Modern generative AI and large language model (LLM) services pose unique traffic-routing challenges for Kubernetes. Unlike typical short-lived, stateless web requests, LLM inference sessions are often long-running, resource-intensive, and partially stateful. For example, a GPU-backed model server may keep several inference sessions active at the same time while maintaining an in-memory token cache.
Traditional load balancers, which mostly route on HTTP path or round-robin scheduling, lack the specialized capabilities this kind of workload requires. They cannot take the model identifier or the importance of a request into account (for example, the difference between an interactive chat request and a batch job). Organizations typically cope with ad-hoc workarounds, but a unified, standardized solution has been missing.
To address this, the Gateway API Inference Extension builds on the existing Gateway API, adding inference-specific routing capabilities while preserving the familiar model of Gateways and HTTPRoutes. Adding this extension to an existing gateway turns it into an "Inference Gateway", letting you self-host generative AI models or LLMs in a "model-as-a-service" fashion.
The Gateway API Inference Extension upgrades an ext-proc-capable proxy or gateway (such as Envoy Gateway, kGateway, or GKE Gateway) into an Inference Gateway, enabling inference platform teams to run their own LLM serving stack on Kubernetes.
Key Features
The Gateway API Inference Extension provides the following key features:
- Model-aware routing: Unlike traditional routing based solely on the request path, the Gateway API Inference Extension can route based on the model name. This is possible because gateway implementations (such as Envoy Proxy) understand generative AI inference API specifications (such as the OpenAI API). Model-aware routing also applies to models fine-tuned with LoRA (Low-Rank Adaptation).
- Serving priority: The extension lets you assign a serving priority to models. For example, a model serving latency-sensitive online chat can be given a higher criticality, while a model serving latency-tolerant work (such as summarization) can be given a lower priority.
- Model rollouts: The extension supports traffic splitting by model name, enabling progressive rollouts and canary releases of model versions.
- Extensibility for inference serving: The extension defines an extensible pattern that lets users plug in custom routing capabilities for scenarios the default behavior cannot cover.
- Customizable, inference-optimized load balancing: The extension provides a customizable load-balancing and request-routing pattern optimized for inference, with a built-in model endpoint picking mechanism driven by real-time model server metrics. This "model-server-aware" load balancing replaces the traditional approach and, in practice, reduces inference latency and improves GPU utilization across the cluster.
Core CRDs
The Gateway API Inference Extension defines two core CRDs: InferencePool and InferenceModel.
InferencePool
An InferencePool represents a group of Pods dedicated to AI inference, together with the extension configuration used to route to those Pods. In the Gateway API resource model, an InferencePool is treated as a "Backend" resource. In practice, it can replace a traditional Kubernetes Service as the routing target.
Although an InferencePool resembles a Service in some ways (it selects Pods and specifies a port), it adds inference-specific capabilities. Through the extensionRef field, an InferencePool points to an Endpoint Picker that performs inference-aware endpoint selection, making intelligent routing decisions based on real-time metrics such as request queue depth and available GPU memory.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  labels:
  name: vllm-llama3-8b-instruct
spec:
  targetPortNumber: 8000
  selector: # select the Pods running the LLM server
    app: vllm-llama3-8b-instruct
  extensionRef: # points to the Endpoint Picker
    name: vllm-llama3-8b-instruct-epp
InferenceModel
An InferenceModel represents an inference model or adapter and its associated configuration. The resource defines the model's criticality, which enables priority-based request scheduling.
In addition, an InferenceModel can map the "model name" in a user request to one or more actual backend model names, which makes version management, canary releases, and adapting to different model formats straightforward. Multiple InferenceModels can reference the same InferencePool, forming a flexible and extensible model-routing scheme.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: food-review
spec:
  modelName: food-review # model name in the user request
  criticality: Standard # criticality of the model
  poolRef: # multiple InferenceModels can reference the same InferencePool
    name: vllm-llama3-8b-instruct
  targetModels: # actual backend model name(s)
  - name: food-review-1
    weight: 100
Related Components
Endpoint Picker (EPP)
The Endpoint Picker (EPP) is an intelligent traffic-scheduling component designed for AI inference. The EPP implements Envoy's ext-proc protocol: before forwarding traffic, Envoy first calls the EPP over gRPC, and the EPP tells Envoy which specific Pod the request should be routed to.
The EPP implements the following core functions (a minimal sketch of the picking logic follows this list):
1. Endpoint selection
The EPP's primary responsibility is to pick a suitable Pod from the InferencePool as the target of each request:
- Each EPP instance serves exactly one InferencePool (one EPP instance is deployed per InferencePool).
- It uses the InferencePool's Selector field to choose a target from the Pods that are marked Ready.
- The ModelName in the request must match an InferenceModel bound to that InferencePool.
- If no matching ModelName is found, the EPP returns an error response to the gateway proxy (such as Envoy).
2. Traffic splitting and model-name rewriting
The EPP supports canary releases and model version management:
- Based on the InferenceModel configuration, traffic can be split proportionally across model versions (for example, LoRA adapters).
- The model name (ModelName) in the user request is rewritten to the real backend model name, enabling flexible mapping.
3. Observability
The EPP also emits monitoring metrics for inference traffic:
- It provides statistics per InferenceModel.
- The metrics can be broken down by the backend model actually used, which helps with monitoring and tuning.
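To make the endpoint-selection step concrete, here is a minimal Go sketch of metric-based picking under stated assumptions: the PodMetrics struct, the scoring formula (queue depth plus weighted KV-cache utilization), and all names are illustrative only and do not reflect the actual EPP code.

package main

import (
	"errors"
	"fmt"
)

// PodMetrics is a hypothetical snapshot of the metrics the EPP scrapes
// from each model-server Pod: readiness, request queue depth, KV-cache usage.
type PodMetrics struct {
	Name         string
	Ready        bool
	QueueDepth   int     // number of waiting requests
	KVCacheUsage float64 // 0.0 - 1.0
}

// pickEndpoint filters the Ready Pods of the pool and returns the one with
// the lowest combined load score. The weighting is an illustrative assumption.
func pickEndpoint(pods []PodMetrics) (string, error) {
	best := ""
	bestScore := 0.0
	for _, p := range pods {
		if !p.Ready {
			continue // only Pods matching the pool selector and marked Ready
		}
		score := float64(p.QueueDepth) + 10*p.KVCacheUsage
		if best == "" || score < bestScore {
			best, bestScore = p.Name, score
		}
	}
	if best == "" {
		return "", errors.New("no ready endpoint in the InferencePool")
	}
	return best, nil
}

func main() {
	pods := []PodMetrics{
		{Name: "vllm-0", Ready: true, QueueDepth: 3, KVCacheUsage: 0.8},
		{Name: "vllm-1", Ready: true, QueueDepth: 1, KVCacheUsage: 0.2},
	}
	target, err := pickEndpoint(pods)
	if err != nil {
		panic(err)
	}
	fmt.Println("route request to:", target) // vllm-1
}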
Dynamic LoRA Adapter Sidecar
The Dynamic LoRA Adapter Sidecar is a sidecar-based tool for deploying new LoRA adapters to a set of running vLLM model servers. You deploy the sidecar alongside the vLLM server and declare the desired LoRA adapters in a ConfigMap. The sidecar watches the ConfigMap and sends load or unload requests to the vLLM container to reconcile the actual state with the declared intent.
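The sidecar itself is a small Python program (sidecar.py, as seen in the logs later), but the reconciliation it performs conceptually boils down to calling vLLM's dynamic LoRA routes (/v1/load_lora_adapter and /v1/unload_lora_adapter, both visible in the vLLM route list further down). Below is a minimal Go sketch of such a load call; the JSON field names (lora_name, lora_path) are assumptions based on vLLM's dynamic LoRA loading API and should be verified against the vLLM version you run.

package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// loadAdapter asks a local vLLM server to load a LoRA adapter at runtime.
func loadAdapter(server, name, source string) error {
	body, _ := json.Marshal(map[string]string{
		"lora_name": name,   // assumed field name
		"lora_path": source, // assumed field name
	})
	resp, err := http.Post(server+"/v1/load_lora_adapter", "application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("load %s failed: %s", name, resp.Status)
	}
	return nil
}

func main() {
	// Mirrors what the sidecar does after reading a ConfigMap entry such as
	// { id: food-review-1, source: Kawon/llama3.1-food-finetune_v14_r8 }.
	if err := loadAdapter("http://localhost:8000", "food-review-1", "Kawon/llama3.1-food-finetune_v14_r8"); err != nil {
		fmt.Println("reconcile error:", err)
	}
}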
As an aside, here is a quick recap of what LoRA is:
LoRA (Low-Rank Adaptation) is an efficient technique for fine-tuning large models. It attaches small, trainable low-rank matrices alongside specific layers of the pretrained model, so only a small number of parameters need to be updated to adapt to a new task, which significantly reduces compute and storage cost. Its core benefits are the ability to dynamically load different task adapters to switch between tasks, and improved fine-tuning efficiency while the original model weights stay frozen, which suits personalized model customization and resource-constrained scenarios.
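In equation form (the standard LoRA formulation, not specific to this project), a frozen pretrained weight matrix is combined with a trainable low-rank update:

W' = W_0 + \Delta W = W_0 + BA, \qquad W_0 \in \mathbb{R}^{d \times k},\ B \in \mathbb{R}^{d \times r},\ A \in \mathbb{R}^{r \times k},\ r \ll \min(d, k)

Only B and A are trained, so the number of trainable parameters drops from d·k to r·(d + k); for example, with d = k = 4096 and r = 16 that is roughly 0.8% of the original layer's parameters.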
Request Flow
To show how all of this fits together, let's walk through an example request:
- The client sends a request to the Inference Gateway.
- When the request reaches the Inference Gateway, the gateway routes it to the corresponding InferencePool according to the HTTPRoute match rules. With an Envoy-based L7 router (such as kgateway or Istio), the target endpoint would normally be chosen by routing policy and load balancing.
- For an InferencePool, however, the request is first sent to a dedicated extension component, the Endpoint Picker (EPP). The EPP selects a suitable backend LLM endpoint based on metrics coming from the LLM servers themselves.
- When the request reaches the EPP, the EPP extracts the modelName from the request body. It then compares this modelName against the modelName fields of the available InferenceModel objects to determine the corresponding backend model or LoRA adapter. For example, if the EPP sees the model name food-review, it finds the matching InferenceModel and routes the request to an appropriate endpoint such as food-review-1 or food-review-2 (a weighted-selection sketch follows this list).
- The EPP defines a series of filters for selecting the endpoints of a given model. It periodically queries metrics from each LLM Pod (such as queue length and KV-cache usage) to choose the Pod best suited to handle the current request.
- Once the EPP has chosen the best Pod, it returns the choice to the Inference Gateway.
- The Inference Gateway routes the request to the Pod chosen by the EPP.
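The model-name rewrite in step 4 can be pictured with a short, hypothetical Go sketch of weighted target selection. The route table and weights mirror the canary example later in this post; none of the names refer to real EPP APIs.

package main

import (
	"fmt"
	"math/rand"
)

// TargetModel mirrors a targetModels entry of an InferenceModel:
// a backend model (or LoRA adapter) name plus a traffic weight.
type TargetModel struct {
	Name   string
	Weight int
}

// resolveModel rewrites the model name from the request body into a backend
// model name, picking a target in proportion to the configured weights
// (e.g. food-review-1: 90, food-review-2: 10 during a canary rollout).
func resolveModel(requested string, routes map[string][]TargetModel) (string, bool) {
	targets, ok := routes[requested]
	if !ok || len(targets) == 0 {
		return "", false // no InferenceModel matches the requested modelName
	}
	total := 0
	for _, t := range targets {
		total += t.Weight
	}
	if total <= 0 {
		return targets[0].Name, true
	}
	n := rand.Intn(total)
	for _, t := range targets {
		if n < t.Weight {
			return t.Name, true
		}
		n -= t.Weight
	}
	return targets[len(targets)-1].Name, true
}

func main() {
	routes := map[string][]TargetModel{
		"food-review": {{Name: "food-review-1", Weight: 90}, {Name: "food-review-2", Weight: 10}},
	}
	backend, _ := resolveModel("food-review", routes)
	fmt.Println("rewrite model to:", backend)
}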
Hands-on Walkthrough
Environment Preparation
Prepare a Kubernetes cluster with GPUs. You can follow my earlier article for a quick setup: "Deploy a GPU Kind Cluster in One Command and Try Fast Inference with vLLM". This walkthrough runs the meta-llama/Llama-3.1-8B-Instruct model, which has non-trivial GPU requirements; I used an A100 GPU.
The resource files used in this walkthrough are available on GitHub: https://github.com/cr7258/hands-on-lab/tree/main/gateway/gateway-api-inference-extension/get-started
Create a Hugging Face Token
First create a token on Hugging Face and request access to the meta-llama/Llama-3.1-8B model. When filling in the access form, do not select China as the country, or the request will be rejected immediately.
Then create a Secret to store the token.
- kubectl create secret generic hf-token --from-literal=token="<your-huggingface-token>"
Deploy vLLM
Deploy the inference service with vLLM. The default configuration uses one replica; you can increase the replica count if you have enough GPUs. A lora-adapter-syncer sidecar container is also configured, which dynamically loads and unloads LoRA adapters according to the ConfigMap.
Wait for the vLLM container to start. If everything is working, you should see logs like the following:
- kubectl logs vllm-llama3-8b-instruct-545c578498-47wt6 -f
- Defaulted container "vllm" out of: vllm, lora-adapter-syncer (init)
- INFO 04-05 05:51:39 [__init__.py:239] Automatically detected platform cuda.
- WARNING 04-05 05:51:44 [api_server.py:759] LoRA dynamic loading & unloading is enabled in the API server. This should ONLY be used for local development!
- INFO 04-05 05:51:44 [api_server.py:981] vLLM API server version 0.8.2
- INFO 04-05 05:51:44 [api_server.py:982] args: Namespace(host=None, port=8000, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='meta-llama/Llama-3.1-8B-Instruct', task='auto', tokenizer=None, hf_config_path=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=None, guided_decoding_backend='xgrammar', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=None, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=True, enable_lora_bias=False, max_loras=2, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=12, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, use_tqdm_on_load=True, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_config=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', worker_extension_cls='', generation_config='auto', 
override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, additional_config=None, enable_reasoning=False, reasoning_parser=None, disable_cascade_attn=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, enable_server_load_tracking=False)
- INFO 04-05 05:51:51 [config.py:585] This model supports multiple tasks: {'classify', 'generate', 'embed', 'reward', 'score'}. Defaulting to 'generate'.
- WARNING 04-05 05:51:51 [arg_utils.py:1859] Detected VLLM_USE_V1=1 with LORA. Usage should be considered experimental. Please report any issues on Github.
- INFO 04-05 05:51:51 [config.py:1697] Chunked prefill is enabled with max_num_batched_tokens=2048.
- WARNING 04-05 05:51:51 [config.py:2381] LoRA with chunked prefill is still experimental and may be unstable.
- INFO 04-05 05:51:53 [core.py:54] Initializing a V1 LLM engine (v0.8.2) with config: model='meta-llama/Llama-3.1-8B-Instruct', speculative_config=None, tokenizer='meta-llama/Llama-3.1-8B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=meta-llama/Llama-3.1-8B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"level":3,"custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":512}
- WARNING 04-05 05:51:54 [utils.py:2321] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x73d836b269c0>
- INFO 04-05 05:51:55 [parallel_state.py:954] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
- INFO 04-05 05:51:55 [cuda.py:220] Using Flash Attention backend on V1 engine.
- INFO 04-05 05:51:55 [gpu_model_runner.py:1174] Starting to load model meta-llama/Llama-3.1-8B-Instruct...
- INFO 04-05 05:51:55 [topk_topp_sampler.py:53] Using FlashInfer for top-p & top-k sampling.
- INFO 04-05 05:51:55 [weight_utils.py:265] Using model weights format ['*.safetensors']
- INFO 04-05 05:52:51 [weight_utils.py:281] Time spent downloading weights for meta-llama/Llama-3.1-8B-Instruct: 55.301468 seconds
- Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
- Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:00<00:02, 1.25it/s]
- Loading safetensors checkpoint shards: 50% Completed | 2/4 [00:01<00:01, 1.72it/s]
- Loading safetensors checkpoint shards: 75% Completed | 3/4 [00:01<00:00, 1.50it/s]
- Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00, 1.25it/s]
- Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00, 1.33it/s]
- INFO 04-05 05:52:54 [loader.py:447] Loading weights took 3.27 seconds
- INFO 04-05 05:52:54 [punica_selector.py:18] Using PunicaWrapperGPU.
- INFO 04-05 05:52:54 [gpu_model_runner.py:1186] Model loading took 15.1749 GB and 59.268527 seconds
- INFO 04-05 05:53:07 [backends.py:415] Using cache directory: /root/.cache/vllm/torch_compile_cache/253772ede5/rank_0_0 for vLLM's torch.compile
- INFO 04-05 05:53:07 [backends.py:425] Dynamo bytecode transform time: 12.60 s
- INFO 04-05 05:53:13 [backends.py:132] Cache the graph of shape None for later use
- INFO 04-05 05:53:53 [backends.py:144] Compiling a graph for general shape takes 44.37 s
- INFO 04-05 05:54:19 [monitor.py:33] torch.compile takes 56.97 s in total
- INFO 04-05 05:54:20 [kv_cache_utils.py:566] GPU KV cache size: 148,096 tokens
- INFO 04-05 05:54:20 [kv_cache_utils.py:569] Maximum concurrency for 131,072 tokens per request: 1.13x
- INFO 04-05 05:55:41 [gpu_model_runner.py:1534] Graph capturing finished in 81 secs, took 0.74 GiB
- INFO 04-05 05:55:42 [core.py:151] init engine (profile, create kv cache, warmup model) took 167.44 seconds
- WARNING 04-05 05:55:42 [config.py:1028] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
- INFO 04-05 05:55:42 [serving_chat.py:115] Using default chat sampling params from model: {'temperature': 0.6, 'top_p': 0.9}
- INFO 04-05 05:55:42 [serving_completion.py:61] Using default completion sampling params from model: {'temperature': 0.6, 'top_p': 0.9}
- INFO 04-05 05:55:42 [api_server.py:1028] Starting vLLM API server on http://0.0.0.0:8000
- INFO 04-05 05:55:42 [launcher.py:26] Available routes are:
- INFO 04-05 05:55:42 [launcher.py:34] Route: /openapi.json, Methods: GET, HEAD
- INFO 04-05 05:55:42 [launcher.py:34] Route: /docs, Methods: GET, HEAD
- INFO 04-05 05:55:42 [launcher.py:34] Route: /docs/oauth2-redirect, Methods: GET, HEAD
- INFO 04-05 05:55:42 [launcher.py:34] Route: /redoc, Methods: GET, HEAD
- INFO 04-05 05:55:42 [launcher.py:34] Route: /health, Methods: GET
- INFO 04-05 05:55:42 [launcher.py:34] Route: /load, Methods: GET
- INFO 04-05 05:55:42 [launcher.py:34] Route: /ping, Methods: GET, POST
- INFO 04-05 05:55:42 [launcher.py:34] Route: /tokenize, Methods: POST
- INFO 04-05 05:55:42 [launcher.py:34] Route: /detokenize, Methods: POST
- INFO 04-05 05:55:42 [launcher.py:34] Route: /v1/models, Methods: GET
- INFO 04-05 05:55:42 [launcher.py:34] Route: /version, Methods: GET
- INFO 04-05 05:55:42 [launcher.py:34] Route: /v1/chat/completions, Methods: POST
- INFO 04-05 05:55:42 [launcher.py:34] Route: /v1/completions, Methods: POST
- INFO 04-05 05:55:42 [launcher.py:34] Route: /v1/embeddings, Methods: POST
- INFO 04-05 05:55:42 [launcher.py:34] Route: /pooling, Methods: POST
- INFO 04-05 05:55:42 [launcher.py:34] Route: /score, Methods: POST
- INFO 04-05 05:55:42 [launcher.py:34] Route: /v1/score, Methods: POST
- INFO 04-05 05:55:42 [launcher.py:34] Route: /v1/audio/transcriptions, Methods: POST
- INFO 04-05 05:55:42 [launcher.py:34] Route: /rerank, Methods: POST
- INFO 04-05 05:55:42 [launcher.py:34] Route: /v1/rerank, Methods: POST
- INFO 04-05 05:55:42 [launcher.py:34] Route: /v2/rerank, Methods: POST
- INFO 04-05 05:55:42 [launcher.py:34] Route: /invocations, Methods: POST
- INFO 04-05 05:55:42 [launcher.py:34] Route: /v1/load_lora_adapter, Methods: POST
- INFO 04-05 05:55:42 [launcher.py:34] Route: /v1/unload_lora_adapter, Methods: POST
- INFO: Started server process [1]
- INFO: Waiting for application startup.
- INFO: Application startup complete.
- INFO: 10.244.1.1:33920 - "GET /health HTTP/1.1" 200 OK
- INFO: 10.244.1.1:33922 - "GET /health HTTP/1.1" 200 OK
Check the logs of the lora-adapter-syncer sidecar container; you can see that the food-review-1 adapter has been loaded.
- 2025-04-05 12:55:56 - WARNING - sidecar.py:266 - skipped adapters found in both `ensureExist` and `ensureNotExist`
- 2025-04-05 12:55:56 - INFO - sidecar.py:271 - adapter to load food-review-1
- 2025-04-05 12:55:56 - INFO - sidecar.py:218 - food-review-1 already present on model server localhost:8000
- 2025-04-05 12:55:57 - INFO - sidecar.py:276 - adapters to unload
- 2025-04-05 12:55:57 - INFO - sidecar.py:310 - Waiting 5s before next reconciliation...
- 2025-04-05 12:56:02 - INFO - sidecar.py:314 - Periodic reconciliation triggered
- 2025-04-05 12:56:02 - INFO - sidecar.py:255 - reconciling model server localhost:8000 with config stored at /config/configmap.yaml
Install the Inference Extension CRDs
Install the two CRDs, InferencePool and InferenceModel.
- VERSION=v0.2.0
- kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/download/$VERSION/manifests.yaml
Deploy the InferenceModel
Deploy an InferenceModel that forwards traffic for the user-requested food-review model to the food-review-1 LoRA adapter on the sample model server. The InferenceModel references the InferencePool (created in the next step) through poolRef.
# 02-inferencemodel.yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: food-review
spec:
  modelName: food-review # model name in the user request
  criticality: Standard # criticality of the model
  poolRef: # multiple InferenceModels can reference the same InferencePool
    name: vllm-llama3-8b-instruct
  targetModels: # actual backend model name(s)
  - name: food-review-1
    weight: 100
Deploy the InferencePool and EPP
Deploy the InferencePool, which selects the Pods running the LLM server via selector and references the EPP via extensionRef. The EPP makes intelligent routing decisions based on real-time metrics such as request queue depth and available GPU memory.
# 03-inferencepool-resources.yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  labels:
  name: vllm-llama3-8b-instruct
spec:
  targetPortNumber: 8000
  selector: # select the Pods running the LLM server
    app: vllm-llama3-8b-instruct
  extensionRef: # points to the Endpoint Picker
    name: vllm-llama3-8b-instruct-epp
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-llama3-8b-instruct-epp
  namespace: default
spec:
  selector:
    app: vllm-llama3-8b-instruct-epp
  ports:
  - protocol: TCP
    port: 9002
    targetPort: 9002
    appProtocol: http2
  type: ClusterIP
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama3-8b-instruct-epp
  namespace: default
  labels:
    app: vllm-llama3-8b-instruct-epp
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-llama3-8b-instruct-epp
  template:
    metadata:
      labels:
        app: vllm-llama3-8b-instruct-epp
    spec:
      # Conservatively, this timeout should mirror the longest grace period of the pods within the pool
      terminationGracePeriodSeconds: 130
      containers:
      - name: epp
        image: us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/epp:main
        imagePullPolicy: Always
        args:
        - -poolName
        - "vllm-llama3-8b-instruct"
        - -v
        - "4"
        - --zap-encoder
        - "json"
        - -grpcPort
        - "9002"
        - -grpcHealthPort
        - "9003"
        env:
        - name: USE_STREAMING
          value: "true"
        ports:
        - containerPort: 9002
        - containerPort: 9003
        - name: metrics
          containerPort: 9090
        livenessProbe:
          grpc:
            port: 9003
            service: inference-extension
          initialDelaySeconds: 5
          periodSeconds: 10
        readinessProbe:
          grpc:
            port: 9003
            service: inference-extension
          initialDelaySeconds: 5
          periodSeconds: 10
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: pod-read
rules:
- apiGroups: ["inference.networking.x-k8s.io"]
  resources: ["inferencemodels"]
  verbs: ["get", "watch", "list"]
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "watch", "list"]
- apiGroups: ["inference.networking.x-k8s.io"]
  resources: ["inferencepools"]
  verbs: ["get", "watch", "list"]
- apiGroups: ["discovery.k8s.io"]
  resources: ["endpointslices"]
  verbs: ["get", "watch", "list"]
- apiGroups:
  - authentication.k8s.io
  resources:
  - tokenreviews
  verbs:
  - create
- apiGroups:
  - authorization.k8s.io
  resources:
  - subjectaccessreviews
  verbs:
  - create
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: pod-read-binding
subjects:
- kind: ServiceAccount
  name: default
  namespace: default
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: pod-read
Deploy the Inference Gateway
Gateways that currently support the Gateway API Inference Extension include Kgateway and Envoy AI Gateway, among others; see Implementations for the full list.
This post uses Kgateway for the demonstration.
First install the Kgateway CRDs.
- KGTW_VERSION=v2.0.0
- helm upgrade -i --create-namespace --namespace kgateway-system --version $KGTW_VERSION kgateway-crds oci://cr.kgateway.dev/kgateway-dev/charts/kgateway-crds
Then install Kgateway, setting inferenceExtension.enabled=true to enable the inference extension.
- helm upgrade -i --namespace kgateway-system --version $KGTW_VERSION kgateway oci://cr.kgateway.dev/kgateway-dev/charts/kgateway --set inferenceExtension.enabled=true
Next, create a Gateway whose gatewayClassName refers to Kgateway.
# 04-gateway.yaml
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: inference-gateway
spec:
  gatewayClassName: kgateway
  listeners:
  - name: http
    port: 80
    protocol: HTTP
Confirm that the Gateway has been assigned an IP address and reports a Programmed=True status.
- kubectl get gateway inference-gateway
- NAME CLASS ADDRESS PROGRAMMED AGE
- inference-gateway kgateway 172.18.0.4 True 16s
Deploy an HTTPRoute that routes traffic to the InferencePool.
# 05-httproute.yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route
spec:
  parentRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: inference-gateway
  rules:
  - backendRefs:
    - group: inference.networking.x-k8s.io
      kind: InferencePool
      name: vllm-llama3-8b-instruct
      port: 8000 # Remove when https://github.com/kgateway-dev/kgateway/issues/10987 is fixed.
    matches:
    - path:
        type: PathPrefix
        value: /
    timeouts:
      request: 300s
Confirm that the HTTPRoute status conditions include Accepted=True and ResolvedRefs=True:
kubectl get httproute llm-route -o yaml

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
......
status:
  parents:
  - conditions:
    - lastTransitionTime: "2025-04-05T13:04:35Z"
      message: ""
      observedGeneration: 2
      reason: Accepted
      status: "True"
      type: Accepted
    - lastTransitionTime: "2025-04-05T13:06:14Z"
      message: ""
      observedGeneration: 2
      reason: ResolvedRefs
      status: "True"
      type: ResolvedRefs
    controllerName: kgateway.dev/kgateway
    parentRef:
      group: gateway.networking.k8s.io
      kind: Gateway
      name: inference-gateway
Verify with a Request
With all of the configuration in place, we can now send a request to the Inference Gateway with curl to test it.
IP=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}')
PORT=80
curl -i ${IP}:${PORT}/v1/completions -H 'Content-Type: application/json' -d '{
  "model": "food-review",
  "prompt": "Write as if you were a critic: San Francisco",
  "max_tokens": 100,
  "temperature": 0
}'
The response is shown below; the model handled the request successfully.
- HTTP/1.1 200 OK
- date: Sat, 05 Apr 2025 13:18:22 GMT
- server: envoy
- content-type: application/json
- x-envoy-upstream-service-time: 1785
- x-went-into-resp-headers: true
- transfer-encoding: chunked
- {
- "choices": [
- {
- "finish_reason": "length",
- "index": 0,
- "logprobs": null,
- "prompt_logprobs": null,
- "stop_reason": null,
- "text": "'s iconic seafood restaurant, Ali's Bistro, serves a variety of seafood dishes, including sushi, sashimi, and seafood paella. How would you rate Ali's Bistro 1.0? (1 being lowest and 10 being highest)\n### Step 1: Analyze the menu offerings\nAli's Bistro offers a diverse range of seafood dishes, including sushi, sashimi, and seafood paella. This variety suggests that the restaurant caters to different tastes and dietary"
- }
- ],
- "created": 1743859102,
- "id": "cmpl-0046459d-d94f-43b5-b8f4-0898d8e2d50b",
- "model": "food-review-1",
- "object": "text_completion",
- "usage": {
- "completion_tokens": 100,
- "prompt_tokens": 11,
- "prompt_tokens_details": null,
- "total_tokens": 111
- }
- }
Roll Out a New Adapter Version
Next, let's roll out a new adapter version. Modify the vllm-llama3-8b-instruct-adapters ConfigMap so that the lora-adapter-syncer sidecar container loads the new adapter into the vLLM container.
- kubectl edit configmap vllm-llama3-8b-instruct-adapters
Update the ConfigMap as follows, adding the food-review-2 adapter.
apiVersion: v1
kind: ConfigMap
metadata:
  name: vllm-llama3-8b-instruct-adapters
data:
  configmap.yaml: |
    vLLMLoRAConfig:
      name: vllm-llama3-8b-instruct-adapters
      port: 8000
      defaultBaseModel: meta-llama/Llama-3.1-8B-Instruct
      ensureExist:
        models:
        - id: food-review-1
          source: Kawon/llama3.1-food-finetune_v14_r8
        # add the new adapter
        - id: food-review-2
          source: Kawon/llama3.1-food-finetune_v14_r8
The new adapter version is applied to the model server in real time, without a restart. Checking the lora-adapter-syncer sidecar logs, you can see that the food-review-2 adapter has been loaded.
- 2025-04-05 13:15:21 - INFO - sidecar.py:271 - adapter to load food-review-2, food-review-1
- 2025-04-05 13:15:21 - INFO - sidecar.py:231 - loaded model food-review-2
- 2025-04-05 13:15:21 - INFO - sidecar.py:218 - food-review-1 already present on model server localhost:8000
- 2025-04-05 13:15:21 - INFO - sidecar.py:276 - adapters to unload
- 2025-04-05 13:15:21 - INFO - sidecar.py:62 - model server reconcile to Config '/config/configmap.yaml' !
- 2025-04-05 13:15:22 - INFO - sidecar.py:314 - Periodic reconciliation triggered
- 2025-04-05 13:15:22 - INFO - sidecar.py:255 - reconciling model server localhost:8000 with config stored at /config/configmap.yaml
Modify the InferenceModel configuration to release the new adapter version as a canary.
- kubectl edit inferencemodel food-review
复制代码 将 10% 的流量路由到新的 food-review-2 适配器,90% 的流量路由到 food-review-1。
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: food-review
spec:
  modelName: food-review
  criticality: Standard
  poolRef:
    name: vllm-llama3-8b-instruct
  targetModels:
  - name: food-review-1
    weight: 90
  - name: food-review-2
    weight: 10
Send the same curl request several times; you should observe that roughly 90% of the requests are routed to food-review-1 and 10% to food-review-2.
curl -i ${IP}:${PORT}/v1/completions -H 'Content-Type: application/json' -d '{
  "model": "food-review",
  "prompt": "Write as if you were a critic: San Francisco",
  "max_tokens": 100,
  "temperature": 0
}'
# A request served by food-review-1, identifiable by the model field in the response
- HTTP/1.1 200 OK
- date: Sat, 05 Apr 2025 13:18:34 GMT
- server: envoy
- content-type: application/json
- x-envoy-upstream-service-time: 1780
- x-went-into-resp-headers: true
- transfer-encoding: chunked
- {
- "choices": [
- {
- "finish_reason": "length",
- "index": 0,
- "logprobs": null,
- "prompt_logprobs": null,
- "stop_reason": null,
- "text": "'s iconic seafood restaurant, Ali's Bistro, serves a variety of seafood dishes, including sushi, sashimi, and seafood paella. How would you rate Ali's Bistro 1.0? (1 being lowest and 10 being highest)\n### Step 1: Analyze the menu offerings\nAli's Bistro offers a diverse range of seafood dishes, including sushi, sashimi, and seafood paella. This variety suggests that the restaurant caters to different tastes and dietary"
- }
- ],
- "created": 1743859115,
- "id": "cmpl-99203056-cb12-4c8e-bae9-23c28c07cdd7",
- "model": "food-review-1",
- "object": "text_completion",
- "usage": {
- "completion_tokens": 100,
- "prompt_tokens": 11,
- "prompt_tokens_details": null,
- "total_tokens": 111
- }
- }
curl -i ${IP}:${PORT}/v1/completions -H 'Content-Type: application/json' -d '{
  "model": "food-review",
  "prompt": "Write as if you were a critic: San Francisco",
  "max_tokens": 100,
  "temperature": 0
}'
- HTTP/1.1 200 OK
- date: Sat, 05 Apr 2025 13:18:38 GMT
- server: envoy
- content-type: application/json
- x-envoy-upstream-service-time: 2531
- x-went-into-resp-headers: true
- transfer-encoding: chunked
- # A request served by food-review-2
- {
- "choices": [
- {
- "finish_reason": "length",
- "index": 0,
- "logprobs": null,
- "prompt_logprobs": null,
- "stop_reason": null,
- "text": "'s iconic seafood restaurant, Ali's Bistro, serves a variety of seafood dishes, including sushi, sashimi, and seafood paella. How would you rate Ali's Bistro 1.0? (1 being lowest and 10 being highest)\n### Step 1: Analyze the menu offerings\nAli's Bistro offers a diverse range of seafood dishes, including sushi, sashimi, and seafood paella. This variety suggests that the restaurant caters to different tastes and dietary"
- }
- ],
- "created": 1743859119,
- "id": "cmpl-6f2e2e5f-a0e7-4ee0-bd54-5b1a2ef23399",
- "model": "food-review-2",
- "object": "text_completion",
- "usage": {
- "completion_tokens": 100,
- "prompt_tokens": 11,
- "prompt_tokens_details": null,
- "total_tokens": 111
- }
- }
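Rather than eyeballing individual responses, you can tally the model field over a batch of requests. The Go sketch below is illustrative only: the gateway address is a placeholder taken from the earlier kubectl get gateway output, and 20 requests give only a rough view of the 90/10 split.

package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// completionResponse captures only the field we need from the
// /v1/completions response: the backend model that served the request.
type completionResponse struct {
	Model string `json:"model"`
}

func main() {
	gateway := "http://172.18.0.4:80" // replace with your own gateway address
	payload := []byte(`{"model": "food-review", "prompt": "Write as if you were a critic: San Francisco", "max_tokens": 100, "temperature": 0}`)

	counts := map[string]int{}
	for i := 0; i < 20; i++ {
		resp, err := http.Post(gateway+"/v1/completions", "application/json", bytes.NewReader(payload))
		if err != nil {
			fmt.Println("request failed:", err)
			continue
		}
		var cr completionResponse
		if err := json.NewDecoder(resp.Body).Decode(&cr); err == nil {
			counts[cr.Model]++ // e.g. food-review-1 vs food-review-2
		}
		resp.Body.Close()
	}
	fmt.Println(counts) // expect roughly a 90/10 split across the requests
}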
Once the new adapter version is confirmed to be working, modify the InferenceModel configuration to route 100% of the traffic to food-review-2.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: food-review
spec:
  modelName: food-review
  criticality: Standard
  poolRef:
    name: vllm-llama3-8b-instruct
  targetModels:
  - name: food-review-2
    weight: 100
At the same time, modify the vllm-llama3-8b-instruct-adapters ConfigMap, moving the old food-review-1 version into the ensureNotExist list so it is unloaded from the server.
apiVersion: v1
kind: ConfigMap
metadata:
  name: vllm-llama3-8b-instruct-adapters
data:
  configmap.yaml: |
    vLLMLoRAConfig:
      name: vllm-llama3-8b-instruct-adapters
      port: 8000
      defaultBaseModel: meta-llama/Llama-3.1-8B-Instruct
      ensureExist:
        models:
        - id: food-review-2
          source: Kawon/llama3.1-food-finetune_v14_r8
      ensureNotExist:
        models:
        - id: food-review-1
          source: Kawon/llama3.1-food-finetune_v14_r8
Watching the lora-adapter-syncer sidecar logs, you can see that the food-review-1 adapter has been unloaded.
- 2025-04-05 13:27:53 - INFO - sidecar.py:271 - adapter to load food-review-2
- 2025-04-05 13:27:53 - INFO - sidecar.py:218 - food-review-2 already present on model server localhost:8000
- 2025-04-05 13:27:53 - INFO - sidecar.py:276 - adapters to unload food-review-1
- 2025-04-05 13:27:53 - INFO - sidecar.py:247 - unloaded model food-review-1
- 2025-04-05 13:27:53 - INFO - sidecar.py:62 - model server reconcile to Config '/config/configmap.yaml' !
- 2025-04-05 13:27:56 - INFO - sidecar.py:314 - Periodic reconciliation triggered
- 2025-04-05 13:27:56 - INFO - sidecar.py:255 - reconciling model server localhost:8000 with config stored at /config/configmap.yaml
At this point, all requests should be served by the new adapter version.
Summary
The Gateway API Inference Extension provides a purpose-built traffic-routing solution for LLM inference services on Kubernetes. With model-aware routing, serving priorities, and intelligent load balancing, it improves GPU utilization and reduces inference latency. Built around the two core CRDs, InferencePool and InferenceModel, together with the Endpoint Picker and Dynamic LoRA Adapter Sidecar components, it enables canary releases of model versions and dynamic LoRA adapter management, offering a standardized and flexible way to self-host large language models on Kubernetes.
References
- Gateway API Inference Extension: https://gateway-api-inference-extension.sigs.k8s.io/
- Gateway API: https://gateway-api.sigs.k8s.io/
- ext-proc: https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_filters/ext_proc_filter
- InferencePool: https://gateway-api-inference-extension.sigs.k8s.io/api-types/inferencepool/
- InferenceModel: https://gateway-api-inference-extension.sigs.k8s.io/api-types/inferencemodel/
- EndPoint Picker (EPP): https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/pkg/epp
- Dynamic LORA Adapter Sidecar: https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/tools/dynamic-lora-sidecar
- Implementations: https://gateway-api-inference-extension.sigs.k8s.io/implementations
- Deep Dive into the Gateway API Inference Extension: https://kgateway.dev/blog/deep-dive-inference-extensions/
- Smarter AI Inference Routing on Kubernetes with Gateway API Inference Extension: https://kgateway.dev/blog/smarter-ai-reference-kubernetes-gateway-api/