Intelligent LLM Inference Routing on Kubernetes: A Deep Dive into the Gateway API Inference Extension
Modern generative AI and large language model (LLM) services present unique traffic-routing challenges for Kubernetes. Unlike typical short-lived, stateless web requests, LLM inference sessions are usually long-running, resource-intensive, and partially stateful. For example, a GPU-backed model server may hold several active inference sessions at the same time, along with in-memory token caches.
Traditional load balancers mostly route on HTTP paths or round-robin scheduling and lack the specialized capabilities these workloads need. They cannot recognize the model identity or the criticality of a request (for example, the difference between an interactive chat request and a batch job). Organizations typically cope with ad-hoc workarounds, but there has been no unified, standard solution.
To address this, the Gateway API Inference Extension builds on the existing Gateway API and adds inference-specific routing capabilities, while keeping the familiar Gateway and HTTPRoute model. Adding this extension to an existing gateway turns it into an "Inference Gateway" that lets you self-host generative AI models or LLMs in a "model as a service" fashion.
The Gateway API Inference Extension can upgrade any ext-proc-capable proxy or gateway (such as Envoy Gateway, kGateway, or GKE Gateway) into an Inference Gateway, enabling inference platform teams to run their own LLM serving stack on Kubernetes.

Key Features

The Gateway API Inference Extension adds inference-specific capabilities such as model-aware routing, request criticality (serving priority), canary rollouts of model versions, and load-aware endpoint selection across model server Pods.

Core CRDs

The Gateway API Inference Extension defines two core CRDs: InferencePool and InferenceModel.
InferencePool

An InferencePool represents a group of Pods dedicated to AI inference, together with the extension configuration used to route to them. In the Gateway API resource model, an InferencePool is treated as a kind of "Backend" resource; in practice it can replace a traditional Kubernetes Service as the backend target of a route.
Although an InferencePool resembles a Service in some ways (it selects Pods and specifies a port), it adds capabilities specific to inference. Through the extensionRef field it points to an Endpoint Picker, which manages inference-aware endpoint selection and makes intelligent routing decisions based on real-time metrics such as request queue depth and available GPU memory.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: vllm-llama3-8b-instruct
spec:
  targetPortNumber: 8000
  selector: # selects the Pods running the LLM service
    app: vllm-llama3-8b-instruct
  extensionRef: # points to the Endpoint Picker
    name: vllm-llama3-8b-instruct-epp
InferenceModel

An InferenceModel represents an inference model or adapter and its associated configuration. The resource defines the model's criticality level, which enables priority-based request scheduling.
In addition, an InferenceModel can smoothly map the model name used in user requests to one or more actual backend model names, which makes version management, canary releases, and adapting to different model formats straightforward. Multiple InferenceModels can reference the same InferencePool, forming a flexible and extensible model-routing scheme.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: food-review
spec:
  modelName: food-review # model name used in user requests
  criticality: Standard # criticality level of the model
  poolRef: # multiple InferenceModels can reference the same InferencePool
    name: vllm-llama3-8b-instruct
  targetModels: # actual backend model name(s)
  - name: food-review-1
    weight: 100
Related Components

EndPoint Picker (EPP)

The Endpoint Picker (EPP) is an intelligent traffic-scheduling component designed specifically for AI inference. It implements Envoy's ext-proc (external processing) protocol: before Envoy forwards a request, it calls the EPP over gRPC, and the EPP tells Envoy which specific Pod the request should be routed to.
The EPP implements the following core functions:
1. Endpoint selection

The EPP's primary responsibility is to pick a suitable Pod from the InferencePool as the target of each request, based on real-time signals from the model servers such as request queue depth and available GPU memory.

2. Traffic splitting and model-name rewriting

The EPP supports canary releases and model version management: it maps the model name in the user's request onto one or more backend target models and splits traffic between them according to the weights declared in the InferenceModel.

3. Observability

The EPP also emits monitoring metrics for inference traffic, exposed on a dedicated metrics port (9090 in the EPP Deployment shown later in this post).
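For a quick look at these metrics, the EPP's metrics port can be port-forwarded and scraped directly. This is a sketch rather than an exact recipe: the Deployment name and port 9090 come from the EPP manifest later in this post, the conventional Prometheus /metrics path is assumed, and the exact metric names may vary between versions.

kubectl port-forward deploy/vllm-llama3-8b-instruct-epp 9090:9090 &
curl -s localhost:9090/metrics | grep -i inference   # filter for inference-related metrics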

Dynamic LORA Adapter Sidecar

The Dynamic LoRA Adapter Sidecar is a sidecar-based tool for rolling out new LoRA adapters to a set of running vLLM model servers. You deploy the sidecar alongside the vLLM server and declare the desired LoRA adapters in a ConfigMap. The sidecar watches the ConfigMap and sends load or unload requests to the vLLM container to realize the declared configuration.
A quick aside on what LoRA is:
LoRA (Low-Rank Adaptation) adapters are an efficient technique for fine-tuning large models. Small trainable low-rank matrices are added alongside specific layers of the pre-trained model, so only a small number of parameters needs to be updated to adapt to a new task, which greatly reduces compute and storage cost. In practice this means adapters for different tasks can be loaded dynamically to switch between tasks, and fine-tuning efficiency improves while the original model weights stay untouched, making LoRA well suited to personalized model customization and resource-constrained environments.
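Under the hood, the sidecar drives vLLM's dynamic LoRA endpoints (/v1/load_lora_adapter and /v1/unload_lora_adapter, both visible in the route list of the vLLM startup log later in this post). As a hedged illustration of what such a call looks like, assuming a port-forward to the vLLM Pod on port 8000 and vLLM's lora_name/lora_path request fields (depending on the vLLM version, lora_path may need to be a locally available path rather than a Hugging Face repo id):

kubectl port-forward deploy/vllm-llama3-8b-instruct 8000:8000 &
# load an adapter
curl -X POST localhost:8000/v1/load_lora_adapter -H 'Content-Type: application/json' \
  -d '{"lora_name": "food-review-1", "lora_path": "Kawon/llama3.1-food-finetune_v14_r8"}'
# unload it again
curl -X POST localhost:8000/v1/unload_lora_adapter -H 'Content-Type: application/json' \
  -d '{"lora_name": "food-review-1"}'

In this walkthrough you will not call these endpoints by hand; the lora-adapter-syncer sidecar issues them for you based on the ConfigMap.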
Request Flow

To see how all of this fits together, consider the path of a single request. The client sends an OpenAI-style request (for example, POST /v1/completions with "model": "food-review") to the Inference Gateway. The gateway matches an HTTPRoute whose backend is an InferencePool. Before forwarding, the Envoy-based gateway calls the EPP over the ext-proc gRPC protocol; the EPP resolves the requested model name through the matching InferenceModel (rewriting it to a target model such as food-review-1 where weights apply), then picks a Pod in the pool based on real-time metrics. The gateway forwards the request to that Pod and streams the response back to the client.

Hands-On Walkthrough

Environment Preparation

Prepare a Kubernetes cluster with GPUs. You can set one up quickly by following my earlier article: 一键部署 GPU Kind 集群,体验 vLLM 极速推理 (deploying a GPU Kind cluster with one command to try out vLLM). This walkthrough runs meta-llama/Llama-3.1-8B-Instruct, which has non-trivial GPU requirements; I ran it on an A100 GPU.
The resource files used in this walkthrough are available on GitHub: https://github.com/cr7258/hands-on-lab/tree/main/gateway/gateway-api-inference-extension/get-started
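Before moving on, it is worth confirming that the cluster actually advertises GPU resources (a quick check; the nvidia.com/gpu resource name assumes the NVIDIA device plugin is installed):

kubectl describe nodes | grep nvidia.com/gpu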
Create a Hugging Face Token

First create a token on Hugging Face and request access to the meta-llama/Llama-3.1-8B model. When filling in the access request, do not select China as the country, or the request will be rejected almost immediately.
Then create a Secret to store the token.
kubectl create secret generic hf-token --from-literal=token="<your-huggingface-token>"
Deploy vLLM

Deploy the inference service with vLLM. The default configuration uses a single replica, which you can increase if you have enough GPUs. The Deployment also configures lora-adapter-syncer as a sidecar container that dynamically loads and unloads LoRA adapters according to the ConfigMap.
# 01-gpu-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama3-8b-instruct
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-llama3-8b-instruct
  template:
    metadata:
      labels:
        app: vllm-llama3-8b-instruct
    spec:
      containers:
        - name: vllm
          image: "vllm/vllm-openai:latest"
          imagePullPolicy: Always
          command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
          args:
          - "--model"
          - "meta-llama/Llama-3.1-8B-Instruct"
          - "--tensor-parallel-size"
          - "1"
          - "--port"
          - "8000"
          - "--max-num-seq"
          - "1024"
          - "--compilation-config"
          - "3"
          - "--enable-lora"
          - "--max-loras"
          - "2"
          - "--max-lora-rank"
          - "8"
          - "--max-cpu-loras"
          - "12"
          env:
            # Enabling LoRA support temporarily disables automatic v1, we want to force it on
            # until 0.8.3 vLLM is released.
            - name: VLLM_USE_V1
              value: "1"
            - name: PORT
              value: "8000"
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token
                  key: token
            - name: VLLM_ALLOW_RUNTIME_LORA_UPDATING
              value: "true"
          ports:
            - containerPort: 8000
              name: http
              protocol: TCP
          lifecycle:
            preStop:
              exec:
                command:
                - /usr/bin/sleep
                - "30"
          livenessProbe:
            httpGet:
              path: /health
              port: http
              scheme: HTTP
            periodSeconds: 1
            successThreshold: 1
            failureThreshold: 5
            timeoutSeconds: 1
          readinessProbe:
            httpGet:
              path: /health
              port: http
              scheme: HTTP
            periodSeconds: 1
            successThreshold: 1
            failureThreshold: 1
            timeoutSeconds: 1
          startupProbe:
            failureThreshold: 600
            initialDelaySeconds: 2
            periodSeconds: 1
            httpGet:
              path: /health
              port: http
              scheme: HTTP
          resources:
            limits:
              nvidia.com/gpu: 1
            requests:
              nvidia.com/gpu: 1
          volumeMounts:
            - mountPath: /data
              name: data
            - mountPath: /dev/shm
              name: shm
            - name: adapters
              mountPath: "/adapters"
      initContainers:
        - name: lora-adapter-syncer
          tty: true
          stdin: true
          image: us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/lora-syncer:main
          restartPolicy: Always
          imagePullPolicy: Always
          env:
            - name: DYNAMIC_LORA_ROLLOUT_CONFIG
              value: "/config/configmap.yaml"
          volumeMounts: # DO NOT USE subPath, dynamic configmap updates don't work on subPaths
          - name: config-volume
            mountPath: /config
      restartPolicy: Always
      enableServiceLinks: false
      terminationGracePeriodSeconds: 130
      volumes:
        - name: data
          emptyDir: {}
        - name: shm
          emptyDir:
            medium: Memory
        - name: adapters
          emptyDir: {}
        - name: config-volume
          configMap:
            name: vllm-llama3-8b-instruct-adapters
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: vllm-llama3-8b-instruct-adapters
data:
  configmap.yaml: |
    vLLMLoRAConfig:
      name: vllm-llama3-8b-instruct-adapters
      port: 8000
      defaultBaseModel: meta-llama/Llama-3.1-8B-Instruct
      ensureExist:
        models:
          - id: food-review-1
            source: Kawon/llama3.1-food-finetune_v14_r8
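Apply the Deployment and ConfigMap (a small sketch, assuming the manifest is saved under the file name given in its first comment):

kubectl apply -f 01-gpu-deployment.yaml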
Wait for the vLLM container to start. If everything goes well, you should see logs like the following:
  1. kubectl logs vllm-llama3-8b-instruct-545c578498-47wt6 -f
  2. Defaulted container "vllm" out of: vllm, lora-adapter-syncer (init)
  3. INFO 04-05 05:51:39 [__init__.py:239] Automatically detected platform cuda.
  4. WARNING 04-05 05:51:44 [api_server.py:759] LoRA dynamic loading & unloading is enabled in the API server. This should ONLY be used for local development!
  5. INFO 04-05 05:51:44 [api_server.py:981] vLLM API server version 0.8.2
  6. INFO 04-05 05:51:44 [api_server.py:982] args: Namespace(host=None, port=8000, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='meta-llama/Llama-3.1-8B-Instruct', task='auto', tokenizer=None, hf_config_path=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=None, guided_decoding_backend='xgrammar', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=None, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=True, enable_lora_bias=False, max_loras=2, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=12, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, use_tqdm_on_load=True, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_config=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', worker_extension_cls='', generation_config='auto', 
override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, additional_config=None, enable_reasoning=False, reasoning_parser=None, disable_cascade_attn=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, enable_server_load_tracking=False)
  7. INFO 04-05 05:51:51 [config.py:585] This model supports multiple tasks: {'classify', 'generate', 'embed', 'reward', 'score'}. Defaulting to 'generate'.
  8. WARNING 04-05 05:51:51 [arg_utils.py:1859] Detected VLLM_USE_V1=1 with LORA. Usage should be considered experimental. Please report any issues on Github.
  9. INFO 04-05 05:51:51 [config.py:1697] Chunked prefill is enabled with max_num_batched_tokens=2048.
  10. WARNING 04-05 05:51:51 [config.py:2381] LoRA with chunked prefill is still experimental and may be unstable.
  11. INFO 04-05 05:51:53 [core.py:54] Initializing a V1 LLM engine (v0.8.2) with config: model='meta-llama/Llama-3.1-8B-Instruct', speculative_config=None, tokenizer='meta-llama/Llama-3.1-8B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=meta-llama/Llama-3.1-8B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"level":3,"custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":512}
  12. WARNING 04-05 05:51:54 [utils.py:2321] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x73d836b269c0>
  13. INFO 04-05 05:51:55 [parallel_state.py:954] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
  14. INFO 04-05 05:51:55 [cuda.py:220] Using Flash Attention backend on V1 engine.
  15. INFO 04-05 05:51:55 [gpu_model_runner.py:1174] Starting to load model meta-llama/Llama-3.1-8B-Instruct...
  16. INFO 04-05 05:51:55 [topk_topp_sampler.py:53] Using FlashInfer for top-p & top-k sampling.
  17. INFO 04-05 05:51:55 [weight_utils.py:265] Using model weights format ['*.safetensors']
  18. INFO 04-05 05:52:51 [weight_utils.py:281] Time spent downloading weights for meta-llama/Llama-3.1-8B-Instruct: 55.301468 seconds
  19. Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
  20. Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:00<00:02,  1.25it/s]
  21. Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:01<00:01,  1.72it/s]
  22. Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:01<00:00,  1.50it/s]
  23. Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00,  1.25it/s]
  24. Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00,  1.33it/s]
  25. INFO 04-05 05:52:54 [loader.py:447] Loading weights took 3.27 seconds
  26. INFO 04-05 05:52:54 [punica_selector.py:18] Using PunicaWrapperGPU.
  27. INFO 04-05 05:52:54 [gpu_model_runner.py:1186] Model loading took 15.1749 GB and 59.268527 seconds
  28. INFO 04-05 05:53:07 [backends.py:415] Using cache directory: /root/.cache/vllm/torch_compile_cache/253772ede5/rank_0_0 for vLLM's torch.compile
  29. INFO 04-05 05:53:07 [backends.py:425] Dynamo bytecode transform time: 12.60 s
  30. INFO 04-05 05:53:13 [backends.py:132] Cache the graph of shape None for later use
  31. INFO 04-05 05:53:53 [backends.py:144] Compiling a graph for general shape takes 44.37 s
  32. INFO 04-05 05:54:19 [monitor.py:33] torch.compile takes 56.97 s in total
  33. INFO 04-05 05:54:20 [kv_cache_utils.py:566] GPU KV cache size: 148,096 tokens
  34. INFO 04-05 05:54:20 [kv_cache_utils.py:569] Maximum concurrency for 131,072 tokens per request: 1.13x
  35. INFO 04-05 05:55:41 [gpu_model_runner.py:1534] Graph capturing finished in 81 secs, took 0.74 GiB
  36. INFO 04-05 05:55:42 [core.py:151] init engine (profile, create kv cache, warmup model) took 167.44 seconds
  37. WARNING 04-05 05:55:42 [config.py:1028] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
  38. INFO 04-05 05:55:42 [serving_chat.py:115] Using default chat sampling params from model: {'temperature': 0.6, 'top_p': 0.9}
  39. INFO 04-05 05:55:42 [serving_completion.py:61] Using default completion sampling params from model: {'temperature': 0.6, 'top_p': 0.9}
  40. INFO 04-05 05:55:42 [api_server.py:1028] Starting vLLM API server on http://0.0.0.0:8000
  41. INFO 04-05 05:55:42 [launcher.py:26] Available routes are:
  42. INFO 04-05 05:55:42 [launcher.py:34] Route: /openapi.json, Methods: GET, HEAD
  43. INFO 04-05 05:55:42 [launcher.py:34] Route: /docs, Methods: GET, HEAD
  44. INFO 04-05 05:55:42 [launcher.py:34] Route: /docs/oauth2-redirect, Methods: GET, HEAD
  45. INFO 04-05 05:55:42 [launcher.py:34] Route: /redoc, Methods: GET, HEAD
  46. INFO 04-05 05:55:42 [launcher.py:34] Route: /health, Methods: GET
  47. INFO 04-05 05:55:42 [launcher.py:34] Route: /load, Methods: GET
  48. INFO 04-05 05:55:42 [launcher.py:34] Route: /ping, Methods: GET, POST
  49. INFO 04-05 05:55:42 [launcher.py:34] Route: /tokenize, Methods: POST
  50. INFO 04-05 05:55:42 [launcher.py:34] Route: /detokenize, Methods: POST
  51. INFO 04-05 05:55:42 [launcher.py:34] Route: /v1/models, Methods: GET
  52. INFO 04-05 05:55:42 [launcher.py:34] Route: /version, Methods: GET
  53. INFO 04-05 05:55:42 [launcher.py:34] Route: /v1/chat/completions, Methods: POST
  54. INFO 04-05 05:55:42 [launcher.py:34] Route: /v1/completions, Methods: POST
  55. INFO 04-05 05:55:42 [launcher.py:34] Route: /v1/embeddings, Methods: POST
  56. INFO 04-05 05:55:42 [launcher.py:34] Route: /pooling, Methods: POST
  57. INFO 04-05 05:55:42 [launcher.py:34] Route: /score, Methods: POST
  58. INFO 04-05 05:55:42 [launcher.py:34] Route: /v1/score, Methods: POST
  59. INFO 04-05 05:55:42 [launcher.py:34] Route: /v1/audio/transcriptions, Methods: POST
  60. INFO 04-05 05:55:42 [launcher.py:34] Route: /rerank, Methods: POST
  61. INFO 04-05 05:55:42 [launcher.py:34] Route: /v1/rerank, Methods: POST
  62. INFO 04-05 05:55:42 [launcher.py:34] Route: /v2/rerank, Methods: POST
  63. INFO 04-05 05:55:42 [launcher.py:34] Route: /invocations, Methods: POST
  64. INFO 04-05 05:55:42 [launcher.py:34] Route: /v1/load_lora_adapter, Methods: POST
  65. INFO 04-05 05:55:42 [launcher.py:34] Route: /v1/unload_lora_adapter, Methods: POST
  66. INFO:     Started server process [1]
  67. INFO:     Waiting for application startup.
  68. INFO:     Application startup complete.
  69. INFO:     10.244.1.1:33920 - "GET /health HTTP/1.1" 200 OK
  70. INFO:     10.244.1.1:33922 - "GET /health HTTP/1.1" 200 OK
Check the logs of the lora-adapter-syncer sidecar container; you can see that the food-review-1 adapter has been loaded.
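The sidecar's logs can be fetched with a command along these lines (a sketch; -c selects the lora-adapter-syncer container defined in the Deployment above, and kubectl picks one Pod of the Deployment):

kubectl logs deploy/vllm-llama3-8b-instruct -c lora-adapter-syncer -f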
2025-04-05 12:55:56 - WARNING - sidecar.py:266 -  skipped adapters found in both `ensureExist` and `ensureNotExist`
2025-04-05 12:55:56 - INFO - sidecar.py:271 -  adapter to load food-review-1
2025-04-05 12:55:56 - INFO - sidecar.py:218 -  food-review-1 already present on model server localhost:8000
2025-04-05 12:55:57 - INFO - sidecar.py:276 -  adapters to unload
2025-04-05 12:55:57 - INFO - sidecar.py:310 -  Waiting 5s before next reconciliation...
2025-04-05 12:56:02 - INFO - sidecar.py:314 -  Periodic reconciliation triggered
2025-04-05 12:56:02 - INFO - sidecar.py:255 -  reconciling model server localhost:8000 with config stored at /config/configmap.yaml
Install the Inference Extension CRDs

Install the InferencePool and InferenceModel CRDs.
VERSION=v0.2.0
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/download/$VERSION/manifests.yaml
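To confirm the CRDs are registered before continuing, a quick check (the CRD names follow from the inference.networking.x-k8s.io API group used by the resources below):

kubectl get crd | grep inference.networking.x-k8s.io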
Deploy the InferenceModel

Deploy an InferenceModel that forwards traffic for the user-facing food-review model to the food-review-1 LoRA adapter on the sample model servers. The InferenceModel references the InferencePool (created in the next section) through poolRef.
# 02-inferencemodel.yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: food-review
spec:
  modelName: food-review # model name used in user requests
  criticality: Standard # criticality level of the model
  poolRef: # multiple InferenceModels can reference the same InferencePool
    name: vllm-llama3-8b-instruct
  targetModels: # actual backend model name(s)
  - name: food-review-1
    weight: 100
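Apply it and check that the resource is created (assuming the file name shown in the manifest comment):

kubectl apply -f 02-inferencemodel.yaml
kubectl get inferencemodel food-review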
Deploy the InferencePool and EPP

Deploy the InferencePool, which selects the Pods running the LLM service via its selector and references the EPP via extensionRef. The EPP makes intelligent routing decisions based on real-time metrics such as request queue depth and available GPU memory.
# 03-inferencepool-resources.yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: vllm-llama3-8b-instruct
spec:
  targetPortNumber: 8000
  selector: # selects the Pods running the LLM service
    app: vllm-llama3-8b-instruct
  extensionRef: # points to the Endpoint Picker
    name: vllm-llama3-8b-instruct-epp
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-llama3-8b-instruct-epp
  namespace: default
spec:
  selector:
    app: vllm-llama3-8b-instruct-epp
  ports:
    - protocol: TCP
      port: 9002
      targetPort: 9002
      appProtocol: http2
  type: ClusterIP
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama3-8b-instruct-epp
  namespace: default
  labels:
    app: vllm-llama3-8b-instruct-epp
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-llama3-8b-instruct-epp
  template:
    metadata:
      labels:
        app: vllm-llama3-8b-instruct-epp
    spec:
      # Conservatively, this timeout should mirror the longest grace period of the pods within the pool
      terminationGracePeriodSeconds: 130
      containers:
      - name: epp
        image: us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/epp:main
        imagePullPolicy: Always
        args:
        - -poolName
        - "vllm-llama3-8b-instruct"
        - -v
        - "4"
        - --zap-encoder
        - "json"
        - -grpcPort
        - "9002"
        - -grpcHealthPort
        - "9003"
        env:
        - name: USE_STREAMING
          value: "true"
        ports:
        - containerPort: 9002
        - containerPort: 9003
        - name: metrics
          containerPort: 9090
        livenessProbe:
          grpc:
            port: 9003
            service: inference-extension
          initialDelaySeconds: 5
          periodSeconds: 10
        readinessProbe:
          grpc:
            port: 9003
            service: inference-extension
          initialDelaySeconds: 5
          periodSeconds: 10
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: pod-read
rules:
- apiGroups: ["inference.networking.x-k8s.io"]
  resources: ["inferencemodels"]
  verbs: ["get", "watch", "list"]
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "watch", "list"]
- apiGroups: ["inference.networking.x-k8s.io"]
  resources: ["inferencepools"]
  verbs: ["get", "watch", "list"]
- apiGroups: ["discovery.k8s.io"]
  resources: ["endpointslices"]
  verbs: ["get", "watch", "list"]
- apiGroups:
  - authentication.k8s.io
  resources:
  - tokenreviews
  verbs:
  - create
- apiGroups:
  - authorization.k8s.io
  resources:
  - subjectaccessreviews
  verbs:
  - create
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: pod-read-binding
subjects:
- kind: ServiceAccount
  name: default
  namespace: default
roleRef:
  kind: ClusterRole
  name: pod-read
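Apply the manifest and wait for the EPP Pod to become Ready (again assuming the file name from the manifest comment):

kubectl apply -f 03-inferencepool-resources.yaml
kubectl get pods -l app=vllm-llama3-8b-instruct-epp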
Deploy the Inference Gateway

Gateways that currently support the Gateway API Inference Extension include Kgateway, Envoy AI Gateway, and others; see the Implementations page for the complete list.
This post uses Kgateway for the demonstration.
First, install the Kgateway CRDs.
KGTW_VERSION=v2.0.0
helm upgrade -i --create-namespace --namespace kgateway-system --version $KGTW_VERSION kgateway-crds oci://cr.kgateway.dev/kgateway-dev/charts/kgateway-crds
Then install Kgateway with inferenceExtension.enabled=true to enable the inference extension.
helm upgrade -i --namespace kgateway-system --version $KGTW_VERSION kgateway oci://cr.kgateway.dev/kgateway-dev/charts/kgateway --set inferenceExtension.enabled=true
Next, create a Gateway whose gatewayClassName refers to Kgateway.
# 04-gateway.yaml
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: inference-gateway
spec:
  gatewayClassName: kgateway
  listeners:
  - name: http
    port: 80
    protocol: HTTP
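Apply the Gateway (assuming the file name from the comment above):

kubectl apply -f 04-gateway.yaml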
Confirm that the gateway has been assigned an IP address and reports Programmed=True.
kubectl get gateway inference-gateway
NAME                CLASS      ADDRESS      PROGRAMMED   AGE
inference-gateway   kgateway   172.18.0.4   True         16s
Deploy an HTTPRoute that routes traffic to the InferencePool.
# 05-httproute.yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route
spec:
  parentRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: inference-gateway
  rules:
  - backendRefs:
    - group: inference.networking.x-k8s.io
      kind: InferencePool
      name: vllm-llama3-8b-instruct
      port: 8000 # Remove when https://github.com/kgateway-dev/kgateway/issues/10987 is fixed.
    matches:
    - path:
        type: PathPrefix
        value: /
    timeouts:
      request: 300s
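Apply the route (assuming the file name from the comment above):

kubectl apply -f 05-httproute.yaml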
Confirm that the HTTPRoute status conditions include Accepted=True and ResolvedRefs=True:
kubectl get httproute llm-route -o yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
......
status:
  parents:
  - conditions:
    - lastTransitionTime: "2025-04-05T13:04:35Z"
      message: ""
      observedGeneration: 2
      reason: Accepted
      status: "True"
      type: Accepted
    - lastTransitionTime: "2025-04-05T13:06:14Z"
      message: ""
      observedGeneration: 2
      reason: ResolvedRefs
      status: "True"
      type: ResolvedRefs
    controllerName: kgateway.dev/kgateway
    parentRef:
      group: gateway.networking.k8s.io
      kind: Gateway
      name: inference-gateway
Request Verification

All configuration is now in place. You can test it by sending a request to the inference gateway with curl.
IP=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}')
PORT=80
curl -i ${IP}:${PORT}/v1/completions -H 'Content-Type: application/json' -d '{
"model": "food-review",
"prompt": "Write as if you were a critic: San Francisco",
"max_tokens": 100,
"temperature": 0
}'
The response is shown below; the model handled the request successfully.
HTTP/1.1 200 OK
date: Sat, 05 Apr 2025 13:18:22 GMT
server: envoy
content-type: application/json
x-envoy-upstream-service-time: 1785
x-went-into-resp-headers: true
transfer-encoding: chunked
{
  "choices": [
    {
      "finish_reason": "length",
      "index": 0,
      "logprobs": null,
      "prompt_logprobs": null,
      "stop_reason": null,
      "text": "'s iconic seafood restaurant, Ali's Bistro, serves a variety of seafood dishes, including sushi, sashimi, and seafood paella. How would you rate Ali's Bistro 1.0? (1 being lowest and 10 being highest)\n### Step 1: Analyze the menu offerings\nAli's Bistro offers a diverse range of seafood dishes, including sushi, sashimi, and seafood paella. This variety suggests that the restaurant caters to different tastes and dietary"
    }
  ],
  "created": 1743859102,
  "id": "cmpl-0046459d-d94f-43b5-b8f4-0898d8e2d50b",
  "model": "food-review-1",
  "object": "text_completion",
  "usage": {
    "completion_tokens": 100,
    "prompt_tokens": 11,
    "prompt_tokens_details": null,
    "total_tokens": 111
  }
}
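Note that the model field in the response is food-review-1: the EPP rewrote the requested food-review model name to the target adapter. To watch routing decisions on the gateway side, you can also tail the EPP's logs (a sketch; the Deployment name comes from the EPP manifest above):

kubectl logs deploy/vllm-llama3-8b-instruct-epp -f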
Roll Out a New Adapter Version

Next, let's roll out a new adapter version. By editing the vllm-llama3-8b-instruct-adapters ConfigMap, the lora-adapter-syncer sidecar container loads the new adapter into the vLLM container.
kubectl edit configmap vllm-llama3-8b-instruct-adapters
Change the ConfigMap as follows to add the food-review-2 adapter.
apiVersion: v1
kind: ConfigMap
metadata:
  name: vllm-llama3-8b-instruct-adapters
data:
  configmap.yaml: |
    vLLMLoRAConfig:
      name: vllm-llama3-8b-instruct-adapters
      port: 8000
      defaultBaseModel: meta-llama/Llama-3.1-8B-Instruct
      ensureExist:
        models:
        - id: food-review-1
          source: Kawon/llama3.1-food-finetune_v14_r8
        # add the new adapter
        - id: food-review-2
          source: Kawon/llama3.1-food-finetune_v14_r8
The new adapter version is applied to the model server in real time, without a restart. Checking the lora-adapter-syncer sidecar logs, you can see that the food-review-2 adapter has been loaded.
2025-04-05 13:15:21 - INFO - sidecar.py:271 -  adapter to load food-review-2, food-review-1
2025-04-05 13:15:21 - INFO - sidecar.py:231 -  loaded model food-review-2
2025-04-05 13:15:21 - INFO - sidecar.py:218 -  food-review-1 already present on model server localhost:8000
2025-04-05 13:15:21 - INFO - sidecar.py:276 -  adapters to unload
2025-04-05 13:15:21 - INFO - sidecar.py:62 -  model server reconcile to Config '/config/configmap.yaml' !
2025-04-05 13:15:22 - INFO - sidecar.py:314 -  Periodic reconciliation triggered
2025-04-05 13:15:22 - INFO - sidecar.py:255 -  reconciling model server localhost:8000 with config stored at /config/configmap.yaml
Modify the InferenceModel configuration to release the new adapter version as a canary.
kubectl edit inferencemodel food-review
Route 10% of the traffic to the new food-review-2 adapter and 90% to food-review-1.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: food-review
spec:
  modelName: food-review
  criticality: Standard
  poolRef:
    name: vllm-llama3-8b-instruct
  targetModels:
  - name: food-review-1
    weight: 90
  - name: food-review-2
    weight: 10
Send the same curl request repeatedly and you will observe that roughly 90% of requests are served by food-review-1 and 10% by food-review-2. The raw output of two such requests is shown after the sketch below.
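For a quick distribution check, a small loop like the following tallies the model field over a batch of requests (a sketch, assuming jq is installed; IP and PORT are set as in the previous section):

for i in $(seq 1 50); do
  curl -s ${IP}:${PORT}/v1/completions -H 'Content-Type: application/json' -d '{
    "model": "food-review",
    "prompt": "Write as if you were a critic: San Francisco",
    "max_tokens": 10,
    "temperature": 0
  }' | jq -r '.model'
done | sort | uniq -c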
curl -i ${IP}:${PORT}/v1/completions -H 'Content-Type: application/json' -d '{
"model": "food-review",
"prompt": "Write as if you were a critic: San Francisco",
"max_tokens": 100,
"temperature": 0
}'
# Request served by food-review-1, identified by the model field in the response
HTTP/1.1 200 OK
date: Sat, 05 Apr 2025 13:18:34 GMT
server: envoy
content-type: application/json
x-envoy-upstream-service-time: 1780
x-went-into-resp-headers: true
transfer-encoding: chunked
{
  "choices": [
    {
      "finish_reason": "length",
      "index": 0,
      "logprobs": null,
      "prompt_logprobs": null,
      "stop_reason": null,
      "text": "'s iconic seafood restaurant, Ali's Bistro, serves a variety of seafood dishes, including sushi, sashimi, and seafood paella. How would you rate Ali's Bistro 1.0? (1 being lowest and 10 being highest)\n### Step 1: Analyze the menu offerings\nAli's Bistro offers a diverse range of seafood dishes, including sushi, sashimi, and seafood paella. This variety suggests that the restaurant caters to different tastes and dietary"
    }
  ],
  "created": 1743859115,
  "id": "cmpl-99203056-cb12-4c8e-bae9-23c28c07cdd7",
  "model": "food-review-1",
  "object": "text_completion",
  "usage": {
    "completion_tokens": 100,
    "prompt_tokens": 11,
    "prompt_tokens_details": null,
    "total_tokens": 111
  }
}
curl -i ${IP}:${PORT}/v1/completions -H 'Content-Type: application/json' -d '{
"model": "food-review",
"prompt": "Write as if you were a critic: San Francisco",
"max_tokens": 100,
"temperature": 0
}'
HTTP/1.1 200 OK
date: Sat, 05 Apr 2025 13:18:38 GMT
server: envoy
content-type: application/json
x-envoy-upstream-service-time: 2531
x-went-into-resp-headers: true
transfer-encoding: chunked
# Request served by food-review-2
{
  "choices": [
    {
      "finish_reason": "length",
      "index": 0,
      "logprobs": null,
      "prompt_logprobs": null,
      "stop_reason": null,
      "text": "'s iconic seafood restaurant, Ali's Bistro, serves a variety of seafood dishes, including sushi, sashimi, and seafood paella. How would you rate Ali's Bistro 1.0? (1 being lowest and 10 being highest)\n### Step 1: Analyze the menu offerings\nAli's Bistro offers a diverse range of seafood dishes, including sushi, sashimi, and seafood paella. This variety suggests that the restaurant caters to different tastes and dietary"
    }
  ],
  "created": 1743859119,
  "id": "cmpl-6f2e2e5f-a0e7-4ee0-bd54-5b1a2ef23399",
  "model": "food-review-2",
  "object": "text_completion",
  "usage": {
    "completion_tokens": 100,
    "prompt_tokens": 11,
    "prompt_tokens_details": null,
    "total_tokens": 111
  }
}
Once the new adapter version is confirmed to work correctly, modify the InferenceModel configuration to route 100% of the traffic to food-review-2.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: food-review
spec:
  modelName: food-review
  criticality: Standard
  poolRef:
    name: vllm-llama3-8b-instruct
  targetModels:
  - name: food-review-2
    weight: 100
At the same time, update the vllm-llama3-8b-instruct-adapters ConfigMap and move the old food-review-1 version into the ensureNotExist list so that it is unloaded from the servers.
apiVersion: v1
kind: ConfigMap
metadata:
  name: vllm-llama3-8b-instruct-adapters
data:
  configmap.yaml: |
    vLLMLoRAConfig:
      name: vllm-llama3-8b-instruct-adapters
      port: 8000
      defaultBaseModel: meta-llama/Llama-3.1-8B-Instruct
      ensureExist:
        models:
        - id: food-review-2
          source: Kawon/llama3.1-food-finetune_v14_r8
      ensureNotExist:
        models:
        - id: food-review-1
          source: Kawon/llama3.1-food-finetune_v14_r8
Watching the lora-adapter-syncer sidecar logs, you can see that the food-review-1 adapter has been unloaded.
2025-04-05 13:27:53 - INFO - sidecar.py:271 -  adapter to load food-review-2
2025-04-05 13:27:53 - INFO - sidecar.py:218 -  food-review-2 already present on model server localhost:8000
2025-04-05 13:27:53 - INFO - sidecar.py:276 -  adapters to unload food-review-1
2025-04-05 13:27:53 - INFO - sidecar.py:247 -  unloaded model food-review-1
2025-04-05 13:27:53 - INFO - sidecar.py:62 -  model server reconcile to Config '/config/configmap.yaml' !
2025-04-05 13:27:56 - INFO - sidecar.py:314 -  Periodic reconciliation triggered
2025-04-05 13:27:56 - INFO - sidecar.py:255 -  reconciling model server localhost:8000 with config stored at /config/configmap.yaml
At this point, all requests should be served by the new adapter version.
Summary

The Gateway API Inference Extension provides a purpose-built traffic-routing solution for LLM inference services on Kubernetes. With model-aware routing, serving priority, and intelligent load balancing, it improves GPU utilization and reduces inference latency. Built around the two core CRDs, InferencePool and InferenceModel, and combined with the Endpoint Picker and the Dynamic LoRA Adapter Sidecar, it enables canary rollouts of model versions and dynamic LoRA adapter management, offering a standardized yet flexible way to self-host large language models on Kubernetes.