Title: Smarter LLM Inference Routing for Kubernetes: A Deep Dive into the Gateway API Inference Extension
Author: 科技颠覆者  Posted: 2025-4-14 00:21

Modern generative AI and large language model (LLM) services present unique traffic-routing challenges for Kubernetes. Unlike typical short-lived, stateless web requests, LLM inference sessions are often long-running, resource-intensive, and partially stateful. A GPU-backed model server, for example, may maintain several active inference sessions at once along with in-memory token caches.
Traditional load balancers mostly route on HTTP paths or round-robin scheduling and lack the specialized capabilities these workloads demand. They cannot recognize the model identifier in a request or how important the request is (for example, the difference between an interactive chat request and a batch job). Organizations typically get by with ad-hoc workarounds, but there has been no standard, unified solution.
To address this, the Gateway API Inference Extension builds on the existing Gateway API, adding routing capabilities dedicated to inference workloads while keeping the familiar Gateway and HTTPRoute model. Adding this extension to an existing gateway turns it into an "Inference Gateway", letting users self-host generative AI models or LLMs in a "model-as-a-service" fashion.
The Gateway API Inference Extension can upgrade any ext-proc-capable proxy or gateway (such as Envoy Gateway, kgateway, or GKE Gateway) into an Inference Gateway, enabling inference platform teams to run self-hosted large language model services on Kubernetes.
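For illustration, the gateway itself remains an ordinary Gateway API resource. The following is a minimal sketch only: the GatewayClass name depends on which implementation you run, and inference-gateway is an assumed name reused in later examples.

apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: inference-gateway
spec:
  gatewayClassName: kgateway   # or the class of your Envoy Gateway / GKE Gateway installation
  listeners:
  - name: http
    protocol: HTTP
    port: 80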
Key features
The Gateway API Inference Extension provides the following key features:
Model-aware routing: Instead of routing purely on the request path as traditional gateways do, the Gateway API Inference Extension can route on the model name. This relies on the gateway implementation (such as Envoy Proxy) understanding generative AI inference API specifications such as the OpenAI API. Model-aware routing also applies to models fine-tuned with LoRA (Low-Rank Adaptation).
Serving priority: The Gateway API Inference Extension lets you assign a serving priority to each model. For example, a model backing latency-sensitive interactive chat can be given a higher criticality, while a model handling latency-tolerant work such as summarization gets a lower one.
Model rollouts: The Gateway API Inference Extension supports traffic splitting by model name, enabling progressive rollouts and canary releases of new model versions (see the sketch after this list).
Extensibility: The Gateway API Inference Extension defines an extensible pattern that lets users add custom routing capabilities to their inference service for scenarios the defaults cannot cover.
Customizable, inference-optimized load balancing: The Gateway API Inference Extension provides a customizable load-balancing and request-routing pattern built for inference; the reference implementation picks model endpoints based on real-time model server metrics. This "model-server-aware" intelligent load balancing replaces the traditional approach and, in practice, lowers inference latency and raises GPU utilization across the cluster.
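To make the priority and traffic-splitting features concrete, here is a minimal sketch of an InferenceModel resource, assuming the v1alpha2 API. The pool name is illustrative; food-review-1 and food-review-2 match the adapter names that appear in the test requests later in this post.

apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: food-review
spec:
  modelName: food-review            # model name clients send in the OpenAI-style request body
  criticality: Standard             # Critical | Standard | Sheddable
  poolRef:
    name: vllm-llama3-8b-instruct   # the InferencePool that serves this model
  targetModels:                     # split traffic between two LoRA adapter versions
  - name: food-review-1
    weight: 90
  - name: food-review-2
    weight: 10

With this in place, requests whose body declares "model": "food-review" are split roughly 90/10 between the two adapters, while a latency-sensitive model could set criticality: Critical to be favored when capacity is scarce.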
Core CRDs
The Gateway API Inference Extension defines two core CRDs: InferencePool and InferenceModel.
InferencePool
An InferencePool represents a group of Pods dedicated to AI inference, together with the extension configuration used to route to those Pods. In the Gateway API resource model, an InferencePool is treated as a "Backend" resource; in practice it can replace a traditional Kubernetes Service as the target that routes point at.
Although an InferencePool resembles a Service in some respects (it selects Pods and specifies a port), it adds capabilities specific to inference. Through its extensionRef field, an InferencePool references an Endpoint Picker that performs inference-aware endpoint selection, making intelligent routing decisions from real-time metrics such as request queue depth and available GPU memory.
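As a rough sketch of how the pieces connect, again assuming the v1alpha2 API and reusing the illustrative names above (the selector label, target port, and Endpoint Picker service name are also assumptions), an InferencePool and the HTTPRoute that targets it might look like:

apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: vllm-llama3-8b-instruct
spec:
  targetPortNumber: 8000                # port the vLLM Pods listen on
  selector:
    app: vllm-llama3-8b-instruct        # labels selecting the model server Pods
  extensionRef:
    name: vllm-llama3-8b-instruct-epp   # Endpoint Picker making per-request scheduling decisions
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route
spec:
  parentRefs:
  - name: inference-gateway
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /
    backendRefs:
    - group: inference.networking.x-k8s.io
      kind: InferencePool
      name: vllm-llama3-8b-instruct

For reference, the startup log of one of the vLLM model server Pods (the vllm container, deployed alongside a lora-adapter-syncer container) is shown below; it loads meta-llama/Llama-3.1-8B-Instruct and exposes an OpenAI-compatible API on port 8000: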
Defaulted container "vllm" out of: vllm, lora-adapter-syncer (init)
INFO 04-05 05:51:39 [__init__.py:239] Automatically detected platform cuda.
WARNING 04-05 05:51:44 [api_server.py:759] LoRA dynamic loading & unloading is enabled in the API server. This should ONLY be used for local development!
INFO 04-05 05:51:44 [api_server.py:981] vLLM API server version 0.8.2
INFO 04-05 05:51:51 [config.py:585] This model supports multiple tasks: {'classify', 'generate', 'embed', 'reward', 'score'}. Defaulting to 'generate'.
WARNING 04-05 05:51:51 [arg_utils.py:1859] Detected VLLM_USE_V1=1 with LORA. Usage should be considered experimental. Please report any issues on Github.
INFO 04-05 05:51:51 [config.py:1697] Chunked prefill is enabled with max_num_batched_tokens=2048.
WARNING 04-05 05:51:51 [config.py:2381] LoRA with chunked prefill is still experimental and may be unstable.
WARNING 04-05 05:51:54 [utils.py:2321] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x73d836b269c0>
INFO 04-05 05:51:55 [parallel_state.py:954] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 04-05 05:51:55 [cuda.py:220] Using Flash Attention backend on V1 engine.
INFO 04-05 05:51:55 [gpu_model_runner.py:1174] Starting to load model meta-llama/Llama-3.1-8B-Instruct...
INFO 04-05 05:51:55 [topk_topp_sampler.py:53] Using FlashInfer for top-p & top-k sampling.
INFO 04-05 05:51:55 [weight_utils.py:265] Using model weights format ['*.safetensors']
INFO 04-05 05:52:51 [weight_utils.py:281] Time spent downloading weights for meta-llama/Llama-3.1-8B-Instruct: 55.301468 seconds
INFO 04-05 05:52:54 [loader.py:447] Loading weights took 3.27 seconds
INFO 04-05 05:52:54 [punica_selector.py:18] Using PunicaWrapperGPU.
INFO 04-05 05:52:54 [gpu_model_runner.py:1186] Model loading took 15.1749 GB and 59.268527 seconds
INFO 04-05 05:53:07 [backends.py:415] Using cache directory: /root/.cache/vllm/torch_compile_cache/253772ede5/rank_0_0 for vLLM's torch.compile
INFO 04-05 05:53:07 [backends.py:425] Dynamo bytecode transform time: 12.60 s
INFO 04-05 05:53:13 [backends.py:132] Cache the graph of shape None for later use
INFO 04-05 05:53:53 [backends.py:144] Compiling a graph for general shape takes 44.37 s
INFO 04-05 05:54:19 [monitor.py:33] torch.compile takes 56.97 s in total
INFO 04-05 05:54:20 [kv_cache_utils.py:566] GPU KV cache size: 148,096 tokens
INFO 04-05 05:54:20 [kv_cache_utils.py:569] Maximum concurrency for 131,072 tokens per request: 1.13x
INFO 04-05 05:55:41 [gpu_model_runner.py:1534] Graph capturing finished in 81 secs, took 0.74 GiB
INFO 04-05 05:55:42 [core.py:151] init engine (profile, create kv cache, warmup model) took 167.44 seconds
WARNING 04-05 05:55:42 [config.py:1028] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
INFO 04-05 05:55:42 [serving_chat.py:115] Using default chat sampling params from model: {'temperature': 0.6, 'top_p': 0.9}
INFO 04-05 05:55:42 [serving_completion.py:61] Using default completion sampling params from model: {'temperature': 0.6, 'top_p': 0.9}
INFO 04-05 05:55:42 [api_server.py:1028] Starting vLLM API server on http://0.0.0.0:8000
INFO 04-05 05:55:42 [launcher.py:26] Available routes are:
INFO 04-05 05:55:42 [launcher.py:34] Route: /openapi.json, Methods: GET, HEAD
INFO 04-05 05:55:42 [launcher.py:34] Route: /docs, Methods: GET, HEAD
INFO 04-05 05:55:42 [launcher.py:34] Route: /docs/oauth2-redirect, Methods: GET, HEAD
INFO 04-05 05:55:42 [launcher.py:34] Route: /redoc, Methods: GET, HEAD
INFO 04-05 05:55:42 [launcher.py:34] Route: /health, Methods: GET
INFO 04-05 05:55:42 [launcher.py:34] Route: /load, Methods: GET
INFO 04-05 05:55:42 [launcher.py:34] Route: /ping, Methods: GET, POST
INFO 04-05 05:55:42 [launcher.py:34] Route: /tokenize, Methods: POST
INFO 04-05 05:55:42 [launcher.py:34] Route: /detokenize, Methods: POST
INFO 04-05 05:55:42 [launcher.py:34] Route: /v1/models, Methods: GET
INFO 04-05 05:55:42 [launcher.py:34] Route: /version, Methods: GET
INFO 04-05 05:55:42 [launcher.py:34] Route: /v1/chat/completions, Methods: POST
INFO 04-05 05:55:42 [launcher.py:34] Route: /v1/completions, Methods: POST
INFO 04-05 05:55:42 [launcher.py:34] Route: /v1/embeddings, Methods: POST
INFO 04-05 05:55:42 [launcher.py:34] Route: /pooling, Methods: POST
INFO 04-05 05:55:42 [launcher.py:34] Route: /score, Methods: POST
INFO 04-05 05:55:42 [launcher.py:34] Route: /v1/score, Methods: POST
INFO 04-05 05:55:42 [launcher.py:34] Route: /v1/audio/transcriptions, Methods: POST
INFO 04-05 05:55:42 [launcher.py:34] Route: /rerank, Methods: POST
INFO 04-05 05:55:42 [launcher.py:34] Route: /v1/rerank, Methods: POST
INFO 04-05 05:55:42 [launcher.py:34] Route: /v2/rerank, Methods: POST
INFO 04-05 05:55:42 [launcher.py:34] Route: /invocations, Methods: POST
INFO 04-05 05:55:42 [launcher.py:34] Route: /v1/load_lora_adapter, Methods: POST
INFO 04-05 05:55:42 [launcher.py:34] Route: /v1/unload_lora_adapter, Methods: POST
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: 10.244.1.1:33920 - "GET /health HTTP/1.1" 200 OK
INFO: 10.244.1.1:33922 - "GET /health HTTP/1.1" 200 OK
"prompt": "Write as if you were a critic: San Francisco",
"max_tokens": 100,
"temperature": 0
}'
The response is shown below; the model handled the request successfully.
HTTP/1.1 200 OK
date: Sat, 05 Apr 2025 13:18:22 GMT
server: envoy
content-type: application/json
x-envoy-upstream-service-time: 1785
x-went-into-resp-headers: true
transfer-encoding: chunked
{
"choices": [
{
"finish_reason": "length",
"index": 0,
"logprobs": null,
"prompt_logprobs": null,
"stop_reason": null,
"text": "'s iconic seafood restaurant, Ali's Bistro, serves a variety of seafood dishes, including sushi, sashimi, and seafood paella. How would you rate Ali's Bistro 1.0? (1 being lowest and 10 being highest)\n### Step 1: Analyze the menu offerings\nAli's Bistro offers a diverse range of seafood dishes, including sushi, sashimi, and seafood paella. This variety suggests that the restaurant caters to different tastes and dietary"
"prompt": "Write as if you were a critic: San Francisco",
"max_tokens": 100,
"temperature": 0
}'
# Request sent to food-review-1, identifiable via the model field in the response
HTTP/1.1 200 OK
date: Sat, 05 Apr 2025 13:18:34 GMT
server: envoy
content-type: application/json
x-envoy-upstream-service-time: 1780
x-went-into-resp-headers: true
transfer-encoding: chunked
{
"choices": [
{
"finish_reason": "length",
"index": 0,
"logprobs": null,
"prompt_logprobs": null,
"stop_reason": null,
"text": "'s iconic seafood restaurant, Ali's Bistro, serves a variety of seafood dishes, including sushi, sashimi, and seafood paella. How would you rate Ali's Bistro 1.0? (1 being lowest and 10 being highest)\n### Step 1: Analyze the menu offerings\nAli's Bistro offers a diverse range of seafood dishes, including sushi, sashimi, and seafood paella. This variety suggests that the restaurant caters to different tastes and dietary"
"prompt": "Write as if you were a critic: San Francisco",
"max_tokens": 100,
"temperature": 0
}'
HTTP/1.1 200 OK
date: Sat, 05 Apr 2025 13:18:38 GMT
server: envoy
content-type: application/json
x-envoy-upstream-service-time: 2531
x-went-into-resp-headers: true
transfer-encoding: chunked
# Request sent to food-review-2
{
"choices": [
{
"finish_reason": "length",
"index": 0,
"logprobs": null,
"prompt_logprobs": null,
"stop_reason": null,
"text": "'s iconic seafood restaurant, Ali's Bistro, serves a variety of seafood dishes, including sushi, sashimi, and seafood paella. How would you rate Ali's Bistro 1.0? (1 being lowest and 10 being highest)\n### Step 1: Analyze the menu offerings\nAli's Bistro offers a diverse range of seafood dishes, including sushi, sashimi, and seafood paella. This variety suggests that the restaurant caters to different tastes and dietary"
Deep Dive into the Gateway API Inference Extension: https://kgateway.dev/blog/deep-dive-inference-extensions/
Smarter AI Inference Routing on Kubernetes with Gateway API Inference Extension: https://kgateway.dev/blog/smarter-ai-reference-kubernetes-gateway-api/