Title: Smarter LLM Inference Routing for Kubernetes: A Deep Dive into the Gateway API Inference Extension
Author: 科技颠覆者  Posted: 2025-4-14 00:21

Modern generative AI and large language model (LLM) services present unique traffic-routing challenges for Kubernetes. Unlike typical short-lived, stateless web requests, LLM inference sessions are often long-running, resource-intensive, and partially stateful. A GPU-backed model server, for example, may maintain several active inference sessions at once along with in-memory token caches.
Traditional load balancers mostly route on HTTP paths or round-robin scheduling and lack the specialized capabilities these workloads demand. They cannot recognize the model identifier in a request or how important the request is (for example, the difference between an interactive chat request and a batch job). Organizations typically get by with ad-hoc workarounds, but there has been no standard, unified solution.
To address this, the Gateway API Inference Extension builds on the existing Gateway API, adding routing capabilities dedicated to inference workloads while keeping the familiar Gateway and HTTPRoute model. Adding this extension to an existing gateway turns it into an "Inference Gateway", letting users self-host generative AI models or LLMs in a "model-as-a-service" fashion.
The Gateway API Inference Extension can upgrade any ext-proc-capable proxy or gateway (such as Envoy Gateway, kgateway, or GKE Gateway) into an Inference Gateway, enabling inference platform teams to run self-hosted large language model services on Kubernetes.
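For illustration, the gateway itself remains an ordinary Gateway API resource. The following is a minimal sketch only: the GatewayClass name depends on which implementation you run, and inference-gateway is an assumed name reused in later examples.

apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: inference-gateway
spec:
  gatewayClassName: kgateway   # or the class of your Envoy Gateway / GKE Gateway installation
  listeners:
  - name: http
    protocol: HTTP
    port: 80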
Key features
The Gateway API Inference Extension provides the following key features:
Model-aware routing: Instead of routing purely on the request path as traditional gateways do, the Gateway API Inference Extension can route on the model name. This relies on the gateway implementation (such as Envoy Proxy) understanding generative AI inference API specifications such as the OpenAI API. Model-aware routing also applies to models fine-tuned with LoRA (Low-Rank Adaptation).
Serving priority: The Gateway API Inference Extension lets you assign a serving priority to each model. For example, a model backing latency-sensitive interactive chat can be given a higher criticality, while a model handling latency-tolerant work such as summarization gets a lower one.
Model rollouts: The Gateway API Inference Extension supports traffic splitting by model name, enabling progressive rollouts and canary releases of new model versions (see the sketch after this list).
Extensibility: The Gateway API Inference Extension defines an extensible pattern that lets users add custom routing capabilities to their inference service for scenarios the defaults cannot cover.
Customizable, inference-optimized load balancing: The Gateway API Inference Extension provides a customizable load-balancing and request-routing pattern built for inference; the reference implementation picks model endpoints based on real-time model server metrics. This "model-server-aware" intelligent load balancing replaces the traditional approach and, in practice, lowers inference latency and raises GPU utilization across the cluster.
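To make the priority and traffic-splitting features concrete, here is a minimal sketch of an InferenceModel resource, assuming the v1alpha2 API. The pool name is illustrative; food-review-1 and food-review-2 match the adapter names that appear in the test requests later in this post.

apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: food-review
spec:
  modelName: food-review            # model name clients send in the OpenAI-style request body
  criticality: Standard             # Critical | Standard | Sheddable
  poolRef:
    name: vllm-llama3-8b-instruct   # the InferencePool that serves this model
  targetModels:                     # split traffic between two LoRA adapter versions
  - name: food-review-1
    weight: 90
  - name: food-review-2
    weight: 10

With this in place, requests whose body declares "model": "food-review" are split roughly 90/10 between the two adapters, while a latency-sensitive model could set criticality: Critical to be favored when capacity is scarce.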
Core CRDs
The Gateway API Inference Extension defines two core CRDs: InferencePool and InferenceModel.
InferencePool
An InferencePool represents a group of Pods dedicated to AI inference, together with the extension configuration used to route to those Pods. In the Gateway API resource model, an InferencePool is treated as a "Backend" resource; in practice it can replace a traditional Kubernetes Service as the target that routes point at.
Although an InferencePool resembles a Service in some respects (it selects Pods and specifies a port), it adds capabilities specific to inference. Through its extensionRef field, an InferencePool references an Endpoint Picker that performs inference-aware endpoint selection, making intelligent routing decisions from real-time metrics such as request queue depth and available GPU memory.
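As a rough sketch of how the pieces connect, again assuming the v1alpha2 API and reusing the illustrative names above (the selector label, target port, and Endpoint Picker service name are also assumptions), an InferencePool and the HTTPRoute that targets it might look like:

apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: vllm-llama3-8b-instruct
spec:
  targetPortNumber: 8000                # port the vLLM Pods listen on
  selector:
    app: vllm-llama3-8b-instruct        # labels selecting the model server Pods
  extensionRef:
    name: vllm-llama3-8b-instruct-epp   # Endpoint Picker making per-request scheduling decisions
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route
spec:
  parentRefs:
  - name: inference-gateway
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /
    backendRefs:
    - group: inference.networking.x-k8s.io
      kind: InferencePool
      name: vllm-llama3-8b-instruct

For reference, the startup log of one of the vLLM model server Pods (the vllm container, deployed alongside a lora-adapter-syncer container) is shown below; it loads meta-llama/Llama-3.1-8B-Instruct and exposes an OpenAI-compatible API on port 8000: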
Defaulted container "vllm" out of: vllm, lora-adapter-syncer (init)
INFO 04-05 05:51:39 [__init__.py:239] Automatically detected platform cuda.
WARNING 04-05 05:51:44 [api_server.py:759] LoRA dynamic loading & unloading is enabled in the API server. This should ONLY be used for local development!
INFO 04-05 05:51:44 [api_server.py:981] vLLM API server version 0.8.2
INFO 04-05 05:51:51 [config.py:585] This model supports multiple tasks: {'classify', 'generate', 'embed', 'reward', 'score'}. Defaulting to 'generate'.
WARNING 04-05 05:51:51 [arg_utils.py:1859] Detected VLLM_USE_V1=1 with LORA. Usage should be considered experimental. Please report any issues on Github.
INFO 04-05 05:51:51 [config.py:1697] Chunked prefill is enabled with max_num_batched_tokens=2048.
WARNING 04-05 05:51:51 [config.py:2381] LoRA with chunked prefill is still experimental and may be unstable.
WARNING 04-05 05:51:54 [utils.py:2321] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x73d836b269c0>
INFO 04-05 05:51:55 [parallel_state.py:954] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 04-05 05:51:55 [cuda.py:220] Using Flash Attention backend on V1 engine.
INFO 04-05 05:51:55 [gpu_model_runner.py:1174] Starting to load model meta-llama/Llama-3.1-8B-Instruct...
INFO 04-05 05:51:55 [topk_topp_sampler.py:53] Using FlashInfer for top-p & top-k sampling.
INFO 04-05 05:51:55 [weight_utils.py:265] Using model weights format ['*.safetensors']
INFO 04-05 05:52:51 [weight_utils.py:281] Time spent downloading weights for meta-llama/Llama-3.1-8B-Instruct: 55.301468 seconds
INFO 04-05 05:52:54 [loader.py:447] Loading weights took 3.27 seconds
INFO 04-05 05:52:54 [punica_selector.py:18] Using PunicaWrapperGPU.
INFO 04-05 05:52:54 [gpu_model_runner.py:1186] Model loading took 15.1749 GB and 59.268527 seconds
INFO 04-05 05:53:07 [backends.py:415] Using cache directory: /root/.cache/vllm/torch_compile_cache/253772ede5/rank_0_0 for vLLM's torch.compile
INFO 04-05 05:53:07 [backends.py:425] Dynamo bytecode transform time: 12.60 s
INFO 04-05 05:53:13 [backends.py:132] Cache the graph of shape None for later use
INFO 04-05 05:53:53 [backends.py:144] Compiling a graph for general shape takes 44.37 s
INFO 04-05 05:54:19 [monitor.py:33] torch.compile takes 56.97 s in total
INFO 04-05 05:54:20 [kv_cache_utils.py:566] GPU KV cache size: 148,096 tokens
INFO 04-05 05:54:20 [kv_cache_utils.py:569] Maximum concurrency for 131,072 tokens per request: 1.13x
INFO 04-05 05:55:41 [gpu_model_runner.py:1534] Graph capturing finished in 81 secs, took 0.74 GiB
INFO 04-05 05:55:42 [core.py:151] init engine (profile, create kv cache, warmup model) took 167.44 seconds
WARNING 04-05 05:55:42 [config.py:1028] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
INFO 04-05 05:55:42 [serving_chat.py:115] Using default chat sampling params from model: {'temperature': 0.6, 'top_p': 0.9}
INFO 04-05 05:55:42 [serving_completion.py:61] Using default completion sampling params from model: {'temperature': 0.6, 'top_p': 0.9}
INFO 04-05 05:55:42 [api_server.py:1028] Starting vLLM API server on http://0.0.0.0:8000
INFO 04-05 05:55:42 [launcher.py:26] Available routes are:
INFO 04-05 05:55:42 [launcher.py:34] Route: /openapi.json, Methods: GET, HEAD
INFO 04-05 05:55:42 [launcher.py:34] Route: /docs, Methods: GET, HEAD
INFO 04-05 05:55:42 [launcher.py:34] Route: /docs/oauth2-redirect, Methods: GET, HEAD
INFO 04-05 05:55:42 [launcher.py:34] Route: /redoc, Methods: GET, HEAD
INFO 04-05 05:55:42 [launcher.py:34] Route: /health, Methods: GET
INFO 04-05 05:55:42 [launcher.py:34] Route: /load, Methods: GET
INFO 04-05 05:55:42 [launcher.py:34] Route: /ping, Methods: GET, POST
INFO 04-05 05:55:42 [launcher.py:34] Route: /tokenize, Methods: POST
INFO 04-05 05:55:42 [launcher.py:34] Route: /detokenize, Methods: POST
INFO 04-05 05:55:42 [launcher.py:34] Route: /v1/models, Methods: GET
INFO 04-05 05:55:42 [launcher.py:34] Route: /version, Methods: GET
INFO 04-05 05:55:42 [launcher.py:34] Route: /v1/chat/completions, Methods: POST
INFO 04-05 05:55:42 [launcher.py:34] Route: /v1/completions, Methods: POST
INFO 04-05 05:55:42 [launcher.py:34] Route: /v1/embeddings, Methods: POST
INFO 04-05 05:55:42 [launcher.py:34] Route: /pooling, Methods: POST
INFO 04-05 05:55:42 [launcher.py:34] Route: /score, Methods: POST
INFO 04-05 05:55:42 [launcher.py:34] Route: /v1/score, Methods: POST
INFO 04-05 05:55:42 [launcher.py:34] Route: /v1/audio/transcriptions, Methods: POST
INFO 04-05 05:55:42 [launcher.py:34] Route: /rerank, Methods: POST
INFO 04-05 05:55:42 [launcher.py:34] Route: /v1/rerank, Methods: POST
INFO 04-05 05:55:42 [launcher.py:34] Route: /v2/rerank, Methods: POST
INFO 04-05 05:55:42 [launcher.py:34] Route: /invocations, Methods: POST
INFO 04-05 05:55:42 [launcher.py:34] Route: /v1/load_lora_adapter, Methods: POST
INFO 04-05 05:55:42 [launcher.py:34] Route: /v1/unload_lora_adapter, Methods: POST
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: 10.244.1.1:33920 - "GET /health HTTP/1.1" 200 OK
INFO: 10.244.1.1:33922 - "GET /health HTTP/1.1" 200 OK
"prompt": "Write as if you were a critic: San Francisco",
"max_tokens": 100,
"temperature": 0
}'
The response is shown below; the model handled the request successfully.
HTTP/1.1 200 OK
date: Sat, 05 Apr 2025 13:18:22 GMT
server: envoy
content-type: application/json
x-envoy-upstream-service-time: 1785
x-went-into-resp-headers: true
transfer-encoding: chunked
{
"choices": [
{
"finish_reason": "length",
"index": 0,
"logprobs": null,
"prompt_logprobs": null,
"stop_reason": null,
"text": "'s iconic seafood restaurant, Ali's Bistro, serves a variety of seafood dishes, including sushi, sashimi, and seafood paella. How would you rate Ali's Bistro 1.0? (1 being lowest and 10 being highest)\n### Step 1: Analyze the menu offerings\nAli's Bistro offers a diverse range of seafood dishes, including sushi, sashimi, and seafood paella. This variety suggests that the restaurant caters to different tastes and dietary"
"prompt": "Write as if you were a critic: San Francisco",
"max_tokens": 100,
"temperature": 0
}'
# Request sent to food-review-1, identifiable via the model field in the response
HTTP/1.1 200 OK
date: Sat, 05 Apr 2025 13:18:34 GMT
server: envoy
content-type: application/json
x-envoy-upstream-service-time: 1780
x-went-into-resp-headers: true
transfer-encoding: chunked
{
"choices": [
{
"finish_reason": "length",
"index": 0,
"logprobs": null,
"prompt_logprobs": null,
"stop_reason": null,
"text": "'s iconic seafood restaurant, Ali's Bistro, serves a variety of seafood dishes, including sushi, sashimi, and seafood paella. How would you rate Ali's Bistro 1.0? (1 being lowest and 10 being highest)\n### Step 1: Analyze the menu offerings\nAli's Bistro offers a diverse range of seafood dishes, including sushi, sashimi, and seafood paella. This variety suggests that the restaurant caters to different tastes and dietary"
"prompt": "Write as if you were a critic: San Francisco",
"max_tokens": 100,
"temperature": 0
}'
HTTP/1.1 200 OK
date: Sat, 05 Apr 2025 13:18:38 GMT
server: envoy
content-type: application/json
x-envoy-upstream-service-time: 2531
x-went-into-resp-headers: true
transfer-encoding: chunked
# Request sent to food-review-2
{
"choices": [
{
"finish_reason": "length",
"index": 0,
"logprobs": null,
"prompt_logprobs": null,
"stop_reason": null,
"text": "'s iconic seafood restaurant, Ali's Bistro, serves a variety of seafood dishes, including sushi, sashimi, and seafood paella. How would you rate Ali's Bistro 1.0? (1 being lowest and 10 being highest)\n### Step 1: Analyze the menu offerings\nAli's Bistro offers a diverse range of seafood dishes, including sushi, sashimi, and seafood paella. This variety suggests that the restaurant caters to different tastes and dietary"
Deep Dive into the Gateway API Inference Extension: https://kgateway.dev/blog/deep-dive-inference-extensions/
Smarter AI Inference Routing on Kubernetes with Gateway API Inference Extension: https://kgateway.dev/blog/smarter-ai-reference-kubernetes-gateway-api/