The previous article walked through deploying TensorRT-LLM & Triton Server. This one gives a brief overview of the architecture of Triton Inference Server, so that configuration and development become easier to reason about.
TensorRT-LLM & Triton Server Deployment Recap
Deployment Overview
In the previous article we saw that deploying TRT-LLM & Triton involves quite a few moving parts, but it boils down to four steps: 1. download the model; 2. convert & build the model (TensorRT); 3. configure the backends (Triton); 4. start Triton Server.
```bash
# Download the model
wget https://hf-mirror.com/hfd/hfd.sh
chmod a+x hfd.sh
export HF_ENDPOINT=https://hf-mirror.com
apt-get update
apt-get install -y aria2
aria2c --version
./hfd.sh meta-llama/Meta-Llama-3.1-8B-Instruct --hf_username Dong-Hua --hf_token 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 aria2c -x 4
# Resume an interrupted download
aria2c --header='Authorization: Bearer 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' --console-log-level=error --file-allocation=none -x 4 -s 4 -k 1M -c 'https://hf-mirror.com/meta-llama/Meta-Llama-3.1-8B-Instruct/resolve/main/model-00002-of-00004.safetensors' -d '.' -o 'model-00002-of-00004.safetensors'
```
```bash
# Convert the checkpoint format
python3 convert_checkpoint.py --model_dir /data/Meta-Llama-3.1-8B-Instruct \
                              --output_dir ./trt_ckpts/llama3.1_checkpoint_gpu_tp \
                              --dtype float16 \
                              --tp_size 1 \
                              --workers 8
# Build the engine
trtllm-build --checkpoint_dir ./trt_ckpts/llama3.1_checkpoint_gpu_tp \
             --output_dir ./trt_engines/llama3.1_8B_128K_fp16_1-gpu \
             --workers 8 \
             --gemm_plugin auto \
             --max_num_tokens 131072
```
This step produces two files: the compiled model engine and its model configuration file.
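If you want to confirm the build output, a quick sketch like the following (paths taken from the build command above) lists what landed in the engine directory; for this single-GPU (tp_size=1) build you should see rank0.engine plus config.json:

```python
# Sanity-check the trtllm-build output directory (a sketch; adjust the path if needed).
from pathlib import Path

engine_dir = Path("./trt_engines/llama3.1_8B_128K_fp16_1-gpu")
for p in sorted(engine_dir.iterdir()):
    # Expect config.json (engine/model configuration) and rank0.engine (serialized engine)
    print(f"{p.name}\t{p.stat().st_size / 1e9:.2f} GB")
```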

```bash
# Before configuring, create triton_model_repo and copy the two files from the
# previous step into the corresponding directory inside the repo:
cd tensorrtllm_backend
mkdir triton_model_repo
cp -r all_models/inflight_batcher_llm/* triton_model_repo/
cp /data/TensorRT-LLM/examples/llama/trt_engines/llama3.1_8B_128K_fp16_1-gpu/* triton_model_repo/tensorrt_llm/1

# Configure the backends
python3 tools/fill_template.py -i triton_model_repo/postprocessing/config.pbtxt \
    tokenizer_dir:/data/Meta-Llama-3.1-8B-Instruct,tokenizer_type:auto,triton_max_batch_size:64,postprocessing_instance_count:1

python3 tools/fill_template.py -i triton_model_repo/preprocessing/config.pbtxt \
    tokenizer_dir:/data/Meta-Llama-3.1-8B-Instruct,tokenizer_type:auto,triton_max_batch_size:64,preprocessing_instance_count:1

python3 tools/fill_template.py -i triton_model_repo/tensorrt_llm_bls/config.pbtxt \
    triton_max_batch_size:64,decoupled_mode:true,bls_instance_count:1

python3 tools/fill_template.py -i triton_model_repo/ensemble/config.pbtxt \
    triton_max_batch_size:64

python3 tools/fill_template.py -i triton_model_repo/tensorrt_llm/config.pbtxt \
    triton_backend:tensorrtllm,triton_max_batch_size:64,decoupled_mode:true,engine_dir:triton_model_repo/tensorrt_llm/1,max_queue_delay_microseconds:10000,batching_strategy:inflight_fused_batching
```
In this step the configuration is filled in via the fill_template.py script, although you could just as well edit the corresponding config.pbtxt files directly. Note that every change lands under the triton_model_repo directory: the heart of Triton is the set of backends inside that model repository, and backends are exactly what this article is about.
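As a quick sanity check (a sketch, not part of the official tooling), you can scan the repo for any `${...}` placeholders that fill_template.py left unfilled, since unset parameters keep that form in config.pbtxt:

```python
# Scan triton_model_repo for template placeholders that were never filled in.
# Assumes you run this from the tensorrtllm_backend directory.
import re
from pathlib import Path

for cfg in Path("triton_model_repo").rglob("config.pbtxt"):
    leftovers = re.findall(r"\$\{\w+\}", cfg.read_text())
    if leftovers:
        print(cfg, "still contains placeholders:", sorted(set(leftovers)))
```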
```bash
# Start the server
python3 scripts/launch_triton_server.py \
    --world_size 1 \
    --model_repo=/data/tensorrtllm_backend/triton_model_repo/ \
    --log \
    --log-file=./triton_server.logs
# Stop the server
pkill tritonserver
```
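After launching, you can confirm the server is healthy before sending traffic. A minimal sketch against Triton's standard HTTP health endpoints (default HTTP port 8000, as used in the request examples later in this article):

```python
# Probe Triton's liveness/readiness endpoints over HTTP.
import requests

base = "http://localhost:8000"
print("live: ", requests.get(f"{base}/v2/health/live").status_code)   # 200 once the server process is up
print("ready:", requests.get(f"{base}/v2/health/ready").status_code)  # 200 once all models are loaded and ready
```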
Triton Architecture
At a high level, the architecture of Triton Inference Server works as follows.
- The model repository is a file-system based repository of the models that Triton serves for inference.
- Inference requests arrive at the server over HTTP/REST, gRPC, or the C API, and are routed to the appropriate per-model scheduler.
- Triton implements multiple scheduling and batching algorithms, configurable per model.
- Each model's scheduler optionally batches inference requests and then passes them to the backend corresponding to the model type. The backend runs inference on the inputs of the batched requests to produce the requested outputs, which are then returned.
Why use backends?
As described above, every model deployed on Triton is served through some backend. A TensorRT model, for example, is served by the TensorRT backend, which spins up one or more model instances, and those instances run inference on GPUs or CPUs.
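If you want to see which backend each deployed model ended up on, Triton's HTTP repository-index and model-configuration endpoints expose that information. A small sketch (the model name `tensorrt_llm` matches the repo layout shown below):

```python
# Inspect loaded models and the backend/platform serving them.
import requests

base = "http://localhost:8000"

# List everything in the model repository together with its load state
for m in requests.post(f"{base}/v2/repository/index", json={}).json():
    print(m["name"], m.get("state"))

# Show the backend (or platform) and instance placement for one model
cfg = requests.get(f"{base}/v2/models/tensorrt_llm/config").json()
print(cfg.get("backend") or cfg.get("platform"), cfg.get("instance_group"))
```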
triton_model_repo Directory Structure
During deployment we had to edit quite a few backend configuration files before starting the server, and all of them live under the triton_model_repo directory. Here is its layout:
```
(base) [root@iv-ycl6gxrcwka8j6ujk4bc tensorrtllm_backend]# tree triton_model_repo/
triton_model_repo/
├── ensemble
│   ├── 1
│   └── config.pbtxt
├── postprocessing
│   ├── 1
│   │   ├── model.py
│   │   └── __pycache__
│   │       └── model.cpython-310.pyc
│   └── config.pbtxt
├── preprocessing
│   ├── 1
│   │   ├── model.py
│   │   └── __pycache__
│   │       └── model.cpython-310.pyc
│   └── config.pbtxt
├── tensorrt_llm
│   ├── 1
│   │   ├── config.json        # model configuration file
│   │   ├── model.py           # backend script
│   │   ├── __pycache__
│   │   │   └── model.cpython-310.pyc
│   │   └── rank0.engine       # model engine file
│   └── config.pbtxt           # backend config file (the one edited during deployment)
└── tensorrt_llm_bls
    ├── 1
    │   ├── lib
    │   │   ├── decode.py
    │   │   ├── __pycache__
    │   │   │   ├── decode.cpython-310.pyc
    │   │   │   └── triton_decoder.cpython-310.pyc
    │   │   └── triton_decoder.py
    │   ├── model.py
    │   └── __pycache__
    │       └── model.cpython-310.pyc
    └── config.pbtxt
```
As you can see, the triton_model_repo repository contains five models, each served by a backend:
- ensemble model: lets you chain several models together into a pipeline; it only supports a static request flow
- postprocessing model: post-processing (detokenizing the model output)
- preprocessing model: pre-processing (tokenizing the input text)
- tensorrt_llm model: the model served from the TensorRT engine
- tensorrt_llm_bls model: BLS (Business Logic Scripting) mode, which allows branching and dynamic control flow
Each model directory follows a similar layout (the tensorrt_llm model is shown here as an example):
```
├── *****
│   ├── 1
│   │   ├── config.json        # model configuration file
│   │   ├── model.py           # backend script
│   │   ├── __pycache__
│   │   │   └── model.cpython-310.pyc
│   │   └── rank0.engine       # model engine file
│   └── config.pbtxt           # backend config file (the one edited during deployment)
```
Once deployment is done, you can query the model in either of the following two ways:
```bash
curl -X POST localhost:8000/v2/models/ensemble/generate -d '{"text_input": "你是什么语言模型?", "max_tokens": 1000, "bad_words": "", "stop_words": "<|eot_id|>"}'
curl -X POST localhost:8000/v2/models/tensorrt_llm_bls/generate -d '{"text_input": "你是什么语言模型?", "max_tokens": 1000, "bad_words": "", "stop_words": "<|eot_id|>"}'
```
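The same two calls from Python, as a minimal sketch using the requests library (the response JSON should carry the text_output field declared in the ensemble/BLS configs):

```python
# Send the same generate requests shown above via Python.
import requests

payload = {
    "text_input": "你是什么语言模型?",
    "max_tokens": 1000,
    "bad_words": "",
    "stop_words": "<|eot_id|>",
}

# Route through the static ensemble pipeline
r1 = requests.post("http://localhost:8000/v2/models/ensemble/generate", json=payload)
print(r1.json().get("text_output"))

# Route through the BLS model (dynamic logic lives in its model.py)
r2 = requests.post("http://localhost:8000/v2/models/tensorrt_llm_bls/generate", json=payload)
print(r2.json().get("text_output"))
```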
So here is the question: what is the difference between the two?
Ensemble Mode
Ensemble is not really a model; it is just a scheduler. It chains several models together into a pipeline, and when a request targets this model, the call flow statically follows the configured pipeline.
Contents of its config.pbtxt:
```
# Copyright 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
#  * Redistributions of source code must retain the above copyright
#    notice, this list of conditions and the following disclaimer.
#  * Redistributions in binary form must reproduce the above copyright
#    notice, this list of conditions and the following disclaimer in the
#    documentation and/or other materials provided with the distribution.
#  * Neither the name of NVIDIA CORPORATION nor the names of its
#    contributors may be used to endorse or promote products derived
#    from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

name: "ensemble"
platform: "ensemble"
max_batch_size: 64
input [
  {
    name: "text_input"
    data_type: TYPE_STRING
    dims: [ 1 ]
  },
  ...
  {
    name: "embedding_bias_weights"
    data_type: TYPE_FP32
    dims: [ -1 ]
    optional: true
  }
]
output [
  {
    name: "text_output"
    data_type: TYPE_STRING
    dims: [ -1 ]
  },
  ...
  {
    name: "batch_index"
    data_type: TYPE_INT32
    dims: [ 1 ]
  }
]
ensemble_scheduling {
  step [
    {
      model_name: "preprocessing"
      model_version: -1
      input_map {
        key: "QUERY"
        value: "text_input"
      }
      ...
    },
    {
      model_name: "tensorrt_llm"
      model_version: -1
      input_map {
        key: "input_ids"
        value: "_INPUT_ID"
      }
      ...
    },
    {
      model_name: "postprocessing"
      model_version: -1
      input_map {
        key: "TOKENS_BATCH"
        value: "_TOKENS_BATCH"
      }
      ...
    }
  ]
}
```
Notice the ensemble_scheduling section: when the ensemble model is invoked, the call chain follows the configured scheduling steps. Each step declares its inputs and outputs together with an input/output mapping (input_map), i.e. an output of the previous model is wired to a named input of the next one. The listing above is abridged; see the full config file for the details.
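To make the data flow concrete, here is a simplified, hypothetical sketch of what an ensemble-style scheduler does with input_map/output_map. It is an illustration only, not Triton's actual implementation:

```python
# Simplified illustration of ensemble scheduling: each step's input_map pulls
# tensors that were produced earlier, and its output_map publishes the step's
# outputs under new names for later steps (or as the final ensemble outputs).

def run_ensemble(steps, ensemble_inputs, infer):
    """steps: list of dicts with 'model_name', 'input_map', 'output_map';
    infer(model_name, inputs) stands in for a per-model inference call."""
    tensors = dict(ensemble_inputs)  # e.g. {"text_input": ...}
    for step in steps:
        # Rename: model input name -> name of an already-produced tensor
        step_inputs = {k: tensors[v] for k, v in step["input_map"].items()}
        outputs = infer(step["model_name"], step_inputs)
        # Publish: model output name -> name visible to subsequent steps
        for k, v in step["output_map"].items():
            tensors[v] = outputs[k]
    return tensors  # includes the final outputs, e.g. "text_output"
```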
BLS Mode
Where Ensemble is the static mode, BLS (Business Logic Scripting) is the dynamic one. Compared with the static pipeline it adds a model.py script, and the "dynamic" behavior is implemented by the conditional logic written in that script.
Structure of the model.py script:
```python
# Copyright 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
#  * Redistributions of source code must retain the above copyright
#    notice, this list of conditions and the following disclaimer.
#  * Redistributions in binary form must reproduce the above copyright
#    notice, this list of conditions and the following disclaimer in the
#    documentation and/or other materials provided with the distribution.
#  * Neither the name of NVIDIA CORPORATION nor the names of its
#    contributors may be used to endorse or promote products derived
#    from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

import json
import traceback

import triton_python_backend_utils as pb_utils
from lib.triton_decoder import TritonDecoder


def get_valid_param_value(param, default_value=''):
    value = param.get('string_value', '')
    return default_value if value.startswith('${') or value == '' else value


class TritonPythonModel:

    def initialize(self, args):
        # Parse model configs
        model_config = json.loads(args['model_config'])

        params = model_config['parameters']

        accumulate_tokens_str = get_valid_param_value(
            params.get('accumulate_tokens', {}))
        self.accumulate_tokens = accumulate_tokens_str.lower() in [
            'true', 'yes', '1', 't'
        ]
        self.decoupled = pb_utils.using_decoupled_model_transaction_policy(
            model_config)
        self.logger = pb_utils.Logger

        default_tensorrt_llm_model_name = 'tensorrt_llm'
        self.llm_model_name = get_valid_param_value(
            params.get('tensorrt_llm_model_name', {}),
            default_tensorrt_llm_model_name)

        self.draft_llm_model_name = get_valid_param_value(
            params.get('tensorrt_llm_draft_model_name', {}), None)

        self.multimodal_encoders_name = get_valid_param_value(
            params.get('multimodal_encoders_name', {}), None)

        self.decoder = TritonDecoder(
            streaming=self.decoupled,
            accumulate=self.accumulate_tokens,
            preproc_model_name="preprocessing",
            postproc_model_name="postprocessing",
            llm_model_name=self.llm_model_name,
            draft_llm_model_name=self.draft_llm_model_name,
            multimodal_encoders_name=self.multimodal_encoders_name)

    def execute(self, requests):
        responses = []

        for request in requests:
            if self.decoupled:
                response_sender = request.get_response_sender()
            try:
                req = self.decoder.convert_triton_request(request)
                req.validate()
                speculative_decode = (req.num_draft_tokens is not None
                                      and req.num_draft_tokens[0][0] > 0)
                if speculative_decode and (self.draft_llm_model_name is None
                                           or self.draft_llm_model_name == ""):
                    raise Exception(
                        "cannot perform speculative decoding without draft model"
                    )
                is_multimodal = req.image_input is not None

                if speculative_decode and is_multimodal:
                    raise Exception(
                        "Multimodal and speculative decoding is not currently supported"
                    )

                res_gen = self.decoder.decode(
                    req,
                    speculative_decoding=speculative_decode,
                    is_multimodal=is_multimodal)

                for res in res_gen:
                    triton_response = self.decoder.create_triton_response(res)
                    if self.decoupled:
                        response_sender.send(triton_response)
                    else:
                        responses.append(triton_response)

                if self.decoupled:
                    response_sender.send(
                        flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL)

            except Exception:
                self.logger.log_error(traceback.format_exc())
                # If encountering an error, send a response with err msg
                error_response = pb_utils.InferenceResponse(
                    output_tensors=[],
                    error=pb_utils.TritonError(traceback.format_exc()))

                if self.decoupled:
                    response_sender.send(error_response)
                    response_sender.send(
                        flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL)
                else:
                    responses.append(error_response)

            self.decoder.reset_decoder()

        if self.decoupled:
            return None
        else:
            assert len(responses) == len(requests)
            return responses

    def finalize(self):
        """`finalize` is called only once when the model is being unloaded.
        Implementing `finalize` function is optional. This function allows
        the model to perform any necessary clean ups before exit.
        """
        print('Tensorrt_llm_bls Cleaning up...')
```
Every model.py has to implement the three methods above: initialize, execute, and finalize. The dynamic behavior of BLS mode lives in execute: you write your own logic there, so that different conditions trigger different operations (or calls to different models).
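For reference, a minimal, hypothetical Python-backend model.py with the same three-method skeleton might look like the sketch below (a trivial "echo" model). The tensor names text_input/text_output are illustrative and must match whatever the model's config.pbtxt declares:

```python
# Minimal Python backend model.py sketch: echoes its string input back as output.
import json

import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:

    def initialize(self, args):
        # args['model_config'] is the serialized config.pbtxt for this model
        self.model_config = json.loads(args['model_config'])

    def execute(self, requests):
        responses = []
        for request in requests:
            # Read the input tensor declared in config.pbtxt
            text = pb_utils.get_input_tensor_by_name(request, "text_input").as_numpy()
            # Any custom routing/branching logic (the "dynamic" part of BLS)
            # would go here; this sketch simply echoes the input back.
            out = pb_utils.Tensor("text_output", text.astype(np.object_))
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses

    def finalize(self):
        print("echo model unloaded")
```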
With this brief introduction to config.pbtxt and model.py, you should be able to explore the corresponding files of the other backends on your own. In short, what deployment really comes down to is adjusting the configuration and code to your needs; the backends in triton_model_repo are not fixed to the set shown above, and you can add or remove models as required.
Articles in this series:
1. LLM Inference Framework Selection Survey
2. TensorRT-LLM & Triton Server Deployment Notes
3. vLLM Inference Engine Research
4. vLLM Inference Engine Performance Benchmarks
5. Notes on Issues Deploying LLMs with vLLM
6. Triton Inference Server Architecture