
Author: 兜兜零元    Time: 2024-6-11 11:12
Title: Triton Study Notes
Bilibili link: 合集·Triton 从入门到精通 (collection: "Triton from Beginner to Mastery")

Glossary:

scheduler: the task scheduler, i.e. the component that decides when queued requests are dispatched to a model instance for execution.

model instance, inference, and request: a model instance is one loaded copy of a model that can run in parallel with other copies; a request is a single client call; an inference is one execution of a model instance, serving one or (after batching) several requests.

batching: combining multiple requests into a single model execution to improve hardware utilization and throughput.


I. How the Triton Inference Server Works

1. Overview of Triton



2. Design Basics of Triton


3. Auxiliary Features of Triton


4. Additional resources


5. In-house Inference Server vs Triton


6. Hands-on Practice

6.1 Prepare the Model Repository
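
Before launching Triton, the models must be arranged in a model repository: one subdirectory per model, each with a config.pbtxt and numbered version subdirectories holding the model files. A minimal sketch, with placeholder model and file names:

model_repository/
└── resnet50_onnx/            # one directory per served model
    ├── config.pbtxt          # model configuration
    └── 1/                    # one subdirectory per model version
        └── model.onnx        # the model file itself

Triton is pointed at model_repository/ at startup and serves every model it finds there.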


6.2 Configure the Served Model
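
A sketch of a minimal config.pbtxt for the ONNX model above; the tensor names, dtype, and shapes here are illustrative assumptions and must match the actual model:

name: "resnet50_onnx"
backend: "onnxruntime"
max_batch_size: 8
input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]
  }
]
output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]
# One instance of the model on GPU.
instance_group [ { count: 1, kind: KIND_GPU } ]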


6.3 Launch Triton Server
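
Triton is usually launched from the NGC container, mounting the repository prepared above; a sketch, with the release tag and host path as placeholders:

docker run --gpus all --rm \
  -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v /path/to/model_repository:/models \
  nvcr.io/nvidia/tritonserver:<xx.yy>-py3 \
  tritonserver --model-repository=/models

Port 8000 serves HTTP, 8001 serves gRPC, and 8002 exposes Prometheus metrics.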


6.4 Configure an Ensemble Model
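
An ensemble is itself a model whose config declares platform: "ensemble" and wires the member models together through ensemble_scheduling steps. A sketch assuming two member models named preprocess and classifier:

name: "pipeline"
platform: "ensemble"
max_batch_size: 8
input [ { name: "IMAGE", data_type: TYPE_UINT8, dims: [ -1 ] } ]
output [ { name: "SCORES", data_type: TYPE_FP32, dims: [ 1000 ] } ]
ensemble_scheduling {
  step [
    {
      model_name: "preprocess"
      model_version: -1
      input_map { key: "INPUT_0" value: "IMAGE" }          # member input <- ensemble input
      output_map { key: "OUTPUT_0" value: "preprocessed" } # intermediate tensor
    },
    {
      model_name: "classifier"
      model_version: -1
      input_map { key: "INPUT_0" value: "preprocessed" }
      output_map { key: "OUTPUT_0" value: "SCORES" }       # member output -> ensemble output
    }
  ]
}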



6.5 Send Requests to Triton Server
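
A minimal Python client sketch using the tritonclient package over HTTP; the model and tensor names assume the config sketched in 6.2:

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Build the input tensor; name, shape, and dtype must match config.pbtxt.
data = np.random.rand(1, 3, 224, 224).astype(np.float32)
infer_input = httpclient.InferInput("input", list(data.shape), "FP32")
infer_input.set_data_from_numpy(data)

# Run inference and read back the requested output.
result = client.infer(
    "resnet50_onnx",
    inputs=[infer_input],
    outputs=[httpclient.InferRequestedOutput("output")])
print(result.as_numpy("output").shape)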


II. A Detailed Tutorial on Triton Backends

1. Overview

1.1 When should you implement a backend?

1.2 What must be implemented when writing a backend?



1.3 Why does Triton design backends this way?


2. Code Exploration

2.1 Caveats


2.2 How is a backend compiled and built once written?


3. Summary


III. Triton Python Backend & BLS Deep Dive

1. Python Backend (implementing custom backends in Python)

1.1 How Triton works



1.2 Why do we need the Python backend?


1.3 How it works



1.4 How to implement it?



import json
import os

import onnxruntime
import triton_python_backend_utils as pb_utils


# The class name must not be changed.
class TritonPythonModel:
    """Your Python model must use the same class name. Every Python model
    that is created must have "TritonPythonModel" as the class name.
    """

    def initialize(self, args):
        """`initialize` is called only once when the model is being loaded.
        Implementing `initialize` is optional. This function allows
        the model to initialize any state associated with this model.

        Parameters
        ----------
        args : dict
          Both keys and values are strings. The dictionary keys and values are:
          * model_config: A JSON string containing the model configuration
          * model_instance_kind: A string containing model instance kind
          * model_instance_device_id: A string containing model instance device ID
          * model_repository: Model repository path
          * model_version: Model version
          * model_name: Model name
        """
        # Parse the model config; it is passed in as an unparsed JSON string.
        self.model_config = model_config = json.loads(args['model_config'])

        # Get the configuration of the output tensor named "output".
        output_config = pb_utils.get_output_config_by_name(model_config, "output")

        # Convert the Triton type of the output to a numpy dtype.
        self.output_dtype = pb_utils.triton_string_to_numpy(output_config['data_type'])

        # Directory containing this model.py, inside the model repository.
        self.model_directory = os.path.dirname(os.path.realpath(__file__))

        # Build the path to model.onnx and create an onnxruntime session for inference.
        self.session = onnxruntime.InferenceSession(
            os.path.join(self.model_directory, 'model.onnx'))

        print('Initialized...')

    def execute(self, requests):
        """`execute` must be implemented in every Python model. It receives a
        list of pb_utils.InferenceRequest as its only argument and is called
        when inference is requested for this model.

        Parameters
        ----------
        requests : list
          A list of pb_utils.InferenceRequest (may contain one or more requests)

        Returns
        -------
        list
          A list of pb_utils.InferenceResponse. The length of this list must
          be the same as `requests`.
        """
        output_dtype = self.output_dtype

        responses = []

        # Every Python backend must iterate through the list of requests and
        # create an instance of pb_utils.InferenceResponse for each of them.
        # Avoid storing any of the input tensors in class attributes, as they
        # will be overridden in subsequent inference requests; make a copy of
        # the underlying NumPy array and store that if it is required.
        for request in requests:
            # Get the input (a Python backend tensor).
            input_tensor = pb_utils.get_input_tensor_by_name(request, "input")
            # Convert it to a numpy array so it can be fed to the ONNX model.
            input_array = input_tensor.as_numpy()

            # Run the onnxruntime session to get the prediction.
            prediction = self.session.run(None, {"input": input_array})
            # Cast the prediction to the output datatype defined in the config
            # file, then pack it as a Python backend output tensor.
            out_tensor = pb_utils.Tensor("output", prediction[0].astype(output_dtype))

            # Wrap the output tensor in the response for this request.
            inference_response = pb_utils.InferenceResponse(output_tensors=[out_tensor])
            responses.append(inference_response)

        # Return a list of pb_utils.InferenceResponse whose length matches
        # the length of `requests`.
        return responses

    def finalize(self):
        """`finalize` is called only once when the model is being unloaded.
        Implementing `finalize` is optional. This function allows the model
        to perform any necessary clean-ups before exit.
        """
        print('Cleaning up...')
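
For this Python model to load, its config.pbtxt must set backend: "python" and declare tensors named "input" and "output" to match the script; model.py and model.onnx sit together in the version directory. A sketch with illustrative dtype and shapes:

name: "onnx_python_wrapper"
backend: "python"
max_batch_size: 8
input [ { name: "input", data_type: TYPE_FP32, dims: [ 3, 224, 224 ] } ]
output [ { name: "output", data_type: TYPE_FP32, dims: [ 1000 ] } ]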


import json
import os

import torch
import triton_python_backend_utils as pb_utils
from torch.utils.dlpack import from_dlpack, to_dlpack


# The class name must not be changed.
class TritonPythonModel:
    """Your Python model must use the same class name. Every Python model
    that is created must have "TritonPythonModel" as the class name.
    """

    def initialize(self, args):
        """`initialize` is called only once when the model is being loaded.
        Implementing `initialize` is optional. This function allows
        the model to initialize any state associated with this model.

        Parameters
        ----------
        args : dict
          Both keys and values are strings. The dictionary keys and values are:
          * model_config: A JSON string containing the model configuration
          * model_instance_kind: A string containing model instance kind
          * model_instance_device_id: A string containing model instance device ID
          * model_repository: Model repository path
          * model_version: Model version
          * model_name: Model name
        """
        # Parse the model config; it is passed in as an unparsed JSON string.
        self.model_config = model_config = json.loads(args['model_config'])

        # Get the configuration of the output tensor named "output".
        output_config = pb_utils.get_output_config_by_name(model_config, "output")

        # Convert the Triton type of the output to a numpy dtype.
        self.output_dtype = pb_utils.triton_string_to_numpy(output_config['data_type'])

        # Directory containing this model.py, inside the model repository.
        self.model_directory = os.path.dirname(os.path.realpath(__file__))

        # Use the GPU if one is available, otherwise fall back to CPU.
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        print(self.device)

        # Load the TorchScript model and put it on the chosen device.
        model_path = os.path.join(self.model_directory, 'model.pt')
        if not os.path.exists(model_path):
            raise pb_utils.TritonModelException("Cannot find the pytorch model")
        self.model = torch.jit.load(model_path).to(self.device)

        print('Initialized...')

    def execute(self, requests):
        """`execute` must be implemented in every Python model. It receives a
        list of pb_utils.InferenceRequest as its only argument and is called
        when inference is requested for this model.

        Parameters
        ----------
        requests : list
          A list of pb_utils.InferenceRequest (may contain one or more requests)

        Returns
        -------
        list
          A list of pb_utils.InferenceResponse. The length of this list must
          be the same as `requests`.
        """
        responses = []

        # Every Python backend must iterate through the list of requests and
        # create an instance of pb_utils.InferenceResponse for each of them.
        for request in requests:
            # Get the input and convert the Triton tensor to a Torch tensor
            # via DLPack (zero-copy).
            input_tensor = pb_utils.get_input_tensor_by_name(request, "input")
            pytorch_tensor = from_dlpack(input_tensor.to_dlpack())

            # Reject oversized images with an error response.
            if pytorch_tensor.shape[2] > 1000 or pytorch_tensor.shape[3] > 1000:
                responses.append(pb_utils.InferenceResponse(
                    output_tensors=[],
                    error=pb_utils.TritonError(
                        "Image shape should not be larger than 1000")))
                continue

            # Move the tensor to the device and run the model to get the prediction.
            prediction = self.model(pytorch_tensor.to(self.device))
            # Transfer the GPU tensor to CPU if needed:
            # prediction = prediction.to('cpu')

            # Convert the Torch output tensor back to a Triton tensor via DLPack.
            out_tensor = pb_utils.Tensor.from_dlpack("output", to_dlpack(prediction))

            # Wrap the output tensor in the response for this request.
            inference_response = pb_utils.InferenceResponse(output_tensors=[out_tensor])
            responses.append(inference_response)

        # Return a list of pb_utils.InferenceResponse whose length matches
        # the length of `requests`.
        return responses

    def finalize(self):
        """`finalize` is called only once when the model is being unloaded.
        Implementing `finalize` is optional. This function allows the model
        to perform any necessary clean-ups before exit.
        """
        print('Cleaning up...')
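
By default Triton may hand the Python backend input tensors that already live in GPU memory, which is what makes the DLPack round trip above zero-copy. If a script needs plain CPU/numpy inputs instead, the model config can request them with the FORCE_CPU_ONLY_INPUT_TENSORS parameter; a sketch, at the cost of the zero-copy path:

parameters: {
  key: "FORCE_CPU_ONLY_INPUT_TENSORS"
  value: { string_value: "yes" }
}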


2. BLS: Business Logic Scripting

2.1 When to use it?



2.2 How to implement it?

# Excerpt: the execute() method of a BLS model. Assumes the same imports as
# the models above: torch, pb_utils, and from_dlpack / to_dlpack.
def execute(self, requests):
    responses = []
    input_tensors = []
    batch_sizes = []
    # Gather the INPUT Triton tensor of every request.
    for request in requests:
        in_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT")
        # Convert it to a Torch tensor via DLPack.
        pytorch_tensor = from_dlpack(in_tensor.to_dlpack())
        input_tensors.append(pytorch_tensor)
        batch_sizes.append(pytorch_tensor.shape[0])

    # Concatenate the input tensors of all requests into one batch.
    batch_input_tensor = torch.cat(input_tensors, dim=0).to('cuda')
    # The tensor name must match the input name (INPUT_0) declared in the
    # config of the model about to be called.
    batch_input = pb_utils.Tensor.from_dlpack("INPUT_0", to_dlpack(batch_input_tensor))

    # Build the inference request for the first BLS call, to preprocess.
    infer_request = pb_utils.InferenceRequest(
        model_name='preprocess',
        requested_output_names=["OUTPUT_0"],
        inputs=[batch_input])

    # First BLS call: preprocess.
    batch_preprocess_response = infer_request.exec()

    # Extract the output tensor from the response of the first call.
    batch_preprocess_output = pb_utils.get_output_tensor_by_name(
        batch_preprocess_response, 'OUTPUT_0')

    # Build the inference request for the second BLS call, to classifier.
    # The pb tensor name must match the name of the input tensor of the
    # model being called (model_name).
    infer_request = pb_utils.InferenceRequest(
        model_name='classifier',
        requested_output_names=["OUTPUT_0"],
        inputs=[pb_utils.Tensor.from_dlpack(
            'INPUT_0', batch_preprocess_output.to_dlpack())])

    # Second BLS call: classifier.
    batch_classifier_response = infer_request.exec()

    # Extract the classifier output tensor from the response of the second call.
    batch_classifier_output = pb_utils.get_output_tensor_by_name(
        batch_classifier_response, 'OUTPUT_0')
    # Convert the classifier output to a Torch tensor, shape [batch_size, 1000].
    batch_classifier_tensor = from_dlpack(batch_classifier_output.to_dlpack())
    # Get the category indices from the classifier output, shape [batch_size].
    batch_class_ids = torch.argmax(batch_classifier_tensor, dim=1)

    batch_seg_input = pb_utils.Tensor.from_dlpack(
        "input", batch_preprocess_output.to_dlpack())
    # Check whether the input images contain a cat or a dog.
    if 283 in batch_class_ids or 263 in batch_class_ids:
        # If they do, segment with deeplabv3_rn50: build the inference
        # request for the third BLS call.
        infer_request = pb_utils.InferenceRequest(
            model_name='deeplabv3_rn50',
            requested_output_names=["out", "aux"],  # names defined in its config
            inputs=[batch_seg_input])
        # Third BLS call: deeplabv3_rn50.
        batch_seg_response = infer_request.exec()
    else:
        # Otherwise, segment with fcn_resnet50.
        infer_request = pb_utils.InferenceRequest(
            model_name='fcn_resnet50',
            requested_output_names=["out", "aux"],
            inputs=[batch_seg_input])
        # Third BLS call: fcn_resnet50.
        batch_seg_response = infer_request.exec()

    # Post-processing: get the segmentation output tensor from the response.
    batch_seg_output = pb_utils.get_output_tensor_by_name(batch_seg_response, 'out')
    batch_seg_tensor = from_dlpack(batch_seg_output.to_dlpack())
    batch_seg_tensor = torch.softmax(batch_seg_tensor, dim=1)
    batch_seg_tensor = batch_seg_tensor * 255.0
    batch_seg_tensor = batch_seg_tensor.type(torch.uint8)

    # Mapping between classifier class ids and segmentation class ids.
    class_seg_id_map = {817: 7, 283: 8, 263: 12, 339: 13, 681: 0, 665: 14, 176: 13}
    batch_classes = batch_class_ids.cpu().detach().numpy()

    # Split the batched output back into one response per request.
    cursor = 0
    for i in range(len(requests)):
        batch_size = batch_sizes[i]
        batch_mask = []
        for j in range(cursor, cursor + batch_size):
            cls = class_seg_id_map[batch_classes[j]]
            print(cls)
            mask = batch_seg_tensor[j, cls, :, :]
            batch_mask.append(mask)
        batch_mask_tensor = torch.stack(batch_mask, dim=0)
        print(batch_mask_tensor.shape)
        batch_output = pb_utils.Tensor.from_dlpack('OUTPUT', to_dlpack(batch_mask_tensor))
        responses.append(pb_utils.InferenceResponse(output_tensors=[batch_output]))

        cursor += batch_size
    return responses
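
One detail the sketch above glosses over: each infer_request.exec() can fail, and a BLS model should check the response before unpacking it. A minimal guard using the pb_utils error API:

# After each BLS call, check for errors before reading output tensors.
if batch_preprocess_response.has_error():
    raise pb_utils.TritonModelException(
        batch_preprocess_response.error().message())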



2.3 How does it work?



IV. Triton Stateful Model

1. Application


2. Stateful Models


2.1 Sequence batcher



2.2 Control Inputs


2.3 Direct & Oldest
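
Putting 2.1-2.3 together: the sequence batcher, its scheduling strategy, and the control inputs are all declared in the stateful model's config.pbtxt. A sketch with illustrative values, using the same START/READY/CORRID names as the CTC practice below:

sequence_batching {
  max_sequence_idle_microseconds: 5000000
  # Strategy: Direct pins each sequence to a dedicated batch slot;
  # Oldest dynamically batches the oldest candidate sequences.
  oldest { max_candidate_sequences: 4 }
  control_input [
    {
      name: "START"
      control [ { kind: CONTROL_SEQUENCE_START, fp32_false_true: [ 0, 1 ] } ]
    },
    {
      name: "READY"
      control [ { kind: CONTROL_SEQUENCE_READY, fp32_false_true: [ 0, 1 ] } ]
    },
    {
      name: "CORRID"
      control [ { kind: CONTROL_SEQUENCE_CORRID, data_type: TYPE_UINT64 } ]
    }
  ]
}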


2.4 Streaming client
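
A streaming client talks to a stateful model over a gRPC stream and tags every request of one logical sequence with the same sequence_id; Triton derives the START/READY controls from the start/end flags. A sketch with a placeholder model name, tensor name, and dummy data:

import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient("localhost:8001")
# Responses arrive asynchronously on the stream callback.
client.start_stream(callback=lambda result, error: print(result or error))

chunks = [np.zeros((1, 16), dtype=np.uint8) for _ in range(3)]  # dummy chunks
for step, chunk in enumerate(chunks):
    inp = grpcclient.InferInput("INPUT", list(chunk.shape), "UINT8")
    inp.set_data_from_numpy(chunk)
    client.async_stream_infer(
        "ctc_decoder",                       # placeholder model name
        inputs=[inp],
        sequence_id=1001,                    # same id for the whole sequence
        sequence_start=(step == 0),
        sequence_end=(step == len(chunks) - 1))

client.stop_stream()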


3. Practice: CTC Streaming



import json

import numpy as np
# triton_python_backend_utils is available in every Triton Python model. You
# need to use this module to create inference requests and responses. It also
# contains some utility functions for extracting information from model config
# and converting Triton input/output types to numpy types.
import triton_python_backend_utils as pb_utils
from multiprocessing.pool import ThreadPool


class Decoder(object):
    def __init__(self, blank):
        self.prev = ''
        self.result = ''
        self.blank_symbol = blank

    def decode(self, input, start, ready):
        """CTC-style streaming decode: collapse repeated characters and drop
        the blank symbol, accumulating the result across calls.

        input: a list of characters / a string
        """
        if start:
            # A new sequence begins: reset the decoder state.
            self.prev = ''
            self.result = ''
        if ready:
            for li in input.decode("utf-8"):
                if li != self.prev:
                    if li != self.blank_symbol:
                        self.result += li
                self.prev = li
        r = np.array([[self.result]])
        return r


class TritonPythonModel:
    """Your Python model must use the same class name. Every Python model
    that is created must have "TritonPythonModel" as the class name.
    """

    def initialize(self, args):
        """`initialize` is called only once when the model is being loaded.
        Implementing `initialize` is optional. This function allows
        the model to initialize any state associated with this model.

        Parameters
        ----------
        args : dict
          Both keys and values are strings. The dictionary keys and values are:
          * model_config: A JSON string containing the model configuration
          * model_instance_kind: A string containing model instance kind
          * model_instance_device_id: A string containing model instance device ID
          * model_repository: Model repository path
          * model_version: Model version
          * model_name: Model name
        """
        # Parse the model config; it is passed in as an unparsed JSON string.
        self.model_config = model_config = json.loads(args['model_config'])

        # Get the max batch size.
        max_batch_size = max(model_config["max_batch_size"], 1)

        # Get the blank symbol from the config.
        blank = self.model_config.get("blank_id", '.')

        # Initialize one decoder per batch slot, max_batch_size in total.
        self.decoders = [Decoder(blank) for i in range(max_batch_size)]

        # Get the OUTPUT0 configuration and convert its Triton type to a
        # numpy type.
        output0_config = pb_utils.get_output_config_by_name(model_config, "OUTPUT0")
        self.output0_dtype = pb_utils.triton_string_to_numpy(output0_config['data_type'])

    def batch_decode(self, batch_input, batch_start, batch_ready):
        responses = []
        args = []
        idx = 0
        for i, r, s in zip(batch_input, batch_ready, batch_start):
            args.append([idx, i, r, s])
            idx += 1
        # Decode all slots of the batch in parallel.
        with ThreadPool() as p:
            responses = p.map(self.process_single_request, args)
        return responses

    def process_single_request(self, inp):
        decoder_idx, input, ready, start = inp
        # Each batch slot has its own decoder holding that sequence's state.
        response = self.decoders[decoder_idx].decode(input[0], start[0], ready[0])
        out_tensor_0 = pb_utils.Tensor("OUTPUT0", response.astype(self.output0_dtype))
        inference_response = pb_utils.InferenceResponse(output_tensors=[out_tensor_0])
        return inference_response

    def execute(self, requests):
        """`execute` must be implemented in every Python model. It receives a
        list of pb_utils.InferenceRequest as its only argument and is called
        when an inference request is made for this model. Depending on the
        batching configuration (e.g. dynamic batching), `requests` may contain
        multiple requests. Every Python model must create one
        pb_utils.InferenceResponse for every pb_utils.InferenceRequest in
        `requests`. If there is an error, you can set the error argument when
        creating a pb_utils.InferenceResponse.

        Parameters
        ----------
        requests : list
          A list of pb_utils.InferenceRequest

        Returns
        -------
        list
          A list of pb_utils.InferenceResponse. The length of this list must
          be the same as `requests`.
        """
        responses = []

        batch_input = []
        batch_ready = []
        batch_start = []
        batch_corrid = []
        # Gather the data tensor and the sequence-batcher control tensors
        # (START / READY / CORRID) from every request.
        for request in requests:
            in_0 = pb_utils.get_input_tensor_by_name(request, "INPUT")
            batch_input += in_0.as_numpy().tolist()

            in_start = pb_utils.get_input_tensor_by_name(request, "START")
            batch_start += in_start.as_numpy().tolist()

            in_ready = pb_utils.get_input_tensor_by_name(request, "READY")
            batch_ready += in_ready.as_numpy().tolist()

            in_corrid = pb_utils.get_input_tensor_by_name(request, "CORRID")
            batch_corrid += in_corrid.as_numpy().tolist()

        responses = self.batch_decode(batch_input, batch_start, batch_ready)
        # The returned list must match the length of `requests`.
        assert len(requests) == len(responses)
        return responses

    def finalize(self):
        """`finalize` is called only once when the model is being unloaded.
        Implementing `finalize` is optional. This function allows the model
        to perform any necessary clean-ups before exit.
        """
        print('cleaning up...')
4. Practice: WeNet


5. Summary



V. Triton Priority Management



Triton Priority Queue

How it works



How to use it?


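Priorities sent by the client only take effect if the model's dynamic batcher declares priority levels in config.pbtxt; in Triton, lower values mean higher priority, with 1 the highest. A sketch with illustrative values:

dynamic_batching {
  priority_levels: 2
  default_priority_level: 2
}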

# Excerpt from an async client; assumes `client`, `inputs`, `outputs`,
# `completion_callback`, `user_data`, `model_name` and `inference_count`
# are set up as in Triton's async-infer client examples, and that
# `partial` is imported from functools.

# Send n requests, alternating between priority 1 and priority 2.
for i in range(inference_count):
    if i % 2 == 0:
        priority = 1
    else:
        priority = 2

    client.async_infer(
        model_name,
        inputs,
        partial(completion_callback, user_data),
        request_id=str(i),
        model_version=str(1),
        # timeout=1000,
        outputs=outputs,
        priority=priority)

# Collect the completed requests and report the priority of each.
processed_count = 0
while processed_count < inference_count:
    (results, error) = user_data._completed_requests.get()
    processed_count += 1
    if error is not None:
        print("inference failed: " + str(error))
        # sys.exit(1)
        continue
    this_id = results.get_response().id
    if int(this_id) % 2 == 0:
        this_priority = 1
    else:
        this_priority = 2
    print("Request id {} with priority {} has been executed".format(
        this_id, this_priority))
Rate Limiter

How it works


How to use it?
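
Rate limiting is turned on server-wide with tritonserver --rate-limit=execution_count; each instance group then declares the resources one execution consumes and a priority, and Triton only schedules an execution when those resources are available. A config sketch with illustrative resource names and counts:

instance_group [
  {
    count: 2
    kind: KIND_GPU
    rate_limiter {
      resources [ { name: "R1", count: 4 } ]
      priority: 2
    }
  }
]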


Summary


