Reference for Efficient Deployment of Tsinghua University's Public YOLOv10 Object Detection Algorithm on the Horizon Bayes-Architecture Neural Network Acceleration Unit (BPU) (PTQ Approach)
Using RDK Ultra as an example, with a modified Head and a fast 13 ms Python post-processing routine
This article presents an approach for deploying YOLOv10 on the Horizon Bayes-architecture BPU. Taking the YOLOv10s - Detect object detection algorithm as an example, with 640×640 input resolution and COCO pre-trained weights for 80 classes, the Backbone and Neck are accelerated on the BPU with an inference time of about 15 ms, and the numpy-optimized post-processing takes about 13 ms. A 30 fps real-time object detection demo is built using multi-threaded inference plus Web streaming. All programs in this article are open source.
1. Introduction
What makes YOLOv10 interesting: it removes the NMS step and becomes NMS-free. In other words, for the 8400 bboxes, each with 80 class scores, whatever survives threshold filtering is already the final result; no NMS step is needed to remove duplicate detections of the same object.
Unfortunately, this part has nothing to do with the BPU and still has to be implemented on the CPU. That said, on the RDK Ultra, running NMS from 100 candidates down to 5 results only takes about 1 ms (cv2.dnn.NMSBoxes). Since threshold filtering already yields the final result, the post-processing workload shrinks further: a pure numpy vectorized post-processing implementation on the RDK Ultra takes only 13 ms. Accuracy has not been measured yet; in general, if objects are detected normally the accuracy will not be too bad.
Below is the description from the official Ultralytics documentation. Note the term one2one: when we modify the Head and export to ONNX, the one2one branch must be used.
YOLOv10, built on the Ultralytics Python package by researchers at Tsinghua University, introduces a new approach to real-time object detection, addressing both the post-processing and model architecture deficiencies found in previous YOLO versions. By eliminating non-maximum suppression (NMS) and optimizing various model components, YOLOv10 achieves state-of-the-art performance with significantly reduced computational overhead. Extensive experiments demonstrate its superior accuracy-latency trade-offs across multiple model scales.
On the Bernoulli2-architecture X3, the main blocker is a Softmax operator in the Backbone. On the Bayes-architecture Ultra, quantization for this operator can be specified manually, and with int16 quantization its accuracy drop is kept under control. So, unfortunately, the X3 currently cannot deploy the public YOLOv10 model very efficiently, although the public YOLOv10 can still be tried in non-real-time detection scenarios.
2. Post-Processing Optimization
2.1 Measured Data
Serial program data on RDK Ultra (8×A55@1.2GHz, 2×Bayes BPU@96TOPS): YOLOv10s with the original Backbone+Neck, 7.2M parameters, 640×640 resolution, 80 classes, single-core model, numpy+torch vectorized pure Python post-processing.
Note: the pre-processing here is implemented with OpenCV. In engineering applications nv12 data is usually used, so the RDK Ultra's hardware codec unit and VPS module can handle the pre-processing, while normalization and similar steps are already fused into the BPU computation.
| Stage | Latency |
| --- | --- |
| Pre-processing (CPU, Resize) | 4.38 ms |
| Inference (BPU, pyeasyDNN) | 14.30 ms |
| Post-processing (CPU, Python) | 12.96 ms |

RDK Ultra peak performance (multi-threaded, hrt_model_exec perf test tool):

| Test condition | Frame rate | BPU utilization (max 200%) | Average per-frame latency |
| --- | --- | --- | --- |
| 2 threads | 230 FPS | 158% | 8.5 ms |
| 8 threads | 291 FPS | 200% | 27 ms |

2.2 Detailed Walkthrough of the Optimized Post-Processing Flow
As shown in the figure below, the operators in the Backbone and Neck are all accelerated well by the Bayes-architecture BPU. In particular, the SoftMax operator in the Transformer block is supported on the BPU and, with int16 quantization, still keeps a cosine similarity of 0.99, which is very impressive.
The Head and the end2end part (as shown in the ONNX model of this article) cannot be accelerated well by the BPU, so they are taken out entirely and implemented on the CPU as post-processing. Also, since deployment only cares about the forward pass, there is no need to compute the full information for all 8400 Grid Cells. The main optimization idea is: filter first, then compute. The computation covers the Sigmoid of the Classify part, the DFL computation of the Bounding Box part (SoftMax regression plus a Conv that takes the expectation), and the feature decoding (dist2bbox, ltrb2xyxy).
Classify part: Dequantize operation
When compiling the model, all dequantize operators were removed, so the three Classify output heads must be dequantized manually in post-processing. The dequantization scales can be obtained in several ways: from the logs produced when running hb_mapper, or through the API of the BPU inference interface. See the community article: 反量化节点的融合实现 (horizon.cc).
Note that every channel along the C dimension has its own dequantization scale; each head has 80 dequantization scales, and they can be multiplied in directly using numpy broadcasting.
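A minimal sketch of this step, assuming the output order and variable names used by the serial program in Section 4 (the bin file name is that program's, not a fixed convention):
- # Sketch: read the per-channel scales once, then dequantize one Classify head
- # with a numpy broadcast multiply.
- import numpy as np
- from hobot_dnn import pyeasy_dnn as dnn
- quantize_model = dnn.load("./yolov10s_no_sigmoid.bin")
- s_clses_scale = quantize_model[0].outputs[3].properties.scale_data[:, np.newaxis]  # (80, 1)
- outputs = quantize_model[0].forward(np.zeros((1, 3, 640, 640), dtype=np.uint8))    # dummy frame
- s_clses = outputs[3].buffer.reshape(80, -1).astype(np.float32) * s_clses_scale     # (80, 6400) float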
Classify part: ReduceMax operation
ReduceMax finds the maximum value along one dimension of a tensor. It is used here to find, for each of the 8400 Grid Cells, the maximum of its 80 class scores; the operation works on the 80 class values of each Grid Cell, along the C dimension. Note that this step produces the maximum value itself, not the index of the maximum among the 80 values.
The Sigmoid activation is monotonic, so the ordering of the 80 scores before Sigmoid is the same as their ordering after Sigmoid.
$$Sigmoid(x)=\frac{1}{1+e^{-x}}$$

$$Sigmoid(x_1) > Sigmoid(x_2) \Leftrightarrow x_1 > x_2$$
Therefore, the position of the maximum value output directly by the bin model (after dequantization) is also the position of the final maximum score, and passing that maximum through Sigmoid gives exactly the maximum the original ONNX model would output.
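A tiny numpy check of this monotonicity argument (random placeholder logits, not model output):
- # Sketch: max/argmax can be taken on raw logits; Sigmoid is applied later,
- # only to the values that survive.
- import numpy as np
- logits = np.random.randn(80, 8400).astype(np.float32)
- max_logits = np.max(logits, axis=0)                       # ReduceMax over the C dimension
- assert np.array_equal(np.argmax(logits, axis=0),
-                       np.argmax(1 / (1 + np.exp(-logits)), axis=0))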
Classify part: Threshold (TopK) operation
This step finds, among the 8400 Grid Cells, the ones that satisfy the threshold. It operates on the 8400 Grid Cells, along the H and W dimensions. If you read my program you will notice that the H and W dimensions are flattened afterwards; this is only for convenience of program design and written expression and makes no essential difference.
Suppose the score of some class of some Grid Cell is denoted $x$, the value after the activation function is $y$, and threshold filtering uses a given threshold denoted $C$. Then the necessary and sufficient condition for this score to pass is:
$$y = Sigmoid(x) = \frac{1}{1+e^{-x}} > C$$
From this, an equivalent necessary and sufficient condition for the score to pass is:
$$x > -\ln\left(\frac{1}{C}-1\right)$$
This operation yields the indices of the Grid Cells that satisfy the condition, together with the maximum value of each such Grid Cell; applying Sigmoid to that maximum then gives the class score of that Grid Cell.
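A minimal sketch of this filtering trick (placeholder data; conf and the variable names follow the program in Section 4):
- # Sketch: move the threshold into logit space, filter there, and apply Sigmoid
- # only to the few survivors.
- import numpy as np
- conf = 0.3
- conf_inverse = -np.log(1 / conf - 1)                      # threshold in logit space
- max_logits = np.random.randn(8400).astype(np.float32)     # ReduceMax result (placeholder)
- valid_indices = np.flatnonzero(max_logits >= conf_inverse)
- scores = 1 / (1 + np.exp(-max_logits[valid_indices]))     # Sigmoid only on the survivors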
Classify part: GatherElements and ArgMax operations
Using the indices obtained from the Threshold (TopK) step, GatherElements picks out the qualifying Grid Cells, and ArgMax then determines which of the 80 classes has the largest value, giving the class of each qualifying Grid Cell.
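A quick sketch of these two steps (placeholder data; shapes follow the stride-8 head):
- # Sketch: gather the surviving Grid Cells, then ArgMax over the 80 classes.
- import numpy as np
- clses = np.random.randn(80, 6400).astype(np.float32)      # dequantized Classify head (placeholder)
- valid = np.flatnonzero(np.max(clses, axis=0) >= -0.847)   # indices from the Threshold step (conf=0.3)
- ids = np.argmax(clses[:, valid], axis=0)                  # class id of each surviving Grid Cell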
Bounding Box part: GatherElements and Dequantize operations
Using the indices from the Threshold (TopK) step, GatherElements picks out the qualifying Grid Cells. Here, too, every channel along the C dimension has its own dequantization scale; each head has 64 dequantization scales, which can be multiplied in directly via numpy broadcasting, giving 1×64×k×1 bbox information.
Bounding Box part: DFL, SoftMax+Conv operation
Each Grid Cell uses 4 numbers to determine the position of its box. For each edge, the DFL structure gives 16 estimates of the offset relative to the anchor position; a SoftMax is taken over the 16 estimates, and a convolution is then used to take the expectation. This is also the core of the Anchor-Free design: each Grid Cell is responsible for predicting only 1 Bounding Box. Suppose, for the prediction of one edge's offset, the 16 numbers are $l_p$ (and correspondingly $t_p$, $r_p$, $b_p$ for the other edges), where $p = 0,1,\dots,15$; then the offset is computed as:
$$\hat{l} = \sum_{p=0}^{15}\frac{p \cdot e^{l_p}}{S}, \quad S = \sum_{p=0}^{15}{e^{l_p}}$$
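A vectorized sketch of this expectation (placeholder data; equivalent to the weights_static trick used in the program of Section 4):
- # Sketch: DFL expectation over 16 bins per edge via SoftMax.
- import numpy as np
- from scipy.special import softmax
- bboxes = np.random.randn(64, 5).astype(np.float32)        # 4 edges x 16 bins, 5 surviving cells (placeholder)
- probs = softmax(bboxes.reshape(4, 16, -1), axis=1)        # SoftMax over the 16 bins of each edge
- bins = np.arange(16, dtype=np.float32)[np.newaxis, :, np.newaxis]
- ltrb = np.sum(probs * bins, axis=1)                       # expectation: (4, 5), i.e. l, t, r, b per cell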
Bounding Box part: Decode, dist2bbox (ltrb2xyxy) operation
This step decodes each Bounding Box from the ltrb description into the xyxy description. l, t, r and b are the distances of the left, top, right and bottom edges from the center of the Grid Cell. After converting these relative positions to absolute positions and multiplying by the sampling stride of the corresponding feature level, the xyxy coordinates are recovered; xyxy gives the predicted top-left and bottom-right corner coordinates of the Bounding Box.
The input image size is $Size=640$. For the $i$-th feature map of the Bounding Box prediction branch ($i=1,2,3$), the downsampling stride is $Stride(i)$. In YOLOv10s - Detect, $Stride(1)=8$, $Stride(2)=16$, $Stride(3)=32$, so the feature map sizes are $n_i = Size/Stride(i)$, i.e. $n_1=80$, $n_2=40$, $n_3=20$. The three feature maps have $n_1^2+n_2^2+n_3^2=8400$ Grid Cells in total, responsible for predicting 8400 Bounding Boxes.
On feature map $i$, the Grid Cell at position $(x, y)$ is responsible for predicting a Bounding Box of the corresponding scale, where $x, y \in [0, n_i) \cap \mathbb{Z}$ and $\mathbb{Z}$ is the set of integers. After the DFL structure, the Bounding Box is described as ltrb, whereas what we need is the xyxy description. The conversion is:
$$x_1 = (x+0.5-l)\times Stride(i)$$

$$y_1 = (y+0.5-t)\times Stride(i)$$

$$x_2 = (x+0.5+r)\times Stride(i)$$

$$y_2 = (y+0.5+b)\times Stride(i)$$
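A numpy sketch for the stride-8 feature map (placeholder ltrb values; the anchor construction matches s_anchor in the program of Section 4):
- # Sketch: dist2bbox / ltrb2xyxy for the 80x80 (stride-8) feature map.
- import numpy as np
- anchor = np.stack([np.tile(np.arange(0.5, 80.5), 80),
-                    np.repeat(np.arange(0.5, 80.5), 80)], axis=0)   # (2, 6400) grid-cell centres
- ltrb = np.abs(np.random.randn(4, 6400)).astype(np.float32)         # DFL output (placeholder)
- x1y1 = anchor - ltrb[0:2]                                          # top-left, in grid units
- x2y2 = anchor + ltrb[2:4]                                          # bottom-right, in grid units
- xyxy = np.vstack([x1y1, x2y2]).transpose(1, 0) * 8                 # (6400, 4) in input-image pixels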
With YOLOv8 or v9 there would be an NMS step to remove duplicate detections of the same object; YOLOv10 does not need it, and at this point we already have the final detection results: class (id), score and position (xyxy).
3. Step-by-Step Reference
Note: for any error like No such file or directory, No module named "xxx" or command not found, please check carefully. Do not just copy and run the commands line by line; if you do not understand the modification process, please go to the Horizon developer community and start from YOLOv5.
Download the THU-MIG/yolov10 repository and set up the environment following the official YOLOv8 documentation.
- $ git clone https://github.com/THU-MIG/yolov10.git
Enter the local repository and download the official pre-trained weights, here using the YOLOv10s-Detect model with 7.2M parameters as the example.
- $ cd yolov10
- $ wget https://github.com/THU-MIG/yolov10/releases/download/v1.1/yolov10s.pt
Uninstall the yolo-related command line packages, so that modifying the ./ultralytics/ultralytics directory directly takes effect.
- $ conda list | grep ultralytics
- $ pip list | grep ultralytics # or
- # if it exists, uninstall it
- $ conda uninstall ultralytics
- $ pip uninstall ultralytics # or
Modify the Detect output head so that the Bounding Box information and the Classify information of the three feature levels are output separately, giving 6 output heads in total.
File path: ./ultralytics/ultralytics/nn/modules/head.py, around line 510; replace the forward method of the v10Detect class with the following:
- def forward(self, x):
- bbox = []
- cls = []
- for i in range(self.nl):
- bbox.append(self.one2one_cv2[i](x[i]))
- cls.append(self.one2one_cv3[i](x[i]))
- return (bbox, cls)
Run the following Python script; if you get a No module named onnxsim error, just install that package.
- from ultralytics import YOLO
- YOLO('yolov10s.pt').export(format='onnx', simplify=True, opset=11)
Referring to the Tiangong Kaiwu toolchain manual and the examples in the OE package, check the model; once all operators run on the BPU, compile it:
- (bpu) $ hb_mapper checker --model-type onnx --march bayes --model yolov10s.onnx
- (bpu) $ hb_mapper makertbin --model-type onnx --config ./yolov10s.yaml
The yaml file used for hb_mapper makertbin; the Softmax operator name in node_info must be adjusted according to the checker output:
- model_parameters:
- onnx_model: './yolov10s.onnx'
- march: "bayes"
- layer_out_dump: False
- working_dir: 'yolov10s'
- output_model_file_prefix: 'yolov10s'
- remove_node_type: "Dequantize;" # remove all dequantize nodes
- node_info: { # run the Softmax operator stuck in the middle on the BPU, with int16 quantization
- "/model.10/attn/Softmax": {
- 'ON': 'BPU',
- 'InputType': 'int16',
- 'OutputType': 'int16'
- }
- }
- input_parameters:
- input_name: ""
- input_type_rt: 'rgb'
- input_layout_rt: 'NCHW'
- input_type_train: 'rgb'
- input_layout_train: 'NCHW'
- input_shape: ''
- norm_type: 'data_scale'
- mean_value: ''
- scale_value: 0.003921568627451
- calibration_parameters:
- cal_data_dir: './calibration_data_rgb_f32'
- cal_data_type: 'float32'
- compiler_parameters:
- compile_mode: 'latency'
- debug: False
- optimize_level: 'O3'
Copy the compiled bin model to the board and measure its performance with the hrt_model_exec tool; thread_num can be adjusted to explore the best number of threads.
- hrt_model_exec perf --model_file yolov10s.bin \
- --model_name="" \
- --core_id=0 \
- --frame_count=200 \
- --perf_time=0 \
- --thread_num=1 \
- --profile_path="."
4. Serial Deployment Program
When using the following program, remember to change the image and model file paths; install any missing packages with pip install.
- # Copyright (c) 2024,WuChao D-Robotics.
- #
- # Licensed under the Apache License, Version 2.0 (the "License");
- # you may not use this file except in compliance with the License.
- # You may obtain a copy of the License at
- #
- # http://www.apache.org/licenses/LICENSE-2.0
- #
- # Unless required by applicable law or agreed to in writing, software
- # distributed under the License is distributed on an "AS IS" BASIS,
- # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- # See the License for the specific language governing permissions and
- # limitations under the License.
- import cv2
- import numpy as np
- from scipy.special import softmax
- from time import time
- from hobot_dnn import pyeasy_dnn as dnn
- img_path = "kite.jpg"
- result_save_path = "kite.result.jpg"
- quantize_model_path = "./yolov10s_no_sigmoid.bin"
- input_image_size = 640
- conf=0.3
- conf_inverse = -np.log(1/conf - 1)
- print("sigmoid_inverse threshol = %.2f"%conf_inverse)
- # Some constants and helper functions
- coco_names = [
- "person", "bicycle", "car", "motorcycle", "airplane",
- "bus", "train", "truck", "boat", "traffic light",
- "fire hydrant", "stop sign", "parking meter", "bench", "bird",
- "cat", "dog", "horse", "sheep", "cow",
- "elephant", "bear", "zebra", "giraffe", "backpack",
- "umbrella", "handbag", "tie", "suitcase", "frisbee",
- "skis", "snowboard", "sports ball", "kite", "baseball bat",
- "baseball glove", "skateboard", "surfboard", "tennis racket", "bottle",
- "wine glass", "cup", "fork", "knife", "spoon",
- "bowl", "banana", "apple", "sandwich", "orange",
- "broccoli", "carrot", "hot dog", "pizza", "donut",
- "cake", "chair", "couch", "potted plant", "bed",
- "dining table", "toilet", "tv", "laptop", "mouse",
- "remote", "keyboard", "cell phone", "microwave", "oven",
- "toaster", "sink", "refrigerator", "book", "clock",
- "vase", "scissors", "teddy bear", "hair drier", "toothbrush"
- ]
- yolo_colors = [
- (56, 56, 255), (151, 157, 255), (31, 112, 255), (29, 178, 255),
- (49, 210, 207), (10, 249, 72), (23, 204, 146), (134, 219, 61),
- (52, 147, 26), (187, 212, 0), (168, 153, 44), (255, 194, 0),
- (147, 69, 52), (255, 115, 100), (236, 24, 0), (255, 56, 132),
- (133, 0, 82), (255, 56, 203), (200, 149, 255), (199, 55, 255)]
- def draw_detection(img, box, score, class_id):
- x1, y1, x2, y2 = box
- color = yolo_colors[class_id%20]
- cv2.rectangle(img, (x1, y1), (x2, y2), color, 2)
- label = f"{coco_names[class_id]}: {score:.2f}"
- (label_width, label_height), _ = cv2.getTextSize(label, cv2.FONT_HERSHEY_SIMPLEX, 0.5, 1)
- label_x = x1
- label_y = y1 - 10 if y1 - 10 > label_height else y1 + 10
- # Draw a filled rectangle as the background for the label text
- cv2.rectangle(
- img, (label_x, label_y - label_height), (label_x + label_width, label_y + label_height), color, cv2.FILLED
- )
- # Draw the label text on the image
- cv2.putText(img, label, (label_x, label_y), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 0, 0), 1, cv2.LINE_AA)
- # Load the horizon quantized model and print its input/output tensor info
- begin_time = time()
- quantize_model = dnn.load(quantize_model_path)
- print("\033[0;31;40m" + "Load horizon quantize model time = %.2f ms"%(1000*(time() - begin_time)) + "\033[0m")
- print("-> input tensors")
- for i, quantize_input in enumerate(quantize_model[0].inputs):
- print(f"intput[{i}], name={quantize_input.name}, type={quantize_input.properties.dtype}, shape={quantize_input.properties.shape}")
- print("-> output tensors")
- for i, quantize_input in enumerate(quantize_model[0].outputs):
- print(f"output[{i}], name={quantize_input.name}, type={quantize_input.properties.dtype}, shape={quantize_input.properties.shape}")
- # Prepare some constants
- # Prepare the dequantization scales ahead of time
- s_bboxes_scale = quantize_model[0].outputs[0].properties.scale_data[:,np.newaxis]
- m_bboxes_scale = quantize_model[0].outputs[1].properties.scale_data[:,np.newaxis]
- l_bboxes_scale = quantize_model[0].outputs[2].properties.scale_data[:,np.newaxis]
- s_clses_scale = quantize_model[0].outputs[3].properties.scale_data[:, np.newaxis]
- m_clses_scale = quantize_model[0].outputs[4].properties.scale_data[:, np.newaxis]
- l_clses_scale = quantize_model[0].outputs[5].properties.scale_data[:, np.newaxis]
- # DFL expectation weights, generated only once
- weights_static = np.array([i for i in range(16)]).astype(np.float32)[np.newaxis, :, np.newaxis]
- # Pre-generate some indices, only once
- static_index = np.arange(8400)
- # anchors, generated only once
- s_anchor = np.stack([np.tile(np.linspace(0.5, 79.5, 80), reps=80),
- np.repeat(np.arange(0.5, 80.5, 1), 80)], axis=0)
- m_anchor = np.stack([np.tile(np.linspace(0.5, 39.5, 40), reps=40),
- np.repeat(np.arange(0.5, 40.5, 1), 40)], axis=0)
- l_anchor = np.stack([np.tile(np.linspace(0.5, 19.5, 20), reps=20),
- np.repeat(np.arange(0.5, 20.5, 1), 20)], axis=0)
- # Read the image and pre-process it with a resize
- begin_time = time()
- img = cv2.imread(img_path)
- print("\033[0;31;40m" + "cv2.imread time = %.2f ms"%(1000*(time() - begin_time)) + "\033[0m")
- begin_time = time()
- input_tensor = cv2.resize(img, (input_image_size, input_image_size), interpolation=cv2.INTER_NEAREST)
- input_tensor = cv2.cvtColor(input_tensor, cv2.COLOR_BGR2RGB)
- # input_tensor = np.array(input_tensor) / 255.0
- input_tensor = np.transpose(input_tensor, (2, 0, 1))
- input_tensor = np.expand_dims(input_tensor, axis=0)# .astype(np.float32) # NCHW
- print("\033[0;31;40m" + "Pre Process time = %.2f ms"%(1000*(time() - begin_time)) + "\033[0m")
- print(f"{input_tensor.shape = }")
- img_h, img_w = img.shape[0:2]
- y_scale, x_scale = img_h/input_image_size, img_w/input_image_size
- # Inference
- begin_time = time()
- quantize_outputs = quantize_model[0].forward(input_tensor)
- print("\033[0;31;40m" + "BPU Forward time = %.2f ms"%(1000*(time() - begin_time)) + "\033[0m")
- begin_time = time()
- # bbox: to numpy, reshape
- s_bboxes = quantize_outputs[0].buffer.reshape(64, -1) # (64,6400)
- m_bboxes = quantize_outputs[1].buffer.reshape(64, -1) # (64,1600)
- l_bboxes = quantize_outputs[2].buffer.reshape(64, -1) # (64,400)
- # classify: to numpy, reshape, dequantize
- s_clses = quantize_outputs[3].buffer.reshape(80, -1).astype(np.float32) * s_clses_scale # (80,6400)
- m_clses = quantize_outputs[4].buffer.reshape(80, -1).astype(np.float32) * m_clses_scale # (80,1600)
- l_clses = quantize_outputs[5].buffer.reshape(80, -1).astype(np.float32) * l_clses_scale # (80,400)
- # classify: threshold filtering with vectorized numpy operations (optimized v2.0)
- s_max_scores = np.max(s_clses, axis=0)
- #s_valid_indices = np.where(s_max_scores >= conf_inverse)
- s_valid_indices = np.flatnonzero(s_max_scores >= conf_inverse) # indices of scores above the threshold (values are still raw logits here)
- s_ids = np.argmax(s_clses[:,s_valid_indices], axis=0)
- s_scores = s_max_scores[s_valid_indices]
- m_max_scores = np.max(m_clses, axis=0)
- #m_valid_indices = np.where(m_max_scores >= conf_inverse)
- m_valid_indices = np.flatnonzero(m_max_scores >= conf_inverse) # indices of scores above the threshold (values are still raw logits here)
- m_ids = np.argmax(m_clses[:,m_valid_indices], axis=0)
- m_scores = m_max_scores[m_valid_indices]
- l_max_scores = np.max(l_clses, axis=0)
- #l_valid_indices = np.where(l_max_scores >= conf_inverse)
- l_valid_indices = np.flatnonzero(l_max_scores >= conf_inverse) # indices of scores above the threshold (values are still raw logits here)
- l_ids = np.argmax(l_clses[:,l_valid_indices], axis=0)
- l_scores = l_max_scores[l_valid_indices]
- # 3 Classify branches: Sigmoid
- s_scores = 1 / (1 + np.exp(-s_scores))
- m_scores = 1 / (1 + np.exp(-m_scores))
- l_scores = 1 / (1 + np.exp(-l_scores))
- # 3 Bounding Box branches: dequantize
- s_bboxes_float32 = s_bboxes[:,s_valid_indices].astype(np.float32) * s_bboxes_scale
- m_bboxes_float32 = m_bboxes[:,m_valid_indices].astype(np.float32) * m_bboxes_scale
- l_bboxes_float32 = l_bboxes[:,l_valid_indices].astype(np.float32) * l_bboxes_scale
- # 3 Bounding Box branches: dist2bbox (ltrb2xyxy)
- s_ltrb_indices = np.sum(softmax(s_bboxes_float32.reshape(4, 16,-1), axis=1) * weights_static, axis=1)
- s_anchor_indices = s_anchor[:,s_valid_indices]
- s_x1y1 = s_anchor_indices - s_ltrb_indices[0:2]
- s_x2y2 = s_anchor_indices + s_ltrb_indices[2:4]
- s_dbboxes = np.vstack([s_x1y1, s_x2y2]).transpose(1,0)*8
- m_ltrb_indices = np.sum(softmax(m_bboxes_float32.reshape(4, 16,-1), axis=1) * weights_static, axis=1)
- m_anchor_indices = m_anchor[:,m_valid_indices]
- m_x1y1 = m_anchor_indices - m_ltrb_indices[0:2]
- m_x2y2 = m_anchor_indices + m_ltrb_indices[2:4]
- m_dbboxes = np.vstack([m_x1y1, m_x2y2]).transpose(1,0)*16
- l_ltrb_indices = np.sum(softmax(l_bboxes_float32.reshape(4, 16,-1), axis=1) * weights_static, axis=1)
- l_anchor_indices = l_anchor[:,l_valid_indices]
- l_x1y1 = l_anchor_indices - l_ltrb_indices[0:2]
- l_x2y2 = l_anchor_indices + l_ltrb_indices[2:4]
- l_dbboxes = np.vstack([l_x1y1, l_x2y2]).transpose(1,0)*32
- # Concatenate the filtered results of the small/medium/large feature levels
- dbboxes = np.concatenate((s_dbboxes, m_dbboxes, l_dbboxes), axis=0)
- scores = np.concatenate((s_scores, m_scores, l_scores), axis=0)
- ids = np.concatenate((s_ids, m_ids, l_ids), axis=0)
- print("\033[0;31;40m" + "Post Process time = %.2f ms"%(1000*(time() - begin_time)) + "\033[0m")
- # Draw
- begin_time = time()
- for score, class_id, xyxy in zip(scores, ids, dbboxes):
- x1, y1, x2, y2 = xyxy
- x1, y1, x2, y2 = int(x1*x_scale), int(y1*y_scale), int(x2*x_scale), int(y2*y_scale)
- print("(%d, %d, %d, %d) -> %s: %.2f"%(x1,y1,x2,y2, coco_names[class_id], score))
- draw_detection(img, (x1, y1, x2, y2), score, class_id)
- print("\033[0;31;40m" + "Draw Result time = %.2f ms"%(1000*(time() - begin_time)) + "\033[0m")
- # Save the image locally
- begin_time = time()
- cv2.imwrite(result_save_path, img)
- print("\033[0;31;40m" + "cv2.imwrite time = %.2f ms"%(1000*(time() - begin_time)) + "\033[0m")
5. Parallel Deployment Program
5.1 Serial vs. Parallel Program Design
Through heterogeneous CPU+BPU computing we already have a reasonably efficient program that performs the end-to-end object detection task, and in real engineering applications the input and output devices or targets can be chosen flexibly to satisfy on-site requirements.
To process a real-time video stream, a classic approach is the serial program design: inside a while loop, the whole end-to-end pipeline is executed step by step. The advantage is that the program stays simple and intuitive. However, because of the heterogeneous computing nature of the RDK Ultra platform: while reading or saving images, the CPU is doing IO and waiting for data to move into memory, so neither the CPU nor the BPU is working; during pre-processing, post-processing and rendering, the CPU is working, but not all 8 cores are busy, and the BPU is idle; during inference, the BPU is working, but not both of its cores are busy, and the CPU is idle. For a heterogeneous platform with a multi-core CPU and BPU, a serial program does not make good use of the available compute, so a parallel program design is needed.
Python does not carry the computation itself; it only calls the cv2, BPU, torch and numpy interfaces, so multi-threading can be used for real-time video stream detection without being limited by Python's global interpreter lock (GIL). One dedicated thread handles image input and another handles image output; these two threads only occupy the CPU at the moment of each input or output, and while waiting they are put to sleep by the operating system. A thread pool reads images from the input queue, runs the end-to-end inference, and puts the results into the output queue. In the actual program design I used 4 threads for end-to-end inference; this scheduling is also handled by the operating system and makes full use of the CPU and BPU compute.
5.2 Web Streaming Result
[Horizon Bayes-architecture BPU running YOLOv10s at a steady 30 fps, RDK Ultra board] https://www.bilibili.com/video/BV1X7421d75q
5.3 Streaming Program
Note: the file layout is as follows:
- .
- ├── templates
- │ └── index.html
- └── YOLOv10_HorizonRT_wucPostprocess_Web.py
index.html
- <html lang="zh-CN">
- <head>
- <meta charset="UTF-8">
- <title>YOLOv10s实时视频推流 演示</title>
- </head>
- <body>
- <h1>
- <font color="#FF0000">YOLOv10s - Detect</font> 实时视频推流 演示
- </h1>
- <h2>实时视频推流</h2>
- <img src="{{ url_for('video_feed') }}" alt="实时视频" width="640" height="480">
- <h2>RDK Ultra 开发板 参数: 8×A55@1.2Ghz, 2×Bayes BPU@96TOPS</h2>
- <h2>YOLOv10s - Detect, 640×640, COCO2017数据集, 80类别</h2>
- </body>
- </html>
YOLOv10_HorizonRT_wucPostprocess_Web.py
- #!/user/bin/env python
- # Copyright (c) 2024,WuChao D-Robotics.
- #
- # Licensed under the Apache License, Version 2.0 (the "License");
- # you may not use this file except in compliance with the License.
- # You may obtain a copy of the License at
- #
- # http://www.apache.org/licenses/LICENSE-2.0
- #
- # Unless required by applicable law or agreed to in writing, software
- # distributed under the License is distributed on an "AS IS" BASIS,
- # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- # See the License for the specific language governing permissions and
- # limitations under the License.
- import cv2, argparse, sys
- import numpy as np
- from threading import Thread, Lock
- from queue import Queue
- from scipy.special import softmax
- from time import time, sleep
- from hobot_dnn import pyeasy_dnn as dnn
- from flask import Flask, render_template, Response
- def main():
- # Thread count configuration
- n2 = 2 # number of video-frame inference threads
- # Global variables used to control the threads
- global is_loading, is_forwarding, is_writing
- is_loading, is_forwarding, is_writing = True, True, True
- # Inference instance
- model = YOLOv10_Detect()
- # Task queues
- global task_queue, save_queue
- task_queue = Queue(30)
- save_queue = Queue(30) # give the result queue a bit more buffering
- sleep(1)
- # Create and start the reader thread
- video_path = "/dev/video0"
- task_loader = Dataloader_videoCapture(video_path, task_queue, 0.005)
- task_loader.start()
- # Create and start the inference threads
- inference_threads = [InferenceThread(_, model, task_queue, save_queue, 0.005) for _ in range(n2)]
- for t in inference_threads:
- t.start()
- # Create and start the log-printing thread
- result_writer = msg_printer(task_queue, save_queue, 0.5)
- result_writer.start()
- app.run(debug=False, port=7998, host="0.0.0.0")
- print("[INFO] wait_join")
- task_loader.join()
- for t in inference_threads:
- t.join()
- result_writer.join()
- result_writer.join()
- print("[INFO] All task done.")
- exit()
- class YOLOv10_Detect():
- def __init__(self):
- quantize_model_path = "./yolov10s_no_sigmoid.bin"
- self.input_image_size = 640
- self.conf=0.3
- self.conf_inverse = -np.log(1/self.conf - 1)
- print("sigmoid_inverse threshol = %.2f"%self.conf_inverse)
- # Load the horizon quantized model and print its input/output tensor info
- begin_time = time()
- self.quantize_model = dnn.load(quantize_model_path)
- print("\033[0;31;40m" + "Load horizon quantize model time = %.2f ms"%(1000*(time() - begin_time)) + "\033[0m")
- print("-> input tensors")
- for i, quantize_input in enumerate(self.quantize_model[0].inputs):
- print(f"intput[{i}], name={quantize_input.name}, type={quantize_input.properties.dtype}, shape={quantize_input.properties.shape}")
- print("-> output tensors")
- for i, quantize_input in enumerate(self.quantize_model[0].outputs):
- print(f"output[{i}], name={quantize_input.name}, type={quantize_input.properties.dtype}, shape={quantize_input.properties.shape}")
- # Prepare some constants
- # Prepare the dequantization scales ahead of time
- self.s_bboxes_scale = self.quantize_model[0].outputs[0].properties.scale_data[:,np.newaxis]
- self.m_bboxes_scale = self.quantize_model[0].outputs[1].properties.scale_data[:,np.newaxis]
- self.l_bboxes_scale = self.quantize_model[0].outputs[2].properties.scale_data[:,np.newaxis]
- self.s_clses_scale = self.quantize_model[0].outputs[3].properties.scale_data[:, np.newaxis]
- self.m_clses_scale = self.quantize_model[0].outputs[4].properties.scale_data[:, np.newaxis]
- self.l_clses_scale = self.quantize_model[0].outputs[5].properties.scale_data[:, np.newaxis]
- # DFL expectation weights, generated only once
- self.weights_static = np.array([i for i in range(16)]).astype(np.float32)[np.newaxis, :, np.newaxis]
- # Pre-generate some indices, only once
- self.static_index = np.arange(8400)
- # anchors, generated only once
- self.s_anchor = np.stack([np.tile(np.linspace(0.5, 79.5, 80), reps=80),
- np.repeat(np.arange(0.5, 80.5, 1), 80)], axis=0)
- self.m_anchor = np.stack([np.tile(np.linspace(0.5, 39.5, 40), reps=40),
- np.repeat(np.arange(0.5, 40.5, 1), 40)], axis=0)
- self.l_anchor = np.stack([np.tile(np.linspace(0.5, 19.5, 20), reps=20),
- np.repeat(np.arange(0.5, 20.5, 1), 20)], axis=0)
- def forward(self, input_tensor):
- return self.quantize_model[0].forward(input_tensor)
- def preprocess(self, img):
- self.img = img
- input_tensor = cv2.resize(img, (self.input_image_size, self.input_image_size), interpolation=cv2.INTER_NEAREST)
- input_tensor = cv2.cvtColor(input_tensor, cv2.COLOR_BGR2RGB)
- # input_tensor = np.array(input_tensor) / 255.0
- input_tensor = np.transpose(input_tensor, (2, 0, 1))
- input_tensor = np.expand_dims(input_tensor, axis=0)# .astype(np.float32) # NCHW
- img_h, img_w = img.shape[0:2]
- self.y_scale, self.x_scale = img_h/self.input_image_size, img_w/self.input_image_size
- return input_tensor
-
- def postprocess(self, quantize_outputs):
- # bbox: to numpy, reshape
- s_bboxes = quantize_outputs[0].buffer.reshape(64, -1) # (64,6400)
- m_bboxes = quantize_outputs[1].buffer.reshape(64, -1) # (64,1600)
- l_bboxes = quantize_outputs[2].buffer.reshape(64, -1) # (64,400)
- # classify: to numpy, reshape, dequantize
- s_clses = quantize_outputs[3].buffer.reshape(80, -1).astype(np.float32) * self.s_clses_scale # (80,6400)
- m_clses = quantize_outputs[4].buffer.reshape(80, -1).astype(np.float32) * self.m_clses_scale # (80,1600)
- l_clses = quantize_outputs[5].buffer.reshape(80, -1).astype(np.float32) * self.l_clses_scale # (80,400)
- # classify: threshold filtering with vectorized numpy operations (optimized v2.0)
- s_max_scores = np.max(s_clses, axis=0)
- #s_valid_indices = np.where(s_max_scores >= conf_inverse)
- s_valid_indices = np.flatnonzero(s_max_scores >= self.conf_inverse)
- s_ids = np.argmax(s_clses[:,s_valid_indices], axis=0)
- s_scores = s_max_scores[s_valid_indices]
- m_max_scores = np.max(m_clses, axis=0)
- #m_valid_indices = np.where(m_max_scores >= conf_inverse)
- m_valid_indices = np.flatnonzero(m_max_scores >= self.conf_inverse)
- m_ids = np.argmax(m_clses[:,m_valid_indices], axis=0)
- m_scores = m_max_scores[m_valid_indices]
- l_max_scores = np.max(l_clses, axis=0)
- #l_valid_indices = np.where(l_max_scores >= conf_inverse)
- l_valid_indices = np.flatnonzero(l_max_scores >= self.conf_inverse)
- l_ids = np.argmax(l_clses[:,l_valid_indices], axis=0)
- l_scores = l_max_scores[l_valid_indices]
- # 3 Classify branches: Sigmoid
- s_scores = 1 / (1 + np.exp(-s_scores))
- m_scores = 1 / (1 + np.exp(-m_scores))
- l_scores = 1 / (1 + np.exp(-l_scores))
- # 3 Bounding Box branches: dequantize
- s_bboxes_float32 = s_bboxes[:,s_valid_indices].astype(np.float32) * self.s_bboxes_scale
- m_bboxes_float32 = m_bboxes[:,m_valid_indices].astype(np.float32) * self.m_bboxes_scale
- l_bboxes_float32 = l_bboxes[:,l_valid_indices].astype(np.float32) * self.l_bboxes_scale
- # 3 Bounding Box branches: dist2bbox (ltrb2xyxy)
- s_ltrb_indices = np.sum(softmax(s_bboxes_float32.reshape(4, 16,-1), axis=1) * self.weights_static, axis=1)
- s_anchor_indices = self.s_anchor[:,s_valid_indices]
- s_x1y1 = s_anchor_indices - s_ltrb_indices[0:2]
- s_x2y2 = s_anchor_indices + s_ltrb_indices[2:4]
- s_dbboxes = np.vstack([s_x1y1, s_x2y2]).transpose(1,0)*8
- m_ltrb_indices = np.sum(softmax(m_bboxes_float32.reshape(4, 16,-1), axis=1) * self.weights_static, axis=1)
- m_anchor_indices = self.m_anchor[:,m_valid_indices]
- m_x1y1 = m_anchor_indices - m_ltrb_indices[0:2]
- m_x2y2 = m_anchor_indices + m_ltrb_indices[2:4]
- m_dbboxes = np.vstack([m_x1y1, m_x2y2]).transpose(1,0)*16
- l_ltrb_indices = np.sum(softmax(l_bboxes_float32.reshape(4, 16,-1), axis=1) * self.weights_static, axis=1)
- l_anchor_indices = self.l_anchor[:,l_valid_indices]
- l_x1y1 = l_anchor_indices - l_ltrb_indices[0:2]
- l_x2y2 = l_anchor_indices + l_ltrb_indices[2:4]
- l_dbboxes = np.vstack([l_x1y1, l_x2y2]).transpose(1,0)*32
- # Concatenate the filtered results of the small/medium/large feature levels
- dbboxes = np.concatenate((s_dbboxes, m_dbboxes, l_dbboxes), axis=0)
- scores = np.concatenate((s_scores, m_scores, l_scores), axis=0)
- ids = np.concatenate((s_ids, m_ids, l_ids), axis=0)
- # Draw
- for score, class_id, xyxy in zip(scores, ids, dbboxes):
- x1, y1, x2, y2 = xyxy
- x1, y1, x2, y2 = int(x1*self.x_scale), int(y1*self.y_scale), int(x2*self.x_scale), int(y2*self.y_scale)
- # print("(%d, %d, %d, %d) -> %s: %.2f"%(x1,y1,x2,y2, coco_names[class_id], score))
- draw_detection(self.img, (x1, y1, x2, y2), score, class_id)
- return self.img
- def signal_handler(signal, frame):
- global is_loading, is_forwarding, is_writing
- is_loading, is_forwarding, is_writing = False, False, False
- print('Ctrl+C, stopping!!!')
- sys.exit(0)
- class Dataloader_videoCapture(Thread):
- # Read frames from the capture until no more frames are available
- # delay_time controls the read frequency; it should roughly match the frame interval at peak FPS, typically 0.033 s
- def __init__(self, video_path, task_queue, delay_time):
- Thread.__init__(self)
- self.cap = cv2.VideoCapture(video_path)
- self.task_queue = task_queue
- self.delay_time = delay_time
- def run(self):
- global is_loading
- while is_loading:
- if not self.task_queue.full():
- # begin_time = time()
- ret, frame = self.cap.read()
- if ret:
- self.task_queue.put(frame)
- else:
- is_loading = False
- self.cap.release()
- break
- # print("\033[0;31;40m" + "Read time = %.2f ms"%(1000*(time() - begin_time)) + "\033[0m")
- sleep(self.delay_time)
- print("[INFO] Dateloader thread exit.")
- class InferenceThread(Thread):
- # Inference thread
- def __init__(self, i, model, task_queue, result_queue, delay_time):
- Thread.__init__(self)
- self.task_queue = task_queue
- self.result_queue = result_queue
- self.model = model
- self.delay_time = delay_time
- self.i = i
- def run(self):
- global is_forwarding, is_running, frame_counter
- while is_forwarding:
- if not self.task_queue.empty():
- # begin_time = time()
- # Take an image from the task queue
- img = self.task_queue.get()
- # Store the stretch factors
- img_h, img_w = img.shape[0:2]
- y_scale, x_scale = img_h/self.model.input_image_size, img_w/self.model.input_image_size
- # Pre-processing
- input_tensor = self.model.preprocess(img)
- # Inference
- output_tensors = self.model.forward(input_tensor)
- # Post-processing
- result = self.model.postprocess(output_tensors)
- # Put the result into the result queue
- trytimes = 5
- while trytimes > 0:
- trytimes -= 1
- if not self.result_queue.full():
- trytimes = 0
- self.result_queue.put(result)
- # Increment the frame counter
- if trytimes >= 0:
- frame_counter += 1
- # print("\033[0;31;40m" + "Forward time = %.2f ms"%(1000*(time() - begin_time)) + "\033[0m")
- elif not is_loading:
- is_forwarding = False
- sleep(self.delay_time)
- print(f"[INFO] Forward thread {self.i} exit.")
- app = Flask(__name__)
- @app.route('/video_feed')
- def video_feed():
- return Response(gen_frames(), mimetype='multipart/x-mixed-replace; boundary=frame')
- @app.route('/')
- def index():
- return render_template('index.html')
- def gen_frames():
- global is_forwarding, is_writing, save_queue
- while not save_queue.empty():
- save_queue.get()
- while is_writing:
- if not save_queue.empty():
- # begin_time = time()
- img_result = save_queue.get()
- cv2.putText(img_result, '%.2f fps'%fps, (40, 40), cv2.FONT_HERSHEY_COMPLEX, 1.5, (255, 0, 0), 2, cv2.LINE_AA)
- ret, buffer = cv2.imencode('.jpg', img_result, [cv2.IMWRITE_JPEG_QUALITY, 50])
- frame = buffer.tobytes()
- yield (b'--frame\r\n'
- b'Content-Type: image/jpeg\r\n\r\n' + frame + b'\r\n')
- # print("\033[0;31;40m" + "Frame Write time = %.2f ms"%(1000*(time() - begin_time)) + "\033[0m")
- elif not is_forwarding:
- #self.out.release()
- is_writing = False
- sleep(0.001)
- class msg_printer(Thread):
- # Global variables used to compute the frame rate
- def __init__(self, task_queue, save_queue, delay_time):
- Thread.__init__(self)
- self.delay_time = delay_time
- self.task_queue = task_queue
- self.save_queue = save_queue
- def run(self):
- global frame_counter, fps
- frame_counter = 0
- fps = 0.0
- begin_time = time()
- while is_loading or is_forwarding or is_writing:
- delta_time = time() - begin_time
- fps = frame_counter/delta_time
- frame_counter = 0
- begin_time = time()
- print("Smart FPS = %.2f, task_queue_size = %d, save_queue_size = %d"%(fps, self.task_queue.qsize(), self.save_queue.qsize()))
- sleep(self.delay_time)
- print("[INFO] msg_printer thread exit.")
-
- # Some constants and helper functions
- coco_names = [
- "person", "bicycle", "car", "motorcycle", "airplane",
- "bus", "train", "truck", "boat", "traffic light",
- "fire hydrant", "stop sign", "parking meter", "bench", "bird",
- "cat", "dog", "horse", "sheep", "cow",
- "elephant", "bear", "zebra", "giraffe", "backpack",
- "umbrella", "handbag", "tie", "suitcase", "frisbee",
- "skis", "snowboard", "sports ball", "kite", "baseball bat",
- "baseball glove", "skateboard", "surfboard", "tennis racket", "bottle",
- "wine glass", "cup", "fork", "knife", "spoon",
- "bowl", "banana", "apple", "sandwich", "orange",
- "broccoli", "carrot", "hot dog", "pizza", "donut",
- "cake", "chair", "couch", "potted plant", "bed",
- "dining table", "toilet", "tv", "laptop", "mouse",
- "remote", "keyboard", "cell phone", "microwave", "oven",
- "toaster", "sink", "refrigerator", "book", "clock",
- "vase", "scissors", "teddy bear", "hair drier", "toothbrush"
- ]
- yolo_colors = [
- (56, 56, 255), (151, 157, 255), (31, 112, 255), (29, 178, 255),
- (49, 210, 207), (10, 249, 72), (23, 204, 146), (134, 219, 61),
- (52, 147, 26), (187, 212, 0), (168, 153, 44), (255, 194, 0),
- (147, 69, 52), (255, 115, 100), (236, 24, 0), (255, 56, 132),
- (133, 0, 82), (255, 56, 203), (200, 149, 255), (199, 55, 255)]
- def draw_detection(img, box, score, class_id):
- x1, y1, x2, y2 = box
- color = yolo_colors[class_id%20]
- cv2.rectangle(img, (x1, y1), (x2, y2), color, 2)
- label = f"{coco_names[class_id]}: {score:.2f}"
- (label_width, label_height), _ = cv2.getTextSize(label, cv2.FONT_HERSHEY_SIMPLEX, 0.5, 1)
- label_x = x1
- label_y = y1 - 10 if y1 - 10 > label_height else y1 + 10
- # Draw a filled rectangle as the background for the label text
- cv2.rectangle(
- img, (label_x, label_y - label_height), (label_x + label_width, label_y + label_height), color, cv2.FILLED
- )
- # Draw the label text on the image
- cv2.putText(img, label, (label_x, label_y), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 0, 0), 1, cv2.LINE_AA)
- if __name__ == "__main__":
- main()
6. A More Efficient DataFlow Design
Because the author is a bit lazy, OpenCV is used everywhere for image handling, with bgr8 images as the intermediate data, so every image operation here runs on the CPU. In reality, on the RDK series the most efficient approach is to use nv12 as the intermediate data. We can choose nv12 input when compiling the bin model, so that the BPU accepts nv12 data directly; operations such as image encoding/decoding can then use the hardware codec, and scaling/rotation can use the VPS. This greatly reduces CPU load and improves overall system efficiency.
For the concrete DataFlow, see the blog post: TROS DataFlow - USB Camera & mipi Sensor - rtsp (CSDN blog).
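As a rough illustration of the nv12 idea above (this sketch is not part of the original programs; it assumes a bin model compiled with nv12 input and even frame dimensions):
- # Sketch: convert a BGR frame to NV12 with OpenCV/numpy.
- import cv2
- import numpy as np
- def bgr2nv12(bgr):
-     h, w = bgr.shape[:2]                                  # h and w must be even
-     yuv420p = cv2.cvtColor(bgr, cv2.COLOR_BGR2YUV_I420)   # planar Y, U, V (shape: (h*3//2, w))
-     y = yuv420p[:h, :].reshape(-1)
-     u = yuv420p[h:h + h // 4, :].reshape(-1)
-     v = yuv420p[h + h // 4:, :].reshape(-1)
-     uv = np.empty(u.size + v.size, dtype=yuv420p.dtype)
-     uv[0::2], uv[1::2] = u, v                             # interleave U and V
-     return np.concatenate([y, uv])                        # NV12: Y plane followed by interleaved UV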
7. For the related model files and program downloads, please visit the Horizon developer community
Reference for deploying Tsinghua University's public YOLOv10 object detection algorithm on the Horizon Bayes-architecture BPU (horizon.cc)