AscendC从入门到醒目系列（三）基于自界说算子工程开发AscendC算子 ...

星球的眼睛 · 2024-11-12 20:12:10

本次主要讨论下AscendC别的一种开发流程，基于自界说算子工程的算子开发。从算子工程创建、代码编写、编译部署到运行验证的开发全流程，让您对算子开发工程有个宏观的熟悉，此处我们以输入是动态shape（主要体如今tiling）的Add算子实现为例，为了与内置Add算子区分，界说算子范例为AddCustom。
1、创建工程

CANN软件包中提供了工程创建工具msOpGen，开发者可以输入算子原型界说文件天生Ascend C算子开发工程。
1.1 编写AddCustom算子的原型界说json文件

```yaml
```java
[
{
"op": "AddCustom",
"input_desc": [
{
"name": "x",
"param_type": "required",
"format": [
"ND"
],
"type": [
"fp16"
]
},
{
"name": "y",
"param_type": "required",
"format": [
"ND"
],
"type": [
"fp16"
]
}
],
"output_desc": [
{
"name": "z",
"param_type": "required",
"format": [
"ND"
],
"type": [
"fp16"
]
}
]
}
]

复制代码

1.2 用msOpGen工具天生AddCustom算子的开发工程

${INSTALL_DIR}/python/site-packages/bin/msopgen gen -i $HOME/sample/add_custom.json -c ai_core-<soc_version> -lan cpp -out $HOME/sample/AddCustom

复制代码

${INSTALL_DIR}为CANN软件安装后文件存储路径，请根据现真相况举行替换，如/usr/local/Ascend/ascend-toolkit/latest。
-i：算子原型界说文件add_custom.json所在路径。
-c：ai_core-<soc_version>代表算子在AI Core上执行，<soc_version>为昇腾AI处理器的型号，可通过npu-smi info命令举行查询，基于同系列的AI处理器型号创建的算子工程，其根本能力通用。例如soc_version设置为Ascend310P1，Ascend910B3等。
-lan：参数cpp代表算子基于Ascend C编程框架，利用C++编程语言开发。

1.3 工程目录天生

命令执行完后，会在$HOME/sample目录下天生算子工程目录AddCustom，工程中包含算子实现的模板文件，编译脚本等，如下所示

AddCustom
├── build.sh // 编译入口脚本
├── cmake
│ ├── config.cmake
│ ├── util // 算子工程编译所需脚本及公共编译文件存放目录
├── CMakeLists.txt // 算子工程的CMakeLists.txt
├── CMakePresets.json // 编译配置项
├── framework // 算子插件实现文件目录，单算子模型文件的生成不依赖算子适配插件，无需关注
├── op_host // host侧实现文件
│ ├── add_custom_tiling.h // 算子tiling定义文件
│ ├── add_custom.cpp // 算子原型注册、shape推导、信息库、tiling实现等内容文件
│ ├── CMakeLists.txt
├── op_kernel // kernel侧实现文件
│ ├── CMakeLists.txt
│ ├── add_custom.cpp // 算子核函数实现文件
├── scripts // 自定义算子工程打包相关脚本所在目录

复制代码

CMakePresets.json // 编译设置项
add_custom_tiling.h // 算子tiling界说文件
op_host/add_custom.cpp // 算子原型注册、shape推导、信息库、tiling实现等内容文件
op_kernel/add_custom.cpp // 算子核函数实现文件
上述文件为后续算子开发过程中需要修改的文件，其他文件无需修改。

2 算子核函数实现

在工程存储目录的“AddCustom/op_kernel/add_custom.cpp”文件中实现算子的核函数。算子核函数实当代码的内部调用关系示意图如下：

2.1 核函数界说

核函数的界说，并在核函数中调用算子类的Init和Process函数。

extern "C" __global__ __aicore__ void add_custom(GM_ADDR x, GM_ADDR y, GM_ADDR z, GM_ADDR workspace, GM_ADDR tiling)
{
// 获取Host侧传入的Tiling参数
GET_TILING_DATA(tiling_data, tiling);
// 初始化算子类
KernelAdd op;
// 算子类的初始化函数，完成内存初始化相关工作
op.Init(x, y, z, tiling_data.totalLength, tiling_data.tileNum);
// 完成算子实现的核心逻辑
op.Process();
}

复制代码

2.2 界说KernelAdd算子类

和之前AscendC从入门到醒目系列（二） - 知乎 (zhihu.com)中一样，KernelAdd算子类主要也是实现Init，CopyIn，Compute，CopyOut这个4个关键函数。

#include "kernel_operator.h"
constexpr int32_t BUFFER_NUM = 2;
class KernelAdd {
public:
__aicore__ inline KernelAdd() {}
// 初始化函数，完成内存初始化相关操作
__aicore__ inline void Init(GM_ADDR x, GM_ADDR y, GM_ADDR z, uint32_t totalLength, uint32_t tileNum)
{
// 使用获取到的TilingData计算得到singleCoreSize(每个核上总计算数据大小)、tileNum（每个核上分块个数）、singleTileLength（每个分块大小）等变量
this->blockLength = totalLength / AscendC::GetBlockNum();
this->tileNum = tileNum;
this->tileLength = this->blockLength / tileNum / BUFFER_NUM;
// 获取当前核的起始索引
xGm.SetGlobalBuffer((__gm__ DTYPE_X*)x + this->blockLength * AscendC::GetBlockIdx(), this->blockLength);
yGm.SetGlobalBuffer((__gm__ DTYPE_Y*)y + this->blockLength * AscendC::GetBlockIdx(), this->blockLength);
zGm.SetGlobalBuffer((__gm__ DTYPE_Z*)z + this->blockLength * AscendC::GetBlockIdx(), this->blockLength);
// 通过Pipe内存管理对象为输入输出Queue分配内存
pipe.InitBuffer(inQueueX, BUFFER_NUM, this->tileLength * sizeof(DTYPE_X));
pipe.InitBuffer(inQueueY, BUFFER_NUM, this->tileLength * sizeof(DTYPE_Y));
pipe.InitBuffer(outQueueZ, BUFFER_NUM, this->tileLength * sizeof(DTYPE_Z));
}
// 核心处理函数，实现算子逻辑，调用私有成员函数CopyIn、Compute、CopyOut完成矢量算子的三级流水操作
__aicore__ inline void Process()
{
int32_t loopCount = this->tileNum * BUFFER_NUM;
for (int32_t i = 0; i < loopCount; i++) {
CopyIn(i);
Compute(i);
CopyOut(i);
}
}
private:
// 搬入函数，完成CopyIn阶段的处理，被核心Process函数调用
__aicore__ inline void CopyIn(int32_t progress)
{
// 从Queue中分配输入Tensor
AscendC::LocalTensor<DTYPE_X> xLocal = inQueueX.AllocTensor<DTYPE_X>();
AscendC::LocalTensor<DTYPE_Y> yLocal = inQueueY.AllocTensor<DTYPE_Y>();
// 将GlobalTensor数据拷贝到LocalTensor
AscendC::DataCopy(xLocal, xGm[progress * this->tileLength], this->tileLength);
AscendC::DataCopy(yLocal, yGm[progress * this->tileLength], this->tileLength);
// 将LocalTesor放入VECIN（代表矢量编程中搬入数据的逻辑存放位置）的Queue中
inQueueX.EnQue(xLocal);
inQueueY.EnQue(yLocal);
}
// 计算函数，完成Compute阶段的处理，被核心Process函数调用
__aicore__ inline void Compute(int32_t progress)
{
// 将Tensor从队列中取出，用于后续计算
AscendC::LocalTensor<DTYPE_X> xLocal = inQueueX.DeQue<DTYPE_X>();
AscendC::LocalTensor<DTYPE_Y> yLocal = inQueueY.DeQue<DTYPE_Y>();
// 从Queue中分配输出Tensor
AscendC::LocalTensor<DTYPE_Z> zLocal = outQueueZ.AllocTensor<DTYPE_Z>();
// 调用Add接口进行计算
AscendC::Add(zLocal, xLocal, yLocal, this->tileLength);
// 将计算结果LocalTensor放入到VecOut的Queue中
outQueueZ.EnQue<DTYPE_Z>(zLocal);
// 释放输入Tensor
inQueueX.FreeTensor(xLocal);
inQueueY.FreeTensor(yLocal);
}
// 搬出函数，完成CopyOut阶段的处理，被核心Process函数调用
__aicore__ inline void CopyOut(int32_t progress)
{
// 从VecOut的Queue中取出输出Tensor
AscendC::LocalTensor<DTYPE_Z> zLocal = outQueueZ.DeQue<DTYPE_Z>();
// 将输出Tensor拷贝到GlobalTensor中
AscendC::DataCopy(zGm[progress * this->tileLength], zLocal, this->tileLength);
// 将不再使用的LocalTensor释放
outQueueZ.FreeTensor(zLocal);
}
private:
//Pipe内存管理对象
AscendC::TPipe pipe;
//输入数据Queue队列管理对象，QuePosition为VECIN
AscendC::TQue<AscendC::QuePosition::VECIN, BUFFER_NUM> inQueueX, inQueueY;
//输出数据Queue队列管理对象，QuePosition为VECOUT
AscendC::TQue<AscendC::QuePosition::VECOUT, BUFFER_NUM> outQueueZ;
//管理输入输出Global Memory内存地址的对象，其中xGm, yGm为输入，zGm为输出
AscendC::GlobalTensor<DTYPE_X> xGm;
AscendC::GlobalTensor<DTYPE_Y> yGm;
AscendC::GlobalTensor<DTYPE_Z> zGm;
// 每个核上总计算数据大小
uint32_t blockLength;
// 每个核上总计算数据分块个数
uint32_t tileNum;
// 每个分块大小
uint32_t tileLength;
};

复制代码

3. host侧开发

核函数开发并验证完成后，下一步就是举行Host侧的实现，对应“AddCustom/op_host”目录下的add_custom_tiling.h文件与add_custom.cpp文件。
3.1 add_custom_tiling.h

这个是界说数据怎么切分，每个核上执行多少数据量，核上的数据又怎么切分执行的问题。

#ifndef ADD_CUSTOM_TILING_H
#define ADD_CUSTOM_TILING_H
#include "register/tilingdata_base.h"
namespace optiling {
BEGIN_TILING_DATA_DEF(TilingData)
// AddCustom算子使用了2个tiling参数：totalLength与tileNum
TILING_DATA_FIELD_DEF(uint32_t, totalLength); // 总计算数据量
TILING_DATA_FIELD_DEF(uint32_t, tileNum); // 每个核上总计算数据分块个数
END_TILING_DATA_DEF;
// 注册tiling数据到对应的算子
REGISTER_TILING_DATA_CLASS(AddCustom, TilingData)
}
#endif // ADD_CUSTOM_TILING_H

复制代码

3.2 add_custom.cpp

修改“add_custom.cpp”文件，举行Tiling的实现。

namespace optiling {
const uint32_t BLOCK_DIM = 8;
const uint32_t TILE_NUM = 8;
static ge::graphStatus TilingFunc(gert::TilingContext* context)
{
TilingData tiling;
uint32_t totalLength = context->GetInputShape(0)->GetOriginShape().GetShapeSize();
context->SetBlockDim(BLOCK_DIM);
tiling.set_totalLength(totalLength);
tiling.set_tileNum(TILE_NUM);
tiling.SaveToBuffer(context->GetRawTilingData()->GetData(), context->GetRawTilingData()->GetCapacity());
context->GetRawTilingData()->SetDataSize(tiling.GetDataSize());
size_t *currentWorkspace = context->GetWorkspaceSizes(1);
currentWorkspace[0] = 0;
return ge::GRAPH_SUCCESS;
}
} // namespace optiling

复制代码

3.3 实现AddCustom算子的shape推导

在“add_custom.cpp”文件中实现AddCustom算子的shape推导。

static graphStatus InferShape(gert::InferShapeContext *context)
{
const gert::Shape *x1_shape = context->GetInputShape(0);
gert::Shape *y_shape = context->GetOutputShape(0);
*y_shape = *x1_shape;
return GRAPH_SUCCESS;
}

复制代码

3.4 算子原型注册

namespace ops {
class AddCustom : public OpDef {
public:
explicit AddCustom(const char* name) : OpDef(name)
{
// Add算子的第一个输入
this->Input("x")
.ParamType(REQUIRED) // 代表输入必选
.DataType({ ge::DT_FLOAT16, ge::DT_FLOAT, ge::DT_INT32 }) // 输入支持的数据类型
.Format({ ge::FORMAT_ND, ge::FORMAT_ND, ge::FORMAT_ND }); // 输入支持的数据格式
// Add算子的第二个输入
this->Input("y")
.ParamType(REQUIRED)
.DataType({ ge::DT_FLOAT16, ge::DT_FLOAT, ge::DT_INT32 })
.Format({ ge::FORMAT_ND, ge::FORMAT_ND, ge::FORMAT_ND });
this->Output("z")
.ParamType(REQUIRED)
.DataType({ ge::DT_FLOAT16, ge::DT_FLOAT, ge::DT_INT32 })
.Format({ ge::FORMAT_ND, ge::FORMAT_ND, ge::FORMAT_ND });
// 关联InferShape函数
this->SetInferShape(ge::InferShape);
// 关联Tiling函数
this->AICore()
.SetTiling(optiling::TilingFunc);
// 注册算子支持的AI处理器型号，请替换为实际支持的AI处理器型号
this->AICore().AddConfig("ascendxxx");
}
};
// 结束算子注册
OP_ADD(AddCustom);
} // namespace ops

复制代码

OP_ADD(AddCustom)：算子原型注册接口 .
4 算子工程编译部署

编译AddCustom工程，天生自界说算子安装包，并将其安装到算子库中
4.1 编译自界说算子工程

编译自界说算子工程，构建天生自界说算子包。修改CMakePresets.json中ASCEND_CANN_PACKAGE_PATH为CANN软件的安装目录，例如：/usr/local/Ascend/ascend-toolkit/latest。

{
……
"configurePresets": [
{
……
"ASCEND_CANN_PACKAGE_PATH": {
"type": "PATH",
"value": "/usr/local/Ascend/latest"
},
……
}
]
}

复制代码

在算子工程AddCustom目录下执行如下命令，举行算子工程编译。

./build.sh

复制代码

编译乐成后，会在当前目录下创建build_out目录，并在build_out目录下天生自界说算子安装包custom_opp__.run，例如“custom_opp_ubuntu_x86_64.run”。
4.2 自界说算子安装包部署

在自界说算子包所在路径下，执行如下命令，安装自界说算子包。

./custom_opp_<target os>_<target architecture>.run

复制代码

命令执行乐成后，自界说算子包中的相关文件将部署至当前情况的OPP算子库的vendors/customize目录中.
假如用户部署多个自界说算子包，可通过如下命令指定路径安装：

./custom_opp_<target os>_<target architecture>.run
--install-path=<path>

复制代码

说明：假如部署算子包时通过设置–install-path参数指定了算子包的安装目录，则在利用自界说算子前，需要执行source /vendors/<vendor_name>/bin/set_env.bash 命令，set_env.bash脚本中将自界说算子包的安装路径追加到情况变量ASCEND_CUSTOM_OPP_PATH中，使自界说算子在当前情况中生效。
4.3 查看部署后的目录结构

├── opp // 算子库目录
│ ├── built-in // 内置算子所在目录
│ ├── vendors // 自定义算子所在目录
│ ├── config.ini
│ └── vendor_name1 // 自定义算子所在目录，若不指定路径安装，默认为“customize”
│ ├── framework //自定义算子插件库
│ ├── op_impl
│ │ └── ai_core
│ │ └── tbe
│ │ ├── config
│ │ │ └── ${soc_version} //昇腾AI处理器类型
│ │ │ └── aic-${soc_version}-ops-info.json //自定义算子信息库文件
│ │ ├── vendor_name1_impl //自定义算子实现代码文件
│ │ │ └── dynamic
│ │ │ ├── xx.cpp
│ │ │ └── xx.py
│ │ ├── kernel //自定义算子二进制文件
│ │ │ └── ${soc_version} //昇腾AI处理器类型
│ │ │ └── config
│ │ └── op_tiling
│ │ ├── lib
│ │ └── liboptiling.so
│ └── op_proto //自定义算子原型库所在目录
│ ├── inc
│ │ └── op_proto.h
│ └── lib
│ ├── vendor_name2 // 存储厂商vendor_name2部署的自定义算子
vendor_name1 // 自定义算子所在目录，若不指定路径安装，默认为“customize”
vendor_name2 // 存储厂商vendor_name2部署的自定义算子

复制代码

5 算子ST测试

CANN开发套件包中提供了ST测试工具“msOpST”，用于天生算子的ST测试用例并在硬件情况中执行。
本节仅以AddCustom算子为例，介绍ST测试工具的关键执行流程。
5.1 编写测试界说文件AddCustom_case.json

创建算子ST测试用例界说文件“AddCustom_case.json”，例如存储到跟算子工程目录“AddCustom”同级别的“AddCustom_st”路径下。
“AddCustom_case.json”文件的样例如下，开发者可基于此文件定制修改。

[
{
"case_name": "Test_AddCustom_001",
"op": "AddCustom",
"input_desc": [
{
"format": [
"ND"
],
"type": [
"float16"
],
"shape": [8,2048],
"data_distribute": [
"uniform"
],
"value_range": [
[
0.1,
1.0
]
],
"name": "x"
},
{
"format": [
"ND"
],
"type": [
"float16"
],
"shape": [8,2048],
"data_distribute": [
"uniform"
],
"value_range": [
[
0.1,
1.0
]
],
"name": "y"
}
],
"output_desc": [
{
"format": [
"ND"
],
"type": [
"float16"
],
"shape": [8,2048],
"name": "z"
}
]
}
]

复制代码

5.2 设置ST测试用例执行时依靠的情况变量

${INSTALL_DIR}体现CANN软件安装目录，例如，/usr/local/Ascend/ascend-toolkit/latest。{arch-os}为运行情况的架构和操作体系，arch体现操作体系架构，os体现操作体系，例如x86_64-linux。

export DDK_PATH=${INSTALL_DIR}
export NPU_HOST_LIB=${INSTALL_DIR}/{arch-os}/devlib

复制代码

5.3 天生测试用例

进入msOpST工具所在目录，执行如下命令天生并执行测试用例。
step1：进入msOpST工具所在目录。

cd $HOME/Ascend/ascend-toolkit/latest/python/site-packages/bin

复制代码

step2：天生测试用例文件并执行.

./msopst run -i $HOME/AddCustom_st/AddCustom_case.json -soc <soc_version> -out $HOME/AddCustom_st

复制代码

-i：算子测试用例界说文件（*.json）的路径，可设置为绝对路径或者相对路径。
-soc：昇腾AI处理器的型号，请根据现真相况举行替换。
-out：天生文件所在路径。
此命令执行完成后，会输出类似如下打屏效果：

------------------------------------------------------------------------
- test case count: 1
- success count: 1
- failed count: 0
------------------------------------------------------------------------
2023-08-28 20:20:40 (25058) - [INFO] Process finished!
2023-08-28 20:20:40 (25058) - [INFO] The st report saved in: xxxx/AddCustom_st/20230828202015/st_report.json.

复制代码

也可以查看上述屏**显信息提示的“st_report.json”**文件，查看详细运行效果。
参考学习：
基于自界说算子工程的算子开发-快速入门-Ascend C算子开发-算子开发-CANN社区版8.0.RC3.alpha003开发文档-昇腾社区

免责声明：如果侵犯了您的权益，请联系站长，我们会及时删除侵权内容，谢谢合作！更多信息从访问主页：qidao123.com:ToB企服之家，中国第一个企服评测及商务社交产业平台。

		自动登录	找回密码
密码			立即注册

AscendC从入门到醒目系列（三）基于自界说算子工程开发AscendC算子 ...

本帖子中包含更多资源

0 个回复

快速回复

楼主热帖

标签云

浏览过的版块