Practical Ascend C Operator Performance Optimization Techniques 02: Memory Optimization


 Ascend C is the programming language CANN provides for operator development. It natively supports the C and C++ standards, combining development efficiency with runtime performance. With Ascend C, developers can efficiently implement custom, innovative algorithms on Ascend AI hardware.
As more and more developers adopt Ascend C, we are sharing an "Ascend C Operator Performance Optimization" series focused on the tuning phase developers care about most. It introduces the techniques commonly used to optimize Ascend C operators and helps developers build better-performing operators on their own. The series covers pipeline optimization, data-transfer optimization, memory optimization, API usage optimization, and tiling optimization, each presented from multiple angles: the approach, optimization cases, and performance comparisons.
The previous installment was "Practical Ascend C Operator Performance Optimization Techniques 01: Pipeline Optimization". In this installment you will learn several practical memory optimization techniques:
   

  • Fusing consecutive vector computations in the Unified Buffer
  • Efficient accumulation of matrix-multiply results by staging data in the L0C Buffer
  • Keeping the smaller matrix resident in the L1 Buffer and transferring only the larger matrix in chunks
  • Efficient bias computation via the BT Buffer
  • Efficient on-the-fly quantization by staging quantization parameters in the FP Buffer
  Overview of storage units in the Ascend AI Processor

  For the compute resources in an AI processor to deliver their full throughput, the input data must reach the compute units accurately and on time, which requires a carefully designed memory system that keeps the compute units supplied with data.
  The AI Core in an Ascend AI Processor contains multiple levels of internal storage; the AI Core must load data from external storage into internal storage before it can compute on it. The AI Core's main internal storage includes:
   

  • L1 Buffer: a general-purpose internal storage and the larger staging area inside the AI Core. Data that the AI Core reuses repeatedly can be kept here to reduce reads and writes over the bus.
  • L0A Buffer / L0B Buffer: inputs to Cube instructions.
  • L0C Buffer: output of Cube instructions; during accumulation it also serves as part of the input.
  • Unified Buffer: input and output of vector and scalar computations.
  To support data transfer and movement within the AI Core, the AI Core also contains the MTE (Memory Transfer Engine), which can perform on-the-fly data format/type conversion during transfers.
  Figure 1 AI Core architecture
  

  Besides the basic storage units above (L1 Buffer, L0 Buffers, Unified Buffer), some Ascend AI Processors that use the decoupled AI Core architecture add two more buffers: the BT Buffer and the FP Buffer. The decoupled architecture splits the AI Core into two independent cores, one for matrix computation (AI Cube, AIC) and one for vector computation (AI Vector, AIV). Each has its own Scalar unit and can load its own code segment independently, decoupling matrix computation from vector computation; under unified scheduling by the system software, the two cooperate to optimize compute efficiency.
  
  

  • BT Buffer: BiasTable Buffer, used to hold bias data.
  • FP Buffer: Fixpipe Buffer, used to hold quantization parameters, ReLU parameters, and so on.
  Figure 2 AI Core architecture (decoupled)
  

    Fusing consecutive vector computations in the Unified Buffer (UB)

  When an operator performs multiple vector computations and the output of one computation is the input of the next, the intermediate result can be kept in UB (Unified Buffer) and used directly as the next computation's input, instead of being copied from UB out to GM and then back from GM into UB. This UB fusion reduces the number of copies in and out, chains the vector computations, and improves memory-access efficiency. The data-flow comparison is as follows:
  
  Figure 3 Data-flow comparison
  
As an example, consider an operator that computes Exp and then Abs. The source operand is first copied from GM to UB for the Exp computation; when Exp finishes, its result is copied from UB back to GM. The Exp result is then copied from GM to UB again as the input of Abs, and when Abs finishes, the destination operand is copied from UB to GM. That is 4 GM transfers in total; for n chained vector computations, 2n GM transfers are needed.
class KernelSample {
public:
    __aicore__ inline KernelSample() {}
    __aicore__ inline void Init(__gm__ uint8_t* src0Gm, __gm__ uint8_t* dstGm)
    {
        src0Global.SetGlobalBuffer((__gm__ float*)src0Gm);
        dstGlobal.SetGlobalBuffer((__gm__ float*)dstGm);
        pipe.InitBuffer(inQueueSrc0, 1, 1024 * sizeof(float));
        pipe.InitBuffer(outQueueDst, 1, 1024 * sizeof(float));
    }
    __aicore__ inline void Process()
    {
        CopyIn();
        Compute();
        CopyOut();
        CopyIn1();
        Compute1();
        CopyOut1();
    }

private:
    __aicore__ inline void CopyIn()
    {
        LocalTensor<float> src0Local = inQueueSrc0.AllocTensor<float>();
        DataCopy(src0Local, src0Global, 1024);
        inQueueSrc0.EnQue(src0Local);
    }
    __aicore__ inline void Compute()
    {
        LocalTensor<float> src0Local = inQueueSrc0.DeQue<float>();
        LocalTensor<float> dstLocal = outQueueDst.AllocTensor<float>();
        Exp(dstLocal, src0Local, 1024);
        outQueueDst.EnQue<float>(dstLocal);
        inQueueSrc0.FreeTensor(src0Local);
    }
    __aicore__ inline void CopyOut()
    {
        LocalTensor<float> dstLocal = outQueueDst.DeQue<float>();
        DataCopy(dstGlobal, dstLocal, 1024);
        outQueueDst.FreeTensor(dstLocal);
    }
    __aicore__ inline void CopyIn1()
    {
        PipeBarrier<PIPE_ALL>();
        LocalTensor<float> src0Local = inQueueSrc0.AllocTensor<float>();
        DataCopy(src0Local, dstGlobal, 1024);
        inQueueSrc0.EnQue(src0Local);
    }
    __aicore__ inline void Compute1()
    {
        LocalTensor<float> src0Local = inQueueSrc0.DeQue<float>();
        LocalTensor<float> dstLocal = outQueueDst.AllocTensor<float>();
        Abs(dstLocal, src0Local, 1024);
        outQueueDst.EnQue<float>(dstLocal);
        inQueueSrc0.FreeTensor(src0Local);
    }
    __aicore__ inline void CopyOut1()
    {
        LocalTensor<float> dstLocal = outQueueDst.DeQue<float>();
        DataCopy(dstGlobal, dstLocal, 1024);
        outQueueDst.FreeTensor(dstLocal);
    }

private:
    TPipe pipe;
    TQue<QuePosition::VECIN, 1> inQueueSrc0;
    TQue<QuePosition::VECOUT, 1> outQueueDst;
    GlobalTensor<float> src0Global, dstGlobal;
};

   With UB fusion, consecutive vector computations run in UB: the previous result is used directly as the input of the next computation, which continues in UB without intermediate copies in and out. Only the initial copy of the source operand into UB and the final copy of the result from UB to GM are needed, for 2 GM transfers in total.
  
class KernelSample {
public:
    __aicore__ inline KernelSample() {}
    __aicore__ inline void Init(__gm__ uint8_t* src0Gm, __gm__ uint8_t* dstGm)
    {
        src0Global.SetGlobalBuffer((__gm__ float*)src0Gm);
        dstGlobal.SetGlobalBuffer((__gm__ float*)dstGm);
        pipe.InitBuffer(inQueueSrc0, 1, 1024 * sizeof(float));
        pipe.InitBuffer(outQueueDst, 1, 1024 * sizeof(float));
    }
    __aicore__ inline void Process()
    {
        CopyIn();
        Compute();
        CopyOut();
    }

private:
    __aicore__ inline void CopyIn()
    {
        LocalTensor<float> src0Local = inQueueSrc0.AllocTensor<float>();
        DataCopy(src0Local, src0Global, 1024);
        inQueueSrc0.EnQue(src0Local);
    }
    __aicore__ inline void Compute()
    {
        LocalTensor<float> src0Local = inQueueSrc0.DeQue<float>();
        LocalTensor<float> dstLocal = outQueueDst.AllocTensor<float>();
        Exp(dstLocal, src0Local, 1024);
        Abs(dstLocal, dstLocal, 1024);
        outQueueDst.EnQue<float>(dstLocal);
        inQueueSrc0.FreeTensor(src0Local);
    }
    __aicore__ inline void CopyOut()
    {
        LocalTensor<float> dstLocal = outQueueDst.DeQue<float>();
        DataCopy(dstGlobal, dstLocal, 1024);
        outQueueDst.FreeTensor(dstLocal);
    }

private:
    TPipe pipe;
    TQue<QuePosition::VECIN, 1> inQueueSrc0;
    TQue<QuePosition::VECOUT, 1> outQueueDst;
    GlobalTensor<float> src0Global, dstGlobal;
};
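The transfer-count arithmetic of this section can be checked with a small host-side C++ sketch. This is not Ascend C device code; the helper function names are illustrative, and the fused pass simply mirrors what the optimized Compute() above does (Exp followed by Abs on the same local buffer, no round trip in between).

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Unfused: each of the n chained vector ops copies its input in from GM and
// its result back out to GM, so n ops cost 2n GM transfers.
int unfusedGmTransfers(int n) { return 2 * n; }

// Fused: intermediates stay in the Unified Buffer; only the first copy-in and
// the last copy-out touch GM, independent of n.
int fusedGmTransfers(int) { return 2; }

// Host-side equivalent of the fused Compute(): Exp then Abs in one pass.
std::vector<float> expThenAbs(const std::vector<float>& src) {
    std::vector<float> dst(src.size());
    for (size_t i = 0; i < src.size(); ++i) {
        dst[i] = std::fabs(std::exp(src[i]));
    }
    return dst;
}
```

For the two-op Exp+Abs example this gives 4 GM transfers unfused versus 2 fused, matching the counts above.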

    Efficient accumulation of matrix-multiply results by staging data in L0C

  When an operator accumulates matrix-multiply results (for example A1 * B1 + A2 * B2 + ...), the previous product can be staged in CO1 (L0C), and the Mmad interface can accumulate onto it. Compared with copying each product from CO1 to GM and then to UB for the accumulation, this reduces the number of data transfers and improves memory-access efficiency.
  Figure 4 Data flow before optimization
  

  
  Figure 5 Data flow after optimization
  

  
   Before optimization, accumulating the results of 2 matrix multiplies proceeds as follows:
  

  • Copy the result of the previous matrix multiply from CO1 to workspace, and from workspace to UB;
  • Repeat the steps above for the next matrix multiply to bring its result into UB as well;
  • Add the two matmul results in UB.
   Accumulating n matrix multiplies therefore adds n CO1->workspace transfers, n workspace->UB transfers, and n Add operations.
...
// Illustrative sample only, not complete code; some synchronization control code is omitted
public:
    __aicore__ inline KernelSample()
    {
        aSize = m * k;
        bSize = k * n;
        cSize = m * n;
    }
    __aicore__ inline void Init(__gm__ uint8_t *a, __gm__ uint8_t *b, __gm__ uint8_t *c)
    {
        aGM.SetGlobalBuffer((__gm__ half *)a);
        bGM.SetGlobalBuffer((__gm__ half *)b);
        cGM.SetGlobalBuffer((__gm__ float *)c);
        pipe.InitBuffer(inQueueA1, 1, aSize * sizeof(half));
        pipe.InitBuffer(inQueueA2, 1, aSize * sizeof(half));
        pipe.InitBuffer(inQueueB1, 1, bSize * sizeof(half));
        pipe.InitBuffer(inQueueB2, 2, bSize * sizeof(half));
        pipe.InitBuffer(outQueueCO1, 1, cSize * sizeof(float));
        pipe.InitBuffer(inQueueSrc0, 1, cSize * sizeof(float));
        pipe.InitBuffer(inQueueSrc1, 1, cSize * sizeof(float));
        pipe.InitBuffer(outQueueDst, 1, cSize * sizeof(float));
    }
    __aicore__ inline void Process()
    {
        // First matrix multiply
        CopyIn();
        SplitA();
        SplitB();
        Compute();
        // Copy the first matmul result out
        CopyOut();
        // Copy the first matmul result into UB
        CopyIn1();
        // Second matrix multiply
        Compute1();
        // Copy the second matmul result out
        CopyOut1();
        // Copy the second matmul result into UB
        CopyIn2();
        // Accumulate the two matmul results
        Compute2();
        CopyOut2();
    }
private:
    __aicore__ inline void CopyIn()
    {
        LocalTensor<half> a1Local = inQueueA1.AllocTensor<half>();
        LocalTensor<half> b1Local = inQueueB1.AllocTensor<half>();

        Nd2NzParams dataCopyA1Params;
        dataCopyA1Params.ndNum = 1;
        dataCopyA1Params.nValue = m;
        dataCopyA1Params.dValue = k;
        dataCopyA1Params.srcNdMatrixStride = 0;
        dataCopyA1Params.srcDValue = k;
        dataCopyA1Params.dstNzC0Stride = m;
        dataCopyA1Params.dstNzNStride = 1;
        dataCopyA1Params.dstNzMatrixStride = 0;
        DataCopy(a1Local, aGM, dataCopyA1Params);

        Nd2NzParams dataCopyB1Params;
        dataCopyB1Params.ndNum = 1;
        dataCopyB1Params.nValue = k;
        dataCopyB1Params.dValue = n;
        dataCopyB1Params.srcNdMatrixStride = 0;
        dataCopyB1Params.srcDValue = n;
        dataCopyB1Params.dstNzC0Stride = k;
        dataCopyB1Params.dstNzNStride = 1;
        dataCopyB1Params.dstNzMatrixStride = 0;
        DataCopy(b1Local, bGM, dataCopyB1Params);

        inQueueA1.EnQue<half>(a1Local);
        inQueueB1.EnQue<half>(b1Local);
    }
    __aicore__ inline void SplitA()
    {
        ...
    }
    __aicore__ inline void SplitB()
    {
        ...
    }
    __aicore__ inline void Compute()
    {
        LocalTensor<half> a2Local = inQueueA2.DeQue<half>();
        LocalTensor<half> b2Local = inQueueB2.DeQue<half>();
        LocalTensor<float> c1Local = outQueueCO1.AllocTensor<float>();
        MmadParams mmadParams;
        mmadParams.m = m;
        mmadParams.n = n;
        mmadParams.k = k;
        // Matrix multiply
        Mmad(c1Local, a2Local, b2Local, mmadParams);
        outQueueCO1.EnQue<float>(c1Local);
        inQueueA2.EnQue<half>(a2Local);
        inQueueB2.EnQue<half>(b2Local);
    }
    __aicore__ inline void CopyOut()
    {
        LocalTensor<float> c1Local = outQueueCO1.DeQue<float>();
        GM_ADDR usrWorkspace = AscendC::GetUserWorkspace(workspace);
        xGm.SetGlobalBuffer((__gm__ float *)(usrWorkspace));
        FixpipeParamsV220 fixpipeParams;
        fixpipeParams.nSize = n;
        fixpipeParams.mSize = m;
        fixpipeParams.srcStride = m;
        fixpipeParams.dstStride = n;
        fixpipeParams.ndNum = 1;
        fixpipeParams.srcNdStride = 0;
        fixpipeParams.dstNdStride = 0;
        // Copy the matmul result from CO1 to workspace
        Fixpipe(xGm, c1Local, fixpipeParams);
        outQueueCO1.EnQue<float>(c1Local);
    }
    __aicore__ inline void CopyIn1()
    {
        PipeBarrier<PIPE_ALL>();
        LocalTensor<float> src0Local = inQueueSrc0.AllocTensor<float>();
        // Copy the matmul result from workspace to UB
        DataCopy(src0Local, xGm, cSize);
        inQueueSrc0.EnQue<float>(src0Local);
    }
    __aicore__ inline void Compute1()
    {
        LocalTensor<half> a2Local = inQueueA2.DeQue<half>();
        LocalTensor<half> b2Local = inQueueB2.DeQue<half>();
        LocalTensor<float> c1Local = outQueueCO1.DeQue<float>();
        MmadParams mmadParams;
        mmadParams.m = m;
        mmadParams.n = n;
        mmadParams.k = k;
        // Matrix multiply
        Mmad(c1Local, a2Local, b2Local, mmadParams);
        outQueueCO1.EnQue<float>(c1Local);
        inQueueA2.FreeTensor(a2Local);
        inQueueB2.FreeTensor(b2Local);
    }
    __aicore__ inline void CopyOut1()
    {
        LocalTensor<float> c1Local = outQueueCO1.DeQue<float>();
        FixpipeParamsV220 fixpipeParams;
        fixpipeParams.nSize = n;
        fixpipeParams.mSize = m;
        fixpipeParams.srcStride = m;
        fixpipeParams.dstStride = n;
        fixpipeParams.ndNum = 1;
        fixpipeParams.srcNdStride = 0;
        fixpipeParams.dstNdStride = 0;
        // Copy the matmul result from CO1 to workspace
        Fixpipe(xGm, c1Local, fixpipeParams);
        outQueueCO1.FreeTensor(c1Local);
    }
    __aicore__ inline void CopyIn2()
    {
        PipeBarrier<PIPE_ALL>();
        LocalTensor<float> src1Local = inQueueSrc1.AllocTensor<float>();
        // Copy the matmul result from workspace to UB
        DataCopy(src1Local, xGm, cSize);
        inQueueSrc1.EnQue<float>(src1Local);
    }
    __aicore__ inline void Compute2()
    {
        LocalTensor<float> src0Local = inQueueSrc0.DeQue<float>();
        LocalTensor<float> src1Local = inQueueSrc1.DeQue<float>();
        LocalTensor<float> dstLocal = outQueueDst.AllocTensor<float>();
        // Add the two matmul results
        Add(dstLocal, src0Local, src1Local, cSize);
        outQueueDst.EnQue<float>(dstLocal);
        inQueueSrc0.FreeTensor(src0Local);
        inQueueSrc1.FreeTensor(src1Local);
    }
    __aicore__ inline void CopyOut2()
    {
        ...
    }
private:
    TPipe pipe;
    TQue<QuePosition::A1, 1> inQueueA1;
    TQue<QuePosition::A2, 1> inQueueA2;
    TQue<QuePosition::B1, 1> inQueueB1;
    TQue<QuePosition::B2, 1> inQueueB2;
    TQue<QuePosition::CO1, 1> outQueueCO1;
    TQue<QuePosition::VECIN, 1> inQueueSrc0;
    TQue<QuePosition::VECIN, 1> inQueueSrc1;
    TQue<QuePosition::VECOUT, 1> outQueueDst;

    GlobalTensor<half> aGM;
    GlobalTensor<half> bGM;
    GlobalTensor<float> cGM;
    uint16_t m = 32, k = 32, n = 32;
    uint16_t aSize, bSize, cSize;
...
 After optimization, the operator stages the previous matmul result in L0C when accumulating; the Mmad interface parameters cmatrixInitVal and cmatrixSource set the initial value of the C matrix, so only 2 Mmad calls are needed to accumulate the 2 matmul results.
...
// Illustrative sample only, not complete code; some synchronization control code is omitted
public:
    __aicore__ inline KernelSample()
    {
        aSize = m * k;
        bSize = k * n;
        cSize = m * n;
    }
    __aicore__ inline void Init(__gm__ uint8_t *a, __gm__ uint8_t *b, __gm__ uint8_t *c)
    {
        aGM.SetGlobalBuffer((__gm__ half *)a);
        bGM.SetGlobalBuffer((__gm__ half *)b);
        cGM.SetGlobalBuffer((__gm__ float *)c);
        pipe.InitBuffer(inQueueA1, 1, aSize * sizeof(half));
        pipe.InitBuffer(inQueueA2, 1, aSize * sizeof(half));
        pipe.InitBuffer(inQueueB1, 1, bSize * sizeof(half));
        pipe.InitBuffer(inQueueB2, 2, bSize * sizeof(half));
        pipe.InitBuffer(outQueueCO1, 1, cSize * sizeof(float));
    }
    __aicore__ inline void Process()
    {
        CopyIn();
        SplitA();
        SplitB();
        Compute();
        CopyOut();
    }
private:
    __aicore__ inline void CopyIn()
    {
        LocalTensor<half> a1Local = inQueueA1.AllocTensor<half>();
        LocalTensor<half> b1Local = inQueueB1.AllocTensor<half>();

        Nd2NzParams dataCopyA1Params;
        dataCopyA1Params.ndNum = 1;
        dataCopyA1Params.nValue = m;
        dataCopyA1Params.dValue = k;
        dataCopyA1Params.srcNdMatrixStride = 0;
        dataCopyA1Params.srcDValue = k;
        dataCopyA1Params.dstNzC0Stride = m;
        dataCopyA1Params.dstNzNStride = 1;
        dataCopyA1Params.dstNzMatrixStride = 0;
        DataCopy(a1Local, aGM, dataCopyA1Params);

        Nd2NzParams dataCopyB1Params;
        dataCopyB1Params.ndNum = 1;
        dataCopyB1Params.nValue = k;
        dataCopyB1Params.dValue = n;
        dataCopyB1Params.srcNdMatrixStride = 0;
        dataCopyB1Params.srcDValue = n;
        dataCopyB1Params.dstNzC0Stride = k;
        dataCopyB1Params.dstNzNStride = 1;
        dataCopyB1Params.dstNzMatrixStride = 0;
        DataCopy(b1Local, bGM, dataCopyB1Params);

        inQueueA1.EnQue(a1Local);
        inQueueB1.EnQue(b1Local);
    }
    __aicore__ inline void SplitA()
    {
        ...
    }
    __aicore__ inline void SplitB()
    {
        ...
    }
    __aicore__ inline void Compute()
    {
        LocalTensor<half> a2Local = inQueueA2.DeQue<half>();
        LocalTensor<half> b2Local = inQueueB2.DeQue<half>();
        LocalTensor<float> c1Local = outQueueCO1.AllocTensor<float>();
        MmadParams mmadParams;
        mmadParams.m = m;
        mmadParams.n = n;
        mmadParams.k = k;
        // First matrix multiply
        Mmad(c1Local, a2Local, b2Local, mmadParams);
        PipeBarrier<PIPE_M>();
        // Second matmul accumulates onto the first matmul's result
        mmadParams.cmatrixInitVal = false;
        Mmad(c1Local, a2Local, b2Local, c1Local, mmadParams);
        outQueueCO1.EnQue<float>(c1Local);
        inQueueA2.FreeTensor(a2Local);
        inQueueB2.FreeTensor(b2Local);
    }
    __aicore__ inline void CopyOut()
    {
        LocalTensor<float> c1Local = outQueueCO1.DeQue<float>();
        FixpipeParamsV220 fixpipeParams;
        fixpipeParams.nSize = n;
        fixpipeParams.mSize = m;
        fixpipeParams.srcStride = m;
        fixpipeParams.dstStride = n;
        fixpipeParams.ndNum = 1;
        fixpipeParams.srcNdStride = 0;
        fixpipeParams.dstNdStride = 0;
        Fixpipe(cGM, c1Local, fixpipeParams);
        outQueueCO1.FreeTensor(c1Local);
    }
private:
    TPipe pipe;
    TQue<QuePosition::A1, 1> inQueueA1;
    TQue<QuePosition::A2, 1> inQueueA2;
    TQue<QuePosition::B1, 1> inQueueB1;
    TQue<QuePosition::B2, 1> inQueueB2;
    TQue<QuePosition::CO1, 1> outQueueCO1;

    GlobalTensor<half> aGM;
    GlobalTensor<half> bGM;
    GlobalTensor<float> cGM;
    uint16_t m = 32, k = 32, n = 32;
    uint16_t aSize, bSize, cSize;
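Functionally, calling Mmad with cmatrixInitVal = false accumulates the new product onto the value already sitting in L0C. A minimal host-side C++ sketch of that accumulate semantics (illustrative only, row-major float matrices, not device code):

```cpp
#include <cassert>
#include <vector>

// Sketch of Mmad's accumulate mode: compute A (m x k) times B (k x n) and add
// it onto the existing C (m x n) instead of overwriting it. Chaining calls
// yields A1*B1 + A2*B2 + ... without C ever leaving the accumulator.
void mmadAccumulate(const std::vector<float>& a, const std::vector<float>& b,
                    std::vector<float>& c, int m, int k, int n) {
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < n; ++j) {
            float sum = 0.0f;
            for (int p = 0; p < k; ++p) {
                sum += a[i * k + p] * b[p * n + j];
            }
            c[i * n + j] += sum;  // accumulate rather than assign
        }
    }
}
```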

    Keeping the smaller matrix resident in the L1 Buffer and transferring only the larger matrix in chunks

  In cube computations where L1 cannot hold both the left and right matrices in full, keep the smaller matrix resident in L1 and transfer only the larger matrix in chunks, reducing the number of transfers.
  Suppose L1 is 512 KB and the left and right matrices are 992 KB and 16 KB respectively, with half-precision data, so the two matrices cannot both be loaded into L1 at once. The planned tiling strategy is: do not split the K axis; split the left matrix evenly into two blocks A1 and A2, each of shape [992, 256]; and split the right matrix evenly into two blocks, each of shape [256, 16]. The load order is: load A1 into L1, then load B1 and B2 in turn and compute; then load A2 into L1, and again load B1 and B2 in turn and compute.
  Figure 6 Tiling strategy before optimization
  

  
...
public:
    __aicore__ inline KernelSample()
    {
        aSize = baseM * baseK;
        bSize = baseK * baseN;
        cSize = m * n;
    }
    __aicore__ inline void Init(__gm__ uint8_t *a, __gm__ uint8_t *b, __gm__ uint8_t *c)
    {
        aGM.SetGlobalBuffer((__gm__ half *)a);
        bGM.SetGlobalBuffer((__gm__ half *)b);
        cGM.SetGlobalBuffer((__gm__ float *)c);
        pipe.InitBuffer(inQueueA1, 1, aSize * sizeof(half));
        pipe.InitBuffer(inQueueA2, 1, aSize * sizeof(half));
        pipe.InitBuffer(inQueueB1, 1, bSize * sizeof(half));
        pipe.InitBuffer(inQueueB2, 2, bSize * sizeof(half));
        pipe.InitBuffer(outQueueCO1, 1, cSize * sizeof(float));
    }
    __aicore__ inline void Process()
    {
        for (uint32_t i = 0; i < 2; i++) {
            CopyInA1(i);
            SplitA();
            for (uint32_t j = 0; j < 2; j++) {
                CopyInB1(j);
                SplitB();
                Compute(i, j);
            }
        }
        CopyOut();
    }
private:
    __aicore__ inline void CopyInA1(uint32_t i)
    {
        LocalTensor<half> a1Local = inQueueA1.AllocTensor<half>();
        // Load left-matrix block a1/a2 into A1
        Nd2NzParams dataCopyA1Params;
        dataCopyA1Params.ndNum = 1;
        dataCopyA1Params.nValue = baseM;
        dataCopyA1Params.dValue = baseK;
        dataCopyA1Params.srcNdMatrixStride = 0;
        dataCopyA1Params.srcDValue = baseK;
        dataCopyA1Params.dstNzC0Stride = baseM;
        dataCopyA1Params.dstNzNStride = 1;
        dataCopyA1Params.dstNzMatrixStride = 0;
        DataCopy(a1Local, aGM[i * baseM * baseK], dataCopyA1Params);
        inQueueA1.EnQue(a1Local);
    }
    __aicore__ inline void SplitA()
    {
        LocalTensor<half> a1Local = inQueueA1.DeQue<half>();
        LocalTensor<half> a2Local = inQueueA2.AllocTensor<half>();
        // Move left-matrix block a1/a2 from A1 to A2
        LoadData2dParams loadL0AParams;
        loadL0AParams.repeatTimes = baseM * baseK * sizeof(half) / 512;
        loadL0AParams.srcStride = 1;
        loadL0AParams.dstGap = 0;
        LoadData(a2Local, a1Local, loadL0AParams);
        inQueueA2.EnQue(a2Local);
        inQueueA1.FreeTensor(a1Local);
    }
    __aicore__ inline void CopyInB1(uint32_t j)
    {
        LocalTensor<half> b1Local = inQueueB1.AllocTensor<half>();
        // Load right-matrix block b1/b2 into B1
        Nd2NzParams dataCopyB1Params;
        dataCopyB1Params.ndNum = 1;
        dataCopyB1Params.nValue = baseK;
        dataCopyB1Params.dValue = baseN;
        dataCopyB1Params.srcNdMatrixStride = 0;
        dataCopyB1Params.srcDValue = n;
        dataCopyB1Params.dstNzC0Stride = baseK;
        dataCopyB1Params.dstNzNStride = 1;
        dataCopyB1Params.dstNzMatrixStride = 0;
        DataCopy(b1Local, bGM[j * baseN], dataCopyB1Params);
        inQueueB1.EnQue(b1Local);
    }
    __aicore__ inline void SplitB()
    {
        LocalTensor<half> b1Local = inQueueB1.DeQue<half>();
        LocalTensor<half> b2Local = inQueueB2.AllocTensor<half>();
        // Move right-matrix block b1/b2 from B1 to B2
        LoadData2dTransposeParams loadL0BParams;
        loadL0BParams.startIndex = 0;
        loadL0BParams.repeatTimes = baseK / nBlockSize;
        loadL0BParams.srcStride = 1;
        loadL0BParams.dstGap = 1;
        LoadDataWithTranspose(b2Local, b1Local, loadL0BParams);
        inQueueB2.EnQue(b2Local);
        inQueueB1.FreeTensor(b1Local);
    }
    __aicore__ inline void Compute(uint32_t i, uint32_t j)
    {
        LocalTensor<half> a2Local = inQueueA2.DeQue<half>();
        LocalTensor<half> b2Local = inQueueB2.DeQue<half>();
        LocalTensor<float> c1Local = outQueueCO1.AllocTensor<float>();
        // Matrix multiply
        MmadParams mmadParams;
        mmadParams.m = baseM;
        mmadParams.n = baseN;
        mmadParams.k = baseK;
        Mmad(c1Local[i * baseM * baseN + j * m * baseN], a2Local, b2Local, mmadParams);
        outQueueCO1.EnQue<float>(c1Local);
        inQueueA2.FreeTensor(a2Local);
        inQueueB2.FreeTensor(b2Local);
    }
    __aicore__ inline void CopyOut()
    {
        ...
    }
private:
    TPipe pipe;
    TQue<QuePosition::A1, 1> inQueueA1;
    TQue<QuePosition::A2, 1> inQueueA2;
    TQue<QuePosition::B1, 1> inQueueB1;
    TQue<QuePosition::B2, 1> inQueueB2;
    TQue<QuePosition::CO1, 1> outQueueCO1;

    GlobalTensor<half> aGM;
    GlobalTensor<half> bGM;
    GlobalTensor<float> cGM;
    uint16_t m = 1984, k = 256, n = 32;
    uint16_t baseM = 992, baseK = 256, baseN = 16;
    uint16_t aSize, bSize, cSize;
    uint16_t nBlockSize = 16;
...
After optimization, the smaller right matrix is copied into L1 once and stays resident there, while the loop re-transfers only blocks of the A matrix. With a loop count of 2, only 3 transfers are needed in total.
...
public:
    __aicore__ inline KernelSample()
    {
        aSize = baseM * baseK;
        bSize = baseK * n;
        cSize = m * n;
    }
    __aicore__ inline void Init(__gm__ uint8_t *a, __gm__ uint8_t *b, __gm__ uint8_t *c)
    {
        aGM.SetGlobalBuffer((__gm__ half *)a);
        bGM.SetGlobalBuffer((__gm__ half *)b);
        cGM.SetGlobalBuffer((__gm__ float *)c);
        pipe.InitBuffer(inQueueA1, 1, aSize * sizeof(half));
        pipe.InitBuffer(inQueueA2, 1, aSize * sizeof(half));
        pipe.InitBuffer(inQueueB1, 1, bSize * sizeof(half));
        pipe.InitBuffer(inQueueB2, 2, bSize * sizeof(half));
        pipe.InitBuffer(outQueueCO1, 1, cSize * sizeof(float));
    }
    __aicore__ inline void Process()
    {
        CopyInB1();
        SplitB();
        for (uint32_t i = 0; i < 2; i++) {
            CopyInA1(i);
            SplitA();
            for (uint32_t j = 0; j < 2; j++) {
                Compute(i, j);
            }
        }
        CopyOut();
    }
private:
    __aicore__ inline void CopyInB1()
    {
        LocalTensor<half> b1Local = inQueueB1.AllocTensor<half>();
        // Load the entire right matrix into B1
        Nd2NzParams dataCopyB1Params;
        dataCopyB1Params.ndNum = 1;
        dataCopyB1Params.nValue = baseK;
        dataCopyB1Params.dValue = n;
        dataCopyB1Params.srcNdMatrixStride = 0;
        dataCopyB1Params.srcDValue = n;
        dataCopyB1Params.dstNzC0Stride = baseK;
        dataCopyB1Params.dstNzNStride = 1;
        dataCopyB1Params.dstNzMatrixStride = 0;
        DataCopy(b1Local, bGM, dataCopyB1Params);
        inQueueB1.EnQue(b1Local);
    }
    __aicore__ inline void SplitB()
    {
        LocalTensor<half> b1Local = inQueueB1.DeQue<half>();
        LocalTensor<half> b2Local = inQueueB2.AllocTensor<half>();
        // Move the entire right matrix from B1 to B2
        LoadData2dTransposeParams loadL0BParams;
        loadL0BParams.startIndex = 0;
        loadL0BParams.repeatTimes = baseK / nBlockSize;
        loadL0BParams.srcStride = 1;
        loadL0BParams.dstGap = 1;
        for (int blockNum = 0; blockNum < (n / nBlockSize); blockNum++) {
            LoadDataWithTranspose(b2Local[blockNum * 16 * nBlockSize], b1Local[blockNum * baseK * nBlockSize], loadL0BParams);
        }
        inQueueB2.EnQue(b2Local);
        inQueueB1.FreeTensor(b1Local);
    }
    __aicore__ inline void CopyInA1(uint32_t i)
    {
        LocalTensor<half> a1Local = inQueueA1.AllocTensor<half>();
        // Load left-matrix block a1/a2 into A1
        Nd2NzParams dataCopyA1Params;
        dataCopyA1Params.ndNum = 1;
        dataCopyA1Params.nValue = baseM;
        dataCopyA1Params.dValue = baseK;
        dataCopyA1Params.srcNdMatrixStride = 0;
        dataCopyA1Params.srcDValue = baseK;
        dataCopyA1Params.dstNzC0Stride = baseM;
        dataCopyA1Params.dstNzNStride = 1;
        dataCopyA1Params.dstNzMatrixStride = 0;
        DataCopy(a1Local, aGM[i * baseM * baseK], dataCopyA1Params);
        inQueueA1.EnQue(a1Local);
    }
    __aicore__ inline void SplitA()
    {
        LocalTensor<half> a1Local = inQueueA1.DeQue<half>();
        LocalTensor<half> a2Local = inQueueA2.AllocTensor<half>();
        // Move left-matrix block a1/a2 from A1 to A2
        LoadData2dParams loadL0AParams;
        loadL0AParams.repeatTimes = baseM * baseK * sizeof(half) / 512;
        loadL0AParams.srcStride = 1;
        loadL0AParams.dstGap = 0;
        LoadData(a2Local, a1Local, loadL0AParams);
        inQueueA2.EnQue(a2Local);
        inQueueA1.FreeTensor(a1Local);
    }
    __aicore__ inline void Compute(uint32_t i, uint32_t j)
    {
        LocalTensor<half> a2Local = inQueueA2.DeQue<half>();
        LocalTensor<half> b2Local = inQueueB2.DeQue<half>();
        LocalTensor<float> c1Local = outQueueCO1.AllocTensor<float>();
        // Matrix multiply
        MmadParams mmadParams;
        mmadParams.m = baseM;
        mmadParams.n = baseN;
        mmadParams.k = baseK;
        Mmad(c1Local[i * baseM * baseN + j * m * baseN], a2Local, b2Local, mmadParams);
        outQueueCO1.EnQue<float>(c1Local);
        inQueueA2.FreeTensor(a2Local);
        inQueueB2.FreeTensor(b2Local);
    }
    __aicore__ inline void CopyOut()
    {
        ...
    }
private:
    TPipe pipe;
    TQue<QuePosition::A1, 1> inQueueA1;
    TQue<QuePosition::A2, 1> inQueueA2;
    TQue<QuePosition::B1, 1> inQueueB1;
    TQue<QuePosition::B2, 1> inQueueB2;
    TQue<QuePosition::CO1, 1> outQueueCO1;

    GlobalTensor<half> aGM;
    GlobalTensor<half> bGM;
    GlobalTensor<float> cGM;
    uint16_t m = 1984, k = 256, n = 32;
    uint16_t baseM = 992, baseK = 256, baseN = 16;
    uint16_t aSize, bSize, cSize;
    uint16_t nBlockSize = 16;
...
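The GM-to-L1 transfer counts of the two schedules above can be tallied directly. A host-side sketch with illustrative function names, for the 2 A blocks and 2 B blocks of this example:

```cpp
#include <cassert>

// Before: each A block is loaded once, and every B block is re-loaded inside
// every A iteration, for aBlocks + aBlocks * bBlocks transfers.
int transfersBefore(int aBlocks, int bBlocks) {
    return aBlocks + aBlocks * bBlocks;
}

// After: the small right matrix is loaded into L1 once and stays resident,
// so only 1 + aBlocks transfers are needed.
int transfersAfter(int aBlocks) {
    return 1 + aBlocks;
}
```

With 2 A blocks and 2 B blocks this gives 6 transfers before versus the 3 quoted above.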

  
    Efficient bias computation via the BT Buffer

  When an operator performs a matrix multiply with bias, the bias data can be copied to C2 (BiasTable Buffer), and a single Mmad call then computes the matrix multiply plus bias. Compared with first copying the matmul result from CO1 (L0C) to GM and then to UB to add the bias there, this reduces the number of data transfers and improves memory-access efficiency. The data-flow comparison is as follows:
  Figure 7 Data flow before optimization
  

  
  Figure 8 Data flow after optimization
  

  
  Before optimization, a matmul-with-bias computation proceeds as follows:
  

  • Copy the matmul result from CO1 (L0C) to workspace;
  • Copy it from workspace to UB;
  • Add the bias in UB;
  • Finally, copy the result to GM.
  Looping this computation n times adds n CO1->workspace transfers and n workspace->UB transfers.
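The fused form computes C = A*B + bias in one pass, with one bias value added per output column as the product is produced. A minimal host-side C++ sketch of that semantics (illustrative only, row-major matrices, not device code):

```cpp
#include <cassert>
#include <vector>

// Sketch of a bias-fused matmul: with the bias staged in the BiasTable Buffer,
// one Mmad call yields A*B + bias directly, avoiding the separate workspace/UB
// round trip and Add that the unfused flow needs.
std::vector<float> matmulWithBias(const std::vector<float>& a,
                                  const std::vector<float>& b,
                                  const std::vector<float>& bias,
                                  int m, int k, int n) {
    std::vector<float> c(m * n);
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < n; ++j) {
            float sum = bias[j];  // one bias value per output column
            for (int p = 0; p < k; ++p) {
                sum += a[i * k + p] * b[p * n + j];
            }
            c[i * n + j] = sum;
        }
    }
    return c;
}
```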
  // This sample is for illustration only; it is not complete code and omits some synchronization control code
  public:
      __aicore__ inline KernelSample()
      {
          aSize = m * k;
          bSize = k * n;
          cSize = m * n;
      }
      __aicore__ inline void Init(__gm__ uint8_t *a, __gm__ uint8_t *b, __gm__ uint8_t *bias, __gm__ uint8_t *c)
      {
          aGM.SetGlobalBuffer((__gm__ half *)a);
          bGM.SetGlobalBuffer((__gm__ half *)b);
          cGM.SetGlobalBuffer((__gm__ float *)c);
          biasGM.SetGlobalBuffer((__gm__ float *)bias);
          pipe.InitBuffer(inQueueA1, 1, aSize * sizeof(half));
          pipe.InitBuffer(inQueueA2, 1, aSize * sizeof(half));
          pipe.InitBuffer(inQueueB1, 1, bSize * sizeof(half));
          pipe.InitBuffer(inQueueB2, 2, bSize * sizeof(half));
          pipe.InitBuffer(outQueueCO1, 1, cSize * sizeof(float));
          pipe.InitBuffer(inQueueBias, 1, n * sizeof(float));
          pipe.InitBuffer(inQueueSrc0, 1, cSize * sizeof(float));
          pipe.InitBuffer(outQueueDst, 1, cSize * sizeof(float));
      }
      __aicore__ inline void Process()
      {
          CopyIn();
          SplitA();
          SplitB();
          Compute();
          CopyOut();
          CopyIn1();
          Compute1();
          CopyOut1();
      }
  private:
      __aicore__ inline void CopyIn()
      {
          LocalTensor<half> a1Local = inQueueA1.AllocTensor<half>();
          LocalTensor<half> b1Local = inQueueB1.AllocTensor<half>();
          LocalTensor<float> biasLocal = inQueueBias.AllocTensor<float>();

          Nd2NzParams dataCopyA1Params;
          dataCopyA1Params.ndNum = 1;
          dataCopyA1Params.nValue = m;
          dataCopyA1Params.dValue = k;
          dataCopyA1Params.srcNdMatrixStride = 0;
          dataCopyA1Params.srcDValue = k;
          dataCopyA1Params.dstNzC0Stride = m;
          dataCopyA1Params.dstNzNStride = 1;
          dataCopyA1Params.dstNzMatrixStride = 0;
          DataCopy(a1Local, aGM, dataCopyA1Params);

          Nd2NzParams dataCopyB1Params;
          dataCopyB1Params.ndNum = 1;
          dataCopyB1Params.nValue = k;
          dataCopyB1Params.dValue = n;
          dataCopyB1Params.srcNdMatrixStride = 0;
          dataCopyB1Params.srcDValue = n;
          dataCopyB1Params.dstNzC0Stride = k;
          dataCopyB1Params.dstNzNStride = 1;
          dataCopyB1Params.dstNzMatrixStride = 0;
          DataCopy(b1Local, bGM, dataCopyB1Params);
          // Move bias from GM to UB
          DataCopy(biasLocal, biasGM, n);

          inQueueA1.EnQue(a1Local);
          inQueueB1.EnQue(b1Local);
          inQueueBias.EnQue(biasLocal);
      }
      __aicore__ inline void SplitA()
      {
          ...
      }
      __aicore__ inline void SplitB()
      {
          ...
      }
      __aicore__ inline void Compute()
      {
          LocalTensor<half> a2Local = inQueueA2.DeQue<half>();
          LocalTensor<half> b2Local = inQueueB2.DeQue<half>();
          LocalTensor<float> c1Local = outQueueCO1.AllocTensor<float>();
          MmadParams mmadParams;
          mmadParams.m = m;
          mmadParams.n = n;
          mmadParams.k = k;
          // Matrix multiply
          Mmad(c1Local, a2Local, b2Local, mmadParams); // m*n
          outQueueCO1.EnQue<float>(c1Local);
          inQueueA2.FreeTensor(a2Local);
          inQueueB2.FreeTensor(b2Local);
      }
      __aicore__ inline void CopyOut()
      {
          LocalTensor<float> c1Local = outQueueCO1.DeQue<float>();
          GM_ADDR usrWorkspace = AscendC::GetUserWorkspace(workspace);
          xGm.SetGlobalBuffer((__gm__ float *)(usrWorkspace));
          FixpipeParamsV220 fixpipeParams;
          fixpipeParams.nSize = n;
          fixpipeParams.mSize = m;
          fixpipeParams.srcStride = m;
          fixpipeParams.dstStride = n;
          fixpipeParams.ndNum = 1;
          fixpipeParams.srcNdStride = 0;
          fixpipeParams.dstNdStride = 0;
          // Move the matmul result from CO1 to workspace
          Fixpipe(xGm, c1Local, fixpipeParams);
          outQueueCO1.FreeTensor(c1Local);
      }
      __aicore__ inline void CopyIn1()
      {
          PipeBarrier<PIPE_ALL>();
          // Move the matmul result from workspace to UB
          LocalTensor<float> src0Local = inQueueSrc0.AllocTensor<float>();
          DataCopy(src0Local, xGm, cSize);
          inQueueSrc0.EnQue(src0Local);
      }
      __aicore__ inline void Compute1()
      {
          LocalTensor<float> src0Local = inQueueSrc0.DeQue<float>();
          LocalTensor<float> biasLocal = inQueueBias.DeQue<float>();
          LocalTensor<float> dstLocal = outQueueDst.AllocTensor<float>();
          BinaryRepeatParams addRepeatParams;
          addRepeatParams.dstRepStride = 8;
          addRepeatParams.src0RepStride = 8;
          addRepeatParams.src1RepStride = 0;
          // Bias add
          Add(dstLocal, src0Local, biasLocal, 32, m, addRepeatParams);
          outQueueDst.EnQue<float>(dstLocal);
          inQueueSrc0.FreeTensor(src0Local);
          inQueueBias.FreeTensor(biasLocal);
      }
      __aicore__ inline void CopyOut1()
      {
          ...
      }
  private:
      TPipe pipe;
      TQue<QuePosition::A1, 1> inQueueA1;
      TQue<QuePosition::A2, 1> inQueueA2;
      TQue<QuePosition::B1, 1> inQueueB1;
      TQue<QuePosition::B2, 1> inQueueB2;
      TQue<QuePosition::CO1, 1> outQueueCO1;
      TQue<QuePosition::VECIN, 1> inQueueBias;
      TQue<QuePosition::VECIN, 1> inQueueSrc0;
      TQue<QuePosition::VECOUT, 1> outQueueDst;

      GlobalTensor<half> aGM;
      GlobalTensor<half> bGM;
      GlobalTensor<float> cGM;
      GlobalTensor<float> biasGM;
      GlobalTensor<float> xGm;
      uint16_t m = 32, k = 32, n = 32;
      uint16_t aSize, bSize, cSize;
  ...
After optimization, the operator first moves the bias into BT and then calls Mmad once to compute the matmul together with the bias add.
  ...
  // This sample is for illustration only; it is not complete code and omits some synchronization control code
  public:
      __aicore__ inline KernelSample()
      {
          aSize = m * k;
          bSize = k * n;
          cSize = m * n;
      }
      __aicore__ inline void Init(__gm__ uint8_t *a, __gm__ uint8_t *b, __gm__ uint8_t *bias, __gm__ uint8_t *c)
      {
          aGM.SetGlobalBuffer((__gm__ half *)a);
          bGM.SetGlobalBuffer((__gm__ half *)b);
          cGM.SetGlobalBuffer((__gm__ float *)c);
          biasGM.SetGlobalBuffer((__gm__ float *)bias);
          pipe.InitBuffer(inQueueA1, 1, aSize * sizeof(half));
          pipe.InitBuffer(inQueueA2, 1, aSize * sizeof(half));
          pipe.InitBuffer(inQueueB1, 1, bSize * sizeof(half));
          pipe.InitBuffer(inQueueB2, 2, bSize * sizeof(half));
          pipe.InitBuffer(outQueueCO1, 1, cSize * sizeof(float));
          pipe.InitBuffer(inQueueC1, 1, n * sizeof(float));
          pipe.InitBuffer(outQueueC2, 1, n * sizeof(float));
      }
      __aicore__ inline void Process()
      {
          CopyIn();
          SplitA();
          SplitB();
          SplitBias();
          Compute();
          CopyOut();
      }
  private:
      __aicore__ inline void CopyIn()
      {
          LocalTensor<half> a1Local = inQueueA1.AllocTensor<half>();
          LocalTensor<half> b1Local = inQueueB1.AllocTensor<half>();
          LocalTensor<float> bias1Local = inQueueC1.AllocTensor<float>();

          Nd2NzParams dataCopyA1Params;
          dataCopyA1Params.ndNum = 1;
          dataCopyA1Params.nValue = m;
          dataCopyA1Params.dValue = k;
          dataCopyA1Params.srcNdMatrixStride = 0;
          dataCopyA1Params.srcDValue = k;
          dataCopyA1Params.dstNzC0Stride = m;
          dataCopyA1Params.dstNzNStride = 1;
          dataCopyA1Params.dstNzMatrixStride = 0;
          DataCopy(a1Local, aGM, dataCopyA1Params);

          Nd2NzParams dataCopyB1Params;
          dataCopyB1Params.ndNum = 1;
          dataCopyB1Params.nValue = k;
          dataCopyB1Params.dValue = n;
          dataCopyB1Params.srcNdMatrixStride = 0;
          dataCopyB1Params.srcDValue = n;
          dataCopyB1Params.dstNzC0Stride = k;
          dataCopyB1Params.dstNzNStride = 1;
          dataCopyB1Params.dstNzMatrixStride = 0;
          DataCopy(b1Local, bGM, dataCopyB1Params);
          // Move bias from GM to L1
          DataCopy(bias1Local, biasGM, n);

          inQueueA1.EnQue(a1Local);
          inQueueB1.EnQue(b1Local);
          inQueueC1.EnQue(bias1Local);
      }
      __aicore__ inline void SplitA()
      {
          ...
      }
      __aicore__ inline void SplitB()
      {
          ...
      }
      __aicore__ inline void SplitBias()
      {
          LocalTensor<float> bias1Local = inQueueC1.DeQue<float>();
          LocalTensor<float> bias2Local = outQueueC2.AllocTensor<float>();
          // Move bias from L1 to BT
          DataCopy(bias2Local, bias1Local, { 1, (uint16_t)(n * sizeof(float) / 64), 0, 0 });
          outQueueC2.EnQue<float>(bias2Local);
          inQueueC1.FreeTensor(bias1Local);
      }
      __aicore__ inline void Compute()
      {
          LocalTensor<half> a2Local = inQueueA2.DeQue<half>();
          LocalTensor<half> b2Local = inQueueB2.DeQue<half>();
          LocalTensor<float> bias2Local = outQueueC2.DeQue<float>();
          LocalTensor<float> c1Local = outQueueCO1.AllocTensor<float>();
          MmadParams mmadParams;
          mmadParams.m = m;
          mmadParams.n = n;
          mmadParams.k = k;
          mmadParams.cmatrixInitVal = false;
          // Matrix multiply, accumulating the bias from BT
          Mmad(c1Local, a2Local, b2Local, bias2Local, mmadParams);
          outQueueCO1.EnQue<float>(c1Local);
          inQueueA2.FreeTensor(a2Local);
          inQueueB2.FreeTensor(b2Local);
          outQueueC2.FreeTensor(bias2Local);
      }
      __aicore__ inline void CopyOut()
      {
          LocalTensor<float> c1Local = outQueueCO1.DeQue<float>();
          FixpipeParamsV220 fixpipeParams;
          fixpipeParams.nSize = n;
          fixpipeParams.mSize = m;
          fixpipeParams.srcStride = m;
          fixpipeParams.dstStride = n;

          fixpipeParams.ndNum = 1;
          fixpipeParams.srcNdStride = 0;
          fixpipeParams.dstNdStride = 0;
          Fixpipe(cGM, c1Local, fixpipeParams);
          outQueueCO1.FreeTensor(c1Local);
      }
  private:
      TPipe pipe;
      TQue<QuePosition::A1, 1> inQueueA1;
      TQue<QuePosition::A2, 1> inQueueA2;
      TQue<QuePosition::B1, 1> inQueueB1;
      TQue<QuePosition::B2, 1> inQueueB2;
      TQue<QuePosition::CO1, 1> outQueueCO1;
      TQue<QuePosition::C1, 1> inQueueC1;
      TQue<QuePosition::C2, 1> outQueueC2;

      GlobalTensor<half> aGM;
      GlobalTensor<half> bGM;
      GlobalTensor<float> cGM;
      GlobalTensor<float> biasGM;
      uint16_t m = 32, k = 32, n = 32;
      uint16_t aSize, bSize, cSize;
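The saving scales with the loop count: per iteration, the unfused path performs CO1->workspace, workspace->UB, and UB->GM moves, while the fused path performs a single CO1->GM move. A toy cost model of those global-memory transfer counts (the per-iteration breakdown is an illustrative assumption based on the steps listed above):

```python
def gm_transfers(iterations, fused_bias):
    # Unfused: CO1->workspace, workspace->UB, UB->GM per iteration (3 GM-side moves).
    # Fused: one CO1->GM Fixpipe per iteration.
    per_iter = 1 if fused_bias else 3
    return iterations * per_iter

# For 8 loop iterations, the unfused path issues 16 extra GM-side transfers.
print(gm_transfers(8, fused_bias=False) - gm_transfers(8, fused_bias=True))  # 16
```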

  Efficient on-the-fly quantization by storing quantization parameters in the FP Buffer

When an operator quantizes a matmul result, the quantization parameters can be moved into C2PIPE2GM (the Fixpipe Buffer) so that a single Fixpipe call quantizes the matmul result. Compared with moving the result from CO1 (L0C) to GM, then from GM to UB, and quantizing in UB, this requires fewer data transfers and uses memory more efficiently.
Figure 9 Data flow before optimization

Figure 10 Data flow after optimization

 Before optimization, quantizing the matmul result proceeds as follows:

  • Move the matmul result from CO1 to workspace;
  • Move it from workspace to UB;
  • Move the quantization parameters to UB and run a series of quantization computations there together with the matmul result;
  • Move the final quantized result from UB to GM.
 Compared with the optimized example, this adds the CO1->workspace and workspace->UB transfers as well as the quantization vector computation.
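The three vector instructions in the unoptimized Compute1 below implement, per element: cast the fp32 matmul result to fp16, multiply by the dequantization scale, then cast to int8. A pure-Python sketch of that chain (the truncation and int8 saturation behavior are simplifying assumptions, not the exact hardware rounding):

```python
def quantize(x_fp32, deq_fp16):
    # Cast(fp32 -> fp16), Mul by per-element scale, Cast(fp16 -> int8),
    # modeled as: scale, truncate toward zero, saturate to [-128, 127].
    scaled = [x * d for x, d in zip(x_fp32, deq_fp16)]
    return [max(-128, min(127, int(v))) for v in scaled]

print(quantize([100.0, -300.0, 3.7], [0.5, 0.5, 1.0]))  # [50, -128, 3]
```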
  ...
  // This sample is for illustration only; it is not complete code and omits some synchronization control code
  public:
      __aicore__ inline KernelSample()
      {
          aSize = m * k;
          bSize = k * n;
          cSize = m * n;
      }
      __aicore__ inline void Init(__gm__ uint8_t *a, __gm__ uint8_t *b, __gm__ uint8_t *c, __gm__ uint8_t *deqTensor)
      {
          aGM.SetGlobalBuffer((__gm__ half *)a);
          bGM.SetGlobalBuffer((__gm__ half *)b);
          cGM.SetGlobalBuffer((__gm__ int8_t *)c);
          deqGM.SetGlobalBuffer((__gm__ half *)deqTensor);
          pipe.InitBuffer(inQueueA1, 1, aSize * sizeof(half));
          pipe.InitBuffer(inQueueA2, 1, aSize * sizeof(half));
          pipe.InitBuffer(inQueueB1, 1, bSize * sizeof(half));
          pipe.InitBuffer(inQueueB2, 2, bSize * sizeof(half));
          pipe.InitBuffer(outQueueCO1, 1, cSize * sizeof(float));
          pipe.InitBuffer(inQueueSrc0, 1, cSize * sizeof(float));
          pipe.InitBuffer(inQueueTmp, 1, cSize * sizeof(half));
          pipe.InitBuffer(inQueueDeq, 1, cSize * sizeof(half));
          pipe.InitBuffer(outQueueDst, 1, cSize * sizeof(int8_t));
      }
      __aicore__ inline void Process()
      {
          CopyIn();
          SplitA();
          SplitB();
          Compute();
          CopyOut();
          CopyIn1();
          Compute1();
          CopyOut1();
      }
  private:
      __aicore__ inline void CopyIn()
      {
          LocalTensor<half> a1Local = inQueueA1.AllocTensor<half>();
          LocalTensor<half> b1Local = inQueueB1.AllocTensor<half>();
          LocalTensor<half> deqLocal = inQueueDeq.AllocTensor<half>();

          Nd2NzParams dataCopyA1Params;
          dataCopyA1Params.ndNum = 1;
          dataCopyA1Params.nValue = m;
          dataCopyA1Params.dValue = k;
          dataCopyA1Params.srcNdMatrixStride = 0;
          dataCopyA1Params.srcDValue = k;
          dataCopyA1Params.dstNzC0Stride = m;
          dataCopyA1Params.dstNzNStride = 1;
          dataCopyA1Params.dstNzMatrixStride = 0;
          DataCopy(a1Local, aGM, dataCopyA1Params);

          Nd2NzParams dataCopyB1Params;
          dataCopyB1Params.ndNum = 1;
          dataCopyB1Params.nValue = k;
          dataCopyB1Params.dValue = n;
          dataCopyB1Params.srcNdMatrixStride = 0;
          dataCopyB1Params.srcDValue = n;
          dataCopyB1Params.dstNzC0Stride = k;
          dataCopyB1Params.dstNzNStride = 1;
          dataCopyB1Params.dstNzMatrixStride = 0;
          DataCopy(b1Local, bGM, dataCopyB1Params);
          // Move quantization parameters to UB
          DataCopy(deqLocal, deqGM, cSize);

          inQueueA1.EnQue(a1Local);
          inQueueB1.EnQue(b1Local);
          inQueueDeq.EnQue(deqLocal);
      }
      __aicore__ inline void SplitA()
      {
          ...
      }
      __aicore__ inline void SplitB()
      {
          ...
      }
      __aicore__ inline void Compute()
      {
          LocalTensor<half> a2Local = inQueueA2.DeQue<half>();
          LocalTensor<half> b2Local = inQueueB2.DeQue<half>();
          LocalTensor<float> c1Local = outQueueCO1.AllocTensor<float>();
          MmadParams mmadParams;
          mmadParams.m = m;
          mmadParams.n = n;
          mmadParams.k = k;
          // Matrix multiply
          Mmad(c1Local, a2Local, b2Local, mmadParams); // m*n
          outQueueCO1.EnQue<float>(c1Local);
          inQueueA2.FreeTensor(a2Local);
          inQueueB2.FreeTensor(b2Local);
      }
      __aicore__ inline void CopyOut()
      {
          LocalTensor<float> c1Local = outQueueCO1.DeQue<float>();
          GM_ADDR usrWorkspace = AscendC::GetUserWorkspace(workspace);
          xGm.SetGlobalBuffer((__gm__ float *)(usrWorkspace));
          FixpipeParamsV220 fixpipeParams;
          fixpipeParams.nSize = n;
          fixpipeParams.mSize = m;
          fixpipeParams.srcStride = m;
          fixpipeParams.dstStride = n;
          fixpipeParams.ndNum = 1;
          fixpipeParams.srcNdStride = 0;
          fixpipeParams.dstNdStride = 0;
          // Move the matmul result from CO1 to workspace
          Fixpipe(xGm, c1Local, fixpipeParams);
          outQueueCO1.FreeTensor(c1Local);
      }
      __aicore__ inline void CopyIn1()
      {
          PipeBarrier<PIPE_ALL>();
          // Move the matmul result from workspace to UB
          LocalTensor<float> src0Local = inQueueSrc0.AllocTensor<float>();
          DataCopy(src0Local, xGm, cSize);
          inQueueSrc0.EnQue(src0Local);
      }
      __aicore__ inline void Compute1()
      {
          LocalTensor<float> src0Local = inQueueSrc0.DeQue<float>();
          LocalTensor<half> tmpLocal = inQueueTmp.AllocTensor<half>();
          LocalTensor<half> deqLocal = inQueueDeq.DeQue<half>();
          LocalTensor<int8_t> dstLocal = outQueueDst.AllocTensor<int8_t>();
          // Quantization: fp32 -> fp16 cast, multiply by scale, fp16 -> int8 cast
          Cast(tmpLocal, src0Local, RoundMode::CAST_NONE, cSize);
          LocalTensor<half> tmpHalfBuffer = src0Local.ReinterpretCast<half>();
          Mul(tmpHalfBuffer, tmpLocal, deqLocal, cSize);
          Cast(dstLocal, tmpHalfBuffer, RoundMode::CAST_NONE, cSize);
          outQueueDst.EnQue<int8_t>(dstLocal);
          inQueueSrc0.FreeTensor(src0Local);
          inQueueTmp.FreeTensor(tmpLocal);
          inQueueDeq.FreeTensor(deqLocal);
      }
      __aicore__ inline void CopyOut1()
      {
          ...
      }
  private:
      TPipe pipe;
      TQue<QuePosition::A1, 1> inQueueA1;
      TQue<QuePosition::A2, 1> inQueueA2;
      TQue<QuePosition::B1, 1> inQueueB1;
      TQue<QuePosition::B2, 1> inQueueB2;
      TQue<QuePosition::CO1, 1> outQueueCO1;
      TQue<QuePosition::VECIN, 1> inQueueDeq;
      TQue<QuePosition::VECIN, 1> inQueueSrc0;
      TQue<QuePosition::VECCALC, 1> inQueueTmp;
      TQue<QuePosition::VECOUT, 1> outQueueDst;

      GlobalTensor<half> aGM;
      GlobalTensor<half> bGM;
      GlobalTensor<int8_t> cGM;
      GlobalTensor<half> deqGM;
      GlobalTensor<float> xGm;
      uint16_t m = 32, k = 32, n = 32;
      uint16_t aSize, bSize, cSize;
      ...
 After optimization, the operator moves the quantization parameters into FB (the Fixpipe Buffer) and calls the Fixpipe interface once to quantize the matmul result on the way out.
  ...
  public:
      __aicore__ inline KernelSample()
      {
          aSize = m * k;
          bSize = k * n;
          cSize = m * n;
      }
      __aicore__ inline void Init(__gm__ uint8_t *a, __gm__ uint8_t *b, __gm__ uint8_t *c, __gm__ uint8_t *deqTensor)
      {
          aGM.SetGlobalBuffer((__gm__ half *)a);
          bGM.SetGlobalBuffer((__gm__ half *)b);
          cGM.SetGlobalBuffer((__gm__ int8_t *)c);
          deqGM.SetGlobalBuffer((__gm__ uint64_t *)deqTensor);
          pipe.InitBuffer(inQueueA1, 1, aSize * sizeof(half));
          pipe.InitBuffer(inQueueA2, 1, aSize * sizeof(half));
          pipe.InitBuffer(inQueueB1, 1, bSize * sizeof(half));
          pipe.InitBuffer(inQueueB2, 2, bSize * sizeof(half));
          pipe.InitBuffer(outQueueCO1, 1, cSize * sizeof(float));
          pipe.InitBuffer(inQueueDeq1, 1, cSize * sizeof(uint64_t));
          pipe.InitBuffer(inQueueDeq, 1, cSize * sizeof(uint64_t));
      }
      __aicore__ inline void Process()
      {
          CopyIn();
          SplitA();
          SplitB();
          SplitDeq();
          Compute();
          CopyOut();
      }
  private:
      __aicore__ inline void CopyIn()
      {
          LocalTensor<half> a1Local = inQueueA1.AllocTensor<half>();
          LocalTensor<half> b1Local = inQueueB1.AllocTensor<half>();
          LocalTensor<uint64_t> deq1Local = inQueueDeq1.AllocTensor<uint64_t>();

          Nd2NzParams dataCopyA1Params;
          dataCopyA1Params.ndNum = 1;
          dataCopyA1Params.nValue = m;
          dataCopyA1Params.dValue = k;
          dataCopyA1Params.srcNdMatrixStride = 0;
          dataCopyA1Params.srcDValue = k;
          dataCopyA1Params.dstNzC0Stride = m;
          dataCopyA1Params.dstNzNStride = 1;
          dataCopyA1Params.dstNzMatrixStride = 0;
          DataCopy(a1Local, aGM, dataCopyA1Params);

          Nd2NzParams dataCopyB1Params;
          dataCopyB1Params.ndNum = 1;
          dataCopyB1Params.nValue = k;
          dataCopyB1Params.dValue = n;
          dataCopyB1Params.srcNdMatrixStride = 0;
          dataCopyB1Params.srcDValue = n;
          dataCopyB1Params.dstNzC0Stride = k;
          dataCopyB1Params.dstNzNStride = 1;
          dataCopyB1Params.dstNzMatrixStride = 0;
          DataCopy(b1Local, bGM, dataCopyB1Params);
          // Move quantization parameters from GM to L1
          DataCopy(deq1Local, deqGM, cSize);

          inQueueA1.EnQue(a1Local);
          inQueueB1.EnQue(b1Local);
          inQueueDeq1.EnQue(deq1Local);
      }
      __aicore__ inline void SplitA()
      {
          ...
      }
      __aicore__ inline void SplitB()
      {
          ...
      }
      __aicore__ inline void SplitDeq()
      {
          LocalTensor<uint64_t> deq1Local = inQueueDeq1.DeQue<uint64_t>();
          LocalTensor<uint64_t> deqLocal = inQueueDeq.AllocTensor<uint64_t>();
          // Move quantization parameters from L1 to FB
          DataCopy(deqLocal, deq1Local, { 1, (uint16_t)(cSize * sizeof(uint64_t) / 128), 0, 0 });
          inQueueDeq.EnQue<uint64_t>(deqLocal);
          inQueueDeq1.FreeTensor(deq1Local);
      }
      __aicore__ inline void Compute()
      {
          LocalTensor<half> a2Local = inQueueA2.DeQue<half>();
          LocalTensor<half> b2Local = inQueueB2.DeQue<half>();
          LocalTensor<float> c1Local = outQueueCO1.AllocTensor<float>();
          MmadParams mmadParams;
          mmadParams.m = m;
          mmadParams.n = n;
          mmadParams.k = k;
          // Matrix multiply
          Mmad(c1Local, a2Local, b2Local, mmadParams); // m*n
          outQueueCO1.EnQue<float>(c1Local);
          inQueueA2.FreeTensor(a2Local);
          inQueueB2.FreeTensor(b2Local);
      }
      __aicore__ inline void CopyOut()
      {
          LocalTensor<float> c1Local = outQueueCO1.DeQue<float>();
          LocalTensor<uint64_t> deqLocal = inQueueDeq.DeQue<uint64_t>();
          SetFixpipeNz2ndFlag(1, 0, 0);
          DataCopyCO12DstParams dataCopyParams;
          dataCopyParams.nSize = n;
          dataCopyParams.mSize = m;
          dataCopyParams.srcStride = m;
          dataCopyParams.dstStride = n;
          dataCopyParams.quantPre = QuantMode_t::VQF322B8_PRE;
          dataCopyParams.nz2ndEn = true;
          // Move out the matmul result, quantized on the fly
          DataCopy(cGM, c1Local, dataCopyParams);
          outQueueCO1.FreeTensor(c1Local);
      }

  private:
      TPipe pipe;
      TQue<QuePosition::A1, 1> inQueueA1;
      TQue<QuePosition::A2, 1> inQueueA2;
      TQue<QuePosition::B1, 1> inQueueB1;
      TQue<QuePosition::B2, 1> inQueueB2;
      TQue<QuePosition::C1, 1> inQueueDeq1;
      TQue<QuePosition::C2PIPE2GM, 1> inQueueDeq;
      TQue<QuePosition::CO1, 1> outQueueCO1;
      GlobalTensor<half> aGM;
      GlobalTensor<half> bGM;
      GlobalTensor<int8_t> cGM;
      GlobalTensor<uint64_t> deqGM;
      uint16_t m = 32, k = 32, n = 32;
      uint16_t aSize, bSize, cSize;
      ...
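With the parameters resident in FB, quantization happens on the fly while the result is moved from CO1 to GM, so no UB buffers or vector instructions are needed. A pure-Python model of such a fused copy-out (a single per-element scale with assumed truncation toward zero and int8 saturation; the exact rounding of VQF322B8_PRE is not modeled):

```python
def fused_copy_out(c_fp32, scale):
    # Fixpipe-style on-the-fly quantization: each element is scaled and
    # saturated to the int8 range while being moved out, with no
    # intermediate UB buffer.
    out = []
    for v in c_fp32:
        q = int(v * scale)               # truncate toward zero (assumed)
        out.append(max(-128, min(127, q)))  # saturate to [-128, 127]
    return out

print(fused_copy_out([10.0, 500.0, -7.5], 0.25))  # [2, 125, -1]
```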
More learning resources

 For more Ascend C operator performance optimization techniques and practice cases, visit: 昇腾社区 (Ascend Community) — Ascend C introductory courses, learning resources, and operator documentation.
 
