[C#] 对24位图像举行程度翻转(FlipX)的跨平台SIMD硬件加快向量算法-第2部分 ...

伤心客 · 2025-3-16 22:01:00

在本主题的上一篇文章里，给各人解说了24位图像程度翻转(FlipX)算法。但该文章重要是为了介绍 YShuffleX3Kernel 的使用，该算法性能并不是最优的。于是本文将介绍怎样使用 YShuffleX2Kernel 来优化程序。而且Imageshop在留言区给了一份C语言的、基于Sse系列指令集实现的代码，正好一起对比一下。
一、算法思路

1.1 瓶颈分析

当硬件没有多向量换位（shuffle）的指令时，一般来说需要3条换位指令才气实现 YShuffleX3Kernel 方法。
对于24位图像程度翻转来说，每次内循环是处置惩罚3个向量长度的数据，需调用3次 YShuffleX3Kernel 方法，故共调用了 3*3=9 条换位指令。
且在使用X86平台的Avx2指令集时，因它没有提供“跨小道（lane）换位”指令，导致需要用2条换位指令来实现“向量内全范围的换位”。于是对于24位图像程度翻转来说，故共调用了 3*3*2=18 条换位指令。这个数量很大了。
于是，若能低落换位指令的数量，将能提升程序的性能。
1.2 优化思路

X86架构的Sse指令集，与Arm架构的AdvSimd指令集里均使用128位的向量寄存器，即16个字节。
24位像素是3个字节一组。对于16个字节向量寄存器来说，仅能完整存放 5个像素（(int)(16 / 3) = 5），占据了 3*5=15 个字节。剩余的1个字节超出向量大小的界限了，位于另一个向量中。
换个角度来看，使用1条单向量的换位指令，能将16个字节中的15个字节做好程度翻转。仅剩1个字节的数据不正确，此时可以另想办法来修正。
一种思路是用标量的内存读写语句来修正这些不正确的字节。这就是 Imageshop的Sse版算法的思路，对于每次处置惩罚48个字节的内循环，使用标量的内存读写语句修正了6个字节的数据。这种办法能将换位指令数量降到了最低，但代价是增加了标量的内存读写工作。需要举行基准测试，实际对比性能。
另一种思路是再用1条换位指令来盘算剩余字节的数据，然后使用条件掩码将这2种效果（15个字节+剩余1字节）归并。而且大多数架构（X86、Arm）的换位指令带有清零能力，警惕的调整掩码，能使无效字节为0，这样仅需 or 指令就能将2种效果归并了。
上面这种盘算，正好是“2向量的换位”操作。X86架构的Avx512系列指令集，以及Arm架构的AdvSimd指令集，均提供了这样的指令。而且 .NET 8.0 支持了这些指令的内在函数（Intrinsic Functions）。
手动调用内在函数是很繁琐的，且难以跨平台。于是 VectorTraits 提供了 YShuffleX2Kernel 方法。它会检查硬件是否支持“2向量的换位”指令，如支持便会使用各平台的这些指令；若不支持，便会主动盘算好掩码，调用2条换位指令并组合。
也就是说，当硬件没有多向量换位的指令时，YShuffleX2Kernel 比起 YShuffleX3Kernel，能少使用1个换位指令。从而低落了开销，优化了性能。
1.3 盘算索引

对于24位图像程度翻转来说，每次内循环是处置惩罚3个向量长度的数据，此时需要3个索引向量。其中索引0、索引2，仅需访问2个数据向量，正好能使用 YShuffleX2Kernel。而中间的索引1，默认情况会访问3个数据向量。下面的表格具体说明白这些情况。
Nameoffset01234567891011121314151617181920212223242526272829303132333435363738394041424344454647Indice045464742434439404136373833343530313227282924252621222318192015161712131491011678345012Indice01629303126272823242520212217181914Indice1E1615161112138910567234-10Indice1A031322728292425262122231819201516Indice1B1516171213149101167834501Indice201712131491011678345012变量说明：

Indice: YShuffleX3Kernel 的索引。3个索引均写在同一行。
Indice0: YShuffleX2Kernel 的索引0。
Indice1E: 演示了错误方案下的索引1。
Indice1A: 索引1使用A方案。
Indice1B: 索引1使用B方案。
Indice2: YShuffleX2Kernel 的索引2。

索引2（Indice2）最简单，直接使用 YShuffleX3Kernel的索引2就行。因它仅需访问第0个与第1个输入向量。
索引0（Indice2）稍微复杂一点。它仅需访问第1个与第2个输入向量，于是从数据所在的偏移量（offset）来说，偏移量为16。即向量寄存器的字节数（Vector.Count）。
索引1最麻烦，由于它需要访问3个向量。于是上表给出了 Indice1E 这一行，将 offset 设为16，于是可以观察到它不仅有小于下界的值（-1），还有凌驾上界的值（16）。有2种思路来解决这一困难：

A. 干脆用 YShuffleX3Kernel 来盘算它。于是可以直接使用 YShuffleX3Kernel 时的索引1，即 Indice1A。也就说，内循环会调用2次 YShuffleX2Kernel，和1次 YShuffleX3Kernel，比起调用3次YShuffleX3Kernel的开销低。
B. 干脆为它提供的输入向量。此时用“偏移量15”来加载2笔一连的向量数据，便能用 YShuffleX2Kernel 来盘算了。由于现在数据加载所在很近（偏移15与偏移16很近），且现在处置惩罚器的高速缓存（Cache）技术很成熟，使得这种加载的开销很低。

随后可以通过基准测试，来看哪种思路的性能更好。
二、算法实现

2.1 程序里盘算索引

首先需要盘算索引。可以在先前YShuffleX3Kernel的索引盘算代码上修改，关键是处置惩罚好 offset。

// -- Indices of YShuffleX3Kernel
private static readonly Vector<byte> _shuffleIndices0;
private static readonly Vector<byte> _shuffleIndices1;
private static readonly Vector<byte> _shuffleIndices2;
// -- Indices of YShuffleX2Kernel
private static readonly byte _shuffleX2Offset0 = (byte)Vector<byte>.Count;
private static readonly byte _shuffleX2Offset1A = 0;
private static readonly byte _shuffleX2Offset1B = (byte)(Vector<byte>.Count / 3 * 3);
private static readonly byte _shuffleX2Offset2 = 0;
private static readonly Vector<byte> _shuffleX2Indices0;
private static readonly Vector<byte> _shuffleX2Indices1A; // Need YShuffleX3Kernel
private static readonly Vector<byte> _shuffleX2Indices1B;
private static readonly Vector<byte> _shuffleX2Indices2;
static ImageFlipXOn24bitBenchmark() {
const int cbPixel = 3; // 24 bit: Bgr24, Rgb24.
int vectorWidth = Vector<byte>.Count;
int blockSize = vectorWidth * cbPixel;
Span<byte> buf = stackalloc byte[blockSize];
for (int i = 0; i < blockSize; i++) {
int m = i / cbPixel;
int n = i % cbPixel;
buf[i] = (byte)((vectorWidth - 1 - m) * cbPixel + n);
}
_shuffleIndices0 = Vectors.Create(buf);
_shuffleIndices1 = Vectors.Create(buf.Slice(vectorWidth * 1));
_shuffleIndices2 = Vectors.Create(buf.Slice(vectorWidth * 2));
// -- Indices of YShuffleX2Kernel
_shuffleX2Indices0 = Vector.Subtract(_shuffleIndices0, new Vector<byte>(_shuffleX2Offset0));
_shuffleX2Indices1A = _shuffleIndices1; // _shuffleX2Offset1A is 0
_shuffleX2Indices1B = Vector.Subtract(_shuffleIndices1, new Vector<byte>(_shuffleX2Offset1B));
_shuffleX2Indices2 = _shuffleIndices2; // _shuffleX2Offset2 is 0
}

复制代码

可见，代码改动不多。仅需使用 Vector.Subtract 做向量减法，减去偏移量。
2.2 思路A的实现

根据上面的思路A，编写代码。源代码如下。
[code]public static unsafe void UseVectorsX2AArgsDoBatch(byte* pSrc, int strideSrc, int width, int height, byte* pDst, int strideDst) { const int cbPixel = 3; // 24 bit: Bgr24, Rgb24. Vectors.YShuffleX2Kernel_Args(_shuffleX2Indices0, out var indices0arg0, out var indices0arg1, out var indices0arg2, out var indices0arg3); Vectors.YShuffleX3Kernel_Args(_shuffleX2Indices1A, out var indices1arg0, out var indices1arg1, out var indices1arg2, out var indices1arg3); Vectors.YShuffleX2Kernel_Args(_shuffleX2Indices2, out var indices2arg0, out var indices2arg1, out var indices2arg2, out var indices2arg3); int vectorWidth = Vector.Count; if (width

		自动登录	找回密码
密码			立即注册

[C#] 对24位图像举行程度翻转(FlipX)的跨平台SIMD硬件加快向量算法-第2部分 ...

0 个回复

快速回复

楼主热帖

标签云

浏览过的版块