IT评测·应用市场-qidao123.com
标题:
dotnet C# 在差别的机器 CPU 型号上的基准性能测试
[打印本页]
作者:
饭宝
时间:
2025-1-16 10:02
标题:
dotnet C# 在差别的机器 CPU 型号上的基准性能测试
本文将记录我在多个差别的机器上,在差别的 CPU 型号上,执行雷同的我编写的 dotnet 的 Benchmark 的代码,测试差别的 CPU 型号对 C# 系的优化程度。本文非严谨测试,数值只有相对意义
以下是我的测试效果,对应的测试代码放在
github
上,可以在本文末尾找到下载代码的方法
我非常推荐你自己拉取代码,在你自己的设备上跑一下,测试其性能。且在开始之前,期望你已经把握了基础的性能测试知识,避免出现诡异的结论
本文的测试将围绕着尽可能多的覆盖基础 CPU 指令以及基础逻辑行为。基础的 CPU 指令的性能测试已经有许多前辈测试过了,我这里重点测试的是各个 C# 系的上层业务行为下,所调用的多个 CPU 指令的最终性能影响。额外的也覆盖 CPU 缓存,逻辑分支命中,方法参数堆栈传递等的性能。本文的测试重点不在于 C# 系的雷同功能的多个差别实现之间的性能对比,重点在于雷同的代码在差别的 CPU 型号、内存、体系上的性能差异,正如此需求所述,本文非严谨测试,测试效果的数值只有相对意义
数组创建
英特尔 13th Gen Intel Core i7-13700K
以下是在我开辟机上跑的,我开了几百个进程,有比较多干扰,但是题目不大,因为 i7-13700K 依然性能遥遥领先。等后续找个空闲的机器,再跑一次比较正确的性能测试
BenchmarkDotNet v0.13.12, Windows 11 (10.0.22631.3447/23H2/2023Update/SunValley3)
13th Gen Intel Core i7-13700K, 1 CPU, 24 logical and 16 physical cores
.NET SDK 8.0.204
[Host] : .NET 8.0.4 (8.0.424.16909), X64 RyuJIT AVX2
Job-AXOZTJ : .NET 8.0.4 (8.0.424.16909), X64 RyuJIT AVX2
RunStrategy=Throughput
复制代码
MethodArraySizeMeanErrorStdDevMedianRatioRatioSDNewArray103.873 ns0.1146 ns0.2417 ns3.777 ns1.000.00GCZeroInitialized1012.234 ns0.2815 ns0.4382 ns12.168 ns3.150.21GCZeroUninitialized104.470 ns0.1491 ns0.4056 ns4.354 ns1.140.13NewArrayWithRandomVisit1012.012 ns0.2679 ns0.2506 ns11.941 ns3.090.18NewArrayWithOrdinalVisit109.839 ns0.3379 ns0.9803 ns9.635 ns2.580.26NewArray10011.875 ns0.1932 ns0.2444 ns11.813 ns1.000.00GCZeroInitialized10021.980 ns0.4524 ns0.8931 ns21.820 ns1.880.08GCZeroUninitialized10012.126 ns0.2769 ns0.5201 ns11.953 ns1.040.05NewArrayWithRandomVisit10047.344 ns0.9635 ns2.1351 ns46.572 ns4.030.24NewArrayWithOrdinalVisit10075.207 ns1.4285 ns1.3363 ns75.364 ns6.330.15NewArray1000110.197 ns2.1602 ns2.0206 ns109.619 ns1.000.00GCZeroInitialized1000116.560 ns2.0796 ns1.8435 ns116.604 ns1.060.03GCZeroUninitialized100033.476 ns0.5921 ns0.5538 ns33.643 ns0.300.01NewArrayWithRandomVisit1000208.835 ns4.1962 ns8.8512 ns205.699 ns1.920.09NewArrayWithOrdinalVisit1000620.850 ns11.5406 ns10.7951 ns619.304 ns5.640.15NewArray10000996.853 ns21.9389 ns61.8790 ns970.393 ns1.000.00GCZeroInitialized10000996.704 ns20.8764 ns58.5397 ns974.900 ns1.000.08GCZeroUninitialized1000063.200 ns1.0544 ns0.9863 ns63.315 ns0.060.00NewArrayWithRandomVisit100001,242.151 ns24.2642 ns38.4856 ns1,233.944 ns1.210.07NewArrayWithOrdinalVisit100006,068.245 ns90.8508 ns84.9819 ns6,076.727 ns5.790.34NewArray1000007,381.046 ns137.9635 ns147.6194 ns7,372.520 ns1.000.00GCZeroInitialized1000007,214.089 ns85.2068 ns71.1515 ns7,209.220 ns0.970.02GCZeroUninitialized1000007,347.661 ns146.3643 ns174.2363 ns7,306.838 ns1.000.03NewArrayWithRandomVisit1000008,456.669 ns164.5726 ns219.6997 ns8,517.366 ns1.140.05NewArrayWithOrdinalVisit100000129,749.709 ns2,408.4302 ns2,773.5518 ns128,963.159 ns17.570.55NewArray100000059,752.036 ns1,194.7579 ns1,929.3113 ns59,414.325 ns1.000.00GCZeroInitialized100000060,008.303 ns1,188.0164 ns1,778.1671 ns59,378.000 ns1.010.04GCZeroUninitialized100000058,868.279 ns1,023.4279 ns957.3151 ns58,724.731 ns0.970.04NewArrayWithRandomVisit100000056,399.609 ns1,068.5479 ns999.5204 ns56,296.948 ns0.930.03NewArrayWithOrdinalVisit10000001,314,841.960 ns26,155.6618 ns27,986.2651 ns1,313,674.414 ns21.921.00
兆芯 ZHAOXIN KaiXian KX-U6780A
BenchmarkDotNet v0.13.12, UnionTech OS Desktop 20 E
ZHAOXIN KaiXian KX-U6780A2.7GHz (Max: 2.70GHz), 1 CPU, 8 logical and 8 physical cores
.NET SDK 8.0.204
[Host] : .NET 8.0.4 (8.0.424.16909), X64 RyuJIT AVX
Job-YPUGMN : .NET 8.0.4 (8.0.424.16909), X64 RyuJIT AVX
RunStrategy=Throughput
复制代码
MethodArraySizeMeanErrorStdDevMedianRatioRatioSDNewArray1040.20 ns0.977 ns1.491 ns39.98 ns1.000.00GCZeroInitialized10141.12 ns2.996 ns6.051 ns139.67 ns3.540.18GCZeroUninitialized1048.72 ns0.849 ns0.663 ns48.91 ns1.190.05NewArrayWithRandomVisit10195.75 ns1.082 ns0.845 ns195.65 ns4.770.16NewArrayWithOrdinalVisit1072.42 ns1.513 ns2.400 ns72.45 ns1.800.08NewArray100135.07 ns2.892 ns6.100 ns135.41 ns1.000.00GCZeroInitialized100228.42 ns4.662 ns10.135 ns228.83 ns1.700.11GCZeroUninitialized100137.26 ns2.939 ns5.519 ns136.70 ns1.020.06NewArrayWithRandomVisit100572.02 ns11.660 ns19.157 ns568.34 ns4.260.27NewArrayWithOrdinalVisit100467.29 ns9.357 ns13.117 ns464.49 ns3.470.21NewArray10001,037.70 ns20.377 ns54.742 ns1,031.50 ns1.000.00GCZeroInitialized10001,127.93 ns22.581 ns59.091 ns1,125.79 ns1.090.07GCZeroUninitialized1000653.93 ns6.239 ns4.871 ns652.04 ns0.600.02NewArrayWithRandomVisit10002,375.21 ns47.088 ns100.349 ns2,352.11 ns2.270.13NewArrayWithOrdinalVisit10004,474.90 ns87.887 ns107.933 ns4,453.16 ns4.190.28NewArray100009,586.62 ns189.501 ns369.608 ns9,657.74 ns1.000.00GCZeroInitialized100009,767.26 ns194.643 ns462.590 ns9,811.53 ns1.020.07GCZeroUninitialized100004,093.63 ns80.993 ns143.965 ns4,026.86 ns0.430.02NewArrayWithRandomVisit1000013,908.10 ns202.573 ns169.158 ns13,928.15 ns1.470.06NewArrayWithOrdinalVisit1000043,057.16 ns854.132 ns1,495.943 ns42,914.21 ns4.500.25NewArray10000063,542.13 ns576.256 ns510.836 ns63,519.28 ns1.000.00GCZeroInitialized10000066,357.64 ns1,312.089 ns2,118.779 ns66,043.66 ns1.030.03GCZeroUninitialized10000063,638.29 ns1,241.493 ns1,477.909 ns63,270.73 ns1.010.03NewArrayWithRandomVisit10000076,609.50 ns1,501.442 ns1,729.063 ns75,958.21 ns1.210.03NewArrayWithOrdinalVisit100000665,286.65 ns9,295.620 ns7,762.264 ns662,915.19 ns10.470.16NewArray1000000461,130.99 ns9,000.698 ns10,004.252 ns461,306.23 ns1.000.00GCZeroInitialized1000000459,810.29 ns8,893.401 ns10,586.961 ns455,791.25 ns1.000.03GCZeroUninitialized1000000456,245.03 ns8,819.606 ns12,363.856 ns452,252.89 ns0.990.04NewArrayWithRandomVisit1000000497,132.01 ns9,841.562 ns12,796.810 ns490,990.22 ns1.080.03NewArrayWithOrdinalVisit10000006,742,537.03 ns48,986.470 ns38,245.414 ns6,732,321.64 ns14.510.31
飞腾腾锐 Phytium D2000
BenchmarkDotNet v0.13.12, Kylin V10 SP1
Phytium,D2000/8 E8C, 8 logical cores
.NET SDK 8.0.204
[Host] : .NET 8.0.4 (8.0.424.16909), Arm64 RyuJIT AdvSIMD
Job-NHRLJG : .NET 8.0.4 (8.0.424.16909), Arm64 RyuJIT AdvSIMD
RunStrategy=Throughput
复制代码
MethodArraySizeMeanErrorStdDevRatioRatioSD
NewArray
10
22.18 ns
0.149 ns
0.132 ns
1.00
0.00
GCZeroInitialized1092.43 ns0.564 ns0.440 ns4.170.02GCZeroUninitialized1025.68 ns0.248 ns0.243 ns1.160.01NewArrayWithRandomVisit10108.25 ns0.299 ns0.250 ns4.880.03NewArrayWithOrdinalVisit1034.55 ns0.126 ns0.112 ns1.560.01
NewArray
100
76.35 ns
0.941 ns
0.880 ns
1.00
0.00
GCZeroInitialized100163.69 ns0.952 ns0.743 ns2.140.03GCZeroUninitialized10080.21 ns0.528 ns0.468 ns1.050.02NewArrayWithRandomVisit100421.53 ns1.679 ns1.402 ns5.520.06NewArrayWithOrdinalVisit100300.66 ns1.274 ns1.130 ns3.940.05
NewArray
1000
640.11 ns
4.059 ns
3.598 ns
1.00
0.00
GCZeroInitialized1000672.06 ns3.242 ns3.032 ns1.050.01GCZeroUninitialized1000483.70 ns2.202 ns1.952 ns0.760.01NewArrayWithRandomVisit10001,765.24 ns6.469 ns5.402 ns2.760.02NewArrayWithOrdinalVisit10002,850.39 ns12.971 ns12.133 ns4.450.03
NewArray
10000
5,219.58 ns
36.810 ns
32.631 ns
1.00
0.00
GCZeroInitialized100005,280.52 ns27.550 ns24.422 ns1.010.01GCZeroUninitialized100002,640.52 ns44.642 ns34.853 ns0.510.01NewArrayWithRandomVisit100008,992.89 ns20.367 ns19.052 ns1.720.01NewArrayWithOrdinalVisit1000026,983.43 ns355.773 ns297.086 ns5.170.05
NewArray
100000
45,506.61 ns
431.868 ns
403.970 ns
1.00
0.00
GCZeroInitialized10000045,543.14 ns432.449 ns404.513 ns1.000.01GCZeroUninitialized10000044,461.84 ns331.168 ns309.775 ns0.980.01NewArrayWithRandomVisit10000057,232.01 ns318.770 ns298.178 ns1.260.01NewArrayWithOrdinalVisit100000445,380.51 ns2,904.888 ns2,425.713 ns9.780.10
NewArray
1000000
318,862.16 ns
1,899.267 ns
1,683.651 ns
1.00
0.00
GCZeroInitialized1000000319,510.71 ns4,669.274 ns3,645.462 ns1.000.01GCZeroUninitialized1000000314,884.17 ns5,637.859 ns4,401.669 ns0.990.02NewArrayWithRandomVisit1000000357,843.40 ns3,063.527 ns2,865.625 ns1.120.01NewArrayWithOrdinalVisit10000004,547,465.54 ns15,355.309 ns12,822.379 ns14.280.05
NewArray
1000000000
1,541,406,672.88 ns
35,733,853.844 ns
102,527,125.216 ns
1.000
0.00
GCZeroInitialized10000000001,548,370,215.42 ns38,407,327.571 ns110,197,822.498 ns1.0090.10GCZeroUninitialized10000000001,486,735.21 ns28,605.254 ns26,757.372 ns0.0010.00NewArrayWithRandomVisit10000000001,590,271,119.60 ns33,473,585.461 ns96,041,991.522 ns1.0360.09NewArrayWithOrdinalVisit10000000003,861,833,983.54 ns2,367,487.064 ns1,976,958.923 ns2.5460.16以上的飞腾腾锐 Phytium D2000 最后的测试数据预计是不正常的
数组拷贝
测试维度
到场测试的内容如下:
CopyByFor : 利用 for 循环进行拷贝数组
Memcpy : 利用尺度 C 提供的 memcpy 函数进行拷贝,在 linux 下利用 libc.so.6 导出函数,在 windows 下利用 msvcrt.dll 导出函数。这处于非常裸露的方式,更具体请参阅下文的数据说明内容
CopyBlockUnaligned : 利用 dotnet 自带的 Unsafe.CopyBlockUnaligned 方法进行数组拷贝
英特尔 13th Gen Intel Core i7-13700K
数组较小
小于 1000 的数组时,存在较大 P/Invoke 干扰,于是决定最小设置为 1000 的值
BenchmarkDotNet v0.13.12, Windows 11 (10.0.22631.3447/23H2/2023Update/SunValley3)
13th Gen Intel Core i7-13700K, 1 CPU, 24 logical and 16 physical cores
.NET SDK 8.0.204
[Host] : .NET 8.0.4 (8.0.424.16909), X64 RyuJIT AVX2
Job-GCHWHL : .NET 8.0.4 (8.0.424.16909), X64 RyuJIT AVX2
RunStrategy=Throughput
复制代码
MethodsourcedestMeanErrorStdDevRatio
CopyByFor
Int32[10000]
Int32[10000]
1,958.98 ns
8.391 ns
7.007 ns
1.000
MemcpyInt32[10000]Int32[10000]609.35 ns3.266 ns3.055 ns0.311CopyBlockUnalignedInt32[10000]Int32[10000]577.84 ns1.391 ns1.301 ns0.295
CopyByFor
Int32[1000]
Int32[1000]
202.09 ns
0.376 ns
0.352 ns
0.103
MemcpyInt32[1000]Int32[1000]32.21 ns0.323 ns0.302 ns0.016CopyBlockUnalignedInt32[1000]Int32[1000]19.19 ns0.067 ns0.059 ns0.010根据上述测试数据可以看到,即使在较小数据量环境下,依然 memcpy 和 Unsafe.CopyBlockUnaligned 比 for 速率快
数组较大
BenchmarkDotNet v0.13.12, Windows 11 (10.0.22631.3447/23H2/2023Update/SunValley3)
13th Gen Intel Core i7-13700K, 1 CPU, 24 logical and 16 physical cores
.NET SDK 8.0.204
[Host] : .NET 8.0.4 (8.0.424.16909), X64 RyuJIT AVX2
Job-DBDADP : .NET 8.0.4 (8.0.424.16909), X64 RyuJIT AVX2
RunStrategy=Throughput
复制代码
MethodsourcedestMeanErrorStdDevMedianRatioRatioSD
CopyByFor
Int32[100000000]
Int32[100000000]
41,348,684.32 ns
751,207.515 ns
1,028,261.326 ns
41,102,646.15 ns
1.000
0.00
MemcpyInt32[100000000]Int32[100000000]27,086,427.67 ns738,121.867 ns2,057,588.736 ns26,318,143.75 ns0.6750.05CopyBlockUnalignedInt32[100000000]Int32[100000000]24,020,801.37 ns467,035.642 ns458,691.448 ns23,894,810.94 ns0.5790.02
CopyByFor
Int32[10000000]
Int32[10000000]
3,800,486.40 ns
69,523.151 ns
162,508.123 ns
3,748,857.23 ns
0.092
0.01
MemcpyInt32[10000000]Int32[10000000]2,313,413.90 ns75,362.059 ns208,827.911 ns2,248,826.17 ns0.0580.01CopyBlockUnalignedInt32[10000000]Int32[10000000]2,005,075.29 ns55,131.653 ns149,989.727 ns1,925,467.19 ns0.0490.00
CopyByFor
Int32[1000000]
Int32[1000000]
201,416.81 ns
1,630.278 ns
1,524.963 ns
200,902.27 ns
0.005
0.00
MemcpyInt32[1000000]Int32[1000000]104,570.31 ns3,304.068 ns9,319.184 ns100,412.65 ns0.0030.00CopyBlockUnalignedInt32[1000000]Int32[1000000]99,385.15 ns1,824.888 ns1,617.716 ns99,135.09 ns0.0020.00
CopyByFor
Int32[10000]
Int32[10000]
1,958.87 ns
4.267 ns
3.783 ns
1,959.42 ns
0.000
0.00
MemcpyInt32[10000]Int32[10000]624.06 ns4.451 ns4.164 ns622.60 ns0.0000.00CopyBlockUnalignedInt32[10000]Int32[10000]581.32 ns2.044 ns1.912 ns581.53 ns0.0000.00
CopyByFor
Int32[1000]
Int32[1000]
201.05 ns
0.678 ns
0.635 ns
201.05 ns
0.000
0.00
MemcpyInt32[1000]Int32[1000]32.12 ns0.638 ns0.683 ns32.10 ns0.0000.00CopyBlockUnalignedInt32[1000]Int32[1000]21.02 ns0.090 ns0.085 ns21.04 ns0.0000.00
兆芯 ZHAOXIN KaiXian KX-U6780A
数组较小
BenchmarkDotNet v0.13.12, UnionTech OS Desktop 20 E
ZHAOXIN KaiXian KX-U6780A2.7GHz (Max: 2.70GHz), 1 CPU, 8 logical and 8 physical cores
.NET SDK 8.0.204
[Host] : .NET 8.0.4 (8.0.424.16909), X64 RyuJIT AVX
Job-SBDPDU : .NET 8.0.4 (8.0.424.16909), X64 RyuJIT AVX
RunStrategy=Throughput
复制代码
MethodsourcedestMeanErrorStdDevMedianRatioRatioSD
CopyByFor
Int32[10000]
Int32[10000]
14.814 us
0.1734 us
0.1537 us
14.785 us
1.00
0.00
MemcpyInt32[10000]Int32[10000]15.329 us0.2950 us0.5167 us15.313 us1.040.04CopyBlockUnalignedInt32[10000]Int32[10000]13.125 us0.5590 us1.6482 us13.188 us0.940.09
CopyByFor
Int32[1000]
Int32[1000]
1.127 us
0.0226 us
0.0211 us
1.127 us
0.08
0.00
MemcpyInt32[1000]Int32[1000]2.152 us0.0571 us0.1675 us2.197 us0.130.02CopyBlockUnalignedInt32[1000]Int32[1000]2.297 us0.0453 us0.0863 us2.279 us0.160.01
数组较大
BenchmarkDotNet v0.13.12, UnionTech OS Desktop 20 E
ZHAOXIN KaiXian KX-U6780A2.7GHz (Max: 2.70GHz), 1 CPU, 8 logical and 8 physical cores
.NET SDK 8.0.204
[Host] : .NET 8.0.4 (8.0.424.16909), X64 RyuJIT AVX
Job-KKBWNV : .NET 8.0.4 (8.0.424.16909), X64 RyuJIT AVX
RunStrategy=Throughput
复制代码
MethodsourcedestMeanErrorStdDevMedianRatioRatioSD
CopyByFor
Int32[100000000]
Int32[100000000]
334,741.708 μs
7,661.2780 μs
22,469.2022 μs
332,150.996 μs
1.000
0.00
MemcpyInt32[100000000]Int32[100000000]164,233.004 μs3,256.7894 μs8,406.8134 μs161,660.880 μs0.4930.04CopyBlockUnalignedInt32[100000000]Int32[100000000]164,128.312 μs3,671.9104 μs10,826.7108 μs162,440.250 μs0.4920.05
CopyByFor
Int32[10000000]
Int32[10000000]
33,404.753 μs
663.0404 μs
1,687.6494 μs
32,963.932 μs
0.100
0.01
MemcpyInt32[10000000]Int32[10000000]23,405.518 μs1,142.2886 μs3,350.1346 μs24,879.320 μs0.0700.01CopyBlockUnalignedInt32[10000000]Int32[10000000]24,981.451 μs498.7301 μs899.3133 μs24,921.681 μs0.0750.00
CopyByFor
Int32[1000000]
Int32[1000000]
5,036.027 μs
100.2153 μs
195.4623 μs
5,014.961 μs
0.015
0.00
MemcpyInt32[1000000]Int32[1000000]2,585.947 μs51.0945 μs106.6533 μs2,601.145 μs0.0080.00CopyBlockUnalignedInt32[1000000]Int32[1000000]2,529.769 μs50.4126 μs98.3259 μs2,516.467 μs0.0080.00
CopyByFor
Int32[10000]
Int32[10000]
13.663 μs
0.2509 μs
0.2224 μs
13.680 μs
0.000
0.00
MemcpyInt32[10000]Int32[10000]10.112 μs0.1976 μs0.2957 μs10.131 μs0.0000.00CopyBlockUnalignedInt32[10000]Int32[10000]10.010 μs0.1742 μs0.1630 μs9.964 μs0.0000.00
CopyByFor
Int32[1000]
Int32[1000]
1.088 μs
0.0058 μs
0.0045 μs
1.089 μs
0.000
0.00
MemcpyInt32[1000]Int32[1000]1.358 μs0.0266 μs0.0364 μs1.355 μs0.0000.00CopyBlockUnalignedInt32[1000]Int32[1000]1.349 μs0.0267 μs0.0461 μs1.334 μs0.0000.00
数据说明
通过数据对比 Intel 和 兆芯 以上测试数据,可以看到在 Int32[10000] 的测试数据集内里,轻松就可以看到 Intel 比 兆芯 快了 10 倍,如下图所示
在如下图的对比 Intel 和 兆芯 的对较大的数组进行拷贝的性能,可以看到 Intel 平台也的确能够比 兆芯 快出 10 倍的性能
具体的性能比较如下
方法数组长度Intel兆芯Intel比兆芯兆芯比IntelCopyByForInt32[100000000]41,348,684.32334,741,708.000.12352414818.095583052MemcpyInt32[100000000]27,086,427.67164,233,004.000.16492682356.063295094CopyBlockUnalignedInt32[100000000]24,020,801.37164,128,312.000.14635379536.832757553CopyByForInt32[10000000]3,800,486.4033,404,753.000.11377082788.789599405MemcpyInt32[10000000]2,313,413.9023,405,518.000.098840534110.11730672CopyBlockUnalignedInt32[10000000]2,005,075.2924,981,451.000.080262563212.45910871CopyByForInt32[1000000]201,416.815,036,027.000.039995180725.00301241MemcpyInt32[1000000]104,570.312,585,947.000.040437916924.72926589CopyBlockUnalignedInt32[1000000]99,385.152,529,769.000.039286255025.45419512CopyByForInt32[10000]1,958.8713,663.000.14337041656.974939634MemcpyInt32[10000]624.0610,112.000.061714794316.20357017CopyBlockUnalignedInt32[10000]581.3210,010.000.058073926117.21943164CopyByForInt32[1000]201.051,088.000.18478860295.411589157MemcpyInt32[1000]32.121,358.000.023652430042.27895392CopyBlockUnalignedInt32[1000]21.021,349.000.015581912564.17697431更具体的对 兆芯 的分析:在对较小的数组进行拷贝,利用 for 进行拷贝的速率比尺度 C 的 memcpy 函数快,利用 for 循环进行拷贝与 dotnet 的 Unsafe.CopyBlockUnaligned 差不多。而在 Intel 平台下,无论是 尺度 C 的 memcpy 照旧 dotnet 的 Unsafe.CopyBlockUnaligned 都比 for 快几倍。这就意味着无论是 memcpy 照旧 CopyBlockUnaligned 内里的指令优化,在 兆芯 下都是负优化
在更大的数据两环境下,可以看到 Intel 平台的 memcpy 和 CopyBlockUnaligned 对 for 循环的优化比率不断下跌,其数据环境如下
数组长度CopyByForMemcpyCopyBlockUnalignedCopyByFor与Memcpy比率CopyByFor与CopyBlockUnaligned比率1000201.0532.1221.026.2593399759.564700285100001,958.87624.06581.323.1389129253.3696931121000000201,416.81104,570.3199,385.151.9261376392.026628827100000003,800,486.402,313,413.902,005,075.291.6428043421.89543326310000000041,348,684.3227,086,427.6724,020,801.371.526546241.721369894我的猜测是随着数组长度增长,将渐渐超过了 Intel 的 CPU 的缓存,导致了比率的下降。但无论怎样,利用 memcpy 和 CopyBlockUnaligned 在 Intel 下都有优化
这就是为什么在数组较大时,如在 100000000 长度时,雷同的 Memcpy 方法下兆芯比Intel的耗时比例为 6.06 倍。相较于在 1000 长度时,兆芯比Intel的耗时比例为 42.27 倍小了非常多。如此可以看到其实也不能全怪兆芯,只是因为 Intel 的优化比较强,导致看起来差异比较大
在数组长度比较大的时候,在 兆芯 上也是 memcpy 会比 for 循环拷贝更快。且 memcpy 和 CopyBlockUnaligned 的性能也是基本持平的。也就是说在数据量比较大的时候,利用 dotnet 自带的 Unsafe.CopyBlockUnaligned 方法照旧很有意义的,既速率快又相对安全。在数据量比较小的时候,利用 CopyBlockUnaligned 依然不会有较大的性能损失
飞腾腾锐 Phytium D2000
数组较大
BenchmarkDotNet v0.13.12, Kylin V10 SP1
Phytium,D2000/8 E8C, 8 logical cores
.NET SDK 8.0.204
[Host] : .NET 8.0.4 (8.0.424.16909), Arm64 RyuJIT AdvSIMD
Job-QEJWOH : .NET 8.0.4 (8.0.424.16909), Arm64 RyuJIT AdvSIMD
RunStrategy=Throughput
MethodsourcedestMeanErrorStdDevRatioCopyByForInt32[100000000]Int32[100000000]161,848,301.3 ns275,376.77 ns229,952.07 ns1.000MemcpyInt32[100000000]Int32[100000000]139,057,784.0 ns493,850.72 ns437,785.80 ns0.859CopyBlockUnalignedInt32[100000000]Int32[100000000]137,746,376.7 ns740,242.45 ns618,135.97 ns0.851CopyByForInt32[10000000]Int32[10000000]15,514,977.7 ns33,694.59 ns29,869.38 ns0.096MemcpyInt32[10000000]Int32[10000000]14,492,865.7 ns40,272.32 ns35,700.37 ns0.090CopyBlockUnalignedInt32[10000000]Int32[10000000]14,497,063.8 ns38,595.84 ns30,133.09 ns0.090CopyByForInt32[1000000]Int32[1000000]1,240,798.0 ns15,140.32 ns14,162.26 ns0.008MemcpyInt32[1000000]Int32[1000000]1,046,522.7 ns20,519.03 ns19,193.52 ns0.006CopyBlockUnalignedInt32[1000000]Int32[1000000]1,032,201.1 ns19,159.44 ns17,921.75 ns0.006CopyByForInt32[10000]Int32[10000]8,930.6 ns9.04 ns8.02 ns0.000MemcpyInt32[10000]Int32[10000]3,058.1 ns12.31 ns11.51 ns0.000CopyBlockUnalignedInt32[10000]Int32[10000]3,199.1 ns16.51 ns12.89 ns0.000CopyByForInt32[1000]Int32[1000]886.8 ns0.66 ns0.59 ns0.000MemcpyInt32[1000]Int32[1000]250.3 ns0.36 ns0.30 ns0.000CopyBlockUnalignedInt32[1000]Int32[1000]235.8 ns0.29 ns0.25 ns0.000
数据说明和对比
飞腾腾锐 Phytium,D2000/8 E8C, 8 logical cores 的跑分不高,与 Intel 最大差距在数组拷贝上能拉到 10 倍,均值性能差距是 4 倍左右。但在我的测试内里飞腾腾锐的性能比兆芯快,大概均值性能差距是 2 倍左右,如以下对比
方法数组长度Intel兆芯飞腾腾锐Intel比兆芯兆芯比Intel飞腾比Intel兆芯比飞腾CopyByForInt32[100000000]41,348,684.32334,741,708.00161,848,301.300.12352414818.09558305193.91423098372.0682435670MemcpyInt32[100000000]27,086,427.67164,233,004.00139,057,784.000.16492682356.06329509385.13385470001.1810414295CopyBlockUnalignedInt32[100000000]24,020,801.37164,128,312.00137,746,376.700.14635379536.83275755345.73446216791.1915254392CopyByForInt32[10000000]3,800,486.4033,404,753.0015,514,977.700.11377082788.78959940504.08236632552.1530648413MemcpyInt32[10000000]2,313,413.9023,405,518.0014,492,865.700.098840534110.11730672156.26470935441.6149682530CopyBlockUnalignedInt32[10000000]2,005,075.2924,981,451.0014,497,063.800.080262563212.45910870517.23018425911.7232076333CopyByForInt32[1000000]201,416.815,036,027.001,240,798.000.039995180725.00301240996.16034977424.0587001269MemcpyInt32[1000000]104,570.312,585,947.001,046,522.700.040437916924.729265888210.00783778882.4709898791CopyBlockUnalignedInt32[1000000]99,385.152,529,769.001,032,201.100.039286255025.454195118710.38586851252.4508489673CopyByForInt32[10000]1,958.8713,663.008,930.600.14337041656.97493963364.55905700741.5299084048MemcpyInt32[10000]624.0610,112.003,058.100.061714794316.20357016954.90033009653.3066282986CopyBlockUnalignedInt32[10000]581.3210,010.003,199.100.058073926117.21943163835.50316521023.1290050327CopyByForInt32[1000]201.051,088.00886.800.18478860295.41158915694.41084307391.2268831755MemcpyInt32[1000]32.121,358.00250.300.023652430042.27895392287.79265255295.4254894127CopyBlockUnalignedInt32[1000]21.021,349.00235.800.015581912564.176974310211.21788772605.7209499576
点的几何计算
代码和性能测试的设计
以下代码用于测试麋集的计算过程中的各个设备之间的性能差异,其性能测试焦点代码如下
[Benchmark()]
[ArgumentsSource(nameof(GetArgument))]
public void Test(Point[] source, double[] result)
{
for (int i = 1; i < source.Length - 1; i++)
{
var a = source[i - 1];
var b = source[i];
var c = source[i + 1];
var abx = b.X - a.X;
var aby = b.Y - a.Y;
var acx = c.X - a.X;
var acy = c.Y - a.Y;
var cross = abx * acy - aby * acx;
var abs = Math.Abs(cross);
var acl = Math.Sqrt(acx * acx + acy * acy);
result[i] = abs / acl;
}
}
复制代码
以上性能测试中传入的 Point[] source 为输入数据,而 double[] result 为存放的输出数据,输出数据只是为了让计算效果有的存放,让 JIT 开森而已
此性能测试中对代码逻辑的内存访问猜测,即 CPU 缓存命中以及浮点计算要求较高。颠末实际测试发现 Intel 在这方面的优化照旧非常好的,但兆芯则有很大的优化空间
英特尔 13th Gen Intel Core i7-13700K
BenchmarkDotNet v0.13.12, Windows 11 (10.0.22631.3880/23H2/2023Update/SunValley3)
13th Gen Intel Core i7-13700K, 1 CPU, 24 logical and 16 physical cores
.NET SDK 9.0.100-preview.5.24307.3
[Host] : .NET 8.0.6 (8.0.624.26715), X64 RyuJIT AVX2
Job-UGRNFG : .NET 8.0.6 (8.0.624.26715), X64 RyuJIT AVX2
RunStrategy=Throughput
复制代码
MethodsourceresultMeanErrorStdDev
万点
Point[10000]
Double[10000]
19.622 μs
0.0914 μs
0.0810 μs
千点
Point[1000]
Double[1000]
1.974 μs
0.0108 μs
0.0101 μs
兆芯 ZHAOXIN KaiXian KX-U6780A
BenchmarkDotNet v0.13.12, UnionTech OS Desktop 20 E
ZHAOXIN KaiXian KX-U6780A2.7GHz (Max: 2.70GHz), 1 CPU, 8 logical and 8 physical cores
.NET SDK 8.0.204
[Host] : .NET 8.0.4 (8.0.424.16909), X64 RyuJIT AVX
Job-BBRJWB : .NET 8.0.4 (8.0.424.16909), X64 RyuJIT AVX
RunStrategy=Throughput
复制代码
MethodsourceresultMeanErrorStdDevMedian
万点
Point[10000]Double[10000]475.13 us8.295 us15.782 us467.05 us
千点
Point[1000]Double[1000]50.89 us1.230 us3.626 us50.81 us
飞腾腾锐 Phytium D2000
BenchmarkDotNet v0.13.12, Kylin V10 SP1
Phytium,D2000/8 E8C, 8 logical cores
.NET SDK 8.0.204
[Host] : .NET 8.0.4 (8.0.424.16909), Arm64 RyuJIT AdvSIMD
Job-JCFXCW : .NET 8.0.4 (8.0.424.16909), Arm64 RyuJIT AdvSIMD
RunStrategy=Throughput
MethodsourceresultMeanErrorStdDev
万点
Point[10000]Double[10000]147.96 us0.015 us0.014 us
千点
Point[1000]Double[1000]14.76 us0.004 us0.003 us
数据说明和对比
性能对好比下表,可以看到兆芯比Intel能慢上25倍左右,兆芯比飞腾慢上3倍左右
MethodsourceresultIntel兆芯飞腾腾锐Intel比兆芯兆芯比Intel飞腾比Intel兆芯比飞腾万点Point[10000]Double[10000]19.622475.13147.960.04129817124.214147397.5405157483.211205731千点Point[1000]Double[1000]1.97450.8914.760.03878954625.780141847.4772036473.447831978
通过上图可以看到,在进行基础的麋集计算中,似乎兆芯做了负面优化
代码
本文代码放在
github
和
gitee
上,可以利用如下下令行拉取代码
先创建一个空文件夹,接着利用下令行 cd 下令进入此空文件夹,在下令行内里输入以下代码,即可获取到本文的代码
git init
git remote add origin https://gitee.com/lindexi/lindexi_gd.git
git pull origin 1e20b4c8ef64b17604e1ee92f41f7ac25ad08d26
复制代码
以上利用的是 gitee 的源,如果 gitee 不能访问,请替换为 github 的源。请在下令行继续输入以下代码,将 gitee 源换成 github 源进行拉取代码
git remote remove origin
git remote add origin https://github.com/lindexi/lindexi_gd.git
git pull origin 1e20b4c8ef64b17604e1ee92f41f7ac25ad08d26
复制代码
获取代码之后,进入 BulowukaileFeanayjairwo 文件夹,即可获取到源代码
特殊感谢
特殊感谢
https://github.com/mjebrahimi/Performance-Wars-Benchmarking-CSharp
提供的代码
参考文档
C# 尺度性能测试
C# 尺度性能测试高级用法
dotnet 6 数组拷贝性能对比
跑分系列
ARM Phytium,D2000/8 E8C 8 Core 2300 MHz
D2000高效能桌面CPU 跑分:
https://www.cpubenchmark.net/cpu.php?cpu=ARM+Phytium%2CD2000%2F8+E8C+8+Core+2300+MHz&id=4862
和 Intel i7-13700K 对比:
https://www.cpubenchmark.net/compare/4862vs5060/ARM-Phytium,D20008-E8C-8-Core-2300-MHz-vs-Intel-i7-13700K
另一个和 Intel i7-13700K 对比:
https://openbenchmarking.org/vs/Processor/Phytium+D2000,Intel+Core+i7-13700K
免责声明:如果侵犯了您的权益,请联系站长,我们会及时删除侵权内容,谢谢合作!更多信息从访问主页:qidao123.com:ToB企服之家,中国第一个企服评测及商务社交产业平台。
欢迎光临 IT评测·应用市场-qidao123.com (https://dis.qidao123.com/)
Powered by Discuz! X3.4