dotnet C# 在差别的机器 CPU 型号上的基准性能测试

饭宝  论坛元老 | 2025-1-16 10:02:47 | 显示全部楼层 | 阅读模式
打印 上一主题 下一主题

主题 1008|帖子 1008|积分 3024

本文将记录我在多个差别的机器上,在差别的 CPU 型号上,执行雷同的我编写的 dotnet 的 Benchmark 的代码,测试差别的 CPU 型号对 C# 系的优化程度。本文非严谨测试,数值只有相对意义
以下是我的测试效果,对应的测试代码放在 github 上,可以在本文末尾找到下载代码的方法
我非常推荐你自己拉取代码,在你自己的设备上跑一下,测试其性能。且在开始之前,期望你已经把握了基础的性能测试知识,避免出现诡异的结论
本文的测试将围绕着尽可能多的覆盖基础 CPU 指令以及基础逻辑行为。基础的 CPU 指令的性能测试已经有许多前辈测试过了,我这里重点测试的是各个 C# 系的上层业务行为下,所调用的多个 CPU 指令的最终性能影响。额外的也覆盖 CPU 缓存,逻辑分支命中,方法参数堆栈传递等的性能。本文的测试重点不在于 C# 系的雷同功能的多个差别实现之间的性能对比,重点在于雷同的代码在差别的 CPU 型号、内存、体系上的性能差异,正如此需求所述,本文非严谨测试,测试效果的数值只有相对意义
数组创建

英特尔 13th Gen Intel Core i7-13700K

以下是在我开辟机上跑的,我开了几百个进程,有比较多干扰,但是题目不大,因为 i7-13700K 依然性能遥遥领先。等后续找个空闲的机器,再跑一次比较正确的性能测试
  1. BenchmarkDotNet v0.13.12, Windows 11 (10.0.22631.3447/23H2/2023Update/SunValley3)
  2. 13th Gen Intel Core i7-13700K, 1 CPU, 24 logical and 16 physical cores
  3. .NET SDK 8.0.204
  4.   [Host]     : .NET 8.0.4 (8.0.424.16909), X64 RyuJIT AVX2
  5.   Job-AXOZTJ : .NET 8.0.4 (8.0.424.16909), X64 RyuJIT AVX2
  6. RunStrategy=Throughput  
复制代码
MethodArraySizeMeanErrorStdDevMedianRatioRatioSDNewArray103.873 ns0.1146 ns0.2417 ns3.777 ns1.000.00GCZeroInitialized1012.234 ns0.2815 ns0.4382 ns12.168 ns3.150.21GCZeroUninitialized104.470 ns0.1491 ns0.4056 ns4.354 ns1.140.13NewArrayWithRandomVisit1012.012 ns0.2679 ns0.2506 ns11.941 ns3.090.18NewArrayWithOrdinalVisit109.839 ns0.3379 ns0.9803 ns9.635 ns2.580.26NewArray10011.875 ns0.1932 ns0.2444 ns11.813 ns1.000.00GCZeroInitialized10021.980 ns0.4524 ns0.8931 ns21.820 ns1.880.08GCZeroUninitialized10012.126 ns0.2769 ns0.5201 ns11.953 ns1.040.05NewArrayWithRandomVisit10047.344 ns0.9635 ns2.1351 ns46.572 ns4.030.24NewArrayWithOrdinalVisit10075.207 ns1.4285 ns1.3363 ns75.364 ns6.330.15NewArray1000110.197 ns2.1602 ns2.0206 ns109.619 ns1.000.00GCZeroInitialized1000116.560 ns2.0796 ns1.8435 ns116.604 ns1.060.03GCZeroUninitialized100033.476 ns0.5921 ns0.5538 ns33.643 ns0.300.01NewArrayWithRandomVisit1000208.835 ns4.1962 ns8.8512 ns205.699 ns1.920.09NewArrayWithOrdinalVisit1000620.850 ns11.5406 ns10.7951 ns619.304 ns5.640.15NewArray10000996.853 ns21.9389 ns61.8790 ns970.393 ns1.000.00GCZeroInitialized10000996.704 ns20.8764 ns58.5397 ns974.900 ns1.000.08GCZeroUninitialized1000063.200 ns1.0544 ns0.9863 ns63.315 ns0.060.00NewArrayWithRandomVisit100001,242.151 ns24.2642 ns38.4856 ns1,233.944 ns1.210.07NewArrayWithOrdinalVisit100006,068.245 ns90.8508 ns84.9819 ns6,076.727 ns5.790.34NewArray1000007,381.046 ns137.9635 ns147.6194 ns7,372.520 ns1.000.00GCZeroInitialized1000007,214.089 ns85.2068 ns71.1515 ns7,209.220 ns0.970.02GCZeroUninitialized1000007,347.661 ns146.3643 ns174.2363 ns7,306.838 ns1.000.03NewArrayWithRandomVisit1000008,456.669 ns164.5726 ns219.6997 ns8,517.366 ns1.140.05NewArrayWithOrdinalVisit100000129,749.709 ns2,408.4302 ns2,773.5518 ns128,963.159 ns17.570.55NewArray100000059,752.036 ns1,194.7579 ns1,929.3113 ns59,414.325 ns1.000.00GCZeroInitialized100000060,008.303 ns1,188.0164 ns1,778.1671 ns59,378.000 ns1.010.04GCZeroUninitialized100000058,868.279 ns1,023.4279 ns957.3151 ns58,724.731 ns0.970.04NewArrayWithRandomVisit100000056,399.609 ns1,068.5479 ns999.5204 ns56,296.948 ns0.930.03NewArrayWithOrdinalVisit10000001,314,841.960 ns26,155.6618 ns27,986.2651 ns1,313,674.414 ns21.921.00兆芯 ZHAOXIN KaiXian KX-U6780A
  1. BenchmarkDotNet v0.13.12, UnionTech OS Desktop 20 E
  2. ZHAOXIN KaiXian KX-U6780A2.7GHz (Max: 2.70GHz), 1 CPU, 8 logical and 8 physical cores
  3. .NET SDK 8.0.204
  4.   [Host]     : .NET 8.0.4 (8.0.424.16909), X64 RyuJIT AVX
  5.   Job-YPUGMN : .NET 8.0.4 (8.0.424.16909), X64 RyuJIT AVX
  6. RunStrategy=Throughput  
复制代码
MethodArraySizeMeanErrorStdDevMedianRatioRatioSDNewArray1040.20 ns0.977 ns1.491 ns39.98 ns1.000.00GCZeroInitialized10141.12 ns2.996 ns6.051 ns139.67 ns3.540.18GCZeroUninitialized1048.72 ns0.849 ns0.663 ns48.91 ns1.190.05NewArrayWithRandomVisit10195.75 ns1.082 ns0.845 ns195.65 ns4.770.16NewArrayWithOrdinalVisit1072.42 ns1.513 ns2.400 ns72.45 ns1.800.08NewArray100135.07 ns2.892 ns6.100 ns135.41 ns1.000.00GCZeroInitialized100228.42 ns4.662 ns10.135 ns228.83 ns1.700.11GCZeroUninitialized100137.26 ns2.939 ns5.519 ns136.70 ns1.020.06NewArrayWithRandomVisit100572.02 ns11.660 ns19.157 ns568.34 ns4.260.27NewArrayWithOrdinalVisit100467.29 ns9.357 ns13.117 ns464.49 ns3.470.21NewArray10001,037.70 ns20.377 ns54.742 ns1,031.50 ns1.000.00GCZeroInitialized10001,127.93 ns22.581 ns59.091 ns1,125.79 ns1.090.07GCZeroUninitialized1000653.93 ns6.239 ns4.871 ns652.04 ns0.600.02NewArrayWithRandomVisit10002,375.21 ns47.088 ns100.349 ns2,352.11 ns2.270.13NewArrayWithOrdinalVisit10004,474.90 ns87.887 ns107.933 ns4,453.16 ns4.190.28NewArray100009,586.62 ns189.501 ns369.608 ns9,657.74 ns1.000.00GCZeroInitialized100009,767.26 ns194.643 ns462.590 ns9,811.53 ns1.020.07GCZeroUninitialized100004,093.63 ns80.993 ns143.965 ns4,026.86 ns0.430.02NewArrayWithRandomVisit1000013,908.10 ns202.573 ns169.158 ns13,928.15 ns1.470.06NewArrayWithOrdinalVisit1000043,057.16 ns854.132 ns1,495.943 ns42,914.21 ns4.500.25NewArray10000063,542.13 ns576.256 ns510.836 ns63,519.28 ns1.000.00GCZeroInitialized10000066,357.64 ns1,312.089 ns2,118.779 ns66,043.66 ns1.030.03GCZeroUninitialized10000063,638.29 ns1,241.493 ns1,477.909 ns63,270.73 ns1.010.03NewArrayWithRandomVisit10000076,609.50 ns1,501.442 ns1,729.063 ns75,958.21 ns1.210.03NewArrayWithOrdinalVisit100000665,286.65 ns9,295.620 ns7,762.264 ns662,915.19 ns10.470.16NewArray1000000461,130.99 ns9,000.698 ns10,004.252 ns461,306.23 ns1.000.00GCZeroInitialized1000000459,810.29 ns8,893.401 ns10,586.961 ns455,791.25 ns1.000.03GCZeroUninitialized1000000456,245.03 ns8,819.606 ns12,363.856 ns452,252.89 ns0.990.04NewArrayWithRandomVisit1000000497,132.01 ns9,841.562 ns12,796.810 ns490,990.22 ns1.080.03NewArrayWithOrdinalVisit10000006,742,537.03 ns48,986.470 ns38,245.414 ns6,732,321.64 ns14.510.31飞腾腾锐 Phytium D2000
  1. BenchmarkDotNet v0.13.12, Kylin V10 SP1
  2. Phytium,D2000/8 E8C, 8 logical cores
  3. .NET SDK 8.0.204
  4.   [Host]     : .NET 8.0.4 (8.0.424.16909), Arm64 RyuJIT AdvSIMD
  5.   Job-NHRLJG : .NET 8.0.4 (8.0.424.16909), Arm64 RyuJIT AdvSIMD
  6. RunStrategy=Throughput  
复制代码
MethodArraySizeMeanErrorStdDevRatioRatioSDNewArray1022.18 ns0.149 ns0.132 ns1.000.00GCZeroInitialized1092.43 ns0.564 ns0.440 ns4.170.02GCZeroUninitialized1025.68 ns0.248 ns0.243 ns1.160.01NewArrayWithRandomVisit10108.25 ns0.299 ns0.250 ns4.880.03NewArrayWithOrdinalVisit1034.55 ns0.126 ns0.112 ns1.560.01NewArray10076.35 ns0.941 ns0.880 ns1.000.00GCZeroInitialized100163.69 ns0.952 ns0.743 ns2.140.03GCZeroUninitialized10080.21 ns0.528 ns0.468 ns1.050.02NewArrayWithRandomVisit100421.53 ns1.679 ns1.402 ns5.520.06NewArrayWithOrdinalVisit100300.66 ns1.274 ns1.130 ns3.940.05NewArray1000640.11 ns4.059 ns3.598 ns1.000.00GCZeroInitialized1000672.06 ns3.242 ns3.032 ns1.050.01GCZeroUninitialized1000483.70 ns2.202 ns1.952 ns0.760.01NewArrayWithRandomVisit10001,765.24 ns6.469 ns5.402 ns2.760.02NewArrayWithOrdinalVisit10002,850.39 ns12.971 ns12.133 ns4.450.03NewArray100005,219.58 ns36.810 ns32.631 ns1.000.00GCZeroInitialized100005,280.52 ns27.550 ns24.422 ns1.010.01GCZeroUninitialized100002,640.52 ns44.642 ns34.853 ns0.510.01NewArrayWithRandomVisit100008,992.89 ns20.367 ns19.052 ns1.720.01NewArrayWithOrdinalVisit1000026,983.43 ns355.773 ns297.086 ns5.170.05NewArray10000045,506.61 ns431.868 ns403.970 ns1.000.00GCZeroInitialized10000045,543.14 ns432.449 ns404.513 ns1.000.01GCZeroUninitialized10000044,461.84 ns331.168 ns309.775 ns0.980.01NewArrayWithRandomVisit10000057,232.01 ns318.770 ns298.178 ns1.260.01NewArrayWithOrdinalVisit100000445,380.51 ns2,904.888 ns2,425.713 ns9.780.10NewArray1000000318,862.16 ns1,899.267 ns1,683.651 ns1.000.00GCZeroInitialized1000000319,510.71 ns4,669.274 ns3,645.462 ns1.000.01GCZeroUninitialized1000000314,884.17 ns5,637.859 ns4,401.669 ns0.990.02NewArrayWithRandomVisit1000000357,843.40 ns3,063.527 ns2,865.625 ns1.120.01NewArrayWithOrdinalVisit10000004,547,465.54 ns15,355.309 ns12,822.379 ns14.280.05NewArray10000000001,541,406,672.88 ns35,733,853.844 ns102,527,125.216 ns1.0000.00GCZeroInitialized10000000001,548,370,215.42 ns38,407,327.571 ns110,197,822.498 ns1.0090.10GCZeroUninitialized10000000001,486,735.21 ns28,605.254 ns26,757.372 ns0.0010.00NewArrayWithRandomVisit10000000001,590,271,119.60 ns33,473,585.461 ns96,041,991.522 ns1.0360.09NewArrayWithOrdinalVisit10000000003,861,833,983.54 ns2,367,487.064 ns1,976,958.923 ns2.5460.16以上的飞腾腾锐 Phytium D2000 最后的测试数据预计是不正常的
数组拷贝

测试维度

到场测试的内容如下:

  • CopyByFor : 利用 for 循环进行拷贝数组
  • Memcpy  : 利用尺度 C 提供的 memcpy 函数进行拷贝,在 linux 下利用 libc.so.6 导出函数,在 windows 下利用 msvcrt.dll 导出函数。这处于非常裸露的方式,更具体请参阅下文的数据说明内容
  • CopyBlockUnaligned : 利用 dotnet 自带的 Unsafe.CopyBlockUnaligned 方法进行数组拷贝
英特尔 13th Gen Intel Core i7-13700K

数组较小

小于 1000 的数组时,存在较大 P/Invoke 干扰,于是决定最小设置为 1000 的值
  1. BenchmarkDotNet v0.13.12, Windows 11 (10.0.22631.3447/23H2/2023Update/SunValley3)
  2. 13th Gen Intel Core i7-13700K, 1 CPU, 24 logical and 16 physical cores
  3. .NET SDK 8.0.204
  4.   [Host]     : .NET 8.0.4 (8.0.424.16909), X64 RyuJIT AVX2
  5.   Job-GCHWHL : .NET 8.0.4 (8.0.424.16909), X64 RyuJIT AVX2
  6. RunStrategy=Throughput  
复制代码
MethodsourcedestMeanErrorStdDevRatioCopyByForInt32[10000]Int32[10000]1,958.98 ns8.391 ns7.007 ns1.000MemcpyInt32[10000]Int32[10000]609.35 ns3.266 ns3.055 ns0.311CopyBlockUnalignedInt32[10000]Int32[10000]577.84 ns1.391 ns1.301 ns0.295CopyByForInt32[1000]Int32[1000]202.09 ns0.376 ns0.352 ns0.103MemcpyInt32[1000]Int32[1000]32.21 ns0.323 ns0.302 ns0.016CopyBlockUnalignedInt32[1000]Int32[1000]19.19 ns0.067 ns0.059 ns0.010根据上述测试数据可以看到,即使在较小数据量环境下,依然 memcpy 和 Unsafe.CopyBlockUnaligned 比 for 速率快
数组较大
  1. BenchmarkDotNet v0.13.12, Windows 11 (10.0.22631.3447/23H2/2023Update/SunValley3)
  2. 13th Gen Intel Core i7-13700K, 1 CPU, 24 logical and 16 physical cores
  3. .NET SDK 8.0.204
  4.   [Host]     : .NET 8.0.4 (8.0.424.16909), X64 RyuJIT AVX2
  5.   Job-DBDADP : .NET 8.0.4 (8.0.424.16909), X64 RyuJIT AVX2
  6. RunStrategy=Throughput  
复制代码
MethodsourcedestMeanErrorStdDevMedianRatioRatioSDCopyByForInt32[100000000]Int32[100000000]41,348,684.32 ns751,207.515 ns1,028,261.326 ns41,102,646.15 ns1.0000.00MemcpyInt32[100000000]Int32[100000000]27,086,427.67 ns738,121.867 ns2,057,588.736 ns26,318,143.75 ns0.6750.05CopyBlockUnalignedInt32[100000000]Int32[100000000]24,020,801.37 ns467,035.642 ns458,691.448 ns23,894,810.94 ns0.5790.02CopyByForInt32[10000000]Int32[10000000]3,800,486.40 ns69,523.151 ns162,508.123 ns3,748,857.23 ns0.0920.01MemcpyInt32[10000000]Int32[10000000]2,313,413.90 ns75,362.059 ns208,827.911 ns2,248,826.17 ns0.0580.01CopyBlockUnalignedInt32[10000000]Int32[10000000]2,005,075.29 ns55,131.653 ns149,989.727 ns1,925,467.19 ns0.0490.00CopyByForInt32[1000000]Int32[1000000]201,416.81 ns1,630.278 ns1,524.963 ns200,902.27 ns0.0050.00MemcpyInt32[1000000]Int32[1000000]104,570.31 ns3,304.068 ns9,319.184 ns100,412.65 ns0.0030.00CopyBlockUnalignedInt32[1000000]Int32[1000000]99,385.15 ns1,824.888 ns1,617.716 ns99,135.09 ns0.0020.00CopyByForInt32[10000]Int32[10000]1,958.87 ns4.267 ns3.783 ns1,959.42 ns0.0000.00MemcpyInt32[10000]Int32[10000]624.06 ns4.451 ns4.164 ns622.60 ns0.0000.00CopyBlockUnalignedInt32[10000]Int32[10000]581.32 ns2.044 ns1.912 ns581.53 ns0.0000.00CopyByForInt32[1000]Int32[1000]201.05 ns0.678 ns0.635 ns201.05 ns0.0000.00MemcpyInt32[1000]Int32[1000]32.12 ns0.638 ns0.683 ns32.10 ns0.0000.00CopyBlockUnalignedInt32[1000]Int32[1000]21.02 ns0.090 ns0.085 ns21.04 ns0.0000.00兆芯 ZHAOXIN KaiXian KX-U6780A

数组较小
  1. BenchmarkDotNet v0.13.12, UnionTech OS Desktop 20 E
  2. ZHAOXIN KaiXian KX-U6780A2.7GHz (Max: 2.70GHz), 1 CPU, 8 logical and 8 physical cores
  3. .NET SDK 8.0.204
  4.   [Host]     : .NET 8.0.4 (8.0.424.16909), X64 RyuJIT AVX
  5.   Job-SBDPDU : .NET 8.0.4 (8.0.424.16909), X64 RyuJIT AVX
  6. RunStrategy=Throughput  
复制代码
MethodsourcedestMeanErrorStdDevMedianRatioRatioSDCopyByForInt32[10000]Int32[10000]14.814 us0.1734 us0.1537 us14.785 us1.000.00MemcpyInt32[10000]Int32[10000]15.329 us0.2950 us0.5167 us15.313 us1.040.04CopyBlockUnalignedInt32[10000]Int32[10000]13.125 us0.5590 us1.6482 us13.188 us0.940.09CopyByForInt32[1000]Int32[1000]1.127 us0.0226 us0.0211 us1.127 us0.080.00MemcpyInt32[1000]Int32[1000]2.152 us0.0571 us0.1675 us2.197 us0.130.02CopyBlockUnalignedInt32[1000]Int32[1000]2.297 us0.0453 us0.0863 us2.279 us0.160.01数组较大
  1. BenchmarkDotNet v0.13.12, UnionTech OS Desktop 20 E
  2. ZHAOXIN KaiXian KX-U6780A2.7GHz (Max: 2.70GHz), 1 CPU, 8 logical and 8 physical cores
  3. .NET SDK 8.0.204
  4.   [Host]     : .NET 8.0.4 (8.0.424.16909), X64 RyuJIT AVX
  5.   Job-KKBWNV : .NET 8.0.4 (8.0.424.16909), X64 RyuJIT AVX
  6. RunStrategy=Throughput  
复制代码
MethodsourcedestMeanErrorStdDevMedianRatioRatioSDCopyByForInt32[100000000]Int32[100000000]334,741.708 μs7,661.2780 μs22,469.2022 μs332,150.996 μs1.0000.00MemcpyInt32[100000000]Int32[100000000]164,233.004 μs3,256.7894 μs8,406.8134 μs161,660.880 μs0.4930.04CopyBlockUnalignedInt32[100000000]Int32[100000000]164,128.312 μs3,671.9104 μs10,826.7108 μs162,440.250 μs0.4920.05CopyByForInt32[10000000]Int32[10000000]33,404.753 μs663.0404 μs1,687.6494 μs32,963.932 μs0.1000.01MemcpyInt32[10000000]Int32[10000000]23,405.518 μs1,142.2886 μs3,350.1346 μs24,879.320 μs0.0700.01CopyBlockUnalignedInt32[10000000]Int32[10000000]24,981.451 μs498.7301 μs899.3133 μs24,921.681 μs0.0750.00CopyByForInt32[1000000]Int32[1000000]5,036.027 μs100.2153 μs195.4623 μs5,014.961 μs0.0150.00MemcpyInt32[1000000]Int32[1000000]2,585.947 μs51.0945 μs106.6533 μs2,601.145 μs0.0080.00CopyBlockUnalignedInt32[1000000]Int32[1000000]2,529.769 μs50.4126 μs98.3259 μs2,516.467 μs0.0080.00CopyByForInt32[10000]Int32[10000]13.663 μs0.2509 μs0.2224 μs13.680 μs0.0000.00MemcpyInt32[10000]Int32[10000]10.112 μs0.1976 μs0.2957 μs10.131 μs0.0000.00CopyBlockUnalignedInt32[10000]Int32[10000]10.010 μs0.1742 μs0.1630 μs9.964 μs0.0000.00CopyByForInt32[1000]Int32[1000]1.088 μs0.0058 μs0.0045 μs1.089 μs0.0000.00MemcpyInt32[1000]Int32[1000]1.358 μs0.0266 μs0.0364 μs1.355 μs0.0000.00CopyBlockUnalignedInt32[1000]Int32[1000]1.349 μs0.0267 μs0.0461 μs1.334 μs0.0000.00数据说明

通过数据对比 Intel 和 兆芯 以上测试数据,可以看到在 Int32[10000] 的测试数据集内里,轻松就可以看到 Intel 比 兆芯 快了 10 倍,如下图所示

在如下图的对比 Intel 和 兆芯 的对较大的数组进行拷贝的性能,可以看到 Intel 平台也的确能够比 兆芯 快出 10 倍的性能

具体的性能比较如下

方法数组长度Intel兆芯Intel比兆芯兆芯比IntelCopyByForInt32[100000000]41,348,684.32334,741,708.000.12352414818.095583052MemcpyInt32[100000000]27,086,427.67164,233,004.000.16492682356.063295094CopyBlockUnalignedInt32[100000000]24,020,801.37164,128,312.000.14635379536.832757553CopyByForInt32[10000000]3,800,486.4033,404,753.000.11377082788.789599405MemcpyInt32[10000000]2,313,413.9023,405,518.000.098840534110.11730672CopyBlockUnalignedInt32[10000000]2,005,075.2924,981,451.000.080262563212.45910871CopyByForInt32[1000000]201,416.815,036,027.000.039995180725.00301241MemcpyInt32[1000000]104,570.312,585,947.000.040437916924.72926589CopyBlockUnalignedInt32[1000000]99,385.152,529,769.000.039286255025.45419512CopyByForInt32[10000]1,958.8713,663.000.14337041656.974939634MemcpyInt32[10000]624.0610,112.000.061714794316.20357017CopyBlockUnalignedInt32[10000]581.3210,010.000.058073926117.21943164CopyByForInt32[1000]201.051,088.000.18478860295.411589157MemcpyInt32[1000]32.121,358.000.023652430042.27895392CopyBlockUnalignedInt32[1000]21.021,349.000.015581912564.17697431更具体的对 兆芯 的分析:在对较小的数组进行拷贝,利用 for 进行拷贝的速率比尺度 C 的 memcpy 函数快,利用 for 循环进行拷贝与 dotnet 的 Unsafe.CopyBlockUnaligned 差不多。而在 Intel 平台下,无论是 尺度 C 的 memcpy 照旧 dotnet 的 Unsafe.CopyBlockUnaligned 都比 for 快几倍。这就意味着无论是 memcpy 照旧 CopyBlockUnaligned 内里的指令优化,在 兆芯 下都是负优化
在更大的数据两环境下,可以看到 Intel 平台的 memcpy 和 CopyBlockUnaligned 对 for 循环的优化比率不断下跌,其数据环境如下

数组长度CopyByForMemcpyCopyBlockUnalignedCopyByFor与Memcpy比率CopyByFor与CopyBlockUnaligned比率1000201.0532.1221.026.2593399759.564700285100001,958.87624.06581.323.1389129253.3696931121000000201,416.81104,570.3199,385.151.9261376392.026628827100000003,800,486.402,313,413.902,005,075.291.6428043421.89543326310000000041,348,684.3227,086,427.6724,020,801.371.526546241.721369894我的猜测是随着数组长度增长,将渐渐超过了 Intel 的 CPU 的缓存,导致了比率的下降。但无论怎样,利用 memcpy 和 CopyBlockUnaligned 在 Intel 下都有优化
这就是为什么在数组较大时,如在 100000000 长度时,雷同的 Memcpy 方法下兆芯比Intel的耗时比例为 6.06 倍。相较于在 1000 长度时,兆芯比Intel的耗时比例为 42.27 倍小了非常多。如此可以看到其实也不能全怪兆芯,只是因为 Intel 的优化比较强,导致看起来差异比较大
在数组长度比较大的时候,在 兆芯 上也是 memcpy 会比 for 循环拷贝更快。且 memcpy 和 CopyBlockUnaligned 的性能也是基本持平的。也就是说在数据量比较大的时候,利用 dotnet 自带的 Unsafe.CopyBlockUnaligned 方法照旧很有意义的,既速率快又相对安全。在数据量比较小的时候,利用 CopyBlockUnaligned 依然不会有较大的性能损失
飞腾腾锐 Phytium D2000

数组较大

BenchmarkDotNet v0.13.12, Kylin V10 SP1
Phytium,D2000/8 E8C, 8 logical cores
.NET SDK 8.0.204
[Host]     : .NET 8.0.4 (8.0.424.16909), Arm64 RyuJIT AdvSIMD
Job-QEJWOH : .NET 8.0.4 (8.0.424.16909), Arm64 RyuJIT AdvSIMD
RunStrategy=Throughput
MethodsourcedestMeanErrorStdDevRatioCopyByForInt32[100000000]Int32[100000000]161,848,301.3 ns275,376.77 ns229,952.07 ns1.000MemcpyInt32[100000000]Int32[100000000]139,057,784.0 ns493,850.72 ns437,785.80 ns0.859CopyBlockUnalignedInt32[100000000]Int32[100000000]137,746,376.7 ns740,242.45 ns618,135.97 ns0.851CopyByForInt32[10000000]Int32[10000000]15,514,977.7 ns33,694.59 ns29,869.38 ns0.096MemcpyInt32[10000000]Int32[10000000]14,492,865.7 ns40,272.32 ns35,700.37 ns0.090CopyBlockUnalignedInt32[10000000]Int32[10000000]14,497,063.8 ns38,595.84 ns30,133.09 ns0.090CopyByForInt32[1000000]Int32[1000000]1,240,798.0 ns15,140.32 ns14,162.26 ns0.008MemcpyInt32[1000000]Int32[1000000]1,046,522.7 ns20,519.03 ns19,193.52 ns0.006CopyBlockUnalignedInt32[1000000]Int32[1000000]1,032,201.1 ns19,159.44 ns17,921.75 ns0.006CopyByForInt32[10000]Int32[10000]8,930.6 ns9.04 ns8.02 ns0.000MemcpyInt32[10000]Int32[10000]3,058.1 ns12.31 ns11.51 ns0.000CopyBlockUnalignedInt32[10000]Int32[10000]3,199.1 ns16.51 ns12.89 ns0.000CopyByForInt32[1000]Int32[1000]886.8 ns0.66 ns0.59 ns0.000MemcpyInt32[1000]Int32[1000]250.3 ns0.36 ns0.30 ns0.000CopyBlockUnalignedInt32[1000]Int32[1000]235.8 ns0.29 ns0.25 ns0.000数据说明和对比

飞腾腾锐 Phytium,D2000/8 E8C, 8 logical cores 的跑分不高,与 Intel 最大差距在数组拷贝上能拉到 10 倍,均值性能差距是 4 倍左右。但在我的测试内里飞腾腾锐的性能比兆芯快,大概均值性能差距是 2 倍左右,如以下对比

方法数组长度Intel兆芯飞腾腾锐Intel比兆芯兆芯比Intel飞腾比Intel兆芯比飞腾CopyByForInt32[100000000]41,348,684.32334,741,708.00161,848,301.300.12352414818.09558305193.91423098372.0682435670MemcpyInt32[100000000]27,086,427.67164,233,004.00139,057,784.000.16492682356.06329509385.13385470001.1810414295CopyBlockUnalignedInt32[100000000]24,020,801.37164,128,312.00137,746,376.700.14635379536.83275755345.73446216791.1915254392CopyByForInt32[10000000]3,800,486.4033,404,753.0015,514,977.700.11377082788.78959940504.08236632552.1530648413MemcpyInt32[10000000]2,313,413.9023,405,518.0014,492,865.700.098840534110.11730672156.26470935441.6149682530CopyBlockUnalignedInt32[10000000]2,005,075.2924,981,451.0014,497,063.800.080262563212.45910870517.23018425911.7232076333CopyByForInt32[1000000]201,416.815,036,027.001,240,798.000.039995180725.00301240996.16034977424.0587001269MemcpyInt32[1000000]104,570.312,585,947.001,046,522.700.040437916924.729265888210.00783778882.4709898791CopyBlockUnalignedInt32[1000000]99,385.152,529,769.001,032,201.100.039286255025.454195118710.38586851252.4508489673CopyByForInt32[10000]1,958.8713,663.008,930.600.14337041656.97493963364.55905700741.5299084048MemcpyInt32[10000]624.0610,112.003,058.100.061714794316.20357016954.90033009653.3066282986CopyBlockUnalignedInt32[10000]581.3210,010.003,199.100.058073926117.21943163835.50316521023.1290050327CopyByForInt32[1000]201.051,088.00886.800.18478860295.41158915694.41084307391.2268831755MemcpyInt32[1000]32.121,358.00250.300.023652430042.27895392287.79265255295.4254894127CopyBlockUnalignedInt32[1000]21.021,349.00235.800.015581912564.176974310211.21788772605.7209499576点的几何计算

代码和性能测试的设计

以下代码用于测试麋集的计算过程中的各个设备之间的性能差异,其性能测试焦点代码如下
  1.     [Benchmark()]
  2.     [ArgumentsSource(nameof(GetArgument))]
  3.     public void Test(Point[] source, double[] result)
  4.     {
  5.         for (int i = 1; i < source.Length - 1; i++)
  6.         {
  7.             var a = source[i - 1];
  8.             var b = source[i];
  9.             var c = source[i + 1];
  10.             var abx = b.X - a.X;
  11.             var aby = b.Y - a.Y;
  12.             var acx = c.X - a.X;
  13.             var acy = c.Y - a.Y;
  14.             var cross = abx * acy - aby * acx;
  15.             var abs = Math.Abs(cross);
  16.             var acl = Math.Sqrt(acx * acx + acy * acy);
  17.             result[i] = abs / acl;
  18.         }
  19.     }
复制代码
以上性能测试中传入的 Point[] source 为输入数据,而 double[] result 为存放的输出数据,输出数据只是为了让计算效果有的存放,让 JIT 开森而已
此性能测试中对代码逻辑的内存访问猜测,即 CPU 缓存命中以及浮点计算要求较高。颠末实际测试发现 Intel 在这方面的优化照旧非常好的,但兆芯则有很大的优化空间
英特尔 13th Gen Intel Core i7-13700K
  1. BenchmarkDotNet v0.13.12, Windows 11 (10.0.22631.3880/23H2/2023Update/SunValley3)
  2. 13th Gen Intel Core i7-13700K, 1 CPU, 24 logical and 16 physical cores
  3. .NET SDK 9.0.100-preview.5.24307.3
  4.   [Host]     : .NET 8.0.6 (8.0.624.26715), X64 RyuJIT AVX2
  5.   Job-UGRNFG : .NET 8.0.6 (8.0.624.26715), X64 RyuJIT AVX2
  6. RunStrategy=Throughput  
复制代码
MethodsourceresultMeanErrorStdDev万点Point[10000]Double[10000]19.622 μs0.0914 μs0.0810 μs千点Point[1000]Double[1000]1.974 μs0.0108 μs0.0101 μs兆芯 ZHAOXIN KaiXian KX-U6780A
  1. BenchmarkDotNet v0.13.12, UnionTech OS Desktop 20 E
  2. ZHAOXIN KaiXian KX-U6780A2.7GHz (Max: 2.70GHz), 1 CPU, 8 logical and 8 physical cores
  3. .NET SDK 8.0.204
  4.   [Host]     : .NET 8.0.4 (8.0.424.16909), X64 RyuJIT AVX
  5.   Job-BBRJWB : .NET 8.0.4 (8.0.424.16909), X64 RyuJIT AVX
  6. RunStrategy=Throughput
复制代码
MethodsourceresultMeanErrorStdDevMedian万点Point[10000]Double[10000]475.13 us8.295 us15.782 us467.05 us千点Point[1000]Double[1000]50.89 us1.230 us3.626 us50.81 us飞腾腾锐 Phytium D2000

BenchmarkDotNet v0.13.12, Kylin V10 SP1
Phytium,D2000/8 E8C, 8 logical cores
.NET SDK 8.0.204
[Host]     : .NET 8.0.4 (8.0.424.16909), Arm64 RyuJIT AdvSIMD
Job-JCFXCW : .NET 8.0.4 (8.0.424.16909), Arm64 RyuJIT AdvSIMD
RunStrategy=Throughput
MethodsourceresultMeanErrorStdDev万点Point[10000]Double[10000]147.96 us0.015 us0.014 us千点Point[1000]Double[1000]14.76 us0.004 us0.003 us数据说明和对比

性能对好比下表,可以看到兆芯比Intel能慢上25倍左右,兆芯比飞腾慢上3倍左右
MethodsourceresultIntel兆芯飞腾腾锐Intel比兆芯兆芯比Intel飞腾比Intel兆芯比飞腾万点Point[10000]Double[10000]19.622475.13147.960.04129817124.214147397.5405157483.211205731千点Point[1000]Double[1000]1.97450.8914.760.03878954625.780141847.4772036473.447831978

通过上图可以看到,在进行基础的麋集计算中,似乎兆芯做了负面优化
代码

本文代码放在 githubgitee 上,可以利用如下下令行拉取代码
先创建一个空文件夹,接着利用下令行 cd 下令进入此空文件夹,在下令行内里输入以下代码,即可获取到本文的代码
  1. git init
  2. git remote add origin https://gitee.com/lindexi/lindexi_gd.git
  3. git pull origin 1e20b4c8ef64b17604e1ee92f41f7ac25ad08d26
复制代码
以上利用的是 gitee 的源,如果 gitee 不能访问,请替换为 github 的源。请在下令行继续输入以下代码,将 gitee 源换成 github 源进行拉取代码
  1. git remote remove origin
  2. git remote add origin https://github.com/lindexi/lindexi_gd.git
  3. git pull origin 1e20b4c8ef64b17604e1ee92f41f7ac25ad08d26
复制代码
获取代码之后,进入 BulowukaileFeanayjairwo 文件夹,即可获取到源代码
特殊感谢

特殊感谢 https://github.com/mjebrahimi/Performance-Wars-Benchmarking-CSharp 提供的代码
参考文档

C# 尺度性能测试
C# 尺度性能测试高级用法
dotnet 6 数组拷贝性能对比
跑分系列

ARM Phytium,D2000/8 E8C 8 Core 2300 MHz

D2000高效能桌面CPU 跑分:https://www.cpubenchmark.net/cpu.php?cpu=ARM+Phytium%2CD2000%2F8+E8C+8+Core+2300+MHz&id=4862
和 Intel i7-13700K 对比:https://www.cpubenchmark.net/compare/4862vs5060/ARM-Phytium,D20008-E8C-8-Core-2300-MHz-vs-Intel-i7-13700K
另一个和 Intel i7-13700K 对比:https://openbenchmarking.org/vs/Processor/Phytium+D2000,Intel+Core+i7-13700K

免责声明:如果侵犯了您的权益,请联系站长,我们会及时删除侵权内容,谢谢合作!更多信息从访问主页:qidao123.com:ToB企服之家,中国第一个企服评测及商务社交产业平台。

本帖子中包含更多资源

您需要 登录 才可以下载或查看,没有账号?立即注册

x
回复

使用道具 举报

0 个回复

倒序浏览

快速回复

您需要登录后才可以回帖 登录 or 立即注册

本版积分规则

饭宝

论坛元老
这个人很懒什么都没写!
快速回复 返回顶部 返回列表