一:背景
1. 讲故事
前些天有位朋友在微信上找到我,说他们的客户端程序卡死了,让我帮忙看下是什么原因导致的?dump也拿到了手,既然有了dump就开始正式分析吧。
二:WinDbg 分析
1. 什么导致的卡死
客户端的程序卡死比较好找原因,入手点就是主线程,看下它此时正在做什么,可以用 k 命令。- 0:000> k
- # ChildEBP RetAddr
- 00 003cdf7c 74c115ce ntdll!NtWaitForSingleObject+0x15
- 01 003cdf7c 756e1194 KERNELBASE!WaitForSingleObjectEx+0x98
- 02 003cdf94 6f573bea kernel32!WaitForSingleObjectExImplementation+0x75
- 03 003cdfc4 6f573c31 clr!CLREventWaitHelper2+0x33
- 04 003ce014 6f573bb6 clr!CLREventWaitHelper+0x2a
- 05 003ce04c 6f57c8be clr!CLREventBase::WaitEx+0x152
- 06 003ce060 6f5764a9 clr!WKS::GCHeap::WaitUntilGCComplete+0x34
- 07 003ce0b0 6f583cf4 clr!Thread::RareDisablePreemptiveGC+0x231
- 08 003ce134 6a87a767 clr!JIT_RareDisableHelper+0x24
- 09 003ce16c 6a87472b System_Drawing_ni+0x4a767
- 0a 003ce17c 0846b372 System_Drawing_ni!System.Drawing.Graphics.Clear+0x1b
- ...
复制代码 从卦中信息看,代码正在托管层做Graphics,突然程序触发了GC,因为STW的原因,clr需要使用SuspendRuntime把主线程导入到 WaitUntilGCComplete 进行等待,有了这些信息之后,接下来就是寻找为什么会触发GC。
2. 为什么会触发 GC
要找到GC触发原因,首先要找哪一个线程触发了GC,这个可以用 !t 看下托管线程列表中的 GC 字样,输出如下:- 0:000> !t
- ThreadCount: 48
- UnstartedThread: 0
- BackgroundThread: 35
- PendingThread: 0
- DeadThread: 0
- Hosted Runtime: no
- Lock
- ID OSID ThreadOBJ State GC Mode GC Alloc Context Domain Count Apt Exception
- 48 42 ee8 0ee39f60 1029220 Cooperative 00000000:00000000 0076c700 3 MTA (GC) (Threadpool Worker)
- ...
- 0:048> k 10
- # ChildEBP RetAddr
- 00 4775c9e8 6f57d24e clr!WKS::gc_heap::mark_object_simple1+0x8a
- 01 4775ca14 6f57bf72 clr!WKS::gc_heap::mark_object_simple+0x22b
- 02 4775ca34 6f5774b2 clr!WKS::GCHeap::Promote+0xaa
- 03 4775ca4c 6f57809c clr!GcEnumObject+0x37
- 04 4775cdbc 6f5777cb clr!EECodeManager::EnumGcRefs+0x854
- 05 4775ce10 6f5723b9 clr!GcStackCrawlCallBack+0x167
- 06 4775d0dc 6f5724bf clr!Thread::StackWalkFramesEx+0x92
- 07 4775d410 6f57743b clr!Thread::StackWalkFrames+0x9d
- 08 4775d448 6f57ba0e clr!GCToEEInterface::GcScanRoots+0x108
- 09 4775d4a8 6f5792db clr!WKS::gc_heap::mark_phase+0x18a
- 0a 4775d4d0 6f57966f clr!WKS::gc_heap::gc1+0xda
- 0b 4775d508 6f57978c clr!WKS::gc_heap::garbage_collect+0x447
- 0c 4775d530 6f70b767 clr!WKS::GCHeap::GarbageCollectGeneration+0x1f6
- 0d 4775d590 6f70b7a3 clr!WKS::gc_heap::trigger_ephemeral_gc+0x1e
- 0e 4775d590 6f575f6f clr!WKS::gc_heap::allocate_small+0x270
- 0f 4775d5bc 6f575fca clr!WKS::gc_heap::try_allocate_more_space+0x17c
- ...
复制代码 从卦中可以看到当前 48 号线程触发了GC,并且是处于三阶段中的标记阶段,接下来需要观察下到底触发的是哪一代GC,可以用 dt 观察下 setting 全局变量即可。- 0:048> x clr!*settings*
- ...
- 6fbd4bd8 clr!WKS::gc_heap::settings = <no type information>
- 6fbd7538 clr!SVR::gc_heap::settings = <no type information>
- ...
- 0:048> dp clr!WKS::gc_heap::settings
- 6fbd4bd8 00002ce4 00000002 00000001 00000001
- 6fbd4be8 00000000 00000000 00000000 00000000
- 6fbd4bf8 00000001 00000000 00000000 00000000
- 6fbd4c08 00000000 00000000 00000005 00000001
- 6fbd4c18 00000000 00000000 00000000 00000001
- ...
复制代码 从卦中的 +0x4 偏移可以看到当前触发的是 FullGC,从 +0x38 可以看到GC的触发原因是 reason_oos_soh = 5,有经验的朋友看到这里应该就知道是什么原因了。
3. 为什么会触发FullGC
相信大家都知道FullGC 有一个 STW 的概念,既然有STW自然就会让程序卡死,回过头来说一下经验在哪里,对,就是这个指针的长度,很显然这个程序是 32bit 的,所以很大概率程序是 32bit 部署,会受到 2G 虚拟地址的限制,因为可用内存不足导致高频的触发 FullGC,可以用 !address -summary 去验证一下。- 0:048> !address -summary
- --- Usage Summary ---------------- RgnCount ----------- Total Size -------- %ofBusy %ofTotal
- <unknown> 955 6029f000 ( 1.503 GB) 81.21% 75.13%
- Image 1251 10105000 ( 257.020 MB) 13.57% 12.55%
- Free 326 995d000 ( 153.363 MB) 7.49%
- Stack 165 34c0000 ( 52.750 MB) 2.78% 2.58%
- Heap 137 2db0000 ( 45.688 MB) 2.41% 2.23%
- Other 12 47000 ( 284.000 kB) 0.01% 0.01%
- TEB 55 37000 ( 220.000 kB) 0.01% 0.01%
- PEB 1 1000 ( 4.000 kB) 0.00% 0.00%
- ...
- --- State Summary ---------------- RgnCount ----------- Total Size -------- %ofBusy %ofTotal
- MEM_COMMIT 2152 66fc7000 ( 1.609 GB) 86.97% 80.46%
- MEM_RESERVE 424 f6cc000 ( 246.797 MB) 13.03% 12.05%
- MEM_FREE 326 995d000 ( 153.363 MB) 7.49%
- ...
复制代码 从卦中可以看到,当前程序吃了1.6G的虚拟地址,占全量的 80% ,这样情况按理说程序会抛 OutOfMemoryException 异常,在 !t 中也得到了验证。- 0:048> !t
- ThreadCount: 48
- UnstartedThread: 0
- BackgroundThread: 35
- PendingThread: 0
- DeadThread: 0
- Hosted Runtime: no
- Lock
- ID OSID ThreadOBJ State GC Mode GC Alloc Context Domain Count Apt Exception
- ...
- 33 39 10c8 0ee3dec0 1029220 Preemptive 00000000:00000000 0076c700 0 MTA (Threadpool Worker) System.OutOfMemoryException 32614444 (nested exceptions)
- ...
- 46 44 89c 0ee3c458 1029220 Preemptive 00000000:00000000 0076c700 1 MTA (Threadpool Worker) System.OutOfMemoryException 32605d34 (nested exceptions)
- ...
复制代码 接下来的问题是谁吃掉了 1.6G 的内存,总有地方会吃,可以使用 !eeheap -gc 观察下托管堆。- 0:048> !eeheap -gc
- Number of GC Heaps: 1
- generation 0 starts at 0x32632d50
- generation 1 starts at 0x3262534c
- generation 2 starts at 0x03291000
- ephemeral segment allocation context: (0x3265ffb0, 0x3265ffbc)
- segment begin allocated size
- 03290000 03291000 0428fee4 0xffeee4(16772836)
- 06c60000 06c61000 07c5ffc4 0xffefc4(16773060)
- ...
- 7d210000 7d211000 7e20ffac 0xffefac(16773036)
- 31660000 31661000 3265ffb0 0xffefb0(16773040)
- Large object heap starts at 0x04291000
- segment begin allocated size
- 04290000 04291000 0450fa78 0x27ea78(2615928)
- 53390000 53391000 54391020 0x1000020(16777248)
- Total Size: Size: 0x4622fd30 (1176698160) bytes.
- ------------------------------
- GC Heap Size: Size: 0x4622fd30 (1176698160) bytes.
复制代码 从卦中看应该就是托管堆吃掉了,接下来就是看下托管堆中哪一类对象最多,最终找到了一个大集合,命令输出如下:- 0:048> !gcroot 4c0507c0
- Thread 89c:
- 471bd450 07f76405 IBatisNet.DataMapper.MappedStatements.MappedStatement.RunQueryForList[[System.__Canon, mscorlib]](IBatisNet.DataMapper.Scope.RequestScope, IBatisNet.DataMapper.ISqlMapSession, System.Object, System.Collections.Generic.IList`1<System.__Canon>, IBatisNet.DataMapper.RowDelegate`1<System.__Canon>)
- ebp+90: 471be6c4
- -> 32c2ea50 System.Collections.Generic.List`1[[xxx.Model]]
- -> 53391010 xxxRMT[]
- -> 4c0507c0 xxxMT
- 0:048> !do 32c2ea50
- Name: System.Collections.Generic.List`1[[xxx, xxx]]
- MethodTable: 095f1b58
- EEClass: 6e246b4c
- Size: 24(0x18) bytes
- File: C:\Windows\Microsoft.Net\assembly\GAC_32\mscorlib\v4.0_4.0.0.0__b77a5c561934e089\mscorlib.dll
- Fields:
- MT Field Offset Type VT Attr Value Name
- 6e67ca34 4001886 4 System.__Canon[] 0 instance 53391010 _items
- 6e66f2d8 4001887 c System.Int32 1 instance 3011824 _size
- 6e66f2d8 4001888 10 System.Int32 1 instance 3011824 _version
- 6e66d824 4001889 8 System.Object 0 instance 00000000 _syncRoot
- 6e67ca34 400188a 4 System.__Canon[] 0 static <no information>
复制代码 从卦中看当前的List有length=3011824,并且还被 89c 线程持有,最终通过代码找到了是某种查询下导致的大SQL引发。
三:总结
这次程序卡死还是挺有意思的,表象是主线程被GC卡住,实则是大SQL导致虚拟地址不足,分享出来让大家少踩坑吧!

免责声明:如果侵犯了您的权益,请联系站长,我们会及时删除侵权内容,谢谢合作! |