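I: Background
1. The story
Early this year a friend reached out to say that their management system had stopped responding, and asked me to help figure out what was going on. He had a dump on hand, so let's analyze it.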
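II: Why did it stop responding?
1. Is the thread pool queue backed up?
My friend's system is a web application, and when a web application stops responding the first thing to check is the thread pool. Run the !sos tpq command; the output looks like this:
- 0:000> !sos tpq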
- global work item queue________________________________
- 0x00000004010774C0 Microsoft.AspNetCore.Server.IIS.Core.IISHttpContextOfT<Microsoft.AspNetCore.Hosting.HostingApplication+Context>
- 0x0000000401077808 Microsoft.AspNetCore.Server.IIS.Core.IISHttpContextOfT<Microsoft.AspNetCore.Hosting.HostingApplication+Context>
- ....
- 0x000000030239DD78 Microsoft.AspNetCore.Server.IIS.Core.IISHttpContextOfT<Microsoft.AspNetCore.Hosting.HostingApplication+Context>
- 0x000000030239E0C0 Microsoft.AspNetCore.Server.IIS.Core.IISHttpContextOfT<Microsoft.AspNetCore.Hosting.HostingApplication+Context>
- local per thread work items_____________________________________
- 0x0000000100A46410 System.Threading.Tasks.Task<System.Threading.Tasks.Task>
- ...
- 0x000000010133F8C0 System.Threading.Tasks.Task<System.Threading.Tasks.Task>
- 2 Work Microsoft.AspNetCore.Http.Connections.Internal.HttpConnectionContext+<>c.<WaitOnTasks>b__123_1
- 4 Work Microsoft.AspNetCore.Http.Connections.Internal.HttpConnectionContext+<>c.<WaitOnTasks>b__123_0
- 266 Work Microsoft.AspNetCore.SignalR.HubConnectionContext.AbortConnection
- ----
- 272
The output confirms that the thread pool queue really is backed up. So why the backlog? My reflex was to suspect a lock, so I checked with !syncblk:
- 0:000> !syncblk
- Index SyncBlock MonitorHeld Recursion Owning Thread Info SyncBlock Owner
- -----------------------------
- Total 468
- CCW 0
- RCW 0
- ComClassFactory 0
- Free 120
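The output shows that locks have nothing to do with it, so the only option left is to dig into the individual worker threads and see why they are not keeping up.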
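On a live process, the same backlog signal can be read from the runtime without taking a dump. The following is a minimal sketch, assuming .NET 5 or later; `ThreadPoolProbe`, `Snapshot`, `LooksBackedUp`, and the threshold of 100 are hypothetical names and values chosen for illustration, not something taken from the system under analysis.

```csharp
using System;
using System.Threading;

public static class ThreadPoolProbe
{
    // Counters exposed by the runtime; PendingWorkItems roughly corresponds
    // to the total that !sos tpq reports for a dump.
    public sealed record Snapshot(long PendingWorkItems, int ThreadCount, long CompletedWorkItems);

    // Read the current thread pool counters of this process.
    public static Snapshot Capture() =>
        new Snapshot(
            ThreadPool.PendingWorkItemCount,
            ThreadPool.ThreadCount,
            ThreadPool.CompletedWorkItemCount);

    // Hypothetical heuristic: treat the pool as backed up once the number of
    // queued items reaches the threshold. The default of 100 is an assumption.
    public static bool LooksBackedUp(Snapshot snapshot, long pendingThreshold = 100) =>
        snapshot.PendingWorkItems >= pendingThreshold;
}
```

Calling `ThreadPoolProbe.Capture()` periodically and logging the result would show a queue that keeps growing while the completed count stalls, which is the live equivalent of the 272 queued items seen above.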
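2. What are the threads doing?
To see what each thread is doing, run ~*e !clrstack to dump every thread's call stack. The output is very long, but a careful read shows that many threads are parked in TryGetConnection, as the screenshot below shows: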

The output shows roughly 143 threads stuck in TryGetConnection, and it is not yet clear why they cannot obtain a Connection. Next, look at the code in question, simplified as follows:
- private bool TryGetConnection(DbConnection owningObject, uint waitForMultipleObjectsTimeout, bool allowCreate, bool onlyOneCheckConnection, DbConnectionOptions userOptions, out DbConnectionInternal connection)
- {
- DbConnectionInternal dbConnectionInternal = null;
- Transaction transaction = null;
- if (HasTransactionAffinity)
- {
- dbConnectionInternal = GetFromTransactedPool(out transaction);
- }
- if (dbConnectionInternal == null)
- {
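- // Count this caller as a waiter before blocking.
- Interlocked.Increment(ref _waitCount);
- do
- {
- // Block until one of the pool's wait handles is signaled or the timeout elapses.
- int num = WaitHandle.WaitAny(_waitHandles.GetHandles(allowCreate), (int)waitForMultipleObjectsTimeout);
- //....
- } while (dbConnectionInternal == null);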
- }
- }
- private DbConnectionInternal GetFromTransactedPool(out Transaction transaction)
- {
- transaction = ADP.GetCurrentTransaction();
- DbConnectionInternal dbConnectionInternal = null;
- if (null != transaction && _transactedConnectionPool != null)
- {
- dbConnectionInternal = _transactedConnectionPool.GetTransactedObject(transaction);
- //....
- }
- return dbConnectionInternal;
- }
-
- internal DbConnectionInternal GetTransactedObject(Transaction transaction)
- {
- lock (_transactedCxns)
- {
- flag = _transactedCxns.TryGetValue(transaction, out value);
- }
- if (flag)
- {
- lock (value)
- {
- int num = value.Count - 1;
- if (0 <= num)
- {
- dbConnectionInternal = value[num];
- value.RemoveAt(num);
- }
- }
- }
- return dbConnectionInternal;
- }
The data in the output above tells us three things:
- _totalObjects = 100: the connection pool currently holds 100 Connections.
- _count = 0: all 100 Connections are in use, so none are free.
- _waitCount = 143: 143 threads are currently waiting to obtain a Connection.
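The relationship between these three counters can be sketched with a toy pool. This is a heavily simplified, hypothetical stand-in and not the actual Microsoft.Data.SqlClient pool: `TinyPool` and its members are invented for illustration, and a `SemaphoreSlim` replaces the wait handles used by the real code.

```csharp
using System;
using System.Threading;

// Hypothetical, simplified stand-in for a fixed-size connection pool.
public sealed class TinyPool
{
    private readonly SemaphoreSlim _free; // free slots, playing the role of _count
    private int _waitCount;               // callers currently inside TryGet

    public TinyPool(int totalObjects)
    {
        TotalObjects = totalObjects;
        _free = new SemaphoreSlim(totalObjects, totalObjects);
    }

    public int TotalObjects { get; }
    public int Count => _free.CurrentCount;
    public int WaitCount => Volatile.Read(ref _waitCount);

    // Try to take a slot, blocking for at most the given timeout.
    public bool TryGet(TimeSpan timeout)
    {
        Interlocked.Increment(ref _waitCount);
        try
        {
            return _free.Wait(timeout);
        }
        finally
        {
            Interlocked.Decrement(ref _waitCount);
        }
    }

    // Hand a slot back so that one waiter can proceed.
    public void Return() => _free.Release();
}
```

With a pool of 100 slots whose holders never call `Return`, `Count` stays at 0 and every further caller adds to `WaitCount` until its timeout expires, which matches the shape of the numbers in the dump.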
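3. Where did the pooled connections go?
To answer this, keep examining the thread stacks. Searching for the TDS transport-layer method Microsoft.Data.SqlClient.TdsParserStateObject.TryReadByte, for example, turns up exactly 100 threads, and the frames above it mostly belong to the xxxx.GetRoomNosInDltParams method, as the screenshot shows.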
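Counting by eye is error-prone when the stack dump is this long. Below is a minimal sketch of a helper that counts the threads whose stack contains a given frame. It assumes the output of ~*e !clrstack has been saved as text and that each thread's section starts with a line beginning with "OS Thread Id:"; `StackCounter` and `CountThreadsWithFrame` are hypothetical names.

```csharp
using System;
using System.Linq;

public static class StackCounter
{
    // Counts thread sections that mention the frame at least once.
    // Assumption: every thread section begins with "OS Thread Id:".
    public static int CountThreadsWithFrame(string clrstackOutput, string frame)
    {
        return clrstackOutput
            .Split(new[] { "OS Thread Id:" }, StringSplitOptions.RemoveEmptyEntries)
            .Count(section => section.Contains(frame, StringComparison.Ordinal));
    }
}
```

A thread is counted once no matter how many times the frame appears in its stack, so the result can be compared directly with the pool size.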
Digging into the individual thread stacks, they are all running roughly the following SQL, formatted here:
[code]SELECT [HotelId], [RoomId], [SubRoomId], [SupplierId], [RoomNo], [RoomCategory], [SellPrice], [RoomNoState], [CheckInDate] FROM [xxx] WHERE ((((((([RoomId] IN (673,674)) AND ( [SellPrice] > @SellPrice1 )) AND ( [CheckInDate] >= @CheckInDate2 )) AND ( [CheckInDate]