Title: [Server] Problem: GPU drops off halfway through training, error "Unable to determine the device" Author: 惊雷无声 Date: 2025-1-20 13:37
Halfway through a training run a GPU dropped off, and nvidia-smi on the server reported:
Unable to determine the device handle for GPU 0000:02:00.0: Unknown Error
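1. Use sudo nvidia-bug-report.sh to inspect the problem log
Run nvidia-bug-report.sh while the training job is running to capture the GPU state and bug information. The log it produced contained this entry: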
Xid (...): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus
An expert replying in a related issue thread noted:
One of the gpus is shutting down. Since it’s not always the same one, I guess they’re not damaged but either overheating or lack of power occurs. Please monitor temperatures, check PSU.
So the most likely cause is overheating or an insufficient power supply.
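To avoid scrolling through the whole report by hand, the Xid entries can be pulled out with a few lines of Python. This is only a rough sketch: the function name find_xid_events and the regular expression are my own, and the pattern assumes the line layout shown in the log excerpt above (other driver versions may print it differently).

```python
import re

# Assumed layout, taken from the log excerpt in this post:
#   Xid (...): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus
XID_PATTERN = re.compile(r"Xid \((?P<device>[^)]*)\): (?P<code>\d+), (?P<detail>.*)")


def find_xid_events(log_text):
    """Return a list of (device, xid_code, detail) tuples, one per Xid line."""
    events = []
    for line in log_text.splitlines():
        match = XID_PATTERN.search(line)
        if match:
            events.append(
                (match["device"], int(match["code"]), match["detail"].strip())
            )
    return events
```

nvidia-bug-report.sh normally writes nvidia-bug-report.log.gz into the current directory, so the text can be loaded with gzip.open("nvidia-bug-report.log.gz", "rt", errors="replace").read() and passed to find_xid_events. An entry with code 79 corresponds to the "GPU has fallen off the bus" message above.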
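2. Summary of solutions
(1) GPU temperature limit
While training is running, run the following command alongside it to monitor how the GPU temperature changes (it queries the temperature section every 2 seconds and writes the output to nvidiatemp.log):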
nvidia-smi -q -l 2 -d TEMPERATURE -f nvidiatemp.log
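Once the GPU has dropped, nvidiatemp.log can be checked for how hot the cards got beforehand. The sketch below is an illustration, not a definitive parser: the helper names current_temps and peak_temp are made up, and it assumes the log contains lines of the form "GPU Current Temp : 45 C", which is what nvidia-smi -q -d TEMPERATURE prints on the drivers I have seen. Readings shown as N/A are skipped.

```python
import re

# Assumed field name and layout: "GPU Current Temp    : 45 C"
TEMP_PATTERN = re.compile(r"GPU Current Temp\s*:\s*(\d+)\s*C")


def current_temps(log_text):
    """Return every 'GPU Current Temp' reading (in Celsius) in log order."""
    return [int(value) for value in TEMP_PATTERN.findall(log_text)]


def peak_temp(log_text):
    """Return the highest reading, or None if the log has no readings."""
    temps = current_temps(log_text)
    return max(temps) if temps else None
```

For example, peak_temp(open("nvidiatemp.log").read()) gives the hottest sample across all GPUs in the log. Comparing it with the "GPU Slowdown Temp" and "GPU Shutdown Temp" values in the same log shows whether the cards were getting close to the temperature limit.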