Error message
[INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 81.51 GB, percent = 64.9%
W0419 10:14:27.573000 108354 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 108373 closing signal SIGTERM
W0419 10:14:27.594000 108354 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 108375 closing signal SIGTERM
W0419 10:14:27.594000 108354 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 108376 closing signal SIGTERM
E0419 10:14:33.446000 108354 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: -9) local_rank: 1 (pid: 108374) of binary: /opt/conda/envs/llamaf/bin/python
Traceback (most recent call last):
  File "/opt/conda/envs/llamaf/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.5.1', 'console_scripts', 'torchrun')())
xxx
xxx
xxx
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
/home/LLaMA-Factory/src/llamafactory/launcher.py FAILED
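The last memory report before the crash shows CPU Virtual Memory at only 64.9% (81.51 GB used), and the worker with local_rank 1 then exited with code -9, meaning it was killed by SIGKILL. On Linux this pattern usually means the kernel's OOM killer stopped the process while it was still loading, so the cause was identified as insufficient memory.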
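The diagnosis above can be cross-checked with a small sketch like the following. It assumes a POSIX shell; `exitcode_to_signal` and `filter_oom_lines` are hypothetical helper names introduced here for illustration, and the exact wording of kernel OOM messages varies between kernel versions, so the pattern is deliberately loose.

```shell
#!/bin/sh
# Sketch: map a torchrun-style negative exit code to a signal name.
# exitcode_to_signal is a hypothetical helper, not part of torchrun.
exitcode_to_signal() {
    code=$1
    case "$code" in
        -*) kill -l "${code#-}" ;;            # -9 -> 9 -> KILL
        *)  echo "not a signal exit: $code" ;;
    esac
}

# Keep only kernel log lines that look like an OOM kill.
# Reads from stdin so it can be fed from dmesg or journalctl -k.
filter_oom_lines() {
    grep -iE 'out of memory|oom-kill|killed process'
}

# Typical use (reading the kernel log needs root on many systems):
#   exitcode_to_signal -9
#   sudo dmesg -T | filter_oom_lines
```

If the filtered kernel log names the pid from the torchrun error (108374 in the log above), that would confirm the worker was killed for lack of memory rather than by something else sending SIGKILL.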
Solution: add a swap file.
- # Create the directory that will hold the swap file
- sudo mkdir -p /data/swap/
- # Allocate the swap file
- # bs=512M is the block size, count=192 is the number of blocks,
- # so the swap size is bs*count = 96 GiB
- sudo dd if=/dev/zero of=/data/swap/swap0 bs=512M count=192
- # Restrict the swap file's permissions to root only
- sudo chmod 0600 /data/swap/swap0
- # Format the file as swap space
- sudo mkswap /data/swap/swap0
- # Activate the swap file
- sudo swapon /data/swap/swap0
- # Check that the swap information is correct
- sudo swapon -s
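As a sanity check on the sizing, the arithmetic behind the `dd` command and the resulting swap total can be sketched as below. This assumes a POSIX shell with `awk` and a Linux `/proc/meminfo`; `swap_size_gib` and `swap_total_gib` are hypothetical helper names, not standard tools.

```shell
#!/bin/sh
# swap_size_gib BS_MIB COUNT
# Size in GiB of a file written by: dd bs=<BS_MIB>M count=<COUNT>
swap_size_gib() {
    echo $(( $1 * $2 / 1024 ))
}

# swap_total_gib
# Read a meminfo-format stream on stdin and print SwapTotal in GiB.
# SwapTotal is reported in kB; one decimal place is printed because the
# kernel reports slightly less than the file size (swap header page).
swap_total_gib() {
    awk '/^SwapTotal:/ { printf "%.1f\n", $2 / 1024 / 1024 }'
}

# Usage:
#   swap_size_gib 512 192           # size requested by the dd command
#   swap_total_gib < /proc/meminfo  # total swap the kernel now sees
```

If no other swap was configured beforehand, the two numbers should roughly agree once `swapon` has succeeded. A swap file activated this way lasts only until the next reboot.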