IT Reviews · App Market - qidao123.com Tech Community

Title: llama-factory fine-tuning error

Author: 王柳    Posted: 3 days ago
Error message

[INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 81.51 GB, percent = 64.9%
W0419 10:14:27.573000 108354 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 108373 closing signal SIGTERM
W0419 10:14:27.594000 108354 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 108375 closing signal SIGTERM
W0419 10:14:27.594000 108354 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 108376 closing signal SIGTERM
E0419 10:14:33.446000 108354 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: -9) local_rank: 1 (pid: 108374) of binary: /opt/conda/envs/llamaf/bin/python
Traceback (most recent call last):
  File "/opt/conda/envs/llamaf/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.5.1', 'console_scripts', 'torchrun')())
xxx
xxx
xxx
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
/home/LLaMA-Factory/src/llamafactory/launcher.py FAILED
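  The last memory report shows CPU Virtual Memory at only 64.9% (81.51 GB used), yet the worker with local_rank 1 exited with code -9, which means it was killed with SIGKILL. That pattern most likely means host memory ran out while loading continued and the kernel's OOM killer terminated the process, so the cause is insufficient memory.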
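The exit code in the log can be decoded with a small helper. This is only a sketch: `signal_from_exitcode` is a hypothetical name, and it assumes the Python convention, which torchrun's message appears to follow, that a negative exit code means the process was terminated by that signal number.

```shell
# Hypothetical helper: map a torchrun-style exit code to a signal name.
# Assumption: a negative exit code N means "terminated by signal -N".
signal_from_exitcode() {
    code="$1"
    if [ "$code" -lt 0 ]; then
        # kill -l <number> prints the signal name without the SIG prefix
        kill -l "$(( 0 - code ))"
    else
        echo "none"
    fi
}

signal_from_exitcode -9    # prints KILL

# To confirm that the kernel OOM killer was responsible, the kernel log can
# be searched (may require root):
#   sudo dmesg -T | grep -i -E 'out of memory|killed process'
```

If the kernel log names the worker's pid (108374 in the log above), memory exhaustion on the host is confirmed.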
Solution: add swap space.
  # Create the directory that will hold the swap file
  sudo mkdir -p /data/swap/
  # Allocate the swap file:
  # bs=512M is the block size and count=192 is the number of blocks,
  # so the swap size is bs*count = 96 GiB
  sudo dd if=/dev/zero of=/data/swap/swap0 bs=512M count=192
  # Restrict the swap file's permissions to root only
  sudo chmod 0600 /data/swap/swap0
  # Format the file as swap space
  sudo mkswap /data/swap/swap0
  # Activate the swap file
  sudo swapon /data/swap/swap0
  # Check that the swap information is correct
  sudo swapon -s
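Before running `dd`, it may help to check the size arithmetic and the free disk space, since the command writes the full swap size to disk. The helpers below are a minimal sketch; `swap_size_gib` and `has_space` are hypothetical names, not part of LLaMA-Factory or any system tool.

```shell
# Hypothetical helper: size in GiB of a file written by
# dd with bs=<bs_mib>M count=<count>.
swap_size_gib() {
    bs_mib="$1"
    count="$2"
    echo $(( bs_mib * count / 1024 ))
}

# Hypothetical helper: succeed if the filesystem holding <dir> has at least
# <needed_gib> GiB available. df -Pk reports available space in KiB (column 4).
has_space() {
    dir="$1"
    needed_gib="$2"
    avail_kib=$(df -Pk "$dir" | awk 'NR==2 {print $4}')
    [ "$avail_kib" -ge $(( needed_gib * 1024 * 1024 )) ]
}

swap_size_gib 512 192    # prints 96

# Example check before allocating the swap file (path taken from the post):
#   has_space /data 96 && echo "enough space for the swap file"
# After swapon, the new space should show up in the "Swap:" row of: free -h
```

A smaller `bs` with a proportionally larger `count` gives the same file size while needing less memory for the `dd` buffer, which can matter on a host that is already short on RAM.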