1. Full error message
torch.distributed.DistNetworkError: The client socket has timed out after -446744073709s while trying to connect to (127.0.0.1, 26825)
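2. Scenario
I was fine-tuning a llama3 model with llamafactory and wanted to use DeepSpeed as the multi-GPU fine-tuning setup. After enabling DeepSpeed and explicitly mounting all GPUs, the job failed with the error above and would not start.
Container orchestration platform: K8S
Environment variable used to mount all GPUs: NVIDIA_VISIBLE_DEVICES=all
llamafactory-cli option that enables DeepSpeed through a config file: --deepspeed ds_z2_config.json
GPU server configuration: 5 GPUs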
3. Solutions
Option 1: Switch to a single GPU
Specifying a single device with CUDA_VISIBLE_DEVICES before launching is enough to get the job running again.
CUDA_VISIBLE_DEVICES=0 llamafactory-cli train xxxx
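To confirm which devices a process will be restricted to after setting CUDA_VISIBLE_DEVICES, a small check like the one below can help. This is a minimal sketch: the helper name parse_visible_devices is hypothetical, and it only parses the environment variable rather than querying the driver.

```python
import os


def parse_visible_devices(value):
    """Split a CUDA_VISIBLE_DEVICES-style string into a list of device IDs.

    None means the variable is unset (no restriction applied);
    an empty string means no devices are visible.
    """
    if value is None:
        return None
    return [item.strip() for item in value.split(",") if item.strip()]


if __name__ == "__main__":
    devices = parse_visible_devices(os.environ.get("CUDA_VISIBLE_DEVICES"))
    if devices is None:
        print("CUDA_VISIBLE_DEVICES is unset: no restriction is applied")
    else:
        print(f"{len(devices)} device(s) selected: {devices}")
```

Running it with CUDA_VISIBLE_DEVICES=0 set should report a single selected device, which matches the single-GPU workaround described here.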
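Option 2: Remove the --ddp_timeout parameter
Testing showed that the error appears once the parameter --ddp_timeout 18000000000000 is added.
My command is shown below. Deleting the line "--ddp_timeout 18000000000000 \" and running it again is all that is needed.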
CUDA_VISIBLE_DEVICES=3,4 FORCE_TORCHRUN=1 llamafactory-cli train \
    --model_name_or_path /fine_tuning/model_path \
    --deepspeed ds_z2_config.json \
    --finetuning_type lora \
    --do_train true \
    --stage sft \
    --dataset_dir /fine_tuning/dataset \
    --dataset trainFile \
    --template llama3 \
    --cutoff_len 1024 \
    --overwrite_cache true \
    --preprocessing_num_workers 16 \
    --output_dir /fine_tuning/work_dir/llama3-2024-10-12-10-09-48-adapter \
    --overwrite_output_dir true \
    --plot_loss true \
    --save_steps 500 \
    --save_total_limit 5 \
    --logging_steps 5 \
    --learning_rate 5e-05 \
    --num_train_epochs 3.0 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --lr_scheduler_type cosine \
    --warmup_steps 0 \
    --bf16 false \
    --fp16 true \
    --ddp_timeout 18000000000000 \
    --val_size 0.01 \
    --eval_strategy steps \
    --eval_steps 50 \
    --per_device_eval_batch_size 1 \
    --lora_alpha 16 \
    --lora_dropout 0 \
    --lora_rank 8 \
    --lora_target all
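The negative timeout in the error message is consistent with an integer overflow. If the 18000000000000-second timeout is converted to microseconds and stored in a signed 64-bit integer (an assumption about the internals, not something confirmed from the PyTorch source), it wraps around to exactly the reported figure. The arithmetic can be checked with the sketch below; wrap_int64 is a helper written only for this illustration.

```python
def wrap_int64(value):
    """Reduce an integer to the range of a signed 64-bit integer (two's complement)."""
    value &= (1 << 64) - 1      # keep only the low 64 bits
    if value >= 1 << 63:        # top bit set means the value reads as negative
        value -= 1 << 64
    return value


def truncate_div(numerator, denominator):
    """Integer division that truncates toward zero, as C and C++ do."""
    quotient = abs(numerator) // abs(denominator)
    return quotient if (numerator < 0) == (denominator < 0) else -quotient


ddp_timeout_s = 18_000_000_000_000            # value passed via --ddp_timeout
as_microseconds = ddp_timeout_s * 1_000_000   # 1.8e19, above 2**63 - 1
wrapped = wrap_int64(as_microseconds)         # assumed int64 storage
wrapped_seconds = truncate_div(wrapped, 1_000_000)

print(wrapped_seconds)  # -446744073709, the value in the error message
```

Under this assumption, any timeout above roughly 9.22e12 seconds would overflow, while a smaller value such as 180000000 stays well within range.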
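Option 3: Use a different launch method
Other launchers can be used instead. After all, cli.py in the llamafactory source runs the job through torchrun when FORCE_TORCHRUN=1 is set.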
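As a rough illustration of that behaviour, a launcher of this kind can be sketched as follows. This is not the llamafactory source: the function name build_torchrun_command, the defaults, and the random port range are assumptions made for the example. It does show how a randomly chosen local port, similar to the 26825 in the error message, could end up as the master address that workers connect to.

```python
import os
import random


def build_torchrun_command(script, script_args, env=None, device_count=1):
    """Assemble a torchrun command line from environment variables.

    Hypothetical helper: the variable names mirror those used in this
    article (NNODES, NPROC_PER_NODE); the defaults are illustrative.
    """
    env = os.environ if env is None else env
    master_addr = env.get("MASTER_ADDR", "127.0.0.1")
    # Assumed behaviour: fall back to a random high port when none is given.
    master_port = env.get("MASTER_PORT", str(random.randint(20001, 29999)))
    return [
        "torchrun",
        "--nnodes", env.get("NNODES", "1"),
        "--node_rank", env.get("NODE_RANK", "0"),
        "--nproc_per_node", env.get("NPROC_PER_NODE", str(device_count)),
        "--master_addr", master_addr,
        "--master_port", master_port,
        script,
        *script_args,
    ]


command = build_torchrun_command(
    "launcher.py",
    ["--stage", "sft"],
    env={"NPROC_PER_NODE": "2", "MASTER_PORT": "29500"},
)
print(" ".join(command))
```

Printing the assembled command before running it makes it easy to see which address, port, and process count the workers will actually use.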
==== torchrun: two different multi-GPU launch styles that are practically equivalent; see https://pytorch.org/docs/stable/elastic/run.html ====
CUDA_VISIBLE_DEVICES=1,2,3 torchrun --node_rank=0 --master_addr=127.0.0.1 --master_port=29500 --nnodes=1 --nproc_per_node=3 /finetune/finetune.py --train_format multi-turn --train_file /fine_tuning/formatted_data/train.jsonl --max_seq_length 64 --preprocessing_num_workers 1 --model_name_or_path /fine_tuning/model_path --output_dir /fine_tuning/output/tool_alpaca_pt-PTV2_0000001-1728627571091 --per_device_train_batch_size 1 --gradient_accumulation_steps 1 --max_steps 100000 --logging_steps 1 --save_steps 500 --learning_rate 2e-2 --pre_seq_len 128
CUDA_VISIBLE_DEVICES=1,2,3 torchrun --rdzv-backend=c10d --rdzv-endpoint=localhost:0 --nnodes=1 --nproc_per_node=3 /finetune/finetune.py --train_format multi-turn --train_file /fine_tuning/formatted_data/train.jsonl --max_seq_length 64 --preprocessing_num_workers 1 --model_name_or_path /fine_tuning/model_path --output_dir /fine_tuning/output/tool_alpaca_pt-PTV2_0000001-1728627571091 --per_device_train_batch_size 1 --gradient_accumulation_steps 1 --max_steps 100000 --logging_steps 1 --save_steps 500 --learning_rate 2e-2 --pre_seq_len 128
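The second command passes port 0 in --rdzv-endpoint, which the torchrun documentation linked above describes as a way to have a free port picked automatically instead of relying on a fixed one such as 29500. The underlying operating-system mechanism can be sketched in a few lines; find_free_port is a hypothetical helper, and torchrun's own implementation may differ.

```python
import socket


def find_free_port(host="127.0.0.1"):
    """Ask the OS for an unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))            # port 0 means "any free port"
        return sock.getsockname()[1]    # the port the OS actually assigned


port = find_free_port()
print(f"free port chosen by the OS: {port}")
```

The port is released as soon as the socket closes, so another process could claim it before it is reused; that is a known limitation of this pattern.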
==== Launching llamafactory through torchrun ====
export NPROC_PER_NODE=2
export NNODES=1
CUDA_VISIBLE_DEVICES=3,4 FORCE_TORCHRUN=1 torchrun /usr/src/llama-factory/src/llamafactory/launcher.py \
    --model_name_or_path /fine_tuning/model_path \
    --finetuning_type lora \
    --do_train true \
    --stage sft \
    --dataset_dir /fine_tuning/dataset \
    --dataset trainFile \
    --template llama3 \
    --cutoff_len 1024 \
    --overwrite_cache true \
    --preprocessing_num_workers 16 \
    --output_dir /fine_tuning/work_dir/llama3-2024-10-12-10-09-48-adapter \
    --overwrite_output_dir true \
    --plot_loss true \
    --save_steps 500 \
    --save_total_limit 5 \
    --logging_steps 5 \
    --learning_rate 5e-05 \
    --num_train_epochs 3.0 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --lr_scheduler_type cosine \
    --warmup_steps 0 \
    --bf16 false \
    --fp16 true \
    --val_size 0.01 \
    --eval_strategy steps \
    --eval_steps 50 \
    --per_device_eval_batch_size 1 \
    --lora_alpha 16 \
    --lora_dropout 0 \
    --lora_rank 8 \
    --lora_target all
==== Launching with deepspeed directly ====
deepspeed --include localhost:0,1,2,3,4 train.py \
    --deepspeed ds_z2_config.json \
    --stage sft \
    --model_name_or_path /fine_tuning/model_path \
    --do_train \
    --dataset_dir /fine_tuning/dataset \
    --dataset trainFile \
    --template llama3 \
    --finetuning_type lora \
    --output_dir /fine_tuning/work_dir/llama3-2024-10-18-09-46-99-adapter \
    --overwrite_cache \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --lr_scheduler_type cosine \
    --logging_steps 5 \
    --save_steps 500 \
    --learning_rate 1e-01 \
    --num_train_epochs 30.0 \
    --plot_loss \
    --fp16
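4. Problem analysis (optional reading)
Why switching to a single GPU works
With only one GPU mounted, the error does not occur. With more than one mounted, it does.
For example, a working GPU visibility setting: NVIDIA_VISIBLE_DEVICES=0
A failing GPU visibility setting: NVIDIA_VISIBLE_DEVICES=0,1
Launching with torchrun directly did not trigger the problem, even with all GPUs mounted. Launching through llamafactory did, so the cause may lie in how it handles GPU communication or device allocation.
I searched through many articles on Baidu and found nobody reporting the same issue, so I had to analyze it myself. Tracing the code just above the failure point shows llamafactory querying the GPU devices in order to configure the run. It may only handle the single-GPU case by default.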
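Since the error is a client socket failing to reach (127.0.0.1, 26825), one way to narrow the problem down is to check whether anything is listening on that port while the job is starting. The following is a minimal sketch using only the standard library; can_connect is a hypothetical helper, and the host and port are taken from the error message at the top of this article.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Address reported in the DistNetworkError; adjust it to match your own log.
    print(can_connect("127.0.0.1", 26825))
```

A False result while the launcher is still waiting would suggest that the process expected to host the store never started listening, rather than a problem on the connecting side.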