Artificial Intelligence: Fine-tuning Llama 3 on Chinese Datasets with LLaMA-Factory

Posted by 忿忿的泥巴坨 on 2024-9-19 02:20:45


I. Overview

https://github.com/SmartFlowAI/Llama3-Tutorial/tree/main

1. Baseline model evaluation
2. Offline evaluation with OpenCompass
3. Data preparation
4. Fine-tuning
5. Merging
6. Testing
7. Human evaluation comparison

II. Implementation

1. Baseline model evaluation
Baseline model: llama3-8b
https://zhuanlan.zhihu.com/p/694818596?
https://github.com/SmartFlowAI/Llama3-Tutorial/blob/main/docs/opencompass.md
https://github.com/InternLM/Tutorial/blob/camp2/data_fine_tuning/data_fine_tuning.md
CUDA_VISIBLE_DEVICES=0 llamafactory-cli eval \
--model_name_or_path /home/Meta-Llama-3-8B-Instruct \
--template llama3 \
--task triviaqa \
--split validation \
--lang en \
--n_shot 5 \
--batch_size 1
https://i-blog.csdnimg.cn/blog_migrate/3557f1c9a3dafc43f37f96f48e83de82.png
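In my view, although the epoch=1 model scores higher than the epoch=3 model on the standard Chinese evaluation metrics, in human evaluation the epoch=3 model's Chinese output meets users' needs better. As the number of training epochs grows, the model becomes more inclined to answer in Chinese.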
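One rough way to put a number on "more inclined to answer in Chinese" is to measure the share of Chinese characters in each reply. The sketch below is only an illustration: the helper name `chinese_ratio` and the two sample replies are hypothetical and are not taken from the original runs.

```python
def chinese_ratio(text: str) -> float:
    """Fraction of non-whitespace characters in the CJK Unified Ideographs block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    cjk = sum(1 for c in chars if "\u4e00" <= c <= "\u9fff")
    return cjk / len(chars)


# Hypothetical replies, written here only to exercise the helper
replies = {
    "mostly_english": "Beijing is the capital of China.",
    "mostly_chinese": "北京是中国的首都。",
}
for name, reply in replies.items():
    print(name, round(chinese_ratio(reply), 2))
```

Averaging this ratio over the same prompts for the epoch=1 and epoch=3 checkpoints would give a simple signal to set next to the human impressions. Punctuation such as "。" is not counted as Chinese here, so the ratio stays below 1 even for fully Chinese replies.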

2. Offline evaluation with OpenCompass
For deployment, see the separate OpenCompass configuration post.
from mmengine.config import read_base

with read_base():
    from .datasets.mmlu.mmlu_gen_4d595a import mmlu_datasets

datasets = [*mmlu_datasets]

from opencompass.models import HuggingFaceCausalLM

batch_size = 20
# Models to evaluate (several paths may be listed)
model_name_or_paths = [
    '/home/Meta-Llama-3-8B-Instruct'
]
models = []  # each model's config dict is collected in this list

for model_name_or_path in model_name_or_paths:
    # Use the last path component as the short model name
    abbr = model_name_or_path.split('/')[-1]
    model = dict(
        type=HuggingFaceCausalLM,
        abbr=abbr,
        path=model_name_or_path,
        tokenizer_path=model_name_or_path,
        tokenizer_kwargs=dict(padding_side='left',
                              truncation_side='left',
                              use_fast=False,
                              trust_remote_code=True
                              ),
        max_out_len=1024,
        max_seq_len=2048,
        batch_size=batch_size,
        model_kwargs=dict(device_map='auto', trust_remote_code=True),
        batch_padding=False,  # if false, inference with for-loop without batch padding
        run_cfg=dict(num_gpus=2, num_procs=2),
    )
    models.append(model)

# Run with: python run.py configs/eval_llama3_8b_demo.py
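A small caveat about the `abbr` line in the config above: `split('/')[-1]` returns an empty string when the model path ends with a slash. The sketch below shows one possible way to make the name derivation tolerant of that; `model_abbr` is a hypothetical helper and is not part of OpenCompass.

```python
import posixpath


def model_abbr(model_path: str) -> str:
    """Derive a short display name from a POSIX-style model directory path."""
    # normpath drops a trailing slash, so basename always sees the directory name
    return posixpath.basename(posixpath.normpath(model_path))


print(model_abbr("/home/Meta-Llama-3-8B-Instruct"))
print(model_abbr("/home/Meta-Llama-3-8B-Instruct/"))
# The plain split used in the config yields an empty name for the second form:
print(repr("/home/Meta-Llama-3-8B-Instruct/".split("/")[-1]))
```

With the path exactly as written in the config (no trailing slash) both approaches give the same result, so this only matters if paths are edited later.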

3. Data preparation
https://github.com/InternLM/Tutorial/blob/camp2/data_fine_tuning/data_fine_tuning.md
https://modelscope.cn/datasets/baicai003/Llama3-Chinese-dataset/summary
https://huggingface.co/datasets/m-a-p/COIG-CQIA
import json

import datasets

data = datasets.load_dataset("llamafactory/alpaca_gpt4_zh")
data = data["train"]
res = []
for i in range(len(data)):
    res.append(data[i])  # append the i-th record, not the whole dataset
with open('alpaca_gpt4_zh.json', 'w', encoding="utf8") as file:
    # Write indented JSON and keep Chinese characters unescaped
    json.dump(res, file, indent=4, ensure_ascii=False)
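Before training on an exported file it can help to check that every record has the expected shape. The following is a minimal sketch that assumes Alpaca-style records with `instruction`, `input` and `output` fields; the helper name `find_bad_records` and the two sample records are made up for illustration.

```python
REQUIRED_KEYS = ("instruction", "output")  # assumed Alpaca-style fields


def find_bad_records(records):
    """Return indices of records that are not dicts or lack a non-empty required field."""
    bad = []
    for idx, rec in enumerate(records):
        if not isinstance(rec, dict):
            bad.append(idx)
            continue
        if any(not str(rec.get(key) or "").strip() for key in REQUIRED_KEYS):
            bad.append(idx)
    return bad


# Tiny in-memory sample; a real check would json.load the exported file instead
sample = [
    {"instruction": "保持健康的三个提示。", "input": "", "output": "多运动，均衡饮食，充足睡眠。"},
    {"instruction": "", "input": "", "output": "missing instruction"},
]
print(find_bad_records(sample))
```

An empty `input` is allowed here because many Alpaca-style records have no extra context.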
Register the exported files in dataset_info.json:
#42677
"alpaca_gpt4_zh_local": {
    "file_name": "alpaca_gpt4_zh.json"
}
#16493
"Llama3-Chinese-dataset_local": {
    "file_name": "Llama3-Chinese-dataset.json"
}
#11262
"COIG-CQIA_local": {
    "file_name": "COIG-CQIA.json"
}
#51983
"alpaca_gpt4_en_local": {
    "file_name": "alpaca_gpt4_en.json"
}
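The entries above can also be merged into dataset_info.json with a short script instead of by hand. This is a sketch under assumptions: `register_datasets` is a hypothetical helper, and the demo writes to a temporary directory rather than to the real `./data` folder passed as `--dataset_dir`.

```python
import json
import os
import tempfile

# Dataset names and file names copied from the entries above
LOCAL_DATASETS = {
    "alpaca_gpt4_zh_local": "alpaca_gpt4_zh.json",
    "Llama3-Chinese-dataset_local": "Llama3-Chinese-dataset.json",
    "COIG-CQIA_local": "COIG-CQIA.json",
    "alpaca_gpt4_en_local": "alpaca_gpt4_en.json",
}


def register_datasets(info_path, datasets):
    """Merge {name: {"file_name": ...}} entries into a dataset_info.json file."""
    info = {}
    if os.path.exists(info_path):
        with open(info_path, encoding="utf8") as f:
            info = json.load(f)
    for name, file_name in datasets.items():
        info[name] = {"file_name": file_name}
    with open(info_path, "w", encoding="utf8") as f:
        json.dump(info, f, indent=2, ensure_ascii=False)
    return info


# Demonstrated on a temporary directory so no real file is touched
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "dataset_info.json")
    registered = register_datasets(path, LOCAL_DATASETS)
    print(sorted(registered))
```

Existing entries in the file are kept, and entries with the same name are overwritten.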

4. Fine-tuning
# LoRA fine-tuning on two GPUs
CUDA_VISIBLE_DEVICES=0,1 nohup llamafactory-cli train \
--stage sft \
--do_train \
--model_name_or_path /home/Meta-Llama-3-8B-Instruct \
--dataset alpaca_gpt4_zh_local,Llama3-Chinese-dataset_local,COIG-CQIA_local,alpaca_gpt4_en_local \
--dataset_dir ./data \
--template llama3 \
--finetuning_type lora \
--output_dir ./saves/LLaMA3-8B/lora/sft \
--overwrite_cache \
--overwrite_output_dir \
--cutoff_len 1024 \
--preprocessing_num_workers 16 \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 2 \
--gradient_accumulation_steps 8 \
--lr_scheduler_type cosine \
--logging_steps 50 \
--warmup_steps 20 \
--save_steps 100 \
--eval_steps 50 \
--evaluation_strategy steps \
--load_best_model_at_end \
--learning_rate 5e-5 \
--num_train_epochs 3.0 \
--val_size 0.1 \
--plot_loss \
--fp16 > output.log 2>&1 &
# 3 epochs overfit somewhat, so train for 1 epoch instead
CUDA_VISIBLE_DEVICES=0,1 nohup llamafactory-cli train \
--stage sft \
--do_train \
--model_name_or_path /home/Meta-Llama-3-8B-Instruct \
--dataset alpaca_gpt4_zh_local,Llama3-Chinese-dataset_local,COIG-CQIA_local,alpaca_gpt4_en_local \
--dataset_dir ./data \
--template llama3 \
--finetuning_type lora \
--output_dir ./saves/LLaMA3-8B/lora/sft_1 \
--overwrite_cache \
--overwrite_output_dir \
--cutoff_len 1024 \
--preprocessing_num_workers 16 \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 2 \
--gradient_accumulation_steps 8 \
--lr_scheduler_type cosine \
--logging_steps 50 \
--warmup_steps 20 \
--save_steps 100 \
--eval_steps 50 \
--evaluation_strategy steps \
--load_best_model_at_end \
--learning_rate 5e-5 \
--num_train_epochs 1.0 \
--val_size 0.1 \
--plot_loss \
--fp16 > output.log 2>&1 &
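To get a feel for what the batch settings in the two commands above imply, the arithmetic can be sketched as follows. Two assumptions go into this: the `#` numbers next to the dataset entries are sample counts, and both GPUs are used for data-parallel training. The step count is a rough estimate, not a value read from the training logs.

```python
import math

# Values taken from the training commands above
per_device_train_batch_size = 4
gradient_accumulation_steps = 8
num_gpus = 2  # CUDA_VISIBLE_DEVICES=0,1
val_size = 0.1

# Assumption: the "#" numbers next to the dataset entries are sample counts
sample_counts = [42677, 16493, 11262, 51983]

# Samples consumed per optimizer step across both GPUs
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_gpus

total_samples = sum(sample_counts)
# Hold out val_size of the data for evaluation, rounding the held-out part up
train_samples = total_samples - math.ceil(total_samples * val_size)
approx_steps_per_epoch = train_samples // effective_batch

print(effective_batch, total_samples, train_samples, approx_steps_per_epoch)
```

Under these assumptions one epoch is on the order of 1,700 optimizer steps, so `--eval_steps 50` and `--save_steps 100` trigger many evaluations and checkpoints per epoch.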

5. Merging
CUDA_VISIBLE_DEVICES=0 llamafactory-cli export \
--model_name_or_path /home/Meta-Llama-3-8B-Instruct \
--adapter_name_or_path ./saves/LLaMA3-8B/lora/sft \
--template llama3 \
--finetuning_type lora \
--export_dir megred-model-path-1 \
--export_size 2 \
--export_device cpu \
--export_legacy_format False
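Conceptually, merging a LoRA adapter folds the low-rank update into each adapted weight matrix: W' = W + (alpha / r) * B A. The toy sketch below illustrates that formula in pure Python. It is not the code that the export command runs; the function names and all numbers are made up for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return W + (alpha / rank) * B @ A, the merged weight of a LoRA layer."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


# Toy 2x2 layer with a rank-1 adapter; all numbers are made up for illustration
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]       # shape (rank, in_features)
B = [[0.5], [-0.5]]    # shape (out_features, rank)
merged = merge_lora(W, A, B, alpha=2, rank=1)
print(merged)
```

After merging, the adapter is no longer needed at inference time, which is why the chat and eval commands only take the merged model path.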

6. Testing
After fine-tuning:
CUDA_VISIBLE_DEVICES=0 llamafactory-cli chat \
--model_name_or_path megred-model-path \
--template llama3
https://i-blog.csdnimg.cn/blog_migrate/64d774727483b1fa680a8db9fc04eae7.png
https://i-blog.csdnimg.cn/blog_migrate/8dff9b3e3d8d00fa7179e090f544953e.png
https://i-blog.csdnimg.cn/blog_migrate/c5b1e9751e562e1f4d193adf9d46ae5f.png
https://i-blog.csdnimg.cn/blog_migrate/dd20611368663735d38a7c2c69d0a005.png
Before fine-tuning:
https://i-blog.csdnimg.cn/blog_migrate/8ab80ffa0fe245419ec27642f281faf8.png
CUDA_VISIBLE_DEVICES=0 llamafactory-cli eval \
--model_name_or_path megred-model-path \
--template llama3 \
--task mmlu \
--split validation \
--lang en \
--n_shot 5 \
--batch_size 1
https://i-blog.csdnimg.cn/blog_migrate/f123d984a20e120d221b1383fdaaa8e2.png
7. Human evaluation comparison
https://i-blog.csdnimg.cn/blog_migrate/af2842e56d57fc070d25acaba40f4468.png
https://i-blog.csdnimg.cn/blog_migrate/c52ccbdc6dfda0065154a4870a52903b.png
