[Large Models] Fine-Tuning Llama-3.1 8B with Unsloth: A Code Walkthrough


Unsloth is an open-source project for accelerating large language model training. It uses OpenAI's Triton to rewrite the model's compute kernels, which substantially speeds up training and reduces GPU memory usage.


  • Unsloth GitHub repository: https://github.com/unslothai/unsloth
  • Official Colab notebook for fine-tuning Llama-3.1 8B with Unsloth: https://colab.research.google.com/drive/1Ys44kVvmeZtnICzWz0xgpRnrIOjZAuxp?usp=sharing#scrollTo=2eSvM9zX_2d3
  • For installing Unsloth, see the companion post: [Large Models] Unsloth Installation and Usage Tutorial
1. Loading the Model and Tokenizer

from unsloth import FastLanguageModel
import torch

max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/Meta-Llama-3.1-8B-bnb-4bit",      # Llama-3.1 15 trillion tokens model 2x faster!
    "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    "unsloth/Meta-Llama-3.1-70B-bnb-4bit",
    "unsloth/Meta-Llama-3.1-405B-bnb-4bit",    # We also uploaded 4bit for 405b!
    "unsloth/Mistral-Nemo-Base-2407-bnb-4bit", # New Mistral 12b 2x faster!
    "unsloth/Mistral-Nemo-Instruct-2407-bnb-4bit",
    "unsloth/mistral-7b-v0.3-bnb-4bit",        # Mistral v3 2x faster!
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/Phi-3-mini-4k-instruct",          # Phi-3 2x faster!
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/gemma-2-9b-bnb-4bit",
    "unsloth/gemma-2-27b-bnb-4bit",            # Gemma 2x faster!
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)
Output:

[Code notes]:
(1) The model and tokenizer are loaded with Unsloth's FastLanguageModel.from_pretrained(), which noticeably speeds up loading of both.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)
(2) For comparison, here is the traditional way of loading the model and tokenizer with Hugging Face transformers:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = './model/llama-3-8b'   # local path to the model
model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True, device_map='auto')
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
2. LoRA Adapter

   We now add LoRA adapters so we only need to update 1 to 10% of all parameters!
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)
Output:

[Code notes]:
(1) The LoRA adapter is added with Unsloth's FastLanguageModel.get_peft_model(); the wrapped model is what gets passed to SFTTrainer later. Here r is the LoRA rank and lora_alpha scales the low-rank update (the effective scaling is lora_alpha / r).
(2) For comparison, here is the traditional way of adding a LoRA adapter with peft's get_peft_model:
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj",],
)
model = get_peft_model(model, lora_config)
# model.print_trainable_parameters()
In the traditional approach, get_peft_model wraps the base Transformer model. You can then call model.print_trainable_parameters() to check the number and share of trainable parameters, which is drastically smaller than the full model (see the sketch below).
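For reference, a minimal sketch of that check on the peft-wrapped model (the figures in the comment are illustrative for r = 16 on Llama-3.1-8B with the seven target modules listed above; actual numbers depend on the base model and LoRA settings):

# Print a summary of trainable vs. total parameters for the LoRA-wrapped model
model.print_trainable_parameters()
# Expected output is roughly of the form:
#   trainable params: ~41.9M || all params: ~8.07B || trainable%: ~0.5

Only the low-rank adapter matrices are trainable; the 8B base weights stay frozen.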
3. Data Preparation



  • Dataset: the training data is yahma/alpaca-cleaned, a set of 52K examples filtered and cleaned from the original Alpaca data; dataset page: https://huggingface.co/datasets/yahma/alpaca-cleaned
  • Note: EOS_TOKEN must be appended to the formatted text, otherwise generation will never terminate.
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN

def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs       = examples["input"]
    outputs      = examples["output"]
    texts = []
    for instruction, input, output in zip(instructions, inputs, outputs):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = alpaca_prompt.format(instruction, input, output) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }
pass

from datasets import load_dataset
dataset = load_dataset("yahma/alpaca-cleaned", split = "train")
dataset = dataset.map(formatting_prompts_func, batched = True,)

[Code notes]:
(1) The training set yahma/alpaca-cleaned follows the Alpaca prompt format, consisting of an Instruction, an Input, and a Response.
(2) The formatting function formatting_prompts_func renders each training record into the prompt-template string the model expects. Raw records cannot be fed to the model directly: they are first turned into the template string by formatting_prompts_func and then tokenized. In this code the function is applied with dataset.map, and SFTTrainer later reads the resulting field via dataset_text_field = "text" (a quick way to inspect the formatted text is shown in the snippet after these notes).
(3) The official notebooks also provide data-preparation and prompt-template examples for other kinds of tasks:


  • llama-3 template for ShareGPT-style conversation tasks: https://colab.research.google.com/drive/1XamvWYinY6FOSX9GLvnqSjjsNflxdhNc?usp=sharing
  • mistral-7b template for text-completion tasks: https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing#scrollTo=QmUBVEnvCDJv
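To sanity-check the formatting described in note (2), one formatted record can be printed directly (a minimal sketch using the dataset and "text" field defined above):

# Inspect one formatted training example; it should follow the Alpaca template and end with the EOS token
print(dataset[0]["text"])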
Common fine-tuning dataset formats fall into the following categories: instruction-following, multi-turn conversation, and other auxiliary formats (illustrative records for the first two are sketched after this list).
  

  • Instruction-following format: the user provides an instruction and the model produces output that satisfies it. Such datasets are usually stored as JSON files; the Alpaca-52k dataset is a typical example. Alpaca records come in two flavors: instruction/output and instruction/input/output.
  • Multi-turn conversation format: the user and the model interact over several turns until the user's request is fulfilled. A typical example is the ShareGPT dataset used to train the Vicuna model [6].
  • Other auxiliary formats: some data is hard to cast as a dialogue, such as plain-text documents; there are also purpose-specific datasets, such as text-summarization data and data for generating dialogues from plain text.
For an introduction to prompts in LLM instruction tuning, see the post: [NLP] LLM — the "Prompt" in instruction fine-tuning of large models.
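As a rough illustration of the first two formats (field names follow the common Alpaca and ShareGPT conventions; the values are made-up examples):

# Instruction-following record (Alpaca-style, instruction/input/output)
alpaca_record = {
    "instruction": "Translate the sentence into French.",
    "input": "Good morning!",
    "output": "Bonjour !",
}

# Multi-turn conversation record (ShareGPT-style)
sharegpt_record = {
    "conversations": [
        {"from": "human", "value": "What is LoRA?"},
        {"from": "gpt",   "value": "LoRA adds small low-rank matrices to a frozen base model..."},
        {"from": "human", "value": "Why does that save memory?"},
        {"from": "gpt",   "value": "Only the low-rank matrices receive gradients and optimizer state..."},
    ],
}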
4. Training the Model

4.1 Instantiating the SFTTrainer

The model is trained with the SFTTrainer class from Hugging Face's TRL library; official documentation: https://huggingface.co/docs/trl/sft_trainer
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        # num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 60,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)
Output:

[Code notes]:
(1) The model passed to SFTTrainer is the LoRA-wrapped model defined in the LoRA step above, so only the small set of adapter parameters is trained (parameter-efficient fine-tuning). With per_device_train_batch_size = 2 and gradient_accumulation_steps = 4, the effective batch size on a single GPU is 8, so max_steps = 60 covers roughly 480 training examples.
4.2 Launching Training

trainer_stats = trainer.train()
Output:

4.3 GPU Memory Usage
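The cell below assumes that start_gpu_memory and max_memory were recorded in a cell run before trainer.train(), as in the official notebook; a sketch of that earlier cell:

# Run BEFORE training: record total GPU memory and the baseline reserved after model loading
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")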

#@title Show final memory and time stats
# start_gpu_memory and max_memory come from the pre-training cell sketched above
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")
Output:

5. Model Inference

5.1 Direct Inference

# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

inputs = tokenizer(
[
    alpaca_prompt.format(
        "Continue the Fibonacci sequence.", # instruction
        "1, 1, 2, 3, 5, 8", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True)
tokenizer.batch_decode(outputs)
Output:
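The decoded output above includes the prompt and any special tokens. If only the model's continuation is wanted, a variant like the following can be used (skip_special_tokens is a standard transformers decoding option):

# Decode only the newly generated tokens, dropping the prompt and special tokens
generated = tokenizer.batch_decode(
    outputs[:, inputs["input_ids"].shape[1]:],
    skip_special_tokens = True,
)
print(generated[0])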

5.2 Inference with TextStreamer

You can also use a TextStreamer for continuous inference - so you can see the generation token by token, instead of waiting the whole time!
# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

inputs = tokenizer(
[
    alpaca_prompt.format(
        "Continue the Fibonacci sequence.", # instruction
        "1, 1, 2, 3, 5, 8", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)
Output:

6. Saving / Loading the LoRA Model

6.1 Saving the LoRA Adapter



  • Local saving: save the adapter to a local path
model.save_pretrained("lora_model") # Local saving
tokenizer.save_pretrained("lora_model")


  • Online saving: push the adapter to the Hugging Face Hub
model.push_to_hub("your_name/lora_model", token = "...") # Online saving
tokenizer.push_to_hub("your_name/lora_model", token = "...") # Online saving
6.2 Loading the LoRA Adapter

The official notebook shows two ways to do this:


  • Method 1: Unsloth's FastLanguageModel
To load the LoRA adapter we just saved and use it for inference, change False to True:
if False:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )
    FastLanguageModel.for_inference(model) # Enable native 2x faster inference

# alpaca_prompt = You MUST copy from above!
inputs = tokenizer(
[
    alpaca_prompt.format(
        "What is a famous tall tower in Paris?", # instruction
        "", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)
Output:



  • Method 2: Hugging Face's AutoPeftModelForCausalLM
If unsloth is not installed, the model can also be loaded with Hugging Face's AutoPeftModelForCausalLM. However, this is much slower than Unsloth, and downloading 4-bit models is not supported.
   You can also use Hugging Face’s AutoModelForPeftCausalLM. Only use this if you do not have unsloth installed. It can be hopelessly slow, since 4bit model downloading is not supported, and Unsloth’s inference is 2x faster.
if False:
    # I highly do NOT suggest - use Unsloth if possible
    from peft import AutoPeftModelForCausalLM
    from transformers import AutoTokenizer
    model = AutoPeftModelForCausalLM.from_pretrained(
        "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        load_in_4bit = load_in_4bit,
    )
    tokenizer = AutoTokenizer.from_pretrained("lora_model")
7. Saving to float16 for vLLM

   We also support saving to float16 directly. Select merged_16bit for float16 or merged_4bit for int4. We also allow lora adapters as a fallback. Use push_to_hub_merged to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.
# Merge to 16bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")

# Merge to 4bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")

# Just LoRA adapters
if False: model.save_pretrained_merged("model", tokenizer, save_method = "lora",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "lora", token = "")
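Once merged to 16 bit and uploaded, the checkpoint can be served with vLLM like any regular Hugging Face model. A minimal sketch (assuming the placeholder repo id hf/model from the snippet above and that the vllm package is installed):

from vllm import LLM, SamplingParams

# Load the merged 16-bit checkpoint pushed above
llm = LLM(model = "hf/model")
sampling_params = SamplingParams(temperature = 0.7, max_tokens = 64)
outputs = llm.generate(["Continue the Fibonacci sequence: 1, 1, 2, 3, 5, 8"], sampling_params)
print(outputs[0].outputs[0].text)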
8. GGUF / llama.cpp Conversion

# Save to 8bit Q8_0
if False: model.save_pretrained_gguf("model", tokenizer,)
# Remember to go to https://huggingface.co/settings/tokens for a token!
# And change hf to your username!
if False: model.push_to_hub_gguf("hf/model", tokenizer, token = "")

# Save to 16bit GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# Save to q4_k_m GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")

# Save to multiple GGUF options - much faster if you want multiple!
if False:
    model.push_to_hub_gguf(
        "hf/model", # Change hf to your username!
        tokenizer,
        quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
        token = "", # Get a token at https://huggingface.co/settings/tokens
    )
Now, use the model-unsloth.gguf file or model-unsloth-Q4_K_M.gguf file in llama.cpp or a UI-based system like GPT4All (which can be installed from its website).
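As a rough sketch of loading the exported file with the llama-cpp-python bindings (the model_path is assumed to match the Q4_K_M export named above):

from llama_cpp import Llama

# Load the GGUF file produced by save_pretrained_gguf / push_to_hub_gguf
llm = Llama(model_path = "model-unsloth-Q4_K_M.gguf", n_ctx = 2048)
out = llm(
    "### Instruction:\nContinue the Fibonacci sequence.\n\n### Input:\n1, 1, 2, 3, 5, 8\n\n### Response:\n",
    max_tokens = 64,
)
print(out["choices"][0]["text"])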
References

Official resources for fine-tuning Llama-3 with Unsloth:


  • Unsloth GitHub repository: https://github.com/unslothai/unsloth
  • https://colab.research.google.com/drive/1Ys44kVvmeZtnICzWz0xgpRnrIOjZAuxp?usp=sharing#scrollTo=2eSvM9zX_2d3
For small issues encountered during hands-on practice, the following posts may also help:


  • A full walkthrough of fine-tuning Llama-3 in practice
  • Getting started with supervised fine-tuning of large models
