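1. Introduction
This article walks through fine-tuning Code Llama so that it becomes a useful tool for SQL development. On programming tasks, a suitably fine-tuned Code Llama usually performs much better than plain Llama, especially when it is optimized for one specific task. The approach here:
- Train on b-mc2/sql-create-context, a collection of natural-language questions paired with their SQL queries
- Use the LoRA method: quantize the base model's weights to int8, freeze them, and train only the adapter
- Follow the alpaca-lora project for most of the code, with some improvements and optimizations
Together, these steps focus Code Llama on SQL development and give better results. Following the steps in this guide should be enough to reproduce the fine-tuning yourself.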
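2. Fine-tuning Code Llama
2.1. Installing dependencies
I ran the code in this article on an A100 GPU server with Python 3.10 and CUDA 11.8; the run took about an hour. (To check portability, I also ran the code on Colab, and it worked well.)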
- !pip install git+https://github.com/huggingface/transformers.git@main bitsandbytes accelerate==0.20.3 # we need latest transformers for this
- !pip install git+https://github.com/huggingface/peft.git@e536616888d51b453ed354a6f1e243fecb02ea08
- !pip install datasets==2.10.1
- import locale # colab workaround
- locale.getpreferredencoding = lambda: "UTF-8" # colab workaround
- !pip install wandb
2.2. Loading libraries
- from datetime import datetime
- import os
- import sys
- import torch
- from peft import (
- LoraConfig,
- get_peft_model,
- get_peft_model_state_dict,
- prepare_model_for_int8_training,
- set_peft_model_state_dict,
- )
- from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer, DataCollatorForSeq2Seq
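(If you get an import error, try restarting the Jupyter kernel.)
2.3. Loading the dataset
This pulls the dataset from the Hugging Face Hub and holds out 10% of it as an evaluation set, to check how the model is doing during training: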
- from datasets import load_dataset
- dataset = load_dataset("b-mc2/sql-create-context", split="train")
- split = dataset.train_test_split(test_size=0.1)  # split once so the train and eval sets cannot overlap
- train_dataset = split["train"]
- eval_dataset = split["test"]
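A held-out set is only meaningful if no example lands in both halves, which is why the split should be computed once and both parts taken from the same result. The sketch below mimics a 90/10 split on plain Python indices; `split_indices` and the fixed seed are illustrative assumptions, not part of the datasets API.

```python
import random


def split_indices(n_rows, test_size=0.1, seed=42):
    """Shuffle row indices once and cut them into disjoint train/test parts."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_test = int(round(n_rows * test_size))
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(1000)
print(len(train_idx), len(test_idx))
print(set(train_idx).isdisjoint(test_idx))
```

Calling the function twice with different seeds would give two unrelated shuffles, so a "train" part from one call and a "test" part from the other could share rows.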
To load your own dataset instead, do the following:
- train_dataset = load_dataset('json', data_files='train_set.jsonl', split='train')
- eval_dataset = load_dataset('json', data_files='validation_set.jsonl', split='train')
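For a custom dataset to work with the rest of this walkthrough, each JSONL line presumably needs the same three fields that the prompt template reads later: question, context and answer. A minimal sketch that writes such a file with the standard library follows; the `write_jsonl` helper and the sample row are illustrative only.

```python
import json
import os
import tempfile


def write_jsonl(path, rows):
    """Write one JSON object per line (the JSON Lines layout)."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


# Illustrative row with the three fields the prompt template expects.
rows = [
    {
        "question": "How many heads of the departments are older than 56?",
        "context": "CREATE TABLE head (age INTEGER)",
        "answer": "SELECT COUNT(*) FROM head WHERE age > 56",
    }
]

path = os.path.join(tempfile.mkdtemp(), "train_set.jsonl")
write_jsonl(path, rows)

# Read the file back to confirm the round trip.
with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]
print(loaded[0]["answer"])
```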
To look at any sample in the dataset, index into it, for example train_dataset[0].
2.4. Loading the model
I load Code Llama from Hugging Face in int8 (the standard setup for LoRA):
- base_model = "codellama/CodeLlama-7b-hf"
- model = AutoModelForCausalLM.from_pretrained(
- base_model,
- load_in_8bit=True,
- torch_dtype=torch.float16,
- device_map="auto",
- )
- tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-hf")
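torch_dtype=torch.float16 means that computations are performed in float16, even though the stored values themselves are 8-bit integers.
If you get the error "ValueError: Tokenizer class CodeLlamaTokenizer does not exist or is not currently imported.", make sure your transformers version is 4.33.0.dev0 and that accelerate is >=0.20.3.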
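The remark about float16 computation over 8-bit values can be illustrated with a toy absmax quantizer: weights are stored as integers in [-127, 127] together with one floating-point scale, and are converted back to floats before being used in arithmetic. This is a simplified sketch of the general idea, not the actual bitsandbytes implementation (which adds refinements such as outlier handling); the function names are hypothetical.

```python
def quantize_absmax(weights):
    """Map floats to int8-range integers using a single absmax scale."""
    absmax = max(abs(w) for w in weights)
    scale = absmax / 127 if absmax > 0 else 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(q_weights, scale):
    """Recover approximate floats for use in higher-precision math."""
    return [q * scale for q in q_weights]


weights = [0.02, -0.5, 0.31, 1.27, -1.0]
q, scale = quantize_absmax(weights)
restored = dequantize(q, scale)
print(q)
print([round(r, 2) for r in restored])
```

The rounding error per weight is at most half the scale, which is the price paid for storing each value in one byte.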
2.5. Checking the base model
Check whether the model can already do what we want it to do:
- eval_prompt = """You are a powerful text-to-SQL model. Your job is to answer questions about a database. You are given a question and context regarding one or more tables.
- You must output the SQL query that answers the question.
- ### Input:
- Which Class has a Frequency MHz larger than 91.5, and a City of license of hyannis, nebraska?
- ### Context:
- CREATE TABLE table_name_12 (class VARCHAR, frequency_mhz VARCHAR, city_of_license VARCHAR)
- ### Response:
- """
- model_input = tokenizer(eval_prompt, return_tensors="pt").to("cuda")
- model.eval()
- with torch.no_grad():
-     print(tokenizer.decode(model.generate(**model_input, max_new_tokens=100)[0], skip_special_tokens=True))
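Output:
- SELECT * FROM table_name_12 WHERE class > 91.5 AND city_of_license = 'hyannis, nebraska'
The question asks only for the class, yet the query selects every column and compares class (rather than frequency_mhz) with 91.5, so this is clearly wrong. On to fine-tuning!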
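For a decoder-only model, generate returns the prompt tokens followed by the completion, so the decoded string normally contains the prompt as well. A small helper can isolate the SQL. `extract_response` is a hypothetical name, and the sketch assumes the `### Response:` marker appears exactly as in the prompt template.

```python
RESPONSE_MARKER = "### Response:"


def extract_response(decoded_text):
    """Return the text after the last response marker, trimmed of whitespace."""
    _, found, tail = decoded_text.rpartition(RESPONSE_MARKER)
    if not found:
        raise ValueError("response marker not found in decoded text")
    return tail.strip()


# Shortened stand-in for a decoded generation (prompt + completion).
decoded = (
    "### Input:\nWhich Class has a Frequency MHz larger than 91.5?\n"
    "### Context:\nCREATE TABLE table_name_12 (class VARCHAR)\n"
    "### Response:\nSELECT class FROM table_name_12\n"
)
print(extract_response(decoded))
```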
2.6. Tokenization
Set up some tokenization settings, such as left padding, since it makes training use less memory:
- tokenizer.add_eos_token = True
- tokenizer.pad_token_id = 0
- tokenizer.padding_side = "left"
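To make the padding settings concrete, here is a plain-Python sketch of left-padding a batch to a common length with pad id 0 and building the matching attention mask. It only illustrates the resulting layout; in the actual run the padding is done by the tokenizer and the data collator, and `pad_left` is a hypothetical helper.

```python
def pad_left(sequences, pad_id=0):
    """Left-pad token id lists to equal length and build attention masks."""
    width = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = width - len(seq)
        input_ids.append([pad_id] * n_pad + list(seq))
        attention_mask.append([0] * n_pad + [1] * len(seq))  # 0 marks padding
    return input_ids, attention_mask


ids, mask = pad_left([[5, 6, 7], [8, 9]])
print(ids)   # [[5, 6, 7], [0, 8, 9]]
print(mask)  # [[1, 1, 1], [0, 1, 1]]
```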
Set up the tokenize function so that labels and input_ids are identical. This is basically self-supervised fine-tuning:
- def tokenize(prompt):
-     result = tokenizer(
-         prompt,
-         truncation=True,
-         max_length=512,
-         padding=False,
-         return_tensors=None,
-     )
-     # "self-supervised learning" means the labels are also the inputs:
-     result["labels"] = result["input_ids"].copy()
-     return result
And a function that turns each data_point into a prompt that I found online and that works well:
- def generate_and_tokenize_prompt(data_point):
-     full_prompt = f"""You are a powerful text-to-SQL model. Your job is to answer questions about a database. You are given a question and context regarding one or more tables.
- You must output the SQL query that answers the question.
- ### Input:
- {data_point["question"]}
- ### Context:
- {data_point["context"]}
- ### Response:
- {data_point["answer"]}
- """
-     return tokenize(full_prompt)
Reformat each sample into a prompt and tokenize it, giving the tokenized training and evaluation datasets:
- tokenized_train_dataset = train_dataset.map(generate_and_tokenize_prompt)
- tokenized_val_dataset = eval_dataset.map(generate_and_tokenize_prompt)
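Training and inference need to use exactly the same template, otherwise the model is queried with a prompt format it was not tuned on. One way to guarantee that is a single builder function used for both cases. The sketch below is an assumption about how one might organize this (`build_prompt` is not in the original notebook), with line breaks following the template as shown in this article.

```python
PROMPT_HEADER = (
    "You are a powerful text-to-SQL model. Your job is to answer questions "
    "about a database. You are given a question and context regarding one or "
    "more tables.\n"
    "You must output the SQL query that answers the question.\n"
)


def build_prompt(question, context, answer=None):
    """Build the training prompt, or the inference prompt when answer is None."""
    prompt = (
        f"{PROMPT_HEADER}"
        f"### Input:\n{question}\n"
        f"### Context:\n{context}\n"
        f"### Response:\n"
    )
    if answer is not None:
        prompt += f"{answer}\n"
    return prompt


# Illustrative row; real rows come from the dataset.
row = {
    "question": "How many rows are there?",
    "context": "CREATE TABLE t (id INTEGER)",
    "answer": "SELECT COUNT(*) FROM t",
}
train_prompt = build_prompt(row["question"], row["context"], row["answer"])
eval_prompt = build_prompt(row["question"], row["context"])
print(train_prompt.startswith(eval_prompt))
```

Because the inference prompt is a strict prefix of the training prompt, the model is asked to continue from exactly the point where the answer began during training.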
2.7. Setting up LoRA
Set up a standard LoRA configuration and attach it to the base model:
- model.train() # put model back into training mode
- model = prepare_model_for_int8_training(model)
- config = LoraConfig(
- r=16,
- lora_alpha=16,
- target_modules=[
- "q_proj",
- "k_proj",
- "v_proj",
- "o_proj",
- ],
- lora_dropout=0.05,
- bias="none",
- task_type="CAUSAL_LM",
- )
- model = get_peft_model(model, config)
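A back-of-the-envelope count shows why training only the adapter is cheap. Each adapted d_out x d_in projection gains two small matrices, A (r x d_in) and B (d_out x r). The sketch assumes the commonly cited shape of the 7B Llama architecture (32 layers, hidden size 4096, square attention projections); treat those numbers as assumptions rather than values read from the model config.

```python
def lora_trainable_params(n_layers, hidden_size, n_target_modules, r):
    """Count LoRA parameters for square hidden_size x hidden_size projections."""
    per_module = r * hidden_size + hidden_size * r  # A is r x d, B is d x r
    return n_layers * n_target_modules * per_module


# Assumed 7B shape: 32 layers, hidden size 4096; q/k/v/o targeted with r=16.
trainable = lora_trainable_params(
    n_layers=32, hidden_size=4096, n_target_modules=4, r=16
)
print(f"{trainable:,}")  # 16,777,216
```

Under these assumptions the adapter holds about 16.8 million parameters, a fraction of a percent of a model with roughly 7 billion weights.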
To resume from a checkpoint, set resume_from_checkpoint to the path of the adapter_model.bin file you want to resume from:
- resume_from_checkpoint = ""  # set this to the adapter_model.bin file you want to resume from
- if resume_from_checkpoint:
-     if os.path.exists(resume_from_checkpoint):
-         print(f"Restarting from {resume_from_checkpoint}")
-         adapters_weights = torch.load(resume_from_checkpoint)
-         set_peft_model_state_dict(model, adapters_weights)
-     else:
-         print(f"Checkpoint {resume_from_checkpoint} not found")
Optionally, set up Weights & Biases (wandb) to view the training graphs:
- wandb_project = "sql-try2-coder"
- if len(wandb_project) > 0:
-     os.environ["WANDB_PROJECT"] = wandb_project
- if torch.cuda.device_count() > 1:
-     # keeps Trainer from trying its own DataParallelism when more than 1 gpu is available
-     model.is_parallelizable = True
-     model.model_parallel = True
2.8. Training the model
If you run out of GPU memory, lower per_device_train_batch_size. gradient_accumulation_steps is derived from it, so the effective batch size, and with it the training dynamics, stays the same. All the other settings are standard and do not need to be changed:
- batch_size = 128
- per_device_train_batch_size = 32
- gradient_accumulation_steps = batch_size // per_device_train_batch_size
- output_dir = "sql-code-llama"
- training_args = TrainingArguments(
- per_device_train_batch_size=per_device_train_batch_size,
- gradient_accumulation_steps=gradient_accumulation_steps,
- warmup_steps=100,
- max_steps=400,
- learning_rate=3e-4,
- fp16=True,
- logging_steps=10,
- optim="adamw_torch",
- evaluation_strategy="steps", # if val_set_size > 0 else "no",
- save_strategy="steps",
- eval_steps=20,
- save_steps=20,
- output_dir=output_dir,
- load_best_model_at_end=False,
- group_by_length=True, # group sequences of roughly the same length together to speed up training
- report_to="wandb", # if use_wandb else "none",
- run_name=f"codellama-{datetime.now().strftime('%Y-%m-%d-%H-%M')}", # if use_wandb else None,
- )
- trainer = Trainer(
- model=model,
- train_dataset=tokenized_train_dataset,
- eval_dataset=tokenized_val_dataset,
- args=training_args,
- data_collator=DataCollatorForSeq2Seq(
- tokenizer, pad_to_multiple_of=8, return_tensors="pt", padding=True
- ),
- )
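The relationship between the batch settings can be checked with a little arithmetic. The effective batch size is per_device_train_batch_size times gradient_accumulation_steps (times the number of GPUs under data parallelism, assumed to be 1 here), so halving the per-device batch doubles the accumulation steps and leaves the effective batch unchanged. `accumulation_steps` is a hypothetical helper.

```python
def accumulation_steps(batch_size, per_device_batch_size):
    """Gradient accumulation steps needed to reach the target batch size."""
    if batch_size % per_device_batch_size != 0:
        raise ValueError("batch_size should be a multiple of the per-device batch size")
    return batch_size // per_device_batch_size


batch_size, max_steps = 128, 400
for per_device in (32, 16, 8):
    steps = accumulation_steps(batch_size, per_device)
    # per-device batch, accumulation steps, effective batch (always 128)
    print(per_device, steps, per_device * steps)

# Each optimizer step consumes one effective batch.
print(batch_size * max_steps)  # 51200 examples over the whole run
```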
We then apply a few PyTorch-related optimizations, which only make training faster and do not affect accuracy:
- model.config.use_cache = False
- old_state_dict = model.state_dict
- model.state_dict = (lambda self, *_, **__: get_peft_model_state_dict(self, old_state_dict())).__get__(
- model, type(model)
- )
- if torch.__version__ >= "2" and sys.platform != "win32":
-     print("compiling the model")
-     model = torch.compile(model)
- trainer.train()  # start the training run
The training run takes about 1 hour on an A100.
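With warmup_steps=100, max_steps=400 and learning_rate=3e-4, and assuming the Trainer's default linear scheduler is in effect (the training arguments in this section do not set lr_scheduler_type), the learning rate ramps up linearly for 100 steps and then decays linearly to zero. A standalone sketch of that shape, with `linear_schedule_lr` as a hypothetical name:

```python
def linear_schedule_lr(step, base_lr=3e-4, warmup_steps=100, max_steps=400):
    """Linear warmup to base_lr, then linear decay to zero at max_steps."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, max_steps - step)
    return base_lr * remaining / max(1, max_steps - warmup_steps)


for step in (0, 50, 100, 250, 400):
    print(step, f"{linear_schedule_lr(step):.2e}")
```

The peak rate is reached at step 100, and half of it is in effect at both step 50 and step 250.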
2.9. Loading the final checkpoint
- import torch
- from transformers import AutoModelForCausalLM, BitsAndBytesConfig, AutoTokenizer
- base_model = "codellama/CodeLlama-7b-hf"
- model = AutoModelForCausalLM.from_pretrained(
- base_model,
- load_in_8bit=True,
- torch_dtype=torch.float16,
- device_map="auto",
- )
- tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-hf")
To load the fine-tuned LoRA/QLoRA adapter, use PeftModel.from_pretrained. output_dir should be a directory containing adapter_config.json and adapter_model.bin:
- from peft import PeftModel
- model = PeftModel.from_pretrained(model, output_dir)
Try the same prompt as before:
- eval_prompt = """You are a powerful text-to-SQL model. Your job is to answer questions about a database. You are given a question and context regarding one or more tables.
- You must output the SQL query that answers the question.
- ### Input:
- Which Class has a Frequency MHz larger than 91.5, and a City of license of hyannis, nebraska?
- ### Context:
- CREATE TABLE table_name_12 (class VARCHAR, frequency_mhz VARCHAR, city_of_license VARCHAR)
- ### Response:
- """
- model_input = tokenizer(eval_prompt, return_tensors="pt").to("cuda")
- model.eval()
- with torch.no_grad():
-     print(tokenizer.decode(model.generate(**model_input, max_new_tokens=100)[0], skip_special_tokens=True))
Model output:
- SELECT class FROM table_name_12 WHERE frequency_mhz > 91.5 AND city_of_license = "hyannis, nebraska"
The output shows that the fine-tuning worked! The adapter can also be converted to a Llama.cpp model to run locally.
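Eyeballing a single query is a weak test. A cheap sanity check is to load the CREATE TABLE context into an in-memory SQLite database and ask SQLite to compile the generated query. This is a sketch and not part of the original notebook: `is_valid_sql` is a hypothetical helper, SQLite's dialect may differ from the one the dataset targets, and a query that compiles can still be semantically wrong.

```python
import sqlite3


def is_valid_sql(context, query):
    """Return True if the query compiles against the schema in context."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(context)       # create the tables from the context
        conn.execute(f"EXPLAIN {query}")  # compile the query without running it
        return True
    except sqlite3.Error:
        return False
    finally:
        conn.close()


context = (
    "CREATE TABLE table_name_12 "
    "(class VARCHAR, frequency_mhz VARCHAR, city_of_license VARCHAR)"
)
good = (
    "SELECT class FROM table_name_12 "
    "WHERE frequency_mhz > 91.5 AND city_of_license = 'hyannis, nebraska'"
)
bad = "SELECT klass FROM table_name_12"  # misspelled column name
print(is_valid_sql(context, good), is_valid_sql(context, bad))
```

The example writes the string literal with single quotes, the standard SQL form, rather than the double quotes seen in the model output.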
The full code as a Jupyter Notebook:
https://github.com/Crossme0809/frenzyTechAI/blob/main/fine-tune-code-llama/finetunecode_llama.ipynb
3. References
[1] Alpaca-LoRA: https://github.com/tloen/alpaca-lora
[2] LoRA paper: https://arxiv.org/abs/2106.09685
[3] sql-create-context: https://huggingface.co/datasets/b-mc2/sql-create-context