LLaMA2 was trained on two GPU clusters:
- RSC cluster: 200 Gbps InfiniBand + 400W A100 GPUs;
- Production cluster: 200 Gbps RoCE + 350W A100 GPUs.
With optimized code, the RoCE + 350W GPU cluster reached about 90% of the performance of the IB + 400W GPU cluster. Training consumed 3.3M GPU-hours in total.
Table of Contents
Abstract
1 Introduction
1.1 Status quo: no open-source LLM can rival ChatGPT
1.2 Open-sourcing LLaMA2/LLaMA2-Chat to fill the gap
1.3 How LLaMA2 was made: a bird's-eye view of training + fine-tuning
1.4 Paper organization
2 Pretraining
2.1 Pretraining Data
2.2 Training Details
2.2.1 Hyperparameters
2.2.2 Tokenizer
2.2.3 Training Hardware and Carbon Footprint
Training Hardware
Carbon Footprint of Pretraining
2.3 Pretrained Model Evaluation
2.3.1 Comparison with open-source base models
2.3.2 Comparison with closed-source models
3 Fine-tuning
3.1 Supervised Fine-Tuning (SFT)
3.1.1 Starting with publicly available instruction-tuning data
3.1.2 Quality Is All You Need
3.1.3 Fine-Tuning Details
3.2 Reinforcement Learning with Human Feedback (RLHF)
3.2.1 Human preference data collection
3.2.2 Reward Modeling
Training Objectives
Data Composition
Training Details
Reward Model Results
Scaling Trends
3.2.3 Iterative Fine-Tuning
Rejection Sampling
PPO
3.3 System Message for Multi-Turn Consistency
3.4 RLHF Results
3.4.1 Model-Based Evaluation
3.4.2 Human Evaluation
4 Safety (omitted)
5 Discussion
5.1 Learnings and Observations
Beyond Human Supervision: from SFT to RLHF
In-Context Temperature Rescaling
LLaMA2-Chat Temporal Perception
Tool Use
5.2 Limitations and Ethical Considerations
5.3 Responsible Release Strategy
5.3.1 Release Details
5.3.2 Responsible Release
6 Related Work
6.1 Large Language Models
6.2 Instruction Tuning
6.3 Known LLM Safety Challenges
7 Conclusion
References (omitted)
Appendix (omitted)
Abstract
This paper introduces LLaMA 2, a family of pretrained and fine-tuned large language models we developed:
- LLaMA2 ranges from 7B to 70B parameters;
- the fine-tuned models, called LLaMA2-Chat, are optimized for dialogue use cases.
Compared with other open-source chat models,
- LLaMA2 performs better on most benchmarks;
- human evaluations of helpfulness and safety also indicate that LLaMA2 is superior.
LLaMA2 may therefore be a suitable substitute for closed-source models. This paper describes in detail how we fine-tuned LLaMA2-Chat and improved its safety, so that the community can build on our work and contribute to the responsible development of LLMs.
1 Introduction
Large language models (LLMs) have shown great promise as capable AI assistants that excel at complex reasoning tasks spanning many domains and requiring expert knowledge (e.g., programming and creative writing). They interact with humans through a simple chat interface, which is why they found a mass market so quickly after launch.
The underlying training methodology is conceptually quite simple:
- first, pretrain an auto-regressive transformer on a large corpus of self-supervised data;
- then, align it with human preferences via techniques such as Reinforcement Learning with Human Feedback (RLHF).
Given how simple this recipe is, the capabilities that LLMs end up with are all the more remarkable.
1.1 Status quo: no open-source LLM can rival ChatGPT
The training methodology may be simple, but the extremely high compute requirements have confined LLM development to a handful of companies with the resources to do the research and training. Several pretrained LLMs have been open-sourced before, including
- BLOOM (Scao et al., 2022)
- LLaMA-1 (Touvron et al., 2023)
- Falcon (Penedo et al., 2023)
and their performance is on par with closed-source pretrained models such as GPT-3 (Brown et al., 2020) and Chinchilla (Hoffmann et al., 2022). Still, they are not yet substitutes for the more capable closed-source, production-grade LLMs such as ChatGPT, BARD, and Claude:
- the latter are heavily fine-tuned to align with human preferences, which greatly enhances their usability and safety;
- this alignment process requires significant compute and human annotation, is often not transparent, and is hard to reproduce, which limits the community's ability to advance AI alignment research.
1.2 Open-sourcing LLaMA2/LLaMA2-Chat to fill the gap
This paper introduces LLaMA2, our open-sourced family of pretrained and fine-tuned LLMs, comprising LLaMA2 and LLaMA2-Chat. Compared with other open-source chat models,
- LLaMA2 performs better on most benchmarks;
- human evaluations of helpfulness and safety also indicate that LLaMA2 is superior.
LLaMA2 may therefore be a suitable substitute for closed-source models. This paper describes in detail how we fine-tuned LLaMA2-Chat and improved its safety, so that the community can build on our work and contribute to the responsible development of LLMs. Specifically, we release the following models to the general public for research and commercial use:
- LLaMA2: an upgraded version of LLaMA 1,
- trained on a new mix of publicly available data, with the dataset 40% larger (1.4T tokens -> 2T tokens),
- with the context length doubled,
- and using grouped-query attention (Ainslie et al., 2023).
This release includes LLaMA2 models with 7B/13B/70B parameters. We report results for the 34B variant in this paper, but its release is delayed (it is still undergoing safety testing).
- LLaMA2-Chat: a fine-tuned version of LLaMA2, optimized for dialogue use cases.
We believe that, done safely, the open release of LLMs will be a net benefit to society. Note, however, that like all LLMs, LLaMA2 is a new technology that carries potential risks in use (Bender et al., 2021b; Weidinger et al., 2021; Solaiman et al., 2023):
- our testing so far covers English only; before deploying any LLaMA2-Chat application, developers should perform safety testing and tuning tailored to their specific use case;
- we provide a responsible use guide and code examples to facilitate the safe deployment of LLaMA2 and LLaMA2-Chat; see Section 5.3 for more information.
Useful links:
- ai.meta.com/resources/models-and-libraries/llama/
- ai.meta.com/llama
- github.com/facebookresearch/llama
1.3 How LLaMA2 was made: a bird's-eye view of training + fine-tuning
Figure 4: the LLaMA2-Chat training and tuning process.
The recipe has four steps:
- pretrain on public data (self-supervised learning) to obtain LLaMA2;
- apply supervised fine-tuning (SFT) to LLaMA2 to obtain an initial version of LLaMA2-Chat;
- have humans give feedback on and annotate LLaMA2-Chat's answers, yielding two reward models (one for helpfulness, one for safety);
- iterate LLaMA2-Chat over multiple rounds of RLHF / rejection sampling / PPO.
1.4 Paper organization
The rest of this paper is organized as follows:
- Section 2: pretraining methodology
- Section 3: fine-tuning methodology
- Section 4: approach to model safety
- Section 5: key observations and insights
- Section 6: related work
- Section 7: conclusion
2 Pretraining
To create the new LLaMA2 family of models, we adopted the pretraining approach of Touvron et al. (2023), using an optimized auto-regressive transformer, and made several improvements to boost performance, including:
- more robust data cleaning;
- updated training data mixtures;
- training on more tokens;
- a longer context length;
- grouped-query attention (GQA) to improve inference performance (a minimal sketch follows below).
Translator's note: for the rationale behind how GQA improves inference, see the article 大模型推理的极限:理论分析、数学建模与 CPU/GPU 实测(2024).
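To make the GQA point concrete, here is a minimal, hedged sketch (an illustration, not LLaMA2's actual implementation) of how a small number of key/value heads is shared across groups of query heads, which shrinks the KV cache and speeds up inference; all shapes and names are illustrative.

```python
# Minimal grouped-query attention (GQA) sketch; illustrative only, not LLaMA2's code.
import torch

def grouped_query_attention(q, k, v):
    """
    q: (batch, n_q_heads, seq, head_dim)
    k, v: (batch, n_kv_heads, seq, head_dim) with n_kv_heads < n_q_heads.
    Each group of n_q_heads // n_kv_heads query heads shares one KV head,
    so the KV cache is n_q_heads / n_kv_heads times smaller.
    """
    n_q_heads, n_kv_heads, head_dim = q.shape[1], k.shape[1], q.shape[-1]
    group = n_q_heads // n_kv_heads
    k = k.repeat_interleave(group, dim=1)   # share each KV head across its group
    v = v.repeat_interleave(group, dim=1)
    attn = torch.softmax(q @ k.transpose(-2, -1) / head_dim**0.5, dim=-1)
    return attn @ v

# Example: 32 query heads sharing 8 KV heads (4 query heads per group).
q = torch.randn(1, 32, 16, 128)
k = torch.randn(1, 8, 16, 128)
v = torch.randn(1, 8, 16, 128)
print(grouped_query_attention(q, k, v).shape)   # torch.Size([1, 32, 16, 128])
```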
Table 1 compares some attributes of LLaMA 2 and LLaMA 1:

| | LLaMA 1 | LLaMA 2 |
|---|---|---|
| Training data | See the LLaMA 1 paper | A new mix of publicly available data |
| Parameters | 7B / 13B / 33B / 65B | 7B / 13B / 34B / 70B |
| Context length | 2k / 2k / 2k / 2k | 4k / 4k / 4k / 4k |
| GQA | No / No / No / No | No / No / Yes / Yes |
| Training tokens | 1T / 1T / 1.4T / 1.4T | 2T / 2T / 2T / 2T |
| Learning rate | 3.0×10⁻⁴ / 3.0×10⁻⁴ / 1.5×10⁻⁴ / 1.5×10⁻⁴ | 3.0×10⁻⁴ / 3.0×10⁻⁴ / 1.5×10⁻⁴ / 1.5×10⁻⁴ |

Table 1: Comparison of LLaMA 1 and LLaMA 2 models. Token counts refer to pretraining data only. All models were trained with a global batch size of 4M tokens.
2.1 Pretraining Data
- Our training corpus is a mix of publicly available data sources; it does not include data from Meta's products or services.
- We made an effort to remove data from certain sites known to contain a large amount of personal information.
- We trained on 2T (2 trillion) tokens, which offers a good performance-cost trade-off.
- We up-sampled the most factual data sources to increase knowledge and dampen hallucinations.
We carried out a variety of pretraining data investigations so that users can better understand the potential capabilities and limitations of LLaMA2; detailed results are in Section 4.1.
2.2 Training Details
We adopted most of the pretraining settings and model architecture from Llama 1:
- the standard transformer architecture (Vaswani et al., 2017),
- pre-normalization with RMSNorm (Zhang and Sennrich, 2019),
- the SwiGLU activation function (Shazeer, 2020), and rotary positional embeddings (RoPE, Su et al., 2022).
Compared with Llama 1, the main architectural differences are
- the context length (doubled, 2k -> 4k), and
- grouped-query attention (GQA).
Appendix A.2.1 describes these differences in detail, with ablation experiments demonstrating their importance.
2.2.1 Hyperparameters
- We trained with the AdamW optimizer (Loshchilov and Hutter, 2017), with β1 = 0.9, β2 = 0.95, eps = 10⁻⁵.
- We used a cosine learning rate schedule with a 2000-step warmup, decaying the final learning rate down to 10% of the peak learning rate.
- We used a weight decay of 0.1 and gradient clipping of 1.0 (a configuration sketch follows Figure 5).
Figure 5(a) shows the training loss of LLaMA2 trained with these hyperparameters.
Figure 5: LLaMA2 training loss. Note that even after training on 2T tokens, these models show no sign of saturation.
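As a rough illustration of the optimizer settings listed above, here is a hedged PyTorch sketch; the model, peak learning rate, and total step count are placeholders rather than values from the paper (the peak LR varies by model size).

```python
# Sketch of AdamW + cosine schedule with 2000-step warmup and decay to 10% of peak.
import math
import torch

model = torch.nn.Linear(10, 10)                               # placeholder for the real transformer
peak_lr, total_steps, warmup_steps = 3e-4, 100_000, 2_000     # placeholder step counts

optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr,
                              betas=(0.9, 0.95), eps=1e-5, weight_decay=0.1)

def lr_factor(step):
    if step < warmup_steps:                                   # linear warmup
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.1 + 0.9 * 0.5 * (1 + math.cos(math.pi * progress))   # cosine decay to 10% of peak

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_factor)

# Per training step:
#   loss.backward()
#   torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)   # gradient clipping at 1.0
#   optimizer.step(); scheduler.step(); optimizer.zero_grad()
```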
2.2.2 Tokenizer
LLaMA2 uses the same tokenizer as Llama 1: a byte-pair encoding (BPE) algorithm (Sennrich et al., 2016), using the implementation from SentencePiece (Kudo and Richardson, 2018).
As with Llama 1,
- all numbers are split into individual digits,
- unknown UTF-8 characters are decomposed into bytes.
The vocabulary size is 32k tokens. A small usage sketch follows.
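Here is a hedged usage sketch of a SentencePiece BPE tokenizer; the file name tokenizer.model and the exact token strings are assumptions for illustration, not artifacts described in the paper.

```python
# Load a SentencePiece BPE model and inspect its behavior (illustrative).
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")  # assumed local model file
print(sp.vocab_size())                                         # ~32k for a LLaMA-style tokenizer
pieces = sp.encode("trained on 2000000000000 tokens", out_type=str)
print(pieces)   # with a LLaMA-style model, each digit of the number is its own token
```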
2.2.3 Training Hardware and Carbon Footprint
Training Hardware
We pretrained LLaMA2 on Meta's Research Super Cluster (RSC, Lee and Sengupta, 2022) as well as on internal production clusters. Both clusters use NVIDIA A100 GPUs and 200 Gbps interconnects, but they differ in interconnect technology and per-GPU power cap:
- RSC cluster: 200 Gbps InfiniBand + 400W GPUs;
- Production cluster: 200 Gbps RoCE + 350W GPUs; RoCE is the cheaper option.
Takeaway: with optimized code, the RoCE + 350W GPU cluster reached about 90% of the performance of the IB + 400W GPU cluster.
Carbon Footprint of Pretraining
Following prior research (Bender et al., 2021a; Patterson et al., 2021; Wu et al., 2022; Dodge et al., 2022), and combining power-consumption estimates for the GPUs with carbon-efficiency figures, we estimate the carbon emissions produced by LLaMA2 pretraining. Note that:
- actual GPU power draw depends on utilization; we estimate GPU power using the thermal design power (TDP), so the two may differ;
- our calculation does not account for other power demands such as the interconnect, non-GPU server power, or datacenter cooling;
- emissions associated with manufacturing AI hardware such as GPUs may add to the overall carbon footprint (Gupta et al., 2022).
The results are shown in Table 2.
Table 2: CO2 emissions during pretraining. Time: total GPU time required for training each model. Power Consumption: peak power capacity per GPU device for the GPUs used, adjusted for power usage efficiency. 100% of the emissions are directly offset by Meta's sustainability program, and because we are openly releasing these models, the pretraining costs do not need to be incurred by others.
- Training used A100-80GB machines (400W/350W TDP) for a total of 3.3M GPU-hours;
- the total estimated emissions are 539 tCO2eq, 100% of which are offset by Meta's sustainability program (a back-of-envelope check follows this list);
- open-sourcing LLaMA2 also means other companies do not need to incur these pretraining costs, saving further global resources.
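As a rough sanity check on the figures above, here is a hedged back-of-envelope estimate; the carbon intensity of 0.4 kgCO2eq/kWh is an assumption for illustration (not a value stated here), and PUE is ignored, so the result only roughly approximates the reported 539 tCO2eq.

```python
# Rough emission estimate: GPU-hours x TDP x assumed grid carbon intensity.
gpu_hours = 3.31e6         # ~3.3M A100 GPU-hours in total
tdp_kw = 0.4               # 400W TDP (the 34B model's GPUs were capped at 350W)
carbon_kg_per_kwh = 0.4    # assumed carbon intensity, for illustration only

energy_kwh = gpu_hours * tdp_kw
emissions_tonnes = energy_kwh * carbon_kg_per_kwh / 1000
print(f"~{energy_kwh:.2e} kWh -> ~{emissions_tonnes:.0f} tCO2eq")   # ~530 tCO2eq
```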
2.3 Pretrained Model Evaluation
This section reports results on standard academic benchmarks for the LLaMA 1/2 base models, MosaicML Pretrained Transformer (MPT) models, and Falcon (Almazrouei et al., 2023). All evaluations use our internal evaluation library. We reproduced the MPT and Falcon results internally; for these models, we always pick the best score between our evaluation framework and any publicly reported results.
The benchmarks fall into the following categories (results for individual benchmarks are in A.2.2):
- Code. Average pass@1 score on HumanEval (Chen et al., 2021) and MBPP (Austin et al., 2021) (the pass@k estimator is sketched after this list).
- Commonsense Reasoning. Average score on PIQA (Bisk et al., 2020), SIQA (Sap et al., 2019), HellaSwag (Zellers et al., 2019a), WinoGrande (Sakaguchi et al., 2021), ARC easy and challenge (Clark et al., 2018), OpenBookQA (Mihaylov et al., 2018), and CommonsenseQA (Talmor et al., 2018). 7-shot results for CommonsenseQA and 0-shot results for all other benchmarks.
- World Knowledge. 5-shot performance on NaturalQuestions (Kwiatkowski et al., 2019) and TriviaQA (Joshi et al., 2017), reported as the average.
- Reading Comprehension. 0-shot average on SQuAD (Rajpurkar et al., 2018), QuAC (Choi et al., 2018), and BoolQ (Clark et al., 2019).
- Math. Average top-1 score on GSM8K (8-shot) (Cobbe et al., 2021) and MATH (4-shot) (Hendrycks et al., 2021).
- Popular Aggregated Benchmarks. Overall results on MMLU (5-shot) (Hendrycks et al., 2020), Big Bench Hard (BBH) (3-shot) (Suzgun et al., 2022), and AGI Eval (3-5 shot) (Zhong et al., 2023). For AGI Eval, we only evaluate the English tasks and report the average.
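For reference, the pass@1 numbers above follow the unbiased pass@k estimator of Chen et al. (2021); the sketch below is a generic implementation of that formula, with n and c chosen arbitrarily for the example.

```python
# Unbiased pass@k estimator: 1 - C(n - c, k) / C(n, k); for k = 1 this reduces to c / n.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """n: samples per problem, c: samples passing the unit tests, k: evaluation budget."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=20, c=5, k=1))   # 0.25 == 5/20
```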
2.3.1 Comparison with open-source base models
Table 3 summarizes the overall performance on a suite of popular benchmarks. Safety benchmarks are covered in Section 4.1.
Table 3: Performance compared with other open-source base models on a set of academic benchmarks.
As the table shows,
- LLaMA2 outperforms LLaMA1;
- compared with Llama 1 65B, LLaMA2 70B improves the results on MMLU and BBH by about 5 and 8 points, respectively;
- LLaMA2 7B and 30B outperform MPT models of the corresponding size on all categories except the code benchmarks;
- LLaMA2 7B and 34B outperform Falcon 7B and 40B on all benchmark categories;
- LLaMA2 70B outperforms all open-source models.
2.3.2 Comparison with closed-source models
Besides open-source models, we also compare the results of LLaMA2 70B against closed-source models. As shown in Table 4,
Table 4: Comparison of LLaMA2 and closed-source models on academic benchmarks. GPT-3.5/GPT-4 results are from OpenAI (2023); PaLM results are from Chowdhery et al. (2022); PaLM-2-L results are from Anil et al. (2023).
- LLaMA2 70B is close to GPT-3.5 (OpenAI, 2023) on MMLU and GSM8K, but there is a significant gap on coding benchmarks;
- LLaMA2 70B is on par with or better than PaLM (540B) (Chowdhery et al., 2022);
- there is still a large gap between LLaMA2 70B and GPT-4/PaLM-2-L.
We also analyzed potential data contamination; details are in Section A.6.
3 Fine-tuning
LLaMA2-Chat is the result of several months of iterative alignment work, including instruction tuning and RLHF, both of which require significant compute and annotation resources. This section describes our experiments and findings.
3.1 Supervised Fine-Tuning (SFT)
3.1.1 Starting with publicly available instruction-tuning data
As in Touvron et al. (2023), we bootstrapped the SFT stage with publicly available instruction-tuning data (Chung et al., 2022).
3.1.2 Quality Is All You Need
Third-party SFT data is available from many different sources, but we found that some of it falls short in diversity and quality, especially when it comes to aligning LLMs to dialogue-style instructions. We therefore began by collecting several thousand examples of high-quality SFT data, as illustrated in Table 5.
Table 5: SFT annotation examples, showing one helpfulness and one safety annotation, where both the prompt and the answer were written by human annotators.
These annotations were sourced from our vendors, and we found that a small number of high-quality SFT annotations was enough to markedly improve the quality of the results:
- this echoes the findings of Zhou et al. (2023), who also observed that a small set of clean instruction-tuning data can be sufficient for high quality;
- in our experience, tens of thousands of SFT annotations are enough to achieve high-quality results, so we collected 27,540 SFT annotations in total and stopped there; note that our SFT annotations do not use any Meta user data;
- we also observed that different annotation platforms and vendors can result in markedly different downstream model performance, which highlights the importance of data checks when sourcing annotations from vendors.
To validate data quality, we carefully examined a set of 180 examples, manually comparing the human-provided annotations with model generations. Surprisingly, we found that outputs sampled from the resulting SFT model were competitive with the SFT data written by human annotators, suggesting that we could reprioritize and devote more annotation effort to preference-based annotation for RLHF.
3.1.3 Fine-Tuning Details
For supervised fine-tuning, we use a cosine learning rate schedule with
- an initial learning rate of 2×10⁻⁵,
- a weight decay of 0.1,
- a batch size of 64,
- and a sequence length of 4096 tokens.
For the fine-tuning process, each sample consists of a prompt and an answer.
- To ensure the model's sequence length is properly filled, we concatenate all prompts and answers from the training set, using a special token to separate the prompt and answer segments.
- We use an autoregressive objective and zero out the loss on tokens from the user prompt, so backpropagation happens only on answer tokens (a loss-masking sketch follows this list).
- Finally, we fine-tune the model for 2 epochs.
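Below is a hedged sketch of the prompt-loss masking described above. The ignore value of -100 and the helper names are common conventions assumed here for illustration; handling of the separator special token is omitted for brevity.

```python
# Concatenate prompt + answer, mask prompt positions out of the autoregressive loss.
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100   # positions with this label contribute nothing to the loss

def build_example(prompt_ids, answer_ids):
    input_ids = prompt_ids + answer_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + answer_ids   # zero out loss on prompt tokens
    return torch.tensor(input_ids), torch.tensor(labels)

def lm_loss(logits, labels):
    """Next-token prediction loss; only answer tokens are supervised."""
    return F.cross_entropy(logits[:-1], labels[1:], ignore_index=IGNORE_INDEX)

# Toy example with made-up token ids and random "model" logits.
input_ids, labels = build_example(prompt_ids=[1, 5, 9, 2], answer_ids=[7, 8, 3])
logits = torch.randn(len(input_ids), 16)   # (seq_len, vocab_size) stand-in for model output
print(lm_loss(logits, labels))
```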
3.2 Reinforcement Learning with Human Feedback (RLHF)
RLHF is a model-training procedure applied on top of the fine-tuned model to further align its behavior with human preferences and instructions. Given two model outputs, human annotators select the one they prefer, and we take these judgments to be representative of broad human preferences. This human feedback is then used to train a reward model, which, having learned the annotators' preference patterns, can make preference decisions automatically.
3.2.1 Human preference data collection
Reward modeling requires collecting human preference data. We chose a binary comparison protocol, mainly because it maximizes the diversity of the prompts we collect. Other strategies are worth considering, which we leave to future work.
Our annotation procedure is as follows:
- annotators first write a prompt, then choose the better of two sampled model responses according to the criteria provided;
- to maximize diversity, the two responses are sampled from two different model variants, with varied temperature hyperparameters;
- in addition to the binary choice, we ask annotators to label how strongly they prefer their chosen response: significantly better / better / slightly better / negligibly better / unsure.
For preference annotation, we focus on helpfulness and safety:
- helpfulness refers to how well a LLaMA2-Chat response fulfills the user's request and provides the requested information;
- safety refers to whether a LLaMA2-Chat response is unsafe; for example, "giving detailed instructions on making a bomb" might satisfy the helpfulness criterion, but it fails safety according to our guidelines.
Separating the two allows us to apply specific guidelines to each and to guide annotators better; for example, beyond the general guidelines, our safety annotations also provide guidance on adversarial prompts.
Apart from the differences in annotation guidelines, we additionally collect a safety label during the safety stage. This extra information classifies model responses into three categories:
- the chosen response is safe and the other response is not;
- both responses are safe;
- both responses are unsafe.
18%, 47%, and 35% of the safety dataset falls into these three categories, respectively. We do not include a category in which the chosen response is unsafe while the other is safe, because we believe that safer responses will also be judged better/preferred by humans. Safety guidelines and more detailed information on safety annotation can be found in Section 4.2.1.
Human annotations were collected in weekly batches. As more preference data came in, the reward models improved, allowing us to train progressively better versions of LLaMA2-Chat (see the results in Section 5, Figure 20). Improving LLaMA2-Chat also shifted the model's data distribution. If the reward model is not exposed to this new distribution, its accuracy degrades quickly, e.g., due to hyper-specialization (Scialom et al., 2020b). It is therefore important to run an iteration with the latest preference data before each new round of LLaMA2-Chat tuning. This step keeps the reward model on-distribution and lets it provide accurate rewards for the latest model.
Table 6: Statistics of human preference data used for reward modeling. We list both the open-source and internally collected human preference data used for reward modeling. Note that a binary human preference comparison contains 2 responses (chosen and rejected) sharing the same prompt (and previous dialogue). Each example consists of a prompt (including previous dialogue if available) and a response, which is the input of the reward model. We report the number of comparisons, the average number of turns per dialogue, the average number of tokens per example, per prompt and per response. More details on Meta helpfulness and safety data per batch can be found in Appendix A.3.1.
Table 6 summarizes our reward-modeling data and compares it against several open-source preference datasets, including Anthropic Helpful and Harmless (Bai et al., 2022a), OpenAI Summarize (Stiennon et al., 2020), OpenAI WebGPT (Nakano et al., 2021), StackExchange (Lambert et al., 2023), Stanford Human Preferences (Ethayarajh et al., 2022), and Synthetic GPT-J (Havrilla).
Following the guidelines described above, we collected more than 1 million binary comparisons, a large dataset that we refer to as Meta reward modeling data. Note that the number of tokens in prompts and answers varies with the text domain:
- summarization and online forum data usually have longer prompts,
- while dialogue-style prompts are usually shorter.
Compared with existing open-source datasets, our preference data features more dialogue turns and is longer on average.
3.2.2 Reward Modeling
How the reward model works:
- input: a model response and its corresponding prompt (including context from previous turns);
- output: a scalar score indicating the quality of the generation (e.g., helpfulness and safety).
Using these scores as rewards, we can optimize LLaMA2-Chat during RLHF for better alignment with human preferences and improved helpfulness and safety.
Others have found that helpfulness and safety sometimes trade off against each other (Bai et al., 2022a), which can make it challenging for a single reward model to optimize both. To address this, we train two separate reward models:
- one optimized for helpfulness (referred to as the Helpfulness RM),
- one optimized for safety (Safety RM).
We initialize the reward models from pretrained LLaMA2-Chat checkpoints:
- this way both models benefit from the knowledge already acquired in pretraining; in short, the reward model "knows" everything the chat model knows;
- it also prevents an information mismatch between the two models, which could, for example, result in favoring hallucinations;
- the model architecture and hyperparameters are identical to those of the pretrained language model, except that the classification head used for next-token prediction is replaced with a regression head that outputs a scalar reward (sketched below).
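A minimal sketch of that architectural change follows; it assumes, for illustration, a backbone that returns final hidden states and scoring from the last non-padded position, and is not Meta's implementation.

```python
# Reuse the pretrained transformer body; swap the LM head for a scalar regression head.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, backbone: nn.Module, hidden_size: int):
        super().__init__()
        self.backbone = backbone                                  # same arch/weights as the chat model
        self.reward_head = nn.Linear(hidden_size, 1, bias=False)  # replaces the vocabulary head

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(input_ids, attention_mask)         # (batch, seq, hidden), assumed
        last_idx = attention_mask.sum(dim=1) - 1                  # last non-padded position
        last_hidden = hidden[torch.arange(hidden.size(0)), last_idx]
        return self.reward_head(last_hidden).squeeze(-1)          # one scalar reward per sequence
```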
Training Objectives
To train the reward model, we convert our collected human preference data into a binary ranking label format (i.e., chosen & rejected) and enforce that the chosen response receives a higher score. We use a binary ranking loss consistent with Ouyang et al. (2022):

$\mathcal{L}_{\text{ranking}} = -\log\big(\sigma\big(r_\theta(x, y_c) - r_\theta(x, y_r)\big)\big)$

where $r_\theta(x, y)$ is the scalar score output for prompt $x$ and completion $y$ with model weights $\theta$, $y_c$ is the preferred response that annotators choose, and $y_r$ is the rejected counterpart. Built on top of this binary ranking loss, we further modify it separately for better helpfulness and safety reward models as follows. Given that our preference ratings are decomposed on a four-point scale (e.g., significantly better), as presented in Section 3.2.1, it can be useful to leverage this information to explicitly teach the reward model to assign more discrepant scores to generations that differ more. To do so, we further add a margin component to the loss:

$\mathcal{L}_{\text{ranking}} = -\log\big(\sigma\big(r_\theta(x, y_c) - r_\theta(x, y_r) - m(r)\big)\big)$
where the margin m(r) is a discrete function of the preference rating. Naturally, we use a large margin for pairs with distinct responses, and a smaller one for those with similar responses (shown in Table 27). We found this margin component can improve Helpfulness reward model accuracy especially on samples where two responses are more separable. More detailed ablation and analysis can be found in Table 28 in Appendix A.3.3.
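A hedged sketch of the margin-augmented ranking loss above is shown below; the numeric margins per preference rating are placeholders (the paper's actual values are in its Table 27).

```python
# L = -log(sigmoid(r(x, y_c) - r(x, y_r) - m(r))), with a rating-dependent margin m(r).
import torch
import torch.nn.functional as F

MARGINS = {"significantly_better": 3.0, "better": 2.0,
           "slightly_better": 1.0, "negligibly_better": 0.0}   # placeholder margin scale

def ranking_loss(chosen_scores, rejected_scores, ratings):
    margins = torch.tensor([MARGINS[r] for r in ratings])
    return -F.logsigmoid(chosen_scores - rejected_scores - margins).mean()

# Dummy reward-model scores for a batch of two comparisons.
chosen = torch.tensor([1.8, 0.4])
rejected = torch.tensor([0.2, 0.3])
print(ranking_loss(chosen, rejected, ["significantly_better", "slightly_better"]))
```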
Data Composition
We combine our newly collected data with existing open-source preference datasets to form a larger training dataset. Initially, open-source datasets were used to bootstrap our reward models while we were in the process of collecting preference annotation data. We note that in the context of RLHF in this study, the role of reward signals is to learn human preference for LLaMA2-Chat outputs rather than any model outputs. However, in our experiments, we do not observe negative transfer from the open-source preference datasets. Thus, we have decided to keep them in our data mixture, as they could enable better generalization for the reward model and prevent reward hacking, i.e. LLaMA2-Chat taking advantage of some weaknesses of our reward, and so artificially inflating the score despite performing less well. With training data available from different sources, we experimented with different mixing recipes for both Helpfulness and Safety reward models to ascertain the best settings. After extensive experimentation, the Helpfulness reward model is eventually trained on all Meta Helpfulness data, combined with an equal parts of the remaining data uniformly sampled from Meta Safety and from the open-source datasets. The Meta Safety reward model is trained on all Meta Safety and Anthropic Harmless data, mixed with Meta Helpfulness and open-source helpfulness data in a 90/10 proportion. We found that the setting with 10% helpfulness data is especially beneficial for the accuracy on samples where both the chosen and rejected responses were deemed safe.
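To make the 90/10 recipe for the Safety reward model concrete, here is a rough sketch; the dataset variables and the exact sampling procedure are assumptions for illustration, not the paper's pipeline.

```python
# Safety RM mixture: all Meta Safety + Anthropic Harmless (90%), topped up with 10% helpfulness data.
import random

def mix_safety_rm_data(meta_safety, anthropic_harmless, helpfulness_pool, seed=0):
    random.seed(seed)
    base = list(meta_safety) + list(anthropic_harmless)       # the 90% portion
    n_helpful = round(len(base) / 9)                          # 10% of the final mixture
    extra = random.sample(list(helpfulness_pool), min(n_helpful, len(helpfulness_pool)))
    return base + extra
```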
Training Details
We train for one epoch over the training data. In earlier experiments, we found that training longer can lead to over-fitting. We use the same optimizer parameters as for the base model. The maximum learning rate is 5 × 10⁻⁶ for the 70B parameter LLaMA2-Chat and 1 × 10⁻⁵ for the rest. The learning rate is decreased on a cosine learning rate schedule, down to 10% of the maximum learning rate. We use a warm-up of 3% of the total number of steps, with a minimum of 5. The effective batch size is kept fixed at 512 pairs, or 1024 rows per batch.
Reward Model Results
On each batch of human preference annotation for reward modeling, we held out 1000 examples as a test set to evaluate our models. We refer to the union of all prompts for the corresponding test sets as “Meta Helpfulness” and “Meta Safety,” respectively.
As reference points, we also evaluated other publicly available alternatives as baselines: SteamSHP-XL (Ethayarajh et al., 2022) based on FLAN-T5-xl, the Open Assistant reward model based on DeBERTa V3 Large (He et al., 2020), and GPT4 accessible through the OpenAI’s API. Note that at inference time, as opposed to training, all the reward models can predict a scalar for a single output, without requiring to access its paired output. For GPT-4, we prompt with a zero-shot question “Choose the best answer between A and B,” where A and B are the two responses for comparison.
We report the results in terms of accuracy in Table 7. As expected, our own reward models perform the best on our internal test sets collected based on LLaMA2-Chat, with the Helpfulness reward model performing best on the Meta Helpfulness test set, and similarly the Safety reward model performing best on the Meta Safety test set. Overall, our reward models outperform all of the baselines, including GPT-4. Interestingly, GPT-4 performs better than other non-Meta reward models, despite not being trained directly nor targeting specifically this reward modeling task.
The fact that helpfulness and safety performed the best on their own domain is potentially due to the tension between the two objectives (i.e., being as helpful as possible versus refusing unsafe prompts when necessary), which may confuse the reward model during training. In order for a single model to perform well on both dimensions, it needs to not only learn to select the better response given a prompt but also to distinguish adversarial prompts from safe ones. As a result, optimizing two separate models eases the reward modeling task. More detailed analysis on this tension between safety and helpfulness can be found in Appendix A.4.1. When we group the scores by preference rating in Table 8, we can see that the accuracy is superior for the “significantly better” test set and degrades gradually as comparison pairs become more similar (e.g., “slightly better”). It is expected that learning to model human preferences becomes challenging when deciding between two similar model responses, due to annotator subjectivity and their reliance on nuanced details that may differentiate responses. We emphasize that the accuracy on more distinct responses matters the most to improve LLaMA2-Chat performance. The human preference annotation agreement rate is also higher on more distinct responses than similar pairs.
Scaling Trends
We study the scaling trends in terms of data and model size for the reward model, finetuning different model sizes on an increasing amount of the reward model data collected each week (see the details on volume per batch in Table 26). Figure 6 reports these trends, showing the expected result that larger models obtain higher performance for a similar volume of data. More importantly, the scaling performance has not yet plateaued given the existing volume of data annotation used for training, a signal that there is room for more improvement with more annotations. We note that reward model accuracy is one of the most important proxies for the final performance of LLaMA2-Chat. While best practices for comprehensively evaluating a generative model is an open research question, the ranking task of the reward has no ambiguity. Therefore, everything else being equal, an improvement of the reward model can be directly translated into an improvement for LLaMA2-Chat.
3.2.3 Iterative Fine-Tuning
As we received more batches of human preference data annotation, we were able to train better reward models and collect more prompts. We therefore trained successive versions for RLHF models, referred to here as RLHF-V1, . . . , RLHF-V5. We explored RLHF fine-tuning with two main algorithms:
- Proximal Policy Optimization (PPO) (Schulman et al., 2017), the standard in RLHF literature.
- Rejection Sampling fine-tuning. We sample K outputs from the model and select the best candidate with our reward, consistent with Bai et al. (2022b). The same re-ranking strategy for LLMs was also proposed in Deng et al. (2019), where the reward is seen as an energy function. Here, we go one step further, and use the selected outputs for a gradient update. For each prompt, the sample obtaining the highest reward score is considered the new gold standard. Similar to Scialom et al. (2020a), we then fine-tune our model on the new set of ranked samples, reinforcing the reward.
The two RL algorithms mainly differ in:
- Breadth — in Rejection Sampling, the model explores K samples for a given prompt, while only one generation is done for PPO.
- Depth — in PPO, during training at step t the sample is a function of the updated model policy from t − 1 after the gradient update of the previous step. In Rejection Sampling fine-tuning, we sample all the outputs given the initial policy of our model to collect a new dataset, before applying the fine-tuning similar to SFT. However, since we applied iterative model updates, the fundamental differences between the two RL algorithms are less pronounced.
Until RLHF (V4), we used only Rejection Sampling fine-tuning, and after that, we combined the two sequentially, applying PPO on top of the resulted Rejection Sampling checkpoint before sampling again.
Rejection Sampling
We perform rejection sampling only with our largest 70B LLaMA2-Chat. All smaller models are fine-tuned on rejection sampled data from the larger model, thus distilling the large-model capabilities into the smaller ones. We leave further analysis of the effect of this distillation for future work.
At each iterative stage, we sample K answers for each prompt from the most recent model. We score each sample given the best reward model accessible at the time of the experiment, and then select the best answer for a given prompt. In earlier versions of our model, up to RLHF V3, our approach was to confine answer selection solely to the “bag” of samples gathered from the preceding iteration. For example, RLHF V3 was trained using only samples from RLHF V2. However, despite continuous improvement, this method led to a regression in some capabilities. For example, RLHF V3 struggled more than previous versions to compose rhyming lines in poems, as discerned through qualitative analysis, suggesting that further investigation into the causes of and mitigations for forgetting (Kirkpatrick et al., 2017; Nguyen et al., 2019; Ramasesh et al., 2021) could be a fruitful area for additional future research.
In response, on subsequent iterations, we modified our strategy, incorporating top-performing samples from all prior iterations, such as those used in RLHF-V1 and RLHF-V2. Although we do not present specific figures, this adjustment demonstrated considerable enhancements in performance and effectively addressed the previously noted issues. This mitigation can be seen as analogous to Synnaeve et al. (2019) and Vinyals et al. (2019) in the RL literature.
We illustrate the benefit of Rejection Sampling in Figure 7. The delta between the maximum and median curves can be interpreted as the potential gain of fine-tuning on the best output. As expected, this delta increases with more samples, since the maximum increases (i.e., more samples, more opportunities to generate a good trajectory), while the median remains stationary. There is a direct connection between the exploration and the maximum reward we can obtain among the samples. The temperature parameter also plays an important role for exploration, as a higher temperature enables us to sample more diverse outputs. In Figure 8, we report for a LLaMA2-Chat-SFT (left) and a LLaMA2-Chat-RLHF (right), the maximum reward curves among N samples (with N ∈ [1, . . . , 100]), for different temperatures. We can observe that the optimal temperature is not constant during the iterative model updates: RLHF has a direct impact on rescaling the temperature. For LLaMA2-Chat-RLHF, the optimal temperature when sampling between 10 and 100 outputs is T ∈ [1.2, 1.3]. Given a finite compute budget, it is therefore necessary to re-adjust the temperature progressively. Note that this temperature rescaling happens for a constant number of steps for each model, and always starting from the base model on each new RLHF version.
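The core loop of rejection sampling described above can be sketched as follows; `generate` and `reward` stand in for the current chat model and the best available reward model (hypothetical callables), and the values of K and the temperature are illustrative.

```python
# Best-of-K data collection: sample K answers per prompt, keep the highest-reward one.
def best_of_k(prompts, generate, reward, k=8, temperature=1.2):
    """Returns (prompt, best_answer) pairs, which are then used for SFT-like fine-tuning."""
    dataset = []
    for prompt in prompts:
        candidates = [generate(prompt, temperature=temperature) for _ in range(k)]
        best = max(candidates, key=lambda answer: reward(prompt, answer))
        dataset.append((prompt, best))
    return dataset
```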
PPO
We further train our language model following the RL scheme of Stiennon et al. (2020), which uses the reward model as an estimate for the true reward function (human preference) and the pretrained language model as the policy to optimize. During this phase, we seek to optimize the following objective:

$\arg\max_{\pi} \; \mathbb{E}_{p \sim \mathcal{D},\, g \sim \pi} \left[ R(g \mid p) \right]$