In Section 2.5 of our article《ChatGPT技术原理剖析:从RL之PPO算法、RLHF到GPT4、instructGPT》we mentioned: "In July 2021, OpenAI released the Codex paper, Evaluating Large Language Models Trained on Code. The initial Codex was fine-tuned from a 12-billion-parameter GPT-3 variant and trained on 159 GB of Python code; this 12B model later evolved into code-cushman-001 in the OpenAI API, which has fairly strong code/reasoning capabilities."
Next, let's look at what actually lies behind Codex, i.e., how it was trained step by step.
1.1 Evaluating Codex's performance
All 164 programming problems in the HumanEval dataset are hand-written rather than taken from public sources online. Each problem includes a function signature, a docstring, a function body, and several unit tests, with an average of 7.7 tests per problem; functional correctness is evaluated against these tests.
Hand-writing these tasks matters because Codex is trained on a large fraction of GitHub, which already contains solutions to problems from a variety of sources; if the test problems were publicly available online, the training data scraped from GitHub might already include those problems together with their answers (the point is to evaluate the model's ability to solve problems, and that evaluation breaks down if the model has effectively already seen the answers).
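To make the format concrete, here is a hypothetical problem written in the HumanEval style (not an actual item from the dataset; the function name, docstring, and tests are invented for illustration). The model sees everything up to and including the docstring and must complete the body; the hidden unit tests then check functional correctness.

```python
def count_vowels(text: str) -> int:
    """Count how many characters of text are vowels (a, e, i, o, u), case-insensitively.

    >>> count_vowels("OpenAI")
    4
    """
    # --- candidate completion produced by the model ---
    return sum(1 for ch in text.lower() if ch in "aeiou")


def check(candidate):
    # plays the role of the hidden unit tests used to judge functional correctness
    assert candidate("OpenAI") == 4
    assert candidate("xyz") == 0
    assert candidate("") == 0


check(count_vowels)
```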
Using just a single sampled answer per problem, the 12B-parameter Codex solves 28.8% of the problems. After additionally fine-tuning Codex on a set of standalone, correctly implemented functions, the resulting model, Codex-S, solves 37.7% of the problems with a single sample.
Of course, if Codex-S is allowed to generate 100 samples for each programming problem, it can produce at least one correct answer for 77.5% of the problems. This result suggests that accurate code samples can be selected via heuristic ranking instead of fully evaluating each sample.
Indeed, the sample with the highest mean log-probability passes the unit tests for 44.5% of the problems.
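As a rough illustration of that heuristic, the sketch below ranks candidate completions by the mean log-probability of their tokens; the candidate strings and per-token log-probabilities are made-up stand-ins for what a real model and its scoring API would return.

```python
import numpy as np

# Two candidate completions for the same prompt, with made-up per-token
# log-probabilities standing in for what the model would actually return.
candidates = {
    "    return sorted(set(a) & set(b))": [-0.2, -0.1, -0.4, -0.3],
    "    return list(set(a) | set(b))":   [-0.9, -1.2, -0.8, -1.5],
}

def mean_logprob(token_logprobs: list[float]) -> float:
    """Average log-probability per token: higher means the model is more confident."""
    return float(np.mean(token_logprobs))

# Keep the single sample the model is most confident about, without running any tests
best = max(candidates, key=lambda code: mean_logprob(candidates[code]))
print(best)
```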
1.1.2 How the pass@k metric is computed
So, to evaluate pass@k, we generate n ≥ k samples per task (the paper uses n = 200 and k ≤ 100), count the number of correct samples c ≤ n that pass the unit tests, and compute the unbiased estimator

pass@k = E over problems[ 1 - C(n-c, k) / C(n, k) ]

where C(·, ·) denotes the binomial coefficient.
Intuitively, imagine drawing k of these n samples uniformly at random and checking whether at least one of them passes: C(n-c, k) / C(n, k) is the probability that all k drawn samples are incorrect, so the expression above is exactly the probability that at least one is correct, computed in closed form rather than by repeatedly drawing random subsets.
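The estimator can be computed in a numerically stable product form (the Codex paper provides a similar snippet), since expanding the binomial coefficients directly would overflow for n = 200:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k for one task: 1 - C(n-c, k) / C(n, k).

    n: total number of generated samples for the task
    c: number of samples that pass the unit tests
    k: the k in pass@k
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain at least one correct sample
    # Numerically stable product form of 1 - C(n-c, k) / C(n, k)
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 200 samples generated, 30 of them pass the unit tests
print(pass_at_k(200, 30, 1))    # 0.15
print(pass_at_k(200, 30, 100))  # very close to 1.0
```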
Codex is trained using the same learning rate as the corresponding GPT model, with a 175-step linear warmup and cosine learning-rate decay. Training runs for a total of 100 billion tokens, using the Adam optimizer with β1 = 0.9, β2 = 0.95, ε = 10^-8, and a weight-decay coefficient of 0.1.
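As a rough sketch of how the warmup-plus-cosine schedule and those Adam hyperparameters fit together in code: the 175-step warmup, betas, epsilon, and weight decay come from the paper, while the base learning rate, total step count, and the toy model are illustrative placeholders, and AdamW is used here for the decoupled weight decay.

```python
import math
import torch

base_lr, warmup_steps, total_steps = 1e-4, 175, 100_000  # base_lr/total_steps are placeholders

model = torch.nn.Linear(512, 512)  # stand-in for the actual transformer
optimizer = torch.optim.AdamW(     # AdamW = Adam with decoupled weight decay
    model.parameters(),
    lr=base_lr, betas=(0.9, 0.95), eps=1e-8, weight_decay=0.1,
)

def lr_lambda(step: int) -> float:
    """Linear warmup for the first 175 steps, then cosine decay."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
# Inside the training loop: optimizer.step(); scheduler.step()
```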
Next comes a question: when predicting the next token of the current sequence, one could simply take the most probable token (the highest softmax value), but that is neither globally optimal nor diverse (a locally optimal choice at each step does not imply a globally optimal sequence, and because the most probable token is always chosen, the output is identical no matter how many times you sample). To address this, sampling-based decoding strategies such as temperature, top-k, and top-p sampling are used instead; a minimal contrast between greedy decoding and temperature sampling is sketched below.
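A minimal sketch of that contrast, over a single made-up next-token distribution (the logits are invented numbers, not real model output):

```python
import torch

# Made-up scores for four candidate next tokens
logits = torch.tensor([2.0, 1.5, 0.3, -1.0])

# Greedy decoding: always the argmax, so repeated runs give the same answer
greedy_token = int(torch.argmax(logits))

# Temperature sampling: sharpen/flatten the distribution, then draw at random,
# so different runs can yield different (more diverse) answers
temperature = 0.8
probs = torch.softmax(logits / temperature, dim=-1)
sampled_token = int(torch.multinomial(probs, num_samples=1))
```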
An important use case for GitHub Copilot is reducing some of the drudgery of writing unit tests. For instance, suppose we already have an implementation of a function that computes the common prefix of two lists and we want to test it.
To do this, we import the unit-test package and start writing a test function; Copilot then generates the asserts, which we can accept simply by pressing Tab.
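What the resulting test might end up looking like is sketched below; the implementation under test and the specific asserts are invented for illustration, not Copilot's actual output.

```python
import unittest

def common_prefix(a: list, b: list) -> list:
    """The implementation under test: common prefix of two lists."""
    result = []
    for x, y in zip(a, b):
        if x != y:
            break
        result.append(x)
    return result

class TestCommonPrefix(unittest.TestCase):
    def test_common_prefix(self):
        # asserts of the kind Copilot suggests and we accept with Tab
        self.assertEqual(common_prefix([1, 2, 3], [1, 2, 4]), [1, 2])
        self.assertEqual(common_prefix([1, 2], [3, 4]), [])
        self.assertEqual(common_prefix([], [1]), [])

if __name__ == "__main__":
    unittest.main()
```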
3.1.3 Long-context fine-tuning: similar to fine-tuning by position interpolation
Effective handling of long sequences is a major topic of research in transformer-based language modeling (Vaswani et al., 2017). The fundamental modeling challenges are extrapolation, i.e., operating on sequence lengths beyond those seen at training time, and the quadratic complexity of attention, which favors training on short-to-medium-length inputs.
For Code Llama, a dedicated long-context fine-tuning (LCFT) stage is proposed, in which models are presented with sequences of 16,384 tokens, up from the 4,096 tokens used for Llama 2 and the initial code-training stages. By limiting the training time spent on processing long sequences to a fine-tuning stage, long-range capabilities are gained without significantly increasing the cost of training the models.
The strategy is similar to the recently proposed fine-tuning by position interpolation (Chen et al., 2023b), and the authors confirm the importance of modifying the rotation frequencies of the rotary position embedding used in the Llama 2 foundation models (Su et al., 2021). However, instead of downscaling the frequencies linearly as Chen et al. (2023b) do, they change the base period from which the frequencies are derived.
Specifically, with rotary embeddings, the query and key vectors x_n at position n are subject to a linear transformation R^d_{Θ,n} x_n, where R^d_{Θ,n} is a block-diagonal matrix whose 2×2 blocks are rotations of the form

( cos(n·θ_i)  -sin(n·θ_i) )
( sin(n·θ_i)   cos(n·θ_i) )
Here d denotes the embedding dimension. The rotation frequencies are computed as θ_i = θ^(-2i/d), and for fine-tuning the base period θ is increased from 10,000 to 1,000,000. This increase allows processing much larger sequences and reduces the bias towards short-distance attention (see Appendix F.1 of the Code Llama paper for further discussion).
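A small sketch of how changing the base period reshapes these frequencies (illustrative only, not Code Llama's actual code; the head dimension of 128 is an assumption):

```python
import numpy as np

def rope_frequencies(dim: int, base: float) -> np.ndarray:
    """Per-pair rotation frequencies theta_i = base ** (-2 * i / dim)."""
    i = np.arange(dim // 2)
    return base ** (-2.0 * i / dim)

dim = 128                                        # a typical per-head dimension
freq_llama2 = rope_frequencies(dim, 10_000)      # original base period
freq_lcft = rope_frequencies(dim, 1_000_000)     # Code Llama LCFT base period

# A larger base period shrinks the frequencies, so distant positions are rotated
# by smaller angles, which reduces the bias towards short-distance attention.
print(freq_llama2[-1], freq_lcft[-1])
```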
Experiments confirm that the Code Llama models, trained on 16,384-token sequences during LCFT, are not only effective within this increased sequence length but also show extrapolation capabilities, exhibiting stable behavior on very long sequences of up to 100,000 tokens (Section 3.3 of the paper).
// To be updated

3.2 How well does Code Llama perform
Meta evaluated on two coding benchmarks: HumanEval and MBPP (Mostly Basic Python Programming). As we already know, HumanEval tests a model's ability to complete code from docstrings, while MBPP tests its ability to write code from a description. More specifically:
The original GPT model uses a pooler function to obtain the final output. CodeGeeX instead uses an extra query layer on top of all the other transformer layers (Zeng et al., 2021) and obtains the final embedding through attention; as shown in the figure above, the input of the top query layer replaces the query input with the query embedding of position n + 1. The final output is then multiplied by the transpose of the word-embedding matrix to obtain the output probabilities.
For its decoding strategy, CodeGeeX supports greedy decoding, temperature sampling, top-k sampling, top-p sampling, and beam search.
Finally, a detokenization step turns the selected token IDs back into actual text; a sketch of top-p sampling followed by detokenization is given below.
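The sketch below samples one token with top-p (nucleus) sampling and then decodes it; the random logits and the stand-in GPT-2 tokenizer are assumptions made purely for illustration, not CodeGeeX's actual vocabulary or code.

```python
import torch
from transformers import AutoTokenizer

def top_p_sample(logits: torch.Tensor, p: float = 0.9, temperature: float = 0.8) -> int:
    """Sample a token ID from the smallest set of tokens whose probability mass >= p."""
    probs = torch.softmax(logits / temperature, dim=-1)
    sorted_probs, sorted_ids = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    keep = (cumulative - sorted_probs) < p      # keep tokens until mass p is reached
    kept_probs = sorted_probs * keep
    kept_probs = kept_probs / kept_probs.sum()  # renormalize over the nucleus
    choice = torch.multinomial(kept_probs, num_samples=1)
    return int(sorted_ids[choice])

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # stand-in tokenizer
fake_logits = torch.randn(tokenizer.vocab_size)     # pretend model output
token_id = top_p_sample(fake_logits)
print(tokenizer.decode([token_id]))                 # detokenization: ID -> text
```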
To increase training efficiency, CodeGeeX adopts 8-way model parallelism together with 192-way data parallelism, with the ZeRO-2 optimizer (Rajbhandari et al., 2020) enabled to further reduce the memory consumption of optimizer states. The micro-batch size is 16 per node, and the global batch size reaches 3,072.
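As a loose illustration of how those numbers relate, here is a ZeRO-2 style setup written as a DeepSpeed-like config dict. CodeGeeX itself was trained on its own Ascend/MindSpore stack, so this mapping is an assumption for illustration only, not the project's real configuration.

```python
zero2_style_config = {
    "train_micro_batch_size_per_gpu": 16,   # 16 per node in the paper's setup
    "train_batch_size": 3072,               # 16 x 192-way data parallelism
    "zero_optimization": {
        "stage": 2,                         # ZeRO-2: shard optimizer states and gradients
    },
    "fp16": {"enabled": True},              # FP16 weights (see the next paragraph)
}
```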
Specifically, the Adam optimizer (Kingma and Ba, 2014) is used to optimize the loss given in Equation 2 of the paper.
The model weights are kept in FP16 format, except that layer normalization and softmax are computed in FP32 for higher precision and stability. The model takes about 27 GB of GPU memory. Training starts from an initial learning rate of 1e-4 and applies a cosine learning-rate decay.
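A toy sketch of that mixed-precision choice: activations stay in FP16, while layer-norm and softmax are computed in FP32 and cast back. The FP32LayerNorm helper, the fp32_softmax function, and the tensor shapes are invented for illustration.

```python
import torch
import torch.nn as nn

class FP32LayerNorm(nn.LayerNorm):
    """LayerNorm that upcasts to FP32 internally, then casts back to the input dtype."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return super().forward(x.float()).to(x.dtype)

def fp32_softmax(scores: torch.Tensor, dim: int = -1) -> torch.Tensor:
    """Softmax computed in FP32 for stability, returned in the original dtype."""
    return torch.softmax(scores.float(), dim=dim).to(scores.dtype)

# Toy FP16 activations and attention scores (shapes are arbitrary)
x = torch.randn(2, 8, 64, dtype=torch.float16)
scores = torch.randn(2, 8, 8, dtype=torch.float16)

ln = FP32LayerNorm(64)
print(ln(x).dtype, fp32_softmax(scores).dtype)  # both remain torch.float16
```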