The Power of Scale for Parameter-Efficient Prompt Tuning


Paper-reading series index



  

Paper link

Paper title, translated: the power of scale for parameter-efficient prompt tuning
Abstract

In this work, we explore “prompt tuning,” a simple yet effective mechanism for learning “soft prompts” to condition frozen language models to perform specific downstream tasks. Unlike the discrete text prompts used by GPT-3, soft prompts are learned through backpropagation and can be tuned to incorporate signals from any number of labeled examples. Our end-to-end learned approach outperforms GPT-3’s few-shot learning by a large margin. More remarkably, through ablations on model size using T5, we show that prompt tuning becomes more competitive with scale: as models exceed billions of parameters, our method “closes the gap” and matches the strong performance of model tuning (where all model weights are tuned). This finding is especially relevant because large models are costly to share and serve and the ability to reuse one frozen model for multiple downstream tasks can ease this burden. Our method can be seen as a simplification of the recently proposed “prefix tuning” of Li and Liang (2021) and we provide a comparison to this and other similar approaches. Finally, we show that conditioning a frozen model with soft prompts confers benefits in robustness to domain transfer and enables efficient “prompt ensembling.”
1 Introduction


  • With the wide success of pre-trained large language models, a range of techniques has arisen to adapt these general-purpose models to downstream tasks. ELMo (Peters et al., 2018) proposed freezing the pre-trained model and learning a task-specific weighting of its per-layer representations. However, since GPT (Radford et al., 2018) and BERT (Devlin et al., 2019), the dominant adaptation technique has been model tuning (or “fine-tuning”), where all model parameters are tuned during adaptation, as proposed by Howard and Ruder (2018).
  • More recently, Brown et al. (2020) showed that prompt design (or “priming”) is surprisingly effective at modulating a frozen GPT-3 model’s behavior through text prompts. Prompts are typically composed of a task description and/or several canonical examples. This return to “freezing” pre-trained models is appealing, especially as model size continues to increase. Rather than requiring a separate copy of the model for each downstream task, a single generalist model can simultaneously serve many different tasks.
  • Unfortunately, prompt-based adaptation has several key drawbacks. Task description is error-prone and requires human involvement, and the effectiveness of a prompt is limited by how much conditioning text can fit into the model’s input. As a result, downstream task quality still lags far behind that of tuned models. For instance, GPT-3 175B few-shot performance on SuperGLUE is 17.5 points below fine-tuned T5-XXL (Raffel et al., 2020) (71.8 vs. 89.3) despite using 16 times more parameters.
  • Several efforts to automate prompt design have been recently proposed. Shin et al. (2020) propose a search algorithm over the discrete space of words, guided by the downstream application training data. While this technique outperforms manual prompt design, there is still a gap relative to model tuning.
  • Li and Liang (2021) propose “prefix tuning” and show strong results on generative tasks. This method freezes the model parameters and backpropagates the error during tuning to prefix activations prepended to each layer in the encoder stack, including the input layer. Hambardzumyan et al. (2021) simplify this recipe by restricting the trainable parameters to the input and output subnetworks of a masked language model, and show reasonable results on classification tasks.
  • In this paper, we propose prompt tuning as a further simplification for adapting language models. We freeze the entire pre-trained model and only allow an additional k tunable tokens per downstream task to be prepended to the input text. This “soft prompt” is trained end-to-end and can condense the signal from a full labeled dataset, allowing our method to outperform few-shot prompts and close the quality gap with model tuning (Figure 1). At the same time, since a single pre-trained model is recycled for all downstream tasks, we retain the efficient serving benefits of frozen models (Figure 2).
    Figure 1: Standard model tuning of T5 achieves strong performance, but requires storing a separate copy of the model for each end task. Our prompt tuning of T5 matches the quality of model tuning while enabling the reuse of a single frozen model for all tasks. Our approach significantly outperforms few-shot prompt design using GPT-3. We show the mean and standard deviation over 3 runs for the tuning methods.
    Figure 2: Model tuning requires making a task-specific copy of the entire pre-trained model for each downstream task, and inference must be performed in separate batches. Prompt tuning only requires storing a small task-specific prompt for each task and enables mixed-task inference using the original pre-trained model. For a T5 “XXL” model, each copy of the tuned model requires 11 billion parameters. By contrast, our tuned prompts require only 20,480 parameters per task (a reduction of more than five orders of magnitude), assuming a prompt length of 5 tokens.
  • While we developed our method concurrently with Li and Liang (2021) and Hambardzumyan et al. (2021), we are the first to show that prompt tuning alone (with no intermediate-layer prefixes or task-specific output layers) is sufficient to be competitive with model tuning. Through detailed experiments in sections 2–3, we demonstrate that language model capacity is a key ingredient for these approaches to succeed. As Figure 1 shows, prompt tuning becomes more competitive with scale.
    We compare with similar approaches in Section 4. Explicitly separating task-specific parameters from the “generalist” parameters needed for general language-understanding has a range of additional benefits. We show in Section 5 that by capturing the task definition in the prompt while keeping the generalist parameters fixed, we are able to achieve better resilience to domain shifts. In Section 6, we show that “prompt ensembling”, learning multiple prompts for the same task, can boost quality and is more efficient than classic model ensembling. Finally, in Section 7, we investigate the interpretability of our learned soft prompts. In sum, our key contributions are: 1. Proposing prompt tuning and showing its competitiveness with model tuning in the regime of large language models. 2. Ablating many design choices, and showing quality and robustness improve with scale. 3. Showing prompt tuning outperforms model tuning on domain shift problems. 4. Proposing “prompt ensembling” and showing its effectiveness.
2 Prompt Tuning


Following T5’s “text-to-text” approach (Raffel et al., 2020), we cast all tasks as text generation. Instead of modeling classification as the probability of an output class given some input, Pr(y | X), where X is a series of tokens and y is a single class label, we now model it as conditional generation, where Y is a sequence of tokens representing a class label. T5 models classification as Pr_θ(Y | X), parameterized by the weights θ of the transformer (Vaswani et al., 2017) that makes up its encoder and decoder.
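As a concrete illustration of this casting (a hypothetical sketch; the exact per-task serialization follows the T5 recipe of Raffel et al. (2020) and is not reproduced here), a classification example becomes a pair of strings whose target is the class label written out as text:

```python
# Hypothetical sketch: casting a classification example as text-to-text.
# The field names and label strings below are illustrative only.

def to_text_to_text(premise: str, hypothesis: str, label: int) -> tuple[str, str]:
    """Turn an entailment-style example into (input_text, target_text)."""
    input_text = f"premise: {premise} hypothesis: {hypothesis}"
    # The target Y is the string form of the class label, so classification
    # becomes conditional generation of a short token sequence.
    target_text = ["entailment", "not_entailment"][label]
    return input_text, target_text

print(to_text_to_text("A man is playing a guitar.", "A person plays music.", 0))
```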


Prompting is the approach of adding extra information for the model to condition on during its generation of Y. Normally, prompting is done by prepending a series of tokens, P, to the input X, such that the model maximizes the likelihood of the correct Y, Pr_θ(Y | [P; X]), while keeping the model parameters θ fixed. In GPT-3, the representations of the prompt tokens, P = {p1, p2, ..., pn}, are part of the model’s embedding table, parameterized by the frozen θ. Finding an optimal prompt therefore requires selecting prompt tokens, either through manual search or via non-differentiable search methods (Jiang et al., 2020; Shin et al., 2020). Prompt tuning removes the restriction that the prompt P be parameterized by θ; instead, the prompt has its own dedicated parameters, θ_P, that can be updated. While prompt design involves selecting prompt tokens from a fixed vocabulary of frozen embeddings, prompt tuning can be thought of as using a fixed prompt of special tokens, where only the embeddings of these prompt tokens can be updated. Our new conditional generation is now Pr_{θ; θ_P}(Y | [P; X]) and can be trained by maximizing the likelihood of Y via backpropagation, while only applying gradient updates to θ_P.

Given a series of n tokens, {x1, x2, ..., xn}, the first thing T5 does is embed the tokens, forming a matrix X_e ∈ R^(n×e), where e is the dimension of the embedding space. Our soft prompt is represented as a parameter P_e ∈ R^(p×e), where p is the length of the prompt. The prompt is then concatenated to the embedded input, forming a single matrix [P_e; X_e] ∈ R^((p+n)×e), which flows through the encoder-decoder as normal. Our models are trained to maximize the probability of Y, but only the prompt parameters P_e are updated.
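To make the mechanics concrete, here is a minimal sketch in JAX (the framework used for the paper's experiments), with a toy differentiable function standing in for the frozen T5 encoder-decoder. The shapes and the stand-in loss are illustrative assumptions; what the sketch shows is the pattern described above: concatenate [P_e; X_e] and backpropagate into P_e alone.

```python
import jax
import jax.numpy as jnp

# Toy sizes; T5-XXL has e = 4096 and a ~32k SentencePiece vocabulary.
p, n, e, vocab = 20, 12, 512, 32000

key = jax.random.PRNGKey(0)
frozen_embedding = jax.random.normal(key, (vocab, e)) * 0.02        # part of frozen theta
prompt = jax.random.uniform(key, (p, e), minval=-0.5, maxval=0.5)   # theta_P, trainable

def frozen_model_loss(full_input, targets):
    """Toy stand-in for the frozen T5 encoder-decoder: any differentiable
    function of the (p + n) x e input suffices to illustrate the update rule."""
    logits = full_input.mean(axis=0) @ frozen_embedding.T            # (vocab,)
    return -jnp.sum(jax.nn.log_softmax(logits)[targets])

def loss_fn(prompt_params, token_ids, targets):
    x_e = frozen_embedding[token_ids]                   # embed the input: X_e, (n, e)
    full_input = jnp.concatenate([prompt_params, x_e])  # [P_e; X_e], ((p + n), e)
    return frozen_model_loss(full_input, targets)

token_ids = jnp.arange(n)          # a dummy input sequence
targets = jnp.array([42])          # a dummy target token id
# Gradients flow only into the prompt; the frozen weights are closed-over constants.
grads = jax.grad(loss_fn)(prompt, token_ids, targets)
prompt = prompt - 0.3 * grads      # plain SGD step for illustration (the paper uses Adafactor)
```

In the real setup the stand-in loss is replaced by the full frozen T5 encoder-decoder and the prompt is trained with Adafactor under the settings given in Section 3.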
2.1 Design Decisions


  • There are many possible ways to initialize the prompt representations. The simplest is to train from scratch, using random initialization. A more sophisticated option is to initialize each prompt token to an embedding drawn from the model’s vocabulary. Conceptually, our soft-prompt modulates the frozen network’s behavior in the same way as text preceding the input, so it follows that a word-like representation might serve as a good initialization spot. For classification tasks, a third option is to initialize the prompt with embeddings that enumerate the output classes, similar to the “verbalizers” of Schick and Schütze (2021). Since we want the model to produce these tokens in the output, initializing the prompt with the embeddings of the valid target tokens should prime the model to restrict its output to the legal output classes.
  • Another design consideration is the length of the prompt. The parameter cost of our method is EP, where E is the token embedding dimension and P is the prompt length. The shorter the prompt, the fewer new parameters must be tuned, so we aim to find a minimal length that still performs well.
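As a quick sanity check on this E × P cost, a worked example using the T5-XXL embedding dimension of 4096 (implied by the 20,480-parameter figure in the Figure 2 caption):

```python
# Task-specific parameter count for prompt tuning is E * P.
E = 4096                      # token embedding dimension of T5-XXL (implied by Figure 2)
for P in (5, 20, 100):
    params = E * P
    print(f"prompt length {P:3d}: {params:6d} parameters, "
          f"{params / 11_000_000_000:.7f} of the 11B frozen model")
# prompt length 5 gives 20480 parameters, matching Figure 2's "over five orders
# of magnitude" reduction versus storing a full model copy per task.
```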
2.2 Unlearning Span Corruption


Unlike autoregressive language models like GPT-3, the T5 models we experiment with use an encoder-decoder architecture and are pre-trained on a span corruption objective. Specifically, T5 is tasked with “reconstructing” masked spans in the input text, which are marked with unique sentinel tokens. The target output text consists of all of the masked content, separated by sentinels, plus a final sentinel. For example, from the text “Thank you for inviting me to your party last week” we might construct a pre-training example where the input is “Thank you ⟨X⟩ me to your party ⟨Y⟩ week” and the target output is “⟨X⟩ for inviting ⟨Y⟩ last ⟨Z⟩”.
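The following small Python sketch shows how such an example can be constructed. It is an illustrative simplification: T5's real preprocessing samples span positions and lengths randomly, whereas the spans here are fixed so the output matches the example above.

```python
# Illustrative sketch of span corruption; T5's actual pipeline samples spans randomly.

def span_corrupt(words, spans):
    """Mask the given (start, end) word spans, returning (input, target) text."""
    sentinels = [f"<extra_id_{i}>" for i in range(len(spans) + 1)]
    inp, tgt, cursor = [], [], 0
    for i, (start, end) in enumerate(spans):
        inp += words[cursor:start] + [sentinels[i]]   # keep text, drop the span
        tgt += [sentinels[i]] + words[start:end]      # target holds the masked span
        cursor = end
    inp += words[cursor:]
    tgt += [sentinels[len(spans)]]                    # final sentinel closes the target
    return " ".join(inp), " ".join(tgt)

words = "Thank you for inviting me to your party last week".split()
print(span_corrupt(words, [(2, 4), (8, 9)]))
# ('Thank you <extra_id_0> me to your party <extra_id_1> week',
#  '<extra_id_0> for inviting <extra_id_1> last <extra_id_2>')
```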

  • While Raffel et al. (2020) find this architecture and pre-training objective more effective than traditional language modeling, we hypothesize that this setup is not a good fit for producing a frozen model that can be readily controlled through prompt tuning. In particular, a T5 model pre-trained exclusively on span corruption, such as T5.1.1, has never seen truly natural input text (free of sentinel tokens), nor has it ever been asked to predict truly natural targets. In fact, due to the details of T5’s span corruption preprocessing, every pre-training target will begin with a sentinel. While this “unnatural” tendency to output sentinels is easy to overcome through fine-tuning, we suspect that it would be much harder to override through a prompt alone, as the decoder priors cannot be adjusted.
  • Given these concerns, we experiment with T5 models in three settings. (1) “Span Corruption”: We use pre-trained T5 off-the-shelf as our frozen model, and test its ability to output the expected text for downstream tasks. (2) “Span Corruption + Sentinel”: We use the same model, but prepend all downstream targets with a sentinel, so as to more closely resemble the targets seen in pretraining. (3) “LM Adaptation”: We continue T5’s self-supervised training for a small number of additional steps, but using the “LM” objective discussed by Raffel et al. (2020); given a natural text prefix as input, the model must produce the natural text continuation as output. Crucially, this adaptation happens only once, producing a single frozen model that we can reuse for prompt tuning across any number of downstream tasks.
    Through LM adaptation, we hope to “quickly” transform T5 into a model more similar to GPT-3, which always outputs realistic text, and is known to respond well to prompts as a “few-shot learner”. It is not obvious how successful this late-stage transformation will be compared to pre-training from scratch, and it has not been investigated previously to our knowledge. As such, we experiment with various lengths of adaptation up to 100K steps.
3 Results


Our frozen models are built on top of pre-trained T5 checkpoints of all sizes (Small, Base, Large, XL, XXL). We use the public T5.1.1 checkpoints, which include improvements over the original T5. Our “default” configuration, plotted with a green “×” throughout, uses an LM-adapted version of T5 trained for an additional 100K steps, initializes the prompt with class labels (see Section 3.2), and uses a prompt length of 100 tokens. While this is longer than the default 10-token prefix used by Li and Liang (2021), our method still uses fewer task-specific parameters, as we only tune the input layer rather than prepending tunable activations at every network layer. See Figure 4 for a detailed comparison. We will also see shortly that even much shorter prompts are viable as model size increases.
We measure performance on the SuperGLUE benchmark (Wang et al., 2019a), a collection of eight challenging English language understanding tasks. We report metrics on the development set associated with each dataset.
Each of our prompts trains on a single SuperGLUE task; there was no multi-task setup or mixing of training data across tasks. We translate each SuperGLUE dataset into a text-to-text format following Raffel et al. (2020), except that we omit the task names prepended to inputs indicating which SuperGLUE task an example belongs to.
We train our prompts for 30,000 steps using T5’s standard cross-entropy loss, with a constant learning rate of 0.3 and a batch size of 32. Checkpoints are selected via early stopping on the development set, where the stopping metric is the default metric for the dataset, or the average of metrics for datasets evaluated with multiple metrics. All experiments were run in JAX (Bradbury et al., 2018) using the Adafactor optimizer (Shazeer and Stern, 2018) with weight decay 1e−5, β2 decay 0.8, and parameter scaling off. The models were implemented in Flax (Heek et al., 2020).
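A hedged sketch of this optimizer configuration, expressed with optax's Adafactor (the paper's own training code is not shown here, and the argument names below are optax's, not necessarily the paper's):

```python
import jax.numpy as jnp
import optax

prompt = jnp.zeros((100, 4096))             # toy stand-in for the soft prompt P_e

optimizer = optax.adafactor(
    learning_rate=0.3,                      # constant learning rate 0.3
    decay_rate=0.8,                         # beta2 decay 0.8
    weight_decay_rate=1e-5,                 # weight decay 1e-5
    multiply_by_parameter_scale=False,      # "parameter scaling off"
)
opt_state = optimizer.init(prompt)          # only the prompt parameters are optimized

# One training step would then look like (loss_fn as in the earlier sketch):
#   grads = jax.grad(loss_fn)(prompt, token_ids, targets)
#   updates, opt_state = optimizer.update(grads, opt_state, prompt)
#   prompt = optax.apply_updates(prompt, updates)
```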
3.1 Closing the Gap

To compare our method with standard model tuning, we tune the public T5.1.1 checkpoints on SuperGLUE using the default hyperparameters specified in the T5 library (learning rate 0.001, and Adafactor optimizer with pre-training parameter states restored). We consider two baselines. (1) “Model Tuning”: For an apples-to-apples comparison, we tune on each task separately, as in our prompt tuning setup. (2) “Model Tuning (Multitask)”: We use T5’s multi-task tuning setup to achieve a more competitive baseline. In this case, a single model is tuned on all tasks jointly, with a text prefix indicating the task name.
In Figure 1 (p. 1), we see that prompt tuning becomes more competitive with model tuning as scale increases. At the XXL size (11 billion parameters), prompt tuning matches even the stronger multi-task model tuning baseline, despite having over 20,000 times fewer task-specific parameters.
To compare with prompt design, we include GPT-3 few-shot performance on the SuperGLUE dev split, as reported by Brown et al. (2020). Figure 1 shows that prompt tuning beats GPT-3 prompt design by a large margin, with prompt-tuned T5-Small matching GPT-3 XL (over 16 times larger), and prompt-tuned T5-Large beating GPT-3 175B (over 220 times larger).
3.2 Ablation Study

Prompt Length We train prompts for each model size while varying the prompt length in {1, 5, 20, 100, 150} and fixing other settings to our default configuration. Figure 3(a) shows that for most model sizes, increasing prompt length beyond a single token is critical to achieve good performance. Notably, the XXL model still gives strong results with a single-token prompt, suggesting that the larger the model, the less conditioning signal is needed to achieve a target behavior. Across all models, increasing beyond 20 tokens only yields marginal gains.
Figure 3: Ablations of various hyperparameters on prompt tuning performance (mean and standard deviation over 3 runs). In our “default” configuration, quality improves steadily with model size. Across all ablations, the largest (XXL) model is the most robust to hyperparameter choice. (a) Prompt length: increasing to 20+ tokens generally confers a large boost, but XXL performs well even with single-token prompts. (b) Prompt initialization: random uniform initialization lags behind the more “advanced” initializations using sampled vocabulary or class-label embeddings, but the difference vanishes at XXL size. (c) Pre-training objective: LM adaptation outperforms span corruption, even when a sentinel is added to downstream task targets, but XXL works well with any method. (d) LM adaptation: longer adaptation generally gives larger gains, but XXL is robust even to short adaptation.
Prompt Initialization We ablate the effect of prompt initialization by training models at all sizes while fixing other hyperparameters to their default values. For random initialization, we sample uniformly from the range [−0.5, 0.5]. When initializing from sampled vocabulary, we restrict to the 5,000 most “common” tokens in T5’s SentencePiece vocabulary (Kudo and Richardson, 2018), which is ordered by likelihood in the pre-training corpus. For “class label” initialization, we take the embeddings for the string representations of each class in the downstream task and use them to initialize one of the tokens in the prompt. When a class label is multi-token, we average the token embeddings. At longer prompt lengths, we often run out of class labels before we have initialized all of the prompt tokens. In this case we fall back to our sampled vocab strategy to fill in the prompt.
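A sketch of the three initialization strategies in Python/JAX, assuming access to the frozen model's embedding table; the "5,000 most common tokens" list and the label-to-token-id mapping are taken as given, and the toy values at the bottom are made up for illustration:

```python
import jax
import jax.numpy as jnp

def init_prompt(key, p, embedding, strategy, common_ids=None, label_token_ids=None):
    """Return a (p, e) soft-prompt matrix under one of the three strategies."""
    e = embedding.shape[1]
    if strategy == "random_uniform":
        return jax.random.uniform(key, (p, e), minval=-0.5, maxval=0.5)
    if strategy == "sampled_vocab":
        # Draw prompt tokens from the 5,000 most common vocabulary items.
        ids = jax.random.choice(key, common_ids, shape=(p,))
        return embedding[ids]
    if strategy == "class_label":
        # One prompt token per class label; multi-token labels are averaged.
        rows = [embedding[jnp.asarray(ids)].mean(axis=0) for ids in label_token_ids]
        rows = rows[:p]
        if len(rows) < p:   # ran out of labels: fall back to sampled vocab
            filler = init_prompt(key, p - len(rows), embedding,
                                 "sampled_vocab", common_ids=common_ids)
            rows += list(filler)
        return jnp.stack(rows)
    raise ValueError(strategy)

# Toy usage with a random stand-in embedding table and made-up label token ids.
key = jax.random.PRNGKey(0)
embedding = jax.random.normal(key, (32000, 512)) * 0.02
prompt = init_prompt(key, 8, embedding, "class_label",
                     common_ids=jnp.arange(5000),
                     label_token_ids=[[17], [245, 9]])   # hypothetical label ids
print(prompt.shape)   # (8, 512)
```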
Figure 3(b) shows our ablation of initialization strategy across model sizes, where we find that the class based initialization performs best. At smaller model sizes, there are large gaps between the different initializations, but once the model is scaled to XXL size, those differences disappear.
With “class label” initialization, we observe that the class labels typically persist in the learned prompts, such that the nearest token embeddings (in cosine distance) match the tokens used for initialization. Beyond this, we did not find our learned prompts to be interpretable, similar to those of Shin et al. (2020). See Section 7 for details.
Pre-training Objective In Figures 3(c) and 3(d), we see pre-training objective has a clear effect on prompt tuning quality. As hypothesized in Section 2.2, T5’s default “span corruption” objective is not well-suited for training frozen models to be later conditioned by prompts. Intuitively, models pre-trained to read and write sentinel tokens are hard to apply directly to tasks of reading and writing text without sentinels. As seen in Figure 3(c), even the “workaround” of adding a sentinel to the downstream targets has little benefit. While LM adaptation adds value across all model sizes, we note our largest XXL model is the most forgiving and gives strong results even with span corruption.
Given the benefit of LM adaptation, we also explore how long of an adaptation is helpful. Figure 3(d) shows that longer adaptation provides additional gains, up to 100K steps. This suggests that the “transition” from span corruption to a language modeling objective is not a trivial change, and making an effective switch takes an investment of training resources (10% of the steps of the original T5 pre-training). At the same time, as in our other ablations, we observe that the XXL model is robust to even non-ideal configurations. At this size, the gains from adaptation are quite modest.
In the non-optimal “span corruption” setting, we observe instability across model sizes, with the Small model outperforming the larger Base, Large, and XL models. On inspection, we find that for many tasks, these mid-sized models never learn to output a legal class label and thus score 0%. The two most common error modes are copying subspans from the input and predicting an empty string. Furthermore, this poor performance is not due to random variance in prompt tuning, as we observe low variance across 3 runs for each size. These results indicate that using models pre-trained with the “span corruption” objective can be unreliable, with only 2 out of 5 models working well, whereas the LM-adapted versions work reliably across all model sizes.
We have released T5 1.1 checkpoints adapted using the LM objective for 100K steps for all model sizes.
4 Comparison to Similar Approaches

In this section, we review recent work on learning continuous prompts, and draw comparisons with our method. One important axis of comparison is the number of task-specific parameters each method requires, as shown in Figure 4. Among methods with learnable parameters, prompt tuning is the most parameter efficient, requiring less than 0.01% task-specific parameters for models over a billion parameters.
Li and Liang (2021) propose “prefix tuning”: learning a sequence of prefixes that are prepended at every transformer layer. This is akin to learning transformer activations that are fixed across examples at every network layer. In contrast, prompt tuning uses a single prompt representation that is prepended to the embedded input. Beyond requiring fewer parameters, our approach allows the transformer to update the intermediate-layer task representations, as contextualized by an input example. Their work builds on GPT-2 (Radford et al., 2019) and BART (Lewis et al., 2020), while ours focuses on T5 and examines changes in performance and robustness to design choices as model size increases. When using BART, prefix tuning includes prefixes on both the encoder and decoder network, while prompt tuning only requires prompts on the encoder. Li and Liang (2021) also rely on a reparameterization of the prefix to stabilize learning, which adds a large number of parameters during training, whereas our configuration does not require this reparameterization and is robust across SuperGLUE tasks and model sizes.
Hambardzumyan et al. (2021) propose “WARP”, where prompt parameters are added to the input layer. This method works with masked language models, relying on a [MASK] token and a learnable output layer to project the mask to class logits. This formulation restricts the model to producing a single output, limiting it to classification. Prompt tuning does not require any changes to the input or a task-specific head. The performance of prompt tuning is also considerably closer to the strong performance of model tuning.
Liu et al. (2021) propose “P-tuning” where learnable continuous prompts are interleaved throughout the embedded input, using patterns based on human design. Our approach removes this complication by simply prepending the prompt to the input. To achieve strong SuperGLUE results, P-tuning has to be used in conjunction with model tuning, that is, models jointly update both the prompt and the main model parameters, whereas our approach keeps the original language model frozen.

Qin and Eisner (2021) use “soft words” to learn prompts to extract knowledge from pre-trained LMs. Prompts are positioned relative to the input based on hand-designed prompt prototypes, and a learned prompt parameter is included for each layer, so the parameter cost scales with model depth.
Logeswaran et al. (2020) use a learnable prepended token to adapt transformer models to various tasks, but focus on small synthetic datasets designed to accommodate a compositional task representation, as opposed to larger real-world datasets. Their base models are small transformers trained from scratch jointly with the task representations, whereas we keep the base model frozen and investigate scaling laws using larger transformers.
More generally, work on task prompts is closely aligned with work on “adapters” (Rebuffi et al., 2017; Houlsby et al., 2019), small bottleneck layers inserted between frozen pre-trained network layers. Adapters offer another means of reducing task-specific parameters, with Houlsby et al. (2019) achieving GLUE performance close to full model tuning when freezing BERT-Large and only adding 2–4% additional parameters. Pfeiffer et al. (2020) use multiple adapters in a multilingual context to explicitly separate language understanding from task specification, similar to our approach. A core difference between adapters and prompt tuning is how the approaches change model behavior. Adapters modify the actual function that acts on the input representation, parameterized by the neural network, by allowing the rewriting of activations at any given layer. Prompt tuning modifies behavior by leaving the function fixed and adding new input representations that can affect how subsequent input is processed.
5 Resilience to Domain Shift

By freezing the core language model parameters, prompt tuning prevents the model from modifying its general understanding of language. Instead, prompt representations indirectly modulate the representation of the input. This reduces the model’s ability to overfit to a dataset by memorizing specific lexical cues and spurious correlations. This restriction suggests that prompt tuning may improve robustness to domain shifts, where the distribution of inputs differs between training and evaluation.
We investigate zero-shot domain transfer on two tasks: question answering (QA) and paraphrase detection. For question answering, we use the MRQA 2019 shared task on generalization (Fisch et al., 2019). This task collects extractive QA datasets in a unified format and tests how models trained on “in-domain” datasets perform when evaluated on “out-of-domain” datasets. For our experiments, we train on SQuAD (Rajpurkar et al., 2016) and evaluate on each of the out-of-domain datasets.
Table 1 shows that prompt tuning outperforms model tuning on the majority of out-of-domain datasets, with a remarkable 12.5 point F1 gap between the two approaches on TextbookQA. We observe larger gains from prompt tuning in cases of larger domain shifts (e.g. to Biomedical in BioASQ or to Textbooks in TextbookQA). Of the datasets where model tuning is better, we see that DROP shares a domain (Wikipedia) with SQuAD and is thus one of the smallest domain transfers.
Table 1: Mean and standard deviation of F1 for models trained on SQuAD and evaluated on the out-of-domain datasets of the MRQA 2019 shared task. Prompt tuning tends to give stronger zero-shot performance than model tuning, especially on datasets like TextbookQA that have larger domain shifts.
As a second test of robustness to domain shift, we explore transfer between two paraphrase detection tasks from GLUE (Wang et al., 2019b). The first task is QQP (Iyer et al., 2017), which asks if two questions from the community Q&A site Quora are “duplicates”. The second task is MRPC (Dolan and Brockett, 2005), which asks if two sentences drawn from news articles are paraphrases. We test transfer in both directions (QQP⇔MRPC). As before, we train on the “in-domain” task, select checkpoints using in-domain validation, and evaluate zero-shot on the “out-of-domain” task.
Table 2 shows that training a lightweight prompt on the QQP data and evaluating on MRPC gives much better performance than tuning the entire model (+3.2 accuracy and +3.1 F1). The results are much closer in the other direction, with prompt tuning showing a small improvement in accuracy and a small drop in F1. These results support the view that model tuning may be over-parameterized and more prone to overfit the training task, to the detriment of similar tasks in different domains.

6 Prompt Ensembling

Ensembles of neural models trained from different initializations on the same data are widely observed to improve task performance (Hansen and Salamon, 1990) and are useful for estimating model uncertainty (Lakshminarayanan et al., 2017). However, as model size increases, ensembling can become impractical. Beyond the space required to store N models (e.g. 42 GiB for each copy of T5-XXL), there is a substantial inference cost to running N distinct models, whether in parallel or in series.
Prompt tuning provides a more efficient way to ensemble multiple adaptations of a pre-trained language model. By training N prompts on the same task, we create N separate “models” for a task, while still sharing the core language modeling parameters throughout. Beyond drastically reducing storage costs, the prompt ensemble makes inference more efficient. To process one example, rather than computing forward passes of N different models, we can execute a single forward pass with a batch size of N, replicating the example across the batch and varying the prompt. These savings mirror those seen for multi-tasking in Figure 2.
To demonstrate the viability of prompt ensembling, we train five prompts for each SuperGLUE task, using a single frozen T5-XXL model with our default hyperparameters. We use simple majority voting to compute predictions from the ensemble. Table 3 shows that across all tasks, the ensemble beats the single-prompt average and beats, or matches, the best individual prompt.
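A sketch of this ensembling pattern in Python/JAX, with a hypothetical stand-in for the frozen model's class scores; the point is the batching trick (one example replicated across a batch of N prompts in a single forward pass) followed by a simple majority vote:

```python
import jax
import jax.numpy as jnp

N, p, n, e, num_classes = 5, 20, 12, 512, 2

key = jax.random.PRNGKey(0)
prompts = jax.random.normal(key, (N, p, e))     # N trained prompts, one "model" each
x_e = jax.random.normal(key, (n, e))            # one embedded input example

def frozen_model_logits(full_input):
    """Hypothetical stand-in for the frozen encoder-decoder's class scores."""
    w = jax.random.normal(jax.random.PRNGKey(1), (e, num_classes))
    return full_input.mean(axis=0) @ w

def predict_with_prompt(prompt, x_e):
    full_input = jnp.concatenate([prompt, x_e])  # [P_e; X_e], shape (p + n, e)
    return jnp.argmax(frozen_model_logits(full_input))

# Single forward pass with batch size N: the example is replicated across the
# batch and only the prompt varies, mirroring Figure 2's mixed-task batching.
preds = jax.vmap(predict_with_prompt, in_axes=(0, None))(prompts, x_e)

# Simple majority vote over the N single-prompt predictions.
ensemble_pred = jnp.argmax(jnp.bincount(preds, length=num_classes))
print(preds, ensemble_pred)
```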
7 Interpretability


  • An ideally interpretable prompt would consist of natural language that clearly describes the task at hand, explicitly asks the model for some result or action, and makes it easy to understand why the prompt elicited such behavior from the model.
  • As prompt tuning works in the continuous embedding space rather than the discrete token space, interpreting prompts becomes more difficult. To test the interpretability of our learned soft prompts, we compute the nearest neighbors to each prompt token from the frozen model’s vocabulary. We use cosine distance between the vocabulary embedding vector and the prompt token representation as the similarity metric (a sketch of this nearest-neighbor computation follows this list).
  • We observe that for a given learned prompt token, the top-5 nearest neighbors form tight semantic clusters. For example, we see lexically similar clusters such as { Technology / technology / Technologies / technological / technologies }, as well as more diverse but still strongly related clusters such as { entirely / completely / totally / altogether / 100% }. The nature of these clusters suggests that the prompts are in fact learning “word-like” representations. We found that random vectors drawn from the embedding space do not show this sort of semantic clustering.
  • When initializing the prompts using the “class label” strategy, we often find that the class labels persist through training. Specifically, if a prompt token is initialized to a given label, that label is often among the learned token’s nearest neighbors after tuning. When initializing with the “Random Uniform” or “Sampled Vocab” methods, the class labels can also be found in the nearest neighbors of the prompts; however they tend to appear as neighbors to multiple prompt tokens. This suggests that the model is learning to store the expected output classes in the prompts as reference, and initializing the prompt to output classes makes this easier and more centralized.
  • When examining longer prompts (e.g. size 100), we often find several prompt tokens with the same nearest neighbors. This suggests there is either excess capacity in the prompt, or that the lack of sequential structure in the prompt representation makes it difficult for the model to localize information to a specific position.
  • While the learned prompts taken as sequences show little interpretability, we do observe a high frequency of words like science, technology and engineering as the nearest neighbors for prompts trained on the BoolQ dataset, where approximately 20% of the questions are in the “Nature/Science” category. While more investigation is needed, this suggests that one role of the prompt may be to prime the model to interpret inputs in a specific domain or context (e.g. “scientific”).
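A sketch of the nearest-neighbor probe used above (Python/JAX), assuming access to the learned prompt matrix and the frozen vocabulary embedding table; mapping the returned ids back to token strings would go through the tokenizer, which is omitted here:

```python
import jax
import jax.numpy as jnp

def nearest_vocab_neighbors(prompt, embedding, k=5):
    """For each prompt token, return indices of the k nearest vocabulary
    embeddings under cosine similarity (i.e. smallest cosine distance)."""
    prompt_norm = prompt / jnp.linalg.norm(prompt, axis=1, keepdims=True)
    vocab_norm = embedding / jnp.linalg.norm(embedding, axis=1, keepdims=True)
    cosine_sim = prompt_norm @ vocab_norm.T          # (p, vocab)
    # Top-k most similar vocabulary ids per prompt token.
    return jax.lax.top_k(cosine_sim, k)[1]

# Toy usage with random stand-ins for the real tables.
key = jax.random.PRNGKey(0)
embedding = jax.random.normal(key, (32000, 512))                 # frozen vocab embeddings
prompt = jax.random.normal(jax.random.PRNGKey(1), (100, 512))    # learned soft prompt
neighbors = nearest_vocab_neighbors(prompt, embedding)
print(neighbors.shape)   # (100, 5): ids to map back through the tokenizer
```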
8 Conclusion

In this paper, we showed that prompt tuning is a competitive technique for adapting frozen pre-trained language models to downstream tasks. On the popular SuperGLUE benchmark, its task performance rivals that of traditional model tuning, with the gap vanishing as model size increases. On zero-shot domain transfer, we found that prompt tuning leads to improved generalization. This plausibly indicates that freezing general-purpose language understanding parameters and restricting downstream learning to a lightweight parameter footprint can help to avoid overfitting to a specific domain.
Beyond task quality metrics, we discussed the appeal of moving to frozen pre-trained models in terms of storage and serving costs. This move enables both efficient multi-task serving, as well as efficient high-performing prompt ensembling. Looking forward, we believe that factoring out task-defining parameters as distinct from general language-modeling parameters is an exciting step that opens up many avenues for new research.
