1.2 Text Tasks
1.2.1 Cloze Test
Simply put, one or more words are removed from the original text and the model is asked to fill them in.
Original text: All the world's a stage, and all the men and women merely players.
Input: All the world's a stage, and all the __ and women merely players.
Output: the predicted word
Label: men
1.2.2 Masked Language Model (MLM) (this is the key part!)
The MLM objective randomly selects words to mask (roughly 15% of them); this is mainly how the model learns semantics and syntax.
Note: because the selection is random, we usually specify a parameter max_pred that caps the number of masked words per sentence.
Original text: All the world's a stage, and all the men and women merely players.
Input: All the world's a stage, and all the [MASK] and [MASK] merely players.
Output: the predicted words
Labels: men, women
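As an illustration of this step, here is a minimal Python sketch that randomly selects positions to mask and records the original words as labels; the make_mlm_example helper, the mask_ratio default, and the toy sentence are assumptions for illustration, not BERT's actual data pipeline.

```python
import random

# Minimal sketch: randomly pick roughly 15% of the tokens in a sentence to mask,
# capped by max_pred, and keep the original words as labels.
def make_mlm_example(tokens, mask_ratio=0.15, max_pred=5, mask_token="[MASK]"):
    n_to_mask = min(max_pred, max(1, int(round(len(tokens) * mask_ratio))))
    positions = random.sample(range(len(tokens)), n_to_mask)

    masked = list(tokens)
    labels = {}                      # position -> original token
    for pos in positions:
        labels[pos] = masked[pos]
        masked[pos] = mask_token
    return masked, labels

tokens = "All the world 's a stage and all the men and women merely players".split()
masked, labels = make_mlm_example(tokens)
print(masked)    # e.g. ['All', 'the', ..., '[MASK]', ...]
print(labels)    # e.g. {9: 'men', 11: 'women'}; which positions are chosen is random
```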
To better match downstream tasks, the BERT authors adjusted the MLM rules slightly. For a selected word such as men, the replacement works as follows:
- [MASK]: 80% of the time
- apple (a random word): 10% of the time
- men (kept unchanged): 10% of the time
In all three cases, the model still predicts the original word at the selected position.
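This 80/10/10 rule can be sketched in Python as follows; apply_mask_rule and the toy vocab list are hypothetical names for illustration, not BERT's actual implementation.

```python
import random

# Minimal sketch of the 80/10/10 replacement rule; `vocab` is a toy word list
# used to draw random replacement tokens, not BERT's real WordPiece vocabulary.
def apply_mask_rule(token, vocab, mask_token="[MASK]"):
    p = random.random()
    if p < 0.8:                    # 80%: replace with [MASK]
        return mask_token
    elif p < 0.9:                  # 10%: replace with a random word
        return random.choice(vocab)
    else:                          # 10%: keep the original word
        return token

vocab = ["apple", "stage", "player", "world"]
original = "men"
replaced = apply_mask_rule(original, vocab)
# Whatever the replacement is, the training label stays the original word.
print(replaced, "-> label:", original)
```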
Below is how the original paper describes this procedure:
In order to train a deep bidirectional representation, we simply mask some percentage of the input tokens at random, and then predict those masked tokens. We refer to this procedure as a “masked LM” (MLM), although it is often referred to as a Cloze task in the literature (Taylor, 1953). In this case, the final hidden vectors corresponding to the mask tokens are fed into an output softmax over the vocabulary, as in a standard LM. In all of our experiments, we mask 15% of all WordPiece tokens in each sequence at random. In contrast to denoising auto-encoders (Vincent et al., 2008), we only predict the masked words rather than reconstructing the entire input.
Although this allows us to obtain a bidirectional pre-trained model, a downside is that we are creating a mismatch between pre-training and fine-tuning, since the [MASK] token does not appear during fine-tuning. To mitigate this, we do not always replace “masked” words with the actual [MASK] token. The training data generator chooses 15% of the token positions at random for prediction. If the i-th token is chosen, we replace the i-th token with (1) the [MASK] token 80% of the time (2) a random token 10% of the time (3) the unchanged i-th token 10% of the time. Then, T_i will be used to predict the original token with cross entropy loss. We compare variations of this procedure in Appendix C.2.
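As a rough illustration of the last point (cross entropy is computed only for the selected tokens, not the whole input), here is a small sketch assuming PyTorch; the tensor shapes, the toy vocabulary size, and the use of -100 as the ignore label are illustrative assumptions rather than details from the paper.

```python
import torch
import torch.nn.functional as F

# Rough sketch: compute cross entropy only at the selected positions by setting
# labels to -100 everywhere else, which F.cross_entropy ignores.
batch, seq_len, vocab_size = 2, 8, 100            # toy sizes, not BERT's real dimensions
logits = torch.randn(batch, seq_len, vocab_size)  # stand-in for the per-token vocabulary logits (T_i)

labels = torch.full((batch, seq_len), -100, dtype=torch.long)  # -100 = not selected, ignored
labels[0, 3] = 17                                 # pretend token id 17 was masked at position 3
labels[1, 5] = 42                                 # pretend token id 42 was masked at position 5

loss = F.cross_entropy(logits.view(-1, vocab_size), labels.view(-1), ignore_index=-100)
print(loss.item())
```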