Building Large Language Models (LLMs) for Production: LLM Architecture and Landscape

Posted by 钜形不锈钢水箱 on 2024-12-10 12:10:05

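Understanding Transformers

The Transformer architecture has proven versatile across many applications. The original network was proposed as an encoder-decoder architecture for translation. The next step in its evolution was the encoder-only model, such as BERT, followed by the decoder-only network introduced in the first iteration of the GPT models.
These variants differ not only in network design but also in learning objective. The learning objective plays a crucial role in shaping a model's behavior and results, so understanding the differences is essential for choosing the right architecture for a task and getting the best performance from it.
In this chapter, we take a closer look at the Transformer, covering each of its components and the network's inner workings. We also examine the seminal paper "Attention Is All You Need".
We will also load pretrained models to highlight the differences between the original Transformer and the GPT architecture, and review recent innovations in the field such as large multimodal models (LMMs).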
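Attention Is All You Need

The title is one of the most memorable in natural language processing (NLP). The paper "Attention Is All You Need" marks an important milestone in the development of neural network architectures for NLP. This collaboration between Google Brain and the University of Toronto introduced the Transformer, an encoder-decoder network that uses attention mechanisms for machine translation. The Transformer reached a new state-of-the-art BLEU score of 41.8 on the WMT 2014 English-to-French translation task. Notably, this was achieved after training for only 3.5 days on 8 GPUs, a substantial reduction in training cost compared with earlier models.
The Transformer reshaped the field and has proven effective on tasks beyond translation, including classification, summarization, and language generation. One of its key innovations is a highly parallelizable network structure, which makes training both more efficient and more effective.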
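Architecture

Let's now look more closely at the basic components of the Transformer. As the architecture diagram shows, the original design targets sequence-to-sequence tasks (taking a sequence as input and generating an output based on it), such as translation. In this setup, the encoder creates a representation of the input phrase, and the decoder uses that representation as a reference to generate the output.
On closer inspection, the Transformer architecture falls into three distinct categories, distinguished by their versatility and by their specialization for different tasks.

[*]The encoder-only category focuses on extracting context-aware representations from the input. A representative model is BERT, which is very effective for classification tasks.
[*]The encoder-decoder category suits sequence-to-sequence tasks such as translation, summarization, and training multimodal models (for example, caption generators). An example model in this category is BART.
[*]The decoder-only category is designed to generate output by following the instructions it is given, as seen in LLMs. The representative models are the GPT family.
Next, we compare these design choices and their effect on different tasks. As the diagram shows, however, several building blocks, such as the embedding layer and the attention mechanism, are shared by the encoder and decoder components. Understanding these elements will improve your grasp of how the models work internally. This section outlines the key components and then shows how to load an open-source model to trace each step.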
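Input Embeddings

The first step in the Transformer architecture is to convert the input tokens (words or subwords) into embeddings. These embeddings are high-dimensional vectors that capture the semantic features of the input tokens. You can think of each one as a long list of features that represents the embedded word. The list contains thousands of numbers that the model learns on its own to represent our world. Instead of working with sentences, words, and synonyms to compare things and understand language, the model works with these lists of numbers and compares them numerically, using basic operations such as vector addition and subtraction to see whether they are similar. This looks far more complex than understanding the words themselves, which is why the embedding vectors are so large: when you cannot understand meanings and words directly, you need thousands of values to represent them. The size varies with the model architecture. For example, the largest GPT-3 model from OpenAI uses 12,288-dimensional embedding vectors, while smaller models such as BERT (base) use 768-dimensional embeddings. This layer lets the model understand and process the input effectively and serves as the foundation for all subsequent layers.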
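The numeric comparison described above can be sketched with a toy example. The 4-dimensional vectors below are invented purely for illustration (real models learn hundreds or thousands of dimensions), and the helper name cosine_similarity is ours, not part of any library.

```python
import math

# Toy 4-dimensional embedding table. All values are made up for illustration;
# a real model learns these numbers during training.
embedding_table = {
    "king":  [0.8, 0.7, 0.1, 0.0],
    "queen": [0.8, 0.1, 0.7, 0.0],
    "man":   [0.2, 0.9, 0.1, 0.1],
    "woman": [0.2, 0.1, 0.9, 0.1],
    "apple": [0.0, 0.1, 0.1, 0.9],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Vector arithmetic on embeddings: king - man + woman.
analogy = [
    k - m + w
    for k, m, w in zip(
        embedding_table["king"], embedding_table["man"], embedding_table["woman"]
    )
]

# Find the remaining word whose embedding is most similar to the result.
candidates = [w for w in embedding_table if w not in ("king", "man", "woman")]
closest = max(candidates, key=lambda w: cosine_similarity(analogy, embedding_table[w]))
print(closest)  # queen
```

With these hand-picked values the arithmetic lands closest to "queen"; with learned embeddings the same comparison works on whatever regularities the model has picked up.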
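Positional Encoding

Earlier models such as recurrent neural networks (RNNs) process the input sequentially, one token at a time, which naturally preserves the order of the text. Transformers, by contrast, have no built-in sequential processing. Instead, they use positional encodings to keep track of word order for the later layers. These encodings are vectors filled with unique values at each index; combined with the input embeddings, they give the model information about the relative or absolute position of each token in the sequence. Because the vectors encode each word's position, the model can recognize word order, which is essential for understanding the context and meaning of a sentence.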
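As one concrete instance, here is a minimal sketch of the fixed sinusoidal scheme from "Attention Is All You Need". It is shown only as an illustration: many models, including the OPT model loaded later in this section, learn their positional embeddings instead. The function name and the toy token embeddings are our own.

```python
import math

def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a seq_len x d_model table of positional encodings.

    PE[pos][2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos][2i+1] = cos(pos / 10000^(2i / d_model))
    """
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # Dimensions 2i and 2i+1 share the same frequency.
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

pe = sinusoidal_positional_encoding(seq_len=4, d_model=8)

# Toy token embeddings (made-up values). The position signal is added
# element-wise, so the same token gets a different input vector at each position.
token_embeddings = [[0.1] * 8 for _ in range(4)]
model_input = [
    [e + p for e, p in zip(emb, pos_vec)]
    for emb, pos_vec in zip(token_embeddings, pe)
]
```

Every row of pe is different, which is what lets the later layers tell positions apart even when the same token appears more than once.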
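Self-Attention Mechanism

The self-attention mechanism is the core of the Transformer. It computes a weighted sum of the embeddings of all the words in a phrase. The weights come from learned "attention" scores, and terms that are more relevant to each other receive higher attention weights. The mechanism is implemented with Query, Key, and Value vectors derived from the input. Here is a brief description of each:

[*]Query vector: represents the word or token for which attention weights are being computed. The query vector specifies which parts of the input sequence should be prioritized. When you multiply a word embedding by the query weight matrix, you are asking, "What should I pay attention to?"
[*]Key vector: represents the set of words or tokens in the input sequence that the query is compared against. Key vectors help identify the important or relevant information in the input sequence. When you multiply a word embedding by the key weight matrix, you are asking, "What is important?"
[*]Value vector: stores the information or features associated with each word or token in the input sequence. Value vectors hold the actual data that is weighted and mixed according to the attention weights computed between the query and the keys. The value vector answers the question, "What information do we have?"
Before the Transformer design, attention mechanisms were mainly used to compare two parts of a text. For example, in a summarization task a model could attend to different regions of the input article while generating the summary.
Self-attention lets the model highlight the most important parts of the text. It can be used in encoder-only or decoder-only models to build powerful input representations: encoder-only models turn text into embeddings, while decoder-only models enable text generation.
Multi-head attention improves accuracy considerably. In this setup, several attention components process the same information, and each head learns during training to focus on a distinct aspect of the text, such as verbs, nouns, or numbers.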
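The interplay of the three vectors can be sketched as single-head scaled dot-product attention in plain Python. This is a dependency-free illustration, not how any library implements it: the projection matrices are set to the identity and the embeddings are made up, whereas a real model learns all of these.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def self_attention(embeddings, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a list of embeddings."""
    queries = [matvec(w_q, e) for e in embeddings]
    keys = [matvec(w_k, e) for e in embeddings]
    values = [matvec(w_v, e) for e in embeddings]
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Compare this token's query against every key.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Mix the value vectors according to the attention weights.
        mixed = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Three toy token embeddings and identity projections (illustrative values only).
embeddings = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
outputs, weights = self_attention(embeddings, identity, identity, identity)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row shows how much one token attends to every token in the sequence, and each row sums to 1.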
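The Architecture in Action

The code examples for this section are available in the accompanying Notebooks.
Working with the architecture hands-on shows how the components above operate inside a pretrained large language model. Using the Hugging Face Transformers library, you will load a pretrained tokenizer, convert text into token IDs, pass the input through each network layer, and inspect the output.
First, use AutoModelForCausalLM and AutoTokenizer to load the model and tokenizer. Then tokenize a sample sentence, which serves as the input for the following steps.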
from transformers import AutoModelForCausalLM, AutoTokenizer

OPT = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b", load_in_8bit=True)
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")

inp = "The quick brown fox jumps over the lazy dog"
inp_tokenized = tokenizer(inp, return_tensors="pt")
print(inp_tokenized['input_ids'].size())
print(inp_tokenized)

Output:
torch.Size([1, 10])
{'input_ids': tensor([[    2,   133,  2119,  6219, 23602, 13855,    81,     5, 22414,  2335]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

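We load Facebook's pretrained OPT model (facebook/opt-1.3b) in 8-bit format, a memory-saving strategy that makes efficient use of GPU resources. The tokenizer object loads the vocabulary needed to interact with the model and converts the sample input (the inp variable) into token IDs and an attention mask. The attention mask is a vector that tells the model which tokens to ignore. In this example every index of the attention mask is 1, meaning each token is processed normally; setting an index to 0 instructs the model to ignore that token. Note also how the text input is converted to token IDs using the model's pretrained vocabulary.
Next, we inspect the model's architecture through its .model attribute.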
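To make the effect of a 0 in the attention mask concrete, here is a minimal sketch of one common way a mask is applied: masked positions get a score of negative infinity before the softmax, so they end up with zero weight. The function name and the scores are hypothetical, and the exact mechanics inside the Transformers library may differ.

```python
import math

def masked_softmax(scores, attention_mask):
    """Softmax over scores, ignoring positions where attention_mask is 0.

    Assumes at least one position is kept (mask value 1).
    """
    masked = [
        s if keep == 1 else float("-inf")
        for s, keep in zip(scores, attention_mask)
    ]
    peak = max(masked)
    exps = [math.exp(s - peak) for s in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

# Made-up attention scores for four tokens; the last one is treated as padding.
scores = [2.0, 1.0, 0.5, 1.0]
attention_mask = [1, 1, 1, 0]
weights = masked_softmax(scores, attention_mask)
print([round(w, 3) for w in weights])
```

The masked token receives exactly zero weight, and the remaining weights still sum to 1.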
print(OPT.model)

Output:
OPTModel(
(decoder): OPTDecoder(
(embed_tokens): Embedding(50272, 2048, padding_idx=1)
(embed_positions): OPTLearnedPositionalEmbedding(2050, 2048)
(final_layer_norm): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
(layers): ModuleList(
   (0-23): 24 x OPTDecoderLayer(
   (self_attn): OPTAttention(
      (k_proj): Linear8bitLt(in_features=2048, out_features=2048, bias=True)
      (v_proj): Linear8bitLt(in_features=2048, out_features=2048, bias=True)
      (q_proj): Linear8bitLt(in_features=2048, out_features=2048, bias=True)
      (out_proj): Linear8bitLt(in_features=2048, out_features=2048, bias=True)
   )
   (activation_fn): ReLU()
   (self_attn_layer_norm): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
   (fc1): Linear8bitLt(in_features=2048, out_features=8192, bias=True)
   (fc2): Linear8bitLt(in_features=8192, out_features=2048, bias=True)
   (final_layer_norm): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
   )
)
)
)

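Decoder-only models are a common choice for Transformer-based language models, so we must go through the decoder attribute to reach the model's internals. The layers entry also shows that the decoder consists of 24 stacked layers with identical design. Let's start with the embedding layer.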
embedded_input = OPT.model.decoder.embed_tokens(inp_tokenized['input_ids'])
print("Layer:\t", OPT.model.decoder.embed_tokens)
print("Size:\t", embedded_input.size())
print("Output:\t", embedded_input)

Output:
Layer: Embedding(50272, 2048, padding_idx=1)
Size:    torch.Size([1, 10, 2048])
Output:tensor([[[-0.0407,0.0519,0.0574,..., -0.0263, -0.0355, -0.0260],
      [-0.0371,0.0220, -0.0096,...,0.0265, -0.0166, -0.0030],
      [-0.0455, -0.0236, -0.0121,...,0.0043, -0.0166,0.0193],
      ...,
      [ 0.0007,0.0267,0.0257,...,0.0622,0.0421,0.0279],
      [-0.0126,0.0347, -0.0352,..., -0.0393, -0.0396, -0.0102],
      [-0.0115,0.0319,0.0274,..., -0.0472, -0.0059,0.0341]]],
   device='cuda:0', dtype=torch.float16, grad_fn=<EmbeddingBackward0>)

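The embedding layer is accessed through the decoder object's .embed_tokens attribute, and we pass the tokenized input to it. As shown, the embedding layer converts a list of IDs of size [1, 10] into a tensor of size [1, 10, 2048]. This representation is then passed on through the decoder layers.
As mentioned earlier, the positional encoding component uses the attention mask to build a vector that conveys the position signal to the model. The positional embeddings are produced by the decoder's .embed_positions layer. As shown, this layer generates a distinct vector for each position, which is then added to the output of the embedding layer.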
embed_pos_input = OPT.model.decoder.embed_positions(
inp_tokenized['attention_mask']
)
print("Layer:\t", OPT.model.decoder.embed_positions)
print("Size:\t", embed_pos_input.size())
print("Output:\t", embed_pos_input)

Output:
Layer: OPTLearnedPositionalEmbedding(2050, 2048)
Size:    torch.Size([1, 10, 2048])
Output:tensor([[[-8.1406e-03, -2.6221e-01,6.0768e-03,...,1.7273e-02,
-5.0621e-03, -1.6220e-02],
      [-8.0585e-05,2.5000e-01, -1.6632e-02,..., -1.5419e-02,
-1.7838e-02,2.4948e-02],
      [-9.9411e-03, -1.4978e-01,1.7557e-03,...,3.7117e-03,
-1.6434e-02, -9.9087e-04],
      ...,
      [ 3.6979e-04, -7.7454e-02,1.2955e-02,...,3.9330e-03,
-1.1642e-02,7.8506e-03],
      [-2.6779e-03, -2.2446e-02, -1.6754e-02,..., -1.3142e-03,
-7.8583e-03,2.0096e-02],
      [-8.6288e-03,1.4233e-01, -1.9012e-02,..., -1.8463e-02,
-9.8572e-03,8.7662e-03]]], device='cuda:0', dtype=torch.float16, grad_fn=<EmbeddingBackward0>)
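The printed layer has 2050 rows rather than 2048. To the best of our reading of the Transformers source for OPT, the layer derives position indices from the attention mask with a cumulative sum and then shifts them by an offset of 2. The sketch below reproduces that index calculation in plain Python; the function name is ours, and the offset and padding behavior should be treated as assumptions that may vary between library versions.

```python
from itertools import accumulate

# Assumed offset: the embedding table has 2050 = 2048 + 2 rows.
OFFSET = 2

def position_ids_from_mask(attention_mask, offset=OFFSET):
    """Map an attention mask to row indices of the positional embedding table."""
    running = list(accumulate(attention_mask))  # cumulative sum of the mask
    # Real tokens count up from 0; masked tokens collapse to -1 before the offset.
    return [(count * keep) - 1 + offset for count, keep in zip(running, attention_mask)]

# The ten-token example above, where every mask entry is 1.
print(position_ids_from_mask([1] * 10))       # [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

# A left-padded sequence: the padded slots share one index,
# and real tokens still start at the first real position.
print(position_ids_from_mask([0, 0, 1, 1, 1]))  # [1, 1, 2, 3, 4]
```

This is why the layer takes the attention mask rather than the token IDs: position depends on where the real tokens are, not on what they are.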

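Finally, the self-attention component. We access the first layer by index and use its .self_attn attribute. The architecture diagram also shows that the input to self-attention is the sum of the embedding vectors and the positional encoding vectors.
embed_position_input = embedded_input + embed_pos_input
hidden_states, _, _ = OPT.model.decoder.layers[0].self_attn(embed_position_input)
print("Layer:\t", OPT.model.decoder.layers[0].self_attn)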
print("Size:\t", hidden_states.size())
print("Output:\t", hidden_states)

Output:
Layer: OPTAttention(
(k_proj): Linear8bitLt(in_features=2048, out_features=2048, bias=True)
(v_proj): Linear8bitLt(in_features=2048, out_features=2048, bias=True)
(q_proj): Linear8bitLt(in_features=2048, out_features=2048, bias=True)
(out_proj): Linear8bitLt(in_features=2048, out_features=2048, bias=True)
)
Size:    torch.Size([1, 10, 2048])
Output:tensor([[[-0.0119, -0.0110,0.0056,...,0.0094,0.0013,0.0093],
      [-0.0119, -0.0110,0.0056,...,0.0095,0.0013,0.0093],
      [-0.0119, -0.0110,0.0056,...,0.0095,0.0013,0.0093],
      ...,
      [-0.0119, -0.0110,0.0056,...,0.0095,0.0013,0.0093],
      [-0.0119, -0.0110,0.0056,...,0.0095,0.0013,0.0093],
      [-0.0119, -0.0110,0.0056,...,0.0095,0.0013,0.0093]]],
   device='cuda:0', dtype=torch.float16, grad_fn=<MatMul8bitLtBackward>)

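The self-attention component consists of the query, key, and value layers described earlier plus a final output projection. It takes the sum of the embedded input and the positional encoding vectors as input. In practice, the model also passes the attention mask to this component so it can determine which parts of the input to ignore. (This is omitted from the sample code for clarity.)
The rest of the architecture uses nonlinearities (for example, ReLU), feedforward layers, and layer normalization.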
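The remaining pieces of a decoder layer (the LayerNorm, fc1, ReLU, and fc2 entries in the printed architecture) can be sketched in plain Python. This is a simplified illustration under stated assumptions: normalization is applied before the feedforward layers, the learned scale and shift of LayerNorm are omitted, and all weights are made-up toy values.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one token's feature vector to zero mean and unit variance.

    The learned scale and shift (elementwise_affine) are omitted for brevity.
    """
    mean = sum(x) / len(x)
    variance = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(variance + eps) for v in x]

def linear(x, weight, bias):
    """y = W x + b, with weight stored as (out_features x in_features)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def feed_forward_block(x, w1, b1, w2, b2):
    """Residual block (assumed pre-norm): x + fc2(relu(fc1(layer_norm(x))))."""
    hidden = relu(linear(layer_norm(x), w1, b1))
    projected = linear(hidden, w2, b2)
    return [a + b for a, b in zip(x, projected)]

# Toy sizes: d_model=2 expanded to 8, the same 4x ratio as 2048 -> 8192 above.
x = [1.0, 3.0]
w1 = [[0.1 * (i + 1), -0.1 * (i + 1)] for i in range(8)]
b1 = [0.0] * 8
w2 = [[0.05] * 8, [-0.05] * 8]
b2 = [0.0, 0.0]
out = feed_forward_block(x, w1, b1, w2, b2)
print(out)
```

The residual connection means the block only has to learn a correction to its input: with an all-zero second layer, the output equals the input.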
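Design Choices for Transformer Models

The notebook for this section is available in the accompanying Notebooks.
The Transformer architecture has proven adaptable to many applications. The original model was proposed as an encoder-decoder for translation. The design then evolved with encoder-only models such as BERT, and the first GPT model introduced the decoder-only network.
These variants differ not only in network architecture but also in learning objective, which significantly affects a model's behavior and results. Understanding the differences is essential for choosing the best design for a given task and getting the best performance across applications.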
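Encoder-Decoder Architecture

The full Transformer architecture, usually called the encoder-decoder model, consists of a stack of encoder layers connected to a stack of decoder layers through a cross-attention mechanism. It is exactly the architecture we saw in the previous sections.
These models are particularly well suited to tasks that convert one sequence into another, such as translation or summarization, where both input and output are text. They are also useful in multimodal applications such as image captioning, where the input is an image and the desired output is its caption. In these scenarios cross-attention plays a key role, helping the decoder focus on the most relevant parts of the content during generation.
A typical example is the pretrained BART model, which has a bidirectional encoder that builds a detailed representation of the input, while an autoregressive decoder generates the output one token at a time. The model is given inputs in which some parts are randomly masked, together with the input shifted by one token, and tries to reconstruct the original input; this reconstruction task is its learning objective. The code below loads the BART model to inspect its architecture.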
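Cross-attention differs from self-attention only in where its inputs come from: queries come from the decoder, while keys and values come from the encoder output. The sketch below is a simplified illustration with made-up vectors; it skips the learned query, key, and value projections that a real model applies, and the function names are ours.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(decoder_states, encoder_states):
    """Each decoder position attends over all encoder positions.

    Learned projections are omitted: decoder states act as queries,
    encoder states act as both keys and values.
    """
    d_k = len(encoder_states[0])
    contexts, all_weights = [], []
    for query in decoder_states:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
            for key in encoder_states
        ]
        weights = softmax(scores)
        # Weighted mix of encoder states: what the decoder "reads" from the input.
        context = [
            sum(w * state[j] for w, state in zip(weights, encoder_states))
            for j in range(d_k)
        ]
        contexts.append(context)
        all_weights.append(weights)
    return contexts, all_weights

# Toy values: three encoder states (source tokens or image regions)
# and two decoder states (the output tokens generated so far).
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_states = [[2.0, 0.0], [0.0, 2.0]]
contexts, weights = cross_attention(decoder_states, encoder_states)
```

The weight matrix has one row per decoder position and one column per encoder position, so the input and output sequences are free to have different lengths.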
from transformers import AutoModel, AutoTokenizer

BART = AutoModel.from_pretrained("facebook/bart-large")
print(BART)

Output:
BartModel(
(shared): Embedding(50265, 1024, padding_idx=1)
(encoder): BartEncoder(
(embed_tokens): Embedding(50265, 1024, padding_idx=1)
(embed_positions): BartLearnedPositionalEmbedding(1026, 1024)
(layers): ModuleList(
   (0-11): 12 x BartEncoderLayer(
   (self_attn): BartAttention(
      (k_proj): Linear(in_features=1024, out_features=1024, bias=True)
      (v_proj): Linear(in_features=1024, out_features=1024, bias=True)
      (q_proj): Linear(in_features=1024, out_features=1024, bias=True)
      (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
   )
   (self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
   (activation_fn): GELUActivation()
   (fc1): Linear(in_features=1024, out_features=4096, bias=True)
   (fc2): Linear(in_features=4096, out_features=1024, bias=True)
   (final_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
   )
)
(layernorm_embedding): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
)
(decoder): BartDecoder(
(embed_tokens): Embedding(50265, 1024, padding_idx=1)
(embed_positions): BartLearnedPositionalEmbedding(1026, 1024)
(layers): ModuleList(
   (0-11): 12 x BartDecoderLayer(
   (self_attn): BartAttention(
      (k_proj): Linear(in_features=1024, out_features=1024, bias=True)
      (v_proj): Linear(in_features=1024, out_features=1024, bias=True)
      (q_proj): Linear(in_features=1024, out_features=1024, bias=True)
      (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
   )
   (activation_fn): GELUActivation()
   (self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
   (encoder_attn): BartAttention(
      (k_proj): Linear(in_features=1024, out_features=1024, bias=True)
      (v_proj): Linear(in_features=1024, out_features=1024, bias=True)
      (q_proj): Linear(in_features=1024, out_features=1024, bias=True)
      (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
   )
   (encoder_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
   (fc1): Linear(in_features=1024, out_features=4096, bias=True)
   (fc2): Linear(in_features=4096, out_features=1024, bias=True)
   (final_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
   )
)
(layernorm_embedding): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
)
)

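Most of the layers in the BART model are already familiar. The model consists of an encoder and a decoder, each with 12 layers. The decoder additionally contains an encoder_attn layer, known as cross-attention, which conditions the decoder's output on the encoder's representation. We can use the transformers pipeline function with a fine-tuned model for summarization.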
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
sum = summarizer("""Gaga was best known in the 2010s for pop hits like “Poker Face” and avant-garde experimentation on albums like “Artpop,” and Bennett, a singer who mostly stuck to standards, was in his 80s when the pair met. And yet Bennett and Gaga became fast friends and close collaborators, which they remained until Bennett’s death at 96 on Friday. They recorded two albums together, 2014’s “Cheek to Cheek” and 2021’s “Love for Sale,” which both won Grammys for best traditional pop vocal album.""", min_length=20, max_length=50)

print(sum[0]['summary_text'])

Output:
Bennett and Gaga became fast friends and close collaborators.
They recorded two albums together, 2014's "Cheek to Cheek" and 2021's
"Love for Sale"

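Encoder-Only Architecture

Encoder-only models are built by stacking multiple encoder components. Because the encoder output is not coupled to a decoder, it can only be used as a text-to-vector method, for example to measure similarity. It can also be combined with a classification head (a feedforward layer) on top for label prediction (called the pooler layer in libraries such as Hugging Face).
The fundamental difference in the encoder-only architecture is the absence of the masked self-attention layer, so the encoder can process the entire input at once. (In a decoder, by contrast, future tokens must be masked during training to prevent "cheating" when generating new tokens.) This property makes encoder-only models well suited to producing vector representations of documents that retain all of the information.
The BERT paper (along with higher-quality variants such as RoBERTa) introduced a well-known pretrained model that substantially raised state-of-the-art scores on a range of NLP tasks. The model is pretrained with two learning objectives:

[*]Masked language modeling: mask random tokens in the input and try to predict them.
[*]Next sentence prediction: present a pair of sentences and determine whether the second logically follows the first in the text.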
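The masked language modeling objective can be sketched as a data-preparation step. This is a simplified illustration: it always substitutes a [MASK] token with a fixed 15% probability, whereas BERT's actual recipe sometimes keeps the original token or swaps in a random one. The function name and the word-level tokens are our own.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Randomly hide tokens and record what the model must predict.

    Returns (masked_tokens, labels), where labels holds the original token
    at masked positions and None everywhere else.
    """
    rng = random.Random(seed)  # seeded so the example is repeatable
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            labels.append(token)   # the training target for this position
        else:
            masked.append(token)
            labels.append(None)    # no loss is computed here
    return masked, labels

tokens = "the quick brown fox jumps over the lazy dog".split()
masked, labels = mask_tokens(tokens, mask_prob=0.15, seed=0)
print(masked)
print(labels)
```

Because the encoder sees the whole sentence at once, it can use context on both sides of each [MASK] to recover the hidden token.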
from transformers import AutoModel

BERT = AutoModel.from_pretrained("bert-base-uncased")
print(BERT)

Output:
BertModel(
(embeddings): BertEmbeddings(
(word_embeddings): Embedding(30522, 768, padding_idx=0)
(position_embeddings): Embedding(512, 768)
(token_type_embeddings): Embedding(2, 768)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(encoder): BertEncoder(
(layer): ModuleList(
   (0-11): 12 x BertLayer(
   (attention): BertAttention(
      (self): BertSelfAttention(
         (query): Linear(in_features=768, out_features=768, bias=True)
         (key): Linear(in_features=768, out_features=768, bias=True)
         (value): Linear(in_features=768, out_features=768, bias=True)
         (dropout): Dropout(p=0.1, inplace=False)
      )
      (output): BertSelfOutput(
         (dense): Linear(in_features=768, out_features=768, bias=True)
         (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
         (dropout): Dropout(p=0.1, inplace=False)
      )
   )
   (intermediate): BertIntermediate(
      (dense): Linear(in_features=768, out_features=3072, bias=True)
      (intermediate_act_fn): GELUActivation()
   )
   (output): BertOutput(
      (dense): Linear(in_features=3072, out_features=768, bias=True)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
   )
   )
)
)
(pooler): BertPooler(
(dense): Linear(in_features=768, out_features=768, bias=True)
(activation): Tanh()
)
)

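The BERT model uses the conventional Transformer architecture with 12 stacked encoder blocks. The network's output is then passed to a pooler layer, a feedforward linear layer followed by a nonlinear activation function (Tanh in the printout above), which builds the final representation. That representation is used for other tasks such as classification and similarity assessment. The code below uses a fine-tuned BERT model for sentiment analysis: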
from transformers import pipeline

classifier = pipeline("text-classification", model="nlptown/bert-base-multilingual-uncased-sentiment")
lbl = classifier("""This restaurant is awesome.""")

print(lbl)

Output:
[{'label': '5 stars', 'score': 0.8550480604171753}]

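Decoder-Only Architecture

Today's large language models mostly use decoder-only networks as their foundation, occasionally with minor modifications. Because they incorporate masked self-attention, these models focus on predicting the next token, which gave rise to the concept of prompting.
Research shows that scaling up decoder-only models significantly improves their language understanding and generalization. As a result, a single model can perform well on a variety of tasks simply by being given different prompts. Large pretrained models such as GPT-4 and LLaMA 2 can perform classification, summarization, translation, and other tasks when given the relevant instructions.
Large language models such as the GPT family are pretrained with the causal language modeling objective. The model tries to predict the next word, and the attention mechanism can attend only to the preceding tokens on the left. In other words, the model predicts the next token from the previous context alone and cannot peek at future tokens, which prevents cheating.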
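Both halves of this objective are easy to sketch: the causal mask that hides future positions, and the (context, target) pairs the model is trained on. The function names below are ours, and word-level tokens are used only to keep the example readable.

```python
def causal_mask(seq_len):
    """Row i holds 1 for positions 0..i (visible) and 0 for later positions (hidden)."""
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]

def next_token_pairs(tokens):
    """Build (context, target) pairs: predict each token from the ones before it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

for row in causal_mask(4):
    print(row)
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]

for context, target in next_token_pairs(["The", "quick", "brown", "fox"]):
    print(context, "->", target)
# ['The'] -> quick
# ['The', 'quick'] -> brown
# ['The', 'quick', 'brown'] -> fox
```

The lower-triangular mask lets a model train on every position of a sequence in parallel while each position still sees only its own past.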
from transformers import AutoModel

gpt2 = AutoModel.from_pretrained("gpt2")
print(gpt2)

Output:
GPT2Model(
(wte): Embedding(50257, 768)
(wpe): Embedding(1024, 768)
(drop): Dropout(p=0.1, inplace=False)
(h): ModuleList(
(0-11): 12 x GPT2Block(
   (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
   (attn): GPT2Attention(
   (c_attn): Conv1D()
   (c_proj): Conv1D()
   (attn_dropout): Dropout(p=0.1, inplace=False)
   (resid_dropout): Dropout(p=0.1, inplace=False)
   )
   (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
   (mlp): GPT2MLP(
   (c_fc): Conv1D()
   (c_proj): Conv1D()
   (act): NewGELUActivation()
   (dropout): Dropout(p=0.1, inplace=False)
   )
)
)
(ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)

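Looking at the architecture, you will find the standard Transformer decoder block without the cross-attention layer. The GPT family also uses a distinct linear layer (Conv1D) that stores its weights transposed. (Note that this is not the same as PyTorch's convolutional layer!) This design choice is specific to OpenAI; other large open-source language models use the conventional linear layer. The code below shows how to use the GPT-2 model for text generation. It generates four possible completions of the sentence "This movie was a very".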
from transformers import pipeline

generator = pipeline(model="gpt2")
output = generator("This movie was a very", do_sample=True,
top_p=0.95, num_return_sequences=4, max_new_tokens=50, return_full_text=False)

for item in output:
    print(">", item['generated_text'])

Sample output:
>hard thing to make, but this movie is still one of the most amazing shows I've seen in years. You know, it's sort of fun for a couple of decades to watch, and all that stuff, but one thing's for sure —
>special thing and that's what really really made this movie special," said Kiefer Sutherland, who co-wrote and directed the film's cinematography. "A lot of times things in our lives get passed on from one generation to another, whether
>good, good effort and I have no doubt that if it has been released, I will be very pleased with it."
>enjoyable one for the many reasons that I would like to talk about here. First off, I'm not just talking about the original cast, I'm talking about the cast members that we've seen before and it would be fair to say that none of
