transformers 阅读：BERT 模子

何小豆儿在此 · 2024-6-15 00:42:42

媒介

想深入理解 BERT 模子，在阅读 transformers 库同时纪录一下。
笔者小白，错误的地方请不吝指出。
Embedding

为了使 BERT 能处理大量卑鄙任务，它的输入可以明确表示单一句子或句子对，例如<标题，答案>。
To make BERT handle a variety of down-stream tasks, our input representation is able to unambiguously represent both a single sentence and a pair of sentences (e.g., h Question, Answeri) in one token sequence
因此 BERT 的 Embedding 分为三个部门：

Token Embeddings：对于分词效果进行嵌入。
Segement Embeddings：用于表示每个词地点句子，例如区分某个词是属于标题句子照旧属于答案句子。
Position Embeddings：位置嵌入。

在 transfoerms 中定义如下：

class BertEmbeddings(nn.Module):
"""Construct the embeddings from word, position and token_type embeddings."""
def __init__(self, config):
super().__init__()
self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=config.pad_token_id)
self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size)
self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)
# self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load
# any TensorFlow checkpoint file
self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
self.dropout = nn.Dropout(config.hidden_dropout_prob)
# position_ids (1, len position emb) is contiguous in memory and exported when serialized
self.position_embedding_type = getattr(config, "position_embedding_type", "absolute")
self.register_buffer(
"position_ids", torch.arange(config.max_position_embeddings).expand((1, -1)), persistent=False
)
self.register_buffer(
"token_type_ids", torch.zeros(self.position_ids.size(), dtype=torch.long), persistent=False
)

复制代码

值得注意的是，BERT 中采取可学习的位置嵌入，而不是 Transformer 中的计算编码。
下面是前向计算代码：

def forward(
self,
input_ids: Optional[torch.LongTensor] = None,
token_type_ids: Optional[torch.LongTensor] = None,
position_ids: Optional[torch.LongTensor] = None,
inputs_embeds: Optional[torch.FloatTensor] = None,
past_key_values_length: int = 0,
) -> torch.Tensor:
if input_ids is not None:
input_shape = input_ids.size()
else:
input_shape = inputs_embeds.size()[:-1]
seq_length = input_shape[1]
if position_ids is None:
position_ids = self.position_ids[:, past_key_values_length : seq_length + past_key_values_length]
# Setting the token_type_ids to the registered buffer in constructor where it is all zeros, which usually occurs
# when its auto-generated, registered buffer helps users when tracing the model without passing token_type_ids, solves
# issue #5664
if token_type_ids is None:
if hasattr(self, "token_type_ids"):
buffered_token_type_ids = self.token_type_ids[:, :seq_length]
buffered_token_type_ids_expanded = buffered_token_type_ids.expand(input_shape[0], seq_length)
token_type_ids = buffered_token_type_ids_expanded
else:
token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=self.position_ids.device)
if inputs_embeds is None:
inputs_embeds = self.word_embeddings(input_ids)
token_type_embeddings = self.token_type_embeddings(token_type_ids)
embeddings = inputs_embeds + token_type_embeddings
if self.position_embedding_type == "absolute":
position_embeddings = self.position_embeddings(position_ids)
embeddings += position_embeddings
embeddings = self.LayerNorm(embeddings)
embeddings = self.dropout(embeddings)
return embeddings

复制代码

下面介绍各个参数的含义和作用：

input_ids：当前词在词表的位置构成的列表。
token_type_ids：当前词对应的句子，所属第一句/第二句/Padding。
position_ids：当前词在句子中的位置构成的列表。
inputs_embeds：对 input_ids 进行嵌入的效果。
past_key_values_length：假如没有传入 position_ids 则从过去计算的地方向后自动取 seq_len 长度作为 position_ids

前向计算逻辑如下：

根据 input_ids 计算 input_embeddings，假如提供 input_embeds 则不消计算。
根据 token_type_ids 计算 token_type_embeddings。
根据 position_ids 计算 position_embeddings。
上面三个步骤的效果求和。
对步骤4效果做一次 LayerNorm 和 Dropout 后输出。

BertSelfAttention

自注意力是 BERT 中的核心模块，其初始化代码如下：

class BertSelfAttention(nn.Module):
def __init__(self, config, position_embedding_type=None):
super().__init__()
if config.hidden_size % config.num_attention_heads != 0 and not hasattr(config, "embedding_size"):
raise ValueError(
f"The hidden size ({config.hidden_size}) is not a multiple of the number of attention "
f"heads ({config.num_attention_heads})"
)
self.num_attention_heads = config.num_attention_heads
self.attention_head_size = int(config.hidden_size / config.num_attention_heads)
self.all_head_size = self.num_attention_heads * self.attention_head_size
self.query = nn.Linear(config.hidden_size, self.all_head_size)
self.key = nn.Linear(config.hidden_size, self.all_head_size)
self.value = nn.Linear(config.hidden_size, self.all_head_size)
self.dropout = nn.Dropout(config.attention_probs_dropout_prob)
self.position_embedding_type = position_embedding_type or getattr(
config, "position_embedding_type", "absolute"
)
if self.position_embedding_type == "relative_key" or self.position_embedding_type == "relative_key_query":
self.max_position_embeddings = config.max_position_embeddings
self.distance_embedding = nn.Embedding(2 * config.max_position_embeddings - 1, self.attention_head_size)
self.is_decoder = config.is_decoder

复制代码

与 Transformer 基本一致，都会将 embedding 分为多块计算。虽然判断了 hidden_size 能否整除 num_attention_heads ，但是由于后面的设置，理论上仍然大概导致 hidden_size 与 all_head_size 大小不同。
在 BERT 中对应参数如下：
typehidden_sizenum_attention_headsbase76812large102416 在 base 和 large 中每个 head 的大小为 64。
下面是对张量进行维度变更，为了后面的计算。

def transpose_for_scores(self, x: torch.Tensor) -> torch.Tensor:
new_x_shape = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size)
x = x.view(new_x_shape)
return x.permute(0, 2, 1, 3)

复制代码

输入 x 的维度为 [bsz, seq_len, hidden_size]，new_x_shape 变为 [bsz, seq_len, heads, head_size]，然后交换 1 2 维度，变为 [bsz, heads, seq_len, head_size]。
前向计算代码如下：

def forward(
self,
hidden_states: torch.Tensor,
attention_mask: Optional[torch.FloatTensor] = None,
head_mask: Optional[torch.FloatTensor] = None,
encoder_hidden_states: Optional[torch.FloatTensor] = None,
encoder_attention_mask: Optional[torch.FloatTensor] = None,
past_key_value: Optional[Tuple[Tuple[torch.FloatTensor]]] = None,
output_attentions: Optional[bool] = False,
) -> Tuple[torch.Tensor]:
mixed_query_layer = self.query(hidden_states)
# If this is instantiated as a cross-attention module, the keys
# and values come from an encoder; the attention mask needs to be
# such that the encoder's padding tokens are not attended to.
is_cross_attention = encoder_hidden_states is not None
if is_cross_attention and past_key_value is not None:
# reuse k,v, cross_attentions
key_layer = past_key_value[0]
value_layer = past_key_value[1]
attention_mask = encoder_attention_mask
elif is_cross_attention:
key_layer = self.transpose_for_scores(self.key(encoder_hidden_states))
value_layer = self.transpose_for_scores(self.value(encoder_hidden_states))
attention_mask = encoder_attention_mask
elif past_key_value is not None:
key_layer = self.transpose_for_scores(self.key(hidden_states))
value_layer = self.transpose_for_scores(self.value(hidden_states))
key_layer = torch.cat([past_key_value[0], key_layer], dim=2)
value_layer = torch.cat([past_key_value[1], value_layer], dim=2)
else:
key_layer = self.transpose_for_scores(self.key(hidden_states))
value_layer = self.transpose_for_scores(self.value(hidden_states))
query_layer = self.transpose_for_scores(mixed_query_layer)
use_cache = past_key_value is not None
if self.is_decoder:
# if cross_attention save Tuple(torch.Tensor, torch.Tensor) of all cross attention key/value_states.
# Further calls to cross_attention layer can then reuse all cross-attention
# key/value_states (first "if" case)
# if uni-directional self-attention (decoder) save Tuple(torch.Tensor, torch.Tensor) of
# all previous decoder key/value_states. Further calls to uni-directional self-attention
# can concat previous decoder key/value_states to current projected key/value_states (third "elif" case)
# if encoder bi-directional self-attention `past_key_value` is always `None`
past_key_value = (key_layer, value_layer)
# Take the dot product between "query" and "key" to get the raw attention scores.
attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
if self.position_embedding_type == "relative_key" or self.position_embedding_type == "relative_key_query":
query_length, key_length = query_layer.shape[2], key_layer.shape[2]
if use_cache:
position_ids_l = torch.tensor(key_length - 1, dtype=torch.long, device=hidden_states.device).view(
-1, 1
)
else:
position_ids_l = torch.arange(query_length, dtype=torch.long, device=hidden_states.device).view(-1, 1)
position_ids_r = torch.arange(key_length, dtype=torch.long, device=hidden_states.device).view(1, -1)
distance = position_ids_l - position_ids_r
positional_embedding = self.distance_embedding(distance + self.max_position_embeddings - 1)
positional_embedding = positional_embedding.to(dtype=query_layer.dtype) # fp16 compatibility
if self.position_embedding_type == "relative_key":
relative_position_scores = torch.einsum("bhld,lrd->bhlr", query_layer, positional_embedding)
attention_scores = attention_scores + relative_position_scores
elif self.position_embedding_type == "relative_key_query":
relative_position_scores_query = torch.einsum("bhld,lrd->bhlr", query_layer, positional_embedding)
relative_position_scores_key = torch.einsum("bhrd,lrd->bhlr", key_layer, positional_embedding)
attention_scores = attention_scores + relative_position_scores_query + relative_position_scores_key
attention_scores = attention_scores / math.sqrt(self.attention_head_size)
if attention_mask is not None:
# Apply the attention mask is (precomputed for all layers in BertModel forward() function)
attention_scores = attention_scores + attention_mask
# Normalize the attention scores to probabilities.
attention_probs = nn.functional.softmax(attention_scores, dim=-1)
# This is actually dropping out entire tokens to attend to, which might
# seem a bit unusual, but is taken from the original Transformer paper.
attention_probs = self.dropout(attention_probs)
# Mask heads if we want to
if head_mask is not None:
attention_probs = attention_probs * head_mask
context_layer = torch.matmul(attention_probs, value_layer)
context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)
context_layer = context_layer.view(new_context_layer_shape)
outputs = (context_layer, attention_probs) if output_attentions else (context_layer,)
if self.is_decoder:
outputs = outputs + (past_key_value,)
return outputs

复制代码

首先判断是否要做交织注意力，判断原则就是是否输入 encoder_hidden_states。再联合是否输入 past_key_value 就形成了四种情况。

计算交织注意力，传入过去的键值。则 K、V 均采取过去的键值，Mask 采取 encoder_attention_mask。
计算交织注意力，没有传入过去的键值。则 K、V 通过 hidden_size 线性变更得到，Mask 采取 encoder_attention_mask。
不计算交织注意力，传入过去的键值。则 K、V 由 hidden_size 线性变更之后，与过去的键值拼接而成，拼接维度 dim=2。
不计算交织注意力，没有传入过去的键值。则 K、V 通过 hidden_size 线性变更得到。

无论哪种情况，Q 的都是由 hidden_size 线性变更得到。
然后就是计算 QKTQK^TQKT 得到 raw_attention_score。但是在进行 scale 之前，对位置编码种类进行特别操纵。
absolute
不进行任何操纵。
relative
分为 relative_key 和 relative_key_query。
这两者都会先计算 Q 和 K 的距离，然后对距离进行 embedding。
不同的是 relative_key 会将 Q 与上述 embedding 进行计算后与 raw_attention_score 相加。
relative_key_query 会将 Q 和 V 都与上述 embedding 进行计算，然后两者与 raw_attention_score 累加。
经过上面的操纵后，对 raw_attention_score 进行 scale 操纵。
操纵之后，与 Mask 计算采取加法而不是乘法，这是因为 Mask 的值是很大的负数而不是零，这种方式 Mask 更加“严实”。
然后就是正常的后续计算。
BertSelfOutput

多头注意力后的操纵：

class BertSelfOutput(nn.Module):
def __init__(self, config):
super().__init__()
self.dense = nn.Linear(config.hidden_size, config.hidden_size)
self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
self.dropout = nn.Dropout(config.hidden_dropout_prob)
def forward(self, hidden_states: torch.Tensor, input_tensor: torch.Tensor) -> torch.Tensor:
hidden_states = self.dense(hidden_states)
hidden_states = self.dropout(hidden_states)
hidden_states = self.LayerNorm(hidden_states + input_tensor)
return hidden_states

复制代码

必要注意的是，hidden_size 经过线性变更之后，先经过 dropoutdropoutdropout，然后与 input_tensor 进行残差毗连，之后进行 LayerNormLayerNormLayerNorm。
BertAttention

上面报告了 BERT 中的多头注意力层和注意力层之后的输出，这里就是对这两块进行一次封装。

class BertAttention(nn.Module):
def __init__(self, config, position_embedding_type=None):
super().__init__()
self.self = BertSelfAttention(config, position_embedding_type=position_embedding_type)
self.output = BertSelfOutput(config)
self.pruned_heads = set()
def prune_heads(self, heads):
if len(heads) == 0:
return
heads, index = find_pruneable_heads_and_indices(
heads, self.self.num_attention_heads, self.self.attention_head_size, self.pruned_heads
)
# Prune linear layers
self.self.query = prune_linear_layer(self.self.query, index)
self.self.key = prune_linear_layer(self.self.key, index)
self.self.value = prune_linear_layer(self.self.value, index)
self.output.dense = prune_linear_layer(self.output.dense, index, dim=1)
# Update hyper params and store pruned heads
self.self.num_attention_heads = self.self.num_attention_heads - len(heads)
self.self.all_head_size = self.self.attention_head_size * self.self.num_attention_heads
self.pruned_heads = self.pruned_heads.union(heads)

复制代码

里面定义 prune_heads() 函数用于注意力头的剪枝。
此中 find_pruneable_heads_and_indices 用于找到可以剪枝的注意力头。返回必要剪掉的 heads 和保存的维度下标。

def find_pruneable_heads_and_indices(
heads: List[int], n_heads: int, head_size: int, already_pruned_heads: Set[int]
) -> Tuple[Set[int], torch.LongTensor]:
"""
Finds the heads and their indices taking `already_pruned_heads` into account.
Args:
heads (`List[int]`): List of the indices of heads to prune.
n_heads (`int`): The number of heads in the model.
head_size (`int`): The size of each head.
already_pruned_heads (`Set[int]`): A set of already pruned heads.
Returns:
`Tuple[Set[int], torch.LongTensor]`: A tuple with the indices of heads to prune taking `already_pruned_heads`
into account and the indices of rows/columns to keep in the layer weight.
"""

复制代码

prune_linear_layer 函数用于具体剪枝注意力头。会按照 index 保存维度。

def prune_linear_layer(layer: nn.Linear, index: torch.LongTensor, dim: int = 0) -> nn.Linear:
"""
Prune a linear layer to keep only entries in index.
Used to remove heads.
Args:
layer (`torch.nn.Linear`): The layer to prune.
index (`torch.LongTensor`): The indices to keep in the layer.
dim (`int`, *optional*, defaults to 0): The dimension on which to keep the indices.
Returns:
`torch.nn.Linear`: The pruned layer as a new layer with `requires_grad=True`.
"""

复制代码

前向计算代码如下：

def forward(
self,
hidden_states: torch.Tensor,
attention_mask: Optional[torch.FloatTensor] = None,
head_mask: Optional[torch.FloatTensor] = None,
encoder_hidden_states: Optional[torch.FloatTensor] = None,
encoder_attention_mask: Optional[torch.FloatTensor] = None,
past_key_value: Optional[Tuple[Tuple[torch.FloatTensor]]] = None,
output_attentions: Optional[bool] = False,
) -> Tuple[torch.Tensor]:
self_outputs = self.self(
hidden_states,
attention_mask,
head_mask,
encoder_hidden_states,
encoder_attention_mask,
past_key_value,
output_attentions,
)
attention_output = self.output(self_outputs[0], hidden_states)
outputs = (attention_output,) + self_outputs[1:] # add attentions if we output them
return outputs

复制代码

outputs 大概包含(attention, all_attention, past_value_key)
BertIntermediate

Attention 之后加入全毗连层和激活函数，比较简朴。

class BertIntermediate(nn.Module):
def __init__(self, config):
super().__init__()
self.dense = nn.Linear(config.hidden_size, config.intermediate_size)
if isinstance(config.hidden_act, str):
self.intermediate_act_fn = ACT2FN[config.hidden_act]
else:
self.intermediate_act_fn = config.hidden_act
def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
hidden_states = self.dense(hidden_states)
hidden_states = self.intermediate_act_fn(hidden_states)
return hidden_states

复制代码

BertOutput

经过中间层后，又是一个全毗连层 + dropout + 残差 + 层归一化。和 BertSelfOutput 架构相同。

class BertOutput(nn.Module):
def __init__(self, config):
super().__init__()
self.dense = nn.Linear(config.intermediate_size, config.hidden_size)
self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
self.dropout = nn.Dropout(config.hidden_dropout_prob)
def forward(self, hidden_states: torch.Tensor, input_tensor: torch.Tensor) -> torch.Tensor:
hidden_states = self.dense(hidden_states)
hidden_states = self.dropout(hidden_states)
hidden_states = self.LayerNorm(hidden_states + input_tensor)
return hidden_states

复制代码

BertLayer

BertLayer 是将 BertAttention、BertIntermediate 和 BertOutput 封装起来。

class BertLayer(nn.Module):
def __init__(self, config):
super().__init__()
self.chunk_size_feed_forward = config.chunk_size_feed_forward
self.seq_len_dim = 1
self.attention = BertAttention(config)
self.is_decoder = config.is_decoder
self.add_cross_attention = config.add_cross_attention
if self.add_cross_attention:
if not self.is_decoder:
raise ValueError(f"{self} should be used as a decoder model if cross attention is added")
self.crossattention = BertAttention(config, position_embedding_type="absolute")
self.intermediate = BertIntermediate(config)
self.output = BertOutput(config)

复制代码

这里注意的是，假如加入交织注意力，必须作为 decoder。
前向计算代码如下：

def forward(
self,
hidden_states: torch.Tensor,
attention_mask: Optional[torch.FloatTensor] = None,
head_mask: Optional[torch.FloatTensor] = None,
encoder_hidden_states: Optional[torch.FloatTensor] = None,
encoder_attention_mask: Optional[torch.FloatTensor] = None,
past_key_value: Optional[Tuple[Tuple[torch.FloatTensor]]] = None,
output_attentions: Optional[bool] = False,
) -> Tuple[torch.Tensor]:
# decoder uni-directional self-attention cached key/values tuple is at positions 1,2
self_attn_past_key_value = past_key_value[:2] if past_key_value is not None else None
self_attention_outputs = self.attention(
hidden_states,
attention_mask,
head_mask,
output_attentions=output_attentions,
past_key_value=self_attn_past_key_value,
)
attention_output = self_attention_outputs[0]
# if decoder, the last output is tuple of self-attn cache
if self.is_decoder:
outputs = self_attention_outputs[1:-1]
present_key_value = self_attention_outputs[-1]
else:
outputs = self_attention_outputs[1:] # add self attentions if we output attention weights
cross_attn_present_key_value = None
if self.is_decoder and encoder_hidden_states is not None:
if not hasattr(self, "crossattention"):
raise ValueError(
f"If `encoder_hidden_states` are passed, {self} has to be instantiated with cross-attention layers"
" by setting `config.add_cross_attention=True`"
)
# cross_attn cached key/values tuple is at positions 3,4 of past_key_value tuple
cross_attn_past_key_value = past_key_value[-2:] if past_key_value is not None else None
cross_attention_outputs = self.crossattention(
attention_output,
attention_mask,
head_mask,
encoder_hidden_states,
encoder_attention_mask,
cross_attn_past_key_value,
output_attentions,
)
attention_output = cross_attention_outputs[0]
outputs = outputs + cross_attention_outputs[1:-1] # add cross attentions if we output attention weights
# add cross-attn cache to positions 3,4 of present_key_value tuple
cross_attn_present_key_value = cross_attention_outputs[-1]
present_key_value = present_key_value + cross_attn_present_key_value
layer_output = apply_chunking_to_forward(
self.feed_forward_chunk, self.chunk_size_feed_forward, self.seq_len_dim, attention_output
)
outputs = (layer_output,) + outputs
# if decoder, return the attn key/values as the last output
if self.is_decoder:
outputs = outputs + (present_key_value,)
return outputs
def feed_forward_chunk(self, attention_output):
intermediate_output = self.intermediate(attention_output)
layer_output = self.output(intermediate_output, attention_output)
return layer_output

复制代码

基本逻辑：

对 hidden_states 进行一次 Attention。
假如是 decoder，将 attention_outputs 进行一次 CrossAttention。
经过中间层和 Output 层。

必要注意的是，对于第三步，这里采取分块操纵来节省内存。

layer_output = apply_chunking_to_forward(
self.feed_forward_chunk, self.chunk_size_feed_forward, self.seq_len_dim, attention_output
)
def feed_forward_chunk(self, attention_output):
intermediate_output = self.intermediate(attention_output)
layer_output = self.output(intermediate_output, attention_output)
return layer_output
def apply_chunking_to_forward(
forward_fn: Callable[..., torch.Tensor], chunk_size: int, chunk_dim: int, *input_tensors
) -> torch.Tensor:
"""
This function chunks the `input_tensors` into smaller input tensor parts of size `chunk_size` over the dimension
`chunk_dim`. It then applies a layer `forward_fn` to each chunk independently to save memory.
If the `forward_fn` is independent across the `chunk_dim` this function will yield the same result as directly
applying `forward_fn` to `input_tensors`.
Args:
forward_fn (`Callable[..., torch.Tensor]`):
The forward function of the model.
chunk_size (`int`):
The chunk size of a chunked tensor: `num_chunks = len(input_tensors[0]) / chunk_size`.
chunk_dim (`int`):
The dimension over which the `input_tensors` should be chunked.
input_tensors (`Tuple[torch.Tensor]`):
The input tensors of `forward_fn` which will be chunked
Returns:
`torch.Tensor`: A tensor with the same shape as the `forward_fn` would have given if applied`.

复制代码

apply_chunking_to_forward 函数就是将输入的 tensor 在指定的 chunk_dim 上分为若干个大小为 chunk_size 的块，然后对每块进行前向计算。
BERT 中指定的维度是 seq_len_dim，也就是将一个句子分为若干指定大小的块，分别进行计算。
BertEncoder

有了前面的 BertLayer，BertEncoder 就是若干 BertLayer 堆叠而成。

class BertEncoder(nn.Module):
def __init__(self, config):
super().__init__()
self.config = config
self.layer = nn.ModuleList([BertLayer(config) for _ in range(config.num_hidden_layers)])
self.gradient_checkpointing = False

复制代码

前向计算也就是输入经过若干 BertLayer。假如设置了梯度查抄点，而且处于训练状态，BERT 会采取 torch.utils.checkpoint.checkpoint() 来节省内存。

if self.gradient_checkpointing and self.training:
def create_custom_forward(module):
def custom_forward(*inputs):
return module(*inputs, past_key_value, output_attentions)
return custom_forward
layer_outputs = torch.utils.checkpoint.checkpoint(
create_custom_forward(layer_module),
hidden_states,
attention_mask,
layer_head_mask,
encoder_hidden_states,
encoder_attention_mask,
)
def checkpoint(function, *args, use_reentrant: bool = True, **kwargs):
r"""Checkpoint a model or part of the model
Checkpointing works by trading compute for memory. Rather than storing all
intermediate activations of the entire computation graph for computing
backward, the checkpointed part does **not** save intermediate activations,
and instead recomputes them in backward pass. It can be applied on any part
of a model.

复制代码

这是一种时间换空间的思想。
BertPooler

class BertPooler(nn.Module):
def __init__(self, config):
super().__init__()
self.dense = nn.Linear(config.hidden_size, config.hidden_size)
self.activation = nn.Tanh()
def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
# We "pool" the model by simply taking the hidden state corresponding
# to the first token.
first_token_tensor = hidden_states[:, 0]
pooled_output = self.dense(first_token_tensor)
pooled_output = self.activation(pooled_output)
return pooled_output

复制代码

BertPooler 仅仅获取 hidden_states 在 seq_len 维度上的第一个向量，然后经过线性变更后传入激活函数。
Summary

BERT 是由若干 BertLayer 堆叠而成，可以在最后一层加入不同的线性层，以适应不同的卑鄙任务。
BertLayer 是由 BertAttention、CrossAttention(可选)、BertIntermediate 和 BertOutput 堆叠而成。

BertAttention：由自注意力层和输出层构成
- 自注意力层：Mask 采取加法，被遮罩的地方为较大负数
- 输出层：依次经过线性层 dropout 残差 LayerNorm
BertIntermediate：依次经过线性层激活函数
BertOutput：依次经过线性层 dropout 残差 LayerNorm

那么，我们该怎样学习大模子？

作为一名热心肠的互联网老兵，我决定把宝贵的AI知识分享给大家。至于能学习到多少就看你的学习毅力和能力了。我已将重要的AI大模子资料包罗AI大模子入门学习头脑导图、精品AI大模子学习书籍手册、视频教程、实战学习等录播视频免费分享出来。
一、大模子全套的学习门路

学习大型人工智能模子，如GPT-3、BERT或任何其他先进的神经网络模子，必要体系的方法和持续的努力。既然要体系的学习大模子，那么学习门路是必不可少的，下面的这份门路能资助你快速梳理知识，形成本身的体系。
L1级别:AI大模子期间的华丽登场

L2级别：AI大模子API应用开辟工程

L3级别：大模子应用架构进阶实践

L4级别：大模子微调与私有化部署

一样寻常掌握到第四个级别，市场上大多数岗位都是可以胜任，但要还不是天花板，天花板级别要求更加严格，对于算法和实战是非常苛刻的。发起普通人掌握到L4级别即可。
以上的AI大模子学习门路，不知道为什么发出来就有点糊，高清版可以微信扫描下方CSDN官方认证二维码免费领取【包管100%免费】

二、640套AI大模子报告合集

这套包含640份报告的合集，涵盖了AI大模子的理论研究、技能实现、行业应用等多个方面。无论您是科研人员、工程师，照旧对AI大模子感兴趣的爱好者，这套报告合集都将为您提供宝贵的信息和启示。

三、大模子经典PDF籍

随着人工智能技能的飞速发展，AI大模子已经成为了当今科技领域的一大热点。这些大型预训练模子，如GPT-3、BERT、XLNet等，以其强盛的语言理解和天生能力，正在改变我们对人工智能的认识。那以下这些PDF籍就是非常不错的学习资源。

四、AI大模子商业化落地方案

作为普通人，入局大模子期间必要持续学习和实践，不断进步本身的技能和认知水平，同时也必要有责任感和伦理意识，为人工智能的健康发展贡献力量。

免责声明：如果侵犯了您的权益，请联系站长，我们会及时删除侵权内容，谢谢合作！更多信息从访问主页：qidao123.com:ToB企服之家，中国第一个企服评测及商务社交产业平台。

		自动登录	找回密码
密码			立即注册

transformers 阅读：BERT 模子

本帖子中包含更多资源

0 个回复

快速回复

楼主热帖

标签云