Compact Convolutional Transformers
Author: Sayak Paul
Date created: 2021/06/30
Last modified: 2023/08/07
Description: Compact Convolutional Transformers for efficient image classification.
ⓘ This example uses Keras 3. View in Colab
GitHub source
As discussed in the Vision Transformers (ViT) paper, a Transformer-based architecture for vision typically requires a larger dataset than usual, as well as a longer pre-training schedule. ImageNet-1k (which has about a million images) is considered to fall under the medium-sized data regime with respect to ViTs. This is primarily because, unlike CNNs, ViTs (or a typical Transformer-based architecture) do not have well-informed inductive biases (such as convolutions for processing images). This begs the question: can't we combine the benefits of convolution and the benefits of Transformers in a single network architecture? These benefits include parameter-efficiency, and self-attention to process long-range and global dependencies (interactions between different regions in an image).

In Escaping the Big Data Paradigm with Compact Transformers, Hassani et al. present an approach for doing exactly this. They proposed the Compact Convolutional Transformer (CCT) architecture. In this example, we will work on an implementation of CCT and we will see how well it performs on the CIFAR-10 dataset.

If you are unfamiliar with the concept of self-attention or Transformers, you can read this chapter from François Chollet's book Deep Learning with Python. This example uses code snippets from another example, Image classification with Vision Transformer.
Imports
from keras import layers
import keras

import matplotlib.pyplot as plt
import numpy as np
Hyperparameters and constants
positional_emb = True
conv_layers = 2
projection_dim = 128

num_heads = 2
transformer_units = [
    projection_dim,
    projection_dim,
]
transformer_layers = 2
stochastic_depth_rate = 0.1

learning_rate = 0.001
weight_decay = 0.0001
batch_size = 128
num_epochs = 30
image_size = 32
Load CIFAR-10 dataset
num_classes = 10
input_shape = (32, 32, 3)

(x_train, y_train), (x_test, y_test) = keras.datasets.cifar10.load_data()

y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)

print(f"x_train shape: {x_train.shape} - y_train shape: {y_train.shape}")
print(f"x_test shape: {x_test.shape} - y_test shape: {y_test.shape}")
x_train shape: (50000, 32, 32, 3) - y_train shape: (50000, 10)
x_test shape: (10000, 32, 32, 3) - y_test shape: (10000, 10)
The CCT tokenizer
The first recipe introduced by the CCT authors is the tokenizer for processing the images. In a standard ViT, images are organized into uniform non-overlapping patches. This eliminates the boundary-level information present in between different patches. This is important for a neural network to effectively exploit the locality information. The figure below presents an illustration of how images are organized into patches.

We already know that convolutions are quite good at exploiting locality information. So, based on this, the authors introduce an all-convolution mini-network to produce image patches.
class CCTTokenizer(layers.Layer):
    def __init__(
        self,
        kernel_size=3,
        stride=1,
        padding=1,
        pooling_kernel_size=3,
        pooling_stride=2,
        num_conv_layers=conv_layers,
        num_output_channels=[64, 128],
        positional_emb=positional_emb,
        **kwargs,
    ):
        super().__init__(**kwargs)

        # This is our tokenizer.
        self.conv_model = keras.Sequential()
        for i in range(num_conv_layers):
            self.conv_model.add(
                layers.Conv2D(
                    num_output_channels[i],
                    kernel_size,
                    stride,
                    padding="valid",
                    use_bias=False,
                    activation="relu",
                    kernel_initializer="he_normal",
                )
            )
            self.conv_model.add(layers.ZeroPadding2D(padding))
            self.conv_model.add(
                layers.MaxPooling2D(pooling_kernel_size, pooling_stride, "same")
            )

        self.positional_emb = positional_emb

    def call(self, images):
        outputs = self.conv_model(images)
        # After passing the images through our mini-network the spatial dimensions
        # are flattened to form sequences.
        reshaped = keras.ops.reshape(
            outputs,
            (
                -1,
                keras.ops.shape(outputs)[1] * keras.ops.shape(outputs)[2],
                keras.ops.shape(outputs)[-1],
            ),
        )
        return reshaped
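A quick shape check, not part of the original example, makes the tokenizer output concrete: with the defaults above (two conv/pool stages), each 32×32×3 image becomes a sequence of 8 × 8 = 64 tokens of dimension 128.

# Sanity check (assumed illustration, not from the original example).
dummy_images = keras.ops.ones((4, 32, 32, 3))
dummy_tokens = CCTTokenizer()(dummy_images)
print(dummy_tokens.shape)  # (4, 64, 128): 64 tokens per image, 128 channels each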
Positional embeddings are optional in CCT. If we want to use them, we can use the Layer defined below.
class PositionEmbedding(keras.layers.Layer):
    def __init__(
        self,
        sequence_length,
        initializer="glorot_uniform",
        **kwargs,
    ):
        super().__init__(**kwargs)
        if sequence_length is None:
            raise ValueError("`sequence_length` must be an Integer, received `None`.")
        self.sequence_length = int(sequence_length)
        self.initializer = keras.initializers.get(initializer)

    def get_config(self):
        config = super().get_config()
        config.update(
            {
                "sequence_length": self.sequence_length,
                "initializer": keras.initializers.serialize(self.initializer),
            }
        )
        return config

    def build(self, input_shape):
        feature_size = input_shape[-1]
        self.position_embeddings = self.add_weight(
            name="embeddings",
            shape=[self.sequence_length, feature_size],
            initializer=self.initializer,
            trainable=True,
        )
        super().build(input_shape)

    def call(self, inputs, start_index=0):
        shape = keras.ops.shape(inputs)
        feature_length = shape[-1]
        sequence_length = shape[-2]
        # trim to match the length of the input sequence, which might be less
        # than the sequence_length of the layer.
        position_embeddings = keras.ops.convert_to_tensor(self.position_embeddings)
        position_embeddings = keras.ops.slice(
            position_embeddings,
            (start_index, 0),
            (sequence_length, feature_length),
        )
        return keras.ops.broadcast_to(position_embeddings, shape)

    def compute_output_shape(self, input_shape):
        return input_shape
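A small usage sketch, again not part of the original example: the layer returns its learned position vectors broadcast to the input shape; the addition to the tokens happens later inside the model-building function.

# Hypothetical usage; `tokens` stands in for the tokenizer output above.
tokens = keras.ops.ones((4, 64, 128))
positions = PositionEmbedding(sequence_length=64)(tokens)
print(positions.shape)  # (4, 64, 128); added to the tokens via `encoded_patches += ...` below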
Sequence Pooling
Another recipe introduced in CCT is attention pooling, or sequence pooling. In ViT, only the feature map corresponding to the class token is pooled and is then used for the subsequent classification task (or any other downstream task). In CCT, the whole token sequence is pooled instead, with a learned attention weight per token.
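The operation implemented by the layer below can be written compactly. For a token sequence X in R^(n×d) and a learned projection w in R^(d×1) (the Dense(1) layer):

z = softmax(X w)^T X

Every token receives a scalar score, the scores are normalized over the sequence with a softmax, and the pooled representation z in R^(1×d) is the score-weighted average of all tokens, so no class token is needed.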
class SequencePooling(layers.Layer):
    def __init__(self):
        super().__init__()
        self.attention = layers.Dense(1)

    def call(self, x):
        attention_weights = keras.ops.softmax(self.attention(x), axis=1)
        attention_weights = keras.ops.transpose(attention_weights, axes=(0, 2, 1))
        weighted_representation = keras.ops.matmul(attention_weights, x)
        return keras.ops.squeeze(weighted_representation, -2)
Stochastic depth for regularization
Stochastic depth is a regularization technique that randomly drops a set of layers. During inference, the layers are kept as they are. It is very similar to Dropout, but it operates on a block of layers rather than on the individual units inside a layer. In CCT, stochastic depth is used just before the residual blocks of the Transformers encoder.
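Concretely, with a residual branch output f(x), drop probability p, and a per-example Bernoulli variable b where P(b = 1) = 1 - p, the residual update used in the encoder below is

y = x + (b / (1 - p)) * f(x)   during training
y = x + f(x)                   at inference

so the branch contributes the same amount in expectation in both modes. The layer below implements the b / (1 - p) scaling of the branch; the skip connection adds x back.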
# Referred from: github.com:rwightman/pytorch-image-models.
class StochasticDepth(layers.Layer):
    def __init__(self, drop_prop, **kwargs):
        super().__init__(**kwargs)
        self.drop_prob = drop_prop
        self.seed_generator = keras.random.SeedGenerator(1337)

    def call(self, x, training=None):
        if training:
            keep_prob = 1 - self.drop_prob
            shape = (keras.ops.shape(x)[0],) + (1,) * (len(x.shape) - 1)
            random_tensor = keep_prob + keras.random.uniform(
                shape, 0, 1, seed=self.seed_generator
            )
            random_tensor = keras.ops.floor(random_tensor)
            return (x / keep_prob) * random_tensor
        return x
MLP for the Transformers encoder
def mlp(x, hidden_units, dropout_rate):
    for units in hidden_units:
        x = layers.Dense(units, activation=keras.ops.gelu)(x)
        x = layers.Dropout(dropout_rate)(x)
    return x
Data augmentation
In the original paper, the authors use AutoAugment to induce stronger regularization. For this example, we will be using standard geometric augmentations like random cropping and flipping.
# Note the rescaling layer. These layers have pre-defined inference behavior.
data_augmentation = keras.Sequential(
    [
        layers.Rescaling(scale=1.0 / 255),
        layers.RandomCrop(image_size, image_size),
        layers.RandomFlip("horizontal"),
    ],
    name="data_augmentation",
)
The final CCT model
In CCT, the outputs from the Transformers encoder are weighted and then passed on to the final task-specific layer (in this example, we do classification).
def create_cct_model(
    image_size=image_size,
    input_shape=input_shape,
    num_heads=num_heads,
    projection_dim=projection_dim,
    transformer_units=transformer_units,
):
    inputs = layers.Input(input_shape)

    # Augment data.
    augmented = data_augmentation(inputs)

    # Encode patches.
    cct_tokenizer = CCTTokenizer()
    encoded_patches = cct_tokenizer(augmented)

    # Apply positional embedding.
    if positional_emb:
        sequence_length = encoded_patches.shape[1]
        encoded_patches += PositionEmbedding(sequence_length=sequence_length)(
            encoded_patches
        )

    # Calculate Stochastic Depth probabilities.
    dpr = [x for x in np.linspace(0, stochastic_depth_rate, transformer_layers)]

    # Create multiple layers of the Transformer block.
    for i in range(transformer_layers):
        # Layer normalization 1.
        x1 = layers.LayerNormalization(epsilon=1e-5)(encoded_patches)

        # Create a multi-head attention layer.
        attention_output = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=projection_dim, dropout=0.1
        )(x1, x1)

        # Skip connection 1.
        attention_output = StochasticDepth(dpr[i])(attention_output)
        x2 = layers.Add()([attention_output, encoded_patches])

        # Layer normalization 2.
        x3 = layers.LayerNormalization(epsilon=1e-5)(x2)

        # MLP.
        x3 = mlp(x3, hidden_units=transformer_units, dropout_rate=0.1)

        # Skip connection 2.
        x3 = StochasticDepth(dpr[i])(x3)
        encoded_patches = layers.Add()([x3, x2])

    # Apply sequence pooling.
    representation = layers.LayerNormalization(epsilon=1e-5)(encoded_patches)
    weighted_representation = SequencePooling()(representation)

    # Classify outputs.
    logits = layers.Dense(num_classes)(weighted_representation)

    # Create the Keras model.
    model = keras.Model(inputs=inputs, outputs=logits)
    return model
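Before training, it can be useful to build the model once and check its size. This quick check is not part of the original example; the number it prints should roughly match the 0.4 million parameters mentioned in the closing discussion.

# Optional sanity check (assumed, not from the original example).
sanity_model = create_cct_model()
print(f"Number of parameters: {sanity_model.count_params():,}")  # roughly 0.4 million with the defaults above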
Model training and evaluation
def run_experiment(model):
    optimizer = keras.optimizers.AdamW(learning_rate=0.001, weight_decay=0.0001)

    model.compile(
        optimizer=optimizer,
        loss=keras.losses.CategoricalCrossentropy(
            from_logits=True, label_smoothing=0.1
        ),
        metrics=[
            keras.metrics.CategoricalAccuracy(name="accuracy"),
            keras.metrics.TopKCategoricalAccuracy(5, name="top-5-accuracy"),
        ],
    )

    checkpoint_filepath = "/tmp/checkpoint.weights.h5"
    checkpoint_callback = keras.callbacks.ModelCheckpoint(
        checkpoint_filepath,
        monitor="val_accuracy",
        save_best_only=True,
        save_weights_only=True,
    )

    history = model.fit(
        x=x_train,
        y=y_train,
        batch_size=batch_size,
        epochs=num_epochs,
        validation_split=0.1,
        callbacks=[checkpoint_callback],
    )

    model.load_weights(checkpoint_filepath)
    _, accuracy, top_5_accuracy = model.evaluate(x_test, y_test)
    print(f"Test accuracy: {round(accuracy * 100, 2)}%")
    print(f"Test top 5 accuracy: {round(top_5_accuracy * 100, 2)}%")

    return history


cct_model = create_cct_model()
history = run_experiment(cct_model)
Epoch 1/30
352/352 ━━━━━━━━━━━━━━━━━━━━ 90s 248ms/step - accuracy: 0.2578 - loss: 2.0882 - top-5-accuracy: 0.7553 - val_accuracy: 0.4438 - val_loss: 1.6872 - val_top-5-accuracy: 0.9046
Epoch 2/30
352/352 ━━━━━━━━━━━━━━━━━━━━ 91s 258ms/step - accuracy: 0.4779 - loss: 1.6074 - top-5-accuracy: 0.9261 - val_accuracy: 0.5730 - val_loss: 1.4462 - val_top-5-accuracy: 0.9562
Epoch 3/30
352/352 ━━━━━━━━━━━━━━━━━━━━ 92s 260ms/step - accuracy: 0.5655 - loss: 1.4371 - top-5-accuracy: 0.9501 - val_accuracy: 0.6178 - val_loss: 1.3458 - val_top-5-accuracy: 0.9626
Epoch 4/30
352/352 ━━━━━━━━━━━━━━━━━━━━ 92s 261ms/step - accuracy: 0.6166 - loss: 1.3343 - top-5-accuracy: 0.9613 - val_accuracy: 0.6610 - val_loss: 1.2695 - val_top-5-accuracy: 0.9706
Epoch 5/30
352/352 ━━━━━━━━━━━━━━━━━━━━ 92s 261ms/step - accuracy: 0.6468 - loss: 1.2814 - top-5-accuracy: 0.9672 - val_accuracy: 0.6834 - val_loss: 1.2231 - val_top-5-accuracy: 0.9716
Epoch 6/30
352/352 ━━━━━━━━━━━━━━━━━━━━ 92s 261ms/step - accuracy: 0.6619 - loss: 1.2412 - top-5-accuracy: 0.9708 - val_accuracy: 0.6842 - val_loss: 1.2018 - val_top-5-accuracy: 0.9744
Epoch 7/30
352/352 ━━━━━━━━━━━━━━━━━━━━ 93s 263ms/step - accuracy: 0.6976 - loss: 1.1775 - top-5-accuracy: 0.9752 - val_accuracy: 0.6988 - val_loss: 1.1988 - val_top-5-accuracy: 0.9752
Epoch 8/30
352/352 ━━━━━━━━━━━━━━━━━━━━ 93s 263ms/step - accuracy: 0.7070 - loss: 1.1579 - top-5-accuracy: 0.9774 - val_accuracy: 0.7010 - val_loss: 1.1780 - val_top-5-accuracy: 0.9732
Epoch 9/30
352/352 ━━━━━━━━━━━━━━━━━━━━ 95s 269ms/step - accuracy: 0.7219 - loss: 1.1255 - top-5-accuracy: 0.9795 - val_accuracy: 0.7166 - val_loss: 1.1375 - val_top-5-accuracy: 0.9784
Epoch 10/30
352/352 ━━━━━━━━━━━━━━━━━━━━ 93s 264ms/step - accuracy: 0.7273 - loss: 1.1087 - top-5-accuracy: 0.9801 - val_accuracy: 0.7258 - val_loss: 1.1286 - val_top-5-accuracy: 0.9814
Epoch 11/30
352/352 ━━━━━━━━━━━━━━━━━━━━ 93s 265ms/step - accuracy: 0.7361 - loss: 1.0863 - top-5-accuracy: 0.9828 - val_accuracy: 0.7222 - val_loss: 1.1412 - val_top-5-accuracy: 0.9766
Epoch 12/30
352/352 ━━━━━━━━━━━━━━━━━━━━ 93s 264ms/step - accuracy: 0.7504 - loss: 1.0644 - top-5-accuracy: 0.9834 - val_accuracy: 0.7418 - val_loss: 1.0943 - val_top-5-accuracy: 0.9812
Epoch 13/30
352/352 ━━━━━━━━━━━━━━━━━━━━ 94s 266ms/step - accuracy: 0.7593 - loss: 1.0422 - top-5-accuracy: 0.9856 - val_accuracy: 0.7468 - val_loss: 1.0834 - val_top-5-accuracy: 0.9818
Epoch 14/30
352/352 ━━━━━━━━━━━━━━━━━━━━ 93s 265ms/step - accuracy: 0.7647 - loss: 1.0307 - top-5-accuracy: 0.9868 - val_accuracy: 0.7526 - val_loss: 1.0863 - val_top-5-accuracy: 0.9822
Epoch 15/30
352/352 ━━━━━━━━━━━━━━━━━━━━ 93s 263ms/step - accuracy: 0.7684 - loss: 1.0231 - top-5-accuracy: 0.9863 - val_accuracy: 0.7666 - val_loss: 1.0454 - val_top-5-accuracy: 0.9834
Epoch 16/30
352/352 ━━━━━━━━━━━━━━━━━━━━ 94s 268ms/step - accuracy: 0.7809 - loss: 1.0007 - top-5-accuracy: 0.9859 - val_accuracy: 0.7670 - val_loss: 1.0469 - val_top-5-accuracy: 0.9838
Epoch 17/30
352/352 ━━━━━━━━━━━━━━━━━━━━ 94s 268ms/step - accuracy: 0.7902 - loss: 0.9795 - top-5-accuracy: 0.9895 - val_accuracy: 0.7676 - val_loss: 1.0396 - val_top-5-accuracy: 0.9836
Epoch 18/30
352/352 ━━━━━━━━━━━━━━━━━━━━ 106s 301ms/step - accuracy: 0.7920 - loss: 0.9693 - top-5-accuracy: 0.9889 - val_accuracy: 0.7616 - val_loss: 1.0791 - val_top-5-accuracy: 0.9828
Epoch 19/30
352/352 ━━━━━━━━━━━━━━━━━━━━ 93s 264ms/step - accuracy: 0.7965 - loss: 0.9631 - top-5-accuracy: 0.9893 - val_accuracy: 0.7850 - val_loss: 1.0149 - val_top-5-accuracy: 0.9842
Epoch 20/30
352/352 ━━━━━━━━━━━━━━━━━━━━ 93s 265ms/step - accuracy: 0.8030 - loss: 0.9529 - top-5-accuracy: 0.9899 - val_accuracy: 0.7898 - val_loss: 1.0029 - val_top-5-accuracy: 0.9852
Epoch 21/30
352/352 ━━━━━━━━━━━━━━━━━━━━ 92s 261ms/step - accuracy: 0.8118 - loss: 0.9322 - top-5-accuracy: 0.9903 - val_accuracy: 0.7728 - val_loss: 1.0529 - val_top-5-accuracy: 0.9850
Epoch 22/30
352/352 ━━━━━━━━━━━━━━━━━━━━ 91s 259ms/step - accuracy: 0.8104 - loss: 0.9308 - top-5-accuracy: 0.9906 - val_accuracy: 0.7874 - val_loss: 1.0090 - val_top-5-accuracy: 0.9876
Epoch 23/30
352/352 ━━━━━━━━━━━━━━━━━━━━ 92s 263ms/step - accuracy: 0.8164 - loss: 0.9193 - top-5-accuracy: 0.9911 - val_accuracy: 0.7800 - val_loss: 1.0091 - val_top-5-accuracy: 0.9844
Epoch 24/30
352/352 ━━━━━━━━━━━━━━━━━━━━ 94s 268ms/step - accuracy: 0.8147 - loss: 0.9184 - top-5-accuracy: 0.9919 - val_accuracy: 0.7854 - val_loss: 1.0260 - val_top-5-accuracy: 0.9856
Epoch 25/30
352/352 ━━━━━━━━━━━━━━━━━━━━ 92s 262ms/step - accuracy: 0.8255 - loss: 0.9000 - top-5-accuracy: 0.9914 - val_accuracy: 0.7918 - val_loss: 1.0014 - val_top-5-accuracy: 0.9842
Epoch 26/30
352/352 ━━━━━━━━━━━━━━━━━━━━ 90s 257ms/step - accuracy: 0.8297 - loss: 0.8865 - top-5-accuracy: 0.9933 - val_accuracy: 0.7924 - val_loss: 1.0065 - val_top-5-accuracy: 0.9834
Epoch 27/30
352/352 ━━━━━━━━━━━━━━━━━━━━ 92s 262ms/step - accuracy: 0.8339 - loss: 0.8837 - top-5-accuracy: 0.9931 - val_accuracy: 0.7906 - val_loss: 1.0035 - val_top-5-accuracy: 0.9870
Epoch 28/30
352/352 ━━━━━━━━━━━━━━━━━━━━ 92s 260ms/step - accuracy: 0.8362 - loss: 0.8781 - top-5-accuracy: 0.9934 - val_accuracy: 0.7878 - val_loss: 1.0041 - val_top-5-accuracy: 0.9850
Epoch 29/30
352/352 ━━━━━━━━━━━━━━━━━━━━ 92s 260ms/step - accuracy: 0.8398 - loss: 0.8707 - top-5-accuracy: 0.9942 - val_accuracy: 0.7854 - val_loss: 1.0186 - val_top-5-accuracy: 0.9858
Epoch 30/30
352/352 ━━━━━━━━━━━━━━━━━━━━ 92s 263ms/step - accuracy: 0.8438 - loss: 0.8614 - top-5-accuracy: 0.9933 - val_accuracy: 0.7892 - val_loss: 1.0123 - val_top-5-accuracy: 0.9846
313/313 ━━━━━━━━━━━━━━━━━━━━ 14s 44ms/step - accuracy: 0.7752 - loss: 1.0370 - top-5-accuracy: 0.9824
Test accuracy: 77.82%
Test top 5 accuracy: 98.42%
Let's now visualize the training progress of the model.
plt.plot(history.history["loss"], label="train_loss")
plt.plot(history.history["val_loss"], label="val_loss")
plt.xlabel("Epochs")
plt.ylabel("Loss")
plt.title("Train and Validation Losses Over Epochs", fontsize=14)
plt.legend()
plt.grid()
plt.show()
(Plot: train and validation losses over epochs.)
The CCT model we just trained has just 0.4 million parameters, and it gets us to ~79% top-1 accuracy within 30 epochs. The plot above shows no signs of overfitting as well. This means we can train this network for longer (perhaps with a bit more regularization) and may obtain even better performance. This performance can be further improved by additional recipes like a cosine decay learning rate schedule, or other data augmentation techniques such as AutoAugment, MixUp or CutMix. With these modifications, the authors present 95.1% top-1 accuracy on the CIFAR-10 dataset. The authors also present a number of experiments to study how the number of convolutional blocks, Transformers layers, and so on affect the final performance of CCTs.
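As a concrete illustration, a cosine decay schedule can be dropped into run_experiment with a small change to the optimizer. The sketch below is an assumption on my part, reusing the hyperparameters defined above rather than the exact recipe from the paper:

# Hypothetical variant of the optimizer in `run_experiment`: cosine decay
# instead of a fixed learning rate. 45,000 training images remain after the
# 10% validation split.
steps_per_epoch = 45000 // batch_size
lr_schedule = keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=learning_rate,
    decay_steps=steps_per_epoch * num_epochs,
)
optimizer = keras.optimizers.AdamW(learning_rate=lr_schedule, weight_decay=weight_decay)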
In comparison, the ViT model takes about 4.7 million parameters and 100 epochs of training to reach a top-1 accuracy of 78.22% on the CIFAR-10 dataset. You can refer to this notebook to know about the experimental setup.

The authors also demonstrate the performance of Compact Convolutional Transformers on NLP tasks, and they report competitive results there.