Attention Mechanism and Transformers
Attention Mechanism Basics
The attention mechanism is a technique that lets a model dynamically focus on different parts of its input, and it plays a crucial role in sequence modeling.
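As a toy illustration (numbers invented here, not taken from the text above): attention turns similarity scores into softmax weights and uses them to take a weighted average of the input values.
import tensorflow as tf

# Three input "value" vectors and one similarity score per value (toy numbers)
values = tf.constant([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])        # shape [3, 2]
scores = tf.constant([2.0, 0.5, 0.1])                             # higher score = more focus

weights = tf.nn.softmax(scores)                                   # attention weights, sum to 1
context = tf.reduce_sum(weights[:, tf.newaxis] * values, axis=0)  # weighted average, shape [2]
print(weights.numpy(), context.numpy())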
Transformer Architecture
Overall Structure
- Encoder-Decoder
- Multi-head Attention
- Feed-forward Network
Self-attention Mechanism
- Query-Key-Value
- Attention Scores
- Scaled Dot-Product Attention (see the sketch below)
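A minimal sketch of scaled dot-product attention, assuming q, k, v tensors whose last dimension is the depth d_k; the additive -1e9 masking convention is one common choice, not dictated by the list above:
def scaled_dot_product_attention(q, k, v, mask=None):
    # softmax(q @ k^T / sqrt(d_k)) @ v
    matmul_qk = tf.matmul(q, k, transpose_b=True)             # [..., seq_len_q, seq_len_k]
    dk = tf.cast(tf.shape(k)[-1], tf.float32)
    scaled_logits = matmul_qk / tf.math.sqrt(dk)              # scale by sqrt(d_k)
    if mask is not None:
        scaled_logits += (1.0 - tf.cast(mask, tf.float32)) * -1e9  # hide masked positions
    attention_weights = tf.nn.softmax(scaled_logits, axis=-1)
    return tf.matmul(attention_weights, v), attention_weights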
Positional Encoding
- Position Information
- Sinusoidal Encoding (formula below)
- Learnable Positional Encoding
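For reference, the sinusoidal encoding from "Attention Is All You Need", which the hands-on code later in this section implements:
PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))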
Practical Applications
Natural Language Processing
import tensorflow as tf

# Transformer encoder layer
class TransformerEncoderLayer(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads, dff, rate=0.1):
        super(TransformerEncoderLayer, self).__init__()
        self.mha = tf.keras.layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=d_model)
        self.ffn = tf.keras.Sequential([
            tf.keras.layers.Dense(dff, activation='relu'),
            tf.keras.layers.Dense(d_model)
        ])
        self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.dropout1 = tf.keras.layers.Dropout(rate)
        self.dropout2 = tf.keras.layers.Dropout(rate)

    def call(self, x, training=False, mask=None):
        # Multi-head self-attention with residual connection and layer norm
        attn_output = self.mha(query=x, value=x, key=x, attention_mask=mask)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(x + attn_output)
        # Position-wise feed-forward network with residual connection and layer norm
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        return self.layernorm2(out1 + ffn_output)
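A quick shape check for the layer above (dimensions chosen only for illustration):
sample_layer = TransformerEncoderLayer(d_model=128, num_heads=8, dff=512)
sample_input = tf.random.uniform((2, 50, 128))              # [batch, seq_len, d_model]
sample_output = sample_layer(sample_input, training=False)
print(sample_output.shape)                                  # (2, 50, 128)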
Machine Translation
- Sequence-to-Sequence
- Parallel Processing
- Long-range Dependencies
Text Generation
- Language Models
- Autoregressive Generation (sketch below)
- Conditional Generation
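A minimal sketch of greedy autoregressive decoding; language_model, start_id, end_id, and max_new_tokens are illustrative names for a hypothetical model that maps a token-id sequence to next-token logits and its special tokens:
def greedy_generate(language_model, start_id, end_id, max_new_tokens=50):
    # Repeatedly append the most likely next token until end_id or the length limit
    tokens = [start_id]
    for _ in range(max_new_tokens):
        logits = language_model(tf.constant([tokens]), training=False)  # [1, len, vocab]
        next_id = int(tf.argmax(logits[0, -1]))                         # greedy choice
        tokens.append(next_id)
        if next_id == end_id:
            break
    return tokens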
Advanced Topics
Model Variants
- BERT
- GPT Series
- T5 Model
Optimization Techniques
- Pre-training Strategies
- Fine-tuning Methods
- Knowledge Distillation (see the loss sketch below)
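As a sketch of the core idea behind knowledge distillation (teacher_logits, student_logits, and temperature are illustrative names; the T^2 scaling follows the common formulation):
def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    # KL divergence between temperature-softened teacher and student distributions
    teacher_probs = tf.nn.softmax(teacher_logits / temperature)
    student_log_probs = tf.nn.log_softmax(student_logits / temperature)
    kl = tf.reduce_sum(
        teacher_probs * (tf.math.log(teacher_probs + 1e-9) - student_log_probs), axis=-1)
    return tf.reduce_mean(kl) * temperature ** 2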
Efficiency Improvements
- Sparse Attention (see the mask sketch below)
- Linear Attention
- Progressive Computation
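For intuition on sparse attention, a toy local-window mask in which each position may only attend to neighbours within window steps (one of many possible sparsity patterns):
def local_attention_mask(seq_len, window=2):
    # True where attention is allowed: |i - j| <= window
    i = tf.range(seq_len)[:, tf.newaxis]
    j = tf.range(seq_len)[tf.newaxis, :]
    return tf.abs(i - j) <= window          # [seq_len, seq_len] boolean mask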
Hands-on Project
Text Classification Implementation
import tensorflow as tf
import numpy as np

# Transformer model for text classification
class TransformerClassifier(tf.keras.Model):
    def __init__(self, vocab_size, num_layers, d_model, num_heads, dff, max_length, num_classes, rate=0.1):
        super(TransformerClassifier, self).__init__()
        self.embedding = tf.keras.layers.Embedding(vocab_size, d_model)
        self.pos_encoding = self.positional_encoding(max_length, d_model)
        self.encoder_layers = [
            TransformerEncoderLayer(d_model, num_heads, dff, rate)
            for _ in range(num_layers)
        ]
        self.dropout = tf.keras.layers.Dropout(rate)
        self.final_layer = tf.keras.layers.Dense(num_classes)

    def positional_encoding(self, position, d_model):
        # Sinusoidal positional encoding: sine on even indices, cosine on odd indices
        angle_rads = self.get_angles(
            np.arange(position)[:, np.newaxis],
            np.arange(d_model)[np.newaxis, :],
            d_model
        )
        angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])
        angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])
        pos_encoding = angle_rads[np.newaxis, ...]
        return tf.cast(pos_encoding, dtype=tf.float32)

    def get_angles(self, pos, i, d_model):
        angle_rates = 1 / np.power(10000, (2 * (i // 2)) / np.float32(d_model))
        return pos * angle_rates

    def call(self, x, training=False):
        # Token embedding scaled by sqrt(d_model), plus positional encoding
        seq_len = tf.shape(x)[1]
        x = self.embedding(x)
        x *= tf.math.sqrt(tf.cast(self.embedding.output_dim, tf.float32))
        x += self.pos_encoding[:, :seq_len, :]
        x = self.dropout(x, training=training)
        # Stack of encoder layers
        for encoder_layer in self.encoder_layers:
            x = encoder_layer(x, training=training)
        # Global average pooling over the sequence dimension
        x = tf.reduce_mean(x, axis=1)
        # Final classification layer (returns logits)
        return self.final_layer(x)
Model Training and Evaluation
# Model hyperparameters
VOCAB_SIZE = 10000
NUM_LAYERS = 4
D_MODEL = 128
NUM_HEADS = 8
DFF = 512
MAX_LENGTH = 200
NUM_CLASSES = 5
DROPOUT_RATE = 0.1
# Create the model instance
model = TransformerClassifier(
    vocab_size=VOCAB_SIZE,
    num_layers=NUM_LAYERS,
    d_model=D_MODEL,
    num_heads=NUM_HEADS,
    dff=DFF,
    max_length=MAX_LENGTH,
    num_classes=NUM_CLASSES,
    rate=DROPOUT_RATE
)

# Compile the model
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy']
)
Performance Optimization Tips
- Gradient Accumulation (the step below assumes a batches iterable yielding (inputs, labels) mini-batches)
def train_step(model, optimizer, loss_fn, batches, accumulation_steps=4):
    # Accumulate gradients over several mini-batches before applying one update
    accumulated_gradients = [tf.zeros_like(var) for var in model.trainable_variables]
    for inputs, labels in batches:
        with tf.GradientTape() as tape:
            predictions = model(inputs, training=True)
            loss = loss_fn(labels, predictions)
        gradients = tape.gradient(loss, model.trainable_variables)
        for i, grad in enumerate(gradients):
            accumulated_gradients[i] += grad
    # Average the accumulated gradients and apply them in a single optimizer step
    accumulated_gradients = [g / accumulation_steps for g in accumulated_gradients]
    optimizer.apply_gradients(zip(accumulated_gradients, model.trainable_variables))
- Mixed Precision Training
# Enable mixed precision training (float16 compute with float32 variables)
policy = tf.keras.mixed_precision.Policy('mixed_float16')
tf.keras.mixed_precision.set_global_policy(policy)
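When mixed precision is combined with a custom training loop such as the gradient-accumulation step above, loss scaling has to be handled explicitly (model.fit handles it for you once the policy is set); a sketch reusing the model and dummy tensors from the smoke test:
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.mixed_precision.LossScaleOptimizer(tf.keras.optimizers.Adam(1e-4))

with tf.GradientTape() as tape:
    predictions = model(x_dummy, training=True)
    loss = loss_fn(y_dummy, predictions)
    scaled_loss = optimizer.get_scaled_loss(loss)            # scale up to avoid float16 underflow
scaled_gradients = tape.gradient(scaled_loss, model.trainable_variables)
gradients = optimizer.get_unscaled_gradients(scaled_gradients)
optimizer.apply_gradients(zip(gradients, model.trainable_variables))
For numerical stability it is also common to keep the final Dense layer in float32 (dtype='float32') when using mixed_float16.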
Future Prospects
Emerging Research Directions
- Sparse Attention Optimization
- Efficient Transformer Variants
- Cross-modal Applications
Industrial Application Trends
- Large-scale Pre-trained Models
- Model Compression and Deployment
- Domain-specific Optimization
Reference Resources
Papers
- "Attention Is All You Need" - Vaswani et al.
- "BERT: Pre-training of Deep Bidirectional Transformers" - Devlin et al.
Online Tutorials
- TensorFlow Official Documentation
- Hugging Face Transformers Library
Code Repositories