Artificial Intelligence - Transformer (1)
The Transformer is a neural network architecture introduced in the 2017 Google research paper "Attention Is All You Need." It has achieved groundbreaking results in natural language processing (NLP) and computer vision (CV). Modern language models such as BERT and GPT are all built on the Transformer architecture.
In this article, we will explain the basic concepts and structure of the Transformer and provide a simple example code to help you understand it better.
🔍 What is a Transformer?
Unlike traditional Recurrent Neural Networks (RNNs), which process a sequence one step at a time, and Convolutional Neural Networks (CNNs), the Transformer relies on the Self-Attention mechanism. This allows it to process all positions in parallel and to learn long-range dependencies effectively.
Key Features
- Parallel Processing Capable: Transformers do not rely on step-by-step sequential processing, so they can learn from large amounts of data quickly (a short sketch after this list illustrates the idea).
- Learning Long-Range Dependencies: They can effectively learn relationships between distant words in a sentence, leading to more accurate context understanding.
- Flexible Structure: Built from an Encoder and a Decoder, the Transformer can be adapted to a wide range of tasks.
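To make the parallel-processing point concrete, here is a minimal PyTorch sketch (purely illustrative, not part of the translation example later in this article). An RNN must walk through the sequence one position at a time, while attention relates every pair of positions in a single matrix operation:
import torch
import torch.nn as nn

seq_len, batch, dim = 10, 2, 16
x = torch.randn(seq_len, batch, dim)        # a toy sequence of embeddings

# Recurrent processing: each step depends on the previous hidden state,
# so positions must be visited one at a time
rnn_cell = nn.RNNCell(dim, dim)
h = torch.zeros(batch, dim)
for t in range(seq_len):
    h = rnn_cell(x[t], h)

# Attention scores: every position is related to every other position
# in one matrix operation, so all pairs (including distant ones) are computed in parallel
scores = torch.einsum('ibd,jbd->bij', x, x) / dim ** 0.5
weights = scores.softmax(dim=-1)            # (batch, seq_len, seq_len)
print(weights.shape)                        # torch.Size([2, 10, 10])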
🏗️ Components of a Transformer
A Transformer is broadly composed of two parts: the Encoder and the Decoder.
1. Encoder
- Converts input sentences into internal representations.
- Composed of multiple identical layers, each consisting of Self-Attention and Feed-Forward Neural Networks.
2. Decoder
- Takes the output from the Encoder and generates the target output sequence.
- Similarly, it comprises multiple identical layers, each consisting of Self-Attention, Encoder-Decoder Attention, and Feed-Forward Neural Networks.
3. Self-Attention Mechanism
- Learns the relationships between words within an input sequence.
- Determines how strongly each word should attend to the other words in the sentence, producing richer representations (a minimal code sketch follows this list).
4. Multi-Head Attention
- Performs several Self-Attention operations in parallel to extract information from different representation spaces.
- Enables the model to learn diverse contextual information simultaneously.
5. Positional Encoding
- Adds positional information to the input embeddings, because Self-Attention by itself has no notion of word order; this lets the model take the order of words into account.
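The sketch below is a simplified, from-scratch illustration of the three mechanisms just described: scaled dot-product Self-Attention, splitting it across multiple heads, and the sinusoidal positional encoding used in the original paper. It omits the learned Q/K/V projections for brevity and is not the code used in the example later on:
import math
import torch

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (batch, heads, seq_len, head_dim)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))   # pairwise similarity of words
    weights = scores.softmax(dim=-1)                           # how much each word attends to the others
    return weights @ v

def multi_head_self_attention(x, num_heads):
    # x: (batch, seq_len, embed_dim); Q, K and V are simply x here (no learned projections)
    batch, seq_len, embed_dim = x.shape
    head_dim = embed_dim // num_heads
    q = k = v = x.view(batch, seq_len, num_heads, head_dim).transpose(1, 2)
    out = scaled_dot_product_attention(q, k, v)                # each head attends in a different subspace
    return out.transpose(1, 2).reshape(batch, seq_len, embed_dim)

def sinusoidal_positional_encoding(seq_len, embed_dim):
    # Fixed sin/cos positional encoding from "Attention Is All You Need"
    pos = torch.arange(seq_len, dtype=torch.float).unsqueeze(1)
    i = torch.arange(0, embed_dim, 2, dtype=torch.float)
    angle = pos / (10000 ** (i / embed_dim))
    pe = torch.zeros(seq_len, embed_dim)
    pe[:, 0::2] = torch.sin(angle)
    pe[:, 1::2] = torch.cos(angle)
    return pe

x = torch.randn(2, 5, 16)                                      # (batch, seq_len, embed_dim)
x = x + sinusoidal_positional_encoding(5, 16)                  # inject word-order information
print(multi_head_self_attention(x, num_heads=4).shape)         # torch.Size([2, 5, 16])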
🧪 Example Code
Now, let’s look at a simple example of implementing a Transformer model using PyTorch.
Installing Required Libraries
pip install torch torchvision torchtext
Note that this example uses the legacy torchtext API (Field, BucketIterator, Multi30k.splits). In torchtext 0.9 these classes were moved to torchtext.legacy, and they were removed in later releases, so you may need an older torchtext version (or imports from torchtext.legacy) for the code below to run as written.
Importing Libraries
import torch
import torch.nn as nn
import torch.optim as optim
from torchtext.datasets import Multi30k
from torchtext.data import Field, BucketIterator
Data Preprocessing
SRC = Field(tokenize="spacy", tokenizer_language="de", init_token='<sos>', eos_token='<eos>', lower=True)
TRG = Field(tokenize="spacy", tokenizer_language="en", init_token='<sos>', eos_token='<eos>', lower=True)
train_data, valid_data, test_data = Multi30k.splits(exts=('.de', '.en'), fields=(SRC, TRG))
SRC.build_vocab(train_data, min_freq=2)
TRG.build_vocab(train_data, min_freq=2)
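The spaCy tokenizers also require the German and English language models to be installed. With spaCy 2.x (which the short "de"/"en" names above imply), something like the following usually works; with spaCy 3.x you would instead install de_core_news_sm and en_core_web_sm and pass those full names as tokenizer_language:
python -m spacy download de
python -m spacy download en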
Model Definition
class TransformerModel(nn.Module):
    def __init__(self, src_vocab_size, trg_vocab_size, src_pad_idx, trg_pad_idx,
                 embed_size=512, num_heads=8, num_layers=3, forward_expansion=4,
                 dropout=0.1, max_len=100):
        super(TransformerModel, self).__init__()
        # Token and learned positional embeddings for source and target
        self.src_word_embedding = nn.Embedding(src_vocab_size, embed_size)
        self.src_position_embedding = nn.Embedding(max_len, embed_size)
        self.trg_word_embedding = nn.Embedding(trg_vocab_size, embed_size)
        self.trg_position_embedding = nn.Embedding(max_len, embed_size)
        self.transformer = nn.Transformer(embed_size, num_heads, num_layers, num_layers,
                                          forward_expansion * embed_size, dropout)
        self.fc_out = nn.Linear(embed_size, trg_vocab_size)
        self.dropout = nn.Dropout(dropout)
        self.src_pad_idx = src_pad_idx
        self.trg_pad_idx = trg_pad_idx

    def make_src_mask(self, src):
        # (src_len, N) -> (N, src_len) boolean mask that is True at padding positions
        src_mask = src.transpose(0, 1) == self.src_pad_idx
        return src_mask

    def make_trg_mask(self, trg):
        # Causal mask so each target position can only attend to earlier positions
        trg_len = trg.shape[0]
        trg_mask = self.transformer.generate_square_subsequent_mask(trg_len).to(trg.device)
        return trg_mask

    def forward(self, src, trg):
        src_seq_length, N = src.shape
        trg_seq_length, N = trg.shape
        src_positions = torch.arange(0, src_seq_length).unsqueeze(1).expand(src_seq_length, N).to(src.device)
        trg_positions = torch.arange(0, trg_seq_length).unsqueeze(1).expand(trg_seq_length, N).to(trg.device)
        embed_src = self.dropout(self.src_word_embedding(src) + self.src_position_embedding(src_positions))
        embed_trg = self.dropout(self.trg_word_embedding(trg) + self.trg_position_embedding(trg_positions))
        src_padding_mask = self.make_src_mask(src)
        trg_mask = self.make_trg_mask(trg)
        out = self.transformer(embed_src, embed_trg,
                               src_key_padding_mask=src_padding_mask, tgt_mask=trg_mask)
        out = self.fc_out(out)
        return out
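As a quick sanity check (not part of the original article), the model can be run on random token indices to confirm the tensor shapes it expects and returns: inputs are (sequence_length, batch_size) tensors of token IDs, and the output has shape (target_length, batch_size, target_vocab_size). The vocabulary sizes and padding index below are illustrative:
# Hypothetical sanity check with random token indices
dummy_model = TransformerModel(src_vocab_size=1000, trg_vocab_size=1200, src_pad_idx=1, trg_pad_idx=1)
src = torch.randint(0, 1000, (12, 4))   # (src_len, batch) of token indices
trg = torch.randint(0, 1200, (9, 4))    # (trg_len, batch) of token indices
out = dummy_model(src, trg)
print(out.shape)                        # torch.Size([9, 4, 1200])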
Model Initialization and Training Setup
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
src_vocab_size = len(SRC.vocab)
trg_vocab_size = len(TRG.vocab)
src_pad_idx = SRC.vocab.stoi['<pad>']
trg_pad_idx = TRG.vocab.stoi['<pad>']
model = TransformerModel(src_vocab_size, trg_vocab_size, src_pad_idx, trg_pad_idx).to(device)
optimizer = optim.Adam(model.parameters(), lr=0.0005)
criterion = nn.CrossEntropyLoss(ignore_index=trg_pad_idx)

# Batch iterators for training (train_iterator is used in the training loop below);
# the batch size of 128 is an illustrative choice
train_iterator, valid_iterator, test_iterator = BucketIterator.splits(
    (train_data, valid_data, test_data), batch_size=128, device=device)
Training Loop
for epoch in range(10):
    model.train()
    for batch in train_iterator:
        src = batch.src.to(device)
        trg = batch.trg.to(device)
        # Teacher forcing: feed the target shifted right and predict the next token
        output = model(src, trg[:-1, :])
        output = output.reshape(-1, output.shape[2])
        trg = trg[1:, :].reshape(-1)
        optimizer.zero_grad()
        loss = criterion(output, trg)
        loss.backward()
        optimizer.step()
    print(f'Epoch {epoch} Loss {loss.item():.4f}')
Example Explanation
- Data Preprocessing: Loads and preprocesses the Multi30k German-English translation data using torchtext.
- Model Definition: Implements the Transformer architecture in the TransformerModel class.
- Model Initialization: Sets the vocabulary sizes and padding indices, then initializes the model, optimizer, loss function, and data iterators.
- Training Loop: Iterates over the batches to train the model.
This example is a simplified implementation intended to convey the basic idea of the Transformer; further tuning and more careful data preprocessing are required for practical use.
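For completeness, here is one possible greedy-decoding sketch for translating a sentence with the trained model. It is not part of the original article and assumes the SRC/TRG fields, model, and device defined above:
def translate_sentence(model, sentence_tokens, src_field, trg_field, device, max_len=50):
    model.eval()
    # Numericalize the (already tokenized, lowercased) source sentence and add <sos>/<eos>
    tokens = [src_field.init_token] + sentence_tokens + [src_field.eos_token]
    src_indices = [src_field.vocab.stoi[token] for token in tokens]
    src = torch.LongTensor(src_indices).unsqueeze(1).to(device)          # (src_len, 1)

    trg_indices = [trg_field.vocab.stoi[trg_field.init_token]]
    for _ in range(max_len):
        trg = torch.LongTensor(trg_indices).unsqueeze(1).to(device)      # (trg_len, 1)
        with torch.no_grad():
            output = model(src, trg)                                     # (trg_len, 1, trg_vocab)
        next_token = output[-1, 0].argmax().item()                       # greedy: most likely next word
        trg_indices.append(next_token)
        if next_token == trg_field.vocab.stoi[trg_field.eos_token]:
            break
    return [trg_field.vocab.itos[idx] for idx in trg_indices[1:]]

# Example usage (the German tokens are illustrative)
print(translate_sentence(model, ['eine', 'katze', 'sitzt', 'auf', 'dem', 'tisch'], SRC, TRG, device))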
🔗 Transformers in BERT and GPT
BERT (Bidirectional Encoder Representations from Transformers)
- Structure: Uses only the Encoder part of the Transformer.
- Characteristics:
- Understands context bidirectionally.
- Trained using Masked Language Modeling.
- Applications:
- Used for various NLP tasks like sentence classification, named entity recognition, and question answering.
GPT (Generative Pre-trained Transformer)
- Structure: Uses only the Decoder part of the Transformer.
- Characteristics:
- Understands context in a unidirectional (forward) manner.
- Trained using language modeling to predict the next word.
- Applications:
- Excels in generative tasks like text generation, translation, and summarization (a short sketch contrasting the two designs follows below).
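To connect these two designs back to PyTorch, the sketch below (illustrative, not an implementation of BERT or GPT themselves) builds an encoder-only stack where every token can attend to every other token, and then applies a causal mask so each token only sees earlier tokens, which is the core of a decoder-only, GPT-style model:
import torch
import torch.nn as nn

embed_size, num_heads, seq_len, batch = 64, 4, 8, 2
x = torch.randn(seq_len, batch, embed_size)   # a toy sequence of embeddings

encoder_layer = nn.TransformerEncoderLayer(d_model=embed_size, nhead=num_heads)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)

# BERT-style (encoder-only): no attention mask, so every token sees the whole sequence
bidirectional_out = encoder(x)

# GPT-style (decoder-only): a causal mask hides future tokens, so each token only sees
# the tokens before it; GPT's blocks are essentially this masked self-attention stack
# without encoder-decoder attention
causal_mask = torch.triu(torch.full((seq_len, seq_len), float('-inf')), diagonal=1)
unidirectional_out = encoder(x, mask=causal_mask)

print(bidirectional_out.shape, unidirectional_out.shape)   # both torch.Size([8, 2, 64])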
🧐 Conclusion
The Transformer is a pivotal architecture in modern NLP, and its flexibility and performance have led to its application across various models and tasks. In this article, we explored the basic concepts and structure of Transformers and provided a simple implementation example to aid understanding. For a deeper understanding, it is recommended to refer to the original paper and various implementation examples.
📚 References
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). "Attention Is All You Need." Advances in Neural Information Processing Systems 30 (NIPS 2017).