Transformer
Definition
The transformer is a deep learning architecture introduced in the 2017 paper 'Attention Is All You Need' (Vaswani et al.) that relies entirely on self-attention mechanisms rather than recurrence or convolution. It displaced recurrent neural networks as the dominant architecture for language tasks and now underpins most large language models, including the GPT and BERT families. Because transformers process all input tokens in parallel rather than one at a time, they train far more efficiently on modern GPUs.
How It Works
Transformers use multi-head self-attention to weigh the importance of each token relative to every other token in a sequence. The architecture consists of encoder and decoder stacks, each containing attention layers and feed-forward networks with residual connections. Positional encodings are added to input embeddings since the architecture has no inherent sense of token order.
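The attention mechanism described above can be sketched in a few lines of NumPy. This is a minimal illustration of scaled dot-product self-attention, not a production implementation; the sequence length, embedding size, and random projection matrices are arbitrary choices for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: subtract the row max before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)  # each row sums to 1: how much each
                                        # token attends to every other token
    return weights @ V

# Self-attention: queries, keys, and values all derive from the same sequence.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                      # 4 tokens, embedding dim 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = scaled_dot_product_attention(x @ Wq, x @ Wk, x @ Wv)
print(out.shape)  # (4, 8): one contextualized vector per input token
```

Multi-head attention runs several such attention operations in parallel on lower-dimensional projections of the input and concatenates the results, letting each head specialize in a different relationship between tokens.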
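The positional encodings mentioned above can likewise be sketched directly. This follows the sinusoidal scheme from the original paper, where each position is mapped to alternating sine and cosine values at geometrically spaced frequencies; the sequence length and model dimension here are arbitrary example values.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    pos = np.arange(seq_len)[:, None]            # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]        # even dimension indices
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                 # even dims get sine
    pe[:, 1::2] = np.cos(angles)                 # odd dims get cosine
    return pe

pe = sinusoidal_positional_encoding(seq_len=16, d_model=8)
print(pe.shape)  # (16, 8)
# These values are added elementwise to the token embeddings before the
# first layer, giving the model a signal for token order.
```

Each position gets a unique pattern, and nearby positions get similar patterns, which is what lets attention layers recover relative order from otherwise order-blind inputs.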