An Introduction to Transformers

Understanding Modern AI from the Inside Out

This is an educational resource that teaches how modern AI systems work by having you build them yourself. No black boxes - just the actual math and code that power language models and image generators.

📖 Read Online

What's Inside

Understanding Gradients

Calculate a complete transformer forward and backward pass by hand. Using only basic Python (no NumPy, no PyTorch), you compute every matrix multiplication, every activation function, every gradient. By the end, you understand transformers not because someone explained them in abstract terms, but because you calculated every operation yourself.

Tokenization & Embeddings - How text becomes vectors
QKV Projections - What Query, Key, Value actually mean
Attention - The softmax-weighted sum that made transformers possible
Multi-Head Attention - Running parallel attention operations
Feed-Forward Network - The MLP that processes attended information
Layer Normalization - Stabilizing activations for training
Cross-Entropy Loss - Measuring prediction quality
Loss Gradients & Backpropagation - The complete backward pass
AdamW Optimizer - How weights actually get updated
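
As a taste of the by-hand style described above, here is a minimal sketch of scaled dot-product attention in plain Python. It is not taken from the notebooks: the function names and the toy `Q`, `K`, `V` values are made up for illustration, and masking and multiple heads are left out.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors (no masking)."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        # The output is the softmax-weighted sum of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
    return outputs

# Toy inputs (hypothetical): 2 tokens, dimension 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
result = attention(Q, K, V)
```

Each output row is a blend of the rows of `V`, weighted toward the value whose key best matches the query.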

Building a Transformer

Build a complete GPT-style transformer in PyTorch. This section covers the architecture that powers modern language models, from embeddings to interpretability tools.

Embeddings & Positions - Token embeddings, ALiBi, RoPE
Scaled Dot-Product Attention - Attention with causal masking
Multi-Head Attention - Parallel attention heads
Feed-Forward Networks - Position-wise MLPs with GELU
Transformer Block - Pre-LN, residuals, and all components combined
Complete Model - Stacking blocks with embedding and output layers
Training at Scale - Gradient accumulation, validation splits
KV-Cache - Fast inference through caching
Interpretability - Logit lens, attention analysis, induction heads
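
Causal masking, listed above, can be sketched without any framework. The example below uses plain Python lists rather than PyTorch to stay dependency-free; `causal_attention_weights` and the score matrix are hypothetical names and values, not the section's actual code. In PyTorch the same idea is commonly written with a lower-triangular mask and `masked_fill`.

```python
import math

def causal_attention_weights(scores):
    """Mask out future positions in a square score matrix, then softmax each row."""
    n = len(scores)
    weights = []
    for i in range(n):
        # Position i may attend only to positions 0..i; later ones get -inf.
        masked = [scores[i][j] if j <= i else float("-inf") for j in range(n)]
        m = max(masked)  # finite, because the diagonal entry is never masked
        exps = [math.exp(s - m) for s in masked]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Hypothetical raw attention scores for a 3-token sequence.
scores = [[0.0, 1.0, 2.0],
          [1.0, 1.0, 5.0],
          [0.5, 0.5, 0.5]]
weights = causal_attention_weights(scores)
```

The first token can only attend to itself, so its row is all weight on position 0, no matter how large the masked scores were.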

Fine-Tuning a Transformer

Take a pretrained model and adapt it for specific tasks using modern techniques.

Supervised Fine-Tuning (SFT) - Instruction formatting, loss masking, LoRA
Reward Modeling - Preference data, training reward models, evaluation
RLHF - PPO algorithm, KL penalty, training dynamics, reference models
DPO - Direct Preference Optimization as an alternative to RLHF
Advanced Topics - Memory optimization, hyperparameter tuning, evaluation metrics, common pitfalls
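
To make the DPO entry above concrete, here is a minimal sketch of the DPO loss for a single preference pair, assuming the summed log-probabilities of each response are already computed. The function name, argument names, and the default `beta=0.1` are illustrative assumptions, not the section's code.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of sequence log-probabilities."""
    # How much more the policy likes each response than the reference does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) == softplus(-margin), written in a stable form.
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

# Hypothetical log-probabilities: the policy has shifted toward the chosen response.
loss = dpo_loss(policy_chosen_logp=-10.0, policy_rejected_logp=-14.0,
                ref_chosen_logp=-12.0, ref_rejected_logp=-12.0)
```

When the policy matches the reference exactly, the margin is zero and the loss is log(2); it drops below that as the policy favors the chosen response more than the reference does.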

Reasoning with Transformers

Explore how models can "think" before answering. This section covers techniques from simple prompting to training your own reasoning model.

Chain-of-Thought - Prompting models to show their work
Self-Consistency - Sampling multiple chains, majority voting
Tree of Thoughts - Exploring and pruning reasoning paths
Process Reward Models - Scoring individual reasoning steps
Best-of-N Sampling - Using verification to select best solutions
Monte Carlo Tree Search - Search algorithms for reasoning
Budget Forcing - Controlling reasoning length with "Wait" tokens
GRPO - RL without a critic (DeepSeek's approach)
Distillation - Transferring reasoning to smaller models
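
The "RL without a critic" idea behind GRPO can be sketched as follows: sample a group of answers to the same prompt, score each one, and use each reward's distance from the group mean as its advantage. This is a simplified illustration under stated assumptions (population standard deviation, a small `eps` to avoid division by zero, hypothetical names); it omits the policy-gradient update itself.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled answers to the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    # Answers that beat the group average get positive advantages, the rest negative.
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical rewards: 1.0 for a correct final answer, 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Because the baseline comes from the group itself, no separate value network is needed. If every answer in a group gets the same reward, all advantages are zero and that group contributes no learning signal.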

From Noise to Images

Learn how AI generates images from text prompts. This section builds from flow matching fundamentals to a working latent diffusion model.

Flow Matching - Velocity fields, noise-to-data paths
Diffusion Transformer - Patchifying images, attention for generation
Class Conditioning - Classifier-free guidance (CFG)
Text Conditioning - CLIP embeddings, cross-attention
Latent Diffusion - VAE compression, scaling to larger images
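
The noise-to-data path in flow matching can be illustrated with a straight-line interpolation. The sketch below assumes one common convention (noise at t=0, data at t=1, constant velocity `data - noise`); the notebooks may use a different convention, and the function name and sample values are hypothetical.

```python
import random

def flow_matching_pair(noise, data, t):
    """Return the point at time t on the straight noise-to-data path, plus its velocity."""
    # Linear interpolation: pure noise at t=0, pure data at t=1.
    x_t = [(1.0 - t) * n + t * d for n, d in zip(noise, data)]
    # Along a straight path the velocity is the same at every t.
    velocity = [d - n for n, d in zip(noise, data)]
    return x_t, velocity

random.seed(0)
data = [0.5, -1.0, 2.0]  # stand-in for a flattened image
noise = [random.gauss(0.0, 1.0) for _ in data]
t = random.random()
x_t, v_target = flow_matching_pair(noise, data, t)
```

In training, a model would be given `x_t` and `t` and asked to predict `v_target`; at generation time, following the predicted velocity from noise traces a path toward a data sample.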

Philosophy

These materials follow a simple principle: the best way to understand something is to build it yourself.

No hand-waving. Every operation is explicit.
No magic. You see exactly what the computer sees.
No shortcuts. Understanding comes from doing the work.

The architecture that powers GPT, Claude, Stable Diffusion, and other frontier models isn't beyond comprehension. It's just matrix multiplications, attention mechanisms, and optimization - repeated at scale.

Getting Started

Prerequisites

Python 3.12
A GPU is recommended for the PyTorch sections (CUDA or MPS)
Basic understanding of calculus (chain rule, partial derivatives)
Familiarity with Python

Installation

# Clone the repository
git clone https://github.com/zhubert/intro-to-transformers.git
cd intro-to-transformers

# Install dependencies with uv (recommended)
uv sync

# Or with pip
pip install -e .

Running the Notebooks

# Activate the virtual environment
source .venv/bin/activate

# Start Jupyter
jupyter notebook

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

Attention Is All You Need (Vaswani et al., 2017)
Andrej Karpathy's educational videos and nanoGPT
The PyTorch and Hugging Face teams
Everyone who has worked to make AI more understandable
