Diffusion Models
Diffusion models generate data by learning to reverse a gradual noising process. They've become the dominant approach for high-quality image generation, powering DALL-E, Stable Diffusion, and Midjourney.
The Core Idea
Forward Process (Add Noise)
Clean image → Noisy image → ... → Pure noise
x₀ → x₁ → ... → x_T
Gradually add Gaussian noise over T steps.
Reverse Process (Remove Noise)
Pure noise → Less noisy → ... → Clean image
x_T → x_{T-1} → ... → x₀
Learn to denoise step by step.
Mathematical Framework
Forward Process
q(xₜ | xₜ₋₁) = N(xₜ; √(1-βₜ)xₜ₋₁, βₜI)
At each step, scale down and add noise.
Closed Form (Jump to Any Step)
q(xₜ | x₀) = N(xₜ; √ᾱₜ x₀, (1-ᾱₜ)I)
where ᾱₜ = ∏ᵢ₌₁ᵗ (1-βᵢ), i.e. the cumulative product of αᵢ = 1-βᵢ
Can directly compute xₜ from x₀ without iterating.
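The closed form above can be sketched in a few lines of NumPy; the linear beta schedule below is one common illustrative choice, not the only option.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)      # linear noise schedule (illustrative)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)         # alpha_bar_t = prod_{i<=t} (1 - beta_i)

def q_sample(x0, t, noise):
    """Jump directly from x0 to x_t: x_t = sqrt(abar_t) x0 + sqrt(1-abar_t) eps."""
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * noise

x0 = np.ones((4, 4))                    # toy "image"
noise = np.zeros((4, 4))                # zero noise isolates the scaling term
x_early = q_sample(x0, 0, noise)        # nearly the clean image
x_late = q_sample(x0, T - 1, noise)     # signal almost entirely scaled away
```

Because ᾱₜ shrinks toward 0 as t grows, the signal term fades while the noise term dominates, which is exactly why x_T is (approximately) pure Gaussian noise.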
Reverse Process
p_θ(xₜ₋₁ | xₜ) = N(xₜ₋₁; μ_θ(xₜ, t), σₜ²I)
Neural network predicts the mean of denoised distribution.
Training
Simple Objective
Loss = ||ε - ε_θ(xₜ, t)||²
- Sample noise ε
- Add it to image: xₜ = √ᾱₜ x₀ + √(1-ᾱₜ) ε
- Train network to predict ε from xₜ
Training Algorithm
for batch in dataloader:
    x_0 = batch                                  # clean images
    t = randint(0, T, (x_0.shape[0],))           # random timestep per sample
    noise = randn_like(x_0)
    # Create noisy version via the closed-form forward process
    x_t = sqrt_alpha_bar[t] * x_0 + sqrt_one_minus_alpha_bar[t] * noise
    # Predict the noise that was added
    predicted_noise = model(x_t, t)
    # Simple MSE loss between true and predicted noise
    loss = mse_loss(predicted_noise, noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
Sampling
Basic DDPM Sampling
def sample(model, shape):
    x = randn(shape)  # Start from pure noise
    for t in reversed(range(T)):
        # Predict the noise in the current sample
        pred_noise = model(x, t)
        # Denoise one step (note sqrt(1 - beta[t]) = sqrt(alpha[t]))
        x = (x - beta[t] / sqrt(1 - alpha_bar[t]) * pred_noise) / sqrt(1 - beta[t])
        # Add fresh noise (except at the last step)
        if t > 0:
            x += sqrt(beta[t]) * randn_like(x)
    return x
DDIM (Faster Sampling)
Deterministic sampling, skip steps:
1000 steps → 50 steps with minimal quality loss
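One deterministic DDIM update (η = 0) can be sketched as below; because it first reconstructs an x₀ estimate and then re-noises it to an earlier timestep, t_prev can be many steps before t, which is what allows skipping from 1000 steps down to ~50.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)      # same illustrative schedule as DDPM
alpha_bars = np.cumprod(1.0 - betas)

def ddim_step(x_t, t, t_prev, pred_noise):
    """Jump deterministically from timestep t to t_prev (t_prev << t allowed)."""
    ab_t, ab_prev = alpha_bars[t], alpha_bars[t_prev]
    # Estimate x0 from the current sample and the predicted noise
    x0_pred = (x_t - np.sqrt(1.0 - ab_t) * pred_noise) / np.sqrt(ab_t)
    # Re-noise the x0 estimate to the earlier timestep, with no fresh randomness
    return np.sqrt(ab_prev) * x0_pred + np.sqrt(1.0 - ab_prev) * pred_noise
```

If the predicted noise is exact, this step lands exactly on the closed-form q(x_{t_prev} | x₀) mean trajectory, regardless of how many timesteps were skipped.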
Architecture
U-Net Backbone
Input (noisy image + time embedding)
↓
[Encoder blocks with downsample]
↓
[Middle block]
↓
[Decoder blocks with upsample]
+ Skip connections from encoder
↓
Output (predicted noise)
Time Conditioning
# Sinusoidal embedding like transformers
t_embed = get_timestep_embedding(t)
# Add to network activations
h = h + mlp(t_embed)
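One possible implementation of the sinusoidal embedding, mirroring the transformer positional encoding; the dimension 128 is an illustrative choice.

```python
import numpy as np

def get_timestep_embedding(t, dim=128):
    """Map an integer timestep to a dim-sized vector of sines and cosines."""
    half = dim // 2
    # Geometrically spaced frequencies, as in transformer positional encodings
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    args = t * freqs
    return np.concatenate([np.sin(args), np.cos(args)])
```

Distinct timesteps map to distinct smooth vectors, so the network can condition its denoising behavior on how noisy the input is.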
Attention
Self-attention layers for global context.
Conditional Generation
Classifier Guidance
Use classifier gradient to steer generation:
ε̃ = ε_θ(xₜ, t) - s∇ₓlog p(y|xₜ)
Classifier-Free Guidance
Train with and without condition, interpolate:
ε̃ = ε_θ(xₜ, t, ∅) + s(ε_θ(xₜ, t, c) - ε_θ(xₜ, t, ∅))
No separate classifier needed!
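Both guidance rules are simple arithmetic on noise predictions. In this sketch the eps_* arrays stand in for network outputs, grad_log_p for the classifier gradient, and s is the guidance scale.

```python
import numpy as np

def classifier_guidance(eps_pred, grad_log_p, s):
    # Shift the noise prediction along the classifier's gradient of log p(y|x_t)
    return eps_pred - s * grad_log_p

def classifier_free_guidance(eps_uncond, eps_cond, s):
    # Extrapolate from the unconditional toward the conditional prediction
    return eps_uncond + s * (eps_cond - eps_uncond)
```

With s = 1, classifier-free guidance reduces to the plain conditional prediction; s > 1 exaggerates the condition's influence, typically trading diversity for prompt fidelity.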
Latent Diffusion (Stable Diffusion)
Key Innovation
Diffusion in latent space, not pixel space:
Image → [VAE Encoder] → Latent → Diffusion → Latent → [VAE Decoder] → Image
Benefits
- Much smaller latent space (64×64 vs 512×512)
- Faster training and sampling
- Enables high-resolution generation
Architecture
1. VAE: Compress image 8× (512×512 → 64×64)
2. U-Net: Denoise in latent space
3. Text conditioning: CLIP text encoder
4. VAE Decoder: Decompress to image
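The shape flow through this pipeline can be sketched with dummy stand-ins; vae_encode, unet_denoise, vae_decode, and the embedding sizes here are hypothetical placeholders, not the real Stable Diffusion modules.

```python
import numpy as np

def vae_encode(img):               # 512x512x3 -> 64x64x4 (8x spatial compression)
    return np.zeros((64, 64, 4))   # dummy: used during training to get latents

def unet_denoise(z, t, text_emb):  # one denoising step, entirely in latent space
    return z * 0.99                # dummy stand-in for the U-Net

def vae_decode(z):                 # 64x64x4 -> 512x512x3
    return np.zeros((512, 512, 3)) # dummy stand-in for the VAE decoder

text_emb = np.zeros(768)           # e.g. a CLIP text embedding (size illustrative)
z0 = vae_encode(np.zeros((512, 512, 3)))  # training-time path: image -> latent
z = np.random.randn(64, 64, 4)     # generation starts from latent noise
for t in reversed(range(50)):      # e.g. 50 DDIM steps
    z = unet_denoise(z, t, text_emb)
image = vae_decode(z)              # decompress the denoised latent to pixels
```

The point is that the expensive iterative loop runs on a 64×64×4 latent, and the single decoder pass at the end is what produces the full-resolution image.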
Applications
Text-to-Image
- DALL-E 2, 3
- Stable Diffusion
- Midjourney
- Imagen
Image-to-Image
- Style transfer
- Inpainting
- Super-resolution
- Editing (with ControlNet)
Other Domains
- Video generation (Sora)
- Audio synthesis
- 3D generation
- Molecule design
Diffusion vs GANs
| Aspect | Diffusion | GANs |
|---|---|---|
| Training | Stable | Unstable |
| Mode coverage | Excellent | Can collapse |
| Sample quality | SOTA | Good |
| Sampling speed | Slow (many steps) | Fast (one pass) |
| Diversity | High | Can be limited |
Recent Advances
Consistency Models
Distill diffusion into single-step generator.
Flow Matching
Simplified training objective, connects to flows.
Rectified Flow
Straighten sampling paths for faster generation.
Key Takeaways
- Diffusion: learn to reverse a noising process
- Train by predicting noise added to images
- Sample by iterative denoising
- Latent diffusion: diffuse in compressed space
- Classifier-free guidance: conditional generation
- State-of-the-art for image generation quality