Diffusion Models
Diffusion models generate data by learning to reverse a gradual noising process. They've become the dominant approach for high-quality image generation, powering DALL-E, Stable Diffusion, and Midjourney.
The Core Idea
Forward Process (Add Noise)
Clean image → Noisy image → ... → Pure noise
x₀ → x₁ → ... → x_T
Gradually add Gaussian noise over T steps.
Reverse Process (Remove Noise)
Pure noise → Less noisy → ... → Clean image
x_T → x_{T-1} → ... → x₀
Learn to denoise step by step.
Mathematical Framework
Forward Process
q(xₜ | xₜ₋₁) = N(xₜ; √(1-βₜ)xₜ₋₁, βₜI)
At each step, scale down and add noise.
Closed Form (Jump to Any Step)
q(xₜ | x₀) = N(xₜ; √ᾱₜ x₀, (1-ᾱₜ)I)
where ᾱₜ = ∏ᵢ₌₁ᵗ (1-βᵢ), i.e. the cumulative product of αᵢ = 1-βᵢ
Can directly compute xₜ from x₀ without iterating.
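The closed form above can be sketched in a few lines of NumPy; the linear beta schedule below is one common illustrative choice, not the only option.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)      # linear noise schedule (illustrative)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)         # alpha_bar_t = prod_{i<=t} (1 - beta_i)

def q_sample(x0, t, noise):
    """Jump directly from x0 to x_t: x_t = sqrt(abar_t) x0 + sqrt(1-abar_t) eps."""
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * noise

x0 = np.ones((4, 4))                    # toy "image"
noise = np.zeros((4, 4))                # zero noise isolates the scaling term
x_early = q_sample(x0, 0, noise)        # nearly the clean image
x_late = q_sample(x0, T - 1, noise)     # signal almost entirely scaled away
```

Because ᾱₜ shrinks toward 0 as t grows, the signal term fades while the noise term dominates, which is exactly why x_T is (approximately) pure Gaussian noise.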
Reverse Process
p_θ(xₜ₋₁ | xₜ) = N(xₜ₋₁; μ_θ(xₜ, t), σₜ²I)
Neural network predicts the mean of denoised distribution.
Training
Simple Objective
Loss = ||ε - ε_θ(xₜ, t)||²
- Sample noise ε
- Add it to image: xₜ = √ᾱₜ x₀ + √(1-ᾱₜ) ε
- Train network to predict ε from xₜ
Training Algorithm
for batch in dataloader:
    x_0 = batch                                  # clean images
    t = randint(0, T, (x_0.shape[0],))           # random timestep per sample
    noise = randn_like(x_0)
    # Create noisy version via the closed-form forward process
    x_t = sqrt_alpha_bar[t] * x_0 + sqrt_one_minus_alpha_bar[t] * noise
    # Predict the noise that was added
    predicted_noise = model(x_t, t)
    # Simple MSE loss between true and predicted noise
    loss = mse_loss(predicted_noise, noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
Sampling
Basic DDPM Sampling
def sample(model, shape):
    x = randn(shape)  # Start from pure noise
    for t in reversed(range(T)):
        # Predict the noise in the current sample
        pred_noise = model(x, t)
        # Denoise one step (note sqrt(1 - beta[t]) = sqrt(alpha[t]))
        x = (x - beta[t] / sqrt(1 - alpha_bar[t]) * pred_noise) / sqrt(1 - beta[t])
        # Add fresh noise (except at the last step)
        if t > 0:
            x += sqrt(beta[t]) * randn_like(x)
    return x
DDIM (Faster Sampling)
Deterministic sampling, skip steps:
1000 steps → 50 steps with minimal quality loss
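One deterministic DDIM update (η = 0) can be sketched as below; because it first reconstructs an x₀ estimate and then re-noises it to an earlier timestep, t_prev can be many steps before t, which is what allows skipping from 1000 steps down to ~50.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)      # same illustrative schedule as DDPM
alpha_bars = np.cumprod(1.0 - betas)

def ddim_step(x_t, t, t_prev, pred_noise):
    """Jump deterministically from timestep t to t_prev (t_prev << t allowed)."""
    ab_t, ab_prev = alpha_bars[t], alpha_bars[t_prev]
    # Estimate x0 from the current sample and the predicted noise
    x0_pred = (x_t - np.sqrt(1.0 - ab_t) * pred_noise) / np.sqrt(ab_t)
    # Re-noise the x0 estimate to the earlier timestep, with no fresh randomness
    return np.sqrt(ab_prev) * x0_pred + np.sqrt(1.0 - ab_prev) * pred_noise
```

If the predicted noise is exact, this step lands exactly on the closed-form q(x_{t_prev} | x₀) mean trajectory, regardless of how many timesteps were skipped.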
Architecture
U-Net Backbone
Input (noisy image + time embedding)
↓
[Encoder blocks with downsample]
↓
[Middle block]
↓
[Decoder blocks with upsample]
+ Skip connections from encoder
↓
Output (predicted noise)
Time Conditioning
# Sinusoidal embedding like transformers
t_embed = get_timestep_embedding(t)
# Add to network activations
h = h + mlp(t_embed)
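One possible implementation of the sinusoidal embedding, mirroring the transformer positional encoding; the dimension 128 is an illustrative choice.

```python
import numpy as np

def get_timestep_embedding(t, dim=128):
    """Map an integer timestep to a dim-sized vector of sines and cosines."""
    half = dim // 2
    # Geometrically spaced frequencies, as in transformer positional encodings
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    args = t * freqs
    return np.concatenate([np.sin(args), np.cos(args)])
```

Distinct timesteps map to distinct smooth vectors, so the network can condition its denoising behavior on how noisy the input is.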
Attention
Self-attention layers for global context.
Conditional Generation
Classifier Guidance
Use classifier gradient to steer generation:
ε̃ = ε_θ(xₜ, t) - s∇ₓlog p(y|xₜ)
Classifier-Free Guidance
Train with and without condition, interpolate:
ε̃ = ε_θ(xₜ, t, ∅) + s(ε_θ(xₜ, t, c) - ε_θ(xₜ, t, ∅))
No separate classifier needed!
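Both guidance rules are simple arithmetic on noise predictions. In this sketch the eps_* arrays stand in for network outputs, grad_log_p for the classifier gradient, and s is the guidance scale.

```python
import numpy as np

def classifier_guidance(eps_pred, grad_log_p, s):
    # Shift the noise prediction along the classifier's gradient of log p(y|x_t)
    return eps_pred - s * grad_log_p

def classifier_free_guidance(eps_uncond, eps_cond, s):
    # Extrapolate from the unconditional toward the conditional prediction
    return eps_uncond + s * (eps_cond - eps_uncond)
```

With s = 1, classifier-free guidance reduces to the plain conditional prediction; s > 1 exaggerates the condition's influence, typically trading diversity for prompt fidelity.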
Latent Diffusion (Stable Diffusion)
Key Innovation
Diffusion in latent space, not pixel space:
Image → [VAE Encoder] → Latent → Diffusion → Latent → [VAE Decoder] → Image
Benefits
- Much smaller latent space (64×64 vs 512×512)
- Faster training and sampling
- Enables high-resolution generation
Architecture
1. VAE: Compress image 8× (512×512 → 64×64)
2. U-Net: Denoise in latent space
3. Text conditioning: CLIP text encoder
4. VAE Decoder: Decompress to image
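The shape flow through this pipeline can be sketched with dummy stand-ins; vae_encode, unet_denoise, vae_decode, and the embedding sizes here are hypothetical placeholders, not the real Stable Diffusion modules.

```python
import numpy as np

def vae_encode(img):               # 512x512x3 -> 64x64x4 (8x spatial compression)
    return np.zeros((64, 64, 4))   # dummy: used during training to get latents

def unet_denoise(z, t, text_emb):  # one denoising step, entirely in latent space
    return z * 0.99                # dummy stand-in for the U-Net

def vae_decode(z):                 # 64x64x4 -> 512x512x3
    return np.zeros((512, 512, 3)) # dummy stand-in for the VAE decoder

text_emb = np.zeros(768)           # e.g. a CLIP text embedding (size illustrative)
z0 = vae_encode(np.zeros((512, 512, 3)))  # training-time path: image -> latent
z = np.random.randn(64, 64, 4)     # generation starts from latent noise
for t in reversed(range(50)):      # e.g. 50 DDIM steps
    z = unet_denoise(z, t, text_emb)
image = vae_decode(z)              # decompress the denoised latent to pixels
```

The point is that the expensive iterative loop runs on a 64×64×4 latent, and the single decoder pass at the end is what produces the full-resolution image.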
Applications
Text-to-Image
- DALL-E 2, 3
- Stable Diffusion
- Midjourney
- Imagen
Image-to-Image
- Style transfer
- Inpainting
- Super-resolution
- Editing (with ControlNet)
Other Domains
- Video generation (Sora)
- Audio synthesis
- 3D generation
- Molecule design
Diffusion vs GANs
| Aspect | Diffusion | GANs |
|---|---|---|
| Training | Stable | Unstable |
| Mode coverage | Excellent | Can collapse |
| Sample quality | SOTA | Good |
| Sampling speed | Slow (many steps) | Fast (one pass) |
| Diversity | High | Can be limited |
Recent Advances
Consistency Models
Distill diffusion into single-step generator.
Flow Matching
Simplified training objective, connects to flows.
Rectified Flow
Straighten sampling paths for faster generation.
Key Takeaways
- Diffusion: learn to reverse a noising process
- Train by predicting noise added to images
- Sample by iterative denoising
- Latent diffusion: diffuse in compressed space
- Classifier-free guidance: conditional generation
- State-of-the-art for image generation quality