Advanced · Deep Learning

Understand diffusion models - the generative AI breakthrough behind DALL-E, Stable Diffusion, and state-of-the-art image generation.

Tags: diffusion, generative, stable-diffusion, image-generation, dall-e

Diffusion Models

Diffusion models generate data by learning to reverse a gradual noising process. They've become the dominant approach for high-quality image generation, powering DALL-E, Stable Diffusion, and Midjourney.

The Core Idea

Forward Process (Add Noise)

Clean image → Noisy image → ... → Pure noise
     x₀    →     x₁      → ... →    x_T

Gradually add Gaussian noise over T steps.

Reverse Process (Remove Noise)

Pure noise → Less noisy → ... → Clean image
    x_T    →    x_{T-1} → ... →    x₀

Learn to denoise step by step.

Mathematical Framework

Forward Process

q(xₜ | xₜ₋₁) = N(xₜ; √(1-βₜ)xₜ₋₁, βₜI)

At each step, scale down and add noise.

Closed Form (Jump to Any Step)

q(xₜ | x₀) = N(xₜ; √ᾱₜ x₀, (1-ᾱₜ)I)

where ᾱₜ = ∏ₛ₌₁ᵗ αₛ and αₛ = 1 - βₛ

This lets us compute xₜ directly from x₀ in a single step, without iterating through the chain.
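The closed form can be sketched in a few lines of NumPy. This is a minimal illustration, assuming the linear β schedule from the original DDPM paper (1e-4 to 0.02 over 1000 steps):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # linear schedule (DDPM defaults)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)       # ᾱₜ = ∏ₛ₌₁ᵗ (1 - βₛ)

def q_sample(x0, t, noise):
    """Jump directly from x₀ to xₜ using the closed form."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise

x0 = np.random.randn(32, 32)         # stand-in for a clean image
noise = np.random.randn(32, 32)
x_t = q_sample(x0, T - 1, noise)     # near t = T the sample is almost pure noise
```

Note that ᾱₜ shrinks toward zero as t grows, so the x₀ term vanishes and only noise remains.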

Reverse Process

p(xₜ₋₁ | xₜ) = N(xₜ₋₁; μθ(xₜ, t), σₜ²I)

A neural network predicts the mean of the denoising distribution. In practice the network predicts the noise ε, and the mean follows from it:

μθ(xₜ, t) = (1/√αₜ) (xₜ - (βₜ/√(1-ᾱₜ)) ε_θ(xₜ, t)),  where αₜ = 1 - βₜ

Training

Simple Objective

Loss = ||ε - ε_θ(xₜ, t)||²
  • Sample noise ε
  • Add it to image: xₜ = √ᾱₜ x₀ + √(1-ᾱₜ) ε
  • Train network to predict ε from xₜ

Training Algorithm

for batch in dataloader:
    x_0 = batch  # Clean images
    t = randint(1, T)  # Random timestep
    noise = randn_like(x_0)
    
    # Create noisy version via the closed form
    # (sqrt_alpha_bar etc. are precomputed from the β schedule)
    x_t = sqrt_alpha_bar[t] * x_0 + sqrt_one_minus_alpha_bar[t] * noise
    
    # Predict the noise that was added
    predicted_noise = model(x_t, t)
    
    # Simple MSE objective
    loss = mse_loss(predicted_noise, noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Sampling

Basic DDPM Sampling

def sample(model):
    x = randn(shape)  # Start with noise
    
    for t in reversed(range(T)):
        # Predict noise
        pred_noise = model(x, t)
        
        # Denoise one step
        x = (x - beta[t]/sqrt(1-alpha_bar[t]) * pred_noise) / sqrt(1-beta[t])
        
        # Add noise (except last step)
        if t > 0:
            x += sqrt(beta[t]) * randn_like(x)
    
    return x

DDIM (Faster Sampling)

Deterministic sampling, skip steps:

1000 steps → 50 steps with minimal quality loss
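A minimal sketch of the deterministic DDIM update (η = 0), assuming `eps_model` stands in for the trained noise predictor and `alpha_bar` is the precomputed ᾱ schedule:

```python
import numpy as np

def ddim_sample(eps_model, alpha_bar, shape, num_steps=50):
    """Deterministic DDIM sampling over a strided subset of timesteps."""
    T = len(alpha_bar)
    timesteps = np.linspace(T - 1, 0, num_steps).astype(int)  # e.g. 1000 → 50
    x = np.random.randn(*shape)
    for i, t in enumerate(timesteps):
        eps = eps_model(x, t)
        # Predict the clean image implied by the current noise estimate
        x0_pred = (x - np.sqrt(1 - alpha_bar[t]) * eps) / np.sqrt(alpha_bar[t])
        # Step to the previous (strided) timestep without adding fresh noise
        a_prev = alpha_bar[timesteps[i + 1]] if i + 1 < len(timesteps) else 1.0
        x = np.sqrt(a_prev) * x0_pred + np.sqrt(1 - a_prev) * eps
    return x

# With a trained eps_model, e.g.:
# samples = ddim_sample(eps_model, alpha_bar, shape=(3, 64, 64))
```

Because no noise is injected between steps, the same starting noise always maps to the same image, and large strides between timesteps cost little quality.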

Architecture

U-Net Backbone

  Input (noisy image + time embedding)
        ↓
  [Encoder blocks with downsample]
        ↓
  [Middle block]
        ↓
  [Decoder blocks with upsample]
  + Skip connections from encoder
        ↓
  Output (predicted noise)

Time Conditioning

# Sinusoidal embedding like transformers
t_embed = get_timestep_embedding(t)

# Add to network activations
h = h + mlp(t_embed)
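One common way to implement `get_timestep_embedding`, following the transformer sinusoidal recipe (the embedding dimension of 128 here is an arbitrary illustration):

```python
import numpy as np

def get_timestep_embedding(t, dim=128):
    """Sinusoidal timestep embedding, as in the transformer positional encoding."""
    half = dim // 2
    # Geometrically spaced frequencies from 1 down to 1/10000
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    angles = t * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

emb = get_timestep_embedding(42)  # one 128-dim vector per timestep
```

The mix of frequencies lets the network distinguish both nearby timesteps (high frequencies) and coarse noise levels (low frequencies).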

Attention

Self-attention layers for global context.

Conditional Generation

Classifier Guidance

Use classifier gradient to steer generation:

ε̃ = ε_θ(xₜ, t) - s∇ₓlog p(y|xₜ)

Classifier-Free Guidance

Train with and without condition, interpolate:

ε̃ = ε_θ(xₜ, t, ∅) + s(ε_θ(xₜ, t, c) - ε_θ(xₜ, t, ∅))

No separate classifier needed!
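The interpolation above is a one-liner at sampling time. A sketch, assuming `eps_model` accepts a condition argument where `None` stands for the null condition ∅ (the scale of 7.5 is a typical default, not a fixed constant):

```python
import numpy as np

def cfg_eps(eps_model, x_t, t, cond, guidance_scale=7.5):
    """Classifier-free guidance: push the prediction toward the condition."""
    eps_uncond = eps_model(x_t, t, None)   # ∅: null condition (e.g. empty prompt)
    eps_cond = eps_model(x_t, t, cond)
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```

A scale of 1 recovers the plain conditional prediction; larger scales trade diversity for stronger adherence to the condition.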

Latent Diffusion (Stable Diffusion)

Key Innovation

Diffusion in latent space, not pixel space:

Image → [Encoder] → Latent → Diffusion → Latent → [Decoder] → Image
          VAE                                        VAE

Benefits

  • Much smaller latent space (64×64 vs 512×512)
  • Faster training and sampling
  • Enables high-resolution generation

Architecture

1. VAE: Compress image 8× (512×512 → 64×64)
2. U-Net: Denoise in latent space
3. Text conditioning: CLIP text encoder
4. VAE Decoder: Decompress to image
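The four stages above compose into a simple text-to-image pipeline. The sketch below uses placeholder stubs for the real components; the tensor shapes (3×512×512 image, 4×64×64 latent, 77×768 text embedding) follow Stable Diffusion's 8× VAE and CLIP text encoder, but everything else is illustrative:

```python
import numpy as np

def vae_decode(latent):
    """Placeholder VAE decoder: (4, 64, 64) latent → (3, 512, 512) image."""
    return np.random.randn(3, 512, 512)

def denoise_latent(latent, text_embedding, steps=50):
    """Placeholder for the U-Net diffusion loop running in latent space."""
    return latent

text_embedding = np.random.randn(77, 768)  # CLIP text encoder output shape
z = np.random.randn(4, 64, 64)             # text-to-image starts from latent noise
z = denoise_latent(z, text_embedding)
image = vae_decode(z)
```

Note that the VAE encoder only enters for image-to-image tasks; pure text-to-image generation starts directly from latent-space noise.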

Applications

Text-to-Image

  • DALL-E 2, 3
  • Stable Diffusion
  • Midjourney
  • Imagen

Image-to-Image

  • Style transfer
  • Inpainting
  • Super-resolution
  • Editing (with ControlNet)

Other Domains

  • Video generation (Sora)
  • Audio synthesis
  • 3D generation
  • Molecule design

Diffusion vs GANs

| Aspect         | Diffusion         | GANs            |
|----------------|-------------------|-----------------|
| Training       | Stable            | Unstable        |
| Mode coverage  | Excellent         | Can collapse    |
| Sample quality | SOTA              | Good            |
| Sampling speed | Slow (many steps) | Fast (one pass) |
| Diversity      | High              | Can be limited  |

Recent Advances

Consistency Models

Distill diffusion into single-step generator.

Flow Matching

Simplified training objective, connects to flows.

Rectified Flow

Straighten sampling paths for faster generation.

Key Takeaways

  1. Diffusion: learn to reverse a noising process
  2. Train by predicting noise added to images
  3. Sample by iterative denoising
  4. Latent diffusion: diffuse in compressed space
  5. Classifier-free guidance: conditional generation
  6. State-of-the-art for image generation quality