Intermediate · Deep Learning

Learn about quantization - techniques to reduce model size and speed up inference by using lower-precision numbers.

Tags: quantization, optimization, inference, efficiency, LLMs

Model Quantization

Quantization reduces the precision of model weights and activations from 32-bit floats to lower-bit representations. It dramatically reduces model size and speeds up inference with minimal accuracy loss.

Why Quantize?

Memory Reduction

FP32: 32 bits per weight
INT8:  8 bits per weight → 4× smaller
INT4:  4 bits per weight → 8× smaller

Speed Improvement

  • Faster memory access (smaller weights)
  • Faster integer arithmetic
  • Better cache utilization

Deployment Benefits

  • Run larger models on device
  • Lower latency
  • Reduced power consumption

Quantization Basics

Linear Quantization

Map continuous values to discrete integers:

Q(x) = round(x / scale) + zero_point

scale = (max_val - min_val) / (2^bits - 1)
zero_point = round(-min_val / scale)

Dequantization

x_approx = (Q(x) - zero_point) × scale

Example

Original: [-2.1, 0.5, 1.8]
Scale: 0.015
Zero point: 140

Quantized: [0, 173, 260] → clipped to [0, 173, 255] for INT8
Dequantized: [-2.1, 0.495, 1.725]
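This example can be reproduced with a few lines of Python, a minimal sketch of the quantize/dequantize formulas above (the scale and zero point are taken from the example rather than recomputed from the data):

```python
def quantize(x, scale, zero_point, bits=8):
    """Linear quantization: Q(x) = round(x / scale) + zero_point, clipped to the INT range."""
    q = round(x / scale) + zero_point
    return max(0, min(2 ** bits - 1, q))  # clip to [0, 255] for INT8

def dequantize(q, scale, zero_point):
    """x_approx = (Q(x) - zero_point) * scale"""
    return (q - zero_point) * scale

scale, zero_point = 0.015, 140
values = [-2.1, 0.5, 1.8]

quantized = [quantize(v, scale, zero_point) for v in values]
print(quantized)                         # [0, 173, 255]

recovered = [dequantize(q, scale, zero_point) for q in quantized]
print([round(x, 3) for x in recovered])  # [-2.1, 0.495, 1.725]
```

Note the asymmetry: 1.8 lands at 260 before clipping to 255, which is why the dequantized value drifts to 1.725.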

Types of Quantization

Post-Training Quantization (PTQ)

Quantize after training:

# PyTorch dynamic quantization
import torch
import torch.nn as nn

model_quantized = torch.quantization.quantize_dynamic(
    model,
    {nn.Linear},  # layers to quantize
    dtype=torch.qint8
)

Pros: no retraining needed. Cons: may lose more accuracy than QAT.

Quantization-Aware Training (QAT)

Simulate quantization during training:

# Insert fake quantization nodes (model must be in train mode)
model.train()
model.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
torch.quantization.prepare_qat(model, inplace=True)

# Train with fake quantization
for batch in dataloader:
    train_step(model, batch)

# Convert to an actually quantized model
model.eval()
model_quantized = torch.quantization.convert(model)

Pros: better accuracy retention. Cons: requires training.

Precision Levels

Precision   Bits  Size vs FP32  Typical Use
FP32        32    1×            Training
FP16/BF16   16    0.5×          Training, inference
INT8         8    0.25×         Inference
INT4         4    0.125×       LLM inference
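To make the table concrete, here is the weight-memory footprint of a hypothetical 7-billion-parameter model at each precision (pure arithmetic; activations, KV cache, and framework overhead are ignored):

```python
# Weight memory for a 7B-parameter model at each precision level
params = 7e9

for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    gb = params * bits / 8 / 1e9  # bits → bytes → gigabytes
    print(f"{name}: {gb:.1f} GB")
# FP32: 28.0 GB, FP16: 14.0 GB, INT8: 7.0 GB, INT4: 3.5 GB
```

This is why INT4 quantization is what makes 7B-class models fit on consumer GPUs and laptops.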

LLM Quantization

GPTQ

  • One-shot quantization for LLMs
  • Uses calibration data
  • Works well for INT4

AWQ (Activation-aware Weight Quantization)

  • Protects salient weights based on activation magnitude
  • Better INT4 quality than GPTQ

GGML/GGUF (llama.cpp)

  • Multiple quantization formats: Q4_0, Q4_K_M, Q8_0
  • Optimized for CPU inference

bitsandbytes

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Load model in 4-bit
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # NormalFloat 4-bit
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b",
    quantization_config=bnb_config,
)

QLoRA

Combines quantization with fine-tuning:

  • Base model: 4-bit quantized (frozen)
  • LoRA adapters: full precision (trained)

→ Fine-tune 70B models on a single GPU!
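The mechanics can be sketched in plain PyTorch (an illustrative toy, not the NF4 kernels bitsandbytes actually uses: the base weight is frozen as symmetric INT8 for simplicity, the bias is omitted, and only the low-rank A and B matrices are trainable):

```python
import torch
import torch.nn as nn

class QLoRALinear(nn.Module):
    """Frozen quantized base weight + trainable low-rank adapters (toy sketch)."""
    def __init__(self, base: nn.Linear, rank: int = 4):
        super().__init__()
        w = base.weight.detach()
        self.scale = (w.abs().max() / 127).item()
        # Stored as a buffer, so it is frozen: no gradients flow to the base weight
        self.register_buffer("qweight", torch.round(w / self.scale).to(torch.int8))
        # LoRA adapters stay in full precision and receive gradients
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))

    def forward(self, x):
        w = self.qweight.float() * self.scale   # dequantize on the fly
        return x @ (w + self.B @ self.A).T      # base + low-rank update

layer = QLoRALinear(nn.Linear(16, 8))
out = layer(torch.randn(2, 16))
print(out.shape)                                    # torch.Size([2, 8])
print(sum(p.numel() for p in layer.parameters()))   # 96: only A (4×16) and B (8×4)
```

Because B is initialized to zero, the layer starts out computing exactly the (dequantized) base transformation; training then learns the low-rank correction.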

Quantization Strategies

Per-Tensor Quantization

  • One scale for the entire tensor
  • Simple but less accurate

Per-Channel Quantization

  • A different scale per output channel
  • Better accuracy, more complex

Per-Token Quantization (Activations)

  • A different scale per token
  • Handles varying activation ranges
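The benefit of per-channel scales is easy to demonstrate on synthetic weights whose channel magnitudes differ widely (a sketch using symmetric INT8 quantization):

```python
import torch

torch.manual_seed(0)
# Four output channels with very different magnitudes
w = torch.randn(4, 8) * torch.tensor([0.1, 0.5, 1.0, 5.0]).unsqueeze(1)

def mse_after_quant(w, scale):
    """Mean squared error after symmetric round-to-nearest INT8 quantization."""
    q = torch.clamp(torch.round(w / scale), -127, 127)
    return ((q * scale - w) ** 2).mean().item()

per_tensor_err = mse_after_quant(w, w.abs().max() / 127)                      # one scale
per_channel_err = mse_after_quant(w, w.abs().amax(dim=1, keepdim=True) / 127) # scale per row

print(per_channel_err < per_tensor_err)  # True: each channel's scale fits its own range
```

With a single scale, the large-magnitude channel forces a coarse quantization step on the small-magnitude channels; per-channel scales remove that coupling.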

Mixed Precision

Different precision for different layers:
- Embeddings: FP16
- Attention: INT8
- FFN: INT4

Accuracy Considerations

Sensitive Layers

Some layers are more sensitive:

  • First and last layers
  • Normalization layers
  • Skip connections

Keep these in higher precision.

Calibration

# Run representative data through the model
with torch.no_grad():
    for batch in calibration_data:
        model(batch)  # collect activation statistics (e.g., running min/max)

# Use the statistics to determine scale and zero_point
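The statistics-collection step can be made concrete with a hand-rolled min/max observer (a minimal sketch; PyTorch ships production observers in torch.ao.quantization, and the formulas below mirror the scale and zero-point definitions given earlier):

```python
import torch

class MinMaxObserver:
    """Tracks the running min/max of activations to derive quantization parameters."""
    def __init__(self):
        self.min_val = float("inf")
        self.max_val = float("-inf")

    def observe(self, x: torch.Tensor):
        self.min_val = min(self.min_val, x.min().item())
        self.max_val = max(self.max_val, x.max().item())

    def qparams(self, bits: int = 8):
        scale = (self.max_val - self.min_val) / (2 ** bits - 1)
        zero_point = round(-self.min_val / scale)
        return scale, zero_point

obs = MinMaxObserver()
for batch in [torch.tensor([-1.0, 2.0]), torch.tensor([0.0, 3.0])]:
    obs.observe(batch)  # in practice, hooked onto a layer's activations

scale, zero_point = obs.qparams()
print(scale, zero_point)  # scale = 4/255 ≈ 0.0157, zero_point = 64
```

In a real pipeline one observer is attached per quantized tensor, and more robust estimators (moving averages, histograms/percentiles) are used to resist outlier activations.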

Hardware Considerations

CPU

  • INT8 well supported (AVX-512, VNNI)
  • INT4 requires special handling

GPU

  • INT8 tensor cores (Turing and later)
  • INT4 tensor cores on Turing/Ampere; Hopper adds FP8 instead
  • Libraries: cuBLAS, TensorRT

Edge Devices

  • INT8 common
  • NPU acceleration

Code Example: Basic Quantization

import io
import torch

# Original model
model = MyModel()
model.load_state_dict(torch.load('model.pt'))
model.eval()

# Dynamic quantization (simplest)
quantized_model = torch.quantization.quantize_dynamic(
    model,
    {torch.nn.Linear},
    dtype=torch.qint8
)

# Check size reduction via serialized size: quantized weights live in packed
# buffers, so counting parameters() would miss them
def size_mb(m):
    buf = io.BytesIO()
    torch.save(m.state_dict(), buf)
    return buf.getbuffer().nbytes / 1e6

print(f"Size: {size_mb(model):.1f}MB → {size_mb(quantized_model):.1f}MB")

# Inference
with torch.no_grad():
    output = quantized_model(input_tensor)

Best Practices

  1. Start with PTQ: Try post-training first
  2. Use calibration data: Representative of real usage
  3. Evaluate carefully: Check accuracy on your task
  4. Mixed precision: Keep sensitive layers in FP16
  5. Benchmark: Actually measure speedup on target hardware

Key Takeaways

  1. Quantization reduces precision for smaller, faster models
  2. INT8: 4× smaller, INT4: 8× smaller
  3. PTQ is simple; QAT is more accurate
  4. For LLMs: GPTQ, AWQ, bitsandbytes
  5. QLoRA enables fine-tuning quantized models
  6. Always validate accuracy after quantization

Practice Questions

Test your understanding with these related interview questions: