Model Quantization
Quantization reduces the precision of model weights and activations from 32-bit floats to lower-bit representations, dramatically shrinking model size and speeding up inference, often with minimal accuracy loss.
Why Quantize?
Memory Reduction
- FP32: 32 bits per weight
- INT8: 8 bits per weight → 4× smaller
- INT4: 4 bits per weight → 8× smaller
Speed Improvement
- Faster memory access (smaller weights)
- Faster integer arithmetic
- Better cache utilization
Deployment Benefits
- Run larger models on device
- Lower latency
- Reduced power consumption
Quantization Basics
Linear Quantization
Map continuous values to discrete integers:
Q(x) = round(x / scale) + zero_point
scale = (max_val - min_val) / (2^bits - 1)
zero_point = round(-min_val / scale)
Dequantization
x_approx = (Q(x) - zero_point) × scale
Example
Original: [-2.1, 0.5, 1.8]
Scale: 0.015
Zero point: 140
Quantized: [0, 173, 260] → clipped to [0, 173, 255] for INT8
Dequantized: [-2.1, 0.495, 1.725]
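The example above can be reproduced in a few lines of Python (the scale and zero point come from the example; `quantize`/`dequantize` are illustrative helper names, not a library API):

```python
def quantize(x, scale, zero_point, qmin=0, qmax=255):
    """Linear quantization: round, shift by zero_point, then clip to INT8 range."""
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))

def dequantize(q, scale, zero_point):
    """Map the integer code back to an approximate real value."""
    return (q - zero_point) * scale

scale, zero_point = 0.015, 140
values = [-2.1, 0.5, 1.8]
quantized = [quantize(v, scale, zero_point) for v in values]
recovered = [dequantize(q, scale, zero_point) for q in quantized]
print(quantized)   # [0, 173, 255]  (1.8 maps to 260 and is clipped)
print(recovered)   # ≈ [-2.1, 0.495, 1.725]
```

Note that 1.8 is recovered as 1.725: clipping is a lossy operation, which is why the observed range matters when choosing the scale.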
Types of Quantization
Post-Training Quantization (PTQ)
Quantize after training:
import torch
import torch.nn as nn

# PyTorch dynamic quantization: weights are quantized ahead of time,
# activations are quantized on the fly at inference
model_quantized = torch.quantization.quantize_dynamic(
    model,
    {nn.Linear},        # layer types to quantize
    dtype=torch.qint8
)
Pros: no retraining needed. Cons: typically loses more accuracy than QAT.
Quantization-Aware Training (QAT)
Simulate quantization during training:
# Insert fake-quantization nodes that simulate INT8 during training
model.train()
model.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
torch.quantization.prepare_qat(model, inplace=True)

# Train with fake quantization in the forward pass
for batch in dataloader:
    train_step(model, batch)

# Convert to an actually quantized model for inference
model.eval()
model_quantized = torch.quantization.convert(model)
Pros: better accuracy retention. Cons: requires (re)training.
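The "fake quantization" that QAT inserts is just quantize-then-dequantize applied in the forward pass, so the network sees values snapped to the integer grid while training stays in floats. A conceptual sketch, not the framework's internals:

```python
def fake_quantize(x, scale, zero_point, qmin=0, qmax=255):
    """Quantize then immediately dequantize: the value passes through the
    INT8 grid but stays a float, so gradients can flow (frameworks use a
    straight-through estimator in the backward pass)."""
    q = max(qmin, min(qmax, round(x / scale) + zero_point))
    return (q - zero_point) * scale

print(fake_quantize(0.5, 0.015, 140))   # ≈ 0.495: the nearest grid point to 0.5
```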
Precision Levels
| Precision | Bits | Size vs FP32 | Typical Use |
|---|---|---|---|
| FP32 | 32 | 1× | Training |
| FP16/BF16 | 16 | 0.5× | Training, inference |
| INT8 | 8 | 0.25× | Inference |
| INT4 | 4 | 0.125× | LLM inference |
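To make the table concrete, here is the rough weight-memory footprint of a 7B-parameter model at each precision (a back-of-the-envelope sketch: it ignores activations, the KV cache, and quantization metadata such as scales):

```python
params = 7_000_000_000          # 7B parameters
bits = {"FP32": 32, "FP16": 16, "INT8": 8, "INT4": 4}

for name, b in bits.items():
    gb = params * b / 8 / 1e9   # bits → bytes → GB (decimal)
    print(f"{name}: {gb:.1f} GB")
# FP32: 28.0 GB, FP16: 14.0 GB, INT8: 7.0 GB, INT4: 3.5 GB
```

At INT4 the weights of a 7B model fit comfortably in consumer-GPU or laptop memory, which is why INT4 dominates local LLM inference.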
LLM Quantization
GPTQ
- One-shot quantization for LLMs
- Uses calibration data
- Works well for INT4
AWQ (Activation-aware Weight Quantization)
- Protects salient weights based on activation magnitude
- Better INT4 quality than GPTQ
GGML/GGUF (llama.cpp)
- Multiple quantization formats: Q4_0, Q4_K_M, Q8_0
- Optimized for CPU inference
bitsandbytes
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Load the model with 4-bit weights (bitsandbytes backend)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat 4-bit
    bnb_4bit_compute_dtype=torch.float16,   # matmuls run in FP16
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b",
    quantization_config=bnb_config,
)
QLoRA
Combines quantization with fine-tuning:
- Base model: 4-bit quantized (frozen)
- LoRA adapters: full precision (trained)
→ Fine-tune 70B models on a single GPU!
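The memory win comes from where the gradients live: only the small adapters are trained. A rough count of trainable parameters for one LoRA-adapted linear layer (the layer dimensions and rank here are illustrative assumptions, not fixed by QLoRA):

```python
d_in, d_out, r = 4096, 4096, 16     # hypothetical layer size and LoRA rank

full = d_in * d_out                 # full fine-tuning: train the whole matrix
lora = r * (d_in + d_out)           # LoRA: train low-rank factors A (r×d_in) and B (d_out×r)

print(f"full: {full:,}  lora: {lora:,}  ratio: {lora/full:.3%}")
# full: 16,777,216  lora: 131,072  ratio: 0.781%
```

Under 1% of the parameters need gradients and optimizer state, which is what makes training on a single GPU feasible even when the frozen base is large.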
Quantization Strategies
Per-Tensor Quantization
- One scale for the entire tensor
- Simple but less accurate
Per-Channel Quantization
- A different scale per output channel
- Better accuracy, more complex
Per-Token Quantization (Activations)
- A different scale per token
- Handles varying activation ranges
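The accuracy gap between per-tensor and per-channel scales is easy to demonstrate. A minimal sketch using symmetric (zero-point-free) INT8 quantization on a weight matrix whose rows have very different ranges (the matrix is made up for illustration):

```python
# Two "output channels" with very different magnitudes
weights = [
    [0.01, -0.02, 0.015],   # small-range channel
    [5.0, -4.0, 3.0],       # large-range channel
]

def quantize_row(row, scale, qmax=127):
    """Symmetric INT8: code = clip(round(w / scale)) in [-qmax, qmax]."""
    return [max(-qmax, min(qmax, round(w / scale))) for w in row]

def max_error(row, scale):
    """Worst-case reconstruction error for one row at a given scale."""
    codes = quantize_row(row, scale)
    return max(abs(w - q * scale) for w, q in zip(row, codes))

# Per-tensor: one scale derived from the global max |w|
global_scale = max(abs(w) for row in weights for w in row) / 127
per_tensor_err = max(max_error(row, global_scale) for row in weights)

# Per-channel: each row gets a scale matched to its own range
per_channel_err = max(
    max_error(row, max(abs(w) for w in row) / 127) for row in weights
)

print(per_tensor_err, per_channel_err)
assert per_channel_err < per_tensor_err  # per-channel adapts to each row's range
```

With one global scale, the small-range channel is quantized almost entirely to zero; a per-row scale preserves it.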
Mixed Precision
Different precision for different layers:
- Embeddings: FP16
- Attention: INT8
- FFN: INT4
Accuracy Considerations
Sensitive Layers
Some layers are more sensitive to quantization error:
- First and last layers
- Normalization layers
- Skip connections
Keep these in higher precision.
Calibration
# Run representative data through the model to collect activation
# statistics (e.g. running min/max per tensor)
model.eval()
with torch.no_grad():
    for batch in calibration_data:
        model(batch)
# The observed ranges determine scale and zero_point per tensor
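What the framework's observers do can be sketched in plain Python: track the running min/max of an activation over calibration batches, then derive the INT8 parameters with the formulas from earlier (the batch values here are made-up numbers):

```python
# Made-up activation batches standing in for real calibration data
batches = [
    [0.1, 2.3, -0.5],
    [1.9, -1.2, 0.4],
    [3.0, 0.0, -0.8],
]

# Observe the running range across all calibration batches
min_val = min(min(b) for b in batches)   # -1.2
max_val = max(max(b) for b in batches)   #  3.0

# Derive INT8 quantization parameters from the observed range
bits = 8
scale = (max_val - min_val) / (2**bits - 1)
zero_point = round(-min_val / scale)

print(f"scale={scale:.6f} zero_point={zero_point}")
```

If the calibration data is not representative, the observed range is wrong and real activations get clipped, which is why the best-practices list below stresses representative calibration data.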
Hardware Considerations
CPU
- INT8 well supported (AVX-512, VNNI)
- INT4 requires special handling
GPU
- INT8 tensor cores (Turing+)
- INT4 on Hopper
- cuBLAS, TensorRT
Edge Devices
- INT8 common
- NPU acceleration
Code Example: Basic Quantization
import os
import torch

# Original model (MyModel is a placeholder for your architecture)
model = MyModel()
model.load_state_dict(torch.load('model.pt'))
model.eval()

# Dynamic quantization (simplest)
quantized_model = torch.quantization.quantize_dynamic(
    model,
    {torch.nn.Linear},
    dtype=torch.qint8
)

# Check size reduction by serializing both models
# (quantized weights are packed and no longer appear in .parameters(),
# so counting parameters would undercount the quantized model)
torch.save(model.state_dict(), 'fp32.pt')
torch.save(quantized_model.state_dict(), 'int8.pt')
print(f"Size: {os.path.getsize('fp32.pt')/1e6:.1f}MB → "
      f"{os.path.getsize('int8.pt')/1e6:.1f}MB")

# Inference (input_tensor: an example input for your model)
with torch.no_grad():
    output = quantized_model(input_tensor)
Best Practices
- Start with PTQ: Try post-training first
- Use calibration data: Representative of real usage
- Evaluate carefully: Check accuracy on your task
- Mixed precision: Keep sensitive layers in FP16
- Benchmark: Actually measure speedup on target hardware
Key Takeaways
- Quantization reduces precision for smaller, faster models
- INT8: 4× smaller, INT4: 8× smaller
- PTQ is simple; QAT is more accurate
- For LLMs: GPTQ, AWQ, bitsandbytes
- QLoRA enables fine-tuning quantized models
- Always validate accuracy after quantization