Model Quantization
Quantization reduces the precision of model weights and activations from 32-bit floats to lower-bit representations, dramatically shrinking model size and speeding up inference, often with minimal accuracy loss.
Why Quantize?
Memory Reduction
- FP32: 32 bits per weight
- INT8: 8 bits per weight → 4× smaller
- INT4: 4 bits per weight → 8× smaller
Speed Improvement
- Faster memory access (smaller weights)
- Faster integer arithmetic
- Better cache utilization
Deployment Benefits
- Run larger models on device
- Lower latency
- Reduced power consumption
Quantization Basics
Linear Quantization
Map continuous values to discrete integers:
Q(x) = round(x / scale) + zero_point
scale = (max_val - min_val) / (2^bits - 1)
zero_point = round(-min_val / scale)
Dequantization
x_approx = (Q(x) - zero_point) × scale
Example
Original: [-2.1, 0.5, 1.8]
Scale: 0.015
Zero point: 140
Quantized: [0, 173, 260] → clipped to [0, 173, 255] for INT8
Dequantized: [-2.1, 0.495, 1.725]
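The example above can be reproduced in a few lines of Python (the scale and zero point come from the example; `quantize`/`dequantize` are illustrative helper names, not a library API):

```python
def quantize(x, scale, zero_point, qmin=0, qmax=255):
    """Linear quantization: round, shift by zero_point, then clip to INT8 range."""
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))

def dequantize(q, scale, zero_point):
    """Map the integer code back to an approximate real value."""
    return (q - zero_point) * scale

scale, zero_point = 0.015, 140
values = [-2.1, 0.5, 1.8]
quantized = [quantize(v, scale, zero_point) for v in values]
recovered = [dequantize(q, scale, zero_point) for q in quantized]
print(quantized)   # [0, 173, 255]  (1.8 maps to 260 and is clipped)
print(recovered)   # ≈ [-2.1, 0.495, 1.725]
```

Note that 1.8 is recovered as 1.725: clipping is a lossy operation, which is why the observed range matters when choosing the scale.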
Types of Quantization
Post-Training Quantization (PTQ)
Quantize after training:
import torch
import torch.nn as nn

# PyTorch dynamic quantization: weights are quantized ahead of time,
# activations are quantized on the fly at inference
model_quantized = torch.quantization.quantize_dynamic(
    model,
    {nn.Linear},        # layer types to quantize
    dtype=torch.qint8
)
Pros: no retraining needed. Cons: typically loses more accuracy than QAT.
Quantization-Aware Training (QAT)
Simulate quantization during training:
# Insert fake-quantization nodes that simulate INT8 during training
model.train()
model.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
torch.quantization.prepare_qat(model, inplace=True)

# Train with fake quantization in the forward pass
for batch in dataloader:
    train_step(model, batch)

# Convert to an actually quantized model for inference
model.eval()
model_quantized = torch.quantization.convert(model)
Pros: better accuracy retention. Cons: requires (re)training.
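The "fake quantization" that QAT inserts is just quantize-then-dequantize applied in the forward pass, so the network sees values snapped to the integer grid while training stays in floats. A conceptual sketch, not the framework's internals:

```python
def fake_quantize(x, scale, zero_point, qmin=0, qmax=255):
    """Quantize then immediately dequantize: the value passes through the
    INT8 grid but stays a float, so gradients can flow (frameworks use a
    straight-through estimator in the backward pass)."""
    q = max(qmin, min(qmax, round(x / scale) + zero_point))
    return (q - zero_point) * scale

print(fake_quantize(0.5, 0.015, 140))   # ≈ 0.495: the nearest grid point to 0.5
```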
Precision Levels
| Precision | Bits | Size vs FP32 | Typical Use |
|---|---|---|---|
| FP32 | 32 | 1× | Training |
| FP16/BF16 | 16 | 0.5× | Training, inference |
| INT8 | 8 | 0.25× | Inference |
| INT4 | 4 | 0.125× | LLM inference |
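To make the table concrete, here is the rough weight-memory footprint of a 7B-parameter model at each precision (a back-of-the-envelope sketch: it ignores activations, the KV cache, and quantization metadata such as scales):

```python
params = 7_000_000_000          # 7B parameters
bits = {"FP32": 32, "FP16": 16, "INT8": 8, "INT4": 4}

for name, b in bits.items():
    gb = params * b / 8 / 1e9   # bits → bytes → GB (decimal)
    print(f"{name}: {gb:.1f} GB")
# FP32: 28.0 GB, FP16: 14.0 GB, INT8: 7.0 GB, INT4: 3.5 GB
```

At INT4 the weights of a 7B model fit comfortably in consumer-GPU or laptop memory, which is why INT4 dominates local LLM inference.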
LLM Quantization
GPTQ
- One-shot quantization for LLMs
- Uses calibration data
- Works well for INT4
AWQ (Activation-aware Weight Quantization)
- Protects salient weights based on activation magnitude
- Better INT4 quality than GPTQ
GGML/GGUF (llama.cpp)
- Multiple quantization formats: Q4_0, Q4_K_M, Q8_0
- Optimized for CPU inference
bitsandbytes
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Load the model with 4-bit weights (bitsandbytes backend)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat 4-bit
    bnb_4bit_compute_dtype=torch.float16,   # matmuls run in FP16
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b",
    quantization_config=bnb_config,
)
QLoRA
Combines quantization with fine-tuning:
- Base model: 4-bit quantized (frozen)
- LoRA adapters: full precision (trained)
→ Fine-tune 70B models on a single GPU!
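The memory win comes from where the gradients live: only the small adapters are trained. A rough count of trainable parameters for one LoRA-adapted linear layer (the layer dimensions and rank here are illustrative assumptions, not fixed by QLoRA):

```python
d_in, d_out, r = 4096, 4096, 16     # hypothetical layer size and LoRA rank

full = d_in * d_out                 # full fine-tuning: train the whole matrix
lora = r * (d_in + d_out)           # LoRA: train low-rank factors A (r×d_in) and B (d_out×r)

print(f"full: {full:,}  lora: {lora:,}  ratio: {lora/full:.3%}")
# full: 16,777,216  lora: 131,072  ratio: 0.781%
```

Under 1% of the parameters need gradients and optimizer state, which is what makes training on a single GPU feasible even when the frozen base is large.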
Quantization Strategies
Per-Tensor Quantization
- One scale for the entire tensor
- Simple but less accurate
Per-Channel Quantization
- A different scale per output channel
- Better accuracy, more complex
Per-Token Quantization (Activations)
- A different scale per token
- Handles varying activation ranges
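The accuracy gap between per-tensor and per-channel scales is easy to demonstrate. A minimal sketch using symmetric (zero-point-free) INT8 quantization on a weight matrix whose rows have very different ranges (the matrix is made up for illustration):

```python
# Two "output channels" with very different magnitudes
weights = [
    [0.01, -0.02, 0.015],   # small-range channel
    [5.0, -4.0, 3.0],       # large-range channel
]

def quantize_row(row, scale, qmax=127):
    """Symmetric INT8: code = clip(round(w / scale)) in [-qmax, qmax]."""
    return [max(-qmax, min(qmax, round(w / scale))) for w in row]

def max_error(row, scale):
    """Worst-case reconstruction error for one row at a given scale."""
    codes = quantize_row(row, scale)
    return max(abs(w - q * scale) for w, q in zip(row, codes))

# Per-tensor: one scale derived from the global max |w|
global_scale = max(abs(w) for row in weights for w in row) / 127
per_tensor_err = max(max_error(row, global_scale) for row in weights)

# Per-channel: each row gets a scale matched to its own range
per_channel_err = max(
    max_error(row, max(abs(w) for w in row) / 127) for row in weights
)

print(per_tensor_err, per_channel_err)
assert per_channel_err < per_tensor_err  # per-channel adapts to each row's range
```

With one global scale, the small-range channel is quantized almost entirely to zero; a per-row scale preserves it.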
Mixed Precision
Different precision for different layers:
- Embeddings: FP16
- Attention: INT8
- FFN: INT4
Accuracy Considerations
Sensitive Layers
Some layers are more sensitive to quantization error:
- First and last layers
- Normalization layers
- Skip connections
Keep these in higher precision.
Calibration
# Run representative data through the model to collect activation
# statistics (e.g. running min/max per tensor)
model.eval()
with torch.no_grad():
    for batch in calibration_data:
        model(batch)
# The observed ranges determine scale and zero_point per tensor
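What the framework's observers do can be sketched in plain Python: track the running min/max of an activation over calibration batches, then derive the INT8 parameters with the formulas from earlier (the batch values here are made-up numbers):

```python
# Made-up activation batches standing in for real calibration data
batches = [
    [0.1, 2.3, -0.5],
    [1.9, -1.2, 0.4],
    [3.0, 0.0, -0.8],
]

# Observe the running range across all calibration batches
min_val = min(min(b) for b in batches)   # -1.2
max_val = max(max(b) for b in batches)   #  3.0

# Derive INT8 quantization parameters from the observed range
bits = 8
scale = (max_val - min_val) / (2**bits - 1)
zero_point = round(-min_val / scale)

print(f"scale={scale:.6f} zero_point={zero_point}")
```

If the calibration data is not representative, the observed range is wrong and real activations get clipped, which is why the best-practices list below stresses representative calibration data.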
Hardware Considerations
CPU
- INT8 well supported (AVX-512, VNNI)
- INT4 requires special handling
GPU
- INT8 tensor cores (Turing+)
- INT4 on Hopper
- cuBLAS, TensorRT
Edge Devices
- INT8 common
- NPU acceleration
Code Example: Basic Quantization
import os
import torch

# Original model (MyModel is a placeholder for your architecture)
model = MyModel()
model.load_state_dict(torch.load('model.pt'))
model.eval()

# Dynamic quantization (simplest)
quantized_model = torch.quantization.quantize_dynamic(
    model,
    {torch.nn.Linear},
    dtype=torch.qint8
)

# Check size reduction by serializing both models
# (quantized weights are packed and no longer appear in .parameters(),
# so counting parameters would undercount the quantized model)
torch.save(model.state_dict(), 'fp32.pt')
torch.save(quantized_model.state_dict(), 'int8.pt')
print(f"Size: {os.path.getsize('fp32.pt')/1e6:.1f}MB → "
      f"{os.path.getsize('int8.pt')/1e6:.1f}MB")

# Inference (input_tensor: an example input for your model)
with torch.no_grad():
    output = quantized_model(input_tensor)
Best Practices
- Start with PTQ: Try post-training first
- Use calibration data: Representative of real usage
- Evaluate carefully: Check accuracy on your task
- Mixed precision: Keep sensitive layers in FP16
- Benchmark: Actually measure speedup on target hardware
Key Takeaways
- Quantization reduces precision for smaller, faster models
- INT8: 4× smaller, INT4: 8× smaller
- PTQ is simple; QAT is more accurate
- For LLMs: GPTQ, AWQ, bitsandbytes
- QLoRA enables fine-tuning quantized models
- Always validate accuracy after quantization