Anomaly Detection
Anomaly detection identifies data points that deviate significantly from expected behavior. It's critical for fraud detection, system monitoring, quality control, and security.
Types of Anomalies
Point Anomalies
Single data points that are unusual:
Data: [10, 12, 11, 13, 500, 12, 11]
                        ↑
                     anomaly
Contextual Anomalies
Normal in one context, anomalous in another:
Temperature 80°F: Normal in summer, anomalous in winter
Collective Anomalies
Group of points that together are anomalous:
Heart rate: Normal pattern, then sudden flat line
Approaches
Supervised
Labeled anomalies available:
- Classification problem (very imbalanced)
- Need labeled anomaly examples
Unsupervised
No labels, learn "normal":
- Density estimation
- Clustering
- Reconstruction error
Semi-supervised
Only normal data for training:
- Learn normal distribution
- Flag deviations as anomalies
Statistical Methods
Z-Score
z = (x - μ) / σ
Anomaly if |z| > threshold (typically 2-3)
Assumes normal distribution.
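As a sketch, the z-score rule in numpy on the point-anomaly example from earlier, with threshold 2:

```python
import numpy as np

data = np.array([10, 12, 11, 13, 500, 12, 11], dtype=float)
z = (data - data.mean()) / data.std()   # z-score for each point
anomalies = data[np.abs(z) > 2]         # flag |z| above the threshold
```

Here only 500 is flagged; with threshold 3 it would be missed, because the extreme outlier itself inflates σ and pulls every z-score down.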
IQR Method
Q1 = 25th percentile
Q3 = 75th percentile
IQR = Q3 - Q1
Anomaly if: x < Q1 - 1.5×IQR or x > Q3 + 1.5×IQR
Robust to non-normal distributions.
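The same toy data through the IQR rule, as a numpy sketch:

```python
import numpy as np

data = np.array([10, 12, 11, 13, 500, 12, 11], dtype=float)
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
anomalies = data[(data < lower) | (data > upper)]   # outside the fences
```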
Mahalanobis Distance
D = √((x - μ)ᵀ Σ⁻¹ (x - μ))
Accounts for correlations between features.
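A numpy sketch on toy correlated data; the injected point (2.5, −2.5) is unremarkable per feature but violates the correlation:

```python
import numpy as np

rng = np.random.default_rng(0)
# Strongly correlated 2-D normal data, plus one point that breaks the correlation
X = rng.multivariate_normal([0, 0], [[1, 0.9], [0.9, 1]], size=500)
X = np.vstack([X, [2.5, -2.5]])

mu = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
diff = X - mu
D = np.sqrt(np.einsum('ij,jk,ik->i', diff, cov_inv, diff))  # per-row quadratic form
```

Its Euclidean distance from the mean is modest, but its Mahalanobis distance dwarfs every sample's because it lies against the correlation.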
Machine Learning Methods
Isolation Forest
Idea: Anomalies are easier to isolate
Build trees with random splits
Anomalies require fewer splits to isolate
Anomaly score derived from average path length (shorter path → more anomalous)
from sklearn.ensemble import IsolationForest
model = IsolationForest(contamination=0.01)
predictions = model.fit_predict(X) # -1 = anomaly, 1 = normal
Local Outlier Factor (LOF)
Idea: Compare local density to neighbors
LOF > 1: Less dense than neighbors (anomaly)
LOF ≈ 1: Similar density (normal)
LOF < 1: Denser than neighbors
from sklearn.neighbors import LocalOutlierFactor
lof = LocalOutlierFactor(n_neighbors=20)
predictions = lof.fit_predict(X)  # -1 = anomaly; set novelty=True to score unseen data
One-Class SVM
Idea: Find boundary around normal data
from sklearn.svm import OneClassSVM
model = OneClassSVM(nu=0.01) # nu ≈ expected anomaly rate
model.fit(X_train) # Train on normal data only
predictions = model.predict(X_test)  # -1 = anomaly, 1 = normal
DBSCAN
Idea: Anomalies don't belong to any cluster
from sklearn.cluster import DBSCAN
clustering = DBSCAN(eps=0.5, min_samples=5)
labels = clustering.fit_predict(X)
anomalies = X[labels == -1] # Noise points
Deep Learning Methods
Autoencoder
Idea: High reconstruction error = anomaly
# Minimal stand-in: an autoencoder via sklearn's MLPRegressor (input reconstructs itself)
import numpy as np
from sklearn.neural_network import MLPRegressor

model = MLPRegressor(hidden_layer_sizes=(2,), max_iter=2000)
model.fit(X_normal, X_normal)              # train on normal data only
reconstruction = model.predict(X)
error = np.mean((X - reconstruction) ** 2, axis=1)
anomalies = error > threshold              # pick threshold on a validation set
Variational Autoencoder
Use likelihood instead of reconstruction error:
p(x) low → anomaly
LSTM for Sequences
Predict next value, flag large prediction errors:
[x₁, x₂, x₃] → predict x₄
If |x₄ - x̂₄| > threshold → anomaly
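The predict-then-compare idea doesn't depend on the LSTM itself; as a hedged sketch, lag features plus sklearn's LinearRegression can stand in for the sequence model on a synthetic sine wave with one injected spike:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

t = np.arange(200)
series = np.sin(0.1 * t)
series[150] += 5.0                       # injected anomaly

# Lag features: [x_{t-3}, x_{t-2}, x_{t-1}] -> predict x_t
lags = 3
X = np.column_stack([series[i:len(series) - lags + i] for i in range(lags)])
y = series[lags:]

model = LinearRegression().fit(X, y)
errors = np.abs(y - model.predict(X))
anomaly_idx = int(np.argmax(errors)) + lags   # index of the worst prediction
```

The largest errors cluster around the spike, which first appears as an unpredictable target and then as a corrupted input to the next few predictions.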
Time Series Anomaly Detection
Seasonal-Trend Decomposition
data = trend + seasonal + residual
Anomaly if residual is large
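A hand-rolled sketch, assuming a known period and a least-squares line as a rough stand-in for the trend (libraries such as statsmodels' STL do this more carefully):

```python
import numpy as np

period = 12
t = np.arange(120)
series = 0.05 * t + np.sin(2 * np.pi * t / period)   # trend + seasonal
series[60] += 3.0                                    # injected anomaly

# Trend: least-squares line
trend = np.polyval(np.polyfit(t, series, 1), t)

# Seasonal: mean detrended value at each phase of the cycle
detrended = series - trend
seasonal = np.array([detrended[p::period].mean() for p in range(period)])[t % period]

residual = series - trend - seasonal
anomalies = np.where(np.abs(residual) > 3 * residual.std())[0]
```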
Exponential Smoothing
forecast_{t+1} = α × actual_t + (1-α) × forecast_t
Anomaly if |actual - forecast| > threshold
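A minimal sketch: apply the smoothing recursion and flag forecast errors more than k standard deviations above the mean error (the function name and constants are illustrative):

```python
import numpy as np

def ses_anomalies(series, alpha=0.3, k=2.0):
    forecast = series[0]
    errors = []
    for x in series[1:]:
        errors.append(abs(x - forecast))               # error before updating
        forecast = alpha * x + (1 - alpha) * forecast  # smoothing recursion
    errors = np.array(errors)
    # +1 because the first point has no forecast
    return np.where(errors > errors.mean() + k * errors.std())[0] + 1

series = np.array([10, 11, 10, 12, 11, 50, 11, 10, 12, 11], dtype=float)
idx = ses_anomalies(series)   # flags the 50 at index 5
```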
Prophet
Facebook's library handles seasonality:
from prophet import Prophet
model = Prophet()
model.fit(df)                    # df needs 'ds' (datestamp) and 'y' columns
forecast = model.predict(future_df)
# Flag actuals outside [yhat_lower, yhat_upper] as anomalies
Evaluation
Challenges
- Highly imbalanced (few anomalies)
- Accuracy is misleading
- Unlabeled test data
Metrics
Precision: What fraction of detected anomalies are real?
Recall: What fraction of real anomalies did we detect?
F1: Harmonic mean of precision and recall
AUC-PR: Area under precision-recall curve
At Different Thresholds
from sklearn.metrics import precision_recall_curve
precision, recall, thresholds = precision_recall_curve(y_true, scores)
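One common follow-up is to pick the threshold that maximizes F1 along the curve; a sketch with toy labels and scores:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

y_true = np.array([0, 0, 0, 1, 0, 1, 0, 0, 1, 0])
scores = np.array([0.1, 0.2, 0.15, 0.9, 0.3, 0.8, 0.2, 0.1, 0.7, 0.25])

precision, recall, thresholds = precision_recall_curve(y_true, scores)
f1 = 2 * precision * recall / (precision + recall + 1e-12)
best_threshold = thresholds[np.argmax(f1[:-1])]   # last P/R pair has no threshold
```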
Setting Thresholds
Statistical
threshold = mean + k × std
k typically 2-3
Percentile
threshold = percentile(scores, 99)
Business-driven
False positive cost vs false negative cost
Choose threshold that minimizes expected cost
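A toy sketch of cost-based selection (the 50:1 cost ratio and the score distributions are made up for illustration): sweep candidate thresholds and keep the one with the lowest total cost.

```python
import numpy as np

C_FP, C_FN = 1.0, 50.0    # a missed anomaly costs 50x a false alarm (hypothetical)

rng = np.random.default_rng(1)
y_true = np.array([0] * 95 + [1] * 5)
scores = np.where(y_true == 1,
                  rng.uniform(0.6, 1.0, size=100),   # anomalies score high
                  rng.uniform(0.0, 0.7, size=100))   # normals overlap partially

candidates = np.linspace(0, 1, 101)
costs = [C_FP * np.sum((scores >= c) & (y_true == 0)) +
         C_FN * np.sum((scores < c) & (y_true == 1)) for c in candidates]
best = candidates[int(np.argmin(costs))]
```

Because missed anomalies are so expensive here, the chosen threshold sits below every anomaly's score, accepting some false alarms in exchange.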
Best Practices
Feature Engineering
- Aggregate features (rolling stats)
- Time-based features
- Domain-specific indicators
Multiple Detectors
# Normalize each detector's scores to a common scale (e.g. [0, 1]) before weighting
scores = (0.3 * isolation_forest_scores +
          0.3 * autoencoder_scores +
          0.4 * lof_scores)
Feedback Loop
- Collect analyst labels
- Retrain periodically
- Track precision over time
Production Considerations
Streaming Data
# Maintain running statistics with exponential moving averages
score = (new_value - mean) / (std + 1e-9)    # score against current estimates first
mean = alpha * new_value + (1 - alpha) * mean
std = (alpha * (new_value - mean) ** 2 + (1 - alpha) * std ** 2) ** 0.5
Concept Drift
- Normal changes over time
- Retrain/update models
- Multiple time windows
Alert Fatigue
- Too many false positives → ignored
- Tune for precision over recall
- Group related anomalies
Key Takeaways
- Anomalies can be point, contextual, or collective
- Unsupervised: Isolation Forest, LOF, autoencoders
- Train on normal data, flag high-scoring points
- Evaluate with precision, recall, PR-AUC
- Threshold setting is crucial and domain-dependent
- In production: handle drift, avoid alert fatigue