Classical Machine Learning (Intermediate)

Learn about anomaly detection - techniques to identify unusual patterns, outliers, and rare events in data.

Tags: anomaly-detection, outliers, fraud-detection, unsupervised-learning

Anomaly Detection

Anomaly detection identifies data points that deviate significantly from expected behavior. It's critical for fraud detection, system monitoring, quality control, and security.

Types of Anomalies

Point Anomalies

Single data points that are unusual:

Data: [10, 12, 11, 13, 500, 12, 11]
                      ↑
                   Anomaly

Contextual Anomalies

Normal in one context, anomalous in another:

Temperature 80°F: Normal in summer, anomalous in winter

Collective Anomalies

Group of points that together are anomalous:

Heart rate: Normal pattern, then sudden flat line

Approaches

Supervised

Labeled anomalies available:

  • Classification problem (very imbalanced)
  • Need labeled anomaly examples

Unsupervised

No labels, learn "normal":

  • Density estimation
  • Clustering
  • Reconstruction error

Semi-supervised

Only normal data for training:

  • Learn normal distribution
  • Flag deviations as anomalies

Statistical Methods

Z-Score

z = (x - μ) / σ

Anomaly if |z| > threshold (typically 2-3)

Assumes normal distribution.
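Applied to the example series from above, a minimal NumPy sketch (the function name is illustrative):

```python
import numpy as np

def zscore_anomalies(x, threshold=3.0):
    """Flag points whose z-score magnitude exceeds the threshold."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return np.abs(z) > threshold

data = [10, 12, 11, 13, 500, 12, 11]
print(zscore_anomalies(data, threshold=2.0))
# → [False False False False  True False False]
```

Note that in small samples a single extreme outlier inflates σ so much that its own z-score may stay under 3, which is why a threshold of 2 is used here.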

IQR Method

Q1 = 25th percentile
Q3 = 75th percentile
IQR = Q3 - Q1

Anomaly if: x < Q1 - 1.5×IQR or x > Q3 + 1.5×IQR

Robust to non-normal distributions.
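The same series handled with Tukey's fences, again as an illustrative sketch:

```python
import numpy as np

def iqr_anomalies(x, k=1.5):
    """Tukey's fences: flag points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    x = np.asarray(x, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)

data = [10, 12, 11, 13, 500, 12, 11]
print(iqr_anomalies(data))
# → [False False False False  True False False]
```

Unlike the z-score, the quartiles barely move when the outlier is extreme, so the default k=1.5 works without tuning.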

Mahalanobis Distance

D = √((x - μ)ᵀ Σ⁻¹ (x - μ))

Accounts for correlations between features.
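A from-scratch sketch (function name illustrative): the synthetic point (2, -2) is only moderately far from the mean per-feature, but it violates the strong positive correlation, so its Mahalanobis distance is the largest by far.

```python
import numpy as np

def mahalanobis_distances(X):
    """Distance of each row from the sample mean, covariance-aware."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
    diff = X - mu
    # D_i = sqrt(diff_i^T  Σ^{-1}  diff_i) for every row i
    return np.sqrt(np.einsum('ij,jk,ik->i', diff, cov_inv, diff))

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[1, 0.9], [0.9, 1]], size=200)
X = np.vstack([X, [2.0, -2.0]])  # against the correlation
d = mahalanobis_distances(X)
print(d.argmax())  # index of the most anomalous point
```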

Machine Learning Methods

Isolation Forest

Idea: Anomalies are easier to isolate

Build trees with random splits.
Anomalies require fewer splits to isolate, so a shorter average
path length across trees means a higher anomaly score.

from sklearn.ensemble import IsolationForest

model = IsolationForest(contamination=0.01)
predictions = model.fit_predict(X)  # -1 = anomaly, 1 = normal

Local Outlier Factor (LOF)

Idea: Compare local density to neighbors

LOF > 1: Less dense than neighbors (anomaly)
LOF ≈ 1: Similar density (normal)
LOF < 1: Denser than neighbors

from sklearn.neighbors import LocalOutlierFactor

lof = LocalOutlierFactor(n_neighbors=20)
predictions = lof.fit_predict(X)

One-Class SVM

Idea: Find boundary around normal data

from sklearn.svm import OneClassSVM

model = OneClassSVM(nu=0.01)  # nu ≈ expected anomaly rate
model.fit(X_train)  # Train on normal data only
predictions = model.predict(X_test)

DBSCAN

Idea: Anomalies don't belong to any cluster

from sklearn.cluster import DBSCAN

clustering = DBSCAN(eps=0.5, min_samples=5)
labels = clustering.fit_predict(X)
anomalies = X[labels == -1]  # Noise points

Deep Learning Methods

Autoencoder

Idea: High reconstruction error = anomaly

# Train an autoencoder on normal data only
# (with Keras this would be model.fit(X_normal, X_normal, epochs=...))
model.fit(X_normal)

# Detect anomalies via per-sample reconstruction error
reconstruction = model.predict(X)
error = np.mean((X - reconstruction) ** 2, axis=1)  # requires numpy as np
anomalies = error > threshold

Variational Autoencoder

Use likelihood instead of reconstruction error:

p(x) low → anomaly

LSTM for Sequences

Predict next value, flag large prediction errors:

[x₁, x₂, x₃] → predict x₄
If |x₄ - x̂₄| > threshold → anomaly
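A full LSTM is overkill for a short illustration, but the prediction-error logic itself can be sketched with a rolling-mean forecast standing in for the trained model (names and the median+MAD thresholding rule are illustrative choices, not part of any standard API):

```python
import numpy as np

def forecast_error_anomalies(series, window=3, k=3.0):
    """Flag points whose one-step forecast error is far from typical.
    A rolling mean stands in for a trained LSTM forecaster here;
    the threshold is a robust median + k*MAD rule on the errors."""
    series = np.asarray(series, dtype=float)
    forecasts = np.array([series[t - window:t].mean()
                          for t in range(window, len(series))])
    errors = np.abs(series[window:] - forecasts)
    med = np.median(errors)
    mad = np.median(np.abs(errors - med))
    flags = np.zeros(len(series), dtype=bool)
    flags[window:] = errors > med + k * mad
    return flags

series = [10, 11, 10, 12, 11, 10, 95, 11, 10]
print(forecast_error_anomalies(series))
```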

Time Series Anomaly Detection

Seasonal-Trend Decomposition

data = trend + seasonal + residual

Anomaly if residual is large
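A from-scratch sketch of the decompose-then-threshold idea on synthetic data (a running-median trend plus per-phase seasonal means; statsmodels' STL is the robust production choice, and the function name here is illustrative):

```python
import numpy as np

def decompose_anomalies(series, period=7, k=6.0):
    """Split series into trend + seasonal + residual; flag large residuals."""
    x = np.asarray(series, dtype=float)
    n = len(x)
    half = period // 2
    # Trend: running median over one full period (robust to spikes)
    trend = np.array([np.median(x[max(0, i - half):i + half + 1])
                      for i in range(n)])
    detrended = x - trend
    # Seasonal: mean of each phase position within the period
    seasonal = np.array([detrended[p::period].mean() for p in range(period)])
    resid = detrended - seasonal[np.arange(n) % period]
    sigma = np.median(np.abs(resid)) / 0.6745  # robust std estimate
    flags = np.abs(resid) > k * sigma
    flags[:period] = flags[-period:] = False   # edge windows are unreliable
    return flags

rng = np.random.default_rng(0)
t = np.arange(200)
x = 0.05 * t + 3 * np.sin(2 * np.pi * t / 7) + rng.normal(0, 0.3, size=200)
x[120] += 10                                   # injected point anomaly
flags = decompose_anomalies(x)
print(np.where(flags)[0])
```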

Exponential Smoothing

forecast_{t+1} = α × actual_t + (1-α) × forecast_t

Anomaly if |actual - forecast| > threshold
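A minimal sketch of this rule (names illustrative); note the forecast is deliberately not updated with flagged values, so a spike does not drag the forecast and trigger false alarms on the points that follow:

```python
def ewma_anomalies(series, alpha=0.3, threshold=20.0):
    """Flag points far from the exponentially smoothed forecast."""
    forecast = series[0]
    flags = [False]                    # first point has no forecast
    for x in series[1:]:
        is_anomaly = abs(x - forecast) > threshold
        flags.append(is_anomaly)
        if not is_anomaly:             # don't let anomalies drag the forecast
            forecast = alpha * x + (1 - alpha) * forecast
    return flags

print(ewma_anomalies([10, 11, 10, 12, 11, 95, 11, 10]))
# → [False, False, False, False, False, True, False, False]
```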

Prophet

Facebook's library handles seasonality:

from prophet import Prophet

model = Prophet()
model.fit(df)  # df needs columns ds (timestamp) and y (value)
forecast = model.predict(df)  # forecast over the history
residuals = df['y'].values - forecast['yhat'].values
# Large residuals indicate anomalies

Evaluation

Challenges

  • Highly imbalanced (few anomalies)
  • Accuracy is misleading
  • Unlabeled test data

Metrics

Precision: What fraction of detected anomalies are real?
Recall: What fraction of real anomalies did we detect?
F1: Harmonic mean
AUC-PR: Area under precision-recall curve

At Different Thresholds

from sklearn.metrics import precision_recall_curve

precision, recall, thresholds = precision_recall_curve(y_true, scores)

Setting Thresholds

Statistical

threshold = mean + k × std
k typically 2-3

Percentile

threshold = percentile(scores, 99)

Business-driven

False positive cost vs false negative cost
Choose threshold that minimizes expected cost
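Given labeled validation data and per-error costs, the search is a simple sweep over candidate thresholds (function name and toy numbers are illustrative):

```python
import numpy as np

def cost_optimal_threshold(scores, y_true, fp_cost=1.0, fn_cost=50.0):
    """Pick the score threshold minimizing expected cost of FPs and FNs."""
    scores = np.asarray(scores, dtype=float)
    y_true = np.asarray(y_true)
    best_t, best_cost = None, float('inf')
    for t in np.unique(scores):        # every distinct score is a candidate
        pred = scores >= t
        fp = np.sum(pred & (y_true == 0))
        fn = np.sum(~pred & (y_true == 1))
        cost = fp * fp_cost + fn * fn_cost
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t, best_cost

t, c = cost_optimal_threshold([0.1, 0.2, 0.7, 0.3, 0.8, 0.9],
                              [0, 0, 0, 0, 1, 1])
print(t, c)  # → 0.8 0.0
```

With a high false-negative cost the sweep favors lower thresholds (more recall); raising fp_cost pushes it the other way.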

Best Practices

Feature Engineering

  • Aggregate features (rolling stats)
  • Time-based features
  • Domain-specific indicators

Multiple Detectors

# Normalize each detector's scores to a common range first
# (e.g. min-max or rank-based), so the weights are meaningful
scores = 0.3 * isolation_forest_scores + \
         0.3 * autoencoder_scores + \
         0.4 * lof_scores

Feedback Loop

  • Collect analyst labels
  • Retrain periodically
  • Track precision over time

Production Considerations

Streaming Data

# Maintain running statistics with exponential moving averages
mean = alpha * new_value + (1 - alpha) * mean
var = alpha * (new_value - mean) ** 2 + (1 - alpha) * var
score = (new_value - mean) / var ** 0.5

Concept Drift

  • Normal changes over time
  • Retrain/update models
  • Multiple time windows

Alert Fatigue

  • Too many false positives → ignored
  • Tune for precision over recall
  • Group related anomalies

Key Takeaways

  1. Anomalies can be point, contextual, or collective
  2. Unsupervised: Isolation Forest, LOF, autoencoders
  3. Train on normal data, flag high-scoring points
  4. Evaluate with precision, recall, PR-AUC
  5. Threshold setting is crucial and domain-dependent
  6. In production: handle drift, avoid alert fatigue