Classical Machine Learning (Intermediate)

Learn about anomaly detection - techniques to identify unusual patterns, outliers, and rare events in data.

Tags: anomaly-detection, outliers, fraud-detection, unsupervised-learning

Anomaly Detection

Anomaly detection identifies data points that deviate significantly from expected behavior. It's critical for fraud detection, system monitoring, quality control, and security.

Types of Anomalies

Point Anomalies

Single data points that are unusual:

Data: [10, 12, 11, 13, 500, 12, 11]
                      ↑
                   Anomaly

Contextual Anomalies

Normal in one context, anomalous in another:

Temperature 80°F: Normal in summer, anomalous in winter

Collective Anomalies

Group of points that together are anomalous:

Heart rate: Normal pattern, then sudden flat line

Approaches

Supervised

Labeled anomalies available:

  • Classification problem (very imbalanced)
  • Need labeled anomaly examples

Unsupervised

No labels, learn "normal":

  • Density estimation
  • Clustering
  • Reconstruction error

Semi-supervised

Only normal data for training:

  • Learn normal distribution
  • Flag deviations as anomalies

Statistical Methods

Z-Score

z = (x - μ) / σ

Anomaly if |z| > threshold (typically 2-3)

Assumes normal distribution.
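Applied to the example series from above, a minimal NumPy sketch (the function name is illustrative):

```python
import numpy as np

def zscore_anomalies(x, threshold=3.0):
    """Flag points whose z-score magnitude exceeds the threshold."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return np.abs(z) > threshold

data = [10, 12, 11, 13, 500, 12, 11]
print(zscore_anomalies(data, threshold=2.0))
# → [False False False False  True False False]
```

Note that in small samples a single extreme outlier inflates σ so much that its own z-score may stay under 3, which is why a threshold of 2 is used here.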

IQR Method

Q1 = 25th percentile
Q3 = 75th percentile
IQR = Q3 - Q1

Anomaly if: x < Q1 - 1.5×IQR or x > Q3 + 1.5×IQR

Robust to non-normal distributions.
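The same series handled with Tukey's fences, again as an illustrative sketch:

```python
import numpy as np

def iqr_anomalies(x, k=1.5):
    """Tukey's fences: flag points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    x = np.asarray(x, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)

data = [10, 12, 11, 13, 500, 12, 11]
print(iqr_anomalies(data))
# → [False False False False  True False False]
```

Unlike the z-score, the quartiles barely move when the outlier is extreme, so the default k=1.5 works without tuning.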

Mahalanobis Distance

D = √((x - μ)ᵀ Σ⁻¹ (x - μ))

Accounts for correlations between features.
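A from-scratch sketch (function name illustrative): the synthetic point (2, -2) is only moderately far from the mean per-feature, but it violates the strong positive correlation, so its Mahalanobis distance is the largest by far.

```python
import numpy as np

def mahalanobis_distances(X):
    """Distance of each row from the sample mean, covariance-aware."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
    diff = X - mu
    # D_i = sqrt(diff_i^T  Σ^{-1}  diff_i) for every row i
    return np.sqrt(np.einsum('ij,jk,ik->i', diff, cov_inv, diff))

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[1, 0.9], [0.9, 1]], size=200)
X = np.vstack([X, [2.0, -2.0]])  # against the correlation
d = mahalanobis_distances(X)
print(d.argmax())  # index of the most anomalous point
```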

Machine Learning Methods

Isolation Forest

Idea: Anomalies are easier to isolate

Build trees with random splits.
Anomalies require fewer splits to isolate, so a shorter average
path length across trees means a higher anomaly score.

from sklearn.ensemble import IsolationForest

model = IsolationForest(contamination=0.01)
predictions = model.fit_predict(X)  # -1 = anomaly, 1 = normal

Local Outlier Factor (LOF)

Idea: Compare local density to neighbors

LOF > 1: Less dense than neighbors (anomaly)
LOF ≈ 1: Similar density (normal)
LOF < 1: Denser than neighbors

from sklearn.neighbors import LocalOutlierFactor

lof = LocalOutlierFactor(n_neighbors=20)
predictions = lof.fit_predict(X)

One-Class SVM

Idea: Find boundary around normal data

from sklearn.svm import OneClassSVM

model = OneClassSVM(nu=0.01)  # nu ≈ expected anomaly rate
model.fit(X_train)  # Train on normal data only
predictions = model.predict(X_test)

DBSCAN

Idea: Anomalies don't belong to any cluster

from sklearn.cluster import DBSCAN

clustering = DBSCAN(eps=0.5, min_samples=5)
labels = clustering.fit_predict(X)
anomalies = X[labels == -1]  # Noise points

Deep Learning Methods

Autoencoder

Idea: High reconstruction error = anomaly

# Train an autoencoder on normal data only
# (with Keras this would be model.fit(X_normal, X_normal, epochs=...))
model.fit(X_normal)

# Detect anomalies via per-sample reconstruction error
reconstruction = model.predict(X)
error = np.mean((X - reconstruction) ** 2, axis=1)  # requires numpy as np
anomalies = error > threshold

Variational Autoencoder

Use likelihood instead of reconstruction error:

p(x) low → anomaly

LSTM for Sequences

Predict next value, flag large prediction errors:

[x₁, x₂, x₃] → predict x₄
If |x₄ - x̂₄| > threshold → anomaly
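A full LSTM is overkill for a short illustration, but the prediction-error logic itself can be sketched with a rolling-mean forecast standing in for the trained model (names and the median+MAD thresholding rule are illustrative choices, not part of any standard API):

```python
import numpy as np

def forecast_error_anomalies(series, window=3, k=3.0):
    """Flag points whose one-step forecast error is far from typical.
    A rolling mean stands in for a trained LSTM forecaster here;
    the threshold is a robust median + k*MAD rule on the errors."""
    series = np.asarray(series, dtype=float)
    forecasts = np.array([series[t - window:t].mean()
                          for t in range(window, len(series))])
    errors = np.abs(series[window:] - forecasts)
    med = np.median(errors)
    mad = np.median(np.abs(errors - med))
    flags = np.zeros(len(series), dtype=bool)
    flags[window:] = errors > med + k * mad
    return flags

series = [10, 11, 10, 12, 11, 10, 95, 11, 10]
print(forecast_error_anomalies(series))
```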

Time Series Anomaly Detection

Seasonal-Trend Decomposition

data = trend + seasonal + residual

Anomaly if residual is large
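A from-scratch sketch of the decompose-then-threshold idea on synthetic data (a running-median trend plus per-phase seasonal means; statsmodels' STL is the robust production choice, and the function name here is illustrative):

```python
import numpy as np

def decompose_anomalies(series, period=7, k=6.0):
    """Split series into trend + seasonal + residual; flag large residuals."""
    x = np.asarray(series, dtype=float)
    n = len(x)
    half = period // 2
    # Trend: running median over one full period (robust to spikes)
    trend = np.array([np.median(x[max(0, i - half):i + half + 1])
                      for i in range(n)])
    detrended = x - trend
    # Seasonal: mean of each phase position within the period
    seasonal = np.array([detrended[p::period].mean() for p in range(period)])
    resid = detrended - seasonal[np.arange(n) % period]
    sigma = np.median(np.abs(resid)) / 0.6745  # robust std estimate
    flags = np.abs(resid) > k * sigma
    flags[:period] = flags[-period:] = False   # edge windows are unreliable
    return flags

rng = np.random.default_rng(0)
t = np.arange(200)
x = 0.05 * t + 3 * np.sin(2 * np.pi * t / 7) + rng.normal(0, 0.3, size=200)
x[120] += 10                                   # injected point anomaly
flags = decompose_anomalies(x)
print(np.where(flags)[0])
```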

Exponential Smoothing

forecast_{t+1} = α × actual_t + (1-α) × forecast_t

Anomaly if |actual - forecast| > threshold
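A minimal sketch of this rule (names illustrative); note the forecast is deliberately not updated with flagged values, so a spike does not drag the forecast and trigger false alarms on the points that follow:

```python
def ewma_anomalies(series, alpha=0.3, threshold=20.0):
    """Flag points far from the exponentially smoothed forecast."""
    forecast = series[0]
    flags = [False]                    # first point has no forecast
    for x in series[1:]:
        is_anomaly = abs(x - forecast) > threshold
        flags.append(is_anomaly)
        if not is_anomaly:             # don't let anomalies drag the forecast
            forecast = alpha * x + (1 - alpha) * forecast
    return flags

print(ewma_anomalies([10, 11, 10, 12, 11, 95, 11, 10]))
# → [False, False, False, False, False, True, False, False]
```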

Prophet

Facebook's library handles seasonality:

from prophet import Prophet

model = Prophet()
model.fit(df)  # df needs columns ds (timestamp) and y (value)
forecast = model.predict(df)  # forecast over the history
residuals = df['y'].values - forecast['yhat'].values
# Large residuals indicate anomalies

Evaluation

Challenges

  • Highly imbalanced (few anomalies)
  • Accuracy is misleading
  • Unlabeled test data

Metrics

Precision: What fraction of detected anomalies are real?
Recall: What fraction of real anomalies did we detect?
F1: Harmonic mean
AUC-PR: Area under precision-recall curve

At Different Thresholds

from sklearn.metrics import precision_recall_curve

precision, recall, thresholds = precision_recall_curve(y_true, scores)

Setting Thresholds

Statistical

threshold = mean + k × std
k typically 2-3

Percentile

threshold = percentile(scores, 99)

Business-driven

False positive cost vs false negative cost
Choose threshold that minimizes expected cost
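Given labeled validation data and per-error costs, the search is a simple sweep over candidate thresholds (function name and toy numbers are illustrative):

```python
import numpy as np

def cost_optimal_threshold(scores, y_true, fp_cost=1.0, fn_cost=50.0):
    """Pick the score threshold minimizing expected cost of FPs and FNs."""
    scores = np.asarray(scores, dtype=float)
    y_true = np.asarray(y_true)
    best_t, best_cost = None, float('inf')
    for t in np.unique(scores):        # every distinct score is a candidate
        pred = scores >= t
        fp = np.sum(pred & (y_true == 0))
        fn = np.sum(~pred & (y_true == 1))
        cost = fp * fp_cost + fn * fn_cost
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t, best_cost

t, c = cost_optimal_threshold([0.1, 0.2, 0.7, 0.3, 0.8, 0.9],
                              [0, 0, 0, 0, 1, 1])
print(t, c)  # → 0.8 0.0
```

With a high false-negative cost the sweep favors lower thresholds (more recall); raising fp_cost pushes it the other way.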

Best Practices

Feature Engineering

  • Aggregate features (rolling stats)
  • Time-based features
  • Domain-specific indicators

Multiple Detectors

# Normalize each detector's scores to a common range first
# (e.g. min-max or rank-based), so the weights are meaningful
scores = 0.3 * isolation_forest_scores + \
         0.3 * autoencoder_scores + \
         0.4 * lof_scores

Feedback Loop

  • Collect analyst labels
  • Retrain periodically
  • Track precision over time

Production Considerations

Streaming Data

# Maintain running statistics with exponential moving averages
mean = alpha * new_value + (1 - alpha) * mean
var = alpha * (new_value - mean) ** 2 + (1 - alpha) * var
score = (new_value - mean) / var ** 0.5

Concept Drift

  • Normal changes over time
  • Retrain/update models
  • Multiple time windows

Alert Fatigue

  • Too many false positives → ignored
  • Tune for precision over recall
  • Group related anomalies

Key Takeaways

  1. Anomalies can be point, contextual, or collective
  2. Unsupervised: Isolation Forest, LOF, autoencoders
  3. Train on normal data, flag high-scoring points
  4. Evaluate with precision, recall, PR-AUC
  5. Threshold setting is crucial and domain-dependent
  6. In production: handle drift, avoid alert fatigue