Intermediate · Classical Machine Learning

Learn about DBSCAN, a density-based clustering algorithm that finds clusters of arbitrary shape and identifies outliers.

Tags: clustering, unsupervised, density-based, outlier-detection

DBSCAN Clustering

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a clustering algorithm that groups together points that are closely packed, marks points in low-density regions as outliers, and can find clusters of arbitrary shape.

Core Concepts

Key Parameters

ε (eps): Maximum distance between two points for them to count as neighbors
minPts: Minimum number of points required to form a dense region

Point Types

   ●●●●●          ← Core points (≥ minPts neighbors within ε)
  ●     ●
  ●  ●  ●
  ● ● ● ●
   ●●●●●
        ○         ← Border point (< minPts, but near core)
              ×   ← Noise point (outlier)

Core Point: Has at least minPts points within distance ε
Border Point: Within ε of a core point, but has fewer than minPts neighbors itself
Noise Point: Neither core nor border; an outlier
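These definitions can be checked directly by counting ε-neighbors. A minimal sketch (the classify_point helper and the toy data are illustrative, not part of any library):

```python
import numpy as np

def classify_point(X, i, eps, min_pts):
    """Classify X[i] as 'core', 'border', or 'noise' by counting eps-neighbors."""
    dists = np.linalg.norm(X - X[i], axis=1)
    n_neighbors = int(np.sum(dists <= eps))  # includes the point itself
    if n_neighbors >= min_pts:
        return 'core'
    # Border: not dense itself, but within eps of some core point
    for j in np.where(dists <= eps)[0]:
        if j != i and int(np.sum(np.linalg.norm(X - X[j], axis=1) <= eps)) >= min_pts:
            return 'border'
    return 'noise'

# Toy data: a tight group of four, one point on its edge, one far away
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [0.1, 0.1], [0.3, 0.0], [5.0, 5.0]])
print([classify_point(X, i, eps=0.25, min_pts=4) for i in range(len(X))])
# → ['core', 'core', 'core', 'core', 'border', 'noise']
```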

Algorithm

import numpy as np

UNVISITED = -2  # not yet processed
NOISE = -1      # marked as an outlier

def get_neighbors(X, point, eps):
    """Indices of all points within distance eps of `point` (including itself)."""
    return [j for j in range(len(X)) if np.linalg.norm(X[j] - point) <= eps]

def dbscan(X, eps, min_pts):
    """Label each row of X (an (n, d) array) with a cluster id, or NOISE."""
    labels = [UNVISITED] * len(X)
    cluster_id = 0

    for i, point in enumerate(X):
        if labels[i] != UNVISITED:
            continue

        # Find neighbors within eps
        neighbors = get_neighbors(X, point, eps)

        if len(neighbors) < min_pts:
            labels[i] = NOISE  # Mark as noise (may be reclaimed as a border point later)
        else:
            # Start a new cluster and grow it from this core point
            expand_cluster(X, labels, i, neighbors, cluster_id, eps, min_pts)
            cluster_id += 1

    return labels

def expand_cluster(X, labels, point_idx, neighbors, cluster_id, eps, min_pts):
    labels[point_idx] = cluster_id

    # Process the seed list; it grows as new core points are discovered
    i = 0
    while i < len(neighbors):
        neighbor_idx = neighbors[i]

        if labels[neighbor_idx] == NOISE:
            labels[neighbor_idx] = cluster_id  # Reclaim as a border point
        elif labels[neighbor_idx] == UNVISITED:
            labels[neighbor_idx] = cluster_id

            # If this neighbor is also a core point, its neighbors join the seed list
            new_neighbors = get_neighbors(X, X[neighbor_idx], eps)
            if len(new_neighbors) >= min_pts:
                neighbors.extend(new_neighbors)

        i += 1

Scikit-learn Implementation

from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
import numpy as np

# Generate sample data: three blobs (the last 100 points stay near the origin)
X = np.random.randn(300, 2)
X[:100] += [2, 2]       # Cluster 1
X[100:200] += [-2, -2]  # Cluster 2

# Standardize features
X_scaled = StandardScaler().fit_transform(X)

# Apply DBSCAN
dbscan = DBSCAN(eps=0.5, min_samples=5)
labels = dbscan.fit_predict(X_scaled)

# Results
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = list(labels).count(-1)
print(f"Clusters: {n_clusters}, Noise points: {n_noise}")
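After fitting, scikit-learn also records which samples were core points (core_sample_indices_), which lets you split the data into core, border, and noise points explicitly. A short sketch on synthetic data:

```python
from sklearn.cluster import DBSCAN
import numpy as np

X = np.random.RandomState(0).randn(100, 2)
db = DBSCAN(eps=0.5, min_samples=5).fit(X)

core_mask = np.zeros(len(X), dtype=bool)
core_mask[db.core_sample_indices_] = True   # indices of core samples

noise_mask = db.labels_ == -1               # noise is labeled -1
border_mask = ~core_mask & ~noise_mask      # clustered, but not core

print(f"core: {core_mask.sum()}, border: {border_mask.sum()}, noise: {noise_mask.sum()}")
```

The three masks partition the dataset, which is a handy check when tuning eps and min_samples.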

Choosing Parameters

eps (epsilon)

k-distance graph method:

from sklearn.neighbors import NearestNeighbors
import matplotlib.pyplot as plt

# Compute k-nearest neighbor distances
k = 5  # Usually k = minPts
neighbors = NearestNeighbors(n_neighbors=k)
neighbors.fit(X_scaled)
distances, _ = neighbors.kneighbors(X_scaled)

# Sort each point's distance to its k-th neighbor
# (note: with n_neighbors=k, column 0 is the point itself at distance 0)
k_distances = np.sort(distances[:, k-1])

plt.plot(k_distances)
plt.xlabel('Points sorted by distance')
plt.ylabel(f'{k}th nearest neighbor distance')
plt.title('k-distance graph (elbow = eps)')
plt.axhline(y=0.5, color='r', linestyle='--', label='eps = 0.5')
plt.legend()
plt.show()

# Look for "elbow" - sharp increase indicates good eps
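If you want a programmatic starting point rather than eyeballing the plot, one crude heuristic (an assumption of this sketch, not a standard rule) is to take the index where the sorted k-distance curve bends most sharply, via a discrete second difference:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Two well-separated blobs as example data
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(100, 2), rng.randn(100, 2) + [5, 5]])

k = 5
distances, _ = NearestNeighbors(n_neighbors=k).fit(X).kneighbors(X)
k_distances = np.sort(distances[:, k - 1])

# Crude elbow pick: index of the largest discrete second difference
elbow = int(np.argmax(np.diff(k_distances, 2))) + 1
eps_estimate = float(k_distances[elbow])
print(f"estimated eps: {eps_estimate:.3f}")
```

Treat the result as a starting value to refine visually, not a final answer.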

minPts

Rules of thumb:

  • minPts ≥ dimensions + 1 (minimum)
  • minPts ≥ 2 × dimensions (recommended)
  • Larger minPts = smoother clusters, more noise
  • For 2D data: minPts = 4-5 is common
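The rules of thumb above translate directly into code (the helper name is illustrative):

```python
def suggest_min_pts(n_dims):
    """Rule of thumb: minPts = 2 * dimensions, floored at dimensions + 1."""
    return max(2 * n_dims, n_dims + 1)

print(suggest_min_pts(2))   # → 4, matching the common 2D choice
print(suggest_min_pts(10))  # → 20
```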

DBSCAN vs K-Means

| Aspect | DBSCAN | K-Means |
| --- | --- | --- |
| Cluster shape | Arbitrary | Spherical |
| Number of clusters | Automatic | Must specify k |
| Handles outliers | Yes (marks as noise) | No (assigns all points) |
| Density | Handles varying density poorly | Assumes uniform |
| Parameters | eps, minPts | k |
| Deterministic | Mostly (border points can depend on visit order) | No (random init) |

Visual Comparison

Data with non-spherical clusters:

     ●●●●●●                    ●●●●●●
   ●●      ●●                ●●      ●●  
  ●●        ●●              ●●        ●●
 ●●          ●●            ●●          ●●
  ●●        ●●              ●●        ●●
   ●●      ●●                ●●      ●●
     ●●●●●●                    ●●●●●●
         ○○○○○
        ○     ○                 
       ○       ○
        ○     ○
         ○○○○○

DBSCAN: Finds 2 clusters     K-Means: Cuts across shapes
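The picture above can be checked quantitatively: on two interlocking half-moons, DBSCAN recovers the true shapes while K-Means cuts across them, which shows up in the adjusted Rand index (ARI) against the true labels. A sketch (eps=0.2 is a choice suited to this noise level, not a universal value):

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

X, y_true = make_moons(n_samples=300, noise=0.05, random_state=42)

labels_db = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
labels_km = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)

ari_db = adjusted_rand_score(y_true, labels_db)
ari_km = adjusted_rand_score(y_true, labels_km)
print(f"DBSCAN ARI: {ari_db:.2f}, K-Means ARI: {ari_km:.2f}")
```

An ARI of 1.0 means perfect agreement with the true grouping; K-Means scores much lower here because its spherical partitions slice through both moons.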

Variations

HDBSCAN (Hierarchical DBSCAN)

Handles varying densities:

import hdbscan

clusterer = hdbscan.HDBSCAN(
    min_cluster_size=10,
    min_samples=5,
    cluster_selection_epsilon=0.0
)
labels = clusterer.fit_predict(X_scaled)

# Also provides cluster probabilities
probs = clusterer.probabilities_

OPTICS

Produces a reachability plot for different eps values:

from sklearn.cluster import OPTICS

optics = OPTICS(min_samples=5, xi=0.05)
labels = optics.fit_predict(X_scaled)

# Can extract clusters at different thresholds
reachability = optics.reachability_
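A useful consequence is that once OPTICS is fitted, a DBSCAN-style flat clustering at any eps can be extracted from the reachability data without refitting; scikit-learn exposes this as cluster_optics_dbscan (the blob data here is illustrative):

```python
import numpy as np
from sklearn.cluster import OPTICS, cluster_optics_dbscan

# Two well-separated blobs as example data
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(100, 2), rng.randn(100, 2) + [6, 6]])

optics = OPTICS(min_samples=5).fit(X)

# Extract a flat, DBSCAN-like clustering at a chosen eps, reusing the fit
labels = cluster_optics_dbscan(
    reachability=optics.reachability_,
    core_distances=optics.core_distances_,
    ordering=optics.ordering_,
    eps=0.5,
)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(f"{n_clusters} clusters at eps=0.5")
```

This makes it cheap to sweep eps values after a single fit, which is the main practical draw of OPTICS over plain DBSCAN.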

Strengths

  1. No need to specify k: Finds natural clusters
  2. Arbitrary shapes: Not limited to spherical clusters
  3. Outlier detection: Naturally identifies noise
  4. Robust: Less sensitive to initialization
  5. Efficient: O(n log n) with spatial indexing (e.g., a k-d tree), O(n²) without

Limitations

  1. Varying density: Struggles when clusters have different densities
  2. High dimensions: Distance becomes less meaningful (curse of dimensionality)
  3. Parameter sensitivity: Results depend heavily on eps
  4. Border points: May be assigned arbitrarily when between clusters
  5. Memory: Requires distance computations for all pairs

When to Use DBSCAN

Good For

  • Data with noise/outliers
  • Non-spherical cluster shapes
  • Unknown number of clusters
  • Spatial data (geographic clustering)
  • Anomaly detection

Not Ideal For

  • Clusters with very different densities
  • High-dimensional data (> 20 dimensions)
  • When clusters must include all points
  • Large datasets without spatial indexing

Practical Example

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Create moon-shaped clusters
X, _ = make_moons(n_samples=300, noise=0.05)

# Add some outliers
outliers = np.random.uniform(-1.5, 2.5, (20, 2))
X = np.vstack([X, outliers])

# Cluster with DBSCAN
db = DBSCAN(eps=0.2, min_samples=5)
labels = db.fit_predict(X)

# Plot results
plt.figure(figsize=(10, 5))

plt.subplot(1, 2, 1)
plt.scatter(X[:, 0], X[:, 1], c='gray', alpha=0.5)
plt.title('Original Data')

plt.subplot(1, 2, 2)
colors = ['red', 'blue', 'green', 'purple', 'orange']
for i in range(max(labels) + 1):
    mask = labels == i
    plt.scatter(X[mask, 0], X[mask, 1], c=colors[i % len(colors)], label=f'Cluster {i}')

# Plot noise points
noise_mask = labels == -1
plt.scatter(X[noise_mask, 0], X[noise_mask, 1], c='black', marker='x', label='Noise')
plt.title('DBSCAN (eps=0.2, minPts=5)')
plt.legend()

plt.tight_layout()
plt.show()

Key Takeaways

  1. DBSCAN finds clusters based on density, not distance to centroids
  2. Automatically determines number of clusters and identifies outliers
  3. Works well for arbitrary cluster shapes
  4. Key parameters: eps (neighborhood radius) and minPts (density threshold)
  5. Use k-distance graph to choose eps
  6. Consider HDBSCAN for varying density clusters