Intermediate · Classical Machine Learning

Learn about DBSCAN, a density-based clustering algorithm that finds clusters of arbitrary shape and identifies outliers.

Tags: clustering, unsupervised, density-based, outlier-detection

DBSCAN Clustering

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a clustering algorithm that groups together points that are closely packed, marks points in low-density regions as outliers, and can find clusters of arbitrary shape.

Core Concepts

Key Parameters

ε (eps): Maximum distance between two points for them to count as neighbors
minPts: Minimum number of points required to form a dense region

Point Types

   ●●●●●          ← Core points (≥ minPts neighbors within ε)
  ●     ●
  ●  ●  ●
  ● ● ● ●
   ●●●●●
        ○         ← Border point (< minPts, but near core)
              ×   ← Noise point (outlier)

Core Point: Has at least minPts points within distance ε
Border Point: Within ε of a core point, but has fewer than minPts neighbors itself
Noise Point: Neither core nor border; an outlier
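These definitions can be checked directly by counting ε-neighbors. A minimal sketch (the classify_point helper and the toy data are illustrative, not part of any library):

```python
import numpy as np

def classify_point(X, i, eps, min_pts):
    """Classify X[i] as 'core', 'border', or 'noise' by counting eps-neighbors."""
    dists = np.linalg.norm(X - X[i], axis=1)
    n_neighbors = int(np.sum(dists <= eps))  # includes the point itself
    if n_neighbors >= min_pts:
        return 'core'
    # Border: not dense itself, but within eps of some core point
    for j in np.where(dists <= eps)[0]:
        if j != i and int(np.sum(np.linalg.norm(X - X[j], axis=1) <= eps)) >= min_pts:
            return 'border'
    return 'noise'

# Toy data: a tight group of four, one point on its edge, one far away
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [0.1, 0.1], [0.3, 0.0], [5.0, 5.0]])
print([classify_point(X, i, eps=0.25, min_pts=4) for i in range(len(X))])
# → ['core', 'core', 'core', 'core', 'border', 'noise']
```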

Algorithm

import numpy as np

UNVISITED = -2  # not yet processed
NOISE = -1      # marked as an outlier

def get_neighbors(X, point, eps):
    """Indices of all points within distance eps of `point` (including itself)."""
    return [j for j in range(len(X)) if np.linalg.norm(X[j] - point) <= eps]

def dbscan(X, eps, min_pts):
    """Label each row of X (an (n, d) array) with a cluster id, or NOISE."""
    labels = [UNVISITED] * len(X)
    cluster_id = 0

    for i, point in enumerate(X):
        if labels[i] != UNVISITED:
            continue

        # Find neighbors within eps
        neighbors = get_neighbors(X, point, eps)

        if len(neighbors) < min_pts:
            labels[i] = NOISE  # Mark as noise (may be reclaimed as a border point later)
        else:
            # Start a new cluster and grow it from this core point
            expand_cluster(X, labels, i, neighbors, cluster_id, eps, min_pts)
            cluster_id += 1

    return labels

def expand_cluster(X, labels, point_idx, neighbors, cluster_id, eps, min_pts):
    labels[point_idx] = cluster_id

    # Process the seed list; it grows as new core points are discovered
    i = 0
    while i < len(neighbors):
        neighbor_idx = neighbors[i]

        if labels[neighbor_idx] == NOISE:
            labels[neighbor_idx] = cluster_id  # Reclaim as a border point
        elif labels[neighbor_idx] == UNVISITED:
            labels[neighbor_idx] = cluster_id

            # If this neighbor is also a core point, its neighbors join the seed list
            new_neighbors = get_neighbors(X, X[neighbor_idx], eps)
            if len(new_neighbors) >= min_pts:
                neighbors.extend(new_neighbors)

        i += 1

Scikit-learn Implementation

from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
import numpy as np

# Generate sample data: three blobs (the last 100 points stay near the origin)
X = np.random.randn(300, 2)
X[:100] += [2, 2]       # Cluster 1
X[100:200] += [-2, -2]  # Cluster 2

# Standardize features
X_scaled = StandardScaler().fit_transform(X)

# Apply DBSCAN
dbscan = DBSCAN(eps=0.5, min_samples=5)
labels = dbscan.fit_predict(X_scaled)

# Results
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = list(labels).count(-1)
print(f"Clusters: {n_clusters}, Noise points: {n_noise}")
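After fitting, scikit-learn also records which samples were core points (core_sample_indices_), which lets you split the data into core, border, and noise points explicitly. A short sketch on synthetic data:

```python
from sklearn.cluster import DBSCAN
import numpy as np

X = np.random.RandomState(0).randn(100, 2)
db = DBSCAN(eps=0.5, min_samples=5).fit(X)

core_mask = np.zeros(len(X), dtype=bool)
core_mask[db.core_sample_indices_] = True   # indices of core samples

noise_mask = db.labels_ == -1               # noise is labeled -1
border_mask = ~core_mask & ~noise_mask      # clustered, but not core

print(f"core: {core_mask.sum()}, border: {border_mask.sum()}, noise: {noise_mask.sum()}")
```

The three masks partition the dataset, which is a handy check when tuning eps and min_samples.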

Choosing Parameters

eps (epsilon)

k-distance graph method:

from sklearn.neighbors import NearestNeighbors
import matplotlib.pyplot as plt

# Compute k-nearest neighbor distances
k = 5  # Usually k = minPts
neighbors = NearestNeighbors(n_neighbors=k)
neighbors.fit(X_scaled)
distances, _ = neighbors.kneighbors(X_scaled)

# Sort each point's distance to its k-th neighbor
# (note: with n_neighbors=k, column 0 is the point itself at distance 0)
k_distances = np.sort(distances[:, k-1])

plt.plot(k_distances)
plt.xlabel('Points sorted by distance')
plt.ylabel(f'{k}th nearest neighbor distance')
plt.title('k-distance graph (elbow = eps)')
plt.axhline(y=0.5, color='r', linestyle='--', label='eps = 0.5')
plt.legend()
plt.show()

# Look for "elbow" - sharp increase indicates good eps
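If you want a programmatic starting point rather than eyeballing the plot, one crude heuristic (an assumption of this sketch, not a standard rule) is to take the index where the sorted k-distance curve bends most sharply, via a discrete second difference:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Two well-separated blobs as example data
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(100, 2), rng.randn(100, 2) + [5, 5]])

k = 5
distances, _ = NearestNeighbors(n_neighbors=k).fit(X).kneighbors(X)
k_distances = np.sort(distances[:, k - 1])

# Crude elbow pick: index of the largest discrete second difference
elbow = int(np.argmax(np.diff(k_distances, 2))) + 1
eps_estimate = float(k_distances[elbow])
print(f"estimated eps: {eps_estimate:.3f}")
```

Treat the result as a starting value to refine visually, not a final answer.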

minPts

Rules of thumb:

  • minPts ≥ dimensions + 1 (minimum)
  • minPts ≥ 2 × dimensions (recommended)
  • Larger minPts = smoother clusters, more noise
  • For 2D data: minPts = 4-5 is common
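The rules of thumb above translate directly into code (the helper name is illustrative):

```python
def suggest_min_pts(n_dims):
    """Rule of thumb: minPts = 2 * dimensions, floored at dimensions + 1."""
    return max(2 * n_dims, n_dims + 1)

print(suggest_min_pts(2))   # → 4, matching the common 2D choice
print(suggest_min_pts(10))  # → 20
```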

DBSCAN vs K-Means

| Aspect | DBSCAN | K-Means |
| --- | --- | --- |
| Cluster shape | Arbitrary | Spherical |
| Number of clusters | Automatic | Must specify k |
| Handles outliers | Yes (marks as noise) | No (assigns all points) |
| Density | Handles varying density poorly | Assumes uniform |
| Parameters | eps, minPts | k |
| Deterministic | Mostly (border points can depend on visit order) | No (random init) |

Visual Comparison

Data with non-spherical clusters:

     ●●●●●●                    ●●●●●●
   ●●      ●●                ●●      ●●  
  ●●        ●●              ●●        ●●
 ●●          ●●            ●●          ●●
  ●●        ●●              ●●        ●●
   ●●      ●●                ●●      ●●
     ●●●●●●                    ●●●●●●
         ○○○○○
        ○     ○                 
       ○       ○
        ○     ○
         ○○○○○

DBSCAN: Finds 2 clusters     K-Means: Cuts across shapes
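The picture above can be checked quantitatively: on two interlocking half-moons, DBSCAN recovers the true shapes while K-Means cuts across them, which shows up in the adjusted Rand index (ARI) against the true labels. A sketch (eps=0.2 is a choice suited to this noise level, not a universal value):

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

X, y_true = make_moons(n_samples=300, noise=0.05, random_state=42)

labels_db = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
labels_km = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)

ari_db = adjusted_rand_score(y_true, labels_db)
ari_km = adjusted_rand_score(y_true, labels_km)
print(f"DBSCAN ARI: {ari_db:.2f}, K-Means ARI: {ari_km:.2f}")
```

An ARI of 1.0 means perfect agreement with the true grouping; K-Means scores much lower here because its spherical partitions slice through both moons.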

Variations

HDBSCAN (Hierarchical DBSCAN)

Handles varying densities:

import hdbscan

clusterer = hdbscan.HDBSCAN(
    min_cluster_size=10,
    min_samples=5,
    cluster_selection_epsilon=0.0
)
labels = clusterer.fit_predict(X_scaled)

# Also provides cluster probabilities
probs = clusterer.probabilities_

OPTICS

Produces a reachability plot for different eps values:

from sklearn.cluster import OPTICS

optics = OPTICS(min_samples=5, xi=0.05)
labels = optics.fit_predict(X_scaled)

# Can extract clusters at different thresholds
reachability = optics.reachability_
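A useful consequence is that once OPTICS is fitted, a DBSCAN-style flat clustering at any eps can be extracted from the reachability data without refitting; scikit-learn exposes this as cluster_optics_dbscan (the blob data here is illustrative):

```python
import numpy as np
from sklearn.cluster import OPTICS, cluster_optics_dbscan

# Two well-separated blobs as example data
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(100, 2), rng.randn(100, 2) + [6, 6]])

optics = OPTICS(min_samples=5).fit(X)

# Extract a flat, DBSCAN-like clustering at a chosen eps, reusing the fit
labels = cluster_optics_dbscan(
    reachability=optics.reachability_,
    core_distances=optics.core_distances_,
    ordering=optics.ordering_,
    eps=0.5,
)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(f"{n_clusters} clusters at eps=0.5")
```

This makes it cheap to sweep eps values after a single fit, which is the main practical draw of OPTICS over plain DBSCAN.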

Strengths

  1. No need to specify k: Finds natural clusters
  2. Arbitrary shapes: Not limited to spherical clusters
  3. Outlier detection: Naturally identifies noise
  4. Robust: Less sensitive to initialization
  5. Efficient: O(n log n) with spatial indexing (e.g., a k-d tree), O(n²) without

Limitations

  1. Varying density: Struggles when clusters have different densities
  2. High dimensions: Distance becomes less meaningful (curse of dimensionality)
  3. Parameter sensitivity: Results depend heavily on eps
  4. Border points: May be assigned arbitrarily when between clusters
  5. Memory: Requires distance computations for all pairs

When to Use DBSCAN

Good For

  • Data with noise/outliers
  • Non-spherical cluster shapes
  • Unknown number of clusters
  • Spatial data (geographic clustering)
  • Anomaly detection

Not Ideal For

  • Clusters with very different densities
  • High-dimensional data (> 20 dimensions)
  • When clusters must include all points
  • Large datasets without spatial indexing

Practical Example

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Create moon-shaped clusters
X, _ = make_moons(n_samples=300, noise=0.05)

# Add some outliers
outliers = np.random.uniform(-1.5, 2.5, (20, 2))
X = np.vstack([X, outliers])

# Cluster with DBSCAN
db = DBSCAN(eps=0.2, min_samples=5)
labels = db.fit_predict(X)

# Plot results
plt.figure(figsize=(10, 5))

plt.subplot(1, 2, 1)
plt.scatter(X[:, 0], X[:, 1], c='gray', alpha=0.5)
plt.title('Original Data')

plt.subplot(1, 2, 2)
colors = ['red', 'blue', 'green', 'purple', 'orange']
for i in range(max(labels) + 1):
    mask = labels == i
    plt.scatter(X[mask, 0], X[mask, 1], c=colors[i % len(colors)], label=f'Cluster {i}')

# Plot noise points
noise_mask = labels == -1
plt.scatter(X[noise_mask, 0], X[noise_mask, 1], c='black', marker='x', label='Noise')
plt.title('DBSCAN (eps=0.2, minPts=5)')
plt.legend()

plt.tight_layout()
plt.show()

Key Takeaways

  1. DBSCAN finds clusters based on density, not distance to centroids
  2. Automatically determines number of clusters and identifies outliers
  3. Works well for arbitrary cluster shapes
  4. Key parameters: eps (neighborhood radius) and minPts (density threshold)
  5. Use k-distance graph to choose eps
  6. Consider HDBSCAN for varying density clusters