# DBSCAN Clustering

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) groups points that are closely packed together, marks points in low-density regions as outliers, and can find clusters of arbitrary shape.
## Core Concepts

### Key Parameters

- **ε (eps)**: maximum distance between two points for them to count as neighbors
- **minPts**: minimum number of points required to form a dense region
### Point Types

```
●●●●●      ← Core points (≥ minPts neighbors within ε)
●   ●
●   ●
●   ●
●●●●●
      ○    ← Border point (< minPts neighbors, but near a core point)
         × ← Noise point (outlier)
```
- **Core point**: has at least minPts points within distance ε
- **Border point**: within ε of a core point, but has fewer than minPts neighbors itself
- **Noise point**: neither core nor border; an outlier
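With scikit-learn, all three point types can be recovered after fitting: `core_sample_indices_` lists the core points, label `-1` marks noise, and every other labeled point is a border point. A minimal sketch on a hypothetical five-point line plus one outlier (the data and parameter values are illustrative):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical data: four points spaced 0.1 apart, plus one far-away outlier
X = np.array([[0.0, 0], [0.1, 0], [0.2, 0], [0.3, 0], [1.0, 0]])

# Note: sklearn's min_samples counts the point itself
db = DBSCAN(eps=0.15, min_samples=3).fit(X)

core_mask = np.zeros(len(X), dtype=bool)
core_mask[db.core_sample_indices_] = True

point_types = np.where(db.labels_ == -1, "noise",
                       np.where(core_mask, "core", "border"))
print(point_types.tolist())
# → ['border', 'core', 'core', 'border', 'noise']
```

The two endpoints of the chain have only two neighbors each (including themselves), so they are border points attached to the core in the middle; the distant point has no core point in range and becomes noise.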
## Algorithm

A reference implementation in plain Python. The sentinel labels and the brute-force `get_neighbors` helper are added here so the pseudocode runs as-is; a real implementation would answer the neighbor queries with a spatial index.

```python
UNVISITED, NOISE = -2, -1

def get_neighbors(X, point, eps):
    # Brute-force scan: indices of all points within eps of `point`
    return [j for j, other in enumerate(X)
            if sum((a - b) ** 2 for a, b in zip(point, other)) ** 0.5 <= eps]

def dbscan(X, eps, min_pts):
    labels = [UNVISITED] * len(X)
    cluster_id = 0
    for i, point in enumerate(X):
        if labels[i] != UNVISITED:
            continue
        # Find neighbors within eps
        neighbors = get_neighbors(X, point, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # Mark as noise (may change later)
        else:
            # Start a new cluster
            expand_cluster(X, labels, i, neighbors, cluster_id, eps, min_pts)
            cluster_id += 1
    return labels

def expand_cluster(X, labels, point_idx, neighbors, cluster_id, eps, min_pts):
    labels[point_idx] = cluster_id
    i = 0
    while i < len(neighbors):
        neighbor_idx = neighbors[i]
        if labels[neighbor_idx] == NOISE:
            labels[neighbor_idx] = cluster_id  # Border point joins the cluster
        elif labels[neighbor_idx] == UNVISITED:
            labels[neighbor_idx] = cluster_id
            # If this neighbor is also a core point, grow the search frontier
            new_neighbors = get_neighbors(X, X[neighbor_idx], eps)
            if len(new_neighbors) >= min_pts:
                neighbors.extend(new_neighbors)
        i += 1
```
## Scikit-learn Implementation

```python
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
import numpy as np

# Generate sample data
X = np.random.randn(300, 2)
X[:100] += [2, 2]       # Cluster 1
X[100:200] += [-2, -2]  # Cluster 2

# Standardize features
X_scaled = StandardScaler().fit_transform(X)

# Apply DBSCAN
dbscan = DBSCAN(eps=0.5, min_samples=5)
labels = dbscan.fit_predict(X_scaled)

# Results: noise points get the label -1
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = list(labels).count(-1)
print(f"Clusters: {n_clusters}, Noise points: {n_noise}")
```
## Choosing Parameters

### eps (epsilon)

**k-distance graph method:**

```python
from sklearn.neighbors import NearestNeighbors
import matplotlib.pyplot as plt

# Compute the distance to each point's k-th nearest neighbor
# (ask for k + 1 neighbors: each point's nearest "neighbor" is itself)
k = 5  # usually k = minPts
neighbors = NearestNeighbors(n_neighbors=k + 1)
neighbors.fit(X_scaled)
distances, _ = neighbors.kneighbors(X_scaled)

# Sort and plot the k-th distance; a sharp increase ("elbow") suggests a good eps
k_distances = np.sort(distances[:, k])
plt.plot(k_distances)
plt.xlabel('Points sorted by distance')
plt.ylabel(f'{k}th nearest neighbor distance')
plt.title('k-distance graph (elbow = eps)')
plt.axhline(y=0.5, color='r', linestyle='--', label='eps = 0.5')
plt.legend()
plt.show()
```
### minPts

Rules of thumb:

- minPts ≥ dimensions + 1 (absolute minimum)
- minPts ≥ 2 × dimensions (recommended starting point)
- Larger minPts → smoother clusters, but more points labeled as noise
- For 2D data, minPts = 4-5 is common
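These rules of thumb are starting points; in practice it helps to sweep both parameters together and watch how the cluster and noise counts respond. A quick sketch (the grid values and the `make_moons` data are illustrative, not prescriptive):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# Sweep a small grid of eps and min_samples values
for eps in (0.1, 0.2, 0.3):
    for min_samples in (3, 5, 10):
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        n_noise = int(np.sum(labels == -1))
        print(f"eps={eps:.1f} min_samples={min_samples:2d} "
              f"-> clusters={n_clusters}, noise={n_noise}")
```

A stable region of the grid, where the cluster count stops changing as the parameters vary, is usually a better choice than any single "optimal" cell.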
DBSCAN vs K-Means
| Aspect | DBSCAN | K-Means |
|---|---|---|
| Cluster shape | Arbitrary | Spherical |
| Number of clusters | Automatic | Must specify k |
| Handles outliers | Yes (marks as noise) | No (assigns all points) |
| Density | Handles varying density poorly | Assumes uniform |
| Parameters | eps, minPts | k |
| Deterministic | Yes | No (random init) |
## Visual Comparison

Data with non-spherical clusters:

```
●●●●●●      ●●●●●●
●●    ●●    ●●    ●●
●●    ●●    ●●    ●●
●●    ●●    ●●    ●●
●●    ●●    ●●    ●●
●●    ●●    ●●    ●●
●●●●●●      ●●●●●●

     ○○○○○
    ○     ○
    ○     ○
    ○     ○
     ○○○○○
```

DBSCAN finds the clusters as drawn; K-Means cuts across the shapes.
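This difference can be checked numerically. A sketch using `make_moons` and the adjusted Rand index (ARI) against the known labels; the exact scores depend on the random seed, but DBSCAN should track the true moons far better than K-Means here:

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)

db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# ARI = 1.0 means a perfect match with the true grouping
print("DBSCAN ARI: ", adjusted_rand_score(y_true, db_labels))
print("K-Means ARI:", adjusted_rand_score(y_true, km_labels))
```

K-Means splits the interleaved moons with a straight boundary, so its ARI stays well below DBSCAN's on this data.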
## Variations

### HDBSCAN (Hierarchical DBSCAN)

Handles varying densities:

```python
import hdbscan

clusterer = hdbscan.HDBSCAN(
    min_cluster_size=10,
    min_samples=5,
    cluster_selection_epsilon=0.0
)
labels = clusterer.fit_predict(X_scaled)

# Also provides per-point cluster membership strengths
probs = clusterer.probabilities_
```
### OPTICS

Produces a reachability plot, from which clusterings at different density thresholds can be extracted:

```python
from sklearn.cluster import OPTICS

optics = OPTICS(min_samples=5, xi=0.05)
labels = optics.fit_predict(X_scaled)

# Reachability distances, in cluster order
reachability = optics.reachability_
```
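One OPTICS fit can be sliced into DBSCAN-style clusterings at several eps values with scikit-learn's `cluster_optics_dbscan`, without re-running the expensive neighbor computation. A sketch on illustrative moon data:

```python
from sklearn.cluster import OPTICS, cluster_optics_dbscan
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
optics = OPTICS(min_samples=5).fit(X)

# Extract a flat DBSCAN-like labeling at each eps from the single OPTICS run
for eps in (0.1, 0.2):
    labels = cluster_optics_dbscan(
        reachability=optics.reachability_,
        core_distances=optics.core_distances_,
        ordering=optics.ordering_,
        eps=eps,
    )
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    print(f"eps={eps}: {n_clusters} clusters")
```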
## Strengths

- **No need to specify k**: finds the number of clusters from the data
- **Arbitrary shapes**: not limited to spherical clusters
- **Outlier detection**: naturally identifies noise
- **Robust**: no random initialization to be sensitive to
- **Efficient**: O(n log n) with spatial indexing, O(n²) worst case
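The spatial-indexing point is exposed directly in scikit-learn's `DBSCAN` through its `algorithm` parameter. The backend only changes how neighborhoods are found, so on well-separated data the clustering itself should not change; a sketch with illustrative data:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)

# Same parameters, different neighbor-search backends
brute = DBSCAN(eps=0.2, min_samples=5, algorithm="brute").fit_predict(X)
tree = DBSCAN(eps=0.2, min_samples=5, algorithm="kd_tree").fit_predict(X)
print("identical labels:", np.array_equal(brute, tree))
```

For low-dimensional data, the tree-based backends are what makes the O(n log n) behavior achievable in practice.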
## Limitations

- **Varying density**: struggles when clusters have very different densities
- **High dimensions**: distances become less meaningful (curse of dimensionality)
- **Parameter sensitivity**: results depend heavily on eps
- **Border points**: may be assigned to either cluster when reachable from two
- **Memory**: naive implementations compute all pairwise distances
## When to Use DBSCAN

### Good For

- Data with noise/outliers
- Non-spherical cluster shapes
- Unknown number of clusters
- Spatial data (geographic clustering)
- Anomaly detection

### Not Ideal For

- Clusters with very different densities
- High-dimensional data (> 20 dimensions)
- When every point must belong to a cluster
- Large datasets without spatial indexing
## Practical Example

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Create moon-shaped clusters
X, _ = make_moons(n_samples=300, noise=0.05)

# Add some outliers
outliers = np.random.uniform(-1.5, 2.5, (20, 2))
X = np.vstack([X, outliers])

# Cluster with DBSCAN
db = DBSCAN(eps=0.2, min_samples=5)
labels = db.fit_predict(X)

# Plot results
plt.figure(figsize=(10, 5))
plt.subplot(1, 2, 1)
plt.scatter(X[:, 0], X[:, 1], c='gray', alpha=0.5)
plt.title('Original Data')

plt.subplot(1, 2, 2)
colors = ['red', 'blue', 'green', 'purple', 'orange']
for i in range(max(labels) + 1):
    mask = labels == i
    plt.scatter(X[mask, 0], X[mask, 1], c=colors[i % len(colors)], label=f'Cluster {i}')

# Plot noise points
noise_mask = labels == -1
plt.scatter(X[noise_mask, 0], X[noise_mask, 1], c='black', marker='x', label='Noise')
plt.title('DBSCAN (eps=0.2, minPts=5)')
plt.legend()

plt.tight_layout()
plt.show()
```
## Key Takeaways

- DBSCAN finds clusters based on density, not distance to centroids
- Automatically determines the number of clusters and identifies outliers
- Works well for arbitrary cluster shapes
- Key parameters: eps (neighborhood radius) and minPts (density threshold)
- Use a k-distance graph to choose eps
- Consider HDBSCAN for clusters of varying density