# Feature Engineering
Feature engineering is the process of using domain knowledge to create, transform, and select features that make machine learning algorithms work better. It's often the difference between a mediocre model and a great one.
## Why Feature Engineering Matters

> "Coming up with features is difficult, time-consuming, requires expert knowledge. Applied machine learning is basically feature engineering." — Andrew Ng
Good features can:
- Make simple models perform like complex ones
- Reduce training time
- Improve interpretability
- Handle domain-specific patterns
## Numerical Features

### Scaling

**Standardization (Z-score):**

```python
X_scaled = (X - X.mean()) / X.std()  # mean 0, std 1
```

Use for: most algorithms, especially distance-based ones (k-NN, SVM, k-means).

**Min-Max Scaling:**

```python
X_scaled = (X - X.min()) / (X.max() - X.min())  # range [0, 1]
```

Use for: neural networks, or when you need bounded values.

**Robust Scaling:**

```python
X_scaled = (X - X.median()) / (X.quantile(0.75) - X.quantile(0.25))  # center by median, scale by IQR
```

Use for: data with outliers, since the median and IQR barely move when extreme values appear.
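In practice these three transformations usually come from sklearn rather than being hand-rolled. A minimal sketch of all three, assuming the features arrive as NumPy arrays (the toy data is illustrative):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

X_train = np.array([[1.0], [2.0], [3.0], [100.0]])  # 100.0 is an outlier
X_test = np.array([[2.5]])

for scaler in (StandardScaler(), MinMaxScaler(), RobustScaler()):
    scaler.fit(X_train)  # learn statistics from training data only
    print(type(scaler).__name__, scaler.transform(X_test).ravel())
```

Note how RobustScaler is the least distorted by the outlier, since it centers and scales with the median and IQR.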
### Transformations

**Log Transform:**

```python
X_log = np.log1p(X)  # log(1 + X), safe for zeros
```

Use for: right-skewed data (income, prices).

**Square Root:**

```python
X_sqrt = np.sqrt(X)
```

Use for: count data, moderate skew.

**Box-Cox / Yeo-Johnson:**

```python
from sklearn.preprocessing import PowerTransformer

pt = PowerTransformer(method='yeo-johnson')  # unlike Box-Cox, Yeo-Johnson handles zero and negative values
X_transformed = pt.fit_transform(X)
```

Use for: automatic normalization when you don't want to pick a transform by hand.
### Binning

Convert continuous features to categorical ones:

```python
# Custom bin edges (pass an integer to pd.cut for equal-width bins)
age_group = pd.cut(df['age'], bins=[0, 18, 35, 50, 65, 100],
                   labels=['child', 'young', 'middle', 'senior', 'elderly'])

# Equal-frequency bins (quantiles)
income_bracket = pd.qcut(df['income'], q=5, labels=['Q1', 'Q2', 'Q3', 'Q4', 'Q5'])
```

Use for: capturing non-linear relationships, reducing noise.
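sklearn offers the same idea as a fit/transform step, which slots into a pipeline. A minimal sketch with `KBinsDiscretizer` (the toy ages and bin count are illustrative):

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

ages = np.array([[5], [17], [24], [41], [58], [73], [90]])

# strategy='quantile' mimics pd.qcut; strategy='uniform' gives equal-width bins
binner = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='quantile')
age_bins = binner.fit_transform(ages)

print(age_bins.ravel())      # bin index per row
print(binner.bin_edges_[0])  # the learned edges
```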
## Categorical Features

### One-Hot Encoding

```python
pd.get_dummies(df['color'], prefix='color')
# columns: color_blue, color_green, color_red
```

Warning: high cardinality means many columns; one mitigation is sketched below.
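One way to tame cardinality before one-hot encoding is to collapse rare categories into a single bucket. A minimal sketch (the 5% cutoff and the 'other' label are arbitrary choices):

```python
import pandas as pd

df = pd.DataFrame({'color': ['red'] * 60 + ['blue'] * 39 + ['chartreuse']})

freq = df['color'].value_counts(normalize=True)
rare = freq[freq < 0.05].index  # categories under 5% of rows
df['color_grouped'] = df['color'].where(~df['color'].isin(rare), 'other')

dummies = pd.get_dummies(df['color_grouped'], prefix='color')
print(dummies.columns.tolist())  # ['color_blue', 'color_other', 'color_red']
```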
### Label Encoding

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit_transform(['red', 'blue', 'green'])  # array([2, 0, 1]); classes are sorted alphabetically
```

Use for: tree-based models, ordinal data. (Strictly, LabelEncoder is intended for targets; sklearn's OrdinalEncoder does the same job for feature columns.)
### Target Encoding

```python
mean_by_category = df.groupby('category')['target'].mean()
df['category_encoded'] = df['category'].map(mean_by_category)
```

Warning: risk of data leakage, because each row's encoding includes its own target value. Compute the encoding out-of-fold, as sketched below.
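A minimal out-of-fold sketch, where each fold is encoded using statistics computed only from the other folds (the toy DataFrame and fold count are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

df = pd.DataFrame({
    'category': list('ABABCABCAB'),
    'target':   [1, 0, 1, 1, 0, 1, 0, 0, 1, 1],
})

df['category_encoded'] = np.nan
global_mean = df['target'].mean()  # fallback for categories unseen in a fold

for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(df):
    fold_means = df.iloc[train_idx].groupby('category')['target'].mean()
    val_index = df.index[val_idx]
    df.loc[val_index, 'category_encoded'] = (
        df.loc[val_index, 'category'].map(fold_means).fillna(global_mean)
    )
```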
### Frequency Encoding

```python
freq = df['category'].value_counts(normalize=True)
df['category_freq'] = df['category'].map(freq)
```
### Embedding Encoding

For high-cardinality categoricals in neural networks:

```python
nn.Embedding(num_categories, embedding_dim)
```
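A minimal PyTorch sketch, assuming the categories have already been mapped to integer ids (the sizes are illustrative):

```python
import torch
import torch.nn as nn

num_categories, embedding_dim = 10_000, 16
embedding = nn.Embedding(num_categories, embedding_dim)

category_ids = torch.tensor([3, 17, 3, 9542])  # one id per row
vectors = embedding(category_ids)              # shape (4, 16); weights are learned during training
print(vectors.shape)
```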
## Date/Time Features

```python
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df['dayofweek'] = df['date'].dt.dayofweek  # Monday=0, Sunday=6
df['is_weekend'] = df['dayofweek'].isin([5, 6])
df['hour'] = df['date'].dt.hour
df['quarter'] = df['date'].dt.quarter

# Cyclical encoding for periodic features
df['month_sin'] = np.sin(2 * np.pi * df['month'] / 12)
df['month_cos'] = np.cos(2 * np.pi * df['month'] / 12)
```
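Why the sin/cos pair matters: as raw numbers, December (12) and January (1) sit eleven units apart even though they are adjacent months; on the sin/cos circle they are neighbors. A small check:

```python
import numpy as np

def month_to_xy(month):
    angle = 2 * np.pi * month / 12
    return np.sin(angle), np.cos(angle)

dist = lambda a, b: np.hypot(a[0] - b[0], a[1] - b[1])

dec, jan, jun = month_to_xy(12), month_to_xy(1), month_to_xy(6)
print(dist(dec, jan))  # ~0.52: December and January are close
print(dist(dec, jun))  # 2.0: half a year apart is maximally distant
```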
## Feature Creation

### Interaction Features

```python
df['area'] = df['length'] * df['width']
df['bmi'] = df['weight'] / df['height']**2  # weight in kg, height in meters
df['price_per_sqft'] = df['price'] / df['sqft']
```
### Polynomial Features

```python
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
# For two inputs, creates: x1, x2, x1², x1*x2, x2²
```
### Aggregation Features

```python
# Group statistics
df['user_avg_purchase'] = df.groupby('user_id')['amount'].transform('mean')
df['user_purchase_count'] = df.groupby('user_id')['amount'].transform('count')
df['amount_vs_user_avg'] = df['amount'] / df['user_avg_purchase']
```
### Lag Features (Time Series)

```python
df['sales_lag1'] = df['sales'].shift(1)
df['sales_lag7'] = df['sales'].shift(7)
df['sales_rolling_mean_7'] = df['sales'].rolling(7).mean()
df['sales_rolling_std_7'] = df['sales'].rolling(7).std()
# shift() and rolling() leave NaNs at the start of the series; drop or impute them before training
```
## Text Features

### Basic

```python
df['text_length'] = df['text'].str.len()
df['word_count'] = df['text'].str.split().str.len()
df['avg_word_length'] = df['text'].apply(
    lambda x: np.mean([len(w) for w in x.split()]) if x.split() else 0  # guard against empty strings
)
df['num_exclamation'] = df['text'].str.count('!')
```
### TF-IDF

```python
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(max_features=1000)
X_tfidf = tfidf.fit_transform(df['text'])  # sparse matrix of shape (n_rows, ≤1000)
```
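As with scalers, fit the vectorizer on training text only and reuse it on later splits; `get_feature_names_out()` maps columns back to terms. A minimal sketch (the corpus is illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_texts = ["great product, works great", "terrible, broke in a day"]
test_texts = ["great value"]

tfidf = TfidfVectorizer(max_features=1000)
X_train = tfidf.fit_transform(train_texts)  # vocabulary is learned here
X_test = tfidf.transform(test_texts)        # and only reused here
print(tfidf.get_feature_names_out())
```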
## Feature Selection

### Filter Methods

```python
# Variance threshold: drop near-constant features
from sklearn.feature_selection import VarianceThreshold

sel = VarianceThreshold(threshold=0.01)
X_reduced = sel.fit_transform(X)

# Correlation with target
correlations = df.corrwith(df['target']).abs().sort_values(ascending=False)
```
### Wrapper Methods

```python
# Recursive Feature Elimination
from sklearn.feature_selection import RFE

rfe = RFE(estimator=model, n_features_to_select=10)  # estimator must expose coef_ or feature_importances_
X_selected = rfe.fit_transform(X, y)
```
### Embedded Methods

```python
# L1 regularization (Lasso) zeroes out the coefficients of weak features; see the sketch below
# Tree-based feature importance
importances = model.feature_importances_
```
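A minimal sketch of the Lasso route via sklearn's `SelectFromModel` (the synthetic data and alpha value are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

X_scaled = StandardScaler().fit_transform(X)  # Lasso is sensitive to feature scale
selector = SelectFromModel(Lasso(alpha=0.05))
X_selected = selector.fit_transform(X_scaled, y)
print(selector.get_support().sum(), 'features kept')
```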
## Common Pitfalls

### 1. Data Leakage

```python
# WRONG: fitting on all data lets test-set statistics leak into training
scaler.fit(X)

# RIGHT: fit only on training data, then transform both splits
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```
### 2. Target Leakage

Features that would not be available at prediction time. For example, using "number of support calls about a cancellation" to predict churn encodes the outcome itself.
### 3. High Cardinality

Too many one-hot columns lead to overfitting and slow training; prefer target, frequency, or embedding encodings, or group rare categories as shown earlier.
### 4. Missing Values

Handle them before or during feature engineering; many sklearn transformers and estimators reject NaNs outright. A sketch follows.
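A minimal imputation sketch with sklearn's `SimpleImputer` (the strategy and toy data are illustrative):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_train = np.array([[1.0, 7.0],
                    [np.nan, 8.0],
                    [3.0, np.nan]])

imputer = SimpleImputer(strategy='median')  # 'mean', 'most_frequent', 'constant' also available
X_train_imputed = imputer.fit_transform(X_train)
# As everywhere else: fit on the training split, then imputer.transform(X_test)
```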
## Best Practices

- Understand your data first: EDA before engineering
- Use domain knowledge: the best features come from understanding the problem
- Keep it simple: start simple, add complexity as needed
- Validate properly: use cross-validation, watch for leakage
- Document features: track what each feature represents
- Automate pipelines: use sklearn's Pipeline for reproducibility (see the sketch after this list)
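A minimal pipeline sketch that chains scaling, selection, and a model, so that every fit/transform step sees only that fold's training data during cross-validation (the synthetic data, steps, and hyperparameters are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=25, n_informative=8, random_state=0)

pipe = Pipeline([
    ('scale', StandardScaler()),
    ('select', SelectKBest(f_classif, k=10)),
    ('model', LogisticRegression(max_iter=1000)),
])

# Each CV fold refits the scaler and selector on that fold's training data only,
# which sidesteps the leakage pitfalls above
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```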
## Key Takeaways
- Feature engineering often matters more than model choice
- Scale numerical features for distance-based algorithms
- Handle categorical features appropriately for your model
- Create interaction and domain-specific features
- Use aggregations for relational data
- Always prevent data leakage with proper train/test separation