Feature Engineering

Session Overview

  • What is feature engineering?
  • Why do it?
  • Transformation
  • Extraction
  • Creation

Feature Engineering

Feature:

  • A dimension of your data
  • A column of your data
  • A factor

Feature Engineering:

Making new features that are not originally in your data.

Why Feature Engineering?

  • Adjust data to algorithm

  • Enhance ability to interpret the data

  • Improve performance of your analysis

Manual vs Automated Feature Engineering

Feature engineering is about highlighting relationships in your data.


  • You can do this

  • Algorithms can do this (Deep Learning)

Transformation

  • Imputation
  • Actual transformation
  • One-hot encoding
  • Outliers
  • Scaling

Imputation

Dealing with those pesky missing numbers.

  • Imputation replaces missing values with the values of your choice.

  • Common choices: Mean, Median, Mode

  • Less common choices: K-Nearest Neighbors, tree-based learners

Example:

County     Day        Temperature
Buncombe   Monday     21
Buncombe   Tuesday    23
Buncombe   Wednesday  (missing)
Buncombe   Thursday   24
Buncombe   Friday     22
Buncombe   Saturday   21
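
A minimal sketch of mean imputation on the table above, assuming pandas and scikit-learn; the column names are taken from the example.

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data from the table above; Wednesday's temperature is missing.
df = pd.DataFrame({
    "county": ["Buncombe"] * 6,
    "day": ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"],
    "temperature": [21, 23, None, 24, 22, 21],
})

# Replace the missing value with the column mean (22.2 here).
imputer = SimpleImputer(strategy="mean")
df["temperature"] = imputer.fit_transform(df[["temperature"]]).ravel()
print(df)
```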

Wait, aren’t we making up data?

  • Yes, and
  • Advantages in terms of preserving information about our datasets
  • Can result in unbiased estimates (often better than dropping)
  • Preserves dataset size
  • Can improve model performance

Transformation

  • Data transformation changes your data from one distribution to another via a mathematical function.

  • In general, this is done on a single feature without involving other features.

Why Transformations?

  • Assumptions of Statistical Models
  • Adjust importance of relative differences
  • Change relationships between two (or more) variables
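
As a concrete illustration, a log transform is a common choice for pulling in a long right tail; a minimal sketch with NumPy (the values are made up).

```python
import numpy as np

weights = np.array([6.5, 7.0, 7.4, 8.1, 22.0])  # long right tail

# log1p(x) = log(1 + x): compresses large values while keeping their order.
log_weights = np.log1p(weights)
print(log_weights.round(2))
```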

One-hot encoding
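
One-hot encoding turns a categorical feature into one 0/1 column per category. A minimal sketch with pandas; the smoker column is a made-up example.

```python
import pandas as pd

df = pd.DataFrame({"smoker": ["yes", "no", "yes", "no"]})

# One binary column per category: smoker_no, smoker_yes.
encoded = pd.get_dummies(df, columns=["smoker"])
print(encoded)
```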

Outliers

Leverage?
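
One simple way to flag candidate outliers is the interquartile-range (IQR) rule; a minimal sketch with pandas, where the 1.5 multiplier is a common convention rather than a fixed rule.

```python
import pandas as pd

weights = pd.Series([7.1, 7.9, 8.4, 6.8, 7.5, 45.0])  # one obvious outlier

q1, q3 = weights.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Boolean mask of points outside the IQR fences.
outliers = (weights < lower) | (weights > upper)
print(weights[outliers])
```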

Scaling

  • Normalization: Rescale values to [0, 1].

  • Standardization: Rescale data to have a mean of 0 and a standard deviation of 1.
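
A minimal sketch of both rescalings with scikit-learn; X stands in for any numeric feature matrix.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[150.0], [160.0], [175.0], [190.0]])  # e.g. heights in cm

# Normalization: squeeze values into [0, 1].
X_norm = MinMaxScaler().fit_transform(X)

# Standardization: mean 0, standard deviation 1.
X_std = StandardScaler().fit_transform(X)

print(X_norm.ravel().round(2), X_std.ravel().round(2))
```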

Why feature scaling?

  • Units
  • Relative scales of features
    • Super important for distance-based algorithms
    • Super important for PCA, K-Means, KNN
  • Gradient Descent!

Scaling Article

Transformation Example

An epidemiological study with millions of health records related to infant mortality. We have age, parent smoking status, weight (lbs), height (cm), and head circumference (mm). Our research goal is to understand the potential roles of microcephaly and tobacco exposure in infant mortality.


EDA Results:

  • Age (yrs) ranges from 0-6. 1% missing data.
  • Parent smoking is either yes or no. 20% missing data.
  • Weight is in lbs and has no missing data, but some massive outliers. Its distribution is decidedly non-normal.
  • Height is in cm and has 30% missing data. No obvious outliers, but the distribution is far from normal.
  • Head circumference is in mm and has no missing data. No obvious outliers, but the distribution has a long tail.


Feature Engineering:

  • Age: might want to drop the rows with missing data (only 1%)
  • Parent smoking: One-hot encode & Impute
  • Weight: Scale, remove outliers
  • Height: Impute, scale
  • Head: Scale
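
One way to wire this plan together is scikit-learn's ColumnTransformer; a sketch only, assuming the columns are named age, parent_smoking, weight, height, and head_circumference.

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Parent smoking: impute the 20% missing values, then one-hot encode.
smoking = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

# Height: impute the 30% missing values, then scale.
height = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

preprocess = ColumnTransformer([
    ("smoking", smoking, ["parent_smoking"]),
    ("height", height, ["height"]),
    ("scale", StandardScaler(), ["weight", "head_circumference"]),
], remainder="passthrough")  # age passes through unchanged

# Usage (after dropping the 1% of rows with missing age and handling weight outliers):
# X_ready = preprocess.fit_transform(df.dropna(subset=["age"]))
```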

Session Overview

  • What is feature engineering?
  • Why do it?
  • Transformation
  • Extraction
  • Creation

Extraction

  • Binning
  • Dimension Reduction
  • Clustering

Binning

  • Impose cut-offs on continuous data

  • Data discretization: continuous -> categorical/ordinal

  • Establish relationships where thresholds are important
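
A minimal sketch of binning with pandas: continuous values are cut at chosen thresholds into ordered categories (the cut-offs and labels here are arbitrary).

```python
import pandas as pd

ages = pd.Series([0.2, 1.5, 3.0, 4.8, 5.9])

# Continuous -> ordinal: impose cut-offs on a continuous feature.
age_group = pd.cut(ages, bins=[0, 1, 3, 6], labels=["infant", "toddler", "child"])
print(age_group)
```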

Dimension Reduction

  • Understand relationships in many dimensions and project that relationship into a smaller number of dimensions.

Dimension Reduction Approaches

  • Principal Coordinates Analysis
  • Multi-Dimensional Scaling
  • t-distributed Stochastic Neighbor Embedding
  • Uniform Manifold Approximation and Projection


These are also really good approaches for visualizing high-dimensional datasets.
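
A minimal sketch using two of these approaches as implemented in scikit-learn (MDS and t-SNE live in sklearn.manifold); the data here are random placeholders.

```python
import numpy as np
from sklearn.manifold import MDS, TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 34))  # e.g. 100 samples, 34 scaled features

# Project 34 dimensions down to 2 for modelling or visualization.
X_mds = MDS(n_components=2).fit_transform(X)
X_tsne = TSNE(n_components=2, perplexity=30).fit_transform(X)
print(X_mds.shape, X_tsne.shape)
```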

Why Dimension Reduction

  • Reduces complexity and ‘feature overload’

  • Faster training

  • Correlation/Collinearity

  • Improve Signal to Noise ratio

Clustering
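
Cluster assignments can themselves be used as a new categorical feature. A minimal sketch with scikit-learn's KMeans on made-up two-dimensional data.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])  # two blobs

# The cluster label becomes a new feature for each sample.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels[:5], labels[-5:])
```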

Extraction Example

We want to understand the effect of soil characteristics on rapid moisture loss and have a large gridded dataset with lat, lon, time, and 34 other features, such as soil particle size, average pore space, and soil type.

EDA Results:

  • All 34 features are numeric, contain no missing values, and have already been scaled.
  • 10 are highly correlated with one another.
  • Feature-by-feature visualization shows that two features have distinct groupings.

Feature Engineering:

  • Dimension reduction to collapse the highly correlated features and/or reduce the overall number of dimensions
  • Clustering on two features to identify distinct groups
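
A sketch of that plan, assuming the correlated columns and the two grouping features have already been identified; the names and data below are placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
df = pd.DataFrame(base + 0.05 * rng.normal(size=(200, 10)),
                  columns=[f"soil_feature_{i}" for i in range(10)])  # 10 correlated columns

# Collapse the correlated block into a couple of components.
components = PCA(n_components=2).fit_transform(df)

# Two features with visible groupings -> a cluster label as a new feature.
grouped = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(4, 1, (100, 2))])
df["soil_group"] = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(grouped)
print(components.shape, df["soil_group"].value_counts())
```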

Session Overview

  • What is feature engineering?
  • Why do it?
  • Transformation
  • Extraction
  • Creation

Creation

  • Indexing (Creating your index)
  • Lagging

Indexing

Create your own index through a meaningful combination of features.


Examples:

  • Hot Dry Windy Index
  • Standardized Precipitation Index
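
A sketch of constructing a composite index; the Hot Dry Windy Index is commonly described as wind speed multiplied by vapor pressure deficit, but treat this formulation and the column names as assumptions rather than the official definition.

```python
import pandas as pd

# Hypothetical daily weather observations.
df = pd.DataFrame({
    "wind_speed": [4.0, 6.5, 8.0],              # m/s (assumed units)
    "vapor_pressure_deficit": [1.2, 2.0, 3.1],  # kPa (assumed units)
})

# New feature: a composite index built from existing columns.
df["hdwi"] = df["wind_speed"] * df["vapor_pressure_deficit"]
print(df)
```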

Lagging
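
Lagging shifts earlier values of a series forward so that yesterday's (or last week's) value becomes a feature for today's row. A minimal sketch with pandas shift; the series is made up.

```python
import pandas as pd

df = pd.DataFrame({"hdwi": [10.0, 12.0, 9.5, 14.0, 11.0]})

# Past values become predictors for the current row.
df["hdwi_lag1"] = df["hdwi"].shift(1)
df["hdwi_lag2"] = df["hdwi"].shift(2)  # NaNs appear where history is missing
print(df)
```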

Creation Examples

We are interested in predicting the Hot Dry Windy Index for the next week using non-time-series machine learning approaches. The only data we have are historical values that serve as inputs to the HDWI.

Feature Engineering:

  • Construct HDWI
  • Lag the existing time series to use historical values as features