Feature Engineering

Session Overview

  • What is feature engineering?
  • Why do it?
  • Transformation
  • Extraction
  • Creation

Feature Engineering

Feature:

  • A dimension of your data
  • A column of your data
  • A factor

Feature Engineering:

Making new features that are not originally in your data.

Why Feature Engineering?

  • Adjust data to algorithm

  • Enhance ability to interpret the data

  • Improve performance of your analysis

Manual vs Automated Feature Engineering

Feature engineering is about highlighting relationships in your data.


  • You can do this

  • Algorithms can do this (Deep Learning)

Transformation

  • Imputation
  • Actual transformation
  • One-hot encoding
  • Outliers
  • Scaling

Imputation

Dealing with those pesky missing numbers.

  • Imputation replaces missing values with the values of your choice.

  • Common choices: Mean, Median, Mode

  • Less common choices: K-Nearest Neighbors, tree-based learners

Example:

County     Day        Temperature
Buncombe   Monday     21
Buncombe   Tuesday    23
Buncombe   Wednesday  (missing)
Buncombe   Thursday   24
Buncombe   Friday     22
Buncombe   Saturday   21
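
A minimal sketch of mean imputation on the table above, assuming pandas and scikit-learn; the column names are taken from the example.

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data from the table above; Wednesday's temperature is missing.
df = pd.DataFrame({
    "county": ["Buncombe"] * 6,
    "day": ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"],
    "temperature": [21, 23, None, 24, 22, 21],
})

# Replace the missing value with the column mean (22.2 here).
imputer = SimpleImputer(strategy="mean")
df["temperature"] = imputer.fit_transform(df[["temperature"]]).ravel()
print(df)
```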

Wait, aren’t we making up data?

  • Yes, and
  • Advantages in terms of preserving information about our datasets
  • Can result in unbiased estimates (often better than dropping)
  • Preserves dataset size
  • Can improve model performance

Transformation

  • Data transformation changes your data from one distribution to another via a mathematical function.

  • In general, this is done on a single feature without involving other features.

Why Transformations?

  • Assumptions of Statistical Models
  • Adjust importance of relative differences
  • Change relationships between two (or more) variables
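
As a concrete illustration, a log transform is a common choice for pulling in a long right tail; a minimal sketch with NumPy (the values are made up).

```python
import numpy as np

weights = np.array([6.5, 7.0, 7.4, 8.1, 22.0])  # long right tail

# log1p(x) = log(1 + x): compresses large values while keeping their order.
log_weights = np.log1p(weights)
print(log_weights.round(2))
```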

One-hot encoding
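
One-hot encoding turns a categorical feature into one 0/1 column per category. A minimal sketch with pandas; the smoker column is a made-up example.

```python
import pandas as pd

df = pd.DataFrame({"smoker": ["yes", "no", "yes", "no"]})

# One binary column per category: smoker_no, smoker_yes.
encoded = pd.get_dummies(df, columns=["smoker"])
print(encoded)
```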

Outliers

Leverage?
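
One simple way to flag candidate outliers is the interquartile-range (IQR) rule; a minimal sketch with pandas, where the 1.5 multiplier is a common convention rather than a fixed rule.

```python
import pandas as pd

weights = pd.Series([7.1, 7.9, 8.4, 6.8, 7.5, 45.0])  # one obvious outlier

q1, q3 = weights.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Boolean mask of points outside the IQR fences.
outliers = (weights < lower) | (weights > upper)
print(weights[outliers])
```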

Scaling

  • Normalization: Rescale values to [0, 1].

  • Standardization: Rescale data to have a mean of 0 and a standard deviation of 1.
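
A minimal sketch of both rescalings with scikit-learn; X stands in for any numeric feature matrix.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[150.0], [160.0], [175.0], [190.0]])  # e.g. heights in cm

# Normalization: squeeze values into [0, 1].
X_norm = MinMaxScaler().fit_transform(X)

# Standardization: mean 0, standard deviation 1.
X_std = StandardScaler().fit_transform(X)

print(X_norm.ravel().round(2), X_std.ravel().round(2))
```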

Why feature scaling?

  • Units
  • Relative scales of features
    • Super important for distance-based algorithms
    • Super important for PCA, K-Means, KNN
  • Gradient Descent!

Scaling Article

Transformation Example

An epidemiological study with millions of health records related to infant mortality. We have age, parent smoking status, weight (lbs), height (cm), and head circumference (mm). Our research goal is to understand the potential roles of microcephaly and tobacco exposure in infant mortality.


EDA Results:

  • Age (yrs) ranges from 0-6. 1% missing data.
  • Parent smoking is either yes or no. 20% missing data.
  • Weight is in lbs and has no missing data, but some massive outliers. Its distribution is decidedly non-normal.
  • Height is in cm and has 30% missing data. No obvious outliers, but the distribution is far from normal.
  • Head circumference is in mm and has no missing data. No obvious outliers, but the distribution has a long tail.


Feature Engineering:

  • Age: might want to drop the rows with missing data (only 1%)
  • Parent smoking: One-hot encode & Impute
  • Weight: Scale, remove outliers
  • Height: Impute, scale
  • Head: Scale
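
One way to wire this plan together is scikit-learn's ColumnTransformer; a sketch only, assuming the columns are named age, parent_smoking, weight, height, and head_circumference.

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Parent smoking: impute the 20% missing values, then one-hot encode.
smoking = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

# Height: impute the 30% missing values, then scale.
height = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

preprocess = ColumnTransformer([
    ("smoking", smoking, ["parent_smoking"]),
    ("height", height, ["height"]),
    ("scale", StandardScaler(), ["weight", "head_circumference"]),
], remainder="passthrough")  # age passes through unchanged

# Usage (after dropping the 1% of rows with missing age and handling weight outliers):
# X_ready = preprocess.fit_transform(df.dropna(subset=["age"]))
```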

Session Overview

  • What is feature engineering?
  • Why do it?
  • Transformation
  • Extraction
  • Creation

Extraction

  • Binning
  • Dimension Reduction
  • Clustering

Binning

  • Impose cut-offs on continuous data

  • Data discretization: continuous -> categorical/ordinal

  • Establish relationships where thresholds are important
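
A minimal sketch of binning with pandas: continuous values are cut at chosen thresholds into ordered categories (the cut-offs and labels here are arbitrary).

```python
import pandas as pd

ages = pd.Series([0.2, 1.5, 3.0, 4.8, 5.9])

# Continuous -> ordinal: impose cut-offs on a continuous feature.
age_group = pd.cut(ages, bins=[0, 1, 3, 6], labels=["infant", "toddler", "child"])
print(age_group)
```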

Dimension Reduction

  • Understand relationships in many dimensions and project that relationship into a smaller number of dimensions.

Dimension Reduction Approaches

  • Principal Coordinates Analysis
  • Multi-Dimensional Scaling
  • t-distributed Stochastic Neighbor Embedding
  • Uniform Manifold Approximation and Projection


These are also really good approaches for visualizing high-dimensional datasets.
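
A minimal sketch using two of these approaches as implemented in scikit-learn (MDS and t-SNE live in sklearn.manifold); the data here are random placeholders.

```python
import numpy as np
from sklearn.manifold import MDS, TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 34))  # e.g. 100 samples, 34 scaled features

# Project 34 dimensions down to 2 for modelling or visualization.
X_mds = MDS(n_components=2).fit_transform(X)
X_tsne = TSNE(n_components=2, perplexity=30).fit_transform(X)
print(X_mds.shape, X_tsne.shape)
```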

Why Dimension Reduction

  • Reduces complexity and ‘feature overload’

  • Faster training

  • Correlation/Collinearity

  • Improve Signal to Noise ratio

Clustering
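
Cluster assignments can themselves be used as a new categorical feature. A minimal sketch with scikit-learn's KMeans on made-up two-dimensional data.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])  # two blobs

# The cluster label becomes a new feature for each sample.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels[:5], labels[-5:])
```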

Extraction Example

We want to understand the effect of soil characteristics on rapid moisture loss and have a large gridded dataset with lat, lon, time, and 34 other features, such as soil particle size, average pore space, and soil type.

EDA Results:

  • All 34 features are numeric, contain no missing values, and have already been scaled.
  • 10 are highly correlated with one another.
  • Feature-by-feature visualization shows that two features have distinct groupings.

Feature Engineering:

  • Dimension reduction to collapse the highly correlated features and/or reduce the overall number of dimensions
  • Clustering on two features to identify distinct groups
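
A sketch of that plan, assuming the correlated columns and the two grouping features have already been identified; the names and data below are placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
df = pd.DataFrame(base + 0.05 * rng.normal(size=(200, 10)),
                  columns=[f"soil_feature_{i}" for i in range(10)])  # 10 correlated columns

# Collapse the correlated block into a couple of components.
components = PCA(n_components=2).fit_transform(df)

# Two features with visible groupings -> a cluster label as a new feature.
grouped = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(4, 1, (100, 2))])
df["soil_group"] = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(grouped)
print(components.shape, df["soil_group"].value_counts())
```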

Session Overview

  • What is feature engineering?
  • Why do it?
  • Transformation
  • Extraction
  • Creation

Creation

  • Indexing (Creating your index)
  • Lagging

Indexing

Create your own index through a meaningful combination of features.


Examples:

  • Hot Dry Windy Index
  • Standardized Precipitation Index
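
A sketch of constructing a composite index; the Hot Dry Windy Index is commonly described as wind speed multiplied by vapor pressure deficit, but treat this formulation and the column names as assumptions rather than the official definition.

```python
import pandas as pd

# Hypothetical daily weather observations.
df = pd.DataFrame({
    "wind_speed": [4.0, 6.5, 8.0],              # m/s (assumed units)
    "vapor_pressure_deficit": [1.2, 2.0, 3.1],  # kPa (assumed units)
})

# New feature: a composite index built from existing columns.
df["hdwi"] = df["wind_speed"] * df["vapor_pressure_deficit"]
print(df)
```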

Lagging
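
Lagging shifts earlier values of a series forward so that yesterday's (or last week's) value becomes a feature for today's row. A minimal sketch with pandas shift; the series is made up.

```python
import pandas as pd

df = pd.DataFrame({"hdwi": [10.0, 12.0, 9.5, 14.0, 11.0]})

# Past values become predictors for the current row.
df["hdwi_lag1"] = df["hdwi"].shift(1)
df["hdwi_lag2"] = df["hdwi"].shift(2)  # NaNs appear where history is missing
print(df)
```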

Creation Examples

We are interested in predicting the Hot Dry Windy Index for the next week using non-time-series machine learning approaches. The only data we have are historical values that serve as inputs to the HDWI.

Feature Engineering:

  • Construct HDWI
  • Lag the existing time series to use historical values as features