Feature:
Feature Engineering:
Making new features that are not originally in your data.
Adjust data to algorithm
Enhance ability to interpret the data
Improve performance of your analysis
Feature engineering is about highlighting relationships in your data.
You can do this
Algorithms can do this (Deep Learning)
Dealing with those pesky missing numbers.
Imputation replaces missing values with substitute values of your choice.
Common choices: Mean, Median, Mode
Less common choices: K-Nearest Neighbors, tree-based learners
Example:
County | Day | Temperature |
---|---|---|
Buncombe | Monday | 21 |
Buncombe | Tuesday | 23 |
Buncombe | Wednesday | |
Buncombe | Thursday | 24 |
Buncombe | Friday | 22 |
Buncombe | Saturday | 21 |
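
The missing Wednesday value in the table above can be filled with any of the common choices. The sketch below is a minimal illustration using only the Python standard library. The helper names `impute` and `knn_fill` are hypothetical, and `knn_fill` is a deliberately simplified stand-in for K-Nearest Neighbors imputation that uses position in the week as the only distance measure.

```python
from statistics import mean, median, mode

# Temperatures from the table above; None marks the missing Wednesday value.
days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"]
temps = [21, 23, None, 24, 22, 21]

# Statistics are computed from the observed values only.
observed = [t for t in temps if t is not None]


def impute(values, fill):
    """Return a copy of values with every None replaced by fill."""
    return [fill if v is None else v for v in values]


def knn_fill(values, index, k=2):
    """Average the k observed values closest in position to values[index].

    Simplified assumption: distance is the number of days apart.
    """
    neighbors = sorted(
        (abs(i - index), v) for i, v in enumerate(values) if v is not None
    )
    return mean(v for _, v in neighbors[:k])


missing_at = temps.index(None)

mean_filled = impute(temps, mean(observed))
median_filled = impute(temps, median(observed))
mode_filled = impute(temps, mode(observed))
knn_filled = impute(temps, knn_fill(temps, missing_at))

for name, filled in [
    ("mean", mean_filled),
    ("median", median_filled),
    ("mode", mode_filled),
    ("2 nearest days", knn_filled),
]:
    print(f"{name:>15}: {days[missing_at]} = {filled[missing_at]}")
```

Each strategy gives a different answer for the same gap, which is why the choice of imputation method should be recorded alongside the analysis.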
Wait, aren’t we making up data?
Data transformation changes your data from one distribution to another via a mathematical function.
In general, this is done for a single feature without involving other data.
Leverage?
Normalization: Rescale values to [0,1]
Standardization: Rescale data to have a mean of 0 and a standard deviation of 1.
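
Both rescalings can be written in a few lines. This is a minimal sketch using the Python standard library. The function names are illustrative, and the standardization here assumes the population standard deviation (`pstdev`); using the sample standard deviation instead is an equally valid convention.

```python
from statistics import mean, pstdev


def normalize(values):
    """Min-max rescale values to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


# Observed temperatures from the imputation example.
temps = [21, 23, 24, 22, 21]

normalized = normalize(temps)
standardized = standardize(temps)

print([round(v, 3) for v in normalized])
print([round(v, 3) for v in standardized])
```

Both functions divide by a measure of spread, so a constant feature (all values equal) would raise a `ZeroDivisionError` and needs to be handled separately.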
An epidemiological study with millions of health records related to infant mortality. We have age, parent smoking status, weight (lbs), height (cm), and head circumference (mm). Our research goal is to understand the potential roles of microcephaly and tobacco exposure in infant mortality.
EDA Results:
Feature Engineering:
Impose cut-offs on continuous data
Data discretization, continuous -> categorical/ordinal
Establish relationships where thresholds are important
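
Imposing cut-offs can be sketched as a lookup against sorted bin edges. The example below reuses the temperature values from the imputation table. The cut-offs (22 and 24) and the labels are hypothetical placeholders chosen only to show the mechanics, and `discretize` is an illustrative helper, not a library function.

```python
from bisect import bisect_right


def discretize(value, edges, labels):
    """Map a continuous value to the label of the bin it falls in.

    edges must be sorted, and len(labels) must equal len(edges) + 1.
    A value exactly equal to an edge is placed in the upper bin.
    """
    return labels[bisect_right(edges, value)]


edges = [22, 24]                   # hypothetical cut-offs
labels = ["cool", "mild", "warm"]  # ordinal categories, lowest to highest

temps = [21, 23, 24, 22, 21]
categories = [discretize(t, edges, labels) for t in temps]

print(categories)
```

In a real study the cut-offs should come from domain knowledge (for example, an established clinical threshold) rather than from the data being analyzed.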
## Dimension Reduction Approaches
These approaches are also well suited to visualizing high-dimensional datasets.
Reduces complexity and ‘feature overload’
Faster training
Correlation/Collinearity
Improve signal-to-noise ratio
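
One very simple way to act on correlation/collinearity is to drop any feature that is almost perfectly correlated with one already kept. The sketch below shows that idea only. It is not a projection method such as PCA, the feature names and values are made-up toy data, and the 0.95 threshold is an arbitrary assumption.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def drop_collinear(features, threshold=0.95):
    """Keep a feature only if |r| with every already-kept feature is below threshold."""
    kept = {}
    for name, values in features.items():
        if all(abs(pearson(values, other)) < threshold for other in kept.values()):
            kept[name] = values
    return kept


# Toy data (made up): "b" is roughly 2 * "a", "c" is only weakly related to "a".
features = {
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.1, 3.9, 6.2, 8.0, 9.9],
    "c": [3.0, 1.0, 4.0, 2.0, 5.0],
}

reduced = drop_collinear(features)
print(list(reduced))
```

The result depends on the order in which features are visited, since the first member of a correlated group is the one that survives.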
We want to understand the effect of soil characteristics on rapid moisture loss. We have a large gridded dataset with lat, lon, time, and 34 other features, including soil particle size, average pore space, and soil type.
EDA Results:
Feature Engineering:
Create your own index through a meaningful combination of features.
Examples:
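
As one illustration, temperature, relative humidity, and wind speed can be combined into a single fire-weather style index. The sketch below multiplies wind speed by vapor pressure deficit, which is the general form of the Hot Dry Windy Index. It is a simplified single-point version under stated assumptions: saturation vapor pressure comes from the Tetens approximation, the function names are hypothetical, and no attempt is made to reproduce how the operational index is derived from atmospheric profiles.

```python
from math import exp


def saturation_vapor_pressure_hpa(temp_c):
    """Tetens approximation of saturation vapor pressure (hPa), temp in deg C."""
    return 6.108 * exp(17.27 * temp_c / (temp_c + 237.3))


def vapor_pressure_deficit_hpa(temp_c, rel_humidity_pct):
    """Difference between saturation and actual vapor pressure (hPa)."""
    return saturation_vapor_pressure_hpa(temp_c) * (1 - rel_humidity_pct / 100)


def hot_dry_windy_sketch(temp_c, rel_humidity_pct, wind_speed_ms):
    """Simplified index: wind speed (m/s) times vapor pressure deficit (hPa)."""
    return wind_speed_ms * vapor_pressure_deficit_hpa(temp_c, rel_humidity_pct)


# Placeholder inputs: 20 deg C, 50 % relative humidity, 10 m/s wind.
index_value = hot_dry_windy_sketch(20.0, 50.0, 10.0)
print(round(index_value, 1))
```

The new feature is larger when the air is hotter, drier, or windier, so three raw inputs collapse into one column that tracks the condition of interest.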
We are interested in predicting the Hot Dry Windy Index for the next week using non-time series machine learning approaches. The only data we have are historical values that function as inputs for the HDWI.
Feature Engineering:
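
One common way to make a time series usable by a non-time series learner is to build lagged features: each row holds the most recent values as inputs and a value some days ahead as the target. The sketch below assumes daily data and a 7-day horizon. `make_lagged_rows` is a hypothetical helper and the series is a placeholder, not real HDWI data.

```python
def make_lagged_rows(series, n_lags, horizon):
    """Turn a 1-D series into (features, target) pairs.

    features: the n_lags most recent values up to and including time t
    target:   the value horizon steps after time t
    """
    rows = []
    for t in range(n_lags - 1, len(series) - horizon):
        features = series[t - n_lags + 1 : t + 1]
        target = series[t + horizon]
        rows.append((features, target))
    return rows


# Placeholder daily series (values 1..12), not real observations.
daily_values = list(range(1, 13))

rows = make_lagged_rows(daily_values, n_lags=3, horizon=7)
for features, target in rows:
    print(features, "->", target)
```

Each row can then be passed to any tabular learner. Rows should be split into training and test sets by time rather than at random, so that the model is never trained on values from after the period it is tested on.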
Earth System Data Science in the Cloud