Model Development Workflow

Our Framework

Session Overview

  • Applied Modeling Recap
  • Questions!
  • ML Paradigms
  • Workflow

Applied Modeling Recap


“All models are wrong, but some are useful.”

George Box

Understanding and Prediction


Understanding: How does something work?


Prediction: What will happen?

Understanding


Approach:

  • Apply a mechanistic model

    • Multivariate Regression

    • Differential Equations

  • Goal: Understand the relative weighting of parts and how they interact.

  • Result: Estimation of how a natural phenomenon works.


Prediction


Approach:

  • Apply a model and estimate its predictive performance (not necessarily its goodness of fit)

    • Pretty much any model

    • Including ‘Black Box’ Models

  • Goal: Reproducibly produce accurate and precise estimates of desired outcome given inputs.

  • Result: A tool to map inputs to outputs with known performance.

Modeling in this Session

We can use the same type of model for both prediction and understanding.


  • We will be wrong.
    • We want to know how wrong.
    • We want to be useful!

Learning

Unsupervised

  • Tell the model to look for something….


Supervised

  • Tell the model what to look for…


Semi-Supervised

  • Tell the model some of what to look for…

Learning

Unsupervised

  • Pattern Recognition
  • We ask the computer to learn, then teach us
  • No (or little) a priori knowledge, minimal inputs
  • Examples:
    • Clustering
    • Hidden Markov Models
    • Anomaly Detection
    • Autoencoders
    • Generative Networks
    • Object Feature Recognition

Supervised

  • Pattern Mapping
  • We teach the computer to learn by example
  • Provide labeled or informational inputs
  • Examples
    • Regression
    • Classification
    • Facial Recognition
    • Most ML problems start or end here….

Semi-supervised

  • Blend of pattern mapping and pattern finding
  • Mix of Supervised and Unsupervised approaches
  • Some labels, mostly unlabeled data
  • Examples: Most use cases where label generation at scale is prohibitively expensive.

Research Questions and Learning Type:

  • Which of Supervised, Semi-supervised, or Unsupervised learning are you likely to use in your team project? In your research?

Classification and Regression

Supervised Learning!

Classification:

  • Categorical Response
  • Predicting Labels
  • Each Label is a Class
  • Probability of an observation belonging to a Class is assessed and returned
  • Classes are assigned based on those probabilities (sketched in code below)

Regression:

  • Numerical/Continuous Response
  • Predicting Values
  • Values are returned directly from the model
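
For concreteness, here is a minimal sketch of the probability-to-label step described above. LogisticRegression is only a stand-in classifier, and the data are made-up values loosely based on the drought example later in the session:

```python
# Minimal sketch: a classifier returns class probabilities, and labels are
# assigned from them. Model choice and data are illustrative placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[37, 35, 10], [20, 22, 80], [38, 24, 20], [22, 19, 90]])
y = np.array(["drought", "none", "drought", "none"])

clf = LogisticRegression().fit(X, y)
proba = clf.predict_proba(X)                      # probability of each class per observation
labels = clf.classes_[np.argmax(proba, axis=1)]   # assign the most probable class
print(proba)
print(labels)                                     # equivalent to clf.predict(X)
```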

Classification and Regression

Supervised Learning!

Classification:

  • Can predict only labels it was trained on


  • Performance is measured by how many labels it gets right (or wrong).
  • Performance metrics: Accuracy, Precision, Recall

Regression:

  • Can predict values it was not trained on


  • Performance is measured by how close it comes to the correct value
  • Performance metrics: Mean Absolute Error, Root-Mean-Square-Error
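
As a minimal sketch (not the session's prescribed code), the metrics named above can be computed with scikit-learn; the y_true/y_pred arrays are made-up placeholders for your own labels and values:

```python
# Classification metrics (Accuracy, Precision, Recall) and
# regression metrics (MAE, RMSE) on placeholder predictions.
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             mean_absolute_error, mean_squared_error)

# Classification: compare predicted labels to true labels
y_true_cls = ["drought", "none", "drought", "none"]
y_pred_cls = ["drought", "drought", "drought", "none"]
print("Accuracy: ", accuracy_score(y_true_cls, y_pred_cls))
print("Precision:", precision_score(y_true_cls, y_pred_cls, pos_label="drought"))
print("Recall:   ", recall_score(y_true_cls, y_pred_cls, pos_label="drought"))

# Regression: compare predicted values to true values
y_true_reg = np.array([10.0, 40.0, 30.0])
y_pred_reg = np.array([12.0, 35.0, 33.0])
print("MAE: ", mean_absolute_error(y_true_reg, y_pred_reg))
print("RMSE:", np.sqrt(mean_squared_error(y_true_reg, y_pred_reg)))
```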

Case Studies & Examples

Research Question: Will there be a drought next week?

Information:

  • Past week’s maximum soil temperature
  • Past week’s average soil temperature
  • Past week’s average humidity
  • Soil Type
Drought   | Week ID | County ID | Prev. Week Max Temp | Prev. Week Avg Temp | Prev. Week Avg Soil Humidity | Soil Type
----------|---------|-----------|---------------------|---------------------|------------------------------|-----------
Severe    | 1       | 3477      | 37                  | 35                  | 10                           | Sandy Loam
None      | 1       | 3211      | 20                  | 22                  | 80                           | Sandy Loam
Moderate  | 2       | 3100      | 38                  | 24                  | 20                           | Sandy Loam
Moderate  | 1       | 2011      | 22                  | 15                  | 20                           | Sandy Loam
Wet!      | 3       | 1022      | 22                  | 19                  | 90                           | Sandy Loam
None      | 1       | 1102      | 21                  | 19                  | 50                           | Sandy Loam
Moderate  | 1       | 2204      | 20                  | 15                  | 20                           | Sandy Loam
Wet!      | 1       | 2224      | 20                  | 15                  | 90                           | Sandy Loam
…         | …       | …         | …                   | …                   | …                            | …

Case Studies & Examples

Research Question: Will there be a drought next week?

  • Supervised, Unsupervised, Semi-Supervised?

  • Classification, Regression?

Case Studies & Examples

Research Question: What is the HDWI?

Information:

  • Insolation
  • Percent Vegetation Cover
  • Vegetation Type
  • Altitude
HDWI | Insolation | Vegetation Cover (%) | Vegetation Type | Altitude
-----|------------|----------------------|-----------------|---------
70   | 40         | 80                   | Trees           | 3000
40   | 20         | 30                   | Trees           | 1500
30   | 40         | 20                   | Grass           | 2800
10   | 50         | 15                   | Grass           | 1200
60   | 10         | 7                    | Corn!           | 900
…    | …          | …                    | …               | …

Case Studies & Examples

Research Question: What is the HDWI?

  • Supervised, Unsupervised, Semi-Supervised?

  • Classification, Regression?

Case Studies & Examples

Research Question: How have patterns of precipitation changed in the tropics?

Information:

  • Precipitation in the Tropics


Research Question: What archetypes present themselves for hospital admission on hot days?

Information:

  • Hospital Admissions on hot (>20 °C) days

Supervised, Unsupervised, Semi-Supervised?

Paradigms

Statistical and ML Models can be thought of as:

Maps

  • Algorithms map inputs to outputs in a predictable, repeatable way
  • Helpful paradigm for linking/connecting type problems.
  • What two pieces of information would be useful to connect?

Compression

  • Algorithms strip out redundant information, reducing the data to its critical pieces of information.
  • Critical information can be reconstituted via decompression
  • Helpful for dimension reduction/distillation type problems
  • What is the minimum set of useful information that allows me to understand my data?

Relationship Discovery

  • Algorithms detect latent relationships embedded in data
  • Data result from consistent outcomes of interactions
  • Helpful paradigm for ‘Understanding’ type problems.
  • Which features are related to the response, and how?

Difference Assessment

  • Algorithms separate groups based on embedded characteristics
  • Data arise from different processes
  • Helpful paradigm for classification and hypothesis testing type problems
  • Are there differences in my data? What groups do new data belong to?

Our Approach

How do we teach this?

  1. Start Small and Scale Up
  2. Start Backwards
    • Know the End before Beginning
    • Metrics!
  3. Common Workflow Bakes in Best Practices
  4. Tools
  5. ML Algorithms
  6. Performance and Scaling

Session Overview

  • Goals
  • Metrics
  • Model Evaluation
    • Training and Testing Data
  • Preparing your data
    • Information Leakage
    • Order of Operations
  • Training
    • Cross validation!
  • Tuning/Optimization
  • Fitting
  • Out of Sample Performance
  • Workflow Evaluation

Today

Focus: Supervised Classification and Regression

  • Many common applications
  • Common workflows and principles
  • Workflow best practices apply for deep learning and not so deep learning
  • Process and principles are more important than the tools

Goals

  1. Define your very specific research question

  2. What approach(es) does your research question relate to?

    • mapping, compression, relationship discovery, and/or difference assessment
  3. What type of learning will you use?

    • supervised, unsupervised, semi-supervised
  4. What type of data are you working with?

    • Categorical, continuous, ordinal

    • Image, text, gridded, tabular

    • If supervised, classification or regression?

Metrics

Now that you have defined your research space, what metrics will you use?


What does success look like?


How do we test for success?

Modeling Evaluation

How do we design an experiment to see if our model is working?

Model on a portion of the data, test on another portion.

Training and Test Sets

Basic idea: Train on a portion of the data, test on an isolated, separate portion of the data.


  • Common Splits: 80:20, 70:30, 75:25
  • Time Series Considerations
    • Avoid information leakage
    • Make sure you have enough information in your time series
  • Unbalanced data (Categorical)
    • Make sure splits reflect data structure
  • RANDOM (but reproducible, set your seed)
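
A minimal sketch of such a split with scikit-learn's train_test_split; X and y are small placeholder arrays, and random_state and stratify address the reproducibility and unbalanced-data points above:

```python
# Reproducible 80:20 train/test split on placeholder data.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)                 # 10 observations, 2 features
y = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])     # unbalanced labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,     # 80:20 split
    random_state=42,   # reproducible randomness: set your seed
    stratify=y,        # keep class proportions when data are unbalanced
)
```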

Session Overview

  • Goals
  • Metrics
  • Model Evaluation
    • Training and Testing Data
  • Preparing your data
    • Information Leakage
    • Order of Operations
  • Training
    • Cross validation!
  • Tuning/Optimization
  • Fitting
  • Out of Sample Performance
  • Workflow Evaluation

Preparing your Data

Information Leakage

Information leakage happens when the model has access during training to information it would not have in practice, or when the information you are predicting is somehow embedded in your predictors.

  • Do not allow information from your test set to leak into your training process; fit these steps on the training data only (see the pipeline sketch below):

    • Standardization
    • Dimension Reduction
    • Imputation
  • Order of Operations

    • Drop NA
    • Near Zero Variance
    • Dimension Reduction
    • One Hot Encoding
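
A minimal sketch of leakage-safe preprocessing using a scikit-learn Pipeline; the imputer, scaler, PCA, and logistic regression steps are illustrative choices on synthetic data, not a prescribed recipe. Because the pipeline is fit on the training split only, the preprocessing statistics never see the test split:

```python
# Leakage-safe preprocessing: imputation, standardization, and dimension
# reduction are fit on the training data only, then applied to the test data.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
X[::7, 0] = np.nan                      # a few missing values to impute
y = (X[:, 1] > 0).astype(int)           # placeholder binary response

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),  # imputation
    ("scale", StandardScaler()),                 # standardization
    ("reduce", PCA(n_components=2)),             # dimension reduction
    ("model", LogisticRegression()),
])

# fit() learns all preprocessing statistics from the training split only;
# score()/predict() reuse those statistics on the test split, so nothing leaks.
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))
```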

Session Overview

  • Goals
  • Metrics
  • Model Evaluation
    • Training and Testing Data
  • Preparing your data
    • Information Leakage
    • Order of Operations
  • Training
    • Cross validation!
  • Tuning/Optimization
  • Fitting
  • Out of Sample Performance
  • Workflow Evaluation

Training

When you actually apply a model to your dataset. :)

Wasn’t splitting your data into training and testing sets a good idea? Cross validation applies that same idea again, inside the training set.

Cross Validation

  • Creates folds (subsets) of the data to alternately leave out and include.

  • This resampling can help with appropriate fitting (guarding against over- and under-fitting)

  • Can be used to tune model parameters.

  • k-fold, n-repeated

    • k: the number of folds
    • n: the number of times you repeat the sampling
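
A minimal sketch of k-fold, n-repeated cross validation with scikit-learn; the Ridge model and the synthetic data are placeholders:

```python
# 5-fold cross validation, repeated 3 times, on placeholder regression data.
import numpy as np
from sklearn.model_selection import RepeatedKFold, cross_val_score
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=60)

cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=0)   # k folds, n repeats
scores = cross_val_score(Ridge(), X, y, cv=cv,
                         scoring="neg_mean_absolute_error")
print(scores.mean(), scores.std())   # average performance across the 15 folds
```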

Regularization


Session Overview

  • Goals
  • Metrics
  • Model Evaluation
    • Training and Testing Data
  • Preparing your data
    • Information Leakage
    • Order of Operations
  • Training
    • Cross validation!
  • Tuning/Optimization
  • Fitting
  • Out of Sample Performance
  • Workflow Evaluation

Tuning and Optimization


  • Parameters: Values determined by fitting your model to the training data; also called fitted parameters, coefficients, model parameters, or trained parameters. This is what you train.

\(y = mx + b\)
\(y = f(x_0 w_0 + x_1 w_1)\)

  • Hyperparameters: Values that govern the structure of the model. This is what you tune.
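
A minimal sketch of the distinction: GridSearchCV tunes a hyperparameter (here Ridge's regularization strength alpha, chosen only as an example), while the fitted coefficients are the trained parameters. The data are synthetic placeholders:

```python
# Hyperparameter tuning (what you tune) vs. fitted parameters (what you train).
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
X = rng.normal(size=(80, 3))
y = X @ np.array([2.0, 0.0, -1.0]) + rng.normal(scale=0.2, size=80)

search = GridSearchCV(Ridge(),
                      param_grid={"alpha": [0.01, 0.1, 1.0, 10.0]},  # hyperparameter grid
                      cv=5)
search.fit(X, y)

print("Tuned hyperparameter:", search.best_params_)            # what you tune
print("Trained parameters:  ", search.best_estimator_.coef_)   # what you train
```

With the default refit=True, GridSearchCV also refits the best hyperparameter setting on the full training data, which is the Fitting step described next.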

Fitting

Train your best model (after tuning and/or optimization) on the whole training dataset.

Out of Sample Performance

How well did your model do on the test set?

  • Generate predictions on your test set using your fitted model.
  • Evaluate metrics of interest based on these predictions.
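
A minimal sketch of this step; the synthetic data, the Ridge model, and the MAE metric are placeholders for whatever you chose in the Goals and Metrics steps:

```python
# Out-of-sample evaluation: predict on the held-out test set, then score it.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -1.0, 0.5]) + rng.normal(scale=0.2, size=100)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2)

model = Ridge().fit(X_train, y_train)        # fitted on the training set only
y_pred = model.predict(X_test)               # predictions on the held-out test set
print(mean_absolute_error(y_test, y_pred))   # evaluate your metric of interest
```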

Workflow Evaluation

  • How did performance on the test set compare to internal performance during cross validation? (Compared in the sketch below.)

  • How did different models compare?

  • Overfitting, Underfitting?

  • Insights:

    • Mapping -> Strength of connection
    • Compression -> Reduction in parameter space
    • Relationship Discovery -> How do the predictors interact
    • Difference Assessment -> What does our model tell us about groups
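
A minimal sketch of the first comparison above, contrasting cross-validated error on the training set with error on the held-out test set; the data and model are again synthetic placeholders:

```python
# Compare internal (cross-validated) error with held-out test error.
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(3)
X = rng.normal(size=(120, 4))
y = X @ np.array([1.0, 0.5, -1.0, 0.0]) + rng.normal(scale=0.3, size=120)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=3)

model = Ridge().fit(X_train, y_train)
cv_mae = -cross_val_score(model, X_train, y_train, cv=5,
                          scoring="neg_mean_absolute_error").mean()
test_mae = mean_absolute_error(y_test, model.predict(X_test))

# A test error much larger than the cross-validated error points to overfitting;
# both being large points to underfitting.
print(f"CV MAE: {cv_mae:.3f}  Test MAE: {test_mae:.3f}")
```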

Session Overview

  • Goals
  • Metrics
  • Model Evaluation
    • Training and Testing Data
  • Preparing your data
    • Information Leakage
    • Order of Operations
  • Training
    • Cross validation!
  • Tuning/Optimization
  • Fitting
  • Out of Sample Performance
  • Workflow Evaluation
