“All models are wrong, but some are useful.”
Understanding: How does something work?
Prediction: What will happen?
Approach:
Apply a mechanistic model
Multivariate Regression
Differential Equations
Goal: Understand the relative weighting of parts and how they interact.
Result: Estimation of how a natural phenomenon works.
Approach:
Apply a model and estimate its predictive performance (not necessarily its goodness of fit)
Pretty much any model
Including ‘Black Box’ Models
Goal: Reproducibly produce accurate and precise estimates of desired outcome given inputs.
Result: A tool to map inputs to outputs with known performance.
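The difference between goodness of fit and predictive performance can be sketched with a small example. This is a minimal illustration, not a prescribed method: the data are synthetic, and `fit_line` and `rmse` are hypothetical helper names. The model is fit on one part of the data, then scored both on that part (goodness of fit) and on held-out points (predictive performance).

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = m*x + b; returns (m, b)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    return m, mean_y - m * mean_x


def rmse(xs, ys, m, b):
    """Root-mean-square error of the line's predictions on (xs, ys)."""
    squared = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(squared / len(xs))


# Synthetic data (assumption): true slope 2, intercept 1, Gaussian noise.
rng = random.Random(0)
xs = [i / 10 for i in range(50)]
ys = [2.0 * x + 1.0 + rng.gauss(0, 0.5) for x in xs]

# Hold out every fifth point; the model never sees these while fitting.
train_x = [x for i, x in enumerate(xs) if i % 5 != 0]
train_y = [y for i, y in enumerate(ys) if i % 5 != 0]
test_x = [x for i, x in enumerate(xs) if i % 5 == 0]
test_y = [y for i, y in enumerate(ys) if i % 5 == 0]

m, b = fit_line(train_x, train_y)
fit_rmse = rmse(train_x, train_y, m, b)   # goodness of fit (seen data)
pred_rmse = rmse(test_x, test_y, m, b)    # predictive performance (unseen data)
print(f"slope={m:.2f} intercept={b:.2f} fit={fit_rmse:.2f} pred={pred_rmse:.2f}")
```

For a simple model like this the two numbers tend to be close; a large gap between them is one sign that a model fits its training data better than it predicts.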
Can use the same type of model for both predicting and understanding.
Unsupervised
Supervised
Semi-Supervised
Supervised Learning!
Classification:
Regression:
Research Question: Will there be a drought next week?
Information:
Previous Week:
Drought | Week ID | County ID | Max Temp | Avg Temp | Avg Soil Humidity | Soil Type |
---|---|---|---|---|---|---|
Severe | 1 | 3477 | 37 | 35 | 10 | Sandy Loam |
None | 1 | 3211 | 20 | 22 | 80 | Sandy Loam |
Moderate | 2 | 3100 | 38 | 24 | 20 | Sandy Loam |
Moderate | 1 | 2011 | 22 | 15 | 20 | Sandy Loam |
Wet! | 3 | 1022 | 22 | 19 | 90 | Sandy Loam |
None | 1 | 1102 | 21 | 19 | 50 | Sandy Loam |
Moderate | 1 | 2204 | 20 | 15 | 20 | Sandy Loam |
Wet! | 1 | 2224 | 20 | 15 | 90 | Sandy Loam |
… | … | … | … | … | … | … |
Research Question: Will there be a drought next week?
Supervised, Unsupervised, Semi-Supervised?
Classification, Regression?
Research Question: What is the HDWI?
Information:
HDWI | Insolation | Vegetation Cover | Vegetation Type | Altitude |
---|---|---|---|---|
70 | 40 | 80 | Trees | 3000 |
40 | 20 | 30 | Trees | 1500 |
30 | 40 | 20 | Grass | 2800 |
10 | 50 | 15 | Grass | 1200 |
60 | 10 | 7 | Corn! | 900 |
… | … | … | … | … |
Research Question: What is the HDWI?
Supervised, Unsupervised, Semi-Supervised?
Classification, Regression?
Research Question: How have patterns of precipitation changed in the tropics?
Information:
Research Question: What archetypes present themselves for hospital admission on hot days?
Information:
Statistical and ML Models can be thought of as:
Maps
Compression
Relationship Discovery
Difference Assessment
How do we teach this?
Focus: Supervised Classification and Regression
Define your very specific research question
What approach(es) does your research question relate to?
What type of learning will you use?
What type of data are you working with?
Categorical, continuous, ordinal
Image, text, gridded, tabular
If supervised, classification or regression?
Now that you have defined your research space, what metrics will you use?
What does success look like?
How do we test for success?
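As one possible starting point, two common metrics can be written in a few lines: accuracy for classification and root-mean-square error (RMSE) for regression. This is only a sketch; the function names are hypothetical, the predicted values are invented for illustration, and the right metric depends on the research question.

```python
import math


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true class labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root-mean-square error between true and predicted numeric values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


# Classification: categorical labels (the predictions here are made up).
labels_true = ["Severe", "None", "Moderate", "Moderate"]
labels_pred = ["Severe", "None", "Moderate", "None"]
print(accuracy(labels_true, labels_pred))  # 3 of 4 correct

# Regression: numeric targets (the predictions here are made up).
values_true = [70.0, 40.0]
values_pred = [67.0, 44.0]
print(rmse(values_true, values_pred))
```

Accuracy alone can mislead when classes are imbalanced, so it is worth deciding what success looks like before picking a metric.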
How do we design an experiment to see if our model is working?
Model on a portion of the data, test on another portion.
Basic idea: Train on a portion of the data, test on an isolated, separate portion of the data.
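A train/test split can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: `train_test_split` here is a hypothetical helper written for this example (not a library function), and a random split assumes the observations are independent, which may not hold for time series or spatially clustered data.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(10))  # stand-in for ten observations
train, test = train_test_split(rows, test_fraction=0.2, seed=42)
print(len(train), len(test))
```

Fixing the seed makes the split repeatable, so the test set stays the same across experiments.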
Information Leakage
Leakage occurs when the model has information during training that you should not or would not have in practice, or when the information you are predicting is somehow embedded in your predictors.
Do not allow information to pass between your training set and your test set.
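One common way leakage creeps in is preprocessing. The sketch below standardizes a feature using statistics computed from the training data only, then reuses those same statistics on the test data. The temperature values and the helper names `fit_scaler` and `apply_scaler` are assumptions made for illustration.

```python
import statistics


def fit_scaler(values):
    """Learn the mean and (population) standard deviation from training data."""
    return statistics.mean(values), statistics.pstdev(values)


def apply_scaler(values, mean, std):
    """Standardize values using previously learned statistics."""
    return [(v - mean) / std for v in values]


# Hypothetical temperature readings, already split into train and test.
train_temps = [20.0, 22.0, 21.0, 37.0]
test_temps = [38.0, 20.0]

# Correct: statistics come from the training set only.
mean, std = fit_scaler(train_temps)
train_scaled = apply_scaler(train_temps, mean, std)
test_scaled = apply_scaler(test_temps, mean, std)

# Leaky (for contrast): statistics computed over train and test together,
# so the test values influence how the training data are transformed.
leaky_mean, leaky_std = fit_scaler(train_temps + test_temps)
print(mean, leaky_mean)
```

The two means differ, which shows that the leaky version lets the test set shape what the model sees during training.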
Order of Operations
When you actually apply a model to your dataset. :)
Wasn’t splitting your data into training and testing sets a good idea?
Creates folds (sets) of data to alternately leave out and include.
This resampling can help detect overfitting and underfitting.
Can be used to tune model parameters.
k-fold, n-repeated
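Repeated k-fold splitting can be sketched as an index generator. This is a minimal illustration, assuming independent samples; `repeated_k_fold` is a hypothetical helper written for this example. Each repeat reshuffles the data, cuts it into k folds, and holds out each fold once for validation.

```python
import random


def repeated_k_fold(n_samples, k=5, n_repeats=2, seed=0):
    """Yield (train_idx, val_idx) pairs for n_repeats rounds of k-fold CV."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    rng = random.Random(seed)
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        # Deal the shuffled indices into k folds of near-equal size.
        folds = [indices[i::k] for i in range(k)]
        for i, val_idx in enumerate(folds):
            train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
            yield train_idx, val_idx


splits = list(repeated_k_fold(n_samples=10, k=5, n_repeats=2))
for train_idx, val_idx in splits[:2]:
    print(sorted(train_idx), sorted(val_idx))
```

Within one repeat every sample is used for validation exactly once; averaging the score over all folds and repeats gives a steadier estimate for comparing model settings.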
\(y = mx + b\)
\(y = f(x_0 w_0 + x_1 w_1)\)
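The two formulas above can be written directly as functions. The slides do not say what \(f\) is, so the sigmoid used here is an assumption chosen for illustration; any activation function could be passed in instead.

```python
import math


def linear(x, m, b):
    """y = m*x + b: a line with slope m and intercept b."""
    return m * x + b


def sigmoid(z):
    """A common choice for the activation f (assumed here, not given in the slides)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron(xs, ws, activation=sigmoid):
    """y = f(x_0*w_0 + x_1*w_1 + ...): weighted sum passed through f."""
    return activation(sum(x * w for x, w in zip(xs, ws)))


print(linear(2.0, m=3.0, b=1.0))
print(neuron([1.0, 2.0], [0.5, -0.25]))
print(neuron([1.0, 2.0], [3.0, 4.0], activation=lambda z: z))
```

With the identity function as \(f\), the second formula reduces to a linear model without an intercept, which shows how closely the two are related.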
Train your best model (after tuning and/or optimization) on the whole training dataset.
How well did your model do on the test set?
How did performance on test set compare to internal performance with validation?
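The last three steps can be sketched end to end: estimate performance with cross-validation inside the training set, refit on the whole training set, then score once on the test set. This is a minimal illustration on synthetic data; `fit_line` and `rmse` are hypothetical helpers, and a least-squares line stands in for whatever model was selected.

```python
import math
import random


def fit_line(pairs):
    """Ordinary least squares for y = m*x + b on (x, y) pairs; returns (m, b)."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    m = sxy / sxx
    return m, mean_y - m * mean_x


def rmse(pairs, m, b):
    """Root-mean-square error of the line's predictions on (x, y) pairs."""
    return math.sqrt(sum((y - (m * x + b)) ** 2 for x, y in pairs) / len(pairs))


# Synthetic data (assumption): true slope 2, intercept 1, Gaussian noise.
rng = random.Random(1)
data = [(i / 10, 2.0 * (i / 10) + 1.0 + rng.gauss(0, 0.5)) for i in range(60)]
rng.shuffle(data)
train, test = data[:48], data[48:]

# Internal validation: 4-fold cross-validation using the training set only.
k = 4
folds = [train[i::k] for i in range(k)]
cv_scores = []
for i in range(k):
    fit_part = [p for j, fold in enumerate(folds) if j != i for p in fold]
    m, b = fit_line(fit_part)
    cv_scores.append(rmse(folds[i], m, b))
cv_rmse = sum(cv_scores) / k

# Final model: refit on the whole training set, then score once on the test set.
m, b = fit_line(train)
test_rmse = rmse(test, m, b)
print(f"cv_rmse={cv_rmse:.2f} test_rmse={test_rmse:.2f}")
```

If the test score is much worse than the cross-validated score, that can point to overfitting during tuning or to leakage between the sets.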
Models?
Overfitting, Underfitting?
Insights:
Earth System Data Science in the Cloud