Machine Learning Metrics

Session Overview

What they mean and how to interpret/implement them

  • Why Metrics?
  • Bias-Variance Tradeoff
  • Classification Metrics
  • Regression Metrics

Why Metrics

Begin with the end in mind…


Objective Measures:

  • How do we know we are successful?
  • How do we communicate our success?
  • How do we interpret our success?

Model Prediction Error

We want our models to be:

  • Generalizable (work with previous AND new data)
  • Low error

Bias - Variance Tradeoff

Bias: Difference between model prediction and correct value.

Variance: Variability of a prediction at a given point (how much the prediction changes when the model is trained on different data).

Bias - Variance Tradeoff | Example

Research Objective: Measure soil temperature using remote sensing observations.

Method: Use 10% of cloudless satellite photos, calculate the mean ground temperature from a single wavelength band.


Our Error

Bias-Variance Error

Finding the Sweet Spot

Study Design:

Bias:

  • Remove systematic bias from your sampling

Variance:

  • Increase sample size to reduce variance

Tradeoff?

Algorithms:

  • Less complex models (Linear, Parametric) tend to have higher bias, but lower variance

  • More complex models (Trees, Deep Learning) tend to have lower bias, but higher variance
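
A minimal sketch (not from the slides) of this tradeoff using scikit-learn: a linear model underfits a nonlinear signal (high bias), while an unpruned decision tree fits the training data almost perfectly but generalizes worse (high variance). The synthetic dataset and model choices below are purely illustrative.

# Illustrative sketch: a simple vs. a complex model on the same noisy data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)  # noisy nonlinear target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for model in (LinearRegression(), DecisionTreeRegressor(random_state=0)):
    model.fit(X_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(X_train))
    test_err = mean_squared_error(y_test, model.predict(X_test))
    # Linear model: similar (but high) train and test error -> bias-dominated.
    # Unpruned tree: near-zero train error, larger test error -> variance-dominated.
    print(type(model).__name__, f"train MSE = {train_err:.3f}", f"test MSE = {test_err:.3f}")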


Training:

Metrics


Finding the sweet spot requires metrics!

Session Overview

What they mean and how to interpret/implement them

  • Why Metrics?
  • Bias-Variance Tradeoff
  • Classification Metrics
  • Regression Metrics

Classification and Regression

Supervised Learning!

Classification:

  • Performance is measured by how many labels it gets right (or wrong).
  • Performance metrics: Accuracy, Precision, Recall

Regression:

  • Performance is measured by how close it comes to the correct value
  • Performance metrics: Mean Absolute Error, Root-Mean-Square-Error

Regression

Root Mean Square Error (RMSE):

  • Procedure: square all the errors, average them, then take the square root.
  • Weights larger errors more heavily

  • Desirable when large errors should be penalized more (i.e., the relative cost of a large error is greater than that of a small one)

  • RMSE increases with the variance of the frequency distribution of error magnitudes.

  • RMSE tends to increase as sample size increases (bad for comparing across different sample sizes)

  • Less intuitively explainable to stakeholders

Mean Absolute Error (MAE):

  • Procedure: take the absolute value of every error, then average the results.
  • All errors are weighted equally.

  • Does not increase with the variance of the frequency distribution of error magnitudes.

  • Not sensitive to sample size.

  • Intuitively explainable to stakeholders
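
A minimal sketch of both procedures with NumPy; the observed and predicted values are made up purely to show how a single large error inflates RMSE more than MAE.

# Illustrative sketch: RMSE and MAE from paired observed/predicted arrays.
import numpy as np

y_true = np.array([2.0, 3.5, 4.0, 5.5, 10.0])   # observed values (made up)
y_pred = np.array([2.5, 3.0, 4.5, 5.0, 7.0])    # model predictions (made up)

errors = y_pred - y_true
rmse = np.sqrt(np.mean(errors ** 2))   # square, average, square root
mae = np.mean(np.abs(errors))          # absolute value, average

# The single large error (10.0 observed vs. 7.0 predicted) pulls RMSE well above MAE.
print(f"RMSE = {rmse:.2f}, MAE = {mae:.2f}")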

Takeaways:

  • With Regression you are measuring error, not accuracy!

  • Low error is good, high error is bad!

  • Continuous metric - know your relative units!

  • Other metrics: \(R^2\), MAPE, Adj \(R^2\), MSE, etc.

Classification

Confusion Matrix:

Accuracy:

  • The number of predictions the model got right, divided by the total number of predictions.
  • Overall, how well did the model do at making correct predictions
  • What happens with imbalanced data?

Balanced Accuracy:

  • Average of the True Positive Rate (recall) and the True Negative Rate (specificity)

\(Balanced\ Accuracy = (TPR + TNR) / 2\)

Precision:

  • What proportion of positive identifications was actually correct?

  • The number of true positives divided by the total number the model thought were positive

\(Precision = TP / (TP + FP)\)

Recall:

  • AKA: Sensitivity

  • What proportion of actual positives was identified correctly?

  • The number of true positives divided by the sum of true positives and false negatives.

\(Recall = TP/(TP + FN)\)

Specificity:

  • What proportion of actual negatives was identified correctly?

  • The number of true negatives divided by the sum of true negatives and false positives

\(Specificity = TN/(TN + FP)\)
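
A minimal sketch of these confusion-matrix metrics using scikit-learn; the label arrays below are illustrative.

# Illustrative sketch: confusion-matrix metrics with scikit-learn.
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             precision_score, recall_score, confusion_matrix)

y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]   # actual labels (made up)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]   # predicted labels (made up)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print("Accuracy:         ", accuracy_score(y_true, y_pred))          # (TP + TN) / total
print("Balanced accuracy:", balanced_accuracy_score(y_true, y_pred)) # (TPR + TNR) / 2
print("Precision:        ", precision_score(y_true, y_pred))         # TP / (TP + FP)
print("Recall:           ", recall_score(y_true, y_pred))            # TP / (TP + FN)
print("Specificity:      ", tn / (tn + fp))                          # TN / (TN + FP)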

Classification: Curves

  • Classification algorithms don’t just provide a class.

  • They generate a probability of belonging to each class, then apply a threshold to assign the class.

Receiver Operator Characteristic (ROC) Curve

  • Plots performance of classification model at different thresholds
  • Usually True Positive Rate and False Positive Rate
  • Area Under the Curve (AUC) is used as a metric

Figure: ROC curve plotting True Positive Rate vs. False Positive Rate at different decision thresholds.

Figure: AUC, the area under the ROC curve.
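
A minimal sketch of building the ROC curve and computing AUC with scikit-learn, assuming the classifier exposes class probabilities; the scores below are illustrative.

# Illustrative sketch: ROC curve and AUC from predicted class probabilities.
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [0, 0, 1, 1, 0, 1, 0, 1]                      # actual labels (made up)
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.65, 0.7]    # predicted P(class = 1) (made up)

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # one (FPR, TPR) point per threshold
auc = roc_auc_score(y_true, y_score)

print("AUC =", auc)
# Plotting fpr vs. tpr (e.g., with matplotlib) reproduces the ROC curve described above.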

Classification Takeaways

  • With Classification you are measuring accuracy (or similar metrics), not error!

  • High Accuracy is good, low accuracy is bad

  • Continuous metric, but bounded between 0 and 1 (or 0% and 100%)

  • Use confusion matrices to understand your situation

  • Other metrics: many others, including variations on these curves and the F1 score, which combines precision and recall

But…

Log Loss:

  • Classification algorithms don’t just provide a class.

  • They generate a probability of belonging to each class, then apply a threshold to assign the class.

  • What if we used a metric that measured error of the probabilities?

Log Loss measures how close the predicted class probability is to the correct value. The farther away the probability is, the higher the log loss value.
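
For binary classification the standard formula is \(Log\ Loss = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log(p_i) + (1 - y_i)\log(1 - p_i)\right]\). A minimal sketch with scikit-learn, using made-up probabilities to show that confident wrong predictions are penalized hardest:

# Illustrative sketch: log loss compares predicted probabilities to true labels.
from sklearn.metrics import log_loss

y_true = [1, 1, 0, 0]   # actual labels (made up)

confident_and_right = [0.95, 0.90, 0.05, 0.10]   # predicted P(class = 1)
hesitant            = [0.60, 0.55, 0.45, 0.40]
confident_and_wrong = [0.10, 0.20, 0.90, 0.85]

for name, probs in [("confident & right", confident_and_right),
                    ("hesitant", hesitant),
                    ("confident & wrong", confident_and_wrong)]:
    print(name, log_loss(y_true, probs))   # lower is better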

Session Overview

What they mean and how to interpret/implement them

  • Why Metrics?
  • Bias-Variance Tradeoff
  • Classification Metrics
  • Regression Metrics
