Note: Dense Layer
Forward propagation & Backward Propagation
Minimize loss during training -> improves the metrics we actually care about
Loss = cost incurred from incorrect predictions
Optimize Network weights to achieve lowest loss
The learning rate governs the step size we take during gradient descent.
Small learning rate -> slow convergence toward the optimum; more likely to get stuck in local optima
Large learning rate -> unstable updates that can diverge and never converge
Optimal learning rate -> adaptive gradient descent, where the learning rate is adjusted during training (see the sketch below)
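A minimal sketch of gradient descent on a toy quadratic loss, showing how the learning rate scales each update (the function and values here are illustrative, not from the notes):

```python
# Minimal sketch: gradient descent on a toy quadratic loss, L(w) = (w - 3)^2 (illustrative only).
def loss(w):
    return (w - 3.0) ** 2

def grad(w):
    return 2.0 * (w - 3.0)

w = 0.0               # initial weight guess
learning_rate = 0.1   # try 0.01 (slow convergence) or 1.1 (divergence) to see the effect
for step in range(50):
    w -= learning_rate * grad(w)   # step opposite the gradient, scaled by the learning rate

print(round(w, 4), round(loss(w), 8))   # w approaches 3.0 and the loss approaches 0.0
```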
Regularization
Early Stopping
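A minimal sketch of the early-stopping logic, using a made-up validation-loss curve in place of a real training run:

```python
# Toy validation-loss curve standing in for a real training run (illustrative values only).
val_losses = [1.00, 0.80, 0.65, 0.60, 0.58, 0.59, 0.60, 0.61, 0.62, 0.63]

patience = 3                      # epochs to wait for an improvement before stopping
best_val_loss = float("inf")
epochs_without_improvement = 0

for epoch, val_loss in enumerate(val_losses):
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        epochs_without_improvement = 0    # improvement: reset the counter
    else:
        epochs_without_improvement += 1   # no improvement this epoch
        if epochs_without_improvement >= patience:
            print(f"Early stopping at epoch {epoch}; best val loss = {best_val_loss}")
            break
```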
Advanced Complex Topics are just combinations of the Basics
Increasing complexity -> Layers and Activation Functions
Convolution
Pooling
Softmax
Convolutional Neural Networks
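A minimal sketch of those pieces assembled into a small CNN, written here with PyTorch; the layer sizes, 28x28 input shape, and 10-class output are illustrative assumptions:

```python
import torch
import torch.nn as nn

# A tiny CNN: convolution -> pooling -> convolution -> pooling -> dense -> softmax.
model = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=16, kernel_size=3, padding=1),  # convolution
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),                                          # pooling (28 -> 14)
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),                                          # pooling (14 -> 7)
    nn.Flatten(),
    nn.Linear(32 * 7 * 7, 10),     # dense layer mapping features to 10 class scores
    nn.Softmax(dim=1),             # softmax turns scores into class probabilities
)

x = torch.randn(4, 1, 28, 28)      # a batch of four 28x28 single-channel images
probs = model(x)                   # shape: (4, 10); each row sums to 1
```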
Please read: Inventing Transformers
Definitely Read: Attention is all you Need
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.0 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature.
Attention = Context
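A minimal NumPy sketch of the scaled dot-product attention the paper is built on; the token count and vector sizes are illustrative:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # similarity of each query to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over keys: the "context" weights
    return weights @ V                                  # weighted sum of value vectors

# Illustrative example: 4 tokens, each with 8-dimensional query/key/value vectors.
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
context = scaled_dot_product_attention(Q, K, V)        # shape (4, 8): one context vector per token
```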
Since an NN’s internal weights are updated in small, incremental steps via backpropagation and gradient descent, it is common for an NN to need to see the training data tens to hundreds of times before it is adequately trained.
Each pass of the entire training data set through the model is called an epoch. Usually, you’ll need at least 10-100 epochs to adequately train your model.
How many epochs do you need to train your deep learner? Unlike non-NN machine learning models, determining this requires active monitoring and experimentation.
During each epoch, the weights can be updated in the following manner:
Each time a single training observation is passed through the network, known as stochastic gradient descent;
After all the training data have been passed through the network and their errors have been averaged together, known as batch gradient descent;
After multiple subsets (i.e., mini-batches) of the training data have been passed through the network, with the weights updated after each subset has been passed through, known as mini-batch gradient descent (sketched below).
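A minimal NumPy sketch of mini-batch gradient descent on a toy linear-regression problem; the data, learning rate, and batch size are illustrative:

```python
import numpy as np

# Illustrative data set: N observations of a noisy linear relationship (values are made up).
rng = np.random.default_rng(0)
N = 1000
X = rng.normal(size=(N, 1))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=N)

w, b = 0.0, 0.0
learning_rate = 0.1
m = 32                                     # mini-batch size

for epoch in range(10):                    # one epoch = one full pass over the training data
    order = rng.permutation(N)             # shuffle so mini-batches differ between epochs
    for start in range(0, N, m):
        idx = order[start:start + m]
        x_b, y_b = X[idx, 0], y[idx]
        err = (w * x_b + b) - y_b
        # Gradients of mean squared error, averaged over this mini-batch only
        w -= learning_rate * 2 * np.mean(err * x_b)
        b -= learning_rate * 2 * np.mean(err)
        # m = 1 would be stochastic gradient descent; m = N would be batch gradient descent

print(round(w, 3), round(b, 3))            # w approaches 3.0, b approaches 0.0
```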
Each epoch will have a specific number of steps/iterations associated with it, num_steps;
The number of steps/iterations is determined by the size of your data, N, and the size of your mini-batches, m;
num_steps = ceiling(N / m). The ceiling function rounds up to the nearest integer (worked example below).
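A quick worked example with illustrative numbers:

```python
import math

N = 1000                       # training observations (illustrative)
m = 32                         # mini-batch size (illustrative)
num_steps = math.ceil(N / m)   # 1000 / 32 = 31.25, rounded up
print(num_steps)               # 32 weight updates per epoch
```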
Deep learners thrive when trained on large data sets. Deep learners also train orders of magnitude faster on GPUs (versus CPUs). However, GPUs have limited memory (although it’s gotten much better).
How you do data I/O to your GPU matters! You may need to build a custom data generator/data loader whereby your CPU pulls data from your “hard drive” and preps it for your GPU.
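One common pattern is a PyTorch Dataset/DataLoader pair; a minimal sketch, assuming a hypothetical directory of one-sample-per-file .npy arrays:

```python
import glob
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class FileBackedDataset(Dataset):
    """Reads one sample at a time from disk so the full data set never has to fit in memory."""
    def __init__(self, file_paths):
        self.file_paths = file_paths              # list of per-sample .npy files

    def __len__(self):
        return len(self.file_paths)

    def __getitem__(self, i):
        sample = np.load(self.file_paths[i])      # CPU pulls the sample off the "hard drive"
        return torch.from_numpy(sample).float()

# Hypothetical data directory; swap in your own file listing.
file_paths = sorted(glob.glob("data/*.npy"))

# num_workers > 0 lets CPU processes prep upcoming mini-batches while the GPU trains on the current one;
# pin_memory speeds up the CPU-to-GPU copy.
loader = DataLoader(FileBackedDataset(file_paths), batch_size=32,
                    shuffle=True, num_workers=4, pin_memory=True)

device = "cuda" if torch.cuda.is_available() else "cpu"
for batch in loader:
    batch = batch.to(device, non_blocking=True)   # move each mini-batch to the GPU as it is needed
    # ... forward/backward pass on the GPU ...
```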
Earth System Data Science in the Cloud