Note: Dense Layer
Forward propagation & Backward Propagation
Minimize loss during training -> improves the metrics we actually care about
Loss = cost incurred from incorrect predictions
Optimize Network weights to achieve lowest loss
The learning rate governs the step size we take during gradient descent.
Small learning rate -> slow convergence toward the optimum; more likely to get stuck in local optima
Large learning rate -> unstable updates that can diverge and never converge
Optimal learning rate -> adaptive gradient descent, where the learning rate is adjusted during training (see the sketch below)
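A minimal sketch of gradient descent on a toy quadratic loss, showing how the learning rate scales each update (the function and values here are illustrative, not from the notes):

```python
# Minimal sketch: gradient descent on a toy quadratic loss, L(w) = (w - 3)^2 (illustrative only).
def loss(w):
    return (w - 3.0) ** 2

def grad(w):
    return 2.0 * (w - 3.0)

w = 0.0               # initial weight guess
learning_rate = 0.1   # try 0.01 (slow convergence) or 1.1 (divergence) to see the effect
for step in range(50):
    w -= learning_rate * grad(w)   # step opposite the gradient, scaled by the learning rate

print(round(w, 4), round(loss(w), 8))   # w approaches 3.0 and the loss approaches 0.0
```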
Regularization
Early Stopping
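A minimal sketch of the early-stopping logic, using a made-up validation-loss curve in place of a real training run:

```python
# Toy validation-loss curve standing in for a real training run (illustrative values only).
val_losses = [1.00, 0.80, 0.65, 0.60, 0.58, 0.59, 0.60, 0.61, 0.62, 0.63]

patience = 3                      # epochs to wait for an improvement before stopping
best_val_loss = float("inf")
epochs_without_improvement = 0

for epoch, val_loss in enumerate(val_losses):
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        epochs_without_improvement = 0    # improvement: reset the counter
    else:
        epochs_without_improvement += 1   # no improvement this epoch
        if epochs_without_improvement >= patience:
            print(f"Early stopping at epoch {epoch}; best val loss = {best_val_loss}")
            break
```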
Advanced Complex Topics are just combinations of the Basics
Increasing complexity -> Layers and Activation Functions
Convolution
Pooling
Softmax
Convolutional Neural Networks
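A minimal sketch of those pieces assembled into a small CNN, written here with PyTorch; the layer sizes, 28x28 input shape, and 10-class output are illustrative assumptions:

```python
import torch
import torch.nn as nn

# A tiny CNN: convolution -> pooling -> convolution -> pooling -> dense -> softmax.
model = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=16, kernel_size=3, padding=1),  # convolution
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),                                          # pooling (28 -> 14)
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),                                          # pooling (14 -> 7)
    nn.Flatten(),
    nn.Linear(32 * 7 * 7, 10),     # dense layer mapping features to 10 class scores
    nn.Softmax(dim=1),             # softmax turns scores into class probabilities
)

x = torch.randn(4, 1, 28, 28)      # a batch of four 28x28 single-channel images
probs = model(x)                   # shape: (4, 10); each row sums to 1
```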
Please read: Inventing Transformers
Definitely Read: Attention is all you Need
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.0 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature.
Attention = Context
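A minimal NumPy sketch of the scaled dot-product attention the paper is built on; the token count and vector sizes are illustrative:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # similarity of each query to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over keys: the "context" weights
    return weights @ V                                  # weighted sum of value vectors

# Illustrative example: 4 tokens, each with 8-dimensional query/key/value vectors.
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
context = scaled_dot_product_attention(Q, K, V)        # shape (4, 8): one context vector per token
```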
Since an NN’s internal weights are updated in small, incremental steps via backpropagation and gradient descent, it is common for an NN to need to see the training data tens to hundreds of times before it is adequately trained.
Each pass of the entire training data set through the model is called an epoch. Usually, you’ll need at least 10-100 epochs to adequately train your model.
How many epochs do you need to train your deep learner? Unlike non-NN machine learning models, determining this requires active monitoring and experimentation.
During each epoch, the weights can be updated in the following manner:
Each time a single training observation is passed through the network, known as stochastic gradient descent;
After all the training data have been passed through the network and their errors have been averaged together, known as batch gradient descent;
After multiple subsets (i.e., mini-batches) of the training data have been passed through the network, with the weights updated after each subset has been passed through, known as mini-batch gradient descent (sketched below).
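A minimal NumPy sketch of mini-batch gradient descent on a toy linear-regression problem; the data, learning rate, and batch size are illustrative:

```python
import numpy as np

# Illustrative data set: N observations of a noisy linear relationship (values are made up).
rng = np.random.default_rng(0)
N = 1000
X = rng.normal(size=(N, 1))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=N)

w, b = 0.0, 0.0
learning_rate = 0.1
m = 32                                     # mini-batch size

for epoch in range(10):                    # one epoch = one full pass over the training data
    order = rng.permutation(N)             # shuffle so mini-batches differ between epochs
    for start in range(0, N, m):
        idx = order[start:start + m]
        x_b, y_b = X[idx, 0], y[idx]
        err = (w * x_b + b) - y_b
        # Gradients of mean squared error, averaged over this mini-batch only
        w -= learning_rate * 2 * np.mean(err * x_b)
        b -= learning_rate * 2 * np.mean(err)
        # m = 1 would be stochastic gradient descent; m = N would be batch gradient descent

print(round(w, 3), round(b, 3))            # w approaches 3.0, b approaches 0.0
```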
Each epoch will have a specific number of steps/iterations associated with it, num_steps;
The number of steps/iterations is determined by the size of your data, N, and the size of your mini-batches, m;
num_steps = ceiling(N / m). The ceiling function rounds up to the nearest integer (worked example below).
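A quick worked example with illustrative numbers:

```python
import math

N = 1000                       # training observations (illustrative)
m = 32                         # mini-batch size (illustrative)
num_steps = math.ceil(N / m)   # 1000 / 32 = 31.25, rounded up
print(num_steps)               # 32 weight updates per epoch
```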
Deep learners thrive when trained on large data sets. Deep learners also train orders of magnitude faster on GPUs (versus CPUs). However, GPUs have limited memory (although it’s gotten much better).
How you do data I/O to your GPU matters! You may need to build a custom data generator/data loader whereby your CPU pulls data from your “hard drive” and preps it for your GPU.
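One common pattern is a PyTorch Dataset/DataLoader pair; a minimal sketch, assuming a hypothetical directory of one-sample-per-file .npy arrays:

```python
import glob
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class FileBackedDataset(Dataset):
    """Reads one sample at a time from disk so the full data set never has to fit in memory."""
    def __init__(self, file_paths):
        self.file_paths = file_paths              # list of per-sample .npy files

    def __len__(self):
        return len(self.file_paths)

    def __getitem__(self, i):
        sample = np.load(self.file_paths[i])      # CPU pulls the sample off the "hard drive"
        return torch.from_numpy(sample).float()

# Hypothetical data directory; swap in your own file listing.
file_paths = sorted(glob.glob("data/*.npy"))

# num_workers > 0 lets CPU processes prep upcoming mini-batches while the GPU trains on the current one;
# pin_memory speeds up the CPU-to-GPU copy.
loader = DataLoader(FileBackedDataset(file_paths), batch_size=32,
                    shuffle=True, num_workers=4, pin_memory=True)

device = "cuda" if torch.cuda.is_available() else "cpu"
for batch in loader:
    batch = batch.to(device, non_blocking=True)   # move each mini-batch to the GPU as it is needed
    # ... forward/backward pass on the GPU ...
```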
Earth System Data Science in the Cloud