Notes on chapter three of Deep Learning with JavaScript
Goals
- Nonlinearity enhances the representational power of neural networks and in many cases improves prediction power.
- Understanding underfitting and overfitting in relationship to model training.
3.1 Nonlinearity Description and Utility
Definitions
A neural network with two layers has a first layer that is called a hidden layer because its output is not directly seen from outside the model. The second layer is the output layer because its output is also the model's final output, as returned by the TensorFlow.js API method predict().
These can be referred to as a multilayer perceptron if:
- They have a simple topology, as in feedforward neural networks
- They have at least one hidden layer
API
The call to model.summary() is a diagnostic/reporting tool that prints the topology of TensorFlow.js models to the console.
[Figure: model.summary() output]
- The first column is the layer name and type.
- The second column is the shape for each layer. The first dimension is almost always a null value which represents an undetermined and variable batch size.
- The third column is the number of weight parameters for each layer. This counts all of the individual numbers that make up a layer’s weights.
- For example, a kernel shape of [12, 50] and a bias of 50 yields (12 * 50) + 50 = 650 weight parameters.
- The total at the bottom includes all the weights of the model, split into trainable and non-trainable weights.
- The trainable weights are updated in the call to model.fit()
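The parameter count in the third column can be checked by hand. The following is a minimal sketch in plain JavaScript (no TensorFlow.js required); the helper name denseParamCount is hypothetical and only illustrates the arithmetic for a dense layer with a kernel of shape [inputDim, units] and one bias per unit.

```javascript
// Hypothetical helper: number of weight parameters in a dense layer.
// The kernel holds one number per (input, unit) pair and the bias
// holds one number per unit.
function denseParamCount(inputDim, units, useBias = true) {
  const kernelCount = inputDim * units;
  const biasCount = useBias ? units : 0;
  return kernelCount + biasCount;
}

// Kernel shape [12, 50] plus a bias of length 50.
const hiddenLayerParams = denseParamCount(12, 50);
console.log(hiddenLayerParams); // 650
```

Summing this count over every layer should match the total that model.summary() prints at the bottom, assuming the model contains only dense layers.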
3.1.1 Building the Intuition for Nonlinearity in Neural Networks
An activation function is an element-by-element transform. The sigmoid function is a squashing function in that it squashes all real values into a much smaller range, between 0 and +1.
- The sigmoid activation function is a curve
- The relu activation function is the concatenation of two line segments
Why Nonlinear
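Both activation functions are differentiable, which makes backpropagation possible. The one exception is relu at exactly 0, where the two line segments meet and a fixed derivative value is used in practice. They are called nonlinear because the plot of the activation function is not a straight line.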
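To make the shapes of the two functions concrete, here is a small sketch in plain JavaScript. The function names are illustrative and are not the TensorFlow.js API; the point is only that each function is applied element by element and that sigmoid keeps every output between 0 and 1.

```javascript
// Sigmoid squashes any real number into the open interval (0, 1).
function sigmoid(x) {
  return 1 / (1 + Math.exp(-x));
}

// relu is two line segments: flat at 0 for negative inputs,
// the identity for non-negative inputs.
function relu(x) {
  return Math.max(0, x);
}

// An activation is an element-by-element transform of a layer's output.
function applyActivation(values, activationFn) {
  return values.map((value) => activationFn(value));
}

const sampleInputs = [-10, -1, 0, 1, 10];
const sigmoidOutputs = applyActivation(sampleInputs, sigmoid);
const reluOutputs = applyActivation(sampleInputs, relu);

console.log(sigmoidOutputs);
console.log(reluOutputs);
```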
Note
Hyperbolic tangent (tanh), relu and sigmoid are activation functions.
Nonlinearity and Model Capacity
Nonlinear functions allow our models to represent more diverse relationships between data. Replacing a linear function with a nonlinear function does not lose the ability to learn linear relationships. Adding a nonlinear layer adds to the breadth and depth of input-output relations the model can learn.
Neural networks can be thought of as cascading functions. Each layer can be viewed as a function, and the stacking of layers amounts to cascading these functions to form a more complex function that is the network itself.
The range of input-output relations a machine learning algorithm is capable of learning is called its capacity.
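One way to see why the nonlinearity matters for capacity: cascading two linear functions only ever produces another linear function, while putting a nonlinear activation between them does not. The sketch below uses made-up scalar "layers" (the coefficients 2, 1, 3 and -4 are arbitrary assumptions chosen for illustration).

```javascript
// Two scalar "layers", each a linear function of its input.
const layer1 = (x) => 2 * x + 1;
const layer2 = (x) => 3 * x - 4;
const relu = (x) => Math.max(0, x);

// Cascading the two linear layers collapses into one linear function:
// 3 * (2x + 1) - 4 = 6x - 1.
const linearCascade = (x) => layer2(layer1(x));
const collapsed = (x) => 6 * x - 1;

// With relu in between, the cascade is no longer a single straight line.
const nonlinearCascade = (x) => layer2(relu(layer1(x)));

for (const x of [-3, 0, 5]) {
  console.log(x, linearCascade(x), collapsed(x), nonlinearCascade(x));
}
```

For inputs where layer1 is negative (for example x = -3), the nonlinear cascade departs from the straight line 6x - 1, which is the extra expressiveness a hidden layer with a nonlinear activation provides.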
Nonlinearity and Model Interpretability
Does a nonlinear model with multiple layers still allow for interpretability? In general, no. This is because there is no easily identifiable meaning to any of the 50 outputs from the hidden layer. Choosing a deep model over a shallow one is a tradeoff between interpretability and model capacity.
3.1.2 Hyperparameters and Hyperparameter Optimization
The proper configuration of a sigmoid layer also relies on:
- The number of units
- The kernel's 'leCunNormal' initialization, which is distinct from the default 'glorotNormal' that uses the size of the input and output to create the random numbers
- The number of dense layers in a model
- Type of initializer
- Whether to use any weight regularization, and the regularization factor
- _Inclusion of dropout layers*_
- _Number of epochs*_
- _Optimizer learning rate*_
- _Learning rate decrease*_
- _Batch size for training*_

* These are model training configurations
Number of units, kernel initializers, optimizers and activation function are hyperparameters.
Hyperparameters do not change for a model during training. Hyperparameter optimization is the name of the process of selecting these values.
Many hyperparameters are discrete, so the loss is not differentiable with respect to them and they cannot be tuned by gradient descent.
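Because gradients are not available for discrete hyperparameters, one common approach is simply to try combinations and keep the best one (grid search). The sketch below is plain JavaScript under loud assumptions: the search space values are made up, and evaluateConfig is a hypothetical stand-in for "train a model with this configuration and return its validation loss".

```javascript
// Hypothetical search space: the values are illustrative only.
const searchSpace = {
  units: [10, 50, 100],
  learningRate: [0.001, 0.01, 0.1],
};

// Stand-in for training a model and measuring validation loss.
// A real version would build, fit and evaluate a model here.
function evaluateConfig({ units, learningRate }) {
  return Math.abs(units - 50) / 100 + Math.abs(Math.log10(learningRate) + 2);
}

// Try every combination and remember the one with the lowest loss.
function gridSearch(space, evaluateFn) {
  let best = { config: null, loss: Infinity };
  for (const units of space.units) {
    for (const learningRate of space.learningRate) {
      const config = { units, learningRate };
      const loss = evaluateFn(config);
      if (loss < best.loss) {
        best = { config, loss };
      }
    }
  }
  return best;
}

const bestResult = gridSearch(searchSpace, evaluateConfig);
console.log(bestResult.config);
```

The cost grows with the product of the number of candidate values per hyperparameter, which is why the search space is usually kept small.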
3.2 Nonlinear Output
Binary vs Multiclass Classification
- Binary Examples
  - spam email
  - fraudulent charges
  - whether a word is present in an audio sample
  - fingerprint matching
- Multiclass Examples
  - article topics
  - image classification
  - character recognition
  - Atari
3.2.1 Binary Classification
In the example regarding email phishing we use a sigmoid layer as the output layer to express the probability that an email is spam.
The probability tells us how strongly the model supports or opposes the positive classification, rather than giving only a yes/no answer.
It's also easier to come up with a differentiable loss function: given binary yes/no labels, a probability output lets us measure by how much the model has missed the mark.
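As an illustration of such a loss, the sketch below computes binary cross-entropy in plain JavaScript. Naming this particular loss is an assumption on my part, since these notes do not say which loss the chapter uses; the clipping constant epsilon is also an illustrative choice that avoids taking the log of 0.

```javascript
// Binary cross-entropy averaged over examples.
// labels: array of 0 or 1; probs: predicted probabilities of the label being 1.
function binaryCrossentropy(labels, probs, epsilon = 1e-7) {
  let total = 0;
  for (let i = 0; i < labels.length; i++) {
    // Clip the probability so Math.log never receives 0.
    const p = Math.min(Math.max(probs[i], epsilon), 1 - epsilon);
    total += -(labels[i] * Math.log(p) + (1 - labels[i]) * Math.log(1 - p));
  }
  return total / labels.length;
}

// A confident correct prediction is penalized less than a hesitant one,
// and a confident wrong prediction is penalized most.
console.log(binaryCrossentropy([1], [0.9]));
console.log(binaryCrossentropy([1], [0.6]));
console.log(binaryCrossentropy([1], [0.1]));
```

The loss changes smoothly as the predicted probability moves, which is what makes it usable with gradient-based training.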
The adam optimizer aims at addressing the shortcomings of zigzagging and slow convergence found in gradient descent by using a multiplication factor that varies with the history of the gradients.
- The adam optimizer usually depends less on the learning rate
- sgd, momentum, rmsprop, adadelta and adamax are other optimizers
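The idea of a multiplication factor that varies with the history of the gradients can be sketched for a single scalar parameter. This is the commonly published Adam update rule written in plain JavaScript, not the TensorFlow.js implementation; the names createAdam and minimizeSquare are hypothetical, and the default values are the usual textbook ones, assumed here for illustration.

```javascript
// Scalar sketch of the Adam update rule.
// m tracks a running average of gradients, v a running average of
// squared gradients; their ratio scales each step.
function createAdam({ learningRate = 0.001, beta1 = 0.9, beta2 = 0.999, epsilon = 1e-8 } = {}) {
  let m = 0;
  let v = 0;
  let t = 0;
  return function step(param, grad) {
    t += 1;
    m = beta1 * m + (1 - beta1) * grad;
    v = beta2 * v + (1 - beta2) * grad * grad;
    // Bias correction compensates for m and v starting at 0.
    const mHat = m / (1 - Math.pow(beta1, t));
    const vHat = v / (1 - Math.pow(beta2, t));
    return param - (learningRate * mHat) / (Math.sqrt(vHat) + epsilon);
  };
}

// Minimize f(x) = x^2, whose gradient is 2x.
function minimizeSquare(start, steps, learningRate) {
  const adamStep = createAdam({ learningRate });
  let x = start;
  for (let i = 0; i < steps; i++) {
    x = adamStep(x, 2 * x);
  }
  return x;
}

console.log(minimizeSquare(3, 1, 0.1));
console.log(minimizeSquare(3, 500, 0.1));
```

Because each step is divided by the running magnitude of the gradients, the size of the first update is close to the learning rate regardless of how large the raw gradient is, which is one reason the method is less sensitive to the gradient scale than plain gradient descent.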