Deep Learning with JavaScript 3.1-3.2.1

Posted on April 4, 2020

Notes on chapter three of Deep Learning with JavaScript.

  • 3.1 Nonlinearity Description and Utility
    • 3.1.1 Building the Intuition for Nonlinearity in Neural Networks
    • 3.1.2 Hyperparameters and Hyperparameter Optimization
  • 3.2 Nonlinear Output
    • 3.2.1 Binary Classification


  1. Nonlinearity enhances the representational power of neural networks and in many cases improves their predictive power.
  2. Understanding underfitting and overfitting in relation to model training.

3.1 Nonlinearity Description and Utility


A neural network with two layers has a first layer that is called a hidden layer because its output is not directly seen from outside the model. The second layer is the output layer because its output is also the model's final output, the value returned by the TensorFlow.js API method predict().

Such a network can be referred to as a multilayer perceptron (MLP) if:

  • It has a simple topology with no loops, known as a feedforward neural network
  • It has at least one hidden layer


The call to model.summary() is a diagnostic and reporting tool that prints the topology of a TensorFlow.js model to the console:

  1. The first column is the layer name and type.
  2. The second column is the output shape of each layer. The first dimension is almost always null, which represents an undetermined and variable batch size.
  3. The third column is the number of weight parameters for each layer. This counts all of the individual numbers that make up a layer’s weights.
    1. A kernel shape of: [12, 50]
    2. A bias of: 50
    3. Yields: (12 * 50) + 50 = 650
  4. The totals at the bottom count all the weights of the model, split into trainable and non-trainable parameters.
    1. The trainable weights are updated during training, in the call to model.fit().
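The parameter count in the third column can be reproduced by hand. The following is a minimal sketch in plain JavaScript, not TensorFlow.js code; the helper name denseParamCount and the single-unit output layer are assumptions made for illustration.

```javascript
// Number of weight parameters in a dense layer: one kernel entry per
// (input, unit) pair, plus one bias value per unit.
function denseParamCount(inputSize, units, useBias = true) {
  const kernelParams = inputSize * units;
  const biasParams = useBias ? units : 0;
  return kernelParams + biasParams;
}

// Hidden layer from the notes: kernel shape [12, 50], bias of 50.
const hiddenParams = denseParamCount(12, 50);
console.log(hiddenParams); // 650

// Hypothetical output layer: a single unit fed by the 50 hidden units.
const outputParams = denseParamCount(50, 1);
console.log(hiddenParams + outputParams); // 701
```

Under these assumptions, the total printed at the bottom of the summary would be the sum of the per-layer counts.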

3.1.1 Building the Intuition for Nonlinearity in Neural Networks

An activation function is an element-by-element transform. The sigmoid function is a squashing function: it squashes all real values into the much smaller range of 0 to +1.

  • The sigmoid activation function is a curve
  • The relu activation function is the concatenation of two line segments
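Both functions are short enough to write out directly. This is a sketch in plain JavaScript using the standard definitions of sigmoid and relu; the helper applyActivation is a hypothetical name that only illustrates the element-by-element nature of the transform.

```javascript
// Sigmoid: squashes any real number into the interval between 0 and 1.
function sigmoid(x) {
  return 1 / (1 + Math.exp(-x));
}

// relu: zero for negative inputs, the input itself for positive inputs.
function relu(x) {
  return Math.max(0, x);
}

// An activation is applied to each element of a layer's output separately.
function applyActivation(values, activation) {
  return values.map((v) => activation(v));
}

const preActivations = [-2, 0, 3];
console.log(applyActivation(preActivations, relu));    // [ 0, 0, 3 ]
console.log(applyActivation(preActivations, sigmoid)); // middle value is 0.5
```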

Why Nonlinear

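Both activation functions are called nonlinear because their plots are not straight lines. Both are also differentiable, which makes backpropagation possible: sigmoid is smooth everywhere, while relu has a kink at 0 (where implementations simply pick a slope) and is differentiable everywhere else.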


Hyperbolic tangent (tanh), relu and sigmoid are activation functions.

Nonlinearity and Model Capacity

Nonlinear functions allow our models to represent more diverse relationships in the data. Replacing a linear function with a nonlinear one does not lose the ability to learn linear relationships. Adding a nonlinear layer adds to the breadth and depth of the input-output relations the model can learn.

Neural networks can be thought of as cascading functions. Each layer can be viewed as a function, and stacking layers amounts to cascading these functions to form a more complex function that is the network itself.
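The cascading view also shows why the nonlinearity matters. In this sketch (plain JavaScript with one input and one output per layer, and made-up weights chosen only for illustration), two linear layers collapse into a single linear function, while placing relu between them produces something a single line cannot express.

```javascript
// A one-input, one-output "layer" with no activation: y = w * x + b.
function linearLayer(w, b) {
  return (x) => w * x + b;
}

// Cascade layers: the output of each function feeds the next one.
function cascade(...layers) {
  return (x) => layers.reduce((value, layer) => layer(value), x);
}

const layer1 = linearLayer(2, 1);  // y = 2x + 1
const layer2 = linearLayer(3, -4); // y = 3x - 4

// Two linear layers collapse into one: 3 * (2x + 1) - 4 = 6x - 1.
const twoLinear = cascade(layer1, layer2);
const collapsed = linearLayer(6, -1);
console.log(twoLinear(5), collapsed(5)); // 29 29

// With relu in between, the cascade is no longer a straight line.
const withRelu = cascade(layer1, (x) => Math.max(0, x), layer2);
console.log(withRelu(-3), collapsed(-3)); // -4 -19
```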

The range of input output relations a machine learning algorithm is capable of learning is called its capacity.

Nonlinearity and Model Interpretability

Does a nonlinear model with multiple layers still allow for interpretability? In general, no. There is no easily identifiable meaning to any of the 50 outputs from the hidden layer. Choosing a deep model over a shallow one is a tradeoff: more model capacity at the cost of interpretability.

3.1.2 Hyperparameters and Hyperparameter Optimization

The proper configuration of a sigmoid layer, and of the model around it, also relies on:

  • The number of units
  • The kernel's 'leCunNormal' initialization
    • This is distinct from the default 'glorotNormal', which uses the sizes of both the input and the output to create the random numbers
  • The number of dense layers in the model
  • The type of initializer
  • Whether to use any weight regularization, and the regularization factor
  • Inclusion of dropout layers*
  • Number of epochs*
  • Optimizer learning rate*
  • Learning rate decrease*
  • Batch size for training*

* These are model training configurations.

The number of units, the kernel initializer, the optimizer and the activation function are all hyperparameters.

Hyperparameters do not change for a model during training. Hyperparameter optimization is the name of the process of selecting these values.

Many hyperparameters are discrete, so the loss value is not differentiable with respect to them and they cannot be tuned by gradient descent the way weights are.
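Because gradients are not available, one simple way to select hyperparameters is to try combinations and compare the results. The sketch below shows a grid search in plain JavaScript. It is one possible approach, not something these notes prescribe; the search space is made up, and toyValidationLoss is a stand-in for "train a model with this configuration and return its validation loss".

```javascript
// Build every combination of the candidate values (a Cartesian product).
function gridOf(space) {
  return Object.entries(space).reduce(
    (combos, [name, values]) =>
      combos.flatMap((combo) => values.map((v) => ({ ...combo, [name]: v }))),
    [{}]
  );
}

// Evaluate each combination and keep the one with the lowest loss.
function gridSearch(space, evaluate) {
  let best = null;
  for (const config of gridOf(space)) {
    const loss = evaluate(config);
    if (best === null || loss < best.loss) best = { config, loss };
  }
  return best;
}

// Hypothetical search space; names and values are illustrative only.
const space = {
  units: [10, 50, 100],
  activation: ['sigmoid', 'relu'],
  learningRate: [0.1, 0.01],
};

// Stand-in scoring function. A real version would build, fit and
// evaluate a model for the given configuration.
function toyValidationLoss({ units, activation, learningRate }) {
  const activationPenalty = activation === 'relu' ? 0 : 0.5;
  return Math.abs(units - 50) / 100 + activationPenalty + learningRate;
}

const best = gridSearch(space, toyValidationLoss);
console.log(best.config); // units 50, activation 'relu', learningRate 0.01
```

Grid search grows quickly with the number of hyperparameters (12 combinations here), which is part of why hyperparameter optimization is treated as its own problem.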

3.2 Nonlinear Output

Binary vs Multiclass Classification

  • Binary Examples
    • spam email
    • fraudulent charges
    • whether a word is present in an audio sample
    • fingerprint matching
  • Multiclass Examples
    • article topics
    • image classification
    • character recognition
    • Atari games

3.2.1 Binary Classification

In the email phishing example, we use a sigmoid layer as the output layer so that the model's output expresses the probability that an email is spam.
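A probability output can be scored against a 0/1 label with a differentiable loss. The sketch below uses binary cross-entropy, a common choice for this setup; these notes do not name a specific loss, so treat it as an assumption. It is plain JavaScript, and the clipping constant eps is an illustrative value.

```javascript
// Binary cross-entropy for one example.
// label is 0 or 1; prob is the sigmoid output, between 0 and 1.
function binaryCrossentropy(label, prob) {
  const eps = 1e-7; // keeps Math.log away from log(0)
  const p = Math.min(Math.max(prob, eps), 1 - eps);
  return -(label * Math.log(p) + (1 - label) * Math.log(1 - p));
}

// Mean loss over a batch of examples.
function meanBinaryCrossentropy(labels, probs) {
  const total = labels.reduce(
    (sum, label, i) => sum + binaryCrossentropy(label, probs[i]),
    0
  );
  return total / labels.length;
}

// The true label is 1. A confident correct answer costs little,
// a confident wrong answer costs a lot.
console.log(binaryCrossentropy(1, 0.9)); // about 0.105
console.log(binaryCrossentropy(1, 0.1)); // about 2.303
```

The loss changes smoothly as the predicted probability moves, which is what lets training measure how far off a prediction is rather than only whether it is right or wrong.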

  • The probability tells us how strongly the model supports or opposes a positive classification, rather than giving only a yes/no answer.

  • It is also easier to come up with a differentiable loss function: given binary yes/no labels, a probability output lets us see by how much the model has missed the mark.

  • The adam optimizer addresses two shortcomings of plain gradient descent, zigzagging and slow convergence, by scaling each update with a multiplication factor that varies with the history of the gradients.

    • The adam optimizer is usually less sensitive to the choice of learning rate
    • sgd, momentum, rmsprop, adadelta and adamax are other optimizers
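The "multiplication factor that varies with the history of the gradients" can be made concrete with a single-weight version of the Adam update rule. This is a sketch in plain JavaScript, not the TensorFlow.js implementation; the default values for beta1, beta2 and epsilon are commonly used ones, and the toy objective (w - 3)^2 is an assumption for illustration.

```javascript
// One Adam step for a single scalar weight.
// state holds the running averages (m, v) and the step counter t.
function adamStep(weight, grad, state, opts = {}) {
  const { learningRate = 0.001, beta1 = 0.9, beta2 = 0.999, epsilon = 1e-8 } = opts;
  const t = state.t + 1;
  const m = beta1 * state.m + (1 - beta1) * grad;        // running mean of gradients
  const v = beta2 * state.v + (1 - beta2) * grad * grad; // running mean of squared gradients
  const mHat = m / (1 - Math.pow(beta1, t));             // bias correction for early steps
  const vHat = v / (1 - Math.pow(beta2, t));
  // The update is scaled by the gradient history instead of the raw gradient.
  const newWeight = weight - (learningRate * mHat) / (Math.sqrt(vHat) + epsilon);
  return { weight: newWeight, state: { m, v, t } };
}

// Toy problem: minimize (w - 3)^2, whose gradient is 2 * (w - 3).
let w = 0;
let state = { m: 0, v: 0, t: 0 };
for (let step = 0; step < 2000; step++) {
  const grad = 2 * (w - 3);
  ({ weight: w, state } = adamStep(w, grad, state, { learningRate: 0.01 }));
}
console.log(w); // close to 3
```

In this sketch the very first step moves the weight by roughly the learning rate whether the gradient is small or large, since the gradient is divided by the square root of its own squared history. That normalization is one way to see why the step size depends less on the raw gradient scale.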