2.2 Inside Model.fit() – Disecting Stochastic Gradient Descent
Understand what’s going on in TensorFlow.js’s
2.2.1 Intuitions Behind Gradient Descent Optimization
A process called random initialization helps set the weights of the dense layer of the function we used in the previous example…
= m * x + b y /* or... */ = kernel * sizeMb + biasoutput
Here kernel and bias are tunable weights. The loss with parameters close to the slope and y-intercept will be small and as the values move away from these goals the loss function will increase.
This loss as a function of all tunable params is also referred to as a loss surface.
The 2D contour plot demonstrates this loss surface since our model has only two params. Each training loop we receive a feedback signal which adjusts parameters to find a better fit or move closer to the bottom of the loss surface..
THis process has some distinct steps. - Draw a batch of trsaining samples x. these have corresponding targets
y_true - the batch size is often a power of 2 in order to leverage a GPUs architecture for processing. - this also makes the calculated values more stable - Run the network on x called a forward pass to obtain predictions
y_pred - Compute the loss of the network on the batch.. or the difference between y_true and y_pred as specified by the loss function. - the loss function is specified when
loss.compile() is called. - Then update all parameters/weights so that this batch has a reduction in the output of thee loss function. - this is done by the optimizer option in compile.
The optimizer has a difficult job, but because all operations are differentiable we can use gradient of the loss in order to determine which params/weights should be adjusted.
A gradient is a direction in which moving the parameters reduced the loss function. The mathematical definition defines the gradient as the direction in which the loss increases so in the context of deep learning we want to create the opposite.
The stochastic in SGD just means we pick random samples from the training step for efficiency
2.2.2 Backpropogation: inside stochastic gradient descent
The algorithm for computing gradients in gradient descent is called backpropogation. The relationshipt between loss v x and y’ can be expressed as
loss = square(y' - y) = square(v * x - y) Backpropogation has the important step of determining…
Assuming everything else stays the same, how much change in the loss value will we get if
vis increased by a unit amount?
The gradient is computed step by step.. from loss value back to the value v. - step one is to make the point that a unit increase in loss corresponds to to a unit increase in loss itself. - the derivative of edge 3 gives us a value of -10 this is the amount of increase in loss we’ll get if edge 3 is increased by 1. - this is referred to as the chain rule. - At edge 2 we calculate the gradient of edge 3 with respect to edge 2. The gradient is 1 because this is a simple add operation.Multiplying this 1 with the value on edge 3 gives the gradient on edge 2 which is just -10 - at edge 1 the gradient of edge 2 with respect to edge 1 is
2. We calculate the final gradient as
x * v or
2 * -10 = -20
2.3 Linear Regression With Multiple Input Features
- Understand an learn how to build a model that takes in and learns from multiple input features
- Know how to normalize data to stabilize the learning process
- Get a feel for using
tf.Model.fit()to update the ui
2.3.1 Boston Housing Prices
This particular dataset has been used since the 70s to ntroduce problems of statistics and morerecently deep learning. We’ll use the features in this dataset in order to get a median home price based solely on information about the home.
2.3.2 Getting and Running the Boston Housing Project Data from Github
Examples can be downloaded from
git clone https://github.com/tensorflow/tfjs-examples.git