Deep Learning with JavaScript 2.3.3-2.3.6

Posted on April 1, 2020

Notes on chapter two of Deep Learning with JavaScript

2.3.3 Accessing Boston Housing Data
2.3.4 Precisely Defining the Boston Housing Problem
2.3.5 Data Normalization
2.3.6 Linear Regression

2.3.3 Accessing Boston Housing Data

Typically, downloading data from a URL or a Google Cloud Storage bucket is the easiest way to manage and process data for training and testing a model. Beyond the training and test sets, the Boston housing data dedicates a third split of the data to validation, so that the model's performance can be evaluated independently.

2.3.4 Precisely Defining the Boston Housing Problem

  1. Step one is to download the data and convert it to the appropriate shape.
const Papa = require('papaparse');

const BASE_URL = 'https://storage.googleapis.com/tfjs-examples/multivariate-linear-regression/data/';

const TRAIN_FEATURES_FN = 'train-data.csv';
const TRAIN_TARGET_FN = 'train-target.csv';
const TEST_FEATURES_FN = 'test-data.csv';
const TEST_TARGET_FN = 'test-target.csv';

const parseCSV = async (data) => {
  return new Promise((resolve) => {
    // Convert each row object of strings into an array of floats.
    data = data.map((row) =>
      Object.keys(row).map((key) => parseFloat(row[key]))
    );
    resolve(data);
  });
};

const loadCSV = async (filename) => {
  return new Promise(resolve => {
    const url = `${BASE_URL}${filename}`

    console.log(`  * Downloading data from: ${url}`)
    Papa.parse(url, {
      download: true,
      header: true,
      complete: (results) => {
        resolve(parseCSV(results['data']));
      }
    })
  });
};

export class BostonHousingDataset{
  constructor(){
    this.trainFeatures = null;
    this.trainTarget = null;
    this.testFeatures = null;
    this.testTarget = null;
  }

  get numFeatures(){
    if (this.trainFeatures === null){
      throw new Error('loadData() must be called before numFeatures')
    }
    return this.trainFeatures[0].length
  }

  async loadData(){
    [this.trainFeatures,this.trainTarget,this.testFeatures,this.testTarget] = 
      await Promise.all([
        loadCSV(TRAIN_FEATURES_FN), loadCSV(TRAIN_TARGET_FN),
        loadCSV(TEST_FEATURES_FN), loadCSV(TEST_TARGET_FN)
      ])
    
    shuffle(this.trainFeatures,this.trainTarget)
    shuffle(this.testFeatures,this.testTarget)
  }
}

with shuffle and featureDescriptions defined as

// Fisher-Yates shuffle applied to data and target in lockstep,
// so each feature row stays aligned with its target value.
function shuffle (data, target) {
  let counter = data.length;
  let temp = 0
  let index = 0
  while (counter > 0 ) {
    index = (Math.random() * counter) | 0;
    counter--;

    temp = data[counter];
    data[counter] = data[index];
    data[index] = temp

    temp = target[counter]
    target[counter]= target[index]
    target[index] = temp
  }
}

export const featureDescriptions = [
  'Crime rate',
  'Land zone size',
  'Industrial proportion', 
  'Next to river',
  'Nitric oxide concentration',
  'Number of rooms per house', 
  'Age of housing',
  'Distance to commute', 
  'Distance to highway', 
  'Tax rate', 
  'School class size',
  'School drop-out rate'
];
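
Before wiring everything together, here is a minimal usage sketch of this data module (assuming it lives in a data.js file, as the import below suggests):

import {BostonHousingDataset, featureDescriptions} from './data';

const data = new BostonHousingDataset();
data.loadData().then(() => {
  // numFeatures can only be read after loadData() resolves.
  console.log(`Loaded ${data.numFeatures} features:`);
  featureDescriptions.forEach(description => console.log(`  - ${description}`));
});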

then we define our workflow with the following

import * as tf from '@tensorflow/tfjs';
import * as tfviz from '@tensorflow/tfjs-vis';

import {BostonHousingDataset, featureDescriptions} from './data'
import * as normalization from './normalization'
import * as ui from './ui'

const NUM_EPOCHS = 200;
const BATCH_SIZE = 40;
const LEARNING_RATE = 0.01;
  2. Then we convert the data to tensors, using the same API as in the earlier download example.
const bostonData = new BostonHousingDataset()
const tensors = {}
export const arraysToTensors = () => {
    tensors.trainFeatures = tf.tensor2d(bostonData.trainFeatures)
    tensors.trainTarget   = tf.tensor2d(bostonData.trainTarget)
    tensors.testFeatures  = tf.tensor2d(bostonData.testFeatures)
    tensors.testTarget    = tf.tensor2d(bostonData.testTarget)
}

In the last example we used mean absolute error, which is not very sensitive to large outliers. An alternative loss function is mean squared error, which prefers models that make many small mistakes over models that make a few large ones.
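
A quick sketch of the difference (the error values here are made up for illustration):

// Two sets of prediction errors with the same mean absolute error:
// many small mistakes vs. one large outlier.
const smallErrors = tf.tensor1d([2, 2, 2, 2]);
const oneOutlier = tf.tensor1d([0, 0, 0, 8]);

smallErrors.abs().mean().print(); // MAE: 2
oneOutlier.abs().mean().print();  // MAE: 2

// Mean squared error penalizes the outlier much more heavily.
smallErrors.square().mean().print(); // MSE: 4
oneOutlier.square().mean().print();  // MSE: 16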

  3. The next step is to determine a baseline loss estimate. Without the error of a simple estimate as a reference point, we can't judge the error of a more complicated model. For this example we use the average price and gauge the error when always guessing that value.
export const computeBaseline = () => {
  const averagePrice = tf.mean(tensors.trainTarget);
  console.log(`  * Average Price: ${averagePrice.dataSync()[0]}`)
  // Calculate the mean squared error of always guessing the average price
  const baseline = tf.mean(tf.pow(tf.sub(tensors.testTarget, averagePrice), 2))
  console.log(`  * Baseline Loss: ${baseline.dataSync()[0]}`)
}

The call to dataSync() here synchronously pulls the tensor's value from the GPU to the CPU.
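
A minimal illustration of the synchronous and asynchronous ways to read a tensor's values (the tensor here is hypothetical):

const t = tf.tensor1d([1, 2, 3]);

// Synchronous: blocks until the value is copied to the CPU.
const values = t.dataSync(); // Float32Array [1, 2, 3]

// Asynchronous alternative: returns a promise, friendlier to the UI thread.
t.data().then(vals => console.log(vals));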

  4. The square root of this baseline loss is our typical error: we can expect to be off by roughly this value when always guessing the average price. From here we use machine learning to achieve a better MSE.
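
For example, if computeBaseline() printed a baseline loss of 85 (a made-up value), the typical error would be:

const baselineMse = 85;                      // hypothetical value printed by computeBaseline()
const typicalError = Math.sqrt(baselineMse); // ~9.2, in the same units as the price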

2.3.5 Data Normalization

Linear regression works by the optimizer searching for weights such that the sum of each weight times its feature matches the home price. To do this, the optimizer follows a gradient in weight space. Because the features have very different scales (for example, crime rate versus tax rate), small changes in one direction can have large impacts while very large moves in another have little impact, and this can cause instability in the model.

To counteract these differences we will normalize our data. This means we will scale features so they have zero mean and unit standard deviation. This is sometimes called standard transformation or z-score normalization.

The algorithm is simple:
  1. Calculate the mean of each feature.
  2. Subtract that mean from each value of the feature.
  3. Divide the result by the standard deviation of the feature.
normalizedFeature = (feature - mean(feature)) / std(feature)

or with TensorFlow.js:

export function determineMeanAndStandardDeviation (data) {
  const dataMean = data.mean(0);
  const differenceFromMean = data.sub(dataMean);
  const squaredDifferenceFromMean = differenceFromMean.square();
  const variance = squaredDifferenceFromMean.mean(0);
  const dataStandardDeviation = variance.sqrt();
  return {dataMean, dataStandardDeviation};
}

export function normalizeTensor(data, dataMean, dataStandardDeviation) {
  return data.sub(dataMean).div(dataStandardDeviation);
}

The 0 in these calls means the mean is taken along axis 0 of data, i.e. down the rows, producing one value per feature. The data argument is a rank-2 tensor (a 2D array) and thus has two dimensions: examples and features.
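
A tiny worked example of the axis argument (made-up values):

// A rank-2 tensor: 3 examples (rows) x 2 features (columns).
const sample = tf.tensor2d([[1, 10], [2, 20], [3, 30]]);

// Axis 0 collapses the example dimension: one mean per feature.
sample.mean(0).print(); // [2, 20]

// Axis 1 would instead average across features within each example.
sample.mean(1).print(); // [5.5, 11, 16.5]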

2.3.6 Linear Regression

Once the data is normalized we can calculate a baseline, and we can now build and fit a model:

const bostonData = new BostonHousingDataset()
const tensors = {}
export const arraysToTensors = () => {
  tensors.rawTrainFeatures = tf.tensor2d(bostonData.trainFeatures)
  tensors.trainTarget   = tf.tensor2d(bostonData.trainTarget)
  tensors.rawTestFeatures  = tf.tensor2d(bostonData.testFeatures)
  tensors.testTarget    = tf.tensor2d(bostonData.testTarget)
  let {dataMean, dataStandardDeviation} = normalization.determineMeanAndStandardDeviation(tensors.rawTrainFeatures)
  tensors.trainFeatures = 
    normalization.normalizeTensor(
      tensors.rawTrainFeatures, 
      dataMean, 
      dataStandardDeviation
    )
  tensors.testFeatures = 
    normalization.normalizeTensor(
      tensors.rawTestFeatures, 
      dataMean, 
      dataStandardDeviation
    )
}
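
Note that dataMean and dataStandardDeviation are computed from the training features only and then applied to both splits; reusing the training statistics on the test set avoids leaking test information into preprocessing. A sketch of the overall data-preparation flow, using the names defined above:

const prepareData = async () => {
  await bostonData.loadData(); // download and shuffle the CSVs
  arraysToTensors();           // convert to tensors and normalize
  console.log(tensors.trainFeatures.shape); // [numTrainExamples, 12]
};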

export function linearRegressionModel() {
  const model = tf.sequential();
  // A single dense unit with no activation computes y = w·x + b,
  // which is exactly linear regression.
  model.add(tf.layers.dense({inputShape: [bostonData.numFeatures], units: 1}))

  model.summary()
  return model;
}
 

Now we must specify the loss function and optimizer with a call to model.compile() before we begin training.

async function run (model, modelName, weightsIllustration) {
  model.compile({
    optimizer: tf.train.sgd(LEARNING_RATE),
    loss: 'meanSquaredError'
  })
  let trainLogs = [];
  const container = document.querySelector(`#${modelName} .chart`)

  ui.updateStatus(`  * Starting Training Process: ${new Date().toISOString()}`)

  await model.fit(
    tensors.trainFeatures,
    tensors.trainTarget,
    { batchSize: BATCH_SIZE,
      epochs: NUM_EPOCHS,
      validationSplit: 0.2,
      callbacks: {
        onEpochEnd: async (epoch, logs) => {
          await ui.updateStatus(
            `  * Epoch ${epoch + 1} of ${NUM_EPOCHS} completed`,
            modelName
          )
          trainLogs.push(logs)
          tfviz.show.history(container, trainLogs, ['loss', 'val_loss'])

          if (weightsIllustration) {
            model.layers[0].getWeights()[0].data().then(kernelAsArray => {
              // describeKernelElements is defined elsewhere in the example
              const weightsList = describeKernelElements(kernelAsArray)
              ui.updateWeightDescription(weightsList)
            })
          }
        }
      }
    }
  )

  ui.updateStatus('Running on test data...');
  const result = model.evaluate(
      tensors.testFeatures, tensors.testTarget, {batchSize: BATCH_SIZE});
  const testLoss = result.dataSync()[0];

  const trainLoss = trainLogs[trainLogs.length - 1].loss;
  const valLoss = trainLogs[trainLogs.length - 1].val_loss;
  await ui.updateModelStatus(
      `Final train-set loss: ${trainLoss.toFixed(4)}\n` +
      `Final validation-set loss: ${valLoss.toFixed(4)}\n` +
      `Test-set loss: ${testLoss.toFixed(4)}`,
      modelName);
}
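
A hedged sketch of how these pieces might be invoked together (the ui helpers and describeKernelElements are assumed to be defined elsewhere in the example):

document.addEventListener('DOMContentLoaded', async () => {
  await bostonData.loadData();
  arraysToTensors();
  computeBaseline();
  await run(linearRegressionModel(), 'linearRegressionModel', true);
});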
  • Training Data → For fitting the model weights with gradient descent.
    • Usage in TensorFlow.js: training data is typically used as the x and y parameters in the call to Model.fit(x, y, config)
  • Validation Data → For selecting model structure and hyperparameters (see the sketch after this list).
    • Usage in TensorFlow.js: the fit function accepts two ways of specifying validation data in the config object, as shown above:
      • validationData: use this if you have your own validation data
      • validationSplit: use this if you want a portion of the training data to be split out into a validation set
  • Testing Data → For a final unbiased estimate of model performance.
    • Usage in TensorFlow.js: testing data is used as the x and y parameters in the call to Model.evaluate(x, y, config)
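
To make the two validation options concrete, a minimal sketch (valFeatures and valTarget are hypothetical held-out tensors):

const fitWithValidation = async (model) => {
  // Option 1: automatically carve 20% off the training data.
  await model.fit(tensors.trainFeatures, tensors.trainTarget,
      {epochs: NUM_EPOCHS, validationSplit: 0.2});

  // Option 2: supply your own held-out tensors instead.
  await model.fit(tensors.trainFeatures, tensors.trainTarget,
      {epochs: NUM_EPOCHS, validationData: [valFeatures, valTarget]});
};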