Neural Nets for Web Developers #1: Quick Start w/ TensorFlow

Introduction

In this multiple-part series of posts, I plan to introduce the concepts of Deep Learning to my fellow web developers.

In this post, we'll start with the very first question that comes up on hearing the words "Neural Net" ("What is a Neural Net?") and gradually work our way up to our first Deep Learning model making predictions on our data.

In the last few posts of this series, I plan to include a project that lays out and implements advanced DL topics like Convolutional Neural Nets & more.

Without further ado, let's get started!

What is a Neural Net?

Let's start with an analogy & then get to the first principles.

Neuron Facts for Kids

This is a figure depicting the structure of a single Biological Neuron.

There are billions of these in your brain & body, connected to each other at their terminals and communicating via certain chemicals (neurotransmitters) and electrical signals.

The junction between 2 neurons is called a synapse. Here, electrical signals from the first neuron get converted into these chemicals, which travel across the gap in the synapse & get converted back into electrical signals in the second neuron.

When your brain and/or body is doing an activity that, to some extent, has repetitive patterns in it, certain networks of these neurons get activated again & again.

Mind-blowingly enough (pun intended), after a few repetitions, these networks begin adapting certain parts of their neurons to facilitate faster & stronger electrical signal transfer.

This happens when you learn to do something physically or mentally.

Here, when I say networks being "activated", I mean that the neurons in the network fire electrochemical signals one after another, causing some of the subsequent neurons to fire & others not to, which results in a unique pattern of firing with different intensities.

In turn, this activates other connected networks that MIGHT be connected to other areas of your body or even your brain.

Understanding "learning" in terms of connections

In our brains, every thought, activity or response corresponds to a certain network of neurons being activated.

While reading this post you can 3asily 1denti5y 3ords eve9 if they are not exactly what you see.

In your lifetime, reading millions of words has developed strongly connected neural networks in your brain that take input from your eyes & output "ideas" for what the whole word could be.

The same happens when you speak: certain neural networks are activated that continuously take input from your moving mouth muscles & predict the next relevant movement, so that you can produce the next relevant word the instant it "pops up" in your head.

It was a long road from saying nonsensical words to having meaningful conversations.

From tirelessly moving your legs at random to walking with outstanding balance.

Bicycling, learning JavaScript, learning math.

All of it boils down to certain groups of neurons firing in certain patterns millions of times to produce the "conscious you".

Have you noticed a pattern so far?

All of the learnable mental/physical actions you've done started out with randomness and slowly got more accurate or "improved".

Let's define "improvement" on an abstract basis:

  1. start with randomness

  2. change

  3. calculate accuracy

  4. change in whatever direction that helps improve accuracy

  5. repeat until the maximum possible accuracy is achieved
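The five steps above can be sketched in a few lines of plain Python. This is only a toy illustration, not how real neural nets are trained: the accuracy() function, the target value 7 and the step size are all made up for the example.

```python
import random

def improve(score, start, step=0.5, rounds=200, seed=0):
    """Keep a random change only when it raises the score."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    best, best_score = start, score(start)
    for _ in range(rounds):                          # 5. repeat
        candidate = best + rng.uniform(-step, step)  # 2. change
        candidate_score = score(candidate)           # 3. calculate accuracy
        if candidate_score > best_score:             # 4. keep changes that help
            best, best_score = candidate, candidate_score
    return best, best_score

# Hypothetical "skill": the closer the guess is to 7, the higher the accuracy.
def accuracy(guess):
    return -abs(guess - 7.0)

start_guess = random.Random(42).uniform(-10, 10)     # 1. start with randomness
final_guess, final_accuracy = improve(accuracy, start_guess)
print(start_guess, "->", final_guess)
```

Because a change is only kept when it helps, the accuracy can never get worse from one round to the next.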

Artificial Neural Networks merely mimic this process, and over the past few decades we've started to reap the benefits.

Nature itself follows this abstract process of improvement in a modified form, as outlined by Darwin's theory of evolution (survival of the fittest).

Let's look into how we can replicate this process digitally.

The Perceptron

In 1943, Warren McCulloch and Walter Pitts proposed a simple ON/OFF model of an artificial neuron that mimicked biological neurons. Building on that idea, Frank Rosenblatt introduced the "Perceptron" model in the late 1950s, which works purely on numbers and is therefore well suited to computers.

Let's break this figure down into components.

  1. Think of the AN (artificial neuron) as a machine or a big mathematical function that receives some input, processes it and generates an output where the output depends on the input.

  2. The X1, X2 & X3 here, are the numerical inputs

  3. The inputs go into the summation unit that does a weighted sum of all the inputs or X's (more on "weighted sum" in the next few sections)

    Quick Note:
    The Greek letter "Σ" is called "sigma" and in Math we use it to signify where addition of multiple values is being done.

  4. The summation unit then passes the summed-up result into a literal Mathematical function. This function could be whatever suits the problem best: exponential, linear, sinusoidal, etc.

Okay... so this appears like a really big "machine" so to speak, but where's the "learning part"?

The "learning part" lies in the "weighted sum" in the summation component of the AN. Let's look into it.

Summation Unit and Weighted Sum

The summation unit of the neuron receives input(s) in an array like so:
[12.495, 200.001, 89.055, 79.402, 203.379, ...]
Or more generalized as:
[x1, x2, x3, x4, ... xn]

  1. The whole array is first multiplied by another array of the same length containing the "weights":
    [x1, x2, ... xn] * [w1, w2 ... wn] = [w1 * x1, w2 * x2, ... wn * xn]

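    More you know:
    This element-wise multiplication of 2 lists is also known as the "Hadamard product"; together with the summing-up in the next step, it makes up the "dot product" of the 2 lists.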

  2. And then all the elements of the resultant "weighted" array are summed up to produce one number (also known as the weighted sum).

    Along with that we also add one more number to the final weighted sum number, called a "bias". (more on this number in the next section)

  3. The resultant number is passed off to the last component of our AN, the literal mathematical function

A number to multiply by is often called a weight because the product carries more/less "weight" than the original number. For example:
12 * 1.2 = 14.4
In this simple operation, 1.2 is the weight applied to 12 to produce the weighted product 14.4.

But you just mentioned that an AN is like a machine, a mathematical function whose output depends only on the inputs given?

What are these w1, w2, w3...?
They don't seem to depend on anything. What are their values?
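Here's a minimal sketch of the summation unit in plain Python. The function names and the sample numbers are mine, and I've used a do-nothing identity() function as a stand-in for the "literal mathematical function" at the end of the AN.

```python
def weighted_sum(inputs, weights, bias):
    """Multiply each input by its weight, add everything up, then add the bias."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    weighted = [x * w for x, w in zip(inputs, weights)]  # [w1 * x1, w2 * x2, ...]
    return sum(weighted) + bias

def identity(value):
    """Placeholder for the neuron's mathematical function: returns its input."""
    return value

def neuron(inputs, weights, bias, function=identity):
    """A whole artificial neuron: summation unit followed by the function."""
    return function(weighted_sum(inputs, weights, bias))

example_output = neuron([1.0, 2.0, 3.0], [0.5, -1.0, 2.0], bias=0.25)
print(example_output)  # 0.5 - 2.0 + 6.0 + 0.25 = 4.75
```

Swapping identity() for any other function changes how the neuron shapes its output, without touching the weighted sum.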

You got me there!

The weights array is initially randomized & contains random numbers.

And I know that this will produce a totally random output after going through the literal mathematical function.

But that's part of the plan!

Recall our definition of improvement: the first step is to start randomly AND THEN steer towards maximum accuracy.

Machine Learning with A Hypothetical Perceptron

Let's try to understand the whole process with a simple, highly overused example of housing prices.

Here's our example data:

house_areas = [3765, 7530, 11295, 15060, 18825, 22590, 26355, 30120, 33885, 37650, 41415, 45180, 48945, 52710, 56475]
house_prices = [16203815, 11924806, 15363025, 20681588, 23343728, 31297816, 45029324, 48653516, 43981690, 52140160, 64208930, 59116732, 69043050, 66664916, 77795170]

In this example, our goal is to look at this data & predict what the price of a house should be for a given, unseen house area.

Our hypothetical perceptron for this problem looks something like this:

Let's start by defining a mathematical function for the last component of the Perceptron.

Because we aren't doing anything crazy here, let's set it to the identity function, which just outputs exactly the input that is passed into it.

Let's now go through the process of "training" this Perceptron (hypothetically):

  1. We pass in our first house's area as an array containing one number, like this: [A1]

  2. The summation unit takes its weighted sum with the weights [w1] and adds the "bias" number b to the result.

    The bias term and weight(s) are randomized initially.

    Resultant number = (A1 * w1) + b

    This resultant number is then passed into our identity function.

    If we had a second input besides the house area A, let's say the no. of bedrooms B, then the weighted sum + bias would've looked like this:
    (A1 * w1 + B1 * w2) + b

  3. The identity function receives (A1 * w1) + b and spits out (A1 * w1) + b, i.e. the same as the input.

  4. We compare the numeric difference between (A1 * w1) + b and the actual price of the house with area A1.

  5. We change the values of w1 & b by some amount and repeat the above steps for A2, A3 & so on...

  6. We keep on doing this until our predicted values are close to the actual values throughout the data, at which point we've TUNED the weight(s) & bias(es).
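To make these steps concrete, here's a rough plain-Python sketch of the same loop. A few things in it are my own assumptions and not part of the process above: I rescale the data to thousands of sq. ft and millions of USD, I start from w1 = 0 and b = 0 instead of random values so the run is repeatable, and "change by some amount" is implemented as a small fixed nudge in whichever direction reduces the difference.

```python
house_areas = [3765, 7530, 11295, 15060, 18825, 22590, 26355, 30120,
               33885, 37650, 41415, 45180, 48945, 52710, 56475]
house_prices = [16203815, 11924806, 15363025, 20681588, 23343728, 31297816,
                45029324, 48653516, 43981690, 52140160, 64208930, 59116732,
                69043050, 66664916, 77795170]

# Assumption: smaller numbers keep the fixed nudge size below sensible.
areas = [a / 1000 for a in house_areas]            # thousands of sq. ft
prices = [p / 1_000_000 for p in house_prices]     # millions of USD

def identity(value):
    return value

def predict(area, w1, b):
    return identity(area * w1 + b)                 # steps 2 & 3

def average_difference(w1, b):
    diffs = [abs(predict(a, w1, b) - p) for a, p in zip(areas, prices)]
    return sum(diffs) / len(diffs)

w1, b = 0.0, 0.0   # fixed starting guess (the text starts from random values)
nudge = 0.001      # hypothetical size of each change
difference_before = average_difference(w1, b)

for _ in range(200):                               # step 6: keep repeating
    for area, price in zip(areas, prices):         # step 1: one house at a time
        difference = predict(area, w1, b) - price  # step 4: compare
        direction = 1.0 if difference > 0 else -1.0
        w1 -= nudge * direction * area             # step 5: change w1 ...
        b -= nudge * direction                     # ... and b a little

difference_after = average_difference(w1, b)
print(difference_before, "->", difference_after)
```

The average difference after the loop should come out far smaller than before it, which is all "tuning" really means here.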

Machine Learning with an Actual Perceptron in TensorFlow

We've seen in an overview how perceptrons work; let's implement one using TensorFlow now.

This blog is available as a Google Colab notebook where you can easily run all the code from the comfort of your web browser.

No need to jump through a dozen hoops to get your Python environment up & running.

import tensorflow as tf # 0

ml_model = tf.keras.Sequential([ # 1
    tf.keras.layers.Dense(1)
])

ml_model.compile( # 2
    loss="mae",
    optimizer=tf.keras.optimizers.SGD(),
)

ml_model.build((None, 1)) # 3

ml_model.summary() # 4

ml_model.weights # 5

Let's break this down into pieces to understand more clearly what's happening behind the scenes.

Keras Sequential API (# 1)

The Keras module in TensorFlow helps us work with deep neural networks.

Let's understand with examples.

model = tf.keras.Sequential([
    tf.keras.layers.Dense(3),
    tf.keras.layers.Dense(5),
])

The above code corresponds to the below neural network structure behind the scenes:

If you look at this figure closely, you'll find that the ANs are connected to each other in a special way where each AN in a layer receives multiple inputs & generates 1 output but sends the output to ALL of the neurons in the next layer.

ANs in a layer aren't connected to each other but are connected to all of the ANs in the previous and subsequent layers.

This type of layering & connecting architecture is called "Dense" architecture and is one of the most commonly used in Deep Learning models.

The dense architecture helps us to put as many ANs & layers of ANs as we want and the model would learn its weights & biases accordingly.

Let's use our imagination aided by this figure to hypothetically train this example ANN (Artificial Neural Network):

  1. The array of inputs containing numbers is fed into our first layer containing 3 Neurons where each neuron receives all the inputs in an array.

  2. Each neuron in the first layer (layer #1) then processes the input array independently, putting the inputs through the summation unit to compute a weighted sum & a bias and then the activation function to compute the final output.

    TensorFlow calls the mathematical function in the neuron an "activation function". We'll follow the same convention from now on.

    The final output from each neuron in the current layer is then sent to all of the neurons in the subsequent layer.

  3. The subsequent layer or layer #2 neurons receive the outputs of ALL the neurons in the previous layer, compute the weighted sum with bias & run the activation function on the resultant.

  4. The output layer which is nothing but a simple array, collects the output of each layer #2 neuron.

  5. [Explanation Point]
    Here, just for the sake of continuing this process, we'll assume that there's a dataset that has 3 numerical values for its features and 5 numerical values for its labels.

    Independent values are called "features" in Machine Learning
    for example: house area, no. of bedrooms, etc.

    Dependent values are called "labels" in Machine Learning
    for example: house price, etc.

    Labels are dependent on Features that's why we try to find patterns in features relating to the labels.

  6. We'll compare the difference in the 5 predicted/output label values with the actual label values in our hypothetical dataset corresponding to the feature values we predicted upon.

  7. Update each weight & bias, do a comparison again and feed in the next 3 values in the neural network and so on until we have JUST RIGHT values of weights & biases.
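The forward part of this walkthrough (steps 1 to 4) can be sketched in plain Python. This is not how TensorFlow implements Dense layers internally; the helper names, the random starting weights and the 3 feature values are made up for illustration, and every neuron uses the identity activation.

```python
import random

def dense_layer(inputs, weights, biases):
    """One output per neuron: weighted sum of ALL the inputs + that neuron's bias."""
    return [
        sum(x * w for x, w in zip(inputs, neuron_weights)) + bias
        for neuron_weights, bias in zip(weights, biases)
    ]

def random_layer(n_inputs, n_neurons, rng):
    """Random weights (one list per neuron) and zero biases."""
    weights = [[rng.uniform(-1, 1) for _ in range(n_inputs)]
               for _ in range(n_neurons)]
    biases = [0.0] * n_neurons
    return weights, biases

rng = random.Random(0)                                   # fixed seed, repeatable
layer1 = random_layer(n_inputs=3, n_neurons=3, rng=rng)  # like Dense(3)
layer2 = random_layer(n_inputs=3, n_neurons=5, rng=rng)  # like Dense(5)

features = [0.2, 0.5, 0.9]                # hypothetical row with 3 feature values
hidden = dense_layer(features, *layer1)   # 3 numbers, one per layer #1 neuron
outputs = dense_layer(hidden, *layer2)    # 5 numbers, one per layer #2 neuron
print(len(hidden), len(outputs))
```

Notice how every neuron in layer #2 receives the full hidden list; that's the "connected to ALL neurons in the previous layer" part.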

In tf.keras.layers.Dense(1), we tell TensorFlow to scaffold a new layer with 1 neuron, where every neuron in the layer uses the default identity function (aka linear function) as its activation function.

This means each neuron's actual output is simply the weighted sum of its inputs + bias.

To change the activation function for the neurons of a particular layer, we can do:
tf.keras.layers.Dense(n_neurons, activation="...")

There are some predefined activation functions like the default "linear", or "elu", "relu", "sigmoid", etc. that help train ANNs better by modulating the outputs of the neurons for a particular layer.
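To get a feel for what these names mean, here are plain-Python versions of four of them, using their textbook definitions. These are stand-ins I wrote for illustration, not TensorFlow's own code.

```python
import math

def linear(x):
    """The default: output equals input."""
    return x

def relu(x):
    """Negative values become 0, positive values pass through."""
    return max(0.0, x)

def sigmoid(x):
    """Squashes any number into the range 0..1."""
    return 1.0 / (1.0 + math.exp(-x))

def elu(x, alpha=1.0):
    """Like relu, but negative values curve smoothly towards -alpha."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

for name, fn in [("linear", linear), ("relu", relu),
                 ("sigmoid", sigmoid), ("elu", elu)]:
    print(name, [round(fn(v), 3) for v in (-2.0, 0.0, 2.0)])
```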

Now, allow me to visualize the importance of bias numbers for you.

The bias is added to a neuron's weighted sum regardless of what the inputs are, so it shifts that neuron's output up or down. Because the bias is a changeable number, just like the weights, the network can use it to "dim" or "highlight" the influence of a particular neuron on the overall output.

Coming to our original code, we can see that we've defined an ANN that has only 1 layer of neuron(s) and that layer contains only 1 neuron.

ml_model = tf.keras.Sequential([ # 1
    tf.keras.layers.Dense(1)
])

Now that we know how to code up an ANN & how Sequential API works, let's code up the next step of the process - calculating loss & optimizing the model.

Losses & Optimizers (# 2)

After we've scaffolded our Neural Net, we need to compile it with a loss function & an optimizer algorithm. Let's understand what they are.

ml_model.compile( # 2
    loss="mae",
    optimizer=tf.keras.optimizers.SGD(),
)

For regression problems like this one, 2 commonly used loss-calculating functions are MAE & MSE.

Mean Absolute Error (MAE) is calculated by taking the absolute difference between each predicted label and the corresponding actual data label, and then averaging those differences.

For example, let's say we have 3 predicted labels & 3 corresponding actual data labels - [p1, p2, p3] & [y1, y2, y3]
Then the differences will be - [abs(p1 - y1), abs(p2 - y2), abs(p3 - y3)]

where the abs() function calculates the absolute value of the difference (disregards any -ve sign).

The average of this difference array is the MAE loss for that particular set of predictions.

Mean Squared Error (MSE) is similar, except that it takes the square of each difference instead of its absolute value.

Because squaring magnifies large differences, MSE is a useful loss function if you want to penalize the model more for more significant differences.
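Both losses are short enough to write out by hand. The sketch below uses plain Python with made-up predicted & actual values; TensorFlow computes the same quantities for us when we pass "mae" or "mse".

```python
def mean_absolute_error(predicted, actual):
    """Average of abs(p - y) over all label pairs."""
    differences = [abs(p - y) for p, y in zip(predicted, actual)]
    return sum(differences) / len(differences)

def mean_squared_error(predicted, actual):
    """Average of (p - y) squared over all label pairs."""
    squares = [(p - y) ** 2 for p, y in zip(predicted, actual)]
    return sum(squares) / len(squares)

predicted = [3.0, 5.0, 2.0]   # hypothetical [p1, p2, p3]
actual = [1.0, 5.0, 6.0]      # hypothetical [y1, y2, y3]

mae = mean_absolute_error(predicted, actual)  # (2 + 0 + 4) / 3
mse = mean_squared_error(predicted, actual)   # (4 + 0 + 16) / 3
print(mae, mse)
```

The difference of 4 contributes twice as much as the difference of 2 to MAE, but four times as much to MSE. That's the extra penalty for bigger misses.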

In this case, we use MAE loss by specifying a string "mae" as the loss function parameter and TensorFlow will take care of the rest.

Coming to the topic of optimizers: for now, just know that optimizers like Stochastic Gradient Descent (SGD) or Adam use the loss computed by the loss function to decide in which direction, and by how much, the weights & biases are changed.

At each iteration over the data, TensorFlow does all of this for us.
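For the curious, the core idea of plain gradient descent fits in one line: move each parameter a small step against its gradient (the direction in which the loss grows). The sketch below is a simplified stand-in, not the actual tf.keras.optimizers.SGD code; the parameter & gradient values are made up, and the 0.01 default mirrors the default learning rate of Keras' SGD.

```python
def sgd_step(params, gradients, learning_rate=0.01):
    """Return new params, each moved a small step against its gradient."""
    return [p - learning_rate * g for p, g in zip(params, gradients)]

params = [1.0, 2.0]       # hypothetical [weight, bias]
gradients = [0.5, -1.0]   # hypothetical slope of the loss for each param

updated = sgd_step(params, gradients, learning_rate=0.1)
print(updated)  # the weight moves down a little, the bias moves up a little
```

The learning rate is the "by how much" knob: too small and training crawls, too large and the params overshoot.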

Building The Model

So until now, our model actually looks something like this:

We've been assuming that there will only be one input feature to train on but the model doesn't know this yet.

As far as the TF model is concerned, the number of input features could be anything from 1 upwards, which also means that the number of weights it needs (one per input feature) is not known yet.

To specify the input shape of the data, we call the .build() function of the TF model that takes in a tuple as its first argument to define the shape of the input data.

ml_model.build((None, 1)) # 3

The tuple argument (None, 1) signifies the following:

  1. None - any number of rows of data

  2. 1 - exactly 1 feature per row of data

This means that our features have to be passed in as a 2D (rows x columns) array with exactly 1 column, like this:

house_areas = [
    [3765.],
    [7530.],
    [11295.],
    [15060.],
    [18825.],
    [22590.],
    [26355.],
    [30120.],
    [33885.],
    [37650.],
    [41415.],
    [45180.],
    [48945.],
    [52710.],
    [56475.]
]

Here, the shape of our data is (15, 1) (15 rows & 1 column).

Now our model knows how many features will be passed into it per row, i.e. 1.
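If your data starts out as a flat list (like the house_areas list from earlier), turning it into this rows x columns form takes one line. The to_column() and shape() helpers below are names I made up for this sketch.

```python
flat_areas = [3765, 7530, 11295, 15060, 18825, 22590, 26355, 30120,
              33885, 37650, 41415, 45180, 48945, 52710, 56475]

def to_column(values):
    """Wrap each value in its own row: [a, b] -> [[a], [b]]."""
    return [[float(v)] for v in values]

def shape(rows):
    """(number of rows, number of columns) of a list of equal-length rows."""
    return (len(rows), len(rows[0]))

column_areas = to_column(flat_areas)
print(shape(column_areas))  # (15, 1)
```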

Quick Note:

Calling .build() before training the model isn't strictly necessary, as TF automatically sets the input shape once we start training with the data.

I've explicitly called the .build() function here because the model summary needs the input shape to show the correct number of parameters.

Model Summary

The convenient .summary() function provides us with a summary of our TF model. Let's have a look at what it has to say about our model:

The .summary() function lists the layers along with their type, output data shape & parameters (number of weights & biases).

In our case, this is exactly what we expected: 1 tunable/trainable weight and 1 tunable/trainable bias, which in total is 2 trainable parameters.

And here's the icing on the cake: we can return all of the trainable params using ml_model.weights

The first element in the array lists all the weights per input and the second element lists all the biases per neuron.

Exactly 2 trainable variables, where the weight is initialized to a random value & the bias is initialized to 0.
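The parameter count of a dense layer follows directly from the weighted sum: every neuron has one weight per input plus one bias. Here's a tiny helper (the name is mine) to check the numbers by hand. For the Dense(3) example I'm assuming 3 input features, as in the hypothetical dataset from earlier.

```python
def dense_param_count(n_inputs, n_neurons):
    """One weight per (input, neuron) pair + one bias per neuron."""
    return n_inputs * n_neurons + n_neurons

print(dense_param_count(1, 1))  # our model: 1 weight + 1 bias = 2
print(dense_param_count(3, 3))  # Dense(3) fed 3 features: 9 weights + 3 biases = 12
print(dense_param_count(3, 5))  # Dense(5) fed 3 outputs: 15 weights + 5 biases = 20
```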

Visualizing the Data

This step generally comes at the tippy top of the process, but for the sake of learning & the format of this post, let's visualize our data a little here.

import matplotlib.pyplot as plt  # plotting library used for all charts in this post

plt.scatter(house_areas, house_prices)

plt.xlabel("House area (sq. ft)")
plt.ylabel("House pricing USD")

Please excuse the accuracy here, given how little time I spent generating it, I like how it scales :)

Again, as visualizing comes at the top of the ML process, this is the right time to point out that in this data, the prices increase roughly linearly with the house areas.

This means that if we can find the equation of a straight line that minimizes the difference between the actual data points and the corresponding points on the line, our job is done & our model will perform well enough.

Let's do just that.

Training the Model

ml_model_history = ml_model.fit(house_areas, house_prices, epochs=100)

To train our TF model, we call the model's .fit() function, passing the features array as the first argument, the labels array as the second argument & an optional parameter called epochs.

In our case, our model will go through the whole feature sample 100 times (i.e. the number of epochs). On each pass, it makes predictions with the current weights & biases, calculates the loss & lets the optimizer update the weights & biases.
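A small detail worth knowing: by default, .fit() processes the data in batches of 32 rows and updates the weights & biases once per batch. Our 15 rows fit into a single batch, so each epoch here means exactly one update. The helper below is just my own arithmetic sketch of that, not a TensorFlow function.

```python
import math

def updates_per_epoch(n_samples, batch_size=32):
    """Number of batches (and so weight updates) in one pass over the data."""
    return math.ceil(n_samples / batch_size)

epochs = 100
total_updates = updates_per_epoch(15) * epochs
print(updates_per_epoch(15), total_updates)  # 1 update per epoch, 100 in total
```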

While printing the loss at each epoch, TensorFlow also stores it during training; it can be accessed from the returned object's .history["loss"] property.

Let's plot our epoch vs. loss curve using Matplotlib (a popular graphing library)

import numpy as np
import matplotlib.pyplot as plt

plt.plot(np.arange(1, 101), ml_model_history.history["loss"])

plt.xlabel("No. of epochs")
plt.ylabel("Loss")

There's nothing more satisfying than seeing your model's loss decreasing at every epoch!

Here, the loss drops sharply as TensorFlow's SGD optimizer tunes our 2 trainable params in the direction of minimum loss.

After 5 or so epochs, the loss at each epoch starts to fluctuate around approximately the same level. At this point the model has reached about the lowest loss it can with this setup, and further epochs won't push it meaningfully lower.

Now that the loss is at its lowest, let's run this model over the house areas & compare how close its predicted house prices are to the actual ones.

Predicting on the same data the model was trained on is not ideal when working with big real-world datasets due to the problem of overfitting. This post will not go further into explaining overfitting & underfitting,

but I highly recommend going down the rabbit hole of YouTube videos & data science articles on why it is a problem :)

house_prices_pred = ml_model.predict(house_areas)

house_prices_pred

Alright alright alright...

Let's plot & visualize these predictions:

plt.scatter(house_areas, house_prices, c="blue")

plt.plot(house_areas, house_prices_pred, c="red")

plt.xlabel("House areas (sq. ft)")
plt.ylabel("House prices")

Look at that!

It's almost perfect!

Where to go from here?

Consider going through these resources if you wanna learn more about Deep Learning & TensorFlow:

Lastly, be sure to give this post a "👍🏻" to let Hashnode know this post is good enough :)
