CODE YOUR OWN NEURAL NETWORK
A step-by-step explanation
2nd Edition (Python code + additional material)
Steven C. Shaffer
Copyright 2020, by Shaffer Media Enterprises LLC
You may use this code as a basis for your own learning or as the starting place for your own projects (academic or commercial). Do not reproduce or publish in whole or in part without the express permission of the author.
About the author
Steven Shaffer is an Associate Teaching Professor of computer science and engineering at Penn State, University Park. He has been using and teaching artificial intelligence techniques for about 20 years.
Preface
This e-book will take you step-by-step through coding your own neural network. This version uses Python, whereas the previous version used C++. Once you have mastered the steps, it should be easy enough to translate your knowledge into most other languages.
Please note: This mini-book goes step by step through the development of a relatively simple 3-layer back-propagation neural network that solves the exclusive OR problem. The goal is to explain the concepts from a code-literate standpoint, for someone who understands coding but has no other training in the subject. The book uses a single example with a sigmoid activation function only, and does not cover bias weights or momentum. I use built-in (magic) numbers and global variables in order to keep the example clear, without confounding details. This is not production code. If you already understand how neural networks work, you don’t need to read this book.
There are lots of text-heavy and math-heavy explanations of how neural networks work, but few code-centric ones. The exclusive OR problem is a good example because it cannot be solved with a single-layer neural network. If you are a coder, the explanations should make sense to you. No explanations of Python syntax are given, so this is not for the novice programmer. This book is not for the purist or the researcher; it’s an introductory book for people who want a basic understanding of how neural networks work.
With that said, I feel confident that if you spend a few hours going through this tutorial, you will gain a solid knowledge of the mechanics of back-propagation neural networks. If you want a copy of the source code, follow the instructions towards the end of this mini-book. However, I believe that you’re better off typing it in as you go – you will understand it better that way.
Once you’ve walked through this tutorial, you can start to code your own for your own work, or, having a better understanding of how they work, use one of the many open source or commercial neural network building tools.
Please note: It’s very hard to get program code to display properly in an e-format; I’ve done the best I can.
Introduction: The requisite explanatory diagram
Every time one encounters an explanation of a neural network, there is always a diagram given such as that shown below. So, in the interests of tradition (and probably pedagogy), I’ve included it. Please keep in mind that this is a very rough sketch of the process at this point; however, we will refer back to this diagram from time to time.
[Diagram: input nodes 1 and 2 feed hidden nodes 3 and 4, which in turn feed output node 5. Each connection carries a weight, and nodes 3, 4, and 5 each have a threshold.]
Let’s go through the parts one at a time…
- Two inputs come from outside the network and are passed, unchanged, from nodes 1 and 2 toward nodes 3 and 4; each connection multiplies the value it carries by a weight.
- Nodes 3 and 4 each combine their weighted inputs, compare the sum to a threshold, and send a weighted output on to node 5.
- Node 5 combines its weighted inputs, compares the sum to its threshold, and produces the network’s output.
Note that nodes 3, 4, and 5 – which are referred to as neurons – each operate the same way: they take their weighted inputs, combine them, and then, based on a threshold value, produce an output. This is how a working, trained neural network operates.
There is another step though; the network has to be trained to produce the correct response. This is done through an iteration of comparing actual output with expected outputs and adjusting the weights until a certain level of correctness is achieved. We will go through each of these steps as we progress through this tutorial.
1: Data structures and initialization
If you’re a software developer, you know that the definition of your data structures is an important step in writing any program. In this section, I will explain the data structures used in the neural network, which are surprisingly simple.
First, let’s define some terms. Our network will have a certain number of input nodes (two in the example), a certain number of hidden nodes (also two in the example), and some number of output nodes (one in the example). Near the top of the program, I define the following constants:
NUMINPUTNODES = 2
NUMHIDDENNODES = 2
NUMOUTPUTNODES = 1
NUMNODES = NUMINPUTNODES + NUMHIDDENNODES + NUMOUTPUTNODES
ARRAYSIZE = NUMNODES + 1 # 1-offset to match "node 1" "node 2" etc.
MAXITERATIONS = 50000
E = 2.71828
LEARNINGRATE = 0.20
TARGETERROR = 0.005
So, the value for ARRAYSIZE is based on the number of nodes in the network, which will vary for different applications. Why do we need ARRAYSIZE? Because all of our data structures are arrays (I told you they were simple)!
weights = [[0.0] * ARRAYSIZE for _ in range(ARRAYSIZE)] # comprehension: each row is its own list
values = [0.0] * ARRAYSIZE
expectedValues = [0.0] * ARRAYSIZE
thresholds = [0.0] * ARRAYSIZE
Each of the arrays has been given a descriptive name; for comparison, look at the “explanatory diagram” shown earlier. An element of the weights array, e.g. weights[x][y], is the weight from node x to node y. The values are the values stored at any moment in each node. The thresholds are as shown in the diagram. The expected values are not in the diagram, but will be needed when we train the network. Not every cell of these arrays will be used (for example, there are no thresholds for the input nodes), but the simplicity of the data structure is well worth a few wasted bytes of memory. If ARRAYSIZE were large, an argument could be made for tightening up these data structures, and you are certainly welcome to do so, but it’s not really necessary for this tutorial.
That’s the extent of the data structures in this program; as I said, it’s deceptively simple.
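One Python subtlety is worth calling out here: a two-dimensional list has to be built with a list comprehension, because multiplying a list of lists only duplicates references to a single inner row. A minimal demonstration of the pitfall (the names are just for illustration):

```python
ARRAYSIZE = 6

# Multiplying the outer list copies REFERENCES to one shared inner list.
aliased = [[0.0] * ARRAYSIZE] * ARRAYSIZE
aliased[1][2] = 9.9
print(aliased[3][2])   # prints 9.9 -- every "row" is the same list!

# A comprehension builds a fresh inner list for each row.
weights = [[0.0] * ARRAYSIZE for _ in range(ARRAYSIZE)]
weights[1][2] = 9.9
print(weights[3][2])   # prints 0.0 -- the rows are independent
```

This matters for the weights array in particular, since training updates individual cells one at a time.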
After setting up the arrays, main calls connectNodes, which is shown below in context, along with things that we will need eventually:
import random
NUMINPUTNODES = 2
NUMHIDDENNODES = 2
NUMOUTPUTNODES = 1
NUMNODES = NUMINPUTNODES + NUMHIDDENNODES + NUMOUTPUTNODES
ARRAYSIZE = NUMNODES + 1 # 1-offset to match "node 1" "node 2" etc.
MAXITERATIONS = 50000
E = 2.71828
LEARNINGRATE = 0.20
TARGETERROR = 0.005
…
def main():
    random.seed(0)
    weights = [[0.0] * ARRAYSIZE for _ in range(ARRAYSIZE)]
    values = [0.0] * ARRAYSIZE
    expectedValues = [0.0] * ARRAYSIZE
    thresholds = [0.0] * ARRAYSIZE
    print("Neural Network Tutorial")
    connectNodes(weights, thresholds)
…
main()
Now, let’s look at the connectNodes function:
def connectNodes(weights, thresholds):
    # Set random connection weights
    for x in range(1, NUMNODES + 1):
        for y in range(1, NUMNODES + 1):
            weights[x][y] = (random.randint(0, 200)) / 100.0
    # Set random thresholds
    thresholds[3] = random.uniform(0.0, 2.0)
    thresholds[4] = random.uniform(0.0, 2.0)
    thresholds[5] = random.uniform(0.0, 2.0)
This function sets the initial values for the weights and the thresholds. The actual selection of ideal initial values for these variables is the subject of an entire line of research. What I’ve done here is to set both the initial thresholds and the initial weights to random values between 0.0 and 2.0, primarily because when I experimented with the fully working program, these ranges seemed to work pretty well. Once you have the entire program running (at the end of the tutorial), it’s instructive to change these initial values and observe what happens.
This would be a good time to get your first version of this code running if you haven’t already!
2: The main loop of the program
The main processing loop of the program works as shown below in context:
def main():
    random.seed(0)
    weights = [[0.0] * ARRAYSIZE for _ in range(ARRAYSIZE)]
    values = [0.0] * ARRAYSIZE
    expectedValues = [0.0] * ARRAYSIZE
    thresholds = [0.0] * ARRAYSIZE
    print("Neural Network Tutorial")
    connectNodes(weights, thresholds)
    # MAIN LOOP OF THE PROGRAM
    fourInARow = 0
    for counter in range(MAXITERATIONS):
        trainingExample(values, expectedValues, counter)
        activateNetwork(weights, values, thresholds)
        sumOfSquaredErrors = updateWeights(weights, values, expectedValues, thresholds)
        if sumOfSquaredErrors < TARGETERROR:
            fourInARow += 1
        else:
            fourInARow = 0
        if fourInARow >= 4:
            print("Leaving early! Counter:", counter)
            break
        if counter % 1000 == 0:
            print("Counter:", counter)
    print("End Counter:", counter)
    displayNetwork(weights, values, thresholds, expectedValues)
The training process works essentially like this:
[Diagram: generate a training example → activate the network → update the weights → check the error, repeated until done.]
We just keep doing this until we decide we’re done. In the end, we’ll set MAXITERATIONS to 50000, but for developing and debugging, we’ll set it much lower. The sum of squared errors is a measure of how close the calculated answer is to the desired/expected value. I’ve set the program to end when a particular sum of squared errors is attained; however, there is no guarantee that will ever happen, so MAXITERATIONS puts a cap on how long the program will run. Again, all of these values can be adjusted and played with after the program is up and running. (Note that we want all four learned responses to have a sum of squared errors under the threshold to be considered “done”, so this is accounted for in the code. More on this below.)
For each iteration, the program will:
- Generate a training example.
- Run the example through the network.
- Update the values to “learn” the desired function, returning the sum of squared errors.
- Check if we’ve found a “good enough” answer and, if not, repeat the process.
What’s happening here is that the trainingExample function will generate an example of our target function (XOR), then we send it through the network, checking how close the network’s answer would be, and correct the network by modifying the node weights. This works here because we already know how to calculate the function we want (XOR); in many/most applications, we don’t have a hard and fast way to create the training examples. Instead, the network might learn from real-life consequences or by comparing its results to an expert answering the same question. For example, in a stock trading application, the network might learn from successful versus unsuccessful trades (the real-world results). In another case, if a network is learning to diagnose diseases, it might compare its answers to one or more expert doctors.
With XOR, what we want to look at in the end is the weighting of the results of the network for each of the cases:
input 1 | input 2 | XOR result
   1    |    1    |     0
   0    |    1    |     1
   1    |    0    |     1
   0    |    0    |     0
Foreshadowing a bit, we’ll see output from the program something like this:
1.0000 | 1.0000 | 0.0705 | err: 0.0050
0.0000 | 1.0000 | 0.9479 | err: 0.0027
1.0000 | 0.0000 | 0.9480 | err: 0.0027
0.0000 | 0.0000 | 0.0542 | err: 0.0029
These are the results of running the network with each of the four scenarios (the first two columns), producing the third column: values approaching zero and one. Thus, for the inputs 1 and 1, the result is 0.0705, which is pretty close to zero, compared to the result for 0 and 1, which is 0.9479, pretty close to one.
The fourth column shows the sum of squared errors, as described in more detail below.
3: How the training example is generated
Generating the training example for an XOR function is pretty simple, which is one of the reasons why it’s a good example for this tutorial. The code for trainingExample is:
def trainingExample(values, expectedValues, counter):
    if counter % 4 == 0:
        values[1] = 1
        values[2] = 1
        expectedValues[5] = 0
    elif counter % 4 == 1:
        values[1] = 0
        values[2] = 1
        expectedValues[5] = 1
    elif counter % 4 == 2:
        values[1] = 1
        values[2] = 0
        expectedValues[5] = 1
    elif counter % 4 == 3:
        values[1] = 0
        values[2] = 0
        expectedValues[5] = 0
    else:
        print("ACK!")
        exit()
I’ve added the “unneeded” else statement for completeness. Again, the code is not meant for production, but to simplify the concepts. The function is simple and effective as-is. Every time the function runs, it generates one of four possible XOR scenarios, as laid out in the diagram in the previous section. There’s little randomness or fuzziness to this, because XOR is a strict function. All this function does is set up the example against which the network will compare its answers.
I suggest that you implement this function into your program at this point and run it to make sure it’s working thus far. If you throw in some trace statements, you’ll see that the first six examples cycle through the four cases, like this:
1 1 0
0 1 1
1 0 1
0 0 0
1 1 0
0 1 1
Which looks like an XOR function!
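The cycling behavior can be checked on its own. Here is a self-contained sketch (with each branch condensed into a tuple assignment) that prints the first six examples:

```python
# Condensed version of trainingExample, cycling through the four XOR cases.
def trainingExample(values, expectedValues, counter):
    if counter % 4 == 0:
        values[1], values[2], expectedValues[5] = 1, 1, 0
    elif counter % 4 == 1:
        values[1], values[2], expectedValues[5] = 0, 1, 1
    elif counter % 4 == 2:
        values[1], values[2], expectedValues[5] = 1, 0, 1
    else:  # counter % 4 == 3
        values[1], values[2], expectedValues[5] = 0, 0, 0

values = [0.0] * 6
expectedValues = [0.0] * 6
trace = []
for counter in range(6):
    trainingExample(values, expectedValues, counter)
    trace.append((values[1], values[2], expectedValues[5]))
    print(values[1], values[2], expectedValues[5])
```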
4: How activateNetwork works
Okay, now we’re getting to the crux of the matter! Before diving into the code, let’s review the “requisite explanatory diagram” from the first few pages of this mini-book, and review how the values of NUMINPUTNODES and NUMHIDDENNODES are set. Here is the diagram again, just for ease of access:

Note that there are two input nodes, two hidden nodes, and one output node. Another neural network might have a different topology, so this function is written to be fairly general. Keep in mind that the basic data structure here is just an array, with each element of the array corresponding to one of the nodes in the network.
In order to activate the network, we will need to process all of the hidden and output nodes (note that the input nodes are just values, they need no processing). So, in order to process all of the hidden nodes, we start at NUMINPUTNODES plus one, and iterate through all of the hidden nodes, as follows:
for hiddenNode in range(1 + NUMINPUTNODES, 1 + NUMINPUTNODES + NUMHIDDENNODES):
    # do something!
What exactly we will do in this loop will be covered below. Note that this might seem overly complex, because in this particular case, we could just do the following:
for hiddenNode in range(3, 5):
    # do something!
But we want to make the process work for different network designs. Equivalently, then, when we need to process all of the output nodes, we will start with the last hidden node and process through the output nodes:
for outputNode in range(1 + NUMINPUTNODES + NUMHIDDENNODES, NUMNODES + 1):
    # do something!
One of the cool things is that the process within each of these loops is exactly the same; if you understand how to process the hidden nodes, you will immediately understand how to process the output nodes. (Remember that the input nodes are not “really” neurons, but just a way to get the input into the network.) This being the case, we will cover the hidden node processing in great detail, and then just refer back to it for the output node processing.
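For our 2-2-1 topology, you can verify what these range expressions produce with a quick check:

```python
NUMINPUTNODES = 2
NUMHIDDENNODES = 2
NUMOUTPUTNODES = 1
NUMNODES = NUMINPUTNODES + NUMHIDDENNODES + NUMOUTPUTNODES

# The same range expressions used in the activation loops.
hidden = list(range(1 + NUMINPUTNODES, 1 + NUMINPUTNODES + NUMHIDDENNODES))
output = list(range(1 + NUMINPUTNODES + NUMHIDDENNODES, NUMNODES + 1))
print(hidden)   # prints [3, 4] -- the hidden nodes
print(output)   # prints [5]    -- the output node
```

Change the three NUM... constants and the same two expressions still pick out the right node indices.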
5: Sidebar: Summations and sigmoid functions
If you are very familiar with the concepts of summations and sigmoid functions, please feel free to skip this section!
A summation, usually symbolized with the capital Greek letter sigma (Σ), which looks like an “E”, is simply the total of some bunch of numbers, and is most easily understood with this piece of code:
total = 0
for k in range(0, n):
    total += data[k]
The above code would be written something like this by a mathematician:
total = Σ data[k]   (summing k from 0 to n − 1)
A sigmoid function produces an “S” shaped curve that has the special property of quickly sliding toward one value or another. (This is not the mathematical definition, of course, but an explanation of its usefulness.) The reason that this is useful to us for the purposes of this mini-book is that the output of an XOR function is really 1 or 0, not .5 or .78634 or some other real number. So, what we want from this function is an attempt to “push” the answer one way or the other. If you look at the diagram below and imagine that the curve goes on forever in both directions, you can see that there is only a small space in which the value is “fuzzy”.
[Figure: an S-shaped sigmoid curve, hugging 0 on the far left and 1 on the far right, with only a narrow “fuzzy” region in the middle.]
The sigmoid function is calculated like this:
f(x) = 1 / (1 + e^(−x))
Where ‘e’ is the mathematical constant approximated by 2.71828.
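You can get a feel for the squashing behavior with a few sample inputs (this uses the same approximation of e as the rest of the program):

```python
E = 2.71828  # the approximation of e used throughout this tutorial

def sigmoid(x):
    # Maps any real number into the open interval (0, 1)
    return 1.0 / (1.0 + pow(E, -x))

# Large negative inputs land near 0, large positive inputs near 1;
# only a narrow band around 0 is "fuzzy".
for x in [-10, -2, 0, 2, 10]:
    print(x, round(sigmoid(x), 4))
```

Note that sigmoid(0) is exactly 0.5 – the most “undecided” the function can be.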
6: Back to: Calculating the output of the hidden nodes
The following calculation shows how the output for each hidden neuron is computed:
output(h) = f( Σ (weights[i][h] × values[i]) − thresholds[h] ), where f is the sigmoid function and i runs over the input nodes.
It might look a little daunting if you are not big on math, but between this explanation and the code below, you’ll be able to “get” it!
Starting from the inside of the equation:
Σ weights[i][h] × values[i]
All that we are doing is going through all of the inputs into the hidden node, multiplying them by the appropriate weight, and adding them up as we go along. This is literally just a “weighted sum.”
Next, we subtract the threshold value:
Σ weights[i][h] × values[i] − thresholds[h]
Then we apply the sigmoid function to the entire result and voila we have our output!
Here is the code to implement this part – for each hidden node, we loop through all of the input nodes, updating the weighted input. After that, we adjust for the threshold and then use the sigmoid function to calculate the value of the hidden node based on the weighted input. In short: Process each hidden node using the values of the input nodes.
for hiddenNode in range(1 + NUMINPUTNODES, 1 + NUMINPUTNODES + NUMHIDDENNODES):
    weightedInput = 0.0
    for inputNode in range(1, NUMINPUTNODES + 1):
        weightedInput += weights[inputNode][hiddenNode] * values[inputNode]
    weightedInput += (-1 * thresholds[hiddenNode])
    values[hiddenNode] = 1.0 / (1.0 + pow(E, -weightedInput))
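To see the arithmetic concretely, here is one hidden node’s activation worked by hand, with made-up weights, inputs, and threshold (these numbers are illustrative only, not values from the program):

```python
E = 2.71828

# Hypothetical numbers: weights from input nodes 1 and 2 into hidden node 3,
# the (1, 1) input case, and a threshold for node 3.
weight_1_3 = 0.8
weight_2_3 = 0.4
value1, value2 = 1.0, 1.0
threshold3 = 0.5

weightedInput = weight_1_3 * value1 + weight_2_3 * value2   # 1.2
weightedInput += -1 * threshold3                            # 0.7
value3 = 1.0 / (1.0 + pow(E, -weightedInput))
print(round(value3, 4))   # roughly 0.6682
```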
(Sidebar)
In an early version of this book, I used single-letter variable names to make the code easier to read. For this I received criticism in the reviews. With shorter variable names, the code would look like this:
for h in range(1 + NUMINPUTNODES, 1 + NUMINPUTNODES + NUMHIDDENNODES):
    wi = 0.0
    for i in range(1, NUMINPUTNODES + 1):
        wi += weights[i][h] * values[i]
    wi += (-1 * thresholds[h])
    values[h] = 1.0 / (1.0 + pow(E, -wi))
Which I think is easier to read. Please feel free to do this exchange yourself as you go.
(End of sidebar)
As was mentioned earlier, calculating the output nodes works exactly the same way as the hidden nodes, except that, now, we process all output nodes by using the hidden nodes as the inputs.
for outputNode in range(1 + NUMINPUTNODES + NUMHIDDENNODES, NUMNODES + 1):
    weightedInput = 0.0
    for hiddenNode in range(1 + NUMINPUTNODES, 1 + NUMINPUTNODES + NUMHIDDENNODES):
        weightedInput += weights[hiddenNode][outputNode] * values[hiddenNode]
    weightedInput += (-1 * thresholds[outputNode])
    values[outputNode] = 1.0 / (1.0 + pow(E, -weightedInput))
That’s all there is to it! Below is the entire activateNetwork function. For code-literate folks, the code is likely easier than the explanation!
def activateNetwork(weights, values, thresholds):
    for hiddenNode in range(1 + NUMINPUTNODES, 1 + NUMINPUTNODES + NUMHIDDENNODES):
        weightedInput = 0.0
        for inputNode in range(1, NUMINPUTNODES + 1):
            weightedInput += weights[inputNode][hiddenNode] * values[inputNode]
        weightedInput += (-1 * thresholds[hiddenNode])
        values[hiddenNode] = 1.0 / (1.0 + pow(E, -weightedInput))
    for outputNode in range(1 + NUMINPUTNODES + NUMHIDDENNODES, NUMNODES + 1):
        weightedInput = 0.0
        for hiddenNode in range(1 + NUMINPUTNODES, 1 + NUMINPUTNODES + NUMHIDDENNODES):
            weightedInput += weights[hiddenNode][outputNode] * values[hiddenNode]
        weightedInput += (-1 * thresholds[outputNode])
        values[outputNode] = 1.0 / (1.0 + pow(E, -weightedInput))
At this point, it might be worthwhile to run the program; set MAXITERATIONS to 5, throw in some trace statements, and watch the program execute.
7: Updating the weights
The next step of the iteration is to update the weights; this is how the network will “learn” to produce the correct output. The process is essentially this:

So, now we will need to update the weights in the network so that subsequent runs of the network will generate answers that are closer and closer to the idealized training examples. We’ll calculate three things:
- The absolute error, which is just the expected value minus the value generated by the network.
- The sum of squared errors, which will give us a way to measure the overall error in the system.
- The error gradient, which is explained below.
We’ll do this for each of the output nodes:
sumOfSquaredErrors = 0.0
for outputNode in range(1 + NUMINPUTNODES + NUMHIDDENNODES, NUMNODES + 1):
    absoluteError = expectedValues[outputNode] - values[outputNode]
    sumOfSquaredErrors += pow(absoluteError, 2)
    outputErrorGradient = values[outputNode] * (1.0 - values[outputNode]) * absoluteError
    # process the hidden nodes
    # update the threshold
    delta3 = LEARNINGRATE * -1 * outputErrorGradient
    thresholds[outputNode] += delta3
Note that in our current example there is only one output node, but there could be more.
(Sidebar on error gradients)
The error gradient is a simple concept, wrapped up in a complex derivation, which results in a simple calculation. Roughly, the error gradient is the slope of the activation function times the amount of error. This makes sense because the slope of the activation function at a particular point gives you a sense of how “changeable” the value is, and multiplying it by the amount of error gives you a sense of the “bigness” of the error. Officially, the error gradient is calculated by taking the derivative of the activation function and multiplying it by the amount of the error. This sounds like a huge pain, except that when you start to do the math, one of the big factors turns out to be equivalent to the output of the neuron itself, which we already know. Replacing that into the equation gives us:
Error gradient = O(n) x [1 – O(n)] x E(n), where:
O(n) is the output of the network at node n, and
E(n) is the absolute error at node n.
Which is very easy to calculate and build into a program. It’s important to note that calculating the error gradient for a hidden layer node is very close to how it’s done for the output nodes. An important difference, though, is that we don’t know the expected value for a hidden layer node, and so we can’t use the absolute error. Instead, we will use a portion of the output error gradient – that portion is determined by the weight of the connection between the hidden node and the output node. For example, in our “requisite explanatory diagram”, the value of the output node 5 is determined by the values of the hidden nodes 3 and 4, multiplied by their weight. This gives us the ability to apportion “blame” to nodes 3 and 4 for the error found in node 5.
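Here is the sidebar’s arithmetic with hypothetical numbers plugged in, for both an output node and a hidden node (the values are made up purely to trace the formulas through once):

```python
# Hypothetical values, just to trace the gradient formulas through once.
outputValue = 0.75     # O(n): what the network produced at the output node
expectedValue = 1.0    # what the training example says it should be

absoluteError = expectedValue - outputValue                      # E(n) = 0.25
outputErrorGradient = outputValue * (1.0 - outputValue) * absoluteError
print(outputErrorGradient)   # 0.75 * 0.25 * 0.25 = 0.046875

# A hidden node has no expected value, so it gets a share of the output
# gradient, apportioned by the weight of its connection to the output node.
hiddenValue = 0.6
weightHiddenToOutput = 1.2
hiddenErrorGradient = (hiddenValue * (1.0 - hiddenValue) *
                       outputErrorGradient * weightHiddenToOutput)
print(hiddenErrorGradient)   # 0.6 * 0.4 * 0.046875 * 1.2 = 0.0135 (up to rounding)
```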
(End of sidebar)
Back to: Updating the weights
Recapping a bit, we are looping through all of the output nodes, calculating the error, sum of squared errors, and the error gradient. Each of these values will be used as we continue.
Now, for each output node, we need to backtrack to the hidden nodes that generated the output. This is called back propagation; all it really means is that, assuming there is some error in the output, we figure out how to assign ‘blame’ to the nodes that generated that output. Then, once we know who to blame, we give them a correction by changing the weighting. The basic process is:

Note that all of this processing happens for each output node (in case there are more than one), so there are 3 nested loops:
[Diagram: three nested loops – for each output node, for each hidden node feeding it, for each input node feeding that hidden node.]
Also note that this program works for a single layer of hidden nodes. If you have more layers of hidden nodes, there will need to be another layer of nested for loop. There are ways to program this in a more general way, but it makes the code a lot harder to understand. Once you completely understand this example (all the way to the end), it would be a good idea to first add an extra hidden layer, then think of ways to generalize the case to ‘n’ hidden layers.
Below is the code for the entire function (sorry it wraps awkwardly); note that it returns the sumOfSquaredErrors for use in main.
def updateWeights(weights, values, expectedValues, thresholds):
    sumOfSquaredErrors = 0.0
    for outputNode in range(1 + NUMINPUTNODES + NUMHIDDENNODES, NUMNODES + 1):
        absoluteError = expectedValues[outputNode] - values[outputNode]
        sumOfSquaredErrors += pow(absoluteError, 2)
        outputErrorGradient = values[outputNode] * (1.0 - values[outputNode]) * absoluteError
        for hiddenNode in range(1 + NUMINPUTNODES, 1 + NUMINPUTNODES + NUMHIDDENNODES):
            delta1 = LEARNINGRATE * values[hiddenNode] * outputErrorGradient
            weights[hiddenNode][outputNode] += delta1
            hiddenErrorGradient = values[hiddenNode] * (1 - values[hiddenNode]) * \
                outputErrorGradient * weights[hiddenNode][outputNode]
            for inputNode in range(1, NUMINPUTNODES + 1):
                delta2 = LEARNINGRATE * values[inputNode] * hiddenErrorGradient
                weights[inputNode][hiddenNode] += delta2
            thresholdDelta = LEARNINGRATE * -1 * hiddenErrorGradient
            thresholds[hiddenNode] += thresholdDelta
        # update the threshold (theta) for the output node
        delta3 = LEARNINGRATE * -1 * outputErrorGradient
        thresholds[outputNode] += delta3
    return sumOfSquaredErrors
Using our “requisite explanatory diagram” (shown again below), the function will update the weights and the thresholds. If there were more nodes, then the corresponding weights and threshold values would be updated as well, based on NUMINPUTNODES, NUMHIDDENNODES, and NUMOUTPUTNODES.

The last element that needs to be explained is the learning rate.
8: Understanding the learning rate
When a person learns, there is a fine balance between making a reasonable deduction and jumping to conclusions. Phrases like “don’t generalize from a single data point” and “the definition of insanity is doing the same thing over and over and expecting a different result” remind us of the pitfalls of inductive thinking. However, how many times does someone need to touch a hot stove before deciding it’s a bad idea? In terms of learning rate, not touching a hot stove should have a learning rate of 1 (100%). However, if you eat pizza and then later get a stomachache, would it be reasonable to never eat pizza again? Probably not. Colloquially, if we agree that getting sick right after eating pizza five times in a row is reason enough to stop, then we could say that a learning rate of 0.2 (20%) might be appropriate. (It doesn’t exactly work that way, but it’s a reasonable way to think about it.)
Learning rate is built into neural networks in order to keep them from over-generalizing or under-generalizing. It’s implemented simply as a real number greater than zero but less than one; the amount of change to the weights and the thresholds is multiplied by this number in order to keep the program from jumping to conclusions on the one hand, or taking forever to learn on the other. I’ve set the LEARNINGRATE in this program to 0.2, mostly because it seems to be a common choice and it worked in this case. The better the training examples, the higher you could consider setting the learning rate. Given this particular problem (XOR), one could argue for setting it to 1.0, because the training examples are generated to be perfect. However, that would backfire: there are actually four different things that the network must learn, and you don’t want the most recent example dominating the overall results. Setting it to 1.0 would mean that the network would just spit out the last training example it encountered. We’d like to let the network find the answer by approaching it, not leaping onto it!
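The mechanical effect is easy to see: LEARNINGRATE simply scales each delta. A small demonstration, using made-up numbers for the hidden-node value and the error gradient:

```python
# How LEARNINGRATE scales one weight update; the hidden-node value and
# gradient below are made-up numbers for illustration.
hiddenValue = 0.9
outputErrorGradient = 0.05

deltas = []
for LEARNINGRATE in (0.05, 0.2, 1.0):
    delta = LEARNINGRATE * hiddenValue * outputErrorGradient
    deltas.append(delta)
    print(LEARNINGRATE, delta)
```

The same gradient produces a nudge twenty times larger at 1.0 than at 0.05 – exactly the difference between leaping and approaching.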
9: Displaying the network
The last step of the program is to dump the network values to the screen:
values[1] = 1
values[2] = 1
expectedValues[5] = 0
activateNetwork(weights, values, thresholds)
sumOfSquaredErrors = updateWeights(weights, values, expectedValues, thresholds)
print("%8.4f | %8.4f | %8.4f | err: %8.4f" %
      (values[1], values[2], values[5], sumOfSquaredErrors))
This just executes the network and reports the results for one type of input (1,1). We could display the network without executing it, but we’d have to somehow save the sumOfSquaredErrors for each of the four input types, which would be a pain. Here is the entire function:
def displayNetwork(weights, values, thresholds, expectedValues):
    values[1] = 1
    values[2] = 1
    expectedValues[5] = 0
    activateNetwork(weights, values, thresholds)
    sumOfSquaredErrors = updateWeights(weights, values, expectedValues, thresholds)
    print("%8.4f | %8.4f | %8.4f | err: %8.4f" %
          (values[1], values[2], values[5], sumOfSquaredErrors))
    values[1] = 0
    values[2] = 1
    expectedValues[5] = 1
    activateNetwork(weights, values, thresholds)
    sumOfSquaredErrors = updateWeights(weights, values, expectedValues, thresholds)
    print("%8.4f | %8.4f | %8.4f | err: %8.4f" %
          (values[1], values[2], values[5], sumOfSquaredErrors))
    values[1] = 1
    values[2] = 0
    expectedValues[5] = 1
    activateNetwork(weights, values, thresholds)
    sumOfSquaredErrors = updateWeights(weights, values, expectedValues, thresholds)
    print("%8.4f | %8.4f | %8.4f | err: %8.4f" %
          (values[1], values[2], values[5], sumOfSquaredErrors))
    values[1] = 0
    values[2] = 0
    expectedValues[5] = 0
    activateNetwork(weights, values, thresholds)
    sumOfSquaredErrors = updateWeights(weights, values, expectedValues, thresholds)
    print("%8.4f | %8.4f | %8.4f | err: %8.4f" %
          (values[1], values[2], values[5], sumOfSquaredErrors))
You can insert displayNetwork anywhere you want in the process in order to see the current status. You can also do something like this:
if counter % 1000 == 0:
    displayNetwork(weights, values, thresholds, expectedValues)
as I did in the example main. This will display the network every 1000 iterations.
10: Analyzing the results
Start by putting all of the parts of the program together. This is ideal because it will help you to understand how it all works. If you have a problem getting it to work, you can get a full copy of the entire program by contacting me here.
Set MAXITERATIONS to 50000 and run the program to completion. You should see something like this:
Neural Network Tutorial
…
Leaving early! Counter: 18592
End Counter: 18592
1.0000 | 1.0000 | 0.0705 | err: 0.0050
0.0000 | 1.0000 | 0.9479 | err: 0.0027
1.0000 | 0.0000 | 0.9480 | err: 0.0027
0.0000 | 0.0000 | 0.0542 | err: 0.0029
This display tells us that when the network receives a [1,1], it produces a value of 0.0705. Note that the ideal value for an input of [1,1] is 0. Likewise for the other three combinations of input: [0,1] is 0.9479 but should be 1; [1,0] is .9480 but should be 1; and [0,0] is 0.0542 but should be 0. These answers are just the luck of the draw – they’re just the result of running the network on what was randomly assigned to the weights and thresholds.
(Note: You, of course, might get different answers, based on the version of Python, operating system, and computer you are using. After you’ve got the program running, I suggest you try different random seeds and compare the results.)
Now, let’s change MAXITERATIONS to 100 and re-run the program:
Neural Network Tutorial
1.0000 | 1.0000 | 0.7226 | err: 0.5221
0.0000 | 1.0000 | 0.5466 | err: 0.2055
1.0000 | 0.0000 | 0.5620 | err: 0.1918
0.0000 | 0.0000 | 0.3355 | err: 0.1126
Here we see that nothing much good has happened! Let’s run it for 1000 iterations:
1.0000 | 1.0000 | 0.6664 | err: 0.4440
0.0000 | 1.0000 | 0.5497 | err: 0.2028
1.0000 | 0.0000 | 0.5670 | err: 0.1875
0.0000 | 0.0000 | 0.3367 | err: 0.1134
This is somewhat better, but certainly not great. Let’s try 10000:
1.0000 | 1.0000 | 0.1316 | err: 0.0173
0.0000 | 1.0000 | 0.9043 | err: 0.0092
1.0000 | 0.0000 | 0.9051 | err: 0.0090
0.0000 | 0.0000 | 0.0947 | err: 0.0090
You can start to see that things are “tending toward” the proper answers. As we know (shown above), by the time you get to 18,592 iterations, the answers have a sumOfSquaredErrors at or below 0.005. (This is the case for this specific example; there is nothing magic about that number.)
At some point we have to decide what is "close enough" to stop. A natural idea is to write the program to run until it reaches some specific sum of squared errors (as shown). That would be great, but unfortunately you can't rely on the network ever reaching any particular target. In fact, a neural net will often hit a point where it starts flipping from a low sum of squared errors to a relatively high one, then back again, ad infinitum. It's possible to write more sophisticated "when to stop" checks, but they are beyond the scope of this introductory mini-book.
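The stopping logic described here boils down to a loop with two exits: one for reaching the error goal and one for hitting the iteration cap. In the sketch below, train_one_epoch is a hypothetical stand-in that just simulates a shrinking error; it is not the book's back-propagation code:

```python
MAXITERATIONS = 50000
ERROR_GOAL = 0.005  # "close enough" threshold; no guarantee it is reachable

def train_one_epoch(counter):
    # Stand-in for one back-propagation pass over all four patterns;
    # here it simply simulates an error that shrinks as training proceeds.
    return 1.0 / (counter + 1)

counter = 0
while counter < MAXITERATIONS:
    sum_of_square_errors = train_one_epoch(counter)
    if sum_of_square_errors <= ERROR_GOAL:
        print("Leaving early! Counter:", counter)
        break
    counter += 1
print("End Counter:", counter)
```

Because the simulated error always shrinks, this sketch always leaves early; a real network may instead oscillate and never reach the goal, which is why the MAXITERATIONS cap is the loop's true safety net.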
11: Changing the learning rate
Let’s look at messing with the learning rate. In the current example, with the learning rate set to 0.2, we get the following results:
End Counter: 18592
1.0000 | 1.0000 | 0.0705 | err: 0.0050
0.0000 | 1.0000 | 0.9479 | err: 0.0027
1.0000 | 0.0000 | 0.9480 | err: 0.0027
0.0000 | 0.0000 | 0.0542 | err: 0.0029
Using this as a baseline, let’s see what happens if we change the learning rate. First, let’s go high and set it for 0.9, and the MAXITERATIONS back to 50000:
End Counter: 49999
1.0000 | 1.0000 | 0.7710 | err: 0.5944
0.0000 | 1.0000 | 0.6456 | err: 0.1256
1.0000 | 0.0000 | 0.7241 | err: 0.0761
0.0000 | 0.0000 | 0.0118 | err: 0.0001
Not only did it take longer, it learned way worse! How about way low (0.05)? Here are the results:
1.0000 | 1.0000 | 0.1024 | err: 0.0105
0.0000 | 1.0000 | 0.9254 | err: 0.0056
1.0000 | 0.0000 | 0.9255 | err: 0.0056
0.0000 | 0.0000 | 0.0758 | err: 0.0058
Not a bad outcome, but it took way longer; maybe it would have finished in another 5000 iterations or so. Perhaps this is an example of slow and steady wins the race.
Continuing to home in on the "best" (fastest to reach the goal of 0.005 sumOfSquareErrors) learning rate, it looks to be around 0.22:
Leaving early! Counter: 16872
End Counter: 16872
1.0000 | 1.0000 | 0.0705 | err: 0.0050
0.0000 | 1.0000 | 0.9478 | err: 0.0027
1.0000 | 0.0000 | 0.9480 | err: 0.0027
0.0000 | 0.0000 | 0.0544 | err: 0.0030
You could, of course, make achieving the lowest possible sumOfSquareErrors the primary goal and let the program run for days or weeks. Keep in mind, though, that setting a particular goal does not mean that it is achievable! Your program may run for weeks and never get an appreciably better answer than the one we started with in this section.
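The trade-off explored in this chapter shows up even on a toy problem that has nothing to do with the book's network: minimizing f(w) = w² by gradient descent. The function, rates, and thresholds below are illustrative, but the qualitative behavior is the same: too low a rate crawls, while a rate near the stability limit overshoots back and forth:

```python
def iterations_to_converge(rate, goal=1e-4, max_iters=10000):
    # Gradient descent on f(w) = w**2, whose gradient is 2 * w.
    w = 1.0
    for i in range(max_iters):
        if w * w <= goal:
            return i
        w -= rate * 2 * w  # step against the gradient
    return max_iters  # never converged within the cap

for rate in (0.05, 0.2, 0.99):
    print(f"rate {rate}: {iterations_to_converge(rate)} iterations")
```

On this toy, 0.2 reaches the goal fastest; 0.05 is slow but steady; and 0.99 makes w overshoot past zero on every step (its sign flips each iteration), so it takes far longer still, just as the too-high learning rate hurt the network above.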
And that, dear readers, is how a simple neural network works!
12: Furthering your understanding
The best way to really “get” what is happening is to mess around with the code. Try the following:
- Change how you seed the random number generator.
- You could play with the initial range of values in connectNodes to see how the initial values change the results of the runs.
- Another thing to try is to change the training function from XOR to just OR. It turns out that OR is a lot easier for a neural network to learn. Test it out!
- You could also try changing the program to keep running until a specific sum of squared errors is achieved; remember, though, not every program and network will be able to achieve a pre-determined error level.
- Also, as previously mentioned, try modifying the network to something that takes three inputs, has three hidden nodes, and two outputs. It really doesn’t matter what the function does, as long as the training examples can be written.
- Use your imagination and natural curiosity! Hours of fun can be had by all!
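As a starting point for the XOR-to-OR experiment in the list above, the change amounts to flipping a single target value in the training data. (The list format here is illustrative; the book's program stores its patterns differently.)

```python
# Each entry is (inputs, target). Only the (1, 1) target differs.
xor_patterns = [([1, 1], 0), ([0, 1], 1), ([1, 0], 1), ([0, 0], 0)]
or_patterns  = [([1, 1], 1), ([0, 1], 1), ([1, 0], 1), ([0, 0], 0)]

# OR is linearly separable: a single line (here, x + y = 0.5) splits
# the 0 case from the 1 cases, which is why a network learns it faster.
for (x, y), target in or_patterns:
    assert (1 if x + y > 0.5 else 0) == target
print("OR is linearly separable; XOR is not.")
```

No single line can separate XOR's outputs the same way, which is exactly why XOR needs the hidden layer and OR does not.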
13: Conclusion and full program code
This mini-book was intended to give you a boost in understanding how neural networks work by looking at the code, instead of all the math. Yes, the math is important, but it’s not necessary to understand the fundamental principles involved. I hope that I’ve succeeded in my mission!
I strongly urge you to read the text and follow along with the development of the code if you really want to understand it. But, if you end up frustrated or unable to get the code to work, contact me here to get a copy of the code.