Multilayer perceptron tutorial - Building one from scratch in Python

Author: Goran Trlin

In this article, we will see how a basic multilayer perceptron (MLP) can be built from scratch. We will use Python and the pandas and NumPy libraries to write a program capable of distinguishing between two types of input images: handwritten circles and handwritten lines.

This article is the first one in a mini-series on building and improving MLPs in Python without using any high-level libraries or large data sets. Just plain Python and some very basic inputs (datasets). The full series:

1. Multilayer perceptron tutorial - building one from scratch in Python
The first tutorial uses no advanced concepts and relies on two small neural networks, one for circles and one for lines.

2. Softmax and Cross-entropy functions for multilayer perceptron networks
The second tutorial fuses the two neural networks into one and adds the notions of Softmax output and Cross-entropy loss.

3. Adding automatic differentiation to the multilayer perceptron
The third tutorial introduces automatic differentiation, which greatly simplifies gradient calculations.

The real goal of this tutorial is to enable the readers to inspect the attached code and play with it to gain a better intuition on how multilayer perceptrons, and artificial neural networks in general, work.

Neural network architecture

Multilayer perceptrons are a type of artificial neural network that can be used to classify data or predict outcomes based on input features provided with each training example. An MLP contains at least three layers: (1) an input layer, (2) one or more hidden layers, and (3) an output layer.

The basic architecture of the multilayer perceptron we are building in this part is given in the figure below.

[Figure: Multilayer Perceptron Architecture]

As you can see, we will have three input neurons, three hidden layer neurons, equipped with logistic functions at their outputs, and a single output neuron. This is a simple network, but with properly selected image features, and combined with another identical network, it should be enough to correctly classify input images into either circles or lines. Please note that the logistic function, located at the output of every hidden layer neuron, limits the hidden neuron outputs to a value between 0.00 and 1.00.
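To make the squashing effect of the logistic function concrete, here is a minimal sketch in plain Python. The function name logistic is our own choice for illustration and is not necessarily the name used in the attached project.

```python
import math

def logistic(x):
    """Logistic (sigmoid) function: maps any real number into the range 0..1."""
    # Two branches so that math.exp never receives a large positive argument,
    # which would raise OverflowError.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

# Large negative inputs approach 0, large positive inputs approach 1.
for x in (-5.0, 0.0, 5.0):
    print(x, round(logistic(x), 4))
```

An input of zero lands exactly in the middle of the range, at 0.5.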

Perceptrons, like all other neural networks, learn by comparing the predicted outputs with the correct (target) outputs for every input example (in our case, an image). This comparison is carried out by a loss function. The loss function returns a numerical value (a scalar) representing the distance between the predicted output and the target. The ultimate objective for any neural network is to have the smallest possible loss (ideally zero) for any test example. The training phase of a neural network comes down to updating weights based on gradients of the loss function. More on this in the sections below.
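As an illustration of what such a loss function can look like, the sketch below uses the squared error. This is an assumption made for the example: the article does not state which loss the attached code uses, and the name squared_error is hypothetical.

```python
def squared_error(predicted, target):
    """Half the squared distance between prediction and target (assumed loss)."""
    return 0.5 * (predicted - target) ** 2

# A prediction close to the target gives a small loss,
# a prediction far from the target gives a larger one.
print(squared_error(0.8, 1.0))
print(squared_error(0.1, 1.0))
```

The factor 0.5 is a common convenience: it cancels the 2 that appears when the square is differentiated.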

In this first article, two MLP networks will be used to distinguish between handwritten circles and lines. One network, MLP(circle), will be trained to recognize circles, and the other network, MLP(line), will be trained to recognize lines. After a training phase, for any given image the program will select the network with the higher output value. This is a very basic technique, but again, it should be good enough as an introductory example. In future tutorials and articles, we will add more advanced network architectures and procedures.

In our network, and most other neural networks, each neuron is connected to all neurons in the layer after it. In neural network terminology, this type of connectivity between two layers is referred to as fully connected layers. Every connection between two nodes (neurons) needs to have a corresponding weight. Since our MLP network has three layers, it will need two layers of weights to connect these neuron layers:

$$ \boldsymbol{W_{1}} = [W_{1,1,1},W_{1,1,2},W_{1,1,3},W_{1,2,1},W_{1,2,2},W_{1,2,3},W_{1,3,1},W_{1,3,2},W_{1,3,3}] $$
$$ \boldsymbol{W_{2}} = [W_{2,1,1},W_{2,2,1},W_{2,3,1}] $$

Weights \( \boldsymbol{W_{1}} \) connect the input layer with the hidden layer, and weights \( \boldsymbol{W_{2}} \) connect the hidden layer with the output layer. In the code, you can find these weight arrays stored as MLP.weights[0] and MLP.weights[1].
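The following sketch shows how a forward pass through such a 3-3-1 network could be computed. It rests on several assumptions that the article does not confirm: the weights are laid out as weights[layer][from][to] (inferred from the index notation above), there are no bias terms, and the output neuron is a plain weighted sum without a logistic function. The function names are hypothetical.

```python
import math

def logistic(x):
    """Numerically safe logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def forward(inputs, weights):
    """Return (hidden activations, output) for a two-weight-layer network.

    weights[0][i][j] connects input i to hidden neuron j (assumed layout).
    weights[1][j][0] connects hidden neuron j to the single output neuron.
    """
    w1, w2 = weights
    hidden = [
        logistic(sum(inputs[i] * w1[i][j] for i in range(len(inputs))))
        for j in range(len(w1[0]))
    ]
    # Assumption: linear output neuron, no bias.
    output = sum(hidden[j] * w2[j][0] for j in range(len(hidden)))
    return hidden, output

# With all-zero first-layer weights every hidden neuron outputs logistic(0) = 0.5.
weights = [
    [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],  # 3 x 3
    [[1.0], [1.0], [1.0]],                                  # 3 x 1
]
hidden, output = forward([0.19, 0.0, 0.04], weights)
print(hidden, output)
```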

Feature selection

For training, we will use a custom set of 40 PNG images. Some of these contain handwritten circles, while the others contain lines. For the sake of simplicity, we are using only 16x16 pixel images with white backgrounds.

The next step is to select the features which will be used as inputs \( ( i_{1}, i_{2}, i_{3} ) \) to our network. We will use the following features:

  1. A total percentage of colored pixels \( \left ( i_{1} \right) \)
  2. A total amount of colored pixels in the central region of the image \( \left ( i_{2} \right) \)
  3. A total amount of colored pixels near the image borders \( \left ( i_{3} \right) \)

Since circles have more near-border pixels, they should have higher values for feature 3. Lines, on the other hand, usually pass through the image center, so they are likely to have higher values for feature 2.
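The sketch below shows one way these three features could be computed. It is not the attached ImageFeatureExtractor class. The image is represented as a 16x16 grid of 0/1 values instead of a PNG file, and the region definitions are assumptions: the "central region" is taken to be the middle 8x8 block, "near the borders" is taken to be a frame two pixels wide, and each feature is normalized by the area of its region.

```python
def extract_features(pixels, border=2, margin=4):
    """Return (feature1, feature2, feature3) for a square grid of 0/1 pixels.

    border: width of the frame counted as "near the border" (assumed).
    margin: distance from the edge at which the central region starts (assumed).
    """
    size = len(pixels)
    total = center_count = border_count = 0
    for r in range(size):
        for c in range(size):
            if not pixels[r][c]:
                continue  # background pixel
            total += 1
            if margin <= r < size - margin and margin <= c < size - margin:
                center_count += 1
            if r < border or r >= size - border or c < border or c >= size - border:
                border_count += 1
    area = size * size
    center_area = (size - 2 * margin) ** 2
    border_area = area - (size - 2 * border) ** 2
    return total / area, center_count / center_area, border_count / border_area

SIZE = 16
# A horizontal line through the middle of the image.
line = [[1 if r == 8 else 0 for c in range(SIZE)] for r in range(SIZE)]
# A square ring close to the edges, standing in for a hand-drawn circle.
ring = [
    [1 if (r in (1, 14) or c in (1, 14)) and 1 <= r <= 14 and 1 <= c <= 14 else 0
     for c in range(SIZE)]
    for r in range(SIZE)
]
line_features = extract_features(line)
ring_features = extract_features(ring)
print("line:", line_features)
print("ring:", ring_features)
```

With these synthetic shapes, the line scores higher on feature 2 and the ring scores higher on feature 3, which matches the reasoning above.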

Training data

After we have decided on features, the next step is to extract these features from the training images. For this operation, we will create a helper Python class ImageFeatureExtractor. This class contains methods capable of accessing the image pixels and computing features 1-3 based on pixel colors. To automate the feature extraction process, we will add another class ProcessImageFolder. This class expects a folder containing training images as its input. Once such a folder is provided, the class starts extracting features from every image in the folder. The result is a CSV file of this format:

file_name,feature1,feature2,feature3
1.png,0.19,0,0.04
2.png,0.08,0.44,0

In order for our networks to be able to learn, we need to assign correct outputs (targets) to every training example. Since we are going to build and train two MLPs, we need to create two different CSV files with correct answers. The file names for these are correct-outputs-circle.txt and correct-outputs-line.txt. These files contain the correct answers in the following format (extracted from correct-outputs-circle.txt):

file_name,correct_answer
1.png,1
2.png,0
3.png,0

We are assigning 1 if the training image is really a circle, and 0 if it is not. The same logic is used for correct-outputs-line.txt.
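To show how the two files fit together, here is a minimal sketch that pairs each feature row with its target by file name. It uses only the standard csv module and the sample rows shown above, embedded as strings so that the example runs on its own. The function name load_training_set is hypothetical, and the attached code may load the data differently.

```python
import csv
import io

FEATURES_CSV = """file_name,feature1,feature2,feature3
1.png,0.19,0,0.04
2.png,0.08,0.44,0
"""

TARGETS_CSV = """file_name,correct_answer
1.png,1
2.png,0
3.png,0
"""

def load_training_set(features_text, targets_text):
    """Return a list of (file_name, [i1, i2, i3], target) tuples."""
    targets = {
        row["file_name"]: int(row["correct_answer"])
        for row in csv.DictReader(io.StringIO(targets_text))
    }
    samples = []
    for row in csv.DictReader(io.StringIO(features_text)):
        name = row["file_name"]
        if name not in targets:
            continue  # skip images that have no recorded answer
        inputs = [float(row["feature%d" % k]) for k in (1, 2, 3)]
        samples.append((name, inputs, targets[name]))
    return samples

training_set = load_training_set(FEATURES_CSV, TARGETS_CSV)
for sample in training_set:
    print(sample)
```

To read from disk instead, the io.StringIO objects would be replaced with opened files.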

Backpropagation and gradient descent

Training is performed through backpropagation. Backpropagation is an iterative procedure which uses the chain rule for derivatives to propagate the error back from the output layer all the way to the input layer. What exactly does "propagate" mean in this context? It means that we need to update all weights, between all node layers in the network, so that they better fit the training data. How can this be done? By the following procedure:

  1. Calculating the output error on every training sample
  2. Finding the error function partial derivative with respect to \( [W_{2,1,1},W_{2,2,1},W_{2,3,1}] \)
  3. Finding the error function partial derivative with respect to \( [W_{1,1,1}, W_{1,1,2}, W_{1,1,3}, W_{1,2,1}, W_{1,2,2}, W_{1,2,3}, W_{1,3,1}, W_{1,3,2}, W_{1,3,3}] \)

The process described above can also be seen as finding the gradients of the error function at every network layer. The partial derivative values for node weights \( W_{1,x,y} \) (weights between the input and hidden layer) are typically smaller than the partial derivatives for weights \( W_{2,x,y} \) (weights between the hidden and output layer) because their calculation includes additional multiplications with values smaller than 1.00, such as the derivative of the logistic function, which never exceeds 0.25.
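The gradients for both weight layers can be sketched in a few lines once a loss and an output function are fixed. The example below makes assumptions that the article does not confirm: a squared error loss E = 0.5 * (o - t)^2, a linear output neuron, no bias terms, and the weights[layer][from][to] layout. Under these assumptions the chain rule gives dE/dW2[j] = (o - t) * h[j] and dE/dW1[i][j] = (o - t) * W2[j] * h[j] * (1 - h[j]) * x[i]. All function names are hypothetical.

```python
import math

def logistic(x):
    """Numerically safe logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def forward(x, w1, w2):
    """Hidden activations and (assumed linear) output of the 3-3-1 network."""
    hidden = [logistic(sum(x[i] * w1[i][j] for i in range(3))) for j in range(3)]
    output = sum(hidden[j] * w2[j][0] for j in range(3))
    return hidden, output

def loss(x, target, w1, w2):
    """Assumed loss: half the squared error of the output."""
    _, output = forward(x, w1, w2)
    return 0.5 * (output - target) ** 2

def gradients(x, target, w1, w2):
    """Partial derivatives of the loss with respect to every weight."""
    hidden, output = forward(x, w1, w2)
    delta = output - target  # dE/do
    # Output-layer weights: one chain-rule step.
    grad_w2 = [[delta * hidden[j]] for j in range(3)]
    # Hidden-layer weights: two further factors, w2 and the logistic
    # derivative h * (1 - h), which is at most 0.25.
    grad_w1 = [
        [delta * w2[j][0] * hidden[j] * (1.0 - hidden[j]) * x[i] for j in range(3)]
        for i in range(3)
    ]
    return grad_w1, grad_w2

x = [0.19, 0.0, 0.04]
w1 = [[0.1, -0.2, 0.3], [0.4, 0.5, -0.6], [-0.7, 0.8, 0.9]]
w2 = [[0.5], [-0.4], [0.3]]
grad_w1, grad_w2 = gradients(x, 1.0, w1, w2)
print(grad_w1)
print(grad_w2)
```

A useful sanity check for hand-derived formulas like these is to compare them against finite differences of the loss, that is, to nudge one weight by a tiny amount and measure how much the loss changes.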

The best way to get a better grasp of the gradient calculation would be to derive the formulas for every gradient on paper. In this first tutorial, we are doing just that, meaning that the gradient formulas in our code were derived using only basic tools of symbolic differentiation - chain rule for derivatives and basic derivative rules. In future tutorials, gradients will likely be computed by methods of automatic differentiation, since that approach is more elegant. Automatic differentiation packages such as Autograd abstract away many of the troubles that can come up when calculating closed-form expressions for gradients - these packages, when implemented correctly, enable us to compute gradient values in just a few lines of code. However, we believe that it's important to know how these expressions can be calculated manually, using just the chain rule and some basic derivative rules.

After all the gradients are calculated, we can update the weights. However, there are multiple options here:

  1. Stochastic Gradient Descent (SGD) – update all weights after every training sample.
  2. Batch Gradient Descent (BGD) – update all weights once per pass over the training set, using the gradient of each weight averaged over all samples and stepping in the opposite direction of that average.
  3. Mini-batch Gradient Descent – a middle way between SGD and BGD – updates weights after processing a small number of examples from the training set.
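The three options differ only in how many samples contribute to each update. The sketch below illustrates this on a deliberately tiny model with a single weight (prediction = w * x, squared error loss), which is an assumption made to keep the example short and is not the MLP from this article. A batch size of 1 corresponds to SGD, a batch size equal to the size of the training set corresponds to BGD, and anything in between is mini-batch gradient descent.

```python
def gradient(w, sample):
    """Gradient of 0.5 * (w * x - t)^2 with respect to w for one sample."""
    x, t = sample
    return (w * x - t) * x

def run_epoch(w, samples, learning_rate, batch_size):
    """One pass over the data, updating w once per batch."""
    for start in range(0, len(samples), batch_size):
        batch = samples[start:start + batch_size]
        mean_grad = sum(gradient(w, s) for s in batch) / len(batch)
        w -= learning_rate * mean_grad  # step against the averaged gradient
    return w

# Toy data generated by t = 2 * x, so the ideal weight is 2.
samples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]

results = {}
for name, batch_size in (("SGD", 1), ("mini-batch", 2), ("BGD", len(samples))):
    w = 0.0
    for _ in range(50):
        w = run_epoch(w, samples, 0.05, batch_size)
    results[name] = w
    print(name, round(w, 4))
```

On this toy problem all three schedules reach the same weight. They differ in how many updates they perform per pass and in how noisy each update is.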

We are using SGD for training our MLP networks. Since we need to find gradients for all weights, we will be storing these to an object with the same structure as MLP.weights. The name of this new object will be MLP.weights_gradients. The gradient values stored there will be used for making updates to all weights in MLP.weights. Here is the code which makes updates to the weights:

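        # Step against the gradient: a positive gradient lowers the weight,
        # a negative gradient raises it, and a zero gradient leaves it unchanged.
        # Both branches reduce to: weights[layer][i][j] -= learning_rate * gradient_value
        if (gradient_value > 0):
            self.weights[layer][i][j] += -learning_rate * abs(gradient_value)
        elif (gradient_value < 0):
            self.weights[layer][i][j] += learning_rate * abs(gradient_value)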

As you can see, we are changing the weight values in the direction of the negative gradient (slope). This way, every weight update should bring us a step closer to a minimum of the error function. In practice, however, depending on the value of learning_rate and the overall shape of the error function, we might not be able to find the global minimum.
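The influence of learning_rate is easy to see on a one-dimensional example. The sketch below applies the same branching update to a single scalar weight and minimizes the toy error function E(w) = (w - 3)^2, whose gradient is 2 * (w - 3). The error function, the starting point and the function names are assumptions chosen for illustration only.

```python
def update_weight(weight, gradient_value, learning_rate):
    """The update rule from the snippet above, applied to one scalar weight."""
    if gradient_value > 0:
        weight += -learning_rate * abs(gradient_value)
    elif gradient_value < 0:
        weight += learning_rate * abs(gradient_value)
    return weight

def descend(learning_rate, steps=25, start=0.0):
    """Run gradient descent on E(w) = (w - 3)^2 and return the final weight."""
    w = start
    for _ in range(steps):
        w = update_weight(w, 2.0 * (w - 3.0), learning_rate)
    return w

# A small step size approaches the minimum at w = 3.
print("learning_rate=0.1:", descend(0.1))
# A step size that is too large overshoots further on every update.
print("learning_rate=1.1:", descend(1.1))
```

With a learning rate of 0.1 the distance to the minimum shrinks by a factor of 0.8 per step, while with 1.1 it grows by a factor of 1.2 per step, so the weight moves away from the minimum instead of toward it.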

Testing

When both networks are trained, the program should be able to distinguish between circles and lines for any custom image we provide in the test-images folder.

For each test sample, we find the output value of both MLP(circle) and MLP(line). The larger of these values determines the final program verdict on the provided test image.
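The decision rule itself is very small. Here is a minimal sketch, assuming the two network outputs have already been computed. The function name classify and the handling of ties (a tie counts as a line) are our own assumptions.

```python
def classify(circle_output, line_output):
    """Pick the label of the network with the larger output value."""
    # Assumption: if both outputs are equal, the image is reported as a line.
    return "CIRCLE" if circle_output > line_output else "LINE"

print("This image is a " + classify(0.91, 0.12))  # prints "This image is a CIRCLE"
```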

How to use the main program

First, download the attached Python project. Run the script extract-image-features.py to generate the features CSV file for the training process. After that, run the script mlp.py. It will train the two networks and then ask for a test image name. Enter one of the following file names:

  • tc-1.png (circle)
  • tc-2.png (circle)
  • tc-3.png (circle)
  • tc-4.png (circle)
  • tc-5.png (circle)
  • tl-1.png (line)
  • tl-2.png (line)
  • tl-3.png (line)
  • tl-4.png (line)
  • tl-5.png (line)

You can, of course, add your own images to the test-images/ folder and test on them. The program prints its verdict to the console as one of the following strings:

This image is a CIRCLE

or

This image is a LINE

Conclusion

So, now you have a fully working Python program capable of distinguishing between handwritten lines and circles, no matter how you draw them! We hope you find it cool, but also useful for learning purposes. We have lots of plans for more advanced neural networks, so make sure you check back here from time to time!