How Can Mathematics and Computational Theory Combine To Create Artificial Neural Networks?
14/03/2023
Contents
1 Introduction
6 Limitations
  6.0.1 Computational Resources
  6.0.2 Overfitting
1 Introduction
It was Alexander Bain (1873), in his book Mind and Body, who first theorized that the brain is made up of many individual neurones and related their activity to the behaviour of the body, which marked the starting point for neural networks. However, it wasn't until 1943 that McCulloch and Pitts proposed a computational model for an artificial neural network, combining both mathematics and computational theory. In addition, they modeled the world's first neural network using electrical circuits, which marked the beginning of artificial neural networks. A large proportion of the idea of the neural network can be traced to a human attribute: inspiration. We can relate to this by studying the evolution of flight. Some of the first (theoretical) flying vehicles can be traced back to Leonardo da Vinci's ornithopter, which employed a central body with wings on the side to provide lift. This and later machines following the same principle proved unsuccessful, but the key point is where the idea came from: birds. Effectively, their designers had taken the bird, figured out that it used wings to fly and therefore implemented wings into their own designs. This idea of focusing on the key parts of something (such as the bird) to deduce how it works can be closely linked to how neural networks came about.
Similar to how the idea of aircraft originated from and was first designed around birds, when humans started to ponder whether they could make conscious artificial systems, they turned to the closest thing that could relate to it: the brain. David Marr's three levels of analysis for information-processing tasks/systems describe well how we can study the brain (and more, such as birds and flight) to recreate a similar system in an artificial environment. The first level, computational theory, describes the abstraction of a problem/task to obtain a definition, which, for neural networks, involves how the brain functions and its overall design, such as the individual neurones which make up the brain, but does not involve the chemical and biological processes that occur. The second level - the algorithmic level - defines the representation and algorithms used. For example, the Roman numeral VI, the number 6 and the binary representation 0000 0110 (8 bits) are all different representations of the same fundamental idea; however, because of these differing representations, different algorithms are needed to add and subtract in each. Therefore, we have now seen that an abstract definition of the problem cannot automatically be applied to different scenarios - it would be infeasible to put flapping wings on aircraft in the case of flight - since there are different types of representation and therefore different ways (algorithms) to solve the problem. This leads on to the third level, the physical implementation: the realization of the algorithmic level in an actual working prototype.
The development of new types of neural networks, algorithms and machine learning has rapidly accelerated over the past years. This dissertation aims to understand the links between these designs and both the theory and mathematics behind them, and how we can use these to create our very own neural networks. It will also study the architectural style of Generative Pre-trained Transformer (a term explained later) language models, such as ChatGPT. However, as with almost everything, there is a second side to the coin, which in this case relates to the limitations and problems of neural networks, which this dissertation will also explore.
2 The computational theory behind artificial neural networks
2.1 The perceptron
2.1.1 Linear classifiers (hard thresholds):
Linear classification models are ones that categorize new data points into one of two discrete classes (often called binary classification), based on learned decision boundaries - lines separating the data points so that the two different classes lie on opposite sides - so that when new data points are obtained, they can easily be assigned to one class or the other. However, this is not always possible. For example, the outputs of the logical operator XOR are linearly inseparable, since no linear decision boundary can be drawn that correctly classifies the data points. Therefore, we have to introduce non-linear threshold activation functions, such as the logistic (sigmoid) function:
\[ f(x) = \frac{1}{1 + e^{-x}} \]
which creates a soft (smooth) and differentiable boundary: an S-shaped (sigmoid) curve which rises smoothly from 0 towards 1 as x increases.
The logistic function maps the input to a value between 0 and 1, which can be interpreted as the probability that the input falls in a particular class, and which we can then use to classify the input.
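As an illustration (a minimal sketch in plain Python, with arbitrary example inputs), the logistic function and the resulting classification rule might look like this:

```python
import math

def logistic(x: float) -> float:
    """Logistic (sigmoid) activation: squashes any real input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# Interpreting the output as the probability of belonging to class 1.
for x in (-4.0, 0.0, 4.0):
    p = logistic(x)
    predicted_class = 1 if p >= 0.5 else 0
    print(f"input={x:+.1f}  p(class 1)={p:.3f}  predicted class={predicted_class}")
```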
The perceptron is at the core of every neural network. Without it, neural networks would not function as they do today, so understanding how it works is crucial to understanding the rest of this dissertation. A good way to describe the perceptron is with the following analogy: what the biological neurone is to the brain, the perceptron is to the neural network, which we can derive from its name - perception, the ability to perceive and receive input - and neurone, the processing unit of the brain. A perceptron receives several inputs through a range of directed links from the perceptrons preceding it. These links transfer the result of the activation function, a_i, of the previous node. The perceptron then calculates a weighted sum of its inputs plus the bias (a number given to the perceptron which stays constant regardless of the input), to which it then applies an activation function (as mentioned above) to produce the output of the perceptron.
\[ in_j = \sum_{i=0}^{n} w_{i,j}\, a_i \]
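To make this concrete, here is a minimal sketch (plain Python, with weights, inputs and bias invented purely for this example) of a single sigmoid perceptron computing its weighted input and activation:

```python
import math

def sigmoid_perceptron(inputs, weights, bias):
    """Weighted sum of inputs plus bias, passed through the logistic activation."""
    in_j = sum(w * a for w, a in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-in_j))

# Hypothetical example values, purely for illustration.
inputs  = [0.5, 0.9, 0.1]   # activations a_i from the previous layer
weights = [0.4, -0.6, 1.2]  # weights w_{i,j} on the incoming links
bias    = 0.05

print(sigmoid_perceptron(inputs, weights, bias))  # a value between 0 and 1
```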
The activation function can either be a hard threshold, or a logistic function, in which case the perceptron is called a sigmoid perceptron. For linear classifiers, the perceptron will output a binary 1 or 0 (representing which class the input belongs to): it is either on one side of the line or it is not. The sigmoid perceptron, on the other hand, outputs a value between 0 and 1, representing the probability that the input belongs to the class represented by 1; the closer the output is to 0 or 1, the more confident we can be that the input belongs to that class. For example, if the output was 0.98, then the input most likely belongs to the class represented by 1. If, however, the output falls at 0.5, the input is equally likely to belong to either class. Another popular and commonly used activation function is the Heaviside step function, which is a hard threshold; however, this is only ever used in isolated, unconnected perceptrons, since it is non-differentiable and therefore incompatible with various important neural network features, because neural networks use the gradient of the curve to update the perceptron's weights (a concept mentioned later in Training and Learning algorithms).
\[ h_{\mathbf{w}}(\mathbf{x}) = \begin{cases} 0 & \text{if } \mathbf{w} \cdot \mathbf{x} < 0 \\ 1 & \text{if } \mathbf{w} \cdot \mathbf{x} \geq 0 \end{cases} \]
Figure (2): the Heaviside step function, where w represents the weights vector (essentially, a list of the weights) and x a vector of the corresponding inputs to those weights. An example representation of w and x may be [1, 1.05, 3] and [20.5, 15, 18] respectively.
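A tiny sketch (plain Python, reusing the example vectors from the caption above) of how such a hard-threshold perceptron classifies an input:

```python
def heaviside_classify(weights, inputs):
    """Hard-threshold (Heaviside) perceptron: 1 if w . x >= 0, else 0."""
    dot = sum(w * x for w, x in zip(weights, inputs))
    return 1 if dot >= 0 else 0

w = [1, 1.05, 3]       # weights vector from the figure caption
x = [20.5, 15, 18]     # corresponding inputs
print(heaviside_classify(w, x))  # prints 1, since w . x = 90.25 >= 0
```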
Multi-layer networks (networks with one or more hidden layers), while often more complex, are more commonly used as they produce more accurate results, at the disadvantage that they take longer to train.
Deep neural networks, which give rise to the term deep learning, contain many (two or more) hierarchically ordered hidden layers. The first hidden layers begin with basic, straightforward classifications. As the number of hidden layers increases, the complexity and level of abstraction of the classifications increase, building upon and combining the outputs (classifications) of the neurons before them. For example, in the case of classifying digits, the input data would be just plain pixels; however, as the data propagates through the network, the network starts to recognise lines, patterns and so on, building up an image until it reaches its conclusion.
There are also other types of recurrent neural networks, such as the LSTM (long short-term memory) network, which was created in response to the vanishing gradient problem (where the gradient of the loss function (explained below) becomes very small, preventing the neural network from training any further). It also sought to solve the problem that earlier data (which may be essential for predicting what follows) eventually stops cycling through the network and becomes inaccessible, which can leave a plain RNN unable to accurately predict certain patterns. For example, in the case of two separate sentences where one leads on to the other, such as "Joe broke his leg. He cannot move his leg.", the model can take in the information from the first sentence and use it to process the second. However, if the first sentence was several lines back, it may have dropped out of memory, making it harder to predict the second sentence accurately. To counter this, "forget" gates (together with input and output gates) are added to RNNs to make them LSTMs, which allows the LSTM to retain information and use it many steps later.
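As a rough sketch (plain Python with scalar states rather than vectors, and hypothetical parameter values chosen only for illustration), the gating idea can be shown as follows: the forget gate decides how much of the previous cell state to keep, while the input and output gates control what is written to and read from that memory.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step with scalar states; p holds (input weight, recurrent weight, bias) per gate."""
    f = sigmoid(p["f"][0] * x + p["f"][1] * h_prev + p["f"][2])    # forget gate: keep old memory?
    i = sigmoid(p["i"][0] * x + p["i"][1] * h_prev + p["i"][2])    # input gate: accept new info?
    o = sigmoid(p["o"][0] * x + p["o"][1] * h_prev + p["o"][2])    # output gate: expose memory?
    g = math.tanh(p["g"][0] * x + p["g"][1] * h_prev + p["g"][2])  # candidate new memory
    c = f * c_prev + i * g        # cell state: old memory scaled by f, plus gated new info
    h = o * math.tanh(c)          # hidden state passed on to the next step
    return h, c

# Hypothetical parameters, purely for illustration.
params = {k: (0.5, 0.5, 0.0) for k in ("f", "i", "o", "g")}
h, c = 0.0, 0.0
for x in [1.0, 0.0, 0.0, 0.0]:    # information seen early can persist in the cell state c
    h, c = lstm_step(x, h, c, params)
print(h, c)
```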
The convolutional layer is where the majority of the computation within the network happens. It uses a kernel (a small matrix of weights, also called a filter) that is slid across the image to identify features, a process known as convolution. CNNs are designed so that they can support multiple convolutional layers, which allows for a hierarchical design where the lower layers process smaller features in an image, such as an eye, whereas the later layers may use this data to detect faces and so on.
The pooling layer uses dimensionality reduction to reduce the number of inputs, by combining several output neurons in the previous layer into one neuron in the next layer. This helps to reduce complexity, prevent overfitting (explained in Limitations) and use fewer computational resources.
Finally, we have the fully connected layer, in which every node is connected directly to every neuron in the previous layer. It then analyzes the features identified in the previous layers and uses them to perform the classification, deciding on the final output.
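As a rough sketch (NumPy, with a toy image and a hypothetical hand-picked kernel rather than learned ones), a convolutional layer followed by a pooling layer might look like this:

```python
import numpy as np

rng = np.random.default_rng(0)

def convolve2d(image, kernel):
    """'Valid' 2D convolution (strictly cross-correlation, as most CNN libraries implement it)."""
    kh, kw = kernel.shape
    out_h, out_w = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(feature_map, size=2):
    """Non-overlapping max pooling: keeps only the strongest response in each block."""
    h, w = feature_map.shape
    h, w = h - h % size, w - w % size
    blocks = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

image = rng.random((8, 8))                       # a toy 8x8 "image"
kernel = np.array([[1.0, -1.0], [1.0, -1.0]])    # a hypothetical vertical-edge detector
features = convolve2d(image, kernel)             # convolutional layer: 7x7 feature map
print(max_pool(features).shape)                  # pooling layer reduces it to (3, 3)
```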
\[ \text{precision} = \frac{TP}{TP + FP} \]
Since we are working with discrete values, where the algorithm is either right or wrong, it is a lot easier to calculate both the accuracy and precision. For accuracy, we are calculating the total number of correct predictions divided by the total number of predictions, giving us the probability of success of the model. For example, if it predicted 72 values correctly out of 100, the accuracy would be 0.72, meaning that 72% of all predictions were right. For precision, we are calculating the ratio of correctly identified predictions of a class to the total number of predictions of that class, whether right or wrong.
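A small worked sketch (plain Python, with hypothetical prediction counts chosen so that 72 of 100 predictions are correct) of these two metrics:

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)

def precision(tp, fp):
    """Of everything predicted as the positive class, how much really was positive."""
    return tp / (tp + fp)

# Hypothetical counts, purely for illustration.
tp, tn, fp, fn = 40, 32, 18, 10
print(f"accuracy  = {accuracy(tp, tn, fp, fn):.2f}")   # 0.72, i.e. 72 correct out of 100
print(f"precision = {precision(tp, fp):.2f}")          # ~0.69
```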
However, while these two measurements alone may appear to indicate a great success rate, they can sometimes be misleading. Let us take the example of a spam detection model, which aims to determine whether a received email is spam or not, and treat an authentic email as the 'positive' class. There are two possible types of false prediction. The first, a false positive, is when the model incorrectly classifies a spam email as authentic. Although this is not ideal, we are primarily focused on false negatives, which, if made, can have much more profound consequences. If an important, authentic email were classified as spam, this could lead to potential business or family complications, so we want to prevent this from happening as much as possible. Here we have shown that precision doesn't directly account for false negatives, meaning we have to introduce other metrics.
Although the above example focused on precision, there are limitations with accuracy as well. One of these is called the 'accuracy paradox', which we will again apply to the spam detection model. Let us say that 20 spam and 60 authentic emails are classified correctly, but 20 authentic emails are incorrectly classified as spam: we end up with an accuracy of 80%. We can then compare this to a second example where 80 authentic emails are classified correctly, but 20 spam emails are incorrectly classified as authentic. This again results in 80% accuracy, yet the models are vastly different. The second model does not correctly classify a single spam email, and we could essentially replace it with an algorithm which classifies every email as authentic without even seeing it. We have now seen how precision and accuracy can be wholly misleading and, as mentioned above, must introduce new metrics, such as recall and the F1 score, to better deal with these cases.
\[ \text{recall} = \frac{TP}{TP + FN} \]
The F1 score, meanwhile, is often regarded as a combined measure of the model's accuracy. It is calculated from both the precision and recall of the results, giving equal weight to each of them (by taking their harmonic mean, as shown below):
\[ F_1 = \frac{2}{\frac{1}{\text{recall}} + \frac{1}{\text{precision}}} = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \]
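Continuing the same sketch (plain Python, the same hypothetical counts as before), recall and the F1 score follow directly:

```python
def recall(tp, fn):
    """Of everything that really was positive, how much the model actually found."""
    return tp / (tp + fn)

def f1_score(p, r):
    """Harmonic mean of precision and recall, giving equal weight to both."""
    return 2 * (p * r) / (p + r)

tp, fp, fn = 40, 18, 10          # hypothetical counts, as before
p = tp / (tp + fp)               # precision
r = recall(tp, fn)               # recall
print(f"recall = {r:.2f}, F1 = {f1_score(p, r):.2f}")
```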
\[ MSE = \frac{1}{N} \sum_{(x,y) \in E} \left( y - h(x) \right)^2 \]
Figure (3): MSE, where N represents the total number of labeled examples (each an input x with its correct output y). The equation loops through all the labeled examples in E (the set of all examples), calculates the difference between the actual value y and the predicted value h(x), and squares it. The total is then divided by N to obtain the average (mean) squared error.
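A minimal sketch (plain Python, with a made-up example set E and a hypothetical model h) of this loss:

```python
def mse(examples, h):
    """Mean squared error of hypothesis h over a set of (x, y) labeled examples."""
    return sum((y - h(x)) ** 2 for x, y in examples) / len(examples)

def h(x):
    """A hypothetical model, purely for illustration."""
    return 2 * x + 1

E = [(0, 1.2), (1, 2.9), (2, 5.1)]   # made-up labeled examples (x, y)
print(mse(E, h))                      # 0.02: h fits these points closely
```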
We can plot the loss function as a (usually convex, bowl-shaped) curve for the different weight values we input into the MSE, where the y axis represents the loss and the x axis the weight value; in our case this is a quadratic convex function. Therefore we know our curve has only one global minimum, which is also its only local minimum (this would differ if we had non-quadratic, more erratic functions). We refer to the global minimum as the lowest y value that the entire function will output over all values of x, while a local minimum is the lowest y value within a certain region of x values. Now that we know our function has a global minimum, we can start to take steps towards it by following the polarity of the
derivative of the curve: if it is positive, shift to the left, and if it is negative, shift to the right. These shifts are scaled by the learning rate, η, which is a hyperparameter (a value set by the user before training, in contrast to a parameter, such as a weight or bias, which is learned during training). If we had a large η, we would take large steps towards the global minimum and reach its vicinity quite quickly; however, because of this, we risk overshooting the minimum and constantly rebounding between two values either side of it, therefore never reaching an optimal weight value. If, on the other hand, we set η to a very low value, the end result may be optimal, but the journey there would take very long due to the very small increments. Therefore, we have to try to choose a balanced learning rate which won't overshoot the minimum but also won't take too long.
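A minimal sketch (plain Python, using a hypothetical one-weight quadratic loss) of gradient descent and the effect of different learning rates:

```python
def gradient_descent(d_loss, w0, eta, steps=50):
    """Repeatedly step the weight against the sign (and size) of the derivative."""
    w = w0
    for _ in range(steps):
        w -= eta * d_loss(w)   # positive gradient -> move left, negative -> move right
    return w

# Hypothetical convex loss L(w) = (w - 3)^2, whose global minimum is at w = 3.
d_loss = lambda w: 2 * (w - 3)

print(gradient_descent(d_loss, w0=0.0, eta=0.1))    # converges close to 3
print(gradient_descent(d_loss, w0=0.0, eta=1.1))    # too large: overshoots and diverges
print(gradient_descent(d_loss, w0=0.0, eta=0.001))  # too small: barely moves in 50 steps
```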
3.3 Backpropagation
Up until the 1960s, when backpropagation was first theorised, methods of training neural networks - even smaller ones - were slow and computationally expensive. The introduction of backpropagation changed this, enabling the training of neural networks with hundreds, even thousands, of perceptrons. But how does it work? While forward propagation is the movement of data from the inputs to the outputs, backpropagation retraces these steps, propagating backwards through each layer and calculating the gradient of the error/loss function with respect to its weights. It then utilizes methods such as stochastic gradient descent to decrease the loss function and obtain optimal weights.
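To make this concrete, here is a minimal sketch (NumPy, a hypothetical 2-4-1 sigmoid network trained on XOR rather than the digit network built later) of forward propagation followed by backpropagation and a gradient descent update:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# A hypothetical 2-4-1 sigmoid network trained on XOR, small enough to follow by hand.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)

W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)    # input -> hidden
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)    # hidden -> output
eta = 0.5

for _ in range(10000):
    # Forward propagation: data moves from the inputs towards the output.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)

    # Backpropagation: the gradient of the squared-error loss is pushed back layer by layer.
    d_out = (out - Y) * out * (1 - out)          # error signal at the output layer
    d_h = (d_out @ W2.T) * h * (1 - h)           # error signal at the hidden layer

    # Gradient descent update for every weight and bias.
    W2 -= eta * h.T @ d_out;   b2 -= eta * d_out.sum(axis=0)
    W1 -= eta * X.T @ d_h;     b1 -= eta * d_h.sum(axis=0)

print(out.round(2).ravel())   # typically converges towards [0, 1, 1, 0]
```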
seeing as this is more specific to a programming language rather than the gist behind neural networks). In the final subsection, 'Testing', this dissertation will analyze and explore the different algorithms mentioned throughout this paper.
4.1 Creation
To correctly classify a digit, e.g. "that number is a 7", is quite an easy task for humans; however, once you attempt to make your own classifier, it is quite hard to even know where to start. How can we get a computer to correctly recognize a number? One way of doing this is by looking at a digit in an abstract way, e.g. "what patterns are there?", "does it contain any circles?" (ideal for recognizing numbers such as nine, eight, zero or six). However, we won't explicitly tell the network to do this, as ML is defined as the ability of a program to learn to do a specific task without being explicitly told how to. Therefore, in this example, we are going to include 784 input nodes, one for each pixel in the image, 10 hidden nodes (the nodes which will process the data and look for the patterns), and 10 output nodes, each of which outputs the probability of its corresponding digit being the one in the image. We can then use these probabilities to converge on an answer (the one with the highest probability).
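A minimal sketch (NumPy, with randomly initialised weights standing in for trained ones) of this 784-10-10 architecture's forward pass and how the outputs are turned into an answer:

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# 784 input pixels -> 10 hidden nodes -> 10 output nodes, as described above.
W1, b1 = rng.normal(scale=0.1, size=(784, 10)), np.zeros(10)
W2, b2 = rng.normal(scale=0.1, size=(10, 10)), np.zeros(10)

def forward(pixels):
    """Forward pass: returns one value per digit 0-9, interpreted as that digit's probability."""
    hidden = sigmoid(pixels @ W1 + b1)
    return sigmoid(hidden @ W2 + b2)

image = rng.random(784)                             # a stand-in for a flattened 28x28 digit image
outputs = forward(image)
print("predicted digit:", int(np.argmax(outputs)))  # converge on the most probable digit
```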
4.2 Training
Before we begin training a neural network, we have to split the data we have into training and test data sets. The training data set will be used to train the neural network, while the testing set (independent of the training set) will be used to test our model on. It is important that we have these two sets, because if we trained our model and tested it on the same data it was trained on, it would easily recognize and classify it correctly, whereas the whole point of the network is to classify new data points correctly. Additionally, we don't want a small test data set, since noise in the chosen examples may make the model look inaccurate when it is in fact accurate. Let us take the example of a model which makes an error 10 times in 10,000 - an astounding accuracy. If we tested this model using a data set of 5 examples and 1 was incorrectly classified, the apparent accuracy would be only 80%, when in reality it is 99.9% accurate. By convention, the typical split of testing to training data is 20:80 (testing:training). The training of the network used mini-batch SGD, where for each epoch (iteration) the program randomly chose n samples, evaluated them, calculated a loss function and modified the weights of the network to reduce the loss. Do note, though, that due to noise in data sets the weights can sometimes be modified in the wrong direction, resulting in reduced accuracy.
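A rough sketch (NumPy, with a made-up data set and a simple linear model standing in for the digit network) of the 80:20 split and the mini-batch SGD loop described above:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data set: 1000 labeled examples with 5 features each.
X = rng.random((1000, 5))
y = X @ np.array([1.0, -2.0, 0.5, 3.0, 0.0]) + rng.normal(scale=0.1, size=1000)

# 80:20 training/testing split, after shuffling the examples.
idx = rng.permutation(len(X))
split = int(0.8 * len(X))
X_train, y_train = X[idx[:split]], y[idx[:split]]
X_test, y_test = X[idx[split:]], y[idx[split:]]

# Mini-batch SGD on a simple linear model (a stand-in for the digit network).
w, eta, batch_size = np.zeros(5), 0.1, 32
for epoch in range(20):
    order = rng.permutation(len(X_train))              # randomly choose the samples each epoch
    for start in range(0, len(order), batch_size):
        batch = order[start:start + batch_size]
        error = X_train[batch] @ w - y_train[batch]
        grad = X_train[batch].T @ error / len(batch)   # gradient of the MSE on this mini-batch
        w -= eta * grad                                # modify the weights to reduce the loss

test_mse = np.mean((X_test @ w - y_test) ** 2)         # evaluate on the unseen test set
print(f"test MSE: {test_mse:.4f}")
```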
6.0.1 Computational Resources
One of the first and major limitations is the intensive computational resources it takes to train neural networks on data sets, especially large ones. For example, as explored above, ChatGPT took over 34 days to finish training. Therefore, unless you have access to huge amounts of computational power, it is hard to create powerful models such as GPT, and even if you did have access, the amount of electricity used would be hugely costly.
Additionally, training a good neural network requires huge amounts of high-quality data. It is therefore hard to train models in areas where there is not much data, or where data may be expensive to obtain.
6.0.2 Overfitting
Secondly, another big (but solvable) problem is overfitting. This is when the model has been fit so closely to the training data that it effectively memorizes it instead of generalizing from it, so when it tries to predict new data, it will be inaccurate. One way of overcoming this is to add more examples to the training data (data augmentation). We can also try to reduce the complexity of the network, since if it has excessively many hidden layers and nodes, it will most likely overfit. Conversely, models can also underfit as a result of a lack of training.
Another limitation is that no one knows directly what a neural network is doing under the hood. Networks are both complex and very difficult to interpret, making it hard to find errors or to work out how the neural network is actually arriving at its answers.