
How can mathematics and computational theory combine to create artificial neural networks?

Elliot Boulanger

14/03/2023
Contents

1 Introduction
2 The computational theory behind artificial neural networks
  2.1 The perceptron
    2.1.1 Linear classifiers (hard thresholds)
    2.1.2 Logistic Regression (soft threshold)
    2.1.3 The McCulloch-Pitts neuron (perceptron)
  2.2 Types of network architectures
    2.2.1 Deep and Shallow Neural Networks
    2.2.2 Feed-forward neural networks
    2.2.3 Recurrent neural networks
    2.2.4 Convolutional neural networks
3 Training and Learning algorithms
  3.1 Performance metrics
    3.1.1 Accuracy & Precision
    3.1.2 Recall & F1-Score
  3.2 Gradient Descent
    3.2.1 Stochastic Gradient Descent
    3.2.2 Mini-batch Stochastic Gradient Descent
  3.3 Backpropagation
4 Real-world application analysis
  4.1 Creation
  4.2 Training
  4.3 Testing (Work in progress)
5 The use of neural networks in a new booming technological era (Work in progress)
6 Limitations
  6.0.1 Computational Resources
  6.0.2 Overfitting
7 Conclusion & insight (Work in progress)
1 Introduction
It was Alexander Bain (1873), in his book Mind and Body, who first theorized that the brain is made up of many individual neurones and related their activity to that of the body, marking the starting point for neural networks. However, it wasn’t until 1943 that McCulloch and Pitts proposed a computational model for an artificial neural network, combining both mathematics and computational theory. They also modelled the world’s first neural network using electrical circuits, which marked the beginning of artificial neural networks. A large part of the idea of the neural network can be traced to a human attribute: inspiration. We can see a parallel in the evolution of flight. Some of the first (theoretical) flying vehicles can be traced back to Leonardo da Vinci’s ornithopter, which employed a central hull (coque) and wings on the side to provide lift. This and later machines following the same principle proved unsuccessful, but the key point is where the idea came from: birds. Effectively, designers had observed that a bird uses wings to fly and therefore implemented wings in their own designs. This approach of focusing on the key parts of something (such as the bird) to deduce how it works is closely linked to how neural networks came about.

Similar to how the idea of air vehicles originated from and was first designed around birds, when humans started to ponder whether they could make conscious artificial systems, they turned to the closest thing they could relate it to: the brain. David Marr’s three levels of analysis for information-processing systems describe well how we can study the brain (and, more broadly, birds and flight) to recreate a similar system in an artificial environment. The first level, computational theory, describes the abstraction of a problem or task to obtain a definition; for neural networks, this covers how the brain functions and its overall design, such as the individual neurones that make it up, but not the chemical and biological processes occurring within it. The second level - the algorithmic level - defines the representations and algorithms used. For example, the Roman numeral VI, the number 6 and the binary representation 0000 0110 (8 bits) are all different representations of the same fundamental idea, yet each representation demands different algorithms for addition and subtraction. We have now seen that an abstract definition of the problem cannot automatically be applied to different scenarios - it would be infeasible to put flapping wings on aircraft, in the case of flight - since there are different types of representation and therefore different ways (algorithms) to solve the problem. This leads on to the third level, the physical implementation: the realization of the algorithmic level as an actual working prototype.

The development of new types of neural networks, algorithms and machine learning has accelerated rapidly over the past years. This dissertation aims to understand the links between these designs and the theory and mathematics behind them, and how we can use these to create our very own neural networks. It will also study the architectural style of Generative Pre-trained Transformer (GPT, a term explained later) language models, such as ChatGPT. However, as with almost everything, there is a second side to the coin, which in this case relates to the limitations and problems of neural networks, which this dissertation will also explore.

2 The computational theory behind artificial neural networks
2.1 The perceptron
2.1.1 Linear classifiers (hard thresholds):
Linear classification models are ones that assign new data points to one of two discrete classes (often called binary classification), based on learned decision boundaries - lines separating the data points so that the two classes fall on opposite sides - so that when new data points arrive, they can easily be assigned to one class or the other. However, this is not always possible: the outputs of the logical operator XOR, for example, are linearly inseparable, since no linear decision boundary can correctly classify the data points. Therefore, we have to introduce non-linear threshold activation functions.
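A small sketch can make this concrete. The classifier below uses a hand-chosen line (the weights are illustrative, not learned): one line classifies AND perfectly, while a brute-force search over a grid of candidate lines never finds one for XOR (the grid search is only suggestive; the impossibility can be proven algebraically).

```python
# A hard-threshold linear classifier: output 1 if w.x + b >= 0, else 0.
# Weights below are chosen by hand for illustration, not learned.

def classify(weights, bias, inputs):
    s = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if s >= 0 else 0

points = [(0, 0), (0, 1), (1, 0), (1, 1)]

# AND is linearly separable: one line classifies all four points.
and_labels = [0, 0, 0, 1]
assert [classify([1, 1], -1.5, p) for p in points] == and_labels

# XOR is not: a brute-force search over a grid of lines never succeeds.
xor_labels = [0, 1, 1, 0]
grid = [v / 2 for v in range(-6, 7)]   # candidate weights/biases in [-3, 3]
separable = any(
    [classify([w1, w2], b, p) for p in points] == xor_labels
    for w1 in grid for w2 in grid for b in grid
)
print(separable)  # False: no line in the grid separates XOR
```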

2.1.2 Logistic Regression (soft threshold)


Logistic regression uses the logistic function,

f(x) = 1 / (1 + e^(-x))

which creates a soft (smooth) and differentiable boundary: the familiar S-shaped sigmoid curve rising from 0 to 1.

Given an input, the logistic function outputs the probability that the input falls in each class, which we can then use to classify it.
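A minimal sketch of the logistic function and its probability reading (the example inputs are arbitrary):

```python
import math

# The logistic (sigmoid) function: a smooth, differentiable soft threshold.
def logistic(x):
    return 1 / (1 + math.exp(-x))

# Outputs lie strictly between 0 and 1 and can be read as class probabilities.
print(round(logistic(0), 2))    # 0.5  (on the boundary: both classes equally likely)
print(round(logistic(4), 2))    # 0.98 (strongly suggests class 1)
print(round(logistic(-4), 2))   # 0.02 (strongly suggests class 0)
```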

2.1.3 The McCulloch-Pitts neuron (perceptron)

Figure (1): The McCulloch-Pitts perceptron

The perceptron is at the core of every neural network; without it, neural networks would not function as they do today, so understanding how it works is crucial to understanding the rest of this dissertation. A useful analogy: what the biological neurone is to the brain, the perceptron is to the neural network, which we can derive from its name - perception, the ability to perceive and receive input, and neurone, the processing unit of the brain. A perceptron receives several inputs through directed links from the perceptrons preceding it. These links carry the result of the activation function, a_i, of each previous node. The perceptron then calculates a weighted sum of its inputs plus a bias (a number given to the perceptron that stays constant regardless of the input), to which it applies an activation function (as mentioned above) to calculate its output.

in_j = Σ_{i=0}^{n} w_{i,j} a_i

The activation function can be either a hard threshold or a logistic function, in which case the perceptron is called a sigmoid perceptron. For linear classifiers, the perceptron outputs a binary 1 or 0 (representing which class the input belongs to): the input is either on one side of the line or it is not. The sigmoid perceptron instead outputs a value between 0 and 1, representing the probability that the input belongs to the class labelled 1. For example, an output of 0.98 suggests that the input most likely belongs to the class represented by 1, while an output of 0.5 suggests that it is equally likely to belong to either class. Another popular activation function is the Heaviside step function, a hard threshold; however, it is only ever used in isolated, unconnected perceptrons, since it is non-differentiable and therefore incompatible with various important neural network features, which use the gradient of the curve to update the perceptron’s weights (a concept covered later, in Training and Learning algorithms).

h_w(x) = 0 if w · x < 0,  1 if w · x ≥ 0

Figure (2): The Heaviside step function, where w represents the weight vector (essentially, a list of the weights) and x a vector of the corresponding inputs. An example w and x might be [1, 1.05, 3] and [20.5, 15, 18].
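Putting the pieces together, a single perceptron can be sketched as a weighted sum followed by an activation. The weights, inputs and bias below are arbitrary example values, reusing the vectors from Figure (2):

```python
import math

# A single perceptron: weighted sum of inputs plus bias, then an activation.
# Weights, inputs and bias here are arbitrary illustrative values.

def weighted_sum(weights, inputs, bias):
    return sum(w * a for w, a in zip(weights, inputs)) + bias

def step(z):                      # hard threshold (Heaviside)
    return 1 if z >= 0 else 0

def sigmoid(z):                   # soft threshold (logistic)
    return 1 / (1 + math.exp(-z))

w = [1.0, 1.05, 3.0]              # example weight vector w
x = [20.5, 15.0, 18.0]            # example input vector x
z = weighted_sum(w, x, bias=-90.0)

print(step(z))                    # 1, since w.x + bias >= 0
print(round(sigmoid(z), 3))       # a probability-like output in (0, 1)
```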

2.2 Types of network architectures


Now that we have seen how a perceptron works, we can combine perceptrons to form an artificial neural network: a collection of perceptrons structured in layers - an input layer, hidden layer(s) and an output layer. There are two fundamental types of commonly used neural network, feed-forward and recurrent, each unique in its own way. Feed-forward networks have directed links (connections) between nodes, meaning inputs only travel in one direction, from any particular node to a node further along (from ’upstream’ to ’downstream’ nodes).

2.2.1 Deep and Shallow Neural Networks


If we take a step back from the specific properties and rules that different types of network follow, we can see that there are two general kinds of neural network: those with one hidden layer and those with multiple hidden layers. These are called Shallow and Deep Neural Networks. While Shallow Neural Networks can be quick to train and useful for simpler problems, Deep Neural Networks, though often more complex, are more commonly used as they produce more accurate results, at the cost of taking longer to train.

Deep Neural Networks, linked to the term Deep Learning, contain many (two or more) hierarchically ordered hidden layers. The first hidden layers make basic, straightforward classifications. As the number of hidden layers increases, so do the complexity and level of abstraction of the classifications, each layer building upon the outputs (classifications) of the neurons before it. For example, when classifying digits, the input data is just plain pixels; but as it propagates through the network, the network starts to recognise lines, patterns and so on, building up an image until it reaches its conclusion.

2.2.2 Feed-forward neural networks


Feed-forward neural networks only allow inputs and activations to propagate one way through the network, from input to output. Additionally, unlike other neural networks, feed-forward networks have no cycles (loops, where a perceptron sends its output back to an earlier layer). The most commonly used FFNNs are the Single-layer (SLP) and Multi-layer (MLP) Perceptrons, which, as their names suggest, have one and many layers respectively. Although SLPs are considered neural networks, their simplicity limits them to simple linear problems, while MLPs can tackle much more complex, non-linear problems.
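The one-way, layer-by-layer flow can be sketched as a forward pass through a tiny MLP. Layer sizes and weights below are illustrative, not learned values:

```python
import math

# Forward pass through a tiny multi-layer perceptron (MLP):
# activations flow one way, layer by layer, with no cycles.

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def layer(inputs, weights, biases):
    # weights[j] holds the incoming weights of output node j
    return [sigmoid(sum(w * a for w, a in zip(weights[j], inputs)) + biases[j])
            for j in range(len(biases))]

x = [0.5, -1.0]                                       # input layer (2 nodes)
h = layer(x, [[0.1, 0.4], [-0.3, 0.8]], [0.0, 0.1])   # hidden layer (2 nodes)
y = layer(h, [[1.2, -0.7]], [0.05])                   # output layer (1 node)

print(0.0 < y[0] < 1.0)  # True: sigmoid outputs stay in (0, 1)
```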

2.2.3 Recurrent neural networks


Recurrent neural networks are networks in which links are specifically placed between nodes to create a cycle (loop), allowing the output of a node to affect its own input. These cycles act as a kind of short-term memory, letting the network keep data in memory while it processes other perceptrons. Such networks are therefore specialized in processing sequential data, where earlier data is needed to predict or calculate the next wanted value. This is very useful in NLP (Natural Language Processing, i.e. understanding language syntax and word meanings) problems, such as text and speech recognition, where it is important to know the words that come before the current word in order to make sense of the sentence.

There are also other types of recurrent neural network, such as the LSTM (long short-term memory) network, created in response to the vanishing gradient problem (where the gradient of the loss function (explained below) becomes very small, preventing the neural network from training any further). It also sought to solve cases where earlier data - essential for predicting what follows - was no longer cycling through the network and had become inaccessible, so that a plain RNN could not accurately predict certain patterns. For example, given two linked sentences such as ”Joe broke his leg. He cannot move his leg.”, the model can take information from the first sentence and use it to process the second. But if the first sentence were several lines back, it might have dropped out of memory, making the second sentence harder to predict accurately. To counter this, ”forget gates” are added to RNNs to make them LSTMs, allowing the network to retain information and use it many steps later.
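The cycle that gives an RNN its short-term memory can be sketched as a single recurrent update, where the hidden state h carries information forward through the sequence. The scalar weights below are illustrative, not trained values:

```python
import math

# A minimal recurrent step: the hidden state h acts as short-term memory,
# carrying information from earlier sequence items to later ones.

def rnn_step(x, h, w_in=0.5, w_rec=0.9, b=0.0):
    return math.tanh(w_in * x + w_rec * h + b)

sequence = [1.0, 0.0, 0.0, 0.0]   # a "signal" followed by silence
h = 0.0
states = []
for x in sequence:
    h = rnn_step(x, h)
    states.append(h)

# The influence of the first input persists but decays at each step -
# with smaller recurrent weights it would vanish, which motivates LSTMs.
print([round(s, 3) for s in states])
```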

2.2.4 Convolutional neural networks


A convolutional neural network (CNN) is a network commonly used to analyze and recognize images, having been specifically designed around processing pixels. A CNN has three types of layer:
- A convolutional layer
- A pooling layer
- A fully-connected layer

The convolutional layer is where the majority of the computation within the network happens. It uses a kernel (an algorithm used for pattern analysis) to identify features in the image (a process known as convolution). CNNs are designed to support multiple convolutional layers, allowing for a hierarchical design where the lower layers process small features in an image, such as an eye, while later layers may use this data to detect faces and so on.
The pooling layer uses dimensionality reduction to reduce the number of inputs, combining several output neurons of the previous layer into one neuron in the next layer. This reduces complexity, helps to prevent overfitting (explained in Limitations) and uses fewer computational resources.
Finally, in the fully-connected layer, every node is connected directly to every neuron in the previous layer. It analyzes the features identified in the previous layers and uses them to perform classification, deciding what to output.
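The pooling layer’s dimensionality reduction can be sketched with 2×2 max pooling, one common pooling choice (the 4×4 “feature map” values below are made up):

```python
# 2x2 max pooling: keep only the strongest activation in each 2x2 region,
# reducing a 4x4 feature map (16 values) to 2x2 (4 values).

def max_pool_2x2(feature_map):
    n = len(feature_map)
    return [[max(feature_map[i][j], feature_map[i][j + 1],
                 feature_map[i + 1][j], feature_map[i + 1][j + 1])
             for j in range(0, n, 2)]
            for i in range(0, n, 2)]

fm = [[1, 3, 2, 0],
      [4, 2, 1, 1],
      [0, 0, 5, 6],
      [1, 2, 7, 3]]

print(max_pool_2x2(fm))  # [[4, 2], [2, 7]]
```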

3 Training and Learning algorithms


At the moment, all we have is a collection of perceptrons in layers (with randomly initialized weight and bias values), which, if we fed it some inputs, would essentially output gibberish. Therefore we have to introduce training algorithms which, in a few words, modify the weights of links and the biases of perceptrons to produce (hopefully better) changes in the output, so that we get closer to the results we want.

3.1 Performance metrics


Before delving into the analysis of different types of training and learning algorithms, it is important
to understand the different performance metrics used to determine whether the network has been trained
well or not. These metrics are: Accuracy, Precision, Recall and F1-Score.

3.1.1 Accuracy & Precision


Accuracy is the measure of the closeness of a set of values/measurements to a specific value (usually,
the real value), whereas precision is the closeness of a set of values/measurements to each other. We
can calculate these using the following equations, where TP = True Positive, TN = True Negative, FP
= False Positive, FN = False Negative.
accuracy = (TP + TN) / (TP + TN + FP + FN)

precision = TP / (TP + FP)

Since we are working with discrete values, where the algorithm is either right or wrong, it is a lot easier to calculate both accuracy and precision. For accuracy, we divide the total number of correct predictions by the total number of predictions, giving the probability of success of the model. For example, if it predicted 72 values right out of 100, the accuracy would be 0.72, meaning that 72% of all predictions were right. For precision, we calculate the ratio of correctly identified positive predictions to all predictions of that class, whether right or wrong.

However, these two measurements alone can seem to indicate a great success rate while in fact being misleading. Take the example of a spam detection model, which aims to determine whether a received email is spam or not (treating spam as the positive class). There are two possible types of false prediction. A false negative - spam incorrectly classified as an authentic email - is annoying but relatively harmless. A false positive, where an important, authentic email is classified as spam, can have much more profound consequences, potentially leading to business or family complications, so we want to prevent it absolutely. Precision directly penalizes false positives, but neither it nor accuracy accounts for false negatives, so we have to introduce other metrics.

Although the above example focused on precision, accuracy has limitations as well. One of these is called the ’accuracy paradox’, which we can again apply to the spam detection model. Say 20 spam and 60 authentic emails are classified correctly, but 20 authentic emails are incorrectly classified as spam: we end up with an accuracy of 80%. Compare this to a second model where 80 authentic emails are classified correctly, but all 20 spam emails are incorrectly classified as authentic. This again gives 80% accuracy, yet the models are vastly different: the second model does not correctly classify a single spam email, and could essentially be replaced by an algorithm that labels every email as authentic without even looking at it. We have now seen how precision and accuracy can be wholly misleading and, as mentioned above, must introduce new metrics, Recall & F1-Score, to better deal with this.

3.1.2 Recall & F1-Score


Recall, often called sensitivity, is the ratio of correctly predicted positive cases (TP) to all real positive cases, which in the case of our spam detection model is the number of correctly identified spam emails against the total number of spam emails.

recall = TP / (TP + FN)

The F1-score combines precision and recall into a single measure of the model’s performance, giving equal weight to both by taking their harmonic mean:

F1 = 2 / (1/precision + 1/recall) = 2 × (precision × recall) / (precision + recall)
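All four metrics follow directly from the confusion-matrix counts. The counts below are from a hypothetical spam classifier (spam = positive class) and are made up for illustration:

```python
# Computing accuracy, precision, recall and F1 from confusion-matrix counts.

def metrics(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical counts: 20 spam caught, 60 authentic passed,
# 5 authentic wrongly flagged, 15 spam missed.
acc, prec, rec, f1 = metrics(tp=20, tn=60, fp=5, fn=15)
print(round(acc, 2), round(prec, 2), round(rec, 3), round(f1, 2))
# 0.8 0.8 0.571 0.67 - decent accuracy and precision,
# but low recall reveals that many spam emails slip through
```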

3.2 Gradient Descent


Gradient Descent is one of the most commonly used algorithms in machine learning and neural networks for finding the optimal parameters (weights and biases) that minimize a given loss function. It is first important to define a loss function (also known as a cost or objective function). One of the most commonly used is Mean Squared Error (MSE).

MSE = (1/N) Σ_{(x,y)∈E} (y − h(x))²

Figure (3): MSE, where N represents the total number of labeled examples (each an input x and the correct output y). The equation loops through all the labeled examples in E (the set of all examples), calculates the difference between the actual value y and the predicted value h(x), and squares it. The total is then divided by N to obtain the average (mean) squared error.
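A quick numerical sketch of the MSE formula above, using made-up (x, y) pairs and an arbitrary example hypothesis h:

```python
# Mean squared error over a set of labeled examples (x, y),
# with h(x) a stand-in predictor; here h simply doubles its input.

examples = [(1, 2.0), (2, 4.5), (3, 5.5)]   # made-up (x, y) pairs
h = lambda x: 2 * x                          # example hypothesis

mse = sum((y - h(x)) ** 2 for x, y in examples) / len(examples)
print(round(mse, 4))  # (0^2 + 0.5^2 + (-0.5)^2) / 3 = 0.1667
```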

We can plot the loss function as a (usually convex, bowl-shaped) curve for the different weight values we input into the MSE, where the y axis represents the loss and the x axis the weight value; in our case this is a quadratic convex function. Therefore we know our curve has only one minimum, which is both its local and global minimum (this would differ if we had non-quadratic, more erratic functions). The global minimum is the lowest y value the entire function will output, over all values of x, while a local minimum is the lowest y value within a certain region of x values. Knowing that our function has a global minimum, we can start taking steps towards it by following the sign of the derivative of the curve: if it is positive, shift to the left; if it is negative, shift to the right. The size of these shifts is governed by the learning rate, η, which is a hyperparameter (a value set by the user before training, in contrast to a parameter, such as a weight or bias, which is learned during training). With a large η we take large steps towards the global minimum and approach it quickly, but we risk overshooting it and endlessly rebounding between two values above it - never reaching an optimal weight value. With a very small η the end result may be optimal, but the journey there takes very long due to the very small increments. We therefore have to choose a balanced learning rate that neither overshoots the minimum nor takes too long.
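The effect of the learning rate can be demonstrated on a simple one-dimensional convex loss, a toy stand-in for the MSE curve described above:

```python
# Gradient descent on the 1-D convex loss L(w) = (w - 3)^2,
# whose global minimum is at w = 3, with derivative dL/dw = 2(w - 3).

def descend(eta, w=0.0, steps=50):
    for _ in range(steps):
        w -= eta * 2 * (w - 3)   # w := w - eta * dL/dw
    return w

print(round(descend(eta=0.1), 4))   # balanced: converges very close to 3
print(round(descend(eta=0.001), 4)) # too small: still far from 3 after 50 steps
print(abs(descend(eta=1.1)) > 10)   # too large: overshoots and diverges (True)
```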

3.2.1 Stochastic Gradient Descent


Although Gradient Descent works well on small datasets, when faced with millions or even billions of examples - as at large companies such as Google and OpenAI - it can take a very long time just to iterate through the data once (changing the parameters once). This is vastly inefficient, since tuning a neural network’s parameters to their optimum often requires thousands, if not hundreds of thousands, of iterations. One way of working around this is to use batches of one example, called stochastic (meaning random) gradient descent, which calculates new parameters at each iteration (for each example). This has many advantages: on large datasets, SGD (Stochastic Gradient Descent) is both faster than the original GD and Mini-Batch SGD (explained below) and more feasible, since it requires only one example to be in memory at a time. However, it is important to note that datasets cause ’noise’. Since all the examples in the dataset are different (at least, they should be), randomness is induced into the updates. This noise can be very useful on non-convex curves with both local and global minima, since the randomness can help the descent escape local minima and converge on the global minimum. But this comes at a price: since we use only one example at a time, SGD is very ’noisy’, making it hard to converge precisely onto the global minimum.

3.2.2 Mini-batch Stochastic Gradient Descent


We therefore want a solution combining the speed of SGD with the smoothness (thanks to reduced noise) of GD. Mini-batch stochastic gradient descent is a compromise between the two: instead of one example per batch, MB SGD typically uses between 10 and 1000 examples, which both reduces the noise of SGD and is more efficient than using all the examples in one batch (GD).
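The compromise can be sketched by fitting a single weight with mini-batch SGD on a toy dataset. The true weight, batch size and learning rate below are all illustrative choices:

```python
import random

# Mini-batch SGD fitting y = w * x with MSE loss on a toy dataset.
# Each update uses a small random batch rather than one example (SGD)
# or the whole dataset (GD).

random.seed(0)
data = [(x, 2.0 * x) for x in range(1, 11)]   # noiseless examples, true w = 2

w, eta, batch_size = 0.0, 0.01, 5
for epoch in range(200):
    batch = random.sample(data, batch_size)    # a random mini-batch
    # gradient of (1/B) * sum (y - w x)^2 with respect to w
    grad = sum(-2 * x * (y - w * x) for x, y in batch) / batch_size
    w -= eta * grad

print(round(w, 3))  # close to the true weight 2.0
```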

3.3 Backpropagation
Up until the 1960s, when backpropagation was first theorized, methods of training neural networks - even small ones - were slow and computationally expensive. The introduction of backpropagation changed this, enabling the training of neural networks with hundreds, even thousands, of perceptrons. But how does it work? While forward propagation is the movement of data from the inputs to the outputs, backpropagation retraces these steps, propagating backwards through each layer and calculating the gradient of the error/loss function with respect to its weights. It then uses methods such as stochastic gradient descent to decrease the loss function and obtain optimal weights.
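A minimal sketch of backpropagation on a two-layer chain of sigmoid units (one unit per layer, squared-error loss; the weights and inputs are arbitrary). The backward pass applies the chain rule layer by layer, and a finite-difference check confirms the gradients:

```python
import math

# Backpropagation through the chain x -> h -> y with sigmoid activations
# and squared-error loss L = (y - t)^2.

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def loss(w1, w2, x, t):
    h = sigmoid(w1 * x)
    y = sigmoid(w2 * h)
    return (y - t) ** 2

def backprop(w1, w2, x, t):
    # forward pass, keeping intermediate activations
    h = sigmoid(w1 * x)
    y = sigmoid(w2 * h)
    # backward pass: push dL/d(output) back through each layer
    dL_dy = 2 * (y - t)
    dy_dz2 = y * (1 - y)               # sigmoid derivative at the output
    dL_dw2 = dL_dy * dy_dz2 * h
    dL_dh = dL_dy * dy_dz2 * w2
    dh_dz1 = h * (1 - h)               # sigmoid derivative at the hidden unit
    dL_dw1 = dL_dh * dh_dz1 * x
    return dL_dw1, dL_dw2

w1, w2, x, t = 0.5, -0.3, 1.0, 1.0
g1, g2 = backprop(w1, w2, x, t)

# numerical check: central finite differences agree with backprop
eps = 1e-6
n1 = (loss(w1 + eps, w2, x, t) - loss(w1 - eps, w2, x, t)) / (2 * eps)
n2 = (loss(w1, w2 + eps, x, t) - loss(w1, w2 - eps, x, t)) / (2 * eps)
print(abs(g1 - n1) < 1e-6 and abs(g2 - n2) < 1e-6)  # True
```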

4 Real-world application analysis:


Now that we know the theory behind neural networks, this section will describe a neural network I made with the goal of classifying handwritten digits from the famous MNIST dataset (in the hope that this helps the reader further understand NNs). The MNIST dataset is a collection of 28-by-28-pixel images of digits, drawn by students and government workers. This dataset is considered the ”hello world” of creating neural networks, albeit a much more complicated one. In this section, I will describe how we can create, train and test our very own neural network (without going into the code behind it, since that is specific to a programming language rather than to the gist of neural networks). In the final subsection, ’Testing’, this essay will analyze and explore the different algorithms mentioned throughout this paper.

4.1 Creation
Correctly classifying a digit - i.e. ”that number is 7” - is quite an easy task for humans, but once you attempt to make your own classifier, it is hard to even know where to start. How can we get a computer to correctly recognize a number? One way is to look at a digit in an abstract way: ”What patterns are there?”, ”Does it contain any circles?” (ideal for recognizing numbers such as zero, six, eight or nine). However, we won’t explicitly tell the network to do this, as ML is defined as the ability of a program to learn to do a specific task without being explicitly told how. In this example, we will use 784 input nodes, one for each pixel in the image, 10 hidden nodes (the nodes which will process the data and look for patterns), and 10 output nodes, each outputting the probability that the digit attached to that node is the one in the image. We can then use these probabilities to converge on an answer (the one with the highest probability).
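The shape described above can be sketched as follows (the uniform initialization range is my own assumption; real implementations often use other initialization schemes):

```python
import random

# Creating the network shape described in the text: 784 input nodes,
# one hidden layer of 10 nodes, and 10 output nodes, with randomly
# initialized weights and biases.

random.seed(0)

def make_layer(n_inputs, n_nodes):
    weights = [[random.uniform(-1, 1) for _ in range(n_inputs)]
               for _ in range(n_nodes)]
    biases = [random.uniform(-1, 1) for _ in range(n_nodes)]
    return weights, biases

layers = [make_layer(784, 10),   # input -> hidden
          make_layer(10, 10)]    # hidden -> output

print(len(layers[0][0]), len(layers[0][0][0]))  # 10 hidden nodes, 784 weights each
print(len(layers[1][0]))                         # 10 output nodes
```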

4.2 Training
Before we begin training a neural network, we have to split our data into training and test sets. The training set is used to train the neural network, while the test set (independent of the training set) is used to test our model. It is important to have these two sets: if we trained our model and tested it on the same data, it would easily recognize and classify it correctly, whereas the whole point of the network is to classify new data points correctly. Additionally, we don’t want a small test set, since noise in the chosen examples may make the model look inaccurate when it is in fact accurate. Take a model that makes an error 10 times in 10,000 - an astounding accuracy. If we tested it on a set of just 5 examples and 1 was incorrectly classified, its apparent accuracy would be only 80%, when in reality it is 99.9% accurate. By convention, the typical split of testing to training data is 20:80. The training of my network used mini-batch SGD: for each epoch (iteration), the program randomly chose n samples, evaluated them, calculated a loss function, and modified the weights of the network to achieve a change in output. Note that, due to noise in datasets, the weights can sometimes be modified incorrectly, resulting in reduced accuracy.
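The 20:80 split described above can be sketched as follows (the dataset here is a stand-in list; shuffling first avoids any ordering bias in the original data):

```python
import random

# An 80:20 train/test split on a placeholder dataset of 1000 "examples".

random.seed(0)
dataset = list(range(1000))          # stand-in for 1000 labeled examples
random.shuffle(dataset)              # remove any ordering bias

split = int(0.8 * len(dataset))
train, test = dataset[:split], dataset[split:]

print(len(train), len(test))         # 800 200
assert not set(train) & set(test)    # the two sets are independent
```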

4.3 Testing (Work in progress)


Now that we have defined how to create and train our network, we can start modifying the model’s hyperparameters (such as the learning rate and batch size) to analyze the impact of these changes.

5 The use of neural networks in a new booming technological era (Work in progress)
6 Limitations
Most things have ”two faces”: one prosperous and good, the other disadvantageous. Neural networks are no exception, although most of their limitations are not due to limitations in the theory, and where disadvantages do arise, they can often be bypassed with suitable, clever theory and coding.

6.0.1 Computational Resources
One of the first and major limitations is the intensive computational resources it takes to train neural networks on datasets, especially large ones. For example, ChatGPT took over 34 days to finish training. Therefore, unless you have access to huge amounts of computational power, it is hard to create powerful models such as GPT - and even if you did have access, the amount of electricity used would be hugely costly.

Additionally, training a good neural network requires huge amounts of high-quality data. It is therefore hard to train models in areas where there is little data, or where data is expensive to obtain.

6.0.2 Overfitting
Secondly, another big (but solvable) problem is overfitting. This occurs when the model has been excessively fit, so that it matches the training data exactly instead of generalizing from it; when it then tries to predict new data, it is inaccurate. One way of overcoming this is to add more examples to the training data (data augmentation). We can also try to reduce the complexity of the network, since a network with excessively many hidden layers and nodes will most likely overfit. Conversely, models can also underfit as a result of too little training.

Another limitation is that no one knows directly what a neural network is doing under the hood. Networks are both complex and very difficult to interpret, making it hard to find errors or to work out how the neural network is actually functioning.

7 Conclusion & insight (Work in progress)
