A Practical Guide To Graph Neural Networks
How do graph neural networks work, and where can they be applied?
1 AN INTRODUCTION
Contemporary artificial intelligence (AI), or more specifically, deep learning (DL) has been domi-
nated in recent years by the learning architecture known as the neural network (NN). NN variants
have been designed to increase performance in certain problem domains; the convolutional neural
network (CNN) excels in the context of image-based tasks, and the recurrent neural network (RNN)
in the space of natural language processing and time series analysis. NNs have also been leveraged
as components in composite DL frameworks — they have been used as trainable generators and
discriminators in generative adversarial networks (GANs), and as encoders and decoders in trans-
formers [46]. Although they seem unrelated, the images used as inputs in computer vision, and
the sentences used as inputs in natural language processing can both be represented by a single,
general data structure: the graph (see Figure 1).
Formally, a graph is a set of distinct vertices (representing items or entities) that are joined
optionally to each other by edges (representing relationships). The learning architecture that has
been designed to process said graphs is the titular graph neural network (GNN). Uniquely, the graphs
fed into a GNN (during training and evaluation) do not have strict structural requirements
per se; the number of vertices and edges between input graphs can change. In this way, GNNs
can handle unstructured, non-Euclidean data [4], a property which makes them valuable in certain
problem domains where graph data is abundant. Conversely, NN-based algorithms are typically
required to operate on structured inputs with strictly defined dimensions. For example, a CNN built
to classify over the MNIST dataset must have an input layer of 28 × 28 neurons, and all subsequent
images fed to it must be 28 × 28 pixels in size to conform to this strict dimensionality requirement [27].

(a) A graph representation of a 14 × 14 pixel image of the digit ‘7’. Pixels are represented by vertices and their direct adjacency is represented by edge relationships.
(b) Methane (top) and glycine (bottom) molecular structures represented as graphs. Edges represent electrochemical bonds and vertices represent atomic nuclei or simple molecules.
(c) A vector representation and a Reed–Kellogg diagram (rendered according to modern tree conventions) of the same sentence. The graph structure encodes dependencies and constituencies.
(d) A game-playing tree can be represented as a graph. Vertices are states of the game and directed edges represent actions which take us from one state to another.
Fig. 1. The graph data structure is highly abstract, and can be used to represent images (matrices), molecules, sentence structures, game-playing trees, etc.
The expressiveness of graphs as a method for encoding data, and the flexibility of GNNs with
respect to unstructured inputs, have motivated their research and development. They represent a
new approach for exploring comparatively general deep learning methods, and they facilitate the
application of deep learning approaches to sets of data which — until recently — were not possible
to work with using traditional NNs or other such algorithms.
Notation      Meaning
V             A set of vertices.
N             The number of vertices in a set of vertices V.
v_i           The i-th vertex in a set of vertices V.
v_i^F         The feature vector of vertex v_i.
ne[v_i]       The set of vertex indices for the vertices that are direct neighbors of v_i.
E             A set of edges.
M             The number of edges in a set of edges E.
e_{i,j}       The edge between the i-th vertex and the j-th vertex, in a set of edges E.
e_{i,j}^F     The feature vector of edge e_{i,j}.
h_i^k         The k-th hidden layer's representation of the i-th vertex's local neighborhood.
o_i           The i-th output of a GNN (indexing is framework dependent).
G = G(V, E)   A graph defined by the set of vertices V and the set of edges E.
A             The adjacency matrix; each element A_{i,j} indicates whether the i-th vertex is connected to the j-th vertex by an edge.
W             The weight matrix; each element W_{i,j} represents the ‘weight’ of the edge between the i-th vertex and the j-th vertex. The ‘weight’ typically represents some real concept or property. For example, the weight between two given vertices could be inversely proportional to their distance from one another (i.e., close vertices have a higher weight between them). Graphs with a weight matrix are referred to as weighted graphs, but not all graphs are weighted graphs.
D             The degree matrix; a diagonal matrix of vertex degrees or valencies (the number of edges incident to a vertex). Formally defined as D_{i,i} = \sum_j A_{i,j}.
L             The non-normalized graph Laplacian; defined as L = D − W. For unweighted graphs, W = A.
L_sn          The symmetric normalized graph Laplacian; defined as L_sn = I_n − D^{-1/2} A D^{-1/2}.
L_rw          The random-walk normalized graph Laplacian; defined as L_rw = I_n − D^{-1} A.
I_n           An n × n identity matrix; all zeros except for ones along the diagonal.
Table 1. Notation used in this work. We suggest that the reader familiarise themselves with this notation
before proceeding.
representing a molecular structure, the edges might represent the electrochemical bond between
two atoms (see Figure 3).
2.1.3 Features. In AI, features are simply quantifiable attributes which characterize a phenomenon
that is under study. In the graph domain, features can be used to further characterize vertices and
edges. Extending our social network example, we might have features for each person (vertex)
which quantify the person's age, popularity, and social media usage. Similarly, we might have a
feature for each relationship (edge) which quantifies how well two people know each other, or the
type of relationship they have (familial, colleague, etc.). In practice there might be many different
features to consider for each vertex and edge, so they are represented by numeric feature vectors
referred to as v_i^F and e_{i,j}^F respectively.
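As a concrete illustration of the idea above, vertex and edge feature vectors can be stored as plain numeric arrays. The sketch below is ours, and the people, relationships, and feature values in it are entirely hypothetical.

```python
import numpy as np

# Hypothetical vertex features for three people: [age, popularity, social media usage].
vertex_features = {
    0: np.array([25.0, 0.8, 3.5]),   # v_0^F
    1: np.array([31.0, 0.3, 1.0]),   # v_1^F
    2: np.array([19.0, 0.6, 5.2]),   # v_2^F
}

# Hypothetical edge features: [years known, relationship type (0 = friend, 1 = colleague)].
edge_features = {
    (0, 1): np.array([4.0, 0.0]),    # e_{0,1}^F
    (1, 2): np.array([2.0, 1.0]),    # e_{1,2}^F
}
```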
Fig. 2. Two renderings of the same social network graph, one displaying only the edge relationships and
vertex indices v_i (left), and one displaying each vertex's vertex features v_i^F (right).
A =
0 1 0 0 0 0 0 0
1 0 1 1 1 0 0 0
0 1 0 0 0 0 0 0
0 1 0 0 0 1 1 1
0 1 0 0 0 0 0 0
0 0 0 1 0 0 0 0
0 0 0 1 0 0 0 0
0 0 0 1 0 0 0 0
Fig. 3. A diagram of an alcohol molecule (left), its associated graph representation with vertex indices labelled
(middle), and its adjacency matrix (right).
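Using the adjacency matrix from Fig. 3, the matrices defined in Table 1 can be computed directly. The following numpy sketch is ours and simply mirrors those definitions.

```python
import numpy as np

# Adjacency matrix of the molecular graph in Fig. 3 (8 vertices).
A = np.array([
    [0, 1, 0, 0, 0, 0, 0, 0],
    [1, 0, 1, 1, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 1, 1, 1],
    [0, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 0, 0],
], dtype=float)

D = np.diag(A.sum(axis=1))      # degree matrix: D_ii = sum_j A_ij
L = D - A                       # non-normalized graph Laplacian (W = A for unweighted graphs)

D_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(D)))
L_sn = np.eye(A.shape[0]) - D_inv_sqrt @ A @ D_inv_sqrt   # symmetric normalized Laplacian
```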
considering the vertices attached (via edges) to the current neighborhood. The neighborhood grown
from v_i after one iteration is referred to in this text as the direct neighborhood, or by the set of
neighbor indices ne[v_i]. Note that a neighborhood can be defined subject to certain vertex and
edge feature criteria.
2.1.5 States. States encode the information represented in a given neighborhood around a vertex (including
the neighborhood's vertex and edge features, and states). States can be thought of as hidden
feature vectors. In practice, states are created by iteratively applying a feature extraction function
to the previous iteration's states, with later iterations' states including all the information required
to perform classification, regression, or some other output calculation on a given vertex.
Fig. 4. Two selected frames from a computer generated video of a model hand moving. The rendered graphs
show the connections between joints in the human hand, and the hierarchical dependency of said joints.
This is actually one single spatiotemporal graph — the spatial positions of the vertices in the graph change
with respect to time. In practice, these kinds of graphs could be classified into types of hand movements.
Images provided from the ‘Hands from Synthetic Data’ dataset [41]. Image best viewed in colour.
(2) e_{i,j}^F — the features of the edges which join v_i to its neighbor vertices v_j. Here only direct
neighbors will be considered, though in practice neighbors further than one edge away may
be used. Similarly, for directed graphs, neighbors may or may not be considered based on edge
direction (e.g., only outgoing or incoming edges considered as valid neighbor connections).
(3) v_j^F — the features of v_i's neighbors.
(4) h_j^{k-1} — the previous state of v_i's neighbors. Recall that a state simply encodes the information
represented in a given neighborhood.
Formally, the transition function f is used in the recursive calculation of a vertex's k-th state as
per Equation 1.
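For concreteness, a hedged sketch of what this recursive update typically looks like, following the classic formulation of Scarselli et al. [37] (the precise form of Equation 1 may differ in its details):

h_i^k = f\left(v_i^F,\; e_{i,j}^F,\; v_j^F,\; h_j^{k-1}\right), \qquad j \in ne[v_i],

where the contributions from each neighbor j are aggregated (e.g., summed) inside f.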
We can see that under this formulation, f is well defined. It accepts four feature vectors which
all have a defined length, regardless of which vertex in the graph is being considered and regardless
of the iteration. This means that the transition function can be applied recursively, until a stable
state is reached for all vertices in the input graph. If f is a contraction map, Banach's fixed point
theorem ensures that the values of h_i^k will converge to stable values exponentially fast, regardless
of the initialisation of h_i^0 [21].
The iterative passing of ‘messages’ or states between neighbors to generate an encoding of the
graph is what gives this message passing operation its name. In the first iteration, any vertex’s state
encodes the features of the neighborhood within a single edge. In the second iteration, any vertex’s
state is an encoding of the features of the neighborhood within two edges away, and so on. This is
because the calculation of the 𝑘 𝑡ℎ state relies on the (𝑘 − 1)𝑡ℎ state. To fully elucidate this process,
we will step through how the transition function is recursively applied (see Figures 5, 6, and 7).
The purpose of repeated applications of the transition function is thus to create discriminative
embeddings which can ultimately be used for downstream machine learning tasks. This is inves-
tigated further via a worked example in Appendices B.1, and we recommend that readers look
through the example (and code) if further explanation is required.
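To make the recursion concrete, here is a minimal numpy sketch of an RGNN-style transition applied until the hidden states stabilise. The particular form of f (a damped single-layer map that ignores edge features) is our own simplification, not the transition used in the worked example of Appendices B.1.

```python
import numpy as np

def transition(v_feat, nbr_feats, nbr_states, W_self, W_nbr):
    # A hypothetical transition f: combine a vertex's own features with a sum
    # over its neighbors' features and previous states (edge features omitted).
    msgs = sum(np.tanh(W_nbr @ np.concatenate([nf, ns]))
               for nf, ns in zip(nbr_feats, nbr_states))
    return np.tanh(W_self @ v_feat) + 0.5 * msgs   # damping encourages convergence

def run_rgnn(A, X, state_dim=4, max_iters=50, tol=1e-4, seed=0):
    # Recursively apply the transition until every hidden state h_i is stable.
    rng = np.random.default_rng(seed)
    N, F = X.shape
    W_self = rng.normal(scale=0.1, size=(state_dim, F))
    W_nbr = rng.normal(scale=0.1, size=(state_dim, F + state_dim))
    H = np.zeros((N, state_dim))                    # h_i^0 initialisation
    for _ in range(max_iters):
        H_new = np.stack([transition(X[i], X[A[i] > 0], H[A[i] > 0], W_self, W_nbr)
                          for i in range(N)])
        if np.abs(H_new - H).max() < tol:           # stable states reached
            return H_new
        H = H_new
    return H
```

Because the random weights are small and the neighbor contribution is damped, this f behaves like a contraction in practice, so the loop usually terminates well before max_iters.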
3.1.2 Output. The output function is responsible for taking the converged hidden state of a graph
G (V, E ) and creating a meaningful output. Recall that from a machine learning perspective, the
repeated application of the transition function 𝑓 to the features of G (V, E ) ensured that every
final state h_i^{k_max} is simply an encoding of some region of the graph. The size of this region is
dependent on the halting condition (convergence, max time steps, etc.), but often the repeated
‘message passing’ ensures that each vertex’s final hidden state has ‘seen’ the entire graph. These
rich encodings typically have lower dimensionality than the graph’s input features, and can be fed
to fully connected layers for the purpose of classification, regression, and so on.
The question now is what do we define as a ‘meaningful output’? This depends largely on the
task framework:
• Vertex-level frameworks. A unique output value is required for every vertex, and the
output function g takes a vertex's features and final state as inputs: o_i = g(v_i^F, h_i^{k_max}). The
output function g can be applied to every vertex in the graph.
• Edge-level frameworks. A unique output value is required for each edge, and the output
function g takes the edge's features, and the vertex features and final states of the vertices
which define the edge: o_{i,j} = g(e_{i,j}^F, v_i^F, h_i^{k_max}, v_j^F, h_j^{k_max}). Similarly, g can be applied to every
edge in the graph.
Fig. 5. A simple graph with N = 6 and M = 6 is defined using the notation outlined in Table 1. It is understood
that every vertex and edge has some associated features. Additionally, each vertex v_i is initialised with some
h_i^0.
Fig. 6. We calculate the k-th state for each neighborhood in the graph. Note that any k-th state calculation is
dependent on only the input graph features of the graph and on the (k−1)-th states. We begin by calculating
h_0^k for the neighborhood centered around v_0, as per Equation 1. v_0 and its direct connections (emboldened) to
its neighbor vertices (dotted) are illustrated here.
Fig. 7. The same calculation is performed on each neighborhood in the graph, until the final vertex's neigh-
borhood is reached. Once this is done, the k-th layer representation of the graph has been calculated. The
process is then repeated for the (k+1)-th layer, the (k+2)-th layer, and so on, typically until convergence is
reached.
• Graph-level frameworks. A unique output is required over the entire graph (i.e., graph clas-
sification). The output function g will consider all relevant information calculated thus far. This
will include the final state around each vertex, and may optionally include the initial vertex and
edge features: o = g(h_0^{k_max}, h_1^{k_max}, ..., h_{N−1}^{k_max}, v_0^F, v_1^F, ..., v_{N−1}^F, e_{0,0}^F, e_{0,1}^F, ..., e_{0,N−1}^F, ..., e_{N−1,N−1}^F). Here,
An NGN's weights are dependent on the local structure of the graph — as opposed to methods that
attempt to use the global structure of the graph and thus incur large computational costs. As such,
the messages which are passed in an NGN's forward pass include features that are sensitive to the
flow of information over the graph. de Haan et al. find that such NGN frameworks provide competitive
results in a variety of tasks, including graph classification.
Fig. 8. A convolution operation on 2D matrices. This process is used in computer vision and in CNNs. The
convolutional operation here has a stride of 2 pixels. The given filter is applied in the red, green, blue, and
then purple positions — in that order. At each position each element of the filter is multiplied with the
corresponding element in the input and the results are summed, producing a single element in the output. For
clarity, this multiplication and summing process is illustrated for the purple position. In CNNs, the filter is
learned during training, so that the most discriminative features are extracted while the dimension is reduced.
In the case of this image the filter is a standard sharpening filter used in image analysis. Image best viewed
in colour.
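For reference, the operation described in this caption can be written in a few lines of numpy. The input below is a random stand-in rather than the image pictured in Fig. 8, and the sharpening filter is the standard 3 × 3 one.

```python
import numpy as np

def conv2d(image, kernel, stride=2):
    """Valid 2D convolution (strictly, cross-correlation, as in most DL libraries)."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            patch = image[r * stride:r * stride + kh, c * stride:c * stride + kw]
            out[r, c] = np.sum(patch * kernel)   # elementwise multiply, then sum
    return out

# A standard 3x3 sharpening filter, as described in the Fig. 8 caption.
sharpen = np.array([[ 0, -1,  0],
                    [-1,  5, -1],
                    [ 0, -1,  0]])

image = np.random.rand(7, 7)                     # stand-in for an activation map
print(conv2d(image, sharpen, stride=2).shape)    # (3, 3)
```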
When we think of convolution, we often think of the convolution operation used in CNNs (see
Figure 8). Indeed, this conforms to a general definition of convolution:
“An output derived from two given inputs by integration (or summation), which
expresses how the shape of one is modified by the other.”
Convolution in CNNs involves two matrix inputs: one is the previous layer of activations, and the
other is a W × H matrix of learned weights, which is ‘slid’ across the activation matrix, aggregating
each W × H region using a simple linear combination. In the spatial graph domain, it seems that this
type of convolution is not well defined [40]; the convolution of a rigid matrix of learned weights
must occur on a rigid structure of activations. How do we reconcile convolutions on unstructured
inputs such as graphs?
Note that at no point during our formal definition of convolution is the structure of the given
inputs alluded to. In fact, convolutional operations can be applied to continuous functions (e.g.,
audio recordings and other signals), N-dimensional discrete tensors (e.g., semantic vectors in 1D,
and images in 2D), and so on. During convolution, one input is typically interpreted as a filter (or
kernel) being applied to the other input, and we will adopt this language throughout this section.
Specific filters or kernels can be utilised to perform specific tasks: in the case of audio recordings,
high pass filters can be used to filter out low frequency signals, and in the case of images, certain
filters can be used to increase contrast, sharpen, or blur images. Relating to our previous example
in CNNs, the learned convolutional filters perform a kind of learned feature extraction.
Fig. 9. Three neighborhoods in a given graph (designated by dotted boxes), with each one defined by a
central vertex (designated by a correspondingly coloured circle). In spatial convolution, a neighborhood is
selected, the values of the included vertices’ features are aggregated, and then this aggregated value is
used to update the vertex embedding of the central vertex. This process is repeated for all neighborhoods in
the graph. These embeddings can then be used as vertex ‘values’ in the next layer of spatial convolution, thus
allowing for hierarchical feature extraction. In this case a vertex is a neighbor of another vertex only if it is
directly connected to it. Note the similarities to the transition process of RGNNs, as illustrated in Figures 5, 6,
and 7. Image best viewed in colour.
Consider the similarities between images and graphs (see Figure 1). Images are just particular
cases of graphs: pixels are vertices, and edges exist between adjacent pixels [54]. Image convolution
is then just a specific example of graph convolution. One full graph convolution operation is then
completed as follows (see Figure 9 for an illustrated example). Note that we use the term spatial
connectivity to describe if two vertices are connected by an edge (or ‘connected’ under some other
spatial definition).
(1) Using spatial connectivity to define graph neighborhoods around all vertices, select the first
neighborhood in the input graph.
(2) Aggregate the values in this neighborhood (e.g., using a sum or mean calculation).
(3) Use the aggregated values to update the central vertex’s hidden state.
(4) Repeat this process on the subsequent neighborhoods in the input graph.
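A minimal sketch of these four steps, using a mean aggregator and a single weight matrix, is given below. It is our own illustrative implementation rather than any specific published layer; real spatial layers differ in how they combine self and neighbor information.

```python
import numpy as np

def spatial_graph_conv(A, H, W, include_self=True):
    # One spatial graph convolution layer: for every vertex, aggregate (here, mean)
    # the embeddings of its neighborhood, then update the central vertex's embedding
    # with a learned linear map W and a non-linearity.
    N = A.shape[0]
    H_out = np.zeros((N, W.shape[1]))
    for i in range(N):
        nbrs = np.where(A[i] > 0)[0]
        pool = [H[i]] if include_self else []
        pool += [H[j] for j in nbrs]
        agg = np.mean(pool, axis=0)          # step 2: aggregate the neighborhood
        H_out[i] = np.tanh(agg @ W)          # step 3: update the central vertex
    return H_out                             # step 4: repeated for every neighborhood

# Stacking calls gives hierarchical feature extraction, e.g.:
# H1 = spatial_graph_conv(A, X, W0); H2 = spatial_graph_conv(A, H1, W1)
```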
The choice of aggregation function is not trivial — different aggregation functions can have
notable effects on performance and computational cost. A notable framework that investigated
aggregator selection is the GraphSAGE framework [13], which demonstrated that learned aggre-
gators can outperform simpler aggregation functions (such as taking the mean of embeddings)
and thus can create more discriminative, powerful vertex embeddings. GraphSAGE has since been
outperformed on accepted benchmarks [10] by other frameworks [3], but the framework is still
competitive and can be used to explore the concept of learned aggregators (see Appendices B.2).
This type of graph convolution is referred to as the spatial graph convolutional operation,
since spatial connectivity is used to retrieve the neighborhoods in this process. The hidden states
calculated during the first pass of spatial graph convolution can be considered in the next stage
of spatial graph convolution, thus allowing for increasingly higher level feature extraction. This
operation propagates vertex information along edges in a similar manner to RGNNs (Section 3).
Interestingly, at this point we can see that there are many similarities between RGNNs and spatial
CGNNs. The core difference is that RGNNs iterate until a stable point is reached, and use a repeated
transition function in doing so. Alternatively, CGNNs iterate for a fixed number of layers, each of
which usually has only a single layer of weights being applied.
Fig. 10. Pictured is an example graph G (V, E ) rendered in the plane (N = 18, M = 23), with corresponding
graph signal values rendered as red bars rising perpendicularly to said plane. The vertices V represent cities,
and the edges E represent if two cities have a flight path between them. This graph also has a weight matrix
W signifying the average number of flights per week along a given flight path which are actually conducted.
The weight / number of flights associated with each edge / flight path is proportional to the edge’s thickness.
The graph signal represents the number of people with a contagious virus in each city; since there are 18
vertices, the graph signal vector is accordingly of size 18. Other graphs and graph signals can be defined for
neural networks, traffic networks, energy networks, social networks, etc. [39, 42].
For example, a graph might have vertices that represent cities, and weighted edges that represent
the average number of weekly flights which go between two cities. In this context, we could have a
signal which represents the number of people with a contagious virus in each city (see Figure 10).
As previously mentioned, we could describe this scenario using different language; each vertex
has a feature which is the number of people with a contagious virus in the city which that vertex
represents. Again, this collective set of features (or samples) is referred to as the graph signal [39].
4.2.1 The Mathematics. Unfortunately, many of the simple and fundamental tools for understanding
and analysing a signal are not well defined for graphs that are represented in vertex space. For
example, the translation operation has no immediate analogue for graphs [39]. Similarly, there is no
immediate way to convolve a graph directly with a filter. To overcome the challenges in processing
signals on graphs, the inclusion of spectral analysis — similar to that used in discrete signal analysis
— has been fundamental in the emergence of localised computation of graph information. To process
graph signals in this way we are required to transform the graph from vertex space to frequency
space. It is within this frequency space that our previous understandings of mathematical operations
(such as convolution) now hold.
We leverage Fourier techniques to perform this transformation to and from the vertex and
frequency spaces. Classically, the Fourier transform is the expansion of a given function in terms
of the eigenfunctions of the Laplace operator [39]. In the graph domain, we use the graph Fourier
transform, which is the expansion of a given signal in terms of the eigenfunctions of the graph
Laplacian (see Table 1 for a formal definition of the graph Laplacian, and see Figure 11 for a concrete
example).
The graph Laplacian L is a real symmetric matrix, and as such, we can perform eigendecomposition
on it and hence extract its set of eigenvalues λ_k and eigenvectors u_k (its eigensystem).
In reality, there are many ways to calculate a matrix's eigensystem, such as via singular value
decomposition (SVD). Additionally, other graph Laplacians have been used, such as L_sn and L_rw
(see Table 1).
By the fundamental property of eigenvectors, the Laplacian matrix can be factorised as three
matrices such that L = UΛU^T. Here, U is a matrix with eigenvectors as its columns, ordered by
eigenvalue, and Λ is a diagonal matrix with said eigenvalues across its main diagonal. Interestingly,
the eigenvectors of L are exactly the exponentials of the discrete Fourier transform.
Once this eigendecomposition has been performed and the eigensystem has been calculated,
we can freely express a given discrete graph signal 𝑓 in terms of the eigenvectors of the graph
Laplacian, thus producing the graph Fourier transform 𝑓ˆ. This is performed using Equation 2, and
the inverse transformation is performed using Equation 3. Again, the 𝑘 eigenvectors 𝑢𝑘 appear as
columns in the matrix U, which is itself generated when calculating the eigensystem of the graph
Laplacian matrix L.
\hat{f}(k) = \sum_{i=1}^{N} f(i)\, u_k^*(i) \qquad \text{or} \qquad \hat{f} = U^T f \qquad (2)

f(i) = \sum_{k=1}^{N} \hat{f}(k)\, u_k(i) \qquad \text{or} \qquad f = U \hat{f} \qquad (3)
As per the convolution theorem, multiplication in the frequency domain corresponds to convolution
in the time domain [29]. This property is also valid for graphs: multiplication in the frequency domain
is equivalent to convolution in the vertex domain. This is an important property, as we cannot
define the classical convolutional operator directly in the vertex domain, since classical convolution
requires a translation operator — one signal is imposed on the other at every possible position.
However, we can define convolution in the frequency domain as per Equation 4.
(f * g)(i) = \sum_{k=1}^{N} \hat{f}(k)\, \hat{g}(k)\, u_k(i) \qquad (4)
In practice, the function 𝑔 that appears in Equation 4 is typically a learned filter (as alluded to in
Section 4.1), but we will cover this more specifically in Section 4.3. For worked examples and an in
depth explanation of graph signal processing we recommend the following works: [39, 42]. With
an understanding of how convolution can be performed on graphs in the spectral domain, we can
now delve into the mechanics of spectral convolution.
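The following numpy sketch strings Equations 2–4 together on a small, hypothetical weighted graph: it builds the Laplacian, computes its eigensystem, transforms a graph signal into the frequency domain, applies a hand-picked (not learned) spectral filter, and transforms back.

```python
import numpy as np

# Hypothetical weight matrix for a 4-vertex weighted graph.
W = np.array([[0, 1, 0, 2],
              [1, 0, 3, 0],
              [0, 3, 0, 1],
              [2, 0, 1, 0]], dtype=float)
D = np.diag(W.sum(axis=1))
L = D - W                                      # non-normalized graph Laplacian

# Eigendecomposition of the (real, symmetric) Laplacian: L = U diag(lam) U^T.
lam, U = np.linalg.eigh(L)

f = np.array([0.2, 0.4, 0.3, 0.1])             # a graph signal (one value per vertex)
f_hat = U.T @ f                                # Equation 2: forward graph Fourier transform
f_back = U @ f_hat                             # Equation 3: inverse transform (recovers f)

g_hat = np.exp(-2.0 * lam)                     # a hand-picked low-pass spectral filter
f_filtered = U @ (g_hat * f_hat)               # Equation 4: filtering via the frequency domain
```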
Fig. 11. A concrete example of how the graph Laplacian is formed for a graph, and how we might represent
edge weights and graph signals. In this example, there are two graph signals; the first graph signal is f_1 =
[0.2, 0.4, 0.3, 0.3, 0.1]; equivalently, it is the ordered set of the first elements from each vertex's feature vector
v_i^F. Similarly, the second graph signal is f_2 = [8, 6, 7, 12, 4]. The graph Laplacian in this example is simply
formed by L = D − A, and it is both real and symmetric, thus it can be decomposed to determine its
eigensystem (again, definitions of mathematical notation are available in Table 1).
In practice, a graph might have multiple features per vertex, or in other words, multiple graph
signals (see Section B.3 for an example). We define f_k as the number of graph signals (or hidden
channels) in the k-th layer. We also define the k-th layer's hidden representation of the i-th graph
signal's element belonging to the j-th vertex as H^k_{j,i}. The goal is then to aggregate each of the f_k
graph signals like so: \sum_{i}^{f_k} U\, \Theta\, (U^T H^k_{:,i}), where the ‘:’ symbol designates that we are considering
every vertex.
The spectral convolution process is now almost complete. We firstly need to add non-linearity
to each layer of processing in the form of an activation function σ. Secondly, we must define our
forward pass with respect to subsequent layers of processing. This forms Equation 5, the forward
pass for spectral convolution. Note that we are calculating H^k_{:,j} — the k-th layer's j-th hidden channel
for every vertex — and we do this for each of the f_k hidden channels.

H^k_{:,j} = \sigma\left( \sum_{i=1}^{f_{k-1}} U\, \Theta^k\, (U^T H^{k-1}_{:,i}) \right), \qquad \text{where } j = 1, 2, ..., f_k \qquad (5)
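A direct, deliberately naive implementation of this forward pass is sketched below. One assumption on our part: each pair of input and output channels is given its own diagonal spectral filter, a common generalisation of the single Θ^k written in Equation 5.

```python
import numpy as np

def spectral_layer(U, H, Theta, f_out):
    # H has shape (N, f_in): one column per input graph signal / hidden channel.
    # Theta has shape (f_out, f_in, N): one diagonal spectral filter per
    # (output channel, input channel) pair. This is the expensive, un-approximated
    # form that later approaches (e.g. ChebNet, GCNs) simplify.
    N, f_in = H.shape
    H_out = np.zeros((N, f_out))
    for j in range(f_out):
        acc = np.zeros(N)
        for i in range(f_in):
            acc += U @ (Theta[j, i] * (U.T @ H[:, i]))   # filter channel i in the spectral domain
        H_out[:, j] = np.tanh(acc)                        # sigma: the non-linearity
    return H_out
```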
4.3.1 Limitations and Improvements. There are some obvious shortcomings with this basic form of
spectral convolution, not the least of which is having to recalculate the eigensystem if the graph's
structure is altered. This requirement precludes the use of spectral methods on dynamic graphs,
and also means that the learned filters Θ^k are domain dependent — they only work correctly on
the graph structure they were trained on [54].
These initial limitations were alleviated by ChebNet, an approach which approximates the filter
Θ^k using Chebyshev polynomials of Λ [22, 43]. This approach ensures that filters extract local
features independently of the graph. Furthermore, graph convolutional networks (or GCNs) further
reduce the computational complexity of determining the eigensystem from O(N^3) to O(M) by
introducing first-order approximations and normalization tricks to improve numerical stability
[28, 53, 54].
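In code, the resulting GCN layer is strikingly simple. The sketch below follows the commonly used first-order form with the ‘renormalisation trick’ of [22]; it is an illustration, not a drop-in replacement for any particular library implementation.

```python
import numpy as np

def gcn_layer(A, H, W):
    # One GCN layer: propagate features over the symmetrically normalized
    # adjacency matrix (with self-loops added), then apply a linear map and ReLU.
    A_hat = A + np.eye(A.shape[0])                       # add self-loops
    D_hat_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    A_norm = D_hat_inv_sqrt @ A_hat @ D_hat_inv_sqrt     # D^-1/2 (A + I) D^-1/2
    return np.maximum(0.0, A_norm @ H @ W)               # ReLU(A_norm H W)
```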
Interestingly, spectral methods which use order 𝐾 polynomials as their filters (such as ChebNet
and GCNs [22, 28, 43]) can be shown to be producing linear combinations of input signals at vertices
with a 𝐾-hop local neighborhood [39]. This allows us to interpret these spectral filtering methods as
spatial convolution methods, thus linking spectral convolution methods back to spatial convolution
methods.
classification, and graph regression). A focus is placed on identifying medium-scale datasets which
can statistically separate GNN performances, and all experimental parameters are explicitly defined.
A variety of general findings are provided based on the empirical results of said experiments. For
example, it is found that graph agnostic NNs perform poorly, i.e., multi-layer perceptron algorithms
which update vertex embeddings independently of one another (without considering the graph
structure) are not suitable for tasks in the graph domain.
Zhang et al. discuss all major methods of deep learning in the graph domain [58], and as
such include a discussion on GNNs. Uniquely, Zhang et al. include detailed discussions on Graph
Reinforcement Learning and Graph Adversarial Methods, though there is again a focus on CGNNs.
As in [54], a distinction is drawn between vertex level and graph level learning frameworks.
In [59], Zhou et al. provide a GNN taxonomy, and distinctly classify GNNs based on the type of
graph ingested, the method of training, and the method of propagation (during the forward pass).
Numerous aggregation and updater functions are formally described, compared, and classified.
Additionally, GNNs are also classified by problem domain; applications to knowledge graphs,
physical systems, combinatorial optimisation problems, graph generation, as well as text and image
based problems are considered.
In general, there has been much discussion in the literature on improving GCNs: a stand-alone
GCN was shown by [7] to have poor performance, particularly on graph-level classification tasks. It
was supposed that this was due to the isotropic nature of the edges (all edges are weighted equally).
More accurate algorithms take an anisotropic approach, such as Graph Attention Networks (GAT)
[47] and GatedGCN [3]. If the edges are weighted differently, the importance of vertices and the
state of the graph can be ‘remembered’. The covalent single and double bonds in a molecular
application of GNNs would be one example where such an anisotropic treatment of edges is appropriate. Other applications make
use of residual connections, such as the Graph Isomorphism Network (GIN) [55]. The residual
connections allow GIN to use the learned features from all layers in the prediction. GIN also makes
use of batch normalization [19], but in the graph domain.
5 GRAPH AUTOENCODERS
Graph autoencoders (GAEs) represent the application of GNNs (often CGNNs) to autoencoding.
Perhaps more than any other machine learning architecture, autoencoders (AEs) transition smoothly to
the graph domain because they come ‘prepackaged’ with the concept of embeddings. In their short
history, GAEs have led the way in unsupervised learning on graph-structured data and enabled
greater performance on supervised tasks such as vertex classification on citation networks [23].

\text{Loss}_{AE} = \lVert X - \hat{X} \rVert^2 \qquad (6)
We can perform end-to-end learning across this network to optimize the encoding and decoding
in order to strike a balance between sensitivity to inputs and generalisability — we do
not want the network to overfit and ‘memorise’ all training inputs. Additionally, the latent space
representations of an encoder-decoder network can be referred to as the embeddings of the inputs,
and they are analogous to the vertex embeddings which were discussed in earlier sections. In other
words, a latent representation is simply a learned feature vector.
Fig. 12. The architecture for a very simple standard AE. AEs take an original input, alter the dimensionality
through multiple fully connected layers, and thus convert said input into a latent space vector (this process
forms the encoder). From there, the AE attempts to reconstruct the original input (this process forms the
decoder). By minimising the reconstruction loss (see Equation 6), efficient latent space representations can be
learned. This diagram shows an AE with a latent space representation that is smaller than the input size, such
that dim(L_0) = dim(L_4) > dim(L_1) = dim(L_3) > dim(L_2). In practice, other layers such as convolutional
layers can be used in lieu of fully connected layers.
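For completeness, a minimal PyTorch sketch of such an AE. The layer sizes are arbitrary choices of ours, and the reconstruction loss is the mean squared error of Equation 6.

```python
import torch
import torch.nn as nn

class SimpleAE(nn.Module):
    """A minimal fully connected autoencoder in the spirit of Fig. 12."""
    def __init__(self, in_dim=64, hidden_dim=32, latent_dim=8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, latent_dim),           # latent space representation
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, in_dim),               # reconstructed input
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

x = torch.rand(16, 64)                                   # a batch of hypothetical inputs
model = SimpleAE()
loss = nn.functional.mse_loss(model(x), x)               # reconstruction loss (Equation 6)
loss.backward()
```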
5.1.1 Variational Autoencoders. Rather than representing inputs with single points in latent space,
variational autoencoders (VAEs) learn to encode inputs as probability distributions in latent space.
This creates a ‘smoother’ latent space that covers the full spectrum of inputs, rather than leaving
‘gaps’, where an unseen latent space vector would be decoded into a meaningless output. This
has the effect of increasing generalisation to unseen inputs and regularising the model to avoid
overfitting. Ultimately, this approach transforms the AE into a more suitable generative model.
Figure 13 shows a VAE which predicts a normal distribution 𝑁 (𝜇, 𝜎) for a given input (in practice,
the mean and covariance matrix are typically used to define these normal distributions). Unlike in
AEs — where the loss is simply the mean squared error between the input and the reconstructed
input — a VAE's loss has two terms. This is shown formally in Equation 7. The first term is the
reconstruction term we saw in Equation 6; this term still appears as it is obviously still our desire to
accurately reconstruct the provided input. The second term regularises the latent space distributions
by ensuring that they do not diverge significantly from standard normal distributions (denoted as
N(0, 1)), using Kullback–Leibler divergence (denoted as KL in Equation 7). Without this second
term, the VAE might return distributions with small variances, or high magnitude means, both of
which would cause overfitting.
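Assuming the standard two-term formulation described above, Equation 7 takes a form along the lines of

\text{Loss}_{VAE} = \lVert X - \hat{X} \rVert^2 + \mathrm{KL}\left( N(\mu, \sigma),\; N(0, 1) \right),

though the exact parameterisation and relative weighting of the two terms may differ.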
Fig. 13. This diagram depicts a simple VAE architecture. The encoder portion of this VAE takes an input, and
encodes it via multiple fully connected layers into a distribution in latent space. This distribution is then
sampled to produce a latent space vector (akin to the latent space representation in Figure 12). The decoder
portion of this VAE takes this latent space vector, and decodes it into a reconstructed input as in an AE.
Equation 7 illustrates the loss term which is minimised in a VAE architecture.
5.2.1 The Target Variable. In AEs, the target input has a well defined structure (e.g., a vector of
known length, or an image of known size), and thus the quality of its reconstruction is easily
measured using mean squared error. To measure the suitability of a reconstruction in a GAE, we
need to produce a meaningfully structured output that allows us to identify similarities between
the reconstruction and the input graph.
Fig. 14. This diagram demonstrates the high level operation of a basic GAE. Rather than ingesting a matrix
or vector, the encoder ingests graph structured data G. In this case a CGNN fulfills the role of the encoder by
creating discriminative vertex features / vertex embeddings for every vertex. This does not alter the structure
of G. As with AEs and VAEs, the decoder's role is to take these embeddings and create a data structure which
can be compared to the input. In the case of GAEs, G's adjacency matrix is reconstructed as per Section 5.2.4.
This reconstructed adjacency matrix Â (see Equation 8) is compared with the original adjacency matrix A to
create a loss term, which is then used to backpropagate errors throughout the GAE and allow training.
5.2.2 The Loss Function. In VGAEs the loss is similar to the standard VAE loss; KL-divergence is
still used to measure similarity between the predicted and true distributions, however binary
cross-entropy is now used to measure the difference between the predicted adjacency matrix and the true
adjacency matrix, which replaces the reconstruction component of the standard AE loss. Alternative
methods include using distance in Wasserstein space (this is known as the Wasserstein metric, and it
quantifies the cost associated with making one probability distribution equal to another) rather than
KL-divergence [60], L2-reconstruction loss, Laplacian eigenmaps, and the ranking loss [2].
5.2.3 Encoders. Before the seminal work in [23] on VGAEs, a number of deep GAEs had been
developed for unsupervised training on graph-structured data, including Deep Neural Networks for
Graph Representations (DNGR) [6] and Structural Deep Network Embedding (SDNE) [50]. Because these
methods operate on only the adjacency matrix, information about both the entire graph and the
local neighbourhoods was lost. More recent work mitigates this by using an encoder that aggregates
information from a vertex's local neighbourhood to learn latent vector representations. For example,
[35] proposes a linear encoder that uses a single weight matrix to aggregate information from each
vertex's one-step local neighbourhood, showing competitive performance on numerous benchmarks.
Despite this, typical GAEs use more complex encoders — primarily CGNNs — in order to capture
nonlinear relationships in the input data and larger local neighbourhoods [2, 6, 23, 33, 44, 45, 50, 56].
5.2.4 Decoders. Standard GNNs, regardless of their aggregation functions or other details, typically
generate a matrix containing the embedding of each vertex. GAEs apply a decoder to this encoded
representation of each vertex in order to reconstruct the original graph. For this reason, the decoder
aspect of GAEs is commonly referred to as the generative model. In many variations of GAEs [23],
the decoder is a simple inner product of the latent variables. As the inner product of two (normalised) vectors
is equivalent to calculating their cosine similarity, the higher the result of this product, the more
likely it is that those vertices are connected. With this prediction of vertex similarity, the graph's adjacency
matrix can be predicted solely from the embeddings of all of the vertices in the graph.
More formally, the encoder produces a vertex embedding 𝑧𝑖 for each vertex v𝑖 , and we then
compute the cosine similarity for each pair (𝑧𝑖 , 𝑧 𝑗 ) which gives us an approximation for the distance
between v𝑖 and v 𝑗 (after adding non-linearity via an activation function 𝜎). By computing all pairwise
distance approximations, we have produced an approximation of the adjacency matrix Â. This can
be compared to A to determine a reconstruction error, which can then be backpropagated through
the GAE during training. As with AEs and VAEs, GAEs can thus be trained in an unsupervised
manner, since a loss value can be calculated using only unlabelled instances (a concept which is
explored via a worked example in Appendices B.3).
\hat{A}_{i,j} = \sigma(z_i^T z_j) \qquad (8)
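A sketch of this inner-product decoder (Equation 8), with a logistic sigmoid as σ; the embeddings below are random placeholders.

```python
import numpy as np

def inner_product_decoder(Z):
    # Reconstruct an adjacency matrix from vertex embeddings Z (N x d), as in
    # Equation 8: A_hat[i, j] = sigma(z_i^T z_j).
    logits = Z @ Z.T                       # all pairwise inner products at once
    return 1.0 / (1.0 + np.exp(-logits))   # sigmoid activation

Z = np.random.rand(5, 3)                   # hypothetical embeddings for 5 vertices
A_hat = inner_product_decoder(Z)           # compare against A with binary cross-entropy
```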
5.3.1 Further Reading. Some notable developments in GAEs include variational graph autoen-
coders (VGAEs) [23] (this architecture is discussed in our worked example in Appendices B.3 and
implemented in the associated code), Graph2Gauss (G2G) [2] and Semi-implicit Graph Variational
Autoencoders (SIG-VAE) [15], which further apply the concept of embeddings as probability distri-
butions. In addition, in both [56] and [33], regularization is achieved through adversarial training. In
[16], graph embeddings are taken further still, attempting to account for noise in vertex attributes
with a denoising attribute autoencoder.
Other applications of AEs to graphs have explored alternative encoders, such as LSTMs, in
attempts to more appropriately define a vertex’s local neighbourhood and generate embeddings
[44, 56]. With a similar perspective to VGAEs, the work described in [2] embeds vertices as Gaussian
distributions, rather than point vectors, in order to improve generalisation to unseen vertices and
capture uncertainty about each embedding that may be useful to downstream machine learning
tasks.
Experiments with alternative decoders have shown promising results in the improvement of GAEs.
For example, [15] use a Bernoulli-Poisson link decoder that enables greater generative flexibility
and application to sparse real-world graphs. While many GAEs can perform link prediction on
undirected graphs, [36] move away from symmetric decoders such as inner-product in order to
perform link prediction on directed graphs. This method draws inspiration from physics, particularly
classical mechanics, analogising the concepts of mass and acceleration between bodies with edge
directions in a graph.
6 CONCLUSION
The development of GNNs has accelerated hugely in recent years due to increased interest
in exploring unstructured data and developing general AI architectures. Consequently, GNNs
have emerged as a successful and highly performant branch of algorithms for handling such data.
Ultimately, GNNs represent both a unique and novel approach to dealing with graphs as input data,
and this capability exposes graph-based datasets to deep learning approaches, thus supporting the
development of highly general ML and AI.
ACKNOWLEDGMENTS
We thank our colleague Richard Pienaar for providing feedback which greatly improved this work.
REFERENCES
[1] [n.d.]. Classification: Precision and Recall | Machine Learning Crash Course. https://developers.google.com/machine-
learning/crash-course/classification/precision-and-recall
[2] Aleksandar Bojchevski and Stephan Günnemann. 2018. Deep Gaussian Embedding of Graphs: Unsupervised Inductive
Learning via Ranking. arXiv: Machine Learning (2018).
[3] Xavier Bresson and Thomas Laurent. 2017. Residual Gated Graph ConvNets. arXiv:cs.LG/1711.07553
[4] M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst. 2017. Geometric Deep Learning: Going beyond
Euclidean data. IEEE Signal Processing Magazine 34, 4 (2017), 18–42.
[5] Cameron Buckner and James Garson. 2019. Connectionism. In The Stanford Encyclopedia of Philosophy (fall 2019 ed.),
Edward N. Zalta (Ed.). Metaphysics Research Lab, Stanford University.
[6] Shaosheng Cao, Wei Lu, and Qiongkai Xu. 2016. Deep Neural Networks for Learning Graph Representations. In
Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI’16). AAAI Press, 1145–1152.
[7] Hong Chen and Hisashi Koga. 2019. GL2vec: Graph Embedding Enriched by Line Graphs with Edge Features. In
ICONIP.
[8] Pim de Haan, Taco Cohen, and Max Welling. 2020. Natural Graph Networks. arXiv:cs.LG/2007.08349
[9] Nathan de Lara and Edouard Pineau. 2018. A Simple Baseline Algorithm for Graph Classification. CoRR abs/1810.09155
(2018). arXiv:1810.09155 http://arxiv.org/abs/1810.09155
[10] Vijay Prakash Dwivedi, Chaitanya K. Joshi, Thomas Laurent, Yoshua Bengio, and Xavier Bresson. 2020. Benchmarking
Graph Neural Networks. arXiv:cs.LG/2003.00982
[11] Matthias Fey and Jan E. Lenssen. 2019. Fast Graph Representation Learning with PyTorch Geometric. In ICLR Workshop
on Representation Learning on Graphs and Manifolds.
[12] Yulan Guo, Hanyun Wang, Qingyong Hu, Hao Liu, Li Liu, and Mohammed Bennamoun. 2019. Deep Learning for 3D
Point Clouds: A Survey. arXiv:cs.CV/1912.12033
[13] William L. Hamilton, Rex Ying, and Jure Leskovec. 2017. Inductive Representation Learning on Large Graphs. CoRR
abs/1706.02216 (2017). arXiv:1706.02216 http://arxiv.org/abs/1706.02216
[14] William L. Hamilton, Rex Ying, and Jure Leskovec. 2017. Representation Learning on Graphs: Methods and Applications.
CoRR abs/1709.05584 (2017). arXiv:1709.05584 http://arxiv.org/abs/1709.05584
[15] Arman Hasanzadeh, Ehsan Hajiramezanali, Nick Duffield, Krishna R. Narayanan, Mingyuan Zhou, and Xiaoning Qian.
2019. Semi-Implicit Graph Variational Auto-Encoders. arXiv:cs.LG/1908.07078
[16] Bhagya Hettige, Weiqing Wang, Yuan-Fang Li, and Wray Buntine. 2020. Robust Attribute and Structure Preserving
Graph Embedding. In Advances in Knowledge Discovery and Data Mining, Hady W. Lauw, Raymond Chi-Wing Wong,
Alexandros Ntoulas, Ee-Peng Lim, See-Kiong Ng, and Sinno Jialin Pan (Eds.). Springer International Publishing, Cham,
593–606.
[17] G. E. Hinton and R. R. Salakhutdinov. 2006. Reducing the Dimensionality of Data with Neural Networks. Science 313, 5786 (2006), 504–507. https://doi.org/10.1126/science.1127647 arXiv:https://science.sciencemag.org/content/313/5786/504.full.pdf
[18] Weihua Hu*, Bowen Liu*, Joseph Gomes, Marinka Zitnik, Percy Liang, Vijay Pande, and Jure Leskovec. 2020. Strategies
for Pre-training Graph Neural Networks. In International Conference on Learning Representations. https://openreview.
net/forum?id=HJlWWJSFDH
[19] Sergey Ioffe and Christian Szegedy. 2015. Batch Normalization: Accelerating Deep Network Training by Reducing
Internal Covariate Shift. CoRR abs/1502.03167 (2015). arXiv:1502.03167 http://arxiv.org/abs/1502.03167
[20] John J. Irwin, Teague Sterling, Michael M. Mysinger, Erin S. Bolstad, and Ryan G. Coleman. 2012. ZINC: A Free
Tool to Discover Chemistry for Biology. Journal of Chemical Information and Modeling 52, 7 (2012), 1757–1768.
https://doi.org/10.1021/ci3001277 arXiv:https://doi.org/10.1021/ci3001277 PMID: 22587354.
[21] M. A. Khamsi and William A. Kirk. 2001. An introduction to metric spaces and fixed point theory. Wiley.
[22] Thomas N. Kipf and Max Welling. 2016. Semi-Supervised Classification with Graph Convolutional Networks. CoRR
abs/1609.02907 (2016). arXiv:1609.02907 http://arxiv.org/abs/1609.02907
[23] Thomas N. Kipf and Max Welling. 2016. Variational Graph Auto-Encoders. arXiv:stat.ML/1611.07308
[24] Alex Krizhevsky et al. 2009. Learning multiple layers of features from tiny images. (2009).
[25] Q. V. Le. 2013. Building high-level features using large scale unsupervised learning. In 2013 IEEE International Conference
on Acoustics, Speech and Signal Processing. 8595–8598.
[26] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. 1998. Gradient-based learning applied to document recognition. Proc.
IEEE 86, 11 (1998), 2278–2324.
[27] Yann LeCun and Corinna Cortes. 2010. MNIST handwritten digit database. http://yann.lecun.com/exdb/mnist/. (2010).
http://yann.lecun.com/exdb/mnist/
[28] Ron Levie, Federico Monti, Xavier Bresson, and Michael M. Bronstein. 2017. CayleyNets: Graph Convolutional Neural Networks with Complex Rational Spectral Filters. CoRR abs/1705.07664 (2017). arXiv:1705.07664 http://arxiv.org/abs/1705.07664
[29] Bing Li and G. Jogesh Babu. 2019. Convolution Theorem and Asymptotic Efficiency. In A Graduate Course on Statistical
Inference. Springer New York, New York, NY, 295–327. https://doi.org/10.1007/978-1-4939-9761-9_10
[30] Andreas Loukas. 2019. What graph neural networks cannot learn: depth vs width. CoRR abs/1907.03199 (2019).
arXiv:1907.03199 http://arxiv.org/abs/1907.03199
[31] Annamalai Narayanan, Mahinthan Chandramohan, Rajasekar Venkatesan, Lihui Chen, Yang Liu, and Shantanu Jaiswal.
2017. graph2vec: Learning Distributed Representations of Graphs. CoRR abs/1707.05005 (2017). arXiv:1707.05005
http://arxiv.org/abs/1707.05005
[32] Antonio Ortega, Pascal Frossard, Jelena Kovačević, José MF Moura, and Pierre Vandergheynst. 2018. Graph signal
processing: Overview, challenges, and applications. Proc. IEEE 106, 5 (2018), 808–828.
[33] Shirui Pan, Ruiqi Hu, Guodong Long, Jing Jiang, Lina Yao, and Chengqi Zhang. 2018. Adversarially Regularized Graph
Autoencoder for Graph Embedding. arXiv:cs.LG/1802.04407
[34] Benedek Rozemberczki, Oliver Kiss, and Rik Sarkar. 2020. Karate Club: An API Oriented Open-source Python Framework
for Unsupervised Learning on Graphs. In Proceedings of the 29th ACM International Conference on Information and
Knowledge Management (CIKM ’20). ACM.
[35] Guillaume Salha, Romain Hennequin, and Michalis Vazirgiannis. 2019. Keep It Simple: Graph Autoencoders Without
Graph Convolutional Networks. arXiv:cs.LG/1910.00942
[36] Guillaume Salha, Stratis Limnios, Romain Hennequin, Viet-Anh Tran, and Michalis Vazirgiannis. 2019. Gravity-Inspired Graph Autoencoders for Directed Link Prediction. CoRR abs/1905.09570 (2019). arXiv:1905.09570 http://arxiv.org/abs/1905.09570
[37] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini. 2009. The Graph Neural Network Model. IEEE
Transactions on Neural Networks 20, 1 (Jan 2009), 61–80. https://doi.org/10.1109/TNN.2008.2005605
[38] Suvash Sedhain, Aditya Krishna Menon, Scott Sanner, and Lexing Xie. 2015. AutoRec: Autoencoders Meet Collaborative
Filtering. In Proceedings of the 24th International Conference on World Wide Web (WWW ’15 Companion). Association
for Computing Machinery, New York, NY, USA, 111–112. https://doi.org/10.1145/2740908.2742726
[39] David I. Shuman, Sunil K. Narang, Pascal Frossard, Antonio Ortega, and Pierre Vandergheynst. 2012. Signal Processing
on Graphs: Extending High-Dimensional Data Analysis to Networks and Other Irregular Data Domains. CoRR
abs/1211.0053 (2012). arXiv:1211.0053 http://arxiv.org/abs/1211.0053
[40] David I Shuman, Sunil K Narang, Pascal Frossard, Antonio Ortega, and Pierre Vandergheynst. 2013. The emerging field
of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains.
IEEE signal processing magazine 30, 3 (2013), 83–98.
[41] Tomas Simon, Hanbyul Joo, Iain A. Matthews, and Yaser Sheikh. 2017. Hand Keypoint Detection in Single Images
using Multiview Bootstrapping. CoRR abs/1704.07809 (2017). arXiv:1704.07809 http://arxiv.org/abs/1704.07809
[42] Ljubisa Stankovic, Danilo P Mandic, Milos Dakovic, Ilia Kisil, Ervin Sejdic, and Anthony G Constantinides. 2019.
Understanding the basis of graph signal processing via an intuitive example-driven approach [lecture notes]. IEEE
Signal Processing Magazine 36, 6 (2019), 133–145.
[43] Shanshan Tang, Bo Li, and Haijun Yu. 2019. ChebNet: Efficient and Stable Constructions of Deep Neural Networks
with Rectified Power Units using Chebyshev Approximations. arXiv:cs.LG/1911.05467
[44] Ke Tu, Peng Cui, Xiao Wang, Philip S. Yu, and Wenwu Zhu. 2018. Deep Recursive Network Embedding with Regular Equivalence. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD ’18). Association for Computing Machinery, New York, NY, USA, 2357–2366. https://doi.org/10.1145/3219819.3220068
[45] Rianne van den Berg, Thomas N. Kipf, and Max Welling. 2017. Graph Convolutional Matrix Completion.
arXiv:stat.ML/1706.02263
[46] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. CoRR abs/1706.03762 (2017). arXiv:1706.03762 http://arxiv.org/abs/1706.03762
[47] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. 2017. Graph
Attention Networks. arXiv:stat.ML/1710.10903
[48] Saurabh Verma and Zhi-Li Zhang. 2017. Hunt For The Unique, Stable, Sparse And Fast Feature Learning On Graphs.
In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus,
S. Vishwanathan, and R. Garnett (Eds.). Curran Associates, Inc., 88–98. http://papers.nips.cc/paper/6614-hunt-for-the-
unique-stable-sparse-and-fast-feature-learning-on-graphs.pdf
[49] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. 2008. Extracting and Composing Robust
Features with Denoising Autoencoders. In Proceedings of the 25th International Conference on Machine Learning (ICML
’08). Association for Computing Machinery, New York, NY, USA, 1096–1103. https://doi.org/10.1145/1390156.1390294
[50] Daixin Wang, Peng Cui, and Wenwu Zhu. 2016. Structural Deep Network Embedding. In Proceedings of the 22nd ACM
SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’16). Association for Computing
Machinery, New York, NY, USA, 1225–1234. https://doi.org/10.1145/2939672.2939753
[51] Minjie Wang, Lingfan Yu, Da Zheng, Quan Gan, Yu Gai, Zihao Ye, Mufei Li, Jinjing Zhou, Qi Huang, Chao Ma, Ziyue
Huang, Qipeng Guo, Hao Zhang, Haibin Lin, Junbo Zhao, Jinyang Li, Alexander J Smola, and Zheng Zhang. 2019.
Deep Graph Library: Towards Efficient and Scalable Deep Learning on Graphs. ICLR Workshop on Representation
Learning on Graphs and Manifolds (2019). https://arxiv.org/abs/1909.01315
[52] Isaac Ronald Ward, Hamid Laga, and Mohammed Bennamoun. 2019. RGB-D image-based Object Detection: from Traditional Methods to Deep Learning Techniques. CoRR abs/1907.09236 (2019). arXiv:1907.09236 http://arxiv.org/abs/1907.09236
[53] Felix Wu, Tianyi Zhang, Amauri H. Souza Jr., Christopher Fifty, Tao Yu, and Kilian Q. Weinberger. 2019. Simplifying
Graph Convolutional Networks. CoRR abs/1902.07153 (2019). arXiv:1902.07153 http://arxiv.org/abs/1902.07153
[54] Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and Philip S. Yu. 2019. A Comprehensive
Survey on Graph Neural Networks. CoRR abs/1901.00596 (2019). arXiv:1901.00596 http://arxiv.org/abs/1901.00596
[55] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. 2018. How Powerful are Graph Neural Networks? CoRR
abs/1810.00826 (2018). arXiv:1810.00826 http://arxiv.org/abs/1810.00826
[56] Wenchao Yu, Cheng Zheng, Wei Cheng, Charu C. Aggarwal, Dongjin Song, Bo Zong, Haifeng Chen, and Wei Wang.
2018. Learning Deep Network Representations with Adversarially Regularized Autoencoders. In Proceedings of the 24th
ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD ’18). Association for Computing
Machinery, New York, NY, USA, 2663–2671. https://doi.org/10.1145/3219819.3220000
[57] Si Zhang, Hanghang Tong, Jiejun Xu, and Ross Maciejewski. 2019. Graph convolutional networks: a comprehensive
review. Computational Social Networks 6, 1 (10 Nov 2019), 11. https://doi.org/10.1186/s40649-019-0069-y
[58] Ziwei Zhang, Peng Cui, and Wenwu Zhu. 2018. Deep Learning on Graphs: A Survey. CoRR abs/1812.04202 (2018).
arXiv:1812.04202 http://arxiv.org/abs/1812.04202
[59] Jie Zhou, Ganqu Cui, Zhengyan Zhang, Cheng Yang, Zhiyuan Liu, and Maosong Sun. 2018. Graph Neural Networks: A
Review of Methods and Applications. CoRR abs/1812.08434 (2018). arXiv:1812.08434 http://arxiv.org/abs/1812.08434
[60] Dingyuan Zhu, Peng Cui, Daixin Wang, and Wenwu Zhu. 2018. Deep Variational Network Embedding in Wasserstein
Space. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD
’18). Association for Computing Machinery, New York, NY, USA, 2827–2836. https://doi.org/10.1145/3219819.3220052
A PERFORMANCE METRICS
In this section we explicitly define the performance metrics used in our practical examples, before
presenting the practical examples themselves in Section B. For more detail on these performance
metrics, we recommend the reader visit Google’s Machine Learning Crash Course [1].
In classification, we define the positive class as the class which is relevant or currently being
considered, and the negative class (or classes) as the remaining class(es). It is customary to label
what we are interested in as the positive class. For example, in the case of a machine learning model
which predicts if patients have tumors, the positive class would correspond to a patient having a
tumor present, and the negative class would correspond to a patient not having a tumor. It then
follows that:
• a true positive (TP) is where the actual class is positive and the prediction is correct. For
example, the patient has a tumor and the model predicts that they have a tumor.
• a false positive (FP) is where the actual class is negative and the prediction is incorrect.
For example, the patient does not have a tumor and the model predicts that they do have a
tumor.
• a true negative (TN) is where the actual class is negative and the prediction is correct. For
example, the patient does not have a tumor and the model predicts that they do not have a
tumor.
• a false negative (FN) is where the actual class is positive and the prediction is incorrect.
For example, the patient has a tumor and the model predicts that they do not have a tumor.
Accuracy is the fraction of predictions which were correct (when compared to their ground truth
labels) out of all the predictions made.
\text{Precision} = \frac{TP}{TP + FP} \qquad (10)

\text{Recall} = \frac{TP}{TP + FN} \qquad (11)
The true positive rate (TPR) and false positive rate (FPR) are defined as follows. Note that the
TPR is the same as recall.
\text{TPR} = \frac{TP}{TP + FN} \qquad (12)

\text{FPR} = \frac{FP}{FP + TN} \qquad (13)
One can then plot the TPR against the FPR while varying the threshold at which a classification
is considered a positive prediction. The resulting plot is the receiver operating characteristic
(ROC) curve. A sample ROC curve is plotted in Figure 15. The area under the ROC curve (AUC)
is the integral of the ROC curve between the bounds 0 and 1 (see Figure 15). It is a useful metric as
it is invariant to the classification threshold and scale. In essence, it is an aggregate measure of a
model at all classification thresholds.
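These quantities are straightforward to compute. The sketch below uses made-up labels and scores, and sweeps the classification threshold in the way the ROC construction describes.

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    """TP, FP, TN, FN for binary labels (1 = positive class)."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return tp, fp, tn, fn

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])                   # made-up ground truth labels
scores = np.array([0.9, 0.4, 0.7, 0.3, 0.2, 0.6, 0.8, 0.1])   # made-up model confidences

# Sweep the classification threshold to trace out points on the ROC curve.
for t in [0.25, 0.5, 0.75]:
    tp, fp, tn, fn = confusion_counts(y_true, (scores >= t).astype(int))
    precision = tp / (tp + fp)            # Equation 10
    recall = tp / (tp + fn)               # Equation 11 (also the TPR, Equation 12)
    fpr = fp / (fp + tn)                  # Equation 13
    print(f"t={t:.2f}  precision={precision:.2f}  recall={recall:.2f}  FPR={fpr:.2f}")
```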
Fig. 15. A demonstration of how the ROC is plotted with respect to the TPR and FPR at varying classification
thresholds. The AUC is also illustrated as the shaded area under the receiver operating characteristic.
Average precision is the average value of the precision metric over the interval where recall
= 0 to recall = 1. It is also equivalent to measuring the area under the precision-recall curve [52].
\text{Average precision} = \int_0^1 p(r)\, dr \qquad (14)
B PRACTICAL EXAMPLES
In the following worked examples, we concisely present the results of an investigation involving
GNNs. Code for each of these examples can be accessed online4 .
Fig. 18. TSNE renderings of final hidden graph representations for the x1, x2, x4, x8 hidden layer networks.
Note that with more applications of the transition function (equivalent to more layers in a NN) the final
hidden representations of the input graphs become more linearly separable into their classes.
Here, our transition function f was a ‘feedforward NN’ with just one layer, so more advanced NN (or
other) implementations of f might result in more performant RGNNs. As more rounds of the transition function
were applied to our hidden states, the performance — and required computation — increased. Ensuring a
consistent number of transition function applications is key in developing simplified GNN architectures, and
in reducing the amount of computation required in the transition stage. We will explore how this improved
concept is realised through CGNNs in Section 4.1.