Practical Statistics For Geoscientists
Online Edition
2012
Contents
3.2.1 An example: the confidence interval on a mean
4 Hypothesis testing
4.1 Hypotheses and hypothesis testing
4.1.1 Significance levels
4.1.1.1 Important points concerning significance levels
4.2 An example: Meltwater particles
4.2.1 Review of the F-test
4.3 Now it's your turn: Mesozoic belemnites
4.4 Other hypothesis tests
4.4.1 Meltwater particles revisited
7.2.2 Disadvantages of dendrograms
7.3 Deterministic k-means clustering
7.3.1 Visualizing cluster solutions
7.3.2 What input data should you use?
7.3.3 How many clusters should be included in a model
7.3.3.1 Silhouette plots
7.4 Now it's your turn: Portuguese rocks
10.4 More advanced texts
Years of geological research and exploration using traditional methods have discovered a lot of relatively obvious theoretical principles and economic deposits; we have to use more sophisticated and sensitive techniques to uncover what remains!
Swan and Sandilands
The overarching aim of the course is broader. Hopefully you'll see how useful statistics are, and after the course you'll have the confidence to use statistics independently and to apply the methods most appropriate to your research problems.
Some of the recommended textbooks can also be downloaded for free thanks to the ANU's library subscription; I've marked these in the reading list.
R is an interpreted language, which means we can work step by step through a series of commands without the need for compiling complete sets of code.
R is based around so-called inbuilt functions that provide quick access to thousands of different data processing methods.
R is the de facto standard language of the statistics community, so for almost any statistical task you have, there will be R code available to perform it.
R contains versatile graphics libraries that allow you to view your data in a variety of ways (visualization is an important part of statistics).
1.4.1 Installing R
If you don’t have R installed then you just need to go to the home page:
http://www.r-project.org/
and download the appropriate version for your computer. You can then install R just like you
would install any other piece of software. At certain points during the course we'll need to use special R packages to perform certain tasks. These packages can also be downloaded and installed using a single command in R (you'll need an active internet connection). Once you have R installed and you start the software, you'll see a screen that looks something like the one in Figure 1.1.
Figure 1.1: A screenshot of the basic R installation. You can interact with R by typing commands
at the > prompt in the console.
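Installing a package only takes a single command at the prompt. As a sketch (the package name here is purely illustrative, not one required by the course):

> # download and install an add-on package from CRAN (needs an internet connection)
> install.packages("moments")
> # load the installed package into the current R session
> library(moments)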
1.4.2 RStudio
As you’ll see, the R interface is pretty simple, which limits how efficiently you can interact with it.
A recently developed piece of software called Rstudio provides a more complete way of interacting
with R and allows scripts to be written and then executed, existing variables to be displayed on
screen, etc (Figure 1.2). If you can download Rstudio from:
http://rstudio.org/download/
and getting it working on your system it will certainly enhance the useability of your R instal-
lation. Don’t worry if you don’t want to install RStudio (or can’t get it to work) everything we
will do in the course can be easily be performed in the standard version of R.
Figure 1.2: A screenshot of RStudio. The console lets you interact with R more easily and provides extra information such as the variables in the memory and the available packages.
Example code: 1
> 1 + 1
Hopefully R will give you the correct answer; if you want to confirm this, look at Figure 1.3, which shows the mathematical proof by Whitehead and Russell (1910) of the calculation we've just done.
Figure 1.3: That’s a relief, mathematicians agree that 1+1 does indeed equal 2.
Sometimes I’ll also include the answers that R writes to the screen within the gray boxes. In
such cases you can see that this is output rather than a command because the prompt symbol is
missing. For example, repeating the calculation from above.
Example code: 2
> 1 + 1
[1] 2
Finally, for most of the commands I have also included comments. A comment is a piece of
text that can be included in the command, but which will be ignored by R. This might seem a bit
pointless, but comments provide a way in which to annotate commands and provide an explanation
of what they are doing. The comment symbol in R is #, which means any text after the # will be
ignored by R.
Example code: 3
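> # anything after the # symbol is a comment and is ignored by R
> 1 + 1 # add one and one (an illustrative comment)
[1] 2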
The combination of some data and an aching desire for an answer
does not ensure that a reasonable answer can be extracted from
a given body of data.
John Tukey
Figure 2.1: The orbit of Ceres.
After just 42 days Ceres disappeared behind the Sun, when only 19 imprecise observations of its path had been made. Based on this scant amount of data Giuseppe Piazzi made predictions of when and where Ceres would reappear out of the Sun's glare. However, Ceres didn't reappear where expected and the new planet was lost. The 23-year-old German, Carl Friedrich Gauss, heard of this problem and, using statistical methods he had developed when he was just 18, extracted sufficient information from the existing observations to make a prediction of the position of Ceres based on Kepler's second law of planetary motion (Figure 2.2).
Figure 2.2: A (very poor) reproduction of the sketch in Gauss’ notebook that shows his calculated
orbit for Ceres. The Sun is positioned in the center of the sketch and the orbit of Ceres is connected
to the Sun by two straight-lines.
The prediction Gauss made was close enough that Ceres was found again, and it made him a scientific celebrity. Gauss' key insight was that the observations of Ceres' motion would be normally (bell-shaped) distributed, such that observations towards the middle of the distribution should be considered more reliable than those towards the extremes. Maybe this idea seems obvious to us now, but at the time this statistical insight allowed Gauss to identify patterns in the Ceres data that everyone else had missed.
The normal distribution, also known (unsurprisingly) as the Gaussian distribution, appears commonly in nature and we'll be looking at it in more detail later.
To consult the statistician after an experiment is finished is often merely to ask him to conduct a
post-mortem examination. He can perhaps say what the experiment died of.
2.1.4 Do you need statistics?
As you might have guessed from the three examples above, I’m going to suggest that you do need
statistics. I can’t think of any field in the geosciences where, at the most basic level, information is
stored in some form other than numbers. Therefore we have to be comfortable dealing with large
(sometimes massive) numerical data sets.
As we saw in the example of Fisher at the Agricultural College, if we don't think in a statistical way we can easily end up wasting our time. Imagine that you suddenly discovered that your work had been in vain because, rather than performing a proper statistical analysis, you had formed conclusions on the basis of your own subjective reading of the data. Our brains seem to be very good at spotting patterns; sometimes this is useful, but in data analysis we can often convince ourselves that patterns exist when they really don't. There's not much we can do about this; it just seems to be the way our brains are wired. But to make advances in science and to limit the mistakes we make, we should take an objective (i.e., statistical) approach rather than relying upon subjective intuition. To try to convince you how bad our intuition can be, I'm going to give two examples.
Figure 2.3: Which set of data points is distributed randomly: the ones in the left panel or the ones on the right?
Suppose you’re on a game show, and you’re given the choice of three doors: Behind one door
is a car; behind the others, goats. You pick a door, say A, and the host, who knows what’s behind
the doors, opens another door, say C, which has a goat. He then says to you, “Do you want to
pick door B?” Is it to your advantage to switch your choice?
Figure 2.4: In search of a new car, the player picks a door, say A. The game show host then opens
one of the other doors, say C, to reveal a goat and offers to let the player pick door B instead of
door A.
So here’s the big question, if you want to maximize your probability of winning a car rather
than a goat, should you stick with the door you selected first, switch to the alterative door, or
doesn’t it make any difference which door you select?
Imagine I arrive in Canberra, go to a major street and write down the numbers of the first 5 taxis that I spot. For example (once I've sorted them into numerical order):
73, 179, 280, 405, 440
Based on this sample, how many taxis, N, are there in the city? Here we are using a sample (our 5 taxi numbers) to draw inferences concerning a population (all the taxis in the city). Clearly the problem cannot be solved exactly with the provided information. However, we can make an estimate of N using the information we have, some necessary assumptions, and some simple statistics. The key point is that unless we are very lucky, our estimate of N will be incorrect. Don't think of statistics in terms of right or wrong answers; think of statistics in terms of better or worse answers. So what we are interested in is a good estimate of the total number of taxis.
If we just take the taxi numbers at face value we can see the problems with forming a good estimate. Given the numbers above, one person may say:
The highest taxi number is 440, therefore I can say that there must be at least 440 taxis in Canberra.
To which we could respond: what if some taxi numbers in the sequence between 1 and 440 are missing (i.e., the taxis are not numbered in a continuous sequence)? Then there could be fewer than 440. Or alternatively: what if some taxis share the same number? Then there could be more than 440.
A more cautious person may simply say:
I've seen 5 taxis on the street, therefore there must be at least 5 taxis in Canberra.
To which we could ask:
What if the driver of a taxi simply keeps driving around the block and each time gets out of
their taxi and swaps the licence plate with another they have stored in the car? That way you could
see 5 different licence plate numbers, but it would always be the same taxi.
Okay, this last argument might be unreasonable, but it does show that we can come up with
arguments that will only allow us to estimate that the total number of taxis in Canberra is 1. It
goes without saying that 1 is a very bad estimate of the population. Therefore we can see that
simply spotting 5 taxis doesn’t help too much in estimating the size of the population unless we
make some basic assumptions about how the taxi numbers behave. Such assumptions could be: the taxis are numbered in a continuous sequence starting at 1 (no numbers are missing); no two taxis share the same number; and the taxis we happened to spot are a random selection from the whole fleet.
These simple assumptions will form the basis of our assessment. It is important to realize that if
our assumptions are incorrect or we’ve missed a key assumption then our estimate may be poor.
There are two simple ways we can estimate the number of taxis: the median estimate and the extreme estimate. In the median estimate we find the middle number in the sequence, which is 280, and note that the difference between the median value of the sample and the assumed first value of the population (taxi number 1) is 279. We then estimate that the largest taxi number in the population lies the same distance above the median, giving 280 + 279 = 559 (Figure 2.5). Notice that this approach employs all of the assumptions listed above.
Median estimate = 2 × median − 1 = 2 × 280 − 1 = 559
Figure 2.5: Schematic diagram showing how the median estimate is formed for the taxi number
problem.
The extreme estimate looks at the edges of the data rather than the center. The lowest number
in our sample is 73 therefore there is a difference of 72 between the first number in our sample
and the assumed lowest number in the population (taxi number 1). We then look at the highest
number in our sample and add 72 to make an estimate of the population size, so 440+72 = 512
(Figure 2.6).
Figure 2.6: Schematic diagram showing how the extreme estimate is formed for the taxi number problem.
You can see that our two estimates are different, and of course if we repeated the experiment we would collect 5 different numbers in our sample and obtain different estimates for the population. The key point, however, is that we are using a sample to draw inference concerning the population. To do this we make estimates that rely on assumptions. If we have bad assumptions then our estimate (unless we are very lucky) will also be bad. There may be more than one method with which to make an estimate, and these methods cannot be expected to yield the same result (although you would hope that they are consistent).
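Both estimates are easy to reproduce in R (a quick sketch, not one of the course's numbered examples, using the five taxi numbers above):

> taxis = c(73, 179, 280, 405, 440)
> 2 * median(taxis) - 1            # median estimate
[1] 559
> max(taxis) + (min(taxis) - 1)    # extreme estimate
[1] 512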
It may seem that the taxi number problem is a trivial example and that we could simply telephone the taxi company and ask them how many taxis they have. This was, however, an important problem in World War II, when the Allies needed to estimate how many missiles the German forces had at their disposal. In order to keep track of their weapons the Germans painted sequential serial numbers on the outside of their missiles. I'm guessing that in this situation Eisenhower couldn't telephone Hitler and ask him how many missiles he had, so instead soldiers were ordered to record the serial numbers of any German missiles that they captured. These numbers were returned for analysis and predictions of the total number of German missiles could be made. Out of interest, the median estimate outperforms the extreme estimate, and it didn't take long before the Germans stopped painting sequential numbers on the sides of their missiles.
The Literary Digest poll predicted a clear win for the Republican candidate, Alf Landon. On election day, however, the Democratic candidate Franklin D. Roosevelt won with a landslide victory. So why was the result of the Literary Digest poll so wrong?
Figure 2.7: The front cover of the issue of the Literary Digest in which they announced the result
of the election poll.
The simple answer is that the sample of people surveyed by the Literary Digest was not representative of the population as a whole. The election took place during the Great Depression, and lists of people who had magazine subscriptions and owned cars and telephones were biased towards the middle classes with higher than average incomes. In general, the middle classes favored the Republican Party, hence the result of the poll suggested a win for Alf Landon. This is a clear example of a nonrepresentative sample leading to a poor statistical estimate. Contrastingly, George Gallup performed a similar poll for the same election, which involved a much smaller sample size, but selected the voters specifically to obtain a demographically representative sample. On the basis of his poll, Gallup predicted the outcome of the election correctly.
Throughout the course, when I use the word sample I will normally mean a statistical sample rather than a geological sample. You'll notice that when we discuss geological samples I will use words like specimen to try to avoid any confusion.
The Hypothetical Population corresponds to the complete geological entity (Figure 2.8).
In some cases the hypothetical population only exists in theory because parts may have been lost
to erosion, etc. The Available Population represents the existing parts of the geological entity.
Finally the Accessible Population is the material which can be collected to form a sample and
therefore is used to represent the entity. Given that one of the main aims of statistics is to use
statistical samples in order to draw conclusions concerning populations it is essential to consider
if the accessible population is representative of the hypothetical and available populations.
Figure 2.8: A cross-section showing the different populations that can be considered in a geological
scenario.
Percentage data giving a relative abundance of the taxa.
Concentrations of each taxon.
Certain statistical approaches can only be applied to specific forms of data. Therefore the type
of statistical technique we must use will depend on the type of data that we have available. It is
important to consider what type of data you have and what limitations that places on you. In
this section we’ll look quickly at some of the different types of data and the problems that may be
associated with them.
Figure 2.9: The number of earthquakes per hour is a discrete data set (clearly you cannot have half
an earthquake).
2.4.3 Ordinal data
With ordinal data the values are used to denote a position within a sequence. This allows a
qualitative, but not quantitative, rank order. A classical example of ordinal data is Mohs scale of
mineral hardness.
(340° + 20°) / 2 = 180°
Figure 2.10: An example of the problems associated with directional data. If we calculate the
average of two angles using simple arithmetic, the value may not be the one we expect!
A classic example of closed data is sediment grain sizes that are split into sand, silt and clay fractions, then expressed as percentage abundance and plotted in a ternary diagram (Figure 2.11). Clearly the sand, silt and clay fractions for any given sample must add up to 100%. Closed data are surprisingly common in the geosciences and we'll be paying special attention to them in Chapter 9.
Figure 2.11: A ternary diagram showing the classification of sediments depending on their compo-
sition of sand, silt and clay. Notice that all positions in the diagram correspond to compositions
that add to 100%.
-4 -3 -2 -1 0 1 2 3 4
Figure 2.12: An example of an interval scale. Notice that the values are equally spaced but negative
values are also possible because zero does not represent the end of the scale.
The classic example of interval scale data is the Celsius temperature scale. Ask yourself the question:
It was 0 °C today, but tomorrow it will be twice as warm. What will the temperature be tomorrow?
This demonstrates that calculations such as ratios are meaningless for interval scale data.
2.4.7 Ratio scale data
Ratio scale data is the best form of data for statistical analysis. The data is continuous, the zero
point is fundamentally meaningful, and as the name suggests, ratios are meaningful. An example
of ratio scale data is the Kelvin temperature scale. Ask yourself the question:
It was 273 K today, but tomorrow it will be twice as warm. What will the temperature be tomorrow?
Length is also a form of ratio scale data, so if a fossil is 1.5 meters long, how long would a fossil
be that is half the size?
Jetzt stehen die Chancen 50:50 oder sogar 60:60.
(Now our chances are 50:50, if not even 60:60)
Reiner Calmund (German football coach)
A discrete random variable takes on various values of x with probabilities specified by its prob-
ability distribution p(x).
Consider what happens when we roll a fair six-sided die: what is the probability that we will throw a given number? Because the die is fair we know that there is an equal probability that it will land on any of its 6 sides, so the probability is simply 1/6 ≈ 0.167. We can represent this information with an appropriate discrete probability distribution (Figure 3.1).
Figure 3.1: The discrete probability distribution describing the probability, p, of obtaining a given
value, x, when a fair die is thrown once.
The probability distribution looks exactly like we would expect, with the chance of throwing
each number having the same probability of 1/6. The use of the word “discrete” tells us that only
certain results are allowed, for example, we cannot consider the probability of throwing a value of
2.5 because it is clearly impossible given the system we are studying. The discrete nature of the
system is demonstrated in the distribution where all values except the allowable results of 1, 2, 3,
4, 5 and 6 have a probability of zero.
Rolling a single die once is a simple case, but what happens if we roll two dice and add up their values? There are 11 possible totals between 2 (rolling two ones) and 12 (rolling two sixes), and the probability distribution is shown in Figure 3.2.
Figure 3.2: The discrete probability distribution describing the probability, p, of obtaining a given
total, x, when 2 fair dice are thrown.
Let’s look at the kind of information the probability distribution for rolling two dice and
summing their values can provide us with. For example, we are most likely to throw a total of 7,
which has a probability of 0.167, whilst the chance of throwing a total of 2 is under 0.03 (in other
words less than 3%). We can also combine probabilities, for example, the probability of throwing
a total of 4 or less is just the probability of throwing a total of 4 plus the probability of throwing a
total of 3 plus the probability of throwing a total of 2 (0.083 + 0.056 + 0.028 = 0.167). The chance
of throwing a total of 5 or a total of 7 is the probability of throwing a total of 5 plus the probability
of throwing a total of 7 (0.111 + 0.167 = 0.278). Of course if we summed the probabilities of all
the different possible outcomes they would equal 1 because we know that for any given trial one
of the allowable outcomes has to occur.
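These probabilities are easy to reproduce in R (a quick sketch, not one of the course's numbered examples):

> totals = outer(1:6, 1:6, "+")         # all 36 equally likely outcomes for two dice
> p = table(totals) / 36                # probability of each possible total
> round(p[["7"]], 3)                    # the most likely total
[1] 0.167
> round(sum(p[c("2", "3", "4")]), 3)    # probability of a total of 4 or less
[1] 0.167
> round(p[["5"]] + p[["7"]], 3)         # probability of a total of 5 or a total of 7
[1] 0.278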
Figure 3.3: The normal continuous probability distribution describing IQ scores that have a mean
of 100 and a standard deviation of 15.
The first thing to notice is that the distribution is symmetrical about its center, which is positioned on the mean value of 100. The width of the distribution is controlled by the standard deviation: if we used a larger standard deviation the distribution would be wider and lower, and if we used a smaller standard deviation it would be narrower and higher (Figure 3.4).
Figure 3.4: Examples of normal distributions with different means (µ) and standard deviations (σ): µ = 100, σ = 15; µ = 100, σ = 20; and µ = 120, σ = 10.
There are some important differences in how we must interpret discrete and continuous distributions. For example, it is not as simple to answer the question "what is the probability of a candidate scoring 100 on an IQ test" as it may seem. If we used the same approach as we did for discrete distributions we would simply read off the y-axis value at an IQ of 100 and quote that as a probability. However, if we assume that the IQ score can take any value (i.e., there are an infinite number of possible test scores), then the probability of obtaining a given score exactly is zero. We can, however, make statements concerning probabilities if we consider ranges of values, for example: what is the probability that a randomly selected candidate will score between 80 and 110 points? By definition the integral of a continuous distribution is 1, and if we simply integrate the distribution between 80 and 110 we will obtain the probability of a score in that interval (Figure 3.5). This is the reason why continuous probability distributions are expressed in terms of probability densities (see the y-axis of Figure 3.3) rather than straight probabilities as in the discrete case. If we do this for the interval [80,110] we find the probability is ∼0.66.
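The integration can be done with the built-in normal distribution function pnorm (a quick sketch, not one of the course's numbered examples):

> # probability of an IQ between 80 and 110 under a normal model with mean 100 and sd 15
> round(pnorm(110, mean = 100, sd = 15) - pnorm(80, mean = 100, sd = 15), 2)
[1] 0.66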
Figure 3.5: The probability of a random candidate obtaining an IQ score of between 80 and 110 can
be found by integration of the corresponding interval of the normal distribution (shaded region).
We can also determine the probability of scoring more or less than a given value. Marilyn vos Savant, who famously answered the Monty Hall problem, has a quoted IQ of 228. This leads to an obvious question: what proportion of people will have an IQ of 228 or higher? We use the same procedure as above and simply integrate a normal distribution with a mean of 100 and a standard deviation of 15 over the interval [228, ∞), Figure 3.6.
Figure 3.6: The probability of a random candidate obtaining an IQ score of 228 or higher can be found by integration of the corresponding interval of the normal distribution (shaded region).
We can see that the probability densities associated with this part of the distribution are very
low and the number of people expected to have an IQ of 228 or higher corresponds to less than
1 person in every 1000 trillion. How can we interpret this result? Does it mean that Marilyn vos Savant is the cleverest person who will ever live, or can we take it as evidence that she lied about her IQ? Well, the answer is probably simpler. The normal distribution provides a model of
how the IQ scores are expected to be distributed, but it is certainly not a perfect model. We can
expect it to perform well in the region where we find the majority of the cases (the center), but as
we head out to the extremes of the distribution, called the tails, it will perform poorly. Therefore
we must take the probabilities associated with Marilyn vos Savant’s IQ with a pinch of salt.
To demonstrate this point concerning the tails of the distribution, ask yourself what is the probability of someone having an IQ of 0 or less? Clearly it's not possible to have a negative IQ, but if we take a normal distribution with a mean of 100 and a standard deviation of 15 and integrate over the interval (−∞, 0] we find the probability according to the model is 1.3 × 10^-11, which admittedly is very low, but is clearly not zero (Figure 3.7).
Figure 3.7: The probability of a random candidate obtaining an IQ score of 0 or lower can be found
by integration of the corresponding interval of the normal distribution (shaded region).
Figure 3.8: An example of a log-normal distribution (left). The log-normal distribution becomes a normal distribution when expressed in terms of logarithms (right).
3.2.1 An example: the confidence interval on a mean
As mentioned above, probability distributions play a key role in allowing us to make inferences concerning a population on the basis of the information contained in a sample. At this stage in the course you'll just need to accept the steps we're about to take without worrying why they work; the key point is to demonstrate how we can use probability distributions to make inferences.
Returning to our example of IQ scores, imagine that I don’t know what the average IQ of the
population is (remember it’s defined as a population mean µ = 100 with a population standard
deviation of σ = 15), so I’m going to estimate it using a statistical sample. I choose 10 people at
random and obtain their IQ scores to form my sample, X. The scores in the sample are as follows:
107.9, 106.8, 88.1, 100.8, 94.4, 99.0, 84.4, 110.9, 110.0, 85.7
Based on this sample I wish to estimate the mean of the population, µ. Of course it’s easy to
find the mean of the sample, X̄, simply by adding the values in X and dividing by the number of
values, n (10 in this case):
X̄ = (∑X) / n    (3.1)
For my sample X̄ = 98.8, which is close to, but not exactly the same as, the true value of 100. Of
course it’s not surprising that the mean of the sample is not exactly the same as the mean of the
population, it is after all just a sample. The key step is to use the information in the sample to
draw inferences concerning the mean of the population. Specifically we want to define a confidence
interval, so that we can make a statement of the form: with a given probability, the population mean µ lies within a stated range.
In this way we won’t be able to make a definite statement about the precise value of the population
mean, but instead we’ll be able to say with a specific probability that µ lies in a certain interval
(based on the sampling error).
To find the confidence interval let's imagine that we repeated the above experiment an infinite number of times, collecting a new sample of 10 scores and calculating a value of X̄ each time. If the values in X come from a normal distribution, which we know they do, the collection of infinitely many X̄ values would be normally distributed with a mean of µ and a standard deviation of σ/√n, where n is still the size of each sample. This is a so-called sampling distribution and it is shown in Figure 3.9.
Figure 3.9: Distribution of sample means (X̄). The central 95% of the distribution, lying between µ − 1.96σ/√n and µ + 1.96σ/√n, provides the basis for estimating a confidence interval for the population mean.
Of course I can’t take an infinite number of samples, but the sampling distribution provides
me with a model with which to estimate the population mean within a confidence interval.
Examination of the sampling distribution shows that 95% of the samples should yield values of X̄ that lie within the interval µ ± 1.96σ/√n. So there is a 95% chance that our original sample with X̄ = 98.8 falls into this interval. This can be written more formally as:
Pr(µ − 1.96σ/√n < X̄ < µ + 1.96σ/√n) = 95%    (3.3)
Figure: Distribution of sample means when the population standard deviation σ is estimated by the sample standard deviation s. The central 95% of the distribution lies between µ − 2.26 s/√n and µ + 2.26 s/√n (for samples of size n = 10).
We now have all the information we need to estimate the 95% confidence interval on µ:
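As a sketch (not one of the course's numbered examples), the calculation can be carried out in R using the 2.26 s/√n form shown in the figure above:

> X = c(107.9, 106.8, 88.1, 100.8, 94.4, 99.0, 84.4, 110.9, 110.0, 85.7)
> n = length(X)
> # 95% confidence interval: sample mean plus or minus 2.26 standard errors
> round(mean(X) + c(-1, 1) * qt(0.975, df = n - 1) * sd(X) / sqrt(n), 1)
[1]  91.5 106.1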
By a small sample, we may judge of the whole piece.
Miguel de Cervantes from Don Quixote
4 Hypothesis testing
We’ll start by considering the very simple example of flipping a coin which can either land heads
or tails side up. If your coin is “fair”, in other words it is equally likely to land on heads or tails
when flipped, what is the probability that you will flip a tail? The answer is pretty obvious, if we
have two possibilities and they carry an equal probability then:
p = 1/2    (4.1)
So let’s try this for real, flip a coin 10 times and from your sample of results calculate the probability
that you will flip a tail (in other words count the number of times you flipped a tail and divide it
by the total number of flips, which is 10). We know that a fair coin should yield p = 0.5, so if you
didn’t flip 5 tails is it safe to assume that your coin is not fair? Of course it’s possible that your
coin gave a result close to 0.5, maybe 0.4 or 0.6, so what do you consider to the acceptable range
of p values from your experiment in which the coin can still be considered to be fair?
The primary problem with this question is that the answers we give are subjective. One person may say that you must achieve p = 0.5 exactly, whilst a more relaxed person may consider anything in the interval [0.3,0.7] to be okay. As scientists we want to avoid these kinds of subjective choices because they mean that two different people can make two different judgements based on the same data sets. Statistics help us to interpret the data in an objective manner and thus remove the effects of our personal beliefs (which, as we saw in the Monty Hall problem, can be very much in error).
This problem is similar to the experimental design needed to test if the lady drinking tea at a garden party in Cambridge can really tell the difference if you put the milk in the cup first. How many cups of tea should she drink, and what proportion should she get correct, before you can conclude that she really can tell the difference in taste? Let's go back to our coin flipping experiment. Imagine we flip a coin 100 times and after each flip calculate the current probability of a tail by dividing the number of tails obtained at that point by the current number of flips. I did this and the results are shown in Figure 4.1.
Figure 4.1: The results of flipping a coin 100 times. After each flip the probability of flipping a
tail is calculated based on the data so far. The dashed horizontal line shows p = 0.5, which is the
probability expected for a fair coin.
We can see from the results of our coin flips that the experimental probability gets close to
the expected value of p = 0.5, but even after 100 flips we’re not exactly at 0.5, so can the coin be
judged to be fair?
Clearly we can’t say that a coin is only fair if it gives p = 0.5 exactly. This would mean
that every time we repeat the 100 coin flips we would always need to get 50 heads and 50 tails.
Instead, we have to decide what is the acceptable range of p values in which the coin can still be
considered fair and what that range depends on (for example, the total number of flips included
in the experiment). To make these kind of decisions we will employ hypothesis testing.
A tentative assumption made in order to draw out and test its logical or empirical consequences.
Let's apply these definitions to the coin flipping experiment we examined above. If we want to perform a hypothesis test to judge if our coin is fair we need to state the null and alternative hypotheses:
H0: The coin is fair (p = 0.5)
H1: The coin is not fair (p ≠ 0.5)
A hypothesis test allows us to evaluate the possibility of H0 given the available experimental data. If H0 does not appear to be very likely on the basis of the data, then we must reject H0 and instead accept H1. For example, if we flipped our coin 100 times and obtained 100 tails we would feel pretty safe in rejecting the null hypothesis (the coin is fair) and instead accepting the alternative hypothesis (the coin is not fair).
How could we go about testing the null hypothesis for our coin flipping experiment? We want to test if our coin is fair, so let's consider how a perfectly fair coin (i.e., with p = 0.5) would behave. We actually studied a similar problem in Section 3.1 when we were studying discrete probability distributions. If we repeated the coin flipping experiment, with a total of 10 flips each time, we would obtain a distribution of results that would describe the probability of any given result. For example, if my coin is fair, what proportion of the experiments would yield 1 tail and 9 heads? Using the binomial distribution we can find what the distribution of results would look like for our coin tossing experiment if we repeated it an infinite number of times (i.e., a perfect representation of the system). The binomial distribution representing the 10 flip experiment for a fair coin is shown in Figure 4.2.
Figure 4.2: The distribution of results for a fair coin flipped a total of 10 times. As expected most
trials give a result around the 5 tails region, thus this region of the distribution is associated with
high probabilities. We should also expect, however, to occasionally see extreme results with low
probabilities, for example 10 tails out of 10 (which carries a probability of approximately 0.001).
Now we’ll use the binomial distribution to look at the probabilities for an experiment including
100 flips and how many tails we can expect to get in any given trial (Figure 4.3).
Figure 4.3: The distribution of results for a fair coin flipped a total of 100 times. As expected most
trials yield a result around the 50 tails region, thus this region of the distribution is associated with
high probabilities.
As before, extreme results have low probabilities; for example, we should only expect to observe a trial that produces 100 tails out of 100 in about 1 in every 10^30 trials. That means if you had started flipping coins at the birth of the Universe and completed a 100-flip trial every 15 minutes you probably still wouldn't have got 100 tails out of 100 yet. This might seem like a stupid statement to make, but what it shows us is that for a truly fair coin, results at the extremes (i.e., the tails of the distribution) are very unlikely and results towards the center of the distribution are much more likely. We can use this information to test the hypothesis that a given coin is fair. If the result of our 100 flips experiment falls into a low probability region of the distribution we know that the chance of getting such a result for a truly fair coin is low, which suggests that our coin may not in fact be fair.
Let’s look at our binomial distribution for 100 flips of a fair coin again, with a specific focus
on the extremes (the tails) of the distribution. If we add up the probabilities of the results (as we
did in Section 3.1) we find there is only a 5% chance that an experiment will result in 59 or more
tails and a 5% chance that my experiment will result in 41 or less tails (Figure 4.4). This also
tells us that for an experiment consisting of 100 flips of a fair coin we would expect to get between
42 and 58 tails in 90% of the cases. If our coin is unfair, however, we should get an unexpectedly
low or high number of tails, that doesn’t fit with the probabilities expected from the binomial
distribution.
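These cut-offs can be checked with R's built-in binomial functions (a quick sketch, not one of the course's numbered examples):

> # boundaries of the central region: about 90% of experiments give between 42 and 58 tails
> qbinom(c(0.05, 0.95), size = 100, prob = 0.5)
[1] 42 58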
Figure 4.4: The distribution of results for a fair coin flipped a total of 100 times. We can assess how likely an extreme result is by adding up the probabilities in the tails of the distribution. In the case of a fair coin flipped 100 times, 5% of the experiments should yield ≤41 tails and 5% of the experiments should yield ≥59 tails. The remaining 90% of the experiments should give between 42 and 58 tails.
Remember that our null hypothesis (H0 ) is that the coin is fair, so at the start of the experiment
we are assuming the coin to be fair. If you now did 100 flips of the coin and only got 41 tails or
less, you could say that there is only a 5% chance that you would get 41 tails or less in a given
experiment, which would make you think that the coin may not be fair. The opposite case of
having too many tails also holds. If you got 59 tails or more you could say that there is only a
5% chance that you would get 59 tails or more in a given experiment, which again would make
you think that the coin may not be fair. To place this argument into a more robust framework we
need to introduce the concept of significance levels.
If we were to perform an experiment and get either ≤41 or ≥59 tails we could reject the null hypothesis (the coin is fair) and accept the alternative hypothesis (the coin is not fair) at a significance level (α) of 0.1. Here α = 0.1 is telling us that there is a 10% chance that, given the available data, we have incorrectly accepted the alternative hypothesis when the null hypothesis was in fact true.
Maybe we really want to make sure that we don't incorrectly reject the null hypothesis, so we will work with a significance level of α = 0.01, which means that our experiment has to fall further into the tails of the binomial distribution before we will reject the null hypothesis (Figure 4.5). For 100 coin flips, if the number of tails fell in the interval [37,62] we would accept the null hypothesis with a significance level of 0.01. If, however, the number of tails was ≤36 or ≥63 we would see that the probability of such a result for a fair coin is low (≤1%) and therefore reject the null hypothesis and adopt the alternative hypothesis with a significance level of 0.01.
Figure 4.5: The distribution of results for a fair coin flipped a total of 100 times. We can assess how likely an extreme result is by adding up the probabilities in the extremes of the distribution. In the case of a fair coin flipped 100 times, 0.5% of the experiments should yield ≤36 tails and 0.5% of the experiments should yield ≥63 tails. The remaining 99% of the experiments should give between 37 and 62 tails per experiment.
First, there is a danger that the experimenter has decided in advance what result is wanted, e.g., the coin is not fair, and then a significance level is chosen to ensure the test gives the desired result. Second, significance levels only tell us about the probability that we have incorrectly rejected the null hypothesis. Significance levels don't give any information about alternative possibilities, for example, incorrectly accepting the null hypothesis.
Imagine that meltwater particle concentrations (in ppm) were measured at 16 locations in Antarctica:
3.7, 2.0, 1.3, 3.9, 0.2, 1.4, 4.2, 4.9, 0.6, 1.4, 4.4, 3.2, 1.7, 2.1, 4.2, 3.5
and 18 locations were visited in Greenland yielding the concentrations (again in ppm):
3.7, 7.8, 1.9, 2.0, 1.1, 1.3, 1.9, 3.7, 3.4, 1.6, 2.4, 1.3, 2.6, 3.7, 2.2, 1.8, 1.2, 0.8
To help understand the transport of meltwater particles we want to test if the variance in meltwater particle concentrations is the same in Antarctica and Greenland (the variance is just the standard deviation squared). The variance of the population is denoted by σ²; however, because we are working with a sample we have to make an estimate of the population variance by calculating the sample variance, s²:
s² = (1/(n − 1)) ∑ (Xi − X̄)²    (4.2)
where n is the size of the given data set (i.e., n = 18 for Greenland and n = 16 for Antarctica). Using equation 4.2 we find that:
s²_Antarctica = 2.2263
s²_Greenland = 2.6471
and now we must use this information to draw inferences concerning σ²_Antarctica and σ²_Greenland. To test if the meltwater variances at Antarctica and Greenland are the same, first we must state our hypotheses:
H0: The population variances are the same (σ²_Antarctica = σ²_Greenland)
H1: The population variances are different (σ²_Antarctica ≠ σ²_Greenland)
This is the same form of problem as we had in our coin flipping experiment. We therefore need
to be able to generate a probability distribution that represents the possible values of variance
ratios and we need to select a significance level against which the null hypothesis can be tested.
Earlier we discussed the need to make assumptions in order to draw statistical inference and
here we will make the assumption that the data from both Antarctica and Greenland come from
normal distributions. Therefore we can take two normal distributions with the same variance and
sample 16 random numbers from the first (to represent the Antarctica sampling) and 18 random
numbers from the second (to represent the Greenland sampling), find their respective estimated
variances and then take the ratio. Because the distributions have the same variance we know that their values of σ² are the same; however, because we are dealing with samples, their values of s² will be slightly different each time we draw a set of random numbers, and therefore the ratios will form a distribution. The so-called F-distribution gives the distribution of ratios for an infinite number of samples. We can control the sample sizes the F-distribution represents by adjusting its degrees of freedom.
We calculated our ratio with s²_Greenland as the numerator and s²_Antarctica as the denominator. Because the sample sizes for Greenland and Antarctica are 18 and 16, respectively, our F-test will employ an F-distribution with {18-1,16-1} degrees of freedom. We can then compare the ratio obtained from our Greenland and Antarctica samples to the distribution of ratios expected from two normal distributions with the same variances. We'll perform the F-test at the α = 0.05 level, which means we need to check if our ratio for the Greenland and Antarctica samples is more extreme than the 5% most extreme values of an F-distribution with {18-1,16-1} degrees of freedom. Because our variance ratio could possibly take values less than 1 (if the numerator is less than the denominator) or values greater than 1 (if the numerator is greater than the denominator), our 5% of extreme values must consider the lowest 2.5% and the highest 2.5%, as shown in Figure 4.6.
Figure 4.6: An F-distribution with {18-1,16-1} degrees of freedom. The extreme 5% of the F-values are shown by the shaded regions. The F-value of the Greenland to Antarctica variance ratio (F = 1.189) is shown as an open symbol.
We can see that our variance ratio for the Greenland and Antarctica samples does not fall into the extremes, so at the α = 0.05 significance level we accept the null hypothesis that σ²_Antarctica = σ²_Greenland.
Voila, we have now performed an F-test and shown that the population variances of the meltwater particle concentrations at Greenland and Antarctica can be considered the same at the 0.05 significance level. You could now take this information and build it into your understanding of how meltwater particle systems work. This is an important point: the F-test has given us some statistical information, but the job of understanding what that information means in a geological sense is your responsibility.
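For reference, the whole test can be reproduced in R in a few lines (a sketch using the concentrations listed above; the variable names are illustrative):

> A = c(3.7, 2.0, 1.3, 3.9, 0.2, 1.4, 4.2, 4.9, 0.6, 1.4, 4.4, 3.2, 1.7, 2.1, 4.2, 3.5)
> G = c(3.7, 7.8, 1.9, 2.0, 1.1, 1.3, 1.9, 3.7, 3.4, 1.6, 2.4, 1.3, 2.6, 3.7, 2.2, 1.8, 1.2, 0.8)
> round(var(G) / var(A), 3)    # the test statistic (ratio of the sample variances)
[1] 1.189
> # critical values of an F-distribution with {17,15} degrees of freedom at the 0.05 level
> qf(c(0.025, 0.975), df1 = length(G) - 1, df2 = length(A) - 1)

The observed ratio lies between the two critical values, in agreement with Figure 4.6.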
To recap, the steps of the test were to state the null and alternative hypotheses and then:
Choose the significance level (α) at which the test will be performed.
Calculate the test statistic (ratio of the variances in the case of the F -test).
Compare the test statistic to a critical value or values (obtained from an F -distribution in
the case of the F -test).
Accept or reject H0 .
Finally, it is important to consider if we made any assumptions during the statistical test. For
the F -test we assumed that the Antarctica and Greenland samples both came from a normal
distribution. If this is not a valid assumption then the results of the F -test will not be valid. Such
assumptions make the F -test a so-called parametric test, which just means that we assume that
the data come from a specific form of distribution.
Figure 4.7: Belemnite fossils (photo courtesy of the Natural History Museum, London).
We will now perform an F-test to determine if the variances of the lengths of the samples from horizons A and B are the same at the 0.05 significance level. Our first step is to state the null and alternative hypotheses:
H0: The population variances are the same (σ²_A = σ²_B)
H1: The population variances are different (σ²_A ≠ σ²_B)
The test statistic is the ratio of the sample variances:
F = s²_A / s²_B
To perform the calculations we must first load the data into R. The data is stored in the file
Belemnites.Rdata and it includes two variables A and B.
Example code: 4
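> # load the belemnite length data into the R workspace (the file path may differ on your system)
> load('Belemnites.Rdata')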
We now have the two variables in the memory. If we want to look at the values they contain we just give the variable name and hit the enter key. For example, to look at the values in the variable B:
Example code: 5
> B
[1] 5.13 6.59 6.02 3.42 4.92 4.32 3.98 3.77 5.29 4.57 5.06 4.63 4.54 5.37 5.73
[16] 7.11 3.64 3.98 6.04 4.61
Note that the values in the square brackets tell you what position in B the beginning of the displayed row corresponds to. For example, [16] indicates that the displayed row starts with the 16th value of B, which is 7.11. Using R we can perform a wide variety of mathematical procedures, which makes it very useful when we need to calculate various statistical values.
To calculate the variances of A and B we will use the function var; give the commands:
Example code: 6
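> # sample variances of the two horizons (variable names follow those used in Example code 7)
> A_var = var(A)
> B_var = var(B)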
Example code: 7
> F=A_var/B_var
> F
[1] 0.3375292
So the ratio of the sample variances is 0.338 and now we need to compare this value to an F-distribution to see if it lies in the most extreme 5%. This is what we call a two-sided test, so as before we need to consider the lowest 2.5% and highest 2.5% of the F-distribution. Therefore we're accounting for the possibility that the ratio is significantly less than 1 or significantly greater than 1 (just like in our example with the microparticle concentrations). To find the values of F for the extremes we first need to find the number of degrees of freedom for the distribution. We can do this using the length function, which tells us how many entries there are in a given variable. Once we know the degrees of freedom we can find the value at which the F-distribution reaches a given probability using the function qf. The qf function has 3 inputs: the probability we are interested in, the number of degrees of freedom of the numerator and the number of degrees of freedom of the denominator.
Example code: 8
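> # degrees of freedom of the numerator (A) and the denominator (B); variable names are illustrative
> df_A = length(A) - 1
> df_B = length(B) - 1
> # F-values marking the lowest 2.5% and the highest 2.5% of the distribution
> qf(0.025, df_A, df_B)
> qf(0.975, df_A, df_B)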
We can now compare our F-value to these critical values and finish off the test.
Example code: 9
Figure 4.8: An F-distribution with {17,19} degrees of freedom. The extreme 5% of the F-values are shown by the shaded regions. The F-value of the belemnite variance ratio (F = 0.338) is shown as an open symbol.
Notice that the recipe is the same as before: state the null and alternative hypotheses, then:
Choose the significance level (α) at which the test will be performed.
Calculate the test statistic(s).
Compare the test statistic to a critical value or values (obtained from a suitable distribution).
Accept or reject H0 .
So if there is some property you need to test in order to make inferences about the geological
system you are studying you can usually find a test that will do it for you. We’ll look at one final
example to again demonstrate the general nature of hypothesis testing.
You can see that the assumption of normal distributions is the same as for the F -test, but now
we have the added assumption that the Antarctica and Greenland samples have the same vari-
ance. Fortunately, in our earlier analysis we established using an F -test that the Antarctica and
Greenland samples have the same variance so we know that our data meet this assumption.
If we want to test if the two means are the same, first we must state our hypotheses:
H0: The population means are the same (µ_Antarctica = µ_Greenland)
H1: The population means are different (µ_Antarctica ≠ µ_Greenland)
and we'll work with a significance level of α = 0.05. The t-statistic is a little more complicated to calculate than the F-statistic, but it is still just an equation that we plug known values into. Specifically:
t = (X̄1 − X̄2) / (S √(1/n1 + 1/n2)),    S = √[((n1 − 1)s1² + (n2 − 1)s2²) / (n1 + n2 − 2)]    (4.3)
where n1, n2 are the number of values from Antarctica (16) and Greenland (18), X̄1, X̄2 are the mean sample values from Antarctica and Greenland, and s1², s2² are the sample variances from Antarctica and Greenland. We can calculate the test statistic in R using the variables A (Antarctica) and G (Greenland) stored in the file microparticles.Rdata. As you type the commands into R, check that you can see how they marry up with the terms in equation 4.3. Also notice that we can reuse variable names such as top and bottom once we are finished with them (R simply overwrites the existing values).
Example code: 10
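> # load the Antarctica (A) and Greenland (G) concentrations (file path may differ on your system)
> load('microparticles.Rdata')
> # the pooled standard deviation S from equation 4.3 (illustrative variable names)
> top = (length(A) - 1) * var(A) + (length(G) - 1) * var(G)
> bottom = length(A) + length(G) - 2
> S = sqrt(top / bottom)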
Example code: 11
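> # the t-statistic from equation 4.3, reusing the names top and bottom
> top = mean(A) - mean(G)
> bottom = S * sqrt(1/length(A) + 1/length(G))
> t = top / bottom
> round(t, 3)
[1] 0.376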
Example code: 12
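> # critical t-values for a two-sided test at the 0.05 level with n1 + n2 - 2 = 32 degrees of freedom
> round(qt(c(0.025, 0.975), df = length(A) + length(G) - 2), 2)
[1] -2.04  2.04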
The t-value for the microparticle data lies between the critical values rather than in the extremes of the distribution. Therefore we can accept the null hypothesis and state that µ_Antarctica = µ_Greenland at the 0.05 significance level.
Figure 4.9: A t-distribution with 32 degrees of freedom. The extreme 5% of the t-values are shown by the shaded regions. The t-value of the microparticle data (t = 0.376) is shown as an open symbol.
The invalid assumption that correlation implies cause is probably
among the two or three most serious and common errors of human
reasoning.
Stephen Jay Gould
I’m sure that you are familiar with correlation and regression from packages like EXCEL that
include them as part of their trendlines options. We’re going to look at both correlation and
regression in more detail and examine what important information they can and cannot provide
you with.
5.1 Correlation
As mentioned above, correlation tells us the degree to which variables (2 or more) are related
linearly. Correlation usually expresses the degree of the relationship with a single value. By far
the most common value used in correlation analysis is the Pearson product-moment correlation
coefficient (PPMCC) and it will be our focus in this section. You’ve all probably used a variant
of the PPMCC before in the “R-squared” value EXCEL allows you to add to trendlines in scatter
plots. In which case, you should be familiar with at least a qualitative interpretation of the PPMCC
which is simply the value “R” that is used to calculate “R-squared”. We will use a lowercase r
rather than R so that it won’t cause any confusion with the language R that we are using for the
examples. In EXCEL, r is only a few mouse clicks away, but to understand what information it
can give us we need to look at how it is calculated (sorry, but it’s unavoidable) and how we can
draw inferences from it.
5.1.1 Sample correlation
If we have a data set consisting of two variables, let’s say; X and Y , first we look at how far each
value is away from its corresponding mean. We’ll call these differences deviations and represent
them with lowercase letters:
x = X − X̄, (5.1)
y = Y − Ȳ . (5.2)
The gravel sizes are expressed on the φ scale, φ = −log₂(D/D₀), where D is the diameter of the particle and D₀ is a reference diameter equal to 1 mm. So if the data are log-normally distributed, the φ scale transforms them into a normal distribution and the assumptions of the correlation analysis are met. The second variable in the data set is the downstream distance, dist, which has units of kilometers. In this case a value of, for example, dist = 10 corresponds to a piece of gravel being collected 10 km downstream with respect to the starting location of the experiment.
The first thing we’ll do is plot the data to take a look at it. We can do this in R using the
commands.
Example code: 13
> #plot the data points with black symbols and label the axes
> plot(dist,size,col='black',xlab='Distance [km]',ylab='Gravel size [phi]')
Plot of the gravel data: gravel size [phi] (y-axis) against distance [km] (x-axis).
Just by looking at the plot we can see that there may be a linear relationship between gravel
size and distance, but we’ll need to calculate the PPMCC to quantify the degree of the relation-
ship. As we saw in equations 5.1 and 5.2, the first step to calculating r is to find the deviations
of the variables. The structure of the deviations is the key to understanding equation 5.3, so we’ll
now plot them in a new figure and consider them in a general way in the next section.
Example code: 14
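> # deviations of each variable from its mean (equations 5.1 and 5.2); illustrative variable names
> x = dist - mean(dist)
> y = size - mean(size)
> # plot the deviations; dashed lines through zero divide the plot into four quadrants
> plot(x, y, col='black', xlab='x', ylab='y')
> abline(h = 0, v = 0, lty = 2)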
Figure 5.2: Deviations of the gravel data set. Notice how the sign of xy will depend on the quadrant of the plot where the deviations lie: xy is positive in the first and third quadrants and negative in the second and fourth.
Looking back at equation 5.3 we can see that the top of the equation involves multiplying
the two sets of deviations together. This multiplication gives us a measure of how well the two
variables are moving together. If the deviations of a given observation have the same sign (i.e.,
the point representing the deviations lies in the first or third quadrant) the product xy will be
positive. Alternatively, if the deviations of a given observation have different signs (i.e., the point
representing the deviations lies in the second or fourth quadrant) the product xy will be negative.
If values in X and Y exhibit a positive relationship, i.e., as X increases Y also increases, then most of our deviations will define points that lie in the first or third quadrants and when we form the sum ∑xy, we'll get a positive number. In the opposite case of a negative relationship, i.e., as one variable increases the other decreases, most of the deviations will define points in the second or fourth quadrant. In such cases ∑xy will be negative. The last case to consider is when no relationship exists between X and Y. In this situation the points defined by x and y will be spread amongst all four quadrants and the signs of xy will cancel to give a value of ∑xy close to zero.
We can see that the sign of ∑xy tells us about the sense of the correlation, i.e., is it positive or negative, but there are problems with the magnitude of the value. Clearly, each value of xy will depend on the units of the data. In the example of our river gravels, if the downstream distance was in millimeters rather than kilometers each value of xy would be 10^6 times larger because 10^6 mm = 1 km. To compensate for the problem we take the magnitudes of the x and y values into account in the denominator of equation 5.3 and this makes the PPMCC "scale invariant".
We’ve calculated and plotted the deviations of the gravel data already and now it is a simple task
to calculate the PPMCC using R.
Example code: 15
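> # PPMCC built up from the deviations, following equation 5.3
> # (the built-in function cor(dist, size) gives the same value)
> r = sum(x * y) / sqrt(sum(x^2) * sum(y^2))
> r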
Example code: 16
Figure 5.3: Examples of the PPMCC value for different data distributions. Notice that some cases
yield r = 0 even though there is a clear relationship between x and y. This is because the PPMCC
only measures the extent of the linear relationship.
The square of the correlation coefficient, r², tells us the proportion of the variance in Y that is accounted for by its linear relationship with X. Taking our gravel data as an example, we found r = 0.96 and thus r² = 0.92. Therefore 92% of the variation in the gravel sizes is accounted for by their position downstream, whilst 100(1 − r²)% = 8% of their variation is not accounted for by their position downstream.
Example code: 17
5.1.5 Confidence interval for the population correlation
If we can show that ρ is significantly different from 0 at some desired significance level then we can
also calculate the confidence interval on ρ. The confidence interval will tell us the range in which
the true value of ρ should lie with a given probability. The first step is to take our calculated value
of r and apply Fisher’s z transform:
zr = (1/2) ln[(1 + r)/(1 − r)]. (5.6)
The standard error of zr is approximately:
SE = 1/√(n − 3), (5.7)
where n is the sample size. Because zr is normally distributed we can find confidence intervals for
zr from a normal distribution just like in Section 3.2.1. For example, the 95% confidence interval
on zr would be [zr − (1.96 ∗ SE), zr + (1.96 ∗ SE)]. We now have the confidence interval for zr and
to find what this corresponds to in terms of correlation coefficients we need to apply the inverse z
transform (i.e., convert from zr values back to r values):
r = (e^(2z) − 1)/(e^(2z) + 1) (5.8)
We’ll now calculate the confidence interval on ρ for our river gravels using R.
Example code: 18
Example code: 19
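The original listings are not reproduced here; a single minimal sketch covering both steps (the transform and the back-transform), assuming the gravel data are still loaded, is:
> n = length(dist) #sample size
> r = cor(dist,size) #sample correlation coefficient
> zr = 0.5*log((1+r)/(1-r)) #Fisher's z transform (equation 5.6)
> SE = 1/sqrt(n-3) #standard error of zr (equation 5.7)
> z_low = zr-1.96*SE #lower limit of the 95% CI on zr
> z_up = zr+1.96*SE #upper limit of the 95% CI on zr
> rho_low = (exp(2*z_low)-1)/(exp(2*z_low)+1) #back-transform (equation 5.8)
> rho_up = (exp(2*z_up)-1)/(exp(2*z_up)+1) #back-transform (equation 5.8)
> rho_low; rho_up #show the confidence interval on rho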
5.1.6 The influence of outliers on correlation
An outlier is a data point that deviates markedly from other members of the sample in which
it occurs. An outlier could be from an extreme end of a distribution, for example Marilyn vos
Savant’s IQ of 228, or simply the product of a large measurement error. The important point
is that outliers can have a dramatic effect on the calculated PPMCC because they allow a large
amount of the data variability to be described by a single point. Shown in Figure 5.4 is a small
data set with a correlation that is effectively zero (r² = 0.03). When an outlier is included in the same data set the correlation becomes highly significant (r² = 0.81) because most of the variability
in the data is controlled by the single outlier.
Figure 5.4: The correlation in a data set can be strongly influenced by the presence of outliers.
The panel on the left shows a data set which exhibits no correlation. However, when an outlier is
added to the data the correlation increases dramatically.
It is therefore important that you check your data in advance for outliers and consider removing
them from the analysis.
5.2 Regression
In the previous section we assessed the degree to which two variables are related linearly using
correlation. The aim of regression is to quantify how the variables are related linearly. To put this
in simpler terms, how do we find the straight-line that will give the best description of the sample data? Again this is a task that is performed commonly in Excel, but we need to look at the
process in more detail. Firstly, you should all be familiar with the equation for a straight-line:
Y = a + bX, (5.9)
where a is the intercept (the value of Y when X = 0) and b is the gradient. Let’s again consider
our river gravels. With distance as the X value and grain size as the Y value we fit a straight line
to the data to obtain a and b for our studied sample. Of course what really interests us is making
inferences from the sample to estimate the relationship for the population. Thus whilst we can
calculate a and b, what we really want is confidence intervals for α and β. It’s also important to
note that there are assumptions associated with linear regression; we'll state them now and then
look at them in more detail later:
Y = a + bX + E, (5.10)
Figure 5.5: The error Ei for point i based on the difference between the data point Yi and its
corresponding point on the line Ŷi (more on this later).
A line that fits the data closely will produce small errors, whilst a poorly fitting line will produce
large errors. Therefore if we can get the collection of errors in E as small as possible we will know
that we've found the best fitting line. We can do this by finding the line that produces the smallest possible value of ∑E². Another property of E that we can use to help us is that for the best fitting line ∑E = 0. It makes intuitive sense that for the best fitting line all the errors will cancel each other out. To make the errors in E as small as possible (i.e., minimize ∑E²) and ensure that ∑E = 0 we use the approach of least-squares (which is the technique Gauss developed when he
was 18 years old and subsequently used to make his predictions about the orbit of Ceres).
To calculate b, we’ll again use deviations in X and Y (equations 5.1 and 5.2):
b = ∑xy / ∑x². (5.11)
We also know that the best fit line must pass through the point (X̄, Ȳ ), i.e., the mean of the data,
so once b is found we can calculate a:
a = Ȳ − bX̄. (5.12)
We now know the equation for the best-fit line, which provides us with a linear model relating X
and Y . We can therefore use the line to make predictions about Y given a value or values of X.
If we were interested in the value of Y at a value of X denoted by X0 , the prediction of Y is given
by:
Ŷ0 = a + bX0, (5.13)
note the Ŷ notation that denotes we are making a prediction of Y . Let’s try this out in R, again
using the downstream gravel data and the deviations we calculated earlier. We’ll make predictions
for the original X values in order to draw the regression line in the plot (Figure 5.6).
Example code: 20
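The original listing is not shown here; a minimal sketch, reusing the deviations x and y and the variable names a, b and Yhat that appear in the later listings, is:
> b = sum(x*y)/sum(x^2) #slope of the best-fit line (equation 5.11)
> a = mean(size)-b*mean(dist) #intercept of the best-fit line (equation 5.12)
> Yhat = a+b*dist #predictions for the original X values (equation 5.13)
> plot(dist,size,col='black',xlab='Distance [km]',ylab='Gravel size [phi]')
> lines(dist,Yhat,col='black') #add the regression line to the plot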
Figure 5.6: Regression line showing the linear relationship relating gravel size to distance down-
stream.
[Figure panels showing the two fitted lines, y = −0.14x + 0.12 and y = 0.90x − 0.04.]
Figure 5.7: Regression can be strongly influenced by the presence of outliers. The panel on the left
shows the regression for a data set with no outliers. However, when an outlier is added to the data
(right) the regression line is pulled towards it, changing the regression dramatically.
Therefore, as with correlations, it is essential that you check your data set in advance for
outliers and consider removing them from the analysis.
Example code: 21
> Yhat = a+b*dist #predicted gravel size for the distance values
> s2 = 1/(n-2)*sum((size-Yhat)^2) #estimated variance of residuals
> s = sqrt(s2) #estimated standard deviation of the residuals
> t=-qt(0.025,n-2) #obtain the t distribution value with dof = n-2
> beta_low=b-t*s/sqrt(sum(x^2)) #lower value of 95% CI
> beta_up=b+t*s/sqrt(sum(x^2)) #upper value of 95% CI
> b #show the value of the slope on screen
> beta_low #show the lower value of the slope confidence interval
> beta_up #show the upper value of the slope confidence interval
So we can say that on the basis of our sample gradient, b = 0.203 φ km−1 , there is a 95%
probability that the true population gradient lies in the interval [0.201, 0.205] φ km−1 .
Let’s go back to our gravels example and see how we would calculate the confidence interval for α
in R.
Example code: 22
> Yhat = a+b*dist #predicted gravel size for the distance values
> s2 = 1/(n-2)*sum((size-Yhat)^2) #estimated variance of residuals
> s = sqrt(s2) #estimated standard deviation of the residuals
> t=-qt(0.025,n-2) #obtain the t distribution value with dof=n-2
> alpha_low=a-t*s*sqrt(1/n+mean(dist)^2/sum(x^2)) #lower value of 95% CI
> alpha_up=a+t*s*sqrt(1/n+mean(dist)^2/sum(x^2)) #upper value of 95% CI
> a #show the value of the intercept on screen
> alpha_low #show the lower value of the intercept confidence interval
> alpha_up #show the upper value of the intercept confidence interval
So we can say that on the basis of our sample intercept, a = -10.81 φ, there is a 95% probability that the true population intercept lies in the interval [-10.99, -10.62] φ.
5.2.5.1 Predicting a mean
In the sections above we found that a = -10.807 and b = 0.203, so, for example, at a distance of 5 km downstream we can make a prediction of the gravel size:
Ŷ0 = a + bX0 = −10.807 + 0.203 ∗ 5 = −9.792
Okay, so we’ve predicted a value of -9.8 φ but what does this value correspond to? It is a prediction
of the mean of Y at X0 . So in the case of our example data set the estimated mean gravel size
at a downstream distance of 5 km is -9.8 φ. The focus of the previous sections was the calculation
of confidence intervals on the slope and intercept and in a similar manner we need to include a
confidence interval on the estimated mean because of the uncertainty associated with working with
a sample rather than the whole population and the misfit between the data and the regression
model. Again we’ll use the estimated variance of the residuals, s2 (equation 5.14). The 95%
confidence interval for the mean of Y0 at the position X0 is given by:
µ0 = (a + bX0) ± t0.025 s √(1/n + (X0 − X̄)²/∑x²) (5.17)
where t again represents Student’s t-distribution with n − 2 degrees of freedom. Let’s perform this
calculation in R for the gravel data at a position X0 = 5 km (Figure 5.8).
Example code: 23
> Yhat = a+b*dist #predicted gravel size for the distance values
> s2 = 1/(n-2)*sum((size-Yhat)^2) #estimated variance of residuals
> s = sqrt(s2) #estimated standard deviation of the residuals
> t=-qt(0.025,n-2) #obtain the t distribution value with dof = n-2
> X0=5 # make predictions for a distance of 5 km downstream
> C=t*s*sqrt(1/n+(X0-mean(dist))^2/sum(x^2)) #half-width of the CI
> mu0=a+b*X0 #prediction of mean at X0
> mu0_low=mu0-C #lower value of the 95% CI
> mu0_up=mu0+C #upper value of the 95% CI
> mu0 #show the value of the mean at X0
> mu0_low #show the lower value of the mean confidence interval
> mu0_up #show the upper value of the mean confidence interval
> plot(dist,size,col='black',xlab='Distance [km]',ylab='Gravel size [phi]')
> lines(dist,Yhat,col='black') #plot the regression line
> lines(c(X0,X0),c(mu0_low,mu0_up)) #plot the CI for the predicted mean
Figure 5.8: Regression line showing the linear relationship relating gravel size to distance down-
stream. The vertical line shows the 95% confidence interval for the mean gravel size at a distance
of 5 km downstream (calculated using equation 5.17).
Example code: 24
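The original listing is not reproduced here; a minimal sketch, reusing the quantities defined in Example code 23, could compute the band for a collection of X0 values:
> X0 = seq(min(dist),max(dist),length.out=100) #a collection of X0 values
> C = t*s*sqrt(1/n+(X0-mean(dist))^2/sum(x^2)) #half-widths from equation 5.17
> plot(dist,size,col='black',xlab='Distance [km]',ylab='Gravel size [phi]')
> lines(dist,Yhat,col='black') #regression line
> lines(X0,a+b*X0-C,lty=2) #lower limit of the confidence band (dashed)
> lines(X0,a+b*X0+C,lty=2) #upper limit of the confidence band (dashed)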
Figure 5.9: Regression line showing the linear relationship relating gravel size to distance down-
stream. The dashed lines shows the band of 95% confidence intervals for the mean gravel size
downstream, calculated using equation 5.17 and a collection of X0 values.
Let’s think about the band defined by the confidence intervals in more detail. First, we can see
that the band gets wider towards the edges of the data (i.e., low values of X and high values of X).
This is because as we move away from the center of the data our predictions of the mean contain
larger uncertainties. This level of uncertainty is controlled by the term (X0 − X̄)² in equation 5.17, which we can see will yield larger values (and thus wider confidence intervals) the more removed X0 is from X̄. The second point to note is what would happen to the width of the confidence
intervals as we increased the size of the sample, n. If we imagine that we had an infinitely large
sample, i.e., the whole population, we can see that equation 5.17 would become:
µ0 = (a + bX0) ± t0.025 s √0 (5.18)
so the uncertainty disappears and we can make a perfect prediction of µ0. Although the lab work associated with measuring the size of an infinite number of pieces of gravel may be a challenge, this demonstrates
that as n gets larger the size of the confidence interval will decrease. This shouldn’t be too
surprising to you because clearly the larger the sample size the better it will approximate the
natural system. Of course we can also use equation 5.17 in an alternative manner if we need to
make predictions of the mean grain size with a certain level of uncertainty. Beginning with my
original sample I can estimate the value of n that would be required to make a prediction within
a confidence interval of a certain size and then increase the size of my sample appropriately.
5.2.5.2 Predicting a single future observation
Imagine after I’ve collected my river gravels and performed a regression analysis, I decide to return
to the river to collect one more piece of gravel at a distance X0 downstream. Clearly at any given
point along the river the gravels will have a distribution of sizes and this needs to be included in
our analysis. The best prediction of the size of this new piece of gravel is still the mean given
by the regression model, but now we have to include an uncertainty relating to the distribution
of gravel sizes at a given location. The equation for the prediction interval for a single future observation is:
µ0 = (a + bX0) ± t0.025 s √(1/n + (X0 − X̄)²/∑x² + 1). (5.19)
Again we’ll perform the calculation in R for the gravel data at a collection of positions, X0 , and
add the prediction intervals to the existing plot as dotted lines (Figure 5.10).
Example code: 25
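Again the original listing is not shown; a sketch that adds the prediction intervals to the existing plot, reusing the quantities from the previous examples, is:
> X0 = seq(min(dist),max(dist),length.out=100) #a collection of X0 values
> P = t*s*sqrt(1/n+(X0-mean(dist))^2/sum(x^2)+1) #half-widths from equation 5.19
> lines(X0,a+b*X0-P,lty=3) #lower limit of the prediction band (dotted)
> lines(X0,a+b*X0+P,lty=3) #upper limit of the prediction band (dotted)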
Figure 5.10: Regression line showing the linear relationship relating gravel size to distance down-
stream. The dashed lines show the band of 95% confidence intervals for the mean gravel size
downstream, calculated using equation 5.17 and a collection of X0 values. The dotted lines show
the 95% prediction intervals for the gravel size of a single future observation, calculated using
equation 5.19.
You’ll see that equation 5.19 looks very similar to equation 5.17, but notice the +1 inside the
square-root. This term has an important effect as we increase n. Again, for an infinitely large
sample, equation 5.19 would become:
µ0 = (a + bX0) ± t0.025 s √1,
so we can see the prediction interval will never be reduced to 0, no matter how many samples are
included in the analysis. At first this may seem a bit odd because by considering an infinitely
large sample you would think that we had removed all uncertainty, but this is not the case. As
mentioned above, we know that at a given distance along the river not all the gravel particles will
have the same size. Therefore, we will be selecting a single piece from a distribution of sizes and
thus there is a level of uncertainty associated with the prediction. A further example of this is
given in Figure 5.11 where n is increased for an artificial data set. You can see that as n increases
the confidence interval on the mean gets smaller, but the prediction interval stays approximately
the same width. This is because the data naturally exhibit a scatter around the mean that is not
influenced by n and the prediction interval has to take this scatter into account.
[Figure panels for sample sizes n = 10, 50, 100 and 500.]
Figure 5.11: Notice how with increasing sample size (denoted by n) the width of the confidence
interval of the prediction of the mean (dashed lines) decreases and the level of uncertainty is reduced.
In contrast the prediction intervals on a future observation (dotted lines) remain approximately
constant because they have to incorporate the uncertainty created by the scatter of the data. This
scatter is a natural property of the data and is therefore independent of sample size.
Figure 5.12: The sepals of a flower lie between the petals.
With this type of data it’s very difficult to say which is the independent variable and which is
the dependent variable (if we can even claim that such a relationship exists at all). We'll start by loading the data from the file iris regression.Rdata and plot len (sepal length) as the independent variable and wid (sepal width) as the dependent variable (Figure 5.13). In the examples
above we calculated the various correlation and regression parameters from scratch, but now we
have a chance to use some of the functions built into R to make things a bit easier.
Example code: 26
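The course listing is not reproduced here; once len and wid have been loaded from the data file with load(), a minimal sketch of the plotting step is:
> plot(len,wid,col='black',xlab='Length [cm]',ylab='Width [cm]') #sepal width against sepal length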
Figure 5.13: The sepal data set from 50 of the irises measured by Fisher.
The first thing we need to test is if there is a significant correlation between sepal length and
width. We can do this using the hypothesis test outlined in Section 5.1.4.
Example code: 27
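The original listing is not shown; one possible sketch uses the inbuilt cor.test function, which performs the t-test on r described in Section 5.1.4:
> cor.test(len,wid) #test of the null hypothesis that rho = 0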
Example code: 28
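Again as a minimal sketch (the object name model is the one referred to in the text below):
> model = lm(wid~len) #regression with len as the independent variable
> plot(len,wid,col='black',xlab='Length [cm]',ylab='Width [cm]')
> abline(model) #add the fitted line to the plot (Figure 5.14)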
Figure 5.14: The sepal data set from 50 of the irises measured by Fisher. The line shows the
regression relationship between sepal length (assumed to be the independent variable) and width
(assumed to be the dependent variable).
We can ask R to return a full summary of the regression model stored in model using the
function summary.
Example code: 29
> summary(model)
Call:
lm(formula = wid ~ len)
Residuals:
Min 1Q Median 3Q Max
-0.72394 -0.18273 -0.00306 0.15738 0.51709
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.5694 0.5217 -1.091 0.281
len 0.7985 0.1040 7.681 6.71e-10 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 0.2565 on 48 degrees of freedom
Multiple R-squared: 0.5514, Adjusted R-squared: 0.542
F-statistic: 58.99 on 1 and 48 DF, p-value: 6.71e-10
We can now make predictions of sepal width based on the values of sepal length with the re-
gression equation:
width = 0.7985 ∗ length − 0.5694 (5.20)
What will happen if we try the regression the other way around? Rather than predicting the
width from the length, we want to predict the length from the width. We’ll start by switching the
variables so that wid is the independent variable and len is the dependent variable and then we’ll
calculate the correlation coefficient.
Example code: 30
Example code: 31
> summary(lm(len~wid)) #summary of the regression with the variables switched
Call:
lm(formula = len ~ wid)
Residuals:
Min 1Q Median 3Q Max
-0.52476 -0.16286 0.02166 0.13833 0.44428
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.6390 0.3100 8.513 3.74e-11 ***
wid 0.6905 0.0899 7.681 6.71e-10 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Figure 5.15: The sepal data set from 50 of the irises measured by Fisher. The line shows the
regression relationship between sepal width (assumed to be the independent variable) and length
(assumed to be the dependent variable).
We now have a regression equation to predict sepal length from sepal width:
length = 0.6905 ∗ width + 2.6390 (5.21)
Now we have to ask ourselves an important question; are the regression relationships given in equa-
tions 5.20 and 5.21 equivalent? We can test this by rewriting equation 5.21 to make a prediction
of sepal width:
length = 0.6905 ∗ width + 2.6390 (5.22)
width = (length − 2.6390)/0.6905 (5.23)
Now that both regression equations are written in terms of sepal length to make predictions about
sepal width we can compare the regression lines on the same plot (Figure 5.16).
Example code: 32
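The original listing is not reproduced; a sketch that draws both lines using the coefficients quoted in equations 5.20 and 5.23 is:
> plot(len,wid,col='black',xlab='Length [cm]',ylab='Width [cm]')
> abline(a=-0.5694,b=0.7985,lty=2) #width predicted from length (equation 5.20, dashed)
> abline(a=-2.6390/0.6905,b=1/0.6905,lty=3) #the rearranged regression of equation 5.23 (dotted)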
Figure 5.16: Comparison of the two regression equations obtained with sepal length as the inde-
pendent variable (dashed) and sepal width as the independent variable (dotted). The two lines are
different so we can see that the regression equations are not equivalent.
We find that the two regression equations are not equivalent and the analysis is sensitive to
the selection of the independent and dependent variables. We can see why this effect occurs if
we look back at the way the regression line is calculated. In Figure 5.5 we saw that the best-fit
regression line is found by minimizing the sum of the squared residuals, which represent the errors
in Y but do not consider the possibility of errors in X. This also fits with the assumption we
stated earlier that the errors associated with X should be orders of magnitude less than those on
Y (in other words the errors in X are unimportant compared to those in Y ). So the regression
is calculated by minimizing the residuals in Y and does not consider the X direction. Therefore
when we switch the variables on the X and Y axes we will obtain a different regression equation
(Figure 5.17).
Figure 5.17: Depending on how the independent and dependent variables are selected, the errors which are minimized are different and therefore different regression equations are obtained.
Figure 5.18: Comparison of the (a) least-squares and (b) RMA approaches. Least-squares is based
on the differences (dashed line) between the model and data in the Y direction. RMA is based on
the areas of the triangles (shaded) defined by the differences between the model and data in both
the X and Y directions.
In reduced major axis (RMA) regression the slope and intercept of the line are given by:
b = sY /sX , (5.24)
a = Ȳ − bX̄, (5.25)
where sY represents the standard deviation of the Y variable and sX represents the standard de-
viation of the X variable. Returning to Fisher’s irises, we can perform the RMA analysis in R and
include the resulting line in our plot (Figure 5.19).
Example code: 33
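The course listing is not shown; following equations 5.24 and 5.25 with width as Y and length as X, a minimal sketch that adds the RMA line to the existing plot is:
> b_rma = sd(wid)/sd(len) #RMA slope (equation 5.24)
> a_rma = mean(wid)-b_rma*mean(len) #RMA intercept (equation 5.25)
> abline(a=a_rma,b=b_rma) #add the RMA line as a solid line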
Figure 5.19: Comparison of the two regression equations obtained with sepal length as the inde-
pendent variable (dashed) and sepal width as the independent variable (dotted). The two lines are
different so we can see that the regression equations are not equivalent. The RMA line is shown
as a solid line.
The RMA line lies between the two other regression estimates and has the equation:
width = 1.0754 ∗ length − 1.9554 (5.26)
But what happens when we switch X and Y again?
Example code: 34
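As a sketch of the swapped calculation (again not the original listing):
> b_rma2 = sd(len)/sd(wid) #RMA slope with width treated as the X variable
> a_rma2 = mean(len)-b_rma2*mean(wid) #corresponding RMA intercept
> b_rma2; a_rma2 #rearranging this line for width should recover equation 5.26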
So when we swap the axes we obtain the RMA line:
If we rearrange equation 5.27 to make predictions of sepal width on the basis of sepal length we
get:
Compare this result to equation 5.26 and we can see the RMA line is not influenced by our selection
of X and Y . One drawback of the RMA approach, however, is the calculation of confidence and
prediction intervals is not so straight forward. A very good paper by Warton (see recommended
reading) gives a good discussion of the different approaches to regression and how the different
techniques can be applied.
Multiple linear regression
6
In the previous chapter we looked in detail at correlation and regression when there was a linear
relationship between two variables. For obvious reasons, the case of two variables is called a
bivariate problem and now we’ll extend the idea to multivariate problems where we consider more
variables.
Figure 6.1: How different numbers of dimensions can be represented using increasingly complex
coordinate systems.
We can continue this process of adding as many dimensions as we want (and some problems
require a lot of dimensions) and we never hit a limit. Instead, we just add an extra coordinate each
time we need one. Just to give you some idea of how extreme these things can get, mathematicians
have discovered a symmetrical shape called The Monster that looks a bit like a snowflake and exists
in 196883 dimensional space!
Although we can't draw or build a 4D hypercube we can attempt to represent it using
projections. For example the cube drawn in Figure 6.1 is a projection of a 3D body on to a 2D
surface (a piece of paper). Similarly we can project a 4D hypercube into 2D. Such a projection
is shown in Figure 6.2 and you should be able to tell immediately that it would be difficult to
interpret data plotted in such a projected space. If that’s the case for 4D, just imagine what would
happen for more complicated data sets with higher numbers of dimensions.
If you would like to see what a 4D hypercube looks like projected into 3D you should visit the
Arche de la Défense in Paris. Alternatively, if you would like to see a 2D representation of a 3D
projection of a 4D hypercube then you can look at Figure 6.3.
Figure 6.3: La Grande Arche de la Défense in Paris
Before we finish with this section it’s important to note that many statistical techniques rely
on calculating the distance between points (we’ll look at this in more detail later). Of course in 2D
we can calculate the straight-line distance between points using Pythagoras’ theorem, whereby we
draw a straight-line along the X dimension and a straight-line along the Y dimension to form a
right angled triangle and then the distance is given by the length of the hypotenuse. This procedure
can be extended to 3 or more dimensions, so for any given number of dimensions we can calculate
the distance between two points using Pythagoras’ theorem (Figure 6.4).
Figure 6.4: Pythagoras’ theorem works in 2 (left), 3 (right) or as many dimensions as you want.
Adding further regressor variables to the regression equation gives, for example, the 4-dimensional case
Ŷ = b1 X1 + b2 X2 + b3 X3 + a (6.2)
In this case the regression equation defines a hyperplane that, just like a hypercube, we cannot
visualize easily. However, the method for fitting (hyper-)planes to data with more than one regressor variable is the same as for bivariate (X,Y) regression, namely least-squares. We'll examine
this problem in more detail in R. First we’ll generate some artificial data with a known structure
(so we know the result we should get in advance) and then we’ll build a regression model using
the lm function.
The first step is to create 2 independent variables, X1 and X2 . To do this we’ll generate 2
sets of random numbers in R, each consisting of 500 values. Then we’ll calculate the dependent
variable, Y , based on the relationship Y = 0.5X1 + 1.25X2 + 0.9.
Example code: 35
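The original listing is not reproduced; a minimal sketch of the construction, assuming the random numbers are drawn from a uniform distribution, is:
> X1 = runif(500) #500 random values for the first independent variable
> X2 = runif(500) #500 random values for the second independent variable
> Y = 0.5*X1+1.25*X2+0.9 #dependent variable built from the known relationship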
Example code: 36
Figure 6.5: Scatter plot of the dependent variable Y formed as a function of the independent
variables X1 and X2 .
Next, we’ll build a MLR model using the lm function and then look at the coefficients of the
fitted equation.
Example code: 37
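A minimal sketch of the fitting step (the object name fit matches the later listings):
> fit = lm(Y~X1+X2) #least-squares fit of the multiple linear regression
> coef(fit) #coefficients of the fitted equation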
(Intercept) X1 X2
0.90 0.50 1.25
You should see that the values of the coefficients match those we used to build the Y values.
Although this is no great surprise, it does demonstrate how we can use the lm function for MLR.
Next we'll build a more complicated example and investigate the information returned by lm. We'll start by generating 4 independent parameters,
with each one composed of 500 random numbers. To make things more realistic we’ll also add
some random numbers into the system which will act like measurement noise.
Example code: 38
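The original listing is not shown; a sketch of the construction, assuming uniform random regressors, normally distributed noise with a standard deviation of 0.1, and the relationship quoted later in the text, is:
> X1 = runif(500); X2 = runif(500); X3 = runif(500); X4 = runif(500) #independent variables
> E = rnorm(500,mean=0,sd=0.1) #random noise acting like measurement error (sd assumed)
> Y = 0.5*X1+1.25*X2+0*X3+1.05*X4+0.9+E #dependent variable with added noise
The model itself would then be fitted with lm(Y~X1+X2+X3+X4), as in the previous example.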
Example code: 39
Example code: 40
> Yhat=fitted(fit)
> plot(Y,Yhat,xlab='Y',ylab='Predicted Y')
Figure 6.6: Scatter plot of the dependent variable Y against the predictions, Ŷ , made by the MLR
model.
Clearly we’ll have to extend the statistical analysis further if we are to obtain information
about β rather than simply b. Fortunately, the lm function can give us more information about
which coefficients are important in the regression model. This information can be obtained using
the summary command.
Example code: 41
> summary(fit)
Call:
lm(formula = Y ~ X1 + X2 + X3 + X4)
Residuals:
Min 1Q Median 3Q Max
-0.302141 -0.073881 0.005527 0.071022 0.287857
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.91170 0.01585 57.536 <2e-16 ***
X1 0.49786 0.01526 32.628 <2e-16 ***
X2 1.25338 0.01579 79.374 <2e-16 ***
X3 0.01457 0.01562 0.933 0.351
X4 1.01231 0.01501 67.455 <2e-16 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Example code: 42
R can also calculate the coefficient confidence intervals for us. We just need to use the confint function and define the confidence level we want. As an example we can calculate the 95% confidence intervals for the model in fit:
Example code: 43
> confint(fit,level=0.95) #95% confidence intervals on the coefficients
2.5 % 97.5 %
(Intercept) 0.88056983 0.94283674
X1 0.46787723 0.52783661
X2 1.22235535 1.28440573
X3 -0.01612402 0.04526898
X4 0.98282237 1.04179386
Remember that we constructed the data using the relationship Y = 0.5X1 +1.25X2 +0X3 +1.05X4 +
0.9 and we can see that the estimated confidence intervals for β span the true values of β (note,
your coefficients and confidence intervals might be slightly different to the ones above because you
will have used different random numbers in the construction of E, which will change the final model
fit very slightly).
We can also test which of the variables make a significant contribution to the model, by finding
a t-value given by the ratio of a coefficient to its standard error. Again we’ll do this for the X4
coefficient. We can then compare the t-value to a Student’s t-distribution with n − k degrees of
freedom to find the significance of the regressor in the model, which is termed a p-value.
Example code: 44
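As a minimal sketch (not the original listing) of the calculation for the X4 coefficient:
> co = summary(fit)$coefficients #table of estimates and standard errors
> tval = co["X4","Estimate"]/co["X4","Std. Error"] #ratio of the coefficient to its standard error
> pval = 2*pt(-abs(tval),df=fit$df.residual) #two-sided p-value from Student's t-distribution
> tval; pval #compare with the values in the summary table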
Example code: 45
Figure 6.7: Scatter plot of the dependent variable Y against the predictions, Ŷ , made by the MLR
model.
In this case effective length is defined as “the length of the boundary which is capable of exerting
a net driving or resisting force. For example, two mid-ocean ridges on opposite sides of a plate
exert no net force on the plate because their effects cancel out”. Forsyth and Uyeda collected a
data set for 12 plates (Figure 6.8).
Figure 6.8: The tectonic plates.
Plate          Total area    Continental area   Effective ridge    Effective trench   Speed
name           (10^6 km^2)   (10^6 km^2)        length (10^2 km)   length (10^2 km)   (cm/yr)
North American 60 36 86 10 1.1
South American 41 20 71 3 1.3
Pacific 108 0 119 113 8.0
Antarctic 59 15 17 0 1.7
Indian 60 15 108 83 6.1
African 79 31 58 9 2.1
Eurasian 69 51 35 0 0.7
Nazca 15 0 54 52 7.6
Cocos 2.9 0 29 25 8.6
Caribbean 3.8 0 0 0 2.4
Philippine 5.4 0 0 30 6.4
Arabian 4.9 4.4 27 0 4.2
So the big question is which independent variables could control plate motion and in what way?
We can perform MLR with the four parameters to see how well they can predict the plate
speed. The data are held in a data file called plates.Rdata and the first step is to load them into R.
Example code: 46
Example code: 47
> fit=lm(speed~total_area+cont_area+ridge_len+trench_len) #MLR of plate speed
> summary(fit) #summary of the fitted model
Call:
lm(formula = speed ~ total_area + cont_area + ridge_len + trench_len)
Residuals:
Min 1Q Median 3Q Max
-2.00396 -0.41947 -0.05667 0.23691 2.57130
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.54710 0.73409 6.194 0.000448 ***
total_area -0.03767 0.02211 -1.704 0.132193
cont_area -0.01529 0.04743 -0.322 0.756580
ridge_len -0.01442 0.02067 -0.698 0.507746
trench_len 0.08037 0.02511 3.200 0.015063 *
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
From the Pr(>|t|) column it appears that only the intercept and effective trench length play
a significant role in the regression. This is confirmed by the confidence intervals on the coeffi-
cients, whereby the intervals for the intercept and effective trench length are the only ones not to
span 0.
Example code: 48
> confint(fit,level=0.95)
2.5 % 97.5 %
(Intercept) 2.81125995 6.28293689
total_area -0.08994543 0.01460916
cont_area -0.12745309 0.09687012
ridge_len -0.06329900 0.03444902
trench_len 0.02098227 0.13975114
Finally we can plot the measured plate speeds against the speeds predicted by the regression
model (Figure 6.9).
Example code: 49
Figure 6.9: Scatter plot of the measured plate speed against the predictions of plate speed made by
the MLR model.
6.2.2 Multicollinearity
MLR assumes that the regressors are independent of each other (i.e., they exhibit no correlation).
If, however, a correlation exists between the regressors then they are not independent and suffer
from multicollinearity. If there was a perfect correlation between the regressors X1 and X2 , then an
infinite number of MLR solutions would exist that can all explain the data equally well (however,
perfect correlation is very rare in geological situations). In practice, if multicollinearity exists
the contribution of any regressor to the MLR depends on the other regressors that are already
included in the model. Therefore it is important to check for any correlation between regressors
before beginning MLR and think about the variables you are using.
We will use R to form a data set with variables which suffer from multicollinearity and then
perform a simple MLR. First, we’ll set up the independent variables (X1 and X2 ) each consisting
of 500 random numbers. We’ll make a third regressor, X3 , as a function of X1 and X2 , thus intro-
ducing multicollinearity. Then we’ll form the dependent variable, Y , according to the relationship;
Y = 1.3X1 + 0.8X2 + 1.1X3 + 0.5.
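A minimal sketch of how such a data set might be constructed (the coefficients used to build X3 are an assumption; the output shown under Example code 50 is from the author's own random draw):
> X1 = runif(500) #first independent variable
> X2 = runif(500) #second independent variable
> X3 = 2*X1+0.5*X2 #third regressor built from the first two, introducing multicollinearity
> Y = 1.3*X1+0.8*X2+1.1*X3+0.5 #dependent variable from the stated relationship
> fit = lm(Y~X1+X2+X3) #attempt the MLR fit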
Example code: 50
Call:
lm(formula = Y ~ X1 + X2 + X3)
Residuals:
Min 1Q Median 3Q Max
-1.801e-14 -1.360e-16 4.220e-17 2.078e-16 7.019e-15
We've obviously managed to confuse R: the NA values mean Not Available, because the multicollinearity means a stable solution can't be found. We can also see that the coefficients returned by the model are not close to the ones we used to build it, which means that we could fundamentally misunderstand the system we're working with. Something surprising happens when
we look at the predictions of the regression model (Figure 6.10).
Example code: 51
[Figure 6.10: scatter plot of Y against the predicted Y values from the regression model.]
We can see that although the predicted coefficients are incorrect, the actual predictions of Y
are perfect. Therefore, the derived relationship between the variables and Y is incorrect because
of the multicollinearity, but the prediction of Y is accurate. This allows us to make predictions
but not to quantify the underlying relationship.
Multicollinearity can be difficult to detect in real data sets where the relationship between the
regressor variables may be poorly understood. There are however cases such as compositional data
where we would expect multicollinearity to occur (Chapter 9). As with bivariate regression it is
also important to consider the following points.
2. Are there any outliers which could strongly influence the regression (influential values can
be found using Cook’s distance)?
As an exercise, consider a collection of breakfast cereals for which a taste rating has been determined along with four other properties:
Calorie content.
Sodium content.
Carbohydrate content.
Potassium content.
The R file cereals.Rdata contains the following variables; rating, calories, Na, carbo and K. Use
MLR to find the significant regressors and determine the regression relationship. How does predicted
taste compare to the taste rating? Use the code from the previous examples as a template, the
solution to the problem is on the next page.
We will assume that any effects of multicollinearity are negligible. Start by performing the
regression:
Example code: 52
> summary(lm(rating~calories+carbo+K+Na)) #MLR of the taste rating
Call:
lm(formula = rating ~ calories + carbo + K + Na)
Residuals:
Min 1Q Median 3Q Max
-13.693 -4.820 -0.966 3.685 24.756
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 65.47364 8.52176 7.683 5.98e-11 ***
calories -0.54754 0.04566 -11.992 < 2e-16 ***
carbo 1.28172 0.22697 5.647 3.04e-07 ***
K 0.09131 0.01298 7.033 9.59e-10 ***
Na 0.03986 0.03512 1.135 0.26
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Figure 6.11: Scatter plot of the taste rating values against the predictions of taste rating.
The average human has one breast and one testicle.
Des McHale
Cluster analysis
7
In the previous discussion of regression techniques we assumed that variables were related by some
form of continuous relationship that could be described by a line or a similar function. In some
situations, however, we may expect data to fall into a collection of discrete groups rather than
describing a continuous path. An example is given on the left of Figure 7.1, which shows the
regression line for the river gravels we studied earlier. Not surprisingly, we found that the change
in gravel size as we moved downstream could be described by a continuous line. On the right of
Figure 7.1 you can see a very different situation where the data points appear in distinct groups.
Would it make sense to fit a straight-line through this data? Clearly not, we would be using a
continuous function to try and describe a data set that appears not to be continuous. Instead we
can gain insights into the data by trying to find and characterize the groups of data within the
sample. To do this we will employ cluster analysis.
Figure 7.1: An example of the kinds of data that should be analyzed using regression analysis (left)
and cluster analysis (right).
7.1 The principles behind cluster analysis
Cluster analysis, also called segmentation analysis or taxonomy analysis, is a way to create groups
of objects, or clusters, in such a way that the properties of objects in the same cluster are very
similar and the properties of objects in different clusters are quite distinct. Cluster analysis is
a multivariate technique so it can work in large numbers of dimensions, where each dimension
represents one property of the data set. There are a variety of different clustering methods, we’ll
only be looking at a small number of them, but they all work on a similar principle, which is to
measure the similarity between data points.
Figure 7.2: How can we judge which point, B or C, is the most similar to A?
There are a number of different ways in which we can measure distance. The one we are most
familiar with is the Euclidean distance, which is simply the length of a straight-line connecting two
points. As we saw in Section 6.1 the Euclidean distance can be calculated for any given number of
dimensions using Pythagoras’ theorem. There are, however, different ways to measure distance. A
simple example is the Manhattan Street Block distance, which rather than measuring the shortest
straight-line distance between two points, instead sums the absolute differences of their coordinates
(Figure 7.3). This is also known as Taxicab geometry because if you imagine the grid layout of
streets in New York, a taxi driver cannot go in a straight-line between two locations, but instead
can just move along north-south or east-west streets.
Figure 7.3: A comparison of Euclidean (left) and Manhattan Street Block (right) distances.
There is even a measure of distance known as the Canberra distance! This measures the
distance of data points from a central point of interest, analogous to how the suburbs are set up
in Canberra. Not surprisingly the Canberra distance was developed by two workers at C.S.I.R.O.
in the ACT.
Imagine, for example, a data set in which we have measured several properties of a group of people:
Height
Foot length
Finger width
Umbilical radius
In this case it is clear that the absolute variations in height will be much larger than the other
variables, and they will therefore dominate the analysis. To avoid this effect the data are standardized before the analysis; this means setting the mean of each variable to 0 and the standard deviation to 1. This is also known as taking the z-score and for a collection of values, X, we can standardize
the data using the equation:
z = (X − X̄)/s, (7.1)
where X̄ and s are the mean and standard deviation of the values in X, respectively. This is a
simple operation to perform in R and as an example we’ll standardize a sample of 100 random
numbers.
Example code: 53
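The original listing is not reproduced; a minimal sketch, assuming the random numbers come from a normal distribution with an arbitrary mean and standard deviation, is:
> X = rnorm(100,mean=10,sd=3) #a sample of 100 random numbers
> z = (X-mean(X))/sd(X) #standardize following equation 7.1
> mean(z); sd(z) #check: the z-scores have a mean of 0 and a standard deviation of 1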
Example code: 54
Figure 7.4: A log-normal distribution (left) can be transformed into a normal distribution (right)
simply by taking the logarithm of the values.
Figure 7.5: In our hierarchical clustering example we start with 6 data points based on 2 measured
parameters, X and Y. The individual points are labeled so that we can see how the clustering
procedure advances.
The first step of the hierarchical clustering procedure is to find the 2 points that are most
similar to each other. Because we know that distance is a good measure of similarity, we can
simply say that the two points closest to each other are the most similar. Examination of the data
shows that the closest two points are C and D so we connect them with a line and create a new
data point (shown by a black dot) at the center of the line (Figure 7.6). Simultaneously we’ll track
the development of the clusters in a so-called dendrogram, in which we draw a link between points
C and D at a height corresponding to the distance separating them.
Figure 7.6: The closest two points, C and D, are deemed to be the most similar and are connected.
A new point is then created halfway along the link between the points. This connection can be
represented in the dendrogram on the right, with the link between the points drawn at a height
corresponding to the distance separating them.
Because we have now connected C and D together, they are no longer considered separately in
the analysis, but instead are represented by the single point midway along their link. Now we need
to find the next most similar pair of points, which is B and E. Again we link the points and create
a new point midway between them (Figure 7.7). We can also add this link to the dendrogram.
Notice in the dendrogram that the link between B and E is higher than the link of C and D
because the points are separated by a greater distance.
Figure 7.7: Points B and E are connected and an additional link is added to the dendrogram.
The next shortest connection is between F and the midpoint in the link between C and D.
Again we connect these points and insert a new point halfway along the connection (Figure 7.8).
In the dendrogram a connection is made between F and the center of the C and D connection.
You should now be able to see how the dendrogram records the sequence of connections that are
being formed in the data set.
Figure 7.8: The next connection links F with the point between C and D. This is represented in the
dendrogram as a link connecting into the existing C to D link.
The next shortest connection is point A to the middle of the connection between F and C to
D. This link is also added to the dendrogram (Figure 7.9).
Figure 7.9: Point A is now included in the hierarchy and a link is formed in the dendrogram that
connects it to the existing cluster of points containing C, D and F.
Finally, the center point of the B to E link is connected to the center point that was inserted
when the previous link was made. This connects B and E to the other data points and the final
link is placed in the dendrogram (Figure 7.10).
Figure 7.10: The final connection completes the hierarchical clustering routine and the resulting link
in the dendrogram shows how all 6 points are connected to each other, the length of the connections
and the sequence in which the connections were made.
To show how hierarchical clustering can be performed in R we’ll use the same set of data points
and calculate a dendrogram directly. As a first step we’ll enter the data values by hand and plot
them.
Example code: 55
Example code: 56
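The original listings are not shown; the following sketch uses hypothetical coordinates for the six points (chosen only to mimic Figure 7.5) and the average linkage method that appears in the later dendrograms:
> X = c(-0.30,0.55,-0.75,-0.70,0.95,-0.55) #hypothetical X values for points A to F
> Y = c(1.15,-0.12,0.40,0.15,-0.02,0.72) #hypothetical Y values for points A to F
> plot(X,Y,col='black') #plot the raw data points
> text(X,Y,labels=c('A','B','C','D','E','F'),pos=3) #label the points
> D = dist(cbind(X,Y)) #Euclidean distances between all pairs of points
> plot(hclust(D,method='average')) #hierarchical clustering and the resulting dendrogram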
Example code: 57
Example code: 58
[Dendrogram computed with hclust using average linkage.]
Figure 7.11: Final dendrogram for 30 cases taken from Fisher’s iris data set. Notice how the
dendrogram reveals clearly the presence of two different species in the sample, which are only
connected by the final link. The numbers along the bottom correspond to the case number in the
data set.
Example code: 59
Figure 7.12: Final dendrogram for the 100 cases that compose Fisher’s complete iris data set.
Notice how the fine details of the dendrogram become difficult to interpret when a large number of
cases are included.
7.3 Deterministic k-means clustering
K-means clustering is another way of finding groups within large multivariate data sets. It aims
to find an arrangement of cluster centers and associate every case in the data set to one of these
clusters. The best cluster solution minimizes the total distance between the data points and their
assigned cluster center. The procedure for calculating a k-means clustering follows 4 basic steps:
1. Choose the number of clusters, k, and place the k cluster centers at initial positions.
2. Assign each case to its nearest cluster center.
3. Move each cluster center to the mean position of the cases assigned to it.
4. Recalculate the positions of the cluster centers until a minimum total distance is obtained across all the cases.
To demonstrate the ideas behind k-means clustering we’ll look at a graphical example that considers
a simple two-dimensional data set. The data set clearly contains two different groups of points,
those with high X and high Y values and those with low X and low Y values (Figure 7.13). In
k-means clustering we aim to find cluster centers that mark the centers of the different groups.
Each case is then assigned to the nearest cluster center and it is assumed that they have similar
characteristics. Of course, the characteristics of a cluster are given by the location of the center in
the parameter space. Finally, because each case can only belong to one cluster this is a so-called
hard clustering technique.
Figure 7.13: A simple example of k-means clustering. The cases on the left show that there are two
clear groups in the data. K-means involves defining the centers of these groups (marked by filled
circles) and associating the individual cases to them to form clusters. The positions of the cluster
centers in the parameter space define the characteristics of that cluster.
Therefore to obtain a clear plot of the clusters in two-dimensions we must choose 2 variables which
show a good separation between the cluster centers. To look at this problem we’ll return to Fisher’s
100 irises and perform a k-means clustering of the 4 petal and sepal measurements, splitting the
data into 2 clusters (Figure 7.14).
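As a hedged sketch (not the course listing), assuming the four measurements are held in a data frame called irisX with the petal length and petal width in columns 3 and 4:
> fit = kmeans(irisX,centers=2) #hard 2-cluster k-means solution
> plot(irisX[,3],irisX[,4],pch=c(3,4)[fit$cluster],xlab='Petal length',ylab='Petal width')
> points(fit$centers[,3],fit$centers[,4],pch=19) #cluster centers shown as filled circles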
Figure 7.14: K-means clustering of Fisher’s iris data set. When plotted using the petal length
and petal width variables, a 2 cluster solution can be visualized with a clear separation between
the cluster centers (filled circles). Each case is assigned to its nearest cluster center. Cluster 1
(plus signs) is characterized by a center with small petal widths and small petal lengths. Cluster 2
(multiplication signs) is characterized by a center with large petal widths and large petal lengths.
If we repeat the exercise but now calculate a more complex 3 cluster solution we can see that
it becomes harder to visualize the results with just 2 parameters (Figure 7.15). There is a strong
overlap between two of the clusters, which may indicate that there are too many clusters in the
model (i.e., the data really only consists of 2 groups) or that we haven’t chosen the best parameters
to display the results.
Figure 7.15: K-means clustering of Fisher’s iris data set. When plotted using the petal length and
petal width variables a 3 cluster solution shows a clear overlap between two of the clusters (different
clusters are shown with stars, plus and multiplication signs). There is a poor separation between
two of the clusters in the top right of the plot.
that explains what process each cluster represents.
For each case the silhouette value compares the average distance to the other cases in its own cluster, a, with the average distance to the cases in the nearest neighbouring cluster, b, via s = (b − a)/max(a, b), so the values range between −1 and +1. Therefore a good cluster solution will be one which produces silhouette values close to +1. The values for all the cases in a given solution can be displayed in a silhouette plot. Let's look at an example of some two-dimensional data that clearly contains 4 clusters (Figure 7.16).
Figure 7.16: Example two-dimensional data set that clearly contains 4 clusters.
First we’ll calculate a 2 cluster solution, which is clearly inadequate to explain the data given
that we know it contains 4 clusters. The data is split into two clusters and the silhouette values
are generally around 0.5 (Figure 7.17).
1 : 25 | 0.47
2 : 25 | 0.48
Figure 7.17: The assignments of the individual cases to the final clusters are indicated by the
different symbols (left). The silhouette plot on the right gives a histogram of the case silhouette
values for each cluster (gray bars). The numbers to the right of the plot show how many cases are
in each cluster and what their average silhouette value is. For example there are 25 cases in cluster
1 and their average silhouette value is 0.47.
Now we can try a more complex 3 cluster solution and again examine the silhouette plot
(Figure 7.18). We can see that the silhouette values for clusters 2 (crosses) and 3 (triangles) are
close to 1, which suggests that points have been assigned to them correctly. Cluster 1 (circles),
however, is still spanning two sets of data points, so its silhouette values are reduced.
1 : 20 | 0.48
2 : 15 | 0.85
3 : 15 | 0.79
Figure 7.18: The assignments of the individual cases to the final clusters are indicated by the
different symbols (left). The silhouette plot on the right gives a histogram of the case silhouette
values for each cluster (gray bars). The numbers to the right of the plot show how many cases are
in each cluster and what their average silhouette value is. In this case the values for cluster 1 are
still low because it is spanning two of the clusters in the data.
A 4 cluster solution clearly (and not too surprisingly) does a good job of partitioning the data
into its correct groups (Figure 7.19). None of the cases are assigned wrongly so all the clusters
have high silhouette values.
1 : 10 | 0.85
2 : 15 | 0.85
3 : 10 | 0.83
4 : 15 | 0.79
Figure 7.19: The assignments of the individual cases to the final clusters are indicated by the
different symbols (left). The silhouette plot on the right gives a histogram of the case silhouette
values for each cluster (gray bars). The numbers to the right of the plot show how many cases are
in each cluster and what their average silhouette value is. In this case the values for all the clusters
are high because the data have been grouped correctly.
So far we have only considered the cases where not enough, or just enough, clusters were
included. But what happens when we calculate a solution that contains too many clusters? We’ll
work with the same data set, but this time calculate a 5 cluster solution (Figure 7.20). We can see
that in order to accommodate the extra cluster, one of the clusters has been split into two parts
(bottom right). This means that some of the points in clusters 4 and 5 have low silhouette values
because they are not distinctly in one cluster or another or may have been assigned to the wrong
cluster.
1 : 10 | 0.85
2 : 15 | 0.85
3 : 10 | 0.83
4 : 10 | 0.25
5 : 5 | 0.38
Figure 7.20: The assignments of the individual cases to the final clusters are indicated by the
different symbols (left). The silhouette plot on the right gives a histogram of the case silhouette
values for each cluster (gray bars). The numbers to the right of the plot show how many cases are
in each cluster and what their average silhouette value is. In this case one group in the data has
been split into two parts to accommodate an additional cluster. This leads to an ambiguity in the
assignment of the cases that is reflected in the silhouette plot.
A 6 cluster solution is clearly too complex, with 2 of the data groups (top left and bottom
right) being split in order to accommodate the additional clusters (Figure 7.21). The low values
in the silhouette plot show the high uncertainties associated with the case assignments when the
model is overly complex.
1 : 10 | 0.85
2 : 15 | 0.85
3 : 7 | 0.49
4 : 3 | 0.36
5 : 10 | 0.25
6 : 5 | 0.38
Figure 7.21: The assignments of the individual cases to the final clusters are indicated by the
different symbols (left). The silhouette plot on the right gives a histogram of the case silhouette
values for each cluster (gray bars). The numbers to the right of the plot show how many cases
are in each cluster and what their average silhouette value is. In this case two groups in the data
have been split into parts to accommodate the additional clusters. This leads to an ambiguity in
the assignment of the cases that is reflected in the silhouette plot.
As a general approach to model selection we can calculate the average silhouette value across all
cases for a given number of clusters. If we then compare the different values for different models we
can choose the number of clusters which returns the highest average silhouette value (Figure 7.22).
Figure 7.22: For our example data set the average silhouette value reaches a maximum when 4
clusters are included in the analysis. This suggests we should select a 4 cluster model to represent
the data. Of course we set the data up to have 4 clusters so this result is not surprising, but it
demonstrates the approach.
Example code: 60
121
Example code: 61
Example code: 62
Example code: 63
> fit$silinfo$avg.width
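As a sketch of how this calculation might be scripted with the cluster package (assuming the standardized data are held in a matrix called dat; the names here are illustrative and the course's own listing may differ):

library(cluster)
avg.width <- numeric(5)
for (k in 2:6) {
  fit <- pam(dat, k)                          # k-medoid clustering with k clusters
  avg.width[k - 1] <- fit$silinfo$avg.width   # average silhouette value over all cases
}
plot(2:6, avg.width, type = 'b',
     xlab = 'Number of clusters', ylab = 'Average silhouette width')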
The results of the analysis are shown on the next page.
122
The results of the silhouette analysis are shown in Figure 7.23 and suggest that a two cluster
model should be selected.
[Plot of the average silhouette width against the number of clusters (2 to 6)]
Figure 7.23: Plot of the average silhouette value as a function of the number of clusters for the
Portuguese rocks data set. The silhouette values suggest we should select a 2 cluster model to
represent the data.
If we perform the analysis for two clusters we can look at how the rocks are assigned to the
different cluster centers and the properties of the cluster centers. We’ll start by producing the two
cluster model and looking at the cluster assignments.
123
Example code: 64
Example code: 65
> fit$medoids
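A sketch of how the two cluster model and its assignments might be produced (again assuming the standardized rock data are in a matrix called dat; the names are illustrative):

library(cluster)
fit <- pam(dat, 2)        # two cluster (k-medoid) solution
fit$clustering            # the cluster to which each rock is assigned
fit$medoids               # the properties of the two cluster centers (medoids)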
124
Example code: 66
For the center of the second cluster we obtain oxide percentages of:
SiO2 [%]   Al2O3 [%]   Fe2O3 [%]   MnO [%]   CaO [%]   MgO [%]   Na2O [%]   K2O [%]   TiO2 [%]
  0.5         0.4         0.1        0         54.4      0.6       0.07       0.13       0
Given that cluster 2 contains limestones and marbles the composition above is not surprising.
Repeat the calculation to find the composition of the first cluster center and check that it matches
with what you would expect from the rocks included in the cluster.
125
126
Simplicity, simplicity, simplicity! I say, let your affairs be as two
or three, and not a hundred or a thousand. Simplify, simplify.
Henry David Thoreau
Figure 8.1: An example 2D data set. How many dimensions do we need to describe the variability
of the data fully?
Because the data points all fall exactly on a straight-line we can describe their variability fully
with a single dimension (a straight-line passing through the points). This means that we are able
to take the 2D data and by exploiting its structure, specifically the perfect correlation between X
and Y , reduce the number of dimensions needed to represent it to 1. This is shown in Figure 8.2
where the data points are plotted on a 1 dimensional line.
Figure 8.2: Because the data points in Figure 8.1 fall on a line, we can rotate our coordinate
system so that all of the points lie along a single dimension. Therefore their full variability can be
explained in 1D.
Now consider the two-dimensional data set in Figure 8.3: how many dimensions do we need to
describe its variability fully?
Figure 8.3: Another example of a 2D data set, where the points do not all fall perfectly onto a
straight-line. How many dimensions do we need to describe the variability of the data fully?
Because the points do not lie perfectly on a straight-line we will still need two dimensions to
describe the variability fully. However we could make a compromise and say that the deviations
from a straight-line are very minor so we could still represent the vast majority of the data vari-
ability using a single dimension passing along the main path of the data. By doing this we’ll lose
some information about the data (specifically their deviations from the line) but maybe that is an
acceptable loss given that we can reduce the number of dimensions required to represent most of
the data variability.
Often in data sets with many variables, a number of the variables will show the same variation
because they are controlled by the same process. Therefore in multivariate data sets we often
have data redundancy. We can take advantage of this redundancy by replacing groups of corre-
lated variables with new uncorrelated variables (the so-called principal components). Principal
component analysis (PCA) generates a new set of variables based on a linear combination of the
original parameters. All the principal components are orthogonal (at right-angles) to each other so
there is no redundant information because no correlation exists between them. There are as many
principal components as original variables, however, because of data redundancy it is common for
the first few principal components to account for a large proportion of the total variance of the
original data.
Principal components are formed by rotating the data axes and shifting the origin to the point
corresponding to the multivariate mean of the data. Let’s look at an example similar to the one
above. We’ll take a 2D data set and see how much of the data variability is explained by the
principal components (Figure 8.4).
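A minimal sketch of this idea on a synthetic 2D data set (the data below are invented purely for illustration):

set.seed(1)
x <- rnorm(100, mean = 6, sd = 2)
y <- 0.8 * x + rnorm(100, sd = 0.5)     # y is strongly correlated with x
pc <- princomp(cbind(x, y))             # fit the principal components
summary(pc)                             # proportion of the variance explained by PC1 and PC2
plot(pc$scores[, 1], pc$scores[, 2], xlab = 'PC1', ylab = 'PC2')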
[Axis labels: X (51% of variability), left panel; PC1 (89% of variability), right panel]
Figure 8.4: In the original data set (left) each dimension explains approximately 50% of the total
variability. Principal components (marked as PC1 and PC2) can be fitted to the data. The data
can then be plotted with the principal components defining a new coordinate system (right) where
89% of the total variability can be explained using a single dimension.
Now we’ll consider a 3D case. In Figure 8.5 the three-dimensional data can be represented very
well using only the first 2 principal components. This simplifies the data into a two-dimensional
system and allows it to be plotted more easily. The first 2 principal components account for 99%
of the total data variability, whilst the third component accounts for the remaining 1%.
[Recoverable axis labels: Z (26% of variability), left panel; PC3 (1% of variability), right panel]
Figure 8.5: The original data set is shown on the left and the variability explained by each dimension
is shown in the axis labels. The principal components can be fitted to the data (black lines). The
data can then be plotted with the principal components defining a new coordinate system (right).
Given that the third principal component describes such a small amount of the total data
variability we can consider dropping it from the plot. Therefore if we only plot the first 2 principal
components, as in Figure 8.6, we lose 1% of the data variability from the representation, but maybe
this is worth it when we can represent the data in a 2D rather than 3D plot.
[Axes: PC1 (82% of variability) against PC2 (17% of variability)]
Figure 8.6: By only plotting the first 2 principal components we lose a small amount of information
(1% in this case), but the data can be plotted in two dimensions and is easier to interpret.
We’re not going to dwell on how the principal components are calculated, but instead we’ll
focus on their application. To give you a hint, the principal components are obtained from so-
called eigenvector analysis, which involves the calculation of eigenvalues and eigenvectors. The
eigenvectors describe the directions of the principal components through the data, whilst the
eigenvalues describe the length of the components (the longer the component, the more of the
data variability it describes). Information on how a data set is related to its principal components
is given by the scores and loadings. The scores give the coordinates of the data points in the
principal component space and the loadings show how the orientation of the principal components
is related to the original coordinate system. If these concepts aren’t too clear to you at the moment,
some examples should help.
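To make the link concrete, here is a small sketch (on invented data) showing that the principal component directions and variances are simply the eigenvectors and eigenvalues of the data covariance matrix:

set.seed(2)
X <- matrix(rnorm(200), ncol = 2) %*% matrix(c(1, 0.9, 0, 0.4), 2, 2)  # correlated 2D data
Xz <- scale(X)
ev <- eigen(cov(Xz))
ev$vectors              # eigenvectors: the directions of the principal components
ev$values               # eigenvalues: the variance along each component
pc <- princomp(Xz)
unclass(pc$loadings)    # the same directions (signs may be flipped)
pc$sdev^2               # the same variances, up to a factor of (n-1)/n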
131
Of course, as the names suggest, the lengths must obey the relationship X1 > X2 > X3. We'll now
calculate some more properties of the blocks based on the length of the axes.
132
The first thing we can do to investigate the structure of the blocks data set is to look at how
the different parameters (X1 through X6 ) are correlated to each other. Given that we know how
the data set was constructed we should have a general feeling of what correlations should exist.
We can calculate a so-called correlation matrix, which consists of the correlation of each parameter
to all the other parameters and plot it in a form which is easy to interpret (Figure 8.8).
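One way such a figure could be produced is with the corrplot package (an assumption on my part; other packages can draw similar ellipse-based correlation plots). Assuming the six measurements are the columns of a matrix called blocks:

library(corrplot)
R <- cor(blocks)                    # the 6 x 6 correlation matrix
corrplot(R, method = 'ellipse')     # draw each correlation as a colour-coded ellipse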
Figure 8.8: Graphical representation of the blocks correlation matrix. The colour and orientation
of the ellipses indicates either a positive (blue) or negative (red) correlation. Additionally, a darker
shade of the colour indicates a stronger correlation. The strength of the correlation is also indicated
by the form of the ellipse, ranging from circles (no correlation) to straight-lines (perfect correlation).
The correlation matrix shows the kind of relationships we would expect. For example a strong
positive correlation exists between X1 and X4 , which is not surprising because as the longest
side of the block increases the length of the diagonal should also increase. We also see negative
relationships, for example between X3 and X5 , again this is not surprising because as the length of
the shortest side increases, X5 will decrease because the denominator in the ratio becomes larger.
Because some of the parameters in the blocks data set are correlated to each other it means that
we have data redundancy and PCA should provide us with a good representation of the data
variability. We can test this supposition by calculating the so-called scree plot, which shows how
much of the data variability each principal component represents (Figure 8.9).
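A sketch of how the scree plot (and, further below, the score plot) might be produced, again assuming the block measurements are in a matrix called blocks:

pc <- princomp(scale(blocks))     # PCA of the standardized blocks data
screeplot(pc)                     # variance explained by each principal component
summary(pc)                       # cumulative proportion of the total variance
plot(pc$scores[, 1], pc$scores[, 2], xlab = 'PC1 Score', ylab = 'PC2 Score')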
Figure 8.9: A scree plot of the PCA of the blocks data set. The scree plot shows how much variance
each principal component explains.
Remember, the aim of PCA is to provide a good representation of the data variability in a small
number of dimensions. Therefore it is important to check the scree plot and see what proportion
of the variance is explained by the first 2 or 3 principal components. In the case of the blocks data,
the first two principal components explain >90% of the data variability. Therefore we know that
we are not losing too much information by only considering the first 2 principal components. We
can now plot the blocks according to their coordinates with respect to the principal components.
This information is given by the scores and we’ll simply plot each block according to its scores
with respect to the first and second principal components (Figure 8.10).
Figure 8.10: PCA representation for the 25 blocks based on the scores of the first 2 principal
components.
We can now see that the positions of the blocks in the principal component space follow gen-
eral trends according to their shape characteristics. We can get this information in more detail by
studying the loadings, which show how the directions of the principal components are related to
the original parameters.
135
overall size of the blocks. The second principal component is a little more complicated. The
parameters X3 and X6 have opposite signs to the other parameters. Again this is a little clearer if
we look back at the scores in Figure 8.10. We can see that blocks with high scores on the second
principal component tend to be more equidimensional, but those with large negative scores are
more tabular. This makes sense given that in the loadings, X3 has a different sign to X1 and X2 .
Therefore it appears the second principal component is representing the shape of the blocks rather
than their size.
Hopefully, this example demonstrates the concepts behind PCA and how we can go about the
interpretation of the results. Of course in a manufactured example like this one it is easier to
understand the results because all the relationships between the different parameters are defined
in advance. In real world examples it is necessary to study the principal component scores and
loadings in detail in order to form a clear understanding of a PCA structure.
The skulls of 82 individuals were examined and the measurements of four different parts of the
skull were taken:
Braincase width.
Cheek tooth length.
Bulla length.
Bulla depth.
We’ll now perform a PCA on the oreodont skull measurements. As a first step we’ll load the data
file (Orodont.Rdata) and plot all the variables in the data matrix, X, against each other in a series
of 2D plots (Figure 8.12).
Example code: 67
> rm(list=ls())
> load('Orodont.Rdata')
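The scatter-plot matrix itself can then be drawn with a single command (assuming, as stated above, that the measurements are stored in the data matrix X):

> pairs(X)   # all pairwise bivariate plots of the skull measurements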
[Matrix of pairwise scatter plots of the four skull measurements: braincase width, cheek tooth length, bulla length and bulla depth]
Figure 8.12: The various oreodont skull measurements displayed as a collection of bivariate plots.
Just as with the blocks example we can calculate the correlation matrix and plot it (Figure 8.13).
We see strong positive relationships between the 4 different measurement types. This is not too
surprising because we would expect that as one component of a skull gets larger, all the other
components become larger as well. The high correlation suggests that a data redundancy may
exist that can be exploited by PCA.
[Recoverable part of the correlation matrix: the braincase width row reads 1.00, 0.90, 0.81, 0.71]
Figure 8.13: Correlation matrix for the four sets of measurements made on the collection of ore-
odont skulls.
The first step of the PCA is to standardize the data using the scale function and then process
the data with the princomp function. To illustrate the results we’ll then plot the first 2 principal
components (Figure 8.14).
Example code: 68
> dev.off()
> Xz=scale(X,center=TRUE,scale=TRUE) #standardize the data
> pc=princomp(Xz) # calculate the PCA solution
> plot(pc$scores[,1],pc$scores[,2],xlab='PC1',ylab='PC2') #plot the PCA scores
Figure 8.14: Score of the first two principal components for the 82 oreodont skulls.
We can see that the collection of skulls appears to fall into two different groups that may cor-
respond to different species. To understand these groups in more detail we must first look at the
scree plot to see how much of the data variability is explained by the first 2 principal components
and then the principal component loadings. First the scree plot (Figure 8.15).
Example code: 69
Figure 8.15: Scree plot from the PCA of the oreodont skulls. The first 2 principal components
explain ∼96% of the data variability.
The scree plot shows that the first two principal components explain about 96% of the data
variability, which suggests the scores in Figure 8.14 provide a reliable representation of the data.
To understand what the first two principal components represent we must look at the loadings.
Example code: 70
> pc$loadings[,1:2]
Comp.1 Comp.2
Braincase width -0.4971044 0.4878506
Cheek tooth length -0.5012622 0.4681118
Bulla depth -0.5185804 -0.2901168
Bulla length -0.4823876 -0.6772780
All 4 variables have a similar influence on PC1, which suggests that PC1 represents changes in
overall skull size. For PC2, the size of the bulla controls the component in a different way to the
size of the teeth and braincase, which suggests that PC2 is related to the shape of the skull.
Figure 8.16: Score of the first two principal components for Fisher’s 100 irises.
The scores show two groupings, which supports the results of our earlier cluster analysis and
again indicates that two different species are contained in the sample. The first two principal
components explain over 96% of the variability and therefore provide a good representation of the
data set.
Water depth
Temperature
Nutrient content
Oxygen content
A typical benthic foraminifera data set will contain the number of individuals of different taxa
(maybe ∼400) from different depths in a sediment core or many different surface sediment locations.
In this form we may have a 400 dimensional space and we would need 79800 charts if we wanted
to make bivariate scatter plots comparing each of the taxa’s abundances. How can we use PCA
to make the analysis of the assemblage an easier task?
142
Often in data sets with many variables, a number of the variables will show the same variation
because they are controlled by the same process. Many of the species will respond to the same
environmental conditions in a similar way, i.e., we have data redundancy. Therefore, we can
perform a PCA and use the loadings to find which species vary with which principal component.
We know which environmental parameters control the specific species abundance and therefore
we can make an environmental interpretation about what the individual principal components
represent. There can be some complicating issues when working with abundance data such as
foraminifera percentages because they are closed data, but we’ll talk about this in more detail in
Chapter 9.
Figure 8.17: A data set consisting of 1000 random points within a 3D sphere were generated (left).
A NLM of the 3D data to map it into 2D produces points within a circle.
Alternatively, we can start with random points in a 5D hypersphere (we can’t visualize this
but it is simple enough to represent mathematically). If we map from 5D to 3D we’ll get a sphere
full of points and if we map to 2D we’ll get a circle full of points (Figure 8.18).
Figure 8.18: A data set consisting of 1000 random points within a 5D sphere. A NLM of the data
into 3D produces a sphere of points (left) and mapping into 2D produces a circle of points (right).
For any mapping into a p dimensional space we can calculate the stress, E ∈ [0, 1], which is given by:

E = \frac{1}{\sum_{i=1}^{n-1}\sum_{j=i+1}^{n} \delta_{ij}} \sum_{i=1}^{n-1}\sum_{j=i+1}^{n} \frac{(\delta_{ij} - d_{ij})^{2}}{\delta_{ij}}    (8.1)

where \delta_{ij} is the distance between points i and j in the original high-dimensional space and d_{ij} is the distance between the same two points in the p dimensional mapping.
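As a sketch of how such a mapping might be computed, one implementation of this kind of nonlinear map is the sammon function in the MASS package (using it as the NLM here is my assumption; the course's own listings may use a different routine):

library(MASS)
set.seed(3)
p <- matrix(runif(3000, -1, 1), ncol = 3)    # random candidate points in a cube
p <- p[rowSums(p^2) <= 1, ][1:300, ]         # keep 300 points inside the unit sphere
nlm2 <- sammon(dist(p), k = 2)               # map the interpoint distances into 2D
plot(nlm2$points, xlab = 'NLM X', ylab = 'NLM Y', asp = 1)
nlm2$stress                                  # the stress (Equation 8.1) of the final configuration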
144
Example code: 71
Figure 8.19: NLM of Fisher’s iris data, mapping the data from 4 dimensions down to 2 dimensions.
So, on the basis of the flower measurements the NLM also reveals the presence of two distinct
groups of irises, which may be two different species. An important point to note is the names and
units of the NLM axes. As we saw in equation 8.1 the NLM is based on the distances between
points and the axes will combine the different variables together in such a way that we cannot give
meaningful names or units to the axes. Therefore we just assign the axes arbitrary names, like the
ones in Figure 8.19, which show that we are dealing with an NLM.
145
slates, limestones and breccias. The measured oxides (all expressed as %) were: SiO2, Al2O3,
Fe2O3, MnO, CaO, MgO, Na2O, K2O and TiO2. We'll extend the data set and include an addi-
tional 9 physical-mechanical measurements, which are as follows:
RMCS: Compression breaking load (kg/cm2 ).
RCSG: Compression breaking load after freezing tests (kg/cm2 ).
RMFX: Bending strength (kg/cm2 ).
MVAP: Volumetric weight (kg/m3 ).
AANP: Water absorption (%).
PAOA: Apparent porosity (%).
CDLT: Thermal linear expansion coefficient (10^-6/°C).
RDES: Abrasion test (mm).
RCHQ: Impact test: minimum fall height (cm).
We now have an 18 dimensional data set and if we wanted to create a collection of scatter plots
showing all the different combinations of variables we would need 153 plots. Instead we can try to
reveal the underlying structure of the data by using NLM to map it into 2D (Figure 8.20). This
is just like the example above, which requires standardization of the columns in the data matrix
and the calculation of the interpoint distances.
Figure 8.20: NLM of the Portuguese rocks data set containing both oxide and physical-mechanical
measurements. The data is mapped from 18 to 2 dimensions.
146
8.2.3.1 Cluster analysis and NLM
When we studied k-means clustering we analyzed the Portuguese rocks data set and concluded
that there were two different groups of data. When working with multivariate data sets it is a
challenge to represent the results of a cluster analysis because the cluster centers exist in the same
high dimensional space as the original data. One solution to the problem is to represent the
original data in a lower number of dimensions, for example by using PCA or NLM, and then to
build the results of the cluster analysis into that solution. As an example, Figure 8.21 takes
the 2D representation of the Portuguese rocks data set obtained by NLM, but codes the points
according to the cluster assignments that were found in Section 7.4. In this way we can combine
the techniques of cluster analysis and dimension reduction to give more detailed insights into the
data.
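A sketch of how this combination might be coded, assuming the 18 standardized measurements are in a matrix called rocks18 (an illustrative name, not one taken from the course files):

library(MASS)
library(cluster)
Xz <- scale(rocks18)                    # standardize the 18 variables
map <- sammon(dist(Xz), k = 2)          # 2D NLM representation of the data
fit <- pam(Xz, 2)                       # 2 cluster solution
plot(map$points, pch = fit$clustering,  # plotting symbol coded by cluster assignment
     xlab = 'NLM X', ylab = 'NLM Y')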
Figure 8.21: A 2D NLM representation of the 18 dimensional rocks data set. The results of a 2
cluster k-means solution are combined with the NLM representation by plotting each point with a
symbol according to the cluster center to which it is assigned (i.e., all the points marked by circles
belong to one cluster and all the points marked by triangles belong to another cluster.)
We can repeat the same process for more complex cluster solutions (i.e., ones involving more
cluster centers). The NLM doesn’t change, but the points are assigned symbols according to which
of the clusters they are assigned to. A 3 cluster solution is shown in Figure 8.22 and a 4 cluster
solution is shown in Figure 8.23.
Figure 8.22: A 2D NLM representation of the 18 dimensional rocks data set. The results of a 3
cluster k-means solution are combined with the NLM representation by plotting each point with a
symbol according to the cluster center to which it is assigned.
148
2
●
●
●
●
● ●
0
● ●●
−2
●
●
NLM Y
−4
−6
−8
−6 −4 −2 0 2 4 6
NLM X
Figure 8.23: A 2D NLM representation of the 18 dimensional rocks data set. The results of a 4
cluster k-means solution are combined with the NLM representation by plotting each point with a
symbol according to the cluster center to which it is assigned. Two of the clusters (marked by o
and x symbols) overlap strongly, which suggests that 4 clusters are too many to give an appropriate
representation of the data.
149
Figure 8.24: The 3D Swiss roll function. Notice how the function evolves with a looping form.
We can see that local structure in the Swiss roll function is very important. If we only consider
inter-point distances then we get a false impression of the structure of the function. For example,
the straight-line distance between the start (red) and end (blue) of the function is less than the
distance between the start and the point where the function has completed half a loop (yellow).
Therefore using only interpoint distance we would have to conclude that the blue region of the
function is more closely related to the start of the function than the yellow region. However, since
we can see how the function evolves, it is clear that the yellow part of the function is in fact
more closely related to the start of the function. A global technique like NLM is not designed to
preserve the kind of evolving structure that we see in the Swiss roll function, but LLE is. The
idea behind LLE is to find networks of neighboring points in high dimensional space, like those
in Figure 8.25, and in an approach quite similar to Sammon mapping find a representation that
preserves the inter-point distances in 2D. This is repeated for each data point and thus we obtain
a low-dimensional mapping that focusses on local rather than global structure.
150
Figure 8.25: Schematic showing how LLE works. Distances are preserved within local neighbour-
hoods rather than globally, therefore the low-dimensional mapping focusses on local structure. Image
taken from http://cs.nyu.edu/∼roweis/lle/algorithm.html
So, what happens when we apply LLE and Sammon mapping to the Swiss roll function (Figure 8.26)?
Because LLE is based on local behavior, the Swiss roll is successfully unrolled and the
local structure is clearly preserved. The Sammon map, however, performs less well and simply
compresses the function along its long axis, failing to uncover the true underlying structure.
Figure 8.26: Examples of the Swiss roll function in 3D (left) mapped into 2D using the LLE
(middle) and Sammon approaches (right). Notice the fundamental differences in the final mappings
that result from considering local or global structure.
151
they contain in a concise way. This has led to the development of techniques such as nonlinear
principal components, where the components are no longer straight-lines. This means that data
sets that may contain non-linear structures can be represented efficiently. An example of nonlinear
principal component analysis is given in Figure 8.27.
Figure 8.27: An example of nonlinear principal component analysis where the individual com-
ponents can be curved in order to represent nonlinear structures in a data set (taken from
http://www.nlpca.de/).
152
The study of ’non-Euclidean’ geometry brings nothing to students
but fatigue, vanity, arrogance, and imbecility.
Matthew Ryan
So, the results are clear. There are more red jellybeans in the jar than at the start of the
experiment, therefore Richard must have bought a new bag of red beans. Greg owes me 35 green
jellybeans and Becky owes just 5 blue.
What would have happened in the same experiment if I had expressed all the results as percentages
rather than the absolute number of jellybeans? The same results would look like this:
153
Time     Percentage of red   Percentage of green   Percentage of blue
Start           43                   35                    22
Finish          69                    6                    25
How can we interpret these results? The percentages of red and blue have both increased; does
that mean that Richard and Becky both bought replacement beans? Well, we know from the
absolute numbers of beans that only Richard bought new beans, but this information has been
lost when we express the results as percentages. What we can see is that the percentage data
only carry relative information, that is, how abundant one colour of bean is compared to the
other colours of beans. Because we have no absolute information we can't say how the numbers
of the individual colours of beans are changing. This is a key concept: compositional data only carry
relative information.
All the components must be represented by non-negative numbers (i.e., numbers that are ≥ 0).
The components of each composition must sum to a constant total (e.g., 1 or 100%).
This second requirement is called the sum-to-one constraint. An example of closed data is sediment
grain sizes that are split into sand, silt and clay size fractions. For any given sample, the
contributions from the three size fractions must add up to 100%.
Because of the sum-to-one constraint, all the information of a D component composition can
be given by D − 1 components. For example consider the grain size compositions for three more
samples, where missing values are shown with question marks.
154
Sample   Sand [%]   Silt [%]   Clay [%]   Total [%]
   1       21.3       57.5        ?         100.0
   2         ?        13.2      10.7        100.0
   3       63.2         ?        7.8        100.0
Even though each case is represented with only two values, we can immediately work out what
the percentage of the remaining grain size fraction will be because the total for each case must be
100%. Therefore all the information on the 3 grain size components can be represented with just
2 components.
This might seem like a trivial issue, but its effect can be dramatic. As an example think of the
case of a marine micropaleontologist, who has counted the numbers of two different foraminifer
taxa, A and B, through a sediment core. They now want to test how A and B correspond to each
other because it is possible that the two taxa abundances are controlled by the same environmental
conditions. There are two ways to perform this analysis: by comparing the absolute numbers of
individuals or by comparing the percentage abundances of the two species (Figure 9.1).
[Scatter plots of individuals of species A against individuals of species B: r² = 0.025 for the counts (left) and r² = 1.0 for the percentages (right)]
Figure 9.1: The comparison of the foraminifer taxa A and B expressed in absolute counts (left) and
as percentages (right). The r2 values give the coefficient of determination for each comparison.
We can see that the results of the two forms of analysis are dramatically different. When we
employ counts of individuals there is clearly no significant correlation between A and B, but when
we use percentages we obtain a perfect negative correlation. The reasons behind these results are
clear: if we have a given percentage of species A, then the corresponding percentage of species B
must be (100 - A)%. Therefore, when we plot a collection of results we'll obtain a perfect negative
relationship that is purely a product of the manner in which we have decided to express the data
and has no physical meaning.
You might have two responses to this problem. First, you could decide never to use percentage
data, which is a nice idea but almost impossible in practice. Some forms of data can only be
expressed in a relative sum-to-one form, for example mineral compositions that may be given in %,
ppt, ppm, etc. Although it is possible to express data such as foraminifer abundance in absolute
terms, in practice it is very difficult, so most assemblage information is normally given
as percentages. Your second response to the problem could be to say that it only deals with 2
parameters whereas your own data set contains many more, for example, 50 different foraminifer
taxa. Unfortunately this doesn't solve the problem: the sum-to-one constraint will still induce false
correlation when different components of a composition are studied, no matter how many parts
the composition is made up of. To my knowledge (at the time of writing), there is no statistical technique
available that can correctly quantify the correlations between the different parts of a composition
(this is worrying when you think how often you see such correlations employed in the scientific
literature).
155
We have now seen some of the problems caused by the sum-to-one constraint, but in some cases
we can use it to our advantage. You’ll be familiar with representing grain size data in a triangular
ternary plot, where each edge of the triangle represents the abundance of one of the grain size
fractions. Our ability to represent such 3D data sets in a 2D figure (a triangle) without any loss
of information is a direct result of the sum-to-one constraint (Figure 9.2).
Figure 9.2: Example of grain size data characterized by the relative proportions of sand, silt and
clay, plotted in a 2D ternary plot.
Of course for a given case each of the abundances of the grain size fractions must also be ≥ 0,
otherwise the case would plot outside of the triangle. The non-negativity and sum-to-one constraints
should make intuitive sense to us. Imagine that someone told you that “my sediment contains
-20% sand ” or “the three grain size fractions in my sediment add up to 105% ” you would know
that there is something wrong with their data.
[Plot of B against A showing the line A + B = 1 and the regions A ≥ 0, B ≥ 0 and A < 0]
Figure 9.3: Example of the constraints that act on 2 part compositions. Such compositions must
lie on the diagonal line between (1,0) and (0,1), which means they meet both the non-negativity
and sum-to-one constraints. Points that do not fall on the line must violate one or both of the
constraints.
Similarly, if we consider compositions with three components, all of the cases must lie on a 2D
triangle within a 3D space. Because of the sum-to-one constraint the corners, or vertices, of the
triangle must be positioned at (1,0,0), (0,1,0) and (0,0,1), Figure 9.4.
Figure 9.4: Example of the constraints that act on 3 part compositions. Such compositions must
lie on a triangle (shaded) between the points (1,0,0), (0,1,0) and (0,0,1).
We can extend this idea to a higher number of dimensions, but they become more difficult to
represent in a diagram. For example a composition with four components must lie within a 3D
tetrahedron with the corners (1,0,0,0), (0,1,0,0), (0,0,1,0) and (0,0,0,1).
The general rule that emerges is that a D component composition must reside within a D-1
dimensional simplex. So a 1D simplex is a line, a 2D simplex is a triangle, a 3D simplex is a
tetrahedron, a 4D simplex is a pentachoron, etc. This means that compositions cannot sit at just
any point in Euclidean space; instead they reside in a so-called simplicial sample
space. At this point you might think "so what", but simplicial sample spaces have a number of
difficulties associated with them.
158
9.3.0.1 Distances between points in a simplex
When studying cluster analysis we saw how the similarity between two data points can be measured
in terms of the distance separating them. This is simple for points in a Euclidean space, where
we can just measure the distance between points along each dimension and then use Pythagoras’
theorem to find the length of the line separating the points. When we try something similar in a
ternary plot we hit a problem. The dimensions are at 60° to each other, so when we draw lines
along the dimensions we don’t form a right-angled triangle and we can’t apply Pythagoras’ theorem
(Figure 9.5). Maybe at this stage we shouldn’t be too worried because we can simply use a bit
of extra geometry that doesn’t rely on Pythagoras’ theorem to find the distance between the two
points. However, we do have a problem because the algorithms that are available for calculating
various statistics assume that you can use Pythagoras’ theorem, so you’re going to have a lot of
work if you plan to rewrite them all. In fact the problem is more fundamental because what we
think is a straight-line in our Euclidean minds has a different form in simplicial space.
Figure 9.5: Because the angles between the dimensions in a simplex are not 90° we can't use
Pythagoras’ theorem to find the distance between the two points.
Figure 9.6: The shortest path between two points in a simplex appears as a curve rather than a
straight-line because the dimensions are not at right-angles to each other.
We won’t dwell on this point because geometry within a simplex is a whole field within itself
(if you’re interested you should see the recommended reading for more details). The fact that the
shortest path connecting two points in a simplex appears as a curve should re-emphasize the point
that we really won’t be able to use Pythagoras’ theorem and Euclidean distances.
The arithmetic mean minimizes the sum of squared Euclidean distances to the individual values.
So Euclidean distances are even involved in a simple mean, therefore we can’t even calculate
something as basic as an average composition using standard statistical methods.
160
9.4 Solving the problems
The problems associated with the statistical analysis of compositional data were first identified by
Karl Pearson in 1896. An effort was then undertaken to redesign a number of different statistical
methods so they could be applied to compositional data. However, to go through every existing
statistical approach and find ways to adjust it for compositional data was just too large a task
and the problems associated with compositional data became largely ignored. Instead standard
statistical techniques would be applied to compositional data ignoring the fact that the results could
be entirely spurious. Occasional papers were published warning of the problems with compositional
data analysis, but they were largely ignored because no solutions to the issues could be given.
The framework for the statistical analysis of compositional data was set out by John Aitchison
in a series of papers in the 1980s and 1990s. Aitchison's solution to the problem was not to try to
redesign all statistical approaches to make them suitable for compositional data, but instead he
developed a method to transform compositional data so that they could be analyzed using standard
statistical approaches. Aitchison defined two key criteria that the analysis of compositional data
must meet, which we'll look at now.
Subcompositional coherence demands that two scientists, one using full compositions and the other
using subcompositions of these full compositions, should make the same inference about relations
within common parts.
To illustrate this problem we’ll consider a sample composed of 25 specimens of hongite. One
investigator defines the composition of the hongites by percentage occurrence of the five minerals;
albite, blandite, cornite, daubite and endite. Another investigator decides that they will only ex-
amine the three minerals; blandite, daubite and endite. To demonstrate the differences in the way
the data is represented by the two investigators, consider the example of one specimen.
Both investigators want to find out if there is a relationship between the relative abundances
of daubite and endite, so they perform a correlation analysis using all 25 specimens. The results
of this analysis are shown in Figure 9.7.
[Daubite [%] against endite [%]: investigator 1, r = -0.21; investigator 2, r = 0.46]
Figure 9.7: The correlation of daubite and endite for the two investigators
Notice that the correlations obtained by the two investigators don't even have the same sign,
which means they would reach very different conclusions about how daubite and endite are related
in their specimens. This result demonstrates that correlations are not subcompositionally
coherent (they are said to be subcompositionally incoherent) and that the results of the statistical analysis
depend on which minerals were included in the original mineral quantification. Clearly, this is not
a good thing.
162
Ratios also help to provide subcompositional coherence because the ratios within a subcomposition
are equal to the corresponding ratios within the full composition. Returning to the composition
of our first hongite specimen:
If the two investigators take the ratios of their daubite and endite percentages they will get the
same result.
\frac{6.4}{9.3} = \frac{13.5}{19.6} = 0.69
Thus, the ratios do not depend on which components a given investigator decided to quantify.
Taking the logarithm has a number of advantages, but one of the most obvious is that it will
transform any ratio to lie in the interval (−∞, ∞), which fits with the requirement we discussed in
Section 9.3.
We can see that the third step requires an inverse transform to convert log-ratio values back into
compositions that reside within a unit simplex. As an example, consider log-ratios formed from 3
components, A, B and C, where C is the denominator variable. Then to perform the inverse alr
(alr^{-1}) and recover the values of A, B and C:
A = \frac{\exp(\log(A/C))}{\exp(\log(A/C)) + \exp(\log(B/C)) + 1}

B = \frac{\exp(\log(B/C))}{\exp(\log(A/C)) + \exp(\log(B/C)) + 1}

C = \frac{1}{\exp(\log(A/C)) + \exp(\log(B/C)) + 1}
163
The corresponding alr^{-1} would therefore be:

\frac{\exp(2.77)}{\exp(2.77) + \exp(1.10) + 1} = 0.80

\frac{\exp(1.10)}{\exp(2.77) + \exp(1.10) + 1} = 0.15

\frac{1}{\exp(2.77) + \exp(1.10) + 1} = 0.05
We’ll now look at a number of different examples to see how Aitchison’s method can be applied
and how it gives results that are consistent with the behaviour of compositional data.
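A minimal base-R sketch of the alr and its inverse, following the formulas above (the compositions package used later provides its own versions of these transforms; the small functions here are purely for illustration, with the last column of each composition used as the denominator):

alr <- function(comp) log(comp[, -ncol(comp), drop = FALSE] / comp[, ncol(comp)])
alrInv <- function(lr) {
  e <- cbind(exp(lr), 1)     # exponentiate the log-ratios and append 1 for the denominator part
  e / rowSums(e)             # rescale so that each composition sums to one
}
alrInv(cbind(2.77, 1.10))    # recovers the composition (0.80, 0.15, 0.05) from the example above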
Figure 9.8: A ternary plot showing the composition of 23 basalt specimens from Scotland.
It would be tempting to simply calculate the mean of the A, F and M components directly, but
as we’ve seen in the previous sections this would yield a spurious result. Instead we must take the
164
log-ratio of the data, calculate the means of the log-ratios and then transform those means back
into A, F and M values (Figure 9.9). Fortunately, R has a package available called compositions,
which will help us with this procedure. Once we define a data set as a collection of compositions
using the acomp function, R knows that it should always use log-ratio analysis when calculating
statistics. The data are stored in the file AFM.Rdata, which contains the data matrix AFM.
Example code: 72
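The course listing itself is not reproduced here, but a sketch of how this step might look with the compositions package is:

library(compositions)
load('AFM.Rdata')                     # contains the data matrix AFM
afm <- acomp(AFM)                     # declare the data as compositions
plot(afm)                             # ternary plot of the 23 specimens
mean(afm)                             # compositional (log-ratio based) mean
colMeans(AFM) / sum(colMeans(AFM))    # naive component-wise average, for comparison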
Figure 9.9: A ternary plot showing the composition of 23 basalt specimens from Scotland and the
resultant mean composition (filled square). For comparison the position of the average of the A, F
and M components without taking the compositional nature of the data into consideration is shown
(open square).
We can see that the mean composition sits where we would expect, in the center of the curved
165
distribution of data points. For comparison, I also calculated the mean using the incorrect method
of just taking the average of each component without using the log-ratio transform. This compo-
sition is marked by an open square and it is positioned towards the edge of the data distribution,
which is obviously not what we would expect for a mean.
Figure 9.10: The first two principal components of the basalt data set calculated without applying
the log-ratio transform
We can see from this analysis that as expected the two principal components are at right-angles
to each other. The data distribution, however, seems quite curved and the principal components
don’t do a very good job of describing this curvature. There is an additional problem, if we
lengthen the principal components, they will eventually extend outside the limits of the ternary
plot, which means that they are not physically realistic for a simplicial sample space.
Calculating the principal components using the log-ratio approach is simple in R, and we can add
them to a plot with a single call to the straight function (Figure 9.11). For completeness
we'll start from the beginning by loading the data again, but if you still have all the basalt data
loaded from the previous example then you can go straight into the PCA.
166
Example code: 73
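A sketch of how the log-ratio PCA might be run (the compositions package supplies a princomp method for acomp objects; the exact plotting commands, including the straight call mentioned above, are in the course listing):

library(compositions)
load('AFM.Rdata')
afm <- acomp(AFM)
pca <- princomp(afm)     # PCA carried out in log-ratio space
summary(pca)             # variability explained by each component
plot(afm)                # ternary plot; the components can then be added with straight()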
Figure 9.11: The first two principal components of the basalt data set calculated using the log-ratio
transform
The first thing we notice is that the principal components appear to be curves, however when
we think back to Section 9.3.0.2 this is not surprising because we know that straight-lines appear
to be curved inside a simplex. This curvature also means that unlike the non-alr example above,
the principal components will never extend beyond the limits of the ternary plot. The curve in
the first principal component allows it to describe the curvature in the data very clearly and we
can emphasize this point by looking at the variability explained by each of the components and
comparing it to the non-alr analysis (Figure 9.12).
[Recoverable axis label: PC2 scores (22% of total variability); non-alr panel (left) and alr panel (right)]
Figure 9.12: Scores for the first two principal components of the Scottish basalts using the traditional
non-alr approach (left) and the alr method (right).
We can see the scores from the non-alr approach describe a slightly curved path because the
principal components cannot represent the data properly. In contrast, by applying the alr we
obtain principal components that provide a very good representation of the data with the first
component explaining over 99% of the data variability.
Example code: 74
Figure 9.13: A ternary plot showing the composition of 23 basalt specimens from Scotland (dots).
We then collect a new specimen and determine its composition (square). Does the new specimen
belong to the same population distribution as our original sample of 23 basalts?
For this example we’ll calculate the log-ratios directly (rather than letting R do it in the back-
ground) and plot them in order that we can see the structure of the data after the log-ratio
transform (Figure 9.14). To calculate the alr we’ll use M as the denominator in the ratios. This
is, however, an arbitrary decision and the results should not be sensitive to which parameter we
use as the denominator.
Example code: 75
[Axes: log(A/M) against log(F/M)]
Figure 9.14: A plot of the 23 basalt specimens from Scotland (dots) and the new specimen (square)
in log-ratio space.
Working with the log-ratios we now need to define a confidence ellipse around the 2D distri-
bution of data points using the R function ellipse. To do this we need to use some statistical
quantities such as covariance matrices (calculated with the cov function) that we haven’t consid-
ered in the earlier chapters. At this stage we are just going to perform the necessary calculations
and not worry too much about how the process works. You’ll see that the position of the center
of the confidence ellipse is simply the mean of each of the log-ratios, which can be found using
the function colMeans (which returns the mean of each column in a matrix). We’ll then add the
confidence ellipses to the plot (Figure 9.15).
Example code: 76
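A sketch of these steps, assuming the two log-ratios (log(A/M) and log(F/M)) for the 23 specimens are held in a two-column matrix called lr (an illustrative name):

library(ellipse)
m <- colMeans(lr)                                # center of the ellipses: the mean log-ratios
S <- cov(lr)                                     # covariance matrix of the log-ratios
plot(lr, xlab = 'log(A/M)', ylab = 'log(F/M)')
lines(ellipse(S, centre = m, level = 0.90))              # 90% confidence region
lines(ellipse(S, centre = m, level = 0.95), lty = 3)     # 95% (dotted)
lines(ellipse(S, centre = m, level = 0.99), lty = 2)     # 99% (dashed)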
Figure 9.15: 90% (solid line), 95% (dotted line) and 99% (dashed) confidence regions of the true
population distribution represented by the original 23 basalt specimens.
We can see that the new sample is located outside of even the 99% confidence ellipse for the
original specimens, so it would appear that it originates from a different population. Of course
when plotted simply as log-ratios it is difficult to imagine what the confidence ellipses will look like
in the original ternary plot. This is no problem, we can use the inverse alr to transform the ellipse
information stored in e1, e2 and e3 back into the original A, F and M compositional space. We
can then plot all the information together in a ternary plot (Figure 9.16). This process involves
a number of commands in R, so we’ll only look at the case of the 99% ellipse stored in e3. If you
want to perform the same process for the 90% or 95% ellipses you would need to use the e1 and
e2 variables, respectively.
Example code: 77
Figure 9.16: The 99% confidence ellipse (solid) of the basalt specimens (dots) shows that it is highly
unlikely that the new specimen (square) comes from the same underlying population.
Example code: 78
Figure 9.17: Ternary diagram for the sand, silt and clay compositions of 39 sediments from an
Arctic lake. Points are colour coded according to the depth at which they were collected, with red
and blue indicating shallow and deep water, respectively.
The pattern in the ternary plot shows a clear relationship between the grain size composition
of the sediments and the depth at which they were collected. At shallower depths the sediments
contain more sand and as depth increases the sediments become more fine grained. This is exactly
what we would expect from gravity settling with the largest particles being deposited in shallower
waters and only the finer particles making it towards the center of the lake where the water is
deeper. But can we find the regression relationship between grain size and depth?
We could start our analysis by ignoring the fact we are working with compositions and just
blindly apply a regression analysis to the sand, silt and clay percentages as a function of depth.
We could then plot the results of the regression analysis in the ternary plot. You can probably
see by now that this isn’t going to work because we can’t just ignore the problems associated with
compositional data, but if we did the regression plot would look like Figure 9.18 (there is no point
in showing you how to calculate this regression relationship because it is wrong).
Figure 9.18: A regression of sediment grain size and collection depth that ignores the fact that the
grain size data are compositions.
We can see that the regression line in Figure 9.18 doesn’t do a very good job of explaining
the data. The data shows a curved trend, whilst the line is straight and we can see that if we
extended the line it would eventually pass outside the boundaries of the diagram. By now, it
should be clear what we need to do to solve these problems. We’ll apply the alr transform to the
grainsize data, calculate the regression in log-ratio space and then transform the results back to
the original compositional data space. We’ll perform the regression using the lm function just as
we did in Chapter 5. We’ll first look at the regression in log-ratio space (Figure 9.19) and then
plot everything in the ternary plot.
174
Example code: 79
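A sketch of the regression steps, assuming the grain size proportions are in vectors sand, silt and clay and the water depths in a vector depth (illustrative names; the course listing may organize the data differently):

lr1 <- log(sand / clay)                     # alr with clay as the denominator
lr2 <- log(silt / clay)
fit1 <- lm(lr1 ~ log(depth))                # regress each log-ratio on log(depth)
fit2 <- lm(lr2 ~ log(depth))
d.new <- seq(min(depth), max(depth), length.out = 100)
lr1_hat <- predict(fit1, newdata = data.frame(depth = d.new))
lr2_hat <- predict(fit2, newdata = data.frame(depth = d.new))
e <- cbind(exp(lr1_hat), exp(lr2_hat), 1)   # inverse alr to return to the simplex
reg_comp <- e / rowSums(e)                  # columns: sand, silt, clay along the regression line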
Figure 9.19: Regression for the two log-ratios against the log of depth. Red shows the regression
for log(sand/clay) and green shows log(silt/clay).
Now we have the regression models that are represented by a collection of points calculated
as a function of depth along each of the regression lines in Figure 9.19. The points are stored
in the variables lr1_hat and lr2_hat and we simply need to perform the inverse alr transform
to return the points to the original sand, silt and clay compositional data space. Then we'll display
the transformed points in a ternary plot to illustrate the regression relationship (Figure 9.20).
Example code: 80
Figure 9.20: Log-ratio based regression line for the Arctic lake sediment data.
The calculated regression line captures the curvature of the data and we can see that even if
the line was extended it would still provide a physically meaningful model. For example as the
depth becomes shallower the regression line moves towards sediments consisting of only sand and
at greater depths the line moves towards sediments composed of only clay.
177
9.6.1 Dealing with zeros
Finally, one major issue when dealing with compositional data in the geosciences is the presence
of zeros in a data set. For example we might have counted foraminifer abundance in a collection
of sediments and a certain species may be absent in one of the assemblages. It would therefore be
represented in the foraminifer abundances as 0%. However when we apply the alr we then need
to take the logarithm of 0, which is not defined.
Various strategies have been suggested to deal with the presence of zeros in compositional data,
most of which focus on replacing them with very small values. This is still a problem for geoscience
data sets that may contain large numbers of zeros, and a satisfactory way to deal with such
data has yet to be developed.
Maybe there is light at the end of the tunnel: in 2011 Michael Greenacre from the Universitat
Pompeu Fabra developed a method for dealing with zeros by relaxing the subcompositional co-
herence requirement. He found that you could deal with zeros as long as you were willing to let
your analysis be very slightly subcompositionally incoherent. We will have to wait and see how
successful this approach is.
178
10
Recommended reading
Salsburg, D. (2001). The Lady Tasting Tea: How Statistics Revolutionized Science in the Twenti-
eth Century. Holt.
Swan, A.R.H. & M. Sandilands (1995). Introduction to Geological Data Analysis. Blackwell Sci-
ence.
Middleton, G.V. (2000). Data Analysis in the Earth Sciences using MATLAB. Prentice Hall.
Trauth, M.J. (2010). MATLAB Recipes for Earth Sciences. Springer.
Pawlowsky-Glahn, V. & A. Buccianti (2011). Compositional Data Analysis: Theory and Ap-
plications. Wiley.
Warton, D.I., I.J. Wright, D.S. Falster & M. Westoby (2006). Bivariate line-fitting methods for
allometry. Biological Reviews 81(2), 259-291.
180