Outliers


DETECT suspects 1: Visual inspection

2.24 2.43 2.36 2.83 2.30


No.  Titre (ml)
1    11.4
2    11.1
3    11.5
4    11.9
5    11.3
6    11.2
DETECT suspects 2: Calculation: Tukey k-Test

The interquartile range (IQR) is the distance between the first and third
quartiles (the length of the box in the boxplot)
IQR = Q3 - Q1

An outlier is an individual value that falls outside the overall pattern.


How far outside the overall pattern does a value have to fall to be
considered a suspected outlier?

Suspected low outlier: any value < Q1 - 1.5 IQR

Suspected high outlier: any value > Q3 + 1.5 IQR


25 7.9
24 5.6
23 5.3
22 4.9
21 4.7
20 4.5
Q3 = 4.35
19 4.2
18 4.1
17 3.9
16 3.8
15 3.7
14 3.6
13 3.4
12 3.3
11 2.9
10 2.8
9 2.5
8 2.3
7 2.3
Q1 = 2.2
6 2.1
5 1.5
4 1.9
3 1.6
2 1.2
1 0.6
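
The fence calculation for the 25 values above can be sketched in a few lines of Python. This is a minimal illustration, not the slide author's own procedure: the function name `tukey_fences` is hypothetical, and the quartile convention (`method="exclusive"` in the standard library) is an assumption chosen because it reproduces the quartiles shown above (Q1 = 2.2, Q3 = 4.35). Other quartile conventions give slightly different fences.

```python
import statistics

# The 25 values from the ranked list above, rank 1 to rank 25 as printed.
values = [0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
          3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6, 7.9]


def tukey_fences(data, k=1.5):
    """Return (low_fence, high_fence) = (Q1 - k*IQR, Q3 + k*IQR)."""
    # Assumption: the "exclusive" quartile method, which matches the
    # Q1 and Q3 quoted on the slide for this dataset.
    q1, _median, q3 = statistics.quantiles(data, n=4, method="exclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


low, high = tukey_fences(values)
suspects = [v for v in values if v < low or v > high]

print(f"low fence  = {low:.3f}")
print(f"high fence = {high:.3f}")
print("suspected outliers:", suspects)
```

With these quartiles the IQR is 2.15, so the fences sit at about -1.025 and 7.575, and only the largest value (7.9) is flagged as a suspected high outlier.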
DETECT suspects 2: Calculation: Grubbs' Test
ISO test for point outliers
The suspect value is the value furthest from the mean
Assumes a Normal population
Use the entire dataset, including the suspect, to calculate the mean and s
G_critical depends on n
If G_exp > G_critical, then REJECT the suspect

$G_{exp} = \dfrac{|\text{suspect value} - \bar{x}|}{s}$
Example

The following values were obtained for the nitrate concentration (mg/L) in a
sample of river water:

0.403 0.410 0.401 0.380

Ideally, take more measurements when a suspect value occurs, especially if only a few were made.
Additional values may make it clearer whether the suspect should be rejected.
If the suspect is kept, the extra values also reduce its effect on the mean and s.
With 3 further measurements:

0.403 0.410 0.401 0.380 0.400 0.413 0.408
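
The Grubbs' calculation for both versions of the nitrate data can be sketched as follows. This is an illustrative sketch: the function name `grubbs_statistic` is hypothetical, and the critical values in `G_CRIT` are assumptions taken from commonly tabulated two-sided 5% values, so check them against the table you are actually using.

```python
import statistics


def grubbs_statistic(data):
    """Return (suspect, G) where G = |suspect - mean| / s.

    The mean and sample standard deviation s are calculated from the
    whole dataset, suspect included.
    """
    mean = statistics.mean(data)
    s = statistics.stdev(data)
    # The suspect is the value furthest from the mean.
    suspect = max(data, key=lambda v: abs(v - mean))
    return suspect, abs(suspect - mean) / s


four = [0.403, 0.410, 0.401, 0.380]
seven = four + [0.400, 0.413, 0.408]

# Assumed critical values (two-sided, 5% level) for n = 4 and n = 7.
G_CRIT = {4: 1.481, 7: 2.020}

for data in (four, seven):
    suspect, g = grubbs_statistic(data)
    verdict = "REJECT" if g > G_CRIT[len(data)] else "retain"
    print(f"n={len(data)}: suspect={suspect}, G={g:.3f}, "
          f"G_critical={G_CRIT[len(data)]} -> {verdict}")
```

With only four values, G is about 1.43, which is below the assumed critical value, so 0.380 is retained. With the three further measurements G rises to about 2.03, which is just above the assumed critical value, so 0.380 is rejected.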


You try:

A set of mass spectrometer measurements on a uranium isotope:

199.31 199.53 200.19 200.82 201.92 201.95 202.18 206.32


DETECT suspects 2: Calculation: Dixon's Q-Test

Popular
For small samples (n = 3 to 10)
Assumes a Normal population
If Q > critical value, then REJECT the suspect
Dixon's Q-Test
The following values were obtained for the nitrate concentration (mg/L) in a
sample of river water:

0.403 0.410 0.401 0.380 0.400 0.413 0.408

$Q = \dfrac{|\text{suspect value} - \text{nearest value}|}{\text{range}}$
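
Applied to the seven nitrate values, the Q calculation can be sketched like this. The function name `dixon_q` is hypothetical, it only examines the two extreme values, and the critical value `Q_CRIT_N7` is an assumption based on a commonly tabulated 5% value for n = 7, so verify it against your own table.

```python
def dixon_q(data):
    """Return (suspect, Q) with Q = |suspect - nearest value| / range.

    Only the smallest and largest values are candidates; the one with
    the larger gap to its nearest neighbour is treated as the suspect.
    """
    ordered = sorted(data)
    spread = ordered[-1] - ordered[0]
    if spread == 0:
        raise ValueError("all values are identical, Q is undefined")
    gap_low = ordered[1] - ordered[0]
    gap_high = ordered[-1] - ordered[-2]
    if gap_low >= gap_high:
        return ordered[0], gap_low / spread
    return ordered[-1], gap_high / spread


nitrate = [0.403, 0.410, 0.401, 0.380, 0.400, 0.413, 0.408]

Q_CRIT_N7 = 0.570  # assumed critical value, 5% level, n = 7

suspect, q = dixon_q(nitrate)
verdict = "REJECT" if q > Q_CRIT_N7 else "retain"
print(f"suspect={suspect}, Q={q:.3f}, critical={Q_CRIT_N7} -> {verdict}")
```

Here the suspect is 0.380, its nearest neighbour is 0.400, and the range is 0.413 - 0.380 = 0.033, giving Q of about 0.606. That exceeds the assumed critical value, so the suspect is rejected.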
You try:
0.189 0.167 0.187 0.183 0.186 0.182

0.181 0.184 0.181 0.177

$Q = \dfrac{|\text{suspect value} - \text{nearest value}|}{\text{range}}$
DECIDE
Correct obvious errors for which the correct data exist
Exclude obvious errors for which no correct data exist
Ignore? Run the analysis with and without the suspect to see if it is influential
Use a trimmed mean
Retain?
Outliers are expected in large samples
Some methods are robust to outliers
Replace
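
One way to see whether a suspect is influential is to compare the mean with and without it, alongside a trimmed mean. The sketch below reuses the seven nitrate values from the earlier examples. The helper `trimmed_mean` and the choice to trim one value from each end are illustrative assumptions, not a recommendation from the slides.

```python
import statistics


def trimmed_mean(data, n_trim=1):
    """Mean after dropping the n_trim smallest and n_trim largest values."""
    ordered = sorted(data)
    if 2 * n_trim >= len(ordered):
        raise ValueError("cannot trim that many values")
    return statistics.mean(ordered[n_trim:len(ordered) - n_trim])


nitrate = [0.403, 0.410, 0.401, 0.380, 0.400, 0.413, 0.408]
suspect = 0.380

with_suspect = statistics.mean(nitrate)
without_suspect = statistics.mean([v for v in nitrate if v != suspect])
trimmed = trimmed_mean(nitrate, n_trim=1)

print(f"mean with suspect:    {with_suspect:.4f}")
print(f"mean without suspect: {without_suspect:.4f}")
print(f"trimmed mean (1 off each end): {trimmed:.4f}")
```

If the three results differ by less than the precision that matters for the analysis, the suspect is not influential and the decision to keep or drop it has little practical consequence.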

DISCLOSE
