2.1 Data Analysis
DATA ANALYSIS (Part 1)
HANIM AWAB, Department of Chemistry, Faculty of Science, UTM
TYPES OF ERROR
1. GROSS ERROR (e.g. contaminated reagents, faulty instrument)
- Serious, obvious errors that give outlier readings
- Detectable with sufficient replicate measurements
- Experiments with gross errors must be repeated
2. RANDOM/INDETERMINATE ERROR (e.g. inaccurate manipulation of the procedure)
- Data scattered symmetrically about a mean value
- Deviations of the measurements from the mean are described by the Gaussian (normal) error curve
- Cannot be eliminated, but can be minimized
- Can be assessed by statistical tests
Some ways to overcome errors:
- Carry out replicate measurements
- Analyse accurately using known standards or standard reference materials (SRM)
- Perform statistical tests on the data
3. SYSTEMATIC/DETERMINATE ERROR (operator, instrument or method error)
- All data too high/too low, or the error increases with the magnitude of the measurement
- Causes bias in the technique (either +ve or -ve)
- Affects accuracy
- May be detected by:
  - blank determinations
  - analysis of standard samples
  - independent analyses by alternative/dissimilar methods
- Can be avoided/eliminated by correcting instrument, method and personal errors*
*Ways to minimize/eliminate systematic errors
Instrument errors:
- Careful recalibration and good maintenance of apparatus (e.g. glassware) and instruments (e.g. AAS, GC)
Personal errors:
- Training of the operator, care and self-discipline
Degrees of Freedom - the number of results in a set (each time another quantity is derived from the set, the degrees of freedom are reduced by 1)
Range - the difference between the highest and lowest values of the results
Deviation (d) - the difference, with respect to sign, between an individual result and the mean or median of the set
Standard Deviation (s or σ) - a measure of the spread of the results about the mean
Relative Standard Deviation (RSD) - also known as the coefficient of variation; often used in comparing precisions
Variance (V) - the square of the standard deviation (σ² or s²)
Determinations/Formula
MEAN (AVERAGE) / MEDIAN
Mean: x̄ = (Σ xi) / N   (sum taken over i = 1 to N)

STANDARD DEVIATION
A measure of spread about the mean; estimates the variability of the individual measurements (the standard deviation is better estimated by pooling the results from more than one set)
s = √[ Σ (xi − x̄)² / (N − 1) ]
RELATIVE STANDARD DEVIATION (RSD) / COEFFICIENT OF VARIATION (CV)
The standard deviation divided by the mean, RSD = s/x̄, often quoted as a percentage (so the result does not depend on the units used)
VARIANCE (V)
The square of the standard deviation
- Sample variance (N ≤ 30): V = s²
- Population variance (large N): V = σ²

Worked example: Se content (mg/g) of nine samples

Sample        1         2         3         4         5         6         7         8          9
Se (mg/g)     0.07      0.07      0.08      0.07      0.07      0.08      0.08      0.09       0.08
(xi − x̄)²     4.9×10⁻⁵  4.9×10⁻⁵  9.0×10⁻⁶  4.9×10⁻⁵  4.9×10⁻⁵  9.0×10⁻⁶  9.0×10⁻⁶  1.69×10⁻⁴  9.0×10⁻⁶

Mean = Σxi/N = 0.077
Σ(xi − x̄)² = 4.01×10⁻⁴
s = √[ Σ(xi − x̄)² / (N − 1) ] = √(4.01×10⁻⁴ / 8) = 0.007
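A quick cross-check of this worked example, as a minimal Python sketch using only the standard library's statistics module (the variable names are illustrative, not part of the notes):

    from statistics import mean, stdev, variance

    se = [0.07, 0.07, 0.08, 0.07, 0.07, 0.08, 0.08, 0.09, 0.08]   # Se content, mg/g

    x_bar = mean(se)        # 0.0767, rounded to 0.077 on the slide
    s = stdev(se)           # sample standard deviation (N - 1 in the denominator) = 0.007
    rsd = 100 * s / x_bar   # relative standard deviation / coefficient of variation, %
    v = variance(se)        # sample variance, the square of s

    print(f"mean = {x_bar:.3f}, s = {s:.3f}, RSD = {rsd:.1f}%, V = {v:.1e}")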
STD. DEV. FOR POOLED DATA (s_pooled)
To achieve a value of s that is a good approximation to σ (which requires roughly N ≥ 30), it is sometimes necessary to pool data from a number of sets of measurements. Suppose there are t small sets of data, comprising N1, N2, ..., Nt measurements; the resultant pooled sample standard deviation is:
s_pooled = √[ (sum of the squared deviations from the mean, over all t sets) / (N1 + N2 + ... + Nt − t) ]

Worked example: six small sets of measurements

Set     No. of obs   Deviations from the mean        Σ(deviation)²
1       3            0.05, 0.10, 0.08                0.0189
2       4            0.06, 0.05, 0.09, 0.06          0.0178
3       5            0.05, 0.12, 0.07, 0.00, 0.08    0.0282
4       4            0.05, 0.10, 0.06, 0.09          0.0242
5       3            0.07, 0.09, 0.10                0.0230
6       4            0.06, 0.12, 0.04, 0.03          0.0205
Total   23                                           0.1326

For set 1, for example: (0.05)² + (0.10)² + (0.08)² = 0.0189

s_pooled = √[ 0.1326 / (23 − 6) ] = 0.088

(For comparison, the standard deviation estimated from set 1 alone, based on only three results, is √(0.0189/2) = 0.097.)
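The pooled calculation is easy to verify in code. A minimal Python sketch (the helper function and variable names are illustrative), computing s_pooled directly from the deviations in the table above:

    def pooled_std(deviation_sets):
        """Pooled sample standard deviation from each set's deviations about its own mean."""
        sum_sq = sum(d * d for devs in deviation_sets for d in devs)   # pooled sum of squared deviations
        n_total = sum(len(devs) for devs in deviation_sets)            # total number of measurements
        t = len(deviation_sets)                                        # one degree of freedom lost per set
        return (sum_sq / (n_total - t)) ** 0.5

    sets = [
        [0.05, 0.10, 0.08],
        [0.06, 0.05, 0.09, 0.06],
        [0.05, 0.12, 0.07, 0.00, 0.08],
        [0.05, 0.10, 0.06, 0.09],
        [0.07, 0.09, 0.10],
        [0.06, 0.12, 0.04, 0.03],
    ]
    print(f"s_pooled = {pooled_std(sets):.3f}")   # 0.088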
Solve this Problem
Given a set of diameters of four cells, in units of µm: 120, 135, 160, 150
(a) Use the functions available in your calculator
(b) Use an Excel spreadsheet (at your own time; submit the data and result printout)
Calculate the following:
- Mean
- Median
- Standard Deviation
- Relative Standard Deviation (RSD)
- Variance
PRECISION
- Reproducibility (repeatability) of repeated measurements, i.e. how similar are values obtained in exactly the same way?
- Assessed using the deviation of each result from the mean: di = xi − x̄
ACCURACY
Nearness (proximity) to the true value, i.e. a measure of the agreement between the experimental mean and the true value (which may not be known!)
Measures of accuracy:
- Absolute error: E = xi − xt, where xt is the true (accepted) value
- Relative error: ER = |xi − xt| / xt × 100%
Discussion Question 1
Four students analyzed the Fe content in a sample. Each student performed 5 replicates and the results are illustrated below. Comment on the accuracy and precision of each set of results. (Hint: Student C obtained the best results.)
[Figure: dot plots of each student's five replicate results against the true value, on a scale from 9.80 to 10.20; reported means: A = 10.10, B = 9.90, C = 10.01, D = 10.01]
Discussion Question 2
- Comment on the accuracy and precision of the following results. Explain or show proof.
- Which set of data has to be thrown out (discarded)? Why?
Student       A        B        C        D        E
Data values   10.00    10.10    9.65     9.97     9.80
              10.00    10.08    9.75     9.98     9.89
              10.00    10.09    9.78     10.02    10.01
              10.00    10.07    10.07    10.03    10.13
              10.00    10.08    10.24    10.05    10.22
Mean (x̄)      10.00    10.08    9.90     10.01    10.01
s             0.00     0.01     0.25     0.03     0.17
CONFIDENCE LIMIT & CONFIDENCE INTERVAL
The Confidence Interval (CI) is the range of values surrounding the mean within which the population mean, µ, is expected to lie with a certain degree of probability. The boundaries of the range are called the Confidence Limits.
The Confidence Level (CL) is the probability that the true mean lies within a certain interval (expressed as a %).
Example: It is 99% probable that µ for a set of measurements is 7.25 ± 0.15 mg. Thus, the mean should lie in the interval from 7.10 mg to 7.40 mg with 99% probability.
CI for a large number of data (N > 30), with the population standard deviation σ known:
µ = x̄ ± zσ/√N
CI for a small number of data (N < 30), where σ is not known and s is used instead:
µ = x̄ ± ts/√N
Values of z for determining confidence limits:
Confidence level (%)   50     68     80     90     95     96     99     99.7   99.9
z                      0.67   1.0    1.29   1.64   1.96   2.00   2.58   3.00   3.29
N = number of measurements
z = value from the normal distribution curve (read from the z-table)
t = value from the Student's t distribution, which depends on the degrees of freedom, N − 1 (read from the t-table)
t is also known as Student's t and is generally used in hypothesis tests
SAMPLE QUESTION (CONFIDENCE INTERVAL)
Calculate the confidence interval (CI) at the 95%, 90% and 99% confidence levels, given the following data for the analysis of Ca in a rock sample: 14.35, 14.41, 14.40, 14.32, 14.37
Mean = 14.37, s = 0.037
From the t-table, at the 95% confidence level and N − 1 = 4 degrees of freedom, t = 2.78
CI: µ = x̄ ± ts/√N = 14.37 ± (2.78 × 0.037)/√5
The confidence interval is 14.37 ± 0.05, i.e. 14.32 < µ < 14.42
Summary of results (calculate the rest by yourselves):
Confidence level    Confidence interval (CI)
90%                 µ = 14.37 ± 0.04
95%                 µ = 14.37 ± 0.05
99%                 µ = 14.37 ± 0.08
If the confidence level increases, the CI widens, so the probability that µ falls within the interval also increases.
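The same calculation as a short Python sketch (it assumes SciPy is available for the t critical value; variable names are illustrative):

    from math import sqrt
    from statistics import mean, stdev
    from scipy.stats import t

    ca = [14.35, 14.41, 14.40, 14.32, 14.37]   # % Ca in the rock sample
    x_bar, s, n = mean(ca), stdev(ca), len(ca)

    for level in (0.90, 0.95, 0.99):
        t_crit = t.ppf(1 - (1 - level) / 2, df=n - 1)   # two-sided critical value, N - 1 degrees of freedom
        half_width = t_crit * s / sqrt(n)
        print(f"{level:.0%} CI: {x_bar:.2f} ± {half_width:.2f}")   # ±0.04, ±0.05, ±0.08 as on the slide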
[Further worked example, not fully recoverable from the original: confidence intervals calculated for a mean of 8.53 at the 90% and 99% confidence levels, parts (a) to (c)]
OTHER USES OF THE CONFIDENCE INTERVAL
- To determine the number of replicates (N) needed for the mean to be within a given confidence interval
- To detect systematic error
[Worked example, not fully recoverable from the original: at the 90% confidence level, µ = x̄ ± ts/√N = 7.24 ± ...]
Example 1: Calculate the number of replicates needed so that the confidence interval is ±1.5 µg/mL at the 95% confidence level, given s = 2.4 µg/mL.
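One way to attack Example 1, as a sketch: it assumes the large-sample relation (CI half-width = z·s/√N, with z = 1.96 at 95%), which is the usual first approximation when s is a good estimate of σ.

    from math import ceil

    z = 1.96          # z for the 95% confidence level (from the z-table above)
    s = 2.4           # standard deviation, µg/mL
    half_width = 1.5  # required confidence-interval half-width, µg/mL

    # half_width = z * s / sqrt(N)  =>  N = (z * s / half_width) ** 2
    n_required = ceil((z * s / half_width) ** 2)
    print(n_required)   # about 10 replicates (9.8 rounded up)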
Example 2: Ten measurements on a sample gave a mean of 0.461, with a standard deviation of 0.003. A solution gave a reading of 0.470. Show whether systematic error exists at the 95% confidence level.
At the 95% confidence level, (N − 1) = 9 and t = 2.26
µ = x̄ ± ts/√N = 0.461 ± (2.26 × 0.003)/√10 = 0.461 ± 0.002
This means 0.459 < µ < 0.463, i.e. 95% of the time the true value lies between 0.459 and 0.463.
Therefore, the reading 0.470 is NOT in this range, and systematic error EXISTS.
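The same check as a short Python sketch (it simply reuses the t value quoted above rather than looking it up):

    from math import sqrt

    x_bar, s, n = 0.461, 0.003, 10
    t_crit = 2.26                     # t at 95% confidence, 9 degrees of freedom
    reading = 0.470                   # the reading to be tested against the interval

    half_width = t_crit * s / sqrt(n)
    low, high = x_bar - half_width, x_bar + half_width
    print(f"95% CI: {low:.3f} to {high:.3f}")   # 0.459 to 0.463
    print("systematic error exists" if not (low <= reading <= high) else "no systematic error detected")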
DISTRIBUTION OF ERRORS
The NORMAL or GAUSSIAN distribution (bell-shaped, symmetrical curve) gives the limits within which the population mean (µ) is expected to lie with a given degree of probability (in the absence of any systematic error)
[Figure: Gaussian curves of dN/N versus deviation from the mean in units of s (from −4s to +4s), with shaded central areas of 50% (±0.67s), 80% (±1.29s) and 95% (±1.96s); the mean is indicated by µ]
Based on the curve, the percentages of the area under the curve between certain limits are as follows:
50% of the area lies within ±0.67s
80% of the area lies within ±1.29s
90% of the area lies within ±1.64s
95% of the area lies within ±1.96s
99% of the area lies within ±2.58s
When we say that at a confidence level of 80% the confidence limits are ±1.29s, we mean that:
- 80% of the time the true mean will lie within ±1.29s of the measurements made
- or, in other words, 20% of the time the true mean will NOT lie within ±1.29s
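These coverage percentages can be verified with the standard library's NormalDist (a small illustrative sketch):

    from statistics import NormalDist

    std_normal = NormalDist()   # mean 0, standard deviation 1
    for z in (0.67, 1.29, 1.64, 1.96, 2.58):
        area = std_normal.cdf(z) - std_normal.cdf(-z)   # area between -z*s and +z*s
        print(f"±{z}s covers {area:.1%} of the curve")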
SIGNIFICANCE TESTS
Tests whether the difference between two results is significant or merely due to random variation; used to decide whether the difference between measured and known values can be explained by random errors.
The NULL HYPOTHESIS, Ho:
- If Ho is accepted: there is NO significant difference between the observed and known values (other than that due to random variation)
- If Ho is rejected: the difference is significant
(2) Comparison of the means (x̄1 and x̄2) of two samples
- e.g. compare the mean of a new method with that of a reference (or standard) method
- Accept the null hypothesis (Ho) if there is NO significant difference between the methods, i.e. the results are the same, or µ1 − µ2 = 0
- Calculate t; if tcalc < ttable, accept Ho, i.e. there is NO significant difference in the results
- Use the pooled estimate of the standard deviation: s² = {(n1 − 1)s1² + (n2 − 1)s2²} / (n1 + n2 − 2)
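A quick way to run this comparison, as a sketch (assuming SciPy is available; the replicate values are made up purely for illustration, and equal_var=True makes ttest_ind use the pooled standard deviation shown above):

    from scipy.stats import ttest_ind

    new_method = [10.1, 10.3, 10.0, 10.2, 10.1]   # hypothetical replicate results, new method
    ref_method = [10.4, 10.2, 10.3, 10.5, 10.3]   # hypothetical replicate results, reference method

    t_stat, p_value = ttest_ind(new_method, ref_method, equal_var=True)   # pooled-variance t test
    print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
    # accept Ho (no significant difference at the 95% level) if p > 0.05, i.e. |tcalc| < ttable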
The F Test
- F is the ratio of two sample variances: F = s1²/s2² (an estimate of σ1²/σ2²)
- One-tailed test: tests whether method A is more precise than method B (assumes A is always the more precise)
- Two-tailed test: tests whether methods A and B differ in their precision (i.e. either method could be the more precise)
Ho: the population variances are equal (i.e. their ratio is 1)
[F is always taken > 1; thus the smaller variance, i.e. that of the more precise method, is always the denominator]
If Fcalc < Ftable, accept Ho, which means there is NO significant difference in precision between the two methods
F-TABLE: critical values of F at the chosen confidence level (used in the examples below)
Example Question: ONE-TAILED F TEST
A proposed method for the COD of wastewater was compared with a standardized method. The results are as follows:
Standardized method (8 determinations): mean = 72 mg/L, s = 3.31 mg/L
Proposed method (9 determinations): mean = 72 mg/L, s = 1.51 mg/L
Is the proposed method significantly more precise than the standardized method?
F = (sStd)²/(sProp)² = (3.31)²/(1.51)² = 4.8
(the proposed method has the smaller variance, so it is set as the denominator)
With 8 determinations for the standard method and 9 for the proposed method, the degrees of freedom (N − 1) are 7 (numerator) and 8 (denominator); from the F-table, Fcrit = 3.50
Since Fcalc > Ftable, reject Ho. Thus there is a significant difference between the methods, and the proposed method is significantly more precise.
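The same one-tailed F test as a Python sketch (assuming SciPy for the critical value; variable names are illustrative):

    from scipy.stats import f

    s_std, n_std = 3.31, 8      # standardized method
    s_prop, n_prop = 1.51, 9    # proposed method (smaller s, so its variance is the denominator)

    f_calc = s_std**2 / s_prop**2
    f_crit = f.ppf(0.95, dfn=n_std - 1, dfd=n_prop - 1)   # one-tailed, 95% confidence level

    print(f"Fcalc = {f_calc:.2f}, Fcrit = {f_crit:.2f}")   # 4.80 vs 3.50
    print("reject Ho" if f_calc > f_crit else "accept Ho")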
Example: Determination of CO using a standard procedure gave an s value of 0.21 ppm. The method was modified twice, giving s1 = 0.15 and s2 = 0.12 (both with 9 degrees of freedom). Are the modified methods significantly more precise than the standard procedure?
Ho: s1 = sstd     Ho: s2 = sstd
F1 = (sstd)²/(s1)² = (0.21)²/(0.15)² = 1.96
F2 = (sstd)²/(s2)² = (0.21)²/(0.12)² = 3.06
In standard methods the number of data is large, thus s → σ and the degrees of freedom become infinite. From the F-table (numerator df = ∞, denominator df = 9), Fcrit = 2.71.
F1 < Ftable: accept Ho, but F2 > Ftable: reject Ho.
Only the 2nd modified method is significantly more precise than the standard method.
The Q TEST or DIXON'S TEST (detection of gross errors)
The Q-test is used for detecting an outlier (a suspect, unreasonable value) which statistically does not belong to the set.
Example: 10.05, 10.10, 10.15, 10.05, 10.45, 10.10
Qcalc is compared with Qtable and the null hypothesis, Ho, is checked.
Qexpt = |suspect value − nearest value| / (largest value − smallest value)
Qexpt = |10.45 − 10.15| / (10.45 − 10.05) = 0.75
From the Q-table (at 95% and N = 6), Qtable = 0.625 (Q-table: next slide)
Qcalc > Qtable, so the data point 10.45 can be rejected
Q TABLE
No. of observations    Confidence level
                       90%      95%      99%
3                      0.941    0.970    0.994
4                      0.765    0.829    0.926
5                      0.642    0.710    0.821
6                      0.560    0.625    0.740
7                      0.507    0.568    0.680
8                      0.468    0.526    0.634
9                      0.437    0.493    0.599
10                     0.412    0.466    0.568
EXAMPLE QUESTION: Q-TEST
The following data were obtained for the determination of nitrite concentration (mg/L) in a sample of river water: 0.403, 0.410, 0.401, 0.380, 0.400, 0.413, 0.411
Should the data point 0.380 be retained?
Q = |0.380 − 0.400| / |0.413 − 0.380| = 0.606
From the Q-table, for a sample size of 7, Qtable = 0.570
Qcalc > Qtable, thus the suspect outlier is rejected
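A small Python sketch of the Dixon Q-test as applied above (the helper function is illustrative; the hard-coded critical values are the 95% column of the Q table above, so only N = 3 to 10 is covered):

    def dixon_q(values):
        """Q for the most extreme value: gap to its nearest neighbour divided by the total range."""
        data = sorted(values)
        gap_low = data[1] - data[0]       # if the suspect value is the smallest
        gap_high = data[-1] - data[-2]    # if the suspect value is the largest
        return max(gap_low, gap_high) / (data[-1] - data[0])

    Q_CRIT_95 = {3: 0.970, 4: 0.829, 5: 0.710, 6: 0.625, 7: 0.568, 8: 0.526, 9: 0.493, 10: 0.466}

    nitrite = [0.403, 0.410, 0.401, 0.380, 0.400, 0.413, 0.411]   # mg/L
    q_calc = dixon_q(nitrite)
    q_table = Q_CRIT_95[len(nitrite)]
    print(f"Qcalc = {q_calc:.3f}, Qtable = {q_table}")            # 0.606 vs 0.568
    print("reject the suspect value" if q_calc > q_table else "retain the value")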