Introduction to Statistical Analysis of Laboratory Data
About this ebook
- Provides detailed discussions on statistical applications including a comprehensive package of statistical tools that are specific to the laboratory experiment process
- Introduces terminology used in many applications such as the interpretation of assay design and validation as well as “fit for purpose” procedures including real world examples
- Includes a rigorous review of statistical quality control procedures in laboratory methodologies and influences on capabilities
- Presents methodologies used in the areas such as method comparison procedures, limit and bias detection, outlier analysis and detecting sources of variation
- Introduces the analysis of robustness and ruggedness, including multivariate influences on response, to account for controllable/uncontrollable laboratory conditions
Book preview
Introduction to Statistical Analysis of Laboratory Data - Alfred Bartolucci
To Lieve and Frank
Preface
Intended Audience
The advantage of this book is that it provides comprehensive coverage of the analytical tools for problem solving related to laboratory data analysis and quality control. The content of the book is motivated by the topics that laboratory statistics course audiences and others have requested over the years since 2003. As a result, the book could also be used as a textbook in short courses on quantitative aspects of laboratory experimentation and as a reference guide to statistical techniques in the laboratory and the processing of pharmaceuticals. Output throughout the book is presented in familiar software formats such as Excel and JMP (SAS Institute, Cary, NC).
The audience for this book could be laboratory scientists and directors, process chemists, medicinal chemists, analytical chemists, quality control scientists, quality assurance scientists, CMC regulatory affairs staff and managers, government regulators, microbiologists, drug safety scientists, pharmacists, pharmacokineticists, pharmacologists, research and development technicians, safety specialists, medical writers, clinical research directors and personnel, serologists, and stability coordinators. The book would also be suitable for graduate students in biology, chemistry, physical pharmacy, pharmaceutics, environmental health sciences and engineering, and biopharmaceutics. These individuals usually have an advanced degree in chemistry, pharmaceutics, or formulation science and hold job titles such as scientist, senior scientist, principal scientist, director, senior director, and vice president. The above partial list of titles is from the full list of attendees who have participated in the 2-day course titled Introductory Statistics for Laboratory Data Analysis, given through the Center for Professional Innovation and Education.
Prospectus
There is an unmet need to have the necessary statistical tools in a comprehensive package with a focus on laboratory experimentation. The study of the statistical handling of laboratory data from the design, analysis, and graphical perspective is essential for understanding pharmaceutical research and development of results involving practical quantitative interpretation and communication of the experimental process. A basic understanding of statistical concepts is pertinent to those involved in the utilization of the results of quantitation from laboratory experimentation and how these relate to assuring the quality of drug products and decisions about bioavailability, processing, dosing and stability, and biomarker development. A fundamental knowledge of these concepts is critical as well for design, formulation, and manufacturing.
This book presents a detailed discussion of important basic statistical concepts and methods of data presentation and analysis in aspects of biological experimentation requiring a fundamental knowledge of probability and the foundations of statistical inference, including basic statistical terminology such as simple statistics (e.g., means, standard deviations, medians) and transformations needed to effectively communicate and understand one's data results. Statistical tests (one-sided, two-sided, nonparametric) are presented as required to initiate a research investigation (i.e., research questions in statistical terms). Topics include concepts of accuracy and precision in measurement analysis to ensure appropriate conclusions in experimental results including between- and within-laboratory variation. Further topics include statistical techniques to compare experimental approaches with respect to specificity, sensitivity, linearity, and validation and outlier analysis. Advanced topics of the book go beyond the basics and cover more complex issues in laboratory investigations with examples, including association studies such as correlation and regression analysis with laboratory applications, including dose response and nonlinear dose–response considerations. Model fit and parallelism are presented. To account for controllable/uncontrollable laboratory conditions, the analysis of robustness and ruggedness as well as suitability, including multivariate influences on response, are introduced. Method comparison using more accurate alternatives to correlation and regression analysis and pairwise comparisons including the Mandel sensitivity are pursued. Outliers, limit of detection and limit of quantitation and data handling of censored results (results below or above the limit of detection) with imputation methodology are discussed. Statistical quality control for process stability and capability is discussed and evaluated. 
Where relevant, the procedures provided follow the CLSI (Clinical and Laboratory Standards Institute) guidelines for data handling and presentation.
The significance of this book includes the following:
A comprehensive package of statistical tools (simple, cross-sectional, and longitudinal) required in laboratory experimentation
A solid introduction to the terminology used in many applications such as the interpretation of assay design and validation as well as fit-for-purpose procedures
A rigorous review of statistical quality control procedures in laboratory methodologies and influences on capabilities
A thorough presentation of methodologies used in areas such as method comparison procedures, limit and bias detection, outlier analysis, and detecting sources of variation
Acknowledgments
The authors would like to thank Ms. Laura Gallitz for her thorough review of the manuscript and excellent suggestions and edits that she provided throughout.
Chapter 1
Descriptive Statistics
1.1 Measures of Central Tendency
Before we deal in detail with the laboratory applications, we wish to establish a basic understanding of statistical terms. We want to be sure we understand the meaning of these concepts, since one often describes one's data in terms of summary statistics. We discuss what are commonly known as measures of central tendency, such as the mean, median, and mode, plus other descriptive measures of data. We also want to understand the difference between samples and populations.
Data come from the samples we take from a population. To be specific, a population is a collection of data whose properties are analyzed. The population is the complete collection to be studied; it contains all possible data points of interest. A sample is a part of the population of interest, a subcollection selected from a population. For example, if one wanted to determine the preference of voters in the United States for a political candidate, then all registered voters in the United States would be the population. One would select a subset, say, 5000, from that population and then determine from the sample the preference for that candidate, perhaps noting the percent of the sample that prefers that candidate over another. It would be logistically impossible and prohibitively expensive to canvass the entire population, so we take what we believe to be a representative sample from the population. If the sampling is done appropriately, then we can generalize our results to the whole population. Thus, in statistics, we deal with the sample that we collect and base our decisions on it. Similarly, if we want to test a certain vegetable or fruit for food allergens or contaminants, we take a batch from the whole collection and send it to the laboratory, where it is subjected to chemical testing for the presence or degree of the allergen or contaminants. There are certain safeguards taken when one samples. For example, we want the sample to appropriately represent the whole population. Factors relevant in considering the representativeness of a sample include the homogeneity of the food and the relative sizes of the samples to be taken, among other considerations. Therefore, keep in mind that when we do statistics, we always deal with the sample in the expectation that what we conclude generalizes to the whole population.
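The voter-preference example can be sketched in Python. The population below is simulated with an assumed true preference of 52%, so everything here is hypothetical; the point is only that a well-drawn random sample of 5000 recovers the population percentage closely:

```python
import random

# Hypothetical population: preference (1 = prefers candidate A) for
# 1,000,000 registered voters; the assumed true proportion here is 0.52.
random.seed(42)
population = [1 if random.random() < 0.52 else 0 for _ in range(1_000_000)]

# Draw a simple random sample of 5000 voters, as in the example above.
sample = random.sample(population, 5000)
sample_pct = 100 * sum(sample) / len(sample)

# The sample percentage estimates the (usually unknowable) population value.
print(f"Sample preference: {sample_pct:.1f}%")
```

With a sample of this size, the estimate typically lands within a couple of percentage points of the population value, which is why canvassing everyone is unnecessary.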
Now let's talk about what we mean when we say we have a distribution of the data. The following is a sample of size 16 of white blood cell (WBC) counts ×1000 from a diseased sample of laboratory animals:
[WBC data values, listed in ascending order, are not reproduced in this extract]

Note that this data is purposely presented in ascending order. That may not necessarily be the order in which the data was collected. However, it is presented as such in order to give an idea of the range of the observations in some meaningful way. When we rank the data from the smallest to the largest, we call this a distribution.
One can see the distribution of the WBC counts by examining Figure 1.1. We'll use this figure as well as the data points presented to demonstrate some of the statistics that will be commonplace throughout the text. The height of the bars represents the frequency of counts for each of the values 5.13–6.8, and the actual counts are placed on top of the bars. Let us note some properties of this distribution. The mean is easy. It is obviously the average of the counts from 5.13 to 6.8, or 5.94. Algebraically, if we denote the elements of a sample of size n as x_1, x_2, ..., x_n, then the sample mean in statistical notation is equal to
(1.1)   x̄ = (x_1 + x_2 + ... + x_n)/n = (1/n) Σ x_i
For example, in our aforementioned WBC data, x_1 = 5.13, and so on, where n = 16.
Figure 1.1 Frequency Distribution of White Cell Counts
Then the mean is noted as earlier, x̄ = 5.94.
The median is the middle data point of the distribution when there is an odd number of values and the average of the two middle values when there is an even number of values in the distribution. We demonstrate it as follows.
Note our data is:
[WBC data values, in ascending order with positions 8 and 9 underlined, are not reproduced in this extract]

The number of data points is an even number, or 16. Thus, the two middle values are in positions 8 and 9. So the median is the average of 6.0 and 6.0, or

median = (6.0 + 6.0)/2 = 6.0.
Suppose we had a distribution of seven data points, which is an odd number; then the median is just the middle value, or the value in position number 4 [the seven ordered values are not reproduced in this extract]. Thus, the median value is 5.7. The median is also referred to as the 50th percentile. Approximately 50% of the values are above it and 50% of the values are below it. It is truly the middle value of the distribution.
The mode is the most frequently occurring value in the distribution. If we examine our full data set of 16 points, one will note that the value 6.0 occurs four times. Also see Figure 1.1. Thus, the mode is 6.0. One can have a distribution with more than one mode. For example, if the values of 5.4 and 6.0 were each counted four times, then this would be a bimodal distribution or a distribution with two modes.
We have just discussed what is referred to as measures of central tendency. It is easy to see that the measures of central tendency from this data (mean, median, and mode) are all in the center of the distribution, and all other values are centered around them. In cases where the mean = median = mode as in our example, the distribution is seen to be symmetric. Such is not always the case.
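As a minimal sketch, the three measures of central tendency can be computed with Python's standard library. The values below are hypothetical WBC-style counts chosen to mimic the shape of the text's example (n = 16, with 6.0 occurring four times); they are not the actual data:

```python
import statistics

# Hypothetical WBC-style counts (x1000), in ascending order; illustrative
# values, not the 16 observations from the text.
wbc = [5.1, 5.4, 5.4, 5.7, 5.8, 5.9, 6.0, 6.0,
       6.0, 6.0, 6.2, 6.3, 6.5, 6.6, 6.7, 6.8]

print(statistics.mean(wbc))    # arithmetic mean
print(statistics.median(wbc))  # average of positions 8 and 9, since n = 16 is even
print(statistics.mode(wbc))    # most frequently occurring value
```

Here the median and mode coincide at 6.0 and the mean is close by, illustrating the near-symmetric case discussed above.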
Figure 1.2 deals with data that is skewed and not symmetric. Note the mode to the left, indicating a high frequency of low values. These are potassium values from a laboratory sample. This data is said to be skewed to the right, or positively skewed. We'll revisit this concept of skewness in Chapter 2 and later chapters as well. There are 23 values (not listed here) ranging from 30 to 250. One usually computes the geometric mean (GM) for data of this form. The GM is sometimes preferred to the arithmetic mean (ARM) since it is less sensitive to outliers or extreme values; it is sometimes called a spread-preserving statistic. The GM is always less than or equal to the ARM and is commonly used with data that may be skewed rather than normal or symmetric, as much laboratory data is.
Figure 1.2 Frequency Distribution of Potassium Values
Suppose we have n observations x_1, x_2, ..., x_n; then the GM is defined as
(1.2)   GM = (x_1 × x_2 × ... × x_n)^(1/n)
or equivalently
(1.3)   GM = exp[(1/n) Σ ln x_i]

In our potassium example, the GM [numerical value not reproduced in this extract] is smaller, as expected. Note that the ARM = 75.217.
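Equations (1.2) and (1.3) can be checked numerically; they are algebraically equivalent. The data below are hypothetical skewed values, not the 23 potassium measurements from the text:

```python
import math

# Hypothetical skewed, strictly positive data (not the text's potassium values).
x = [30, 45, 52, 60, 75, 110, 250]
n = len(x)

gm_direct = math.prod(x) ** (1 / n)                  # Equation (1.2): nth root of the product
gm_log = math.exp(sum(math.log(v) for v in x) / n)   # Equation (1.3): exp of the mean log

arm = sum(x) / n  # arithmetic mean, for comparison

print(gm_direct, gm_log, arm)  # the two GM forms agree; GM <= ARM
```

The log form (1.3) is the one used in practice, since multiplying many values together can overflow while summing their logs cannot.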
1.2 Measures of Variation
We've learned some important measures of statistics. The mean, median, and mode describe some sample characteristics. However, they don't tell the whole story. We want to know more characteristics of the data with which we are dealing. One such measure is the dispersion or the variance. This particular measure has several forms in laboratory science and is essential to determining something about the precision of an experiment. We will discuss several forms of variance and relate them to data accordingly.
The range is the difference between the maximum and minimum value of the distribution. Referring to the WBC data:
range = 6.8 − 5.13 = 1.67

Obviously, the range is easy to compute, but it depends only on the two most extreme values of the data. We want a value or measure of dispersion that utilizes all of the observations. Note the data in Table 1.1. For the sake of demonstration, we have three observations: 2, 4, and 9. These data are seen in the data column. Note their sum or total is 15. Their mean or average is 5. Note their deviations from the mean: 2 − 5 = −3, 4 − 5 = −1, and 9 − 5 = 4. The sum of the deviations is 0. This property holds for a data set of any size; that is, the sum of the deviations from the mean is always exactly 0. So the sum of deviations doesn't make sense as a measure of dispersion, or we would have a perfect world of no variation or dispersion of the data. The last column, denoted as (Deviation)², is the deviation column squared. The sum of the squared deviations is 26.
Table 1.1 Demonstration of Variance
The variance of a sample is the average squared deviation from the sample mean. Specifically, from the previous sample of three values,
s² = [(2 − 5)² + (4 − 5)² + (9 − 5)²]/(3 − 1) = 26/2 = 13. Thus, the variance is 13. Dividing by (3 − 1) = 2 instead of 3 gives us an unbiased estimator of the variance: on average, over repeated samples, it equals the true population variance. Note that if our sample size were 100, then dividing by 99 rather than 100 would not make much of a difference in the value of the variance. The adjustment of dividing the sum of squared deviations by the sample size minus 1, (n − 1), can be thought of as a small-sample adjustment. It keeps us from underestimating the variance, erring instead toward a conservatively larger estimate.
Recall our WBC data:
[WBC data values not reproduced in this extract]

The mean or average is 5.939 ≈ 5.94.
So the variance is the sum of the squared deviations from 5.94 divided by 16 − 1 = 15 [numerical value not reproduced in this extract].

Algebraically, one may note the variance formula in statistical notation for the data in Table 1.1, where the mean is x̄ = 5.
One defines the sample variance as s², or
(1.4)   s² = Σ (x_i − x̄)² / (n − 1)
So for the data in Table 1.1 we have
equationThe sample standard deviation (SD), c01-math-0025 , is the square root of sample c01-math-0026 , or in our case c01-math-0027 .
(1.5)   s = √[Σ (x_i − x̄)² / (n − 1)]
The variance is a measure of variation. The square root of the variance, or SD, is a measure of variation in terms of the original scale.
Thus, referring back to the aforementioned WBC data, the SD of our WBC counts is the square root of the variance [numerical value not reproduced in this extract].
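A short sketch of the variance and SD computation for the Table 1.1 demonstration data (2, 4, and 9), showing that the hand calculation with the n − 1 divisor matches Python's standard library:

```python
import statistics

data = [2, 4, 9]  # the demonstration data from Table 1.1

mean = sum(data) / len(data)             # 5
ss = sum((x - mean) ** 2 for x in data)  # sum of squared deviations = 26
var = ss / (len(data) - 1)               # divide by n - 1  ->  13
sd = var ** 0.5                          # standard deviation, Equation (1.5)

# statistics.variance and statistics.stdev use the same n - 1 divisor.
assert var == statistics.variance(data)
print(var, round(sd, 3))  # 13.0 3.606
```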
Just as we discussed the GM earlier for data that may be possibly skewed, we also have a geometric standard deviation (GSD). One uses the log of the data as we did for the GM. The GSD is defined as
(1.6)   GSD = exp(SD of the ln x_i values)
As an example, suppose we have n data points [values not reproduced in this extract]. Then from (1.6), the GSD can be computed [numerical value not reproduced in this extract]. Unlike the GM, the GSD is not necessarily a close neighbor of the arithmetic SD, which in this case is 16.315.
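Equation (1.6) is straightforward to compute: take logs, take the sample SD of the logs, and exponentiate. The data here are hypothetical:

```python
import math
import statistics

# Hypothetical positive data points (the text's example values are not
# reproduced in this extract).
x = [12.0, 15.5, 20.1, 48.0, 61.3]

logs = [math.log(v) for v in x]
gsd = math.exp(statistics.stdev(logs))  # Equation (1.6): exp(SD of ln x)

# The GSD is a multiplicative (unitless) spread factor, always >= 1,
# so it need not resemble the arithmetic SD numerically.
print(round(gsd, 3))
```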
Another measure of variation is the standard error of the mean (SE or SEM), which is the SD divided by the square root of the sample size or
(1.7)   SE = SD/√n
For our aforementioned WBC data, the SE is the SD divided by √16 = 4 [numerical value not reproduced in this extract].
The standard error (SE) of the mean is the variation one would expect in the sample means after repeated sampling from the same population. It is the SD of the sample means. Thus, the sample SD deals with the variability of your data while the SE of the mean deals with the variability of your sample mean.
Naturally, we have only one sample and one sample mean. Theoretically, the SE is the SD of many sample means after sampling repeatedly from the same population. It can be thought of as a SD of the sample means from replicated sampling or experimentation. Thus, a good approximation of the SE of the mean from one sample is the SD divided by the square root of the sample size as seen earlier. It is naturally smaller than the SD. This is because from repeated sampling from the population one would not expect the mean to vary much, certainly not as much as the sample data. Rosner (2010, Chapter 6, Estimation) and Daniel (2008, Chapter 6, Estimation) give an excellent demonstration and explanation of the SD and SE of the mean comparisons.
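A sketch of the SE computation from Equation (1.7); the data are hypothetical, but with n = 16 the SEM is exactly one quarter of the SD, as in the WBC example:

```python
import math
import statistics

# Hypothetical sample of n = 16 measurements (not the text's WBC data).
data = [5.1, 5.4, 5.4, 5.7, 5.8, 5.9, 6.0, 6.0,
        6.0, 6.0, 6.2, 6.3, 6.5, 6.6, 6.7, 6.8]

sd = statistics.stdev(data)       # sample SD, with the n - 1 divisor
sem = sd / math.sqrt(len(data))   # Equation (1.7): SD / sqrt(n)

# The SEM is always smaller than the SD; here it is SD/4 since sqrt(16) = 4.
print(round(sd, 4), round(sem, 4))
```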
Another common measure of variation used in laboratory data exploration is the coefficient of variation (CV), sometimes referred to as the relative standard deviation (RSD). This is defined as the ratio of the SD to the mean expressed as a percent.
It is also called a measure of reliability – sometimes referred to as precision and is defined as
(1.8)   %CV = (SD/mean) × 100
Our sample CV of the WBC measurements follows from (1.8) [numerical value not reproduced in this extract].
The multiplication by 100 allows it to be referred to as the percent CV, %CV, or CV%.
The %CV normalizes the variability of the data set by calculating the SD as a percent of the mean. The %CV or CV helps one to compare the precision differences that may exist among assays and assay methods. We'll see an example of this in the following section. Clearly, an assay with CV = 7.1% is more precise than one with CV = 10.3%.
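Comparing the precision of two assays by %CV can be sketched as follows; both assays and their measurements are hypothetical:

```python
import statistics

def pct_cv(values):
    """Percent coefficient of variation: 100 * SD / mean, Equation (1.8)."""
    return 100 * statistics.stdev(values) / statistics.mean(values)

# Two hypothetical assays measuring the same analyte.
assay_a = [9.8, 10.1, 10.0, 9.9, 10.2]   # tight replicates
assay_b = [9.0, 11.2, 10.4, 8.8, 10.6]   # noisier replicates

# The assay with the smaller %CV is the more precise one.
print(round(pct_cv(assay_a), 2), round(pct_cv(assay_b), 2))
```

Because the %CV divides out the mean, it lets one compare the precision of assays even when they operate on different scales or units.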
1.3 Laboratory Example
The following example is based on the article by Steele et al. (2005) from the Archives of Pathology and Laboratory Medicine. The objective of the study was to determine the long-term within- and between-laboratory variation of cortisol, ferritin, thyroxine, free thyroxine, and thyroid-stimulating hormone (TSH) measurements by using commonly available methods and to determine whether these variations are within accepted medical standards, that is to say, within the specified CV.
The design – Two vials of pooled frozen serum were mailed 6 months apart to laboratories participating in two separate College of American Pathologists' surveys. The data from those laboratories that analyzed an analyte in both surveys were used to determine for each method the total variance and the within- and between-laboratory variance components. For our purposes, we focus on the CV for one of the analytes, namely, the TSH. There were more than 10 analytic methods studied in this survey. The three methods we report here are as follows: A – Abbott AxSYM, B – Bayer Advia Centaur, and C – Bayer Advia Centaur 3G. The study examined many endpoints directed to measuring laboratory precision with a focus on total precision overall and within- and between-laboratory precision. The within-laboratory goals as per the %CV based on biological criteria were cortisol – 10.43%, ferritin – 6.40%, thyroxine – 3.00%, free thyroxine – 3.80%, and TSH – 10.00%. Figure 1.3 shows the graph for analytic methods A, B, and C, for TSH. The horizontal reference line across the top of the figure at 10% indicates that all of the bars for the total, within- and between-laboratory %CV met the criteria for the three methods shown here. Also, note in examining Figure 1.3 that the major source of variation was within-laboratory as opposed to the between- or among-laboratory variation or %CV.
Figure 1.3 CV% for TSH.
Reproduced in part from Steele et al. (2005) with permission from Archives of Pathology and Laboratory Medicine. Copyright 2005 College of American Pathologists
When examining the full article, the authors point out that the number of methods that met within-laboratory imprecision goals based on biological criteria was 5 of 5 for cortisol; 5 of 7 for ferritin; 0 of 7 for thyroxine and free thyroxine; and 8 of 8 for TSH. Their overall conclusion was that for all analytes tested, the total within-laboratory component of variance was the major source of variation. In addition, note that for several analytes, such as thyroxine and free thyroxine, the available methods may not meet analytic goals in terms of their imprecision.
1.4 Putting it All Together
Let's consider a small data set of potassium values and demonstrate summary statistics in one display. Table 1.2 gives the potassium values, denoted by x_i, where i = 1, ..., 10. The natural logs of the values are seen in the third column, denoted by y_i = ln(x_i). The normal range of values for adult laboratory potassium (K) levels is 3.5–5.2 milliequivalents per liter (mEq/L), or 3.5–5.2 millimoles per liter (mmol/L). Obviously, a number of the values are outside this range. The summary statistics are provided for both raw and transformed values, respectively. The x values are actually from what we call a log-normal distribution, which we will discuss in the following chapter. Focusing on the untransformed potassium values of Table 1.2, Table 1.3 gives a complete set of summary statistics that one often encounters. We've discussed most of them and will explain the others. The minimum and maximum values are obvious, being the minimum and maximum potassium values from Table 1.2. The other two added values in Table 1.3 are the 25th percentile (first quartile) and the 75th percentile (third quartile). They are percentiles just like the median. Just as the median is the 50th percentile (second quartile), with approximately 50% of the values lying above it and 50% below it, the 25th percentile is the value 2.9, meaning that approximately 25% of the values in the distribution lie below it, which implies that about 75% of the values lie above 2.9. Likewise, the 75th percentile is the value 8.05, meaning that 75% of the values in the distribution are less than or equal to 8.05, implying that about 25% of the values lie above it. Note that the median is in the middle of the 25th and 75th percentiles. The interval between the 25th and 75th percentiles is called the interquartile range (IQR). Note that approximately 50% of the data points lie in the IQR.
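A Table 1.3-style summary can be produced with the standard library. The potassium-style values below are hypothetical, and note that statistics.quantiles uses one of several common percentile interpolation conventions, so its quartiles may differ slightly from those reported by JMP or Excel:

```python
import statistics

# Hypothetical potassium-style values (mEq/L); Table 1.2's actual data is
# not reproduced in this extract.
k = [2.1, 2.7, 2.9, 3.6, 4.3, 4.9, 6.0, 8.0, 8.1, 12.5]

# statistics.quantiles with n=4 returns the three quartiles (Q1, median, Q3).
q1, q2, q3 = statistics.quantiles(k, n=4)
iqr = q3 - q1  # interquartile range: about 50% of the data lies within it

print(min(k), q1, q2, q3, max(k), iqr)
```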
Table 1.2 Potassium Values and Descriptive Statistics
Table 1.3 Descriptive Statistics of 10 Potassium (X) Values
Let's revisit the GM and GSD. From Table 1.2, we note that
GM = exp(mean of the y_i values) [numerical value not reproduced in this extract]

Also, the relation between the arithmetic SD and the GSD is such that ln(GSD) = the arithmetic SD of the y_i's in Table 1.2. Thus, ln(GSD) = 0.68, or GSD = exp(0.68) = 1.974.
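The GM/GSD relationships just stated can be verified numerically on hypothetical data (Table 1.2's values are not reproduced in this extract):

```python
import math
import statistics

# Hypothetical positive potassium-style values, with their natural logs
# standing in for the y_i column of Table 1.2.
x = [2.1, 2.7, 2.9, 3.6, 4.3, 4.9, 6.0, 8.0, 8.1, 12.5]
y = [math.log(v) for v in x]

gm = math.exp(statistics.mean(y))    # GM = exp(mean of the logs)
gsd = math.exp(statistics.stdev(y))  # ln(GSD) = arithmetic SD of the logs

# Check the stated relationship: ln(GSD) equals the SD of the y values.
assert math.isclose(math.log(gsd), statistics.stdev(y))
print(round(gm, 3), round(gsd, 3))
```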
1.5 Summary
We have briefly summarized a number of basic descriptive statistics in this chapter such as the measures of central tendency and measures of variation. We also put them in the context of data that has a symmetric distribution as well as data that is not symmetrically distributed or may be skewed. It is important to note that these statistics just describe some property of the sample with which we are dealing in laboratory experimentation. Our goal in the use of these statistics is to describe what is expected to be true in the population from which the sample was drawn. In the next chapter, we discuss inferential statistics, which leads us to draw scientific conclusions from the data.
References
Daniel WM. (2008). Biostatistics: A Foundation for Analysis in the Health Sciences, 9th ed., John Wiley & Sons, New York.
Rosner B. (2010). Fundamentals of Biostatistics, 7th ed., Cengage Learning.
Steele BW, Wang E, Palmer-Toy DE, Killeen AA,