Business Statistics
Better Business Decisions
We all use statistics. It may not be in the formal sense we will be using in this text, but we do use our beliefs about the probability of an event or object to help us make decisions every day.
For example, when one is driving past a gas station, what do the prices on the sign signify, and how do you interpret that information? In its raw form the prices are simply data points. Every day we may drive past multiple gas stations. Each station has its price per gallon displayed on a large sign for us to see. We collect these data points and consciously calculate what we believe is the going price for a gallon of gas. In essence, we have taken data, organized the information, and generated a descriptive statistic (the average price for gas). Albeit an average based only on our own observations, it is a statistic (derived from data) that we use as representative of the "going" (average) price of gas (for the population of gas stations) to help us make decisions.
How do you use this average gas price (a statistic)? Let's say that the next day you need to purchase gas for the car. You are driving past a gas station where the price of gas is posted at $0.10 below what you believe the average price of gas to be (based on recent observations). Do you stop and get gas at this station? Maybe.
Why maybe? This brings us to that nagging thing called probability. To help understand probability, think of how different your reaction to a price drop would be if prices had been staying the same for the past six months versus fluctuating. For this example, let's assume you have been observing fluctuations in gas prices. Some days the price is rising; other days you observe the price is dropping. How confident are you that this $0.10 difference is below what your regular gas station will be charging? Will the price be $0.15 lower at the next station?

This is where you use your built-in statistical calculator to make a decision. You estimate the probability of the price difference and the chance that all stations have lowered gas prices versus this one station being outside the normal range of price variations you have been observing. If you come to the conclusion that the price of gas at this station is below what you can expect from other stations, you are likely to purchase from this location. This is the essence of statistical inference. We take data from a sample to represent the general population. We transform the data from its raw form by sorting, describing and analyzing. This allows us to make inferences about the world at large (the population represented by our sample).

An Empirical Approach to Business Decisions

Managers need information in order to introduce products and services that create value in the mind of the customer. But the perception of value is a subjective one, and what customers value this year may be quite different from what they value next year. As such, the attributes that create value cannot simply be deduced from common knowledge. Rather, data must be collected and analyzed. The goal of research is to provide the facts and direction that managers need to make their more important decisions. The value of the information provided is dependent on how well one effectively and efficiently executes the research process.

"SWAG", "shooting from the hip", "flying by the seat of your pants", and having a "gut feeling" are all expressions that characterized business practices during their early stages of development. Now, however, business managers increasingly turn to statistical methods in production planning and control, inventory control, finance and accounting, and personnel selection and training, to mention some major areas.

Scientific procedures involve the use of models to describe, analyze, and make predictions. A model can be a well-defined set of descriptions and procedures, like the Product Life Cycle in marketing, or it can be a scaled-down analog of the real thing, such as an engineer's scale model of a car.

Models that are useful and valuable in managing business operations are broadly termed "statistical models". A large number of these have been developed to assist researchers in a variety of fields, such as agriculture, psychology, education, communication, and military tactics, as well as in business. Only the professional statistician would be expected to understand the full range of such statistical models.

The question then is, "How can the average business manager be expected to understand and use such models?" First, that average business manager does not need to understand and use all the models. There are certain models that have more frequent business application than others. The statistical methods presented in this text have been chosen to cover a variety of problems generally encountered in business. The reader is encouraged to seek out additional readings for detailed coverage of each model and additional statistical methods beyond the scope of this text.

Second, the approach to learning models is to emphasize the similarity in logical structure among various models, and to stress understanding of the assumptions inherent in each model. This makes it possible to make sense of a seeming quagmire of statistical methods, and to clearly and logically determine when it is appropriate to use a specific statistical model.
As a manager reading research reports, it is then possible to determine if the correct techniques were utilized, emphasizing what the techniques do not say rather than simply how to properly interpret conclusions from statistical tests.

Finally, the average business manager is not expected to be a statistician. The objective is to train a manager who can properly understand and interpret results from statistical models, and who will ask the right questions of the researcher or statistician in order to evaluate and apply results.

Empiricism, the scientific method, refers to using direct observation to obtain knowledge. Thus, the empirical approach to acquiring knowledge is based on making observations of individuals or objects of interest. As illustrated by the gas price example, everyday observation is an application of the scientific approach.

In general, your decision about which station to purchase gas from when you observe the price is a microcosm of the processes we use to draw statistical inferences. Unfortunately, generalizations based on everyday observations are often misleading. In the context of making business decisions we will require a precise estimate of the probability that we are drawing a correct conclusion from the sample of data we have available. A major distinction between research and everyday observation is that research is planned in advance. Based on a theory or hunch, researchers develop research questions and then plan what, when, where and how to observe in order to answer the questions.

What (or whom) to observe: the population. When a population is large, researchers often plan to observe only a sample (i.e., a subset of a population). Planning how to draw an adequate sample is, of course, critical in conducting valid research.

When the observations will be made (morning, night, and so on). Researchers realize that the timing of their observations may affect the results of their investigations.

Where to make the observations. For instance, will the observations be made in a quiet room or in a busy shopping mall?

How to observe. Use an existing questionnaire or develop a new survey instrument. For instance, researchers might build or adopt existing interviews, questionnaires, personality scales, etc., to use in making observations.

The observations that researchers make result in data. The data might be the brands participants plan to purchase, or the data might be respondent scores on a scale that measures preference. In this context, variables are things that we measure, control, or manipulate in research. The participants (respondents) with the variables represent our data. Think of the data file as a spreadsheet in Excel with each respondent represented by a row of data and each variable represented by a column.
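As a minimal sketch of that idea (hypothetical survey responses, assuming the pandas library is available), a data file holds one row per respondent and one column per variable:

import pandas as pd

# Hypothetical responses: each row is a respondent, each column a variable.
data = pd.DataFrame({
    "respondent": [1, 2, 3],
    "brand_planned": ["A", "B", "A"],   # a nominal variable
    "preference_score": [7, 4, 6],      # a score on a preference scale
})
print(data)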
General Characteristics of Scientific Inquiry:

1. Use objectivity – freedom from bias
2. Obtain total evidence – all relevant facts
3. Seek general patterns – laws of nature
4. Use theories involving general laws – understanding and prediction
5. Require empirical verification – predicted results
6. Disclose all methods and assumptions
7. State proficiency of methods – degree of success
Objective of Science is to Predict with Understanding:

Deductive

• All S is G & W
• X is S
• Therefore, X is G & W

Closely reasoned: if (a) and (b) are true, (c) must be true. This logic is used in making mathematical proofs.

Inductive reasoning, by contrast, leads only to a probable conclusion: X is almost certain to be G & W. X can be explained relative to several sets of empirical data. Thus, one must either include total evidence or have only a "potential explanation." Most applications of statistical reasoning in business situations are only potential explanations. It is generally impossible or too expensive to obtain total evidence.

Objectives of Scientific Procedure:

3. To specify conditions wherein the relational structure will adequately represent the real situation, or can be related to it.

4. To use the relational structure to predict with understanding or test the observed outcome.
To maximize the benefit of research, those who use it need to understand the research process and its limitations. One of the most difficult aspects of business decision-making is clearly defining the problem. The decision problem faced by management must be translated into a research problem in the form of questions that define the information that is required to make the decision and how this information can be obtained. For example, a decision problem may be whether to launch a new product. The corresponding research problem might be to assess whether the market would accept the new product.

The objective of the research should be defined clearly. To ensure that the true decision problem is addressed, it is useful for the researcher to outline possible scenarios of the research results and then for the decision maker to formulate plans of action under each scenario.

Information can be useful, but what determines its real value to the organization? In general, the value of information is determined by:

1. The validity and reliability of the information.
2. The level of indecisiveness that would exist without the information.
3. The cost of the information in terms of time and money.
4. The ability and willingness to act on the information.

To ascertain the probability that we are drawing a correct conclusion, and thereby provide accurate information, we need to start at the beginning. We must answer several basic questions about the data we are using and the techniques we have employed to analyze the data and generate information. These areas of inquiry lead us to the organization of this text.

First, we will review the general concept of statistics. This will include:

• General concepts and assumptions.
• Levels of measurement.

Once we have reviewed the fundamental concepts we will focus on obtaining information from data. We will discuss the pros and cons of various sampling procedures.

The second section, descriptive statistics, is dedicated to describing the data we have collected, including:

• Frequencies and proportions.
• Graphical representations.
• Mean, median and mode.
• Range, variance, and standard deviation.
• Measures of association.
• Tests of normality.

The third section of the text outlines statistical procedures used to draw inferences about the population we are studying, including:

• Hypothesis testing, including Type I and Type II error.
• Confidence intervals and the normal distribution.
• Testing against a hypothesized value.
The fifth section focuses on the development of linear equations to help in making
better decisions and prediction. We initially develop basic bi-variate equations
and then expand on our foundation to develop multiple regression models.
• Linear regression.
• Multivariate regression.
The final section of the text revisits the question “How do we know?” by posing
the question, “What do we know?”. The section provides a review of key sections
of a research report coupled with a summary of the techniques and procedures
covered. A general guide on when to use each technique is provided as a
reference.
Statistics is about this whole process used to answer questions and make
decisions. Effective decision-making involves correctly designing studies,
collecting unbiased data, describing the data with numbers and graphs, analyzing
the data to draw inferences and reaching conclusions based on the transformation
of data into information. The next section outlines the several elements in
research and statistics. It is imperative that the research process and
methodologies be clearly articulated in the research report to aid the decision
maker in understanding the findings and recommendations.
Section 2

1. Value of Information
2. Research Process
3. Research Report

Managers need information in order to introduce products and services that create value in the mind of the customer. But the perception of value is a subjective one, and what customers value this year may be quite different from what they value next year. As such, the attributes that create value cannot simply be deduced from common knowledge. Rather, data must be collected and analyzed. The goal of research is to provide the facts and direction that managers need to make their more important management decisions.

To maximize the benefit of research, those who use it need to understand the research process and its limitations.

Information can be useful, but what determines its real value to the organization? In general, the value of information is determined by:
• The validity and reliability of the information.
• The level of indecisiveness that would exist without the information.
• The cost of the information in terms of time and money.
• The ability and willingness to act on the information.

The Research Process

Once the need for research has been established, most research projects involve these steps:

1. Define the problem
2. Determine research design
3. Identify data types and sources
4. Design data collection forms and questionnaires
5. Determine sample plan and size
6. Collect the data
7. Analyze and interpret the data
8. Prepare the research report
Problem Definition

The decision problem faced by management must be translated into a research problem in the form of questions that define the information that is required to make the decision and how this information can be obtained. Thus, the decision problem is translated into a research problem. For example, a decision problem may be whether to launch a new product. The corresponding research problem might be to assess whether the market would accept the new product.

The objective of the research should be defined clearly. To ensure that the true decision problem is addressed, it is useful for the researcher to outline possible scenarios of the research results and then for the decision maker to formulate plans of action under each scenario. The use of such scenarios can ensure that the purpose of the research is agreed upon before it commences.

Research Design

Research can be classified in one of three categories. These classifications are made according to the objective of the research. In some cases the research will fall into one of these categories, but in other cases different phases of the same research project will fall into different categories.

Exploratory research has the goal of formulating problems more precisely, clarifying concepts, gathering explanations, gaining insight, eliminating impractical ideas, and forming hypotheses. Exploratory research can be performed using a literature search, surveying certain people about their experiences, focus groups, and case studies. When surveying people, exploratory research studies would not try to acquire a representative sample, but rather seek to interview those who are knowledgeable and who might be able to provide insight concerning the relationship among variables. Case studies can include contrasting situations or benchmarking against an organization known for its excellence. Exploratory research may develop hypotheses, but it does not seek to test them. Exploratory research is characterized by its flexibility.

Descriptive research is more rigid than exploratory research and seeks to describe users of a product, determine the proportion of the population that uses a product, or predict future demand for a product. As opposed to exploratory research, descriptive research should define questions, people surveyed, and the method of analysis prior to beginning data collection.
In other words, the who, what, where, when, why, and how aspects of the research should be defined. Such preparation allows one the opportunity to make any required changes before the costly process of data collection has begun.

There are two basic types of descriptive research: longitudinal studies and cross-sectional studies. Longitudinal studies are time series analyses that make repeated measurements of the same individuals, thus allowing one to monitor behavior such as brand-switching. However, longitudinal studies are not necessarily representative, since many people may refuse to participate because of the commitment required.

Secondary data has the advantage of saving time and reducing data gathering costs. The disadvantages are that the data may not fit the problem perfectly and that the accuracy may be more difficult to verify for secondary data than for primary data. Some secondary data is republished by organizations other than the original source. Because errors can occur and important explanations may be missing in republished data, one should obtain secondary data directly from its source. One also should consider who the source is and whether the results may be biased. There are several criteria that one should use to evaluate secondary data, including whether the data is useful in the research study.

Primary research, in contrast, can gather many types of information, for example:

• psychological and lifestyle characteristics
• attitudes and opinions
• awareness and knowledge
• intentions - for example, purchase intentions. While useful, intentions are not a reliable indication of actual future behavior.
• motivation - a person's motives are more stable than his or her behavior, so motive is a better predictor of future behavior than is past behavior.
• behavior
• quality measurements
• performance data
• arrival times

Primary data can be obtained by communication or by observation. Communication involves questioning respondents either verbally or in writing. This method is versatile, since one needs only to ask for the information; however, the response may not be accurate. Direct communication usually is quicker and cheaper than observation. Observation involves the recording of actions and is performed by either a person or some mechanical or electronic device. Observation is less versatile than communication since some attributes of a person may not be readily observable, such as attitudes, awareness, knowledge, intentions, and motivation. Observation also might take longer since observers may have to wait for appropriate events to occur, though observation using scanner data might be quicker and more cost effective. Observation typically is more accurate than communication.

Personal interviews have an interviewer bias that mail-in questionnaires do not have. For example, in a personal interview the respondent's perception of the interviewer may affect the responses.

Questionnaire or Experimental Design

The questionnaire and the experimental design are very important tools for gathering primary data. Poorly constructed questions or a poor design can result in large errors and invalidate the research data, so significant effort should be put into the design. A questionnaire should be tested thoroughly prior to conducting the survey. Scenarios from the experimental design should be evaluated prior to conducting the study.

Measurement Scales

Attributes can be measured on nominal, ordinal, interval, and ratio scales:

• Nominal numbers are simply identifiers, with the only permissible mathematical use being for counting. Example: social security numbers.

• Ordinal scales are used for ranking. The interval between the numbers conveys no meaning. Median and mode calculations can be performed on ordinal numbers. Example: class ranking.

• Interval scales maintain an equal interval between numbers. These scales can be used for ranking and for measuring the interval between two numbers. Since the zero point is arbitrary, ratios cannot be taken between numbers on an interval scale; however, mean, median, and mode are all valid. Example: temperature scale.

• Ratio scales are referenced to an absolute zero value, so ratios between numbers on the scale are meaningful. In addition to mean, median, and mode, geometric averages also are valid. Example: weight.

The scale of measurement has implications for the appropriateness of various statistical techniques and models in data analysis.
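The sketch below (hypothetical values, standard library only) illustrates which summary calculations are appropriate at each level of measurement:

from statistics import mean, median, mode

zip_codes = ["83702", "59801", "83702"]   # nominal: counting and mode only
class_ranks = [1, 2, 3, 4, 50]            # ordinal: median and mode, but not the mean
temperatures_f = [68, 70, 75, 80]         # interval: mean is valid, ratios are not
weights_lb = [12.0, 24.0, 36.0]           # ratio: ratios such as 36/12 = 3 are meaningful

print(mode(zip_codes))                 # most common value
print(median(class_ranks))             # middle rank
print(mean(temperatures_f))            # average temperature
print(weights_lb[2] / weights_lb[0])   # a meaningful ratio on a ratio scale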
Validity and Reliability

The validity of a test is the extent to which differences in scores reflect differences in the measured characteristic. Predictive validity is a measure of the usefulness of a measuring instrument as a predictor. Proof of predictive validity is determined by the correlation between results and actual behavior. Construct validity is the extent to which a measuring instrument measures what it intends to measure.

Reliability is the extent to which a measurement is repeatable with the same results. A measurement may be reliable and not valid. However, if a measurement is valid, then it also is reliable, and if it is not reliable, then it cannot be valid. One way to show reliability is to show stability by repeating the test with the same results.

Sampling Plan

The sampling frame is the pool from which the interviewees are chosen. The telephone book often is used as a sampling frame, but it has some shortcomings. Telephone books exclude those households that do not have telephones and those households with unlisted numbers. Since a certain percentage of the numbers listed in a phone book are out of service, there are many people who have just moved who are not sampled. Such sampling biases can be overcome by using random digit dialing. Mall intercepts represent another sampling frame, though there are many people who do not shop at malls, and those who shop more often will be over-represented unless their answers are weighted in inverse proportion to their frequency of mall shopping.

In designing the research study, one should consider the potential errors. Two sources of errors are random sampling error and non-sampling error. Sampling errors are those due to the fact that there is a non-zero confidence interval around the results because the sample size is less than the population being studied. Non-sampling errors are those caused by faulty coding, untruthful responses, respondent fatigue, and so on.

There is a tradeoff between sample size and cost. The larger the sample size, the smaller the sampling error but the higher the cost. After a certain point the smaller sampling error cannot be justified by the additional cost.

While a larger sample size may reduce sampling error, it actually may increase the total error. There are two reasons for this effect. First, a larger sample size may reduce the ability to follow up on non-responses. Second, even if there is a sufficient number of interviewers for follow-ups, a larger number of interviewers may result in a less uniform interview process.

Data Collection

In addition to sampling error, the actual data collection process will introduce additional errors. These errors are called non-sampling errors. Some non-sampling errors may be intentional on the part of the interviewer, who may introduce a bias by leading the respondent to provide a certain response. The interviewer also may introduce unintentional errors, for example, due to not having a clear understanding of the interview process or due to fatigue.

Respondents also may introduce errors. A respondent may introduce intentional errors by lying or simply by not responding to a question. A respondent may introduce unintentional errors by not understanding the question, guessing, not paying close attention, and being fatigued or distracted.

The research study should be designed to minimize non-sampling errors.

Data Analysis - Preliminary Steps

Before analysis can be performed, raw data must be transformed into the right format. First, it must be edited so that errors can be corrected or omitted. The data must then be coded; this procedure converts the edited raw data into numbers or symbols. A codebook is created to document how the data was coded. Finally, the data is tabulated to count the number of samples falling into various categories.
Simple tabulations count the occurrences of each variable independently of the other variables. Cross tabulations, also known as contingency tables or cross tabs, treat two or more variables simultaneously. However, since the variables are in a two-dimensional table, cross tabbing more than two variables is difficult to visualize, since more than two dimensions would be required. Cross tabulation can be performed for nominal and ordinal variables.

Cross tabulation is the most commonly utilized data analysis method in research. Many studies take the analysis no further than cross tabulation. This technique divides the sample into sub-groups to show how the dependent variable varies from one subgroup to another. A third variable can be introduced to uncover a relationship that initially was not evident.
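A minimal cross tabulation sketch, using hypothetical survey data and assuming the pandas library is available, looks like this:

import pandas as pd

survey = pd.DataFrame({
    "region": ["East", "East", "West", "West", "West", "East"],
    "uses_product": ["Yes", "No", "Yes", "Yes", "No", "Yes"],
})
# Each cell counts the respondents falling into that combination of the two nominal variables.
table = pd.crosstab(survey["region"], survey["uses_product"])
print(table)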
Hypothesis Testing

A basic fact about testing hypotheses is that a hypothesis may be rejected, but a hypothesis never can be unconditionally accepted until all possible evidence is evaluated. In the case of sampled data, the information set cannot be complete. So if a test using such data does not reject a hypothesis, the conclusion is not necessarily that the hypothesis should be accepted.

The null hypothesis in an experiment is the hypothesis that the independent variable has no effect on the dependent variable. The null hypothesis is expressed as H0. This hypothesis is assumed to be true unless proven otherwise. The alternative to the null hypothesis is the hypothesis that the independent variable does have an effect on the dependent variable. This hypothesis is known as the alternative, research, or experimental hypothesis and is expressed as H1. This alternative hypothesis states that the relationship observed between the variables cannot be explained by chance alone.

There are two types of errors in evaluating a hypothesis:

• Type I error: occurs when one rejects the null hypothesis and accepts the alternative, when in fact the null hypothesis is true.

• Type II error: occurs when one accepts the null hypothesis when in fact the null hypothesis is false.

Because their names are not very descriptive, these types of errors sometimes are confused. Some people jokingly define a Type III error to occur when one confuses Type I and Type II. To illustrate the difference, it is useful to consider a trial by jury in which the null hypothesis is that the defendant is innocent. If the jury convicts a truly innocent defendant, a Type I error has occurred. If, on the other hand, the jury declares a truly guilty defendant to be innocent, a Type II error has occurred.

Hypothesis testing involves the following steps:

• Formulate the null and alternative hypotheses.
• Choose the appropriate test.
• Choose a level of significance (alpha) and determine the rejection region.
• Gather the data and calculate the test statistic.
• Determine the probability of the observed value of the test statistic under the null hypothesis, given the sampling distribution that applies to the chosen test.
• Compare the value of the test statistic to the rejection threshold.
• Based on the comparison, reject or do not reject the null hypothesis.
• Make the research conclusion.

In order to analyze whether research results are statistically significant or simply due to chance, a test of statistical significance can be run.
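As one possible illustration of these steps (hypothetical fill-weight data, assuming the scipy library is available), a one-sample t-test against a hypothesized value might look like this:

from scipy import stats

# Step 1: H0: mean fill weight = 16 oz; H1: mean fill weight != 16 oz (hypothetical data).
fill_weights = [15.8, 16.1, 15.9, 16.0, 15.7, 15.9, 16.2, 15.8]

# Steps 2-3: choose the t-test and a level of significance (alpha).
alpha = 0.05

# Steps 4-5: calculate the test statistic and its probability under the null hypothesis.
t_stat, p_value = stats.ttest_1samp(fill_weights, popmean=16.0)

# Steps 6-8: compare to the rejection threshold and state the conclusion.
if p_value < alpha:
    print(f"t = {t_stat:.2f}, p = {p_value:.3f}: reject H0")
else:
    print(f"t = {t_stat:.2f}, p = {p_value:.3f}: do not reject H0")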
Tests of Statistical Significance

Chi-square

The chi-square (χ2) goodness-of-fit test is used to determine whether a set of proportions has specified numerical values. We will review chi-square as a technique and see how it is used to analyze bivariate cross-tabulated data.
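A short sketch of a chi-square test applied to cross-tabulated counts (hypothetical data, assuming the scipy library is available):

from scipy.stats import chi2_contingency

# Rows: region (East, West); columns: uses the product (Yes, No). Hypothetical counts.
observed = [[30, 20],
            [45, 55]]
chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi-square = {chi2:.2f}, p = {p_value:.3f}, df = {dof}")
# A small p-value suggests the two variables are not independent.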
Student's t-Test

Another test of statistical significance is the t-test. Many instances occur in business where it is desirable to test whether the difference between two sample outcomes is statistically significant or is just a chance occurrence due to sampling error. For example, an owner of a fleet of trucks buys two brands of tires and needs to know if they provide equal mileage or equal tread wear. It is possible to test the difference between two means using the t distribution, which is the best test to use when its assumptions can be met.
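For the tire example, a two-sample t-test might be sketched as follows (hypothetical tread-life figures in thousands of miles, assuming scipy is available):

from scipy import stats

brand_a = [42, 45, 41, 44, 43, 46, 40, 44]
brand_b = [39, 41, 40, 42, 38, 41, 40, 39]

t_stat, p_value = stats.ttest_ind(brand_a, brand_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# A small p-value suggests the difference in mean tread life is unlikely to be
# just a chance occurrence due to sampling error.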
ANOVA

Another test of significance is the Analysis of Variance (ANOVA) test. The primary purpose of ANOVA is to test for differences between multiple means. Whereas the t-test can be used to compare two means, ANOVA is needed to compare three or more means. If multiple t-tests were applied, the probability of a Type I error (rejecting a true null hypothesis) increases as the number of comparisons increases.

ANOVA is efficient for analyzing data using relatively few observations and can be used with categorical variables. Regression can perform a similar analysis to that of ANOVA.
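A one-way ANOVA comparing three means can be sketched as follows (hypothetical sales figures by region, assuming scipy is available):

from scipy import stats

region_east = [12, 15, 14, 13, 16]
region_central = [10, 11, 13, 12, 11]
region_west = [17, 16, 18, 15, 17]

f_stat, p_value = stats.f_oneway(region_east, region_central, region_west)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
# A single F-test replaces three pairwise t-tests and avoids inflating the Type I error rate.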
Regression

The general purpose of regression is to learn more about the relationship between an independent or predictor variable and a dependent or criterion variable. Regression procedures are widely used in research. In general, multiple regression allows the researcher to ask (and hopefully answer) the general question "What is the best predictor of Y?" The regression line expresses the best prediction of the dependent variable (Y), given the independent variables (X). However, nature is rarely perfectly predictable, and usually there is substantial variation of the observed points around the fitted regression line.
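A minimal regression sketch (hypothetical advertising and sales figures, assuming the numpy library is available) fits a line predicting Y from X:

import numpy as np

advertising = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # X: advertising spend
sales = np.array([2.1, 3.9, 6.2, 7.8, 10.1])        # Y: sales

slope, intercept = np.polyfit(advertising, sales, deg=1)
print(f"predicted sales = {intercept:.2f} + {slope:.2f} * advertising")
# The observed points scatter around this fitted line; regression gives the best
# linear prediction, not a perfect one.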
Research Report

The format of the research report varies with the needs of the organization. The report often contains the following sections:

• Purpose and background research
• Table of Contents
• Executive summary
• Research objectives
• Methodology
• Results
• Limitations
• Conclusions and recommendations
• Appendices containing copies of the questionnaires, etc.

Research by itself does not arrive at business decisions, nor does it guarantee that the organization will be successful in marketing or manufacturing its products. However, when conducted in a systematic, analytical, and objective manner, research can reduce the uncertainty in the decision-making process and increase the probability and magnitude of success.
Chapter 1
General Statistical Concepts
This chapter offers an overview of general statistics.
This is a review only. The reader is encouraged to
do additional research and readings in specific
areas of interest.
Section 1
Experiment vs Observation
Experiments
An experiment is a study in which treatments are given to see how the participants respond to them.
We all conduct informal experiments in our everyday lives. For example:
• We might try a new dry cleaner (the treatment) to see if our clothes are cleaner (the response)
than when we used our old service.
• A teacher might bring to class a new video (the treatment) to see if students enjoy it (the
response).
In an experiment, the treatments are called the independent variable, and the responses are called the dependent variable. The treatments (independent variables) are administered so researchers can observe possible changes in response (dependent variables).

Clearly, the purpose of experiments is to identify cause-and-effect relationships, in which the independent variable is the possible cause and the dependent variable demonstrates the possible effect. Suppose, for example, that a waiter tries being friendlier one evening and earns more in tips; whether the friendliness caused the increase is not clear. Perhaps, by chance, the evening that the waiter tried being more friendly, he happened to have been more efficient. The possible alternative explanations are almost endless unless an experiment is planned in advance to compare the average tips earned under the more friendly condition with those earned under the less friendly condition.

Observation

An observational study is one in which data is collected on individuals in a way that doesn't affect them. The most common nonexperimental study is the survey. Surveys are questionnaires that are presented to individuals who have been selected from a population of interest. Surveys take on many different forms: paper surveys sent through the mail, Web sites, call-in polls conducted by TV networks, and phone surveys.

A nonexperimental study, or descriptive study, is defined as a study in which observations are made to determine the status of what exists at a given point in time without the administration of treatments. An example is a survey conducted to determine participants' attitudes. In such a study, researchers strive not to change the participants' attitudes. A researcher can obtain solid data on the attitudes held by participants with a proper sample of appropriate questions.

Conducted properly, surveys can provide this kind of information. However, if not conducted properly, surveys can result in bogus or misleading information. Some problems include improper wording of questions. In addition, a survey of attitudes will not gather data on how to change attitudes; to do this, one would need to conduct an experiment. Data from an observational study can only be interpreted in causal terms based on a theory that we have, but correlational data cannot prove causality.
Probability

At the core of statistics is the concept of probability. The following is a brief introduction to this core concept.

All probability distributions can be classified as discrete probability distributions or as continuous probability distributions, depending on whether they define probabilities associated with discrete variables or continuous variables. In the discrete case, one can easily assign a probability to each possible value: when throwing a die, each of the six values 1 to 6 has the probability 1/6. In contrast, when a random variable takes values from a continuum, probabilities are nonzero only if they refer to finite intervals: in quality control one might demand that the probability of a "16 oz" package containing between 15.5 oz and 16.5 oz should be no less than 98%.

A continuous random variable can take a continuous range of values, as opposed to a discrete distribution, where the set of possible values for the random variable is at most countable. For a discrete distribution an event with probability zero is impossible (e.g., rolling 3½ on a standard die is impossible and has probability zero); this is not so in the case of a continuous random variable.

If a variable can take on any value between two specified values, it is called a continuous variable; otherwise, it is called a discrete variable. An example will clarify the difference: suppose we flip a coin and count the number of heads. The number of heads could be any integer value between 0 and plus infinity. However, it could not be just any number between 0 and plus infinity; we could not, for example, get 2.5 heads. Therefore, the number of heads must be a discrete variable.

What is the probability that a card drawn at random from a deck of cards will be an ace? Of the 52 cards in the deck, 4 are aces, so the probability is 4/52. In general, the probability of an event is the number of favorable outcomes divided by the total number of possible outcomes. (This assumes the outcomes are all equally likely.) In this case there are four favorable outcomes: (1) the ace of spades, (2) the ace of hearts, (3) the ace of diamonds, and (4) the ace of clubs. Since each of the 52 cards in the deck represents a possible outcome, there are 52 possible outcomes.

The same principle can be applied to the problem of determining the probability of obtaining different totals from a pair of dice. There are 36 possible outcomes when a pair of dice is thrown.
To calculate the probability that the sum of the two dice will equal 5, count the number of outcomes that sum to 5 and divide by the total number of outcomes (36). Since four of the outcomes have a total of 5 (1,4; 2,3; 3,2; 4,1), the probability of the two dice adding up to 5 is 4/36 = 1/9. In like manner, the probability of obtaining a sum of 12 is computed by dividing the number of favorable outcomes (there is only one) by the total number of outcomes (36). The probability is therefore 1/36.
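Because the outcome table is easy to generate, a short sketch (standard library only) can enumerate the 36 outcomes and confirm these probabilities:

from fractions import Fraction

outcomes = [(d1, d2) for d1 in range(1, 7) for d2 in range(1, 7)]
print(len(outcomes))   # 36 equally likely outcomes

p_sum_5 = Fraction(sum(1 for d1, d2 in outcomes if d1 + d2 == 5), len(outcomes))
p_sum_12 = Fraction(sum(1 for d1, d2 in outcomes if d1 + d2 == 12), len(outcomes))
print(p_sum_5)    # 1/9
print(p_sum_12)   # 1/36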
Conditional Probability

A conditional probability is the probability of an event given that another event has occurred. For example, what is the probability that the total of two dice will be greater than 8 given that the first die is a 6? This can be computed by considering only outcomes for which the first die is a 6 and then determining the proportion of those outcomes that total more than 8. Of the 36 possible outcomes, there are 6 for which the first die is a 6, and of these, there are four that total more than 8 (6,3; 6,4; 6,5; 6,6). The probability of a total greater than 8 given that the first die is 6 is therefore 4/6 = 2/3.

More formally, this probability can be written as:

p(total > 8 | Die 1 = 6) = 2/3

In this equation, the expression to the left of the vertical bar represents the event and the expression to the right of the vertical bar represents the condition. Thus it would be read as "The probability that the total is greater than 8 given that Die 1 is 6 is 2/3." In more abstract form, p(A|B) is the probability of event A given that event B occurred.
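The same enumeration idea handles conditional probability: keep only the outcomes that satisfy the condition and count within them (standard library only):

from fractions import Fraction

outcomes = [(d1, d2) for d1 in range(1, 7) for d2 in range(1, 7)]
given_first_is_6 = [(d1, d2) for d1, d2 in outcomes if d1 == 6]

p = Fraction(sum(1 for d1, d2 in given_first_is_6 if d1 + d2 > 8), len(given_first_is_6))
print(p)   # 2/3, i.e., p(total > 8 | Die 1 = 6)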
Probability of A and B

Independent Events

A and B are two events. If A and B are independent, then the probability that events A and B both occur is:

p(A and B) = p(A) x p(B)

In other words, the probability of A and B both occurring is the product of the probability of A and the probability of B.

What is the probability that a fair coin will come up heads twice in a row? Two events must occur: a head on the first toss and a head on the second toss. Since the probability of each event is 1/2, the probability of both events is 1/2 x 1/2 = 1/4.

Now consider a similar problem: someone draws a card at random out of a deck, replaces it, and then draws another card at random. What is the probability that the first card is the ace of clubs and the second card is a club (any club)? Since there is only one ace of clubs in the deck, the probability of the first event is 1/52.
Since 13/52 = 1/4 of the deck is composed of clubs, the probability of the second event is 1/4. Therefore, the probability of both events is 1/52 x 1/4 = 1/208.

Dependent Events

If A and B are not independent, then the probability of A and B is:

p(A and B) = p(A) x p(B|A)

where p(B|A) is the conditional probability of B given A.

If someone draws a card at random from a deck and then, without replacing the first card, draws a second card, what is the probability that both cards will be aces? Event A is that the first card is an ace. Since 4 of the 52 cards are aces, p(A) = 4/52 = 1/13. Given that the first card is an ace, what is the probability that the second card will be an ace as well? Of the 51 remaining cards, 3 are aces. Therefore, p(B|A) = 3/51 = 1/17, and the probability of A and B is:

1/13 x 1/17 = 1/221
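Both card examples reduce to multiplying the appropriate probabilities, which a short sketch (standard library only) can verify exactly:

from fractions import Fraction

# Independent events (the first card is replaced): ace of clubs, then any club.
p_ace_of_clubs = Fraction(1, 52)
p_club = Fraction(13, 52)
print(p_ace_of_clubs * p_club)                    # 1/208

# Dependent events (no replacement): two aces in a row, using p(B|A) = 3/51.
p_first_ace = Fraction(4, 52)
p_second_ace_given_first = Fraction(3, 51)
print(p_first_ace * p_second_ace_given_first)     # 1/221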
Mutually Exclusive

Two events are mutually exclusive if it is not possible for both of them to occur. For example, if a die is rolled, the event "getting a 1" and the event "getting a 2" are mutually exclusive, since it is not possible for the die to be both a one and a two on the same roll. The occurrence of one event "excludes" the possibility of the other event. If events A and B are mutually exclusive, then the probability of A or B is simply:

p(A or B) = p(A) + p(B)

What is the probability of rolling a die and getting either a 1 or a 6? Since it is impossible to get both a 1 and a 6 on the same roll, these two events are mutually exclusive, so p(1 or 6) = 1/6 + 1/6 = 1/3.

If the events A and B are not mutually exclusive, then

p(A or B) = p(A) + p(B) - p(A and B)

The logic behind this formula is that when p(A) and p(B) are added, the occasions on which A and B both occur are counted twice. To adjust for this, p(A and B) is subtracted.

What is the probability that a card selected from a deck will be either an ace or a spade? The relevant probabilities are:

p(ace) = 4/52

p(spade) = 13/52

The only way in which an ace and a spade can both be drawn is to draw the ace of spades. There is only one ace of spades, so:

p(ace and spade) = 1/52

The probability of an ace or a spade can be computed as:

p(ace) + p(spade) - p(ace and spade) = 4/52 + 13/52 - 1/52 = 16/52 = 4/13

Consider the probability of rolling a die twice and getting a 6 on at least one of the rolls. The events are defined in the following way:

Event A: 6 on the first roll: p(A) = 1/6

Event B: 6 on the second roll: p(B) = 1/6

p(A and B) = 1/6 x 1/6

p(A or B) = 1/6 + 1/6 - 1/6 x 1/6 = 11/36
Statistical models rest on a set of assumptions. The implications relative to each assumption are discussed along with the procedures and techniques employed in statistical modeling:
1. level of measure
2. random sampling
4. equal variance
5. independent samples
6. number of samples
The computational procedures associated with several tests are discussed in later
sections.
Section 2
Population vs Sample
No matter how a sample is drawn, it is always possible that the statistics obtained by studying the
sample do not accurately reflect the population parameters that would have been obtained if the entire
population had been studied. In fact, researchers almost always expect some amount of error as a
result of sampling. If sampling creates errors, why do researchers sample? First, for economic and
physical reasons it is not always possible to study an entire population. Second, with proper sampling,
highly reliable results can be obtained. Furthermore, with proper sampling, the amount of error to allow
for in the interpretation of the resulting data can be estimated with inferential statistics, which are
covered in this book. It is the role of the decision maker to design the research study to yield results
within an acceptable level of error.
Freedom from bias is the most important characteristic of a good sample. A bias exists whenever some
members of a population have a greater chance of being selected for inclusion in a sample than other
members of the population. Here are some examples of biased samples:
• A professor wishes to study the attitudes of all sophomores at a college (the population) but asks only those enrolled in an introductory class (the sample) to participate in the study. Note that only those in the class have a chance of being selected; other sophomores have no chance.

• An individual wants to predict the results of a statewide election (the population) but asks the intentions of only voters whom he encounters in a large shopping mall (the sample). Note that only those in the mall have a chance of being selected; other voters have no chance.

• A magazine editor wants to determine the opinions of all rifle owners (the population) on a gun-control measure but mails questionnaires only to those who subscribe to the magazine (the sample). Note that only magazine subscribers have a chance to respond; other rifle owners have no chance.

In these three examples, samples of convenience were used, increasing the odds that some members of a population will be selected while reducing the odds that other members will be selected. Any studies that use this type of sampling are suspect and should be looked upon with skepticism. In addition to the obvious bias in the examples, there is an additional problem. Even those who do have a chance of being included in the samples may refuse to participate. This problem is often referred to as the problem of volunteerism (also called self-selection bias). Volunteerism is presumed to create an additional source of bias because those who decide not to participate have no chance of being included. Furthermore, many studies comparing participants (i.e., volunteers) with non-participants suggest that participants tend to be more highly educated and tend to come from higher socioeconomic status (SES) groups than their counterparts. Efforts to reduce the effects of volunteerism include offering rewards, stressing to potential participants the importance of the study, and making it easy for individuals to respond.

To eliminate bias in the selection of individuals for a study, some type of random sampling is needed. A classic type of random sampling is simple random sampling, which gives each member of a population an equal chance of being selected. After the sample has been selected, efforts must be made to encourage all those selected to participate. If some refuse, as often happens, a biased sample is obtained even though all members of the population had an equal chance to have their names selected.

Suppose that a researcher is fortunate and obtains the cooperation of everyone selected. The researcher has obtained an unbiased sample. Can the researcher be certain that the results obtained from the sample accurately reflect the results that would have been obtained by studying the entire population? Definitely not; the possibility of random errors still exists. Random errors created by random selection are called sampling errors. At random (i.e., by chance), the researcher may have selected a disproportionately large number of Democrats, males, low SES group members, and so on. Such errors make the sample unrepresentative and therefore may lead to incorrect results.

If both biased and unbiased sampling are subject to error, why do researchers prefer unbiased random sampling? They prefer it for two reasons: (1) inferential statistics enable researchers to estimate the amount of error to allow for when analyzing the results from unbiased samples, and (2) the amount of sampling error obtained from unbiased samples tends to be small when large samples are used.

While using large samples helps to limit the amount of random error, it is important to note that selecting a large sample does not correct for errors due to bias. If an individual who is trying to predict the results of an election is very persistent and spends weeks at the shopping mall asking shoppers how they intend to vote, the individual will obtain a very large sample of people who may differ from the population of voters in various ways, such as being more affluent, having more time to spend shopping, being better educated, and so on. Thus, increasing the size of a biased sample does not reduce the amount of error due to bias.
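A small simulation (hypothetical population, standard library only) shows sampling error at work: unbiased random samples still vary around the true population value, and larger samples vary less, although no sample size repairs a biased selection procedure:

import random

random.seed(1)
population = ["Democrat"] * 500 + ["Republican"] * 500   # true proportion of Democrats = 0.50

for n in (25, 400):
    proportions = []
    for _ in range(5):
        sample = random.sample(population, n)            # unbiased simple random sample
        proportions.append(sample.count("Democrat") / n)
    print(n, [round(p, 2) for p in proportions])
# The larger samples cluster more tightly around 0.50; a biased frame (e.g., only
# mall shoppers) would not be fixed by drawing more observations from it.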
Yet there are many situations in which researchers have no choice but to use biased samples. For instance, for ethical and legal reasons, much medical research is conducted using volunteers who are willing to risk taking a new medication or undergoing a new surgical procedure. If promising results are obtained in initial studies, larger studies with better (but usually still biased) samples are undertaken. At some point, despite the possible role of bias, decisions such as Food and Drug Administration approval of a new drug need to be made on the basis of data obtained with biased samples. Little progress would be made in most fields if the results of all studies with biased samples were summarily dismissed.

It is important to note that the statistical remedies for errors due to biased samples are extremely limited. When biased samples are used, the results of statistical analyses of the data should be viewed with caution.

Random Sample and Sampling Procedure

The basic assumption in every application of statistical inference is that the sample is a random sample. The term "random sample" actually refers to a procedure by which the sample is drawn; the resulting sample is also called a random sample.

A random sample from a finite population is a sample that has been selected by a procedure with the following properties:

• The procedure assigns a known probability to each element in the population.

• If a given element has been selected, then the probability of selecting the remaining items is uniformly affected. This means that the selection of one item does not affect the selection of any other particular items; they are in no way "tied together."

Stated differently, this means:

• Events are independent.

• Underlying probabilities remain unchanged in drawing the sample.

The second condition cannot be strictly true when sampling from a finite population, because the selection of one item increases the probability of the selection of any remaining items (relative to the first item chosen). This creates no great problem as long as the sample is not a large percentage of the population. In that case, the probabilities will have been altered sufficiently to need corrective measures. The finite correction factor provides the necessary correction.

Whenever the sample from a finite population equals or exceeds 10% of the total population, i.e., n ≥ 10% of N, the following correction factor is used to compensate for the inherent changes in the underlying probabilities during the sampling process:

√((N - n) / (N - 1))

The estimate of the standard deviation of the sampling distribution (the standard error of the mean) then becomes

(σ / √n) x √((N - n) / (N - 1))

where N = number in population and n = number in sample.

The effect of the finite correction factor is to decrease the standard error by only a small amount unless a very large percentage of the population is included in the sample. When n = N, the correction factor is reduced to 0. Recall that when we are able to collect data from everyone in the population we have a census, and inferential statistics do not apply. In this case (sample n = population N) the measure of standard deviation is the true population parameter.
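A sketch of the correction in use (hypothetical numbers, standard library only, using the formula as reconstructed above):

import math

def standard_error(sigma, n, N):
    fpc = math.sqrt((N - n) / (N - 1))      # finite population correction factor
    return (sigma / math.sqrt(n)) * fpc

print(standard_error(sigma=10, n=50, N=5000))    # sample is 1% of the population: tiny correction
print(standard_error(sigma=10, n=500, N=5000))   # sample is 10% of the population: larger correction
print(standard_error(sigma=10, n=5000, N=5000))  # n = N (a census): the standard error is 0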
Distinction Between Random and Simple Random Sample:

Random Sample

• Probability of selecting each element is known but not necessarily equal.

• Events are independent.

• Underlying probabilities remain unchanged in the sampling process.

Simple Random Sample

• Probability of selecting each element is known and equal.

• Events are independent.

• Underlying probabilities remain unchanged in the sampling process.

Procedure for Drawing a Random Sample:

Identify the population you plan to study. You then must decide on the most appropriate way to draw a representative sample from the population. If the population consists of individuals, what kind of individuals: what income, geographic area, etc.? At times it is sufficient to draw a simple random sample. In other instances you may choose to stratify or cluster the population to get your sample.

For example, suppose I want to draw a random sample of machine owners (construction equipment) within a dealer territory. The population definition must be sharpened: what is a "machine owner?" Does it include owners of certain brands, non-owners of certain brands, particular machine usage or production, certain markets, leased equipment, or not?

The definition of the population must be determined by the purpose of the research. In a recent study of the feasibility of locating a mini branch store, it was determined that the relevant machine owner population included any project that used machines within a ten-mile radius of the proposed location, offices of firms that owned machines that might be presently located elsewhere, and all brands of machines.

Are there significant differences among definable groups in the population to justify stratified sampling as opposed to simple random sampling? In surveying the relevant machine owner population, a simple random sample (by chance) may not reflect proportionately the needs of owners with one or two machines and those owning more than two. If the product mix in the store needs to be significantly different to meet the needs of different segments of the population, a stratified sample (later weighted according to the relative size of each stratum) would likely produce a more accurate measure of potential demand than a simple random sample of the same size.
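The difference between a simple random sample and a stratified sample can be sketched with a hypothetical owner list (standard library only):

import random

random.seed(7)
owners = [("small", f"owner{i}") for i in range(90)] + \
         [("large", f"owner{i}") for i in range(90, 100)]

# Simple random sample of 10: every owner has an equal chance, so by chance the
# small "large owner" stratum may be missed entirely.
srs = random.sample(owners, 10)

# Stratified sample of 10: draw within each stratum in proportion to its size (9 and 1 here).
small = [o for o in owners if o[0] == "small"]
large = [o for o in owners if o[0] == "large"]
stratified = random.sample(small, 9) + random.sample(large, 1)

print(sum(1 for stratum, _ in srs if stratum == "large"), "large owners in the simple random sample")
print(sum(1 for stratum, _ in stratified if stratum == "large"), "large owner in the stratified sample")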
Once the population and strata have been appropriately defined, a list of members (in each stratum) should be prepared, if possible. Then, selection from the list (in each stratum) should be done using a random number table to select the sample members.

If a single list cannot be prepared, sampling may be done in two or more stages. For example, the first list (stage) might be of UCC1 filings, the second of brands, and the third of dealers. A telephone book might be sampled by listing the numbers of pages and then the number of names on each page. The first random number chooses the page; a second random number chooses the person on the page.

Other approaches to sampling include clustering and stratification. If a complete enumeration of the population is not possible, "cluster" sampling may be needed. In this case only blocks of a city or areas are listed and then selected by means of a random number table. All members in the block or area selected are interviewed as a "cluster".

Once you have identified the population you plan to study, you will choose the most efficient method to draw a representative random sample. Your choices include a simple random sample, a systematic sample, stratification and clustering. The appropriate method is dependent upon the definition of the population of interest.

It's clear that asking one person in a telephone poll isn't enough to get an idea about the general views in the United States. What isn't quite as clear is the question of how many people we should pick for the sample to get a representative sample. Remember that the bigger the sample, the more it will cost to interview or to study. This means that statisticians must weigh the need for a larger sample against the costs of acquiring one, always remembering that one of the costs of a too-small sample is that it will have a greater chance of being unrepresentative. More generally, samples in statistics must be of a certain size to be meaningful representations of populations. How large depends partly on the population we are looking at. If it's very diverse, we need a larger sample to capture that diversity. The size of the sample also depends on the precision we are seeking and on the kinds of questions we are asking.
Section 3
Descriptive or Inferential
Descriptive statistics are used to describe a set of data. Descriptions are in the form of a midpoint
(mean, median, mode), a dispersion (range, variance, quartiles), and a shape (normal, skewed,
rectangular). For instance, suppose you have the scores on a standardized test for 500 participants.
One way to summarize the data is to calculate an average score, which indicates how the typical
individual scored. You might also determine the range of scores from the highest to the lowest score,
which would indicate how much the scores vary. The only major assumption underlying descriptive
models is the Level of Measure (or scale) used to represent the data.
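As a small illustration (hypothetical test scores, standard library only), the basic descriptive summaries can be computed directly:

from statistics import mean, median, mode, pstdev

scores = [72, 85, 91, 66, 85, 78, 94, 70, 85, 88]

print(mean(scores))               # midpoint: the average score
print(median(scores))             # midpoint: the middle score
print(mode(scores))               # midpoint: the most frequent score
print(max(scores) - min(scores))  # dispersion: the range
print(pstdev(scores))             # dispersion: the standard deviation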
Correlational statistics are a special subgroup of descriptive statistics, which are described separately.
The purpose of correlational statistics is to describe the relationship between two or more variables for
one group of participants. For instance, suppose a researcher is interested in the predictive validity of a
college admissions test. The researcher could collect the admissions scores and the freshman GPAs for
a group of college students. To determine the validity of the test for predicting GPAs, a statistic known
as a correlation coefficient could be computed. Correlation coefficients range in value from -1.00 to +1.00, where 0.00 indicates no correlation between the variables and a coefficient of -1.00 or +1.00 indicates a perfect (negative or positive) correlation.
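For the admissions-test example, a correlation coefficient might be computed as follows (hypothetical scores, assuming the numpy library is available):

import numpy as np

admission_scores = [520, 580, 610, 640, 700, 720]
freshman_gpa = [2.6, 2.9, 3.1, 3.0, 3.5, 3.6]

r = np.corrcoef(admission_scores, freshman_gpa)[0, 1]
print(round(r, 2))   # a value near +1 would support the test's predictive validity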
Measurement can be defined as the assignment of numerals to objects or events according to rules.
There are four basic levels of measure: nominal, ordinal, interval, and ratio. The importance of the level
of measure is realized in the operations used to produce the scale and in the mathematical operations that are permissible with each level. The mathematical operations possible, and examples of each level, are detailed in the table at the end of the section.

Note that the only mathematical operation possible with a nominal level of measure is equivalence, that is, = or ≠. With an ordinal (ranking) level, the additional operations of > and < are added to = and ≠. Thus, with ordinal data, the middle of a distribution can be determined with the median. This cannot be done at the nominal level. With an ordinal level of measure, however, we cannot compute the mean because the operations (of addition and division) are not possible. If we had interval or ratio level data we could compute the mean and median as well. We could also compute the variance.

For most statistical models, a distinction needs to be made among only three levels: nominal, ordinal, and interval or ratio. The level of measure is a necessary condition for all statistical models, descriptive or inferential. It is very important in distinguishing among different models involving statistical inference, as discussed in the next section.

Inference Models

Inferential statistics are tools that tell us how much confidence we can have when generalizing from a sample to a population. Consider national opinion polls in which carefully drawn samples of only about 1,500 adults are used to estimate the opinions of the entire adult population of the United States. The pollster first calculates descriptive statistics, such as the percentage of respondents who are in favor of capital punishment and the percentage who are opposed.

Most statistical models of interest in business problem solving involve making an inference concerning a characteristic of a population from data in a sample. All such models are based on probability theory and therefore require that a random sample be used in making such inferences. A discussion of the random sampling process was included in earlier sections and will be revisited in subsequent sections when the requirements for specific techniques are addressed.

Having sampled, a researcher knows that the results may not be accurate because the sample may not be representative. In fact, the researcher knows that there is a high probability that the results are off by at least a small amount. This is why researchers often mention a margin of error, which is an inferential statistic. It is reported as a warning to readers of research that random sampling may have produced errors, which should be considered when interpreting results. For instance, a weekly news magazine recently reported that 52% of the respondents in a national poll believed that the economy was improving. A footnote in the report indicated that the margin of error was ±2.3. This means that the researcher was confident that the true percentage for the whole population was within 2.3 percentage points of 52% (i.e., 49.7% to 54.3%).
For most statistical models, distinction needs to be made among only three levels:
nominal, ordinal, and interval or ratio. The level of measure is a necessary You may recall that a population is any group in which a researcher is interested. It
condition for all statistical models, descriptive or inferential. It is very important in may be large, such as all adults age 18 and over who reside in the United States,
distinguishing among different models involving statistical inference as discussed or it might be small, such as all employees of a specific company. A study in
in the next section. which all members of a population are included is called a census. A census is
often feasible and desirable when studying small populations (e.g., an algebra
Inference Models teacher may choose to pretest all students at the beginning of a course). When a
population is large, it is more economical to study only a sample of the population.
Inferential statistics are tools that tell us how much confidence we can have when
With modern sampling techniques, highly accurate information can be obtained
generalizing from a sample to a population. Consider national opinion polls in
using relatively small samples.
which carefully drawn samples of only about 1,500 adults are used to estimate the
opinions of the entire adult population of the United States. The pollster first Inferential statistics are not needed when analyzing the results of a census
calculates descriptive statistics, such as the percentage of respondents who are in because there is no sampling error. The use of inferential statistics for evaluating
favor of capital punishment and the percentage who are opposed. results when sampling is covered in chapter 3.
Most statistical models of interest in business problem solving involve making an Shape of Population Distribution Assumption:
inference concerning a characteristic of a population from data in a sample. All
such models are based on probability theory and therefore require that random Among the models used for statistical inference are two general groups:
sample be used in making such inferences. A discussion of the random sampling Parametric and Non-Parametric. The parametric models, such as Z or t, and
Pearson product moment correlation require that the sample be drawn from a
28
population with prescribed "shape", usually a normal curve. The non-parametric models such as the Spearman rank correlation and the Mann-Whitney U test are termed "distribution-free" models because they do not require any prescribed population shape.

Some parametric models also carry the assumption that the variance of x is the same for all values of y and that the variance of y is the same for all values of x. This property is called homoscedasticity and is described later in the text. (Non-parametric models do not require any assumption concerning variances.)

Independent Samples Assumption:

Numbers of Samples:

The particular assumptions and requirements for various models are presented in the sections that describe the models. A summary of decision rules concerning the choice of statistical model is given in the section of the text entitled "Decision Rules for Choosing Among Statistical Models."

Non-Parametric Statistics

Non-parametric models are often referred to as distribution free statistical models. As we progress to testing hypotheses and drawing inferences we will be reviewing several distribution free procedures. Interested readers who deal with very small sample sizes or ordinal levels of measure are encouraged to review the multitude of distribution free modeling options available.
Section 4
Level of Measurement
For most statistical models, distinction needs to be made among only three measurement levels:
nominal, ordinal, and interval or ratio. The level of measure is a necessary condition for all statistical
techniques, descriptive or inferential. It is very important in distinguishing among different techniques
involving statistical inference.
The lowest level of measurement is nominal (also known as categorical). It is helpful to think of this level as the naming level because names (i.e., words) are used instead of numbers. Examples include political affiliation (Democrat or Republican), religious affiliation, and gender.
Notice that the categories named do not put the participants in any particular order. There is no basis on which we could all agree for saying that Democrats are either higher or lower than Republicans. The same is true for religious affiliation or gender. Note that the only mathematical operation possible with the nominal level of measure is equivalence, that is = or ≠.

The next level of measurement is ordinal. Ordinal measurement puts participants in rank order from high to low, but it does not indicate how much higher or lower one participant is in relation to another. To understand this level, consider these examples:

• Participants are ranked according to their height; the tallest participant is given a rank of 1, the next tallest is given a rank of 2, and so on.
• College students report their class rank in terms of freshman, sophomore, junior or senior.

The next two levels, interval and ratio, tell us by how much participants differ. For example, notice that if one participant is 5'6" tall and another is 5'8" tall, we know not only the order of the participants, but we also know by how much the participants differ from each other (i.e., two inches). Both interval and ratio scales have equal intervals. For instance, the difference between three inches and four inches is the same as the difference between five inches and six inches.

In most statistical analyses, interval and ratio measurements are analyzed in the same way. However, there is a difference between these two levels. An interval scale does not have an absolute zero. For instance, if we measure intelligence, we do not know exactly what constitutes absolutely zero intelligence and thus cannot measure the zero point. In contrast, a ratio scale has an absolute zero point on its scale. For instance, we know where the zero point is when we measure height.
Levels of Measure
Chapter 2
Descriptive Statistics
How do you use this average gas price (a statistic)? Let’s say that the next day you need to purchase
gas for the car. You are driving past a gas station where the price of gas is posted as $ .10 below what
you believe the average price of gas to be (based on recent observations). Do you stop and get gas at
this station? Maybe.
Why maybe? This brings us to that nagging thing called probability. To help understand probability
think of how different your reaction to a price drop would be if prices had been the same for the past 6
months versus fluctuating prices. For this example, let's assume you have been observing fluctuations in gas prices. Some days the price is rising, other days you observe the price is dropping. How
confident are you that this $ .10 difference is below what your regular gas station will be charging? Will
the price be $ .15 lower at the next station?
In addition to the average price of gas we have calculated several descriptive statistics regarding the
variability in price. We have an estimate for the range in prices we expect from a low to high price and
we have a sense of how much variation we have observed. These descriptive statistics help us in making decisions.

This chapter is dedicated to describing the data we have collected, including:

Section 1 - Descriptives
• Frequencies and proportions.
• Tables and cross tabulations.
• Plots and graphical depictions.

Section 2 - Distribution
• Shapes of distributions

Section 3 - Location

Section 4 - Spread

Section 5 - Association
• Correlation.

Section 6 - Normality
• Tests of normality

We will focus on drawing conclusions from the information we have drawn out of the data in later sections of the text.

Methods for Summarizing Data

Tables and Crosstabulations

When variables are categorical, frequency tables (crosstabulations) provide useful summaries. For a report, you may need only the number or percentage of cases falling in specified categories or cross-classifications. At times, you may require a test of independence or a measure of association between two categorical variables.

Statistical procedures are designed to make, analyze, and save frequency tables that are formed by categorical variables (or table factors). The values of the factors can be character or numeric. Both procedures form tables using data read from a cases-by-variables rectangular file or recorded as frequencies (for example, from a table in a report) with cell indices. You can request percentages of row totals.

• Multiway: Frequency counts, percentages, tests and measures of association tabulated for a series of two-way tables and standardized tables stratified by all combinations of values of a third, fourth, etc., table factor.

There are many formats for displaying tabular data. Let us examine several basic layouts for counts and percentages.
DATA FOR ANALYSIS I
If we enter the data into SYSTAT we can request a basic frequency table for just current lease holders by using the data select function. Click on Data and then Select Cases.

In this example we have data from a survey of current lease holders and their perception of repair cost at a dealership. The lease variable is coded as 1 for current lease holders and 2 for non-lease owners. Repair cost is coded as 1 = expensive, 2 = high, and 3 = average.
Once you click on Select Cases the following window will open. You will click on LEASE to have it entered as a selection criterion and set the operator to = 1 (since 1 represents current lease holders in the data file). Click OK and you are ready to run the basic one-way analysis of the data.

You will click on Analysis, Tables and One-Way to open the window for selecting the repair cost variable for analysis.
When you click on One-Way the window for defining the tables will open. Select repair cost as the variable of interest and click on counts and percents.

Once you click OK the output window will open and you will see the following tables from your one-way frequency analysis of repair cost for current lease holders.

From the analysis we see that half of the current lease holders view the dealer repair cost to be expensive and 42% view the cost as average.
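For readers who want to mirror this walkthrough outside SYSTAT, here is a minimal sketch in Python with pandas. The survey values below are hypothetical (they are not the data set used in the text) and were chosen only to roughly reproduce the percentages reported above.

    import pandas as pd

    # Hypothetical survey records following the coding scheme described above:
    # LEASE: 1 = current lease holder, 2 = non-lease owner
    # REPAIR: 1 = expensive, 2 = high, 3 = average
    data = pd.DataFrame({
        "LEASE":  [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
        "REPAIR": [1, 1, 1, 1, 1, 1, 3, 3, 3, 3, 3, 2, 1, 1, 1, 1, 2, 3],
    })

    # "Select Cases": keep only the current lease holders (LEASE = 1)
    lease_holders = data[data["LEASE"] == 1]

    # One-way frequency table for repair cost: counts and percents
    counts = lease_holders["REPAIR"].value_counts().sort_index()
    percents = (counts / counts.sum() * 100).round(1)
    print(pd.DataFrame({"Count": counts, "Percent": percents}))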
Two-Way Frequency Table
To start the analysis we will select two-way tables (Lease by Repair Cost).
We will have Repair Cost as the column and Lease as the Row variable. Click on Counts and Percents for the output. When you are finished click OK.
This will generate the analysis and open the output window.
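A comparable two-way table can be sketched with pandas.crosstab; the records below are again hypothetical stand-ins for the survey data, not the actual file used in the text.

    import pandas as pd

    # Hypothetical records coded as in the text (LEASE: 1 = lease holder, 2 = non-lease owner;
    # REPAIR: 1 = expensive, 2 = high, 3 = average)
    df = pd.DataFrame({
        "LEASE":  [1] * 12 + [2] * 6,
        "REPAIR": [1, 1, 1, 1, 1, 1, 3, 3, 3, 3, 3, 2, 1, 1, 1, 1, 2, 3],
    })

    # Two-way table with Lease as the row variable and Repair Cost as the column variable
    print(pd.crosstab(df["LEASE"], df["REPAIR"]))

    # Row percentages, the "percents" requested in the SYSTAT dialog
    print((pd.crosstab(df["LEASE"], df["REPAIR"], normalize="index") * 100).round(1))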
From the analysis it does appear that owners without a lease view the dealership as an expensive repair option at a higher rate than the current lease holders. We can extend our analysis by using multiway tables when we wish to examine the interaction between more than two variables.

Graphical Representations

Under the Graph function you can select Summary Charts including Bar, Dot, Line, and Pie charts. Clicking on Density Displays you can generate Histograms, Box Plots, Dot Density and Density Functions.
Additional graphical representations are provided for multivariate data.
Clicking on Plots opens a menu for selecting Scatterplots and Probability Plots.
The other option available under the Graphics routine is to open the Graph Gallery and select the style of representation you wish to use by clicking from the visual menu of types.

Example: Pie Chart from Repair Cost Data

Using the data from our two-way table analysis we can generate a pie chart to get a view of how lease holders and non-lease holders view dealership repair cost. As we found in the two-way table analysis, a higher percentage of non-lease holders view the dealership as an expensive repair option (red area of the pie chart).
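A similar pie chart can be drawn with matplotlib; the counts below are hypothetical and serve only to illustrate the chart, not to report the actual survey results.

    import matplotlib.pyplot as plt

    # Hypothetical counts of repair-cost perceptions for the non-lease owners
    labels = ["Expensive", "High", "Average"]
    counts = [4, 1, 1]

    plt.pie(counts, labels=labels, autopct="%1.1f%%")
    plt.title("Perceived dealership repair cost (non-lease owners)")
    plt.show()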
Section 2
Distribution's Shape
The stem-and-leaf plot is useful for assessing distributional shape and identifying outliers. Values that are markedly different from the others in the sample are labeled as outside values; that is, the value is more than 1.5 hspreads outside its hinge (the hspread is the distance between the lower and upper hinges, or quartiles). Under normality, this translates into roughly 2.7 standard deviations from the mean.
• Selected variable(s). A separate stem-and-leaf plot is created for each selected variable.
• Number of lines. You can indicate how many lines (stems) to include in the plot.
The shape of a distribution of a set of scores can be seen by examining a frequency distribution which is a
table that shows how many participants have each score. Consider the frequency distribution in Table 1. The
frequency (i.e., f which is the number of participants) associated with each score (X) is shown. Examination
indicates that most of the participants are near the middle of the distribution (i.e., near a score of 19) and that
the participants are spread out on both sides of the middle with the frequencies tapering
off.
Distribution of Scores

X     f
22    1
21    3
20    4
19    8
18    5
17    2
16    0
15    1
N = 24

Frequency Polygon
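The frequency (f) column above can be reproduced by simply counting the raw scores; a short sketch in Python:

    from collections import Counter

    # Raw scores for the 24 participants, reconstructed from the table above
    # (no participant scored 16, so 16 does not appear in the raw data)
    scores = [22] + [21] * 3 + [20] * 4 + [19] * 8 + [18] * 5 + [17] * 2 + [15]

    freq = Counter(scores)
    for x in sorted(freq, reverse=True):
        print(x, freq[x])
    print("N =", len(scores))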
The shape of a distribution is even clearer when examining a frequency polygon, which is a figure (i.e., a drawing) that shows how many participants have each score. The same data shown in the table are shown in the frequency polygon on the next page. For instance, the frequency distribution shows that 3 participants had a score of 21; this same information is displayed in the frequency polygon. The high point in the polygon shows where most of the participants are clustered (in this case, near a score of 19). The tapering off around 19 illustrates how spread out the participants are around the middle.

When there are many participants, the shape of a polygon becomes smoother and is referred to as a curve. The most important shape is that of the normal curve, which is often called the bell-shaped curve. This curve is illustrated below.

The normal curve is important for two reasons. First, it is a shape very often found in nature. For instance, the heights of women in large populations are normally distributed. There are small numbers of very short women, which is why the curve
is low on the left; many women are of about average height, which is why the curve is high in the middle; and there are small numbers of very tall women. Here is another example: The average annual rainfall in Pittsburgh over the past 100 years has been approximately normal. There have been a very small number of years in which there was extremely little rainfall, many years with about average rainfall, and a very small number of years with a great deal of rainfall. Another reason the normal curve is important is that it is used as the basis for a number of inferential statistics, which are covered in this text.

Some distributions are skewed. For instance, if you plot the distribution of income for a large population, in all likelihood you will find that it has a positive skew (i.e., is skewed to the right). Skewed right indicates that there are large numbers of people with relatively low incomes; thus, the curve is high on the left. The curve drops off dramatically to the right, forming a long tail pointing to the right. This long tail is created by the small numbers of individuals with very high incomes. Skewed distributions are named for their long tails. On a number line, positive numbers are to the right; hence, the term positive skew is used to describe a skewed distribution in which there is a long tail pointing to the right (but no long tail pointing to the left).

A distribution skewed to the right (positive skew).

When the long tail is pointing to the left, a distribution is said to have a negative skew (i.e., skewed to the left). A negative skew would be found if a large population of individuals was tested on skills in which they have been thoroughly trained. For instance, if a researcher tested a very large population of recent nursing school graduates on very basic nursing skills, a distribution with a negative skew should emerge. There should be large numbers of graduates with high scores, but there should be a long tail pointing to the left, showing that a small number of nurses, for one reason or another, such as being physically ill on the day the test was administered, did not perform well on the test.

Bimodal distributions have two high points. Such a curve is called bimodal even though the two high points are not exactly equal in height. Such a curve is most likely to emerge when human intervention or a rare event has changed the composition of a population. For instance, if a civil war in a country cost the lives of many young adults, the distribution of age after the war might be bimodal, with a dip in the middle. Bimodal distributions are much less frequently found in research than the other types of curves discussed earlier in this section.
Discrete Probability Distributions

If a random variable is a discrete variable, its probability distribution is called a discrete probability distribution. Suppose you flip a coin two times. This simple statistical experiment can have four possible outcomes: HH, HT, TH, and TT. Now, let the variable X represent the number of Heads that result from this experiment. The variable X can only take on the values 0, 1, or 2, so it is a discrete random variable.

# HEADS    Probability
0          0.25
1          0.50
2          0.25

The table represents a discrete probability distribution because it relates each value of a discrete random variable with its probability of occurrence. With a discrete probability distribution, each possible value of the discrete random variable can be associated with a non-zero probability. Thus, a discrete probability distribution can always be presented in tabular form.

For a continuous random variable, the distribution is instead described by a density function:

• The area bounded by the curve of the density function and the x-axis is equal to 1, when computed over the domain of the variable.
• The probability that a random variable assumes a value between a and b is equal to the area under the density function bounded by a and b.

The shape of a distribution has important implications for determining which average to compute. Graphical displays and specific statistical tests are used to determine the appropriateness of assuming a normal distribution for the data.
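Before moving on, the two-coin distribution tabulated above can be verified by enumerating the four equally likely outcomes; a minimal sketch in Python:

    from itertools import product
    from collections import Counter
    from fractions import Fraction

    # Enumerate the four equally likely outcomes of two coin flips: HH, HT, TH, TT
    outcomes = list(product("HT", repeat=2))

    # X = number of heads in each outcome
    heads = Counter(flips.count("H") for flips in outcomes)

    for x in sorted(heads):
        print(x, Fraction(heads[x], len(outcomes)))   # 0 -> 1/4, 1 -> 1/2, 2 -> 1/4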
Section 3
Location
Before deciding what you want to describe (location, spread, and so on), you should consider what type
of variables are present. Are the values of a variable unordered categories, ordered categories, counts,
or measurements?
For many statistical purposes, counts are treated as measured variables. From the discussion on levels of
measurement we know that such variables are called quantitative if one can do arithmetic on their values.
The mean is the most frequently used average. It is so widely used that it is sometimes simply called the
average. However, the term "average" is ambiguous because several different types of averages are used in
statistics. In this section, the mean will be considered. The mean is a location measure, a measure of central tendency.
Computation of the mean is easy: sum (i.e., add up) the scores and divide by the number of scores. Here is an
example:
Scores: 5, 6, 7, 10, 12, 15
Sum of scores: 55

Number of scores: 6

Computation of mean: 55/6 = 9.166 = 9.17

Notice in the example above that the answer was computed to three decimal places and rounded to two. In research reports, the mean is usually reported to two decimal places.

There are several symbols for the mean. Commonly used symbols for the mean are x̄, M and m. The symbol x̄ is pronounced "X-bar" and is used frequently in statistics textbooks and research reports in business.

The mean is defined as "the balance point in a distribution of scores." Specifically, it is the point around which all the deviations sum to zero.

For example, if the sum of the scores for 5 numbers is 60, dividing this by the number of scores (5) yields a mean of 12.00. By subtracting the mean from each score, the deviations from the mean are obtained. If the first score is 7, the score (7) minus the mean (12) yields a deviation of -5. Thus, for a score of 7, the deviation is -5. The deviations of all 5 scores will sum to zero. (The negatives cancel out the positives when summing, yielding zero.)

If you substitute any other number for the mean and perform the calculations of deviations, you will not get a sum of zero. Only the mean will produce this sum. Thus, saying "the mean equals 12.0" is a shorthand way of saying "the value around which the deviations sum to zero is 12.0."

A major drawback of the mean is that it is drawn in the direction of extreme scores. This is a problem if there are either some extremely high scores that pull the mean up or some extremely low scores that pull it down. The following is an example of the contributions given to charity by two groups of children expressed in cents:

Group A: 1, 1, 2, 3, 3, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 7, 8, 10, 10, 10, 11

Mean for Group A = 5.52

Group B: 1, 2, 2, 3, 3, 3, 4, 4, 5, 5, 5, 6, 6, 6, 6, 6, 9, 10, 10, 150, 200

Mean for Group B = 21.24

Notice that overall the two distributions are quite similar. Yet the mean for Group B is much higher than the mean for Group A because just two students in Group B gave extremely high contributions of 150 cents and 200 cents. If only the means for the two groups were reported without reporting all the individual contributions, it would suggest that the average student in Group B gave about 21 cents when in fact none of the students made a contribution of this amount. Recall from the earlier discussion that a distribution that has some extreme scores at one end but not the other is called a skewed distribution. The mean is almost always inappropriate for describing the average of a highly skewed distribution. Another limitation of the mean is that it is appropriate only for use with interval and ratio scales of measurement.

Median and Mode

How do we describe the center, or central location, of the distribution on a scale? If the data are not normally distributed, with extreme high or low scores, then the mean is not the best way to describe the center of the data. When there are extreme values or outliers present in the data, the arithmetic mean (AM) will be affected by the extreme observations and thus will not be a suitable measure of central tendency. Another measure of location is to pick the value above which one half of the data values fall and, by implication, below which the other half of the data values fall. This measure is called the median. The median is computed based only on the central one or two values and does not depend on the values of other observations.
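Both group means, and the balance-point property of the mean, can be checked with a few lines of Python using the contribution data listed above:

    from statistics import mean

    # Charity contributions (in cents) from the two groups described above
    group_a = [1, 1, 2, 3, 3, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 7, 8, 10, 10, 10, 11]
    group_b = [1, 2, 2, 3, 3, 3, 4, 4, 5, 5, 5, 6, 6, 6, 6, 6, 9, 10, 10, 150, 200]

    print(round(mean(group_a), 2))   # 5.52
    print(round(mean(group_b), 2))   # 21.24, pulled upward by the two extreme gifts

    # The deviations from the mean sum to zero (up to floating-point rounding)
    m = mean(group_a)
    print(sum(x - m for x in group_a))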
The alternative to describing the center of the skewed data with the mean is the median. The median is the value in a distribution that has 50% of the cases above it and 50% of the cases below it. Thus, it is defined as the middle point in a distribution. In the following example there are 11 scores. The middle score, with 50% on each side, is 81, which is the median. Thus, 81 is the value of the median for the set of scores. Note: there are five scores above 81 and five scores below 81.

In the next example there are 6 scores. Because there is an even number of scores, the median is halfway between the two middle scores. To find the halfway point, sum the two middle scores (7 + 10 = 17) and divide by 2 (17/2 = 8.5). Thus, 8.5 is the value of the median of the set of scores.

Scores (arranged in order from low to high):
3, 3, 7, 10, 12, 15

An advantage of the median is that it is insensitive to extreme scores. Taking the same set of data and replacing the 15 with an extremely high score of 319 has no effect on the value of the median. The median is 8.5, which is the same value as in the previous example, despite the one extremely high score. Thus, the median is insensitive to the skew in a skewed distribution. Put another way, the median is an appropriate average for describing the typical participant in a highly skewed distribution.

Scores (arranged in order from low to high):
3, 3, 7, 10, 12, 319

The mode is another average. It is defined as the most frequently occurring score. The following data has a mode of 7 because it occurs more often than any other score.

Scores (arranged in order from low to high):
2, 2, 4, 6, 7, 7, 7, 9, 10, 12

A disadvantage of the mode is that there may be more than one mode for a given distribution. This is the case for the following observations in which both 20 and 23 are modes.

Choosing Among the Three Averages

Other things being equal, choose the mean because more powerful statistical tests described later in this book can be applied to it than to the other averages. However,

• the mean is not appropriate for describing highly skewed distributions, and
• the mean is not appropriate for describing nominal and ordinal data.

Choose the median when the mean is inappropriate. The exception to this guideline is when describing nominal data. Nominal data are naming data such as political affiliation, ethnicity, and so on. There is no natural order to these data; therefore, they cannot be put in order, which is required in order to calculate the median.

Choose the mode when an average is needed to describe nominal data. Note that when describing nominal data, it is often not necessary to use an average because percentages can be used as an alternative. For instance, if there are more registered Democrats than Republicans in a community, the best way to describe this is to report the percentage of people registered in each party. To state only the mode is much less informative than reporting percentages.

Note that in a perfectly symmetrical distribution such as the normal distribution, the mean, median, and mode all have the same value.
In skewed distributions, their values are different. In a distribution with a positive
skew, the mean has the highest value because it is pulled in the direction of the
extremely high scores. In a distribution with a negative skew, the mean has the
lowest value because it is pulled in the direction of the extremely low scores. As
noted earlier, the mean should not be used when a distribution is highly skewed.
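The medians and the mode from the examples in this section can be confirmed with Python's statistics module:

    from statistics import median, mode

    print(median([3, 3, 7, 10, 12, 15]))           # 8.5, halfway between the two middle scores
    print(median([3, 3, 7, 10, 12, 319]))          # still 8.5; the extreme score has no effect
    print(mode([2, 2, 4, 6, 7, 7, 7, 9, 10, 12]))  # 7, the most frequently occurring score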
Section 4
Spread
Variability (spread) refers to differences among the scores of participants. For instance, if all the participants
who take a test earn the same score, there is no variability. In practice, of course, some variability (and often
quite a large amount of variability) is usually found among participants in research studies. Two measures of
variability (i.e., the range and interquartile range) are designed to concisely describe the amount of variability in
a set of data.
A simple statistic that describes variability is the range, which is the difference between the highest score and
the lowest score. For the following scores the range is 18 (20 minus 2). A researcher could report 18 as the
range or simply state that the scores range from 2 to 20.
Scores: 2,5,7,7,8,8,10,12,12,15,17,20
A weakness of the range is that it is based on only the two most extreme scores, which may not accurately reflect the variability in the entire group. Consider the following data where the range is also 18. However, there is much less variability among the participants in the following set of scores.

Scores: 2, 2, 2, 3, 4, 4, 5, 5, 5, 6, 6, 20

Notice that except for the one participant with a score of 20, all participants have scores in the narrow range from 2 to 6. Yet, the one participant with a score of 20 has pulled the range up to a value of 18, making it unrepresentative of the variability of the scores of the majority of the group.

In this case, scores such as the score of 20 are known as outliers. They lie far outside the range of the majority of other scores and increase the size of the range. As a general rule, the range is inappropriate for describing a distribution of scores with outliers.

A better measure of variability is the interquartile range (IQR). It is defined as the range of the middle 50% of the participants. By using only the middle 50%, the range of the majority of the participants is being described and, at the same time, outliers that could have an undue influence on the ordinary range are stripped of their influence.

Using the same set of data illustrates the value and meaning of the interquartile range. Notice that the scores are in order from low to high. The interquartile range separates the lowest 25% from the middle 50%, and separates the highest 25% from the middle 50%. It turns out that the range for the middle 50% is 3 points. When 3.0 is reported as the IQR, readers know that the range of the middle 50% of participants is only 3 points, indicating little variability for the majority of the participants. Note that the undue influence of the outlier of 20 has been overcome by using the interquartile range.

When the median is reported as the average for a set of scores, it is customary to also report the interquartile range as the measure of variability. It is customary to report the value of an average (such as the value of the median) first, followed by the value of a measure of variability (such as the interquartile range).

The standard deviation is the most frequently used measure of variability. In the previous section, you learned that the term variability refers to the differences among participants. Synonyms for variability are spread and dispersion.

The standard deviation is a statistic that provides an overall measurement of how much participants' scores differ from the mean score of their group. It is a special type of average of the deviations of the scores from their mean.

The more spread out participants are around their mean, the larger the standard deviation. Comparison of the following two examples illustrates this principle. Note that S is the symbol for the standard deviation. Notice, too, that the mean is the same for both groups (i.e., Mean = 10.00 for each group), but Group A, with the greater variability among the scores (S = 7.45), has a larger standard deviation than Group B (S = 1.49).

Example Group A:

Scores for Group A: 0, 0, 5, 5, 10, 15, 15, 20, 20

Mean = 10.00, S = 7.45

Example Group B:

Scores for Group B: 8, 8, 9, 9, 10, 11, 11, 12, 12

Mean = 10.00, S = 1.49

Now consider the scores of Group C in the next example. All participants have the same score; therefore, there is no variability. When this is the case, the standard
deviation equals zero, which indicates the complete lack of variability. Thus, S = 0.00.

Example Group C:

Scores for Group C: 10, 10, 10, 10, 10, 10, 10, 10, 10, 10

Considering the three previous examples, it is clear that the more participants differ from the mean of their group, the larger the standard deviation. Conversely, the less participants differ from the mean of their group, the smaller the standard deviation.

In review, even though the three groups (A, B and C) have the same mean, the following is true: Group A has the largest standard deviation (S = 7.45), Group B has a smaller standard deviation (S = 1.49), and Group C has no variability at all (S = 0.00). Thus, if you were reading a research report on the three groups, you would obtain important information about how the groups differ by considering their standard deviations.

The standard deviation takes on a special meaning when considered in relation to the normal curve because the standard deviation was designed expressly to describe this curve. Here is a basic rule to remember: About two-thirds of the cases (68%) lie within one standard deviation unit of the mean in a normal distribution. (Note that "within one standard deviation unit" means one unit on both sides of the mean.)

Consider this example: Suppose that the mean of a set of normally distributed scores equals 70 and the standard deviation equals 10. Then, about two-thirds of the cases lie within 10 points of the mean. More precisely, 68% (a little more than two-thirds) of the cases lie within +/- one standard deviation (10 points) of the mean.

As you can see, 34% of the cases will lie between a score of 60 and the mean of 70, while another 34% of the cases will lie between the mean of 70 and a score of 80. In all, 68% of the cases lie between scores of 60 and 80.

The 68% rule applies to all normal curves. In fact, this is a property of the normal curve: 68% of the cases lie in the "middle area" bounded by one standard deviation on each side. Suppose, for instance, that for another group, the mean of their normal distribution also equals 70, but the group has less variability with a standard deviation of only 5. Since the standard deviation is only 5 points, 68% of the cases lie between scores of 65 and 75 for this group of participants.
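The 68% figure is a property of the normal curve and can be checked numerically; a small sketch using scipy:

    from scipy.stats import norm

    # Proportion of a normal distribution within one standard deviation of its mean
    print(round(norm.cdf(1) - norm.cdf(-1), 4))      # about 0.6827, i.e., roughly 68%

    # With a mean of 70 and a standard deviation of 10, the middle 68% runs from 60 to 80
    print(round(norm.cdf(80, loc=70, scale=10) - norm.cdf(60, loc=70, scale=10), 4))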
The 68% guideline (sometimes called the two-thirds rule of thumb) strictly applies only to perfectly normal distributions. The less normal a distribution is, the less accurate the guideline.

By examining the calculation of the standard deviation you can see it is based on the differences between the mean and each of the scores in a distribution. When researchers report the mean (the most frequently used average), they also report the standard deviation, for example: "Group A has a higher mean (Mean = 67.89, S = 8.77) than Group B (Mean = 60.23, S = 8.54)."

As discussed earlier, when the data contain outliers it is better to report the median and the interquartile range so the reader is not misled by extreme scores in interpreting the spread of the responses.

EXAMPLE

Entering the data from the earlier example of range and interquartile comparisons into SYSTAT, we are able to run an analysis using descriptives to derive the following output.

Scores: 2, 2, 2, 3, 4, 4, 5, 5, 5, 6, 6, 20
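For readers working outside SYSTAT, the same descriptives can be sketched in Python; the exact values a package reports depend on its conventions, especially for quartiles.

    from statistics import mean, median, pstdev

    scores = [2, 2, 2, 3, 4, 4, 5, 5, 5, 6, 6, 20]   # listed in order from low to high

    print("Mean:", round(mean(scores), 2))
    print("Median:", median(scores))
    print("S:", round(pstdev(scores), 2))            # n in the denominator, matching the S values above
    print("Range:", max(scores) - min(scores))       # 18, inflated by the outlier of 20

    # Interquartile range from the hinges (medians of the lower and upper halves),
    # the convention used in the text; other quartile rules give slightly different values
    half = len(scores) // 2
    print("IQR:", median(scores[half:]) - median(scores[:half]))   # 3.0 for these scores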
Section 5
Measures of Association
Indeed there is. Notice that students who scored high on the SAT-V, such as Janice and Scot, had the highest GPAs. Also, those who scored low on the SAT-V, such as Hillary and Mitt, had the lowest GPAs. This type of relationship is called a direct relationship (also called a positive relationship). In a direct relationship, those who score high on one variable tend to score high on the other and those who score low on one variable tend to score low on the other.

In the next example the scores are on two variables for one group of participants. The first variable is self-concept, which was measured with 12 true-false items containing statements such as "I feel good about myself when I am in public." Participants earned one point for each statement that they marked as being true of them. Thus, the self-concept scores could range from zero (marking all statements as false) to 12 (marking all statements as true). Obviously, the higher a participant's score, the higher the self-concept. The second variable is depression, measured with a standardized depression scale with possible scores from 20 to 80. Higher scores indicate more depression.

Participant    Self-Concept    Depression
Sally          12              25
Jose           12              29
Sarah          10              38
Dick            7              50
Matt            8              61
Joan            4              72

A key question for the research is "Does the data indicate that there is a relationship between self-concept and depression?" Close examination indicates that there is a relationship. Notice that participants with high self-concept scores such as Sally and Jose (both with the highest possible self-concept score of 12) had relatively low depression scores of 25 and 29 (on a scale from 20 to 80). At the same time, participants with low self-concept scores such as Matt and Joan have high depression scores. In other words, those with high self-concepts tend to have low depression scores, while those with low self-concepts tend to have high depression scores. Such a relationship is called an inverse relationship (also called a negative relationship). In an inverse relationship, those who score high on one variable tend to score low on the other.

It is important to note that just because a correlation between two variables is observed, it does not necessarily indicate that there is a causal relationship between the variables. For instance, our data does not establish whether (a) having a low self-concept causes depression or (b) being depressed causes an individual to have a low self-concept. In fact, there might not be any causal relationship at all between the two variables because a host of other variables (such as life circumstances, genetic predispositions, and so on) might account for the relationship between self-concept and depression. For instance, having a disruptive home life might cause some individuals to have a low self-concept and at the same time cause these same individuals to become depressed.

In order to study cause-and-effect, a controlled experiment is needed in which different treatments are administered to the participants. For instance, to examine a possible causal link between self-concept and depression, a researcher could give an experimental group a treatment designed to improve self-concept and then compare the average level of depression of the experimental group with the average level of a control group.
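The strength of this inverse relationship can be summarized with a correlation coefficient, introduced more formally below; a quick sketch using the scores in the table:

    from scipy.stats import pearsonr

    # Scores from the table above
    self_concept = [12, 12, 10, 7, 8, 4]      # Sally, Jose, Sarah, Dick, Matt, Joan
    depression   = [25, 29, 38, 50, 61, 72]

    r, p_value = pearsonr(self_concept, depression)
    print(round(r, 2))   # a strong negative value, reflecting the inverse relationship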
For example, the College Board is interested in how well the SAT works in predicting success in college. This can be revealed by examining the correlation between SAT scores and college GPAs. It is not necessary for the College Board to examine what causes high GPAs in an experiment for the purposes of determining the predictive validity of its test.

In addition to validating tests, correlations are of interest in developing theories. Often, a postulate of a theory may indicate that X should be related to Y. If a correlation is found in a correlational study, the finding helps to support the theory. If it is not found, it calls the theory into question.

Up to this point, only clear-cut examples have been considered. However, in practice, correlational data almost always include individuals who are exceptions to the overall trend, making the degree of correlation less obvious. Consider the following, which has the students from our first example and two others: Joe and Patricia.

Student     SAT-V    GPA
Mitt        333      1.0
Janice      756      3.8
Thomas      444      1.9
Scot        629      3.2
Diana       501      2.3
Hillary     245      0.4
Patricia    404      3.1

Joe has a high SAT-V score but a very low GPA. Thus, Joe is an exception to the rule that high values on one variable are associated with high values on the other. There may be a variety of explanations for this exception: Joe may have had a family crisis during his first year in college, or he may have abandoned his good work habits to make time for TV viewing and campus parties as soon as he moved away from home to college. Patricia is another exception: Perhaps she made an extra effort to apply herself to college work, which could not be predicted by the SAT. When studying hundreds of participants, there will be many exceptions, some large and some small. To make sense of such data, statistical techniques are required.

The Pearson Correlation Coefficient

Assume that we are interested in studying acceleration and braking for new cars. We want a single number that summarizes how well we could predict acceleration from braking. Later on we will use linear regression when we discuss how we calculate such a line, but it is enough here to know that we are interested in drawing a line through the area covered by the points in the scatterplot such that the acceleration of a car could be predicted rather well by the value on the line corresponding to its braking. The closer the points cluster around this line, the better would be the prediction.

We also want this number to represent how well we can predict braking from acceleration using a similar line. This symmetry we seek is fundamental to all the measures available in correlation. It means that, whatever the scales on which we measure our variables, the coefficient of association we compute will be the same for either prediction. If this symmetry makes no sense for a certain data set, then a measure of association may not be an appropriate summary for it.

The most common measure of association is the Pearson correlation coefficient, which varies between -1 and +1. A Pearson correlation of 0 indicates that neither of two variables can be predicted from the other by using a linear equation. A Pearson correlation of +1 indicates that one variable can be predicted perfectly by a positive linear function of the other, and vice versa. And a value of -1 indicates the same, except that the function has a negative sign for the slope of the line.
By plotting a scatter diagram of x and y, one can get a good idea of whether there is a relationship between the two variables. Referring to the figure on the following page, one can see that the values of Y (the dependent variable) can take on a number of forms when plotted against values of X (the independent variable). The previous unit dealt with the construction of a line through the scatter plot in order to assess the nature of the relationship. Now a way to measure the degree or strength of the relationship, and a means of testing the strength for statistical significance, is needed. There is one primary means of measuring the strength of the relationship: the correlation coefficient (r), and/or the coefficient of determination (r²).
There are three commonly used methods to test the statistical significance of the strength of the relationship as measured by (r):

• Using the F test to evaluate the ratio of the explained variations (from the regression line) to the unexplained variations.

All three tests are equivalent methods of determining whether the strength of the relationship between x and y is statistically significant, versus whether the observed relationship could have occurred by chance. They are equivalent, however, only for bivariate analysis. Primary attention will be given to the first two tests. The third (the F test) is included here for completeness because computer programs like SYSTAT use the F test for correlation analysis.

Keep in mind that the Pearson correlation measures linear predictability. Do not assume that a Pearson correlation near 0 implies no relationship between variables. Many nonlinear associations (U- and S-shaped curves, for example) can have Pearson correlations of 0.
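That warning can be illustrated with a small sketch: a perfectly U-shaped relationship has a Pearson correlation of essentially zero.

    from scipy.stats import pearsonr

    x = [-3, -2, -1, 0, 1, 2, 3]
    y = [value ** 2 for value in x]   # a perfect U-shaped (quadratic) relationship

    r, _ = pearsonr(x, y)
    print(round(r, 4))   # essentially 0, even though y is completely determined by x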
Example Analysis Using SYSTAT:

After entering the data on SAT-V and GPA for the eight students from our earlier example, we can request Analysis, Correlations, and Simple. The following window opens and you add SAT and GPA as Selected variables. The output window will include Pearson's correlation coefficient and a scatterplot of the data.
Chi Square Distribution

The most familiar test available for two-way tables is the Pearson chi-square test (χ²) for independence of table rows and columns. When the table has only two rows or two columns, the chi-square test is also a test for equality of proportions. The concept of interaction in a two-way frequency table is similar to the one in analysis of variance. It is easiest to see in an example. An advertising agency was interested in the potential effect on sales for two different campaigns. The campaigns were run in two major cities (NY and LA). The results are as follows:

        NY    LA
AD A     8     9
AD B     6     7

Notice in the table that the sales (in 000 units) are similar for both NY and LA. We are interpreting these numbers relatively, so we should compute row percentages to understand the differences better.

        NY     LA
AD A    47.1   52.8
AD B    46.1   53.8

Now we can see that the percentages are similar in the two rows in the table. A simple graph reveals these similarities. There is almost complete overlap in the plot. This indicates there is no interaction between the AD campaigns (A or B) and sales in NY or LA.

Now let's extend the example and assume the agency has two new campaigns (C and D) and has run each campaign in NY and LA. We will once again use unit sales in thousands as our outcome measure. The results are as follows:

Notice in the table that the sales (in 000 units) are dissimilar for both NY and LA. Here is the same table standardized by rows:

        NY     LA
AD C    29.4   70.6
AD D    69.2   30.8

Now we can see that the percentages are dissimilar in the two rows in the table. A simple graph reveals these dissimilarities.

The method of making these comparisons will be described below. In every case, the basic form of the null hypothesis is that there is "no difference" between the two distributions being compared, with the alternate hypothesis being that there is a difference. Since hypotheses are always statements about a population, the "no difference" statement must refer to the populations involved.

The following examples show a progression from simple to more complex applications, indicating the statements of hypotheses associated with each. The only assumptions required of the model are that: (1) at least a nominal level of measure is achieved, thus any level is acceptable, and (2) a random sample is used. If more than one sample is involved, then (3) the samples must be independent of each other.
If this had been a binomial problem, in order to test whether there is a definite yes or no response, the null hypothesis would have been that the proportions of yes and no responses are equal (50-50). 50-50 is used because it represents no definite opinion (a balance) one way or the other. In the present case, with three categories of response, the null hypothesis would represent equal responses in all three categories.

Using the Chi Squared model, this is analogous to comparing the sample data distribution to a uniform population distribution with flat, constant height as shown below. The null hypothesis states that the population distribution is uniform in shape, or that the sample distribution came from a population distribution that is uniform in shape.

The hypothesis is tested by comparing the height of the two distributions at each of the three bars or points. If the two distributions differ too much, as measured by the χ² statistic, the null hypothesis would be rejected and it would be concluded that the sample did not come from a population where the opinions were equally distributed. This would mean that there is a definite opinion one way or the other.

Equally distributed opinions in this case would be equivalent to no opinion one way or the other, like 50-50 when testing a proportion. Thus, if the two distributions do not differ "too" much as measured by the χ² statistic, the null hypothesis would be accepted and the conclusion would be that there is no strong difference of opinion one way or the other.

Calculation of χ²

Step 1: Set up the table showing the frequencies observed (fo) in the sample and the frequencies expected (fe) under the null hypothesis:

            YES    NO    DK    TOTAL
Observed     15     5     4      24
Expected      ?     ?     ?      24

Step 2: Check to see that no more than 20% of the cells have fe less than 5. If more than 20% of the cells have fe less than 5, the cells would have to be regrouped or the test discontinued. In this case, all three cells have fe > 5, thus the criterion is met. This criterion reflects the statement made above that it takes at least 5 elements of a sample to properly establish a "point" on a distribution where a comparison is to be made.
Step 3: Calculate χ² by comparing the observed and expected frequencies in each cell:

            YES    NO    DK    TOTAL
Observed     15     5     4      24
Expected      8     8     8      24

After the fe (expected value) is determined in the first two cells, the fe in the third cell is fixed by the total n. Thus there are 2 degrees of freedom in this case.

Although this is a two-tail or non-directional test, the values for α are shown as areas under one tail, as in the case with the F distribution.

OUTPUT FROM SYSTAT:

Step 6: State the conclusion: Reject H0; the observed value of χ² is too great to conclude that the pattern of responses in the sample came from a population with a uniform distribution of end user opinions about prices in the East. Thus, the difference observed is statistically significant. The population of end users do have definite (yes) opinions about whether prices are higher in the East.
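The same one-sample (goodness-of-fit) calculation can be sketched with scipy; the expected frequencies of 8 follow from spreading the 24 responses evenly over the three categories.

    from scipy.stats import chisquare

    observed = [15, 5, 4]    # Yes, No, Don't Know
    expected = [8, 8, 8]     # a uniform split of the 24 responses across three categories

    stat, p_value = chisquare(observed, f_exp=expected)
    print(round(stat, 2), round(p_value, 4))
    # chi-square is 9.25 with 2 degrees of freedom; the p-value is below .05,
    # consistent with rejecting H0 as concluded above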
Two (or more) Independent Samples

Perhaps the most widely used application of the χ² test is to test whether two or more groups differ with respect to some opinion or characteristic. The test evaluates whether two or more sample distributions (patterns) could reasonably have come from the same population distribution (pattern).

An extension of the one-sample example of the survey of end users concerning prices in the East will illustrate this application. In addition to the survey of 24 end users without Caterpillar equipment, a survey was also made of 60 end users with Caterpillar equipment. The results from the random samples were as follows:

                            Yes    No    Don't Know    Total
No Caterpillar Equipment     15     5         4          24
Caterpillar Equipment        30     5        25          60
Total                        45    10        29          84

Now the relevant question is, "Do the end users with Caterpillar equipment have the same opinions as end users without Caterpillar equipment?" Converted to a statistical hypothesis, the parallel question is whether sample 1 and sample 2 are from the same common population. Since the χ² test compares one distribution with another, the null hypothesis might read, "The pattern of responses from end users without Caterpillar equipment is from the same population pattern as the pattern of responses from end users with Caterpillar equipment."

Often the statements of these hypotheses are shortened to

H0: the patterns of responses are the same; and
H1: the patterns of responses are not the same.

In interpreting these statements it must be realized that "the same" refers to "from the same population distribution." A statistical hypothesis must be a statement about a population. The shortened versions of H0 and H1 appear to be statements about two samples unless correctly interpreted as explained above.

The correct understanding of the statistical hypothesis in this example sheds light on the way the expected frequencies (fe) are determined in order to calculate the value of χ² observed.

The logic is this: the column totals represent a combined estimate of the population distribution using both samples:

45    10    29    84

The object is then to compare each sample pattern with this common estimate of the population distribution. Since the sample sizes are less than the total, and unequal in this case, the population pattern can be "scaled down" for each sample by multiplying the column totals by the ratio of each sample size to the combined total of both samples.

For example, to calculate the expected frequencies for end users without Caterpillar equipment, the column totals are multiplied by the ratio of that sample size to the total size. The results are as follows:

End users without Cat equipment:
  Yes: fe = 45 × (24/84) = 12.86
  No:  fe = 10 × (24/84) = 2.86
  DK:  fe = 29 × (24/84) = 8.29
  Total = 24
Likewise, the expected frequencies for end users with Caterpillar equipment would be calculated:

End users with Cat equipment:
  Yes: fe = 45 × (60/84) = 32.14
  No:  fe = 10 × (60/84) = 7.14
  DK:  fe = 29 × (60/84) = 20.71
  Total = 60

NOTE: In calculating (fe) it would have been necessary to calculate fe for only two cells, (A) and (B), before the remaining values were predetermined by the row and column totals. Thus, there are 2 degrees of freedom with a 2 x 3 contingency table.

The observed value of χ² is then compared with the critical value of χ² for 2 degrees of freedom.

Statistical conclusion: Accept H0; the two sample distributions do not differ enough (from a common population) to conclude that they come from different population distributions.

Conclusion in terms of the problem statement: End users with Caterpillar equipment have the same pattern of opinions as end users without Caterpillar equipment regarding the question of prices in the East.

EXAMPLE: Analysis using SYSTAT

Are prices higher in the East?
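The two-sample comparison can also be sketched with scipy's test of independence; the expected frequencies it reports match the fe values calculated by hand above.

    from scipy.stats import chi2_contingency

    #         Yes  No  Don't Know
    table = [[15,   5,  4],    # end users without Caterpillar equipment
             [30,   5, 25]]    # end users with Caterpillar equipment

    stat, p_value, dof, expected = chi2_contingency(table)
    print(round(stat, 2), round(p_value, 3), dof)
    print(expected.round(2))   # first row: 12.86, 2.86, 8.29 -- the fe values computed above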
Once the data is entered into SYSTAT you can run the two-way table analysis by clicking on Analysis, Tables and Two-Way. A window will open where you will select the variables you wish to include in the analysis and any additional tests you may wish to run during the analysis. In this example ownership of Caterpillar Equipment will be our row variable and the perception of cost in the East our column variable. We are requesting a separate table of counts and percentages.
When we are finished with our selections we click OK and the results appear in an output window.

Extension of Concept to More Than Two Independent Samples:

Instead of making the test to see whether two sample distributions (end users without and with Caterpillar equipment) could have come from the same population distribution, the concept can be extended to more than two samples. The survey might have included end users without Caterpillar equipment, end users with Caterpillar equipment, and end users leasing machines. In this case the contingency table would have been a 3 x 3 table. (This is still considered a two-way table, since only two variables are involved.)

                              Yes    No    Don't Know    Total
No Caterpillar Equipment       15     5         4          24
Caterpillar Equipment          30     5        25          60
Lease Caterpillar Equipment    18    10         2          30
Total                          63    20        31         114

The degrees of freedom in the 3 x 3 matrix would be greater than in the 2 x 3 case. Note that the fe in four cells must be computed before the remaining values are determined by the row and column totals. The 3 x 3 matrix thus has four degrees of freedom.
Once the data is entered into SYSTAT you can run the two-way table analysis by clicking on Analysis, Tables and Two-Way. You will select the variables you want to analyze and any additional statistical tests you may wish to have run as part of the analysis. In this example ownership of Caterpillar Equipment will be our row variable and the perception of cost in the East our column variable. We are requesting a separate table of counts and percentages.
Click OK for the statistical analysis to run and you will receive the results under the output tab.

Interpretation of Test for Independence

The two or more independent sample case described above is often called a "test for independence." The interpretation of the meaning of "independence" is often difficult for someone first learning the Chi Square technique. The procedure is the same, but be careful how you interpret the meaning of the result. In the 3 x 3 table above, three independent random samples of (1) end users without Caterpillar equipment, (2) end users with Caterpillar equipment, and (3) end users with leased machines responded "yes", "no", or "don't know" concerning whether prices were generally higher in the East than in the West. This problem is usually posed as determining whether the response (yes, no, or don't know) to the question is "independent" of end user category. This is meant to test whether ownership makes any difference, or exerts any influence, on the response to the question.

The structure of the problem is exactly the same as in the 2 x 3 table above, with the exception that now there are three rows instead of two, and the expected frequencies are calculated as described above. The null hypothesis is again that there is "no difference" statistically between the observed pattern of responses and the expected pattern that would occur if responses were proportioned within the cells (weighted) according to the totals observed in column and row.

If the observed and expected values are very similar, there will be a low value for χ², and H0, which states that there is no difference between the observed and expected patterns, would be accepted.
If, instead, the observed and expected values differ greatly, there will be a high value of χ², and in rejecting the null hypothesis that there is "no difference", a statistically significant difference would be indicated.

The interpretation of the results is the key point. The clustering of data which results in the large differences, and the high value of the χ² statistic, is evidence that ownership category is influencing the responses on the question. Thus the pattern of responses does depend on the ownership category. Therefore there is evidence of dependence, or lack of independence, between ownership category and response to the question.

On the other hand, if the observed pattern was very similar to the expected "equally proportioned" pattern, then χ² would be a low value and H0 accepted. This would be interpreted as meaning that the responses do not cluster, and thus indicate that ownership category does not influence the response to the question concerning prices. In this case, there is no dependency and thus we conclude the effects are independent although (and because) the distribution patterns are similar.

Rank Correlation

There are two main techniques for calculating correlation coefficients based on ranks: the Spearman Rank Correlation Coefficient and the Kendall Rank Correlation Coefficient. For purposes of simple correlation, they produce practically identical results. This section will be limited to the Spearman Rank Correlation Coefficient, labeled rs.

Rank correlation is a non-parametric alternative to the parametric technique for calculating (r), called the Pearson (r) or "product-moment" correlation, associated earlier with the least squares regression analysis. The rank correlation model requires only

1. that a random sample is taken, and
2. that at least an ordinal level of measure is achieved on each of the variables.

The product-moment correlation model requires (1) a random sample as well, but in addition, (2) x and y must be normally distributed, and (3) the level of measure must be interval.
EXAMPLE

The quality control group measured the percent of steel in the final product and the breaking point. The data has been transformed to rank order for the eight samples. Once the data has been entered into SYSTAT we can request Analysis, Correlation and Simple.

Once we click on Simple the following window for variable and procedure selection will open. We select the two variables of interest, Percent_rank (percentage of steel content) and Break_rank (the breaking point). Under the test we identify our data as rank order and use Spearman. Click OK.

The scatterplot reveals a strong positive relationship between the two rank order variables. The Spearman Correlation reflects this relationship at .952.
We test the significance of the Spearman Correlation Coefficient using SYSTAT.
Under Analysis, Hypothesis Testing and Correlation we select Zero Correlation.
We will discuss the confidence level and alternate hypothesis in later sections of
the text. The results of the analysis give a p-value of 0.000.
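As a cross-check on the SYSTAT output, here is a minimal sketch (assuming Python with scipy) of the same kind of Spearman rank correlation and its significance test. The eight data pairs are illustrative stand-ins rather than the quality control group's measurements; their rank pattern happens to give rs of about 0.95, similar to the result above. The variable names simply mirror Percent_rank and Break_rank.

    # Minimal sketch: Spearman rank correlation on illustrative data.
    from scipy.stats import spearmanr

    percent_steel = [2.1, 2.4, 2.7, 3.0, 3.3, 3.6, 3.9, 4.2]   # illustrative values
    break_point   = [11,  14,  13,  18,  19,  21,  25,  24]    # illustrative values

    rho, p_value = spearmanr(percent_steel, break_point)       # ranks the data internally
    print(f"Spearman rs = {rho:.3f}, p-value = {p_value:.4f}")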
Section 6
Tests of Normality
If the fit of the curve to the data looks excellent, examine the fit in more detail. For a normal distribution, we would expect 68% of the observations to fall between one standard deviation below the mean and one standard deviation above the mean. By counting values in the stem-and-leaf diagram, we find the number of cases is on target. This is not to say that every number follows a normal distribution exactly, however.
You’ve learned numerical measures of location, spread, and outliers, but what about measures of
shape? The histogram gives you a general idea of the shape, but two numerical measures of shape
give a more precise evaluation: skewness tells you the amount and direction of skew (departure from
horizontal symmetry), and kurtosis tells you how tall and sharp the central peak is, relative to a
standard bell curve.
Many statistical inferences require that a distribution be normal or nearly normal. A normal distribution
has skewness and excess kurtosis of 0, so if your distribution is close to those values then it is
probably close to normal.
Among the descriptive statistics are Skewness and Kurtosis. Abnormally skewed and peaked distributions may be signs of trouble, and problems may then arise in applying testing statistics. So the key question is, what are the acceptable ranges for these two statistics and how will they affect the testing statistics if they are outside those limits?

A commonly used "number crunching" software program is SYSTAT, which is compatible with PCs and Macs. The program provides an analysis of the dependent variables and puts a number of useful descriptive statistics into the output, including all of the following: mean, standard error of the mean, median, mode, standard deviation, variance, kurtosis, skewness, range, minimum, maximum, sum, count, largest, smallest, and confidence level. Our question focuses on the skew and kurtosis statistics.

Skewness

Let us begin by talking about skewness. Skewness is a function that returns the skewness of a distribution. Skewness characterizes the degree of asymmetry of a distribution around its mean. Positive skewness indicates a distribution with an asymmetric tail extending towards more positive values. Negative skewness indicates a distribution with an asymmetric tail extending towards more negative values. While that definition is accurate, it isn't 100 percent helpful because it doesn't explain what the resulting number actually means.

The skewness statistic is sometimes also called the skewedness statistic. Normal distributions produce a skewness statistic of about zero. (I say "about" because small variations can occur by chance alone.) So a skewness statistic of -0.0201 would be an acceptable skewness value for a normally distributed set of test scores because it is very close to zero and is probably just a chance fluctuation from zero. As the skewness statistic departs further from zero, a positive value indicates the possibility of a positively skewed distribution (that is, with scores bunched up on the low end of the score scale) or a negative value indicates the possibility of a negatively skewed distribution (that is, with scores bunched up on the high end of the scale). Values of 2 standard errors of skewness (ses) or more (regardless of sign) are probably skewed to a significant degree.

The first thing you usually notice about a distribution's shape is whether it has one mode (peak) or more than one. If it's unimodal (has just one peak), like most data sets, the next thing you notice is whether it's symmetric or skewed to one side. If the bulk of the data is at the left and the right tail is longer, we say that the distribution is skewed right or positively skewed; if the peak is toward the right and the left tail is longer, we say that the distribution is skewed left or negatively skewed.

Look at the following two graphs. They both have μ = 0.1 and σ = 0.26, but their shapes are different. When the distribution has a positive skew, the mean is pulled to the right of the mode and median.
Interpreting
If skewness is positive, the data are positively skewed or skewed right, meaning
that the right tail of the distribution is longer than the left. If skewness is negative,
the data are negatively skewed or skewed left, meaning that the left tail is longer.
But what is "too much for random chance to be the explanation"? Divide the sample skewness by the standard error of skewness (SES) to get a test statistic that measures how many standard errors separate the sample skewness from zero.

The critical test value is approximately 2. (This is a two-tailed test of skewness ≠ 0 at roughly the 0.05 significance level.)

• At < −2, the population is very likely skewed negatively (though you don't know by how much).
• Between −2 and +2, you can't reach any conclusion about the skewness of the population: it might be symmetric, or it might be skewed in either direction.
• At > +2, the population is very likely skewed positively (though you don't know by how much).

Don't mix up the meanings of this test statistic and the amount of skewness. The amount of skewness tells you how highly skewed your sample is: the bigger the number, the bigger the skew. The test statistic tells you whether the whole population is probably skewed, but not by how much: the bigger the number, the higher the probability.

The existence of positively or negatively skewed distributions as indicated by the skewness statistic is important for you to recognize because skewing, one way or the other, will tend to reduce the reliability of the results. Perhaps more importantly, from a decision making point of view, if the scores are scrunched up around any of your cut-points, making a decision will be difficult because many observations will be near that cut-point. Skewed distributions will also create problems insofar as they indicate violations of the assumption of normality that underlies many of the other statistics like correlation coefficients and t-tests.

However, a skewed distribution may actually be a desirable outcome on a criterion-referenced test. For example, a negatively skewed distribution with students all scoring very high on an achievement test at the end of a course may simply indicate that the teaching, materials, and student learning are all functioning very well. This would be especially true if the students had previously scored poorly in a positively skewed distribution (with students generally scoring very low) at the beginning of the course on the same or a similar test. In fact, the difference between the positively skewed distribution at the beginning of the course and the negatively skewed distribution at the end of the course would be an indication of how much the students had learned while the course was going on.

You should also note that, when reporting central tendency for skewed distributions, it is a good idea to report the median in addition to the mean. A few very skewed scores (representing only a few students) can dramatically affect the mean, but will have less effect on the median. This is why we rarely read about the average family income (or mean salary) in the United States. Just a few billionaires would make the average "family income" very high, higher than most people actually make. Median income is reported and makes a lot more sense to most people. The same is true in any skewed distributions of scores as well. So reporting the median along with the mean in skewed distributions is a generally good idea.

Kurtosis

Kurtosis characterizes the relative peakedness or flatness of a distribution compared to the normal distribution. Positive kurtosis indicates a relatively peaked distribution. Negative kurtosis indicates a relatively flat distribution. And, once again, that definition doesn't really help us understand the meaning of the numbers resulting from this statistic.

Normal distributions produce a kurtosis statistic of about one (small variations can occur by chance alone). So a kurtosis statistic of 0.9581 would be an acceptable kurtosis value for a mesokurtic (that is, normally high) distribution because it is close to one. As the kurtosis statistic departs further from one, a positive value
indicates the possibility of a leptokurtic distribution (that is, too tall) or a negative value indicates the possibility of a platykurtic distribution (that is, too flat, or even concave if the value is large enough). Values of 2 standard errors of kurtosis (sek) or more (regardless of sign) probably differ from mesokurtic to a significant degree.

You may remember that the mean and standard deviation have the same units as the original data, and the variance has the square of those units. However, the kurtosis, like skewness, has no units: it is a pure number.

The reference standard is a normal distribution, which has a kurtosis of 1. For example, the "kurtosis" reported by SYSTAT.

• A normal distribution has kurtosis 1. Any distribution with kurtosis ≈ 1 is called mesokurtic.
• A distribution with kurtosis < 1 is called platykurtic. Compared to a normal distribution, its central peak is lower and broader, and its tails are shorter and thinner.
• A distribution with kurtosis > 1 is called leptokurtic. Compared to a normal distribution, its central peak is higher and sharper, and its tails are longer and fatter.

The smallest possible kurtosis is 1 and the largest is ∞. Just as with variance, standard deviation, and skewness, the computation of kurtosis is complete if you have data for the whole population. But if you have data for only a sample, you have to compute the sample kurtosis and standard error for the sample kurtosis.

Your data set is just one sample drawn from a population. You divide the sample excess kurtosis by the standard error of kurtosis (SEK) to get the test statistic, which tells you how many standard errors the sample excess kurtosis is from zero.

The critical test value is approximately 2. (This is a two-tailed test of excess kurtosis ≠ 0 at approximately the 0.05 significance level.)

• At < −2, the population very likely has negative excess kurtosis (kurtosis < 3, platykurtic), though you don't know how much.
• Between −2 and +2, you can't reach any conclusion about the kurtosis of the population.
• At > +2, the population very likely has positive excess kurtosis (kurtosis > 3, leptokurtic), though you don't know how much.

Suppose, for example, that the kurtosis statistic was 1.9142 for a particular study, with a standard error of kurtosis (sek) of .8944. Since 1.9142/.8944 = 2.14, the kurtosis is more than 2 standard errors from zero, and you can assume that the distribution has a significant kurtosis problem. Since the sign of the kurtosis statistic is positive, you know that the distribution is leptokurtic (too tall). Alternatively, if the kurtosis statistic had been negative, you would have known that the distribution was platykurtic (too flat). Yet another alternative would be that the kurtosis statistic might fall within the range between −2 and +2, in which case you would have to assume that the kurtosis was within the expected range of chance fluctuations in that statistic.

The existence of flat or peaked distributions as indicated by the kurtosis statistic is important to you as a researcher insofar as it indicates violations of the assumption of normality that underlies many of the other statistics like correlation coefficients and t-tests.

Geometric and Harmonic Means

When the data is multiplicative or the quantities are rates, the more appropriate measure of central tendency is a geometric or harmonic mean, respectively.

The geometric mean (GM) is a suitable measure of central tendency when the quantities involved are multiplicative in nature, such as rate of population growth, interest rate, etc. For example, suppose an investment earns an interest of 5% in the first year, 15% in the second, and 25% in the third. Then the investor may be
interested in the 'average' annual interest percentage. Evidently, we want the answer to be such a number y that, if the annual interest rate of y applies uniformly over the three years, then the final return is the same as that given by the differential interest rates mentioned.

The harmonic mean (HM) is a suitable measure of central tendency when the quantities involved are rates. For example, a person drove a car for 100 miles, of which he maintained a speed of 50 miles/hr for the first 25 miles, 40 miles/hr for the next 25 miles, 45 miles/hr for the next 25 miles and 55 miles/hr for the last 25 miles. Then he has spent 25(1/50 + 1/40 + 1/45 + 1/55) hours on a 100 mile journey, making the average speed 4/(1/50 + 1/40 + 1/45 + 1/55) = 43.83619, which is the harmonic mean of the four speeds.

Test for Normality

There are many ways to assess normality, and unfortunately none of them are without problems. Graphical methods are a good start, such as plotting a histogram and making a quantile plot.

We have reviewed measures of the shape of the distribution:

• Kurtosis: a measure of the "peakedness" or "flatness" of a distribution. A kurtosis value near zero indicates a shape close to normal. A positive value indicates a distribution which is more peaked than normal, and a negative kurtosis indicates a shape flatter than normal. An extreme positive kurtosis indicates a distribution where more of the values are located in the tails of the distribution rather than around the mean. A kurtosis value of +/-1 is considered very good for most statistical uses, but +/-2 is also usually acceptable.

• Skewness: the extent to which a distribution of values deviates from symmetry around the mean. A value of zero means the distribution is symmetric, while a positive skewness indicates a greater number of smaller values, and a negative value indicates a greater number of larger values. Values for acceptability for statistical purposes (+/-1 to +/-2) are the same as with kurtosis.

The skewness and kurtosis statistics, like all the descriptive statistics, are designed to help us think about the distributions of scores that our study creates. Interpreting the results depends heavily on the type and purpose of the data being analyzed. Keep in mind that all statistics must be interpreted in terms of the types and purposes of your study.

A formal way of finding out if the normal distribution describes the data well is to carry out a statistical test of hypothesis. The Shapiro-Wilk test is a standard test for normality used when the sample size is between 3 and 5000. The p-value given by this test is an indication of how good the fit is—the smaller the p-value is, the worse is the fit. Generally, p-values of the order of 0.05 or 0.01 are considered small enough to declare the fit poor.

The Anderson-Darling test is a standard goodness-of-fit test. It can be used to test whether the given data arise from a normal distribution, where Fn(x) is the proportion of sample points less than or equal to x in a sample of size n. It gives greater importance to the observations in the tails than those at the center. Note that there are algorithms to determine reasonably precisely the Anderson-Darling p-value in the range 0.01 to 0.15, but beyond 0.15 it is difficult to compute it with sufficient precision.

Multivariate Normality Assessment

Mardia's skewness and kurtosis coefficients, and tests of significance of these coefficients using asymptotic distributions, are useful for multivariate normality assessment. Also, one may use the Henze-Zirkler test statistic and its associated p-value using the lognormal distribution.
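Returning to the skewness and kurtosis rules of thumb described earlier in this section, the sketch below (assuming Python with numpy and scipy) applies the "divide by the standard error and compare with 2" test to a simulated right-skewed sample. The SES and SEK formulas used here are the common approximations sqrt(6/n) and sqrt(24/n); the data are simulated, not from the text.

    # Minimal sketch: skewness and excess kurtosis compared with their standard errors.
    import numpy as np
    from scipy.stats import skew, kurtosis

    rng = np.random.default_rng(1)
    x = rng.lognormal(mean=0.0, sigma=0.5, size=200)   # a deliberately right-skewed sample

    n = len(x)
    ses = np.sqrt(6.0 / n)     # approximate standard error of skewness
    sek = np.sqrt(24.0 / n)    # approximate standard error of kurtosis

    g1 = skew(x)               # sample skewness
    g2 = kurtosis(x)           # excess kurtosis (0 for a normal distribution)

    print(f"skewness = {g1:.3f}, skewness/SES = {g1/ses:.2f}")
    print(f"excess kurtosis = {g2:.3f}, kurtosis/SEK = {g2/sek:.2f}")
    # A ratio beyond about +/-2 suggests a real departure from normality.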
Non-Normal Shape

Before you compute means and standard deviations on everything in sight, however, remember that means and standard deviations are not good descriptors for non-normal data. In these cases, you have two alternatives: either transform your data to look normal, or find other descriptive statistics that characterize the data. You may find that if you log the values of a variable, for example, the histogram looks quite normal.

If a transformation does not work, then you may be looking at data that come from a different mathematical distribution. You should turn to distribution-free summary statistics to characterize your data: the median, range, minimum, maximum, midrange, quartiles, and percentiles.

The mean and standard deviation are computed from the sample. This information is used to "match up" the two distributions for comparison.

Sample:

Sample  Invoices  Sales      Sample  Invoices  Sales
1       160       100        21      155       155
2       220       105        22      203       155
3       128       118        23      150       160
4       160       120        24      155       160
5       135       123        25      195       160
Once we have the data loaded into the SYSTAT program we can run the analysis using the descriptives routine. The following window will open for you to select the type of output you would like to review.

You can add to the default selection by clicking on the median, mode, skewness, SE of skewness, kurtosis, and SE of kurtosis. In addition you may wish to click on the Normality test tab on the left menu, where you can select either or both univariate normality tests. Once you have made all of the selections for analysis click OK. The results of the analysis are provided in the output window.

Descriptive Output from SYSTAT
INVOICES SALES
N of Cases 37 37
Minimum 100.000 100.000
Maximum 235.000 235.000
Interquartile Range 42.750 42.750
Median 155.000 155.000
Arithmetic Mean 154.405 154.378
Mode 160.000 .
Standard Deviation 29.994 29.993
Skewness (G1) 0.620 0.623
Standard Error of Skewness 0.388 0.388
Kurtosis (G2) 0.527 0.530
Standard Error of Kurtosis 0.759 0.759
Shapiro-Wilk Statistic 0.966 0.966
Shapiro-Wilk p-Value 0.313 0.306
Anderson-Darling Statistic 0.434 0.439
Adjusted Anderson-Darling Statistic 0.443 0.449
Conclusion: Accept H0, the population is a normal distribution; or, the sample could reasonably have come from a normal distribution; or, there is very little difference between the observed sample distribution and the theoretical normal distribution. Therefore, the sample data support the hypothesis that the population is a normal distribution.
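The same conclusion can be checked outside SYSTAT. Below is a minimal sketch (assuming Python with scipy) of the Shapiro-Wilk and Anderson-Darling tests; because the full invoice data are not reproduced here, a simulated sample of 37 values stands in for the Invoices variable.

    # Minimal sketch: univariate normality tests on a stand-in sample.
    import numpy as np
    from scipy.stats import shapiro, anderson

    rng = np.random.default_rng(7)
    x = rng.normal(loc=154.4, scale=30.0, size=37)   # simulated data, not the invoices

    w_stat, w_p = shapiro(x)
    print(f"Shapiro-Wilk W = {w_stat:.3f}, p-value = {w_p:.3f}")
    # A p-value well above 0.05 would indicate no evidence against normality.

    ad = anderson(x, dist="norm")
    print(f"Anderson-Darling statistic = {ad.statistic:.3f}")
    # scipy reports critical values rather than a p-value for this test.
    print("critical values:", dict(zip(ad.significance_level, ad.critical_values)))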
Section 7
Summary of Procedures
Before requesting descriptive statistics, first scan graphical displays to see if the shape of the distribution
is symmetric, if there are outliers, and if the sample has subpopulations. If the latter is true, then the
sample is not homogeneous, and the statistics should be calculated for each subgroup separately.
Generally, data are presented in a format with columns representing variables and rows representing
cases (respondents/participants). Almost always, descriptive statistics are needed for the variables and
such statistics are called column statistics. Occasionally, descriptive statistics are needed for cases or
rows. For instance, if your data set consists of scores in a number of similar tests (columns) on a list of
students (cases) and if you wish to find the average score and the variation of each student, you would
want row statistics.
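A minimal sketch (assuming Python with numpy) of the column-versus-row distinction just described: with variables in columns and cases in rows, column statistics summarize each variable, while row statistics summarize each case. The four-student score matrix is hypothetical.

    # Minimal sketch: column statistics vs. row statistics.
    import numpy as np

    scores = np.array([        # hypothetical data: 4 students x 3 similar tests
        [78, 82, 80],
        [65, 70, 62],
        [91, 88, 95],
        [55, 60, 58],
    ])

    print("column means (per test):    ", scores.mean(axis=0))
    print("row means (per student):    ", scores.mean(axis=1))
    print("row std devs (per student): ", scores.std(axis=1, ddof=1))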
The Descriptive Statistics procedure in SYSTAT provides basic statistics and stem-and-leaf plots for columns as well as rows. The basic statistics are number of observations (N), minimum, maximum, arithmetic mean (AM), geometric mean, harmonic mean, sum, standard deviation, variance, coefficient of variation (CV), range, interquartile range, median, mode, standard error of AM, etc.

Besides the above descriptive statistics, the trimmed mean, Winsorized mean and their standard error and confidence interval can also be computed for columns and rows. For the trimmed mean, you can specify whether left-sided (lower), right-sided (upper), or two-sided trimming is required and the proportion p of data to be removed. For the Winsorized mean, you can specify a proportion p for two-sided Winsorization.

A confidence interval for the mean (based on the normal distribution, with a default confidence coefficient of 0.95), and skewness and kurtosis measures with their standard errors (SES, SEK), can also be opted for. Along with all the above options, Shapiro-Wilk and Anderson-Darling tests for normality can also be performed. For multivariate data, Mardia's skewness and kurtosis coefficients and asymptotic tests of significance on them, and the Henze-Zirkler test for multinormality, are available.

N-tiles and P-tiles are also available with seven different algorithms, and an associated transformation of the data to an N-tile class can be requested. Bootstrap confidence intervals can also be computed for the corresponding parameters using two popular methods, the Percentile method and the Bias corrected accelerated method.

Descriptive statistics are numerical summaries of our data. Inevitably, these summaries mask details of the data. Without them, however, we would be lost. There are many ways to describe a set of data. Not all are appropriate for every data set, however.

Descriptive Statistics in Statistical Software

Basic Statistics: The following statistics are available:

All options. Calculate all available statistics except Trimmed and Winsorized Means, Normality tests, and N-tiles and P-tiles.

N. Computes the number of non-missing values for the variable.

Minimum. Computes the smallest non-missing value.

Maximum. Computes the largest non-missing value.

Sum. The total of all non-missing values of a variable.
Median. The median estimates the center of a distribution. If the data are sorted in increasing order, the median is the value above which half of the values fall.

Mode. Computes the variable value which occurs most frequently.

Geometric mean (GM). Computes the geometric mean for positive values. It is the nth root of the product of all non-missing w-entries.

Harmonic mean (HM). Calculates the harmonic mean for positive values. It is the number of elements to be averaged divided by the sum of the reciprocals of the elements.

SD. Standard deviation, a measure of spread, is the square root of the sum of the squared deviations of the values from the mean divided by (n-1).

CV. The coefficient of variation is the standard deviation divided by the sample mean.

Variance. The mean of the squared deviations of values from the mean. (Variance is the standard deviation squared.)

Range. The difference between the minimum and the maximum values.

Interquartile range. The difference between the 1st and 3rd quartiles. The quartiles (corresponding percentiles) are calculated using the CLEVELAND method.

Skewness. A measure of the symmetry of a distribution about its mean. If the skewness is significantly nonzero, the distribution is asymmetric. A significant positive value indicates a long right tail; a negative value, a long left tail. A skewness coefficient is considered significant if the absolute value of SKEWNESS / SES is greater than 2.

Kurtosis. A value of kurtosis significantly greater than 0 indicates that the variable has longer tails than those for a normal distribution; less than 0 indicates that the distribution is flatter than a normal distribution. A kurtosis coefficient is considered significant if the absolute value of KURTOSIS / SEK is greater than 2.

SE of kurtosis. Computes the standard error of kurtosis (SQR(24/w)).

Trimmed mean (TM). Computes the mean after trimming out the extreme observations. For two-sided trimming (default) enter a value between 0 and 0.5, and for lower or for upper trimming enter a value between 0 and 1. The default value for all the cases is 0.10. Beware that for two-sided trimming, each side is trimmed by the given proportion.

SE of TM. Computes the standard error of the two-sided trimmed mean.

CI of TM. Computes the confidence interval of the two-sided trimmed mean. Enter a value between 0 and 1. (0.95 (default) and 0.99 are typical values.) If the value is bigger than 1, it is treated as a percentage.

Winsorized mean (WM). Computes the mean after replacing a specified proportion of the extreme observations with the nearest observation. Enter a value between 0 and 0.5 for two-sided Winsorizing. The default value is 0.10. Beware that each side is Winsorized by the given proportion.

SE of WM. Computes the standard error of the two-sided Winsorized mean.

CI of WM. Computes the confidence interval for the two-sided Winsorized mean. Enter a value between 0 and 1 (0.95 (default) and 0.99 are typical values.) If the value is bigger than 1, it is treated as a percentage.
CORRELATION MEASURES

Measures for Continuous Data

The following measures are available for continuous data:

Pearson. Produces a matrix of Pearson product-moment correlation coefficients. Pearson correlations vary between -1 and +1. A value of 0 indicates that neither of two variables can be predicted from the other by using a linear equation. A Pearson correlation of +1 or -1 indicates that one variable can be predicted perfectly by a linear function of the other.

Covariance. Produces a covariance matrix.

SSCP. Produces a sum of cross-products matrix. If the Pairwise option is chosen, sums are weighted by N/n, where n is the count for a pair, and N is the number of cases.

The Pearson, Covariance, and SSCP measures are related. The entries in an SSCP matrix are sums of squares of deviations (from the mean) and sums of cross-products of deviations. If you divide each entry by (n-1), variances result from the sums of squares and covariances from the sums of cross-products. Divide each covariance by the product of the standard deviations (of the two variables) and the result is a correlation.

TESTS OF NORMALITY

Univariate tests. The following tests of normality are available:

Shapiro-Wilk. Computes the Shapiro-Wilk test statistic along with its p-value.

Anderson-Darling. Computes the Anderson-Darling test statistic along with its p-value.

Multivariate tests. The following measures and tests of multivariate normality are available:

Mardia skewness. Computes Mardia's skewness coefficient and tests its significance using an asymptotic distribution.

Mardia kurtosis. Computes Mardia's kurtosis coefficient and tests its significance using an asymptotic distribution.

Henze-Zirkler. Computes the Henze-Zirkler test statistic and its associated p-value using the lognormal distribution.

RESAMPLING

Perform resampling. Generates samples of cases and uses data thereof to carry out the same analysis on each sample.
Number of samples. Specify the number of samples to be generated. These samples are analyzed using the chosen method of sampling. The default is 1.

Sample size. Specify the size of each sample to be generated while resampling. The default sample size is the number of cases in the data file in use.

Random seed. Specify a random seed to be used while resampling. The default random seed is generated by the system.

Confidence. Specify a confidence level for the bootstrap-based confidence interval. Enter any value between 0 and 1. The default value is 0.95.

Estimates. Specify the parameters for which you desire resampling estimates.

CRONBACH'S ALPHA

Cronbach's alpha is a lower bound for test reliability and ranges in value from 0 to 1 (negative values can occur when items are negatively correlated). Alpha can be viewed as the correlation between the items (variables) selected and all other possible tests or scales (with the same number of items) constructed to measure the characteristic of interest. Note that alpha depends on both the number of items and the correlations among them. Even when the average correlation is small, the reliability coefficient can be large if the number of items is large.

• Kurtosis: a measure of the "peakedness" or "flatness" of a distribution. A kurtosis value near zero indicates a shape close to normal. A positive value indicates a distribution which is more peaked than normal, and a negative kurtosis indicates a shape flatter than normal.

• Skewness: the extent to which a distribution of values deviates from symmetry around the mean. A value of zero means the distribution is symmetric, while a positive skewness indicates a greater number of smaller values, and a negative value indicates a greater number of larger values.

Descriptive statistics are designed to help us think about the distributions of scores that our study creates. Interpreting the results depends heavily on the type and purpose of the data being analyzed. All statistics must be interpreted in terms of the types and purposes of your study.

Remember, means and standard deviations are not good descriptors for non-normal data. In a case of non-normal data, either transform your data to look normal, or find other descriptive statistics that characterize the data.

If a transformation does not work, then you may be looking at data that come from a different mathematical distribution. You should turn to distribution-free summary statistics (Non-parametric Statistics) to characterize your data: the median, range, minimum, maximum, midrange, quartiles, and percentiles.

There are many ways to assess normality, and unfortunately none of them are without problems. Graphical methods are a good start, such as plotting a histogram and making a quantile plot.
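A minimal sketch (assuming Python with numpy) of the resampling idea behind the options described above: draw repeated samples with replacement, recompute the statistic for each, and take percentiles of those estimates as a bootstrap confidence interval. The data values are illustrative only.

    # Minimal sketch: percentile-method bootstrap confidence interval for a mean.
    import numpy as np

    rng = np.random.default_rng(42)                # "random seed" option
    data = np.array([160, 220, 128, 160, 135, 155, 203, 150, 155, 195], dtype=float)

    n_samples = 1000                               # "number of samples" option
    boot_means = np.array([
        rng.choice(data, size=len(data), replace=True).mean()
        for _ in range(n_samples)
    ])

    lo, hi = np.percentile(boot_means, [2.5, 97.5])   # percentile method, 0.95 level
    print(f"bootstrap 95% CI for the mean: ({lo:.1f}, {hi:.1f})")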
Chapter 3
Inferential
Statistics
4. Constructing the Model

5. Estimating the Model

6. Confidence Interval

7. Hypothesis Testing
This chapter outlines statistical procedures used to draw inferences about the population we are studying, including:

• Confidence intervals and the normal distribution.
• Hypothesis testing including type I error and type II error.
• Testing against a hypothesized value.

Once we have established an understanding of how we state and test hypotheses, we will move into tests where we have two or more means and situations where we want to develop a predictive equation to help in making better decisions. Chapter 4 will focus on:

• Comparing two means.
• Comparing three or more means (ANOVA).

Chapter 5 will expand on the relationship between variables by examining how we develop and test linear equations (regression and multiple regression):

• Linear regression.

What is a Population?

We are going to use inferential methods to estimate the mean age of the population contained in a recent edition of Who's Who in America. We could enter all 70,000 plus ages into a file and compute the mean age exactly. This is not practical. A sampling estimate can be more accurate than an entire census. For example, biases are introduced into large censuses from refusals to comply, keypunch or coding errors, and other sources. In these cases, a carefully constructed random sample can yield less-biased information about the population.

This is an unusual population because it is contained in a list and is therefore finite. We are not about to estimate the mean age of the rich and famous. After all, Spy magazine used to have a regular feature listing all of the famous people who are not in Who's Who. And bogus listings may escape the careful fact checking of the Who's Who research staff. When we get our estimate, we might be tempted to generalize beyond the book, but we would be wrong to do so. For example, if a psychologist measures opinions in a random sample from a class of college sophomores, his or her conclusions should begin with the statement, "College sophomores at my university think..." If the word "people" is substituted for "college sophomores," it is the researcher's responsibility to make clear that the sample is representative of the larger group on all attributes that might affect the results.
• Pick the first name on every tenth page (some names have no chance of being chosen).
• Close your eyes, flip the pages of the book, and point to a name (Tversky and others have done research that shows that humans cannot behave randomly).
• Randomly pick the first letter of the last name and randomly choose from the names beginning with that letter (there are more names beginning with C, for example, than with I).

The way to pick randomly from a book, file, or any finite population is to assign a number to each name or case and then pick a sample of numbers randomly.

Construct a Model

To make an inference about age, we need to construct a model for our population:

a = μ + ε

This model says that the age (a) of someone we pick from the book can be described by an overall mean age (μ) plus an amount of error (ε) specific to that person and due to random factors that are too numerous and insignificant to describe systematically. Notice that we use Greek letters to denote things that we cannot observe directly and Roman letters for those that we do observe. Of the unobservables in the model, μ is called a parameter, and ε a random variable. A parameter is a constant that helps to describe a population. Parameters indicate how a model is an instance of a family of models for similar populations. A random variable varies like the tossing of a coin.

There are two more parameters associated with the random variable ε but not appearing in the model equation. One is its mean (με), which we have rigged to be 0, and the other is its standard deviation (σε or simply σ). Because a is simply the sum of μ (a constant) and ε (a random variable), its standard deviation is also σ.

In specifying this model, we assume the following:

• The model is true for every member of the population.
• The error, plus or minus, that helps determine one population member's age is independent of (not predictable from) the error for other members.
• The errors in predicting all of the ages come from the same random distribution with a mean of 0 and a standard deviation of σ.

Estimating the Model

Because we have not sampled the entire population, we cannot compute the parameter values directly from the data. We have only a small sample from a much larger population, so we can estimate the parameter values only by using some statistical method on our sample data. When our three assumptions are appropriate, the sample mean will be a good estimate of the population mean. Without going into all of the details, the sample estimate will be, on average, close to the value of the mean in the population. We can use various methods to estimate the mean.

Confidence Interval

Our estimate will not be exactly correct. If we took more samples of the same size and computed estimates, how much would we expect them to vary? First, it should be plain without any mathematics to see that the larger our sample, the
closer will be our sample estimate to the true value of μ in the population. After all, if we could sample the entire population, the estimates would be the true values. Even so, the variation in sample estimates is a function only of the sample size and the variation of the ages in the population. It does not depend on the size of the population. The standard deviation of the sample mean is the standard deviation of the population divided by the square root of the sample size (discussed in the next section). On average, we would expect our sample estimates of the mean age to vary by plus or minus a little more than one standard deviation of the sample mean.

If we knew the shape of the sampling distribution of mean age, we would be able to complete our description of the accuracy of our estimate. There is an approximation that works quite well, however. If the sample size is reasonably large (say, greater than 25), then the mean of a simple random sample is approximately normally distributed. This is true even if the population distribution is not normal, provided the sample size is large.

We now have enough information from our sample to construct a normal approximation of the distribution of our sample mean.

From this normal approximation, we can build a 95% symmetric confidence interval that gives us a specific idea of the variability of our estimate. If we did this entire procedure again—sample names, compute the mean and its standard error, and construct a 95% confidence interval using the normal approximation—then we would expect that 95 intervals out of a hundred so constructed would cover the real population mean age. Remember, the population mean age is not necessarily at the center of the interval that we just constructed, but we do expect the interval to be close to it.

Hypothesis Testing

From the sample mean and its standard error, we can construct hypothesis tests on the mean. Suppose we believed that the average age of those listed in Who's Who is 62 years. After all, we might have picked an unusual sample just through the luck of the draw. Let us say, for argument, that the population mean age is 62 and the standard deviation is 11.5. How likely would it be to find a sample mean age of 56.7? If it is very unlikely, then we would reject this null hypothesis that the population mean is 62. Otherwise, we would fail to reject it.

There are several ways to represent an alternative hypothesis against this null hypothesis. We could make a simple alternative value of 56.7 years. Usually, however, we make the alternative composite—that is, it represents a range of possibilities that do not include the value 62. Here is how it would look:

H0: μ = 62
H1: μ ≠ 62

We would reject the null hypothesis if our sample value for the mean were outside of a set of values that a population value of 62 could plausibly generate. In this context, "plausible" means more probable than a conventionally agreed upon critical level for our test. This value is usually 0.05. A result that would be expected to occur fewer than five times in a hundred samples is considered significant and would be a basis for rejecting our null hypothesis.
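The "95 intervals out of a hundred" claim can be illustrated by simulation. The sketch below (assuming Python with numpy and scipy) builds repeated 95% intervals from samples of a simulated, deliberately non-normal population and counts how often they cover the true mean; the population here is simulated, not the Who's Who ages.

    # Minimal sketch: coverage of normal-approximation 95% confidence intervals.
    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    population = rng.gamma(shape=9.0, scale=7.0, size=70_000)   # skewed population, mean about 63
    mu = population.mean()

    n, trials, z = 100, 1000, norm.ppf(0.975)
    covered = 0
    for _ in range(trials):
        sample = rng.choice(population, size=n, replace=False)
        se = sample.std(ddof=1) / np.sqrt(n)
        lo, hi = sample.mean() - z * se, sample.mean() + z * se
        if lo <= mu <= hi:
            covered += 1

    print(f"{covered} of {trials} intervals covered the population mean")  # typically around 950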
Constructing this hypothesis test is mathematically equivalent to sliding the normal distribution to center over 62. We then look at the sample value 56.7 to see if it is outside of the middle 95% of the area under the curve. If so, we reject the null hypothesis.

Statistical Methods

Once we have established an understanding of how we state and test hypotheses, we will move into tests where we have two or more means and situations where we want to develop a predictive equation to help in making better decisions. Chapter 4 will focus on:

• Linear regression.
• Multivariate regression.
Section 2
1. Interval Estimates
2. Population Unknown
3. Using t-Statistics

To make an inference about a population we will start by constructing and estimating a model to estimate a confidence interval using the normal distribution, and then show the circumstances that require modifying the procedure and using the (t) distribution. The procedure for making a 95% confidence interval estimate of (µ) is as follows:

2. Take a random sample of size n from the population and compute its mean (x̄).

3. Select as the lower (upper) confidence limit a value µ1 (µ2) which, if it were the true population mean, would make the probability of obtaining the given sample mean (x̄) or a larger (smaller) sample mean just equal to 0.025.

Thus, the lower confidence limit is selected by choosing the value of µ1 to fall two deviations below the sample mean (x̄), at x̄ – 2σx̄, and the upper confidence limit (µ2) to fall at 1.96σx̄ above the sample mean.

And if sampling from a normal population, the sampling distribution of means will be normal regardless of the size of the sample (n).

If the above conditions are met, use the standard normal statistic (Z) to compute the 95% confidence interval, because the statistic:
1. Is normally distributed.
2. Has mean = 0.
3. A value of ±2 would encompass 95% of the area under a standard normal curve.

95% Confidence Interval with Standard Normal Statistic

The 95% confidence interval with the standard normal statistic is approximately x̄ ± 2σx̄. The exact test and probability is reported by the statistics program.

σ Unknown, Population Normal or Large

In many cases there is no way of knowing the population standard deviation (σ). In such cases estimate σ from the sample in order to, in turn, estimate σx̄.

If the estimated standard deviation of the sampling distribution (σ̂x̄) is used (also called the estimated standard error of the mean), the (t) statistic can be computed. For small sample sizes (n) the distribution of the (t) statistic departs from the normal distribution. Thus when:

1. σ̂ must be determined from the sample, and
2. the sample size is small,

the (t) statistic should be used to determine the width of the confidence interval (or to test hypotheses).

The values of the (t) statistic depend on the sample size, whereas the values of (Z) are the same for all values of sample size. For sample sizes greater than 121, the value of the (t) statistic is essentially the same as the (Z) statistic. Therefore, the (t) distribution is tabled only for values of (n) from 1 to 121.

Since the value of Z is the same for all values of (n) there is only one curve for the standard normal statistic. For the (t) distribution, however, we have a family of curves, each slightly different depending on the sample size (n). The (df) under (n) in each column for the (t) distribution stands for "degrees of freedom". Degrees of freedom equals (n-1) for the case where only one population parameter is estimated from sample data.

Later, when more than one population parameter is estimated from the sample data, the degrees of freedom will be [(n) – (number of parameters estimated)]. For example, in testing the difference between two means, the degrees of freedom are (n1 + n2 – 2).

Note in the above table that the values for (t), when n = 30, are somewhat, but not greatly, different from the (Z) values, even for (t) values when n ≥ 30. It is always
more accurate to use the (t) values for n < 121 and when σ̂ has been estimated from the sample.

Summary of When to Use the (t) Distribution

• Whenever (σ̂) must be estimated from the sample.
• Sample size (n) is small.

Do not confuse these criteria with the incorrect assumption that the (t) distribution is used whenever (n) is small.

95% Confidence Interval with (t) Statistic and n = 10

Population assumed normal (or large) with mean and standard deviation σ both unknown.

Example

Take a random sample of n = 10 invoices from a monthly total of 165 invoices and make a 95% confidence interval estimate of the average number of service hours invoiced.

x̄ = 706/10 = 70.6

σ̂ = 2.37

σ̂x̄ = σ̂/√n = 2.37/√10 = 0.75

Is it appropriate to use the t distribution?

1. We estimate σ̂ from the sample.
2. n is small (even less than 30).
3. We assume the population of service hours is normally distributed.

Because of (1) and (2), use the (t) distribution. The (t) value for n = 10, df = 9, 95% confidence interval = 2.262. Thus the confidence interval would be:

Lower limit: µ1 = 70.6 – (2.262)(0.75) = 70.6 – 1.70 = 68.9 hours

Upper limit: µ2 = 70.6 + (2.262)(0.75) = 70.6 + 1.70 = 72.3 hours

Thus, the 95% confidence interval estimate of the population mean service hours is 68.9 hours to 72.3 hours.
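A minimal sketch (assuming Python with scipy) reproducing the interval above from the summary statistics in the example (n = 10, x̄ = 70.6, estimated standard error 0.75).

    # Minimal sketch: 95% t-based confidence interval from summary statistics.
    from scipy.stats import t

    n, xbar, se = 10, 70.6, 0.75
    t_crit = t.ppf(0.975, df=n - 1)                 # about 2.262 for 9 degrees of freedom
    lower, upper = xbar - t_crit * se, xbar + t_crit * se
    print(f"t critical = {t_crit:.3f}")
    print(f"95% CI: ({lower:.1f}, {upper:.1f}) hours")   # about (68.9, 72.3)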
What Does the “95% Confidence” Mean?
If all possible samples of size 10 were taken from this population of 165 invoices
and an interval was constructed from each sample (as done above), then 95% of
the intervals constructed would contain the true population mean service hours and
5% of the intervals would NOT include the true population mean service hours.
                         µ1      x̄      µ2      Width
Normal (Z) (incorrect)   69.13   70.6   72.07   2.94 hours
t (correct)              68.90   70.6   72.30   3.40 hours
The correct use of the (t) statistic produces a wider confidence interval (3.40
hours) than the interval constructed with the normal (Z) statistic (2.94 hours). The
relatively wider interval (and thus more conservative estimate) associated with the
(t) distribution reflects the fact that less information is assumed available to make
the estimate (σ̂ must be estimated from the sample) with the (t) distribution than
with the normal (Z) distribution. As discussed above, it is (theoretically) assumed
with the use of the normal distribution that (σ) is either known or that the sample
used to estimate (σ) is large enough (>121, or 30), and that no adjustment need be
made for the paucity of information used in estimating (σ). Of course, it is also
assumed that the population is normal, or large, as previously explained.
Section 3
Hypothesis Testing
1. Key Points from Sampling
2. Concept of Hypothesis Testing
3. Null and Alternate Hypothesis
4. Type I and Type II Error

Hypotheses come in many shapes and forms. Some are fairly general statements or assumptions about phenomena, such as the hypothesis: Dilbert's Service Department is going to have a rough time completing work promised today. That hypothesis can be tested by spending the day in the service shop, or simply by waiting until the next morning and reading the computer printout of yesterday's activities.

Statistics such as the sample mean (x̄) are random variables, since their value varies from sample to sample. As such, they have probability distributions associated with them. The sampling distribution of a statistic is a probability distribution for all possible values of the statistic computed from a sample of size n.

The mean of the sampling distribution is equal to the mean of the parent population, and the standard deviation of the sampling distribution of the sample mean is the standard deviation of the population divided by the square root of the sample size.
The Central Limit Theorem states that the shape of the distribution of the sample mean becomes approximately normal as the sample size n increases, regardless of the shape of the population.

A hypothesis is a statement regarding a characteristic of one or more populations. We test these types of statements using sample data because it is usually impossible or impractical to gain access to the entire population. If population data are available, there is no need for inferential statistics.

Statistical Hypothesis – A Narrow Concept

Hypothesis testing is a procedure, based on sample evidence and probability, used to test statements regarding a characteristic of one or more populations.

A statistical hypothesis is an assumption or statement made about a population (mean, standard deviation, shape as normal or not) that is tested using information contained in samples together with probability ideas. We either accept or reject it on the basis of a pre-chosen probability level. Note: You never prove or disprove a statistical hypothesis, only accept or reject it.

The pre-chosen probability level is called the level of significance, or the alpha (α) error, or Type I error. It is the pre-chosen probability of rejecting the hypothesis when it is in fact true. The most commonly used error level is α = 0.05; however, α = 0.10 is also commonly used in business, and occasionally α = 0.010.

The Null and Alternate Hypothesis

Statistical tests always involve two hypotheses:

1. The null hypothesis H0.
2. The alternative hypothesis H1.

The null hypothesis is that there is "no difference" between two things. The "no difference" really means no statistically significant difference between the observed sample statistic, for example, a sample mean x̄, and the assumed value of the population mean, µ. Thus the statistical notation H0: µ = 70 inches means that there is no statistically significant difference between a sample mean (x̄ = some value) and 70 inches.

Stated another way, it means that any actual difference between a sample value (for example, x̄ = 72 inches) and the assumed value µ = 70 inches can be explained or accounted for by chance variation alone, and does not require some other (outside) influence to explain the difference.

More technically, the null or "no statistically significant difference" means that the value of the sample mean actually observed (x̄ = 72) could reasonably be a member of a sampling distribution whose mean is equal to the assumed population mean µ = 70 and whose standard deviation is σx̄.
The alternative hypothesis, H1, can be one of three variations:

1. H1: µ ≠ 70 (two tail test)
2. H1: µ > 70 (one tail test)
3. H1: µ < 70 (one tail test)

The alternative hypotheses of the second and third types are "directional" hypotheses, and are used only when the expected direction of the deviation from the null hypothesis is reasonably founded.

Whenever the directional hypotheses are used, the null hypothesis can be interpreted as ≤ or ≥, depending on the direction of the alternative hypothesis.

For example:        H0: µ = 70
                    H1: µ > 70
actually becomes:   H0: µ ≤ 70
                    H1: µ > 70

and:                H0: µ = 70
                    H1: µ < 70
actually becomes:   H0: µ ≥ 70
                    H1: µ < 70
Whenever the null hypothesis is tested, the alternative hypothesis must be stated, because it determines whether a one tail or two tail test is to be made.

Whenever the null hypothesis is accepted, the alternative hypothesis is rejected. Likewise, whenever the null hypothesis is rejected, the alternative hypothesis must be accepted.

Another way of viewing statistical tests of hypotheses is to ask, "Which of the hypotheses, H0 or H1, is most consistent with the sample data (e.g., mean = 72)?" If the difference between the assumed value under H0, µ = 70, and the observed value mean = 72 is "large", it is probably more reasonable to conclude that the sample did not come from a population whose mean = 70; rather, it came from one whose mean is greater or less than 70, as the case may be under H1, µ > 70 or µ < 70.

Just how far the sample value, x̄ = 72, can be from H0, µ = 70, before one concludes that it supports H1 rather than H0 depends on the level of significance, or α error, and the standard deviation of the sampling distribution (which can be reduced by increasing the sample size n).

There are three computational procedures which can be used in testing hypotheses. All give the same results, but the first method described below is less likely to lead to error in application situations.

1. Calculate an acceptance region in terms of the data of the problem (critical values CV1 and CV2 stated in the original units of measure) for the chosen α level. If the sample statistic falls within the acceptance region, the null hypothesis is accepted; if outside, it is rejected.

2. Calculate an acceptance region in terms of the test statistic (such as t) for the chosen α level, such as α = 0.05. Again, this determines a 95% acceptance region and a 5% rejection region stated in terms of t values. The boundaries are called critical values of t and are labeled tcrit1 and tcrit2 for a two tail test. The value of the sample statistic, x̄, is converted into a t statistic value. This value is called tobserved (or tobs). If the value of tobs is within the acceptance limits established by tcrit1 and tcrit2, the null hypothesis is accepted, and if outside the limits, the null hypothesis is rejected.

3. Calculate the probability that a certain sample statistic will differ from the chosen population parameter, µ, by more than a specified amount. If the probability is smaller than a minimum level, such as α = 0.05, reject the null hypothesis; if not, accept it.

Example of Three Computational Procedures for Testing Hypotheses

Data from service hour sample of 165 invoices:

N = 165 invoices, n = 10, x̄ = 70.6, σ̂ = 2.37, σ̂x̄ = 0.75

To test the null hypothesis that the mean hours of the population of service invoices for transmission overhauls in the building construction market is µ = 72 hours:
1. Computational procedure in terms of the data in the problem:

• Compute CV1 and CV2:

CV1 = 72 – t(σ̂x̄) = 72 – (2.262)(0.75) = 70.3
CV2 = 72 + t(σ̂x̄) = 72 + (2.262)(0.75) = 73.7

Since the sample mean, x̄ = 70.6, falls within the acceptance region (70.3 to 73.7), the null hypothesis is accepted.

2. Computational procedure in terms of the test statistic (in this case, t):

tobs = (x̄ – µ)/σ̂x̄ = (70.6 – 72)/0.75 = –1.867

Since |tobs| = 1.867 is less than tcrit = 2.262, the null hypothesis is accepted.
3. Computational procedure in terms of probability levels:

From the original hypothesis and confidence level of .95 we get alpha = 0.05 for a two tail test. In other words, we have a probability of .025 in each tail of the reject-H0 region.

• Compute the probability of obtaining a value of x̄ = 70.6 or greater from a sampling distribution when µ = 72 and σ̂x̄ = 0.75.

• Compute the value of t:

t = (x̄ – µ)/σ̂x̄ = (70.6 – 72)/0.75 = –1.867

• Interpret the probability associated with t = 1.867, two tail, 9 degrees of freedom:
  t = 1.867, area = .096

• To get the area in the left tail to compare with the critical level of α/2 = 0.025:
  t = 1.867, area/2 = .096/2 = 0.048

• Compare the critical level of α/2 = 0.025 with the observed level of .048.

• Conclusion: The probability that x̄ will differ from µ is observed to be larger (0.096, two tail) than the minimum (critical) level (α = 0.05, two tail); thus, the null hypothesis is accepted. Note: This means that it is not "unreasonable" to obtain a sample value as large as x̄ = 70.6 from a population whose µ = 72 and σ̂x̄ = 0.75. Thus, the difference (70.6 vs. 72) is not significant. The observed difference (70.6 vs. 72) can be explained by sampling variation alone, and does not need any (outside) influence to explain the difference.

Assumptions Underlying Hypothesis Tests Regarding One Mean

The preceding examples of different computational procedures for testing hypotheses concerning one mean have three basic underlying assumptions:

1. The sample chosen must be a Random Sample.
2. The level of measure achieved must be at least interval level.
3. The samples must have been drawn from normal populations or from large populations (so that the Central Limit Theorem holds), as discussed previously.

Summary

The null hypothesis, denoted H0, is a statement to be tested. The null hypothesis is a statement of no change, no effect or no difference. The null hypothesis is assumed true until evidence indicates otherwise. In this chapter, it will be a statement regarding the value of a population parameter.

The alternative hypothesis, denoted H1, is a statement that we are trying to find evidence to support; it, too, is a statement regarding the value of a population parameter.

There are three ways to set up the null and alternative hypotheses:

Equal versus not equal hypothesis (two-tailed test)
• H0: parameter = some value
• H1: parameter ≠ some value
Equal versus less than hypothesis (left-tailed test)
• H0: parameter = some value
• H1: parameter < some value

Equal versus greater than hypothesis (right-tailed test)
• H0: parameter = some value
• H1: parameter > some value
The null hypothesis is a statement of “status quo” or “no difference” and always
contains a statement of equality. The null hypothesis is assumed to be true until
we have evidence to the contrary. The claim that we are trying to gather evidence
for determines the alternative hypothesis.
We reject the null hypothesis when the null hypothesis is true. This decision would
be incorrect. This type of error is called a Type I error.
The probability of making a Type I error, α, is chosen by the researcher before the
sample data is collected. The level of significance, α, is the probability of making a
Type I error.
We do not reject the null hypothesis when the alternative hypothesis is true. This
decision would be incorrect. This type of error is called a Type II error.
Section 4
Hypothesis Test
1. Test of Hypothesis

The sampling distribution (of means) is used in a slightly different manner in a test of hypothesis.

n = 10     x      (x − x̄)    (x − x̄)²      x²
 1        73       +0.9       0.81       5329
 2        73       +0.9       0.81       5329
 3        70       −2.1       4.41       4900
 4        69       −3.1       9.61       4761
 5        71       −1.1       1.21       5041
 6        70       −2.1       4.41       4900
 7        78       +5.9      34.81       6084
 8        70       −2.1       4.41       4900
 9        71       −1.1       1.21       5041
10        76       +3.9      15.21       5776
The following illustrates the general hypothesis.
H0: µ = 70 α = 0.05
H1: µ ≠ 70
The standard deviation for the sampling distribution is then the standard deviation divided by the square root of the number of cases.

For the sampling distribution of the means we set the mean as hypothesized at 70 service hours and calculated the standard deviation of the sampling distribution as 0.92.

Based on our analysis we will accept H0 of service hours equal to 70 if the value from our sample falls within t(.025, 9 df) × 0.92 = 2.262(0.92) ≈ 2.08 of 70. We calculate the acceptance range as being 67.92 to 72.08. If the sample mean is less than 67.92 we will reject H0. If the sample mean is greater than 72.08 we will reject H0.

x̄ = 72.1, Reject H0

The results of our study lead us to reject our hypothesis of service hours equal to 70.

Extending the example we can use SYSTAT to test our general hypothesis of service hours equal to 70. Once we have entered the data into SYSTAT we can request Analysis, Hypothesis Testing, Mean, and One-Sample t-Test.

When we click on One-Sample t-Test the following window opens. We click on the variable named Service_Hrs and click the Add button. For the Mean we enter the hypothesized value for service hours of 70. The default test is for a hypothesis of equality and the alternate of not equal. The default confidence level is 0.95 (alpha = 0.05). Click OK.
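The same test is easy to reproduce outside SYSTAT. The sketch below is a minimal check written in Python with SciPy (an assumed tool choice, not part of the text); it applies a one-sample t-test to the ten service-hour observations tabled above against the hypothesized mean of 70.

```python
# One-sample t-test for the service-hours example (H0: mu = 70).
# Minimal sketch using Python/SciPy; the data are the ten observations tabled earlier.
from scipy import stats

service_hrs = [73, 73, 70, 69, 71, 70, 78, 70, 71, 76]

t_stat, p_value = stats.ttest_1samp(service_hrs, popmean=70)

print(f"t = {t_stat:.3f}, p = {p_value:.3f}")   # roughly t = 2.272, p = 0.049
# Since p < alpha = 0.05, reject H0: the mean service time differs from 70 hours.
```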
The first window of output you will see is the graph for the t-test showing the distribution of values and the box plot from the sample.

The results of the t-test reveal that the sample mean of 72.1 service hours is more than 2 standard errors above the hypothesized mean of 70 (t = 2.272), and the observed p-value falls below α = 0.05, so we reject H0.

Guidelines for Hypothesis Construction

In single-tailed hypothesis testing problems, it is sometimes difficult to determine which hypothesis should be the null and which should be the alternative. There are two general areas of problem solving in which we will employ hypothesis testing techniques. The first is the area of scientific research and the second is the area of quality control. Research hypotheses often take forms such as:

• The quantity demanded of a good (Y) will increase as its price is decreased.
• Attitude toward a brand and purchase of that brand are positively related.

We test our research hypothesis by converting it into a statistical hypothesis. A statistical hypothesis is an assumption about a population or a population parameter that may be accepted or rejected using the techniques of hypothesis testing.
In scientific research the alternative hypothesis (H1) is the operational statement of the research hypothesis. This means that the null hypothesis (H0) is a "straw man", that is, it is formulated for the express purpose of being rejected. In scientific research the hypothesis is not accepted unless the overwhelming weight of the evidence is in its favor.

Example: Assume a theory indicates that the true value of a population mean in the current time period should exceed its value in the previous time period (the research hypothesis). If the mean equaled 20 in the previous period, the hypotheses would be:

H0: µ = 20 (straw man)
H1: µ > 20 (operational statement of the research hypothesis)

In this example we are interested in testing in one direction. We are looking to find if our sample results are significantly greater than the hypothesized population value of 20.

If, in the above example, we test these hypotheses at the alpha = 0.05 level, there is a 0.05 probability of rejecting H0, and thereby accepting H1, when H0 is true, i.e., when our research hypothesis is incorrect. By setting up our hypotheses in this manner, we make it difficult to accept our research hypothesis unless the weight of the evidence is strongly in its favor.

If our hypothesis had been stated as less than, such as:

H0: µ = 50 (straw man)
H1: µ < 50 (operational statement of the research hypothesis)

then our test would be in the opposite direction. We are looking to find if our sample results are significantly less than the hypothesized population value of 50.

As illustrated earlier, we are able to draw these conclusions based on the probability of an observation occurring by pure chance. As an observation moves further away from our hypothesized value we know that the probability of this occurring by chance reduces.

Remember that as the researcher (decision maker) you set the confidence level (CL) required (the alpha value for the statistical test is simply 1 − CL).
Hypothesis Testing In Quality Control
The risk (or probability) of shutting down production to look for a problem that does not exist is the alpha or Type I error.

In order for the risk of shutting down production when no problem exists to be given by alpha, the null hypothesis must be that the process is in control.

Example: Assume that Rastafar Equipment requires that not more than 5% of weld/bore service jobs are redone. The acceptance sampling hypotheses would be:

H0: p ≤ 0.05 (the process is in control)
H1: p > 0.05 (the process is out of control)

Thus, at the alpha level of 0.01, there is a very small chance of shutting down the production line to correct a problem that does not exist.
As we move further away from the mean score we find the probability of this result gets increasingly smaller. By design we do not want to shut down the production line unless the probability of the results from our sample is so small that it is highly likely that the number of reworked jobs is greater than 5%.

Student's t-distribution

To test hypotheses regarding the population mean assuming the population standard deviation is unknown, we use the t-distribution. When we replace σ with s, the statistic t = (x̄ − µ) / (s/√n) follows Student's t-distribution with n − 1 degrees of freedom.

• The area under the curve is 1. Because of the symmetry, the area under the curve to the right of 0 equals the area under the curve to the left of 0, which equals 1/2.
• The area in the tails of the t-distribution is a little greater than the area in the tails of the standard normal distribution because using s as an estimate of σ introduces more variability into the t-statistic.
Testing Hypotheses with the t-distribution

To test hypotheses regarding the population mean with σ unknown, we use the following steps, provided that:

• The sample is obtained using simple random sampling.
• The sample has no outliers, and the population from which the sample is drawn is normally distributed or the sample size is large (n ≥ 30).

Step 1: Determine the null and alternative hypotheses.

SYSTAT provides the test statistic and probability as part of the Analysis routine.

Step 5: Compare the observed p-value to the critical alpha value. If the P-value < α, reject the null hypothesis.

Remember we never "accept" the null hypothesis, because without having access to the entire population, we don't know the exact value of the parameter stated in the null. Rather, we say that we do not reject the null hypothesis.

Hypothesis Testing or Confidence Intervals

Confidence intervals and hypothesis testing give the same results, so which method is more useful? The answer is that it depends on the context. Scientific journals usually follow a hypothesis testing model because their null hypothesis value for an experiment is usually 0 and the scientist is attempting to reject the hypothesis that nothing happened in the experiment. Those involved in making decisions (epidemiologists, business people, engineers) are often more interested in confidence intervals. They focus on the size and credibility of an effect and care less whether it can be distinguished from 0.
Chapter 4
Testing Two
or More
Means
We often want to compare the means from two or
more groups. In this section we will review the use
of the t-test for two means and ANOVA for comparing
two or more means.
Section 1
3. SYSTAT Two Samples Analysis

Suppose that a simple random sample of size n1 is taken from a population with unknown mean μ1 and unknown standard deviation σ1. In addition, a simple random sample of size n2 is taken from a population with unknown mean μ2 and unknown standard deviation σ2. If the two populations are normally distributed or the sample sizes are sufficiently large (n1 ≥ 30, n2 ≥ 30), then

t = (x̄1 − x̄2) / sqrt( s1²/n1 + s2²/n2 )

approximately follows Student's t-distribution with the smaller of n1 − 1 or n2 − 1 degrees of freedom, where x̄i is the sample mean and si is the sample standard deviation from population i.

To test hypotheses regarding two population means, μ1 and μ2, with unknown population standard deviations, we can use the following steps, provided that:

• the samples are independent;
• the populations from which the samples are drawn are normally distributed or the sample sizes are large (n1 ≥ 30, n2 ≥ 30).

The basic statistical concept underlying the test is to sample from two normal populations, which in theory are actually assumed to be one, take two independent random samples and compute their means (mean 1 and mean 2); then, if all possible such samples of sizes n1 and n2 are taken from the population(s) and the differences between means for such samples are determined, the distribution of all these differences between means forms a sampling distribution of the difference between means.

Hypothesis Test of Difference Between Two Means
This sampling distribution is made up of the differences between means from all possible pairs of independent samples of sizes n1 and n2 which can be taken from the populations (1) and (2). For two independent samples we have Mean = X̄1 and Mean = X̄2, with Std. Dev. = σ̂1 and Std. Dev. = σ̂2 (each estimated from its sample as an estimate of its population), and the difference between sample means = X̄1 − X̄2.

The standard deviation of this sampling distribution is estimated by

σ̂(x̄1 − x̄2) = σ̂ pooled · sqrt( 1/n1 + 1/n2 )

If the samples did not come from normal populations with equal variance, the computation of σ̂ pooled will produce erroneous conclusions for the test. The equal variance assumption can be verified with the F test presented in the next unit.

Note the similarity between the above formula and the formula used in testing one mean:

σ̂x̄ = σ̂ / √n   (one mean)
σ̂(x̄1 − x̄2) = σ̂ pooled · sqrt( 1/n1 + 1/n2 )   (two means)

In computing σ̂ pooled from the sample data, it is usually more efficient to use the formula

σ̂² pooled = [ (n1 − 1)σ̂1² + (n2 − 1)σ̂2² ] / (n1 + n2 − 2)

which is the same as pooling the two sums of squared deviations about their means and dividing by the combined degrees of freedom.
Rastafar Equipment can purchase caliper brake pads from two manufacturers. The pads have the same shape and surface configuration but are made of different rubber compounds. Shipments of 1,000 pads have been purchased from each vendor, and a random sample of 15 pads is chosen for an engineering test. A special test fixture determines the number of hours the brake pad can be pressed against a backhoe loader wheel with a given force before wearing away 1/8 inch of material. Assume the following are typical results:
Brake Pad A                Brake Pad B
n = 15                     n = 15
x̄ = 573 hours              x̄ = 620 hours
σ̂ = 40 hours               σ̂ = 60 hours

Step 1: State H0 and H1:  H0: µ1 = µ2, H1: µ1 ≠ µ2, α = 0.05. Since no evidence is available as to which brake pad should last longer, use a two-tail test.

Step 2: Calculate x̄1 and x̄2 from the sample data; in this case, given as x̄1 = 573, x̄2 = 620.

Step 3: Calculate σ̂1 and σ̂2 from the sample data; in this case, given as σ̂1 = 40, σ̂2 = 60.

Step 4: Use the F test to determine whether the samples came from populations with equal variance.

Step 6: Establish critical values, CV1 and CV2, for the acceptance and rejection regions for H0. Since H0: µ1 = µ2 is the same as µ1 − µ2 = 0, the sampling distribution of differences between means x̄1 − x̄2 centers on zero. In this case, since n1 + n2 − 2 = 15 + 15 − 2 = 28 is less than 30, and σ̂1 and σ̂2 are estimated from the samples, it is appropriate to use Student's t distribution instead of the normal Z distribution. The degrees of freedom for the t distribution for two-means tests are n1 + n2 − 2 = 28 degrees of freedom in this case: t(28 df, .05) = 2.048. Thus, for a two-tail test with α = 0.05 (and using σ̂(x̄1 − x̄2) ≈ 18.6), CV1 = −38.09 and CV2 = +38.09.
Step 7: Check to see whether x̄1 − x̄2 falls within or outside the acceptance region.

Conclusion: Reject H0. x̄1 − x̄2 = −47 is outside the acceptance region for H0. The brake pads do not last for an equal number of hours, or they do not wear at the same rate.

Using the t-test we calculate the observed t statistic = −2.526. We compare the observed t statistic to the critical t statistic of −2.048. The observed t statistic is outside the acceptance region for H0. We reject H0.
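Because only summary statistics are reported for the brake pads, the pooled two-sample t-test can be reproduced directly from those summaries. The sketch below assumes Python with SciPy; ttest_ind_from_stats is the summary-statistics form of the pooled test.

```python
# Pooled two-sample t-test for the brake pad example, computed from summary
# statistics (n = 15 each, means 573 and 620 hours, std. devs. 40 and 60 hours).
from scipy import stats

t_stat, p_value = stats.ttest_ind_from_stats(
    mean1=573, std1=40, nobs1=15,
    mean2=620, std2=60, nobs2=15,
    equal_var=True,            # pooled-variance form, as in the hand calculation
)

print(f"t = {t_stat:.3f}, p = {p_value:.4f}")   # roughly t = -2.52 with 28 df
# |t| exceeds the critical value 2.048 (t, 28 df, alpha = .05), so reject H0.
```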
SYSTAT Two Independent Samples Analysis

When we have two independent samples we can use SYSTAT to test our hypothesis regarding the population.

EXAMPLE

The company believes the Northeast and West coast have similar math skills. The hypothesis is:

H0: µNE = µWEST (no difference in math scores)
H1: µNE ≠ µWEST

Once the data has been entered into SYSTAT we can request Analysis, Hypothesis Testing, Mean, and Two-Sample t-Test.

We are interested in math skills as measured by a math proficiency exam for two REGIONS, the NE (coded 1) and the WEST (coded 4). We click OK and SYSTAT generates the following output.

The plot of scores for the northeast and west gives a clear indication of the differences in both the distribution of scores and the mean scores. Scores in the northeast are more tightly clustered (lower variance) than the scores in the west.

The results indicate that the average math scores in the West (REGION 4) are higher than math scores in the Northeast (REGION 1) at 505 versus 470, respectively. The test of differences at a 95% confidence level (alpha = .05) is significant. We come to this conclusion by comparing the observed p-value of .000 to our critical value of .05. Since the observed p-value is less than our critical value we reject H0 (no difference in scores).
Section 2
2. Testing Hypothesis
3. SYSTAT Two Dependent Samples

A sampling method is dependent when the individuals selected to be in one sample are used to determine the individuals to be in the second sample. Dependent samples are often referred to as matched-pairs samples.

In other words, statistical inference methods on matched-pairs data use the same methods as inference on a single population mean, except that the differences are analyzed.

To test hypotheses regarding the mean difference of matched-pairs data, the following must be satisfied:

1. the sample is obtained using simple random sampling;
2. the sample data consist of matched pairs (the samples are dependent);
3. the differences are normally distributed with no outliers or the sample size, n, is large (n > 30).
Step 1: Determine the null and alternative hypotheses. The hypotheses can be structured in one of three ways, where µd is the population mean difference of the matched-pairs data.

Two-Tailed        Left-Tailed       Right-Tailed
H0: µd = 0        H0: µd = 0        H0: µd = 0
H1: µd ≠ 0        H1: µd < 0        H1: µd > 0

Step 2: Select a level of significance, alpha, based on the seriousness of making a Type I error.

Step 3: Compute the test statistic t = d̄ / (sd / √n), which approximately follows Student's t-distribution with n − 1 degrees of freedom. The values d̄ and sd are the mean and standard deviation of the differences.

Step 4: Determine the critical value (classical approach) or the P-value:

P-Value Approach
• Two-Tailed: the sum of the area in the tails is the P-value.
• Left-Tailed: the area left of t0 is the P-value.
• Right-Tailed: the area right of t0 is the P-value.

Step 5: Compare the critical value with the test statistic; or, using the P-value approach, if the P-value < α, reject the null hypothesis.

Step 6: State the conclusion.

Confidence Interval for Matched-Pairs Data

A (1 − α)·100% confidence interval for µd is given by

Lower bound: d̄ − t(α/2) · sd / √n
Upper bound: d̄ + t(α/2) · sd / √n

The critical value t(α/2) is determined using n − 1 degrees of freedom. Note: The interval is exact when the population is normally distributed and approximately correct for non-normal populations, provided that n is large.

Testing Hypotheses Regarding the Difference of Two Means

Suppose that a simple random sample of size n1 is taken from a population with unknown mean µ1 and unknown standard deviation σ1. In addition, a simple random sample of size n2 is taken from a population with unknown mean µ2 and unknown standard deviation σ2. If the two populations are normally distributed or the sample sizes are sufficiently large (n1 > 30, n2 > 30), then t = (x̄1 − x̄2) / sqrt( s1²/n1 + s2²/n2 ) approximately follows Student's t-distribution.

To test hypotheses regarding two population means, µ1 and µ2, with unknown population standard deviations, we can use the following steps, provided that:

1. the samples are obtained using simple random sampling;
2. the samples are independent;
3. the populations from which the samples are drawn are normally distributed or the sample sizes are large (n1 > 30, n2 > 30).
Step 1: Determine the null and alternative hypotheses. The hypotheses are structured in one of three ways:

Two-Tailed        Left-Tailed       Right-Tailed
H0: µ1 = µ2       H0: µ1 = µ2       H0: µ1 = µ2
H1: µ1 ≠ µ2       H1: µ1 < µ2       H1: µ1 > µ2

Note: µ1 is the population mean for population 1, and µ2 is the population mean for population 2.

When we have two dependent samples we can use SYSTAT to test our hypothesis regarding the population.

EXAMPLE

The company collected data on math and verbal skills across the US. The company believes that Verbal and Math skills should be the same. The general hypothesis is that the population profile on verbal and math is equal for each state: if the state received high verbal skill scores then it received high math skill scores. This is a matched-pairs test with scores for verbal and math by state.

Once the data are entered into SYSTAT we request Analysis, Hypothesis Tests, Mean, and Paired t-Test.
The following window opens for variable selection. We are interested in the VERBAL and MATH variables and set the confidence level at 95% (.95) for the hypothesis test. Click OK.

The pattern of verbal and math score differences is illustrated in the following graph.

The output from our analysis indicates that VERBAL and MATH scores are not equal. We reject the null hypothesis.
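The mechanics of the paired test can be sketched the same way. The verbal and math scores by state are not reproduced in this text, so the five pairs below are hypothetical placeholders used only to show the call (Python/SciPy assumed).

```python
# Paired (dependent-samples) t-test sketch. The verbal/math pairs below are
# hypothetical placeholders, not the SYSTAT data set used in the text.
from scipy import stats

verbal = [480, 510, 495, 520, 470]   # hypothetical verbal scores by state
math   = [500, 525, 505, 540, 490]   # hypothetical math scores for the same states

t_stat, p_value = stats.ttest_rel(verbal, math)   # tests H0: mean difference = 0

print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
# Reject H0 when p < alpha; the test is equivalent to a one-sample t-test on the
# verbal - math differences with n - 1 = 4 degrees of freedom.
```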
Section 3
Proportions
INFERENCES: TWO OR MORE SAMPLES
1. Difference Between Proportions
2. State the Hypotheses
3. Analyze Sample Data
4. Interpret the Results
5. SYSTAT Test of Proportions

Hypothesis Test for Difference Between Proportions

How to conduct a hypothesis test to determine whether the difference between two proportions is significant. The test procedure, called the two-proportion z-test, is appropriate when the following conditions are met:

The sampling method for each population is simple random sampling.

The samples are independent.

Each sample includes at least 10 successes and 10 failures. (Some texts say that 5 successes and 5 failures are enough.)

This approach consists of four steps: (1) state the hypotheses, (2) formulate an analysis plan, (3) analyze sample data, and (4) interpret results.

Every hypothesis test requires the analyst to state a null hypothesis and an alternative hypothesis. The table below shows three sets of hypotheses. Each makes a statement about the difference d between two population proportions, P1 and P2. (In the table, the symbol ≠ means "not equal to".)
Set    Null hypothesis     Alternative hypothesis    Number of tails
1      P1 - P2 = 0         P1 - P2 ≠ 0               2
2      P1 - P2 > 0         P1 - P2 < 0               1
3      P1 - P2 < 0         P1 - P2 > 0               1

The first set of hypotheses (Set 1) is an example of a two-tailed test, since an extreme value on either side of the sampling distribution would cause a researcher to reject the null hypothesis. The other two sets of hypotheses (Sets 2 and 3) are one-tailed tests, since an extreme value on only one side of the sampling distribution would cause a researcher to reject the null hypothesis.

When the null hypothesis states that there is no difference between the two population proportions (i.e., d = 0), the null and alternative hypotheses for a two-tailed test are often stated in the following form.

Ho: P1 = P2
H1: P1 ≠ P2

Analyze Sample Data

Using sample data, complete the following computations to find the test statistic and its associated P-value.

• Pooled sample proportion. Since the null hypothesis states that P1 = P2, we use a pooled sample proportion (p) to compute the standard error of the sampling distribution. p = (p1 * n1 + p2 * n2) / (n1 + n2), where p1 is the sample proportion from population 1, p2 is the sample proportion from population 2, n1 is the size of sample 1, and n2 is the size of sample 2.

• Standard error. Compute the standard error (SE) of the sampling distribution of the difference between two proportions. SE = sqrt{ p * ( 1 - p ) * [ (1/n1) + (1/n2) ] }, where p is the pooled sample proportion, n1 is the size of sample 1, and n2 is the size of sample 2.

• Test method. Use the two-proportion z-test (described in the next section) to determine whether the hypothesized difference between population proportions differs significantly from the observed sample difference.
• P-value. The P-value is the probability of observing a sample statistic as extreme as the test statistic. Since the test statistic is a z-score, use the probability associated with the z-score.

The analysis described above is a two-proportion z-test.

Interpret Results

If the sample findings are unlikely, given the null hypothesis, the researcher rejects the null hypothesis. Typically, this involves comparing the P-value to the significance level, and rejecting the null hypothesis when the P-value is less than the significance level.
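The pooled-proportion and standard-error formulas above translate line for line into code. The sketch below assumes Python with SciPy for the normal tail area; the success counts and sample sizes are hypothetical, chosen only to illustrate the computation.

```python
# Two-proportion z-test using the pooled standard error described above.
# The success counts and sample sizes are hypothetical, for illustration only.
from math import sqrt
from scipy.stats import norm

x1, n1 = 60, 100    # successes and sample size, group 1 (hypothetical)
x2, n2 = 45, 100    # successes and sample size, group 2 (hypothetical)

p1, p2 = x1 / n1, x2 / n2
p_pool = (p1 * n1 + p2 * n2) / (n1 + n2)              # pooled sample proportion
se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))  # standard error under H0

z = (p1 - p2) / se
p_value = 2 * norm.sf(abs(z))                         # two-tailed P-value

print(f"z = {z:.3f}, p = {p_value:.4f}")
# Reject H0: P1 = P2 when the P-value is below the chosen significance level.
```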
To test hypotheses regarding two population proportions, p1 and p2, we can use the following steps (the earlier steps parallel those already described):

Step 4: Determine the critical value.

Step 5: If the P-value < α, reject the null hypothesis.

Step 6: State the conclusion.

SYSTAT Two Proportions Analysis

Use the test for a single proportion for a situation involving one group of subjects whose members can be classified into one of two categories of a dichotomous response variable, such as successes and failures. For instance, in a public opinion poll, we could ask people if they approve or disapprove of the current political administration. If sentiment was evenly split, 0.50 of the respondents should respond in each category. However, we hypothesize that recent events will sway opinions to be more favorable, leading to a 0.60 approval rating.

Test against (Null). Enter the hypothesized value of the proportion according to the null hypothesis. This value must lie in the interval (0,1) and differ from the value for Proportion.

Alternative type. Specify the alternative (greater than, less than, or not equal) under which the power or sample size is to be calculated. The default is 'not equal'.

Level of test. Specify the probability of a Type I error, commonly referred to as the alpha (α) level. By default the confidence level is set at 95% (alpha = .05).

EXAMPLE

We conducted a study of taste preferences for a Dark coffee blend. The product group hypothesized that the proportion liking the dark roast would be 50%.

The results of the analysis indicate that we reject H0: Proportion = 0.50 and accept the alternate H1: Proportion ≠ 0.50.

The test for the equality of two proportions applies when dealing with two independent groups whose members can be classified into one of two categories of a dichotomous response variable. For example, suppose we desire to compare the effectiveness of two different teaching methods, large lectures versus smaller laboratory sessions. We will divide the student population into two groups, assigning a teaching method to each. At the end of the semester, we will record the number of students passing and the number failing using a common exam. The null hypothesis asserts that the proportion passing will be the same in the two groups.

Proportion 1. The hypothesized proportion in the first group. This value must lie in the interval (0,1).

Proportion 2. The hypothesized proportion in the second group. This value must lie in the interval (0,1) and cannot equal Proportion 1.

Alternative type. Specify the alternative (greater than, less than, or not equal) under which the power or sample size is to be calculated. The default is 'not equal'.

Sample sizes. You must identify how the total number of cases is distributed across the two groups:

Equal. The number of cases in the first group equals the number of cases in the second group.

Unequal. The number of cases differs between the two groups. If selecting this option, enter the ratio of the group 2 sample size to the group 1 sample size. A value between 0 and 1 indicates that the second group contains fewer cases than the first group. Values above 1 correspond to the situation in which the second group is larger.

Level of test. Specify the probability of a Type I error, commonly referred to as the alpha (α) level. By default the confidence level is set at 95% (alpha = .05).

We cannot reject H0: Proportion Male = Proportion Female at a 95% confidence level. The p-value of 0.097 exceeds our significance level of alpha = 0.05.
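A single-proportion test like the dark-roast example can also be run as an exact binomial test. The sample size for the taste study is not reported, so the counts below are hypothetical placeholders (Python/SciPy assumed).

```python
# One-sample proportion test, H0: p = 0.50 (preference for the dark roast).
# The counts are hypothetical placeholders; the text's sample size is not given.
from scipy.stats import binomtest

k, n = 64, 100                       # hypothetical: 64 of 100 tasters prefer dark roast
result = binomtest(k, n, p=0.50, alternative='two-sided')

print(f"sample proportion = {k / n:.2f}, p-value = {result.pvalue:.4f}")
# A p-value below alpha = 0.05 leads us to reject H0: p = 0.50.
```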
Section 4
ANOVA
1. ANOVA
4. Post-hoc Test
5. Comparison of Bonferroni Method with Scheffé and Tukey Methods
6. SYSTAT ANOVA

The procedure known as the Analysis of Variance or ANOVA is used to test hypotheses concerning means. ANOVA is a general technique that can be used to test the hypothesis that the means among two or more groups are equal, under the assumption that the sampled populations are normally distributed.

The ANOVA procedure is one of the most powerful statistical techniques.

A couple of questions come immediately to mind: what means? and why analyze variances in order to derive conclusions about the means?
To begin, let us study the effect of temperature on a passive component such as a resistor. We select
three different temperatures and observe their effect on the resistors. This experiment can be
conducted by measuring all the participating resistors before placing n resistors each in three different
ovens.
Each oven is heated to a selected temperature. Then we measure the resistors again after, say, 24
hours and analyze the responses, which are the differences between before and after being subjected
to the temperatures. The temperature is called a factor. The different temperature settings are called
levels. In this example there are three levels or settings of the factor Temperature.
What is a factor?

A factor is an independent treatment variable whose settings (values) are controlled and varied by the experimenter. The intensity setting of a factor is the level. Levels may be quantitative numbers or, in many cases, simply "present" or "not present" ("0" or "1").

In this experiment there is only one factor, temperature, and the analysis of variance that we will be using to analyze the effect of temperature is called a one-way or one-factor ANOVA.

We could have opted to also study the effect of positions in the oven. In this case there would be two factors, temperature and oven position. Here we speak of a two-way or two-factor ANOVA. Furthermore, we may be interested in a third factor, the effect of time. Now we deal with a three-way or three-factor ANOVA. In each of these ANOVAs we test a variety of hypotheses of equality of means (or average responses when the factors are varied).

First consider the possible hypotheses for the one-way ANOVA.

1. The null hypothesis is: the means are the same.
2. The alternative hypothesis is: the means are not the same.

For the 2-way ANOVA, the possible null hypotheses are:

1. There is no difference in the means of factor A
2. There is no difference in means of factor B
3. There is no interaction between factors A and B

The alternative hypothesis for cases 1 and 2 is: the means are not equal. The alternative hypothesis for case 3 is: there is an interaction between A and B.

For the 3-way ANOVA: The main effects are factors A, B and C. The 2-factor interactions are: AB, AC, and BC. There is also a three-factor interaction: ABC. For each of the seven cases the null hypothesis is the same: there is no difference in means, and the alternative hypothesis is the means are not equal.

In general, the number of main effects and interactions can be found from the binomial expansion

2^k = C(k,0) + C(k,1) + C(k,2) + ... + C(k,k)

where k is the number of factors. The first term is for the overall mean, and is always 1. The second term is for the number of main effects. The third term is for the number of 2-factor interactions, and so on. The last term is for the k-factor interaction and is always 1.

This section gives an overview of the one-way ANOVA. First we explain the principles involved in the 1-way ANOVA.

Partition response into components

In an analysis of variance the variation in the response measurements is partitioned into components that correspond to different sources of variation.
The goal in this procedure is to split the total variation in the data into a portion due to random error and portions due to changes in the values of the independent variable(s).

Sums of squares and degrees of freedom

The sample variance is

s² = Σ(yi − ȳ)² / (n − 1)

The numerator part is called the sum of squares of deviations from the mean, and the denominator is called the degrees of freedom.

The variance, after some algebra, can be rewritten as:

s² = [ Σyi² − (Σyi)²/n ] / (n − 1)

The first term in the numerator is called the "raw sum of squares" and the second term is called the "correction term for the mean". Another name for the numerator is the "corrected sum of squares", and this is usually abbreviated by Total SS or SS(Total).

The SS in a 1-way ANOVA can be split into two components, called the "sum of squares of treatments" and the "sum of squares of error", abbreviated as SST and SSE, respectively. The guiding principle behind ANOVA is the decomposition of the sums of squares, or Total SS. Algebraically, this is expressed by

SS(Total) = Σi Σj (yij − ȳ..)² = Σi ni (ȳi. − ȳ..)² + Σi Σj (yij − ȳi.)² = SST + SSE

where k is the number of treatments (i = 1, ..., k) and the bar over the y.. denotes the "grand" or "overall" mean. Each ni is the number of observations for treatment i. The total number of observations is N (the sum of the ni).

Concept of "Treatment"

We introduced the concept of treatment. The definition is: A treatment is a specific combination of factor levels whose effect is to be compared with other treatments.

The mathematical model that describes the relationship between the response and treatment for the one-way ANOVA is given by

Yij = µ + τi + εij

where Yij represents the j-th observation (j = 1, 2, ..., ni) on the i-th treatment (i = 1, 2, ..., k levels). So, Y23 represents the third observation using level 2 of the factor. µ is the common effect for the whole experiment, τi represents the i-th treatment effect, and εij represents the random error present in the j-th observation on the i-th treatment.

The errors εij are assumed to be normally and independently (NID) distributed, with mean zero and variance σε². µ is always a fixed parameter, and τ1, τ2, ..., τk are considered to be fixed parameters if the levels of the treatment are fixed, and not a random sample from a population of possible levels. It is also assumed that µ is chosen so that

Σ τi = 0
holds. This is the fixed effects model.

If the k levels of treatment are chosen at random, the model equation remains the same. However, now the τi's are random variables assumed to be NID. This is the random effects model. Whether the levels are fixed or random depends on how these levels are chosen in a given experiment.

The sums of squares SST and SSE previously computed for the one-way ANOVA are used to form two mean squares, one for treatments and the second for error. These mean squares are denoted by MST and MSE, respectively. These are typically displayed in a tabular form, known as an ANOVA Table. The ANOVA table also shows the statistics used to test hypotheses about the population means.

When the null hypothesis of equal means is true, the two mean squares estimate the same quantity (error variance), and should be of approximately equal magnitude. In other words, their ratio should be close to 1. If the null hypothesis is false, MST should be larger than MSE.

The mean squares are formed by dividing the sum of squares by the associated degrees of freedom. Let N = Σni. Then, the degrees of freedom for treatment, DFT = k − 1, and the degrees of freedom for error, DFE = N − k. The corresponding mean squares are:

MST = SST / DFT
MSE = SSE / DFE

F Statistic

The test statistic, used in testing the equality of treatment means, is:

F = MST / MSE

The critical value is the tabular value of the F distribution, based on the chosen alpha level and the degrees of freedom DFT and DFE. The calculations are displayed in an ANOVA table as output from statistical software:

Source               SS     DF      MS              F
Treatments           SST    k-1     SST / (k-1)     MST/MSE
Error                SSE    N-k     SSE / (N-k)
Total (corrected)    SS     N-1

The word "source" stands for source of variation. Some researchers prefer to use "between" and "within" instead of "treatments" and "error", respectively.

EXAMPLE

The data below resulted from measuring the difference in resistance resulting from subjecting identical resistors to three different temperatures for a period of 24 hours. The sample size of each group was 5. In the language of Design of Experiments, we have an experiment in which each of three treatments was replicated 5 times.
        Level 1    Level 2    Level 3
          6.9        8.3        8.0
          5.4        6.8       10.5
          5.8        7.8        8.1
          4.6        9.2        6.9
          4.0        6.5        9.3
Means     5.34       7.72       8.56

Source               SS        DF     MS        F
Treatments           27.897     2     13.949    9.59
Error                17.452    12      1.454
Total (corrected)    45.349    14
Correction Factor   779.041     1

INTERPRETATION

The test statistic is the F value of 9.59. Using an α of .05, we have that F(.05; 2, 12) = 3.89, the critical F value. Since the test statistic is much larger than the critical value, we reject the null hypothesis of equal population means and conclude that there is a (statistically) significant difference among the population means. The p-value for 9.59 is .00325, so the test statistic is significant at that level.

FURTHER ANALYSIS

There are several techniques we might use to further analyze the differences. These are:

• constructing confidence intervals around the difference of two means.
• estimating combinations of factor levels with confidence bounds.

CALCULATIONS

SYSTAT and other statistical programs do ANOVA calculations. This section describes how to calculate the various entries in an ANOVA table. Remember, the goal is to produce two variances (of treatments and error) and their ratio. The various computational formulas will be shown and applied to the data from the previous example.

STEP 1 Compute CM, the correction for the mean.

CM = (total of all observations)² / N = (108.1)² / 15 = 779.041
STEP 2 Compute the total SS.

The total SS is the sum of squares of all the observations minus CM. The sum of squares of all 15 observations is 824.390; this 824.390 SS is called the "raw" or "uncorrected" sum of squares. Subtracting CM gives SS(Total) = 824.390 − 779.041 = 45.349.

STEP 3 Compute SST, the treatment sum of squares.

First we compute the total (sum) for each treatment.

T1 = (6.9) + (5.4) + ... + (4.0) = 26.7
T2 = (8.3) + (6.8) + ... + (6.5) = 38.6
T3 = (8.0) + (10.5) + ... + (9.3) = 42.8

Then

SST = (T1² + T2² + T3²)/5 − CM = (26.7² + 38.6² + 42.8²)/5 − 779.041 = 27.897

STEP 4 Compute SSE, the error sum of squares. Here we utilize the property that the treatment sum of squares plus the error sum of squares equals the total sum of squares:

SSE = SS(Total) − SST = 45.349 − 27.897 = 17.452

STEP 5 Compute MST, MSE and their ratio, F.

MST is the mean square of treatments, MSE is the mean square of error (MSE is also frequently denoted by σ̂²).

MST = SST / (k-1) = 27.897 / 2 = 13.949
MSE = SSE / (N-k) = 17.452 / 12 = 1.454

where N is the total number of observations and k is the number of treatments. Finally, compute F as

F = MST / MSE = 9.59

That is it. These numbers are the quantities that are in the ANOVA table that was shown previously.

ANOVA RESULTS:

Source        SS        DF     MS        F
Treatments    27.897     2     13.949    9.59
Error         17.452    12      1.454
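The hand calculation can be verified with any one-way ANOVA routine. A minimal sketch in Python with SciPy (an assumed tool; SYSTAT produces the same table) applied to the three temperature levels:

```python
# One-way ANOVA for the resistor example: three temperature levels, five
# replicates each. Reproduces F = MST / MSE from the hand calculation.
from scipy import stats

level1 = [6.9, 5.4, 5.8, 4.6, 4.0]
level2 = [8.3, 6.8, 7.8, 9.2, 6.5]
level3 = [8.0, 10.5, 8.1, 6.9, 9.3]

f_stat, p_value = stats.f_oneway(level1, level2, level3)

print(f"F = {f_stat:.2f}, p = {p_value:.5f}")   # roughly F = 9.59, p = 0.0033
# F exceeds the critical value F(.05; 2, 12) = 3.89, so reject the hypothesis
# of equal treatment means.
```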
For the one-way ANOVA the formula for a confidence interval for the difference between two treatment means is

(ȳi. − ȳj.) ± t(1−α/2, N−k) · sqrt( σ̂² (1/ni + 1/nj) )

where σ̂² = MSE. A 95% confidence interval for the difference is: from −0.247 to 5.007.

An unbiased estimator of the factor level mean µi in the 1-way ANOVA model is given by the treatment mean

ȳi. = ( Σj yij ) / ni

where ni is the number of observations for treatment i.

EXAMPLE

                                                Total     Mean
Group 1       6.9    5.4    5.8    4.6    4.0    26.70    5.34
Group 2       8.3    6.8    7.8    9.2    6.5    38.60    7.72
Group 3       8.0   10.5    8.1    6.9    9.3    42.80    8.56
Group 4       5.8    3.8    6.1    5.6    6.2    27.50    5.50
All Groups                                      135.60    6.78

ANOVA OUTPUT:

Source               SS        DF     MS        F
Treatments           38.820     3     12.940    9.724
Error                21.292    16      1.331
Total (Corrected)    60.112    19
Mean                919.368     1

Since the confidence interval is two-sided, the entry (1 − α/2) value for the t table is (1 − 0.05/2) = 0.975, and the associated degrees of freedom is N − 4, or 20 − 4 = 16. Hence, we obtain confidence limits 5.34 ± 2.120 (0.5159), and the confidence interval is 4.246 to 6.434.

Definition and Estimation of Contrasts

Definitions

A contrast is a linear combination of 2 or more factor level means with coefficients that sum to zero. Two contrasts are orthogonal if the sum of the products of corresponding coefficients (i.e., coefficients for the same means) adds to zero.

Formally, the definition of a contrast is expressed below, using the notation µi for the i-th treatment mean:

C = c1 µ1 + c2 µ2 + ... + ck µk   where   c1 + c2 + ... + ck = Σ ci = 0
ORTHOGONAL CONTRASTS

An example of orthogonal contrasts for four treatments:

          1     2     3     4
c1       +1     0     0    -1
c2        0    +1    -1     0
c3       +1    -1    -1    +1

PROPERTIES OF ORTHOGONAL CONTRASTS:

3. The first two contrasts are simply pairwise comparisons; the third one involves all the treatments.

ESTIMATING CONTRASTS

Contrasts are estimated by taking the same linear combination of treatment mean estimators. In other words:

Ĉ = c1 ȳ1. + c2 ȳ2. + ... + ck ȳk.

and the estimated variance is

s²(Ĉ) = σ̂² Σ ( ci² / ni )

These formulas hold for any linear combination of treatment means, not just for contrasts.

CONFIDENCE INTERVAL FOR CONTRASTS

An unbiased estimator for a contrast C is given by Ĉ above, and the 1 − α confidence limits of C are:

Ĉ ± t(1−α/2, N−k) · s(Ĉ)
EXAMPLE

Estimate the contrasts defined by the table above and construct a 95% confidence interval for C.

POINT ESTIMATE

The point estimate is obtained by substituting the group means into the contrast.

LINEAR COMBINATIONS

Sometimes we are interested in a linear combination of the factor-level means that is not a contrast. Assume that in our sample experiment certain costs are associated with each group. For example, there might be costs associated with each factor:

Factor     Cost in $
  1            3
  2            5
  3            2
  4            1

The following linear combination may then be of interest (using the costs as coefficients):

C = 3 µ1 + 5 µ2 + 2 µ3 + 1 µ4

This resembles a contrast, but the coefficients ci do not sum to zero. A linear combination is given by the same definition as a contrast, but without the restriction that the coefficients sum to zero.

CONFIDENCE INTERVAL

Confidence limits for a linear combination C are obtained in precisely the same way as those for a contrast, using the same calculation for the point estimator and estimated variance.
TWO WAY ANOVA

The 2-way ANOVA is probably the most popular layout in experimental design.

Factorial Model

In an a × b factorial layout the response is modeled as

yijk = µ + αi + βj + (αβ)ij + εijk

where µ is the overall mean response, αi is the effect due to the i-th level of factor A, βj is the effect due to the j-th level of factor B, (αβ)ij is the effect due to any interaction between the i-th level of A and the j-th level of B, and εijk is the random error.

At this point, consider the levels of factor A and of factor B chosen for the experiment to be the only levels of interest to the experimenter, such as predetermined levels for temperature settings or the length of time for a process step. The factors A and B are said to be fixed factors and the model is a fixed-effects model. Random factors will be discussed later.

When an a x b factorial experiment is conducted with an equal number of observations per treatment combination, the total (corrected) sum of squares is partitioned as:

SS(Total) = SS(A) + SS(B) + SS(AB) + SSE

where AB represents the interaction between A and B.

Source               SS          df              MS
Factor A             SS(A)       (a - 1)         MS(A) = SS(A)/(a-1)
Factor B             SS(B)       (b - 1)         MS(B) = SS(B)/(b-1)
Interaction AB       SS(AB)      (a-1)(b-1)      MS(AB) = SS(AB)/[(a-1)(b-1)]
Error                SSE         (N - ab)        SSE/(N - ab)
Total (Corrected)    SS(Total)   (N - 1)

The various hypotheses that can be tested using this ANOVA table concern whether the different levels of Factor A, or Factor B, really make a difference in the response, and whether the AB interaction is significant. Recall that the possible null hypotheses are:

1. There is no difference in the means of factor A
2. There is no difference in means of factor B
3. There is no interaction between factors A and B
BASIC MODEL

Factor A has 1, 2, ..., a levels. Factor B has 1, 2, ..., b levels. There are a · b treatment combinations (or cells) in a complete factorial layout. Assume that each treatment cell has r independent observations (known as replications). When each cell has the same number of replications, the design is a balanced factorial.

• Let Bj be the sum of all observations of level j of factor B, j = 1, ..., b. The Bj are the column sums.
• Let (AB)ij be the sum of all observations of level i of A and level j of B. These are the cell sums.
• Let r be the number of replicates in the experiment; that is, the number of times each factorial treatment combination appears in the experiment.

Then the total number of observations for each level of factor A is rb, the total number of observations for each level of factor B is ra, and the total number of observations for each cell (interaction) is r.

EXAMPLE

An evaluation of a new coating applied to 3 different materials was conducted at 2 different laboratories. Each laboratory tested 3 samples from each of the treated materials. The results are given in the following table:

Lab 2:   2.7   1.9   2.7
         3.1   2.2   2.3
         2.6   2.3   2.5

PRELIMINARY ANALYSIS

The preliminary part of the analysis yields a table of row and column sums:

                     Material (B)
Lab (A)         1       2       3      Total (Ai)
   1          12.3     9.2    10.3       31.8
   2           8.4     6.4     7.5       22.3
Total (Bj)    20.7    15.6    17.8       54.1
ANOVA RESULTS

Source          SS        df     MS        F         p-value
A               5.0139     1     5.0139    100.28    0
B               2.1811     2     1.0906     21.81    0.0001
AB              0.1344     2     0.0672      1.34    0.298
Error           0.6000    12     0.0500
Total (Corr)    7.9294    17

INTERPRETATION

From the results we see that the interaction AB is insignificant at p = 0.298. The Lab and Material (coating) effects were each significant at p ≤ .0001.

RANDOM MODELS

With random factors, such as operators, days, lots or batches, where the levels in the experiment might have been chosen at random from a large number of possible levels, the model is called a random model and inferences are to be extended to all levels of the population.

In a random model the experimenter is often interested in estimating components of variance. Let us run an example that analyzes and interprets a components-of-variance or random model.

In this example the k levels (e.g., the batches) are chosen at random from a population with variance σ²batch. The data are as follows:

                 Batch
   1     2     3     4     5
  74    68    75    72    79
  76    71    77    74    81
  75    72    77    73    79

From the analysis, the test statistic from the ANOVA table is F = 36.94 / 1.80 = 20.5. If we had chosen an α value of .01, then the F value for a df of 4 in the numerator and 10 in the denominator is 5.99.

Since the test statistic is larger than the critical value, we reject the hypothesis of equal means. Since these batches were chosen via a random selection process, it may be of interest to find out how much of the variance in the experiment might be attributed to batch differences and how much to random error. In order to answer these questions, we can use the EMS (expected mean square). From this analysis we see that 11.71/13.51 = 86.7 percent of the total variance is attributable to batch differences and 13.3 percent to error variability within the batches.
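The variance-component estimates quoted above follow from the expected mean squares of the balanced one-way random model, E(MST) = σ²error + r·σ²batch and E(MSE) = σ²error. A short sketch of that arithmetic (Python assumed):

```python
# Variance-component estimates for the one-way random (batch) model, using the
# expected mean squares: E(MST) = sigma_error^2 + r * sigma_batch^2, E(MSE) = sigma_error^2.
MST, MSE, r = 36.94, 1.80, 3         # mean squares and replicates per batch

var_error = MSE                      # estimate of within-batch (error) variance
var_batch = (MST - MSE) / r          # estimate of between-batch variance component
total = var_batch + var_error

print(f"batch component = {var_batch:.2f}")             # about 11.71
print(f"share of total  = {var_batch / total:.1%}")     # about 86.7% of the variance
```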
Questions concerning the reason for the rejection of the null hypothesis arise in the form of:

• "Which mean(s) or proportion(s) differ from a standard or from each other?"
• "Does the mean of treatment 1 differ from that of treatment 2?"
• "Does the average of treatments 1 and 2 differ from the average of treatments 3 and 4?"

One popular way to investigate the cause of rejection of the null hypothesis is a Multiple Comparison Procedure. These are methods which examine or compare more than one pair of means or proportions at the same time. Doing pairwise comparison procedures over and over again for all possible pairs will not, in general, work. This is because the overall significance level is not as specified for a single pair comparison.

The ANOVA uses the F test to determine whether there exists a significant difference among treatment means or interactions. In this sense it is a preliminary test that informs us if we should continue the investigation of the data at hand.

If the null hypothesis (no difference among treatments or interactions) is accepted, there is an implication that no relation exists between the factor levels and the response. There is not much we can learn, and we are finished with the analysis.

When the F test rejects the null hypothesis, we usually want to undertake a thorough analysis of the nature of the factor-level effects.

Previously, we discussed several procedures for examining particular factor-level effects. These were

• Estimation of the Difference Between Two Factor Means.
• Estimation of Factor Level Effects.
• Confidence Intervals For A Contrast.

These types of investigations should be done on combinations of factors that were determined in advance of observing the experimental results, or else the confidence levels are not as specified by the procedure. Also, doing several comparisons might change the overall confidence level. This can be avoided by carefully selecting contrasts to investigate in advance and making sure that:

• the number of such contrasts does not exceed the number of degrees of freedom between the treatments.
• only orthogonal contrasts are chosen.

However, there are also several powerful multiple comparison procedures we can use after observing the experimental results.

Tests on Means after Experimentation

If the decision on what comparisons to make is withheld until after the data are examined, the following procedures can be used:

• Tukey's Method to test all possible pairwise differences of means to determine if at least one difference is significantly different from 0.
• Scheffé's Method to test all possible contrasts at the same time, to see if at least one is significantly different from 0.
• Bonferroni Method to test, or put simultaneous confidence intervals around, a pre-selected group of contrasts.

Multiple Comparisons Between Proportions

When we are dealing with population proportion defective data, the Marascuilo procedure can be used to simultaneously examine comparisons between all groups after the data have been collected.
TUKEY METHOD

The Tukey method applies simultaneously to the set of all pairwise comparisons. The confidence coefficient for the set, when all sample sizes are equal, is exactly 1 − α. For unequal sample sizes, the confidence coefficient is greater than 1 − α. In other words, the Tukey method is conservative when there are unequal sample sizes.

Studentized Range Distribution

The Tukey method uses the studentized range distribution. Suppose we have r independent observations y1, ..., yr from a normal distribution with mean μ and variance σ². Let w be the range for this set, i.e., the maximum minus the minimum. Now suppose that we have an estimate s² of the variance σ² that is based on ν degrees of freedom and is independent of the yi. The studentized range is defined as

q = w / s

The distribution of q has been tabulated and is provided as part of the analysis function in SYSTAT.

As an example, let r = 5 and ν = 10. The 95th percentile is q(.05; 5, 10) = 4.65. This means that if we have five observations from a normal distribution, the probability is .95 that their range is not more than 4.65 times as great as an independent sample standard deviation estimate for which the estimator has 10 degrees of freedom.

Tukey's Method

The Tukey confidence limits for all pairwise comparisons µi − µj with confidence coefficient of at least 1 − α are:

(ȳi. − ȳj.) ± [ q(α; k, N−k) / √2 ] · σ̂ · sqrt( 1/ni + 1/nj )

Notice that the point estimator and the estimated variance are the same as those for a single pairwise comparison that was illustrated previously. The only difference between the confidence limits for simultaneous comparisons and those for a single comparison is the multiple of the estimated standard deviation.

Example

Using the data from the previous example and setting a confidence coefficient of 95 percent, we find that the simultaneous pairwise comparisons indicate that the differences μ1 − μ4 and μ2 − μ3 are not significantly different from 0 (their confidence intervals include 0), and all the other pairs are significantly different. (We will do a full analysis using SYSTAT at the end of this section.)

It is possible to work with unequal sample sizes. In this case, one has to calculate the estimated standard deviation for each pairwise comparison. The Tukey procedure for unequal sample sizes is sometimes referred to as the Tukey-Kramer Method.
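Tukey's pairwise comparisons for the four-group example can be reproduced in software. The sketch below assumes Python with a recent SciPy release that provides scipy.stats.tukey_hsd; the four groups are the data used throughout this section.

```python
# Tukey pairwise comparisons for the four-group example (the same data used in
# the Scheffe and Bonferroni illustrations). Assumes a recent SciPy release
# that includes scipy.stats.tukey_hsd.
from scipy import stats

g1 = [6.9, 5.4, 5.8, 4.6, 4.0]
g2 = [8.3, 6.8, 7.8, 9.2, 6.5]
g3 = [8.0, 10.5, 8.1, 6.9, 9.3]
g4 = [5.8, 3.8, 6.1, 5.6, 6.2]

result = stats.tukey_hsd(g1, g2, g3, g4)
print(result)                                   # table of pairwise differences and p-values

ci = result.confidence_interval(confidence_level=0.95)
print(ci.low[2, 0], ci.high[2, 0])              # interval for mu3 - mu1, roughly (1.13, 5.31)
# Only mu1 - mu4 and mu2 - mu3 have intervals that include 0; all other pairs differ.
```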
SCHEFFE'S METHOD

Scheffé's method applies to the set of estimates of all possible contrasts among the factor level means, not just the pairwise differences considered by Tukey's method. An arbitrary contrast

C = Σ ci µi   with   Σ ci = 0

is estimated by Ĉ = Σ ci ȳi., for which the estimated variance is

s²(Ĉ) = σ̂² Σ ( ci² / ni )

It can be shown that the probability is 1 − α that all confidence limits of the type

Ĉ ± sqrt( (k − 1) F(α; k − 1, N − k) ) · s(Ĉ)

are correct simultaneously.

Example

Applying the formulas above to the two contrasts estimated earlier, we obtain in both cases an estimated variance of .2661, where σ̂² = 1.331 was computed in our previous example. The standard error = .5158 (square root of .2661). For a confidence coefficient of 95 percent and degrees of freedom in the numerator of r − 1 = 4 − 1 = 3, and in the denominator of 20 − 4 = 16, we have sqrt(3 · F(.05; 3, 16)) = sqrt(3 × 3.24) = 3.12.

The confidence limits for C1 are −.5 ± 3.12(.5158) = −.5 ± 1.608, and for C2 they are .34 ± 1.608. Compare these with the single-contrast 95 percent interval obtained earlier, −1.594 ≤ C ≤ 0.594. As expected, the Scheffé confidence interval procedure, which generates simultaneous intervals for all contrasts, is considerably wider.

Comparison of Scheffé's Method with Tukey's Method

If only pairwise comparisons are to be made, the Tukey method will result in a narrower confidence limit, which is preferable. Consider for example the comparison between μ3 and μ1:

Tukey:    1.13 < μ3 − μ1 < 5.31
Scheffé:  0.95 < μ3 − μ1 < 5.49

The normalized contrast, using sums, for the Scheffé method is 4.413, which is close to the maximum contrast.

BONFERRONI METHOD

The Bonferroni method is a simple method that allows many comparison statements to be made (or confidence intervals to be constructed) while still assuring that an overall confidence coefficient is maintained.

This method applies to an ANOVA situation when the analyst has picked out a particular set of pairwise comparisons or contrasts or linear combinations in advance. This set is not infinite, as in the Scheffé case, but may exceed the set of pairwise comparisons specified in the Tukey procedure.

The Bonferroni method is valid for equal and unequal sample sizes. We restrict ourselves to only linear combinations or comparisons of treatment level means
(pairwise comparisons and contrasts are special cases of linear combinations). We denote the number of statements or comparisons in the finite set by g.

Each statement is assigned an individual confidence interval with confidence coefficient (1 − α/g), and the Bonferroni inequality insures that the overall confidence coefficient for the set of g statements is at least 1 − α.

Example

We wish to estimate, as we did using the Scheffé method, the following linear combinations (contrasts):

C1 = (µ1 + µ2)/2 − (µ3 + µ4)/2
C2 = (µ1 + µ3)/2 − (µ2 + µ4)/2

and construct 95 percent confidence intervals around the estimates. The point estimates are Ĉ1 = −0.5 and Ĉ2 = 0.34, each with estimated variance .2661, where σ̂² = 1.331. The standard error is .5158 (the square root of .2661).

For a 95% overall confidence coefficient using the Bonferroni method, the t value is t(1 − 0.05/(2·2), 16) = t(0.9875, 16) = 2.473. Now we can calculate the confidence intervals for the two contrasts. For C1 we have confidence limits −0.5 ± 2.473(.5158) and for C2 we have confidence limits 0.34 ± 2.473(0.5158).
Thus, the confidence intervals are:

−1.776 ≤ C1 ≤ 0.776
−0.936 ≤ C2 ≤ 1.616

The corresponding Scheffé intervals obtained earlier (for example, −2.108 ≤ C1 ≤ 1.108) are wider and therefore less attractive when only this pre-selected pair of contrasts is of interest.

Comparison of Bonferroni Method with Scheffé and Tukey Methods

1. If all pairwise comparisons are of interest, Tukey has the edge. If only a subset of pairwise comparisons are required, Bonferroni may sometimes be better.
3. Many computer packages include all three methods. So, study the output and select the method with the smallest confidence band.

SYSTAT ANOVA Analysis

Data was collected on the poverty level for the lower 48 states. Each state was classified regionally as northeast, midwest, south or west. As a researcher we want to determine if there is a significant difference in poverty levels by region. Our test hypothesis can be stated as "poverty levels do not vary by region." The alternate is that "poverty levels are not equal across all regions of the US."

Once the data have been entered into SYSTAT we can request Analysis using ANOVA.

Once you click on estimate model the following window will open.
As stated in our hypothesis we are interested in the rate of poverty by region (US).

Once you click OK the analysis will generate the model and multiple components of output. We are interested in the section of the output focusing on the test of our hypothesis.

We are now interested in where significant differences exist by region. We request a pairwise comparison using Tukey's test. We click on Analysis, ANOVA and Pairwise Comparisons.

We have only one available effect from our model (REGION). We add REGION to the groups and click on Tukey. We have the option of changing the confidence level. By default the level is set at 95%. Once we are finished click OK.

Tukey's pairwise comparisons are in the output window. Recall that our general hypothesis is that all means are equal (no differences by region). Or stated differently:

H0: µnortheast = µmidwest = µsouth = µwest

A plot of means is provided as part of the SYSTAT output. The plot provides a clear representation of the higher rate of poverty in the south region relative to the northeast.

Post hoc analysis reveals that the significant difference in poverty is between the Northeast and the South. The other regional contrasts are not significant at alpha of .05. Our findings indicate that the significant difference we found in our ANOVA analysis is the result of the difference between the northeast and south.
Section 5
Method to Use
1. Method to Use
2. Criteria for Selecting the Best Method

A sampling method is independent when the individuals selected for one sample do not dictate which individuals are to be in a second sample. A sampling method is dependent when the individuals selected to be in one sample are used to determine the individuals to be in the second sample. Dependent samples are often referred to as matched-pairs samples. We began by focusing on the analysis of independent samples, then examined the procedures for testing hypotheses for matched pairs. We concluded this section with an introduction to the analysis of two or more means using ANOVA.

The statistical inference methods on matched-pairs data use the same methods as inference on a single population mean, except that the differences are analyzed.

Step 1: Was the sample(s) drawn according to the requirements for a random sample? If YES: Go to Step 2. If NO: Cannot use any statistical model which is based on probability theory. However, one can describe data, i.e., its mean, standard deviation, quartiles, etc. One might also be able to fit a line to a set of data by the method of least squares, but could not construct a meaningful confidence interval.
Step 2: What level of measure was attained?

• Proportion, p
• Mean, μ

Step 4: What is the number of samples involved?

2. Students' (t)
4. F distribution
5. Chi Square
6. Mann-Whitney
Mean:

Dependent samples: Provided each sample size is greater than 30 or the differences come from a population that is normally distributed, use Student's t-distribution with n − 1 degrees of freedom.

Provided each sample size is greater than 30 or each population is normally distributed, use Student's t-distribution.

Provided the samples are obtained randomly and the total number of observations where the outcomes differ is at least 10, use the normal distribution.

Provided the normal-approximation conditions hold for each sample and the sample size is no more than 5% of the population size, use the normal distribution.

1. Chi Square
2. Analysis of Variance

Step 5: Is the normal population assumption required?
Yes for the Z, t, and F (ANOVA) models, which assume sampling from a normal population. Note: the normal approximation to the binomial requires np and nq ≥ 5 in order to approximate a binomial distribution with a normal distribution.

No for non-parametric tests: Chi Square, Mann-Whitney, Wilcoxon, Spearman rank correlation, and the Binomial itself.

Step 6: Is the equal variance assumption required?

Yes for Z and t two-means tests and for Pearson Product Moment Correlation (homoscedasticity, i.e., the variance of Yi from the regression line is the same for all values of X).

No for the two-proportions test, Chi Square, Mann-Whitney, Wilcoxon, and Spearman Rank Correlation.

Step 7: Check other assumptions required for different models (tests) to have valid application:

• (Z) versus (t)
a. Use Z when σ is known, or n ≥ 30.
b. Use t when σ is unknown and n < 30.

• Chi Square
a. No less than 20% of cells may have fe < 5, and none < 1. (All cells should have fe > 5 if possible.)

• For the Mann-Whitney test and Spearman Rank Correlation, ties in ranks receive the average of the tied ranks.

• In the Wilcoxon matched-pairs signed ranks test, differences of zero in (response 2) minus (response 1) are dropped from the analysis and n is reduced by one.

Step 8: Lastly, if a random sample has been drawn, but none of the above tests meet the circumstances of the problem, refer to an advanced text on Statistical Methods.

Testing the Model

As we have discovered, confidence intervals and hypothesis testing give the same results. The approach to use will depend on the context. Scientific research usually follows a hypothesis testing model because the null hypothesis value for an experiment is usually 0 and the scientist is attempting to reject the hypothesis that nothing happened in the experiment. Those involved in making decisions (epidemiologists, business people, engineers) are often more interested in confidence intervals. They focus on the size and credibility of an effect and care less whether it can be distinguished from 0.
Chapter 5
Regression
Linear Regression
Computational Approach
The general computational problem that needs to be solved in regression analysis is to fit a straight line
to a number of points.
In the simplest case - one dependent and one independent variable - you can
visualize this in a scatterplot.
A scatter plot reveals different possible relationships between the explanatory
variable and the response variable.
The relationships in (a) and (b) are both linear, with (a) showing a strong positive relationship and (b) a negative relationship. In essence, as the explanatory variable increases in value, the response increases for (a) but decreases for (b). We may also find nonlinear relationships as in (c) and (d), or no relationship as in (e).
Least Squares
The Regression Equation

A line in a two dimensional or two-variable space is defined by the equation:

Y = a + bX

The Y variable can be expressed in terms of a constant (a) and a slope (b) times the X variable. The constant is also referred to as the intercept, and the slope as the regression coefficient or B coefficient. For example, Violent Crime may best be predicted by Population. Thus, knowing a state's population would lead us to predict the Violent Crime rate.

For example, the graph below shows a two dimensional regression equation plotted with a 95% confidence interval.

In the multivariate case, when there is more than one independent variable, the regression line cannot be visualized in the two dimensional space, but can be computed just as easily.

Association Analysis is a method for examining the relationship between two or more variables. The following two sections deal only with relationships between two variables and the third section deals with the more complex relationships between more than two variables.

Association Analysis is composed of two parts: (1) the nature of the relationship among variables, usually referred to as regression analysis, and (2) the degree or strength of the relationship, usually referred to as correlation analysis. In most applications, regression and correlation are used as supplementary techniques.

In the first case, the objective might be to estimate or predict the market price of certain houses knowing the size in square feet of each. (This is used in real estate appraising and in tax assessing.) In the second case, a drug store chain may want to decide how many square feet of in-store display space to allot to a certain type of product in order to provide a predicted amount of sales. In the third case, a financial analyst for a hardware distribution chain may be estimating or predicting the working capital requirements to cover accounts receivable of a new operation.
In regression analysis, the value of one variable, the "dependent" variable, is estimated from a known value of a second variable. The second variable is called the "independent" variable, because its value is assumed known, or does not depend on knowing the value of some other variable.

In an effort to improve the allocation of direct sales effort, the sales manager of Bubba Equipment had a random sample of his product support sales people keep a record of the minutes spent with the buyers of maintenance and repair items on each sales call. The orders taken on each call were also recorded so the sales manager could determine the relationship between minutes spent with the buyers and sales dollars produced by the call.

Some typical data are shown below. (Only six data points are included for simplicity in calculation.)

Sales Call    Time Spent With Buyer (In Minutes)    Dollar Amount of Sales Order
1             30                                    250
2             20                                    200
3             25                                    175
4             15                                    125
5             10                                    100
6             15                                    175

A plot of the data is shown below:

Scatter diagram of sales orders taken versus time spent with buyer

As the plot depicts, there is generally a pattern to the scatter plot of a data set. Several alternatives can be used to describe such a set of data. One alternative is to simply draw a line through the data points that appears to be representative of the set of data. This is sometimes called "eyeballing" a line. The problem with such an approach is that it lacks precision, i.e., it probably is not the mathematically best fitting line for that data set. A commonly used technique for fitting such a line is called the "method of least squares".
Method of Least Squares

Recall from basic algebra that the equation for a straight line is:

Y = a + bX

In statistical jargon the equation for a straight line is often referred to as the basis for the general linear model. More detail concerning the general linear model is contained in a later section.

The task of fitting a line to the set of data in the example above is accomplished with the method of least squares. Developed by the German mathematician Karl Gauss (1777 – 1855), the method of least squares did not achieve popularity until the early part of the twentieth century.

The name "least squares" is quite descriptive of how the regression line is estimated. It is an attempt to fit a line to a scatter plot such that the line best describes the data composing that scatter plot. By arbitrarily drawing a line through the data (eyeball method) there is no assurance that the line is the best line for describing that data. Least squares provides a rigorous mathematical procedure for specifying the best possible line. A line is fitted such that the sum of the deviations (from the line) squared is a minimum. The deviations from the line are illustrated in terms of the above example:

Illustration of Deviations From Regression Line

The distance between the line and a given observation is called "error". For example, the initial observation has a $250 order as the result of 30 minutes spent with the buyer (from Table 2). Note that the line constructed in Figure 5 does not pass directly through the point (30, 250), but rather is a short distance below it. The difference between the regression line and the point is the error in predicting that point. Note that the line measures the distance or error between the actual observation and the regression line perpendicular to the x-axis rather than perpendicular to the regression line.

The error term is incorporated into the general linear model as follows:

Yi = a + bXi + ei

where:
Yi = the actual Y observation for a particular value of x
Xi = a particular value of x
ei = the error term, the deviation of Yi from the line

The value of Yi (the actual Y observation) depends on the value of Xi, plus an error term because of the deviation of Yi from the line. Then to minimize the error terms (ei) squared over all (6) values of Xi, the equation solved for ei and squared becomes:

ei² = (Yi − a − bXi)²

With the help of some simple calculus, it is possible to determine the computational formulas for a and b of the general linear model. In order to minimize Σei², take partial derivatives with respect to a and b and set the equations equal to 0. This process results in two simultaneous equations called "normal equations":

∑Y = na + b∑X
∑XY = a∑X + b∑X²

These normal equations can be solved for the coefficients a and b in the linear equation, Y = a + bX:

b = (n∑XY − ∑X∑Y) / (n∑X² − (∑X)²)
a = (∑Y − b∑X) / n

Table 3: Calculation of Terms Used in Regression Equations

Call    Time with Buyer (X)    Sales Orders (Y)    X²      Y²        XY
1       30                     250                 900     62,500    7,500
2       20                     200                 400     40,000    4,000
3       25                     175                 625     30,625    4,375
4       15                     125                 225     15,625    1,875
5       10                     100                 100     10,000    1,000
6       15                     175                 225     30,625    2,625
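The normal equations can be verified numerically. The short Python sketch below recomputes the column totals of Table 3 and solves for a and b; numpy is assumed to be available, and the small difference from the text's a = 48.55 comes only from the text rounding b to 6.38 before computing a.

```python
# Recompute the least squares coefficients for the Table 3 data.
import numpy as np

x = np.array([30, 20, 25, 15, 10, 15], dtype=float)        # minutes with buyer
y = np.array([250, 200, 175, 125, 100, 175], dtype=float)  # sales orders ($)
n = len(x)

sum_x, sum_y = x.sum(), y.sum()                 # 115 and 1,025
sum_xy, sum_x2 = (x * y).sum(), (x ** 2).sum()  # 21,375 and 2,475

b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)   # ~6.38
a = (sum_y - b * sum_x) / n                                     # ~48.5
print("b =", round(b, 2), "a =", round(a, 2))   # text reports Y = 48.55 + 6.38X
```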
Ŷ = a + bX = 48.55 + 6.38X

To position the line on the plot of the data, it is necessary to compute several (at least two) values of Ŷ for given values of X. If one value of X is chosen near the low end of the range of the data, for example Xi = 12, and another value near the high end, for example Xi = 28, and the corresponding value of Ŷ is determined from the equation for each, the line can be drawn through the two points on the scatter diagram as shown in Figure 6 below. When correctly positioned, the line should pass through both points.
SYSTAT Graphics Output: Plot of a Regression Line

The regression line represents a series of point estimates of the average value(s) of Y for given values of X. It is analogous to x̄ as a point estimate for µ. The difference is that we have a different point estimate of the conditional mean along the regression line for each value of X. For example, if the sales manager wanted to make a point estimate of the average amount of sales orders that could be expected if salespeople spent 26 minutes with the buyers of large department stores, that point estimate would be:

Ŷ = a + b(x) = a + b(26)

Similar point estimates could be made for other values of X. Point estimates of Ŷ are also used to construct interval estimates of the conditional mean, similar to the way x̄ is used in the construction of a confidence interval estimate of µ.

The regression model illustrated has a number of underlying assumptions. The model is a parametric model, and has assumptions analogous to those presented earlier for the two mean tests. The assumptions underlying the regression model are:

1. A random sample is chosen.
2. Interval level of measure is achieved on both X and Y.
3. The deviations of Yi from the regression line (Yi − Ŷ) are normally distributed.
4. The variance of the deviations of Yi from the regression line is the same over all values of X (homoscedasticity).
5. The relationship between X and Y is linear.
6. Each Yi is independent of each other Yi (no autocorrelation).

With regard to the assumptions outlined above, it is very important to note that the relationship between X and Y is assumed to be linear. The relationship may be linear over a portion of the range of X, and non-linear over another portion. In the above example, the relationship between sales orders taken and time spent with the buyer becomes non-linear as the time spent with the buyer increases above approximately 25 minutes. A "diminishing returns" phenomenon sets in, and more time spent with the buyer does not produce proportionally more sales orders.
The model gives only the best linear equation fit to the data. The relationship between X and Y may be better represented by a non-linear regression equation. In such cases, the usual procedure is to perform a mathematical transformation on the data of one or both variables to "linearize" the relationship and then proceed with the linear regression model on the transformed data. A logarithmic transform is frequently useful in connection with application of this model in business.

In addition, if the assumptions of any statistical model are violated, it is necessary to explicitly note the consequences. When violation occurs, it is possible to have completely spurious results. If the consequent limitations of a violation are not made explicit, the business results could well be disastrous.

It is also worth noting that regression analysis has primarily two uses: (1) prediction and (2) explanation. Many times these two purposes overlap, as in the example used in this unit. In the example used above, the objective was stated as the prediction of sales orders based on information about time spent by the salespeople with the buyer. The resultant equation allowed prediction of sales, but it was also logically and theoretically consistent. That is to say, it is logical that time spent with a buyer will at least partially explain why a sale was made. Had an attempt to explain sales orders been made by using the number of games won by the Chicago Bears, there could still be a high degree of predictive validity, but no real explanation.

The above point also emphasizes that a regression equation does not prove a causal relationship between the independent and dependent variables. One can only assume that time spent with a buyer causes more sales orders; the regression itself cannot prove it. The existence of an apparent relationship between two variables, even if substantiated by a test of significance of the correlation coefficient as discussed in the next section, does not prove a causal relationship. Causality must be explained on some other theoretical basis outside the realm of the regression model, such as the theory of electromagnetic radiation or economic theory.

Another point needing emphasis is that predictions should not be made beyond the range of data for the independent variables. In the above example, the model should be used to predict sales orders only when time spent with the buyer is between 10 and 30 minutes. As a rule of thumb, predictions can be made for X values 15% below the minimum X value (8.5 for this example) and 15% above the maximum X value (34.5 for this example). As noted previously, any prediction beyond these numbers can be inappropriate because the regression line might not be linear beyond these points.

As a corollary to the rule of thumb noted above, the a value (Y intercept) is usually not a meaningfully interpretable number. The reason it is usually meaningless is that an X value of zero is usually outside the relevant range of X values.

The b value in the regression equation is mathematically equal to the slope of the line as previously noted. It can be interpreted as the change in Y for each unit change in X. Recall that the above equation was Ŷ = 48.55 + 6.38X. In other words, as X changes by one unit, Y will change by 6.38 units. A business interpretation would be that for every minute spent with the buyer, sales will change by 6.38 dollars. If the sign of b is positive, a one unit change upwards or downwards in X will cause a change of 6.38 units upwards/downwards in Y. Conversely, if the sign of b is negative, a one unit change upwards/downwards in X will cause a 6.38 unit change downwards/upwards in Y.

A final rule of thumb is that there should be at least 10 observations for each of the independent variables. In bivariate linear regression there is only one independent variable and therefore at least 10 observations are needed. Only six are presented in the above example for ease of computation.
One final comment on the interpretation of a regression equation is in order. It is important to realize that the interpretation of a regression analysis depends on whether the data used are cross-sectional or time series. Cross-sectional data are collected at one point in time, census data being a good example. The majority of the census is collected during the month of April of every tenth year. Time-series data, on the other hand, observe a particular phenomenon over successive time periods. An example of time-series data would be collecting monthly sales and advertising data for a particular firm or industry for a given length of time, say three years. This would provide the possibility of 36 observations for construction of a regression line. It is important to realize that a time-series interpretation pertains to the way Y changes over time as X changes over time, while a cross-sectional interpretation refers to the amount that Y values differ as there is a simultaneous unit change in X.

THE INFERENTIAL ASPECTS OF REGRESSION

At this point it is useful to note how regression analysis relates to the other inferential statistics studied. Recall the general linear model:

Y = a + bX + e

This conforms to the usual notation of Latin letters for statistics and Greek letters for parameters. In terms of the parameters, the equation would be expressed as:

Y = α + βX + ε

Note that the three parameters (α, β, ε) correspond to the three statistics (a, b, e). Both Y and X are variables and therefore do not have a corresponding parameter.

Confidence Interval Estimate of the Conditional Mean

A confidence interval estimate of µ was shown to be constructed by positioning the sampling distribution of means on either side of the sample mean x̄. The upper and lower limits of the confidence interval were:

Upper limit = x̄ + (Z or t)·s_x̄ and lower limit = x̄ − (Z or t)·s_x̄

where s_x̄ is the standard error of the mean.

In a completely analogous fashion, the confidence interval estimate of the conditional mean is constructed by positioning the sampling distribution of the conditional mean vertically above and below the regression line. The width of the confidence interval of the conditional mean is:

Ŷ + (Z or t)·s_Ŷ and Ŷ − (Z or t)·s_Ŷ

where s_Ŷ is the standard error of the conditional mean, analogous to s_x̄.

One major difference between s_Ŷ and s_x̄ is that for small sample cases (defined as n ≤ 100 for regression analysis) the value of s_Ŷ varies at different points along the regression line, whereas s_x̄ is always the same for a given sample. For large samples, the width of the interval is treated as the same all along the line, as discussed below.
Concepts and Computational Procedures

In constructing a confidence interval estimate of the mean, the sample data were used to compute x̄, s, and s_x̄.

In regression analysis, s_y.x is the concept similar to s. It is called "the estimate of the population standard deviation conditional on x." Also in regression analysis, s_Ŷ is the concept similar to s_x̄. It is called "the standard error of the conditional mean," as mentioned above. The standard error of the conditional mean is also called "the standard error of the regression," and sometimes is simply called the "standard error of the estimate."

The conceptual similarity between these estimates is shown by comparing the formulas below. The subscript on s indicates the variable (x or y) to which reference is being made.

(A)  s_x = sqrt[ Σ(x − x̄)² / (n − 1) ]

Estimate of the population standard deviation based on the sample (of x's), where the (x − x̄)'s are deviations from the sample mean.

(B)  s_y = sqrt[ Σ(y − ȳ)² / (n − 1) ]

Estimate of the population standard deviation (of y) based on the sample (of x and y pairs), where the (y − ȳ)'s are the deviations from the sample mean ȳ, and the values of x are disregarded.

(C)  s_y.x = sqrt[ Σ(y − Ŷ)² / (n − 2) ]

Estimate of the population standard deviation (of y) conditional on x. The values of x are taken into account. The deviations (y − Ŷ) are deviations in y from the regression line at the points defined by the x's.

This formula (C) is conceptually correct, but it is not computationally convenient, because it would require:

1) Calculation of the value of Ŷ for each Xi
2) Determining the difference (Yi − Ŷ) for each Yi corresponding to each Xi.

Formulas have been developed for calculating s_y.x directly without computing the deviations from the line. The formulas are similar to

s_x = sqrt[ (Σx² − (Σx)²/n) / (n − 1) ]

which is the alternative formula for calculating s_x without determining the deviations (x − x̄) for use in formula (A) above.
Computational Formulas for s_y.x

There are two commonly used formulas for s_y.x. The first, shown below, uses only terms derived directly from the data. The second, simpler formula uses the terms a and b, the coefficients of the regression equation. If the coefficients a and b have already been calculated, the second formula could be used. They both produce the same result, given usual rounding errors.

(1) Using terms from the data to compute s_y.x:

s_y.x = sqrt[ (A − B²/C) / (n − 2) ]

At first glance, this formula appears formidable until you realize that there are three basic terms involved:

A = Σy² − (Σy)²/n
B = Σxy − (Σx)(Σy)/n
C = Σx² − (Σx)²/n

These three terms A, B, and C are also used in other formulas (for b and for (r), the coefficient of correlation). It is computationally convenient to record them for later use to avoid recalculation.

(2) Using regression coefficients a and b in computing s_y.x:

After a and b have been determined, the following equation can be used to compute s_y.x:

s_y.x = sqrt[ (Σy² − aΣy − bΣxy) / (n − 2) ]

In using this formula, be aware that if any error has been made in computing a and b, that error is carried into s_y.x.

Calculating the Standard Error of the Conditional Mean

Once the value has been calculated for the estimate of the standard deviation of the population of y conditional on x, s_y.x, then s_Ŷ can be calculated. There are two situations to be dealt with: the large sample and the small sample case.

(1) Large Sample Case: When the sample size is ≥ 100, the standard error of the conditional mean, s_Ŷ, can be determined simply as:

s_Ŷ = s_y.x / √n

which is analogous to the standard error of the mean.
(2) Small Sample Case: When the sample size is < 100, a second consideration must be made in computing s_Ŷ. This consideration is that, because of the small sample size, there may be error in positioning the regression line. Instead of the line being correctly positioned, the line may be incorrectly slanted upward or incorrectly slanted downward.

Possible Errors in Positioning Regression Lines

To compensate for the possible error in positioning the regression line when the sample size is small, a second term is added to the formula for s_Ŷ as shown below. Correct positioning here refers to the line being a good estimate of the parameter line, Y = α + βX + ε.

s_Ŷ = s_y.x · sqrt[ 1/n + (x − x̄)² / Σ(x − x̄)² ]

The formula includes this second term to compensate for error in positioning of the regression line. Note that the denominator in the second term under the radical, Σ(x − x̄)², is the term "C" of the three basic terms mentioned above in computing s_y.x.

In the confidence interval construction procedure, the net effect of the second term is to make the confidence interval estimate of Y become larger the farther x is from the mean. This gives a confidence interval whose width varies along the regression line for the small sample case. For the large sample case, it is assumed that the regression line is positioned correctly, thus the width of the confidence interval is the same all along the line.

Confidence Interval Estimates for Large and Small Sample Case

To continue the example of the relationship between sales orders taken and time spent with the buyer introduced above, the regression line

Ŷ = 48.55 + 6.38(x)

was computed. It was pointed out that the values of Ŷ represent a string of point estimates of the average sales orders taken conditional on x, the time spent with the buyer. A confidence interval estimate is a vertical interval above and below the regression line.

Since s_y.x is estimated from the sample and n = 6, the t distribution will be used with n − 2 = 4 degrees of freedom; t (4 df, 95% confidence) = 2.776.

(a) Confidence interval at x̄ = 19.17:

Ŷ = 48.55 + 6.38(19.17) = 170.86
(b) Confidence interval at 28:

Ŷ = 48.55 + 6.38(28) = 227.19
= 227.19 ± (2.776)(19.17)
= 227.19 ± 53.19
= $173.97 to $280.41

(c) Confidence interval at 12:

Ŷ = 48.55 + 6.38(12) = 125.11
= 125.11 ± (2.776)(16.98)
= 125.11 ± 47.19
= $77.97 to $172.25

95% Confidence Interval Estimate of Mean Sales Orders Taken Conditional on Time Spent with the Buyer

If the sales manager wishes to have an estimate of the average sales orders to be expected when 28 minutes are spent with the buyer, the point estimate would be Ŷ = $227.19, and the 95% confidence interval estimate would be from $173.97 to $280.41. Likewise, if 12 minutes are spent with the buyer, the point estimate of sales orders would be Ŷ = $125.11, and the 95% confidence interval estimate would be from $77.97 to $172.25.

Other factors affect sales orders taken besides time with the buyer. Additional variables can be evaluated with a multiple regression model.
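The interval arithmetic above can be reproduced directly. The sketch below computes s_y.x, the standard error of the conditional mean at x = 28 and x = 12, and the 95% limits; it is an illustration only (numpy and scipy assumed), and small differences from the text's figures are rounding.

```python
# Reproduce the 95% confidence intervals for the conditional mean.
import numpy as np
from scipy import stats

x = np.array([30, 20, 25, 15, 10, 15], dtype=float)
y = np.array([250, 200, 175, 125, 100, 175], dtype=float)
n = len(x)

b = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
a = y.mean() - b * x.mean()
resid = y - (a + b * x)
s_yx = np.sqrt((resid ** 2).sum() / (n - 2))   # est. sd of y conditional on x
t_crit = stats.t.ppf(0.975, df=n - 2)          # 2.776 with 4 df

for xg in (28, 12):
    y_hat = a + b * xg                         # point estimate
    s_yhat = s_yx * np.sqrt(1 / n + (xg - x.mean()) ** 2 /
                            ((x - x.mean()) ** 2).sum())
    print(xg, round(y_hat, 2),
          round(y_hat - t_crit * s_yhat, 2), round(y_hat + t_crit * s_yhat, 2))
# Roughly (28: 227, from 174 to 280) and (12: 125, from 78 to 172), as in the text.
```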
The Coefficient of Determination, r², and Coefficient of Correlation, r

Both r² and r are measures of the strength of the relationship between two variables. Definitions and calculations are often first made in terms of r², but tests of significance are made in terms of r.

In using a linear regression line for predicting a value of Ŷ for a given value of xi, we can consider that the line is a way of "explaining" some of the variation in Y as depending on x.

Since there is still some variation in yi around the regression line, the values of x have not "explained" all of the variation in the values of y. The total variation in y can be expressed, in terms of populations, as:

Total variation in y = variation explained by the regression + unexplained (residual) variation
The sample coefficient of determination, r², is defined in terms of (biased) estimates of the above terms as:

r² = explained variation / total variation

Although r² may take on only values between zero and one, r may be positive or negative; thus −1 ≤ r ≤ +1. The test of hypothesis for the correlation coefficient is therefore stated as follows:

H0: ρ = 0
H1: ρ ≠ 0

Once we have entered the data in SYSTAT we can request Analysis using Regression and Two-Stage Least Squares.
Click OK and the following output is generated from the analysis.

Testing the Coefficient b in the Regression Equation

A second way to test the strength of the correlation between x and y is to test b, the coefficient in the linear regression equation. This test is equivalent to the direct test of r described above, although it is b which is actually tested.

The logic of the test is this: if the slope of the regression line were absolutely flat, then b in the equation would be equal to zero. The equation would then be Y = a, where (a) is a constant, the y intercept. Thus knowing the value of x is of no value in predicting the value of y. To test b, the statistical hypothesis is:

H0: β = 0 and H1: β ≠ 0

From the SYSTAT output we see that the t value for Time Spent (beta) is 3.697 and is significant at a p-value of 0.021. If we are testing at a CL of 95%, our critical alpha would be equal to 0.05 and we would reject the null hypothesis of β = 0.
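The same test of b can be reproduced outside SYSTAT. The sketch below uses scipy's linregress (an assumption, not the text's software) on the six observations; it returns the slope, its standard error, and the two-sided p-value for H0: β = 0.

```python
# Test H0: beta = 0 for the time-spent example.
from scipy.stats import linregress

x = [30, 20, 25, 15, 10, 15]          # minutes with buyer
y = [250, 200, 175, 125, 100, 175]    # sales orders ($)

fit = linregress(x, y)
t_value = fit.slope / fit.stderr      # ~3.70, as in the SYSTAT output
print("b =", round(fit.slope, 2), "t =", round(t_value, 2),
      "p =", round(fit.pvalue, 3))    # p ~ 0.021 -> reject H0 at alpha = 0.05
```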
Section 2
Predicted Scores
1. Predicted and Residual Scores
2. R-square
3. Residual Diagnostics

The regression line expresses the best prediction of the dependent variable (Y), given the independent variables (X). However, nature is rarely (if ever) perfectly predictable, and usually there is substantial variation of the observed points around the fitted regression line. The deviation of a particular point from the regression line (its predicted value) is called the residual value.
For example:
The relationships in (a) and (b) are both linear, with (a) showing a strong positive relationship and (b) a negative relationship. To interpret the direction of the relationship between variables, look at the signs (plus or minus) of the regression or B coefficients. If a B coefficient is positive, then the relationship of this variable with the dependent variable is positive (e.g., the greater the independent variable, the higher the dependent variable); if the B coefficient is negative, then the relationship is negative (e.g., the greater the independent variable, the lower the dependent variable).

If X and Y are perfectly related, then there is no residual variance and the ratio of residual variance to total variance would be 0.0. All variation in Y is explained by X. If we know the value of X we can perfectly predict the value of Y.

In most cases, the ratio would fall somewhere between these extremes, that is, between 0.0 and 1.0. 1.0 minus this ratio is referred to as R-square, the coefficient of determination.
The line is fitted by minimizing the squared deviations of the observed y values from the line, rather than from the average of all y values. This makes the mathematics simpler. This method is ordinary least squares.

Residual Diagnostics

You do not need to understand the mathematics of how a line is fitted in order to use regression. You can fit a line to any x-y data. The computer doesn't care where the numbers come from. To have a model and estimates that mean something, however, you should be sure the assumptions are reasonable.

The sample of the errors in the model are the residuals, the differences between the observed and predicted values of the dependent variable. There are many diagnostics you can perform on the residuals. Here are several important ones:

The errors are normally distributed. Draw a normal probability plot (PPLOT) of the residuals. The residuals should fall approximately on a diagonal straight line in this plot. When the sample size is small the line may be quite jagged. It is difficult to tell by any method whether a small sample is from a normal population. You can also plot a histogram or stem-and-leaf diagram of the residuals to see if they are lumpy in the middle with thin, symmetric tails. SYSTAT offers tests to check normality: the Shapiro-Wilk test and the Anderson-Darling test.

The errors have constant variance. Plot the residuals against the estimated values. The following plot shows studentized residuals (STUDENT) against estimated values (ESTIMATE). Use these statistics to identify outliers in the dependent variable space. Under normal regression assumptions, they have a t distribution with (N − p − 1) degrees of freedom, where N is the total sample size and (p) is the number of predictors (including the constant). Large values (greater than 2 or 3 in absolute magnitude) indicate possible problems.

Plot of residuals (RESIDUAL) against estimated values (ESTIMATE), with one point labeled "Large Residual"

Our residuals should be arranged in a horizontal band within two or three units around 0 in this plot. Again, since there are so few observations, it is difficult to tell whether they violate this assumption in this case. There is only one particularly large residual, and it is toward the middle of the values.
The errors are independent. Several plots can be done. Examine the plot of
residuals against estimated values. Make sure that the residuals are randomly
scattered above and below the 0 horizontal and that they do not track in a snaky
way across the plot. If they do not look randomly scattered, they may not be independent of each other. You may also want to plot residuals against
other variables, such as time, orientation, or other ways that might influence the
variability of your dependent measure. ACF PLOT in SERIES measures whether
the residuals are serially correlated. Here is an autocorrelation plot:
Autocorrelation Plot
All the bars should be within the confidence bands if each residual is not
predictable from the one preceding it, and the one preceding that, and the one
preceding that, and so on.
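These residual checks can be scripted as well. The sketch below fits the example line and produces the three diagnostics discussed here: a normal probability plot, residuals against estimated values, and a rough lag-1 serial correlation check. It assumes numpy, scipy, and matplotlib; the text's own plots come from SYSTAT.

```python
# Basic residual diagnostics for the bivariate example.
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

x = np.array([30, 20, 25, 15, 10, 15], dtype=float)
y = np.array([250, 200, 175, 125, 100, 175], dtype=float)

fit = stats.linregress(x, y)
estimate = fit.intercept + fit.slope * x
residual = y - estimate

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
stats.probplot(residual, plot=ax1)      # normality: points near the diagonal line
ax2.scatter(estimate, residual)         # constant variance: horizontal band near 0
ax2.axhline(0.0)
plt.show()

# Independence: correlation of successive residuals (with n = 6 this is only a
# rough illustration, not a serious test).
lag1 = np.corrcoef(residual[:-1], residual[1:])[0, 1]
print("lag-1 residual correlation:", round(lag1, 2))
```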
Section 3
Correlation Coefficient
1. Interpreting the Correlation Coefficient
2. Assumptions
3. Limitations
4. Residual Analysis

Customarily, the degree to which two or more predictors (independent or X variables) are related to the dependent (Y) variable is expressed in the correlation coefficient R, which is the square root of R-square. In multiple regression, R can assume values between 0 and 1. To interpret the direction of the relationship between variables, look at the signs (plus or minus) of the regression or B coefficients. If a B coefficient is positive, then the relationship of this variable with the dependent variable is positive (e.g., the greater the IQ the better the grade point average); if the B coefficient is negative then the relationship is negative (e.g., the lower the class size the better the average test scores). Of course, if the B coefficient is equal to 0 then there is no relationship between the variables.
• Assumption of Linearity
• Normality Assumption
• Limitations
• Multicollinearity and matrix ill-conditioning
• The importance of residual analysis
• Choice of the number of variables
Assumption of Linearity
First of all, as is evident in the name multiple linear regression, it is assumed that the relationship
between variables is linear. In practice this assumption can virtually never be confirmed; fortunately,
multiple regression procedures are not greatly affected by minor deviations from this assumption.
However, as a rule it is prudent to always look at bivariate scatterplots of the variables of interest. If
curvature in the relationships is evident, you may consider either transforming the variables, or explicitly allowing for nonlinear components.

Other methods include Exploratory Data Analysis and Data Mining Techniques, the General Stepwise Regression, and the General Linear Models.

Normality Assumption

It is assumed in multiple regression that the residuals (predicted minus observed values) are distributed normally (i.e., follow the normal distribution). Again, even though most tests (specifically the F-test) are quite robust with regard to violations of this assumption, it is always a good idea, before drawing final conclusions, to review the distributions of the major variables of interest. You can produce histograms for the residuals as well as normal probability plots, in order to inspect the distribution of the residual values.

Limitations

The major conceptual limitation of all regression techniques is that you can only ascertain relationships, but never be sure about the underlying causal mechanism. For example, you would find a strong positive relationship (correlation) between the damage that a fire does and the number of firemen involved in fighting the blaze. Do we conclude that the firemen cause the damage? Of course, the most likely explanation of this correlation is that the size of the fire (an external variable that we forgot to include in our study) caused the damage as well as the involvement of a certain number of firemen (i.e., the bigger the fire, the more firemen are called to fight the blaze). Even though this example is fairly obvious, in real correlation research, alternative causal explanations are often not considered.

Multicollinearity and Matrix Ill-Conditioning

This is a common problem in many correlation analyses. Imagine that you have two predictors (X variables) of a person's height: (1) weight in pounds and (2) weight in ounces. Obviously, our two predictors are completely redundant; weight is one and the same variable, regardless of whether it is measured in pounds or ounces. Trying to decide which one of the two measures is a better predictor of height would be rather silly; however, this is exactly what you would try to do if you were to perform a multiple regression analysis with height as the dependent (Y) variable and the two measures of weight as the independent (X) variables. When there are very many variables involved, it is often not immediately apparent that this problem exists, and it may only manifest itself after several variables have already been entered into the regression equation. Nevertheless, when this problem occurs it means that at least one of the predictor variables is (practically) completely redundant with other predictors.

The Importance of Residual Analysis

Even though most assumptions of multiple regression cannot be tested explicitly, gross violations can be detected and should be dealt with appropriately. In particular, outliers (i.e., extreme cases) can seriously bias the results by "pulling" or "pushing" the regression line in a particular direction, thereby leading to biased regression coefficients. Often, excluding just a single extreme case can yield a completely different set of results.
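The pounds-versus-ounces redundancy is easy to demonstrate numerically. In the sketch below (hypothetical data, numpy assumed), the two weight columns are perfectly correlated and the design matrix loses a rank, which is exactly the ill-conditioning described above.

```python
# Demonstrate multicollinearity with two redundant predictors.
import numpy as np

rng = np.random.default_rng(0)
pounds = rng.uniform(100, 220, size=20)              # hypothetical weights
ounces = pounds * 16                                 # same information, new units
height = 50 + 0.1 * pounds + rng.normal(0, 2, size=20)

print("corr(pounds, ounces) =", round(np.corrcoef(pounds, ounces)[0, 1], 4))

X = np.column_stack([np.ones(20), pounds, ounces])   # intercept + two predictors
print("columns:", X.shape[1], "rank:", np.linalg.matrix_rank(X))
# Rank 2 with 3 columns: the matrix is singular, so the b's are not uniquely defined.
```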
Section 4
Multiple Regression
1. Multiple Regression
2. Partial Correlation
3. Purposes
4. Inferences
5. Considerations
6. Curvilinearity

Multiple regression (the term was first used by Pearson, 1908) is more technically referred to as multiple linear regression. Multiple refers to the fact that there is more than one independent variable, as compared to the bivariate or simple model that contains only one independent variable. Linear refers to the relationship between the Y and X's as being a first order equation. A linear equation is an equation of the first order, in other words, an equation containing no exponent greater than 1. Examples are y = 3x and y = 29 + 6x − 0.3z. Other functions are referred to as non-linear or curvilinear. Examples are equations of second order, e.g., y = x² and y = 21 − 0.6x², and equations of third order, e.g., y = x³ and y = 2 + 3x + 4x³, etc. The linearity assumption is discussed later under curvilinear models.
6. Curvilinearity
The general purpose of multiple regression is to learn more about the relationship between several
independent or predictor variables and a dependent or criterion variable. For example, a real estate
agent might record for each listing the size of the house (in square feet), the number of bedrooms, the
average income in the respective neighborhood according to census data, and a subjective rating of
appeal of the house. Once this information has been compiled for various houses it would be
interesting to see whether and how these measures relate to the price for which a house is sold. For
example, you might learn that the number of bedrooms is a better predictor of the price for which a
house sells in a particular neighborhood than how "pretty" the house is (subjective rating). You may
also detect "outliers," that is, houses that should really sell for more, given their location and
characteristics.
companies in the market, recording the salaries and respective characteristics (i.e., values on dimensions) for different positions. This information can be used in a multiple regression analysis to build a regression equation of the form:

Salary = .5*Resp + .8*No_Super

Once this so-called regression line has been determined, the analyst can now easily construct a graph of the expected (predicted) salaries and the actual salaries of job incumbents in his or her company. Thus, the analyst is able to determine which position is underpaid (below the regression line) or overpaid (above the regression line), or paid equitably.

The multiple regression model is a logical extension of the bivariate model and can be expressed as:

Y = a + b1*X1 + b2*X2 + ... + bp*Xp

where:

• Y = dependent variable
• a = y intercept
• b1 thru bn = regression coefficients for each of the respective independent variables
• x1 thru xn = independent variables

To relate the above model in more concrete terms, consider the equation Ŷ = a + b1*x1 + b2*x2, where Ŷ is sales in units, x1 is price in dollars, and x2 is advertising in dollars. Such an equation could allow analysis of the relationship between the dependent variable, sales, and the two independent variables, price and advertising.

Another example would be an expansion of the bivariate equation from earlier:

Y = a + b*X

• Y = GPA
• X = IQ

If in addition to IQ we had additional predictors of achievement (e.g., Motivation, Self-discipline) we could construct a linear equation containing all those variables. In general then, multiple regression procedures will estimate a linear equation of the form:

Y = a + b1*X1 + b2*X2 + ... + bp*Xp

The mathematics involved in estimating the parameters, i.e., a, b1, b2 as estimates for α, β1 and β2, of a multiple regression becomes extremely time consuming and complex when the regression equation includes more than two variables. There are several methods of estimating such equations, the most frequently employed method being matrix algebra. The least squares logic, as presented under bivariate regression, still applies. Statistical software programs are available to solve the more complex regression problem.

Unique Prediction and Partial Correlation

Note that in the equation, regression coefficients (or B coefficients) represent the independent contributions of each independent variable to the prediction of the dependent variable. Another way to express this fact is to say that, for example, variable X1 is correlated with the Y variable, after controlling for all other independent variables. This type of correlation is also referred to as a partial correlation (this term was first used by Yule, 1907). Perhaps the following example will clarify this issue. You would probably find a significant negative correlation between hair length and height in the population (i.e., short people have longer hair). At first this may seem odd; however, if we were to add the variable Gender into the multiple regression equation, this correlation would probably disappear. This is because women, on the average, have longer hair than men; they also are shorter on the average than men. Thus, after we remove this gender difference by entering Gender into the equation, the relationship between hair length and height disappears because hair length does not make any unique
contribution to the prediction of height, above and beyond what it shares in the prediction with the variable Gender. Put another way, after controlling for the variable Gender, the partial correlation between hair length and height is zero.
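The hair-length example can be simulated to show both ideas at once: fitting the multiple regression equation Y = a + b1*X1 + b2*X2 by least squares and watching a bivariate relationship vanish once the controlling variable enters. The data below are simulated, not real, and numpy is assumed.

```python
# Simulated illustration: hair length "predicts" height only until Gender enters.
import numpy as np

rng = np.random.default_rng(1)
n = 200
gender = rng.integers(0, 2, size=n)                     # 0 = male, 1 = female
hair = 5 + 15 * gender + rng.normal(0, 3, size=n)       # women: longer hair
height = 178 - 13 * gender + rng.normal(0, 6, size=n)   # women: shorter

# Bivariate: height regressed on hair length only.
b1 = np.polyfit(hair, height, 1)[0]
print("slope of height on hair alone:", round(b1, 2))   # clearly negative

# Multiple regression: height on hair length and gender (least squares fit).
X = np.column_stack([np.ones(n), hair, gender])
coef, *_ = np.linalg.lstsq(X, height, rcond=None)
print("hair coefficient controlling for gender:", round(coef[1], 2))  # near 0
```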
A random sample of data from the regional divisions of company records:

Multiple Correlation

Multiple correlation is the degree or strength of the relationship between more than two variables. In the previous unit simple (or bivariate or zero order) correlation was examined, i.e., the correlation between x and y. Simple correlation is usually denoted r, although a more satisfactory notation would be rxy. The subscripts tell which variables pertain to the coefficient.

In the case of multiple correlation, the subscripts become more important than for the bivariate case. In the example above, the multiple correlation coefficient would be R y●12 = 0.9108. For purposes here, the dot in the subscripts can be thought of as separating the dependent and independent variables. The Y subscript stands for the dependent variable and the (12) part of the subscript indicates that the correlation involves both of the independent variables. As with bivariate correlation, the R² is easier to interpret than the R value itself. The R² value is the portion of the variance in the dependent variable that is "explained" by the independent variables.

The unshaded part of Y is unrelated to either X1 or X2. The unshaded area is equal to 1 − R² and is often denoted k². k² is appropriately named the coefficient of non-determination. In this case k² = 1 − 0.8296 = 0.1704.

It is appropriate at this time to introduce some of the formulas for correlation and regression, and very similar formulas are useful in studying analysis of variance. Recall from the discussion of bivariate correlation that SSt = SSreg + SSres, where:

SSt = Total sum of squares
SSreg = Sum of squares due to the regression
SSres = Sum of squares unexplained by the regression

The portion or percentage interpretation of R² becomes clearer if a sum of squares formula is used:

R² = SSreg / SSt

F obs = (SSreg / (k − 1)) / (SSres / (N − k))

where N and k are the sample size and the number of coefficients in the equation respectively. Computer programs use the F and t statistics in very specific ways and this is the topic of the next section.
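For completeness, the R², k², and F pieces fit together as in the short sketch below. The sample size N and the number of coefficients k are hypothetical here; only R = 0.9108 comes from the example above.

```python
# Relationship between R-square, the coefficient of non-determination, and F.
R = 0.9108          # multiple correlation from the example
N = 15              # hypothetical sample size
k = 3               # hypothetical: intercept plus two predictors

r_square = R ** 2                     # ~0.8296, explained share of variance
k_square = 1 - r_square               # ~0.1704, coefficient of non-determination
f_obs = (r_square / (k - 1)) / (k_square / (N - k))

print(round(r_square, 4), round(k_square, 4), round(f_obs, 2))
```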
Inferences in Multiple Regression
where s_Xi = the standard deviation of the specific independent variable under consideration.
As a Venn diagram, they would appear as:

The question now is how to assign the beta weights (or regression weights) to the independent variables. In other words, how should the overlap of X1 and X2 be proportioned when explaining Y?

There is not one satisfactory answer to this question that will apply in all situations. First, it is difficult to set a cut off point to detect when two variables are collinear, because colinearity is a matter of degree. One practical check is to reduce (or increase) the number of observations by about 10 percent and rerun the equation. If the beta weights (or regression weights) change significantly or reverse their sign, colinearity is present. The solution then is to eliminate the variable(s) that is causing the problem. Which variable to eliminate is a subjective choice. Further symptoms of, and treatments for, colinearity are found in the many available regression texts.

To detect heteroscedasticity, it is also useful to list the error term and each independent variable from the smallest value to the largest value. If e increases as Xi increases, then heteroscedasticity is most likely a problem, although autocorrelation, as defined below, can also be the culprit.

The cure for heteroscedasticity is to divide each variable in the equation by the X variable that is causing the problem. This has the effect of standardizing the variance and hence eliminating the problem.

The fourth consideration is a frequent problem in multiple regression studies known as autocorrelation. Actually the technical name for this problem is first-order autoregression, but it is commonly referred to as autocorrelation or serial correlation. It occurs only in time series data, and is defined as a systematic correlation between successive observations. Serial correlation is detected by examining scatter plots of the error terms. If the error terms appear as a linear function (rather than random scatter), autocorrelation is most likely a problem. First-order autocorrelation can also be detected with the Durbin-Watson statistic
or the Von Neumann ratio. It would be unnecessarily burdensome to discuss these statistics here, as it is quite effective to simply examine a scatter plot of the error term.

The problem of autocorrelation is usually caused by a variable omitted from the equation. It can also be caused by erroneous specification of the functional form of the equation. The problem of specifying functional form is discussed briefly in the next section.

Curvilinearity

Thus far the assumptions outlined for bivariate regression have sufficed, with the exception of multicolinearity, which cannot exist in the bivariate case. One important assumption that can be relaxed is that of linearity. Surprisingly, most models conform well to the linearity assumption. On the other hand, a linear model is often used when there is little available evidence of the theoretical form of the relationship, or a linear approximation is a sufficiently precise estimate of a complex form. The form or combinatorial rule applied to an equation is usually referred to as functional form. There are several ways of relating dependent and independent variables to make a function linear. Examples would be logarithmic, semi-logarithmic, and multiplicative models. The important element here is to recognize that a curvilinear model sometimes produces a better fit to the data than a linear model. A combination of theory, logic, and data analysis will indicate when such models are appropriate.

CAVEAT

Multiple regression is a seductive technique: "plug in" as many predictor variables as you can think of and usually at least a few of them will come out significant. This is because you are capitalizing on chance when simply including as many variables as you can think of as predictors of some other variable of interest. This problem is compounded when, in addition, the number of observations is relatively low. Intuitively, it is clear that you can hardly draw conclusions from an analysis of 100 questionnaire items based on 10 respondents. Most authors recommend that you should have at least 10 to 20 times as many observations (cases, respondents) as you have variables; otherwise the estimates of the regression line are probably very unstable and unlikely to replicate if you were to conduct the study again.
Chapter 6
What do we know?
Every day we use our built-in statistical calculator to make decisions. We estimate the probability of a price/object/offer difference and the chance that this one price/object/offer is outside the normal range of variations we have observed. This is the essence of statistical inference. We take data
from a sample to represent the general population. We transform the data from its raw form by sorting,
describing and analyzing. This allows us to make inferences about the world at large (the population
represented by our sample).
Managers need information in order to introduce products and services that create value in the mind of
the customer. But the perception of value is a subjective one, and what customers value this year may
be quite different from what they value next year. As such, the attributes that create value cannot simply
be deduced from common knowledge. Rather, data must be collected and analyzed. The goal of
research is to provide the facts and direction that managers need to make their more important
decisions. The value of the information provided is dependent on how well one effectively and
efficiently executes the research process.
Scientific procedures involve the use of models to describe, analyze, and make predictions. A model can be a well-defined set of descriptions and procedures like the Product Life Cycle in marketing, or it can be a scaled down analog of the real thing, such as an engineer's scale model of a car.

Models that are useful and valuable in managing business operations are broadly termed "statistical models". A large number of these have been developed to assist researchers in a variety of fields, such as agriculture, psychology, education, communication, and military tactics, as well as in business. Only the professional statistician would be expected to understand the full range of such statistical models.

The statistical methods presented in this text have been chosen to cover a variety of problems generally encountered in business. The reader is encouraged to seek out additional readings for detailed coverage of each model and additional statistical methods beyond the scope of this text.

The approach to learning models is to emphasize the similarity in logical structure among various models, and to stress understanding of the assumptions inherent in each model. This makes it possible to make sense from a seeming quagmire of statistical methods, and to clearly and logically determine when it is appropriate to use a specific statistical model. As a manager reading research reports, it is then possible to determine if the correct techniques were utilized.

A business manager needs to understand and address the limitations of statistical models, emphasizing what the techniques do not say, rather than simply how to properly interpret conclusions from statistical tests.

Finally, a business manager is not expected to be a statistician. The objective is to properly understand and interpret results from statistical models. A manager should be able to ask the right questions of the researcher or statistician in order to evaluate and apply results.

A major distinction between research and everyday observation is that research is planned in advance. Based on a theory or hunch, researchers develop research questions and then plan what, when, where and how to observe in order to answer the questions.

What (or whom) to observe, the population. When a population is large, researchers often plan to observe only a sample (i.e., a subset of a population). Planning how to draw an adequate sample is, of course, critical in conducting valid research.

When the observations will be made. (Morning, night…). Researchers realize that the timing of their observations may affect the results of their investigations.

Where to make the observations. For instance, will the observations be made in a quiet room or in a busy shopping mall?

How to observe. Use an existing questionnaire or develop a new survey instrument. For instance, researchers might build or adopt existing interviews, questionnaires, personality scales, etc., to use in making observations.

The observations that researchers make result in data. The data might be the brands participants plan to purchase or the data might be respondent scores on a scale that measures preference. In this context, variables are things that we measure, control, or manipulate in research. The participants (respondents) with the variables represent our data.

Why is understanding necessary if prediction is okay without it? Theories in business are generally not very highly refined and represent a great deal of abstraction from the real world situation. The use of statistics does not overcome these deficiencies; it may even fool some people by making them think they understand more than they do. It is essential to be constantly aware of what is known and what is not known.
The Value of Information
Information can be useful, but what determines its real value to the organization?
In general, the value of information is determined by:
To maximize the value of information obtained, those who use information need to
understand the research process and its limitations.
Section 2
1. Executive Summary
2. Purpose
3. Objectives
4. Methodology
5. Findings
6. Conclusions and Recommendations

Title Page

The title page should include the title of the report along with the name(s) of the client or organization for whom the report is written. Also included on the title page should be the name(s) of the author(s) of the report along with all pertinent information about them.

Table of Contents

The table of contents lists the information contained in the report in the order in which it will be found. All major topics of interest should be listed.
Executive Summary
The executive summary should be a one to two page overview of the information contained in the
research report. It should give the reader an easy reference, in very brief form, to the important
information contained in the report and explained in more detail in the body of the report. People
attending a presentation of research or reading the report will use this section as a reference during
presentations and as a synopsis of the research done.
A one to two page summary of the entire report, focusing on Findings, Conclusions,
Recommendations.
Background and Purpose of Research

• Problem/opportunity
• Background & context (summary of background & exploratory research)

The introduction should contain a brief overview of the problem being addressed and the background information needed for the reader to understand the work being done and the reasoning behind it. After reading the introduction, the reader should know exactly what the report is about, why the research was conducted, and how this research adds to the knowledge that the reader may have about the topic.

This section will contain all of the information that was collected through review of existing information. The importance of the secondary information as it pertains to the problem being researched must be made clear to the reader. Conclusions should be drawn in a logical fashion and insight into how these conclusions will be used throughout the rest of the research agenda should be provided.

Qualitative Research (if used):

This section should contain all information regarding any interviews or focus groups that were conducted as part of the research project. This section should begin with an explanation of why this research is needed or beneficial. Other information provided should include:

• An overview of the issues that were included in this research
• Why these issues were salient
• How the discussion guide was developed
• Identification and description of the variables included in the experiment
• Clear statement of the hypothesized relationships between or among the variables
• Explanation of how the variables were measured
• Discussion of reliability and validity of the measurements
• Conditions under which the experiment was conducted
• Description (not identification) of the subjects
• Description of data collection
• Analysis of data, including details of procedures used and statistical significance
• Conclusions clearly based on data analysis
• How these conclusions will contribute to the rest of the research project

Observation (if used):

If observation was a part of the research project, you will need to explain several things to the reader or attendee at your presentation, starting with why this method is appropriate for your research goals. In addition, the following topics should all be part of the final report:

• Explanation of why observation was appropriate
• Location and conditions under which observation was conducted
• Description of the population observed
• The recording methods used
• Methods used to interpret observed behaviors
• Conclusions drawn from observation
• Explanation of bases for those conclusions
• How these conclusions will contribute to the rest of the research project

Survey (if used):

This is the section that should pull together all of the other issues that were identified in the research steps that were conducted previously. The connections to the issues and constructs identified earlier should be made again here so that the reader can easily see the foundations that are being used. Many issues will have to be addressed in this section regarding how the survey was developed and how it was administered. Topics discussed in this section should include:

• Identification of all issues included on the survey
• Explanation of the importance of the selected issues to the project
• Development of the survey questions and wording
• Sources of survey questions (existing scales or newly created)
• Description of population of interest
• Explanation of target population appropriateness
• Determination of sample size needed
• Sampling procedures (random or convenience)
• Determination of the sample population
• Method of survey distribution

Data Analysis

In this section, the reader should find a brief overview of the methods that were utilized in the research, the reasons that those methods were appropriate for the
research problem, and an explanation of how the outcomes for those methods can be understood and interpreted. It is important to remember that the people reading your report or listening to your presentation may not be familiar with the analysis methods being used. You must present the methods in such a way that anyone interested in your research will be able to understand what was done and why it was done. This section should include the following:

• Overview of analysis methods used
• Justification for methods chosen
• Outcomes of analysis
• Significance of results (statistical and otherwise)

Detailed Findings

• Results for each question
• Tables
• Graphs
• Cross-tabulations
• Text summary of findings

Findings

The findings are the actual results of your research. Your findings should consist of a detailed presentation of your interpretation of the statistics found relating to the study itself and analysis of the resulting data collection. The judicious use of figures, tables and graphs is encouraged when it is helpful to allow the reader to more easily understand the work being presented. The findings section should include the following:

• Findings based only on results of the research, not speculation
• In-depth explanation of all major findings
• Clear presentation of support for the findings

Limitations:

Recognize that even the best marketing research work is not perfect and is open to questioning. In this section, briefly discuss the factors that may have influenced your findings but were outside of your control. Some of the limitations may be time constraints, budget constraints, market changes, certain procedural errors, and other events. Admit that your research is not perfect but discuss the degree of accuracy with which your results can be accepted. In this section, suggestions can be offered to correct these limitations in future research.

Conclusions and Recommendations

• Summary of conclusions
• Recommendations (i.e., solve the problem/opportunity)
• Future research suggestions (if any)

Conclusions are broad generalizations that focus on addressing the research questions for which the project was conducted. Recommendations are your choices for strategies or tactics based on the conclusions that you have drawn. Quite often authors are tempted to speculate on outcomes that cannot be supported by the research findings. Do not draw any conclusions or make any recommendations that your research cannot clearly support.

References

This section should be a listing of all existing information sources used in the research project. It is important to allow the reader to see all of the sources used
194
and enable the reader to further explore those sources to verify the information
presented.
Appendices
• Anything else which is relevant, but not appropriate for the main body of
the report
This section should include all supporting information from the research project
that was not included in the body of the report. You should include surveys,
complex statistical calculations, certain detailed tables and other such information
in an appendix. The information presented in this section is important to support
the work presented in the body of the report but would make it more difficult to
read and understand if presented within the body of the report.
195
Section 3
Decision Rules: Choosing Among Statistical Techniques

Statistics is about this whole process of using the scientific method to answer questions and make decisions. Effective decision-making involves correctly designing studies, collecting unbiased data, describing the data with numbers and graphs, analyzing the data to draw inferences, and reaching conclusions based on the transformation of data into information.

Step 1: Was the sample(s) drawn according to the requirements for a random sample? If YES: go to Step 2. If NO: you cannot use any statistical model which is based on probability theory. However, one can describe the data, i.e., its mean, standard deviation, quartiles, etc. One might also be able to fit a line to a set of data by the method of least squares, but could not construct a meaningful confidence interval.

Step 3: What are the number of samples involved?
One Random Sample:
1. Normal (Z)
2. Student's (t)
3. Binomial (itself)
4. Normal approx. to Binomial
5. Pearson's Product Moment Correlation
6. Chi Square

Step 4: Is the normal population assumption required?
Yes for parametric tests: Z, t, F, and Pearson product moment correlation (variables X and Y). Samples must have been drawn from a normal or large population. Note: the normal approximation to the binomial requires np and nq ≥ 5 in order to approximate the binomial distribution with a normal distribution.
No for non-parametric tests: Chi Square, Mann-Whitney, Wilcoxon, Spearman rank correlation, and the Binomial itself.
Yes for Z and t two-means tests and for Pearson Product Moment Correlation (homoscedasticity, i.e., the variance of Yi from the regression line is the same for all values of X).

Step 6: Check other assumptions required for different models (tests) to have valid application:
a. No less than 20% of cells may have fe < 5, and none < 1.
5. In the Wilcoxon matched pairs signed ranks test, differences of zero in (response 2) minus (response 1) are dropped from the analysis and n is reduced by one.

Step 7: Lastly, if a random sample has been drawn, but none of the above tests meet the circumstances of the problem, refer to an advanced text on Statistical Methods.
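As a minimal sketch of how these decision rules play out in practice (assuming the scipy library is available; the two samples and the 0.05 cutoff below are invented for the illustration), one can check the normal population assumption and then choose between a parametric and a non-parametric comparison of two independent samples:

    # A minimal sketch of Steps 4-6: check normality, then pick a parametric
    # or non-parametric test for comparing two independent samples.
    from scipy import stats

    group_a = [12.1, 11.4, 13.0, 12.7, 11.9, 12.5, 13.2, 12.0]
    group_b = [10.9, 11.1, 12.2, 11.5, 10.8, 11.7, 11.3, 11.0]

    # Shapiro-Wilk test of the normal population assumption
    normal_a = stats.shapiro(group_a).pvalue > 0.05
    normal_b = stats.shapiro(group_b).pvalue > 0.05

    if normal_a and normal_b:
        # Parametric route: two-sample t-test (equal variances not assumed)
        result = stats.ttest_ind(group_a, group_b, equal_var=False)
    else:
        # Non-parametric route: Mann-Whitney U test
        result = stats.mannwhitneyu(group_a, group_b)

    print(result)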
Methods for Summarizing Data
Section 4
Statistical Formulas

Sample mean: x̄ = Σxi / n
Population mean: µ = Σxi / N
Range: largest observation − smallest observation
Population variance: σ² = Σ(xi − µ)² / N
Sample variance: s² = Σ(xi − x̄)² / (n − 1)
Sample standard deviation: s = √s²
Population covariance: σxy = Σ(xi − µx)(yi − µy) / N
Sample covariance: sxy = Σ(xi − x̄)(yi − ȳ) / (n − 1)
Population coefficient of correlation: ρ = σxy / (σx σy)
Sample coefficient of correlation: r = sxy / (sx sy)
Coefficient of determination: R² = r²
Least squares slope: b1 = sxy / sx²
Least squares y-intercept: b0 = ȳ − b1 x̄

Probability
Conditional probability: P(A|B) = P(A and B) / P(B)
Complement rule: P(Aᶜ) = 1 − P(A)
Multiplication rule: P(A and B) = P(A|B) P(B)
Addition rule: P(A or B) = P(A) + P(B) − P(A and B)

Expected value of a discrete random variable: E(X) = µ = Σ x P(x)
Variance of a discrete random variable: V(X) = σ² = Σ (x − µ)² P(x)
Standard deviation: σ = √σ²
Covariance: COV(X, Y) = σxy = Σ (x − µx)(y − µy) P(x, y)
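As a quick illustration of the descriptive formulas above (assuming numpy is installed; the x and y values are invented), the following computes the sample mean, variance, standard deviation, covariance, and correlation:

    import numpy as np

    x = np.array([2.0, 4.0, 5.0, 7.0, 9.0])
    y = np.array([1.5, 3.0, 4.5, 6.5, 8.0])

    n = len(x)
    x_bar = x.sum() / n                      # sample mean
    s2 = ((x - x_bar) ** 2).sum() / (n - 1)  # sample variance
    s = np.sqrt(s2)                          # sample standard deviation
    s_xy = ((x - x_bar) * (y - y.mean())).sum() / (n - 1)  # sample covariance
    r = s_xy / (x.std(ddof=1) * y.std(ddof=1))             # sample correlation

    print(x_bar, s2, s, s_xy, r)
    print(np.corrcoef(x, y)[0, 1])  # same correlation computed by numpy directly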
Coefficient of correlation: ρ = σxy / (σx σy)

Laws of variance:
1. V(c) = 0
2. V(X + c) = V(X)
3. V(cX) = c² V(X)

Laws of expected value and variance of the sum of two variables:
1. E(X + Y) = E(X) + E(Y)
2. V(X + Y) = V(X) + V(Y) + 2 COV(X, Y)

Laws of expected value and variance for the sum of more than two variables:
1. E(X1 + X2 + ... + Xk) = E(X1) + E(X2) + ... + E(Xk)
2. V(X1 + X2 + ... + Xk) = V(X1) + V(X2) + ... + V(Xk), when the variables are independent

Mean and variance of a portfolio of two stocks:
E(Rp) = w1 E(R1) + w2 E(R2)
V(Rp) = w1² V(R1) + w2² V(R2) + 2 w1 w2 COV(R1, R2)

Binomial probability:
P(X = x) = [n! / (x!(n − x)!)] p^x (1 − p)^(n − x), x = 0, 1, ..., n

Poisson probability:
P(X = x) = (e^(−µ) µ^x) / x!, x = 0, 1, 2, ...
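The binomial and Poisson probabilities above can be evaluated directly; this is a minimal sketch (scipy assumed available, parameter values invented for the example):

    from scipy import stats

    # Binomial: probability of exactly 3 successes in n = 10 trials with p = 0.2
    print(stats.binom.pmf(3, n=10, p=0.2))

    # Poisson: probability of exactly 2 events when the mean number of events is 3.5
    print(stats.poisson.pmf(2, mu=3.5))

    # Normal approximation check: np and n(1 - p) should both be at least 5
    n, p = 10, 0.2
    print(n * p, n * (1 - p))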
Continuous Probability Distributions

Exponential distribution:
f(x) = λ e^(−λx), x ≥ 0; P(X > x) = e^(−λx); P(X < x) = 1 − e^(−λx)

Standardizing the sample mean:
Z = (x̄ − µ) / (σ/√n)
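For example, the exponential probabilities and the standardized sample mean can be checked numerically, as in this sketch (λ, µ, σ, n, and the observed x̄ are invented for the illustration; scipy assumed available):

    import math
    from scipy import stats

    # Exponential distribution with rate lambda = 0.5: P(X > 2) = e^(-lambda * x)
    lam = 0.5
    print(math.exp(-lam * 2))
    print(stats.expon.sf(2, scale=1 / lam))  # same value from scipy (scale = 1/lambda)

    # Standardizing a sample mean: Z = (x_bar - mu) / (sigma / sqrt(n))
    x_bar, mu, sigma, n = 103.0, 100.0, 12.0, 36
    z = (x_bar - mu) / (sigma / math.sqrt(n))
    print(z, stats.norm.sf(z))  # z value and P(X_bar > 103)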
Expected value of the difference between two means: E(x̄1 − x̄2) = µ1 − µ2
Standard error of the difference between two means: σ(x̄1 − x̄2) = √(σ1²/n1 + σ2²/n2)

Introduction to Estimation
Confidence interval estimator of µ (σ known): x̄ ± z(α/2) σ/√n (the ends of the interval are the LCL and UCL)

Introduction to Hypothesis Testing
Test statistic for µ (σ known): z = (x̄ − µ) / (σ/√n)

Inference about One Population
Test statistic for µ (σ unknown): t = (x̄ − µ) / (s/√n), with ν = n − 1
Confidence interval estimator of µ: x̄ ± t(α/2) s/√n
Test statistic for p: z = (p̂ − p) / √[p(1 − p)/n]
Confidence interval estimator of p: p̂ ± z(α/2) √[p̂(1 − p̂)/n]
Confidence interval estimator of the total of a large finite population: N [x̄ ± t(α/2) s/√n]
Confidence interval estimator of the total in a small population: N [x̄ ± t(α/2) (s/√n) √((N − n)/(N − 1))]
Confidence interval estimator of p when the population is small: p̂ ± z(α/2) √[p̂(1 − p̂)/n] √((N − n)/(N − 1))
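A minimal sketch of the one-population estimators above, with invented sample data (scipy and the Python standard library assumed):

    import math
    import statistics
    from scipy import stats

    data = [12.3, 11.8, 12.9, 13.1, 12.0, 12.6, 11.9, 12.4, 12.8, 12.2]
    n = len(data)
    x_bar = statistics.mean(data)
    s = statistics.stdev(data)                  # sample standard deviation (n - 1 in the denominator)

    # 95% confidence interval estimator of mu: x_bar +/- t(alpha/2) * s / sqrt(n)
    t_crit = stats.t.ppf(0.975, df=n - 1)
    half_width = t_crit * s / math.sqrt(n)
    print(x_bar - half_width, x_bar + half_width)   # LCL and UCL

    # Confidence interval estimator of p: p_hat +/- z(alpha/2) * sqrt(p_hat * (1 - p_hat) / n)
    successes, trials = 130, 400
    p_hat = successes / trials
    z_crit = stats.norm.ppf(0.975)
    half = z_crit * math.sqrt(p_hat * (1 - p_hat) / trials)
    print(p_hat - half, p_hat + half)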
Inference about Two Populations

Unequal-variances t-test of µ1 − µ2:
t = [(x̄1 − x̄2) − (µ1 − µ2)] / √(s1²/n1 + s2²/n2)
ν = (s1²/n1 + s2²/n2)² / [ (s1²/n1)²/(n1 − 1) + (s2²/n2)²/(n2 − 1) ]

Unequal-variances interval estimator of µ1 − µ2:
(x̄1 − x̄2) ± t(α/2) √(s1²/n1 + s2²/n2)

t-Test of µD (matched pairs): t = (x̄D − µD) / (sD/√n), with ν = n − 1
t-Estimator of µD: x̄D ± t(α/2) sD/√n

z-Test of p1 − p2:
Case 1 (H0: p1 − p2 = 0): z = (p̂1 − p̂2) / √[ p̂(1 − p̂)(1/n1 + 1/n2) ], where p̂ is the pooled proportion (x1 + x2)/(n1 + n2)
Case 2 (H0: p1 − p2 = D, D ≠ 0): z = [(p̂1 − p̂2) − D] / √[ p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2 ]
z-Estimator of p1 − p2: (p̂1 − p̂2) ± z(α/2) √[ p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2 ]

F-test of σ1²/σ2²: F = s1²/s2², with ν1 = n1 − 1 and ν2 = n2 − 1
F-Estimator of σ1²/σ2²:
LCL = (s1²/s2²) [1 / F(α/2, ν1, ν2)]
UCL = (s1²/s2²) F(α/2, ν2, ν1)
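To illustrate the two-population formulas, the following sketch (sample values invented, scipy assumed) runs an unequal-variances t-test, a matched-pairs t-test, and an F ratio of two sample variances:

    import numpy as np
    from scipy import stats

    a = np.array([23.1, 25.4, 22.8, 26.0, 24.3, 25.1])
    b = np.array([21.9, 22.5, 23.0, 21.4, 22.8, 22.1])

    # Unequal-variances (Welch) t-test of mu1 - mu2
    print(stats.ttest_ind(a, b, equal_var=False))

    # Matched-pairs t-test of mu_D (here a and b are treated as paired responses)
    print(stats.ttest_rel(a, b))

    # F ratio of the two sample variances, F = s1^2 / s2^2, with a two-tailed p-value
    f = a.var(ddof=1) / b.var(ddof=1)
    p_value = 2 * min(stats.f.sf(f, len(a) - 1, len(b) - 1),
                      stats.f.cdf(f, len(a) - 1, len(b) - 1))
    print(f, p_value)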
Analysis of Variance

One-way ANOVA:
SS(Total) = SST + SSE
SST = Σ nj (x̄j − x̄G)², where x̄G is the grand mean
SSE = Σ (nj − 1) sj²
MST = SST / (k − 1)
MSE = SSE / (n − k)
F = MST / MSE

Randomized block design:
SS(Total) = SST + SSB + SSE
MST = SST / (k − 1), MSB = SSB / (b − 1), MSE = SSE / (n − k − b + 1)
F = MST / MSE (treatments) and F = MSB / MSE (blocks)

Two-factor ANOVA:
SS(Total) = SS(A) + SS(B) + SS(AB) + SSE
F = MS(A) / MSE, F = MS(B) / MSE, F = MS(AB) / MSE
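For example, the one-way ANOVA F statistic can be obtained directly from raw data; the three groups below are invented for the illustration (scipy assumed):

    from scipy import stats

    group1 = [18.2, 19.1, 17.8, 18.9, 18.4]
    group2 = [20.3, 21.0, 19.8, 20.6, 20.1]
    group3 = [17.5, 18.0, 17.2, 17.9, 17.6]

    # One-way ANOVA: F = MST / MSE, testing whether all group means are equal
    f_stat, p_value = stats.f_oneway(group1, group2, group3)
    print(f_stat, p_value)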
Tukey's multiple comparison method:
ω = qα(k, ν) √(MSE / ng), where qα(k, ν) is the critical value of the studentized range and ng is the number of observations in each group

Simple Linear Regression

Sample slope: b1 = sxy / sx²
Sample y-intercept: b0 = ȳ − b1 x̄
Sum of squares for error: SSE = Σ (yi − ŷi)²
Standard error of estimate: sε = √[SSE / (n − 2)]
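A short sketch of the simple linear regression quantities above (slope, intercept, R², and standard error of estimate), using invented data and scipy:

    import numpy as np
    from scipy import stats

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([2.1, 4.3, 5.9, 8.2, 9.8, 12.1])

    result = stats.linregress(x, y)
    print(result.slope, result.intercept)        # b1 and b0
    print(result.rvalue ** 2)                    # coefficient of determination R^2

    # Standard error of estimate: s_e = sqrt(SSE / (n - 2))
    y_hat = result.intercept + result.slope * x
    sse = ((y - y_hat) ** 2).sum()
    print(np.sqrt(sse / (len(x) - 2)))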
Coefficient of determination: R² = sxy² / (sx² sy²) = 1 − SSE / Σ(yi − ȳ)²

Prediction interval: ŷ ± t(α/2, n − 2) sε √[1 + 1/n + (xg − x̄)² / ((n − 1) sx²)]

Multiple Regression

Coefficient of determination: R² = 1 − SSE / Σ(yi − ȳ)²
Adjusted coefficient of determination: adjusted R² = 1 − [SSE / (n − k − 1)] / [Σ(yi − ȳ)² / (n − 1)]
Mean squares: MSR = SSR / k and MSE = SSE / (n − k − 1)
F-statistic: F = MSR / MSE

Durbin-Watson statistic: d = Σ(ei − ei−1)² / Σ ei² (the first sum runs from i = 2 to n, the second from i = 1 to n)
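The multiple regression and Durbin-Watson quantities can be produced with the statsmodels package, assuming it is installed; the data below are simulated purely for illustration:

    import numpy as np
    import statsmodels.api as sm
    from statsmodels.stats.stattools import durbin_watson

    rng = np.random.default_rng(1)
    x1 = rng.normal(size=40)
    x2 = rng.normal(size=40)
    y = 2.0 + 1.5 * x1 - 0.8 * x2 + rng.normal(scale=0.5, size=40)

    X = sm.add_constant(np.column_stack([x1, x2]))   # adds the y-intercept column
    model = sm.OLS(y, X).fit()

    print(model.rsquared, model.rsquared_adj)        # R^2 and adjusted R^2
    print(model.fvalue, model.f_pvalue)              # F = MSR / MSE and its p-value
    print(durbin_watson(model.resid))                # Durbin-Watson statistic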
Section 5
Video Links

The following video links are provided to aid in the review of the many concepts presented throughout the text.

Introduction to Quantitative Research, Measurement Levels, Frequency Tables, and Graphics:
http://www.youtube.com/watch?v=jhVdiIBQQHE&playnext=1&list=PLB815A3C15C587645
http://www.youtube.com/watch?v=5fDBOq5x5nk
http://www.youtube.com/watch?v=Nol6wS9Wj4M

Measures of Central Tendency, Measures of Dispersion, Correlation, Regression, Reliability, Validity, Probability
Central Tendency:
http://www.youtube.com/watch?v=HP6ip-dDKxE
http://www.youtube.com/view_play_list?p=2B567EA871FF4171
Probability and Events:
http://www.youtube.com/watch?v=BolCgB4YGMw
Calculating the Probability of Simple Events:
http://www.youtube.com/watch?v=BAjOEsU_mpE
http://www.youtube.com/watch?v=Kuz3ZHLVj_k
Probability Basics with Excel:
http://www.youtube.com/watch?v=88FCRYjyySc
Basic Rules of Probability:
http://www.youtube.com/watch?v=7sg8bo_BGeM
http://www.youtube.com/watch?v=81zcjULlh58
What is Correlation?:
http://www.youtube.com/watch?v=Ypgo4qUBt5o

Statistical Inference
Sampling Distribution of a Sample Proportion (Central Limit Theorem):
http://www.youtube.com/watch?v=aE11B6deuNQ
Sampling Distribution of the Sample Mean:
http://www.youtube.com/watch?v=FXZ2O1Lv-KE&feature=related
http://www.youtube.com/watch?v=qudsqhWBApA
Sampling Distribution of Sample Mean & Central Limit Theorem:
http://www.dailymotion.com/video/xd9fc5_how-to-graph-the-normal-distributio_school
Applications of the Normal Distribution:
http://www.youtube.com/watch?v=bYnIIZbeFes
Statistical Inference:
http://www.youtube.com/watch?v=zRKDDADMDO8

Hypothesis Testing, Sampling, Types of Errors, Statistical Significance, Effect Size, Power
Hypothesis Testing:
http://www.youtube.com/watch?v=tpdmnFWcSn0
http://www.youtube.com/watch?v=rHAxhlmbRPU
http://www.youtube.com/watch?v=kMxDtJL3RFY&feature=relmfu
Statistical and Practical Significance:
http://www.youtube.com/watch?v=rOyK_K0SOaU

Statistical Inference, Hypothesis Testing, Comparing Two Means, Comparing Two Proportions
Comparing means:
http://www.youtube.com/watch?v=rZuYwJupGus
http://www.youtube.com/watch?v=9je5FMppLlU
http://www.youtube.com/watch?v=2iclH6eysgY&playnext=1&list=PLC54F478AEBCE4EE4

Lean Sigma Search: Leveraging Lean & Six Sigma in Executive Search:
http://www.youtube.com/watch?v=Mr_4JtI2LvI
Six Sigma for Managers: Intro to SSFM:
http://www.youtube.com/watch?v=vtspuAPsOl0
http://www.youtube.com/watch?v=tlDO4KFkxeI

Simple regression:
http://www.youtube.com/watch?v=GCEyZxS6vn8
http://www.youtube.com/watch?v=wOmqP9auN0Y
http://www.youtube.com/watch?v=CSYTZWFnVpg
Non-parametric:
http://www.youtube.com/watch?v=W6SBqH_nlV4
References and Resources

The following references and resources are provided as a starting point for the reader interested in learning more about any of the topics covered in this text. No attempt has been made to provide an exhaustive list. There are many great books on statistics that focus on specific techniques and procedures.

Afifi, A. A., May, S., and Clark, V. (2012). Practical multivariate analysis, 5th ed. New York: Chapman & Hall.
Akaike, H. (1973). Information theory as an extension of the maximum likelihood principle. In B. N. Petrov and F. Csaki (eds.), Second International Symposium on Information Theory. Budapest: Akademiai Kiado, pp. 267-281.
Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, AC-19, 716-723.
Anderson, T. W. (1971). The Statistical Analysis of Time Series. New York: Wiley.
Anderson, T. W. and Darling, D. A. (1954). A test of goodness of fit. Journal of the American Statistical Association, 49, 765-769.
Ansfield, F. and Klotz, J. (1977). A phase III study comparing the clinical utility of four regimens of 5-fluorouracil. Cancer, 39, 34-40.
Bailey, B. J. R. (1980). Large sample simultaneous confidence intervals for the multinomial probabilities based on transformations of the cell frequencies. Technometrics, 22, 583-589.
Bartlett, M. S. (1947). Multivariate analysis. Journal of the Royal Statistical Society, Series B, 9, 176-197.
Belsley, D. A., Kuh, E., and Welsch, R. E. (1980). Regression diagnostics: Identifying influential data and sources of collinearity. New York: John Wiley & Sons.
Bishop, Y. M. M., Fienberg, S. E., and Holland, P. W. (1975). Discrete multivariate analysis: Theory and practice. Cambridge, Mass.: McGraw-Hill.
Block, J. (1960). On the number of significant findings to be expected by chance. Psychometrika, 25, 369-380.
Bock, R. D. (1975). Multivariate statistical methods in behavioral research. New York: McGraw-Hill.
Boos, D. D. and Brownie, C. (1989). Bootstrap methods for testing homogeneity of variances. Technometrics, 31(1), 69-82.
Box, G. E. P. and Tiao, G. C. (1973). Bayesian inference in statistical analysis. Reading, Mass.: Addison-Wesley.
Brillinger, D. R. (1975). Time Series: Data Analysis and Theory. New York: Holt, Rinehart and Winston.
Burnham, K. P. and Anderson, D. R. (2002). Model selection and multimodel inference: A practical information-theoretic approach. New York: Springer-Verlag.
Chambers, J. M. (1977). Computational methods for data analysis. New York: John Wiley & Sons.
Cochran, W. G. and Cox, G. M. (1957). Experimental designs, 2nd ed. New York: John Wiley & Sons.
Daniel, C. (1960). Locating outliers in factorial experiments. Technometrics, 2, 149-156.
Davis, D. J. (1977). An analysis of some failure data. Journal of the American Statistical Association, 72, 113-150.
Dempster, A. P. (1969). Elements of continuous multivariate analysis. San Francisco: Addison-Wesley.
Dixon, W. J. (1992). BMDP statistical software manual. Berkeley: University of California Press.
Dixon, W. J. and Tukey, J. W. (1968). Approximate behavior of the distribution of winsorized t. Technometrics, 10, 83-98.
Efron, B. (1982). The jackknife, the bootstrap and other resampling plans. Vol. 38 of CBMS-NSF Regional Conference Series in Applied Mathematics. Philadelphia, Penn.: SIAM.
Efron, B. and LePage, R. (1992). Introduction to bootstrap. In R. LePage and L. Billard (eds.), Exploring the Limits of Bootstrap. New York: John Wiley & Sons.
Efron, B. and Tibshirani, R. J. (1993). An introduction to the bootstrap. New York: Chapman & Hall.
Faith, D. P., Minchin, P., and Belbin, L. (1987). Compositional dissimilarity as a robust measure of ecological distance. Vegetatio, 69, 57-68.
Feingold, M. and Korsog, P. E. (1986). The correlation and dependence between two F statistics with the same denominator. The American Statistician, 40, 218-220.
Fisher, R. A. (1935). The design of experiments. London: Oliver & Boyd.
Fleiss, J. L., Levin, B., and Paik, M. C. (2003). Statistical methods for rates and proportions, 3rd ed. New York: John Wiley & Sons.
Flury, B. (1997). A first course in multivariate analysis. New York: Springer-Verlag.
Goodman, L. A. and Kruskal, W. H. (1954). Measures of association for cross-classification. Journal of the American Statistical Association, 49, 732-764.
Gower, J. C. (1985). Measures of similarity, dissimilarity, and distance. In Kotz, S. and Johnson, N. L. (eds.), Encyclopedia of Statistical Sciences, vol. 5. New York: John Wiley & Sons.
Hadi, A. S. (1994). A modification of a method for the detection of outliers in multivariate samples. Journal of the Royal Statistical Society, Series B, 56, 393-396.
Hand, D. J., Daly, F., Lunn, A. D., McConway, K. J., and Ostrowski, E. (1996). A handbook of small data sets. New York: Chapman & Hall.
Henze, N. and Zirkler, B. (1990). A class of invariant consistent tests for multivariate normality. Communications in Statistics: Theory and Methods, 19, 3595-3618.
Hill, M. A. and Engelman, L. (1992). Graphical aids for nonlinear regression and discriminant analysis. In Y. Dodge and J. Whittaker (eds.), Computational Statistics, Vol. 2: Proceedings of the 10th Symposium on Computational Statistics. Physica-Verlag, 111-126.
Hocking, R. R. (1983). Developments in linear regression methodology: 1959-82. Technometrics, 25, 219-230.
Hoyle, R. H. (1995). Structural equation modeling: Concepts, issues, and applications. Thousand Oaks, CA.
Huff, D. (1993). How to Lie With Statistics. New York: W. W. Norton & Company.
James, A. D. (1984). Extending Rosenberg's technique for standardizing percentage tables. Social Forces, 62(3), 679-708.
John, P. W. M. (1971). Statistical design and analysis of experiments. New York: MacMillan.
Judge, G. G., Griffiths, W. E., Lutkepohl, H., Hill, R. C., and Lee, T. C. (1988). Introduction to the theory and practice of econometrics, 2nd ed. New York: John Wiley & Sons, pp. 275-318, 453-454.
Kaplan, D. (2000). Structural equation modeling: Foundations and extensions. Thousand Oaks, CA: Sage.
Kendall, M. G., Stuart, A., Ord, J. K., and Arnold, S. (1999). Kendall's advanced theory of statistics, Volume 2A. London: Hodder Arnold.
Kline, R. B. (2005). Principles and practices of structural equation modeling, 2nd ed. New York: Guilford Press.
Kutner, M. H. (1974). Hypothesis testing in linear models (Eisenhart Model I). The American Statistician, 28, 98-100.
Kutner, M. H., Nachtsheim, C. J., Neter, J., and Li, W. (2004). Applied linear statistical models, 5th ed. Irwin: McGraw-Hill.
Levene, H. (1960). Robust tests for equality of variances. In I. Olkin (ed.), Contributions to Probability and Statistics. Palo Alto, Calif.: Stanford University Press, 278-292.
Little, R. J. A. and Rubin, D. B. (2002). Statistical analyses with missing data. New York: John Wiley & Sons.
Longley, J. W. (1967). An appraisal of least-squares for the electronic computer from the point of view of the user. Journal of the American Statistical Association, 62, 819-841.
Mardia, K. V. (1970). Measures of multivariate skewness and kurtosis with applications. Biometrika, 58, 519-530.
Mardia, K. V., Kent, J. T., and Bibby, J. M. (1979). Multivariate analysis. London: Academic Press.
Mendenhall, W., Beaver, R. J., and Beaver, B. M. (2002). A brief introduction to probability and statistics. Pacific Grove, CA: Duxbury Press.
Miller, R. (1985). Multiple comparisons. In Kotz, S. and Johnson, N. L. (eds.), Encyclopedia of Statistical Sciences, vol. 5. New York: John Wiley & Sons, 679-689.
Milliken, G. A. and Johnson, D. E. (1984). Analysis of messy data, Vol. 1: Designed Experiments. New York: Van Nostrand Reinhold Company.
Montgomery, D. C. (2005). Introduction to statistical quality control, 5th ed. New York: John Wiley & Sons.
Montgomery, D. C., Peck, E. A., and Vining, G. G. (2001). Introduction to Linear Regression Analysis, 3rd ed. New York: John Wiley.
Morrison, A. S., Black, M. M., Lowe, C. R., MacMahon, B., and Yuasa, S. Y. (1990). Some international differences in histology and survival in breast cancer. International Journal of Cancer, 11, 261-267.
Morrison, D. F. (2004). Multivariate statistical methods, 4th ed. Pacific Grove, CA: Duxbury Press.
Nelson, L. S. (1998). The Anderson-Darling test for normality. Journal of Quality Technology, 30(3), 298-299.
Noreen, E. W. (1989). Computer intensive methods for testing hypotheses: An introduction. New York: John Wiley & Sons.
Ott, L. R. and Longnecker, M. (2001). Statistical methods and data analysis, 5th ed. Pacific Grove, CA: Duxbury Press.
Press, S. J. (1989). Bayesian statistics: Principles, models and applications. New York: John Wiley & Sons.
Rao, C. R. (1973). Linear statistical inference and its applications, 2nd ed. New York: John Wiley & Sons. (Paperback reprint edition, 2002.)
Rosenberg, M. (1962). Test factor standardization as a method of interpretation. Social Forces, 53-61.
Salkind, N. J. (2007). Statistics for People Who (Think They) Hate Statistics. Sage Publications.
Santner, T. J. and Duffy, D. E. (1989). Statistical Analysis of Discrete Data. New York: Springer.
Schachter, S. (1959). The psychology of affiliation: Experimental studies of the sources of gregariousness. Stanford, CA: Stanford University Press.
Scheffe, H. (1959). The analysis of variance. New York: John Wiley & Sons.
Shapiro, S. S. and Wilk, M. B. (1965). An analysis of variance test for normality (complete samples). Biometrika, 52, 591-611.
Shye, S. (ed.). (1978). Theory construction and data analysis in the behavioral sciences. San Francisco: Jossey-Bass.
Snedecor, G. W. and Cochran, W. G. (1989). Statistical methods, 8th ed. Ames: Iowa State University Press.
Spicer, C. C. (1972). Calculation of power sums of deviations about the mean. Applied Statistics, 21, 226-227.
Stephens, M. A. (1982). Anderson-Darling test of goodness of fit. In Kotz, S. and Johnson, N. L. (eds.), Encyclopedia of Statistical Sciences, Volume 1. New York: John Wiley & Sons, 81-85.
Timm, N. H. (2002). Applied multivariate analysis. New York: Springer-Verlag.
Trader, R. L. (1986). Bayesian regression. In Johnson, N. L. and Kotz, S. (eds.), Encyclopedia of Statistical Sciences, vol. 7. New York: John Wiley & Sons, 677-683.
Tukey, J. and McLaughlin, D. (1963). Less vulnerable confidence and significance procedures for location based on a single sample: Trimming/winsorization. Sankhya, Series A, 25, 331-352.
Tukey, J. W. (1958). Bias and confidence in not quite large samples. Annals of Mathematical Statistics, 29, 614.
Vogt, W. P. (2005). Dictionary of Statistics & Methodology: A Nontechnical Guide for the Social Sciences.
Weisberg, S. (2005). Applied linear regression, 3rd ed. Hoboken, NJ: Wiley-Interscience.
Wilkinson, J. H. and Reinsch, C. (eds.). (1971). Linear Algebra, Vol. 2: Handbook for Automatic Computation. New York: Springer-Verlag.
Statistics for Better Business Decisions
ISBN-10: 0988216000
ISBN-13: 978-0-9882160-0-6
Alpha
Alpha is the complement of the confidence level: alpha = 1 − confidence level. Alpha is the significance level against which the observed p-value is compared. At a 95% confidence level, alpha is equal to 5%, or .05.
Whenever the null hypothesis is tested, the alternative hypotheses must be stated because it determines whether
a one tail or two tail test is to be made.
Whenever the null hypothesis is accepted, the alternative hypothesis is rejected. Likewise, whenever the null
hypothesis is rejected, the alternative hypothesis must be accepted.
A collection of statistical models, and their associated procedures, in which the observed variance in a particular
variable is partitioned into components attributable to different sources of variation. In its simplest form, ANOVA
provides a statistical test of whether or not the means of several groups are all equal, and therefore generalizes the t-test to more than two groups. Doing multiple two-sample t-tests would result in an increased chance of
committing a type I error. For this reason, ANOVAs are useful in comparing two, three, or more means.
The result obtained by adding several quantities together and then dividing this total by the number of quantities;
the mean.
In statistics, the beta is the unit contribution of a variable to the value of the outcome variable in a regression equation (a "beta weight"). Multiple regression is a method in which two or more variables are used to predict the value of an outcome.
At a conceptual level, we use multiple regression so several variables can be considered at the same time for their effect on an outcome of interest.
The variation due to the interaction between the samples is denoted SS(B) for Sum of Squares Between groups. If
the sample means are close to each other (and therefore the Grand Mean) this will be small. There are k samples
involved with one data value for each sample (the sample mean), so there are k-1 degrees of freedom.
The variance due to the interaction between the samples is denoted MS(B) for Mean Square Between groups. This
is the between group variation divided by its degrees of freedom.
A statistic is biased if it is calculated in such a way that is systematically different from the population parameter of
interest. Characteristics of an experimental or sampling design, or the mathematical treatment of data, that
systematically affects the results of a study so as to produce incorrect, unjustified, or inappropriate inferences or
conclusions.
The caret or "hat" is used over a letter to represent an estimate of the quantity denoted by that letter. For example, σ̂ means an estimate of σ. A sub-letter used with another letter works somewhat like an adjective, further describing the letter. For example, σx̄ (sigma x-bar) is read "standard deviation of x-bar" or, more generally, "standard deviation of means."
When variables are categorical, frequency tables (crosstabulations) provide useful summaries.
The central limit theorem states that the mean of a sufficiently large number of independent random variables,
each with finite mean and variance, will be approximately normally distributed.
Essentially, the Central Limit Theorem tells us that if we take repeated samples of size n, compute the mean of each sample, and plot the frequencies of those means, we get an approximately normal distribution; as the sample size n increases toward infinity, the distribution of sample means approaches the normal distribution.
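A short simulation can make this concrete; the exponential population, sample size of 30, and number of repetitions below are assumed purely for the illustration (numpy assumed available):

    import numpy as np

    rng = np.random.default_rng(0)
    population = rng.exponential(scale=2.0, size=100_000)   # a skewed population

    # Means of 2,000 repeated samples of size n = 30
    sample_means = [rng.choice(population, size=30).mean() for _ in range(2_000)]

    print(population.mean(), population.std())            # population mu and sigma
    print(np.mean(sample_means), np.std(sample_means))    # roughly mu and sigma / sqrt(30)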
Statistical measures of central tendency include the mean, median and mode.
A statistical method assessing the goodness of fit between observed values and those expected theoretically.
Cluster sampling is a technique used when relatively homogeneous groupings are evident in a statistical
population. The total population is divided into groups (or clusters) and a simple random sample of the groups is
selected. Then the required information is collected from a simple random sample of the elements within each
selected group.
The coefficient of determination R² is used in the context of statistical models whose main purpose is the
prediction of future outcomes on the basis of other related information. It is the proportion of variability in a data
set that is accounted for by the statistical model. It provides a measure of how well future outcomes are likely to
be predicted by the model.
A confidence interval (CI) is an interval estimate of a population parameter and is used to indicate the reliability of
an estimate. It is an observed interval (i.e. it is calculated from the observations), in principle different from sample
to sample, that frequently includes the parameter of interest, if the experiment is repeated. How frequently the
observed interval contains the parameter is determined by the confidence level or confidence coefficient.
From a normal approximation, we can build a 95% symmetric confidence interval that gives us a specific idea of
the variability of our estimate. We would expect that 95 intervals out of a hundred constructed would cover the real
population mean age. Remember, population mean age is not necessarily at the center of the interval that we just
constructed, but we do expect the interval to be close to it.
If a random variable is a continuous variable, its probability distribution is called a continuous probability
distribution. The equation used to describe a continuous probability distribution is called a probability density
function. The probability that a random variable assumes a value between a and b is equal to the area under the
density function bounded by a and b.
A contrast is a linear combination of 2 or more factor level means with coefficients that sum to zero.
Two contrasts are orthogonal if the sum of the products of corresponding coefficients (i.e., coefficients for the
same means) adds to zero.
Pearson product-moment correlation coefficient, also known as r, R, or Pearson's r, a measure of the strength and
direction of the linear relationship between two variables.
Correlational statistics are a special subgroup of descriptive statistics, which are described separately. The
purpose of correlational statistics is to describe the relationship between two or more variables for one group of
participants.
Cronbach's alpha is a lower bound for test reliability and ranges in value from 0 to 1 (negative values can occur when
items are negatively correlated). Alpha can be viewed as the correlation between the items (variables) selected and all
other possible tests or scales (with the same number of items) constructed to measure the characteristic of interest.
The observations that researchers make result in data. The data might be the brands participants plan to purchase
or the data might be respondent scores on a scale that measures preference. In this context, variables are things
that we measure, control, or manipulate in research. The participants (respondents) with the variables represent
our data. Think of the data file as a spreadsheet in Excel with each respondent represented by a row of data and
each variable represented by a column.
Closely reasoned. If (a) and (b) are true, (c) must be true. This logic is used in making mathematical proofs.
A sampling method is dependent when the individuals selected to be in one sample are used to determine the indi-
viduals to be in the second sample. Dependent samples are often referred to as matched-pairs samples. In other
words, statistical inference methods on matched pairs data use the same methods as inference on a single popula-
tion mean, except that the differences are analyzed.
Descriptive statistics is the discipline of quantitatively describing the main features of a collection of data.
Descriptive statistics are distinguished from inferential statistics (or inductive statistics), in that descriptive
statistics aim to summarize a data set, rather than use the data to learn about the population that the data are
thought to represent. This generally means that descriptive statistics, unlike inferential statistics, are not developed
on the basis of probability theory. Even when a data analysis draws its main conclusions using inferential statistics,
descriptive statistics are generally also presented.
With a discrete probability distribution, each possible value of the discrete random variable
can be associated with a non-zero probability. Thus, a discrete probability distribution can
always be presented in tabular form.
Statistical analyses also commonly use measures of dispersion (spread), such as the range, interquartile range, or
standard deviation.
A statistic is efficient if the spread of the sampling distribution around the population parameter being estimated is
small. Or, in comparison of one statistic to another statistic, the one whose sampling distribution spreads less
around the parameter, is more efficient.
Empiricism, the scientific method, refers to using direct observation to obtain knowledge. Thus, the empirical
approach to acquiring knowledge is based on making observations of individuals or objects of interest. As
illustrated by the gas price example, everyday observation is an application of the scientific approach.
An experiment is a study in which treatments are given to see how the participants respond to them. We all
conduct informal experiments in our everyday lives.
Whenever the sample from a finite population equals or exceeds 10% of the total population, i.e., n ≥ 0.10N, the following correction factor is used to compensate for the inherent changes in the underlying probabilities during the sampling process: √((N − n)/(N − 1)).
There are many equivalent formulas used in statistical calculations. The many variations have been derived largely
as an aid in the process of computation.
When variables are categorical, frequency tables (crosstabulations) provide useful summaries.
Generalizability refers to the appropriateness of applying findings from a study to a larger population.
Generalizability requires random selection. If participants in a study are randomly selected from a larger
population, it is appropriate to generalize study results to the larger population; if not, it is not appropriate to
generalize.
Greek letters (like µ, π, σ) are generally used when referring to populations, while common letters like (s, p) are
used to refer to samples.
A supposition or proposed explanation made on the basis of limited evidence as a starting point for further
investigation.
There are three ways to set up the null and alternative hypotheses:
A sampling method is independent when the individuals selected for one sample do not dictate which individuals
are to be in a second sample.
X can be explained relative to several sets of empirical data. Thus, one must either include total evidence or have
only a “potential explanation.” Most applications of statistical reasoning in business situations are only potential
explanations. It is generally impossible or too expensive to obtain total evidence.
Statistical inference is the process of drawing conclusions from data subject to random variation. Inferential
statistics (or inductive statistics) aim to use the data to learn about the population that the data are thought to
represent.
Inferential statistics are tools that tell us how much confidence we can have when generalizing from a sample to a
population.
Statistical analyses also commonly use measures of dispersion, such as the range, interquartile range, or standard
deviation. The interquartile range (IQR) is a measure of statistical dispersion, being equal to the difference between
the upper and lower quartiles, IQR = Q3 − Q1
In most statistical analyses, interval and ratio measurements are analyzed in the same way. However, there is a
difference between these two levels. An interval scale does not have an absolute zero. For instance, if we measure
intelligence, we do not know exactly what constitutes absolutely zero intelligence and thus cannot measure the
zero point. In contrast, a ratio scale has an absolute zero point on its scale. For instance, we know where the zero
point is when we measure height.
Is a measure of the "peakedness" of the probability distribution of a real-valued random variable. Kurtosis is a
descriptor of the shape of a probability distribution.
The Law of Large Numbers tells us that if we take a sample of (n) observations of our random variable and average the observations (the sample mean), the average will approach the expected value E(X) of the random variable as n grows large.
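For example, a quick simulation of die rolls (invented for the illustration, numpy assumed) shows the running average settling toward the expected value E(X) = 3.5 as n grows:

    import numpy as np

    rng = np.random.default_rng(0)
    rolls = rng.integers(1, 7, size=10_000)      # simulated rolls of a fair die

    for n in (10, 100, 1_000, 10_000):
        print(n, rolls[:n].mean())               # running average approaches E(X) = 3.5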
Measurement can be defined as the assignment of numerals to objects or events according to rules. There are
four basic levels of measure: nominal, ordinal, interval, and ratio. The importance of the level of measure is
realized in the operations to produce the scale and in the mathematical operations that are permissible with each
level.
A linear equation is an algebraic equation in which each term is either a constant or the product of a constant and
(the first power of) a single variable.
A sampling method is dependent when the individuals selected to be in one sample are used to determine the
individuals to be in the second sample. Dependent samples are often referred to as matched-pairs samples. In
other words, statistical inference methods on matched pairs data use the same methods as inference on a single
population mean, except that the differences are analyzed.
The result obtained by adding several quantities together and then dividing this total by the number of quantities;
the mean.
In general, the mean square of a set of values is the arithmetic mean of the squares of their differences from some
given value, namely their second moment about that value.
When the mean square is regarded as an estimator of certain parental variance components the sum of squares
about the observed mean is divided by the number of degrees of freedom, not the number of observations
Mean squared error (MSE) equals the sum of the variance and the squared bias of the estimator. An estimator is
used to infer the value of an unknown parameter in a statistical model. Bias is the difference between this
estimator's expected value and the true value of the parameter being estimated. The MSE provides a means of
choosing the best estimator: a minimal MSE often, but not always, indicates minimal variance, and thus a good
estimator. Like variance, mean squared error has the disadvantage of heavily weighting outliers.
The variation due to the interaction between the samples is denoted SS(B) for Sum of Squares Between groups. If
the sample means are close to each other (and therefore the Grand Mean) this will be small. There are k samples
involved with one data value for each sample (the sample mean), so there are k-1 degrees of freedom.
The variance due to the interaction between the samples is denoted MS(B) for Mean Square Between groups. This
is the between group variation divided by its degrees of freedom.
A measure of central tendency, the median is a numerical value separating the higher half of a sample, a
population, or a probability distribution, from the lower half.
A measure of central tendency, the mode is the most frequently occurring score. The mode and median should be
used when the data is skewed.
In statistics, linear regression is an approach to modeling the relationship between a dependent variable y and one
or more explanatory variables denoted X. The case of one explanatory variable is called simple regression. More
than one explanatory variable is multiple regression.
The lowest level of measurement is nominal (also known as categorical). It is helpful to think of this level as the naming level
because names (i.e., words) are used instead of numbers.
Among the models used for statistical inference are two general groups: Parametric and Non-Parametric. The
parametric models, such as Z or t, and Pearson product moment correlation require that the sample be drawn
from a population with prescribed “shape”, usually a normal curve. The non-parametric models such as
Spearman rank correlation and the Mann-Whitney U test are termed “distribution-free” models because they do
not require any prescribed population shape.
With non-probability sampling methods, we do not know the probability that each population element will be
chosen, and/or we cannot be sure that each population element has a non-zero chance of being chosen.
Non-probability sampling methods offer two potential advantages - convenience and cost. The main disadvantage
is that non-probability sampling methods do not allow you to estimate the extent to which sample statistics are
likely to differ from population parameters. Only probability sampling methods permit that kind of analysis.
In probability theory, the normal distribution is a continuous probability distribution that has a bell-shaped
probability density function, known informally as the bell curve.
The bell curve is a function that represents the distribution of random variables as a symmetrical bell-shaped
graph.
The hypothesis that there is no significant difference between specified populations, any observed difference
being due to sampling or experimental error. The two tailed test holds when the hypothesis is of equality. When
the hypothesis is directional, either greater than or less than, we use a one tailed test.
Whenever the null hypothesis is tested, the alternative hypotheses must be stated because it determines whether
a one tail or two tail test is to be made.
Whenever the null hypothesis is accepted, the alternative hypothesis is rejected. Likewise, whenever the null
hypothesis is rejected, the alternative hypothesis must be accepted.
An observational study is one in which data are collected on individuals in a way that doesn't affect them. The
most common nonexperimental study is the observational survey. Surveys are questionnaires that are presented
to individuals who have been selected from a population of interest. Surveys take on many different forms: paper
surveys sent through the mail; Web sites; call-in polls conducted by TV networks; and phone surveys.
The observations that researchers make result in data. The data might be the brands participants plan to purchase
or the data might be respondent scores on a scale that measures preference. In this context, variables are things
that we measure, control, or manipulate in research. The participants (respondents) with the variables represent
our data. Think of the data file as a spreadsheet in Excel with each respondent represented by a row of data and
each variable represented by a column.
Ordinal measurement puts participants in rank order from high to low, but it does not indicate how much higher or lower
one participant is in relation to another.
Two contrasts are orthogonal if the sum of the products of corresponding coefficients (i.e., coefficients for the
same means) adds to zero.
Among the models used for statistical inference are two general groups: Parametric and Non-Parametric. The
parametric models, such as Z or t, and Pearson product moment correlation require that the sample be drawn
from a population with prescribed “shape”, usually a normal curve. The non-parametric models such as
Spearman rank correlation and the Mann-Whitney U test are termed “distribution-free” models because they do
not require any prescribed population shape.
In statistics, the Pearson product-moment correlation coefficient (is typically denoted by r) is a measure of the
correlation (linear dependence) between two variables X and Y, giving a value between +1 and −1 inclusive. It is
widely used in the sciences as a measure of the strength of linear dependence between two variables.
We cannot make a probability statement about a point estimate. The parameter is either equal to the point or not.
But we can describe the “goodness” of a point estimate as:
2. Unbiased – not a consistently higher or lower estimate than the parameter value.
A population is all the objects that belong to the same group. A population is a well-defined collection of objects, items, numbers, etc., with mean µ = Σxi / N.
Size of population = N
A population consists of all members of a group in which a researcher has an interest. It may be small, such as all
doctors affiliated with a particular hospital, or it may be large, such as all college seniors in a state.
σ̂ is an estimate of the population standard deviation based on sample data and corrected for bias. The standard deviation of the sample itself, s, is often used as an estimate of the population standard deviation, although it is a biased estimate. The amount of bias (error) diminishes as n increases and is usually negligible for n > 30.
Number of observations = N
Population, Sampling Distribution (of Means), and Sample(s): the sampling distribution of means has a mean equal to the population mean. The distribution is made up of the means of multiple samples of size n drawn from the population. The relationship of the population to the sampling distribution of means makes it possible to state inferences about populations in terms of probabilities associated with the appropriate sampling distribution, all starting with actual information from one sample.
The population standard deviation is a measure of the spread (variability in scores) in the population, expressed in the units of measure.
Probability is the branch of mathematics that studies the possible outcomes of given events together with the
outcomes' relative likelihoods and distributions.
Probability is the chance that a particular event will occur expressed on a linear scale from 0 (impossibility) to 1
(certainty), also expressed as a percentage between 0 and 100%.
Pearson product-moment correlation coefficient, also known as r, R, or Pearson's r, a measure of the strength and
direction of the linear relationship between two variables.
In statistics, the coefficient of determination R² is used in the context of statistical models whose main purpose is
the prediction of future outcomes on the basis of other related information. It is the proportion of variability in a
data set that is accounted for by the statistical model. It provides a measure of how well future outcomes are likely
to be predicted by the model.
A random sample from a finite population is a sample that has been selected by a procedure with the following
properties:
• If a given element has been selected, then the probability of selecting the remaining items is uniformly
affected.
• This means that the selection of one item does not affect the selection of any other particular items; they are in
no way “tied together.”
Sampling is concerned with the selection of a subset of individuals from within a population to estimate
characteristics of the whole population.
Statistical analyses also commonly use measures of dispersion, such as the range, interquartile range, or standard
deviation. The range is a measure of the difference between the high and low measures from a series of
observations.
In most statistical analyses, interval and ratio measurements are analyzed in the same way. However, there is a
difference between these two levels. An interval scale does not have an absolute zero. For instance, if we measure
intelligence, we do not know exactly what constitutes absolutely zero intelligence and thus cannot measure the
zero point. In contrast, a ratio scale has an absolute zero point on its scale. For instance, we know where the zero
point is when we measure height.
Regression analysis is a statistical technique for estimating the relationships among variables. There are several
types of regression:
◦ Logistic regression
◦ Nonlinear regression
◦ Nonparametric regression
◦ Robust regression
◦ Stepwise regression
A. If the population is normally distributed: If random samples of size (n) are taken from a normal population with mean (µ) and standard deviation (σ), then the sampling distribution of the sample means (x̄) is also normal, with mean equal to µ and standard deviation equal to σ/√n.
B. If the population is large, but not necessarily normally distributed, the Central Limit Theorem applies. If random samples of size (n) are taken from a large population with mean (µ) and standard deviation (σ), then the sampling distribution of the sample means (x̄) is approximately normal, with mean approximately equal to µ and standard deviation approximately equal to σ/√n, PROVIDED that the sample size is large (where large means n ≥ 30).
Consistency; the ability of a person or system to perform and maintain its functions.
A major distinction between research and everyday observation is that research is planned in advance. Based on a
theory or hunch, researchers develop research questions and then plan what, when, where and how to observe in
order to answer the questions.
A sample is one subset of all possible subsets which may be selected from a population. When populations are
large, researchers usually sample. A sample is a subset of a population. For instance, we might be interested in
the attitudes of all registered voters in California toward the economy. The registered voters would constitute the
population. If we administered an attitude scale to all these voters, we would be studying the population, and the
summarized results (such as averages) would be referred to as parameters. If we studied only a sample of the
voters, the summarized results would be referred to as statistics.
There are other statistical measures of central tendency that should not be confused with means - including the
'median' and 'mode'. Statistical analyses also commonly use measures of dispersion, such as the range,
interquartile range, or standard deviation.
A measure of central tendency, the mean of the sample, is the average of observations expressed in the units of
measure.
Sample variance is a measure of variability in the observed values for the sample expressed in units of measure
squared.
Sampling distribution is the set (collection) of all possible subsets of a given size (n) which may be selected from a
population. A subset (sample) is different.
The number of possible samples of size n which can be taken from a finite population of size N is C(N, n) = N! / [n!(N − n)!].
Estimate of standard deviation (error) of the difference between means based on two samples.
Estimate of standard deviation (error) of the sampling distribution of the proportion based on (a) sample
proportion.
Estimate of standard deviation (error) of the sampling distribution of the difference between proportions
based on proportions from two samples.
Mean of the sampling distribution of means is the average of samples and theoretically equal to the population
mean.
Sampling distribution standard deviation is a measure of variability in the sampling distribution of means
expressed in units of measure.
Empiricism, the scientific method, refers to using direct observation to obtain knowledge. Thus, the empirical
approach to acquiring knowledge is based on making observations of individuals or objects of interest. As
illustrated by the gas price example, everyday observation is an application of the scientific approach.
Scientific procedures involve the use of models to describe, analyze, and make predictions. A model can be a
well-defined set of descriptions and procedures like the Product Life Cycle in marketing, or it can be a scaled
down analog of the real thing, such as an engineer’s scale model of a car.
A simple random sample is a subset of individuals (a sample) chosen from a larger set (a population). Each
individual is chosen randomly and entirely by chance, such that each individual has the same probability of being
chosen at any stage during the sampling process, and each subset of k individuals has the same probability of
being chosen for the sample as any other subset of k individuals. This process and technique should not be
confused with random sampling.
Skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable. The
skewness value can be positive or negative, or even undefined. Qualitatively, a negative skew indicates that the tail
on the left side of the probability density function is longer than the right side and the bulk of the values (possibly
including the median) lie to the right of the mean. A positive skew indicates that the tail on the right side is longer
than the left side and the bulk of the values lie to the left of the mean. A zero value indicates that the values are
relatively evenly distributed on both sides of the mean, typically but not necessarily implying a symmetric
distribution.
Statistical analyses also commonly use measures of dispersion, such as the range, interquartile range, or standard
deviation.
Number of observations = N
A sample is a subset of a population. For instance, we might be interested in the attitudes of all registered voters
in Pennsylvania. The registered voters would constitute the population. If we administered an attitude scale to a
sample of these voters the summarized results would be referred to as statistics.
Statistical Inference is the process of making inferences concerning a population on the basis of information
contained in samples (from the population) and is based on the relationships between the population, the
sampling distribution, and the sample.
Models that are useful and valuable in managing business operations are broadly termed “statistical models”. A
large number of these have been developed to assist researchers in a variety of fields, such as agriculture,
psychology, education, communication, and military tactics, as well as in business. Only the professional
statistician would be expected to understand the full range of such statistical models.
Statistics is about this whole process of using the scientific method to answer questions and make decisions.
Effective decision-making involves correctly designing studies, collecting unbiased data, describing the data with
numbers and graphs, analyzing the data to draw inferences and reaching conclusions based on the transformation
of data into information.
The Stem procedure creates a stem-and-leaf plot for one or more variables. The plot shows the distribution of a
variable graphically. In a stem-and-leaf plot, the digits of each number are separated into a stem and a leaf. The
stems are listed as a column on the left, and the leaves for each stem are in a row on the right.
When subpopulations within an overall population vary, it is advantageous to sample each subpopulation (stratum)
independently. Stratification is the process of dividing members of the population into homogeneous subgroups
before sampling. The strata should be mutually exclusive: every element in the population must be assigned to
only one stratum. The strata should also be collectively exhaustive: no population element can be excluded. Then
simple random sampling or systematic sampling is applied within each stratum. This often improves the
representativeness of the sample by reducing sampling error.
A test for statistical significance that uses a statistical distribution called Student's t-distribution, which is that of a
fraction ( t) whose numerator is drawn from a normal distribution with a mean of zero, and whose denominator is
the root mean square of k terms drawn from the same normal distribution (where k is the number of degrees of
freedom). The t-distribution approaches normal as sample size increases.
A statistical technique used in ANOVA and regression analysis. Regression analysis is a tool used to determine
how well a statistical model fits a set of data. The sum of squares technique helps determine what estimator(s)
provide the best fit.
Surveys are questionnaires that are presented to individuals who have been selected from a population of interest.
Surveys take on many different forms: paper surveys sent through the mail; Web sites; call-in polls conducted by
TV networks; and phone surveys.
Greek letters (like µ, π, σ) are generally used when referring to populations, while common letters (like x̄, s, p) are used to refer to samples.
µ = Population mean
σ² = Population variance
π = Population proportion
x̄ = Sample mean
s² = Sample variance
σx̄ = Standard deviation of the sampling distribution of means (often called the "standard error" of the sampling distribution of (sample) means)
σx̄1−x̄2 = Standard deviation of the sampling distribution of differences between (sample) means (also called the "standard error" of the sampling distribution of differences between sample means)
σp̂ = Standard deviation (error) of the sampling distribution of the proportion based on an assumed value for the population proportion, π
• The area under the curve is 1. Because of the symmetry, the area under the curve to the right of 0 equals
the area under the curve to the left of 0 equals 1/2.
• As t increases (or decreases) without bound, the graph approaches, but never equals, 0.
• The area in the tails of the t-distribution is a little greater than the area in the tails of the standard normal
distribution because using s as an estimate of σ introduces more variability to the t-statistic.
• As the sample size n increases, the density curve of t gets closer to the standard normal density
curve. This result occurs because as the sample size increases, the values of s get closer to the values of
σ by the Law of Large Numbers.
The total variation is comprised of the sum of the squares of the differences of each mean from the grand mean.
There is the between group variation and the within group variation. The whole idea behind the analysis of
variance is to compare the ratio of between group variance to within group variance. If the variance caused by the
interaction between the samples is much larger when compared to the variance that appears within each group,
then it is because the means aren't the same.
A treatment is a specific combination of factor levels whose effect is to be compared with other treatments.
The mathematical model that describes the relationship between the response and treatment for the one-way ANOVA is given by Yij = µ + τi + εij, where Yij represents the j-th observation (j = 1, 2, ..., ni) on the i-th treatment (i = 1, 2, ..., k levels), µ is the overall mean, τi is the i-th treatment effect, and εij is random error.
Tukey's test, also known as Tukey's HSD (honestly significant difference) test is generally used in conjunction with
an ANOVA to find which means are significantly different from one another. The test compares the means of every
treatment to the means of every other treatment; that is, it applies simultaneously to the set of all pairwise
comparisons and identifies where the difference between two means is greater than the standard error would be
expected to allow.
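One common implementation of Tukey's HSD test is provided by the statsmodels package; the sketch below (group labels and values invented, statsmodels assumed installed) shows how it is typically called:

    import numpy as np
    from statsmodels.stats.multicomp import pairwise_tukeyhsd

    values = np.array([18.2, 19.1, 17.8, 20.3, 21.0, 19.8, 17.5, 18.0, 17.2])
    groups = np.array(["A", "A", "A", "B", "B", "B", "C", "C", "C"])

    # Compares every pair of group means and flags differences larger than
    # the standard error would be expected to allow
    result = pairwise_tukeyhsd(values, groups, alpha=0.05)
    print(result)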
We reject the null hypothesis when the null hypothesis is true. This decision would be incorrect. This type of
error is called a Type I error.
As the probability of a Type I error increases, the probability of a Type II error decreases, and vice-versa.
We do not reject the null hypothesis when the alternative hypothesis is true. This decision would be incorrect.
This type of error is called a Type II error.
As the probability of a Type II error increases, the probability of a Type I error decreases, and vice-versa.
A statistic is said to be an unbiased estimate of a population parameter if the mean of the sampling distribution of
the statistic is (theoretically) equal to the population parameter.
For example: the "hat" on σ, read "sigma-hat" (σ̂), denotes the estimate of the population standard deviation based on a sample and corrected for bias.
In logic, an argument is valid if and only if its conclusion is entailed by its premises and each step in the argument
is valid. A formula is valid if and only if it is true under every interpretation, and an argument form (or schema) is
valid if and only if every argument of that logical form is valid.
The quality of being different. In statistics it is a measure of spread or variability in the observations (data). The
variance is a measure of how far a set of numbers is spread out. It is one of several descriptors of a probability
distribution, describing how far the numbers lie from the mean (expected value).
The variation due to differences within individual samples, denoted SS(W) for Sum of Squares Within groups. Each
sample is considered independently, no interaction between samples is involved. The degrees of freedom is equal
to the sum of the individual degrees of freedom for each sample. Since each sample has degrees of freedom equal
to one less than their sample sizes, and there are k samples, the total degrees of freedom is k less than the total
sample size: df = N - k.
The variance due to the differences within individual samples is denoted MS(W) for Mean Square Within groups.
This is the within group variation divided by its degrees of freedom.