Chisquare
Gain a basic understanding of non-parametric statistical tests, with particular emphasis on the Chi-
square test.
Comprehend the notion of the two principal uses of Chi-square tests: the test of independence and
the goodness-of-fit test.
Learn the best practices for selecting and utilizing Chi-square tests, taking constraints and
assumptions into account, and reliably interpreting findings.
The notation for the chi-square distribution is $\chi^2 \sim \chi^2_{df}$, where df = degrees of freedom, which
depends on how chi-square is being used. (If you want to practice calculating chi-square probabilities,
use df = n − 1. The degrees of freedom are calculated differently for each of the major uses.)
For the $\chi^2$ distribution, the population mean is $\mu = df$ and the population standard deviation is
$\sigma = \sqrt{2\,df}$.
The random variable is shown as $\chi^2$, but may be any upper-case letter. The random variable for a
chi-square distribution with k degrees of freedom is the sum of k independent, squared standard
normal variables:
$$\chi^2 = Z_1^2 + Z_2^2 + \dots + Z_k^2$$
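The definition above can be checked by simulation. The sketch below is only an illustration: the choice of k = 5, the seed, and the number of draws are arbitrary assumptions, and the helper name `chi_square_draw` is hypothetical. It builds each draw as a sum of k squared standard normal values and compares the sample mean and standard deviation with the theoretical values df and √(2·df).

```python
import random
import statistics

def chi_square_draw(k, rng):
    """Return one draw: the sum of k squared independent standard normal values."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(k))

rng = random.Random(0)  # fixed seed (arbitrary) so the run is repeatable
k = 5                   # degrees of freedom, chosen only for illustration
draws = [chi_square_draw(k, rng) for _ in range(20000)]

sample_mean = statistics.mean(draws)
sample_sd = statistics.stdev(draws)
print(f"sample mean = {sample_mean:.2f} (theory: {k})")
print(f"sample sd   = {sample_sd:.2f} (theory: {(2 * k) ** 0.5:.2f})")
```

Because every term is a square, no draw can be negative, which is why the chi-square curve starts at zero.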
1. The curve is nonsymmetrical and skewed to the right.
2. There is a different chi-square curve for each df.
3. The test statistic for any test is always greater than or equal to zero.
4. When df > 90, the chi-square curve approximates the normal distribution. For $X \sim \chi^2_{1000}$, the
mean is $\mu = df = 1{,}000$ and the standard deviation is $\sigma = \sqrt{2(1{,}000)} = 44.7$. Therefore,
X ~ N(1,000, 44.7), approximately.
5. The mean, μ, is located just to the right of the peak.
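The arithmetic in item 4 can be reproduced in a few lines. This is a minimal sketch; the function name `chi_square_mean_sd` is made up for the illustration.

```python
import math

def chi_square_mean_sd(df):
    """Mean and standard deviation of a chi-square distribution with df degrees of freedom."""
    return df, math.sqrt(2 * df)

mean, sd = chi_square_mean_sd(1000)
print(f"mean = {mean}, sd = {sd:.1f}")
```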
There are two main kinds of chi-square tests: the goodness-of-fit test, which asks something like
"How well does the coin in my hand match a theoretically fair coin?"; and the test of independence,
which asks a question of relationship, such as, "Is there a relationship between student gender and
course choice?"
Goodness-of-Fit Test
This kind of hypothesis test tells you whether or not the data "fit" a given distribution. You
might suspect, for instance, that your data fit a binomial distribution. To find out whether they
do, you use a chi-square test (meaning that the distribution for the hypothesis test is chi-square).
For this test, the null and alternative hypotheses can be written as sentences or stated as equations
or inequalities.
The test statistic for a goodness-of-fit test is:

$$\chi^2 = \sum_{k} \frac{(O - E)^2}{E}$$

where:
O = observed values (data)
E = expected values (from theory)
k = the number of different data cells or categories
The observed values are the data values and the expected values are the values you would expect
to get if the null hypothesis were true. There are k terms of the form $\frac{(O - E)^2}{E}$, one for each category.
The number of degrees of freedom is df = (number of categories – 1).
In almost all cases, the goodness-of-fit test is right-tailed. If the observed and expected values
differ substantially, the test statistic becomes large and falls far out in the right tail of the
chi-square curve.
Note: In order to use this test, each cell's expected value must be at least five.
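The statistic, the degrees of freedom, and the expected-count condition can be sketched together as follows. The function name `goodness_of_fit` and the coin-toss counts are hypothetical, chosen only to echo the fair-coin question raised earlier; they are not data from this handout.

```python
def goodness_of_fit(observed, expected):
    """Return (chi-square statistic, degrees of freedom) for a goodness-of-fit test."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected need one entry per category")
    if any(e < 5 for e in expected):
        raise ValueError("every expected count must be at least five")
    # One term (O - E)^2 / E per category, summed over all categories.
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return statistic, len(observed) - 1

# Hypothetical data: 100 tosses of a coin giving 58 heads and 42 tails,
# compared with the 50/50 split expected from a fair coin.
statistic, df = goodness_of_fit([58, 42], [50, 50])
print(f"chi-square = {statistic:.2f}, df = {df}")
```

With SciPy installed, `scipy.stats.chisquare` computes the same statistic and also returns a p-value.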
Example 1:
One study indicates that the number of televisions that American families have is distributed
(this is the given distribution for the American population) as in Table 1.

Number of Televisions    Percent
0                        10
1                        16
2                        55
3                        11
4+                        8
Table 1: Expected (E) Percents

A random sample of 600 families in the far western United States resulted in the data in Table 2.

Number of Televisions    Frequency
0                         66
1                        119
2                        340
3                         60
4+                        15
Total                    600
Table 2: Observed (O) Frequencies
Does the distribution of "number of televisions" among families in the far west of the United
States seem to differ from the distribution of the American population as a whole, even at the 1%
significance level?
Solution
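Multiplying each percent in Table 1 by the sample size of 600 gives the expected frequencies (for
example, 0.10 × 600 = 60). The expected frequencies are therefore 60, 96, 330, 66, and 48.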
$H_0$: The distribution of "number of televisions" among families in the far west of the United
States is the same as that of the American population.
$H_a$: The distribution of "number of televisions" among families in the far west of the United
States differs from that of the American population.
With df = 5 − 1 = 4, the observed and expected frequencies give a test statistic of $\chi^2 \approx 29.65$ and
a p-value of about 0.000006. Since the p-value is below α = 0.01, reject $H_0$. This means you reject
the claim that the distribution for the far western states is the same as that of the entire
American population.
In summary, at the 1% significance level there is sufficient evidence to conclude that the
distribution of "number of televisions" in the far western United States differs from that of the entire
American population.
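The calculation for Example 1 can be reproduced with the standard library alone. This is a sketch under one stated assumption: the helper `chi_square_sf_even_df` (a made-up name) uses a closed form for the right-tail probability that holds only when df is even, which happens to be the case here (df = 4). With SciPy available, `scipy.stats.chi2.sf` would be the more general route.

```python
import math

percents = [10, 16, 55, 11, 8]      # Table 1: 0, 1, 2, 3, 4+ televisions
observed = [66, 119, 340, 60, 15]   # Table 2: sample of 600 families
n = sum(observed)

# Expected frequency = population percent applied to the sample size.
expected = [p * n / 100 for p in percents]
statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1

def chi_square_sf_even_df(x, df):
    """Right-tail probability P(X > x); this closed form is valid only for even df."""
    if df % 2 != 0:
        raise ValueError("closed form requires an even df")
    half = x / 2
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(df // 2))

p_value = chi_square_sf_even_df(statistic, df)
alpha = 0.01
print(f"expected = {expected}")
print(f"chi-square = {statistic:.2f}, df = {df}, p-value = {p_value:.6f}")
print("reject H0" if p_value < alpha else "do not reject H0")
```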
Test of Independence
To determine whether two factors are independent, use the test of independence, which is based on
the chi-square distribution. The null hypothesis states that the two factors are independent. The test
compares observed and expected values and is right-tailed. Every cell must have an expected value of
at least five. Tests of independence commonly use a contingency table of observed (data) values.
The test statistic for a test of independence is similar to that of a goodness-of-fit test:

$$\chi^2 = \sum_{(i \cdot j)} \frac{(O - E)^2}{E}$$

where:
O = observed values
E = expected values
i = the number of rows in the table
j = the number of columns in the table
Example:
In a volunteer group, adults 21 and older volunteer from one to nine hours each week to spend
time with a disabled senior citizen. The program recruits among community college students, four-year
college students, and nonstudents.
Type of Volunteer             1-3 Hours   4-6 Hours   7-9 Hours   Row Total
Community College Students    111          96          48         255
Four-Year College Students     96         133          61         290
Nonstudents                    91         150          53         294
Column Total                  298         379         162         839
Table 4: Number of Hours Worked Per Week by Volunteer Type (Observed)
Is the number of hours volunteered independent of the type of volunteer?
Solution
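This is a test of independence, as indicated by the question and the observed table. The two
variables are the number of hours volunteered and the type of volunteer.
$H_0$: The number of hours volunteered is independent of the type of volunteer.
$H_a$: The number of hours volunteered is dependent on the type of volunteer.
Under $H_0$, the expected frequency for each cell is computed from the row and column totals.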
For example, the calculation for the expected frequency for the top left cell is

$$E = \frac{(\text{row total})(\text{column total})}{\text{total number surveyed}} = \frac{(255)(298)}{839} = 90.57$$
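The same formula applies to every cell of the table. The sketch below computes the full table of expected frequencies from the observed counts in Table 4; the variable names are chosen for the illustration.

```python
observed = [
    [111, 96, 48],   # community college students
    [96, 133, 61],   # four-year college students
    [91, 150, 53],   # nonstudents
]
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# E = (row total)(column total) / (total number surveyed), cell by cell.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
for row in expected:
    print("  ".join(f"{e:7.2f}" for e in row))
```

Every expected value in the result is well above five, so the condition for using the test is met.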
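Distribution for the test: $\chi^2$ with df = (number of rows − 1)(number of columns − 1) = (3 − 1)(3 − 1) = 4.
Calculate the test statistic: $\chi^2 = 12.99$ (calculator or computer)
Make a decision: Since α = 0.05 > p-value ≈ 0.0113, reject $H_0$. This means that the factors are not independent.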
Conclusion: The data provide enough evidence, at a 5% level of significance, to draw the conclusion
that the type of volunteer and the quantity of hours volunteered are dependent on one another.
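The test statistic and p-value for this example can be checked without a statistics package. This is a sketch, not the handout's own method: it relies on the fact that for df = 4 the right-tail probability has the closed form exp(−x/2)(1 + x/2). For other table sizes, a library routine such as `scipy.stats.chi2_contingency` would be the usual choice.

```python
import math

observed = [
    [111, 96, 48],   # community college students
    [96, 133, 61],   # four-year college students
    [91, 150, 53],   # nonstudents
]
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Sum (O - E)^2 / E over every cell, with E = (row total)(column total) / n.
statistic = sum(
    (o - r * c / n) ** 2 / (r * c / n)
    for row, r in zip(observed, row_totals)
    for o, c in zip(row, col_totals)
)
df = (len(observed) - 1) * (len(observed[0]) - 1)

# Closed form for the right tail, valid only when df = 4.
assert df == 4
p_value = math.exp(-statistic / 2) * (1 + statistic / 2)

alpha = 0.05
print(f"chi-square = {statistic:.2f}, df = {df}, p-value = {p_value:.4f}")
print("reject H0" if p_value < alpha else "do not reject H0")
```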
Before you use a chi-square test for independence, check that your data satisfy two assumptions.
If they do not, the test is not appropriate. The two assumptions are as follows:
First assumption: The two variables should be measured at the nominal or ordinal level (that is,
they should be categorical data).
Second assumption: There should be two or more independent, categorical groups in each of your
two variables. Examples of independent variables that satisfy this requirement are: gender (two
categories: males and females); ethnicity (three categories: Caucasian, African American, and
Hispanic); degree of physical activity (four categories: sedentary, low, moderate, and high);
occupation (five categories: surgeon, physician, nurse, dentist, and therapist); and so on.
The chi-square test is an essential tool for statistical inference, so statistical software such as
SPSS (Statistical Package for the Social Sciences) includes it, which makes the analysis more
accessible and efficient.
Example
Educators are constantly searching for innovative approaches to teach undergraduate statistics as
a part of a degree course that does not focus on statistics (such as psychology). With today's
technology, statistical program guides can be provided online rather than in a book. But different
people have different ways of learning. An educator would like to know whether the preferred learning
medium (online vs. books) is associated with gender (male or female). We therefore have two
nominal variables: preferred learning medium (online/books) and gender (male/female).
You can use SPSS Statistics to analyze your data using a chi-square test for independence by following
the 13 steps listed below. We walk you through how to interpret your chi-square test results for
independence at the end of these 13 steps.
3. Transfer one of the variables into the Row(s): box and the other variable into the Column(s): box. In
our example, we will transfer the Gender variable into the Row(s): box and Preferred Learning
Medium into the Column(s): box. There are two ways to do this. You can either: (1) highlight the
variable with your mouse and then use the relevant right-arrow button to transfer the
variables; or (2) drag-and-drop the variables. How do you know which variable goes in the row or
column box? There is no right or wrong way; it depends on how you want to present your data.
If you want to display clustered bar charts (recommended), make sure that the Display clustered
bar charts checkbox is ticked.
You will end up with a screen similar to the one on the right:
7. Click on the Cells button. You will be presented with the following Crosstabs: Cell Display
dialogue box:
8. Select Observed from the –Counts– area, and Row, Column and Total from the –Percentages– area, as
shown below:
9. Click on the Continue button.
10. Click on the Format button.
11. You will be presented with the following:
We can see here that $\chi^2(1) = 0.487$, p = .485. This tells us that there is no statistically significant
association between Gender and Preferred Learning Medium; that is, Males and Females do not differ
significantly in their preference for online learning versus books.
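The reported p-value can be recovered from the test statistic alone. With one degree of freedom the chi-square variable is the square of a single standard normal variable, so its right-tail probability can be written with the complementary error function. The sketch below uses only the standard library; the function name `chi_square_sf_df1` is invented for the illustration.

```python
import math

def chi_square_sf_df1(x):
    """P(X > x) for 1 df: X = Z**2, so P(X > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2))."""
    return math.erfc(math.sqrt(x / 2))

p_value = chi_square_sf_df1(0.487)
print(f"p = {p_value:.3f}")
```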
Phi and Cramer's V are both measures of the strength of association. We can see that the strength of
association between the variables is very weak.
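As a rough illustration of how such a measure is obtained, Cramer's V is commonly defined as the square root of χ² / (n · min(rows − 1, columns − 1)); for a 2 × 2 table it coincides with Phi, the square root of χ² / n. The sample size for the SPSS example is not reported here, so the sketch below applies the formula to the volunteer table from the earlier example instead (χ² = 12.99, n = 839, 3 rows, 3 columns). The function name `cramers_v` is an assumption made for the illustration.

```python
import math

def cramers_v(chi_square, n, n_rows, n_cols):
    """Cramer's V: strength of association, from 0 (none) to 1 (perfect)."""
    return math.sqrt(chi_square / (n * min(n_rows - 1, n_cols - 1)))

# Volunteer example: chi-square = 12.99, n = 839, 3 x 3 table.
v = cramers_v(12.99, 839, 3, 3)
print(f"Cramer's V = {v:.3f}")
```

A value this close to zero indicates a weak association, even though the test of independence for that table was statistically significant.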