
Unit 3: Non-Parametric Tests

Unit Learning Outcomes


By the end of this lesson, you should be able to:
1. Understand the concept of non-parametric tests in statistics and their importance in data analysis;
2. Perform non-parametric tests using statistical software; and
3. Apply non-parametric tests to real-world scenarios and research studies.
Introduction
Non-parametric tests are statistical hypothesis tests that make no assumptions about the frequency distribution of the variables being assessed. They are commonly used when the data are skewed, and they include methods that do not rely on data from a specific distribution. Despite what the term "non-parametric" implies, these models are not free of parameters; rather, the number and nature of the parameters are flexible rather than fixed in advance. For this reason, they are also referred to as "distribution-free models."
Determining whether a continuous outcome has a normal distribution, and consequently whether a parametric or nonparametric test is appropriate, can be challenging at times. A number of statistical tests can be employed to evaluate the likelihood that data originate from a normal distribution. Each is essentially a goodness-of-fit test that compares observed data to quantiles of the normal distribution. For every such test, the null hypothesis is H0: the data are normally distributed, as opposed to H1: the data are not normally distributed. If the test is statistically significant (p < 0.05), indicating that the data do not follow a normal distribution, a nonparametric test is necessary. It should be noted that these normality tests may have low power; that is, they may fail to reject H0 even when the data are not normal.
Numerous techniques and models are used in nonparametric testing. The most popular tests are listed below:
1. Chi-square
2. Wilcoxon Test
3. Mann-Whitney U Test
4. Kruskal-Wallis H Test

Lesson 1: Chi-square Test


Introduction
The purpose of this module is to give students a thorough understanding of the theoretical underpinnings and practical applications of Chi-square non-parametric tests. Students will explore the Chi-square distribution, paying particular attention to the test of independence and the goodness-of-fit test. In addition, the module will cover the practical application of Chi-square tests using SPSS, as well as best practices for accurate interpretation and decision-making.

Lesson Learning Outcomes


By the end of this lesson, you should be able to:

 Gain a basic understanding of non-parametric statistical tests, with particular emphasis on the Chi-square test.
 Comprehend the two principal uses of Chi-square tests: the test of independence and the goodness-of-fit test.
 Learn the best practices for selecting and utilizing Chi-square tests, taking constraints and assumptions into account, and reliably interpreting findings.

What is a chi-squared test?

A chi-squared test (represented symbolically as χ²) is a data analysis based on observations of a random set of variables. Typically, two statistical data sets are compared. Karl Pearson developed this test in 1900 for the analysis and distribution of categorical data, which is why it is also referred to as Pearson's chi-squared test.
Taking the null hypothesis as true, the chi-square test calculates how likely the observed results would be. A hypothesis is a conjecture that, subject to further testing, suggests that a particular condition or statement might be true. Typically, a chi-squared statistic is constructed by summing the squared differences between observed and expected values, each divided by the expected value.
A statistical test known as the chi-square is used to assess how well categorical variables from a
random sample match the expected and observed outcomes. Researchers who are analyzing survey
response data are the ones who use chi-square the most because it can be applied to categorical
variables. This kind of study can cover a wide range of topics, including political science, economics,
marketing, and demographics.
The best data to use chi-square on is nominal data. A nominal variable is a type of categorical variable whose categories differ in kind rather than in any numerical order. Asking someone what color they like, for example, would result in a nominal variable. Conversely, asking someone to place their age in an age group would result in ordinal data.
Formula for Chi-Square

χ²c = Σ (O − E)² / E

where:
c = degrees of freedom
O = observed value(s)
E = expected value(s)
A chi-square test is used to help determine whether observed results are consistent with expected
results and to rule out the possibility that observations are the product of chance. This is appropriate
when the variable being studied is categorical and the data being analyzed are taken from a random
sample. Examples of categorical variables are choices related to gender, race, education level, and type
of cars. These types of information are often collected via surveys or questionnaires. Therefore, chi-
square analysis is often most useful when analyzing this type of data.

How to Perform a Chi-Square Test?


Whether you are doing an independence test or a goodness of fit test, these are the fundamental steps
to follow:
1. Make a table with the expected and observed frequencies.
2. Compute the chi-square statistic using the formula.
3. Using statistical software or a chi-square value table, determine the critical chi-square value.
4. Determine which value is larger, the critical value or the chi-square statistic.
5. Reject or fail to reject the null hypothesis accordingly.
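The steps above can be sketched in a few lines of Python. The observed counts, the uniform expectation, and the critical value (read from a chi-square table for df = 3 at the 5% level) are all assumed values chosen for illustration:

```python
# Hypothetical example: observed counts in four categories vs. an
# expected count of 25 per category (all values assumed for illustration).
observed = [18, 22, 20, 40]
expected = [25, 25, 25, 25]

# Step 2: chi-square statistic = sum over all cells of (O - E)^2 / E
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Steps 3-5: compare against the critical value for df = k - 1 = 3
# at the 5% significance level (7.815, from a chi-square table)
df = len(observed) - 1
print(f"chi-square = {chi_sq:.2f} with df = {df}")   # chi-square = 12.32 with df = 3
print("reject H0" if chi_sq > 7.815 else "fail to reject H0")   # reject H0
```

Since 12.32 exceeds the critical value 7.815, the null hypothesis of equal category frequencies would be rejected in this hypothetical case.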

Limitations of the Chi-Square Test


The chi-square test is sensitive to sample size. With a very large sample, relationships may appear statistically significant even when they are weak or practically unimportant. Furthermore, the chi-square test cannot determine whether there is a causal relationship between two variables; it can only establish whether the two variables are associated.

What is Chi-square distribution?


When we assume that the null hypothesis is true, the sampling distribution of the test statistic is referred to as a chi-squared distribution. The Chi-square distribution is a family of distributions, each defined by its degrees of freedom. The chi-squared test helps identify whether there is a significant difference between the observed frequencies and the expected frequencies in one or more classes or categories.
The notation for the chi-square distribution is: χ² ~ χ²df

where df = degrees of freedom, which depends on how chi-square is being used. (If you want to practice calculating chi-square probabilities, then use df = n − 1. The degrees of freedom for the three major uses are each calculated differently.)

For the χ² distribution, the population mean is μ = df and the population standard deviation is σ = √(2·df).

The random variable is shown as χ², but may be designated by any upper-case letter. The random variable for a chi-square distribution with k degrees of freedom is the sum of k independent, squared standard normal variables.

Figure 1: Chi-square distribution curves

χ² = (Z₁)² + (Z₂)² + … + (Zₖ)²

Properties of the chi-square distribution:
1. The curve is nonsymmetrical and skewed to the right.
2. There is a different chi-square curve for each df.
3. The test statistic for any test is always greater than or equal to zero.
4. When df > 90, the chi-square curve approximates the normal distribution. For χ² with df = 1,000, the mean is μ = df = 1,000 and the standard deviation is σ = √(2(1,000)) = 44.7. Therefore, X ~ N(1,000, 44.7), approximately.
5. The mean, μ, is located just to the right of the peak.
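These properties can be checked by simulation, since a chi-square variable with df degrees of freedom is the sum of df independent squared standard normals. The sketch below (sample sizes and seed are arbitrary choices) verifies that the simulated mean is near μ = df and the simulated standard deviation is near σ = √(2·df) for the df = 1,000 case mentioned above:

```python
import math
import random

random.seed(42)
df = 1000          # degrees of freedom, as in the df = 1,000 example above
n_samples = 1000   # number of simulated chi-square draws (arbitrary choice)

# Each draw is the sum of df independent squared standard normal variables.
draws = [sum(random.gauss(0, 1) ** 2 for _ in range(df))
         for _ in range(n_samples)]

mean = sum(draws) / n_samples
std = math.sqrt(sum((x - mean) ** 2 for x in draws) / n_samples)

print(f"simulated mean ~ {mean:.0f}  (theory: mu = df = {df})")
print(f"simulated std  ~ {std:.1f}  (theory: sigma = sqrt(2*df) = {math.sqrt(2 * df):.1f})")
```

The simulated mean should land close to 1,000 and the simulated standard deviation close to 44.7, consistent with the normal approximation X ~ N(1,000, 44.7).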
There are two main kinds of chi-square tests: the goodness-of-fit test, which asks something like
"How well does the coin in my hand match a theoretically fair coin?"; and the test of independence,
which asks a question of relationship, such as, "Is there a relationship between student gender and
course choice?"

Goodness-of-Fit Test
This kind of hypothesis test tells you whether or not the data "fit" into a given distribution. You
might think, for instance, that the unknown data you have fits a binomial distribution. To find out if
there is a fit or not, you use a chi-square test, which indicates that the distribution for the hypothesis
test is chi-square. For this test, the alternative and null hypotheses can be expressed as equations or
inequalities, or they can be written as sentences.
The test statistic for a goodness-of-fit test is:

χ² = Σ (O − E)² / E, summed over the k cells

where:
O = observed values (data)
E = expected values (from theory)
k = the number of different data cells or categories

The observed values are the data values and the expected values are the values you would expect to get if the null hypothesis were true. There are k terms of the form (O − E)²/E.
The number of degrees of freedom is df = (number of categories – 1).
In almost all cases, the goodness-of-fit test is right-tailed. If there is a significant difference between the observed and expected values, the test statistic becomes very large and falls far into the right tail of the chi-square curve.
Note: In order to use this test, each cell's expected value must be at least five.
Example 1:
One study indicates that the number of televisions that American families have is distributed (this is the given distribution for the American population) as in Table 1. A random sample of 600 families in the far western United States resulted in the data in Table 2.

Table 1: Expected (E) Percents
Number of Televisions   Percent
0                       10
1                       16
2                       55
3                       11
4+                      8

Table 2: Observed Frequencies
Number of Televisions   Frequency
0                       66
1                       119
2                       340
3                       60
4+                      15
                        Total = 600
Does the distribution of "number of televisions" among families in the far west of the United
States seem to differ from the distribution of the American population as a whole, even at the 1%
significance level?
Solution

 This test is always right-tailed.


 The first table contains expected percentages. To get expected (E) frequencies, multiply
the percentage by 600.

Number of Televisions   Percent   Expected Frequency
0                       10        (0.10)(600) = 60
1                       16        (0.16)(600) = 96
2                       55        (0.55)(600) = 330
3                       11        (0.11)(600) = 66
4+                      8         (0.08)(600) = 48
Table 3: Expected Frequencies

Therefore, 60, 96, 330, 66, and 48 are the anticipated frequencies.
H0: The distribution of "number of televisions" among families in the far west of the United States is equal to that of the American population.
Ha: The distribution of "number of televisions" among families in the far west of the United States differs from the distribution of "number of televisions" among all Americans.

Distribution for the test: χ² with df = (number of cells) – 1 = 5 – 1 = 4.

Calculate the test statistic: χ2 = 29.65


Probability statement: p-value = P (χ2 > 29.65) = 0.000006


Compare α and the p-value:
α = 0.01
p-value = 0.000006
So, α > p-value.

Make a decision: Since α > p-value, reject H0.

This indicates that you disagree with the notion that the population distribution of the far western
states is the same as that of the entire American population.
In summary, there is enough data to draw the conclusion that, at the 1% significance level, the
distribution of "number of televisions" in the far western United States differs from that of the entire
American population.
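The arithmetic of this example can be reproduced in a few lines of Python (a sketch of the hand calculation, using the observed counts from Table 2 and the national percentages from Table 1):

```python
# Observed far-west counts (Table 2) and expected frequencies built from
# the national percentages (Table 1) applied to the sample of 600.
observed = [66, 119, 340, 60, 15]
national_pct = [0.10, 0.16, 0.55, 0.11, 0.08]
expected = [p * 600 for p in national_pct]   # [60, 96, 330, 66, 48]

# chi-square statistic: sum of (O - E)^2 / E over the five cells
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1
print(f"chi-square = {chi_sq:.2f} with df = {df}")   # chi-square = 29.65 with df = 4
```

The computed statistic matches the 29.65 quoted in the solution above.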
Test of Independence
To ascertain whether two factors are independent, apply the test of independence based on the
chi-square distribution. The null hypothesis of the test states that the two factors are independent. In
the test, values that are expected and observed are compared. The test is right-tailed. Every observation
or cell category must have an expected value of at least five. Utilizing a contingency table of observed
(data) values is a common practice in tests of independence.
The test statistic for a test of independence is similar to that of a goodness-of-fit test:

χ² = Σ (O − E)² / E, summed over all i·j cells

where:
O = observed values
E = expected values
i = the number of rows in the table
j = the number of columns in the table
Example:
In a volunteer group, adults 21 and older volunteer from one to nine hours each week to spend
time with a disabled senior citizen. The program recruits among community college students, four-year
college students, and nonstudents.

Type of Volunteer 1-3 Hours 4-6 Hours 7-9 Hours Row Total
Community College Students 111 96 48 255
Four-Year College Students 96 133 61 290
Nonstudents 91 150 53 294
Column Total 298 379 162 839
Table 4: Number of Hours Worked Per Week by Volunteer Type (Observed)

Is the number of hours volunteered independent of the type of volunteer?
Solution
This is an independence test, as indicated by the question and the observed table. The quantity of
hours volunteered and the kind of volunteerism are the two variables.
H0: The number of hours volunteered does not depend on the type of volunteer.

H1: The number of hours volunteered depends on the type of volunteer.

Type of Volunteer 1-3 Hours 4-6 Hours 7-9 Hours


Community College Students 90.57 115.19 49.24
Four-Year College Students 103 131 56
Nonstudents 104.42 132.81 56.77
Table 5: Number of Hours Worked Per Week by Volunteer Type (Expected)

For example, the calculation for the expected frequency for the top left cell is
E = (row total)(column total) / (total number surveyed) = (255)(298) / 839 = 90.57

Calculate the test statistic: χ2 = 12.99 (calculator or computer)

Distribution for the test: χ² with df = (3 columns – 1)(3 rows – 1) = (2)(2) = 4

Probability statement: p-value=P (χ2 > 12.99) = 0.0113


Compare α and the p-value: Since no α is given, assume α = 0.05. p-value = 0.0113. α > p-value.

Make a decision: Since α > p-value, reject H0. This means that the factors are not independent.

Conclusion: The data provide enough evidence, at a 5% level of significance, to draw the conclusion
that the type of volunteer and the quantity of hours volunteered are dependent on one another.
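The expected-frequency formula and the test statistic for this example can likewise be reproduced in Python (a sketch of the hand calculation, using the observed counts from Table 4):

```python
# Observed counts from Table 4 (rows: community college students,
# four-year college students, nonstudents; columns: 1-3, 4-6, 7-9 hours).
observed = [
    [111, 96, 48],
    [96, 133, 61],
    [91, 150, 53],
]

row_totals = [sum(row) for row in observed]        # [255, 290, 294]
col_totals = [sum(col) for col in zip(*observed)]  # [298, 379, 162]
grand_total = sum(row_totals)                      # 839

# E = (row total)(column total) / total number surveyed, for each cell
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# chi-square statistic: sum of (O - E)^2 / E over all nine cells
chi_sq = sum((o - e) ** 2 / e
             for obs_row, exp_row in zip(observed, expected)
             for o, e in zip(obs_row, exp_row))
df = (len(observed) - 1) * (len(observed[0]) - 1)
print(f"chi-square = {chi_sq:.2f} with df = {df}")   # chi-square = 12.99 with df = 4
```

The top-left expected frequency comes out to (255)(298)/839 = 90.57, and the statistic matches the 12.99 quoted in the solution above.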
Before using a chi-square test for independence to analyze your data, make sure the data "pass" two assumptions. This is required because the chi-square test for independence is only valid when both conditions are satisfied; if they are not, you cannot use it. The two assumptions are as follows:
First assumption: The two variables you're measuring ought to be at the nominal or ordinal level, or
categorical data.
Second assumption: There should be two or more independent, categorical groups in each of your
two variables. Examples of independent variables that satisfy this requirement are: gender (two
categories: males and females); ethnicity (three categories: Caucasian, African American, and
Hispanic); degree of physical activity (four categories: sedentary, low, moderate, and high);
occupation (five categories: surgeon, physician, nurse, dentist, and therapist); and so on.
The Chi-Square test is an essential tool for statistical inference, so including it in specialized
software like SPSS (Statistical Package for the Social Sciences) improves accessibility and efficiency
in the analytical process.
Example
Educators are constantly searching for innovative approaches to teach undergraduate statistics as
a part of a degree course that does not focus on statistics (such as psychology). With today's
technology, statistical program guides can be provided online rather than in a book. But different
people have different ways of learning. An educator would like to know if the preferred learning
medium—online vs. books—is correlated with gender (male or female). As a result, we have two
nominal variables: preferred learning medium (online/books) and gender (male/female).
You can use SPSS Statistics to analyze your data using a chi-square test for independence by following
the 13 steps listed below. We walk you through how to interpret your chi-square test results for
independence at the end of these 13 steps.

1. Click Analyze > Descriptive Statistics > Crosstabs... on the top menu.
2. You will be presented with the Crosstabs dialogue box.
3. Transfer one of the variables into the Row(s): box and the other variable into the Column(s): box. In our example, we will transfer the Gender variable into the Row(s): box and Preferred Learning Medium into the Column(s): box. There are two ways to do this: you can either (1) highlight the variable with your mouse and then use the relevant arrow button to transfer it, or (2) drag-and-drop the variables. How do you know which variable goes in the row or column box? There is no right or wrong way; it will depend on how you want to present your data. If you want to display clustered bar charts (recommended), make sure that the Display clustered bar charts checkbox is ticked.

4. Click on the Statistics button. You will be presented with the Crosstabs: Statistics dialogue box.
5. Select the Chi-square and Phi and Cramer's V options.
6. Click on the Continue button.
7. Click on the Cells button. You will be presented with the Crosstabs: Cell Display dialogue box.
8. Select Observed from the –Counts– area, and Row, Column and Total from the –Percentages– area.

9. Click on the Continue button.
10. Click on the Format button.
11. You will be presented with the Format dialogue box. This option allows you to change the order of the values to either ascending or descending.
12. Once you have made your choice, click on the Continue button.
13. Click on the OK button to generate your output.
OUTPUT

Figure 2: The Crosstabulation Table (Gender*Preferred Learning Medium Crosstabulation)

Figure 3: The Chi-Square Tests Table

We can see here that χ²(1) = 0.487, p = .485. This tells us that there is no statistically significant association between Gender and Preferred Learning Medium; that is, Males and Females equally prefer online learning versus books.

Figure 4: The Symmetric Measures Table

Phi and Cramer's V are both tests of the strength of association. We can see that the strength of
association between the variables is very weak.
