Andy Field Using SPSS


Lesson 1: SPSS Windows and Files


Objectives
1. Launch SPSS for Windows.
2. Examine SPSS windows and file types.

Overview
In a typical SPSS session, you are likely to work with two or more SPSS windows and to save the
contents of one or more windows to separate files. The window containing your data is the SPSS Data
Editor. If you plan to use the data file again, you may click on File, Save from within the Data Editor
and give the file a descriptive name. SPSS will supply the .sav extension, indicating that the saved
information is in the form of a data file. An SPSS data file includes both the data records and their
structure. The window containing the results of the SPSS procedures you have performed is the SPSS
Viewer. You may find it convenient to save this as an output file. It is okay to use the same name you
used for your data because SPSS will supply the .spo extension to indicate that the saved file is an
output file. As you run various procedures, you may also choose to show the SPSS syntax for these
commands in a syntax window, and save the syntax in a separate .sps file. It is possible to run SPSS
commands directly from syntax, though in this series of tutorials we will focus our attention on SPSS
data and output files and use the point-and-click method to enter the necessary commands.
Launching SPSS
SPSS for Windows is launched from the Windows desktop. There are several ways to access the
program, and the one you use will be based on the way your particular computer is configured. There
may be an SPSS for Windows shortcut on the desktop or in your Start menu. Or you may have to
click Start, All Programs to find the SPSS for Windows folder. In that folder, you will find the SPSS
for Windows program icon.
Once you have located it, click on the SPSS for Windows icon with the left mouse button to launch
SPSS. When you start the program, you will be given a blank dataset and a set of options for running
the SPSS tutorial, typing in data, running queries, creating queries, or opening existing data sources
(see Figure 1-1). For now, just click on Cancel to reveal the blank dataset in the Data Editor screen.

Figure 1-1 SPSS opening screen

The SPSS Data Editor


Examine the SPSS Data Editor's Data View shown in Figure 1-2 below. You will learn in Lesson 2 how to
create an effective data structure within the Variable View and how to enter and manipulate data using
the Data Editor. As indicated above, if you click File, Save while in the Data Editor view, you can save
the data along with their structure as a separate file with the .sav extension. The Data Editor provides
the Data View as shown below, and also a separate Variable View. You can switch between these views
by clicking on the tabs at the bottom of the worksheet-like interface.

Figure 1-2 SPSS Data Editor (Data View)

The SPSS Viewer


The SPSS Viewer is opened automatically to show the output when you run SPSS commands. Assume
for example that you wanted to find the average age of 20 students in a class. We will examine the
commands needed to calculate descriptive statistics in Lesson 3, but for now, simply examine the
SPSS Viewer window (see Figure 1-3). When you click File, Save in this view, you can save the output
to a file with the .spo extension.

Figure 1-3 SPSS Viewer

Syntax Editor Window


Finally, you can view and save SPSS syntax commands from the Syntax Editor window. When you are
selecting commands, you will see a Paste button. Clicking that button pastes the syntax for the
commands you have chosen into the Syntax Editor. For example, the syntax to calculate the mean age
shown above is shown in Figure 1-4:

Figure 1-4 SPSS Syntax Editor
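As a preview, the pasted syntax for a command like this might look similar to the following sketch (the exact subcommands depend on the options chosen and the SPSS version; the variable name Age is assumed here for illustration):

DESCRIPTIVES VARIABLES=Age
  /STATISTICS=MEAN STDDEV MIN MAX.

Saving such lines in a .sps file lets you rerun the same commands later without repeating the point-and-click steps.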

Though we will not address SPSS syntax except in passing in these tutorials, you should note that you
can run commands directly from the Syntax Editor and save your syntax (.sps) files for future

reference. Unlike earlier versions of SPSS, version 15 (the version illustrated in these tutorials)
automatically presents in the SPSS Viewer the syntax for the commands you give it when you
point and click in the Data Editor or the SPSS Viewer (examine Figure 1-3 for an example).
Now that you know the kinds of windows and files involved in an SPSS session, you are ready to learn
how to enter, structure, and manipulate data. Those are the subjects of Lesson 2.

Lesson 2: Entering and Working with Data


Objectives
1. Create a data file and data structure.
2. Compute a new variable.
3. Select cases.
4. Sort cases.
5. Split a file.

Overview
Data can be entered directly into the SPSS Data Editor or imported from a variety of file types. It is
always important to check data entries carefully and ensure that the data are accurate. In this lesson
you will learn how to build an SPSS data file from scratch, how to calculate a new variable, how to
select and sort cases, and how to split a file into separate layers.
Creating a Data File
A common first step in working with SPSS is to create or open a data file. We will assume in this
lesson that you will type data directly into the SPSS Data Editor to create a new data file. You should
realize that you can also read data from many other programs, or copy and paste data from
worksheets and tables to create new data files.
Launch SPSS. You will be given various options, as we discussed in Lesson 1. Select Type in
Data or Cancel. You should now see a screen similar to the following, which is a blank dataset in
the Data View of the SPSS Data Editor (see Figure 2-1):

Figure 2-1 SPSS Data Editor - Data View

Key Point: One Row Per Participant, One Column per Variable
It is important to note that each row in the SPSS data table should be assigned to a single participant,
subject, or case, and that no case's data should appear on different rows. When there are multiple
measures for a case, each measure should appear in a separate column (called a "variable" by SPSS).
If you use a coding variable to indicate which group or condition was assigned to a case, that variable
should also appear in a separate column. So if you were looking at the scores for five quizzes for each
of 20 students, the data for each student would occupy a single row (line) in the data table, and the
score for each quiz would occupy a separate column.
Although SPSS automatically numbers the rows of the data table, it is a very good habit to provide a
separate participant (or subject) number column so that records can be easily sorted, filtered, or
selected. It is also best practice to set up the data structure before entering the data. For this purpose, we
will switch to the Variable View of the Data Editor by clicking on the Variable View tab at the bottom
of the Data Editor window. See Figure 2-2.

Figure 2-2 SPSS Data Editor - Variable View

Example Data
Let us establish the data structure for our example of five quizzes and 20 students. We will assume
that we also know the age and the sex of each student. Although we could enter "F" for female and
"M" for male, most statistical procedures are easier to perform if a number is used to code such
categorical variables. Let us assign the number "1" to females and the number "0" to males. The
hypothetical data are shown below:
Student  Sex  Age  Quiz1  Quiz2  Quiz3  Quiz4  Quiz5
1             18   83     87     81     80     69
2             19   76     89     61     85     75
3             17   85     86     65     64     81
4             20   92     73     76     88     64
5             23   82     75     96     87     78
6             18   88     73     76     91     81
7             21   89     71     61     70     75
8             20   89     70     87     76     88
9             23   92     85     95     89     62
10            21   86     83     77     64     63
11            23   90     71     91     86     87
12            18   84     71     67     62     70
13            21   83     80     89     60     60
14            17   79     77     82     63     74
15            19   89     80     64     94     78
16            20   76     85     65     92     82
17            19   92     76     76     74     91
18            22   75     90     78     70     76
19            22   87     87     63     73     64
20            20   75     74     63     91     87

Specifying the Data Structure


Switch to the Variable View by clicking on the Variable View tab (see Figure 2-2 above). The
numbers at the left of the window now refer to variables rather than participants. Note that you can
specify the variable Name, the Type of variable, the variable Width (in total characters or digits), the
number of Decimals, a descriptive Label, labels for different Values, how to deal
with Missing Values, the display Column width, how to Align the variable in the display, and whether
the Measure is nominal, ordinal, or scale (interval and ratio). In many cases you can simply accept
the defaults by leaving the entries blank. But you will definitely want to enter a variable Name and
Label, and also specify Value labels for the levels of categorical or grouping variables such as sex or
the levels of an independent variable. The variable names should be short and should not contain
spaces or special characters other than perhaps underscores. Variable labels, on the other hand, can
be longer and can contain spaces and special characters.
Let us specify the structure of our dataset by naming the variables as follows. We will also provide
information concerning the width, number of decimals, and type of measure, along with a descriptive
label:
1. Student
2. Sex
3. Age
4. Quiz1
5. Quiz2
6. Quiz3
7. Quiz4
8. Quiz5

No decimals appear in our raw data, so we will set the number of decimals to zero. After we enter the
desired information, the completed data structure might appear as follows:

Figure 2-3 SPSS data structure (Variable View)

Notice that we provided value labels for Sex, so we won't confuse our 1's and 0's later. To do this, click
on Values in the Sex variable row and enter the appropriate labels for males and females (see Figure
2-4).

Figure 2-4 Adding value labels

After entering the value and label for one sex, click on Add and then repeat the process for the other
sex. When both values have been added, click OK.
Entering the Data
Now return to the data view (click on the Data View tab), and type in the data. If you prefer, you
may retrieve a copy of the data file by clicking here. Save the data file with a name that will help you
remember it. In this case, we used lesson_2.sav as the file name. Remember that SPSS will provide
the .sav extension for a data file. The data should appear as follows:

Figure 2-5 Completed data entry

Computing a New Variable


Now we will compute a new variable by averaging the five quiz scores for each student. When we
compute this new variable, it will be added to our variable list, and a new column will be created for it.
Let us call the new variable Quiz_Avg and use SPSS's built-in function called MEAN to compute it.
Select Transform, then Compute. The Compute Variable dialog box appears. You may type in the
new variable name, specify the type and provide a label, and enter the formula for computing the new
variable. In this case, we will use the formula:
Quiz_Avg = MEAN(Quiz1, Quiz2, Quiz3, Quiz4, Quiz5)
You can enter the formula by selecting MEAN from the Functions window and then clicking on the
variable names, or you can simply type in the formula, separating the variable names by commas.
The initial Compute Variable dialog box with the target variable named Quiz_Avg and the MEAN
function selected is below. The question marks indicate that you must supply expressions for the
computation.

Figure 2-6 Compute Variable screen

The appropriate formula is as follows:

Figure 2-7 Completed expression

When you click OK, the new variable appears in both the data and variable views (see below). As
discussed earlier, you can change the number of decimals (numerical variables default to two
decimals) and add a descriptive label for the new variable.

Figure 2-8 New variable appears in Data View

Figure 2-9 New variable appears in Variable View
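If you click Paste instead of OK in the Compute Variable dialog, SPSS places the equivalent syntax in a Syntax Editor window. It should be close to the following sketch (exact formatting may vary by version):

COMPUTE Quiz_Avg=MEAN(Quiz1,Quiz2,Quiz3,Quiz4,Quiz5).
EXECUTE.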

Selecting Cases
You may want to select only certain cases, such as the data for females or for individuals with ages
lower than 20 years. SPSS allows you to select cases either by filtering (which keeps all the cases but
limits further analyses to the selected cases) or by removing the cases that do not meet your criteria.
Usually, you will want to filter cases, but sometimes, you may want to create separate files for
additional analyses by deleting records that do not match your selection criteria. We will select records
for females and filter those records so that the records for males remain but will be excluded from
analyses until we select them again.
From either the variable view or the data view, click on Data, then click on Select Cases. The
resulting dialog box allows you to select the desired cases for further analysis, or to re-select all cases
if data were previously filtered. Let us choose "If condition is satisfied," and specify that we want to
select only records for which the sex of the participant is female. See the dialog box in the following
figure.

Figure 2-10 Select Cases dialog

Click the "If..." button and enter the condition for selection. In this case we will enter the expression
Sex = 1. You can type this in directly, or you can build the expression by pointing and clicking on the entries in the dialog box.

Figure 2-11 Select Cases expression

Click Continue, then click OK, and then examine the data view (see Figure 2-12). Records for males
will now have a diagonal line through the row number label, indicating that though still present, these
records are excluded from further analyses.

Figure 2-12 Selected and filtered data

Also notice that a new variable called Filter_$ has been automatically added to your data file. If you
return to the Data menu and select all the cases again, you can use this filter variable to select
females instead of having to re-enter the selection formula. If you do not want to keep this new
variable, you can right-click on its column label and select Clear.

Figure 2-13 Filter variable added by SPSS
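For reference, the syntax SPSS pastes for this kind of filtered selection is roughly the following sketch (the automatically generated name filter_$ and the exact subcommands may differ slightly by version):

USE ALL.
COMPUTE filter_$=(Sex = 1).
FILTER BY filter_$.
EXECUTE.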

Sorting Cases
Next you will learn to sort cases. Let's return to the Data, Select Cases menu and choose "Select all
cases" in order to re-select the records for males.
We can sort on one or more variables. For example, we may want to sort the records in our dataset by
age and sex. Select Data, Sort Cases:

Figure 2-14 Sort Cases option

Move Sex and Age to the "Sort by" window (see Figure 2-15) and then click OK.

Figure 2-15 Sort Cases dialog

Return to the Data View and confirm that the data are sorted by sex and by age within sex (see Figure
2-16).

Figure 2-16 Cases sorted by Sex and Age
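The equivalent pasted syntax for this sort should resemble the following sketch (A indicates ascending order):

SORT CASES BY Sex(A) Age(A).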

Splitting a File
The last subject we will cover in this tutorial is splitting a file. Instead of filtering cases, splitting a file
creates separate "layers" for the grouping variables. For example, instead of selecting only one sex at
a time, you may want to run several analyses separately for males and females. One convenient way
to accomplish that is to split the file so that every procedure you run will be automatically conducted
and reported for the two groups separately. To split a file, select Data, Split File. The data in a group
need to be consecutive cases in the dataset, so the records must be sorted by groups. However, if
your data are not already sorted, SPSS can do that for you at the same time the file is split (see Figure
2-17).

Figure 2-17 Split File menu

Now, when you run a command, such as a table command to summarize average quiz scores, the
command will be performed for each group separately and those results will be reported in the same
output (see Figure 2-18).

Figure 2-18 Split file results in separate analysis for each group
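The corresponding syntax is approximately the following sketch (LAYERED reports the groups together within each table; SEPARATE would produce separate tables):

SORT CASES BY Sex.
SPLIT FILE LAYERED BY Sex.

To turn the split off again, run SPLIT FILE OFF. (or return to Data, Split File and re-select all cases).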

Lesson 3: Descriptive Statistics and Graphs


Objectives
1. Compute descriptive statistics.
2. Compare means for different groups.
3. Display frequency distributions and histograms.
4. Display boxplots.

Overview
In this lesson, you will learn how to produce various descriptive statistics, simple frequency
distribution tables, and frequency histograms. You will also learn how to explore your data and create
boxplots.
Example
Let us return to our example of 20 students and five quizzes. We would like to calculate the average
score (mean) and standard deviation for each quiz. We will also look at the mean scores for men and
women on each quiz. Open the SPSS data file you saved in Lesson 2, or click here for lesson_3.sav.
Remember that we previously calculated the average quiz score for each person and included that as a
new variable in our data file.
To calculate the means and standard deviations for age, all quizzes, and the average quiz score,
select Analyze, then Descriptive Statistics, and then Descriptives as shown in the following
screenshot (see Figure 3-1).

Figure 3-1 Accessing the Descriptives Procedure

Move the desired variables into the variables window (see Figure 3-2) and then click on Options.

Figure 3-2 Move the desired variables into the variables window.

In the resulting Options dialog box, make sure you check (at a minimum) the boxes in front of Mean and Std.
deviation:

Figure 3-3 Descriptives options

The resulting output table showing the means and standard deviations of the variables is opened in
the SPSS Viewer (see Figure 3-4).

Figure 3-4 Output from Descriptives Procedure
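If you paste rather than run the command, the syntax should look roughly like this sketch (variable names follow the Lesson 2 data file):

DESCRIPTIVES VARIABLES=Age Quiz1 Quiz2 Quiz3 Quiz4 Quiz5 Quiz_Avg
  /STATISTICS=MEAN STDDEV MIN MAX.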

Exploring Means for Different Groups


When you have two or more groups, you may want to examine the means for each group as well as
the overall mean. The SPSS Compare Means procedure provides this functionality and much more,
including various hypothesis tests. Assume that you want to compare the means of men and women
on age, the five quizzes, and the average quiz score. Select Analyze, Compare Means, Means (see
Figure 3-5):

Figure 3-5 Selecting Means Procedure

In the resulting dialog box, move the variables you are interested in summarizing
into the Dependent List. At this point, do not worry whether your variables are actual "dependent
variables" or not. Move Sex to the Independent List (see Figure 3-6). Click on Options to see the
many summary statistics available. In the current case, make sure that Mean, Number of Cases, and
Standard Deviation are selected.

Figure 3-6 Means dialog box

When you click OK, the report table appears in the SPSS Viewer with the separate means for the two
sexes along with the overall data, as shown in the following figure.

Figure 3-7 Report from Means procedure
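The pasted syntax for this report is approximately the following sketch:

MEANS TABLES=Age Quiz1 Quiz2 Quiz3 Quiz4 Quiz5 Quiz_Avg BY Sex
  /CELLS=MEAN COUNT STDDEV.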

As this lesson makes clear, there are several ways to produce summary statistics such as means and
standard deviations in SPSS. From Lesson 2 you may recall that splitting the file would allow you to
calculate the descriptive statistics separately for males and females. The way to find the procedure
that works best in a given situation is to try different ones, and always to explore the options
presented in the SPSS menus and dialog boxes. The extensive SPSS help files and tutorials are also
very useful.
Frequency Distributions and Histograms
SPSS provides several different ways to explore, summarize, and present data in graphic form. For
many procedures, graphs and plots are available as output options. SPSS also has an extensive
interactive chart gallery and a chart builder that can be accessed through the Graphs menu. We will
look at only a few of these features, and the interested reader is encouraged to explore the many
additional charting and graphing features of SPSS.
One very useful feature of the Frequencies procedure in SPSS is that it can produce simple frequency
tables and histograms. You may optionally choose to have the normal curve superimposed on the
histogram for a visual check as to how the data are distributed. Let us examine the distribution of
ages of our 20 hypothetical students. Select Analyze, Descriptive Statistics, Frequencies (see
Figure 3-8).

Figure 3-8 Selecting Frequencies procedure

In the Frequencies dialog, move Age to the variables window, and then click on Charts. Select
Histograms and check the box in front of With normal curve (see Figure 3-9).

Figure 3-9 Frequencies: Charts dialog

Click Continue and OK. In the resulting output, SPSS displays the simple frequency table for age and
the frequency histogram with the normal curve (see Figures 3-10 and 3-11).

Figure 3-10 Simple frequency table

Figure 3-11 Frequency histogram with normal curve
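The corresponding syntax sketch for the frequency table and histogram with a superimposed normal curve is:

FREQUENCIES VARIABLES=Age
  /HISTOGRAM NORMAL
  /ORDER=ANALYSIS.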

Exploratory Data Analysis


In addition to the standard descriptive statistics and frequency distributions and graphs, SPSS also
provides many graphical and semi-graphical techniques collectively referred to as exploratory data

analysis (EDA). EDA is useful for describing the characteristics of a dataset, identifying outliers, and
providing summary descriptions. Some of the most widely-used EDA techniques are boxplots and
stem-and-leaf displays. You can access these techniques through Analyze, Descriptive
Statistics, Explore. As with the Compare Means procedure, groups
can be separated if desired. For example, a side-by-side boxplot comparing the average quiz grades of
men and women is shown in Figure 3-12.

Figure 3-12 Boxplots
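The Explore procedure corresponds to the EXAMINE command in syntax. A sketch that would produce descriptive statistics, stem-and-leaf displays, and side-by-side boxplots of the average quiz score by sex is:

EXAMINE VARIABLES=Quiz_Avg BY Sex
  /PLOT=BOXPLOT STEMLEAF
  /STATISTICS=DESCRIPTIVES.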

Lesson 4: Independent-Samples t Test


Objectives
1. Conduct an independent-samples t test.
2. Interpret the output of the t test.

Overview
The independent-samples or between-groups t test is used to examine the effects of one independent
variable on one dependent variable and is restricted to comparisons of two conditions or groups (two
levels of the independent variable). In this lesson, we will describe how to analyze the results of a
between-groups design. Lesson 5 covers the paired-samples or within-subjects t test. The reader

should note that SPSS incorrectly labels this test a "T test" rather than a t test, but is inconsistent in
that labeling, as some of the SPSS output also refers to t-test results.
A between-groups design is one in which participants have been randomly assigned to the two levels
of the independent variable. In this design, each participant is assigned to only one group, and
consequently, the two groups are independent of one another. For example, suppose that you are
interested in studying the effects of caffeine consumption on task performance. If you randomly assign
some participants to the caffeine group and other participants to the no-caffeine group, then you are
using a between-groups design. In a within-subjects design, by contrast, all participants would be
tested once with caffeine and once without caffeine.
An Example: Parental Involvement Experiment
Assume that you studied the effects of parental involvement (independent variable) on students'
grades (dependent variable). Half of the students in a third grade class were randomly assigned to the
parental involvement group. The teacher contacted the parents of these children throughout the year
and told them about the educational objectives of the class. Further, the teacher gave the parents
specific methods for encouraging their children's educational activities. The other half of the students
in the class were assigned to the no-parental involvement group. The scores on the first test were
tabulated for all of the children, and these are presented below:
Student  Involve  Test1
1        1        78.6
2        1        64.9
3        1        100.0
4        1        83.7
5        1        94.0
6        1        78.2
7        1        76.9
8        1        82.0
9        0        81.0
10       0        69.5
11       0        73.8
12       0        66.7
13       0        54.8
14       0        69.3
15       0        73.5
16       0        79.4

Creating Your Data File: Key Point


When creating a data file for an independent-samples t test in SPSS, you must also create a separate
column for the grouping variable that shows to which condition or group a particular participant
belongs. In this case, that is the parental involvement condition, so you should create a numeric code
that allows SPSS to identify the parental involvement condition for that particular score. If this concept
is difficult to grasp, you may want to revisit Lesson 2, in which a grouping variable is created for male
and female students.
So, the variable view of your SPSS data file should look like the one below, with three variables--one
for student number, one for parental involvement condition (using for example a code of "1" for
involvement and "0" for no involvement), and one column for the score on Test 1. When creating the
data file, it is a good idea to create a variable Label for each variable and a Value label for the
grouping variable(s). These labels make it easier to interpret the output of your statistical procedures.
The variable view of the data file might look similar to the one below.

Figure 4-1 Variable View

The data view of the file should look like the following:

Figure 4-2 Data View

Note that in this particular case the two groups are separated in the data file, with the first half of the
data corresponding to the parental involvement condition and the second half corresponding to the
no-involvement condition. Although this makes for an orderly data table, such ordering is NOT required in
SPSS for the independent-samples t test. When performing the test, whether or not the data are
sorted by the independent variable, you must specify which condition a participant is in by use of
a grouping variable as indicated above.

Performing the t test for the Parental Involvement Experiment


You should enter the data as described above. Or you may access the SPSS data file for the parental
involvement experiment by clicking here. To perform the t test, complete the following steps in order.
Click on Analyze, then Compare Means, then Independent Samples T Test.

Figure 4-3 Select Analyze, Compare Means, Independent-Samples T Test

Now, move the dependent variable (in this case, labeled "Score on Test 1 [Test1]") into the Test
Variable window. Then move your independent variable (in this case, "Parental Involvement
[Involve]") into the Grouping Variable window. Remember that Grouping Variable stands for the
levels of the independent variable.

Figure 4-4 Independent-Samples T Test dialog box

You will notice that there are question marks in the parentheses following your independent variable in
the Grouping Variable field. This is because you need to define the particular groups that you want
to compare. To do so, click on Define Groups, and indicate the numeric values that each group
represents. In this case, you will want to put a "0" in the field labeled Group 1 and a "1" in the field
labeled Group 2. Once you have done this, click on Continue.
Now click on OK to run the t test. You may also want to click on Paste in order to save the SPSS
syntax of what you have done (see Figure 4-5) in case you desire to run the same kind of test from
SPSS syntax.

Figure 4-5 Syntax for the independent-samples t test
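The pasted syntax shown in Figure 4-5 should look something like the following sketch (the grouping values 0 and 1 match the coding described above; the confidence-interval and missing-value subcommands are SPSS defaults):

T-TEST GROUPS=Involve(0 1)
  /MISSING=ANALYSIS
  /VARIABLES=Test1
  /CRITERIA=CI(.95).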

Output from the t test Procedure


As you can see below, the output from an independent-samples t test procedure is relatively
straightforward.

Figure 4-6 Independent-samples t test output

Interpreting the Output


In the SPSS output, the first table lists the number of participants (N), mean, standard deviation, and
standard error of the mean for both of your groups. Notice that the value labels are printed as well as
the variable labels for your variables, making it easier to interpret the output.
The second table (see Figure 4-6) presents you with an F test (Levene's test for equality of variances)
that evaluates the basic assumption of the t test that the variances of the two groups are
approximately equal (homogeneity of variance or homoscedasticity). If the F value reported here is
very high and the significance level is very low--usually lower than .05 or .01, then the assumption of
homogeneity of variance has been violated. In that case, you should use the t test in the lower half of
the table, whereas if you have not violated the homogeneity assumption, you should use the t test in
the upper half of the table. The t-test formula for unequal variances makes an adjustment to the
degrees of freedom, so this value is often fractional, as seen above.
In this particular case, you can see that we have not violated the homogeneity assumption, and we
should report the value of t as 2.356, degrees of freedom of 14, and the significance level of .034.
Thus, our data show that parental involvement has a significant effect on grades, t(14) = 2.356, p = .034.

Lesson 5: Paired-Samples t Test


Objectives
1. Conduct a paired-samples t test.
2. Interpret the output of the paired-samples t test.

Overview
The paired-samples or dependent t test is used for within-subjects or matched-pairs designs in which
observations in the groups are linked. The linkage could be based on repeated measures, natural
pairings such as mothers and daughters, or pairings created by the experimenter. In any of these
cases, the analysis is the same. The dependency between the two observations is taken into account,
and each set of observations serves as its own control, making this a generally more powerful test
than the independent-samples t test. Because of the dependency, the degrees of freedom for the
paired-samples t test are based on the number of pairs rather than the number of observations.
Example
Imagine that you conducted an experiment to test the effects of the presence of others
(independent variable) on problem-solving performance (dependent variable). Assume further that
you used a within-subjects design; that is, each participant was tested alone and in the presence of
others on different days using comparable tasks. Higher scores indicate better problem-solving
performance. The data appear below:
Participant

Alone

Others

12

10

12

10

11

10

12

11

12

The following figure shows the variable view of the structure of the dataset:

Figure 5-1 Dataset variable view

Entering Data for a Within-Subjects Design: Key Point

When you enter data for a within-subjects design, there must be a separate column for each
condition. This tells SPSS that the two data points are linked for a given participant. Unlike the
independent-samples t test where a grouping variable is required, there is no additional grouping
variable in the paired-samples t test. The properly configured data are shown in the following
screenshot of the SPSS Data Editor Data View:

Figure 5-2 Dataset data view

Performing the Paired-Samples t test Step-by-Step


The SPSS data file for this example can be found here. After you have entered or opened the dataset,
you should follow these steps in order.
Click on Analyze, Compare Means, and then Paired-Samples T test.

Figure 5-3 Select Paired-Samples T Test

In the resulting dialog box, click on the label for Alone and then press <Shift> and click on the label
for Others. Click on the arrow to move this pair of variables to the Paired Variables window.

Figure 5-4 Identify paired variables
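For reference, the pasted syntax for this analysis is approximately the following sketch:

T-TEST PAIRS=Alone WITH Others (PAIRED)
  /CRITERIA=CI(.95)
  /MISSING=ANALYSIS.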

Interpreting the Paired-Samples t Test Output

Click OK and the following output appears in the SPSS Output Viewer Window (see Figure 5-5). Note
that the correlation between the two observations is reported along with its p level, and that the value
of t, the degrees of freedom (df), and the p level of the calculated t are reported as well.

Figure 5-5 Paired-Samples T Test output

Lesson 6: One-Way ANOVA


Objectives
1. Conduct a one-way ANOVA.
2. Perform post hoc comparisons among means.
3. Interpret the ANOVA and post hoc comparison output.

Overview
The one-way ANOVA compares the means of three or more independent groups. Each group
represents a different level of a single independent variable. It is useful at least conceptually to think
of the one-way ANOVA as an extension of the independent-samples t test. The null hypothesis in the
ANOVA is that the several populations being sampled all have the same mean. Because the variance is
based on deviations from the mean, the "analysis of variance" can be used to test hypotheses about
means. The test statistic in the ANOVA is an F ratio, which is a ratio of two variances. When an ANOVA
leads to the conclusion that the sample means differ by more than a chance level, it is usually
instructive to perform post hoc (or a posteriori) analyses to determine which of the sample means are
different. It is also helpful to determine and report effect size when performing ANOVA.

Example Problem
In a class of 30 students, ten students each were randomly assigned to three different methods of
memorizing word lists. In the first method, the student was instructed to repeat the word silently
when it was presented. In the second method, the student was instructed to spell the word backward
and visualize the backward word and to pronounce it silently. The third method required the student to
associate each word with a strong memory. Each student saw the same 10 words flashed on a
computer screen for five seconds each. The list was repeated in random order until each word had
been presented a total of five times. A week later, students were asked to write down as many of the
words as they could recall. For each of the three groups, the number of correctly-recalled words is
shown in the following table:
Method1   Method2   Method3

Entering the Data in SPSS


Recall our previous lessons on data entry. These 30 scores represent 30 different individuals, and each
participant's data should take up one line of the data file. The group membership should be coded as a
separate variable. The correctly-entered data would take the following form (see Figure 6-1). Note
that although we used 1, 2, and 3 to code group membership, we could just as easily have used 0, 1,
and 2.

Figure 6-1 Data for one-way ANOVA

Conducting the One-Way ANOVA


To perform the one-way ANOVA in SPSS, click on Analyze, Compare Means, One-Way ANOVA (see
Figure 6-2).

Figure 6-2 Select Analyze, Compare Means, One-Way ANOVA

In the resulting dialog box, move Recall to the Dependent List and Method to the Factor field. Select
Post Hoc and then check the box in front of Tukey for the Tukey HSD test (see Figure 6-3), which is
one of the most frequently used post hoc procedures. Note also the many other post hoc comparison
tests available.

Figure 6-3 One-Way ANOVA dialog with Tukey HSD test selected
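The equivalent syntax sketch, including the Tukey HSD request, is approximately:

ONEWAY Recall BY Method
  /MISSING=ANALYSIS
  /POSTHOC=TUKEY ALPHA(0.05).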

The ANOVA summary table and the post hoc test results appear in the SPSS Viewer (see Figure 6-4).
Note that the overall (omnibus) F ratio is significant, indicating that the means differ by a larger
amount than would be expected by chance alone if the null hypothesis were true. The post hoc test
results indicate that the mean for Method 1 is significantly lower than the means for Methods 2 and 3,
but that the means for Methods 2 and 3 are not significantly different.

Figure 6-4 ANOVA summary table and post hoc test results

As an aid to understanding the post hoc test results, SPSS also provides a table of homogeneous
subsets (see Figure 6-5). Note that it is not strictly necessary that the sample sizes be equal in the
one-way ANOVA, and when they are unequal, the Tukey HSD procedure uses the harmonic mean of
the sample sizes for post hoc comparisons.

Figure 6-5 Table of homogeneous subsets

Missing from the ANOVA results table is any reference to effect size. A common effect size index is eta
squared, which is the between-groups sum of squares divided by the total sum of squares. As such,
this index represents the proportion of variance that can be attributed to between-group differences or
treatment effects. An alternative method of performing the one-way ANOVA provides the effect-size
index, but not the post hoc comparisons discussed earlier. To perform this alternative analysis,
select Analyze, Compare Means, Means (see Figure 6-6). Move Recall to the Dependent List and
Method to the Independent List. Under Options, select Anova Table and eta.

Figure 6-6 ANOVA procedure and effect size index available from Means procedure
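A syntax sketch for this alternative analysis is:

MEANS TABLES=Recall BY Method
  /CELLS=MEAN COUNT STDDEV
  /STATISTICS=ANOVA.

The /STATISTICS=ANOVA subcommand requests the ANOVA table along with eta and eta squared.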

The ANOVA summary table from the Means procedure appears in Figure 6-7 below. Eta squared is
directly interpretable as an effect size index: 58 percent of the variance in recall can be explained by
the method used for remembering the word list.

Figure 6-7 ANOVA table and effect size from Means procedure

Lesson 7: Repeated-Measures ANOVA


Objectives
1. Conduct the repeated-measures ANOVA.
2. Interpret the output.
3. Construct a profile plot.

Overview
The repeated-measures or within-subjects ANOVA is used when there are multiple measures for each
participant. It is conceptually useful to think of the repeated-measures ANOVA as an extension of the
paired-samples t test. Each set of observations for a subject or case serves as its own control, so this
test is quite powerful. In the repeated-measures ANOVA, the test of interest is the within-subjects
effect of the treatments or repeated measures.
The procedure for performing a repeated-measures ANOVA in SPSS is found in the Analyze, General
Linear Model menu.
Example Data
Assume that a statistics professor is interested in the effects of taking a statistics course on
performance on an algebra test. She administers a 20-item college algebra test to ten randomly
selected statistics students at the beginning of the term, at the end of the term, and six months after
the course is finished. The hypothetical test results are as follows.
Student

Before

After

SixMo

13

15

17

12

15

14

12

17

16

19

20

20

10

15

14

10

13

15

12

11

14

15

13

10

11

16

Coding Considerations
Data coding considerations in the repeated-measures ANOVA are similar to those in the paired-samples
t test. Each participant or subject takes up a single row in the data file, and each observation
requires a separate column. The properly coded SPSS data file with the data entered correctly should
appear as follows (see Figure 7-1). You may also retrieve a copy of the data file if you like.

Figure 7-1 SPSS data file coded for repeated-measures ANOVA

Performing the Repeated-Measures ANOVA


To perform the repeated-measures ANOVA in SPSS, click on Analyze, then General Linear Model,
and then Repeated Measures. See Figure 7-2.

Figure 7-2 Select Analyze, General Linear Model, Repeated Measures

In the resulting Repeated Measures dialog, you must specify the number of factors and the number of
levels for each factor. In this case, the single factor is the time the algebra test was taken, and there
are three levels: at the beginning of the course, immediately after the course, and six months after
the course. You can accept the default label of factor1, or change it to a more descriptive one. We will
use "Time" as the label for our factor, and specify that there are three levels (see Figure 7-3).

Figure 7-3 Specifying factor and levels

After naming the factor and specifying the number of levels, you must add the factor and then define
it. Click on Add and then click on Define. See Figure 7-4.

Figure 7-4 Specifying within-subjects variable levels

Now you can enter the levels one at a time by clicking on a variable name and then clicking on the
right arrow adjacent to the Within-Subjects Variables field. Or you can click on Before in the left pane
of the Repeated Measures dialog, then hold down <Shift> and click on SixMo to select all three levels
at the same time, and then click on the right arrow to move all three levels to the window in one step
(see Figure 7-5).

Figure 7-5 Within-subjects variables appropriately entered

Clicking on Options allows you to specify the calculation of descriptive statistics, effect size, and
contrasts among the means. If you like, you can also click on Plots to include a line graph of the
algebra test mean scores for the three administrations. Figure 7-6 is a screen shot of the Profile Plots
dialog. You should click on Time, then Horizontal Axis, and then click on Add. Click Continue to
return to the Repeated Measures dialog.

Figure 7-6 Profile Plots dialog

Now click on Options and specify descriptive statistics, effect size, and contrasts (see Figure 7-7). You
must move Time to the Display Means window as well as specify a confidence level adjustment for the
main effects contrasts. A Bonferroni correction will adjust the alpha level in the post hoc comparisons,
while the default LSD (Fisher's least significant difference test) will not adjust the alpha level. We will
select the more conservative Bonferroni correction.

Figure 7-7 Specifying descriptive statistics, effect size, and mean contrasts
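The full specification pastes to syntax approximately like the following sketch (the factor name Time and the Bonferroni adjustment match the choices above; the other subcommands are typical defaults):

GLM Before After SixMo
  /WSFACTOR=Time 3 Polynomial
  /METHOD=SSTYPE(3)
  /PLOT=PROFILE(Time)
  /EMMEANS=TABLES(Time) COMPARE ADJ(BONFERRONI)
  /PRINT=DESCRIPTIVE ETASQ
  /WSDESIGN=Time.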

Click on Continue, then OK to run the repeated-measures ANOVA. The SPSS output provides several
tests. When there are multiple dependent variables, the multivariate test is used to determine
whether there is an overall within-subjects effect for the combined dependent variables. As there is
only one within-subjects factor, we can ignore this test in the present case. Sphericity is an assumption
that the variances of the differences between the pairs of measures are equal. The non-significant test of
sphericity indicates that this assumption is not violated in the present case, and adjustments to the
degrees of freedom (and thus to the p level) are not required. The test of interest is the Test of
Within-Subjects Effects. We can assume sphericity and report the F ratio as 8.149 with 2 and 18
degrees of freedom and the p level as .003 (see Figure 7-8). Partial eta-squared has an interpretation
similar to that of eta-squared in the one-way ANOVA, and is directly interpretable as an effect-size
index: about 48 percent of the within-subjects variation in algebra test performance can be explained
by knowledge of when the test was administered.

Figure 7-8 Test of within-subjects effects

Additional insight is provided by the Bonferroni-corrected pairwise comparisons, which indicate that
the means for Before and After are significantly different, while none of the other comparisons are
significant. The profile plot is of assistance in visualizing these contrasts. See Figures 7-9 and
7-10. These results indicate an immediate but unsustained improvement in algebra test performance
for students taking a statistics course.

Figure 7-9 Bonferroni-corrected pairwise comparisons

Figure 7-10 Profile plot

Lesson 8: Two-Way ANOVA


Objectives
1. Conduct the two-way ANOVA.
2. Examine and interpret main effects and the interaction effect.
3. Produce a plot of cell means.

Overview
We will introduce the two-way ANOVA with the simplest of such designs, a balanced or completely
crossed factorial design. In this case there are two independent variables (factors), each of which has
two or more levels. We can think of this design as a table in which each cell represents a single
independent group. The group represents a combination of levels of the two factors. For simplicity, let
us refer to the factors as A and B and assume that each factor has two levels and each independent

group has the same number of observations. There will be four independent groups. The design can
thus be visualized as follows:

Figure 8-1 Conceptualization of Two-Way ANOVA

The two-way ANOVA is an economical design, because it allows the assessment of the main effects of
each factor as well as their potential interaction.
Example Data and Coding Considerations
Assume that you are studying the effects of observing violent acts on subsequent aggressive behavior.
You are interested in the kind of violence observed: a violent cartoon versus a video of real-action
violence. A second factor is the amount of time one is exposed to violence: ten minutes or 30 minutes.
You randomly assign 8 children to each group. After the child watches the violent cartoon or action
video, the child plays a Tetris-like computer video game for 30 minutes. The game provides options for
either aggressing ("trashing" the other computerized player) or simply playing for points without
interfering with the other player. The program provides 100 opportunities for the player to make an
aggressive choice and records the number of times the child chooses an aggressive action when the
game provides the choice. The hypothetical data are below:

Figure 8-2 Example Data

When coding and entering data for this two-way ANOVA, you should recognize that each of the 32
participants is a unique individual and that there are no repeated measures. Therefore, each
participant takes up a row in the data file, and the data should be coded and entered in such a way
that the factors are identified by two columns with group membership coded as a combination of the
levels. For illustrative purposes we will use 1 and 2 to represent the levels of the factors, though as
you learned earlier, you could just as easily have used 0s and 1s. The data view of the resulting SPSS
data file should appear something like this:

Figure 8-3 SPSS data file data view for two-way ANOVA (partial data)

For ease of interpretation, the variables can be labeled and the values of each specified in the variable
view (see Figure 8-4).

Figure 8-4 Variable view with labels and values identified

If you prefer, you may retrieve a copy of the data file.


Performing the Two-Way ANOVA
To perform the two-way ANOVA, select Analyze, General Linear Model, and
then Univariate because there is only one dependent variable (see Figure 8-5).

Figure 8-5 Select Analyze, General Linear Model, Univariate

In the resulting dialog, you should specify that Aggression is the dependent variable and that both
Time and Type are fixed factors (see Figure 8-6).

Figure 8-6 Specifying the two-way ANOVA

This procedure will test the main effects for Time and Type as well as their possible interaction. It is
helpful to specify profile plots to examine the interaction of the two variables. For that purpose,
select Plots and then move Type to the Horizontal Axis field and Time to the Separate Lines field (see
Figure 8-7).

Figure 8-7 Specifying profile plots

When you click on Add, the Type * Time interaction is added to the Plots window, as shown in Figure
8-8.

Figure 8-8 Plotting an interaction term

Click Continue, then click Options. Check the boxes in front of Descriptive statistics and Estimates of
effect size (see Figure 8-9). Click Continue, then click OK to run the two-way ANOVA. The table of
interest is the Test of Between-Subjects Effects. Examination of the table reveals significant F ratios
for Time, Type and the Time * Type interaction (see Figure 8-9).

Figure 8-9 Table of between-subjects effects
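For reference, a syntax sketch for this two-way ANOVA, including the profile plot and the descriptive and effect-size options, is:

UNIANOVA Aggression BY Time Type
  /METHOD=SSTYPE(3)
  /PLOT=PROFILE(Type*Time)
  /PRINT=DESCRIPTIVE ETASQ
  /DESIGN=Time Type Time*Type.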


As in the repeated-measures ANOVA, a partial eta-squared is calculated as a measure of effect size.
The profile plot (see Figure 8-10) shows that the interaction is ordinal: the differences in the number
of aggressive choices made after observing the two violence conditions increase with the time of
exposure.

Figure 8-10 Interaction plot

Lesson 9: ANOVA for Mixed Factorial Designs


Objectives
1. Conduct a mixed-factorial ANOVA.
2. Test between-groups and within-subjects effects.
3. Construct a profile plot.

Overview
A mixed factorial design involves two or more independent variables, of which at least one is a
within-subjects (repeated measures) factor and at least one is a between-groups factor. In the simplest case,

there will be one between-groups factor and one within-subjects factor. The between-groups factor
would need to be coded in a single column as with the independent-samples t test or the one-way
ANOVA, while the repeated measures variable would comprise as many columns as there are
measures as in the paired-samples t test or the repeated-measures ANOVA.
Example Data
As an example, assume that you conducted an experiment in which you were interested in the extent
to which visual distraction affects younger and older people's learning and remembering. To do this,
you obtained a group of younger adults and a separate group of older adults and had them learn
under three conditions (eyes closed, eyes open looking at a blank field, eyes open looking at a
distracting field of pictures). This is a 2 (age) x 3 (distraction condition) mixed factorial design. The
scores on the data sheet below represent the number of words recalled out of ten under each
distraction condition.
Age       Closed Eyes   Simple Distraction   Complex Distraction
Younger
Younger
Younger
Younger
Older
Older
Older
Older

Building the SPSS Data File


Note that there are eight separate participants, so the data file will require eight rows. There will be a
column for the participants' age, which is the between-groups variable, and three columns for the
repeated measures, which are the distraction conditions. As always it is helpful to include a column for
participant (or case) number.
The data appropriately entered in SPSS should look something like the following (see Figure 9-1). You
may optionally download a copy of the data file.

Figure 9-1 SPSS data structure for mixed factorial design

Performing the Mixed Factorial ANOVA

To conduct this analysis, you will use the repeated measures procedure. The initial steps are identical
to those in the within-subjects ANOVA. You must first specify repeated measures to identify the
within-subjects variable(s), and then specify the between-groups factor(s).
Select Analyze, then General Linear Model, then Repeated Measures (see Figure 9-2).

Figure 9-2 Preparing for the Mixed Factorial Analysis

Next, you must define the within-subjects factor(s). This process should be repeated for each factor
on which there are repeated measures. In our present case, there is only one within-subject variable,
the distraction condition. SPSS will give the within-subjects variables the names factor1, factor2, and
so on, but you can provide more descriptive names if you like. In the Repeated Measures dialog box,
type in the label distraction and the number of levels, 3. See Figure 9-3. If you like, you can give this
measure (the three distraction levels) a new name by clicking in the Measure Name field. If you
choose to name this factor, the name must be unique and may not conflict with any other variable
names. If you do not name the measure, the SPSS name for the measure will default to MEASURE_1.
In the present case we will leave the measure name blank and accept the default label.

Figure 9-3 Specifying the within-subjects factor.

We will now specify the within-subjects and between-groups variables. Click on Add and
then Define to specify which variable in the dataset is associated with each level of the
within-subjects factor (see Figure 9-4).

Figure 9-4 Defining the within-subjects variable

Move the Closed, Simple, and Complex variables to levels 1, 2, and 3, respectively, and then move
Age to the Between-Subjects Factor(s) window (see Figure 9-5). You can optionally specify one or
more covariates for analysis of covariance.

Figure 9-5 The complete design specification for the mixed factorial ANOVA

To display a plot of the cell means, click on Plots, and then move Age to the Horizontal axis, and
distraction to Separate Lines. Next click on Add to specify the plot (see Figure 9-6) and then
click Continue.

Figure 9-6 Specifying plot

We will use the Options menu to display marginal and cell means, to compare main
effects, to display descriptive statistics, and to display measures of effect size. We will select the
Bonferroni interval adjustment to control the level of Type I error. See Figure 9-7.

Figure 9-7 Repeated measures options
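The complete mixed-design specification pastes to syntax approximately like the following sketch (the variable names Closed, Simple, Complex, and Age and the factor name distraction match the steps above):

GLM Closed Simple Complex BY Age
  /WSFACTOR=distraction 3 Polynomial
  /METHOD=SSTYPE(3)
  /PLOT=PROFILE(Age*distraction)
  /EMMEANS=TABLES(Age) COMPARE ADJ(BONFERRONI)
  /EMMEANS=TABLES(distraction) COMPARE ADJ(BONFERRONI)
  /PRINT=DESCRIPTIVE ETASQ
  /WSDESIGN=distraction
  /DESIGN=Age.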

Select Continue to close the options dialog and then OK to run the ANOVA. The resulting SPSS output
is rather daunting, but you should focus on the between and within-subjects tests. The test of
sphericity is not significant, indicating that this assumption has not been violated. Therefore you
should use the F ratio and degrees of freedom associated with the sphericity assumption (see Figure
9-8). Specifically you will want to determine whether there is a main effect for age, an effect for
distraction condition, and a possible interaction of the two. The tables of interest from the SPSS
Viewer are shown in Figures 9-8 and 9-9.

Figure 9-8 Partial SPSS output

The test of within-subjects effects indicates that there is a significant effect of the distraction condition
on word memorization. The lack of an interaction between distraction and age indicates that this effect
is consistent for both younger and older participants. The test of between-subjects effects (see Figure
9-9) indicates there is a significant effect of the age condition on word memory.

Figure 9-9 Test of between-subjects effects

The remainder of the output assists in the interpretation of the main effects of the within-subjects
(distraction condition) and between-subjects (age condition) factors. Of particular interest is the
profile plot, which clearly displays the main effects and the absence of an interaction (see Figure 9-10). As discussed above, SPSS calls the within-subjects variable MEASURE_1 in the plot.

Figure 9-10 Profile plot

Lesson 10: Correlation and Scatterplots


Objectives
1. Calculate correlation coefficients.
2. Test the significance of correlation coefficients.
3. Construct a scatterplot.
4. Edit features of the scatterplot.

Overview

In correlational research, there is no experimental manipulation. Rather, we measure variables in their
natural state. Instead of independent and dependent variables, it is useful to think of predictors and
criteria. In bivariate (two-variable) correlation, we are assessing the degree of linear relationship
between a predictor, X, and a criterion, Y. In multiple regression, we are assessing the degree of
relationship between a linear combination of two or more predictors, X1, X2, ...Xk, and a criterion, Y.
We will address correlation in the bivariate case in Lesson 10, linear regression in the bivariate case
in Lesson 11, and multiple regression and correlation in Lesson 12.
The Pearson product moment correlation coefficient summarizes and quantifies the relationship
between two variables in a single number. This number can range from -1 representing a perfect
negative or inverse relationship to 0 representing no relationship or complete independence to +1
representing a perfect positive or direct relationship. When we calculate a correlation coefficient from
sample data, we will need to determine whether the obtained correlation is significantly different from
zero. We will also want to produce a scatterplot or scatter diagram to examine the nature of the
relationship. Sometimes the correlation is low not because of a lack of relationship, but because of a
lack of linear relationship. In such cases, examining the scatterplot will assist in determining whether a
relationship may be nonlinear.
Example Data
Suppose that you have collected questionnaire responses to five questions concerning dormitory
conditions from 10 college freshmen. (Normally you would like to have a larger sample, but the small
sample in this case is useful for illustration.) The questionnaire assesses the students' level of
satisfaction with noise, furniture, study area, safety, and privacy. Assume that you have also assessed
the students' family income level, and you would like to test the hypothesis that satisfaction with the
college living environment is related to wealth (family income).
The questionnaire contains five questions about satisfaction with various aspects of the dormitory:
"noise," "furniture," "study area," "safety," and "privacy." These are answered on a 5-point Likert-type
scale (very dissatisfied to very satisfied), coded as 1 to 5. The data sheet for this study is shown below.
Student  Income  Noise  Furniture  Study_Area  Safety  Privacy
1        39
2        59
3        75
4        45
5        95
6        115
7        67
8        48
9        140
10       55

Entering the Data in SPSS


The data correctly entered in SPSS would look like the following (see Figure 10-1). Remember not only
to enter the data, but to add appropriate labels in the Variable View to improve the readability of the
output. If you prefer, you can download a copy of the data file.

Figure 10-1 Data entered in SPSS

Calculating and Testing Correlation Coefficients


To calculate and test the significance of correlation coefficients,
select Analyze, Correlate, Bivariate (see Figure 10-2).

Figure 10-2 The bivariate correlation procedure

Move the desired variables to the Variables window, as shown in Figure 10-3.

Figure 10-3 Move desired variables to the Variables window

Under the Options menu, let us select means and standard deviations and then click Continue. The
output contains a table of descriptive statistics (see Figure 10-4) and a table of correlations and
related significance tests (see Figure 10-5).

Figure 10-4 Descriptive statistics

Figure 10-5 Correlation matrix
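The corresponding syntax sketch, including the descriptive statistics requested under Options, is:

CORRELATIONS
  /VARIABLES=Income Noise Furniture Study_Area Safety Privacy
  /PRINT=TWOTAIL NOSIG
  /STATISTICS=DESCRIPTIVES
  /MISSING=PAIRWISE.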

Note that SPSS flags significant correlations with asterisks. The correlation matrix is symmetrical, so
the above-diagonal entries are the same as the below-diagonal entries. In our survey results we note
strong negative correlations between family income and the various survey items and strong positive
correlations among the various items.

Constructing a Scatterplot
For purposes of illustration, let us produce a scatterplot of the relationship between satisfaction with
noise level in the dormitory and family income. We see from the correlation matrix that this is a
significant negative correlation. As family income increases, satisfaction with the dormitory noise level
decreases. To build the scatterplot, select Graphs, Interactive, Scatterplot (see Figure 10-6). Please
note that there are several different ways to construct the scatterplot in SPSS, and that we are
illustrating only one here.

Figure 10-6 Constructing a scatterplot


In the resulting dialog, enter Family Income on the X-axis and Noise on the Y-axis (see Figure 10-7).

Figure 10-7 Specifying variables for the scatterplot

The resulting scatterplot (see Figure 10-8) shows the relationship between family income and
satisfaction with dormitory noise.

Figure 10-8 Scatterplot

You can edit a chart object by double-clicking on it in the SPSS Viewer. Among many other options, you can change the labeling and scaling of axes, add trend lines and other elements to the scatterplot, and change the marker types. The edited chart appears in Figure 10-9. If you like, you can save this particular combination of settings as a chart template to use again in the future.

Figure 10-9 Edited scatterplot

Lesson 11: Linear Regression


Objectives
1. Determine the regression equation.
2. Compute predicted Y values.
3. Compute and interpret residuals.
Overview
Closely related to correlation is the topic of linear regression. As you learned in Lesson 10, the
correlation coefficient is an index of linear relationship. If the correlation coefficient is significant, that
is an indication that a linear equation can be used to model the relationship between the
predictor X and the criterion Y. In this lesson you will learn how to determine the equation of the line
of best fit between the predictor and the criterion, how to compute predicted values based on that
linear equation, and how to calculate and interpret residuals.
Example Problem and Data
This spring term you are in a large introductory psychology class. You observe an apparent
relationship between the outside temperature and the number of people who skip class on a given
day. More people seem to be absent when the weather is warmer, and more seem to be present when
it is cooler outside. You randomly select 10 class periods and record the outside temperature reading

10 minutes before class time and then count the number of students in attendance that day. If you
determine that there is a significant linear relationship, you would like to impress your professor by
predicting how many people will be present on a given day, based on the outside temperature. The
data you collect are the following:
Temp    Attendance
50      87
77      60
67      73
53      86
75      59
70      65
83      65
85      62
80      58
64      89
Entering the Data in SPSS


These pairs of data must be entered as separate variables. The data file may look something like the
following (see Figure 11-1):

Figure 11-1 Data in SPSS

If you prefer, you can download a copy of the data. As you learned in Lesson 10, you should first
determine whether there is a significant correlation between temperature and attendance. Running the
Correlation procedure (see Lesson 10 for details), you find that the correlation is -.87, and is
significant at the .01 level (see Figure 11-2).

Figure 11-2 Significant correlation

A scatterplot is helpful in visualizing the relationship (see Figure 11-3). Clearly, there is a negative
relationship between attendance and temperature.

Figure 11-3 Scatterplot

Linear Regression

The correlation and scatterplot indicate a strong, though by no means perfect, relationship between
the two variables. Let us now turn our attention to regression. We will "regress" the attendance (Y) on
the temperature (X). In linear regression, we are seeking the equation of a straight line that best fits
the observations. The usefulness of such a line may not be immediately apparent, but if we can model
the relationship by a straight line, we can use that line to predict a value of Y for any value of X, even
those that have not yet been observed. For example, looking at the scatterplot in Figure 11-3, what
attendance would you predict for a temperature of 60 degrees? The regression line can answer that
question. This line will have an intercept term and a slope coefficient and will be of the general form
Ŷ = a + bX
The intercept and slope (regression) coefficient are derived in such a way that the sums of the
squared deviations of the actual data points from the line are minimized. This is called "ordinary least
squares" estimation or OLS. Note that the predicted value of Y (read "Y-hat") is a linear combination
of two constants, the intercept term and the slope term, and the value of X, so that the only thing that
varies is the value of X. Therefore, the correlation between the predicted Ys and the observed Ys will
be the same as the correlation between the observed Ys and the observed Xs. If we subtract the
predicted value of Y from the observed value of Y, the difference is called a "residual." A residual
represents the part of the Y variable that cannot be explained by the X variable. Visually, the distance
between the observed data points and the line of best fit represents the residual.
SPSS's Regression procedure allows us to determine the equation of the line of best fit, to calculate
predicted values of Y, and to calculate and interpret residuals. Optionally, you can save the predicted
values of Y and the residuals as either standard scores or raw-score equivalents.
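As a rough illustration of what the Regression procedure computes, the following Python sketch fits a least-squares line and derives predicted values and residuals. The x and y arrays are placeholders, not the attendance data from this lesson.

import numpy as np

x = np.array([50.0, 55.0, 60.0, 65.0, 70.0, 75.0])
y = np.array([90.0, 85.0, 82.0, 76.0, 70.0, 64.0])

slope, intercept = np.polyfit(x, y, deg=1)   # line of best fit: y_hat = intercept + slope * x
y_hat = intercept + slope * x                # predicted values of Y
residuals = y - y_hat                        # observed minus predicted

print(f"y_hat = {intercept:.3f} + {slope:.3f} * x")
print("residuals:", np.round(residuals, 2))
# Because y_hat is just a linear function of x, the correlation between the
# predicted and observed Ys equals the absolute value of the correlation between x and y.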
Running the Regression Procedure
Open the data file in SPSS. Select Analyze, Regression, and then Linear (see Figure 11-4).

Figure 11-4 Performing the Regression procedure

The Regression procedure outputs a value called "Multiple R," which will always range from 0 to 1. In
the bivariate case, Multiple R is the absolute value of the Pearson r, and is thus .87. The square of r or
of Multiple R is .752, and represents the amount of shared variance between Y and X. When we run
the regression tool, we can optionally ask for either standardized or unstandardized (raw-score)
predicted values of Y and residuals to be calculated and saved as new variables (see Figure 11-5).

Figure 11-5 Save options in the Regression procedure

Click OK to run the Regression procedure. The output is shown in Figure 11-6. In the ANOVA table
summarizing the regression, the omnibus F test tests the hypothesis that the population Multiple R is
zero. We can safely reject that null hypothesis. Notice that dividing the regression sum of squares,
which is based on the predicted values of Y, by the total sum of squares, which is based on the
observed values of Y, produces the same value as R Square. The value of R Square thus represents
the proportion of variance in the criterion that can be explained by the predictor. The residual sum of
squares represents the variance in the criterion that remains unexplained.
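The following brief Python sketch illustrates that identity: R Square equals the regression sum of squares divided by the total sum of squares. The data are illustrative placeholders, not the SPSS output shown in the figure.

import numpy as np

x = np.array([50.0, 55.0, 60.0, 65.0, 70.0, 75.0])
y = np.array([90.0, 85.0, 82.0, 76.0, 70.0, 64.0])

slope, intercept = np.polyfit(x, y, 1)
y_hat = intercept + slope * x

ss_total = np.sum((y - y.mean()) ** 2)           # based on observed Y
ss_regression = np.sum((y_hat - y.mean()) ** 2)  # based on predicted Y
ss_residual = np.sum((y - y_hat) ** 2)           # unexplained variation

r_square = ss_regression / ss_total
print(f"R Square = {r_square:.3f}")
print(f"check: SS_reg + SS_res = {ss_regression + ss_residual:.3f}, SS_total = {ss_total:.3f}")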

Figure 11-6 Regression procedure output

In Figure 11-7 you can see that the residuals and predicted values are now saved as new variables in
the SPSS data file.

Figure 11-7 Saving predicted values and residuals

The regression equation for predicting attendance from the outside temperature is 133.556 - .897 x
Temp. So for a temperature of 60 degrees, you would predict the attendance to be 80 students (see
Figure 11-8 in which this is illustrated graphically). Note that this process of using a linear equation to
predict attendance from the temperature has some obvious practical limits. You would never predict
attendance higher than 100 percent, for example, and there may be a point at which the temperature
becomes so hot as to be unbearable, and the attendance could begin to rise simply because the
classroom is air-conditioned.

Figure 11-8 Linear trend line and regression equation

To impress your professor, assume that the outside temperature on a class day is 72 degrees.
Substituting 72 for X in the regression equation, you predict that there will be 69 students in
attendance that day.
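A quick check of those two predictions, substituting the temperatures into the regression equation reported above:

def predicted_attendance(temp):
    # regression equation from this lesson: 133.556 - .897 * Temp
    return 133.556 - 0.897 * temp

for temp in (60, 72):
    print(temp, round(predicted_attendance(temp)))  # roughly 80 and 69 students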
Examining Residuals
A residual is the difference between the observed and predicted values for the criterion variable (Hair,
Black, Babin, Anderson, & Tatham, 2006). Bivariate linear regression and multiple linear regression
make four key assumptions about these residuals.
1. The phenomenon (i.e., the regression model being considered) is linear, so that the relationship between X and Y is linear.
2. The residuals have equal variances at all levels of the predicted values of Y.
3. The residuals are independent. This is another way of saying that the successive observations of the dependent variable are uncorrelated.
4. The residuals are normally distributed with a mean of zero.
Thus it can be very instructive to examine the residuals when you perform a regression analysis. It is
helpful to examine a histogram of the standardized residuals (see Figure 11-9), which can be created
from the Plots menu. The normal curve can be superimposed for visual reference.

Figure 11-9 Histogram of standardized residuals

These residuals appear to be approximately normally distributed. Another useful plot is the normal p-p
plot produced as an option in the Plots menu. This plot compares the cumulative probabilities of the
residuals to the expected frequencies if the residuals were normally distributed. Significant departures
from a straight line would indicate nonnormality in the data (see Figure 11-10). In this case the
residuals appear once again to be fairly normally distributed.

Figure 11-10 Normal p-p plot of observed and expected cumulative probabilities of residuals

When there are significant departures from normality, homoscedasticity, and linearity, data
transformations or the introduction of polynomial terms such as quadratic or cubic values of the
original independent or dependent variables can often be of help (Edwards, 1976).
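As a rough numeric counterpart to these visual checks, the following Python sketch standardizes the residuals from a placeholder regression and applies a Shapiro-Wilk test of normality. This is not an SPSS procedure, merely an illustration of the idea behind the residual plots.

import numpy as np
from scipy import stats

x = np.array([50.0, 55.0, 60.0, 65.0, 70.0, 75.0, 80.0, 85.0])
y = np.array([90.0, 85.0, 82.0, 76.0, 70.0, 64.0, 60.0, 55.0])

slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)
z_residuals = (residuals - residuals.mean()) / residuals.std(ddof=1)  # standardized residuals

w, p = stats.shapiro(z_residuals)
print(f"Shapiro-Wilk W = {w:.3f}, p = {p:.3f}")  # a large p gives no evidence against normality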
References
Edwards, A. L. (1976). An introduction to linear regression and correlation. San Francisco: Freeman.
Hair, J. F., Black, W. C., Babin, B. J., Anderson, R. E., & Tatham, R. L. (2006). Multivariate data analysis (6th ed.). Upper Saddle River, NJ: Pearson Prentice Hall.

Lesson 12: Multiple Correlation and


Regression
Objectives
1. Perform and interpret a multiple regression analysis.
2. Test the significance of the regression and the regression coefficients.
3. Examine residuals for diagnostic purposes.
Overview
Multiple regression involves one continuous criterion (dependent) variable and two or more predictors
(independent variables). The equation for a line of best fit is derived in such a way as to minimize the
sums of the squared deviations from the line. Although there are multiple predictors, there is only one
predicted Y value, and the correlation between the observed and predicted Y values is called
Multiple R. The value of Multiple R will range from zero to one. In the case of bivariate correlation, a
regression analysis will yield a value of Multiple R that is the absolute value of the Pearson product
moment correlation coefficient between X and Y, as discussed in Lesson 11. The multiple linear
regression equation will take the following general form:
Ŷ = b0 + b1X1 + b2X2 + ... + bkXk
Instead of using a to represent the Y intercept, it is common practice in multiple regression to call the
intercept term b0. The significance of Multiple R, and thus of the entire regression, must be tested. As
well, the significance of the individual regression coefficients must be examined to verify that a
particular independent variable is adding significantly to the prediction.
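To make the general form concrete, the following Python sketch fits a two-predictor least-squares equation and computes Multiple R. The small arrays are invented for illustration; they are not the course-grade data used later in this lesson.

import numpy as np

x1 = np.array([500., 550., 600., 620., 650., 700., 720., 750.])   # e.g., a test score
x2 = np.array([2.8, 3.0, 3.1, 3.4, 3.3, 3.6, 3.7, 3.9])           # e.g., a GPA
y  = np.array([70., 74., 78., 80., 82., 86., 88., 93.])           # criterion

X = np.column_stack([np.ones_like(x1), x1, x2])   # first column produces the intercept b0
b, *_ = np.linalg.lstsq(X, y, rcond=None)         # b = [b0, b1, b2]

y_hat = X @ b
multiple_r = np.corrcoef(y, y_hat)[0, 1]          # correlation between observed and predicted Y
print("b0, b1, b2 =", np.round(b, 4))
print(f"Multiple R = {multiple_r:.3f}, R Square = {multiple_r**2:.3f}")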
As in simple linear regression, residual plots are helpful in diagnosing the degree to which the linearity,
normality, and homoscedasticity assumptions have been met. Various data transformations can be
attempted to accommodate situations of curvilinearity, non-normality, and heteroscedasticity. In
multiple regression we must also consider the potential impact of multicollinearity, which is the degree
of linear relationship among the predictors. When there is a high degree of collinearity in the
predictors, the regression equation will tend to be distorted, and may lead to inappropriate conclusions
regarding which predictors are statistically significant (Lind, Marchal, and Wathen, 2006). For this
reason, we will ask for collinearity diagnostics when we run our regression. As a rule of thumb, if the
variance inflation factor (VIF) for a given predictor is very high or if the absolute value of the
correlation between two predictors is greater than .70, one or more of the predictors should be
dropped from the analysis, and the regression equation should be recomputed.
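The following short Python sketch shows how a variance inflation factor is computed in the two-predictor case: regress each predictor on the other(s) and take VIF = 1 / (1 − R²). The predictor arrays are placeholders.

import numpy as np

x1 = np.array([500., 550., 600., 620., 650., 700., 720., 750.])
x2 = np.array([2.8, 3.0, 3.1, 3.4, 3.3, 3.6, 3.7, 3.9])

r12 = np.corrcoef(x1, x2)[0, 1]       # with only two predictors, R-squared is simply r12 ** 2
vif = 1.0 / (1.0 - r12 ** 2)
print(f"r(x1, x2) = {r12:.3f}, VIF = {vif:.2f}")
# Rule of thumb from the text: be cautious when VIF is very high or |r12| exceeds .70.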
Multiple regression is in actuality a general family of techniques, and the mathematical and statistical
underpinnings of multiple regression make it an extremely powerful and flexible tool. By using group
membership or treatment level qualitative coding variables as predictors, one can easily use multiple
regression in place of t tests and analyses of variance. In this tutorial we will concentrate on the
simplest kind of multiple regression, a forced or simultaneous regression in which all predictor
variables are entered into the regression equation at one time. Other approaches include stepwise
regression in which variables are entered according to their predictive ability and hierarchical
regression in which variables are entered according to theory or hypothesis. We will examine
hierarchical regression more closely in Lesson 14 on analysis of covariance.
Example Data
The following data (see Figure 12-1) represent statistics course grades, GRE Quantitative scores, and
cumulative GPAs for 32 graduate students at a large public university in the southern U.S. (source:
data collected by the webmaster). You may click here to retrieve a copy of the entire dataset.

Figure 12-1 Statistics course grades, GREQ, and GPA (partial data)
Preparing for the Regression Analysis
We will determine whether quantitative ability (GREQ) and cumulative GPA can be used to predict
performance in the statistics course. A very useful first step is to calculate the zero-order correlations
among the predictors and the criterion. We will use the Correlate procedure for that purpose.
Select Analyze, Correlate, Bivariate (see Figure 12-2).

Figure 12-2 Calculate intercorrelations as preparation for regression analysis

In the Options menu of the resulting dialog box, you can request descriptive statistics if you like. The
resulting intercorrelation matrix reveals that GREQ and GPA are both significantly related to the course
grade, but are not significantly related to each other. Thus our initial impression is that collinearity will
not be a problem (see Figure 12-3).

Figure 12-3 Descriptive statistics and intercorrelations

Conducting the Regression Analysis


To conduct the regression analysis, select Analyze, Regression, Linear (see Figure 12-4).

Figure 12-4 Selecting the Linear Regression procedure

In the Linear Regression dialog box, move Grade to the Dependent variable field and GPA and GREQ to
the Independent(s) list, as shown in Figure 12-5.

Figure 12-5 Linear Regression dialog box

Click on the Statistics button and check the box in front of collinearity diagnostics (see Figure 12-6).

Figure 12-6 Requesting collinearity diagnostics

Select Continue and then click on Plots to request standardized residual plots and also to request
scatter diagrams. You should request a histogram and normal distribution plot of the standardized
residuals. You can also plot the standardized residuals against the standardized predicted values to
check the assumption of homoscedasticity (see Figure 12-7).

Click OK to run the regression analysis. The results are excerpted in Figure 12-8.

Figure 12-8 Regression procedure output (excerpt)

Interpreting the Regression Output


The significant overall regression indicates that a linear combination of GREQ and GPA predicts grades
in the statistics course. The value of R-Square is .513, and indicates that about 51 percent of the
variation in grades is accounted for by knowledge of GPA and GREQ. The significant t values for the
regression coefficients for GREQ and GPA show that each variable contributes significantly to the
prediction. Examining the unstandardized regression coefficients is not very instructive, because these
are based on raw scores and their values are influenced by the units of measurement of the
predictors. Thus, the raw-score regression coefficient for GREQ is much smaller than that for GPA
because the two variables use different scales. On the other hand, the standardized coefficients are
quite interpretable, because each shows the relative contribution to the prediction of the given
variable with the other variable held constant. These are technically standardized partial regression
coefficients. In the present case, we can conclude that GREQ has more predictive value than GPA,
though both are significant.
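The following Python sketch shows the relationship between the raw-score and standardized coefficients: beta equals b multiplied by the ratio of the predictor's standard deviation to the criterion's standard deviation. The data are invented placeholders, not the data in this lesson.

import numpy as np

x1 = np.array([500., 550., 600., 620., 650., 700., 720., 750.])
x2 = np.array([2.8, 3.0, 3.1, 3.4, 3.3, 3.6, 3.7, 3.9])
y  = np.array([70., 74., 78., 80., 82., 86., 88., 93.])

X = np.column_stack([np.ones_like(x1), x1, x2])
b0, b1, b2 = np.linalg.lstsq(X, y, rcond=None)[0]   # raw-score coefficients

beta1 = b1 * x1.std(ddof=1) / y.std(ddof=1)         # standardized partial regression coefficients
beta2 = b2 * x2.std(ddof=1) / y.std(ddof=1)
print(f"raw b1 = {b1:.4f}, b2 = {b2:.4f}; standardized beta1 = {beta1:.3f}, beta2 = {beta2:.3f}")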
The collinearity diagnostics indicate a low degree of overlap between the predictors (as we predicted).
If the two predictor variables were orthogonal (uncorrelated), the variance inflation factor (VIF) for
each would be 1. Thus we conclude that there is not a problem with collinearity in this case.
The histogram of the standardized residuals shows that the departure from normality is not too severe
(see Figure 12-9).

Figure 12-9 Histogram of standardized residuals

The normal p-p plot indicates some departure from normality and may suggest a curvilinear
relationship between the predictors and the criterion (see Figure 12-10).

Figure 12-10 Normal p-p plot

The plot of standardized predicted values against the standardized residuals indicates a large degree
of heteroscedasticity (see Figure 12-11). This is mostly the result of a single outlier, case 11
(Participant 118), whose GREQ and grade scores are significantly lower than those of the remainder of
the group. Eliminating that case and recomputing the regression increases Multiple R slightly and also
reduces the heteroscedasticity.

Figure 12-11 Plot of predicted values against residuals

Lesson 13: Chi-Square Tests


Objectives
1. Perform and interpret a chi-square test of goodness of fit.
2. Perform and interpret a chi-square test of independence.
Overview
Chi-square tests are used to compare observed frequencies to the frequencies expected under some
hypothesis. Tests for one categorical variable are generally called goodness-of-fit tests. In this case,
there is a one-way table of observed frequencies of the levels of some categorical variable. The null
hypothesis might state that the expected frequencies are equally distributed or that they are unequal
on the basis of some theoretical or postulated distribution.
Tests for two categorical variables are usually called tests of independence or association. In this case,
there will be a two-way contingency table with one categorical variable occupying rows of the table
and the other categorical variable occupying columns of the table. In this analysis, the expected
frequencies are commonly derived on the basis of the assumption of independence. That is, if there
were no association between the row and column variables, then a cell entry would be expected to be
the product of the cell's row and column marginal totals divided by the overall sample size.

In both tests, the chi-square test statistic is calculated as the sum of the squared differences between
the observed and expected frequencies divided by the expected frequencies, according to the following
simple formula:
χ² = Σ (O − E)² / E
where O represents the observed frequency in a given cell of the table and E represents the
corresponding expected frequency under the null hypothesis.
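The following Python sketch applies that formula to a small goodness-of-fit example with equal expected frequencies. The observed counts are invented; they are not the mentoring data introduced below.

import numpy as np
from scipy import stats

observed = np.array([28, 18, 14])                  # counts in three categories
expected = np.full(3, observed.sum() / 3)          # equal expected frequencies

chi2 = np.sum((observed - expected) ** 2 / expected)
chi2_scipy, p = stats.chisquare(observed)          # same statistic; df = categories - 1
print(f"chi-square = {chi2:.2f} (scipy: {chi2_scipy:.2f}), p = {p:.4f}")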
We will illustrate both the goodness-of-fit test and the test of independence using the same dataset.
You will find the goodness-of-fit test for equal or unequal expected frequencies as an option under
Nonparametric Tests in the Analyze menu. For the chi-square test of independence, you will use the
Crosstabs procedure under the Descriptive Statistics menu in SPSS. The cross-tabulation procedure
can make use of numeric or text entries, while the Nonparametric Test procedure requires numeric
entries. For that reason, you will need to recode any text entries into numerical values for goodness-of-fit tests.
Example Data
Assume that you are interested in the effects of peer mentoring on student academic success in a
competitive private liberal arts college. A group of 30 students is randomly selected during their
freshman orientation. These students are assigned to a team of seniors who have been trained as
tutors in various academic subjects, listening skills, and team-building skills. The 30 selected students
meet in small group sessions with their peer tutors once each week during their entire freshman year,
are encouraged to work with their small group for study sessions, and are encouraged to schedule
private sessions with their peer mentors whenever they desire. You identify an additional 30 students
at orientation as a control group. The control group members receive no formal peer mentoring. You
determine that there are no significant differences between the high school grades and SAT scores of
the two groups. At the end of four years, you compare the two groups on academic retention and
academic performance. You code mentoring as 1 = present and 0 = absent to identify the two groups.
Because GPAs differ by academic major, you generate a binary code for grades. If the student's
cumulative GPA is at the median or higher for his or her academic major, you assign a 1. Students
whose grades are below the median for their major receive a zero. If the student is no longer enrolled
(i.e., has transferred, dropped out, or flunked out), you code a zero for retention. If he or she is still
enrolled, but has not yet graduated after four years, you code a 1. If he or she has graduated, you
code a 2.
You collect the following (hypothetical) data:

Properly entered in SPSS, the data should look like the following (see Figure 13-1). For your
convenience, you may also download a copy of the dataset.

Figure 13-1 Dataset in SPSS (partial data)

Conducting a Goodness-of-Fit Test


To determine whether the three retention outcomes are equally distributed, you can perform a
goodness-of-fit test. Because there are three possible outcomes (no longer enrolled, currently
enrolled, and graduated) and sixty total students, you would expect each outcome to be observed in
1/3 of the cases if there were no differences in the frequencies of these outcomes. Thus the null
hypothesis would be that 20 students would not be enrolled, 20 would be currently enrolled, and 20
would have graduated after four years. To test this hypothesis, you must use the Nonparametric Tests
procedure. To conduct the test, select Analyze, Nonparametric Tests, Chi-Square as shown in
Figure 13-2.

Figure 13-2 Selecting chi-square test for goodness of fit

In the resulting dialog box, move Retention to the Test Variable List and accept the default for equal
expected frequencies. SPSS counts and tabulates the observed frequencies and performs the chi-square test (see Figure 13-3). The degrees of freedom for the goodness-of-fit test are the number of categories minus one. The significant chi-square shows that the frequencies are not equally distributed, χ2(2, N = 60) = 6.10, p = .047.

Figure 13-3 Chi-square test of goodness of fit

Conducting a Chi-Square Test of Independence


If mentoring is not related to retention, you would expect mentored and non-mentored students to
have the same outcomes, so that any observed differences in frequencies would be due to chance.
That would mean that you would expect half of the students in each outcome group to come from the
mentored students, and the other half to come from the non-mentored students. To test the
hypothesis that there is an association (or non-independence) between mentoring and retention, you
will conduct a chi-square test as part of the cross-tabulation procedure. To conduct the test,
select Analyze, Descriptive Statistics, Crosstabs (see Figure 13-4).

Figure 13-4 Preparing for the chi-square test of independence

In the Crosstabs dialog, move one variable to the row field and the other variable to the column field.
I typically place the variable with more levels in the row field to keep the output tables narrower (see
Figure 13-5), though the results of the test would be identical if you were to reverse the row and
column variables.

Figure 13-5 Establishing row and column variables

Clustered bar charts are an excellent way to compare the frequencies visually, so we will select that
option (see Figure 13-5). Under the Statistics option, select chi-square and Phi and
Cramer's V (measures of effect size for chi-square tests). You can also click on the Cells button to
display both observed and expected cell frequencies. The format menu allows you to specify whether
the rows are arranged in ascending (the default) or descending order. Click OK to run the Crosstabs
procedure and conduct the chi-square test.

Figure 13-6 Partial output from Crosstabs procedure

For the test of independence, the degrees of freedom are the number of rows minus one multiplied by
the number of columns minus one, or in this case 2 x 1 = 2. The Pearson Chi-Square is significant, indicating that mentoring had an effect on retention, χ2(2, N = 60) = 14.58, p < .001. The value of Cramer's V is .493, indicating a large effect size (Gravetter & Wallnau, 2005).
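For readers who want to see the computation outside SPSS, the following Python sketch runs a chi-square test of independence and computes Cramer's V for a 3 x 2 table. The cell counts are invented for illustration; they are not the counts from Figure 13-6.

import numpy as np
from scipy import stats

table = np.array([[4, 12],    # not enrolled:   hypothetical counts for the two groups
                  [10, 11],   # still enrolled
                  [16, 7]])   # graduated

chi2, p, df, expected = stats.chi2_contingency(table, correction=False)
n = table.sum()
cramers_v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))
print(f"chi2({df}, N = {n}) = {chi2:.2f}, p = {p:.4f}, Cramer's V = {cramers_v:.3f}")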
The clustered bar chart provides an excellent visual representation of the chi-square test results (see
Figure 13-7).

Figure 13-7 Clustered bar chart

Going Further
For additional practice, you can use the Nonparametric Tests and Crosstabs procedures to determine
whether grades differ between mentored and non-mentored students and whether there is an
association between grades and retention outcomes.
References
Gravetter, F. J., & Wallnau, L. B. (2005). Essentials of statistics for the behavioral sciences (5th ed.).
Belmont, CA: Thomson/Wadsworth.

Lesson 14: Analysis of Covariance


Objectives
1. Perform and interpret an analysis of covariance using the General Linear Model.
2. Perform and interpret an analysis of covariance using hierarchical regression.
Analysis of covariance (ANCOVA) is a blending of regression and analysis of variance (Roscoe, 1975).
It is possible to perform ANCOVA using the General Linear Model procedure in SPSS. An entirely
equivalent analysis is also possible using hierarchical regression, so the choice is left to the user and

his or her preferences. We will illustrate both procedures in this tutorial. We will use the simplest of
cases, a single covariate, two treatments, and a single variate (dependent variable).
ANCOVA is statistically equivalent to matching experimental groups with respect to the variable or
variables being controlled (or covaried). As you recall from correlation and regression, if two variables
are correlated, one can be used to predict the other. If there is a covariate (X) that correlates with the
dependent variable (Y), then dependent variable scores can be predicted by the covariate. If this is the
case, the differences observed between the groups cannot then be attributed to the experimental
treatment(s). ANCOVA provides a mechanism for assessing the differences in dependent variable
scores after statistically controlling for the covariate. There are two obvious advantages to this
approach: (1) any variable that influences the variation in the dependent variable can be statistically
controlled, and (2) this control can reduce the amount of error variance in the analysis.
Example Data
Assume that you are comparing performance in a statistics class taught by two different methods.
Students in one class are instructed in the classroom, while students in the second class take their
class online. Both classes are taught by the same instructor, and use the same textbook, exams, and
assignments. At the beginning of the term all students take a test of quantitative ability (pretest), and
at the end, their score on the final exam is recorded (posttest). Because the two classes are intact, it
is not possible to achieve experimental control, so this is a quasi-experimental design. Assume that
you would like to compare the scores for the two groups on the final score while controlling for initial
quantitative ability. The hypothetical data are as follows:

Before the ANCOVA


You may retrieve the SPSS dataset if you like. As a precursor to the ANCOVA, let us perform a
between-groups t test to examine overall differences between the two groups on the final exam. You
will recall this test as the subject of Lesson 4, so the details are not repeated here. The result of the t test is shown in Figure 14-1. Of course, if there were multiple groups you would perform an ANOVA rather than a t test. In this case, we conclude that the second method
led to improved test scores, but must rule out the possibility that this difference is attributable to
differences in quantitative ability of the two groups. As you know by now, you could just as easily have
compared the means using the Compare Means or One-way ANOVA procedures, and the square root
of the F-ratio obtained would be the value of t.

Figure 14-1 t Test Results

As a second precursor to the ANCOVA, let us determine the degree of correlation between quantitative
ability and exam scores. As correlation is the subject of Lesson 10, the details are omitted here, and
only the results are shown in Figure 14-2.

Figure 14-2 Correlation between pretest and posttest scores

Knowing that there is a statistically significant correlation between pretest and posttest scores, we
would like to exercise statistical control by holding the effects of the pretest scores constant. The
resulting ANCOVA will verify whether there are any differences in the posttest scores of the two groups
after controlling for differences in ability.
Performing the ANCOVA in GLM
To perform the ANCOVA via the General Linear Model menu, select Analyze, General Linear Model,
Univariate (see Figure 14-3).

Figure 14-3 ANCOVA via the GLM procedure

In the resulting dialog box, move Posttest to the Dependent Variable field, Method to the Fixed
Factor(s) field, and Pretest to the Covariate(s) field. See Figure 14-4.

Figure 14-4 Univariate dialog box

Under Options you may want to choose descriptive statistics and effect size indexes, as well as plots of
estimated marginal means for Method. As there are just two groups, main effect comparisons are not
appropriate. Examine Figure 14-5.

Figure 14-5 Univariate options for ANCOVA

Click Continue. If you like, you can click on Plots to add profile plots for the estimated marginal means
of the posttest scores of the two groups after adjusting for pretest scores. Click on OK to run the
analysis. The results are shown in Figure 14-6. The results indicate that after controlling for initial
quantitative ability, the differences in posttest scores are statistically significantly different between
the two groups, F(1,27)=16.64, p < .001, partial eta-squared = .381.

Figure 14-6 ANCOVA results

The profile plot makes it clear that the online class had higher exam scores after controlling for initial
quantitative ability (see Figure 14-7).

Figure 14-7 Profile plot

Performing an ANCOVA Using Hierarchical Regression


To perform the same ANCOVA using hierarchical regression, enter the posttest as the criterion. Then
enter the covariate (pretest) as one independent variable block and group membership (method) as a
second block. Examine the change in R-Square as the two models are compared, and the significance
of the change. The F value produced by this analysis is identical to that produced via the GLM
approach.
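Before walking through the SPSS dialogs, here is a minimal Python sketch of that hierarchical-regression logic: fit the covariate alone (block 1), then the covariate plus the group code (block 2), and test the significance of the R Square change. The pretest, method, and posttest values are invented placeholders, not the data from this lesson.

import numpy as np
from scipy import stats

pretest  = np.array([48., 52., 55., 60., 62., 65., 70., 72., 75., 80.])
method   = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])   # 0 = classroom, 1 = online (hypothetical coding)
posttest = np.array([60., 63., 66., 70., 71., 78., 82., 84., 86., 90.])

def r_square(X, y):
    """R Square from an OLS fit of y on X (X already includes the intercept column)."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    y_hat = X @ b
    return 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

ones = np.ones_like(pretest)
r2_block1 = r_square(np.column_stack([ones, pretest]), posttest)           # covariate only
r2_block2 = r_square(np.column_stack([ones, pretest, method]), posttest)   # covariate + group code

n, k1, k2 = len(posttest), 1, 2
f_change = ((r2_block2 - r2_block1) / (k2 - k1)) / ((1.0 - r2_block2) / (n - k2 - 1))
p_change = stats.f.sf(f_change, k2 - k1, n - k2 - 1)
print(f"R Square change = {r2_block2 - r2_block1:.4f}, F(1, {n - k2 - 1}) = {f_change:.2f}, p = {p_change:.4f}")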
Select Analyze, Regression, Linear (see Figure 14-8).

Figure 14-8 ANCOVA via hierarchical regression

Now enter Posttest as the Dependent Variable and Pretest as an Independent variable (see Figure 14-9).

Figure 14-9 Linear regression dialog box


Click on the Next button and enter Method as an Independent variable, as shown in Figure 14-10.

Figure 14-10 Entering second block

Click on Statistics, and check the box in front of R squared change (see Figure 14-11).

Figure 14-11 Specify R squared change

Click Continue then OK to run the hierarchical regression. Note in the partial output shown in Figure
14-12 that the value of F for the R Square Change with pretest held constant is identical to that
calculated earlier.

Figure 14-12 Hierarchical regression yields results identical to GLM

References
Roscoe, J. T. (1975). Fundamental research statistics for the behavioral sciences (2nd ed.). New York:
Holt, Rinehart and Winston, Inc.