Classification

In multivariate data analysis (i.e., when we have two or more variables), we try to find the relationship between the variables and understand its nature. The choice of technique depends primarily on the measurement scales used for the variables and on how we look for the relationship. For example, we can study the relationship between variables x and y either by examining the difference in y at each value of x or by examining the association between the two variables. Two classes of data analysis techniques will be taught in class: analysis of difference and analysis of association.
I. Analysis of Difference
   Ex. t-test (z-test), Analysis of Variance (ANOVA)
II. Analysis of Association
   Ex. Cross-tabulation, Correlation, Regression

Suggestion

Before you start any of the data analysis techniques, check:
1. What are the variables in my analysis?
2. Can the variables be classified as dependent and independent variables?
3. If yes, which is the dependent variable and which is the independent variable?
4. What measurement scales were used for the variables?
Data Analysis I: Cross-tabulation (& Frequency)

In cross-tabulation, there are no specific dependent or independent variables. Cross-tabulation simply analyzes whether two (or more) variables measured on nominal scales are associated or, equivalently, whether the cell counts for the values of one variable differ across the values (categories) of the other variable. All of the variables are measured on a nominal scale.

Example 1

Research objective: To see whether the preferred brands (brand A, brand B, and brand C) are associated with the locations (Denver and Salt Lake City). Is there any difference in brand preference between the two locations?

Data: We selected 40 people in each city and recorded their preferred brand. If the preferred brand is A, favorite brand = 1. If the preferred brand is B, favorite brand = 2. If the preferred brand is C, favorite brand = 3.

Denver
ID               1   2   3   4   5   6   7   8   9  10
favorite brand   1   3   3   1   3   2   3   3   3   1

ID              11  12  13  14  15  16  17  18  19  20
favorite brand   1   2   2   2   3   3   1   3   1   2

ID              21  22  23  24  25  26  27  28  29  30
favorite brand   1   3   3   3   1   3   3   3   3   3

ID              31  32  33  34  35  36  37  38  39  40
favorite brand   3   3   2   3   1   3   3   3   2   1
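Before running the cross-tabulation, it can help to tally the codes by hand as a sanity check. The following is a minimal sketch in Python rather than SPSS (the language and the variable names are assumptions of this illustration, not part of the course procedure). It counts the Denver responses listed above.

```python
from collections import Counter

# Favorite-brand codes for the 40 Denver respondents (IDs 1-40),
# copied from the table above.
denver = [
    1, 3, 3, 1, 3, 2, 3, 3, 3, 1,
    1, 2, 2, 2, 3, 3, 1, 3, 1, 2,
    1, 3, 3, 3, 1, 3, 3, 3, 3, 3,
    3, 3, 2, 3, 1, 3, 3, 3, 2, 1,
]

# Map the numeric codes back to brand labels (1 = A, 2 = B, 3 = C).
labels = {1: "A", 2: "B", 3: "C"}
counts = Counter(labels[code] for code in denver)

for brand in ("A", "B", "C"):
    print(f"Brand {brand}: {counts[brand]}")
```

The three counts should match the Denver row of the cross-tabulation that SPSS produces.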
Procedures in SPSS

1. Open SPSS. In the upper-left corner, you will see "SPSS Data Editor".
2. Type in the new data (try to save it!). Each column represents one variable. Or, if you already have the data file, just open it (data_crosstab_1.sav).
3. Go to the top menu bar and click on Analyze. You will find Descriptive Statistics; move your cursor to it.
4. In the submenu next to Descriptive Statistics, click on Crosstabs. You will then see the Crosstabs dialog box. The names listed in it are the variable names; they will differ for a different dataset.
5. Click on the first variable in the upper-left box. In this example, it's Location. Click the button just to the left of the Row(s): box. This moves Location to the Row(s): box. In the same way, move the other variable, Brand, to the Column(s): box.
6. Click on Statistics at the bottom. Another menu box will appear. In that box, check Chi-square and Contingency coefficient, and click on Continue in the upper-right corner.
7. You are returned to the original menu box. Click on OK in the upper-right corner.
8. You now have the final results in the Output - SPSS Viewer window, as follows:
Case Processing Summary
   Cases Missing: N = 0, Percent = .0%

Chi-Square Tests
   df: 2, 2
   a. 0 cells (.0%) have expected count less than 5. The minimum expected count is 11.50.

Symmetric Measures
   Contingency Coefficient: Value = .427
   N of Valid Cases: 80
   a. Not assuming the null hypothesis.
   b. Using the asymptotic standard error assuming the null hypothesis.
Interpretation

The first table simply shows that there are 80 observations in the data set. Recall the dataset: there were 40 observations in each city.
From the second table, we find that among the 40 people in Denver (location=1), 10, 7, and 23 people prefer brand A, B, and C, respectively. On the other hand, 19, 16, and 5 of the 40 people in Salt Lake City (location=2) prefer brand A, B, and C, respectively.

From the third table, the chi-square value is 17.89 (Chi-Square = 17.886) and the associated p-value is reported as 0.000 (Asymp. Sig. = 0.000), which is less than 0.05. Therefore, we conclude that people in different cities prefer different brands, or that consumers' favorite brand is associated with where they live. (This conclusion is also supported by the final table, which shows that the contingency coefficient is 0.427.)

Looking at the column totals or row totals in the second table, we can also get the frequency for each variable. In this example, among all 80 people in the sample, 29, 23, and 28 people said that their favorite brand is A, B, and C, respectively. Forty people were selected from Denver (location=1) and the other 40 from Salt Lake City (location=2).
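The chi-square statistic, the minimum expected count, and the contingency coefficient quoted above can be recomputed from the cell counts alone. The sketch below is written in Python as an illustration (SPSS is the tool used in class, and the function and variable names here are made up for the example). It relies on two standard facts: the expected count of a cell is (row total x column total) / N, and for 2 degrees of freedom the chi-square upper-tail probability equals exp(-x/2).

```python
import math

# Observed counts from the Location * Brand table:
# rows = Denver, Salt Lake City; columns = brand A, B, C.
observed = [
    [10, 7, 23],
    [19, 16, 5],
]


def chi_square(table):
    """Return (chi-square, df, minimum expected count) for a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    min_expected = float("inf")
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            min_expected = min(min_expected, expected)
            chi2 += (obs - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df, min_expected


chi2, df, min_expected = chi_square(observed)
n = sum(sum(row) for row in observed)

# Closed form for the upper-tail probability, valid only when df = 2.
p_value = math.exp(-chi2 / 2)

# Contingency coefficient: C = sqrt(chi2 / (chi2 + N)).
contingency = math.sqrt(chi2 / (chi2 + n))

print(f"chi-square = {chi2:.3f}, df = {df}, p = {p_value:.4f}")
print(f"minimum expected count = {min_expected:.2f}")
print(f"contingency coefficient = {contingency:.3f}")
```

The p-value is small but not literally zero; SPSS displays it as .000 because it rounds to three decimals.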
Example 2: 49ers vs Packers

Research objective: To see whether winning (or losing) the football game is associated with whether the game is a home game or an away game.

Data: We selected the data on the football games between the 49ers and the Packers for the period between 19XX and 19&&
GAME_AT * RESULT Crosstabulation (Count)

                    RESULT
                    1      2      Total
GAME_AT    1        12     3      15
           2        4      11     15
Total               16     14     30

Chi-Square Tests

                                df    Asymp. Sig.    Exact Sig.    Exact Sig.
                                      (2-sided)      (2-sided)     (1-sided)
Pearson Chi-Square              1     .003
Continuity Correction (a)       1     .010
Likelihood Ratio                1     .003
Fisher's Exact Test
Linear-by-Linear Association
N of Valid Cases

a. Computed only for a 2x2 table
b. 0 cells (.0%) have expected count less than 5. The minimum expected count is 7.00.

Symmetric Measures
   Contingency Coefficient: Value = .471
   N of Valid Cases: 30
   a. Not assuming the null hypothesis.
   b. Using the asymptotic standard error assuming the null hypothesis.
Interpretation

Among the 30 games in total, the 49ers won 16 games and lost 14. Each team tends to win when the game is a home game for that team: among the 16 49ers victories, 12 games were played at SF, and among the 14 Packers victories, 11 games were played at GB. The chi-square for this table is 8.571 (Pearson Chi-Square = 8.571) and the p-value is 0.003 (Asymp. Sig. = 0.003). Since the p-value of 0.003 is less than 0.05 (we typically take 0.05 as our pre-specified level of significance) and the contingency coefficient (.471) is far from 0, we conclude that the likelihood that a team wins the football game is associated with whether the game is a home game or an away game for that team.
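For a 2x2 table such as this one, both the Pearson chi-square and the continuity-corrected value can be obtained from shortcut formulas. The sketch below is a Python illustration, not the SPSS procedure itself, and its names (`a`, `b`, `c`, `d`, `upper_tail_df1`) are chosen only for this example. It uses the fact that a chi-square variable with 1 degree of freedom has upper-tail probability erfc(sqrt(x/2)).

```python
import math

# Observed counts from the GAME_AT * RESULT table:
# rows = game played at SF, at GB; columns = 49ers win, Packers win.
a, b = 12, 3
c, d = 4, 11
n = a + b + c + d


def upper_tail_df1(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2))


# Product of the four marginal totals, shared by both formulas.
margins = (a + b) * (c + d) * (a + c) * (b + d)

# Pearson chi-square for a 2x2 table (shortcut formula).
pearson = n * (a * d - b * c) ** 2 / margins

# Yates continuity correction: shrink |ad - bc| by n/2 before squaring.
yates = n * (abs(a * d - b * c) - n / 2) ** 2 / margins

# Contingency coefficient based on the Pearson chi-square.
contingency = math.sqrt(pearson / (pearson + n))

print(f"Pearson chi-square = {pearson:.3f}, p = {upper_tail_df1(pearson):.3f}")
print(f"Continuity-corrected = {yates:.3f}, p = {upper_tail_df1(yates):.3f}")
print(f"Contingency coefficient = {contingency:.3f}")
```

The corrected statistic is smaller than the Pearson value, which is why its p-value in the Chi-Square Tests table (.010) is larger than the uncorrected one (.003).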
Dependent variable: Sales
Grouping variable: Promotion (1 = Yes, 2 = No)

Procedures in SPSS

1. Open SPSS. In the upper-left corner, you will see "SPSS Data Editor".
2. Type in the new data (try to save it!). Each column represents one variable. Or, if you already have the data file, just open it (data_t-test_1.sav).
3. Go to the top menu bar and click on Analyze. You will find Compare Means; move your cursor to it.
4. In the submenu next to Compare Means, click on Independent Samples T-Test. You will then see:

   [Dialog box: Independent Samples T-Test. Variables: Sales, Promo. Boxes: Test Variable(s):, Grouping variable:. Buttons: OK, PASTE, RESET, CANCEL, HELP]

   The names listed are the variable names; they will differ for a different dataset.
5. Click on the dependent variable in the left-hand box. In this example, it's Sales. Click the button just to the left of the Test Variable(s): box. This moves Sales to the Test Variable(s): box. Then move the independent variable (grouping variable) to the Grouping variable: box. In this example, it's Promo. Click on Define Groups and specify which number stands for which group in your data by typing these numbers into the boxes:
   Group 1: 1
   Group 2: 2
6. Click on OK in the upper-right corner.
7. You now have the final results in the Output - SPSS Viewer window, as follows:
Group Statistics
   N                   10        10
   Std. Error Mean     5.336     5.498

Independent Samples Test (t-test for Equality of Means)
   Sig.                .507
   t                   2.493     2.493
   df                  18        17.984
   95% Confidence Interval of the Difference
      Lower            3.004     3.003
      Upper            35.196    35.197
Interpretation

From the first table, the mean sales with and without the promotion are 81.7 and 62.6, respectively. The test statistic, t, for this observed difference is 2.49 (t = 2.493). The p-value for this t-statistic is 0.023 (Sig. (2-tailed) = 0.023). Since the p-value (0.023) is less than 0.05, we reject the null hypothesis and conclude that there's a significant difference in average sales between when firms offer a price promotion and when they offer just regular prices.
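The t statistic and the two degrees-of-freedom values in the output can be reproduced from the group summaries. The sketch below is a Python illustration under stated assumptions: the means come from the interpretation above, the standard errors and group sizes come from the Group Statistics table, and assigning 5.336 to the promotion group is an assumption (the result does not depend on it because the groups are the same size). All variable names are hypothetical.

```python
import math

# Summary statistics (see the assumptions in the paragraph above).
mean_promo, mean_no_promo = 81.7, 62.6
se_promo, se_no_promo = 5.336, 5.498
n_promo, n_no_promo = 10, 10

# Standard error of the difference between two independent means.
# With equal group sizes, the pooled and unpooled versions coincide.
se_diff = math.sqrt(se_promo**2 + se_no_promo**2)
t_stat = (mean_promo - mean_no_promo) / se_diff

# Degrees of freedom when equal variances are assumed.
df_pooled = n_promo + n_no_promo - 2

# Welch-Satterthwaite degrees of freedom (equal variances not assumed).
v1, v2 = se_promo**2, se_no_promo**2
df_welch = (v1 + v2) ** 2 / (v1**2 / (n_promo - 1) + v2**2 / (n_no_promo - 1))

print(f"t = {t_stat:.3f}")
print(f"df (equal variances assumed) = {df_pooled}")
print(f"df (equal variances not assumed) = {df_welch:.3f}")
```

Because the two groups are the same size, the t value is the same in both rows of the SPSS output and only the degrees of freedom differ.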
Trained by (a): 1 1 2 2 3 1 2 3 2 2 1 1 3 3 3

a. Grouping variable: training program (1 = trained by program #1, 2 = trained by program #2, 3 = trained by program #3)
b. Dependent variable: Job satisfaction (1: strongly dissatisfied --- 7: strongly satisfied)

Procedures in SPSS

1. Open SPSS. In the upper-left corner, you will see "SPSS Data Editor".
2. Type in the new data (try to save it!). Each column represents one variable. Or, if you already have the data file, just open it (data_anova_1.sav).
3. Go to the top menu bar and click on Analyze. You will find Compare Means; move your cursor to it.
4. In the submenu next to Compare Means, click on One-way ANOVA. You will then see:

   [Dialog box: One-Way ANOVA. Variables: Program, Satis. Boxes: Dependent List:, Factor:. Buttons: OK, PASTE, RESET, CANCEL, HELP, Post Hoc]

   The names listed are the variable names; they will differ for a different dataset.
5. Click on the dependent variable in the left-hand box. In this example, it's Satis. Click the button just to the left of the Dependent List: box. This moves Satis to the Dependent List: box. Then move the independent variable (grouping variable) to the Factor: box. In this example, it's Program.
6. Click on Options. A new sub-menu box will appear. Then click on the following boxes:
   Statistics
      Descriptive
      Fixed and random effects
      Homogeneity of Variance test
      ........
7. Click on Continue, and then click on OK in the upper-right corner.
8. You now have the final results in the Output - SPSS Viewer window, as follows:
Descriptives (SATIS)

                95% Confidence Interval for Mean
          N     Lower Bound    Upper Bound    Minimum    Maximum
1         5     2.68           7.32           2          7
2         5     1.76           3.84           2          4
3         5     3.92           5.28           4          5
Total     15    3.30           4.97           2          7

ANOVA (SATIS)

                  Sum of Squares    df    Mean Square    F        Sig.
Between Groups    13.733            2     6.867          4.578    .033
Within Groups     18.000            12    1.500
Total             31.733            14
Interpretation

(1) From the first table, the mean satisfaction of the subjects in group 1 (i.e., those who were trained by program 1), group 2, and group 3 is 5.0, 2.8, and 4.6, respectively.

(2) Checking the ANOVA table, the second output:

Source     SS       DF    MS      F       Sig.
Between    13.73    2     6.87    4.58    .033
Within     18.00    12    1.50
Total      31.73    14

SST = SSG + SSE                              31.73 = 13.73 + 18.00
dfT = number of respondents - 1              15 - 1 = 14
dfG = number of groups - 1                   3 - 1 = 2
dfE = dfT - dfG                              14 - 2 = 12
MSG = SSG/dfG                                6.87 = 13.73 / 2
MSE = SSE/dfE                                1.50 = 18.00 / 12
F = MSG/MSE (test statistic)                 4.58 = 6.87 / 1.50
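The arithmetic behind the ANOVA table can be traced step by step. The sketch below is a Python illustration with hypothetical variable names. It starts from the two sums of squares in the SPSS output and rebuilds the degrees of freedom, the mean squares, and F. It also recovers the Sig. value using a closed form that holds only when the between-groups df is 2, as it is here: P(F > x) = (1 + 2x/dfE)^(-dfE/2).

```python
# Sums of squares taken from the ANOVA table (SSG = between, SSE = within).
ss_between, ss_within = 13.733, 18.000
n_respondents, n_groups = 15, 3

# SST = SSG + SSE
ss_total = ss_between + ss_within

# Degrees of freedom.
df_total = n_respondents - 1
df_between = n_groups - 1
df_within = df_total - df_between

# Mean squares and the F test statistic.
ms_between = ss_between / df_between
ms_within = ss_within / df_within
f_stat = ms_between / ms_within

# Upper-tail probability of F; this closed form is valid only for
# a numerator df of 2, so the check below guards against misuse.
if df_between != 2:
    raise ValueError("closed form requires 2 numerator degrees of freedom")
p_value = (1 + 2 * f_stat / df_within) ** (-df_within / 2)

print(f"SST = {ss_total:.3f}, dfT = {df_total}, dfG = {df_between}, dfE = {df_within}")
print(f"MSG = {ms_between:.3f}, MSE = {ms_within:.3f}")
print(f"F = {f_stat:.3f}, Sig. = {p_value:.3f}")
```

With three or more groups beyond this special case, the p-value would need an F-distribution routine from a statistics package instead of the closed form.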