A Step-by-Step Approach to Using SAS for Factor Analysis and Structural Equation Modeling, Second Edition
About this ebook
Norm O'Rourke, Ph.D., R.Psych.
Norm O'Rourke, Ph.D., R.Psych., is a clinical psychologist and associate professor with the Interdisciplinary Research in the Mathematical and Computational Sciences (IRMACS) Centre at Simon Fraser University in Burnaby (BC), Canada. He sits on the executive board of the American Psychological Association's Society for Clinical Geropsychology and the National Mental Health Commission of Canada. To date, he has published two governmental reports and seventy peer-reviewed publications in leading gerontology, measurement, and mental health academic journals. As co-applicant, Dr. O'Rourke has been part of teams awarded $4M in research funding, and $1.3M as principal applicant in governmental and foundation funding as team leader.
Chapter 1: Principal Component Analysis
Introduction: The Basics of Principal Component Analysis
A Variable Reduction Procedure
An Illustration of Variable Redundancy
What Is a Principal Component?
Principal Component Analysis Is Not Factor Analysis
Example: Analysis of the Prosocial Orientation Inventory
Preparing a Multiple-Item Instrument
Number of Items per Component
Minimal Sample Size Requirements
SAS Program and Output
Writing the SAS Program
Results from the Output
Steps in Conducting Principal Component Analysis
Step 1: Initial Extraction of the Components
Step 2: Determining the Number of Meaningful Components to Retain
Step 3: Rotation to a Final Solution
Step 4: Interpreting the Rotated Solution
Step 5: Creating Factor Scores or Factor-Based Scores
Step 6: Summarizing the Results in a Table
Step 7: Preparing a Formal Description of the Results for a Paper
An Example with Three Retained Components
The Questionnaire
Writing the Program
Results of the Initial Analysis
Results of the Second Analysis
Conclusion
Appendix: Assumptions Underlying Principal Component Analysis
References
Introduction: The Basics of Principal Component Analysis
Principal component analysis is used when you have obtained measures for a number of observed variables and wish to arrive at a smaller number of variables (called principal components) that will account for, or capture, most of the variance in the observed variables. The principal components may then be used as predictors or criterion variables in subsequent analyses.
A Variable Reduction Procedure
Principal component analysis is a variable reduction procedure. It is useful when you have obtained data for a number of variables (possibly a large number of variables) and believe that there is redundancy among those variables. In this case, redundancy means that some of the variables are correlated with each other, often because they are measuring the same construct. Because of this redundancy, you believe that it should be possible to reduce the observed variables into a smaller number of principal components that will account for most of the variance in the observed variables.
Because it is a variable reduction procedure, principal component analysis is similar in many respects to exploratory factor analysis. In fact, the steps followed when conducting a principal component analysis are virtually identical to those followed when conducting an exploratory factor analysis. There are significant conceptual differences between the two, however, so it is important that you do not mistakenly claim that you are performing factor analysis when you are actually performing principal component analysis. The differences between these two procedures are described in greater detail in a later subsection titled "Principal Component Analysis Is Not Factor Analysis."
An Illustration of Variable Redundancy
We now present a fictitious example to illustrate the concept of variable redundancy. Imagine that you have developed a seven-item measure to gauge job satisfaction. The fictitious instrument is reproduced here:
Please respond to the following statements by placing your response to the left of each statement. In making your ratings, use a number from 1 to 7 in which 1 = "Strongly Disagree" and 7 = "Strongly Agree."
_____ 1. My supervisor(s) treats me with consideration.
_____ 2. My supervisor(s) consults me concerning important decisions that affect my work.
_____ 3. My supervisor(s) gives me recognition when I do a good job.
_____ 4. My supervisor(s) gives me the support I need to do my job well.
_____ 5. My pay is fair.
_____ 6. My pay is appropriate, given the amount of responsibility that comes with my job.
_____ 7. My pay is comparable to that of other employees whose jobs are similar to mine.
Perhaps you began your investigation with the intention of administering this questionnaire to 200 employees and using their responses to the seven items as seven separate variables in subsequent analyses.
There are a number of problems with conducting the study in this manner, however. One of the more important problems involves the concept of redundancy as previously mentioned. Examine the content of the seven items in the questionnaire. Notice that items 1 to 4 each deal with employees’ satisfaction with their supervisors. In this way, items 1 to 4 are somewhat redundant or overlapping in terms of what they are measuring. Similarly, notice also that items 5 to 7 each seem to deal with the same topic: employees’ satisfaction with their pay.
Empirical findings may further support the likelihood of item redundancy. Assume that you administer the questionnaire to 200 employees and compute all possible correlations between responses to the seven items. Fictitious correlation coefficients are presented in Table 1.1:
Table 1.1: Correlations among Seven Job Satisfaction Items
NOTE: N = 200.
When correlations among several variables are computed, they are typically summarized in the form of a correlation matrix such as the one presented in Table 1.1; this provides an opportunity to review how a correlation matrix is interpreted. (See Appendix A.5 for more information about correlation coefficients.)
The rows and columns of Table 1.1 correspond to the seven variables included in the analysis. Row 1 (and column 1) represents variable 1, row 2 (and column 2) represents variable 2, and so forth. Where a given row and column intersect, you will find the correlation coefficient between the two corresponding variables. For example, where the row for variable 2 intersects with the column for variable 1, you find a coefficient of .75; this means that the correlation between variables 1 and 2 is .75.
The correlation coefficients presented in Table 1.1 show that the seven items seem to hang together in two distinct groups. First, notice that items 1 to 4 show relatively strong correlations with one another. This could be because items 1 to 4 are measuring the same construct. In the same way, items 5 to 7 correlate strongly with one another, a possible indication that they also measure a single construct. Even more interesting, notice that items 1 to 4 are very weakly correlated with items 5 to 7. This is what you would expect to see if items 1 to 4 and items 5 to 7 were measuring two different constructs.
Given this apparent redundancy, it is likely that the seven questionnaire items are not really measuring seven different constructs. More likely, items 1 to 4 are measuring a single construct that could reasonably be labeled "satisfaction with supervision," whereas items 5 to 7 are measuring a different construct that could be labeled "satisfaction with pay."
If responses to the seven items actually display the redundancy suggested by the pattern of correlations in Table 1.1, it would be advantageous to reduce the number of variables in this dataset, so that (in a sense) items 1 to 4 are collapsed into a single new variable that reflects employees’ satisfaction with supervision and items 5 to 7 are collapsed into a single new variable that reflects satisfaction with pay. You could then use these two new variables (rather than the seven original variables) as predictor variables in multiple regression, for instance, or another type of analysis.
In essence, this is what is accomplished by principal component analysis: it allows you to reduce a set of observed variables into a smaller set of variables called principal components. The resulting principal components may then be used in subsequent analyses.
What Is a Principal Component?
How Principal Components Are Computed
A principal component can be defined as a linear combination of optimally weighted observed variables. In order to understand the meaning of this definition, it is necessary to first describe how participants’ scores on a principal component are computed.
In the course of performing a principal component analysis, it is possible to calculate a score for each participant for a given principal component. In the preceding study, for example, each participant would have scores on two components: one score on the "satisfaction with supervision" component and one score on the "satisfaction with pay" component. Participants’ actual scores on the seven questionnaire items would be optimally weighted and then summed to compute their scores for a given component.
Below is the general form of the formula to compute scores on the first component extracted (created) in a principal component analysis:
C1 = b11(X1) + b12(X2) + ... + b1p(Xp)
where
C1 = the participant’s score on principal component 1 (the first component extracted)
b1p = the coefficient (or weight) for observed variable p, as used in creating principal component 1
Xp = the participant’s score on observed variable p
For example, assume that component 1 in the present study was "satisfaction with supervision." You could determine each participant’s score on principal component 1 by using the following fictitious formula:
C1 = .44(X1) + .40(X2) + .47(X3) + .32(X4) + .02(X5) + .01(X6) + .03(X7)
In this case, the observed variables (the "X" variables) are participant responses to the seven job satisfaction questions: X1 represents question 1; X2 represents question 2; and so forth. Notice that different coefficients or weights were assigned to each of the questions when computing scores on component 1: questions 1 to 4 were assigned relatively large weights that range from .32 to .47, whereas questions 5 to 7 were assigned very small weights ranging from .01 to .03. This makes sense, because component 1 is the satisfaction with supervision component and satisfaction with supervision was measured by questions 1 to 4. It is therefore appropriate that items 1 to 4 would be given a good deal of weight in computing participant scores on this component, while items 5 to 7 would be given comparatively little weight.
Because component 2 measures a different construct, a different equation with different weights would be used to compute scores for this component (i.e., "satisfaction with pay"). Below is a fictitious illustration of this formula:
C2 = .01(X1) + .04(X2) + .02(X3) + .02(X4) + .48(X5) + .31(X6) + .39(X7)
The preceding example shows that, when computing scores for the second component, considerable weight would be given to items 5 to 7, whereas comparatively little would be given to items 1 to 4. As a result, component 2 should account for much of the variability in the three satisfaction with pay items (i.e., it should be strongly correlated with those three items).
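The two fictitious formulas can be applied to a single participant with a few lines of Python. This is a minimal sketch rather than anything PROC FACTOR produces: the weights are the fictitious ones shown above, the participant's responses are invented for illustration, and the function name component_score is hypothetical. Raw 1-to-7 responses are weighted here, as in the formulas above; in an actual analysis the observed variables are standardized first.

```python
# Fictitious weights from the two formulas above (one weight per item X1..X7).
W_SUPERVISION = [0.44, 0.40, 0.47, 0.32, 0.02, 0.01, 0.03]  # component 1
W_PAY = [0.01, 0.04, 0.02, 0.02, 0.48, 0.31, 0.39]          # component 2


def component_score(weights, responses):
    """Return the weighted sum C = b1*X1 + b2*X2 + ... + bp*Xp."""
    if len(weights) != len(responses):
        raise ValueError("need exactly one response per weight")
    return sum(b * x for b, x in zip(weights, responses))


# Hypothetical participant (an assumption for illustration): high ratings on
# the supervision items (1 to 4) and low ratings on the pay items (5 to 7).
responses = [7, 6, 7, 6, 2, 1, 2]

c1 = component_score(W_SUPERVISION, responses)
c2 = component_score(W_PAY, responses)
print(f"C1 = {c1:.2f}, C2 = {c2:.2f}")
```

Because items 1 to 4 carry nearly all of the weight in the first formula, this participant's score on component 1 is driven almost entirely by the supervision ratings, and the score on component 2 by the pay ratings.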
But how are these weights for the preceding equations determined? PROC FACTOR in SAS generates these weights by using a special type of equation called an eigenequation. The weights produced by these eigenequations are optimal weights in the sense that, for a given set of data, no other set of weights could produce a set of components that are more effective in accounting for variance among observed variables. These weights are created to satisfy what is known as the principle of least squares. Later in this chapter we will show how PROC FACTOR can be used to extract (create) principal components.
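For readers who want to see the general shape of such an eigenequation, it can be sketched in matrix form. The notation is an assumption introduced here for illustration and is not taken from the SAS output: R is the p × p correlation matrix of the observed variables, b_k is the column of weights for component k, and λ_k is the eigenvalue associated with that component.

```latex
\mathbf{R}\,\mathbf{b}_k = \lambda_k\,\mathbf{b}_k,
\qquad \mathbf{b}_k^{\top}\mathbf{b}_k = 1,
\qquad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p \ge 0,
\qquad \sum_{k=1}^{p} \lambda_k = p
```

In this form, λ_k is the amount of variance accounted for by component k, and the eigenvalues sum to the number of observed variables. Software may rescale the weights (for example, so that component scores have unit variance), so reported coefficients need not equal the unit-length b_k shown here.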
It is now possible to understand the definition provided at the beginning of this section more fully. A principal component was defined as a linear combination of optimally weighted observed variables. The words "linear combination" refer to the fact that scores on a component are created by adding together scores for the observed variables being analyzed. "Optimally weighted" refers to the fact that the observed variables are weighted in such a way that the resulting components account for a maximal amount of observed variance in the dataset.
Number of Components Extracted
The preceding section may have created the impression that, if a principal component analysis were performed on data from our fictitious seven-item job satisfaction questionnaire, only two components would be created. Such an impression would not be entirely correct.
In reality, the number of components extracted in a principal component analysis is equal to the number of observed variables being analyzed. This means that an analysis of responses to the seven-item questionnaire would actually result in seven components, not two.
In most instances, however, only the first few components account for meaningful amounts of variance; only these first few components are retained, interpreted, and used in subsequent analyses. For example, in your analysis of the seven-item job satisfaction questionnaire, it is likely that only the first two components would account for, or capture, meaningful amounts of variance. Therefore, only these would be retained for interpretation. You could assume that the remaining five components capture only trivial amounts of variance. These latter components would therefore not be retained, interpreted, or further analyzed.
Characteristics of Principal Components
The first component extracted in a principal component analysis accounts for a maximal amount of total variance among the observed variables. Under typical conditions, this means that the first component will be correlated with at least some (often many) of the observed variables.
The second component extracted will have two important characteristics. First, this component will account for a maximal amount of variance in the dataset that was not accounted for or captured by the first component. Under typical conditions, this again means that the second component will be correlated with some of the observed variables that did not display strong correlations with component 1.
The second characteristic of the second component is that it will be uncorrelated with the first component. Literally, if you were to compute the correlation between components 1 and 2, that coefficient would be zero. (For the exception, see the following section regarding oblique solutions.)
The remaining components that are extracted exhibit the same two characteristics: each accounts for a maximal amount of variance in the observed variables that was not accounted for by the preceding components; and each is uncorrelated with all of the preceding components. Principal component analysis proceeds in this manner with each new component accounting for progressively smaller amounts of variance. This is why only the first few components are retained and interpreted. When the analysis is complete, the resulting components will exhibit varying degrees of correlation with the observed variables, but will be completely uncorrelated with one another.
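These characteristics are easiest to verify in the smallest possible case. The sketch below assumes just two positively correlated variables with invented scores; in that special case the two components are the scaled sum and the scaled difference of the standardized variables (weights of 1/√2), with variances 1 + r and 1 − r. The data and function names are hypothetical, and this is an illustration of the properties rather than the computation PROC FACTOR performs.

```python
from math import sqrt
from statistics import mean, stdev, variance


def standardize(values):
    """Rescale values to a mean of zero and a standard deviation of one."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def correlation(x, y):
    """Pearson correlation computed from standardized scores."""
    zx, zy = standardize(x), standardize(y)
    return sum(a * b for a, b in zip(zx, zy)) / (len(x) - 1)


# Hypothetical scores for eight participants on two correlated items.
x1 = [2, 4, 4, 5, 7, 6, 3, 5]
x2 = [1, 3, 5, 4, 6, 7, 2, 4]

z1, z2 = standardize(x1), standardize(x2)
r = correlation(x1, x2)

# Two-variable case with r > 0: component 1 is the scaled sum of the
# standardized variables, component 2 is the scaled difference.
c1 = [(a + b) / sqrt(2) for a, b in zip(z1, z2)]
c2 = [(a - b) / sqrt(2) for a, b in zip(z1, z2)]

var_c1 = variance(c1)          # equals 1 + r
var_c2 = variance(c2)          # equals 1 - r
r_c1_c2 = correlation(c1, c2)  # zero, apart from rounding error

print(f"r = {r:.3f}")
print(f"variance of component 1 = {var_c1:.3f}")
print(f"variance of component 2 = {var_c2:.3f}")
print(f"correlation between components = {r_c1_c2:.6f}")
```

The first component captures the larger share of variance, the second captures what remains, the two variances sum to 2 (the number of variables), and the components are uncorrelated.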
What is meant by "total variance" in the dataset? To understand the meaning of "total variance" as it is used in a principal component analysis, remember that the observed variables are standardized in the course of the analysis. This means that each variable is transformed so that it has a mean of zero and a standard deviation of one (and hence a variance of one). The "total variance" in the dataset is simply the sum of variances for these observed variables. Because they have been standardized to have a standard deviation of one, each observed variable contributes one unit of variance to the total variance in the dataset. Because of this, total variance in principal component analysis will always be equal to the number of observed variables analyzed. For example, if seven variables are being analyzed, the total variance will equal seven. The components that are extracted in the analysis will partition this variance. Perhaps the first component will account for 3.2 units of total variance; perhaps the second component will account for 2.1 units. The analysis continues in this way until all variance in the dataset has been accounted for.
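The arithmetic behind total variance can be checked directly. The following sketch uses invented responses to three items (an assumption for illustration) to show that standardized variables each contribute one unit of variance, and then converts the 3.2 and 2.1 units mentioned above into percentages of a total variance of seven.

```python
from statistics import mean, stdev, variance


def standardize(values):
    """Rescale values to a mean of zero and a standard deviation of one."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


# Hypothetical responses from five participants to three items.
items = [
    [1, 3, 4, 6, 7],
    [2, 2, 5, 5, 6],
    [7, 5, 4, 2, 1],
]

# After standardization each item has a variance of one, so the total
# variance equals the number of items.
z_items = [standardize(item) for item in items]
total_variance = sum(variance(z) for z in z_items)
print(f"total variance = {total_variance:.1f}")

# Units of variance from the text: seven variables, first component 3.2,
# second component 2.1.
n_variables = 7
units = [3.2, 2.1]
percentages = [100 * u / n_variables for u in units]
print([round(p, 1) for p in percentages])
```

With seven variables, 3.2 units correspond to roughly 46% of total variance and 2.1 units to 30%, so the two components together would account for about three-quarters of the variance in the dataset.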
Orthogonal versus Oblique Solutions
This chapter will discuss only principal component analyses that result in orthogonal solutions. An orthogonal solution is one in which the components are uncorrelated ("orthogonal" means uncorrelated).
It is possible to perform a principal component analysis that results in correlated components. Such a solution is referred to as an oblique solution. In some situations, oblique solutions are preferred to orthogonal solutions because they produce cleaner, more easily interpreted results.
However, oblique solutions are often complicated to interpret. For this reason, this chapter will focus only on the interpretation of orthogonal solutions. The concepts discussed will provide a good foundation for the somewhat more complex concepts discussed later in this text.
Principal Component Analysis Is Not Factor Analysis
Principal component analysis is commonly confused with factor analysis. This is understandable because there are many important similarities between the two. Both are methods that can be used to identify groups of observed variables that tend to hang together empirically. Both procedures can also be performed with PROC FACTOR, and they generally provide similar results.
Nonetheless, there are some important conceptual differences between principal component analysis and factor analysis that should be understood at the outset. Perhaps the most important difference deals with the assumption of an underlying causal structure. Factor analysis assumes that covariation among the observed variables is due to the presence of one or more latent variables that exert directional influence on these observed variables. An example of such a structure is presented in Figure 1.1.
Figure 1.1: Example of the Underlying Causal Structure That Is Assumed in Factor Analysis
The ovals in Figure 1.1 represent the latent (unmeasured) factors of "satisfaction with supervision" and "satisfaction with pay." These factors are latent in the sense that it is assumed employees hold these beliefs but that these beliefs cannot be measured directly; however, they do influence employees’ responses to the items that constitute the job satisfaction questionnaire described earlier. (These seven items are represented as the squares labeled V1 to V7 in the figure.) It can be seen that the "supervision" factor exerts influence on items V1 to V4 (the supervision questions), whereas the "pay" factor exerts influence on items V5 to V7 (the pay items).
Researchers use factor analysis when they believe that one or more unobserved or latent factors exert directional influence on participants’ responses to observed variables. Exploratory factor analysis helps the researcher identify the number and nature of such latent factors. These procedures are described in the next chapter.
In contrast, principal component analysis makes no assumption about underlying causal structures; it is simply a variable reduction procedure that (typically) results in a relatively small number of components accounting for, or capturing, most variance in a set of observed variables (i.e., groupings of observed variables versus latent constructs).
Another important distinction between the two is that principal component analysis assumes no measurement error, whereas factor analysis captures both true variance and measurement error. Acknowledgement and measurement of error is particularly germane to social science research because instruments are invariably incomplete measures of underlying constructs. Because it ignores measurement error, principal component analysis can overestimate precision of measurement when it is used in instrument construction studies (i.e., it can overestimate the effectiveness of the scale).
In summary, both factor analysis and principal component analysis are important in social science research, but their conceptual foundations are quite distinct.
Example: Analysis of the Prosocial Orientation Inventory
Assume that you have developed an instrument called the Prosocial Orientation Inventory (POI) that assesses the extent to which a person has engaged in helping behaviors over the preceding six months. This fictitious instrument contains six items and is presented here:
Instructions: Below are a number of activities in which people sometimes engage. For each item, please indicate how frequently you have engaged in this activity over the past six months. Provide your response by circling the appropriate number to the left of each item using the response key below:
7 = Very Frequently
6 = Frequently
5 = Somewhat Frequently
4 = Occasionally
3 = Seldom
2 = Almost Never
1 = Never
When this instrument was developed, the intent was to administer it to a sample of participants and use their responses to the six items as separate predictor variables. As previously stated, however, you learned that this is a problematic practice and have decided, instead, to perform a principal component analysis on responses to see if a smaller number of components can successfully account for most variance in the dataset. If this is the case, you will use the resulting components as predictor variables in subsequent analyses.
At this point, it may be instructive to examine the content of the six items that constitute the POI to make an informed guess as to what is likely to result from the principal component analysis. Imagine that when you first constructed the instrument, you assumed that the six items were assessing six different types of prosocial behavior. Inspection of items 1 to 3, however, shows that these three items share something in common: they all deal with going out of one’s way to do a favor for someone else.
It would not be surprising then to learn that these three items will hang together empirically in the principal component analysis to be performed. In the same way, a review of items 4 to 6 shows that each of these items involves the activity of giving money to those in need.
Again, it is possible that these three items will also group together in the course of the analysis.
In summary, the nature of the items suggests that it may be possible to account for variance in the POI with just two components: a "helping others" component and a "financial giving" component. At this point, this is only speculation, of course; only a formal analysis can determine the number and nature of components measured by the inventory of items. (Remember that the preceding instrument is fictitious and used for purposes of illustration only and should not be regarded as an example of a good measure of prosocial orientation. Among other problems, this questionnaire obviously deals with very few forms of helping behavior.)
Preparing a Multiple-Item Instrument
The preceding section illustrates an important point about how not to prepare a multiple-item scale to measure a construct. Generally speaking, it is poor practice to throw together a questionnaire, administer it to a sample, and then perform a principal component analysis (or factor analysis) to determine what the questionnaire is measuring.
Better results are much more likely when you make a priori decisions about what you want the questionnaire to measure, and then take steps to ensure that it does. For example, you would have been more likely to obtain optimal results if you:
• began with a thorough review of theory and research on prosocial behavior
• used that review to determine how many types of prosocial behavior may exist
• wrote multiple questionnaire items to assess each type of prosocial behavior
Using this approach, you could have made statements such as "There are three types of prosocial behavior: acquaintance helping; stranger helping; and financial giving." You could have then prepared a number of items to assess each of these three types, administered the questionnaire to a large sample, and performed a principal component analysis to see if three components did, in fact, emerge.
Number of Items per Component
When a variable (such as a questionnaire item) is given a weight in computing a principal component, we say that the variable loads on that component. For example, if the item "Went out of my way to do a favor for a coworker" is given a lot of weight on the "helping others" component, we say that this item "loads" on that component.
It is highly desirable to have a minimum of three (and preferably more) variables loading on each retained component when the principal component analysis is complete (see Clark and Watson 1995). Because some items may be dropped during the course of the analysis (for reasons to be discussed later), it is generally good practice to write at least five items for each construct that you wish to measure. This increases your chances that at least three items per component will survive the analysis. Note that we have violated this recommendation by writing only three items for each of the two a priori components constituting the POI.
Keep in mind that the recommendation of three items per scale should be viewed as an absolute minimum and certainly not as an optimal number. In practice, test and attitude scale developers normally desire that their scales contain many more than just three items to measure a given construct. It is not unusual to see individual scales that include 10, 20, or even more items to assess a single construct (e.g., Chou and O’Rourke 2012; O’Rourke and Cappeliez 2002). Up to a point, the greater the number of scale items, the more reliable the scale will be. The recommendation of three items per scale should therefore be viewed as a rock-bottom lower bound, appropriate only if practical concerns prevent you from including more items (e.g., total questionnaire length). For more information on scale construction, see DeVellis (2012) and Saris and Gallhofer (2007).
Minimal Sample Size Requirements
Principal component analysis is a large-sample procedure. To obtain reliable results, the minimal number of participants providing usable data for the analysis should be the larger of 100 participants or 5 times the number of variables being analyzed (Streiner 1994).
To illustrate, assume that you wish to perform an analysis on responses to a 50-item questionnaire. (Remember that when responses to a questionnaire are analyzed, the number of variables is equal to the number of items on that questionnaire.) Five times the number of items on the questionnaire equals 250. Therefore, your final sample should provide usable (complete) data from at least 250 participants. Note, however, that any participant who fails to answer just one item will not provide usable data for the principal component analysis and will therefore be excluded from the final sample. A certain number of participants can always be expected to leave at least one question blank. To ensure that the final sample includes at least 250 usable responses, you would be wise to administer the questionnaire to perhaps 300 to 350 participants (see Little and Rubin 1987). A preferable alternative is to use an imputation procedure that assigns values for skipped items (van Buuren 2012). A number of such procedures are available in SAS but are not covered in this text.
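The sample size rule and the effect of skipped items can be expressed as a short sketch. The function names are hypothetical, and missing responses are represented here by None purely for illustration.

```python
def minimal_sample_size(n_variables, per_variable=5, floor=100):
    """Larger of `floor` participants or `per_variable` times the variables."""
    return max(floor, per_variable * n_variables)


def complete_cases(rows):
    """Keep only participants who answered every item (None marks a skip)."""
    return [row for row in rows if all(value is not None for value in row)]


print(minimal_sample_size(50))  # 50-item questionnaire
print(minimal_sample_size(7))   # seven-item questionnaire

# Hypothetical responses from four participants to three items; the second
# and fourth participants each skipped one item.
responses = [
    [5, 6, 7],
    [4, None, 6],
    [2, 3, 1],
    [None, 5, 5],
]
usable = complete_cases(responses)
print(len(usable))
```

For the 50-item questionnaire the rule gives 250 participants, whereas for a seven-item questionnaire the floor of 100 applies. Only two of the four hypothetical participants provide usable data, which is why the questionnaire should be administered to more people than the minimum.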
These rules regarding the number of participants per variable again constitute a lower bound, and some have argued that they should be applied only under two optimal conditions for principal component analysis: when many variables are expected to load on each component, and when variable communalities are high. Under less optimal conditions, even larger samples may be required.
What is a communality? A communality refers to the percent of variance in an observed variable that is accounted for by the retained components (or factors). A given variable will display a large communality if it loads heavily on at least one of the study’s retained components. Although communalities are computed in both procedures, the concept of variable communality is more relevant to factor analysis than principal component analysis.
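To make the definition concrete, the short sketch below computes a communality by hand. It assumes an orthogonal solution with two retained components, in which case a variable's communality equals the sum of its squared loadings; the two loading values are hypothetical and were chosen only for illustration.

```sas
* Communality for one variable under an orthogonal two-component solution;
data _null_;
   loading1 = 0.80;    /* hypothetical loading on component 1 */
   loading2 = 0.30;    /* hypothetical loading on component 2 */
   communality = loading1**2 + loading2**2;   /* 0.64 + 0.09  */
   put communality= 4.2;
run;
```

Under these assumptions the communality is .73, meaning that the two retained components account for 73% of the variance in that variable.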
SAS Program and Output
You may perform principal component analysis using the PRINCOMP, CALIS, or FACTOR procedures. This chapter will show how to perform the analysis using PROC FACTOR since this is a somewhat more flexible SAS procedure. (It is also possible to perform an exploratory factor analysis with PROC FACTOR or PROC CALIS.) Because the analysis is to be performed using PROC FACTOR, the output will at times make reference to factors rather than to principal components (e.g., component 1 will be referred to as FACTOR1 in the output). It is important to remember, however, that you are performing principal component analysis, not factor analysis.
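For comparison, here is a minimal sketch of the same kind of analysis using PROC PRINCOMP. It assumes a dataset named D1 containing the variables V1 to V6, as in this chapter's example; the output dataset name D1_PRIN is hypothetical.

```sas
* Unrotated principal components of the correlation matrix;
proc princomp data=D1 out=D1_prin;   /* OUT= saves component scores (Prin1, Prin2, ...) */
   var V1-V6;
run;
```

PROC PRINCOMP analyzes the correlation matrix by default and reports unrotated components, which is one reason this chapter relies on the more flexible PROC FACTOR.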
This section will provide instructions on writing the SAS program and an overview of the SAS output. A subsequent section will provide a more detailed treatment of the steps followed in the analysis as well as the decisions to be made at each step.
Writing the SAS Program
The DATA Step
To perform a principal component analysis, data may be entered as raw data, a correlation matrix, a covariance matrix, or some other format. (See Appendix A.2 for further description of these data input options.) In this chapter’s first example, raw data will be analyzed.
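As a rough illustration of the correlation matrix option, the sketch below enters a small TYPE=CORR dataset and analyzes it with PROC FACTOR. The dataset name D2, the sample size of 200, the standard deviations, and the correlations are all hypothetical values invented for this example; they are not drawn from the POI data.

```sas
* Hypothetical correlation matrix for three variables (illustrative values only);
data D2 (type=corr);
   input _type_ $ _name_ $ V1-V3;
   datalines;
N     .   200   200   200
STD   .   1.20  1.10  1.30
CORR  V1  1.00  0.45  0.30
CORR  V2  0.45  1.00  0.25
CORR  V3  0.30  0.25  1.00
;
run;

* Principal component analysis based on the correlation matrix;
proc factor data=D2 method=prin priors=one;
   var V1-V3;
run;
```

Because no raw data are available with this form of input, options that require raw data (such as OUT=) cannot be used.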
Assume that you administered the POI to 50 participants, and entered their responses according to the following guide:
Here are the statements to enter these responses as raw data. The first three observations and the last three observations are reproduced here; for the entire dataset, see Appendix B.
data D1;
input (V1-V6) (1.) ;
datalines;
556754
567343
777222
.
.
.
767151
455323
455544
;
run;
The dataset in Appendix B includes only 50 cases so that it will be relatively easy to enter the data and replicate the analyses presented here. It should be restated, however, that 50 observations is an unacceptably small sample for principal component analysis. Earlier it was noted that a sample should provide usable data from the larger of either 100 cases or 5 times the number of observed variables. A small sample is being analyzed here for illustrative purposes only.
The PROC FACTOR Statement
The general form for the SAS program to perform a principal component analysis is presented here:
proc factor data=dataset-name
simple
method=prin
priors=one
mineigen=p
rotate=varimax
round
flag=desired-size-of-significant-factor-loadings ;
var variables-to-be-analyzed ;
run;
Options Used with PROC FACTOR
The PROC FACTOR statement begins the FACTOR procedure and a number of options may be requested in this statement before it ends with a semicolon. Some options that are especially useful in social science research are:
FLAG
causes the output to flag (with an asterisk) factor loadings with absolute values greater than some specified size. For example, if you specify
flag=.35
an asterisk will appear next to any loading whose absolute value exceeds .35. This option can make it much easier to interpret a factor pattern. Negative values are not allowed in the FLAG option, and the FLAG option can be used in conjunction with the ROUND option.
METHOD=factor-extraction-method
specifies the method to be used in extracting the factors or components. The current program specifies
method=prin
to request that the principal method be used for the initial extraction. When this method is combined with PRIORS=ONE, PROC FACTOR extracts principal components, so this is the appropriate method for a principal component analysis.
MINEIGEN=p
specifies the critical eigenvalue a component must display if that component is to be retained (here, p = the critical eigenvalue). For example, the current program specifies
mineigen=1
This statement will cause PROC FACTOR to retain and rotate any component whose eigenvalue is 1.00 or larger. Negative values are not allowed.
NFACT=n
allows you to specify the number of components to be retained and rotated where n = the number of components.
OUT=name-of-new-dataset
creates a new dataset that includes all of the variables in the existing dataset, along with factor scores for the components retained in the present analysis. Component 1 is given the variable name FACTOR1, component 2 is given the name FACTOR2, and so forth. It must be used in conjunction with the NFACT option, and the analysis must be based on raw data.
PRIORS=prior-communality-estimates
specifies prior communality estimates. Users should always specify PRIORS=one to perform a principal component analysis.
ROTATE=rotation-method
specifies the rotation method to be used. The preceding program requests a varimax rotation that provides orthogonal (uncorrelated) components. Oblique rotations may also be requested (correlated components).
ROUND
Factor loadings and correlation coefficients in the matrices printed by PROC FACTOR are normally carried out to several decimal places. Requesting the ROUND option, however, causes all coefficients to be multiplied by 100 and rounded to the nearest integer (thus eliminating the decimal point). This generally makes it easier to read the coefficients.
PLOTS=scree
creates a plot that graphically displays the size of the eigenvalues associated with each component. This can be used to perform a scree test to visually determine how many components should be retained.
SIMPLE
requests simple descriptive statistics: the number of usable cases on which the analysis was performed and the means and standard deviations of the observed variables.
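The NFACT and OUT= options described above might be combined as in the following sketch. It assumes the raw dataset D1 from this chapter; the choice of two components and the output dataset name D1_SCORES are assumptions made for illustration.

```sas
* Retain two components and save the component scores in a new dataset;
proc factor data=D1
   method=prin
   priors=one
   nfact=2                 /* number of components to retain and rotate */
   rotate=varimax
   out=D1_scores ;         /* original variables plus Factor1, Factor2  */
var V1-V6 ;
run;

* Inspect the first five observations, including the new score variables;
proc print data=D1_scores (obs=5);
   var V1-V6 Factor1 Factor2;
run;
```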
The VAR Statement
The variables to be analyzed are listed on the VAR statement with each variable separated by at least one space. Remember that the VAR statement is a separate statement and not an option within the FACTOR statement, so don’t forget to end the FACTOR statement with a semicolon before beginning the VAR statement.
Example of an Actual Program
The following is an actual program, including the DATA step, that could be used to analyze some fictitious data. Only a few sample lines of data appear here; the entire dataset may be found in Appendix B.
data D1;
input #1 @1 (V1-V6) (1.) ;
datalines;
556754
567343
777222
.
.
.
767151
455323
455544
;
run;
proc factor data=D1
simple
method=prin
priors=one
mineigen=1
plots=scree
rotate=varimax
round
flag=.40 ;
var V1 V2 V3 V4 V5 V6;
run;
Results from the Output
The preceding program would produce three pages of output. Here is a list of some of the most important information provided by the output and the page on which it appears:
• page 1 includes simple statistics (mean values and standard deviations)
• page 2 includes scree plot of eigenvalues and cumulative variance explained
• page 3 includes the final communality estimates
The output created by the preceding program is presented here as Output 1.1.
Output 1.1: Results of the Initial Principal Component Analysis of the Prosocial Orientation Inventory (POI) Data (Page 1)
The FACTOR Procedure
Output 1.1 (Page 2)
The FACTOR Procedure
Initial Factor Method: Principal Components
Prior Communality Estimates: ONE
Output 1.1 (Page 3)
The FACTOR Procedure
Rotation Method: Varimax
Page 1 from Output 1.1 provides simple statistics for the observed variables included in the analysis. Once the SAS log has been checked to verify that no errors were made in the analysis, these simple statistics should be reviewed to determine how many usable observations were included in the analysis, and to verify that the means and standard deviations are in the expected range. On page 1, it says “Means and Standard Deviations from 50 Observations,” meaning that data from 50 participants were included in the analysis.
Steps in Conducting Principal Component Analysis
Principal component analysis is normally conducted in a sequence of steps, with somewhat subjective decisions being made at various points. Because this chapter is intended as an introduction to the topic, this text will not provide a comprehensive discussion of all of the options available at each step; instead, specific recommendations will be made, consistent with common practice in applied research. For a more detailed treatment of principal component analysis and factor analysis, see Stevens (2002).
Step 1: Initial Extraction of the Components
In principal component analysis, the number of components extracted is equal to the number of variables being analyzed. Because six variables are analyzed in the present study, six components are extracted. The first can be expected to account for a fairly large amount of the total variance. Each succeeding component will account for progressively smaller amounts of variance. Although a large number of components may be extracted in this way, only the first few components will be sufficiently important to be retained for interpretation.
Page 2 from Output 1.1 provides the eigenvalue table from the analysis. (This table appears just below the heading “Eigenvalues of the Correlation Matrix: Total = 6 Average = 1.”) An eigenvalue represents the amount of variance captured by a given component. In the column headed “Eigenvalue,” the eigenvalue for each component is presented. Each row in the table presents information for one of the six components. Row 1 provides information about the first component extracted, row 2 provides information about the second component extracted, and so forth.
Where the column headed “Eigenvalue” intersects with rows 1 and 2, it can be seen that the eigenvalue for component 1 is approximately 2.27, while the eigenvalue for component 2 is 1.97. This pattern is consistent with our earlier statement that the first components tend to account for relatively large amounts of variance, whereas the later components account for comparatively smaller amounts.
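The “Total = 6” in the table heading reflects the fact that each of the six standardized variables contributes one unit of variance, so the eigenvalues sum to the number of variables. As a minimal sketch, the arithmetic below converts the two eigenvalues reported above into shares of that total; the variable names are arbitrary, and the eigenvalue for component 1 is the approximate value given in the text.

```sas
* Share of total variance captured by components 1 and 2;
data _null_;
   n_vars = 6;                  /* total variance = number of variables */
   eigen1 = 2.27;               /* eigenvalue of component 1 (approx.)  */
   eigen2 = 1.97;               /* eigenvalue of component 2            */
   share1 = eigen1 / n_vars;
   share2 = eigen2 / n_vars;
   put share1= 5.3 share2= 5.3;
run;
```

On these figures, component 1 captures roughly 38% of the total variance and component 2 roughly 33%.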
Step 2: Determining the Number of Meaningful Components to Retain
Earlier it was stated that the number of components extracted is equal to the number of variables analyzed. This requires that you decide just how many of these components are truly meaningful and worthy of being retained for rotation and interpretation. In general, you expect that only the first few components will account for meaningful amounts of variance, and that the later components will tend to account for only trivial variance. The next step, therefore, is to determine how many meaningful components should be retained to interpret. This section will describe four criteria that may be used in making this decision: the eigenvalue‑one criterion, the scree test, the proportion of variance accounted for, and the interpretability criterion.
The Eigenvalue-One Criterion
In principal component analysis, one of the most commonly used criteria for solving the number-of-components problem is the eigenvalue-one criterion, also known as the Kaiser-Guttman criterion (Kaiser 1960). With this method, you retain and interpret all components with eigenvalues greater than 1.00.
The rationale for this criterion is straightforward: each observed variable contributes one unit of variance to the total variance in the dataset. Any component with an eigenvalue greater than 1.00 accounts for a greater amount of variance than had been contributed by one variable. Such a component therefore accounts for a meaningful amount of variance and (in theory) is worthy of retention.
On the other hand, a component with an eigenvalue less than 1.00 accounts for less variance than contributed by one variable. The purpose of principal component analysis is to reduce a number of observed variables into a relatively smaller number of components. This cannot be effectively achieved if you retain components that account for less variance than had been contributed by individual variables. For this reason, components with eigenvalues less than 1.00 are viewed as trivial and are not retained.
The eigenvalue-one criterion has a number of positive features that contribute to its utility. Perhaps the most important reason for its use is its simplicity. It does not require subjective decisions; you merely retain components with eigenvalues greater than 1.00.
Yet this criterion often results in retaining the correct number of components, particularly when a small to moderate number of variables are analyzed and the variable communalities are high. Stevens (2002) reviews studies that have investigated the accuracy of the eigenvalue-one criterion and recommends its use when fewer than 30 variables are being analyzed and communalities are greater than .70, or when the analysis is based on more than 250 observations and the mean communality is greater than .59.
There are, however, various problems associated with the eigenvalue-one criterion. As suggested in the preceding paragraph, it can lead to retaining the wrong number of components under circumstances that are often encountered in research (e.g., when many variables are analyzed, when communalities are small). Also, the reflexive application of this criterion can lead to retaining a certain number of components when the actual difference in the eigenvalues of successive components is trivial. For example, if component 2 has an eigenvalue of 1.01 and component 3 has an eigenvalue of 0.99, then component 2 will be retained but component 3 will not. This may mistakenly lead you to believe that the third component was meaningless when, in fact, it accounted for almost the same amount of variance as the second component. In short, the eigenvalue‑one criterion can be helpful when used judiciously, yet the reflexive application of this approach can lead to serious errors of interpretation. Almost always, the eigenvalue-one criterion should be considered in conjunction with other criteria (e.g., scree test, the proportion of variance accounted for, and the interpretability criterion) when deciding how many components to retain and interpret.
With SAS, the eigenvalue-one criterion can be applied by including the MINEIGEN=1 option in the PROC FACTOR statement and not including the NFACT option. The use of the MINEIGEN=1 will cause PROC FACTOR to retain any component with an eigenvalue greater than 1.00.
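Because the criterion should not be applied reflexively, it can help to save the eigenvalue table and inspect values that fall near the 1.00 cutoff. The sketch below is one possible way to do this. It assumes that the ODS table produced by PROC FACTOR is named Eigenvalues (you can confirm the name in your release by submitting ODS TRACE ON before the procedure), and the dataset name EIGEN_TBL is hypothetical.

```sas
* Save the eigenvalue table while applying the eigenvalue-one criterion;
ods output Eigenvalues=eigen_tbl;   /* assumed ODS table name; verify with ODS TRACE ON */
proc factor data=D1
   method=prin
   priors=one
   mineigen=1 ;                     /* no NFACT option, so MINEIGEN decides retention */
var V1-V6 ;
run;

* Review all eigenvalues, including those just above or below 1.00;
proc print data=eigen_tbl;
run;
```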
The eigenvalue table from the current analysis appears on page 2 of Output 1.1. The eigenvalues for components 1, 2, and 3 are 2.27, 1.97, and 0.80, respectively. Only components 1 and 2 have eigenvalues greater than 1.00, so the eigenvalue-one criterion would lead you to retain and interpret only these two components.
Fortunately, the application of the criterion is fairly unambiguous in this case. The last component retained (2) has an eigenvalue of 1.97, which is substantially greater than 1.00, and the next component (3) has an eigenvalue of 0.80, which is clearly lower than 1.00. In this instance, you are not faced with the difficult decision of whether to retain a component with an eigenvalue approaching 1.00 (e.g., an eigenvalue of .99). In situations such as this, the eigenvalue-one criterion may be used with greater confidence.
The Scree Test
With the scree test (Cattell 1966), you plot the eigenvalues associated with each component and look for a definitive “break” between the components with relatively large eigenvalues and those with relatively small eigenvalues. The components that appear before the break are assumed to be meaningful and are retained for rotation, whereas those appearing after the break are assumed to be unimportant and are not retained. Sometimes a scree plot will display several large breaks. When this is the case, you should look for the last big break before the eigenvalues begin to level off. Only the components that appear before this last large break should be retained.
Specifying the PLOTS=SCREE option in the PROC FACTOR statement tells SAS to print an eigenvalue plot as part of the output. This appears as page 2 of Output 1.1.
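The PLOTS= option relies on ODS Graphics. Whether ODS Graphics is enabled by default depends on how SAS is run at your site, so the following sketch switches it on explicitly before requesting the plot. It assumes the dataset D1 and variables V1 to V6 from this chapter.

```sas
* Enable ODS Graphics so that PLOTS=SCREE can produce the eigenvalue plot;
ods graphics on;

proc factor data=D1
   method=prin
   priors=one
   plots=scree ;
var V1-V6 ;
run;

ods graphics off;
```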
You can see that the component numbers are listed on the horizontal axis, while eigenvalues are listed on the vertical axis. With this plot, notice there is a relatively small break between components 1 and 2, and a relatively large break following component 2. The breaks between components 3, 4, 5, and 6 are all relatively small. It is often helpful to draw long lines with extended tails connecting successive pairs of eigenvalues so that these breaks are more apparent (e.g., measure degrees separating lines with a protractor).
Because the large break in this plot appears between components 2 and 3, the scree test would lead you to retain only components 1 and 2. The components appearing after the break (3 to 6) would be regarded as trivial.
The scree test can be expected to provide reasonably accurate results, provided that the sample is large (over 200) and most of the variable communalities are large (Stevens 2002). This criterion too has its weaknesses, most notably the ambiguity of scree plots under common research conditions. Very often, it is difficult to determine precisely where in the scree plot a break exists, or even if a break exists at all. In contrast to the eigenvalue-one criterion, the scree test is often more subjective.
The break in the scree plot on page 2 of Output 1.1 is unusually obvious. In contrast, consider the plot that appears in Figure 1.2.
Figure 1.2: A Scree Plot with No Obvious Break
Figure 1.2 presents a fictitious scree plot from a principal component analysis of 17 variables. Notice that there is no obvious break in the plot that separates the meaningful components from the trivial components. Most researchers would agree that components 1 and 2 are probably meaningful whereas components 13 to 17 are probably trivial; but it is difficult to decide exactly where you should draw the line. This example underscores the qualitative nature of judgments based solely on the scree test.
Scree plots such as the one presented in Figure 1.2 are common in social science research. When encountered, the use of the scree test must be supplemented with additional criteria such as the “variance accounted for” criterion and the interpretability criterion, to be described later.
Why do they call it a “scree” test? The word “scree” refers to the loose rubble that lies at the base of a cliff or glacier. When performing a scree test, you normally hope that the scree plot will take the form of a cliff. At the top will be the eigenvalues for the few meaningful components, followed by a definitive break (the edge of the cliff). At the bottom of the cliff will lie the scree (i.e., eigenvalues for the trivial components).
Proportion of Variance Accounted For
A third criterion to address the number of factors problem involves retaining a component