Sawtooth Software: Analysis of Traditional Conjoint Using Microsoft Excel: An Introductory Example
Sawtooth Software: Analysis of Traditional Conjoint Using Microsoft Excel: An Introductory Example
Bryan K. Orme,
Sawtooth Software, Inc.,
2002
In this example, we assume the reader has a basic understanding of multiple regression
analysis.
Consider a conjoint analysis problem with three attributes, each with levels as follows:
Brand
A
B
C
Color
Red
Blue
Price
$50
$100
$150
1
Assume the data for one respondent are as follows (as if in an Excel spreadsheet):
A B C D E
1 Card# Brand Color Price Preference
2 1 1 1 1 5
3 2 1 1 2 5
4 3 1 1 3 0
5 4 1 2 1 8
6 5 1 2 2 5
7 6 1 2 3 2
8 7 2 1 1 7
9 8 2 1 2 5
10 9 2 1 3 3
11 10 2 2 1 9
12 11 2 2 2 6
13 12 2 2 3 5
14 13 3 1 1 10
15 14 3 1 2 7
16 15 3 1 3 5
17 16 3 2 1 9
18 17 3 2 2 7
19 18 3 2 3 6
The first card is made up of level 1 of each of the attributes, or (Brand A, Red, $50). The
respondent rated that card a “5” on the preference scale.
After collecting the respondent data, the next step is to code the data in an appropriate
manner for estimating utilities using multiple regression. We use a procedure called
dummy coding for the independent variables (the product characteristics). In its simplest
form, dummy coding uses a “1” to reflect the presence of a feature, and a “0” to represent
its absence. For example, we can code the Brand attribute as three separate columns.
2
Applying dummy-coding for all attributes results in an array of columns as follows:
A B C D E F G H I J
1 Card # A B C Red Blue $50 $100 $150 Preference
2 1 1 0 0 1 0 1 0 0 5
3 2 1 0 0 1 0 0 1 0 5
4 3 1 0 0 1 0 0 0 1 0
5 4 1 0 0 0 1 1 0 0 8
6 5 1 0 0 0 1 0 1 0 5
7 6 1 0 0 0 1 0 0 1 2
8 7 0 1 0 1 0 1 0 0 7
9 8 0 1 0 1 0 0 1 0 5
10 9 0 1 0 1 0 0 0 1 3
11 10 0 1 0 0 1 1 0 0 9
12 11 0 1 0 0 1 0 1 0 6
13 12 0 1 0 0 1 0 0 1 5
14 13 0 0 1 1 0 1 0 0 10
15 14 0 0 1 1 0 0 1 0 7
16 15 0 0 1 1 0 0 0 1 5
17 16 0 0 1 0 1 1 0 0 9
18 17 0 0 1 0 1 0 1 0 7
19 18 0 0 1 0 1 0 0 1 6
Again, we see that card 1 is defined as (Brand A, Red, $50), but we have expanded the
layout to reflect dummy coding.
To this point, the coding has been very straightforward. But, there is one complication
that must be resolved. In multiple regression analysis, no independent variable may be
perfectly predictable based on the state of any other independent variable or combination
of independent variables. If so, the regression procedure could not separate the effects of
the confounded variables. We have that problem with the data above, since, for example,
we can perfectly predict the state of brand A based on the states for brands B and C. This
situation is termed linear dependency.
3
To resolve this linear dependency, we omit one column from each attribute. It really
doesn’t matter which column (level) we drop, and for this example we have excluded the
first level for each attribute, to produce the modified data table below:
A B C D E F G
1 Card # B C Blue $100 $150 Preference
2 1 0 0 0 0 0 5
3 2 0 0 0 1 0 5
4 3 0 0 0 0 1 0
5 4 0 0 1 0 0 8
6 5 0 0 1 1 0 5
7 6 0 0 1 0 1 2
8 7 1 0 0 0 0 7
9 8 1 0 0 1 0 5
10 9 1 0 0 0 1 3
11 10 1 0 1 0 0 9
12 11 1 0 1 1 0 6
13 12 1 0 1 0 1 5
14 13 0 1 0 0 0 10
15 14 0 1 0 1 0 7
16 15 0 1 0 0 1 5
17 16 0 1 1 0 0 9
18 17 0 1 1 1 0 7
19 18 0 1 1 0 1 6
Even though it appears that one level from each attribute is missing from the data, they
are really implicitly included as reference levels for each attribute. The explicitly coded
levels are estimated as contrasts with respect to the omitted levels, which are defined as
“0.”
Microsoft ExcelTM (we have used Excel from Office 2000 in this example) offers a
simple multiple regression tool, under Tools + Data Analysis + Regression (you must
have installed the Analysis Toolpak add-in). Using the tool, specify the preference score
(column G) as the dependent variable (the Input Y Range) and the five dummy-coded
attribute columns (columns B through F) as independent variables (the Input X range).
You should also make sure a constant is estimated (this usually happens by default).
where:
Y = respondent’s preference for the product concept,
b1 through b5 are beta weights (utilities) for the features,
e is an error term, and
the reference levels are equal to “0.”
4
The solution minimizes the sum of squares of the errors over all observations. A portion
of the output from Excel is as follows:
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.948902
R Square 0.900415
Adjusted R
Square 0.858921
Standard Error 0.942809
Observations 18
Coefficients Standard Error t Stat P-value Lower 95% Upper 95% Lower 95.0% Upper 95.0%
Intercept 5.833333 0.544331 10.71652 1.69E-07 4.647338 7.019329 4.647338 7.019329
X Variable 1 1.666667 0.544331 3.061862 0.009865 0.480671 2.852662 0.480671 2.852662
X Variable 2 3.166667 0.544331 5.817538 8.24E-05 1.980671 4.352662 1.980671 4.352662
X Variable 3 1.111111 0.444444 2.5 0.027915 0.14275 2.079472 0.14275 2.079472
X Variable 4 -2.16667 0.544331 -3.98042 0.001825 -3.35266 -0.98067 -3.35266 -0.98067
X Variable 5 -4.5 0.544331 -8.26703 2.68E-06 -5.686 -3.314 -5.686 -3.314
Using that output (after rounding to two decimals places of precision), the utilities
(Coefficients) are:
Brand
A 0.00
B 1.67
C 3.17
Color
Red 0.00
Blue 1.11
Price
$50 0.00
$100 -2.17
$150 -4.50
The constant is 5.83, and the fit for this respondent (R-Square) is 0.90. The fit ranges
from a low of 0 to a high of 1.0. The standard errors of the coefficients (betas) reflect
how precisely we are able to estimate the betas with this design. Lower standard errors
are better. The remaining statistics presented in Excel’s output are beyond the scope of
this paper, and are generally not of much use when considering individual-level conjoint
analysis problems.
5
Notes:
One can easily generalize how many parameters (independent variables plus the constant)
are involved in any conjoint analysis problem as #Levels - #Attributes + 1. Most
traditional conjoint analysis problems solve a separate regression equation for each
respondent. Therefore, to estimate utilities, the respondent must have evaluated at least
as many cards as parameters to be estimated. When the respondent answers the
minimum number of conjoint cards to enable estimation, this is called a saturated design.
While such a design is easiest on the respondent, it leaves no room for respondent error.
It also always yields an R-square of 100, and therefore no ability to assess respondent
consistency.
Most good conjoint designs in practice include more observations than parameters to be
estimated (usually 1.5 to 3 times more). The design above has three times as many cards
(observations) as parameters to be estimated. These designs usually lead to more stable
estimates of respondent utilities than saturated designs.
Also note that in practice (except with the smallest problems), asking respondents to
evaluate all possible combinations of the attribute levels is usually not practical. Design
catalogs and computer programs are available to find efficient fractional factorial
designs. Fractional factorial designs show just an efficient subset of the possible
combinations, and still provide enough degrees of freedom to estimate utilities.
The standard errors for the Color attribute are lower than for Brand and Price (recall that
lower standard errors imply greater precision of the beta estimate). Because Color only
has two levels (as compared to three each for Brand and Price), each color level has more
representation within the design. Therefore, more information is provided for each color
level than is provided for any level of the three-level attributes.
Acknowledgements: