Advanced Statistical Methods and Data Analytics For Research - Hypothesis Testing and SPSS - by Prof. M. Guruprasad


DATA ANALYTICS AND STATISTICS

Section 1 ANALYSIS - HYPOTHESIS TESTING

Section 2 ANALYSIS AND EVALUATION - Advanced statistical techniques for Marketing Research - SPSS


Section 1 ANALYSIS - HYPOTHESIS TESTING

A hypothesis is usually considered the principal instrument in research. Its main function is to
suggest new experiments and observations. In fact, many experiments are carried out with the
deliberate object of testing hypotheses. Decision-makers often face situations wherein they are
interested in testing hypotheses on the basis of available information and then taking decisions on
the basis of such testing. In social science, where direct knowledge of population parameter(s) is
rare, hypothesis testing is the often-used strategy for deciding whether sample data offer such
support for a hypothesis that a generalization can be made. Thus, hypothesis testing enables us to
make probability statements about population parameter(s). The hypothesis may not be proved
absolutely, but in practice it is accepted if it has withstood critical testing. Before we explain
how hypotheses are tested through the different tests meant for the purpose, it will be appropriate to
explain clearly the meaning of a hypothesis and the related concepts, for a better understanding of
the hypothesis testing techniques.

What is a hypothesis?
Ordinarily, when one talks about a hypothesis, one simply means a mere assumption or some
supposition to be proved or disproved. But for a researcher a hypothesis is a formal question that
he intends to resolve. Thus a hypothesis may be defined as a proposition or a set of propositions
set forth as an explanation for the occurrence of some specified group of phenomena either
asserted merely as a provisional conjecture to guide some investigation or accepted as highly
probable in the light of established facts. Quite often a research hypothesis is a predictive
statement, capable of being tested by scientific methods, that relates an independent variable to
some dependent variable. For example, consider statements like the following ones:
“Students who receive counseling will show a greater increase in creativity than students not
receiving counseling” or
“The automobile A is performing as well as automobile B.”
These are hypotheses capable of being objectively verified and tested. Thus, we may conclude
that a hypothesis states what we are looking for and it is a proposition which can be put to a test
to determine its validity.


Characteristics of hypothesis: Hypothesis must possess the following characteristics:

(i) Hypothesis should be clear and precise. If the hypothesis is not clear and precise, the
inferences drawn on its basis cannot be taken as reliable.
(ii) Hypothesis should be capable of being tested. Research programs have many a time
bogged down in a swamp of untestable hypotheses. Some prior study may be
done by the researcher in order to make the hypothesis a testable one. A hypothesis "is
testable if other deductions can be made from it which, in turn, can be confirmed or
disproved by observation."
(iii) Hypothesis should state the relationship between variables, if it happens to be a relational
hypothesis.
(iv) Hypothesis should be limited in scope and must be specific. A researcher must
remember that narrower hypotheses are generally more testable and he should
develop such hypotheses.
(v) Hypothesis should be stated, as far as possible, in the simplest terms so that it is
easily understandable by all concerned. But one must remember that simplicity of
hypothesis has nothing to do with its significance.
(vi) Hypothesis should be consistent with most known facts i.e., it must be consistent with
a substantial body of established facts. In other words, it should be one which judges
accept as being the most likely.
(vii) Hypothesis should be amenable to testing within a reasonable time. One should not
use even an excellent hypothesis if it cannot be tested within a reasonable time, for
one cannot spend a lifetime collecting data to test it.
(viii) Hypothesis must explain the facts that gave rise to the need for explanation. This
means that by using the hypothesis plus other known and accepted generalizations,
one should be able to deduce the original problem condition. Thus hypothesis must
actually explain what it claims to explain; it should have empirical reference.

A hypothesis is an important characteristic of the scientific method: it establishes the relationship
of a concept with theory, specifies the test to be applied (especially in the context of a
meaningful value judgment), states what the researcher is looking for, and is a tentative
generalization.

TYPES OF HYPOTHESIS

1. Hypotheses that state the existence of empirical uniformities (generalizations).

2. Hypotheses concerned with the relations of analytical variables - testing the existence of a logical relationship between derived empirical uniformities.

3. Hypotheses drawn from observations in day-to-day practice.

CHARACTERISTICS:

1. Conceptually clear.

2. Empirically verifiable.


3. Specific.

4. Related to available techniques and to the body of theory.

5. Relevant to the existing environmental conditions for the purpose of testing; and

6. It should identify the specific variables and their relations.

A hypothesis is a tentative, provisional solution that guides the scientist in getting on with the
enquiry.

Place in science:

A) It helps to explain facts.

B) It establishes guidelines for scientific research.

Well-established generalization: in order to explain the observed facts, a hypothesis is
formulated; established facts provide the evidence for the hypothesis as well as its initial
formulation.

Working hypothesis: a hypothesis used only to guide the investigation.

A good hypothesis has the following qualities:

1. Relevance to the facts.

2. Self-consistency.

3. Capacity for deductive development.

4. Testability.

5. Agreement with established law.

6. Simplicity.

7. A satisfactory solution.
ANALYSIS - PROCEDURE FOR HYPOTHESIS TESTING (TESTS)

BACKGROUND:


To test a hypothesis means to tell (on the basis of the data the researcher has collected)
Whether or not the hypothesis seems to be valid. In hypothesis testing the main question
is: whether to accept the null hypothesis or not to accept the null hypothesis? Procedure for
hypothesis testing refers to all those steps that we undertake for making a choice between
the two actions i.e., rejection and acceptance of a null hypothesis. The various steps
involved in hypothesis testing are stated below: (i) making a formal statement; (ii) selecting a
significance level; (iii) deciding the distribution to use; (iv) selecting a random sample and
computing an appropriate value; (v) calculating the probability; and (vi) comparing the
probability.
Tests of hypotheses: As has been stated above, hypothesis testing determines the
validity of the assumption (technically described as the null hypothesis) with a view to choosing
between two conflicting hypotheses about the value of a population parameter. Hypothesis
testing helps to decide, on the basis of sample data, whether a hypothesis about the
population is likely to be true or false. Statisticians have developed several tests of
hypotheses (also known as tests of significance) for the purpose of testing hypotheses,
which can be classified as: (A) parametric tests or standard tests of hypotheses; and (B)
non-parametric tests or distribution-free tests of hypotheses. Important Parametric
Tests: The important parametric tests are: (1) the z-test; (2) the t-test; (3) the chi-square (χ²)
test; and (4) the F-test (ANOVA). All these tests are based on the assumption of normality,
i.e., the source of data is considered to be normally distributed. In some cases the
population may not be normally distributed, yet the tests will be applicable because
we mostly deal with samples, and the sampling distributions closely approach
normal distributions.

Now, let us discuss the subject in detail.

BASIC CONCEPTS CONCERNING TESTING OF HYPOTHESES

Basic concepts in the context of testing of hypotheses need to be explained.


(a) Null hypothesis and alternative hypothesis: In the context of statistical analysis, we often
talk about the null hypothesis and the alternative hypothesis. If we are to compare method A with
method B with regard to superiority, and we proceed on the assumption that both methods are
equally good, then this assumption is termed the null hypothesis. As against this, if we
think that method A is superior or that method B is inferior, we are then stating what is termed
the alternative hypothesis. The null hypothesis is generally symbolized as H0 and the alternative
hypothesis as Ha. Suppose we want to test the hypothesis that the population mean (μ) is equal
to the hypothesized mean (μH0) = 100. Then we would say that the null hypothesis is that the
population mean is equal to the hypothesized mean 100, and symbolically we can express this as:

H0: μ = μH0 = 100

If our sample results do not support this null hypothesis, we should conclude that something else
is true. What we conclude by rejecting the null hypothesis is known as the alternative hypothesis. In
other words, the set of alternatives to the null hypothesis is referred to as the alternative
hypothesis. If we accept H0, then we are rejecting Ha, and if we reject H0, then we are accepting
Ha. For H0: μ = μH0 = 100, we may consider three possible alternative hypotheses as follows:

Ha: μ ≠ μH0 (the population mean is not equal to 100, i.e., it may be more or less than 100)
Ha: μ > μH0 (the population mean is greater than 100)
Ha: μ < μH0 (the population mean is less than 100)

The null hypothesis and the alternative hypothesis are chosen before the sample is drawn (the
researcher must avoid the error of deriving hypotheses from the data that he collects and then
testing the hypotheses from the same data).


In the choice of null hypothesis, the following considerations are usually kept in view:
(1) The alternative hypothesis is usually the one which one wishes to prove, and the null hypothesis
the one which one wishes to disprove. Thus, a null hypothesis represents the hypothesis we
are trying to reject, and the alternative hypothesis represents all other possibilities.
(2) If the rejection of a certain hypothesis when it is actually true involves great risk, it is taken
as the null hypothesis, because then the probability of rejecting it when it is true is α (the level of
significance), which is chosen very small.
(3) The null hypothesis should always be a specific hypothesis, i.e., it should not state an
approximate value.

Generally, in hypothesis testing we proceed on the basis of null hypothesis, keeping the
alternative hypothesis in view. Why so? The answer is that on the assumption that null
hypothesis is true, one can assign the probabilities to different possible sample results, but this
cannot be done if we proceed with the alternative hypothesis. Hence the use of null hypothesis
(at times also known as statistical hypothesis) is quite frequent.

(b) The level of significance: This is a very important concept in the context of hypothesis
testing. It is always some percentage (usually 5%) which should be chosen with great care,
thought and reason. In case we take the significance level at 5 per cent, then this implies that H0
will be rejected when the sampling result (i.e., observed evidence) has a less than 0.05
probability of occurring if H0 is true. In other words, the 5 per cent level of significance means
that the researcher is willing to take as much as a 5 per cent risk of rejecting the null hypothesis
when it (H0) happens to be true. Thus the significance level is the maximum value of the
probability of rejecting H0 when it is true and is usually determined in advance before testing the
hypothesis.

(c) Decision rule or test of hypothesis: Given a hypothesis H0 and an alternative hypothesis Ha,
we make a rule, known as a decision rule, according to which we accept H0 (i.e., reject Ha)
or reject H0 (i.e., accept Ha). For instance, if H0 is that a certain lot is good (there are very few
defective items in it) against Ha that the lot is not good (there are too many defective items in it),
then we must decide the number of items to be tested and the criterion for accepting or rejecting
the hypothesis. We might test 10 items in the lot and plan our decision saying that if there are
none or only 1 defective item among the 10, we will accept H0; otherwise we will reject H0 (i.e.,
accept Ha). This sort of basis is known as a decision rule.

(d) Type I and Type II errors: In the context of testing of hypotheses, there are basically two
types of errors we can make. We may reject H0 when H0 is true, and we may accept H0 when in
fact H0 is not true. The former is known as a Type I error and the latter as a Type II error. In other
words, a Type I error means rejection of a hypothesis that should have been accepted, and a Type II
error means accepting a hypothesis that should have been rejected. A Type I error is denoted
by α (alpha), known as the α-error, and is also called the level of significance of the test; a Type II
error is denoted by β (beta), known as the β-error. In tabular form the said two errors can be
presented as follows:

              H0 is true            H0 is false
Accept H0     Correct decision      Type II error (β)
Reject H0     Type I error (α)      Correct decision

The probability of a Type I error is usually determined in advance and is understood as the level of
significance of testing the hypothesis. If the Type I error is fixed at 5 per cent, it means that there are
about 5 chances in 100 that we will reject H0 when H0 is true. We can control the Type I error just by
fixing it at a lower level. For instance, if we fix it at 1 per cent, we will say that the maximum
probability of committing a Type I error would only be 0.01.


But with a fixed sample size, n, when we try to reduce the Type I error, the probability of
committing a Type II error increases. Both types of errors cannot be reduced simultaneously.
There is a trade-off between these two types of errors, which means that the probability of
making one type of error can only be reduced if we are willing to increase the probability of
making the other type of error. To deal with this trade-off in business situations, decision-makers
decide the appropriate level of Type I error by examining the costs or penalties attached to both
types of errors. If a Type I error involves the time and trouble of reworking a batch of chemicals
that should have been accepted, whereas a Type II error means taking a chance that an entire group
of users of this chemical compound will be poisoned, then in such a situation one should prefer a
Type I error to a Type II error. As a result, one must set a very high level for the Type I error in one's
testing technique for a given hypothesis. Hence, in the testing of hypotheses, one must make every
possible effort to strike an adequate balance between Type I and Type II errors.
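
The trade-off can be made concrete by continuing the hypothetical lot example from above: for a fixed n, relaxing the acceptance cutoff lowers α but raises β. A minimal sketch, with the 5% and 30% defect rates assumed purely for illustration:

from scipy.stats import binom

n = 10
p_good, p_bad = 0.05, 0.30   # hypothetical defect rates for good and bad lots

# Rule: accept the lot if at most c of the n items are defective.
for c in range(4):
    alpha = 1 - binom.cdf(c, n, p_good)   # P(reject | good lot): Type I
    beta = binom.cdf(c, n, p_bad)         # P(accept | bad lot): Type II
    print(f"c = {c}: alpha = {alpha:.3f}, beta = {beta:.3f}")

As c increases, alpha falls while beta rises, which is exactly the trade-off described above.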

(e) Two-tailed and one-tailed tests: In the context of hypothesis testing, these two terms are
quite important and must be clearly understood. A two-tailed test rejects the null hypothesis if,
say, the sample mean is significantly higher or lower than the hypothesized value of the mean of
the population. Such a test is appropriate when the null hypothesis is some specified value and
the alternative hypothesis is a value not equal to the specified value of the null hypothesis.
Symbolically, the two-tailed test is appropriate when we have H0: μ = μH0 and Ha: μ ≠ μH0,
which may mean μ < μH0 or μ > μH0. Thus, in a two-tailed test, there are two rejection regions
(also known as critical regions), one on each tail of the curve, as illustrated in Figure a.

If the significance level is 5 per cent and the two-tailed test is to be applied, the probability of the
rejection area will be 0.05 (equally split on both tails of the curve as 0.025 each) and that of the
acceptance region will be 0.95, as shown in Figure a. If we take μH0 = 100 and our
sample mean deviates significantly from 100 in either direction, then we shall reject the null
hypothesis; but if the sample mean does not deviate significantly from μH0, we shall
accept the null hypothesis.


But there are situations when only a one-tailed test is considered appropriate. A one-tailed test
would be used when we are to test, say, whether the population mean is either lower than or
higher than some hypothesized value. For instance, if we have H0: μ = μH0 and Ha: μ < μH0, then we
are interested in what is known as a left-tailed test (wherein there is one rejection region only, on
the left tail), as illustrated in Figure b:

Figure a (two-tailed test: acceptance region in the middle, rejection regions on both tails)


Mathematically we can state:


Acceptance Region A: |Z| < 1.96
Rejection Region R: |Z| ≥ 1.96


Figure b (left-tailed test: rejection region on the left tail)

Mathematically we can state:


Acceptance Region A: Z > −1.645
Rejection Region R: Z ≤ −1.645

If μH0 = 100 and our sample mean deviates significantly from 100 in the lower direction, we
shall reject H0; otherwise we shall accept H0 at the chosen level of significance. If the significance
level in the given case is kept at 5%, then the rejection region will be equal to 0.05 of the area in the
left tail, as shown in the above curve.
In case we have H0: μ = μH0 and Ha: μ > μH0, we are then interested in what is known as a one-tailed
test (right tail), and the rejection region will be on the right tail of the curve, as shown below:

Mathematically we can state:


Acceptance Region A: Z < 1.645
Rejection Region R: Z ≥ 1.645

If μH0 = 100 and our sample mean deviates significantly from 100 in the upward direction,
we shall reject H0; otherwise we shall accept it. If in the given case the significance level is
kept at 5%, then the rejection region will be equal to 0.05 of the area in the right tail, as has been
shown in the above curve.

It should always be remembered that accepting H0 on the basis of sample information does not
constitute proof that H0 is true. We only mean that there is no statistical evidence to reject it;
we are certainly not saying that H0 is true (although we behave as if it is).
Note on the Normal Distribution

The normal distribution is a continuous distribution and plays a very important and pivotal
role in statistical theory and practice, particularly in the areas of statistical inference and
statistical quality control. Its importance is also due to the fact that, in practice,
experimental results very often seem to follow the normal distribution, or bell-shaped
curve. The normal curve is symmetrical and is defined by its mean (μ) and its standard
deviation (σ). The curve is symmetrical about the mean, its tails extend indefinitely, and
the mean, mode and median of the distribution have the same value:

Mean = Mode = Median

The normal curve is not just one curve but a family of curves. Just as the equation for a
circle describes the family of circles, some small and some big, the equation for the normal
curve describes a family of such curves which may differ only with regard to the values of μ
and σ, but have the same characteristics in all other respects.

Characteristics of the Normal Curve


Some of these characteristics are:

a) All normal curves are symmetrical about the mean. This means that the number
of units in the data below the mean is the same as the number of units above the
mean. This means that the mean and the median have the same value.
b) The height of the normal curve is at its maximum at the mean value. Thus the
mean and the mode coincide. This means that the normal distribution has the
same value for mean, median and mode.
c) The height of the curve declines as we go in either direction from the mean, but
never touches the base, so that the tails of the curve on both sides of the mean
extend indefinitely.
d) The first and the third quartiles are equidistant from the mean.
e) The height of the normal curve Y at any value of the random continuous variable
x is given by the following equation:
Y = [1 / (σ √(2π))] · e^(−(1/2)((x − μ)/σ)²)

where:

e = mathematical constant = 2.71828

π = mathematical constant = 3.14159

μ = mean of the distribution

σ = standard deviation of the distribution

x = value of the continuous random variable, between minus infinity (−∞) and plus infinity (+∞).


The normal distribution is most commonly applied in statistical quality control and is very useful
in many sociological studies. It is very close to many other distributions and can serve as a
good approximation for many discrete distributions, such as the binomial as n becomes larger.

PROCEDURE FOR HYPOTHESIS TESTING


To test a hypothesis means to tell (on the basis of the data the researcher has collected) whether
or not the hypothesis seems to be valid. In hypothesis testing the main question is: whether to
accept the null hypothesis or not to accept the null hypothesis? Procedure for hypothesis testing
refers to all those steps that we undertake for making a choice between the two actions i.e.,
rejection and acceptance of a null hypothesis

The various steps involved in hypothesis testing are stated below:


(i) Making a formal statement: This step consists in making a formal statement of the null
hypothesis (H0) and also of the alternative hypothesis (Ha). This means that hypotheses
should be clearly stated, considering the nature of the research problem. For instance, Mr.
Mohan of the Civil Engineering Department wants to test whether the load-bearing capacity of an old
bridge is more than 10 tons. In that case he can state his hypotheses as under:
Null Hypothesis H0: μ = 10 tons
Alternative Hypothesis Ha: μ > 10 tons

Take another example. The average score in an aptitude test administered at the national
level is 80. To evaluate a state's education system, the average score of 100 of the state's
students selected on a random basis was 75. The state wants to know if there is a significant
difference between the local scores and the national scores. In such a situation the
hypotheses may be stated as under:
Null Hypothesis H0: μ = 80
Alternative Hypothesis Ha: μ ≠ 80

The formulation of hypotheses is an important step which must be accomplished with due
care, in accordance with the object and nature of the problem under consideration. It also
indicates whether we should use a one-tailed test or a two-tailed test. If Ha is of the type
greater than (or of the type less than), we use a one-tailed test, but when Ha is of the type
"whether greater or smaller", then we use a two-tailed test.

(ii) Selecting a significance level: The hypotheses are tested on a pre-determined level of
significance, and as such the same should be specified. Generally, in practice, either the 5% level
or the 1% level is adopted for the purpose. The factors that affect the level of significance are:
(a) the magnitude of the difference between sample means;
(b) the size of the samples;
(c) the variability of measurements within samples; and
(d) whether the hypothesis is directional or non-directional (a directional hypothesis is one
which predicts the direction of the difference between, say, means).
In brief, the level of significance must be adequate in the context of the purpose and nature of the enquiry.

(iii) Deciding the distribution to use: After deciding the level of significance, the next step in
hypothesis testing is to determine the appropriate sampling distribution. The choice generally
remains between the normal distribution and the t-distribution. The rules for selecting the
correct distribution are similar to those that we have stated earlier in the context of
estimation.

(iv) Selecting a random sample and computing an appropriate value: The next step is to
select a random sample (or samples) and compute an appropriate value from the sample data
concerning the test statistic, utilizing the relevant distribution. In other words, draw a sample
to furnish empirical data.

(v) Calculation of the probability: One has then to calculate the probability that the sample
result would diverge as widely as it has from expectations, if the null hypothesis were in fact
true.

(vi) Comparing the probability: Yet another step consists in comparing the probability thus
calculated with the specified value of α, the significance level. If the calculated probability
is equal to or smaller than the α value in the case of a one-tailed test (and α/2 in the case of a two-
tailed test), then reject the null hypothesis (i.e., accept the alternative hypothesis); but if the
calculated probability is greater, then accept the null hypothesis. In case we reject H0, we
run a risk of committing a Type I error (at most α, the level of significance); but if we
accept H0, then we run some risk (the size of which cannot be specified as long as the H0
happens to be vague rather than specific) of committing a Type II error.
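
Putting the six steps together for the aptitude-test example stated in step (i) (H0: μ = 80 against Ha: μ ≠ 80, with a sample mean of 75 from n = 100 students), a minimal worked sketch in Python follows; the population standard deviation of 20 is a hypothetical value assumed for illustration, since none is given in the example:

from scipy.stats import norm

mu0, xbar, n, sigma, alpha = 80.0, 75.0, 100, 20.0, 0.05  # sigma is assumed

z = (xbar - mu0) / (sigma / n ** 0.5)   # test statistic: (75 - 80) / 2 = -2.5
p_value = 2 * norm.cdf(-abs(z))         # two-tailed probability, ~0.0124

print(f"z = {z:.2f}, p = {p_value:.4f}")
print("reject H0" if p_value <= alpha else "accept H0")   # reject at the 5% level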
FLOW DIAGRAM FOR HYPOTHESIS TESTING
The above-stated general procedure for hypothesis testing can also be depicted in the form of a
flow chart for better understanding.

Tests of Hypotheses


Hypothesis testing helps to decide, on the basis of sample data, whether a hypothesis about the
population is likely to be true or false. Statisticians have developed several tests of hypotheses
(also known as tests of significance) for the purpose, which can be classified as:
(a) Parametric tests or standard tests of hypotheses; and
(b) Non-parametric tests or distribution-free tests of hypotheses.

Parametric tests usually assume certain properties of the parent population from which we draw
samples. Assumptions such as that the observations come from a normal population, that the sample
size is large, and assumptions about population parameters such as the mean and variance, must hold
good before parametric tests can be used. But there are situations when the researcher cannot or does not
want to make such assumptions. In such situations we use statistical methods for testing
hypotheses which are called non-parametric tests, because such tests do not depend on any
assumption about the parameters of the parent population. Besides, most non-parametric tests
assume only nominal or ordinal data, whereas parametric tests require measurement equivalent
to at least an interval scale. As a result, non-parametric tests need more observations than
parametric tests to achieve the same size of Type I and Type II errors.

Important Parametric Tests


The important parametric tests are:
(1) z-test
The z-test is based on the normal probability distribution and is used for judging the significance of
several statistical measures, particularly the mean. The relevant test statistic, z, is worked out and
compared with its probable value (read from the table showing the area under the normal curve) at a
specified level of significance, for judging the significance of the measure concerned. This is a
most frequently used test in research studies. The test is used even when the binomial distribution or
the t-distribution is applicable, on the presumption that such a distribution tends to approximate the
normal distribution as n becomes larger. The z-test is generally used for comparing the mean of a
sample to some hypothesized mean for the population in the case of a large sample, or when the
population variance is known. The z-test is also used for judging the significance of the difference
between the means of two independent samples in the case of large samples, or when the population
variance is known. The z-test is also used for comparing a sample proportion to a theoretical value
of the population proportion, or for judging the difference in proportions of two independent samples
when n happens to be large. Besides, this test may be used for judging the significance of the
median, mode, coefficient of correlation and several other measures.
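
For the proportion case mentioned above, a one-sample z-test can be run with the statsmodels package. A minimal sketch, in which the 130 aware respondents out of 400 and the claimed 30% awareness level are hypothetical figures:

from statsmodels.stats.proportion import proportions_ztest

# H0: population proportion = 0.30 vs Ha: proportion != 0.30
z_stat, p_value = proportions_ztest(count=130, nobs=400, value=0.30)
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")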

(2) t-test
The t-test is based on the t-distribution and is considered an appropriate test for judging the significance
of a sample mean, or of the difference between the means of two samples,
in the case of small sample(s) when the population variance is not known (in which case we use the
variance of the sample as an estimate of the population variance). In case two samples are related,
we use the paired t-test (also known as the difference test) for judging the significance of the mean
of the differences between the two related samples. It can also be used for judging the significance of
the coefficients of simple and partial correlations. The relevant test statistic, t, is calculated from
the sample data and then compared with its probable value based on the t-distribution (read
from the table that gives probable values of t for different levels of significance and different
degrees of freedom) at a specified level of significance and the relevant degrees of freedom, for
accepting or rejecting the null hypothesis. It may be noted that the t-test applies only in the case of small
samples (n < 30) when the population variance is unknown.
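
A minimal sketch of a one-sample t-test on a small sample with SciPy; the ten scores and the hypothesized mean of 80 are hypothetical:

from scipy.stats import ttest_1samp

scores = [72, 78, 81, 69, 75, 74, 80, 77, 73, 76]   # n = 10 < 30
t_stat, p_value = ttest_1samp(scores, popmean=80)   # H0: mu = 80
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

For two related samples, scipy.stats.ttest_rel provides the paired (difference) test described above.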

(3) χ²-test or Chi-square test

The χ²-test is based on the chi-square distribution and, as a parametric test, is used for comparing a
sample variance to a theoretical population variance. As a non-parametric test, it "can be used to
determine if categorical data shows dependency or if two classifications are independent. It can
also be used to make comparisons between theoretical populations and actual data when
categories are used." Thus, the chi-square test is applicable to a large number of problems. The test
is, in fact, a technique through the use of which it is possible for researchers to (i) test the
goodness of fit, (ii) test the significance of association between two attributes, and (iii) test the
homogeneity or the significance of population variance.
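
A minimal sketch of use (i), the goodness-of-fit test, with SciPy; the observed brand-choice counts and the assumed uniform preference are hypothetical:

from scipy.stats import chisquare

observed = [50, 30, 20]        # hypothetical choices for brands A, B, C
expected = [100 / 3] * 3       # uniform preference over the 3 brands
chi2_stat, p_value = chisquare(observed, f_exp=expected)
print(f"chi2 = {chi2_stat:.2f}, p = {p_value:.4f}")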

(4) F-test.


The F-test is based on the F-distribution and is used to compare the variances of two independent
samples. The test is also used in the context of analysis of variance (ANOVA) for judging the
significance of more than two sample means at one and the same time. It is also used for judging
the significance of multiple correlation coefficients. The test statistic, F, is calculated and compared
with its probable value (read from the F-ratio tables for the degrees of freedom of the
greater and smaller variances at a specified level of significance) for accepting or rejecting the
null hypothesis.
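
A minimal sketch of the ANOVA use of the F-test with SciPy, comparing three hypothetical groups of observations at one and the same time:

from scipy.stats import f_oneway

group_a = [23, 25, 21, 27, 24]   # hypothetical observations
group_b = [30, 28, 33, 29, 31]
group_c = [22, 26, 24, 25, 23]
f_stat, p_value = f_oneway(group_a, group_b, group_c)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")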

All these tests are based on the assumption of normality i.e., the source of data is considered to
be normally distributed. In some cases the population may not be normally distributed, yet the
tests will be applicable on account of the fact that we mostly deal with samples and the sampling
distributions closely approach normal distributions.


Section 2 ANALYSIS AND EVALUATION

Advanced statistical techniques for Marketing Research - SPSS


As discussed in Chapter 13, the process of converting information from a questionnaire
so it can be read by a computer is referred to as data preparation. This process normally
follows a five-step approach, beginning with data validation, then editing and coding of the
data, followed by data entry, error detection, and data tabulation. The purpose of the data
preparation process is to take data in its raw form and convert it so as to establish meaning
and create value for the user.

In Chapter 13 we discussed some of the basic statistical techniques of tabulation, frequency
distribution, and measures of central tendency, among others. In this chapter we will briefly
discuss some of the advanced techniques used in research.

Once the researcher has formed the hypotheses and calculated the means of the groups,
the next step is to actually analyze the relationships in the sample data.

The reader should note that most of these techniques are used for the following reasons.

One purpose is the determination of population parameters using sample statistics, i.e.,
deciding the extent of the relationship between the studied sample data and the population, and
confirming whether the sample data can be generalized, i.e., whether they are representative of the
population characteristics. This decision implies that whatever result we get from the sample
data analysis is valid and closer to reality.

The other purpose is establishing statistical significance between different variables within the research
study (say, to verify the researcher's proposed hypothesis). For example, consider the
hypothesis "There is no association between owning a PC and frequency of browsing at a
cyber café". From the data collected in the field, the researcher can verify whether
such a relationship exists, using techniques such as cross-tabulation, correlation
techniques or any other relevant technique.

Relationships between variables can be described in several ways, including presence, direction,
strength of association, and type. The first issue, and probably the most obvious, is whether two or more
variables are related at all. If a systematic relationship exists between two or more variables, this is
referred to as the presence of a relationship. To measure whether a relationship is present, we rely on the
concept of statistical significance. If we test for statistical significance and find that it exists, then we say
that a relationship is present. Stated another way, we say that knowledge about the behavior of one
variable allows us to make a useful prediction about the behavior of another.

If a relationship is present between two variables, it is important to know the direction. The
direction of a relationship can be either positive or negative. An understanding of the
strength of association also is important. We generally categorize the strength of association
as nonexistent, weak, moderate, or strong. If a consistent and systematic relationship is not
present, then the strength of association is nonexistent. A weak association means there is
a low probability of the variables having a relationship. A strong association means there is
a high probability that a consistent and systematic relationship exists.

Another concept that is important to understand is the type of relationship. If we say that
two variables can be described as related, then we would pose this as a question:

“What is the nature of the relationship?” How can the link between Y and X best be
described? There are a number of different ways in which two variables can share a
relationship. Variables Y and X can have a linear relationship, which means that the strength
and nature of the relationship between them remains the same over the range of both
variables, and can best be described using a straight line. Conversely, Y and X could have a
curvilinear relationship, which would mean that the strength and/or direction of their
relationship changes over the range of both variables (perhaps Y’s relationship with X first
gets stronger as X increases, but then gets weaker as the value of X continues to increase).

In the above discussion, we assumed a relationship between two variables. In reality,
market forces consist of multiple factors impacting the business. Statistically speaking, the
real market situation requires multivariate analysis.

Accordingly, for the purpose of analysis, statistical techniques are broadly divided into
univariate analysis, bivariate analysis and multivariate analysis.

Univariate analysis

Univariate analysis involves describing the distribution of a single variable, including its
central tendency (the mean, median, and mode) and dispersion (the range and
quantiles of the data set, and measures of spread such as the variance and standard
deviation). The shape of the distribution may also be described via indices such as
skewness and kurtosis. Characteristics of a variable's distribution may also be depicted in
graphical or tabular format, including histograms and stem-and-leaf displays.
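
A minimal sketch of such a univariate summary with pandas; the ages are hypothetical:

import pandas as pd

ages = pd.Series([23, 31, 27, 45, 38, 29, 33, 27, 41, 36], name="age")
print(ages.describe())           # count, mean, std, min, quartiles, max
print("skewness:", ages.skew())
print("kurtosis:", ages.kurtosis())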

Bivariate analysis

When a sample consists of more than one variable, descriptive statistics may be used to
describe the relationship between pairs of variables. In this case, descriptive statistics
include:

Cross-tabulations and contingency tables

Graphical representation via scatterplots

Quantitative measures of dependence

Descriptions of conditional distributions

The main reason for differentiating univariate and bivariate analysis is that bivariate
analysis is not simply descriptive analysis; it describes the relationship
between two different variables. Quantitative measures of dependence include correlation
(such as Pearson's r when both variables are continuous, or Spearman's rho if one or both
are not) and covariance (which reflects the scale the variables are measured on). The slope, in
regression analysis, also reflects the relationship between variables. The slope indicates the
unit change in the criterion variable for a one unit change in the predictor.
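
A minimal sketch of these measures of dependence with SciPy and NumPy; the two series are hypothetical:

import numpy as np
from scipy.stats import pearsonr, spearmanr

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2])

r, p = pearsonr(x, y)          # both variables continuous
rho, p_rho = spearmanr(x, y)   # rank-based alternative
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
print("covariance:", np.cov(x, y)[0, 1])   # scale-dependent measure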

Multivariate analysis (MVA) is based on the statistical principle of multivariate statistics,
which involves observation and analysis of more than one statistical outcome variable at a
time (and thus also includes bivariate analysis). In design and analysis, the technique is used to
perform trade studies across multiple dimensions while taking into account the effects of all
variables on the responses of interest.

Uses for multivariate analysis include:

Design for capability (also known as capability-based design)

Inverse design, where any variable can be treated as an independent variable

Analysis of Alternatives (AoA), the selection of concepts to fulfill a customer need

Analysis of concepts with respect to changing scenarios

Identification of critical design drivers and correlations across hierarchical levels.

Some of the multivariate statistical techniques are multiple regression analysis, discriminant analysis, factor analysis and cluster analysis.

There are many different models, each with its own type of analysis:

Multivariate analysis of variance (MANOVA) extends the analysis of variance to cover


cases where there is more than one dependent variable to be analyzed simultaneously: see
also MANCOVA.

Multivariate regression analysis attempts to determine a formula that can describe how
elements in a vector of variables respond simultaneously to changes in others. For linear
relations, regression analyses here are based on forms of the general linear model.

Principal components analysis (PCA) creates a new set of orthogonal variables that
contain the same information as the original set. It rotates the axes of variation to give a
new set of orthogonal axes, ordered so that they summarize decreasing proportions of the
variation.

Factor analysis is similar to PCA but allows the user to extract a specified number of
synthetic variables, fewer than the original set, leaving the remaining unexplained variation
as error. The extracted variables are known as latent variables or factors; each one may be
supposed to account for covariation in a group of observed variables.

Canonical correlation analysis finds linear relationships among two sets of variables; it is
the generalised (i.e. canonical) version of bivariate correlation.


Multidimensional scaling comprises various algorithms to determine a set of synthetic


variables that best represent the pairwise distances between records. The original method is
principal coordinates analysis (PCoA; based on PCA).

Discriminant analysis, or canonical variate analysis, attempts to establish whether a set


of variables can be used to distinguish between two or more groups of cases.

Clustering systems assign objects into groups (called clusters) so that objects (cases)
from the same cluster are more similar to each other than objects from different clusters.
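
As a minimal sketch of one of these techniques, a principal components analysis can be run with scikit-learn on a small hypothetical data set; the other techniques follow a broadly similar fit/transform pattern in such packages:

import numpy as np
from sklearn.decomposition import PCA

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])   # hypothetical data

pca = PCA(n_components=2)
scores = pca.fit_transform(X)              # the new orthogonal variables
print(pca.explained_variance_ratio_)       # decreasing shares of variation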

With the advancement of technology, most applications of these complicated statistical
techniques are carried out through menu-driven, user-friendly statistical software
packages. Prominent among the software is SPSS, originally named the Statistical Package
for the Social Sciences.

SPSS
SPSS is a widely used program for statistical analysis in social science. It is also used by market
researchers, health researchers, survey companies, government, education researchers, marketing
organizations, data miners, and others. The original SPSS manual (Nie, Bent & Hull, 1970) has
been described as one of "sociology's most influential books" for allowing ordinary researchers
to do their own statistical analysis. In addition to statistical analysis, data management (case
selection, file reshaping, creating derived data) and data documentation (a metadata dictionary is
stored in the data file) are features of the base software.

SPSS is a comprehensive and flexible statistical analysis and data management solution. SPSS
can take data from almost any type of file and use them to generate tabulated reports, charts and
plots of distributions and trends, and descriptive statistics, and to conduct complex statistical analyses.
SPSS is available on several platforms: Windows, Macintosh, and UNIX systems.

SPSS Statistics is a software package used for statistical analysis. Long produced by SPSS Inc.,
it was acquired by IBM in 2009. The current versions (2014) are officially named IBM SPSS
Statistics. Companion products in the same family are used for survey authoring and deployment
(IBM SPSS Data Collection), data mining (IBM SPSS Modeler), text analytics, and
collaboration and deployment (batch and automated scoring services).

The software was originally named the Statistical Package for the Social Sciences (SPSS).
Later the expansion of the acronym was changed to Statistical Product and Service Solutions to
reflect the growing diversity of the user base. Today, the IBM SPSS website makes no mention of
an official expansion of the acronym.

SPSS is used in virtually every industry, including telecommunications, banking, finance,
insurance, healthcare, manufacturing, retail, consumer packaged goods, higher education,
government, and market research.

Statistics included in the base software:


Descriptive statistics: Cross tabulation, Frequencies, Descriptives, Explore, Descriptive Ratio Statistics

Bivariate statistics: Means, t-test, ANOVA, Correlation (bivariate, partial, distances), Nonparametric tests

Prediction for numerical outcomes: Linear regression

Prediction for identifying groups: Factor analysis, cluster analysis (two-step, K-means, hierarchical), Discriminant

Now let us discuss briefly some of the statistical techniques and how they are applied for
analysis

Simple Tabulation and Cross Tabulation

In a questionnaire-based marketing research project, each question usually represents a variable


under study. The basic form of analysis of one variable in a questionnaire is Simple Tabulation
of the answers. This could be in the form of simple counting of the frequencies (how many
people answered Yes, and No, for example), and percentages.

Two different questions in a questionnaire may represent two variables, and if we count these
two together, this is called a cross-tabulation. An example could be “10 people from Income
Group 1 said they liked Brand A”. Here, the two variables are “INCOME GROUP” and
“LIKING FOR BRANDS A TO E”, measured separately in two different questions on the
questionnaire.

Simple and cross tabulation is a very useful form of analysis for all nominally and ordinally
scaled variables. For these two scales, calculations such as the average (mean) and standard
deviation are not permitted. Therefore, frequencies and percentages are used to analyse such
variables.

Dependent and Independent Variables


1. If two or more variables are analysed together, it may be necessary to spell out the relationship
between the two variables. The concept of dependent and independent variables is useful in
spelling out the relationship. Two variables are called independent variables if a change in one
does not influence or cause a change in the other. But if a change in one variable causes a
change in the other, the first one is called an independent variable, and the second one is called a
dependent variable (dependent on the first).
2. A common example of a dependent variable in marketing is “Sales”. Annual sales of a brand
usually depend on several factors or variables. One of the independent variables on which
annual sales depend could be the quantum of advertising (in rupees) done for the brand. A
second variable on which sales may depend could be the number of retailers stocking the brand.
3. In a consumer research questionnaire, the dependent variable could be satisfaction with the
brand, which may depend on taste (if it is a food brand), and easy availability. Another example
is the quantity of a product bought, a dependent variable, which depends on family size and
household income.


Demographic Variables
1. Many demographic variables such as age, location, income, occupation, sex, education are
generally independent variables for the purposes of most marketing studies. This is because
other variables “depend” on them.
2. Attitude towards a brand, or the brand purchased, or intention to buy, are usually treated as
dependent variables in many marketing studies. For a marketing researcher, these variables or
similar ones, are the real variables of interest, as they help in arriving at strategies for increasing
sales or market share.
3. The other major types of independent variables are the elements of the four ‘P’s of marketing.
The marketing effort of a company can be measured in terms of its promotional efforts, price
variations and distribution changes. It can also be gauged from new product launches, or
repositioning or repackaging of existing brands.
4. Therefore, we could measure sales as the dependent variable with any of the marketing ‘P’s as
independent variables.
In a questionnaire-based survey, the first stage of analysis is called simple tabulation. This
consists of every question being treated separately and tabulated. For every question, the number
of responses in each category of answers is counted. Assuming the sample size is 500, and all
500 have answered the question, the simple tabulation of the respondents' gender may look like
the following –
1. Male - 300
2. Female - 200
-----
Total 500
-----
The simple tabulation for another question on the questionnaire may look like this –
1. Regular Users of Brand X -- 200
2. Occasional Users of Brand X -- 150
3. Non-users of Brand X -- 150
-----
Total 500
-----
A title can be included for each table, and on the top of each column, to explain the variable
name through a label. For example, the above simple table can be titled Frequency of Usage, or
Number of Users and Non-users of Brand X.
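
A minimal sketch of how such a simple tabulation can be produced with pandas; the eight answers are hypothetical:

import pandas as pd

usage = pd.Series(["Regular", "Occasional", "Non-user", "Regular",
                   "Non-user", "Regular", "Occasional", "Regular"])
counts = usage.value_counts()
percent = usage.value_counts(normalize=True) * 100
print(pd.DataFrame({"count": counts, "percent": percent.round(1)}))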

After the simple frequency and percentage tabulation for every question on the questionnaire
comes the second stage – the cross tabulations. A cross-tabulation can be done by combining
any two of the questions and tabulating the data together. This is a 2-variable cross tabulation.
An example could be a cross-tabulation between Brand Preference for brands of tea and Region
to which Respondent belongs. Assuming we have the data on these two variables from a study,
the cross tabulation may look like this –
BRAND         Regionwise Buyers (No.)
              North   South   East   West   Total
BrookeBond      25      20      20     15      80
Lipton          10      15      20      5      50
Tata            15      15      10     30      70
Total           50      50      50     50     200
This is a cross-tabulation of two variables. An extension of this could be adding percentages.
All these percentages can be displayed in a table form separately, or in brackets along with
number of respondents. The table of percentages along with numbers will look like this –
BRAND         Regionwise Buyers - Numbers and Percentages
              North       South       East        West        Total
BrookeBond    25 (50%)    20 (40%)    20 (40%)    15 (30%)     80 (40%)
Lipton        10 (20%)    15 (30%)    20 (40%)     5 (10%)     50 (25%)
Tata          15 (30%)    15 (30%)    10 (20%)    30 (60%)     70 (35%)
Total         50 (100%)   50 (100%)   50 (100%)   50 (100%)   200 (100%)
The above table can be interpreted according to the column (region) we are looking at. The first
four columns represent findings for each region, and the fifth column (Total) represents overall
findings for all the regions taken together. For example, from column 4, 30% of buyers in the
west prefer Brooke Bond, 10% Lipton, and 60% Tata tea. From column 5, out of the total
200 respondents across all regions, 40% prefer Brooke Bond, 25% Lipton and 35% Tata tea.
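
A minimal sketch of how such a cross-tabulation with column percentages can be produced with pandas; only six hypothetical respondents are shown:

import pandas as pd

df = pd.DataFrame({
    "brand":  ["BrookeBond", "Lipton", "Tata", "BrookeBond", "Tata", "Lipton"],
    "region": ["North", "South", "West", "East", "West", "North"],
})
print(pd.crosstab(df["brand"], df["region"], margins=True))          # counts
print(pd.crosstab(df["brand"], df["region"], normalize="columns"))   # column shares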
Lack of Causal Inference in Cross Tabulations
It must be mentioned here that any two variables can be cross-tabulated. Even if the
cross-tabulation shows a significant association between the two variables, it does not necessarily
mean that one of them (the independent) causes the other (the dependent). Causality or direct
effect is more an assumption made by the researcher, based on his expectation or experience.
The mere existence of a statistically significant association does not necessarily imply a cause-
and-effect relationship between the (presumed) independent and the (presumed) dependent
variable.


Using the t-Test

The t-test is especially useful when the sample size is small (n < 30) and when the population
standard deviation is unknown. Unlike the univariate test, however, we assume that the samples
are drawn from populations with normal distributions and that the variances of the populations
are equal. Essentially, the t-test for differences between group means can be conceptualized as
the difference between the means divided by the variability of random means. The t value is the
ratio of the difference between the two sample means to the standard error. The t-test provides
a rational way of determining whether the difference between the two sample means occurred
by chance.
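
A minimal sketch of such a two-sample t-test with SciPy; the satisfaction scores for the two groups are hypothetical:

from scipy.stats import ttest_ind

group_1 = [7.1, 6.8, 7.4, 6.9, 7.2, 7.0, 6.7, 7.3]   # hypothetical scores
group_2 = [6.2, 6.5, 6.0, 6.4, 6.1, 6.6, 6.3, 5.9]

# Assumes, as stated above, normal populations with equal variances
t_stat, p_value = ttest_ind(group_1, group_2)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")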


The Chi-squared Test for Cross Tabulations


In the case of cross-tabulations featuring two variables, a test of significance called the
chi-squared test can be used to test whether the two variables are significantly statistically
associated with each other.
The chi-square value is also often used to judge the significance of population variance, i.e., we
can use the test to judge if a random sample has been drawn from a normal population with a
given mean and a specified variance.

Let us assume that we have conducted a consumer survey for a brand of detergent. One of the
questions dealt with the income category of the respondent, and another question asked the
respondent to rate his purchase intention.

These two variables were cross-tabulated from a sample of, say, 20 respondents for the sake of
this illustration, and a cross-tabulation with a chi-squared test was requested from the computer
package. The question asked was: is there a significant association between respondent income
and purchase intention? The chi-squared test answers precisely this question.

Chi-square analysis permits us to test for significance between the frequency distributions of
two or more groups, say, males versus females. Categorical data from questions about sex,
education, or other nominal variables can be examined to provide tests of hypotheses of interest.
Chi-square analysis compares the observed frequencies of the responses with the expected
frequencies, which are based on our ideas about the population distribution or our predicted
proportions. In other words, the statistic tests whether the observed data are distributed the way
we would expect them to be.
The chi-square test is an important test amongst the several tests of significance developed by
statisticians. Chi-square (pronounced "ki-square") is a statistical measure used in the context of
sampling analysis for comparing a variance to a theoretical variance. It “can be used to
determine if categorical data shows dependency or the two classifications are independent. It can
also be used to make comparisons between theoretical populations and actual data when
categories are used.” Thus, the chi-square test is applicable in a large number of problems. The
test is, in fact, a technique through the use of which it is possible for researchers to (i) test the
goodness of fit; (ii) test the significance of association between two attributes, and (iii) test the
homogeneity or the significance of population variance.

It can be calculated by using the formula

χ² = Σ (Oᵢ − Eᵢ)² / Eᵢ

where Oᵢ is the observed frequency and Eᵢ is the expected frequency in each category (cell).
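
For a cross-tabulation such as the income-by-purchase-intention example above,
scipy.stats.chi2_contingency computes the expected frequencies and the chi-square statistic in
one call. The observed counts below are invented for illustration.

# Chi-squared test of association for a cross-tabulation (invented counts).
from scipy.stats import chi2_contingency

# Rows: income category (low, middle, high);
# columns: purchase intention (will buy, will not buy).
observed = [[3, 5],
            [4, 2],
            [5, 1]]

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi-square = {chi2:.3f}, df = {dof}, p = {p:.4f}")
# A small p-value suggests income and purchase intention are associated.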

Correlation

1. Correlation and Regression are generally performed together. Correlation analysis is applied
to measure the degree of association between two sets of quantitative data. The correlation
coefficient measures this association; its value ranges from -1 (perfect negative correlation)
through 0 (no correlation) to +1 (perfect positive correlation).

2. For example, how are sales of product A correlated with sales of product B? Or, how is the
advertising expenditure correlated with other promotional expenditure? Or, are daily ice cream
sales correlated with daily maximum temperature?

3. Correlation does not necessarily imply a causal effect. Given any two strings of numbers,
there will be some correlation between them; this does not imply that one variable is causing a
change in the other, or is dependent upon the other.

4. Correlation is usually followed by regression analysis in many applications.
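
As a quick sketch, the ice-cream example in point 2 above could be checked with
scipy.stats.pearsonr; the daily figures below are invented for illustration.

# Pearson correlation sketch (invented daily figures).
from scipy.stats import pearsonr

max_temp = [30, 32, 35, 28, 31, 36, 33, 29]                 # daily max temperature (deg C)
ice_cream_sales = [200, 220, 260, 180, 210, 275, 240, 190]  # daily sales (units)

r, p = pearsonr(max_temp, ice_cream_sales)
print(f"r = {r:.3f}, p = {p:.4f}")  # r near +1 indicates strong positive correlation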

Regression

1. The main objective of regression analysis is to explain the variation in one variable (called the
dependent variable), based on the variation in one or more other variables (called the
independent variables).

2. The application areas include ‘explaining’ variations in sales of a product based on advertising
expenses, or the number of salespeople, or the number of sales offices, or on all of the above
variables.

3. If there is only one dependent variable and one independent variable is used to explain the
variation in it, then the model is known as a simple regression.

4. If multiple independent variables are used to explain the variation in a dependent variable, it is
called a multiple regression model.

5. The form of the regression equation could be either linear or non-linear.

As seen from the preceding discussion, the major application of regression analysis in marketing
is in the area of sales forecasting, based on some independent (or explanatory) variables. This
does not mean that regression analysis is the only technique used in sales forecasting. There are a
variety of quantitative and qualitative methods used in sales forecasting, and regression is only
one of the better known (and often used) quantitative techniques.
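
As a sketch of the simplest case, a one-predictor regression of sales on advertising can be fitted
with statsmodels; the figures below are hypothetical.

# Simple linear regression sketch: sales explained by advertising (hypothetical data).
import numpy as np
import statsmodels.api as sm

advertising = np.array([10, 12, 15, 18, 20, 22, 25, 28])    # e.g., Rs. lakh
sales = np.array([110, 118, 131, 140, 152, 158, 170, 182])  # e.g., '000 units

X = sm.add_constant(advertising)  # adds the intercept term
model = sm.OLS(sales, X).fit()    # fits sales = b0 + b1 * advertising
print(model.params)               # estimated intercept and slope
print(model.rsquared)             # share of variation in sales 'explained'

# A multiple regression simply stacks more explanatory columns into X.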

ANALYSIS OF VARIANCE (ANOVA)

Analysis of variance (ANOVA) is an extremely useful technique for research in the fields of
economics, biology, education, psychology, sociology, business/industry and in several other
disciplines. This technique is used when multiple sample cases are involved. It is used to
determine whether there is a statistically significant difference between three or more means. As stated
earlier, the significance of the difference between the means of two samples can be judged
through either z-test or the t-test, but the difficulty arises when we happen to examine the
significance of the difference amongst more than two sample means at the same time. The
ANOVA technique enables us to perform this simultaneous test and as such is considered to be
an important tool of analysis in the hands of a researcher. Using this technique, one can draw
inferences about whether the samples have been drawn from populations having the same mean.
The ANOVA technique is important in the context of all those situations where we want to
compare more than two populations such as in comparing the yield of crop from several varieties
of seeds, the gasoline mileage of four automobiles, the smoking habits of five groups of
university students and so on. In such circumstances one generally does not want to consider all
possible combinations of two populations at a time, for that would require a great number of
tests before we would be able to arrive at a decision. This would also consume a lot of time and money,
and even then certain relationships may be left unidentified (particularly the interaction effects).
Therefore, one quite often utilizes the ANOVA technique and through it investigates the
differences among the means of all the populations simultaneously.

Multiple dependent variables can be analyzed together using a related procedure called
multivariate analysis of variance (MANOVA). The objective in MANOVA is identical to that in
ANOVA—to examine group differences in means—only the comparisons are considered for a
group of dependent variables.

An example of an ANOVA problem may be to compare light, medium, and heavy drinkers of
Starbucks coffee on their attitude toward a particular Starbucks advertising campaign. In this
instance there is one independent variable—consumption of Starbucks coffee—but it is divided
into three different levels. Our earlier t statistics won’t work here, since we have more than two
groups to compare. ANOVA requires that the dependent variable, in this case the attitude toward
the Starbucks advertising campaign, be metric. That is, the dependent variable must be either
interval or ratio scaled. A second data requirement is that the independent variable, in this case
the coffee consumption variable, be categorical. The null hypothesis for ANOVA always states
that there is no difference between the ad campaign attitudes of the groups of Starbucks coffee
drinkers. In specific terminology, the null hypothesis would be H0: μ1 = μ2 = μ3.
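
For the Starbucks illustration, a one-way ANOVA can be run with scipy.stats.f_oneway; the
attitude scores for the three consumption groups below are invented.

# One-way ANOVA sketch for three consumption groups (invented attitude scores).
from scipy.stats import f_oneway

light = [4, 5, 3, 4, 5, 4]   # attitude toward the campaign, e.g., on a 1-7 scale
medium = [5, 6, 5, 6, 5, 6]
heavy = [6, 7, 6, 7, 7, 6]

f_stat, p_value = f_oneway(light, medium, heavy)
print(f"F = {f_stat:.3f}, p = {p_value:.4f}")
# A small p-value leads us to reject H0: mu1 = mu2 = mu3.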

FACTOR ANALYSIS:
1. Factor Analysis is a set of techniques used for understanding variables by grouping them into
“factors” consisting of similar variables

2. It can also be used to confirm whether a hypothesized set of variables groups into a factor or
not

3. It is most useful when a large number of variables needs to be reduced to a smaller set of
“factors” that contain most of the variance of the original variables

4. Generally, Factor Analysis is done in two stages, called
Extraction of Factors and
Rotation of the Solution obtained in stage 1

5. Factor Analysis is best performed with interval or ratio-scaled variables

In marketing research, a common application of Factor Analysis is to understand the underlying
motives of consumers who buy a product category or a brand.

Suppose that a two-wheeler manufacturer is interested in determining which variables his
potential customers think about when they consider his product. Let us assume that twenty
two-wheeler owners were surveyed by this manufacturer (or by a marketing research company
on his behalf). They were asked to indicate, on a seven-point scale (1 = Completely Agree,
7 = Completely Disagree), their agreement or disagreement with a set of ten statements relating
to their perceptions of some attributes of the two-wheelers. The objective of doing Factor
Analysis here is to find underlying "factors", fewer than ten in number, which would be linear
combinations of some of the original ten variables.

Thus Factor Analysis is used to summarize the information contained in a large number of
variables into a smaller number of subsets called factors.
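
A minimal sketch of the two-stage procedure (extraction followed by varimax rotation) using
scikit-learn's FactorAnalysis is shown below; the 20 x 10 ratings matrix is randomly generated as
a stand-in for the two-wheeler survey, so the resulting factors are illustrative only.

# Factor analysis sketch: extract and rotate factors from survey ratings.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
ratings = rng.integers(1, 8, size=(20, 10)).astype(float)  # stand-in: 20 owners x 10 statements

fa = FactorAnalysis(n_components=3, rotation="varimax")  # extraction + rotation stages
scores = fa.fit_transform(ratings)  # factor scores for each respondent
loadings = fa.components_.T         # 10 variables x 3 factors loading matrix
print(np.round(loadings, 2))        # high loadings show which variables group together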

CLUSTER ANALYSIS

Cluster analysis is used to classify respondents or objects (e.g., products, stores) into groups that
are homogeneous, or similar within the groups but different between the groups.

1. A cluster, by definition, is a group of similar objects
2. There could be clusters of people, brands or other objects
3. If clusters are formed of customers similar to one another, then cluster analysis can help
marketers identify segments (clusters)
4. If clusters of brands are formed, this can be used to gain insights into brands that are perceived
as similar to each other on a set of attributes
5. Cluster analysis is best performed when the variables are interval or ratio-scaled

As the name implies, the basic purpose of cluster analysis is to classify or segment objects (e.g.,
customers, products, market areas) into groups so that objects within each group are similar to
one
another on a variety of variables. Cluster analysis seeks to classify segments or objects such that
there will be as much likeness within segments and as much difference between segments as
possible. Thus, this method strives to identify natural groupings or segments among many
variables without designating any of the variables as a dependent variable. Let us discuss the
application of cluster analysis with an example. Suppose a fast-food chain wants to open an
eat-in restaurant in a new, growing suburb of a major metropolitan area. Marketing researchers
surveyed a large sample of households in this suburb and collected data on characteristics such
as demographics, lifestyles, and expenditures on eating out. The fast-food chain wants to identify
one or more household segments that are likely to visit its new restaurant. Once this segment is
identified, the firm’s advertising and services would be tailored to them. A target segment can be
identified for the company by conducting a cluster analysis of the data it has gathered. The
results of the cluster analysis will identify segments, each of which contains households that
have similar characteristics and differs considerably from the other segments.

Let us suppose that our research identifies four potential clusters or segments for our fast-food
chain. As our intuitive example illustrates, this growing suburb contains households that seldom
visit restaurants at all (cluster 1), households that tend to frequent regular restaurants (with table
service) exclusively (cluster 2), households that tend to frequent fast-food restaurants exclusively
(cluster 3), and households that frequent both regular and fast-food restaurants (cluster 4). By
examining the characteristics associated with each of the clusters, management can decide which
clusters to target and how best to reach them through marketing communications.
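
A sketch of how such household segments might be formed with k-means clustering in
scikit-learn follows; the two variables (monthly eating-out spend and household size) and all
values are synthetic.

# K-means clustering sketch: group households into four segments (synthetic data).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
households = np.column_stack([
    rng.normal(3000, 800, 100),             # monthly eating-out spend (Rs.)
    rng.integers(1, 7, 100).astype(float),  # household size
])

kmeans = KMeans(n_clusters=4, n_init=10, random_state=1).fit(households)
print(kmeans.labels_[:10])      # segment membership of the first 10 households
print(kmeans.cluster_centers_)  # average profile of each segment

In practice the variables would first be standardized so that no single scale dominates the
distance calculations.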

Discriminant Analysis
Discriminant analysis is a multivariate technique used for predicting group membership on the
basis of two or more independent variables. There are many situations where the marketing
researcher’s purpose is to classify objects or groups by a set of independent variables. In
marketing, consumers are often categorized on the basis of heavy versus light users of a product,
or viewers versus nonviewers of a media vehicle such as a television commercial.

1. The major application area for this technique is where we want to be able to distinguish
between two or more sets of objects or people, based on the knowledge of some of their
characteristics.
2. Examples include the selection process for a job, the admission process of an educational
programme in a college, or dividing a group of people into potential buyers and non-buyers.
3. Discriminant analysis can be, and in fact is, used by credit rating agencies to rate individuals
and to classify them into good lending risks or bad lending risks. The detailed example discussed
later tells you how to do that.
4. To summarise, we can use linear discriminant analysis when we have to classify objects into
two or more groups based on the knowledge of some variables (characteristics) related to them.
Typically, these groups would be users-non-users, potentially successful salesman – potentially
unsuccessful salesman, high risk – low risk consumer, or on similar lines.
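
A sketch of the lending-risk illustration using scikit-learn's LinearDiscriminantAnalysis follows;
the applicant characteristics and good/bad labels are synthetic.

# Linear discriminant analysis sketch: good vs. bad lending risks (synthetic data).
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(2)
good = np.column_stack([rng.normal(60, 10, 50), rng.normal(8, 2, 50)])  # income ('000), years in job
bad = np.column_stack([rng.normal(35, 10, 50), rng.normal(3, 2, 50)])
X = np.vstack([good, bad])
y = np.array([1] * 50 + [0] * 50)  # 1 = good lending risk, 0 = bad lending risk

lda = LinearDiscriminantAnalysis().fit(X, y)
new_applicant = [[50.0, 5.0]]            # income 50, five years in current job
print(lda.predict(new_applicant))        # predicted group membership
print(lda.predict_proba(new_applicant))  # classification probabilities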

Conjoint Analysis
Conjoint analysis is a multivariate technique which estimates the relative importance consumers
place on the different attributes of a product or service, as well as the utilities or value they attach
to the various levels of each attribute. This dependence method assumes that consumers choose
or form preferences for products by evaluating the overall utility or value of the product. This
value is composed of the individual utilities of each product feature or attribute. Conjoint
analysis tries to estimate the product attribute importance weights that would best match the
consumer’s indicated product choice or preference.

1. Marketing managers frequently want to know what utility a particular product feature or
service feature will have for a consumer.
2. Conjoint analysis is a multivariate technique that captures the levels of utility that an
individual customer puts on various attributes of the product offering. It enables a direct
comparison between, say, the utility of a price level of Rs. 400 versus Rs. 500, a delivery period
of 1 week versus 2 weeks, or an after-sales response of 24 hours versus 48 hours.
3. Once we know the utility levels for every attribute (and at every level), we can combine these
to find the combination of attributes that gives the customer the highest utility, the combination
that gives the second highest utility, and so on.
4. This information can be used to design a product or service offering.
5. If this is done across a sample of customers say, segment-wise, it can also be used to predict
market-share, and the response of customers to changes in the competitive strategy through
changes in the marketing elements.

For example, assume that our fast-food company wants to determine the best combination of
restaurant features to attract customers. A marketing researcher could develop a number of
descriptions or restaurant profiles, each containing different combinations of features. Consumers
would then be surveyed, shown the different profiles, and asked to rank the descriptions in order
of their likelihood of patronizing the restaurant.
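
Part-worth utilities are often estimated by regressing the preference ratings on dummy-coded
attribute levels. The sketch below uses this regression approach (one of several ways conjoint
studies are analysed); the restaurant profiles and the single respondent's ratings are invented.

# Conjoint-style part-worth estimation via dummy-coded regression (invented data).
import pandas as pd
import statsmodels.api as sm

profiles = pd.DataFrame({
    "price":   ["low", "low", "low", "low", "high", "high", "high", "high"],
    "service": ["fast", "slow", "fast", "slow", "fast", "slow", "fast", "slow"],
    "rating":  [9, 6, 8, 5, 7, 3, 6, 2],   # respondent's preference ratings
})

X = pd.get_dummies(profiles[["price", "service"]], drop_first=True).astype(float)
X = sm.add_constant(X)
model = sm.OLS(profiles["rating"], X).fit()
print(model.params)  # the coefficients are the estimated part-worth utilities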

Conjoint Analysis Applications in Marketing Research

The fast-food example used in this discussion illustrates one possible use of conjoint analysis:
to identify the important attributes that influence consumer restaurant choice. There are
other important applications of this technique in marketing research, such as the following:
• Market share potential. Products with different combinations of features can be compared
to determine the most popular design.
• Product image analysis. The relative contribution of each product attribute can be
determined for use in marketing and advertising decisions.
• Segmentation analysis. Groups of potential customers who place different levels of
importance on the product features can be identified for use as high and low potential
market segments.

Thus the various statistical techniques are extremely useful to marketing researchers, marketing
decision makers and policy makers across many areas of management decision making,
government and academic research.

Summary of Selected Multivariate Methods

Multiple regression enables the marketing researcher to predict a single dependent metric
variable from two or more metrically measured independent variables.
Multiple discriminant analysis can predict a single dependent nonmetric variable from two or
more metrically measured independent variables.
Factor analysis is used to summarize the information contained in a large number of variables
into a smaller number of subsets called factors.
Cluster analysis is used to classify respondents or objects (e.g., products, stores) into groups that
are homogeneous, or similar within the groups but different between the groups.
Conjoint analysis is used to estimate the value (utility) that respondents associate with different
product and/or service features, so that the most preferred combination of features can be
determined.
Perceptual mapping is used to visually display respondents’ perceptions of products, brands,
companies, and so on. Several multivariate methods can be used to develop the data to construct
perceptual maps.
