Advanced Statistical Methods and Data Analytics For Research - Hypothesis Testing and SPSS - by Prof. M. Guruprasad
A hypothesis is usually considered the principal instrument in research. Its main function is to
suggest new experiments and observations. In fact, many experiments are carried out with the
deliberate object of testing hypotheses. Decision-makers often face situations wherein they are
interested in testing hypotheses on the basis of available information and then taking decisions on
the basis of such testing. In social science, where direct knowledge of population parameter(s) is
rare, hypothesis testing is the often-used strategy for deciding whether sample data offer such
support for a hypothesis that generalization can be made. Thus, hypothesis testing enables us to
make probability statements about population parameter(s). The hypothesis may not be proved
absolutely, but in practice it is accepted if it has withstood critical testing. Before we explain
how hypotheses are tested through different tests meant for the purpose, it will be appropriate to
explain clearly the meaning of a hypothesis and the related concepts for a better understanding of
hypothesis-testing techniques.
What is a hypothesis?
Ordinarily, when one talks about a hypothesis, one simply means a mere assumption or some
supposition to be proved or disproved. But for a researcher, a hypothesis is a formal question that
he intends to resolve. Thus a hypothesis may be defined as a proposition or a set of propositions
set forth as an explanation for the occurrence of some specified group of phenomena either
asserted merely as a provisional conjecture to guide some investigation or accepted as highly
probable in the light of established facts. Quite often a research hypothesis is a predictive
statement, capable of being tested by scientific methods, that relates an independent variable to
some dependent variable. For example, consider statements like the following ones:
“Students who receive counseling will show a greater increase in creativity than students not
receiving counseling” or
“The automobile A is performing as well as automobile B.”
These are hypotheses capable of being objectively verified and tested. Thus, we may conclude
that a hypothesis states what we are looking for and it is a proposition which can be put to a test
to determine its validity.
Characteristics of a Hypothesis:
(i) Hypothesis should be clear and precise. If the hypothesis is not clear and precise, the
inferences drawn on its basis cannot be taken as reliable.
(ii) Hypothesis should be capable of being tested. Research programmes have, many a
time, bogged down in a swamp of untestable hypotheses. Some prior study may be
done by the researcher in order to make the hypothesis a testable one. A hypothesis "is
testable if other deductions can be made from it which, in turn, can be confirmed or
disproved by observation."
(iii) Hypothesis should state relationship between variables, if it happens to be relational
hypothesis.
(iv) Hypothesis should be limited in scope and must be specific. A researcher must
remember that narrower hypotheses are generally more testable and he should
develop such hypotheses.
(v) Hypothesis should be stated as far as possible in the simplest terms, so that it is
easily understandable by all concerned. But one must remember that simplicity of a
hypothesis has nothing to do with its significance.
(vi) Hypothesis should be consistent with most known facts i.e., it must be consistent with
a substantial body of established facts. In other words, it should be one which judges
accept as being the most likely.
(vii) Hypothesis should be amenable to testing within a reasonable time. One should not
use even an excellent hypothesis, if the same cannot be tested in reasonable time for
one cannot spend a lifetime collecting data to test it.
(viii) Hypothesis must explain the facts that gave rise to the need for explanation. This
means that by using the hypothesis plus other known and accepted generalizations,
one should be able to deduce the original problem condition. Thus hypothesis must
actually explain what it claims to explain; it should have empirical reference.
Basic Concepts Concerning Testing of Hypothesis
Basic concepts in the context of hypothesis testing need to be explained.
(a) Null hypothesis and alternative hypothesis: In the context of statistical analysis, we often
talk about the null hypothesis and the alternative hypothesis. If we are to compare method A
with method B about its superiority and if we proceed on the assumption that both
methods are equally good, then this assumption is termed the null hypothesis. As
against this, if we think that method A is superior or that method B is inferior,
we are then stating what is termed the alternative hypothesis. The null hypothesis is
generally symbolized as H0 and the alternative hypothesis as Ha. Suppose we want
to test the hypothesis that the population mean (μ) is equal to the hypothesized mean
(μH0) = 100. Then we would say that the null hypothesis is that the population mean is
equal to the hypothesized mean 100, and symbolically we can express it as:
H0: μ = μH0 = 100
If our sample results do not support this null hypothesis, we should conclude that something
else is true. What we conclude on rejecting the null hypothesis is known as the alternative
hypothesis. In other words, the set of alternatives to the null hypothesis is referred to as
the alternative hypothesis. If we accept H0, then we are rejecting Ha, and if we reject H0,
then we are accepting Ha. For H0: μ = μH0 = 100, we may consider three possible
alternative hypotheses as follows:
Alternative hypothesis    To be read as follows
Ha: μ ≠ μH0    (The population mean is not equal to 100, i.e., it may be more or less than 100)
Ha: μ > μH0    (The population mean is greater than 100)
Ha: μ < μH0    (The population mean is less than 100)
A hypothesis is an important feature of the scientific method: it establishes the relationship
of concepts with theory, specifies the test to be applied (especially in the context of
meaningful value judgments), states what the researcher is looking for, and serves as a
tentative generalization.
TYPES OF HYPOTHESIS
CHARACTERISTICS:
1. Conceptually clear
2. Empirically verifiable
3. Specific
4. Relevant to the existing environmental conditions for the purpose of testing
A hypothesis is a tentative, provisional solution that guides the scientist to get on with the
enquiry.
Place in science:
1. Self-consistency
2. Testability
3. Simplicity
4. A satisfactory solution
ANALYSIS: PROCEDURE FOR HYPOTHESIS TESTING (TESTS)
BACKGROUND:
To test a hypothesis means to tell (on the basis of the data the researcher has collected)
whether or not the hypothesis seems to be valid. In hypothesis testing the main question
is: whether to accept the null hypothesis or not to accept the null hypothesis? The procedure
for hypothesis testing refers to all those steps that we undertake for making a choice between
the two actions, i.e., rejection and acceptance of a null hypothesis. The various steps
involved in hypothesis testing are stated below:
(i) Making a formal statement;
(ii) Selecting a significance level;
(iii) Deciding the distribution to use;
(iv) Selecting a random sample and computing an appropriate value;
(v) Calculating the probability;
(vi) Comparing the probability.
Test of hypotheses: As stated above, hypothesis testing determines the validity of the
assumption (technically described as the null hypothesis) with a view to choosing
between two conflicting hypotheses about the value of a population parameter. Hypothesis
testing helps to decide, on the basis of sample data, whether a hypothesis about the
population is likely to be true or false. Statisticians have developed several tests of
hypothesis (also known as tests of significance) for the purpose of testing hypotheses,
which can be classified as: A) parametric tests or standard tests of hypothesis; and B)
non-parametric tests or distribution-free tests of hypotheses. Important parametric
tests: The important parametric tests are: 1) z-test; 2) t-test; 3) χ² (chi-square) test;
and 4) F-test (ANOVA).
The null hypothesis and the alternative hypothesis are chosen before the sample is drawn (the
researcher must avoid the error of deriving hypotheses from the data that he collects and then
testing the hypotheses from the same data).
In the choice of the null hypothesis, the following considerations are usually kept in view:
(1) The alternative hypothesis is usually the one which one wishes to prove, and the null hypothesis is
the one which one wishes to disprove. Thus, a null hypothesis represents the hypothesis we
are trying to reject, and the alternative hypothesis represents all other possibilities.
(2) If the rejection of a certain hypothesis when it is actually true involves great risk, it is taken
as the null hypothesis, because then the probability of rejecting it when it is true is α (the level of
significance), which is chosen very small.
(3) The null hypothesis should always be a specific hypothesis, i.e., it should not state an
approximate value.
Generally, in hypothesis testing we proceed on the basis of the null hypothesis, keeping the
alternative hypothesis in view. Why so? The answer is that on the assumption that the null
hypothesis is true, one can assign probabilities to different possible sample results, but this
cannot be done if we proceed with the alternative hypothesis. Hence the use of the null hypothesis
(at times also known as the statistical hypothesis) is quite frequent.
(b) The level of significance: This is a very important concept in the context of hypothesis
testing. It is always some percentage (usually 5%) which should be chosen with great care,
thought and reason. In case we take the significance level at 5 per cent, this implies that H0
will be rejected when the sampling result (i.e., observed evidence) has a less than 0.05
probability of occurring if H0 is true. In other words, the 5 per cent level of significance means
that the researcher is willing to take as much as a 5 per cent risk of rejecting the null hypothesis
when it (H0) happens to be true. Thus the significance level is the maximum value of the
probability of rejecting H0 when it is true, and is usually determined in advance before testing the
hypothesis.
(c) Decision rule or test of hypothesis: Given a hypothesis H0 and an alternative hypothesis Ha,
we make a rule, known as a decision rule, according to which we accept H0 (i.e., reject Ha)
or reject H0 (i.e., accept Ha). For instance, if H0 is that a certain lot is good (there are very few
defective items in it) against Ha that the lot is not good (there are too many defective items in it),
then we must decide the number of items to be tested and the criterion for accepting or rejecting
the hypothesis. We might test 10 items in the lot and plan our decision saying that if there are
none or only 1 defective item among the 10, we will accept H0; otherwise we will reject H0 (i.e.,
accept Ha). This sort of basis is known as a decision rule.
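This decision rule can be sketched numerically. The short example below (a minimal illustration; the per-item defect rates 0.05 and 0.30 are assumed values, not from the text) computes, via the binomial distribution, how often the rule "accept the lot if at most 1 of 10 tested items is defective" accepts a lot:

```python
from math import comb

def prob_accept(p_defective, n=10, max_defects=1):
    """P(accept lot) = P(at most max_defects defectives among n tested items),
    assuming each item is independently defective with probability p_defective."""
    return sum(comb(n, k) * p_defective**k * (1 - p_defective)**(n - k)
               for k in range(max_defects + 1))

# A genuinely good lot (5% defectives, assumed) is accepted far more often
# than a poor lot (30% defectives, assumed) under the same decision rule.
good = prob_accept(0.05)   # ~0.914
poor = prob_accept(0.30)   # ~0.149
```

The same rule thus discriminates between good and poor lots, though imperfectly; the two error probabilities it leaves open are exactly the Type I and Type II errors discussed next.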
(d) Type I and Type II errors: In the context of testing of hypotheses, there are basically two
types of errors we can make. We may reject H0 when H0 is true, and we may accept H0 when in
fact H0 is not true. The former is known as a Type I error and the latter as a Type II error. In other
words, a Type I error means rejection of a hypothesis that should have been accepted, and a Type II
error means accepting a hypothesis that should have been rejected. Type I error is denoted
by α (alpha), known as the α-error, also called the level of significance of the test; and Type II
error is denoted by β (beta), known as the β-error. In tabular form the said two errors can be
presented as follows:

                     Accept H0              Reject H0
H0 (true)            Correct decision       Type I error (α)
H0 (false)           Type II error (β)      Correct decision
The probability of Type I error is usually determined in advance and is understood as the level of
significance of testing the hypothesis. If type I error is fixed at 5 percent, it means that there are
about 5 chances in 100 that we will reject H0 when H0 is true. We can control Type I error just by
fixing it at a lower level. For instance, if we fix it at 1 per cent, we will say that the maximum
probability of committing Type I error would only be 0.01.
But with a fixed sample size, n, when we try to reduce Type I error, the probability of
committing Type II error increases. Both types of errors cannot be reduced simultaneously.
There is a trade-off between these two types of errors, which means that the probability of
making one type of error can only be reduced if we are willing to increase the probability of
making the other type of error. To deal with this trade-off in business situations, decision-makers
decide the appropriate level of Type I error by examining the costs or penalties attached to both
types of errors. If Type I error involves the time and trouble of reworking a batch of chemicals
that should have been accepted, whereas Type II error means taking a chance that an entire group
of users of this chemical compound will be poisoned, then in such a situation one should prefer a
Type I error to a Type II error. As a result one must set very high level for Type I error in one's
testing technique of a given hypothesis. Hence, in the testing of hypothesis, one must make all
possible effort to strike an adequate balance between Type I and Type II errors.
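The trade-off can be made concrete with the lot-sampling rule from earlier. In this sketch (the "good lot" and "bad lot" defect rates are assumed values chosen only for illustration), loosening the acceptance criterion with n fixed shrinks α but inflates β:

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def alpha_beta(c, n=10, p_good=0.05, p_bad=0.30):
    """For the rule 'accept the lot if at most c of n items are defective':
    alpha = P(reject | lot is good), beta = P(accept | lot is bad)."""
    alpha = 1 - binom_cdf(c, n, p_good)
    beta = binom_cdf(c, n, p_bad)
    return alpha, beta

# With the sample size fixed, a looser acceptance rule (larger c)
# reduces the Type I error but increases the Type II error:
for c in (0, 1, 2):
    a, b = alpha_beta(c)
    print(f"c={c}: alpha={a:.4f}, beta={b:.4f}")
```

Only a larger sample size can reduce both errors at once, which is why the balance between them is struck by weighing the costs attached to each.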
(e) Two tailed and One-tailed tests: In the context of hypothesis testing, these two terms are
quite important and must be clearly understood. A two-tailed test rejects the null hypothesis if,
say, the sample mean is significantly higher or lower than the hypothesized value of the mean of
the population. Such a test is appropriate when the null hypothesis is some specified value and
the alternative hypothesis is a value not equal to the specified value of the null hypothesis.
Symbolically, the two-tailed test is appropriate when we have H0: μ = μH0 and Ha: μ ≠ μH0,
which may mean μ < μH0 or μ > μH0. Thus, in a two-tailed test, there are two rejection regions
(also known as critical regions), one on each tail of the curve, which can be illustrated as in Figure
a:
If the significance level is 5 per cent and the two-tailed test is to be applied, the probability of the
rejection area will be 0.05 (equally split on both tails of the curve as 0.025) and that of the
acceptance region will be 0.95, as shown in the curve in Fig. a. If we take μH0 = 100 and if our
sample mean deviates significantly from 100 in either direction, then we shall reject the null
hypothesis; but if the sample mean does not deviate significantly from μH0, in that case we shall
accept the null hypothesis.
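As a numerical check of this 0.95/0.05 split, here is a short sketch using the standard normal CDF (built from the error function; 1.96 is the familiar table value of the critical z for a 5% two-tailed test):

```python
from math import erf, sqrt

def std_normal_cdf(z):
    """CDF of the standard normal distribution."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# Two-tailed test at the 5% level: 0.025 rejection area in each tail.
z_crit = 1.96  # standard-table critical value for alpha = 0.05, two-tailed
acceptance = std_normal_cdf(z_crit) - std_normal_cdf(-z_crit)  # ~0.95
tail = 1 - std_normal_cdf(z_crit)                              # ~0.025
```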
But there are situations when only a one-tailed test is considered appropriate. A one-tailed test
would be used when we are to test, say, whether the population mean is either lower than or
higher than some hypothesized value. For instance, if our H0: μ = μH0 and Ha: μ < μH0, then we
are interested in what is known as a left-tailed test (wherein there is one rejection region only, on
the left tail), which can be illustrated as in Figure b:
Figure a (two-tailed test: acceptance region in the middle, rejection regions of 0.025 on each tail)
Figure b (left-tailed test: rejection region of 0.05 on the left tail)
If our μH0 = 100 and if our sample mean deviates significantly from 100 in the lower direction, we
shall reject H0; otherwise we shall accept H0 at a certain level of significance. If the significance
level in the given case is kept at 5%, then the rejection region will be equal to 0.05 of the area in the
left tail, as shown in the above curve.
In case our H0: μ = μH0 and Ha: μ > μH0, we are then interested in what is known as a one-tailed
test (right tail), and the rejection region will be on the right tail of the curve as shown below:
If our μH0 = 100 and if our sample mean deviates significantly from 100 in the upward direction,
we shall reject H0; otherwise we shall accept the same. If in the given case the significance level is
kept at 5%, then the rejection region will be equal to 0.05 of the area in the right tail, as shown
in the above curve.
It should always be remembered that accepting H0 on the basis of sample information does not
constitute proof that H0 is true. We only mean that there is no statistical evidence to reject it;
we are certainly not saying that H0 is true (although we behave as if H0 is true).
NOTE ON NORMAL DISTRIBUTION
The normal distribution is a continuous distribution and plays a very important and pivotal
role in statistical theory and practice, particularly in the area of statistical inference and
statistical quality control. Its importance is also due to the fact that, in practice, experimental
results very often seem to follow the normal distribution, or the bell-shaped
curve. The normal curve is symmetrical and is defined by its mean (μ) and its standard
deviation (σ); the mean, mode and median of the distribution have the same value.
(The curve is symmetrical about the mean, and its tails extend indefinitely.)
The normal curve is not just one curve but a family of curves. Just as the equation for a
circle describes the family of circles, some small and some big, the equation for the normal
curve describes a family of such curves which may differ only with regard to the values of
μ and σ, but have the same characteristics in all other respects.
a) All normal curves are symmetrical about the mean. This means that the number
of units in the data below the mean is the same as the number of units above the
mean. This means that the mean and the median have the same value.
b) The height of the normal curve is at its maximum at the mean value. Thus the
mean and the mode coincide. This means that the normal distribution has the
same value for mean, median and mode.
c) The height of the curve declines as we go in either direction from the mean, but
never touches the base, so that the tails of the curves on both sides of mean
extend indefinitely.
d) The first and the third quartiles are equidistant from the mean.
e) The height of the normal curve Y at any value of the random continuous variable
x is given by the following equation:
Y = (1 / (σ√(2π))) · e^(−½ · ((x − μ)/σ)²)
where:
μ = mean of the distribution
σ = standard deviation of the distribution
π = 3.1416 (approximately)
e = 2.7183 (approximately)
The normal distribution is most commonly applied in statistical quality control and is very useful
in many sociological studies. It is very close to many other distributions and can serve as a
good approximation for many discrete distributions, such as the binomial as n becomes larger.
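The density equation above can be sketched directly; in the minimal example below, the values μ = 100 and σ = 15 are illustrative only:

```python
from math import exp, pi, sqrt

def normal_pdf(x, mu, sigma):
    """Height Y of the normal curve at x, per the density equation above."""
    return (1 / (sigma * sqrt(2 * pi))) * exp(-0.5 * ((x - mu) / sigma) ** 2)

mu, sigma = 100, 15  # illustrative values
peak = normal_pdf(mu, mu, sigma)  # the maximum height occurs at the mean
# Symmetry: points equidistant from the mean have equal heights.
left, right = normal_pdf(85, mu, sigma), normal_pdf(115, mu, sigma)
```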
Take another example. The average score in an aptitude test administered at the national
level is 80. To evaluate a state's education system, the average score of 100 of the state's
students, selected on a random basis, was 75. The state wants to know if there is a significant
difference between the local scores and the national scores. In such a situation the
hypotheses may be stated as under:
Null Hypothesis H0: μ = 80
Alternative Hypothesis Ha: μ ≠ 80
The formulation of hypotheses is an important step, which must be accomplished with due
care in accordance with the object and nature of the problem under consideration. It also
indicates whether we should use a one-tailed test or a two-tailed test. If Ha is of the type
"greater than" (or of the type "lesser than"), we use a one-tailed test, but when Ha is of the type
"whether greater or smaller", then we use a two-tailed test.
(ii) Selecting a significance level: The hypotheses are tested on a pre-determined level of
significance and as such the same should be specified. Generally, in practice, either the 5% level
or the 1% level is adopted for the purpose. The factors that affect the level of significance are:
(a) the magnitude of the difference between sample means;
(b) the size of the samples;
(c) the variability of measurements within samples;
(d) whether the hypothesis is directional or non-directional (a directional hypothesis is one
which predicts the direction of the difference between, say, means).
In brief, the level of significance must be adequate in the context of the purpose and nature of
the enquiry.
(iii) Deciding the distribution to use: After deciding the level of significance, the next step in
hypothesis testing is to determine the appropriate sampling distribution. The choice generally
remains between the normal distribution and the t-distribution. The rules for selecting the
correct distribution are similar to those that we have stated earlier in the context of
estimation.
(iv) Selecting a random sample and computing an appropriate value: Another step is to
select a random sample(s) and compute an appropriate value from the sample data
concerning the test statistic utilizing the relevant distribution. In other words, draw a sample
to furnish empirical data.
(v) Calculation of the probability: One has then to calculate the probability that the sample
result would diverge as widely as it has from expectations, if the null hypothesis were in fact
true.
(vi) Comparing the probability: Yet another step consists in comparing the probability thus
calculated with the specified value of α, the significance level. If the calculated probability
is equal to or smaller than the α value in the case of a one-tailed test (and α/2 in the case of a
two-tailed test), then reject the null hypothesis (i.e., accept the alternative hypothesis), but if the
calculated probability is greater, then accept the null hypothesis. In case we reject H0, we
run a risk of (at most α, the level of significance) committing a Type I error, but if we
accept H0, then we run some risk (the size of which cannot be specified so long as H0
happens to be vague rather than specific) of committing a Type II error.
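The six steps can be traced end-to-end on the aptitude-test example above (H0: μ = 80, n = 100, sample mean 75). The text does not give a population standard deviation, so the value σ = 18 below is purely an assumed figure for illustration:

```python
from math import erf, sqrt

def std_normal_cdf(z):
    """CDF of the standard normal distribution."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# (i) Formal statement: H0: mu = 80, Ha: mu != 80 (hence a two-tailed test).
mu0 = 80
# (ii) Significance level chosen in advance.
alpha = 0.05
# (iii) Distribution: normal (large sample, n = 100).
n, sample_mean = 100, 75
sigma = 18  # ASSUMED population standard deviation, for illustration only
# (iv) Compute the test statistic from the sample data.
z = (sample_mean - mu0) / (sigma / sqrt(n))   # ~ -2.78
# (v) Probability of a deviation at least this wide, if H0 were true.
p_value = 2 * std_normal_cdf(-abs(z))
# (vi) Compare the probability with alpha: reject H0 if p_value <= alpha.
reject_h0 = p_value <= alpha
```

Under this assumed σ, the sample mean of 75 lies far out in the tails, so H0 would be rejected at the 5% level; with a larger assumed σ the same difference might not be significant.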
FLOW DIAGRAM FOR HYPOTHESIS TESTING
The above-stated general procedure for hypothesis testing can also be depicted in the form of a
flow chart for better understanding.
Tests of Hypotheses
Hypothesis testing helps to decide on the basis of a sample data, whether a hypothesis about the
population is likely to be true or false. Statisticians have developed several tests of hypotheses
(also known as the tests of significance) for the purpose of testing of hypotheses which can be
classified as:
(a) Parametric tests or standard tests of hypotheses
(b) Non-parametric tests or distribution-free test of hypotheses.
Parametric tests usually assume certain properties of the parent population from which we draw
samples. Assumptions such as that the observations come from a normal population, that the
sample size is large, or assumptions about population parameters such as the mean and
variance must hold good before parametric tests can be used. But there are situations when the
researcher cannot, or does not want to, make such assumptions. In such situations we use
statistical methods for testing
hypotheses, which are called non-parametric tests because such tests do not depend on any
assumption about the parameters of the parent population. Besides, most non-parametric tests
assume only nominal or ordinal data, whereas parametric tests require measurement equivalent
to at least an interval scale. As a result, non-parametric tests need more observations than
parametric tests to achieve the same size of Type I and Type II errors.
(1) z-test
z-test is based on the normal distribution and is an appropriate test for judging the significance
of a sample mean when the population variance is known. The z-test is also used for comparing
the sample proportion to a theoretical value of the population proportion, or for judging the
difference in proportions of two independent samples when n happens to be large. Besides, this
test may be used for judging the significance of the median, mode, coefficient of correlation and
several other measures.
(2) t-test
t-test is based on the t-distribution and is considered an appropriate test for judging the
significance of a sample mean, or of the difference between the means of two samples,
in the case of small sample(s) when the population variance is not known (in which case we use
the variance of the sample as an estimate of the population variance). In case two samples are
related, we use the paired t-test (or what is known as the difference test) for judging the
significance of the mean of the differences between the two related samples. It can also be used
for judging the significance of the coefficients of simple and partial correlations. The relevant
test statistic, t, is calculated from the sample data and then compared with its probable value
based on the t-distribution (to be read from the table that gives probable values of t for different
levels of significance and different degrees of freedom) at a specified level of significance for the
concerning degrees of freedom, for accepting or rejecting the null hypothesis. It may be noted
that the t-test applies only in the case of small sample(s) (say, n less than 30) when the
population variance is unknown.
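A one-sample t-test along these lines can be sketched as follows. The ten sample values are invented for illustration, and 2.262 is the standard two-tailed t-table value for 9 degrees of freedom at the 5% level:

```python
from math import sqrt
from statistics import mean, stdev  # stdev uses the (n - 1) sample formula

# Hypothetical measurements; H0: the population mean equals 12.0.
sample = [12.1, 11.8, 12.4, 12.3, 11.9, 12.0, 12.5, 12.2, 11.7, 12.1]
mu0 = 12.0

n = len(sample)
# t = (sample mean - hypothesized mean) / (sample s.d. / sqrt(n))
t = (mean(sample) - mu0) / (stdev(sample) / sqrt(n))   # ~1.22

# Compare |t| with the table value for n - 1 = 9 degrees of freedom.
t_table = 2.262  # two-tailed, 5% level, 9 d.f.
reject_h0 = abs(t) > t_table   # here False: H0 is not rejected
```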
(4) F-test.
22
Prof.M.Guruprasad
F-test is based on the F-distribution and is used to compare the variances of two independent
samples. This test is also used in the context of analysis of variance (ANOVA) for judging the
significance of more than two sample means at one and the same time. It is also used for judging
the significance of multiple correlation coefficients. Test statistic, F, is calculated and compared
with its probable value (to be seen in the F-ratio tables for different degrees of freedom for
greater and smaller variances at a specified level of significance) for accepting or rejecting the
null hypothesis.
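A minimal variance-ratio sketch of this test follows (both samples are invented; 3.18 is, approximately, the F-table value for (9, 9) degrees of freedom at the 5% level, quoted here as an assumed lookup):

```python
from statistics import variance  # sample variance, (n - 1) denominator

# Two hypothetical independent samples; H0: the population variances are equal.
sample_a = [10, 12, 11, 13, 9, 14, 10, 12, 11, 13]
sample_b = [11, 12, 11, 12, 11, 12, 11, 12, 11, 12]

# Put the larger variance in the numerator so that F >= 1.
var_a, var_b = variance(sample_a), variance(sample_b)
f_stat = max(var_a, var_b) / min(var_a, var_b)

f_table = 3.18  # assumed F-table value for (9, 9) d.f. at the 5% level
reject_h0 = f_stat > f_table   # here True: the variances differ significantly
```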
All these tests are based on the assumption of normality i.e., the source of data is considered to
be normally distributed. In some cases the population may not be normally distributed, yet the
tests will be applicable on account of the fact that we mostly deal with samples and the sampling
distributions closely approach normal distributions.
As discussed in Chapter 13, the process of converting information from a questionnaire
so it can be read by a computer is referred to as data preparation. This process normally
follows a five-step approach, beginning with data validation, then editing and coding of the
data, followed by data entry, error detection, and data tabulation. The purpose of the data
preparation process is to take data in its raw form and convert it so as to establish meaning
and create value for the user.
Once the researcher has formed the hypotheses and calculated the means of the groups,
the next step is to actually analyze the relationships of the sample data.
The reader should note that most of these techniques are used to test for statistical
significance between different variables within the research study (say, to verify the
researcher's proposed hypothesis). For example, consider the hypothesis: "There is no
association between owning a PC and the frequency of visiting a cyber café." From the data
collected in the field, the researcher can verify whether such a relationship exists using
techniques such as the cross-tabulation method, correlation techniques, or any other relevant
technique.
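The cyber-café hypothesis can be checked with a chi-square test on a cross-tabulation. The counts below are entirely hypothetical, and 3.841 is the standard chi-square table value for 1 degree of freedom at the 5% level:

```python
# Hypothetical 2x2 cross-tabulation:
# rows = PC owner / non-owner; columns = frequent / infrequent cafe visits
observed = [[40, 10],
            [20, 30]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count under H0 (no association) = row total * column total / N;
# chi-square = sum of (observed - expected)^2 / expected over all cells.
chi2 = sum((observed[i][j] - row_totals[i] * col_totals[j] / grand_total) ** 2
           / (row_totals[i] * col_totals[j] / grand_total)
           for i in range(2) for j in range(2))

chi2_table = 3.841  # 1 d.f., 5% level
reject_h0 = chi2 > chi2_table   # True here: an association is indicated
```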
Relationships between variables can be described in several ways, including presence, direction,
strength of association, and type. The first issue, and probably the most obvious, is whether two or more
variables are related at all. If a systematic relationship exists between two or more variables, this is
referred to as the presence of a relationship. To measure whether a relationship is present, we rely on the
concept of statistical significance. If we test for statistical significance and find that it exists, then we say
that a relationship is present. Stated another way, we say that knowledge about the behavior of one
variable allows us to make a useful prediction about the behavior of another.
If a relationship is present between two variables, it is important to know the direction. The
direction of a relationship can be either positive or negative. An understanding of the
strength of association also is important. We generally categorize the strength of association
as nonexistent, weak, moderate, or strong. If a consistent and systematic relationship is not
present, then the strength of association is nonexistent. A weak association means there is
25
Prof.M.Guruprasad
a low probability of the variables having a relationship. A strong association means there is
a high probability a consistent and systematic relationship exists.
Another concept that is important to understand is the type of relationship. If we say that
two variables can be described as related, then we would pose this as a question:
“What is the nature of the relationship?” How can the link between Y and X best be
described? There are a number of different ways in which two variables can share a
relationship. Variables Y and X can have a linear relationship, which means that the strength
and nature of the relationship between them remains the same over the range of both
variables, and can best be described using a straight line. Conversely, Y and X could have a
curvilinear relationship, which would mean that the strength and/or direction of their
relationship changes over the range of both variables (perhaps Y’s relationship with X first
gets stronger as X increases, but then gets weaker as the value of X continues to increase).
In the above discussion, we assumed a relationship between just two variables. In reality, market forces consist of multiple factors impacting the business; statistically speaking, the real market situation requires multivariate analysis.
Accordingly, for the purpose of analysis, statistical techniques are broadly divided into univariate analysis, bivariate analysis and multivariate analysis.
Univariate analysis
Univariate analysis involves describing the distribution of a single variable, including its
central tendency (including the mean, median, and mode) and dispersion (including the
range and quantiles of the data-set, and measures of spread such as the variance and
standard deviation). The shape of the distribution may also be described via indices such as
skewness and kurtosis. Characteristics of a variable's distribution may also be depicted in
graphical or tabular format, including histograms and stem-and-leaf displays.
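As an illustration, the univariate measures listed above can be computed with a few lines of Python using only the standard library (the age values here are invented for the sketch):

```python
# Univariate descriptives for a single variable (hypothetical ages).
import statistics as st

ages = [23, 25, 25, 28, 31, 34, 35, 35, 35, 42]

mean   = st.mean(ages)            # central tendency
median = st.median(ages)
mode   = st.mode(ages)
rng    = max(ages) - min(ages)    # dispersion: range
var    = st.pvariance(ages)       # population variance
sd     = st.pstdev(ages)          # population standard deviation

# Skewness (population formula): mean of cubed standard scores
skew = sum(((x - mean) / sd) ** 3 for x in ages) / len(ages)

print(mean, median, mode, rng, round(sd, 2), round(skew, 2))
```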
Bivariate analysis
When a sample consists of more than one variable, descriptive statistics may be used to
describe the relationship between pairs of variables.
The main reason for differentiating univariate and bivariate analysis is that bivariate
analysis is not merely descriptive: it describes the relationship
between two different variables. Quantitative measures of dependence include correlation
(such as Pearson's r when both variables are continuous, or Spearman's rho if one or both
are not) and covariance (which reflects the scale on which the variables are measured). The slope, in
regression analysis, also reflects the relationship between variables. The slope indicates the
unit change in the criterion variable for a one unit change in the predictor.
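The bivariate measures just described (covariance, Pearson's r, and the regression slope) can be sketched as follows; the advertising and sales figures are hypothetical, chosen to lie on a perfect line so the results are easy to check:

```python
# Covariance, Pearson's r, and the regression slope from first principles.
import statistics as st

ad_spend = [1, 2, 3, 4, 5]          # predictor X (hypothetical figures)
sales    = [3, 5, 7, 9, 11]         # criterion Y (perfectly linear here)

n = len(ad_spend)
mx, my = st.mean(ad_spend), st.mean(sales)

# Sample covariance: averaged cross-products of deviations
cov = sum((x - mx) * (y - my) for x, y in zip(ad_spend, sales)) / (n - 1)
r = cov / (st.stdev(ad_spend) * st.stdev(sales))   # scale-free correlation
slope = cov / st.variance(ad_spend)  # unit change in Y per unit change in X

print(cov, r, slope)
```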
Multivariate analysis
Multivariate analysis involves more than two variables. There are many different models, each with its own type of analysis:
Multivariate regression analysis attempts to determine a formula that can describe how
elements in a vector of variables respond simultaneously to changes in others. For linear
relations, regression analyses here are based on forms of the general linear model.
Principal components analysis (PCA) creates a new set of orthogonal variables that
contain the same information as the original set. It rotates the axes of variation to give a
new set of orthogonal axes, ordered so that they summarize decreasing proportions of the
variation.
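A minimal sketch of PCA as described above: centre the data, then rotate it onto the eigenvectors of its covariance matrix, ordered by decreasing variance (the two correlated variables are simulated):

```python
# PCA via eigendecomposition of the covariance matrix (simulated data).
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
# Second variable is strongly correlated with the first
data = np.column_stack([x, 0.8 * x + 0.3 * rng.normal(size=200)])

centred = data - data.mean(axis=0)
cov = np.cov(centred, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]           # largest variance first
components = centred @ eigvecs[:, order]    # rotate onto the new orthogonal axes

# The new axes are uncorrelated and preserve the total variance
print(np.var(components, axis=0, ddof=1))
```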
Factor analysis is similar to PCA but allows the user to extract a specified number of
synthetic variables, fewer than the original set, leaving the remaining unexplained variation
as error. The extracted variables are known as latent variables or factors; each one may be
supposed to account for covariation in a group of observed variables.
Canonical correlation analysis finds linear relationships among two sets of variables; it is
the generalised (i.e. canonical) version of bivariate correlation.
Clustering systems assign objects into groups (called clusters) so that objects (cases)
from the same cluster are more similar to each other than objects from different clusters.
With the advancement of technology, most applications of these complicated statistical
techniques are carried out through menu-driven, user-friendly statistical software
packages. Prominent among these is SPSS, originally named the Statistical Package
for the Social Sciences.
SPSS
SPSS is a widely used program for statistical analysis in social science. It is also used by market
researchers, health researchers, survey companies, government, education researchers, marketing
organizations, data miners, and others. The original SPSS manual (Nie, Bent & Hull, 1970) has
been described as one of "sociology's most influential books" for allowing ordinary researchers
to do their own statistical analysis. In addition to statistical analysis, data management (case
selection, file reshaping, creating derived data) and data documentation (a metadata dictionary is
stored in the datafile) are features of the base software.
SPSS is a comprehensive and flexible statistical analysis and data management solution. SPSS
can take data from almost any type of file and use them to generate tabulated reports, charts, and
plots of distributions and trends, descriptive statistics, and conduct complex statistical analyses.
SPSS is available on several platforms: Windows, Macintosh, and UNIX systems.
SPSS Statistics is a software package used for statistical analysis. Long produced by SPSS Inc.,
it was acquired by IBM in 2009. The current versions (2014) are officially named IBM SPSS
Statistics. Companion products in the same family are used for survey authoring and deployment
(IBM SPSS Data Collection), data mining (IBM SPSS Modeler), text analytics, and
collaboration and deployment (batch and automated scoring services).
The software was originally named the Statistical Package for the Social Sciences (SPSS).[2]
Later the expansion of the acronym was changed to Statistical Product and Service Solutions to
reflect the growing diversity of the userbase. Today, the IBM SPSS website makes no mention of
an official expansion of the acronym.
Now let us discuss briefly some of the statistical techniques and how they are applied for
analysis.
Simple and Cross Tabulation
Two different questions in a questionnaire may represent two variables, and if we count these
two together, this is called a cross-tabulation. An example could be “10 people from Income
Group 1 said they liked Brand A”. Here, the two variables are “INCOME GROUP” and
“LIKING FOR BRANDS A TO E”, measured separately in two different questions on the
questionnaire.
Simple and Cross tabulation is a very useful form of analysis for all nominally and ordinally
scaled variables. For these two scales, calculations such as average (mean) and standard
deviation are not permitted. Therefore, frequency and percentages are used to analyse such
variables.
Demographic Variables
1. Many demographic variables such as age, location, income, occupation, sex, education are
generally independent variables for the purposes of most marketing studies. This is because
other variables “depend” on them.
2. Attitude towards a brand, or the brand purchased, or intention to buy, are usually treated as
dependent variables in many marketing studies. For a marketing researcher, these variables or
similar ones, are the real variables of interest, as they help in arriving at strategies for increasing
sales or market share.
3. The other major types of independent variables are the elements of the four ‘P’s of marketing.
The marketing effort of a company can be measured in terms of its promotional efforts, price
variations and distribution changes. It can also be gauged from new product launches, or
repositioning or repackaging of existing brands.
4. Therefore, we could measure sales as the dependent variable with any of the marketing ‘P’s as
independent variables.
In a questionnaire-based survey, the first stage of analysis is called simple tabulation. This
consists of every question being treated separately and tabulated. For every question, the number
of responses in each category of answers is counted. Assuming the sample size is 500, and all
500 have answered the question, the simple tabulation of the respondents' gender may look like
the following –
1. Male - 300
2. Female - 200
-----
Total 500
-----
The simple tabulation for another question on the questionnaire may look like this –
1. Regular Users of Brand X -- 200
2. Occasional Users of Brand X -- 150
3. Non-users of Brand X -- 150
-----
Total 500
-----
A title can be included for each table, and on the top of each column, to explain the variable
name through a label. For example, the above simple table can be titled Frequency of Usage, or
Number of Users and Non-users of Brand X.
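The simple tabulation above can be reproduced programmatically; the responses are simulated to match the counts shown:

```python
# Simple tabulation of a single question (responses simulated to match
# the gender table above: 300 male, 200 female).
from collections import Counter

responses = ["Male"] * 300 + ["Female"] * 200

table = Counter(responses)
total = sum(table.values())
for label, count in table.items():
    print(f"{label:<8} {count:>4} ({count / total:.0%})")
print(f"{'Total':<8} {total:>4}")
```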
After the simple frequency and percentage tabulation for every question on the questionnaire
comes the second stage – the cross tabulations. A cross-tabulation can be done by combining
any two of the questions and tabulating the data together. This is a 2-variable cross tabulation.
An example could be a cross-tabulation between Brand Preference for brands of tea and Region
to which Respondent belongs. Assuming we have the data on these two variables from a study,
the cross tabulation may look like this –
BRAND          Regionwise Buyers (No.)
               North   South   East   West   Total
BrookeBond       25      20     20     15      80
Lipton           10      15     20      5      50
Tata             15      15     10     30      70
Total            50      50     50     50     200
This is a cross-tabulation of two variables. An extension of this could be adding percentages.
All these percentages can be displayed in a table form separately, or in brackets along with
number of respondents. The table of percentages along with numbers will look like this –
BRAND          Regionwise Buyers - Numbers and Percentages
               North       South       East        West        Total
BrookeBond     25 (50%)    20 (40%)    20 (40%)    15 (30%)    80 (40%)
Lipton         10 (20%)    15 (30%)    20 (40%)     5 (10%)    50 (25%)
Tata           15 (30%)    15 (30%)    10 (20%)    30 (60%)    70 (35%)
Total          50 (100%)   50 (100%)   50 (100%)   50 (100%)   200 (100%)
The above table can be interpreted according to the column (region) we are looking at. The first
four columns represent findings for each region, and the fifth column (Total) represents overall
findings for all the regions on an average. For example, from column 4, 30% of buyers in the
west prefer Brooke Bond, 10% Lipton, and 60% prefer Tata tea. From column 5, out of the total
200 respondents, across all regions, 40% prefer Brooke Bond, 25% Lipton and 35% Tata tea.
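The cross-tabulation and its column percentages can be rebuilt as follows, using the same figures as the table above:

```python
# Cross-tabulation of brand by region with column percentages
# (figures taken from the table above).
counts = {
    "BrookeBond": {"North": 25, "South": 20, "East": 20, "West": 15},
    "Lipton":     {"North": 10, "South": 15, "East": 20, "West": 5},
    "Tata":       {"North": 15, "South": 15, "East": 10, "West": 30},
}
regions = ["North", "South", "East", "West"]

# Column totals: buyers per region across all brands
col_totals = {r: sum(counts[b][r] for b in counts) for r in regions}

for brand, row in counts.items():
    cells = {r: f"{row[r]} ({row[r] / col_totals[r]:.0%})" for r in regions}
    print(brand, cells, "total:", sum(row.values()))
```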
Lack of Causal Inference in Cross Tabulations
It must be mentioned here that any two variables can be cross-tabulated. Even if the cross-
tabulation shows a significant association between the two variables, it does not necessarily
mean that one of them (the independent) causes the other (the dependent). Causality or direct
effect is more of an assumption made by the researcher based on his expectation or experience.
The mere existence of a statistically significant association does not necessarily imply a cause-
and-effect relationship between the (presumed) independent and the (presumed) dependent
variable.
z-test
z-test is based on the normal probability distribution and is used for judging the significance of
several statistical measures, particularly the mean. The relevant test statistic*, z, is worked out
and compared with its probable value (to be read from table showing area under normal curve) at
a specified level of significance for judging the significance of the measure concerned. This is a
most frequently used test in research studies. This test is used even when binomial distribution or
t-distribution is applicable on the presumption that such a distribution tends to approximate
normal distribution as ‘n’ becomes larger. z-test is generally used for comparing the mean of a
sample to some hypothesised mean for the population in case of large sample, or when
population variance is known. z-test is also used for judging he significance of difference
between means of two independent samples in case of large samples, or when population
variance is known. z-test is also used for comparing the sample proportion to a theoretical value
of population proportion or for judging the difference in proportions of two independent samples
when n happens to be large. Besides, this test may be used for judging the significance of
median, mode, coefficient of correlation and several other measures.
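A worked one-sample z-test along these lines, with invented figures for the hypothesised mean, the known population standard deviation, and the sample result:

```python
# One-sample z-test for a mean (all figures hypothetical):
# H0: population mean = 50, known population sd = 10, n = 100, sample mean = 52.
import math

pop_mean, pop_sd, n, sample_mean = 50, 10, 100, 52

# Test statistic: difference from the hypothesised mean / standard error
z = (sample_mean - pop_mean) / (pop_sd / math.sqrt(n))

# Compare with the normal-table value at the 5% level (two-tailed)
critical = 1.96
print(z, abs(z) > critical)   # H0 rejected when |z| exceeds the critical value
```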
t-test
The t-test is especially useful when the sample size is small (n < 30) and when the population
standard deviation is unknown. Unlike the univariate test, however, we assume that the samples
are drawn from populations with normal distributions and that the variances of the populations
are equal. Essentially, the t-test for differences between group means can be conceptualized as
the difference between the means divided by the variability of random means. The t value is a
ratio of the difference between the two sample means and the standard error. The t-test tries to
provide a rational way of determining if the difference between the two sample means occurred
by chance.
t-test is based on the t-distribution and is considered an appropriate test for judging the significance
of a sample mean, or for judging the significance of the difference between the means of two samples
in case of small sample(s) when population variance is not known (in which case we use the
variance of the sample as an estimate of the population variance). In case two samples are
related, we use the paired t-test (or what is known as the difference test) for judging the significance of the mean of
differences between the two related samples. It can also be used for judging the significance of the
coefficients of simple and partial correlations. The relevant test statistic, t, is calculated from the
sample data and then compared with its probable value based on the t-distribution (to be read from
the table that gives probable values of t for different levels of significance for different degrees of
freedom) at a specified level of significance for the concerning degrees of freedom, for accepting or
rejecting the null hypothesis. It may be noted that the t-test applies only in case of small sample(s)
when population variance is unknown.
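A small worked example of the pooled-variance t-test for two independent samples; the ratings and groups are invented for the sketch:

```python
# Independent-samples t-test with pooled variance (hypothetical small samples).
import math
import statistics as st

a = [4, 5, 6]   # e.g. ratings from group A
b = [1, 2, 3]   # ratings from group B

na, nb = len(a), len(b)
# Pool the two sample variances, weighting by degrees of freedom
pooled_var = ((na - 1) * st.variance(a) + (nb - 1) * st.variance(b)) / (na + nb - 2)
se = math.sqrt(pooled_var * (1 / na + 1 / nb))     # standard error of the difference
t = (st.mean(a) - st.mean(b)) / se                 # difference in means / standard error

critical = 2.776        # t-table value, df = 4, 5% level, two-tailed
print(round(t, 3), abs(t) > critical)
```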
Let us assume that we have conducted a consumer survey of a brand of detergent. One of the
questions dealt with the income category of the respondent, and another asked the
respondent to rate his purchase intention.
These two variables were cross-tabulated from a sample of say 20 respondents for the sake of
this illustration. A cross-tabulation with a chi-squared test was requested from the computer
package.
The question asked was: Is there a significant association between respondent income and
purchase intention? The chi-squared test basically answers this question.
Chi-square analysis permits us to test for significance between the frequency distributions of
two or more groups, say, males versus females. Categorical data from questions about sex,
education, or other nominal variables can be examined to provide tests of hypotheses of interest.
Chi-square analysis compares the observed frequencies of the responses with the expected
frequencies, which are based on our ideas about the population distribution or our predicted
proportions. This statistic tests whether or not the observed data are distributed the way we
would expect them to be. It does this by comparing the observed frequencies with the expected
frequencies.
The chi-square test is an important test amongst the several tests of significance developed by
statisticians. Chi-square (pronounced "ki-square") is a statistical measure used in the context
of sampling analysis for comparing a variance to a theoretical variance. It "can be used to
determine if categorical data shows dependency or the two classifications are independent. It can
also be used to make comparisons between theoretical populations and actual data when
categories are used."1 Thus, the chi-square test is applicable in a
large number of problems. The test is, in fact, a technique through the use of which it is possible
for all researchers to (i) test the goodness of fit; (ii) test the significance of association between
two attributes, and (iii) test the homogeneity or the significance of population variance.
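The chi-square statistic for the brand-by-region table shown earlier can be computed from scratch by comparing observed frequencies with the expected frequencies derived from the row and column totals:

```python
# Chi-square test of association on the brand-by-region table above.
observed = {
    "BrookeBond": [25, 20, 20, 15],
    "Lipton":     [10, 15, 20, 5],
    "Tata":       [15, 15, 10, 30],
}

row_totals = {b: sum(r) for b, r in observed.items()}
col_totals = [sum(r[j] for r in observed.values()) for j in range(4)]
n = sum(row_totals.values())

chi2 = 0.0
for brand, row in observed.items():
    for j, obs in enumerate(row):
        expected = row_totals[brand] * col_totals[j] / n   # row x col / grand total
        chi2 += (obs - expected) ** 2 / expected

df = (len(observed) - 1) * (4 - 1)     # (rows-1)(cols-1) = 6
critical = 12.592                      # chi-square table, df = 6, 5% level
print(round(chi2, 2), df, chi2 > critical)
```

Since the computed statistic exceeds the table value at the 5% level, the null hypothesis of no association between brand preference and region would be rejected for these data.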
Correlation
1. Correlation and regression are generally performed together. The application of correlation
analysis is to measure the degree of association between two sets of quantitative data. The
correlation coefficient measures this association; its value ranges from -1 (perfect negative
correlation) through 0 (no correlation) to +1 (perfect positive correlation).
2. For example, how are sales of product A correlated with sales of product B? Or, how is the
advertising expenditure correlated with other promotional expenditure? Or, are daily ice cream
sales correlated with daily maximum temperature?
3. Correlation does not necessarily imply a causal effect. Given any two series of
numbers, there will be some correlation between them; it does not mean that one variable is
causing a change in the other, or is dependent upon the other.
Regression
1. The main objective of regression analysis is to explain the variation in one variable (called the
dependent variable), based on the variation in one or more other variables (called the
independent variables).
2. The applications areas are in ‘explaining’ variations in sales of a product based on advertising
expenses, or number of sales people, or number of sales offices, or on all the above variables.
3. If there is only one dependent variable and one independent variable is used to explain the
variation in it, then the model is known as a simple regression.
4. If multiple independent variables are used to explain the variation in a dependent variable, it is
called a multiple regression model.
As seen from the preceding discussion, the major application of regression analysis in marketing
is in the area of sales forecasting, based on some independent (or explanatory) variables. This
does not mean that regression analysis is the only technique used in sales forecasting. There are a
variety of quantitative and qualitative methods used in sales forecasting, and regression is only
one of the better known (and often used) quantitative techniques.
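A minimal simple-regression sketch along these lines, fitting sales against advertising expense by ordinary least squares (the figures are hypothetical):

```python
# Simple regression: 'explaining' sales from advertising expense
# (hypothetical figures).
import statistics as st

adv   = [2, 4, 6, 8, 10]
sales = [25, 35, 50, 60, 80]

mx, my = st.mean(adv), st.mean(sales)
# Least-squares slope: sum of cross-products / sum of squared X deviations
slope = (sum((x - mx) * (y - my) for x, y in zip(adv, sales))
         / sum((x - mx) ** 2 for x in adv))
intercept = my - slope * mx

def predict(x):
    return intercept + slope * x   # forecast for a new advertising level

print(slope, intercept, predict(12))
```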
ANOVA (Analysis of Variance)
When the means of more than two populations are to be compared, it would be impractical to test all
possible combinations of two populations at a time, for that would require a great number of tests
before we would be able to arrive at a decision. This would also consume a lot of time and money,
and even then certain relationships may be left unidentified (particularly the interaction effects).
Therefore, one quite often utilizes the ANOVA technique and through it investigates the
differences among the means of all the populations simultaneously.
Multiple dependent variables can be analyzed together using a related procedure called
multivariate analysis of variance (MANOVA). The objective in MANOVA is identical to that in
ANOVA—to examine group differences in means—only the comparisons are considered for a
group of dependent variables.
An example of an ANOVA problem may be to compare light, medium, and heavy drinkers of
Starbucks coffee on their attitude toward a particular Starbucks advertising campaign. In this
instance there is one independent variable—consumption of Starbucks coffee—but it is divided
into three different levels. Our earlier t statistics won’t work here, since we have more than two
groups to compare. ANOVA requires that the dependent variable, in this case the attitude toward
the Starbucks advertising campaign, be metric. That is, the dependent variable must be either
interval or ratio scaled. A second data requirement is that the independent variable, in this case
the coffee consumption variable, be categorical. The null hypothesis for ANOVA always states
that there is no difference between the ad campaign attitudes of the groups of Starbucks coffee
drinkers. In specific terminology, the null hypothesis would be μ1 = μ2 = μ3
FACTOR ANALYSIS:
1. Factor Analysis is a set of techniques used for understanding variables by grouping them into
“factors” consisting of similar variables
2. It can also be used to confirm whether a hypothesized set of variables groups into a factor or
not
3. It is most useful when a large number of variables needs to be reduced to a smaller set of
“factors” that contain most of the variance of the original variables
Suppose that a two-wheeler manufacturer is interested in determining which variables his
potential customers think about when they consider his product.
4. Let us assume that twenty two-wheeler owners were surveyed by this manufacturer (or by a
marketing research company on his behalf). They were asked to indicate on a seven-point scale
Thus Factor analysis is used to summarize the information contained in a large number of
variables into a smaller number of subsets called factors.
CLUSTER ANALYSIS
Cluster analysis is used to classify respondents or objects (e.g., products, stores) into groups that
are homogeneous, or similar within the groups but different between the groups.
As the name implies, the basic purpose of cluster analysis is to classify or segment objects (e.g.,
customers, products, market areas) into groups so that objects within each group are similar to
one
another on a variety of variables. Cluster analysis seeks to classify segments or objects such that
there will be as much likeness within segments and as much difference between segments as
possible. Thus, this method strives to identify natural groupings or segments among many
variables without designating any of the variables as a dependent variable. Let us discuss the
application of cluster analysis with an example. Suppose a fast-food chain wants to open an
eat-in restaurant in a new, growing suburb of a major metropolitan area. Marketing researchers
surveyed a large sample of households in this suburb and collected data on characteristics such
as demographics, lifestyles, and expenditures on eating out. The fast-food chain wants to identify
one or more household segments that are likely to visit its new restaurant. Once this segment is
identified, the firm’s advertising and services would be tailored to them. A target segment can be
identified for the company by conducting a cluster analysis of the data it has gathered. The
results of the cluster analysis will identify segments, each of which contains households with
similar characteristics and differs considerably from the other segments.
Let us suppose that our research identifies four potential clusters or segments for our fast-food
chain. As our intuitive example illustrates, this growing suburb contains households that seldom
visit restaurants at all (cluster 1), households that tend to frequent regular restaurants (with table
service) exclusively (cluster 2), households that tend to frequent fast-food restaurants exclusively
(cluster 3), and households that frequent both regular and fast-food restaurants (cluster 4). By
examining the characteristics associated with each of the clusters, management can decide which
clusters to target and how best to reach them through marketing communications.
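A minimal k-means sketch of this idea, clustering households on two spending measures; the two simulated segments and all spend figures are invented:

```python
# A minimal k-means sketch: household spend on fast food vs. table-service
# restaurants (all figures simulated).
import numpy as np

rng = np.random.default_rng(1)
seg_a = rng.normal([2, 2], 0.3, size=(20, 2))     # low spend on both
seg_b = rng.normal([8, 8], 0.3, size=(20, 2))     # high spend on both
data = np.vstack([seg_a, seg_b])

centroids = data[[0, -1]].copy()                  # crude initialisation
for _ in range(10):
    # Assign each household to its nearest centroid, then recompute centroids
    dist = np.linalg.norm(data[:, None] - centroids[None], axis=2)
    labels = dist.argmin(axis=1)
    centroids = np.array([data[labels == k].mean(axis=0) for k in range(2)])

print(centroids.round(1))
```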
Discriminant Analysis
Discriminant analysis is a multivariate technique used for predicting group membership on the
basis of two or more independent variables. There are many situations where the marketing
researcher’s purpose is to classify objects or groups by a set of independent variables. In
marketing, consumers are often categorized on the basis of heavy versus light users of a product,
or viewers versus nonviewers of a media vehicle such as a television commercial.
1. The major application area for this technique is where we want to be able to distinguish
between two or three sets of objects or people, based on the knowledge of some of their
characteristics.
2. Examples include the selection process for a job, the admission process of an educational
programme in a college, or dividing a group of people into potential buyers and non-buyers.
3. Discriminant analysis can be, and is in fact used, by credit rating agencies to rate individuals,
to classify them into good lending risks or bad lending risks. The detailed example discussed
later tells you how to do that.
4. To summarise, we can use linear discriminant analysis when we have to classify objects into
two or more groups based on the knowledge of some variables (characteristics) related to them.
Typically, these groups would be users-non-users, potentially successful salesman – potentially
unsuccessful salesman, high risk – low risk consumer, or on similar lines.
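A sketch of two-group linear (Fisher) discriminant classification for the lending-risk example; the income and debt scores below are invented for illustration:

```python
# Two-group linear discriminant sketch: classify credit applicants as good or
# bad lending risks from two indicators (hypothetical income and debt scores).
import numpy as np

good = np.array([[7.0, 2.0], [8.0, 3.0], [9.0, 2.5], [7.5, 1.5]])
bad  = np.array([[3.0, 6.0], [2.5, 7.0], [4.0, 6.5], [3.5, 8.0]])

mean_g, mean_b = good.mean(axis=0), bad.mean(axis=0)
# Pool the within-group covariance matrices, weighted by degrees of freedom
pooled = (np.cov(good, rowvar=False) * (len(good) - 1) +
          np.cov(bad, rowvar=False) * (len(bad) - 1)) / (len(good) + len(bad) - 2)

w = np.linalg.solve(pooled, mean_g - mean_b)     # discriminant weights
cutoff = w @ (mean_g + mean_b) / 2               # midpoint between group means

def classify(x):
    return "good risk" if w @ x > cutoff else "bad risk"

print(classify([8.0, 2.0]), classify([3.0, 7.0]))
```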
Conjoint Analysis
Conjoint analysis is a multivariate technique which estimates the relative importance consumers
place on the different attributes of a product or service, as well as the utilities or value they attach
to the various levels of each attribute. This dependence method assumes that consumers choose
or form preferences for products by evaluating the overall utility or value of the product. This
value is composed of the individual utilities of each product feature or attribute. Conjoint
analysis tries to estimate the product attribute importance weights that would best match the
consumer’s indicated product choice or preference.
1. Marketing managers frequently want to know what utility a particular product feature or
service feature will have for a consumer.
2. Conjoint analysis is a multivariate technique that captures the exact levels of utility that an
individual customer puts on various attributes of the product offering. It enables a direct
comparison between, say, the utility of a price level of Rs. 400 versus Rs. 500, a delivery period
of 1 week versus 2 weeks, or an after-sales response of 24 hours versus 48 hours.
3. Once we know the utility levels for every attribute (and at every level), we can combine these to
find the combination of attributes that gives the customer the highest utility, the second-best
combination that gives the second-highest utility, and so on.
4. This information can be used to design a product or service offering.
5. If this is done across a sample of customers say, segment-wise, it can also be used to predict
market-share, and the response of customers to changes in the competitive strategy through
changes in the marketing elements.
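A toy sketch of the conjoint idea: part-worth utilities recovered by dummy-coded least squares from preference ratings generated under an assumed additive model. All profiles, attribute levels, and utilities below are invented:

```python
# Toy conjoint sketch: recover part-worth utilities by dummy-coded least squares.
# Each attribute has 2 levels, coded 1 for the "better" level, 0 otherwise:
# price Rs. 400 (vs 500), delivery 1 week (vs 2), response 24h (vs 48h).
from itertools import product
import numpy as np

profiles = np.array(list(product([0, 1], repeat=3)), dtype=float)  # full factorial

true_utils = np.array([2.0, 1.0, 0.5])     # assumed part-worths: price, delivery, response
ratings = 5.0 + profiles @ true_utils      # additive preference model

# Regress ratings on dummy-coded attribute levels (with an intercept column)
X = np.column_stack([np.ones(len(profiles)), profiles])
coefs, *_ = np.linalg.lstsq(X, ratings, rcond=None)

base, partworths = coefs[0], coefs[1:]
print(partworths)   # price matters most, then delivery, then response time
```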
For example, assume that our fast-food company wants to determine the best combination of
restaurant features to attract customers. A marketing researcher could develop a number of
Thus, the various statistical techniques are extremely useful to marketing researchers,
marketing decision-makers, and policy-makers in various areas of management
decision-making, government, and academic research.
Multiple regression enables the marketing researcher to predict a single dependent metric
variable from two or more metrically measured independent variables.
Multiple discriminant analysis can predict a single dependent nonmetric variable from two or
more metrically measured independent variables.
Factor analysis is used to summarize the information contained in a large number of variables
into a smaller number of subsets called factors.
Cluster analysis is used to classify respondents or objects (e.g., products, stores) into groups that
are homogeneous, or similar within the groups but different between the groups.
Conjoint analysis is used to estimate the value (utility) that respondents associate with different
product and/or service features, so that the most preferred combination of features can be
determined.
Perceptual mapping is used to visually display respondents’ perceptions of products, brands,
companies, and so on. Several multivariate methods can be used to develop the data to construct
perceptual maps.