
6

Data Analysis and Statistical Testing

Research data can be analyzed using various statistical measures, and conclusions can be
inferred from these measures. Figure 6.1 presents the steps involved in analyzing and
interpreting the research data. The data should be reduced to a suitable form before it can
be used for further analysis. Statistical techniques can be used to preprocess the attributes
(software metrics) so that they can be analyzed and meaningful conclusions drawn from them.
After preprocessing, the attributes are reduced so that dimensionality decreases and better
results can be obtained. Then, the model is predicted and validated using statistical and/or
machine learning techniques. The results obtained are analyzed and interpreted from every
aspect. Finally, hypotheses are tested and a decision about the accuracy of the model is made.
This chapter provides a description of data preprocessing techniques, feature reduction
methods, and statistical tests. As discussed in Chapter 4, hypothesis testing can be
done either without model prediction or can be used for model comparison after the models
have been developed. In this chapter, we present the various statistical tests that can be applied
for testing a given hypothesis. The techniques for model development, methods for model vali-
dation, and ways of interpreting the results are presented in Chapter 7. We explain these tests
with software engineering-related examples so that the reader gets an idea about the practical
use of the statistical tests. The examples of model comparison tests are given in Chapter 7.

6.1 Analyzing the Metric Data


After data collection, descriptive statistics can be used to summarize and analyze the nature
of the data, for example, by identifying attributes with very few data points or determining
the spread of the data. In this section, we present various statistical measures for
summarizing data and graphical techniques for identifying outliers. We also present
correlation analysis, which is used to find the relationship between attributes.

6.1.1 Measures of Central Tendency


Measures of central tendency are used to summarize the average values of the attributes.
These measures include the mean, median, and mode. They are known as measures of central
tendency because they provide an idea of the central values of the data around which all
the other values tend to gather.

6.1.1.1 Mean
The mean is the average value of the data set. It is defined as the ratio of the sum of the
values of the data points to the total number of data points and is given as,

[Figure: research data flows through metric data analysis (descriptive statistics, outlier analysis, correlation analysis), attribute reduction (attribute selection, attribute extraction), hypothesis testing (parametric and nonparametric tests), performance evaluation measures (recall, precision, accuracy, MARE, MRE, etc.), model development (statistical analysis, machine learning), model validation (leave-one-out, hold-out, k-cross), and model comparison tests (Friedman, paired t-test, post hoc analysis) to analysis and interpretation of results.]

FIGURE 6.1
Steps for analyzing and interpreting data.

Mean (µ) = (∑ xi)/N

where:
xi (i = 1, 2, …, N) are the data points
N is the number of data points

For example, consider 28, 29, 30, 14, and 67 as values of data points.
The mean is (28 + 29 + 30 + 14 + 67)/5 = 33.6.

6.1.1.2 Median
The median is the value that divides the data into two halves: exactly 50% of the data
points lie below the median and 50% lie above it. For an odd number of data points, the
median is the central value, and for an even number of data points, it is the mean of the
two central values. Consider the following data points:
8, 15, 5, 20, 6, 35, 10

First, we need to arrange data in ascending order,


5, 6, 8, 10, 15, 20, 35

The median is the 4th value, that is, 10. If one more data point, 40, is added to the
above distribution, then
5, 6, 8, 10, 15, 20, 35, 40

Median = (10 + 15)/2 = 12.5
The median is not useful if the number of categories in an ordinal scale is very low.
In such cases, the mode is the preferred measure of central tendency.

6.1.1.3 Mode
The mode gives the value that has the highest frequency in the distribution. For example,
in Table 6.1, the fault severity category 2 has the highest frequency of 50. Hence, 2 is
reported as the mode for Table 6.1.
Unlike the mean and median, the mode may take multiple values in the same distribution.
In Table 6.2, two categories of maintenance effort, very high and medium, have the same
frequency. This is known as a bimodal distribution.
The major disadvantage of the mode is that it does not produce useful results when applied
to interval/ratio scales having many values. For example, the following data points
represent the number of failures occurring per second while testing a given software
system, arranged in ascending order:
15, 17, 18, 18, 45, 63, 64, 65, 71, 75, 79

It can be seen that the data is centered around 60–80 failures, but the mode of the
distribution is 18, since it occurs twice whereas the rest of the values occur only once.
Clearly, the mode does not represent the central values in this case. Hence, either other
measures of central tendency should be used, or the data should be organized into suitable
class intervals before the mode is computed.
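As an illustration, the three measures can be computed with Python's standard statistics module; the following is a minimal sketch using the example data from this section.

```python
# Minimal sketch (Python standard library) reproducing the central
# tendency examples in this section.
from statistics import mean, median, mode

print(mean([28, 29, 30, 14, 67]))             # 33.6
print(median([8, 15, 5, 20, 6, 35, 10]))      # 10 (4th value after sorting)
print(median([8, 15, 5, 20, 6, 35, 10, 40]))  # 12.5 (mean of two central values)
print(mode([15, 17, 18, 18, 45, 63, 64, 65, 71, 75, 79]))  # 18
```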

6.1.1.4 Choice of Measures of Central Tendency


The choice of a measure of central tendency depends on

1. The scale type of data at which it is measured.


2. The distribution of data (left skewed, symmetrical, right skewed).

TABLE 6.1
Faults at Severity Levels
Fault Severity Frequency
0 23
1 19
2 50
3 17

TABLE 6.2
Maintenance Effort
Maintenance Effort Frequency
Very high 15
High 10
Medium 15

TABLE 6.3
Statistical Measures with Corresponding Relevant Scale Types
Measures Relevant Scale Type
Mean Interval and ratio data that are not skewed.
Median Ordinal, interval, and ratio, but not useful for
ordinal scales having few values.
Mode All scale types, but not useful for scales having
many values.

Table 6.3 depicts the relevant scale type of data for each statistical measure.
Consider the following data set:
18, 23, 23, 25, 35, 40, 42

The mean, median, and mode are shown in Table 6.4; each measure computes the "average"
value in a different way. If the data is symmetrical, all three measures (mean, median,
and mode) have the same value. But if the data is skewed, there will be differences
between these measures. Figure 6.2 shows symmetrical and skewed distributions. The
symmetrical curve is a bell-shaped curve, in which the data points are equally
distributed about the center.
Usually, when the data is skewed, the mean is a misleading measure for determining
central values. For example, if we calculate average lines of code (LOC) of 10 modules
given in Table 6.5, it can be seen that most of the values of the LOC are between 200 and 400,
but one module has 3,000 LOC. In this case, the mean will be 531. Only one value has influ-
enced the mean and caused the distribution to skew to the right. However, the median will
be 265, since the median is based on the midpoint and is not affected by the extreme values

TABLE 6.4
Descriptive Statistics
Measure Value
Mean 29.43
Median 25
Mode 23

[Figure: three frequency distributions showing the relative positions of the mean, median, and mode in each case.]

FIGURE 6.2
Graphs representing skewed and symmetrical distributions: (a) left skewed, (b) normal (no skew), and (c) right
skewed.

TABLE 6.5
Sample Data of LOC for 10 Modules
Module# LOC Module# LOC
1 200 6 270
2 202 7 290
3 240 8 300
4 250 9 301
5 260 10 3,000

in the data distribution. Hence, the median better reflects the average LOC in modules as
compared to the mean and is the best measure when the data is skewed.

6.1.2 Measures of Dispersion


The measures of dispersion indicate the spread or the range of the distributions in the data
set. Measures of dispersion include range, standard deviation, variance, and quartiles.
The range is defined as the difference between the highest value and the lowest value
in the distribution. It is the easiest measure to compute. Thus, for the distribution of
LOC given in Table 6.5, the range will be
Range = 3000 − 200 = 2800
The ranges of two distributions may be different even if they have the same mean.
The advantage of the range measure is that it is simple to compute; the disadvantage
is that it only takes into account the extreme values in the distribution and, hence,
does not represent the actual spread in the distribution. The interquartile range (IQR)
can be used to overcome this disadvantage.
The quartiles are used to compute the IQR of the distribution. The quartiles divide
the metric data into four equal parts, as depicted in Figure 6.3. For the purpose of
calculating quartiles, the data is first arranged in ascending order. 25% of the metric
data lies below the lower quartile (25th percentile), 50% lies below the median value,
and 75% lies below the upper quartile (75th percentile).
The lower quartile (Q1) is computed by the following steps:

1. Computing the median of the data set

2. Computing the median of the lower half of the data set

The upper quartile (Q3) is computed by the following steps:

1. Computing the median of the data set

2. Computing the median of the upper half of the data set

[Figure: the lower quartile, median, and upper quartile divide the ordered data into four equal parts.]

FIGURE 6.3
Quartiles.

[Figure: the ordered data 200, 202, 240, 250, 260, 270, 290, 300, 301, 3,000 with Q1 = 240, the median = 265, and Q3 = 300 marked.]

FIGURE 6.4
Example of quartiles.

The IQR is defined as the difference between upper quartile and lower quartile and is given as,
IQR = Q3 − Q1

For example, for Table 6.5, the quartiles are shown in Figure 6.4.
IQR = Q3 − Q1 = 300 − 240 = 60

The standard deviation is used to measure the average distance of a data point from the
mean; it assesses the spread by calculating the distance of each data point from the mean.
The standard deviation is small if most of the data points are close to the mean, and
large if they are spread far from it. The standard deviation (σx) for the population is
given as:

σx = √( ∑(x − µ)² / N )
where:
x is the given value
N is the number of values
µ is the mean of all the values

Variance is a measure of variability and is the square of standard deviation.
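The dispersion measures above can be sketched in Python with numpy for the LOC data of Table 6.5. Note that numpy's default percentile interpolation differs slightly from the median-of-halves convention used in this section, so the computed quartiles are close to, but not exactly, Q1 = 240 and Q3 = 300.

```python
# Sketch: range, quartiles/IQR, and population standard deviation and
# variance for the LOC data in Table 6.5.
import numpy as np

loc = np.array([200, 202, 240, 250, 260, 270, 290, 300, 301, 3000])

print(loc.max() - loc.min())           # range = 2800
q1, q3 = np.percentile(loc, [25, 75])  # numpy interpolates between points
print(q3 - q1)                         # IQR
print(loc.std())                       # population standard deviation
print(loc.var())                       # variance = square of standard deviation
```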

6.1.3 Data Distributions


The shape of the distribution of the data is used to describe and understand the met-
rics data. Shape exhibits the patterns of distribution of data points in a given data set.
A distribution can either be symmetrical (half of the data points lie to the left of the
median and the other half lie to the right of the median) or skewed (low
and/or high data values are imbalanced). A bell-shaped curve is known as the normal curve
and is defined as, "The normal curve is a smooth, unimodal curve that is perfectly sym-
metrical. It has 68.3 percent of the area under the curve within one standard deviation of
the mean" (Argyrous 2011). For example, suppose that for the variable LOC the mean is 250
and the standard deviation is 50 for a given 500 samples. For LOC to be normally
distributed, 342 data points must lie between 200 (250 − 50) and 300 (250 + 50). For the
normal curve to be symmetrical, 171 data points must lie between 200 and 250, and the same
number of data points must lie between 250 and 300 (Figure 6.5).
Consider the mean and standard deviation of LOC for four different data sets with 500 data
points shown in Table 6.6. Given the mean and standard deviation in Table 6.6, for data set
1 to be normal, the range of LOC containing 342 (68.3% of 500) data points should be
between 200 and 300. Similarly, in data set 2, 342 data points should have LOC between
160 and 280, and in data set 3, 342 data points should lie between 170 and 230.

[Figure: normal curve with 68.3% of the area within one standard deviation of the mean (34.15% on each side), between 200 and 300.]

FIGURE 6.5
Normal curve.

TABLE 6.6
Range of Distribution for Normal Data Sets
S. No. Mean Standard Deviation Ranges
1 250 50 200–300
2 220 60 160–280
3 200 30 170–230
4 200 10 190–210

TABLE 6.7
Sample Fault Count Data
Data1 35, 45, 45, 55, 55, 55, 65, 65, 65, 65, 75, 75, 75, 75, 75, 85, 85, 85, 85, 95, 95, 95, 105, 105, 115
Data2 0, 2, 72, 75, 78, 80, 80, 85, 85, 87, 87, 87, 87, 88, 89, 90, 90, 92, 92, 95, 95, 98, 98, 99, 102
Data3 20, 37, 40, 43, 45, 52, 55, 57, 63, 65, 74, 75, 77, 82, 86, 86, 87, 89, 89, 90, 95, 107, 165, 700, 705

6.1.4 Histogram Analysis


The normal curve can be used to understand data descriptions. A number of methods can be
applied to analyze the normality of a data set. One of these methods is histogram analysis.
A histogram is a graphical representation that depicts the frequency of occurrence of
ranges of values. For example, consider the fault counts given for three software systems
in Table 6.7. The histograms for all three data sets are shown in Figure 6.6, with the
normal curve superimposed on each histogram to check the normality of the data.
Figure 6.6 shows that the data set Data1 is normal, whereas Data2 is left skewed and
Data3 is right skewed.
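As a sketch, such a histogram with a superimposed normal curve can be drawn with matplotlib and scipy; Data1 from Table 6.7 is used here, and the bin count of 6 is an arbitrary choice.

```python
# Sketch: histogram of Data1 with a normal curve superimposed to
# visually check normality.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

data1 = [35, 45, 45, 55, 55, 55, 65, 65, 65, 65, 75, 75, 75, 75, 75,
         85, 85, 85, 85, 95, 95, 95, 105, 105, 115]

counts, bins, _ = plt.hist(data1, bins=6, edgecolor="black")
x = np.linspace(min(data1), max(data1), 200)
# Scale the density so the curve matches the histogram's frequency axis.
scale = len(data1) * (bins[1] - bins[0])
plt.plot(x, norm.pdf(x, np.mean(data1), np.std(data1)) * scale)
plt.xlabel("Fault count")
plt.ylabel("Frequency")
plt.show()
```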

6.1.5 Outlier Analysis


Data points that lie away from the rest of the data values are known as outliers. These
values are located in an empty space and are extreme or unusual values. The presence of
outliers may adversely affect the results of data analysis for the following three reasons:
1. The mean no longer remains a true representative of the central tendency.
2. In regression analysis, the error terms are squared; hence, outliers may overinflu-
ence the results.
3. The outliers may distort the results of the data analysis.

[Figure: histograms of fault count (x-axis) versus frequency (y-axis) for the three data sets, each with a superimposed normal curve.]

FIGURE 6.6
Histogram analysis for fault count data given in Table 6.7: (a) Data1, (b) Data2, and (c) Data3.

For example, suppose that one calculates the average of LOC, where most values are
between 1,000 and 2,000, but the LOC for one module is 15,000. Thus, the data point with
the value 15,000 is located far away from the other values in the data set and is an outlier.
Outlier analysis is carried out to detect the data points that are overinfluential and must be
considered for removal from the data sets.
The outliers can be divided into three types: univariate, bivariate, and multivariate.
Univariate outliers are influential data points that occur within a single variable. Bivariate
outliers occur when two variables are considered in combination, whereas multivariate
outliers occur when more than two variables are considered in combination. Once the out-
liers are detected, the researcher must make the decision of inclusion or exclusion of the
identified outlier. The outliers generally signal the presence of anomalies, but they may
sometimes provide interesting patterns to the researchers. The decision is based on the
reason of the occurrence of the outlier.
Box plots, z-scores, and scatter plots can be used for detecting univariate and bivariate
outliers.

6.1.5.1 Box Plots


Box plots are based on the median and quartiles: they are constructed using the upper and
lower quartiles. An example box plot is shown in Figure 6.7.

[Figure: a box plot with the lower and upper quartiles forming the box, the median inside the box, and the start and end of the tail marked by the boundary lines.]

FIGURE 6.7
Example box plot.

The two boundary lines signify the start and end of the tail and correspond to ±1.5 IQR.
Thus, once the value of the IQR is known, it is multiplied by 1.5. The values inside the
boundaries of the box plot are not considered extreme, whereas the data points beyond the
start and end of the tail are considered outliers. The distance between the lower and the
upper quartile is known as the box length.
The start of the tail is calculated as Q1 − 1.5 × IQR and the end of the tail is calculated
as Q3 + 1.5 × IQR. To avoid negative values, these are truncated to the nearest values of
the actual data points. Thus, the actual start of the tail is the lowest value in the
variable above (Q1 − 1.5 × IQR), and the actual end of the tail is the highest value below
(Q3 + 1.5 × IQR).
The box plots also provide information on the skewness of the data. The median lies in
the middle of the box if the data is not skewed. The median lies away from the middle if
the data is left or right skewed. For example, consider the LOC values given below for a
software system:
200, 202, 240, 250, 260, 270, 290, 300, 301, 3000

The median of the data set is 265, the lower quartile is 240, and the upper quartile is 300.
The IQR is 60. The start of the tail is 240 − 1.5 × 60 = 150 and the end of the tail is
300 + 1.5 × 60 = 390. The actual start of the tail is the lowest value above 150, that is,
200, and the actual end of the tail is the highest value below 390, that is, 301. Thus, case
number 10 with value 3,000 is above the end of the tail and, hence, is an outlier. The box
plot for the given data set is shown in Figure 6.8 with one outlier, 3,000.
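The tail calculations above can be collected into a small helper; the sketch below follows the median-of-halves quartile convention used in this example (for an odd number of points, the central value is excluded from both halves).

```python
# Sketch: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] as outliers.
def boxplot_outliers(values):
    s = sorted(values)
    n = len(s)

    def median(v):
        m = len(v) // 2
        return v[m] if len(v) % 2 else (v[m - 1] + v[m]) / 2

    lower_half = s[:n // 2]          # values below the overall median
    upper_half = s[(n + 1) // 2:]    # values above the overall median
    q1, q3 = median(lower_half), median(upper_half)
    iqr = q3 - q1
    start, end = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in s if v < start or v > end]

loc = [200, 202, 240, 250, 260, 270, 290, 300, 301, 3000]
print(boxplot_outliers(loc))  # [3000]
```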
A decision regarding inclusion or exclusion of the outliers must be made by the research-
ers during data analysis considering the following reasons:

1. Data entry errors


2. Extraordinary or unusual events
3. Unexplained reasons

Outlier values may also be present because of a combination of data values across more
than one variable; such outliers are called multivariate outliers. The scatter plot is
another visualization method to detect outliers: all the data points are represented
graphically, which allows us to examine more than one metric variable at a given time.

[Figure: box plot of the LOC values with the value 3,000 marked as an outlier above the end of the tail.]

FIGURE 6.8
Box plot for LOC values.

6.1.5.2 Z-Score
The z-score is another method to identify outliers; it depicts the relationship of a value
to the mean and is given as follows:

z-score = (x − µ)/σ
where:
x is the score or value
µ is the mean
σ is the standard deviation

The z-score indicates whether a value is above or below the mean, and by how many standard
deviations; it may be positive or negative. Data samples whose z-scores exceed the
threshold of ±2.5 are considered to be outliers.
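A z-score based outlier check is sketched below using numpy; applied to Data2 of Table 6.7, it flags the two lowest values, consistent with Example 6.1.

```python
# Sketch: flag data points whose |z-score| exceeds 2.5 as outliers.
import numpy as np

def zscore_outliers(values, threshold=2.5):
    x = np.asarray(values, dtype=float)
    z = (x - x.mean()) / x.std()
    return x[np.abs(z) > threshold]

data2 = [0, 2, 72, 75, 78, 80, 80, 85, 85, 87, 87, 87, 87, 88, 89,
         90, 90, 92, 92, 95, 95, 98, 98, 99, 102]
print(zscore_outliers(data2))  # [0. 2.]
```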

Example 6.1:
Consider the data set given in Table 6.7. Calculate univariate outliers for each variable
using box plots and z-scores.
Solution:
The box plots for Data1, Data2, and Data3 are shown in Figure 6.9. The z-scores for
data sets given in Table 6.7 are shown in Table 6.8.
To identify multivariate outliers, for each data point, the Mahalanobis Jackknife distance
D measure can be calculated. Mahalanobis Jackknife is a measure of the distance in
multidimensional space of each observation from the multivariate mean center of the
observations (Hair et al. 2006). Each data point is evaluated using chi-square distribution
with 0.001 significance value.

[Figure: box plots of the three data sets; Data2 shows outliers below the start of the tail (cases 1 and 2), and Data3 shows outliers above the end of the tail (cases 23, 24, and 25).]

FIGURE 6.9
(a)–(c) Box plots for data given in Table 6.7.

TABLE 6.8
Z-Score for Data Sets
Case No. Data1 Data2 Data3 Z-scoredata1 Z-scoredata2 Z-scoredata3
1 35 0 20 −1.959 −3.214 −0.585
2 45 2 37 −1.469 −3.135 −0.488
3 45 72 40 −1.469 −0.370 −0.471
4 55 75 43 −0.979 −0.251 −0.454
5 55 78 45 −0.979 −0.132 −0.443
6 55 80 52 −0.979 −0.052 −0.404
7 65 80 55 −0.489 −0.052 −0.387
8 65 85 57 −0.489 0.145 −0.375
9 65 85 63 −0.489 0.145 −0.341
10 65 87 65 −0.489 0.224 −0.330
11 75 87 74 0 0.224 −0.279
12 75 87 75 0 0.224 −0.273
13 75 87 77 0 0.224 −0.262
14 75 88 82 0 0.264 −0.234
15 75 89 86 0 0.303 −0.211
16 85 90 86 0.489 0.343 −0.211
17 85 90 87 0.489 0.343 −0.205
18 85 92 89 0.489 0.422 −0.194
19 85 92 89 0.489 0.422 −0.194
20 95 95 90 0.979 0.540 −0.188
21 95 95 95 0.979 0.540 −0.160
22 95 98 107 0.979 0.659 −0.092
23 105 98 165 1.469 0.659 0.235
24 105 99 700 1.469 0.698 3.264
25 115 102 705 1.959 0.817 3.292
Mean 75 81.35 123.26
SD 20.41 25.29 176.63

6.1.6 Correlation Analysis

This is an optional step followed in empirical studies. Correlation analysis studies the
variation of two or more independent variables to determine the amount of correlation
between them. For example, the relationship of design metrics to the size of the class may
be analyzed to determine empirically whether a coupling, cohesion, or inheritance metric is
essentially measuring size, such as LOC. A model that simply predicts larger classes as
more fault prone is not very useful, because these classes cover a large part of the system
and testing effort cannot then be focused effectively (Briand et al. 2000; Aggarwal et al.
2009). A nonparametric technique (Spearman's Rho) for measuring the relationship between
object-oriented (OO) metrics and size can be used if a skewed distribution of the design
measures is observed. Hopkins terms a correlation coefficient value between 0.5 and 0.7
large, between 0.7 and 0.9 very large, and between 0.9 and 1.0 almost perfect (Hopkins 2003).
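Spearman's Rho can be computed as sketched below with scipy; the two metric vectors are illustrative values, not data from any of the cited studies.

```python
# Sketch: Spearman's rho between two software metrics (illustrative data).
from scipy.stats import spearmanr

loc = [200, 202, 240, 250, 260, 270, 290, 300, 301, 3000]
cbo = [3, 4, 8, 6, 9, 10, 14, 12, 15, 24]

rho, p_value = spearmanr(loc, cbo)
# Interpret |rho| using Hopkins's thresholds: 0.5-0.7 large,
# 0.7-0.9 very large, 0.9-1.0 almost perfect.
print(rho, p_value)
```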

6.1.7 Example—Descriptive Statistics of Fault Prediction System


Univariate and multivariate outliers are found in the FPS study. To identify multivariate
outliers, the Mahalanobis Jackknife distance is calculated for each data point. The input
metrics were normalized using min–max normalization, which performs a linear transformation
on the original data (Han and Kamber 2001). Suppose that minA and maxA are the minimum and
maximum values of an attribute A. Min–max normalization maps the value v of A to v′ in the
range 0–1 using the formula:
v′ = (v − minA)/(maxA − minA)
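A minimal sketch of min–max normalization:

```python
# Sketch: min-max normalization, mapping an attribute to the range 0-1.
import numpy as np

def min_max_normalize(values):
    v = np.asarray(values, dtype=float)
    return (v - v.min()) / (v.max() - v.min())

loc = [200, 202, 240, 250, 260, 270, 290, 300, 301, 3000]
print(min_max_normalize(loc))  # 200 maps to 0.0 and 3000 maps to 1.0
```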

Table 6.9 shows “min,” “max,” “mean,” “median,” “standard deviation,” “25% quartile,”
and “75% quartile” for all metrics considered in FPS study. The following observations are
made from Table 6.9:

• The size of a class measured in terms of lines of source code ranges from 0 to 2,313.
• The values of depth of inheritance tree (DIT) and number of children (NOC)
are low in the system, which shows that inheritance is not much used in all the

TABLE 6.9
Descriptive Statistics for Metrics
Metric Min. Max. Mean Median Std. Dev. Percentile (25%) Percentile (75%)
CBO 0 24 8.32 8 6.38 3 14
LCOM 0 100 68.72 84 36.89 56.5 96
NOC 0 5 0.21 0 0.7 0 0
RFC 0 222 34.38 28 36.2 10 44.5
WMC 0 100 17.42 12 17.45 8 22
LOC 0 2313 211.25 108 345.55 8 235.5
DIT 0 6 1 1 1.26 0 1.5

TABLE 6.10
Correlation Analysis Results
Metric CBO LCOM NOC RFC WMC LOC DIT
CBO 1
LCOM 0.256 1
NOC −0.03 −0.028 1
RFC 0.386 0.334 −0.049 1
WMC 0.245 0.318 0.035 0.628 1
LOC 0.572 0.238 −0.039 0.508 0.624 1
DIT 0.4692 0.256 −0.031 0.654 0.136 0.345 1

systems; similar results have also been shown by others (Chidamber et al. 1998;
Cartwright and Shepperd 2000; Briand et al. 2000a).
• The lack of cohesion in methods (LCOM) measure, which counts the number of method
pairs with no attribute usage in common, has high values (up to 100) in the KC1 data set.

The correlation among metrics, an important statistical quantity, is calculated as shown
in Table 6.10; Gyimothy et al. (2005) and Basili et al. (1996) also calculated the
correlation among metrics. The values of the correlation coefficients are interpreted
using the thresholds given by Hopkins (2003). Thus, in Table 6.10, the correlated values
with correlation coefficient >0.5 are shown in bold; these coefficients are significant
at the 0.01 level. In this data set, the weighted methods per class (WMC), LOC, and DIT
metrics are correlated with the response for a class (RFC) metric. Similarly, the WMC and
coupling between objects (CBO) metrics are correlated with the LOC metric. Therefore,
these metrics are not totally independent and represent redundant information.

6.2 Attribute Reduction Methods


Sometimes the presence of a large number of attributes in an empirical study reduces the
efficiency of the prediction results produced by the statistical and machine learning tech-
niques. Reducing the dimensionality of the data reduces the size of the hypothesis space and
allows the methods to operate faster and more effectively. The attribute reduction methods

[Figure: attribute reduction is performed using either attribute selection techniques or attribute extraction techniques.]

FIGURE 6.10
Attribute reduction procedure.

involve either selecting a subset of attributes (independent variables) by eliminating the
attributes that have little or no predictive information (known as attribute selection), or
combining the relevant attributes into a new set of attributes (known as attribute
extraction). Figure 6.10 graphically depicts the procedures of the attribute selection and
extraction methods.
For example, a researcher may collect a large amount of data that captures various
constructs of the design to predict the probability of occurrence of a fault in a module.
However, much of the collected information may have no relation to, or impact on, the
occurrence of faults. It is also possible that more than one attribute captures the same
concept and is hence redundant. Irrelevant and redundant attributes only add noise to the
data, increase computational time, and may reduce the accuracy of the predicted models.
To remove the noise and correlation in the attributes, it is desirable to reduce data
dimensionality as a preprocessing step of data analysis. The advantages of applying
attribute reduction methods are as follows:

1. Improved model interpretability


2. Faster training time
3. Reduction in overfitting of the models
4. Reduced noise

Hence, attribute reduction leads to improved computational efficiency, lower cost, increased
problem understanding, and improved accuracy. Figure 6.11 shows the categories of attri-
bute reduction methods.

[Figure: attribute reduction methods are divided into attribute selection and attribute extraction. Attribute selection comprises wrapper methods (machine learning techniques) and filter methods (univariate analysis, correlation-based feature subselection); attribute extraction comprises supervised methods (linear discriminant analysis) and unsupervised methods (principal component analysis).]

FIGURE 6.11
Classification of attribute reduction methods.

6.2.1 Attribute Selection


Attribute selection involves selecting a subset of attributes from a given set of attributes.
For example, univariate analysis and correlation-based feature selection (CFS) techniques
can be used for attribute subselection. The available methods, discussed below, can be
categorized as wrapper and filter methods. Wrapper methods use learning techniques to find
subsets of attributes, whereas filter methods are independent of the learning technique.
Because wrapper methods use a learning algorithm for selecting subsets of attributes, they
are slower in execution than filter methods, which compute attribute rankings on the basis
of correlation-based and information-centric measures. At the same time, filter methods may
produce a subset that does not work very well with the learning technique, as the attributes
are not tuned to a specific prediction model. Figure 6.12 depicts the procedure of filter
methods, and Figure 6.13 shows the procedure of wrapper methods. Examples of learning
techniques used in wrapper methods include hill climbing, genetic algorithms, simulated
annealing, and tabu search. Examples of techniques used in filter methods include the
correlation coefficient, mutual information, and information gain. Two widely used methods
for feature selection are explained in the sections below.

[Figure: in the filter method, attribute subsets generated from the original set are evaluated using an attribute measurement; once a good subset is found, the learning algorithm is trained on the training data and the model is validated on the testing data to obtain the accuracy.]

FIGURE 6.12
Procedure of filter method.

[Figure: in the wrapper method, attribute subsets generated from the original set are evaluated using the learning algorithm itself; once a good subset is found, the model is trained on the training data and validated on the testing data to obtain the accuracy.]

FIGURE 6.13
Procedure of wrapper method.

6.2.1.1 Univariate Analysis


The univariate analysis is done to find the individual effect of each independent variable
on the dependent variable. One of the purposes of univariate analysis is to screen out
the independent variables that are not significantly related to the dependent variables.
For example, in regression analysis, only the independent variables that are significant at
0.05 significance level may be considered in subsequent model prediction using multivari-
ate analysis. The primary goal is to preselect the independent variables for multivariate
analysis that seem to be useful predictors. The choice of methods in the univariate analy-
sis depends on the type of dependent variables being used.
In the univariate regression analysis, the independent variables are chosen based on the
results of the significance value (see Section 6.3.2), whereas, in the case of other methods,
the independent variables are ranked based on the values of the performance measures
(see Chapter 7).

6.2.1.2 Correlation-Based Feature Selection


This is a commonly used method for preselecting attributes in machine learning methods.
To incorporate the correlation of independent variables, a CFS method is applied to select the
best predictors out of the independent variables in the data sets (Hall 2000). The best combina-
tions of independent variables are searched through all possible combinations of variables. CFS
evaluates the worth of a subset of independent variables, such as software metrics, by considering
the individual predictive ability of each attribute along with the degree of redundancy between
them. Hall (2000) showed that CFS can be used in drastically reducing the dimensionality of
data sets, while maintaining the performance of the machine learning methods.
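The core of CFS is a merit score that rewards feature–class correlation and penalizes feature–feature redundancy. The sketch below computes that merit for a candidate subset using mean absolute Pearson correlations; note that Hall's (2000) formulation uses symmetrical uncertainty and pairs the merit with a heuristic search (for example, best-first) over subsets, which is omitted here.

```python
# Sketch: CFS-style merit of a feature subset,
#   merit = k * r_cf / sqrt(k + k*(k-1) * r_ff),
# where r_cf is the mean feature-class correlation and r_ff is the mean
# feature-feature correlation among the k selected features.
import numpy as np

def cfs_merit(X, y):
    k = X.shape[1]
    r_cf = np.mean([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(k)])
    r_ff = 0.0
    if k > 1:
        r_ff = np.mean([abs(np.corrcoef(X[:, i], X[:, j])[0, 1])
                        for i in range(k) for j in range(i + 1, k)])
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)
```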

6.2.2 Attribute Extraction


Unlike attribute selection, which selects from the existing attributes according to their
significance values or importance, attribute extraction transforms the existing attributes
and produces new attributes by combining or aggregating the original attributes, so that
useful information for model building can be extracted. Principal component analysis is
the most widely used attribute extraction technique in the literature.

6.2.2.1 Principal Component Method


The principal component method (or P.C. method) is a standard technique used to find the
interdependence among a set of variables. The factors summarize the commonality of the
variables, and the factor loadings represent the correlation between the variables and
the factor. The P.C. method maximizes the sum of squared loadings of each factor extracted
in turn. It aims at constructing new variables (Pi), called principal components (P.C.s),
out of a given set of variables Xj (j = 1, 2, …, k):

P1 = (b11 × X1) + (b12 × X2) + ⋯ + (b1k × Xk)
P2 = (b21 × X1) + (b22 × X2) + ⋯ + (b2k × Xk)
⋮
Pk = (bk1 × X1) + (bk2 × X2) + ⋯ + (bkk × Xk)

All bij's, called loadings, are worked out in such a way that the extracted P.C.s satisfy
the following two conditions:

1. P.C.s are uncorrelated (orthogonal).


2. The first P.C. (P1) has the highest variance, the second P.C. has the next highest
variance, and so on.

The variables with high loadings help identify the dimension the P.C. is capturing, but
this usually requires some degree of interpretation. To identify these variables and
interpret the P.C.s, the rotated components are used. As the dimensions are independent,
orthogonal rotation is used, in which the axes are maintained at 90 degrees. There are
various strategies to perform such rotation, including quartimax, varimax, and equimax
orthogonal rotation; for a detailed description, refer to Hair et al. (2006) and Kothari
(2004). The varimax method maximizes the sum of the variances of the squared loadings of
the factor matrix (a table displaying the factor loadings of all variables on each factor).
Varimax rotation is the most frequently used strategy in the literature. An eigenvalue
(or latent root) is associated with each P.C. It refers to the sum of squared values of the
loadings relating to a dimension and indicates the relative importance of each dimension
for the particular set of variables being analyzed. The P.C.s with eigenvalue >1 are taken
for interpretation (Kothari 2004).
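A bare-bones version of the extraction step is sketched below: eigendecomposition of the correlation matrix of the metrics, keeping components with eigenvalue greater than 1. Varimax rotation, which the text recommends for interpretation, is not included in this sketch.

```python
# Sketch: principal components via the correlation matrix; keep
# components whose eigenvalue exceeds 1.
import numpy as np

def principal_components(X):
    corr = np.corrcoef(X, rowvar=False)           # k x k correlation matrix
    eigenvalues, loadings = np.linalg.eigh(corr)  # ascending eigenvalues
    order = np.argsort(eigenvalues)[::-1]         # sort descending
    eigenvalues, loadings = eigenvalues[order], loadings[:, order]
    keep = eigenvalues > 1                        # eigenvalue-one criterion
    return eigenvalues[keep], loadings[:, keep]
```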

6.2.3 Discussion
It is useful to interpret the results of regression analysis in the light of the results
obtained from P.C. analysis. P.C. analysis shows the main dimensions, with the independent
variables that are the main drivers for predicting the dependent variable. It would also be
interesting to observe the metrics included in the dimensions across various replicated
studies; this will help in finding differences across the studies. From such observations,
recommendations can be derived regarding which independent variables appear to be redundant
and need not be collected, without losing a significant amount of design information
(Briand and Wust 2002). P.C. analysis is also a widely used method for removing redundant
variables in neural networks.
The univariate analysis is used for preselecting the metrics with respect to their signifi-
cance, whereas CFS is the widely used method for preselecting independent variables in
machine learning methods (Hall 2000). Hall (2003) showed that CFS chooses few attributes,
is faster, and is overall a good performer.

6.3 Hypothesis Testing


As discussed in Section 4.7, hypothesis testing is an important part of empirical research.
Hypothesis testing allows a researcher to reach a conclusion on the basis of statistical
tests. Generally, a hypothesis is an assumption that the researcher wants to accept or
reject. For example, an experimenter observes that birds can fly and wants to show that a
given animal is not a bird. In this example, the null hypothesis can be "the observed
animal is a bird." A critical area c is given to test a particular unit x. The test can be
formulated as given below:

1. If x ∈ c, then the null hypothesis is rejected.

2. If x ∉ c, then the null hypothesis is accepted.

In the given example, x is an attribute of the animal, and the critical area is c = {run,
walk, sit, and so on}; these are the values that will cause the null hypothesis to be
rejected. The test is whether x ≠ fly; if so, the null hypothesis is rejected, otherwise
it is accepted. Hence, if x = fly, the null hypothesis is accepted.
In real life, a software practitioner may want to prove that decision tree algorithms are
better than the logistic regression (LR) technique. This is the researcher's assumption.
Hence, the null hypothesis can be formulated as "there is no difference between the
performance of the decision tree technique and the LR technique." The assumption then
needs to be evaluated using statistical tests on the basis of data to reach a conclusion.
In empirical research, hypothesis formulation and evaluation are the bottom line of research.
This section will highlight the concept of hypothesis testing, and the steps followed in
hypothesis testing.

6.3.1 Introduction
Consider a setup where the researcher is interested in whether some learning technique
“Technique X” performs better than “Technique Y” in predicting the change proneness of
a class. To reach a conclusion, both technique X and technique Y are used to build change
prediction models. These prediction models are then used to predict the change proneness
of a sample data set (for details on training and testing of models refer Chapter 7) and
based on the outcome observed over the sample data set, it is determined which technique
is the better predictor out of the two. However, concluding which technique is better is a
challenging task because of the following issues:

1. The number of data points in the sample could be very large, making data analysis
and synthesis difficult.
2. The researcher might be biased towards one of the techniques and could overlook
minute differences that have the potential of impacting the final result greatly.
3. The conclusions drawn can be assumed to happen by chance because of bias in the
sample data itself.

To neutralize the impact of researcher bias and ensure that all the data points contribute
to the results, it is essential that a standard procedure be adopted for the analysis and
synthesis of sample data. Statistical tests allow the researcher to test the research questions
(hypotheses) in a generalized manner. There are various statistical tests like the student
t-test, chi-squared test, and so on. Each of these tests is applicable to a specific type of data
and allows for comparison in such a way that using the data collected from a small sample,
conclusions can be drawn for the entire population.

6.3.2 Steps in Hypothesis Testing


In hypothesis testing, a series of steps is followed to verify a given hypothesis. Section
4.7.5 summarizes the following steps; we restate them here because they are followed in
each statistical test described in the coming sections. The first two steps are part of
the experimental design process and are carried out as the design phase progresses.

Step 1: Define hypothesis—In the first step, the hypothesis is defined corresponding
to the outcomes. The statistical tests are used to verify the hypothesis formed in
the experimental design phase.

Step 2: Select the appropriate statistical test—The appropriate statistical test is


determined in experiment design on the basis of assumptions of a given statisti-
cal test.
Step 3: Apply test and calculate p-value—The next step involves applying the appro-
priate statistical test and calculating the significance value, also known as p-value.
There are a series of parametric and nonparametric tests available. These tests are
illustrated with examples in the coming sections.
Step 4: Define significance level—The threshold level or critical value (also known as
α-value) that is used to check the significance of the test statistic is defined.
Step 5: Derive conclusions—Finally, the conclusions on the hypothesis are derived
using the results of the statistical test carried out in step 3.

6.4 Statistical Testing


The hypothesis formed in an empirical study is verified using statistical tests. In the
following subsections, the overview of statistical tests, the difference between one-tailed
and two-tailed tests, and the interpretation of statistical tests are discussed.

6.4.1 Overview of Statistical Tests


The validity of the hypothesis is evaluated using the test statistic obtained from the
statistical tests. The rejection region is the region such that, if the test value falls
within it, the null hypothesis is rejected. The statistical tests are applied on the
independent and dependent variables, and the test value is computed using the test
statistic. After applying the statistical tests, the actual or test value is compared with
the predetermined critical value or p-value. Finally, a decision on acceptance or rejection
of the hypothesis is made (Figure 6.14).

6.4.2 Categories of Statistical Tests


Statistical tests can be classified according to the relationship between the samples, that
is, whether they are independent or dependent (Figure 6.15). The decision on the sta-
tistical tests can be made based on the number of data samples to be compared. Some
tests work on two data samples, such as t-test or Wilcoxon signed-rank, whereas oth-
ers work on multiple data sets, such as Friedman or Kruskal–Wallis. Further, the tests
can be categorized as parametric and nonparametric. Parametric tests are statistical tests
that can be applied to a given data set if it satisfies the underlying assumptions of the
test. Nonparametric tests are used when certain assumptions are not satisfied by the data
sample. The categorization is depicted in Figure 6.15. Univariate LR can also be applied
for testing hypotheses for a binary dependent variable.

[Figure: the variables are input to the statistical tests, the test value and p-value are compared, and the validity of the hypothesis is determined.]

FIGURE 6.14
Steps in statistical tests.

[Figure: statistical tests categorized by the relationship between samples. Independent samples — more than two samples: parametric, one-way ANOVA; nonparametric, Kruskal–Wallis; two samples: parametric, t-test; nonparametric, Mann–Whitney U. Dependent samples — more than two samples: parametric, related measures ANOVA; nonparametric, Friedman; two samples: parametric, paired t-test; nonparametric, Wilcoxon signed-rank. Association between variables: chi-square. Causal relationships: univariate regression analysis.]

FIGURE 6.15
Categories of statistical tests.

Table 6.11 depicts the summary of assumptions, data scale, and normality requirement for
each statistical test discussed in this chapter.

6.4.3 One-Tailed and Two-Tailed Tests


In a two-tailed test, the deviation of the parameter in each direction from the specified
value is considered. When the hypothesis is specified in one direction, a one-tailed test
is used. For example, consider the following null and alternative hypotheses for a
one-tailed test:

H0: µ = µ0
Ha: µ > µ0

where:
µ is the population mean
µ0 is the sample mean

TABLE 6.11
Summary of Statistical Tests
Test | Assumptions | Data Scale | Normality
One sample t-test | The data should not have any significant outliers; the observations should be independent. | Interval or ratio | Required
Two sample t-test | Standard deviations of the two populations must be equal; samples must be independent of each other; the samples are randomly drawn from respective populations. | Interval or ratio | Required
Paired t-test | Samples must be related with each other; the data should not have any significant outliers. | Interval or ratio | Required
Chi-squared test | Samples must be independent of each other; the samples are randomly drawn from respective populations. | Nominal or ordinal | Not required
F-test | All the observations should be independent; the samples are randomly drawn from respective populations and there is no measurement error. | Interval or ratio | Required
One-way ANOVA | Used with three or more independent samples; the data should not have any significant outliers; the data should have homogeneity of variances. | Interval or ratio | Required
Two-way ANOVA | The data should not have any significant outliers; the data should have homogeneity of variances. | Interval or ratio | Required
Wilcoxon signed test | The data should consist of two "related groups" or "matched pairs." | Ordinal or continuous | Not required
Wilcoxon–Mann–Whitney test | The samples must be independent. | Ordinal or continuous | Not required
Kruskal–Wallis test | The test should validate three or more independent sample distributions; the samples are drawn randomly from respective populations. | Ordinal or continuous | Not required
Friedman test | The samples should be drawn randomly from respective populations. | Ordinal or continuous | Not required

Here, the alternative hypothesis specifies that the population mean is strictly greater
than the sample mean. The hypothesis below is an example of a two-tailed test:

H0: µ = µ0
Ha: µ < µ0 or µ > µ0

Figure 6.16 shows the probability curve for a two-tailed test with rejection (or critical)
regions on both sides of the curve. Thus, the null hypothesis is rejected if the sample
mean lies in either of the rejection regions. The two-tailed test is also called a
nondirectional test. Figure 6.17 shows the probability curve for a one-tailed test with
the rejection region on one side of the curve. The one-tailed test is also referred to
as a directional test.

6.4.4 Type I and Type II Errors


There are two types of errors that can occur in hypothesis testing, distinguished as
type I and type II errors. Both depend directly on the null hypothesis. The goal of the
test is to reject the null hypothesis. A statistical test can either reject (prove false)
or fail to reject (fail to prove false) a null hypothesis, but can never prove it to
be true.
Type I error is the probability of wrongly rejecting the null hypothesis when the null
hypothesis is true. In other words, a type I error occurs when the null hypothesis of no
difference is rejected, even when there is no difference. A type I error can also be
called a "false positive": a result in which an actual "miss" is erroneously seen as a
"hit." Type I error is denoted by the Greek letter alpha (α). This means that it usually
equals the significance
FIGURE 6.16
Probability curve for two-tailed test.

FIGURE 6.17
Probability curve for one-tailed test.

TABLE 6.12
Types of Errors
H0 True H0 False

Reject H0 Type I error (false positive) Correct result (true positive)


Fail to reject H0 Correct result (true negative) Type II error (false negative)

level of a test. Type II error is defined as the probability of wrongly failing to reject
the null hypothesis when the null hypothesis is false. In other words, a type II error
occurs when the null hypothesis is actually false but somehow fails to get rejected. It
is also known as a "false negative": a result in which an actual "hit" is erroneously seen
as a "miss." The rate of the type II error is denoted by the Greek letter beta (β) and is
related to the power of a test (which equals 1 − β). The definitions of these errors are
also tabulated in Table 6.12.

6.4.5 Interpreting Significance Results


If the calculated value of a test statistic is greater than the critical value for the
test, then the alternative hypothesis is accepted; otherwise, the null hypothesis is
accepted and the alternative hypothesis is rejected.
The test results provide a calculated p-value, which is the exact level of significance
for the outcome. For example, if the p-value reported by the test is 0.01, then the
confidence level of the test is (1 − 0.01) × 100 = 99%. The obtained p-value is compared
with the significance value or critical value, and a decision about acceptance or rejection
of the hypothesis is made. If the p-value is less than or equal to the significance value,
the null hypothesis is rejected. The various tables for obtaining p-values and various test
statistic values are presented in Appendix I. The appendix lists t-table test values, chi-
square test values, Wilcoxon–Mann–Whitney test values, area under the normal distribu-
tion table, F-test table at 0.05 significance level, critical values for two-tailed Nemenyi test
at 0.05 significance level, and critical values for two-tailed Bonferroni test at 0.05 signifi-
cance level.

6.4.6 t-Test
W. Gosset designed the Student's t-test (Student 1908). The purpose of the t-test is to
determine whether two data sets are different from each other or not. It is based on the
assumption that both the data sets are normally distributed. There are three variants of
the t-test:

1. One sample t-test, which is used to compare mean with a given value.
2. Independent sample t-test, which is used to compare means of two independent
samples.
3. Paired t-test, which is used to compare means of two dependent samples.

6.4.6.1 One Sample t-Test


This is the simplest type of t-test; it determines the difference between the mean of a
data set and a hypothesized value. In this test, the mean of a single sample is computed

and is compared with a given value of interest. The aim of the one sample t-test is to
find whether there is sufficient evidence to conclude that the mean of a given sample
differs from a specified value. For example, a one sample t-test can be used to determine
whether the average increase in the number of comment lines per method is more than five
after improving the readability of the source code.
The assumption in the one sample t-test is that the population from which the sample is
derived must have a normal distribution. The following null and alternative hypotheses are
formed for applying the one sample t-test to a given problem:

H0: µ = µ0 (Mean of the sample is equal to the hypothesized value.)


Ha: µ ≠ µ0 (Mean of the sample is not equal to the hypothesized value.)

The t statistic is given below:

t = (µ − µ0)/(σ/√n)

where:
µ represents mean of a given sample
σ represents standard deviation
n represents sample size

The above hypotheses are based on a two-tailed t-test. The degrees of freedom (DOF) is
n − 1, as the t-test is based on the assumption that the standard deviation of the
population is equal to the standard deviation of the sample. The next step is to obtain
the significance value (p-value) and compare it with the established threshold value (α).
To obtain the p-value for the given t-statistic, the t-distribution table is referred to;
the table can only be used given the DOF.

Example 6.2:
Consider Table 6.13, where the numbers of modules for 15 software systems are shown.
We want to determine whether the population from which the sample is derived has, on
average, a number of modules different from 12.

TABLE 6.13
Number of Modules
Module No. Module# Module No. Module# Module No. Module#
S1 10 S6 35 S11 24
S2 15 S7 26 S12 23
S3 24 S8 29 S13 14
S4 29 S9 19 S14 12
S5 16 S10 18 S15 5

Solution:
The following steps are carried out to solve the example:

Step 1: Formation of hypothesis.


In this step, null (H0) and alternative (Ha) hypotheses are formed. The hypoth-
eses for the example are given below:
H0: µ = 12 (Mean of the sample is equal to 12.)
Ha: µ ≠ 12 (Mean of the sample is not equal to 12.)
Step 2: Select the appropriate statistical test.
The sample belongs to a normally distributed population, does not contain any
significant outliers, and has independent observations. The mean of the number of
modules can be tested using the one sample t-test, as the standard deviation of
the population is not known.
Step 3: Apply test and calculate p-value.
It can be seen that the mean is 19.933 and standard deviation is 8.172. For one
sample t-test, the value of t is,

t = (µ − µ0)/(σ/√n) = (19.93 − 12)/(8.17/√15) = 3.76

The DOF is 14 (15 − 1) in this example.


To obtain the p-value for a specific t-statistic, we perform the following steps,
referring to Table 6.14:
1. In the column named df, identify the row with the desired DOF.
In this example, the desired DOF is 14.
2. Now, in the desired row, mark out the t-score values between which the
computed t-score falls. In this example, the calculated t-statistic is 3.76.
This t-statistic falls beyond the t-score of 2.977.
3. Now, move upward to find the corresponding p-value for the selected t-score
for either one-tail or two-tail significance test. In this example, the signifi-
cance value for one-tail test would be <0.005, and for two-tail test it would
be <0.01.
Given 14 DOF and referring to the t-distribution table, the obtained p-value is 0.002.

TABLE 6.14
Critical Values of t-Distributions
Level of significance for one-tailed test
0.10 0.05 0.025 0.01 0.005

Level of significance for two-tailed test


df 0.20 0.10 0.05 0.02 0.01
1 3.078 6.314 12.706 31.821 63.657
2 1.886 2.920 4.303 6.965 9.925
3 1.638 2.353 3.182 4.541 5.841
⋯ ⋯ ⋯ ⋯ ⋯ ⋯
14 1.345 1.761 2.145 2.624 2.977
15 1.341 1.753 2.131 2.602 2.947
⋯ ⋯ ⋯ ⋯ ⋯ ⋯
120 1.289 1.658 1.980 2.358 2.617
∞ 1.282 1.645 1.960 2.326 2.576

Step 4: Define significance level.


After obtaining the p-value, we need to decide the threshold or α value. It can be
seen that the results are statistically significant at the 0.01 significance value
for a two-tailed test and at 0.005 for a one-tailed test.
It is important to note that we will apply two-tailed significance in all other
examples of this chapter.
Step 5: Derive conclusions.
As computed in Step 4, the results are statistically significant at the 0.01 α value.
Hence, we reject the null hypothesis and conclude that the average number of modules in a
given software system is statistically significantly different from 12 (t = 3.76, p = 0.002).
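The same result can be reproduced with scipy, as sketched below using the data of Table 6.13.

```python
# Sketch: one sample t-test for Example 6.2 (two-tailed).
from scipy import stats

modules = [10, 15, 24, 29, 16, 35, 26, 29, 19, 18, 24, 23, 14, 12, 5]
t, p = stats.ttest_1samp(modules, popmean=12)
print(t, p)  # t is approximately 3.76, p is approximately 0.002
```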

6.4.6.2 Two Sample t-Test


The two sample (independent sample) t-test determines the difference between the unknown
means of two populations based on independent samples drawn from the two populations. If
the means of the two samples are different from each other, we conclude that the
populations are different from each other. The samples are either derived from two
different populations, or the population is divided into two random subgroups and the
samples are derived from these subgroups, where each group is subjected to a different
treatment (or technique). In both cases, it is necessary that the two samples are
independent of each other. The hypotheses for the application of this variant of the
t-test can be formulated as given below:

H0: µ1 = µ2 (There is no difference in the mean values of both the samples.)


Ha: µ1 ≠ µ2 (There is difference in the mean values of both the samples.)

The t-statistic for the two sample t-test is given as,

t = (µ1 − µ2)/√(σ1²/n1 + σ2²/n2)
where:
µ1 and µ2 are the means of both the samples, respectively
σ1 and σ2 are the standard deviations of both the samples, respectively

The DOF is n1 + n2 − 2, where n1 and n2 are the sample sizes of the two samples. Now,
obtain the significance value (p-value) for the computed t-statistic using the
t-distribution and compare it with the established threshold value (α).
Example 6.3:
Consider an example comparing the properties of industrial and open source software
in terms of the average amount of coupling between modules (the other modules to which
a module is coupled). Both software systems are text editors developed in the Java
language. In this example, we believe that the type of software affects the amount of
coupling between modules.

Industrial: 150, 140, 172, 192, 186, 180, 144, 160, 188, 145, 150, 141
Open source: 138, 111, 155, 169, 100, 151, 158, 130, 160, 156, 167, 132

Solution:
Step 1: Formation of hypothesis.
In this step, null (H0) and alternative (Ha) hypotheses are formed. The hypoth-
eses for the example are given below:

TABLE 6.15
Descriptive Statistics
Descriptive Statistic Industrial Software Open Source Software
No. of observations 12 12
Mean 162.33 143.92
Standard deviation 20.01 21.99

H0: µ1 = µ2 (There is no difference in the mean amount of coupling between modules depicted by the industrial and open source data sets.)
Ha: µ1 ≠ µ2 (There is a difference in the mean amount of coupling between modules depicted by the industrial and open source data sets.)
Step 2: Select the appropriate statistical test.
As we are using two samples derived from different populations, one sample from the industrial software and the other from the open source software, the samples are independent. Also, the test variable, amount of coupling between modules, is measured at the continuous/interval measurement level. Hence, we use the two sample t-test to compare the difference between the average amount of coupling between modules derived from the two independent samples.
Step 3: Apply test and calculate p-value.
The summary of descriptive statistic of each sample is given in Table 6.15.
The t-statistic is given below:

$$ t = \frac{\mu_1 - \mu_2}{\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}} = \frac{162.33 - 143.92}{\sqrt{20.01^2/12 + 21.99^2/12}} = 2.146 $$
The DOF is 22 (12 + 12 − 2) in this example. Given 22 DOF and referring to the t-distribution table, the obtained p-value is 0.043.
Step 4: Define significance level.
As computed in Step 3, the p-value is 0.043. It can be seen that the results are
statistically significant at 0.05 significance value.
Step 5: Derive conclusions.
The results are significant at the 0.05 significance level. Hence, we reject the null hypothesis, and the results show that the mean amount of coupling between modules depicted by the industrial software is statistically significantly different from the mean amount of coupling between modules depicted by the open source software (t = 2.146, p = 0.043).
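The same result can be reproduced with `scipy.stats.ttest_ind`, sketched below. With `equal_var=True`, SciPy applies the pooled-variance form of the test, which for equal sample sizes yields the same t-statistic and uses n1 + n2 − 2 = 22 DOF.

```python
from scipy import stats

industrial = [150, 140, 172, 192, 186, 180, 144, 160, 188, 145, 150, 141]
open_source = [138, 111, 155, 169, 100, 151, 158, 130, 160, 156, 167, 132]

# Two sample (independent) t-test with pooled variance (DOF = 22).
t_stat, p_value = stats.ttest_ind(industrial, open_source, equal_var=True)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")  # expected: t ≈ 2.146, p ≈ 0.043
```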

6.4.6.3 Paired t-Test


The paired t-test can be used if the two samples are related or paired in some manner. The samples are the same but subjected to different treatments (or techniques), or the pairs consist of before and after measurements on the same sample. We can formulate the following null and alternative hypotheses for application of the paired t-test on a given problem:

H0: µ1 − µ2 = 0 (There is no difference between the mean values of the two samples.)
Ha: µ1 − µ2 ≠ 0 (There exists difference between the mean values of the two samples.)

The measure of paired t-test is given as,

$$ t = \frac{\mu_1 - \mu_2}{\sigma_d/\sqrt{n}} $$

∑ d − ( ∑ d ) n
2 2

σd =
n−1

where:
n represents number of pairs and not total number of samples
d is difference between values of two samples

The DOF is n − 1. The p-value is obtained and compared with the established threshold
value (α) for the computed t-statistic using the t-distribution.

Example 6.4:
Consider an example where the values of the CBO (number of other classes to which a class is coupled) metric are given before and after applying a refactoring technique to improve the quality of the source code. The data is given in Table 6.16.
Solution:
Step 1: Formation of hypothesis.
In this step, null (H0) and alternative (Ha) hypotheses are formed. The hypoth-
eses for the example are given below:
H0: µCBO1 = µCBO2 (Mean of CBO metric before and after applying refactoring
are equal.)
Ha: µCBO1 ≠ µCBO2 (Mean of CBO metric before and after applying refactoring
are not equal.)
Step 2: Select the appropriate statistical test.
The samples are extracted from populations with normal distribution. As
we are using samples derived from the same populations and analyzing the
before and after effect of refactoring on CBO, these are related samples. We
need to use paired t-test for comparing the difference between values of CBO
derived from two dependent samples.
Step 3: Apply test and calculate p-value.
We first calculate the mean values of both the samples and also calculate the
difference (d) among the paired values of both the samples as shown in Table 6.17.
The t-statistic is given below:

$$ \sigma_d = \sqrt{\frac{\sum d^2 - \left(\sum d\right)^2/n}{n-1}} = \sqrt{\frac{12 - (8)^2/15}{14}} = 0.743 $$

$$ t = \frac{\mu_1 - \mu_2}{\sigma_d/\sqrt{n}} = \frac{67.6 - 67.07}{0.743/\sqrt{15}} = 2.779 $$

The DOF is 14 (15 − 1) in this example. Given 14 DOF and referring to the t-distribution table, the obtained p-value is 0.015.

TABLE 6.16
CBO Values
CBO before refactoring 45 48 49 52 56 58 66 67 74 75 81 82 83 88 90
CBO after refactoring 43 47 49 52 56 57 66 67 74 73 80 82 83 87 90

TABLE 6.17
CBO Values
CBO before Refactoring CBO after Refactoring Differences (d)
45 43 2
48 47 1
49 49 0
52 52 0
56 56 0
58 57 1
66 66 0
67 67 0
74 74 0
75 73 2
81 80 1
82 82 0
83 83 0
88 87 1
90 90 0
µCBO1 = 67.6 µCBO2 = 67.07

Step 4: Define significance level.


The computed p-value is 0.015, which is less than α = 0.05. Thus, the result is significant at α = 0.05.
Step 5: Derive conclusions.
Fifteen classes were selected and the CBO metric was calculated for these classes. The mean CBO was found to be 67.6. With the aim to improve the quality of these classes, the software developer applied the refactoring technique on these classes. The mean of the CBO metric after applying refactoring was found to be 67.07. The reduction in the mean is statistically significant at the 0.05 significance level (p-value < 0.05). Hence, the software developer can reject the null hypothesis and conclude that there is a statistically significant improvement (reduction) in the mean value of the CBO metric after the refactoring technique is applied (t = 2.779, p = 0.015).
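The paired computation can be verified with `scipy.stats.ttest_rel`, which operates on the two related samples directly; a minimal sketch is given below.

```python
from scipy import stats

cbo_before = [45, 48, 49, 52, 56, 58, 66, 67, 74, 75, 81, 82, 83, 88, 90]
cbo_after  = [43, 47, 49, 52, 56, 57, 66, 67, 74, 73, 80, 82, 83, 87, 90]

# Paired (related samples) t-test on the before/after CBO values (DOF = 14).
t_stat, p_value = stats.ttest_rel(cbo_before, cbo_after)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")  # expected: t ≈ 2.779, p ≈ 0.015
```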

6.4.7 Chi-Squared Test


It is a nonparametric test, symbolically denoted as χ² (pronounced "kai-square"). This test is used when the attributes are categorical (nominal or ordinal). It measures the distance of the observed values from the null expectations. The purpose of this test is to

• Test the interdependence between attributes.


• Test the goodness-of-fit of models.
• Test the significance of attributes for attribute selection or attribute ranking.
• Test whether the data follows normal distribution or not.
The χ² statistic calculates the difference between the observed and expected frequencies and is given as

$$ \chi^2 = \sum \frac{(O_{ij} - E_{ij})^2}{E_{ij}} $$

where:
Oij is the observed frequency of the cell in the ith row and jth column
Eij is the expected frequency of the cell in the ith row and jth column

The expected frequency is calculated as below:

$$ E_{row,column} = \frac{N_{row} \times N_{column}}{N} $$

where:
N is the total number of observations (grand total)
Nrow is the total of all observations in a specific row
Ncolumn is the total of all observations in a specific column
Erow,column is the expected frequency of the cell in the corresponding row and column

The larger the difference between the observed and the expected values, the greater is the deviation from the stated null hypothesis. The DOF is (rows − 1) × (columns − 1) for any given table. The expected values are calculated for each category of one categorical variable at each level of the other categorical variable. Then, calculate the χ² value for each cell. After calculating the individual χ² values, add the individual χ² values of all the cells to obtain an overall χ² value. The overall χ² value is compared with the tabulated value for (rows − 1) × (columns − 1) DOF. If the calculated χ² value is greater than the tabulated χ² value at critical value α, we reject the null hypothesis.

Example 6.5:
Consider Table 6.18 that consists of data for a particular software. It states the catego-
rization of modules according to three maintenance levels (high, medium, and low)
and according to the number of LOC (high and low). A researcher wants to investigate
whether LOC and maintenance level are independent of each other or not.

Step 1: Formation of hypothesis.


In this step, null (H0) and alternative (Ha) hypotheses are formed. The hypoth-
eses for the example are given below:
H0: LOC and maintenance level are independent of each other.
Ha: LOC and maintenance level are not independent of each other.
Step 2: Select the appropriate statistical test.
The attributes explored in the example “maintenance level” and “LOC”
are ordinal. The data can be arranged in a bivariate table to investigate the

TABLE 6.18
Categorization of Modules
Maintenance Level
High Low Medium Total
LOC High 23 40 22 85
Low 17 30 20 67
Total 40 70 42 152

relationship between the two attributes. Thus, the chi-square test is an appropriate test for checking the independence of the two attributes.
Step 3: Apply test and calculate p-value.
Calculate the expected frequency of each cell according to the following formula:

$$ E_{row,column} = \frac{N_{row} \times N_{column}}{N} $$

Table 6.19 shows the calculated expected frequency of each cell.


Now, calculate the chi-square value for each cell according to the following
formula as shown in Table 6.20:

$$ \chi^2 = \sum \frac{(O_{ij} - E_{ij})^2}{E_{ij}} $$

Finally, calculate the overall χ² value by adding the corresponding χ² values of all the cells:

χ² = 0.017 + 0.018 + 0.093 + 0.022 + 0.023 + 0.118 = 0.291

Step 4: Define significance level.
The DOF = (rows − 1) × (columns − 1) = (2 − 1) × (3 − 1) = 2. Given 2 DOF and referring to the χ²-distribution table, the obtained p-value is 0.862, and the tabulated χ² value at α = 0.05 is 5.991. As the calculated χ² value (0.291) is less than the tabulated value, the results are not statistically significant at the 0.05 significance level.
Step 5: Derive conclusions.
The results are not statistically significant at the 0.05 significance level. Hence, we accept the null hypothesis, and the results show that the two attributes "maintenance level" and "LOC" are independent (χ² = 0.291, p = 0.862).

TABLE 6.19
Calculation of Expected Frequency
                          Maintenance Level
           High                    Low                     Medium
LOC High   85 × 40/152 = 22.36     85 × 70/152 = 39.14     85 × 42/152 = 23.48
    Low    67 × 40/152 = 17.63     67 × 70/152 = 30.85     67 × 42/152 = 18.52

TABLE 6.20
Calculation of χ² Values
                          Maintenance Level
           High                          Low                           Medium
LOC High   (23 − 22.36)²/22.36 = 0.017   (40 − 39.14)²/39.14 = 0.018   (22 − 23.48)²/23.48 = 0.093
    Low    (17 − 17.63)²/17.63 = 0.022   (30 − 30.85)²/30.85 = 0.023   (20 − 18.52)²/18.52 = 0.118
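The full computation of Example 6.5 (expected frequencies, χ² statistic, DOF, and p-value) can be reproduced with `scipy.stats.chi2_contingency`, as sketched below; small differences from the hand-rounded values are expected.

```python
from scipy.stats import chi2_contingency

# Observed frequencies from Table 6.18 (rows: LOC high/low,
# columns: maintenance level high/low/medium).
observed = [[23, 40, 22],
            [17, 30, 20]]

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.3f}, p = {p:.3f}, DOF = {dof}")
# expected output: chi2 ≈ 0.296, p ≈ 0.862, DOF = 2
# (the hand computation rounds intermediate values, giving 0.291)
```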

Example 6.6:
Analyze the performance of four algorithms when applied on a single data set as given
in Table 6.21. Evaluate whether there is any significant difference in the performance of
the four algorithms at 5% significance level.
Solution:
Step 1: Formation of hypothesis.
The hypotheses for the example are given below:
H0: There is no significant difference in the performance of the algorithms.
Ha: There is significant difference in the performance of the algorithms.
Step 2: Select the appropriate statistical test.
To explore the “goodness-of-fit” of different algorithms when applied on a
specific data set, we can effectively use chi-square test.
Step 3: Apply test and calculate p-value.
Calculate the expected frequency according to the following formula:

$$ E = \frac{\sum_{i=1}^{n} O_i}{n} $$

where:
Oi is the observed value of ith observation
n is the total number of observations

$$ E = \frac{81 + 61 + 92 + 43}{4} = 69.25 $$

Next, we calculate the individual χ² values as shown in Table 6.22.

TABLE 6.21
Performance Values of Algorithms
Algorithm Performance
A1 81
A2 61
A3 92
A4 43

TABLE 6.22
Calculation of χ² Values
Algorithm   Observed Frequency (Oi)   Expected Frequency (Ei)   (Oi − Ei)   (Oi − Ei)²   (Oi − Ei)²/Ei
A1          81                        69.25                     11.75       138.06       1.99
A2          61                        69.25                     −8.25       68.06        0.98
A3          92                        69.25                     22.75       517.56       7.47
A4          43                        69.25                     −26.25      689.06       9.95

Now,

$$ \chi^2 = \sum \frac{(O_i - E_i)^2}{E_i} = 20.40 $$

The DOF is n − 1, that is, 4 − 1 = 3. Given 3 DOF and referring to the χ²-distribution table, the tabulated χ² value at α = 0.05 is 7.815, and the obtained p-value is 0.0001.
Step 4: Define significance level.
It can be seen that the results are statistically significant at 0.05 significance
value as the obtained p-value in Step 3 is less than 0.05.
Step 5: Derive conclusions.
The results are significant at the 0.05 significance level. Hence, we reject the null hypothesis, and the results show that there is a significant difference in the performance of the four algorithms (χ² = 20.40, p = 0.0001).
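The goodness-of-fit variant can be verified with `scipy.stats.chisquare`, which by default takes the mean of the observations as the expected frequency, exactly as in the hand computation above.

```python
from scipy.stats import chisquare

# Observed performance of the four algorithms (Table 6.21); the expected
# frequency defaults to the mean of the observations (69.25 here).
observed = [81, 61, 92, 43]

chi2, p = chisquare(observed)
print(f"chi2 = {chi2:.2f}, p = {p:.5f}")  # expected: chi2 ≈ 20.40, p ≈ 0.0001
```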

Example 6.7:
Consider a scenario where a researcher wants to find the importance of SLOC metric,
in deciding whether a particular class having more than 50 source LOC (SLOC) will
be defective or not. The details of defective and not defective classes are provided in
Table 6.23. Test the result at 0.05 significance value.
Solution:
Step 1: Formation of hypothesis.
The null and alternative hypotheses are formed as follows:
H0: The proportion of defective classes is the same for classes having SLOC ≥ 50 and classes having SLOC < 50 (SLOC and defect proneness are independent).
Ha: Classes having SLOC ≥ 50 differ in defect proneness from classes having SLOC < 50 (SLOC and defect proneness are associated).
Step 2: Select the appropriate statistical test.
To investigate the importance of SLOC attribute in detection of defective and
not defective classes, we can appropriately use chi-square test to find an attri-
bute’s importance.
Step 3: Apply test and calculate p-value.
Calculate the expected frequency of each cell according to the following formula:

$$ E_{row,column} = \frac{N_{row} \times N_{column}}{N} $$

Table 6.24 shows the observed and the calculated expected frequency of each cell. We then calculate the individual χ² value of each cell.
Now,

$$ \chi^2 = \sum \frac{(O_{ij} - E_{ij})^2}{E_{ij}} = 200 $$

The DOF = (rows − 1) × (columns − 1) = (2 − 1) × (2 − 1) = 1. Given 1 DOF and referring to the χ²-distribution table, the obtained p-value is less than 0.00001.

TABLE 6.23
SLOC Values for Defective and Not Defective Classes
Defective (D) Not Defective (ND) Total
Number of classes having SLOC ≥ 50 200 200 400
Number of classes having SLOC < 50 100 700 800
Total 300 900 1,200

TABLE 6.24
Calculation of Expected Frequency
Observed Frequency (Oij)   Expected Frequency (Eij)   (Oij − Eij)   (Oij − Eij)²   (Oij − Eij)²/Eij
200                        400 × 300/1200 = 100       100           10,000         100
200                        400 × 900/1200 = 300       −100          10,000         33.33
100                        800 × 300/1200 = 200       −100          10,000         50
700                        800 × 900/1200 = 600       100           10,000         16.67

Step 4: Define significance level.
The tabulated χ² value at α = 0.05 for 1 DOF is 3.841. As the calculated χ² value (200) far exceeds the tabulated value and the computed p-value is less than 0.00001, the results are statistically significant at α = 0.05.
Step 5: Derive conclusions.
The results are significant at the 0.05 significance level. Hence, we reject the null hypothesis, and the results show that classes having SLOC ≥ 50 are significantly more likely to be defective than classes having SLOC < 50 (χ² = 200, p < 0.00001).
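A quick cross-check with `scipy.stats.chi2_contingency` is sketched below; `correction=False` disables Yates' continuity correction so that the result matches the hand computation for this 2 × 2 table.

```python
from scipy.stats import chi2_contingency

# Observed frequencies from Table 6.23 (rows: SLOC >= 50 / SLOC < 50,
# columns: defective / not defective).
observed = [[200, 200],
            [100, 700]]

# correction=False matches the hand computation (no Yates' correction).
chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(f"chi2 = {chi2:.2f}, p = {p:.2e}, DOF = {dof}")
# expected: chi2 = 200.0, p << 0.00001, DOF = 1
```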

Example 6.8:
Consider a scenario where 40 students have developed the same program. The size of each program is measured in terms of LOC and is provided in Table 6.25. Evaluate whether the size values of the programs developed by the 40 students follow a normal distribution.
Solution:
Step 1: Formation of hypothesis.
The null and alternative hypotheses are as follows:
H0: The data follows a normal distribution.
Ha: The data does not follow a normal distribution.
Step 2: Select the appropriate statistical test.
In the case of the normal distribution, there are two parameters, the mean (µ)
and the standard deviation (σ) that can be estimated from the data. Based on
the data, µ = 793.125 and σ = 64.81. To test the normality of data, we can use
chi-square test.
Step 3: Apply test and calculate p-value.
We first need to divide the data into segments in such a way that each segment has the same probability of including a value if the data actually is normally distributed with mean µ and standard deviation σ.

TABLE 6.25
LOC Values
641 672 811 770 741 854 891 792 753 876
801 851 744 948 777 808 758 773 734 810
833 704 846 800 799 724 821 757 865 813
721 710 749 932 815 784 812 837 843 755

We divide the data into 10 segments and find the upper and lower limits of all the segments. To find the upper limit (xi) of the ith segment, the following equation is used:

$$ P(X < x_i) = \frac{i}{10} $$

where:
i = 1, 2, …, 9
X is N(µ, σ²)

which in terms of the standard normal distribution corresponds to

$$ P(X_s < z_i) = \frac{i}{10} $$

where:
i = 1, 2, …, 9
Xs is N(0, 1)
zi = (xi − µ)/σ

Using the standard normal table, we can calculate the values of zi. We can then calculate the value of xi using the following equation:

$$ x_i = \sigma z_i + \mu $$

The calculated values of zi and xi are given in Table 6.26. Since a normally distributed variable theoretically ranges from −∞ to +∞, the lower limit of segment 1 is taken as −∞ and the upper limit of segment 10 is taken as +∞. The number of values that fall in each segment is also shown in the table. They represent the observed frequency (Oi). The expected number of values (Ei) in each segment is 40/10 = 4.
Now,

$$ \chi^2 = \sum \frac{(O_i - E_i)^2}{E_i} = \frac{20}{4} = 5 $$

TABLE 6.26
Segments and χ2 Calculation
Segment No.   zi       Lower Limit   Upper Limit   Oi   Ei   (Oi − Ei)²
1 −1.28 −∞ 710.17 4 4 0
2 −0.84 710.17 738.68 3 4 1
3 −0.525 738.68 759.10 7 4 9
4 −0.255 759.10 776.60 2 4 4
5 0 776.60 793.13 3 4 1
6 0.255 793.13 809.65 4 4 0
7 0.525 809.65 827.15 6 4 4
8 0.84 827.15 847.56 4 4 0
9 1.28 847.56 876.08 4 4 0
10 – 876.08 +∞ 3 4 1

DOF = n − e − 1, where e is the number of parameters that must be estimated from the data (here, the mean [µ] and standard deviation [σ], so e = 2) and n is the number of segments. In our example, DOF = 10 − 2 − 1 = 7. The computed p-value is 0.660.
Step 4: Define significance level.
At the 0.05 significance level, the tabulated χ² value for 7 DOF is 14.07. Since the tabulated value of χ² is greater than the calculated value (5), the results are not significant. The obtained p-value (0.660) is also well above α = 0.05.
Step 5: Derive conclusions.
The results are not significant at the 0.05 significance level. Hence, we accept the null hypothesis, which means that the data follows a normal distribution (χ² = 5, p = 0.660).
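The segment-based normality check can also be scripted. The sketch below computes the decile boundaries of the fitted normal distribution with `scipy.stats.norm.ppf`, bins the data, and passes `ddof=2` to `scipy.stats.chisquare` to account for the two estimated parameters (DOF = 10 − 1 − 2 = 7).

```python
import numpy as np
from scipy import stats

loc = [641, 672, 811, 770, 741, 854, 891, 792, 753, 876,
       801, 851, 744, 948, 777, 808, 758, 773, 734, 810,
       833, 704, 846, 800, 799, 724, 821, 757, 865, 813,
       721, 710, 749, 932, 815, 784, 812, 837, 843, 755]

mu, sigma = 793.125, 64.81  # parameters estimated from the data

# Segment boundaries: deciles of N(mu, sigma^2), closed by -inf and +inf.
edges = np.concatenate(([-np.inf],
                        stats.norm.ppf(np.arange(1, 10) / 10, mu, sigma),
                        [np.inf]))
observed, _ = np.histogram(loc, bins=edges)

# ddof=2 accounts for the two estimated parameters (mu and sigma).
chi2, p = stats.chisquare(observed, f_exp=np.full(10, 4.0), ddof=2)
print(f"chi2 = {chi2:.1f}, p = {p:.3f}")  # expected: chi2 = 5.0, p ≈ 0.66
```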

6.4.8 F-Test
F-test is used to investigate the equality of variance for two populations. A number
of assumptions need to be checked for application of F-test, which includes the follow-
ing (Kothari 2004):

1. The samples should be drawn from normally distributed populations.


2. All the observations should be independent.
3. The samples are randomly drawn from respective populations and there is no
measurement error.

We can formulate the following null and alternative hypotheses for the application of
F-test on a given problem with two populations:

H0: σ1² = σ2² (Variances of two populations are equal.)
Ha: σ1² ≠ σ2² (Variances of two populations are not equal.)

To test the above stated hypothesis, we compute the F-statistic as follows:

$$ F = \frac{\sigma_{sample1}^2}{\sigma_{sample2}^2} $$

The variance of a sample can be computed by the following formula:

$$ \sigma_{sample}^2 = \frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n-1} $$

where:
n represents the number of observations in a sample
xi represents the ith observation of the sample
µ represents the mean of the sample observations

We also designate v1 as the DOF in the sample having greater variance and v2 as the DOF in the
other sample. The DOF is designated as one less than the number of observations in the cor-
responding sample. For example, if there are 5 observations in a sample, then the DOF is des-
ignated as 4 (5 − 1). The calculated value of F is compared with tabulated Fα (v1, v2) value at the
desired α value. If the calculated F-value is greater than Fα, we reject the null hypothesis (H0).

TABLE 6.27
Runtime Performance of Learning Techniques
A1 11 16 10 4 8 13 17 18 5
A2 14 17 9 5 7 11 19 21 4

Example 6.9:
Consider Table 6.27 that shows the runtime performance (in seconds) of two learning
techniques (A1 and A2) on several data sets. We want to test whether the populations
have the same variances.
Solution:
Step 1: Formation of hypothesis.
In this step, null (H0) and alternative (Ha) hypotheses are formed. The hypoth-
eses for the example are given below:
H0: σ1² = σ2² (Variances of two populations are equal.)
Ha: σ1² ≠ σ2² (Variances of two populations are not equal.)
Step 2: Select the appropriate statistical test.
The samples belong to normal populations and are independent in nature.
Thus, to investigate the equality of variances of two populations, we use F-test.
Step 3: Apply test and calculate p-value.
In this example, n1 = 9 and n2 = 9. The calculation of two sample variances is as
follows:
We first compute the means of the two samples,

µ1 = 11.33 and µ2 = 11.89


$$ \sigma_1^2 = \frac{\sum_{i=1}^{9}(x_i - \mu_1)^2}{n_1 - 1} = \frac{(11 - 11.33)^2 + \dots + (5 - 11.33)^2}{9 - 1} = 26 $$

$$ \sigma_2^2 = \frac{\sum_{i=1}^{9}(x_i - \mu_2)^2}{n_2 - 1} = \frac{(14 - 11.89)^2 + \dots + (4 - 11.89)^2}{9 - 1} = 38.36 $$

Now, compute the F-statistic:

$$ F = \frac{\sigma_2^2}{\sigma_1^2} = \frac{38.36}{26} = 1.47 \quad (\text{because } \sigma_2^2 > \sigma_1^2) $$

The DOF of sample 2, which has the greater variance, is v1 = 8 (numerator), and the DOF of sample 1 is v2 = 8 (denominator).
The computed p-value is 0.299.
Step 4: Define significance level.
We look up the tabulated value of the F-distribution with v1 = 8 and v2 = 8 at α = 0.05, which is 3.44. The calculated value of F (F = 1.47) is less than the tabulated value and, as obtained in Step 3, the computed p-value is 0.299. The results are not significant at α = 0.05.
Step 5: Derive conclusions.
Because the calculated value of F is less than the tabulated value, we accept the null hypothesis. Thus, we conclude that the variances in runtime performance of the two techniques do not differ significantly (F = 1.47, p = 0.299).
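SciPy has no dedicated two-sample variance F-test, so the sketch below computes the variance ratio directly and obtains the one-tailed p-value from the F-distribution with `scipy.stats.f.sf`.

```python
import numpy as np
from scipy import stats

a1 = [11, 16, 10, 4, 8, 13, 17, 18, 5]
a2 = [14, 17, 9, 5, 7, 11, 19, 21, 4]

var1 = np.var(a1, ddof=1)  # sample variance of A1 (26.0)
var2 = np.var(a2, ddof=1)  # sample variance of A2 (38.36)

# Larger variance in the numerator, as in the worked example.
f_stat = max(var1, var2) / min(var1, var2)
dfn = dfd = len(a1) - 1  # both samples have 9 observations, so DOF = 8 each

# One-tailed p-value from the F-distribution.
p_value = stats.f.sf(f_stat, dfn, dfd)
print(f"F = {f_stat:.2f}, p = {p_value:.3f}")  # expected: F ≈ 1.47, p ≈ 0.30
```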

6.4.9 Analysis of Variance Test


Analysis of variance (ANOVA) test is a method used to determine the equality of sample
means for three or more populations. The variation in data can be attributed to two
reasons: chance or just specific causes (Harnett and Murphy 1980). ANOVA test helps
in determining whether the cause of variance is “specific” or just by chance. It splits up
the variance into “within samples” and “between samples.” A “within sample” variance
is attributed to just random effects and other influences that cannot be explained.
However, a “between samples” variance is attributed to a “specific factor,” which can
also be termed as the “treatment effect” (Kothari 2004). This helps a researcher in draw-
ing conclusions about different factors that can affect the dependent variable outcome.
However, the ANOVA test only indicates that there is difference among different groups,
but not which specific group is different. The various assumptions required for the use of the ANOVA test are as follows:

1. The populations from which the samples (observations) are extracted should be normally distributed.
2. The variance of the outcome variable should be equal for all the populations.
3. The observations should be independent.

We also assume that all the other factors except the ones that are being investigated
are adequately controlled, so that the conclusions can be appropriately drawn. One-
way ANOVA, also called the single factor ANOVA, considers only one factor for analy-
sis in the outcome of the dependent variable. It is used for a completely randomized
design.
In general, we calculate two variance estimates, one "within samples" and the other "between samples." Finally, we compute the F-value with these two variance estimates as follows:

$$ F = \frac{\text{Variance between samples}}{\text{Variance within samples}} $$

The computed F-value is then compared with the F-limit for specific DOF. If the computed
F-value is greater than the F-limit value, then we can conclude that the sample means
differ significantly.

6.4.9.1 One-Way ANOVA


This test is used to determine whether various sample means are equal for a quan-
titative outcome variable and a single categorical factor (Seltman 2012). The factor
may have two or more number of levels. These levels are called “treatments.” All the
subjects are exposed to only one level of treatment at a time. For example, one-way
ANOVA can be used to determine whether the performance of different techniques
(factors) vary significantly from each other when applied on a number of data sets. It is
analogous to two independent samples t-test and is applied when we want to investi-
gate the equality of means of more than two samples; otherwise independent samples

t-test is sufficient. We can formulate the following null and alternative hypotheses for
application of one-way ANOVA on a given problem:

H0: µ1 = µ2 = µ3 = ⋯ = µk (Means of all the samples are equal.)
Ha: µ1 ≠ µ2 ≠ µ3 ≠ ⋯ ≠ µk (Means of all the samples are not equal, i.e., at least the mean value of one sample is different from the others.)

The steps for computing the F-statistic are as follows. Here, we assume k is the number of samples and n is the total number of observations:

Step a: Calculate the means of each of the samples: µ1, µ2, µ3, …, µk.
Step b: Calculate the mean of the sample means:

$$ \mu = \frac{\mu_1 + \mu_2 + \mu_3 + \dots + \mu_k}{k} $$

Step c: Calculate the sum of squares of variance between the samples (SSBS):

$$ SSBS = n_1(\mu_1 - \mu)^2 + n_2(\mu_2 - \mu)^2 + n_3(\mu_3 - \mu)^2 + \dots + n_k(\mu_k - \mu)^2 $$

Step d: Calculate the sum of squares of variance within samples (SSWS). To obtain SSWS, we find the deviation of each sample observation from its corresponding sample mean and square the obtained deviations. We then sum all the squared deviation values to obtain SSWS:

$$ SSWS = \sum_i (x_{1i} - \mu_1)^2 + \sum_i (x_{2i} - \mu_2)^2 + \sum_i (x_{3i} - \mu_3)^2 + \dots + \sum_i (x_{ki} - \mu_k)^2 $$
Step e: Calculate the sum of squares for total variance (SSTV).

SSTV = SSBS + SSWS

Step f: Calculate the mean square between samples (MSBS) and the mean square within samples (MSWS), and set up an ANOVA summary as shown in Table 6.28. The calculated value of F is compared with the tabulated Fα (k − 1, n − k) value at the desired α value. If the calculated F-value is greater than Fα, we reject the null hypothesis (H0).

TABLE 6.28
Computation of Mean Square and F-Statistic
Source of Variation   Sum of Squares (SS)   DOF     Mean Square (MS)      F-Ratio
Between samples       SSBS                  k − 1   MSBS = SSBS/(k − 1)   F-ratio = MSBS/MSWS
Within samples        SSWS                  n − k   MSWS = SSWS/(n − k)
Total                 SSTV                  n − 1

TABLE 6.29
Accuracy Values of Techniques
Techniques
Data Sets A1 A2 A3
D1 60 (x11) 50 (x12) 40 (x13)
D2 40 (x21) 50 (x22) 40 (x23)
D3 70 (x31) 40 (x32) 50 (x33)
D4 80 (x41) 70 (x42) 30 (x43)

Example 6.10:
Consider Table 6.29 that shows the performance values (accuracy) of three techniques
(A1, A2, and A3), which are applied on four data sets (D1, D2, D3, and D4) each. We want
to investigate whether the performance of all the techniques calculated in terms of accu-
racy (refer to Section 7.5.3 for definition of accuracy) are equivalent.

Solution:
The following steps are carried out to solve the example.

Step 1: Formation of hypothesis.


In this step, null (H0) and alternative (Ha) hypotheses are formed. The hypoth-
eses are given below:
H0: µ1 = µ2 = µ3 (Means of all the samples are equal, i.e., all techniques work equally well.)
Ha: µ1 ≠ µ2 ≠ µ3 (Means of all the samples are not equal, i.e., at least the mean value of one technique is different from the others.)
Step 2: Select the appropriate statistical test.
The given hypothesis checks the means of more than two sample populations.
The data is normally distributed, and the homogeneity of variance of outcome
variables is checked. The observations are independent, that is, at a time only
one treatment is applied on a specific data set. Thus, we use one-way ANOVA
to test the hypothesis as only one factor (technique) is used to determine the
outcome (performance).
Step 3: Apply test and calculate p-value.
Step a: Calculate the means of each of the samples.

µ1 = (60 + 40 + 70 + 80)/4 = 62.5; µ2 = (50 + 50 + 40 + 70)/4 = 52.5; µ3 = (40 + 40 + 50 + 30)/4 = 40
Step b: Calculate the mean of the sample means.

$$ \mu = \frac{\mu_1 + \mu_2 + \mu_3 + \dots + \mu_k}{k} = \frac{62.5 + 52.5 + 40}{3} = 51.67 $$
3
Step c: Calculate the SSBS.

$$ SSBS = 4(62.5 - 51.67)^2 + 4(52.5 - 51.67)^2 + 4(40 - 51.67)^2 = 1016.68 $$

Step d: Calculate the SSWS.

$$ SSWS = [(60 - 62.5)^2 + \dots + (80 - 62.5)^2] + [(50 - 52.5)^2 + \dots + (70 - 52.5)^2] + [(40 - 40)^2 + \dots + (30 - 40)^2] = 1550 $$

Step e: Calculate the SSTV.


SSTV = SSBS + SSWS = 1016.68 + 1550 = 2566.68

Step f: Calculate MSBS and MSWS, and setup an ANOVA summary as shown
in Table 6.30.
The DOF for between sample variance is 2 and that for within sample vari-
ance is 9. For the corresponding DOF, we compute the F-value using the
F-distribution table and obtain the p-value as 0.103.
Step 4: Define significance level.
After obtaining the p-value in Step 3, we need to decide the threshold or α
value. The calculated value of F at Step 3 is 2.95, which is less than the tabu-
lated value of F (4.26) with DOF being v1 = 2 and v2 = 9 at 5% level. Thus, the
results are not statistically significant at 0.05 significance value.
Step 5: Derive conclusions.
As the results are not statistically significant at 0.05 significance value, we
accept the null hypothesis, which states that there is no difference in sample
means and all the three techniques perform equally well. The difference in
observed values of the techniques is only because of sampling fluctuations
(F = 2.95, p = 0.103).
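The entire ANOVA computation of this example can be reproduced with `scipy.stats.f_oneway`, as sketched below.

```python
from scipy.stats import f_oneway

a1 = [60, 40, 70, 80]  # accuracy of technique A1 on D1-D4
a2 = [50, 50, 40, 70]  # accuracy of technique A2 on D1-D4
a3 = [40, 40, 50, 30]  # accuracy of technique A3 on D1-D4

# One-way ANOVA across the three techniques.
f_stat, p_value = f_oneway(a1, a2, a3)
print(f"F = {f_stat:.2f}, p = {p_value:.3f}")  # expected: F ≈ 2.95, p ≈ 0.103
```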

6.4.10 Wilcoxon Signed Test


Wilcoxon signed-ranks test is a nonparametric test that is used to perform pairwise
comparisons among different treatments (Wilcoxon 1945). It is also called Wilcoxon
matched pairs test and is used in the scenario of two related samples (Kothari 2004).
The Wilcoxon signed-ranks test is based on the following hypotheses:

H0: There is no statistical difference between the two treatments.


Ha: There exists a statistical difference between the two treatments.

TABLE 6.30
Computation of Mean Square and F-Statistic
Source of Variation   Sum of Squares (SS)   DOF          Mean Square (MS)            F-Ratio                    F-Limit (0.05)
Between samples       1016.68               3 − 1 = 2    MSBS = 1016.68/2 = 508.34   F = 508.34/172.22 = 2.95   F(2,9) = 4.26
Within samples        1550                  12 − 3 = 9   MSWS = 1550/9 = 172.22
Total                 2566.68               11

To perform the test, we compute the differences among the related pairs of values of the two treatments. The differences are then ranked based on their absolute values. We perform the following steps while assigning ranks to the differences:

1. Exclude the pairs where the absolute difference is 0. Let nr be the reduced number
of pairs.
2. Assign rank to the remaining nr pairs based on the absolute difference. The
smallest absolute difference is assigned a rank 1.
3. In case of ties among differences (more than one difference having the same value), each tied difference is assigned the average of the tied ranks. For example, if there are two differences of data value 5 each occupying the 7th and 8th ranks, we would assign the mean rank, that is, 7.5 ([7 + 8]/2 = 7.5), to each of the differences.

We now compute two variables, R+ and R−. R+ represents the sum of ranks assigned to differences where the first treatment outperforms the second treatment, whereas R− represents the sum of ranks assigned to differences where the second treatment outperforms the first treatment. They can be calculated by the following formulas (Demšar 2006):

$$ R^+ = \sum_{d_i > 0} \operatorname{rank}(d_i), \qquad R^- = \sum_{d_i < 0} \operatorname{rank}(d_i) $$

where:
di is the difference between the performance measures of the two treatments when applied on the ith of the n data instances

Finally, we calculate the Z-statistic as follows, where Q = minimum(R+, R−):

$$ Z = \frac{Q - \frac{1}{4}n_r(n_r+1)}{\sqrt{\frac{1}{24}n_r(n_r+1)(2n_r+1)}} $$
If the Z-statistic is in the critical region with specific level of significance, then the null
hypothesis is rejected and it is concluded that there is significant difference between two
treatments, otherwise null hypothesis is accepted.

Example 6.11:
Consider an example where a researcher wants to compare the performance of two techniques (T1 and T2) on multiple data sets using a performance measure, as given in Table 6.31. Investigate whether the performance of the two techniques measured in terms of AUC (refer to Section 7.5.6 for details on AUC) differs significantly.
Solution:
Step 1: Formation of hypothesis.
The hypotheses for the example are given below:
H0: The performance of the two techniques does not differ significantly.
Ha: The performance of the two techniques differs significantly.

TABLE 6.31
Performance Values of Techniques
Techniques
Data Sets T1 T2
D1 0.75 0.65
D2 0.87 0.73
D3 0.58 0.64
D4 0.72 0.72
D5 0.60 0.70

Step 2: Select the appropriate statistical test.


The two techniques have matched pairs as they are evaluated on the same
data sets. Moreover, the performance measurement scale is continuous. As we
need to investigate the comparative performance of the two techniques, we use
Wilcoxon signed test.
Step 3: Apply test and calculate p-value.
We assign ranks on the basis of the absolute difference between the performances of the two techniques. Here, n = 5. For each pair, the ranks are given in Table 6.32.
According to Table 6.32, we can see that nr = 4. We now compute R+ and R− as follows:

$$ R^+ = \sum_{d_i > 0} \operatorname{rank}(d_i) = 1 + 2.5 = 3.5 $$

$$ R^- = \sum_{d_i < 0} \operatorname{rank}(d_i) = 2.5 + 4 = 6.5 $$

Thus, Q = minimum(R+, R−) = 3.5. The Z-statistic can be computed as follows:

$$ Z = \frac{Q - \frac{1}{4}n_r(n_r+1)}{\sqrt{\frac{1}{24}n_r(n_r+1)(2n_r+1)}} = \frac{3.5 - \frac{1}{4} \times 4(4+1)}{\sqrt{\frac{1}{24} \times 4(4+1)(2 \times 4+1)}} = -0.549 $$

The obtained two-tailed p-value from the normal (Z) distribution table is 0.581.
Step 4: Define significance level.
The critical Z value at α = 0.05 for a two-tailed test is 1.96. As the absolute test statistic value (|Z| = 0.549) is less than the critical value, we accept the null hypothesis. The obtained p-value in Step 3 is also greater than α = 0.05. Thus, the results are not significant at critical value α = 0.05.

TABLE 6.32
Computing R+ and R−
Data Set T1 T2 di |di| Rank(di)
D1 0.75 0.65 −0.10 0.10 2.5
D2 0.87 0.73 −0.14 0.14 4
D3 0.58 0.64 0.06 0.06 1
D4 0.72 0.72 0 0 –
D5 0.60 0.70 0.10 0.10 2.5

Step 5: Derive conclusions.
As shown in Step 4, we accept the null hypothesis. Thus, we conclude that the performance of the two techniques does not differ significantly (Z = −0.549, p = 0.581).
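The test is available as `scipy.stats.wilcoxon`; a minimal sketch follows. Like the hand computation, SciPy's default zero-difference handling drops the D4 pair, and because of the tied ranks it falls back to the normal approximation.

```python
from scipy.stats import wilcoxon

t1 = [0.75, 0.87, 0.58, 0.72, 0.60]
t2 = [0.65, 0.73, 0.64, 0.72, 0.70]

# Wilcoxon signed-ranks test; the zero difference (D4) is dropped,
# and the statistic is the smaller rank sum, min(R+, R-).
stat, p_value = wilcoxon(t1, t2)
print(f"W = {stat}, p = {p_value:.3f}")  # expected: W = 3.5, p ≈ 0.58
```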

6.4.11 Wilcoxon–Mann–Whitney Test (U-Test)


This test is used to ascertain the difference between two independent samples when the
outcome variable is continuous or ordinal (Anderson et al. 2002). It is the nonparametric
equivalent of the independent samples t-test. However, the underlying data does not need
to be normal for the application of Wilcoxon–Mann–Whitney test. It is also commonly
known as Wilcoxon rank-sum test or Mann–Whitney U-test. The test investigates whether
the two samples drawn independently belong to the same population by checking the
equality of the two sample means. It can be used when sample sizes are unequal. We can
formulate the following null and alternative hypotheses for application of Wilcoxon–Mann–
Whitney test on a given problem:
H0: µ1 − µ2 = 0 (The two sample means belong to the same population and are
identical.)
Ha: µ1 − µ2 ≠ 0 (The two sample means are not equal and belong to different
populations.)
To perform the test, we need to compute the rank-sum statistics for all the observations in
the following manner. We assume that the number of observations in sample 1 is n1 and
the number of observations in sample 2 is n2. The total number of observations is denoted
by N (N = n1 + n2):

1. Arrange the data values of all the observations (both the samples) in ascending
(low to high) order.
2. Assign ranks to all the observations. The lowest value observation is provided
rank 1, the next to lowest observation is provided rank 2, and so on, with the
highest observation given the rank N.
3. In case of ties (more than one observation having the same value), each tied obser-
vation is assigned an average of tied ranks. For example: if there are three observa-
tions of data value 20 each occupying 7th, 8th, and 9th ranks, we would assign the
mean rank, that is, 8 ([7 + 8 + 9]/3 = 8) to each of the observation.
4. We then find the sum of all the ranks allotted to observations in sample 1 and
denote it with T1. Similarly, find the sum of all the ranks allotted to observations in
sample 2 and denote it as T2.
5. Finally, we compute the U-statistic by the following formula:

$$ U = n_1 n_2 + \frac{n_1(n_1+1)}{2} - T_1 \quad \text{or} \quad U = n_1 n_2 + \frac{n_2(n_2+1)}{2} - T_2 $$
It can be observed that the sum of the U-values obtained by the above two formulas is
always equal to the product of the two sample sizes (n1.n2; Hooda 2003). It should be noted

that we should use the lower computed U-value as obtained by the two equations described
above. Wilcoxon–Mann–Whitney test has two specific cases (Anderson et al. 2002; Hooda
2003): (1) when the sample sizes are small (n1 < 7, n2 < 8) or (2) when the sample sizes
are large (n1 ≥ 10, n2 ≥ 10). The p-values for the corresponding computed U-values are
interpreted as follows:

Case 1: When the sample sizes are small (n1 < 7, n2 < 8)
To decide whether we should accept or reject the null hypothesis, we should derive
the p-value from the tables shown in Appendix I. For the given values of n1 and n2,
we find a p-value that is less than or equal to the computed U-value. For example,
if the value of n1 and n2 is 4 and 5, respectively, and the computed U-value is 3, then
the p-value would be 0.056. For a two-tailed test, the U-value should be computed
for the lesser of the two computed U-values.
Case 2: When the sample sizes are large (n1 ≥ 10, n2 ≥ 10)
For sample sizes, where each sample contains 10 or more data values, the sampling
U-distribution can be approximated by the normal distribution. In this case, we can
calculate the mean (µU) and standard deviation (σU) of the normal population as
follows:

$$ \mu_U = \frac{n_1 n_2}{2}; \qquad \sigma_U = \sqrt{\frac{n_1 n_2(n_1 + n_2 + 1)}{12}} $$

Thus, the Z-statistic can be defined as,

$$ Z = \frac{U - \mu_U}{\sigma_U} $$

If the absolute value of the computed Z-statistic is greater than the tabulated Z-value at significance level α, we reject the null hypothesis. Otherwise, we accept the null hypothesis.

Example 6.12:
Consider an example for comparing the coupling values of two different software (one
open source and other academic software), to ascertain whether the two samples are
identical with respect to coupling values (coupling of a module corresponds to the
number of other modules to which a module is coupled).

Academic: 89, 93, 35, 43


Open source: 52, 38, 5, 23, 32

Solution:
Step 1: Formation of hypothesis.
In this step, null (H0) and alternative (Ha) hypotheses are formed. The hypoth-
eses for the example are given below:
H0: µ1 − µ2 = 0 (The two samples are identical in terms of coupling values.)
Ha: µ1 − µ2 ≠ 0 (The two samples are not identical in terms of coupling values.)
Step 2: Select the appropriate statistical test.
The two samples of our study are independent in nature, as they are collected
from two different software. Also, the outcome variable (amount of coupling)
is continuous or ordinal in nature. The data may not be normal. Hence, we

TABLE 6.33
Computation of Rank Statistics for
Coupling Values of Two Software
Observations Rank Sample Name
5 1 Open source
23 2 Open source
32 3 Open source
35 4 Academic
38 5 Open source
43 6 Academic
52 7 Open source
89 8 Academic
93 9 Academic

use the Wilcoxon–Mann–Whitney test for comparing the differences among


coupling values of an academic and open source software.
Step 3: Apply test and calculate p-value.
In this example, n1 = 4, n2 = 5, and N = 9. Table 6.33 shows the arrangement of
all the observations in ascending order, and the ranks allocated to them.
Sum of ranks assigned to observations in Academic software (T1) = 4 + 6 + 8 + 9 = 27.
Sum of ranks assigned to observations in open source software
(T2) = 1 + 2 + 3 + 5 + 7 = 18.
The U-statistic is given below:

$$ U = n_1 n_2 + \frac{n_1(n_1+1)}{2} - T_1 = 4 \times 5 + \frac{4(4+1)}{2} - 27 = 3 $$

$$ U = n_1 n_2 + \frac{n_2(n_2+1)}{2} - T_2 = 4 \times 5 + \frac{5(5+1)}{2} - 18 = 17 $$
We compute the p-value to be 0.056 at α = 0.05 for the values of n1 and n2 as
4 and 5, respectively, and the U-value as 3.
Step 4: Define significance level.
The derived p-value of 0.056 in Step 3 is a one-tailed probability; the corresponding two-tailed p-value is 2 × 0.056 = 0.112, which is greater than α = 0.05. Hence, we accept the null hypothesis at α = 0.05. Thus, the results are not significant at α = 0.05.
Step 5: Derive conclusions.
As shown in Step 4, we accept the null hypothesis. Thus, we conclude that
the coupling values of the academic and open source software do not differ
significantly (U = 3, p = 0.056).
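The test is available as `scipy.stats.mannwhitneyu`, sketched below. Note that SciPy reports the U-value of the first sample, so the minimum of U and n1n2 − U corresponds to the hand-computed value.

```python
from scipy.stats import mannwhitneyu

academic = [89, 93, 35, 43]
open_source = [52, 38, 5, 23, 32]

# Two-sided Mann-Whitney U-test; with no ties and small samples,
# SciPy computes the exact p-value.
u_stat, p_value = mannwhitneyu(academic, open_source, alternative='two-sided')
print(f"U = {u_stat}, p = {p_value:.3f}")
# SciPy reports U = 17 for the first sample; min(17, 20 - 17) = 3
# matches the hand-computed U. Expected two-sided p ≈ 0.111.
```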

Example 6.13:
Let us consider another example for large sample size, where we want to ascertain
whether the two sets of observations (sample 1 and sample 2) are extracted from identi-
cal populations by observing the cohesion values of the two samples.
Sample 1: 55, 40, 71, 59, 48, 40, 75, 46, 71, 72, 58, 76
Sample 2: 46, 42, 63, 54, 34, 46, 72, 43, 65, 70, 51, 70

Solution:
Step 1: Formation of hypothesis.
In this step, null (H0) and alternative (Ha) hypotheses are formed. The hypotheses for the example are given below:
H0: µ1 − µ2 = 0 (The two samples are identical in terms of cohesion values.)
Ha: µ1 − µ2 ≠ 0 (The two samples are not identical in terms of cohesion values.)
Step 2: Select the appropriate statistical test.
The two samples of our study are independent in nature as they are collected
from two different software. Also, the outcome variable (amount of cohesion) is
continuous or ordinal in nature. The data may not be normal. Hence, we use the
Wilcoxon–Mann–Whitney test for comparing the differences among cohesion
values of the two software.
Step 3: Apply test and calculate p-value.
In this example, n1 = 12, n2 = 12, and N = 24. Table 6.34 shows the arrangement
of all the observations in ascending order, and the ranks allocated to them.
Sum of ranks assigned to observations in sample 1 (T1) = 2.5 + 2.5 + 7 + 9
+ 12 + 13 + 14 + 19.5 + 19.5 + 21.5 + 23 + 24 = 167.5.
Sum of ranks assigned to observations in sample 2 (T2) = 1 + 4 + 5 + 7 + 7
+ 10 + 11 + 15 + 16 + 17.5 + 17.5 + 21.5 = 132.5.

TABLE 6.34
Computation of Rank Statistics for
Cohesion Values of Two Samples
Observations Rank Sample Name
34 1 Sample 2
40 2.5 Sample 1
40 2.5 Sample 1
42 4 Sample 2
43 5 Sample 2
46 7 Sample 1
46 7 Sample 2
46 7 Sample 2
48 9 Sample 1
51 10 Sample 2
54 11 Sample 2
55 12 Sample 1
58 13 Sample 1
59 14 Sample 1
63 15 Sample 2
65 16 Sample 2
70 17.5 Sample 2
70 17.5 Sample 2
71 19.5 Sample 1
71 19.5 Sample 1
72 21.5 Sample 1
72 21.5 Sample 2
75 23 Sample 1
76 24 Sample 1

The U-statistic is given below:

$$ U = n_1 n_2 + \frac{n_1(n_1+1)}{2} - T_1 = 12 \times 12 + \frac{12(12+1)}{2} - 167.5 = 54.5 $$

$$ U = n_1 n_2 + \frac{n_2(n_2+1)}{2} - T_2 = 12 \times 12 + \frac{12(12+1)}{2} - 132.5 = 89.5 $$

As the sample size is large, we can calculate the mean (µU) and standard deviation (σU) of the normal approximation as follows:

$$ \mu_U = \frac{n_1 n_2}{2} = \frac{12 \times 12}{2} = 72; \qquad \sigma_U = \sqrt{\frac{n_1 n_2(n_1+n_2+1)}{12}} = \sqrt{\frac{12 \times 12(12+12+1)}{12}} = 17.32 $$

Thus, the Z-statistic can be computed as,

$$ Z = \frac{U - \mu_U}{\sigma_U} = \frac{54.5 - 72}{17.32} = -1.01 $$

The obtained p-value from the normal table is 0.311.
Step 4: Define significance level.
As computed in Step 3, the obtained p-value is 0.311. This means that the results are not significant at α = 0.05. Thus, we accept the null hypothesis.
Step 5: Derive conclusions.
As shown in Step 4, we accept the null hypothesis. Thus, we conclude that the
cohesion values of two software samples do not differ significantly (U = 54.5,
p = 0.311).
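For the large-sample case, recent SciPy versions allow selecting the normal approximation explicitly, as sketched below. Setting `use_continuity=False` matches the Z-statistic computed above; SciPy additionally applies a tie correction to σU.

```python
from scipy.stats import mannwhitneyu

sample1 = [55, 40, 71, 59, 48, 40, 75, 46, 71, 72, 58, 76]
sample2 = [46, 42, 63, 54, 34, 46, 72, 43, 65, 70, 51, 70]

# Normal approximation without continuity correction, matching the
# hand-computed Z-statistic (ties make the exact method unavailable).
u_stat, p_value = mannwhitneyu(sample1, sample2, alternative='two-sided',
                               method='asymptotic', use_continuity=False)
print(f"U = {u_stat}, p = {p_value:.3f}")
# SciPy reports U = 89.5 for sample 1; min(89.5, 144 - 89.5) = 54.5. p ≈ 0.311
```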

6.4.12 Kruskal–Wallis Test


This test is used to investigate whether there is any significant difference among three or more
independent sample distributions (Anderson et al. 2002). It is a nonparametric test that extends
the Wilcoxon–Mann–Whitney test on k sample distributions. We can formulate the following
null and alternative hypothesis for application of Kruskal–Wallis test on a given problem:

H0: µ1 = µ2 = ⋯ = µk (All samples have identical distributions and belong to the same population.)
Ha: µ1 ≠ µ2 ≠ ⋯ ≠ µk (All samples do not have identical distributions and may belong to different populations.)

The steps to compute the Kruskal–Wallis test statistic H are very similar to that of
Wilcoxon–Mann–Whitney test statistic U. Assuming there are k samples of size n1, n2, … nk,
respectively, and the total number of observations N (N = n1 + n2 + … nk), we perform the
following steps:

1. Organize and sort the data values of all the observations (belonging to all the
samples) in an ascending (low to high) order.

2. Next, allocate ranks to all the observations from 1 to N. The observation with the
lowest data value is assigned a rank of 1, and the observation with the highest data
value is assigned rank N.
3. In case of two or more observations of equal values, assign the average of the ranks that would have been assigned to the observations. For example, if there are two observations of data value 40 each occupying the 3rd and 4th ranks, we would assign the mean rank, that is, 3.5 ([3 + 4]/2 = 3.5), to each of the 3rd and 4th observations.
4. We then compute the sum of ranks allocated to observations in each sample and
denote it as T1, T2… Tk.
5. Finally, the H-statistic is computed by the following formula:

$$ H = \frac{12}{N(N+1)} \sum_{i=1}^{k} \frac{T_i^2}{n_i} - 3(N+1) $$

The calculated H-value is compared with the tabulated χα² value at (k − 1) DOF at the desired α value. If the calculated H-value is greater than the χα² value, we reject the null hypothesis (H0).

Example 6.14:
Consider an example (Table 6.35) where three research tools were evaluated by 17 dif-
ferent researchers and were given a performance score out of 100. Investigate whether
there is a significant difference in the performance rating of the tools.
Solution:
Step 1: Formation of hypothesis.
In this step, null (H0) and alternative (Ha) hypotheses are formed. The hypoth-
eses for the example are given below:
H0: µ1 = µ2 = µ3 (The performance rating of all tools does not differ
significantly.)
Ha: µ1 ≠ µ2 ≠ µ3 (The performance rating of all tools differ significantly.)
Step 2: Select the appropriate statistical test.
The three samples are independent in nature as they are rated by 17 different
researchers. The outcome variable is continuous. As we need to compare more
than two samples, we use Kruskal–Wallis test to investigate whether there is a
significant difference in the performance rating of the tools.

TABLE 6.35
Performance Score of Tools
Tools
Tool 1 Tool 2 Tool 3
30 65 55
75 25 75
65 35 65
90 20 85
100 45 95
95 75

TABLE 6.36
Computation of Rank Kruskal–Wallis Test
for Performance Score of Research Tools
Sample
Observations Rank Name
20 1 Tool 2
25 2 Tool 2
30 3 Tool 1
35 4 Tool 2
45 5 Tool 2
55 6 Tool 3
65 8 Tool 1
65 8 Tool 2
65 8 Tool 3
75 11 Tool 1
75 11 Tool 3
75 11 Tool 3
85 13 Tool 3
90 14 Tool 1
95 15.5 Tool 1
95 15.5 Tool 3
100 17 Tool 1

Step 3: Apply test and calculate p-value.


In this example, n1 = 6, n2 = 5, n3 = 6, and N = 17. The arrangement of all the
performance rating observations in ascending order and their corresponding
ranks are shown in Table 6.36.
Sum of ranks assigned to performance rating observations of Tool 1 (T1) =
3 + 8 + 11 + 14 + 15.5 + 17 = 68.5.
Sum of ranks assigned to performance rating observations of Tool 2 (T2) =
1 + 2 + 4 + 5 + 8 = 20.
Sum of ranks assigned to performance rating observations of Tool 3 (T3) =
6 + 8 + 11 + 11 + 13 + 15.5 = 64.5.
The H-statistic can be computed as follows:

$$ H = \frac{12}{N(N+1)} \sum_{i=1}^{k} \frac{T_i^2}{n_i} - 3(N+1) = \frac{12}{17(17+1)} \left[ \frac{(68.5)^2}{6} + \frac{(20)^2}{5} + \frac{(64.5)^2}{6} \right] - 3(17+1) = 7 $$

The p-value obtained at 2 DOF is 0.029.
Step 4: Define significance level.
We refer to the chi-square distribution with k − 1 = 2 DOF at α = 0.05. The tabulated chi-square value is χ²0.05 = 5.99. As the test statistic value (H = 7) is greater than the tabulated χ² value, we reject the null hypothesis. Thus, the results are significant with a p-value of 0.029.
Step 5: Derive conclusions.
As shown in Step 4, we reject the null hypothesis. Thus, we conclude that the
performance rating of all tools differ significantly (H = 7, p = 0.029).
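The computation can be reproduced with `scipy.stats.kruskal`, as sketched below. SciPy additionally corrects the H-statistic for tied ranks, so its value comes out slightly above the uncorrected hand-computed H = 7.

```python
from scipy.stats import kruskal

tool1 = [30, 75, 65, 90, 100, 95]
tool2 = [65, 25, 35, 20, 45]
tool3 = [55, 75, 65, 85, 95, 75]

# Kruskal-Wallis H-test across the three independent samples.
h_stat, p_value = kruskal(tool1, tool2, tool3)
print(f"H = {h_stat:.2f}, p = {p_value:.3f}")  # expected: H ≈ 7.08, p ≈ 0.029
```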

6.4.13 Friedman Test


Friedman test is a nonparametric test, which can be used to rank a set of k treatments over
multiple data instances or subjects (Friedman 1940). The test can be used to investigate
the existence of any statistical difference between various treatments. It is generally used
in a scenario where same set of treatments (techniques/methods) are repeatedly applied
over n independent data instances or subjects. A uniform measure is required to com-
pute the performance of different treatments on n data instances. However, Friedman
test does not require that the samples should be drawn from normal populations. To
proceed with the test, we must compute the ranks based on the performance of differ-
ent treatments on n data instances. The Friedman test is based on the assumption that
the measures over data instances are independent of each other. The hypotheses can be
formulated as follows:

H0: There is no statistical difference between the performances of various treatments.


Ha: There is a statistically significant difference between the performances of the various treatments.

The steps to compute the Friedman test statistic χ² are as follows, assuming there are k treatments that are applied on n independent data instances each.

1. Organize and sort the data values of all the treatments for a specific data instance or
data set in descending (high to low) order. Allocate ranks to all the observations from
1 to k, where rank 1 is assigned to the best performing treatment value and rank k to
the worst performing treatment. In case of two or more observations of equal values,
assign the average of the ranks that would have been assigned to the observations.
2. We then compute the total of ranks allocated to a specific treatment on all the data
instances. This is done for all the treatments and the rank total for k treatments is
denoted by R1, R 2, … Rk.
3. Finally, the χ²-statistic is computed by the following formula:

$$ \chi^2 = \frac{12}{nk(k+1)} \sum_{i=1}^{k} R_i^2 - 3n(k+1) $$

where:
Ri is the individual rank total of the ith treatment
n is the number of data instances

The value of the Friedman measure χ² is distributed over k − 1 DOF. If the value of the Friedman measure is in the critical region (obtained from the chi-squared table with a specific level of significance, i.e., 0.01 or 0.05, and k − 1 DOF), then the null hypothesis is rejected and it is concluded that there is a difference among the performance of different treatments; otherwise, the null hypothesis is accepted.

Example 6.15:
Consider Table 6.37, where the performance values of six different classification methods
are stated when they are evaluated on six data sets. Investigate whether the performance
of different methods differ significantly.

TABLE 6.37
Performance Values of Different Methods
Methods
Data Sets M1 M2 M3 M4 M5 M6
D1 83.07 75.38 73.84 72.30 56.92 52.30
D2 66.66 75.72 73.73 71.71 70.20 45.45
D3 83.00 54.00 54.00 77.00 46.00 59.00
D4 61.93 62.53 62.53 64.04 56.79 53.47
D5 74.56 74.56 73.98 73.41 68.78 43.35
D6 72.16 68.86 63.20 58.49 60.37 48.11

Solution:
Step 1: Formation of hypothesis.
In this step, null (H0) and alternative (Ha) hypotheses are formed. The hypoth-
eses for the example are given below:
H0: There is no statistical difference between the performances of various
methods.
Ha: There is statistical significant difference between the performances of
various methods.
Step 2: Select the appropriate statistical test.
As we need to evaluate the difference between the performances of different
methods when they are evaluated using six data sets, we are evaluating
different treatments on different data instances. Moreover, there is no specific
assumption for data normality. Thus, we can use Friedman test.
Step 3: Apply test and calculate p-value.
We compute the rank total allocated to each method on the basis of perfor-
mance ranking of each method on different data sets as shown in Table 6.38.
Now, compute the Friedman statistic:

$$ \chi^2 = \frac{12}{nk(k+1)} \sum R_i^2 - 3n(k+1) = \frac{12}{6 \times 6 \times (6+1)} \left( 13.5^2 + 13.5^2 + 18^2 + 19^2 + 29^2 + 33^2 \right) - 3 \times 6 \times (6+1) = 15.88 $$

After applying the standard adjustment for tied ranks, the statistic becomes χ² = 16.11.

DOF = k − 1 = 5

TABLE 6.38
Computation of Rank Totals for Friedman Test
Methods
Data Sets M1 M2 M3 M4 M5 M6
D1 1 2 3 4 5 6
D2 5 1 2 3 4 6
D3 1 4.5 4.5 2 6 3
D4 4 2.5 2.5 1 5 6
D5 1.5 1.5 3 4 5 6
D6 1 2 3 5 4 6
Rank total 13.5 13.5 18 19 29 33

We look up the tabulated value of the χ²-distribution with 5 DOF and find the tabulated value as 15.086 at α = 0.01. The p-value is computed as 0.007.
Step 4: Define significance level.
The calculated value of χ² (χ² = 16.11) is greater than the tabulated value. As the computed p-value in Step 3 is less than 0.01, the results are significant at α = 0.01.
Step 5: Derive conclusions.
Since the calculated value of χ² is greater than the tabulated value, we reject the null hypothesis. Thus, we conclude that the performance of the six methods differs significantly (χ² = 16.11, p = 0.007).
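The test is available as `scipy.stats.friedmanchisquare`, which applies the tie correction and therefore reproduces χ² = 16.11; a sketch is given below.

```python
from scipy.stats import friedmanchisquare

# Performance of each method across the six data sets (Table 6.37).
m1 = [83.07, 66.66, 83.00, 61.93, 74.56, 72.16]
m2 = [75.38, 75.72, 54.00, 62.53, 74.56, 68.86]
m3 = [73.84, 73.73, 54.00, 62.53, 73.98, 63.20]
m4 = [72.30, 71.71, 77.00, 64.04, 73.41, 58.49]
m5 = [56.92, 70.20, 46.00, 56.79, 68.78, 60.37]
m6 = [52.30, 45.45, 59.00, 53.47, 43.35, 48.11]

# Friedman test with tie correction across the six methods.
chi2, p_value = friedmanchisquare(m1, m2, m3, m4, m5, m6)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}")  # expected: chi2 ≈ 16.11, p ≈ 0.007
```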

6.4.14 Nemenyi Test


Nemenyi test is a post hoc test that is used to compare multiple subjects (techniques/
tools/other experimental design settings) when the sample sizes are equal. It can be used
after the application of a Kruskal–Wallis test or a Friedman test, if the null hypothesis of
the corresponding test is rejected. Nemenyi test is applicable when we compare all the
subjects with each other and want to investigate whether the performance of two subjects
differ significantly (Demšar 2006). We compute the critical distance (CD) value as follows:

k ( k + 1)
CD = qα
6n

Here k corresponds to the number of subjects and n corresponds to the number of obser-
vations for a subject. The critical values (qα) are studentized range statistic divided by √2.
The computed CD value is compared with the difference between the average ranks allocated to two subjects. If the difference is equal to or greater than the CD value, the two subjects differ significantly at the chosen significance level α.

Example 6.16:
Consider an example where we compare four techniques by analyzing the performance
of the models predicted using these four techniques on six data sets each. We first apply
Friedman test to obtain the average ranks of all the methods. The computed average
ranks are shown in Table 6.39. The result of the Friedman test indicated the rejection
of null hypothesis. Evaluate whether there are significant differences among different
methods using pairwise comparisons.
Solution:
Step 1: Formation of hypothesis.
In this step, null (H0) and alternative (Ha) hypotheses are formed. The hypoth-
eses for the example are given below:
H01: The performance of T1 and T2 techniques do not differ significantly.
Ha1: The performance of T1 and T2 techniques differ significantly.
H02: The performance of T1 and T3 techniques do not differ significantly.
Ha2: The performance of T1 and T3 techniques differ significantly.
H03: The performance of T1 and T4 techniques do not differ significantly.

TABLE 6.39
Average Ranks of Techniques after
Applying Friedman Test
T1 T2 T3 T4
Average rank 3.67 2.67 1.92 1.75

Ha3: The performance of T1 and T4 techniques differ significantly.


H04: The performance of T2 and T3 techniques do not differ significantly.
Ha4: The performance of T2 and T3 techniques differ significantly.
H05: The performance of T2 and T4 techniques do not differ significantly.
Ha5: The performance of T2 and T4 techniques differ significantly.
H06: The performance of T3 and T4 techniques do not differ significantly.
Ha6: The performance of T3 and T4 techniques differ significantly.
Step 2: Select the appropriate statistical test.
The evaluation of different techniques is performed using Friedman test, and
the result led to rejection of the null hypothesis. To analyze whether there are
any significant differences among pairwise comparisons of all the techniques,
we need to apply a post hoc test. The number of data sets for evaluating each
technique is same (six each, i.e., equal sample sizes), thus we use Nemenyi test.
Step 3: Apply test and calculate CD.
In this example, k = 4 and n = 6. The value of qα for four subjects at α = 0.05 is
2.569. The CD can be calculated by the following formula:

$$ CD = q_\alpha \sqrt{\frac{k(k+1)}{6n}} = 2.569 \sqrt{\frac{4(4+1)}{6 \times 6}} = 1.91 $$

We now find the differences among ranks of each pair of techniques as shown
in Table 6.40.
Step 4: Define significance level.
Table 6.41 shows the comparison of the critical difference with the actual rank differences among the different techniques. Only the rank difference of the T1–T4 pair is higher than the computed critical difference; the rank differences of all other

TABLE 6.40
Computation of Pairwise Rank
Differences among Techniques
for Nemenyi Test
Pair Difference
T1–T2 3.67 − 2.67 = 1.00
T1–T3 3.67 − 1.92 = 1.75
T1–T4 3.67 − 1.75 = 1.92
T2–T3 2.67 − 1.92 = 0.75
T2–T4 2.67 − 1.75 = 0.92
T3–T4 1.92 − 1.75 = 0.17

TABLE 6.41
Comparison of Differences
for Nemenyi Test
Pair    Comparison with CD
T1–T2 1.00 < 1.91
T1–T3 1.75 < 1.91
T1–T4 1.92 > 1.91
T2–T3 0.75 < 1.91
T2–T4 0.92 < 1.91
T3–T4 0.17 < 1.91

technique pairs are not significant at α = 0.05.
Step 5: Derive conclusions.
As the rank difference of only the T1–T4 pair is higher than the computed critical
difference, we conclude that the T4 technique significantly outperforms the T1 technique
at significance level α = 0.05. The difference in performance of all other technique pairs
is not significant. We accept all the null hypotheses H01–H06 except H03, which is rejected.
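As a usage illustration, the following sketch reuses the nemenyi_cd helper defined above to reproduce the pairwise comparisons of this example from the average ranks in Table 6.39; only the T1–T4 pair is reported as significant, matching Table 6.41:

    from itertools import combinations

    ranks = {"T1": 3.67, "T2": 2.67, "T3": 1.92, "T4": 1.75}  # Table 6.39
    cd = nemenyi_cd(k=4, n=6, q_alpha=2.569)                  # approx. 1.91
    for a, b in combinations(ranks, 2):
        diff = abs(ranks[a] - ranks[b])
        verdict = "significant" if diff > cd else "not significant"
        print(f"{a}-{b}: rank difference {diff:.2f} -> {verdict}")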

6.4.15 Bonferroni–Dunn Test


The Bonferroni–Dunn test is a post hoc test that is similar to the Nemenyi test. It can be used
to compare multiple subjects, even if the sample sizes are unequal. It is generally used
when all subjects are compared with a control subject (Demšar 2006). For example, all
techniques may be compared with a specific control technique A to evaluate the comparative
pairwise performance of all techniques with technique A. The Bonferroni–Dunn test is also
called Bonferroni correction and is used to control family-wise error rate. A family-wise
error may occur when we are testing a number of hypotheses referred to as family of
hypotheses, which are performed on a single set of data or samples. The probability that
at least one hypothesis may be significant just because of chance (Type I error) needs to
be controlled in such a case (Garcia et al. 2007). Bonferroni–Dunn test is mostly used after
a Friedman test, if the null hypothesis is rejected. To control family-wise error, the criti-
cal value α is divided by the number of comparisons. For example, if we are comparing
k − 1 subjects with a control subject then the number of comparisons is k − 1. The formula
for new critical value is as follows:
αNew = α / Number of comparisons

For example, with α = 0.05 and k = 4 subjects, of which one is the control, there are
k − 1 = 3 comparisons and αNew = 0.05/3 ≈ 0.017.

There is another method for performing the Bonferroni–Dunn test by computing the CD
(in the same manner as the Nemenyi test). However, the critical values used are adjusted
to control the family-wise error. We compute the CD value as follows:

CD = qα √( k(k + 1) / (6n) )

Here, k corresponds to the number of subjects and n corresponds to the number of observations
for a subject. The critical values (qα) are the studentized range statistics divided by √2.
Note that the number of comparisons in the Appendix table includes the control subject.
We compare the computed CD with the difference between average ranks. If the difference is
less than the CD, we conclude that the two subjects do not differ significantly at the chosen
significance level α.

Example 6.17:
Consider an example where we compare four techniques by analyzing the performance
of the models predicted using these four techniques on six data sets each. We first apply
the Friedman test to obtain the average ranks of all the techniques. The computed average
ranks are shown in Table 6.42. The result of the Friedman test indicated the rejection of
the null hypothesis. Evaluate whether there are significant differences between the T1
technique and all the other techniques.

TABLE 6.42
Average Ranks of Techniques

                T1      T2      T3      T4
Average rank    3.67    2.67    1.92    1.75

Solution:
Step 1: Formation of hypothesis.
In this step, null (H0) and alternative (Ha) hypotheses are formed. The hypoth-
eses for the example are given below:
H01: The performance of T1 and T2 techniques do not differ significantly.
Ha1: The performance of T1 and T2 techniques differ significantly.
H02: The performance of T1 and T3 techniques do not differ significantly.
Ha2: The performance of T1 and T3 techniques differ significantly.
H03: The performance of T1 and T4 techniques do not differ significantly.
Ha3: The performance of T1 and T4 techniques differ significantly.
Step 2: Select the appropriate statistical test.
The example needs to evaluate the comparison of the T1 technique with all the other
techniques. Thus, T1 is the control technique. The evaluation of the different
techniques is performed using the Friedman test, and the result led to rejection of
the null hypothesis. To analyze whether there are any significant differences
between the performance of the control technique and the other techniques, we need
to apply a post hoc test. Thus, we use the Bonferroni–Dunn test.
Step 3: Apply test and calculate CD.
In this example, k = 4 and n = 6. The value of qα for four subjects at α = 0.05 is
2.394. The CD can be calculated by the following formula:

CD = qα √( k(k + 1) / (6n) ) = 2.394 × √( 4(4 + 1) / (6 × 6) ) = 1.79

We now find the differences among ranks of each pair of techniques, as shown
in Table 6.43.
Step 4: Define significance level.
Table 6.44 compares the computed critical difference with the actual rank
differences between the control technique and the other techniques. Only the rank
difference of the T1–T4 pair is higher than the computed critical difference.
However, the rank difference of the T1–T3 pair is quite close to the critical
difference. The difference in performance of the T1–T2 pair is not significant.
Step 5: Derive conclusions.
As the rank difference of only the T1–T4 pair is higher than the computed critical
difference, we conclude that the T4 technique significantly outperforms the
T1 technique at significance level α = 0.05. We reject the null hypothesis

TABLE 6.43
Computation of Pairwise Rank
Differences among Techniques for
Bonferroni–Dunn Test
Pair Difference
T1–T2 3.67 − 2.67 = 1.00
T1–T3 3.67 − 1.92 = 1.75
T1–T4 3.67 − 1.75 = 1.92

TABLE 6.44
Comparison of Differences
for Bonferroni–Dunn Test
Pair    Comparison with CD
T1–T2 1.00 < 1.79
T1–T3 1.75 < 1.79
T1–T4 1.92 > 1.79

H03 and accept the null hypotheses H01 and H02.
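This control-versus-others comparison is equally easy to script. The following minimal Python sketch uses the average ranks and qα value of this example; note that 2.394 × √(20/36) ≈ 1.784, which the text rounds to 1.79, and the verdicts are unaffected by the rounding:

    import math

    ranks = {"T1": 3.67, "T2": 2.67, "T3": 1.92, "T4": 1.75}  # Table 6.42
    control = "T1"
    # Same CD formula as the Nemenyi test, but with the Bonferroni-Dunn q_alpha.
    cd = 2.394 * math.sqrt(4 * (4 + 1) / (6 * 6))
    for name, rank in ranks.items():
        if name != control:
            diff = abs(ranks[control] - rank)
            verdict = "significant" if diff > cd else "not significant"
            print(f"{control}-{name}: rank difference {diff:.2f} -> {verdict}")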

6.4.16 Univariate Analysis


Univariate LR may be defined as a statistical method that works by formulating a mathematical
model to depict the relationship between the dependent variable and each of the independent
variables, taken one at a time. As discussed in Section 6.2, one of the purposes of univariate
analysis is to screen out the independent variables that are not significantly related to the
dependent variable. The other goal is to test the hypothesis about the relationship of the
independent variables with the dependent variable. The choice of methods in univariate analysis
depends on the type of dependent variable being used. The formula for univariate LR is given below:

prob(X1) = e^(A0 + A1X1) / (1 + e^(A0 + A1X1))

where:
X1 is an independent variable
A1 is the weight
A0 is a constant

The sign of the weight indicates the direction of the effect of the independent variable on the
dependent variable. A positive sign indicates that the independent variable has a positive effect
on the dependent variable, and a negative sign indicates that the independent variable has a
negative effect on the dependent variable. The significance statistic is employed to test the
hypothesis. In linear regression, the t-test is used to find the significant independent
variables; in LR, the Wald test is used for the same purpose.
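As an illustration, univariate LR can be run one metric at a time with standard statistical software. The following is a minimal Python sketch using the pandas and statsmodels libraries; the file name metrics.csv and the column names are hypothetical placeholders for a data set with software metrics and a binary fault label:

    import math
    import pandas as pd
    import statsmodels.api as sm

    df = pd.read_csv("metrics.csv")        # hypothetical data set
    y = df["faulty"]                       # 1 if the class contains a fault, else 0
    for metric in ["CBO", "WMC", "RFC"]:   # hypothetical metric columns
        X = sm.add_constant(df[[metric]])  # adds the constant A0
        fit = sm.Logit(y, X).fit(disp=0)   # prob(X1) = e^(A0+A1X1)/(1 + e^(A0+A1X1))
        print(metric,
              "B =", round(fit.params[metric], 3),      # weight A1
              "SE =", round(fit.bse[metric], 3),        # standard error
              "Sig. =", round(fit.pvalues[metric], 4),  # Wald test p-value
              "Exp(B) =", round(math.exp(fit.params[metric]), 3))  # odds ratio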

6.5 Example—Univariate Analysis Results for Fault Prediction System


We treat a class as faulty if it contains at least one fault. Tables 6.45 through 6.48 provide
the coefficient (B), standard error (SE), statistical significance (sig), odds ratio [exp(B)], and
R2 statistic for each measure. The statistical significance estimates the importance or the
significance level of each independent variable. The odds ratio represents the probability of
occurrence of an event divided by the probability of its nonoccurrence; for example, for the CBO
metric in Table 6.45, exp(B) = e^0.145 ≈ 1.156. The R2 statistic depicts the variance in the
dependent variable explained by the independent variable. A higher value of R2 means higher
accuracy. The metrics with a significant relationship to fault proneness, that is, below or at
the significance (named as Sig. in Tables 6.45 through 6.48) threshold of 0.01, are shown in
bold (see Tables 6.45 through 6.48). Table 6.45 presents the

TABLE 6.45
Univariate Analysis Using LR Method for HSF
Metric    B          SE          Sig.      Exp(B)    R2
CBO       0.145      0.028       0.0001    1.156     0.263
WMC       0.037      0.011       0.0001    1.038     0.180
RFC       0.016      0.004       0.0001    1.016     0.160
SLOC      0.003      0.001       0.0001    1.003     0.268
LCOM      0.015      0.006       0.0170    1.015     0.100
NOC       −18.256    5903.250    0.9980    0.000     0.060
DIT       0.036      0.134       0.7840    1.037     0.001

TABLE 6.46
Univariate Analysis Using LR Method for MSF
Metric B SE Sig. Exp(B) R2
CBO 0.276 0.030 0.0001 1.318 0.375
WMC 0.065 0.011 0.0001 1.067 0.215
RFC 0.025 0.004 0.0001 1.026 0.196
SLOC 0.010 0.001 0.0001 1.010 0.392
LCOM 0.009 0.003 0.0050 1.009 0.116
NOC −1.589 0.393 0.0001 0.204 0.090
DIT 0.058 0.092 0.5280 1.060 0.001

TABLE 6.47
Univariate Analysis Using LR Method for LSF
Metric B SE Sig. Exp(B) R2
CBO 0.175 0.025 0.0001 1.191 0.290
WMC 0.050 0.011 0.0001 1.052 0.205
RFC 0.015 0.004 0.0001 1.015 0.140
SLOC 0.004 0.001 0.0001 1.004 0.338
LCOM 0.004 0.003 0.2720 1.004 0.001
NOC −0.235 0.192 0.2200 0.790 0.002
DIT 0.148 0.099 0.1340 1.160 0.005

TABLE 6.48
Univariate Analysis Using LR Method for USF
Metric B SE Sig. Exp(B) R2
CBO 0.274 0.029 0.0001 1.315 0.336
WMC 0.068 0.019 0.0001 1.065 0.186
RFC 0.023 0.004 0.0001 1.024 0.127
SLOC 0.011 0.002 0.0001 1.011 0.389
LCOM 0.008 0.003 0.0100 1.008 0.013
NOC −0.674 0.185 0.0001 0.510 0.104
DIT 0.086 0.091 0.3450 1.089 0.001

results of univariate analysis for predicting fault proneness with respect to high-severity faults
(HSF). From Table 6.45, we can see that four out of seven metrics are found to be very significant
(Sig. < 0.01). However, the NOC and DIT metrics are not found to be significant, and the LCOM
metric is significant only at the 0.05 significance level. The value of the R2 statistic is highest
for the SLOC and CBO metrics.
Table 6.46 summarizes the results of univariate analysis for predicting fault proneness
with respect to medium-severity faults (MSF). Table 6.46 shows that the value of the R2 statistic
is highest for the SLOC metric. All the metrics except DIT are found to be significant. NOC
has a negative coefficient, which implies that classes with higher NOC values are less fault
prone.
Table 6.47 summarizes the results of univariate analysis for predicting fault proneness
with respect to low-severity faults (LSF). Again, it can be seen from Table 6.47 that the
value of the R2 statistic is highest for the SLOC metric. The results show that four out of seven
metrics are found to be very significant; the LCOM, NOC, and DIT metrics are not found to
be significant.
Table 6.48 summarizes the results of univariate analysis for predicting fault proneness when
the faults are not categorized according to their severity, that is, ungraded severity faults
(USF). The results show that six out of seven metrics are found to be very significant. The
DIT metric is not found to be significant, and the NOC metric has a negative coefficient. This
shows that the NOC metric is related to fault proneness, but in an inverse manner.
Thus, the SLOC metric has the highest R2 value at all severity levels of faults, which shows
that it is the best predictor. The CBO metric has the second highest R2 value. The values of
the R2 statistic are more important than the values of Sig., as they show the strength
of the correlation.

Exercises
6.1 Describe the measures of central tendency. Discuss the concepts with
examples.
6.2 Consider the following data set on faults found by inspection technique for a
given project. Calculate mean, median, and mode.
100, 160, 166, 197, 216, 219, 225, 260, 275, 290, 315, 319, 361, 354, 365, 410, 416, 440, 450,
478, 523
6.3 Describe the measures of dispersion. Explain the concepts with examples.
6.4 What is the purpose of collecting descriptive statistics? Explain the importance
of outlier analysis.
6.5 What is the difference between attribute selection and attribute extraction
techniques?
6.6 What are the advantages of attribute reduction in research?
6.7 What is CFS technique? State its application with advantages.
6.8 Consider the data set given in exercise 6.2. Calculate the standard deviation,
variance, and quartiles.
6.9 Consider the following table presenting three variables. Determine the normality
of these variables.

Fault Count    Cyclomatic Complexity    Branch Count

332            25                       612
274            24                       567
212            23                       342
106            12                       245
102            10                       105
93             09                       94
63             05                       89
23             04                       56
09             03                       45
04             01                       32

6.10 What is outlier analysis? Discuss its importance in data analysis. Explain
univariate, bivariate, and multivariate outliers.
6.11 Consider the table given in exercise 6.9. Construct box plots and identify univariate
outliers for all the variables given in the data set.
6.12 Consider the data set given in exercise 6.9. Identify bivariate outliers between the
dependent variable fault count and the other variables.
6.13 Consider the following data with the performance accuracy values for different
techniques on a number of data sets. Check whether the conditions of ANOVA are
met. Also apply the ANOVA test to check whether there is a significant difference in the
performance of the techniques.

Techniques
Data Sets Technique 1 Technique 2 Technique 3
D1 84 71 59
D2 76 73 66
D3 82 75 63
D4 75 76 70
D5 72 68 74
D6 85 82 67

6.14 Evaluate, using an appropriate statistical test, whether there is a significant
difference between different algorithms evaluated on three data sets with respect to
the runtime performance (in seconds) of the model.

Data Sets #
Algorithms 1 2 3
Algorithm 1 9 7 9
Algorithm 2 19 20 20
Algorithm 3 18 15 14
Algorithm 4 13 7 6
Algorithm 5 10 9 8

6.15 A software company plans to adopt a new programming paradigm that will
ease the task of software developers. To assess its effectiveness, 50 software developers
used the traditional programming paradigm and 50 others used the new
one. The productivity values per hour are stated below. Perform a t-test to
assess the effectiveness of the new programming paradigm.

Statistic             Old Programming    New Programming
                      Paradigm           Paradigm
Mean                  1.5                2.21
Standard deviation    0.4                0.36

6.16 A company deals with the development of certain customized software products.
The following data lists the proposed cost and the actual cost of 10 different software
products. Evaluate whether the company makes a good estimate of the
proposed cost using a paired sample t-test.

Software Product    Proposed Cost    Actual Cost

P1 1,739 1,690
P2 2,090 2,090
P3 979 992
P4 997 960
P5 2,750 2,650
P6 799 799
P7 980 1,000
P8 1,099 1,050
P9 1,225 1,198
P10 900 943

6.17 The software team needs to determine the average number of methods in a class
for a particular software product. Twenty-two classes were chosen at random
and the number of methods in these classes was analyzed. Evaluate whether the
hypothesized mean of the chosen sample is different from 11 methods per class for
the whole population.

Class No.    No. of Methods    Class No.    No. of Methods    Class No.    No. of Methods

C1           11.5              C9           9                 C17          11.5
C2           12                C10          14                C18          12.5
C3           10                C11          11.5              C19          14
C4           13                C12          7.5               C20          8.5
C5           9.5               C13          11                C21          12
C6           14                C14          6                 C22          9.5
C7           11.5              C15          12
C8           12                C16          12.5

6.18 A software organization develops software tools using five categories of programming
languages. Evaluate a goodness-of-fit test on the data given below to
test whether the organization develops an equal proportion of software tools using
the five different categories of programming languages.

Programming Language Category    Number of Software Tools

Category 1 35
Category 2 30
Category 3 45
Category 4 44
Category 5 28

6.19 Twenty-five students developed the same program and the cyclomatic
complexity values of these 25 programs are stated below. Evaluate whether the
cyclomatic complexity values of the programs developed by the 25 students follow
a normal distribution.

6, 11, 9, 14, 16, 10, 13, 9, 15, 12, 10, 14, 15, 10, 8, 11, 7, 12, 13, 17, 17, 19, 9, 20, 26

6.20 A software organization uses either OO methodology or procedural methodology
for developing software. It also uses effective verification techniques at
different stages to detect errors. Given the following data, evaluate whether the
two attributes, software development stage for verification and methodology, are
independent.

                                Methodology
Software Development Stage      OO      Procedural    Total
Requirements                    80      100           180
Initial design                  50      110           160
Detailed design                 75      65            140
Total                           205     275           480

6.21 The coupling values of a number of classes are provided below for two different
samples. Using the F-test, test the hypothesis that the two samples belong to the
same population.

Sample 1 32 42 33 40 42 44 42 38 32
Sample 2 31 31 31 35 35 32 30 36

6.22 Two training programs were conducted for software professionals by an
organization. Nine participants were asked to rate the training programs on a
scale of 1 to 100. Using the Wilcoxon signed-rank test, evaluate whether one program
is favored over the other.

Participant No. Program A Program B

1 25 45
2 15 55
3 25 65
4 15 65
5 5 35
6 35 15
7 45 45
8 5 75
9 55 85

6.23 A researcher wants to compare the performance of two learning algorithms
across multiple data sets using receiver operating characteristic (ROC) values, as
shown below. Investigate whether there is a statistical difference between the
performance of the two learning algorithms.

Algorithms
Data Sets A1 A2
D1 0.65 0.55
D2 0.78 0.85
D3 0.55 0.70
D4 0.60 0.60
D5 0.89 0.70

6.24 Two attribute selection techniques were analyzed to check whether they have
any effect on a model's performance. Seven models were developed using attribute
selection technique X and nine models were developed using attribute selection
technique Y. Use the Wilcoxon–Mann–Whitney test to evaluate whether there is any
significant difference in the models' performance using the two different attribute
selection techniques.

Attribute Selection Technique X    Attribute Selection Technique Y

57.5 58.9
58.6 58.0
59.3 61.5
56.9 61.2
58.4 62.3
58.8 58.9
57.7 60.0
60.9
60.4

6.25 A researcher wants to find the effect of the same learning algorithm on
three data sets. For every data set, a model is predicted using the same learning
algorithm, with area under the ROC curve as the performance measure.
Evaluate whether there is a statistical difference in the performance of the
learning algorithm on the different data sets.

Data Set ROC Values

1 0.76
2 0.85
3 0.66

6.26 A market survey is conducted to evaluate the effectiveness of three text editors
by 20 probable customers. The customers assessed the text editors on various
criteria and provided a score out of 300. Test the hypothesis of whether there is any
significant difference among the three text editors using the Kruskal–Wallis test.

Text Editor A Text Editor B Text Editor C

200 110 260
60 200 290
150 60 240
190 70 150
150 140 250
270 30 280
210 230

6.27 A researcher wants to compare the performance of four learning techniques
on multiple data sets (five) using area under the ROC curve as the performance
measure. The data for the scenario is given below. Determine whether there is any
statistical difference in the performance of the different learning techniques.

Methods
Data Sets A1 A2 A3 A4
D1 0.65 0.56 0.72 0.55
D2 0.79 0.69 0.69 0.59
D3 0.65 0.65 0.62 0.60
D4 0.85 0.79 0.66 0.76
D5 0.71 0.61 0.61 0.78

6.28 What is the purpose of the Bonferroni–Dunn correction? Consider the data given
in exercise 6.27. Evaluate the pairwise differences using the Wilcoxon test with
Bonferroni–Dunn correction.
6.29 A researcher wants to evaluate the effectiveness of four tools by analyzing the
performances of different models, as given below. Evaluate using the Friedman test
whether the performance of the tools is significantly different. If the difference is
significant, evaluate the pairwise differences using the Nemenyi test.

Tools
Data Sets T1 T2 T3 T4
Model 1 69 60 83 73
Model 2 70 68 81 69
Model 3 73 54 75 67
Model 4 71 61 91 79
Model 5 77 59 85 69
Model 6 73 56 89 77

6.30 Explain a scenario where application of the Nemenyi test is advisable.
6.31 Which test is used to control the family-wise error rate?
6.32 What are Type I and Type II errors? Why is it important to identify them?
6.33 Compare and contrast various statistical tests with respect to their assumptions
and normality conditions of the underlying data.
6.34 Differentiate between:
(a) Wrapper and filter methods
(b) Nemenyi and Bonferroni–Dunn
(c) One-tailed and two-tailed tests
(d) Independent sample and Wilcoxon–Mann–Whitney tests
6.35 Discuss two applications of univariate analysis.

Further Readings
The following books provide details on summarizing data:

D. D. Boos, and C. Brownie, “Comparing variances and other measures of dispersion,”
Statistical Science, vol. 19, pp. 571–578, 2004.
J. I. Marden, Analysing and Modeling Rank Data, Chapman and Hall, London,
1995.
H. Mulholland, and C. R. Jones, “Measures of dispersion,” In: Fundamentals of
Statistics, Springer, New York, chapter 6, pp. 93–110, 1968.
R. R. Wilcox, and H. J. Keselman, “Modern robust data analysis methods: Measures
of central tendency,” Psychological Methods, vol. 8, no. 3, pp. 254–274, 2003.

There are several books on research methodology and statistics in which various concepts
and statistical tests are explained:

W. G. Hopkins, A New View of Statistics, Sportscience, 2003. http://sportsci.org/resource/stats

C. R. Kothari, Research Methodology: Methods and Techniques, New Age International
Limited, New Delhi, India, 2004.

The details on outlier analysis can be obtained from:

V. Barnett, and T. Price, Outliers in Statistical Data, John Wiley & Sons, New York, 1995.

The concept of principal component analysis is explained in the following:

H. Abdi, and L. J. Williams, “Principal component analysis,” Wiley Interdisciplinary
Reviews: Computational Statistics, vol. 2, no. 4, pp. 433–459, 2010.

The basic concept of univariate analysis are presented in:

F. Hartwig, and B. E. Dearing, Exploratory Data Analysis, Sage Publications, Beverly
Hills, CA, 1979.
H. M. Park, “Univariate analysis and normality test using SAS, Stata, and SPSS,” The
University Information Technology Services, Indiana University, Bloomington, IN,
2008.

The details about the CFS technique are provided in:

M. A. Hall, “Correlation-based feature selection for machine learning,” PhD dissertation,
The University of Waikato, Hamilton, New Zealand, 1999.
M. A. Hall, and L. A. Smith, “Feature subset selection: A correlation based filter
approach,” In proceedings of International Conference of Neural Information
Processing and Intelligent Information Systems, pp. 855–858, 1997.

A detailed description of various wrapper and filter methods can be found in:

A. L. Blum, and P. Langley, “Selection of relevant features and examples in machine
learning,” Artificial Intelligence, vol. 97, pp. 245–271, 1997.
N. Sánchez-Maroño, A. Alonso-Betanzos, and M. Tombilla-Sanromán, “Filter meth-
ods for feature selection, a comparative study,” In: Proceedings of the 8th International
Conference on Intelligent Data Engineering and Automated Learning, H. Yin, P. Tino,
W. Byrne, X. Yao, E. Corchado (eds.), Springer-Verlag, Berlin, Germany, pp. 178–187.

Some of the useful facts and concepts of significance tests are presented in:

P. M. Bentler, and D. G. Bonett, “Significance tests and goodness of fit in the analysis
of covariance structures,” Psychological Bulletin, vol. 88, no. 3, pp. 588–606, 1980.
J. M. Bland, and D. G. Altman, “Multiple significance tests: The Bonferroni method,”
BMJ, vol. 310, no. 6973, pp. 170, 1995.
L. L. Harlow, S. A. Mulaik, and J. H. Steiger, What If There Were No Significance Tests,
Psychology Press, New York, 2013.

The one-tailed and two-tailed tests are described in:

J. Hine, and G. B. Wetherill (eds.), “One- and two-tailed tests,” In: A Programmed
Text in Statistics Book 4: Tests on Variance and Regression, Springer, Amsterdam, the
Netherlands, pp. 6–11, 1975.
D. B. Pillemer, “One-versus two-tailed hypothesis tests in contemporary educational
research,” Educational Researcher, vol. 20, no. 9, pp. 13–17, 1991.

Frick provides an excellent use of hypothesis testing based on null hypothesis:

R. W. Frick, “The appropriate use of null hypothesis testing,” Psychological Methods,
vol. 1, no. 4, pp. 379–390, 1996.

The following books provide details on parametric and nonparametric tests:

D. J. Sheskin, Handbook of Parametric and Nonparametric Statistical Procedures, CRC
Press, Boca Raton, FL, 2003.
D. J. Sheskin (ed.), “Parametric versus nonparametric tests,” In: International
Encyclopedia of Statistical Science, Springer, Berlin, Germany, pp. 1051–1052, 2011.

The following is an excellent and widely used book on hypothesis testing:

E. L. Lehmann, and J. P. Romano, Testing Statistical Hypotheses: Springer Texts in
Statistics, Springer, New York, 2008.
