The research data can be analyzed by computing various statistical measures and inferring
conclusions from these measures. Figure 6.1 presents the steps involved in analyzing and
interpreting the research data. The research data should be reduced to a suitable form
before it can be used for further analysis. Statistical techniques can be used to preprocess
the attributes (software metrics) so that they can be analyzed and meaningful conclusions
can be drawn from them. After preprocessing of the data, the attributes need to be
reduced so that the dimensionality decreases and better results can be obtained. Then,
the model is predicted and validated using statistical and/or machine learning techniques.
The results obtained are analyzed and interpreted from every aspect. Finally,
hypotheses are tested and a decision about the accuracy of the model is made.
This chapter provides a description of data preprocessing techniques, feature reduction
methods, and statistical tests. As discussed in Chapter 4, hypothesis testing can either be
done without model prediction or be used for model comparison after the models have been
developed. In this chapter, we present the various statistical tests that can be applied
for testing a given hypothesis. The techniques for model development, methods for model
validation, and ways of interpreting the results are presented in Chapter 7. We explain these
tests with software engineering-related examples so that the reader gets an idea about the
practical use of the statistical tests. Examples of model comparison tests are given in Chapter 7.
6.1.1.1 Mean
The mean is the average value of the data set. It is defined as the ratio of the sum of the
values of the data points to the total number of data points and is given as:
FIGURE 6.1
Steps for analyzing and interpreting data.
Mean (µ) = (Σ xi)/N
where:
xi (i = 1, . . . N) are the data points
N is the number of data points
For example, consider 28, 29, 30, 14, and 67 as values of data points.
The mean is (28 + 29 + 30 + 14 + 67)/5 = 33.6.
6.1.1.2 Median
The median is the value that divides the data into two halves: half of the data points lie
below the median and half lie above it. For an odd number of data points, the median is the
central value, and for an even number of data points, it is the mean of the two central
values. Hence, exactly 50% of the data points lie above the median and 50% of the data
points lie below it. Consider the following data points:
8, 15, 5, 20, 6, 35, 10
After arranging the data in ascending order (5, 6, 8, 10, 15, 20, 35), the median is the 4th
value, that is, 10. If one additional data point, 40, is added to the above distribution, then
the sorted data become:
5, 6, 8, 10, 15, 20, 35, 40
Median = (10 + 15)/2 = 12.5
The median is not useful if the number of categories in an ordinal scale is very low.
In such cases, the mode is the preferred measure of central tendency.
6.1.1.3 Mode
The mode is the value that has the highest frequency in the distribution. For example, in
Table 6.1, the fault severity category 2 has the highest frequency of 50. Hence, 2 is reported
as the mode for Table 6.1, as it has the highest frequency. Unlike the mean and median,
the mode may take multiple values in the same distribution. In Table 6.2, there are two
categories of maintenance effort with the same frequency: very high and medium. This is
known as a bimodal distribution.
The major disadvantage of the mode is that it does not produce useful results when applied
to interval/ratio scales having many distinct values. For example, the following data points
represent the number of failures that occurred per second while testing a given software,
arranged in ascending order:
15, 17, 18, 18, 45, 63, 64, 65, 71, 75, 79
It can be seen that the data is centered around 60–80 failures, but the mode of the
distribution is 18, since it occurs twice whereas the rest of the values occur only once.
Clearly, the mode does not represent the central values in this case. Hence, either other
measures of central tendency should be used, or the data should be organized into suitable
class intervals before the mode is computed.
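For illustration, the three measures can be computed with Python's standard statistics module; a minimal sketch using the example values above:

    import statistics

    data = [28, 29, 30, 14, 67]
    print(statistics.mean(data))    # 33.6
    print(statistics.median(data))  # 29 (middle value of the sorted data)

    failures = [15, 17, 18, 18, 45, 63, 64, 65, 71, 75, 79]
    print(statistics.multimode(failures))  # [18], despite values clustering at 60-80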
TABLE 6.1
Faults at Severity Levels
Fault Severity Frequency
0 23
1 19
2 50
3 17
TABLE 6.2
Maintenance Effort
Maintenance Effort Frequency
Very high 15
High 10
Medium 15
TABLE 6.3
Statistical Measures with Corresponding Relevant Scale Types
Measures Relevant Scale Type
Mean Interval and ratio data that are not skewed.
Median Ordinal, interval, and ratio, but not useful for
ordinal scales having few values.
Mode All scale types, but not useful for scales having
many distinct values.
Table 6.3 depicts the relevant scale type of data for each statistical measure.
Consider the following data set:
18, 23, 23, 25, 35, 40, 42
The mean, median, and mode are shown in Table 6.4; each measure computes the "average"
value in a different way. If the data is symmetrical, all three measures (mean, median,
and mode) have the same value. But if the data is skewed, these measures differ.
Figure 6.2 shows the symmetrical and skewed distributions. The symmetrical curve is a
bell-shaped curve, where the data points are equally distributed about the center.
Usually, when the data is skewed, the mean is a misleading measure for determining
central values. For example, if we calculate average lines of code (LOC) of 10 modules
given in Table 6.5, it can be seen that most of the values of the LOC are between 200 and 400,
but one module has 3,000 LOC. In this case, the mean will be 531. Only one value has influ-
enced the mean and caused the distribution to skew to the right. However, the median will
be 265, since the median is based on the midpoint and is not affected by the extreme values
TABLE 6.4
Descriptive Statistics
Measure Value
Mean 29.43
Median 25
Mode 23
FIGURE 6.2
Graphs representing skewed and symmetrical distributions: (a) left skewed, (b) normal (no skew), and (c) right skewed.
TABLE 6.5
Sample Data of LOC for 10 Modules
Module# LOC Module# LOC
1 200 6 270
2 202 7 290
3 240 8 300
4 250 9 301
5 260 10 3,000
in the data distribution. Hence, the median better reflects the average LOC in modules as
compared to the mean and is the best measure when the data is skewed.
FIGURE 6.3
Quartiles.
FIGURE 6.4
Example of quartile: for the data in Table 6.5 (200, 202, 240, 250, 260, 270, 290, 300, 301, 3,000), Q1 = 240, the median = 265, and Q3 = 300.
The IQR is defined as the difference between upper quartile and lower quartile and is given as,
IQR = Q3 − Q1
For example, for Table 6.5, the quartiles are shown in Figure 6.4.
IQR = Q3 − Q1 = 300 − 240 = 60
The standard deviation is used to measure the average distance a data point has from the
mean. It assesses the spread by calculating the distance of each data point from the mean.
The standard deviation is small if most of the data points are near the mean, and large if
they are spread far from it. The standard deviation (σx) for the population is given as:
σx = √[Σ(x − µ)²/N]
where:
x is the given value
N is the number of values
µ is the mean of all the values
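As a quick sketch, the population and sample standard deviations can be computed with Python's statistics module (the data values here are illustrative only):

    import statistics

    data = [200, 250, 300]
    print(statistics.pstdev(data))  # population standard deviation (divides by N): ~40.82
    print(statistics.stdev(data))   # sample standard deviation (divides by N - 1): 50.0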
FIGURE 6.5
Normal curve: 68.3% of the values (34.15% on each side of the mean) lie within one standard deviation of the mean.
TABLE 6.6
Range of Distribution for Normal Data Sets
S. No. Mean Standard Deviation Ranges
1 250 50 200–300
2 220 60 160–280
3 200 30 170–230
4 200 10 190–210
TABLE 6.7
Sample Fault Count Data
Data1: 35, 45, 45, 55, 55, 55, 65, 65, 65, 65, 75, 75, 75, 75, 75, 85, 85, 85, 85, 95, 95, 95, 105, 105, 115
Data2: 0, 2, 72, 75, 78, 80, 80, 85, 85, 87, 87, 87, 87, 88, 89, 90, 90, 92, 92, 95, 95, 98, 98, 99, 102
Data3: 20, 37, 40, 43, 45, 52, 55, 57, 63, 65, 74, 75, 77, 82, 86, 86, 87, 89, 89, 90, 95, 107, 165, 700, 705
FIGURE 6.6
Histogram analysis for fault count data given in Table 6.7: (a) Data1, (b) Data2, and (c) Data3.
For example, suppose that one calculates the average of LOC, where most values are
between 1,000 and 2,000, but the LOC for one module is 15,000. Thus, the data point with
the value 15,000 is located far away from the other values in the data set and is an outlier.
Outlier analysis is carried out to detect the data points that are overinfluential and must be
considered for removal from the data sets.
The outliers can be divided into three types: univariate, bivariate, and multivariate.
Univariate outliers are influential data points that occur within a single variable. Bivariate
outliers occur when two variables are considered in combination, whereas multivariate
outliers occur when more than two variables are considered in combination. Once the out-
liers are detected, the researcher must make the decision of inclusion or exclusion of the
identified outlier. The outliers generally signal the presence of anomalies, but they may
sometimes provide interesting patterns to the researchers. The decision is based on the
reason of the occurrence of the outlier.
Box plots, z-scores, and scatter plots can be used for detecting univariate and bivariate
outliers.
FIGURE 6.7
Example box plot, showing the median, the lower and upper quartiles, and the start and end of the tail.
Two boundary lines signify the start and end of the tail. These boundary lines correspond
to ±1.5 IQR. Thus, once the value of IQR is known, it is multiplied by 1.5. The values shown
inside the box plot are known to be within the boundaries, and hence are not considered to
be extreme. The data points beyond the start and end of the boundaries or tail are
considered to be outliers. The distance between the lower and the upper quartile is often
known as the box length.
The start of the tail is calculated as Q1 − 1.5 × IQR and the end of the tail is calculated
as Q3 + 1.5 × IQR. To avoid values outside the range of the actual data, these boundaries
are truncated to the nearest values of the actual data points. Thus, the actual start of the
tail is the lowest value in the variable above (Q1 − 1.5 × IQR), and the actual end of the
tail is the highest value below (Q3 + 1.5 × IQR).
The box plots also provide information on the skewness of the data. The median lies in
the middle of the box if the data is not skewed. The median lies away from the middle if
the data is left or right skewed. For example, consider the LOC values given below for a
software:
200, 202, 240, 250, 260, 270, 290, 300, 301, 3000
The median of the data set is 265, lower quartile is 240, and upper quartile is 300. The IQR
is 60. The start of the tail is 240 − 1.5 × 60 = 150 and end of the tail is 300 + 1.5 × 60 = 390.
The actual start of the tail is the lowest value above 150, that is, 200, and the actual end of
the tail is the highest value below 390, that is, 301. Thus, case number 10, with value
3,000, is above the end of the tail and, hence, is an outlier. The box plot for the given data
set is shown in Figure 6.8 with one outlier, 3,000.
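A minimal sketch of this tail computation in Python is given below. The helper name is illustrative, and NumPy's default percentile interpolation can yield slightly different quartile values than the hand computation above, although the flagged outlier is the same:

    import numpy as np

    def iqr_outliers(values):
        # Quartiles, tail boundaries (1.5 x IQR), and the values beyond them
        q1, q3 = np.percentile(values, [25, 75])
        iqr = q3 - q1
        start, end = q1 - 1.5 * iqr, q3 + 1.5 * iqr
        return [v for v in values if v < start or v > end]

    loc = [200, 202, 240, 250, 260, 270, 290, 300, 301, 3000]
    print(iqr_outliers(loc))  # [3000]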
A decision regarding inclusion or exclusion of the outliers must be made by the researchers
during data analysis, based on the reasons for their occurrence.
Outlier values may be present because of combination of data values present across more
than one variable. These outliers are called multivariate outliers. Scatter plot is another
visualization method to detect outliers. In scatter plots, we simply represent all the data
points graphically. The scatter plot allows us to examine more than one metric variable at
a given time.
FIGURE 6.8
Box plot for LOC values, showing one outlier (3,000) beyond the end of the tail.
6.1.5.2 Z-Score
Z-score is another method to identify outliers and is used to depict the relationship of a
value to its mean, and is given as follows:
z-score = (x − µ)/σ
where:
x is the score or value
µ is the mean
σ is the standard deviation
The z-score indicates whether the value is above or below the mean, and by how many
standard deviations; it may be positive or negative. The z-score values of data samples
exceeding the threshold of ±2.5 are considered to be outliers.
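A small sketch of z-score-based outlier detection, using the ±2.5 threshold stated above (the helper name is illustrative):

    import numpy as np

    def zscore_outliers(values, threshold=2.5):
        values = np.asarray(values, dtype=float)
        z = (values - values.mean()) / values.std()  # population standard deviation
        return values[np.abs(z) > threshold]         # values beyond the threshold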
Example 6.1:
Consider the data set given in Table 6.7. Calculate univariate outliers for each variable
using box plots and z-scores.
Solution:
The box plots for Data1, Data2, and Data3 are shown in Figure 6.9. The z-scores for
data sets given in Table 6.7 are shown in Table 6.8.
To identify multivariate outliers, the Mahalanobis Jackknife distance D measure can be
calculated for each data point. Mahalanobis Jackknife is a measure of the distance, in
multidimensional space, of each observation from the multivariate mean center of the
observations (Hair et al. 2006). Each data point is evaluated using the chi-square
distribution with a 0.001 significance value.
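A sketch of this computation is shown below. It computes the ordinary Mahalanobis distance from the multivariate mean center; the jackknifed variant additionally recomputes the mean and covariance with the evaluated observation left out. The function name is illustrative:

    import numpy as np
    from scipy.stats import chi2

    def mahalanobis_outliers(X, alpha=0.001):
        # X: (n_observations, n_variables) array
        mean = X.mean(axis=0)
        cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
        diff = X - mean
        # Squared Mahalanobis distance of each observation from the mean center
        d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
        # Chi-square cutoff at the 0.001 significance value
        cutoff = chi2.ppf(1 - alpha, df=X.shape[1])
        return d2, d2 > cutoff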
FIGURE 6.9
(a)–(c) Box plots for data given in Table 6.7: (a) Data1, (b) Data2, and (c) Data3.
TABLE 6.8
Z-Score for Data Sets
Case No.   Data1   Data2   Data3   Z-score Data1   Z-score Data2   Z-score Data3
1 35 0 20 −1.959 −3.214 −0.585
2 45 2 37 −1.469 −3.135 −0.488
3 45 72 40 −0.979 −0.052 −0.404
4 55 75 43 −0.489 −0.052 −0.387
5 55 78 45 −0.489 0.145 −0.375
6 55 80 52 −0.489 0.145 −0.341
7 65 80 55 −0.489 0.224 −0.330
8 65 85 57 0 0.224 −0.279
9 65 85 63 0 0.224 −0.273
10 65 87 65 0 0.224 −0.262
11 75 87 74 0 0.264 −0.234
12 75 87 75 0 0.303 −0.211
13 75 87 77 0.489 0.343 −0.211
15 75 89 86 0.489 0.422 −0.194
16 85 90 86 0.489 0.422 −0.194
Table 6.9 shows “min,” “max,” “mean,” “median,” “standard deviation,” “25% quartile,”
and “75% quartile” for all metrics considered in FPS study. The following observations are
made from Table 6.9:
• The size of a class measured in terms of lines of source code ranges from 0 to 2,313.
• The values of depth of inheritance tree (DIT) and number of children (NOC)
are low in the system, which shows that inheritance is not much used in all the
TABLE 6.9
Descriptive Statistics for Metrics
Metric   Min.   Max.   Mean   Median   Std. Dev.   Percentile (25%)   Percentile (75%)
CBO 0 24 8.32 8 6.38 3 14
LCOM 0 100 68.72 84 36.89 56.5 96
NOC 0 5 0.21 0 0.7 0 0
RFC 0 222 34.38 28 36.2 10 44.5
WMC 0 100 17.42 12 17.45 8 22
LOC 0 2313 211.25 108 345.55 8 235.5
DIT 0 6 1 1 1.26 0 1.5
TABLE 6.10
Correlation Analysis Results
Metric CBO LCOM NOC RFC WMC LOC DIT
CBO 1
LCOM 0.256 1
NOC −0.03 −0.028 1
RFC 0.386 0.334 −0.049 1
WMC 0.245 0.318 0.035 0.628 1
LOC 0.572 0.238 −0.039 0.508 0.624 1
DIT 0.4692 0.256 −0.031 0.654 0.136 0.345 1
systems; similar results have also been shown by others (Chidamber et al. 1998;
Cartwright and Shepperd 2000; Briand et al. 2000a).
The lack of cohesion in methods (LCOM) measure, which counts the number of classes
with no attribute usage in common, has high values (up to 100) in the KC1 data set.
FIGURE 6.10
Attribute reduction procedure, using attribute selection techniques or attribute extraction techniques.
Hence, attribute reduction leads to improved computational efficiency, lower cost, increased
problem understanding, and improved accuracy. Figure 6.11 shows the categories of attri-
bute reduction methods.
FIGURE 6.11
Classification of attribute reduction methods: attribute selection and attribute extraction.
FIGURE 6.12
Procedure of filter method: the original attribute set undergoes attribute subset generation and attribute measurement; the reduced set is evaluated, and a bad subset sends the process back to subset generation.
FIGURE 6.13
Procedure of wrapper method: candidate attribute subsets are evaluated by training a learning algorithm on training data and validating the model's accuracy on testing data.
P1 = (b11 × X1) + (b12 × X2) + ⋯ + (b1k × Xk)
P2 = (b21 × X1) + (b22 × X2) + ⋯ + (b2k × Xk)
⋮
Pk = (bk1 × X1) + (bk2 × X2) + ⋯ + (bkk × Xk)
All bij's, called loadings, are worked out in such a way that the extracted P.C.s satisfy the
following two conditions: (1) the extracted P.C.s are uncorrelated (orthogonal) with each
other, and (2) the first P.C. accounts for the maximum possible variance in the data, with
each successive P.C. accounting for the maximum of the remaining variance.
The variables with high loadings help identify the dimension the P.C. is capturing, but this
usually requires some degree of interpretation. To identify these variables, and to interpret
the P.C.s, the rotated components are used. As the dimensions are independent, orthogonal
rotation is used, in which the axes are maintained at 90 degrees. There are various
strategies to perform such rotation, including quartimax, varimax, and equimax orthogonal
rotation. For a detailed description, refer to Hair et al. (2006) and Kothari (2004).
The varimax method maximizes the sum of variances of required loadings of the factor matrix
(a table displaying the factor loadings of all variables on each factor). Varimax rotation is
the most frequently used strategy in the literature. An eigenvalue (or latent root) is associated
with each P.C. It refers to the sum of squared values of loadings relating to a dimension, and
it indicates the relative importance of each dimension for the particular set of variables
being analyzed. The P.C.s with eigenvalue >1 are taken for interpretation (Kothari 2004).
6.2.3 Discussion
It is useful to interpret the results of regression analysis in the light of results obtained from
P.C. analysis. P.C. analysis shows the main dimensions, including independent variables
as the main drivers for predicting the dependent variable. It would also be interesting to
observe the metrics included in dimensions across various replicated studies; this will help
in finding differences across various studies. From such observations, the recommendations
regarding which independent variable appears to be redundant and need not be collected
can be derived, without losing a significant amount of design information (Briand and Wust
2002). P.C. analysis is a widely used method for removing redundant variables in neural
networks.
The univariate analysis is used in preselecting the metrics with respect to their signifi-
cance, whereas CFS is the widely used method for preselecting independent variables in
machine learning methods (Hall 2000). In Hall (2003), the results showed that CFS chooses
few attributes, is faster, and is an overall good performer.
In the given example, x is an attribute of animals with critical area c = run, walk, sit,
and so on; these are the values that will cause the null hypothesis to be rejected. The test
is whether x ≠ fly; if yes, reject the null hypothesis, otherwise accept it. Hence, if x = fly,
the null hypothesis is accepted.
In real-life, a software practitioner may want to prove that the decision tree algorithms
are better than the logistic regression (LR) technique. This is known as assumption of the
researcher. Hence, the null hypothesis can be formulated as “there is no difference between
the performance of the decision tree technique and the LR technique." The assumption
needs to be evaluated using statistical tests on the basis of data to reach a conclusion.
In empirical research, hypothesis formulation and evaluation are the bottom line of research.
This section will highlight the concept of hypothesis testing, and the steps followed in
hypothesis testing.
6.3.1 Introduction
Consider a setup where the researcher is interested in whether some learning technique
“Technique X” performs better than “Technique Y” in predicting the change proneness of
a class. To reach a conclusion, both technique X and technique Y are used to build change
prediction models. These prediction models are then used to predict the change proneness
of a sample data set (for details on training and testing of models refer Chapter 7) and
based on the outcome observed over the sample data set, it is determined which technique
is the better predictor out of the two. However, concluding which technique is better is a
challenging task because of the following issues:
1. The number of data points in the sample could be very large, making data analysis
and synthesis difficult.
2. The researcher might be biased towards one of the techniques and could overlook
minute differences that have the potential of impacting the final result greatly.
3. The conclusions drawn may have occurred merely by chance because of bias in the
sample data itself.
To neutralize the impact of researcher bias and ensure that all the data points contribute
to the results, it is essential that a standard procedure be adopted for the analysis and
synthesis of sample data. Statistical tests allow the researcher to test the research questions
(hypotheses) in a generalized manner. There are various statistical tests like the student
t-test, chi-squared test, and so on. Each of these tests is applicable to a specific type of data
and allows for comparison in such a way that using the data collected from a small sample,
conclusions can be drawn for the entire population.
Step 1: Define hypothesis—In the first step, the hypothesis is defined corresponding
to the outcomes. The statistical tests are used to verify the hypothesis formed in
the experimental design phase.
FIGURE 6.14
Steps in statistical tests.
FIGURE 6.15
Categories of statistical tests: for independent samples, the t-test (parametric) or the Mann–Whitney U test (nonparametric) is used for two samples, and one-way ANOVA (parametric) or the Kruskal–Wallis test (nonparametric) for more than two samples; for dependent samples, the paired t-test (parametric) or the Wilcoxon signed-rank test (nonparametric) is used for two samples, and repeated measures ANOVA (parametric) or the Friedman test (nonparametric) for more than two samples; the chi-square test is used for association between variables; and univariate regression analysis for causal relationships.
Univariate regression analysis can be used for testing the hypothesis for a binary
dependent variable. Table 6.11 depicts the summary
of assumptions, data scale, and normality requirement for each statistical test discussed
in this chapter.
H0: µ = µ0
Ha: µ > µ0
where:
µ is the population mean
µ0 is the hypothesized value of the mean
TABLE 6.11
Summary of Statistical Tests
Test Assumptions Data Scale Normality
One sample t-test The data should not have any Interval or ratio. Required
significant outliers.
The observations should be
independent.
Two sample t-test Standard deviations of the two Interval or ratio. Required
populations must be equal.
Samples must be independent of Interval or ratio.
each other.
The samples are randomly drawn Interval or ratio.
from respective populations.
Paired t-test Samples must be related with each Interval or ratio. Required
other.
The data should not have any
significant outliers.
Chi-squared test Samples must be independent of Nominal or ordinal. Not required
each other.
The samples are randomly drawn
from respective populations.
F-test All the observations should be Interval or ratio. Required
independent.
The samples are randomly drawn
from respective populations and
there is no measurement error.
One-way ANOVA One-way ANOVA should be used Interval or ratio. Required
when you have three or more
independent samples.
The data should not have any
significant outliers.
The data should have homogeneity
of variances.
Two-way ANOVA The data should not have any Interval or ratio. Required
significant outliers.
The data should have homogeneity
of variances.
Wilcoxon signed-rank test The data should consist of two
“related groups” or “matched
pairs.”
Wilcoxon–Mann– The samples must be independent. Ordinal or continuous. Not required
Whitney test
Kruskal–Wallis test The test should validate three or Ordinal or continuous. Not required
more independent sample
distributions.
The samples are drawn randomly
from respective populations.
Friedman test The samples should be drawn Ordinal or continuous. Not required
randomly from respective
populations.
Here, the alternative hypothesis specifies that the population mean is strictly "greater than"
the hypothesized value. The hypothesis below is an example of a two-tailed test:
H0: µ = µ0
Ha: µ < µ0 or µ > µ0
Figure 6.16 shows the probability curve for a two-tailed test, with rejection (or critical)
regions on both sides of the curve. Thus, the null hypothesis is rejected if the sample mean
lies in either of the rejection regions. A two-tailed test is also called a nondirectional test.
Figure 6.17 shows the probability curve for a one-tailed test, with the rejection region on
one side of the curve. A one-tailed test is also referred to as a directional test.
FIGURE 6.16
Probability curve for two-tailed test.
FIGURE 6.17
Probability curve for one-tailed test.
TABLE 6.12
Types of Errors
                 H0 True            H0 False
Reject H0        Type I error       Correct decision
Accept H0        Correct decision   Type II error
The probability of a type I error is denoted by the Greek letter alpha (α) and is also known
as the significance level of a test. Type II error is defined as the probability of wrongly not
rejecting the null hypothesis when the null hypothesis is false. In other words, a type II
error occurs when the null hypothesis is actually false, but somehow it fails to get rejected.
It is also known as a "false negative": a result in which an actual "hit" is erroneously seen
as a "miss." The rate of the type II error is denoted by the Greek letter beta (β) and is
related to the power of a test (which equals 1 − β). The definitions of these errors can also
be tabularized as shown in Table 6.12.
6.4.6 t-Test
W. S. Gosset designed the student's t-test (Student 1908). The purpose of the t-test is to
determine whether two data sets are different from each other or not. It is based on the
assumption that both the data sets are normally distributed. There are three variants of
t-tests:
1. One sample t-test, which is used to compare mean with a given value.
2. Independent sample t-test, which is used to compare means of two independent
samples.
3. Paired t-test, which is used to compare means of two dependent samples.
In the one sample t-test, the mean of a given sample is computed and compared with a
given value of interest. The aim of the one sample t-test is to find whether there is
sufficient evidence to conclude that the mean of a given sample differs from a specified
value. For example, the one sample t-test can be used to determine whether the average
increase in the number of comment lines per method is more than five after improving the
readability of the source code.
The assumption in the one sample t-test is that the population from which the sample is
derived must have a normal distribution. The following null and alternative hypotheses are
formed for applying the one sample t-test on a given problem:
t = (µ − µ0)/(σ/√n)
where:
µ represents mean of a given sample
σ represents standard deviation
n represents sample size
The above hypothesis is based on a two-tailed t-test. The degrees of freedom (DOF) are
n − 1, as the t-test estimates the unknown standard deviation of the population by the
standard deviation of the sample. The next step is to obtain the significance value (p-value)
and compare it with the established threshold value (α). To obtain the p-value for the given
t-statistic, the t-distribution table is referred to; the table can only be used given the DOF.
Example 6.2:
Consider Table 6.13, where the number of modules for 15 software systems is shown.
We want to determine whether the population from which the sample is derived has, on
average, a number of modules different from 12.
TABLE 6.13
Number of Modules
System   Module#   System   Module#   System   Module#
S1 10 S6 35 S11 24
S2 15 S7 26 S12 23
S3 24 S8 29 S13 14
S4 29 S9 19 S14 12
S5 16 S10 18 S15 5
Solution:
The following steps are carried out to solve the example:
t = (µ − µ0)/(σ/√n) = (19.93 − 12)/(8.17/√15) = 3.76
TABLE 6.14
Critical Values of t-Distributions
Level of significance for one-tailed test
0.10 0.05 0.02 0.01 0.005
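For reference, the computation of Example 6.2 can be reproduced with SciPy; this is a sketch, and the reported values match the hand calculation up to rounding:

    from scipy import stats

    modules = [10, 15, 24, 29, 16, 35, 26, 29, 19, 18, 24, 23, 14, 12, 5]
    t, p = stats.ttest_1samp(modules, popmean=12)
    print(t, p)  # t is approximately 3.76 with 14 DOF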
t = (µ1 − µ2)/√(σ1²/n1 + σ2²/n2)
where:
µ1 and µ2 are the means of both the samples, respectively
σ1 and σ2 are the standard deviations of both the samples, respectively
The DOF is n1 + n2 − 2, where n1 and n2 are the sample sizes of the two samples. Now,
obtain the significance value (p-value) for the computed t-statistic using the t-distribution
and compare it with the established threshold value (α).
Example 6.3:
Consider an example comparing the properties of industrial and open source software in
terms of the average amount of coupling between modules (the other modules to which a
module is coupled). Both software systems are text editors developed in the Java language.
In this example, we believe that the type of software affects the amount of coupling
between modules.
Industrial: 150, 140, 172, 192, 186, 180, 144, 160, 188, 145, 150, 141
Open source: 138, 111, 155, 169, 100, 151, 158, 130, 160, 156, 167, 132
Solution:
Step 1: Formation of hypothesis.
In this step, null (H0) and alternative (Ha) hypotheses are formed. The hypoth-
eses for the example are given below:
Data Analysis and Statistical Testing 233
TABLE 6.15
Descriptive Statistics
Descriptive Statistic Industrial Software Open Source Software
No. of observations 12 12
Mean 162.33 143.92
Standard deviation 20.01 21.99
t = (µ1 − µ2)/√(σ1²/n1 + σ2²/n2) = (162.33 − 143.92)/√(20.01²/12 + 21.99²/12) = 2.146
The DOF is 22 (12 + 12 − 2) in this example. Given 22 DOF and referring the
t-distribution table, the obtained p-value is 0.043.
Step 4: Define significance level.
As computed in Step 3, the p-value is 0.043. It can be seen that the results are
statistically significant at 0.05 significance value.
Step 5: Derive conclusions.
The results are significant at 0.05 significance level. Hence, we reject the null
hypothesis, and the results show that the mean amount of coupling between
modules depicted by the industrial software is statistically significantly differ-
ent than the mean amount of coupling between modules depicted by the open
source software (t = 2.146, p = 0.043).
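The same comparison can be reproduced with SciPy's pooled-variance t-test (a sketch; with equal sample sizes it matches the hand calculation up to rounding):

    from scipy import stats

    industrial = [150, 140, 172, 192, 186, 180, 144, 160, 188, 145, 150, 141]
    open_source = [138, 111, 155, 169, 100, 151, 158, 130, 160, 156, 167, 132]
    t, p = stats.ttest_ind(industrial, open_source)
    print(t, p)  # approximately t = 2.146, p = 0.043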
H0: µ1 − µ2 = 0 (There is no difference between the mean values of the two samples.)
Ha: µ1 − µ2 ≠ 0 (There exists difference between the mean values of the two samples.)
σd = √{[Σd² − (Σd)²/n]/(n − 1)}
where:
n represents number of pairs and not total number of samples
d is difference between values of two samples
The DOF is n − 1. The p-value is obtained and compared with the established threshold
value (α) for the computed t-statistic using the t-distribution.
Example 6.4:
Consider an example where values of the CBO metric (the number of other classes to which
a class is coupled) are given before and after applying a refactoring technique to improve
the quality of the source code. The data is given in Table 6.16.
Solution:
Step 1: Formation of hypothesis.
In this step, null (H0) and alternative (Ha) hypotheses are formed. The hypoth-
eses for the example are given below:
H0: µCBO1 = µCBO2 (Mean of CBO metric before and after applying refactoring
are equal.)
Ha: µCBO1 ≠ µCBO2 (Mean of CBO metric before and after applying refactoring
are not equal.)
Step 2: Select the appropriate statistical test.
The samples are extracted from populations with normal distribution. As
we are using samples derived from the same populations and analyzing the
before and after effect of refactoring on CBO, these are related samples. We
need to use paired t-test for comparing the difference between values of CBO
derived from two dependent samples.
Step 3: Apply test and calculate p-value.
We first calculate the mean values of both the samples and also calculate the
difference (d) among the paired values of both the samples as shown in Table 6.17.
The t-statistic is given below:
σd = √{[Σd² − (Σd)²/n]/(n − 1)} = √{[12 − (8)²/15]/14} = 0.743

t = (µ1 − µ2)/(σd/√n) = (67.6 − 67.07)/(0.743/√15) = 2.779
The DOF is 14 (15 − 1) in this example. Given 14 DOF and referring the
t-distribution table, the obtained p-value is 0.015.
TABLE 6.16
CBO Values
CBO before refactoring 45 48 49 52 56 58 66 67 74 75 81 82 83 88 90
CBO after refactoring 43 47 49 52 56 57 66 67 74 73 80 82 83 87 90
Data Analysis and Statistical Testing 235
TABLE 6.17
CBO Values
CBO before Refactoring CBO after Refactoring Differences (d)
45 43 2
48 47 1
49 49 0
52 52 0
56 56 0
58 57 1
66 66 0
67 67 0
74 74 0
75 73 2
81 80 1
82 82 0
83 83 0
88 87 1
90 90 0
µCBO1 = 67.6 µCBO2 = 67.07
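For reference, the paired comparison of Example 6.4 can be reproduced with SciPy (a sketch; values match the hand calculation up to rounding):

    from scipy import stats

    before = [45, 48, 49, 52, 56, 58, 66, 67, 74, 75, 81, 82, 83, 88, 90]
    after = [43, 47, 49, 52, 56, 57, 66, 67, 74, 73, 80, 82, 83, 87, 90]
    t, p = stats.ttest_rel(before, after)
    print(t, p)  # approximately t = 2.779, p = 0.015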
6.4.7 Chi-Squared Test
The chi-squared (χ²) test is used to test the association between two categorical variables.
The test statistic is given as:

χ² = Σ (Oij − Eij)²/Eij

where:
Oij is the observed frequency of the cell in the ith row and jth column
Eij is the expected frequency of the cell in the ith row and jth column

The expected frequency of each cell is calculated as:

Erow,column = (Nrow × Ncolumn)/N

where:
N is the total number of observations
Nrow is the total of all observations in a specific row
Ncolumn is the total of all observations in a specific column
The larger the difference between the observed and the expected values, the greater the
deviation from the stated null hypothesis. The DOF is (row − 1) × (column − 1) for any given
table. The expected values are calculated for each category of the categorical variable at
each factor of the other categorical variable. Then, the χ² value is calculated for each cell.
After calculating the individual χ² values, the individual χ² values of each cell are added
to obtain an overall χ² value. The overall χ² value is compared with the tabulated value for
(row − 1) × (column − 1) DOF. If the calculated χ² value is greater than the tabulated χ²
value at critical value α, we reject the null hypothesis.
Example 6.5:
Consider Table 6.18 that consists of data for a particular software. It states the catego-
rization of modules according to three maintenance levels (high, medium, and low)
and according to the number of LOC (high and low). A researcher wants to investigate
whether LOC and maintenance level are independent of each other or not.
TABLE 6.18
Categorization of Modules
Maintenance Level
High Low Medium Total
LOC High 23 40 22 85
Low 17 30 20 67
Total 40 70 42 152
Erow,column = (Nrow × Ncolumn)/N

Then, calculate the χ² value for each cell:

χ² = Σ (Oij − Eij)²/Eij

Finally, calculate the overall χ² value by adding the corresponding χ² values of each cell.
TABLE 6.19
Calculation of Expected Frequency
Maintenance Level:   High   Low   Medium
LOC High:   (85 × 40)/152 = 22.36   (85 × 70)/152 = 39.14   (85 × 42)/152 = 23.48
LOC Low:   (67 × 40)/152 = 17.63   (67 × 70)/152 = 30.85   (67 × 42)/152 = 18.52
TABLE 6.20
Calculation of χ² Values
Maintenance Level:   High   Low   Medium
LOC High:   (23 − 22.36)²/22.36 = 0.018   (40 − 39.14)²/39.14 = 0.019   (22 − 23.48)²/23.48 = 0.093
LOC Low:   (17 − 17.63)²/17.63 = 0.022   (30 − 30.85)²/30.85 = 0.023   (20 − 18.52)²/18.52 = 0.118
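For reference, the same independence test can be run with SciPy (a sketch; the statistic is the sum of the cell values above, about 0.29 with 2 DOF):

    from scipy import stats

    observed = [[23, 40, 22],   # LOC high: maintenance level high, low, medium
                [17, 30, 20]]   # LOC low
    chi2_stat, p, dof, expected = stats.chi2_contingency(observed)
    print(chi2_stat, dof, p)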
Example 6.6
Analyze the performance of four algorithms when applied on a single data set as given
in Table 6.21. Evaluate whether there is any significant difference in the performance of
the four algorithms at 5% significance level.
Solution:
Step 1: Formation of hypothesis.
The hypotheses for the example are given below:
H0: There is no significant difference in the performance of the algorithms.
Ha: There is significant difference in the performance of the algorithms.
Step 2: Select the appropriate statistical test.
To explore the “goodness-of-fit” of different algorithms when applied on a
specific data set, we can effectively use chi-square test.
Step 3: Apply test and calculate p-value.
Calculate the expected frequency of each cell according to the following
formula:
E = (Σ Oi)/n

where:
Oi is the observed value of the ith observation
n is the total number of observations

E = (81 + 61 + 92 + 43)/4 = 69.25

Next, we calculate the individual χ² values as shown in Table 6.22.
TABLE 6.21
Performance Values of Algorithms
Algorithm Performance
A1 81
A2 61
A3 92
A4 43
TABLE 6.22
Calculation of χ² Values
Observed Frequency (Oi)   Expected Frequency (E)   (Oi − E)²/E
81   69.25   1.994
61   69.25   0.983
92   69.25   7.474
43   69.25   9.950

Now

χ² = Σ (Oi − E)²/E = 20.40
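SciPy's chisquare function performs this goodness-of-fit computation directly; when no expected frequencies are passed, it uses the mean of the observations (69.25 here), matching the calculation above:

    from scipy import stats

    performance = [81, 61, 92, 43]
    stat, p = stats.chisquare(performance)  # expected defaults to the mean, 69.25
    print(stat, p)                          # chi-square of about 20.40 with 3 DOF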
Example 6.7:
Consider a scenario where a researcher wants to find the importance of SLOC metric,
in deciding whether a particular class having more than 50 source LOC (SLOC) will
be defective or not. The details of defective and not defective classes are provided in
Table 6.23. Test the result at 0.05 significance value.
Solution:
Step 1: Formation of hypothesis.
The null and alternate hypotheses are formed as follows:
H0: Classes having more than 50 SLOC will not be defective.
Ha: Classes having more than 50 SLOC will be defective.
Step 2: Select the appropriate statistical test.
To investigate the importance of SLOC attribute in detection of defective and
not defective classes, we can appropriately use chi-square test to find an attri-
bute’s importance.
Step 3: Apply test and calculate p-value.
Calculate the expected frequency of each cell according to the following formula:
Erow,column = (Nrow × Ncolumn)/N
Table 6.24 shows the observed and the calculated expected frequency of each cell, along
with the individual χ² value of each cell. Now

χ² = Σ (Oij − Eij)²/Eij = 100 + 33.33 + 50 + 16.67 = 200
TABLE 6.23
SLOC Values for Defective and Not Defective Classes
Defective (D) Not Defective (ND) Total
Number of classes having SLOC ≥ 50 200 200 400
Number of classes having SLOC < 50 100 700 800
Total 300 900 1,200
TABLE 6.24
Calculation of Expected Frequency and χ² Values
Observed Frequency (Oij)   Expected Frequency (Eij)   (Oij − Eij)   (Oij − Eij)²   (Oij − Eij)²/Eij
200   (400 × 300)/1200 = 100   100   10,000   100
200   (400 × 900)/1200 = 300   −100   10,000   33.33
100   (800 × 300)/1200 = 200   −100   10,000   50
700   (800 × 900)/1200 = 600   100   10,000   16.67
Example 6.8:
Consider a scenario where 40 students developed the same program. The size of each
program is measured in terms of LOC and is provided in Table 6.25. Evaluate whether the
size values of the programs developed by the 40 students individually follow a normal
distribution.
Solution:
Step 1: Formation of hypothesis.
The null and alternative hypotheses are as follows:
H0: The data follows a normal distribution.
Ha: The data does not follow a normal distribution.
Step 2: Select the appropriate statistical test.
In the case of the normal distribution, there are two parameters, the mean (µ)
and the standard deviation (σ) that can be estimated from the data. Based on
the data, µ = 793.125 and σ = 64.81. To test the normality of data, we can use
chi-square test.
Step 3: Apply test and calculate p-value.
We first need to divide data into segments in such a way that the segments
have the same probability of including a value, if the data actually is normally
TABLE 6.25
LOC Values
641 672 811 770 741 854 891 792 753 876
801 851 744 948 777 808 758 773 734 810
833 704 846 800 799 724 821 757 865 813
721 710 749 932 815 784 812 837 843 755
distributed with mean µ and standard deviation σ. We divide the data into
10 segments. We find the upper and lower limits of all the segments. To find
upper limit (xi) of ith segment, the following equation is used:
P(X < xi) = i/10

where:
i = 1, …, 9
X is N(µ, σ²)
Equivalently, in terms of the standard normal variable Xs, which is N(0,1),
P(Xs < zi) = i/10 for i = 1, …, 9, where:

zi = (xi − µ)/σ
Using standard normal table, we can calculate the values of zi. We can then
calculate the value of xi using the following equation:
xi = σzi + µ
The calculated values zi and xi are given in Table 6.26. Since, a normally
distributed variable theoretically ranges from −∞ to +∞, the lower limit of
segment 1 is taken as –∞ and the upper limit of segment 10 is taken as +∞. The
number of values that fall in each segment are also shown in the table. They
represent the observed frequency (Oi). The expected number of values (Ei) in
each segment can be calculated as 40/10 = 4.
Now,

χ² = Σ (Oi − Ei)²/Ei = 5
TABLE 6.26
Segments and χ² Calculation
Segment No.   zi   Lower Limit   Upper Limit   Oi   Ei   (Oi − Ei)²
1 −1.28 −∞ 710.17 4 4 0
2 −0.84 710.17 738.68 3 4 1
3 −0.525 738.68 759.10 7 4 9
4 −0.255 759.10 776.60 2 4 4
5 0 776.60 793.13 3 4 1
6 0.255 793.13 809.65 4 4 0
7 0.525 809.65 827.15 6 4 4
8 0.84 827.15 847.56 4 4 0
9 1.28 847.56 876.08 4 4 0
10 – 876.08 +∞ 3 4 1
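The segment construction above can be sketched in Python as follows. The helper name is illustrative, and the choice of standard deviation convention may cause small differences from the hand computation:

    import numpy as np
    from scipy.stats import norm, chi2

    def chi_square_normality(data, n_segments=10):
        data = np.asarray(data, dtype=float)
        mu, sigma = data.mean(), data.std(ddof=1)
        # Upper limits x_i such that P(X < x_i) = i/n_segments, i = 1..n_segments-1
        probs = np.arange(1, n_segments) / n_segments
        cuts = norm.ppf(probs, loc=mu, scale=sigma)
        edges = np.concatenate(([-np.inf], cuts, [np.inf]))
        observed, _ = np.histogram(data, bins=edges)
        expected = len(data) / n_segments
        stat = ((observed - expected) ** 2 / expected).sum()
        # DOF: segments minus 1, minus the two estimated parameters (mu, sigma)
        return stat, chi2.sf(stat, df=n_segments - 1 - 2)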
6.4.8 F-Test
F-test is used to investigate the equality of variance for two populations. A number of
assumptions need to be checked for the application of the F-test (Kothari 2004): all the
observations should be independent, the samples should be randomly drawn from normally
distributed populations, and there should be no measurement error.
We can formulate the following null and alternative hypotheses for the application of the
F-test on a given problem with two populations:
F = σ²sample1 / σ²sample2

where the sample variance is computed as:

σ²sample = Σ(xi − µ)²/(n − 1)
where:
n represents the number of observations in a sample
xi represents the ith observation of the sample
µ represents the mean of the sample observations
We also designate v1 as the DOF in the sample having greater variance and v2 as the DOF in the
other sample. The DOF is designated as one less than the number of observations in the cor-
responding sample. For example, if there are 5 observations in a sample, then the DOF is des-
ignated as 4 (5 − 1). The calculated value of F is compared with tabulated Fα (v1, v2) value at the
desired α value. If the calculated F-value is greater than Fα, we reject the null hypothesis (H0).
TABLE 6.27
Runtime Performance of Learning Techniques
A1 11 16 10 4 8 13 17 18 5
A2 14 17 9 5 7 11 19 21 4
Example 6.9:
Consider Table 6.27 that shows the runtime performance (in seconds) of two learning
techniques (A1 and A2) on several data sets. We want to test whether the populations
have the same variances.
Solution:
Step 1: Formation of hypothesis.
In this step, null (H0) and alternative (Ha) hypothesis are formed. The hypoth-
eses for the example are given below:
H0: σ12 = σ22 (Variances of two populations are equal.)
Ha: σ12 ≠ σ22 (Variances of two populations are not equal.)
Step 2: Select the appropriate statistical test.
The samples belong to normal populations and are independent in nature.
Thus, to investigate the equality of variances of two populations, we use F-test.
Step 3: Apply test and calculate p-value.
In this example, n1 = 9 and n2 = 9. The calculation of two sample variances is as
follows:
We first compute the means of the two samples,
µ1 = (11 + 16 + 10 + 4 + 8 + 13 + 17 + 18 + 5)/9 = 11.33 and
µ2 = (14 + 17 + 9 + 5 + 7 + 11 + 19 + 21 + 4)/9 = 11.89. The sample variances are then:

σ1² = Σ(xi − µ1)²/(n1 − 1) = [(11 − 11.33)² + ⋯ + (5 − 11.33)²]/(9 − 1) = 26

σ2² = Σ(xi − µ2)²/(n2 − 1) = [(14 − 11.89)² + ⋯ + (4 − 11.89)²]/(9 − 1) = 38.36

F = σ2²/σ1² = 38.36/26 = 1.47 (because σ2² > σ1²)
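A sketch of the same computation with NumPy and SciPy; the two-tailed p-value is obtained from the F-distribution with 8 and 8 DOF:

    import numpy as np
    from scipy import stats

    a1 = [11, 16, 10, 4, 8, 13, 17, 18, 5]
    a2 = [14, 17, 9, 5, 7, 11, 19, 21, 4]
    var1, var2 = np.var(a1, ddof=1), np.var(a2, ddof=1)  # sample variances: 26, 38.36
    F = max(var1, var2) / min(var1, var2)                # 1.47
    p = 2 * stats.f.sf(F, dfn=8, dfd=8)                  # two-tailed p-value
    print(F, p)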
We also assume that all the other factors, except the ones being investigated, are
adequately controlled, so that the conclusions can be appropriately drawn. One-way ANOVA,
also called single factor ANOVA, considers only one factor for analysis of the outcome of
the dependent variable. It is used for a completely randomized design.
In general, we calculate two variance estimates, one "within samples" and the other
"between samples." Finally, we compute the F-value as the ratio of these two variance
estimates:

F = variance between samples / variance within samples

The computed F-value is then compared with the F-limit for the specific DOF. If the
computed F-value is greater than the F-limit value, then we can conclude that the sample
means differ significantly.
t-test is sufficient. We can formulate the following null and alternative hypotheses for the
application of one-way ANOVA on a given problem:

H0: µ1 = µ2 = ⋯ = µk (All sample means are equal.)
Ha: At least one sample mean differs from the others.

The steps for computing the F-statistic are as follows. Here, we assume k is the number of
samples and n is the total number of observations:
Step a: Calculate the means of each of the samples: µ1, µ2, µ3, …, µk.
Step b: Calculate the mean of sample means:

µ = (µ1 + µ2 + µ3 + ⋯ + µk)/k

Step c: Calculate the sum of squares of variance between the samples (SSBS):

SSBS = n1(µ1 − µ)² + n2(µ2 − µ)² + n3(µ3 − µ)² + ⋯ + nk(µk − µ)²
Step d: Calculate the sum of squares of variance within samples (SSWS). To obtain
SSWS, we find the deviation of each sample observation from its corresponding
sample mean and square the obtained deviations. We then sum all the squared
deviation values to obtain SSWS.
Step e: Calculate the total sum of squares of variance (SSTV) as SSTV = SSBS + SSWS.
Step f: Calculate the mean square between samples (MSBS) and the mean square
within samples (MSWS), and set up an ANOVA summary as shown in Table 6.28.
The calculated value of F is compared with the tabulated Fα (k − 1, n − k) value at the
desired α value. If the calculated F-value is greater than Fα, we reject the null
hypothesis (H0).
TABLE 6.28
Computation of Mean Square and F-Statistic
Source of Variation   Sum of Squares (SS)   DOF   Mean Square (MS)   F-Ratio
Between sample   SSBS   k − 1   MSBS = SSBS/(k − 1)   F-ratio = MSBS/MSWS
Within sample   SSWS   n − k   MSWS = SSWS/(n − k)
Total   SSTV   n − 1
TABLE 6.29
Accuracy Values of Techniques
Techniques
Data Sets A1 A2 A3
D1 60 (x11) 50 (x12) 40 (x13)
D2 40 (x21) 50 (x22) 40 (x23)
D3 70 (x31) 40 (x32) 50 (x33)
D4 80 (x41) 70 (x42) 30 (x43)
Example 6.10:
Consider Table 6.29 that shows the performance values (accuracy) of three techniques
(A1, A2, and A3), which are applied on four data sets (D1, D2, D3, and D4) each. We want
to investigate whether the performance of all the techniques calculated in terms of accu-
racy (refer to Section 7.5.3 for definition of accuracy) are equivalent.
Solution:
The following steps are carried out to solve the example.
Step a: Calculate the means of each of the samples.

µ1 = (60 + 40 + 70 + 80)/4 = 62.5; µ2 = (50 + 50 + 40 + 70)/4 = 52.5; µ3 = (40 + 40 + 50 + 30)/4 = 40
Step b: Calculate the mean of sample means.

µ = (µ1 + µ2 + µ3)/k = (62.5 + 52.5 + 40)/3 = 51.67
Step c: Calculate the SSBS.

SSBS = 4(62.5 − 51.67)² + 4(52.5 − 51.67)² + 4(40 − 51.67)² = 1016.68

Step d: Calculate the SSWS.

SSWS = [(60 − 62.5)² + (40 − 62.5)² + (70 − 62.5)² + (80 − 62.5)²] + [(50 − 52.5)² + (50 − 52.5)² + (40 − 52.5)² + (70 − 52.5)²] + [(40 − 40)² + (40 − 40)² + (50 − 40)² + (30 − 40)²] = 1550

Step e: Calculate the total sum of squares: SSTV = SSBS + SSWS = 1016.68 + 1550 = 2566.68.
Step f: Calculate MSBS and MSWS, and setup an ANOVA summary as shown
in Table 6.30.
The DOF for between sample variance is 2 and that for within sample vari-
ance is 9. For the corresponding DOF, we compute the F-value using the
F-distribution table and obtain the p-value as 0.103.
Step 4: Define significance level.
After obtaining the p-value in Step 3, we need to decide the threshold or α
value. The calculated value of F at Step 3 is 2.95, which is less than the tabu-
lated value of F (4.26) with DOF being v1 = 2 and v2 = 9 at 5% level. Thus, the
results are not statistically significant at 0.05 significance value.
Step 5: Derive conclusions.
As the results are not statistically significant at 0.05 significance value, we
accept the null hypothesis, which states that there is no difference in sample
means and all the three techniques perform equally well. The difference in
observed values of the techniques is only because of sampling fluctuations
(F = 2.95, p = 0.103).
TABLE 6.30
Computation of Mean Square and F-Statistic
Source of Variation   Sum of Squares (SS)   DOF   Mean Square (MS)   F-Ratio   F-Limit (0.05)
Between sample   1016.68   3 − 1 = 2   MSBS = 1016.68/2 = 508.34   F = 508.34/172.22 = 2.95   F(2,9) = 4.26
Within sample   1550   12 − 3 = 9   MSWS = 1550/9 = 172.22
Total   2566.68   11
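For reference, the ANOVA of Example 6.10 can be reproduced with SciPy (a sketch; results match the table up to rounding):

    from scipy import stats

    a1 = [60, 40, 70, 80]
    a2 = [50, 50, 40, 70]
    a3 = [40, 40, 50, 30]
    F, p = stats.f_oneway(a1, a2, a3)
    print(F, p)  # approximately F = 2.95, p = 0.103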
To perform the test, we compute the differences among the related pair of values of both
the treatments. The differences are then ranked based on their absolute values. We perform
the following steps while assigning ranks to the differences:
1. Exclude the pairs where the absolute difference is 0. Let nr be the reduced number
of pairs.
2. Assign rank to the remaining nr pairs based on the absolute difference. The
smallest absolute difference is assigned a rank 1.
3. In case of ties among differences (more than one difference having the same
value), each tied difference is assigned an average of tied ranks. For example,
if there are two differences of data value 5 each occupying 7th and 8th ranks,
we would assign the mean rank, that is, 7.5 ([7 + 8]/2 = 7.5) to each of the
difference.
We now compute two variables R+ and R−. R+ represents the sum of ranks assigned to dif-
ferences, where the data instance in the first treatment outperforms the second treatment.
However, R− represents the sum of ranks assigned to differences, where the second treat-
ment outperforms the first treatment. They can be calculated by the following formula
(Demšar 2006):
R+ = Σ rank(di) over all di > 0

R− = Σ rank(di) over all di < 0

where:
di is the difference between the performance measures of the two treatments on the ith of
the n data instances

Let Q = minimum(R+, R−). The Z-statistic is then computed as:

Z = [Q − (1/4)nr(nr + 1)] / √[(1/24)nr(nr + 1)(2nr + 1)]
If the Z-statistic is in the critical region with specific level of significance, then the null
hypothesis is rejected and it is concluded that there is significant difference between two
treatments, otherwise null hypothesis is accepted.
Example 6.11:
Consider an example where a researcher wants to compare the performance of two
techniques (T1 and T2) on multiple data sets using a performance measure, as given in
Table 6.31. Investigate whether the performance of the two techniques measured in terms
of AUC (refer to Section 7.5.6 for details on AUC) differs significantly.
Solution:
Step 1: Formation of hypothesis.
The hypotheses for the example are given below:
H0: The performance of the two techniques does not differ significantly.
Ha: The performance of the two techniques differs significantly.
TABLE 6.31
Performance Values of Techniques
Techniques
Data Sets T1 T2
D1 0.75 0.65
D2 0.87 0.73
D3 0.58 0.64
D4 0.72 0.72
D5 0.60 0.70
Thus, Q = minimum(R+, R−) = 3.5. The Z-statistic can be computed as follows:

Z = [Q − (1/4)nr(nr + 1)] / √[(1/24)nr(nr + 1)(2nr + 1)] = [3.5 − (1/4) × 4 × (4 + 1)] / √[(1/24) × 4 × (4 + 1) × (2 × 4 + 1)] = −0.549
The obtained p-value from the Z-distribution table is 0.581.
Step 4: Define significance level.
The critical chi-square value is χ²(0.05) = 3.841, equivalent to a critical Z-value
of 1.96. As the magnitude of the test statistic (Z = −0.549) is less than the
critical value, we accept the null hypothesis. The p-value obtained in Step 3 is
greater than α = 0.05. Thus, the results are not significant at critical value
α = 0.05.
TABLE 6.32
Computing R+ and R−
Data Set T1 T2 di |di| Rank(di)
D1 0.75 0.65 −0.10 0.10 2.5
D2 0.87 0.73 −0.14 0.14 4
D3 0.58 0.64 0.06 0.06 1
D4 0.72 0.72 0 0 –
D5 0.60 0.70 0.10 0.10 2.5
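The same test is available as scipy.stats.wilcoxon, which drops zero-difference pairs by default, as in step 1 of the procedure. This is a sketch; for such a small sample with tied ranks, SciPy falls back to a normal approximation, so the p-value may differ slightly:

    from scipy import stats

    t1 = [0.75, 0.87, 0.58, 0.72, 0.60]
    t2 = [0.65, 0.73, 0.64, 0.72, 0.70]
    stat, p = stats.wilcoxon(t1, t2)
    print(stat, p)  # statistic = min(R+, R-) = 3.5; not significant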
1. Arrange the data values of all the observations (both the samples) in ascending
(low to high) order.
2. Assign ranks to all the observations. The lowest value observation is provided
rank 1, the next to lowest observation is provided rank 2, and so on, with the
highest observation given the rank N.
3. In case of ties (more than one observation having the same value), each tied obser-
vation is assigned an average of tied ranks. For example: if there are three observa-
tions of data value 20 each occupying 7th, 8th, and 9th ranks, we would assign the
mean rank, that is, 8 ([7 + 8 + 9]/3 = 8) to each of the observation.
4. We then find the sum of all the ranks allotted to observations in sample 1 and
denote it with T1. Similarly, find the sum of all the ranks allotted to observations in
sample 2 and denote it as T2.
5. Finally, we compute the U-statistic by the following formula:
n1 ( n1 + 1)
U = n1.n2 + − T1
2
or
n2 ( n2 + 1)
U = n1.n2 + − T2
2
It can be observed that the sum of the U-values obtained by the above two formulas is
always equal to the product of the two sample sizes (n1.n2; Hooda 2003). It should be noted
that we should use the lower computed U-value as obtained by the two equations described
above. Wilcoxon–Mann–Whitney test has two specific cases (Anderson et al. 2002; Hooda
2003): (1) when the sample sizes are small (n1 < 7, n2 < 8) or (2) when the sample sizes
are large (n1 ≥ 10, n2 ≥ 10). The p-values for the corresponding computed U-values are
interpreted as follows:
Case 1: When the sample sizes are small (n1 < 7, n2 < 8)
To decide whether we should accept or reject the null hypothesis, we derive
the p-value from the tables shown in Appendix I. For the given values of n1 and n2,
we look up the p-value corresponding to the computed U-value. For example,
if the values of n1 and n2 are 4 and 5, respectively, and the computed U-value is 3,
then the p-value would be 0.056. For a two-tailed test, the p-value should be obtained
for the lesser of the two computed U-values.
Case 2: When the sample sizes are large (n1 ≥ 10, n2 ≥ 10)
For sample sizes, where each sample contains 10 or more data values, the sampling
U-distribution can be approximated by the normal distribution. In this case, we can
calculate the mean (µU) and standard deviation (σU) of the normal population as
follows:
µU = (n1·n2)/2; σU = √[n1·n2(n1 + n2 + 1)/12]

Thus, the Z-statistic can be defined as:

Z = (U − µU)/σU
Example 6.12:
Consider an example for comparing the coupling values of two different software (one
open source and other academic software), to ascertain whether the two samples are
identical with respect to coupling values (coupling of a module corresponds to the
number of other modules to which a module is coupled).
Solution:
Step 1: Formation of hypothesis.
In this step, null (H0) and alternative (Ha) hypotheses are formed. The hypoth-
eses for the example are given below:
H0: µ1 − µ2 = 0 (The two samples are identical in terms of coupling values.)
Ha: µ1 − µ2 ≠ 0 (The two sample are not identical in terms of coupling values.)
Step 2: Select the appropriate statistical test.
The two samples of our study are independent in nature, as they are collected
from two different software. Also, the outcome variable (amount of coupling)
is continuous or ordinal in nature. The data may not be normal. Hence, we use
the Wilcoxon–Mann–Whitney test for comparing the coupling values of the two software.
Step 3: Apply test and calculate p-value.
In this example, n1 = 4 (academic software) and n2 = 5 (open source software).
Table 6.33 shows all the observations arranged in ascending order with their
assigned ranks.
TABLE 6.33
Computation of Rank Statistics for
Coupling Values of Two Software
Observations Rank Sample Name
5 1 Open source
23 2 Open source
32 3 Open source
35 4 Academic
38 5 Open source
43 6 Academic
52 7 Open source
89 8 Academic
93 9 Academic
Sum of ranks assigned to the academic software (T1) = 4 + 6 + 8 + 9 = 27, and
to the open source software (T2) = 1 + 2 + 3 + 5 + 7 = 18. Then,

U = n1·n2 + [n2(n2 + 1)]/2 − T2 = 4 × 5 + [5(5 + 1)]/2 − 18 = 17

U = n1·n2 + [n1(n1 + 1)]/2 − T1 = 4 × 5 + [4(4 + 1)]/2 − 27 = 3

We use the lesser of the two computed U-values, that is, U = 3. For n1 = 4 and
n2 = 5, the p-value corresponding to U = 3 is 0.056.
Step 4: Define significance level.
As the derived p-value of 0.056, in Step 3, is greater than 2α = 0.10, we accept
the null hypothesis at α = 0.05. Thus, the results are not significant at α = 0.05.
Step 5: Derive conclusions.
As shown in Step 4, we accept the null hypothesis. Thus, we conclude that
the coupling values of the academic and open source software do not differ
significantly (U = 3, p = 0.056).
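The same comparison can be made with SciPy (a sketch). SciPy reports the U-statistic of its first argument, so the smaller U of Example 6.12 is obtained as min(U, n1·n2 − U):

    from scipy import stats

    academic = [35, 43, 89, 93]
    open_source = [5, 23, 32, 38, 52]
    U, p = stats.mannwhitneyu(academic, open_source, alternative="two-sided")
    print(min(U, len(academic) * len(open_source) - U), p)  # smaller U = 3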
Example 6.13:
Let us consider another example for large sample size, where we want to ascertain
whether the two sets of observations (sample 1 and sample 2) are extracted from identi-
cal populations by observing the cohesion values of the two samples.
Sample 1: 55, 40, 71, 59, 48, 40, 75, 46, 71, 72, 58, 76
Sample 2: 46, 42, 63, 54, 34, 46, 72, 43, 65, 70, 51, 70
Solution:
Step 1: Formation of hypothesis.
In this step, null (H0) and alternative (Ha) hypotheses are formed. The hypoth-
esis for the example is given below:
H0: µ1 − µ2 = 0 (The two samples are identical in terms of cohesion values.)
Ha: µ1 − µ2 ≠ 0 (The two sample are not identical in terms of cohesion
values.)
Step 2: Select the appropriate statistical test.
The two samples of our study are independent in nature as they are collected
from two different software. Also, the outcome variable (amount of cohesion) is
continuous or ordinal in nature. The data may not be normal. Hence, we use the
Wilcoxon–Mann–Whitney test for comparing the differences among cohesion
values of the two software.
Step 3: Apply test and calculate p-value.
In this example, n1 = 12, n2 = 12, and N = 24. Table 6.34 shows the arrangement
of all the observations in ascending order, and the ranks allocated to them.
Sum of ranks assigned to observations in sample 1 (T1) = 2.5 + 2.5 + 7 + 9
+ 12 + 13 + 14 + 19.5 + 19.5 + 21.5 + 23 + 24 = 167.5.
Sum of ranks assigned to observations in sample 2 (T2) = 1 + 4 + 5 + 7 + 7
+ 10 + 11 + 15 + 16 + 17.5 + 17.5 + 21.5 = 132.5.
TABLE 6.34
Computation of Rank Statistics for
Cohesion Values of Two Samples
Observations Rank Sample Name
34 1 Sample 2
40 2.5 Sample 1
40 2.5 Sample 1
42 4 Sample 2
43 5 Sample 2
46 7 Sample 1
46 7 Sample 2
46 7 Sample 2
48 9 Sample 1
51 10 Sample 2
54 11 Sample 2
55 12 Sample 1
58 13 Sample 1
59 14 Sample 1
63 15 Sample 2
65 16 Sample 2
70 17.5 Sample 2
70 17.5 Sample 2
71 19.5 Sample 1
71 19.5 Sample 1
72 21.5 Sample 1
72 21.5 Sample 2
75 23 Sample 1
76 24 Sample 1
U = n1·n2 + [n1(n1 + 1)]/2 − T1 = 12 × 12 + [12(12 + 1)]/2 − 167.5 = 54.5

U = n1·n2 + [n2(n2 + 1)]/2 − T2 = 12 × 12 + [12(12 + 1)]/2 − 132.5 = 89.5
As the sample size is large, we can calculate the mean (µU) and standard deviation (σU)
of the normal population: µU = (12 × 12)/2 = 72 and σU = √[12 × 12 × (12 + 12 + 1)/12] = 17.32.
Thus,

Z = (U − µU)/σU = (54.5 − 72)/17.32 = −1.012
When more than two independent samples are to be compared, the Kruskal–Wallis test (a nonparametric counterpart of one-way ANOVA) is used. The hypotheses are formed as follows:
H0: µ1 = µ2 = … = µk (All samples have identical distributions and belong to the same population.)
Ha: µ1 ≠ µ2 ≠ … ≠ µk (All samples do not have identical distributions; at least one sample may belong to a different population.)
The steps to compute the Kruskal–Wallis test statistic H are very similar to those of the Wilcoxon–Mann–Whitney test statistic U. Assuming there are k samples of size n1, n2, …, nk, respectively, and the total number of observations is N (N = n1 + n2 + … + nk), we perform the following steps:
1. Organize and sort the data values of all the observations (belonging to all the
samples) in an ascending (low to high) order.
2. Next, allocate ranks to all the observations from 1 to N. The observation with the
lowest data value is assigned a rank of 1, and the observation with the highest data
value is assigned rank N.
3. In case of two or more observations of equal values, assign the average of the ranks that would have been assigned to the observations. For example, if there are two observations of data value 40 each occupying the 3rd and 4th ranks, we would assign the mean rank, that is, (3 + 4)/2 = 3.5, to each of the 3rd and 4th observations.
4. We then compute the sum of ranks allocated to observations in each sample and
denote it as T1, T2… Tk.
5. Finally, the H-statistic is computed by the following formula:
$$H = \frac{12}{N(N + 1)} \sum_{i=1}^{k} \frac{T_i^2}{n_i} - 3(N + 1)$$
The calculated H-value is compared with the tabulated χ²α value at (k − 1) DOF at the desired α value. If the calculated H-value is greater than the χ²α value, we reject the null hypothesis (H0).
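These steps translate directly into code. The following is a from-scratch sketch in Python (the function name kruskal_wallis_h is ours); it computes the uncorrected H-statistic exactly as described above:

```python
# A from-scratch sketch of the Kruskal-Wallis steps listed above.
import numpy as np
from scipy.stats import rankdata

def kruskal_wallis_h(*samples):
    """Return the (uncorrected) Kruskal-Wallis H-statistic."""
    sizes = [len(s) for s in samples]
    pooled = np.concatenate(samples)
    n_total = len(pooled)
    ranks = rankdata(pooled)          # steps 1-3: rank all N observations,
                                      # averaging ranks over ties
    # step 4: rank sum T_i of each sample
    rank_sums, start = [], 0
    for size in sizes:
        rank_sums.append(ranks[start:start + size].sum())
        start += size
    # step 5: H = 12 / (N(N+1)) * sum(T_i^2 / n_i) - 3(N+1)
    return 12.0 / (n_total * (n_total + 1)) * sum(
        t * t / n for t, n in zip(rank_sums, sizes)) - 3 * (n_total + 1)
```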
Example 6.14:
Consider an example (Table 6.35) where three research tools were evaluated by 17 dif-
ferent researchers and were given a performance score out of 100. Investigate whether
there is a significant difference in the performance rating of the tools.
Solution:
Step 1: Formation of hypothesis.
In this step, null (H0) and alternative (Ha) hypotheses are formed. The hypoth-
eses for the example are given below:
H0: µ1 = µ2 = µ3 (The performance rating of all tools does not differ
significantly.)
Ha: µ1 ≠ µ2 ≠ µ3 (The performance rating of all tools differ significantly.)
Step 2: Select the appropriate statistical test.
The three samples are independent in nature as they are rated by 17 different
researchers. The outcome variable is continuous. As we need to compare more
than two samples, we use Kruskal–Wallis test to investigate whether there is a
significant difference in the performance rating of the tools.
TABLE 6.35
Performance Score of Tools
Tools
Tool 1 Tool 2 Tool 3
30 65 55
75 25 75
65 35 65
90 20 85
100 45 95
95 75
TABLE 6.36
Computation of Rank Kruskal–Wallis Test
for Performance Score of Research Tools
Sample
Observations Rank Name
20 1 Tool 2
25 2 Tool 2
30 3 Tool 1
35 4 Tool 2
45 5 Tool 2
55 6 Tool 3
65 8 Tool 1
65 8 Tool 2
65 8 Tool 3
75 11 Tool 1
75 11 Tool 3
75 11 Tool 3
85 13 Tool 3
90 14 Tool 1
95 15.5 Tool 1
95 15.5 Tool 3
100 17 Tool 1
Step 3: Apply test and calculate p-value.
In this example, n1 = 6, n2 = 5, n3 = 6, and N = 17. Table 6.36 shows the observations arranged in ascending order along with their ranks. The rank sums are T1 = 68.5, T2 = 20, and T3 = 64.5. The H-statistic is computed as:
$$H = \frac{12}{17(17 + 1)} \left[ \frac{(68.5)^2}{6} + \frac{(20)^2}{5} + \frac{(64.5)^2}{6} \right] - 3(17 + 1) = 7$$
Step 4: Define significance level.
The tabulated χ² value at (k − 1) = 2 DOF and α = 0.05 is 5.991. As the calculated H-value of 7 is greater than 5.991 (p ≈ 0.030), the results are significant at α = 0.05.
Step 5: Derive conclusions.
Since the calculated H-value is greater than the tabulated value, we reject the null hypothesis and conclude that the performance ratings of the three tools differ significantly (H = 7, p ≈ 0.030).
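The same result can be cross-checked with scipy.stats.kruskal, assuming SciPy is available; note that SciPy additionally applies a correction for tied ranks, so its H-value is slightly larger than the uncorrected hand computation:

```python
# Cross-checking Example 6.14 with SciPy; variable names are ours.
from scipy.stats import kruskal

tool1 = [30, 75, 65, 90, 100, 95]
tool2 = [65, 25, 35, 20, 45]
tool3 = [55, 75, 65, 85, 95, 75]

# SciPy applies a correction for tied ranks, so it reports H ~ 7.08
# (slightly above the uncorrected hand value of 7) with p ~ 0.029,
# leading to the same rejection of H0 at alpha = 0.05.
h, p = kruskal(tool1, tool2, tool3)
print(h, p)
```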
The Friedman test is used to compare k treatments (e.g., techniques) applied over multiple data instances or data sets. The steps to compute the Friedman statistic are as follows:
1. Organize and sort the data values of all the treatments for a specific data instance or data set in descending (high to low) order. Allocate ranks to all the observations from 1 to k, where rank 1 is assigned to the best performing treatment value and rank k to the worst performing treatment. In case of two or more observations of equal values, assign the average of the ranks that would have been assigned to the observations.
2. We then compute the total of ranks allocated to a specific treatment on all the data
instances. This is done for all the treatments and the rank total for k treatments is
denoted by R1, R 2, … Rk.
3. Finally, the χ2-statistic is computed by the following formula:
$$\chi^2 = \frac{12}{nk(k + 1)} \sum_{i=1}^{k} R_i^2 - 3n(k + 1)$$
where:
Ri is the individual rank total of the ith treatment
n is the number of data instances
The value of the Friedman measure χ² is distributed over k − 1 DOF. If the value of the Friedman measure falls in the critical region (obtained from the chi-squared table at a specific level of significance, i.e., 0.01 or 0.05, and k − 1 DOF), the null hypothesis is rejected and it is concluded that there is a difference among the performance of the different treatments; otherwise, the null hypothesis is accepted.
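The procedure above can be sketched in a few lines of Python; the function name friedman_chi2 is ours, and ties within a data instance are handled by averaging ranks:

```python
# A from-scratch sketch of the Friedman procedure described above.
import numpy as np
from scipy.stats import rankdata

def friedman_chi2(scores):
    """scores: n x k array, one row per data instance, one column per
    treatment; higher score = better. Returns the (uncorrected) chi2."""
    scores = np.asarray(scores, dtype=float)
    n, k = scores.shape
    # step 1: rank treatments within each row, rank 1 = best (highest score)
    ranks = np.vstack([rankdata(-row) for row in scores])
    # step 2: rank totals R_1 ... R_k
    rank_totals = ranks.sum(axis=0)
    # step 3: chi2 = 12 / (n k (k+1)) * sum(R_i^2) - 3 n (k+1)
    return 12.0 / (n * k * (k + 1)) * (rank_totals ** 2).sum() - 3 * n * (k + 1)
```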
Example 6.15:
Consider Table 6.37, where the performance values of six different classification methods
are stated when they are evaluated on six data sets. Investigate whether the performance
of different methods differ significantly.
TABLE 6.37
Performance Values of Different Methods
Methods
Data Sets M1 M2 M3 M4 M5 M6
D1 83.07 75.38 73.84 72.30 56.92 52.30
D2 66.66 75.72 73.73 71.71 70.20 45.45
D3 83.00 54.00 54.00 77.00 46.00 59.00
D4 61.93 62.53 62.53 64.04 56.79 53.47
D5 74.56 74.56 73.98 73.41 68.78 43.35
D6 72.16 68.86 63.20 58.49 60.37 48.11
Solution:
Step 1: Formation of hypothesis.
In this step, null (H0) and alternative (Ha) hypotheses are formed. The hypoth-
eses for the example are given below:
H0: There is no statistical difference between the performances of various
methods.
Ha: There is a statistically significant difference between the performances of the various methods.
Step 2: Select the appropriate statistical test.
As we need to evaluate the difference between the performances of different methods when they are evaluated using six data sets, we are evaluating different treatments on multiple data instances. Moreover, there is no specific assumption of data normality. Thus, we can use the Friedman test.
Step 3: Apply test and calculate p-value.
We compute the rank total allocated to each method on the basis of perfor-
mance ranking of each method on different data sets as shown in Table 6.38.
Now, compute the Friedman statistic,
$$\chi^2 = \frac{12}{nk(k + 1)} \sum_{i=1}^{k} R_i^2 - 3n(k + 1) = \frac{12}{6 \times 6(6 + 1)} \left(13.5^2 + 13.5^2 + 18^2 + 19^2 + 29^2 + 33^2\right) - 3 \times 6(6 + 1) = 15.88$$
Applying the standard correction for tied ranks, c = 1 − Σ(t³ − t)/[nk(k² − 1)] = 1 − 18/1260 = 0.986 (there are three tie groups of two observations each, in D3, D4, and D5), gives χ² = 15.88/0.986 = 16.11.
DOF = k − 1 = 5
TABLE 6.38
Computation of Rank Totals for Friedman Test
Methods
Data Sets M1 M2 M3 M4 M5 M6
D1 1 2 3 4 5 6
D2 5 1 2 3 4 6
D3 1 4.5 4.5 2 6 3
D4 4 2.5 2.5 1 5 6
D5 1.5 1.5 3 4 5 6
D6 1 2 3 5 4 6
Rank total 13.5 13.5 18 19 29 33
We look up the tabulated value of χ2-distribution with 5 DOF, and find the
tabulated value as 15.086 at α = 0.01. The p-value is computed as 0.007.
Step 4: Define significance level.
The calculated value of χ2 (χ2 = 16.11) is greater than the tabulated value. As the
computed p-value in Step 3 is <0.01, the results are significant at α = 0.01.
Step 5: Derive conclusions.
Since the calculated value of χ² is greater than the tabulated value, we reject the null hypothesis. Thus, we conclude that the performances of the six methods differ significantly (χ² = 16.11, p = 0.007).
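The example can be cross-checked with scipy.stats.friedmanchisquare, assuming SciPy is available; SciPy applies the tie correction, so it reproduces the corrected statistic reported above:

```python
# Cross-checking Example 6.15 with SciPy; variable names are ours. Each
# argument holds one method's values over the six data sets of Table 6.37.
from scipy.stats import friedmanchisquare

m1 = [83.07, 66.66, 83.00, 61.93, 74.56, 72.16]
m2 = [75.38, 75.72, 54.00, 62.53, 74.56, 68.86]
m3 = [73.84, 73.73, 54.00, 62.53, 73.98, 63.20]
m4 = [72.30, 71.71, 77.00, 64.04, 73.41, 58.49]
m5 = [56.92, 70.20, 46.00, 56.79, 68.78, 60.37]
m6 = [52.30, 45.45, 59.00, 53.47, 43.35, 48.11]

# SciPy applies the tie correction, so it reproduces the corrected value
# chi2 ~ 16.11 with p ~ 0.007, matching the text.
stat, p = friedmanchisquare(m1, m2, m3, m4, m5, m6)
print(stat, p)
```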
When the Friedman test rejects the null hypothesis, a post hoc test such as the Nemenyi test can be applied to determine which pairs of subjects (e.g., techniques) differ significantly. The Nemenyi test computes a critical difference (CD) as follows:
$$CD = q_\alpha \sqrt{\frac{k(k + 1)}{6n}}$$
Here, k corresponds to the number of subjects and n corresponds to the number of observations for a subject. The critical values (qα) are based on the studentized range statistic divided by √2. The computed CD value is compared with the difference between the average ranks allocated to two subjects. If the difference is equal to or greater than the CD value, the two subjects differ significantly at the chosen significance level α.
Example 6.16:
Consider an example where we compare four techniques by analyzing the performance
of the models predicted using these four techniques on six data sets each. We first apply
Friedman test to obtain the average ranks of all the methods. The computed average
ranks are shown in Table 6.39. The result of the Friedman test indicated the rejection
of null hypothesis. Evaluate whether there are significant differences among different
methods using pairwise comparisons.
Solution:
Step 1: Formation of hypothesis.
In this step, null (H0) and alternative (Ha) hypotheses are formed. The hypoth-
eses for the example are given below:
H01: The performance of T1 and T2 techniques do not differ significantly.
Ha1: The performance of T1 and T2 techniques differ significantly.
H02: The performance of T1 and T3 techniques do not differ significantly.
Ha2: The performance of T1 and T3 techniques differ significantly.
H03: The performance of T1 and T4 techniques do not differ significantly.
Ha3: The performance of T1 and T4 techniques differ significantly.
H04: The performance of T2 and T3 techniques do not differ significantly.
Ha4: The performance of T2 and T3 techniques differ significantly.
H05: The performance of T2 and T4 techniques do not differ significantly.
Ha5: The performance of T2 and T4 techniques differ significantly.
H06: The performance of T3 and T4 techniques do not differ significantly.
Ha6: The performance of T3 and T4 techniques differ significantly.
TABLE 6.39
Average Ranks of Techniques after
Applying Friedman Test
T1 T2 T3 T4
Average rank 3.67 2.67 1.92 1.75
Step 2: Select the appropriate statistical test.
As the Friedman test led to the rejection of the null hypothesis, a post hoc test is needed to identify which pairs of techniques differ significantly. Since all pairwise comparisons are of interest, we use the Nemenyi test.
Step 3: Apply test and calculate CD.
In this example, k = 4 and n = 6. The value of qα for four subjects at α = 0.05 is 2.569. The CD is calculated as follows:
$$CD = q_\alpha \sqrt{\frac{k(k + 1)}{6n}} = 2.569 \sqrt{\frac{4(4 + 1)}{6 \times 6}} = 1.91$$
We now find the differences among ranks of each pair of techniques as shown
in Table 6.40.
Step 4: Define significance level.
Table 6.41 shows the comparison results of critical difference and actual rank
differences among different techniques. The rank difference of only T1–T4 pair
is higher than the computed critical difference. The rank differences of all other
TABLE 6.40
Computation of Pairwise Rank
Differences among Techniques
for Nemenyi Test
Pair Difference
T1–T2 3.67 − 2.67 = 1.00
T1–T3 3.67 − 1.92 = 1.75
T1–T4 3.67 − 1.75 = 1.92
T2–T3 2.67 − 1.92 = 0.75
T2–T4 2.67 − 1.75 = 0.92
T3–T4 1.92 − 1.75 = 0.17
TABLE 6.41
Comparison of Differences
for Nemenyi Test
Pair Difference
T1–T2 1.00 < 1.91
T1–T3 1.75 < 1.91
T1–T4 1.92 > 1.91
T2–T3 0.75 < 1.91
T2–T4 0.92 < 1.91
T3–T4 0.17 < 1.91
technique pairs are below the CD; hence, these pairs do not differ significantly at α = 0.05.
Step 5: Derive conclusions.
As the rank difference of only T1–T4 pair is higher than the computed critical dif-
ference, we conclude that the T4 technique significantly outperforms T1 technique
at significance level α = 0.05. The difference in performance of all other techniques
is not significant. We accept all the null hypotheses H01–H06, except H03.
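The Nemenyi comparison reduces to a few lines of code. The following sketch (variable names ours, with the qα value taken from the example) reproduces Tables 6.40 and 6.41:

```python
# A sketch of the Nemenyi comparison in Example 6.16; q_005 is the
# studentized-range-based critical value for k = 4 taken from the text.
from itertools import combinations
from math import sqrt

avg_rank = {"T1": 3.67, "T2": 2.67, "T3": 1.92, "T4": 1.75}
k, n = 4, 6
q_005 = 2.569                               # critical value for k = 4
cd = q_005 * sqrt(k * (k + 1) / (6 * n))    # ~1.91

# Compare each pairwise rank difference with the critical difference.
for a, b in combinations(avg_rank, 2):
    diff = abs(avg_rank[a] - avg_rank[b])
    print(f"{a}-{b}: {diff:.2f} {'>' if diff > cd else '<'} {cd:.2f}")
```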
The Bonferroni–Dunn test provides another method of post hoc comparison by computing the CD (in the same manner as the Nemenyi test); however, the α values used are adjusted to control the family-wise error. We compute the CD value as follows:
$$CD = q_\alpha \sqrt{\frac{k(k + 1)}{6n}}$$
Here k corresponds to the number of subjects and n corresponds to the number of obser-
vations for a subject. The critical values (qα) are studentized range statistic divided by √2.
Note that the number of comparisons in the Appendix table includes the control subject.
We compare the computed CD with the difference between average ranks. If the difference is less than the CD, we conclude that the two subjects do not differ significantly at the chosen significance level α.
Example 6.17:
Consider an example where we compare four techniques by analyzing the performance
of the models predicted using these four techniques on six data sets each. We first apply
Friedman test to obtain the average ranks of all the methods. The computed average
ranks are shown in Table 6.42. The result of the Friedman test indicated the rejection of
the null hypothesis. Evaluate whether there are significant differences between T1 and all the other techniques.
TABLE 6.42
Average Ranks of Techniques
T1 T2 T3 T4
Average rank 3.67 2.67 1.92 1.75
Solution:
Step 1: Formation of hypothesis.
In this step, null (H0) and alternative (Ha) hypotheses are formed. The hypoth-
eses for the example are given below:
H01: The performance of T1 and T2 techniques do not differ significantly.
Ha1: The performance of T1 and T2 techniques differ significantly.
H02: The performance of T1 and T3 techniques do not differ significantly.
Ha2: The performance of T1 and T3 techniques differ significantly.
H03: The performance of T1 and T4 techniques do not differ significantly.
Ha3: The performance of T1 and T4 techniques differ significantly.
Step 2: Select the appropriate statistical test.
The example needs to evaluate the comparison of T1 technique with all other
techniques. Thus, T1 is the control technique. The evaluation of different tech-
niques is performed using Friedman test, and the result led to rejection of
the null hypothesis. To analyze whether there are any significant differences
among the performance of the control technique and other techniques, we need
to apply a post hoc test. Thus, we use Bonferroni–Dunn’s test.
Step 3: Apply test and calculate CD.
In this example, k = 4 and n = 6. The value of qα for four subjects at α = 0.05 is
2.394. The CD can be calculated by the following formula:
$$CD = q_\alpha \sqrt{\frac{k(k + 1)}{6n}} = 2.394 \sqrt{\frac{4(4 + 1)}{6 \times 6}} = 1.79$$
We now find the differences among ranks of each pair of techniques, as shown
in Table 6.43.
Step 4: Define significance level.
Table 6.44 shows the comparison results of critical difference and actual rank
differences among different techniques. The rank difference of only T1–T4 pair
is higher than the computed critical difference. However, the rank difference
of T1–T3 is quite close to the critical difference. The difference in performance
of T1–T2 is not significant.
Step 5: Derive conclusions.
As the rank difference of only the T1–T4 pair is higher than the computed critical difference, we conclude that the T4 technique significantly outperforms the T1 technique at significance level α = 0.05. We therefore reject the null hypothesis
TABLE 6.43
Computation of Pairwise Rank
Differences among Techniques for
Bonferroni–Dunn Test
Pair Difference
T1–T2 3.67 − 2.67 = 1.00
T1–T3 3.67 − 1.92 = 1.75
T1–T4 3.67 − 1.75 = 1.92
TABLE 6.44
Comparison of Differences
for Bonferroni–Dunn Test
Pair Difference
T1–T2 1.00 < 1.79
T1–T3 1.75 < 1.79
T1–T4 1.92 > 1.79
H03 and accept the null hypotheses H01 and H02, as the rank differences of the T1–T2 and T1–T3 pairs are lower than the computed critical difference.
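A similar sketch covers the Bonferroni–Dunn comparison; only the control technique is compared against the others, with the adjusted critical value qα = 2.394 taken from the example (variable names are ours):

```python
# A sketch of the Bonferroni-Dunn comparison in Example 6.17.
from math import sqrt

avg_rank = {"T1": 3.67, "T2": 2.67, "T3": 1.92, "T4": 1.75}
k, n, control = 4, 6, "T1"
cd = 2.394 * sqrt(k * (k + 1) / (6 * n))    # ~1.79 in the text (1.78 here)

# Compare the control technique against every other technique.
for tech, rank in avg_rank.items():
    if tech != control:
        diff = abs(avg_rank[control] - rank)
        verdict = "significant" if diff > cd else "not significant"
        print(f"{control}-{tech}: {diff:.2f} vs CD {cd:.2f} -> {verdict}")
```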
Logistic regression (LR) models the probability of occurrence of an event (e.g., a fault) in terms of the independent variables. In the univariate case, the probability is given as:
$$prob(X_1) = \frac{e^{(A_0 + A_1 X_1)}}{1 + e^{(A_0 + A_1 X_1)}}$$
where:
X1 is an independent variable
A1 is the weight (coefficient) of the independent variable
A0 is a constant
The sign of the weight indicates the direction of the effect of the independent variable on the dependent variable: a positive sign indicates that the independent variable has a positive effect on the dependent variable, and a negative sign indicates a negative effect. The significance statistic is employed to test the hypothesis. In linear regression, the t-test is used to find the significant independent variables; in LR, the Wald test is used for the same purpose.
TABLE 6.45
Univariate Analysis Using LR Method for HSF
Metric B SE Sig. Exp(B) R2
TABLE 6.46
Univariate Analysis Using LR Method for MSF
Metric B SE Sig. Exp(B) R2
CBO 0.276 0.030 0.0001 1.318 0.375
WMC 0.065 0.011 0.0001 1.067 0.215
RFC 0.025 0.004 0.0001 1.026 0.196
SLOC 0.010 0.001 0.0001 1.110 0.392
LCOM 0.009 0.003 0.0050 1.009 0.116
NOC −1.589 0.393 0.0001 0.204 0.090
DIT 0.058 0.092 0.5280 1.060 0.001
TABLE 6.47
Univariate Analysis Using LR Method for LSF
Metric B SE Sig. Exp(B) R2
CBO 0.175 0.025 0.0001 1.191 0.290
WMC 0.050 0.011 0.0001 1.052 0.205
RFC 0.015 0.004 0.0001 1.015 0.140
SLOC 0.004 0.001 0.0001 1.004 0.338
LCOM 0.004 0.003 0.2720 1.004 0.001
NOC −0.235 0.192 0.2200 0.790 0.002
DIT 0.148 0.099 0.1340 1.160 0.005
TABLE 6.48
Univariate Analysis Using LR Method for USF
Metric B SE Sig. Exp(B) R2
CBO 0.274 0.029 0.0001 1.315 0.336
WMC 0.068 0.019 0.0001 1.065 0.186
RFC 0.023 0.004 0.0001 1.024 0.127
SLOC 0.011 0.002 0.0001 1.011 0.389
LCOM 0.008 0.003 0.0100 1.008 0.013
NOC −0.674 0.185 0.0001 0.510 0.104
DIT 0.086 0.091 0.3450 1.089 0.001
Table 6.45 summarizes the results of univariate analysis for predicting fault proneness with respect to high-severity faults (HSF). From Table 6.45, we can see that five out of seven metrics were found to be significant; the NOC and DIT metrics were not found to be significant, and the LCOM metric is significant only at the 0.05 significance level. The value of the R² statistic is highest for the SLOC and CBO metrics.
Table 6.46 summarizes the results of univariate analysis for predicting fault proneness with respect to medium-severity faults (MSF). Table 6.46 shows that the value of the R² statistic is highest for the SLOC metric. All the metrics except DIT are found to be significant. NOC has a negative coefficient, which implies that classes with a higher NOC value are less fault prone.
Table 6.47 summarizes the results of univariate analysis for predicting fault proneness with respect to low-severity faults (LSF). Again, it can be seen from Table 6.47 that the value of the R² statistic is highest for the SLOC metric. The results show that four out of seven metrics are found to be very significant. The LCOM, NOC, and DIT metrics are not found to be significant.
Table 6.48 summarizes the results of univariate analysis for predicting fault proneness. The
results show that six out of seven metrics were found to be very significant when the faults
were not categorized according to their severity, that is, ungraded severity faults (USF). The
DIT metric is not found to be significant and the NOC metric has a negative coefficient. This
shows that the NOC metric is related to fault proneness but in an inverse manner.
Thus, the SLOC metric has the highest R² value at all severity levels of faults, which shows that it is the best predictor. The CBO metric has the second highest R² value. The values of the R² statistic are more important than the significance values, as they show the strength of the correlation.
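Univariate LR analysis of the kind reported in Tables 6.45 through 6.48 can be carried out with standard libraries. The following is a hypothetical sketch using statsmodels; the data are invented purely for illustration and do not correspond to the study's data sets:

```python
# A hypothetical sketch of univariate LR analysis; the data are invented
# for illustration only and are not the study's data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
cbo = rng.poisson(8, size=200).astype(float)        # hypothetical CBO values
# hypothetical fault proneness generated with a positive CBO effect
prob = 1 / (1 + np.exp(-(-2.0 + 0.25 * cbo)))
faulty = rng.binomial(1, prob)

model = sm.Logit(faulty, sm.add_constant(cbo)).fit(disp=0)
b, se = model.params[1], model.bse[1]     # weight A1 and its standard error
wald_p = model.pvalues[1]                 # Wald test significance (Sig.)
odds_ratio = np.exp(b)                    # Exp(B)
# McFadden's pseudo R-squared; the text's R2 statistic may use a
# different definition.
pseudo_r2 = model.prsquared
print(b, se, wald_p, odds_ratio, pseudo_r2)
```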
Exercises
6.1 Describe the measures of central tendency. Discuss the concepts with examples.
6.2 Consider the following data set on faults found by inspection technique for a
given project. Calculate mean, median, and mode.
100, 160, 166, 197, 216, 219, 225, 260, 275, 290, 315, 319, 361, 354, 365, 410, 416, 440, 450,
478, 523
6.3 Describe the measures of dispersion. Explain the concepts with examples.
6.4 What is the purpose of collecting descriptive statistics? Explain the importance
of outlier analysis.
6.5 What is the difference between attribute selection and attribute extraction
techniques?
6.6 What are the advantages of attribute reduction in research?
6.7 What is CFS technique? State its application with advantages.
6.8 Consider the data set on faults given in exercise 6.2. Calculate the standard deviation, variance, and quartiles.
6.9 Consider the following table presenting three variables. Determine the normality
of these variables.
Fault Count    Cyclomatic Complexity    Branch Count
332 25 612
274 24 567
212 23 342
106 12 245
102 10 105
93 09 94
63 05 89
23 04 56
09 03 45
04 01 32
6.10 What is outlier analysis? Discuss its importance in data analysis. Explain univariate, bivariate, and multivariate outlier analysis.
6.11 Consider the table given in exercise 6.9. Construct box plots and identify univariate outliers for all the variables given in the data set.
6.12 Consider the data set given in exercise 6.9. Identify bivariate outliers between the dependent variable fault count and the other variables.
6.13 Consider the following data with the performance accuracy values for different
techniques on a number of data sets. Check whether the conditions of ANOVA are
met. Also apply ANOVA test to check whether there is significant difference in the
performance of techniques.
Techniques
Data Sets Technique 1 Technique 2 Technique 3
D1 84 71 59
D2 76 73 66
D3 82 75 63
D4 75 76 70
D5 72 68 74
D6 85 82 67
Data Sets #
Algorithms 1 2 3
Algorithm 1 9 7 9
Algorithm 2 19 20 20
Algorithm 3 18 15 14
Algorithm 4 13 7 6
Algorithm 5 10 9 8
6.15 A software company plans to adopt a new programming paradigm that will ease the task of software developers. To assess its effectiveness, 50 software developers used the traditional programming paradigm and 50 others used the new one. The productivity values per hour are stated as follows. Perform a t-test to assess the effectiveness of the new programming paradigm.
Old New
Programming Programming
Statistic Paradigm Paradigm
P1 1,739 1,690
P2 2,090 2,090
P3 979 992
P4 997 960
P5 2,750 2,650
P6 799 799
P7 980 1,000
P8 1,099 1,050
P9 1,225 1,198
P10 900 943
6.17 The software team needs to determine average number of methods in a class
for a particular software product. Twenty-two classes were chosen at random
and the number of methods in these classes were analyzed. Evaluate whether the
hypothesized mean of the chosen sample is different from 11 methods per class for
the whole population.
Class No. No. of Methods Class No. No. of Methods Class No. No. of Methods
6.18 A software organization develops software tools using five categories of pro-
gramming languages. Evaluate a goodness-of-fit test on the data given below to
test whether the organization develops equal proportion of software tools using
the five different categories of programming languages.
Programming Number of
Language Software
Category Tools
Category 1 35
Category 2 30
Category 3 45
Category 4 44
Category 5 28
6.19 Twenty-five students developed the same program and the cyclomatic
complexity values of these 25 programs are stated. Evaluate whether the
cyclomatic complexity values of the program developed by the 25 students fol-
lows normal distribution.
6, 11, 9, 14, 16, 10, 13, 9, 15, 12, 10, 14, 15, 10, 8, 11, 7, 12, 13, 17, 17, 19, 9, 20, 26
6.20 The following table shows the number of software projects categorized by software development stage and development methodology. Test whether development stage and methodology are independent of each other.
Methodology
Software Development Stage    OO    Procedural    Total
Requirements    80    100    180
Initial design    50    110    160
Detailed design    75    65    140
Total    205    275    480
6.21 The coupling values of a number of classes are provided below for two different
samples. Test the hypothesis using F-test whether the two samples belong to the
same population.
Sample 1 32 42 33 40 42 44 42 38 32
Sample 2 31 31 31 35 35 32 30 36
1 25 45
2 15 55
3 25 65
4 15 65
5 5 35
6 35 15
7 45 45
8 5 75
9 55 85
Algorithms
Data Sets A1 A2
D1 0.65 0.55
D2 0.78 0.85
D3 0.55 0.70
D4 0.60 0.60
D5 0.89 0.70
6.24 Two attribute selection techniques were analyzed to check whether they have
any effect on model’s performance. Seven models were developed using attribute
selection technique X and nine models were developed using attribute selection
technique Y. Use Wilcoxon–Mann–Whitney test to evaluate whether there is any
significant difference in the model’s performance using the two different attribute
selection techniques.
Technique X    Technique Y
57.5 58.9
58.6 58.0
59.3 61.5
56.9 61.2
58.4 62.3
58.8 58.9
57.7 60.0
60.9
60.4
6.25 A researcher wants to find the effect of the same learning algorithm on three data sets. For every data set, a model is predicted using the same learning algorithm, with the area under the ROC curve (AUC) as the performance measure.
Data Set    AUC
1 0.76
2 0.85
3 0.66
6.26 A market survey is conducted to evaluate the effectiveness of three text editors
by 20 probable customers. The customers assessed the text editors on various
criteria and provided a score out of 300. Test the hypothesis whether there is any
significant differences among the three text editors using Kruskal–Wallis test.
Methods
Data Sets A1 A2 A3 A4
D1 0.65 0.56 0.72 0.55
D2 0.79 0.69 0.69 0.59
D3 0.65 0.65 0.62 0.60
D4 0.85 0.79 0.66 0.76
D5 0.71 0.61 0.61 0.78
Tools
Data Sets T1 T2 T3 T4
Model 1 69 60 83 73
Model 2 70 68 81 69
Model 3 73 54 75 67
Model 4 71 61 91 79
Model 5 77 59 85 69
Model 6 73 56 89 77
Further Readings
There are several books on research methodology and statistics in which various concepts
and statistical tests are explained:
V. Barnett, and T. Price, Outliers in Statistical Data, John Wiley & Sons, New York, 1995.
Some of the useful facts and concepts of significance tests are presented in:
P. M. Bentler, and D. G. Bonett, “Significance tests and goodness of fit in the analysis
of covariance structures,” Psychological Bulletin, vol. 88, no. 3, pp. 588–606, 1980.
J. M. Bland, and D. G. Altman, “Multiple significance tests: The Bonferroni method,”
BMJ, vol. 310, no. 6973, pp. 170, 1995.
L. L. Harlow, S. A. Mulaik, and J. H. Steiger, What If There Were No Significance Tests,
Psychology Press, New York, 2013.