Investigation and Comparison of Missing Data Imputation Methods

THESIS APPROVAL ROUTING FORM
The thesis of Jason Yingling entitled Multiple Imputation of Missing Data via
Gradiant Boosted Trees has been reviewed and approved by the thesis committee and
all departmental, college, and university policies and procedures have been followed.
Thesis Committee Chair Date
Department Chair Date
College Dean Date
Graduate Dean Date

MULTIPLE IMPUTATION OF MISSING DATA VIA GRADIANT
BOOSTED TREES
by
Jason Yingling
A thesis presented to the Department of Mathematics

and the Graduate School of the University of Central Arkansas in partial
fulfillment of the requirements for the degree of
Master of Science
in
Applied Mathematics
Conway, Arkansas
Aug 2019
TO THE OFFICE OF GRADUATE STUDIES:
The members of the committee approve the thesis of

Jason Yingling presented on July 11, 2019.
Janet Nakarmi, Committee Chairperson
Danny Arrigo
Yinlin Dong
PERMISSION
Title: Multiple Imputation of Missing Data via Gradiant Boosted Trees
Department: Mathematics
Degree: Applied Mathematics
In presenting this thesis in partial fulfillment of the requirements for a graduate

degree from the University of Central Arkansas, I agree that the Library of this
University shall make it freely available for inspection. I further agree that permission
for extensive copying for scholarly purposes may be granted by the professor who
supervised my thesis work, or, in the professor’s absence, by the Chair of the
Department or the Dean of the Graduate School. It is understood that due recognition
shall be given to me and to the University of Central Arkansas in any scholarly use
which may be made of any material in my thesis.
Jason Yingling
Date

© 2019 Jason Yingling
ACKNOWLEDGEMENTS
ABSTRACT
TABLE OF CONTENTS
ACKNOWLEDGEMENTS
ABSTRACT
LIST OF TABLES
LIST OF FIGURES
1 Missing Data
1.1 Types of Missing Data
1.2 Data Modeling
1.2.1 Parametric Methods
1.2.2 Non-parametric Methods
2 XGBoost Imputation
2.0.1 XGBoost
2.0.2 Decision Trees
2.0.3 Extreme Gradient Boosting
2.1 XGBoost Imputation
3 Simulate Missing Data
3.1 Imputation Methods
3.1.1 Mean
3.1.2 KNN
3.1.3 Multiple Imputation by Chained Equations
3.1.4 SVD
3.1.5 Soft Thresholding SVD
3.2 Boston Housing Market Data (Regression)
3.3 Breast Cancer Data Set (Classification)
4 Predictive Ability
4.1 Results
4.1.1 Melbourne Housing Data
4.1.2 Boston Housing Data
4.1.3 Breast Cancer Data
5 Simulation Study
5.0.1 Imputation Results
6 Data Preprocessing
6.1 Skew
6.1.1 Data Transformation
7 Conclusion
BIBLIOGRAPHY
APPENDIX
A Melbourne Feature Engineering
B Code
LIST OF TABLES
3.1 Percent Data Missing for Boston Housing Market Data
3.2 Imputation Results for Boston Housing Data
3.3 Percent Data Missing for Breast Cancer Data
3.4 Imputation Results for Breast Cancer Data
4.1 kNN Prediction Results for Melbourne Housing Data
4.2 Random Forest Prediction Results for Melbourne Housing Data
4.3 Linear Regression Prediction Results for Melbourne Housing Data
4.4 Random Forest Prediction Accuracy for Boston Housing Data
4.5 Linear Regression Prediction Accuracy for Boston Housing Data
4.6 kNN Prediction Accuracy for Boston Housing Data
4.7 kNN Prediction Accuracy for Breast Cancer Data
4.8 Random Forest Prediction Accuracy for Breast Cancer Data
4.9 Logistic Regression Prediction Accuracy for Breast Cancer Data
5.1 Simulated Regression Data Imputation Results
5.2 Simulated Classification Data Imputation Results
6.1 Fisher-Pearson Coefficient Before and After Transformation for the Skewed Variables
6.2 XGBoost Imputation Comparison Between Transformed and Original Data
LIST OF FIGURES
CHAPTER 1
Missing Data
Missing data has been an ever-present problem for statistics and related fields.
It is common for individual values, or even entire records, to be missing from a data
set. Analyzing a data set with missing values can lead to biased results [14]. Biased
results can be detrimental in predictive tasks such as predicting whether a patient
has cancer, or a self-driving car predicting whether it is safe to merge into a new
lane.
In his seminal work, Rubin identified three mechanisms of missing data [20]:
Missing Completely at Random, Missing at Random, and Missing Not at Random.
1.1 Types of Missing Data
The proper method for handling missing data requires knowledge of why the data are
missing. The three types of missing data are as follows.
Missing Completely at Random (MCAR) values are values whose missingness does
not depend on any data. In other words, there is no relationship between the
missingness and any values, observed or missing. The MCAR assumption is generally
not realistic in real world situations. However, it may arise when the data is missing
due to random data entry or transcription errors.
Missing at Random (MAR) values have a systematic relationship between the
missingness and the observed data, but not the missing data. Under MAR, all of the
available information about the missing data can be found within the data set but
may require statistical analysis to observe. For example, if men are more likely to
tell you their age than women, and sex is recorded, age is MAR.
In Missing Not at Random (MNAR), the missingness depends on unobserved data.
Under this condition, the unobserved data can either be the missing value itself or
some other unknown variable that is not in the data set [20]. This thesis will focus
on imputing missing data via modeling.
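The three mechanisms can be illustrated with a small simulation. The following is a
minimal sketch, not code from this thesis: the record layout, the function names, and
every probability are assumptions chosen only to make the contrast visible.

```python
import random

def make_masks(records, seed=0):
    """Return MCAR, MAR, and MNAR missingness masks for the 'age' field.

    records: list of dicts with keys 'sex' ('M' or 'F') and 'age' (number).
    True in a mask means the age would be recorded as missing.
    All probabilities below are illustrative assumptions.
    """
    rng = random.Random(seed)
    mcar, mar, mnar = [], [], []
    for rec in records:
        # MCAR: the same probability for every record.
        mcar.append(rng.random() < 0.2)
        # MAR: depends only on an observed variable (sex).
        p_mar = 0.1 if rec["sex"] == "M" else 0.5
        mar.append(rng.random() < p_mar)
        # MNAR: depends on the value that goes missing (age itself).
        p_mnar = 0.6 if rec["age"] >= 50 else 0.05
        mnar.append(rng.random() < p_mnar)
    return mcar, mar, mnar

def rate(mask, keep):
    """Share of missing values among the records selected by keep."""
    picked = [m for m, k in zip(mask, keep) if k]
    return sum(picked) / len(picked)

rng = random.Random(1)
records = [{"sex": rng.choice("MF"), "age": rng.randint(18, 80)} for _ in range(2000)]
mcar, mar, mnar = make_masks(records)

is_f = [r["sex"] == "F" for r in records]
is_m = [not f for f in is_f]
print("MCAR missing rate, women vs men:", rate(mcar, is_f), rate(mcar, is_m))
print("MAR missing rate, women vs men:", rate(mar, is_f), rate(mar, is_m))
```

Under MCAR the two rates should be close, while under MAR they differ by sex. The
MNAR mask cannot be diagnosed this way in practice, because the variable that drives
it is the one that is missing.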
1.2 Data Modeling
In general, statistical learning or machine learning can be viewed as the
approximation of a function that governs the behavior of the variable of interest [8].
Suppose we have k different input variables X1 , X2 , ..., Xk . The assumption is that
there is a relationship between the variable we would like to model or predict, Y,
and the input variables X = (X1 , X2 , ..., Xk ), which we can write in the general form
Y = f (X) + ε.
The function f is some unknown function that operates on the X variables and ε
is a random error term which has mean zero.
Suppose we are attempting to create a model of the Dallas housing market, where
yi is the price of an individual house and Y is the set of prices across the market
itself. Say we know the square footage and lot size of each house, the x-values. We
generally know that there is some function f that connects the input variables to
the house's value, but the actual form of f is unknown. In order to predict the value
of houses given data, we have to estimate the function f based on the observed data
available to us:
Ŷ = f̂(X),
where Ŷ is our prediction for Y and f̂ is our estimate of the governing function
f . In most cases, one is not explicitly interested in the form of f̂ so long as the
predictions are accurate. Most models fall under one of two approaches to estimating
f : parametric or non-parametric.
1.2.1 Parametric Methods
Parametric methods typically follow two steps in order to model data. The first step
is to make an assumption of the form of f . Returning to our Dallas example, if we
let sq-ft be X1 and lot size be X2 , a very simple model for f could be:
f (X) = β0 + β1 X1 + β2 X2 .
The primary assumptions we made are that f is linear and only consists of the two
input variables. Now that we have the form of f all we need to do is estimate the
coefficients β0 , β1 , and β2 .
The second step is to use the data to "train" the model. In other words, we need
to find the parameters such that
Y ≈ β0 + β1 X1 + β2 X2 .
The primary benefit of using a parametric form for f is that it simplifies the
problem to estimating parameters of the specified model. However, this approach
has a large disadvantage in that the form of the model may not match the actual
unknown form f .
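As a worked illustration of the training step, one common choice is ordinary least
squares, which picks the coefficients that minimize the squared prediction error over
the n observed houses. This is an assumption for illustration, since the text does not
name an estimation method. The symbols x_{i1}, x_{i2}, the design matrix \mathbf{X},
and the response vector \mathbf{y} are introduced only for this sketch.

```latex
\[
(\hat\beta_0, \hat\beta_1, \hat\beta_2)
  = \arg\min_{\beta_0, \beta_1, \beta_2}
    \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_{i1} - \beta_2 x_{i2} \right)^2
\]
\[
\hat{\boldsymbol\beta} = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y},
\qquad
\mathbf{X} =
\begin{pmatrix}
1 & x_{11} & x_{12} \\
\vdots & \vdots & \vdots \\
1 & x_{n1} & x_{n2}
\end{pmatrix}
\]
```

The closed form in the second line holds when \mathbf{X}^{T}\mathbf{X} is invertible,
which requires at least three houses whose square footage and lot size are not
perfectly collinear.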
1.2.2 Non-parametric Methods
Non-parametric methods do not make an assumption about the form of f . Instead,
they attempt to find an estimate of f that describes the available data points. By not
making any assumption about the form of f , non-parametric methods are able to
accurately fit a much wider range of possible forms for f . This inherent flexibility
allows a single non-parametric approach to handle more types of data than a
parametric approach. However, since non-parametric approaches do not reduce the
problem to estimating a small set of parameters, the computation time and the
number of observations required to accurately estimate f increase [8].
The broader ability of non-parametric methods to accurately fit different types
of data is why we chose to use XGBoost, a non-parametric model, as our basis for
imputation in this thesis.
CHAPTER 2
XGBoost Imputation
2.0.1 XGBoost
A model in supervised learning typically refers to the mathematical structure that is
used to predict yi from the input variables xi . A simple example is the linear model,
where the predicted value
\[ \hat{y}_i = \sum_{j} w_j x_{ij} \]
is a linear combination of the inputs weighted by w.
We learn the parameters for the model from the data. For the linear combination,
the parameters estimated are simply the weights wj . The task of training a model is
to identify which weights best describe the training data and labels. We also require
an objective function to measure how well the model is performing.
The objective function typically consists of two parts: training loss and the
regularization term. We can write the objective function as:
obj(w) = L(w) + R(w)
where L is the loss function and R is the regularization term. The loss measures
how well the model predicts the training data. The regularization term helps prevent
over-fitting, where a model learns the training set to the detriment of generalizing
to data it has not seen before.
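The two parts of the objective can be made concrete for the linear model. The sketch
below assumes a squared-error loss and an L2 penalty with a hypothetical strength
`lam`. These are illustrative choices, not the specific loss or penalty used later in
this thesis.

```python
def predict(w, x):
    """Linear prediction: sum_j w_j * x_j."""
    return sum(wj * xj for wj, xj in zip(w, x))

def objective(w, X, y, lam=1.0):
    """obj(w) = L(w) + R(w) with squared-error loss and an L2 penalty.

    The specific loss, penalty, and lam value are assumptions for this sketch.
    """
    loss = sum((predict(w, xi) - yi) ** 2 for xi, yi in zip(X, y))
    reg = lam * sum(wj ** 2 for wj in w)
    return loss + reg

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [2.0, 3.0, 5.0]
print(objective([2.0, 3.0], X, y, lam=0.0))  # perfect fit, no penalty
print(objective([2.0, 3.0], X, y, lam=0.1))  # same fit, penalized weights
```

With the penalty switched on, the same weights score worse, so the optimizer is pushed
toward smaller weights unless the extra size buys a real reduction in loss.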
Bias-variance trade-off is a common term used to describe the trade-off between
the two sources of error. A model that has high variance and low bias tends to be a
complex model with low prediction error on the training data but high error on the
testing data (over-fitting), whereas a model with high bias and low variance tends to
be a simple model with high error on both the training and test data sets
(under-fitting) [11].
2.0.2 Decision Trees
Classification and Regression Tree (CART) is named for the two tasks of decision
trees: classification and regression. CART is a binary decision tree where each node
can be viewed as a question that is based on the data. The node is split based on the
answer, and this process is continued until there are no more splits and each terminal
node, or leaf, of the tree contains an outcome used for prediction [11]. The predicted
value of yi is as follows:
\[ \hat{y}_i = \sum_{j=1}^{N} f_j(x_i), \]
where N is the number of functions or decisions. The objective function is
\[ \mathrm{obj}(w) = \sum_{i=1}^{n} L(\hat{y}_i, y_i) + \sum_{j=1}^{N} R(f_j). \]
The tree is constructed in a top-down approach by a greedy algorithm that always

chooses the best possible split at the current step. CART is able to create very
complicated decision trees that can perfectly describe the data it is trained on.
Therefore CART will be able to accurately predict on similar samples to the data
it has already seen, but can lead to poor ability to predict on data it has never seen
before [16]. One way to overcome the problem of overfitting is to use many smaller
trees to build an ensemble of trees. Random Forest and Boosted Trees are two types
of decision tree based models that utilize an ensemble of decision trees [16, 8].
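The greedy step at a single node can be sketched as follows. This is a minimal
illustration for one numeric feature in the regression case, assuming a squared-error
split criterion. The function names are hypothetical, and classification trees would
typically use an impurity measure instead.

```python
def sse(values):
    """Sum of squared errors around the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(x, y):
    """Greedy search for the threshold on one feature that minimizes
    the total squared error of the two resulting leaves."""
    pairs = sorted(zip(x, y))
    best_threshold, best_cost = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold separates equal feature values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [yv for xv, yv in pairs if xv <= threshold]
        right = [yv for xv, yv in pairs if xv > threshold]
        cost = sse(left) + sse(right)
        if cost < best_cost:
            best_threshold, best_cost = threshold, cost
    return best_threshold, best_cost

x = [1, 2, 3, 10, 11, 12]
y = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
print(best_split(x, y))  # the split falls in the gap between 3 and 10
```

A full tree repeats this search over every feature at every node, which is why an
unrestricted tree can keep splitting until it reproduces the training data exactly.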
Random Forest uses a concept called Bootstrap Aggregating, or "bagging".
Bootstrapping is a statistical method where a large number of equal-sized samples
are drawn, with replacement, from the data to estimate a value. The value estimated
from bootstrapping can be a mean, a median, or, in the case of machine learning, a
parameter [8].
The goal of bagging in machine learning is to reduce the variance of the model
by averaging the predictions of many decision trees that individually have high
variance and low bias. This can be done by drawing bootstrap samples from the data
set and training a different tree on each sample [8].
Boosting refers to a technique that combines weak trees into a stronger model.
Boosting is conceptually different from bagging. In bagging, each model is trained
independently on a different sample of the data and the outputs are then aggregated.
Boosting utilizes a sequential model structure such that each subsequent estimator
attempts to correct the error of the previous estimator [11]. This process is repeated
for each tree added to the model.
2.0.3 Extreme Gradient Boosting
Extreme Gradient Boosting (XGBoost) is a tree boosting ensemble that achieves
better performance than prior algorithms by controlling the complexity of the trees
with a new regularization technique [22].
The regularization term used by XGBoost is:
\[ R(f) = \gamma T + \frac{1}{2} \lambda \lVert w \rVert^2, \]
where T is the number of leaves in the tree, w is the vector of scores on the leaves,
γ penalizes the number of leaves, and λ is the weight of the L2 penalty on the leaf
scores.
XGBoost is trained by optimizing one tree at a time in a recursive manner,
starting with a constant function and adding a new function to the previous
prediction at each round:
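\[
\begin{aligned}
\hat{y}_i^{(0)} &= 0 \\
\hat{y}_i^{(1)} &= f_1(x_i) = \hat{y}_i^{(0)} + f_1(x_i) \\
\hat{y}_i^{(2)} &= f_1(x_i) + f_2(x_i) = \hat{y}_i^{(1)} + f_2(x_i) \\
&\;\;\vdots \\
\hat{y}_i^{(t)} &= \sum_{n=1}^{t} f_n(x_i) = \hat{y}_i^{(t-1)} + f_t(x_i)
\end{aligned}
\]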
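Note that because the prediction for each round t is
\[ \hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i), \]
the new function f_t to add in each round is chosen by optimizing the objective
function at round t:
\[
\begin{aligned}
\mathrm{obj}^{(t)} &= \sum_{i} L(\hat{y}_i^{(t)}, y_i) + \sum_{n=1}^{t} R(f_n) \\
&= \sum_{i} L(\hat{y}_i^{(t-1)} + f_t(x_i), y_i) + R(f_t) + \text{constant},
\end{aligned}
\]
where the constant collects the regularization terms of the trees that were fixed in
earlier rounds.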
XGBoost simplifies the objective further using a second-order approximation, which
is used as a scoring function. The score is used to determine the quality of a tree's
structure. Therefore XGBoost is able to effectively utilize the ability of decision
trees to learn the data while limiting the drawback of overfitting the training
set [22].
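For reference, the second-order approximation can be written out as follows. This is a
sketch of the standard derivation for this kind of regularized boosting objective, as
presented in the XGBoost literature. The symbols g_i, h_i, I_j, G_j, H_j, and w_j do
not appear elsewhere in this thesis and are introduced here only for illustration.

```latex
\[
\mathrm{obj}^{(t)} \approx \sum_{i} \left[ L(\hat{y}_i^{(t-1)}, y_i)
  + g_i f_t(x_i) + \tfrac{1}{2} h_i f_t(x_i)^2 \right] + R(f_t) + \text{constant},
\]
\[
g_i = \frac{\partial L(\hat{y}_i^{(t-1)}, y_i)}{\partial \hat{y}_i^{(t-1)}},
\qquad
h_i = \frac{\partial^2 L(\hat{y}_i^{(t-1)}, y_i)}{\partial \big(\hat{y}_i^{(t-1)}\big)^2}.
\]
% With R(f_t) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2, let I_j be the set
% of observations that fall in leaf j, G_j = \sum_{i \in I_j} g_i and
% H_j = \sum_{i \in I_j} h_i. For a fixed tree structure the best leaf score and the
% resulting structure score are:
\[
w_j^{*} = -\frac{G_j}{H_j + \lambda},
\qquad
\mathrm{obj}^{*} = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T.
\]
```

A lower value of obj* indicates a better tree structure, and the γT term charges a
fixed cost for every additional leaf.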
XGBoost also uses a concept called shrinkage, described by Friedman, to help prevent
overfitting [12]. Shrinkage scales the weights of each newly added tree by a factor η,
also known as the learning rate. To further assist with preventing XGBoost from
overfitting the data, Chen included row and column sub-sampling [22]. When a tree is
grown, a sub-sample of rows or columns can be used instead of the whole data set,
which further aids in preventing over-fitting and is another feature that made
XGBoost ideal for our imputation method.
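The additive training scheme and shrinkage can be sketched with one-split trees. The
code below is plain gradient boosting with squared-error loss, where each new tree is
fit to the current residuals. It is not XGBoost's regularized second-order algorithm,
and the function names, the number of rounds, and the value of `eta` are assumptions
for illustration.

```python
def fit_stump(x, residuals):
    """Fit a one-split regression tree (stump) to the residuals."""
    best = None
    for threshold in sorted(set(x))[:-1]:
        left = [r for xv, r in zip(x, residuals) if xv <= threshold]
        right = [r for xv, r in zip(x, residuals) if xv > threshold]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        cost = (sum((r - lmean) ** 2 for r in left)
                + sum((r - rmean) ** 2 for r in right))
        if best is None or cost < best[0]:
            best = (cost, threshold, lmean, rmean)
    _, threshold, lmean, rmean = best
    return lambda xv: lmean if xv <= threshold else rmean

def boost(x, y, n_rounds=50, eta=0.3):
    """Additive training: y_hat(t) = y_hat(t-1) + eta * f_t(x)."""
    trees = []
    preds = [0.0] * len(y)              # y_hat(0) = 0
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        tree = fit_stump(x, residuals)  # f_t corrects the current error
        trees.append(tree)
        preds = [pi + eta * tree(xv) for pi, xv in zip(preds, x)]
    return lambda xv: sum(eta * tree(xv) for tree in trees)

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]
model = boost(x, y)
mse = sum((model(xv) - yv) ** 2 for xv, yv in zip(x, y)) / len(y)
print("training MSE:", mse)
```

A smaller `eta` makes each tree contribute less, so more rounds are needed, but no
single tree can dominate the fit.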
2.1 XGBoost Imputation
XGBoost is a decision tree based model that does not require the assumption of
normality [22]. Therefore, it is possible to draw from the insights of Multiple
Imputation by Chained Equations (MICE), described in further detail in Chapter
3, and the benefit of XGBoost to create a new imputation technique. We wanted to
create a multiple imputation technique that would be able to handle any given data
set, without requiring any assumptions on the structure of the data or type of data
within the data set.
The steps we used for our imputation method are outlined below:
1. The mean is imputed for each of the numerical variables. These values are
considered placeholders for further computation.
2. The mode is imputed for each of the categorical variables.
3. Choose one variable from the list of variables with missing values and remove
the imputed placeholder values.
4. Create a training set using only the rows with original values for the chosen
variable.
5. Fit an XGBoost model with the chosen variable as the dependent variable.
6. Use the trained model to predict the missing values.
7. The process is repeated for each of the variables with missing data.
8. Steps 5-6 are repeated k times.
Therefore, if there are n variables with missing data, we will need to train n
XGBoost models, one for each such variable. We also repeat the process k times,
giving a total of nk models that must be trained.
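The loop above can be sketched with a pluggable learner. This is a minimal
illustration, not the implementation used in this thesis: it handles numeric columns
only, uses `None` as the missing marker, restarts every one of the k rounds from the
mean placeholders, and substitutes a hypothetical nearest-neighbour stand-in for
XGBoost so that it runs with the standard library alone.

```python
import math

class NearestNeighbour:
    """Stand-in learner exposing a fit/predict interface."""
    def fit(self, X, y):
        self.X, self.y = X, y
        return self

    def predict(self, X):
        def nearest(row):
            dists = [math.dist(row, other) for other in self.X]
            return self.y[dists.index(min(dists))]
        return [nearest(row) for row in X]

def impute(data, model_factory, k=1):
    """Model-based imputation of a numeric table (list of rows, None = missing).

    Returns k completed copies of the data. The column order, the stand-in
    learner, and the restart from placeholders are assumptions of this sketch.
    """
    n_cols = len(data[0])
    missing = [[v is None for v in row] for row in data]
    cols_with_missing = [c for c in range(n_cols) if any(m[c] for m in missing)]
    # Step 1: mean placeholders for every column.
    means = []
    for c in range(n_cols):
        observed = [row[c] for row in data if row[c] is not None]
        means.append(sum(observed) / len(observed))
    filled = [[means[c] if v is None else v for c, v in enumerate(row)]
              for row in data]
    completed = []
    for _ in range(k):
        current = [row[:] for row in filled]
        for c in cols_with_missing:      # steps 3-7, one model per column
            others = [j for j in range(n_cols) if j != c]
            train = [r for r, m in zip(current, missing) if not m[c]]
            test = [r for r, m in zip(current, missing) if m[c]]
            model = model_factory().fit([[r[j] for j in others] for r in train],
                                        [r[c] for r in train])
            preds = model.predict([[r[j] for j in others] for r in test])
            for r, p in zip(test, preds):
                r[c] = p                 # overwrite the placeholder in place
        completed.append(current)
    return completed

data = [[1.0, 2.0], [2.0, 4.0], [3.0, None], [4.0, 8.0]]
print(impute(data, NearestNeighbour, k=2)[0])
```

Because the stand-in learner is deterministic, the k completed copies are identical
here. With a learner that uses row subsampling, such as an XGBoost regressor wrapped
to accept these lists, the copies would differ from round to round.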
XGBoost has many parameters that can be tuned to increase the predictive accuracy
of the model. However, since we are creating a model for each variable with missing
values, we cannot know in advance which parameters are best for each variable, and
finding the optimal parameters would be a cumbersome task, especially if the data
set has many variables that contain missing values. For this reason, we use the
default settings for XGBoost other than enabling row subsampling.
Subsampling provides model diversity between the different iterations of
XGBoost. This gives variability in the imputations and enables us to obtain the
benefits of Multiple Imputation. If we did not include subsampling, there is a chance
that each imputation round would predict exactly the same value, making the method
equivalent to single imputation.
There are several items in the process that could potentially alter the predictive
ability of the imputation technique. The first is the initialization step. The easiest
methods of initializing the data are to take the mean or the median of each column.
The second is the order in which the columns are chosen. The order can impact the
imputation because of a limitation of the technique. If we have a data set with two
variables that contain missing data, the model for the first variable imputed will
use the mean imputation of the second column as input. However, the model for the
second column will use the first model's predictions for the missing values, which
should be more accurate than the mean imputation. This problem has the potential to
be more pronounced for data sets with many columns that contain missing values.
Finally, there is the question of how to handle categorical variables. There are
two primary techniques for transforming categorical variables into something that is
numerically tractable: Label Encoding and One-Hot Encoding. Label Encoding
consists of assigning each category of a given categorical variable a numerical value.
For example, you can assign Monday the value 1 and Sunday the value 7. This appears
to be a logical approach; however, we have now imposed numerical comparisons on the
categories. In our example, Sunday is numerically 7 times larger than Monday, which
does not make sense. Further, simply changing how we order the categories can
potentially drastically change the results.
The second method is called One-Hot Encoding. For One-Hot Encoding, we
simply create a new variable for each category and assign a 1 for having the category
and a 0 for not having the category. Therefore, One-Hot Encoding does not give any
arbitrary weight to a category. However, we need to keep in mind the number of
features being used. For small data sets, the inclusion of one-hot encoded variables
can make the problem high dimensional; in other words, the number of features can
exceed the number of observations. For large data sets, the additional features make
the computation time significantly longer due to the increased complexity and the
amount of data that must be held in memory.
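The difference between the two encodings can be seen on the weekday example. The
sketch below uses hypothetical helper functions and abbreviated day names. It is not
the encoding code used in this thesis.

```python
def label_encode(values, order):
    """Map each category to its 1-based position in `order`."""
    codes = {cat: i + 1 for i, cat in enumerate(order)}
    return [codes[v] for v in values]

def one_hot_encode(values, order):
    """Create one 0/1 column per category, in the given order."""
    return [[1 if v == cat else 0 for cat in order] for v in values]

days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
observed = ["Mon", "Sun", "Wed"]

print(label_encode(observed, days))    # one column with an implied ordering
print(one_hot_encode(observed, days))  # seven columns, no implied ordering
```

Label Encoding keeps a single column, while One-Hot Encoding adds one column per
category, which is the source of the growth in feature count described above.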
To test the difference between Label Encoding and One-Hot Encoding, we ran our
imputation method on the Melbourne data. The Melbourne data is an ideal data set
for testing One-Hot Encoding against Label Encoding because, after the feature
engineering process described in Appendix A, we ended up with a total of 23
features, including 10 categorical features. Once we applied One-Hot Encoding, the
number of features grew to 951.
After running our imputation comparisons, described further in Chapter 4, we
found that the imputation took 38 seconds to run with Label Encoding and 16 minutes
with One-Hot Encoding, on a machine with an i7 processor and 16 GB of RAM.
Although One-Hot Encoding is more theoretically sound and potentially more accurate
than Label Encoding, we chose to proceed with Label Encoding throughout the thesis
due to the substantial difference in computation speed.
CHAPTER 3
Simulate Missing Data
For the first step of this study, we wanted to assess the accuracy of the XGB
Imputation method and compare it to standard techniques. We chose two standard
data sets covering classification and regression tasks: the breast cancer data set and
the Boston housing market data set, respectively.
Since these data sets do not have missing values, we have to simulate missing data
by randomly removing data points. The benefit of using a full data set is that we are
able to test how far off the imputation methods are from the actual data values. In
other words, we are able to obtain a sense of the accuracy of the imputation methods.
In order to create the missing data for each data set, we conduct the following steps:
1. Select a random number of columns
2. For each column, draw a random number between 0 and 0.5.
3. Remove the percent of data corresponding to the column and the random
number generated.
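The three steps above can be sketched as follows. This is an illustration under stated
assumptions rather than the thesis code: the data is a list of rows, removed values
are marked with `None`, and the function name and seed handling are hypothetical.

```python
import random

def simulate_missing(data, seed=0, max_fraction=0.5):
    """Randomly blank out values in a random subset of columns.

    data: list of rows (lists). Returns a copy with None for removed values
    and a dict mapping each chosen column index to the fraction removed.
    """
    rng = random.Random(seed)
    n_rows, n_cols = len(data), len(data[0])
    # Step 1: select a random number of columns.
    chosen = rng.sample(range(n_cols), rng.randint(1, n_cols))
    result = [row[:] for row in data]
    fractions = {}
    for c in chosen:
        # Step 2: draw a random fraction between 0 and 0.5.
        fraction = rng.uniform(0, max_fraction)
        # Step 3: remove that share of the column's values.
        n_remove = round(fraction * n_rows)
        for r in rng.sample(range(n_rows), n_remove):
            result[r][c] = None
        fractions[c] = fraction
    return result, fractions

table = [[float(r * 10 + c) for c in range(4)] for r in range(20)]
with_gaps, fractions = simulate_missing(table, seed=3)
print(fractions)
```

Because the rows to blank out are chosen uniformly at random, the missingness produced
this way is MCAR by construction.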
3.1 Imputation Methods
3.1.1 Mean
Mean imputation is one of the most common methods of imputing data. This method
is easy to use and understand since it simply takes the average of each column with
missing data and replaces each missing value with its column's average.
One downside to this method is that the mean is sensitive to outliers. In addition,
because every missing value in a column is replaced by the same number, the standard
deviation and variance are underestimated. More importantly, the magnitudes of the
covariance and correlation also decrease when the variability decreases, resulting in
biased estimates [8].
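The loss of variability is easy to see on a toy column. The values below are made up
for illustration and the helper name is hypothetical.

```python
import statistics

def mean_impute(column):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in column if v is not None]
    mean = statistics.mean(observed)
    return [mean if v is None else v for v in column]

full = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
with_gaps = [2.0, None, 6.0, None, 10.0, 12.0]
imputed = mean_impute(with_gaps)

print(imputed)
# Sample variance of the complete column versus the mean-imputed column.
print(statistics.variance(full), statistics.variance(imputed))
```

The imputed column has a smaller sample variance than the complete column because the
two filled-in values sit exactly at the centre of the observed data.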
3.1.2 KNN
The nearest-neighbor method utilizes the observations in the training set that are
the most "similar" to a given observation. The simplest form of k-NN can be defined
as:
\[ \hat{Y} = \frac{1}{k} \sum_{x_i \in N_k(x)} y_i, \]
where Nk (x) is the neighborhood of x consisting of the k nearest points. In other
words, k-NN can be viewed as the average of the k nearest neighbors' values. Since we
are using closeness, we need a metric to describe the distance. k-NN typically uses
the Euclidean or Mahalanobis distance to identify the nearest neighbors.
For imputing categorical variables, k-NN identifies the k neighbors and then takes
a vote of the classes of the neighbors. The class that has the most votes becomes
the prediction for the given observation [11].
For imputing numerical values, k-NN calculates an inverse-distance-weighted
average of the k nearest neighbors to obtain the predicted value [8].
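The numerical case can be sketched for a single missing value. The code assumes
Euclidean distance over the remaining columns and donor rows that are fully observed.
The function name and the handling of exact matches are choices made for this
illustration.

```python
import math

def knn_impute_value(target, donors, missing_col, k=3):
    """Impute target[missing_col] from the k nearest complete donor rows.

    Distance is Euclidean over the other columns; the estimate is an
    inverse-distance-weighted average of the donors' values.
    """
    cols = [c for c in range(len(target)) if c != missing_col]

    def distance(row):
        return math.sqrt(sum((row[c] - target[c]) ** 2 for c in cols))

    nearest = sorted(donors, key=distance)[:k]
    # An exact match gets all the weight, avoiding division by zero.
    exact = [row[missing_col] for row in nearest if distance(row) == 0]
    if exact:
        return sum(exact) / len(exact)
    weights = [1 / distance(row) for row in nearest]
    values = [row[missing_col] for row in nearest]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

donors = [[1.0, 10.0], [2.0, 20.0], [4.0, 40.0], [9.0, 90.0]]
print(knn_impute_value([3.0, None], donors, missing_col=1, k=2))
```

In the example the two nearest donors are equally far from the target, so they receive
equal weight and the estimate is the plain average of their values.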
3.1.3 Multiple Imputation by Chained Equations
During the MICE procedure, a series of regression models are run such that each
variable with missing data is modeled conditional on the other variables in the data.
Therefore, each variable can be modeled according to its distribution.
One of the benefits of MICE is that it creates multiple imputations to help account
for the statistical uncertainty in the imputation process.
The MICE process is as follows:
• The mean is imputed for each of the variables. These values are considered
placeholders for further computation.
• Choose one variable and remove the imputed means.
• Compute the regression using the variable with missing values as the dependent
variable.
• The missing values are filled in with the predictions from the linear regression
model.
• The process is repeated for each variable with missing data, which when
completed is considered one cycle.
• The cycles are repeated until the values converge.
The regression model for predicting the missing values that MICE is built upon
is:
ŷi = β0 + βi Xi
The imputation used by MICE adds a random error term to this prediction:
ŷi = β0 + βi Xi + δ
where δ is a random error drawn from a standard normal distribution and scaled by
the root mean square error (RMSE) of the regression.
3.1.4 SVD
Singular Value Decomposition (SVD) is a well-known matrix factorization technique
that factors an m × n matrix Z into three matrices as follows:
Z = U D V^T
where U and V are two orthogonal matrices of size m × z and n × z, respectively, and
D is a diagonal matrix of size z × z having the singular values of Z as its diagonal
entries [13]. The matrices obtained by performing SVD can be grouped into two
factors,
Z = Q P^T
where Q = U and P^T = D V^T. Each entry of Z is then the dot product of a row factor
and a column factor:
\[ \hat{z}_{rc} = q_r^T p_c. \]
Here, r and c represent a given row and column of the data set.
Since we know that there will be missing data, we need to modify SVD to ignore
the missing data in the decomposition.
We can find q_r and p_c using an iterative process that minimizes the squared
difference between their dot product and the known data values:
\[ \min \sum_{(r,c) \in K} (z_{rc} - q_r^T p_c)^2, \]
where K is the set of (row, column) positions with observed values.
To help with over-fitting, we can add the squared magnitudes of the row and
column factors:
\[ \sum_{(r,c) \in K} (z_{rc} - \hat{z}_{rc})^2 + \lambda \left( \lVert q_r \rVert^2 + \lVert p_c \rVert^2 \right), \]
where λ is an L2 regularization parameter.
3.1.5 Soft Thresholding SVD
Mazumder et al. [7] extended SVD by creating thresholds that can be learned to
assist with data imputation. The technique consists of three steps:
• Compute the SVD Z = U D V^T and let di be the singular values.
• Soft-threshold the singular values: di* = (di − λ)+ = max(di − λ, 0).
• Reconstruct Sλ(Z) = U D* V^T, where D* is the diagonal matrix of the
thresholded values di*.
This process is repeated until the matrix converges.
Evaluation Criteria
Imputation methods will be compared based on three measures of performance.

Root mean square error (RMSE) and mean absolute error (MAE) are often used
interchangeably as measures of prediction accuracy. Both metrics have comparable
behavior in response to model bias. However, MAE is more robust to outliers, due to
it having a lower sample variance than RMSE, whereas RMSE is better at capturing
large model errors.
\[ \mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{n} \left( X_i^{obs} - X_i^{imputed} \right)^2}{n}} \]
\[ \mathrm{MAE} = \frac{\sum_{i=1}^{n} \left| X_i^{obs} - X_i^{imputed} \right|}{n} \]
Another way to see the difference between MAE and RMSE is to look at how they
are related to each other:
\[
\frac{\sum_{i=1}^{n} \left| X_i^{obs} - X_i^{imputed} \right|}{n}
\le \sqrt{\frac{\sum_{i=1}^{n} \left( X_i^{obs} - X_i^{imputed} \right)^2}{n}}
\le \frac{\sum_{i=1}^{n} \left| X_i^{obs} - X_i^{imputed} \right|}{n} \cdot \sqrt{n},
\]
that is, MAE ≤ RMSE ≤ MAE · √n. This inequality shows that the gap between RMSE
and MAE can grow with the sample size, so RMSE results are difficult to compare
between varying sample sizes. Since we will be evaluating the different models on
varying amounts of missing data, strictly using RMSE may provide inconclusive
results. However, we also want to know if a model is making imputations that are far
away from their target values, which justifies using both RMSE and MAE.
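The two measures and the inequality can be checked on a toy example. The numbers
below are invented for illustration and are not results from this thesis.

```python
import math

def mae(observed, imputed):
    """Mean absolute error between true and imputed values."""
    return sum(abs(o - p) for o, p in zip(observed, imputed)) / len(observed)

def rmse(observed, imputed):
    """Root mean square error between true and imputed values."""
    squared = sum((o - p) ** 2 for o, p in zip(observed, imputed))
    return math.sqrt(squared / len(observed))

observed = [10.0, 12.0, 9.0, 11.0]
close = [11.0, 11.0, 10.0, 10.0]    # every imputation is off by 1
one_far = [10.0, 12.0, 9.0, 15.0]   # one imputation is off by 4

for name, imputed in [("close", close), ("one_far", one_far)]:
    print(name, mae(observed, imputed), rmse(observed, imputed))
```

Both sets of imputations have the same MAE, but the second has twice the RMSE because
its error is concentrated in a single value. With n = 4, that second case sits exactly
at the upper bound MAE · √n.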
3.2 Boston Housing Market Data (Regression)
The Boston Housing Market data [6] was created by Harrison and Rubinfeld to
investigate the demand for clean air. The data set consists of 14 variables that the
authors thought would help predict the price of houses in Boston.
1. CRIM: per capita crime rate by town
2. ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
3. INDUS: Proportion of non-retail business acres per town
4. CHAS: Charles River dummy variable
5. NOX: nitric oxides concentration (parts per 10 million)
6. RM: average number of rooms per dwelling
7. AGE: proportion of owner-occupied units built prior to 1940
8. DIS: weighted distance to five Boston employment centres
9. RAD: index of accessibility to radial highways
10. TAX: full-value property-tax rate per $10,000
11. PTRATIO: pupil-teacher ratio by town
12. B: 1000(Bk − 0.63)^2 where Bk is the proportion of African Americans by town
13. LSTAT: % lower status of the population
14. MEDV: Median value of owner-occupied homes in $1000’s
Variable    Percent Missing
RM          23.32
TAX         37.54
PTRATIO     13.24
LSTAT        8.69
Table 3.1: Percent Data Missing for Boston Housing Market Data
Table 3.1 shows the random columns selected and the percent missing for the
respective column. All four of the columns that were selected were numerical.
Imputation Name    MAE          RMSE
KNN                17.202305     44.285102
SimpleFill         54.884513    103.212065
Mean               80.828464    166.416332
XGB                 6.208879     15.134943
MICE               18.133222     36.872780
Table 3.2: Imputation Results for Boston Housing Data
Table 3.2 shows that XGB is considerably more accurate in both MAE and RMSE
than the other imputation methods for the Boston Housing Data. For the Boston
data set, the degree of separation from the other imputation methods suggests that
XGB should be the best imputation method for all types of regression models.
Further, we can see Mean struggles to accurately impute data and performs worse
than simply taking the mean and mode of the data.
3.3 Breast Cancer Data Set (Classification)
The Wisconsin Breast Cancer data set is constructed from digitized images of fine
needle aspirates (FNA) of breast masses. The features describe distinct
characteristics of the cell nuclei present in each image. Ten features are computed
for each cell nucleus. All of the variables are numerical.
1. radius (mean of distances from center to points on the perimeter)
2. texture (standard deviation of gray-scale values)
3. perimeter
4. area
5. smoothness (local variation in radius lengths)
6. compactness (perimeter^2 / area - 1.0)
7. concavity (severity of concave portions of the contour)
8. concave points (number of concave portions of the contour)
9. symmetry
10. fractal dimension ("coastline approximation" - 1)
For each feature, the mean, standard error, and "largest" value were computed for
each image, creating 30 total features.
Table 3.3 shows that we randomly selected 23 of the 30 variables, providing us with
a data set in which 77% of the variables contain missing data. This simulation will
strain the ability of model-based methods because the amount of data missing from
the data set will make it harder to learn the patterns within the data.
Variable Name              Percent Missing
mean radius                36.90
mean texture               29.52
mean perimeter             23.37
mean area                  23.02
mean smoothness             9.49
mean concavity             24.07
mean concave points        19.50
mean fractal dimension      6.32
radius error               26.18
texture error              28.82
area error                 12.82
smoothness error            5.79
compactness error           0.87
concavity error            43.76
symmetry error             43.93
fractal dimension error     8.78
worst texture              34.27
worst perimeter            30.93
worst area                 24.78
worst smoothness           25.30
worst concavity            21.79
worst symmetry             41.12
worst fractal dimension    34.62
Table 3.3: Percent Data Missing for Breast Cancer Data
Imputation Name    MAE          RMSE
KNN                 8.791099     51.216050
Mean               34.158963    141.862113
SoftImpute         24.663311    121.266374
XGB                 3.348926     16.986145
MICE                4.318915     18.061134
Table 3.4: Imputation Results for Breast Cancer Data
Table 3.4 shows that XGBoost has the lowest MAE and RMSE, with MICE only slightly
behind on both measures. For both methods the RMSE is several times larger than the
MAE, which means that the imputations are accurate overall but some values are far
off. We suspect that XGBoost will perform better for models that are more resilient
to outliers, such as Random Forest and kNN, and that MICE will have better
performance with Logistic Regression.
CHAPTER 4
Predictive Ability
Machine Learning models are constructed in vastly different ways [11]. The different
construction and prediction methods can cause a large amount of variability between
different models applied to the same data set. Therefore, it is possible to have a
scenario where one imputation method can provide the best results for one supervised
model and at the same time be the worst imputation method for a different supervised
model.
We chose generic supervised models of vastly different construction to test
the impact of applying different imputation techniques. We wanted to see whether a
type of supervised model has an affinity for a similar type of imputation, for example
linear and logistic regression for MICE. For this section, the models we chose are:
1. KNN
2. Random Forest
3. Linear Regression (Regression)
4. Logistic Regression (Classification)
KNN
The nearest-neighbor method uses the observations in the training set that are most
"similar" to a given observation. The simplest form of k-NN can be defined as:
\[ \hat{Y} = \frac{1}{k} \sum_{x_i \in N_k(x)} y_i \]
where N_k(x) is the neighborhood of x consisting of the k nearest points. In other words,
the k-NN prediction is the average of the values of the k nearest neighbors. Since we are
using closeness, we need a metric to describe the distance. k-NN typically uses the
Euclidean or Mahalanobis distance to identify the nearest neighbors.
For classification tasks, k-NN identifies the k neighbors and then takes a vote of
the class of the neighbors. The classification that has the most votes becomes the
prediction for the given observation [11].
For regression tasks, k-NN calculates an inverse distance weighted average with
the k nearest neighbors to obtain the predicted value [8].
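The two prediction rules above can be sketched in a few lines. This is only an illustration under stated assumptions: it uses Euclidean distance, and `knn_predict` is a hypothetical helper written for this example, not the scikit-learn estimator used in our experiments.

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3, task="regression"):
    """Predict for one query point x from its k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    # Euclidean distance from x to every training observation
    dists = np.linalg.norm(X_train - np.asarray(x, dtype=float), axis=1)
    nearest = np.argsort(dists)[:k]  # indices of the neighborhood N_k(x)
    if task == "classification":
        # Majority vote over the classes of the k neighbors
        labels, counts = np.unique(y_train[nearest], return_counts=True)
        return labels[np.argmax(counts)]
    # Inverse-distance weighted average; the small constant avoids division by zero
    weights = 1.0 / (dists[nearest] + 1e-12)
    return float(np.sum(weights * y_train[nearest]) / np.sum(weights))

X = [[0.0], [1.0], [2.0], [10.0]]
y_reg = [0.0, 1.0, 2.0, 10.0]
y_cls = ["a", "a", "b", "b"]

# The two nearest neighbors of 1.5 are 1.0 and 2.0, both at distance 0.5
pred_reg = knn_predict(X, y_reg, [1.5], k=2)
# The three nearest neighbors of 0.4 carry the labels a, a, b
pred_cls = knn_predict(X, y_cls, [0.4], k=3, task="classification")
```

With equal distances the weighted average reduces to the plain average in the formula above; the weights only matter when the neighbors lie at different distances.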
Random Forest
A Decision Tree is a tree whose nodes contain information corresponding to
attributes in the input. This information is used to follow a decision path for a given
set of attributes. Decision trees are appealing because they are very easy to interpret
and fast to compute. However, decision trees tend to learn the training data but fail to
generalize rules that help predict data they have not been exposed to.
Random Forest is an algorithm that extends decision trees by creating an
ensemble of Decision Trees through bagging [4]. Bagging is a technique that
combines multiple predictors trained on equal-size but different bootstrap samples
of the data set, then aggregates their predictions to create a more accurate estimate.
One of the major benefits of bagging is that it allows the model to learn complex
patterns without over-learning the training data to the detriment of data the model
has not seen before [8].
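As a rough sketch of bagging, assuming regression trees as the base predictor and a simple average as the aggregation step, the idea can be written as follows. The helper `bagged_trees_predict` and the toy sine data are hypothetical and exist only for this example. A full Random Forest additionally samples a random subset of features at each split, which this sketch omits.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_trees_predict(X_train, y_train, X_new, n_trees=25, seed=0):
    """Average the predictions of trees fit on bootstrap samples."""
    rng = np.random.default_rng(seed)
    n = len(X_train)
    preds = []
    for _ in range(n_trees):
        # Bootstrap sample: n rows drawn with replacement
        idx = rng.integers(0, n, size=n)
        tree = DecisionTreeRegressor(random_state=0).fit(X_train[idx], y_train[idx])
        preds.append(tree.predict(X_new))
    # Aggregate by averaging over the ensemble
    return np.mean(preds, axis=0)

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)
X_new = np.linspace(-3, 3, 50).reshape(-1, 1)
bagged = bagged_trees_predict(X, y, X_new)
```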
For each data set, after an imputation method fills in the missing values, we randomly
assign 75 percent of the data to train the model and allocate the remaining 25
percent of the data to test. Splitting the data set into a training and test set enables
us to test the performance on data the model has not seen before. If there is a large
difference between train and test metrics, it is likely that the model
is overfitting the training set and needs to be modified so it can generalize more [8].
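A minimal sketch of this split and of the train versus test comparison is given below. The simulated data, the forest size and the variable names are assumptions made for the illustration; only the 75/25 split and `random_state=0` follow the code in Appendix B.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Stand-in for an already imputed data set
X, y = make_regression(n_samples=400, n_features=5, noise=10.0, random_state=0)

# 75 percent of the rows for training, 25 percent held out for testing
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(x_train, y_train)
train_mae = mean_absolute_error(y_train, model.predict(x_train))
test_mae = mean_absolute_error(y_test, model.predict(x_test))

# A test error far above the training error is the warning sign for overfitting
gap = test_mae - train_mae
```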
4.1 Results
4.1.1 Melbourne Housing Data
For the Melbourne data set, we also report the predictive improvement from One Hot
Encoding the categorical variables for each imputation model during the XGBoost
imputation outlined in Chapter 2. Tables 4.2 and 4.3 show an improvement from
One Hot Encoding for Random Forest and Linear Regression. Although performance
increases for Random Forest and Linear Regression, the significant
increase in computational time does not justify the minor improvement in predictive
accuracy. For this reason, we opt to use Label Encoding rather than One Hot
Encoding in this thesis.
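The difference between the two encodings can be seen on a toy table. The column names `Type` and `Rooms` and the four rows are illustrative assumptions, not the Melbourne data itself.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"Type": ["h", "u", "t", "h"], "Rooms": [3, 2, 3, 4]})

# Label Encoding: one integer column per categorical variable
label_encoded = df.copy()
label_encoded["Type"] = LabelEncoder().fit_transform(df["Type"])

# One Hot Encoding: one indicator column per category level
one_hot = pd.get_dummies(df, columns=["Type"])
```

Label Encoding keeps the number of columns fixed, while One Hot Encoding adds a column for every category level, which is where the extra computational time for high-cardinality variables comes from.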
We can see in Tables 4.1, 4.2, and 4.3 that XGBoost is beaten by SoftImpute.
Table 4.3 shows that MICE gives a better predictive improvement for Linear Regression
than XGBoost.
Test MAE Test RMSE Imputation Name

212451 384336 SoftImpute
226965 402351 XGB
226817 403885 XGB OHE
231509 410707 KNN
261011 451932 SimpleFill
278054 467380 MICE
Table 4.1: kNN Prediction Results for Melbourne Housing Data

162571 305146 XGB OHE
168669 318002 XGB
176145 318065 KNN
184143 325730 MICE
Table 4.2: Random Forest Prediction Results for Melbourne Housing Data
259046 411609 MICE
257845 412602 XGB OHE
258613 412743 XGB
263768 416833 KNN
Table 4.3: Linear Regression Prediction Results for Melbourne Housing Data
4.1.2 Boston Housing Data
Tables 4.4, 4.5, and 4.6 show that XGBoost imputation performed best for
all of the models. This across-the-board performance supports the hypothesis from
Chapter 3 that XGBoost would give the best improvement in predictive ability of
all the imputation methods. We also see that SoftImpute had the worst
predictive accuracy, even worse than Mean imputation, for every model.
It was interesting to find that, for kNN, Mean was the second best performing
imputation method. This shows that Mean imputation is not the worst option
in every situation and can perform better than model-based methods for some
supervised models.

2.451654 3.676005 XGB
2.911575 4.770921 MICE
2.981890 5.004804 KNN
3.330472 5.316165 Mean
3.540394 5.805937 SoftImpute
Table 4.4: Random Forest Prediction Accuracy for Boston Housing Data
3.616397 5.412887 XGB
3.759428 5.606318 MICE
3.789144 5.715436 KNN
4.097958 5.934811 Mean
4.206722 6.178896 SoftImpute
Table 4.5: Linear Regression Prediction Accuracy for Boston Housing Data

4.451339 6.765719 XGB
4.584567 7.135352 Mean
4.582992 7.138653 KNN
4.694016 7.172817 MICE
4.817638 7.302483 SoftImpute
Table 4.6: kNN Prediction Accuracy for Boston Housing Data
4.1.3 Breast Cancer Data
The breast cancer data set was one that we thought could cause problems
for model-based methods such as MICE and our method. Since 77% of the variables
had missing data, the models could struggle to learn the
underlying structure of the data. However, as we can see from Tables 4.7, 4.8, and 4.9,
XGBoost and MICE gave the best imputation improvements, and XGBoost was the
best for Logistic Regression and kNN.
Test Accuracy Precision Recall F1 Imputation Name

0.9441 0.945652 0.966667 0.956044 XGB
0.9371 0.945055 0.955556 0.950276 KNN
0.9371 0.926316 0.977778 0.951351 Mean
0.9161 0.943182 0.922222 0.932584 MICE
0.8951 0.894737 0.944444 0.918919 SoftImpute
Table 4.7: kNN Prediction Accuracy for Breast Cancer Data
0.9650 0.988506 0.955556 0.971751 MICE
0.9510 0.966292 0.955556 0.960894 XGB
0.9510 1.000000 0.922222 0.959538 KNN
0.9371 0.976471 0.922222 0.948571 Mean
0.9301 0.954545 0.933333 0.943820 SoftImpute
Table 4.8: Random Forest Prediction Accuracy for Breast Cancer Data

0.9441 0.976744 0.933333 0.954545 XGB
0.9371 0.976471 0.922222 0.948571 MICE
0.9301 0.944444 0.944444 0.944444 SoftImpute
0.9301 0.954545 0.933333 0.943820 KNN
0.9231 0.954023 0.922222 0.937853 Mean
Table 4.9: Logistic Regression Prediction Accuracy for Breast Cancer Data
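The F1 column in Tables 4.7 to 4.9 is the harmonic mean of precision and recall. As a quick check, the XGB row of Table 4.7 can be reproduced from its precision and recall; `f1_from` is a hypothetical helper name used only here.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision and recall of the XGB row in Table 4.7 (kNN, breast cancer data)
f1_xgb_knn = f1_from(0.945652, 0.966667)
```

The result agrees with the tabulated value 0.956044 to the reported precision.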
CHAPTER 5
Simulation Study
In order to control the different variables that may impact the performance and
accuracy of data imputation, we chose to simulate data at varying parameters.
Simulated data has the inherent ability to change one aspect of data at a time and
keep everything else constant. If we only use real data, but want to test the impact
of original sample size, we would be required to find sufficiently different data sets for
the test. We would be able to see differences in imputation between the two different
data sets, but we would not be able to attribute the difference solely to the size of the
data, because other aspects of the data that are completely out of our control may
be different as well.
One of the primary reasons to test the impact of data size is that the amount of
data available may affect the ability of a given imputation model to fully learn
the structure of the data [2]. A model that performs best in an environment of
abundant data may be unusable in situations where there is not a sufficient amount
of data. We test two kinds of simulated data generated with scikit-learn,
one built for regression tasks, the other built for classification tasks [17].
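As a sketch of how such data sets can be generated, the scikit-learn functions `make_regression` and `make_classification` accept the sample size directly. The three sizes match Tables 5.1 and 5.2; the number of features, the noise level, `flip_y` and the seed are assumptions chosen for the illustration, not the settings used for our results.

```python
from sklearn.datasets import make_classification, make_regression

sizes = [100, 1_000, 10_000]

# One (X, y) pair per sample size, everything else held constant
regression_sets = {
    n: make_regression(n_samples=n, n_features=10, noise=1.0, random_state=0)
    for n in sizes
}
classification_sets = {
    n: make_classification(n_samples=n, n_features=10, n_informative=5,
                           flip_y=0.05, random_state=0)
    for n in sizes
}
```

Only `n_samples` changes between the data sets, which is what lets a difference in imputation accuracy be attributed to the data size.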
5.0.1 Imputation Results
Table 5.1 reinforces the finding that, as we increased the size of the data [2], XGBoost
was able to predict the missing values more accurately. This is an inherent byproduct
of XGBoost being a non-parametric method. When we increase the size of the
data, the model has more information with which to estimate the underlying function
that governs each variable with missing data points. However,
Table 5.2 shows that simply increasing the data size does not always lead to better
performance: the 1K data set has lower MAE and RMSE than the 10K data set. A
larger data set may not improve model performance if there is noise in the data,
similar to what we simulated for Table 5.2.
Data Size MAE RMSE

10K 0.56083 0.79473
1K 0.84546 1.04498
100 0.921536 1.13368
Table 5.1: Simulated Regression Data Imputation Results
Data Size MAE RMSE

10K 0.745078 1.002349
1K 0.671616 0.932976
100 0.806236 1.058789
Table 5.2: Simulated Classification Data Imputation Results
CHAPTER 6
Data Preprocessing
Data pre-processing is an important preliminary step in the predictive analysis
process. The old phrase "garbage in, garbage out" perfectly describes predictive
modeling. Therefore, pre-processing can be one of the most important aspects of
predictive ability.
Data pre-processing typically includes "cleaning", normalizing, transforming,
feature selection, and outlier removal. In this analysis, we will focus on data
transformation and outlier removal.
6.1 Skew
Skewness is simply a measure of how asymmetrical data is. For parametric models,
such as Linear and Logistic Regression, parameter estimation is founded on the
assumption that the data is normally distributed. If the data does not follow a
normal distribution, the parameters cannot be estimated accurately [10].
For decision tree based models, variance is used to help perform the node splits.
Variance is sensitive to outliers and skewed data. Therefore, transforming the
variables can lead to improvements for both parametric and tree based models.
For univariate data, skewness is calculated using the
Fisher-Pearson coefficient of skewness:
\[ \frac{\sum_{i=1}^{N} (Y_i - \bar{Y})^3 / N}{s^3} \]
where Ȳ is the mean, s is the standard deviation and N is the number of data points.
We use Python's adjusted Fisher-Pearson coefficient of skewness [9]:
\[ \text{Skew} = \frac{\sqrt{N(N-1)}}{N-2} \cdot \frac{\sum_{i=1}^{N} (Y_i - \bar{Y})^3 / N}{s^3} \]
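The two coefficients can be computed directly from their definitions. The sketch below assumes that s is the standard deviation with N in the denominator, and the function names and the exponential sample are hypothetical. For reference, `scipy.stats.skew` returns the unadjusted coefficient by default and the adjusted one with `bias=False`.

```python
import numpy as np
from scipy.stats import skew

def fisher_pearson(y):
    """Unadjusted Fisher-Pearson coefficient of skewness."""
    y = np.asarray(y, dtype=float)
    n = y.size
    s = y.std()  # standard deviation with N in the denominator
    return np.sum((y - y.mean()) ** 3) / n / s ** 3

def adjusted_fisher_pearson(y):
    """Adjusted coefficient: sqrt(N(N-1)) / (N-2) times the unadjusted one."""
    n = np.asarray(y).size
    return np.sqrt(n * (n - 1)) / (n - 2) * fisher_pearson(y)

rng = np.random.default_rng(0)
sample = rng.exponential(size=500)  # right-skewed toy data

g1 = fisher_pearson(sample)
G1 = adjusted_fisher_pearson(sample)
```

For large N the correction factor is close to 1, so the two coefficients differ noticeably only for small samples.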
6.1.1 Data Transformation
The Box-Cox transformation is a technique that attempts to transform data into a
Gaussian distribution [3].
The Box-Cox transformation is defined as:
\[ T(y_i) = \begin{cases} \dfrac{y_i^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\ \ln y_i, & \lambda = 0 \end{cases} \]
where y_i is the data point being transformed and λ is the transformation
parameter. The Box-Cox transformation only allows positive values in the
data [3].
The Box-Cox requirement of strictly positive data conflicts with the objective of
creating a pipeline that is able to handle any type of data. Therefore we chose to use
an extension of the Box-Cox transformation, the Yeo-Johnson transformation. The major
benefit of Yeo-Johnson is the ability to transform zero and negative values [23]. This
transformation is defined as:
\[ T(y_i) = \begin{cases}
\dfrac{(y_i + 1)^{\lambda} - 1}{\lambda}, & \lambda \neq 0,\ y_i \geq 0 \\
\ln(y_i + 1), & \lambda = 0,\ y_i \geq 0 \\
-\dfrac{(-y_i + 1)^{2 - \lambda} - 1}{2 - \lambda}, & \lambda \neq 2,\ y_i < 0 \\
-\ln(-y_i + 1), & \lambda = 2,\ y_i < 0
\end{cases} \]
Bulmer advises the rule of thumb that normal data has a skew value such that the
Fisher-Pearson coefficient is between -1 and 1 [5]. We adhere to Bulmer's
rule of thumb and transform all of the variables whose Fisher-Pearson coefficient is
larger than 1 in absolute value.
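The four cases of the Yeo-Johnson transformation translate directly into code. The function below is a hand-written sketch for a fixed λ, with the hypothetical name `yeo_johnson`; it is not the routine used for our results, which rely on scikit-learn's `PowerTransformer` as shown in Appendix B.

```python
import numpy as np

def yeo_johnson(y, lam):
    """Apply the Yeo-Johnson transformation with a fixed parameter lam."""
    y = np.asarray(y, dtype=float)
    out = np.empty_like(y)
    pos = y >= 0
    # Non-negative values: Box-Cox applied to y + 1
    if lam != 0:
        out[pos] = ((y[pos] + 1) ** lam - 1) / lam
    else:
        out[pos] = np.log(y[pos] + 1)
    # Negative values: mirrored form with exponent 2 - lam
    if lam != 2:
        out[~pos] = -(((-y[~pos] + 1) ** (2 - lam)) - 1) / (2 - lam)
    else:
        out[~pos] = -np.log(-y[~pos] + 1)
    return out

values = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
identity = yeo_johnson(values, 1.0)  # lam = 1 leaves the data unchanged
logged = yeo_johnson(values, 0.0)    # lam = 0 gives ln(y + 1) for y >= 0
```

In practice λ is not fixed by hand. `PowerTransformer(method='yeo-johnson')` estimates it by maximum likelihood for each variable and, by default, also standardizes the transformed values.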
We use the breast cancer data set for our investigation into the impact of skew.
Variable Skew Post Transformation

area error 5.432816 0.988037
concavity error 5.096981 0.939893
fractal dimension error 3.913617 0.723695
perimeter error 3.434530 0.648734
radius error 3.080464 0.497007
smoothness error 2.308344 0.491316
symmetry error 2.189342 0.455120
compactness error 1.897202 0.414330
worst area 1.854468 0.233932
worst fractal dimension 1.658193 0.223333
texture error 1.642100 0.204429
mean area 1.641391 0.198216
worst compactness 1.469667 0.194282
concave points error 1.440867 0.145516
worst symmetry 1.430145 0.135536
mean concavity 1.397483 0.105286
mean fractal dimension 1.301047 0.091418
mean compactness 1.186983 0.085185
mean concave points 1.168090 0.083782
worst concavity 1.147202 0.081532
worst perimeter 1.125188 0.080579
worst radius 1.100205 0.069123
Table 6.1: Fisher-Pearson Coefficient Before and After Transformation for the Skewed
Variables
After calculating the Fisher-Pearson coefficient for each variable in the breast
cancer data set, we found that there were 22 variables (73%) that were skewed.
Data MAE RMSE

Transformed Data 0.324336 0.581183
Skewed Data 2.770718 21.912681
Table 6.2: XGBoost Imputation Comparison Between Transformed and Original Data
Table 6.2 shows a 97% decrease in RMSE after we transform the data. Since RMSE
heavily penalizes predictions that are far from the true value [21], we believe that
imputation on the original data suffered from some highly inaccurate imputed values.
Inaccurate imputation can lead to a deterioration of the predictive ability of a
predictive model. We chose not to compare predictive performance after imputation on
transformed versus non-transformed data because we would not be able to determine
whether a difference in predictive accuracy is caused by the
data transformation itself or by the increased accuracy of the data imputation.
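To illustrate why RMSE reacts so strongly to a few large errors, the sketch below compares two made-up error patterns with the same MAE. The arrays and helper names are assumptions for the example; only the two RMSE values in the last line are taken from Table 6.2.

```python
import numpy as np

def mae(y_true, y_pred):
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

truth = np.zeros(10)
even_errors = np.full(10, 1.0)  # every imputed value is off by 1
one_outlier = np.zeros(10)
one_outlier[0] = 10.0           # a single imputed value is off by 10

same_mae = (mae(truth, even_errors), mae(truth, one_outlier))
different_rmse = (rmse(truth, even_errors), rmse(truth, one_outlier))

# Relative decrease in RMSE between the two rows of Table 6.2
rmse_drop = 1 - 0.581183 / 21.912681
```

Both patterns have an MAE of 1, but the single outlier raises the RMSE from 1 to about 3.16.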
CHAPTER 7
Conclusion
In this thesis, XGBoost was used to create a Multiple Imputation technique as a new
approach to the missing data problem. The imputation method based on XGBoost
is more flexible than the current industry standard, MICE, in that it makes no
assumptions about the underlying structure of the data. We compared XGBoost
imputation to several standard techniques across different types of data. XGBoost
MI performed on par with or beat the standard methods throughout the different tests;
however, it showed some weakness in making predictions that were highly
inaccurate, as seen when XGBoost had a low MAE score but a higher RMSE. As we
saw in Chapter 4, the inaccuracy on some data points caused XGBoost not to be
the ideal imputation scheme for models that are sensitive to outliers, such as linear
regression and logistic regression. The overall accuracy, shown by the consistently
better MAE score, leads me to suggest that XGBoost MI may be generally better for
tree based methods and kNN for most tasks.
Our study did not include machine learning techniques such as Support Vector
Machines, Generalized Linear Regression, or Neural Networks, so we cannot make any
inference about how XGBoost MI will perform as the imputation method for other
families of models. Future work would investigate how XGBoost MI performs as
the imputation scheme for a wider array of machine learning algorithms.
Further, there are several areas of potential improvement. We chose to only include
subsampling as a parameter for XGBoost. However, XGBoost performance is highly
sensitive to the parameters selected. We could see a significant increase in imputation
performance by tuning the parameters for each missing variable. Further, we could
find better ways of ordering the imputation other than the order in which the variables
reside within the data set. One method could be randomly selecting the columns
to be imputed. A more sophisticated method could be to impute the variables in
descending order with respect to the percent of missing data. There is a lot of room
for potential improvement, and we believe that this thesis is just the first step in
creating a more robust imputation method that utilizes XGBoost.
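The two alternative orderings mentioned above could be computed as follows. This is a sketch of the idea only: the toy data frame and variable names are hypothetical, and neither ordering was used or evaluated in this thesis.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, np.nan, np.nan, 1.0],
    "c": [1.0, 2.0, 3.0, 4.0],
    "d": [np.nan, 2.0, np.nan, 4.0],
})

# Share of missing values per column, keeping only columns that need imputation
missing_share = df.isna().mean()
missing_share = missing_share[missing_share > 0]

# Descending order with respect to the percent of missing data
descending_order = list(missing_share.sort_values(ascending=False).index)

# Random order, seeded so the run is reproducible
rng = np.random.default_rng(0)
random_order = list(rng.permutation(missing_share.index.to_numpy()))
```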
BIBLIOGRAPHY
[1] Allison, P. Multiple Imputation For Missing Data: A Cautionary Tale.
Sociological Methods and Research, 28:301–309, 2000.
[2] Beleites, C., Neugebauer, U., Bocklitz, T., Krafft, C., and Popp, J. Sample size
planning for classification models. Analytica Chimica Acta, 760:25–33, 2013.
[3] Box, G. E. P. and Cox, D. R. An analysis of transformations. Journal of the
Royal Statistical Society: Series B, 26(2):211–252, 1964.
[4] L. Breiman. Inference and missing data. Biometrika, 63(3):581–592, 1976.
[5] Bulmer, M. G. Principles of Statistics. Second Edition. Dover Publ., 1979.
[6] Harrison, D. and Rubinfeld, D.L. Hedonic prices and the demand for clean air.
Journal of Environmental Economics and Management, 5:81–102, 1978.
[7] Trevor Hastie, Rahul Mazumder, Jason D. Lee, and Reza Zadeh. Matrix
completion and low-rank svd via fast alternating least squares. Journal of
Machine Learning Research, 16(104):3367–3402, 2015.
[8] Hastie, T., Tibshirani, R., and Friedman, J. Elements of Statistical Learning:
Data Mining, Inference and Prediction. Springer-Verlag, 2009.
[9] Hotelling, H. New Light on the Correlation Coefficient and its Transforms.
Journal of the Royal Statistical Society, 15(2).
[10] Shao J. Mathematical Statistics. Springer, New York, NY, USA, 2 edition, 2003.
[11] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An
Introduction to Statistical Learning: With Applications in R. Springer Publishing
Company, Incorporated, 2014.
[12] Jerome H. Friedman. Greedy Function Approximation: A Gradient Boosting
Machine. The Annals of Statistics, 29:1189–1232, 2001.
[13] Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix factorization techniques
for recommender systems. Computer, 42(8):30–37, August 2009.
[14] Roderick J A Little and Donald B Rubin. Statistical Analysis with Missing Data.
John Wiley & Sons, Inc., New York, NY, USA, 1986.
[15] Jayant Malani, Neha Sinha, Nivedita Prasad, and Vikas Lokesh. Forecasting
bike sharing demand.
[16] Thomas M. Mitchell. Machine Learning. McGraw-Hill, Inc., New York, NY,
USA, 1 edition, 1997.
[17] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel,
M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos,
D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine
learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
[18] Peter J. Rousseeuw and Katrien Van Driessen. A Fast Algorithm for the Minimum
Covariance Determinant Estimator. Technometrics, 41(3):212–223, 1999.
[19] Rousseeuw, P.J. Least Median of Squares Regression. Journal of the American
Statistical Association, (79):871–880, 1984.
[20] Donald B. Rubin. Inference and missing data. Biometrika, 63(3):581–592, 1976.
[21] T. Chai and R. R. Draxler. Root mean square error (RMSE) or mean absolute
error (MAE)? - Arguments against avoiding RMSE in literature. Geoscientific
Model Development, 7(3):1247–1250, 2014.
[22] Tianqi Chen and Carlos Guestrin. XGBoost: A Scalable Tree Boosting System.
Proceedings of the 22nd ACM SIGKDD International Conference on knowledge
discovery and data mining, pages 785–794, 2016.
[23] Yeo, In-Kwon and Johnson, Richard. A new family of power transformations.
Biometrika, (87):954–959, 2000.
APPENDIX
APPENDIX A
Melbourne Feature Engineering
Prior to running any tests on the Melbourne data, I created some features from the
original data in a process known as Feature Engineering. There are generally two
ways of improving the performance of any given machine learning model: increase the
data size or increase the feature space. Creating new features creates new dimensions
from the data to help models better learn and predict the dependent variable.
The first group of features I created were from the time variable. I created a
categorical feature for the Month and another categorical feature for the Year. The
idea is that the given month of the year may have an impact on the value of a house.
For example, if more people are looking to move in the summer, then perhaps prices
for houses are typically higher in the summer as a result of the increased demand.
Another feature I created was Age of the house. I computed 2017 - X for every X
in the column YearBuilt.
For Latitude and Longitude, I created 10 bins each by ordering the
values by quantiles and creating 10 bins from the resulting ordering.
Then I created another feature that crossed the latitude and longitude bins to
create a grid for Melbourne. Since there are 10 bins apiece, there is the potential of
100 new categories for this one feature.
The last feature I created was a Large house indicator for houses that have more
than 4 Bedrooms.
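The features described above could be built with pandas roughly as follows. This is a sketch on randomly generated stand-in data: apart from `YearBuilt`, every column name (`Date`, `Latitude`, `Longitude`, `Bedrooms` and the new feature names) is an assumption made for the example and may differ from the names in the actual data set.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
houses = pd.DataFrame({
    "Date": pd.to_datetime("2016-01-01")
            + pd.to_timedelta(rng.integers(0, 700, n), unit="D"),
    "YearBuilt": rng.integers(1900, 2017, n),
    "Latitude": rng.uniform(-38.2, -37.4, n),
    "Longitude": rng.uniform(144.4, 145.5, n),
    "Bedrooms": rng.integers(1, 7, n),
})

# Categorical month and year of sale
houses["Month"] = houses["Date"].dt.month.astype(str)
houses["Year"] = houses["Date"].dt.year.astype(str)

# Age of the house, computed as 2017 - YearBuilt
houses["Age"] = 2017 - houses["YearBuilt"]

# Ten quantile bins each for latitude and longitude
houses["LatBin"] = pd.qcut(houses["Latitude"], q=10, labels=False)
houses["LonBin"] = pd.qcut(houses["Longitude"], q=10, labels=False)

# Crossing the two sets of bins gives up to 100 grid cells
houses["GridCell"] = houses["LatBin"].astype(str) + "_" + houses["LonBin"].astype(str)

# Indicator for houses with more than 4 bedrooms
houses["LargeHouse"] = (houses["Bedrooms"] > 4).astype(int)
```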
APPENDIX B
Code
from sklearn.ensemble import IsolationForest
import pandas as pd
from fancyimpute import KNN, SoftImpute, BiScaler, SimpleFill, IterativeImputer, IterativeSVD, MatrixFactorization
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import math
from sklearn import preprocessing
from sklearn.datasets import make_classification, make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler, LabelEncoder, RobustScaler, PowerTransformer
from sklearn import pipeline
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.svm import SVC, SVR
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.tree import DecisionTreeClassifier
# The next import list is cut off in the source after "G"; only the visible names are kept
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, BaggingClassifier, ExtraTreesClassifier
from sklearn.ensemble import RandomForestRegressor  # needed by ML_regressor below
from sklearn.naive_bayes import GaussianNB, BernoulliNB
# This import list is also cut off in the source after "Perceptron"
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV, PassiveAggressiveClassifier, Perceptron
from sklearn.linear_model import LinearRegression  # needed by ML_regressor below
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.metrics import confusion_matrix, mean_absolute_error
from sklearn.metrics import mean_squared_error, precision_score, recall_score, auc, roc_curve, f1_score
from xgboost import XGBClassifier, XGBRegressor
from lightgbm import LGBMClassifier, LGBMRegressor
from catboost import CatBoostClassifier, CatBoostRegressor
from sklearn.neural_network import MLPClassifier, MLPRegressor
import impyute as impy
from scipy import stats
from scipy.stats import norm, skew  # for some statistics
from sklearn.impute import SimpleImputer
from sklearn.datasets import load_boston, load_wine, load_breast_cancer, load_iris
import random

# Imputation methods taken from fancyimpute
MLA = [IterativeImputer(),
       KNN(),
       SimpleFill(),
       # IterativeSVD(),
       SoftImpute()]

# Supervised models used to score each imputation (regression tasks)
ML_regressor = [
    KNeighborsRegressor(),
    RandomForestRegressor(),
    LinearRegression(),
    XGBRegressor()]

# Supervised models used to score each imputation (classification tasks)
ML_classifier = [
    KNeighborsClassifier(),
    RandomForestClassifier(),
    LogisticRegression(),
    XGBClassifier()]

def fix_skew(df):
    """Yeo-Johnson transform every numeric column whose |skew| exceeds 1."""
    temp = df.copy()
    numeric_feats = temp.dtypes[temp.dtypes != "object"].index
    skewed_feats = temp[numeric_feats].apply(lambda x: skew(x.dropna())).sort_values(ascending=False)
    skewness = skewed_feats[abs(skewed_feats) > 1]
    skewed_features = skewness.index
    pt = PowerTransformer(method='yeo-johnson')
    for feat in skewed_features:
        temp[feat] = pt.fit_transform(np.array(temp[feat]).reshape(-1, 1)).ravel()
    print(skewness.shape[0], "skewed numerical features have been transformed")
    return temp

def remove_outlier(df):
    """Drop the rows that an Isolation Forest scores as outliers."""
    temp = df.copy()
    numeric_feats = temp.dtypes[temp.dtypes != "object"].index
    out_df = temp[numeric_feats].copy().fillna(0)
    # behaviour="new" is omitted: recent scikit-learn releases removed the
    # argument and use that behaviour by default
    IF = IsolationForest(contamination="auto")
    IF.fit(out_df)
    temp["IF_score"] = IF.decision_function(out_df)
    return temp[temp["IF_score"] > 0.00]

def intersection(lst1, lst2):
    """Return the elements of lst1 that also appear in lst2, in lst1 order."""
    temp = set(lst2)
    lst3 = [value for value in lst1 if value in temp]
    return lst3

def prep_onehot(df, cat_cols, column):
    """One hot encode every categorical column except the target column."""
    tempdf = df.copy()
    target_list = [column]
    onehot_cols = cat_cols[np.logical_not(cat_cols.isin(target_list))]
    tempdf = pd.get_dummies(tempdf, columns=onehot_cols)
    return tempdf

def XGB_impute(df, k):
    """Impute missing values with k rounds of chained XGBoost models."""
    temp = df.copy()
    cat_features = temp.columns[np.where(temp.dtypes == object)[0]]
    numerical_features = temp.columns[np.where(temp.dtypes != object)[0]]
    missing_cols = [col for col in temp.columns
                    if temp[col].isnull().any()]
    cat_intersect = intersection(cat_features, missing_cols)
    numerical_intersect = intersection(numerical_features, missing_cols)
    temp = label(temp, cat_features)
    # Mark missing entries with the sentinel -5 and remember where they were
    temp[cat_intersect] = temp[cat_intersect].fillna(-5)
    cat_idx = temp[cat_intersect] == -5
    temp[numerical_intersect] = temp[numerical_intersect].fillna(-5)
    num_idx = temp[numerical_intersect] == -5
    # Initial fill: mode for categorical columns, mean for numerical columns
    if len(cat_intersect) > 0:
        temp[cat_intersect] = fill_cat(temp[cat_intersect], -5)
    if len(numerical_intersect) > 0:
        temp[numerical_intersect] = fill_float(temp[numerical_intersect], -5)
    temp[numerical_features] = temp[numerical_features].astype('float')
    # Each round refits one model per incomplete column on the observed rows
    # and overwrites the originally missing entries with its predictions
    for i in range(k):
        for col in cat_intersect:
            missing = cat_idx[col].values
            features = temp.drop(col, axis=1)
            model = XGBClassifier().fit(features.loc[~missing, :], temp.loc[~missing, col])
            temp.loc[missing, col] = model.predict(features.loc[missing, :])
        for col in numerical_intersect:
            missing = num_idx[col].values
            features = temp.drop(col, axis=1)
            model = XGBRegressor().fit(features.loc[~missing, :], temp.loc[~missing, col])
            temp.loc[missing, col] = model.predict(features.loc[missing, :])
    return temp

def label(df, col_label):
    """Label encode the observed values of each listed column."""
    df = df.copy()
    le = LabelEncoder()
    for col in col_label:
        idx = ~df[col].isna()
        df.loc[idx, col] = le.fit(df.loc[idx, col]).transform(df.loc[idx, col])
    return df

def fill_cat(df, val):
    """Replace the sentinel val with the most frequent value of each column."""
    imp_mode = SimpleImputer(missing_values=val, strategy='most_frequent')
    newdf = imp_mode.fit_transform(df)
    return newdf

def fill_float(df, val):
    """Replace the sentinel val with the mean of each column."""
    imp_mean = SimpleImputer(missing_values=val, strategy='mean')
    newdf = imp_mean.fit_transform(df)
    return newdf

def real_imputation_reg(df, target, skew=False, outlier=False):
    """Score every imputation method with every regression model."""
    temp = df.copy()
    if skew:
        temp = fix_skew(temp)
    if outlier:
        temp = remove_outlier(temp)
    MLA_columns = []
    ML_compare = pd.DataFrame(columns=MLA_columns)
    row_index1 = 0
    # Start the actual imputation
    y = temp[target]
    X = temp.drop(target, axis=1).copy()
    X_incomplete = np.array(X)
    for alg in MLA:
        X_filled = alg.fit_transform(X_incomplete)
        MLA_name = alg.__class__.__name__
        x_train, x_test, y_train, y_test = train_test_split(X_filled, y, test_size=0.25, random_state=0)
        for ml in ML_regressor:
            pred_train = ml.fit(x_train, y_train).predict(x_train)
            pred_test = ml.predict(x_test)
            ML_name = ml.__class__.__name__
            ML_compare.loc[row_index1, 'ML Name'] = ML_name
            ML_compare.loc[row_index1, 'ML Train MAE'] = round(mean_absolute_error(pred_train, y_train), 6)
            ML_compare.loc[row_index1, 'ML Train RMSE'] = round(np.sqrt(mean_squared_error(pred_train, y_train)), 6)
            ML_compare.loc[row_index1, 'ML Test MAE'] = round(mean_absolute_error(pred_test, y_test), 6)
            ML_compare.loc[row_index1, 'ML Test RMSE'] = round(np.sqrt(mean_squared_error(pred_test, y_test)), 6)
            ML_compare.loc[row_index1, 'Imputation Name'] = MLA_name
            row_index1 += 1
    # Create new data set for the new imputation
    temp2 = temp.copy()
    XGB = XGB_impute(temp2, 8)
    # CatBoost = CatBoost_impute(temp2, 5)
    # LGBM = LGBM_impute(temp2, 5)
    XY_incomplete = np.array(temp2)
    # MICE: average several posterior draws of the iterative imputer
    n_imputations = 5
    XY_completed = []
    for i in range(n_imputations):
        imputer = IterativeImputer(n_iter=2, sample_posterior=True, random_state=i)
        XY_completed.append(imputer.fit_transform(XY_incomplete))
    XY_completed_mean = pd.DataFrame(np.mean(XY_completed, 0), columns=temp2.columns)
    impute_data = [("XGB", XGB), ("MICE", XY_completed_mean)]
    for MLA_name, data in impute_data:
        # The random_state value is cut off in the source; 0 is assumed to match the split above
        x_train, x_test, y_train, y_test = train_test_split(data.drop(target, axis=1), y, test_size=0.25, random_state=0)
        for ml2 in ML_regressor:
            pred_train = ml2.fit(x_train, y_train).predict(x_train)
            pred_test = ml2.predict(x_test)
            ML_name = ml2.__class__.__name__
            # Names are recorded here as in the first loop so the rows can be identified
            ML_compare.loc[row_index1, 'ML Name'] = ML_name
            ML_compare.loc[row_index1, 'ML Train MAE'] = round(mean_absolute_error(pred_train, y_train), 4)
            ML_compare.loc[row_index1, 'ML Train RMSE'] = round(np.sqrt(mean_squared_error(pred_train, y_train)), 4)
            ML_compare.loc[row_index1, 'ML Test MAE'] = round(mean_absolute_error(pred_test, y_test), 4)
            ML_compare.loc[row_index1, 'ML Test RMSE'] = round(np.sqrt(mean_squared_error(pred_test, y_test)), 4)
            ML_compare.loc[row_index1, 'Imputation Name'] = MLA_name
            row_index1 += 1
    return ML_compare

def real_imputation_class(df, target, skew = False, outlier = False):

    if skew == True:
        temp = fix_skew(temp)
    else: pass
    if outlier == True:
        temp = remove_outlier(temp)
    else: pass
    MLA_columns = []
    ML_compare = pd.DataFrame(columns = MLA_columns)
    row_index1 = 0
    cat_intersect = intersection(numerical_features, missing_cols)
    # Start the actual imputation
    #print(cat_features)
    y = temp[target]
    X = temp.drop(target, axis = 1).copy()
    X_incomplete = np.array(X)
    for alg in MLA:
        X_filled = alg.fit_transform(X_incomplete)
        x_train, x_test, y_train, y_test = train_test_split(X_filled, y, test_size = 0.25, random_state = 0)
        for ml in ML_classifier:
            pred_train = ml.fit(x_train, y_train).predict(x_train)
            pred_test = ml.predict(x_test)
            fp, tp, th = roc_curve(y_test, pred_test)
            ML_compare.loc[row_index1, 'ML Train Accuracy'] = round(ml.score(x_train, y_train), 4)
            ML_compare.loc[row_index1, 'ML Test Accuracy'] = round(ml.score(x_test, y_test), 4)
            ML_compare.loc[row_index1, 'ML Precission'] = precision_score(y_test, pred_test)
            ML_compare.loc[row_index1, 'ML Recall'] = recall_score(y_test, pred_test)
            ML_compare.loc[row_index1, 'ML AUC'] = auc(fp, tp)
            ML_compare.loc[row_index1, 'ML f1'] = f1_score(y_test, pred_test)
            row_index1 += 1
    # Create new dataset for new imputation
    temp2 = temp.copy()
    XGB = XGB_impute(temp2, 8)
    #CatBoost = CatBoost_impute(temp2, 5)
    #LGBM = LGBM_impute(temp2, 5)
    #CatBoost.name = "CatBoost"
    #LGBM.name = "LGBM"
    XY_incomplete = np.array(temp2)
    n_imputations = 5
    XY_completed = []
    x_train, x_test, y_train, y_test = train_test_split(data.drop(target, axis=1), y, test_size = 0.25, random_s
    for ml2 in ML_classifier:
        fp, tp, th = roc_curve(y_test, pred_test)
        ML_compare.loc[row_index1, 'ML Train Accuracy'] = round(ml2.score(x_train, y_train), 4)
        ML_compare.loc[row_index1, 'ML Test Accuracy'] = round(ml2.score(x_test, y_test), 4)
        ML_compare.loc[row_index1, 'ML Precission'] = precision_score(y_test, pred_test)
        ML_compare.loc[row_index1, 'ML Recall'] = recall_score(y_test, pred_test)
        ML_compare.loc[row_index1, 'ML AUC'] = auc(fp, tp)
        ML_compare.loc[row_index1, 'ML f1'] = f1_score(y_test, pred_test)
        row_index1 += 1
    return ML_compare
from datetime import datetime
plt.figure(1); plt.title('Johnson SU')

sns.distplot(y, kde=False, fit=stats.johnsonsu)
plt.figure(2); plt.title('Normal')

sns.distplot(y, kde=False, fit=stats.norm)
plt.figure(3); plt.title('Log Normal')
sns.distplot(y, kde=False, fit=stats.lognorm)
census = pd.read_csv("adult.csv")
####################################################
def XGB_impute2(df, k):

    cat_intersect = intersection(cat_features, missing_cols)
    temp[cat_intersect] = temp[cat_intersect].fillna(-5)
    cat_idx = temp[cat_intersect] == -5
    temp[numerical_intersect] = temp[numerical_intersect].fillna(-5)
    num_idx = temp[numerical_intersect] == -5
    if (len(cat_intersect) > 0):
        temp[cat_intersect] = fill_cat(temp[cat_intersect], -5)
    else: pass
    if (len(numerical_intersect) > 0):
        temp[numerical_intersect] = fill_float(temp[numerical_intersect], -5)
    else: pass
    temp[numerical_features] = temp[numerical_features].astype('float')
    for i in range(k):
        for col in cat_intersect:
            if len(cat_intersect) == 0:
                break
            else:
                temp3 = prep_onehot(temp, cat_features, col)
                temp.loc[cat_idx[col].values, col] = XGBClassifier().fit(temp3.drop(col, axis = 1).loc[~cat_idx[col].v
        for col in numerical_intersect:
            if len(numerical_intersect) == 0:
                break
            else:
                tempdf = pd.get_dummies(temp, columns=cat_features)
                temp.loc[num_idx[col].values, col] = XGBRegressor().fit(tempdf.drop(col, axis = 1).loc[~num_idx[col].v
    return temp
iris_data = load_iris()
iris = pd.DataFrame(iris_data.data)
iris.columns = iris_data.feature_names
iris['class'] = iris_data.target
y = iris['class']
with open('mamogram_missing.tex', 'w') as tf:

    tf.write(a.to_latex())
bc_data = load_breast_cancer()
bc = pd.DataFrame(bc_data.data)
bc.columns = bc_data.feature_names
bc['Type'] = bc_data.target
def results_class(df, target, skew = False, outlier = False):

    if skew == True:
        df, skewed_feats = fix_skew(df)
    else: pass
    if outlier == True:
        df = remove_outlier(df)
    else: pass
    ML_classifier = [
        XGBClassifier()]
    X1_missing = df.copy()
    y = X1_missing[target]
    cols = random.sample(list(X1_missing.columns)[:-1], random.randint(1, len(X1_missing.columns)-2))
    # Randomly remove a percentage of each selected column
    perclist = []
    for i in range(len(cols)):
        perclist.append(random.random())
    mylist = []
    for perc in perclist:
        mylist.append(random.sample(range(1, len(y)), round(perc*len(y)/2)))
    counter = 0
    for col in cols:
        X1_missing.loc[mylist[counter], col] = np.nan
        tempdf.loc[mylist[counter], col] = np.nan
        counter += 1
    miss_perc = X1_missing.isna().sum()/len(X1_missing)
    miss_cols = cols
    MLA_columns = []
    MLA_compare2 = pd.DataFrame(columns = MLA_columns)
    row_index = 0
    ML_columns = []
    ML_compare2 = pd.DataFrame(columns = ML_columns)
    row_index1 = 0
    for alg in MLA:
        X1_full = np.array(df.copy())
        X1_incomplete = np.array(X1_missing.copy())
        X_filled = alg.fit_transform(X1_incomplete)
        temp = pd.DataFrame(X_filled, columns = X1_missing.columns)
        MLA_compare2.loc[row_index, 'Imputation Name'] = MLA_name
        mylist1 = np.array([x for x in np.array(temp[X1_missing.isna()]).flatten() if ~np.isnan(x)])
        # mylist2 = np.array([x for x in np.array(df[X1_missing.isna()]).flatten() if ~np.isnan(x)])
        # MLA_compare2.loc[row_index, 'MAE'] = (abs(mylist1-mylist2)).sum()/len(mylist1)
        # MLA_compare2.loc[row_index, 'RMSE'] = math.sqrt(((mylist1-mylist2)**2).sum()/len(mylist1))
        row_index+=1
        x_train, x_test, y_train, y_test = train_test_split(temp.drop(columns=target), y, test_size = 0.25, ran
        predicted = ml.fit(x_train, y_train).predict(x_test)
        fp, tp, th = roc_curve(y_test, predicted)
        ML_compare2.loc[row_index1, 'ML Name'] = ML_name
        ML_compare2.loc[row_index1, 'ML Train Accuracy'] = round(ml.score(x_train, y_train), 4)
        ML_compare2.loc[row_index1, 'ML Test Accuracy'] = round(ml.score(x_test, y_test), 4)
        ML_compare2.loc[row_index1, 'ML Precission'] = precision_score(y_test, predicted)
        ML_compare2.loc[row_index1, 'ML Recall'] = recall_score(y_test, predicted)
        ML_compare2.loc[row_index1, 'ML AUC'] = auc(fp, tp)
        ML_compare2.loc[row_index1, 'ML f1'] = f1_score(y_test, predicted)
        ML_compare2.loc[row_index1, 'Imputation Name'] = MLA_name
        row_index1 += 1
    temp2 = df.copy()
    temp2_incomplete = X1_missing.copy()
    XGB = XGB_impute(temp2_incomplete, 5)
    XY_incomplete = np.array(temp2_incomplete)
    n_imputations = 5
    XY_completed = []
    ML_classifier = [
        XGBClassifier()]
    mylist1 = np.array([x for x in np.array(data[X1_missing.isna()]).flatten() if ~np.isnan(x)])
    mylist2 = np.array([x for x in np.array(df[X1_missing.isna()]).flatten() if ~np.isnan(x)])
    MLA_compare2.loc[row_index, 'MAE'] = (abs(mylist1-mylist2)).sum()/len(mylist1)
    MLA_compare2.loc[row_index, 'RMSE'] = math.sqrt(((mylist1-mylist2)**2).sum()/len(mylist1))
    row_index+=1
    x_train, x_test, y_train, y_test = train_test_split(data.drop(columns=target), y, test_size = 0.25, random
    ML_compare2.loc[row_index1, 'ML Train Accuracy'] = round(ml2.score(x_train, y_train), 4)
    ML_compare2.loc[row_index1, 'ML Test Accuracy'] = round(ml2.score(x_test, y_test), 4)
    ML_compare2.loc[row_index1, 'ML Precission'] = precision_score(y_test, pred_test)
    ML_compare2.loc[row_index1, 'ML Recall'] = recall_score(y_test, pred_test)
    ML_compare2.loc[row_index1, 'ML f1'] = f1_score(y_test, pred_test)
    row_index1 += 1
    return ML_compare2, MLA_compare2, miss_perc, miss_cols
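The evaluation pattern used in the function above (blank out a random subset of cells, impute them, then score only the blanked cells against the known values) can be sketched in isolation. This is a minimal sketch under stated assumptions, not the thesis code: the helper names mask_at_random, mean_impute and imputation_error are hypothetical, column-mean imputation stands in for the imputers listed in MLA, and only NumPy is used.

```python
import numpy as np

def mask_at_random(X, frac, rng):
    """Return a copy of X with roughly `frac` of its cells set to NaN, plus the boolean mask."""
    mask = rng.random(X.shape) < frac
    X_missing = X.astype(float).copy()
    X_missing[mask] = np.nan
    return X_missing, mask

def mean_impute(X_missing):
    """Stand-in imputer (assumption): fill each NaN with the mean of its column."""
    col_means = np.nanmean(X_missing, axis=0)
    filled = X_missing.copy()
    rows, cols = np.where(np.isnan(filled))
    filled[rows, cols] = col_means[cols]
    return filled

def imputation_error(X_true, X_imputed, mask):
    """MAE and RMSE computed only over the cells that were masked out."""
    diff = X_imputed[mask] - X_true[mask]
    mae = np.abs(diff).mean()
    rmse = np.sqrt((diff ** 2).mean())
    return mae, rmse

rng = np.random.default_rng(0)
X_true = rng.normal(size=(200, 4))          # toy complete data set
X_missing, mask = mask_at_random(X_true, 0.2, rng)
X_imputed = mean_impute(X_missing)
mae, rmse = imputation_error(X_true, X_imputed, mask)
print(f"MAE={mae:.4f} RMSE={rmse:.4f}")
```

Scoring only the masked cells keeps the untouched observations from diluting the error, which is why the listing indexes both frames with X1_missing.isna() before taking differences.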
def results_reg(df, target, skew = False, outlier = False):

    if skew == True:
        df = fix_skew(df)
    else: pass
    else: pass
    ML_regressor = [
        XGBRegressor()]
    cols = random.sample(list(X1_missing.columns)[:-1], round(random.randint(1, len(X1_missing.columns))/2))
    perclist = []
    mylist = []
    counter = 0
    for col in cols:
        counter += 1
    MLA_columns = []
    row_index = 0
    ML_columns = []
    row_index1 = 0
    for alg in MLA:
        row_index+=1
        pred_train = ml.predict(x_train)
        ML_compare2.loc[row_index1, 'ML Train MAE'] = round(mean_absolute_error(pred_train, y_train), 6)
        ML_compare2.loc[row_index1, 'ML Train RMSE'] = round(np.sqrt(mean_squared_error(pred_train, y_train)),
        ML_compare2.loc[row_index1, 'ML Test MAE'] = round(mean_absolute_error(predicted, y_test), 6)
        ML_compare2.loc[row_index1, 'ML Test RMSE'] = round(np.sqrt(mean_squared_error(predicted, y_test)), 6)
        row_index1 += 1
    XGB = XGB_impute(temp2_incomplete, 5)
    n_imputations = 5
    XY_completed = []
    ML_regressor = [
        XGBRegressor()]
    row_index+=1
    ML_compare2.loc[row_index1, 'ML Test MAE'] = round(mean_absolute_error(pred_test, y_test), 6)
    ML_compare2.loc[row_index1, 'ML Test RMSE'] = round(np.sqrt(mean_squared_error(pred_test, y_test)), 6)
    row_index1 += 1
    return ML_compare2, MLA_compare2
def results_multiclass(df, target, skew = False, outlier = False):

    if skew == True:
        df, skewed_feats = fix_skew(df)
    else: pass
    else: pass
    ML_classifier = [
        XGBClassifier()]
    perclist = []
    mylist = []
    counter = 0
    for col in cols:
        counter += 1
    MLA_columns = []
    row_index = 0
    ML_columns = []
    row_index1 = 0
    for alg in MLA:
        MLA_compare2.loc[row_index, 'MAE'] = (abs(mylist1-mylist2)).sum()/len(y)
        MLA_compare2.loc[row_index, 'RMSE'] = math.sqrt(((mylist1-mylist2)**2).sum()/len(y))
        row_index+=1
        # ML_compare2.loc[row_index1, 'ML Precission'] = precision_score(y_test, predicted)
        # ML_compare2.loc[row_index1, 'ML Recall'] = recall_score(y_test, predicted)
        # ML_compare2.loc[row_index1, 'ML AUC'] = auc(fp, tp)
        # ML_compare2.loc[row_index1, 'ML f1'] = f1_score(y_test, predicted)
        row_index1 += 1
    n_imputations = 5
    XY_completed = []
    ML_classifier = [
        XGBClassifier()]
    MLA_compare2.loc[row_index, 'MAE'] = (abs(mylist1-mylist2)).sum()/len(y)
    MLA_compare2.loc[row_index, 'RMSE'] = math.sqrt(((mylist1-mylist2)**2).sum()/len(y))
    row_index+=1
    # ML_compare2.loc[row_index1, 'ML Precission'] = precision_score(y_test, pred_test)
    # ML_compare2.loc[row_index1, 'ML Recall'] = recall_score(y_test, pred_test)
    # ML_compare2.loc[row_index1, 'ML AUC'] = auc(fp, tp)
    # ML_compare2.loc[row_index1, 'ML f1'] = f1_score(y_test, pred_test)
    row_index1 += 1
with open('bc_class.tex', 'w') as tf:
    tf.write(ML_compare2.to_latex())
b = pd.read_csv('file:///C:/Users/Jason Yingling/Downloads/mammographic_masses (1).data', header=None)

b.columns = ["BI-RADS", "Age", 'shape', 'margin', 'Density', 'Severity']
b = b.replace('?', np.nan)
a = b.isna().sum()/len(b)
b["Age"] = b["Age"].astype(float)
b["BI-RADS"] = b["BI-RADS"].astype(float)
b["Density"] = b["Density"].astype(float)
#res = results_class(b, 'Severity')
try1 = real_imputation_class(b, 'Severity')
with open('mammogram_pred.tex', 'w') as tf:

    tf.write(try1.to_latex())
def results_class2(df, target):
    ML_classifier = [
        #KNeighborsClassifier(),
        #RandomForestClassifier(),
        #LogisticRegression(),
        XGBClassifier()]
    perclist = []
    mylist = []
    counter = 0
    for col in cols:
        counter += 1
    MLA_columns = []
    row_index = 0
    ML_columns = []
    row_index1 = 0
    for alg in MLA:
        row_index+=1
        fp, tp, th = roc_curve(y_test, predicted)
        ML_compare2.loc[row_index1, 'ML Precission'] = precision_score(y_test, predicted)
        ML_compare2.loc[row_index1, 'ML Recall'] = recall_score(y_test, predicted)
        ML_compare2.loc[row_index1, 'ML f1'] = f1_score(y_test, predicted)
        row_index1 += 1
    XGB2 = XGB_impute2(temp2_incomplete, 5)
    XGB2.name = "XGB2"
    n_imputations = 5
    XY_completed = []
    ML_classifier = [
        XGBClassifier()]
    row_index+=1
    ML_compare2.loc[row_index1, 'ML Precission'] = precision_score(y_test, pred_test)
    ML_compare2.loc[row_index1, 'ML Recall'] = recall_score(y_test, pred_test)
    ML_compare2.loc[row_index1, 'ML f1'] = f1_score(y_test, pred_test)
    row_index1 += 1
temp, skew_cols = fix_skew(bc)
with open('skew_cols.tex', 'w') as tf:
    tf.write(skew_cols.to_latex())
df3, df4, skewed_feats = results_class(bc, 'Type', skew = True)
skew_res = df4[(df4["Imputation Name"]=="XGB") | (df4["Imputation Name"]=="XGB2")]
skew_pred = df3[(df3["Imputation Name"]=="XGB") | (df3["Imputation Name"]=="XGB2")]
with open('skew_imputation.tex', 'w') as tf:
    tf.write(skew_res.to_latex())
df1 = real_imputation_class(b, 'Severity', outlier = True)

df2 = real_imputation_class(b, 'Severity')
df3 = real_imputation_class(b, 'Severity', skew = True, outlier = True)
df4 = real_imputation_class(b, 'Severity', skew = True)
def fix_skew2(df):
    skewed_feats = temp[numeric_feats].apply(lambda x: skew(x.dropna())).sort_values(ascending=False)
    skewness = skewed_feats[abs(skewed_feats) > 1]
    skewed_features = skewness.index
    pt = PowerTransformer(method='yeo-johnson')
    for feat in skewed_features:
        temp[feat] = pt.fit_transform(np.array(temp[feat]).reshape(-1, 1))
    print(skewness.shape[0], "skewed numerical features have been transformed")
    skewed_feats2 = temp[numeric_feats].apply(lambda x: skew(x.dropna())).sort_values(ascending=False)
    return temp, skewed_feats, skewed_feats2
bc_skew, skew_feats, skew_feats2 = fix_skew2(bc)

skew_feats.columns = ['skew']
skew_feats2 = skew_feats[skew_feats>1]
skewed_data = skew_feats.index
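The selection rule in the listing above keeps features whose skewness exceeds 1 in absolute value. Assuming skew there is scipy.stats.skew with its default arguments, the statistic is the Fisher-Pearson coefficient g1 = m3 / m2^(3/2), where m2 and m3 are the second and third central moments. A minimal NumPy-only sketch of that statistic and of the threshold rule follows; the function name fisher_pearson_skew and the toy features are assumptions made for illustration.

```python
import numpy as np

def fisher_pearson_skew(x):
    """Biased Fisher-Pearson coefficient g1 = m3 / m2**1.5, ignoring NaNs."""
    x = np.asarray(x, dtype=float)
    x = x[~np.isnan(x)]
    centered = x - x.mean()
    m2 = np.mean(centered ** 2)   # second central moment
    m3 = np.mean(centered ** 3)   # third central moment
    return m3 / m2 ** 1.5

rng = np.random.default_rng(0)
features = {
    "symmetric": rng.normal(size=5000),
    "right_skewed": rng.exponential(size=5000),  # theoretical skewness is 2
}
skewness = {name: fisher_pearson_skew(col) for name, col in features.items()}

# Same threshold as the listing: flag features with |skewness| > 1.
to_transform = [name for name, g1 in skewness.items() if abs(g1) > 1]
print(to_transform)
```

Only the flagged features would then be passed to a power transform such as Yeo-Johnson, leaving roughly symmetric features unchanged.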
#############################################################################################
# Generated Data #
#############################################################################################
def results_reg_simulation(df, target, X1_missing, skew = False, outlier = False):
    if skew == True:
        df = fix_skew(df)
    else: pass
    else: pass
    ML_regressor = [
        XGBRegressor()]
    y = df[target]
    MLA_columns = []
    row_index = 0
    ML_columns = []
    row_index1 = 0
    for alg in MLA:
        row_index+=1
        row_index1 += 1
    XGB = XGB_impute(temp2_incomplete, 5)
    n_imputations = 5
    XY_completed = []
    ML_regressor = [
        XGBRegressor()]
    row_index+=1
    row_index1 += 1
X, y = make_regression(n_samples=10000, bias=.1, noise=1.5, effective_rank=7, n_features=12)
X = pd.DataFrame(X)
X.columns = X.columns.astype(str)
X['y'] = y
cols = random.sample(list(X.columns)[:-1], random.randint(1, len(X.columns)-2))

perclist = []
mylist = []
X_missing = X.copy()
counter = 0
for col in cols:
    X_missing.loc[mylist[counter], col] = np.nan
    counter += 1
miss_perc = X.isna().sum()/len(X)
df1, df2 = results_reg_simulation(X, 'y', X_missing)
X['y'] = y
mylist = []
counter = 0
for col in cols:
    counter += 1
X['y'] = y
mylist = []
counter = 0
for col in cols:
    counter += 1
with open('10k_pred.tex', 'w') as tf:
    tf.write(df1.to_latex())
with open('100_pred.tex', 'w') as tf:
with open('10k_imp.tex', 'w') as tf:
with open('100_imp.tex', 'w') as tf:

################################################
# Simulate Classification data #
################################################
X['y'] = y
cols = random.sample(list(X.columns)[:-1], random.randint(1, len(X.columns)-2))

perclist = []
mylist = []
counter = 0
for col in cols:
    counter += 1
miss_perc = X.isna().sum()/len(X)
X['y'] = y
mylist = []
counter = 0
for col in cols:
    counter += 1
X['y'] = y
mylist = []
counter = 0
for col in cols:
    counter += 1
with open('100_pred.tex', 'w') as tf:
with open('100_imp.tex', 'w') as tf:

#######################################################
# Calculate time to run OHE vs LE & compare results #
#######################################################
mel_data = pd.read_csv("melb_data.csv")
mel_data['Date'] = pd.to_datetime(mel_data['Date'])
mel_data['Year'] = mel_data['Date'].dt.year
mel_data['Month'] = mel_data['Date'].dt.month
mel_data['Month'] = mel_data['Month'].astype(object)
mel_data['Year'] = mel_data['Year'].astype(object)
mel_data['Postcode'] = mel_data['Postcode'].astype(object)
mel_data['Age'] = 2017 - mel_data['YearBuilt']
lat, bins = pd.qcut(mel_data["Lattitude"], 10, retbins=True, labels=False)

mel_data['latbin'] = lat
lon, bins = pd.qcut(mel_data["Longtitude"], 10, retbins=True, labels=False)
mel_data['lonbin'] = lon
abc_list = []
classes = 10
for i in range(97, 123):
    abc_list.append(str(chr(i)))
train_lon, lon_bins = pd.qcut(mel_data["Longtitude"], classes, retbins=True, labels=abc_list[0:classes])
train_lat, lat_bins = pd.qcut(mel_data["Lattitude"], classes, retbins=True, labels=abc_list[0:classes])
train_lon = train_lon.astype(object)
train_lat = train_lat.astype(object)
mel_data["grid"] = train_lon + train_lat
le = preprocessing.LabelEncoder()
le.fit(mel_data["grid"])
mel_data["grid"] = le.transform(mel_data["grid"])
mel_data['Large'] = mel_data['Bedroom2'].apply(lambda x: 1 if x > 4 else 0)
mel_data = mel_data.drop(columns=['Address', 'Longtitude', 'Lattitude', 'Date', 'YearBuilt'])
from timeit import default_timer as timer

df = mel_data.copy()
target = 'Price'
# The builtin object replaces np.object, which newer NumPy releases no longer provide
cat_features = df.columns[np.where(df.dtypes == object)[0]]
df = label(df, cat_features)
temp2_incomplete = mel_data.copy()
start1 = timer()
end1 = timer()
time1 = end1 - start1
start2 = timer()
XGB2 = XGB_impute2(temp2_incomplete, 5)
end2 = timer()
time2 = end2 - start2
XGB2.name = "XGB2"
impute_data = [XGB, XGB2]
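The timing comparison above brackets each imputation call between two timer() readings. The same pattern can be wrapped in a small reusable helper, sketched here under stated assumptions: time_call is a hypothetical name, and the two lambdas are placeholder workloads standing in for the label-encoded and one-hot-encoded imputation calls, not the thesis functions.

```python
from timeit import default_timer as timer

def time_call(func, *args, **kwargs):
    """Run func once and return (result, elapsed seconds)."""
    start = timer()
    result = func(*args, **kwargs)
    elapsed = timer() - start
    return result, elapsed

# Placeholder workloads (assumptions), not the thesis imputers.
result_a, time_a = time_call(lambda n: sum(range(n)), 100_000)
result_b, time_b = time_call(lambda n: sum(i * i for i in range(n)), 100_000)
print(f"first: {time_a:.6f}s, second: {time_b:.6f}s")
```

A single run is sensitive to background load, so repeating each call several times and comparing the minimum or median would give a steadier comparison.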
ML_classifier = [
    LogisticRegression()]
MLA_columns = []
row_index = 0
ML_columns = []
row_index1 = 0
for alg in MLA:
    X1_incomplete = np.array(df.copy())
    row_index+=1
x_train, x_test, y_train, y_test = train_test_split(data.drop(columns='Price'), mel_data['Price'], test_size
ML_compare2.loc[row_index1, 'ML Train RMSE'] = round(np.sqrt(mean_squared_error(pred_train, y_train)), 6)
row_index1 += 1
x_train, x_test, y_train, y_test = train_test_split(data.drop(columns='Price'), mel_data['Price'], test_size =
row_index1 += 1
