Investigation and Comparison Missing Data Imputation Methods
The thesis of Jason Yingling entitled Multiple Imputation of Missing Data via
Gradient Boosted Trees has been reviewed and approved by the thesis committee and
all departmental, college, and university policies and procedures have been followed.
by
Jason Yingling
Master of Science
in
Applied Mathematics
Conway, Arkansas
Aug 2019
TO THE OFFICE OF GRADUATE STUDIES:
Danny Arrigo
Yinlin Dong
PERMISSION
Department: Mathematics
Jason Yingling
Date
© 2019 Jason Yingling
ACKNOWLEDGEMENTS
ABSTRACT
TABLE OF CONTENTS
ACKNOWLEDGEMENTS
ABSTRACT
LIST OF TABLES
LIST OF FIGURES
1 Missing Data
1.1 Types of Missing Data
1.2 Data Modeling
1.2.1 Parametric Methods
1.2.2 Non-parametric Methods
2 XGBoost Imputation
2.0.1 XGBoost
2.0.2 Decision Trees
2.0.3 Extreme Gradient Boosting
2.1 XGBoost Imputation
3.3 Breast Cancer Data Set (Classification)
4 Predictive Ability
4.1 Results
4.1.1 Melbourne Housing Data
4.1.2 Boston Housing Data
4.1.3 Breast Cancer Data
5 Simulation Study
5.0.1 Imputation Results
6 Data Preprocessing
6.1 Skew
6.1.1 Data Transformation
7 Conclusion
BIBLIOGRAPHY
APPENDIX
B Code
LIST OF TABLES
LIST OF FIGURES
CHAPTER 1
Missing Data
Missing data has been an ever-present problem for statistics and other related fields.
It is common for individual values, or entire records, to be missing from a data
set. Analyzing a data set with missing values can lead to biased results [14]. Such
bias can be detrimental in predictive tasks such as predicting whether a patient has
cancer, or a self-driving car predicting whether it is safe to merge into a new lane.
In his seminal work, Rubin identified three mechanisms of missing data [20]:
Missing Completely at Random, Missing at Random, and Missing Not at Random.
The proper method for handling missing data requires knowledge about the reason
for the data to be missing. Regarding the three types of missing data:
Missing Completely at Random (MCAR) values are values whose missingness does not
depend on any data. In other words, there is no relationship between the missingness
and any values, observed or missing. The MCAR assumption is generally not
realistic in real-world situations. However, it may hold when the data is missing
due to random data entry or transcription errors.
Under Missing at Random (MAR), there is a systematic relationship between the
missingness and the observed data, but not the missing data. Under MAR, all
of the available information about the missing data can be found within the data set
but may require statistical analysis to observe. For example, if men are more likely
to tell you their age than women, age is MAR.
In Missing Not at Random (MNAR), the missingness depends on unobserved
data. Under this condition, the unobserved data can be either the missing value
itself or some other unknown variable that is not in the data set [20]. This thesis will
focus on imputing missing data via modeling.
\[ Y = f(X) + \epsilon. \]
The function f is some unknown function that operates on the X variables and
$\epsilon$ is a random error term which has mean zero.
Suppose we are attempting to create a model of the Dallas housing market, where each
individual house price is y_i and the market itself is Y. Say we know the square
footage and lot size of each house, the x-values. We generally know that there is some
function f that connects the input variables to the house's value, but the actual form
of f is unknown. In order to predict the value of houses given data, we have to
estimate the function f based on the observed data available to us.
Ŷ = fˆ(X),
where Ŷ is our prediction for Y and fˆ is our estimate of the governing function
f . In most cases, one is not explicitly interested in the form of fˆ so long as the
predictions are accurate. Most models take one of two approaches to estimating
f : parametric or non-parametric.
1.2.1 Parametric Methods
Parametric methods typically follow two steps in order to model data. The first step
is to make an assumption about the form of f. Returning to our Dallas example, if we
let square footage be X_1 and lot size be X_2, a very simple model for f could be:
\[ f(X) = \beta_0 + \beta_1 X_1 + \beta_2 X_2. \]
The primary assumptions we made are that f is linear and only consists of the two
input variables. Now that we have the form of f, all we need to do is estimate the
coefficients \beta_0, \beta_1, and \beta_2.
The second step is to use the data to "train" the model. In other words, we need
to find the parameters such that
\[ Y \approx \beta_0 + \beta_1 X_1 + \beta_2 X_2. \]
The primary benefit of using a parametric form for f is that it simplifies the
problem to estimating parameters of the specified model. However, this approach
has a large disadvantage in that the form of the model may not match the actual
unknown form f .
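As a minimal sketch of the two parametric steps, the following fits the linear form above by ordinary least squares. The data is synthetic and every number in it (coefficients, ranges, noise level) is a hypothetical value chosen only for illustration, not a figure from the Dallas market.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Hypothetical inputs: X1 = square footage, X2 = lot size.
sq_ft = rng.uniform(800, 4000, size=n)
lot_size = rng.uniform(2000, 12000, size=n)

# Step 1: assume f is linear. These "true" coefficients are made up.
true_beta = np.array([50_000.0, 120.0, 8.0])
noise = rng.normal(0.0, 5_000.0, size=n)  # mean-zero error term
price = true_beta[0] + true_beta[1] * sq_ft + true_beta[2] * lot_size + noise

# Step 2: "train" the model by estimating beta_0, beta_1, beta_2.
design = np.column_stack([np.ones(n), sq_ft, lot_size])
beta_hat, *_ = np.linalg.lstsq(design, price, rcond=None)

predicted = design @ beta_hat
print("estimated coefficients:", beta_hat)
```

If the true relationship were not linear, the same code would still return coefficients, which is exactly the disadvantage described above: the fit is only as good as the assumed form.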
not reducing the problem to simply estimating parameters, the computation time and
number of observations required to accurately estimate f increase [8].
The broader ability of non-parametric methods to accurately fit different types
of data is why we chose to use XGBoost, a non-parametric model, as our basis for
imputation in this thesis.
CHAPTER 2
XGBoost Imputation
2.0.1 XGBoost
where L is the loss function and R is the regularization term. The loss measures
how predictive the model is with regard to the data. The regularization term helps
prevent a model from learning the training set so closely that it fails to generalize
to data it has not seen before, a problem known as over-fitting.
The bias-variance trade-off is a common term used to describe the trade-off between
two sources of error. A model that has high variance and low bias tends to be a
complex model with low predictive error on the training data but high error on the
testing data. In contrast, a model with high bias and low variance tends to be a
simple model with high error on both the training and test data sets, which is
known as under-fitting [11].
2.0.2 Decision Trees
Classification and Regression Tree (CART) is named for the two tasks of decision
trees: classification and regression. CART is a binary decision tree where each node
can be viewed as a question that is based on the data. The node is split based on the
answer, and this process is continued until there are no more splits and each terminal
node, or leaf, of the tree contains an outcome used for prediction [11]. The predicted
value of y_i is as follows:
\[ \hat{y}_i = \sum_{j}^{N} f_j(x_i) \]
The objective function is:
\[ \mathrm{obj}(w) = \sum_{i}^{n} L(\hat{y}_i, y_i) + \sum_{j}^{N} R(f_j). \]
by averaging the predictions of many decision trees that have high variance and low
bias. This can be done by partitioning the data set and having different trees trained
on the different partitions [8].
Boosting refers to a technique that turns weak trees into stronger trees. Boosting
is conceptually different from bagging. In bagging, each model is run independently
on a different partition of the data and the outputs are then aggregated. Boosting
utilizes a sequential model structure such that each subsequent estimator attempts to
correct the error of the previous estimator [11]. This process is repeated for each
tree added to the model.
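The sequential idea can be sketched in a few lines. This is plain gradient boosting for squared error, where each new tree is fit to the residuals of the current prediction; it is a simplified stand-in and not XGBoost's regularized procedure. The function name `boost`, its parameters, and the synthetic data are assumptions made for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost(X, y, n_rounds=50, learning_rate=0.1, max_depth=2):
    """Fit shallow trees one after another, each on the current residuals."""
    prediction = np.zeros(len(y))  # start from a constant (zero) prediction
    trees, errors = [], []
    for _ in range(n_rounds):
        residual = y - prediction              # error left by earlier trees
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residual)                  # new tree tries to correct it
        prediction = prediction + learning_rate * tree.predict(X)
        trees.append(tree)
        errors.append(float(np.mean((y - prediction) ** 2)))
    return trees, errors

# Synthetic example data (hypothetical).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=300)

trees, errors = boost(X, y)
print("training MSE after first and last round:", errors[0], errors[-1])
```

In bagging, by contrast, the loop body would not depend on `prediction` at all; each tree would be fit to `y` on its own sample and the outputs averaged.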
\[ R(f) = \gamma T + \frac{1}{2} \lambda \|w\|^2 \]
where T is the number of leaves, w is the vector of leaf scores with squared L2 norm
\|w\|^2, and \gamma and \lambda are regularization coefficients.
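XGBoost is trained by optimizing one tree at a time in an additive manner,
starting with a constant function and adding a new function to the previous prediction
at each round:
\[ \hat{y}_i^{(0)} = 0, \qquad \hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i) \]
The decision of which new f to add each round is made by optimizing the objective
function at round t:
\begin{align*}
\mathrm{obj}^{(t)} &= \sum_{i} L(\hat{y}_i^{(t)}, y_i) + \sum_{n=1}^{t} R(f_n) \\
&= \sum_{i} L(\hat{y}_i^{(t-1)} + f_t(x_i), y_i) + R(f_t) + \text{constant}
\end{align*}
where the constant collects the regularization terms of the trees fixed in earlier rounds.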
The objective is simplified further using a second-order approximation, which
yields a scoring function. The score is used to determine the quality of a tree's
structure. Therefore XGBoost is able to effectively utilize the ability of decision
trees to learn the data while limiting the drawback of overfitting the training set
[22].
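For reference, the second-order approximation is usually written as follows in the XGBoost literature. The symbols g_i, h_i, G_j, H_j and I_j do not appear in this thesis and are introduced here only for illustration; I_j denotes the set of observations that fall into leaf j, and the regularization term is assumed to be R(f) = \gamma T + \frac{1}{2}\lambda\|w\|^2.

```latex
\begin{align*}
\mathrm{obj}^{(t)} &\approx \sum_{i} \left[ L(\hat{y}_i^{(t-1)}, y_i)
    + g_i f_t(x_i) + \tfrac{1}{2} h_i f_t(x_i)^2 \right] + R(f_t), \\
g_i &= \frac{\partial L(\hat{y}_i^{(t-1)}, y_i)}{\partial \hat{y}_i^{(t-1)}}, \qquad
h_i = \frac{\partial^2 L(\hat{y}_i^{(t-1)}, y_i)}{\partial (\hat{y}_i^{(t-1)})^2}, \\
G_j &= \sum_{i \in I_j} g_i, \qquad H_j = \sum_{i \in I_j} h_i, \\
w_j^{*} &= -\frac{G_j}{H_j + \lambda}, \qquad
\mathrm{obj}^{*} = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T .
\end{align*}
```

The last expression is the structure score: for a fixed tree shape it gives the best achievable objective, so candidate splits can be compared by how much they lower it.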
XGBoost uses a concept called shrinkage, described by Friedman [12], to help prevent
overfitting. Shrinkage scales newly added weights by a factor η, also known as
the learning rate. To further assist with preventing XGBoost from overfitting the
data, Chen included row and column sub-sampling [22]. When growing trees, a sub-sample
of rows or columns can be chosen instead of the whole data set to further aid in
preventing over-fitting, another feature that made XGBoost ideal for our imputation
method.
XGBoost is a decision tree based model that does not require the assumption of
normality [22]. Therefore, it is possible to draw from the insights of Multiple
Imputation by Chained Equations (MICE), described in further detail in Chapter
3, and the benefit of XGBoost to create a new imputation technique. We wanted to
create a multiple imputation technique that would be able to handle any given data
set, without requiring any assumptions on the structure of the data or type of data
within the data set.
The steps we used for our imputation method are outlined below:
1. The mean is imputed for each of the numerical variables. These values are
considered placeholders for further computation.
3. Choose one variable from the list of variables with missing values and remove
the imputed means.
4. Create a training set using only the original values from the chosen
variable.
5. Fit an XGBoost model with the chosen variable as the dependent variable.
7. The process is repeated for each of the variables with missing data.
8. Steps 5-6 are repeated k times.
Therefore, if there are n variables, we will need to create n XGBoost models that
need to be trained - one for each variable that has missing data. We also repeat the
process k times, providing us a total of nk models that must be trained.
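The procedure can be sketched roughly as below. Several things here are assumptions rather than the thesis implementation: scikit-learn's `GradientBoostingRegressor` is swapped in for XGBoost so the example has fewer dependencies (`xgboost.XGBRegressor`, which also accepts a `subsample` argument, could be used in its place), only numerical columns are handled, the steps not listed above are filled in by predicting the missing entries with the fitted model, and each of the k rounds is stored as one completed data set. The function name and arguments are hypothetical.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def chained_boosted_imputation(X, k=3, subsample=0.8, n_estimators=50, seed=0):
    """Return k completed copies of X (NaN marks a missing value)."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    cols = [j for j in range(X.shape[1]) if missing[:, j].any()]

    # Placeholder step: fill each missing entry with its column mean.
    filled = X.copy()
    col_means = np.nanmean(X, axis=0)
    for j in cols:
        filled[missing[:, j], j] = col_means[j]

    completed = []
    for round_idx in range(k):
        for j in cols:
            observed = ~missing[:, j]
            # Inputs: every other column, using current placeholders/predictions.
            others = np.delete(filled, j, axis=1)
            model = GradientBoostingRegressor(
                n_estimators=n_estimators,
                subsample=subsample,           # row subsampling for diversity
                random_state=seed + round_idx,
            )
            # Train only on rows where the chosen variable was really observed.
            model.fit(others[observed], X[observed, j])
            # Assumed step: replace the placeholders with model predictions.
            filled[~observed, j] = model.predict(others[~observed])
        completed.append(filled.copy())
    return completed

# Small synthetic demonstration (hypothetical data).
rng = np.random.default_rng(1)
X_full = rng.normal(size=(60, 3))
X_full[:, 2] = X_full[:, 0] + 0.5 * X_full[:, 1]
X_miss = X_full.copy()
X_miss[rng.choice(60, 10, replace=False), 2] = np.nan
X_miss[rng.choice(60, 8, replace=False), 0] = np.nan

imputations = chained_boosted_imputation(X_miss, k=2)
print(len(imputations), "completed data sets of shape", imputations[0].shape)
```

With two columns containing missing values and k = 2, the sketch trains 2 x 2 = 4 models, matching the nk count above.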
XGBoost has many parameters that can be tuned to increase the predictive accuracy
of the model. However, since we are creating a model for each variable with missing
values, we cannot know in advance which parameters are best for each variable, and
finding the optimal parameters would be a cumbersome task, especially if the data
set has many variables that contain missing values. For this reason, we use the
default settings for XGBoost other than the utilization of row subsampling.
Subsampling will provide model diversity between the different iterations of
XGBoost. This will give variability in the imputations and enables us to obtain the
benefits of Multiple Imputation. If we did not include subsampling, there is a chance
that each imputation round would predict exactly the same values, making the method
equivalent to single imputation.
There are several items in the process that could potentially alter the predictive
ability of the imputation technique. The first is the initialization step. The easiest
methods of initializing the data are to take the mean or the median of each column.
The second is the order in which the columns are chosen. The order can impact the
regression because of a limitation of the imputation technique. If we have a data set
that has two variables that contain missing data, the model for the first variable
imputed will include the mean imputation of the second column as input. However, the
model for the second column will have the first model's predictions for the first
column's missing values as input, which will be more accurate than the mean
imputation. This problem has the potential to be more pronounced for data sets with
many columns that contain missing values.
Finally there is the question of how to handle categorical variables. There are
two primary techniques to transform categorical variables into something that is
numerically tractable: Label Encoding and One-Hot Encoding. Label Encoding
consists of assigning each category of a given categorical variable a numerical value.
For example, you can assign Monday the value 1 and Sunday the value 7. This appears
to be a logical approach; however, we have now given numerical comparisons to
categories. In our example, Sunday is numerically 7 times larger than Monday, which
doesn't make sense. Further, simply changing how we decide our ordering can
potentially drastically change the results.
The second method is called One-Hot Encoding. For One-Hot Encoding, we
simply create a new variable for each category and assign a 1 for having the category
and a 0 for not having the category. Therefore, One-Hot Encoding does not give any
arbitrary weight to each category. However, we need to keep in mind the
number of features being used. For small data sets, the inclusion of one-hot encoded
variables can make the problem high dimensional; in other words, the number of
features can exceed the number of observations. For large data sets, the additional
features will make the computation time significantly longer due to the increased
complexity and the amount of data that must be held in memory.
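The difference between the two encodings can be seen on the day-of-week example. The sketch below uses only the Python standard library; the variable names and the four sample observations are made up for illustration.

```python
days = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]
observations = ["Monday", "Sunday", "Wednesday", "Monday"]

# Label Encoding: one integer per category (Monday -> 1, ..., Sunday -> 7).
label_map = {day: index + 1 for index, day in enumerate(days)}
label_encoded = [label_map[day] for day in observations]

# One-Hot Encoding: one 0/1 column per category.
one_hot_encoded = [
    [1 if day == category else 0 for category in days]
    for day in observations
]

print(label_encoded)
for row in one_hot_encoded:
    print(row)
```

Label Encoding keeps a single column but imposes an ordering, while One-Hot Encoding turns one column into seven, which is how a handful of categorical features can grow into hundreds of columns.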
To test the difference between Label Encoding and One Hot Encoding, we ran our
imputation method on the Melbourne data. The Melbourne data is an ideal data set
for testing One Hot Encoding against Label Encoding because, after we went through
the feature engineering process described in Appendix B, we ended up with a total of
23 features and 10 categorical features. Once we applied One Hot Encoding, the total
number of features grew to 951.
After running our imputation comparisons, described further in Chapter 4, we
found that imputation with Label Encoding of the categorical variables took 38 seconds
to run, while One Hot Encoding took 16 minutes, on an i7 processor with 16 GB of RAM.
Although One Hot Encoding is more theoretically sound and potentially more accurate
than Label Encoding, we chose to proceed with Label Encoding throughout the thesis due
to the substantial difference in computation speed.
CHAPTER 3
Simulate Missing Data
For the first step of this study, we wanted to assess the accuracy of the XGB
Imputation method and compare it to standard techniques. We chose two standard
data sets covering classification and regression tasks: the breast cancer data set and
the Boston housing market data set, respectively.
Since these data sets do not have missing values, we have to simulate missing data
by randomly removing data points. The benefit of using a full data set is that we are
able to test how far off the imputation methods are from the actual data values. In
other words, we are able to obtain a sense of accuracy from the imputation methods.
In order to create the missing data for each data set, we conduct the following steps:
1. Select a random number of columns.
2. For each column, draw a random number between 0 and 0.5.
3. Remove from each column the fraction of its values given by the random
number drawn for that column.
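The three steps above can be sketched with NumPy as follows. The function name `simulate_missing`, its arguments, and the choice to remove rows uniformly at random within a column are assumptions made for illustration.

```python
import numpy as np

def simulate_missing(X, max_fraction=0.5, seed=0):
    """Return a copy of X with NaNs inserted, plus the fraction removed per column."""
    rng = np.random.default_rng(seed)
    X = np.array(X, dtype=float)  # copy, so the complete data is left untouched
    n_rows, n_cols = X.shape

    # Step 1: select a random number of columns.
    n_selected = int(rng.integers(1, n_cols + 1))
    selected = rng.choice(n_cols, size=n_selected, replace=False)

    fractions = {}
    for j in selected:
        # Step 2: draw a random fraction between 0 and max_fraction.
        fraction = rng.uniform(0.0, max_fraction)
        # Step 3: remove that share of the column's values at random.
        n_remove = int(round(fraction * n_rows))
        rows = rng.choice(n_rows, size=n_remove, replace=False)
        X[rows, j] = np.nan
        fractions[int(j)] = n_remove / n_rows
    return X, fractions

# Hypothetical complete data set with 50 rows and 4 columns.
X_full = np.arange(200, dtype=float).reshape(50, 4)
X_miss, fractions = simulate_missing(X_full, seed=1)
print(fractions)
```

Because the complete matrix is kept, the imputed values can later be compared directly against the values that were removed.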
3.1.1 Mean
Mean imputation is the most common method of imputing data. This method is
easy to use and understand since it simply takes the average of each column with
missing data and replaces each missing value with its column's average.
One large downside to this method is the fact that the mean is sensitive to outliers.
In addition, because every missing value in a column receives the same value, the
standard deviation and variance are underestimated. More importantly, the magnitude
of covariance and correlation also decreases when the variability decreases, resulting
in biased estimates [8].
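The shrinkage in variance and correlation can be checked on synthetic data. In this sketch the two variables, the 30% missing rate, and all names are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=50.0, scale=10.0, size=1000)
y = 2.0 * x + rng.normal(scale=5.0, size=1000)  # y is strongly related to x

# Remove about 30% of x completely at random, then impute the column mean.
missing = rng.random(1000) < 0.3
x_imputed = np.where(missing, x[~missing].mean(), x)

var_observed = x[~missing].var(ddof=1)
var_imputed = x_imputed.var(ddof=1)
corr_complete = np.corrcoef(x, y)[0, 1]
corr_imputed = np.corrcoef(x_imputed, y)[0, 1]

print("variance of observed values:", var_observed)
print("variance after mean imputation:", var_imputed)
print("correlation with y, complete vs imputed:", corr_complete, corr_imputed)
```

Every imputed point sits exactly at the mean, so it adds nothing to the sum of squared deviations while still increasing the sample size, which is why the variance estimate drops.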
3.1.2 KNN
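The nearest-neighbor method utilizes the observations in the training set that are the
most "similar" to a given observation. The simplest form of k-NN can be defined as:
\[ \hat{Y}(x) = \frac{1}{k} \sum_{x_i \in N_k(x)} y_i \]
where N_k(x) is the set of the k training observations closest to x.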
During the MICE procedure, a series of regression models are run such
that each variable with missing data is modeled conditional on the other variables
in the data. Therefore, each variable can be modeled according to its distribution.
One of the benefits of MICE is that it creates multiple imputations to help account for
the statistical uncertainty in the imputation process.
The MICE process is as follows:
• The mean is imputed for each of the variables. These values are considered
placeholders for further computation.
• Compute the regression using the variable with missing values as the dependent
variable.
• The missing values are filled in with the predictions from the linear regression
model.
• The process is repeated for each variable with missing data, which when
completed is considered one cycle.
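The regression model for predicting the missing values that MICE is built upon
is:
\[ \hat{y}_i = \beta_0 + \beta_i X_i \]
With random error added, the model used by MICE is as follows:
\[ \hat{y}_i = \beta_0 + \beta_i X_i + \delta \epsilon \]
where \delta is the root mean square error (RMSE) and \epsilon is a random error from
the standard normal distribution.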
3.1.4 SVD
\[ Z = U D V^T \]
where U and V are two orthogonal matrices of size m × z and n × z, respectively, and D
is a diagonal matrix of size z × z having all singular values of matrix Z as its
diagonal entries [13]. The matrices obtained by performing SVD can be written as
\[ Z = Q P^T \]
where Q = U and P^T = D V^T, so that each entry is estimated as
\[ \hat{z}_{rc} = q_r^T p_c. \]
Since we know that there will be missing data, we need to modify SVD to ignore
the missing data in the decomposition.
We can find q_r and p_c using an iterative process that minimizes the squared
difference between their dot product and the known data values:
\[ \min \sum_{(r,c) \in K} (z_{rc} - q_r^T p_c)^2. \]
Here, r and c represent a given row and column of the data set, and K is the set of
entries with known values.
To help with over-fitting, we can add the squared magnitudes of the row and column
factors:
\[ \sum_{(r,c) \in K} (z_{rc} - \hat{z}_{rc})^2 + \lambda \left( \|q_r\|^2 + \|p_c\|^2 \right) \]
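A minimal sketch of fitting the row and column factors on the known entries only is given below. It uses plain gradient descent, applies the penalty once to all of Q and P rather than once per observed entry, and is not the SoftImpute algorithm of Mazumder et al.; the function name, rank, step size, and iteration count are assumptions chosen for a small example.

```python
import numpy as np

def masked_factorization(Z, rank=2, lam=0.1, lr=0.01, n_iter=2000, seed=0):
    """Fit Z ~ Q @ P.T using only the observed (non-NaN) entries of Z."""
    rng = np.random.default_rng(seed)
    observed = ~np.isnan(Z)
    Z0 = np.where(observed, Z, 0.0)
    m, n = Z.shape
    Q = 0.1 * rng.standard_normal((m, rank))
    P = 0.1 * rng.standard_normal((n, rank))
    for _ in range(n_iter):
        # Residuals on known entries; missing entries contribute nothing.
        resid = np.where(observed, Z0 - Q @ P.T, 0.0)
        grad_Q = -2.0 * resid @ P + 2.0 * lam * Q
        grad_P = -2.0 * resid.T @ Q + 2.0 * lam * P
        Q -= lr * grad_Q
        P -= lr * grad_P
    return Q, P

# Hypothetical rank-one matrix with two entries removed.
a = np.array([1.0, 2.0, 1.5, 0.5])
b = np.array([1.0, 0.5, 2.0])
Z_true = np.outer(a, b)
Z = Z_true.copy()
Z[0, 2] = np.nan
Z[3, 0] = np.nan

Q, P = masked_factorization(Z)
Z_hat = Q @ P.T  # entries at the NaN positions serve as the imputed values
print(np.round(Z_hat, 2))
```

The product Q @ P.T is defined for every cell, so the same fitted factors that reproduce the known values also supply estimates for the missing ones.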
Mazumder et al. [7] extended SVD by creating thresholds that can be learned to
assist with data imputation. The technique consists of three steps:
• Reconstruct S_\lambda(Z) = U D^{*} V^T.
Evaluation Criteria
\[ \mathrm{RMSE} = \sqrt{ \frac{ \sum_{i=1}^{n} (X_i^{obs} - X_i^{imputed})^2 }{ n } } \]
\[ \mathrm{MAE} = \frac{ \sum_{i=1}^{n} |X_i^{obs} - X_i^{imputed}| }{ n } \]
Another way to see the difference between MAE and RMSE is to look at how they
are related to each other:
\[ \frac{ \sum_{i=1}^{n} |X_i^{obs} - X_i^{imputed}| }{ n }
   \le \sqrt{ \frac{ \sum_{i=1}^{n} (X_i^{obs} - X_i^{imputed})^2 }{ n } }
   \le \frac{ \sum_{i=1}^{n} |X_i^{obs} - X_i^{imputed}| }{ n } \cdot \sqrt{n}, \]
that is, MAE ≤ RMSE ≤ MAE · √n. This inequality shows that the gap between RMSE
and MAE can grow with the sample size, so RMSE suffers when comparing results between
varying sample sizes. Since we will be evaluating the different models on varying
amounts of missing data, strictly using RMSE may provide inconclusive results.
However, we want to know if a model is making imputations that are far away from
their target values, which gives us justification to use both RMSE and MAE.
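As a small worked example of the two criteria and the bound between them, consider four removed values and their imputations. The numbers and function names here are made up for illustration.

```python
import numpy as np

def rmse(observed, imputed):
    diff = np.asarray(observed, dtype=float) - np.asarray(imputed, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

def mae(observed, imputed):
    diff = np.asarray(observed, dtype=float) - np.asarray(imputed, dtype=float)
    return float(np.mean(np.abs(diff)))

observed = np.array([3.0, 5.0, 2.0, 8.0])
imputed = np.array([2.5, 5.0, 4.0, 4.0])   # errors: 0.5, 0, -2, 4

mae_value = mae(observed, imputed)          # (0.5 + 0 + 2 + 4) / 4 = 1.625
rmse_value = rmse(observed, imputed)        # sqrt(20.25 / 4) = 2.25
upper_bound = mae_value * np.sqrt(len(observed))  # 1.625 * 2 = 3.25

print(mae_value, rmse_value, upper_bound)
```

The single error of 4 moves RMSE well above MAE, which is the behaviour that makes RMSE useful for spotting imputations that are far from their targets.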
The Boston Housing Market data [6] was created by Harrison and Rubinfeld to
investigate the demand for clean air. The data set consists of 14 variables that the
authors thought would help predict the price of houses in Boston.
1. CRIM: per capita crime rate by town
2. ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
3. INDUS: Proportion of non-retail business acres per town
4. CHAS: Charles River dummy variable
5. NOX: nitric oxides concentration (parts per 10 million)
6. RM: average number of rooms per dwelling
7. AGE: proportion of owner-occupied units built prior to 1940
8. DIS: weighted distance to five Boston employment centres
9. RAD: index of accessibility to radial highways
10. TAX: full-value property-tax rate per $10,000
11. PTRATIO: pupil-teacher ratio by town
12. B: 1000(Bk − 0.63)^2 where Bk is the proportion of African Americans by town
13. LSTAT: % lower status of the population
14. MEDV: Median value of owner-occupied homes in $1000’s
Table 3.1: Percent Data Missing for Boston Housing Market Data
Table 3.1 shows the random columns selected and the percent missing for the
respective column. All four of the columns that were selected were numerical.
Table 3.2 shows that XGB is significantly more accurate in both MAE and RMSE
than the other imputation methods for the Boston Housing Data. For the Boston
data set, the degree of separation from the other imputation methods suggests that
XGB should be the best imputation method for all types of regression models.
Further, we can see Mean struggles to accurately impute data and performs worse
than simply taking the mean and mode of the data.
The Wisconsin Breast Cancer data set is constructed from digitized images of fine
needle aspirates (FNA) of breast masses. The features describe distinct characteristics
of the cell nuclei present in the image. Ten features are computed for each cell
nucleus. All of the variables are numerical.
1. radius (mean of distances from center to points on the perimeter)
2. texture (standard deviation of gray-scale values)
3. perimeter
4. area
5. smoothness (local variation in radius lengths)
6. compactness (perimeter^2 / area - 1.0)
7. concavity (severity of concave portions of the contour)
8. concave points (number of concave portions of the contour)
9. symmetry
10. fractal dimension ("coastline approximation" - 1)
For each feature, the mean, standard error and "largest" of these features were
computed for each image, creating 30 total features.
Table 3.3 shows that we randomly selected 23 of the 30 variables, providing us with
a data set in which 77% of the variables contain missing data. This simulation
will strain the ability of model-based methods because the amount of data missing
from the data set will make it harder to learn the patterns within the data.
Variable Name              Percent Missing
mean radius                36.90
mean texture               29.52
mean perimeter             23.37
mean area                  23.02
mean smoothness             9.49
mean concavity             24.07
mean concave points        19.50
mean fractal dimension      6.32
radius error               26.18
texture error              28.82
area error                 12.82
smoothness error            5.79
compactness error           0.87
concavity error            43.76
symmetry error             43.93
fractal dimension error     8.78
worst texture              34.27
worst perimeter            30.93
worst area                 24.78
worst smoothness           25.30
worst concavity            21.79
worst symmetry             41.12
worst fractal dimension    34.62
Table 3.4 shows that XGBoost has the lowest MAE. For RMSE, however, we see that MICE
is slightly better than XGBoost. This means that overall, XGBoost is more accurate in
its imputations but has some values that are far off. Because XGBoost has higher
overall accuracy but worse RMSE, we suspect that XGBoost will perform better for
models that are more resilient to outliers, such as Random Forest and kNN, and that
MICE will have better performance with Logistic Regression.
CHAPTER 4
Predictive Ability
Machine Learning models are constructed in vastly different ways [11]. The different
construction and prediction methods can cause a large amount of variability between
different models applied to the same data set. Therefore, it is possible to have a
scenario where one imputation method can provide the best results for one supervised
model and at the same time be the worst imputation method for a different supervised
model.
We chose to use generic supervised models of vastly different construction to test
the impact of applying different imputation techniques. We wanted to see if the type
of supervised model had an affinity for similar imputation types such as linear and
logistic regression for MICE. For this section, the models we chose are:
1. KNN
2. Random Forest
3. Linear Regression (Regression)
4. Logistic Regression (Classification)
KNN
The nearest-neighbor method utilizes the observations in the training set that are the
most "similar" to a given observation. The simplest form of k-NN can be defined as:
\[ \hat{Y}(x) = \frac{1}{k} \sum_{x_i \in N_k(x)} y_i \]
where N_k(x) is the set of the k training observations closest to x.
For classification tasks, k-NN identifies the k neighbors and then takes a vote of
the class of the neighbors. The classification that has the most votes becomes the
prediction for the given observation [11].
For regression tasks, k-NN calculates an inverse distance weighted average with
the k nearest neighbors to obtain the predicted value [8].
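The two prediction rules can be sketched as follows. The regression function implements the simple unweighted average from the formula, not the inverse distance weighted variant, and Euclidean distance is an assumption; all names and the tiny data set are hypothetical.

```python
import numpy as np
from collections import Counter

def k_nearest(X_train, x, k):
    """Indices of the k training points closest to x (Euclidean distance)."""
    distances = np.linalg.norm(X_train - x, axis=1)
    return np.argsort(distances)[:k]

def knn_regress(X_train, y_train, x, k=3):
    """Unweighted average of the k nearest targets."""
    idx = k_nearest(X_train, x, k)
    return float(np.mean(y_train[idx]))

def knn_classify(X_train, labels, x, k=3):
    """Majority vote among the k nearest labels."""
    idx = k_nearest(X_train, x, k)
    votes = Counter(str(labels[i]) for i in idx)
    return votes.most_common(1)[0][0]

X_train = np.array([[0.0], [1.0], [2.0], [10.0], [11.0]])
y_train = np.array([1.0, 2.0, 3.0, 20.0, 22.0])
labels = np.array(["a", "a", "b", "b", "b"])

query = np.array([1.2])
regression_prediction = knn_regress(X_train, y_train, query)  # mean of 2, 3, 1
class_prediction = knn_classify(X_train, labels, query)       # votes: a, b, a
print(regression_prediction, class_prediction)
```

For the query 1.2 the three nearest points are 1.0, 2.0 and 0.0, so the regression prediction is 2.0 and the vote goes to class "a".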
Random Forest
4.1 Results
For the Melbourne data set, we also include the predictive improvement from One Hot
Encoding the categorical variables for each imputation model during the XGBoost
Imputation outlined in Chapter 2. As Tables 4.2 and 4.3 show, there is an improvement
from using One Hot Encoding for Random Forest and Linear Regression. However, the
significant increase in computational time does not justify the minor improvement in
predictive accuracy we obtain. These results are why we opt to use Label Encoding
over One Hot Encoding for our thesis.
We can see in Tables 4.1, 4.2, and 4.3 that XGBoost is beaten by SoftImpute.
Table 4.3 shows that MICE has better predictive improvement for Linear Regression
than XGBoost.
Table 4.2: Random Forest Prediction Results for Melbourne Housing Data
Test MAE    Test RMSE    Imputation Name
255717      407328       SoftImpute
259046      411609       MICE
257845      412602       XGB OHE
258613      412743       XGB
263768      416833       KNN
266391      419301       SimpleFill
Table 4.3: Linear Regression Prediction Results for Melbourne Housing Data
Tables 4.4, 4.5, and 4.6 show that XGBoost performed the best for imputation for
all of the models. The across-the-board performance supports the hypothesis from
Chapter 3 that XGBoost would have the best improvement on predictive ability out
of all of the models. We also see that SoftImpute had the worst performance on
predictive accuracy, even worse than Mean imputation, for every data set.
It was interesting to find that for kNN, Mean was the second best performing
imputation method. This shows that Mean imputation is not always the worst option
for every situation and can perform better than model based methods for different
supervised models.
Table 4.4: Random Forest Prediction Accuracy for Boston Housing Data
Test MAE    Test RMSE    Imputation Name
3.616397    5.412887     XGB
3.759428    5.606318     MICE
3.789144    5.715436     KNN
4.097958    5.934811     Mean
4.206722    6.178896     SoftImpute
Table 4.5: Linear Regression Prediction Accuracy for Boston Housing Data
The breast cancer data set was one that we thought would potentially cause problems
for the model-based methods such as MICE and our method. Since 77% of the variables
had missing data, there was the possibility that the models would struggle to learn
the underlying structure of the data. However, as we can see from Tables 4.6, 4.7,
and 4.9, XGBoost and MICE had the best imputation improvements and XGBoost was the
best for Logistic Regression and kNN.
Test Accuracy    Precision    Recall      F1          Imputation Name
0.9650           0.988506     0.955556    0.971751    MICE
0.9510           0.966292     0.955556    0.960894    XGB
0.9510           1.000000     0.922222    0.959538    KNN
0.9371           0.976471     0.922222    0.948571    Mean
0.9301           0.954545     0.933333    0.943820    SoftImpute
Table 4.8: Random Forest Prediction Accuracy for Breast Cancer Data
Table 4.9: Logistic Regression Prediction Accuracy for Breast Cancer Data
CHAPTER 5
Simulation Study
In order to control the different variables that may impact the performance and
accuracy of data imputation, we chose to simulate data at varying parameters.
Simulated data has the inherent ability to change one aspect of data at a time and
keep everything else constant. If we only use real data, but want to test the impact
of original sample size, we would be required to find sufficiently different data sets for
the test. We would be able to see differences in imputation between the two different
data sets, but we would not be able to attribute the difference solely to the size of the
data, because other aspects of the data that are completely out of our control may
be different as well.
One of the primary reasons to test the impact of data size is the amount of
data available may impact the ability of a given imputation model to fully learn
the structure of the data [2]. A model that performs the best in an environment of
abundant data may be unusable in situations where there is not a sufficient amount
of data. We use two data simulation utilities available in the scikit-learn
package, one built for regression tasks and the other built for classification tasks [17].
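A sketch of generating such data at two sample sizes is shown below. It assumes the two utilities are scikit-learn's `make_regression` and `make_classification`; the sample sizes, feature counts, and noise level are hypothetical and are not the settings used in this study.

```python
from sklearn.datasets import make_classification, make_regression

sizes = [100, 1000]

# Regression data: only the sample size changes between the two sets.
regression_sets = {
    n: make_regression(n_samples=n, n_features=10, noise=10.0, random_state=0)
    for n in sizes
}

# Classification data generated the same way.
classification_sets = {
    n: make_classification(n_samples=n, n_features=10, n_informative=5,
                           random_state=0)
    for n in sizes
}

for n in sizes:
    X_reg, y_reg = regression_sets[n]
    X_clf, y_clf = classification_sets[n]
    print(n, X_reg.shape, X_clf.shape, sorted(set(y_clf.tolist())))
```

Holding every argument except `n_samples` fixed is what allows a change in imputation accuracy to be attributed to the size of the data.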
Table 5.1 reinforces the finding [2] that as we increase the size of the data, XGBoost
is able to more accurately predict the missing values. This is an inherent by-product
of XGBoost being a non-parametric method. When we increase the size of the
data, the model has more information with which to estimate the underlying function
that governs the individual variables for each of the missing data points. However,
Table 5.2 shows that simply increasing the data size may not always lead to an
improvement in model performance if there is noise in the data similar to what we
simulated in Table 5.2.
CHAPTER 6
Data Preprocessing
6.1 Skew
Skewness is simply a measure of how asymmetrical data is. For parametric models,
such as Linear and Logistic Regression, parameter estimation is founded on the
assumption that the data is normally distributed. If the data does not follow a
normal distribution, the parameters cannot be accurately estimated [10].
For decision tree based models, variance is used to help perform the node splits.
Variance is sensitive to outliers and skewed data. Therefore transforming the
variables can lead to improvements for both parametric and tree based models.
For univariate data, skewness is calculated using the Fisher-Pearson coefficient
of skewness:
\[ \frac{ \sum_{i=1}^{N} (Y_i - \bar{Y})^3 / N }{ s^3 } \]
where \bar{Y} is the mean, s is the standard deviation and N is the number of data
points. We use Python's adjusted Fisher-Pearson coefficient of skewness [9]:
\[ \mathrm{Skew} = \frac{ \sqrt{N(N-1)} }{ N-2 } \cdot \frac{ \sum_{i=1}^{N} (Y_i - \bar{Y})^3 / N }{ s^3 } \]
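\[ T(y_i) = \begin{cases}
\dfrac{y_i^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\
\ln y_i, & \lambda = 0
\end{cases} \]
\[ T(y_i) = \begin{cases}
\dfrac{(y_i + 1)^{\lambda} - 1}{\lambda}, & \lambda \neq 0,\ y_i \geq 0 \\
\ln(y_i + 1), & \lambda = 0,\ y_i \geq 0 \\
-\dfrac{(-y_i + 1)^{2-\lambda} - 1}{2 - \lambda}, & \lambda \neq 2,\ y_i < 0 \\
-\ln(-y_i + 1), & \lambda = 2,\ y_i < 0
\end{cases} \]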
Bulmer advises the rule of thumb that normal data has a skew value such that the
Fisher-Pearson coefficient is less than 1 and greater than -1 [5]. We adhere to Bulmer's
rule of thumb and transform all of the variables with a Fisher-Pearson coefficient larger
than 1.
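The skew check and the four-case transformation can be sketched with NumPy. The second piecewise form matches what is commonly called the Yeo-Johnson transformation. The function names are hypothetical, s is computed with N in the denominator (an assumption about the convention), and \lambda is fixed by hand here, whereas in practice it is usually estimated from the data.

```python
import numpy as np

def adjusted_skew(y):
    """Adjusted Fisher-Pearson coefficient of skewness."""
    y = np.asarray(y, dtype=float)
    n = y.size
    centered = y - y.mean()
    s = np.sqrt(np.mean(centered ** 2))       # standard deviation with N
    g1 = np.mean(centered ** 3) / s ** 3      # unadjusted coefficient
    return float(np.sqrt(n * (n - 1)) / (n - 2) * g1)

def yeo_johnson(y, lam):
    """Apply the four-case transformation for a given lambda."""
    y = np.asarray(y, dtype=float)
    out = np.empty_like(y)
    pos = y >= 0
    if lam != 0:
        out[pos] = ((y[pos] + 1) ** lam - 1) / lam
    else:
        out[pos] = np.log(y[pos] + 1)
    if lam != 2:
        out[~pos] = -(((-y[~pos] + 1) ** (2 - lam)) - 1) / (2 - lam)
    else:
        out[~pos] = -np.log(-y[~pos] + 1)
    return out

# Hypothetical right-skewed variable.
rng = np.random.default_rng(0)
raw = rng.lognormal(size=500)
skew_before = adjusted_skew(raw)
skew_after = adjusted_skew(yeo_johnson(raw, 0.0))
print("skew before:", skew_before, "skew after:", skew_after)

# Following the rule of thumb, only variables with skew above 1 are transformed.
needs_transform = skew_before > 1
```

With \lambda = 1 the transformation leaves the data unchanged, so values of \lambda further from 1 correspond to stronger corrections.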
We use the breast cancer data set for our investigation into the impact of skew.
Table 6.1: Fisher-Pearson Coefficient Before and After Transformation for the Skewed
Variables
After calculating the Fisher-Pearson coefficient for each variable in the breast
cancer data set, we found that there were 22 variables (73%) that were skewed.
Table 6.2: XGBoost Imputation Comparison Between Transformed and Original Data
Figure – shows a 97% decrease in RMSE after we transform the data. Since RMSE
heavily penalises predictions that are far from the true value [21], we believe that the
original data suffered from some highly inaccurate imputations. The
inaccuracy of the imputation can lead to a deterioration of the predictive ability of a
predictive model. We chose not to compare the predictive performance of transformed
and non-transformed data after imputation because we would not be able to determine
whether a difference in predictive accuracy is caused by the data transformation
itself or by the increased accuracy of the data imputation.
CHAPTER 7
Conclusion
In this thesis, XGBoost was used to create a Multiple Imputation technique as a new
approach to the missing data problem. The imputation method based on XGBoost
is more flexible than the current industry standard, MICE, in that it does not make
any assumptions about the underlying structure of the data. We compared XGBoost
imputation to several standard techniques across different types of data. XGBoost
MI performed on par with or better than the standard methods throughout the different
tests. However, it showed some weakness in making a few predictions that were highly
inaccurate, as seen when XGBoost had a low MAE score but a higher RMSE. As we
saw in Chapter 4, the inaccuracy on some data points caused XGBoost to not be
the ideal imputation scheme for models that are sensitive to outliers, such as linear
regression and logistic regression. The overall accuracy, shown by the consistently
better MAE score, leads us to suggest that XGBoost MI may be generally better for
tree based methods and kNN for most tasks.
Our study did not include machine learning techniques such as Support Vector
Machines, Generalized Linear Regression, or Neural Networks, so we cannot make any
inference about how XGBoost MI will perform as the imputation method for other
families of models. Future work would be to investigate how XGBoost MI performs as
the imputation scheme for a wider array of machine learning algorithms.
Further, there are several areas of potential improvement. We chose to include only
subsampling as a parameter for XGBoost. However, XGBoost performance is highly
sensitive to the parameters selected. We could see a significant increase in imputation
performance by tuning the parameters for each missing variable. We could also
find better ways of ordering the imputation than the order in which the variables
appear in the data set. One method could be to select the columns to be imputed in
random order. A more sophisticated method could be to impute the variables in
descending order with respect to the percentage of missing data. There is a lot of
room for improvement, and we believe that this thesis is just the first step in
creating a more robust imputation method that utilizes XGBoost.
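The ordering idea in the last paragraph can be sketched as follows. This is only an illustration under the assumption that the data sits in a pandas DataFrame; the helper name `imputation_order` and the toy columns are hypothetical and are not part of the thesis code.

```python
import numpy as np
import pandas as pd


def imputation_order(df, descending=True):
    """Return the columns that contain missing values, sorted by fraction missing."""
    frac_missing = df.isna().mean()            # fraction of NaN per column
    frac_missing = frac_missing[frac_missing > 0]
    ordered = frac_missing.sort_values(ascending=not descending, kind="stable")
    return list(ordered.index)


# Toy data: "b" is 75% missing, "d" 50%, "a" 25%, "c" is complete.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, np.nan, np.nan, 4.0],
    "c": [1.0, 2.0, 3.0, 4.0],
    "d": [np.nan, np.nan, 3.0, 4.0],
})
order = imputation_order(df)
```

A random ordering could be obtained instead by shuffling the same list of column names before each pass.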
BIBLIOGRAPHY
[2] Beleites, C., Neugebauer, U., Bocklitz, T., Krafft, C., and Popp, J. Sample size
planning for classification models. Analytica Chimica Acta, 760:25–33, 2013.
[6] Harrison, D. and Rubinfeld, D.L. Hedonic housing prices and the demand for clean
air. Journal of Environmental Economics and Management, 5:81–102, 1978.
[7] Trevor Hastie, Rahul Mazumder, Jason D. Lee, and Reza Zadeh. Matrix
completion and low-rank svd via fast alternating least squares. Journal of
Machine Learning Research, 16(104):3367–3402, 2015.
[8] Hastie, T., Tibshirani, R., and Friedman, J. The Elements of Statistical Learning:
Data Mining, Inference, and Prediction. Springer-Verlag, 2009.
[9] Hotelling, H. New Light on the Correlation Coefficient and its Transforms.
Journal of the Royal Statistical Society, 15(2).
[10] Shao J. Mathematical Statistics. Springer, New York, NY, USA, 2 edition, 2003.
[11] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An
Introduction to Statistical Learning: With Applications in R. Springer Publishing
Company, Incorporated, 2014.
[12] Jerome H. Friedman. Greedy Function Approximation: A Gradient Boosting
Machine. The Annals of Statistics, 29:1189–1232, 2001.
[13] Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix factorization techniques
for recommender systems. Computer, 42(8):30–37, August 2009.
[14] Roderick J A Little and Donald B Rubin. Statistical Analysis with Missing Data.
John Wiley & Sons, Inc., New York, NY, USA, 1986.
[15] Jayant Malani, Neha Sinha, Nivedita Prasad, and Vikas Lokesh. Forecasting
bike sharing demand.
[16] Thomas M. Mitchell. Machine Learning. McGraw-Hill, Inc., New York, NY,
USA, 1 edition, 1997.
[18] Peter J. Rousseeuw and Katrien Van Driessen. A Fast Algorithm for the Minimum
Covariance Determinant Estimator. Technometrics, 41(3):212–223, 1999.
[19] Rousseeuw, P.J. Least Median of Squares Regression. Journal of the American
Statistical Association, (79):871–880, 1984.
[20] Donald B. Rubin. Inference and missing data. Biometrika, 63(3):581–592, 1976.
[21] T. Chai and R. R. Draxler. Root mean square error (RMSE) or mean absolute
error (MAE)? - Arguments against avoiding RMSE in literature. Geoscientific
Model Development, 7(3):1247–1250, 2014.
[22] Tianqi Chen and Carlos Guestrin. XGBoost: A Scalable Tree Boosting System.
Proceedings of the 22nd ACM SIGKDD International Conference on knowledge
discovery and data mining, pages 785–794, 2016.
[23] Yeo, In-Kwon and Johnson, Richard. A new family of power transformations.
Biometrika, (87):954–959, 2000.
APPENDIX
APPENDIX A
Melbourne Feature Engineering
Prior to running any tests on the Melbourne data, I created some features from the
original data in a process known as feature engineering. There are generally two
ways of improving the performance of any given machine learning model: increasing
the amount of data or enlarging the feature space. Creating new features adds new
dimensions to the data that can help models learn and predict the dependent variable.
The first group of features I created came from the time variable. I created a
categorical feature for the Month and another categorical feature for the Year. The
idea is that the month of the year may have an impact on the value of a house.
For example, if more people are looking to move in the summer, then perhaps prices
for houses are typically higher in the summer as a result of the increased demand.
Another feature I created was the Age of the house. I computed 2017 - X for every X
in the column YearBuilt.
For Latitude and Longitude, I created 10 bins for each by ordering the values and
splitting them at the deciles, so that each bin holds roughly the same number of
houses.
Then I created another feature that crossed the latitude and longitude bins to
create a grid for Melbourne. Since there are 10 bins apiece, there is the potential
for 100 new categories for this one feature.
The last feature I created was a Large house indicator for houses that have more
than 4 bedrooms.
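The steps above can be sketched with pandas as follows. This is an illustrative reconstruction, not the code used for the thesis: the column names `Date`, `YearBuilt`, `Latitude`, `Longitude`, and `Bedrooms`, the date format, and the helper name `engineer_features` are all assumptions, and the toy frame exists only to exercise the function.

```python
import pandas as pd


def engineer_features(df, reference_year=2017, n_bins=10):
    """Add Month, Year, Age, latitude/longitude bins, a grid cell, and a Large flag."""
    out = df.copy()

    # Time features (assumed day/month/year strings in a "Date" column).
    date = pd.to_datetime(out["Date"], format="%d/%m/%Y")
    out["Month"] = date.dt.month.astype("category")
    out["Year"] = date.dt.year.astype("category")

    # Age of the house relative to the reference year.
    out["Age"] = reference_year - out["YearBuilt"]

    # Quantile bins: each bin holds roughly the same number of rows.
    out["LatBin"] = pd.qcut(out["Latitude"], q=n_bins, labels=False, duplicates="drop")
    out["LongBin"] = pd.qcut(out["Longitude"], q=n_bins, labels=False, duplicates="drop")

    # Cross the two binnings into one grid-cell category (up to n_bins ** 2 values).
    out["Grid"] = out["LatBin"].astype(str) + "_" + out["LongBin"].astype(str)

    # Indicator for houses with more than 4 bedrooms.
    out["Large"] = (out["Bedrooms"] > 4).astype(int)
    return out


# Toy data for illustration only.
toy = pd.DataFrame({
    "Date": ["3/12/2016", "4/02/2017"] * 10,
    "YearBuilt": [1990 + i for i in range(20)],
    "Latitude": [-37.9 + 0.01 * i for i in range(20)],
    "Longitude": [144.9 + 0.01 * i for i in range(20)],
    "Bedrooms": [2, 3, 4, 5] * 5,
})
features = engineer_features(toy)
```

Rows with a missing latitude or longitude would receive a missing bin, so in a real data set the grid feature would need to be built after, or excluded from, imputation of those columns.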
APPENDIX B
Code
from sklearn.ensemble import IsolationForest
import pandas as pd
from fancyimpute import KNN, SoftImpute, BiScaler, SimpleFill, IterativeImputer, IterativeSVD, MatrixFactorization
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import math
from sklearn import preprocessing
from sklearn.datasets import make_classification, make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder, RobustScaler, PowerTransformer
from sklearn import pipeline
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.svm import SVC, SVR
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, BaggingClassifier, ExtraTreesClassifier, G
from sklearn.naive_bayes import GaussianNB, BernoulliNB
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV, PassiveAggressiveClassifier, Perceptron,
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.metrics import confusion_matrix, mean_absolute_error
from sklearn.metrics import mean_squared_error, confusion_matrix, precision_score, recall_score, auc, roc_curve, f1_sc
from xgboost import XGBClassifier, XGBRegressor
from lightgbm import LGBMClassifier, LGBMRegressor
from catboost import CatBoostClassifier, CatBoostRegressor
from sklearn.neural_network import MLPClassifier, MLPRegressor
import impyute as impy
from scipy import stats
from scipy.stats import norm, skew  # for some statistics
from sklearn.ensemble import IsolationForest
from sklearn.impute import SimpleImputer
from sklearn.datasets import load_boston, load_wine, load_breast_cancer, load_iris
import random

MLA = [IterativeImputer(),
       KNN(),
       SimpleFill(),
       #IterativeSVD(),
       SoftImpute()]

ML_regressor = [
    KNeighborsRegressor(),
    RandomForestRegressor(),
    LinearRegression(),
    XGBRegressor()]

ML_classifier = [
    KNeighborsClassifier(),
    RandomForestClassifier(),
    LogisticRegression(),
    XGBClassifier()]

def fix_skew(df):
    temp = df.copy()
    numeric_feats = temp.dtypes[temp.dtypes != "object"].index
    skewed_feats = temp[numeric_feats].apply(lambda x: skew(x.dropna())).sort_values(ascending=False)
    skewness = skewed_feats[abs(skewed_feats) > 1]
    skewed_features = skewness.index
    pt = PowerTransformer(method='yeo-johnson')
    for feat in skewed_features:
        temp[feat] = pt.fit_transform(np.array(temp[feat]).reshape(-1, 1))
    print(skewness.shape[0], "skewed numerical features have been transformed")
    return temp

def remove_outlier(df):
    temp = df.copy()
    numeric_feats = temp.dtypes[temp.dtypes != "object"].index
    out_df = temp[numeric_feats].copy().fillna(0)
    IF = IsolationForest(behaviour="new", contamination="auto")
    IF.fit(out_df)
    temp["IF_score"] = IF.decision_function(out_df)
    return temp[temp["IF_score"] > 0.00]

def prep_onehot(df, cat_cols, column):
    tempdf = df.copy()
    target_list = [column]
    onehot_cols = cat_cols[np.logical_not(cat_cols.isin(target_list))]
    tempdf = pd.get_dummies(tempdf, columns=onehot_cols)
    return tempdf

    temp = label(temp, cat_features)
    temp[cat_intersect] = temp[cat_intersect].fillna(-5)
    cat_idx = temp[cat_intersect] == -5
    temp[numerical_intersect] = temp[numerical_intersect].fillna(-5)
    num_idx = temp[numerical_intersect] == -5
    if (len(cat_intersect) > 0):
        temp[cat_intersect] = fill_cat(temp[cat_intersect], -5)
    else: pass
    if (len(numerical_intersect) > 0):
        temp[numerical_intersect] = fill_float(temp[numerical_intersect], -5)
    else: pass
    temp[numerical_features] = temp[numerical_features].astype('float')
    for i in range(k):
        for col in cat_intersect:
            if len(cat_intersect) == 0:
                break
            else:
                temp.loc[cat_idx[col].values, col] = (
                    XGBClassifier()
                    .fit(temp.drop(col, axis=1).loc[~cat_idx[col].values, :],
                         temp.loc[~cat_idx[col].values, col])
                    .predict(temp.drop(col, axis=1).loc[cat_idx[col].values, :]))
        for col in numerical_intersect:
            if len(numerical_intersect) == 0:
                break
            else:
                temp.loc[num_idx[col].values, col] = (
                    XGBRegressor()
                    .fit(temp.drop(col, axis=1).loc[~num_idx[col].values, :],
                         temp.loc[~num_idx[col].values, col])
                    .predict(temp.drop(col, axis=1).loc[num_idx[col].values, :]))
    return temp

def label(df, col_label):
    df = df.copy()
    le = LabelEncoder()
    for col in col_label:
        idx = ~df[col].isna()
        df.loc[idx, col] = le.fit(df.loc[idx, col]).transform(df.loc[idx, col])
    return df

def fill_cat(df, val):
    tempdf = df.copy()
    imp_mode = SimpleImputer(missing_values=val, strategy='most_frequent')
    newdf = imp_mode.fit_transform(tempdf)
    return newdf

def fill_float(df, val):
    tempdf = df.copy()
    imp_mean = SimpleImputer(missing_values=val, strategy='mean')
    newdf = imp_mean.fit_transform(tempdf)
    return newdf

        temp = fix_skew(temp)
    else: pass
    if outlier == True:
        temp = remove_outlier(temp)
    else: pass
    MLA_columns = []
    ML_compare = pd.DataFrame(columns=MLA_columns)
    row_index1 = 0
    cat_features = temp.columns[np.where(temp.dtypes == np.object)[0]]
    numerical_features = temp.columns[np.where(temp.dtypes != np.object)[0]]
    missing_cols = [col for col in temp.columns
                    if temp[col].isnull().any()]
    cat_intersect = intersection(numerical_features, missing_cols)
    numerical_intersect = intersection(numerical_features, missing_cols)
    # Start the actual imputation
    #print(cat_features)
    temp = label(temp, cat_features)
    y = temp[target]
    X = temp.drop(target, axis=1).copy()
    X_incomplete = np.array(X)
    for alg in MLA:
        X_filled = alg.fit_transform(X_incomplete)
        MLA_name = alg.__class__.__name__
        x_train, x_test, y_train, y_test = train_test_split(X_filled, y, test_size=0.25, random_state=0)
        for ml in ML_regressor:
            pred_train = ml.fit(x_train, y_train).predict(x_train)
            pred_test = ml.predict(x_test)
            ML_name = ml.__class__.__name__
            ML_compare.loc[row_index1, 'ML Name'] = ML_name
            ML_compare.loc[row_index1, 'ML Train MAE'] = round(mean_absolute_error(pred_train, y_train), 6)
            ML_compare.loc[row_index1, 'ML Train RMSE'] = round(np.sqrt(mean_squared_error(pred_train, y_train)), 6)
            ML_compare.loc[row_index1, 'ML Test MAE'] = round(mean_absolute_error(pred_test, y_test), 6)
            ML_compare.loc[row_index1, 'ML Test RMSE'] = round(np.sqrt(mean_squared_error(pred_test, y_test)), 6)
            ML_compare.loc[row_index1, 'Imputation Name'] = MLA_name
            row_index1 += 1
    # Create new dataset for new Imputation
    temp2 = temp.copy()
    XGB = XGB_impute(temp2, 8)
    #CatBoost = CatBoost_impute(temp2, 5)
    #LGBM = LGBM_impute(temp2, 5)
    XGB.name = "XGB"
    #CatBoost.name = "CatBoost"
    #LGBM.name = "LGBM"
    XY_incomplete = np.array(temp2)
    n_imputations = 5
    XY_completed = []
    for i in range(n_imputations):
        imputer = IterativeImputer(n_iter=2, sample_posterior=True, random_state=i)
        XY_completed.append(imputer.fit_transform(XY_incomplete))
    XY_completed_mean = pd.DataFrame(np.mean(XY_completed, 0), columns=temp2.columns)
    XY_completed_mean.name = "MICE"
    impute_data = [XGB, XY_completed_mean]
    for data in impute_data:
        MLA_name = data.name
        x_train, x_test, y_train, y_test = train_test_split(data.drop(target, axis=1), y, test_size=0.25, random_s
        for ml2 in ML_regressor:
            pred_train = ml2.fit(x_train, y_train).predict(x_train)
            pred_test = ml2.predict(x_test)
            ML_name = ml2.__class__.__name__
            ML_compare.loc[row_index1, 'ML Name'] = ML_name
            ML_compare.loc[row_index1, 'ML Train MAE'] = round(mean_absolute_error(pred_train, y_train), 4)
            ML_compare.loc[row_index1, 'ML Train RMSE'] = round(np.sqrt(mean_squared_error(pred_train, y_train)), 4
            ML_compare.loc[row_index1, 'ML Test MAE'] = round(mean_absolute_error(pred_test, y_test), 4)
            ML_compare.loc[row_index1, 'ML Test RMSE'] = round(np.sqrt(mean_squared_error(pred_test, y_test)), 4)
            ML_compare.loc[row_index1, 'Imputation Name'] = MLA_name
            row_index1 += 1
    return ML_compare

    if outlier == True:
        temp = remove_outlier(temp)
    else: pass
    MLA_columns = []
    ML_compare = pd.DataFrame(columns=MLA_columns)
    row_index1 = 0
    cat_features = temp.columns[np.where(temp.dtypes == np.object)[0]]
    numerical_features = temp.columns[np.where(temp.dtypes != np.object)[0]]
    missing_cols = [col for col in temp.columns
                    if temp[col].isnull().any()]
    cat_intersect = intersection(numerical_features, missing_cols)
    numerical_intersect = intersection(numerical_features, missing_cols)
    # Start the actual imputation
    #print(cat_features)
    temp = label(temp, cat_features)
    y = temp[target]
    X = temp.drop(target, axis=1).copy()
    X_incomplete = np.array(X)
    for alg in MLA:
        X_filled = alg.fit_transform(X_incomplete)
        MLA_name = alg.__class__.__name__
        x_train, x_test, y_train, y_test = train_test_split(X_filled, y, test_size=0.25, random_state=0)
        for ml in ML_classifier:
            pred_train = ml.fit(x_train, y_train).predict(x_train)
            pred_test = ml.predict(x_test)
            ML_name = ml.__class__.__name__
            fp, tp, th = roc_curve(y_test, pred_test)
            ML_compare.loc[row_index1, 'ML Name'] = ML_name
            ML_compare.loc[row_index1, 'ML Train Accuracy'] = round(ml.score(x_train, y_train), 4)
            ML_compare.loc[row_index1, 'ML Test Accuracy'] = round(ml.score(x_test, y_test), 4)
            ML_compare.loc[row_index1, 'ML Precission'] = precision_score(y_test, pred_test)
            ML_compare.loc[row_index1, 'ML Recall'] = recall_score(y_test, pred_test)
            ML_compare.loc[row_index1, 'ML AUC'] = auc(fp, tp)
            # F1 is computed with f1_score (the listing called recall_score here)
            ML_compare.loc[row_index1, 'ML f1'] = f1_score(y_test, pred_test)
            ML_compare.loc[row_index1, 'Imputation Name'] = MLA_name
            row_index1 += 1
    # Create new dataset for new Imputation
    temp2 = temp.copy()
    XGB = XGB_impute(temp2, 8)
    #CatBoost = CatBoost_impute(temp2, 5)
    #LGBM = LGBM_impute(temp2, 5)
    XGB.name = "XGB"
    #CatBoost.name = "CatBoost"
    #LGBM.name = "LGBM"
    XY_incomplete = np.array(temp2)
    n_imputations = 5
    XY_completed = []
    for i in range(n_imputations):
        imputer = IterativeImputer(n_iter=2, sample_posterior=True, random_state=i)
        XY_completed.append(imputer.fit_transform(XY_incomplete))
    XY_completed_mean = pd.DataFrame(np.mean(XY_completed, 0), columns=temp2.columns)
    XY_completed_mean.name = "MICE"
    impute_data = [XGB, XY_completed_mean]
    for data in impute_data:
        MLA_name = data.name
        x_train, x_test, y_train, y_test = train_test_split(data.drop(target, axis=1), y, test_size=0.25, random_s
        for ml2 in ML_classifier:
            pred_train = ml2.fit(x_train, y_train).predict(x_train)
            pred_test = ml2.predict(x_test)
            fp, tp, th = roc_curve(y_test, pred_test)
            ML_name = ml2.__class__.__name__
            ML_compare.loc[row_index1, 'ML Name'] = ML_name
            ML_compare.loc[row_index1, 'ML Train Accuracy'] = round(ml2.score(x_train, y_train), 4)
            ML_compare.loc[row_index1, 'ML Test Accuracy'] = round(ml2.score(x_test, y_test), 4)
            ML_compare.loc[row_index1, 'ML Precission'] = precision_score(y_test, pred_test)
            ML_compare.loc[row_index1, 'ML Recall'] = recall_score(y_test, pred_test)
            ML_compare.loc[row_index1, 'ML AUC'] = auc(fp, tp)
            # F1 is computed with f1_score (the listing called recall_score here)
            ML_compare.loc[row_index1, 'ML f1'] = f1_score(y_test, pred_test)
            ML_compare.loc[row_index1, 'Imputation Name'] = MLA_name
            row_index1 += 1
    return ML_compare

from datetime import datetime
census = pd.read_csv("adult.csv")
####################################################

    cat_features = temp.columns[np.where(temp.dtypes == np.object)[0]]
    numerical_features = temp.columns[np.where(temp.dtypes != np.object)[0]]
    missing_cols = [col for col in temp.columns
                    if temp[col].isnull().any()]
    cat_intersect = intersection(cat_features, missing_cols)
    numerical_intersect = intersection(numerical_features, missing_cols)
    temp = label(temp, cat_features)
    temp[cat_intersect] = temp[cat_intersect].fillna(-5)
    cat_idx = temp[cat_intersect] == -5
    temp[numerical_intersect] = temp[numerical_intersect].fillna(-5)
    num_idx = temp[numerical_intersect] == -5
    if (len(cat_intersect) > 0):
        temp[cat_intersect] = fill_cat(temp[cat_intersect], -5)
    else: pass
    if (len(numerical_intersect) > 0):
        temp[numerical_intersect] = fill_float(temp[numerical_intersect], -5)
    else: pass
    temp[numerical_features] = temp[numerical_features].astype('float')
    for i in range(k):
        for col in cat_intersect:
            if len(cat_intersect) == 0:
                break
            else:
                temp3 = prep_onehot(temp, cat_features, col)
                temp.loc[cat_idx[col].values, col] = XGBClassifier().fit(temp3.drop(col, axis=1).loc[~cat_idx[col].v
        for col in numerical_intersect:
            if len(numerical_intersect) == 0:
                break
            else:
                tempdf = pd.get_dummies(temp, columns=cat_features)
                temp.loc[num_idx[col].values, col] = XGBRegressor().fit(tempdf.drop(col, axis=1).loc[~num_idx[col].v
    return temp

iris_data = load_iris()
iris = pd.DataFrame(iris_data.data)
iris.columns = iris_data.feature_names
iris['class'] = iris_data.target
y = iris['class']

bc_data = load_breast_cancer()
bc = pd.DataFrame(bc_data.data)
bc.columns = bc_data.feature_names
bc['Type'] = bc_data.target

    if outlier == True:
        df = remove_outlier(df)
    else: pass
    ML_classifier = [
        KNeighborsClassifier(),
        RandomForestClassifier(),
        LogisticRegression(),
        XGBClassifier()]
    X1_missing = df.copy()
    y = X1_missing[target]
    cols = random.sample(list(X1_missing.columns)[:-1], random.randint(1, len(X1_missing.columns) - 2))
    # Randomly remove percent
    perclist = []
    for i in range(len(cols)):
        perclist.append(random.random())
    mylist = []
    for perc in perclist:
        mylist.append(random.sample(range(1, len(y)), round(perc * len(y) / 2)))
    counter = 0
    for col in cols:
        X1_missing.loc[mylist[counter], col] = np.nan
        tempdf.loc[mylist[counter], col] = np.nan
        counter += 1
    miss_perc = X1_missing.isna().sum() / len(X1_missing)
    miss_cols = cols
    MLA_columns = []
    MLA_compare2 = pd.DataFrame(columns=MLA_columns)
    row_index = 0
    ML_columns = []
    ML_compare2 = pd.DataFrame(columns=ML_columns)
    row_index1 = 0
    for alg in MLA:
        X1_full = np.array(df.copy())
        X1_incomplete = np.array(X1_missing.copy())
        X_filled = alg.fit_transform(X1_incomplete)
        temp = pd.DataFrame(X_filled, columns=X1_missing.columns)
        MLA_name = alg.__class__.__name__
        MLA_compare2.loc[row_index, 'Imputation Name'] = MLA_name
        mylist1 = np.array([x for x in np.array(temp[X1_missing.isna()]).flatten() if ~np.isnan(x)])
        # mylist2 = np.array([x for x in np.array(df[X1_missing.isna()]).flatten() if ~np.isnan(x)])
        # MLA_compare2.loc[row_index, 'MAE'] = (abs(mylist1 - mylist2)).sum() / len(mylist1)
        # MLA_compare2.loc[row_index, 'RMSE'] = math.sqrt(((mylist1 - mylist2) ** 2).sum() / len(mylist1))
        row_index += 1
        for ml in ML_classifier:
            x_train, x_test, y_train, y_test = train_test_split(temp.drop(columns=target), y, test_size=0.25, ran
            predicted = ml.fit(x_train, y_train).predict(x_test)
            fp, tp, th = roc_curve(y_test, predicted)
            ML_name = ml.__class__.__name__
            ML_compare2.loc[row_index1, 'ML Name'] = ML_name
            ML_compare2.loc[row_index1, 'ML Train Accuracy'] = round(ml.score(x_train, y_train), 4)
            ML_compare2.loc[row_index1, 'ML Test Accuracy'] = round(ml.score(x_test, y_test), 4)
            ML_compare2.loc[row_index1, 'ML Precission'] = precision_score(y_test, predicted)
            ML_compare2.loc[row_index1, 'ML Recall'] = recall_score(y_test, predicted)
            ML_compare2.loc[row_index1, 'ML AUC'] = auc(fp, tp)
            ML_compare2.loc[row_index1, 'ML f1'] = f1_score(y_test, predicted)
            ML_compare2.loc[row_index1, 'Imputation Name'] = MLA_name
            row_index1 += 1
    temp2 = df.copy()
    temp2_incomplete = X1_missing.copy()
    XGB = XGB_impute(temp2_incomplete, 5)
    XGB.name = "XGB"
    XY_incomplete = np.array(temp2_incomplete)
    n_imputations = 5
    XY_completed = []
    for i in range(n_imputations):
        imputer = IterativeImputer(n_iter=5, sample_posterior=True, random_state=i)
        XY_completed.append(imputer.fit_transform(XY_incomplete))
    XY_completed_mean = pd.DataFrame(np.mean(XY_completed, 0), columns=temp2.columns)
    XY_completed_mean.name = "MICE"
    impute_data = [XGB, XY_completed_mean]
    ML_classifier = [
        KNeighborsClassifier(),
        RandomForestClassifier(),
        LogisticRegression(),
        XGBClassifier()]
    for data in impute_data:
        MLA_name = data.name
        MLA_compare2.loc[row_index, 'Imputation Name'] = MLA_name
        mylist1 = np.array([x for x in np.array(data[X1_missing.isna()]).flatten() if ~np.isnan(x)])
        mylist2 = np.array([x for x in np.array(df[X1_missing.isna()]).flatten() if ~np.isnan(x)])
        MLA_compare2.loc[row_index, 'MAE'] = (abs(mylist1 - mylist2)).sum() / len(mylist1)
        MLA_compare2.loc[row_index, 'RMSE'] = math.sqrt(((mylist1 - mylist2) ** 2).sum() / len(mylist1))
        row_index += 1
        x_train, x_test, y_train, y_test = train_test_split(data.drop(columns=target), y, test_size=0.25, random
        for ml2 in ML_classifier:
            pred_train = ml2.fit(x_train, y_train).predict(x_train)
            pred_test = ml2.predict(x_test)
            # Recompute the ROC curve for this model so the AUC below is not
            # taken from the previous loop's fp and tp values
            fp, tp, th = roc_curve(y_test, pred_test)
            ML_name = ml2.__class__.__name__
            ML_compare2.loc[row_index1, 'ML Name'] = ML_name
            ML_compare2.loc[row_index1, 'ML Train Accuracy'] = round(ml2.score(x_train, y_train), 4)
            ML_compare2.loc[row_index1, 'ML Test Accuracy'] = round(ml2.score(x_test, y_test), 4)
            ML_compare2.loc[row_index1, 'ML Precission'] = precision_score(y_test, pred_test)
            ML_compare2.loc[row_index1, 'ML Recall'] = recall_score(y_test, pred_test)
            ML_compare2.loc[row_index1, 'ML AUC'] = auc(fp, tp)
            ML_compare2.loc[row_index1, 'ML f1'] = f1_score(y_test, pred_test)
            ML_compare2.loc[row_index1, 'Imputation Name'] = MLA_name
            row_index1 += 1
    return ML_compare2, MLA_compare2, miss_perc, miss_cols
if o u t l i e r == True :
df = remove outlier ( df )
else : pass
49
    ML_regressor = [
        KNeighborsRegressor(),
        RandomForestRegressor(),
        LinearRegression(),
        XGBRegressor()]
    X1_missing = df.copy()
    y = X1_missing[target]
    cols = random.sample(list(X1_missing.columns)[:-1], round(random.randint(1, len(X1_missing.columns)) / 2))
    # Randomly remove percent
    perclist = []
    for i in range(len(cols)):
        perclist.append(random.random())
    mylist = []
    for perc in perclist:
        mylist.append(random.sample(range(1, len(y)), round(perc * len(y) / 2)))
    counter = 0
    for col in cols:
        X1_missing.loc[mylist[counter], col] = np.nan
        counter += 1
    MLA_columns = []
    MLA_compare2 = pd.DataFrame(columns=MLA_columns)
    row_index = 0
    ML_columns = []
    ML_compare2 = pd.DataFrame(columns=ML_columns)
    row_index1 = 0
    for alg in MLA:
        X1_full = np.array(df.copy())
        X1_incomplete = np.array(X1_missing.copy())
        X_filled = alg.fit_transform(X1_incomplete)
        temp = pd.DataFrame(X_filled, columns=X1_missing.columns)
        MLA_name = alg.__class__.__name__
        MLA_compare2.loc[row_index, 'Imputation Name'] = MLA_name
        mylist1 = np.array([x for x in np.array(temp[X1_missing.isna()]).flatten() if ~np.isnan(x)])
        mylist2 = np.array([x for x in np.array(df[X1_missing.isna()]).flatten() if ~np.isnan(x)])
        MLA_compare2.loc[row_index, 'MAE'] = (abs(mylist1 - mylist2)).sum() / len(mylist1)
        MLA_compare2.loc[row_index, 'RMSE'] = math.sqrt(((mylist1 - mylist2) ** 2).sum() / len(mylist1))
        row_index += 1
        for ml in ML_regressor:
            x_train, x_test, y_train, y_test = train_test_split(temp.drop(columns=target), y, test_size=0.25, ran
            predicted = ml.fit(x_train, y_train).predict(x_test)
            pred_train = ml.predict(x_train)
            ML_name = ml.__class__.__name__
            ML_compare2.loc[row_index1, 'ML Name'] = ML_name
            ML_compare2.loc[row_index1, 'ML Train MAE'] = round(mean_absolute_error(pred_train, y_train), 6)
            ML_compare2.loc[row_index1, 'ML Train RMSE'] = round(np.sqrt(mean_squared_error(pred_train, y_train)),
            ML_compare2.loc[row_index1, 'ML Test MAE'] = round(mean_absolute_error(predicted, y_test), 6)
            ML_compare2.loc[row_index1, 'ML Test RMSE'] = round(np.sqrt(mean_squared_error(predicted, y_test)), 6)
            ML_compare2.loc[row_index1, 'Imputation Name'] = MLA_name
            row_index1 += 1
    temp2 = df.copy()
    temp2_incomplete = X1_missing.copy()
    XGB = XGB_impute(temp2_incomplete, 5)
    XGB.name = "XGB"
    XY_incomplete = np.array(temp2_incomplete)
    n_imputations = 5
    XY_completed = []
    for i in range(n_imputations):
        imputer = IterativeImputer(n_iter=5, sample_posterior=True, random_state=i)
        XY_completed.append(imputer.fit_transform(XY_incomplete))
    XY_completed_mean = pd.DataFrame(np.mean(XY_completed, 0), columns=temp2.columns)
    XY_completed_mean.name = "MICE"
    impute_data = [XGB, XY_completed_mean]
    ML_regressor = [
        KNeighborsRegressor(),
        RandomForestRegressor(),
        LinearRegression(),
        XGBRegressor()]
    for data in impute_data:
        MLA_name = data.name
        MLA_compare2.loc[row_index, 'Imputation Name'] = MLA_name
        mylist1 = np.array([x for x in np.array(data[X1_missing.isna()]).flatten() if ~np.isnan(x)])
        mylist2 = np.array([x for x in np.array(df[X1_missing.isna()]).flatten() if ~np.isnan(x)])
        MLA_compare2.loc[row_index, 'MAE'] = (abs(mylist1 - mylist2)).sum() / len(mylist1)
        MLA_compare2.loc[row_index, 'RMSE'] = math.sqrt(((mylist1 - mylist2) ** 2).sum() / len(mylist1))
        row_index += 1
        x_train, x_test, y_train, y_test = train_test_split(data.drop(columns=target), y, test_size=0.25, random
        for ml2 in ML_regressor:
            pred_train = ml2.fit(x_train, y_train).predict(x_train)
            pred_test = ml2.predict(x_test)
            ML_name = ml2.__class__.__name__
            ML_compare2.loc[row_index1, 'ML Name'] = ML_name
            ML_compare2.loc[row_index1, 'ML Train MAE'] = round(mean_absolute_error(pred_train, y_train), 6)
            ML_compare2.loc[row_index1, 'ML Train RMSE'] = round(np.sqrt(mean_squared_error(pred_train, y_train)),
            ML_compare2.loc[row_index1, 'ML Test MAE'] = round(mean_absolute_error(pred_test, y_test), 6)
            ML_compare2.loc[row_index1, 'ML Test RMSE'] = round(np.sqrt(mean_squared_error(pred_test, y_test)), 6)
            ML_compare2.loc[row_index1, 'Imputation Name'] = MLA_name
            row_index1 += 1
    return ML_compare2, MLA_compare2
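
The function above removes values at random from a complete data set and then scores each imputer only on the cells that were removed. A minimal, self-contained sketch of that idea follows. The helper names `mask_at_random` and `imputation_error`, the fixed 20% fraction, and the mean-fill imputer are assumptions made for illustration and are not taken from the thesis code. Unlike the listing, which samples row positions from `range(1, len(y))` and so never masks the first row, this sketch samples from the full index.

```python
import numpy as np
import pandas as pd


def mask_at_random(df, cols, frac, rng):
    """Return a copy of df with a fraction `frac` of rows set to NaN in each column of `cols`."""
    out = df.copy()
    n_remove = int(round(frac * len(out)))
    for col in cols:
        # Draw distinct row labels for this column, independently of the other columns.
        rows = rng.choice(out.index.to_numpy(), size=n_remove, replace=False)
        out.loc[rows, col] = np.nan
    return out


def imputation_error(complete, incomplete, imputed):
    """MAE and RMSE computed only over the cells that are NaN in `incomplete`."""
    mask = incomplete.isna().to_numpy()
    diff = imputed.to_numpy()[mask] - complete.to_numpy()[mask]
    mae = np.abs(diff).mean()
    rmse = np.sqrt((diff ** 2).mean())
    return mae, rmse


rng = np.random.default_rng(0)
complete = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "y"])
incomplete = mask_at_random(complete, ["a", "b"], frac=0.2, rng=rng)

# Column-mean imputation stands in for any imputer under comparison.
imputed = incomplete.fillna(incomplete.mean())
mae, rmse = imputation_error(complete, incomplete, imputed)
print(f"MAE={mae:.4f}  RMSE={rmse:.4f}")
```

Dividing by the number of masked cells, as here, keeps the error comparable across runs with different amounts of missing data.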

    if outlier == True:
        df = remove_outlier(df)
    else: pass
    ML_classifier = [
        KNeighborsClassifier(),
        RandomForestClassifier(),
        LogisticRegression(),
        XGBClassifier()]
    X1_missing = df.copy()
    y = X1_missing[target]
    cols = random.sample(list(X1_missing.columns)[:-1], random.randint(1, len(X1_missing.columns) - 2))
    # Randomly remove percent
    perclist = []
    for i in range(len(cols)):
        perclist.append(random.random())
    mylist = []
    for perc in perclist:
        mylist.append(random.sample(range(1, len(y)), round(perc * len(y) / 2)))
    counter = 0
    for col in cols:
        X1_missing.loc[mylist[counter], col] = np.nan
        counter += 1
    MLA_columns = []
    MLA_compare2 = pd.DataFrame(columns=MLA_columns)
    row_index = 0
    ML_columns = []
    ML_compare2 = pd.DataFrame(columns=ML_columns)
    row_index1 = 0
    for alg in MLA:
        X1_full = np.array(df.copy())
        X1_incomplete = np.array(X1_missing.copy())
        X_filled = alg.fit_transform(X1_incomplete)
        temp = pd.DataFrame(X_filled, columns=X1_missing.columns)
        MLA_name = alg.__class__.__name__
        MLA_compare2.loc[row_index, 'Imputation Name'] = MLA_name
        mylist1 = np.array([x for x in np.array(temp[X1_missing.isna()]).flatten() if ~np.isnan(x)])
        mylist2 = np.array([x for x in np.array(df[X1_missing.isna()]).flatten() if ~np.isnan(x)])
        MLA_compare2.loc[row_index, 'MAE'] = (abs(mylist1 - mylist2)).sum() / len(y)
        MLA_compare2.loc[row_index, 'RMSE'] = math.sqrt(((mylist1 - mylist2) ** 2).sum() / len(y))
        row_index += 1
        for ml in ML_classifier:
            x_train, x_test, y_train, y_test = train_test_split(temp.drop(columns=target), y, test_size=0.25, ran
            predicted = ml.fit(x_train, y_train).predict(x_test)
            ML_name = ml.__class__.__name__
            ML_compare2.loc[row_index1, 'ML Name'] = ML_name
            ML_compare2.loc[row_index1, 'ML Train Accuracy'] = round(ml.score(x_train, y_train), 4)
            ML_compare2.loc[row_index1, 'ML Test Accuracy'] = round(ml.score(x_test, y_test), 4)
            # ML_compare2.loc[row_index1, 'ML Precission'] = precision_score(y_test, predicted)
            # ML_compare2.loc[row_index1, 'ML Recall'] = recall_score(y_test, predicted)
            # ML_compare2.loc[row_index1, 'ML AUC'] = auc(fp, tp)
            # ML_compare2.loc[row_index1, 'ML f1'] = f1_score(y_test, predicted)
            ML_compare2.loc[row_index1, 'Imputation Name'] = MLA_name
            row_index1 += 1
    temp2 = df.copy()
    temp2_incomplete = X1_missing.copy()
    XGB = XGB_impute(temp2_incomplete, 5)
    XGB.name = "XGB"
    XY_incomplete = np.array(temp2_incomplete)
    n_imputations = 5
    XY_completed = []
    for i in range(n_imputations):
        imputer = IterativeImputer(n_iter=5, sample_posterior=True, random_state=i)
        XY_completed.append(imputer.fit_transform(XY_incomplete))
    XY_completed_mean = pd.DataFrame(np.mean(XY_completed, 0), columns=temp2.columns)
    XY_completed_mean.name = "MICE"
    impute_data = [XGB, XY_completed_mean]
    ML_classifier = [
        KNeighborsClassifier(),
        RandomForestClassifier(),
        LogisticRegression(),
        XGBClassifier()]
    for data in impute_data:
        MLA_name = data.name
        MLA_compare2.loc[row_index, 'Imputation Name'] = MLA_name
        mylist1 = np.array([x for x in np.array(data[X1_missing.isna()]).flatten() if ~np.isnan(x)])
        mylist2 = np.array([x for x in np.array(df[X1_missing.isna()]).flatten() if ~np.isnan(x)])
        MLA_compare2.loc[row_index, 'MAE'] = (abs(mylist1 - mylist2)).sum() / len(y)
        MLA_compare2.loc[row_index, 'RMSE'] = math.sqrt(((mylist1 - mylist2) ** 2).sum() / len(y))
        row_index += 1
        x_train, x_test, y_train, y_test = train_test_split(data.drop(columns=target), y, test_size=0.25, random
        for ml2 in ML_classifier:
            pred_train = ml2.fit(x_train, y_train).predict(x_train)
            pred_test = ml2.predict(x_test)
            ML_name = ml2.__class__.__name__
            ML_compare2.loc[row_index1, 'ML Name'] = ML_name
            ML_compare2.loc[row_index1, 'ML Train Accuracy'] = round(ml2.score(x_train, y_train), 4)
            ML_compare2.loc[row_index1, 'ML Test Accuracy'] = round(ml2.score(x_test, y_test), 4)
            # ML_compare2.loc[row_index1, 'ML Precission'] = precision_score(y_test, pred_test)
            # ML_compare2.loc[row_index1, 'ML Recall'] = recall_score(y_test, pred_test)
            # ML_compare2.loc[row_index1, 'ML AUC'] = auc(fp, tp)
            # ML_compare2.loc[row_index1, 'ML f1'] = f1_score(y_test, pred_test)
            ML_compare2.loc[row_index1, 'Imputation Name'] = MLA_name
            row_index1 += 1
    return ML_compare2, MLA_compare2

with open('bc_class.tex', 'w') as tf:
    tf.write(ML_compare2.to_latex())
#res = results_class(b, 'Severity')
try1 = real_imputation_class(b, 'Severity')

def results_class2(df, target):
    ML_classifier = [
        #KNeighborsClassifier(),
        #RandomForestClassifier(),
        #LogisticRegression(),
        XGBClassifier()]
    X1_missing = df.copy()
    y = X1_missing[target]
    cols = random.sample(list(X1_missing.columns)[:-1], random.randint(1, len(X1_missing.columns) - 2))
    # Randomly remove percent
    perclist = []
    for i in range(len(cols)):
        perclist.append(random.random())
    mylist = []
    for perc in perclist:
        mylist.append(random.sample(range(1, len(y)), round(perc * len(y) / 2)))
    counter = 0
    for col in cols:
        X1_missing.loc[mylist[counter], col] = np.nan
        counter += 1
    MLA_columns = []
    MLA_compare2 = pd.DataFrame(columns=MLA_columns)
    row_index = 0
    ML_columns = []
    ML_compare2 = pd.DataFrame(columns=ML_columns)
    row_index1 = 0
    for alg in MLA:
        X1_full = np.array(df.copy())
        X1_incomplete = np.array(X1_missing.copy())
        X_filled = alg.fit_transform(X1_incomplete)
        temp = pd.DataFrame(X_filled, columns=X1_missing.columns)
        MLA_name = alg.__class__.__name__
        MLA_compare2.loc[row_index, 'Imputation Name'] = MLA_name
        mylist1 = np.array([x for x in np.array(temp[X1_missing.isna()]).flatten() if ~np.isnan(x)])
        mylist2 = np.array([x for x in np.array(df[X1_missing.isna()]).flatten() if ~np.isnan(x)])
        MLA_compare2.loc[row_index, 'MAE'] = (abs(mylist1 - mylist2)).sum() / len(mylist1)
        MLA_compare2.loc[row_index, 'RMSE'] = math.sqrt(((mylist1 - mylist2) ** 2).sum() / len(mylist1))
        row_index += 1
        for ml in ML_classifier:
            x_train, x_test, y_train, y_test = train_test_split(temp.drop(columns=target), y, test_size=0.25, ran
            predicted = ml.fit(x_train, y_train).predict(x_test)
            fp, tp, th = roc_curve(y_test, predicted)
            ML_name = ml.__class__.__name__
            ML_compare2.loc[row_index1, 'ML Name'] = ML_name
            ML_compare2.loc[row_index1, 'ML Train Accuracy'] = round(ml.score(x_train, y_train), 4)
            ML_compare2.loc[row_index1, 'ML Test Accuracy'] = round(ml.score(x_test, y_test), 4)
            ML_compare2.loc[row_index1, 'ML Precission'] = precision_score(y_test, predicted)
            ML_compare2.loc[row_index1, 'ML Recall'] = recall_score(y_test, predicted)
            ML_compare2.loc[row_index1, 'ML AUC'] = auc(fp, tp)
            ML_compare2.loc[row_index1, 'ML f1'] = f1_score(y_test, predicted)
            ML_compare2.loc[row_index1, 'Imputation Name'] = MLA_name
            row_index1 += 1
    temp2 = df.copy()
    temp2_incomplete = X1_missing.copy()
    XGB = XGB_impute(temp2_incomplete, 5)
    XGB.name = "XGB"
    XGB2 = XGB_impute2(temp2_incomplete, 5)
    XGB2.name = "XGB2"
    XY_incomplete = np.array(temp2_incomplete)
    n_imputations = 5
    XY_completed = []
    for i in range(n_imputations):
        imputer = IterativeImputer(n_iter=5, sample_posterior=True, random_state=i)
        XY_completed.append(imputer.fit_transform(XY_incomplete))
    XY_completed_mean = pd.DataFrame(np.mean(XY_completed, 0), columns=temp2.columns)
    XY_completed_mean.name = "MICE"
    impute_data = [XGB, XY_completed_mean]
    ML_classifier = [
        KNeighborsClassifier(),
        RandomForestClassifier(),
        LogisticRegression(),
        XGBClassifier()]
    for data in impute_data:
        MLA_name = data.name
        MLA_compare2.loc[row_index, 'Imputation Name'] = MLA_name
        mylist1 = np.array([x for x in np.array(data[X1_missing.isna()]).flatten() if ~np.isnan(x)])
        mylist2 = np.array([x for x in np.array(df[X1_missing.isna()]).flatten() if ~np.isnan(x)])
        MLA_compare2.loc[row_index, 'MAE'] = (abs(mylist1 - mylist2)).sum() / len(mylist1)
        MLA_compare2.loc[row_index, 'RMSE'] = math.sqrt(((mylist1 - mylist2) ** 2).sum() / len(mylist1))
        row_index += 1
        x_train, x_test, y_train, y_test = train_test_split(data.drop(columns=target), y, test_size=0.25, random
        for ml2 in ML_classifier:
            pred_train = ml2.fit(x_train, y_train).predict(x_train)
            pred_test = ml2.predict(x_test)
            # ROC points for this model and this imputed data set
            fp, tp, th = roc_curve(y_test, pred_test)
            ML_name = ml2.__class__.__name__
            ML_compare2.loc[row_index1, 'ML Name'] = ML_name
            ML_compare2.loc[row_index1, 'ML Train Accuracy'] = round(ml2.score(x_train, y_train), 4)
            ML_compare2.loc[row_index1, 'ML Test Accuracy'] = round(ml2.score(x_test, y_test), 4)
            ML_compare2.loc[row_index1, 'ML Precission'] = precision_score(y_test, pred_test)
            ML_compare2.loc[row_index1, 'ML Recall'] = recall_score(y_test, pred_test)
            ML_compare2.loc[row_index1, 'ML AUC'] = auc(fp, tp)
            ML_compare2.loc[row_index1, 'ML f1'] = f1_score(y_test, pred_test)
            ML_compare2.loc[row_index1, 'Imputation Name'] = MLA_name
            row_index1 += 1
    return ML_compare2, MLA_compare2
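
The classification function above passes hard class predictions to `roc_curve`, so the resulting curve has a single interior point. AUC is more commonly computed from continuous scores, for example the positive-class column of a scikit-learn classifier's `predict_proba` output. The sketch below illustrates the difference on a toy example using only NumPy. The function name `pairwise_auc` and the data are assumptions made for illustration; the function computes AUC as the fraction of positive/negative pairs ranked correctly, counting ties as one half.

```python
import numpy as np


def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs where the positive scores higher."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Compare every positive score with every negative score.
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))


y_true = [0, 0, 1, 1, 1]
scores = [0.2, 0.6, 0.55, 0.7, 0.9]

# AUC from the continuous scores.
auc_from_scores = pairwise_auc(y_true, scores)

# AUC after thresholding the scores at 0.5, as happens when hard labels are used.
hard_labels = (np.asarray(scores) >= 0.5).astype(int)
auc_from_labels = pairwise_auc(y_true, hard_labels)

print(auc_from_scores, auc_from_labels)
```

Thresholding discards the ordering among observations on the same side of the cut-off, which is why the two values differ here.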

temp, skew_cols = fix_skew(bc)
with open('skew_cols.tex', 'w') as tf:
    tf.write(skew_cols.to_latex())
skew_res = df4[(df4["Imputation Name"] == "XGB") | (df4["Imputation Name"] == "XGB2")]
skew_pred = df3[(df3["Imputation Name"] == "XGB") | (df3["Imputation Name"] == "XGB2")]
with open('skew_imputation.tex', 'w') as tf:
    tf.write(skew_res.to_latex())

def fix_skew2(df):
    temp = df.copy()
    numeric_feats = temp.dtypes[temp.dtypes != "object"].index
    skewed_feats = temp[numeric_feats].apply(lambda x: skew(x.dropna())).sort_values(ascending=False)
    skewness = skewed_feats[abs(skewed_feats) > 1]
    skewed_features = skewness.index
    pt = PowerTransformer(method='yeo-johnson')
    for feat in skewed_features:
        temp[feat] = pt.fit_transform(np.array(temp[feat]).reshape(-1, 1))
    print(skewness.shape[0], "skewed numerical features have been transformed")
    skewed_feats2 = temp[numeric_feats].apply(lambda x: skew(x.dropna())).sort_values(ascending=False)

################################################
# Generated Data                               #
################################################
def results_reg_simulation(df, target, X1_missing, skew=False, outlier=False):
    if skew == True:
        df = fix_skew(df)
    else: pass
    if outlier == True:
        df = remove_outlier(df)
    else: pass
    ML_regressor = [
        KNeighborsRegressor(),
        RandomForestRegressor(),
        LinearRegression(),
        XGBRegressor()]
    y = df[target]
    MLA_columns = []
    MLA_compare2 = pd.DataFrame(columns=MLA_columns)
    row_index = 0
    ML_columns = []
    ML_compare2 = pd.DataFrame(columns=ML_columns)
    row_index1 = 0
    for alg in MLA:
        X1_full = np.array(df.copy())
        X1_incomplete = np.array(X1_missing.copy())
        X_filled = alg.fit_transform(X1_incomplete)
        temp = pd.DataFrame(X_filled, columns=X1_missing.columns)
        MLA_name = alg.__class__.__name__
        MLA_compare2.loc[row_index, 'Imputation Name'] = MLA_name
        mylist1 = np.array([x for x in np.array(temp[X1_missing.isna()]).flatten() if ~np.isnan(x)])
        mylist2 = np.array([x for x in np.array(df[X1_missing.isna()]).flatten() if ~np.isnan(x)])
        MLA_compare2.loc[row_index, 'MAE'] = (abs(mylist1 - mylist2)).sum() / len(mylist1)
        MLA_compare2.loc[row_index, 'RMSE'] = math.sqrt(((mylist1 - mylist2) ** 2).sum() / len(mylist1))
        row_index += 1
        for ml in ML_regressor:
            x_train, x_test, y_train, y_test = train_test_split(temp.drop(columns=target), y, test_size=0.25, ran
            predicted = ml.fit(x_train, y_train).predict(x_test)
            pred_train = ml.predict(x_train)
            ML_name = ml.__class__.__name__
            ML_compare2.loc[row_index1, 'ML Name'] = ML_name
            ML_compare2.loc[row_index1, 'ML Train MAE'] = round(mean_absolute_error(pred_train, y_train), 6)
            ML_compare2.loc[row_index1, 'ML Train RMSE'] = round(np.sqrt(mean_squared_error(pred_train, y_train)),
            ML_compare2.loc[row_index1, 'ML Test MAE'] = round(mean_absolute_error(predicted, y_test), 6)
            ML_compare2.loc[row_index1, 'ML Test RMSE'] = round(np.sqrt(mean_squared_error(predicted, y_test)), 6)
            ML_compare2.loc[row_index1, 'Imputation Name'] = MLA_name
            row_index1 += 1
    temp2 = df.copy()
    temp2_incomplete = X1_missing.copy()
    XGB = XGB_impute(temp2_incomplete, 5)
    XGB.name = "XGB"
    XY_incomplete = np.array(temp2_incomplete)
    n_imputations = 5
    XY_completed = []
    for i in range(n_imputations):
        imputer = IterativeImputer(n_iter=5, sample_posterior=True, random_state=i)
        XY_completed.append(imputer.fit_transform(XY_incomplete))
    XY_completed_mean = pd.DataFrame(np.mean(XY_completed, 0), columns=temp2.columns)
    XY_completed_mean.name = "MICE"
    impute_data = [XGB, XY_completed_mean]
    ML_regressor = [
        KNeighborsRegressor(),
        RandomForestRegressor(),
        LinearRegression(),
        XGBRegressor()]
    for data in impute_data:
        MLA_name = data.name
        MLA_compare2.loc[row_index, 'Imputation Name'] = MLA_name
        mylist1 = np.array([x for x in np.array(data[X1_missing.isna()]).flatten() if ~np.isnan(x)])
        mylist2 = np.array([x for x in np.array(df[X1_missing.isna()]).flatten() if ~np.isnan(x)])
        MLA_compare2.loc[row_index, 'MAE'] = (abs(mylist1 - mylist2)).sum() / len(mylist1)
        MLA_compare2.loc[row_index, 'RMSE'] = math.sqrt(((mylist1 - mylist2) ** 2).sum() / len(mylist1))
        row_index += 1
        x_train, x_test, y_train, y_test = train_test_split(data.drop(columns=target), y, test_size=0.25, random
        for ml2 in ML_regressor:
            pred_train = ml2.fit(x_train, y_train).predict(x_train)
            pred_test = ml2.predict(x_test)
            ML_name = ml2.__class__.__name__
            ML_compare2.loc[row_index1, 'ML Name'] = ML_name
            ML_compare2.loc[row_index1, 'ML Train MAE'] = round(mean_absolute_error(pred_train, y_train), 6)
            ML_compare2.loc[row_index1, 'ML Train RMSE'] = round(np.sqrt(mean_squared_error(pred_train, y_train)),
            ML_compare2.loc[row_index1, 'ML Test MAE'] = round(mean_absolute_error(pred_test, y_test), 6)
            ML_compare2.loc[row_index1, 'ML Test RMSE'] = round(np.sqrt(mean_squared_error(pred_test, y_test)), 6)
            ML_compare2.loc[row_index1, 'Imputation Name'] = MLA_name
            row_index1 += 1
    return ML_compare2, MLA_compare2

X, y = make_regression(n_samples=10000, bias=.1, noise=1.5, effective_rank=7, n_features=12)
X = pd.DataFrame(X)
X.columns = X.columns.astype(str)
X['y'] = y
mylist = []
for perc in perclist:
    mylist.append(random.sample(range(1, len(y)), round(perc * len(y) / 2)))
X_missing = X.copy()
counter = 0
for col in cols:
    X_missing.loc[mylist[counter], col] = np.nan
    counter += 1
miss_perc = X.isna().sum() / len(X)
df1, df2 = results_reg_simulation(X, 'y', X_missing)

X, y = make_regression(n_samples=1000, bias=.1, noise=1.5, effective_rank=7, n_features=12)
X = pd.DataFrame(X)
X.columns = X.columns.astype(str)
X['y'] = y
mylist = []
for perc in perclist:
    mylist.append(random.sample(range(1, len(y)), round(perc * len(y) / 2)))
X_missing = X.copy()
counter = 0
for col in cols:
    X_missing.loc[mylist[counter], col] = np.nan
    counter += 1
df3, df4 = results_reg_simulation(X, 'y', X_missing)

X, y = make_regression(n_samples=100, bias=.1, noise=1.5, effective_rank=3, n_features=12)
X = pd.DataFrame(X)
X.columns = X.columns.astype(str)
X['y'] = y
mylist = []
for perc in perclist:
    mylist.append(random.sample(range(1, len(y)), round(perc * len(y) / 2)))
X_missing = X.copy()
counter = 0
for col in cols:
    X_missing.loc[mylist[counter], col] = np.nan
    counter += 1
df5, df6 = results_reg_simulation(X, 'y', X_missing)

with open('10k_pred.tex', 'w') as tf:
    tf.write(df1.to_latex())
with open('1k_pred.tex', 'w') as tf:
    tf.write(df2.to_latex())
with open('100_pred.tex', 'w') as tf:
    tf.write(df3.to_latex())
with open('10k_imp.tex', 'w') as tf:
    tf.write(df4.to_latex())
with open('1k_imp.tex', 'w') as tf:
    tf.write(df5.to_latex())

################################################
# Simulate Classification data                 #
################################################
X, y = make_regression(n_samples=10000, bias=.1, noise=1.5, effective_rank=7, n_features=12)
X = pd.DataFrame(X)
X.columns = X.columns.astype(str)
X['y'] = y
mylist = []
for perc in perclist:
    mylist.append(random.sample(range(1, len(y)), round(perc * len(y) / 2)))
X_missing = X.copy()
counter = 0
for col in cols:
    X_missing.loc[mylist[counter], col] = np.nan
    counter += 1
miss_perc = X.isna().sum() / len(X)
df1, df2 = results_reg_simulation(X, 'y', X_missing)

X, y = make_regression(n_samples=1000, bias=.1, noise=1.5, effective_rank=7, n_features=12)
X = pd.DataFrame(X)
X.columns = X.columns.astype(str)
X['y'] = y
mylist = []
for perc in perclist:
    mylist.append(random.sample(range(1, len(y)), round(perc * len(y) / 2)))
X_missing = X.copy()
counter = 0
for col in cols:
    X_missing.loc[mylist[counter], col] = np.nan
    counter += 1
df3, df4 = results_reg_simulation(X, 'y', X_missing)

X, y = make_regression(n_samples=100, bias=.1, noise=1.5, effective_rank=3, n_features=12)
X = pd.DataFrame(X)
X.columns = X.columns.astype(str)
X['y'] = y
mylist = []
for perc in perclist:
    mylist.append(random.sample(range(1, len(y)), round(perc * len(y) / 2)))
X_missing = X.copy()
counter = 0
for col in cols:
    X_missing.loc[mylist[counter], col] = np.nan
    counter += 1
df5, df6 = results_reg_simulation(X, 'y', X_missing)

with open('10k_pred.tex', 'w') as tf:
    tf.write(df1.to_latex())
with open('1k_pred.tex', 'w') as tf:
    tf.write(df2.to_latex())
with open('100_pred.tex', 'w') as tf:
    tf.write(df3.to_latex())
with open('10k_imp.tex', 'w') as tf:
    tf.write(df4.to_latex())
with open('1k_imp.tex', 'w') as tf:
    tf.write(df5.to_latex())

#######################################################
# Calculate time to run OHE vs LE & compare results   #
#######################################################
mel_data = pd.read_csv("melb_data.csv")
mel_data['Date'] = pd.to_datetime(mel_data['Date'])
mel_data['Year'] = mel_data['Date'].dt.year
mel_data['Month'] = mel_data['Date'].dt.month
mel_data['Month'] = mel_data['Month'].astype(object)
mel_data['Year'] = mel_data['Year'].astype(object)
mel_data['Postcode'] = mel_data['Postcode'].astype(object)
mel_data['Age'] = 2017 - mel_data['YearBuilt']
abc_list = []
classes = 10
for i in range(97, 123):
    abc_list.append(str(chr(i)))
train_lon, lon_bins = pd.qcut(mel_data["Longtitude"], classes, retbins=True, labels=abc_list[0:classes])
train_lat, lat_bins = pd.qcut(mel_data["Lattitude"], classes, retbins=True, labels=abc_list[0:classes])
train_lon = train_lon.astype(object)
train_lat = train_lat.astype(object)
mel_data["grid"] = train_lon + train_lat
le = preprocessing.LabelEncoder()
le.fit(mel_data["grid"])
mel_data["grid"] = le.transform(mel_data["grid"])
mel_data['Large'] = mel_data['Bedroom2'].apply(lambda x: 1 if x > 4 else 0)
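
The preceding lines turn longitude and latitude into a single categorical "grid" feature by cutting each coordinate into quantile bins, labelling the bins with letters, and concatenating the two letters. A minimal sketch of that construction on synthetic coordinates follows. The column names `lon` and `lat`, the bin count of 4, and the use of `pd.factorize` in place of `LabelEncoder` are assumptions made so that the example depends only on NumPy and pandas.

```python
import numpy as np
import pandas as pd

# Synthetic coordinates (hypothetical stand-ins for the longitude and latitude columns).
rng = np.random.default_rng(0)
coords = pd.DataFrame({
    "lon": rng.uniform(144.5, 145.5, size=200),
    "lat": rng.uniform(-38.2, -37.5, size=200),
})

n_bins = 4
labels = [chr(ord("a") + k) for k in range(n_bins)]  # 'a', 'b', 'c', 'd'

# Quantile bins: each label covers roughly the same number of rows.
lon_bin = pd.qcut(coords["lon"], n_bins, labels=labels).astype(object)
lat_bin = pd.qcut(coords["lat"], n_bins, labels=labels).astype(object)

# Concatenate the two letters into one cell identifier such as 'ad'.
coords["grid"] = lon_bin + lat_bin

# Integer codes for the cells, assigned in sorted order of the cell names.
coords["grid_code"], cell_names = pd.factorize(coords["grid"], sort=True)

print(coords.head())
```

Because the bins are quantiles rather than equal-width intervals, densely populated areas are split into smaller cells than sparse ones.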
cat_features = df.columns[np.where(df.dtypes == object)[0]]
df = label(df, cat_features)
temp2_incomplete = mel_data.copy()
start1 = timer()
XGB = XGB_impute(temp2_incomplete, 5)
end1 = timer()
time1 = end1 - start1
XGB.name = "XGB"
start2 = timer()
XGB2 = XGB_impute2(temp2_incomplete, 5)
end2 = timer()
time2 = end2 - start2
XGB2.name = "XGB2"
impute_data = [XGB, XGB2]
ML_classifier = [
    KNeighborsClassifier(),
    RandomForestClassifier(),
    LogisticRegression()]
MLA_columns = []
MLA_compare2 = pd.DataFrame(columns=MLA_columns)
row_index = 0
ML_columns = []
ML_compare2 = pd.DataFrame(columns=ML_columns)
row_index1 = 0
for alg in MLA:
    X1_incomplete = np.array(df.copy())
    X_filled = alg.fit_transform(X1_incomplete)
    temp = pd.DataFrame(X_filled, columns=X1_missing.columns)
    MLA_name = alg.__class__.__name__
    row_index += 1
    for ml in ML_regressor:
        x_train, x_test, y_train, y_test = train_test_split(data.drop(columns='Price'), mel_data['Price'], test_size
        predicted = ml.fit(x_train, y_train).predict(x_test)
        pred_train = ml.predict(x_train)
        ML_name = ml.__class__.__name__
        ML_compare2.loc[row_index1, 'ML Name'] = ML_name
        ML_compare2.loc[row_index1, 'ML Train MAE'] = round(mean_absolute_error(pred_train, y_train), 6)
        ML_compare2.loc[row_index1, 'ML Train RMSE'] = round(np.sqrt(mean_squared_error(pred_train, y_train)), 6)
        ML_compare2.loc[row_index1, 'ML Test MAE'] = round(mean_absolute_error(predicted, y_test), 6)
        ML_compare2.loc[row_index1, 'ML Test RMSE'] = round(np.sqrt(mean_squared_error(predicted, y_test)), 6)
        ML_compare2.loc[row_index1, 'Imputation Name'] = MLA_name
        row_index1 += 1
for data in impute_data:
    x_train, x_test, y_train, y_test = train_test_split(data.drop(columns='Price'), mel_data['Price'], test_size =
    for ml2 in ML_regressor:
        pred_train = ml2.fit(x_train, y_train).predict(x_train)
        pred_test = ml2.predict(x_test)
        ML_name = ml2.__class__.__name__
        ML_compare2.loc[row_index1, 'ML Name'] = ML_name
        ML_compare2.loc[row_index1, 'ML Train MAE'] = round(mean_absolute_error(pred_train, y_train), 6)
        ML_compare2.loc[row_index1, 'ML Train RMSE'] = round(np.sqrt(mean_squared_error(pred_train, y_train)),
        ML_compare2.loc[row_index1, 'ML Test MAE'] = round(mean_absolute_error(pred_test, y_test), 6)
        ML_compare2.loc[row_index1, 'ML Test RMSE'] = round(np.sqrt(mean_squared_error(pred_test, y_test)), 6)
        ML_compare2.loc[row_index1, 'Imputation Name'] = MLA_name
        row_index1 += 1
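
The section above times two imputation runs with `timer()` calls placed around each one, under a heading that refers to one-hot encoding (OHE) versus label encoding (LE). The sketch below applies the same start/stop timing pattern to the two encodings themselves on a synthetic categorical column. It assumes that `timer` is `timeit.default_timer`, which the listing does not show, and it uses `pd.factorize` and `pd.get_dummies` as stand-ins for whichever encoders the thesis code used.

```python
from timeit import default_timer as timer

import numpy as np
import pandas as pd

# Synthetic categorical column with up to ten distinct levels (illustrative data only).
rng = np.random.default_rng(0)
cat = pd.Series(rng.choice(list("abcdefghij"), size=5000), name="grid")

# Label encoding: one integer column, codes assigned in sorted order of the levels.
start = timer()
label_encoded = pd.Series(pd.factorize(cat, sort=True)[0], name="grid")
le_seconds = timer() - start

# One-hot encoding: one indicator column per level.
start = timer()
one_hot = pd.get_dummies(cat, prefix="grid")
ohe_seconds = timer() - start

print(f"label encoding: 1 column, {le_seconds:.6f} s")
print(f"one-hot encoding: {one_hot.shape[1]} columns, {ohe_seconds:.6f} s")
```

The encoding step itself is cheap in both cases. Any larger difference in run time would come from the downstream model having to process one column per level instead of a single column.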