prediction system based on classification (predictive) data mining methods, namely the Decision Tree algorithm, the Naïve Bayes algorithm and the SMO Support Vector Machine algorithm.

II. LITERATURE REVIEW
A systematic review of research findings and applications of data mining techniques in the field of diabetes was carried out in order to identify the present status of the research question, the research gap, and the alternatives. Our main objectives were to identify research goals, diabetes types, data mining methods, data mining software and technologies, data sets and outcomes. Based on that review, we have developed a novel approach to predict diabetes using data mining technologies.

Huge amounts of data produced by healthcare transactions are too complex and voluminous to be processed and analysed by conventional methods (Cunningham and Holmes, 1999). Data mining, however, is capable of extracting hidden knowledge from complex data repositories such as research reports, flow charts, evidence tables and medical reports, and transforming it into useful information for decision making.

Breault and colleagues applied classification and regression trees (CART), using the CART data mining software, to data from 15,902 diabetes patients and found that the most important variable related to poor glycaemic control (HbA1c > 9.5) is age (Marinov et al., 2011). Patients below the threshold of 65.6 years had worse glycaemic control than older patients, which was very surprising to clinicians. Using this knowledge, they targeted the specific age groups that are more likely to have poor glycaemic control. However, they only established that age is the most valuable variable for glycaemic control under the CART algorithm; there may be other important variables too, so additional methods are needed to discover them.

Miyaki and colleagues (Mehrpoor et al., 2014) conducted a study to find the best predictors of diabetic vascular complications using CART on data from 165 type 2 diabetes mellitus (T2DM) patients. The authors found that age (cut-off: 65.4 years) was the best predictor and that, depending on the age group, the second-best predictor was body weight (cut-off: 53.9 kg) for patients above 65.4 years or systolic blood pressure for patients below 65.4 years. Here they have gone a few steps further.

Aiswarya and colleagues carried out research on "Diagnosis of Diabetes Using Classification Mining Techniques" (Iyer et al., 2015). According to them, diabetes has affected over 246 million individuals worldwide, most of them women, and according to the WHO report this number is anticipated to rise to over 380 million by 2025. Their paper focused on analysing patterns with the Decision Tree and Naïve Bayes data mining algorithms on a diabetes dataset. They used 70:30 percentage split and 10-fold cross-validation techniques to build their models, obtaining 76.9565% accuracy with the Decision Tree and 79.5652% accuracy with Naïve Bayes. These results confirm that classification data mining methods are well suited to the prediction of diabetes. However, there is no evidence that they developed a system that can predict the risk level of a patient in real time; they only analysed the two algorithms using classifier models. We, in contrast, have developed a system using classification data mining techniques which can diagnose the diabetic risk level of a patient.

Thirugnanam and colleagues used a Fuzzy, Neural Network, Case-Based (FNC) approach to predict the rate of diabetes (Thirugnanam et al., 2012). They present a novel approach that applies the computational intelligence and knowledge engineering techniques of neural networks (N), fuzzy logic (F) and case-based reasoning (C), first as individual approaches, and then applies a rule-based algorithm at the final prediction stage to the values obtained from the initial stage. The benefit they claim is that the prediction accuracy is higher than that of other diabetes prediction algorithms. However, a neural network is somewhat slow because it requires considerable time to train, and diabetes can become critical in some stages, so a quicker solution is needed. That is one of the reasons we moved to a classification-based approach: once the problem is known, results can be obtained quickly.

A case study on "Utilization of Data Mining Techniques for Diagnosis of Diabetes Mellitus" was carried out at the Coimbatore Institute of Engineering and Technology (Thirumal and Nagarajan, 2006). This research was based on elderly diabetes patients. They found that the risk of diabetes is lower when patients are regularly given assessment and treatment plans that suit their needs and lifestyle, and that straightforward awareness measures such as a correct, low-sugar diet help to avoid obesity. The goal of the study was to identify the algorithms that best describe the given data from multiple aspects. Several data mining algorithms were used to test the dataset: Naïve Bayes, decision trees, k-nearest neighbour and SVM are discussed and tested on the Pima Indian diabetes dataset. The accuracy of these models needs to be evaluated before they are used, and when the available data are limited, estimating accuracy becomes a difficult task. Table 1 shows the accuracies of the algorithms obtained from the confusion matrix.

Table 1: Accuracies of algorithms

Algorithm      Accuracy (%)   TP      FP      Precision   Recall
Naïve Bayes    77.8646        0.83    0.317   0.83        0.83
C4.5           78.2552        0.864   0.369   0.814       0.864
SVM            77.474         0.775   0.309   0.77        0.775
kNN            77.7344        0.892   0.437   0.792       0.892
From the experiments, it is concluded that kNN gives lower accuracy than the alternative algorithms because it stores the training examples and delays processing until a new instance has to be classified. The speed of an algorithm also matters when judging its efficiency, which indicates that the other classification algorithms are better suited than kNN to this problem domain. Here, too, the authors only performed an analysis with the WEKA data mining tool to see which algorithm gives better results; it is another case study focused on finding the best data mining algorithms for diabetes-related data. Considering these facts and the previous diabetes-related research, we have developed a system which gives a real-time prediction about whether the patient has diabetes or not.
III. METHODOLOGY
Previous studies have used only a single approach to identify the disease. We have instead combined three classification algorithms through a voting mechanism to increase the accuracy of the model. If one algorithm does not predict correctly, it does not decide the final prediction on its own, because the system also considers the predictions of the other two algorithms and takes the majority decision. This ensures higher accuracy than a single algorithm.
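To make the voting step concrete, the sketch below uses the WEKA 3 Java API's Vote meta-classifier with a majority-voting combination rule over the three base learners. The file name and the choice of instance to classify are placeholder assumptions for illustration, not the exact code of the implemented system.

import weka.classifiers.Classifier;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.functions.SMO;
import weka.classifiers.meta.Vote;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.SelectedTag;
import weka.core.converters.ConverterUtils.DataSource;

public class VotingDiabetesClassifier {
    public static void main(String[] args) throws Exception {
        // Load the Pima Indians diabetes data set (path is a placeholder).
        Instances data = DataSource.read("pima-diabetes.arff");
        data.setClassIndex(data.numAttributes() - 1);   // class attribute is the last one

        // The three base classifiers used in this system.
        Classifier[] base = { new J48(), new NaiveBayes(), new SMO() };

        // Combine them with majority voting, as described above.
        Vote ensemble = new Vote();
        ensemble.setClassifiers(base);
        ensemble.setCombinationRule(
                new SelectedTag(Vote.MAJORITY_VOTING_RULE, Vote.TAGS_RULES));

        ensemble.buildClassifier(data);

        // Predict the class of one record (index 0 stands in for a patient's input).
        double predicted = ensemble.classifyInstance(data.instance(0));
        System.out.println("Predicted class: " + data.classAttribute().value((int) predicted));
    }
}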
A. Decision Tree J48 Algorithm
A decision tree is basically a tree structure (Han and Kamber, 2006) that has the form of a flowchart. It can be used as a method for classification and prediction, with a representation using nodes and internodes: the root and internal nodes are the test cases, while the leaf nodes represent the class values. Figure 1 shows a sample decision tree structure. J48 is a Java-based implementation of the C4.5 decision tree algorithm and is used here to build the classification model. It works as follows: in order to classify a new item, it first creates a decision tree based on the attribute values of the available training data set. Every node of the decision tree is generated by selecting the attribute with the highest information gain. If any attribute gives an unambiguous outcome (an explicit classification of the class attribute), the branch for that attribute is terminated and the target value is assigned to it. We have used the 12-fold cross-validation technique to build the model with this algorithm. Briefly, it works as follows:

Break the data into 12 sets of size n/12.
Train on 11 of the sets and test on the remaining one.
Repeat 12 times and take the mean accuracy.

In 12-fold cross-validation, the original sample is randomly partitioned into 12 equal-sized subsamples. Of the 12 subsamples, a single subsample is retained as the validation data for testing the model, and the remaining 11 subsamples are used as training data.
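The cross-validation step can be reproduced with WEKA's built-in evaluation support. The snippet below is a minimal sketch assuming the same Pima Indians ARFF file as above (the path and random seed are placeholders), rather than the exact code used in the system.

import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class J48CrossValidation {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("pima-diabetes.arff"); // placeholder path
        data.setClassIndex(data.numAttributes() - 1);

        J48 tree = new J48();   // C4.5 decision tree learner

        // 12-fold cross-validation: WEKA partitions the data into 12 folds,
        // trains on 11 folds, tests on the held-out fold, and averages the results.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 12, new Random(1));

        System.out.println(eval.toSummaryString("\n12-fold CV results for J48\n", false));
        System.out.println("Accuracy: " + eval.pctCorrect() + " %");
    }
}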
B. Naïve Bayes Algorithm
The Naïve Bayes classifier algorithm is based on the Bayes rule of conditional probability. It uses all the attributes contained in the data and analyses them individually, as though they were equally important and independent of each other. Various existing data mining solutions find relations between diseases, their symptoms and the corresponding medications, but those algorithms have their own limitations, such as binning of continuous attributes, numerous iterations and high computational time. The Naïve Bayes classifier, by contrast, affords fast, highly scalable model building and scoring, and the build process for Naïve Bayes is parallelisable. It avoids complex iterative parameter estimation, so it can be applied to a large dataset in real time. The formula used by the algorithm is shown below.
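In its standard form (stated here for completeness), the classifier assigns to an instance with attribute values x_1, ..., x_n the class c that maximises the posterior probability under the conditional-independence assumption:

\[
\hat{c} \;=\; \arg\max_{c}\; P(c)\prod_{i=1}^{n} P(x_i \mid c),
\qquad\text{which follows from}\qquad
P(c \mid x_1,\dots,x_n) \;=\; \frac{P(c)\prod_{i=1}^{n} P(x_i \mid c)}{P(x_1,\dots,x_n)}
\]

after dropping the denominator, which does not depend on the class.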
C. SMO Support Vector Machine Algorithm
The Sequential Minimal Optimization (SMO) algorithm trains a support vector machine by breaking the large quadratic programming problem into a series of small sub-problems that can be solved analytically. The SMO implementation replaces all missing values and transforms nominal attributes into binary ones. It also normalises all attributes by default, which helps to speed up the training process. We have used the 70:30 percentage split technique to train and test the data set with this model. Here we are not only considering the accuracy; the model should also handle missing values well. This algorithm does that very accurately because it uses heuristics to partition the training problem into smaller problems, which is the main reason we selected it.
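A minimal sketch of the 70:30 percentage split evaluation with WEKA's SMO classifier is given below; the file path and random seed are placeholder assumptions, illustrating the step described above rather than reproducing the system's exact code.

import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.functions.SMO;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class SmoPercentageSplit {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("pima-diabetes.arff"); // placeholder path
        data.setClassIndex(data.numAttributes() - 1);
        data.randomize(new Random(1));

        // 70:30 percentage split: first 70% for training, remaining 30% for testing.
        int trainSize = (int) Math.round(data.numInstances() * 0.70);
        int testSize = data.numInstances() - trainSize;
        Instances train = new Instances(data, 0, trainSize);
        Instances test = new Instances(data, trainSize, testSize);

        SMO smo = new SMO();   // replaces missing values and normalises by default
        smo.buildClassifier(train);

        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(smo, test);
        System.out.println(eval.toSummaryString("\nSMO 70:30 split results\n", false));
    }
}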
IV. EXPERIMENTAL DESIGN
This section explains the overall design of the system and the process it follows in order to produce the prediction.

D. Dataset Used:
The data set we have used is a benchmark dataset, which allows the accuracy and efficiency of our model to be compared with other work. The data were obtained from the Pima Indians Diabetes Database of the National Institute of Diabetes and Digestive and Kidney Diseases.
1) Inputs:
Number of times pregnant
Plasma glucose concentration 2 hours in an oral glucose tolerance test
Diastolic blood pressure (mm Hg)
Triceps skin fold thickness (mm)
2-Hour serum insulin (mu U/ml)
Body mass index (weight in kg/(height in m)^2)
Diabetes pedigree function
Age (years)
Class variable (0 or 1)
2) Outputs:
Predicted Results (Diagnosed State)
Evaluation Results
Correctly Classified Instances
Incorrectly Classified Instances
Kappa statistic
Mean absolute error
Root mean squared error
Relative absolute error
Root relative squared error
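As a small illustration, the 768-record data set with the attributes listed under Inputs above can be loaded and inspected as follows (a sketch assuming the standard ARFF distribution of the Pima Indians data; the file name is a placeholder):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LoadPimaDataset {
    public static void main(String[] args) throws Exception {
        // Load the 768-instance Pima Indians diabetes data set (placeholder path).
        Instances data = DataSource.read("pima-diabetes.arff");
        data.setClassIndex(data.numAttributes() - 1);  // class variable (0 or 1)

        System.out.println("Instances:  " + data.numInstances());
        System.out.println("Attributes: " + data.numAttributes());
        for (int i = 0; i < data.numAttributes(); i++) {
            System.out.println("  " + data.attribute(i).name());
        }
    }
}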
E. Procedure:
Load the previous data sets into the system (768 test cases).
Perform data pre-processing using the integrated WEKA tool (Witten et al., 2011). The following operations are performed on the dataset:
a. Replace missing values.
b. Normalise the values.
The user then inputs data to the system in order to diagnose whether the disease is present.
Build a model using the J48 Decision Tree algorithm and train it on the data set.
Build a model using the Naïve Bayes algorithm and train it on the data set.
Build a model using the SMO Support Vector Machine algorithm and train it on the data set.
Test the data set using these three models.
Get the evaluation results.
Finally, obtain the predicted vote from all classifiers and give the diagnostic result.

Figure 2: Overview of procedure
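The procedure above can be sketched end to end with the WEKA API. The following is a minimal, assumed reconstruction (file path and positive-class index are placeholders) showing the pre-processing filters, the three models and the final majority vote:

import weka.classifiers.Classifier;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.functions.SMO;
import weka.classifiers.trees.J48;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Normalize;
import weka.filters.unsupervised.attribute.ReplaceMissingValues;

public class DiabetesPredictionPipeline {
    public static void main(String[] args) throws Exception {
        // 1. Load the previous data set (768 cases, placeholder path).
        Instances data = DataSource.read("pima-diabetes.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // 2. Pre-processing: replace missing values, then normalise.
        ReplaceMissingValues replace = new ReplaceMissingValues();
        replace.setInputFormat(data);
        data = Filter.useFilter(data, replace);

        Normalize normalize = new Normalize();
        normalize.setInputFormat(data);
        data = Filter.useFilter(data, normalize);

        // 3. Build and train the three classifier models.
        Classifier[] models = { new J48(), new NaiveBayes(), new SMO() };
        for (Classifier c : models) {
            c.buildClassifier(data);
        }

        // 4. Classify a patient record (the first instance stands in for user input)
        //    and take the majority vote; class index 1 is assumed to be "diabetic".
        Instance patient = data.instance(0);
        int votesForDiabetic = 0;
        for (Classifier c : models) {
            if (c.classifyInstance(patient) == 1.0) {
                votesForDiabetic++;
            }
        }
        String diagnosis = (votesForDiabetic >= 2) ? "diabetic" : "non-diabetic";
        System.out.println("Diagnosed state: " + diagnosis);
    }
}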
The three classifier models were built with the J48 decision tree, Naïve Bayes and SMO support vector machine algorithms. J48 has more than 84% accuracy, and the other two also have more than 76% accuracy, so the system is more accurate than most of the other systems that have been developed. Furthermore, because of the voting process used in this system, the final result is more accurate than the accuracies of the classifiers considered separately, since the system first considers the diagnosed results of all three classifiers and only then gives the final prediction.
F. Confusion Matrix
The confusion matrix contains the information about the predicted and actual classification results of the classifiers. The performance of the classifiers has been evaluated using this matrix.

Table 3: Confusion Matrix

                 Predicted Positive   Predicted Negative
Actual True      TP                   FN
Actual False     FP                   TN
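The measures reported for the classifiers (accuracy, precision, recall/TP rate and FP rate) follow directly from the four counts in this matrix; in their standard form:

\[
\text{Accuracy}=\frac{TP+TN}{TP+TN+FP+FN},\qquad
\text{Precision}=\frac{TP}{TP+FP},\qquad
\text{Recall (TP rate)}=\frac{TP}{TP+FN},\qquad
\text{FP rate}=\frac{FP}{FP+TN}.
\]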
In the ROC curves, the X axis is the true positive rate and the Y axis is the false positive rate. It can be seen that all the ROC curves are skewed towards the true-positive side, which shows that the accuracies of all three classifier models are high and that all three classifiers are appropriate for this task.

Figure 5: ROC curve of the J48 Decision Tree model
V. CONCLUSION
Although all the methods have given more than 75%
accuracy, the Decision Tree and the SMO Support Vector
Machine give more accurate results than the Naïve Bayes
algorithm. However, the ensemble method gives the highest accuracy of all, owing to the voting process across the three algorithms.
References
Cunningham, S.J., Holmes, G., 1999. Developing innovative
applications in agriculture using data mining, in: The
Proceedings of the Southeast Asia Regional Computer
Confederation Conference.
Han, J., Kamber, M., 2006. Data Mining: Concepts and Techniques, 2nd ed., The Morgan Kaufmann Series in Data Management Systems. Morgan Kaufmann/Elsevier, San Francisco, CA.
Iyer, A., S, J., Sumbaly, R., 2015. Diagnosis of Diabetes Using
Classification Mining Techniques. Int. J. Data Min.
Knowl. Manag. Process 5, 01–14.
doi:10.5121/ijdkp.2015.5101
Marinov, M., Mosa, A.S.M., Yoo, I., Boren, S.A., 2011. Data-
mining technologies for diabetes: a systematic review.
J. Diabetes Sci. Technol. 5, 1549–1556.
Mehrpoor, G., Azimzadeh, M.M., Monfared, A., 2014. Data
Mining: A Novel Outlook to Explore Knowledge in
Health and Medical Sciences. Int. J. Travel Med. Glob.
Health 2, 87–90.
Ross, T.J., 2010. Fuzzy Logic with Engineering Applications, 3rd ed. Wiley, Chichester.
Thilakarathna, S., n.d. Nearly Four Million Diabetics in Sri Lanka [WWW Document]. URL http://www.news.lk/news/business/item/5701-nearly-four-million-diabetics-in-sri-lanka (accessed 1.26.16).
Thirugnanam, M., Kumar, P., Srivatsan, S.V., Nerlesh, C.R., 2012.
Improving the Prediction Rate of Diabetes Diagnosis
Using Fuzzy, Neural Network, Case Based (FNC)
Approach. Procedia Eng. 38, 1709–1718.
doi:10.1016/j.proeng.2012.06.208
Thirumal, P.C., Nagarajan, N., 2006. Utilization of Data Mining Techniques for Diagnosis of Diabetes Mellitus - A Case Study.
WHO | Diabetes [WWW Document], n.d. URL http://www.who.int/mediacentre/factsheets/fs312/en/ (accessed 5.22.16).
Witten, I.H., Frank, E., Hall, M.A., 2011. Data Mining: Practical Machine Learning Tools and Techniques, 3rd ed., Morgan Kaufmann Series in Data Management Systems. Morgan Kaufmann, Burlington, MA.