Disease Predictor: Algorithms To Reduce The Faulty Predictions


International Journal of Engineering Applied Sciences and Technology, 2020

Vol. 5, Issue 1, ISSN No. 2455-2143, Pages 174-179


Published Online May 2020 in IJEAST (http://www.ijeast.com)

DISEASE PREDICTOR: ALGORITHMS TO REDUCE THE FAULTY PREDICTIONS

Nikhil Sharma, Mimansha Singh, Yash Gupta
Students, IMS Engineering College
Ghaziabad, India

Abstract -- There are many tools related to disease prediction, but each targets one specific disease, such as heart disease, diabetes or cancer. In general there is no tool that can be used for the prediction of many different diseases. Disease Predictor therefore helps in the prediction of several diseases, making use of different machine learning algorithms to reduce faulty predictions.

Keywords -- Naïve Bayes, Random Forest, Extra Tree, SVM.

I. INTRODUCTION

Disease Predictor is a tool based on machine learning algorithms whose main function is to diagnose the diseases with which a patient may be infected.

According to a report by the World Health Organization (WHO), more than 138 million patients are harmed every year by doctors' errors, among which 2.6 million mistakes result in the death of patients.

There are many tools related to disease prediction, but each targets one specific disease such as heart disease, diabetes or cancer; "Krishnaiah V., Theresa Princy R. et al. (2016, 2016, 2005, 2012, 2014) emphasize the same in their study". In general there is no tool that can be used for the prediction of many different diseases. This tool also helps to improve accuracy in cases where the symptoms of two diseases are hard to differentiate, which can cause prediction faults by doctors and result in serious conditions. So Disease Predictor helps in the prediction of several diseases, making use of different machine learning algorithms to reduce faulty predictions and increase accuracy.

This project aims to provide a platform to predict the occurrence of diseases on the basis of various symptoms: the user selects symptoms and gets back candidate diseases with their probabilistic figures. Here we have used four predefined machine learning algorithms:

● Naïve Bayes
● Extra Tree
● Random Forest
● Support Vector Machine (SVM)

II. LITERATURE SURVEY

Numerous studies have been done that focus on the diagnosis of diseases. They have used different kinds of machine learning algorithms and other strategies, but they fall short in diagnosing more than one disease, and they typically use only a single algorithm, which cannot always give an accurate result. Therefore we propose a tool for diagnosing a larger number of diseases with greater accuracy by the use of multiple machine learning algorithms.

III. DISCUSSION

A. DATASETS:

Taking into account the importance of datasets and their impact on the obtained result, it is crucial to discuss the datasets used in this study. We found it difficult to find any dataset which includes diseases and their symptoms and is sufficient for training the model: if we train the model with too little data, we cannot be sure that it predicts with optimum accuracy. There are many datasets which include diseases along with their symptoms, but the problem with them is that they are rather small for training purposes. So we merged different datasets from many sources so that we could ensure the training is good enough for predicting any disease on the basis of symptoms.

B. DISEASE PREDICTION TECHNIQUES:

There are many machine learning algorithms available for use, but the problem occurs when we have to make a choice among them. Many studies show that for classification problems with more than 2 or 3 classes, algorithms such as Linear Regression, Logistic Regression and KNN do not perform well, while algorithms like Decision Tree, Random Forest, Naive Bayes and Support Vector Machine are accepted as some of the best classification algorithms; "Lakshmi B.N. et al. (2015) emphasize the same in their study". Instead of only using


a Decision Tree, it has also been seen that a combination of Decision Tree and Random Forest performs very well in such scenarios; "Kaur Beant et al. (2014) emphasize the same in their study". Decision trees choose their splits using entropy and information gain:

Entropy(S) = Σ (i = 1 to c) −p_i log2(p_i)

Gain(S, A) = Entropy(S) − Σ (v ∈ Values(A)) (|S_v| / |S|) Entropy(S_v)

So, this study also uses different machine learning algorithms for the prediction of diseases. The other component used is a Graphical User Interface (GUI) made with the help of the tkinter library of Python.
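The two quantities above are straightforward to compute directly. A minimal sketch in plain Python (the toy symptom data and helper names here are invented for illustration, not taken from the paper's dataset):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = -sum over classes of p_i * log2(p_i)."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Gain(S, A) = Entropy(S) - sum over v of (|S_v|/|S|) * Entropy(S_v)."""
    total = len(labels)
    gain = entropy(labels)
    # Partition S by each value v of attribute A.
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute], []).append(label)
    for subset in partitions.values():
        gain -= (len(subset) / total) * entropy(subset)
    return gain

# Toy example: does the "fever" symptom separate the two diseases?
rows = [{"fever": 1}, {"fever": 1}, {"fever": 0}, {"fever": 0}]
labels = ["flu", "flu", "cold", "cold"]
print(entropy(labels))                          # 1.0
print(information_gain(rows, labels, "fever"))  # 1.0
```

Here the "fever" attribute splits the samples into two pure subsets, so the information gain equals the full entropy of the label set.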

Naïve Bayes:

Naive Bayes is a classification algorithm which follows the Bayes rule. It is basically a probabilistic model which produces the probability of occurrence of a particular event.

P(A|B) = P(B|A) P(A) / P(B)

Figure (1): Naïve Bayes Formula

So, by using this classifier we are able to find the probability of an event A occurring when the probability of another event B is known to us. In this probabilistic model we make the assumption that the events/features we are considering are independent of each other; in other words, one particular feature/event does not affect the others. That is why it is known as Naive.

Extra Tree:

Extra Tree is an abbreviation for Extremely Randomized Tree. The Extra Tree classifier is essentially an advanced version of Random Forest, advanced in the sense that it is even more highly randomized.

The Extra Tree classifier also makes use of a forest consisting of many decision trees, and each tree in this forest makes its own prediction. These trees are highly de-correlated, which gives the model the ability to predict the best result.

The difference between the Extra Tree and Random Forest classifiers lies in their tree construction method: an Extra Tree classifier uses the whole dataset for constructing each tree, while Random Forest uses bootstrapping for its tree construction.

The Extra Tree classifier predicts the result in very little time because it does everything by a randomization procedure: the split point in an Extra Tree is also a random selection from the chosen features. It makes wide use of the Gini index for constructing the trees during learning.

Figure (2): Tree Representation

Random Forest:

Random Forest, as its name suggests, is a collection of a large number of individual decision trees that operate as an ensemble. In a random forest, each individual decision tree predicts a class as its output, and the class which has the maximum number of votes becomes the output of the whole model.

Figure (3): Different Decision Trees

In the Random Forest algorithm, having a low correlation among the various decision trees plays a very important role, as these trees do not affect each other and provide a kind of reinforcement for one another. Due to this low correlation, they can produce the best result: the trees protect each other from their individual errors. We cannot ignore the possibility that some of the trees may produce wrong output, but as a whole group they are able to move in the right direction thanks to this reinforcement and the voting. There are some prerequisites which make Random Forest work well. These prerequisites can be defined as follows:


1. The features of the data points should carry a real signal, so that the model can predict the actual output instead of guessing at random.
2. The output produced by each tree should have a low correlation with the others, so that the training can be headed in the right direction.

Support Vector Machine:

SVM, which is an abbreviation for Support Vector Machine, is one of the best classification algorithms. The main principle of a support vector machine is to find a best-fit hyperplane which can easily distinguish among the different classes.

For this study we make use of RBF kernels. RBF kernels make use of radial basis functions for their implementation in order to find the plane. If we have two samples x and x', represented as feature vectors in the input space, then the RBF kernel can be defined as:

K(X, X') = exp(−||X − X'||² / (2σ²))

where ||X − X'||² is the squared Euclidean distance between the two feature vectors and σ is a free parameter.

The above equation can also be written by introducing a special parameter γ = 1 / (2σ²), so that it becomes:

K(X, X') = exp(−γ ||X − X'||²)
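As a quick numerical check, the kernel in this γ form is easy to compute directly. A sketch with NumPy (the sample vectors and the γ value are arbitrary choices for illustration):

```python
import numpy as np

def rbf_kernel(x, x_prime, gamma=0.5):
    """K(X, X') = exp(-gamma * ||X - X'||^2)."""
    sq_dist = np.sum((x - x_prime) ** 2)  # squared Euclidean distance
    return np.exp(-gamma * sq_dist)

x = np.array([1.0, 0.0, 1.0])  # e.g. three binary symptom indicators
y = np.array([1.0, 1.0, 0.0])

print(rbf_kernel(x, x))  # identical vectors -> kernel value 1.0
print(rbf_kernel(x, y))  # squared distance 2, gamma 0.5 -> exp(-1)
```

The kernel value is 1 for identical samples and decays toward 0 as the samples move apart, which is what lets the SVM separate classes that are not linearly separable in the input space.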


Figure (4): Selection of Hyperplane

There are many ways by which we can find hyperplanes, but we have to make a choice among them; this is one of the most important parts of the whole algorithm. To select a particular hyperplane we take the help of vectors. These vectors are also features, but they are the features which lie on the planes that distinguish the classes.

We consider the best plane to be the one with the maximum margin, i.e. the plane which has the maximum distance from the data points of the two different classes. We do so because this helps the methodology by providing some kind of reinforcement, so that other data points can be classified with more confidence.

These hyperplanes are the actual boundary lines which help in classifying the data points. Support vector machines support multi-dimensional planes; the dimension of a plane depends upon the number of features used.

The procedures or functions through which the planes are identified are called kernels. There are many kernels in use. Some of the popular kernels are:

1. Linear kernel
2. RBF kernel
3. Gaussian kernel

IV. CASE STUDY

A. Data Collection

In this study, the dataset of diseases and their symptoms used to train the model is first divided into 2 different datasets: training data and test data.

Our dataset includes 3632 data rows and 133 columns in the training dataset, and 1329 rows with the same 133 columns in the testing dataset.

The features are basically the names of the symptoms: we have included 132 symptoms and 40 different types of diseases in our dataset, which we collected from different datasets used in earlier research.

B. Data Pre-Processing

We applied techniques such that the values of the features were normalized to the range 0 to 1. Then the data was standardized to have a mean of 0 and a standard deviation of 1. After applying the pre-processing, the data is ready for training. The dataset is then split in a proportion of 70:30, where 70% of the data is used to train the model and 30% is used to test it.

C. Implementation

There are two phases of the project: training the model, and using the trained model for the detection of diseases.
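Under stated assumptions (a scikit-learn environment, which the paper does not name explicitly, and synthetic stand-in data in place of the real symptoms dataset), the pre-processing and two-phase pipeline might be sketched as:

```python
# Sketch: scale features to [0, 1], standardize, do a 70:30 split, then
# train and score the four classifiers. The data below is a synthetic
# stand-in for the paper's symptoms dataset, invented for illustration.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Stand-in data: 500 rows, 20 binary symptom indicators, 4 disease classes.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 20)).astype(float)
y = rng.integers(0, 4, size=500)

# Normalize to the range [0, 1], then standardize to mean 0 / std 1.
X = MinMaxScaler().fit_transform(X)
X = StandardScaler().fit_transform(X)

# 70:30 train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)

classifiers = {
    "Naive Bayes": GaussianNB(),
    "Extra Tree": ExtraTreesClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "SVM (RBF)": SVC(kernel="rbf"),
}

for name, clf in classifiers.items():
    clf.fit(X_train, y_train)      # phase 1: model training
    preds = clf.predict(X_test)    # phase 2: using the trained model
    print(f"{name}: accuracy = {accuracy_score(y_test, preds):.3f}")
```

On random stand-in labels the accuracies are of course meaningless; the point of the sketch is only the shape of the pipeline, not the scores reported in the paper.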


Model Training

We have used algorithms which perform well in the classification of data, because all we need is to classify the diseases on the basis of their symptoms. After creating the classifier objects, we fed the training set into the models and trained them. After training, we tested each model with our testing set.

Using the Trained Model for the Detection of the Disease

To use the trained model, a Graphical User Interface (GUI) was made using the tkinter library of Python, which looks as follows:

Figure (7): GUI of the tool

Here we have provided 5 drop-down selection menus for the selection of symptoms and 4 buttons for disease prediction (each using a different algorithm).

D. Result:

On the basis of the data trained in our model we obtained different accuracies for the different algorithms implemented, as follows:

S.No.  Classifier     RMS Error  Accuracy
1      Random Forest  0.610      91.3%
2      Extra Tree     0.515      93.6%
3      Naive Bayes    0.531      92.7%
4      SVM            0.518      93.4%

Figure (6): Accuracy and RMS error

We have also plotted graphs to show the comparative results of the actual value versus the predicted value for each algorithm:

Naïve Bayes:

The accuracy score for the Naive Bayes algorithm was found to be 0.92. Given below is the graph between the actual disease and the disease predicted using the Naive Bayes algorithm.

Extra Tree:

The accuracy score for the Extra Tree algorithm was found to be 0.936. Given below is the graph between the actual disease and the disease predicted using the Extra Tree algorithm.

Random Forest:

The accuracy score for the Random Forest algorithm was found to be 0.913. Given below is the graph between the actual disease and the disease predicted using the Random Forest algorithm.


Support Vector Machine:

The accuracy score for the Support Vector Machine algorithm was found to be 0.934. Given below is the graph between the actual disease and the disease predicted using the Support Vector Machine algorithm.

V. FUTURE WORK

● This project has not implemented the recommendation of medications to the user, so medication recommendation can be added to the project.
● A history of each user's diseases can be kept as a log, and medication recommendations can be implemented on top of it.
● The dataset can be extended with more diseases and their symptoms to further improve the accuracy.
● This study does not involve the classification of diseases on the basis of age; adding this feature may provide better results.

VI. CONCLUSION

Researchers are keen to try different types of classifiers and build new models in an effort to enhance the accuracy of the models they use. We have likewise tried to enhance the accuracy of our system so that we can avoid faulty predictions of diseases. Overall, an accuracy of around 92% was found for this model, which is far better than other existing models.

VII. REFERENCES

[1] Krishnaiah V., February 2016, Heart Disease Prediction System using Data Mining Techniques and Intelligent Fuzzy Approach: A Review, International Journal of Computer Applications (0975-8887), vol. 136, no. 2.

[2] Theresa Princy R. and Thomas J., 2016, Human heart disease prediction system using data mining techniques, DOI: 10.1109/ICCPCT.2016.7530265.

[3] Delen, D., Walker, G., & Kadam, A., 2005, Predicting breast cancer survivability: A comparison of three data mining methods, Artificial Intelligence in Medicine, 113-127.

[4] Gomathi, K., July 2012, An empirical study on breast cancer using data mining techniques, International Journal of Research in Computer Application & Management, 97-102.

[5] Kumara, M., Vohra, R., Arora, A., 2014, Prediction of diabetes using Bayesian network, International Journal of Computer Science and Information Technologies, 5174-5178.

[6] Chaitrali S. Dangare and S. Apte Sulabha, June 2012, Improved Study of Heart Disease Prediction System using Data Mining Classification Techniques, International Journal of Computer Applications (0975-888), vol. 47, no. 10, pp. 44-48.

[7] Obenshain M.K., 2004, Application of Data Mining Techniques to Healthcare Data, Infection Control and Hospital Epidemiology, 25(8), 690-695.

[8] Lakshmi B.N., Indumathi T.S. and N. Ravi, 2015, A comparative study of classification algorithms for predicting gestational risks in pregnant women, International Conference on Computers, Communications, and Systems (ICCCS), Kanyakumari, pp. 42-46, DOI: 10.1109/CCOMS.2015.7562849.

[9] Kaur Beant and Singh H. Williamjeet, October 2014, Review on Heart Disease Prediction System using Data Mining Techniques, International Journal on Recent and Innovation Trends in Computing and Communication, vol. 2, no. 10, pp. 3003-3008.

[10] Jyothi Thomas and Kulanthaivel G., 2013, Preterm Birth Prediction Using Cuckoo Search Based Fuzzy Min-Max Neural Network, International Review on Computers and Software (IRECOS), vol. 8, no. 8, pp. 1854-1862.

[11] H. Blockeel and J. Struyf, 2002, Efficient algorithms for decision tree cross-validation, Journal of Machine Learning Research, vol. 3, pp. 621-650.

[12] R. Duriqi, V. Raca and B. Cico, 2016, Comparative analysis of classification algorithms on three different datasets using WEKA, 5th Mediterranean Conference on Embedded Computing (MECO), pp. 335-338, DOI: 10.1109/MECO.2016.7525775.


[13] Syed Umar Amin, Kavita Agarwal, Rizwan Beg, 2013, Genetic neural network based data mining in prediction of heart disease using risk factors, DOI: 10.1109/CICT.2013.6558288.

[14] Ektaa Meshram, Gajanan Patle, Dhiraj Dahiwade, 2019, Designing Disease Prediction Model Using Machine Learning Approach, DOI: 10.1109/ICCMC.2019.8819782.
