An Intelligent Disease Prediction and Drug Recommendation Prototype by Using Multiple Approaches of
ABSTRACT Large volumes of data must be analyzed and explored with data mining procedures
in order to uncover significant patterns and trends. Medical databases are one area where data mining
can be applied. Many people all over the world struggle with their health and medical
diagnoses. Hospital information systems (HIS) produce massive amounts of data, yet it can be
difficult to extract knowledge from diagnosis case data. With the approaches used in this paper,
patients can quickly learn which disease they may have, and which medication can help treat it,
simply by providing the symptoms they are experiencing. We recommend drugs to users based on ratings
and conditions. Four distinct models are used to predict diseases. The VADER tool and
NLP-based sentiment analysis are used to analyze the reviews. Finally, probabilistic and weighted
average methods are used to recommend medications. Each model and strategy used in this
paper is described in detail. The experimental findings presented in this work can be used in future studies
and for a variety of medical applications.
INDEX TERMS Data mining, drug recommendation system, NLP, sentiment analysis.
Project, with 35.6% of respondents concentrating on online medical condition diagnosis [2]. Every day, more individuals become concerned about issues related to health and medical diagnosis, but many people continue to die as a result of medical mistakes.
According to the administration's research, drug errors cause more than 200,000 fatalities annually in China and more than 100,000 in the USA. Doctors are to blame for more than 42% of drug errors because they write prescriptions based on their relatively limited experience. Finding qualified medical professionals to diagnose and treat medical disorders is therefore one of the most crucial choices a patient must make. The development of data mining and recommender technologies enables us to extract potential knowledge from diagnosis history records, reviews, and ratings of medications in order to assist doctors in prescribing the right medication and effectively reduce medication errors [3].
The goal of this data mining paper is to develop and implement a global disease prediction and drug recommendation system that integrates a variety of data mining technologies. We use a variety of prediction algorithms, together with NLP for sentiment analysis and recommendation, to merge data from diverse sources. The remainder of the paper discusses data collection, pre-processing, methodology, findings, and finally the conclusion and future work [4].

II. RELATED WORK
The development and use of medication and illness recommendation systems has gained popularity in recent years. These systems are designed to offer patients and professionals individualised medical suggestions in order to improve health outcomes. The review focused on the opportunities and problems of applying artificial intelligence (AI) and machine learning (ML) approaches to provide individualised suggestions for patients and clinicians. AI and ML algorithms are used in healthcare to find and recommend treatments for diseases. However, the use of these algorithms is not without challenges. With AI and ML models, it is difficult to assess the precision and dependability of suggested therapies and to accurately forecast patient outcomes. The literature on the design, implementation, and evaluation of existing medication and illness recommendation systems has been reviewed.
Gupta et al. [5] use machine learning techniques to develop a prediction system that evaluates symptoms and predicts the best treatment for each newly identified condition. Three data mining algorithms (Decision Tree Classifier, Random Forest Classifier, and Naive Bayes Classifier) are used to create the disease prediction system. A preliminary list of existing diseases and their symptoms has been created. Subsequently, the medications and their compositions are examined in relation to the stated disorders. The dataset gathered from New York-Presbyterian Hospital was used to evaluate the system. According to this study, the accuracy of the Naive Bayes Classifier is higher (around 98%) than that of the Decision Tree and Random Forest algorithms (both of which are about 97%).
In order to reduce medical errors, Bao et al. [6] developed and implemented a framework for a universal medicine recommender system that applies data mining techniques to the recommendation system. SVM, ID3 decision tree, and BP neural network are employed in experiments on the three models, but ultimately the SVM model is chosen for the system due to its high accuracy of 95%. An open data set of 1200 records was used for the experiment.
By combining ANN and CBR (Case Based Reasoning), Zhang et al. [7] suggested a hybrid recommender system to assist General Practitioners (GP) in individualised clinical prescribing. This approach removes the challenge of analysing the connection between a prescription drug and a symptom.
To increase the impartiality and safety of treating infectious diseases, Bhimavarapu et al. [8] introduced a drug recommender system with a stacked artificial neural network model. Drugs are suggested based on a patient's prior health history, lifestyle, and habits to minimise side effects. Results from the suggested system were 97.5% accurate.
A Disease Diagnosis and Treatment Recommendation System (DDTRS) based on Big Data Mining and Cloud Computing is presented by Chen et al. [9]. The DDTRS was created to use the benefits of cloud computing, big data mining, and machine learning to identify diseases and suggest therapies for them. For disease-symptom clustering, the Density-Peaked Clustering Analysis (DPCA) technique is introduced, and association analyses on the Disease-Diagnosis (D-D) rules and Disease-Treatment (D-T) rules are carried out separately by the Apriori algorithm. To provide a high-performance, low-latency response, the Apache Spark cloud platform is deployed.
The potential of machine learning and data mining techniques for medical diagnostics was demonstrated by Kononenko et al. [10]. The authors discuss how machine learning and data mining techniques can increase the precision and efficiency of medical diagnostics as well as lower the price of medical treatments. They present many case studies of how neural networks and decision trees have been used successfully in machine learning and data mining applications for medical diagnosis.
The current state of machine learning in the diagnosis, classification, and prediction of heart failure was discussed by Olsen et al. [11]. In addition to discussing the numerous datasets used, including the Framingham Heart Study and the Multi-Ethnic Study of Atherosclerosis, they also cover the various techniques and algorithms employed, such as decision trees, support vector machines, and deep learning.
A hybrid strategy employing multi classification and unified collaborative filtering was proposed by Hussein et al. [12] for a Clinical Disease Diagnosis (CDD) recommender system. Various classification models, including J48, Decision Stump, REP, and RF, are used; RF outperformed the others with a 99.7% accuracy rate.
In order to provide automated diagnosis and preventive measures for various diseases, Rustam et al. [13] developed
VOLUME 11, 2023 99305
S. K. Nayak et al.: Intelligent Disease Prediction and Drug Recommendation
the Automated Disease Diagnostic and Precaution Recommender System. During the real-time evaluation, the suggested approach obtains a 99.9% accuracy rate.
An item-based hybrid recommender system was created by Bhat and Aishwarya [14] to increase the precision of drug recommendations for recently released pharmaceuticals. To make precise medicine recommendations, the system integrates content-based and collaborative filtering algorithms. The proposed model's accuracy is 75%.
Feldman et al. [15] demonstrate that incorporating disease prediction algorithms into customised healthcare can help it be scaled and contextualised. They also emphasise the significance of scalability, pointing out that for customised healthcare to be successfully integrated, it is important to take into account the influence of data size and the technical infrastructure required to support it.
A new system that can assess medical data streams and produce real-time predictions in clinical decision support was introduced by Zhang et al. [16]. This system is based on the VFDT (Very Fast Decision Tree) stream mining technique. The authors also discuss some of the difficulties that come with data stream mining, such as data size, data complexity, and data accuracy.
In a case study, Austin et al. [17] used several data mining and machine learning techniques to categorise and forecast heart failure. They compared alternative flexible classification schemes, such as bootstrap aggregation (bagging), boosting, and random forests, to conventional classification and regression trees and found that the flexible tree-based techniques from the data-mining literature offer a significant improvement in prediction and classification of heart failure subtype.
In order to support clinical judgements in the field of cardiac disease prediction, AbuKhousa et al. [18] assessed the state of research and development in predictive data mining. The potential ramifications of five alternative models of data mining techniques were examined. Poor generalisation capacity is a significant unresolved problem for data mining in the healthcare sector due to the dearth of input data and the high cost of re-processing.
An overview of the present state of research on recommender systems in the healthcare industry is given by Tran et al. [19]. They discuss how existing systems in the healthcare industry are structured and designed, as well as how this affects patient care. This in-depth analysis offers insights into scenario-based recommendations, approach-based recommendations, and a variety of algorithm-based recommendations. The review covers recommendations for foods, medications, healthcare professionals, healthcare services, and health status forecasts.
Morales et al. [20] designed a drug recommendation system based on collaborative filtering and clustering methods as an addition to the medications provided by the prescribing physician for diabetic patients. To conduct the studies, a set of diabetes patient data is gathered from the University of California Irvine Machine Learning Repository. Principal component analysis is used for dimensionality reduction, and user-based collaborative filtering is used for medication prediction. The proposed recommendation system produces a satisfactory result, with an accuracy of 0.61 and a mean squared error of 0.51.
An innovative healthcare recommendation system named iDoctor was devised by Zhang et al. [21] and is based on hybrid matrix factorization techniques. Sentiment analysis is employed to eliminate the emotional component of user reviews and to update the actual user ratings. Latent Dirichlet Allocation is used to extract user preferences and doctor features, which are then added to a traditional matrix factorization. On open real datasets from the crowd-sourced review website Yelp, the suggested model's accuracy is higher than that of the current healthcare recommendation system. Four techniques are compared: Hybrid Matrix Factorization (HMF), Basic Matrix Factorization (BMF), and Item-based and User-based Collaborative Filtering. According to the outcome, the RMSE (Root Mean Square Error) of HMF is the lowest of all.
The application of a health recommender system to increase the precision and effectiveness of women's cervical cancer prognosis was studied by Kuanr et al. [22] in 2021. LR (Logistic Regression), SVC (Support Vector Classifier), DT (Decision Tree), KNN (K-Nearest Neighbour), GNB (Gaussian Naive Bayes), XGBoost (eXtreme Gradient Boosting), and GBM (Gradient Boosting Machine) are the seven classifiers used for the model construction. The findings suggest that models with GBM classifiers perform well and that the model with the Decision Tree classifier exhibits the highest accuracy.
Han et al. [23] suggested a hybrid recommender system that combines content-based and collaborative filtering strategies for patient-doctor matchmaking in primary care. The results demonstrate that the hybrid model provides a higher predictive accuracy of 80%, compared to both the heuristic baseline (37% accuracy) and a traditional collaborative filtering recommender system (69% accuracy). The model was evaluated on a dataset of a large number of doctor-patient interactions.
Mudaliar et al. [24] present an application programming interface that recommends medications to patients suffering from a certain condition, which the framework also diagnoses by analysing the patient's symptoms with machine learning approaches. They use intelligent data mining techniques to determine the most specific illness that may be associated with the symptoms. Patients can identify the disease simply by describing their problems, and the programme interface reports which disease the user may have.
Predictive analysis would be done on the ailment, resulting in medicine recommendations to the patient based on
numerous features in the database. The experimental outcomes can potentially be applied to future research and healthcare technologies. This work provided a schematic of the application of information mining processes in regulatory, clinical, research, and instructional aspects of Clinical Predictions. This system has a broad reach since it includes the following features:
• Disease Diagnosis Automation.
• Paperless labour that benefits the environment.
• To improve efficiency and accuracy for patients in order to assist them in the future.
• Managing disease-related information.

III. DATASET AND PRE-PROCESSING
HIPAA provides protection for healthcare information. It is against the law to disclose a patient's medical records without that patient's consent. Government health records and databases require a number of licenses to access. As a result, we use datasets for our research that were easily accessible online and ready for download.

A. DATASET ONE
1) DATA GATHERING
The 10-year medical record dataset was found on data.world ("Medical records 10 yrs.", dataset by arvin6). Four CSV files make up the package:
1) encounter.csv
2) encounter_dx.csv
3) lab_results.csv
4) medication_fulfillment.csv
The encounter.csv file has 1176 rows and 17 columns, encounter_dx.csv has 3063 rows and 6 columns, lab_results.csv has 7509 rows and 21 columns, and medication_fulfillment.csv has 5447 rows and 28 columns. We pre-processed and merged the dataset to meet our needs in order to gain insightful information and determine whether the dataset contains the necessary data to address the problem statement.

2) DATA PREPROCESSING
After acquiring the raw data and understanding the structure of all four CSV files, we pre-processed each table individually and then joined the tables to create a single merged dataset with the necessary columns. To pre-process the data, we ran specific queries. For example, we checked the number of unique values and the count of each row, and we removed or combined columns that were unnecessary or contained no data, to make the data more useful.
After performing the necessary pre-processing, we found that the four tables were organized according to the Star Schema Model, with medication_fulfillment.csv as the fact table and encounter.csv, encounter_dx.csv, and lab_results.csv as the dimension tables. The star schema follows a one-to-many relationship. The four tables were combined into one table once we identified the Primary Key and Foreign Key. The primary key in the medication fulfillment table is "Encounter ID". We then used a left join on "Encounter ID" in a SQL query to combine the medication fulfillment table with the severity and description columns of the encounter_dx table. There are 1176 rows in the resulting table.
Then, we ran a query to count the rows for each value in the "Order ID" column to see if it was the primary key for lab_results.csv. We found that it was not, since the number of unique values did not match the number of rows in the table. We then found that the composite primary key consists of two columns, "Order ID" and "Result LOINC". We did not use any columns from lab_results.csv for the combined dataset because none of them were helpful.
We determined that the primary key for the remaining encounter.csv file is "Encounter ID", so we extracted the CC column and combined it using a left join on "Encounter ID". We now have a fully combined dataset with all of the necessary columns. There are 1176 rows in it.
We took the Drug Name, description, severity, and CC from the combined dataset and grouped them to determine the total number of each drug and the number of linked diseases and descriptions. As seen in Figure 1 below, the extracted columns contain a significant number of null values.
FIGURE 1. Drug name grouped by description, severity, and CC.
To get the total number of rows with null values associated with a medicine name, we ran another query. We were left with only 416 rows of pertinent data to train our classification model because a total of 764 rows had null values. We therefore concluded that this dataset does not fulfill the requirement and searched for additional datasets that would be consistent with the solution of our problem statement.

B. DATASET TWO
1) DATA GATHERING
We forecast diseases based on symptoms in order to make reliable drug recommendations, which are subsequently based on ratings. We have made an effort to assemble data from two major datasets for this purpose.

a: SYMPTOMS DATASET
The Disease-Symptom Knowledge Database, which is a knowledge database of disease-symptom associations
generated by an automated method based on information in textual discharge summaries of patients admitted to New York-Presbyterian Hospital during 2004, provided the dataset. This dataset has three columns, as shown in Figure 2:
1. Disease
2. Count of Disease Occurrence
3. Symptoms
FIGURE 3. Raw drug review dataset.
This dataset includes 916 unique Conditions (Diseases), 3671 unique Drug names, and the ratings and reviews that correspond to the drug names. To obtain more information for efficient medicine recommendations, this dataset is then pre-processed and visualized.

c: SIDE EFFECTS DATASET
We included this dataset of adverse effects for particular medications in order to help patients understand the risks associated with the medication being suggested. Once more, raw data from druglib.com and the UCI Machine Learning Repository for Adverse Effects of Medications were combined to create this dataset. This dataset has fewer rows than the drug review dataset but more columns. Because of this, only the "Side Effects" column from this dataset will be integrated with the side effects from the other two datasets and druglib.com, as shown in Figure 4.

a: SYMPTOMS DATASET
This dataset needed to be cleaned in order to yield useful information. The "Count" column was first removed since the data it contained was not pertinent to this paper. The drop function was then used to handle the null entries in the Disease column. The disease and symptom columns were cleaned to remove extraneous data and leave only the name. The processed data is shown in Figure 5.
We converted it into a new CSV file with symptoms as the columns and diseases as the rows in order to categorize the symptoms according to the diseases. We mapped every symptom to every disease using one-hot
encoding, assigning the value 1 if the symptom is present for the disease and 0 otherwise. The one-hot encoded dataset is shown in Figure 6 below. When symptoms are provided as input, this will assist us in predicting the diseases.

b: DRUG REVIEW DATASET
In order to visualize and analyze the data on a larger dataset, the two sets in this dataset (Train and Test) were combined. They could be joined easily because they both had the same columns. The resulting dataset was fairly clean and did not need much pre-processing. However, a few rows with null values were removed, and the columns were renamed. Visualizing the data was an interesting challenge because the dataset contains a lot of information. The drugs with the most reviews, the most well-liked drugs, the most prevalent conditions, and so on were plotted on a number of different graphs. Figure 7 below illustrates one such representation, which includes the names of a few of the most often used drugs:
FIGURE 7. Visualizing the most popular drugs based on the ratings.
In order to decrease the dimensionality of the combined dataset, only the pertinent columns are preserved.

IV. METHODOLOGY
Our paper's major objective is to suggest a medication to a patient based on the symptoms they exhibit. The design pipeline and dataflow of our implementation are shown below in Figure 9. To develop a drug recommender system, two key components must be addressed: a disease prediction model and a recommendation model.

A. DISEASE PREDICTION PROTOTYPE
The disease prediction prototype is a probabilistic model that bases its predictions on symptoms. For this purpose we use the Disease-Symptom Knowledge Database, which includes 405 symptoms matched to over 149 distinct diseases. In our paper, we have experimented with several procedures for accurately forecasting the diseases. This is because the dataset only contains one feature, the symptoms, and training any classifier with just one piece of information weakens it, which in turn lowers the accuracy of predictions on other inputs. The answer is to attempt to generate many data points for the same condition by considering all of its symptoms, their significance, and their occurrence. We have created four candidate strategies, each of which focuses on a different type of prediction. All of these methods are ultimately trained on four classifiers, and the accuracy and prediction results are compared in order to select the best strategy. The methods are explained in detail below:
1) APPROACH 1
In this approach, the data has been pre-processed, converted, and mapped to a data frame with columns labeled with the list of specific symptoms and the list of specific
target diseases. The pre-processed symptoms dataset was used to fill the newly constructed disease prediction data frame, setting the value of each symptom present for a given disease to 1 and to 0 otherwise. The rows were then grouped to create a single row for each distinct disease, with a value of 1 in the appropriate symptom columns. The data was then used to train the Multinomial Naive Bayes, Extra Tree Classifier, Decision Tree Classifier, and Support Vector Machine Classifier models, with the symptoms serving as the training features and the diseases as the labels. The trained models are then used to forecast diseases by passing one, two, and three symptoms to the classifiers.
FIGURE 9. Design implementation pipeline and dataflow of the recommender prototype.

2) APPROACH 2
Each classifier in Approach 1 offered the same accuracy of 89%. Despite this accuracy, the disease predicted from the input symptoms was incorrect. We used Approach 2 to improve the precision of the disease forecasts. In this method the 'Count of Disease Occurrence' column values were taken into account to increase the number of rows in the dataset, and the symptoms were assumed to be equally important. In Approach 1 we were unable to divide the data into train and test datasets, since it has only one entry for each disease. When the classifier was trained on one set of diseases and the test data was composed of different diseases, splitting the data resulted in 0% accuracy. The newly constructed disease prediction data frame was filled from the pre-processed symptoms dataset by setting the value of each symptom present for a certain disease to 1 and to 0 otherwise. Additionally, the rows are grouped by disease, and each disease row is replicated according to the number of times that disease occurs. The complete dataset has 38839 rows. The dataset was then split into training and testing data. The data was then used to train the Multinomial Naive Bayes, Extra Tree Classifier, Decision Tree Classifier, and Support Vector Machine Classifier models, with the symptoms serving as the training features and the diseases as the labels. The trained models are used to forecast diseases by passing one, two, and three symptoms to the classifiers.

3) APPROACH 3
Even after adding data based on the frequency of disease occurrence, disease prediction with Approach 2 was only 86% accurate, which was insufficient. So, we changed the data to enable the classifiers to recognize patterns and adapt their learning accordingly. In this technique the ranking of significance of each disease's symptoms is maintained. The symptoms dataset was used to fill the newly constructed disease prediction data frame by setting the value of each symptom present for a given disease to 1 and to 0 otherwise. To create more rows, the number of rows for each disease is multiplied. As a result, each disease has an increased and equal number of rows in the resulting data frame. A probabilistic method is used to assign the 0s and 1s. The highest-priority symptom is set to 1 in every row of that particular disease. The second-most significant symptom is set in 95% of the rows, and so on. The least significant symptom is set in around 40% of the rows for a given disease. The dataset was then split into training and testing data. The data was then used to train the Multinomial Naive Bayes, Extra Tree Classifier, Decision Tree Classifier, and Support Vector Machine Classifier models, with the symptoms serving as the training features and the diseases as the labels. The trained models are used to forecast diseases by passing one, two, three, and four symptoms to the classifiers, as in Figure 10.

4) APPROACH 4
With this procedure, we have sought to produce a dataset in which all the symptoms of a specific disease are dispersed at random across various rows. This results in a dataset with several rows for a single disease, with its symptoms marked differently in each row as 0 or 1.
This dataset was produced by splitting larger datasets into smaller ones using the groupby function on various numbers
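The row-generation ideas described for Approach 2 and Approach 3 can be illustrated with a short sketch. This is not the authors' code: the function names, the example symptoms, and the example disease are hypothetical, and the decay schedule between the second-ranked symptom (95%) and the last-ranked symptom (about 40%) is assumed here to be linear, since the text only gives those endpoints.

```python
import random


def one_hot_row(disease, symptoms, all_symptoms):
    """Build one row: 1 for each symptom listed for the disease, else 0."""
    present = set(symptoms)
    row = {s: int(s in present) for s in all_symptoms}
    row["disease"] = disease
    return row


def replicate_by_count(row, count):
    """Approach 2 idea: repeat a disease row once per recorded occurrence."""
    return [dict(row) for _ in range(count)]


def rank_probabilities(n, second=0.95, last=0.40):
    """Approach 3 idea: inclusion probability per symptom rank.

    Rank 1 is always present and rank 2 is present in 95% of rows.
    ASSUMPTION: the remaining ranks decay linearly to about 40%.
    """
    if n <= 0:
        return []
    if n == 1:
        return [1.0]
    if n == 2:
        return [1.0, second]
    step = (second - last) / (n - 2)
    return [1.0] + [second - i * step for i in range(n - 1)]


def probabilistic_rows(disease, ranked_symptoms, all_symptoms, n_rows, seed=0):
    """Generate n_rows for one disease, sampling each symptom by its rank."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    probs = rank_probabilities(len(ranked_symptoms))
    rows = []
    for _ in range(n_rows):
        row = {s: 0 for s in all_symptoms}
        for symptom, p in zip(ranked_symptoms, probs):
            row[symptom] = int(rng.random() < p)
        row["disease"] = disease
        rows.append(row)
    return rows


# Hypothetical example data, not taken from the paper's dataset.
all_symptoms = ["fever", "cough", "fatigue", "rash"]
base_row = one_hot_row("influenza", ["fever", "cough"], all_symptoms)
approach2_rows = replicate_by_count(base_row, 3)
approach3_rows = probabilistic_rows(
    "influenza", ["fever", "cough", "fatigue"], all_symptoms, n_rows=50
)
```

Rows built this way give the classifiers several distinct examples per disease, which is what makes a train/test split possible at all.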
experimentally validates a gold-standard sentiment lexicon to assess the sentiment of the text using a combination of qualitative and quantitative methods. The string input for this tool will be a drug review from our dataset. This tool not only informs us of the review's polarity (positive or negative), but also of how strongly positive or negative it is. Using this tool has many benefits:
1. It requires no training data and excels at handling social media-style texts that may contain emoticons and punctuation.
2. It can be used on real-time streaming data.
3. There is no speed-performance trade-off with it.
The output consists of positive, negative, neutral, and compound components. The Positive, Neutral, and Negative components indicate what proportion of the statement has a positive, neutral, or negative tone. These three components are stated as decimal values, and their sum always equals one. The compound component defines the sentence's overall sentiment. The following thresholds aid in interpreting the compound score:
1. Positive sentiment: compound score >= 0.05
2. Neutral sentiment: -0.05 < compound score < 0.05
3. Negative sentiment: compound score <= -0.05
The output for a sample review is shown in Figure 12. In this statement there is a small positive component in the language, but the overall sentiment is negative, as the compound value given by VADER makes clear.
FIGURE 12. VADER analysis for a sample review from our dataset.

C. DRUG RECOMMENDATION
We need to suggest a medication for the disease after it has been predicted. We used the Drug Review dataset from the UCI Machine Learning Repository, which lists each disease along with the available medications and their specific ratings. With the help of this data, each drug's weighted average rating or probabilistic value can be calculated. The two methods are described below:

1) REVIEW APPROACH 1: WEIGHTED AVERAGE APPROACH
A weighted average differs from a regular mean in that certain data points contribute more to the final average than others, rather than all data points contributing equally.
Formula for weighted average:
\[ \bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i} \qquad (1) \]
In equation 1, data elements with a high weight contribute more to the weighted mean than elements with a low weight.
In this instance, the ratings are x and the useful counts are w. This results in a weighted average rating that prioritizes the useful count, or the number of genuine raters, over other factors.
Figure 13 depicts the function created for calculating the weighted average.
FIGURE 13. Function to create the weighted average.
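As an illustration of the weighted average in equation 1, with ratings as x and useful counts as w, a minimal sketch might look as follows. It is not the function shown in Figure 13, which is not reproduced in the text. The function names, the tuple layout of a review, and the drug names are hypothetical.

```python
from collections import defaultdict


def weighted_average(ratings, useful_counts):
    """Weighted mean of the ratings (x_i), weighted by useful counts (w_i)."""
    total_weight = sum(useful_counts)
    if total_weight == 0:
        return None  # no useful votes, so the weighted mean is undefined
    weighted_sum = sum(w * x for x, w in zip(ratings, useful_counts))
    return weighted_sum / total_weight


def weighted_rating_per_drug(reviews):
    """Group (drug, rating, useful_count) tuples and score each drug."""
    grouped = defaultdict(lambda: ([], []))
    for drug, rating, useful_count in reviews:
        grouped[drug][0].append(rating)
        grouped[drug][1].append(useful_count)
    return {drug: weighted_average(x, w) for drug, (x, w) in grouped.items()}


# Hypothetical reviews: (drug name, rating, useful count).
reviews = [("drug_a", 10, 3), ("drug_a", 5, 1), ("drug_b", 8, 2)]
scores = weighted_rating_per_drug(reviews)
# drug_a: (3*10 + 1*5) / (3 + 1) = 8.75, drug_b: (2*8) / 2 = 8.0
```

In this example the rating of 10 was marked useful three times, so it pulls the score for drug_a well above the plain mean of 7.5.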
FIGURE 14. After weighted average calculation.
In addition, we tested a probabilistic strategy, which is described below. However, in order to get reliable results, elements of both methodologies were used to create the final recommendation.

2) APPROACH 2: BASED ON PROBABILITY RECOMMENDATION
After obtaining the combined dataset, a disease-based drug recommendation system was created. In order to recommend the best drug possible, the medications with positive reviews were first filtered and then sorted based on the useful count, as demonstrated in Figure 15.

V. EXPERIMENTAL RESULTS
A. DISEASE PREDICTION
1) APPROACH 1
We employed the Multinomial Naive Bayes, Extra Tree Classifier, Decision Tree Classifier, and Support Vector Machine models to forecast diseases using Approach 1. All four models produce 89.93% accuracy. Because there was insufficient data for the models to learn from, disease predictions were incorrect. Figures 16, 17, and 18 show the disease prediction results with one, two, and three symptoms.
3) APPROACH 3
We employed Multinomial Naive Bayes, Extra Tree Classifier, Decision Tree Classifier, and Support Vector Machine as our four models for disease prediction in approach 3. The accuracy of Multinomial Naive Bayes is 86.90%. The Extra Tree Classifier has an accuracy of 88.18%. A Decision Tree Classifier with a maximum depth of 120 offers 86.95% accuracy. The Support Vector Machine classifier provides 86.96%. The disease predictions with one and two symptoms were wrong. Multinomial Naive Bayes produced
4) APPROACH 4
We also used this method to make predictions with all models, and the accuracies obtained are as follows: Decision Tree: 88.24%, Multinomial NB: 87.98%, Random Forest: 81.88%, Gaussian NB: 88.34%, and SVM: 88.61%. Each classifier has done well, and the condition predicted using three symptoms as input features is likewise accurate. When we used symptoms from various diseases to try to predict the disease, the model was 9 out of 10 times
C. SENTIMENT ANALYSIS
1) APPROACH 1
FIGURE 27. Predicted labels of reviews.
FIGURE 32. Mapped recommended drugs with possible side effects and
probabilistic score.
in the dataset after performing the Vader analysis on each
review as in Figure 28.
We applied the criteria described earlier to categorise the overall sentiment of each review. Figure 29 shows the resulting distribution of positive, negative, and neutral reviews in the dataset.
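The labelling step can be sketched as below. The negative threshold (compound score <= -0.05) is the one stated in the paper; the positive threshold (>= 0.05) and the neutral band in between follow the usual VADER convention and are assumed here. The compound scores are invented; in practice they would come from VADER's `polarity_scores(text)["compound"]`.

```python
from collections import Counter


def label_from_compound(compound):
    """Map a VADER compound score to an overall sentiment label."""
    if compound >= 0.05:
        return "positive"
    if compound <= -0.05:
        return "negative"
    return "neutral"


def sentiment_distribution(compound_scores):
    """Count how many reviews fall into each sentiment label."""
    return Counter(label_from_compound(c) for c in compound_scores)


# Invented compound scores, for illustration only.
example_scores = [0.6, -0.3, 0.01, 0.9]
distribution = sentiment_distribution(example_scores)
print(distribution)
```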
D. DRUG RECOMMENDATION
We tried two methods for drug recommendation, as mentioned above. The final suggestion, however, was made by combining the useful count factor from the probabilistic technique with the weighted average approach. Many factors were taken into account before suggesting the medication:
Several medications are available for the same condition. Hence, we used sentiment analysis to filter out the negative and neutral reviews, leaving us with only favorable ones.
1) APPROACH 1: THE WEIGHTED AVERAGE METHOD
This strategy allowed us to find the best and most highly rated medicine for the predicted disease. Based on the information provided, we predicted osteoporosis as the disease, and as an output, we recommend the top 3 medications from our dataset, as shown in Figure 30:
a: VERIFICATION OF RESULTS WITH REAL WORLD
After conducting a Google search on osteoporosis therapy, we learned that it calls for vitamins and bone health support, which is precisely what our first recommended medicine addresses. We can therefore conclude that our medicine recommendation system performs as anticipated and produces the desired outcomes. The side effects dataset is then integrated with this output to determine whether the suggested medication has any side effects, as shown in Figures 31 and 32.
The final medicine recommendation code illustrates this, as in Figure 33:
FIGURE 33. Recommended drug for osteoporosis with its possible side effects.
2) APPROACH 2: PROBABILISTIC APPROACH
This method produced fairly good findings and evaluations. Nevertheless, there isn't a concrete dataset to compare
SUVENDU KUMAR NAYAK received the M.Tech. degree in computer science (CS) from Berhampur University, in 2009. He is currently pursuing the Ph.D. degree in machine learning (recommender systems) with the Centurion University of Technology and Management, Odisha. He is also an Assistant Professor with the Department of Computer Science and Engineering, Centurion University of Technology and Management. He is also having more than 12 years of experience in academia. He is also certified in RHCSA (Red hat), Cloud Practitioner (AWS), and also carry a certificate in Oracle Database Development. He has published many articles in international journals and conferences.

MAMATA GARANAYAK received the M.Tech. degree and the Ph.D. degree in machine learning (recommender systems) and in CSE from KIIT, Deemed to be University. She is currently an Associate Professor with the Computer Science Department, Kalinga Institute of Social Sciences (KISS), Deemed to be University, Bhubaneswar. She has 17.8 years of teaching experience. Her research interests include recommender systems, image processing, and machine learning.

SANGRAM KESHARI SWAIN received the Ph.D. degree in computer science and engineering. He has 12 years of teaching experience having a demonstrated history of working in the higher education industry. He has skilled in social services, teaching, research, data analysis, and higher education. A strong education professional having multidimensional approaches like engineering, technology, management, social service, and law.

SANDEEP KUMAR PANDA is currently an Associate Professor and the Head of the Department Artificial Intelligence and Data Science, Faculty of Science and Technology (IcfaiTech), ICFAI Foundation for Higher Education (Deemed to be University), Hyderabad, Telangana, India. He has published 50 papers in international journals and international conferences and book chapters in repute. His research interests include blockchain technology, the Internet of Things, AI, and cloud computing. He received the "Research and Innovation of the Year Award 2020" from MSME, Government of India and DST, Government of India at New Delhi, in 2020, and "Research Excellence Award" from Brand Honchos, in 2022. He has 17 Indian patents on his credit. He has six edited books named Bitcoin and Blockchain: History and Current Applications (CRC Press, USA), Blockchain Technology: Applications and Challenges (Springer, ISRL), Artificial Intelligence and Machine Learning in Business Management: Concepts, Challenges, and Case Studies (CRC Press, USA), The New Advanced Society: Artificial Intelligence and Industrial Internet of Things Paradigm (Wiley Press, USA), Recent Advances in Blockchain Technology: Real-World Applications (Springer, ISRL), and Metaverse and Immersive Technologies: An Introduction to Industrial Business and Social Applications (Wiley Press, USA). He has ten lakh seed money projects from IFHE. He is also a Reviewer of the IEEE ACCESS. His professional affiliations are MIEEE, MACM, and LMIAENG.

DEEPTHI GODAVARTHI received the Ph.D. degree from Andhra University, Andhra Pradesh, India, in 2022. She is currently an Assistant Professor with the School of Computer Science and Engineering, VIT-AP University, Amaravati, Andhra Pradesh. Her research interests include AI, NLP, computer vision, and block chain technologies.