Received 17 August 2023, accepted 7 September 2023, date of publication 11 September 2023,
date of current version 15 September 2023.

Digital Object Identifier 10.1109/ACCESS.2023.3314332
An Intelligent Disease Prediction and Drug Recommendation Prototype by Using Multiple
Approaches of Machine Learning Algorithms
SUVENDU KUMAR NAYAK 1, MAMATA GARANAYAK 2 , SANGRAM KESHARI SWAIN 3,
SANDEEP KUMAR PANDA 4, AND DEEPTHI GODAVARTHI 5

1 Department of Computer Science and Engineering, Centurion University of Technology and Management, Bhubaneswar, Odisha 752050, India
2 Department of Computer Science, Kalinga Institute of Social Sciences, Deemed to be University, Bhubaneswar, Odisha 751024, India
3 Department of Computer Science and Engineering, Centurion University of Technology and Management, Bhubaneswar, Odisha 752050, India
4 Department of Artificial Intelligence and Data Science, Faculty of Science and Technology (IcfaiTech), ICFAI Foundation for Higher Education, Hyderabad, Telangana 501203, India
5 School of Computer Science & Engineering (SCOPE), VIT-AP University, Amaravati, Andhra Pradesh 522237, India
Corresponding author: Sandeep Kumar Panda ([email protected])

This work was supported by the Faculty of Science and Technology (IcfaiTech), ICFAI Foundation for Higher Education, Hyderabad,
Telangana, India.
ABSTRACT Large blocks of data must be analyzed and explored using data mining procedures in order to
uncover significant patterns and trends. Medical databases are one area where data mining procedures
can be applied. Many people all over the world struggle with their health and medical diagnoses.
Hospital information systems (HIS) produce massive amounts of data, yet it can be difficult to extract
knowledge from diagnosis case data. Using the approaches described in this paper, patients can quickly
learn about the illness they are experiencing, and the medication that can help treat it, just by
giving their symptoms. We give customers drug recommendations based on ratings and conditions.
Four distinct prototypes are used to predict the diseases. The VADER tool and NLP-based sentiment
analysis are used to analyze the reviews. Finally, probabilistic and weighted average methodologies
are used to recommend the medications. Each model and strategy used in this paper is described in
detail. The experimental findings presented in this work can be used in future studies and for a
variety of medical applications.
INDEX TERMS Data mining, drug recommendation system, NLP, sentiment analysis.
The associate editor coordinating the review of this manuscript and approving it for publication was
Fabrizio Messina.
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.
For more information, see https://creativecommons.org/licenses/by-nc-nd/4.0/
VOLUME 11, 2023

I. INTRODUCTION
A recommender prototype, broadly defined, is a prototype that anticipates the ratings a customer
would give to a particular item. The customer is subsequently given a ranking of these forecasts.
Several household names, including Google, Instagram, Spotify, Amazon, Reddit, and Netflix, employ
them. Based on the customer's profile, a recommender prototype can determine whether a specific
customer will favour an item or not. Both service providers and customers can benefit from
recommender prototypes. They lower the transaction costs associated with locating and choosing
products in an online buying setting. The use of recommender prototypes is widespread; well-known
examples include medicine recommenders, product recommenders for online shops, playlist generators
for video and audio services, and content recommenders for social networking platforms [1]. The main
operationalization of this objective has been to concentrate on the capacity to numerically estimate
customers' preferences for unseen objects. The purpose of recommenders is frequently stated as
"assist the customers identify relevant items."
Which doctor to trust is one of the most frequently encountered worries among individuals faced with
any medical ailment. It is common knowledge that a person's health has a big impact on how happy they
are. 58.99% of Americans have gone online for health-related information, according to a 2013 survey
by the Pew Internet and American Life
Project, with 35.6% of respondents concentrating on online medical condition diagnosis [2]. Every
day, more individuals become concerned about issues related to health and medical diagnosis, but
many people continue to perish as a result of medical mistakes.
According to the administration's research, drug errors cause more than 200,000 fatalities annually
in China and more than 100,000 in the USA. Doctors are to blame for more than 42% of drug errors
because they write prescriptions based on their relatively limited experience. Finding qualified
medical professionals to diagnose and treat medical disorders is therefore one of the most crucial
choices a patient must make. The development of data mining and recommender technologies enables us
to investigate possible knowledge from diagnosis history records, reviews, and ratings of
medications in order to assist doctors in prescribing the right medication and effectively reduce
medication errors [3].
The goal of this data mining paper is to develop and implement a global disease prediction and drug
recommendation system that integrates a variety of data mining technologies. We use a variety of
prediction algorithms, together with NLP for sentiment analysis and recommendation, to merge data
from diverse sources. The remainder of the paper discusses data collection, pre-processing,
methodology, findings, and finally the conclusion and future work [4].

II. RELATED WORK
The development and use of medication and illness recommendation systems has gained popularity in
recent years. These systems are designed to offer patients and professionals individualised medical
suggestions in order to improve health outcomes. The review focused on the opportunities and
problems of applying artificial intelligence (AI) and machine learning (ML) approaches to provide
individualised suggestions for patients and clinicians. AI and ML algorithms are used in healthcare
to find and recommend treatments for diseases. However, the use of these algorithms is not without
challenges. With AI and ML models, it is challenging to assess the precision and dependability of
suggested therapies and to accurately forecast patient outcomes. The literature on the design,
implementation, and evaluation of existing medical and illness recommendation systems has been
reviewed.
Gupta et al. [5] use machine learning techniques to develop a prediction system that evaluates
symptoms and predicts the best treatment for each newly identified condition. Three data mining
algorithms (Decision Tree Classifier, Random Forest Classifier, and Naive Bayes Classifier) are used
to create the disease prediction system. A preliminary list of existing diseases and their symptoms
was created. Subsequently, the medications and their compositions were examined in relation to the
stated disorders. A dataset gathered from New York-Presbyterian Hospital was used to evaluate the
system. According to this study, the accuracy of the Naive Bayes Classifier (around 98%) is higher
than that of the Decision Tree and Random Forest algorithms (both about 97%).
In order to reduce medical errors, Bao et al. [6] developed and put into practice a framework for a
universal medicine recommender system that applies data mining techniques to the recommendation
system. SVM, ID3 decision tree, and BP neural network models were compared in experiments, and the
SVM model was ultimately chosen for the system due to its high accuracy of 95%. An open data set of
1200 records was used for the experiment.
By combining ANN and CBR (Case-Based Reasoning), Zhang et al. [7] suggested a hybrid recommender
system to assist General Practitioners (GP) in individualised clinical prescribing. This paradigm
removes the challenge of analysing the connection between a prescription drug and a symptom.
To increase the impartiality and safety of treating infectious diseases, Bhimavarapu et al. [8]
introduced a drug recommender system with a stacked artificial neural network model. Drugs are
suggested based on a patient's prior health history, lifestyle, and habits to minimise side effects.
Results from the suggested system were 97.5% accurate.
A Disease Diagnosis and Treatment Recommendation System (DDTRS) based on Big Data Mining and Cloud
Computing is presented by Chen et al. [9]. The DDTRS was created to use the benefits of cloud
computing, big data mining, and machine learning to identify diseases and suggest therapies for
them. For disease-symptom clustering, the Density-Peaked Clustering Analysis (DPCA) technique is
introduced, and association analyses on the Disease-Diagnosis (D-D) rules and Disease-Treatment
(D-T) rules are carried out separately by the Apriori algorithm. To provide a high-performance,
low-latency response, the Apache Spark cloud platform is deployed.
The potential of machine learning and data mining techniques for medical diagnostics was
demonstrated by Kononenko et al. [10]. The authors discuss how machine learning and data mining
techniques can increase the precision and efficiency of medical diagnostics as well as lower the
price of medical treatments. They present many case studies of how neural networks and decision
trees have been used successfully in machine learning and data mining applications for medical
diagnosis.
The current state of machine learning in the diagnosis, classification, and prediction of heart
failure was discussed by Olsen et al. [11]. In addition to discussing the numerous datasets used,
including the Framingham Heart Study and the Multi-Ethnic Study of Atherosclerosis, they also go
over the various techniques and algorithms employed, such as decision trees, support vector
machines, and deep learning.
A hybrid strategy employing multi-classification and unified collaborative filtering was proposed by
Hussein et al. [12] for a Clinical Disease Diagnosis (CDD) recommender system. Various
classification models, including J48, Decision Stump, REP, and RF, were used; RF outperformed them
all with a 99.7% accuracy rate.
In order to provide automated diagnosis and preventive measures for various diseases,
Rustam et al. [13] developed
the Automated Disease Diagnostic and Precaution Recommender System. During the real-time evaluation,
the suggested approach obtains a 99.9% accuracy rate.
An item-based hybrid recommender system was created by Bhat and Aishwarya [14] to increase the
precision of drug recommendations for recently released pharmaceuticals. To make precise medicine
recommendations, the system integrates content-based and collaborative filtering algorithms. The
proposed model's accuracy is 75%.
Incorporating disease prediction algorithms into customised healthcare can help it be scaled and
contextualised, as Feldman et al. [15] demonstrate. They also emphasise the significance of
scalability, pointing out that for customised healthcare to be successfully integrated, it is
important to take into account the influence of data size and the technical infrastructure required
to support it.
A new system that can assess medical data streams and produce real-time predictions in clinical
decision support was introduced by Zhang et al. [16]. This system is based on the VFDT (Very Fast
Decision Tree) stream mining technique. The authors also discuss some of the difficulties that come
with data stream mining, such as data size, data complexity, and data accuracy.
In a case study, Austin et al. [17] used several data mining and machine learning techniques to
classify and predict heart failure. They compared alternative flexible classification schemes, such
as bootstrap aggregation (bagging), boosting, and random forests, to conventional classification and
regression trees and found that the flexible tree-based techniques from the data-mining literature
offer a significant improvement in prediction and classification of heart failure subtype.
In order to support clinical judgements in the field of cardiac disease prediction,
AbuKhousa et al. [18] assessed the state of research and development in predictive data mining. The
potential ramifications of five alternative models of data mining techniques were examined. Poor
generalisation capacity is a significant unresolved problem for data mining in the healthcare sector
due to the dearth of input data and the high cost of re-processing.
An overview of the present state of research on recommender systems in the healthcare industry is
given by Tran et al. [19]. They discuss how existing systems in the healthcare industry are
structured and designed, as well as how this affects patient care. This in-depth analysis offers
insights into scenario-based recommendations, approach-based recommendations, and a variety of
algorithm-based recommendations. The review covers recommendations for foods, medications,
healthcare professionals, healthcare services, and health status forecasts.
Morales et al. [20] designed a drug recommendation system based on collaborative filtering and
clustering methods as an addition to the medications provided by the prescribing physician for
diabetic patients. To conduct the studies, a set of diabetes patient data was gathered from the
University of California Irvine Machine Learning Repository. Principal component analysis is
employed for dimensionality reduction, and user-based collaborative filtering is used for medication
prediction. The proposed recommendation system produces a satisfactory result, with an accuracy of
0.61 and a mean squared error of 0.51.
An innovative healthcare recommendation system named iDoctor was devised by Zhang et al. [21] and is
based on hybrid matrix factorization techniques. Sentiment analysis is employed to eliminate the
emotional component of user reviews and to update the actual user ratings. Latent Dirichlet
Allocation is used to extract user preferences and doctor features, which are then added to a
traditional matrix factorization. The suggested model's accuracy is higher than that of the current
healthcare recommendation system on open real datasets from the crowd-sourced review website Yelp.
Four techniques are compared: Hybrid Matrix Factorization (HMF), Basic Matrix Factorization (BMF),
and Item-based and User-based Collaborative Filtering. According to the outcome, the RMSE (Root Mean
Square Error) of HMF is the lowest of all.
The application of a health recommender system to increase the precision and effectiveness of
women's cervical cancer prognosis was studied by Kuanr et al. [22] in 2021. LR (Logistic
Regression), SVC (Support Vector Classifier), DT (Decision Tree), KNN (K-Nearest Neighbour), GNB
(Gaussian Naive Bayes), XGBoost (eXtreme Gradient Boosting), and GBM (Gradient Boosting Machine) are
the seven classifiers used for the model construction. The findings suggest that models with GBM
classifiers perform admirably and that the model with a Decision Tree classifier exhibits the
highest accuracy.
Han et al. [23] suggested a hybrid recommender system that combines content-based and collaborative
filtering strategies for patient-doctor matchmaking in primary care. The results demonstrate that,
compared to both the heuristic baseline (37% accuracy) and a traditional collaborative filtering
recommender system (69% accuracy), the hybrid model provides a greater predicted accuracy of 80%.
The model was evaluated on a dataset containing a large number of doctor-patient interactions.
Mudaliar et al. [24] present an application programming interface for recommending medications to
patients suffering from a certain condition, which the framework also diagnoses by analyzing the
patient's symptoms using machine learning approaches. They use data mining techniques to determine
the most specific illness that may be associated with the symptoms. Patients can identify the
disease simply by describing their problems, and the programme interface reports which disease the
user may be infected with.
Predictive analysis would be done on the ailment, resulting in medicine recommendations to the
patient based on
numerous features in the database. The experimental outcomes can potentially be applied to future
research and healthcare technologies. This work provided a schematic of the application of
information mining processes in regulatory, clinical, research, and instructional aspects of
Clinical Predictions. This system has a broad reach since it includes the following features:
• Disease diagnosis automation.
• Paperless work that benefits the environment.
• Improved efficiency and accuracy for patients, in order to assist them in the future.
• Management of disease-related information.

III. DATASET AND PRE-PROCESSING
HIPAA provides protection for healthcare information. It is against the law to disclose a patient's
medical records without that patient's consent. Government health records and databases require a
number of licenses to access. As a result, for our research we use datasets that were easily
accessible online and ready for download.

A. DATASET ONE
1) DATA GATHERING
The 10-year medical record dataset was found at "Medical records 10 yrs. - dataset by arvin6 |
data.world". Four CSV files make up the package:
1) encounter.csv
2) encounter_dx.csv
3) lab_results.csv
4) medication_fulfillment.csv
The encounter.csv file has 1176 rows and 17 columns, encounter_dx.csv has 3063 rows and 6 columns,
lab_results.csv has 7509 rows and 21 columns, and medication_fulfillment.csv has 5447 rows and
28 columns. We pre-processed and merged the dataset to meet our needs in order to gain insightful
information and determine whether the dataset contains the necessary data to address the problem
statement.

2) DATA PREPROCESSING
After acquiring the raw data and understanding the structure of all four CSV files, we pre-processed
each table individually and then joined the tables to create a single merged dataset with the
necessary columns. To pre-process the data, we ran specific queries, for example to obtain the
number of unique values and the count of each row, and we removed or combined columns that were
unnecessary or did not contain any data.
After performing the necessary pre-processing, we found that the four tables were organized using
the star schema model, with medication_fulfillment.csv as the fact table and encounter.csv,
encounter_dx.csv, and lab_results.csv as the dimension tables. The star schema follows a one-to-many
relationship. The four tables were combined into one table once we had identified the primary key
and foreign key. The primary key in the medication fulfillment table is "Encounter ID". We then used
a left join on "Encounter ID" in a SQL query to combine the medication fulfillment table with the
severity and description columns of the encounter_dx table. There are 1176 rows in the resulting
table.
Next, we ran a query counting the occurrences of each value in the "Order ID" column to see whether
it was the primary key for lab_results.csv. It was not, since the number of unique values did not
correspond to the number of rows in the table. We then found that the composite primary key consists
of two columns, "Order ID" and "Result LOINC". We did not use any columns from lab_results.csv in
the combined dataset because none of them were helpful.
We determined that the primary key for the remaining encounter.csv file is "Encounter ID"; we
therefore extracted the CC column and combined it using a left join on "Encounter ID". We now have a
fully combined dataset with all of the necessary columns. There are 1176 rows in it.
We took the Drug Name, description, severity, and CC from the combined dataset and grouped them to
determine the total number of each drug and the number of linked diseases and descriptions. As seen
in Figure 1, the extracted columns contain a significant number of null values.
FIGURE 1. Drug name grouped by description, severity, and CC.
In order to get the total number of rows with null values associated with a medicine name, we
conducted another query. We were left with only 416 rows of pertinent data to train our
classification model because a total of 764 rows had null values. We therefore concluded that this
dataset does not fulfill the requirement and searched for additional datasets that would be
consistent with the solution of our problem statement.

B. DATASET TWO
1) DATA GATHERING
We forecast diseases based on symptoms in order to make reliable drug recommendations, which are
subsequently based on ratings. We have made an effort to assemble data from two major datasets for
this purpose.

a: SYMPTOMS DATASET
The Disease-Symptom Knowledge Database, which is a knowledge database of disease-symptom associations
generated by an automated method based on information in textual discharge summaries of patients at
New York-Presbyterian Hospital admitted during 2004, provided the dataset. This dataset has three
columns, as shown in Figure 2:
1. Disease
2. Count of Disease Occurrence
3. Symptoms
FIGURE 2. Raw symptoms dataset.
This dataset contains 405 symptoms and 149 distinct diseases. Each disease has 4-5 symptoms that go
along with it. This dataset is passed on for pre-processing in order to train models to classify and
predict the disease.

b: DRUG REVIEW DATASET
Using the predicted disease as an input, this dataset is used to suggest suitable medications based
on reviews and ratings (sentiment analysis). The dataset was acquired from the UCI Machine Learning
Repository for Drug Review, which offers patient reviews on particular medications together with
information on related conditions and a 10-star patient rating indicating overall patient
satisfaction. Because the two datasets in the repository (Test and Train) had the same number of
columns, they were combined for analysis and visualization. The combined dataset has 213869 rows and
7 columns: ID, drug name, condition, Review, Rating, Date, and Useful count, as in Figure 3.
FIGURE 3. Raw drug review dataset.
This dataset includes 916 unique Conditions (Diseases), 3671 unique Drug names, and ratings and
reviews that correspond to the medicine names. To obtain more information for efficient medicine
recommendations, this dataset is then pre-processed and visualized.

c: SIDE EFFECTS DATASET
We were able to include this dataset of adverse effects for particular medications in order to help
patients understand the risks associated with the medication that is being suggested. Once more,
raw data from druglib.com and the UCI Machine Learning Archive for Adverse Effects of Medications
were combined to create this dataset. This dataset has fewer rows than the drug review dataset but
more columns. Because of this, only the "Side Effects" column from this dataset is integrated with
the side effects from the other two datasets and druglib.com, as shown in Figure 4.
FIGURE 4. Raw dataset containing Side effects of drugs.

2) DATA PREPROCESSING
We worked with several datasets in this paper. The datasets were all acquired in their raw form. A
few standard procedures and tests were run on all of the datasets to pre-process them:
1) The number of null and missing values in each dataset was initially counted.
2) Every one of these values was either dealt with or removed from the dataset.
3) The unique values and frequencies for each column were then determined.
4) The dataset was visualized using standard libraries, and any outliers were identified.
Finally, the dataset was cleaned of any extraneous data.

a: SYMPTOMS DATASET
This dataset needed to be cleaned in order to yield useful information. The "Count" column was first
removed since the data it contained was not pertinent to this paper. The drop function was then used
to handle the null entries in the Disease column. The disease and symptom columns were cleaned to
remove extraneous data and leave only the name. The processed data is shown in Figure 5.
We converted it into a new CSV file with symptoms as the columns and diseases as the rows in order
to categorize the symptoms according to the diseases. We mapped every symptom to every disease using
one hot
encoding, adding a value of 1 if the symptom was present for the disease and 0 otherwise. The
one-hot encoded dataset is shown in Figure 6. When symptoms are provided as input, this will assist
us in predicting the diseases.
FIGURE 5. Symptoms dataset after pre-processing.
FIGURE 6. Dataset after marking symptoms present for a disease as 1 else 0.

b: DRUG REVIEW DATASET
In order to visualize and analyze the data on a larger dataset, the two sets in this dataset, Train
and Test, were combined. They could be joined easily because they both had the same columns. The
resulting dataset was fairly clean and did not need much pre-processing. However, a few rows with
null values were removed, and the columns were renamed. Visualizing the data was an interesting
challenge because the dataset holds a lot of information. The drugs with the most reviews, the most
popular drugs, the most prevalent conditions, and so on were plotted on several graphs. Figure 7
illustrates one such representation, which includes the names of a few of the most often used drugs.

c: SIDE EFFECTS DATASET
This dataset was clean and has a lot of data that is similar to that in the Drug Review dataset.
Pre-processing consisted of handling a few null values and removing unnecessary columns. By
combining this information with the drug review dataset, it is possible to map only the adverse
effects of the particular medications.

d: MERGED DATASET
To make the final medicine prediction, the datasets containing the symptoms and the reviews are
combined as in Figure 8.
FIGURE 8. Merged dataset.
In order to decrease the dimensionality of the combined dataset, only the pertinent columns are
preserved.

IV. METHODOLOGY
Our paper's major objective is to suggest a medication to a patient based on the symptoms they
exhibit. The design pipeline and dataflow of our implementation are shown in Figure 9. In order to
develop a drug recommender system, there are two key components that must be addressed: a disease
prediction model and a recommendation model.

A. DISEASE PREDICTION PROTOTYPE
The disease prediction prototype is a probabilistic prototype that bases its predictions on
symptoms. For this purpose we use the Disease-Symptom Knowledge Database, which includes
405 symptoms matched to over 149 distinct diseases. We experimented with several procedures for
accurately predicting the diseases. This was necessary because the dataset contains only one
feature, the symptoms, and training any classifier on just one piece of information weakens it,
which in turn lowers the accuracy of predictions on other inputs. The answer is to generate many
data points for the same condition by considering all of its symptoms, their significance, and their
occurrence. We have created four potential strategies, each of which focuses on a different type of
prediction. All of these methods are ultimately trained on four classifiers, and the accuracy and
prediction results are compared in order to select the best strategy. The methods are explained in
detail below.
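The one-hot symptom table that the approaches build on (one column per symptom, 1 when the symptom is present for a disease and 0 otherwise) can be illustrated with a minimal sketch. It uses only the Python standard library so that it stays self-contained; the `one_hot_encode` helper, the `records` list, and the disease and symptom names are hypothetical and are not taken from the authors' code or from the Disease-Symptom Knowledge Database.

```python
from collections import defaultdict


def one_hot_encode(records):
    """Build one 0/1 row per disease from (disease, symptom) pairs.

    Returns (symptom_columns, table), where table maps each disease to a
    list of 0/1 flags aligned with symptom_columns.
    """
    # Collect the set of symptoms recorded for each disease.
    symptoms_by_disease = defaultdict(set)
    for disease, symptom in records:
        symptoms_by_disease[disease].add(symptom)

    # Fix a stable column order so that every row has the same layout.
    symptom_columns = sorted({symptom for _, symptom in records})

    # 1 if the symptom is present for the disease, 0 otherwise.
    table = {
        disease: [1 if s in present else 0 for s in symptom_columns]
        for disease, present in symptoms_by_disease.items()
    }
    return symptom_columns, table


# Hypothetical toy data for illustration only.
records = [
    ("influenza", "fever"),
    ("influenza", "cough"),
    ("migraine", "headache"),
    ("migraine", "nausea"),
    ("gastritis", "nausea"),
]
columns, table = one_hot_encode(records)
```

Each row of `table` plays the role of one training example, with the disease name as the label and the 0/1 flags as the features.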
FIGURE 7. Visualizing the most popular drugs based on the ratings.
1) APPROACH 1
In this approach, the data is pre-processed, converted, and mapped to a data frame whose columns are
labeled with the list of specific symptoms and the list of specific
and test datasets since Method 1 only has one entry for each
disease. When the classifier was trained on many diseases and
the test data was composed of various diseases, splitting the
data resulted in 0% accuracy. The newly constructed illness
prediction data frame was mapped using the pre-processed
symptoms dataset by setting the values of symptoms present
for a certain disease as 1 and 0 if not. Additionally, the rows
are categorized according to the diseases, and each disease
row is multiplied according to the number of times that dis-
ease occurs. 38839 rows make up the complete data set. Data
for training and testing were then separated from the dataset.
Following that, the data was trained using the Multinomial
Probabilistic Model, ExtraTree Classifier Model, Decision
Tree Classifier Model, and Support Vector Machine Classi-
fier Model, with various Symptoms serving as the training
features and Diseases serving as the labels. By sending one,
two, and three features to the classifier models, the trained
model is utilized to forecast diseases.
3) APPROACH 3
The classifier still provided accuracy rates of 86% even after
adding additional data based on the frequency of disease
occurrence. The disease prediction using approach 2 was only 86% accurate, which was insufficient. So, we made changes to the data to enable the classifiers to recognize patterns and adapt their learning accordingly. The ranking of significance for each disease symptom is maintained in this approach. By setting the value of each symptom to 1 if it is present for a given disease and 0 if not, the symptoms dataset was utilized to map the newly constructed disease prediction data frame. To make more rows, the number of rows for each disease is multiplied. As a result, each disease has an increased and equal number of rows in the resulting data frame. A probabilistic method is used to map the 0s and 1s. The highest-priority symptom is set to 1 in every row of that particular condition. The second-most significant symptom is set to 1 in 95% of the rows, and so on. The least significant symptom is set to 1 in around 40% of the total rows for a given condition. The dataset was then split into training and testing data. Following that, the Multinomial Naive Bayes, ExtraTree Classifier, Decision Tree Classifier, and Support Vector Machine Classifier models were trained, with the various symptoms serving as the training features and the diseases serving as the labels. By sending one, two, three, and four features to the classifier models, the trained model is utilized to forecast diseases as in Figure 10.
FIGURE 9. Design implementation pipeline and dataflow of the recommender prototype.
target diseases. By setting the value of each symptom to 1 if it is present for a given disease and 0 if not, the pre-processed symptoms dataset was utilized to map the newly constructed disease prediction data frame. The rows were then grouped by disease to create a single row for each distinct ailment, with a value of 1 in the appropriate symptom columns. Following that, the Multinomial Naive Bayes, ExtraTree Classifier, Decision Tree Classifier, and Support Vector Machine Classifier models were trained, with the various symptoms serving as the training features and the diseases serving as the labels. By sending one, two, and three features to the classifier models, the trained model is then utilized to forecast the diseases.
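The probabilistic mapping of 0s and 1s used to build the approach 3 training rows can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the example disease and symptoms, and the linear decay between the 95% and 40% levels are assumptions (the text only fixes the first, second, and last ranks).

```python
import random


def symptom_probabilities(n_symptoms, p_second=0.95, p_last=0.40):
    """Return one presence fraction per symptom rank (rank 1 first).

    Rank 1 is always present and rank 2 uses p_second, as stated in the text.
    The linear decay from p_second down to p_last is an assumption.
    """
    if n_symptoms < 1:
        raise ValueError("need at least one symptom")
    if n_symptoms == 1:
        return [1.0]
    if n_symptoms == 2:
        return [1.0, p_second]
    step = (p_second - p_last) / (n_symptoms - 2)
    return [1.0] + [p_second - step * k for k in range(n_symptoms - 1)]


def expand_disease(disease, ranked_symptoms, all_symptoms, n_rows, seed=0):
    """Build n_rows training rows for one disease.

    ranked_symptoms is ordered from most to least significant. Each symptom
    is set to 1 in a randomly chosen share of the rows given by its rank.
    """
    rng = random.Random(seed)
    fractions = symptom_probabilities(len(ranked_symptoms))
    rows = [{s: 0 for s in all_symptoms} for _ in range(n_rows)]
    for symptom, fraction in zip(ranked_symptoms, fractions):
        for i in rng.sample(range(n_rows), round(fraction * n_rows)):
            rows[i][symptom] = 1
    return [(disease, row) for row in rows]


# Hypothetical example: three ranked symptoms out of four known columns.
rows = expand_disease(
    "flu", ["fever", "cough", "fatigue"],
    ["fever", "cough", "fatigue", "rash"], n_rows=100,
)
```

Fixing the number of rows per symptom, rather than drawing each cell independently, keeps the shares exact, which matches the percentages quoted in the text; independent sampling would be an equally plausible reading.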
2) APPROACH 2
Each classifier in approach 1 offered the same accuracy of 89%. Despite this accuracy, the prediction of the disease using symptoms as input was inaccurate. We used approach 2 to improve the precision of the disease forecasts. The 'Count of Disease Occurrence' column values were taken into account in this method to increase the number of rows in the dataset, and the symptoms were treated as equally important. We were unable to divide the data into train
4) APPROACH 4
With this procedure, we sought to produce a dataset in which all the symptoms of a specific disease are dispersed at random across various rows. This results in a dataset with several rows for a single disease, with its symptoms marked differently in each row as 0 or 1.
This dataset was produced by splitting larger datasets into smaller ones using the groupby function on various numbers
of rows. We then performed groupby.sum() to merge all symptoms into one row for each small dataset.
Eventually, the small datasets with various symptom values for each disease are joined to create a single large dataset with numerous rows and various data points for each disease. We were able to train the models more effectively and accurately as a result of this strategy.
FIGURE 10. Accurate drug predictions for four input symptoms.
B. SENTIMENT ANALYSIS OF DRUG REVIEWS
The next step is to utilize the Merged Dataset, which was built by pre-processing the symptoms and the drug review dataset, to map the list of drugs that can be recommended for each of the most likely conditions. Once we have a list of potential medications, the next task is to suggest the optimal medication for the patient. We utilized two alternative strategies for this. The first utilizes natural language processing (NLP) to examine sentiment and a neural network model to predict whether reviews are positive or negative. The second employs the VADER tool, a straightforward rule-based approach for sentiment analysis.
1) APPROACH 1
In this strategy, we apply sentiment analysis to the drug review dataset using natural language processing to analyze the trends in the positive and negative reviews provided by patients. The main justification for conducting this sentiment analysis was that the rating and the review were found to be inconsistent. Even if a review said the product was good and had no side effects, it could still have a rating below 5. Since NLP captures the essence of a sentence to predict a sentiment, it is more accurate. The word2vec approach was utilized to create the NLP model, and a neural network classifier was created to categorize the reviews. The words in the sentences were converted into a vector of TF-IDF weights by TfidfVectorizer, which scores each term by how often it occurs in a review relative to the whole corpus. After acquiring the vectors, a neural network model, defined as a sequential model and trained on this data, was used to get the sentiment predictions. The final fully connected layer had a softmax activation function, and binary cross-entropy loss was employed with the Adam optimizer. The model had a total of 3+1 layers, each of which had batch normalisation and ReLU activation. The Keras API was used to implement the neural network model in its entirety, and Figure 11 shows an overview of the model.
FIGURE 11. Neural network prototype summary.
Algorithm: Sentiment Analysis; epochs: 25, l_rate: 1e-1, batch_size: 100
Input
x: Reviews
y: Ratings
Output
Drugs, Reviews, Disease and Sentiments
Steps:
1. Encode the rating with a threshold of 7 into two classes (>7 and <7) - input: y
2. Get the Tfidf vectors for the reviews - input: x
3. Define a sequential model using the keras API
4. For each epoch in steps per epoch
a. train (fit) the model
b. update the gradients
5. Get predictions for all the reviews and merge the data into the original dataset.
2) APPROACH 2
VADER, or Valence Aware Dictionary and sEntiment Reasoner, is a sentiment analysis tool with a lexicon and rules that is specifically attuned to the sentiments expressed on social media. This model doesn't need any training data because it relies on a gold-standard sentiment lexicon that was created and experimentally validated using a combination of qualitative and quantitative methods. The string input for this tool is a drug review from our dataset. This tool not only informs us of the review's polarity (positive or negative), but also of how strongly positive or negative it is. Using this tool has many benefits:
1. It requires no training data and excels at handling social media-style texts that may contain emoticons and punctuation.
2. It can be used on real-time streaming data.
3. It does not suffer from a speed-performance trade-off.
The output consists of positive, negative, neutral, and compound components. The positive, neutral, and negative components indicate what proportion of the text has a positive, neutral, or negative tone. The three components are stated as decimal values, and their sum always equals one. The compound component defines the sentence's overall sentiment. The following thresholds aid in interpreting the compound score:
1. Positive sentiment: compound score >= 0.05
2. Neutral sentiment: -0.05 < compound score < 0.05
3. Negative sentiment: compound score <= -0.05
Figure 12 shows the output for a sample review. In this statement, there is a small positive component in the language, but the overall sentiment is negative, as the compound value given by VADER makes clear.
FIGURE 12. VADER analysis for a sample review from our dataset.
C. DRUG RECOMMENDATION
We need to suggest a medication for the disease after it has been predicted. We used the Drug Review dataset from the UCI Machine Learning Repository, which contains the disease along with the many medications that are available, their reviews, ratings, and useful count.
As described in the section on pre-processing the merged dataset, this dataset is next combined with the Disease-Symptom Knowledge Database based on common diseases.
This combined dataset has 6 attributes, of which 3 are used to help us pick the optimal medication. These 3 features and their significance are explained below:
1. Feature 1: Review:
One of the most significant features for drug recommendation is the review, since it provides an honest assessment of the drug by actual users. As a result, after performing sentiment analysis, we only considered reviews having a positive sentiment.
2. Feature 2 & 3: Rating & Useful Count:
Each drug has a rating from 1 to 10, as well as a useful count that lets us know how many users have given that specific rating. With the help of this data, each drug's weighted average rating or probabilistic value can be calculated. The two methods are described below:
1) REVIEW APPROACH 1: WEIGHTED AVERAGE APPROACH
The difference between a weighted average and a regular mean is that certain data points contribute more to the final average than others, as opposed to all of the data points contributing equally.
Formula for weighted average:
$$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i} \tag{1}$$
In equation 1, data elements with a high weight contribute more to the weighted mean than do elements with a low weight.
In this instance, the ratings are x and the useful counts are w.
This results in a weighted average rating that prioritizes the useful count, or the number of genuine raters, over other factors.
Figure 13 depicts the function created for calculating the weighted average.
FIGURE 13. Function to create the weighted average.
After calculation, this weighted average column is merged with the dataset, where it replaces the rating column as shown in Figure 14.
The names of the disease and drug are now included in this integrated dataset, together with a review, its sentiment, the average rating, and a useful count. For user recommendations, we have only taken into account medications that have received favorable reviews, thereby removing harmful medications. After we have added up all the useful counts for a particular drug, the drugs to be recommended are sorted by the highest useful count first, followed by the highest average rating. We have utilized the groupby.sum() technique for this. With this method, we are able to use every piece of data in the dataset in accordance with its significance, giving us the best possible drug suggestion as a result.
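The positive-review filter, the weighted average of equation (1), and the ranking by summed useful count can be sketched together as follows. This is an illustrative sketch under stated assumptions rather than the function shown in Figure 13: the record layout (keys `drug`, `rating`, `useful_count`, `compound`) and all function names are hypothetical, and the compound score is assumed to have been computed beforehand by VADER.

```python
from collections import defaultdict


def label_sentiment(compound):
    """Map a VADER compound score to a label using the thresholds in the text."""
    if compound >= 0.05:
        return "positive"
    if compound <= -0.05:
        return "negative"
    return "neutral"


def weighted_average(ratings, weights):
    """Equation (1): sum(w_i * x_i) / sum(w_i), with useful counts as weights."""
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for w, x in zip(weights, ratings)) / total


def rank_drugs(reviews):
    """Rank drugs for one disease from a list of review records.

    Only positive reviews are kept. Drugs are sorted by summed useful count
    first and by weighted average rating second, both descending.
    """
    grouped = defaultdict(list)
    for review in reviews:
        if label_sentiment(review["compound"]) == "positive":
            grouped[review["drug"]].append(review)
    ranked = []
    for drug, rows in grouped.items():
        counts = [r["useful_count"] for r in rows]
        average = weighted_average([r["rating"] for r in rows], counts)
        ranked.append((drug, sum(counts), average))
    ranked.sort(key=lambda item: (item[1], item[2]), reverse=True)
    return ranked
```

A drug whose positive reviews all have a useful count of zero makes `weighted_average` raise; a real pipeline would need to decide how to treat that case, for example by falling back to the plain mean.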
FIGURE 14. After weighted average calculation.
In addition, we tested a probabilistic strategy, which is described below. However, in order to get reliable results, elements of both methodologies were used to create the final recommendation.
2) APPROACH 2
Based on Probability Recommendation: After getting the combined dataset, a disease-based drug recommendation system was created. In order to recommend the best drug possible, the medications with positive reviews were first selected and sorted based on the useful count, as demonstrated in Figure 15.
FIGURE 15. List of recommended drugs based on sentiment review and useful count.
Later, we used the side effects dataset mentioned in the Data Collecting section to rate the suggested drug while accounting for potential negative effects. Unnecessary columns were removed, string values were converted to numerical values, the side effects for the previously recommended drugs were acquired and coupled with the probability value for symptoms based on three distinct ratings, and the results were then sorted according to probability. The final medicine that was suggested therefore had both a probabilistic score and potential side effects.
Algorithm: Recommendation System - Probabilistic Score
Input
x: Disease
Output
Drugs, Prob. of side effects, Disease and Side Effects
Steps:
1. Select the rows in the data for the input disease
2. Select the disease rows which have only positive reviews
3. Sort the values based on the useful count
4. Drop duplicates and select the top 5
5. Encode the ratings on the side effects dataset
6. Define the probabilistic score based on the effectiveness rating, overall rating and side effects rating.
7. Create the list of recommended drugs
8. For items in the drug list
a. map the corresponding side effect with the probabilistic score
b. concatenate to a new dataframe.
9. Sort the new dataset based on the prob. score
V. EXPERIMENTAL RESULTS
A. DISEASE PREDICTION
1) APPROACH 1
We employed the Multinomial Naive Bayes, Extra Tree Classifier, Decision Tree Classifier, and Support Vector Machine models to forecast diseases using approach 1. All four classifiers produced 89.93% accuracy. Because there was insufficient data for the models to learn from, the disease predictions were incorrect. Figures 16, 17, and 18 show the disease prediction results with one, two, and three symptoms.
FIGURE 16. Disease prediction using approach 1 (Prediction using one symptom).
FIGURE 17. Disease prediction using approach 1 (Prediction using two symptom).
2) APPROACH 2
Four models (Multinomial Naive Bayes, Extra Tree Classifier, Decision Tree Classifier, and Support Vector Machine) were utilized for disease prediction using approach 2. All four classifiers produced an accuracy rate of 86.84%.

For one symptom, the illness predictions were wrong. Multinomial Naive Bayes gave the correct prediction for one, two and three symptoms as in Figures 19, 20, and 21.
FIGURE 18. Disease prediction using approach 1 (Prediction using three symptom).
FIGURE 19. Disease prediction using approach 2 (Prediction using one symptom).
FIGURE 20. Disease prediction using approach 2 (Prediction using two symptom).
FIGURE 21. Disease prediction using approach 2 (Prediction using three symptom).
3) APPROACH 3
We employed Multinomial Naive Bayes, Extra Tree Classifier, Decision Tree Classifier, and Support Vector Machine as our four models for approach 3's disease prediction. The accuracy of Multinomial Naive Bayes is 86.90%. The Extra Tree Classifier has an accuracy rate of 88.18%. A Decision Tree Classifier with a maximum depth of 120 offers 86.95% accuracy. The Support Vector Machine classifier provides 86.96%. The disease forecasts for one and two symptoms were wrong. Multinomial Naive Bayes produced the correct prediction for three symptoms. Moreover, Multinomial Naive Bayes, Extra Tree Classifier, and Support Vector Machine correctly diagnosed the disease for one, two and three symptoms as in Figures 22, 23, and 24.
FIGURE 22. Disease prediction using approach 3 (Prediction using one symptom).
FIGURE 23. Disease prediction using approach 3 (Prediction using two symptom).
FIGURE 24. Disease prediction using approach 3 (Prediction using three symptom).
4) APPROACH 4
We have also utilized this method to make predictions with all the models, and the accuracy obtained is as follows: Decision Tree: 88.24%, Multinomial NB: 87.98%, Random Forest: 81.88%, Gaussian NB: 88.34%, and SVM: 88.61%. Each classifier has done well, and the condition that was predicted using three symptoms as input features is likewise accurate. When we used symptoms from various diseases to try and predict the disease, the model was 9 out of 10 times accurate. We utilized this approach for our final disease prediction because the outcomes were quite good and the calculation time was also reduced. The prediction result is shown in Figure 25.
We can see that Multinomial NB, NB and SVM predict the correct disease.
FIGURE 25. Disease prediction using approach 4 (Prediction using all symptom).
B. COMPARISON OF PERFORMANCE VALUES
Table 1 shows the accuracy values of each approach as described above.
TABLE 1. Accuracy values of each approach.
FIGURE 26. Evaluation metrics for sentiment analysis.
C. SENTIMENT ANALYSIS
1) APPROACH 1
We used loss, accuracy, F1 score, recall, and precision metrics to assess the NLP model, and the results are shown in Figure 26.
It should be highlighted that while training accuracy exceeded 90%, cross-validation (CV) accuracy did not. One possible reason is that the actual labels were incorrect. A few of the predicted sentiments are listed in Figure 27.
FIGURE 27. Predicted labels of reviews.
A strong argument for using NLP-based predicted values rather than rating thresholds can be seen there, where the review at index 1, which states "Works good for me", carries a bad rating, yet the predicted label correctly marks it as a positive review.
2) APPROACH 2
We have also employed VADER analysis as a second method to obtain the sentiment of each drug review. The four parts that make up VADER's output are positive, negative, neutral, and compound. Figure 28 shows the results for each review in the dataset after performing the VADER analysis.
FIGURE 28. Vader sentiment analysis for all reviews.
We applied the previously mentioned thresholds to categorise the overall sentiment of each review. The distribution of positive, negative, and neutral reviews in the dataset, after each review has been given its overall sentiment, can be seen in Figure 29.
FIGURE 29. Graph showing sentiment of reviews in the dataset.
D. DRUG RECOMMENDATION
We tried two methods for the drug recommendation, as mentioned above. The final suggestion, however, was made by combining the useful count factor from the probabilistic technique with the weighted average approach. Several factors were taken into account before suggesting the medication:
Several medications are available for the same condition. Hence, we filtered out the negative and neutral reviews using sentiment analysis, leaving us with only favorable ones.
1) APPROACH 1: THE WEIGHTED AVERAGE METHOD
We were able to find the ideal and highly regarded medicine for the predicted ailment thanks to the outcomes of this strategy. Based on the information provided, we predicted osteoporosis as the disease, and as an output, we recommend the top 3 medications from our dataset, as shown in Figure 30.
FIGURE 30. Output of weighted average approach for drug recommendation.
a: VERIFICATION OF RESULTS WITH REAL WORLD
After conducting a Google search on osteoporosis therapy, we learned that it calls for vitamins and bone health support, which is precisely what our first recommended medicine provides. We can therefore conclude that our medicine recommendation system performs as anticipated and produces the desired outcomes. The side effects dataset is then integrated with this output to determine if the suggested medication has any side effects as shown in Figures 31 and 32.
FIGURE 31. Drug with possible side effects.
FIGURE 32. Mapped recommended drugs with possible side effects and probabilistic score.
The final medicine recommendation code illustrates this as in Figure 33:
FIGURE 33. Recommended drug for osteoporosis's disease with its possible side effects.
2) APPROACH 2: PROBABILISTIC APPROACH
This method produced fairly good findings and evaluations. Nevertheless, there isn't a concrete dataset to compare

the suggested outcomes against, as this list of medications for osteoporosis is given together with any potential adverse effects, sorted by probabilistic score. Figure 34 shows the list of recommended drugs along with possible side effects for osteoporosis ranked based on probabilistic score.
FIGURE 34. List of recommended drugs along with possible side effects for osteoporosis ranked based on probabilistic score.
The outcomes of this strategy can be contrasted with a list of osteoporosis medications that are typically prescribed.
VI. DISCUSSION
We have bundled all the functions used in this paper, with the most accurate techniques, in a single file. This programme forecasts the disease using symptoms as input. The predicted disease is then used as input by the drug recommender, which outputs the recommended medication along with a list of its side effects.
VII. CONCLUSION AND FUTURE SCOPE
Drug recommendation systems are a common technology in today's online services, and as demand for these services grows, there is an increasing need to automate the processes. As a result, we have created a medication recommendation system. The main conclusions from our project are listed below.
1. We successfully created a drug recommendation prototype that recommends medicines, along with their potential adverse effects, based on user-inputted symptoms.
2. For the execution of this project, we created three models: one for sentiment analysis, one for predicting diseases, and one for making recommendations.
3. We tested several strategies for each of the three models.
4. Each of the three models provided accurate results, adding to the drug recommendation model's overall dependability.
One key future direction is improving the accuracy of the prediction and recommender models by using deep neural networks and larger data.

SUVENDU KUMAR NAYAK received the M.Tech. degree in computer science (CS) from Berhampur University, in 2009. He is currently pursuing the Ph.D. degree in machine learning (recommender systems) with the Centurion University of Technology and Management, Odisha. He is also an Assistant Professor with the Department of Computer Science and Engineering, Centurion University of Technology and Management. He has more than 12 years of experience in academia. He is certified in RHCSA (Red Hat) and Cloud Practitioner (AWS), and also holds a certificate in Oracle Database Development. He has published many articles in international journals and conferences.

MAMATA GARANAYAK received the M.Tech. degree and the Ph.D. degree in machine learning (recommender systems) and in CSE from KIIT, Deemed to be University. She is currently an Associate Professor with the Computer Science Department, Kalinga Institute of Social Sciences (KISS), Deemed to be University, Bhubaneswar. She has 17.8 years of teaching experience. Her research interests include recommender systems, image processing, and machine learning.

SANGRAM KESHARI SWAIN received the Ph.D. degree in computer science and engineering. He has 12 years of teaching experience, with a demonstrated history of working in the higher education industry. He is skilled in social services, teaching, research, data analysis, and higher education. He is a strong education professional with multidimensional approaches spanning engineering, technology, management, social service, and law.

SANDEEP KUMAR PANDA is currently an Associate Professor and the Head of the Department of Artificial Intelligence and Data Science, Faculty of Science and Technology (IcfaiTech), ICFAI Foundation for Higher Education (Deemed to be University), Hyderabad, Telangana, India. He has published 50 papers in international journals and international conferences and book chapters of repute. His research interests include blockchain technology, the Internet of Things, AI, and cloud computing. He received the "Research and Innovation of the Year Award 2020" from MSME, Government of India and DST, Government of India at New Delhi, in 2020, and the "Research Excellence Award" from Brand Honchos, in 2022. He has 17 Indian patents to his credit. He has six edited books named Bitcoin and Blockchain: History and Current Applications (CRC Press, USA), Blockchain Technology: Applications and Challenges (Springer, ISRL), Artificial Intelligence and Machine Learning in Business Management: Concepts, Challenges, and Case Studies (CRC Press, USA), The New Advanced Society: Artificial Intelligence and Industrial Internet of Things Paradigm (Wiley Press, USA), Recent Advances in Blockchain Technology: Real-World Applications (Springer, ISRL), and Metaverse and Immersive Technologies: An Introduction to Industrial Business and Social Applications (Wiley Press, USA). He has ten lakh seed money projects from IFHE. He is also a Reviewer of the IEEE ACCESS. His professional affiliations are MIEEE, MACM, and LMIAENG.

DEEPTHI GODAVARTHI received the Ph.D. degree from Andhra University, Andhra Pradesh, India, in 2022. She is currently an Assistant Professor with the School of Computer Science and Engineering, VIT-AP University, Amaravati, Andhra Pradesh. Her research interests include AI, NLP, computer vision, and block chain technologies.
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.
