Topics in Intelligent Computing and Industry Design (ICID) 2(2) (2020) 104-108

Ethics and Information Technology (ETIT)


DOI: http://doi.org/10.26480/etit.02.2020.104.108

ISBN: 978-1-948012-17-1

REVIEW ARTICLE

FAKE NEWS DETECTION USING SUPERVISED LEARNING METHOD


Swapnesh Jain (a), Ruchi Patel (b)*, Shubham Gupta (c), Tanu Dhoot (d)
(a) Department of Computer Science and Engineering, Medicaps University, Indore – 453331, India
(b) Assistant Professor, Computer Science and Engineering, Medicaps University, Indore – 453331, India
(c) Department of Computer Science and Engineering, Medicaps University, Indore – 453331, India
(d) Department of Computer Science and Engineering, Medicaps University, Indore – 453331, India

*Corresponding Author Email: [email protected]

This is an open access article distributed under the Creative Commons Attribution License CC BY 4.0, which permits unrestricted use, distribution, and
reproduction in any medium, provided the original work is properly cited.

ARTICLE DETAILS

Article History:
Received 25 October 2020
Accepted 26 November 2020
Available online 03 December 2020

ABSTRACT

With the advent of social media, information has never been more accessible than it is today. But alongside that accessibility comes a flood of information, at a time when virtually everyone is a content creator. In this age where data is the new oil, the volume of data produced on social media every day has also caused certain troubles. One of the most notorious of these is "fake news". While social media has given a voice to the previously voiceless and enabled them to dismantle old one-sided narratives, it has also enabled people to monger fear, incite violence and spread misinformation. A recent effect of fake news can be seen in the violence in New Delhi, India, where a systematic spread of misinformation led to riots against the majority community of India. In this paper, a comparative study is made of the efficiency of various models in correctly identifying fake news, along with their true positives and true negatives. The aim is to apply a count vector and a TF-IDF vector to four different machine learning methods, namely Naïve Bayes, Logistic Regression, Random Forest and XGBoost, on two different datasets, Kaggle and LIAR. Based on the results obtained, XGBoost along with the count vector gave the highest accuracy in predicting fake news.

KEYWORDS
Count Vector, Fake News, Logistic Regression, Naïve Bayes, Random Forest, Tf-Idf Vector, True Positives, True
Negatives, XGBoost.

1. INTRODUCTION

With the information revolution now a reality, and with virtually every person holding a smartphone and a working internet connection being a content creator, everyone has to deal with the new menace of fake news almost every day, with social media as its breeding ground. With more and more people joining social media every day, and with its great influence on how people form opinions, there are actors with vested interests who want to use this information as a means to propel their agenda.

This has led to the creation of many websites that publish articles containing half-truths or outright lies. Some websites publish fake news almost exclusively, to push propaganda (mostly political) and influence people in a certain way. The fake news industry is a global issue as well as a challenge for the world, with many established media houses also losing credibility over the years after being repeatedly caught spreading fake news.

Many scientists believe that this problem can be dealt with using Machine Learning techniques and Artificial Intelligence. Hence this paper describes a comparative study of four of the most popular machine learning methods, to identify which method produces the best results when detecting fake news.

2. RELATED WORK

Fake news detection is a major challenge. Its most challenging part is the detection of deceptive language, which is done using statistical methods. The issue becomes even more serious when dealing with reviews obtained from television interviews and social media posts such as those on Facebook and Twitter. One study combined sentiment analysis with network metadata to detect fake news; the author trained a Random Forest classifier that achieved an F1-score of more than 88%, and also designed a scraping tool to gather news articles from different sources (Shrestha, 2018). Researchers from Korea University co-authored a paper on article abstraction in which they created a factual database by collecting obvious facts of human decisions; their system searches this factual database for articles related to a news item to verify whether the news is reliable (Kim and Jeong, 2019).

A recent study implemented deep neural networks for detecting fake news, evaluating the performance of different machine learning and deep learning models at the task (Kaliyar, 2018). Other recent work includes an integrated web service model that accepts news text or a URL from the user and then checks the truth level of the news (Mokhtar et al., 2019). Another study focused on detecting fake news by training a Naïve Bayes classifier, which was then tested on Facebook posts; the model achieved a classification accuracy of 74% on the test data (Granik and Mesyura, 2017). Further work includes the evaluation of different machine learning methods using TF-IDF and probabilistic context-free grammars (Gilda, 2017), fake news detection on social media networks to filter out sites with false and misleading news (Aldwairi and Alwahedi, 2018), and hybrid text
Cite The Article: Swapnesh Jain, Ruchi Patel, Shubham Gupta, Tanu Dhoot (2020). Fake News Detection Using Supervised Learning Method. Topics In Intelligent Computing And Industry Design, 2(2): 104-108.

classification methods to deal with the classification of fake news (Kaur et al., 2019).

3. METHODOLOGIES

For classification problems, many different supervised learning techniques are available, such as Naïve Bayes, Decision Trees, Support Vector Machines, Gradient Descent, K-Nearest Neighbors, K-fold cross-validation, Neural Networks, etc. Some of these popular techniques have been used in this study and are described below.

3.1 Naïve Bayes (NB)

The Naive Bayes algorithm is a simple but efficient technique for the construction of classifiers. It consists of models that assign class labels to problem instances, which are represented as feature vectors. There is not just a single algorithm to train all such classifiers; rather, there exists a family of algorithms based on a common principle: all NB classifiers assume that, given the class variable, the value of a feature does not depend on the value of any other feature. Despite their naive design and seemingly oversimplified assumptions, NB classifiers have been known to work quite well in many complex real-world scenarios. An advantage of the NB classifier is that only a small amount of training data is required to estimate the parameters necessary for classification.

3.2 Logistic Regression (LR)

The Logistic Regression algorithm is applied when the dependent (target) variable is categorical in nature. In its basic form, LR is a statistical model that uses a logistic function to model a binary dependent variable. Models based on an analogous technique that use a different kind of sigmoid function instead of the logistic function can also be considered; one such example is the probit model. The characteristic feature that defines the logistic model is that when the value of one of the independent variables increases multiplicatively, the odds of the outcome are scaled at a constant rate, with each independent variable having its own parameter; this generalizes the odds ratio for a binary dependent variable.

3.3 Random Forest (RF)

As the name implies, Random Forest is a model that consists of numerous individual and independent decision trees that function together as an ensemble. Each individual tree in the RF returns a class as a prediction, and the class that attains the highest number of votes becomes the prediction of the entire model. The distinguishing feature of RF is that a fairly large number of independent trees working together outperform any standalone model; like trees under a forest canopy, they shelter each other from their respective individual errors (Understanding Random Forest, 2020).

3.4 Extreme Gradient Boosting (XGBoost)

Extreme Gradient Boosting (hereinafter referred to as XGBoost) is an ensemble machine learning algorithm that is based on decision trees and works within a gradient boosting framework. In prediction problems dealing with unstructured data (images, text, etc.), artificial neural networks tend to outperform all other algorithms on similar problems. However, for small-to-medium structured or tabular data, algorithms based on decision trees are considered the best-suited models and are widely recommended for this purpose (Morde et al., 2020). XGBoost is one such gradient boosting library; it is highly optimized and distributed, and is designed to be efficient, flexible and portable.

4. PROPOSED SOLUTION

This section describes the main steps for detecting fake news using different machine learning methods, which are as follows:

Step 1: Two different datasets, Kaggle and LIAR, are gathered to implement each method.
Step 2: Cleaning techniques such as tokenization, stemming, and removal of punctuation marks, tags and stop-words are applied to each dataset.
Step 3: After cleaning, two different pre-processing techniques, CountVectorizer (count of terms in a vector) and term frequency-inverse document frequency (TF-IDF), are used on each dataset to build the vocabulary of the vectors.
Step 4: The methods are then trained using Naïve Bayes, Logistic Regression, Random Forest and XGBoost.
Step 5: The performance of these different machine learning methods is evaluated and compared using their accuracy and confusion matrix.

[Figure 1: Flowchart of Proposed Solution — Kaggle and LIAR datasets → cleaning the dataset (stemming and removing stop-words) → pre-processing techniques (Count Vector and TF-IDF Vector) → training models using the train dataset → predicting accuracies and confusion matrices using the test dataset]

In the above proposed solution, steps 2 and 3 are important because they not only normalize the dataset but also convert the textual news data into numerical vectors, since these machine learning methods can only be trained on numerical data. In step 2 the dataset is first cleaned, because data gathered from different sources arrives in a raw format that the methods cannot accept. In step 3, CountVectorizer is used to count the frequency of each word occurring in the news. TF-IDF, a technique from information retrieval, is also used to determine the importance of individual terms in the set of text documents. A TF-IDF value increases proportionally with the number of occurrences of a given word in a given document, but is offset by the frequency of the word across the corpus.

TF-IDF is computed as the product of term frequency and inverse document frequency. Term frequency is the ratio of the number of times a term appears in a document to the total number of terms in the document, and inverse document frequency is the log of the ratio of the number of documents to the number of documents that contain the word.

Once the textual data is converted into numerical vectors (using CountVectorizer and TF-IDF), the methods are trained on these vectors so that the output for the test dataset can be predicted and the accuracies of the four methods compared to determine their efficiency. A confusion matrix is also generated, which classifies the news items as positive or negative to evaluate performance.

In the confusion matrix shown in Table 1, rows represent the classifications predicted by the model, while columns represent the actual classifications in the test data.

Table 1: Confusion Matrix

                          Actual Positive        Actual Negative
  Predicted Positive      True Positives (TP)    False Positives (FP)
  Predicted Negative      False Negatives (FN)   True Negatives (TN)

Here, TP is actual real news correctly predicted as real, FP is actual false news incorrectly predicted as real, FN is actual real news incorrectly predicted as false, and TN is actual false news correctly predicted as false.

Accuracy is the ratio of the number of predictions correctly classified by the model to the total number of samples. It is computed using equation (1):

  Accuracy = (TP + TN) / (TP + TN + FP + FN)    (1)

5. EXPERIMENTAL SETTINGS
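The five steps of Section 4 can be sketched end-to-end with scikit-learn. This is a minimal illustration, not the authors' code: the tiny corpus and its labels are invented, evaluation is done on the training set purely for brevity, and XGBoost is omitted to keep the sketch dependency-free (xgboost's XGBClassifier plugs in alongside the other models in the same way).

```python
# Sketch of the Section 4 pipeline: clean -> vectorize -> train -> evaluate.
import re

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.naive_bayes import MultinomialNB

# Step 1 (stand-in): a toy labelled corpus, 1 = fake, 0 = real.
texts = [
    "breaking miracle cure doctors hate this secret trick",
    "government releases official quarterly employment statistics",
    "shocking hidden plot revealed share before it is deleted",
    "city council approves budget for public transport upgrades",
    "you will not believe what this one weird trick can do",
    "university publishes peer reviewed study on air quality",
]
labels = [1, 0, 1, 0, 1, 0]

def clean(text):
    # Step 2 (simplified): lowercase and drop punctuation. Stemming and
    # stop-word removal (e.g. with NLTK) would slot in here.
    return re.sub(r"[^a-z ]", "", text.lower())

results = {}
for vectorizer in (CountVectorizer(), TfidfVectorizer()):
    # Step 3: build the vocabulary and vectorize the cleaned text.
    X = vectorizer.fit_transform(clean(t) for t in texts)
    # Step 4: the classifiers used in the paper (XGBoost omitted here).
    for model in (MultinomialNB(),
                  LogisticRegression(max_iter=1000),
                  RandomForestClassifier(n_estimators=50, random_state=0)):
        model.fit(X, labels)
        # Step 5: accuracy and confusion matrix (on the train set, for
        # brevity; the paper uses held-out test splits).
        pred = model.predict(X)
        key = (type(vectorizer).__name__, type(model).__name__)
        results[key] = (accuracy_score(labels, pred),
                        confusion_matrix(labels, pred))

for (vec, mod), (acc, cm) in results.items():
    print(f"{vec:16s} {mod:24s} acc={acc:.2f}")
```

On the real datasets, the cleaned train split would replace the toy corpus, and the confusion matrices would be tabulated as in Section 6.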
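As a worked example of the TF-IDF definition given in Section 4 (tf = term count / total terms in the document; idf = log of the number of documents over the number of documents containing the term), the toy documents below are illustrative assumptions. Note that scikit-learn's TfidfVectorizer applies a smoothed idf, so its values differ slightly from this textbook formula.

```python
# TF-IDF computed exactly as defined in Section 4.
import math

docs = [
    "fake news spreads fast",
    "real news spreads slowly",
    "fake stories are fake",
]

def tf(term, doc):
    # Fraction of the document's terms that are `term`.
    words = doc.split()
    return words.count(term) / len(words)

def idf(term, docs):
    # Log of (number of documents / number of documents containing `term`).
    df = sum(1 for d in docs if term in d.split())
    return math.log(len(docs) / df)

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# "fake" occurs twice in the 4-word third document and in 2 of the 3
# documents, so its weight there is (2/4) * log(3/2).
print(round(tfidf("fake", docs[2], docs), 4))
# "news" occurs once in the 4-word first document and in 2 of the 3
# documents: (1/4) * log(3/2).
print(round(tfidf("news", docs[0], docs), 4))
```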

5.1 Datasets

To implement each method, two different datasets were gathered for this project.

The first dataset is from Kaggle and includes 16,548 human-labeled news articles in the train set, described by five columns, namely 'id', 'title', 'author', 'text' and 'label' (categorized as 0, indicating real news, and 1, indicating fake news). The Kaggle test dataset consists of 4,137 news articles described by four columns, namely 'id', 'title', 'author' and 'text'; this test set contains news without labels. The Kaggle submit dataset consists of the same 4,137 news articles as the test dataset, described by two columns, 'id' and 'label', with the label included in order to compute the accuracy of each method.

The second dataset is LIAR, which includes 10,240 human-labeled news articles in the train set, described by two columns, namely 'statement' and 'label' (categorized as true and false). The LIAR test dataset consists of 2,551 human-labeled news articles described by the same two columns, used to compute the accuracy of each model.

5.2 Pre-processing

First, both datasets were inspected to see what type of data each method is dealing with. Upon close observation it was found that both the Kaggle and LIAR datasets contain missing values and incomplete news articles; the Kaggle dataset additionally contains news in different languages. To increase the efficiency of each model, the missing values, incomplete articles and articles in other languages were eliminated. The text field plays an important role in labelling, so rows with a blank text field were also deleted.

After eliminating this data, all special characters (punctuation marks and tags) along with stop-words were removed. Stop-words are words that do not contribute much to predicting whether news is real or fake; instead they only introduce confusion into each model. Some commonly used stop-words are "a", "the", "of", "I", "you", "it", "and", etc. There is no benefit in these words taking up space in the database or consuming valuable processing time, so eliminating them improves the efficiency of the model. After the stop-words were removed, stemming was carried out to reduce words to their roots.

The research was carried out using the Python language because it provides a great choice of libraries and packages for building machine learning models. Python also has strong community support, as many programmers contribute to help each other.

6. RESULT AND ANALYSIS

The models were trained and tested on both datasets. In this project, two pre-processing techniques, namely CountVectorizer and TF-IDF, and four machine learning methods (discussed in Methodologies) were used, and the confusion matrices along with the accuracies were obtained and tabulated.

6.1 Experimental Result on Kaggle Dataset

Table 2: Confusion Matrix when CountVectorizer was used as pre-processing technique on Kaggle dataset

  Classification Algorithm   Label                 Actual Positive   Actual Negative
  Naïve Bayes                Predicted Positive    1938              98
                             Predicted Negative    270               1551
  Logistic Regression        Predicted Positive    1957              79
                             Predicted Negative    98                1723
  Random Forest              Predicted Positive    1975              61
                             Predicted Negative    267               1554
  XGBoost                    Predicted Positive    1961              75
                             Predicted Negative    66                1755

Table 2 details the labels obtained from the various prediction methods and the extent of their correctness; it was obtained by applying CountVectorizer along with the various prediction methods to the Kaggle dataset. With the Naïve Bayes classifier, 3489 correct and 368 incorrect predictions were obtained; with Logistic Regression, 3680 correct and 177 incorrect; with Random Forest, 3529 correct and 328 incorrect; and with XGBoost, 3716 correct and 141 incorrect.

Table 3: Accuracy table for CountVectorizer technique on Kaggle dataset

  Classification Algorithm   Accuracy %
  Naïve Bayes                90.45
  Logistic Regression        95.41
  Random Forest              91.49
  XGBoost                    96.34

Table 3 consolidates the accuracy of the various prediction methods with CountVectorizer on the Kaggle dataset, in percentage terms: 90.45% using Naïve Bayes, 95.41% using Logistic Regression, 91.49% using Random Forest and 96.34% using XGBoost.

Table 4: Confusion Matrix when TF-IDF was used as pre-processing technique on Kaggle dataset

  Classification Algorithm   Label                 Actual Positive   Actual Negative
  Naïve Bayes                Predicted Positive    2016              20
                             Predicted Negative    511               1310
  Logistic Regression        Predicted Positive    1954              82
                             Predicted Negative    107               1714
  Random Forest              Predicted Positive    1962              74
                             Predicted Negative    226               1595
  XGBoost                    Predicted Positive    1955              81
                             Predicted Negative    89                1732

Table 4 details the labels obtained from the various prediction methods and the extent of their correctness; it was obtained by applying the TF-IDF Vectorizer along with the various prediction methods to the Kaggle dataset. With the Naïve Bayes classifier, 3326 correct and 531 incorrect predictions were obtained; with Logistic Regression, 3668 correct and 189 incorrect; with Random Forest, 3557 correct and 300 incorrect; and with XGBoost, 3687 correct and 170 incorrect.

Table 5: Accuracy table for TF-IDF technique on Kaggle dataset

  Classification Algorithm   Accuracy %
  Naïve Bayes                86.23
  Logistic Regression        95.09
  Random Forest              92.22
  XGBoost                    95.59

Table 5 consolidates the accuracy of the various prediction methods with the TF-IDF Vectorizer on the Kaggle dataset, in percentage terms: 86.23% using Naïve Bayes, 95.09% using Logistic Regression, 92.22% using Random Forest and 95.59% using XGBoost.

6.2 Experimental Result on LIAR Dataset
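Equation (1) ties each confusion matrix to its accuracy table, and the reported figures can be cross-checked by simple arithmetic. As an illustrative sketch, the two best-performing configurations in this paper are recomputed below from their confusion-matrix cells:

```python
# Equation (1): accuracy = (TP + TN) / (TP + TN + FP + FN).
def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

# XGBoost + CountVectorizer on the Kaggle test set (Table 2).
kaggle_xgb = accuracy(tp=1961, tn=1755, fp=75, fn=66)
# Random Forest + CountVectorizer on the LIAR test set (Table 6).
liar_rf = accuracy(tp=561, tn=1012, fp=608, fn=370)

print(round(100 * kaggle_xgb, 2))  # 96.34, matching Table 3
print(round(100 * liar_rf, 2))     # 61.66, matching Table 7
```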


Table 6: Confusion Matrix when CountVectorizer was used as pre-processing technique on LIAR dataset

  Classification Algorithm   Label                 Actual Positive   Actual Negative
  Naïve Bayes                Predicted Positive    586               583
                             Predicted Negative    401               981
  Logistic Regression        Predicted Positive    590               579
                             Predicted Negative    447               935
  Random Forest              Predicted Positive    561               608
                             Predicted Negative    370               1012
  XGBoost                    Predicted Positive    482               687
                             Predicted Negative    294               1088

Table 6 details the labels obtained from the various prediction methods and the extent of their correctness; it was obtained by applying CountVectorizer along with the various prediction methods to the LIAR dataset. With the Naïve Bayes classifier, 1567 correct and 984 incorrect predictions were obtained; with Logistic Regression, 1525 correct and 1026 incorrect; with Random Forest, 1573 correct and 978 incorrect; and with XGBoost, 1570 correct and 981 incorrect.

Table 7: Accuracy table for CountVectorizer technique on LIAR dataset

  Classification Algorithm   Accuracy %
  Naïve Bayes                61.42
  Logistic Regression        59.78
  Random Forest              61.66
  XGBoost                    61.54

Table 7 consolidates the accuracy of the various prediction methods with CountVectorizer on the LIAR dataset, in percentage terms: 61.42% using Naïve Bayes, 59.78% using Logistic Regression, 61.66% using Random Forest and 61.54% using XGBoost.

Table 8: Confusion Matrix when TF-IDF was used as pre-processing technique on LIAR dataset

  Classification Algorithm   Label                 Actual Positive   Actual Negative
  Naïve Bayes                Predicted Positive    366               803
                             Predicted Negative    220               1162
  Logistic Regression        Predicted Positive    531               638
                             Predicted Negative    352               1030
  Random Forest              Predicted Positive    515               654
                             Predicted Negative    324               1058
  XGBoost                    Predicted Positive    479               690
                             Predicted Negative    326               1056

Table 8 details the labels obtained from the various prediction methods and the extent of their correctness; it was obtained by applying the TF-IDF Vectorizer along with the various prediction methods to the LIAR dataset. With the Naïve Bayes classifier, 1528 correct and 1023 incorrect predictions were obtained; with Logistic Regression, 1561 correct and 990 incorrect; with Random Forest, 1573 correct and 978 incorrect; and with XGBoost, 1535 correct and 1016 incorrect.

Table 9: Accuracy table for TF-IDF technique on LIAR dataset

  Classification Algorithm   Accuracy %
  Naïve Bayes                59.89
  Logistic Regression        61.19
  Random Forest              61.66
  XGBoost                    60.17

Table 9 consolidates the accuracy of the various prediction methods with the TF-IDF Vectorizer on the LIAR dataset, in percentage terms: 59.89% using Naïve Bayes, 61.19% using Logistic Regression, 61.66% using Random Forest and 60.17% using XGBoost.

This study used two pre-processing algorithms and four prediction methods in all combinations on two datasets, Kaggle and LIAR. On the Kaggle dataset, the highest accuracy, 96.34%, was achieved when the CountVectorizer technique was applied with the XGBoost prediction method. The same set of combinations was used to predict fake news on the LIAR dataset, where the highest accuracy, 61.66%, was achieved with the CountVectorizer technique and the Random Forest prediction method.

7. CONCLUSION AND FUTURE WORK

With the proliferation of social media, more and more people access news from unconventional sources such as social media rather than traditional mainstream media. This has led to a mushrooming of fake news platforms all over the world.

This paper therefore analyses various text pre-processing techniques and classification methods that can predict whether a news item is fake or not. In this experiment it was found that the XGBoost classifier with the CountVectorizer technique achieved the highest accuracy, and it can therefore be used to predict whether news is fake or real.

In the future this model can be extended to a sentiment-based model, and algorithms like Recursive Neural Networks (RNN) can be applied to further improve efficiency. The project can be extended further to include fact-checking and deep syntax analysis, as well as recommending similar credible articles. Convolutional Neural Networks (CNN) can be used on image-based news to detect whether it is real or misleading.

ACKNOWLEDGEMENTS

We would like to express our deepest gratitude to the Honourable Chancellor, Shri R. C. Mittal, Medicaps University, who has provided us with every facility to successfully carry out this project, and our profound indebtedness to Prof. (Dr.) Sunil K. Somani, Vice Chancellor, Medicaps University, whose unfailing support and enthusiasm have always boosted our morale. We also thank Prof. (Dr.) D. K. Panda, Dean of Engineering, Medicaps University, for giving us the chance to work on this project. We would also like to thank our Head of Department, Prof. (Dr.) Suresh Jain, Medicaps University, for his continuous encouragement towards the betterment of the project. We express our heartfelt gratitude to our project guide, Dr. Ruchi Patel, Asst. Professor, Department of Computer Science & Engineering, Medicaps University, without whose continuous help and support this project would not have been possible.

REFERENCES

Aldwairi M., Alwahedi A., 2018. Detecting Fake News in Social Media Networks. The 9th International Conference on Emerging Ubiquitous Systems and Pervasive Networks (EUSPN 2018), Procedia Computer Science, Abu Dhabi, 141, 215-222.

Gilda S., 2017. Notice of Violation of IEEE Publication Principles: Evaluating machine learning algorithms for fake news detection. IEEE 15th Student Conference on Research and Development (SCOReD), Putrajaya, 110-115.

Granik M., Mesyura V., 2017. Fake news detection using naive Bayes classifier. IEEE First Ukraine Conference on Electrical and Computer Engineering (UKRCON), Kiev, 900-903.


Kaliyar R.K., 2018. Fake News Detection Using A Deep Neural Network. 4th International Conference on Computing Communication and Automation (ICCCA), Greater Noida, India, 1-7.

Kaur P., Boparai R.S., Singh D., 2019. Hybrid Text Classification Method for Fake News Detection. International Journal of Engineering and Advanced Technology (IJEAT), 8(5), 2388-2392.

Kim K., Jeong C., 2019. Fake News Detection System using Article Abstraction. 16th International Joint Conference on Computer Science and Software Engineering (JCSSE), Chonburi, Thailand, 209-212.

Mokhtar M.S., Jusoh Y.Y., Admodisastro N., Pa N.C., Amruddin A.Y., 2019. Fakebuster: Fake News Detection System Using Logistic Regression Technique In Machine Learning. International Journal of Engineering and Advanced Technology (IJEAT), 9(1), 2407-2410.

Morde V., 2020. XGBoost Algorithm: Long May She Reign! Medium.

Shrestha M., 2018. Detecting Fake News with Sentiment Analysis and Network Metadata. Earlham College, Richmond.

Understanding Random Forest, 2020. Medium.
