Detection of Malicious Web Contents Using Machine and Deep Learning Approaches
Research Scholar, MUIT, Lucknow
KNIT Sultanpur
BBD University, Lucknow
MUIT, Lucknow
Websites have become a main target of attackers as the Internet has grown rapidly. An attacker implants malicious content in
a web page to carry out a variety of harmful and unwanted actions, such as stealing credentials and resources, luring a
web user to an unsafe website, installing or downloading software that joins the victim's machine to a botnet, or taking part in
distributed denial-of-service attacks. Such content can also damage the user's system. Unwanted web content such as phishing,
spam, and drive-by downloads is hosted on malicious URLs, which entice unsuspecting users into schemes that lead to financial
loss, data theft, and malware installation. Every year, billions of dollars are lost as a result. It is critical to detect and
respond to such threats as soon as possible.
Keywords: Web content, URL, Cyber-crime, malware, Classification.
Since the COVID-19 pandemic, Internet usage has grown heavily, whether for distance learning or for company team
meetings. This usage also exposes users to cyber-crime through malicious URLs, which attackers use to fetch, read,
and manipulate user data [11]. It is therefore very important to know whether the page a user is visiting is safe or not.
In this work we tried several approaches to designing a model that determines the category of the URL a user is
visiting. As the number of web pages grows, so does the number of rogue web pages, and attacks become more
sophisticated [12]. One aim of malicious URL identification is to prevent company employees from logging on to
websites that may interfere with the operation of the business, such as websites unrelated to work, websites with
offensive or unlawful content, or websites associated with phishing attempts. While unrestricted web access can be
very beneficial for employees and can make them much more productive, it can also expose organizations to a wide
variety of security threats, such as the propagation of threats, data loss or deletion, or legal issues. Web content has
become a primary target for cyber criminals, who inject malware, specifically JavaScript, to carry out malicious actions
such as impersonation [10]. It is therefore imperative to discover such malicious code in real time, before any harmful
action is performed. Here we present an analysis of detecting malicious web pages using machine learning and deep
learning approaches. The results show that these techniques can effectively distinguish malicious code from benign
code, with promising results.
Sananse and Sarode (2015) developed a technique for distinguishing phishing from non-phishing URLs. They used
Random Forest and a content-based algorithm for classification on their dataset.
Jeeva and Raj Singh (2016) identified important characteristics that distinguish benign from phishing URLs. They
used association rule mining over URL features to detect phishing URLs. However, they focused on only two
categories of URLs, benign and phishing.
Patil and Patil (2016) used static analysis of the URL string to detect malicious web pages. They used 79 static
features extracted from the characteristics of benign and malicious URLs. They evaluated several machine learning
algorithms on their dataset, and their experimental analysis showed a detection rate between 95% and 99%.
Le, Pham, Sahoo, and Hoi (2018) proposed URLNet, a convolutional neural network (CNN) technique for detecting
malicious URLs with deep neural networks. Their model applies CNNs to both the words and the characters of
the URL.
Saxe and Berlin (2017) proposed a neural-network-based approach that uses deep learning with a convolutional
neural network to detect malicious URLs, file paths, and registry keys.
Shibahara (2017) proposed the Event De-noising Convolutional Neural Network (EDCNN) for detecting sequences
of malicious URLs. The model detects malicious URL sequences in proxy logs with a low false positive rate.
EDCNN is a CNN variant designed to reduce the negative impact of benign URLs redirected from compromised websites.
Vazhayil, Vinayakumar, and Soman (2018) presented a comparative study of classical machine learning
techniques and deep learning methods for detecting malicious URLs.
(a) Training dataset: The training dataset contains URLs from five categories: Benign, Spam, Phishing,
Malware, and Defacement.
(b) Testing dataset: The same categories are used for testing.
Data pre-processing:
To pre-process the data, we first split each URL into tokens, to be used as features, on “/”, “-”, “.”, and “com”. For example, one URL yields the following tokens:
['', 'to', 'torrent', '720p', 'h264', 'dl', 'dd5', "b'http:", "'", '2015', 'russian', 'rufgt', '1', 'web', '',
'blackhat', '1110018', '1337x']
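The splitting step can be illustrated with a minimal sketch. The helper name `tokenize_url` and the example URL are hypothetical, and two choices here are assumptions rather than the procedure used in this work: “com” is treated as a token to discard rather than as a delimiter, and empty strings are dropped (the token list above retains them).

```python
import re

def tokenize_url(url):
    """Split a URL into tokens on "/", "-", and "." (hypothetical helper)."""
    parts = re.split(r"[/\-.]", url)
    # Assumption: empty strings and the bare token "com" are discarded.
    return [p for p in parts if p and p != "com"]

tokens = tokenize_url("http://example.com/login-page/index.html")
print(tokens)
```

Because the scheme is not stripped beforehand, it survives as a token of its own (here "http:"), much like the "b'http:" entry in the list above.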
Feature extraction: We then used “TfidfVectorizer” to extract features from the text tokens. We used
70% of the data for training and the remaining 30% for testing.
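A minimal sketch of this step with scikit-learn is given below. The toy URLs, their labels, the helper `tokenize_url`, and `random_state=0` are all illustrative assumptions and are not taken from the dataset used here. The sketch splits the data before fitting the vectorizer so that the vocabulary is learned from the training portion only; whether this work split before or after vectorizing is not stated.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

def tokenize_url(url):
    # Hypothetical tokenizer: split on "/", "-", "." and drop empties and "com".
    return [p for p in re.split(r"[/\-.]", url) if p and p != "com"]

# Toy data for illustration only (not the real dataset).
urls = [
    "http://example.com/index.html", "http://example.org/about-us",
    "http://example.net/free-prize/claim", "http://example.com/cheap-pills",
    "http://example.org/secure-login/verify", "http://example.net/bank-update",
    "http://example.com/setup.exe", "http://example.org/crack-keygen/dl",
    "http://example.net/hacked-by/index", "http://example.com/defaced.php",
]
labels = ["benign", "benign", "spam", "spam", "phishing",
          "phishing", "malware", "malware", "defacement", "defacement"]

# 70% of the URLs for training, the remaining 30% for testing.
train_urls, test_urls, y_train, y_test = train_test_split(
    urls, labels, test_size=0.3, random_state=0
)

# token_pattern=None because a custom tokenizer is supplied.
vectorizer = TfidfVectorizer(tokenizer=tokenize_url, token_pattern=None)
X_train = vectorizer.fit_transform(train_urls)  # learn vocabulary and IDF weights
X_test = vectorizer.transform(test_urls)        # reuse the training vocabulary

print(X_train.shape, X_test.shape)
```

Tokens that appear only in the test URLs are ignored by `transform`, so both matrices share the same columns.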
3.3.1 SVM
We used the support vector classifier (SVC) to classify the different malicious URL classes. We set the penalty
parameter to C=2; C controls the penalty incurred when a training sample falls on the wrong side of the separating hyperplane.
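As a hedged illustration, the sketch below trains scikit-learn's `SVC` with `C=2` on TF-IDF features. Only the value of C comes from the text. The toy URLs, the two-class labels, the `token_pattern` used for tokenizing, and the default RBF kernel are assumptions, since the kernel and other settings are not stated.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Toy training data for illustration only (not the real dataset).
train_urls = [
    "http://example.com/index.html",
    "http://example.org/about-us/team",
    "http://example.net/docs/guide.html",
    "http://example.com/secure-login/verify-account",
    "http://example.org/bank-update/login",
    "http://example.net/verify-password/confirm",
]
train_labels = ["benign", "benign", "benign",
                "phishing", "phishing", "phishing"]

# Assumed tokenization: runs of characters other than "/", "-", and ".".
vectorizer = TfidfVectorizer(token_pattern=r"[^/\-.]+")
X_train = vectorizer.fit_transform(train_urls)

# C=2 as stated in the text; every other setting is the library default.
clf = SVC(C=2)
clf.fit(X_train, train_labels)

new_url = ["http://example.com/login/verify"]
pred = clf.predict(vectorizer.transform(new_url))
print(pred[0])
```

A larger C penalizes misclassified training samples more heavily, which narrows the margin and fits the training data more closely.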
MLPClassifier stands for multi-layer perceptron classifier; as the name suggests, it is a neural network. Unlike
other classification algorithms such as Naive Bayes or the support vector classifier, MLPClassifier relies on an
underlying neural network to perform the classification task. In this work, we ran our code in the Google Colab
environment, which provides the computing resources needed to run heavy workloads.
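A minimal sketch of such a classifier, using scikit-learn's `MLPClassifier`, follows. The network configuration is not given in the text, so the single hidden layer of 32 units, `max_iter=500`, and `random_state=0` are assumptions chosen only to keep the example small, as are the toy URLs and labels.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier

# Toy training data for illustration only (not the real dataset).
train_urls = [
    "http://example.com/index.html",
    "http://example.org/about-us/team",
    "http://example.net/docs/guide.html",
    "http://example.com/free-setup/crack.exe",
    "http://example.org/keygen-download/setup.exe",
    "http://example.net/crack-patch/install.exe",
]
train_labels = ["benign", "benign", "benign",
                "malware", "malware", "malware"]

# Assumed tokenization: runs of characters other than "/", "-", and ".".
vectorizer = TfidfVectorizer(token_pattern=r"[^/\-.]+")
X_train = vectorizer.fit_transform(train_urls)

# Assumed architecture: one hidden layer of 32 units.
mlp = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
mlp.fit(X_train, train_labels)

pred = mlp.predict(vectorizer.transform(["http://example.com/setup.exe"]))
print(pred[0])
```

The fitted model stores one weight matrix per layer transition in `mlp.coefs_`: here, one from the TF-IDF inputs to the hidden layer and one from the hidden layer to the output.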
In this paper, we analyzed classification accuracy when applying different machine learning techniques
and a deep learning method. With SVM (SVC) and Random Forest we obtain an accuracy of around 96%. The work
of Emine Uçar [13] reports an accuracy of around 96% to 98% using CNN and LSTM. We used an MLP
(multi-layer perceptron), which gives an accuracy between 97% and 99%.
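For reference, the accuracy figures above are read here as the fraction of test URLs whose predicted category matches the true category. That reading is an assumption, since the metric is not defined in the text. The helper `accuracy` and the toy labels below are hypothetical and do not reproduce any reported result.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Toy labels for illustration only.
y_true = ["benign", "phishing", "malware", "spam"]
y_pred = ["benign", "phishing", "malware", "benign"]
print(accuracy(y_true, y_pred))
```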
