Detecting Spammers in YouTube: A Study To Find Spam Content in A Video Platform
Abstract:- Social networking has become a popular way for users to meet and interact online. Users spend a significant amount of time on popular social network platforms (such as Facebook, MySpace, or Twitter), storing and sharing personal information. This information also attracts the interest of cybercriminals. This paper takes a step further by addressing the issue of detecting video spammers and promoters.
Keywords: YouTube, Spammers, video spam, social network, supervised machine learning, SVM.
I. INTRODUCTION
Over the last few years, social networking sites have become one of the main ways for users to keep track of and communicate with their friends online. Sites such as Facebook, MySpace, and Twitter are consistently among the top 20 most-visited sites on the Internet. Moreover, statistics show that, on average, users spend more time on popular social networking sites than on any other site [1]. Most social networks provide mobile platforms that allow users to access their services from mobile phones, making access to these sites ubiquitous. The tremendous increase in popularity of social networking sites allows them to collect a huge amount of personal information about the users, their friends, and their habits. Unfortunately, this amount of information, as well as the ease with which one can reach many users, has also attracted the interest of malicious parties. In particular, spammers are always looking for ways to reach new victims with their unsolicited messages. This is shown by a market survey about user perception of spam over social networks, which found that, in 2008, 83% of social network users had received at least one unwanted friend request or message [2].
By allowing users to publicize and share their independently generated content, social video sharing systems may become susceptible to different types of malicious and opportunistic user actions, such as self-promotion, video aliasing, and video spamming [3]. A video response spam is defined as a video posted as a response to an opening video but whose content is completely unrelated to that opening video. Video spammers are motivated to spam in order to promote specific content, advertise to generate sales, disseminate pornography (often as an advertisement), or compromise the system's reputation.
Ultimately, users cannot easily identify a video spam before watching at least a segment of it, which consumes system resources, in particular bandwidth, and undermines user patience and satisfaction with the system. Identifying video spam is therefore a challenging problem in social video sharing systems.
This paper addresses the issue of detecting video spammers and promoters. To do so, a large user dataset was crawled from the YouTube site, containing more than 260 thousand users. Then, a labeled collection was created, with users manually classified as spammers and non-spammers. Using attributes based on the user's profile, the user's social behavior in the system, and the videos posted by the user as well as the target (responded) videos, I investigated the feasibility of applying a supervised learning method to identify polluters. I found that my approach is able to correctly identify the majority of the promoters, misclassifying only a small percentage of legitimate users. In contrast, although I was able to detect a significant fraction of the spammers, they proved to be much harder to distinguish from legitimate users.
The rest of the paper is organized as follows. The next section discusses background. Section 3 describes the crawling strategy and the test collection built from the crawled dataset. Section 4 discusses the spam metrics. Section 5 describes the classification. Finally, Section 6 offers conclusions.
II. BACKGROUND
Mechanisms to detect and identify spam and spammers have been largely studied in the context of the Web [4, 5] and email spamming [6]. In particular, Castillo et al. [4] proposed a framework to detect web spamming which uses social network metrics. A framework to detect spamming in tagging systems, a type of attack that aims at raising the visibility of specific objects, was proposed in [7]. Although applicable to social media sharing systems that allow object tagging by users, such as YouTube, the proposed technique exploits a specific object attribute, i.e., its tags. A survey of approaches to combat spamming in social Web sites has also been published.
IV. SPAM METRICS
My spammer detection method relies on a machine learning approach for classifying the dataset. In this approach, the classification algorithm learns (supervised learning [14]) from part of the data and then applies its knowledge to classify users into two classes: legitimate users or spammers.
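As a minimal sketch of this workflow (assuming a per-user feature matrix X and manual labels y; the attribute extraction itself is not shown, and scikit-learn is used here purely for illustration rather than being the tooling named in this paper):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Toy stand-ins for the crawled attributes and the manually assigned labels.
X = np.random.rand(200, 10)              # 200 users, 10 attributes each (illustrative)
y = np.random.choice([0, 1], size=200)   # 0 = legitimate user, 1 = spammer

# The classifier learns from part of the data ...
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
clf = SVC(kernel="rbf")
clf.fit(X_train, y_train)

# ... and applies that knowledge to classify previously unseen users.
predictions = clf.predict(X_test)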
4.1 Spam Metrics
In order to define the metrics used to evaluate the proposed heuristics, I have considered the following measures:

                          Prediction
True Label        Legitimate    Spammer
Legitimate            a             b
Spammer               c             d

Let a represent the number of legitimate users correctly classified as legitimate, b the number of legitimate users falsely classified as spammers, c the number of spammers falsely classified as legitimate, and d the number of spammers correctly classified as spammers. In order to evaluate the classification algorithms, I consider the following metrics, commonly used in Machine Learning and Information Retrieval [13]:

True positive rate (TP), or recall: R = d/(c+d).
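For reference, the remaining quantities that appear in the results table follow the standard confusion-matrix definitions in terms of a, b, c, and d (standard textbook formulas, stated here for completeness rather than quoted from the original):

% Standard confusion-matrix metrics, using a, b, c, d as defined above.
\begin{align*}
  \text{True positive rate (recall)}  &: \quad R = \frac{d}{c+d} \\
  \text{True negative rate}           &: \quad \frac{a}{a+b} \\
  \text{False positive rate}          &: \quad \frac{b}{a+b} \\
  \text{False negative rate}          &: \quad \frac{c}{c+d} \\
  \text{Precision}                    &: \quad P = \frac{d}{b+d} \\
  \text{Accuracy}                     &: \quad \frac{a+d}{a+b+c+d} \\
  \text{F-measure}                    &: \quad F = \frac{2PR}{P+R}
\end{align*}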
V. CLASSIFICATION
Support Vector Machines (SVM) [11] are a well-known class of algorithms for data classification, and I chose SVM as the classifier for my dataset.
Basically, SVM performs classification by mapping input vectors into an N-dimensional space. The goal is to find the optimal hyperplane that separates the data into two categories, one constructed on each side of the hyperplane. I use a binary non-linear SVM with an RBF kernel, which allows the SVM model to perform separations with very complex boundaries. I chose the implementation of SVM provided with libSVM [12], an open source SVM package that allows searching for the best classifier parameters (i.e., cost and gamma) in order to define the best SVM configuration for the dataset.
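As an illustrative sketch only (the study relies on libSVM's own grid search; here scikit-learn's SVC, which wraps libSVM, stands in for it, and the parameter grid below simply mirrors libSVM's default grid rather than the values actually used in the study), the search over cost and gamma could look like this:

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Grid over cost (C) and RBF width (gamma); these ranges follow libSVM's
# default grid.py search and are assumptions, not the study's actual values.
param_grid = {
    "C":     [2.0 ** k for k in range(-5, 16, 2)],
    "gamma": [2.0 ** k for k in range(-15, 4, 2)],
}

# X_train, y_train: features and labels from the labeled collection (see the
# earlier sketch); 5-fold cross-validation selects the best (C, gamma) pair.
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, scoring="f1")
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Cross-validated F-measure:", search.best_score_)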
The table below reports the classification results for each attribute set (User, Video, SN, and ALL):

Metric        User     Video    SN       ALL
TP            0.064    0.420    0.335    0.469
TN            0.998    0.946    1.000    0.991
FP            0.007    0.078    0.000    0.023
FN            0.976    0.574    0.625    0.571
Accuracy      0.822    0.851    0.874    0.890
F-measure     0.094    0.484    0.590    0.558
VI. CONCLUSIONS
In this paper I studied video spam in a popular online social video network, namely YouTube. My study relies upon a dataset collected from YouTube: I crawled the YouTube site to obtain an entire component of the video response user graph. By manual inspection, I created a test collection with users classified as spammers or legitimate users.
I provided a characterization of the users in this test collection, which highlights several attributes useful for characterizing the social or anti-social behavior of users.
Using a classification technique, I proposed a video spam detection mechanism which is able to correctly identify a significant fraction of the video spammers.
REFERENCES:
M. Cha, H. Kwak, P. Rodriguez, Y. Ahn, and S. Moon. I tube, you tube, everybody tubes: Analyzing the world's largest user generated content video system. In Proc. of IMC, 2007.
P. Gill, M. Arlitt, Z. Li, and A. Mahanti. YouTube traffic characterization: A view from the edge. In Proc. of IMC, 2007.
I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research (JMLR), 6:1453-1484, 2005.
R. Fan, P. Chen, and C. Lin. Working set selection using second order information for training SVM. Journal of Machine Learning Research (JMLR), 6:1889-1918, 2005.
R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. ACM Press / Addison-Wesley, 1999.
Supervised learning, https://en.wikipedia.org/wiki/Supervised_learning