Detecting Emerging Topics in Social Networks Using Anomaly Detection


ISSN: 2393-994X

KARPAGAM JOURNAL OF ENGINEERING RESEARCH (KJER)


Volume No.: II, Special Issue on IEEE Sponsored International Conference on Intelligent Systems and Control (ISCO’15)

Detecting Emerging Topics In Social Networks Using Anomaly Detection


M. RAMYA1, C. BALASUBRAMANIAN2
1 M.E. Final Year, Department of Computer Science and Engineering, [email protected],
Mepco Schlenk Engineering College, Sivakasi, Tamilnadu, India.
2 Senior Assistant Professor, Department of Computer Science and Engineering, [email protected],
Mepco Schlenk Engineering College, Sivakasi, Tamilnadu, India.

Abstract

Our basic assumption is that a new (emerging) topic is something people want to pass on to their friends. To
detect the emergence of topics in a social network, we focus on the social aspect of posts, as reflected in the
mentioning behaviour of users, rather than on their textual contents. The proposed system uses a probability model
that captures both the number of mentions per post and the frequency of mentionees, and uses it to detect anomalies
in the social network. Anomaly instances are aggregated based on the reply/mention relationships in social network
posts. The real data sets are gathered from Twitter. The process is implemented using SDNML, burst detection and
Bayesian classification.
Keywords: Emerging topic, SDNML, Burst detection, Bayesian, Anomaly detection

1. Introduction
Over the past few years the Internet has become not only the most important source of information but also a key
player in event formation. The open publishing of news and information has made it an important indicator of the
pulse of society. Social networks have become a very important source of information and, more recently, a place where
information is created. Blogs, Twitter and Facebook have played a great role in recent and current events all over the
world. It is therefore important to have a system that can extract this information without human intervention.
Twitter is a popular microblogging service that enables users to send and read short text messages commonly
known as tweets. With over 140 million registered users and about 340 million tweets per day, Twitter is the
undisputed market leader in social microblogging today [11]. A new area of research in information retrieval (IR),
called topic detection, has developed over the past four years. Topic detection involves detecting the occurrence of a
new event such as a plane crash, a murder, a jury trial result, or a political scandal in a stream of news stories from
multiple sources [2].
In this paper, the objective is to evaluate anomaly instances using a probability model and a burst detection method.
The rest of the paper is organized as follows. Section 2 describes the contribution and plan of this work, Section 3
introduces the basic notations, Section 4 describes the proposed system, Section 5 gives the experimental setup,
Section 6 presents the results and outcome, and Section 7 concludes the work.

2. Contribution And Plan Of This Work


Conventional approaches to topic detection have mainly been concerned with the frequencies of (textual) words. A
term-frequency-based approach can suffer from the ambiguity caused by synonyms or homonyms. It may also require
complicated pre-processing (e.g., segmentation) depending on the target language [1]. Moreover, it cannot be applied
when the contents of the messages are mostly non-textual. On the other hand, the "words" formed by mentions are unique,
require little pre-processing to obtain (the information is often separated from the contents), and are available
regardless of the nature of the contents. For these reasons, we adopt a probabilistic approach and detect anomaly
instances based on Naive Bayes calculations and the behaviour of users.

3. Basic Notations
3.1. Outlier Detection

Outlier detection is one of the most important data analysis technologies in data mining. It can be used to discover
anomalous phenomena in huge datasets. Outliers arise due to mechanical faults, changes in system behaviour, fraudulent
behaviour, human error, instrument error or simply through natural deviations in populations. Their detection can
identify system faults and fraud before they escalate with potentially catastrophic consequences [4]. Traditional
approaches to outlier detection can be classified as distribution-based, depth-based, clustering-based, distance-based
or density-based. Anomaly detection is applicable in a variety of domains, such as intrusion detection, fault detection,
system health monitoring and event detection, and it is also used to detect and remove anomalous observations from data.

3.2. Naive Bayes Classification

The Naive Bayes classifier is a simple probabilistic classifier based on Bayes' theorem with strong
independence assumptions between the features. For example, such classifiers typically use bag-of-words features to
identify spam e-mail, an approach commonly used in text classification. They work by correlating the use of tokens
(typically words, or sometimes other things) with spam and non-spam e-mails and then using Bayesian inference to
calculate the probability that an e-mail is or is not spam.
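
As an illustration of the spam example above, the following is a minimal sketch of a bag-of-words Naive Bayes classifier with add-one (Laplace) smoothing. The function names, the toy training messages and the labels "spam" and "ham" are assumptions made for this sketch and are not taken from the proposed system.

```python
import math
from collections import Counter

def train(docs):
    """docs is a list of (tokens, label) pairs; returns the counts the classifier needs."""
    class_counts = Counter(label for _, label in docs)
    token_counts = {label: Counter() for label in class_counts}
    for tokens, label in docs:
        token_counts[label].update(tokens)
    vocab = {t for tokens, _ in docs for t in tokens}
    return class_counts, token_counts, vocab

def predict(tokens, model):
    """Return the label with the highest log posterior under the independence assumption."""
    class_counts, token_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_tokens = sum(token_counts[label].values())
        for t in tokens:
            # add-one smoothing so unseen tokens do not zero out the product
            score += math.log((token_counts[label][t] + 1) / (total_tokens + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training data (hypothetical)
training = [
    ("win money now".split(), "spam"),
    ("cheap money offer".split(), "spam"),
    ("meeting agenda attached".split(), "ham"),
    ("lunch meeting tomorrow".split(), "ham"),
]
model = train(training)
```

Calling predict("cheap money".split(), model) on this toy data selects the "spam" class, because both tokens occur only in the spam messages.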

4. Proposed System

The proposed system first pre-processes the Twitter dataset and then applies four modules:

4.1. Data Aggregation
4.2. Probability Estimation
4.3. Burst Detection
4.4. Bayesian Classification

The flow of the system is given in Fig. 1. In this system, the main challenge is to detect anomalies. The dataset is
collected from real-world Twitter data and contains attributes such as username, friend list, followers, screen name
and last tweet date. This is the input to the system. After the dataset is inserted into the database, null or
unwanted values are eliminated. This step is called dataset pre-processing.
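
The pre-processing step can be sketched as a simple filter that drops records with null or empty fields. The field names and sample records below are hypothetical, and the sketch assumes records are held as Python dictionaries rather than database rows.

```python
def preprocess(records, required):
    """Keep only records whose required fields are all present and non-empty."""
    return [
        r for r in records
        if all(r.get(field) not in (None, "") for field in required)
    ]

# Hypothetical raw records; the second and third contain missing values
raw = [
    {"id": 1, "screen_name": "alice", "join_date": "2012-03-01", "last_tweet_date": "2014-11-20"},
    {"id": 2, "screen_name": "", "join_date": "2013-07-15", "last_tweet_date": "2014-10-02"},
    {"id": 3, "screen_name": "carol", "join_date": None, "last_tweet_date": "2014-12-05"},
]
clean = preprocess(raw, ["id", "screen_name", "join_date", "last_tweet_date"])
```

Only the first record survives here, since the other two each have one empty or null field.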

4.1. Data Aggregation


After the dataset has been pre-processed, the data are aggregated from the database. Aggregation is
based on the mentions, replies and retweets in the dataset. Through this, we can easily identify who posts the
mentions, replies and retweets in the network. The aggregation uses the description of each user's post information.
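
One possible way to aggregate posts by type is sketched below. It assumes the common Twitter text conventions that a retweet starts with "RT @", a reply starts with "@", and a mention contains "@user" elsewhere in the text. These rules, the regular expression and the sample posts are assumptions of this sketch, since the dataset's actual encoding of replies and retweets is not specified here.

```python
import re
from collections import Counter

MENTION = re.compile(r"@(\w+)")  # captures the user name after "@"

def classify(text):
    """Label a post as retweet, reply, mention or plain using text conventions."""
    if text.startswith("RT @"):
        return "retweet"
    if text.startswith("@"):
        return "reply"
    if MENTION.search(text):
        return "mention"
    return "plain"

# Hypothetical posts
posts = [
    "RT @bob: big news today",
    "@carol thanks for the link",
    "reading the report by @dave and @erin",
    "just a regular post",
]
counts = Counter(classify(p) for p in posts)
# How often each user is mentioned across all posts
mentionees = Counter(user for p in posts for user in MENTION.findall(p))
```

The counts object gives the number of posts per type, and mentionees gives the frequency with which each user is mentioned.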

4.2. Probability Estimation

Probability estimation uses two types of distributions: the predictive distribution and the joint
probability distribution.
a) The predictive distribution is estimated from the mentions and mentionees in the network.
b) The joint probability distribution is estimated from the number of users in the network and the number of users
who post mentions in the network.
The probability estimates are computed from the mentions, replies and retweets classified in the pre-processed
dataset. First we find the probability density function using the number of mentions to user v in the dataset, mv,
and the total number of mentionees in the dataset. Then we estimate the predictive distribution using equation (1).

Predictive Distribution = mv / mentionees (1)

Here mv is the number of mentions to user v in the dataset, i.e. the total number of times users mention v in their posts.
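
As a worked instance of equation (1), the sketch below takes the counts reported later in Table 6.1.2 (42 mentions and 88 mentionees). The function name is an assumption of this sketch.

```python
def predictive_distribution(mentions_to_v, total_mentionees):
    """Equation (1): ratio of mentions to user v over the total number of mentionees."""
    if total_mentionees <= 0:
        raise ValueError("total_mentionees must be positive")
    return mentions_to_v / total_mentionees

# Counts from Table 6.1.2: 42 mentions, 88 mentionees
value = predictive_distribution(42, 88)
print(round(value, 4))
```

With these counts the ratio is 42/88, approximately 0.4773, which agrees to four decimal places with the value listed in Table 6.1.3.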


4.3. Burst Detection

Burst detection detects anomalies in a time series. Here, id, url, joined date and last tweet date are the
important parameters. The burst-detection method is based on a probabilistic automaton model with two states, a burst
state and a non-burst state.

Pseudo code for Burst Detection

---------------------------------------------------------------------------
for each record in Data
    select join date and last tweet date
    calculate burst value:
        BT = join date - last tweet date
    if BT ≥ 0
        return burst state
    else
        return non-burst state
end for
---------------------------------------------------------------------------
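
The pseudo code above can be sketched in Python as follows. The sketch follows the rule exactly as stated (BT is the join date minus the last tweet date, and BT ≥ 0 is labelled as a burst state). The function name, the day-level granularity and the sample users are assumptions of this sketch.

```python
from datetime import date

def burst_state(join_date, last_tweet_date):
    """Apply the stated rule: BT = join date - last tweet date, burst if BT >= 0."""
    bt = (join_date - last_tweet_date).days  # difference in whole days
    return "burst" if bt >= 0 else "non-burst"

# Hypothetical users: (join date, last tweet date)
users = {
    "u1": (date(2014, 5, 1), date(2014, 5, 1)),
    "u2": (date(2012, 1, 10), date(2014, 11, 30)),
}
states = {uid: burst_state(joined, last) for uid, (joined, last) in users.items()}
```

Under this rule, u1 (last tweet on the join date) is labelled as a burst state and u2 (last tweet long after joining) as a non-burst state.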

4.4. Bayesian Classification

To apply Bayes' rule, we first cluster the data based on the verified attribute. Cluster formation
helps to detect anomaly instances. If the verified value is 1, the record goes to cluster 1; otherwise it goes to
cluster 2. From the cluster information, we count the languages in the data and the cluster membership from the
cluster database. We then assign A as the cluster and B as the language and apply Bayes' rule (2).

P(B|A) = P(A and B) / P(A) (2)
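
Equation (2) can be estimated directly from counts of (cluster, language) pairs. The sketch below is one way to do that; the function name and the sample pairs are hypothetical and serve only to illustrate the calculation.

```python
def conditional_probability(pairs, a, b):
    """Estimate P(B=b | A=a) = P(A=a and B=b) / P(A=a) from observed (A, B) pairs."""
    n = len(pairs)
    count_a = sum(1 for x, _ in pairs if x == a)
    count_ab = sum(1 for x, y in pairs if x == a and y == b)
    if count_a == 0:
        raise ValueError("event A never occurs in the data")
    p_a = count_a / n
    p_a_and_b = count_ab / n
    return p_a_and_b / p_a

# Hypothetical (cluster, language) observations
pairs = [(1, "en"), (1, "en"), (1, "es"), (2, "en"), (2, "ja")]
p = conditional_probability(pairs, 1, "en")
```

Here cluster 1 holds three records, two of them in English, so the estimate of P(language = en | cluster = 1) is 2/3.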

After Bayesian classification we need to identify the neighbourhood relationships between features by calculating the
estimating matrix, entropy and information gain values. Information gain determines which attribute in a given set of
training feature vectors is most useful for discriminating between the classes to be learned, as in equation (3). In
other words, information gain tells us how important a given attribute of the feature vectors is.

re = a1 * log10(a1) - b1 * log10(b1) (3)
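
For reference, the sketch below computes information gain using the textbook definition (Shannon entropy of the class labels minus the weighted entropy after splitting on an attribute). This is the standard formulation and is not claimed to be identical to equation (3). The attribute names, the class labels and the sample rows are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting on attribute."""
    base = entropy([r[target] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[attribute], []).append(r[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder

# Hypothetical records with two candidate attributes and a class label
rows = [
    {"verified": 1, "language": "en", "label": "normal"},
    {"verified": 1, "language": "es", "label": "normal"},
    {"verified": 0, "language": "en", "label": "anomaly"},
    {"verified": 0, "language": "es", "label": "anomaly"},
]
gain_verified = information_gain(rows, "verified", "label")
gain_language = information_gain(rows, "language", "label")
```

In this toy data the verified attribute separates the classes perfectly, so its gain is 1 bit, while language carries no information about the label and its gain is 0.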

Pseudo code for Anomaly Detection

---------------------------------------------------------------------------
Input  : Twitter dataset
Output : Anomaly detection
---------------------------------------------------------------------------
for all records
    pre-process the data
    identify mentions, replies, retweets
    calculate joint probability distribution
        k = modulo of mentions
    calculate predictive distribution
        number of mentions to the user v in the dataset t
    calculate burst estimation
        difference between join date and last tweet date
    classify using Bayesian rule
    calculate decision rule
    return anomaly score aggregation
end for
---------------------------------------------------------------------------


5. Experimental Setup

The experiments were carried out on a Twitter dataset containing several attributes. In total, the database consists
of 137 records with 15 attributes.

6. Results And Outcome


The description of the data set used in this work is tabulated in Table 6.1.1.

Dataset Used : Twitter Data Set


Total number of Data : 138
Total number of Attributes : 15

6.1. Tables

Table 6.1.1: Attributes In Dataset

S.No  ATTRIBUTES       DESCRIPTION
1.    ID               Id for the user
2.    NAME             Name of the user
3.    JOINDATE         Join date on Twitter
4.    LASTTWEET DATE   Last tweet date on Twitter
5.    LANGUAGES        Languages used in Twitter
6.    SCREEN NAME      Twitter name
7.    PROTECTED        Security process

First the dataset is loaded from the system and inserted into the database. The data are then pre-processed: records
containing null or missing values are eliminated from the database. After pre-processing, 88 records remain, and the
values are updated in the database. The number of mentionees is the total number of users in the dataset. The counts
of mentionees, mentions, replies and retweets are tabulated in Table 6.1.2.

Table 6.1.2: Counts Of Mentionees, Replies, Retweets And Mentions

S.No  CLASSIFY DATA   COUNT
1.    Mentionees      88
2.    Replies         14
3.    Retweets        25
4.    Mentions        42

After classification, the predictive distribution is calculated from the total number of mentionees and mentions in
the dataset, as shown in Table 6.1.3.

Table 6.1.3: Predictive Distribution

S.No  DATA                                             DISTRIBUTION
1.    Number of mentions to user v in the dataset      0.4772772

The dataset is clustered based on the verified attribute. From each cluster, we count the languages in the data and
the cluster membership from the cluster database. We assign A as the cluster and B as the language and apply Bayes'
rule; the results are shown in Table 6.1.4.

Table 6.1.4: Bayesian Value

S.No  BAYESIAN VALUE
1.    00081828
2.    0.146139
3.    0.006535
4.    0.006535
5.    0.029411

After Bayesian classification, attribute selection is performed using information gain and matrix estimation. Finally,
the anomaly links are aggregated based on the URL for each user, separately for cluster 1 and cluster 2. The anomaly
score values are then estimated and the instances classified, as shown in Table 6.1.5.

Table 6.1.5: Anomaly Instance

S.No  INSTANCE           TOTAL
1.    Normal Instance    88
2.    Anomaly Instance   24

7. Conclusion
We have presented a method to detect the emergence of topics in a social network. The system considers the mentioning
behaviour of users reflected in posts rather than their textual contents. The probability model captures both the
number of mentions per post and the frequency of mentionees. Because the proposed method does not rely on the textual
contents of social network posts, it is robust to rephrasing and can be applied where topics concern information other
than text, such as images, video and audio. Posts are classified through replies, mentions and retweets. We use the
burst detection model and Bayes' rule to pinpoint the emergence of a topic: we propose a probability model of the
mentioning behaviour of a social network user, detect the emergence of a new topic from the anomalies measured through
the model, and aggregate the anomaly scores based on the reply/mention relationships in social network posts.

Acknowledgements

I thank LORD ALMIGHTY for His immense grace and my parents and all the professors for incessant
encouragement and sustained support for the completion of this project work.

References

1. Journal Article

[1] T. Takahashi, R. Tomioka, and K. Yamanishi, “Discovering Emerging Topics in Social Streams via Link-Anomaly
Detection,” IEEE Trans. Knowledge and Data Eng., vol. 26, no. 1, Jan. 2014.
[2] J. Allan et al., “Topic Detection and Tracking Pilot Study: Final Report,” Proc. DARPA Broadcast News Transcription
and Understanding Workshop, 1998.
[3] D. Aldous, “Exchangeability and Related Topics,” École d’Été de Probabilités de Saint-Flour XIII—1983, pp. 1-198,
Springer, 1985.


[4] V. Chandola, A. Banerjee, and V. Kumar, “Anomaly Detection: A Survey,” ACM Computing Surveys, vol. 41, no. 3,
pp. 15:1-15:58, 2009.
[5] Y. Urabe, K. Yamanishi, R. Tomioka, and H. Iwai, “Real-Time Change-Point Detection Using Sequentially Discounting
Normalized Maximum Likelihood Coding,” Proc. 15th Pacific-Asia Conf. Advances in Knowledge Discovery and Data
Mining (PAKDD’11), 2011.
[6] D. He and D.S. Parker, “Topic Dynamics: An Alternative Model of Bursts in Streams of Topics,” Proc. 16th ACM SIGKDD
Int’l Conf. Knowledge Discovery and Data Mining, pp. 443-452, 2010.
[7] J. Kleinberg, “Bursty and Hierarchical Structure in Streams,” Data Mining Knowledge Discovery, vol. 7, no. 4,
pp. 373-397, 2003.
[8] L.M. Aiello, G. Petkos, C. Martin, D. Corney, S. Papadopoulos, R. Skraba, A. Goker, Y. Kompatsiaris, and A. Jaimes,
“Sensing Trending Topics in Twitter.”
[9] Q. Mei and C. Zhai, “Discovering Evolutionary Theme Patterns from Text: An Exploration of Temporal Text Mining,”
Proc. 11th ACM SIGKDD Int’l Conf. Knowledge Discovery in Data Mining, pp. 198-207, 2005.
[10] Y. Teh, M. Jordan, M. Beal, and D. Blei, “Hierarchical Dirichlet Processes,” J. Am. Statistical Assoc., vol. 101,
no. 476, pp. 1566-1581, 2006.
[11] R. Popovici, A. Weiler, and M. Grossniklaus, “On-line Clustering for Real-Time Topic Detection in Social Media
Streaming Data.”
