Ijcet: International Journal of Computer Engineering & Technology (Ijcet)
ISSN 0976 - 6375(Online), Volume 5, Issue 6, June (2014), pp. 11-18 IAEME
SAMPLING ONLINE SOCIAL NETWORKS USING OUTLIER INDEXING
Mr. Yogesh P Murumkar
Student at PVPIT, Bavdhan, Pune
Prof. Yogesh B. Gurav
Asst. professor, PVPIT, Bavdhan, Pune
I. ABSTRACT
As online social networking emerges, there has been increased interest in using the underlying
network structure, as well as the information available on social peers, to serve the information
needs of a user. In contrast to linear data perturbations (additive, multiplicative, or a combination
of the two), we discuss data distortion using possibly nonlinear random data transformations and
show how it can be useful for anomaly detection while maintaining the confidentiality of a sensitive
data set. We expect to develop bounds on the accuracy obtained under nonlinear distortion and also
quantify privacy using standard definitions. The main attraction of this approach is that it allows a
user to control the amount of privacy by varying the degree of nonlinearity. We analyze the
transformation in full generality and then show that, for specific cases, it is distance preserving.
We focus on improving the performance of collecting information from the neighborhood of a
user or node in a dynamic social network. We introduce sampling-based algorithms that explore a
user's social network efficiently while respecting its structure, and that quickly approximate
quantities of interest. We also introduce and analyze variants of the basic sampling scheme that
exploit correlations across our samples, under both distributed and centralized network models. In
the proposed system we use an Outlier Indexing algorithm because, for large datasets, random
samples can be used for a wide range of analytical tasks. A main contribution of this paper is the
discussion of the relationship between the invertibility of a transformation and privacy preservation,
and the application of these techniques to outlier detection. Experiments conducted on real-life data
sets demonstrate the effectiveness of the approach.
Index Terms: Online Social Network, Information Networks, Search Process, Query Processing,
Performance Evaluation, Privacy.
INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING &
TECHNOLOGY (IJCET)
ISSN 0976-6367 (Print)
ISSN 0976-6375 (Online)
Volume 5, Issue 6, June (2014), pp. 11-18
IAEME: www.iaeme.com/ijcet.asp
Journal Impact Factor (2014): 8.5328 (Calculated by GISI)
www.jifactor.com
II. INTRODUCTION
Over the last decade, the World Wide Web and Web search engines have fundamentally
transformed the way people find and share information. Recently, a new form of publishing and
locating information, known as online social networking, has become very popular. The social
network structure can be modeled as a graph G in which nodes represent individuals and edges
represent the relationships among them. We model environments in which social peers participate in
a centralized social network (where knowledge of the network structure is assumed) or a distributed
one (where the network structure is unknown or limited). In either case, we assume that the rate of
change of the content in these networks is high. Given such an environment, we define the following
problems: sampling nodes in social networks, sampling information in social networks, and low
selectivity. Centralized graphs are typical of social networking sites in which complete knowledge of
the users' network is maintained. Changing trends in the use of web technology, which aim to enhance
interconnectivity, self-expression, and information sharing on the web, have led to the emergence of
online social networking services. This is evident from the multitude of activity and social interaction
that takes place on web sites such as Facebook, MySpace, and Twitter, to name a few. At the same
time, the desire to connect and interact evolves far beyond centralized social networking sites and
takes the form of ad hoc social networks formed by instant messaging clients, VoIP software, or
mobile geosocial networks. While numerous studies have focused on the hyperlinked structure of the
Web and have exploited it for searching content, few studies, if any, have examined the information
exchange in online social networks [1]. The majority of all combinatorial computing applications can
apparently be handled only by what amounts to an exhaustive search through all possibilities [2]. The
effectiveness of the branch-and-bound procedure for solving mixed integer programming (MIP)
problems has made it a method of choice in commercial software for several decades [3]. Anyone
who has used a backtracking procedure will probably have observed some problem instances being
solved almost immediately, and other problem instances of a similar size taking an inordinate length
of time to solve [4]. Online social networks have become increasingly popular in the recent decade,
which has given rise to an increasing need to analyze their properties and compare them to one
another. Many properties of online social networks are considered important [5]. A more efficient
distributed algorithm for the DFS traversal of a network can help reduce the complexity of other
distributed graph algorithms that use a distributed DFS traversal as their basic building block [6].
Many special traversal techniques have been applied to solve graph-related problems [7]. A new
distributed algorithm has been presented for constructing breadth first search (BFS) trees. A BFS tree
is a tree of shortest paths from a given root node to all other nodes of a network under the assumption
of unit edge weights; such trees provide useful building blocks for a number of routing and control
functions in communication networks [8]. Many of the measures used to describe and evaluate the
efficiency and effectiveness of large-scale search services have been surveyed. These measures,
visualized rather than verbalized, reveal a domain rich in complexity and scale [9]. Complex networks
describe a wide range of systems in nature and society. Frequently cited examples include the cell, a
network of chemicals linked by chemical reactions, and the Internet, a network of routers and
computers connected by physical links [10].
Section III reviews the related literature. Section IV presents the proposed algorithm, Section V
reports the experimental analysis, and Section VI concludes and outlines future work.
III. LITERATURE REVIEW
A. Mislove, K.P. Gummadi, and P. Druschel [1] examined the potential for using online social
networks to enhance Internet search. They analyzed the differences between the Web and social
networking systems in terms of the mechanisms they use to publish and locate useful information,
and discussed the benefits of integrating the mechanisms for finding useful content in both the Web
and social networks. Their initial results from a social networking experiment suggest that such
integration has the potential to improve the quality of the Web search experience.
D.E. Knuth [2] notes that one of the chief difficulties associated with the so-called backtracking
technique for combinatorial problems has been the inability to predict the efficiency of a given
algorithm, or to compare the efficiencies of different approaches, without actually writing and
running the programs. The paper presents a simple method that produces reasonable estimates for
most applications, requiring only a modest amount of hand calculation. The method should prove to
be of considerable utility in connection with D. H. Lehmer's branch-and-bound approach to
combinatorial optimization.
G. Cornuéjols, M. Karamanov, and Y. Li [3] showed empirically that the branch-and-bound
solution time of an MIP solver can be roughly estimated in the early stages of the solution process.
They proposed a procedure for this estimation based on parameters of a small subtree. Their
experiments showed that, in a relatively short time, sufficient information can be obtained to predict
the total running time with an error within a factor of five. This procedure can easily be built into an
MIP solver. It is fast and does not interfere with the branch-and-bound algorithm.
P. Kilby, J. Slaney, S. Thiébaux, and T. Walsh [4] propose two new online
methods for estimating the size of a backtracking search tree. The first method is based on a
weighted sample of the branches visited by chronological backtracking. The second is a recursive
method based on assuming that the unexplored part of the search tree will be similar to the part we
have so far explored. They compare these methods against an old method due to Knuth based on
random probing. They show that these methods can reliably estimate the size of search trees explored
by both optimization and decision procedures. They also demonstrate that these methods for
estimating search tree size can be used to select the algorithm likely to perform best on a particular
problem instance.
L. Katzir, E. Liberty, and O. Somekh [5] presented two algorithms for estimating the size of
graphs. Both algorithms rely on nodes being sampled from the graph's stationary distribution. They
showed both analytically and experimentally that, for social networks and other small-world graphs,
these algorithms considerably outperform uniformly sampling nodes. They consistently provide
more accurate estimates while using a smaller number of samples. This result is even more notable
since uniformly sampling nodes is strictly harder than sampling them according to the stationary
distribution.
IV. PROPOSED ALGORITHM
Figure 1: Flow Diagram
The objectives of the proposed system are to explore the underlying social structure and
information, to improve accuracy and efficiency, to study the application of sampling-based
algorithms, and to improve efficiency for low-selectivity quantities using the Outlier Indexing
technique.
Figure 2: Block Diagram
We present algorithmic details of our proposed methods. First, we describe SampleDyn, an
algorithm that is able to compute a near-uniform sample of users in dynamic social networks.
Sampling Dynamic Social Networks
Let Dd(v) be the vicinity of a user v at depth d. We introduce the algorithm SampleDyn that
takes as input the user v, the size of the sample n, the network depth d, and a constant value for
parameter C and obtains a near-uniform random sample of users by performing random walks on the
nodes of Dd(v).
Algorithm 1: Sampling in Dynamic Social Networks
procedure SAMPLEDYN(u, n, d, C)
    T = NULL, samples = 0, Sample array of size n
    while samples < n do
        if (v = RANDOMWALK(u, d, C, T)) != 0 then
            Sample[samples++] = v
        end if
    end while
    return Sample
end procedure
procedure RANDOMWALK(u, d, C, T)
    depth = 0, ps = 1
    while depth < d do
        pick v ∈ children(u) ∪ {u} with pv = 1/(degree(u) + 1)
        if T ∪ {v} has no cycle then
            add v to T
            ps = ps · pv
            if v = u then
                accept with probability C/ps
                if accepted then
                    return v
                else
                    return 0
                end if
            else
                u = v, depth++
            end if
        end if
    end while
    return 0
end procedure
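The random walk of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation, and it makes several assumptions: the neighborhood is already a tree given as a hypothetical `children` dictionary (so the cycle check on T is omitted), a rejected walk returns `None` instead of 0, and the acceptance probability C/ps is clamped to 1. The names `random_walk` and `sample_dyn` and the `seed` parameter are illustrative only.

```python
import random


def random_walk(children, u, d, C, rng):
    """One walk starting at u; returns the accepted node, or None if rejected."""
    depth, ps = 0, 1.0
    while depth < d:
        options = list(children.get(u, ())) + [u]  # children(u) plus u itself
        v = rng.choice(options)
        ps *= 1.0 / len(options)  # pv = 1 / (degree(u) + 1)
        if v == u:
            # The walk stops at u; accept with probability C / ps (clamped).
            return u if rng.random() < min(1.0, C / ps) else None
        u, depth = v, depth + 1
    return None


def sample_dyn(children, u, n, d, C, seed=None):
    """Repeat walks from u until n nodes have been accepted."""
    rng = random.Random(seed)
    sample = []
    while len(sample) < n:
        v = random_walk(children, u, d, C, rng)
        if v is not None:
            sample.append(v)
    return sample


if __name__ == "__main__":
    # Assumed toy tree: r has children a and b; a has children c and d.
    tree = {"r": ["a", "b"], "a": ["c", "d"]}
    print(sample_dyn(tree, "r", n=5, d=3, C=1 / 9, seed=1))
```

In this toy tree the least likely stopping point has path probability 1/9, so choosing C = 1/9 makes every node accepted with the same probability C per walk, which is what makes the resulting sample near-uniform. A smaller C keeps the sample uniform but rejects more walks.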
Using Separate Samples
A first approach is to draw a separate, independent sample from Dd(v) for each item and use it
to estimate the aggregate count of that item.
Algorithm 2: Counts Estimation, Separate Samples
procedure EVALSINGLE(v, d, C, n, X)
    S array of size n
    Count array of size |X|
    for all x ∈ X do
        S = SampleDyn(v, n, d, C)
        for all i ∈ S do
            Count[x] = Count[x] + count_ix
        end for
    end for
    return Count
end procedure
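The separate-samples estimation can be sketched in Python as below. It is an illustration under stated assumptions, not the authors' code: `draw_sample` is a hypothetical zero-argument callable standing in for a call to SampleDyn, and `counts` is an assumed dictionary mapping each node to its per-item counts.

```python
import random


def eval_single(draw_sample, counts, items):
    """Estimate aggregate counts, drawing a fresh sample for every item."""
    estimate = {}
    for x in items:
        sample = draw_sample()  # stands in for S = SampleDyn(v, n, d, C)
        estimate[x] = sum(counts[i].get(x, 0) for i in sample)
    return estimate


if __name__ == "__main__":
    rng = random.Random(0)
    nodes = ["a", "b", "c", "d"]
    # Assumed toy data: every node holds 2 occurrences of x and 1 of y.
    counts = {node: {"x": 2, "y": 1} for node in nodes}
    print(eval_single(lambda: rng.choices(nodes, k=3), counts, ["x", "y"]))
```

Each item triggers its own sampling pass, so the sampling cost grows linearly with the number of items being estimated.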
Distributed Outlier Detection
First, we give an outlier detection algorithm for horizontally partitioned data without
considering privacy. Consider a distributed setting with p players, each player having a subset of the
objects in the whole database. In this setting, each player first computes its set of local outliers by
using the centralized algorithm on its local dataset. After the local outliers are generated, all the
players communicate to compute the global outliers from the sets of local outliers. At the end of the
algorithm each player will have its subset of the actual global outliers. We consider the horizontal
distribution, where each player has a subset of the total number of objects.
The distributed algorithm DistributedOD is divided broadly into three phases. In the first
phase, all players communicate to compute the global parameters. Then each player locally computes
its set of local probable outliers M′. In the second phase, GlobalApproxOD, the players engage in
communication to compute their subsets of global probable outliers. Finally, in the third phase,
GlobalOD, the players again engage in communication to compute their subsets of the actual global
outliers. The process in the distributed setting is viewed from the perspective of one player in a
two-player setting. The round complexity of the algorithm in that setting also holds for the
multi-player setting.
Algorithm 3: DistributedOD, Outlier Detection Algorithm for Horizontal Distribution
Require: Players PA and PB, PA's dataset DA, PB's dataset DB, distance threshold dt, point
threshold pt, approximation factor ε
Ensure: PA's outliers MA
At PA:
    PA sends |DA| to PB
At PB:
    n = |DA| + |DB|
    PB sends n to PA
At PA:
    p′t = (1 − pt) × n
    R = dt/(1 + ε)
    T^A_LH = LSH(DA, R)
    compute bt
    M′A = ApproximateOD(DA, T^A_LH, p′t, bt)
    M″A = GlobalApproxOD(M′A)
    MA = GlobalOD(M″A)
We give the distributed algorithm in a two player setting, which can be easily extended to a p
player setting. Consider two players denoted by PA and PB with local datasets DA and DB. We
present the algorithm such that one player, say PA will be able to compute its subset of the global
outliers at the end of the algorithm. Similarly the algorithm can be used to enable PB to compute its
subset of the global outliers by simply interchanging the roles of PA and PB in the algorithm.
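The local-then-global structure described above can be illustrated with a simplified Python sketch. It is not the DistributedOD algorithm itself: it replaces the LSH-based approximate phases with exact brute-force neighbor counting, ignores privacy, and runs both players in one process. It assumes that a point is an outlier when at most (1 − pt) × n other points lie within distance dt of it. The function names `count_within` and `two_player_outliers` are hypothetical.

```python
import math


def count_within(p, data, dt, skip=None):
    """Number of points in data within distance dt of p, skipping index `skip`."""
    return sum(1 for j, q in enumerate(data)
               if j != skip and math.dist(p, q) <= dt)


def two_player_outliers(d_a, d_b, dt, pt):
    """Return player A's subset of the global outliers (brute-force sketch)."""
    n = len(d_a) + len(d_b)      # global parameter exchanged in phase one
    p_t = (1 - pt) * n           # maximum number of neighbors for an outlier
    # Local phase: a point with more than p_t local neighbors cannot be a
    # global outlier, so only the remaining points stay as candidates.
    local = {i: count_within(p, d_a, dt, skip=i) for i, p in enumerate(d_a)}
    candidates = [i for i, c in local.items() if c <= p_t]
    # Global phase: add the neighbor counts contributed by player B.
    return [d_a[i] for i in candidates
            if local[i] + count_within(d_a[i], d_b, dt) <= p_t]


if __name__ == "__main__":
    d_a = [(0.0, 0.0), (10.0, 10.0)]
    d_b = [(0.0, 0.1), (0.1, 0.0), (0.1, 0.1)]
    print(two_player_outliers(d_a, d_b, dt=1.0, pt=0.9))
```

In this toy input both of A's points look like outliers locally, but the point at the origin has three neighbors in B's data and is dropped in the global phase. This is why local outliers are only probable outliers until the players communicate.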
Using the Same Sample
An alternative approach is to draw a sample S only once and reuse the same sample to estimate
the aggregate counts for each item x ∈ X. We refer to this as the batch variant because it evaluates a
batch of items at each visit to a sampled node.
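A minimal Python sketch of the shared-sample idea follows. As an assumption for illustration, `draw_sample` is a hypothetical zero-argument callable standing in for one call to SampleDyn, `counts` maps each node to its per-item counts, and the function name `eval_shared_sample` is not taken from the paper.

```python
import random


def eval_shared_sample(draw_sample, counts, items):
    """Estimate aggregate counts for all items from a single shared sample."""
    sample = draw_sample()           # drawn once and reused for every item
    estimate = {x: 0 for x in items}
    for i in sample:                 # one visit per sampled node ...
        for x in items:              # ... evaluates the whole batch of items
            estimate[x] += counts[i].get(x, 0)
    return estimate


if __name__ == "__main__":
    rng = random.Random(0)
    nodes = ["a", "b", "c", "d"]
    # Assumed toy data: every node holds 2 occurrences of x and 1 of y.
    counts = {node: {"x": 2, "y": 1} for node in nodes}
    print(eval_shared_sample(lambda: rng.choices(nodes, k=3), counts, ["x", "y"]))
```

The sampling cost no longer depends on the number of items, at the price of estimates that are correlated because they come from the same nodes.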
Cost Analysis
Our sampling algorithms provide an alternative to performing an exhaustive search or
crawling on the network of a user using a depth-first-search or breadth-first-search.
Cost Model
Let Dd(v) = (N, E) be the neighborhood of a user v at depth d, where N is the set of nodes and
E the set of links in the network. Nodes are autonomous in that they perform their computation and
communicate with each other only by sending messages. Each node is unique and has local
information, such as the identity of each of its neighbors. We assume that each node handles
messages from and to neighbors and performs local computations in zero time, meaning that
communication delays outweigh local computations on the nodes.
V. EXPERIMENTAL ANALYSIS
Sampling Accuracy
Performing random walks by selecting each outgoing edge with equal probability picks leaf
nodes in a biased manner. This is because some leaves, e.g., leaves that are close to the root, are
more likely to be destinations of random walks than other leaves. In our first set of experiments, we
explore the effect of this bias on the sampling accuracy and compare the performance of the
aforementioned naive sampling method, say Naive, to our sampling method, EvalSingle.
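The bias of the naive walk can be made concrete with a small worked example. The Python sketch below computes the exact probability with which a naive walk, choosing each outgoing edge with equal probability, ends at each leaf of a tree. The tree and the function name `naive_leaf_probabilities` are assumptions made for illustration.

```python
from fractions import Fraction


def naive_leaf_probabilities(children, root):
    """Exact probability that a naive random walk from root ends at each leaf."""
    probs = {}
    stack = [(root, Fraction(1))]
    while stack:
        node, p = stack.pop()
        kids = children.get(node, [])
        if not kids:
            probs[node] = p  # a leaf is reached with path probability p
            continue
        for k in kids:
            stack.append((k, p / len(kids)))  # each outgoing edge equally likely
    return probs


if __name__ == "__main__":
    # Assumed toy tree: leaf a at depth 1, leaves c and d at depth 2.
    tree = {"r": ["a", "b"], "b": ["c", "d"]}
    for leaf, p in sorted(naive_leaf_probabilities(tree, "r").items()):
        print(leaf, p)
```

In this tree the shallow leaf a is reached with probability 1/2, while the deeper leaves c and d are each reached with probability 1/4, so a uniform sample of leaves cannot be obtained without a correction such as the acceptance step of SampleDyn.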
Sampling Cost
EvalSingle performs considerably better in terms of accuracy than a naive sampling method,
but as many of the performed random walks end up rejecting a selected leaf node, it can be
expensive. In this experiment, we evaluate the cost of our sampling method against the naive
sampling method and against the cost of crawling the entire neighborhood of a user.
For the experimental results we use synthetic user search history logs. The synthetic log
consists of the same users as the real log (from the AOL data set, along with their search history
logs), but we populate the users' history logs with high numbers of queries and URL counts.
Table 1 shows the sampling result of the query.
Name               Type       Users   Queries   URLs
Real dataset       Real       75888   4026350   2789542
Synthetic dataset  Synthetic  50      200       150
Table 1: The sampling result of query
Figure 3: Existing & Proposed Graph
Figure 3 shows accuracy versus data size for the existing and proposed systems. Table 2 shows
the comparison of the existing and proposed systems.
                   Existing System   Proposed System
Efficiency         Low               High
Sampling Accuracy  Medium            High
Sampling Cost      Low               High
Table 2: Comparison with Existing system & Proposed system
VI. CONCLUSION AND FUTURE WORK
Our research presents methods for quickly collecting information from the neighborhood of a user
in a dynamic social network when knowledge of its structure is limited or not available. Our methods
rely on efficient approximation through sampling-based algorithms. By sampling, we avoid visiting all
nodes in the vicinity of a user and thus improve performance. We demonstrate the utility of our approach
by running experiments on real and synthetic data sets.
Despite its potential, our sampling method has limitations and is expected to be inefficient for
quantities with very low selectivity. A similar problem arises when answering queries using sample
collections. Solutions there rely on weighted sampling based on workload information. However, in our
setting the data stored at each node change fast, so this method is not directly applicable. Our
algorithms also assume access to each user's personal information, such as network logs and Web
history, which may be perceived as infringing on privacy; thus, privacy concerns may serve as a major
obstacle toward acceptance of our algorithms. Systems that use our algorithms should follow a social
translucence approach to design, one that strikes a balance among visibility, awareness of others, and
accountability.
A main contribution of this paper is the discussion of the relationship between the invertibility of
a transformation and privacy preservation, and the application of these techniques to outlier detection.
In future work, apart from hierarchical index structures, the proposed CS-SSE scheme can be extended
to other data structures such as hashing, which may further improve performance in terms of server-side
computation. One may also work towards a constant-round protocol for the proposed CS-SSE scheme, as
opposed to the logarithmic-round protocol.
VII. REFERENCES
[1] A. Mislove, K.P. Gummadi, and P. Druschel, "Exploiting Social Networks for Internet Search,"
Proc. Fifth Workshop Hot Topics in Networks (HotNets), 2006.
[2] D.E. Knuth, "Estimating the Efficiency of Backtrack Programs," Math. of Computation, vol. 29,
no. 129, pp. 121-136, 1975.
[3] G. Cornuéjols, M. Karamanov, and Y. Li, "Early Estimates of the Size of Branch-and-Bound
Trees," INFORMS J. Computing, vol. 18, pp. 86-96, 2006.
[4] P. Kilby, J. Slaney, S. Thiébaux, and T. Walsh, "Estimating Search Tree Size," Proc. Nat'l Conf.
Artificial Intelligence (AAAI), 2006.
[5] L. Katzir, E. Liberty, and O. Somekh, "Estimating Sizes of Social Networks via Biased
Sampling," Proc. 20th Int'l Conf. World Wide Web (WWW), 2011.
[6] S.A.M. Makki and G. Havas, "Distributed Algorithms for Depth-First Search," Information
Processing Letters, vol. 60, no. 1, pp. 7-12, 1996.
[7] T.-Y. Cheung, "Graph Traversal Techniques and the Maximum Flow Problem in Distributed
Computation," IEEE Trans. Software Eng., vol. SE-9, no. 4, pp. 504-512, July 1983.
[8] B. Awerbuch and R.G. Gallager, "A New Distributed Algorithm to Find Breadth First Search
Trees," IEEE Trans. Information Theory, vol. 33, no. 3, pp. 315-322, May 1987.
[9] G. Pass, A. Chowdhury, and C. Torgeson, "A Picture of Search," Proc. First Int'l Conf. Scalable
Information Systems (InfoScale), 2006.
[10] R. Albert and A.-L. Barabási, "Statistical Mechanics of Complex Networks," Rev. Modern
Physics, vol. 74, p. 47, 2002.
[11] Muhanad A. Al-Khalisy and Dr. Haider K. Hoomod, "POSN: Private Information Protection in
Online Social Networks," International Journal of Computer Engineering & Technology (IJCET),
Volume 4, Issue 2, 2013, pp. 340-355, ISSN Print: 0976-6367, ISSN Online: 0976-6375.
[12] L. Rajeswari and Dr. S.S. Dhenakaran, "Page Access Coefficient Algorithm for Information
Filtering in Social Network," International Journal of Computer Engineering & Technology
(IJCET), Volume 4, Issue 3, 2013, pp. 60-69, ISSN Print: 0976-6367, ISSN Online: 0976-6375.