
International Journal of Scientific & Engineering Research, Volume 6, Issue 4, April-2015 1255

ISSN 2229-5518

Data Leakage Detection Using Cloud Computing


Prof. Sushilkumar N. Holambe, Dr. Ulhas B. Shinde, Archana U. Bhosale

• Prof. Sushilkumar N. Holambe completed his master's degree in Computer Science & Engineering at C.O.E. Osmanabad, Dr. B.A.M. University, and is pursuing a Ph.D. at Dr. B.A.M.U. [email protected]
• Dr. Ulhas B. Shinde, Dean, Faculty of Engineering & Technology, Dr. B.A.M.U. Aurangabad.
• Archana U. Bhosale completed her B.E. (C.S.E.) at C.O.E. Osmanabad and is pursuing an M.E. (C.S.E.) at C.O.E. Osmanabad. [email protected]

Abstract—In today's virtual and widely distributed networks, the process of handing over sensitive data from the distributor to trusted third parties occurs regularly, and the security and durability of the service must be safeguarded according to the demands of users. A data distributor has given sensitive data to a set of supposedly trusted agents (third parties). Some of the data are leaked and found in an unauthorized place (e.g., on the web or on somebody's laptop). The distributor must assess the likelihood that the leaked data came from one or more agents, as opposed to having been independently gathered by other means. We propose data allocation strategies (across the agents) that improve the probability of identifying leakages. These methods do not rely on alterations of the released data (e.g., watermarks). In some cases, we can also inject "realistic but fake" data records to further improve our chances of detecting leakage and identifying the guilty party. The idea of modifying the data itself to detect leakage is not a new approach. Generally, sensitive data are leaked by agents, and the specific agent responsible for the leak should be detected at an early stage; thus, tracing distributed data from the distributor to the agents is mandatory. This project presents a data leakage detection system that uses various allocation strategies and assesses the likelihood that the leaked data came from one or more agents. For secure transactions, allowing only authorized users to access sensitive data through access control policies shall prevent data leakage by sharing information only with trusted parties; in addition, leakage is detected by adding fake records to the data set, which improves the probability of identifying leakages in the system. Finally, it was decided to implement this mechanism on a cloud server.

Index Terms— cloud environment, data leakage, data security, fake records.

——————————  ——————————

1 INTRODUCTION

In this paper, we develop a model for finding the guilty agents. We also present algorithms for distributing objects to agents in a way that improves our chances of identifying a leaker. Finally, we consider the option of adding "fake objects" to the distributed set. Such objects do not correspond to real entities but appear realistic to the agents. In a sense, the fake objects act as a type of watermark for the entire set, without modifying any individual members. If it turns out that an agent was given one or more fake objects that were leaked, then the distributor can be more confident that the agent was guilty. We also consider an optimization in which leaked data is compared with the original data and the third party who leaked the data is inferred accordingly. We also use an approximation technique to identify guilty agents. We propose a model that can handle requests from any number of customers and that gives data allocation strategies to improve the probability of identifying leakages. There is also an application in which a distributor distributes and manages the files that contain sensitive information, serving users when they send requests. A log is maintained for every request; it is later used to find the overlap with the leaked file set, the subjective risk, and the assessment of guilt probability.

Data leakage happens every day when confidential business information such as customer or patient data, source code or design specifications, price lists, intellectual property and trade secrets, and forecasts and budgets in spreadsheets is leaked out. When these are leaked, the company is left unprotected and the data goes outside the jurisdiction of the corporation. This uncontrolled data leakage puts the business in a vulnerable position; once the data is no longer within its domain, the company is at serious risk. When cybercriminals "cash out" or sell this data for profit, it costs the organization money, damages its competitive advantage, brand, and reputation, and destroys customer trust.

To address this problem, we develop a model for assessing the "guilt" of agents. The distributor "intelligently" gives data to agents in order to improve the chances of detecting a guilty agent, for example by adding fake objects to the distributed sets. The distributor can then assess the likelihood that the leaked data came from one or more agents, as opposed to having been independently gathered by other means. If the distributor sees enough evidence that an agent leaked data, he may stop doing business with that agent or may initiate legal proceedings. The problem has one constraint and one objective: the distributor's constraint is to satisfy each agent by providing the number of objects they request that satisfy their conditions.

2 LITERATURE SURVEY

The guilt detection approach we present is related to the data provenance problem [3]: tracing the lineage of the leaked objects essentially amounts to detecting the guilty agents, assuming some prior knowledge of the way a data view is created out of data sources. Our formulation in terms of objects and sets is more general. As far as the data allocation strategies are concerned, our work is most closely related to watermarking, which is used as a means of establishing original ownership of distributed objects [3]. Finally, there are also many works on mechanisms that allow only authorized users to access sensitive data through access control policies [9], [2]. Such approaches prevent data leakage, in some sense, by sharing information only with trusted parties. However, these policies are restrictive and may make it impossible to satisfy agents' requests.
3 NEED FOR DATA ALLOCATION STRATEGIES

Using the data allocation strategies, the distributor intelligently gives data to agents in order to improve the chances of detecting a guilty agent. Fake objects are added to identify the guilty party. If it turns out that an agent was given one or more fake objects that were leaked, then the distributor can be more confident that the agent was guilty; and when the distributor sees enough evidence that an agent leaked data, he may stop doing business with him or may initiate legal proceedings. In this section we describe allocation strategies that solve, exactly or approximately, the scalar versions of the optimization equation. We resort to approximate solutions in cases where it is inefficient to solve the optimization problem exactly.
3.1 Explicit Data Request
In the case of an explicit data request where fake objects are not allowed, the distributor cannot add fake objects to the distributed data, so the data allocation is fully defined by the agents' data requests. In the case of an explicit data request where fake objects are allowed, the distributor cannot remove or alter the requests R from the agents; however, the distributor can add fake objects. In the algorithm for data allocation for explicit requests, the input is a set of requests R1, …, Rn from n agents and the different conditions of those requests. The e-optimal algorithm finds the agents that are eligible to receive fake objects, then creates one fake object per iteration and allocates it to the selected agent. The e-optimal algorithm minimizes every term of the objective summation by adding the maximum number of fake objects to every set, yielding an optimal solution. A runnable sketch of this loop follows the steps below.

Step 1: Calculate the total number of fake records as the sum of fake records allowed.
Step 2: While the total number of fake objects > 0:
Step 3: Select the agent that will yield the greatest improvement in the sum objective, i.e.,
        i = argmax_i (1/|Ri| − 1/(|Ri| + 1)) Σ_j |Ri ∩ Rj|
Step 4: Create a fake record.
Step 5: Add this fake record to the agent's set and also to the fake record set.
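What follows is a minimal sketch of this selection loop, not the authors' implementation: representing allocations as Python sets of record IDs, the per-agent fake budgets b, and the make_fake factory are all illustrative assumptions.

```python
# Sketch of the e-optimal loop above. Assumptions: R[i] is a non-empty set of
# record IDs for agent i; b[i] is that agent's allowed number of fake records;
# make_fake() returns a fresh, unique fake record ID.

def e_optimal_fake_allocation(R, b, make_fake):
    fakes = set()
    total = sum(b)                          # Step 1: total fake records allowed
    while total > 0:                        # Step 2
        best_i, best_gain = None, float("-inf")
        for i, Ri in enumerate(R):          # Step 3: greatest objective improvement
            if b[i] == 0:
                continue
            overlap = sum(len(Ri & Rj) for j, Rj in enumerate(R) if j != i)
            gain = (1 / len(Ri) - 1 / (len(Ri) + 1)) * overlap
            if gain > best_gain:
                best_i, best_gain = i, gain
        if best_i is None:                  # no agent may receive more fakes
            break
        f = make_fake()                     # Step 4: create a fake record
        R[best_i].add(f)                    # Step 5: add it to the agent's set...
        fakes.add(f)                        # ...and to the fake record set
        b[best_i] -= 1
        total -= 1
    return R, fakes
```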
3.2 Sample Data Request
With sample data requests, each agent Ui may receive any subset of T, out of many different object allocations. In every allocation, the distributor can permute the objects of T and keep the same chances of guilty agent detection. The reason is that the guilt probability depends only on which agents have received the leaked objects and not on the identity of the leaked objects. The distributor gives the data to agents such that he can easily detect the guilty agent in case of a leak. To improve the chances of detecting a guilty agent, he injects fake objects into the distributed dataset. These fake objects are created in such a manner that the agent cannot distinguish them from original objects. One can maintain a separate dataset of fake objects or create them on demand; in this paper we have used a dataset of fake tuples. For example, the distributor sends tuples to agents A1 and A2 as R1 = {t1, t2} and R2 = {t1}. If the leaked dataset is L = {t1}, then agent A2 appears more guilty than A1. So, to minimize the overlap, we insert fake objects into one of the agents' datasets. In practice, the server (distributor) has given sensitive data to an agent, and the distributor can send that data together with fake information. The fake information does not affect the original data and cannot be identified by the client; it also reveals from which agent (client) the data leaked.
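The worked example above (R1 = {t1, t2}, R2 = {t1}, leak L = {t1}) can be checked in a few lines; the score used here is a deliberately naive overlap ratio, not the full guilt model of Section 4.4.

```python
# Naive suspicion score: fraction of an agent's allocation that appears in the leak.
R = {"A1": {"t1", "t2"}, "A2": {"t1"}}
L = {"t1"}

for agent, Ri in sorted(R.items()):
    score = len(Ri & L) / len(Ri)
    print(agent, score)   # A1 0.5, A2 1.0 -> A2 appears more guilty
```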
4 METHODOLOGY

4.1 Problem Definition
The distributor owns the sensitive data set T = {t1, t2, …, tn}. The agents Ai request data objects from the distributor. The objects in T could be of any type and size; e.g., they could be tuples in a relation, or relations in a database. The distributor gives a subset of the data to each agent. After giving objects to the agents, the distributor discovers that a set L of T has leaked; this means some third party has been caught in possession of L. Each agent Ai receives a subset Ri of the objects of T, determined either by an implicit request or an explicit request:

Implicit Request Ri = Implicit(T, mi): any subset of mi records from T can be given to agent Ai.
Explicit Request Ri = Explicit(T, Condi): agent Ai receives all objects of T that satisfy Condi.
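A minimal sketch of the two request types, assuming string record IDs and a condition given as a predicate (the paper leaves both the record format and the condition language open):

```python
import random

def implicit_request(T, m):
    """Implicit(T, mi): any subset of mi records from T may be given to Ai."""
    return set(random.sample(sorted(T), m))

def explicit_request(T, cond):
    """Explicit(T, Condi): Ai receives all objects of T satisfying the condition."""
    return {t for t in T if cond(t)}

T = {f"t{i}" for i in range(1, 11)}
print(implicit_request(T, 3))                          # e.g. {'t3', 't7', 't9'}
print(explicit_request(T, lambda t: t.endswith("1")))  # {'t1'}
```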
4.2 Data Allocation Module
The distributor may be able to add fake objects to the distributed data in order to improve his effectiveness in detecting guilty agents. However, fake objects may impact the correctness of what agents do, so they may not always be allowable. Our use of fake objects is inspired by the use of "trace" records in mailing lists. In this case, company A sells to company B a mailing list to be used once (e.g., to send advertisements). Company A adds trace records that contain addresses owned by company A. Thus, each time company B uses the purchased mailing list, A receives copies of the mailing. These records are a type of fake object that helps identify improper use of data. The distributor creates and adds fake objects to the data that he distributes to agents. Depending upon the addition of fake tuples to the agents' requests, the data allocation problem is divided into four cases:

i. Explicit request with fake tuples (EF)
ii. Explicit request without fake tuples (E~F)
iii. Implicit request with fake tuples (IF)
iv. Implicit request without fake tuples (I~F)

4.3 Optimization Module
The distributor's data allocation to agents has one constraint and one objective. The distributor's constraint is to satisfy agents' requests by providing them with the number of objects they request, or with all available objects that satisfy their conditions. His objective is to be able to detect an agent who leaks any portion of his data; that is, to maximize the chances of detecting a guilty agent who leaks all his data objects. Pr{Gj | S = Ri}, or simply Pr{Gj | Ri}, is the probability that agent Aj is guilty if the distributor discovers a leaked table S that contains all of Ri's objects. Let the distributor have data requests from n agents. The distributor wants to give tables R1, R2, …, Rn to agents A1, A2, …, An respectively, so that the distribution satisfies the agents' requests and maximizes the guilt probability differences Δ(i, j) for all i, j = 1, 2, …, n and i ≠ j:

maximize (over R1, …, Rn)  (…, Δ(i, j), …),  i ≠ j    (A)
minimize (over R1, …, Rn)  (…, |Ri ∩ Rj| / |Ri|, …),  i ≠ j
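As a concrete reading of the minimization objective above, the scalar quantity being reduced can be written as a short function; the pairwise-sum form is one assumed scalarization of the vector objective.

```python
def overlap_objective(R):
    """Sum over ordered pairs i != j of |Ri ∩ Rj| / |Ri|:
    the smaller the value, the easier the allocations are to tell apart."""
    return sum(len(Ri & Rj) / len(Ri)
               for i, Ri in enumerate(R)
               for j, Rj in enumerate(R) if i != j)

print(overlap_objective([{"t1", "t2"}, {"t1"}]))  # 0.5 + 1.0 = 1.5
```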
4.4 Guilt Model Assessment
Let L denote the leaked data set, which may have been leaked intentionally or guessed by the target user. An agent holding some of the leaked data of L is susceptible to being the leaker, but he may argue that he is innocent and that the data in L were obtained by the target through some other means. Our goal is to assess the likelihood that the leaked data came from the agents, as opposed to other sources. For example, if one of the objects of L was given to only agent A1, we may suspect A1 more. The probability that agent Ai is guilty of leaking the data set L is denoted Pr{Gi | L}.
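The guilt probability itself is not expanded in this paper; the sketch below follows the guilt model of Papadimitriou and Garcia-Molina [1], on which this work builds, where p is the probability that a leaked object could have been guessed independently and leaked objects are treated as independent events.

```python
def guilt_probability(i, L, R, p):
    """Pr{Gi | L} following the model of [1]: agent i is innocent only if,
    for every leaked object it holds, guessing or another holder explains
    the leak. R: dict agent -> set of objects given."""
    pr_innocent = 1.0
    for t in L & R[i]:
        holders = sum(1 for Rj in R.values() if t in Rj)
        # (1 - p) / holders: chance agent i leaked this particular object
        pr_innocent *= 1.0 - (1.0 - p) / holders
    return 1.0 - pr_innocent

R = {"A1": {"t1", "t2"}, "A2": {"t1"}}
print(guilt_probability("A1", {"t1", "t2"}, R, p=0.3))  # 0.805
print(guilt_probability("A2", {"t1", "t2"}, R, p=0.3))  # 0.35
```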
Algorithm 1: Allocation of Data Explicitly
Input:
i. T = {t1, t2, t3, …, tn}: the distributor's dataset
ii. R: the request of the agent
iii. Cond: the condition given by the agent
iv. m: the number of tuples given to an agent (m < n, selected randomly)
Output: D: the data sent to the agent
1. D = Φ, T' = Φ
2. For i = 1 to n do
3.   If (ti.fields == Cond) then
4.     T' = T' ∪ {ti}
5. For i = 0 to i < m do
6.   D = D ∪ {ti}
7.   T' = T' − {ti}
8. If T' = Φ then
9.   Go to step 2
10. Allocate dataset D to the particular agent
11. Repeat the steps for every agent
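A runnable sketch of Algorithm 1 under the same illustrative assumptions (records as (id, attribute) tuples, Cond as a predicate); the random choice of m matching tuples mirrors input (iv).

```python
import random

def allocate_explicit(T, cond, m):
    """Algorithm 1 sketch: D receives m randomly chosen tuples from those
    in T satisfying the agent's condition (assumes m <= number of matches)."""
    t_prime = [t for t in T if cond(t)]   # steps 2-4: build T'
    random.shuffle(t_prime)               # input (iv): tuples selected randomly
    return t_prime[:m]                    # steps 5-7: move m tuples into D

# Hypothetical dataset of (id, region) pairs.
T = [(1, "east"), (2, "west"), (3, "east"), (4, "east")]
print(allocate_explicit(T, lambda t: t[1] == "east", m=2))  # e.g. [(4, 'east'), (1, 'east')]
```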
To improve the chances of finding a guilty agent, we can also add fake tuples to the agents' data sets.

Algorithm 2: Addition of Fake Tuples
Input:
i. D: the dataset of the agent
ii. F: the set of fake tuples
iii. Cond: the condition given by the agent
iv. b: the number of fake objects to be sent
Output: D: the dataset with fake tuples
1. While b > 0 do
2.   f = select a fake object at random from the set F
3.   D = D ∪ {f}
4.   F = F − {f}
5.   b = b − 1
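A matching sketch of Algorithm 2; drawing without replacement from the fake pool F mirrors steps 2-4 (the sorted() call only makes the random choice well defined over a set).

```python
import random

def add_fake_tuples(D, F, b):
    """Algorithm 2 sketch: move b randomly chosen fake tuples from the
    fake pool F into the agent's dataset D."""
    D, F = set(D), set(F)
    while b > 0 and F:                 # step 1: while b > 0
        f = random.choice(sorted(F))   # step 2: random fake object from F
        D.add(f)                       # step 3: D = D ∪ {f}
        F.remove(f)                    # step 4: F = F − {f}
        b -= 1                         # step 5: b = b − 1
    return D, F

D, F = add_fake_tuples({"t1", "t2"}, {"f1", "f2", "f3"}, b=1)
print(D, F)   # e.g. {'t1', 't2', 'f2'} {'f1', 'f3'}
```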
Similarly, we can distribute the dataset for an implicit request of an agent. For an implicit request, the subset of the distributor's dataset is selected randomly; thus, with implicit data requests we get different subsets, and hence there are different data allocations. An object allocation that satisfies requests while ignoring the distributor's objective is to give each agent a unique subset of T of size m. The s-max algorithm instead allocates to an agent the data record that yields the minimum increase of the maximum relative overlap among any pair of agents. The s-max algorithm is as follows (a sketch in code appears after the steps):

1. Initialize Min_Overlap, the minimum out of the maximum relative overlaps that the allocations of different objects to Ai yield.
2. For each candidate object tk: initialize max_rel_ov ← 0, the maximum relative overlap between Ri and the allocation of tk to Ai.
3. For all j = 1, …, n with j ≠ i and tk ∈ Rj: calculate the absolute overlap abs_ov, and the relative overlap rel_ov ← abs_ov / min(mi, mj).
4. Find the maximum relative overlap, max_rel_ov ← MAX(max_rel_ov, rel_ov). If max_rel_ov ≤ min_ov, then min_ov ← max_rel_ov and ret_k ← k. Return ret_k.
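A sketch of the s-max selection step described above; allocations are sets, m holds the requested sizes, and ties resolve to the last examined object, as in the pseudocode's ≤ comparison. All names here are illustrative.

```python
def s_max_pick(i, R, m, T):
    """Pick the object whose allocation to agent i yields the minimum
    increase of the maximum relative overlap among pairs of agents."""
    ret_k, min_ov = None, float("inf")     # step 1: running minimum
    for t in T:
        if t in R[i]:
            continue
        max_rel_ov = 0.0                   # step 2
        for j, Rj in enumerate(R):
            if j == i or t not in Rj:
                continue
            abs_ov = len(R[i] & Rj) + 1            # step 3: overlap if t is added
            rel_ov = abs_ov / min(m[i], m[j])      # relative to the smaller request
            max_rel_ov = max(max_rel_ov, rel_ov)   # step 4
        if max_rel_ov <= min_ov:
            min_ov, ret_k = max_rel_ov, t
    return ret_k

R = [{"t1"}, {"t1", "t2"}, set()]
print(s_max_pick(2, R, m=[2, 2, 2], T=["t1", "t2", "t3"]))  # 't3' (no overlap)
```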
The algorithms presented implement a variety of data distribution strategies that can improve the distributor's chances of identifying a leaker. It is shown that distributing objects judiciously can make a significant difference in identifying guilty agents, especially in cases where there is large overlap in the data that agents must receive.
5 BASICS OF CLOUD COMPUTING

Key to the definition of cloud computing is the "cloud" itself. For our purposes, the cloud is a large group of interconnected computers. These computers can be personal computers or network servers; they can be public or private. For example, Google hosts a cloud that consists of both smallish PCs and larger servers. Google's cloud is a private one (that is, Google owns it) that is publicly accessible (by Google's users).
This cloud of computers extends beyond a single company or enterprise. The applications and data served by the cloud are available to a broad group of users, cross-enterprise and cross-platform. Access is via the Internet: any authorized user can access these documents and applications from any computer over any Internet connection. And, to the user, the technology and infrastructure behind the cloud is invisible. It isn't apparent (and, in most cases, doesn't matter) whether cloud services are based on HTTP, HTML, XML, JavaScript, or other specific technologies.
From Google's perspective, there are six key properties of cloud computing:
• Cloud computing is user-centric. Once you as a user are connected to the cloud, whatever is stored there (documents, messages, images, applications) becomes yours. In addition, not only is the data yours, but you can also share it with others. In effect, any device that accesses your data in the cloud also becomes yours.
• Cloud computing is task-centric. Instead of focusing on the application and what it can do, the focus is on what you need done and how the application can do it for you. Traditional applications (word processing, spreadsheets, email, and so on) are becoming less important than the documents they create.
• Cloud computing is powerful. Connecting hundreds or thousands of computers together in a cloud creates a wealth of computing power impossible with a single desktop PC.
• Cloud computing is accessible. Because data is stored in the cloud, users can instantly retrieve more information from multiple repositories. You're not limited to a single source of data, as you are with a desktop PC.
• Cloud computing is intelligent. With all the various data stored on the computers in a cloud, data mining and analysis are necessary to access that information in an intelligent manner.
• Cloud computing is programmable. Many of the tasks necessary with cloud computing must be automated. For example, to protect the integrity of the data, information stored on a single computer in the cloud must be replicated on other computers in the cloud. If that one computer goes offline, the cloud's programming automatically redistributes that computer's data to a new computer in the cloud.
Computing in the cloud may provide additional infrastructure and flexibility.

5.1 Databases in Cloud Computing Environment
In the past, a large database had to be housed onsite, typically on a large server. That limited database access to users either located in the same physical location or connected to the company's internal database, and excluded, in most instances, traveling workers and users in remote offices.
Today, thanks to cloud computing technology, the underlying data of a database can be stored in the cloud, on collections of web servers, instead of being housed in a single physical location. This enables users both inside and outside the company to access the same data, day or night, which increases the usefulness of the data. It is a way to make data universal.

5.2 Lineage Tracing General Data Warehouse Transformations [9]
Yingwei Cui and Jennifer Widom focus on the transformation or modification of data that happens automatically due to data mining or while storing the data in the warehouse. In a warehousing environment, the data lineage problem is that of tracing warehouse data items back to the original source items from which they were derived. They formally define the lineage tracing problem in the presence of general data warehouse transformations, and they present algorithms for lineage tracing in this environment. The tracing procedures take advantage of known structure or properties of transformations when present, but also work in the absence of such information. Their results can be used as the basis for a lineage tracing tool in a general warehousing setting, and can also guide the design of data warehouses that enable efficient lineage tracing.
The major drawback is that the work does not consider more recent tools that would solve this kind of problem automatically, and no clear explanation is given of the security aspects of the technique.

5.3 Databases in the Cloud: A Work in Progress [10]
Edward P. Holden, Jai W. Kang, Dianne P. Bills, and Mukhtar Ilyassov focus on a trial of using cloud computing in the delivery of the Database Architecture and Implementation curriculum. The work describes a curricular initiative in cloud computing intended to keep their information technology curriculum at the forefront of technology. Currently, IT degrees offer extensive database concentrations at both the undergraduate and graduate levels. Supporting this curriculum requires extensive lab facilities where students can experiment with different aspects of database architecture, implementation, and administration. A disruptive technology is defined as a new, and often initially less capable, technological solution that displaces an existing technology because it is lower in cost. Cloud computing fits this definition in that it is poised to replace the traditional
model of purchased software on locally maintained hardware platforms. From this academic perspective, cloud computing means utilizing scalable virtual computing resources, provided by vendors as a service over the Internet, to support the requirements of a specific set of computing curricula without the need for local infrastructure investment.
Cloud computing is the use of virtual computing technology that is scalable to a given application's specific requirements, without local investment in extensive infrastructure, because the computing resources are provided by various vendors as a service over the Internet.

6 EXPERIMENTAL RESULT

In our scenarios we have taken a set of 500 objects, and requests from every agent are accepted. There is no limit on the number of agents, as we are considering their trust values here. The flow of our system is as follows:
1. The agent's request is taken: either explicit or implicit.
2. The leaked dataset is given as an input to the system.
3. The list of all agents having tuples in common with the leaked tuples is found, and the corresponding guilt probabilities are calculated.
4. It shows that as the overlap with the leaked dataset is minimized, the chances of finding the guilty agent increase.
The basic approaches for leakage identification systems in various areas, together with the proposed multi-angle approach for handling situational issues, were all studied in detail. When the handover of sensitive data takes place, each object should always be watermarked so that its origins can be traced with absolute certainty; however, where certain data cannot admit watermarks, it is still possible to assess the likelihood that an agent is responsible for a leak, based on the overlap of the agent's data with the leaked data and on the probability that the objects could have been guessed by other means.

Sample request

Case 1) M > [t], where M = ∑i=1…n mi

Agents    Files requested    Files given
Arch1     5                  5
Arch2     5                  -
Arch3     10                 10
Arch4     10                 -

Here M = 30, i.e., M > [t].

Case II) M < [t], where M = ∑i=1…n mi

Agents    Files requested    Files given
Arch1     8                  8
Arch2     7                  -
Arch3     8                  5
Arch4     6                  -

[Figure: Overlap graph probability at p = 0.3; bar chart over agents Arch1-Arch4 (chart values not recoverable from the text).]
[Figure: Overlap graph at p = 0.3; bar chart over agents Arch1-Arch4.]
[Figure: Random graph at p = 0.3, graph probability (p) = 0.3; bar chart over agents Arch1-Arch4.]
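The flow above can be exercised end to end with the sketches from Section 4; the agents, files, and leaked set below are made-up stand-ins (not the 500-object corpus used in the experiments), and guilt_probability is the Section 4.4 sketch.

```python
# Hypothetical walk-through of steps 1-4 of the system flow.
R = {"Arch1": {"f1", "f2", "f3"},
     "Arch2": {"f2"},
     "Arch3": {"f4", "f5"}}
L = {"f2", "f4"}                                   # step 2: leaked dataset

suspects = {a for a, Ri in R.items() if Ri & L}    # step 3: agents overlapping L
for a in sorted(suspects):                         # guilt per agent at p = 0.3
    print(a, round(guilt_probability(a, L, R, p=0.3), 3))
# Arch3 ranks highest (0.7), since only it holds the leaked object f4.
```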

7 CONCLUSION

Data leakage is a silent type of threat. An employee, as an insider, can intentionally or accidentally leak sensitive information. This sensitive information can be electronically distributed via e-mail, Web sites, FTP, instant messaging, spreadsheets, databases, and any other electronic means available, all without your knowledge. To assess the risk of distributing data, two things are important: the first is a data allocation strategy that helps to distribute the tuples among customers with minimum overlap, and the second is the calculation of a guilt probability, based on the overlap of an agent's data set with the leaked data set.

7.1 Acknowledgments
We sincerely thank Prof. Sushilkumar N. Holambe, my project guide; Dr. Anilkumar N. Holambe, our P.G. coordinator and Head of the Department, C.O.E. Osmanabad; and Dr. S. M. Jagade, Principal, C.O.E. Osmanabad, for their constant encouragement and motivation to write this paper.

REFERENCES

[1] P. Papadimitriou and H. Garcia-Molina, "A Model for Data Leakage Detection," IEEE Transactions on Knowledge and Data Engineering, Jan. 2011.

[2] Srikanth Yadav, Y. Eswararao, V. Shanmukha Rao, and R. Vasantha, "Data Allocation Strategies for Detecting Data Leakage," International Journal of Computer Trends and Technology, vol. 3, issue 1, 2012, ISSN 2231-2803, http://www.internationaljournalssrg.org.

[3] Rudragouda G. Patil, "Development of Data Leakage Detection Using Data Allocation Strategies," International Journal of Computer Applications in Engineering Sciences, ISSN 2231-4946.

[4] P. Buneman, S. Khanna, and W.C. Tan, "Why and Where: A Characterization of Data Provenance," ICDT 2001, 8th International Conference, London, UK, January 4-6, 2001, Proceedings, vol. 1973 of Lecture Notes in Computer Science, Springer, 2001.

[5] S. Jajodia, P. Samarati, M.L. Sapino, and V.S. Subrahmanian, "Flexible Support for Multiple Access Control Policies," ACM Transactions on Database Systems, vol. 26, no. 2, pp. 214-260, 2001.

[6] P. Bonatti, S.D.C. di Vimercati, and P. Samarati, "An Algebra for Composing Access Control Policies," ACM Transactions on Information and System Security, vol. 5, no. 1, pp. 1-35, 2002.

[7] Yin Fan, Wang Yu, Wang Lina, and Yu Rongwei, "A Trustworthiness-Based Distribution Model for Data Leakage Detection," Wuhan University Journal of Natural Sciences.

[8] Rakesh Agrawal and Jerry Kiernan, "Watermarking Relational Databases," IBM Almaden Research Center.

[9] L. Sweeney, "Achieving k-Anonymity Privacy Protection Using Generalization and Suppression," http://en.scientificcommons.org/43196131, 2002.

[10] Edward P. Holden, Jai W. Kang, Geoffrey R. Anderson, and Dianne P. Bills, "Databases in the Cloud: A Work in Progress," 2012.