Big Data Security
De Gruyter Frontiers in
Computational Intelligence
Volume 3
Editors
Dr. Shibakali Gupta
Department of Computer Science & Engineering, University Institute of Technology
The University of Burdwan
Golapbag North
713104 Burdwan, West Bengal, India
[email protected]
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 License,
as of February 23, 2017. For details go to http://creativecommons.org/licenses/by-nc-nd/4.0/.
ISBN 978-3-11-060588-4
e-ISBN (PDF) 978-3-11-060605-8
e-ISBN (EPUB) 978-3-11-060596-9
ISSN 2512-8868
www.degruyter.com
Dr. Shibakali Gupta would like to dedicate this book to his daughter, wife & parents.
Dr. Indradip Banerjee would like to dedicate this book to his son, wife & parents.
Prof. (Dr.) Siddhartha Bhattacharyya would like to dedicate this book to his parents,
Late Ajit Kumar Bhattacharyya and Late Hashi Bhattacharyya, his beloved wife
Rashni, and his youngest sister’s parents-in-law, Late Anil Banerjee and Late
Sandhya Banerjee.
Preface
With the advent of a range of data-driven avenues and the explosion of data, research
in the field of big data has become an important thoroughfare. Big data produces
exceptional numbers of data points, which yield greater insights that drive significant
research, better business decisions, and greater value for customers. To accomplish
these ends, organizations need to be able to handle the data efficiently and quickly
while including measures for protecting sensitive private information, and thus
security plays a vigorous role. End-point devices are the main sources of big data:
processing, storage, and other necessary tasks are performed on input data generated
by the end-points. An organization should therefore make sure to use authentic and
valid end-point security. Because of the large amounts of data generated, it is quite
impossible for most establishments to maintain regular manual checks; periodic
observation and automated security checks in real time are therefore the most
promising approach. On the other hand, cloud-based storage has enabled data
mining and collection. However, this incorporation of big data and cloud storage
has introduced concerns about data secrecy and security threats.
This volume deliberates some of the latest research findings regarding the security
issues and mechanisms for big data. It comprises seven chapters on the subject.
The introductory chapter provides a brief and concise overview of the subject
matter with reference to the characteristics of big data, the inherent security con-
cerns, and mechanisms for ensuring data integrity.
Chapter 2 deals with blockchain technology: the motivation for the research, which
came from the lack of practical applications of blockchain; its history; the principle
of how it functions within digital identity; and the importance of EDU certificate
transparency and the challenges in sharing such certificates. In the theoretical part
of the chapter, a comparison of “classical” identity and digital identity is set out,
described through examples of personal identity cards and e-citizen systems. Then,
following the introduction to blockchain technology and a description of how
consensus is achieved and transactions are logged, the principle of smart contracts
is described; these provide the ability to enter code or even complete applications
into blockchains, enabling automation of a multitude of processes. The chapter also
explains common platforms through examples of business models that use blockchain
as a platform for developing processes based on digital identity.
Chapter 3 describes an anomaly detection procedure for cloud database metrics.
Every big data source or big database needs security metric monitoring. The
monitoring software collects various metrics with the help of custom code, plugins,
and so on. The chapter describes an approach that moves from normal metric
thresholding to anomaly detection.
https://doi.org/10.1515/9783110606058-201
With the tangible and exponential growth of big data in various sectors, everyday
activities such as websites visited, locations visited, and movie timings are stored
by companies such as Google through Android cell phones. Even bank details are
accessible by Google. In such situations, where a person’s identity can be
reconstructed almost completely from just a small number of datasets, the security
of those datasets is of huge importance, especially where human manipulation is
involved. Using social engineering to retrieve a few pieces of sensitive information
could lead to the complete theft of a person’s identity and personal life. Chapter 4
deals with these facts, that is, the social engineering angle of hacking big data,
along with other hacking methodologies that can be used against big data, and how
to secure systems from them. This chapter helps readers visualize major
vulnerabilities in data warehousing systems for big data, along with an insight into
major such hacks in the recent past, which led to the disclosure of private and
sensitive data of millions of people.
Chapter 5 describes information hiding techniques and their applications in big
data. Global communication has no bounds, and ever more information is being
exchanged over public media, which serve an important role as the communication
mode. The rapid growth in the exchange of sensitive information through the
Internet or any public platform causes a major security concern these days. More
essentially, digital data gives easy access to its contents, which can also be copied
without any kind of degradation or loss. Therefore, the urgency of security during
global communication is quite tangible nowadays.
Some big data security issues are discussed in Chapter 6, along with some solution
mechanisms. Big data is a collection of huge sets of data of different categories,
which can be distinguished as structured and unstructured. As data volumes grow
from giga-, tera-, peta-, and exabytes toward zettabytes in this phase of computing,
the threats have also increased in parallel. Big data analysis is becoming an
essential means for automatically extracting intelligence from recurring patterns
and hidden relationships. This can help companies make better decisions, foresee
and recognize change, and identify new opportunities. Different procedures
supporting big data analysis, such as numerical analysis, batch processing,
machine learning, data mining, intelligent investigation, cloud computing, quantum
computing, and data stream processing, become possibly the most important factor.
Chapter 7 summarizes the main contributions and findings of the previously
discussed chapters and offers future research directions. A conclusion has also
been drawn on the possible scope of extension. In this book, several security issues
in the big data domain have been addressed.
The book is targeted to meet the academic and research interests of the big data
community. It will be of use to students and faculty members in the disciplines of
computer science, information science, and communication engineering. The
editors would be more than happy if readers find it useful in exploring further
ideas in this direction.
Shibakali Gupta
October 2019 Indradip Banerjee
Kolkata, India Siddhartha Bhattacharyya
Contents
Preface VII
Santanu Koley
6 Big data security issues with challenges and solutions 95
https://doi.org/10.1515/9783110606058-202
Shibakali Gupta, Indradip Banerjee, Siddhartha Bhattacharyya
1 Introduction
Security is one of the leading accomplishment of awareness in information technol-
ogy and communication system. In the contemporary communication epoch, digital
channels are used to communicate hypermedia content, which governs the field of
arts, entertainment, education, commerce, research, and so on. The users of the
field of the digital media technology are increasing massively, and they realized
that data on web is an extremely important aspect of modern life.
In addressing these security issues, several chief principles apply. The privacy
principle specifies that only the sender and the receiver should be able to access a
message on the web; no unsanctioned entity can access it. Authentication
apparatuses help establish proof of identity; authentication confirms that the origin
of a digital message is correctly recognized. When the content of a message is
altered after it is sent by the sender and before it is obtained by the receiver, the
integrity of the message is lost. Access control regulates who may access the
system and what they may do. It has two areas: role and rule management.
Digital data content includes audio, video, and image media, which can be easily
stored and manipulated. The effortless transmission and manipulation of digital
content constitute a genuine threat to multimedia content creators and traders.
Big data is a term used to describe datasets that are enormous compared with a
normal database, and it is becoming more popular each day. Big data generally
consists of unstructured, semistructured, or structured datasets. Algorithms and
tools are used to process these data within a reasonable, finite amount of time, but
the main emphasis is on the unstructured data [1].
The characteristics of big data mainly depend on the 4Vs (volume, velocity, variety,
veracity) [2, 3]. Volume is a key characteristic of big data, deciding whether the
information is a normal dataset or not; the size of the raw or generated data is
important because time complexity and infrastructure cost depend on it. Velocity
means the throughput, or the speed at which data is processed: how fast
information can be generated in real time to meet the requirements. Variety stands
for the types of data that must be processed successfully; data can be text, audio,
video, image, and so on. Veracity concerns the quality of the data on which the
processing is done, which is vital, because if the information is corrupted or stolen
then nobody can expect accurate results from it.
https://doi.org/10.1515/9783110606058-001
Every such monitoring system faces a common problem: the need for an intelligent
alarm method that can produce predictive warnings, that is, a system that detects
anomalies or problems before they occur. The novel concept detects anomalies by
continuously analyzing previous metric data. The chapter also presents the power
of exponential moving average and exponential moving standard deviation methods
to produce an effective solution. The work has been tested on the CPU and memory
utilization of big database servers, which reflects the real-time quality of the solution.
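The moving-average approach described above can be sketched in a few lines. The decay factor `alpha`, the warm-up length, and the three-sigma rule below are illustrative assumptions, not the chapter's exact parameters.

```python
# Sketch of anomaly detection via exponential moving average (EMA) and
# exponential moving standard deviation. alpha, warmup, and the k-sigma
# threshold are illustrative assumptions, not the chapter's exact settings.

def detect_anomalies(samples, alpha=0.1, k=3.0, warmup=5):
    """Flag sample indices that deviate more than k moving std-devs from the EMA."""
    ema = samples[0]      # exponentially weighted mean estimate
    ema_var = 0.0         # exponentially weighted variance estimate
    anomalies = []
    for i, x in enumerate(samples[1:], start=1):
        std = ema_var ** 0.5
        # skip the warm-up period while the estimates stabilize
        if i >= warmup and std > 0 and abs(x - ema) > k * std:
            anomalies.append(i)   # predictive warning: metric left its band
        # update the exponentially weighted estimates
        diff = x - ema
        ema += alpha * diff
        ema_var = (1 - alpha) * (ema_var + alpha * diff * diff)
    return anomalies

cpu = [40, 41, 39, 42, 40, 41, 40, 95, 41, 40]   # % CPU utilization samples
print(detect_anomalies(cpu))   # [7] -- the 95% spike is flagged
```

The same routine can be applied to memory utilization or any other server metric; only the sampled list changes.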
Without the communication medium, the field of technology would seem to
collapse. But appallingly, these communications often prove fatal to the sensitivity
of vulnerable data: unwanted parties hamper the privacy of the communication and
may even tamper with such data.
The importance of security is thus gradually increasing in all aspects of protecting
the privacy of sensitive data. Various data-hiding concepts are hence under active
development. Cryptography is one such concept; watermarking is another. But to
protect the complete data content seamlessly, this chapter incorporates concepts
of steganography. The realm of steganography serves this stated goal of
safeguarding the privacy of data. Unlike cryptography, steganography brings forth
various techniques that strive to hide the very existence of hidden information, in
addition to keeping it encrypted. Any apparently visible encrypted information, on
the other hand, is far more likely to attract the interest of hackers and crackers.
Precisely speaking, therefore, cryptography is a practice of shielding only the
contents of cryptic messages, whereas steganography is concerned with
camouflaging the very fact that confidential information is being sent, along with
concealing the contents of the message. Hence, the hiding of data in a seemingly
unimportant cover medium is perpetuated. The field of big data is prominent these
days, as it deals with complex and large datasets. Steganographic methodologies
may be used to enhance the security of big data.
Chapter 7: Conclusions
This chapter summarizes the main contributions and findings of the previously
discussed chapters and offers future research directions. A conclusion has also
been drawn on the possible scope of extension. In this book, several security issues
in the big data domain have been addressed. The book covers a wide area of big
data security as well as steganography and points to a fairly large number of ideas
on which the concepts of this book may be improved. The design of numerous big
data security concepts based on steganography has been discussed, meeting
different requirements such as robustness, security, embedding capacity, and
imperceptibility. Experimental studies are carried out to compare the performance
of these developments, and a comparative study of each method against existing
methods is also established.
References
[1] Snijders, C., Matzat, U., & Reips, U.-D. “‘Big Data’: Big gaps of knowledge in the field of
Internet.” International Journal of Internet Science, 2012, 7(1).
[2] Hilbert, Martin. “Big Data for Development: A Review of Promises and Challenges.”
Development Policy Review. martinhilbert.net. Retrieved 7 October 2015.
[3] DT&SC 7-3: What is Big Data?. YouTube. 12 August 2015.
[4] Cheddad, A., Condell, J., Curran, K., & Mc Kevitt, P. “Digital image steganography: Survey
and analysis of current methods.” Signal Processing, 2010, 90, 727–752.
[5] Capurro, R., & Hjørland, B. “The concept of information.” Annual Review of Information
Science and Technology, 2003, 343–411. Medford, NJ: Information Today. A version
retrieved November 6, 2011.
[6] Reading, A. Meaningful Information: The Bridge Between Biology, Brain, and Behavior.
Originally published January 1, 2011.
[7] Liu, S., Hong, Y., & Viterbo, E. “Unshared secret key cryptography.” IEEE Transactions on
Wireless Communications, 2014, 13(12), 6670–6683.
[8] Hsu, F.-H., Wu, M.-H., & Wang, S.-J. “Dual-watermarking by QR-code applications in image
processing.” 2012 9th International Conference on Ubiquitous Intelligence and Computing
and 9th International Conference on Autonomic and Trusted Computing.
[9] Kaminsky, A., Kurdziel, M., & Radziszowski, S. “An overview of cryptanalysis research for
the advanced encryption standard.” Military Communications Conference (MILCOM 2010),
San Jose, CA, Oct. 31–Nov. 3, 2010, 1310–1316. ISSN: 2155-7578.
[10] Bian, Y., & Liang, S. “Locally optimal detection of image watermarks in the wavelet domain
using Bessel K form distribution.” IEEE Transactions on Image Processing, 2013, 22(6),
2372–2384.
[11] EL-Emam, N. N. “Hiding a large amount of data with high security using steganography
algorithm.” Journal of Computer Science, April 2007, 3(4), 223–232.
Leo Mrsic, Goran Fijacko and Mislav Balkovic
2 Digital identity protection using
blockchain for academic qualification
certificates
Abstract: Although identity was always an important issue, the digital era increases
the importance of both questions: what identity is, and why it is important to
manage and protect one. There are many views and definitions of digital identity in
the literature; this chapter, however, explains identity as related to the identification
of an individual, his/her qualifications, and his/her status in society. Using a
modern approach, the chapter focuses on the academic qualifications of an
individual, where blockchain is presented as an efficient concept for publishing,
storing, and verifying educational certificates/diplomas. Motivation for this
research came from the lack of practical applications of blockchain and from the
importance of EDU certificate transparency and the challenges in sharing such
certificates (policy issues, national standards, etc.). With easy-to-apply and
easy-to-understand guidelines, it is easier for a wider audience to accept and
use/reuse sometimes complex digital concepts as part of their solutions and
business processes. The explained approach and proof-of-concept solution were
developed in the research lab of Algebra University College.
2.1 Introduction
If we talk about the classical identity of an individual, we can think of a personal
identity card, a birth certificate, a driving license, but also a university
certificate/diploma or another EDU certificate. In terms of the digital identity of an
individual, we can talk about an e-personal ID card, e-birth certificate, e-homepage,
e-driver’s license, or e-diploma [1]. The “e-” tag means electronic: these documents
also have a digital component. This digital component can be, for example, an
electronic data carrier (chip) that stores certain data or certificates, which are read
into a computer by a reader when needed. The data displayed are centralized and
are guaranteed by the issuing institution, which is responsible for them and where
the data are stored [2].
Digital identity does not necessarily have to be a physical document [3]. It also
includes our email addresses as well as various user accounts and
Leo Mrsic, Goran Fijacko and Mislav Balkovic, Algebra University College, Croatia, Europe.
https://doi.org/10.1515/9783110606058-002
profiles on the Internet, such as an eCitizen account, Facebook profile, and email
(Figure 2.1).
[Figure 2.1 shows the authentication flow: a request with credentials passes through
a policy enforcement point (PEP) to a policy decision point (PDP), which consults an
authentication server, the security policy, and an identity store before returning a
yes/no decision.]
Figure 2.1: Authentication process (adopted from Phil Windley, Digital identity, p. 37).
Blockchain is one of the disruptive technologies often said to be a technology that
will change the world and enable a new revolution. Blockchain represents a
decentralized database that is publicly available to everyone via the Internet.
Imagine that databases and registers owned by the state and its institutions, such
as ministries, banks, and mobile operators, were published publicly by placing them
in the blockchain. This would allow us to access all data concerning ourselves,
given authorization and Internet access. Likewise, we could present the same
information to another party whenever we need to prove our identity or the validity
of some information.
With this type of blockchain technology, there is a possibility that in the future
there will be no standard services of attorneys, commercial courts, public notaries,
and the like. A good part of the services they currently offer is likely to be easily
replaced by smart contracts, because the relationship between the users and the
providers of the above-mentioned services can be precisely defined through code
and entered into smart contracts that are later executed once all the conditions of
the transaction are met. Realization of the services is automated and extremely
fast, which today it is not. The use of smart contracts allows the exclusion of a
whole range of intermediaries in different processes and thus enables faster and
easier private and, more importantly, business activities. They can be used, for
example, in the following:
– Insurance: If authorized agents register in the blockchain that the conditions
for the payment of the insurance are met, the payment is made automatically.
– Medical insurance: If a doctor finds that a patient is ill and unable to fulfill
his/her work obligations and inserts that information into the blockchain, the
patient automatically starts to receive sickness benefit.
– Pension insurance: If an authorized person or a state body certifies that a
person has fulfilled the retirement conditions, the person’s pension is
automatically paid.
– Audio and video industry: If a user pays for viewing or listening to certain
material, he/she automatically gets access to and rights over the purchased
material.
– Gambling industry: A user who makes a bet pays into the account of a smart
contract. After the event is complete, the authorized party registers the data on
the winner in the blockchain, and those who correctly predicted the results
automatically receive payments.
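The condition-triggered payouts in the examples above can be sketched as a minimal escrow-like object. The class, its fields, and the validator roles are hypothetical illustrations of the idea, not any real smart-contract platform's API.

```python
# Minimal sketch of a condition-triggered "smart contract": funds are held
# in escrow and released automatically once an authorized party records
# that the agreed condition is met. All names here are illustrative.

class SimpleSmartContract:
    def __init__(self, payer, payee, amount, authorized_validator):
        self.payer = payer
        self.payee = payee
        self.amount = amount
        self.validator = authorized_validator
        self.condition_met = False
        self.paid = False

    def register_condition(self, validator):
        # Only the agreed validator (e.g., a doctor or a state body)
        # may record that the payout condition is fulfilled.
        if validator != self.validator:
            raise PermissionError("unauthorized validator")
        self.condition_met = True
        self._settle()

    def _settle(self):
        # Executed automatically, with no intermediary, once the condition holds.
        if self.condition_met and not self.paid:
            self.paid = True

contract = SimpleSmartContract("insurer", "patient", 500, "doctor")
contract.register_condition("doctor")
print(contract.paid)   # True -- payout happened without an intermediary
```

On a real blockchain the condition check and settlement would run on every node and be recorded in the chain, so no single party could skip or reverse the payout.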
The project is in its early stages, and we are still investigating all the possibilities
and applications of this technology [5].
We meet the need to prove our identity every day and in different places: at work,
in a bank, in a shop, while traveling, in state institutions, and in many other
places [6].
Currently, there are many new and promising projects and young companies
dealing with this problem and trying to find their place in the market. In this
section, we mention some of them and explain their business models in more detail.
2.3.1 Civic
Civic is a company that develops an identification system allowing users to
selectively share identifying information with companies. Its platform has a mobile
application in which users enter their personal information, which is then stored in
encrypted form. The company’s goal is to establish partnerships with state
governments and banks, that is, with all those who can validate user identity data
and then leave a verification stamp in the blockchain. The system encrypts the
hash of all verified data, stores it in the blockchain, and deletes all personal
information of the user from the company’s own servers.
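The core of this scheme, storing only a hash on chain and checking fresh data against it later, can be sketched as follows. The function names and the salting detail are assumptions for illustration, not Civic's actual protocol.

```python
import hashlib

# Sketch of hash-based attestation: the validator verifies identity data
# off-chain and records only a salted hash on the blockchain; the raw data
# stays with the user. Names and the salting scheme are illustrative.

def attest(identity_data: str, salt: str) -> str:
    """Validator records only the hash of the verified data."""
    return hashlib.sha256((salt + identity_data).encode()).hexdigest()

def verify(identity_data: str, salt: str, on_chain_hash: str) -> bool:
    """Service provider checks data presented by the user against the chain."""
    return attest(identity_data, salt) == on_chain_hash

record = attest("name=Ana;dob=1990-01-01", salt="user-chosen-salt")
print(verify("name=Ana;dob=1990-01-01", "user-chosen-salt", record))  # True
print(verify("name=Eve;dob=1990-01-01", "user-chosen-salt", record))  # False
```

Because only the hash is public, a service provider can confirm that the presented data matches what a validator attested, without the chain ever exposing the data itself.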
As the company states in its White Paper, the Civic ecosystem is designed to
encourage the participation of trusted authentication bodies called “validators.”
Validators can be the aforementioned state governments, banks, various financial
institutions, and others. As Civic currently validates user identity information
through its application, validators have the ability to verify the identity of an
individual or company that is a “user” of the application. They then create a
certificate and place it in the blockchain in the form of a record known as an
attestation; this “verification” is actually a hash of the user’s personal information.
Parties known as “service providers” wanting to verify the same user identity data
no longer need to verify that information independently, but can instead rely on
the information already verified by those validators. The goal is for the user to
remain the “ruler” of his or her identity, with full control over personal
information, so that prior consent must be given for each transfer of identity
information between a validator and a service provider. Through smart contracts,
validators can offer their attestations to service providers for sale, and service
providers can see the prices at which different validators offer them. Each validator
can declare the price at which it is willing to provide verification. After the user,
validator, and provider confirm the transaction through the smart contract system,
the service provider pays the validator the required amount in CVC tokens (utility
tokens supporting the decentralized identity ecosystem of Civic’s model, which
allow on-demand, secure, and lower-cost access to identity verification via
blockchain). After that, the smart contract allocates the CVC tokens and the user
gets his or her share for participating. The user can spend these tokens on
products and services on
the Civic platform. As mentioned, the user is responsible for his or her data,
storing it on a personal device using the Civic app; it is also recommended to back
up the personal account to a cloud system. Since user identity data is not
centralized, that is, not on Civic servers, there is no possibility of massive identity
theft: the data of each user is actually on the user’s own devices, and to steal that
data, an attacker would have to break into each device separately. This largely
helps to suppress the black market for personal information; for example, the black
market for credit card data is quite widespread because transactions can be made
merely by knowing that data, without the user’s knowledge. If a credit card number
had to go through the blockchain proofing mechanism, where the user’s consent is
required for each transaction, then the black market for such data would slowly
lose its meaning and value (Figure 2.2) [7].
[Figure 2.2 diagram: a user signs up for a service by providing identity information
to a service provider (the requestor); the requestor forwards the identity to a
validator for validation, the validator provides validation of the identity data, and
the new user is introduced to the platform.]
2.3.2 HYPR
HYPR is a young company, founded in 2014. Its business model is based on
merging biometric identification methods with blockchain technology. Biometric
identification can replace classic identification with a username and password,
and it is faster and safer. Biometrics can recognize different parts and traits of the
human body, such as palm geometry, fingerprint, iris, scent, face, and many other
physiological elements unique to the individual. Biometrics is a very good way of
verifying an individual’s identity because it is very difficult or impossible to forge.
HYPR therefore offers a password-free authentication platform with biometric
encryption. The company does not develop or produce identification devices, but
develops a distributed security system. As mentioned earlier, any digital data can
be passed through a cryptographic algorithm to obtain its hash, and this hash can
be used to validate the data without the validator needing a copy of it. For
example, we scan our finger on the fingerprint reader of a mobile phone, and a
company that has access to the hash of our fingerprint in digital form can confirm
our identity, without the possibility of forgery. Fingerprints are just one part of the
offering: HYPR supports many types of biometric data, from simple authentication
algorithms, through face and speech algorithms, to much more complex ones such
as typing rhythm on keyboards or mobile devices, or the way we walk. With
blockchain and data decentralization, authentication becomes much faster and
simpler. Each user is responsible for his or her biometric data, stored, for example,
on a personal mobile device. This avoids massive data theft, while individual theft
may still be possible if the user is not careful enough to protect personal data and
devices. Such a system based on blockchain technology is also resistant to denial
of service (DoS) attacks, unlike a centralized system. DoS attacks target a computer
service in order to disable its use; in this case, instead of attacking a single
authentication server, DoS attackers would have to identify and attack all
blockchain nodes in the system. The company emphasizes that, besides protection
against DoS attacks, the interoperability of business processes is equally important.
There is currently no possibility of authentication between two different corporate
entities such as a bank and an insurance company: each company has a different
identity database, and they are not interoperable. Using blockchain technology, we
can have an interoperable distributed identity ledger between multiple entities
without the need for complex and expensive infrastructure. Thus, an insurance
company could prove our identity to a bank through biometric data [8–10].
2.3.3 Blockverify
The problem of proof of identity does not arise only with people. It is also present
with various products such as medicines, luxury goods, diamonds, electronics,
music, and software. These products are often counterfeited, causing manufacturers
damage in the billions of dollars.
The people behind the Blockverify project want to reduce the number of counterfeit
products on the market by preventing duplicates from appearing. Different companies
from different industries can register and track their products using Blockverify and
blockchain technology.
The company believes that improvement regarding counterfeit products can only be
achieved by using a decentralized, scalable solution that is safe against attacks.
Blockverify has its own private blockchain, but it also uses Bitcoin’s blockchain to
record important changes in its chain. Its chain is highly scalable and transparent,
so that each manufactured product can be entered into it as an asset. Each of these
assets is then added to the blockchain and assigned a unique hash, and anyone
with that hash can access the blockchain and check whether the product is valid.
The primary goal of the company is to address the problem of counterfeit
medicines, which are first on the scale of counterfeit products, and also among the
more dangerous, because they directly affect people’s health and cause millions of
deaths per year. Another problem the company wants to solve is verification of
ownership. Thanks to blockchain technology, ownership changes can easily be
recorded permanently. In this way, individuals are prevented from making
duplicate records and unauthorized changes.
[Diagram of three transactions with hashes feb359ad27c907d, 76f0ec56ce04423, and
8d0df86ffc15cd62 stored in linked blocks.]
Figure 2.3: Showing transactions stored in blocks that connect to each other in a chain
(Source: Gupta, M., Blockchain for Dummies, 2nd IBM Limited Edition, 2018, p. 14).
Participants propose transactions, which are then recorded in the network chain
according to certain security rules agreed between the participants. Each block
contains a hash, that is, a digital imprint or unique identifier, then time-stamped
valid transactions, and the hash of the previous block. The hash of the previous
block mathematically links the blocks into the chain and prevents any change of
data and information in the previous blocks, or the insertion of new blocks between
the existing ones. Thus, each following block increases the security of the entire
chain and reduces the already small chance of manipulation or change of values or
data in the chain.
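The hash linking described above can be illustrated with a minimal sketch. The block fields here are a deliberate simplification of a real blockchain's structure, kept only to show why tampering with an earlier block breaks every later link.

```python
import hashlib
import json

# Minimal sketch of hash-linked blocks: each block stores the hash of its
# predecessor, so altering any earlier block breaks every later link.

def block_hash(block: dict) -> str:
    # sort_keys makes the serialization, and hence the hash, deterministic
    return hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()

def make_block(transactions, prev_hash):
    return {"transactions": transactions, "prev_hash": prev_hash}

genesis = make_block(["coinbase"], prev_hash="0" * 64)
block1 = make_block(["A pays B 5"], prev_hash=block_hash(genesis))
block2 = make_block(["B pays C 2"], prev_hash=block_hash(block1))

# Tampering with an early block invalidates the chain:
genesis["transactions"] = ["forged"]
print(block1["prev_hash"] == block_hash(genesis))  # False: the link is broken
```

A real chain adds timestamps, consensus rules, and a network of nodes on top of this linking, but the integrity argument is exactly the one shown: changing old data changes its hash, which no longer matches the pointer stored in the next block.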
There are several types of blockchains. In this chapter, we will mention the two
most common types:
A public blockchain, such as Bitcoin blockchain (the first and most known
cryptovalue based on this technology), is a large distributed network that runs with
the release of a native token. The public blockchain is visible and open to everyone
to use at all levels. An open code is maintained by the developer community.
A private blockchain is smaller in volume and usually does not run with token
issuance. Membership in this type of blockchain is tightly controlled, and it is
often used by organizations that have confidential members or trade in
confidential information.
All types of blockchains use cryptography to enable each participant to use the
network in a safe manner and, most importantly, without the need for a central
authority to enforce the rules. Because of this, blockchain is considered
revolutionary: it is the first way to establish trust when sending and recording
digital data.
As an example, the text of my name, Goran, passed through the SHA-256 algorithm
gives the result
dbe08c149b95e2b97bfcfc4b593652adbf8586c6759bdff47b533cb4451287fb
2 Digital identity protection using blockchain for academic qualification certificates 17
The word Goran will always result in an identical hash value. Adding any character
or letter to the input completely changes the hash, but the length, 64 hexadecimal
characters (256 bits), always remains the same. The word Gordan, for example,
gives a completely different result.
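The determinism and fixed output length of SHA-256 can be checked with a short Python sketch using the standard hashlib module (the helper name sha256_hex is ours):

```python
import hashlib

def sha256_hex(text: str) -> str:
    # SHA-256 always yields a 256-bit digest, i.e. 64 hexadecimal characters,
    # regardless of the input length.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

h1 = sha256_hex("Goran")
h2 = sha256_hex("Gordan")
print(h1)                  # hash of "Goran"
print(h2)                  # a completely different hash
print(len(h1), len(h2))    # both are 64 characters long
```

Running sha256_hex on the same input always reproduces the same digest, while a one-letter change produces an unrelated one.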
In addition to the blocks and chains that are interconnected, there is another
very important segment: the network. The network consists of nodes and full nodes.
A device that connects to and uses a blockchain network becomes a node, but for
this device to become a full node, it must retrieve a complete record of all
transactions from the very beginning of the chain's creation and adhere to the
security rules that define the chain. Anyone, anywhere, can run a full node; only
a computer and the Internet are needed. But it is not as simple as it sounds.
Many people confuse the concepts of Bitcoin and blockchain or misuse them. They
are two different things. Blockchain technology was introduced in 2008, but it was
only launched a year later in the form of the cryptocurrency Bitcoin. Bitcoin is
therefore a cryptocurrency that has its own blockchain. This blockchain is a
protocol that enables the secure transmission and monitoring of the cryptocurrency
Bitcoin, all the way from the emergence of its first block (the genesis block) and
the first transaction. Bitcoin was designed with the vision that it would one day
completely replace fiat (paper) money and break down the money transfer barriers
present today. Over the years, the community has found that blockchain is more
powerful than originally thought; even if Bitcoin as a cryptocurrency does not
survive globally in everyday life, it will leave behind a revolutionary invention
that can potentially change the technological world we are currently familiar with.
Through its consensus mechanism, blockchain eliminates the central authorities
that we know today and on which today's technology is based.
wget https://www.multichain.com/download/multichain-1.0.6.tar.gz
tar -xvzf multichain-1.0.6.tar.gz
cd multichain-1.0.6
mv multichaind multichain-cli multichain-util /usr/local/bin
The last command line transfers the most important files to the bin folder for easier
calling through the commands in the next steps.
After installing the MultiChain application, the first step is to create your own
chain. Since the goal of the application is to enter, issue, and validate educational
certificates for the purpose of this project, we called the chain BlockchainCertificate.
This is done by performing the following command:
multichain-util create BlockchainCertificate
Using this command, we create a chain of this name with the default settings.
After that, you must launch the created chain using the following command:
multichaind BlockchainCertificate -daemon
The chain is launched and its first block (the genesis block) is created. After
launch, the newly created chain gets its IP address and port, through which it can
be accessed from another device. The device on which the chain is created becomes
the first node, and each subsequent computer that connects to that chain over its
IP address and the default port receives the complete chain data and also becomes
a node. For the purpose of this chapter, only one node has been used, but in
production it is not recommended to use only one node, for the safety reasons
mentioned earlier in the chapter.
If another computer joins this chain, it must also have the MultiChain application
installed and must run the command:
multichaind BlockchainCertificate@[ip-address]:[port]
After other computers/nodes merge, only the first node has the authority to assign
certain rights, such as read and write rights, to the other nodes.
Among other things, MultiChain has the ability to store data in a blockchain using
so-called streams. Along with storage, it also offers the ability to extract data.
This functionality is the most important one for the concept of the application
shown here.
So, at the main node you need to create a new stream, which we will call
certificate (the certificate/diploma stream) in this example. The following
statement is executed:
create stream certificate false
The false parameter in the command means that only explicitly permitted addresses
can write to that stream. Since in this example we have only one node, which
created this stream, it is not necessary to assign special rights. If there were a
second node, or some other address from which you wanted to write to that same
stream, you would need to assign rights to each address from the first node using
a special grant command.
The next step is to store data in the created certificate stream. Data is stored
in hexadecimal form. In this example, we will store the name, last name, and OIB
(personal identification number) with the command:
publish certificate key1 476f72616e2046696a61636b6f203638383136393734393035
The long argument is the record encoded as a hexadecimal number. After issuing the
command, we can obtain the data record from the stream using a simple query. The
following command gives us all the information recorded in the certificate stream
(Figure 2.4):
liststreamkeys certificate
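The hexadecimal payload above is simply the UTF-8 bytes of the record. A quick Python sketch shows the round trip (the sample name and OIB are taken from the publish command above):

```python
# Encode a certificate record into the hexadecimal form stored in the stream,
# then decode it back to verify the round trip.
record = "Goran Fijacko 68816974905"

hex_payload = record.encode("utf-8").hex()
print(hex_payload)  # 476f72616e2046696a61636b6f203638383136393734393035

decoded = bytes.fromhex(hex_payload).decode("utf-8")
print(decoded)      # Goran Fijacko 68816974905
```

Any client can therefore decode the stream items back to readable text once it has retrieved them from the chain.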
This type of application is intended for a private blockchain. This means that
each educational institution should have its own stream, and only the people in the
20 Leo Mrsic, Goran Fijacko and Mislav Balkovic
Figure 2.4: Displaying a textual (CLI) interface where chain creation, chain startup, and creation of
a graduate stream are shown.
institution have the authority to store certificates in it. All streams are stored
in the main ledger, which is distributed to all nodes, that is, to all educational
institutions in this example. The more nodes in the chain, the better, because the
chain becomes ever stronger and safer.
The application consists of three modules:
1. Module for certificate/diploma input
2. Certificate/diploma check module
3. Certificate/diploma print module
The first module is for entering a certificate/diploma. It converts the entered
data into hexadecimal form, stores it in the chain, and returns the transaction ID
(txid). The transaction ID is a private key that is given to the graduate, because
it can be used to look up the certificate/diploma data in the chain.
The certificate/diploma check module takes the OIB and the transaction ID, sends
a query to the chain, and verifies whether a matching record exists. It then gives
a positive or negative answer, depending on whether the required degree really
exists in the chain and whether it matches the OIB entered.
2 Digital identity protection using blockchain for academic qualification certificates 21
Once a student successfully completes the program and defends his graduate thesis,
the faculty system records that the student has graduated. With this application
and the module for entering the certificate/diploma, an authorized person at the
university enters the name, last name, and OIB of the graduate, and this
information is stored in the chain. As feedback, the person receives a transaction
ID, which is given to the student and printed on the original certificate/diploma.
It can also be printed in the form of a bar code whose scanned value is the
transaction ID (Figure 2.5).
The student receives his certificate/diploma and his private certificate/diploma
key, which in this case is
80bbfd9b068259c1f02a72b7196417c5464c54a4b68cfaf6e824777e268ff747.
He then applies for a job and, after a call from the employer, goes to the job
interview. The employer asks for the degree to check his qualifications. At
present, the procedure is conducted by the employer contacting the educational
institution, most often in writing, to verify the validity of the
certificate/diploma. This process is long-lasting and consumes a lot of resources.
In this case, however, the employer receives a certificate/diploma with a private
key. The employer then enters the OIB of the person applying for the job and the
key into the application, which in a fraction of a second returns information on
the validity of the certificate/diploma.
After the application confirms the query, the result is printed on the screen
(Figure 2.6). The name and surname of the student, the educational institution,
the field of study, and the date and place of graduation are shown in the
printout. The employer also has the option of printing a copy of the
certificate/diploma for his own archive. If the print option is chosen, the
certificate/diploma is generated and opened in PDF format.
For ease of use after release to production, it is better to run the application
as a web application. This means that everything shown here would be moved to a
web server, and the application would be accessed over the HTTPS protocol (e.g.,
via the URL https://www.diplome.hr) in a web browser. Users would then only need
an Internet connection and an account in the application to quickly and securely
check the validity of a certificate/diploma.
2.6 Conclusion
This chapter presents blockchain technology, its history, and the principle of how
it functions in the context of digital identity [15]. In the theoretical part of
the chapter, a comparison of "classical" identity and digital identity is set out,
described through examples of personal identity cards and e-citizen systems [16].
Then, following the introduction to blockchain technology and a description of how
consensus is achieved and transactions are logged, the principle of smart
contracts is described; these provide the ability to put code or even complete
applications into blockchains, enabling the automation of a multitude of
processes [17].
This chapter explains common platforms through three examples (Civic, HYPR,
Blockverify) and describes business models that use blockchain as a platform for
developing processes based on digital identity. Traditional models have also been
compared with those based on smart contracts. Through examples of cancellations or
delays in air travel, voting, the music industry, and the tracking of personal
health records, it was established that existing models are actually sluggish,
ineffective, and prone to manipulation, while examples of blockchain
implementations showed that such systems function faster, more transparently, and,
most importantly, more safely.
The middle part of this chapter describes the application of the technology in
several industries, from the Fintech industry to the insurance and real estate
industries. Concepts and test solutions are described that are slowly being moved
into the production phase and show excellent results. For this reason, we believe
that similar solutions will be implemented, increasing the adoption of blockchain
technology globally.
In the last, practical part of the chapter, a survey of existing solutions that
offer the creation of one's own blockchain was made, and the MultiChain platform
was selected. For the purpose of this work, it was necessary to create an Ubuntu
virtual machine in Oracle VM VirtualBox, on which we then installed MultiChain.
Through the Ubuntu terminal, the application concept for entering, issuing, and
verifying university certificates/diplomas is presented, and all functionalities
and user roles in this process are described.
Motivation for this research came from the lack of practical applications of
blockchain and from the importance of EDU certificate transparency and the
challenges in sharing such certificates (policy issues, national standards, etc.).
With easy-to-apply and easy-to-understand guidelines, it is easier for a wider
audience to accept and reuse sometimes complex digital concepts as part of their
solutions and business processes. Given the novelty of the approach, we believe
this chapter will be valuable for future researchers looking to implement
blockchain, and even more so for those looking to improve the exchange, storage,
and harmonization of often heterogeneous EDU certificates (in form, acceptability,
and content). The explained approach and proof-of-concept solution were developed
as part of the institutional research lab at Algebra University College.
References
[1] Ashworth, Andrew. Principles of Criminal Law (5th ed, 2006).
Avery, Lisa. ‘A Return to Life: The Right to Identity and the Right to Identify Argentina’s
“Living Disappeared”’ (2004) 27 Harvard Women’s Law Journal 235.
Conte, Frances. ‘Sink or Swim Together: Citizenship, Sovereignty, and Free Movement in the
European Union and the United States’ (2007) 61 University of Miami Law Review 331.
[4] Buergenthal, Thomas. ‘International Human Rights Law and Institutions: Accomplishments
and Prospects’ (1988) 63 Washington Law Review 1.
[5] Klepac, G., Kopal, R., & Mršić, L. (2015). Developing Churn Models Using Data Mining
Techniques and Social Network Analysis (pp. 1–361). Hershey, PA: IGI Global. doi:10.4018/
978-1-4666-6288-9
Mohr, Richard. ‘Identity Crisis: Judgment and the Hollow Legal Subject’ (2007) 11 Passages –
Law, Aesthetics, Politics 106.
Bromby, Michael., & Ness, Haley. ‘Over-observed? What is the Quality of this New Digital
World?’ Paper presented at the 20th Annual Conference of the British and Irish Law, Education
and Technology Association, Queens University, Belfast, April 2005.
[8] Nekam, Alexander. The Personality Conception of the Legal Entity (1938).
Palmer, Stephanie. ‘Public, Private and the Human Rights Act 1998: An Ideological Divide’
(2007) Cambridge Law Journal 559.
[10] Solove, Daniel. The Digital Person, Technology and Privacy in the Information Age (2004).
[11] Third, A., Quick K., Bachler M., Domingue J. (2018), Government services and digital identity,
Knowledge Media Institute of the Open University.
[12] Davies, Margaret., & Naffine, Ngaire. Are Persons Property? Legal Debates About Property
and Personality (2001).
[13] UNWTO. UNWTO World Tourism Barometer: Advance Release January 2017. UNWTO. [Online]
January 2017.
[14] World Economic Forum. Digital Transformation Initiative: Aviation, Travel and Tourism
Industry. Geneva: World Economic Forum, 2017. REF 060117.
[15] Derham, David. ‘Theories of Legal Personality’ in Leicester Webb (ed), Legal Personality and
Political Pluralism (1958) 1.
Naffine, Ngaire. ‘Who are Law’s Persons? From Cheshire Cats to Responsible Subjects’ (2003)
May Modern Law Review 346.
Stacey, Robert. ‘A Report on the Erroneous Fingerprint Individualization in the Madrid Train
Bombing Case’. The Council of Europe’s Convention on Cybercrime (2001). European Treaty
Series 185. Retrieved from http://www.europarl.europa.eu/meetdocs/2014_2019/documents/libe/
dv/7_conv_budapest_/7_conv_budapest_en.pdf.
Souvik Chowdhury and Shibakali Gupta
3 Anomaly detection in cloud big database
metric
Abstract: After the cloudification of various big data sources and big databases,
there is a need to monitor the health and security metrics of these big databases.
Monitoring setups are already provided by various monitoring suites; the
monitoring software collects various metrics with the help of custom code,
plugins, and so on. Here we propose a novel approach that replaces normal metric
thresholding with anomaly detection. Every system administrator faces the common
problem of needing intelligent alarm methods that can produce predictive warnings,
that is, detect an anomaly or problem before it occurs.
We propose an approach that is basically a modification of standard monitoring: it
detects anomalies by analyzing previous metric data and indicates any problem. We
plan to harness the power of the exponential moving average and exponential moving
standard deviation methods to implement the solution.
3.1 Introduction
An anomaly occurs when a system deviates from its normal running operation.
Anomalies typically arise in situations such as system overload, system
malfunction, a brutal network attack, or a defect or error in a running
program [1]. An anomaly is sometimes also termed malware, an intrusion, or an
outlier [2]. Any system, for example, monitoring software, generates data
continuously, that is, in a time series manner. The volume of data is huge and,
because it is time series data, it is constantly changing; hence, processing these
data and tracking any kind of problem are very difficult using a traditional
threshold-based monitoring approach.
Let us start with an example. Consider a big database hosted on a server that runs
various complex queries, data loading activities, and so on. Suppose we put a
threshold of 70% on CPU utilization for that server. During weekends, month ends,
or quarter ends, heavy data loading activity can happen as per business needs, and
due to that we see a spike in CPU utilization up to 99%. Does that mean there was
a problem in the CPU at that time? No, the CPU spike happened due to the extra
load. All traditional monitoring software collects metrics, projects them in
graphs, and triggers an alarm if a metric breaches the predefined threshold.
Anomalies in various health and security metrics could also be due to a bug in the
latest patch/upgrade or to random hardware failures. Simple metric thresholds can
cover the majority of cases. But this is not a good approach, since a threshold
breach does not necessarily indicate a problem.
https://doi.org/10.1515/9783110606058-003
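The fixed-threshold logic criticized here amounts to a one-line check. A minimal Python sketch (the function name and sample values are illustrative):

```python
def fixed_threshold_alerts(cpu_samples, threshold=70.0):
    # Traditional monitoring: alert on every sample that breaches the
    # threshold, regardless of whether the load is expected (e.g., a
    # weekend batch job pushing CPU to 99%).
    return [i for i, v in enumerate(cpu_samples) if v > threshold]

# The data-loading spike triggers alerts even though nothing is wrong.
print(fixed_threshold_alerts([35.0, 40.0, 99.0, 97.0, 42.0]))  # → [2, 3]
```

This is exactly the behavior the anomaly-based approach later in the chapter aims to improve on: the check has no notion of expected load.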
This becomes tougher when the data is complex in nature, because complex data has
various features to monitor, and anomaly detection for such a system is a tough
job [3]. That is why anomaly detection is so important [4]. A real-time anomaly
detection setup constantly monitors the time series data, automatically detects
any anomaly, and executes corrective actions based on the situation. Hence, this
kind of proactive prevention not only saves the system from a drastic crash but
also saves a lot of time and money.
Infrastructure as a service is a private cloud-based infrastructure service
available at a data center. End users can deploy their workloads using this secure
on-demand system. Resource management for data centers requires examining user
requirements and the available resources to satisfy client demands [5]. The system
should be dynamically scalable in terms of infrastructure resources because of
varying user demand; to keep pace with all this, a data center must have a
real-time monitoring system.
The system tracks the resource performance information that arrives continuously
and detects abnormal behavior. It then reschedules its resources to keep the
system in balance. As an example, a data center allocates a fixed amount of CPU
and memory for a set of users [6]. After a certain time, if the users run extra
computationally expensive programs, the CPU and memory usage become high, and the
users need extra resources to do their job [7]. The moment the system detects
these abnormal resource statistics, it allocates extra resources dynamically.
Stream processing is a time-sensitive operation. It needs data preprocessing
before any analysis is done. It converts continuous high-volume, high-rate,
structured and unstructured big data into real-time value. Moreover, these big
data volumes must be processed with ease and delivered with low latency, even when
the data rate is high. To build such a system, highly scalable real-time
operational capability is needed. Recently, we have seen several big data
frameworks (e.g., Hadoop, MapReduce, HBase, Mahout, and Google Bigtable) that
address the scalability issue [8]. However, they excel at batch-based processing,
while the system described here also demands stream processing. Apache Storm is a
distributed framework that provides streaming integration with time-based
in-memory analytics of live machine data as it becomes available in the stream. It
generates low-latency, real-time results, offers instant productivity with no
hidden obstacles, and also has an easy cluster setup procedure [9].
Smoothing can be differentiated from the partially overlapping and related concept
of curve fitting in the following ways:
– Curve fitting often uses an explicit function to produce its result; when no
function form is used, the immediate results from smoothing are simply the
smoothed values themselves, which have no further use.
– Smoothing emphasizes the relatively slow movement of values, paying little
attention to closely matching individual data points, whereas curve fitting
focuses on matching the data values as closely as possible.
– A tuning parameter is often available with a smoothing method to control the
behavior of the smoothing. To get the “best fit,” curve fitting can instead modify
any number of parameters.
A time series is a set of observations recorded at regular intervals of time, for
example, on an hourly, daily, quarterly, or yearly basis. The methods of analyzing
time series represent a crucial area of study in statistics [12]. However, before
we discuss the statistical analysis, we would like to show plots of time series
from completely different fields. In the next three sections, we look at the plots
of three kinds of time series, namely, time series with trend effect, time series
with seasonal effect, and time series with cyclic effect. These plots are known as
time plots.
A trend is a long-term smooth variation (increase or decrease) within the time
series. When the values in a time series are plotted in a graph and, on average,
show an increasing or decreasing trend over an extended period of time, the time
series is called a time series with trend effect. We should note that not all time
series show an increasing or decreasing trend [13]. In some cases, the values of
the time series fluctuate around a constant reading and do not show any trend over
time. We should also remember that an increase or decrease need not be in the same
direction throughout the given period. A time series may show an upward trend, a
downward trend, or no trend at all; let us explain all three cases with the help
of examples [14].
If a time series exhibits an upward trend, that is, the metric values increase as
time progresses, it is termed a time series with upward trend [15]. Figure 3.1
shows an upward trend, depicting the profit of a company plotted for the period
1981–2012.
Figure 3.1: Profit of a company for the period 1981–2012, showing an upward trend.
Metric values of a time series that, when plotted in a graph with respect to time,
show a downward trend are termed a time series with downward trend. Figure 3.2
shows a downward trend in the time plot, which describes the values of the
mortality rate for a developing country from 1991 to 2009.
Figure 3.2: Mortality rate for a developing country from 1991 to 2009, showing a downward trend.
Metric values in a time series that, when plotted in a graph with respect to time,
show no trend (to be precise, random behavior that is neither upward nor downward)
describe a time series with no trend. Figure 3.3 shows the time plot of the
commodity production, in tons, of a factory from 1988 to 2012, where the time
series shows no trend.
Metric values in a time series that, when plotted in a graph with respect to time,
vary seasonally with respect to some period, such as half-yearly, quarterly,
monthly, or yearly, form a time series with seasonal effect. Figure 3.4 describes
a time series with seasonal effect, which plots the data of weekly sales of air.
Figure 3.4: Weekly sales data forming a seasonal time series.
Metric values in a time series that, when plotted in a graph with respect to time,
show a cyclic trend form a time series with cyclic effect. Figure 3.5 shows an
example of a time series with cyclic effect, giving the employee attrition rate in
software industries for the last 25 years.
Figure 3.5: Employees attrition rate in software industries for the last 25 years.
The variations in time series values are of various kinds and arise because of a
variety of factors. These different kinds of variations in the values of the data
in a time series are referred to as components of the time series [16]. The
components of a time series involved in these variations are (i) trend, (ii)
seasonal, (iii) cyclic, and (iv) the remaining variations attributed to random
fluctuations.
In a time series, variations that occur because of regular or natural
forces/factors and operate in a regular and periodic manner over a span of one
year or less are termed seasonal variations [17]. Though we typically think of
seasonal movement in a time series as occurring over one year, it may also
represent any frequently recurring pattern that is less than one year in period.
As an example, daily traffic volume data show seasonal behavior within the same
day, with peak levels occurring during rush hours, moderate flow throughout the
rest of the day, and light flow from late evening to early morning. Thus, in a
time series, seasonal variations may exist whether data are recorded on a yearly,
monthly, daily, or hourly basis.
Sometimes a time series shows variation over a fixed period due to certain
physical reasons that are not part of seasonal effects. For example, economic data
are prone to be affected by business cycles with a period varying from a few to
several years (see Figure 3.5). A period of moderate inflation followed by high
inflation is a primary driver of such cyclic variations. The existence of these
business cycles can introduce intermittent bias across the cyclic, trend, and
seasonal effects [18]. To overcome this problem, we consider a cyclic pattern or
trend in a time series only when its duration is more than a year.
The long-term variations, that is, the trend component, and the short-run
variations, that is, the seasonal and cyclic components, are referred to as
regular variations. Apart from these regular variations, random or irregular
variations, which are not accounted for by the trend, seasonal, or cyclic
components, exist in virtually all time series [19].
Here we calculate the simple moving average of time series data over n periods of
time, called the n-period moving average. The steps to calculate the simple moving
average are:
1. The average of the first n values of the time series is calculated.
2. The first value is discarded, and the average of the next n values is
calculated.
3. Steps 1 and 2 are repeated until all data are used.
These steps generate a new time series of n-period moving averages [21].
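The steps above can be sketched in a few lines of Python (the function name is ours):

```python
def moving_average(series, n):
    # n-period simple moving average: average the first n values, drop the
    # oldest value, average the next n, and repeat until all data are used.
    return [sum(series[i:i + n]) / n for i in range(len(series) - n + 1)]

print(moving_average([2, 4, 6, 8, 10], 3))  # → [4.0, 6.0, 8.0]
```

Each output value summarizes one window of n input values, so the smoothed series is n − 1 points shorter than the original.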
The simple moving average methodology described in Section 4.1 is generally not
well suited to measuring trend, although we are able to remove seasonal variation
using it. It also does not stay close to the most recent values. Therefore, the
weighted (unequal) moving average methodology is employed. In this methodology,
rather than giving equal weights to all values, unequal weights are given in such
a way that every weight is positive and their sum equals one. If wi denotes the
weight of the ith observation, the weighted moving average value yt can be given
as follows:
y_t = Σ_{i=−q}^{q} w_i · x_{t+i},  where w_i ≥ 0 and Σ_{i=−q}^{q} w_i = 1   (3.1)
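Equation (3.1) describes a centered weighted average over a window of 2q + 1 points. A small sketch under that reading (function name and sample weights are ours):

```python
def weighted_moving_average(x, weights):
    # Centered weighted moving average, eq. (3.1):
    # y_t = sum_{i=-q}^{q} w_i * x_{t+i}, with w_i >= 0 and sum(w_i) = 1.
    q = len(weights) // 2  # weights holds 2q + 1 entries
    assert all(w >= 0 for w in weights) and abs(sum(weights) - 1.0) < 1e-9
    return [
        sum(w * x[t + i] for i, w in zip(range(-q, q + 1), weights))
        for t in range(q, len(x) - q)
    ]

# Three-point average giving more weight to the central value.
print(weighted_moving_average([10, 12, 14, 16, 18], [0.25, 0.5, 0.25]))
# → [12.0, 14.0, 16.0]
```

Choosing larger weights near the center keeps the smoothed series closer to recent values than the equal-weight simple moving average does.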
The steps to calculate the exponential moving average (EMA) are as follows:
1. If y1, y2, . . ., yt is a time series, then the first element of the EMA is
equal to the first element of the time series, that is, y1ʹ = y1.
2. The second smoothed value based on the EMA is y2ʹ = w*y2 + (1 – w)*y1ʹ, where
y1ʹ is the first element of the smoothed series, y2 is the second element of the
original time series, and w is the smoothing factor with 0 < w < 1.
In this way, the tth smoothed value can be calculated as ytʹ = w*yt + (1 – w)*y(t – 1)ʹ.
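The chapter's proposal pairs this EMA with an exponential moving standard deviation to form an adaptive band. A minimal sketch of that idea (the variance recurrence, parameter names, and sample data are our assumptions, not the authors' exact formulation):

```python
def ema_anomalies(values, w=0.3, k=3.0):
    # Flag points that deviate from the exponential moving average by more
    # than k exponential moving standard deviations (an adaptive threshold).
    ema = values[0]   # y1' = y1
    ema_var = 0.0     # exponentially weighted variance
    anomalies = []
    for t, y in enumerate(values[1:], start=1):
        band = k * ema_var ** 0.5
        if band > 0 and abs(y - ema) > band:
            anomalies.append(t)              # outside the adaptive band
        diff = y - ema
        ema = w * y + (1 - w) * ema          # yt' = w*yt + (1 - w)*y(t-1)'
        ema_var = (1 - w) * (ema_var + w * diff * diff)
    return anomalies

metric = [50, 52, 51, 53, 52, 51, 99, 52, 53]  # one sudden CPU spike
print(ema_anomalies(metric))  # → [6]
```

Because the band follows the recent history of the metric, a weekend batch job that raises load gradually widens the band instead of raising an alarm, while a sudden spike is still flagged.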
Figure 3.6: Fixed threshold-based mechanism with random spikes.
From Figure 3.7, it is evident that during weekends there is a high growth of CPU,
which could be due to some batch job being run on weekends. Hence, this fixed
threshold mechanism does not hold good in these cases.
Figure 3.7: CPU utilization showing high growth during weekends.
Hence, anomaly detection, with its adaptive capabilities, can also be called
adaptive thresholding, which is more intelligent than the traditional fixed
threshold approach; we can see this in Figure 3.7.
In the following example (Figure 3.8), a mix of unusual growth or spikes (which
turn out to be a problem) and steady growth (which could be due to some scheduled
job and hence is not a problem) is successfully distinguished by the new anomaly
detection-based mechanism.
Figure 3.8: Anomaly-based threshold mechanism for unusual spikes and steady spikes.
We will now see some more graphs, for example, how the anomaly detection-based
mechanism works for memory utilization, where the metric shows a cyclic pattern as
demonstrated earlier. From Figure 3.9, we can see that there is a cyclic pattern
and that the anomaly detection-based method successfully smoothed the raw metric
data.
Figure 3.9: Anomaly-based threshold mechanism for cyclic pattern (memory utilization in big
database servers).
Now we will see how this anomaly detection-based mechanism works well for
cyclic spikes, as shown in Figure 3.10.
Figure 3.10: Distribution of normal metric data and exponential moving average value.
Souvik Chowdhury and Shibakali Gupta 36
3 Anomaly detection in cloud big database metric 37
In Figure 3.10, the blue line depicts the normal metric data and the red one the
exponential moving average (EMA) values. From the figure we can see that, for
spikes, the EMA tries to smooth the data, much like a normal average. This
function works better if the spike persists for a while, which could be due to
some load, thus distinguishing a problem from a normally loaded situation.
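The smoothing behavior described above follows the standard EMA recurrence. The
following is a minimal Python sketch for illustration, not the chapter's
production code; the weight k = 0.5 matches the special case used later:

```python
# EMA recurrence: ema_1 = x_1, ema_i = k*x_i + (1 - k)*ema_(i-1).
def ema(values, k=0.5):
    out = []
    for x in values:
        prev = out[-1] if out else x  # seed with the first observation
        out.append(k * x + (1 - k) * prev)
    return out

# A one-off spike is damped rather than followed outright:
print(ema([10, 10, 80, 10, 10]))  # -> [10.0, 10.0, 45.0, 27.5, 18.75]
```

Note how the spike at index 2 is pulled halfway toward the running average and
then decays over the following points, which is exactly the smoothing visible
in Figure 3.10.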
Now we will see the difference between the normal standard deviation and the EMSD.
From Figure 3.11 it is evident that EMSD is a far more refined version of the
standard deviation: it smooths the data while still tracking any real problem.
We can see that only at the extreme variations did the standard deviation value
also spike up; the case is not the same for EMSD, which carries historical
knowledge, smooths the data, and detects anomalies.
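The exact EMSD formula is defined earlier in the chapter; as a hedged sketch,
one common way to maintain an exponentially weighted standard deviation
alongside the EMA is:

```python
import math

# One common EMSD variant (an illustration, not necessarily the authors' exact
# formula):
#   ema_i = ema_(i-1) + k * (x_i - ema_(i-1))
#   var_i = (1 - k) * (var_(i-1) + k * (x_i - ema_(i-1))**2)
def ema_emsd(values, k=0.5):
    """Return a list of (ema, emsd) pairs, one per observation."""
    ema, var = values[0], 0.0
    out = [(ema, 0.0)]
    for x in values[1:]:
        diff = x - ema
        var = (1 - k) * (var + k * diff * diff)
        ema = ema + k * diff
        out.append((ema, math.sqrt(var)))
    return out

# Unlike a plain standard deviation over a fixed window, the spike's influence
# decays smoothly instead of dropping out all at once:
print(ema_emsd([10, 10, 80]))  # -> [(10, 0.0), (10.0, 0.0), (45.0, 35.0)]
```

This is the "historical knowledge" property noted above: past variance is never
discarded outright, only discounted.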
Figure 3.11: Standard deviation (SD) versus exponential moving standard deviation (EMSD).
Figure 3.12 depicts the alert mechanism set up with the help of the EMA and EMSD.
The orange dots on top are the cases when the alerts got triggered. Here we can
see, in the highlighted portion, that the spikes seem to be regular and this
mechanism did not trigger an alert for this case.
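The alert rule can be sketched as a band test: a point is flagged when it falls
outside ema ± n_sigma · EMSD computed from the history seen so far. The band
width n_sigma = 3 is our assumption for illustration; the text does not state
the exact multiplier used.

```python
import math

# Sketch of the alert rule suggested by Figure 3.12 (band width n_sigma = 3 is
# an assumption). Because the band adapts with the history, regular recurring
# spikes widen the band and stop raising alerts.
def alerts(values, k=0.5, n_sigma=3.0):
    """Return the indices of points flagged as anomalous."""
    ema, var, flagged = values[0], 0.0, []
    for i, x in enumerate(values[1:], start=1):
        if var > 0 and abs(x - ema) > n_sigma * math.sqrt(var):
            flagged.append(i)
        diff = x - ema
        var = (1 - k) * (var + k * diff * diff)  # exponentially weighted variance
        ema = ema + k * diff                     # EMA update
    return flagged

# The lone spike at index 6 is flagged; ordinary jitter is not:
print(alerts([10, 12, 11, 10, 12, 11, 90, 11, 10]))  # -> [6]
```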
Figure 3.12: Alert mechanism setup using exponential moving average and exponential moving standard deviation. (Chart annotation: "No alert since some job is running which caused the extra load.")
with t1 as (
select <column names>, row_number() over (partition by <partition column> order by
<ordered column>) r_n,
--2 / (1 + row_number() over (partition by <partition column> order by <ordered column>)) k_i
0.5 k_i
from <metric table>
), t2 as (
select <column names>, (case when r_n = 1 then 1 else k_i end * c_i) a_i,
case when r_n = 1 then 1 else (1 - k_i) end b_i
from t1
), t3 as (
select <column names>,
a_i,
xmlquery(replace(wm_concat(b_i) over (partition by <partition column> order by <ordered
column> rows between unbounded preceding and current row), ',', '*') returning
content).getnumberval() m_i
from t2
), t4 as (
select <column names>, m_i, (a_i / m_i) x_i from t3
)
select <column names>,
round(m_i * sum(x_i) over (partition by <partition column> order by <ordered column>
rows between unbounded preceding and current row), 3) ema
from t4;
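The XMLQUERY/WM_CONCAT trick above emulates a cumulative product (Oracle has no
PRODUCT analytic function). As a sanity check, the same factorization can be
reproduced in a few lines of Python; the helper names are ours, chosen to
mirror the SQL aliases:

```python
# Cross-check of the factorization computed by the SQL: with a_1 = c_1,
# a_i = k*c_i, b_1 = 1, b_i = (1 - k), and m_i = b_1*...*b_i, the query's
# result m_n * sum_i(a_i / m_i) equals the usual EMA recurrence.
def ema_recurrence(xs, k):
    e = xs[0]
    for x in xs[1:]:
        e = k * x + (1 - k) * e
    return e

def ema_factored(xs, k):
    a = [xs[0]] + [k * x for x in xs[1:]]   # a_i column from t2
    m, total = 1.0, 0.0
    for i, a_i in enumerate(a):
        if i > 0:
            m *= (1 - k)                    # running product of b_i (m_i in t3)
        total += a_i / m                    # x_i = a_i / m_i, summed as in t4
    return m * total                        # final select: m_n * sum(x_i)

xs = [40.0, 60.0, 20.0, 80.0]
assert abs(ema_factored(xs, 0.3) - ema_recurrence(xs, 0.3)) < 1e-9
```

The factorization matters because the partial products 1/m_i can be maintained
by a single analytic pass, which is what keeps the SQL a one-shot window query.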
We can use the following sample to query the EMA using the simplified method.
Here we have two options for selecting the weight factor w mentioned in
Section 3.3.3.1.
Special case w = 0.5. Here the weight factor has been fixed at 0.5, that is,
giving equal weight to the current metric data and the past metric data, thus
leaving a trail of historical metric data in the calculation:
with t1 as (
select <column names>, row_number() over (partition by <partition column> order by
<ordered column>) r_n, amount * power(2, nvl(nullif(row_number() over (partition by
<partition column> order by <ordered column>) - 1, 0), 1)) c_i
from <metric table>
)
select <column names>, round(sum(c_i) over (partition by <partition column> order by
<ordered column> rows between unbounded preceding and current row) / power(2, r_n),
3) ema
from t1;
Special case w = 2/(1 + i). Here the weight factor is dynamic; hence, it behaves
differently at each level:
with t1 as (
select <column names>, row_number() over (partition by <partition column> order by
<ordered column>) r_n, amount * row_number() over (partition by <partition column>
order by <ordered column>) c_i
from <metric table>
)
select <column names>, round(sum(c_i) over (partition by <partition column> order by
<ordered column> rows between unbounded preceding and current row) * 2 / (r_n * (r_n
+ 1)), 3) ema
from t1;
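Both simplified queries can be cross-checked against the plain recurrence. The
sketch below is ours, not the chapter's code; it verifies that the w = 0.5
query's sum(c_i)/2^n closed form and the dynamic query's 2·sum(i·x_i)/(n(n+1))
closed form each match the recurrence ema_i = w_i·x_i + (1 − w_i)·ema_(i−1):

```python
# Cross-check of the two simplified SQL queries above.
def ema_recurrence(xs, weights):
    e = xs[0]  # ema_1 = x_1 (the first weight is effectively 1)
    for x, w in zip(xs[1:], weights[1:]):
        e = w * x + (1 - w) * e
    return e

def ema_w_half(xs):
    # w = 0.5: c_1 = 2*x_1, c_i = 2**(i-1) * x_i, ema_n = sum(c_i) / 2**n
    c = [2 * xs[0]] + [2 ** i * x for i, x in enumerate(xs[1:], start=1)]
    return sum(c) / 2 ** len(xs)

def ema_w_dyn(xs):
    # w = 2/(1 + i): c_i = i*x_i, ema_n = 2 * sum(c_i) / (n * (n + 1))
    n = len(xs)
    return 2 * sum(i * x for i, x in enumerate(xs, start=1)) / (n * (n + 1))

xs = [40.0, 60.0, 20.0, 80.0]
assert ema_w_half(xs) == ema_recurrence(xs, [0.5] * len(xs))
assert abs(ema_w_dyn(xs) - ema_recurrence(xs, [2 / (1 + i) for i in range(1, len(xs) + 1)])) < 1e-9
```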
Now we will discuss the various jobs and procedures set up to achieve this task.
The challenge was performance: real-time calculation of the EMA and EMSD is
expensive, and with the enormous number of targets and metrics in a cloud big
database datacenter it gets even worse. We concentrated on SQL tuning and tried
to make the query as fast as possible, including parallelism in the query. That
was still not good enough, so we also parallelized the job itself. Hence, the
current model is depicted as follows:
Figure 3.13 depicts the job flow. Here, a master job is responsible for all job
allocations; it was tested with 20 parallel jobs. The main job
ANOMALY_COUNT_REFRESH analyzes the currently running jobs, forks a new session,
and attaches a job to the session (Figure 3.14).
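The dispatch pattern in Figure 3.13 can be imitated in miniature with a thread
pool. This is an illustrative sketch, not the actual database scheduler code;
the ThreadPoolExecutor, the round-robin chunking, and the helper names are our
choices (the chapter itself used 20 parallel scheduler jobs):

```python
from concurrent.futures import ThreadPoolExecutor

def anomaly_count(targets):
    # Placeholder for the per-target EMA/EMSD computation.
    return [(t, "processed") for t in targets]

def anomaly_count_refresh(all_targets, pool_size=20):
    # Master job: split the target list across a fixed pool of workers.
    chunks = [all_targets[i::pool_size] for i in range(pool_size)]
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        results = pool.map(anomaly_count, chunks)
    return [item for chunk in results for item in chunk]
```

In the real system the "workers" are database scheduler jobs rather than
threads, but the master/pool split is the same idea.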
Figure 3.13: Job flow.
BEGIN
<Scheduler job create module> (
job_name => 'ANOMALY_COUNT_REFRESH',
job_type => <Object Type>,
job_action => 'ANOMALY_COUNT',
start_date => <Job start date>,
repeat_interval => 'FREQ=<minutely,hourly etc>;INTERVAL=<interval duration 5
mins,1 hour etc>;',
end_date => NULL,
enabled => TRUE,
comments => <Job Description>);
END;
/
Once this job is assigned to a session, the main job takes over; it collects all
necessary data and stores the output (Figure 3.15).
Figure 3.15: ANOMALY_COUNT1 invokes ALERT_TRIGGER, the mechanism to trigger alerts for the anomalies detected.
We have used the following sample commands to create the pool of jobs. At
present, only two jobs are shown.
BEGIN
<Scheduler job create module> (
job_name => 'ANOMALY_COUNT_JOB1',
job_type => <Object Type>,
job_action => 'ANOMALY_COUNT1',
end_date => NULL,
enabled => FALSE,
number_of_arguments => 2);
END;
/
BEGIN
<Scheduler job create module> (
job_name => 'ANOMALY_COUNT_JOB2',
job_type => <Object Type>,
job_action => 'ANOMALY_COUNT1',
end_date => NULL,
enabled => FALSE,
number_of_arguments => 2);
END;
/
References
[1] Zhu, Qingsheng., & Liu, Renyu. “A Network Anomaly Detection Algorithm based on Natural
Neighborhood Graph”, “International Joint Conference on Neural Networks 2018”.
[2] Wang, Lin., Xue, Bai., Wang, Yulei., Li, Hsiao-Chi., Lee, Li-Chien., Song, Meiping.,
Yu, Chunyan., & Li, Sen. et al. “Iterative anomaly detection”, “IEEE International Geoscience
and Remote Sensing Symposium 2017”.
[3] Edisanter Lo. “Hyperspectral anomaly detection based on a generalization of the maximized
subspace model,” “2013 5th Workshop on Hyperspectral Image and Signal Processing:
Evolution in Remote Sensing”.
[4] Yao, Danfeng., Shu, Xiaokui., Cheng, Long., Stolfo, Salvatore J., & Bertino, Elisa.., Ravi et al.
“Anomaly Detection as a Service: Challenges, Advances, and Opportunities, Synthesis
Lectures on Information Security, Privacy, and Trust, Electronic”.
[5] Zhao, Rui., Du, Bo., & Zhang, Liangpei. “GSEAD: Graphical score estimation for hyperspectral
anomaly detection, 2016 8th Workshop on Hyperspectral Image and Signal Processing:
Evolution in Remote”.
[6] Chen, Mingqi., Jiang, Ting., & Zou, Weixia. “Differential physical layer secret key generation
based on weighted exponential moving average, 2015 9th International Conference on Signal
Processing and Communication”.
[7] Belyaev, Alexander., Tutov, Ivan., & Butuzov, Denis. “Analysis of noisy signal restoration
quality with exponential moving average filter”, “International Siberian Conference on Control
and Communications 2016”.
[8] Md. Emdadul, Haque., Md. Nasmus, Sakib Khan., & Md. Rafiqul, Islam Sheikh. “Smoothing
control of wind farm output fluctuations by proposed Low Pass Filter, and Moving Averages,
International Conference on Electrical & Electronic Engineering 2015”.
[9] Lutfi Al-Sharif, Ahmad Hammoudeh. “Estimating the elevator traffic system arrival rate using
exponentially weighted moving average(EWMA), IEEE Jordan Conference on Applied Electrical
Engineering and Computing Technologies 2017”.
[10] Tajiri, Hiroki., & Kumano, Teruhisa. “Input filtering of MPPT control by exponential moving
average in photovoltaic system”, “IEEE International Conference on Power and Energy 2012”.
[11] Xie, Y. J., Xie, M., & Goh, T. N. “A MEWMA chart for a bivariate exponential distribution, IEEE
International Conference on Industrial Engineering and Engineering Management 2009”.
[12] Hansun, Seng., & Kristanda, Marcel Bonar. ”Performance analysis of conventional moving
average methods in forex forecasting, International Conference on Smart Cities, Automation &
Intelligent Computing Systems 2017”.
[13] Guo Feng Liu, Chen-Yu., Bin, Zhou., & Su-Qin, Zhang. “Spares Consumption Combination
Forecasting Based on Genetic Algorithm and Exponential Smoothing Method,
Fifth International Symposium on Computational Intelligence and Design 2012”.
[14] Akpinar, Mustafa., & Yumusak, Nejat. IEEE International Conference on Environment and
Electrical Engineering and IEEE Industrial and Commercial Power Systems Europe 2017”.
[15] Wang, Shuo., & Li, Ning. “MYCAT Shard Key Selection Strategy Based on Exponential
Smoothing, 2nd IEEE Advanced Information Management, Communicates, Electronic and
Automation Control Conference 2018”.
[16] Morimoto, J., Kasamatsu, H., Higuchi, A., Yoshida, T., & Tabuchi, T. “Setting method of
smoothing constant in exponential smoothing, SICE 2004”.
[17] Yanwei, Du.. “Research on Requirement Forecasting of Raw Materials for Boiler
Manufacturing Enterprise Based on Exponential Smoothing Method, Second International
Conference on Computer Modeling and Simulation, Electronic 2010”.
[18] Setiawan, Wawan., Juniati, Enjun., & Farida, Ida.. “The use of Triple Exponential Smoothing
Method (Winter) in forecasting passenger of PT Kereta Api Indonesia with optimization alpha,
beta, and gamma parameters, 2nd International Conference on Science in Information
Technology 2016”.
[19] Li, Jing., Xu, Danning., Zhang, Jinfeng., Xiao, Jianhua., & Wang, Hongbin.. “The comparison of
ARMA, exponential smoothing and seasonal index model for predicting incidence of
Newcastle disease, World Automation Congress, Electronic 2010”.
[20] Wi, Young-Min., Kim, Ji-Hui., Joo, Sung-Kwan., Park, Jong-Bae., & Oh, Jae-Chul. “Customer
baseline load (CBL) calculation using exponential smoothing model with weather adjustment,
Transmission & Distribution Conference & Exposition 2009”.
[21] Chu, Chao-Ting., Chiang, Huann-Keng., Chang, Chao-Hsi., Hong-Wei, Li., & Chang, Tsung-Jui..
“A exponential smoothing gray prediction fall detection signal analysis in wearable device,
6th International Symposium on Next Generation Electronics 2017”.
Shibakali Gupta, Ayan Mukherjee
4 Use of big data in hacking and social
engineering
Abstract: Nowadays, in the fast-paced world of Google and Facebook, every detail
of a human being can be considered a set or array of data that can be stored,
verified, and processed in several ways for the benefit of users. Big data is
best described as huge and complex data entities for which classic application
software is inadequate. Big data epitomizes information assets characterized by
high volume, velocity, and variability, requiring specific technology and
analytical approaches for their transformation into value. Big data work
includes data capture, search, data storage, transmission, updating, data
analysis, visualization, sharing, querying, data source, and information
confidentiality. Big data can be used in innumerable sectors like defense,
health care, and the Internet of things. The most famous example is probably
Palantir, which was primarily sponsored by the CIA (Central Intelligence
Agency). Its primary function was to deliver analytical sway in the war against
terrorism of any kind, but with the accumulating dependency on big data, the
menace of exploitation of this data also arises. The prominence of big data
does not revolve around data magnitude or dimensions; rather, it revolves
around how you process it. You can take statistics from whichever source and
analyze them to discover answers that facilitate cost reductions, time savings,
fresh product development, elevated offerings, and smart management. When you
combine big data with efficient and dynamic analytics, you can accomplish
business-related tasks such as detecting fraudulent behavior, recalculating
entire risk portfolios in a short span of time, and determining root causes of
failures, disputes, and blemishes in near real time. Instances such as
Cambridge Analytica illuminate the exploitation of big data. There are several
instances where large amounts of data have been stolen, as in 2014 at Yahoo
Inc., where 3 billion accounts were effectively compromised according to
official sources, or in 2016 at Adult Friend Finder, where 412.2 million
accounts were affected, with credit card details compromised as well.
Without the encompassing span of big data, it is hard to conceive a scenario
where shady endeavors and marginally forbidden deeds would be a newsflash. It
is only with the inclusion of big data that the sheer extent of these
statistics turns heads. If one individual cheats during a test, it merely earns
a quip from the instructor. If the entire class collaborates and cultivates a
structure of cheating, it becomes newsworthy. The Panama Papers are an
exceptional specimen
https://doi.org/10.1515/9783110606058-004
of an event that is not necessarily illegal, but sketchy to say the least. The
fact that several sets of international figures were acknowledged in this bulk
data set is what makes the news. The evolution of big data makes treasured
visions invariably tempting for hackers, but it also provides a big structure
of data that becomes a payload utmost necessary to protect.
In such a scenario, the security of big data is very important. This chapter
shares insight into how big data can be used in hacking and social engineering.
It will try to list the ways big data is mined from various sources such as
Google services on Android and Facebook, the various ways big data is used in
day-to-day life by the given companies and other advertising companies, the
major ill ways this data can be used against us, and the ways important and
private data can be protected from the data-collecting companies.
Keywords: big data, ethical hacking, social engineering, Cambridge Analytica, big
data security, data privacy, risk and threat
such as volume, variety, and velocity, which oblige explicit technology and
analytical techniques for its transformation into value. Additionally, a new V,
veracity, is added by some organizations to describe it, a revision challenged
by some industry authorities. The three V's have since been expanded to other
harmonizing characteristics of big data [1].
The systematic study of big data can lead to:
– Tuning according to target audience – Big data is used by business today
for scrutinizing the behavior of the target audience and serving them with
optimized services to grow the business.
– Cost cutting in various sectors – Scrutiny of such a mammoth bulk of data
has also aided business in cutting down overhead expenses in various
sectors. Considerable money is saved by enhancements in operational
efficiency and more.
– Intensification of operating margins in different sectors – Big data
also aids businesses in increasing operating margins in different sectors.
With the help of big data, a lot of blue-collar labor can be converted into
machine tasks, and this helps in growing operating margins.
xi. Data warehousing system in banking: Data warehousing systems have given
corporations a major opportunity to visualize the larger scenario by
harmonizing the delicate trends in the records, prioritizing the privacy and
shielding of information, and conveying value adds for customers. They have
been fully embraced by several companies to drive business and advance the
services they offer to customers.
xii. Data warehousing system in finance: Financial amenities have extensively
espoused data warehousing system analytics to advise enhanced investment
assessments with constant returns.
xiii. Data warehousing system in telecom: According to reports in “Global Data
Warehousing System Analytics Market in Telecom Industry 2014–2018,” it
was found that the usage of data analytic tools in telecom segment is pre-
dicted to propagate at a compound annual growth rate of nearly 28% over
the next four years.
xiv. Data warehousing system in retail: Retailers use data warehousing systems
to offer consumers personalized shopping experiences. Customer evaluation is
one way data warehousing technology is making a mark in retail; two-thirds
of retailers have made financial gains in customer management and CRM
through data warehousing systems.
xv. Data warehousing system in healthcare: Data warehousing systems are used
for scrutinizing data in the electronic medical record system with the
objective of reducing costs and refining patient care. This data includes
the unstructured data from physician notes, pathology reports, and so on.
Data warehousing and healthcare analytics have the technological capability
to predict, prevent, and cure diseases.
xvi. Data warehousing system in media and entertainment: Data warehousing
system is altering the broadcasting and entertainment industry, providing
users and viewers a much more tailored and enriched experience. Data ware-
housing system is utilized for growing revenues, analyzing real-time patron
sentiment, increasing promotion effectiveness, ratings, and viewership.
xvii. Data warehousing system in tourism: Data warehousing systems are
renovating global tourism. Information about the world is more easily available
than ever before, and people build detailed itineraries with the help of data
warehousing systems.
xviii. Data warehousing system in airlines: Data warehousing system analytics
provides the aviation industry with necessary tactics. An airline now knows
where each and every plane is heading, where any passenger is sitting on any
flight, and what a passenger is watching on the IFE (in-flight entertainment)
or connectivity system.
xix. Data warehousing system in social media: Data warehousing system is a
motivating influence behind every marketing resolution made by social
media houses and it is driving personalization to the highest extent possible
(Figure 4.1).
As we race into the future, a swelling number of components connected to the
infrastructure of our world and enterprises are reliant on an Internet
connection. The probability of devastating cyberattacks from aggressive states,
cyberterrorists, and hacktivists becomes much more real. This can be visualized
pretty well in the movie Die Hard 4.0, with several unmanned cars crashing, the
rerouting of energy and electricity on a large scale leading to blackouts, and
the tampering of traffic signals leading to accidents.
A few technological loopholes alone would rarely lead to an efficacious kinetic
assault on a large scale. As an alternative to getting direct access to the
system, invaders use several diverse but fundamental methodologies over time.
Data sabotage, that is, the altering of data records, can be considered one such
cyberattack: it seems minute but can be used by invaders for major advantage.
Small manipulations of data can affect a lot in major sectors like the stock
market or defense agencies. A small manipulation of the rating of a particular
fake product in the retail market could lead to its perception as an original
product and a major sales boost in the retail sector, or a simple tickle in the
financial figures of a company's remuneration could provide a major boost in
the stock market.
US agencies such as the CIA and FBI were perceived as major fronts for
cybercrimes in 2016.
i. Yahoo
Date: 2013–14
Impact: 3 billion user accounts
Details: “In September 2016, the once-dominant Internet colossus, while in
negotiations to sell itself to Verizon, indicated that it had been the prey of
the largest data breach in recent history, probably by ‘a state-sponsored
actor.’ The breach compromised the real names, dates of birth, email addresses,
and phone numbers of 500 million users. The corporation stated that the
preponderance of the passwords had been hashed via a robust crypt algorithm.
A few months later, it buried that previous record with the revelation that a
breach in 2013, by a different set of black hat hackers, had compromised 1
billion records with names, dates of birth, security questions and answers,
email addresses, and passwords that were not as well secured as those involved
in 2014. In October 2017, Yahoo revealed that all 3 billion user accounts had
been compromised.
The breaches knocked a probable $350 million off Yahoo’s sale price. Verizon
eventually paid $4.48 billion for Yahoo’s core Internet business. The pact
required the two corporations to share regulatory and legal obligations from
the breaches.”
Details: “The corporation came to know about the breach in late 2016, wherein a
couple of hackers were able to retrieve personal details of 57 million Uber
customers. They were also able to retrieve the driver license details of
600,000 Uber drivers. Credit card and Social Security numbers were secure, as
per the company. The hackers got access to Uber’s GitHub code repository
account, where they retrieved user credentials to Uber’s AWS account. Those
credentials should certainly not have been on GitHub.
The breach was made public a year later by Uber. They compensated the hackers
with $100,000 to destroy the records, with no clause or way to authenticate the
same. They paid them stating it was a ‘bug bounty.’ Uber also sacked its CSO
and placed the responsibility on him.
The breach is said to have affected Uber in both reputation and money. At the
time the breach was announced, the business was in discussions to sell a stake
to Softbank. Uber’s valuation declined from $68 billion to $48 billion by the
time the deal was signed.”
Data warehousing for big data, or big data analytics, was defined by
connoisseurs with the help of terminology like the value of the data, its
volume and variety, along with the velocity and veracity of the data; this is
the definition by five V's. Recently an additional V has been gaining the focus
of the market, not to mention of big data analysis experts. Vulnerability, as
it gains focus in the market, distresses the entire enterprise sector and
demands major attention since, if it is not handled, all the rest is at stake.
Due to its numerous implications, it has now received the consideration of the
entire domain.
Due to its capabilities for further optimizing business through a better
understanding of the customer and enhanced productivity suggestions, it has
made the life of decision makers a lot easier. A clause of confidentiality also
comes into the picture, which mandates the enterprise to secure patrons' data
from any unauthorized scrutiny, and because of this the vulnerability issue
needs to be addressed as an important consideration.
haunting you for the rest of your life span in the same company. Hence, it
needs to be ensured that the incoming records are from reliable sources and
have not been tampered with.
f) Huge dataset
Data warehousing is already set up for huge data processing in a data
warehouse. In such a scenario, data redundancy or data replication is not
strictly required for future system use and failure, and can act as an added
security concern. Rather, securing the network to remote data, and securing the
data itself, would provide better security.
As stated by Morrell, keeping several replicas of data floating around the
system without a proper trail cannot be considered a way to assure data
security. Hence, secured systems and definitive measures are applied to the
sequestered data used for analysis, to minimize online exposure and allow only
minimal public-function access to it, thereby eradicating the security concerns
due to insecure replicas.
Will the above-mentioned issues and procedures thwart all data warehousing
security vulnerabilities? Perhaps not. They can only act as a base for a good
start, but it is important that all firms consider and focus seriously on data
warehousing security. Ease of data management, along with several cost-saving
processing methodologies, is the major reason for its expanding popularity,
from large organizations down to small and medium-sized ones as well. Data
analytics can be further broken down into the services of data mining and data
collection, which are nowadays assisted by cloud-based storage services.
Nevertheless, the issue of
Roughley stated that, “We can begin to aid individuals with the methodology to
extract the utmost value from their data: the way to find an economical supply
of electricity or other energy sources, benefits while acquiring a bank
mortgage, even advanced and enhanced medical services using the data shared by
their fitness tracker. However, we need to acknowledge the actual point that we
have all become entities in the data warehousing system, gradually becoming
habituated to data analytics and data sharing.”
To put it simply, there are numerous ways that data exposure can be prevented,
be it keeping the servers secured and up to date with security patches.
Essentially, following a virtuous line of disclosure for customer records, and
using them for real worth, ought to solve this present-day problem.
4.2 Hacking
Hacking is an endeavor to abuse a computer system or a remote grid inside a larger
network. In simple words, it is the illicit access to or control over computer grid-
safety systems for some forbidden drive. The party engaged in hacking deeds are
branded as a hacker. These hackers may alter structure or security topographies to
achieve an objective that diverges from the original drive of the system. Hacking
can also denote to nonmalicious actions, generally concerning scarce or improvised
variations to equipment or processes.
Ethical hacking signifies the act of tracing flaws and vulnerabilities of
computer and information systems by duplicating the intent and actions of
malicious hackers. Ethical hacking is also known as penetration testing or
intrusion testing. An ethical hacker is a security expert who applies his/her
hacking
Certain organizations hire hackers as part of their upkeep staff. These
authentic hackers, also recognized as ethical hackers or white hat hackers, use
their capabilities to discover faults in the syndicate's security system, thus
averting identity theft and further computer-linked delinquencies. White hat
hackers are typically perceived as hackers who use their expertise to assist
people. They may be rehabilitated black hat hackers or may merely be well
proficient in the procedures and practices used by hackers. An organization can
employ these professionals to test and implement best methodologies that make
them less susceptible to malicious hacking efforts in the future.
While syndicates currently converge on exploiting data warehousing systems and
analytics because of economical storage, reachability, usability, and the
conception of distributed computing, they unknowingly also create a prospect
for hackers in social engineering as well: a technique wherein the hacker
learns the inclinations and interests of employees in the enterprise, which can
assist in constructing an efficacious social engineering attack. For example,
with the employee's data warehoused, the records can be mined easily for which
sites are frequently logged into by the employees and the frequency of visits
to a given site (Facebook, YouTube, etc.). With this information, a naive
hyperlink in a spam e-mail can be twisted to disclose not only individual
minutiae but can also entice the employee into providing corporate
authorizations, thus granting numerous accesses to the hacker.
Currently, data warehousing systems and networks deliver “just-in-time” backing
for governments, syndicates, and officialdoms during crises. They will also
shape forthcoming scenarios of national and international network security and
new procedures of sovereignty, and they enrich the understanding of use, abuse,
and networking of broad topics. These statistics, in mischievous hands, can be
a base point for taking an entire region or government off-guard.
Without the encircling span of big data, it is hard to conceive a scenario in which shady ventures and borderline-illegal acts would make news; only with the application of big data analytics does the sheer scale of such evidence turn heads. If one individual glances at another's sheet during a test, it is worth a red mark from the professor; if the whole class cooperates in an organized way and develops a system of cheating, it becomes newsworthy. The Panama Papers, for example, are an excellent illustration of something that is not necessarily illegal but sketchy to say the least: the fact that hundreds of high-profile global figures were identified in this mass dataset is what made the news. The evolution of big data makes the opportunities for hackers even more appealing, but it also creates a pool of data that becomes even more necessary to protect [4] (Table 4.1).
– Invasion of private communications: enables shared and political engagement on a very large scale, but presents very low barriers to intervention.
– Tracing and stalking: location sharing can be used for triangulation, judging shorter routes, locating nearby friends, and even evading natural calamities, but a criminal can use the same information to raid a house when it is empty.
4 Use of big data in hacking and social engineering 63
With big data sets, hackers may destroy or manipulate data through relatively trivial alterations in order to gain a benefit. Certain techniques might seem harmless to the community, but hackers might even exploit annual corporate financial reports for personal advantage. Such changes to financial reporting might also alter the decision-making of management, investors, dealers, and other people who base their verdicts on these reports.
A company like Equifax, one of the distinct consumer credit agencies operating in the multibillion-dollar data-broker industry, is a perfect example. Such firms paint an exhaustive picture of an individual's life, and that sketch is used to make decisions with direct consequences. As a corporation adds to its stockpile of data, the value grows exponentially; the imperative for data traders is therefore to continuously hoard as much data as possible.
In the near future, hackers might be able to intrude into workstations that control vital technological equipment regulating water distribution, rail networks, gas distribution, and so on. By gaining administrative access to such workstations, hackers could alter operational configurations or manually create chaos, with devastating consequences. Research conducted by specialists has established that this is indeed possible. As per reports, events of this kind have not reached the public news, yet it is possible that they have already occurred.
Thankfully, the defenders are keeping up and developing strategies to thwart modern cyberattacks. Let us compare how cyberattacks have traditionally been detected with how data-centric threat-detection systems are updating the cybersecurity sphere, leading security enterprises to design a contextualized and analytical approach to threat recognition.
Conventional security incident and event management software was not capable of accumulating enough information to detect modern, sophisticated infiltrations. Furthermore, although such tools use historical data, most of them lack the storage or processing capability to scrutinize data older than 30 days, which leads them to overlook significant anomalies. Additionally, these tools scrutinize diverse sources of data separately rather than in conjunction with one another [5–7].
Updated tools take into account the speed, size, variety, and complexity of data in order to recognize the new era of cyberattacks. The new paradigm calls for layering predictive analytics and machine learning systems on top of all sources of data in an organization's cyberinfrastructure (Figure 4.2).
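As a minimal illustration of layering analytics over log data, the sketch below flags hours whose event volume deviates sharply from the historical baseline. The event counts and the three-sigma rule are illustrative assumptions, standing in for the far richer machine learning pipelines described above:

```python
from statistics import mean, stdev

def flag_anomalies(event_counts, threshold=3.0):
    """Flag positions whose event volume deviates more than
    `threshold` standard deviations from the historical mean."""
    mu = mean(event_counts)
    sigma = stdev(event_counts)
    return [i for i, c in enumerate(event_counts)
            if sigma > 0 and abs(c - mu) / sigma > threshold]

# Hypothetical hourly counts of failed logins: one hour shows a burst.
counts = [12, 9, 11, 10, 13, 8, 12, 11, 10, 9, 250, 11]
print(flag_anomalies(counts))  # → [10] (the burst hour)
```

In a real deployment, such a rule would run continuously over months of correlated data from many sources rather than a single twelve-hour window.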
Figure 4.2: Monitoring UI of a Hadoop (big data scalability and analysis) tool.
Because evidence must be correlated across much longer time horizons and from several disparate sources, well-designed visualization becomes indispensable to threat scrutiny.
Companies that use data conceptualization tools have customarily utilized
them for post-destruction design and not for real-time threats monitoring. If plat-
forms are integrated and paired with streamlined visualization, users can swiftly
and accurately pinpoint system susceptibilities [5, 6].
Day by day, malware outbreaks grow in volume and intricacy; they are hard for traditional diagnostic tools and architectures to tackle because of two major factors: scalability and data density. For example, each day at Sophos Labs more than 300,000 new potentially malicious files require scrutiny, and SQL-dependent infrastructure does not scale well and has high maintenance costs [7, 8].
– IDC predicts that cloud and big data systems will help avert cyberthreats to health organizations.
– According to Gartner, one-fourth of global corporations have already adopted big data processing methodologies [13].
Social engineering can be considered one of those attacks with no real-time protection: apart from maintaining logs, which allow discovery only after the fact, there is little defense. It works through manipulation and other human faults, granting hackers insider-level access without directly acting against the intrusion-detection system. Since insider attacks based on the manipulation of authentic users cannot be fully predicted, such attacks are far harder to recognize, let alone protect against in real time. These attacks can only be tackled when users are trained thoroughly in all the modes of attack and the precautions against them (Figure 4.3).
Figure 4.3: Stages of a social engineering attack, including the investigation and play phases.
There are several diverse methodologies that, along with human collaboration, give shape to these attacks. Generally, they can be classified into five broad types [15].
i. Baiting
Just as a fish is caught with bait or a rat-catcher uses bait in a trap, in this type of attack the greed or curiosity of the victim is used as bait under a false assurance. This greed or curiosity either lands the victim in a deceptive trap that compromises personal information or opens the workstation wide to viruses. The baits generally have an authentic look, which gives the victim false assurance. Physical media is the most commonly used form for dispersing such malware.
A malware-infected flash drive containing bait tailored to the target user is a typical example. Out of inquisitiveness, the victim plugs the flash drive into a workstation, thereby providing a direct route for the malware to infest the system.
These attacks are not confined to the physical world: advertisements and other lucrative links to download software act as forms of online bait. Such baits are mostly generalized and not targeted at any particular user.
ii. Scareware
Most online users have seen scenarios in which multiple alarms suddenly pop up in the browser or system and a series of fictitious threats bombards the machine. This type of attack is termed scareware: the victim is threatened into believing that the system is compromised and/or infested with malware. This leads the user to download software suggested by the attacker, which is the real payload used to compromise the system. In short, a rogue scanner or deceptive software that frightens the user into acting according to the attacker's wishes can be termed scareware.
Figure 4.4 shows one of the most common scenarios encountered by almost every Internet user, where legitimate-looking popup banners are bombarded into the browser. These popups generally carry threatening messages or text like the one in Figure 4.4. Users are pushed to install malicious software or to click a link that redirects them to a payload-hosting site that compromises the system [9].
Spam e-mails with threats and warnings are another modus operandi for this type of attack, luring the user into spending money on worthless products.
iii. Pretexting
In this type of attack, a series of well-planned manipulations is crafted by an invader to acquire the victim's information. The perpetrator often initiates the attack by pretending to be someone who requires classified data from the victim to accomplish a task. All kinds of apposite information and records, such as Social Security Numbers (SSNs), can be gathered through this swindle.
In a classic case, the invader kicks off the attack with the following steps:
– Impersonates a colleague, a law-enforcement agency, a bank or tax official, or another entity that would, under specific circumstances, have authority-level access.
– Enquires about classified, important but partial information of the victim, so as not to raise major suspicion.
– Uses the data received to mine the rest of the classified, more important data that can harm the victim in a major way.
iv. Phishing
One of the most prominent and prevalent types of online manipulation attack, which relies on directly reaching the user via mailbox or messaging services, is phishing. It depends mostly on the human tendency to accept free services, on eagerness, or on a sense of distress. It rests on a well-crafted lie in which subtle information is fed to the victim to generate curiosity or urgency, leading the victim to click a malevolent link in mails or chats that redirects to payload pages or attachments.
As shown in Figure 4.5, an electronic mail injects a false sense of affection or caring into the user, along with curiosity to learn the identity of the source showing the affection. The link that purportedly shows a greeting shared by an unknown user actually leads to a payload-containing website that is triggered as soon as the user navigates to the webpage. Once the payload is installed, the user is at the mercy of the predator.
These attacks are generally sent en masse to a huge set of receivers, using links almost identical to the original ones, and regularly updating mail servers with information from security platforms can actually help admins obstruct such attacks.
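The mail-server filtering mentioned above can be sketched as a lookalike-domain check. The trusted-domain list and the similarity cutoff below are illustrative assumptions, not a production filter:

```python
from difflib import SequenceMatcher
from urllib.parse import urlparse

# Illustrative allow-list of legitimate domains.
TRUSTED = {"paypal.com", "google.com", "facebook.com"}

def looks_like_phish(url, cutoff=0.85):
    """Flag a URL whose domain closely resembles, but does not match,
    a trusted domain -- a common phishing trick (e.g., 'paypa1.com')."""
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    if host in TRUSTED:
        return False  # exact match: genuine
    return any(SequenceMatcher(None, host, good).ratio() >= cutoff
               for good in TRUSTED)

print(looks_like_phish("http://paypa1.com/login"))   # True
print(looks_like_phish("https://www.google.com"))    # False
```

Real mail gateways combine such string heuristics with reputation feeds from security platforms, as the text notes.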
v. Spear phishing
Since phishing is generalized and has no specific target, it can be obstructed relatively easily. A modified version of the manipulation attack therefore exists in which phishing is directed specifically at a chosen victim, who can be an individual or a member of a large organization. It follows these steps:
– Selection of a victim.
– Data mining further information about the victim, such as hobbies, interests, and job-related information, to make the attack less suspicious.
– Closely monitoring the victim so as to initiate the attack at the proper time, with maximum success rate.
These attacks generally run for a long duration, but they are amply hard to detect and have enhanced success rates.
Such an attack can be visualized as an assailant impersonating an employee of the victim's own organization, but one with higher authority or access to emergency services. After proper background study and timing, the assailant delivers a message, mostly concerning urgent or emergency routine services, that requires the victim's credentials or other important details. Information cited by the assailant, such as the name of the victim's supervisor, is retrieved during the prerequisite data mining, forcing the victim to believe the call is authentic and to disclose classified details or dispatch them via a web link.
Entities analyze users' data to operate commercial campaigns and, in return, foster the financial development of the platform itself, thus contributing to the vision of the Internet pioneers: to cultivate a digital grid where information is free and can be utilized for the well-being and financial development of all humanity.
Not surprisingly, however, data analysis can easily be misused, for instance by exploiting detailed information about users for morally questionable objectives (e.g., tailored persuasion techniques). In addition, once disclosed to the acquiring party, data are no longer in the possession of the social network and, as such, might be illicitly forwarded to other parties. Given this scenario, we briefly explain the current capabilities and consequences of such capillary data production and analysis; that is, how much can be done starting from our digital shadow?
Nowadays, the combination of psychology and data analysis is so powerful that 70 likes on Facebook are enough to infer more about a user's personality than what their friends know about them; 300 likes are enough to know the user better than their partner does. Hence, online social networks are so privacy-invasive that the daily life of a person and their digital shadow nearly coincide. Artificial-intelligence techniques are today's state of the art in many data-analysis tasks and, while already performing excellently, their growth is not expected to stop [16].
Considering that the Internet pervades every level of our lives, with online social networks acting as a giant magnifying lens on society and proving particularly suitable for fostering political discussion, the inferences performed on our data should raise serious concerns. Data might be used to profile users, to approach them in a highly tailored fashion, and, consequently, to induce them to do things they would not do on their own: social engineering taken to the extreme, precisely. The more that is known about users, the easier it is to employ persuasion techniques proposing exactly what they like, or what they are scared of, thus opening the doors to a plague of our time: the widespread diffusion of fake news, which in turn has detrimental effects on the democracy of a country. In fact, a group of attackers with sufficient resources can spread misconceptions and fake news on a global scale to influence the results of huge events by hacking the voters (which, ironically, has the same effect as vote rigging!) [9].
Very recently, the case of an alleged misuse of data by a company operating in the marketing sector, Cambridge Analytica, came under the media spotlight. It is a case worth discussing because it embodies many of the issues described throughout this chapter. First, some details about the facts: Cambridge Analytica is accused of having been involved in an illicit sharing of data with Aleksandr Kogan, a researcher who developed a Facebook-based application to gather information about users' personalities [17, 18]. Before 2014, Facebook's rules about data sharing were not as strict as they are now. Specifically, a user who agreed to disclose some of his or her own data could also reveal pieces of his or her friends' information. In this way, from the 270,000 users who deliberately shared their data with the application, it was possible to profile up to 50 million American electors. With such information in hand, Cambridge Analytica is accused of having performed microtargeting campaigns to favor the election of Donald Trump, employing unscrupulous means such as the spread of fake news to create a significant shift in public opinion (Figure 4.6).
In our view, four main lessons should be learnt from this story:
Today's data-driven business models come at the cost of sacrificing privacy and require a high level of trust in the entities managing our data. Once data have been disclosed, there is no guarantee that the party entitled to use them (e.g., the legitimate application) will not illegally forward them to other entities.
Although rules can be imposed to limit the control that users have over their friends' information (as Facebook did in 2014), the issue is inherently present in online social networks, since they are based on the friends/followers paradigm. Because of this model, the boundaries among users' information spaces have become blurred: just think of a picture in which a user is inadvertently tagged. Moreover, it has been shown that a target user's information (e.g., location) can be accurately inferred from an analysis of the profiles of his or her friends.
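A toy sketch of such an inference follows: the friend profiles are hypothetical, and a simple majority vote stands in for the statistical models used in the studies cited above.

```python
from collections import Counter

def infer_location(friend_locations):
    """Guess a user's undisclosed home city as the most common
    city among their friends' public profiles."""
    if not friend_locations:
        return None
    return Counter(friend_locations).most_common(1)[0][0]

# Hypothetical public locations of a target user's friends.
friends = ["Kolkata", "Kolkata", "Delhi", "Kolkata", "Mumbai"]
print(infer_location(friends))  # → Kolkata
```

The point is that the target disclosed nothing: the leak comes entirely from the friends/followers paradigm discussed above.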
Social engineering benefits from the heterogeneity and volume of the available data and widely employs persuasion techniques. The data-centric, all-interconnected world we live in represents a favorable scenario for the application of an extreme form of social engineering; that is, people can easily be profiled, contacted, and deceived to induce effects that go far beyond traditional industrial espionage. As a matter of fact, social engineering has the potential to spread ideologies and influence the results of huge political events by exploiting the structure of democracy itself.
The Duolingo case, as explained in our project, is an excellent example of how tracking people's behavior on a large scale and inferring their behavioral habits can improve the efficiency not only of attack patterns but also of training systems.
4.4 Conclusion
Big data analytics is a major boom in the current cyberindustry. Used correctly, it helps in identifying and understanding customers, in optimizing products according to their needs, and in science and research, the military, and other defense applications. Big data analytics can help identify illegal activity or hacking attempts even from minute amounts of available data. On the contrary, however, big data can also be used for corporate espionage and for spying on people and even altering their decisions (e.g., the U.S. elections); with the rise of social-networking applications, every detail about every individual can be considered obtainable online in some way or other.
For this reason, big data security is one of the major concerns in the cyberindustry. As Einstein remarked in the context of nuclear energy, a tool that can provide major and sustainable development can also be the cradle of foremost devastation. Big data security must therefore be considered equally important in the current cyber market.
Big data, which primarily meant the 3 V's, has now been updated to 6 V's: volume, value, variability, velocity, variety, and veracity. Big data analytics and the related security measures are growing every day, and in this chapter an insight has been given into both. With the continuous growth of data volumes and the improvement and introduction of new analysis tools in the market, big data security will in future need to be revamped continually, along with other methods of identifying hacking attempts.
References
[1] Bertino, Elisa “Big Data – Security and Privacy”, 2015 IEEE International Congress on Big
Data, (2015), doi: 10.1109/BigDataCongress.2015.126.
[2] Moreno, Julio, Serrano, Manuel A., & Fernández-Medina, Eduardo “Main Issues in Big Data
Security”, Alarcos Research Group, University of Castilla-La Mancha, 2016.
[3] Bertino, E., Jonker, W., & Petkovic, M. “Data Security – Challenges and Research
Opportunities”, SDM, 2013.
[4] Toshniwal, Raghav, Dastidar, Kanishka Ghosh, & Nath, Asoke “Big Data Security Issues and
Challenges”. International Journal of Innovative Research in Advanced Engineering (IJIRAE),
2016, ISSN: 2349–2163.
[5] Chen, M. et al. “Big Data: A Survey”. Mobile Networks and Applications, 19(2), 171–209, Jan.
2014.
[6] Mayer-Schönberger, Viktor, & Cukier, Kenneth Big Data: A Revolution That Will Transform
How We Live, Work, and Think. Houghton Mifflin Harcourt, 2013. ISBN 9781299903029. OCLC
828620988.
[7] Paulet, R., Kaosar, Md. G., Yi, X., & Bertino, E. Privacy-Preserving and Content-Protecting
Location Based Queries IEEE Transactions on Knowledge and Data Engineering, 2014, 26(5),
1200–1210.
[8] Ongsulee, Pariwat, Chotchaung, Veena, Bamrungsi, Eak, & Rodcheewit, Thanaporn “Big Data,
Predictive Analytics and Machine Learning”, 2018 16th International Conference on ICT and
Knowledge Engineering(ICT&KE), 2018.
[9] Kappler, Karolin, Schrape, Jan-Felix, Ulbricht, Lena, & Weyer, Johannes “Societal Implications of
Big Data”, (2018). KI – Künstliche Intelligenz. 32(1), Springer. doi:10.1007/s13218-017-0520-x.
[10] Peter Kinnaird, Inbal Talgam-Cohen, eds. “Big Data”. XRDS: Crossroads. The ACM Magazine for
Students. 2012, 19(1), Association for Computing Machinery. ISSN 1528–4980. OCLC 779657714.
[11] Jagadish, H.V. et al. “Challenges and Opportunities with Big Data”, 2011, [online] Available:
http://docs.lib.purdue.edu/cctech/1/.
[12] Leskovec, Jure, Rajaraman, Anand, & Ullman, Jeffrey D. Mining of massive datasets.
Cambridge University Press, (2014). ISBN 9781107077232. OCLC 888463433.
[13] Press, Gil. (9 May 2013). “A Very Short History of Big Data”. forbes.com. Jersey City, NJ:
Forbes Magazine. Retrieved 17 September 2016.
[14] Carminati, B., Ferrari, E., & Viviani, M. “Security and Trust in Online Social Networks”,
Morgan & Claypool, 2013.
[15] Bag, Monark, & Singh, Vrijendra. (2012) “A Comprehensive Study of Social Engineering
Based Attacks in India to Develop a Conceptual Model”, DOI: 10.11591/ijins.v1i2.426
[16] Andrew McAfee & Erik Brynjolfsson “Big Data: The Management Revolution”. hbr.org.
Harvard Business Review.
[17] O’Neil, Cathy (2017). Weapons of Math Destruction: How Big Data Increases Inequality and
Threatens Democracy. Broadway Books. ISBN 978-0553418835.
[18] Batini, C., & Scannapieco, M. “Data Quality: Concepts Methodologies and Techniques”,
2006.
[19] Breur, Tom “Statistical Power Analysis and the contemporary “crisis” in social sciences”.
Journal of Marketing Analytics, July 2016, 4(2–3), 61–65. doi:10.1057/s41270-016-0001-3.
ISSN 2050-3318.
[20] Hajirahimova, Makrufa Sh., & Aliyeva, Aybeniz S. (Institute of Information Technology,
Azerbaijan National Academy of Sciences, Baku, Azerbaijan). “About Big Data Measurement
Methodologies and Indicators”. International Journal of Modern Education and Computer
Science, 9(10), 1–9. doi:10.5815/ijmecs.2017.10.01.
[21] [online] Available: https://searchbusinessanalytics.techtarget.com/definition/big-data-
analytics
[22] [online] Available: https://www.dataspace.com/big-data-applications/big-data-helps-detect-
hacking/
[23] [online] Available: https://www.cnbc.com/2016/03/09/the-next-big-threat-in-hacking–data-
sabotage.html
[24] [online] Available: https://www.csoonline.com/article/2130877/data-breach/the-biggest-
data-breaches-of-the-21st-century.html
Srilekha Mukherjee, Goutam Sanyal
5 Steganography, the widely used name for
data hiding
Abstract: Nowadays, global communication knows no bounds. Ever more information is exchanged over public channels, which serve as an important mode of communication; without them, the field of technology would seem to collapse. Unfortunately, these communications often prove fatal to preserving the sensitivity of vulnerable data: unwanted parties breach the privacy of the communication and may even tamper with the data. The importance of security is thus steadily increasing in all aspects of protecting the privacy of sensitive data, and various concepts of data hiding are accordingly making much progress. Cryptography is one such concept; watermarking is another. But to protect the complete data content seamlessly, we incorporate the concepts of steganography, which provides complete invisibility to any sensitive data being communicated. This prevents unwanted attention from third-party sources, which helps to some extent with information safety. The field of big data is much in fashion these days, dealing as it does with complex and large datasets; steganographic methodologies may likewise be used to enhance the security of big data.
5.1 Introduction
The worldwide boom in technology [1] brings with it several disincentives, and overcoming them has become a challenging task. Modern research [2] has given a new dimension to almost all fields of technology, and the new technologies have stressed digital mediums [3]. Communication [4] is the sole mediator for any technology that offers a service to mankind. Modern technologies are indeed a boon to the human race in many ways. The global network of computers serves as an error-free mediator in the source-to-destination delivery of any data or document [5]. Research is carried out at a rigorous level, and hundreds and thousands of newer techniques are emerging to solve complex day-to-day problems [6]. Technology has made life simpler and easier. But, as it is said, there is no rose without a thorn, and so it goes with these
https://doi.org/10.1515/9783110606058-005
A) Secret key (symmetric) cryptography. SKC uses a single key for both encryption and decryption.
B) Public key (asymmetric) cryptography. PKC uses two keys, one for encryption and the other for decryption.
C) Hash functions. A hash function transforms plaintext into a fixed-length digest without using any key.
receiver. On receiving it, the receiver uses the same key to decrypt the secret message and thus finally recovers the corresponding plaintext. A single key is applied for both the encryption and decryption functions; therefore, secret key cryptography is also called symmetric encryption. Basically, these methods may be applied to provide confidentiality and privacy [24].
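The single-key property can be illustrated with a deliberately insecure toy cipher: XOR with a repeating key, used here only to show that the very same key both encrypts and decrypts. It is a sketch, not a real SKC algorithm such as AES:

```python
def xor_cipher(data: bytes, key: bytes) -> bytes:
    """Toy symmetric cipher: XOR each byte with the repeating key.
    Applying it twice with the same key restores the plaintext.
    (Illustration only -- NOT secure for real use.)"""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

key = b"secret"
ciphertext = xor_cipher(b"attack at dawn", key)
plaintext = xor_cipher(ciphertext, key)  # the same key decrypts
print(plaintext)  # b'attack at dawn'
```

Production systems use vetted symmetric ciphers, but the encrypt/decrypt symmetry they rely on is exactly the one shown here.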
– Public key cryptography
These methods apply one key for encryption and another for decryption; therefore, they are also known as asymmetric encryption/decryption. They are primarily used for authentication [25], key exchange, and nonrepudiation.
– Hash functions
Hash functions [26] are also known as message digests. They perform one-way encryption and are algorithms that generally use no key: just a hash value of fixed length is computed from the plaintext. This makes it nearly impossible to recover either the length or the contents of the plaintext. Hash functions are most commonly applied by operating systems to encrypt entered passwords; they also provide a distinct mechanism for assuring the integrity of a specific file.
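These properties are easy to verify with a standard-library hash such as SHA-256; the inputs below are arbitrary examples:

```python
import hashlib

# A hash maps input of any length to a fixed-length digest.
short = hashlib.sha256(b"pw").hexdigest()
long_ = hashlib.sha256(b"pw" * 10_000).hexdigest()
print(len(short), len(long_))  # 64 64 -- both 64 hex characters

# A one-character change in the input yields a completely different digest,
# which is what makes hashes useful for file-integrity checks.
a = hashlib.sha256(b"password").hexdigest()
b = hashlib.sha256(b"passwore").hexdigest()
print(a == b)  # False
```

(Real password storage additionally salts and stretches the hash; the sketch above shows only the fixed-length, one-way behavior described in the text.)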
Mainly, the different methods and techniques of cryptography were created to secure the secrecy of messages through the encryption and decryption of data. However, the encrypted form of data may often attract the attention of third-party external sources: an unwanted observer may become inquisitive and infer that this form contains hidden, precious information from a source. This feature of cryptography may actually alert an unintended third party to the covert communication taking place. Hence arises an obvious requirement to hide the very existence of any encryption made to protect and secure sensitive data.
Steganography [27] solves this issue by completely masking the information so that it is invisible to outside sources. To avoid drawing any suspicion, its methods make changes in the structure of the host that are not identifiable by the human eye; the transmission of the hidden data is therefore completely undetectable, and the communication itself is kept hidden. It is the skill of hiding confidential information within another medium or entity such that nothing unusual appears to external observers. It hides the contents of data within a carrier [28] medium, thus facilitating a seemingly invisible communication. Third parties are not able to see the information being communicated; only the sender and receiver know about, and are aware of, the secret communication taking place. This particular advantage has raised the usage of steganography to a much higher level and given a new dimension to the concept of information
5 Steganography, the widely used name for data hiding 79
security. The safety and integrity of sensitive data are guaranteed, and all fields and sectors have started using techniques that safeguard individual safety and security. Owing to its immense potential for secured connectivity, steganography has become widespread, and its concepts are in huge demand in today's world. It facilitates privacy for several legitimate purposes during communication. Since most communication nowadays takes place electronically [29], multimedia signals [30] are mostly chosen as the message carriers for secure steganographic communication. Many such techniques have been figured out through high-quality research.
Another technique, watermarking [31], also has high usage in the field of information security. Its main advantage is confirming the authenticity of original data. Watermarks may or may not be hidden in the host data; the watermark [32] is embedded within the host data in such a way that it can possibly never be removed, and even if its removal is made possible, that can only be done at the cost of degrading the host medium. Several watermarking applications, such as copyright protection or source authentication, may face an active adversary [33]: such parties may attempt to remove, forge, or invalidate the embedded watermarks. Special inks have also been used for hiding messages in currencies. Steganography, by contrast, keeps its main goal of secure communication intact: the controlling factor is that people are not in any way aware of the presence of a hidden message. This is what distinguishes steganography from other forms of data hiding or information security.
5.3 Steganography
Steganography might be defined as the science and art of hiding data or information within other information that appears to be harmless. The word "steganography" is a combination of two Greek words, "steganos" and "graphein," meaning "covered" and "writing," respectively. The sensitive message is hidden within a selected carrier known as the cover medium; the cover with the hidden data inside it is known as the stego object. The cover object can be any kind of medium within which a private message might successfully be embedded; it also helps hide the very presence of the secret message being sent. Taking an image as the medium, the cover image is the seemingly unimportant image within which the actual confidential data are to be embedded, while the stego-image serves as the carrier for communicating the private data across.
80 Srilekha Mukherjee, Goutam Sanyal
The concept of steganography has been in use since ancient times [34]. Ancient kings and rulers employed many techniques for data hiding. One was shaving the head of a trusted slave and writing the message on his scalp; once the hair grew back, he was sent to the recipient, who shaved his head again to read the message. The Greek historian Herodotus mentions a remarkable instance of this. Histiaeus, the ruler of Miletus (an ancient Greek city), sent a secret message to his vassal Aristagoras, his deputy at Miletus, by shaving the head of one of his trusted servants, marking the message on the shaved scalp [35], and sending him on his way once the hair had regrown. This was one of the many ways communication was carried out in those days.
Demaratus, king of Sparta from 510 until 491 BC, used a related strategy to send an advance warning of a forthcoming attack on Greece, inscribing it directly on the wooden backing of a wax tablet and then covering and smoothing its beeswax surface. Such wax-covered tablets were common reusable writing surfaces at the time and were even used for shorthand.
Mary, Queen of Scots, hid several letters using a combination of cryptographic and steganographic techniques: her secret letters were concealed in the bunghole of a beer barrel that could pass freely in and out of her prison cell. During World War II, the United States Marines practiced another steganographic method, the use of Navajo "code talkers," who applied a simple cryptographic technique while the messages themselves were sent as apparently clear text.
The uses of steganography were not limited to writing materials. The large geoglyphs of the Nazca lines in Peru can also be considered a form of steganography. The figures vary in complexity. These geoglyphs are open to view, yet most of them were not identified until they were observed directly from the air. The designs are mainly shallow lines made in the ground by removing the naturally occurring reddish pebbles and uncovering the whitish or grayish ground underneath. Scholars differ in interpreting their purpose, though in general they attribute some sort of religious significance to them.
Another example involving a human carrier is the writing of secret messages on silk, which was then compressed into a ball and finally covered with wax. The messenger had to swallow this wax ball. In this case, the method for retrieving the secret message is not described in the sources.
Another example of steganography involves the use of the Cardano grille. Named after its creator, Girolamo Cardano, this device is a sheet with holes cut in it; placed over an apparently innocuous text, the holes reveal the letters of the hidden message.
5 Steganography, the widely used name for data hiding 81
Digital communication is a boon of current technology, and with its progression, steganography has come to serve as a backbone of global communication. The need to secure the information traversed is the prime concern that increases the demand for steganography, whose sole purpose is to conceal sensitive data from everyone except its intended recipient. This built-in feature of the domain means steganography can actually guarantee covert communication, which widens its range of applications enormously.
In recent years, global interest, research, and development in this field have grown substantially. The redundancy [36] present in some representations of digital media (used as carrier or cover) is the targeted area for data hiding in steganography. It has attracted many researchers and developers, who have generated newer techniques for achieving and sustaining covert communication [37]. During the communication stage, an unauthenticated observer notices only the transmission of a seemingly unimportant image.
A communication always takes place between two parties, a sender and a receiver. Similarly, in steganography two processes take place: one on the sender side, known as the sender phase, and the other on the receiver side, known as the receiver phase. Robustness and transmission security are essential for delivering the vital message to its intended destination while denying access to unauthorized parties. Hence, a secret point-to-point communication between the two trusted parties must be ensured.
The communication channel is a public medium where untrusted parties may be present, whose main aim may be to uncover any secret data that passes by. Steganography therefore evaluates and generates ways of hiding information that an attacker cannot detect or trace. Once the communication channel is selected, communication proceeds by sending the stego. Many new techniques have emerged for this purpose of hiding information.
Communication today has become digital, so the techniques used are digital steganographic techniques. A steganographic procedure has two phases: one taking place at the sender side and the other at the receiver side. On the sender's side, the sender embeds the message within a chosen cover medium; on the receiver's side, the receiver extracts the hidden message from the received stego. The resemblance of the stego to its cover reflects the efficiency of the procedure used, as does the ability to extract the hidden information in a lossless manner. Lossy [38] extraction results in the loss of data fields from the hidden information, which is not what is expected of a steganographic procedure whose main aim is to communicate data secretly from sender to receiver. If there is even partial loss of hidden data, the procedure does not fully deliver what it promises.
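The embed/extract round trip described above can be illustrated with a minimal least-significant-bit (LSB) sketch in Python. The toy cover values and function names below are illustrative assumptions for this sketch, not a scheme taken from the chapter:

```python
def embed(cover, message):
    """Embed each bit of `message` (bytes) into the LSB of one cover pixel."""
    bits = [(byte >> i) & 1 for byte in message for i in range(7, -1, -1)]
    if len(bits) > len(cover):
        raise ValueError("cover too small for payload")
    stego = list(cover)
    for k, bit in enumerate(bits):
        stego[k] = (stego[k] & ~1) | bit  # replace only the least significant bit
    return stego

def extract(stego, n_bytes):
    """Recover `n_bytes` bytes from the LSBs of the stego pixels (lossless)."""
    out = bytearray()
    for i in range(n_bytes):
        byte = 0
        for bit in stego[8 * i: 8 * i + 8]:
            byte = (byte << 1) | (bit & 1)
        out.append(byte)
    return bytes(out)

cover = [120, 121, 119, 118, 200, 201, 199, 198] * 4  # toy 32-pixel "image"
stego = embed(cover, b"hi")
assert extract(stego, 2) == b"hi"                          # lossless extraction
assert all(abs(a - b) <= 1 for a, b in zip(cover, stego))  # each pixel changes by at most 1
```

Because each pixel value changes by at most 1, the stego closely resembles its cover, while the receiver recovers the hidden bytes exactly.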
The primary demand for data integrity and authentication leads to the adoption of effective measures in the respective systems. Government organizations make wide use of this area, and various individual interests are other important drivers. The modern public's awareness of security attacks has increased considerably: people have become much more conscious about protecting their personal and professional data. This self-awareness has led to increased use of enhanced security in communication systems. Even trade and business make use of the potential of steganography to communicate new product launch information or other trade secrets.
Note: It is not always possible to maximize capacity while simultaneously preserving imperceptibility and improving robustness in a data hiding scheme. Hence, an acceptable balance of these parameters must be struck based on the goal or application at hand. For example, a steganographic scheme may forgo some robustness in favor of capacity with low perceptibility, whereas a watermarking scheme, for which large capacity and low perceptibility are not requisites, would promote high robustness. Since the prime aim of steganography is hiding data, its methods must provide sufficient capacity.
Figure 5.2 shows the types [41] of steganography, which are based on the medium in which data is hidden. For example, if data is hidden in a text file, then it is text steganography.
All these types are used in their relevant mediums, where hiding is of utmost importance in that medium. Various sectors of public importance emphasize information hiding in the required medium, as per their necessity.
Owing to the limited capabilities of the human visual system [43], concealing information within digital images is asserted to be quite an efficient approach. Image steganography is thus considered a strong candidate for facilitating secure communication globally.
There are two primary types of image files [44]: raster and vector (as shown in Figure 5.3). Any image may be classified as either vector or raster graphics. An image preserved in raster form is often called a bitmap. An image map, by contrast, is a file containing information that associates different locations on a given image with hypertext links.
Figure 5.3: Two primary categories of image files (raster and vector).
– Raster image
Raster images (Figure 5.4) are made up of a collection of dots called pixels. They are the more common type (PNG, JPG, GIF, etc.) and are widely used on the web. Each pixel is a tiny colored square; zooming in to any raster image reveals these little squares. Raster images are created with pixel-based programs or captured with a camera or scanner. When an image is scanned, it is converted into a collection of pixels, which we call a raster image.
– Vector image
A vector image (Figure 5.5) is the second major image file type. Vector graphics are created with vector software and are more common for image files applied onto physical products. Vector images are object oriented, while raster images are pixel oriented. Since vector graphics are not formed of pixels, they are resolution independent. Vector shapes (called objects) may be printed as large as, and at the highest resolution that, the printer or output device allows; they retain all their detail when zoomed in or out.
5.3.6.2 Pixel
A pixel is the smallest addressable element of a raster image. For example, a 512 × 512 image has 512 pixels from side to side and 512 from top to bottom, for a total of 512 × 512 = 262,144 pixels.
There are several types of steganographic techniques that efficiently hide data. Broadly, they are categorized into two domains: spatial [46] and transform. Figure 5.6 shows a few categories of both spatial and transform [47] domain techniques.
In the spatial domain, the secret data bits are embedded directly into the cover bit planes: the least significant cover bits are directly replaced with bits of the secret message. A wide variety of procedures use spatial domain techniques, such as least significant bit substitution [48] and pixel value differencing [49]; they are efficient in several respects, such as maximum data-carrying capacity. In the transform domain, the secret message is embedded in a transformed version of the cover. Efficient transform domain techniques include discrete wavelet transformation [50], discrete cosine transformation [51], and discrete Fourier transformation [52].
There are several benchmark metrics by which the efficiency of a steganographic procedure may be evaluated and its strength determined. Given below are a few such metrics, whose values may be computed to assess the efficiency of the steganographic output, the stego.
– Payload is the total data-carrying capacity of the specific image referred to as the carrier, host, or cover. The carried object is termed the confidential or secret data; it may be any file, including text, image, audio, and video. The embedding capacity of a cover or host image is the maximum capacity beyond which distortion becomes noticeable.
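As a rough illustration, assuming a simple scheme that hides a fixed number of bits in every pixel (an assumption made for this sketch, not a statement about any particular method), the payload of a grayscale cover can be estimated directly from its dimensions:

```python
def lsb_capacity_bytes(width, height, bits_per_pixel=1):
    """Maximum payload, in bytes, when hiding `bits_per_pixel` bits in each pixel."""
    return (width * height * bits_per_pixel) // 8

# A 512 x 512 grayscale cover with 1 hidden bit per pixel can carry 32,768 bytes (32 KiB):
assert lsb_capacity_bytes(512, 512) == 32768
```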
– Mean squared error (MSE): This corresponds to the expected value of the squared error loss or quadratic loss. MSE [53] is a risk function. The difference arises mainly from randomness, or because the estimator does not account for information that would yield a more accurate estimate; MSE thus incorporates both the variance of the estimator and its bias.
If we consider a cover image CI of M by N pixels (with M = N) and a stego-image SI obtained after hiding data within CI, the MSE is found as
$$\mathrm{MSE} = \frac{1}{M \times N}\sum_{i=1}^{M}\sum_{j=1}^{N}\left[CI(i,j) - SI(i,j)\right]^{2} \qquad (5.1)$$
– Peak signal-to-noise ratio (PSNR): While the MSE represents the cumulative squared error measured between the images, the PSNR [54] represents a measure of the peak error between the stego and the original.
For color images (i.e., with three RGB component values per pixel), the PSNR definition is the same; the MSE is simply calculated as the sum of all the squared value differences divided by the image size and then by three. It is formulated as
$$\mathrm{PSNR} = 10\log_{10}\left(\frac{255^{2}}{\mathrm{MSE}}\right)\ \mathrm{dB} \qquad (5.2)$$
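A minimal Python sketch of these two metrics, treating images as flat sequences of grayscale pixel values (an illustrative simplification of real image handling):

```python
import math

def mse(cover, stego):
    """Mean squared error between two equally sized pixel sequences."""
    n = len(cover)
    return sum((c - s) ** 2 for c, s in zip(cover, stego)) / n

def psnr(cover, stego, peak=255):
    """Peak signal-to-noise ratio in dB; infinite when the images are identical."""
    m = mse(cover, stego)
    return math.inf if m == 0 else 10 * math.log10(peak ** 2 / m)

cover = [100, 110, 120, 130]
stego = [100, 111, 120, 129]        # two pixels changed by 1
print(mse(cover, stego))            # → 0.5
print(round(psnr(cover, stego), 2)) # → 51.14
```

A higher PSNR means the stego is closer to its cover; values above roughly 40 dB are usually taken as visually indistinguishable.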
– Structural similarity index measure (SSIM): In general, SSIM [55] is considered a full-reference metric, meaning that the measure of image quality depends on an uncompressed, distortion-free image used as a reference. The primary idea behind structural information is that neighboring pixels have strong interdependencies, especially when they are spatially close, and these dependencies carry important information about the structure of the objects in the visual scene. For two image windows x and y with means μx, μy, variances σx², σy², and covariance σxy, it is calculated in its standard form as
$$\mathrm{SSIM}(x,y) = \frac{(2\mu_x\mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)} \qquad (5.3)$$
where c1 and c2 are small constants that stabilize the division.
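As a hedged illustration, the sketch below computes a single global SSIM value over whole images using the standard SSIM formula; practical implementations operate on local sliding windows and average the results, which this simplification omits:

```python
def ssim_global(x, y, peak=255, k1=0.01, k2=0.03):
    """Simplified single-window SSIM over whole images (no local windowing)."""
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((v - mu_x) ** 2 for v in x) / n
    var_y = sum((v - mu_y) ** 2 for v in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    c1, c2 = (k1 * peak) ** 2, (k2 * peak) ** 2  # stabilizing constants
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))

img = [100, 120, 140, 160]
print(ssim_global(img, img))  # → 1.0 for identical images
```

SSIM ranges up to 1.0, with 1.0 indicating a stego structurally identical to its cover.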
Consider some univariate data D1, D2, . . ., DN; then their skewness and kurtosis are found as follows:
$$\mathrm{Skewness} = \frac{\sum_{i=1}^{N}(D_i - \mu)^{3}/N}{\sigma^{3}} \qquad (5.4)$$
$$\mathrm{Kurtosis} = \frac{\sum_{i=1}^{N}(D_i - \mu)^{4}/N}{\sigma^{4}} \qquad (5.5)$$
where μ is mean, σ is standard deviation, and N is number of pixels.
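Equations (5.4) and (5.5) can be computed directly; the sketch below uses the population standard deviation, as the definitions above imply:

```python
import math

def skewness(data):
    """Third standardized moment, per eq. (5.4)."""
    n = len(data)
    mu = sum(data) / n
    sigma = math.sqrt(sum((d - mu) ** 2 for d in data) / n)  # population std dev
    return sum((d - mu) ** 3 for d in data) / n / sigma ** 3

def kurtosis(data):
    """Fourth standardized moment, per eq. (5.5)."""
    n = len(data)
    mu = sum(data) / n
    sigma = math.sqrt(sum((d - mu) ** 2 for d in data) / n)
    return sum((d - mu) ** 4 for d in data) / n / sigma ** 4

print(skewness([1, 2, 3, 4, 5]))  # → 0.0 for symmetric data
print(kurtosis([1, 2, 3, 4, 5]))  # → approximately 1.7 for this sample
```

Such moment statistics are often compared between cover and stego histograms to check whether embedding has disturbed the pixel distribution.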
5.4 Conclusion
The rapid growth in the exchange of sensitive information over the Internet and other public platforms is a major security concern these days. More essentially, digital data gives easy access to communicated content, which can be copied without any degradation or loss. The urgency of security during global communication is therefore quite palpable, and hiding data in a seemingly unimportant cover medium addresses it. The realm of steganography serves this end, safeguarding the privacy of data. Unlike cryptography, steganography offers techniques that strive to hide the very existence of the hidden information, in addition to keeping it encrypted; apparently visible encrypted information is far more likely to attract the interest of hackers and crackers. Precisely put, cryptography is the practice of shielding only the contents of cryptic messages, whereas steganography is concerned with camouflaging the fact that confidential information is being sent at all, along with concealing the contents of the message. Hence, using steganographic procedures in the field of big data enhances its security.
References
[1] Mukherjee, S., & Sanyal, G. A chaos based image steganographic system. Multimed Tools Appl, Springer, 2018, 77(21).
[2] Gupta, B., Agrawal, D.P., & Yamaguchi, S. Handbook of Research on Modern Cryptographic
Solutions for Computer and Cyber Security, 2016.
[3] Saha, PK., Strand, R., & Borgefors, G. Digital Topology and Geometry in Medical Imaging:
A Survey. IEEE Transactions on Medical Imaging, 2015, 34(9), 1940–1964.
[4] Potdar, V., & Chang, E. Gray level modification steganography for secret communication, IEEE
International Conference on Industrial Informatics, Berlin, Germany, 2004, 355–368
[5] Dagadita, MA., Slusanschi, EI., & Dobre, R. Data Hiding Using Steganography. 12th
International Symposium on Parallel and Distributed Computing, IEEE, 2013, 159–166
[6] Katzenbeisser, S., & Petitcolas, F. A. Information Hiding. Artech House information security
and privacy series, Artech House, 2015, ISBN 978-1-60807-928-5, pp. I-XVI, 1–299
[7] Mukherjee, S., & Sanyal, G. A Multi-level Image Steganography Methodology Based on Adaptive PMS and Block Based Pixel Swapping. Multimed Tools Appl, Springer, 2018.
[8] Mukherjee, S., & Sanyal, G. Extended Power Modulus Scrambling (PMS) Based Image
Steganography with Bit Mapping Insertion. In: Fahrnberger G., Gopinathan S., Parida L. (eds)
Distributed Computing and Internet Technology, 2019, ICDCIT 2019. Lecture Notes in
Computer Science, vol 11319. 364–379. Springer, Cham
[9] Mukherjee, S., Roy, S., & Sanyal, G. Image Steganography Using Mid Position Value
Technique, International Conference on Computational Intelligence and Data Science
(ICCIDS), Procedia Computer Science, 2018, 132,461–468, Elsevier
[10] Mukherjee, S., & Sanyal, G. A Novel Image Steganography Methodology Based on Adaptive
PMS Technique. In: Sa P., Sahoo M., Murugappan M., Wu Y., Majhi B. (eds) Progress in
Intelligent Computing Techniques: Theory, Practice, and Applications. Advances in Intelligent
Systems and Computing, 2018, vol 518. 157–164. Springer, Singapore.
[11] Das, Shantanu, Tixeuil, Sebastien (Eds.). Structural Information and Communication
Complexity. 24th International Colloquium, SIROCCO 2017, Porquerolles, France, 2017
[12] Mukherjee, S., Ash, S., & Sanyal, G. A Novel Differential Calculus Based Image Steganography with Crossover. International Journal of Information and Communication Engineering, World Academy of Science, Engineering and Technology (WASET), 2015, 9(4): 1056–1062.
[13] Zhengan, H., Shengli, L., Xianping, M., Kefei, C., & Jin, L. Insight of the Protection for Data
Security Under Selective Opening Attacks. Information Sciences, 2017, 412–413, 223–241
[14] Mukherjee, S., & Sanyal, G. Enhanced Position Power First Mapping (PPFM) based Image
Steganography, International Journal of Computers and Applications (IJCA), Taylor and
Francis, 2017, 39 (2): 59–68,
[15] Mukherjee, S., & Sanyal, G. Edge Based Image Steganography with Variable Threshold,
Multimed Tools Appl, Springer, 2018.
[16] Khosla, S., & Kaur, P. Secure Data Hiding Technique using Video Steganography and
Watermarking. International Journal of Computer Applications, 2014, 95(20), 7–12.
[17] Mukherjee, S., & Sanyal, G. A Physical Equation Based Image Steganography with
Electro-magnetic Embedding, Multimed Tools Appl, Springer, 2019
[18] Kaminsky, Alan., Kurdziel, Michael., & Radziszowski, Stanislaw. An Overview of Cryptanalysis
Research for the Advanced Encryption Standard. Proceedings – IEEE Military Communications
Conference MILCOM, 2010, 10.1109/MILCOM.2010.5680130.
[19] Khalaf, Abdulrahman. Fast Image Encryption based on Random Image Key. International
Journal of Computer Applications, 2016, 134.
[20] Dai, Hong-Ning., Wang, Qiu., Dong, Li., & Wong, Raymond. On Eavesdropping Attacks in
Wireless Sensor Networks with Directional Antennas. International Journal of Distributed
Sensor Networks, 2013, 2013.
[21] Panda, M., & Nag, A. Plain Text Encryption Using AES, DES and SALSA20 by Java Based
Bouncy Castle API on Windows and Linux. 2015 Second International Conference on Advances
in Computing and Communication Engineering, 2015, 541–548.
[22] Wei, S., Sun, Z., Yin, R., & Yuan, J. Trade-Off Between Security and Performance in Block
Ciphered Systems With Erroneous Ciphertexts. IEEE Transactions on Information Forensics
and Security, 2013, 8, 636–645.
[23] Khalaf, Abdulrahman. Fast Image Encryption based on Random Image Key. International
Journal of Computer Applications, 2016, 134.
[24] Ping, L., Jin, L., Zhengan, H., Tong, L., Chong-Zhi, G., Siu-Ming, Y., & Kai, C. Multi-Key Privacy-
Preserving Deep Learning in Cloud Computing. Future Generation Computer Systems, 2017,
74, 76–85.
[25] Muhammad, K., Ahmad, J., Rho, S., & Baik, S.W. Image steganography for authenticity of
visual contents in social networks. Multimedia Tools and Applications, 2017, 76,
18985–19004.
[26] Sobti, Rajeev., & Ganesan, Geetha. Cryptographic Hash Functions: A Review. International
Journal of Computer Science Issues, 2012, 9, 461–479.
[27] Sedighi, V., Cogranne, R., & Fridrich, J. Content-Adaptive Steganography by Minimizing Statistical Detectability. IEEE Transactions on Information Forensics and Security, 2016, 11(2), 221–234.
[28] Steendam, H. On the Selection of the Redundant Carrier Positions in UW-OFDM. IEEE
Transactions on Signal Processing, 2013, 61(5), 1112–1120.
[29] Zhu, L., & Zhu, L. Electronic signature based on digital signature and digital watermarking.
5th International Congress on Image and Signal Processing, CISP, 2012, 1644–1647
[30] Zhang, Weiming., Zhang, Xinpeng., & Wang, Shuozhong., “Near-Optimal Codes for
Information Embedding in Gray-Scale Signals,” IEEE Transactions on Information Theory,
2010, 1262–1270.
[31] Abdallah, E.E., Ben Hamza, A., & Bhattacharya, P. MPEG Video Watermarking Using Tensor
Singular Value Decomposition, International Conference Image Analysis and Recognition,
ICIAR 2007: Image Analysis and Recognition, pp. 772–783
[32] Li, J., Yu, C., Gupta, BB. et al. Color image watermarking scheme based on quaternion
Hadamard transform and Schur decomposition. Multimed Tools Appl, 2018, 77(4),
4545–4561.
[33] Do, Q., Martini, B., & Choo K-K, R. The Role of the Adversary Model in Applied Security
Research. Computers & Security
[34] Kahn, D. The History of Steganography. Lect Notes Comput Sci, 1996, 1174. 1–5.
[35] Siper, A., Farley, R., & Lombardo, C. The Rise of Steganography. Proceedings of Student/
Faculty Research Day, CSIS, Pace University, D1_1-7, 2005.
[36] Hamid, Nagham., Yahya, Abid., Ahmad, R. Badlishah., Osamah, M. Al-Qershi., Alzubaidy, Dheiaa
Aldeen Najim., & Kanaan, Lubna. Enhancing the Robustness of Digital Image Steganography
Using ECC and Redundancy. Journal of Information Science and Engineering, 2012.
[37] Carson, Austin., & Yarhi-Milo, Keren., Covert Communication: The Intelligibility and
Credibility of Signaling in Secret, Security Studies, 2016, 26, 124–156.
[38] Hussain, A., Al-Fayadh, A., & Radi, N. Image Compression Techniques: A Survey in Lossless
and Lossy algorithms, Neurocomputing, 2018, 300, 44–69
[39] Borges, PVK., Mayer, J., & Izquierdo, E. Robust and Transparent Color Modulation for Text
Data Hiding. IEEE Transactions on Multimedia, 2008, 10(8), 1479–1489.
[40] Cem Kasapbaşi, M., & Elmasry, W. New LSB-based colour image steganography method to
enhance the efficiency in payload capacity, security and integrity check. Sādhanā (2018)
43: 68.
[41] Febryan, A., Purboyo, TW., & Saputra, RE. Steganography Methods on Text, Audio, Image and
Video: A Survey. International. Journal of Applied Engineering Research, 2017, 12(21),
10485–10490
[42] Ahvanooey, MT., Li, Q., Hou, J., et al. AITSteg: An Innovative Text Steganography Technique
for Hidden Transmission of Text Message via Social Media. IEEE Access, 2018, 6,
65981–65995.
[43] Khalil, M., Li, JP., & Kumar, K. (2015): Color constancy models inspired by human visual
system: Survey paper. 12th International Computer Conference on Wavelet Active Media
Technology and Information Processing. 432–435
[44] Zhang, Y-M., & Cen, J-J. (2010) Research on method of transformation from bitmap to vector
graphics based on Adobe Illustrator CS4. International Conference on Advanced Computer
Theory and Engineering, IEEE, V3_75–77
[45] Olugbara, OO., Adetiba, E., & Oyewole, SA. Pixel Intensity Clustering Algorithm for Multilevel
Image Segmentation. Mathematical Problems in Engineering, 2015, 1–19
[46] Hashim, M., Mohd, R., & Alwan, A. A review and open issues of multifarious image
steganography techniques in spatial domain. Journal of Theoretical and Applied Information
Technology, 2018, 96(4). 956–977
[47] Elham, Ghasemi., Shanbezadeh, Jamshid., & Nima, Fassihi. High Capacity Image
Steganography using Wavelet Transform and Genetic Algorithm. Lecture Notes in Engineering
and Computer Science, 2011, 1. 10.1007/978-1-4614-1695-1_30.
[48] Yang, C., Weng, C., Wang, S., et al Adaptive Data Hiding in Edge Areas of Images With Spatial
LSB Domain Systems. IEEE Transactions on Information Forensics and Security, 2008, 3,
488–497
[49] Shen, S., & Huang, L. A Data Hiding Scheme Using Pixel Value Differencing and Improving
Exploiting Modification Directions. Computers and Security, 2014, 48, 131–141
[50] Dey, N., Roy, A. B., & Dey, S. A novel approach of color image hiding using RGB color planes
and DWT. International Journal of Computer Applications, 2012, 36(5), 19–24.
[51] Zhou, X., Yunhao Bai, Y., & Wang, C. Image Compression Based on Discrete Cosine Transform
and Multistage Vector Quantization, International Journal of Multimedia and Ubiquitous
Engineering, 2015, 10(6), 347–356
[52] Bhattacharyya, D., & Kim, T. Image Data Hiding Technique Using Discrete Fourier
Transformation. In: Kim T., Adeli H., Robles R.J., Balitanas M. (eds) Ubiquitous Computing and
Multimedia Applications. Communications in Computer and Information Science, Springer,
Berlin, Heidelberg, 2011, 151
[53] Hansen, B. The Integrated Mean Squared Error of Series Regression and a Rosenthal Hilbert-
Space Inequality. Econometric Theory, 2015, 31, 337–361
[54] Tao, D., Di, S., Liang, X., Chen, Z., & Cappello, F. Fixed-PSNR Lossy Compression for Scientific
Data. 2018 IEEE International Conference on Cluster Computing (CLUSTER), 2018, 314–318.
[55] Dosselmann, R., & Yang, X.D. A comprehensive assessment of the structural similarity index.
Signal, Image and Video Processing. SIViP. Vol. 5, pp 81–91 (2011)
[56] Malik, F. Mean and Standard Deviation Features of Color Histogram Using Laplacian Filter for
Content-Based Image Retrieval. Journal of Theoretical and Applied Information Technology,
2011, 34
[57] Sung, Jungmin., Kim, Dae-Chul., Choi, Bong-Yeol., & Ha, Yeong-Ho. Image Thresholding Using
Standard Deviation. Proceedings of SPIE – The International Society for Optical Engineering,
2014, 9024. 10.1117/12.2040990.
[58] Duncan, K., & Sarkar, S. (2012) Relational Entropy-Based Saliency Detection in Images and
Videos. 19th IEEE International Conference on Image Processing. 1093–1096
[59] Koo, H., & Cho, N. Skew Estimation of Natural Images Based on a Salient Line Detector.
J. Electronic Imaging, 2013, 22.
[60] Ferzli, R., Girija, L., & Ali, W. (2010) Efficient Implementation of Kurtosis Based No Reference
Image Sharpness Metric. in Jaakko Astola & Karen O. Egiazarian (Ed.): Proc. SPIE 7532, Image
Processing: Algorithms and Systems VIII
Santanu Koley
6 Big data security issues with challenges
and solutions
Abstract: Big data is a collection of huge sets of data of different categories, which may be distinguished as structured and unstructured data. As we move from giga-, tera-, peta-, and exabytes toward zettabytes in this phase of computing, the threats have grown in parallel. Besides big organizations, cost reduction drives adoption by small- and medium-sized organizations too, thereby increasing the security threat. Checking the streaming data only once is not a solution, as security breaches cannot be detected that way.
Storing the data in clouds is not the only option, as big data technology is available for processing both structured and unstructured data. Nowadays, an enormous quantity of data is generated by mobile phones (smartphones) and similar devices. Big data architecture is realized within the mobile cloud, designed for maximum utilization. The fastest implementations make practical use of a data-centric architecture built on MapReduce technology, while the Hadoop distributed file system also bears the considerable responsibility of handling data with divergent structures.
As time goes on, the volume of information and data generated from different sources increases, and faster execution is demanded for it. In this chapter our aim is to identify where big data security is vulnerable and to find the best possible solutions. We believe this attempt will be a step forward along the way to more secure big data systems.
Keywords: Big data, Hadoop, MapReduce, data security, big data analysis
6.1 Introduction
Massive knowledge is employed by industries at all levels that have access to big data and the means to use it. Software infrastructures like Hadoop enable developers to distribute the storage and processing of very large data sets across computer clusters, leveraging many computing nodes to perform data-parallel computing [1]. Combined with the ability to buy computing power on demand [2] from public cloud suppliers, such developments greatly accelerate the adoption of big data processing methodologies. Therefore, new security
Santanu Koley, Department of Computer Science and Engineering, Budge Budge Institute of
Technology, Kolkata, India
https://doi.org/10.1515/9783110606058-006
96 Santanu Koley
challenges have emerged from the coupling of big data with public cloud environments characterized by heterogeneous compositions of hardware, operating systems (OS), and software infrastructures for storing and computing on data.
Data security is a much-needed criterion in today's Internet-based world, which depends on mobile phone technology. Real-time use of social media such as Facebook, Twitter, LinkedIn, blogs, WhatsApp, and other Internet-based sites produces an enormous quantity of data throughout the world. Eric Schmidt, the erstwhile CEO of Google, once estimated that the data created by this Internet world up to 2003 amounted to around 5 exabytes, and these data keep growing multiplicatively through day-to-day applications [3]. The growth of unstructured (e.g., MS-Word, PDF, plain text, media logs) and semistructured (XML, CSV, JavaScript Object Notation, etc.) data, along with structured data (data in relational databases), stems mainly from the many reasons for which diverse organizations collect data: enhancing sales, detailed study and analysis, the escalation of social media, continuous surveys, shared projects, IoT, multimedia, and so on.
Big data expertise works with another well-known technology, cloud computing, which gives the user great flexibility in terms of services and money: one pays per use of the particular services provided, eliminates costly computer hardware, invests little in setup, moves threats across disparate systems, and enjoys a short time to market, all of which make cloud systems remarkably well accepted. Cloud computing has also opened up novel applications in mobile technology, as all services and amenities are provided at their finest.
Traditional applications like RDBMSs (relational database management systems) come into play for structured data in tabular form stored in .csv or .xls categories of formats. Today, applications like Facebook and WhatsApp demand the storage of unstructured data besides the structured kind. Unstructured data include images of diverse formats, among them healthcare evidence such as X-ray, ECG, and MRI images; travel and logistics records; portable text layouts; forms of miscellaneous kinds; video and audio; documents in text, doc, rtf, and additional formats; manuals; contacts; automotive data; safety-related data; e-mail attachments; energy/industry retail data; and others. Structured data refers to the tables of early database management systems, .CSV and .XLS files, where rows and columns apply and which conventional database management systems represent well.
Big data plays an essential part with cloud computing, as data are stored in cloudlets. The five V's of big data provide the strength to analyze the data with dissimilar approaches and finally find the results. Building on the solutions published by Google, Doug Cutting and his team developed an open-source project called Hadoop. Hadoop runs applications with the MapReduce algorithm, where the data is processed in parallel across nodes. In short, Hadoop is used to build applications that can carry out complete numerical study on gigantic quantities of data.
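The MapReduce workflow just described can be sketched with a toy word count in plain Python; the function names and the two-document corpus below are illustrative and are not part of Hadoop's API.

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in one input split.
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    # Shuffle: group intermediate pairs by key, as Hadoop does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts collected for each word.
    return {word: sum(counts) for word, counts in groups.items()}

splits = ["big data big security", "data security"]
intermediate = [pair for doc in splits for pair in map_phase(doc)]
result = reduce_phase(shuffle(intermediate))
# result == {"big": 2, "data": 2, "security": 2}
```

In a real cluster each `map_phase` call would run on a different node holding one data split, which is exactly how processing time is saved.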
6 Big data security issues with challenges and solutions 97
Real-time privacy with big data analysis is required. A security problem arises when distributed frameworks such as Hadoop's MapReduce function are used: they dispense huge processing tasks to dissimilar systems to save processing time, and breaches of security crop up. These tasks are taken as input by endpoint devices, which are the main factors of security violation, as data processing, storage, and other tasks are performed there.
Storage on endpoints can stock up this streaming data in several tiers. As further data is stored, the tiers also change, since the criteria of most-used and least-used data apply. When the manual tiering system is replaced with autotiering, the transaction logs cannot cope, and the security crisis increases. Nonrelational data stores like NoSQL cannot encrypt data while distributing it to endpoints, whether it is flowing or at rest; labeling and logging pose the same problem. At the time of storing data on storage devices, a proper encryption technique is needed. Similarly, access control, encryption, and validation are also necessary when users are associated with enterprise IT as a whole. Data mining solutions are another security breach surface, as data collected from the provider and collector is passed on to the miner. This stage requires a specified mining algorithm that ensures the privacy and security of data.
Granular auditing is a kind of security check on logs to examine the different locations where data is stored in case of external attacks. An attack may be an unsuccessful hit, but auditing reveals its consequences. Granular access control of big data stored in NoSQL databases or the Hadoop distributed file system (HDFS) requires a vigorous verification practice and mandatory access control. Data provenance is handled with metadata, with which users can check, verify, and authenticate the data at high speed for any security issues.
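The metadata-based provenance check just mentioned can be approximated with cryptographic digests stored alongside the data; this is a minimal standard-library sketch, and the record fields and endpoint name are invented.

```python
import hashlib
import time

def make_provenance_record(data: bytes, source: str) -> dict:
    # Store a digest plus origin metadata alongside the data.
    return {
        "sha256": hashlib.sha256(data).hexdigest(),
        "source": source,
        "timestamp": time.time(),
    }

def verify(data: bytes, record: dict) -> bool:
    # Recompute the digest and compare; any tampering changes the hash.
    return hashlib.sha256(data).hexdigest() == record["sha256"]

payload = b"sensor reading: 42"
record = make_provenance_record(payload, source="endpoint-17")
assert verify(payload, record)                    # authentic data passes
assert not verify(b"sensor reading: 43", record)  # tampered data fails
```

Because hashing is cheap, such checks can run at ingestion speed, which is the "high speed" property the provenance requirement asks for.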
Securing data storage endpoints is very necessary, but at the same time it is critical in the case of a distributed architecture. When the data volume can rise up to exabytes, an autotiering system is essential. It will allocate streaming data involuntarily to the indicated tiers and organize it, making enormous dimensions of data uncomplicated to manage. Problems such as unverified services or contradictory protocols can occur as a result. The system generates logs that record which data is stored in which tiers, so that the tiers can be isolated and preserved accurately. A secure untrusted data repository is a network file system that can protect the data and logs from modification by external unauthorized users through constant checking and monitoring.
Organizations that employ huge amounts of unstructured data like photographs, video, audio, and text cannot use typical structured query language (SQL); rather, they utilize NoSQL. NoSQL does not have a default administrative user enabled and uses weak authentication between server and client, communicating via plaintext. Weak password storage, lack of encryption support, susceptibility to SQL injection, and denial-of-service attacks are also responsible for severe security issues. A partial solution may include server–client encryption algorithms such as Rivest–Shamir–Adleman (RSA).
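The server–client encryption idea can be illustrated with textbook RSA; the tiny primes below are purely pedagogical, and production systems should rely on vetted cryptographic libraries with 2048-bit or larger keys.

```python
# Textbook RSA with tiny primes, for illustration only.
p, q = 61, 53
n = p * q                 # public modulus
phi = (p - 1) * (q - 1)   # Euler's totient of n
e = 17                    # public exponent, coprime with phi
d = pow(e, -1, phi)       # private exponent (modular inverse, Python 3.8+)

def encrypt(m: int) -> int:
    # What the client would send over the wire.
    return pow(m, e, n)

def decrypt(c: int) -> int:
    # Only the server, holding d, can recover the message.
    return pow(c, d, n)

message = 65
cipher = encrypt(message)
assert decrypt(cipher) == message
```

In practice RSA is used only to exchange a symmetric session key, after which the server and client encrypt the actual data stream symmetrically.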
Heroku, AWS Elastic Beanstalk, and Google App Engine. Finally, as an end user,
we are aware of the names of SaaS such as Gmail, Trello, Salesforce CRM, EventPro,
Office 365, and Google Docs.
The cloud does not stand for the traditional model; as the name suggests, it is a flexible service built on a service-oriented architecture. It makes use of the Internet for supplying certain services. The data, computer hardware, and software are all shared in this construction: "Share and use of applications and resources of a network environment to get work done without concern about ownership and management of the resources and applications" (M-S. E Scale, 2009) [6–8]. The SOA view of the services provided through this technology shows that they are nothing but the convergent transformation of several extraordinary technologies. This exceptional form of cloud service provides service and deployment models. The service model can be divided into three different varieties of perception, namely PaaS, SaaS, and IaaS. The service models are described in the NIST model [9] for the above structural design. The division of interest for us is the IaaS cloud of the recently developed system.
Cloud computing is fantastically well preferred because of its notable highlights: a pay-per-use model that makes it low cost, resource pooling, ease of installation and use, provisioning of services as required by the client and the capacity to farm out work, QoS, broad network access, elasticity at short notice, self-provisioning, measured service, adaptability, reliability, easy upkeep with upgrades, low energy consumption, and so on [10].
“MCC at its simplest refers to an infrastructure where both the data storage and
the data processing happen outside of the mobile device. Mobile cloud applications
move the computing power and data storage away from mobile phones and into the
cloud, bringing applications and mobile computing to not just Smartphone users,
but a much broader range of mobile subscribers” [11, 12].
As the processing task is not performed by the cell phone, the power and memory consumption are likewise lower here, and in the long run the cell phone turns out to be fast [13].
IaaS cloud technology is the suggested method for storing information from cell phones, apps, and so on, as in the second code-partition or CDroid [14] server scheme. To achieve low cost, the cloud server should be located in a region with the lowest power cost per unit in the world, as is the case for the Fujitsu server location [15, 16].
[Figure 6.1: Big data analytics architecture: data sources; analytics/reports; a query engine (e.g., Hive, Mahout); a synchronization service; a programming model for processing large datasets with a parallel, distributed algorithm on a cluster, like MapReduce; data storage in clouds; data visualization; and a distributed fault-tolerant database for large unstructured datasets, like NoSQL.]
The present use of the expression "big data" tends to refer to the utilization of predictive analytics, user behavior analytics, or certain other advanced data-analytic techniques that extract value from data, and only sometimes to a particular size of dataset. "There is little doubt that the quantities of data now available are indeed large, but that's not the most relevant characteristic of this new data ecosystem" [17]. Analysis of data collections can discover new relationships to "spot business patterns, anticipate ailments, and battle wrongdoing thus on" [18]. Scientists, business executives, practitioners of medicine, advertising, and government alike regularly meet difficulties with extensive datasets in territories including Internet search, fintech, urban informatics, and business informatics. Researchers encounter restrictions in e-science work, including meteorology, genomics [19], connectomics, complex physics simulations, biology, and environmental research.
Big data is a concept for storing, processing, and analyzing gigantic amounts (e.g., exabytes) of data, which is almost impossible with a conventional RDBMS framework. This is on the grounds that, other than structured data, it manages semi- or unstructured data. Big data is grouped into a few classifications like data sources, content format, data stores, data staging, and data processing. The detailed classification can be comprehended with the assistance of Figure 6.3.
Data sources include the web and social media; the content format is known as structured, semistructured, or unstructured. The data stores are document oriented, column oriented, graph based, or key–value, while data staging can be depicted as cleaning, normalization, and transformation. Finally, data processing is possible with the assistance of two distinct strategies, batch processing and real-time processing.
Big data can be illustrated by means of three principal "V's" (volume, variety, and velocity), which can be extended to veracity, validity, volatility, and value (Figure 6.2).
[Figure 6.2: The 5 V's of big data. Volume: terabyte, records/arch, tables/files, distributed. Variety: structured, unstructured, multifactor, probabilistic, linked, dynamic. Velocity: batch, real/near-time, processes, stream. Veracity: trustworthiness, authenticity, origin/reputation, availability, accountability. Value: statistical, events, correlations, hypothetical. Variability: changing data, changing model, linkage.]
– Volume: Volume refers to the extent of the data created from various sources, ready to extend as far as records, tables, or other forms. The size of the data may range up to terabytes, petabytes, and zettabytes. Machine-generated data and smartphones are the greatest producers of this sort of longitudinal data [21].
– Variety: Variety means the class of data, that is, data in various structures. Furthermore, various sources deliver big data, for example, sensors, gadgets, social networks, the web, and cell phones. For instance, data could be web logs, radiofrequency identification sensor readings, unstructured social networking data, streamed video, and sound. Besides unstructured data, both structured and semistructured assortments of data are stored in big data.
– Velocity: This implies how regularly the data is produced and analyzed, for example, each nanosecond, millisecond, second, minute, hour, day, week, month, or year. Processing frequency may likewise vary with client requirements: some data must be processed in real time, in batches, or as streams, and some need not.
– Veracity: Veracity is the data in doubt, that is, uncertain, untrusted, and un-
clean. The uncertainty is due to data inconsistency and incompleteness, ambi-
guities, latency, deception, and model approximations. It is the management
of the reliability and predictability of inherently imprecise data types.
– Validity: Validity means that accurate processing is done on the input data to provide a particular output as the product. The validity of data is very close to the veracity of the data. With big data, one should pay careful attention to validity. For example, in the banking sector, data accumulated from various banks must be valid to show the growth of the nation. The finance ministry may plan its strategy for the next financial year based on the valid data provided.
– Volatility: Volatility is very common when data needs to be transformed into other types on a regular basis. If it is not accounted for, analytical results are perhaps invalid at the instant they are produced. Such circumstances are very common in businesses like the stock market or a telecom company (call data records related to one day). Volatility is directly linked to the challenges of validity and veracity.
– Value: In big data, value means exact data that exist in reality and have some meaningful aspects that can be extracted by some usable system. This must be the output of big data processing, and value is an essential feature of big data. Extracting the exact value is only possible when the proper roads are there; here, "road" refers to the IT infrastructure arranged to collect big data. This becomes feasible in a business that aims for a return on investment.
There is another factor, called variability, which can be an issue for the individuals who analyze the data. This refers to the inconsistency that can be shown by the data on occasion, thereby hampering the capacity to handle the data adequately.
Big data are portrayed by various perspectives: (a) data sources, (b) data are varied, (c) data cannot be sorted into standard relational databases, (d) content format, (e) data stores, (f) data staging, and (g) data are created, captured, and processed quickly. The real classification of big data can be illustrated by the hierarchical structure shown in Figure 6.3.
[Figure 6.3: Classification of big data. Data source: machine, sensing, IoT. Content format: structured, semistructured, unstructured. Data stores: document-oriented, column-oriented, graph-based. Data staging: cleaning, normalization, transform. Data processing: batch, real time.]
A distributed file system structure is utilized for storage in big data, using two distinct components: HDFS and the MapReduce programming framework. The two are utilized together to store and maintain the entire data structure (Figure 6.4).
[Figure 6.4: 5 V's of big data → Analyze → Result.]
HDFS is responsible for collecting the data in chunks; here data is split into blocks of 64 MB each. It is made perfect for executing its file system on a hardware platform where the cost is low as well. It can tolerate an exceedingly fault-prone environment, as it maintains block replication. The Internet search engine for the web application named the Apache Nutch venture was the reason behind the production of this architecture; Apache Hadoop got its great introduction as a subproject of the resulting web crawler. An instance of HDFS is made of a large number of server computers; sometimes the quantity can reach a couple of thousand as well. These servers hold the data of the said file system. They are set up not for solitary interaction by clients but for batch processing. The data files are much larger than usual, as they reach sizes near terabytes. Here low latency of data access is given up, while high throughput, which is fundamental for a framework like HDFS, is upgraded. Data coherency and this throughput are the foundation of the "write once and read many times" idea of the files.
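The block splitting and replication described above can be modeled in a few lines; the block size is scaled down from 64 MB and the node names are invented for illustration.

```python
# Toy model of HDFS-style block splitting and replication.
BLOCK_SIZE = 8      # bytes, a stand-in for 64 * 1024 * 1024
REPLICATION = 3     # each block lives on three distinct nodes

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    # Split a file into fixed-size blocks (last block may be shorter).
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(blocks, nodes, replication: int = REPLICATION):
    # Round-robin placement: block i goes to `replication` distinct nodes.
    return {i: [nodes[(i + r) % len(nodes)] for r in range(replication)]
            for i in range(len(blocks))}

data = b"write once and read many times..."
blocks = split_into_blocks(data)
placement = place_replicas(blocks, nodes=["node1", "node2", "node3", "node4"])
assert b"".join(blocks) == data           # blocks reassemble the file
assert all(len(set(ns)) == 3 for ns in placement.values())
```

Replication is what lets the cluster "tolerate an exceedingly fault-prone environment": losing one node leaves two live copies of every block it held.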
The Hadoop architecture contains data sources, the Hadoop system, and big data insight. The data sources contain site clickstream data, content management frameworks, outside web content, and user-created content. The Hadoop system incorporates HDFS, which is controlled through a big data landing zone, and MapReduce algorithms. These algorithms are driven by keyword investigation, content characterization or subjects, and client segmentation. The big data insight holds keyword-relevant, content-rich, client-focused landing pages.
HDFS is implemented with Java technology, which takes care of business. The basic plan of this file system is given the name NameNode–DataNode, following a master–slave design. Serving read–write operations/applications from file system clients is performed by the pair. On guidance from the NameNode, DataNodes perform block creation, deletion, and replication. NameNode and DataNode are both software expected to execute on commodity systems, typically with GNU/Linux as the OS. The NameNode runs file system namespace tasks like opening, closing, and renaming files and directories. It is a master server that administrates the file system namespace and controls access to documents by clients.
The NameNode maintains the mapping of the different blocks to DataNodes. To guarantee that the DataNodes, ordinarily one per cluster node, are working precisely, the NameNode expects time-to-time refreshing of a heartbeat and a block report from the DataNodes at an ordinary interval. A DataNode may be called a vendor for its node by arrangement, as a gathering of DataNodes is responsible for amassing a set of blocks; these blocks contain parts of documents dispersed across different blocks. HDFS maintains a file system namespace and, set apart from it, the client data to be put away in these documents as blocks.
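The heartbeat and block-report bookkeeping can be sketched as follows; the class, method names, and timeout are illustrative, not Hadoop's actual implementation.

```python
# Minimal sketch of the heartbeat tracking a NameNode performs.
class NameNode:
    def __init__(self, timeout: float):
        self.timeout = timeout
        self.last_heartbeat = {}   # DataNode id -> last report time
        self.block_reports = {}    # DataNode id -> list of block ids

    def heartbeat(self, node_id: str, now: float, blocks=None):
        self.last_heartbeat[node_id] = now
        if blocks is not None:     # periodic block report piggybacked
            self.block_reports[node_id] = list(blocks)

    def dead_nodes(self, now: float):
        # Nodes silent for longer than the timeout are presumed failed;
        # real HDFS would then re-replicate the blocks they held.
        return [n for n, t in self.last_heartbeat.items()
                if now - t > self.timeout]

nn = NameNode(timeout=10.0)
nn.heartbeat("dn1", now=0.0, blocks=[1, 2])
nn.heartbeat("dn2", now=0.0, blocks=[2, 3])
nn.heartbeat("dn1", now=9.0)
assert nn.dead_nodes(now=12.0) == ["dn2"]  # dn2 silent for 12 s > 10 s
```

The block report tells the NameNode which replicas actually exist, so the block-to-DataNode mapping described above never has to be persisted by the DataNodes themselves.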
The NameNode is not flooded with user data; it is an arbitrator for all HDFS metadata. HDFS is planned in such a way that it has a solitary NameNode in a cluster, which to the highest degree simplifies the design of the framework. The Hadoop system has two layers, in particular the HDFS layer and the MapReduce layer. The second one is the known execution engine in a multinode cluster. A job tracker coordinates task trackers across the master and slave parts; likewise, the name node in the HDFS layer coordinates the data nodes in those parts, as shown in the outline in Figure 6.5.
[Figure 6.5: Multinode cluster: in the MapReduce layer, a job tracker (master) coordinates task trackers (slaves); in the HDFS layer, a name node coordinates data nodes.]
6.3.2 MapReduce
Figure 6.6: Security and privacy challenges in big data ecosystem [22].
The big data security structure can be characterized into various classifications. Some of them can be named infrastructure security, data privacy, data management, and integrity and reactive security [27]. The outline depicts the design in detail. Infrastructure security may be portrayed by cases like distributed frameworks and nonrelational data stores, while data privacy covers data mining and analytics, cryptographically enforced data security, and granular access control. Data management comprises data tiering, transaction logs, granular auditing, and data provenance. Integrity and reactive security delineate real-time protection and endpoint devices. The arrangement is portrayed in Figure 6.7.
Big data is the most recent innovation utilized by organizations, and it brings vulnerability, as we are uninformed of the vast majority of the things involved. Most of the tools brought into play are open source, and attackers accordingly discover the nodes where data is stored. The data stores here are distributed in nature; subsequently, ill-advised client verification happens. There is a significant prospect of pernicious data input and inadequate data validation.
The examination of the advancement of big data uncovers the high adequacy of big data as far as data handling is concerned. Be that as it may, the data processing and data storage of big data raise the issue of data security ruptures and infringement of clients' protection. At the same time, the fundamental exercises learned concern the questionable nature of big data in light of the fact that, from one perspective, the advancement of big data raises data security dangers; then again, big data can possibly improve data security in the event that it is legitimately utilized. The capability of big data is gigantic, and the consideration regarding data security is basic for the powerful improvement of big data and the avoidance of various dangers.
[Figure: MapReduce data flow on a compute cluster: input data is split into DFS blocks, each block is processed by a Map task, and the intermediate outputs are combined by a Reduce task into the results.]
[Figure 6.8: Different big data security and privacy challenges [22]: distributed frameworks; nonrelational data stores; data tiering; transaction logs; granular auditing; granular access control; data provenance; data mining and analytics; cryptographically enforced data security; real-time privacy; endpoint devices.]
tax, and claims sometimes involve frauds on tax payments. Here the problem arises when accessing the data between different parties, the sources of those data, and accessing the data during office or non-office hours.
Today we move to computation on real-time data, where big data faces the most challenges. Here, real-time updating of, or keeping an eye on, websites and web pages is carried out. A gigantic quantity of data (sometimes tera- or petabytes) is collected from a variety of sources, sorted, and scrutinized by means of numerous data mining, data classification, and prediction algorithms, and consequently reports of all these analyses are maintained. These prepared reports are exceptionally helpful when decision-making criteria are satisfied. The performance of an organization depends to a great extent on those reports. A stream processing language is a real-time data processing language used to process data streams coming from multiple sources (Figure 6.9).
IBM's Stream Processing Language has three singular varieties of operators: utility, relational, and arithmetic, which take data through an input source operator and give output through output sink operators. The multiple operators in between filter, aggregate, and join multiple data streams according to the needs of the user. As per the necessities, the formulations of the operators can be composed manually by the users. Processing of streaming data is done in a more competent manner in big data systems, which also prop up ad hoc queries.
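The source/filter/aggregate pipeline described above can be approximated with Python generators; the operator names mirror the text, not IBM's actual SPL syntax, and the readings are invented.

```python
# Generator-based sketch of a stream operator pipeline.
def source(readings):
    # Input source operator: emits tuples one at a time.
    for r in readings:
        yield r

def filter_op(stream, predicate):
    # Relational-style filter: pass through tuples matching the predicate.
    return (r for r in stream if predicate(r))

def aggregate_op(stream, window):
    # Utility-style aggregate: average over fixed-size tumbling windows.
    buf = []
    for r in stream:
        buf.append(r)
        if len(buf) == window:
            yield sum(buf) / window
            buf = []

readings = [3, 50, 4, 5, 60, 6, 7, 8]
pipeline = aggregate_op(filter_op(source(readings), lambda r: r < 10), window=3)
print(list(pipeline))  # [4.0, 7.0]
```

Because generators are lazy, each tuple flows through the whole pipeline as it arrives, mirroring how a stream engine avoids buffering the full input.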
Here the end users can write their own queries in SQL, as in a custom database application, and straightforwardly submit them to the relevant web applications too. Thus, they get further flexibility and power, but there is a larger security concern in having ad hoc queries. The entire practice can be expressed as in Figure 6.10.
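The security concern with user-written ad hoc SQL is classically mitigated with parameterized statements; this sqlite3 sketch (table and values invented) shows the difference.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor TEXT, value REAL)")
conn.executemany("INSERT INTO readings VALUES (?, ?)",
                 [("s1", 1.5), ("s2", 9.9), ("s1", 2.5)])

user_input = "s1' OR '1'='1"   # a classic injection attempt
# Unsafe string concatenation would let this return every row; the
# placeholder instead treats the whole input as one literal value.
rows = conn.execute("SELECT value FROM readings WHERE sensor = ?",
                    (user_input,)).fetchall()
assert rows == []              # no sensor has that literal name

rows = conn.execute("SELECT value FROM readings WHERE sensor = ?",
                    ("s1",)).fetchall()
assert [v for (v,) in rows] == [1.5, 2.5]
```

Parameterization does not remove the need for access control on which tables an ad hoc user may query, but it closes the injection channel itself.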
[Figure: real-time processing architecture: a variety of input sources (social, web, customer info) feed low-latency DAG processing with master metadata; results flow to real-time edge serving, ad hoc search, aggregation, and data conditioning, production, and ad hoc analysis.]
Solution
There should be control on admittance to the data; moreover, it ought to be monitored. To put off illicit entry to the data, threat intelligence should be employed. Big data analytics can be used to establish trust relationships that make sure only authorized links take place on a cluster. Scrutinizing tools like security information and event management (SIEM) solutions can be brought into play to stumble on uncharacteristic associations. They may include:
The distributed computing framework is the modern approach today, where not only the hardware but also the software is distributed. Here a software component shares the workload across nodes.
Smart grid circumstances: application computation infrastructure.
– Threat/challenge: malfunctioning compute worker nodes. Solution provided: trust establishment, with initiation and periodic trust updates.
– Threat/challenge: access to sensitive data. Solution provided: mandatory access control.
To process a gigantic quantity of data, the distributed processing framework (DPF) makes use of parallelism in computation and storage space (Figure 6.11).
There are two different methods presented to make certain the trustworthiness of mappers: trust establishment and mandatory access control (MAC). In the first part, trust establishment, "workers" must be authenticated and given prearranged properties by "masters," and only once they are vetted can they be doled out mapper responsibilities. Subsequent to this requirement, periodic updates must be made to check the mappers again and again against the recognized procedure.
Alternately, MAC helps to enforce the predefined security approach. On the other hand, while MAC makes certain that the input of the mappers is safe and sound, it does not prevent data loss from the mapper output. To keep away from this, it is important to leverage data de-identification techniques that will prevent sensitive info from being circulated among nodes.
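One way to realize such de-identification is to drop or pseudonymize identifying fields before records leave a mapper; the field names and salt below are invented for illustration.

```python
import hashlib

SALT = b"cluster-secret"   # in practice a securely stored secret

def pseudonymize(value: str) -> str:
    # Salted hash: stable across records (so joins still work),
    # but not reversible without the salt.
    return hashlib.sha256(SALT + value.encode()).hexdigest()[:12]

def deidentify(record: dict) -> dict:
    out = dict(record)
    out.pop("name", None)                # drop the direct identifier
    out["user_id"] = pseudonymize(out["user_id"])
    out["zip"] = out["zip"][:3] + "XX"   # generalize a quasi-identifier
    return out

record = {"user_id": "u-1001", "name": "Alice", "zip": "71304", "reading": 42}
clean = deidentify(record)
assert "name" not in clean and clean["zip"] == "713XX"
assert clean["reading"] == 42            # analytic value is preserved
```

The point is that mapper output circulated among nodes carries only pseudonyms and generalized fields, so a compromised node learns little about individuals.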
Solution
The privacy and security problem takes into account a number of questions like auditing, access control, authentication, authorization, and privacy once the mapper and reducer processes are brought into play. The way out employs trusted third-party monitoring and security analytics (Apache Shiro, Apache Ranger, and Sentry), as well as privacy-policy enforcement with security to put off information spillage. In that regard there is an inherent lack of clarity in administering compound application reconciliations on a production-scale distributed framework. The system is to work by means of an enterprise-class programming model that has the capacity to grasp workload strategies, tuning, and general observation and organization. At that point, when we make an application for a solitary office or different capacities, we leverage an IPAF (inter-protocol acceptability framework). The solution is provided with the following:
The above framework of big data accumulates data from an assortment of resources, generally called endpoint devices. The technique of collecting data splits into two classes of perils: data collection with validation, and filtering of data. The first part (data collection with validation) is to collect data from several endpoints connected in a distributed network, where millions of hardware and software components are associated with an enterprise network.
Here another problem arises when input validation is performed on the piled-up data. Substantiation of the input data is important, as infected data may come compiled with some malware application or computer viruses that may harm the data sources. The data should then be filtered and modified as per the given format of the data requirements. The second part (filtering of data) is improbable to provide the exact outcome if data cannot be validated and filtered. The data mapping can be done with the help of knowledge processing and business goals, or by assessing the scope of customer data. Legacy signature-oriented data filtering may perhaps be unsuccessful, causing input validation and data filtering to slow down completely.
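The validation-plus-filtering step can be sketched as a schema check combined with a hash blacklist standing in for signature-based filtering; the record shape, value range, and payloads are all invented.

```python
import hashlib

# Signature database: digests of known-bad payloads (toy example).
KNOWN_BAD_HASHES = {
    hashlib.sha256(b"malicious payload").hexdigest(),
}

def validate(record: dict) -> bool:
    # Schema/format validation before a record enters the pipeline.
    return (isinstance(record.get("sensor"), str)
            and isinstance(record.get("value"), (int, float))
            and -100 <= record["value"] <= 100)

def signature_ok(payload: bytes) -> bool:
    # Signature-oriented filter: reject payloads matching known malware.
    return hashlib.sha256(payload).hexdigest() not in KNOWN_BAD_HASHES

incoming = [
    ({"sensor": "s1", "value": 20}, b"ok"),
    ({"sensor": "s2", "value": 9000}, b"ok"),         # out of range
    ({"sensor": "s3", "value": 5}, b"malicious payload"),
]
accepted = [rec for rec, raw in incoming if validate(rec) and signature_ok(raw)]
assert accepted == [{"sensor": "s1", "value": 20}]
```

As the text notes, signature lists alone age badly, which is why they are paired here with format validation rather than used on their own.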
Solution
Various solutions include the following:
Smart grid circumstances:
– Threat/challenge: compromise in communication of data. Solution provided: analytics to detect outliers.
(SSD) through archival data of little significance that ends up in tape storage or low-cost cloud storage services.
The normal manual tiering system is flexible enough, but an automated tiering system has the largest flexibility ever, shifting data among diverse storage media as its value changes.
Big data relies on distributed processing frameworks like Hadoop, which are used for data processing and tiered storage in big data applications. Hadoop's substantial storage potential comes from its clustering architecture, wherein data are distributed and laid up across a network of pooled computing resources. If subscribed with a specific cloud setup, the data tiering part may be stored in the cloud data center; otherwise, Hadoop may be installed on-premise for an enterprise-specific area.
At the very beginning, the datasets are created; afterward, they are used on a regular basis by several users from dissimilar endpoints. As large datasets come into Hadoop clusters, a component of the data is stocked up on individual computing machines, or nodes, in a cluster.
Yet, while varied kinds of data are accessed very frequently, the value of data diminishes with the progression of time. Data tiering is supported by Hadoop, and by splitting the cluster into different tiers based on the frequency of use, it can considerably shrink the cost of piling up this data. This technique of cost reduction in data storage is continuous in nature and is depicted in Figure 6.14.
Hot data: Real-time data analysis needs high-frequency access to data. Regular reporting and ad hoc queries are also associated with this type of data. The utmost computing power is utilized with this kind of data.
Warm data: This type of data is accessed seldom. But it is functional at times, so it is necessary for it to be stored on hard disks, or sometimes in SSD storage too. The computing power needed is obviously much less than for hot data.
Cold data: Cold data is essentially a library type of data called archival data. This character of data is used in enterprises at times when a periodic report is generated or an enterprise wants to retain data for compliance or once-off queries. This data is located in storage with negligible processing power available to it.
Frozen data: Frozen data means it will almost never be used. This kind of data is not stored in basic computer storage like HDDs. Such data is almost unusable, so the power consumption is near zero. It can be stored on a node that uses minimal computing power, where less processing is performed. Here data is stored in an archival manner (Figure 6.15).
[Figure 6.15: Storage tiers: hot (in memory, real-time data analysis), warm (disk), cold, and frozen (archive).]
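The four tiers above can be assigned automatically from access recency; this is a minimal sketch, and the day thresholds are purely illustrative.

```python
# Toy autotiering policy: tier by days since last access.
def assign_tier(days_since_access: int) -> str:
    if days_since_access <= 1:
        return "hot"      # in memory, real-time analysis
    if days_since_access <= 30:
        return "warm"     # SSD/HDD, occasional use
    if days_since_access <= 365:
        return "cold"     # archival storage, periodic reports
    return "frozen"       # tape or low-cost cloud, almost never read

# Invented datasets: name -> days since last access.
datasets = {"clicks": 0, "monthly_report": 12, "audit_2020": 200, "legacy": 900}
tiers = {name: assign_tier(age) for name, age in datasets.items()}
assert tiers == {"clicks": "hot", "monthly_report": "warm",
                 "audit_2020": "cold", "legacy": "frozen"}
```

A real system would run such a policy periodically and move the underlying blocks (e.g., with Hadoop's Mover tool) rather than merely relabeling them.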
Solution
The cost reduction via data storage tiering is done for easy handling of data. Data is separated according to its frequency of use within a specific time frame and stored on different nodes. The reduction of cost for data storage depends on the storage: when data is stored on nodes with minimum computing power, as in the case of archival data, a huge amount of money is saved. The data can move among the said tiers via Hadoop tools, for instance, Mover. If the system turns out to be relatively dynamic, this allows for superior competence in the big data storage technique.
Now big data are classified into singular storage tiers per the frequency of their usage. This is an excellent starting tip for the enterprise that wants to store information in a less pricey manner. After some time, when this process is rationalized, the user organization need not think about storage and reallocation of the same data, with its cost minimized as data move between tiers.
The development of every job is directly proportionate with the addition of data as big as it con-
structs the peak of big data that gathers by themselves. A small part within big data is useful in
6 Big data security issues with challenges and solutions 119
this instance as the cold and especially archive data exist. Thus there is a perspective to categorize
and tier enterprise level big data as soon as it gets into the clusters, leading to even greater
efficiencies.
Storage tiering has immense potential in a business world where organizations are under pres-
sure to extract useful insights from the large swathes of data they gather on a regular
basis; data that will only keep growing in volume and velocity. Tiered storage brings
cost optimizations to the table that can help organizations achieve the correct
equilibrium between performance, capacity, and cost on their big data clusters.
Data collection in a big data analytics system can be automated, but
automation increases the risk of data loss during collection. Encryption techniques
and proper training of users can minimize this risk. Adequate pro-
tection methods must be introduced, and big data stored in clouds should be tested
thoroughly. This provides extra protection on the user side, while the cloud
service provider conducts routine security reviews within an agreed time frame.
Penalties can be imposed on cloud service providers whose security does
not meet the expected standards. An access-control policy must be
set up to govern authorized user access to both internal and external sites. User au-
thentication for data arriving from different sites must likewise be controlled to some
degree. To protect data from unauthorized access, a second-level security mechanism,
namely encryption, is very useful (Figure 6.16).
The reason is that both raw data and analyzed data are in use.
Confidentiality and integrity should be imposed on data as measures of data protection.
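As a minimal sketch of the integrity side of such a measure, a keyed digest (HMAC) lets the receiver detect tampering with a record. This is an illustration only: in practice it would be paired with encryption for confidentiality, and the key would come from a key-management system rather than being generated in place.

```python
import hashlib
import hmac
import secrets

KEY = secrets.token_bytes(32)  # shared secret; illustrative key handling

def protect(record, key=KEY):
    """Attach an HMAC-SHA256 tag so tampering in transit or at rest is detectable."""
    tag = hmac.new(key, record, hashlib.sha256).hexdigest()
    return record, tag

def verify(record, tag, key=KEY):
    """Recompute the tag and compare in constant time."""
    expected = hmac.new(key, record, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, tag)

rec, tag = protect(b"patient-id=42,reading=98.6")
assert verify(rec, tag)
assert not verify(b"patient-id=42,reading=108.6", tag)  # altered record fails
```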
Another aspect of the data security mechanism is the use of antivirus and fire-
wall protection. The current trend is the creation of specialized attacks such as ransomware,
a now very common type of malware that breaches the defenses of com-
puters throughout the world on a regular basis. The security mechanism also includes
smaller techniques, such as disconnecting user devices from servers holding
important information when those devices are not in use. The security mechanism must pay at-
tention to fortifying the function, rather than just safeguarding the device.
Both proactive and reactive security capabilities should be applied to big data.
Data visibility is managed differently for different entities (systems, individuals, and
organizations), and there are two methods of organizing it. The first technique
controls the visibility of data by restricting access to the underlying
system, for example, the OS itself or the hypervisor. The second encapsu-
lates the data itself in a self-protective shell by means of cryptography.
Both approaches have their benefits and drawbacks. Historically, the first
was simpler to implement and, when integrated with cryptographi-
cally protected communication, could become standard across most of the
computing and communication infrastructure.
Conventional security methods must account for an assortment of threats.
These include replay attacks, password-guessing attacks, spoofed log-
ins, inter-session chosen-plaintext attacks (a Kerberos-specific attack), and
exposure of session keys. Such attacks are common against big data security tech-
niques and are pertinent to any of the customary authentication schemes; even a
comparatively well-protected system such as Kerberos sometimes fails. There is a
need today for a scheme that can forestall these attacks.
Solution
Instead of cryptography alone, steganography can be used for security control: the data is con-
cealed within some other medium. Cryptography actually changes the format of the data,
whereas steganography creates an illusion for intruders seeking unauthorized access to user data.
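A minimal sketch of the idea hides a message in the least significant bits of a byte sequence standing in for image or audio samples. This is a toy cover and a toy scheme for illustration, not a production steganographic system.

```python
def embed(cover, message):
    """Hide message bits in the least significant bit of each cover byte."""
    bits = [(byte >> i) & 1 for byte in message for i in range(7, -1, -1)]
    if len(bits) > len(cover):
        raise ValueError("cover too small for message")
    out = bytearray(cover)
    for i, bit in enumerate(bits):
        out[i] = (out[i] & 0xFE) | bit  # overwrite only the lowest bit
    return out

def extract(stego, n_bytes):
    """Read back n_bytes hidden bytes from the cover's low bits."""
    bits = [b & 1 for b in stego[: n_bytes * 8]]
    return bytes(
        sum(bit << (7 - i) for i, bit in enumerate(bits[k:k + 8]))
        for k in range(0, len(bits), 8)
    )

cover = bytearray(range(200))   # stand-in for image/audio samples
stego = embed(cover, b"key")
assert extract(stego, 3) == b"key"
```

Because each cover byte changes by at most one, the carrier looks essentially unchanged to a casual observer, which is the illusion the text describes.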
One way to address an access-control problem such as this in big data is to revise the secure
remote password (SRP) protocol to accommodate client access control at the authentication
level. The aim is to assign access labels to big data users so as to limit their access rights
in the big data environment. SRP is a secure password-based authen-
tication and key-exchange protocol (Figure 6.17).
This protocol authenticates the client to the server using a password-like secret
that must be known to the client alone; no other secret information
needs to be remembered by the client. The server stores a verifier for every user.
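The stored-verifier idea can be sketched as follows. This is a simplified illustration of the enrolment and check steps only; full SRP additionally performs a zero-knowledge exchange so the password never crosses the wire, which is omitted here.

```python
import hashlib
import secrets

def make_verifier(password):
    """Enrolment: the server stores (salt, verifier), never the password itself."""
    salt = secrets.token_bytes(16)
    verifier = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    return salt, verifier

def check(password, salt, verifier):
    """Authentication: derive a candidate verifier and compare in constant time."""
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    return secrets.compare_digest(candidate, verifier)

salt, verifier = make_verifier("correct horse")
assert check("correct horse", salt, verifier)
assert not check("wrong guess", salt, verifier)
```

Even if the verifier table leaks, the attacker still faces a salted, slow key-derivation function rather than the passwords themselves, which is the property the protocol relies on.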
[Figure: smart grid circumstances. Threat/challenge: imposing access control; solution provided: group signatures with trusted third parties.]
Protected data storage and transaction logs are closely linked with
data warehousing, management, and processing, and hence pose security trou-
bles. External security threats can sometimes corrupt stored data through unautho-
rized access within big data. This corrupted data is then passed from one place to
another as transmission occurs between different nodes and users. The
tiered approach already described governs how data is stored in different tiers; the same
holds for the transaction logs, so both data and transaction logs are spread across
multitiered storage media. The data frequently moves among tiers manually.
Manual movement gives the developer direct authority over exactly what data is
moved and when. On the other hand, as dataset volumes grow exponentially,
scalability and accessibility demand autotiering for big data storage management. An
autotiering solution, however, keeps no track of where the data is stored, which
creates a new protected-storage problem. Consider, for instance, an estab-
lishment that wants to ingest data from its own departments. There are
different types of data (as per the data tiering approach) available within the
departments, where rarely used and heavily used data exist in paral-
lel. An autotiering storage system will help a great deal and can save money
by moving the less-used data to a lower tier, and so on. It may, however, place
sensitive information in a lower tier. Lower-tier data usually has
weaker security, since companies do not want to spend much on unused data; in
this event, they must either change the policy that sends such data to lower tiers or increase the
security of those tiers.
Transaction logs can grow out of control when not properly maintained. Each time
data is changed in the database, records are added to the transaction log. If
a transaction runs against an extremely large table, the trans-
action log must record and hold those data changes until the trans-
action finishes. Since the log writes its data to disk, this can consume a great deal
of disk space very quickly (Figure 6.18).
[Figure 6.18 content: users issue queries against the database; (3) a query returns to the user; (4) the change is recorded in the data file, which happens during checkpoints.]
Solution
Data tiering plays a major role in big data analysis, especially when an autotiering scheme is
adopted. Autotiering must be fully log based, so that the movements of data between tiers are
recorded; in parallel, sensitive data should be properly labeled and the security of its
tier increased. Providing high security for huge volumes can be a problem, in which case the
data should be isolated and kept in a safe place. Encryption, especially policy-based encryp-
tion, together with signatures on the data, secures both data storage and transaction logs. Proof of data pos-
session and periodic audits of the data are also recommended (Figure 6.19).
[Figure: smart grid circumstances. Threat/challenge: data confidentiality and integrity; solution provided: policy-based encryption.]
6.3.2.7.3 Backup
Finally, take steady and automated backups. There are numerous methods for
setting this up (beyond the scope of this section). If the recov-
ery model is set to SIMPLE, the only option is to perform backups
of the data; backing up the transaction log is not possible.
Each backup will be a solitary snapshot of the data in the data-
base at the time of the backup.
If the recovery model is set to FULL, then transaction
log backups enable the database to be fully restored to a specific
point in time rather than being confined to a single explicit moment. Log backups
also perform the significant task of marking existing log records as dor-
mant. Idle log-record space can then be reused by the transaction log
when it returns to that area of the log file, since it writes to the log
in sequential order. It is also essential to note that performing a "full
backup" from the backup options does not back up the transaction
log, even when the database is in the FULL recovery model.
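The difference between the two recovery models can be illustrated with a toy model: a SIMPLE-style restore recovers only the snapshot, while a FULL-style restore replays the transaction log up to a chosen point in time. The dictionary "database" and timestamped log entries are invented for illustration.

```python
# A dict stands in for the database; the log is a list of
# (timestamp, key, value) changes recorded since the snapshot.
def restore(snapshot, log, until_ts=None):
    """Restore the snapshot, then replay logged changes up to until_ts.

    until_ts=None mimics SIMPLE recovery: the snapshot alone.
    A timestamp mimics FULL recovery: point-in-time restore.
    """
    db = dict(snapshot)
    if until_ts is None:
        return db
    for ts, key, value in sorted(log):
        if ts > until_ts:
            break  # stop replaying at the chosen point in time
        db[key] = value
    return db

snapshot = {"balance": 100}
log = [(1, "balance", 120), (2, "balance", 90), (3, "balance", 200)]
assert restore(snapshot, log) == {"balance": 100}             # SIMPLE
assert restore(snapshot, log, until_ts=2) == {"balance": 90}  # FULL, point in time
assert restore(snapshot, log, until_ts=99) == {"balance": 200}
```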
6.3.2.8 NoSQL
A problem in contemporary big data analysis is the synchronization between da-
tabase systems, which host the data and provide SQL querying, and
data analytics [28] packages that carry out numerous forms of non-SQL processing,
like data mining and statistical analysis.
Nonrelational data stores, known as NoSQL databases, are likewise under pres-
sure concerning their security infrastructure.
The basic difference between NoSQL and SQL can be thought of as the difference between
nonrelational and relational data structures. Data stored in NoSQL has more
of a document layout, whereas SQL uses a table setup. This makes NoSQL
more flexible and easier to adapt to new data models than SQL. NoSQL
is open-source software, which means minimal associated cost, and it can
run on low-configuration hardware too. As a consequence, small software compa-
nies make use of it. Quick processing of big data is obtainable using NoSQL tools.
Elastic scalability is also available with this database. No rigid database schema is
imposed in NoSQL, so a huge amount of time is saved. All the above points are
just the opposite of SQL (shown in Figure 6.20). NoSQL database tools such
as MongoDB, CouchDB, CloudDB, and Bigtable were set up to handle the varied
difficulties presented by the analytics world, and consequently security was never
part of the model at any point of its conception (Figure 6.21). Most engineers
using NoSQL databases generally incorporate security in the middleware; no
support is offered by NoSQL databases for enforcing it explicitly in the data-
base itself. Such security practices present extra difficulties. In terms of hosting and process-
ing the colossal majority of data, organizations managing enormous unstructured
datasets may gain an advantage by moving from a cus-
tomary relational database to a NoSQL database. Consequently, the security of NoSQL
databases depends on external enforcing mechanisms.
For instance, well-formed defenses against NoSQL injection are still not established. To keep
security incidents down, the NoSQL
[Figure: smart grid circumstances. Lack of rigorous authentication and authorization mechanisms: enforcement through the middleware layer, and passwords should never be held in the clear; data from diverse appliances and sensors: encrypt data at rest; lack of secure communication between compute nodes: protect communication using SSL/TLS.]
community must press ahead with security provisions for the middleware,
adding them to its engine, and harden the NoSQL database itself to match relational
databases without compromising its functional qualities.
To ensure that the most sensitive private data is fully secure and open only to
authorized entities, data must be encrypted according to access-control ap-
proaches. To guarantee authentication, agreement, and fairness among the dis-
tributed entities, a cryptographically secured communication framework must
be implemented. Sensitive data is, for the most part, stored unencrypted in the cloud.
A lack of designed-in security measures, such as encryption, policy enablement,
compliance, and risk management, also creates security problems. If these
capabilities are needed, they must be built independently. Data-masking policies and
dataset aggregation may be used as security measures, yet tools for re-identifying indi-
viduals may well link datasets back together. Compromised
privacy can lead to heightened security threats,
particularly if the data involved is full of sensitive and commercial
information. The variety of data is another important security consideration:
data providers supply structured or unstructured data, and both are used by
high-, middle-, or low-level users. Big data is the newest of the technologies used
in today's computing world, and it is very clear that when a technology is not well
understood, it certainly becomes vulnerable.
The principal issue with encrypting data is the all-or-nothing retrieval of en-
crypted data, which prevents clients from efficiently performing fine-grained actions,
for example, sharing records or projects. Attribute-based encryption (ABE) alleviates
this issue by using a public-key cryptosystem in which attributes associated with the
encrypted data govern the decryption keys. The unencrypted, less sensitive data
useful for analytics, on the other hand, must be conveyed in a secure and agreed-upon way using a
cryptographically secure communication framework.
The use of big data can bring about invasions of privacy, intrusive advertising, dimin-
ished civil liberties, and, what is more, increased state and corporate control.
User data that appears quite innocuous can become consequential when it is
analyzed and used to predict something.
Anonymizing data for analysis is not adequate to maintain user privacy. For
instance, Netflix (an American global entertainment company that provides
streaming media and video on demand over the web) faced an issue when us-
ers in its anonymized dataset were identified by correlating their
Netflix movie scores with IMDB scores. Consequently, it is critical to establish
rules and recommendations for averting inadvertent privacy disclosures. User data gath-
ered by companies and government authorities is continually mined and analyzed by in-
side analysts and also by outside contractors. A malevolent
insider or unapproved partner can abuse these datasets and extract private in-
formation about users. Intelligence agencies, moreover, require the accumulation of im-
mense amounts of data. Robust and scalable privacy-preserving algorithms
will increase the likelihood of gathering pertinent data while upholding user privacy.
Solution
Data mining is a useful means of dealing with the vast amounts of big data, as important informa-
tion can be extracted and then used to project future developments. The analysis of this data
can help to solve problems and to shape strategies for predicting future trends (Figure 6.23).
One class of data mining is known as basket analysis. This analysis process reveals
consumer buying choices over combinations of different commodities purchased at
a time. The pattern-matching ability of data mining shows retailers customers'
spending capacity and surfaces the products bought together.
Mining delivers the analyzed data to the marketing section of the business, which,
based on it, advertises those different items to the customer.
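A minimal basket-analysis sketch counts how often pairs of items appear in the same transaction; the baskets below are invented sample data.

```python
from collections import Counter
from itertools import combinations

def pair_counts(baskets):
    """Count how often each unordered pair of items is bought together."""
    counts = Counter()
    for basket in baskets:
        # sorted(set(...)) deduplicates items and fixes pair orientation
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts

baskets = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk", "coffee"],
]
top_pair, freq = pair_counts(baskets).most_common(1)[0]
print(top_pair, freq)  # ('bread', 'butter') 2
```

Real basket analysis would go on to compute support, confidence, and lift for candidate rules, but the co-occurrence count above is the raw signal those measures are built from.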
Another branch of data mining is sales forecasting; the operating procedure for
this is very simple. It records which product a consumer bought and when,
and then determines when the purchaser stops buying the
same product again. For instance, coffee retailing is a lot higher in
winter than in summer, so retailers know to increase stock for the win-
ter months. Such forecasting helps the retailer budget funds on a monthly basis, since
sales and inventory budgeting plays a significant part in modern com-
merce.
Finally, data mining helps to predict customer needs in good time and to build
proper databases for a particular area with the quantity and quality of different prod-
ucts (Figure 6.24).
This enables retailers, dealers, distributors, and manufacturers to understand
consumer needs closely and to launch new products at the right time. Encryp-
tion is necessary for data at rest, so that no external attack can harm the data. Access
control and authorization mechanisms are likewise required. The sep-
aration of duty, clear principles, and a comprehensible policy for logging access to datasets
are the big factors in data mining. In view of re-identification issues, dif-
ferential privacy should also be preserved.
[Figure content: smart grid circumstances. Developing vulnerability at host, insider threat: encryption of data at rest, access control and authorization mechanisms; consumer data isolation, outsourcing analytics to untrusted partners: separation-of-duty principles, clear policy for logging access to datasets; unintended leakage through sharing of data: awareness of re-identification issues, differential privacy.]
Figure 6.24: Scalable and composable privacy-preserving data mining and analytics.
In real-time security settings, the system watches every attack
and notifies the administrator of the system appropriately. Theoretically, the sys-
tem is 100% protected, but in reality this cannot be guaranteed every time. Many
things must be settled: auditing of missed hits, for instance, is essential, and
periodic system audit information is required to analyze the system and enhance se-
curity. What information is to be checked, what to ignore, and what went
wrong, when, and in which associated processes must also be determined. Compli-
ance, regulations, and forensic investigation are significant. System auditing is
mandatory for distributed data processing in big data applica-
tions, since the data volume is so large that a standalone system cannot process it prop-
erly. Distributed systems involve hardware such as gateways, routers,
servers, and client computer systems of numerous types, so the auditing system must be put into
practice across these systems as well. The auditing system must cover the soft-
ware too, including the application software in use and also the system
software (OS).
Data-level security support can be understood through the
example given here. To analyze database activities, a standard report is produced by
the sample organization. The database administrators (DBAs) analyze and review ac-
tivity reports of their databases. Data accumulation and filtration are separate
jobs; DBAs have the further duty of sorting out data that is not important to them.
Figure 6.25 describes two managers, namely a DBA manager and an audit manager.
The DBA for the Oracle databases can see only the data that belongs to those
databases, and the same holds for the DB2 and MySQL DBAs, whereas
the DBA manager has permission to access all the data stored in the different
databases. An audit manager is associated with the auditing section for in-
house audits. The SOX and HIPAA auditors are two different auditors working under the same
audit manager. The SOX auditor is accountable for the monetary data, i.e., sales, sal-
aries, and orders, no matter where that data is stored. The HIPAA auditor is
used in healthcare systems; it audits data associated with patient information, no
matter where that data is accumulated (Figure 6.26).
[Figure: smart grid circumstances. Comprehensiveness of audit information; sensible access to audit information; audit of usage, pricing, billing; veracity of audit information: infrastructure solutions (already provided), scaling of SIEM tools.]
Solution
Granular access audits can, in the majority of cases, help detect an attack, and audit
trails disclose why an attack went undetected at the outset. The most signif-
icant aspects of auditing are the completeness of the required audit, well-
timed access to audit information, the reliability of the information, and authorized access to
the audit information.
A successful audit trail requires incorporating the appropriate procedures and technologies
into the big data infrastructure, including application logging, SIEM, forensic tools, and enabling
SYSLOG on routers.
Split big data and audit data to separate responsibilities. The audit data records
information concerning what has happened in the big data infrastructure; however, it must be
kept apart from the "regular" big data. A separate network segment or cloud should
be set up to host the audit system infrastructure.
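One way to protect the veracity of audit information kept in such a separate segment is a hash-chained log, sketched below: each entry embeds the digest of its predecessor, so any later edit or deletion of history is detectable. The event strings are invented for illustration.

```python
import hashlib
import json

def append(chain, event):
    """Append an event; its hash covers the event and the previous entry's hash."""
    prev = chain[-1]["hash"] if chain else "0" * 64
    body = json.dumps({"event": event, "prev": prev}, sort_keys=True)
    chain.append({"event": event, "prev": prev,
                  "hash": hashlib.sha256(body.encode()).hexdigest()})

def audit_verify(chain):
    """Walk the chain and recompute every link; any tampering breaks a link."""
    prev = "0" * 64
    for entry in chain:
        body = json.dumps({"event": entry["event"], "prev": prev}, sort_keys=True)
        if entry["prev"] != prev:
            return False
        if hashlib.sha256(body.encode()).hexdigest() != entry["hash"]:
            return False
        prev = entry["hash"]
    return True

log = []
append(log, "user=dba1 read table=salaries")
append(log, "user=ext7 denied table=salaries")
assert audit_verify(log)
log[0]["event"] = "user=dba1 read table=menus"  # tamper with history
assert not audit_verify(log)
```

Anchoring the latest hash somewhere outside the audit segment (or in a SIEM) also makes wholesale truncation of the log detectable, not just in-place edits.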
Big data systems play a significant role in moving data across networks of networks
and storage. They do an excellent job with regard to performance and versatility.
But alas, they were designed with almost no protection in mind. Existing RDBMS applications are se-
cure enough, with several protection features covering access control, users, ta-
bles, rows, and even the cell level. However, a range of fundamental challenges
prevents big data solutions from providing comprehensive access control. The most im-
portant and foremost difficulty with coarse-grained access mechanisms is that
data that could otherwise be shared is frequently swept into an increas-
ingly restrictive category to ensure security.
Granular access control is therefore essential if analytic frameworks are to adapt
to this continuously evolving, complex security environment. Keeping track of the
roles and authorities of users is one of the challenges, alongside maintaining access
labels across analytic transformations. Big data analytics and distributed
(cloud) computing are increasingly centered on handling diverse data-
set collections, in terms of both variety of schemas and requirements. Legal
and policy restrictions on data originate from different sources. Security ar-
rangements, sharing agreements, and corporate policy additionally impose
requirements on the data being handled. Dealing with this abundance of re-
strictions has so far resulted in increased costs for developing applications
and a walled-garden approach in which few individuals can take part
in the analysis.
Solution
Granular access control, as in any setting, is significant in a cloud environment for making cer-
tain that only the right people can use the right information and that the wrong
people cannot. Granular access control is vital to achieving this, and the Cloud Security Alliance
has identified three specific problems in the realm of cloud data access authority:
tracking the confidentiality requirements of individual elements, supervising the roles and
authorities of users, and properly implementing secrecy requirements with mandatory access control (MAC) (Figure 6.27).
The resolution involves addressing these problems: organizations may want to im-
plement access controls in the infrastructure layer, push complexity down
out of the application space, and adopt authentication and mandatory access control.
Nonetheless, scaling these technologies to the levels necessary to accommodate big
data presents its own set of unique challenges. Use the typical single sign-on
(SSO) method to condense the administrative work involved in supporting a huge
user base. SSO transfers the burden of user verification from administrators to a pub-
licly available scheme.
In conclusion, the right dimension of granularity must be picked for the data at the
row, column, and cell levels. At minimum, comply with element-level access
restrictions; progressively more complex data transformations are being considered in
active research. Authentication, authorization, and mandatory access con-
trol are mandatory.
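A cell-level check of this kind can be sketched with sensitivity labels on individual cells and clearance levels on users; the levels, table contents, and policy below are invented purely for illustration of the mandatory-access-control idea.

```python
# Ordered clearance/sensitivity levels (illustrative).
LEVELS = {"public": 0, "internal": 1, "restricted": 2}

def can_read(user_clearance, cell_label):
    """MAC-style rule: a user may read a cell only at or below their clearance."""
    return LEVELS[user_clearance] >= LEVELS[cell_label]

# Each cell carries its own label: (value, sensitivity).
table = {
    ("row1", "name"):   ("Alice", "public"),
    ("row1", "salary"): (90000, "restricted"),
}

def read(user_clearance, row, col):
    value, label = table[(row, col)]
    return value if can_read(user_clearance, label) else None

assert read("internal", "row1", "name") == "Alice"
assert read("internal", "row1", "salary") is None       # cell denied, row visible
assert read("restricted", "row1", "salary") == 90000
```

Because the label sits on the cell rather than the table, data that would otherwise be locked away wholesale can be shared column by column and cell by cell, which is exactly the coarse-grained problem described above.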
Information and process integrity are both needed in big data. Thus the term data
provenance comes into play: it integrates the two by recording the entities, frame-
works, and procedures that operate on and contribute to the data of interest. The lifetime of data
and its point of origin must be presented as immutable chronological evidence. Analyzing large
provenance graphs to detect metadata dependencies for security and confidentiality
applications is computationally intensive. The origin
of a data item, or the place (process) of its creation, must be known to an assort-
ment of key security applications. This source information is important for companies
based on financial trading: they need to know the origin, processing, and precision of data
for further research on market trends and future forecasting. Data provenance provides
these safety measures, and delivering them within the time frame requires
fast algorithms to handle the provenance metadata carrying this information.
A big data provenance architecture may be partitioned into a few parts: big data
access, the distributed big data platform, provenance, and applications employing
provenance. Each partition works with particular live data, from which data-
base engines compute the provenance data. The diagram de-
scribes a reference design; system developers must choose and decide
on the components in each subframework depending on the targeted provenance
use cases and capabilities.
The subsystem called big data access describes the most efficient method of obtaining
the various sorts of big data for provenance and distributed systems. System develop-
ers are required to find the best method or tool for accessing the particular datasets,
and further experimentation on the datasets is needed. Synchroni-
zation between the data and the computing system is highly desirable. Based on this syn-
chronization, tracking provenance through execution is an important aspect.
The distributed big data subframework manages development and execution sup-
port for big data applications, and storage support for big data provenance, over
a distributed big data platform (Figure 6.28). Before executing on the chosen big
data engine, application designers select a big data workflow frame-
work in which to construct their application. Some systems, such as Spark, can
act as both a workflow construction tool and a distributed data parallel (DDP)
execution engine. With DDP programming models, a big data application
can be constructed by wrapping legacy tools or by direct programming. The
provenance storage likewise requires selecting the appropriate (distributed) databases or
file systems (Table 6.1).
The provenance subframework settles which provenance information will
be reported and how to record it. Provenance can be separated into three di-
mensions: data, lineage, and environment. Reconstructing the exact state of the
experiment along these three dimensions is essential to repeating any data-
driven scientific analysis. Data provenance captures the state of the input, intermediate,
and output data at the time of the experiment; a suitable compression algo-
rithm is preferred for this provenance depending on the dataset. Lineage
provenance organizes the computational activity of the experiment, which is captured by stor-
ing the instructions that acted on these datasets. System prove-
nance assembles data concerning the exact state of the system configuration, which
covers both the hardware specifications and the system software details (OS,
libraries, third-party tools, and so on).
The large prospects of big data provenance lie in how we can make use of it.
Big data provenance can be used for provenance queries, experiment reproduction,
provenance mining, experiment monitoring, data quality management, fault-
tolerance analysis, and much else, as a special kind of big data.
[Figure: smart grid circumstances. Secure collection of data: authentication techniques; keeping track of ownership of data, consistency of data and metadata, pricing, audit: message digests; insider threats: access control through systems and cryptography.]
Solution
Cloud computing embraces big data applications within it; thus, provenance metadata grows in
volume. It tends to be huge, and its intricacy therefore grows very high. There are three most important
threats to protecting provenance metadata in big data applications: faulty infrastructure
components, external attacks on the infrastructure, and attacks from inside the infrastructure.
Secure data provenance is necessary to counter these threats. The system needs to improve trust,
and external attacks must be repelled. Usable secure provenance can be accomplished
by protecting the origin-tracing technique and tuning data access to some degree. The system
is cloud based, and the cloud infrastructure addresses the collection problem
and prevents outside attacks by pairing a fast, lightweight authentication technique with current
provenance tools. Insider attacks must also be prevented, for which a dynamic, scalable
access-control product is essential. To preserve the usability and connectivity of provenance
graphs, fine-grained access control technology provides data-access attribution in big data applications.
Design goals must be identified for big data provenance. A layer-based architecture is fol-
lowed, in which an access-control mechanism is needed to address security. It can hold both
structured and unstructured types of data, but it can handle simple queries only. For handling
complex queries, additional components must be added.
Finally, it can be concluded that authentication techniques, message digests,
access control through systems, and cryptography are the supporting pillars.
Cloud computing technology enables dynamic collection of colossal amounts of data from dif-
ferent ends or nodes. The data generated by different organizations in-
cludes unstructured as well as semistructured data, commonly
referred to as big data. This data needs security at both ends (internal
as well as external) and privacy for specific data. Cloud computing comes
with both positive and negative effects.
Fake data generation is another problem in big data analysis: cybercriminals who have breached
security fabricate data and store it in the same place as the original. The big data technology
used so far is implemented through open-source code, which an individual with the proper
knowledge can modify. An attack launched from any one of the user sites cannot be recognized,
checked, or penalized in big data, and at the same time the server side is also unaware of what
countermeasures should be maintained.
The most striking thing about this methodology is that the phony data does not have to match
the properties of the genuine data precisely. Such data-to-data matching is exceptionally
difficult to achieve with manufactured data, so it is good news that it is generally not required.
Instead, one only needs to match key performance indicators (KPIs) between the phony and the
genuine data when they are used as input to the process of interest. Note that these KPIs are
relative to the particular problem and models being worked with. Android malware [31] sometimes
causes problems in big data, and proper use of antivirus software helps to remove it.
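The KPI matching between phony and genuine data described above might be sketched as follows; the choice of KPIs (mean and spread) and the tolerance are illustrative assumptions:

```python
import statistics

def kpis(values):
    """Key performance indicators used for comparison: mean and spread."""
    return {"mean": statistics.mean(values), "stdev": statistics.pstdev(values)}

def kpis_match(real, synthetic, tolerance=0.05):
    """Accept synthetic data when each KPI is within `tolerance` (relative) of the real one."""
    r, s = kpis(real), kpis(synthetic)
    return all(abs(r[k] - s[k]) <= tolerance * abs(r[k]) for k in r)

real = [10.0, 12.0, 11.0, 13.0, 9.0]
synthetic = [9.1, 12.1, 10.9, 13.0, 9.9]  # individual values differ, KPIs are close
print(kpis_match(real, synthetic))  # True
```

The point is exactly the one made above: the individual values need not coincide at all, as long as the indicators that feed the downstream process agree.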
The volume of the big data market increases exponentially, with enterprise-level data growing
the fastest. It is expected to double every two years, from 2,500 exabytes in 2012 to 40,000
exabytes in 2020. Figure 6.30 suggests that the big data market today (2018) is about
$24.3 billion and will be $49.1 billion in 2020 [32]. This growth will not only increase
complexity but also worsen the problems of access and storage, so security measures must be
improved to keep pace.
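The doubling claim above can be checked with a few lines of arithmetic (an illustrative sketch, not part of the cited study):

```python
def projected_exabytes(start_eb, start_year, year, doubling_period=2):
    """Project data volume assuming it doubles every `doubling_period` years."""
    return start_eb * 2 ** ((year - start_year) / doubling_period)

# Four doublings between 2012 and 2020: 2,500 EB * 2**4 = 40,000 EB.
print(projected_exabytes(2500, 2012, 2020))  # 40000.0
```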
Till now the discussion has been all about hardware and the software associated with it, but
another part of big data security has been left out: the human factor. The data, software, and
hardware are all maintained by human beings, who sometimes undermine the whole system. Devices
left logged in but not in use yield access to other workers, and there is the threat of data
being accessed over unrestricted Wi-Fi. An automatic user log table recording actual working
seconds should be kept internally for individual systems, and proper training on the usage of
data is also useful.
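The automatic user log table with idle-session handling suggested above might be sketched as follows; the five-minute idle limit and the record fields are illustrative assumptions:

```python
IDLE_LIMIT_SECONDS = 300  # assumed policy: force logout after 5 minutes of inactivity

class SessionLog:
    """Minimal per-user activity log with automatic idle logout."""

    def __init__(self):
        self.records = []  # tuples of (user, login_time, logout_time, active_seconds)
        self.active = {}   # user -> (login_time, last_activity_time)

    def login(self, user, now):
        self.active[user] = (now, now)

    def touch(self, user, now):
        """Record user activity, refreshing the idle timer."""
        login, _ = self.active[user]
        self.active[user] = (login, now)

    def sweep(self, now):
        """Force-logout sessions idle longer than IDLE_LIMIT_SECONDS."""
        for user, (login, last) in list(self.active.items()):
            if now - last > IDLE_LIMIT_SECONDS:
                self.records.append((user, login, now, last - login))
                del self.active[user]

log = SessionLog()
log.login("alice", 0)
log.touch("alice", 120)  # alice works for two minutes
log.sweep(500)           # 380 s idle exceeds the limit: alice is logged out
print(log.records)       # [('alice', 0, 500, 120)]
```

A periodic sweep like this closes exactly the gap described above: a device left logged in but unused stops granting access to other workers after the idle limit expires, and the record of actual working seconds feeds the internal log table.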
Another way out of the big data security problem is to use software tools that can analyze and
monitor in real time, so that the system learns of events instantly. Besides actual hazards, the
alarm system may flag artificial data that merely resembles a threat, called a false alarm; it
must therefore be able to differentiate false alarms from real threats. The tools themselves
produce a huge amount of network data, which is again problematic to maintain. Moreover, not
every organization can afford the cost of such tools, or the hardware upgrades needed to run
24 × 7 checking. Big data analytics itself can address this problem by enhancing network
protection services: the log tables maintained to record changes made on the network can also be
used to detect anomalous network connections.
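As a minimal sketch of how log tables can expose anomalous network connections, one could flag hosts whose connection counts deviate strongly from the norm; the z-score threshold and the sample data below are illustrative assumptions:

```python
import statistics

def anomalous_hosts(conn_counts, z_threshold=3.0):
    """Flag hosts whose connection count deviates strongly from the mean.

    conn_counts: dict mapping host -> number of connections seen in one log window.
    """
    counts = list(conn_counts.values())
    mean = statistics.mean(counts)
    stdev = statistics.pstdev(counts)
    if stdev == 0:
        return []  # no variation, nothing stands out
    return [host for host, n in conn_counts.items()
            if abs(n - mean) / stdev > z_threshold]

window = {"10.0.0.1": 12, "10.0.0.2": 15, "10.0.0.3": 11,
          "10.0.0.4": 14, "10.0.0.5": 420}  # one host floods the network
print(anomalous_hosts(window, z_threshold=1.5))  # ['10.0.0.5']
```

A real deployment would tune the threshold against historical traffic to keep the false-alarm rate, discussed above, acceptably low.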
For numerous large organizations that work with big data, the leading concern is the security of
their cloud-based systems, which are a prime target for malicious attacks.
138 Santanu Koley
Figure 6.30: Big data market size in billion U.S. dollars: $17.1 (2017), $24.3 (2018*),
$34.6 (2019*), $49.1 (2020*), $69.8 (2021*), $99.3 (2022*); starred years are projections [32].
For specific big data problems, a large supporting ecosystem must exist. The discussion here
serves to illuminate particular characteristics of the vulnerable regions in the overall big data
processing infrastructure that need to be investigated for specific threats.
The difficulty with big data is that the unstructured character of the information makes it
intricate to classify, model, and map the data as soon as it is captured and stored. The dilemma
is made worse by the fact that the data generally come from external sources, often making it
hard to verify their accuracy.
Challenges such as protecting data storage, data mining and analytics, transaction logs, and
secure communication do exist. The study of an assortment of safety measures addresses both big
data security and the whole stack in large systems. Big data infrastructure security mechanisms
can be further extended and adapted to the organizational environment.
One of the modern approaches to securing big data is the use of blockchain, which could transform
the way we approach big data. Improved data quality and data safekeeping are two immediate
benefits to individuals and businesses, since a blockchain can hold any information that can be
digitized. The blockchain's greatest advantage is its decentralization: no single party owns the
data entries or their integrity, which is established continuously by every computer on the
network. In this sense, blockchain and big data are a natural match. The real question now is who
will be the first to deliver the most appropriate and best-trained artificial
intelligence/machine learning models working on top of distributed, transparent, and immutable
blockchain-generated data layers. The business that does this will attract investment and
generate enormous returns.
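The integrity property described above, where every computer can independently re-establish the validity of the data, can be illustrated with a minimal SHA-256 hash chain; the block fields are illustrative assumptions, not a production blockchain:

```python
import hashlib
import json

def block_hash(block):
    """Deterministic SHA-256 digest of a block's contents."""
    payload = json.dumps(block, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

def append_block(chain, data):
    """Append a new block linked to the previous block's hash."""
    prev = block_hash(chain[-1]) if chain else "0" * 64
    chain.append({"index": len(chain), "prev_hash": prev, "data": data})

def chain_is_valid(chain):
    """Every block must reference the hash of its predecessor."""
    return all(chain[i]["prev_hash"] == block_hash(chain[i - 1])
               for i in range(1, len(chain)))

chain = []
append_block(chain, "sensor reading 42")
append_block(chain, "sensor reading 17")
print(chain_is_valid(chain))   # True
chain[0]["data"] = "tampered"  # any modification breaks the chain
print(chain_is_valid(chain))   # False
```

Because each block commits to the hash of its predecessor, silently altering any stored record invalidates every later link, which is exactly the decentralized integrity guarantee the paragraph above relies on.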
6.4 Conclusion
Big data analysis is becoming an essential means for the automatic discovery of intelligence
hidden in frequently recurring patterns and concealed regularities. It can help companies make
better decisions, anticipate and recognize change, and identify new opportunities. Different
procedures for big data analysis, such as numerical analysis, batch processing, machine learning,
data mining, intelligent investigation, cloud computing, quantum computing, and data stream
processing, become possibly the most important factors. There is a gigantic opportunity for the
big data industry, along with plenty of scope for research and enhancement.
The big data security solutions we have discussed so far may solve the current problems to some
extent. But it should be said that these are not the only solutions; they can be further extended
and adapted.
Nomenclature
IoT Internet of things
SaaS Software as a service
PaaS Platform as a service
IaaS Infrastructure as a service
NIST National Institute of Standards and Technology
HDFS Hadoop distributed file system
RDBMS Relational database management system
MCC Mobile cloud computing
QoS Quality of service
SQL Structured query language
IBM International Business Machines
SIEM Security information and event management
DPF Distributed programming frameworks
MAC Mandatory access control
SSD Solid-state drive
HDD Hard disk drive
ABE Attribute-based encryption
DBA Database administrators
SSO Single sign-on
DDP Distributed data parallel
RAMP Reduce and map provenance
References
[1] A Cloud Security Alliance Collaborative research – Big Data working group. “Expanded Top
Ten Big Data Security and Privacy challenges”, 2013.
[2] Rountree, Derrick., & Castrillo, Ileana. (2014), “The Basics of Cloud Computing”, https://doi.
org/10.1016/B978-0-12-405932-0.00001-3, ScienceDirect Elsevier Inc., pp 1-17
Emma., Magnusonk, Jon Karl., Yaver, Debbie., Berka, Randy., Lail, Kathleen., Chen, Cindy.,
LaButti, Kurt., Nolan, Matt., Lipzen, Anna., Aerts, Andrea., Riley, Robert., Barry, Kerrie.,
Henrissat, Bernard., Blanchette, Robert., Grigoriev, Igor V., & Cullen, Dan. “Draft genome
sequence of a monokaryotic model brown-rot fungus Postia (Rhodonia) placenta SB12”.
Genomics Data-ELSVIER, 2017, 14, 21–23.
[20] Demchenko, Y., Ngo, C., Laat, C. de., Membrey, P., & Gordijenko, D. (2014). “Big Security for
Big Data: Addressing Security Challenges for the Big Data Infrastructure” In W. Jonker
& M. Petković (Eds.), Secure Data Management, Springer International Publishing. Retrieved
from http://link.springer.com/chapter/10.1007/978-3-319-06811-4_13, pp. 76–94.
[21] Xueheng, Hu., Meng, Lei., & Aaron, D. Striegel.(2014) “Evaluating the Raw Potential for
Device-to-Device Caching via Co-location” https://doi.org/10.1016/j.procs.2014.07.042
ELSEVIER Procedia Computer Science, Vol 34, pp. 376–383.
[22] Wang, Jianwu., Crawl, Daniel., Purawat, Shweta., Nguyen, Mai., & Altintas, Ilkay. (2015) “Big
Data Provenance: Challenges, State of the Art and Opportunities” INSPEC Accession Number:
15679551, DOI: 10.1109/BigData.2015.7364047
[23] Terzi, Duygu Sinanc., Demirezen, Umut., & Sagiroglu, Seref. “Evaluations of big data
processing.” Services Transactions on Big Data (ISSN 2326-442X), 2016, 3, 1, 44–53.
[24] Geist, A. et al. (1996) “MPI-2: Extending the message-passing interface” In: L. Bougé,
P. Fraigniaud, A. Mignotte, & Y. Robert (eds) Euro-Par’96 Parallel Processing. Euro-Par 1996.
Lecture Notes in Computer Science, Vol. 1123, DOI https://doi.org/10.1007/3-540-61626-
8_16, Online ISBN 978-3-540-70633-5. Springer, Berlin, Heidelberg.
[25] Kailai, Xu. (2017) http://stanford.edu/~kailaix/files/MPI.pdf, Last Access: May (2019).
[26] Alfredo Cuzzocrea, C. “Privacy and Security of Big Data: Current Challenges and Future
Research Perspectives”. ACM Digital Library, 2014, ISBN: 978-1-4503-1583-8 doi>10.1145/
2663715.2669614, pp. 45–47.
[27] Boyd, Dana., & Crawford, Kate. (2011) “Six Provocations for Big Data ” Social Science
Research Network: A Decade in Internet Time: Symposium on the Dynamics of the Internet
and Society. doi:10.2139/ssrn.1926431.
[28] Cuzzocrea, A., Song, I., & Davis, K.C. (2011) “Analytics over Large-Scale Multidimensional
Data: The Big Data Revolution!” In: Proceedings of the ACM International Workshop on Data
Warehousing and OLAP, pp. 101–104.
[29] Adhikari, M., Koley, S., & Arab, J. Sci Eng https://doi.org/10.1007/s13369-017-2739-0 (2017),
“Cloud Computing: A Multi-workflow Scheduling Algorithm with Dynamic Reusability”.
Arabian Journal for Science and Engineering (Springer), Journal, 2017, 13369, Article: 2739.
[30] Koley, S., Nandy, S., Dutta, P., Dhar, S., & Sur, T. (2016), “Big Data architecture with mobile
cloud in CDroid operating system for storing huge data,” 2016 International Conference on
Computing, Analytics and Security Trends (CAST), Pune, India, doi: 10.1109/
CAST.2016.7914932, pp. 12–17.
[31] Bacci, Alessandro., Bartoli, Alberto., Martinelli, Fabio., Medvet, Eric., Mercaldo, Francesco., &
Visaggio, Corrado Aaron. “Impact of Code Obfuscation on Android Malware Detection based
on Static and Dynamic Analysis”, 4th International Conference on Information Systems
Security and Privacy, Funchal, Portugal, 2018.
[32] Transparent data encryption or always encrypted?, Source: https://azure.microsoft.com/en-
us/blog/transparent-data-encryption-or-always-encrypted/, Last Access: March (2019).
Shibakali Gupta, Indradip Banerjee, Siddhartha Bhattacharyya
7 Conclusions
Security of data is a major concern in this workaday world of information technology and
communication systems. Unintended use and encroachment of data often lead to a breach in the
security of the underlying data. As a result, the integrity of the data gets compromised, which
leads to unsolicited system outcomes. With the advent of digital media technologies, an explosion
in the volume of data has taken place. Digital data content includes audio, video, and image
media, which can be easily stored and manipulated. The effortless transmission and manipulation
of digital content constitute a genuine threat to multimedia content producers and traders. Thus,
the issue of data security has become pressing.
Several approaches are in vogue for thwarting unwanted attacks on the integrity of the data under
consideration. The basic approach rests on enforcing a privacy policy for the intended users.
Different authentication mechanisms help to establish the identity of the end users. Access
control regulations also contribute to the mechanism of data handling to a great extent.
Big data refers to datasets that are enormous in size compared to normal databases. Big data
generally consists of unstructured, semistructured, or structured datasets. Several algorithms as
well as tools exist for processing these data within a reasonable, finite amount of time. The
most prominent type of big data, which has attracted much attention, is the unstructured form of
data [1].
Big data is mainly characterized by the 4Vs (volume, velocity, variety, and veracity) [2–5].
Volume is a key characteristic of big data, which decides whether the information is a normal
dataset or not. Velocity means the throughput, that is, the speed of data processing; it
indicates how fast information can be generated in real time to meet the requirements. Variety is
important in this literature because it stands for the quality and the type of data required in
order to process it successfully. Veracity refers to the trustworthiness and accuracy of the data
being processed.
This book is targeted at discussing the fundamental concepts of big data forms and the security
concerns associated therein [7]. The first contributory chapter explains the common business
models/platforms that use blockchain as a platform for developing their processes based on
digital identity. Each and every big data source or big database needs security metric
monitoring; the monitoring software collects various metrics with the help of custom code,
plugins, and so on. The next chapter describes the approach of moving from normal metric
thresholding to anomaly detection. The third contributory chapter deals with the social
engineering aspect of big data hacking along with other hacking methodologies
that can be used for big data and how to secure the systems from the same. The
fifth chapter describes the information hiding and data consumption techniques in the
big data domain. The next chapter discusses some of the big data security issues
with reference to some solution mechanisms.
Given the varied content of the book, it would surely serve as a good treatise on the subject and
would help the readers grasp the inherent ideas of big data manifestation and the security
mechanisms involved therein.
References
[1] Snijders, C., Matzat, U., & Reips, U.-D. “‘Big Data’: Big gaps of knowledge in the field of
Internet”. International Journal of Internet Science, 2012, 7(1).
[2] Hilbert, Martin. “Big Data for Development: A Review of Promises and Challenges”.
Development Policy Review. martinhilbert.net. Retrieved 7 October 2015.
[3] DT&SC 7-3: What is Big Data?. YouTube. 12 August 2015.
[4] Cheddad, Abbas., Condell, Joan., Curran, Kevin., & Kevitt, Paul Mc. Digital image
steganography: Survey and analysis of current methods Signal Processing, 2010, 90,
727–752.
[5] Capurro, Rafael., & Hjørland, Birger. (2003). The concept of information. Annual review of
information science and technology (s. 343–411). Medford, N.J.: Information Today. A version
retrieved November 6, 2011.
[6] Liu, Shuiyin., Hong, Yi., & Viterbo, E. “Unshared Secret Key Cryptography”. IEEE Transactions
on Wireless Communications, 2014, 13(12), 6670–6683.