

Securing Big Data Hadoop: A Review of Security Issues, Threats and Solutions

Priya P. Sharma
Information Technology Department, SGGS IE&T, Nanded, India

Chandrakant P. Navdeti
Information Technology Department, SGGS IE&T, Nanded, India

Abstract— Hadoop projects treat security as a top agenda item, which in turn is classified as a critical item. From financial applications that are deemed sensitive to healthcare initiatives, Hadoop is traversing new territories which demand security-sensitive environments. With the growing acceptance of Hadoop, there is an increasing trend to incorporate more and more enterprise security features. In due course of time, we have seen Hadoop gradually evolve to address important issues pertaining to what we summarize as 3ADE (authentication, authorization, auditing, and encryption) within a cluster. There is no dearth of production environments that support Hadoop clusters. In this paper, we aim at studying "Big Data" security at the environmental level, probing the built-in protections and the Achilles heel of these systems, assessing a few issues that we are dealing with today in securing contemporary Big Data, and proposing security solutions and commercially available techniques to address the same.

Keywords—Big Data, SASL, delegation, sniffing, cell level, variety, unauthorized
I. INTRODUCTION
So, what exactly is "Big Data"? Put in simple words, it is described as mammoth volumes of data which might be both structured and unstructured. Generally, it is so gigantic that it poses a challenge to process using conventional database and software techniques. As witnessed in enterprise scenarios, three observations can be inferred:
1. The data is stupendous in terms of volume.
2. It moves at a very fast pace.
3. It outpaces the prevailing processing capacity.
The volumes of Big Data are on a roll, which can be inferred from the fact that as far back as the year 2012 a single dataset held a few dozen terabytes of data, a figure which has interestingly been catapulted to many petabytes today. To cater to the demands of the industry, new manifestos for manipulating "Big Data" are being commissioned.
Quick fact: 5 exabytes (1 exabyte ≈ 1.1529×10^18 bytes) of data were created by humans up to 2003; today this amount of information is created in two days [8, 16]. In 2012, the digital world of data expanded to 2.72 zettabytes (10^21 bytes). It is predicted to double every two years, reaching about 8 zettabytes of data by 2015 [8, 16]. With an increase in the data, there is a corresponding increase in the applications and frameworks that administer it. This gives rise to new vulnerabilities that need to be responded to.
Big Data is not only about the size of data; it also includes data variety and data velocity. Together, these three attributes form the three V's of Big Data.

Fig. 1 Three V's of Big Data [17]

Each of the V's represented in Fig. 1 is described below:
Volume, or the size of data, is at present larger than terabytes and petabytes. Because the data comes from machines, networks and human interaction on systems like social media, the volume of data to be analysed is very huge [8].
Velocity is the speed of data processing, which is required not only for Big Data but for all processes, and involves real-time processing and batch processing.
Variety refers to the different types of data from different or many sources, both structured and unstructured. In the past, data was stored from sources like spreadsheets and databases; now data comes in the form of emails, pictures, audio, videos, monitoring devices, PDFs, etc. This multifariousness of unstructured data creates problems for storing, mining and analysing the data [8]. Hadoop is used to process large volumes of data from different sources quickly.
Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment. Hadoop allows running applications on systems with thousands of nodes holding thousands of terabytes of data [2]. Its distributed file system supports fast data transfer rates among nodes and allows the system to continue operating uninterrupted in the event of node failure.
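To make the "one big file system" idea concrete, the following is a minimal sketch (our illustration, not from the paper) of a client writing and then reading a file through the HDFS Java API. The NameNode URI and the file path are assumptions for the example.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import java.net.URI;

    public class HdfsHello {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // One handle onto the cluster-wide file system, obtained via the NameNode.
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);
            Path file = new Path("/user/demo/hello.txt");
            // Write: the client streams data; HDFS splits it into replicated blocks.
            try (FSDataOutputStream out = fs.create(file)) {
                out.writeUTF("stored across DataNodes, read back as one file");
            }
            // Read: equally location-transparent; block placement is invisible here.
            try (FSDataInputStream in = fs.open(file)) {
                System.out.println(in.readUTF());
            }
            fs.close();
        }
    }

The client never addresses individual DataNodes directly; the NameNode resolves which nodes hold the blocks behind the path.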


Hadoop consists of a distributed file system, data storage and analytics platforms, and a layer that handles parallel computation, workflow and configuration administration [8]. HDFS runs across the nodes in a Hadoop cluster and connects the file systems on many input and output data nodes to make them into one big file system [2]. The present Hadoop ecosystem (as shown in Fig. 2) consists of the Hadoop kernel, MapReduce, the Hadoop distributed file system (HDFS) and a number of related components such as Apache Hive, HBase, Oozie, Pig and Zookeeper; these components are explained below [7, 8]:
• HDFS: A highly fault-tolerant distributed file system that is responsible for storing data on the clusters.
• MapReduce: A powerful parallel programming technique for distributed processing of vast amounts of data on clusters.
• HBase: A column-oriented distributed NoSQL database for random read/write access.
• Pig: A high-level data programming language for analyzing data of Hadoop computation.
• Hive: A data warehousing application that provides SQL-like access and a relational model.
• Sqoop: A project for transferring/importing data between relational databases and Hadoop.
• Oozie: An orchestration and workflow management tool for dependent Hadoop jobs.
• Zookeeper: A centralized coordination service for distributed applications.

Fig. 2 Hadoop Architecture

The paper is organised as follows: In section II we describe Big Data Hadoop's traditional security and also discuss its weaknesses and security threats; we describe various security issues in Section III; in Section IV we present our analysis of the security solution for each of the Hadoop components in tabular format; section V is an analysis of security technologies used to secure Hadoop. Finally, we conclude in section VI.

II. BIG DATA HADOOP'S TRADITIONAL SECURITY

A. Hadoop Security Overview
Originally Hadoop was developed without security in mind: there was no security model, no authentication of users and services, and no data privacy, so anybody could submit arbitrary code to be executed. Although auditing and authorization controls (HDFS file permissions and ACLs) were used in earlier distributions, such access control was easily evaded because any user could impersonate any other user. Because impersonation was frequent and practised by most users, the security control measures that did subsist were not very effective. Later, authorization and authentication were added, but they too had weaknesses. Because there were very few security control measures within the Hadoop ecosystem, many accidents and security incidents happened in such environments. Well-meaning users can make mistakes (e.g. deleting massive amounts of data within seconds with a distributed delete). All users and programmers had the same level of access privileges to all the data in the cluster, any job could access any of the data in the cluster, and any user could read any data set [4]. Because MapReduce had no concept of authentication or authorization, a mischievous user could lower the priorities of other Hadoop jobs in order to make his own job complete faster or be executed first – or worse, he could kill the other jobs.
Hadoop is an entire ecosystem of applications that involves Hive, HBase, Zookeeper, Oozie, and the JobTracker, and not just a single technology. Each of these applications requires hardening. To add security capabilities to a big data environment, security functions need to scale with the data; supplementary security does not scale well, and simply cannot keep up [6].
The Hadoop community supports some security features through the current Kerberos implementation, the use of firewalls, and basic HDFS permissions and ACLs [5]. Kerberos is not a compulsory requirement for a Hadoop cluster, making it possible to run entire clusters without deploying or implementing any security. Kerberos is also not very easy to install and configure on the cluster, or to integrate with Active Directory (AD) and Lightweight Directory Access Protocol (LDAP) services [6]. This makes security problematic to implement, and thus limits the adoption of even the most basic security functions for users of Hadoop. Hadoop security is not properly addressed by firewalls: once a firewall is breached, the cluster is wide open for attack. Firewalls offer no protection for data at rest or in motion within the cluster. Firewalls also offer no protection from security failures which originate from within the firewall perimeter [6]. An attacker who can enter the data centre, either physically or electronically, can steal the data they want, since the data is unencrypted and there is no authentication enforced for access [6, 10].

B. Security Threats
We have identified three categories of security violation: unauthorized release of information, unauthorized modification of information, and denial of resources.


The following are the related areas of threat we identify in Hadoop [7]:
• An unauthorized user may access an HDFS file via the RPC or HTTP protocols and could execute arbitrary code or carry out further attacks.
• An unauthorized client may read/write a data block of a file at a DataNode via the pipeline streaming data-transfer protocol.
• An unauthorized client may gain access privileges and may submit a job to a queue, or delete or change the priority of a job.
• An unauthorized user may access intermediate data of a Map job via its TaskTracker's HTTP shuffle protocol.
• A task in execution may use the host OS interfaces to access other tasks, or access local data which includes intermediate Map output or the local storage of the DataNode that runs on the same physical node.
• An unauthorized user may eavesdrop on/sniff data packets being sent by DataNodes to a client.
• A task or node may masquerade as a Hadoop service component such as a DataNode, NameNode, JobTracker, TaskTracker, etc.
• A user may submit a workflow to Oozie as another user.
• DataNodes imposed no access control, so an unauthorized user could read arbitrary data blocks from DataNodes, bypassing access control mechanisms/restrictions, or write garbage data to a DataNode [10].

III. SECURITY ISSUES
Hadoop presents a unique set of security issues for data centre managers and security professionals. The core security issues are described below [5, 6]:
1) Fragmented Data: Big Data clusters contain fluid data, allowing multiple copies to move to and fro among various nodes to ensure redundancy and resiliency. The data can be fragmented and shared across multiple servers. As a result, the fragmentation adds complexity and poses a security issue due to the absence of a security model.
2) Distributed Computing: Since the availability of resources leads to processing of data at any instant or instance where it is available, this progresses to large levels of parallel computation. As a result, complicated environments are created that are at higher risk of attack than centrally managed, monolithic repositories, whose security is easier to enforce.
3) Controlling Data Access: Commissioned data environments provision access at the schema level, devoid of finer granularity in addressing proposed users in terms of roles and access-related scenarios. Many of the available database security schemas provide role-based access.
4) Node-to-node communication: A concern with Hadoop and the variety of players available in this field is that they don't implement secure communication; they bring into use RPC (Remote Procedure Call) over TCP/IP.
5) Client Interaction: A client communicates with the resource manager and data nodes. However, there is a catch: even though efficient communication is facilitated by this model, it makes it cumbersome to shield nodes from clients and vice versa, and also name servers from nodes. Clients that have been compromised tend to propagate malicious data or links to either service.
6) Virtually no security: Big data stacks were designed with little or no security in mind. Prevailing big data installations are built on the web services model, with few or no facilities for preventing common web threats, making them highly susceptible.

IV. HADOOP SECURITY SOLUTION
Hadoop is a distributed system which allows us to store huge amounts of data and process the data in parallel. Hadoop is used as a multi-tenant service and stores sensitive data such as personally identifiable information or financial data. Organizations, including financial organizations, using Hadoop are beginning to store sensitive data on Hadoop clusters. As a result, strong authentication and authorization are necessary [7].
The Hadoop ecosystem consists of various components, and we need to secure all of them. In this section, we will look at each ecosystem component's security and the security solution for each of these components; each component has its own security challenges and issues, and needs to be configured properly, based on its architecture, to be secured. Each of these Hadoop components has end users directly accessing the component, or a backend service accessing the Hadoop core components (HDFS and MapReduce).
We have done a security analysis of the Hadoop components and a brief study of the built-in security of the Hadoop ecosystem, and we see that Hadoop security is not very strong; so in this paper we provide a security solution around the four security pillars, i.e. authentication, authorization, encryption and audits (which we summarize as 3ADE), for each of the ecosystem components. This section describes the four pillars (sufficient and necessary) to help secure the Hadoop cluster; we will narrow our focus and take a deep dive into the built-in and our proposed security solutions for the Hadoop ecosystem.

A. Authentication
Authentication is verifying the identity of a user or system accessing the system. Hadoop provides Kerberos as the primary authentication mechanism. Initially SASL/GSSAPI was used to implement Kerberos and mutually authenticate users, their applications, and Hadoop services over RPC connections [7]. Hadoop also supports "pluggable" authentication for HTTP web consoles, meaning that implementers of web applications and web consoles can implement their own authentication mechanism for HTTP connections. This includes, but is not limited to, HTTP SPNEGO authentication.
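As an illustration of the Kerberos flow just described, here is a minimal client-side sketch (our addition, not from the paper); the principal name and keytab path are hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.security.UserGroupInformation;

    public class KerberosLogin {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Switch Hadoop from "simple" (trust-the-client) mode to Kerberos.
            conf.set("hadoop.security.authentication", "kerberos");
            UserGroupInformation.setConfiguration(conf);
            // Authenticate against the KDC with a keytab instead of a password.
            UserGroupInformation.loginUserFromKeytab(
                    "etluser@EXAMPLE.COM", "/etc/security/keytabs/etluser.keytab");
            System.out.println("Logged in as: "
                    + UserGroupInformation.getLoginUser().getUserName());
        }
    }

Once the login succeeds, subsequent RPC calls from this process are mutually authenticated via SASL/GSSAPI as described above.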


The Hadoop components support the SASL framework, i.e. the RPC layer can be changed to support SASL-based mutual authentication, viz. SASL DIGEST-MD5 authentication or SASL GSSAPI/Kerberos authentication.
MapReduce supports Kerberos authentication, SASL DIGEST-MD5 authentication, and also delegation token authentication on RPC connections. In HDFS, communication between the NameNode and the DataNodes is over RPC connections, and mutual Kerberos authentication is performed between them [15]. HBase supports SASL Kerberos secure client authentication via RPC and HTTP. Hive supports Kerberos and LDAP authentication for user authentication, as well as authentication via Apache Knox, explained in section V.
Pig uses the user's credentials to submit the job to Hadoop, so no additional Kerberos authentication is required; but before starting Pig the user should authenticate with the KDC and get a valid Kerberos ticket [15]. Oozie provides user authentication to the Oozie web services. It also provides Kerberos HTTP Simple and Protected GSSAPI Negotiation Mechanism (SPNEGO) authentication for web clients. The SPNEGO protocol is used when a client application wants to authenticate to a remote server but is not sure which authentication protocols to use. Zookeeper supports SASL Kerberos authentication on RPC connections. Hue offers SPNEGO authentication and LDAP authentication, and it now also supports SAML SSO authentication [15].
There are a number of data flows involved in Hadoop authentication: the Kerberos RPC authentication mechanism is used for authenticating users, applications and Hadoop services; HTTP SPNEGO authentication is used for web consoles; and delegation tokens are used as well [10]. A delegation token is a two-party authentication protocol used between the user and the NameNode for authenticating users; it is simpler and more efficient than the three-party protocol used by Kerberos [7, 15]. Oozie, HDFS and MapReduce support delegation tokens.

B. Authorization and ACLs
Authorization is the process of specifying access control privileges for a user or system. In Hadoop, access control is implemented using file-based permissions that follow the UNIX permissions model. Access control to files in HDFS is enforced by the NameNode based on file permissions and the ACLs of users and groups. MapReduce provides ACLs for job queues that define which users or groups can submit jobs to a queue and change queue properties. Hadoop offers fine-grained authorization using file permissions in HDFS, resource-level access control using ACLs for MapReduce, and coarser-grained access control at a service level [13]. HBase offers user authorization on tables and column families; the user authorization is implemented using coprocessors, which are like database triggers in HBase [15]. They intercept any request to the table before and after it is processed, and Project Rhino (section V) can now be used to extend HBase support for cell-level ACLs. In Hive, authorization is implemented using Apache Sentry (section V). Pig provides authorization using ACLs for job queues; Zookeeper also offers authorization using node ACLs. Hue provides access control via file system permissions, and it also offers ACLs for job queues.
Although Hadoop can be set up to perform access control via user and group permissions and Access Control Lists (ACLs), this may not be sufficient for every organization. Nowadays many organizations use flexible and dynamic access control policies based on XACML and Attribute-Based Access Control (ABAC) [10, 13]. Hadoop can now be configured to support RBAC and ABAC access control using third-party frameworks or tools, some of which are discussed in this section and in section V. Some of Hadoop's components, like HDFS, can offer ABAC using Apache Knox, and Hive can support role-based access control using Apache Sentry. Zettaset Orchestrator, a product by Zettaset, provides role-based access control support and enables Kerberos to be seamlessly integrated into the Hadoop ecosystem [6, 15].
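As a small illustration of the UNIX-style permission model that the NameNode enforces, the following sketch (our addition) sets owner, group and other permissions on an HDFS path via the public FileSystem API; the path, owner and group names are hypothetical, and setOwner requires superuser privileges.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.permission.FsAction;
    import org.apache.hadoop.fs.permission.FsPermission;

    public class HdfsPermissions {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path reports = new Path("/data/finance/reports");
            // rwx for the owner, r-x for the group, nothing for others (750).
            fs.setPermission(reports,
                    new FsPermission(FsAction.ALL, FsAction.READ_EXECUTE, FsAction.NONE));
            // Ownership determines which of the permission bits apply to whom.
            fs.setOwner(reports, "finance", "analysts");
            fs.close();
        }
    }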

TABLE I: ANALYSIS OF SECURITY SOLUTION

Map-Reduce: authentication via MD5-Digest, GSSAPI (Kerberos) and delegation tokens; authorization via job and queue ACLs (resource level); no encryption of data at rest; encryption of data in transit via RPC (SASL) and HTTPS; audit trails: yes (base audit).
HDFS: authentication via the SASL framework and delegation tokens; authorization via POSIX permissions and ABAC; encryption of data at rest via AES and OS-level encryption; encryption of data in transit via RPC (SASL) and the data transfer protocol; audit trails: yes (base audit).
HBase: authentication via SASL Kerberos (secure client authentication); authorization via HBase ACLs on tables, columns and families; encryption of data at rest via a third-party solution; encryption of data in transit via SASL (secure RPC); audit trails: no (but a third-party solution can be used).
Hive: authentication via Apache Knox, LDAP and Kerberos; authorization via Apache Sentry; encryption of data at rest via a third-party solution; encryption of data in transit via SASL; audit trails: yes (Hive metastore).
Pig: authentication via user-level permissions; authorization via ACLs and Apache Sentry; encryption of data at rest via a third-party solution; encryption of data in transit via a third-party solution; audit trails: third-party solution.
Oozie: authentication via delegation tokens and Kerberos; authorization via ACLs and file system permissions; encryption of data at rest via a third-party solution; encryption of data in transit via SSL/TLS; audit trails: yes (services).
Zookeeper: authentication via SASL Kerberos at the RPC layer; authorization via node ACLs; encryption of data at rest via a third-party solution; encryption of data in transit via a third-party solution; audit trails: third-party solution.
Hue: authentication via Kerberos (pluggable); authorization via ACLs and file system permissions; encryption of data at rest: N/A; encryption of data in transit via HTTPS; audit trails: yes (Hue logs).
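To relate the "encryption of data in transit" entries of Table I to concrete settings, the sketch below (our illustration) shows the two standard Hadoop configuration keys involved; in practice they are set cluster-wide in core-site.xml and hdfs-site.xml rather than in client code.

    import org.apache.hadoop.conf.Configuration;

    public class WireEncryptionConfig {
        public static Configuration secureTransport() {
            Configuration conf = new Configuration();
            // SASL quality of protection: authentication < integrity < privacy;
            // "privacy" adds encryption of RPC traffic.
            conf.set("hadoop.rpc.protection", "privacy");
            // Encrypt block data streamed over the HDFS data-transfer protocol.
            conf.setBoolean("dfs.encrypt.data.transfer", true);
            return conf;
        }
    }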


C. Encryption
Encryption ensures the confidentiality and privacy of user information, and it secures the sensitive data in Hadoop. Hadoop is a distributed system running on distinct machines, which means that data must be transmitted over the network on a regular basis, and there is an increasing demand to move sensitive information into the Hadoop ecosystem to generate valuable insights. Sensitive data within the cluster needs a special kind of protection and should be secured both at rest and in motion [10]. This data needs to be protected during transfer to and from the Hadoop system. The Simple Authentication and Security Layer (SASL) authentication framework is used for encrypting data in motion in the Hadoop ecosystem. SASL security guarantees that the data being exchanged between clients and servers is not readable by a "man in the middle". SASL supports various authentication mechanisms, for example DIGEST-MD5, CRAM-MD5, etc. Data at rest can be protected in two ways. First, when a file is stored in Hadoop, the complete file can be encrypted first and then stored in Hadoop (a sketch of this approach appears at the end of section V); here the data blocks in each DataNode can't be decrypted until all the blocks are put back together to recreate the entire encrypted file. Second, encryption can be applied to data blocks once they are loaded into the Hadoop system [15].
Hadoop supports encryption capabilities for various channels like RPC, HTTP, and the Data Transfer Protocol for data in motion. The Hadoop Crypto Codec framework and crypto codec implementations have been introduced to support data-at-rest encryption. HDFS supports AES and OS-level encryption for data at rest. Zookeeper, Oozie, Hive, HBase, Pig and Hue don't offer a data-at-rest encryption solution, but for these components encryption can be implemented via custom encryption techniques or third-party tools like Gazzang's zNcrypt using the crypto codec framework. File-system-level security, and on-the-fly encryption and decryption of files, can be provided using eCryptfs and Gazzang's zNcrypt, commercial security solutions available for Hadoop clusters [10, 13, 15].
To protect data in transit and at rest, encryption and masking techniques can be implemented. Tools such as IBM Optim and Dataguise provide data masking for enterprise data [15]. Intel's distribution offers encryption and compression of files [15]. Project Rhino enables block-level encryption similar to Dataguise and Gazzang [5].

D. Audit Trails
A Hadoop cluster hosts sensitive information, and securing this information is of the utmost importance for organizations to have a successful, secure big data journey. There is always a possibility of security breaches through unintended, unauthorized access, or of inappropriate access by privileged users [13]. So, to meet security compliance requirements, we need to audit the entire Hadoop ecosystem on a periodic basis and deploy or implement a system that does log monitoring.
HDFS and MapReduce provide base audit support. The Apache Hive metastore maintains audit (who/when) information for Hive interactions [13, 15]. Apache Oozie, the workflow engine, provides an audit trail for services; workflow submissions are recorded in the Oozie log files. Hue also supports audit logs. For those Hadoop components which do not provide built-in audit logging, we can use audit log monitoring tools. Scribe and LogStash are open source tools that integrate into most big data environments, as a number of commercial products do. So one just needs to find a compatible tool, install it, integrate it with other systems like log management, and then actually review the results to see what could have gone wrong. Cloudera Navigator by Cloudera is a popular commercial tool that provides audit logging for big data environments. Zettaset Orchestrator provides centralized configuration management, logging, and auditing support [6, 15].

V. SECURITY TECHNOLOGIES: SOLUTIONS FOR SECURING HADOOP
In this section we give an overview of the various commercial and open source technologies that are available to address the various security aspects of big data Hadoop [15].

A. Apache Sentry
Apache Sentry, an open source project by Cloudera, is an authorization module for Hadoop that offers the granular, role-based authorization required to provide precise levels of access to the right users and applications. It supports role-based authorization, fine-grained authorization, and multi-tenant administration [11, 15].

B. Apache Knox
The Apache Knox Gateway is a system that provides a single point of authentication and access for various Hadoop services in a cluster; it provides a perimeter security solution for Hadoop. A second advantage is that it supports various authentication and token verification scenarios. It manages security across multiple clusters and versions of Hadoop. It also provides SSO solutions, and allows integrating other identity management solutions such as LDAP, Active Directory (AD), SAML-based SSO and other SSO systems [9].

C. Project Rhino
Project Rhino provides an integrated end-to-end data security solution for the Hadoop ecosystem. It provides a token-based authentication and SSO solution. It offers the Hadoop crypto codec framework and a crypto codec implementation to provide block-level encryption for the data stored in Hadoop. It supports key distribution and management so that MapReduce can decrypt data blocks and execute the program as required. It also enhances the security of HBase by offering cell-level authorization and transparent encryption for tables stored in Hadoop. It supports an audit logging framework for easy audit trails [15].
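Returning to the first data-at-rest approach described in section IV-C (encrypt the complete file, then store it), the following is a minimal sketch (our illustration, not any of the products above): the client AES-encrypts a local file while streaming it into HDFS, so the DataNodes only ever hold ciphertext. The paths are hypothetical, and key management, cipher mode and IV handling are deliberately simplified; a real deployment would use a key manager and an authenticated mode.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import javax.crypto.Cipher;
    import javax.crypto.CipherOutputStream;
    import javax.crypto.KeyGenerator;
    import javax.crypto.SecretKey;
    import java.io.FileInputStream;

    public class EncryptThenStore {
        public static void main(String[] args) throws Exception {
            // Throwaway key for the sketch; real systems fetch keys from a KMS.
            SecretKey key = KeyGenerator.getInstance("AES").generateKey();
            Cipher cipher = Cipher.getInstance("AES");
            cipher.init(Cipher.ENCRYPT_MODE, key);

            FileSystem fs = FileSystem.get(new Configuration());
            try (FileInputStream in = new FileInputStream("/tmp/sensitive.csv");
                 CipherOutputStream out = new CipherOutputStream(
                         fs.create(new Path("/secure/sensitive.csv.aes")), cipher)) {
                in.transferTo(out); // DataNodes only ever see encrypted blocks
            }
            fs.close();
        }
    }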


VI. CONCLUSION
In the Big Data era, where data is accumulated from various sources, security is a major concern (a critical requirement), as there is no fixed source of data. With Hadoop gaining wider acceptance within the industry, a natural concern over security has spread, and a growing need to accept and assimilate security solutions and commercial security features has surfaced. In this paper we have tried to cover the security solutions available to secure the Hadoop ecosystem.

REFERENCES
[1] Cloud Security Alliance, "Top Ten Big Data Security and Privacy Challenges".
[2] Tom White, "Hadoop: The Definitive Guide", O'Reilly | Yahoo! Press.
[3] Owen O'Malley, Kan Zhang, Sanjay Radia, Ram Marti, and Christopher Harrell, "Hadoop Security Design".
[4] Mike Ferguson, "Enterprise Information Protection: The Impact of Big Data".
[5] Vormetric, "Securing Big Data: Security Recommendations for Hadoop and NoSQL Environments", October 12, 2012.
[6] Zettaset, "The Big Data Security Gap: Protecting the Hadoop Cluster".
[7] Devaraj Das, Owen O'Malley, Sanjay Radia, and Kan Zhang, "Adding Security to Apache Hadoop".
[8] Seref Sagiroglu and Duygu Sinanc, "Big Data: A Review", Collaboration Technologies and Systems (CTS), 2013 International Conference, May 2013.
[9] Hortonworks, "Technical Preview for Apache Knox Gateway".
[10] Kevin T. Smith, "Big Data Security: The Evolution of Hadoop's Security Model".
[11] M. Tim Jones, "Hadoop Security and Sentry".
[12] Victor L. Voydock and Stephen T. Kent, "Security Mechanisms in High-Level Network Protocols", ACM Comput. Surv., 1983.
[13] Vinay Shukla, "Hadoop Security: Today and Tomorrow".
[14] Mahadev Satyanarayanan, "Integrating Security in a Large Distributed System", ACM Trans. Comput. Syst., 1989.
[15] Sudheesh Narayanan, "Securing Hadoop: Implement Robust End-to-End Security for Your Hadoop Ecosystem", Packt Publishing.
[16] S. Singh and N. Singh, "Big Data Analytics", 2012 International Conference on Communication, Information & Computing Technology, Mumbai, India, IEEE, October 2011.
[17] jeffhurtblog.com, "Three Vs of Big Data as Applied to Conferences", July 7, 2012.
