Big Data - Infrastructure Considerations: Author Anand Veeramani / Deepak Shivamurthy
This document is an exclusive property of Happiest Minds Technologies Pvt. Ltd. It is intended
for limited circulation.
This paper takes a closer look at the Big Data concept with the Hadoop framework as an example. We
look at the architecture and methods of implementing a Hadoop cluster, how it relates to server and
network infrastructure and the typical storage requirements for a Big Data cluster. We also look at
Information Security in the context of Big Data at a high level. The content presented here is largely
based on academic work, experiments conducted within Happiest Minds Technologies labs and
experiences derived from implementations for our customers.
Introduction
The volume of data generated globally is growing at a phenomenal scale and pace. The variety of
data generated further adds to its complexity. Data is continuously being generated by sensors
and by humans, and volumes will grow exponentially over time. Cellular Networks and Social
Networking applications are some of the major contributors to data generation. Big Data has opened
up a completely new avenue for organizations to leverage these growing information assets to better
understand and compete in the market.
Big Data can be defined as any data repository characterized by high volume, high velocity and a wide variety of data.
The current focus of the development community globally is driving the creation of best practices
and learnings for Big Data adoption. Within Happiest Minds, we have experienced the pressing need
for a Big Data specific architecture framework. Based on our understanding of the Big Data
architecture design process and its limitations, this paper recommends a robust approach to address
these limitations. The process is adaptive and can be extended to incorporate best practices as they evolve.
Going forward, we shall use Hadoop as an example of a Big Data product. The most important
components of Hadoop are the Hadoop Distributed File System (HDFS), which provides storage, and MapReduce, which provides parallel processing of large data sets.
NameNode:
The NameNode is the master of HDFS; it directs the slave DataNodes to perform low-level input/output tasks.
DataNode:
Each slave machine, referred to as a DataNode, reads and writes HDFS blocks to and from actual files on its local file system.
Secondary NameNode:
Secondary NameNode (SNN) assists in monitoring the state of the HDFS cluster.
JobTracker:
JobTracker is the master overseeing the overall execution of a MapReduce job. It acts as a liaison
between Hadoop and the application.
TaskTracker:
TaskTracker manages the execution of individual tasks on each slave node.
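To make these roles concrete, the sketch below shows the canonical word-count job written against the Hadoop Java MapReduce API. The map and reduce functions are the units of work scheduled onto the slave nodes, while HDFS supplies the input splits and stores the output. It is a minimal illustration of the programming model, not a production job; the input and output paths are placeholders supplied on the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in the input split read from HDFS
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the counts emitted for each word
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // combiner reduces map output shipped over the network
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

A job like this is typically packaged as a JAR and submitted with the hadoop jar command, with the HDFS input and output directories passed as the two command-line arguments.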
Presently, the global Big Data adoption trend focuses primarily on application and not infrastructure.
Hence, there is scope for improvement in taking a holistic view of the application and infrastructure
requirements, while designing a cluster. It is imperative to understand that the underlying data
processing algorithm (MapReduce or equivalent) will produce efficient output only if the data storage
cluster is designed well.
The overall Big Data adoption process, as it stands today, is depicted in the diagram below.
1. Cluster Design: Application requirements are analyzed in terms of workload, volume and other associated parameters, based on which the cluster is designed. Cluster design is not an iterative process. The initial setup is verified and validated with a sample application and sample data before being rolled out. Although Big Data cluster design allows flexibility in fine-tuning the configuration parameters, the large number of parameters and their cross-impacts introduce additional complexity (a small configuration sketch follows this list).
2. Hardware Architecture: The key success factor for Hadoop clusters is the usage of high
quality commodity equipment. Most Hadoop users are cost conscious and as clusters grow,
their cost can be significant. In the present scenario, the NameNode requires higher RAM but only moderate HDD capacity. If the JobTracker runs on a physically separate server, that server needs higher RAM and CPU speed. DataNodes are standard low-end server class machines.
4. Storage Architecture: Most enterprises have huge investments in NAS and SAN devices.
When implementing Big Data, they attempt to re-use this existing storage infrastructure even
though DAS is the recommended storage for Big Data clusters. Parameters such as the type of disk and shared-nothing versus shared-something architectures are often not taken into account.
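As a small illustration of the fine-tuning referred to in step 1, the sketch below sets a few of the standard Hadoop configuration parameters through the Java Configuration API. The property names are the standard Hadoop 2.x ones; the values are placeholders chosen for illustration, not recommendations, and in practice such settings usually live in the cluster's XML configuration files.

```java
import org.apache.hadoop.conf.Configuration;

public class ClusterTuningSketch {
  public static void main(String[] args) {
    // Start from the cluster defaults (core-site.xml, hdfs-site.xml, mapred-site.xml on the classpath)
    Configuration conf = new Configuration();

    // A handful of the many interacting parameters; the values below are illustrative only.
    conf.set("dfs.replication", "3");              // copies of each HDFS block
    conf.set("dfs.blocksize", "134217728");        // HDFS block size in bytes (128 MB here)
    conf.set("mapreduce.job.reduces", "8");        // number of reduce tasks for a job
    conf.set("mapreduce.task.io.sort.mb", "256");  // map-side sort buffer; interacts with task heap size

    System.out.println("dfs.blocksize = " + conf.get("dfs.blocksize"));
  }
}
```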
In the recommended approach, application requirements should drive the overall cluster design activity, including cluster
sizing, hardware architecture, network architecture, storage architecture and information
security architecture.
Hardware architecture should be based on application requirements and cluster sizing.
Network architecture should also be derived from application requirements and cluster
sizing. This can be worked out in parallel to hardware architecture design.
Storage architecture should depend upon cluster sizing, hardware architecture and network
architecture. Application requirements should help fine tune the storage architecture.
Information security architecture should depend upon hardware architecture, network
architecture and storage architecture. Application requirements should validate the security
architecture.
1. Application Requirements and Cluster Sizing: There are a number of important cluster configuration parameters to be derived from the application requirements, and the cluster size itself can be estimated from them, as sketched below.
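As a simple illustration of storage-driven cluster sizing, the back-of-envelope calculation below estimates the number of DataNodes from the expected data volume, the HDFS replication factor and the raw disk capacity per node. All figures, including the overhead factor for intermediate MapReduce output, are assumptions for illustration; a real sizing exercise would also account for compression, data growth and compute requirements.

```java
public class ClusterSizingSketch {
  public static void main(String[] args) {
    // Illustrative inputs derived from application requirements; all figures are assumptions.
    double rawDataTb         = 100;  // data to be stored over the planning horizon (TB)
    double replicationFactor = 3;    // HDFS replication (dfs.replication)
    double overheadFactor    = 1.3;  // headroom for intermediate / temporary MapReduce output
    double diskPerNodeTb     = 24;   // raw DAS capacity per DataNode (e.g. 12 x 2 TB drives)

    double requiredCapacityTb = rawDataTb * replicationFactor * overheadFactor;
    int dataNodes = (int) Math.ceil(requiredCapacityTb / diskPerNodeTb);

    System.out.printf("Required raw capacity: %.0f TB -> approximately %d DataNodes%n",
        requiredCapacityTb, dataNodes);
  }
}
```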
2. Cluster Design and Hardware Architecture: Hadoop is built to handle component failure well and
to scale out on low-cost hardware, thereby eliminating the need for RAID cards, redundant power supplies and other per-component reliability features. Error-correcting RAM and SATA drives with good MTBF numbers should be used, as this helps ensure reliable computation. Hard drives are the largest source of
failures. Therefore, care must be taken in making the right choice of hard drives with focus on
reliability and efficiency. Hardware architecture for each cluster component should be decided in line
with the application requirements and cluster design.
3. Network Architecture: The underlying network architecture impacts the efficiency of the
MapReduce algorithm. In fact, each of the networking components affects the performance of a Hadoop cluster.
4. Storage Architecture: While considering storage for Big Data, the disk size is more important than
the seek time. The architectural parameters to be considered for storage are as follows:
Scalable: Storage should be scalable in terms of size, throughput and speed of access.
Provides tiered storage: It is critical for the storage system to be able to manage the “tiering” of data across the range of media types: flash, fast disk, slower disk and tape (see the sketch after this list).
Makes content widely accessible: Storage should distribute data geographically so that it is closer to
users.
Supports workflow automation: Big Data must be delivered to users in the context of a workflow. For
this reason, Big Data storage architecture must support easy integration of workflow.
Supports integration with legacy systems, and analytical and content applications: A well designed
Big Data storage system should be heterogeneous and flexible. It should offer interfaces that allow
direct access to the Big Data storage functionality.
Supports integration with cloud ecosystems: An ideal Big Data storage system must be built from the ground up to be cloud enabled.
Self-managing: The Big Data storage system should have the built-in ability to handle failures; it must
accommodate component failures and repair itself without intervention.
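As one Hadoop-specific illustration of the tiered storage property in the list above, the sketch below assigns HDFS storage policies to directories. It assumes a Hadoop release with heterogeneous storage support (2.6 or later) and DataNode volumes already tagged with the appropriate storage types; the directory names are hypothetical placeholders and the assignments only show the idea, not a complete tiering design.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class TieredStorageSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // HDFS heterogeneous storage: DataNode volumes are tagged with a storage type
    // (e.g. SSD, DISK, ARCHIVE) and directories are mapped to policies over those types.
    if (fs instanceof DistributedFileSystem) {
      DistributedFileSystem dfs = (DistributedFileSystem) fs;
      dfs.setStoragePolicy(new Path("/data/hot"),  "ALL_SSD"); // latency-sensitive working set (hypothetical path)
      dfs.setStoragePolicy(new Path("/data/warm"), "HOT");     // default: all replicas on DISK (hypothetical path)
      dfs.setStoragePolicy(new Path("/data/cold"), "COLD");    // rarely read data on ARCHIVE volumes (hypothetical path)
    }
  }
}
```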
5. Information Security Architecture: To handle information security of Big Data, the architectural
and operational security aspects need to be looked into. By its very nature, Big Data security has inherent challenges to tackle, and many of these challenges still need appropriately devised mechanisms to address them. The top ten challenges of handling security within a Big Data environment are illustrated below.
As many clusters are being deployed within virtual and cloud environments, they can leverage vendor
supplied management tools to address operational security issues. While these measures cannot
provide fail-proof security, a reasonable amount of effort can make it considerably more difficult to
subvert systems or steal information.
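As a concrete illustration of the architectural side of such hardening, the sketch below shows a few of the standard Hadoop security properties that these measures typically involve. It assumes a Kerberos infrastructure is available to the cluster; the snippet is an indicative excerpt, not a complete security configuration, and in practice these settings belong in core-site.xml and hdfs-site.xml.

```java
import org.apache.hadoop.conf.Configuration;

public class SecurityHardeningSketch {
  public static void main(String[] args) {
    // Illustrative security-related settings; they require matching Kerberos and
    // key infrastructure on the cluster to take effect.
    Configuration conf = new Configuration();

    conf.set("hadoop.security.authentication", "kerberos"); // strong authentication instead of "simple"
    conf.set("hadoop.security.authorization", "true");      // service-level authorization checks
    conf.set("hadoop.rpc.protection", "privacy");           // integrity plus encryption for RPC traffic
    conf.set("dfs.encrypt.data.transfer", "true");          // encrypt block data moving between nodes

    System.out.println("authentication = " + conf.get("hadoop.security.authentication"));
  }
}
```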
For many Big Data scenarios, information comes from outside the company, e.g., social media,
demographic data, web data, events, feeds, etc. The elasticity of the cloud makes it ideal for Big Data
analytics -- the practice of rapidly processing large volumes of unstructured data to identify patterns
and improve business strategies.
Making the storage perform at a level that enables the kind of data analysis Big Data requires is important. Also, if the cloud service provider cannot provision infrastructure in the same physical location, the resulting network latency is detrimental to the overall performance of MapReduce. These factors are the biggest obstacles to the use of cloud for Big Data processing.
Conclusion:
Looking at architecture design for Big Data only from the application perspective gives an isolated
view. Similarly, taking the infrastructure perspective alone into consideration may negate the
advantages of Big Data. A comprehensive strategy covering application and infrastructure aspects
while architecting Big Data is most desirable. The commonly followed process for Big Data
implementation today still has scope for improvement in this regard.
Big Data implementation is a multi-skilled discipline. It has major dependency on the underlying
infrastructure and the deployment architecture. Once a decision to use Big Data has been made by
an organization, the network, storage and security architecture associated with the deployment
architecture must be worked out in an iterative manner before it is finalized. The rule of thumb is to exploit the benefit of low total cost of ownership (TCO) by using commodity hardware while still achieving near real-time application performance.