
VISVESVARAYA TECHNOLOGICAL UNIVERSITY

SANTHI BASTWAD ROAD, MACHHE, BELGAVI-590014


KARNATAKA

SEMINAR REPORT ON
“BIG DATA”
Submitted in partial fulfillment of the requirements for the award of the degree of
BACHELOR OF ENGINEERING IN
ELECTRONICS AND COMMUNICATION ENGINEERING

Under the guidance of


Mr. FAZLULLA KHAN BE, M.Tech
Assistant Professor
Department of ECE

Submitted by:

SRIDEVI B.N (1HM16EC024)

Department of Electronics and Communication Engineering


H.M.S. Institute of Technology
Manchakalukuppe, Tumakuru-572104
2020-2021
H.M.S. INSTITUTE OF TECHNOLOGY
NH-4, Kesaramadu Post, Kyathsandra, Tumkur-04

DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING

CERTIFICATE
This is to certify that the Seminar Report entitled “BIG DATA” is a
bonafide work carried out by SRIDEVI B N in partial fulfillment for the
award of the degree of Bachelor of Engineering in ELECTRONICS AND
COMMUNICATION ENGINEERING under VISVESVARAYA
TECHNOLOGICAL UNIVERSITY, BELAGAVI during the year 2020-2021.
It is certified that all corrections/suggestions indicated for internal
assessment have been incorporated in the report deposited in the
departmental library. The Seminar Report has been approved as it satisfies
the academic requirements in respect of the Seminar prescribed for the Bachelor
of Engineering degree.

Guide Head of the Department Principal


Fazlulla Khan B.E., M.Tech Dr. C P Latha B.E., M.Tech., Ph.D Dr. Irfan G B.E., M.Tech., Ph.D
Assistant Professor , Professor & Head of the Dept, Principal HMSIT,
Dept. of ECE Dept. of ECE Tumkur
ACKNOWLEDGEMENT

It is a pleasure to acknowledge all those who have provided help,
inspiration and encouragement as I proceeded with this Seminar.

I wish to express my sincere thanks to our Principal, Dr. IRFAN G, for
providing the facilities to carry out this Seminar.

I extend my deep gratitude to Dr. C P LATHA, Head of the Department
of Electronics and Communication, for being an inspiration and support for me
throughout the completion of this Seminar.

I am grateful to my Seminar guide Mr. FAZLULLA KHAN, Asst.
Professor, Dept. of Electronics and Communication Engineering, for his
willingness to teach and the moral support given at various stages. Without his guidance
and the invaluable time he spent with me on this Seminar, the report would not have been
completed successfully.

Last but not least, I would like to express my sincere thanks to all the
teaching and non-teaching staff members of the ECE department for their valuable
guidance and support.

SRIDEVI B N (1HM16EC024)
CONTENTS

CHAPTERS TITLE

CHAPTER 1 INTRODUCTION

CHAPTER 2 LITERATURE SURVEY

CHAPTER 3 THE SOURCES OF BIG DATA

CHAPTER 4 CHARACTERISTICS OF BIG DATA

CHAPTER 5 STORING AND PROCESSING OF BIG DATA

CHAPTER 6 IMPORTANCE OF BIG DATA

CHAPTER 7 BIG DATA ANALYTICS TOOLS

CHAPTER 8 BIG DATA SECURITY TECHNOLOGIES

CHAPTER 9 BENEFITS, RISKS AND APPLICATIONS OF BIG DATA

CHAPTER 10 FUTURE OF BIG DATA

CHAPTER 11 CONCLUSION

REFERENCES
LIST OF FIGURES

Fig no. Title

3.1 FIVE V’S OF BIG DATA

4.1 TECHNICAL VIEW OF BIG DATA

4.2 MASTER–SLAVE ARCHITECTURE OF HADOOP

4.3 HADOOP ARCHITECTURE

4.4 STRUCTURE OF HADOOP DISTRIBUTED FILE SYSTEM

4.5 MAPREDUCE LAYER

ABSTRACT

The amount of data in the world is growing day by day. Data is growing because of the use of
the internet, smartphones and social networks. Big data is a collection of data sets which are very
large in size as well as complex. The size of such data is generally measured in petabytes and exabytes.
Traditional database systems are not able to capture, store and analyze this large amount of
data. As the internet grows, the amount of big data continues to grow.

Data sets grow in size in part because they are increasingly being gathered by cheap
and numerous information-sensing mobile devices, aerial sensors (remote sensing), software logs,
cameras, microphones, radio-frequency identification (RFID) readers, and wireless sensor
networks. The world's technological per-capita capacity to store information has roughly
doubled every 40 months since the 1980s.

Big data analytics provides new ways for businesses and governments to analyze
unstructured data. Nowadays, big data is one of the most talked-about topics in the IT industry, and it is
going to play an important role in the future. Big data changes the way that data is managed and
used. Some of its applications are in areas such as healthcare, traffic management, banking,
retail and education. Organizations are becoming more flexible and more open. New
types of data will bring new challenges as well.

Keywords: Big data, petabyte, exabyte, zettabytes.



CHAPTER 1

INTRODUCTION

Big data is a collective term referring to data that is so large and complex that it exceeds the
processing capability of conventional data management systems and software techniques. However, with
big data come big values. Data becomes big data when individual data points stop mattering and only a large
collection of them, or analyses derived from them, are of value. With the many big data analysis technologies available,
insights can be derived to enable better decision making for critical development areas such as health
care, economic productivity, energy, and natural disaster prediction.

Big data is a term used to describe massive amounts of data which are either structured,
semi-structured or unstructured. If data cannot be handled by traditional databases and
software technologies, we categorize it as big data. The term big data originated with the
web companies that had to handle loosely structured or unstructured data. Every day, we create 2.5
quintillion bytes of data; so much that 90% of the data in the world today has been created in the last two
years alone. Lots of data is being collected and warehoused:

– Web data, e-commerce

– Bank/credit card transactions

– Social networks

Society is becoming increasingly instrumented and, as a result, organisations are producing
and storing vast amounts of data.


CHAPTER 2

LITERATURE SURVEY

Big Data has been described by some Data Management pundits (with a bit of a snicker) as “huge,
overwhelming, and uncontrollable amounts of information.” In 1663, John Graunt dealt with
“overwhelming amounts of information” as well, while he studied the bubonic plague, which was
then ravaging Europe. Graunt used statistics and is credited with being the first person to use
statistical data analysis. In the early 1800s, the field of statistics expanded to include collecting and
analyzing data.

The evolution of Big Data includes a number of preliminary steps for its foundation, and while
looking back to 1663 isn’t necessary for the growth of data volumes today, the point remains that “Big
Data” is a relative term depending on who is discussing it. Big Data to Amazon or Google is very
different than Big Data to a medium-sized insurance organization, but no less “Big” in the minds of those
contending with it.

Such foundational steps to the modern conception of Big Data involve the development of
computers, smart phones, the internet, and sensory (Internet of Things) equipment to provide data. Credit
cards also played a role, by providing increasingly large amounts of data, and certainly social media
changed the nature of data volumes in novel and still developing ways. The evolution of modern
technology is interwoven with the evolution of Big Data.

THE FOUNDATIONS OF BIG DATA:

Data became a problem for the U.S. Census Bureau in 1880. They estimated it would take eight
years to handle and process the data collected during the 1880 census, and predicted the data from the
1890 census would take more than 10 years to process. Fortunately, in 1881, a young man working for
the bureau, named Herman Hollerith, created the Hollerith Tabulating Machine. His invention was based
on the punch cards designed for controlling the patterns woven by mechanical looms. His tabulating
machine reduced ten years of labor to three months.

 In 1927, Fritz Pfleumer, an Austrian-German engineer, developed a means of storing information
magnetically on tape. Pfleumer had devised a method for adhering metal stripes to cigarette papers (to
keep a smoker’s lips from being stained by the rolling papers available at the time), and decided he
could use this technique to create a magnetic strip, which could then be used to replace wire recording
technology. After experiments with a variety of materials, he settled on a very thin paper, striped with
iron oxide powder and coated with lacquer, for his patent in 1928.


During World War II (more specifically 1943), the British, desperate to crack Nazi codes,
invented a machine that scanned for patterns in messages intercepted from the Germans. The machine
was called Colossus, and it scanned 5,000 characters a second, reducing the workload from weeks to
merely hours. Colossus was the first data processor. Two years later, in 1945, John Von Neumann
published a paper on the Electronic Discrete Variable Automatic Computer (EDVAC), the first
“documented” discussion on program storage, and laid the foundation of computer architecture today.

It is said these combined events prompted the “formal” creation of the United States’ NSA
(National Security Agency), by President Truman, in 1952. Staff at the NSA were assigned the task of
decrypting messages intercepted during the Cold War. Computers of this time had evolved to the point
where they could collect and process data, operating independently and automatically.

 THE INTERNET EFFECT AND PERSONAL COMPUTERS:

ARPANET began on Oct 29, 1969, when a message was sent from UCLA’s host computer to
Stanford’s host computer. It received funding from the Advanced Research Projects Agency (ARPA), a
subdivision of the Department of Defense. Generally speaking, the public was not aware of ARPANET.
In 1973, it connected with a transatlantic satellite, linking it to the Norwegian Seismic Array. However,
by 1989, the infrastructure of ARPANET had started to age. The system wasn’t as efficient or as fast as
newer networks. Organizations using ARPANET started moving to other networks, such as NSFNET, to
improve basic efficiency and speed. In 1990, the ARPANET project was shut down, due to a combination
of age and obsolescence. The creation of ARPANET led directly to the Internet.

 In 1965, the U.S. government built the first data center, with the intention of storing millions of
fingerprint sets and tax returns. Each record was transferred to magnetic tape and was to be
stored in a central location. Conspiracy theorists expressed their fears, and the project was closed.
However, in spite of its closure, this initiative is generally considered the first effort at large-scale data
storage.

 Personal computers came on the market in 1977, when microcomputers were introduced, and became
a major stepping stone in the evolution of the internet, and subsequently, Big Data. A personal
computer could be used by a single individual, as opposed to mainframe computers, which required
an operating staff, or some kind of time-sharing system, with one large processor being shared by
multiple individuals. After the introduction of the microprocessor, prices for personal computers
lowered significantly, and became described as “an affordable consumer good.” Many of the early
personal computers were sold as electronic kits, designed to be built by hobbyists and technicians.
Eventually, personal computers would provide people worldwide with access to the internet.


 In 1989, a British computer scientist named Tim Berners-Lee came up with the concept of the World
Wide Web. The Web is an information space where web resources are identified by URLs,
interlinked by hypertext links, and accessible via the Internet. His system also allowed for the
transfer of audio, video, and pictures. His goal was to share information on the Internet using a
hypertext system.

By the fall of 1990, Tim Berners-Lee, working for CERN, had written the three basic technologies
that are the foundation of today’s web:

 HTML: HyperText Markup Language. The formatting language of the web.
 URL: Uniform Resource Locator. A unique “address” used to identify each resource on the web. It is
also called a URI (Uniform Resource Identifier).
 HTTP: Hypertext Transfer Protocol. Used for retrieving linked resources from all across the web.
 In 1993, CERN announced the World Wide Web would be free for everyone to develop and use. The
free part was a key factor in the effect the Web would have on the people of the world. (It’s the
companies providing the “internet connection” that charge us a fee.)

 THE INTERNET OF THINGS (IOT):

The concept of Internet of Things was assigned its official name in 1999. By 2013, the IoT had
evolved to include multiple technologies, using the Internet, wireless communications, micro-
electromechanical systems (MEMS), and embedded systems. All of these transmit data about the person
using them. Automation (including buildings and homes), GPS, and others, support the IoT.

The Internet of Things, unfortunately, can make computer systems vulnerable to hacking. In
October of 2016, hackers crippled major portions of the Internet using the IoT. The early response has
been to develop Machine Learning and Artificial Intelligence focused on security issues.

 COMPUTING POWER AND INTERNET GROWTH:

There was an incredible amount of internet growth in the 1990s, and personal computers became
steadily more powerful and more flexible. Internet growth was based on Tim Berners-Lee’s efforts,
CERN’s free access, and access to individual personal computers.


 The term Big Data appeared for the first time in 1998 in a Silicon Graphics (SGI) slide deck by John
Mashey titled “Big Data … and the Next Wave of InfraStress”. The first book mentioning Big
Data was a data mining book, also published in 1998, by Weiss and Indurkhya.

The first academic paper with the words Big Data in the title appeared in the year 2000, in a
paper by Diebold. The era of Big Data has brought with it a plethora of opportunities for the advancement
of science, the improvement of health care, the promotion of economic growth, the enhancement of education
systems and more ways of social interaction and entertainment.

 In 2005, Big Data, which had been used without a name, was labeled by Roger Mougalas. He was
referring to a large set of data that, at the time, was almost impossible to manage and process using
the traditional business intelligence tools available. Additionally, Hadoop, which could handle Big
Data, was created in 2005.

 Hadoop was based on an open-source software framework called Nutch and was merged with
Google’s MapReduce. Hadoop is an open-source software framework that can process structured
and unstructured data from almost all digital sources. Because of this flexibility, Hadoop (and its
sibling frameworks) can process Big Data.


CHAPTER 3

THE SOURCES OF BIG DATA

The bulk of big data generated comes from three primary sources: social data, machine data and
transactional data. In addition, companies need to distinguish between data which is generated
internally, that is to say it resides behind a company’s firewall, and externally generated data which needs
to be imported into a system. Whether data is unstructured or structured is also an important factor.

 Structured data: has semantic meaning attached to it, like data stored in an SQL database.

 Unstructured data: has no predefined structure. It includes calls, texts, tweets, browsing
across various websites, messages exchanged by every means possible, and card transactions
made for various payments.

 Semi-structured data: includes XML and other markup languages, and email.

The three primary sources of Big Data are:

Social data : comes from the Likes, Tweets & Re-tweets, Comments, Video Uploads, and general media
that are uploaded and shared via the world’s favourite social media platforms. This kind of data provides
invaluable insights into consumer behaviour and sentiment and can be enormously influential in
marketing analytics. The public web is another good source of social data, and tools like Google Trends
can be used to good effect to increase the volume of big data.

Machine data: is defined as information which is generated by industrial equipment, sensors that are
installed in machinery, and even web logs which track user behaviour. This type of data is expected to
grow exponentially as the internet of things grows ever more pervasive and expands around the world.
Sensors such as medical devices, smart meters, road cameras, satellites, games and the rapidly growing
Internet of Things will deliver data of high velocity, value, volume and variety in the very near future.

Transactional data: is generated from all the daily transactions that take place both online and offline.
Invoices, payment orders, storage records and delivery receipts are all characterized as transactional data,
yet such data alone is almost meaningless, and most organizations struggle to make sense of the data they
are generating and how it can be put to good use.


CHAPTER 4

CHARACTERISTICS OF BIG DATA

There are many properties associated with big data. The prominent aspects are volume, variety, velocity,
veracity, and value.

1. VOLUME:
The word big in big data refers to the sheer size of the data. It refers to the
vast amounts of data that are generated every second, minute, hour and day in our digitized world. It can
come from large datasets being shared or from many small data pieces collected over time. Every minute, around 204
million emails are sent, 200,000 photos are uploaded and 1.8 million likes are generated on Facebook. On
YouTube, 1.3 million videos are viewed and 72 hours of video are uploaded every minute. The sizes involved are so
massive that they are measured in petabytes, exabytes and zettabytes.

Some astounding examples of massive data generated (by machines) are:

 CERN’s Large Hadron Collider generates about 15 petabytes (1 petabyte = 2^50 bytes) of data per year.
 An Airbus A380 has four engines, and each engine generates about 1 petabyte of data on a flight
from London to Singapore.
 10,000 credit card transactions are made per second.
 1 million customer transactions are made per hour by Walmart.
 According to predictions in an IDC (International Data Corporation) report sponsored by a big data
company called EMC, digital data will grow by a factor of 44 between 2009 and 2020, from a base of about
0.8 zettabytes (1 zettabyte = 2^70 bytes) [1]. About 90% of the world’s data has been created in the last two years.

2. VARIETY:
Variety refers to the ever-increasing different forms that data can come in, such as text,
images, audio, geospatial data and computer-generated simulations. The heterogeneity of data can be
characterized along several dimensions. Some of these are:

 Structural variety: It refers to the difference in the representation of the data. For example an EKG
signal is very different from a newspaper article. Satellite images of wildfires from NASA are
different from tweets sent out by people seeing the spread of fire.

 Media Variety: Media variety refers to the medium in which the data gets delivered. For example:
The audio of a speech and the transcript of a speech represent the same information in two different
media.

Dept. Of ECE, HMSIT Page 7


BIG DATA

 Semantic variety: It comes from different assumptions about or conditions on the data. For example, two
income surveys conducted on two different groups of people may not be comparable or combinable
without knowing more about the populations themselves.

3. VELOCITY:
Velocity refers to the speed at which big data is created or moves from one point to another and
the increasing pace at which it needs to be stored and analyzed. The processing of data in real time to
match its production rate as it gets generated is the main goal of big data analytics.

It allows personalization of the advertisements on the web pages one visits based on recent search,
viewing and purchase history. Put another way, if a business cannot take advantage of the data
as it gets generated and analyze it at speed, it is missing opportunities. Accurate yet old information is
useless.

Taking a real-life example, say we are going on a road trip and need information about weather
conditions to start packing. In this case, the newer the information, the higher its relevance in deciding what
to pack. As weather conditions keep changing, looking at last month’s or last year’s information
won’t help us much; information from the current week, or better still the present day, will help us a great
deal. Obtaining the latest information about the weather, processing it and getting it to us helps us in our
decision making.

Sensors and smart devices monitoring the human body help detect abnormalities in real time and aid us
in taking action, saving lives. Streaming information often needs to be integrated with
existing data to produce decisions during emergencies, such as a tornado.

4. VERACITY:
Veracity refers to the quality of big data: the biases, noise and abnormality in data. It
also refers to the uncertainty, truthfulness and trustworthiness of data. Veracity is very
important for making big data operational, because data is useless if it is not accurate. The results of big data
analysis are only as good as the data being analyzed. Data that are erroneous, duplicated, incomplete or
outdated are collectively referred to as dirty data.

5. VALUE:
Value refers to how big data is going to benefit us and our organization. Data value helps
in measuring the usefulness of data in decision making. Queries can be run on the stored data to
deduce important results and gain insights from the filtered data, which helps solve the most
analytically complex business problems.


FIVE V’s OF BIG DATA :

Fig 3.1: Five V’s Big Data


CHAPTER 5

STORING AND PROCESSING OF BIG DATA

DATA STORAGE
Big data stores are used in similar ways as traditional relational database management systems,
e.g. for online transactional processing (OLTP) solutions and data ware-houses over structured or semi-
structured data. Particular strengths are in handling unstructured and semi-structured data at large scale.
These approaches typically sacrifice properties such as data consistency in order to maintain fast query
responses with increasing amounts of data This section assesses the current state-of-the-art in data store
technologies that are capable of handling large amounts of data, and identifies data store related trends.
Following are differing types of storage systems:

 Distributed File Systems: File systems such as the Hadoop File System (HDFS) offer the capability
to store large amounts of unstructured data in a reliable way on commodity hardware. Although there
are file systems with better performance, HDFS is an integral part of the Hadoop framework and has
already reached the level of a de-facto standard. It has been designed for large data files and is well
suited for quickly ingesting data and bulk processing.

 NoSQL Databases: Probably the most important family of big data storage technologies are NoSQL
database management systems. NoSQL databases use data models from outside the relational
world that do not necessarily adhere to the transactional properties of atomicity, consistency,
isolation, and durability (ACID).

 NewSQL Databases: A modern form of relational databases that aims for scalability comparable to
NoSQL databases while maintaining the transactional guarantees made by traditional database
systems.

 Big Data Querying Platforms: Technologies that provide query facilities (typically SQL-like) on top of
big data stores such as Hadoop.

NoSQL databases are designed for scalability, often by sacrificing consistency. Compared to relational
databases, they often use low-level, non-standardized query interfaces, which makes them more difficult
to integrate into existing applications that expect an SQL interface. The lack of standard interfaces makes
it harder to switch vendors. NoSQL databases can be distinguished by the data models they use.

 Key-Value Stores: Key-value stores allow storage of data in a schema-less way. Data objects can be
completely unstructured or structured, and are accessed by a single key. As no schema is used, it is
not even necessary that data objects share the same structure.

 Columnar Stores: According to Wikipedia, “A column-oriented DBMS is a database management
system (DBMS) that stores data tables as sections of columns of data rather than as rows of data, like
most relational DBMSs” (Wikipedia 2013). Such databases are typically sparse, distributed, and
persistent multi-dimensional sorted maps in which data is indexed by a triple of a row key, column
key, and a timestamp. The value is represented as an uninterpreted string data type.

Data is accessed by column families, i.e. a set of related column keys that effectively compress
the sparse data in the columns. Column families are created before data can be stored, and their number is
expected to be small. In contrast, the number of columns is unlimited. In principle, columnar stores are
less suitable when all columns need to be accessed; in practice, however, this is rarely the case, leading to
the superior performance of columnar stores (see the sketch after this list).

 Document Databases: In contrast to the values in a key-value store, documents are structured.
However, there is no requirement for a common schema that all documents must adhere to, as is the
case for records in relational databases. Thus document databases are referred to as storing semi-
structured data.

Similar to key-value stores, documents can be queried using a unique key. However, it is also possible
to access documents by querying their internal structure, such as requesting all documents that contain a
field with a specified value. The capability of the query interface typically depends on the encoding
format used by the database. Common encodings include XML and JSON.

 Graph Databases: Graph databases, such as Neo4J (2015), store data in graph structures, making
them suitable for storing highly associative data such as social network graphs. A particular flavour of
graph databases are triple stores, such as AllegroGraph (Franz 2015) and Virtuoso (Erling 2009), that
are specifically designed to store RDF triples. However, existing triple store technologies are not yet
suitable for storing truly large datasets efficiently. While in general NoSQL data stores scale better than
relational databases, scalability decreases with the increased complexity of the data model used by the
data store. This particularly applies to graph databases that support applications that are both write and
read intensive.
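
To make the column-family data model described above concrete, the following is a minimal sketch of writing and reading a single cell in Apache HBase, a widely used wide-column store from the Hadoop ecosystem. The table name ("customers"), column family ("info") and row key are illustrative assumptions, and the snippet presumes an HBase cluster whose connection settings (hbase-site.xml) are available on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ColumnStoreSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml from the classpath

        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("customers"))) { // assumed table name

            // Write one cell: row key + column family + column qualifier -> value.
            Put put = new Put(Bytes.toBytes("row-0001"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);

            // Read the same cell back using the row key and the column coordinates.
            Result result = table.get(new Get(Bytes.toBytes("row-0001")));
            byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println("name = " + Bytes.toString(value));
        }
    }
}

Note that the only fixed part of the schema is the column family; individual rows are free to carry different column qualifiers, which is what makes this model suitable for sparse data.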


DATA MANAGEMENT

Preparing the data for analysis is one of the most time-consuming tasks of analytics. Analytics
is performed on large volumes of data, which requires efficient methods to store, filter, transform, and
retrieve the data. Cloud analytics solutions need to consider the multiple Cloud deployment models
adopted by enterprises:

 Private: These run on a private network, managed by the organisation itself or by a third
party. A private Cloud is suitable for businesses that require the highest level of control over security
and data privacy.

 Public: These are hosted off-site, accessed over the Internet and available to the general public. A public Cloud
offers high efficiency and shared resources at low cost. The quality of services such as privacy,
security, and availability is specified in a contract.

 Hybrid: Combines both Clouds, where additional resources from a public Cloud can be provided as
needed to a private Cloud. Customers can develop and deploy analytics applications using a private
environment.


Technical View of Big Data

[Figure 4.1 depicts the big data pipeline: sourcing (internet, communications, sensor networks, transactions,
images, documents, graphs), storing, formatting (key-value stores; unstructured, partially structured and
structured data), organizing, processing (extract, clean, normalize, transform, load), querying, and
predictive analytics, data mining and text analytics, on demand or ad hoc.]

Fig 4.1: Technical view of Big Data


PROCESSING OF BIG DATA:

1. HADOOP
Hadoop is a freely available Java-based programming framework that supports the processing of large
data sets in a distributed computing environment. Using Hadoop, large amounts of data can be
processed over a cluster of servers, and applications can be run on systems with thousands of nodes involving
terabytes of information. This distribution lowers the risk of overall system failure even when a significant
number of nodes fail. Hadoop follows a master–slave architecture for storing data and for data processing. This
master–slave architecture has master nodes and slave nodes, as shown in the image below:

Fig 4.2. Master–slave architecture of Hadoop

Terminologies before understanding the architecture:

 Name Node: The Name Node is basically a master node that acts like a monitor and supervises
operations performed by Data Nodes.
 Secondary Name Node: A Secondary Name Node plays a vital role in case there is a
technical issue in the Name Node.
 Data Node: The Data Node is the slave node that stores all the files and processes them.
 Mapper: A Mapper maps data or files in the Data Nodes. It goes to every Data Node and runs a
particular set of code or operations in order to get the work done.

 Reducer: While a Mapper runs the code, the Reducer is required for collecting the result from each
Mapper.


 JobTracker: The JobTracker is a master node used for getting the location of a file in different
DataNodes. It is a very important service in Hadoop; if it goes down, all the running jobs
are halted.
 TaskTracker: The TaskTracker is a reference for the JobTracker present in the DataNodes. It accepts
different tasks, such as map, reduce, and shuffle operations, from the JobTracker. It is a key
player performing the main MapReduce functions.
 Block: A Block is a small unit into which files are split. It has a default size of 64 MB and can be
increased as needed.
 Cluster: A Cluster is a set of machines such as DataNodes, NameNodes, Secondary NameNodes,
etc.

HADOOP ARCHITECTURE:

Fig 4.3: Hadoop architecture

There are two layers in the Hadoop architecture. First, we will see how data is stored in Hadoop, and
then we will move on to how it is processed. When talking about the storage of files in
Hadoop, HDFS comes into the picture.


Hadoop Distributed File System (HDFS)


HDFS is based on the Google File System (GFS) and provides a distributed file system particularly
designed to run on commodity hardware. The file system has several similarities with existing
distributed file systems. However, HDFS does stand out among them, because it is fault-
tolerant and is specifically designed for deployment on low-cost hardware.
HDFS is mainly responsible for taking care of the storage part of Hadoop applications. So, if you
have a 100 MB file that needs to be stored in the file system, then in HDFS this file will be split into
chunks called blocks. The default size of each block in Hadoop 1 is 64 MB, while in Hadoop
2 it is 128 MB. For example, in Hadoop version 1, a 100 MB file will be divided into 64
MB stored in one block and 36 MB in another block. Each block is given a unique name of the form blk_n
(n = any number), and each block is uploaded to one DataNode in the cluster. On each of the machines in the
cluster, there is a daemon, a piece of software that runs in the background.

Fig 4.4: Structure of Hadoop Distributed File System (HDFS)

The daemons of HDFS are as follows:

 Name Node: It is the master node that maintains and manages the file system metadata. It points to
DataNodes and retrieves data from them. The file system metadata is stored on the Name Node.

 Secondary NameNode: It assists the master node and is responsible for keeping checkpoints of the
file system metadata that is present on the Name Node.

 DataNode: DataNodes hold the application data that is stored on the servers. The DataNode is the slave node
that basically has all the data of the files in the form of blocks.


As we know, HDFS stores the application data and the file system metadata separately on dedicated servers. The
file content is replicated by HDFS on multiple DataNodes based on the replication factor to assure the
availability of the data. The DataNodes and the NameNode communicate with each other using TCP-based
protocols.

The following prerequisites must be satisfied for the HDFS-based Hadoop architecture to perform
efficiently:

o There must be good network speed in order to manage data transfer.

o Hard drives should have a high throughput.
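
As a small illustration of the storage behaviour described above, the following is a minimal sketch that copies a local file into HDFS through the standard Hadoop FileSystem API. The file paths and the explicit 64 MB block size are illustrative assumptions (matching the Hadoop 1 default mentioned earlier); for a 100 MB file the NameNode would then record two blocks of 64 MB and 36 MB.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsPutSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // picks up core-site.xml / hdfs-site.xml
        conf.set("dfs.blocksize", "67108864");    // 64 MB per block, set here purely for illustration

        try (FileSystem fs = FileSystem.get(conf)) {
            Path local = new Path("/tmp/sample-100mb.dat");   // assumed local file
            Path remote = new Path("/data/sample-100mb.dat"); // assumed HDFS destination

            // Copy the local file into HDFS; the cluster splits it into blocks and replicates them.
            fs.copyFromLocalFile(local, remote);

            FileStatus status = fs.getFileStatus(remote);
            System.out.println("Stored " + status.getLen() + " bytes, block size "
                    + status.getBlockSize() + ", replication " + status.getReplication());
        }
    }
}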


WORKING OPERATION OF HADOOP:

Hadoop runs code across a cluster of computers and performs the following tasks:

 Data is initially organized into files and directories. Files are then divided into consistently sized blocks
(128 MB in Hadoop 2, 64 MB in Hadoop 1).
 Then, the blocks are distributed across various cluster nodes for further processing of the data.
 The JobTracker starts its scheduling programs on individual nodes.
 Once all the nodes are done with the scheduling, the output is returned.

Data from HDFS is consumed through MapReduce applications. HDFS also creates multiple
replicas of data blocks and distributes them across the nodes of a cluster, which
enables reliable and extremely quick computations.

 So, in the first step, the file is divided into blocks and stored in different Data Nodes. When a job
request is generated, it is directed to the Job Tracker.

The Job Tracker doesn’t actually know the location of the file, so it contacts the Name Node
for this.


 The Name Node will now find the location and give it to the Job Tracker for further processing.

 Now, since the Job Tracker knows the location of the blocks of the requested file, it will contact the
Task Tracker present on a particular Data Node for the data file.


 The Task Tracker will now send the data it has to the Job Tracker.

 Finally, the Job Tracker will collect the data and send it back to the requesting source.


2. MapReduce Layer

MapReduce is a patented software framework introduced by Google to support distributed
computing on large datasets across clusters of computers.

It is basically a programming model that runs in the Hadoop background, providing
simplicity, scalability, recovery, and speed, including easy solutions for data processing. The
MapReduce framework is capable of processing tremendous amounts of data in parallel on large
clusters of computational nodes.

MapReduce is a programming model that allows you to process your data across an entire cluster.
It basically consists of Mappers and Reducers, which are the different scripts or functions you write
when building a MapReduce program. Mappers transform your data in
parallel across your computing cluster in a very efficient manner, whereas Reducers are responsible for
aggregating your data together. Mappers and Reducers put together can be used to solve complex
problems.
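
To ground the Mapper/Reducer description above, here is a minimal sketch of the classic word-count job written against the Hadoop MapReduce Java API. It assumes the input and output HDFS paths are supplied as command-line arguments; the mapper emits (word, 1) pairs and the reducer sums the counts for each word.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: for every input line, emit (word, 1) for each token.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: sum the counts collected for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // pre-aggregates counts on the map side
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Reusing the reducer as a combiner is a common optimization: counts are partially summed on the map side, which reduces the volume of data shuffled across the network.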

Fig 4.5: MapReduce Layer


Working of the MapReduce Architecture:

 A MapReduce job starts when a client submits a job to the JobTracker. The job configuration
specifies the Map and Reduce functions, along with the locations of the input and output data.

 On receiving the job, the JobTracker sends a request to the NameNode, which holds the locations of the
relevant DataNodes. The NameNode returns those locations to the JobTracker.

 The JobTracker then sends requests to the selected TaskTrackers on those DataNodes, and the
processing of the map phase begins.

 In this phase, each TaskTracker retrieves its share of the input data. For each record parsed by the
‘InputFormat’, a map function is invoked, producing key–value pairs in a memory
buffer. The buffer is then sorted and partitioned by the reducer node each pair is destined for, and a
combine function may be invoked to pre-aggregate values locally.

 When a map task is completed, the TaskTracker notifies the JobTracker.

 Once all the TaskTrackers have notified the JobTracker, the JobTracker tells the selected TaskTrackers
to begin the reduce phase.

 Each TaskTracker then reads the region files and sorts the key–value pairs for each
key. Lastly, the reduce function is invoked, which collects the aggregated values into an output
file.


3. HIVE:
Hive is a data warehousing infrastructure built on top of Hadoop. Its primary role is to provide data
summarization, query, and analysis. It supports the analysis of large datasets stored in Hadoop's HDFS.
Hive provides an SQL-like interface to query data stored in various databases and file
systems that integrate with Hadoop.

Apache Hive supports the analysis of large datasets stored in Hadoop's HDFS and in compatible file systems
such as the Amazon S3 file system. It provides an SQL-like language called Hive Query Language (HiveQL),
with schema on read, and transparently converts queries to MapReduce jobs. By default, metadata is stored
in an embedded Apache Derby database; other client-server databases such as MySQL can alternatively be used.
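
A minimal sketch of issuing a HiveQL query through Hive's JDBC interface is shown below. The HiveServer2 address (localhost:10000), the credentials and the table name (web_logs) are illustrative assumptions, and the Hive JDBC driver must be on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuerySketch {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver"); // register the Hive JDBC driver

        // HiveServer2 JDBC URL: host, port and database are assumptions for this sketch.
        String url = "jdbc:hive2://localhost:10000/default";

        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement()) {

            // HiveQL: an SQL-like aggregation that Hive compiles into MapReduce/Tez/Spark jobs.
            String hql = "SELECT status, COUNT(*) AS hits "
                       + "FROM web_logs GROUP BY status ORDER BY hits DESC";

            try (ResultSet rs = stmt.executeQuery(hql)) {
                while (rs.next()) {
                    System.out.println(rs.getString("status") + "\t" + rs.getLong("hits"));
                }
            }
        }
    }
}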

Other features of Hive include:

 Indexing types such as compaction and bitmap indexes (as of version 0.10) to accelerate queries, with
additional index types planned.
 Support for various storage types such as plain text, HBase, ORC, RCFile and others.
 Metadata stored in an RDBMS, considerably reducing the time needed to perform semantic checks
during query execution.
 The ability to operate on compressed data stored in the Hadoop ecosystem using algorithms such as
DEFLATE, SNAPPY, BWT, etc.
 Built-in user-defined functions (UDFs) to manipulate strings, dates, and other data-mining tools. Hive
supports extending the UDF set to handle use cases not supported by built-in functions.
 SQL-like queries (HiveQL), which are implicitly transformed into MapReduce, Spark, or Tez jobs.

Although Hive is based on SQL, HiveQL does not strictly follow the full SQL-92 standard.
HiveQL offers extensions that are not in SQL, including multi-table inserts and create table as
select, but it provides only basic support for indexes. HiveQL also has only limited subquery
support, and it lacks support for transactions and materialized views.


CHAPTER 6

IMPORTANCE OF BIG DATA


The importance of big data does not revolve around how much data a company has but how a
company utilises the collected data. Every company uses data in its own way; the more efficiently a
company uses its data, the more potential it has to grow. The company can take data from any source and
analyse it to find answers which will enable:

 Cost Savings : Some tools of Big Data like Hadoop and Cloud-Based Analytics can bring cost
advantages to business when large amounts of data are to be stored and these tools also help in
identifying more efficient ways of doing business.

 Time Reductions: The high speed of tools like Hadoop and in-memory analytics can easily identify
new sources of data, which helps businesses analyze data immediately and make quick decisions
based on the learnings.
 Understand the market conditions : By analyzing big data you can get a better understanding of
current market conditions. For example, by analyzing customers’ purchasing behaviours, a company
can find out the products that are sold the most and produce products according to this trend. By this,
it can get ahead of its competitors.

 Control online reputation: Big data tools can perform sentiment analysis. Therefore, you can get
feedback about who is saying what about your company. If you want to monitor and improve the
online presence of your business, big data tools can help with all of this.
 Another huge advantage of big data is the ability to help companies innovate and redevelop their
products.

Best Examples Of Big Data

The best examples of big data can be found in both the public and private sectors: from targeted
advertising, education, and the already mentioned massive industries (healthcare, insurance, manufacturing
and banking) to real-life scenarios in guest services or entertainment.

By the year 2020, 1.7 megabytes of data will be generated every second for every person on the
planet, so the potential for data-driven organizational growth in the hospitality sector is enormous.


CHAPTER 7

BIG DATA ANALYTICS TOOLS

More and more tools offer the possibility of real-time processing of Big Data.

1. Storm: Storm, which is now owned by Twitter, is a real-time distributed computation system.

2. Cloudera: Cloudera offers the Cloudera Enterprise RTQ tool, which offers real-time, interactive
analytical queries of the data stored in HBase or HDFS.

3. GridGain: GridGain is an enterprise open-source grid computing platform made for Java. It is compatible
with Hadoop DFS and offers a substitute for Hadoop’s MapReduce.

4. Space Curve: The technology that Space Curve is developing can discover underlying patterns in
multidimensional geo data.


CHAPTER 8

BIG DATA SECURITY TECHNOLOGIES


None of the big data security tools described below is new. What is new is their scalability and their ability to
secure multiple types of data at different stages.

1. Encryption: Your encryption tools need to secure data in transit and at rest, and they need to do it
across massive data volumes. Encryption also needs to operate on many different types of data, both
user- and machine-generated. Encryption tools also need to work with different analytics toolsets and
their output data, and on common big data storage formats including relational database management
systems (RDBMS), non-relational databases such as NoSQL, and specialized file systems such as the
Hadoop Distributed File System (HDFS). A minimal code sketch illustrating this point follows this list.

2. Centralized Key Management: Centralized key management has been a security best practice for
many years. It applies just as strongly in big data environments, especially those with wide
geographical distribution. Best practices include policy-driven automation, logging, on-demand key
delivery, and abstracting key management from key usage.

3. User Access Control: User access control may be the most basic network security tool, but many
companies practice minimal control because the management overhead can be so high. This is
dangerous enough at the network level, and can be disastrous for the big data platform. Strong user
access control requires a policy-based approach that automates access based on user and role-based
settings. Policy driven automation manages complex user control levels, such as multiple
administrator settings that protect the big data platform against inside attack.

4. Intrusion Detection and Prevention: Intrusion detection and prevention systems are security
workhorses. This does not make them any less valuable to the big data platform. Big data’s value and
distributed architecture lend themselves to intrusion attempts. An IPS enables security admins to protect the
big data platform from intrusion, and should an intrusion succeed, an IDS quarantines it before
it does significant damage.

5. Physical Security: Don’t ignore physical security. Build it in when you deploy your big data
platform in your own data center, or carefully do due diligence around your cloud provider’s data
center security. Physical security systems can deny data center access to strangers or to staff members
who have no business being in sensitive areas. Video surveillance and security logs will do the same.
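
As a small illustration of the encryption point in item 1 above, the following sketch encrypts and decrypts a single record with AES-GCM using the standard Java Cryptography Architecture. It is a toy, application-level example: the key and IV handling is simplified, and in practice key generation and storage would be handled by a centralized key management service, while big data platforms often rely on built-in facilities such as HDFS transparent encryption.

import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

public class AtRestEncryptionSketch {
    public static void main(String[] args) throws Exception {
        // Generate a 256-bit AES key (in production this would come from a key management service).
        KeyGenerator keyGen = KeyGenerator.getInstance("AES");
        keyGen.init(256);
        SecretKey key = keyGen.generateKey();

        // A fresh 12-byte IV per record, stored alongside the ciphertext.
        byte[] iv = new byte[12];
        new SecureRandom().nextBytes(iv);

        // Encrypt the record with AES in GCM mode (confidentiality plus integrity).
        Cipher encrypt = Cipher.getInstance("AES/GCM/NoPadding");
        encrypt.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));
        byte[] ciphertext = encrypt.doFinal("sensitive customer record".getBytes(StandardCharsets.UTF_8));

        // Decrypt with the same key and IV to recover the record.
        Cipher decrypt = Cipher.getInstance("AES/GCM/NoPadding");
        decrypt.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(128, iv));
        System.out.println(new String(decrypt.doFinal(ciphertext), StandardCharsets.UTF_8));
    }
}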


Implementation of Big Data Security:

There are several ways organizations can implement security measures to protect their big data
analytics tools. One of the most common security tools is encryption, a relatively simple tool that can go
a long way.

 Encrypted data is useless to external actors such as hackers if they don’t have the key to unlock it.
Moreover, encrypting data means that both at input and output, information is completely protected.

 Building a strong firewall is another useful big data security tool. Firewalls are effective at filtering
traffic that both enters and leaves servers. Organizations can prevent attacks before they happen by
creating strong filters that block third parties or unknown data sources.

 Data security must complement other security measures such as endpoint security, network security,
application security, physical site security and more to create an in-depth approach. By planning
ahead and being prepared for the introduction of big data analytics in your organization, you will be
able to help your organization meet its objectives securely. Incidents involving data breaches
continue to rise rapidly. This is why it is important to follow the best practices listed below
for Big Data security:

 Boost the security of non-relational data stores


 Implement end point security
 Use customized solutions
 Ensure the safety of transaction and data storage logs
 Practice real-time security monitoring and compliance
 Rely on Big Data cryptography.


CHAPTER 9

BENEFITS, RISKS AND APPLICATIONS OF BIG DATA

BENEFITS OF BIG DATA:

1. Big data allows an organization to analyze the threats it faces internally by providing visibility across
the entire data landscape of the company, using the rich set of tools that big data software provides.

2. An important advantage of big data is that it allows the user to make data safe and secure. The speed,
capacity and scalability of cloud storage provide a clear advantage for companies and organizations.

3. Big data also allows end users to visualize the data, and companies can find new business
opportunities. Data analytics is one more notable advantage of big data, allowing an
individual to personalize the content or the look and feel of real-time websites.

4. Big data is already a vital part of the $64 billion database and data analytics market, and it
furnishes commercial opportunities of a comparable magnitude.

5. Businesses can utilize outside intelligence while making decisions.

6. Improved customer service.

7. Early identification of risks to products/services, if any.

8. Better operational efficiency.


RISKS OF BIG DATA:

Here are the five biggest risks that big data presents for digital enterprises.

1. Unorganized data:

Big data is highly versatile. It comes from a number of sources and in a number of forms. There is
structured data and there is unstructured data, coming from both online and offline sources. And all
this data keeps piling up each day, each minute. It is overwhelming for enterprises to tackle such
unorganized and siloed data sets effectively. A well-planned governance strategy can bring you out of
your dark data and help you make sense of it.

2. Data storage and retention:

This is one of the most obvious risks associated with big data. When data gets accumulated at
such a rapid pace and in such huge volumes, the first concern is its storage. Traditional data storage
methods and technologies are just not enough to store big data and retain it well. Enterprises today need to
shift to cloud-based data storage solutions to store, archive and access big data effectively.

3. Cost management:

The process of storing, archiving, analyzing, reporting and managing big data involves costs.
Many small and medium enterprises think that big data is only for big businesses, and they cannot afford
it. However, with careful budgeting and planning of resources, big data costs can be mitigated well. Once
the initial set up, migration and overhauling costs are taken care of, big data acts as an incredible revenue
generator for digital enterprises.

4. Incompetent analytics :

Without proper analytics, big data is just a pile of trash lying unnecessarily in your organization.
Analytics is what makes data meaningful, giving management valuable insights to make business
decisions and plan strategies for growth. With data growing at such an alarming rate, there’s obviously a
lack of skilled professionals and technology to analyze big data efficiently. This exposes enterprises to the
risk of misinterpreting data and making wrong decisions. Hiring the right talent and applying the
right tools are crucial to deriving relevant decisions from a big data project.

5. Data privacy:

With big data, comes the biggest risk of data privacy. Enterprises worldwide make use of
sensitive data, personal customer information and strategic documents. When there’s so much
confidential data lying around, the last thing you want is a data breach at your enterprise. A security
incident can not only compromise critical data and damage your reputation; it can also lead to legal action and
heavy penalties. Taking measures for data privacy is not just a good initiative anymore; it is a compliance
necessity.


APPLICATIONS OF BIG DATA


The applications of the big data are in the following fields:

1. Government Industry
2. Education Industry
3. International development
4. Manufacturing
5. Cyber-physical models
6. Media
7. Technology
8. Private sector
9. Science and research
10. Collecting information

1. Government Industry:

Along with many other areas, big data in government can have an enormous impact at the local, national
and global levels. With so many complex issues on the table today, governments have their work cut out trying
to make sense of all the information they receive and to make vital decisions that affect millions of people.
Governments of every country come face to face with huge amounts of data on an almost daily
basis, because they have to keep track of various records and databases regarding their citizens. The
proper study and analysis of this data helps governments in endless ways. A few of them are:

 Welfare schemes
 Cyber security

2. Big Data in Education industry

The following are some of the fields in the education industry that have been transformed by big-data-
motivated changes:

 Customized and dynamic learning programs


 Reframing course material
 Grading Systems
 Career prediction


3. International development: Developments in big data analysis furnish cost-effective
opportunities to improve decision-making in critical development areas such as health care, employment,
crime, security and natural disaster management. In this way, big data is helpful for
international development.

 Big Data in Insurance industry:


The insurance industry holds importance not only for individuals but also for business companies. The
reason insurance holds a significant place is that it supports people during times of adversity and
uncertainty. The data collected from the industry’s many sources comes in varying formats and changes at
tremendous speed.

4. Manufacturing: In manufacturing, the big data furnishes an infrastructure for transparency in


manufacturing or producing industry.

5. Cyber-physical models: Present PHM (prognostics and health management) implementations make use of
data gathered during actual usage, and their step-by-step analytical procedures perform more precisely when
more data is included. This is the role of big data in cyber-physical models.

6. Media: In media, big data is used together with the Internet of Things for activities such as the targeting of
consumers (for advertising) and data capture.

7. Technology: In technology, big data is used by websites and companies such as eBay, Amazon, Facebook and
Google.

8. Private sector: The applications of big data in the private sector include retail, retail banking, and
real estate.

 Big Data in Banking Sector:

The amount of data in the banking sector is skyrocketing every second. According to a GDC prognosis,
this data is estimated to grow by 700% by 2020. The study and analysis of big data can help detect:

 Misuse of debit cards


 Venture credit hazard treatment
 Business clarity
 Customer statistics alteration
 Money laundering
 Risk Mitigation


9. Science and Research: The best example of its application in science is the Large Hadron
Collider, whose experiments represent about 150 million sensors delivering data 40 million times per second.

Big data also has applications in science and research. Big data will become even more
important in the future, as around $15 billion has been invested in software firms that specialize in data
management and data analytics.

10. Collecting information

As big data refers to gathering data from disparate sources, this creates a crucial use case
for the insurance industry to pounce on. For example, when a customer intends to buy car insurance in Kenya,
the company can obtain information from which it can calculate the safety level of driving in the
buyer’s vicinity and the buyer’s past driving record. On this basis it can effectively calculate the cost of the car
insurance as well.

 Gaining customer insight: Determining the customer experience and making customers the center of a
company’s attention is of prime importance to organizations.

 Fraud detection: Insurance fraud is a common occurrence. The big data use case for reducing fraud is
highly effective.

 Threat mapping: When an insurance agency sells a policy, it wants to be aware of all the
possibilities of things going unfavourably for its customer that would make them file a claim.


CHAPTER 10

THE FUTURE OF BIG DATA


The future of big data itself is all but guaranteed to be a bright one; it is universally recognized
these days that smart analytics can be a royal road to business success. This implies that big data
architecture will become both more critical to secure and more frequently attacked, thus growing the list
of big data security issues. And that, in a nutshell, is the basis of the emerging field of security
intelligence, which correlates security information across disparate domains to reach conclusions. The solutions
available, already smart, are rapidly going to get smarter in the years to come.

At this time, an increasing number of businesses are adopting big data environments. The time is
ripe to make sure security teams are included in these decisions and deployments, particularly since big
data environments that lack comprehensive data protection capabilities represent low-hanging
fruit for hackers, as they hold so much potentially valuable sensitive data. As far as the future
of big data is concerned, it is certain that data volumes will continue to grow, and the prime reason
will be the drastic increase in the number of handheld and internet-connected devices,
which is expected to grow exponentially.

SQL will remain the standard for data analysis, and Spark, which is still emerging, will become
the complementary tool for data analysis. Tools for analysis without the presence of an analyst are set to
take over, with Microsoft and Salesforce both recently announcing features that let non-coders create
apps for viewing business data. According to IDC, half of all business analytics software will include the
intelligence where it is needed by 2020.

Data security is a detailed, continuous responsibility that needs to become part of business as
usual for big data environments. Securing data requires a holistic approach to protect organizations from
a complex threat landscape across diverse systems.

Impact of big data in IT Industry:

Big Data helps innovative companies use actionable insights to optimize real-time marketing
effectiveness. This can include delivering high-value, personalized customer experiences to drive ROI and
brand loyalty, and creating dynamic online merchandising such as smart cross-selling and up-selling to
increase revenue.


CHAPTER :11

CONCLUSION

The availability of Big Data, low-cost commodity hardware, and new information management and
analytic software have produced a unique moment in the history of data analysis. The convergence of
these trends means that we have the capabilities required to analyze astonishing data sets quickly and
cost-effectively for the first time in history. These capabilities are neither theoretical nor trivial. They
represent a genuine leap forward and a clear opportunity to realize enormous gains in terms of efficiency,
productivity, revenue, and profitability. The Age of Big Data is here, and these are truly revolutionary
times if both business and technology professionals continue to work together and deliver on the promise.
As far as security is concerned, the existing technologies promise to evolve as newer vulnerabilities to
big data arise and the need to secure them increases.


CHAPTER : 12

REFERENCES:
[1] J. Gantz, D. Reinsel, "Extracting value from chaos", IDC iView, 2011, pp. 1–12.

[2] E. Mcnulty, "Understanding Big Data: The Seven V's", Dataconomy, May 22, 2014.

[3] Gartner, "Big Data Strategy Components: Business Essentials", October 9, 2012.

[4] Gartner, "IT glossary: big data" [webpage on the Internet]. Stamford, CT; 2012.

[5] Canada Inforoute, "Big Data Analytics in health", White Paper, Full Report, April 2013

[6] A. Alexandru, D. Coardos, "BD in Tackling Energy Efficiency in Smart City", Scientific Bulletin of
the Electrical Engineering Faculty, vol. 28, no. 4, pp. 14-20, 2014, Bibliotheca Publishing House, ISSN
1843-6188

[7] Arthur G. Erdman, Daniel F. Keefe, and Randall Schiestl, "Applying Regulatory Science and Big Data to
Improve Medical Device Innovation", IEEE Transactions on Biomedical Engineering, vol. 60, no. 3, March
2013.

[8] http://lsst.org/lsst/google

[9] http://en.wikipedia.org/wiki/Parkinson‘s_law

[10] http://www.economist.com/node/15557443

[11] http://www.youtube.com/t/press_statistics/?hl=en

[12] http://www.internetlivestats.com/twitter-statistics/

[13] Magoulas, Roger; Lorica, Ben (February 2009). "Introduction to Big Data". Release 2.0. Sebastopol
CA: O'Reilly Media (11).

[14] John R. Mashey (25 April 1998). "Big Data ... and the Next Wave of InfraStress" (PDF). Slides
from invited talk. Usenix. Retrieved 28 September 2016.

[15] Steve Lohr (1 February 2013). "The Origins of 'Big Data': An Etymological Detective Story". The
New York Times. Retrieved 28 September 2016.

[16] Snijders, C.; Matzat, U.; Reips, U.-D. (2012). "'Big Data': Big gaps of knowledge in the field of
Internet". International Journal of Internet Science. 7: 1–5.


[17] Dedić, N.; Stanier, C. (2017). "Towards Differentiating Business Intelligence, Big Data, Data
Analytics and Knowledge Discovery". Innovations in Enterprise Information Systems Management and
Engineering. Lecture Notes in Business Information Processing. 285. Berlin ; Heidelberg: Springer
International Publishing. pp. 114–122. doi:10.1007/978-3-319-58801-8_10. ISBN 978-3-319-58800-1.
ISSN 1865-1356. OCLC 909580101.

[18] Everts, Sarah (2016). "Information Overload". Distillations. Vol. 2 no. 2. pp. 26–33. Retrieved 22
March 2018.

[19] Grimes, Seth. "Big Data: Avoid 'Wanna V' Confusion". Information Week. Retrieved 5 January
2016.

[20] Fox, Charles (25 March 2018). Data Science for Transport. Springer Textbooks in Earth Sciences,
Geography and Environment. Springer ISBN 9783319729527."avec focalisation sur Big Data &
Analytique" (PDF). Bigdataparis.com. Retrieved 8 October 2017.

[21] Billings S.A. "Nonlinear System Identification: NARMAX Methods in the Time, Frequency, and
Spatio-Temporal Domains". Wiley, 2013 "le Blog ANDSI » DSI Big Data". Andsi.fr. Retrieved 8
October 2017.

[22] Les Echos (3 April 2013). "Les Echos – Big Data car Low-Density Data ? La faible densité en
information comme facteur discriminant – Archives". Lesechos.fr. Retrieved 8 October 2017.

[23] Kitchin, Rob; McArdle, Gavin (17 February 2016). "What makes Big Data, Big Data? Exploring the
ontological characteristics of 26 datasets". Big Data & Society. 3 (1): 205395171663113.
doi:10.1177/2053951716631130.


International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 05 Issue: 02 | Feb-2018 www.irjet.net p-ISSN: 2395-0072

BIG DATA – CONCEPTS, ANALYTICS, ARCHITECTURES – OVERVIEW


P.Joseph Charles1, S. Thulasi Bharathi2, V.Susmitha3

1,2Assistant professor, Department of IT, St. Joseph’s College, Trichy, India,


3IIM.Sc., Computer Science, Department of IT, St. Joseph’s College, Trichy
ABSTRACT: The term "Big Data" has been coined to refer to the gargantuan bulk of data that cannot be
dealt with by traditional data-handling techniques. Big Data is still a novel concept, and in the
following literature we intend to elaborate it in a palpable fashion. It commences with the concept of
the subject in itself along with its properties and the two general approaches of dealing with it. We
have entered the big data era. Organizations are capturing, storing, and analysing data that has high
volume, velocity, and variety and comes from a variety of new sources, including social media, machines,
log files, video, text, image, RFID, and GPS. These sources have strained the capabilities of traditional
relational database management systems and spawned a host of new technologies, approaches, and platforms.
Big Data (BD) is associated with a new generation of technologies and architectures which can harness the
value of extremely large volumes of very varied data through real-time processing and analysis.

Keywords: Big Data, 3 V's, Hadoop, framework, architecture.

I. INTRODUCTION

Big data and analytics are "hot" topics in both the popular and business press. Articles in publications
like the New York Times, Wall Street Journal and Financial Times, as well as books like Super Crunchers
[Ayers, 2007], Competing on Analytics [Davenport and Harris, 2007], and Analytics at Work [Davenport, et
al., 2010] have spread the word about the potential value of big data and analytics. In recent decades,
the increasing importance of data to organisations has led to rapid changes in data collection and
management. Traditional information management and data analysis methods ("analytics") are mainly
intended to support internal decision processes. They operate with structured data types, existing mainly
within the organization. Throughout the history of IT, each generation of organizational data processing
and analysis methods acquired a new name. With the launch of Web 2.0, a large amount of valuable business
data started being generated beyond the organization by consumers and, generally, by web users. This data
can be structured or unstructured, and can come from multiple sources such as social networks, products
viewed in virtual stores, information read by sensors, GPS signals from mobile devices, IP addresses,
cookies, bar codes, etc.

II. BIG DATA CONCEPTS

"Every day, we create 2.5 quintillion bytes of data — so much that 90% of the data in the world today has
been created in the last two years alone. This data comes from everywhere: sensors used to gather climate
information, posts to social media sites, digital pictures and videos, purchase transaction records, and
cell phone GPS signals to name a few" [1]. Such a colossal amount of data, produced continuously, is what
can be coined as Big Data. Big Data decodes previously untouched data to derive new insight that gets
integrated into business operations. However, as the amount of data increases exponentially, the current
techniques are becoming obsolete and dealing with Big Data requires new approaches. Big Data can be simply
defined by explaining the 3 V's – volume, velocity and variety – which are the driving dimensions of Big
Data quantification. Gartner analyst Doug Laney [3] introduced the famous 3 V's concept in his 2001
Metagroup publication, "3D Data Management: Controlling Data Volume, Variety and Velocity".

Figure-1: schematic representation of the 3 V's [4] of Big Data


Volume: The increase in data volume in enterprise-type systems is caused by the amount of transactions
and other traditional data types, as well as by new data types. Too much data becomes a storage problem,
but it also has a great impact on the complexity of data analysis. This essentially concerns the large
quantities of data that are generated continuously. Initially, storing such data was problematic because
of high storage costs. With decreasing storage costs this problem has been kept somewhat at bay for now;
however, this is only a temporary solution and better technology needs to be developed. Smartphones,
e-commerce and social networking websites are examples where massive amounts of data are being generated.
This data can easily be distinguished as structured, unstructured or semi-structured data.

Velocity: refers to both the speed with which data is produced and that with which it must be processed
to meet demand. This involves data flows, the creation of structured records, as well as availability for
access and delivery. The speed of data generation, processing and analysis is continuously increasing due
to real-time generation processes, requests resulting from combining data flows with business processes,
and decision-making processes. The velocity of the data processing must be high, while the processing
capacity depends on the type of processing of the data flows. In what now seem like pre-historic times,
data was processed in batches. However, this technique is only feasible when the incoming data rate is
slower than the batch processing rate and the delay is not much of a hindrance. At present, the speed at
which such colossal amounts of data are being generated is unbelievably high.

Variety: converting large volumes of transactional information into decisions has always been a challenge
for IT leaders, although in the past the types of generated or processed data were less diverse, simpler
and usually structured. Currently, more information coming from new channels and emerging technologies -
mainly from social media, the Internet of Things, mobile sources and online advertising - is available
for analysis and generates semi-structured or unstructured data. This includes tabular data (databases),
hierarchical data, documents, XML, emails, blogs, instant messaging, click streams, log files, data
metering, images, audio, video, information about share rates (stock ticker), financial transactions, etc.

Implementing Big Data is a mammoth task given the large volume, velocity and variety. "Big Data" is a
term encompassing the use of techniques to capture, process, analyze and visualize potentially large
datasets in a reasonable timeframe not accessible to standard IT technologies. By extension, the
platform, tools and software used for this purpose are collectively called "Big Data technologies" [7].
Currently, the most commonly implemented technology is Hadoop, which is the culmination of several other
technologies like the Hadoop Distributed File System, Pig, Hive, HBase, etc. However, even Hadoop and
other existing techniques will be highly incapable of dealing with the complexities of Big Data in the
near future. The following are a few cases where standard processing approaches will fail due to Big Data:

Large Synoptic Survey Telescope (LSST): over thirty thousand gigabytes (30 TB) of images will be
generated every night during the decade-long LSST sky survey. [8]

 There is a corollary to Parkinson's Law that states: "Data expands to fill the space available
for storage." [9]
 This is no longer true, since the data being generated will soon exceed all available storage
space. [10][8]
 72 hours of video are uploaded to YouTube every minute. [11]

Variability: refers to how changeable the meaning of the data is. This is found especially with natural
language processing. Companies have to develop sophisticated programmes which can understand the context
and decode the precise meaning of words;

Visualization: refers to how readable and accessible the data presentation is. Many spatial and temporal
parameters and relationships between them have to be used in order to obtain something which is easily
comprehensible and actionable;

Value: refers to the capacity of the data to bring new insights for creating knowledge.

There are at present two general approaches to big data:

 Divide and Conquer using Hadoop: the huge data set is spread into smaller parts and processed in
parallel fashion using many servers (a small stand-alone sketch of this pattern follows Figure 2 below).
 Brute Force using technology on the likes of SAP HANA: the data set is compressed and handled as a
single unit by one very powerful server with massive storage.

It is important to understand that what is thought to be big data today won't seem so big in the future
[Franks, 2012]. Many data sources are currently untapped, or at least underutilized. For example, every
customer e-mail, customer-service chat, and social media comment may be captured, stored, and analyzed to
better understand customers' sentiments. Web browsing data may capture every mouse movement in order to
better understand customers' shopping behaviours. Radio frequency identification (RFID) tags may be
placed on every single piece of merchandise in order to assess the condition and location of every item.
Figure 1 shows the projected growth of big data.


Figure-2: The Exponential Growth of Big Data
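
To make the "Divide and Conquer" approach above concrete, here is a minimal stand-alone Python sketch
(not Hadoop itself) that splits a data set into chunks, processes the chunks in parallel worker
processes, and then combines the partial results; Hadoop applies the same pattern across many servers.

# divide_and_conquer.py -- toy illustration of the split / process-in-parallel /
# combine pattern that Hadoop distributes across a whole cluster.
from multiprocessing import Pool

def count_words(chunk):
    """Process one chunk independently (the 'divide' step's unit of work)."""
    return sum(len(line.split()) for line in chunk)

if __name__ == "__main__":
    lines = [f"record number {i} with some text" for i in range(100_000)]
    chunks = [lines[i::4] for i in range(4)]            # split the data set four ways
    with Pool(processes=4) as pool:
        partial_counts = pool.map(count_words, chunks)  # process chunks in parallel
    print("total words:", sum(partial_counts))          # combine the partial results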

III. BIG DATA ANALYTICS

Big Data Analytics (BDA) is a new approach in information management which provides a set of capabilities
for revealing additional value from BD. It is defined as "the process of examining large amounts of data,
from a variety of data sources and in different formats, to deliver insights that can enable decisions in
real or near real time" [4]. BDA can be used to identify patterns, correlations and anomalies [4], [5].
BDA is a different concept from those of Data Warehouse (DW) or Business Intelligence (BI) systems.
Gartner defines a DW as "a storage architecture designed to hold data extracted from transaction systems,
operational data stores and external sources. The warehouse then combines that data in an aggregate,
summary form suitable for enterprise-wide data analysis and reporting for predefined business needs" [6].
BI is defined as "a set of methodologies, processes, architectures, and technologies that transform raw
data into meaningful and useful information used to enable more effective strategic, tactical, and
operational insights and decision-making" [7]. By itself, stored data does not generate business value,
and this is true of traditional databases, data warehouses, and the new technologies for storing big data
(e.g., Hadoop). Once the data is appropriately stored, however, it can be analyzed and this can create
tremendous value. A variety of analysis technologies, approaches, and products have emerged that are
especially applicable to big data, such as in-memory analytics, in-database analytics, and appliances.

IV. ARCHITECTURES FOR BD SYSTEMS

The complexity of BD systems required the development of a specialized architecture. Nowadays, the most
commonly used BD architecture is Hadoop. It has redefined data management because it processes large
amounts of data, timely and at a low cost.

The Hadoop Framework

Traditional SQL database management systems are no longer suited to manage such large and complex data
sets as in BD. When working with large volumes of data we need a solution that allows low-cost storage,
while also ensuring good processing performance. One possible solution is the Apache Hadoop software
framework.

Hadoop [8] is an open source project developed by Apache which can be used for the distributed processing
of large data sets. It runs on multiple clusters using simple programming models. The design of the Hadoop
framework ensured its scalability even when tasks are run on thousands of computers, each with its own
processing and storage capability.

Since 2010, Hadoop has been widely adopted by organizations for the storage of large volumes of data and
as a platform for data analysis. Hadoop is currently used by many companies for which the volume of data
generated daily exceeds the storage and processing capacity of conventional systems. Adobe, AOL,
Amazon.com, eBay, Facebook, Google, LinkedIn, Twitter and Yahoo are some of the companies using Hadoop.

Additional software packages can be installed on top of or alongside Hadoop, forming what is called the
Hadoop ecosystem. They are designed to work together as an effective solution for the storage and
processing of data. The Hadoop products which are integrated into most distributions are HDFS, MapReduce,
HBase, Hive, Mahout, Oozie, Pig, Sqoop, Whirr, Zookeeper and Flume.


Figure-3: The Hadoop Ecosystem

The core of Apache Hadoop consists of two components: a distributed file system (HDFS - Hadoop
Distributed File System) and a framework for distributed processing (MapReduce) [9]. Hadoop was designed
to operate in a cluster architecture built on common server equipment. Given the distributed storage, the
location of the data is not known beforehand, being determined by Hadoop (HDFS). Each block of information
is copied to multiple physical machines to avoid any problems caused by faulty hardware. Unlike
traditional systems, Apache Hadoop provides a limited set of functionalities for data processing
(MapReduce), but has the ability to improve its performance and its storage capacity as it is installed
on more physical machines. MapReduce processing divides the problem into sub-problems which can be solved
independently (the map phase), in the manner of "divide et impera". Each of the sub-problems is executed
as close to the data on which it must operate as possible. The results of the sub-problems are then
combined according to needs (the reduce phase). These components build the foundation of the four layers
of the Hadoop Ecosystem, which make up a collection of additional software packages [9], [10].

Data Storage Layer, for storing data in a distributed file system. It consists of:

 HDFS: the main distributed storage.

 HBase: a NoSQL column-oriented distributed database based on the Google BigTable model which uses HDFS
as storage media. It is used in Hadoop applications which require random read/write operations on very
large data sets, or for applications which have many clients. HBase has three main components: a client
library, a master server, and several region servers (a small client sketch follows this list).

 YARN: a resource management platform which ensures security and data governance on different clusters.

 Hive: a data storage platform (DW) used for querying and managing large data sets from distributed
storage. Hive uses a SQL query language named HiveQL.

 Avro: serializes the data, manages remote procedure calls and exchanges data from one program or
language to another. Data is saved based on its own schema because this enables its use with scripting
languages such as Pig.
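
As a minimal illustration of the random read/write access HBase provides, the sketch below uses the
happybase Python client. It assumes an HBase Thrift server running on localhost and a pre-created table
named "users" with a column family "info"; those names are invented for illustration only.

# hbase_client_sketch.py -- illustrative only; assumes an HBase Thrift server on
# localhost and an existing table "users" with column family "info".
import happybase

connection = happybase.Connection("localhost")    # connect to the Thrift gateway
table = connection.table("users")

# Random write: store one row keyed by user id.
table.put(b"user-001", {b"info:name": b"Alice", b"info:city": b"Tumakuru"})

# Random read: fetch that row back by key.
row = table.row(b"user-001")
print(row.get(b"info:name"))                      # -> b'Alice'

connection.close()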

Figure-4: The architecture of a BD integration ecosystem
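
To make the map and reduce phases described above concrete, here is a minimal word-count sketch in the
Hadoop Streaming style, where the mapper and reducer are plain scripts that read standard input and write
standard output. This is an illustrative sketch rather than code from the paper; the streaming jar
location and the HDFS input/output paths are assumptions that vary by installation.

# wordcount_streaming.py -- illustrative word count in the Hadoop Streaming style.
# Run the same file as both mapper and reducer, for example:
#   hadoop jar <path-to-hadoop-streaming.jar> \
#     -input /data/in -output /data/out \
#     -mapper "python3 wordcount_streaming.py map" \
#     -reducer "python3 wordcount_streaming.py reduce"
# (the jar path and HDFS paths above are assumptions that depend on the cluster)
import sys

def map_phase():
    # Emit one "word<TAB>1" pair per word; each sub-problem is independent.
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

def reduce_phase():
    # Hadoop delivers the pairs sorted by key, so equal words arrive together.
    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")

if __name__ == "__main__":
    map_phase() if sys.argv[1:] == ["map"] else reduce_phase()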


Data sources: the emergence of tables which are stored in the Cloud and of mobile infrastructures has led
to a significant increase in the size and complexity of data sets. The data integration ecosystems must
thus include multiple strategies for the access and storage of a huge quantity of very varied data. The
following classification can be made:

Data storage: the data collected is stored in NoSQL/SQL databases, or in Log Management systems for logs.

Data Transformation: in order to load data into the processing phase, it must first be transformed by
using import/export tools (SQL/NoSQL vendor-specific tools), Sqoop (a data-source-to-Hadoop transformation
tool), or log management tools.

Data Processing: both structured and unstructured data are combined so that batch processing or real-time
processing can be performed. Data warehousing and processing then generate usable data for data
consumption.

Data Analysis: can be performed using DWs, which ensure the necessary basic information. New functionality
must be added for the better integration of unstructured data sources and for satisfying the level of
performance required by analysis platforms. In order to support strategic decisions, operational analysis
has to be separated from deep analysis, which makes use of historical data.

Data Consumption: the results of the data analysis have to be presented in a readable and accessible form
to the final users. Query reports or visualisation charts can be used.
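
A minimal end-to-end sketch of the layered flow just described (sources, storage, transformation,
processing, consumption), using plain Python in place of the real ecosystem components. The log lines,
field names and report file are invented for illustration only.

# pipeline_sketch.py -- toy walk-through of the integration layers described above.
import csv
from collections import Counter

# Data source / storage: raw web-server log lines (assumed format: "user url status").
raw_lines = [
    "alice /home 200",
    "bob /checkout 500",
    "alice /checkout 200",
]

# Data transformation: parse the raw lines into structured records.
records = [dict(zip(("user", "url", "status"), line.split())) for line in raw_lines]

# Data processing: batch aggregation (here, error counts per URL).
errors_per_url = Counter(r["url"] for r in records if r["status"].startswith("5"))

# Data consumption: write a small report for the final users.
with open("error_report.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["url", "error_count"])
    writer.writerows(errors_per_url.items())

print(dict(errors_per_url))   # -> {'/checkout': 1}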

REFERENCES:

[1] J. Gantz, D. Reinsel, "Extracting value from chaos", IDC iView, 2011, pp. 1–12.

[2] E. Mcnulty, "Understanding Big Data: The Seven V's", Dataconomy, May 22, 2014.

[3] Gartner, "Big Data Strategy Components: Business Essentials", October 9, 2012.

[4] Gartner, "IT glossary: big data" [webpage on the Internet]. Stamford, CT; 2012.

[5] Canada Inforoute, "Big Data Analytics in health", White Paper, Full Report, April 2013.

[6] A. Alexandru, D. Coardos, "BD in Tackling Energy Efficiency in Smart City", Scientific Bulletin of the
Electrical Engineering Faculty, vol. 28, no. 4, pp. 14-20, 2014, Bibliotheca Publishing House, ISSN
1843-6188.

[7] Arthur G. Erdman, Daniel F. Keefe, and Randall Schiestl, "Applying Regulatory Science and Big Data to
Improve Medical Device Innovation", IEEE Transactions on Biomedical Engineering, vol. 60, no. 3, March
2013.

[8] http://lsst.org/lsst/google

[9] http://en.wikipedia.org/wiki/Parkinson's_law

[10] http://www.economist.com/node/15557443

[11] http://www.youtube.com/t/press_statistics/?hl=en

[12] http://www.internetlivestats.com/twitter-statistics/

[13] http://www.internetlivestats.com/google-search-statistics/

