BDA Notes (Unit-1)


Introduction to Big Data Analytics

1. Overview of Big Data:

Introduction to Big Data:

Big data refers to the vast amount of structured and unstructured data that inundates
businesses on a day-to-day basis. This data is characterized by its large volume, high
velocity, and diverse variety.

Example: Social media platforms like Facebook generate massive amounts of data
daily through user interactions, posts, likes, comments, etc.

Importance of Big Data:

Big data analytics helps organizations make data-driven decisions, gain insights, and
identify patterns that were previously hidden.

Example: Retailers analyze customer purchase history and behavior to personalize
marketing campaigns and improve customer satisfaction.

Characteristics of Big Data:

Volume: Refers to the sheer amount of data generated and collected, often ranging
from terabytes to petabytes.
Velocity: Indicates the speed at which data is generated, processed, and analyzed,
requiring real-time or near-real-time processing.
Variety: Represents the diverse types of data, including structured, semi-structured,
and unstructured data such as text, images, videos, sensor data, etc.
Veracity: Relates to the accuracy and reliability of the data, as big data can be noisy,
incomplete, or inconsistent.
Value: Refers to the potential insights and actionable information that can be extracted
from the data.

Example: Internet of Things (IoT) devices generate massive data streams from many
sources (volume) at high speed (velocity) in diverse forms such as sensor readings
and images (variety), and these readings may contain errors or inconsistencies
(veracity).

Clustering in Big Data Analytics:

Source: https://www.youtube.com/watch?v=aBCDy-dJE0Y

Applications of Big Data:

Big data analytics finds applications across various domains including healthcare,
finance, retail, marketing, and cybersecurity.

Example: Healthcare organizations use big data analytics to analyze patient records,
predict disease outbreaks, and personalize treatment plans.

2. Hadoop Ecosystem:

Introduction to Hadoop:

Hadoop is an open-source framework for distributed storage and processing of large
datasets across clusters of commodity hardware.

Example: Hadoop is used by companies like Yahoo, Facebook, and eBay for
processing and analyzing vast amounts of data efficiently.

Hadoop Distributed File System (HDFS):

HDFS is the primary storage system used by Hadoop for storing large files across
multiple nodes in a Hadoop cluster.

Example: In a Hadoop cluster, large datasets are divided into smaller blocks and
distributed across multiple nodes in the cluster for parallel processing and fault
tolerance.
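
The following is a minimal conceptual sketch in Python (not the real HDFS API) of how a large file might be split into fixed-size blocks and replicated across datanodes. The 128 MB block size and replication factor of 3 are HDFS defaults; the round-robin placement is a simplification, since real HDFS placement is rack-aware.

```python
# Conceptual sketch only: illustrates block splitting and replication,
# not the actual HDFS implementation.
BLOCK_SIZE = 128 * 1024 * 1024   # default HDFS block size: 128 MB
REPLICATION = 3                  # default HDFS replication factor

def place_blocks(file_size, datanodes):
    """Return a block -> datanodes placement for a file of file_size bytes."""
    num_blocks = -(-file_size // BLOCK_SIZE)  # ceiling division
    placement = {}
    for b in range(num_blocks):
        # Round-robin placement; real HDFS also considers rack locality.
        replicas = [datanodes[(b + r) % len(datanodes)] for r in range(REPLICATION)]
        placement[f"block_{b}"] = replicas
    return placement

nodes = ["node1", "node2", "node3", "node4"]
print(place_blocks(1 * 1024**3, nodes))  # a 1 GB file -> 8 blocks, 3 replicas each
```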

MapReduce Paradigm:

MapReduce is a programming model for processing and generating large datasets in
parallel across a distributed cluster.

Example: In a word count example, MapReduce breaks down the task of counting
words in a large document into smaller tasks (mapping) and then aggregates the results
(reducing) to produce the final word count.
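
As an illustration, here is a small self-contained Python simulation of that word-count flow. Real Hadoop jobs implement Mapper and Reducer classes (typically in Java), so this sketch only mirrors the map, shuffle, and reduce phases conceptually.

```python
from collections import defaultdict

def mapper(document):
    # Map phase: emit (word, 1) for every word in the input split.
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(mapped_pairs):
    # Shuffle phase: group all emitted values by key.
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reducer(word, counts):
    # Reduce phase: aggregate the counts for one word.
    return (word, sum(counts))

docs = ["big data needs big tools", "data tools for big data"]
mapped = [pair for doc in docs for pair in mapper(doc)]
grouped = shuffle(mapped)
print(dict(reducer(w, c) for w, c in grouped.items()))
# {'big': 3, 'data': 3, 'needs': 1, 'tools': 2, 'for': 1}
```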

3. Big Data Processing Frameworks:

Apache Spark and its Features:

Apache Spark is an open-source distributed computing system that provides high-level
APIs for in-memory data processing.
Example: Spark is used for real-time data processing in applications like streaming
analytics, machine learning, and graph processing.
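
A short PySpark sketch of the word-count idea using the RDD API (assumes a local Spark installation and the pyspark package):

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "wordcount-demo")

lines = sc.parallelize(["spark keeps data in memory", "spark is fast"])
counts = (lines.flatMap(lambda line: line.split())   # split lines into words
               .map(lambda word: (word, 1))          # emit (word, 1) pairs
               .reduceByKey(lambda a, b: a + b))     # sum counts per word
print(counts.collect())   # e.g., [('spark', 2), ('is', 1), ...]
sc.stop()
```
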
Introduction to Apache Flink:

Apache Flink is an open-source stream processing framework for distributed,
high-performance, and fault-tolerant data processing.

Example: Flink is used in applications where low-latency processing and event-driven
architectures are required, such as fraud detection and real-time analytics.

Video playlist for Apache Flink:

https://www.youtube.com/watch?v=3cg5dABA6mo&list=PLa7VYi0yPIH1UdmQcnUr8lvjbUV8JriK0&index=1

What is Apache Flink? — Architecture

Apache Flink is a framework and distributed processing engine for stateful computations
over unbounded and bounded data streams. Flink has been designed to run in all
common cluster environments, perform computations at in-memory speed and at any
scale.

Here, we explain important aspects of Flink’s architecture.

Process Unbounded and Bounded Data

Any kind of data is produced as a stream of events. Credit card transactions, sensor
measurements, machine logs, or user interactions on a website or mobile application,
all of these data are generated as a stream.

Data can be processed as unbounded or bounded streams.


1. Unbounded streams have a start but no defined end. They do not terminate and
provide data as it is generated. Unbounded streams must be continuously
processed, i.e., events must be promptly handled after they have been ingested.
It is not possible to wait for all input data to arrive because the input is
unbounded and will not be complete at any point in time. Processing unbounded
data often requires that events are ingested in a specific order, such as the order
in which events occurred, to be able to reason about result completeness.
2. Bounded streams have a defined start and end. Bounded streams can be
processed by ingesting all data before performing any computations. Ordered
ingestion is not required to process bounded streams because a bounded data
set can always be sorted. Processing of bounded streams is also known as batch
processing.

Apache Flink excels at processing both unbounded and bounded data sets. Precise
control of time and state enables Flink's runtime to run any kind of application on
unbounded streams. Bounded streams are internally processed by algorithms and data
structures that are specifically designed for fixed-size data sets, yielding excellent
performance.
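
The distinction can be sketched in plain Python (a conceptual illustration, not Flink's API): an unbounded source must be processed incrementally, event by event, while a bounded data set can be fully ingested and sorted first.

```python
import itertools
import random
import time

def unbounded_source():
    # An unbounded stream: it never terminates and yields events as produced.
    while True:
        yield {"ts": time.time(), "value": random.random()}

def process_unbounded(stream, max_events):
    # Events must be handled promptly; we can never wait for "all" input.
    running_sum = 0.0
    for i, event in enumerate(itertools.islice(stream, max_events)):
        running_sum += event["value"]  # incremental, stateful update
        print(f"event {i}: running average = {running_sum / (i + 1):.3f}")

def process_bounded(events):
    # A bounded data set can be fully ingested and sorted before computing.
    ordered = sorted(events, key=lambda e: e["ts"])
    return sum(e["value"] for e in ordered) / len(ordered)

process_unbounded(unbounded_source(), max_events=3)
print(process_bounded([{"ts": 2, "value": 0.5}, {"ts": 1, "value": 0.7}]))
```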

Convince yourself by exploring the use cases that have been built on top of Flink.

Deploy Applications Anywhere

Apache Flink is a distributed system and requires compute resources in order to
execute applications. Flink integrates with all common cluster resource managers such
as Hadoop YARN and Kubernetes, but can also be set up to run as a stand-alone cluster.

Flink is designed to work well with each of the previously listed resource managers. This
is achieved by resource-manager-specific deployment modes that allow Flink to interact
with each resource manager in its idiomatic way.

When deploying a Flink application, Flink automatically identifies the required resources
based on the application’s configured parallelism and requests them from the resource

manager. In case of a failure, Flink replaces the failed container by requesting new
resources. All communication to submit or control an application happens via REST
calls. This eases the integration of Flink in many environments.

Run Applications at any Scale

Flink is designed to run stateful streaming applications at any scale. Applications are
parallelized into possibly thousands of tasks that are distributed and concurrently
executed in a cluster. Therefore, an application can leverage virtually unlimited amounts
of CPUs, main memory, disk and network IO. Moreover, Flink easily maintains very
large application state. Its asynchronous and incremental checkpointing algorithm
ensures minimal impact on processing latencies while guaranteeing exactly-once state
consistency.

Users reported impressive scalability numbers for Flink applications running in their
production environments, such as

● applications processing multiple trillions of events per day,
● applications maintaining multiple terabytes of state, and
● applications running on thousands of cores.

Leverage In-Memory Performance

Stateful Flink applications are optimized for local state access. Task state is always
maintained in memory or, if the state size exceeds the available memory, in
access-efficient on-disk data structures. Hence, tasks perform all computations by
accessing local, often in-memory, state yielding very low processing latencies. Flink
guarantees exactly-once state consistency in case of failures by periodically and
asynchronously checkpointing the local state to durable storage.
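
A conceptual Python sketch of this idea (not Flink's actual implementation; the file name and class structure are illustrative): state lives in local memory, is periodically snapshotted to durable storage, and is restored on restart.

```python
import copy
import json
import os

CHECKPOINT_PATH = "checkpoint.json"   # stand-in for durable storage

class StatefulTask:
    def __init__(self):
        self.state = {}               # local, in-memory state
        if os.path.exists(CHECKPOINT_PATH):
            with open(CHECKPOINT_PATH) as f:
                self.state = json.load(f)   # recover after a failure

    def process(self, key):
        # All computation touches only local state: low latency.
        self.state[key] = self.state.get(key, 0) + 1

    def checkpoint(self):
        # Snapshot a consistent copy to durable storage. Flink does this
        # asynchronously and incrementally, which this sketch does not model.
        snapshot = copy.deepcopy(self.state)
        with open(CHECKPOINT_PATH, "w") as f:
            json.dump(snapshot, f)

task = StatefulTask()
for event in ["a", "b", "a"]:
    task.process(event)
task.checkpoint()   # state now survives a restart
```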

Comparative Analysis of Hadoop and Spark:

Hadoop is primarily designed for batch processing and is optimized for disk-based
storage, while Spark provides in-memory processing, making it faster for iterative
algorithms and interactive queries.

Example: For processing large historical datasets stored in HDFS, Hadoop MapReduce
might be suitable, whereas Spark is preferable for iterative machine learning algorithms
or interactive data analysis.

Comparison Between Apache Spark and Hadoop MapReduce

Below is a feature-wise comparison of Apache Spark and Hadoop MapReduce;
let's discuss each point in detail –

i. Introduction

Apache Spark – It is an open-source big data framework that provides a faster, more
general-purpose data processing engine. Spark is designed for fast computation and
covers a wide range of workloads, for example batch, interactive, iterative, and
streaming.
Hadoop MapReduce – It is also an open-source framework, used for writing applications
that process structured and unstructured data stored in HDFS. Hadoop MapReduce is
designed to process large volumes of data on a cluster of commodity hardware, and it
processes data in batch mode.

ii. Speed

Apache Spark – Spark is a lightning-fast cluster computing tool. It runs applications up
to 100x faster in memory and 10x faster on disk than Hadoop, which it achieves by
reducing the number of read/write cycles to disk and keeping intermediate data in
memory.
Hadoop MapReduce – MapReduce reads from and writes to disk between stages, which
slows down the processing speed.

iii. Difficulty

Apache Spark – Spark is easy to program because it offers many high-level operators
on the RDD (Resilient Distributed Dataset) abstraction.
Hadoop MapReduce – In MapReduce, developers need to hand-code each and every
operation, which makes it much harder to work with.

iv. Easy to Manage

Apache Spark – Spark can perform batch, interactive, machine learning, and streaming
workloads all in the same cluster, which makes it a complete data analytics engine.
There is no need to manage a different component for each need; installing Spark on a
cluster is enough to handle all the requirements.
Hadoop MapReduce – MapReduce provides only a batch engine, so other requirements
depend on different engines, for example Storm, Giraph, and Impala. It is therefore very
difficult to manage many components.

v. Real-time analysis

Apache Spark – It can process real-time data, i.e., data coming from real-time event
streams at the rate of millions of events per second, e.g., Twitter data or Facebook
sharing/posting. Spark's strength is its ability to process live streams efficiently.
Hadoop MapReduce – MapReduce falls short when it comes to real-time data
processing, as it was designed to perform batch processing on voluminous amounts of
data.

vi. Latency

Apache Spark – Spark provides low-latency computing.
Hadoop MapReduce – MapReduce is a high-latency computing framework.

vii. Interactive mode

Apache Spark – Spark can process data interactively.
Hadoop MapReduce – MapReduce doesn't have an interactive mode.

viii. Streaming

Apache Spark – Spark can process real-time data through Spark Streaming.
Hadoop MapReduce – With MapReduce, you can only process data in batch mode.
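
A minimal PySpark Streaming (DStream API) sketch; it assumes a text source on localhost:9999 (e.g., started with `nc -lk 9999`) and counts words in 5-second batches. Note that newer Spark versions favor Structured Streaming, but the DStream API shown here is the classic Spark Streaming interface.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "streaming-demo")  # >= 2 threads: receiver + worker
ssc = StreamingContext(sc, batchDuration=5)      # 5-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()   # print each batch's counts

ssc.start()
ssc.awaitTermination()
```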

ix. Ease of use

Apache Spark – Spark is easier to use, since its RDD abstraction enables a user to
process data with high-level operators. It also provides rich APIs in Java, Scala,
Python, and R.
Hadoop MapReduce – MapReduce is more complex: developers must work with
low-level APIs to process the data, which requires a lot of hand coding.

x. Recovery

Apache Spark – RDDs allow recovery of partitions on failed nodes by recomputing the
lineage DAG. Spark also supports a recovery style more similar to Hadoop's by way of
checkpointing, which reduces an RDD's dependencies (see the sketch below).
Hadoop MapReduce – MapReduce is naturally resilient to system faults or failures, so
it is a highly fault-tolerant system.
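
A hedged PySpark sketch of checkpointing (the checkpoint directory path is illustrative): checkpoint() persists the RDD to reliable storage and truncates its lineage, so recovery does not need to recompute the full DAG.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "checkpoint-demo")
sc.setCheckpointDir("/tmp/spark-checkpoints")   # illustrative, writable path

rdd = sc.parallelize(range(1000)).map(lambda x: x * x)
rdd.checkpoint()            # marked for checkpointing (lazy)
rdd.count()                 # first action materializes the checkpoint
print(rdd.isCheckpointed()) # True
sc.stop()
```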

xi. Scheduler

Apache Spark – Due to in-memory computation, Spark acts as its own flow scheduler.
Hadoop MapReduce – MapReduce needs an external job scheduler, for example Oozie,
to schedule complex flows.

xii. Fault tolerance

Apache Spark – Spark is fault-tolerant, so there is no need to restart the application
from scratch in case of any failure.

Hadoop MapReduce – Like Apache Spark, MapReduce is also fault-tolerant, so there is
no need to restart the application from scratch in case of any failure.

xiii. Security

Apache Spark – Spark is a little less secure than MapReduce because it supports only
authentication through a shared secret (password authentication).
Hadoop MapReduce – Apache Hadoop MapReduce is more secure because of
Kerberos, and it also supports Access Control Lists (ACLs), a traditional file-permission
model.

xiv. Cost

Apache Spark – Spark requires a lot of RAM to run in memory, which increases the
cluster size and thus its cost.
Hadoop MapReduce – MapReduce is the cheaper option when compared in terms of
cost.

xv. Language Developed

Apache Spark – Spark is developed in Scala.
Hadoop MapReduce – Hadoop MapReduce is developed in Java.

xvi. Category

Apache Spark – It is a data analytics engine; hence, it is a popular choice for data
scientists.
Hadoop MapReduce – It is a basic data processing engine.

xvii. License

Apache Spark – Apache License 2.0
Hadoop MapReduce – Apache License 2.0

xviii. OS support

Apache Spark – Spark is cross-platform.
Hadoop MapReduce – Hadoop MapReduce is also cross-platform.

xix. Programming Language support

Apache Spark – Scala, Java, Python, R, SQL.
Hadoop MapReduce – Primarily Java; other languages like C, C++, Ruby, Groovy, Perl,
and Python are also supported via Hadoop Streaming.

xx. SQL support

Apache Spark – It enables the user to run SQL queries using Spark SQL.
Hadoop MapReduce – It enables users to run SQL queries using Apache Hive.
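
A small hedged sketch of Spark SQL via the DataFrame API (the table and column names are made up for illustration):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-demo").getOrCreate()

# Build a tiny DataFrame and expose it as a SQL view.
df = spark.createDataFrame([("alice", 34), ("bob", 29)], ["name", "age"])
df.createOrReplaceTempView("people")

spark.sql("SELECT name FROM people WHERE age > 30").show()
spark.stop()
```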

xxi. Scalability

Apache Spark – Spark is highly scalable; we can keep adding nodes to the cluster. The
largest known Spark cluster has about 8,000 nodes.
Hadoop MapReduce – MapReduce is also highly scalable; we can likewise keep adding
nodes. The largest known Hadoop cluster has about 14,000 nodes.

xxii. Lines of code

Apache Spark – Apache Spark is developed in merely about 20,000 lines of code.
Hadoop MapReduce – Hadoop 2.0 has about 120,000 lines of code.

xxiii. Machine Learning

Apache Spark – Spark has its own machine learning library, MLlib.
Hadoop MapReduce – Hadoop requires an external machine learning tool, for example
Apache Mahout.

xxiv. Caching

Apache Spark – Spark can cache data in memory for further iterations, which enhances
system performance (see the sketch below).
Hadoop MapReduce – MapReduce cannot cache data in memory for future
requirements, so its processing speed is not as high as Spark's.
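
A hedged PySpark sketch of caching for iterative workloads: the first action materializes the RDD in memory, and subsequent iterations reuse the cached data instead of recomputing from the source.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "cache-demo")

# Mark the RDD for in-memory caching; nothing is computed yet (lazy).
data = sc.parallelize(range(1_000_000)).map(lambda x: x * 2)
data.cache()

for i in range(3):
    # The first action computes and caches; later passes read from memory.
    print(f"iteration {i}: sum = {data.sum()}")

sc.stop()
```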

xxv. Hardware Requirements

Apache Spark – Spark needs mid- to high-level hardware.
Hadoop MapReduce – MapReduce runs very well on commodity hardware.

xxvi. Community

Apache Spark – Spark is one of the most active projects at Apache, with a very strong
community.
Hadoop MapReduce – Much of the MapReduce community has shifted to Spark.

Apache Spark Vs Hadoop MapReduce – Infographic

Infographic: Difference Between Hadoop and Spark

Image: https://data-flair.training/blogs/wp-content/uploads/sites/2/2016/09/Apache-Spark-vs-Hadoop-mapreduce-Infographic-01.jpg

Case Studies on Real-world Big Data Applications:

Examples of real-world applications of big data analytics include Netflix's
recommendation system, which analyzes user behavior and preferences to suggest
personalized content, and Uber's dynamic pricing algorithm, which processes real-time
data to adjust fares based on supply and demand.

These case studies demonstrate how big data analytics can drive business value and
improve customer experiences through data-driven insights and decision-making.
