Unit 1
Data Analysis Has Been Around for a While…
Howard Dresner
Data Science: Why all the Excitement?
Exciting new, effective applications of data analytics, e.g.,
Google Flu Trends: detecting outbreaks two weeks ahead of CDC data
4
PageRank: The web as a behavioral dataset
Sponsored search
• Google revenue is around $50 bn/year from advertising, about 97%
of the company's revenue.
• Text Data, Social Media Data → Product Review and Consumer Satisfaction
(Facebook, Twitter, LinkedIn), E-discovery
9
“Big Data” Sources (petabytes, exabytes, or zettabytes)
It's All Happening On-line: User Generated (Web & Mobile)
Every:
Click
Ad impression
Billing event
Fast forward, pause, …
Server request
Transaction
Network message
Fault
…
12
Data Science – A Definition
Data Science is the discipline that uses computer science, statistics,
machine learning, visualization, and human-computer interaction to
collect, clean, integrate, analyze, visualize, and interact with data in order to create
data products.
13
Goal of Data Science
• Improve/validate on a few, relatively clean, small datasets → publish a paper.
• Develop/use tools that can handle massive datasets → take action!
Big Data vs Data Analytics vs Data Science:
What’s The Difference?
• Big data refers to any large and complex collection of data.
• Data analytics is the process of extracting meaningful information
from data.
• Data science is a multidisciplinary field that aims to produce broader
insights.
19
What is big data?
• As the name suggests, big data simply refers to extremely large data
sets (petabytes, exabytes, or zettabytes).
• This size, combined with the complexity and evolving nature of these
data sets, means they surpass the capabilities of traditional
data management tools.
• Some data sets that we can consider truly big data include:
➢Stock market data
➢Social media
➢Sporting events and games
➢Scientific and research data
20
Facets of data
• Structured
• Semi-structured
• Unstructured
• Natural language
• Machine-generated
• Graph-based
• Audio, video, and images
• Streaming
21
Types of big data
• Structured data.
➢Any data set that adheres to a specific structure can be called structured data.
➢Structured data sets can be processed relatively easily compared to other data
types because users know the exact structure of the data.
➢A good example of structured data is a distributed RDBMS, which holds data
in organized table structures, or an Excel spreadsheet.
22
• Semi-structured data.
➢This type of data does not adhere to a specific structure yet retains some kind of
observable structure, such as a grouping or an organized hierarchy.
➢Some examples of semi-structured data are markup languages (XML), web pages,
emails, etc.
• Unstructured data.
➢This type of data consists of data that does not adhere to a schema or a preset
structure.
➢It is the most common type of data when dealing with big data; things like text,
pictures, video, and audio all fall under this type.
• Natural language
➢Natural language is a special type of unstructured data; it’s challenging to process
because it requires knowledge of specific data science techniques and linguistics.
➢The natural language processing community has had success in entity recognition,
topic recognition, summarization, text completion, and sentiment analysis, but models
trained in one domain don’t generalize well to other domains.
• Machine-generated data
➢Machine-generated data is information that’s automatically created by a computer,
process, application, or other machine without human intervention.
➢Machine-generated data is becoming a major data resource and will continue to do so
➢ Examples of machine data are web server logs, call detail records, network event logs,
and telemetry (data collected remotely using sensors).
23
• Graph-based or network data
➢ Data that focuses on the relationship or adjacency of objects.
➢Graph structures use nodes, edges, and properties to represent and store
graph data.
➢Graph-based data is a natural way to represent social networks, and its
structure allows you to calculate specific metrics such as the influence of a
person and the shortest path between two people
➢Examples of graph-based data can be found on many social media websites
➢For instance, on LinkedIn you can see who you know at which company.
➢Your follower list on Twitter is another example of graph-based data.
➢The power and sophistication come from multiple, overlapping graphs of the
same nodes.
➢For example, imagine the connecting edges here showing “friends” on
Facebook. Imagine another graph with the same people which connects
business colleagues via LinkedIn. Imagine a third graph based on movie
interests on Netflix. Overlaying the three different-looking graphs makes
more interesting questions possible (see the short sketch after this list).
• Streaming data
➢The data flows into the system when an event happens instead of being
loaded into a data store in a batch.
➢ Examples are the “What’s trending” on Twitter, live sporting or music
events, and the stock market.
24
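As a small illustration of graph-based data (see the graph-based bullet above), here is a minimal sketch using the networkx Python library; the people, the edges, and the idea of overlaying a "friends" graph with a "colleagues" graph are invented for illustration, not data from any real social network.

# Minimal sketch of graph-based data using the networkx library.
# The people and relationships below are invented for illustration.
import networkx as nx

# "Friends" graph (think: Facebook connections)
friends = nx.Graph()
friends.add_edges_from([("Ana", "Bob"), ("Bob", "Carla"), ("Carla", "Dave")])

# "Colleagues" graph over the same people (think: LinkedIn connections)
colleagues = nx.Graph()
colleagues.add_edges_from([("Ana", "Carla"), ("Bob", "Dave")])

# Graph metrics: shortest path between two people, and a simple influence measure
print(nx.shortest_path(friends, "Ana", "Dave"))   # ['Ana', 'Bob', 'Carla', 'Dave']
print(nx.degree_centrality(friends))              # degree centrality as "influence"

# Overlaying the two graphs (a multi-layer view of the same nodes) makes
# richer questions possible, e.g. paths that mix friendship and work ties.
combined = nx.compose(friends, colleagues)
print(nx.shortest_path(combined, "Ana", "Dave"))  # a shorter 2-hop path now exists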
25
Big data systems & tools
26
• Data lakes and data warehouses are both widely used for storing big data, but
they are not interchangeable terms.
• A data lake is a vast pool of raw data, the purpose for which is not yet defined.
A data warehouse is a repository for structured, filtered data that has already
been processed for a specific purpose.
• There is even an emerging data management architecture trend of the data
lakehouse, which combines the flexibility of a data lake with the data
management capabilities of a data warehouse.
• The distinction is important because they serve different purposes and require
different sets of eyes to be properly optimized. While a data lake works for one
company, a data warehouse will be a better fit for another.
27
Data structure: raw vs. processed
• Raw data is data that has not yet been processed for a purpose. Perhaps the
greatest difference between data lakes and data warehouses is the varying
structure of raw vs. processed data. Data lakes primarily store raw,
unprocessed data, while data warehouses store processed and refined data.
• Because of this, data lakes typically require much larger storage capacity
than data warehouses.
• Additionally, raw, unprocessed data is malleable, can be quickly analyzed for
any purpose, and is ideal for machine learning. The risk of all that raw data,
however, is that data lakes sometimes become data swamps without
appropriate data quality and data governance measures in place.
• Data warehouses, by storing only processed data, save on pricey storage
space by not maintaining data that may never be used. Additionally,
processed data can be easily understood by a larger audience.
28
Purpose: undetermined vs in-use
• The purpose of individual data pieces in a data lake is not fixed. Raw
data flows into a data lake, sometimes with a specific future use in
mind and sometimes just to have on hand.
• This means that data lakes have less organization and less filtration of
data than their counterpart.
• Processed data is raw data that has been put to a specific use. Since
data warehouses only house processed data, all of the data in a data
warehouse has been used for a specific purpose within the
organization.
• This means that storage space is not wasted on data that may never
be used.
29
Users: data scientists vs business professionals
30
Accessibility: flexible vs secure
31
What is data analytics?
• Data analytics is the process of analyzing data in order to extract meaningful
information from a given data set. These analytics techniques and methods are
carried out on big data in most cases, though they certainly can be applied
to any data set.
• The primary goal of data analytics is to help individuals or organizations to
make informed decisions based on patterns, behaviors, trends, preferences,
or any type of meaningful data extracted from a collection of data.
• For example, businesses can use analytics to identify their customer
preferences, purchase habits, and market trends and then create strategies
to address them and handle evolving market conditions. In a scientific sense,
a medical research organization can collect data from medical trials and
evaluate the effectiveness of drugs or treatments accurately by analyzing
those research data.
• Combining these analytics with data visualization techniques will help you
get a clearer picture of the underlying data and present them more flexibly
and purposefully.
32
Types of analytics
While there are multiple analytics methods and techniques for data analytics,
there are four types that apply to any data set.
• Descriptive. This refers to understanding what has happened in the data
set. As the starting point in any analytics process, the descriptive analysis
will help users understand what has happened in the past.
• Diagnostic. The next step after descriptive is diagnostic, which builds on
the descriptive analysis to understand why something
happened. It allows users to identify the root causes of past events,
patterns, etc.
• Predictive. As the name suggests, predictive analytics will predict what will
happen in the future. It combines data from descriptive and diagnostic
analytics and uses ML and AI techniques to predict future trends, patterns,
problems, etc.
• Prescriptive. Prescriptive analytics takes the predictions from predictive
analytics a step further by exploring how to act on those predictions.
This can be considered the most important type of analytics, as it
allows users to understand future events and tailor strategies to handle
them effectively.
33
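To make the descriptive and predictive levels concrete, here is a toy Python sketch on made-up monthly sales figures; the numbers and the simple linear trend are assumptions chosen for illustration, not a recommended forecasting method.

# Toy illustration of descriptive vs. predictive analytics on invented monthly sales.
import numpy as np

sales = np.array([120.0, 135.0, 150.0, 160.0, 172.0, 185.0])  # months 1..6 (invented)
months = np.arange(1, len(sales) + 1)

# Descriptive: what has happened in the data set so far?
print("mean:", sales.mean(), "min:", sales.min(), "max:", sales.max())

# Predictive: fit a simple linear trend and predict month 7.
slope, intercept = np.polyfit(months, sales, 1)
print("forecast for month 7:", slope * 7 + intercept)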
Accuracy of data analytics
34
Data analytics tools & technologies
• There are both open source and commercial products for data
analytics. They will range from simple analytics tools such as
Microsoft Excel’s Analysis ToolPak that comes with Microsoft Office
to SAP BusinessObjects suite and open source tools such as Apache
Spark.
• Among cloud providers, Azure is widely regarded as a strong
platform for data analytics needs. It provides a complete toolset to
cater to any need with its Azure Synapse Analytics suite, Apache
Spark-based Databricks, HDInsight, Machine Learning, etc.
• AWS and GCP also provide tools such as Amazon QuickSight, Amazon
Kinesis, GCP Stream Analytics to cater to analytics needs.
35
What is data science?
Unlike the first two, data science cannot be limited to a single function or
field. Data science is a multidisciplinary approach that extracts information
from data by combining:
• Scientific methods
• Maths and statistics
• Programming
• Advanced analytics
• ML and AI
• Deep learning
• In data analytics, the primary focus is to gain meaningful insights from
the underlying data. The scope of data science far exceeds this
purpose: data science deals with everything from analyzing complex
data and creating new analytics algorithms and tools for data processing and
purification to building powerful, useful visualizations.
36
Data science tools & technologies
• This includes programming languages like R, Python, and Julia, which can be
used to create new algorithms, ML models, and AI processes for big data
platforms like Apache Spark and Apache Hadoop.
• Data processing and purification tools such as WinPure and Data Ladder,
and data visualization tools ranging from Microsoft Power Platform, Google
Data Studio, and Tableau to visualization frameworks like Matplotlib and
Plotly, can also be considered data science tools.
• As data science covers everything related to data, any tool or
technology that is used in Big Data and Data Analytics can somehow be
utilized in the Data Science process.
37
Applications of Data Science
Internet Search
• Search engines make use of data science algorithms to deliver the best results for
search queries in seconds.
Digital Advertisements
• The entire digital marketing spectrum uses data science algorithms, from display
banners to digital billboards.
• This is the main reason that digital ads have higher click-through rates than
traditional advertisements.
Recommender Systems
• The recommender systems not only make it easy to find relevant products from
billions of available products, but they also add a lot to the user experience. Many
companies use this system to promote their products and suggestions in accordance
with the user’s demands and relevance of information. The recommendations are
based on the user’s previous search results.
38
Applications of Big Data
Big Data for Financial Services
• Credit card companies, retail banks, private wealth management advisories, insurance
firms, venture funds, and institutional investment banks all use big data for their
financial services.
• The common problem among them all is the massive amounts of multi-structured data
living in multiple disparate systems, which big data can solve.
• As such, big data is used in several ways, including:
• Customer analytics
• Compliance analytics
• Fraud analytics
• Operational analytics
Big Data in Communications
• Gaining new subscribers, retaining customers, and expanding within current subscriber
bases are top priorities for telecommunication service providers.
• The solutions to these challenges lie in the ability to combine and analyze the masses of
customer-generated data and machine-generated data that is being created every day.
Big Data for Retail
• Whether it’s a brick-and-mortar company or an online retailer, the answer to staying in the
game and being competitive is understanding the customer better.
• This requires the ability to analyze all disparate data sources that companies deal with
every day, including the weblogs, customer transaction data, social media, store-branded
credit card data, and loyalty program data.
39
Applications of Data Analytics
Healthcare
• Instrument and machine data are increasingly being used to track and optimize patient flow,
treatment, and equipment used in hospitals. It is estimated that a one percent
efficiency gain could yield more than $63 billion in global healthcare savings by
leveraging software from data analytics companies.
Travel
• Data analytics can optimize the buying experience through mobile/weblog and social media
data analysis.
• Travel websites can gain insights into the customer’s preferences.
• Products can be upsold by correlating current sales with subsequent browsing behavior,
increasing browse-to-buy conversions via customized packages and offers.
• Data analytics that is based on social media data can also deliver personalized travel
recommendations.
Gaming
• Data analytics helps in collecting data to optimize spend within and across games.
Gaming companies are also able to learn more about what their users like and dislike.
Energy Management
• Most firms are using data analytics for energy management, including smart-grid
management, energy optimization, energy distribution, and building automation in utility
companies.
• The application here is centered on controlling and monitoring network devices and
dispatch crews, as well as managing service outages. Utilities can integrate
millions of data points on network performance, giving engineers the opportunity
to use analytics to monitor the network.
40
Data science is all about:
• Asking the correct questions and analyzing the raw data.
• Modeling the data using various complex and efficient algorithms.
• Visualizing the data to get a better perspective.
• Understanding the data to make better decisions and finding the final
result.
Example:
Suppose we want to travel from station A to station B by car.
• Now, we need to make some decisions, such as which route will get us to the
destination fastest, on which route there will be no traffic jam, and which will be cost-
effective.
• All these decision factors act as input data, and we get an appropriate answer
from these decisions; this analysis of data is called data analysis, which is a part of
data science.
41
Need for Data Science:
42
• Some years ago, data was smaller and mostly available in structured form,
which could easily be stored in Excel sheets and processed using BI tools.
• But in today's world, data has become so vast (approximately 2.5
quintillion bytes of data are generated every day) that it has led to a data
explosion. It is estimated that, at present, about 1.7 MB of data
is created every second for every person on earth. Every
company requires data to work, grow, and improve its business.
• Handling such a huge amount of data is a challenging task for every
organization. To handle, process, and analyze it, we require
complex, powerful, and efficient algorithms and technology, and that
technology came into existence as data science.
• Following are some main reasons for using data science technology:
• With the help of data science technology, we can convert massive amounts of raw
and unstructured data into meaningful insights.
• Data science technology is being adopted by various companies, whether big brands or
startups. Google, Amazon, Netflix, etc., which handle huge amounts of data, are
using data science algorithms for a better customer experience.
• Data science is helping to automate transportation, such as creating self-driving
cars, which are the future of transportation.
• Data science can help with different predictions, such as surveys, elections, flight
ticket confirmation, etc.
43
Data Science Components:
44
The main components of Data Science are given below:
1. Statistics: Statistics is one of the most important components of data science. Statistics is a
way to collect and analyze numerical data in large amounts and find meaningful
insights from it.
2. Domain Expertise: In data science, domain expertise binds data science together.
Domain expertise means specialized knowledge or skills in a particular area. In data
science, there are various areas for which we need domain experts.
3. Data engineering: Data engineering is the part of data science that involves acquiring,
storing, retrieving, and transforming data. Data engineering also adds
metadata (data about data) to the data.
4. Visualization: Data visualization means representing data in a visual context so
that people can easily understand its significance. Data visualization makes
it easy to grasp huge amounts of data through visuals.
5. Advanced computing: Advanced computing does the heavy lifting of data science. It
involves designing, writing, debugging, and maintaining the source code of
computer programs.
6. Mathematics: Mathematics is a critical part of data science. Mathematics involves the
study of quantity, structure, space, and change. For a data scientist, a good
knowledge of mathematics is essential.
7. Machine learning: Machine learning is the backbone of data science. Machine learning is all
about training a machine so that it can act like a human brain. In data science,
we use various machine learning algorithms to solve problems.
45
46
Tools for Data Science
47
Machine learning in Data Science
• Following are the names of some machine learning algorithms used in
data science:
1. Regression
2. Decision tree
3. Clustering
4. Principal component analysis
5. Support vector machines
6. Naive Bayes
7. Artificial neural network
8. Apriori
48
1. Linear Regression Algorithm:
• Linear regression is one of the most popular machine learning algorithms,
based on supervised learning.
• This algorithm performs regression, which is a method of modeling a
target value based on independent variables.
• It takes the form of a linear equation that captures the
relationship between the set of inputs and the predicted output.
• This algorithm is mostly used in forecasting and predictions. Since it
models a linear relationship between the input and output variables,
it is called linear regression.
50
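A minimal sketch of linear regression in Python, assuming scikit-learn is available; the data points are invented for illustration.

# Minimal linear-regression sketch (supervised learning) with scikit-learn.
import numpy as np
from sklearn.linear_model import LinearRegression

# Invented data: one input variable X and a target y with a roughly linear relationship.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])

model = LinearRegression().fit(X, y)
print("slope:", model.coef_[0], "intercept:", model.intercept_)
print("prediction for x = 6:", model.predict([[6.0]])[0])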
3. K-Means Clustering:
• K-means clustering is one of the most popular machine learning algorithms;
it belongs to unsupervised learning.
• It solves the clustering problem.
• If we are given a data set of items with certain features and values, and we need
to categorize those items into groups, such problems can be
solved using the k-means clustering algorithm.
• The k-means clustering algorithm aims at minimizing an objective function,
known as the squared-error function, which is given as:
J(V) = \sum_{i=1}^{c} \sum_{j=1}^{c_i} \lVert x_j - v_i \rVert^2
where J(V) is the objective function, c is the number of clusters, c_i is the number of
data points in cluster i, and ||x_j - v_i|| is the Euclidean distance between data point x_j
and cluster centre v_i.
52
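A minimal sketch of the squared-error objective J(V) in code, assuming scikit-learn; the 2-D points are invented for illustration.

# Minimal k-means sketch: cluster invented 2-D points and evaluate the
# squared-error objective J(V), i.e. the sum of squared distances between
# each point and the centre of its cluster.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],   # points around (1, 1)
              [5.0, 5.0], [5.1, 4.8], [4.9, 5.2]])  # points around (5, 5)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

centres = km.cluster_centers_[km.labels_]             # centre assigned to each point
J = np.sum(np.linalg.norm(X - centres, axis=1) ** 2)  # J(V) computed by hand
print("labels:", km.labels_)
print("J(V):", J, "(scikit-learn reports the same value as inertia_:", km.inertia_, ")")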
Data Science Lifecycle
Communicate
53
The main phases of data science life cycle are given below:
1. Discovery:
• The first phase is discovery, which involves asking the right questions.
• When you start any data science project, you need to determine the basic
requirements, priorities, and project budget.
• In this phase, we need to determine all the requirements of the project, such as the
number of people, technology, time, data, and the end goal; then we can frame the
business problem at a first hypothesis level.
2. Data preparation: Data preparation is also known as Data Munging. In this
phase, we need to perform the following tasks:
• Data cleaning
• Data Reduction
• Data integration
• Data transformation
3. Model Planning: In this phase, we determine the various methods and
techniques to establish the relationships between input variables. We apply
exploratory data analysis (EDA), using various statistical formulas and
visualization tools, to understand the relations between variables and to see what
the data can tell us (a small EDA sketch follows after the tool list). Common tools used for model planning are:
• SQL Analysis Services
•R
• SAS
• Python
54
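As a small illustration of the exploratory data analysis mentioned in the Model Planning step, here is a minimal Python sketch assuming the pandas library; the tiny in-memory dataset is invented for illustration.

# Minimal EDA sketch for the model-planning phase, using pandas on invented data.
import pandas as pd

df = pd.DataFrame({
    "ad_spend":  [10, 20, 30, 40, 50],
    "visitors":  [110, 180, 320, 390, 520],
    "purchases": [12, 19, 33, 41, 55],
})

print(df.describe())   # summary statistics for each variable
print(df.corr())       # pairwise correlations between variables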
4. Model-building:
• In this phase, the process of model building starts.
• We create datasets for training and testing purposes.
• We apply different techniques, such as association, classification, and
clustering, to build the model.
• Following are some common model-building tools:
➢SAS Enterprise Miner
➢WEKA
➢SPSS Modeler
➢MATLAB
5. Operationalize:
• In this phase, we deliver the final reports of the project, along with briefings,
code, and technical documents.
• This phase gives you a clear overview of the complete project performance and
other components on a small scale before full deployment.
6. Communicate results:
• In this phase, we check whether we have reached the goal that we set in the initial
phase.
• We communicate the findings and the final result to the business team.
55
Applications of Data Science:
• Image recognition and speech recognition:
➢Data science is currently used for image and speech recognition.
➢When you upload an image on Facebook, you start getting suggestions to tag
your friends. This automatic tagging suggestion uses an image recognition algorithm,
which is part of data science.
➢When you say something to "Ok Google", Siri, Cortana, etc., and these devices
respond to your voice commands, this is made possible by speech recognition algorithms.
• Gaming world:
➢In the gaming world, the use of machine learning algorithms is increasing day by
day. EA Sports, Sony, and Nintendo are widely using data science to enhance the user
experience.
• Internet search:
➢ When we want to search for something on the internet, we use different
types of search engines such as Google, Yahoo, Bing, Ask, etc.
➢All these search engines use data science technology to make the search
experience better, and you can get a search result within a fraction of a second.
• Transport:
➢Transport industries are also using data science technology to create self-driving cars.
With self-driving cars, it will be easier to reduce the number of road accidents.
56
Applications of Data Science:
• Healthcare:
➢ In the healthcare sector, data science is providing lots of benefits.
➢Data science is being used for tumor detection, drug discovery, medical
image analysis, virtual medical bots, etc.
• Recommendation systems:
➢Most of the companies, such as Amazon, Netflix, Google Play, etc., are using
data science technology for making a better user experience with
personalized recommendations.
➢When you search for something on Amazon and start getting
suggestions for similar products, this is because of data science
technology.
• Risk detection:
➢ Finance industries have always had issues with fraud and the risk of losses, but with
the help of data science, these can be reduced.
➢Most finance companies are looking for data scientists to help avoid risk
and losses while increasing customer satisfaction.
57
Types of Data Science Job
If you learn data science, then you get the opportunity to take on various exciting
job roles in this domain. The main job roles are given below:
• Data Scientist: A data scientist is a professional who works with an enormous
amount of data to come up with compelling business insights through the
deployment of various tools, techniques, methodologies, algorithms, etc.
• Data Analyst: A data analyst is an individual who mines huge amounts
of data, models the data, and looks for patterns, relationships, trends, and so on. At the
end of the day, they come up with visualizations and reports for analyzing the
data to support decision making and problem solving.
• Machine learning expert: The machine learning expert is the one who works with
various machine learning algorithms used in data science, such as regression,
clustering, classification, decision trees, random forests, etc.
• Data engineer: A data engineer works with massive amounts of data and is
responsible for building and maintaining the data architecture of a data science
project. Data engineers also create the data set processes used in
modeling, mining, acquisition, and verification.
• Data Architect
• Data Administrator
• Business Analyst
• Business Intelligence Manager
58
BIG DATA ECOSYSTEM
59
Computer Clusters
• A computer cluster is a single logical unit consisting of
several computers that are linked through a fast local area network (LAN).
• The components of a cluster, commonly termed nodes, run
their own instance of an operating system.
• Computer clusters are needed for big data.
Apache Hadoop
• It is an open-source software framework for processing and querying vast amounts of
data on large clusters of commodity hardware.
• Hadoop is written in Java and can process huge volumes of structured and
unstructured data.
• It is an open-source implementation of Google's MapReduce and is based on a simple
programming model called MapReduce.
• It provides reliability through replication.
• The Apache Hadoop ecosystem is composed of the Hadoop kernel, MapReduce, HDFS,
and several other components such as Apache Hive, HBase, and ZooKeeper.
60
Characteristics of Hadoop
• Scalable– New nodes can be added without disruption and without any
change to the format of the data.
• Cost effective– Hadoop brings parallel computing to commodity
servers. This decreases cost and makes it affordable to
process massive amounts of data.
• Flexible– Hadoop is able to process any type of data from various
sources and deep analysis can be performed.
• Fault tolerant– When a node is damaged, the system is able to
redirect the work to another location to continue the processing
without missing any data.
61
Hadoop/MapReduce
Computing Paradigm
62
Large-Scale Data Analytics
Database
vs.
63
Why Hadoop is able to compete?
Database
vs.
64
What is Hadoop
65
Hadoop Master/Slave Architecture
66
Design Principles of Hadoop
67
Design Principles of Hadoop
68
Who Uses MapReduce/Hadoop
69
Hadoop scalability
◼ Hadoop can reach massive scalability by
exploiting a simple distribution architecture and
coordination model
◼ Huge clusters can be made up using (cheap)
commodity hardware
A 1000-CPU machine would be much more
expensive than 1000 single-CPU or 250 quad-core
machines
◼ Cluster can easily scale up with little or no
modifications to the programs
Hadoop Components
HDFS
• The cluster can grow
indefinitely simply by adding
new nodes
The MapReduce Paradigm
Map/Reduce execution:
• Map: data elements are classified into categories.
• Reduce: an algorithm is applied to all the elements of the same category.
MapReduce and Hadoop
• MapReduce runs on top of HDFS.
• Output is written on HDFS.
• Scalability principle: perform the computation where the data is.
Main Properties of HDFS
76
Map-Reduce Execution Engine
(Example: Color Count)
(Figure: input blocks on HDFS feed the map tasks; each map task parses its block and
produces (k, v) pairs such as (color, 1); the pairs are shuffled and sorted by key; each
reduce task consumes (k, [v]) lists such as (color, [1,1,1,1,...]) and produces (k', v')
pairs such as (color, 100). In this example, one map-reduce job consists of 4 map tasks
and 3 reduce tasks.)
78
Key-Value Pairs
79
MapReduce Phases
Deciding on what will be the key and what will be the value ➔ developer’s
responsibility
80
Example 1: Word Count
• Job: Count the occurrences of each word in a data set
(Figure: map tasks and reduce tasks for the word-count job; a pure-Python simulation follows below.)
81
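To make the map and reduce roles concrete, here is a pure-Python simulation of the word-count job; no Hadoop is involved, and the shuffle step is simulated with an in-memory dictionary, which is an assumption of this sketch.

# Pure-Python simulation of the MapReduce word-count job.
from collections import defaultdict

documents = ["the quick brown fox", "the lazy dog", "the quick dog"]  # toy input

# Map: emit a (word, 1) pair for every word in every input record.
mapped = []
for doc in documents:
    for word in doc.split():
        mapped.append((word, 1))

# Shuffle & sort: group all values by key (the framework does this in real Hadoop).
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce: sum the values for each key.
counts = {word: sum(values) for word, values in groups.items()}
print(counts)   # e.g. {'the': 3, 'quick': 2, 'dog': 2, ...}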
Example 2: Color Count
Job: Count the number of each color in a data set
(Figure: each map task parses its input and emits (color, 1) pairs; the reduce tasks produce
(k', v') pairs such as (color, 100) and write the output files Part0001, Part0002, and Part0003.
That is the output file; it has 3 parts, probably on 3 different machines.)
82
Example 3: Color Filter
(Figure: a map-only job; each map task filters its input block and writes its own output
directly to HDFS as Part0001 to Part0004. That is the output file; it has 4 parts, probably on
4 different machines. A pure-Python sketch follows below.)
83
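A map-only job like the colour filter above needs no shuffle and no reduce phase; here is a minimal pure-Python sketch of the same idea, where the input blocks and the wanted colour are invented for illustration.

# Map-only filter: each map task filters its own input block and writes its own
# output part file; there is no shuffle and no reduce phase.
blocks = [["red", "green", "blue"], ["green", "red"], ["blue", "green"], ["green"]]
wanted = "green"

parts = {"Part000%d" % (i + 1): [c for c in block if c == wanted]
         for i, block in enumerate(blocks)}
print(parts)   # one output part per map task, typically on different machines in HDFS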
84
HDFS has 2 components –
Namenode and Datanodes
Master/slave architecture
An HDFS cluster consists of a single Namenode, a master server that
manages the file system namespace and regulates access to files by
clients.
There are a number of DataNodes, usually one per node in the cluster.
The DataNodes manage the storage attached to the nodes that they run on.
HDFS exposes a file system namespace and allows user data to be stored
in files.
A file is split into one or more blocks, and the set of blocks is stored in
DataNodes (a small worked example follows below).
DataNodes serve read and write requests and perform block creation,
deletion, and replication upon instruction from the Namenode.
85
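A small worked example of how a file maps onto HDFS blocks and replicas; the 1 GB file size, the 128 MB block size, and the replication factor of 3 are assumptions chosen for illustration (both settings are configurable per file).

# Worked example: how many blocks and block replicas a file occupies in HDFS.
import math

file_size_mb = 1024    # assumed: a 1 GB file
block_size_mb = 128    # assumed: a common HDFS block size
replication = 3        # assumed: replication factor of the file

blocks = math.ceil(file_size_mb / block_size_mb)
replicas = blocks * replication
raw_storage_mb = replicas * block_size_mb   # upper bound; the last block may be smaller

print(blocks, "blocks,", replicas, "block replicas,",
      "up to", raw_storage_mb, "MB of raw storage")
# -> 8 blocks, 24 block replicas, up to 3072 MB of raw storage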
How HDFS stores the data
(Figure: clients issue block ops and read/write requests; Datanodes hold the file blocks
and replicate them to other Datanodes.)
87
File system Namespace
• Hierarchical file system with directories and files
• Create, remove, move, rename etc.
• Namenode maintains the file system
• Any meta information changes to the file system recorded by the
Namenode.
• An application can specify the number of replicas of the file needed:
replication factor of the file. This information is stored in the Namenode.
88
Data Replication
HDFS is designed to store very large files across machines in a large
cluster.
Each file is a sequence of blocks.
All blocks in the file except the last are of the same size.
Blocks are replicated for fault tolerance.
Block size and replicas are configurable per file.
The Namenode receives a Heartbeat and a BlockReport from each
DataNode in the cluster.
BlockReport contains all the blocks on a Datanode.
89
Filesystem Metadata
• The HDFS namespace is stored by Namenode.
• The Namenode uses a transaction log called the EditLog to record every change
that occurs to the filesystem metadata.
• For example, creating a new file.
• Change replication factor of a file
• EditLog is stored in the Namenode’s local filesystem
• Entire filesystem namespace including mapping of blocks to files and file
system properties is stored in a file FsImage. Stored in Namenode’s local
filesystem.
90
Datanode
A Datanode stores data in files in its local file system.
Datanode has no knowledge about HDFS filesystem
It stores each block of HDFS data in a separate file.
Datanode does not create all files in the same directory.
It uses heuristics to determine optimal number of files per directory and
creates directories appropriately:
Research issue?
When the filesystem starts up, it generates a list of all HDFS blocks and
sends this report to the Namenode: the Blockreport.
91
YARN (“Yet Another Resource Negotiator”)
• Hadoop YARN is the resource management unit of Hadoop; it allocates resources to the
various processing tools that run on top of it.
Why YARN?
• MapReduce Version 1 performed both processing and resource management functions.
• It consisted of a Job Tracker which was the single master. The Job Tracker allocated the
resources, performed scheduling and monitored the processing jobs.
• It assigned map and reduce tasks on a number of subordinate processes called the Task
Trackers. The Task Trackers periodically reported their progress to the Job Tracker.
93
94
Apart from resource management, YARN also performs job scheduling.
YARN performs all your processing activities by allocating resources and
scheduling tasks. The Apache Hadoop YARN architecture consists of the
following main components:
1.Resource Manager: The master daemon that manages and allocates the cluster's
resources among all the applications (described in detail below).
2.Node Manager: Node Managers run as slave daemons and are responsible for
the execution of a task on every single Data Node.
3.Application Master: Manages the user job lifecycle and resource needs of
individual applications. It works along with the Node Manager and monitors
the execution of tasks.
95
Components of YARN
You can consider YARN as the brain of your Hadoop Ecosystem. The
image below represents the YARN Architecture.
96
The first component of YARN Architecture is,
1. Resource Manager
➢It is the ultimate authority in resource allocation.
➢On receiving the processing requests, it passes parts of requests to
corresponding node managers accordingly, where the actual processing takes
place.
➢It is the arbitrator of the cluster resources and decides the allocation of the
available resources for competing applications.
➢Optimizes cluster utilization (keeping all resources in use all the
time), subject to various constraints such as capacity guarantees, fairness, and
SLAs.
➢It has two major components: a) Scheduler b) Application Manager
97
a) Scheduler
b) Application Manager
98
The second component is:
2. Node Manager
100
The fourth component is:
4.Container
• It is a collection of physical resources such as RAM, CPU cores, and disks
on a single node.
• YARN containers are managed by a container launch context (CLC), a record of the
container life-cycle. This record contains a map of environment
variables, dependencies stored in remotely accessible storage, security
tokens, the payload for Node Manager services, and the command necessary
to create the process.
• It grants an application the right to use a specific amount of
resources (memory, CPU, etc.) on a specific host.
101
Steps involved in application submission to Hadoop YARN:
(Figure: the application is submitted through the Hadoop stack; MapReduce runs on
YARN, the cluster resource manager, which in turn runs on HDFS, the Hadoop
Distributed File System.)
Data processing with MapReduce
(Figure: the input is divided into data slices 1 through X. Mapping: each slice is handled
by a data processor that performs extraction, filtering, and transformation. The mapped
records then go through data shuffling. Reducing: data collectors perform grouping,
aggregating, and dismissing, producing the final result.)
Example: The famous “word counting”
Demo
• The problem
• Q: “What happens after two rainy days in the Geneva region?”
• The goal
• Classify each day as good or bad weather with MapReduce
• Solution
• Build a histogram of the days of the week preceded by 2 or more bad-weather days,
based on meteo data for GVA
(Figure: MapReduce flow of the demo. Input data: timestamped temperature readings such as
"05.06.2015 19:20";"28.0" and "06.06.2015 00:50";"18.0", one record per hour. A first
MapReduce job reduces the raw readings to one value per date, e.g. 2016.09.20 6,
2016.09.26 5, 2016.09.30 3, 2016.10.10 2, 2016.10.27 4. A second MapReduce job reduces
those per-date values to a histogram over the days of the week:
Monday 32, Tuesday 0, Wednesday 3, Thursday 10, Friday 20, Saturday 23, Sunday 25.
Code: https://gitlab.cern.ch/db/hadoop-intro/tree/master/MapReduce)
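Here is a pure-Python sketch of the two-job logic described above; the real implementation in the linked repository runs as MapReduce jobs on Hadoop, and the 15 degree threshold used to call a day "bad weather", like the sample records, is an assumption of this sketch, not the rule used in the actual demo.

# Sketch of the demo logic: which weekdays follow two or more bad-weather days?
# "Bad weather" is naively defined here as a mean temperature below 15 degrees C.
from collections import Counter, defaultdict
from datetime import datetime, timedelta

records = [                       # toy records in the "date time";"temperature" style
    ("05.06.2015 19:20", 28.0), ("05.06.2015 23:50", 19.0),
    ("06.06.2015 00:20", 18.0), ("06.06.2015 10:00", 10.0),
    ("07.06.2015 09:00", 11.0), ("08.06.2015 09:00", 10.0),
    ("09.06.2015 09:00", 21.0),
]

# Job 1 (map + reduce): mean temperature per date -> good/bad flag per date.
per_day = defaultdict(list)
for stamp, temp in records:
    day = datetime.strptime(stamp, "%d.%m.%Y %H:%M").date()
    per_day[day].append(temp)
bad = {day: (sum(temps) / len(temps)) < 15.0 for day, temps in per_day.items()}

# Job 2: histogram of weekdays preceded by 2 or more consecutive bad-weather days.
histogram = Counter()
for day in bad:
    if bad.get(day - timedelta(days=1)) and bad.get(day - timedelta(days=2)):
        histogram[day.strftime("%A")] += 1
print(histogram)   # Counter({'Monday': 1, 'Tuesday': 1}) for the toy data above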
Apache Pig
• Apache Pig is a scripting platform for processing and analyzing large data
sets.
• Apache Pig is used with Hadoop; we can perform all the data
manipulation operations in Hadoop using Apache Pig.
• Apache Pig allows Apache Hadoop users to write complex MapReduce
transformations using a simple scripting language called Pig
Latin.
• The Pig Latin language provides various operators with which
programmers can develop their own functions for reading, writing, and
processing data.
• It is also an abstraction over MapReduce. All scripts are internally
converted to Map and Reduce tasks: Apache Pig has a component
known as the Pig Engine that accepts Pig Latin scripts as input and
converts them into MapReduce jobs.
110
111
Why Do We Need Apache Pig?
• Not all programmers are good at Java, and they may have
difficulty working with Hadoop directly.
• Java is the platform for Hadoop: if we know Java, working with Hadoop is easier,
but if we don't know Java, we may have difficulty understanding and using Hadoop.
• If we don't know Java well, we can use Apache Pig. Apache Pig is a benefit
for all programmers.
• Using Pig Latin, programmers can perform MapReduce tasks
easily without having to type complex lines of code in Java.
• Apache Pig uses a multi-query approach, which reduces the length of the code. For
example, an operation that would require us to
type 200 lines of code (LoC) in Java can typically be done in Apache Pig by
typing about 10 LoC.
Apache Pig reduces the development time by almost 16 times compared with
development in Java.
• Pig Latin is an SQL-like language, and it is easy to learn Apache Pig if we are
familiar with SQL.
• Apache Pig provides many built-in operators to support data operations
like joins, filters, and ordering, and it also supports nested data types like tuples, bags,
and maps that are missing from MapReduce.
112
Features of Pig
• Rich set of operators − It provides a rich set of operators to perform
operations like join, sort, and filter.
• Ease of programming − Pig Latin is similar to SQL, and if you are good at
SQL it is easy to write a Pig script.
• Optimization opportunities − The tasks in Apache Pig optimize their
execution automatically, so programmers need
to focus only on the semantics of the language.
• Extensibility − Using the existing operators, users can develop their own
functions to read, write, and process data.
• UDFs (User Defined Functions) − Apache Pig provides the facility to create
user-defined functions in other programming languages such as Java and to invoke
them in Pig scripts.
• Handles all kinds of data − Apache Pig analyzes all kinds of data,
structured as well as unstructured, and it stores the results
in HDFS, the Hadoop Distributed File System.
113
Apache Pig Vs MapReduce
116
Apache Hive is a data warehouse system built on top of Hadoop that provides an
SQL-like query language (HiveQL). Hive chiefly consists of three core parts:
• Hive Clients: Hive offers a variety of drivers designed for communication
with different applications. For example, Hive provides Thrift clients (cross-
language services) for Thrift-based applications. These clients and drivers
then communicate with the Hive server, which falls under Hive services.
• Hive Services: Hive services perform client interactions with Hive. For
example, if a client wants to perform a query, it must talk to Hive services.
• Hive Storage and Computing: Hive services such as the file system, job client,
and metastore then communicate with Hive storage and store things like
metadata table information and query results.
117
Hive's Features
• Hive is designed for querying and managing only structured data stored in tables
• Hive is scalable, fast, and uses familiar concepts
• Schema gets stored in a database, while processed data goes into a Hadoop
Distributed File System (HDFS)
• Tables and databases get created first; then data gets loaded into the proper tables
• Hive supports four file formats: ORC (which stores rows in columnar format),
SEQUENCEFILE, RCFILE (Record Columnar File), and TEXTFILE
• Hive uses an SQL-inspired language, sparing the user from dealing with the
complexity of MapReduce programming. It makes learning more accessible by
utilizing familiar concepts found in relational databases, such as columns, tables,
rows, and schema, etc.
• The most significant difference between the Hive Query Language (HQL) and SQL is
that Hive executes queries on Hadoop's infrastructure instead of on a traditional
database
• Since Hadoop's programming works on flat files, Hive uses directory structures to
"partition" data, improving performance on specific queries
• Hive supports partition and buckets for fast and simple data retrieval
• Hive supports custom user-defined functions (UDF) for tasks like data cleansing and
filtering. Hive UDFs can be defined according to programmers' requirements
118
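As an illustration of Hive's SQL-like interface, here is a minimal sketch that queries Hive from Python; it assumes the third-party PyHive client, a HiveServer2 instance reachable on localhost:10000, and a hypothetical sales table partitioned by sale_date (all of these are assumptions, not part of the original slides).

# Minimal sketch: running a HiveQL query from Python via the PyHive client.
# Assumptions: HiveServer2 on localhost:10000 and a hypothetical "sales" table
# partitioned by sale_date.
from pyhive import hive

conn = hive.Connection(host="localhost", port=10000, database="default")
cur = conn.cursor()

# HiveQL looks like SQL; filtering on the partition column lets Hive prune directories.
cur.execute("""
    SELECT item, SUM(amount) AS total
    FROM sales
    WHERE sale_date = '2023-01-01'
    GROUP BY item
""")
for item, total in cur.fetchall():
    print(item, total)

cur.close()
conn.close()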
Limitations of Hive
• Hive doesn’t support OLTP. Hive supports Online Analytical
Processing (OLAP), but not Online Transaction Processing (OLTP).
• It doesn’t support subqueries.
• It has a high latency.
• Hive tables don’t support delete or update operations.
119
Hive vs. Relational Databases
• Relational database: maintains a database; doesn't support partitioning.
• Hive: maintains a data warehouse; supports automatic partitioning.
120
Bigger Picture: Hadoop vs. Other Systems
Computing Model:
• Distributed databases: notion of transactions; a transaction is the unit of work; ACID
properties and concurrency control.
• Hadoop: notion of jobs; a job is the unit of work; no concurrency control.
Data Model:
• Distributed databases: structured data with a known schema; read/write mode.
• Hadoop: any data will fit in any format, (un)(semi)structured; read-only mode.
Cost Model:
• Distributed databases: expensive servers.
• Hadoop: cheap commodity machines.
Fault Tolerance:
• Distributed databases: failures are rare; recovery mechanisms.
• Hadoop: failures are common over thousands of machines; simple yet efficient fault
tolerance.
Key Characteristics:
• Distributed databases: efficiency, optimizations, fine-tuning.
• Hadoop: scalability, flexibility, fault tolerance.
• Cloud Computing
• A computing model where any computing infrastructure can
run on the cloud
• Hardware & Software are provided as remote services
• Elastic: grows and shrinks based on the user’s demand
• Example: Amazon EC2
136
Hadoop pros & cons
◼ Good for
Repetitive tasks on big data sets