Unit 1 - Big Data Analytics & Lifecycle
P.V.Kale 1
Course Prerequisite
Knowledge of basic Computer Science principles and skills.
Course Objectives
Throughout the course, students will be expected to demonstrate
their understanding of Big Data Analytics by being able to do
each of the following:
Course Outcomes (Expected Outcome)
On completion of the course, the students will be able to
1. Work with big data tools and their analysis techniques.
2. Analyze data by utilizing clustering and classification
algorithms.
3. Learn and apply different algorithms and recommendation
systems.
Unit -I
Big Data Analytics: Big Data Overview,
State of the Practice in Analytics,
Key Roles for the New Big Data Ecosystem,
Examples of Big Data Analytics,
Unit -II
Exploratory Data Analysis, Statistical Methods for Evaluation:
Hypothesis Testing, Difference of Means,
Wilcoxon Rank-Sum Test,
Type I and II Errors, ANOVA,
Unit -III
Linear Regression:
Use Cases, Model Description, Diagnostics,
Unit -IV
Overview of Time Series Analysis: Box-Jenkins Methodology,
Unit -V
Big Data Tool and Techniques:
Big Data Storage, High-Performance Architecture, HDFS,
MapReduce and YARN, Big Data Application Ecosystem,
Zookeeper, HBase, Hive, Pig, Mahout,
Unit -VI
SQL Essentials, In-Database Text Analysis, Advanced SQL,
Graph Analytics:
Model, Triples, Graphs and Network Organization,
Graph Analytics and Use Cases, Graph Analysis Algorithms,
Technical Complexity,
Features of Graph Analytic Platform, Data Visualization Basics.
1st Text Book
EMC Education Services, "Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data", 2015, John Wiley & Sons, Inc., ISBN: 978-1-118-87613-8.
2nd Text Book
David Loshin, "Big Data Analytics: From Strategic Planning to Enterprise Integration with Tools, Techniques, NoSQL, and Graph", First Edition, 2013, Morgan Kaufmann/Elsevier Publishers, ISBN: 978-0-12-417319-4.
Reference Books:
[1] Bart Baesens, "Analytics in a Big Data World: The Essential Guide to Data Science and its Applications", First Edition, 2014, Wiley Publishers, ISBN: 978-1-118-89271-8.
[3] Arshdeep Bahga & Vijay Madisetti, "Big Data Science & Analytics: A Hands-On Approach", First Edition, 2019, ISBN: 978-1-949978-00-1.
Class: 3rd Year VI Semester
Unit 1- Big Data Analytics
Contents
Big Data Overview
What is Data?
The quantities, characters, or symbols on which operations are performed by a computer, which may be stored and transmitted in the form of electrical signals and recorded on magnetic, optical, or mechanical recording media.
Overview
Nowadays data is created constantly, and at an ever-
increasing rate.
Survey of Big Data
For example, in 2012 Facebook users posted 700 status updates
per second worldwide, which can be leveraged to deduce latent
interests or political views of users and show relevant ads.
Social Media
• There are 2.3 billion monthly active users and counting.
• Over 250 billion photos have been uploaded to Facebook.
• Facebook generates 4 petabytes of data per day!
• 1 petabyte = 1,000,000 GB
Facebook
• Statistics also show that 500+ terabytes of new data are ingested into the databases of the social media site Facebook every day.
• This data is mainly generated through photo and video uploads, message exchanges, comments, etc.
Instagram
• Over 40 billion photos have been shared.
• Around 95 million photos are uploaded every single day.
• Over 500 million videos are uploaded to Instagram per day!
• Petabytes worth of data are used and moved around per day!
Twitter
• There are approx. 600 tweets per second.
• Handling huge quantities of data takes skill.
• Terabytes worth of data are uploaded to the servers every day!
• 1 terabyte = 1,000 GB
How are the measures used?
• Data is measured in bits and bytes.
• One bit contains a value of 0 or 1.
• Eight bits make a byte.
• Then we have kilobytes (1,000 bytes),
• megabytes (1000² bytes),
• gigabytes (1000³ bytes),
• terabytes (1000⁴ bytes),
• petabytes (1000⁵ bytes),
• exabytes (1000⁶ bytes) and
• zettabytes (1000⁷ bytes).
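The decimal ladder above can be sketched in code. The following is a minimal Python sketch (the function name is invented for illustration) that walks a byte count up the 1,000-based units:

```python
# Sketch: converting a raw byte count into the decimal (SI) units
# listed above, where each step up the ladder is a factor of 1,000.
UNITS = ["bytes", "KB", "MB", "GB", "TB", "PB", "EB", "ZB"]

def human_readable(num_bytes: float) -> str:
    """Format a byte count using decimal (1,000-based) units."""
    for unit in UNITS:
        if num_bytes < 1000:
            return f"{num_bytes:g} {unit}"
        num_bytes /= 1000
    return f"{num_bytes:g} YB"

print(human_readable(4_000_000_000_000_000))  # Facebook's ~4 PB/day
print(human_readable(500_000_000_000_000))    # ~500 TB/day
```

For example, `human_readable(1500)` yields "1.5 KB", matching the kilobyte definition above.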
Social media and genetic sequencing are among the fastest-
growing sources of Big Data and examples of untraditional
sources of data being used for analysis.
Examples- Gathering of Data
Several industries have led the way in developing their
ability to gather and exploit data:
Definition
Big Data is data whose scale, distribution, diversity, and/or timeliness require the use of new technical architectures and analytics to enable insights that unlock new sources of business value.
Unit 1- Big Data Analytics
Contents
Evolution of Big Data
Types of Data
Evolution of Big Data
• 1970s and before was the era of mainframes.
Difference
What is Analytics
Analytics is the process of discovering, interpreting, and
communicating significant patterns in data.
It can be applied to structured data or unstructured data.
Structured Data
• Any data that can be stored, accessed, and processed in a fixed format is termed 'structured' data.
• Structured data includes all data that can be stored in a tabular column.
• It is the type of data stored in relational databases such as MySQL and Oracle, where data is organized in rows and columns within named tables.
• Relational databases and Excel spreadsheets are examples of structured data.
• Structured data depends on the existence of a data model – a model of how data can be stored, processed and accessed.
Structured Data
• Because of the data model, each field is discrete and can be accessed separately or jointly along with data from other fields.
• This makes structured data extremely powerful: it is possible to quickly aggregate data from various locations in the database.
• Structured data exists in a format created to be captured, stored, organized and analyzed.
• It's neatly organized for easy access.
• If structured data were an office, it would contain many file cabinets that are efficiently set up, clearly labeled and easy to access.
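The "quick aggregation" described above can be illustrated with an in-memory SQLite table; the table, column names, and values below are hypothetical:

```python
# Sketch: structured data in a relational table -- a fixed schema of
# named columns lets each field be queried separately or aggregated.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("East", 120.0), ("West", 80.0), ("East", 50.0)])
# A quick aggregation across rows -- the kind of query the fixed
# row/column structure makes cheap.
total_east = conn.execute(
    "SELECT SUM(amount) FROM sales WHERE region = 'East'").fetchone()[0]
print(total_east)  # 170.0
conn.close()
```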
Semi-Structured data
• This type of data does not have a standard data model, but it has clear self-describing patterns and structure.
• Examples of semi-structured data are Excel spreadsheets that have a row and column structure and XML files that are defined by an XML schema.
• Semi-structured data is a form of structured data that does not conform to the formal structure of data models associated with relational databases or other forms of data tables,
• but it nonetheless contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data.
• Therefore, it is also known as self-describing data.
Semi-Structured data
• The reason this third category exists (between structured and unstructured data) is that semi-structured data is considerably easier to analyse than unstructured data.
• Many Big Data solutions and tools have the ability to 'read' and process either JSON or XML. This reduces the complexity of analysing semi-structured data, compared to unstructured data.
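As a minimal illustration of the self-describing structure mentioned above, a JSON document can be read without any external schema; the record below is invented:

```python
# Sketch: semi-structured data -- a JSON document carries its own
# field names (tags), so no external schema is needed to read it.
import json

doc = '{"user": "alice", "tags": ["big-data", "nosql"], "age": 30}'
record = json.loads(doc)
# The field names travel with the data ("self-describing"):
print(sorted(record.keys()))   # ['age', 'tags', 'user']
print(record["tags"][0])       # big-data
```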
Quasi-Structured data
• Textual data with erratic data formats that can be formatted with effort, tools, and time.
• For instance, web clickstream data that may contain inconsistencies in data values and formats.
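A sketch of what "formatting with effort, tools, and time" can look like: the clickstream lines below are hypothetical, using inconsistent date separators and spacing, and a regular expression coerces them into uniform fields:

```python
# Sketch: quasi-structured data -- erratic log lines normalized with a
# regular expression. The log format here is invented for illustration.
import re

lines = [
    "2024-01-05 12:01:33 GET /home 200",
    "2024/01/05 12:02:10  GET   /products  404",  # different separators, extra spaces
]
pattern = re.compile(
    r"(\d{4})[-/](\d{2})[-/](\d{2})\s+(\S+)\s+(\w+)\s+(\S+)\s+(\d{3})")

parsed = []
for line in lines:
    m = pattern.match(line)
    if m:
        year, month, day, _time, method, path, status = m.groups()
        parsed.append(f"{year}-{month}-{day} {method} {path} {status}")
print(parsed)
```

After the effort of writing the pattern, both erratic lines come out in one uniform format.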
Unstructured data
• Unstructured data is information that either does not have a predefined data model or is not organised in a predefined manner.
• Unstructured information is typically text-heavy, but may contain data such as dates, numbers, and facts as well.
• This results in irregularities and ambiguities that make it difficult to understand using traditional programs, as compared to data stored in structured databases.
• Common examples of unstructured data include audio files, video files, and documents stored in NoSQL databases.
Unstructured data
• The ability to store and process unstructured data has
greatly grown in recent years, with many new
technologies and tools coming to the market that are able
to store specialised types of unstructured data.
• MongoDB, for example, is optimised to store documents.
• Apache Giraph, as an opposite example, is optimised for
storing relationships between nodes.
• The ability to analyse unstructured data is especially
relevant in the context of Big Data, since a large part of
data in organisations is unstructured.
Unstructured data
• Think about pictures, videos or PDF documents. The ability to extract value from unstructured data is one of the main drivers behind the quick growth of Big Data.
Unit 1- Big Data Analytics
Contents
Capturing Big Data
Big Data
• Big data is high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.
• Big data refers to datasets whose size is typically beyond the storage capacity of, and too complex for, traditional database software tools.
• Big data is anything that exceeds the human and technical infrastructure needed to support its storage, processing and analysis.
• It is data that is big in volume, velocity and variety.
Data: Big in Volume, Variety and Velocity
Volume of Data
• Volume can be in terabytes, petabytes or zettabytes.
• Gartner Glossary: Big data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight and decision making.
• The quantity of data generated in the world has been doubling every 12-18 months.
• Searching the World Wide Web was the first true Big Data application.
• Google perfected the art of this application and developed many of the path-breaking Big Data technologies.
Velocity of Data
• If traditional data is like a lake, Big Data is like a fast-flowing river.
Variety of Data
• If traditional data forms such as invoices and ledgers were like a small store, Big Data is the biggest imaginable shopping mall that offers unlimited variety.
• Data can be structured, semi-structured or unstructured. Data stored in a database is an example of structured data.
• HTML data, XML data, email data and CSV files are examples of semi-structured data. PowerPoint presentations, images, videos, research papers, white papers, the body of an email, etc. are examples of unstructured data.
• There are three major kinds of variety of data: form of data, function of data, and source of data.
V’s in Big Data
Veracity of Data
• Ingesting and processing data from different systems results in veracity challenges concerning data accuracy.
• For example, if different records show the same data with different timestamps, it is hard to determine which record is the correct one.
• Alternatively, if data is incomplete, one may not know about it, and there can be a system error. Hence, big data systems need concepts, methods, and tools to overcome the veracity challenge.
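One common way (though not the only one) to resolve the timestamp conflict described above is to keep the most recent record per entity; the records and field names below are hypothetical:

```python
# Sketch: resolving a veracity conflict -- several records describe the
# same entity with different timestamps, so we keep the latest one.
from datetime import datetime

records = [
    {"id": "cust-1", "email": "a@old.com", "updated": "2023-01-10"},
    {"id": "cust-1", "email": "a@new.com", "updated": "2024-03-02"},
    {"id": "cust-2", "email": "b@x.com",   "updated": "2023-06-15"},
]

latest = {}
for rec in records:
    ts = datetime.strptime(rec["updated"], "%Y-%m-%d")
    if rec["id"] not in latest or ts > latest[rec["id"]][0]:
        latest[rec["id"]] = (ts, rec)

resolved = [rec for _, rec in latest.values()]
print(sorted(r["email"] for r in resolved))  # ['a@new.com', 'b@x.com']
```

"Latest wins" is only one policy; a real system might instead flag conflicts for review.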
Value of Data
• The last V in the 5 V's of big data is value.
• This refers to the value that big data can provide, and it relates directly to what organizations can do with the collected data.
• Being able to pull value from big data is a requirement, as the value of big data increases significantly depending on the insights that can be gained from it.
In short
Challenges with Big Data
Technological Challenges & Solutions
Challenge | Description | Solution | Technology
Comparing Big Data with Traditional data
Unit 1- Big Data Analytics
Contents
State Of The Practice In Analytics
STATE OF THE PRACTICE IN ANALYTICS
• Current business problems provide many opportunities for
organizations to become more analytical and data driven.
• The first three examples do not represent new problems.
• Organizations have been trying to reduce customer churn,
increase sales, and cross-sell customers for many years.
• What is new is the opportunity to fuse advanced analytical
techniques with Big Data to produce more impactful analyses
for these traditional problems.
• The last example portrays emerging regulatory requirements.
• Many compliance and regulatory laws have been in existence for decades, but additional requirements are added every year, which represent additional complexity and data requirements for organizations.
• Laws related to anti-money laundering (AML) and fraud prevention require advanced analytical techniques to comply with and manage.
Key aspects of the state of the practice:
• BI Versus Data Science
• Current Analytical Architecture (data flow)
• Drivers of Big Data
• Emerging Big Data Ecosystem and a New Approach to Analytics
Comparing BI with DS
BI & DS
Analytical Architecture
Drivers of Big Data
• In the 1990s the volume of information was often measured in terabytes. Most organizations analyzed structured data in rows and columns and used relational databases and data warehouses to manage large amounts of enterprise information.
• Later, different kinds of data sources emerged – mainly productivity and publishing tools such as content management repositories and network-attached storage systems. With this kind of information to manage, data began to increase in size and started to be measured at petabyte scales.
• The information that organizations try to manage has broadened to include many other kinds of data.
• In this era, everyone and everything is leaving a digital footprint.
• These applications, which generate data volumes that can be measured at exabyte scale, provide opportunities for new analytics and drive new value for organizations.
• The data now comes from multiple sources, such as medical information, photos and video footage, video surveillance, mobile devices, smart devices, non-traditional IT devices, etc.
Emerging Big Data Ecosystem
Contents
• Key Roles for the New Big Data Ecosystem
• Overview
Key Roles for the New Big Data Ecosystem
Data Analytics Lifecycle
• The Data Analytics Lifecycle is designed for Big Data problems and data science projects.
CLASSIFICATION OF ANALYTICS
• There are basically two schools of thought:
First School of Thought
• It includes basic analytics, operationalized analytics, advanced analytics and monetized analytics.
• Basic analytics:
• This primarily is slicing and dicing of data to help with basic business insights.
• This is about reporting on historical data, basic visualization, etc.
• Together, these levels of analytics answer questions such as: Why did it happen? What will happen? How can we make it happen?
• Operationalized analytics:
• Analytics is operationalized if it gets woven into the enterprise's business processes.
• Advanced analytics:
• This largely is about forecasting the future by way of predictive and prescriptive modelling.
• Monetized analytics:
• This is analytics in use to derive direct business revenue.
Analytics 1.0, 2.0 and 3.0
Second School of Thought
Phase 1 : Data discovery
• In this first phase of data analytics, the stakeholders perform the following tasks:
• examine the business trends, make case studies of similar data analytics projects, and study the domain of the business industry.
• The entire team makes an assessment of the in-house resources, the in-house infrastructure, total time involved, and technology requirements.
• Once all these assessments and evaluations are completed, the stakeholders start formulating the initial hypothesis for resolving all business challenges in terms of the current market scenario.
Cont…
• The data science team learns about and investigates the problem.
• It develops context and understanding.
• It comes to know about the data sources needed and available for the project.
• The team formulates initial hypotheses that can later be tested with data.
Phase 2 : Data Preparation
• In the second phase, after data discovery, data is prepared by transforming it from a legacy system into a form fit for analytics by using a sandbox platform.
• A sandbox is a scalable platform commonly used by data scientists for data preprocessing.
• It includes many CPUs, high-capacity storage, and high I/O capacity.
• The IBM Netezza 1000 is one such platform.
Cont…
• This phase covers the steps to explore, preprocess, and condition data prior to modeling and analysis.
• It requires the presence of an analytic sandbox; the team executes extract, load, and transform processes to get data into the sandbox.
• Data preparation tasks are likely to be performed multiple times and not in a predefined order.
• Several tools commonly used for this phase are Hadoop, Alpine Miner, OpenRefine, etc.
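As a toy stand-in for tools such as OpenRefine, the sketch below conditions a few hypothetical rows by dropping those whose fields cannot be parsed, mirroring the explore/preprocess/condition step:

```python
# Sketch: conditioning raw rows before modeling -- drop records with
# missing or malformed fields. Input rows are invented for illustration.
raw_rows = [
    {"age": "34", "income": "55000"},
    {"age": "",   "income": "61000"},   # missing value
    {"age": "29", "income": "abc"},     # malformed value
]

def condition(rows):
    """Keep only rows whose fields parse cleanly as integers."""
    clean = []
    for row in rows:
        try:
            clean.append({"age": int(row["age"]), "income": int(row["income"])})
        except ValueError:
            continue  # in practice such rows are logged and revisited,
                      # since preparation runs multiple times
    return clean

print(condition(raw_rows))  # [{'age': 34, 'income': 55000}]
```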
P.V.Kale 105
Phase 3 : Model Planning
• The third phase of the lifecycle is model planning, where the data analytics team plans the methods to be adopted and the workflows to be followed during the next phase of model building.
• At this stage, the division of work among the team is decided to clearly define the workload among the team members.
• The data prepared in the previous phase is further explored to understand the features and their relationships, and feature selection is performed for applying to the model.
Phase 3 : Model Planning
• The team explores the data to learn about the relationships between variables and subsequently selects the key variables and the most suitable models.
• In this phase, the data science team develops data sets for training, testing, and production purposes.
• The team builds and executes models based on the work done in the model planning phase.
• Several tools commonly used for this phase are MATLAB and STATISTICA.
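The explore-and-select activity can be sketched with the standard library alone, as a stand-in for the MATLAB/STATISTICA workflows named above; the synthetic data and the 80/20 split are illustrative choices:

```python
# Sketch: model planning -- split prepared data into training and test
# sets, then check how strongly a candidate variable relates to the
# target via the Pearson correlation coefficient.
import random

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

random.seed(42)                          # reproducible illustration
xs = list(range(30))
ys = [2 * x + random.uniform(-3, 3) for x in xs]  # near-linear relation

pairs = list(zip(xs, ys))
random.shuffle(pairs)
split = int(0.8 * len(pairs))            # 80/20 train-test split
train, test = pairs[:split], pairs[split:]

r = pearson([x for x, _ in train], [y for _, y in train])
print(f"train={len(train)} test={len(test)} r={r:.2f}")
```

A correlation near 1 would mark this variable as a key candidate for the model built in the next phase.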
Phase 4 : Model Building
• The next phase of the lifecycle is model building, in which the team works on developing datasets for training and testing as well as for production purposes.
• The execution of the model, based on the planning made in the previous phase, is also carried out.
• The kind of environment needed for the execution of the model is decided and prepared, so that if a more robust environment is required, it is applied accordingly.
Cont..
• The team develops datasets for testing, training, and production purposes.
• The team also considers whether its existing tools will suffice for running the models or whether it needs a more robust environment for executing the models.
• Free or open-source tools: R and PL/R, Octave, WEKA.
• Commercial tools: MATLAB, STATISTICA.
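As a minimal example of the model-building step, the sketch below fits an ordinary-least-squares line in plain Python, standing in for the R/Octave/WEKA tools listed above; the training data are invented:

```python
# Sketch: model building with the simplest possible model -- an
# ordinary least squares fit of y = a + b*x on a training set.
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

train_x = [1, 2, 3, 4, 5]
train_y = [2.1, 3.9, 6.2, 7.8, 10.1]    # roughly y = 2x
a, b = fit_line(train_x, train_y)
print(f"y = {a:.2f} + {b:.2f}x")

# Score a held-out point before moving on to Phase 5:
pred = a + b * 6
print(round(pred, 1))
```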
P.V.Kale 109
Phase 5 : Communicate results
• Phase five of the life cycle checks the results of the project to determine whether it is a success or a failure.
• The results are scrutinized by the entire team along with the stakeholders to draw inferences on the key findings and summarize the entire work done.
• Also, the business value is quantified and an elaborate narrative on the key findings is prepared and discussed among the various stakeholders.
Cont…
• After executing the model, the team needs to compare the outcomes of modeling to the criteria established for success and failure.
• The team considers how best to articulate the findings and outcomes to the various team members and stakeholders, taking caveats and assumptions into account.
• The team should identify the key findings, quantify the business value, and develop a narrative to summarize and convey the findings to stakeholders.
Phase 6 : Operationalization
• In phase six, a final report is prepared by the team along with the briefings, source code, and related documents.
• The last phase also involves running a pilot project to implement the model and test it in a real-time environment.
• As data analytics helps build models that lead to better decision-making, it in turn adds value to individuals, customers, business sectors, and other organizations.
Cont…
• The team communicates the benefits of the project more broadly and sets up a pilot project to deploy the work in a controlled way before broadening the work to a full enterprise of users.
• This approach enables the team to learn about the performance and related constraints of the model in a production environment on a small scale, and make adjustments before full deployment.
• The team delivers final reports, briefings, and code.
• Free or open-source tools: Octave, WEKA, SQL, MADlib.
Case Study: GINA (Global Innovation Network and Analysis)
P.V.Kale 117
P.V.Kale 118
P.V.Kale 119
P.V.Kale 120
P.V.Kale 121
P.V.Kale 122
P.V.Kale 123
P.V.Kale 124
P.V.Kale 125
P.V.Kale 126
P.V.Kale 127
P.V.Kale 128
Questions
1. Describe the V's model of Big Data.
2. What are the major technological challenges in managing Big Data?
3. Explain the difference between BI and Data Science.
4. What are the different phases of the Data Analytics Lifecycle? Explain each in detail.
5. Describe the current analytical architecture for data scientists.
6. What are the key roles in the new Big Data ecosystem?