Unit 1 - Big Data Analytics & Lifecycle
P.V.Kale 1
Course Prerequisite
Knowledge of basic Computer Science principles and skills.
Course Objectives
Throughout the course, students will be expected to demonstrate
their understanding of Big Data Analytics by being able to do
each of the following:
Course Outcomes (Expected Outcome)
On completion of the course, the students will be able to
1. Work with big data tools and their analysis techniques.
2. Analyze data by utilizing clustering and classification
algorithms.
3. Learn and apply different algorithms and recommendation
systems.
Unit -I
Big Data Analytics: Big Data Overview,
State of the Practice in Analytics,
Key Roles for the New Big Data Ecosystem,
Examples of Big Data Analytics,
Unit -II
Exploratory Data Analysis, Statistical Methods for Evaluation:
Hypothesis Testing, Difference of Means,
Wilcoxon Rank-Sum Test,
Type I and II Errors, ANOVA,
Unit -III
Linear Regression:
Use Cases, Model Description, Diagnostics,
Unit -IV
Overview of Time Series Analysis: Box-Jenkins Methodology,
Unit -V
Big Data Tool and Techniques:
Big Data Storage, High-Performance Architecture, HDFS,
MapReduce and YARN, Big Data Application Ecosystem,
Zookeeper, HBase, Hive, Pig, Mahout,
Unit -VI
SQL Essentials, In-Database Text Analysis, Advanced SQL,
Graph Analytics:
Model, Triples, Graphs and Network Organization,
Graph Analytics and Use Cases, Graph Analysis Algorithms,
Technical Complexity,
Features of Graph Analytic Platform, Data Visualization Basics.
1st Text Book
EMC Education Services, "Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data", 2015, John Wiley & Sons, Inc., ISBN: 978-1-118-87613-8.
2nd Text Book
David Loshin, "Big Data Analytics: From Strategic Planning to Enterprise Integration with Tools, Techniques, NoSQL, and Graph", First Edition, 2013, Morgan Kaufmann/Elsevier Publishers, ISBN: 978-0-12-417319-4.
Reference Books:
[1] Bart Baesens, "Analytics in a Big Data World: The Essential Guide to Data Science and its Applications", First Edition, 2014, Wiley Publishers, ISBN: 978-1-118-89271-8.
[3] Arshdeep Bahga & Vijay Madisetti, "Big Data Science & Analytics: A Hands-On Approach", First Edition, 2019, ISBN: 978-1-949978-00-1.
Class: 3rd Year VI Semester
Unit 1- Big Data Analytics
Contents
Big Data Overview
What is Data?
The quantities, characters, or symbols on which operations are performed by a computer, which may be stored and transmitted in the form of electrical signals and recorded on magnetic, optical, or mechanical recording media.
Overview
Nowadays data is created constantly, and at an ever-
increasing rate.
Survey of Big Data
For example, in 2012 Facebook users posted 700 status updates
per second worldwide, which can be leveraged to deduce latent
interests or political views of users and show relevant ads.
Social Media
• There are 2.3 billion monthly active users and counting.
• Over 250 billion photos have been uploaded to Facebook.
• Facebook generates 4 petabytes of data per day!
• 1 petabyte = 1,000,000 GB
Facebook
• Statistics also show that 500+ terabytes of new data are ingested into the databases of the social media site Facebook every day.
• This data is mainly generated through photo and video uploads, message exchanges, comments, etc.
Instagram
• Over 40 billion photos have been shared.
• Around 95 million photos are uploaded every single day.
• Over 500 million videos are uploaded to Instagram per day!
• Petabytes worth of data are used and moved around per day!
Twitter
• There are approx. 600 tweets per second.
• Handling huge quantities of data takes skill.
• Terabytes worth of data are uploaded to the servers every day!
• 1 terabyte = 1,000 GB
How are the measures used?
• Data is measured in bits and bytes.
• One bit contains a value of 0 or 1.
• Eight bits make a byte.
• Then we have kilobytes (1,000 bytes),
• megabytes (1000² bytes),
• gigabytes (1000³ bytes),
• terabytes (1000⁴ bytes),
• petabytes (1000⁵ bytes),
• exabytes (1000⁶ bytes) and
• zettabytes (1000⁷ bytes).
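The decimal ladder above can be sketched in code. The following is a minimal Python sketch (the function name is invented for illustration) that walks a byte count up the 1,000-based units:

```python
# Sketch: converting a raw byte count into the decimal (SI) units
# listed above, where each step up the ladder is a factor of 1,000.
UNITS = ["bytes", "KB", "MB", "GB", "TB", "PB", "EB", "ZB"]

def human_readable(num_bytes: float) -> str:
    """Format a byte count using decimal (1,000-based) units."""
    for unit in UNITS:
        if num_bytes < 1000:
            return f"{num_bytes:g} {unit}"
        num_bytes /= 1000
    return f"{num_bytes:g} YB"

print(human_readable(4_000_000_000_000_000))  # Facebook's ~4 PB/day
print(human_readable(500_000_000_000_000))    # ~500 TB/day
```

For example, `human_readable(1500)` yields "1.5 KB", matching the kilobyte definition above.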
Social media and genetic sequencing are among the fastest-
growing sources of Big Data and examples of untraditional
sources of data being used for analysis.
Examples- Gathering of Data
Several industries have led the way in developing their
ability to gather and exploit data:
Definition
Big Data is data whose scale, distribution, diversity, and/or timeliness require the use of new technical architectures and analytics to enable insights that unlock new sources of business value.
Unit 1- Big Data Analytics
Contents
Evolution of Big Data
Types of Data
Evolution of Big Data
• 1970s and before was the era of mainframes.
Difference
What is Analytics
Analytics is the process of discovering, interpreting, and
communicating significant patterns in data.
It can be applied to structured data or unstructured data.
Structured Data
• Any data that can be stored, accessed, and processed in a fixed format is termed 'structured' data.
• Structured data includes all data that can be stored in a tabular column.
• It is the type of data stored in relational databases such as MySQL and Oracle, where data is organized in rows and columns within named tables.
• Relational databases and Excel spreadsheets are examples of structured data.
• Structured data depends on the existence of a data model – a model of how data can be stored, processed and accessed.
Structured Data
• Because of the data model, each field is discrete and can be accessed separately or jointly along with data from other fields.
• This makes structured data extremely powerful: it is possible to quickly aggregate data from various locations in the database.
• Structured data exists in a format created to be captured, stored, organized and analyzed.
• It's neatly organized for easy access.
• If structured data were an office, it would contain many file cabinets that are efficiently set up, clearly labeled and easy to access.
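The "quick aggregation" described above can be illustrated with an in-memory SQLite table; the table, column names, and values below are hypothetical:

```python
# Sketch: structured data in a relational table -- a fixed schema of
# named columns lets each field be queried separately or aggregated.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("East", 120.0), ("West", 80.0), ("East", 50.0)])
# A quick aggregation across rows -- the kind of query the fixed
# row/column structure makes cheap.
total_east = conn.execute(
    "SELECT SUM(amount) FROM sales WHERE region = 'East'").fetchone()[0]
print(total_east)  # 170.0
conn.close()
```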
Semi-Structured data
• This type of data does not have a standard data model, but it has clear self-describing patterns and structure.
• Examples of semi-structured data are Excel spreadsheets that have a row and column structure and XML files that are defined by an XML schema.
• Semi-structured data is a form of structured data that does not conform to the formal structure of data models associated with relational databases or other forms of data tables,
• but it nonetheless contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data.
• Therefore, it is also known as self-describing data.
Semi-Structured data
• The reason this third category exists (between structured and unstructured data) is that semi-structured data is considerably easier to analyse than unstructured data.
• Many Big Data solutions and tools have the ability to 'read' and process either JSON or XML. This reduces the complexity of analysing semi-structured data, compared to unstructured data.
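As a minimal illustration of the self-describing structure mentioned above, a JSON document can be read without any external schema; the record below is invented:

```python
# Sketch: semi-structured data -- a JSON document carries its own
# field names (tags), so no external schema is needed to read it.
import json

doc = '{"user": "alice", "tags": ["big-data", "nosql"], "age": 30}'
record = json.loads(doc)
# The field names travel with the data ("self-describing"):
print(sorted(record.keys()))   # ['age', 'tags', 'user']
print(record["tags"][0])       # big-data
```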
Quasi-Structured data
• Textual data with erratic data formats that can be formatted with effort, tools, and time.
• For instance, web clickstream data that may contain inconsistencies in data values and formats.
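A sketch of what "formatting with effort, tools, and time" can look like: the clickstream lines below are hypothetical, using inconsistent date separators and spacing, and a regular expression coerces them into uniform fields:

```python
# Sketch: quasi-structured data -- erratic log lines normalized with a
# regular expression. The log format here is invented for illustration.
import re

lines = [
    "2024-01-05 12:01:33 GET /home 200",
    "2024/01/05 12:02:10  GET   /products  404",  # different separators, extra spaces
]
pattern = re.compile(
    r"(\d{4})[-/](\d{2})[-/](\d{2})\s+(\S+)\s+(\w+)\s+(\S+)\s+(\d{3})")

parsed = []
for line in lines:
    m = pattern.match(line)
    if m:
        year, month, day, _time, method, path, status = m.groups()
        parsed.append(f"{year}-{month}-{day} {method} {path} {status}")
print(parsed)
```

After the effort of writing the pattern, both erratic lines come out in one uniform format.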
Unstructured data
• Unstructured data is information that either does not have a predefined data model or is not organised in a predefined manner.
• Unstructured information is typically text-heavy, but may contain data such as dates, numbers, and facts as well.
• This results in irregularities and ambiguities that make it difficult to understand using traditional programs, as compared to data stored in structured databases.
• Common examples of unstructured data include audio files, video files, and documents stored in NoSQL databases.
Unstructured data
• The ability to store and process unstructured data has
greatly grown in recent years, with many new
technologies and tools coming to the market that are able
to store specialised types of unstructured data.
• MongoDB, for example, is optimised to store documents.
• Apache Giraph, as an opposite example, is optimised for
storing relationships between nodes.
• The ability to analyse unstructured data is especially
relevant in the context of Big Data, since a large part of
data in organisations is unstructured.
Unstructured data
• Think about pictures, videos or PDF documents. The ability to extract value from unstructured data is one of the main drivers behind the quick growth of Big Data.
Unit 1- Big Data Analytics
Contents
Capturing Big Data
Big Data
• Big data is high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.
• Big data refers to datasets whose size is typically beyond the storage capacity of, and too complex for, traditional database software tools.
• Big data is anything that exceeds the human and technical infrastructure needed to support its storage, processing and analysis.
• It is data that is big in volume, velocity and variety.
Data: Big in Volume, Variety and Velocity
Volume of Data
• Volume can be in terabytes, petabytes or zettabytes.
• Gartner Glossary: Big data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight and decision making.
• The quantity of data generated in the world has been doubling every 12-18 months.
• Searching the World Wide Web was the first true Big Data application.
• Google perfected the art of this application and developed many of the path-breaking Big Data technologies.
Velocity of Data
• If traditional data is like a lake, Big Data is like a fast-flowing river.
Variety of Data
• If traditional data forms such as invoices and ledgers were like a small store, Big Data is the biggest imaginable shopping mall that offers unlimited variety.
• Data can be structured, semi-structured or unstructured. Data stored in a database is an example of structured data.
• HTML data, XML data, email data and CSV files are examples of semi-structured data. PowerPoint presentations, images, videos, research papers, white papers, the body of an email, etc. are examples of unstructured data.
• There are three major kinds of variety of data: form of data, function of data, and source of data.
V’s in Big Data
Veracity of Data
• Ingesting and processing data from different systems results in veracity challenges concerning data accuracy.
• For example, if different records show the same data with different timestamps, it is hard to determine which record is the correct one.
• Alternatively, if data is incomplete, one may not know about it, and there can be a system error. Hence, big data systems need concepts, methods, and tools to overcome the veracity challenge.
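One common way (though not the only one) to resolve the timestamp conflict described above is to keep the most recent record per entity; the records and field names below are hypothetical:

```python
# Sketch: resolving a veracity conflict -- several records describe the
# same entity with different timestamps, so we keep the latest one.
from datetime import datetime

records = [
    {"id": "cust-1", "email": "a@old.com", "updated": "2023-01-10"},
    {"id": "cust-1", "email": "a@new.com", "updated": "2024-03-02"},
    {"id": "cust-2", "email": "b@x.com",   "updated": "2023-06-15"},
]

latest = {}
for rec in records:
    ts = datetime.strptime(rec["updated"], "%Y-%m-%d")
    if rec["id"] not in latest or ts > latest[rec["id"]][0]:
        latest[rec["id"]] = (ts, rec)

resolved = [rec for _, rec in latest.values()]
print(sorted(r["email"] for r in resolved))  # ['a@new.com', 'b@x.com']
```

"Latest wins" is only one policy; a real system might instead flag conflicts for review.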
Value of Data
• The last V in the 5 V's of big data is value.
• This refers to the value that big data can provide, and it relates directly to what organizations can do with the collected data.
• Being able to pull value from big data is a requirement, as the value of big data increases significantly depending on the insights that can be gained from it.
In short
Challenges with Big Data
Technological Challenges & Solutions
Challenge | Description | Solution | Technology
Comparing Big Data with Traditional data
Unit 1- Big Data Analytics
Contents
State Of The Practice In Analytics
STATE OF THE PRACTICE IN ANALYTICS
• Current business problems provide many opportunities for
organizations to become more analytical and data driven.
• The first three examples do not represent new problems.
• Organizations have been trying to reduce customer churn,
increase sales, and cross-sell customers for many years.
• What is new is the opportunity to fuse advanced analytical
techniques with Big Data to produce more impactful analyses
for these traditional problems.
• The last example portrays emerging regulatory requirements.
• Many compliance and regulatory laws have been in existence for decades, but additional requirements are added every year, which represent additional complexity and data requirements for organizations.
• Laws related to anti-money laundering (AML) and fraud prevention require advanced analytical techniques to comply with and manage.
Key aspects of the state of the practice:
• BI Versus Data Science
• Current Analytical Architecture (data flow)
• Drivers of Big Data
• Emerging Big Data Ecosystem and a New Approach to Analytics
Comparing BI with DS
BI & DS
Analytical Architecture
Drivers of Big Data
• In the 1990s the volume of information was often measured in terabytes. Most organizations analyzed structured data in rows and columns and used relational databases and data warehouses to manage large amounts of enterprise information.
• Later, different kinds of data sources emerged – mainly productivity and publishing tools such as content management repositories and network-attached storage systems. With this kind of information to manage, data began to increase in size and started to be measured at petabyte scales.
• The information that organizations try to manage has broadened to include many other kinds of data.
• In this era, everyone and everything is leaving a digital footprint.
• These applications, which generate data volumes that can be measured at exabyte scale, provide opportunities for new analytics and drive new value for organizations.
• The data now comes from multiple sources, such as medical information, photos and video footage, video surveillance, mobile devices, smart devices, non-traditional IT devices, etc.
Emerging Big Data Ecosystem
Contents
• Key Roles for the New Big Data Ecosystem
• Overview
Key Roles for the New Big Data Ecosystem
Data Analytics Lifecycle
• The Data Analytics Lifecycle is designed for Big Data problems and data science projects.
CLASSIFICATION OF ANALYTICS
• There are basically two schools of thought:
First School of Thought
• It includes basic analytics, operationalized analytics, advanced analytics and monetized analytics.
• Basic analytics:
• This primarily is slicing and dicing of data to help with basic business insights.
• This is about reporting on historical data, basic visualization, etc.
• Together, these levels of analytics answer questions such as: Why did it happen? What will happen? How can we make it happen?
• Operationalized analytics:
• Analytics is operationalized if it gets woven into the enterprise's business processes.
• Advanced analytics:
• This largely is about forecasting the future by way of predictive and prescriptive modelling.
• Monetized analytics:
• This is analytics in use to derive direct business revenue.
Analytics 1.0, 2.0 and 3.0
Second School of Thought
Phase 1 : Data discovery
• In this first phase of data analytics, the stakeholders perform the following tasks:
• examine the business trends, make case studies of similar data analytics projects, and study the domain of the business industry.
• The entire team makes an assessment of the in-house resources, the in-house infrastructure, total time involved, and technology requirements.
• Once all these assessments and evaluations are completed, the stakeholders start formulating the initial hypothesis for resolving all business challenges in terms of the current market scenario.
Cont…
• The data science team learns about and investigates the problem.
• It develops context and understanding.
• It comes to know about the data sources needed and available for the project.
• The team formulates initial hypotheses that can later be tested with data.
Phase 2 : Data Preparation
• In the second phase, after data discovery, data is prepared by transforming it from a legacy system into a form fit for analytics by using a sandbox platform.
• A sandbox is a scalable platform commonly used by data scientists for data preprocessing.
• It includes many CPUs, high-capacity storage, and high I/O capacity.
• The IBM Netezza 1000 is one such platform.
Cont…
• This phase covers the steps to explore, preprocess, and condition data prior to modeling and analysis.
• It requires the presence of an analytic sandbox; the team executes extract, load, and transform processes to get data into the sandbox.
• Data preparation tasks are likely to be performed multiple times and not in a predefined order.
• Several tools commonly used for this phase are Hadoop, Alpine Miner, OpenRefine, etc.
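As a toy stand-in for tools such as OpenRefine, the sketch below conditions a few hypothetical rows by dropping those whose fields cannot be parsed, mirroring the explore/preprocess/condition step:

```python
# Sketch: conditioning raw rows before modeling -- drop records with
# missing or malformed fields. Input rows are invented for illustration.
raw_rows = [
    {"age": "34", "income": "55000"},
    {"age": "",   "income": "61000"},   # missing value
    {"age": "29", "income": "abc"},     # malformed value
]

def condition(rows):
    """Keep only rows whose fields parse cleanly as integers."""
    clean = []
    for row in rows:
        try:
            clean.append({"age": int(row["age"]), "income": int(row["income"])})
        except ValueError:
            continue  # in practice such rows are logged and revisited,
                      # since preparation runs multiple times
    return clean

print(condition(raw_rows))  # [{'age': 34, 'income': 55000}]
```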
P.V.Kale 105
Phase 3 : Model Planning
• The third phase of the lifecycle is model planning, where the data analytics team plans the methods to be adopted and the workflows to be followed during the next phase of model building.
• At this stage, the division of work among the team is decided to clearly define the workload among the team members.
• The data prepared in the previous phase is further explored to understand the features and their relationships, and feature selection is performed for applying to the model.
Phase 3 : Model Planning
• The team explores the data to learn about the relationships between variables and subsequently selects the key variables and the most suitable models.
• In this phase, the data science team develops data sets for training, testing, and production purposes.
• The team builds and executes models based on the work done in the model planning phase.
• Several tools commonly used for this phase are MATLAB and STATISTICA.
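The explore-and-select activity can be sketched with the standard library alone, as a stand-in for the MATLAB/STATISTICA workflows named above; the synthetic data and the 80/20 split are illustrative choices:

```python
# Sketch: model planning -- split prepared data into training and test
# sets, then check how strongly a candidate variable relates to the
# target via the Pearson correlation coefficient.
import random

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

random.seed(42)                          # reproducible illustration
xs = list(range(30))
ys = [2 * x + random.uniform(-3, 3) for x in xs]  # near-linear relation

pairs = list(zip(xs, ys))
random.shuffle(pairs)
split = int(0.8 * len(pairs))            # 80/20 train-test split
train, test = pairs[:split], pairs[split:]

r = pearson([x for x, _ in train], [y for _, y in train])
print(f"train={len(train)} test={len(test)} r={r:.2f}")
```

A correlation near 1 would mark this variable as a key candidate for the model built in the next phase.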
Phase 4 : Model Building
• The next phase of the lifecycle is model building, in which the team works on developing datasets for training and testing as well as for production purposes.
• The execution of the model, based on the planning made in the previous phase, is also carried out.
• The kind of environment needed for the execution of the model is decided and prepared, so that if a more robust environment is required, it is applied accordingly.
Cont..
• The team develops datasets for testing, training, and production purposes.
• The team also considers whether its existing tools will suffice for running the models or whether it needs a more robust environment for executing the models.
• Free or open-source tools: R and PL/R, Octave, WEKA.
• Commercial tools: MATLAB, STATISTICA.
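As a minimal example of the model-building step, the sketch below fits an ordinary-least-squares line in plain Python, standing in for the R/Octave/WEKA tools listed above; the training data are invented:

```python
# Sketch: model building with the simplest possible model -- an
# ordinary least squares fit of y = a + b*x on a training set.
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

train_x = [1, 2, 3, 4, 5]
train_y = [2.1, 3.9, 6.2, 7.8, 10.1]    # roughly y = 2x
a, b = fit_line(train_x, train_y)
print(f"y = {a:.2f} + {b:.2f}x")

# Score a held-out point before moving on to Phase 5:
pred = a + b * 6
print(round(pred, 1))
```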
P.V.Kale 109
Phase 5 : Communicate results
• Phase five of the life cycle checks the results of the project to determine whether it is a success or a failure.
• The results are scrutinized by the entire team along with the stakeholders to draw inferences on the key findings and summarize the entire work done.
• Also, the business value is quantified and an elaborate narrative on the key findings is prepared and discussed among the various stakeholders.
Cont…
• After executing the model, the team needs to compare the outcomes of modeling to the criteria established for success and failure.
• The team considers how best to articulate the findings and outcomes to the various team members and stakeholders, taking caveats and assumptions into account.
• The team should identify the key findings, quantify the business value, and develop a narrative to summarize and convey the findings to stakeholders.
Phase 6 : Operationalization
• In phase six, a final report is prepared by the team along with the briefings, source code, and related documents.
• The last phase also involves running a pilot project to implement the model and test it in a real-time environment.
• As data analytics helps build models that lead to better decision-making, it in turn adds value to individuals, customers, business sectors, and other organizations.
Cont…
• The team communicates the benefits of the project more broadly and sets up a pilot project to deploy the work in a controlled way before broadening the work to a full enterprise of users.
• This approach enables the team to learn about the performance and related constraints of the model in a production environment on a small scale, and make adjustments before full deployment.
• The team delivers final reports, briefings, and code.
• Free or open-source tools: Octave, WEKA, SQL, MADlib.
Case Study: GINA (Global Innovation Network and Analysis)
P.V.Kale 117
P.V.Kale 118
P.V.Kale 119
P.V.Kale 120
P.V.Kale 121
P.V.Kale 122
P.V.Kale 123
P.V.Kale 124
P.V.Kale 125
P.V.Kale 126
P.V.Kale 127
P.V.Kale 128
Questions
1. Describe the V's model of Big Data.
2. What are the major technological challenges in managing Big Data?
3. Explain the difference between BI and Data Science.
4. What are the different phases of the Data Analytics Lifecycle? Explain each in detail.
5. Describe the current analytical architecture for data scientists.
6. What are the key roles in the new Big Data ecosystem?