Big Data Module 2
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 2: Data Analytics Lifecycle 1
How to Approach Your Analytics Problems
Your Thoughts?
Value of Using the Data Analytics Lifecycle
Need For a Process to Guide Data Science Projects
Key Roles for a Successful Analytic Project
Role: Description
Business User: Someone who benefits from the end results and can consult and advise project team on value of end results and how these will be operationalized
Project Sponsor: Person responsible for the genesis of the project, providing the impetus for the project and core business problem, generally provides the funding and will gauge the degree of value from the final outputs of the working team
Project Manager: Ensure key milestones and objectives are met on time and at expected quality.
Business Intelligence Analyst: Business domain expertise with deep understanding of the data, KPIs, key metrics and business intelligence from a reporting perspective
Data Engineer: Deep technical skills to assist with tuning SQL queries for data management, extraction and support data ingest to analytic sandbox
Database Administrator (DBA): Database Administrator who provisions and configures database environment to support the analytical needs of the working team
Data Scientist: Provide subject matter expertise for analytical techniques, data modeling, applying valid analytical techniques to given business problems and ensuring overall analytical objectives are met
Data Analytics Lifecycle
[Figure: the six phases arranged as a cycle: 1 Discovery, 2 Data Prep, 3 Model Planning, 4 Model Building, 5 Communicate Results, 6 Operationalize. Checkpoint questions on the cycle: "Do I have enough information to draft an analytic plan and share for peer review?", "Do I have enough good quality data to start building the model?", "Do I have a good idea about the type of model to try? Can I refine the analytic plan?", "Is the model robust enough? Have we failed for sure?"]
Data Analytics Lifecycle
Phase 1: Discovery
• Learn the Business Domain
  Determine amount of domain knowledge needed to orient you to the data and interpret results downstream
  Determine the general analytic problem type (such as clustering, classification)
  If you don’t know, then conduct initial research to learn about the domain area you’ll be analyzing
• Learn from the past
  Have there been previous attempts in the organization to solve this problem?
  If so, why did they fail? Why are we trying again? How have things changed?
Data Analytics Lifecycle
Phase 1: Discovery
• Resources
Data Analytics Lifecycle
Phase 1: Discovery
• Frame the problem: framing is the process of stating the analytics problem to be solved
Tips for Interviewing the Analytics Sponsor
• Even if you are “given” an analytic problem, you should work with clients to
clarify and frame the problem
You’re typically handed solutions; you need to
identify the problem and their desired outcome
Sponsor Interview Tips
• Prepare for the interview – draft your questions, review them with a colleague or your team
• Use open-ended questions, don’t ask leading questions
• Probe for details, follow-up
• Don’t fill every silence – give them time to think
• Let them express their ideas, don’t put words in their mouth, let them share their feelings
• Ask clarifying questions, ask why – is that correct? Am I on target? Is there anything else?
• Use active listening – repeat it back to make sure you heard it correctly
• Don’t express your opinions
• Be mindful of your body language and theirs – use eye contact, be attentive
• Minimize distractions
• Document what you heard and review it back with the sponsor
Tips for Interviewing the Analytics Sponsor
Interview Questions
• What is the business problem you’re trying to solve?
• What is your desired outcome?
• Will the focus and scope of the problem change if the following dimensions
change:
• Time – analyzing 1 year or 10 years’ worth of data?
• People – how would this project change this?
• Risk – conservative to aggressive
• Resources – none to unlimited (tools, tech, …..)
• Size and attributes of Data
• What data sources do you have?
• What industry issues may impact the analysis?
• What timelines are you up against?
• Who could provide insight into the project? Who should be consulted?
• Who has final say on the project?
Data Analytics Lifecycle
Phase 1: Discovery
• Formulate Initial Hypotheses
  IH, H1, H2, H3, … Hn
  Gather and assess hypotheses from stakeholders and domain experts
  Preliminary data exploration to inform discussions with stakeholders during the hypothesis forming stage
• Identify Data Sources – Begin Learning the Data
  Aggregate sources for previewing the data and provide high-level understanding
  Review the raw data
  Determine the structures and tools needed
  Scope the kind of data needed for this kind of problem
Using a Sample Case Study to Track the Phases in the Data Analytics Lifecycle
Mini Case Study: Churn Prediction for Yoyodyne Bank
Situation Synopsis
• Yoyodyne Bank, a retail bank, wants to improve the Net Present Value
(NPV) and retention rate of its customers
• They want to establish an effective marketing campaign targeting
customers to reduce the churn rate by at least five percent
• The bank wants to determine whether those customers are worth
retaining. In addition, the bank also wants to analyze reasons for
customer attrition and what they can do to keep them
• The bank wants to build a data warehouse to support Marketing
and other related customer care groups
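The two measures named in the synopsis can be sketched in a few lines. This is a minimal illustration, not the bank's method: the function names, customer counts, cash flows and discount rate are all hypothetical, and because the slide does not say whether "five percent" is a relative reduction or a reduction in percentage points, both readings are shown.

```python
def churn_rate(customers_at_start: int, customers_lost: int) -> float:
    """Fraction of customers lost over a period."""
    if customers_at_start <= 0:
        raise ValueError("customers_at_start must be positive")
    return customers_lost / customers_at_start


def net_present_value(cash_flows: list, discount_rate: float) -> float:
    """Discount end-of-period cash flows (periods 1, 2, ...) back to today."""
    return sum(
        cash_flow / (1 + discount_rate) ** period
        for period, cash_flow in enumerate(cash_flows, start=1)
    )


# Hypothetical figures, not taken from the case study.
baseline = churn_rate(customers_at_start=10_000, customers_lost=1_200)

# Two possible readings of "reduce the churn rate by at least five percent".
relative_target = baseline * (1 - 0.05)   # 5% lower than the baseline rate
absolute_target = baseline - 0.05         # 5 percentage points lower

# Value today of one customer's expected yearly margin over three years.
customer_npv = net_present_value([120.0, 120.0, 120.0], discount_rate=0.08)

print(f"baseline churn:  {baseline:.3f}")
print(f"relative target: {relative_target:.3f}")
print(f"absolute target: {absolute_target:.3f}")
print(f"customer NPV:    {customer_npv:.2f}")
```

Agreeing on which reading the sponsor intends is part of framing the problem, since the two targets differ widely.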
How to Frame an Analytics Problem
Mini Case Study
[Table columns: Sample Business Problems; Qualifiers; Analytical Approach]
Data Analytics Lifecycle
Phase 2: Data Preparation
Data Analytics Lifecycle
Phase 2: Data Preparation
• Familiarize yourself with the data thoroughly
  List your data sources
  What’s needed vs. what’s available
• Data Conditioning
  Clean and normalize data
  Discern what you keep vs. what you discard
• Survey & Visualize
  Overview, zoom & filter, details-on-demand
  Descriptive Statistics
  Data Quality
• Useful Tools for this phase:
  • Descriptive Statistics on candidate variables for diagnostics & quality
  • Visualization: R (base package, ggplot and lattice), GnuPlot, Ggobi/Rggobi, Spotfire, Tableau
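Descriptive statistics and data conditioning can be sketched as follows. This is a minimal illustration using only Python's standard library rather than the tools listed on the slide; the column name and values are hypothetical, and min-max scaling is just one of several ways to normalize.

```python
import statistics


def describe(values):
    """Descriptive statistics plus a count of missing (None) entries."""
    present = [v for v in values if v is not None]
    return {
        "count": len(present),
        "missing": len(values) - len(present),
        "min": min(present),
        "max": max(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        "stdev": statistics.stdev(present),
    }


def min_max_normalize(values):
    """Rescale present values to [0, 1]; missing entries stay None."""
    present = [v for v in values if v is not None]
    low, high = min(present), max(present)
    if high == low:
        # A constant column carries no spread to rescale.
        return [None if v is None else 0.0 for v in values]
    return [None if v is None else (v - low) / (high - low) for v in values]


# Hypothetical candidate variable with two missing readings.
monthly_transactions = [12, 7, None, 30, 18, 7, None, 45]

summary = describe(monthly_transactions)
scaled = min_max_normalize(monthly_transactions)

for name, value in summary.items():
    print(f"{name}: {value}")
print(scaled)
```

The missing count speaks to data quality, and the summary helps decide what to keep vs. what to discard before any modeling starts.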
Data Analytics Lifecycle
Phase 3: Model Planning
• Determine Methods
  Select methods based on hypotheses, data structure and volume
Data Analytics Lifecycle
Phase 3: Model Planning
• Data Exploration
• Variable Selection
  Inputs from stakeholders and domain experts
  Capture essence of the predictors, leverage a technique for dimensionality reduction
  Iterative testing to confirm the most significant variables
• Model Selection
  Conversion to SQL or database language for best performance
  Choose technique based on the end goal
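One simple way to start the variable selection step is to rank candidate variables by the strength of their linear association with the outcome. The sketch below uses Pearson correlation as a stand-in; it is not a dimensionality reduction technique and not necessarily what the course intends, and the variable names and toy data are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either series is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        return 0.0
    return covariance / math.sqrt(spread_x * spread_y)


def rank_candidates(candidates, target):
    """Sort candidate variables by absolute correlation with the target."""
    scores = {name: pearson(column, target) for name, column in candidates.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Hypothetical toy data: 1 = customer churned, 0 = customer stayed.
churned = [1, 1, 0, 0, 1, 0, 0, 1]
candidates = {
    "monthly_transactions": [2, 3, 20, 25, 1, 18, 22, 4],
    "account_age_years": [1, 6, 3, 8, 2, 7, 4, 5],
    "branch_id": [3, 3, 3, 3, 3, 3, 3, 3],
}

ranking = rank_candidates(candidates, churned)
for name, score in ranking:
    print(f"{name}: {score:+.3f}")
```

A ranking like this is only a first pass. Inputs from stakeholders and iterative testing, as listed above, decide which variables actually stay in the model.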
Sample Research: Churn Prediction in Other Verticals
Mini Case Study: Churn Prediction for Yoyodyne Bank
Wireless Telecom: Neural network, decision tree, hierarchical neurofuzzy systems, rule evolver
Retail Banking: Multiple regression
Wireless Telecom: Logistic regression, neural network, decision tree
Data Analytics Lifecycle
Phase 4: Model Building
• Develop data sets for testing, training, and production purposes
  Need to ensure that the model data is sufficiently robust for the model and analytical techniques
  Smaller, test sets for validating approach, training set for initial experiments
• Get the best environment you can for building models and workflows… fast hardware, parallel processing
• Useful Tools for this phase: R, PL/R, SQL, Alpine Miner, SAS Enterprise Miner
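Developing separate training and test sets can be sketched as a reproducible random split. This is a minimal illustration in plain Python rather than the tools listed above; the function name, the 80/20 proportion and the seed are assumptions chosen for the example.

```python
import random


def split_dataset(rows, test_fraction=0.2, seed=42):
    """Shuffle rows reproducibly and return (training_rows, test_rows)."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)                  # copy so the input is left untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical customer identifiers standing in for full records.
customer_ids = list(range(100))
train, test = split_dataset(customer_ids, test_fraction=0.2)

print(len(train), "training rows,", len(test), "test rows")
```

Keeping the seed fixed means teammates who rerun the split work with the same rows, which makes results easier to compare across experiments.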
Data Analytics Lifecycle
Phase 5: Communicate Results
Did we succeed? Did we fail?
Data Analytics Lifecycle
Phase 6: Operationalize
• Run a pilot
Analytic Plan
Mini Case Study: Churn Prediction for Retail Banking
Components of Analytic Plan: Retail Banking: Yoyodyne Bank
Phase 1: Discovery, Business Problem Framed: How do we identify churn/no churn for a customer?
Initial Hypotheses: Transaction volume and type are key predictors of churn rates.
Key Outputs from a Successful Analytic Project, by Role
Role: Business User
Description: Someone who benefits from the end results and can consult and advise project team on value of end results and how these will be operationalized
What the Role Needs in the Final Deliverables:
• Sponsor Presentation addressing:
  • Are the results good for me?
  • What are the benefits of the findings?
  • What are the implications of this for me?
Role: Project Sponsor
Description: Person responsible for the genesis of the project, providing the impetus for the project and core business problem, generally provides the funding and will gauge the degree of value from the final outputs of the working team
What the Role Needs in the Final Deliverables:
• Sponsor Presentation addressing:
  • What’s the business impact of doing this?
  • What are the risks? ROI?
  • How can this be evangelized within the organization (and beyond)?
Role: Project Manager
Description: Ensure key milestones and objectives are met on time and at expected quality.
Role: Business Intelligence Analyst
Description: Business domain expertise with deep understanding of the data, KPIs, key metrics and business intelligence from a reporting perspective
What the Role Needs in the Final Deliverables:
• Show the analyst presentation
• Determine if the reports will change
Role: Data Engineer
Description: Deep technical skills to assist with tuning SQL queries for data management, extraction and support data ingest to analytic sandbox
What the Role Needs in the Final Deliverables:
• Share the code from the analytical project
• Create technical document on how to implement it.
Role: Database Administrator (DBA)
Description: Database Administrator who provisions and configures database environment to support the analytical needs of the working team
What the Role Needs in the Final Deliverables:
• Share the code from the analytical project
• Create technical document on how to implement it.
Role: Data Scientist
Description: Provide subject matter expertise for analytical techniques, data modeling, applying valid analytical techniques to given business problems and ensuring overall analytical objectives are met
What the Role Needs in the Final Deliverables:
• Show the analyst presentation
• Share the code
4 Core Deliverables to Meet Most Stakeholder Needs
1. Presentation for Project Sponsors
• “Big picture” takeaways for executive level stakeholders
• Determine key messages to aid their decision-making process
• Focus on clean, easy visuals for the presenter to explain and for the
viewer to grasp
2. Presentation for Analysts
• Business process changes
• Reporting changes
• Fellow Data Scientists will want the details and are comfortable with
technical graphs (such as ROC curves, density plots, histograms)
3. Code for technical people
4. Technical specs of implementing the code
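For readers less familiar with the technical graphs mentioned for the analyst presentation, the points of an ROC curve and the area under it can be computed directly from model scores. The sketch below is a plain-Python illustration with hypothetical labels and scores; in practice a statistics package would normally produce the plot.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative label")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    true_pos = false_pos = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every record that shares this score so ties form one point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                true_pos += 1
            else:
                false_pos += 1
            i += 1
        points.append((false_pos / negatives, true_pos / positives))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Hypothetical churn labels (1 = churned) and model scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(labels, scores)
area = auc(points)
print(points)
print(f"AUC = {area:.3f}")
```

An area near 1.0 means the model ranks churners above non-churners almost always, while 0.5 is what random ranking would give.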
Analyst Wish List for a Successful Analytics Project
Tools
• Statistical/mathematical/visual software of choice for a given situation and problem set,
such as SAS, Matlab, R, Java tools, Tableau, Spotfire
• Collaboration: an online platform or environment for collaboration and communicating
with team members
• Tool or place to log errors with systems, environments or data sets
Concepts in Practice
Greenplum’s Approach to Analytics
[Figure: Attract all kinds of data; flexible and elastic data structures; rich data repository and algorithmic engine. An analyze, restructure data, repeat cycle over past and future data, answering four questions: What happened, where and when? How and why did it happen? What will happen? How can we do better? Other labels: Facts, Interpretation, Fast, ETL / ELT, EDC Platform.]
Source: MAD Skills: New Analysis Practices for Big Data, March 2009
“The pessimist – complains about the wind.
The optimist – expects it to change.
The leader – adjusts the sails.”
John Maxwell (Leadership Author)
Check Your Knowledge
• In which phase would you expect to invest most of your project time, and
why? Where would you expect to spend the least time?
• What are the benefits of doing a pilot program before a full scale rollout of a
new analytical methodology? Discuss this in the context of the mini case
study.
• What kinds of tools would be used in the following phases, and for which
kinds of use scenarios?
Phase 2: Data Preparation
Phase 4: Model Building
• Now that you have completed the analytical project at Yoyodyne, you have an
opportunity to repurpose this approach for an online eCommerce company.
What phases of the lifecycle do you need to focus on to identify ways to do
this?
Module 2: Summary
Lab Exercise 1: Introduction to Data Environment
This first lab introduces the Analytics Lab Environment you
will be working on throughout the course.
After completing the tasks in this lab you should be able to:
• Authenticate and access the Virtual Machine (VM)
assigned to you for all of your lab exercises
• Locate data sets you will be working with for the
course’s labs
• Use meta commands and PSQL to navigate through
the data sets
• Create subsets of the big data, using table joins and
filters, to analyze in subsequent lab exercises
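The last objective, building a subset with a join and a filter, can be sketched outside the lab environment. The example below uses Python's built-in sqlite3 module with an in-memory database as a stand-in for the lab's PSQL session; the table names, columns and rows are hypothetical and are not the course data sets.

```python
import sqlite3

# In-memory database so the sketch leaves nothing behind.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE transactions (
        txn_id INTEGER PRIMARY KEY,
        customer_id INTEGER,
        amount REAL
    );
    INSERT INTO customers VALUES (1, 'east'), (2, 'west'), (3, 'east');
    INSERT INTO transactions VALUES
        (10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.0), (13, 3, 5.0);
    """
)

# Join the two tables, keep one region, and total spending per customer.
subset = conn.execute(
    """
    SELECT c.customer_id, c.region, SUM(t.amount) AS total_amount
    FROM customers AS c
    JOIN transactions AS t ON t.customer_id = c.customer_id
    WHERE c.region = ?
    GROUP BY c.customer_id, c.region
    ORDER BY c.customer_id
    """,
    ("east",),
).fetchall()
conn.close()

for row in subset:
    print(row)
```

The SELECT statement itself is standard SQL, so a query of the same shape should also run in a PSQL session against tables with matching columns, although the `?` placeholder is specific to the Python driver.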