Module 2 - Introduction To BA


Modules 2 & 3

• Introduction to Business Analytics: Definition, Types - Descriptive, Predictive and Prescriptive Analytics
• Predictive and Prescriptive Analytics, Business Analytics for
decision making
• Ethics in data management   
Module 3
• Introduction to Machine Learning: Machine Learning - Definition, Machine Learning
workflow
• Models – CRISP-DM & SEMMA
Data analytics (definition)
• Businesses use analytics to explore and examine their data and then
transform their findings into insights that ultimately help executives,
managers and operational employees make better, more informed
business decisions.
• The major types of analytics businesses use are descriptive analytics,
what has happened in a business; predictive analytics, what could happen;
and prescriptive analytics, what should happen.
• While each of these methodologies offers its own unique insights,
advantages and disadvantages in their application, used in combination
these analytics tools can be an especially powerful asset to a business.
• https://www.youtube.com/watch?v=diaZdX1s5L4
Descriptive Analytics: Insight into the past

• Descriptive analysis, or descriptive statistics, does exactly what the name implies: it “describes”, or summarizes, raw data and makes it something that is interpretable by humans.
• They are analytics that describe the past.
• The past refers to any point of time that an event has occurred,
whether it is one minute ago, or one year ago.
• Descriptive analytics are useful because they allow us to learn from
past behaviors, and understand how they might influence future
outcomes.
Descriptive analytics
• Descriptive analytics is a commonly used form of data analysis whereby historical data is
collected, organised and then presented in a way that is easily understood.
• Descriptive analytics is focused only on what has already happened in a business and,
unlike other methods of analysis, it is not used to draw inferences or predictions from its
findings.
• Descriptive analytics is, rather, a foundational starting point used to inform or prepare data
for further analysis down the line. 
• Generally the simplest form of data analytics, descriptive analytics uses simple maths and statistical tools, such as arithmetic, averages and per cent changes, rather than the complex calculations necessary for predictive and prescriptive analytics.
• Visual tools such as line graphs and pie and bar charts are used to present findings,
meaning descriptive analytics can – and should – be easily understood by a wide
business audience.
• Descriptive analytics uses two key methods, data aggregation and data mining (also
known as data discovery), to discover historical data. Data aggregation is the process of
collecting and organising data to create manageable data sets. These data sets are then
used in the data mining phase where patterns, trends and meaning are identified and then
presented in an understandable way. 
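Below is a minimal Python sketch of the data aggregation step described above, using pandas on a small hypothetical sales table (the column names and figures are invented for illustration).

```python
# A minimal sketch of data aggregation for descriptive analytics, using
# pandas and a small hypothetical sales table (all names are illustrative).
import pandas as pd

sales = pd.DataFrame({
    "region":  ["North", "North", "South", "South", "South"],
    "month":   ["Jan", "Feb", "Jan", "Feb", "Feb"],
    "revenue": [12000, 13500, 9800, 10200, 11050],
})

# Data aggregation: organise the raw rows into a manageable data set.
monthly = sales.groupby(["region", "month"], as_index=False)["revenue"].sum()
print(monthly)

# A simple "data mining" pass over the aggregate: summary statistics per region.
print(monthly.groupby("region")["revenue"].describe())
```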
Descriptive analytics process

• Business metrics are decided. First, metrics are created that will effectively evaluate performance against business goals, such as improving operational efficiency or increasing revenue. The success of descriptive analytics relies heavily on KPI (key performance indicator) governance.
• The data required is identified. Data is sourced from repositories such as reports and databases. To measure accurately against KPIs, companies must catalogue and prepare the correct data sources to extract the needed data and calculate metrics based on the current state of the business.
• The data is collected and prepared. Data preparation – transformation and cleansing, for example – takes place before the analysis stage and is a critical step to ensure accuracy; it is also one of the most time-consuming steps for the analyst.
• The data is analysed. Summary statistics, clustering, pattern tracking and regression
analysis are used to find patterns in the data and measure performance. 
• The data is presented. Finally, charts and graphs are used to present findings in a
way that non-analytics experts can understand.
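As an illustration of the analysis and presentation steps above, here is a minimal Python sketch (hypothetical revenue figures) that computes summary statistics and a per cent change, then draws a simple chart with pandas and matplotlib.

```python
# A minimal sketch of the descriptive analytics process: a KPI (monthly
# revenue), simple summary statistics, and a chart. Figures are hypothetical.
import pandas as pd
import matplotlib.pyplot as plt

revenue = pd.Series(
    [110, 118, 121, 119, 130, 142],
    index=["Jan", "Feb", "Mar", "Apr", "May", "Jun"],
    name="revenue_k",
)

# Analyse: summary statistics and per cent change month over month.
print(revenue.describe())
print(revenue.pct_change().mul(100).round(1))

# Present: a simple bar chart that a non-analytics audience can read.
revenue.plot(kind="bar", title="Monthly revenue (k)")
plt.tight_layout()
plt.show()
```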
Descriptive analytics: Usage
• Descriptive analytics is frequently used in the day-to-day operations of an
organisation.
• Company reports – such as those on inventory, workflow, sales and
revenue – are all examples of descriptive analytics that provide a historical
review of an organisation’s operations.
• Data collected by these kinds of reports can be easily aggregated and
used to create snapshots of an organisation’s operations.
• Social analytics are almost always an example of descriptive analytics.
The number of followers, likes and posts can be used to determine the
average number of replies per post, the number of page views and the
average response time, for example. The comments that people post on
Facebook or Instagram are also examples of descriptive analytics and can
be used to better understand user attitudes. 
Advantages and disadvantages of descriptive analytics

• Since descriptive analytics relies only on historical data and simple calculations, this methodology can easily be applied in day-to-day operations, and its application doesn’t necessarily require an extensive knowledge of analytics.
• This means that businesses can relatively quickly and easily
report on performance and gain insights that can be used to
make improvements.
• On the other hand, descriptive analytics has the obvious
limitation that it doesn’t look beyond the surface of the data –
this is where predictive and prescriptive analytics come into
play. 
Examples of descriptive analytics
• Descriptive analytics helps organisations measure performance to ensure goals and targets are being met. And if they aren’t being met, descriptive analytics can identify areas that require improvement or change. Some examples of how descriptive analytics can be used include the following:
• Summarising past events such as sales and operations data or
marketing campaigns
• Social media usage and engagement data such as Instagram or
Facebook likes
• Reporting general trends
• Collating survey results
Predictive Analytics: Understanding the future

• Predictive analytics has its roots in the ability to “predict” what might
happen.
• These analytics are about understanding the future.
• Predictive analytics provides companies with actionable insights
based on data.
• Predictive analytics provides estimates about the likelihood of a future
outcome
• A common use of predictive analytics is to produce a credit score. These scores are used by financial services to determine the probability of customers making future credit payments.
Predictive Analytics

• While descriptive analytics focuses on historical data, predictive analytics, as its name implies, is focused on predicting and understanding what could happen in the future.
• By analysing past data patterns and trends in historical data and customer insights, businesses can predict what might happen going forward and, in doing so, inform many aspects of the business, including setting realistic goals, effective planning, managing performance expectations and avoiding risks.
Predictive Analytics

• Predictive analytics is based on probabilities.


• Using a variety of techniques – such as data mining, statistical modelling (mathematical
relationships between variables to predict outcomes) and machine learning algorithms
(classification, regression and clustering techniques) – predictive analytics attempts to
forecast possible future outcomes and the likelihood of those events.
• To make predictions, machine learning algorithms, for example, take existing data and attempt to fill in the missing data with the best possible guesses (a minimal regression sketch follows this slide).
• A newer branch of machine learning is deep learning, which, according to 
Cornerstone Performance Management, mimics the construction of ‘human neural
networks as layers of nodes that learn a specific process area but are networked
together into an overall prediction.’
• Deep learning examples include credit scoring using social and environmental analysis
and sorting digital medical images such as X-rays to automate predictions for doctors to
use when diagnosing patients.
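As a hedged illustration of the regression idea mentioned above (not any company's actual model), here is a minimal Python sketch that fits a linear regression to hypothetical monthly sales and forecasts the next period.

```python
# A minimal sketch of predictive analytics via regression: fit a model on
# historical (hypothetical) monthly sales and forecast the next period.
import numpy as np
from sklearn.linear_model import LinearRegression

months = np.arange(1, 13).reshape(-1, 1)           # past 12 months
sales = np.array([200, 205, 212, 220, 225, 231,    # hypothetical history
                  240, 247, 255, 262, 270, 278])

model = LinearRegression().fit(months, sales)
forecast = model.predict(np.array([[13]]))[0]
print(f"Forecast for month 13: {forecast:.1f}")    # a point estimate, not a certainty
```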
Predictive analytics

• Since predictive analytics can tell a business what could happen in the future, this methodology empowers executives and managers to take a more proactive, data-driven approach to business strategy and decision making.
• Businesses can use predictive analytics for anything from
forecasting customer behaviour and purchasing patterns to
identifying sales trends.
• Predictions can also help forecast such things as supply chain,
operations and inventory demands.
Advantages and disadvantages of predictive analytics

• Because predictive analysis is based on probabilities, it can never be completely accurate – but it can act as a vital tool to forecast possible future events and inform effective business strategy for the future.
• Predictive analytics can also improve many areas of a business, including:
• Efficiency, which could include inventory forecasting
• Customer service, which can help a company gain a better understanding of who their
customers are and what they want in order to tailor recommendations
• Fraud detection and prevention, which can help companies identify patterns and
changes
• Risk reduction, which, in the finance industry, might mean improved candidate
screening 
• This method of analysis relies on the existence of historical data, usually large amounts
of it.
Examples Of predictive analytics

• The healthcare industry, as an example, is a key beneficiary of predictive analytics. In 2019, RMIT
University partnered with Digital Health Cooperative Research Centre to 
develop clinical decision support software for aged care that will reduce emergency
hospitalisations and predict patient deterioration by interpreting historical data and developing
new predictive analytics techniques. The goal is that predictive analytics will allow aged-care
providers, residents and their families to better plan for the end of life. 
• Other examples of industries in which predictive analysis can be used include the following:
• E-commerce – predicting customer preferences and recommending products to customers
based on past purchases and search history
• Sales – predicting the likelihood that customers will purchase another product or leave the store
• Human resources – detecting if employees are thinking of quitting and then persuading them to
stay
• IT security – identifying possible security breaches that require further investigation
• Healthcare – predicting staff and resource needs
Prescriptive Analytics: Advise on possible outcomes
• The relatively new field of prescriptive analytics allows users to “prescribe” a number of
different possible actions and guide them towards a solution.
• In a nutshell, these analytics are all about providing advice.
• Prescriptive analytics attempts to quantify the effect of future decisions in order to advise
on possible outcomes before the decisions are actually made.
• At their best, prescriptive analytics predicts not only what will happen, but also why it
will happen, providing recommendations regarding actions that will take advantage of the
predictions.
• Prescriptive analytics uses a combination of techniques and tools such as business rules,
algorithms, machine learning and computational modelling procedures. These techniques
are applied against input from many different data sets including historical and
transactional data, real-time data feeds, and big data.
• If descriptive analytics tells you what has happened and predictive analytics tells you what could happen, then prescriptive analytics tells you what should be done.
Prescriptive Analytics
• Prescriptive analytics takes what has been learned through
descriptive and predictive analysis and goes a step further by
recommending the best possible courses of action for a business.
• This is the most complex stage of the business analytics process,
requiring much more specialised analytics knowledge to perform, and
for this reason it is rarely used in day-to-day business operations. 
• While predictive analytics looks at historical data using statistical techniques to make predictions about the future, machine learning, a subset of artificial intelligence, refers to the ability of a computer system to understand large – often huge – amounts of data without explicit directions and, while doing so, adapt and become increasingly smarter.
Prescriptive analytics

• Prescriptive analytics anticipates what, when and, importantly, why something might happen.
• After considering the possible implications of each decision option,
recommendations can then be made in regard to which decisions will
best take advantage of future opportunities or mitigate future risks.
• Prescriptive analytics predicts multiple futures and, in doing so, makes
it possible to consider the possible outcomes for each before any
decisions are made.
• When prescriptive analytics is performed effectively, findings can have
a real impact on business strategy and decision making to improve
things such as production, customer experience and business growth.
Advantages disadvantages of prescriptive analytics

• Prescriptive analytics, when used effectively, provides invaluable insights in order to make the best possible, data-based decisions to optimise business performance.
• However, as with predictive analytics, this methodology
requires large amounts of data to produce useful results, which
isn’t always available.
• Also, machine learning algorithms, on which this analysis often
relies, cannot always account for all external variables.
• On the flip side, the use of machine learning dramatically
reduces the possibility of human error. 
Examples of prescriptive analytics
A commonly used prescriptive analytics tool is GPS technology, since it provides recommended routes to get the user to their desired destination based on such things as journey time and road closures.
 In this instance, prescriptive analysis ‘optimises an objective that measures the distances
from your starting point to your destination and prescribes the optimal route that has the
shortest distance.’
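As a toy illustration of that optimisation idea (not how any real navigation system is implemented), the sketch below runs Dijkstra's algorithm over a small hypothetical road network to prescribe the shortest route.

```python
# A toy sketch of the GPS example: prescribe the shortest route on a small
# hypothetical road network using Dijkstra's algorithm.
import heapq

roads = {                      # distances in km (made up)
    "Home":   {"A": 4, "B": 2},
    "A":      {"Office": 5},
    "B":      {"A": 1, "Office": 8},
    "Office": {},
}

def shortest_route(graph, start, goal):
    queue = [(0, start, [start])]          # (distance so far, node, path)
    seen = set()
    while queue:
        dist, node, path = heapq.heappop(queue)
        if node == goal:
            return dist, path
        if node in seen:
            continue
        seen.add(node)
        for nxt, step in graph[node].items():
            heapq.heappush(queue, (dist + step, nxt, path + [nxt]))
    return None

print(shortest_route(roads, "Home", "Office"))   # (8, ['Home', 'B', 'A', 'Office'])
```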
Other areas of prescriptive analysis application, according to data analytics firm Sisense,
include the following:
• Oil and manufacturing – tracking fluctuating prices  
• Manufacturing – improving equipment management, maintenance, price modelling,
production and storage
• Healthcare – improving patient care and healthcare administration by evaluating things
such as rates of readmission and the cost-effectiveness of procedures
• Insurance – assessing risk in regard to pricing and premium information for clients
• Pharmaceutical research – identifying the best testing and patient groups for clinical trials.
Ethics in data management   
• Facebook's recent data breach, if found to violate the EU General Data
Protection Regulation (GDPR), could cost them 4% of their global revenue (or
$1.63 billion) in fines.
• This resonated as a warning shot to enterprises across the globe.
Why should businesses handle their users' data ethically?
• The answer is simple: it helps them earn their customers' trust. Every organization is, at its core, in the people business.
• The trust they establish determines their success.
• That trust is notoriously easy to lose; a single instance of unethical behavior by a
company could jeopardise its future.
• Companies have expended enormous amounts of money and effort to fix small instances of negligence in ethical data handling.
• To preserve consumer trust, companies will need to go beyond data security and
privacy to ensure that there is ethical handling of data within and beyond the
organisation. Clearly defined and communicated principles and practices can drive
honest and appropriate behaviors. That's where Data Governance comes in.
Ethics in data management   
• defining ownership of data,
• obtaining consent to collect and share data,
• protecting the identity of human subjects and their personal identifying information, and
• the licensing of data, etc.
Main data ethics principles
• What are the basic data ethics principles?
• PRINCIPLE ONE: Minimising the risk of harm.
• PRINCIPLE TWO: Obtaining informed consent.
• PRINCIPLE THREE: Protecting anonymity and confidentiality.
• PRINCIPLE FOUR: Avoiding deceptive practices.
Ethics in data management   

• Before enterprises can initiate Data Governance, they must identify the
regulations and frameworks relevant to their businesses.
• Data Governance programs could benefit from ethics frameworks from the
government and/or the wider public sector.
• The UK's Department for Digital, Culture, Media & Sport, for example,
formulated one in service of its National Data Strategy.
• It outlines principles on how data should be used in the public sector,
emphasising the importance of collective standards and ethical
frameworks.
• Such frameworks serve as guidelines on understanding the effects of
technology, data workflows and data sharing, as well as their ethical
and real-world consequences
Ethics in data management   

• Data ethics describes a behavior code, often focused on what is wrong and what is right.
This encompasses the following:
• Data management – This includes recording, generation, curation,
dissemination, processing, use, and sharing.
• Algorithms – This includes machine learning algorithms, robots, and artificial agents.
• Corresponding practices – It includes programming, responsible
innovation, professional codes, and hacking.
Ethical principles
• Experts agree on the following ethical principles for using data:
1. Privacy – customer identity and data should remain private. Privacy does not mean confidentiality, as private information may be needed for auditing under legal procedures; however, such data must be acquired from the individual with full consent, and it should not be revealed to other individuals or companies in a way that allows them to trace the person's identity.
2. Shared private information should always remain private – third-party companies often share sensitive data, typically locational, financial and medical data, and there must be limits on how it can be shared, for both privacy and legal reasons.
3. Customers should have a transparent view of how their data is being sold or utilised, and the ability to control the flow of their private data across third-party and massive analytical systems.
4. There should be no interference between big data and human will – big data analytics can determine, and even moderate, who we are before we make up our own minds, so organisations need to consider which types of inferences and predictions should and should not be allowed.
5. Big data should not institutionalise prejudicial biases such as sexism and racism – machine learning algorithms can absorb unconscious biases in people and amplify them through countless training samples.
Ethical principles

Ethical principles for using data provide a high-level and broad context for resolving ethical predicaments, namely to:
• Protect vulnerable people who could be harmed by activities within the profession;
• Enhance and protect the trust in, and reputation of, the profession;
• Give a basis for public evaluation of, and expectations for, the profession;
• Establish the profession as a distinct moral community worthy of autonomy from external regulation and control;
• Serve as a guide for adjudicating disputes among organizations, both members and non-members; and
• Make institutions resilient in the face of external pressures.
• https://www2.deloitte.com/us/en/insights/industry/public-sector/chief-
data-officer-government-playbook/managing-data-ethics.html
• India: The new law, the Personal Data Protection Bill (PDP), is
currently in front of parliament and was proposed to effect a
comprehensive overhaul of India's current data protection regime,
which today is governed by the Information Technology Act, 2000.
• PDP Bill proposes that the processing of personal data must
comply with seven principles for processing, namely: (i) processing
of personal data has to be fair and reasonable; (ii) it should be for a
specific purpose; (iii) only personal data necessary for the
purpose should be collected; (iv) it should be lawful; 
UK GDPR sets out seven key principles:
• Lawfulness, fairness and transparency.
• Purpose limitation.
• Data minimisation.
• Accuracy.
• Storage limitation.
• Integrity and confidentiality (security)
• Accountability.
• https://iclg.com/practice-areas/data-protection-laws-and-regulations/india

• https://www.youtube.com/watch?v=RbxdkTixxLo

• https://www.youtube.com/watch?v=q_okDS2RtzY
(Crisp DM)
Machine learning
• Machine learning is a branch of artificial intelligence (AI) and computer science
which focuses on the use of data and algorithms to imitate the way that
humans learn, gradually improving its accuracy.
• Arthur Samuel is credited with coining the term “machine learning”.
• Robert Nealey, the self-proclaimed checkers master, played the game on an
IBM 7094 computer in 1962, and he lost to the computer.
• Compared to what can be done today, this feat almost seems trivial, but it’s
considered a major milestone within the field of artificial intelligence.
• In the decades since, technological developments in storage and processing power have enabled some of the innovative products that we know and love today, such as Netflix’s recommendation engine or self-driving cars.
ML and Deep Learning

• Machine learning, deep learning, and neural networks are all sub-fields of artificial
intelligence.
• However, neural networks are actually a sub-field of machine learning, and deep learning is a sub-field of neural networks.
• The way in which deep learning and machine learning differ is in how each algorithm
learns.
• Deep learning automates much of the feature extraction piece of the process,
eliminating some of the manual human intervention required and enabling the use of
larger data sets.
• Classical, or "non-deep", machine learning is more dependent on human intervention to
learn.
• Human experts determine the set of features to understand the differences between data
inputs, usually requiring more structured data to learn.
ML and Deep Learning

• "Deep" machine learning can leverage labeled datasets, also known as


supervised learning, to inform its algorithm, but it doesn’t necessarily require a
labeled dataset.
• It can ingest unstructured data in its raw form (e.g. text, images), and it can
automatically determine the set of features which distinguish different
categories of data from one another.
• Unlike machine learning, it doesn't require human intervention to process
data, allowing us to scale machine learning in more interesting ways.
• Deep learning and neural networks are primarily credited with accelerating
progress in areas, such as computer vision, natural language processing, and
speech recognition.
ML and Deep Learning
• Neural networks, or artificial neural networks (ANNs), are comprised of node layers, containing an input layer, one or more hidden layers, and an output layer (a minimal forward-pass sketch follows this slide).
• Each node, or artificial neuron, connects to another and has an associated weight and
threshold.
• If the output of any individual node is above the specified threshold value, that node
is activated, sending data to the next layer of the network.
• Otherwise, no data is passed along to the next layer of the network.
• The “deep” in deep learning is just referring to the depth of layers in a neural
network.
• A neural network that consists of more than three layers—which would be inclusive
of the inputs and the output—can be considered a deep learning algorithm or a deep
neural network.
• A neural network that only has two or three layers is just a basic neural network.
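A minimal Python sketch of the node behaviour described on this slide, using made-up inputs, weights, bias and threshold:

```python
# A minimal sketch of a single artificial neuron: weighted inputs, a
# threshold, and activation that passes data to the next layer.
# All numbers are illustrative only.
import numpy as np

def neuron(inputs, weights, bias, threshold=0.0):
    total = np.dot(inputs, weights) + bias      # weighted sum of inputs
    activated = total > threshold               # fire only above the threshold
    return total if activated else 0.0          # pass data on, or pass nothing

x = np.array([0.5, 0.8, 0.2])                   # outputs from the previous layer
w = np.array([0.4, -0.6, 0.9])                  # learned connection weights
print(neuron(x, w, bias=0.2))
```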
A machine learning algorithm can be broken down into three main parts.
1.A Decision Process: In general, machine learning algorithms are used to make a
prediction or classification.
Based on some input data, which can be labelled or unlabeled, your algorithm will
produce an estimate about a pattern in the data.
2.An Error Function: An error function serves to evaluate the prediction of the
model. If there are known examples, an error function can make a comparison to
assess the accuracy of the model.
3.A Model Optimization Process: If the model can fit better to the data points in the
training set, then weights are adjusted to reduce the discrepancy between the known
example and the model estimate. The algorithm will repeat this evaluate and
optimize process, updating weights autonomously until a threshold of accuracy has
been met.  
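A minimal Python sketch of these three parts for a one-parameter model on made-up data (a plain gradient-descent loop, used here purely for illustration):

```python
# A minimal sketch of the three parts above for a one-parameter model
# y ≈ w * x: a decision process (predict), an error function (mean squared
# error), and an optimisation loop that adjusts the weight. Data is made up.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])      # roughly y = 2x

w = 0.0                                  # initial weight
for step in range(200):
    pred = w * x                         # 1. decision process: make predictions
    error = np.mean((pred - y) ** 2)     # 2. error function: evaluate them
    grad = np.mean(2 * (pred - y) * x)   # gradient of the error w.r.t. w
    w -= 0.01 * grad                     # 3. optimisation: adjust the weight

print(round(w, 3), round(error, 4))      # w ends up close to 2
```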
Methods (ML)
• Supervised learning, also known as supervised machine learning, is defined by its use of labeled datasets to train algorithms to classify data or predict outcomes accurately.
• As input data is fed into the model, it adjusts its weights until the model
has been fitted appropriately.
• This occurs as part of the cross validation process to ensure that the
model avoids overfitting or underfitting.
• Supervised learning helps organizations solve for a variety of real-world
problems at scale, such as classifying spam in a separate folder from your
inbox.
• Some methods used in supervised learning include neural networks, naïve
bayes, linear regression, logistic regression, random forest, support vector
machine (SVM), and more.
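A minimal supervised-learning sketch in the spirit of the spam example, using scikit-learn on a tiny invented dataset (the messages and labels are hypothetical):

```python
# A minimal supervised-learning sketch: a labelled toy dataset, logistic
# regression, and a prediction for new text. Data is invented for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

messages = ["win a free prize now", "meeting at 10am tomorrow",
            "free money click here", "lunch with the project team",
            "claim your free reward", "agenda for the board meeting"]
labels = [1, 0, 1, 0, 1, 0]               # 1 = spam, 0 = not spam

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messages)    # word counts as features
model = LogisticRegression().fit(X, labels)

new = vectorizer.transform(["free prize inside"])
print(model.predict(new))                 # likely [1], i.e. spam
```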
Methods (ML)
• Unsupervised learning, also known as unsupervised machine learning,
uses machine learning algorithms to analyze and cluster unlabeled
datasets.
• These algorithms discover hidden patterns or data groupings without the
need for human intervention.
• Its ability to discover similarities and differences in information makes it the ideal solution for exploratory data analysis, cross-selling strategies, customer segmentation, and image and pattern recognition.
• It’s also used to reduce the number of features in a model through the
process of dimensionality reduction; principal component analysis (PCA)
and singular value decomposition (SVD) are two common approaches for
this.
• Other algorithms used in unsupervised learning include neural networks, k-
means clustering, probabilistic clustering methods, and more.
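A minimal unsupervised-learning sketch on synthetic data, combining k-means clustering with PCA for dimensionality reduction (all numbers are invented):

```python
# A minimal unsupervised-learning sketch: cluster unlabelled (synthetic)
# customer data with k-means, then reduce it to two dimensions with PCA.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# two synthetic customer groups: (annual spend, visits per month, basket size)
group_a = rng.normal([200,  2,  3], [20, 0.5, 1], size=(50, 3))
group_b = rng.normal([800, 10, 12], [50, 1.0, 2], size=(50, 3))
customers = np.vstack([group_a, group_b])

clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(customers)
reduced = PCA(n_components=2).fit_transform(customers)

print(np.bincount(clusters))   # roughly 50 customers per cluster
print(reduced.shape)           # (100, 2) after dimensionality reduction
```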
Methods (ML)
• Semi-supervised learning offers a happy medium between supervised and
unsupervised learning.
• During training, it uses a smaller labeled data set to guide classification and
feature extraction from a larger, unlabeled data set.
• Semi-supervised learning can solve the problem of not having enough labeled data (or not being able to afford to label enough data) to train a supervised learning algorithm (a minimal sketch follows this slide).
• Reinforcement machine learning is a behavioral machine learning model that is
similar to supervised learning, but the algorithm isn’t trained using sample data.
• This model learns as it goes by using trial and error. A sequence of successful
outcomes will be reinforced to develop the best recommendation or policy for a
given problem.
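A minimal semi-supervised sketch on synthetic data, where a couple of labelled points guide the classification of a larger unlabelled set (scikit-learn's LabelPropagation is just one possible tool for this):

```python
# A minimal semi-supervised sketch: a few labelled points guide the
# classification of a larger unlabelled set (unlabelled points carry -1).
import numpy as np
from sklearn.semi_supervised import LabelPropagation

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, size=(50, 2)),     # cluster around (0, 0)
               rng.normal(5, 1, size=(50, 2))])    # cluster around (5, 5)

y = np.full(100, -1)          # -1 marks the unlabelled majority
y[0], y[50] = 0, 1            # only two points are labelled

model = LabelPropagation().fit(X, y)
print(np.bincount(model.transduction_))   # labels inferred for all 100 points
```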
Egs of ML
• Speech Recognition: It is also known as automatic speech recognition (ASR),
computer speech recognition, or speech-to-text, and it is a capability which uses
natural language processing (NLP) to process human speech into a written
format. Many mobile devices incorporate speech recognition into their systems to
conduct voice search—e.g. Siri—or provide more accessibility around texting.
• Customer Service:  Online chatbots are replacing human agents along the
customer journey. They answer frequently asked questions (FAQs) around topics,
like shipping, or provide personalized advice, cross-selling products or suggesting
sizes for users, changing the way we think about customer engagement across
websites and social media platforms. Examples include messaging bots on e-
commerce sites with virtual agents, messaging apps, such as Slack and Facebook
Messenger, and tasks usually done by virtual assistants and voice assistants.
Egs of ML
• Recommendation Engines: Using past consumption behavior data,
AI algorithms can help to discover data trends that can be used to
develop more effective cross-selling strategies. This is used to make
relevant add-on recommendations to customers during the checkout
process for online retailers.
• Automated stock trading: Designed to optimize stock portfolios, AI-
driven high-frequency trading platforms make thousands or even
millions of trades per day without human intervention.
Egs of ML

• Computer Vision: This AI technology enables computers and systems to derive meaningful information from digital images, videos and other visual inputs, and based on those inputs, it can take action.
• This ability to provide recommendations distinguishes it from image
recognition tasks.
• Powered by convolutional neural networks, computer vision has
applications within photo tagging in social media, radiology imaging
in healthcare, and self-driving cars within the automotive industry.
CRISP DM

• The CRoss Industry Standard Process for Data Mining (CRISP-DM) is a process model with six phases that naturally describes the data science life cycle. It’s like a set of guardrails to help you plan, organize, and implement your data science (or machine learning) project.
1.Business understanding – What does the business need?
2.Data understanding – What data do we have / need? Is it clean?
3.Data preparation – How do we organize the data for modeling?
4.Modeling – What modeling techniques should we apply?
5.Evaluation – Which model best meets the business objectives?
6.Deployment – How do stakeholders access the results?
CRISP-DM

• CRISP-DM, which stands for Cross-Industry Standard Process for Data Mining, is an industry-proven way to guide your data mining efforts.
• As a methodology, it includes descriptions of the typical phases of a
project, the tasks involved with each phase, and an explanation of the
relationships between these tasks.
• As a process model, CRISP-DM provides an overview of the data
mining life cycle.
CRISP-DM
• The CRISP-DM model is flexible and can be customized easily.
• For example, if your organization aims to detect money laundering, it is
likely that you will sift through large amounts of data without a specific
modeling goal.
• Instead of modeling, your work will focus on data exploration and
visualization to uncover suspicious patterns in financial data. CRISP-DM
allows you to create a data mining model that fits your particular needs.
Stage 1: Determine Business Objectives
• The first stage of the CRISP-DM process is to understand what you want to accomplish from a business perspective.
• Your organisation may have competing objectives and constraints that
must be properly balanced.
• The goal of this stage of the process is to uncover important factors
that could influence the outcome of the project.
• Neglecting this step can mean that a great deal of effort is put
into producing the right answers to the wrong questions.
Stage 1
Assess the current situation
• This involves more detailed fact-finding about all of the resources, constraints, assumptions and other factors that you’ll need to consider when determining your data analysis goal and project plan.
• Inventory of resources – List the resources available to the project
• Requirements, assumptions and constraints 
• Risks and contingencies
• Terminology
• Costs and benefits 
Stage 1
Determine Data Mining goal
• A business goal states objectives in business terminology. A data mining goal states
project objectives in technical terms. For example, the business goal might be
“Increase catalogue sales to existing customers.” A data mining goal might be
“Predict how many widgets a customer will buy, given their purchases over the past
three years, demographic information (age, salary, city, etc.), and the price of the
item.” 
1.Business success criteria  – describe the intended outputs of the project that enable
the achievement of the business objectives.
2.Data mining success criteria – define the criteria for a successful outcome to the
project in technical terms—for example, a certain level of predictive accuracy or a
propensity-to-purchase profile with a given degree of “lift.” As with business
success criteria, it may be necessary to describe these in subjective terms, in which
case the person or persons making the subjective judgment should be identified. 
Stage 1 :produce project plan
• Project plan – List the stages to be executed in the project, together with
their duration, resources required, inputs, outputs, and dependencies.
• analyze dependencies between time schedule and risks.
• Mark results of these analyses explicitly in the project plan, ideally with
actions and recommendations if the risks are manifested.
• Decide at this point which evaluation strategy will be used in the
evaluation phase. Your project plan will be a dynamic document.
• At the end of each phase you’ll review progress and achievements and
update the project plan accordingly.
• Specific review points for these updates should be part of the project plan.
• Initial assessment of tools and techniques – At the end of the first phase you should undertake an initial assessment of tools and techniques. Here, for example, you select a data mining tool that supports various methods for different stages of the process. It is important to assess tools and techniques early in the process, since the selection of tools and techniques may influence the entire project.
STAGE 2 :Data Understanding

• The second stage of the CRISP-DM process requires you to acquire the data listed in the project resources.
• This initial collection includes data loading, if this is necessary for
data understanding. For example, if you use a specific tool for data
understanding, it makes perfect sense to load your data into this tool.
If you acquire multiple data sources then you need to consider how
and when you’re going to integrate these.
• Initial data collection report – List the data sources
acquired together with their locations, the methods used to acquire
them and any problems encountered.
• Record problems you encountered
STAGE 2 :Data Understanding
Describe data :  Examine the “gross” or “surface” properties of the acquired data and report on the
results.
Data description report – Describe the data that has been acquired including its format, its
quantity (for example, the number of records and fields in each table), the identities of the
fields and any other surface features which have been discovered. Evaluate whether the data
acquired satisfies your requirements.
• Explore data
• During this stage you’ll address data mining questions using querying, data visualization and
reporting techniques. These may include:
• Distribution of key attributes (for example, the target attribute of a prediction task)
• Relationships between pairs or small numbers of attributes
• Results of simple aggregations
• Properties of significant sub-populations
• Simple statistical analyses
• These analyses may directly address your data mining goals. They may also contribute to or refine
the data description and quality reports, and feed into the transformation and other data preparation
steps needed for further analysis. 
STAGE 2 :Data Understanding

Data exploration report – Describe the results of your data exploration, including first findings or initial hypotheses and their impact on the remainder of the project.
• If appropriate you could include graphs and plots here to indicate data
characteristics that suggest further examination of interesting data subsets.
Verify data Quality : Examine the quality of the data, addressing questions
such as:
• Is the data complete (does it cover all the cases required)?
• Is it correct, or does it contain errors and, if there are errors, how common are
they?
• Are there missing values in the data? If so, how are they represented, where do
they occur, and how common are they?
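A minimal Python sketch of these data quality checks, run with pandas on a small hypothetical extract:

```python
# A minimal data-quality check in the spirit of the questions above,
# using pandas on a small hypothetical extract.
import numpy as np
import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 2, 2, 4],
    "amount":   [25.0, np.nan, 40.0, -5.0],
    "country":  ["IN", "IN", None, "UK"],
})

print(orders.isna().sum())                         # where are values missing?
print(orders.duplicated(subset="order_id").sum())  # duplicate keys?
print((orders["amount"] < 0).sum())                # values that look like errors
```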
Stage 3 : data preparation
CLEAN the data
• Data cleaning report – Describe what decisions and actions you took to address data
quality problems. Consider any transformations of the data made for cleaning purposes
and their possible impact on the analysis results.
• Construct required data
• This task includes constructive data preparation operations such as the production of
derived attributes or entire new records, or transformed values for existing attributes.
• Derived attributes – These are new attributes that are constructed from one or more
existing attributes in the same record, for example you might use the variables of length
and width to calculate a new variable of area.
• Generated records – Here you describe the creation of any completely new records. For
example you might need to create records for customers who made no purchase during
the past year. There was no reason to have such records in the raw data, but for modelling
purposes it might make sense to explicitly represent the fact that particular customers
made zero purchases.
Integrate data: Merged data (merging tables refers to joining together two or more tables); aggregation refers to operations in which new values are computed by summarising information from multiple records and/or tables.
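A minimal pandas sketch of the construction and integration tasks above (derived attribute, merged tables, aggregation) on hypothetical tables:

```python
# A minimal sketch of data preparation: a derived attribute (area from
# length and width), a merge of two tables, and an aggregation.
# Tables and columns are hypothetical.
import pandas as pd

products = pd.DataFrame({"product_id": [1, 2, 3],
                         "length_cm": [10, 20, 15],
                         "width_cm":  [4, 5, 10]})
orders = pd.DataFrame({"order_id": [100, 101, 102],
                       "product_id": [1, 1, 3],
                       "quantity": [2, 1, 5]})

# Derived attribute: a new column computed from existing ones.
products["area_cm2"] = products["length_cm"] * products["width_cm"]

# Integrate data: merge the two tables, then aggregate per product.
merged = orders.merge(products, on="product_id", how="left")
per_product = merged.groupby("product_id", as_index=False)["quantity"].sum()
print(merged)
print(per_product)
```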
STAGE 4 : Modelling
• Modelling technique – Document the actual modelling technique that is
to be used.
• Modelling assumptions – Many modelling techniques make specific
assumptions about the data, for example that all attributes have uniform
distributions, no missing values allowed, class attribute must be
symbolic etc. Record any assumptions made.
• Generate test design: Test design – Describe the intended plan for training, testing, and evaluating the models. A primary component of the plan is determining how to divide the available dataset into training, test and validation datasets (a minimal split sketch follows this slide).
• Build Model : Run the modelling tool on the prepared dataset to create
one or more models.
• Assess Model
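A minimal sketch of the test design and build/assess tasks above, splitting a synthetic dataset into training, validation and test sets with scikit-learn:

```python
# A minimal sketch of the modelling stage: split a (synthetic) dataset into
# training, validation and test sets, build a model, and assess it.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)          # synthetic target

# 60% train, 20% validation, 20% test.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

model = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)         # build model
print("validation accuracy:", accuracy_score(y_val, model.predict(X_val)))
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))    # assess model
```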
Stage 5 :Evaluate Results
• Assessment of data mining results
• Approved models – After assessing models with respect to business
success criteria, the generated models that meet the selected criteria
become the approved models
STAGE 6 : Deployment
• Plan Deployment : Summarise your deployment strategy including the
necessary steps and how to perform them.
• Plan Monitoring and Maintenance  – Summarise the monitoring and
maintenance strategy, including the necessary steps and how to perform them.
• Produce Final Report : This is the final written report of the data mining
engagement. It includes all of the previous deliverables, summarising and
organising the results.
• Final presentation – There will also often be a meeting at the conclusion of
the project at which the results are presented to the customer.
• Review project
CRISP-DM vs SEMMA

Step #   CRISP-DM                 SEMMA
1        Business understanding   Sample
2        Data understanding       Explore
3        Data preparation         Modify
4        Modeling                 Model
5        Evaluation               Assess
6        Deployment
KDD vs SEMMA vs CRISP-DM: comparison
SEMMA

• The acronym SEMMA stands for Sample, Explore, Modify, Model, Assess, and refers to the process of conducting a data mining project.
Stages: SEMMA

• Sample — a portion of a large data set is taken that is big enough to extract
significant information and small enough to manipulate quickly.
• Explore — data exploration can help in gaining understanding and ideas as
well as refining the discovery process by searching for trends and anomalies.
• Modify — the data modification stage focuses on creating, selecting and transforming variables to focus the model selection process. This stage may also look for outliers and reduce the number of variables.
• Model — there are different modeling techniques present and each type of
model has its strengths and is appropriate for a specific goal for data mining.
• Assess — this final stage focuses on the evaluation of the reliability and
usefulness of findings and estimates the performance.
1.Determine Business Objectives
- Background
- Business Objectives
- Business Success Criteria
2.Assess the Situation
- Inventory of Resources
- Requirements, Assumptions, and Constraints
- Risks and Contingencies
- Terminology
- Costs and Benefits
3.Determine Goals
- Data Mining Goals
- Data Mining Success Criteria
4.Produce Project Plan
- Project Plan
- Initial Assessment of Tools and Techniques
Data Understanding

• The second stage consists of collecting and exploring the input dataset. The set goal might be unsolvable using the input data; you might need to use public datasets, or even create a specific one for the set goal.
1.Collect Initial Data
- Initial Data Collection Report
2.Describe Data
- Data Description Report
3.Explore Data
- Data Exploration Report
4.Verify Data Quality
- Data Quality Report
Data Preparation
• As we all know, bad input inevitably leads to bad output. Therefore no matter what
you do in modeling — if you made major mistakes while preparing the data —
you will end up returning to this stage and doing it over again.
1.Select Data
- The rationale for Inclusion/Exclusion
2.Clean Data
- Data Cleaning Report
3.Construct Data
- Derived Attributes
- Generated Records
4.Integrate Data
- Merged Data
5.Format Data
- Reformatted Data
Modelling

1.Select Modeling Techniques
- Modeling Technique
- Modeling Assumptions
2.Generate Test Design
- Test Design
3.Build Model
- Parameter Settings
- Models
- Model Descriptions
4.Assess Model
- Model Assessment
- Revised Parameter Settings
Evaluation

1.Evaluate Results
- Assessment of Data Mining Results w.r.t. Business Success Criteria
- Approved Models
2.Review Process
- Review of Process
3.Determine Next Steps
- List of Possible Actions
- Decision
Deployment
1.Plan Deployment
- Deployment Plan
2.Plan Monitoring and Maintenance
- Monitoring and Maintenance Plan
3.Produce Final Report
- Final Report
- Final Presentation
4.Review Project
- Experience Documentation
