Module 1


DATA ANALYTICS

Dr. S. ILANKUMARAN
ASSISTANT PROFESSOR / IT
Introduction
• Data are everywhere.
• IBM estimates that 2.5 quintillion bytes of data are generated
every day.
• 90 percent of the data has been created in the last two years.
• 85 percent of organizations will be unable to exploit big data for
competitive advantage.
• 4.4 million jobs will be created around big data.
Largest Data Sets Analysis by KDnuggets
Data Size             Count   Percentage
Less than 1 MB          12       3.3
1.1 to 10 MB             8       2.5
11 to 100 MB            14       4.3
101 MB to 1 GB          50      15.5
1.1 to 10 GB            59      18
11 to 100 GB            52      16
101 GB to 1 TB          59      18
1.1 to 10 TB            39      12
11 to 100 TB            15       4.7
101 TB to 1 PB           6       1.9
1.1 to 10 PB             2       0.6
11 to 100 PB             0       0
Over 100 PB              6       1.9
PROCESS OF DATA ANALYTICS
• The steps involved in the analytics process:

Collecting data
Cleaning data
Manipulating data
Analyzing data
Visualizing data
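The five steps above can be sketched end to end on a toy data set. This is a minimal illustration only: the readings, the price per unit, and all function names are hypothetical and not part of the original slides.

```python
from statistics import mean

# Collecting: hypothetical monthly electricity readings in kWh.
# None marks a reading that could not be collected.
raw_readings = [310, None, 295, 1020, 305, 298]


def clean(readings):
    """Cleaning: drop missing readings."""
    return [r for r in readings if r is not None]


def manipulate(readings, price_per_kwh):
    """Manipulating: derive the bill amount from each reading."""
    return [r * price_per_kwh for r in readings]


def analyze(values):
    """Analyzing: compute simple summary statistics."""
    return {"mean": mean(values), "max": max(values)}


def visualize(values, scale):
    """Visualizing: build one text bar per value."""
    return ["#" * int(v // scale) for v in values]


cleaned = clean(raw_readings)
bills = manipulate(cleaned, price_per_kwh=2)
summary = analyze(bills)
bars = visualize(bills, scale=100)

print(summary)
for bar in bars:
    print(bar)
```

In practice each step is usually far larger than one function, but the order of the calls mirrors the order of the steps.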
Example Applications
• Mail box analysis
• Internet Bill
• Electricity Bill
• Social Media
Analytics Process Model
ANALYTICS
• Analytics is a term that is often used interchangeably with data
science, data mining and knowledge discovery.
• It refers to extracting useful business patterns or mathematical
decision models from a preprocessed data set.
• Different underlying techniques can be used for this purpose:
• Statistics (linear and logistic regression)
• Machine learning (decision trees)
• Biology-inspired methods (neural networks)
• Kernel methods (SVM)
Predictive and Descriptive - Distinction
• Predictive
• Target is available
• Target is categorical or continuous
• Descriptive
• Target is not available
• Association rules, Sequence rules, and Clustering
Example of Classification Predictive Analytics
Analytical Model Requirements
• A first critical success factor is business relevance
• The analytical model should actually solve the business
problem for which it was developed.
• It makes no sense to have a working analytical model
that got sidetracked from the original problem
statement.
• In order to achieve business relevance, the business
problem to be solved must be appropriately defined,
qualified, and agreed upon by all parties involved at the
outset of the analysis.
Analytical Model Requirements
• A second criterion is statistical performance.
• The model should have statistical significance and
predictive power.
• Depending on the application, analytical models should
also be:
• Interpretable - understanding the patterns that the
analytical model captures
• Justifiable - the degree to which a model corresponds
to previous business knowledge
Analytical Model Requirements
• Analytical models should also be operationally efficient.
This covers the effort needed to:
• collect the data,
• preprocess the data,
• evaluate the model,
• feed its outputs to the business application.
• The economic cost needed to set up the analytical model
should also be considered.
• Analytical models should also comply with both local and
international regulation.
STANDARDIZING
• Data standardization is the process of converting data to a
common format to enable users to process and analyze it.
• Data standardization is the critical process of bringing data into a
common format that allows for
• collaborative research,
• large-scale analytics,
• sharing of sophisticated tools and methodologies.
STANDARDIZING
• Data standardization is also essential for preserving data quality.
When data is standardized, it is much easier to detect errors and
ensure that it is accurate.
• This is essential for making sure that decision-makers have access
to accurate and reliable information.
Why Data Standardization?
• Data standardization is essential because it allows different systems to
exchange data consistently.
• Without standardization, it would be challenging for computers to
communicate with each other and exchange data.
• Standardization also makes it easier to process and analyze data and
store it in a database.
• With this approach, businesses can make better decisions based on
their data.
• When data is standardized, companies can compare and analyze it more
easily to derive insights that they can use to improve their operations.
Why Data Standardization?
• Data standardization has many benefits, but one of the most
important is that it helps businesses avoid making decisions based on
inaccurate or incomplete data.
• Data standardization ensures that companies have a complete and
accurate picture of their data, allowing them to make better decisions
to improve their bottom line.
How to Standardize Data
• Determine Your Requirement
• First, look at the types of data you have and how it's currently organized.
• Is it all in one place?
• Are there different formats?
• Is it accurate and up-to-date?
• Once you understand your current data situation, you can start to identify
areas where standardization would be beneficial.
• Next, consider your business goals and the decisions you need to make.
• What kinds of data would you need to make those decisions?
• Would standardization help you to access and analyze that data more
effectively?
How to Standardize Data
• Assess Data Entry Points
• Several things need to be determined when evaluating data entry points
during the data standardization process.
• To simplify the process, it is helpful first to identify all potential data entry
points and evaluate their feasibility.
• Some factors to consider when assessing data entry points include:
• The data source: Is the data reliable and accurate?
• The data format: Can the data be easily converted into the desired format?
• The data volume: Is the volume of data manageable?
• The data entry points: Are the data entry points clearly defined and easy to use?
How to Standardize Data
• Define Data Standards
• When handling data, it is crucial to establish standards for how that data is
organized and formatted. This ensures that everyone in your organization
works with the same assumptions and that data can be easily shared between
different departments and systems.
• Data standards are rules or guidelines that dictate how data should be
organized and formatted. By establishing data standards, you can ensure that
your data is consistent and easy to work with.
• You need to decide what format your data should be in. Data can be
formatted as text, numbers, dates, or any other data type.
How to Standardize Data
• Clean Your Data
• This means removing any invalid, incorrect, or duplicate data points. Invalid data
does not meet the requirements of the field in which it is entered.
• For example, a phone number field should only contain numbers and perhaps a
dash or parentheses. Any other characters in that field would be invalid.
• Incorrect data does not accurately represent what it is supposed to mean.
• For example, a field that is supposed to contain a person's last name may
instead include their first name.
• Duplicate data is data that is identical to another data point in the same dataset.
• Once you have cleaned your data, you can begin the data standardization
process. This means setting consistent rules for how data should be entered
and encoded.
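The invalid and duplicate cases described above can be sketched as follows. The records, field names, and the exact phone pattern are assumptions made for illustration. In particular, the pattern also allows spaces, which the slide does not mention.

```python
import re

# Hypothetical customer records; the field names are illustrative only.
records = [
    {"last_name": "Kumar", "phone": "(044) 555-0101"},
    {"last_name": "Kumar", "phone": "(044) 555-0101"},  # exact duplicate
    {"last_name": "Devi", "phone": "555-01O2"},         # letter O: invalid
    {"last_name": "Raj", "phone": "555-0199"},
]

# Assumed rule: digits, parentheses, dashes, and spaces only.
PHONE_PATTERN = re.compile(r"[0-9()\- ]+")


def is_valid_phone(phone):
    """Return True when every character is allowed in a phone field."""
    return PHONE_PATTERN.fullmatch(phone) is not None


def clean(rows):
    """Drop rows with an invalid phone number and exact duplicate rows."""
    seen = set()
    cleaned = []
    for row in rows:
        key = (row["last_name"], row["phone"])
        if key in seen or not is_valid_phone(row["phone"]):
            continue
        seen.add(key)
        cleaned.append(row)
    return cleaned


cleaned_records = clean(records)
print(cleaned_records)
```

A format check like this catches invalid data only. Incorrect data, such as a first name stored in the last-name field, still passes and needs a separate check against a trusted source.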
How to Standardize Data
• Normalize Your Data With a Data Automation Platform
• A data automation platform can help you to normalize your data so that it is
all in the same format.
• This can make it much easier to work with and analyze.
• You can also use a data automation platform to standardize data types.
• This can help you ensure that all of your data is in the same format, making it
much easier to work with.
Standardizing Data in Excel
• The Excel STANDARDIZE function is available under Excel Statistical
Functions. It returns a normalized value, which is also called a z-score.
• The mean and standard deviation are the basis of the z-score. The
z-score (or standard score) is a method to put scores on the same
scale. It divides a score's deviation from the mean by the standard
deviation of the data set. The resulting score is the number of
standard deviations a data point lies from the mean.
• Zero is the average of all z-scores for a dataset. A negative z score
indicates that the value is lower than the mean. A positive z score
indicates that the value is higher than the mean.
Standardizing Data in Excel
• Z-Score Formula = STANDARDIZE(x, mean, standard_dev)
• Here: X= data value that you need to normalize.
• Mean= Distribution arithmetic mean
• Standard_dev= Distribution standard deviation.
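The same calculation can be sketched outside Excel. The function below mirrors the three arguments of STANDARDIZE. The sample scores are hypothetical, and using the population standard deviation (`pstdev`) is an assumption, since the slide does not say which variant to pass in.

```python
from statistics import mean, pstdev


def standardize(x, mean_value, standard_dev):
    """Return the z-score of x: (x - mean) / standard deviation."""
    if standard_dev <= 0:
        raise ValueError("standard_dev must be positive")
    return (x - mean_value) / standard_dev


# Hypothetical scores, chosen so the numbers are easy to check by hand.
scores = [2, 4, 4, 4, 5, 5, 7, 9]
mu = mean(scores)       # arithmetic mean of the scores
sigma = pstdev(scores)  # population standard deviation (assumption)

z_scores = [standardize(x, mu, sigma) for x in scores]
print(z_scores)
```

Values below the mean get negative z-scores, values above it get positive ones, and the z-scores of the whole data set average to zero.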
Steps to standardize data
• Four steps to standardize customer data for better insights
• Step 1: Conduct a data source audit.
• Step 2: Define standards for data formats.
• Step 3: Standardize the format of external data sources.
• Step 4: Standardize existing data in the database.
• Standard deviation: $\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \text{mean})^2}$
• Standardize: $z_i = (x_i - \text{mean}) / \sigma$
CATEGORIZATION
• Categorization is a major component of qualitative data analysis by
which investigators attempt to group patterns observed in the data
into meaningful units or categories.
• Categorization is also referred to as coarse classification, classing,
grouping, binning, etc.
• For categorical variables, it is needed to reduce the number of
categories.
• E.g. Purpose of loan has 50 values.
• 49 dummy variables are needed to represent this one variable.
Categorization Methods
• Two very basic methods are used for categorization.
• equal interval binning
• equal frequency binning.
• Consider, for example, the income values 1,000, 1,200, 1,300, 2,000, 1,800, and 1,400.
• Equal interval binning would create two bins with the same range:
• Bin 1: 1,000 to 1,500;
• Bin 2: 1,500 to 2,000.
• Equal frequency binning would create two bins with the same number of
observations:
• Bin 1: 1,000, 1,200, 1,300;
• Bin 2: 1,400, 1,800, 2,000.
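Both methods can be sketched on the income values from the example. The function names are hypothetical. Placing a value that falls exactly on a boundary into the upper bin, except for the maximum, is an assumption, since the slides do not specify it.

```python
def equal_interval_bins(values, n_bins):
    """Split the value range into n_bins intervals of equal width."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins  # assumes the values are not all equal
    bins = [[] for _ in range(n_bins)]
    for v in values:
        # The maximum would land in bin n_bins, so clamp it into the last bin.
        index = min(int((v - low) / width), n_bins - 1)
        bins[index].append(v)
    return bins


def equal_frequency_bins(values, n_bins):
    """Sort the values and split them into n_bins groups of equal size."""
    ordered = sorted(values)
    bins = [[] for _ in range(n_bins)]
    for position, v in enumerate(ordered):
        bins[position * n_bins // len(ordered)].append(v)
    return bins


incomes = [1000, 1200, 1300, 2000, 1800, 1400]
interval_bins = equal_interval_bins(incomes, 2)
frequency_bins = equal_frequency_bins(incomes, 2)

print(interval_bins)
print(frequency_bins)
```

With equal interval binning, 1,400 falls into the first bin, so the two bins hold four and two observations. Equal frequency binning holds three in each.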
Weight of Evidence Coding
• Variable transformation of independent variables.
• Used for grouping, variable selection etc.
• The weight of evidence tells the predictive power of an independent
variable in relation to the dependent variable.
Weight of Evidence Coding
• Example: Predict good or bad customer based on age or income
• Model 1:
• Customer type = a + b (income) ----> Predicts 70% correctly
• Model 2:
• Customer type = a + b (age) ----> Predicts 60% correctly

• So the ability of “income” to separate good and bad customers is
greater than that of “age”, and hence “income” carries the greater weight.
Weight of Evidence Coding
• Definition: $\text{WOE} = \ln\left(\frac{\text{Distribution of Goods}}{\text{Distribution of Bads}}\right)$
• Since it evolved from the credit scoring world, it is generally described as
a measure of the separation of good and bad customers.
• "Bad customers" refers to the customers who defaulted on a loan,
and "good customers" refers to the customers who paid back the loan.
• Positive WOE means Distribution of Goods > Distribution of Bads.
• Negative WOE means Distribution of Goods < Distribution of Bads.
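A small worked example may help. It assumes the usual credit scoring definition, in which the WOE of a category is the natural log of its share of all good customers divided by its share of all bad customers. The income bins and counts are hypothetical.

```python
import math

# Hypothetical counts of good and bad customers per income bin.
income_bins = {
    "low": {"good": 100, "bad": 60},
    "medium": {"good": 300, "bad": 30},
    "high": {"good": 100, "bad": 10},
}


def weight_of_evidence(bins):
    """Return ln(distribution of goods / distribution of bads) per bin."""
    total_good = sum(counts["good"] for counts in bins.values())
    total_bad = sum(counts["bad"] for counts in bins.values())
    woe = {}
    for name, counts in bins.items():
        dist_good = counts["good"] / total_good
        dist_bad = counts["bad"] / total_bad
        # A bin with zero goods or zero bads would break the log;
        # this sketch assumes every bin contains both.
        woe[name] = math.log(dist_good / dist_bad)
    return woe


woe = weight_of_evidence(income_bins)
for name, value in woe.items():
    print(name, round(value, 3))
```

The "low" bin holds 20 percent of the goods but 60 percent of the bads, so its WOE is negative. The other two bins hold a larger share of goods than of bads, so their WOE is positive.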
Weight of Evidence Coding
DATA SEGMENTATION
• Sometimes data is segmented before the analytical modeling starts.
• The segmentation can be conducted
• using the experience and knowledge from a business expert
• based on statistical analysis using decision trees, k‐means, or self‐organizing
maps
• Segmentation is used to estimate different analytical models each
personalized to a specific segment.
• This process must be done carefully because it may increase
production, monitoring, and maintenance costs.
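As an illustration of statistical segmentation, here is a minimal one-dimensional k-means sketch. The customer ages and starting centroids are hypothetical. Real projects would normally use a library implementation on several variables.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Plain k-means on single numbers, starting from the given centroids."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: attach each value to its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            sum(cluster) / len(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical customer ages with two visible groups.
ages = [20, 22, 25, 60, 62, 65]
centroids, segments = kmeans_1d(ages, centroids=[20, 65])

print(centroids)
print(segments)
```

Each resulting segment could then receive its own analytical model, which is where the extra production, monitoring, and maintenance cost comes from.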
DATA SEGMENTATION
• Data segmentation is the process of taking the data you hold,
dividing it up, and grouping similar data together based on the chosen
parameters, so that you can use it more efficiently within marketing
and operations.
• It is the process of grouping your data into at least two subsets.
F TEST
• The F-test is used to carry out the test for the equality of the two
population variances.
• If a researcher wants to test whether or not two independent
samples have been drawn from a normal population with the same
variability, then the F-test is generally employed.
F-TEST
• It is a statistical test used to compare two data sets.
• It reports details such as the mean, variance, and number of
observations.
• In regression, the F-test compares your model with a model that has
zero predictor variables and decides whether your added coefficients
improved the model.
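For the equality-of-variances case, the test statistic is the ratio of the two sample variances. The sketch below computes only that ratio and the degrees of freedom. The sample data are hypothetical, and putting the larger variance in the numerator is one common convention rather than something the slides specify.

```python
from statistics import variance


def f_statistic(sample_a, sample_b):
    """Return (F, numerator df, denominator df) for two samples."""
    var_a = variance(sample_a)  # sample variance, divides by n - 1
    var_b = variance(sample_b)
    # Convention assumed here: the larger variance goes on top, so F >= 1.
    if var_a >= var_b:
        return var_a / var_b, len(sample_a) - 1, len(sample_b) - 1
    return var_b / var_a, len(sample_b) - 1, len(sample_a) - 1


# Hypothetical measurements from two independent groups.
group_a = [2, 4, 6, 8]
group_b = [1, 2, 3, 4]

f_value, df_num, df_den = f_statistic(group_a, group_b)
print(f_value, df_num, df_den)
```

To reach a decision, the F value is compared with the critical value of the F-distribution for these degrees of freedom, taken from a table or a statistics package.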
T-distribution
• The t-distribution is used as an alternative to the normal distribution
when sample sizes are small, in order to estimate confidence intervals.
• It is also used to determine critical values, which indicate how likely
it is that an observation lies a given distance from the mean.
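A confidence interval for the mean of a small sample can be sketched as follows. The sample is hypothetical. The critical value 2.776 is the two-sided 95 percent value for 4 degrees of freedom, hard-coded from a standard t-table, and must be looked up again for other sample sizes or confidence levels.

```python
from math import sqrt
from statistics import mean, stdev

# Two-sided 95% critical value of the t-distribution with 4 degrees of
# freedom, hard-coded from a standard t-table (valid for n = 5 only).
T_CRITICAL_95_DF4 = 2.776

# Hypothetical small sample of five measurements.
sample = [10, 12, 14, 16, 18]


def confidence_interval(values, t_critical):
    """Return (lower, upper) bounds for the mean of a small sample."""
    n = len(values)
    centre = mean(values)
    standard_error = stdev(values) / sqrt(n)  # sample standard deviation
    margin = t_critical * standard_error
    return centre - margin, centre + margin


lower, upper = confidence_interval(sample, T_CRITICAL_95_DF4)
print(round(lower, 3), round(upper, 3))
```

With the normal distribution the critical value would be 1.96, so the t-based interval is wider, which reflects the extra uncertainty of a small sample.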
