Big Data Module 4 - Print-Ready Workbook (Letter)

Module 4: Fundamental Big Data Analysis & Science
INTRODUCTION
MIND MAP POSTER
BIG DATA MODULE 4 OFFICIAL SUPPLEMENT: ANALYSIS FORMULAS
ANALYSIS TECHNIQUES COVERAGE
OVERVIEW
PART I: BIG DATA SCIENCE CONCEPTS & ANALYSIS CHALLENGES
TERMS AND CONCEPTS
DATA SCIENCE
MODEL
EXPLORATORY DATA ANALYSIS (EDA)
CONFIRMATORY DATA ANALYSIS (CDA)
DATA PRODUCT
STATISTICS
DESCRIPTIVE STATISTICS
INFERENTIAL STATISTICS
MACHINE LEARNING
DATA MUNGING
BIG DATA ANALYSIS LIFECYCLE
READING
COMMON BIG DATA DATASET CATEGORIES
COMMON BIG DATA DATASET CATEGORIES
HIGH-VOLUME DATASETS
HIGH-VELOCITY DATASETS
HIGH-VARIETY DATASETS
HIGH-VERACITY DATASETS
HIGH-VALUE DATASETS
EXERCISE 4.1: MATCH TERMS TO STATEMENTS
PART II: ELEMENTS OF BIG DATA ANALYSIS
EXPLORATORY DATA ANALYSIS (EDA)
ATTRIBUTES
Fundamental Big Data Analysis & Science (Copyright Arcitura Education Inc. www.arcitura.com) v2.0 1
EDA
OPTIONAL READING
DATA SUMMARY TYPES
NUMERICAL SUMMARIES
NUMERICAL SUMMARIES: MEASURES OF CENTRAL TENDENCY
NUMERICAL SUMMARIES: MEASURES OF VARIATION OR DISPERSION
NUMERICAL SUMMARIES: MEASURES OF ASSOCIATION
GRAPHICAL SUMMARIES
QUANTITATIVE ANALYSIS
UNIVARIATE ANALYSIS
BIVARIATE ANALYSIS
MULTIVARIATE ANALYSIS
STATISTICS
VARIABLE TYPES
EXERCISE 4.2: FILL IN THE BLANKS
POPULATION & SAMPLE
STATISTICAL INFERENCE
MEASURES OF CENTRAL TENDENCY
MEAN
MEDIAN
MODE
ROBUSTNESS
MEASURES OF VARIATION OR DISPERSION
RANGE
MEAN, MEDIAN, MODE & RANGE
QUANTILES
QUINTILES
QUARTILES
INTERQUARTILE RANGE & OUTLIERS
PERCENTILES
BIAS
DISTRIBUTION
VARIANCE
STANDARD DEVIATION
VARIANCE & STANDARD DEVIATION
Z-SCORE
EXERCISE 4.3: NAME THE MEASURE
DISTRIBUTIONS
FREQUENCY DISTRIBUTION
PROBABILITY
PROBABILITY DISTRIBUTION
READING
SAMPLING DISTRIBUTION
STANDARD ERROR
STATISTICAL ESTIMATORS
CONFIDENCE INTERVAL
SKEWNESS
DISCRETE & CONTINUOUS PROBABILITY DISTRIBUTIONS
DISTRIBUTION FITTING
NORMAL DISTRIBUTION
STANDARD NORMAL DISTRIBUTION
CENTRAL LIMIT THEOREM
MEASURES OF ASSOCIATION
CORRELATION
CORRELATION & HIGH-VOLUME DATASETS
CORRELATION & HIGH-VELOCITY DATASETS
CORRELATION & HIGH-VARIETY DATASETS
CORRELATION & HIGH-VERACITY DATASETS
CORRELATION & HIGH-VALUE DATASETS
COVARIANCE
ESTIMATES OF POPULAR DISTRIBUTION
CHEBYSHEV'S INEQUALITY RULE
EMPIRICAL RULE
EXERCISE 4.4: NAMING AND MATCHING
CONFIRMATORY DATA ANALYSIS (CDA)
HYPOTHESIS TESTING
NULL HYPOTHESIS
ALTERNATIVE HYPOTHESIS
STATISTICAL SIGNIFICANCE
P-VALUE
CRITICAL REGION, ONE-TAILED & TWO-TAILED TESTS
TYPE I ERROR, TYPE II ERROR & THE POWER OF HYPOTHESIS TEST
VISUALIZATION
VISUALIZATION FOR EDA & CDA
BAR GRAPH
LINE GRAPH
HISTOGRAM
FREQUENCY POLYGONS
SCATTER PLOT
STEM & LEAF PLOT
CROSS-TABULATION
BOX & WHISKER PLOT
QUANTILE-QUANTILE PLOT
LATTICE PLOT
PART III: FUNDAMENTAL BIG DATA ANALYSIS TECHNIQUES
READING
PREDICTION: LINEAR REGRESSION
MULTIPLE LINEAR REGRESSION
MEAN SQUARED ERROR
ERROR TERM & RESIDUALS
COEFFICIENT OF DETERMINATION R²
STANDARD ERROR OF ESTIMATE
LINEAR REGRESSION & OTHER TECHNIQUES
LINEAR REGRESSION & HIGH-VOLUME DATASETS
LINEAR REGRESSION & HIGH-VELOCITY DATASETS
LINEAR REGRESSION & HIGH-VARIETY DATASETS
LINEAR REGRESSION & HIGH-VERACITY DATASETS
LINEAR REGRESSION & HIGH-VALUE DATASETS
CLASSIFICATION: K-NN (K-NEAREST NEIGHBORS)
SELECTING THE VALUE OF K
OPTIONAL READING
CLUSTERING: K-MEANS
CLUSTERING
K-MEANS
THE ASSIGN STAGE
THE UPDATE STAGE
THE REASSIGNMENT STAGE
SELECTING THE VALUE OF K
MISSING FEATURE VALUES
CLUSTER DISTORTION
OPTIONAL READING
CLUSTERING & OTHER TECHNIQUES
CLUSTERING & HIGH-VOLUME DATASETS
CLUSTERING & HIGH-VELOCITY DATASETS
CLUSTERING & HIGH-VARIETY DATASETS
CLUSTERING & HIGH-VERACITY DATASETS
CLUSTERING & HIGH-VALUE DATASETS
EXERCISE 4.7: NAME THE ALGORITHM
EXERCISE ANSWERS
EXERCISE 4.1 ANSWERS
EXAM B90.04
MODULE 4 SELF-STUDY KIT
CONTACT INFORMATION AND RESOURCES
AITCP COMMUNITY
GENERAL PROGRAM INFORMATION
GENERAL INFORMATION ABOUT COURSE MODULES AND SELF-STUDY KITS
PEARSON VUE EXAM INQUIRIES
PUBLIC INSTRUCTOR-LED WORKSHOP SCHEDULE
PRIVATE INSTRUCTOR-LED WORKSHOPS
BECOMING A CERTIFIED TRAINER
GENERAL BDSCP INQUIRIES
AUTOMATIC NOTIFICATION
FEEDBACK AND COMMENTS
Introduction
This is the official workbook for the BDSCP course Module 4: Fundamental Big Data Analysis
& Science and the corresponding Pearson VUE Exam B90.04.
The purpose of this document is to establish an understanding of fundamental Big Data
concepts, which include but are not limited to:
- Understanding Big Data
- Fundamental Big Data Terminology & Concepts
- Big Data Business & Technology Drivers
- Traditional Enterprise Technologies Related to Big Data
- Characteristics of Data in Big Data Environments
- Types of Data in Big Data Environments
- Fundamental Analysis, Analytics & Machine Learning Types
- Business Intelligence & Big Data
- Data Visualization & Big Data
- Big Data Adoption & Planning Considerations
Mind Map Poster
The BDSCP Module 4: Mind Map Poster that accompanies this course booklet provides an
alternative visual representation of all primary topics covered in this course.
Big Data Module 4 Official Supplement: Analysis Formulas
This supplement provides the formulas and algorithms upon which
analysis techniques are based. This supplement provides optional
reading for topics not covered on Exam B90.04.
Formulas for the following techniques are provided:
- Mean (Generic, Frequency-based)
- Median (Odd, Even)
- Mode
- Range
- Variance
- Standard Deviation
- Z-score
- Probability
- Sampling Distribution
- Standard Error
- Correlation (Pearson's)
- Covariance
- Distribution (Uniform, Binomial, Geometric, Poisson)
- Histogram
- Linear Regression
- K-Nearest Neighbour
- K-Means
Analysis Techniques Coverage
Modules 4 and 5 cover a variety of topics. The following are the twelve primary Big Data
analysis techniques that are emphasized and further explored in Module 6 lab exercises. The
first four (Correlation, Linear Regression, k-NN, and k-means) are covered in Module 4, and the
rest are covered in Module 5.
- Correlation
- Linear Regression
- k-NN
- k-means
- Logistic Regression
- Naïve Bayes
- Decision Trees
- Classification Rules
- Association Rules
- Time Series Analysis
- Text Analytics
- Outlier Detection
Overview
This module comprises the following three primary parts:
- Part I: Big Data Science Concepts & Analysis Challenges
  - Terms and Concepts
  - Common Big Data Dataset Categories
- Part II: Elements of Big Data Analysis
  - Exploratory Data Analysis (EDA)
  - Statistics
  - Confirmatory Data Analysis (CDA)
  - Visualization
- Part III: Fundamental Big Data Analysis Techniques
  - Prediction: Linear Regression
  - Classification: k-NN (k-Nearest Neighbors)
  - Clustering: k-means
Part I: Big Data Science Concepts & Analysis Challenges
This section covers the following topics:
- Terms and Concepts
- Common Big Data Dataset Categories
Terms and Concepts
Data Science
Data Science is the overarching set of principles, processes, and techniques that enable the
extraction of knowledge from large amounts of data. Data is analyzed to understand and glean
insights in the form of generalizable patterns and correlations. Techniques and theories from
statistics, machine learning, computer science, data mining, and visualization all contribute to
Data Science. Data is generally explored without any prior hypothesis via exploratory data
analysis (EDA) in order to understand the relationships among differing variables.
This level of understanding of data, as described above, is captured in the form of a model
which is then implemented and deployed in the form of a data product. Models and data
products will be discussed separately in the upcoming Model and Data Product topics.
Depending on the nature of the analysis, some situations may not warrant a data
product. Instead, the modeling results are communicated using visualization techniques.
Model
In generic terms, a model is a simplified representation of a phenomenon to aid human
understanding, such as a blueprint of a house, a model plane, a logical data model, or a
physical data model. In data science, a model is a generalized representation of relationships
between data attributes in the form of a mathematical/statistical equation or set of rules.
A model can help the data scientist develop an understanding of the data-generating process,
which can further help in making predictions. A model enhances understanding by removing
unnecessary details and is based on assumptions and constraints pertinent to the problem
domain.
A descriptive model describes the current behavior in order to develop a causal (cause and
effect) understanding of the phenomenon. A successful descriptive model is generally one that
can be easily understood even though it may not produce accurate results.
A predictive model describes future behavior by estimating a target value based on predictor
values. Although understanding a predictive model is important, such models are considered
successful if they produce accurate results even though they may not be easily
comprehensible.
NOTE
In data science, a predictive model may not always be
used to predict a future value.
Instead, it can also be used to predict an unknown value
of interest based on an event that has already occurred,
such as predicting if a comment carries a positive or a
negative sentiment.
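As a minimal sketch of a predictive model expressed as a mathematical equation, the following fits a straight line to a handful of points by ordinary least squares and then estimates a target value from an unseen predictor value. The data and the names `fit_line` and `predict` are hypothetical and chosen only for illustration; linear regression itself is covered in Part III.

```python
# Hypothetical data: advertising spend (predictor) and units sold (target).
spend = [1.0, 2.0, 3.0, 4.0, 5.0]
units = [3.0, 5.0, 7.0, 9.0, 11.0]


def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of the predictor.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Estimate the target value for a predictor value using the fitted model."""
    slope, intercept = model
    return slope * x + intercept


model = fit_line(spend, units)   # the "model" is just two numbers
estimate = predict(model, 6.0)   # estimate for a predictor value not in the data
print(model, estimate)
```

Here the model is nothing more than a slope and an intercept, which illustrates how a model removes detail: five observations are reduced to two parameters under the assumption that the relationship is linear.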
Exploratory Data Analysis (EDA)

EDA, as introduced in Module 2, is a data analysis technique that explores data without any
prior hypothesis to develop an understanding of the data. While helping to generate rather than
prove hypotheses, EDA further helps to understand the process that generated the data in order
to produce models.
Various summary statistics are generated and comparisons performed using different
visualization techniques. EDA may reveal the need to further cleanse data or to collect missing
data, and can also help determine whether the data is suitable for modeling.
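The kind of first-pass summary described above might look like the following sketch, which counts missing values and computes a few summary statistics per attribute using only the Python standard library. The records, field names, and the `summarize` helper are assumptions made for illustration.

```python
import statistics

# Hypothetical raw records; None marks a missing reading.
records = [
    {"temp_c": 21.0, "humidity": 40.0},
    {"temp_c": 23.5, "humidity": 45.0},
    {"temp_c": None, "humidity": 50.0},
    {"temp_c": 26.0, "humidity": 55.0},
    {"temp_c": 28.5, "humidity": None},
]


def summarize(rows, field):
    """Return basic summary statistics and a missing-value count for one attribute."""
    values = [r[field] for r in rows if r[field] is not None]
    return {
        "count": len(values),
        "missing": len(rows) - len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
    }


summary = {field: summarize(records, field) for field in ("temp_c", "humidity")}
for field, stats in summary.items():
    print(field, stats)
```

A non-zero `missing` count is the sort of finding that may prompt further cleansing or collection of missing data before modeling.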
Confirmatory Data Analysis (CDA)

A hypothesis, as introduced in Module 2, is a proposed cause of, or assumption about, a phenomenon
that can be proved or disproved through the analysis of data. Within confirmatory data analysis,
a hypothesis is established before data is collected for testing, whereas in Big Data
environments and data science the hypothesis is generated from the already collected data.
Within data science, the hypothesis is generally not established until the EDA is performed
because it is often not known what phenomena the large amount of data may be hiding until
after data analysis.
Data Product
A data product is an instantiation of the model built during the data analysis that exists in the
form of an application, which generates value from data for fulfilling a business goal. During the
course of its operation, a data product creates further data that is generally used to enhance the
data product via a feedback loop. In the business domain, the end goal of applying data science
is to develop a data product that provides business value.
Statistics
The term statistics, when used with a singular verb, refers to the science of collecting,
organizing, analyzing, and interpreting numerical data. When used with a plural verb, it
refers to numerical facts about a set of data, such as the mean, median, and mode.
The field of statistics generally involves summarization of data through the generation of various
types of statistical information utilized for interpreting data. Statistics involves scientifically
drawing a sample, which is a subset of a dataset, from a population, which is the entire dataset,
and the use of probability theory for prediction.
Descriptive Statistics
Descriptive statistics is the numerical description of data via summarization and visualization
techniques. It helps a data scientist interpret the data and formulate hypotheses. Numerical
data generated via statistics include but are not limited to averages, quartiles, percentiles, and
standard deviations. Visualization techniques include histograms and scatter plots. Table 4.1
provides averages that summarize the daily temperature data for NYC across 12 months.
Table 4.1 An example of descriptive statistics in the form of a table.
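As a rough illustration of how such a table of averages could be produced, the sketch below groups temperature readings by month and computes the monthly mean, followed by quartiles and the standard deviation of all readings. The readings are made up for the example and are not the Table 4.1 data.

```python
import statistics
from collections import defaultdict

# Hypothetical (month, temperature in °F) readings; not the Table 4.1 data.
readings = [
    ("Jan", 30.0), ("Jan", 34.0), ("Jan", 38.0),
    ("Feb", 33.0), ("Feb", 37.0), ("Feb", 41.0),
    ("Mar", 40.0), ("Mar", 46.0), ("Mar", 52.0),
]

# Group the readings by month.
by_month = defaultdict(list)
for month, temp in readings:
    by_month[month].append(temp)

# One average per month, as in a summary table.
monthly_avg = {month: statistics.mean(temps) for month, temps in by_month.items()}

# Quartiles (three cut points) and sample standard deviation over all readings.
all_temps = [temp for _, temp in readings]
quartiles = statistics.quantiles(all_temps, n=4)
std_dev = statistics.stdev(all_temps)

print(monthly_avg)
print(quartiles, std_dev)
```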
Inferential Statistics
Inferential statistics goes beyond description of data to making inferences about the population
based on the observed sample. Successful inferential statistics requires a random
sample. A non-random sampling mechanism introduces bias into the sample (bias is discussed
in the upcoming Statistics section), which leads to wrong or inaccurate inferences about the
population.
Inferential statistics involves the use of point estimators and interval estimators. The process
involves drawing a sample from the population and, based on the sample, making an inference
about the population.
Figure 4.1 The process of inferential statistics as a cycle.
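The process of drawing a random sample and making an inference about the population can be sketched as follows. The population is simulated, the seed is fixed only so the run is repeatable, and the interval uses a normal approximation; all of these are assumptions made for this illustration.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 10,000 simulated values.
population = [random.gauss(50.0, 10.0) for _ in range(10_000)]

# Draw a simple random sample without replacement.
sample = random.sample(population, 100)

# Point estimator: the sample mean estimates the population mean.
point_estimate = statistics.mean(sample)

# Interval estimator: approximate 95% confidence interval for the mean,
# built from the standard error and the normal quantile (about 1.96).
z = statistics.NormalDist().inv_cdf(0.975)
std_error = statistics.stdev(sample) / len(sample) ** 0.5
interval = (point_estimate - z * std_error, point_estimate + z * std_error)

print(point_estimate, interval)
```

The point estimator yields a single number, whereas the interval estimator yields a range that also conveys how uncertain the estimate is. Standard error and confidence intervals are treated in the Statistics section.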
Machine Learning
Machine learning, as introduced in Module 1, is the process through which computers
automatically learn from data to implicitly program themselves by identifying rules and patterns
for formulating predictions about unknown data. The learned rules and patterns essentially
represent the model that has been inferred from the data.
Machine learning and data mining are closely related, as both are used to find hidden patterns.
Data mining is more prevalent in business domains, whereas machine learning is a more
generic field that extends to other fields, such as artificial intelligence and natural language
processing (NLP).
Data mining generally employs machine learning algorithms and is more concerned with the
complete data analytic process, including data acquisition, cleansing, and model creation, rather
than just the application of algorithms. Machine learning and statistics can both be used to
create models.
Statistical models are more concerned with understanding the data generation process,
whereas machine learning algorithms are more concerned with producing the correct output(s)
through means that may not be fully comprehensible.
Machine learning involves the use of algorithms that can be divided into the following three
types:
- Supervised Learning: input data includes example outputs
- Unsupervised Learning: input data does not include any example outputs
- Semi-Supervised Learning: input data includes few example outputs
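The difference between the three types is largely a difference in what the input data contains, which the following sketch tries to make concrete. The feature vectors, labels, and the `nearest_label` helper are hypothetical; the helper is a bare-bones one-nearest-neighbor rule used only to show a supervised learner making use of the example outputs.

```python
# Hypothetical 2-D feature vectors.

# Supervised: every input is paired with an example output (label).
supervised = [((1.0, 1.0), "low"), ((1.2, 0.8), "low"),
              ((5.0, 5.2), "high"), ((4.8, 5.1), "high")]

# Unsupervised: inputs only, no example outputs.
unsupervised = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.2), (4.8, 5.1)]

# Semi-supervised: only some inputs carry an example output.
semi_supervised = [((1.0, 1.0), "low"), ((1.2, 0.8), None),
                   ((5.0, 5.2), "high"), ((4.8, 5.1), None)]


def nearest_label(labeled, point):
    """Return the label of the labeled example closest to `point` (1-NN)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    _, label = min(labeled, key=lambda pair: sq_dist(pair[0], point))
    return label


prediction = nearest_label(supervised, (4.5, 4.9))
print(prediction)
```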
Data Munging
Data munging, also known as data wrangling, refers to the extraction and manipulation of raw
data by applying cleansing, filtering, validation, and format transformation techniques in order to
make data appropriate for analysis. This generally involves the use of tools and programming
languages like SQL, Python, R, Hive, and Pig. In the context of data science, data munging
provides clean input data, which is essential for correctly understanding the data and further
discovering patterns and rules.
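A minimal Python sketch of the four munging techniques named above (cleansing, filtering, validation, and format transformation) is shown below. The field names, the sample records, and the expected date format are all hypothetical and chosen only for illustration.

```python
from datetime import datetime

# Hypothetical raw records as they might arrive from a source system.
raw_records = [
    {"customer": "  alice ", "amount": "19.99", "date": "2024-01-05"},
    {"customer": "BOB", "amount": "n/a", "date": "2024-01-06"},
    {"customer": "carol", "amount": "5.00", "date": "2024-01-06"},
    {"customer": "", "amount": "12.50", "date": "2024-01-07"},
    {"customer": "dave", "amount": "7.25", "date": "07/01/2024"},
]

def munge(record):
    """Return a cleaned record, or None if the record fails validation."""
    name = record["customer"].strip().title()   # cleansing: trim and normalize case
    if not name:                                # validation: a name is required
        return None
    try:
        # Format transformation: text to float and text to date.
        amount = float(record["amount"])
        date = datetime.strptime(record["date"], "%Y-%m-%d").date()
    except ValueError:
        return None                             # unparseable values are rejected
    return {"customer": name, "amount": amount, "date": date}

# Filtering: keep only the records that survived cleansing and validation.
clean_records = [r for r in map(munge, raw_records) if r is not None]

for record in clean_records:
    print(record)
```

Only the first and third records survive; the others fail on a non-numeric amount, a missing name, or an unexpected date format.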
Big Data Analysis Lifecycle

Module 2 introduced the Big Data analysis lifecycle.
The focus of Modules 4 and 5, with respect to the
data science process, will be the Data Analysis stage.
Figure 4.2 The Big Data Analysis Lifecycle
Reading
Further discussion on these topics is provided in the sections A Data Science Profile on pages
10-12 and OK, So What is a Data Scientist, Really? on pages 14-16 of the Doing Data Science
textbook accompanying this module.
Common Big Data Dataset Categories

This section discusses some of the unique challenges related to the analysis of Big Data
datasets in the context of the five Vs: Volume, Velocity, Variety, Veracity, and Value.
In particular, the following types of datasets are discussed:
x High-Volume Datasets
x High-Velocity Datasets
x High-Variety Datasets
x High-Veracity Datasets
x High-Value Datasets
High-Volume Datasets
Data within Big Data environments comes in large volumes, such as an entire collection of daily
financial transactions for a month from across all branches of a supermarket, and varying
volumes, such as tweets that are only 560 bytes (140 characters) in length versus a two-hour
video that is 4.7 gigabytes.
Within structured datasets, large volume can be due to a large number of records or rows, or
due to a large number of fields or columns. In some cases, large volume can be due to both a
large number of records and of fields.
Generally, a dataset with a large number of rows/records is considered tall or long data, while a
dataset with a large number of columns/fields is considered wide data, as depicted in Figure 4.3.
Both tall and wide data bring a unique set of challenges for analyzing data in Big Data
environments, and both often require increased processing resources.
Figure 4.3 Tall datasets have numerous rows, pictured left, while wide datasets have numerous columns, pictured right.
Analysis of tall datasets is somewhat easier, as there are fewer fields/characteristics to take into
consideration. However, such datasets are generally more prone to noise and outliers because of
the large number of records, which calls for automated data cleansing and outlier detection
techniques.
Wide datasets can contain comparatively less noise and fewer outliers, but their analysis is
generally complex, as there are a large number of fields/characteristics that must be taken into
account.
Both types of datasets require intensive EDA to be conducted in order to develop a thorough
understanding before conducting a more targeted, detailed analysis.
Voluminous semi-structured and unstructured datasets can generally be thought of as tall
datasets, as each record is often represented as a BLOB of information in a single column. Pre-
processing of data is required on these types of voluminous semi-structured and unstructured
datasets. Common pre-processing tasks include data cleansing and derivation of new fields, as
well as ensuring the data is represented in a form that can be used for quantitative techniques.
High-Velocity Datasets
Data within Big Data environments arrives at a fast pace, often due to the scale of the
underlying data-generating process. For example, thousands of individuals tweet at any point in
time and a large number of financial transactions occur across multiple stores within a short
span of time.
With high-velocity machine-generated data, the recurring data structure remains the same, such
as smart meter data or Web server logs. With high-velocity human-generated data, unstructured
data values can change on a per record basis, such as customer comments. However, the
overall structure of the individual record often remains the same as it will typically be formatted
by a data-capturing device.
Depending on business requirements, the analysis of high-velocity data can be performed in
transactional or batch mode, and in some circumstances both. With transactional analysis,
individual records are processed as they arrive. The processing may simply involve data
cleansing and updating KPIs for reporting purposes or may involve complex automated analysis
of the record, such as fraud detection. With batch analysis, fast-arriving data is accumulated first
and only then processed for reporting purposes or for performing complex analysis, such as
model development.
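The difference between the two modes can be sketched in Python as follows. The transaction amounts, the running-total KPI, and the fixed fraud threshold are hypothetical stand-ins; a real fraud check would be considerably more involved.

```python
from statistics import fmean

# Hypothetical stream of transaction amounts arriving one at a time.
incoming = [12.0, 250.0, 8.5, 9000.0, 40.0, 15.5]

FRAUD_THRESHOLD = 5000.0  # assumed rule standing in for automated fraud detection

# Transactional mode: each record is processed as soon as it arrives.
running_total = 0.0   # KPI updated per record
flagged = []          # records that trip the fraud rule
for amount in incoming:
    running_total += amount
    if amount > FRAUD_THRESHOLD:
        flagged.append(amount)

# Batch mode: records are accumulated first and only then analyzed together.
batch = list(incoming)
batch_average = fmean(batch)

print(f"running total (transactional): {running_total}")
print(f"flagged records              : {flagged}")
print(f"average amount (batch)       : {batch_average:.2f}")
```

In transactional mode the suspicious record is flagged the moment it arrives, whereas the batch average is only available once the whole batch has been collected.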
High-Variety Datasets
Within Big Data environments, a variety of datasets containing structured, semi-structured, and
unstructured data are generally used for analysis purposes. Unlike traditional data analysis,
which is only focused on structured datasets, analysis within Big Data environments must
incorporate semi-structured and unstructured datasets, as this type of data carries latent
information that can be of potential benefit for an enterprise. For example, text analytics and
sentiment analysis performed on customer comments can identify customers who may be at
risk of defecting to a competitor.
The notion of variety applies to the fact that multiple differently formatted datasets must be
analyzed, rather than the same dataset comprising records made up of different formats that
continue to change. For example, even in a semi-structured dataset that comprises structured
and unstructured data, the data type for a particular field is often fixed, despite some records
containing additional or fewer fields. From a data analysis point of view, high-variety datasets
generally require certain pre-processing steps and may need a combination of analysis
techniques for their analyses.
It can be hard to join high-variety datasets together in order to perform unified data analysis.
The datasets are usually heterogeneous due to a range of enterprise-wide information systems
or different devices that generate the data required for analysis.
For example, the different types of sensors on the factory floor may generate data in different
formats. Noise must be carefully removed from real data (the signal) to achieve meaningful,
correct analytical results. In general, removing noise from machine-generated data is less
difficult as compared to human-generated data, as the former often conforms to some
lower/upper limits whereas the latter requires semantic assessment.
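For machine-generated data, removing noise against lower/upper limits can be as simple as the following Python sketch. The temperature readings and the limits of -40 and 125 degrees Celsius are assumptions made for illustration, not values taken from any particular sensor.

```python
# Assumed operating limits of a hypothetical temperature sensor, in degrees Celsius.
LOWER_LIMIT_C = -40.0
UPPER_LIMIT_C = 125.0

# Hypothetical readings; -999.0 and 850.0 are physically implausible values.
readings_c = [21.4, 21.6, -999.0, 22.1, 850.0, 21.9]

def within_limits(value):
    """True when a reading falls inside the sensor's valid range."""
    return LOWER_LIMIT_C <= value <= UPPER_LIMIT_C

signal = [r for r in readings_c if within_limits(r)]
noise = [r for r in readings_c if not within_limits(r)]

print("signal:", signal)
print("noise :", noise)
```

No comparable one-line rule exists for human-generated text, which is why it requires semantic assessment instead.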
High-Veracity Datasets
Meaningful analysis of data generated within Big Data environments requires high-veracity
datasets. However, voluminous datasets can potentially contain large amounts of noise that
negatively affect the veracity of datasets. Noise creates false data that cannot be trusted and
further produces incorrect analysis results. For example, a misconfigured sensor or device will
create false readings in machine-generated data. Similarly, biased comments or the
appearance of similar comments multiple times under different user IDs is an indication of noise
in human-generated data.
High-Value Datasets
A high-value dataset within Big Data environments is one that is high-veracity, contains useful
insights for the enterprise, and can be analyzed within a meaningful time period, requiring
comparatively simple analysis techniques. Like veracity, the value of a dataset is dependent on
the volume, velocity, and variety characteristics.
High-volume datasets, whether tall or wide, add more value as compared to datasets
comprising fewer records, due to the applicability of the Law of Large Numbers. High-velocity
datasets add further value when compared to low-velocity datasets because of the constant
addition of new records and increased frequency with which results are updated.
Similarly, high-variety, heterogeneous datasets add increased value in comparison to
homogeneous datasets, as a combination of differently formatted datasets provides richer,
unified datasets with increased chances of finding significant insights.
Exercise 4.1: Match Terms to Statements
Answer the questions below by filling in the blank fields with one of the following terms:
x High-Volume Datasets
x High-Velocity Datasets
x High-Variety Datasets
x High-Veracity Datasets
x High-Value Datasets
1. A company collects customer comments that undergo text analytics and sentiment analysis
in order to identify the customers who may be at risk of defecting to a competitor. Which
category of Big Data datasets best characterizes this process?
______________________
2. Thousands of stock trading transactions are arriving very quickly as a result of being
concurrently generated by traders at the New York Stock Exchange. Which Big Data dataset
category is best-suited for describing the resulting dataset?
______________________
3. An application that collects comments from a Web site is run to filter user-created data for
bias and significance. Which Big Data dataset category best describes such removal of
noise?
______________________
4. High data veracity, velocity, and variety contribute to measuring which Big Data dataset
category?
______________________
5. A large banking institution collects a month's worth of daily financial transactions from all of
its branches across the country. What is the appropriate Big Data dataset category for
describing the resulting dataset?
______________________
Exercise answers are provided at the end of this booklet.
Part II: Elements of Big Data Analysis
This portion of the workbook is divided into the following sections:
x Exploratory Data Analysis (EDA)
x Statistics
x Confirmatory Data Analysis (CDA)
x Visualization
Exploratory Data Analysis (EDA)
Attributes
In order to analyze the data and build models, it is important to first understand the data by
exploring data attributes or features of the data and to understand their types. An attribute is a
characteristic of the data. For example, in a database table, the columns are the attributes of
each instance of data displayed in the rows.
The notion of an attribute is more common within data mining, whereas in statistics, machine
learning, and data warehousing, the attribute is known as a variable, feature, and dimension
respectively. The variable types introduced in the upcoming Statistics section also apply to
attributes.
EDA
The process of EDA involves extracting quantitative attributes from the data and producing
various numerical and graphical summaries that are based on statistics generated from the
values of these attributes, with a view to developing an understanding of the data. This
understanding helps to assess the data quality, to make comparisons and find relationships,
and to identify attributes that will eventually become part of the statistical models and machine
learning algorithms.
Another objective of EDA is to ensure targeted data mining efforts by decreasing the amount of
data through the selection of only relevant attributes and data discretization, a topic covered in
Module 5: Advanced Big Data Analysis & Science.
EDA provides information on which type of model to develop and which relationships are
important in the context of the problem space, as well as information on any assumptions that
should be made for the models and which type of patterns should be extracted and generalized.
Alternatively, EDA can be used to determine whether the captured data is erroneous, or if the
process used to capture the data is not configured properly and is producing data that consists
of unrealistic patterns not normally associated with such data.
Optional Reading
For a more in-depth discussion on this topic, see the Exploratory Data Analysis section from
pages 34-37 of the Doing Data Science textbook.
Data Summary Types

A range of different data summaries can be generated when conducting EDA. These can
generally be divided into the following types:
x Numerical Summaries
x Graphical Summaries
Numerical Summaries
Numerical summaries make use of descriptive statistics for summarizing data. There are
generally three types of numerical summaries:
x Measures of Central Tendency
x Measures of Variation or Dispersion
x Measures of Association
Numerical Summaries: Measures of Central Tendency

When conducting EDA, the first step is to develop an understanding of the dataset or
distribution, introduced in the upcoming Statistics section, by finding out how the data is
arranged around the center of the distribution and which values are most commonly occurring.
This understanding provides a basis for comparing different values within the distribution as well
as with other distributions.
The measures of central tendency include:
x Mean
x Median
x Mode
Numerical Summaries: Measures of Variation or Dispersion

In understanding a distribution, it is also important to establish how much the values are spread
out from the center. In other words, are the values closely packed or are they spread out over a
large range?
The measures of variation or dispersion include:
x Range
x IQR
x Variance
The main objective of analyzing the spread is to determine how consistently the values appear
when compared with the averages (mean, median, and mode) and to find and remove any
outliers. When used in conjunction with z-scores, measures of variation provide the ability to
make decisions about which processes or models produce consistent or better results when
compared with other processes or models. Z-scores are introduced in the upcoming Statistics
section.
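As an illustration, the three measures listed above and a z-score can be computed with Python's standard statistics module. The two sets of process outputs are made up, and the quartiles use the module's default method, which is only one of several conventions for computing the IQR.

```python
import statistics

# Hypothetical outputs of two processes that share the same mean (50).
process_a = [50, 51, 49, 50, 52, 48, 50]
process_b = [50, 70, 30, 55, 45, 20, 80]

def spread(values):
    """Return the range, interquartile range, and sample variance of the values."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # three quartile cut points
    return {
        "range": max(values) - min(values),
        "iqr": q3 - q1,
        "variance": statistics.variance(values),
    }

def z_score(value, values):
    """Number of sample standard deviations between a value and the mean."""
    return (value - statistics.fmean(values)) / statistics.stdev(values)

spread_a = spread(process_a)
spread_b = spread(process_b)

print("process A:", spread_a)
print("process B:", spread_b)
print("z-score of 52 in A:", round(z_score(52, process_a), 2))
print("z-score of 52 in B:", round(z_score(52, process_b), 2))
```

Both processes average 50, but process A is far more consistent, so the same value of 52 is relatively unusual for A and unremarkable for B.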
Numerical Summaries: Measures of Association

The measures of association provide information related to the existence of any relationship
between variables that is important when developing models for making predictions. The
measures of association include:
x Correlation
x Covariance
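Both measures can be computed by hand, as in the following Python sketch. The temperature and sales figures are invented for illustration, and the functions use the sample (n - 1) form of covariance together with the Pearson correlation coefficient, which is an assumption about which variants are intended here.

```python
from math import sqrt
from statistics import fmean

# Hypothetical paired observations: daily temperature and ice-cream sales.
temperature_c = [18, 21, 24, 27, 30]
ice_cream_sales = [120, 150, 185, 210, 250]

def covariance(xs, ys):
    """Sample covariance: average co-movement of xs and ys around their means."""
    mean_x, mean_y = fmean(xs), fmean(ys)
    products = ((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sum(products) / (len(xs) - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance rescaled to the range -1 to 1."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))

cov = covariance(temperature_c, ice_cream_sales)
corr = correlation(temperature_c, ice_cream_sales)

print(f"covariance : {cov:.1f}")
print(f"correlation: {corr:.3f}")
```

Covariance depends on the units of both variables, while correlation is unit-free, which makes it easier to compare across different pairs of variables.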
Graphical Summaries
Graphical summaries make use of visual techniques for summarizing data. This helps to explore
data beyond its descriptive characteristics, which further helps in generating hypotheses or
discovering patterns and correlations. Generally, the following graphical techniques are used in
EDA:
x Bar Graph
x Line Graph
x Histogram
x Frequency Polygons
x Scatter Plot
x Stem & Leaf Plot
x Cross-Tabulation
x Box & Whisker Plot
x Quantile-Quantile Plot
x Lattice Plot
Quantitative Analysis
Quantitative analysis of data can be categorized by the number of variables involved. The
following are the three main types:
x Univariate Analysis
x Bivariate Analysis
x Multivariate Analysis
Univariate Analysis
Quantitative analysis of a single variable is known as univariate analysis, such as analysis of
census data for gaining insights about literacy levels or the ethnic makeup of a population. The
main objective is to understand the type of distribution the values make up and to identify any
outliers. Univariate analysis often starts with formulating frequency and probability distributions,
which will be introduced in the upcoming Statistics section.
The techniques involved within univariate analysis include:
x Measures of Central Tendency
x Measures of Variation or Dispersion
Bivariate Analysis
Quantitative analysis of two variables in order to explore their relationship is known as bivariate
analysis, such as an analysis of ice-cream sales and temperature. It is good practice to first
conduct univariate analysis on the variables involved within bivariate analysis before proceeding
to the actual bivariate analysis.
The techniques involved within bivariate analysis include:
x Measures of Association
x Cross-tabulation
x Regression
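Of the techniques listed above, cross-tabulation is the simplest to sketch: it counts how often each combination of two categorical values occurs. The weather and sales categories in this Python example are hypothetical.

```python
from collections import Counter

# Hypothetical daily observations as (weather, sales level) pairs.
observations = [
    ("hot", "high"), ("hot", "high"), ("hot", "low"),
    ("cold", "low"), ("cold", "low"), ("cold", "high"),
    ("hot", "high"),
]

# Count every (weather, sales level) combination.
cross_tab = Counter(observations)

rows = ["hot", "cold"]
columns = ["high", "low"]

# Print the counts as a small two-way table.
print(f"{'':6}" + "".join(f"{c:>6}" for c in columns))
for row in rows:
    print(f"{row:6}" + "".join(f"{cross_tab[(row, c)]:>6}" for c in columns))
```

The resulting table makes it easy to see that high sales cluster on hot days in this made-up data.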
Multivariate Analysis
Quantitative analysis of more than two variables in order to explore their relationship is known
as multivariate analysis, such as predicting ice-cream sales based on temperature and age
group. Multiple linear regression, covered in the Prediction section, is an example of conducting
multivariate analysis.
The numerical summaries used for conducting the aforementioned univariate, bivariate and
multivariate analyses are generally complemented by the graphical summaries for visual
perception.
Statistics
Variable Types
A variable is a measurable or observable attribute of an object which can be categorized as
follows:
x Discrete variables can only take specific values from a defined set of values, such as the
number of people in a city or blood group type. A discrete variable is a variable whose value
is obtained by counting.
x Continuous variables can take any value, such as a patient's temperature or height.
A continuous variable is a variable whose value is obtained by measuring.
x Nominal variables have values that represent a category, such as product categories or
music genres. Such values can be counted but not measured or ordered.
x Ordinal variables take numerical values that can be discrete or continuous and can be
ordered or ranked, such as a survey question based on a satisfaction scale or educational
level. Such values can be counted and ordered but not measured.
x Binary variables consist of only two categories where the categories are generally the
opposite of each other, such as 1/0, true/false, and heads/tails.
x Quantitative variables are number-based and can be counted or measured, such as an
employee's income.
x Qualitative variables, also known as categorical variables, can be counted but not
measured, such as gender.
x Independent variables have values that do not depend on any other variable but rather
influence other variables, whereas dependent variables have values that are influenced by
the independent variable. For example, temperature is an independent variable that ice-
cream sales depend upon.
x A random variable, generally denoted by X, is a variable that can assume a range of values
based on probability.
Exercise 4.2: Fill in the Blanks
1. ______________________ variables can take only specific values from a defined set of
values.
2. ______________________ variables can take any value and are often obtained by
measurement.
3. ______________________ variables have values that represent a category that can be
counted but not measured or ordered.
4. ______________________ variables take numerical values that can be discrete or
continuous and counted and ordered, but not measured.
5. ______________________ variables consist of only two categories where the
categories are generally the opposite of each other.
6. ______________________ variables are number-based and can be counted or
measured, whereas ______________________ variables can be counted but not
measured.
7. ______________________ variables have values that do not depend on any other
variable, but rather influence other variables. These other variables are known as
______________________ variables.
8. ______________________ variables can assume a range of values based on
probability.
Population & Sample
In statistics, a population is the entire set of objects of a particular type that is being analyzed,
such as a dataset of all customers. A sample is a subset of data drawn from the population,
such as a few customers from the entire customer dataset. An observation is a set of attributes
related to the object, such as customer name and e-mail address. N (population size) is the
number of observations in a population, while n (sample size) is the number of observations in
a sample.
Figure 4.4 illustrates how population, sample, and observation relate to a specific dataset.
Within Big Data environments:
x it is possible for n to be close to or equal to N, as large amounts of data can be processed
within a reasonable amount of time. Having n close to N helps to make predictions about the
population with higher confidence.
x ... n can also be equal to 1 but with a large observation set. This helps to draw conclusions
about a single object rather than the whole population.
Figure 4.4 - An example of a subset of data drawn from a population.
A sample statistic describes a numerical fact related to a sample that is generally used to make
conclusions or estimations about the related population parameter, whereas a population
parameter describes a numerical fact about the entire population. For estimation, a sample
statistic is known as an estimator that produces biased/unbiased and precise/imprecise results,
as discussed shortly.
A sample statistic calculated from different samples of fixed size drawn from the same
population can produce different results between themselves, as well as when compared
against the corresponding population parameter. This variation is represented by a sampling
distribution, introduced shortly.
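The variation of a sample statistic from sample to sample can be simulated directly. In the Python sketch below, the population values, the sample size of 50, and the 1,000 repetitions are arbitrary choices made for illustration.

```python
import random
import statistics

random.seed(7)  # fixed seed so the sketch is repeatable

# Hypothetical population of 5,000 values.
population = [random.uniform(0, 100) for _ in range(5_000)]
population_mean = statistics.fmean(population)  # the population parameter

SAMPLE_SIZE = 50

# Draw many samples of the same fixed size and record each sample mean.
# Together, these sample means approximate the sampling distribution of the mean.
sample_means = [
    statistics.fmean(random.sample(population, SAMPLE_SIZE))
    for _ in range(1_000)
]

print(f"population mean         : {population_mean:.2f}")
print(f"smallest sample mean    : {min(sample_means):.2f}")
print(f"largest sample mean     : {max(sample_means):.2f}")
print(f"average of sample means : {statistics.fmean(sample_means):.2f}")
```

Individual sample means differ from one another and from the population mean, but their average lands close to the population parameter.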
Statistical Inference
Statistical inference is the process of deriving conclusions from data generated by random data-
generating processes, also known as stochastic processes. This generally involves creating
models from data in order to represent the random data-generating process in a simplified
manner.
Sample data is used in order to make estimates or test hypotheses related to the population.
For example, sample data gathered regarding insurance claims shows that fewer insurance
claims are made by women as compared to men. A conclusion could be that this is because
women drive more carefully than men.
Measures of Central Tendency

A set of values can be described in terms of different characteristics, such as the number of
contained values, and the minimum and maximum values. Central tendency refers to the middle
point of a set of values and the measures that define this center point are known as the
measures of central tendency. Apart from summarizing a set of values, these measures are also
useful for making comparisons, such as comparing two sets of values or comparing a single
value to a set of values.
The measures of central tendency include:
x Mean
x Median
x Mode
Mean
The mean, commonly known as the average, is a statistic obtained by dividing the sum of all
values by the count of all values. The population mean is denoted by μ, while the sample mean
is denoted by x̄. The mean is generally used when the values do not vary much and increase or
decrease in a regular manner. It is affected by the presence of outliers. Both population and
sample means are calculated in the same manner.
Median
The median is a statistic obtained by finding the middle value among all ordered values when
the total number of values is odd, or the mean of the two middle values when the total number
of values is even. A sample median is denoted by M or x̃. The median is best suited for
scenarios where extreme values can produce a misleading mean. A median is largely unaffected
by the presence of outliers and, as it depends only on the middle of the ordered values rather
than on every value, generally stays the same.
Mode
The mode is a statistic obtained by finding the most frequently occurring value among all
values, and is the only type of average (the others being mean and median) that can be
calculated for nominal variables. When the dataset consists of groups of values rather than
individual values, the mode is the median of the most frequently occurring group of values. A
set of values can have two modes or more than two modes, in which case the values are called
bimodal and multimodal respectively.
Robustness
In statistics, a sample statistic is termed robust if shifting some values or the presence of
outliers does not change the value of the statistic. The median and mode are robust measures.
The mean is not a robust measure.
For example, for a set of five values (3, 1, 5, 1, and 7), the mean, median, and mode are:
Table 4.2 An example of the mean, median, and mode from a set of five values.
Adding an extreme value of 50 changes the mean completely:
Table 4.3 An example where adding an extreme value changes the mean.
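The effect of the extreme value can be checked with a short Python sketch. It uses only the standard library statistics module; the helper name summarize is an illustrative choice and is not part of the workbook.

```python
from statistics import mean, median, mode

values = [3, 1, 5, 1, 7]
with_outlier = values + [50]  # the same values plus one extreme value


def summarize(data):
    """Return the mean, median, and mode of a list of numbers."""
    return {"mean": mean(data), "median": median(data), "mode": mode(data)}


original = summarize(values)
shifted = summarize(with_outlier)

print("original:    ", original)
print("with outlier:", shifted)
```

The mean moves from 3.4 to roughly 11.17, whereas the median only moves from 3 to 4 and the mode stays at 1, which is what robustness means in practice.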
Measures of Variation or Dispersion

In a set of values, the values may be arranged in a number of ways, for example, the values
may occur close to each other or may occur far from each other. Although the measures of
central tendency provide information about the typical makeup of a set of values in terms of its
center point, these do not provide any information about how the values themselves are
arranged. The measures of variation or dispersion summarize the spread of values in a set of
values and describe how far the values typically occur with respect to the center of a set of
values.
The measures of variation or dispersion include:
x Range
x IQR
x Variance
Range
The range is a statistic, obtained by subtracting the minimum value from the maximum value,
that describes the spread or width of the data. Like the mean, the range is heavily affected
by the presence of extreme values, as the presence of a single extreme value gives the
impression that the values are spread over a very large range.
The averages (mean, median, and mode) provide a central value, while the range provides an idea
about the variation in the data. Using the range, two different sets of values can be compared
in terms of variation in their values.
Mean, Median, Mode & Range
For a set of five values (3, 1, 5, 1, and 7) Table 4.4 summarizes mean, median, mode, and
range for a quick comparison:
Table 4.4 A summary of mean, median, mode, and range.
Figure 4.4 shows a number line for a visual analysis of the data using the above measures:
Figure 4.4 An example of visual analysis of data using a number line.
Quantiles
Quantiles divide ranked or ordered data into a specific number of equally sized portions. The
values that indicate the boundary between the portions are actual quantiles and in total are
always one less than the number of portions.
For example, dividing the set of values in Figure 4.5 into three portions results in two quantiles
(3 and 6), below which approximately 33.33% and 66.67% of the values fall respectively. Data can
be divided into any number of portions, but is generally divided into four (quartiles), five
(quintiles), or 100 portions (percentiles).
Figure 4.5 - A set of values is divided into three portions resulting in two quantiles, shown in red.
Quintiles
Quintiles represent four values that divide the data into five equally sized portions obtained by
first arranging the data values in ascending order and then dividing the data into five portions.
The first (Q1), second (Q2), third (Q3), and fourth (Q4) quintiles represent 20%, 40%, 60%, and
80% of the data values below them, as shown in Figure 4.6.
Figure 4.6 An example depicting data distribution over four quintiles.
Quartiles
Quartiles represent three values that divide the data into four equally sized portions obtained by
first arranging the data values in ascending order and then dividing the data into four quarters.
The first, second, and third quartiles are known as lower quartile, median, and upper quartile
and are denoted by Q1, Q2, and Q3 respectively. Q1, Q2, and Q3 represent values below which
25%, 50%, and 75% of data values exist respectively.
There are multiple ways to compute quartiles. The simplest approach is to first divide the data
into two portions by finding the median, Q2. If n is odd, Q2 itself is excluded from both
portions. Q1 and Q3 are then the medians of the first and second portions respectively.
Consider the set of values in Figure 4.7. The median Q2 is 4.5. As the total number of values is
14, an even number, Q1 and Q3 can be calculated without removing any value. Q1, the median of
the first half, is 2, while Q3, the median of the second half, is 7.
Figure 4.7 Quartile Example
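The median-of-halves approach described above can be sketched in Python as follows. The function name quartiles and the 14 sample values are assumptions made for illustration; the values are not taken from Figure 4.7 and were only chosen so that they give the same quartiles (2, 4.5, and 7).

```python
from statistics import median


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves approach."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    q2 = median(ordered)
    lower = ordered[:half]
    # When n is odd, skip the middle value (Q2) so that it is in neither half.
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower), q2, median(upper)


# Hypothetical data: 14 values, so no value is removed before splitting.
sample = [1, 1, 2, 2, 3, 3, 4, 5, 6, 6, 7, 8, 8, 9]
q1, q2, q3 = quartiles(sample)
print(q1, q2, q3)
```

Other quartile conventions exist, and statistical libraries may return slightly different values for the same data.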
Interquartile Range & Outliers

A related statistic, the interquartile range (IQR), measures the spread of the values between
Q1 and Q3 and is obtained by subtracting Q1 from Q3, as follows:
IQR = Q3 - Q1
Outliers are abnormal or extreme data values that generally occur within the first and last
quarter of the data and can skew the results of a calculation. Figure 4.8 illustrates how an IQR
can be used to exclude outliers. As it only includes data values between Q1 and Q3, any
outliers in the first and last quarters can be effectively eliminated.
Figure 4.8 An example of an IQR that is used to exclude outliers.
Percentiles
A percentile, like a quartile, is a value that divides the data into equal portions using
percentages instead of quarters, and is a value under which a given percentage of data values
exists. Each percentile represents the corresponding percentage of values. For example, the
30th percentile means 30% of values are less than the value represented by the 30th percentile.
Q1, Q2, and Q3 are also known as the 25th, 50th, and 75th percentiles, respectively.
Bias
A bias is introduced when the sample is not a true representation of the population, which can
happen if the sample has not been drawn in a random manner. A sample statistic from a biased
sample will result in making false conclusions about the corresponding population parameter.
In technical terms, a bias represents how far the average of multiple values of an estimator,
calculated from multiple samples, is from the corresponding population parameter. On the other
hand, an estimator can be imprecise if different values of the estimator from different samples
are not close to each other, meaning the estimator can be biased or unbiased and precise or
imprecise at the same time.
In Figure 4.9, the estimator is biased as the average value lies at a distance from the
population parameter, shown as the X on the number line. The results are close to each other;
therefore, the estimator is precise.
Figure 4.9 An example where a bias is present.
Distribution
A distribution is a group of numbers or a function that shows all occurrences of different values
or outcomes of a variable. In other words, it shows how values of a variable are distributed. For
example, Table 4.5 shows the distribution of different colored balls when drawn randomly from a
bag.
Table 4.5 - An example of a distribution.
Depending on the type of variable, a distribution can be either discrete or continuous. Generally,
a discrete distribution is shown using a bar chart, while a continuous distribution is shown using
a histogram. This is explained shortly in the Visualization section. In statistics, a distribution can
also refer to a function that explains the nature of a group of numbers.
Variance
The variance is a non-negative value that shows how spread the values are compared to the
mean of the values or center of a distribution. Sample variance is denoted by s², while the
population variance is denoted by σ².
A small variance shows that there is comparatively small difference between the values and the
mean value, and that the values occur close to each other. A large variance shows that there is
comparatively large difference between the values and the mean value, and that the values
occur far from each other.
Standard Deviation
Like the variance, the standard deviation is another non-negative value to view the spread of the
values from the center of the distribution. Sample standard deviation is denoted by s, while the
population standard deviation is denoted by σ. The calculated value is known as one standard
deviation and it is expressed in the same units as the values in the distribution.
Variance & Standard Deviation

The standard deviation is generally more useful than the variance for descriptive
purposes, whereas the variance is usually more useful mathematically. The lower the
variance and standard deviation, the less spread out and the closer to the mean value the
values are.
The sample statistics s² and s can be used to estimate the corresponding population parameters σ² and σ. The
variance and standard deviation enable us to measure how consistently a process generates
data, for example to analyze which bottle-filling machine fills bottles on a more consistent basis.
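As a rough illustration of the bottle-filling comparison, the following Python sketch computes the sample variance and sample standard deviation for two machines. The fill volumes are invented for this example and are not taken from the workbook.

```python
from statistics import stdev, variance

# Hypothetical fill volumes in millilitres (illustrative data only).
machine_a = [498, 502, 500, 499, 501]
machine_b = [490, 510, 505, 495, 500]

for name, fills in (("A", machine_a), ("B", machine_b)):
    s2 = variance(fills)  # sample variance, divides by n - 1
    s = stdev(fills)      # sample standard deviation, the square root of s2
    print(f"machine {name}: s2 = {s2:.2f} ml^2, s = {s:.2f} ml")

# The machine with the smaller standard deviation fills more consistently.
more_consistent = "A" if stdev(machine_a) < stdev(machine_b) else "B"
print("more consistent machine:", more_consistent)
```

Both machines have the same mean fill of 500 ml, so only the measures of variation reveal that machine A is the more consistent of the two.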
Z-Score
A z-score, also known as standard score, is the number of standard deviations above or below
the mean value of the distribution. The z-score is denoted by a z. A set of values can be
converted to z-scores through a process of standardization. A negative z-score shows that the
value is less than the mean, whereas a positive z-score shows that the value is greater than the
mean value.
Z-scores help to make decisions about data in a standardized manner by concentrating on
values that are either closer to or farther from the normal set of values to include or exclude
data based on their distance from the mean value.
Z-scores can be used as a baseline for comparing different datasets with different means and
standard deviations. For example, two bottle filling machines have z-scores of -0.5 and 0.5,
which means that the first machine is under-filling while the second machine is over-filling.
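A z-score is calculated by subtracting the mean from a value and dividing the result by the standard deviation. The sketch below shows this in Python; the function name z_score and the machine figures are assumptions chosen only to reproduce z-scores of -0.5 and 0.5.

```python
def z_score(value, mean, std_dev):
    """Number of standard deviations that value lies above (+) or below (-) the mean."""
    if std_dev <= 0:
        raise ValueError("std_dev must be positive")
    return (value - mean) / std_dev


# Hypothetical machines with different means and standard deviations.
z_first = z_score(value=498.0, mean=500.0, std_dev=4.0)
z_second = z_score(value=331.0, mean=330.0, std_dev=2.0)
print(z_first, z_second)
```

Although the two machines work with different volumes and spreads, the z-scores place both results on the same scale.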
Exercise 4.3: Name the Measure
1. Jack is performing EDA on income data for a particular region with middle class earning
potential. He summarizes the data using one of the averages. However, when he adds data
from another region consisting of few but extremely wealthy people, recalculation of the
average results in a completely different value. Identify the measure that Jack is using.
______________________
2. Amber is comparing the temperature of tropical countries with the temperature of countries
that are farther away from the equator, in order to help chemists develop different variants of
engine oil for each region. She has compiled two sets of distributions, with the average
temperatures for each month of the year arranged in ascending order. Which measure can
be used to determine the temperature fluctuations for each region?
______________________
3. A technician is comparing the performance of two similar machines using a certain measure
of variation. However, he is getting a lot of variation between the lower and upper operating
bounds and is unable to obtain a meaningful comparison. A quick investigation reveals that
the data has extreme values towards both the lower and upper bounds. Which measure of
variation can be used to enable a meaningful comparison of the two machines?
______________________
4. Two dozen contestants participated in an essay writing competition last week. The
published results informed each contestant of the mark he or she received out of 100.
However, the contestants want to know how well they performed in comparison to the other
contestants, in terms of the percentage of contestants that had received lower marks. Which
measure will provide the required additional information?
______________________
5. A data scientist is analyzing the sales figures of two different stores. Calculating the range of
both sets of sales figures reveals that the first store has a much wider range than the
second store. Which measure of variation can be used to quantify the variation based on all
sales figures, in order to identify the store that produced more consistent sales figures?
______________________
6. A bio scientist is comparing two different types of corn seeds that have been genetically
modified. Production data for each type shows that both types have different mean and
standard deviation figures. The production figures for the last season indicate that both
types have resulted in a higher than average yield. Which measure can be used to find the
variety that performed better than the other?
______________________
Distributions
A distribution, as explained earlier, is a set of values showing how often different values occur,
or the chance of different values occurring. In statistics, there are a number of different types
of distributions, including the following:
x Frequency Distribution
x Probability Distribution
x Sampling Distribution
x Normal Distribution
Frequency Distribution
A frequency is the number of times each value of a variable appears. A distribution that shows
the frequency of a variable is known as the frequency distribution. A frequency distribution is a
quick and easy way of summarizing data, generally shown using a table or a bar chart.
For example, a frequency distribution of different colored balls pulled randomly from a bag can
be displayed in the form of a bar chart, as shown in Figure 4.10.
Figure 4.10 A bar chart depicting frequency distribution.
Probability
A probability is a measure of how likely an event or a value of a variable is to occur, and is
a value between 0 and 1. The probabilities of all possible outcomes add up to 1. A probability
closer to 0 indicates a rare event, while a probability closer to 1 indicates a common event.
In statistics, an experiment is a test based on chance that leads to different results known as
outcomes. An event refers to an individual outcome or a group of outcomes of an experiment.
Probability Distribution
A distribution that shows the probability of each event or value of a variable is known as the
probability distribution. The bar chart in Figure 4.11 shows the probability distribution of one red
ball, one yellow ball, and two blue balls.
Figure 4.11 A bar chart depicting probability distribution.
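The step from a frequency distribution to a probability distribution can be sketched in Python for the same bag of one red ball, one yellow ball, and two blue balls. The variable names are illustrative only.

```python
from collections import Counter
from fractions import Fraction

balls = ["red", "yellow", "blue", "blue"]

# Frequency distribution: how many times each colour appears.
frequency = Counter(balls)

# Probability distribution: each frequency divided by the total count.
total = sum(frequency.values())
probability = {colour: Fraction(count, total) for colour, count in frequency.items()}

for colour, p in probability.items():
    print(f"{colour}: frequency = {frequency[colour]}, probability = {float(p):.2f}")
```

The probabilities are 0.25 for red, 0.25 for yellow, and 0.5 for blue, and they add up to 1.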
Reading
For a more in-depth discussion on this topic, see the Probability Distributions section from
pages 30-31 of the Doing Data Science text book that accompanies this module.
Sampling Distribution
A sampling distribution is the probability distribution of a sample statistic, such as a mean, that
is commonly used to make inferences about the population parameters by calculating sample
statistics from a number of fixed-size samples.
A sample statistic, such as a mean, calculated from a number of different samples of the same
size would generally result in different values. In order to view the variation in the sample
statistic values, a sampling distribution is used. The mean of the sampling distribution is an
estimate of the population mean.
Standard Error
The standard error is the standard deviation of a sampling distribution that is used to estimate
how close the sample statistic, generally the mean, is to the population parameter. Standard
error of mean is denoted by SE. As the sample size n increases, the standard error decreases.
The standard deviation of a sample is used to measure how far the values are from the sample
mean, whereas the standard error of mean is used to measure how far the sample mean is from
the population mean.
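The standard error of the mean is commonly estimated as the sample standard deviation divided by the square root of the sample size. The following Python sketch assumes that formula; the function name and the figures are illustrative only.

```python
from math import sqrt


def standard_error_of_mean(s, n):
    """Estimate the standard error of the mean as s / sqrt(n)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return s / sqrt(n)


# Hypothetical sample standard deviation of 10 units at three sample sizes.
for n in (25, 100, 400):
    print(f"n = {n}: SE = {standard_error_of_mean(10.0, n)}")
```

Quadrupling the sample size halves the standard error, which is why larger samples give estimates that lie closer to the population mean.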
Statistical Estimators
A statistical estimator is a rule or a function that provides an estimate for the population
parameter based on a sample statistic. There are two types of estimators, known as a point
estimator and an interval estimator. A point estimator provides a single value, whereas the
interval estimator provides a range of values. A sample mean is an example of a point
estimator, whereas a confidence interval is an example of an interval estimator.
Confidence Interval
A confidence interval measures the reliability of the estimate for the population parameter,
which has been calculated from a sample. Instead of specifying a point estimate for the
population parameter, such as the mean of the population, it specifies a range, or interval
estimate, together with a confidence level, expressed as a percentage, that this interval
estimate contains the population parameter.
Although confidence intervals can be calculated at different confidence levels, such as 50%,
90%, or 99%, they are often calculated at a confidence level of 95%.
At best, the true value of the population parameter can only be estimated and can never be
found exactly because samples are used. Due to this fact, the confidence interval specifies the
uncertainty related to the sampling method rather than specifying the value of the population
parameter. A
95% confidence interval for the population mean can be interpreted as:
95% of the estimate intervals, calculated from different samples, will contain the population
mean.
- or -
There is a 95% chance that a single estimate interval will contain the population mean.
As shown in Figure 4.12, the higher the confidence level, the wider the interval, although making
the interval too wide can affect the importance of this measure. For example, a confidence level
of 99% stating that a Web page load time is between 15 and 25 seconds is less helpful in
estimating the actual load time than a confidence level of 90% stating that the Web page load
time is between 19 and 21 seconds.
Figure 4.12 An example of high and low confidence intervals.
A confidence interval is normally expressed in the form:
p ± the error margin, such as 3.4 ± 0.5
...where p is the mean value.
Therefore, a 95% confidence interval of 3.4 ± 0.5 kg for the mean weight of newborn babies
indicates that there is a certainty of 95% that the mean weight of newborn babies falls within
the range of 2.9 - 3.9 kg.
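One common way to obtain such an interval is to take the sample mean plus or minus a multiple of the standard error. The Python sketch below assumes a normal approximation, which is a simplification (small samples are usually handled with the t-distribution), and uses invented birth weights. The function name confidence_interval is illustrative only.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def confidence_interval(sample, level=0.95):
    """Return (low, high) for the population mean using a normal approximation."""
    centre = mean(sample)
    # z is the number of standard errors that covers the requested level.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    margin = z * stdev(sample) / sqrt(len(sample))
    return centre - margin, centre + margin


# Hypothetical birth weights in kilograms (illustrative data only).
weights = [3.1, 3.6, 2.9, 3.8, 3.4, 3.3, 3.7, 3.2, 3.5, 3.5]

low95, high95 = confidence_interval(weights, level=0.95)
low99, high99 = confidence_interval(weights, level=0.99)
print(f"95%: {low95:.2f} to {high95:.2f} kg")
print(f"99%: {low99:.2f} to {high99:.2f} kg")
```

The 99% interval is wider than the 95% interval for the same sample, which matches the trade-off between confidence level and interval width.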
Skewness
Skewness is the amount of asymmetry of a (probability) distribution when measured from the
mean value. A distribution can be positively skewed where the tail of the curve is longer on the
right side or skewed to the right, and the mean is greater than the median and mode. The
majority of the values exist on the left side of the curve.
A distribution can be negatively skewed where the tail of the curve is longer on the left side or
skewed to the left, and the mean is less than the median and mode. The majority of the values
exist on the right side of the curve.
A normal distribution is not skewed. The left and right tails are similar to each other, and the
mean, median, and mode are equal to each other. For example, in Figure 4.13, three
distributions are summarized in bar graphs with a negative skew, without any skew, and with
positive skew.
Figure 4.13 - An example of three distributions with different skews summarized in bar graphs.
Discrete & Continuous Probability Distributions

A probability distribution is a function used to estimate the occurrence of an event, a single
value, or a range of values of a random variable used for building statistical models. In a
discrete distribution, each specific value of the random variable has a non-zero probability. For
continuous distributions, the probability is zero for a specific value and non-zero for a range of
values called intervals.
The function used for expressing the probability distribution of a continuous variable is known as
the probability density function (PDF). It can be used to find the probability of an interval, which
is the area under the curve between two points on the x-axis of the probability distribution curve.
The area under the probability distribution curve for all possible values of a variable is always
equal to one. For example, Figure 4.14 shows the probability of all values (0 to 50) as the
shaded area of a rectangle that is equal to one. By applying the formula for the area of a
rectangle (area = length * width) with an area of 1 and a width of 50, a height of 1 / 50 = 0.02
is calculated for the PDF. Based on the known value of the PDF, Figure 4.15 shows that the
probability of values > 30 in the distribution is (50 - 30) * 0.02 = 0.4.
Figure 4.14 An example where a value of 0.02 is calculated for the PDF.
Figure 4.15 An example where the probability of values > 30 in the distribution is equal to 0.4.
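The rectangle calculation can be reproduced with a few lines of Python. This is a minimal sketch for a distribution that is flat between 0 and 50; the function name probability_between is an illustrative choice.

```python
low, high = 0, 50

# The total area must equal one, so the height of the rectangle is 1 / width.
density = 1 / (high - low)


def probability_between(a, b):
    """Area under the flat PDF between a and b, clipped to the range [low, high]."""
    a, b = max(a, low), min(b, high)
    return max(b - a, 0) * density


p_above_30 = probability_between(30, 50)
print(f"PDF value = {density}, P(value > 30) = {p_above_30:.1f}")
```

The probability of any single value is zero here, because an interval of zero width has zero area.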
Distribution Fitting
Generally, random data-generating processes follow certain patterns. As a result, the probability
of a random variable that assumes a certain value or a range of values is somewhat predictable.
Depending upon the nature of the random data-generating process, an appropriate probability
distribution can be selected that fits the distribution in order to describe its nature and make
estimates about its values in terms of probabilities.
In some probability distributions, values are more centered around the mean value, whereas in
other distributions the values are evenly distributed. This behavior gives the probability
distribution a particular shape, from which a number of probability distributions have been
formulated. The shape of the curve of a continuous distribution indicates how values are spread
within the distribution.
Optional Reading
For further discussion on this topic, refer to the Fitting a Model section on page 33 of the Doing
Data Science text book that accompanies this module.
Normal Distribution
A normal distribution, also known as a bell-shaped curve or Gaussian distribution, is a
symmetric continuous probability distribution where the majority of values are found in close
proximity to the mean value. A normal distribution represents commonly occurring data where
most values are close to the average value and only a few values are found at the
extremities, as shown in Figure 4.16.
Figure 4.16 An example of a normal distribution.
In a normal distribution, approximately 99.7% of the values are within three standard deviations of
the mean, and the area under the curve is equal to one, as shown in Figure 4.17. A normal
distribution has the same mean, median, and mode.
Figure 4.17 An example of a normal distribution where the area under the curve is equal to one.
Standard Normal Distribution

A standard normal distribution or z-distribution also exists that comprises z-scores of the
probability distribution. A standard normal distribution has a mean of zero and a standard
deviation of one, as shown in Figure 4.18.
Figure 4.18 An example of a normal distribution and a standard normal distribution.
Central Limit Theorem

The central limit theorem states that the sampling distribution of the mean becomes normal or
nearly normal as the sample size n increases. Even if the population distribution is not a normal
distribution, the central limit theorem holds true. For a non-normal population, the sampling
distribution of the mean will get closer to being a normal distribution as the sample size n
increases. It can be used for making estimates about the sample mean and the population
mean even if the population distribution is not normal.
For example, the central limit theorem is applied on a non-normal population in Figure 4.19.
With a smaller sample size, the corresponding sampling distribution of the mean is not normal.
However, as the sample size increases, the sampling distribution of the mean starts becoming
normal.
Figure 4.19 An example of the central limit theorem applied to a non-normal population.
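The behavior shown in Figure 4.19 can be approximated with a small simulation. The Python sketch below is an illustration under stated assumptions: the population is taken to be exponential (strongly right-skewed), the sample sizes of 2 and 50 are arbitrary, and skewness is used as a rough indicator of how far the sampling distribution is from a symmetric, normal-like shape.

```python
import random
from statistics import mean, pstdev


def skewness(values):
    """Average cubed z-score; close to zero for a symmetric distribution."""
    m, s = mean(values), pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)


def sample_means(sample_size, repetitions, rng):
    """Draw repeated fixed-size samples and return the mean of each one."""
    means = []
    for _ in range(repetitions):
        sample = [rng.expovariate(1.0) for _ in range(sample_size)]
        means.append(sum(sample) / sample_size)
    return means


rng = random.Random(42)  # fixed seed so that the run is repeatable
small = sample_means(sample_size=2, repetitions=2000, rng=rng)
large = sample_means(sample_size=50, repetitions=2000, rng=rng)

skew_small, skew_large = skewness(small), skewness(large)
print(f"n = 2:  skewness = {skew_small:.2f}, spread = {pstdev(small):.2f}")
print(f"n = 50: skewness = {skew_large:.2f}, spread = {pstdev(large):.2f}")
```

With the larger sample size, the sampling distribution of the mean is both less skewed and less spread out, even though the population itself is far from normal.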
Measures of Association
A dataset representing a data generating process may contain certain variables that are related
to each other based on a pattern such that when the value of one variable changes, the other
one also changes in the same or different direction with a proportionate or disproportionate
magnitude. The measures of association quantify the relationship between two variables in a
dataset, and include:
x Correlation
x Covariance
Correlation
As originally introduced in Module 2, correlation is the degree of linear association between two
variables, measured using a correlation coefficient. The relationship is considered to be linear
when the scatter plot of the variables' values results in a straight line, which means that both
variables change with the same proportion at a constant rate.
Pearson's product-moment correlation coefficient, generally denoted by r, is one example of the
correlation coefficient that is used most commonly for measuring the correlation between two
variables.
The presence of correlation does not constitute causation. Correlation only constitutes a
mathematical association between the variables rather than a factual association. Non-linear
associations may also exist between variables, in which case Spearman's rank correlation can
be used. However, a monotonic relationship must exist between the variables.
A monotonic relationship is where, as one variable increases, the other variable consistently
moves in a single direction, either never decreasing or never increasing, although it may remain
constant. Variables that first increase and then decrease or vice-versa do not constitute such
a monotonic relationship.
A monotonically increasing relationship is where y either increases or remains constant but
never decreases, as shown in Figure 4.20.
Figure 4.20 A monotonically increasing relationship.
A monotonically decreasing relationship is where y either decreases or remains constant, but
never increases, as shown in Figure 4.21.
Figure 4.21 A monotonically decreasing relationship.
A non-monotonic relationship is where y increases and decreases, as shown in Figure 4.22.
Figure 4.22 A non-monotonic relationship.
Both the Pearson and the Spearman correlation coefficients have a range of -1 to +1 and are
interpreted in the same manner. The Pearson correlation coefficient is affected by outliers as it
takes into account the actual magnitude of the values. Instead of using the values as is, the
calculation of Spearman's correlation coefficient requires converting original values to ranked
values. As a result, Spearman's correlation coefficient is far less affected by outliers as the
actual magnitude of the values is ignored.
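The difference between the two coefficients can be seen in a short Python sketch. The functions below are written out by hand for illustration (they are not the workbook's formulas), and the data is invented: y always increases with x, but the last y value is an outlier.

```python
from math import sqrt


def pearson(x, y):
    """Pearson's r computed from the actual magnitudes of the values."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cross = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cross / (sx * sy)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's coefficient: Pearson's r applied to the ranked values."""
    return pearson(ranks(x), ranks(y))


# Hypothetical data: monotonically increasing, with one extreme y value.
x = [1, 2, 3, 4, 5, 6]
y = [2, 4, 6, 8, 10, 120]

r_pearson = pearson(x, y)
r_spearman = spearman(x, y)
print(f"Pearson = {r_pearson:.2f}, Spearman = {r_spearman:.2f}")
```

Because the ranks of y are simply 1 to 6, the Spearman coefficient reports a perfect monotonic relationship, while the Pearson coefficient is pulled well below 1 by the single extreme value.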
NOTE
If the dataset involves time-based elements, then a
simultaneous time series analysis of the variable(s), as
covered in Module 5: Advanced Big Data Analysis & Science,
can provide a visual aid in confirming correlation analysis
results or identifying new relationships between the
variables.
Correlation & High-Volume Datasets

Applying correlation to high-volume datasets requires special consideration. In the case of
untargeted data mining, which is data mining without any predetermined goal, coupled with wide
datasets, correlation may need to be applied to a number of pairs of variables as multiple
variables may be correlated.
Even in the case of targeted data mining where one of the variables may be known, such as the
dependent variable, the other correlated variable must still be discovered through testing
multiple independent variables. In the case of tall datasets, the large number of records can
pose performance penalties and strain the underlying processing resources.
Choose an algorithmic implementation that supports a distributed/parallel architecture, as fitting
millions of records in the main memory of a single machine may not be possible or ideal.
Implementing an algorithm that supports a distributed/parallel architecture can often be
achieved through the introduction of an analytics engine mechanism in a Big Data solution.
Correlation does not imply causation, and this matters especially in high-volume datasets,
where there is a potential for uncovering many correlations. Some of these uncovered
correlations may be coincidental or may only exist in a particular version of a dataset.
Therefore, validation is required to confirm the findings and, by applying domain knowledge, to
eliminate correlations that are valid but insignificant from a business point of view. Over
time, multiple versions of the dataset should be analyzed to ascertain whether a correlation is
of a recurring nature before devising an action plan.
Correlation & High-Velocity Datasets

In the case of high-velocity datasets where data arrives at a fast pace, the correlation model is
generally updated once the complete dataset is available because performing correlation on a
small dataset may not reveal the true nature of the relationship between variables.
For random data-generating processes, the direction of the correlation between two variables
will seldom change, while its strength may change. As a result, a correlation model may not
require frequent updates despite the high velocity of the data.
Correlation & High-Variety Datasets

Faced with a variety of datasets, determining the correlation between variables may prove to be
challenging. Difficulties can arise because the related variables may not exist within the same
dataset. This would require combining datasets, which can be performed by making use of the
query engine mechanism. Ranked variables, containing values such as low, medium, and high,
also cause difficulty in establishing correlation. In this case, the ranked values need to be
converted into numerical values.
Correlation & High-Veracity Datasets

High-veracity datasets, which contain the least possible amounts of noise and outliers, are
required to determine the right level of correlation between two variables.
Apart from providing false results, noise contributes towards inefficient use of underlying
processing resources (the processing engine) as the noise is unnecessarily processed. Noise
can be removed during the data acquisition and filtering stage of the Big Data analysis lifecycle,
while outliers can be removed using techniques such as those discussed in the Outlier
Detection section in Module 5.
Correlation & High-Value Datasets

As the value characteristic is directly related to the veracity characteristic, correlation shares
similar considerations when applied to high-value datasets in comparison to high-veracity
datasets. In order to achieve maximum value out of high-volume, high-velocity datasets,
correlations must be discovered as soon as the datasets become available. This requires the
underlying correlation algorithm to support distributed/parallel execution in a Big Data platform.
Optional Reading
For a more in-depth discussion on this topic, see the Correlation Doesn't Imply Causation
section from pages 274-278 of the Doing Data Science text book that accompanies this module.
Covariance
Like correlation, covariance is a measure of how two variables change together. Sample
covariance is denoted by $s_{xy}$, while the population covariance is denoted by $\sigma_{xy}$.
However, unlike correlation, its value can be any negative or positive number and is expressed
in the product of the units of the two variables. The value of covariance is therefore dependent
on the units used, meaning the covariance value for inches will be different from the covariance
value for centimeters. The value of correlation is standardized and is not affected by the units
used.
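The dependence of covariance on units can be illustrated with a short sketch. The height and weight figures below are invented for illustration, and the helper function names are assumptions rather than part of the module.

```python
from math import sqrt


def sample_covariance(xs, ys):
    """Sample covariance, dividing by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def correlation(xs, ys):
    """Correlation as covariance standardized by both standard deviations."""
    return sample_covariance(xs, ys) / sqrt(
        sample_covariance(xs, xs) * sample_covariance(ys, ys)
    )


# Made-up data: the same heights expressed in inches and in centimeters.
heights_in = [60.0, 62.0, 65.0, 68.0, 70.0]
weights_lb = [115.0, 120.0, 135.0, 150.0, 160.0]
heights_cm = [h * 2.54 for h in heights_in]

cov_in = sample_covariance(heights_in, weights_lb)
cov_cm = sample_covariance(heights_cm, weights_lb)
r_in = correlation(heights_in, weights_lb)
r_cm = correlation(heights_cm, weights_lb)

print(f"covariance (inches): {cov_in:.2f}")
print(f"covariance (cm):     {cov_cm:.2f}")
print(f"correlation (inches): {r_in:.4f}, correlation (cm): {r_cm:.4f}")
```

Changing the unit rescales the covariance by the conversion factor (2.54 here), while the correlation stays the same.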
Estimates of Popular Distribution

The following rules help to make general estimates about distributions in terms of what
percentage of a distribution's values fall within a specific distance from the mean:
• Chebyshev's Inequality Rule
• Empirical Rule
Chebyshev's Inequality Rule

Chebyshev's inequality rule, which applies to all kinds of distributions, states that a proportion
of at least $1 - 1/k^2$ of the values in a distribution are within k standard deviations of the
mean, provided that k is greater than one. For example, at least 75% of the values are within two
standard deviations of the mean, at least 89% of the values are within three standard deviations
of the mean, and at least 95% of the values are within four and a half standard deviations of the
mean, as shown in Figure 4.23.
Figure 4.23 An example of Chebyshev's inequality rule.
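The percentages quoted for Chebyshev's inequality rule can be reproduced with a few lines of code. This is a small sketch; the function name is an assumption chosen for illustration.

```python
def chebyshev_lower_bound(k):
    """Minimum proportion of values within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's inequality is only informative for k > 1")
    return 1 - 1 / k**2


# The three distances discussed in the text.
bounds = {k: chebyshev_lower_bound(k) for k in (2, 3, 4.5)}
for k, bound in bounds.items():
    print(f"k = {k}: at least {bound:.1%} of values")
```

These are lower bounds that hold for any distribution, so the actual share of values within k standard deviations can be higher.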
Empirical Rule
The empirical rule, also known as the 68-95-99.7 rule, states that approximately 68% of the
values within a distribution are within one standard deviation of the mean, 95% of the values are
within two standard deviations of the mean, and 99.7% of the values are within three standard
deviations of the mean, as shown in Figure 4.24. Unlike Chebyshev's rule, the empirical rule only
applies to normal distributions.
Figure 4.24 An example of the empirical rule.
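The figures in the empirical rule can be checked against the standard normal distribution. The sketch below uses NormalDist from Python's standard statistics module; the helper name share_within is an assumption for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)


def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


shares = {k: share_within(k) for k in (1, 2, 3)}
for k, share in shares.items():
    print(f"within {k} standard deviation(s): {share:.2%}")
```

Because these values come from the normal distribution itself, they are much tighter than the general lower bounds given by Chebyshev's inequality rule.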
Exercise 4.4: Naming and Matching
1. A car manufacturer is ordered by a court to publish reliability figures. In response, a sample
of 100 warranty cases is analyzed and reliability is published based on the mean breakdown
value. However, some dealerships complain that this is not accurate and can mislead
customers, as breakdown time can vary. Which measure can be used to specify reliability
more accurately without misleading the customers?
______________________
2. A dataset contains income figures for over 100,000 individuals and is positively skewed. To
determine the probability of a randomly chosen sample having a mean income greater than
$50,000, the data analyst starts to create a sampling distribution of the mean based on a large
sample size. Which rule or theorem is the data analyst applying?
______________________
3. A negatively skewed distribution consisting of number of children across households, with
known standard deviation and mean, is being analyzed. Which rule or theorem can be
applied to confirm if the probability of a household having up to six children (four and a half
standard deviations from the mean) is 0.95?
______________________
4. A normal distribution consisting of tree heights across the country, with known standard
deviation and mean, is being analyzed. Which rule or theorem can be applied to confirm if
the probability of a tree whose height is within two standard deviations from the mean is
0.95?
______________________
Frequency Distribution ___
Normal Distribution ___
Probability Distribution ___
Sampling Distribution ___
A. used to make inferences on population parameters based on a sample statistic
B. used to find the number of times each value of a variable appears in a dataset
C. used when data often occurs with values the same as or close to the average value
D. used to find the possible occurrence of an event or value of a variable in a dataset
Confirmatory Data Analysis (CDA)
Hypothesis Testing
A hypothesis is a testable claim or proposition that explains a phenomenon. For example, "Drug
A is better than Drug B." In statistics, this is a claim about the population parameter based on a
sample statistic. Hypothesis testing is the scientific process of assessing whether a claim or
proposition is significant and not merely the result of chance.
Understanding hypothesis testing requires knowing the following concepts that are introduced
in this section:
• Null Hypothesis (H0)
• Alternative Hypothesis (H1)
• P-Value
• Type I Error
• Type II Error
• Statistical Significance
Null Hypothesis
The null hypothesis, denoted by H0, states that observations made using the sample data are
based on chance alone, meaning there is no truth behind the observed phenomenon. The null
hypothesis is generally the opposite of the actual hypothesis, and is considered to be true by
default. It is only rejected if there is compelling evidence to the contrary.
Generally, the null hypothesis is stated in terms of equality or status quo, such as "same as,"
and it is the null hypothesis that is actually tested, with the conclusion of the hypothesis testing
stated in terms of H0, such as "reject H0."
• H0 = Drug A has the same effect as Drug B
Alternative Hypothesis
The alternative hypothesis, denoted by H1 or Ha, is the opposite of the null hypothesis, and is
generally accepted when the null hypothesis is rejected.
• H1 = Drug A has a different effect than Drug B
Rejecting the null hypothesis means that there is enough evidence against H0, but not
necessarily that H1 is true. Rather, it indicates that H1 is plausible. If H0 is not rejected, it
does not automatically mean that H0 is true, only that there is not enough evidence against H0
in support of H1.
Statistical Significance
The term statistical significance means that an observed claim or effect is unlikely to have
arisen solely by chance. In other words, such a claim or effect is based on some non-random
cause. The significance level α is a predetermined threshold probability that represents
statistical significance. This value is expressed as a percentage that is set at the start of the
hypothesis testing, often at a value of 5%. H0 is rejected when the p-value is less than or equal
to α, meaning the test results would be unlikely if H0 were true and therefore do not support
H0. In that case, the original claim is statistically significant.
P-Value
The p-value is the probability of getting a value, calculated from the sample, as extreme as or
more extreme than the original observed value under the assumption that the null hypothesis is
true. A p-value is used in weighing the test results to establish whether the original claim is
statistically significant or not.
If the p-value is less than or equal to α, then there is strong evidence against H0 and H0 is
rejected. If the p-value is greater than α, then there is weak evidence against H0 and H0 is not
rejected.
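The decision rule for the p-value can be sketched with a simple two-tailed z-test. This example assumes a known population standard deviation and a normally distributed sample mean, which the text does not specify; the function name and all numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_tailed_p_value(sample_mean, h0_mean, population_sd, n):
    """P-value of a two-tailed z-test for a mean (assumes known population sd)."""
    standard_error = population_sd / sqrt(n)
    z = (sample_mean - h0_mean) / standard_error
    # Probability of a result at least as extreme as z, in either direction,
    # under the assumption that H0 is true.
    return 2 * (1 - NormalDist().cdf(abs(z)))


alpha = 0.05  # significance level, fixed before the test

# Made-up numbers: H0 states the population mean is 50.
p_value = two_tailed_p_value(sample_mean=52.0, h0_mean=50.0, population_sd=8.0, n=64)
reject_h0 = p_value <= alpha

print(f"p-value = {p_value:.4f}, reject H0: {reject_h0}")
```

Here the sample mean lies two standard errors from the value claimed by H0, which gives a p-value just under 5%, so H0 is rejected at the 5% level.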
Critical Region, One-Tailed & Two-Tailed Tests

A term also based on α is the critical region, which contains the extreme values whose total
probability under H0 is equal to α. H0 is rejected if the test results fall within the critical
region. The critical region can either be on one side of the normal set of values or on both sides.
The former is known as a one-tailed test, whereas the latter is known as a two-tailed test, as
discussed on the upcoming page.
Type I Error, Type II Error & the Power of Hypothesis Test

A type I error occurs when H0 is rejected even though it is true. As H0 is rejected when values
fall within the critical region, the probability of a type I error is the same as the significance
level α. A type II error occurs when H0 is not rejected even though it is false, with its
probability represented by β. The power of a hypothesis test is the probability of making the
correct decision of rejecting H0 when it is false. This probability is given by (1 - β).
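The relationship between α, β, and power can be worked through for a one-tailed z-test. This sketch assumes a known population standard deviation and a specific true mean, neither of which comes from the text; the function name and numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def beta_and_power(h0_mean, true_mean, population_sd, n, alpha):
    """Type II error probability and power of an upper-tailed z-test."""
    standard_error = population_sd / sqrt(n)
    # Boundary of the critical region: sample means above it lead to rejecting H0.
    critical_value = NormalDist(h0_mean, standard_error).inv_cdf(1 - alpha)
    # Beta: chance that the sample mean falls outside the critical region
    # when the population mean is actually true_mean.
    beta = NormalDist(true_mean, standard_error).cdf(critical_value)
    return beta, 1 - beta


alpha = 0.05

# Made-up scenario: H0 claims a mean of 50, but the true mean is 52.
beta, power = beta_and_power(50.0, 52.0, population_sd=8.0, n=64, alpha=alpha)
print(f"beta = {beta:.3f}, power = {power:.3f}")

# When H0 is actually true, the chance of rejecting it equals alpha (type I error).
_, type_1_rate = beta_and_power(50.0, 50.0, population_sd=8.0, n=64, alpha=alpha)
print(f"type I error rate = {type_1_rate:.3f}")
```

Increasing the sample size or the gap between the true mean and the mean claimed by H0 reduces β and therefore raises the power.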
Visualization
Visualization for EDA & CDA

Unlike the visualization techniques discussed in Module 2, those presented in Module 4 are
intended for the data scientist, not the business user. Although a business user may gain some
value from these visualizations, it is not their intended or primary purpose.
Like statistical distribution diagrams, the visualizations presented in this section are useful for
guiding EDA and formulating hypotheses for the CDA process. Visualization helps the data
scientist gain insight into a data set by engaging the human visual system in the analytic
process.
Bar Graph
A bar graph, also known as a bar chart, is a graph generally used to view values of discrete
variables that can be ordinal or nominal, and can also be used to view discrete distributions.
Each discrete value is represented as a category on the x-axis, while the y-axis is used to
display the count of each category, as illustrated in the example in Figure 4.25. The actual count
is represented using a rectangle, called a bar, whose height shows the category count.
Generally, there are gaps between the bars in a bar graph.
Figure 4.25 - An example of a bar graph.
Line Graph
A line graph is similar to a bar graph and is used for displaying numerical ordinal data.
Instead of using a bar, a single point is used to represent each value, and all points are then
joined together using a line. Line graphs are often used to analyze data over time or trends, and
should not be used to display nominal data like product categories. However, ordinal data related
to multiple categories can be shown using a single line graph, as shown in Figure 4.26.
Figure 4.26 - An example of a line graph.
Histogram
A histogram is like a bar graph, but is used to view values of continuous variables that have
been grouped into intervals. Instead of viewing the frequency of a distribution in a tabular
form, a histogram is used to view a distribution in a graphical manner, as shown in Figure
4.27.
Unlike a bar graph, there are no gaps between the bars. When the intervals are equal, the height
of each bar represents the frequency of the corresponding interval, and the area of each bar is
proportional to its frequency.
Figure 4.27 - An example of a histogram.
In order to create a histogram, a frequency table with values divided into intervals is required, as
shown in Table 4.6. The intervals must be created without gaps between them, with all values of
a continuous variable covered. Generally, such intervals are equal; however, this is not a
restriction. When unequal intervals are used, a frequency density is calculated for ensuring that
the bar area is in proportion to its frequency, as shown in Figure 4.28.
Table 4.6 - An example of a frequency table with values divided into two intervals.
Figure 4.28 - An example of a histogram with each bar area in proportion to its frequency.
Frequency density shows the concentration of values in a range. Instead of the actual
frequency, histograms can also be used to show relative frequencies or probabilities, in which
case the maximum possible value on the y-axis is 1. An example of this is shown in Figure 4.29.
Relative frequencies are proportions of values in each interval. Such a histogram can be created
by dividing the frequency of each interval by the sum of all frequencies.
Figure 4.29 - An example of relative frequency or probability shown in a histogram.
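The calculations behind frequency density and relative frequency can be sketched as follows. The sketch assumes that frequency density is the frequency divided by the interval width; the intervals and counts are made up and do not reproduce Table 4.6.

```python
# Made-up frequency table with unequal intervals: (lower bound, upper bound, frequency).
intervals = [(0, 10, 5), (10, 20, 12), (20, 50, 9)]

total_frequency = sum(frequency for _, _, frequency in intervals)

rows = []
for lower, upper, frequency in intervals:
    width = upper - lower
    rows.append(
        {
            "interval": (lower, upper),
            "frequency": frequency,
            # Frequency density keeps the bar area (density x width) equal to the frequency.
            "density": frequency / width,
            # Relative frequency is the share of all values that fall in this interval.
            "relative": frequency / total_frequency,
        }
    )

for row in rows:
    print(row)
```

The widest interval ends up with the lowest bar even though its frequency is not the smallest, which is the effect that frequency density is meant to show.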
Frequency Polygons
Like histograms, frequency polygons can be used to display continuous distributions. However,
these can also be used to compare distributions in terms of their shape, such as whether the
distribution is a normal distribution or skewed, as shown in Figures 4.30 and 4.31. The midpoint
of each interval is used on the x-axis and a point is plotted at the corresponding location on the
y-axis that represents the frequency of the interval.
Figure 4.30 A frequency polygon.
Figure 4.31 A frequency polygon comparing distributions.
Frequency polygons can also be used to view cumulative frequencies. A cumulative frequency
is the total frequency up to a certain interval, as summarized in Table 4.7 and Figure 4.32.
Table 4.7 Cumulative frequency summary in a table.
Figure 4.32 Cumulative frequency summary.
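A cumulative frequency column can be derived from the interval frequencies with a running total. The interval labels and counts below are invented for illustration and do not reproduce Table 4.7.

```python
from itertools import accumulate

# Made-up interval labels and their frequencies.
intervals = ["0-10", "10-20", "20-30", "30-40"]
frequencies = [5, 12, 9, 4]

# Running total: each entry is the total frequency up to and including that interval.
cumulative = list(accumulate(frequencies))

for label, frequency, running_total in zip(intervals, frequencies, cumulative):
    print(f"{label:>6}  frequency={frequency:>2}  cumulative={running_total:>2}")
```

The last cumulative value always equals the total number of observations, which is a quick check on the table.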
Scatter Plot
A scatter plot can be used to view the association between two variables to determine whether
a pattern exists between the variables. It also offers a graphical means of spotting outliers.
Generally, a scatter plot is used to plot variables for correlation and regression analysis.
For regression analysis, the independent variable is plotted on the x-axis and the dependent
variable on the y-axis. Each pair of values is generally marked by a cross or a dot on the graph.
In Figure 4.33, black circles represent overlapping values and highlight the concentration of
values, with red circles indicating outliers.
Figure 4.33 - An example of a scatter plot.
Stem & Leaf Plot

Like a histogram, a stem and leaf plot, or stemplot, is a graphical technique for analyzing a
distribution and is well suited for viewing small datasets or samples. Instead of showing
frequencies, it shows actual values. Although the frequencies are not explicitly shown, they can
be estimated through the shape of the plot.
A value is divided into its constituent parts (tens, units), with the stem displaying the
higher-value part (tens) and the leaf displaying the lower-value part (units). Both the stems and
the leaves are arranged in ascending order. A key is often provided for interpretation of the plot.
Figure 4.34 - An example of a stem and leaf plot.
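The construction of a stem and leaf plot can be sketched in a few lines. This assumes non-negative two-digit integers split into tens (stem) and units (leaf); the function names and the data are hypothetical and do not reproduce Figure 4.34.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Group values by their tens digit (stem), keeping units (leaves) in order."""
    stems = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        stems[stem].append(leaf)
    return dict(stems)


def render(plot):
    """Format the plot as text, one stem per line in ascending order."""
    return "\n".join(
        f"{stem} | {' '.join(str(leaf) for leaf in leaves)}"
        for stem, leaves in sorted(plot.items())
    )


# Made-up small sample.
values = [47, 23, 31, 59, 25, 47, 38, 31, 42, 47]
plot = stem_and_leaf(values)
print(render(plot))
print("Key: 2 | 3 means 23")
```

The longest row shows where values are concentrated, and repeated leaves within a row (such as the three 7s on stem 4) point to the mode.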

A stemplot can also be used to compare two distributions, in which case it is known as a back-
to-back stemplot. A common stem is used for both distributions, as illustrated in the left side of
Figure 4.35. A stem and leaf plot is useful in facilitating the identification of outliers and the
mode of the distribution, as indicated by the red and blue circles on the right side of Figure 4.35.
Figure 4.35 An example of a back-to-back stemplot comparing two distributions.
Cross-Tabulation
While not strictly a graphical technique, cross-tabulation, also known as cross-tabs, is a two-way
frequency table used for viewing relationships between two variables. It is also used to evaluate
the performance of a classification model.
The values of the two variables become the actual column or row headers, and the cell values
are the counts of the intersection between the two values. Values from a normal table can be
converted into a cross-tab, as illustrated in the example in Table 4.8.
Table 4.8 - The normal table of individuals to the left is converted into a cross-tab, depicted on the right.
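The conversion of a normal table into a cross-tab can be sketched by counting how often each pair of values occurs. The records and the function name below are hypothetical and do not reproduce Table 4.8.

```python
from collections import Counter

# Made-up normal table: one (gender, purchased) record per individual.
individuals = [
    ("Female", "Yes"),
    ("Male", "No"),
    ("Female", "Yes"),
    ("Male", "Yes"),
    ("Female", "No"),
    ("Male", "No"),
]


def cross_tab(pairs):
    """Two-way frequency table: rows from the first variable, columns from the second."""
    counts = Counter(pairs)
    row_labels = sorted({row for row, _ in pairs})
    column_labels = sorted({column for _, column in pairs})
    return {
        row: {column: counts[(row, column)] for column in column_labels}
        for row in row_labels
    }


table = cross_tab(individuals)
for row, cells in table.items():
    print(row, cells)
```

The same counting approach applies when the two variables are the actual and predicted classes of a classification model.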
Box & Whisker Plot

A box and whisker plot, also known as a box plot, can be used to display the median, range,
Q1, Q3, and IQR of a distribution in a single graph. The mean value can also be
shown by adding a plus sign to the box, as shown in Figures 4.36 and 4.37. The box and
whisker plot is an ideal visual analysis technique for comparing multiple distributions.
Figure 4.36 An example of a box and whisker plot (I).
Figure 4.37 An example of a box and whisker plot (II).
The position of the box reveals whether the distribution is symmetrical or asymmetrical. A box
positioned in the middle of the whiskers represents a symmetrical distribution. If the right
whisker is longer than the left whisker, then the distribution is positively skewed, and vice
versa. Similarly, if the median is greater than the mean, then the distribution is negatively
skewed, and vice versa. Outliers can also be identified, as the presence of outliers makes the
whiskers longer.
Depending upon which axis of the graph represents the categories, a box and whisker plot can be
either horizontal or vertical. Box and whisker plots provide an ideal graphical method for
visualizing a five-number summary that includes the minimum value, Q1, median, Q3, and
maximum value.
Figure 4.38 - An example of a vertical plot.
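The five-number summary behind a box and whisker plot can be computed with Python's standard statistics module. The sample is made up, and the "inclusive" quartile method is one of several conventions, so other tools may report slightly different values for Q1 and Q3.

```python
from statistics import mean, quantiles


def five_number_summary(values):
    """Minimum, Q1, median, Q3, and maximum of a sample."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


# Made-up sample with a longer right tail.
values = [2, 4, 4, 5, 7, 8, 9, 10, 12, 15, 21]

summary = five_number_summary(values)
minimum, q1, median, q3, maximum = summary
iqr = q3 - q1

print(f"min={minimum}, Q1={q1}, median={median}, Q3={q3}, max={maximum}, IQR={iqr}")
print(f"mean={mean(values):.2f}")
```

In this sample the right whisker (from Q3 to the maximum) is longer than the left and the mean is greater than the median, both of which point to a positively skewed distribution.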
Quantile-Quantile Plot
A quantile-quantile (q-q) plot is used for comparing distributions with a graph of quantiles of the
two distributions against each other. Based on the similarity of the distributions, q-q plots can be
used to see whether or not the underlying data-generating processes are of similar type.
A q-q plot can also be used to compare observed values against theoretical values or the values
obtained from a model. This provides a means for testing whether a model fits a given
distribution. If the two distributions are the same, the points on the plot follow a 45° line. In
Figure 4.39, quartiles of Distribution A are compared against quartiles of Distribution B.
Figure 4.39 A quantile-quantile plot comparing Distribution A and Distribution B.
If the points form a line that is flatter, the distribution plotted on the x-axis has a greater variance
as compared to the distribution plotted on the y-axis, as shown in Figure 4.40.
However, if the points form a steeper line, then the distribution plotted on the y-axis has a
greater variance as compared to the distribution plotted on the x-axis, as shown in Figure 4.41.
If one of the distributions is skewed, then the plot follows an arc, as shown in Figure 4.42.
Any strong deviations from the straight line can indicate the presence of outliers, as shown in
Figure 4.43.
Figure 4.40 Flat line.
Figure 4.41 Steep line.
Figure 4.42 Arc plot.
Figure 4.43 An outlier.
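The points of a q-q plot can be generated by pairing the matching quantiles of two samples. In this sketch the samples are invented, deciles are used as the quantiles, and the slope is estimated crudely from the first and last points; all names are assumptions for illustration.

```python
from statistics import quantiles


def qq_points(sample_x, sample_y, n=10):
    """Pair up the matching quantiles of two samples (deciles by default)."""
    return list(zip(quantiles(sample_x, n=n), quantiles(sample_y, n=n)))


# Made-up samples: sample_y has the same shape as sample_x but twice the spread.
sample_x = [float(v) for v in range(1, 21)]
sample_y = [2 * v for v in sample_x]

points = qq_points(sample_x, sample_y)

# Rough slope of the q-q line, taken from its first and last points.
(x_first, y_first), (x_last, y_last) = points[0], points[-1]
slope = (y_last - y_first) / (x_last - x_first)

print(points)
print(f"slope = {slope:.2f}")
```

A slope above 1 corresponds to the steeper line described above, where the distribution on the y-axis has the greater variance; a slope below 1 corresponds to the flatter line.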
Lattice Plot
A lattice plot consists of multiple sub-plots arranged in a grid that enables bivariate and
multivariate analyses, with each panel of the grid representing a sub-plot. Different types of
graphs can be plotted as sub-plots for analysis purposes.
Figures 4.44 and 4.45 provide examples of different types of graphs. The first graph shows a
scatter plot of engine size vs. miles per gallon for vehicles with three, four, and five gears. The
second graph is a histogram of miles per gallon for vehicles with three, four, and five gears.
Figure 4.44 - An example of a lattice plot comprised of scatter plots.
Figure 4.45 - An example of a lattice plot comprised of histograms.
Correctly identify the visual technique used to display corresponding datasets:
1. In a ______________________, each discrete value is represented as a category on the
x-axis, while the y-axis is used to display the count of each category. The actual count is
represented using a rectangle whose height shows the category count.
2. ______________________ are often used to analyze data over time or trends, and should
not be used to display nominal data like product categories. However, ordinal data related to
multiple categories can be shown.
3. For a ______________________, a frequency table is first collected with values divided into
intervals. The intervals must be created without gaps between intervals covering all values
of a continuous variable.
4. A ______________________ can be used to view the association between two variables to
find if a pattern exists between the two variables, and offers a graphical means of spotting
outliers.
5. Like a histogram, a ______________________ is a graphical technique for analyzing a
distribution that is well-suited for viewing small datasets or samples.
6. While not strictly a graphical technique, ______________________ is a two-way frequency
table used for viewing relationships between two variables.
7. Data can be graphed visually for further analysis using the following methods:
______________________ plot for comparing two or more distributions and
visualizing a five-number summary, ______________________ plot for comparing exactly
two distributions, and ______________________ plot for managing multiple sub-plots for
bivariate and multivariate analyses.
Part III: Fundamental Big Data Analysis Techniques
The following fundamental analysis techniques will be covered in this section:
• Prediction: Linear Regression
• Classification: k-NN (k-Nearest Neighbors)
• Clustering: k-means
Reading
For a more in-depth discussion on this topic, see the "Three Basic Algorithms" section on pages
54-55 of the Doing Data Science textbook that accompanies this module.
Prediction: Linear Regression
Linear regression, also known as least squares regression, is a statistical technique for
predicting the values of a continuous dependent variable based on the values of an independent
variable. The dependent and independent variables are also known as response and
explanatory variables respectively.
Linear regression is used to explore the data in order to understand the nature of the
relationship between different variables. As a mathematical relationship between the response
variable and the explanatory variable(s), linear regression assumes that a linear correlation
exists between the response and explanatory variables.
A linear correlation between response and explanatory variables is represented through the line
of best fit, also called the regression line. This is a straight line that passes as closely as
possible through all points on the scatter plot, as illustrated in Figure 4.46.
Figure 4.46 - An example of a regression line.
The development of a linear regression model starts by expressing the linear relationship. Once
the mathematical form has been established, the next stage is to estimate the parameters of the
model via model fitting. This determines the line of best fit, achieved via least squares
estimation that aims to minimize the sum of squares error (SSE). The last stage is to evaluate the
model using R², mean squared error, or cross-validation.
Being a straight line, the regression line cannot pass through every point. It is an
approximation of the actual value of the response variable based on estimated values, as
demonstrated in Figure 4.47. The distance between the actual and the estimated value of the
response variable is the error of estimation. For the best possible estimate of the response
variable, the sum of squares error across all points must be minimized. The line of best fit is
the line that results in the minimum possible sum of squares error.
Figure 4.47 - An example of a straight regression line that cannot pass through all points.
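The fitting stage can be sketched in a few lines of Python. This is a minimal illustration that assumes a single explanatory variable. The function names fit_line and sum_squared_errors and the sample data are hypothetical and are not taken from the workbook.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least squares line for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = co-variation of x and y divided by the variation of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def sum_squared_errors(xs, ys, intercept, slope):
    """Sum of squared vertical distances between the points and the line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical sample data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = fit_line(xs, ys)
sse_best = sum_squared_errors(xs, ys, intercept, slope)
# A line with a slightly different slope has a larger sum of squares error.
sse_other = sum_squared_errors(xs, ys, intercept, slope + 0.1)
print(intercept, slope, sse_best, sse_other)
```

Comparing sse_best with sse_other shows why the fitted line is called the line of best fit: moving the line in any way increases the sum of squares error.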
Apart from predicting the value of the response variable, a regression model also reveals the
nature of the relationship between the response and the explanatory variables. When the values
of the explanatory variables are on comparable scales, the size of each parameter shows the
relative significance of the respective explanatory variable. The higher the magnitude, the more
impact the explanatory variable has on the response variable. Similarly, the sign of the
parameter shows the direction of the association. A negative sign means negative correlation
while a positive sign means positive correlation.
Multiple Linear Regression

In regression, more than one explanatory variable can be used simultaneously for predicting
the response variable, in which case it is called multiple linear regression. For multiple linear
regression, it is recommended to make histograms and scatter plots of the explanatory and
response variables to:
• help ascertain the correctness of the model
• check if all relevant explanatory variables have been added to the model
• find the respective relevance of each explanatory variable
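As a rough sketch of how multiple linear regression parameters can be estimated, the following Python code solves the least squares normal equations with plain Gaussian elimination. The helper names solve_linear_system and fit_multiple and the data are hypothetical. The response values are generated from a known relationship so that the recovered parameters can be checked by eye.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix so that row operations apply to both sides.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


def fit_multiple(rows, ys):
    """Least squares fit; returns [intercept, b1, b2, ...]."""
    design = [[1.0] + list(row) for row in rows]  # leading 1 for the intercept
    p = len(design[0])
    # Normal equations: (X^T X) beta = X^T y.
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(p)]
    return solve_linear_system(xtx, xty)


# Hypothetical data generated from y = 1 + 2*x1 - 3*x2 (no noise).
rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
ys = [1 + 2 * x1 - 3 * x2 for x1, x2 in rows]

beta = fit_multiple(rows, ys)
print(beta)  # approximately [1, 2, -3]
```

The sign of each recovered parameter gives the direction of the association, and, because both explanatory variables are on a similar scale here, the magnitudes can be compared directly.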
Mean Squared Error
The mean squared error (MSE) is a measure that tells how close the line of best fit is to the
actual values of the response variable. In other words, mean squared error identifies the
variation between the actual value and the estimated value of the response variable as provided
by the regression line. Generally, the mean squared error is also known as the estimator for the
variance in the predicted value.
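A minimal sketch of the calculation, assuming the mean squared error is taken as the average of the squared differences between actual and estimated values (some texts divide by the number of observations minus the number of parameters instead). The function name and the values are hypothetical.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and estimated values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical actual values and values estimated by a regression line.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.5, 7.0, 8.0]

mse = mean_squared_error(actual, predicted)
print(mse)  # (0.25 + 0.25 + 0 + 1) / 4 = 0.375
```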
Error Term & Residuals

While the line of best fit attempts to estimate the dependent variable as accurately as possible,
there is always a discrepancy between the predicted value and the actual value, known as the
error term or noise. The error term exists because the included independent variable(s) cannot
possibly predict the dependent variable with 100% accuracy. This is due to the fact that there
are generally other independent variables missing from the regression equation that also affect
the dependent variable. Generally, it is assumed that the noise is normally distributed.
In practice, the true values of the parameters always remain unknown due to the variations in
the data and the factors that have not been captured by the model. If the true values of these
parameters were known, then the true regression line could be drawn and the actual estimate
error, or error term, could be calculated.
However, the line that can actually be drawn will always be an estimate of the true regression
line, in which case the estimate error can only be estimated and is known as a residual. The
residual is known, but the error term is unknown and is best estimated via the residual.
Figure 4.48 - Error term is the actual error, the distance between the point and the point on the grey line. Residual is
the estimated error, the distance between the point and the black line, shown in green.
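The difference between the error term and the residual can be illustrated with a small simulation. The sketch below assumes a hypothetical true line (intercept 2, slope 3) with normally distributed noise. In a real analysis the true line and the error terms are never observed; they are available here only because the data is simulated.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical true regression line, which is never known in practice.
TRUE_INTERCEPT, TRUE_SLOPE = 2.0, 3.0

xs = [float(i) for i in range(20)]
# Error terms: normally distributed noise around the true line.
errors = [random.gauss(0.0, 1.0) for _ in xs]
ys = [TRUE_INTERCEPT + TRUE_SLOPE * x + e for x, e in zip(xs, errors)]

# Estimate the regression line from the data by least squares.
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

# Residuals: distances between the points and the estimated line.
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

for e, r in list(zip(errors, residuals))[:3]:
    print(f"error term {e:+.3f}   residual {r:+.3f}")
```

The residuals are close to the error terms but not identical, because the estimated line differs slightly from the true line.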
Coefficient of Determination R²
The coefficient of determination R² is the percentage of variation in the response variable that
is predicted or explained by the explanatory variable, with values that vary between 0 and 1. A
value equal to 0 means that the response variable cannot be predicted from the explanatory
variable, while a value equal to 1 means the response variable can be predicted without any
errors. A value between 0 and 1 gives the proportion of the variation that the model explains.
For a regression with a single explanatory variable, the value of the coefficient of determination
is simply the square of the correlation coefficient r.
The variation refers to the difference between the actual and the mean value of the response
variable. The explainable variation is the difference between the estimated and the mean value
of the response variable.
For example, 0.75 means that 75% of variation in the response variable is explained by the
explanatory variable, while the other 25% remains unexplained and is considered an error.
Instead of simply providing an average value as a measure of fit for the line, the coefficient of
determination provides a value that can be used to gauge the accuracy of the regression model.
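The definitions above can be checked numerically. The following sketch, using hypothetical data and a single explanatory variable, computes the total, explained, and unexplained variation and compares the resulting coefficient of determination with the square of the correlation coefficient r.

```python
import math

# Hypothetical sample data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
sxx = sum((x - mean_x) ** 2 for x in xs)
syy = sum((y - mean_y) ** 2 for y in ys)

# Least squares line and its estimated values.
slope = sxy / sxx
intercept = mean_y - slope * mean_x
predicted = [intercept + slope * x for x in xs]

total_variation = syy  # actual value vs. mean value
explained_variation = sum((p - mean_y) ** 2 for p in predicted)  # estimate vs. mean
unexplained_variation = sum((y - p) ** 2 for y, p in zip(ys, predicted))

r_squared = explained_variation / total_variation
r = sxy / math.sqrt(sxx * syy)  # correlation coefficient

print(r_squared, r ** 2)
```

For a least squares line, the explained and unexplained variation add up to the total variation, which is why the coefficient of determination always lies between 0 and 1.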
The coefficient of determination R² also reveals whether the model is affected by the variation
in the values. A regression model with a lower R² is less stable than one with a higher R², which
estimates well even in the face of variation in the data. For example, the regression model in
Figure 4.50 on the right has a better fit than the model in Figure 4.49 on the left, as the model
on the right has a higher R² value.
Figure 4.49 - A regression model with a low R² value.
Figure 4.50 - A regression model with a high R² value.
Standard Error of Estimate

The standard error of estimate (SEE) measures the accuracy of the predicted values of the
response variable. It shows how close or far the estimated values are from the actual values,
that is, the typical deviation of the values from the regression line. The smaller the SEE, the
more accurately the regression line can predict the response variable values.
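A minimal sketch of the calculation, assuming the common definition in which the sum of squares error is divided by the number of observations minus the number of estimated parameters (two for a line with an intercept and one slope) before taking the square root. The function name and the values are hypothetical.

```python
import math


def standard_error_of_estimate(actual, predicted, n_parameters=2):
    """Square root of SSE / (n - number of estimated parameters)."""
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(sse / (len(actual) - n_parameters))


# Hypothetical actual values and values estimated by a regression line.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.5, 7.0, 8.0]

see = standard_error_of_estimate(actual, predicted)
print(see)  # sqrt(1.5 / 2), roughly 0.866
```

Unlike the sum of squares error, the SEE is expressed in the same units as the response variable, which makes it easier to interpret.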
Linear Regression & Other Techniques
A linear regression model expresses a kind of correlation between the response and the
explanatory variable(s). By virtue of this characteristic, each explanatory variable can be
individually tested for correlation with the response variable. Similarly, if the dataset involves
time-based elements, then a simultaneous time series analysis (covered in Module 5) of the
response and explanatory variables may also prove helpful in identifying or testing the
relationship between the two.
Linear Regression & High-Volume Datasets

High-volume datasets demand that the underlying linear regression algorithm be able to run in a
distributed/parallel environment where datasets are split over multiple nodes.
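One way a linear regression could be computed over a dataset split across nodes is to have each node calculate a few summary totals for its own partition and then combine those totals centrally. The sketch below illustrates the idea for a single explanatory variable. It is not tied to any particular Big Data platform, and the function names and the two partitions are hypothetical.

```python
def partial_sums(partition):
    """Summary totals computed independently on one node's partition."""
    n = len(partition)
    sx = sum(x for x, _ in partition)
    sy = sum(y for _, y in partition)
    sxy = sum(x * y for x, y in partition)
    sxx = sum(x * x for x, _ in partition)
    return n, sx, sy, sxy, sxx


def combine(stats):
    """Add up the per-partition totals element by element."""
    return tuple(sum(values) for values in zip(*stats))


def fit_from_sums(n, sx, sy, sxy, sxx):
    """Least squares intercept and slope from the combined totals."""
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return intercept, slope


# Hypothetical dataset of (x, y) pairs split over two nodes.
partitions = [
    [(1.0, 2.1), (2.0, 3.9)],
    [(3.0, 6.2), (4.0, 7.8), (5.0, 10.1)],
]

totals = combine(partial_sums(p) for p in partitions)
intercept, slope = fit_from_sums(*totals)
print(intercept, slope)
```

Only five numbers per partition need to be transferred, regardless of how many records each node holds. The same totals can also be updated as new records arrive.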
For tall datasets, the algorithm is required to make calculations across the whole length of the
dataset. It is important that the dataset is cleansed of any noise before applying linear
regression. First determine whether the variables are correlated by performing a correlation test,
as applying linear regression without knowing if the two variables have commonalities can result
in a meaningless model.
For wide datasets, a correlation test is a prerequisite when applying multiple regression. Within
multiple linear regression, each potential explanatory variable should be individually tested for
correlation. Any time-based explanatory variables can further be analyzed using time series
analysis (covered in Module 5: Advanced Big Data Analysis & Science) to identify any
correlations.
Linear Regression & High-Velocity Datasets

Data arriving at a fast pace requires the regression model to be updated on a regular basis, as
the correlation between explanatory and response variables may change over time. Automated,
repeated application of regression models to high-velocity datasets may require configuration of
the workflow engine so that the values of the response variable are automatically calculated as
soon as the data becomes available.
Linear Regression & High-Variety Datasets

A single dataset may not include the required explanatory variable(s) for building an accurate
regression model. A variety of datasets may need to be joined together in order to extract the
relevant explanatory variables. This exercise carries further significance for multiple linear
regression due to the existence of multiple explanatory variables.
As in the case of high-volume datasets, each potential explanatory variable should be
individually tested for correlation. Any time-based explanatory variables can be further analyzed
using time series analysis (covered in Module 5) to identify any correlations.
Linear Regression & High-Veracity Datasets
Low-veracity datasets can adversely impact the accuracy of a regression model. Therefore, it is
necessary to remove any noise during the data acquisition and filtering analysis step of the Big
Data analysis lifecycle, and remove outliers using techniques such as those discussed in the
Outlier Detection section in Module 5.
Low-veracity datasets combined with high volume can pose performance penalties if the
regression model needs to be updated regularly because it will also be unnecessarily applied to
the noise, resulting in the waste of processing resources and time.
Linear Regression & High-Value Datasets

As the value characteristic is directly dependent on the veracity characteristic, the same
considerations apply to high-value datasets as to high-veracity datasets. For extracting
maximum value from high-veracity datasets, the underlying linear regression algorithm needs to
support execution in a distributed/parallel environment. This enables the regression models to
be updated swiftly, especially in cases of high-velocity datasets.
1. ______________________ is a statistical technique for predicting the values of a
dependent or response value based on the values of an independent or explanatory
variable. This technique is used to explore the data and understand the nature of the
relationships between variables.
2. A linear correlation between response and explanatory variables is represented through the
______________________ that passes as closely as possible through all points on a
scatter plot.
3. ______________________ is known as the estimator for the variance in the predicted
value.
4. For ______________________, histograms and scatter plots can be used to summarize the
explanatory and response variables to find the respective relevance of each explanatory
variable.
5. With values that vary between 0 and 1, the ______________________ is the percentage of
variation in the response variable that is predicted by the explanatory variable.
6. The ______________________ measures the accuracy of the predicted values of the
response variable to identify the difference between the estimated values and actual values
and the deviation of values from the regression line.
Optional Reading
For further discussion of this topic, see the Linear Regression Example on pages 55-68 of the
Doing Data Science textbook accompanying this module.
Classification: k-NN (k-Nearest Neighbors)
k-Nearest Neighbors (k-NN), also known as lazy learning and instance-based learning, is a
black-box classification technique in which an instance is classified based on its similarity to a
user-defined number (k) of examples (its nearest neighbors). No model is explicitly generated.
Rather, the examples are stored as-is and an instance is classified by first finding the closest k
examples in terms of distance, and then assigning the class held by the majority of those
examples.
Figure 4.51 - An example of k-NN.
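The classification procedure described above can be sketched as follows. This is a minimal illustration using Euclidean distance and a simple majority vote. The function names and the labeled examples are hypothetical.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Straight-line distance between two points with numerical features."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_classify(examples, instance, k):
    """Classify an instance by majority vote among its k nearest examples.

    examples is a list of (features, label) pairs that are stored as-is;
    no model is built in advance.
    """
    nearest = sorted(examples, key=lambda ex: euclidean(ex[0], instance))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical labeled examples with two numerical features each.
examples = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.5), "A"),
    ((6.0, 6.0), "B"), ((6.5, 7.0), "B"), ((7.0, 6.5), "B"),
]

print(knn_classify(examples, (2.0, 2.0), k=3))  # A
print(knn_classify(examples, (6.0, 6.5), k=3))  # B
```

All of the work happens at classification time, when the distance from the unseen instance to every stored example is calculated.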
k-NN is able to classify instances when interactions and relationships that are difficult to explain
and hard to understand exist between a number of features and the target classes. k-NN works
well where the same-class instances share mostly similar feature values and class boundaries
are easily identifiable.
Because of the potentially large number of distance calculations between the examples and the
unseen instance, k-NN is compute-intensive during the classification stage; therefore it is
generally slow and requires large amounts of memory. These issues can be addressed by
running this algorithm in a distributed/parallel environment.
k-NN generally uses Euclidean distance for calculating the closeness between the examples
and unclassified instances. As the distance calculation can be overshadowed by features based
on larger units, for example mileage vs. number of doors, feature values are normalized
through min-max normalization or z-score standardization.
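Both rescaling approaches can be sketched with the Python standard library. The feature values below are hypothetical, and the sketch assumes that each feature takes at least two distinct values (otherwise the divisions would fail).

```python
import statistics


def min_max(values):
    """Rescale values to the range 0 to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def z_scores(values):
    """Rescale values to have mean 0 and standard deviation 1."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / sd for v in values]


# Hypothetical car features measured in very different units.
mileage = [20000.0, 60000.0, 100000.0]
doors = [2.0, 4.0, 5.0]

print(min_max(mileage))  # [0.0, 0.5, 1.0]
print(min_max(doors))
print(z_scores(mileage))
print(z_scores(doors))
```

After rescaling, a difference in mileage no longer outweighs a difference in the number of doors simply because mileage is recorded in larger units.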
Nominal features must be converted into their numerical counterparts by creating new binary
features (0 and 1) for each category of the original nominal feature. The nominal values can
also be compared as-is, in which case the numerical difference is 0 if the values are the same
and 1 if they are not.
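Both ways of handling a nominal feature can be sketched briefly. The function names and the color categories are hypothetical.

```python
def one_hot(value, categories):
    """Create one binary feature (0 or 1) per category."""
    return [1 if value == category else 0 for category in categories]


def nominal_difference(a, b):
    """Compare nominal values as-is: 0 if they match, 1 if they differ."""
    return 0 if a == b else 1


colors = ["red", "green", "blue"]

print(one_hot("green", colors))            # [0, 1, 0]
print(nominal_difference("red", "red"))    # 0
print(nominal_difference("red", "blue"))   # 1
```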
Selecting the Value of k

Within k-NN, k is the number of neighbors. Choosing k requires testing the algorithm with
different values of k, generally between three and ten, and then choosing the one with the
lowest error rate. The choice of k also requires attaining a bias-variance balance, as increasing
k reduces variance and increases bias. Bias refers to the error caused by learning an incorrect
model, and variance refers to the error caused by variation in the input data. Choosing a
smaller k also means that outliers can affect the classification task.
Choosing the correct value of k depends on the nature of the classification task. For example, if
predicting whether a patient is suffering from a certain disease, it would make sense to err on
the side of caution by choosing a value of k that results in more false positives than false
negatives.
Serious consequences can result from not diagnosing a patient who is actually suffering.
However, when choosing someone to be an astronaut, it may make sense to tune k to get more
false negatives, as falsely dropping someone who is nearly perfect is not going to result in
serious consequences.
Figure 4.52 illustrates the impact of selecting a smaller and a larger k. For k = 1, the closeness
to the outlier, represented by the diamond, results in assigning the class of the outlier example.
When k = 3, classification is unaffected by the outlier, as the majority of the neighboring
examples belong to a normal set of values.
Figure 4.52 - An example with smaller and larger values of k.
Taking the square root of the number of examples is a common heuristic for selecting a starting
value for k, although tests must still be performed to validate accuracy.
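Testing several values of k can be sketched as follows. The code measures the error rate on a small set of held-out instances for each candidate k and also computes the square root heuristic. The data, including a single outlier labeled B inside the region of class A, and all names are hypothetical.

```python
import math
from collections import Counter


def knn_classify(examples, instance, k):
    """Majority vote among the k examples closest to the instance."""
    nearest = sorted(examples, key=lambda ex: math.dist(ex[0], instance))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def error_rate(training, validation, k):
    """Fraction of validation instances that are classified incorrectly."""
    wrong = sum(1 for features, label in validation
                if knn_classify(training, features, k) != label)
    return wrong / len(validation)


training = [
    ((1.0, 1.0), "A"), ((1.0, 2.0), "A"), ((2.0, 1.0), "A"),
    ((2.0, 2.0), "A"), ((1.5, 1.5), "A"),
    ((6.0, 6.0), "B"), ((6.0, 7.0), "B"), ((7.0, 6.0), "B"),
    ((7.0, 7.0), "B"), ((6.5, 6.5), "B"),
    ((2.5, 2.5), "B"),  # outlier sitting next to the class A examples
]
validation = [
    ((2.4, 2.4), "A"), ((1.2, 1.2), "A"),
    ((6.2, 6.2), "B"), ((6.8, 6.8), "B"),
]

candidates = [1, 3, 5]
error_rates = {k: error_rate(training, validation, k) for k in candidates}
best_k = min(candidates, key=lambda k: error_rates[k])  # first k with lowest error
heuristic_k = round(math.sqrt(len(training)))

print(error_rates)  # {1: 0.25, 3: 0.0, 5: 0.0}
print(best_k, heuristic_k)  # 3 3
```

With k = 1 the instance near the outlier is misclassified, while k = 3 and k = 5 outvote the outlier. Held-out instances are used so that the error rate reflects examples the algorithm has not already stored.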
Within Big Data environments, the impact of choosing a non-optimal value for k decreases, as
there should be a greater number of examples, at close proximity, which will represent the
majority. Even instances belonging to rare classes can be successfully classified due to a larger
representation in the examples.
Optional Reading
For further discussion of this topic, see the k-NN Example on pages 71-81 of the Doing Data
Science textbook.
NOTE
More advanced classification algorithms and the impact of the
five Vs on classification will be discussed in Module 5.
Clustering: k-means
Clustering
Clustering is an unsupervised machine learning technique used to create groups of items where
each group contains similar items but the groups themselves are dissimilar to each other. It is
also known as unsupervised classification, as unlabeled instances are classified according to
the properties of the homogeneous groups.
As an EDA tool to understand the data, clustering can identify any natural grouping within data
or interesting subsets of data for further analysis. Results can be used to pre-process data for
semi-supervised learning, where clusters found in unlabeled training data are given class labels
and then used for classification, or to select a subset of important features.
While clustering automatically creates homogeneous groups, the machine-generated labels
often carry no real meaning. Humans must analyze the properties of each group and create
meaningful labels as per the nature of the data analysis task, the business domain, or the
individuals to which the data mining results must be communicated.
k-means
k-means is a common clustering algorithm that uses distance as a measure for creating clusters
of homogeneous items. k is a user-defined number that denotes the number of clusters to be
created, and means refers to the center point of each cluster, or centroid.
The centroid forms the basis for cluster creation around which other similar items that make up
a cluster are located. It is determined from the mean of all point locations that represent the
cluster items in a multidimensional space whose number of dimensions depends on the number
of features of the items to cluster. The value of k must be set within 1 ≤ k ≤ n, where n is the
total number of items in the dataset.
k-means is similar to k-NN in that it requires the user to specify the k value and generally uses
the same Euclidean distance calculation for determining closeness, in this case between the
centroid and the items (represented as points). Operating in an iterative fashion, k-means
begins with less homogeneous groups of instances and modifies each group during each
iteration to attain increased homogeneity within the group. The process continues until
maximum homogeneity within the groups and maximum heterogeneity between the groups is
achieved. The k-means operation is divided into two stages, assign and update, as defined in
the upcoming pages.
The Assign Stage
Based on the user-specified k value, the algorithm randomly selects k points as cluster center
points that represent the actual instances in a multidimensional feature space and have been
plotted according to the feature values. Each dimension represents a single feature. Instead of
choosing points that actually exist, new points can also be created and chosen as cluster
centers.
Another approach to beginning the assign stage is to arbitrarily allocate instances to k clusters
without selecting any initial center points. Once the initial center points are chosen, each
instance is associated with the initial cluster center point that is closest to it. This closeness is
determined by calculating the distance, generally using the Euclidean distance formula,
between the instance (represented by the point) and the initial center point.
In order to calculate distance, all feature values must be numerical in nature and further
normalized by adjusting the scale of values, such as rescaling 10,000 to 10 if other feature
values lie between one and ten.
Values can be standardized by converting them to z-scores so that features whose differences
result in large values, such as income, do not dominate smaller-valued features, such as age.
Alternatively, values can be discretized for meaningful results. The resulting clusters can be
graphically viewed using a Voronoi diagram whose lines mark the cluster boundaries. Between
two clusters, each line in the Voronoi diagram depicts the set of points that are equidistant from
both center points.
For example, the assign stage can result in a graph of clusters, as shown in Figure 4.53. With
k = 3, three randomly selected center points represented by stars are initially selected, with
instances allocated to these center points based on their proximity, after calculating their
Euclidean distances.
Figure 4.53 - Stars represent three randomly selected center points from which the Euclidean
distances to the instances are calculated.
The Update Stage

In the update stage, the true center point or centroid of each cluster is determined by calculating
the mean of all points in a cluster. This generally results in the relocation of the centroid and the
corresponding shift of the cluster boundary.
Figure 4.54 - The shifting of the center points and the corresponding cluster boundary shift as a result of the
determination of the cluster centroids.
The Reassignment Stage

As a result of changes in cluster boundaries, the assign stage requires a rerun as some points
may now be closer to a different centroid than the initial assignment. The update stage also
requires a rerun to calculate the new centroids due to the reassignment of instances to different
clusters. This process continues until no further reassignments are performed.
Figure 4.55 - The reassignment of the highlighted instance (the red circle) from Cluster C to Cluster B as a result of
the shifting of the cluster boundaries.
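The assign, update, and reassignment stages can be sketched as a short loop. This is a minimal illustration rather than a production implementation. The initial center points are passed in explicitly so that the run is repeatable (a random selection would normally be used), and keeping the old center point of an empty cluster is an assumption made for simplicity. All names and data are hypothetical.

```python
import math


def assign(points, centers):
    """Assign stage: index of the closest center point for each point."""
    return [min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points]


def update(points, labels, centers):
    """Update stage: move each center to the mean of its assigned points."""
    new_centers = []
    for c, old in enumerate(centers):
        members = [p for p, label in zip(points, labels) if label == c]
        if not members:
            new_centers.append(old)  # assumption: an empty cluster keeps its center
            continue
        new_centers.append(tuple(sum(m[d] for m in members) / len(members)
                                 for d in range(len(old))))
    return new_centers


def k_means(points, initial_centers, max_iterations=100):
    """Repeat update and assign until no instance is reassigned."""
    centers = list(initial_centers)
    labels = assign(points, centers)
    for _ in range(max_iterations):
        centers = update(points, labels, centers)
        new_labels = assign(points, centers)
        if new_labels == labels:  # no reassignments, so the clusters are stable
            break
        labels = new_labels
    return labels, centers


# Hypothetical instances forming two visible groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
          (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]

# Both initial center points deliberately start inside the first group.
labels, centers = k_means(points, initial_centers=[(1.0, 1.0), (1.0, 2.0)])
print(labels)   # [0, 0, 0, 1, 1, 1]
print(centers)
```

Even with a poor starting position, the second center point is pulled toward the second group during the update stage, after which the instance at (1.0, 2.0) is reassigned to the first cluster.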
Selecting the Value of k

The centroid values can be used to understand the nature of each cluster, as each centroid
provides the mean value of each feature for the cluster, which further helps in determining
meaningful labels for each cluster. The meaningfulness of clusters generated by this algorithm
can vary, depending on the initial seed values randomly chosen by the algorithm. Therefore, it
is important to test the algorithm with different values of k in order to assess the stability of the
generated clusters. While increasing the value of k creates more homogeneous clusters,
surpassing a certain number may introduce model overfitting.
Obtaining information about the dataset or business constraints is an approach for selecting the
correct value of k, such as the known types of customers. In the absence of any information
about the dataset or business constraints, dividing the total number of instances by two and
taking the square root of the result is one way to determine the value of k.
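The rule of thumb described above amounts to a one-line calculation. The function name is hypothetical, and rounding to the nearest whole number with a minimum of 1 is an assumption, since k must be a positive integer.

```python
import math


def rule_of_thumb_k(n_instances):
    """Square root of half the number of instances, rounded, at least 1."""
    return max(1, round(math.sqrt(n_instances / 2)))


print(rule_of_thumb_k(200))   # 10
print(rule_of_thumb_k(5000))  # 50
```

The result is only a starting point and should still be checked against the stability and meaningfulness of the generated clusters.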
Missing Feature Values

Retaining instances with missing feature values is important, as such instances may indicate
special groups. Also, removing instances reduces their number, which can impact the
meaningfulness of the generated clusters.
Dummy values can be inserted for categorical features with missing feature values, as such
instances may represent a distinct cluster. For example, the code DF can be used to represent
the dummy value of default color for the color feature.
Numerical features can be assigned values using a technique known as imputation, where
either one of the averages (mean, median, or mode) or a combination of other features can be
used to determine the missing feature value. For example, mileage can be determined based on
the age of the car.
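Both approaches can be sketched briefly, assuming that missing values are represented by None. The function names and the data are hypothetical. The code DF follows the default color example above.

```python
import statistics


def impute_numeric(values, strategy="mean"):
    """Replace missing numerical values with an average of the known values."""
    known = [v for v in values if v is not None]
    if strategy == "mean":
        fill = statistics.mean(known)
    elif strategy == "median":
        fill = statistics.median(known)
    elif strategy == "mode":
        fill = statistics.mode(known)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]


def fill_categorical(values, dummy="DF"):
    """Replace missing categorical values with a dummy code."""
    return [dummy if v is None else v for v in values]


mileage = [30000, None, 50000, 70000]
colors = ["red", None, "blue"]

print(impute_numeric(mileage))            # [30000, 50000, 50000, 70000]
print(impute_numeric(mileage, "median"))  # [30000, 50000, 50000, 70000]
print(fill_categorical(colors))           # ['red', 'DF', 'blue']
```

Imputing from other features, such as estimating mileage from the age of the car, would require a separate model and is not shown here.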
Cluster Distortion
A cluster's degree of homogeneity can be measured by calculating the cluster's distortion. A
cluster's distortion can be calculated by taking the sum of squared distances between all of its
points and its centroid. The lower the distortion, the higher the homogeneity, and vice-versa.
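The distortion calculation can be sketched as follows, comparing a tightly packed cluster with a widely spread one. The function names and the points are hypothetical.

```python
def centroid_of(points):
    """Mean location of all points in a cluster."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def distortion(points, centroid):
    """Sum of squared distances between the points and the centroid."""
    return sum(sum((x - c) ** 2 for x, c in zip(p, centroid)) for p in points)


# Two hypothetical clusters with the same shape but different spread.
tight = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0)]
loose = [(0.0, 0.0), (0.0, 4.0), (4.0, 0.0)]

tight_distortion = distortion(tight, centroid_of(tight))
loose_distortion = distortion(loose, centroid_of(loose))
print(tight_distortion, loose_distortion)  # roughly 1.33 and 21.33
```

The tightly packed cluster has the lower distortion and is therefore the more homogeneous of the two.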
Optional Reading
For a more in-depth discussion of this topic, see the k-Means example on pages 81-84 of the
Doing Data Science textbook that accompanies this module.
Clustering & Other Techniques

Classification can be used to develop an understanding of the auto-generated clusters and
determine how one cluster is different from the other clusters. For classification, all instances
belonging to each cluster can be labeled with an arbitrary or user-assigned class name. A
classification algorithm, such as classification rules, can then be run to understand the
characteristics of a particular cluster.
Clustering & High-Volume Datasets

Choosing a large value of k for high-volume tall datasets can introduce performance issues, as
both the assign and update stages must be executed for each additional cluster. The majority of
the performance penalty is incurred during the assign stage, when the distance between the
centroid and each instance is calculated.
However, a high-volume wide dataset, even with a small k value, can incur a performance
penalty, especially during the assign stage, as the distance calculation must take into account a
large number of features. It is important for the underlying implementation of the clustering
algorithm to support distributed/parallel execution, for efficient and rapid clustering of high-
volume datasets.
Clustering & High-Velocity Datasets

Clustering is generally an offline analysis technique, as it creates clusters that need further
interpretation or is performed as part of EDA. As a result, high-velocity data is generally added
to existing datasets for clustering purposes. However, some implementations of k-means are
based on incremental updates, where clusters do not need to be recomputed from scratch as
new instances are added.
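One way such an incremental update could work is to adjust the centroid of the closest cluster each time a new instance arrives, using only the current centroid and the number of instances it already represents. This is a sketch of the general idea and not a description of any particular implementation. The function name and the points are hypothetical.

```python
def add_instance(centroid, count, instance):
    """Fold one new instance into a centroid without revisiting old members."""
    new_count = count + 1
    # Running mean: shift the centroid toward the new instance by 1/new_count.
    new_centroid = tuple(c + (x - c) / new_count
                         for c, x in zip(centroid, instance))
    return new_centroid, new_count


# The cluster starts with a single instance and receives three more over time.
centroid, count = (1.0, 1.0), 1
for point in [(2.0, 1.0), (1.0, 2.0), (2.0, 2.0)]:
    centroid, count = add_instance(centroid, count, point)

print(centroid, count)  # approximately (1.5, 1.5) and 4
```

The result matches the mean of all four instances, even though the earlier instances were never revisited.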
Clustering & High-Variety Datasets

To make sure that only similar instances are grouped together, it is important to determine the
true nature of an instance, which requires gathering as much feature data as possible.
Clustering can require combining a variety of datasets for extracting relevant features to build a
large feature vector (an ordered set of features) that creates more homogeneous clusters. Care
should be taken to only include relevant features and keep the count of features to an optimum
level, as adding irrelevant or excessive features can result in performance issues.
The resulting wide datasets can impose performance issues, as each additional feature adds a
new dimension. For example, when using the k-means algorithm, the Euclidean distance and
centroid calculations can become highly dimensional, requiring increased memory and
processing resources.
Clustering & High-Veracity Datasets

To create highly homogeneous clusters of data with a reduced amount of distortion, it is
important to ensure that the dataset is of high quality and free of any noise. At the same time, it
is necessary to not remove instances that may seemingly represent noise but in reality are only
missing a few feature values. Such instances may represent distinctive clusters that require
discovery. Also, removal of too many noisy instances may inadvertently create small clusters
that are not meaningful.
Clustering & High-Value Datasets

Low-value datasets can negatively impact the success of a clustering task by producing clusters
of data from which no actionable information can be gleaned. In some cases, obtaining invalid
clusters can lead to making false conclusions. Data should be a representative mix of the
data-generating process, as performing clustering on a dataset containing data pertaining to
only specific circumstances or operating conditions will result in invalid clusters.
Value also depends on the ability to perform clustering as soon as datasets become available
and to complete the clustering process as quickly as possible, which is determined by the
underlying Big Data platform.
Quick mashing-up of a variety of datasets requires a workflow engine mechanism that can
automatically perform various data blending activities in collaboration with the data transfer
engine mechanism(s). However, clustering algorithms based on incremental update
implementations can help to cluster new, additional data within a reduced amount of time to
obtain value faster out of such datasets. The overall value of a clustering effort still requires the
correct interpretation of automatically generated clusters, for which domain expertise is
considered a necessary skill.
Exercise 4.7: Name the Algorithm
1. John, who works for an airline company as a data scientist, is analyzing 5 TB of flight data
in order to predict fuel consumption based on a number of potentially relevant factors such
as altitude, air turbulence, air temperature, air pressure, how often the plane changes
altitude, use of electrical equipment inside the plane, number of engines, weight of reserve
fuel, and thrust change during landing. The underlying Big Data platform runs a number of
other compute-intensive models. Which techniques or algorithms can be applied to develop
an efficient model for predicting fuel consumption based on only relevant factors?
______________________
2. David is working on character recognition software that can match handwritten characters to
a known set of characters belonging to different languages. He has successfully tagged a
number of characters obtained from a variety of handwritten samples from multiple
individuals who are proficient in these languages. Which algorithm can be used to develop
such a model?
______________________
3. Alice, who works for an insurance company, has been asked to analyze a dataset of 8 TB
to determine whether policy holders can be divided into different groups according to
similarities in their profiles. No groups currently exist for reference. Which algorithm should
Alice use to divide the policy holders into a meaningful set of groups?
______________________
4. Robin, who works for the national astronomy association, is tasked with identifying planets
from a very large number of celestial objects. He has already identified a meaningful number
of planets. Which algorithm can Robin use to develop a model for this task?
______________________
5. Elliot is developing a model that can estimate the completion time of a construction project.
He is planning to take into account a number of factors that may impact project completion
time, such as design changes, distance of construction site from the nearest major road,
number of contractors working, skill level of the workforce, and number of accidents. Which
algorithm or technique can Elliot use to develop such a model?
______________________
Notes
____________________________________________________________________________
Exercise Answers
Exercise 4.1 Answers

1. A company collects customer comments that undergo text analytics and sentiment analysis
in order to identify the customers who may be at risk of defecting to a competitor. Which
category of Big Data datasets best characterizes this process?
High-Variety Datasets
2. Thousands of stock trading transactions are arriving very quickly as a result of being
concurrently generated by traders at the New York Stock Exchange. Which Big Data dataset
category is best-suited for describing the resulting dataset?
High-Velocity Datasets
3. An application that collects comments from a Web site is run to filter user-created data for
bias and significance. Which Big Data dataset category best describes such removal of
noise?
High-Veracity Datasets
4. High data veracity, velocity, and variety contribute to measuring which Big Data dataset
category?
High-Value Datasets
5. A large banking institution collects a month's worth of daily financial transactions from all of
its branches across the country. What is the appropriate Big Data dataset category for
describing the resulting dataset?
High-Volume Datasets

Exercise 4.2 Answers

1. Discrete variables can take only specific values from a defined set of values.
2. Continuous variables can take any value and are often obtained by measurement.
3. Nominal variables have values that represent a category that can be counted but not
measured or ordered.
4. Ordinal variables take numerical values that can be discrete or continuous and counted and
ordered, but not measured.
5. Binary variables consist of only two categories where the categories are generally the
opposite of each other.
6. Quantitative variables are number-based and can be counted or measured, whereas
qualitative variables can be counted but not measured.
7. Independent variables have values that do not depend on any other variable, but rather
influence other variables. These other variables are known as dependent variables.
8. Random variables can assume a range of values based on probability.

Exercise 4.3 Answers

1. Mean
2. Range
3. Interquartile Range (IQR)
4. Percentiles
5. Variance or Standard Deviation
6. Z-score
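The measures named in these answers can be computed with Python's standard statistics module, as in the minimal sketch below. The sample values are hypothetical. Quartile and percentile conventions differ between tools, so the "inclusive" method used here is an assumption rather than the workbook's prescribed method.

```python
import statistics

data = [2, 4, 4, 5, 7, 9, 11]  # hypothetical sample, already sorted for readability

mean = statistics.mean(data)
value_range = max(data) - min(data)

# Quartiles via the "inclusive" method (an assumed convention; others exist).
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Percentiles: 99 cut points, so index 89 holds the 90th percentile.
percentiles = statistics.quantiles(data, n=100, method="inclusive")
p90 = percentiles[89]

variance = statistics.variance(data)  # sample variance (divides by n - 1)
std_dev = statistics.stdev(data)      # sample standard deviation

# Z-score: how many standard deviations each value lies from the mean.
z_scores = [(x - mean) / std_dev for x in data]

print(mean, value_range, iqr, p90, variance, std_dev)
```

Here statistics.variance and statistics.stdev treat the data as a sample. For a full population, statistics.pvariance and statistics.pstdev would be the corresponding calls.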

Exercise 4.4 Answers

1. Confidence Interval
2. Central Limit Theorem
3. Chebyshev's Inequality Rule
4. Empirical Rule
Frequency Distribution B
Normal Distribution C
Probability Distribution D
Sampling Distribution A
Exercise 4.5 Answers

1. In a bar graph, each discrete value is represented as a category on the x-axis, while the
y-axis is used to display the count of each category. The actual count is represented using a
rectangle whose height shows the category count.
2. Line graphs are often used to analyze data over time or trends, and should not be used to
display nominal data like product categories. However, ordinal data related to multiple
categories can be shown.
3. For a histogram, a frequency table is first collected with values divided into intervals. The
intervals must be created without gaps between intervals covering all values of a continuous
variable.
4. A scatter plot can be used to view the association between two variables and to determine
whether a pattern exists between them. It also offers a graphical means of spotting outliers.
5. Like a histogram, a stem and leaf plot (stemplot) is a graphical technique for analyzing a
distribution that is well-suited for viewing small datasets or samples.
6. While not strictly a graphical technique, cross-tabulation (cross-tabs) is a two-way
frequency table used for viewing relationships between two variables.
7. Data can be graphed visually for further analysis using the following methods: box and
whisker plot for comparing two or more distributions and visualizing a five-number
summary, quantile-quantile (q-q) plot for comparing exactly two distributions, and lattice
plot for managing multiple sub-plots for bivariate and multivariate analyses.

Exercise 4.6 Answers

1. Linear regression is a statistical technique for predicting the values of a dependent or
response variable based on the values of an independent or explanatory variable. This
technique is used to explore the data and understand the nature of the relationships
between variables.
2. A linear correlation between response and explanatory variables is represented through the
line of best fit (regression line) that passes as closely as possible through all points on a
scatter plot.
3. Mean squared error is known as the estimator for the variance in the predicted value.
4. For multiple linear regression, histograms and scatter plots can be used to summarize the
explanatory and response variables to find the respective relevance of each explanatory
variable.
5. With values that vary between 0 and 1, the coefficient of determination (R²) is the
percentage of variation in the response variable that is predicted by the explanatory
variable.
6. The standard error of estimate measures the accuracy of the predicted values in the
response variable to identify the difference between the estimated values and actual values
and the deviation of values from the regression line.
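The quantities named in these answers can be illustrated with a small worked example. The sketch below fits a least-squares regression line to five hypothetical (x, y) pairs and then derives the mean squared error, the coefficient of determination, and the standard error of estimate. Dividing the residual sum of squares by n - 2 is one common convention for simple linear regression and is assumed here. The data values are invented for illustration.

```python
from math import sqrt

# Hypothetical explanatory (x) and response (y) values.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(x)

mean_x = sum(x) / n
mean_y = sum(y) / n

# Least-squares slope and intercept of the line of best fit.
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

predicted = [intercept + slope * xi for xi in x]

# Residual (unexplained) and total variation in the response variable.
sse = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))
sst = sum((yi - mean_y) ** 2 for yi in y)

r_squared = 1 - sse / sst   # share of variation explained by x
mse = sse / (n - 2)         # estimate of the error variance (assumed n - 2 divisor)
std_error = sqrt(mse)       # standard error of estimate

print(slope, intercept, r_squared, mse, std_error)
```

For these values the fitted line is y = 2.2 + 0.6x, and the line accounts for 60 percent of the variation in y.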
Exercise 4.7 Answers

1. Correlation and Multiple Linear Regression
2. k-NN
3. k-means
4. k-NN
5. Multiple Linear Regression
Exam B90.04
The course you just completed corresponds to
Exam B90.04, which is an official exam that is part of
the Big Data Science Certified Professional (BDSCP)
program.
This exam can be taken at Pearson VUE testing centers worldwide or via Pearson VUE Online
Proctoring, which enables you to take exams from your home or office workstation with a live
proctor. For more information, visit:
www.bigdatascienceschool.com/exams/
www.pearsonvue.com/arcitura/
www.pearsonvue.com/arcitura/op/ (Online Proctoring)
Module 4 Self-Study Kit

An official BDSCP Self-Study Kit is available for this module,
providing additional study aids and resources, including a
separate self-study guide, Audio Tutor CDs and flash cards.
Note that versions of this self-study kit are available with and
without a Pearson VUE exam voucher for Exam B90.04.
For more information, visit:
www.bigdataselfstudy.com
Contact Information and Resources
AITCP Community
Join the growing international Arcitura IT Certified Professional (AITCP) community by
connecting on official social media platforms: LinkedIn, Twitter, Facebook, and YouTube.
Social media and community links are accessible at:
www.arcitura.com/community
www.servicetechbooks.com/community
General Program Information

For general information about the BDSCP program and Certification requirements, visit:
www.bigdatascienceschool.com and www.bigdatascienceschool.com/matrix/
General Information about Course Modules and Self-Study Kits

For general information about BDSCP Course Modules and Self-Study Kits, visit:
www.bigdatascienceschool.com and www.bigdataselfstudy.com
Pearson VUE Exam Inquiries

For general information about taking BDSCP Exams at Pearson VUE testing centers or via
Pearson VUE Online Proctoring, visit:
www.pearsonvue.com/arcitura/
www.pearsonvue.com/arcitura/op/ (Online Proctoring)
Public Instructor-Led Workshop Schedule

For the latest schedule of instructor-led BDSCP workshops open for public registration, visit:
www.bigdatascienceschool.com/workshops
Private Instructor-Led Workshops
Certified trainers can deliver workshops on-site at your location with optional on-site proctored
exams. To learn about options and pricing, contact:
[email protected]
or
1-800-579-6582
Becoming a Certified Trainer

If you are interested in attaining the Certified Trainer status for this or any other Arcitura courses
or programs, learn more by visiting:
www.arcitura.com/trainerdevelopment/
General BDSCP Inquiries

For any other questions relating to this Course or any Module, Exam, or Certification that is part
of the BDSCP program, contact:
[email protected]
or
1-800-579-6582
Automatic Notification
To be automatically notified of changes or updates to the BDSCP program and related resource
sites, send a blank message to:
[email protected]
Feedback and Comments

Help us improve this course. Send your feedback or comments to:
[email protected]
