Big Data Module 4 - Print-Ready Workbook (Letter)
INTRODUCTION.............................................................................................................................. 7
MIND MAP POSTER......................................................................................................................... 8
BIG DATA MODULE 4 OFFICIAL SUPPLEMENT: ANALYSIS FORMULAS ................................................ 9
ANALYSIS TECHNIQUES COVERAGE ............................................................................................... 10
OVERVIEW ................................................................................................................................... 10
PART I: BIG DATA SCIENCE CONCEPTS & ANALYSIS CHALLENGES................................. 11
TERMS AND CONCEPTS............................................................................................................. 12
DATA SCIENCE ............................................................................................................................. 12
MODEL......................................................................................................................................... 12
EXPLORATORY DATA ANALYSIS (EDA) .......................................................................................... 13
CONFIRMATORY DATA ANALYSIS (CDA) ........................................................................................ 13
DATA PRODUCT............................................................................................................................ 13
STATISTICS .................................................................................................................................. 13
DESCRIPTIVE STATISTICS.............................................................................................................. 14
INFERENTIAL STATISTICS .............................................................................................................. 15
MACHINE LEARNING ..................................................................................................................... 15
DATA MUNGING ............................................................................................................................ 16
BIG DATA ANALYSIS LIFECYCLE .................................................................................................... 16
READING ...................................................................................................................................... 16
COMMON BIG DATA DATASET CATEGORIES ......................................................................... 20
COMMON BIG DATA DATASET CATEGORIES ................................................................................... 21
HIGH-VOLUME DATASETS ............................................................................................................. 21
HIGH-VELOCITY DATASETS ........................................................................................................... 22
HIGH-VARIETY DATASETS ............................................................................................................. 22
HIGH-VERACITY DATASETS ........................................................................................................... 23
HIGH-VALUE DATASETS ................................................................................................................ 23
EXERCISE 4.1: MATCH TERMS TO STATEMENTS ............................................................................. 24
PART II: ELEMENTS OF BIG DATA ANALYSIS......................................................................... 28
EXPLORATORY DATA ANALYSIS (EDA) .................................................................................. 30
ATTRIBUTES ................................................................................................................................. 30
Fundamental Big Data Analysis & Science (Copyright Arcitura Education Inc. www.arcitura.com) v2.0 1
EDA ............................................................................................................................................ 30
OPTIONAL READING ...................................................................................................................... 31
DATA SUMMARY TYPES ................................................................................................................ 31
NUMERICAL SUMMARIES ............................................................................................................... 31
NUMERICAL SUMMARIES: MEASURES OF CENTRAL TENDENCY ....................................................... 31
NUMERICAL SUMMARIES: MEASURES OF VARIATION OR DISPERSION .............................................. 31
NUMERICAL SUMMARIES: MEASURES OF ASSOCIATION .................................................................. 32
GRAPHICAL SUMMARIES ............................................................................................................... 32
QUANTITATIVE ANALYSIS .............................................................................................................. 33
UNIVARIATE ANALYSIS .................................................................................................................. 33
BIVARIATE ANALYSIS .................................................................................................................... 33
MULTIVARIATE ANALYSIS .............................................................................................................. 33
STATISTICS .................................................................................................................................. 37
VARIABLE TYPES .......................................................................................................................... 38
EXERCISE 4.2: FILL IN THE BLANKS ............................................................................................... 39
POPULATION & SAMPLE ................................................................................................................ 40
STATISTICAL INFERENCE ............................................................................................................... 40
MEASURES OF CENTRAL TENDENCY .............................................................................................. 41
MEAN .......................................................................................................................................... 41
MEDIAN........................................................................................................................................ 41
MODE .......................................................................................................................................... 41
ROBUSTNESS ............................................................................................................................... 42
MEASURES OF VARIATION OR DISPERSION .................................................................................... 42
RANGE......................................................................................................................................... 42
MEAN, MEDIAN, MODE & RANGE ................................................................................................... 43
QUANTILES .................................................................................................................................. 43
QUINTILES.................................................................................................................................... 44
QUARTILES .................................................................................................................................. 44
INTERQUARTILE RANGE & OUTLIERS ............................................................................................. 44
PERCENTILES ............................................................................................................................... 45
BIAS ............................................................................................................................................ 45
DISTRIBUTION .............................................................................................................................. 46
VARIANCE .................................................................................................................................... 46
STANDARD DEVIATION .................................................................................................................. 47
VARIANCE & STANDARD DEVIATION............................................................................................... 47
Z-SCORE ..................................................................................................................................... 47
EXERCISE 4.3: NAME THE MEASURE ............................................................................................. 48
DISTRIBUTIONS............................................................................................................................. 50
FREQUENCY DISTRIBUTION ........................................................................................................... 50
PROBABILITY ................................................................................................................................ 50
PROBABILITY DISTRIBUTION .......................................................................................................... 51
READING ...................................................................................................................................... 51
SAMPLING DISTRIBUTION .............................................................................................................. 51
STANDARD ERROR ....................................................................................................................... 51
STATISTICAL ESTIMATORS ............................................................................................................ 52
CONFIDENCE INTERVAL................................................................................................................. 52
SKEWNESS................................................................................................................................... 53
DISCRETE & CONTINUOUS PROBABILITY DISTRIBUTIONS ................................................................ 54
DISTRIBUTION FITTING .................................................................................................................. 55
OPTIONAL READING ...................................................................................................................... 55
NORMAL DISTRIBUTION ................................................................................................................. 56
STANDARD NORMAL DISTRIBUTION................................................................................................ 56
CENTRAL LIMIT THEOREM ............................................................................................................. 57
MEASURES OF ASSOCIATION......................................................................................................... 58
CORRELATION .............................................................................................................................. 58
CORRELATION & HIGH-VOLUME DATASETS.................................................................................... 59
CORRELATION & HIGH-VELOCITY DATASETS.................................................................................. 60
CORRELATION & HIGH-VARIETY DATASETS ................................................................................... 60
CORRELATION & HIGH-VERACITY DATASETS ................................................................................. 60
CORRELATION & HIGH-VALUE DATASETS ...................................................................................... 60
OPTIONAL READING ...................................................................................................................... 61
COVARIANCE ................................................................................................................................ 61
ESTIMATES OF POPULAR DISTRIBUTION ......................................................................................... 61
CHEBYSHEV'S INEQUALITY RULE ................................................................................................... 61
EMPIRICAL RULE .......................................................................................................................... 62
EXERCISE 4.4: NAMING AND MATCHING ......................................................................................... 63
CONFIRMATORY DATA ANALYSIS (CDA) ................................................................................ 68
HYPOTHESIS TESTING .................................................................................................................. 69
NULL HYPOTHESIS ....................................................................................................................... 69
ALTERNATIVE HYPOTHESIS ........................................................................................................... 69
STATISTICAL SIGNIFICANCE ........................................................................................................... 70
P-VALUE ...................................................................................................................................... 70
CRITICAL REGION, ONE-TAILED & TWO-TAILED TESTS ................................................................... 70
TYPE I ERROR, TYPE II ERROR & THE POWER OF HYPOTHESIS TEST .............................................. 70
VISUALIZATION............................................................................................................................ 74
VISUALIZATION FOR EDA & CDA .................................................................................................. 75
BAR GRAPH ................................................................................................................................. 75
LINE GRAPH ................................................................................................................................. 75
HISTOGRAM ................................................................................................................................. 76
FREQUENCY POLYGONS ............................................................................................................... 77
SCATTER PLOT............................................................................................................................. 78
STEM & LEAF PLOT ...................................................................................................................... 79
CROSS-TABULATION ..................................................................................................................... 80
BOX & WHISKER PLOT .................................................................................................................. 80
QUANTILE-QUANTILE PLOT ........................................................................................................... 82
LATTICE PLOT .............................................................................................................................. 83
EXERCISE 4.5: FILL IN THE BLANKS ............................................................................................... 84
PART III: FUNDAMENTAL BIG DATA ANALYSIS TECHNIQUES ............................................. 88
READING ...................................................................................................................................... 89
PREDICTION: LINEAR REGRESSION ........................................................................................ 90
MULTIPLE LINEAR REGRESSION .................................................................................................... 91
MEAN SQUARED ERROR ............................................................................................................... 92
ERROR TERM & RESIDUALS .......................................................................................................... 92
COEFFICIENT OF DETERMINATION R² ............................................................................................. 93
STANDARD ERROR OF ESTIMATE................................................................................................... 93
LINEAR REGRESSION & OTHER TECHNIQUES ................................................................................. 94
LINEAR REGRESSION & HIGH-VOLUME DATASETS.......................................................................... 94
LINEAR REGRESSION & HIGH-VELOCITY DATASETS ....................................................................... 94
LINEAR REGRESSION & HIGH-VARIETY DATASETS ......................................................................... 94
LINEAR REGRESSION & HIGH-VERACITY DATASETS ....................................................................... 95
LINEAR REGRESSION & HIGH-VALUE DATASETS ............................................................................ 95
EXERCISE 4.6: FILL IN THE BLANKS ............................................................................................... 96
OPTIONAL READING ...................................................................................................................... 96
CLASSIFICATION: K-NN (K-NEAREST NEIGHBORS) ............................................................ 100
SELECTING THE VALUE OF K........................................................................................................ 101
OPTIONAL READING .................................................................................................................... 102
CLUSTERING: K-MEANS........................................................................................................... 103
CLUSTERING .............................................................................................................................. 103
K-MEANS .................................................................................................................................... 103
GENERAL INFORMATION ABOUT COURSE MODULES AND SELF-STUDY KITS ................................... 119
PEARSON VUE EXAM INQUIRIES ................................................................................................. 119
PUBLIC INSTRUCTOR-LED WORKSHOP SCHEDULE ....................................................................... 119
PRIVATE INSTRUCTOR-LED WORKSHOPS..................................................................................... 120
BECOMING A CERTIFIED TRAINER ................................................................................................ 120
GENERAL BDSCP INQUIRIES ...................................................................................................... 120
AUTOMATIC NOTIFICATION .......................................................................................................... 120
FEEDBACK AND COMMENTS ........................................................................................................ 120
Introduction
This is the official workbook for the BDSCP course Module 4: Fundamental Big Data Analysis
& Science and the corresponding Pearson VUE Exam B90.04.
The purpose of this document is to establish an understanding of fundamental Big Data
concepts, which include but are not limited to:
- Understanding Big Data
- Fundamental Big Data Terminology & Concepts
- Big Data Business & Technology Drivers
- Traditional Enterprise Technologies Related to Big Data
- Characteristics of Data in Big Data Environments
- Types of Data in Big Data Environments
- Fundamental Analysis, Analytics & Machine Learning Types
- Business Intelligence & Big Data
- Data Visualization & Big Data
- Big Data Adoption & Planning Considerations
Mind Map Poster
The BDSCP Module 4: Mind Map Poster that accompanies this course booklet provides an
alternative visual representation of all primary topics covered in this course.
Big Data Module 4 Official Supplement:
Analysis Formulas
This supplement provides the formulas and algorithms upon which
analysis techniques are based. This supplement provides optional
reading for topics not covered on Exam B90.04.
Formulas for the following techniques are provided:
- Mean (Generic, Frequency-based)
- Median (Odd, Even)
- Mode
- Range
- Variance
- Standard Deviation
- Z-score
- Probability
- Sampling Distribution
- Standard Error
- Correlation (Pearson's)
- Covariance
- Distribution (Uniform, Binomial, Geometric, Poisson)
- Histogram
- Linear Regression
- K-Nearest Neighbour
- K-Means
Analysis Techniques Coverage
Modules 4 and 5 cover a variety of topics. The following are the twelve primary Big Data
analysis techniques that are emphasized and further explored in Module 6 lab exercises.
Correlation, Linear Regression, k-NN, and k-means are covered in Module 4, and the rest are
covered in Module 5.
- Correlation
- Linear Regression
- k-NN
- k-means
- Logistic Regression
- Naïve Bayes
- Decision Trees
- Classification Rules
- Association Rules
- Time Series Analysis
- Text Analytics
- Outlier Detection
Overview
This module comprises the following three primary parts:
- Part I: Big Data Science Concepts & Analysis Challenges
  - Terms and Concepts
  - Common Big Data Dataset Categories
- Part II: Elements of Big Data Analysis
  - Exploratory Data Analysis (EDA)
  - Statistics
  - Confirmatory Data Analysis (CDA)
  - Visualization
- Part III: Fundamental Big Data Analysis Techniques
  - Prediction: Linear Regression
  - Classification: k-NN (k-Nearest Neighbors)
  - Clustering: k-means
Part I: Big Data Science Concepts & Analysis Challenges
This section covers the following topics:
- Terms and Concepts
- Common Big Data Dataset Categories
Terms and Concepts
Data Science
Data Science is the overarching set of principles, processes, and techniques that enable the
extraction of knowledge from large amounts of data. Data is analyzed to understand and glean
insights in the form of generalizable patterns and correlations. Techniques and theories from
statistics, machine learning, computer science, data mining, and visualization all contribute to
Data Science. Data is generally explored without any prior hypothesis via exploratory data
analysis (EDA) in order to understand the relationships among differing variables.
This level of understanding of data, as described above, is captured in the form of a model
which is then implemented and deployed in the form of a data product. Models and data
products will be discussed separately in the upcoming Model and Data Product topics.
Depending on the nature of the analysis, some situations may not warrant a data product.
Instead, the modeling results are communicated using visualization techniques.
Model
In generic terms, a model is a simplified representation of a phenomenon to aid human
understanding, such as a blueprint of a house, a model plane, a logical data model, or a
physical data model. In data science, a model is a generalized representation of relationships
between data attributes in the form of a mathematical/statistical equation or set of rules.
A model can help the data scientist develop an understanding of the data-generating process,
which can further help in making predictions. A model enhances understanding by removing
unnecessary details and is based on assumptions and constraints pertinent to the problem
domain.
A descriptive model describes the current behavior in order to develop a causal (cause and
effect) understanding of the phenomenon. A successful descriptive model is generally one that
can be easily understood even though it may not produce accurate results.
A predictive model describes future behavior by estimating a target value based on predictor
values. Although understanding a predictive model is important, such models are considered
successful if they produce accurate results, even though they may not be easily
comprehensible.
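As a rough illustration of a model expressed as a mathematical equation, the following sketch fits a straight line to a handful of made-up predictor and target values and then uses it to estimate a new target value. The data, the variable names (ad_spend, sales), and the helper functions are assumptions introduced for this example only; they are not part of the course material.

```python
# Hypothetical illustration: made-up (predictor, target) pairs, not course data.

def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Estimate the target value for a predictor value x."""
    slope, intercept = model
    return slope * x + intercept


ad_spend = [1.0, 2.0, 3.0, 4.0]   # predictor values (assumed)
sales = [3.0, 5.0, 7.0, 9.0]      # target values (assumed), exactly y = 2x + 1

model = fit_line(ad_spend, sales)  # the "model" is just (slope, intercept)
estimate = predict(model, 5.0)     # predictive use of the model
print(model, estimate)
```

The fitted pair (slope, intercept) is the simplified representation: it discards the individual data points and keeps only the relationship between the two attributes.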
Data Product
A data product is an instantiation of the model built during the data analysis that exists in the
form of an application, which generates value from data for fulfilling a business goal. During the
course of its operation, a data product creates further data that is generally used to enhance the
data product via a feedback loop. In the business domain, the end goal of applying data science
is to develop a data product that provides business value.
Statistics
The term statistics, when used with a singular verb, refers to the science of collecting,
organizing, analyzing, and interpreting numerical data. When used with a plural verb, statistics
refers to numerical facts about a set of data, such as the mean, median, and mode.
The field of statistics generally involves summarization of data through the generation of various
types of statistical information utilized for interpreting data. Statistics involves scientifically
drawing a sample, which is a subset of a dataset, from a population, which is the entire dataset,
and the use of probability theory for prediction.
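The relationship between a population and a sample can be sketched with Python's standard library. The population below (the integers 1 to 1000) and the sample size of 50 are arbitrary assumptions chosen for illustration.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: the integers 1..1000 stand in for "the entire dataset".
population = list(range(1, 1001))

# Draw a simple random sample (a subset of the population) without replacement.
sample = random.sample(population, 50)

population_mean = statistics.mean(population)  # known here only because the data is synthetic
sample_mean = statistics.mean(sample)          # a statistic computed from the sample

print(population_mean, sample_mean)
```

In practice the population mean is usually unknown, and the sample mean is used to estimate it.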
Descriptive Statistics
Descriptive statistics is the numerical description of data via summarization and visualization
techniques. It helps a data scientist interpret the data in order to formulate hypotheses.
Numerical summaries generated via statistics include but are not limited to averages, quartiles,
percentiles, and standard deviations. Visualization techniques include histograms and scatter
plots. Table 4.1 provides averages that summarize the daily temperature data for NYC across
12 months.
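A minimal sketch of such numerical summaries, using Python's statistics module, is shown here. The readings are made-up values and are not the Table 4.1 data; the "inclusive" quartile method is one of several conventions and is assumed only for this example.

```python
import statistics

# Hypothetical readings (made-up values, not the Table 4.1 data).
readings = [12, 15, 15, 18, 20, 21, 24, 27, 30]

mean = statistics.mean(readings)      # average
median = statistics.median(readings)  # middle value of the sorted data
mode = statistics.mode(readings)      # most frequent value
stdev = statistics.stdev(readings)    # sample standard deviation

# Quartiles: the three cut points that divide the sorted data into four parts.
q1, q2, q3 = statistics.quantiles(readings, n=4, method="inclusive")

print(mean, median, mode, stdev, (q1, q2, q3))
```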
Inferential Statistics
Inferential statistics goes beyond describing data to making inferences about the population
based on the observed sample. Successful inferential statistics requires a random sample. A
non-random sampling mechanism introduces bias into the sample, which leads to wrong or
inaccurate inferences about the population. Bias is discussed in the upcoming Statistics
section.
Inferential statistics involves the use of point estimators and interval estimators. The process
involves drawing a sample from the population and, based on the sample, making an inference
about the population.
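The difference between a point estimator and an interval estimator can be sketched as follows. The sample values are made up, and the interval uses a normal approximation with z = 1.96 (roughly 95% coverage); both are assumptions for illustration rather than the method prescribed by this module.

```python
import math
import statistics

# Hypothetical sample (made-up values) drawn from a larger population.
sample = [48, 52, 50, 47, 53, 51, 49, 50, 52, 48]

# Point estimator: a single value, here the sample mean.
point_estimate = statistics.mean(sample)

# Interval estimator: a range around the point estimate.
# Assumption: normal approximation with z = 1.96; for a sample this small,
# a t-based interval would be somewhat wider.
standard_error = statistics.stdev(sample) / math.sqrt(len(sample))
margin = 1.96 * standard_error
interval = (point_estimate - margin, point_estimate + margin)

print(point_estimate, interval)
```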
Machine Learning
Machine learning, as introduced in Module 1, is the process through which computers
automatically learn from data to implicitly program themselves by identifying rules and patterns
for formulating predictions about unknown data. The learned rules and patterns essentially
represent the model that has been inferred from the data.
Machine learning and data mining are closely related, as both are used to find hidden patterns.
Data mining is more prevalent in business domains, whereas machine learning is a more
generic field that extends to other fields, such as artificial intelligence and natural language
processing (NLP).
Data mining generally employs machine learning algorithms and is more concerned with the
complete data analytic process, including data acquisition, cleansing, and model creation, rather
than just the application of algorithms. Machine learning and statistics can both be used to
create models.
Statistical models are more concerned with understanding the data generation process,
whereas machine learning algorithms are more concerned with producing the correct output(s)
through means that may not be fully comprehensible.
Machine learning involves the use of algorithms that can be divided into the following three
types:
- Supervised Learning: input data includes example outputs
- Unsupervised Learning: input data does not include any example outputs
- Semi-Supervised Learning: input data includes few example outputs
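The three types differ mainly in the shape of the input data, which the following sketch tries to make concrete. All values, labels, and function names are made up for illustration; the nearest-example lookup and the largest-gap split are toy stand-ins, not the k-NN or k-means procedures covered later in this module.

```python
# Made-up one-dimensional data for illustration only.

# Supervised: every input comes with an example output (a label).
labeled = [(1.0, "low"), (1.5, "low"), (8.0, "high"), (9.0, "high")]


def predict_nearest(x, examples):
    """Return the label of the closest labeled example."""
    _, nearest_label = min(examples, key=lambda pair: abs(pair[0] - x))
    return nearest_label


# Unsupervised: inputs only, no example outputs; structure must be discovered.
unlabeled = [1.0, 1.5, 8.0, 9.0]


def split_at_largest_gap(values):
    """Group sorted values into two clusters at the widest gap between neighbors."""
    ordered = sorted(values)
    gaps = [ordered[i + 1] - ordered[i] for i in range(len(ordered) - 1)]
    cut = gaps.index(max(gaps)) + 1
    return ordered[:cut], ordered[cut:]


# Semi-supervised: a few labeled examples plus many unlabeled inputs.
few_labeled = [(1.0, "low"), (9.0, "high")]
many_unlabeled = [1.5, 2.0, 7.5, 8.0]

print(predict_nearest(2.0, labeled))
print(split_at_largest_gap(unlabeled))
```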
Data Munging
Data munging, also known as data wrangling, refers to the extraction and manipulation of raw
data by applying cleansing, filtering, validation, and format transformation techniques in order to
make data appropriate for analysis. This generally involves the use of tools and programming
languages like SQL, Python, R, Hive, and Pig. In the context of data science, data munging
provides clean input data, which is essential for correctly understanding the data and further
discovering patterns and rules.
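The cleansing, filtering, validation, and format transformation steps can be sketched in plain Python. The raw records, field names, and accepted date layouts below are hypothetical and exist only to show the steps; a real pipeline would depend on the actual source data.

```python
from datetime import datetime

# Made-up raw records standing in for extracted source data.
raw_records = [
    {"id": "1", "temp": " 21.5 ", "date": "2015-03-01"},
    {"id": "2", "temp": "n/a", "date": "2015-03-02"},     # invalid reading
    {"id": "3", "temp": "19.0", "date": "03/03/2015"},    # different date layout
    {"id": "1", "temp": " 21.5 ", "date": "2015-03-01"},  # duplicate record
]


def parse_date(text):
    """Format transformation: accept two assumed layouts, emit ISO dates."""
    for layout in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            return datetime.strptime(text, layout).date().isoformat()
        except ValueError:
            continue
    return None


def munge(records):
    """Return cleaned records, dropping invalid rows and duplicate ids."""
    clean, seen = [], set()
    for record in records:
        try:
            temp = float(record["temp"].strip())  # cleansing and validation
        except ValueError:
            continue                              # filtering: drop unparseable rows
        date = parse_date(record["date"])
        if date is None or record["id"] in seen:
            continue                              # filtering: bad date or duplicate
        seen.add(record["id"])
        clean.append({"id": int(record["id"]), "temp": temp, "date": date})
    return clean


clean_records = munge(raw_records)
print(clean_records)
```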
Reading
Further discussion on these topics is provided in the sections "A Data Science Profile" on
pages 10-12 and "OK, So What is a Data Scientist, Really?" on pages 14-16 of the Doing Data
Science textbook accompanying this module.
Common Big Data Dataset Categories
High-Volume Datasets
Data within Big Data environments comes in large volumes, such as an entire collection of daily
financial transactions for a month from across all branches of a supermarket, and varying
volumes, such as tweets that are only 560 bytes (140 characters) in length versus a two-hour
video that is 4.7 gigabytes.
Within structured datasets, large volume can be due to a large number of records or rows, or
due to a large number of fields or columns. In some cases, large volume can be due to both a
large number of records and of fields.
Generally, a dataset with a large number of rows/records is considered tall or long data, while a
dataset with a large number of columns/fields is considered wide data, as depicted in Figure 4.3.
Both tall and wide data bring a unique set of challenges for analyzing data in Big Data
environments, and both often require increased processing resources.
Figure 4.3 Tall datasets have several rows, pictured left, while wide datasets have several columns, pictured right.
Analysis of tall datasets is somewhat easier, as there are fewer fields/characteristics to take into
consideration. However, such datasets are generally more prone to noise and outliers, because
there are a large number of records that will need automated data cleansing and outlier
detection techniques.
Wide datasets can contain comparatively less noise and fewer outliers, but their analysis is generally
complex, as there are a large number of fields/characteristics that must be taken into account.
Both types of datasets require intensive EDA to be conducted in order to develop a thorough
understanding before conducting a more targeted, detailed analysis.
Voluminous semi-structured and unstructured datasets can generally be thought of as tall
datasets, as each record is often represented as a BLOB of information in a single column. Pre-
processing of data is required on these types of voluminous semi-structured and unstructured
datasets. Common pre-processing tasks include data cleansing and derivation of new fields, as
well as ensuring the data is represented in a form that can be used for quantitative techniques.
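As a minimal sketch of such pre-processing, the Python snippet below treats each record as a single JSON text BLOB, discards records that fail to parse (data cleansing), and derives a numeric field. The record layout, the field names user, comment, and comment_length, and the sample records are illustrative assumptions and do not come from the workbook.

```python
import json

# Hypothetical raw records: each one is a single text BLOB (one "column").
raw_records = [
    '{"user": "u1", "comment": "Great service"}',
    '{"user": "u2", "comment": "Too slow, very unhappy"}',
    'not valid json',  # noise that cleansing should drop
]

def preprocess(blobs):
    """Parse each BLOB, skip unparseable ones, and derive a numeric field."""
    cleaned = []
    for blob in blobs:
        try:
            record = json.loads(blob)
        except json.JSONDecodeError:
            continue  # data cleansing: discard malformed records
        # Derived field: comment length in words, usable by quantitative techniques.
        record["comment_length"] = len(record.get("comment", "").split())
        cleaned.append(record)
    return cleaned

records = preprocess(raw_records)
```

After this step each surviving record carries a numeric attribute that can feed the numerical summaries described later in the workbook.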
High-Velocity Datasets
Data within Big Data environments arrives at a fast pace, often due to the scale of the
underlying data-generating process. For example, thousands of individuals tweet at any point in
time and a large number of financial transactions occur across multiple stores within a short
span of time.
With high-velocity machine-generated data, the recurring data structure remains the same, such
as smart meter data or Web server logs. With high-velocity human-generated data, unstructured
data values can change on a per record basis, such as customer comments. However, the
overall structure of the individual record often remains the same as it will typically be formatted
by a data-capturing device.
Depending on business requirements, the analysis of high-velocity data can be performed in
transactional or batch mode, and in some circumstances both. With transactional analysis,
individual records are processed as they arrive. The processing may simply involve data
cleansing and updating KPIs for reporting purposes or may involve complex automated analysis
of the record, such as fraud detection. With batch analysis, fast-arriving data is accumulated first
and only then processed for reporting purposes or for performing complex analysis, such as
model development.
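The difference between the two modes can be sketched as follows. This is an illustrative Python example only: the transaction amounts and the "average sale" KPI are assumed for demonstration.

```python
# Hypothetical stream of transaction amounts arriving over time.
incoming = [12.5, 40.0, 7.25, 99.0]

# Transactional mode: update a running KPI as each record arrives.
running_total = 0.0
running_count = 0
kpi_history = []
for amount in incoming:
    running_total += amount
    running_count += 1
    kpi_history.append(running_total / running_count)  # average sale so far

# Batch mode: accumulate the records first, then process them all at once.
batch = list(incoming)
batch_average = sum(batch) / len(batch)
```

In transactional mode the KPI is available after every record, whereas in batch mode it only becomes available once the accumulated data has been processed. Both arrive at the same final value here.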
High-Variety Datasets
Within Big Data environments, a variety of datasets containing structured, semi-structured, and
unstructured data are generally used for analysis purposes. Unlike traditional data analysis,
which is only focused on structured datasets, analysis within Big Data environments must
incorporate semi-structured and unstructured datasets, as this type of data carries latent
information that can be of potential benefit for an enterprise. For example, text analytics and
sentiment analysis performed on customer comments can identify customers who may be at
risk of defecting to a competitor.
The notion of variety applies to the fact that multiple differently formatted datasets must be
analyzed, rather than the same dataset comprising records made up of different formats that
continue to change. For example, even in a semi-structured dataset that comprises structured
and unstructured data, the data type for a particular field is often fixed, despite some records
containing additional or fewer fields. From a data analysis point of view, high-variety datasets
generally require certain pre-processing steps and may need a combination of analysis
techniques for their analyses.
It can be hard to join high-variety datasets together in order to perform unified data analysis.
The datasets are usually heterogeneous due to a range of enterprise-wide information systems
or different devices that generate the data required for analysis.
For example, the different types of sensors on the factory floor may generate data in different
formats. Noise must be carefully removed from real data (the signal) to achieve meaningful,
correct analytical results. In general, removing noise from machine-generated data is less
difficult as compared to human-generated data, as the former often conforms to some
lower/upper limits whereas the latter requires semantic assessment.
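For machine-generated data, a limit-based noise filter might look like the sketch below. The sensor type, the limits, and the readings are assumptions chosen for illustration. Real limits would come from the device specification.

```python
# Assumed plausible operating limits for a hypothetical temperature sensor (deg C).
LOWER_LIMIT = -40.0
UPPER_LIMIT = 125.0

readings = [21.5, 22.0, 999.0, 21.8, -273.0, 22.3]

def within_limits(value, lower=LOWER_LIMIT, upper=UPPER_LIMIT):
    """Return True when a reading falls inside the accepted range."""
    return lower <= value <= upper

# Separate the usable signal from readings that violate the limits.
signal = [r for r in readings if within_limits(r)]
noise = [r for r in readings if not within_limits(r)]
```

No comparably simple rule exists for human-generated text, which is why it calls for semantic assessment instead.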
High-Veracity Datasets
Meaningful analysis of data generated within Big Data environments requires high-veracity
datasets. However, voluminous datasets can potentially contain large amounts of noise that
negatively affect the veracity of datasets. Noise creates false data that cannot be trusted and
further produces incorrect analysis results. For example, a misconfigured sensor or device will
create false readings in machine-generated data. Similarly, biased comments or the
appearance of similar comments multiple times under different user IDs is an indication of noise
in human-generated data.
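One simple way to surface repeated comments posted under different user IDs is sketched below. The comments, the user IDs, and the normalization rule (lower-casing and collapsing whitespace) are illustrative assumptions. Detecting bias would need more than this.

```python
from collections import defaultdict

# Hypothetical (user_id, comment) pairs.
comments = [
    ("u1", "Best product ever!"),
    ("u2", "best product ever!  "),
    ("u3", "Delivery was late."),
    ("u4", "Best product ever!"),
]

def normalize(text):
    """Lower-case and collapse whitespace so near-identical text compares equal."""
    return " ".join(text.lower().split())

users_by_comment = defaultdict(set)
for user_id, text in comments:
    users_by_comment[normalize(text)].add(user_id)

# Flag comments that appear under more than one user ID as possible noise.
suspicious = {text: users for text, users in users_by_comment.items() if len(users) > 1}
```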
High-Value Datasets
A high-value dataset within Big Data environments is one that is high-veracity, contains useful
insights for the enterprise, and can be analyzed within a meaningful time period, requiring
comparatively simple analysis techniques. Like veracity, the value of a dataset is dependent on
the volume, velocity, and variety characteristics.
High-volume datasets, whether tall or wide, add more value as compared to datasets
comprising fewer records, due to the applicability of the Law of Large Numbers. High-velocity
datasets add further value when compared to low-velocity datasets because of the constant
addition of new records and increased frequency with which results are updated.
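The Law of Large Numbers can be illustrated with a small simulation: as the number of observations grows, the sample mean tends toward the expected value. The die-roll scenario, the seed, and the sample sizes below are assumptions chosen for illustration.

```python
import random

random.seed(42)  # fixed seed so the run is repeatable

def mean_of_rolls(n):
    """Average of n simulated fair six-sided die rolls (expected value 3.5)."""
    return sum(random.randint(1, 6) for _ in range(n)) / n

small_sample_mean = mean_of_rolls(10)
large_sample_mean = mean_of_rolls(100_000)
```

The mean of the large sample will typically sit much closer to 3.5 than the mean of the small sample.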
Similarly, high-variety, heterogeneous datasets add increased value in comparison to
homogeneous datasets, as a combination of differently formatted datasets provides richer,
unified datasets with increased chances of finding significant insights.
Exercise 4.1: Match Terms to Statements
Answer the questions below by filling in the blank fields with one of the following terms:
• High-Volume Datasets
• High-Velocity Datasets
• High-Variety Datasets
• High-Veracity Datasets
• High-Value Datasets
1. A company collects customer comments that undergo text analytics and sentiment analysis
in order to identify the customers who may be at risk of defecting to a competitor. Which
category of Big Data datasets best characterizes this process?
______________________
2. Thousands of stock trading transactions are arriving very quickly as a result of being
concurrently generated by traders at the New York Stock Exchange. Which Big Data dataset
category is best-suited for describing the resulting dataset?
______________________
3. An application that collects comments from a Web site is run to filter user-created data for
bias and significance. Which Big Data dataset category best describes such removal of
noise?
______________________
4. High data veracity, velocity, and variety contribute to measuring which Big Data dataset
category?
______________________
5. A large banking institution collects a month's worth of daily financial transactions from all of
its branches across the country. What is the appropriate Big Data dataset category for
describing the resulting dataset?
______________________
Part II: Elements of Big Data Analysis
This portion of the workbook is divided into the following sections:
• Exploratory Data Analysis (EDA)
• Statistics
• Confirmatory Data Analysis (CDA)
• Visualization
Exploratory Data Analysis (EDA)
Attributes
In order to analyze the data and build models, it is important to first understand the data by
exploring data attributes or features of the data and to understand their types. An attribute is a
characteristic of the data. For example, in a database table, the columns are the attributes of
each instance of data displayed in the rows.
The notion of an attribute is more common within data mining, whereas in statistics, machine
learning, and data warehousing, the attribute is known as a variable, feature, and dimension
respectively. The variable types introduced in the upcoming Statistics section also apply to
attributes.
EDA
The process of EDA involves extracting quantitative attributes from the data and producing
various numerical and graphical summaries that are based on statistics generated from the
values of these attributes, with a view to developing an understanding of the data. This
understanding helps to assess the data quality, to make comparisons and find relationships,
and to identify attributes that will eventually become part of the statistical models and machine
learning algorithms.
Another objective of EDA is to ensure targeted data mining efforts by decreasing the amount of
data through the selection of only relevant attributes and data discretization, a topic covered in
Module 5: Advanced Big Data Analysis & Science.
EDA provides information on which type of model to develop and which relationships are
important in the context of the problem space, as well as information on any assumptions that
should be made for the models and which type of patterns should be extracted and generalized.
Alternatively, EDA can be used to determine whether the captured data is erroneous, or if the
process used to capture the data is not configured properly and is producing data that consists
of unrealistic patterns not normally associated with such data.
Optional Reading
For a more in-depth discussion on this topic, see the Exploratory Data Analysis section on
pages 34-37 of the Doing Data Science textbook.
Numerical Summaries
Numerical summaries make use of descriptive statistics for summarizing data. There are
generally three types of numerical summaries:
• Measures of Central Tendency
• Measures of Variation or Dispersion
• Measures of Association
• Variance
• Standard Deviation
The main objective of analyzing the spread is to determine how consistently the values appear
when compared with the averages (mean, median, and mode) and to find and remove any
outliers. When used in conjunction with z-scores, measures of variation provide the ability to
make decisions about which processes or models produce consistent or better results when
compared with other processes or models. Z-scores are introduced in the upcoming Statistics
section.
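As a rough sketch of how measures of variation support such comparisons, the Python snippet below computes the variance and standard deviation for two processes that share the same mean. The process names and measurements are hypothetical, and the population forms of the formulas are used as an assumption.

```python
import statistics

# Hypothetical output measurements from two processes with the same mean (10).
process_a = [10, 10, 11, 9, 10]
process_b = [5, 15, 10, 2, 18]

def spread(values):
    """Return population variance and standard deviation for a list of values."""
    return statistics.pvariance(values), statistics.pstdev(values)

var_a, std_a = spread(process_a)
var_b, std_b = spread(process_b)

# The process with the smaller standard deviation produces more consistent results.
more_consistent = "A" if std_a < std_b else "B"
```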
Graphical Summaries
Graphical summaries make use of visual techniques for summarizing data. This helps to explore
data beyond its descriptive characteristics, which further helps in generating hypotheses or
discovering patterns and correlations. Generally, the following graphical techniques are used in
EDA:
• Bar Graph
• Line Graph
• Histogram
• Frequency Polygons
• Scatter Plot
• Stem & Leaf Plot
• Cross-Tabulation
• Box & Whisker Plot
• Quantile-Quantile Plot
• Lattice Plot
Quantitative Analysis
Quantitative analysis of data can be categorized by the number of variables involved. The
following are the three main types:
• Univariate Analysis
• Bivariate Analysis
• Multivariate Analysis
Univariate Analysis
Quantitative analysis of a single variable is known as univariate analysis, such as analysis of
census data for gaining insights about literacy levels or the ethnic makeup of a population. The
main objective is to understand the type of distribution the values make up and to identify any
outliers. Univariate analysis often starts with formulating frequency and probability distributions,
which will be introduced in the upcoming Statistics section.
The techniques involved within univariate analysis include:
• Measures of Central Tendency
• Measures of Variation or Dispersion
Bivariate Analysis
Quantitative analysis of two variables in order to explore their relationship is known as bivariate
analysis, such as an analysis of ice-cream sales and temperature. It is good practice to first
conduct univariate analysis on the variables involved within bivariate analysis before proceeding
to the actual bivariate analysis.
The techniques involved within bivariate analysis include:
• Measures of Association
• Cross-tabulation
• Regression
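The Pearson correlation coefficient is one commonly used measure of association. The sketch below applies it to the ice-cream example. The temperature and sales figures are invented for illustration, and the choice of Pearson's coefficient is an assumption rather than something the workbook prescribes.

```python
import math

# Hypothetical daily observations: temperature (deg C) and ice-cream sales (units).
temperature = [18, 22, 25, 30, 33]
sales = [120, 150, 180, 240, 260]

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

r = pearson_r(temperature, sales)
```

A value of r close to +1 suggests a strong positive linear relationship between the two variables. It does not by itself establish that one causes the other.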
Multivariate Analysis
Quantitative analysis of more than two variables in order to explore their relationship is known
as multivariate analysis, such as predicting ice-cream sales based on temperature and age
group. Multiple linear regression, covered in the Prediction section, is an example of conducting
multivariate analysis.
The numerical summaries used for conducting the aforementioned univariate, bivariate and
multivariate analyses are generally complemented by the graphical summaries for visual
perception.
Statistics
Variable Types
A variable is a measurable or observable attribute of an object which can be categorized as
follows:
• Discrete variables can only take specific values from a defined set of values, such as the
number of people in a city or blood group type. A discrete variable is a variable whose value
is obtained by counting.
• Continuous variables can take any value, such as a patient's temperature or height.
A continuous variable is a variable whose value is obtained by measuring.
• Nominal variables have values that represent a category, such as product categories or
music genres. Such values can be counted but not measured or ordered.
• Ordinal variables take numerical values that can be discrete or continuous and can be
ordered or ranked, such as a survey question based on a satisfaction scale or educational
level. Such values can be counted and ordered but not measured.
• Binary variables consist of only two categories where the categories are generally the
opposite of each other, such as 1/0, true/false, and heads/tails.
• Quantitative variables are number-based and can be counted or measured, such as an
employee's income.
• Qualitative variables, also known as categorical variables, can be counted but not
measured, such as gender.
• Independent variables have values that do not depend on any other variable but rather
influence other variables, whereas dependent variables have values that are influenced by
the independent variable. For example, temperature is an independent variable that ice-
cream sales depend upon.
• A random variable, generally denoted by X, is a variable that can assume a range of values
based on probability.
Exercise 4.2: Fill in the Blanks
1. ______________________ variables can take only specific values from a defined set of
values.
2. ______________________ variables can take any value and are often obtained by
measurement.
Population & Sample
In statistics, a population is the entire set of objects of a particular type that is being analyzed,
such as a dataset of all customers. A sample is a subset of data drawn from the population,
such as a few customers from the entire customer dataset. An observation is a set of attributes
related to the object, such as customer name and e-mail address. N (population size)
represents the number of observations in a population, while n (sample size) represents the
number of observations in a sample.
Figure 4.4 illustrates where population, sample, and observation pertain to a specific dataset.
Within Big Data environments:
• it is possible for n to be close to or equal to N, as large amounts of data can be processed
within a reasonable amount of time. Having n close to N helps to make predictions about
the population with higher confidence.
• ... n can also be equal to 1 but with a large observation set. This helps to draw conclusions
about a single object rather than the whole population.
A sample statistic describes a numerical fact related to a sample that is generally used to make
conclusions or estimations about the related population parameter, whereas a population
parameter describes a numerical fact about the entire population. For estimation, a sample
statistic is known as an estimator that produces biased/unbiased and precise/imprecise results,
as discussed shortly.
A sample statistic calculated from different samples of fixed size drawn from the same
population can produce different results between themselves, as well as when compared
against the corresponding population parameter. This variation is represented by a sampling
distribution, introduced shortly.
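The variation of a sample statistic across samples can be sketched with a small simulation. The population of customer spend values, the seed, the sample size n = 50, and the number of samples below are all assumptions made for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed for repeatability

# Hypothetical population of N = 1,000 customer spend values.
population = [random.uniform(10, 100) for _ in range(1000)]
population_mean = statistics.mean(population)  # population parameter

# Draw several samples of fixed size n and compute the sample statistic for each.
n = 50
sample_means = [statistics.mean(random.sample(population, n)) for _ in range(20)]

# With n equal to N, the sample statistic coincides with the parameter.
full_sample_mean = statistics.mean(random.sample(population, len(population)))
```

The 20 sample means differ from one another and from the population mean. Taken together they form an empirical picture of the sampling distribution of the mean.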
Statistical Inference
Statistical inference is the process of deriving conclusions from data generated by random data-
generating processes, also known as stochastic processes. This generally involves creating
models from data in order to represent the random data-generating process in a simplified
manner.
Sample data is used in order to make estimates or test hypotheses related to the population.
For example, sample data gathered regarding insurance claims shows that fewer insurance
claims are made by women as compared to men. A conclusion could be that this is because
women drive more carefully than men.
Mean
The mean, commonly known as the average, is a statistic obtained by dividing the sum of all
values by the count of all values. The population mean is denoted by μ, while the sample mean
is denoted by x̄. The mean is generally used when the values do not change much and increase or
decrease in a normal manner. It is affected by the presence of outliers. Both population and
sample means are calculated in the same manner.
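Written out with the conventional symbols μ (population mean) and x̄ (sample mean), and with N and n as the population and sample sizes, the calculation takes the form below. The index notation x_i for the i-th value is an assumption introduced here for illustration, as is the worked example using the values 3, 1, 5, 1, and 7.

```latex
\mu = \frac{1}{N}\sum_{i=1}^{N} x_i
\qquad
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i

% Worked example for the sample (3, 1, 5, 1, 7):
\bar{x} = \frac{3 + 1 + 5 + 1 + 7}{5} = \frac{17}{5} = 3.4
```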
Median
The median is a statistic obtained by finding the middle value among all ordered values when
the total number of values is odd; when the total is even, the median is the mean of the two
middle values. A sample median is denoted by M or x̃. The median is best suited for scenarios
where extreme values can produce a misleading mean. The median is largely unaffected by the
presence of outliers and, as it does not take all values into consideration, generally stays the
same.
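A minimal Python sketch of the median calculation follows. For an even number of values it averages the two middle values, which is the common convention and is assumed here. The sample data is illustrative.

```python
def median(values):
    """Middle value of the ordered data; mean of the two middle values if the count is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_case = median([3, 1, 5, 1, 7])      # ordered: 1, 1, 3, 5, 7
even_case = median([3, 1, 5, 1, 7, 9])  # ordered: 1, 1, 3, 5, 7, 9
```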
Mode
The mode is a statistic obtained by finding the most frequently occurring value among all values,
and is the only type of average (the others being mean and median) that can be calculated for
nominal variables. When the dataset consists of groups of values rather than individual values,
the mode is the median of the most frequently occurring group of values. A set of values can have
two or more modes, in which case the values are called bimodal and multimodal respectively.
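The mode of individual values, including nominal ones, can be sketched with Python's standard statistics module. The genre and score data are hypothetical.

```python
import statistics

genres = ["rock", "jazz", "rock", "pop", "jazz", "rock"]  # nominal values
scores = [1, 2, 2, 3, 3]                                  # two values tie for most frequent

single_mode = statistics.mode(genres)     # most frequent category
all_modes = statistics.multimode(scores)  # every value that ties for most frequent
```

Because the scores list has two modes, it is an example of a bimodal set of values.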
Robustness
In statistics, a sample statistic is termed robust if shifting some values or the presence of
outliers does not change the value of the statistic. The median and mode are robust measures.
The mean is not a robust measure.
For example, for a set of five values (3, 1, 5, 1, and 7), the mean, median, and mode are:
Table 4.2 An example of the mean, median, and mode from a set of five values.
Table 4.3 An example where adding an extreme value changes the mean.
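The effect of an extreme value on the three averages can be checked with a short Python sketch. The five values come from the example above. The extreme value of 100 is an arbitrary assumption and is not necessarily the one used in Table 4.3.

```python
import statistics

values = [3, 1, 5, 1, 7]
with_extreme = values + [100]  # 100 is an arbitrary, hypothetical extreme value

def averages(data):
    """Return (mean, median, mode) for a list of numbers."""
    return statistics.mean(data), statistics.median(data), statistics.mode(data)

before = averages(values)       # (3.4, 3, 1)
after = averages(with_extreme)  # (19.5, 4.0, 1)
```

The mean jumps from 3.4 to 19.5, while the median moves only from 3 to 4 and the mode stays at 1.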
Range
The range is a statistic obtained by subtracting the minimum value from the maximum value; it
describes the spread or width of the data. Like the mean, the range is heavily affected by the
presence of extreme values, as a single extreme value gives the impression that the values
are spread over a very large range.
The averages (mean, median, and mode) provide central value, while range provides an idea
about the variation in the data. Using range, two different sets of values can be compared in
terms of variation in their values.
Mean, Median, Mode & Range
For a set of five values (3, 1, 5, 1, and 7) Table 4.4 summarizes mean, median, mode, and
range for a quick comparison:
Figure 4.4 shows a number line for a visual analysis of the data using the above measures:
Quantiles
Quantiles divide ranked or ordered data into a specific number of equally sized portions. The
values that indicate the boundary between the portions are actual quantiles and in total are
always one less than the number of portions.
For example, dividing the set of values in Figure 4.5 into three portions results in two quantiles
(3 and 6), below which 33.33% and 66.67% of the values lie respectively. Data can be divided into
any number of portions, but is generally divided into four (quartiles), five (quintiles), or 100
portions (percentiles).
Figure 4.5 - A set of values is divided into three portions resulting in two quantiles, shown in red.
Quintiles
Quintiles represent four values that divide the data into five equally sized portions obtained by
first arranging the data values in ascending order and then dividing the data into five portions.
The first (Q1), second (Q2), third (Q3), and fourth (Q4) quintiles represent 20%, 40%, 60%, and
80% of the data values below them, as shown in Figure 4.6.
Quartiles
Quartiles represent three values that divide the data into four equally sized portions obtained by
first arranging the data values in ascending order and then dividing the data into four quarters.
The first, second, and third quartiles are known as lower quartile, median, and upper quartile
and are denoted by Q1, Q2, and Q3 respectively. Q1, Q2, and Q3 represent values below which
25%, 50%, and 75% of data values exist respectively.
There are multiple ways to compute quartiles. The simplest approach is to first find the median,
Q2, which divides the ordered data into two portions. If n is odd, Q2 itself is excluded from both
portions. Q1 and Q3 are then the medians of the first and second portions respectively.
Consider the set of values in Figure 4.7. The median Q2 is 4.5. As the total number of values is
14, an even number, Q1 and Q3 can be calculated without removing any value. Q1, the median of
the first half, is 2, while Q3, the median of the second half, is 7.
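The median-of-halves approach described above can be sketched in Python as follows. The values in Figure 4.7 are not reproduced in this text, so the 14 values below are hypothetical and were chosen only so that they give the quoted results (Q1 = 2, Q2 = 4.5, Q3 = 7). The function name quartiles is likewise an assumption.

```python
import statistics


def quartiles(data):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # When n is odd, the median itself is left out of both halves.
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return (
        statistics.median(lower),
        statistics.median(ordered),
        statistics.median(upper),
    )


# Hypothetical 14 values, chosen to reproduce the results quoted in the text.
sample = [1, 1, 2, 2, 3, 3, 4, 5, 5, 6, 7, 8, 8, 9]
print(quartiles(sample))  # (2, 4.5, 7)
```

Other quartile methods interpolate between values and can give slightly different results for the same data.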
Outliers are abnormal or extreme data values that generally occur within the first and last
quarters of the data and can skew the results of a calculation. Figure 4.8 illustrates how the
interquartile range (IQR), the difference between Q3 and Q1, can be used to exclude outliers. As
it only covers data values between Q1 and Q3, any outliers in the first and last quarters are
effectively eliminated.
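As a minimal sketch, the interquartile range (Q3 minus Q1) and the values it covers can be computed as shown below. The readings are hypothetical, with one extreme value placed at each end, and the quartiles are found with the median-of-halves method.

```python
import statistics


def iqr_filter(data):
    """Return (IQR, values lying between Q1 and Q3 inclusive)."""
    ordered = sorted(data)
    half = len(ordered) // 2
    q1 = statistics.median(ordered[:half])
    q3 = statistics.median(ordered[-half:])  # upper half; skips the median when n is odd
    middle = [v for v in ordered if q1 <= v <= q3]
    return q3 - q1, middle


# Hypothetical readings with one extreme value at each end.
readings = [-40, 2, 3, 4, 5, 6, 7, 95]
print(iqr_filter(readings))  # (4.0, [3, 4, 5, 6])
```

The extreme readings of -40 and 95 fall outside the span from Q1 to Q3 and therefore have no effect on the IQR.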
Percentiles
A percentile, like a quartile, is a value that divides the data into equal portions using
percentages instead of quarters, and is a value under which a given percentage of data values
exists. Each percentile represents the corresponding percentage of values. For example, the
30th percentile means 30% of values are less than the value represented by the 30th percentile.
Q1, Q2, and Q3 are also known as the 25th, 50th, and 75th percentiles, respectively.
Bias
A bias is introduced when the sample is not a true representation of the population, which can
happen if the sample has not been drawn in a random manner. A sample statistic from a biased
sample will result in making false conclusions about the corresponding population parameter.
In technical terms, bias represents how far the average of multiple values of an estimator,
calculated from multiple samples, is from the corresponding population parameter. Precision is a
separate property: an estimator is imprecise if its values from different samples are not close to
each other. An estimator can therefore be biased or unbiased and, independently, precise or
imprecise.
In Figure 4.9, the estimator is biased as the average value lies at a distance from the
population parameter, shown as the X on the number line. The results are close to each other;
therefore, the estimator is precise.
Figure 4.9 An example where a bias is present.
Distribution
A distribution is a group of numbers or a function that shows all occurrences of different values
or outcomes of a variable. In other words, it shows how values of a variable are distributed. For
example, Table 4.5 shows the distribution of different colored balls when drawn randomly from a
bag.
Depending on the type of variable, a distribution can be either discrete or continuous. Generally,
a discrete distribution is shown using a bar chart, while a continuous distribution is shown using
a histogram. This is explained shortly in the Visualization section. In statistics, a distribution can
also refer to a function that explains the nature of a group of numbers.
Variance
The variance is a non-negative value that shows how spread out the values are compared to the
mean of the values or center of a distribution. Sample variance is denoted by s², while the
population variance is denoted by σ².
A small variance shows that there is comparatively small difference between the values and the
mean value, and that the values occur close to each other. A large variance shows that there is
comparatively large difference between the values and the mean value, and that the values
occur far from each other.
Standard Deviation
Like the variance, the standard deviation is a non-negative value that describes the spread of the
values from the center of the distribution. It is the square root of the variance. Sample standard
deviation is denoted by s, while the population standard deviation is denoted by σ. The calculated
value is known as one standard deviation and it is expressed in the same units as the values in
the distribution.
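The following Python sketch computes the variance and standard deviation for the five values used earlier in this section. It relies on the standard statistics module, where the sample versions divide the sum of squared deviations by n - 1 and the population versions divide by n.

```python
import math
import statistics

values = [3, 1, 5, 1, 7]  # the five values used earlier in this section

sample_var = statistics.variance(values)       # divides by n - 1
population_var = statistics.pvariance(values)  # divides by n
sample_sd = statistics.stdev(values)           # square root of the sample variance
population_sd = statistics.pstdev(values)      # square root of the population variance

print(round(sample_var, 2), round(population_var, 2))  # 6.8 5.44
print(round(sample_sd, 3), round(population_sd, 3))    # 2.608 2.332
```

The standard deviations are in the same units as the original values, whereas the variances are in squared units.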
Z-Score
A z-score, also known as a standard score, is the number of standard deviations by which a value
lies above or below the mean value of the distribution. The z-score is denoted by z. A set of
values can be converted to z-scores through a process of standardization. A negative z-score
shows that the value is less than the mean, whereas a positive z-score shows that the value is
greater than the mean value.
Z-scores help to make decisions about data in a standardized manner: values can be included or
excluded based on how many standard deviations they lie from the mean value.
Z-scores can be used as a baseline for comparing different datasets with different means and
standard deviations. For example, two bottle filling machines have z-scores of -0.5 and 0.5,
which means that the first machine is under-filling while the second machine is over-filling.
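Standardization can be sketched in Python as shown below, where each value x is converted using z = (x - mean) / standard deviation. The fill volumes are hypothetical, and the population standard deviation is used here as an assumption; a sample standard deviation could be substituted.

```python
import statistics


def z_scores(data):
    """Standardize values: z = (x - mean) / standard deviation."""
    mean = statistics.mean(data)
    sd = statistics.pstdev(data)  # population SD; use stdev() for a sample
    return [(x - mean) / sd for x in data]


# Hypothetical fill volumes in millilitres for one machine.
fills = [498, 500, 502, 500, 500]
print([round(z, 2) for z in z_scores(fills)])  # [-1.58, 0.0, 1.58, 0.0, 0.0]
```

After standardization the z-scores have a mean of 0 and a standard deviation of 1, which is what makes values from different datasets comparable.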
Exercise 4.3: Name the Measure
1. Jack is performing EDA on income data for a particular region with middle class earning
potential. He summarizes the data using one of the averages. However, when he adds data
from another region consisting of few but extremely wealthy people, recalculation of the
average results in a completely different value. Identify the measure that Jack is using.
______________________
2. Amber is comparing the temperature of tropical countries with the temperature of countries
that are farther away from the equator, in order to help chemists develop different variants of
engine oil for each region. She has compiled two sets of distributions, with the average
temperatures for each month of the year arranged in ascending order. Which measure can
be used to determine the temperature fluctuations for each region?
______________________
3. A technician is comparing the performance of two similar machines using a certain measure
of variation. However, he is getting a lot of variation between the lower and upper operating
bounds and is unable to obtain a meaningful comparison. A quick investigation reveals that
the data has extreme values towards both the lower and upper bounds. Which measure of
variation can be used to enable a meaningful comparison of the two machines?
______________________
4. Two dozen contestants participated in an essay writing competition last week. The
published results informed each contestant of the mark he or she received out of 100.
However, the contestants want to know how well they performed in comparison to the other
contestants, in terms of the percentage of contestants that had received lower marks. Which
measure will provide the required additional information?
______________________
5. A data scientist is analyzing the sales figures of two different stores. Calculating the range of
both sets of sales figures reveals that the first store has a much wider range than the
second store. Which measure of variation can be used to quantify the variation based on all
sales figures, in order to identify the store that produced more consistent sales figures?
______________________
6. A bio scientist is comparing two different types of corn seeds that have been genetically
modified. Production data for each type shows that both types have different mean and
standard deviation figures. The production figures for the last season indicate that both
types have resulted in a higher than average yield. Which measure can be used to find the
variety that performed better than the other?
______________________
Distributions
A distribution, as explained earlier, is a set of values showing how often different values occur,
or the chance of different values occurring. In statistics, there are a number of different types
of distributions, including the following:
• Frequency Distribution
• Probability Distribution
• Sampling Distribution
• Normal Distribution
Frequency Distribution
A frequency is the number of times each value of a variable appears. A distribution that shows
the frequency of a variable is known as the frequency distribution. A frequency distribution is a
quick and easy way of summarizing data, generally shown using a table or a bar chart.
For example, a frequency distribution of different colored balls pulled randomly from a bag can
be displayed in the form of a bar chart, as shown in Figure 4.10.
Probability
A probability is a measure of how likely an event or a value of a variable is to occur, and is a
value between 0 and 1. The probabilities of all possible outcomes add up to 1. A probability
closer to 0 indicates a rare event, while a probability closer to 1 indicates a common event.
In statistics, an experiment is a test based on chance that leads to different results known as
outcomes. An event refers to an individual outcome or a group of outcomes of an experiment.
Probability Distribution
A distribution that shows the probability of each event or value of a variable is known as the
probability distribution. The bar chart in Figure 4.11 shows the probability distribution of one red
ball, one yellow ball, and two blue balls.
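The relationship between a frequency distribution and a probability distribution can be sketched in Python as follows. The sketch assumes the bag holds exactly the four balls named above and that each ball is equally likely to be drawn.

```python
from collections import Counter

# Assumed bag contents: one red, one yellow, and two blue balls.
bag = ["red", "yellow", "blue", "blue"]

frequency = Counter(bag)  # frequency distribution: count per colour
total = sum(frequency.values())
probability = {colour: count / total for colour, count in frequency.items()}

print(dict(frequency))  # {'red': 1, 'yellow': 1, 'blue': 2}
print(probability)      # {'red': 0.25, 'yellow': 0.25, 'blue': 0.5}
```

Dividing each frequency by the total count converts the frequencies into probabilities that add up to 1.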
Reading
For a more in-depth discussion on this topic, see the Probability Distributions section from
pages 30-31 of the Doing Data Science text book that accompanies this module.
Sampling Distribution
A sampling distribution is the probability distribution of a sample statistic, such as a mean, that
is commonly used to make inferences about the population parameters by calculating sample
statistics from a number of fixed-size samples.
A sample statistic, such as a mean, calculated from a number of different samples of the same
size would generally result in different values. In order to view the variation in the sample
statistic values, a sampling distribution is used. The mean of the sampling distribution is an
estimate of the population mean.
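A sampling distribution of the mean can be approximated by simulation, as in the Python sketch below. The population of the integers 1 to 100, the sample size of 30, and the 2,000 repetitions are all assumptions made for illustration.

```python
import random
import statistics

rng = random.Random(42)           # fixed seed so the run is repeatable
population = list(range(1, 101))  # hypothetical population: 1..100
population_mean = statistics.mean(population)

sample_size = 30
sample_means = [
    statistics.mean(rng.sample(population, sample_size))  # one fixed-size sample
    for _ in range(2000)
]

print(population_mean)                           # 50.5
print(round(statistics.mean(sample_means), 2))   # close to the population mean
print(round(statistics.stdev(sample_means), 2))  # spread of the sampling distribution
```

Individual sample means differ from one another, but their average settles near the population mean.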
Standard Error
The standard error is the standard deviation of a sampling distribution that is used to estimate
how close the sample statistic, generally the mean, is to the population parameter. Standard
error of mean is denoted by SE. As the sample size n increases, the standard error decreases.
The standard deviation of a sample is used to measure how far the values are from the sample
mean, whereas the standard error of mean is used to measure how far the sample mean is from
the population mean.
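The standard error of the mean is commonly estimated as the sample standard deviation divided by the square root of the sample size. The Python sketch below applies that formula to hypothetical measurements to show the standard error shrinking as n grows.

```python
import math
import statistics


def standard_error(sample):
    """Standard error of the mean: sample standard deviation / sqrt(n)."""
    return statistics.stdev(sample) / math.sqrt(len(sample))


# Hypothetical measurements; the larger sample repeats the same values.
small = [12, 15, 11, 14, 13]
large = small * 20  # 100 values with a similar spread

print(round(standard_error(small), 3))  # 0.707
print(round(standard_error(large), 3))  # 0.142
```

Both samples have a similar standard deviation, so the drop in standard error comes almost entirely from the larger sample size.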
Statistical Estimators
A statistical estimator is a rule or a function that provides an estimate for the population
parameter based on a sample statistic. There are two types of estimators, known as a point
estimator and an interval estimator. A point estimator provides a single value, whereas the
interval estimator provides a range of values. A sample mean is an example of a point
estimator, whereas a confidence interval is an example of an interval estimator.
Confidence Interval
A confidence interval measures the reliability of the estimate for the population parameter,
which has been calculated from a sample. Instead of specifying a point estimate for the
population parameter, such as the mean of the population, it specifies a range or interval
estimate with a probability or confidence level expressed as a percentage of this interval
estimate containing the population parameter.
Although confidence intervals can be calculated at different confidence levels, such as 50%,
90%, or 99%, they are often calculated at a confidence level of 95%.
At best, the true value of the population parameter can only be estimated, and cannot be
determined exactly, because samples are used. For this reason, the confidence interval expresses
the uncertainty related to the sampling method rather than specifying the value of the population
parameter. A 95% confidence interval for the population mean can be interpreted as:
95% of the estimate intervals, calculated from different samples, will contain the population
mean.
- or -
Before a sample is drawn, there is a 95% chance that the estimate interval calculated from it will
contain the population mean.
As shown in Figure 4.12, the higher the confidence level, the wider the interval, although making
the interval too wide reduces the usefulness of this measure. For example, a 99% confidence
interval stating that a Web page load time is between 15 and 25 seconds is less helpful in
estimating the actual load time than a 90% confidence interval stating that the Web page load
time is between 19 and 21 seconds.
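The link between confidence level and interval width can be illustrated with the Python sketch below. It assumes a normal approximation (mean plus or minus z times the standard error), which is only one of several ways to build an interval; for small samples a t-based interval is usually preferred. The load times are hypothetical.

```python
import math
import statistics
from statistics import NormalDist


def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation confidence interval for the population mean."""
    mean = statistics.mean(sample)
    se = statistics.stdev(sample) / math.sqrt(len(sample))
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    return mean - z * se, mean + z * se


# Hypothetical page load times in seconds.
load_times = [19.2, 20.5, 21.1, 19.8, 20.3, 20.9, 19.5, 20.7, 20.1, 19.9]

for level in (0.90, 0.95, 0.99):
    low, high = mean_confidence_interval(load_times, level)
    print(f"{level:.0%}: {low:.2f} to {high:.2f}")
```

Each increase in the confidence level produces a wider interval around the same sample mean.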
Figure 4.12 An example of high and low confidence intervals.
Skewness
Skewness is the amount of asymmetry of a (probability) distribution when measured from the
mean value. A distribution is positively skewed, or skewed to the right, when the tail of the
curve is longer on the right side. In this case the mean is generally greater than the median and
mode, and the majority of the values exist on the left side of the curve.
A distribution is negatively skewed, or skewed to the left, when the tail of the curve is longer
on the left side. In this case the mean is generally less than the median and mode, and the
majority of the values exist on the right side of the curve.
A normal distribution is not skewed. The left and right tails are similar to each other, and the
mean, median, and mode are equal to each other. For example, in Figure 4.13, three
distributions are summarized in bar graphs with a negative skew, without any skew, and with
positive skew.
Figure 4.13 - An example of three distributions with different skews summarized in bar graphs.
Figure 4.14 An example where a value of 0.02 is calculated for the PDF.
Figure 4.15 An example where the probability of values > 30 in the distribution is equal to 0.04.
Distribution Fitting
Generally, random data-generating processes follow certain patterns. As a result, the probability
of a random variable that assumes a certain value or a range of values is somewhat predictable.
Depending upon the nature of the random data-generating process, an appropriate probability
distribution can be selected that fits the observed data in order to describe its nature and make
estimates about its values in terms of probabilities.
In some probability distributions, values are more centered around the mean value, whereas in
other distributions the values are evenly distributed. This behavior gives the probability
distribution a particular shape, from which a number of probability distributions have been
formulated. The shape of the curve of a continuous distribution indicates how values are spread
within the distribution.
Optional Reading
For further discussion on this topic, refer to the Fitting a Model section on page 33 of the Doing
Data Science text book that accompanies this module.
Normal Distribution
A normal distribution, also known as a bell-shaped curve or Gaussian distribution, is a
symmetric continuous probability distribution where the majority of values are found in close
proximity to the mean value. A normal distribution represents commonly occurring data where
most values are close to the average value and only a few values are found at the
extremities, as shown in Figure 4.16.
In a normal distribution, approximately 99.7% of the values are within three standard deviations
of the mean, and the area under the curve is equal to one, as shown in Figure 4.17. A normal
distribution has the same mean, median, and mode.
Figure 4.17 An example of a normal distribution where the area under the curve is equal to one.
Figure 4.18 An example of a normal distribution and a standard normal distribution.
Figure 4.19 An example of the central limit theorem applied to a non-normal population.
Measures of Association
A dataset representing a data-generating process may contain certain variables that are related
to each other based on a pattern such that when the value of one variable changes, the other
one also changes in the same or opposite direction with a proportionate or disproportionate
magnitude. The measures of association quantify the relationship between two variables in a
dataset, and include:
• Correlation
• Covariance
Correlation
As originally introduced in Module 2, correlation is the degree of linear association between two
variables, measured using a correlation coefficient. The relationship is considered to be linear
when the scatter plot of the variables' values results in a straight line, which means that both
variables change in proportion to each other at a constant rate.
Pearson's product-moment correlation coefficient, generally denoted by r, is the correlation
coefficient most commonly used for measuring the correlation between two variables.
The presence of correlation does not constitute causation. Correlation only constitutes a
mathematical association between the variables rather than a factual association. Non-linear
associations may also exist between variables, in which case Spearman's rank correlation can
be used. However, a monotonic relationship must exist between the variables.
A monotonic relationship is one where, as one variable increases, the other either never
decreases or never increases, although it may remain constant over some intervals. Variables
that first increase and then decrease or vice-versa do not constitute such a monotonic
relationship.
A monotonically increasing relationship is where, as x increases, y either increases or remains
constant but never decreases, as shown in Figure 4.20.
Figure 4.21 A monotonically decreasing relationship.
Both the Pearson and the Spearman correlation coefficients have a range of -1 to +1 and are
interpreted in the same manner. The Pearson correlation coefficient is affected by outliers as it
takes into account the actual magnitude of the values. Instead of using the values as is, the
calculation of Spearman's correlation coefficient requires converting original values to ranked
values. As a result, Spearman's correlation coefficient is far less affected by outliers as the
actual magnitude of the values is ignored.
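The difference between the two coefficients can be sketched in plain Python as follows. The function names and the data are assumptions for illustration, and the ranking helper assumes there are no tied values. Spearman's coefficient is computed here as Pearson's r applied to the ranks.

```python
import math


def pearson(x, y):
    """Pearson's r from the sums of products and squares of deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """Rank values from 1 upward (assumes no tied values, for simplicity)."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]


def spearman(x, y):
    """Spearman's coefficient: Pearson's r applied to the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical data: y = x cubed is monotonic but not linear.
x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]
print(round(pearson(x, y), 3), round(spearman(x, y), 3))  # 0.943 1.0
```

Because the relationship is monotonically increasing, the ranks agree perfectly and Spearman's coefficient is 1, while Pearson's r is below 1 because the relationship is not a straight line.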
NOTE
Choose an algorithmic implementation that supports a distributed/parallel architecture, as fitting
millions of records in the main memory of a single machine may not be possible or ideal.
Implementing an algorithm that supports a distributed/parallel architecture can often be
achieved through the introduction of an analytics engine mechanism in a Big Data solution.
Correlation does not imply causation, especially in high-volume datasets as there is a potential
for uncovering several correlations. However, some of these uncovered correlations may be
coincidental or may only exist in a particular version of a dataset. Therefore, validation is
required to confirm the findings and eliminate valid but insignificant correlations, from a business
point of view, by applying domain knowledge. Over time, multiple versions of the dataset should
be analyzed to ascertain if a correlation is of recurring nature before devising an action plan.
datasets. In order to achieve maximum value out of high-volume, high-velocity datasets,
correlations must be discovered as soon as the datasets become available. This requires the
underlying correlation algorithm to support distributed/parallel execution in a Big Data platform.
Optional Reading
For a more in-depth discussion on this topic, see the Correlation Doesn't Imply Causation
section from pages 274-278 of the Doing Data Science text book that accompanies this module.
Covariance
Like correlation, covariance is a measure of how two variables change together. Sample
covariance is denoted by s_xy, while the population covariance is denoted by σ_xy.
However, unlike correlation, its value can be any negative or positive number and is expressed in
the product of the units of the two variables. The value of covariance is therefore dependent on
the units used, meaning the covariance value for inches will be different from the covariance
value for centimeters. The value of correlation is standardized and is not affected by the units
used.
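The dependence of covariance on units can be checked with the Python sketch below. The heights and arm spans are hypothetical, and the helper functions are written out in plain Python as an assumption rather than taken from any particular library.

```python
import math


def sample_covariance(x, y):
    """Sample covariance: sum of products of deviations divided by n - 1."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)


def correlation(x, y):
    """Covariance standardized by both standard deviations."""
    return sample_covariance(x, y) / math.sqrt(
        sample_covariance(x, x) * sample_covariance(y, y)
    )


# Hypothetical heights and arm spans, first in inches, then in centimeters.
height_in = [60, 64, 68, 72]
span_in = [59, 65, 67, 73]
height_cm = [v * 2.54 for v in height_in]
span_cm = [v * 2.54 for v in span_in]

print(round(sample_covariance(height_in, span_in), 2))  # covariance in inches
print(round(sample_covariance(height_cm, span_cm), 2))  # larger value in centimeters
print(round(correlation(height_in, span_in), 4), round(correlation(height_cm, span_cm), 4))
```

Converting both variables to centimeters multiplies the covariance by 2.54 squared, while the correlation stays the same.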
Figure 4.23 An example of Chebyshev's inequality rule.
Empirical Rule
The empirical rule, also known as the 68-95-99.7 rule, states that approximately 68% of the values
within a distribution are within one standard deviation of the mean, 95% of the values are within
two standard deviations of the mean and 99.7% of the values are within three standard deviations
of the mean, as shown in Figure 4.24. Unlike Chebyshev's rule, the empirical rule only applies to
normal distributions.
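The three percentages can be verified against the standard normal distribution with the Python sketch below, which uses NormalDist from the standard statistics module. The function name share_within is an assumption.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)


def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(k, round(share_within(k), 4))  # 0.6827, 0.9545, 0.9973
```

The same proportions hold for any normal distribution, since standardizing shifts and rescales the values without changing the shape.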
Exercise 4.4: Naming and Matching
______________________
2. A dataset contains income figures for over 100,000 individuals and is positively skewed. To
determine the probability of a randomly chosen sample with a mean income greater than
$50,000, the data analyst starts to create a sampling distribution of mean based on a large
sample size. Which rule or theorem is the data analyst applying?
______________________
______________________
4. A normal distribution consisting of tree heights across the country, with known standard
deviation and mean, is being analyzed. Which rule or theorem can be applied to confirm if
the probability of a tree whose height is within two standard deviations from the mean is
0.95?
______________________
Frequency Distribution ___
Confirmatory Data Analysis (CDA)
Hypothesis Testing
A hypothesis is a testable claim or proposition that explains a phenomenon. For example, "Drug
A is better than Drug B." In statistics, this is a claim about the population parameter based on a
sample statistic. Hypothesis testing is the scientific process of assessing whether a claim or
proposition is of significance, and not based on chance.
Understanding hypothesis testing requires knowing the following concepts that are introduced
in this section:
• Null Hypothesis (H0)
• Alternative Hypothesis (H1)
• P-Value
• Type I Error
• Type II Error
• Statistical Significance
Null Hypothesis
Null hypothesis, denoted by H0, states that observations made using the sample data are based
on chance alone, meaning there is no truth behind the observed phenomenon. The null
hypothesis is generally the opposite of the actual hypothesis, and is considered to be true by
default. It is only rejected if there is compelling evidence to the contrary.
Generally, the null hypothesis is stated in terms of equality or status quo, such as "same as,"
and it is the null hypothesis that is actually tested, with the conclusion of the hypothesis
testing stated in terms of H0, such as "reject H0."
• H0 = Drug A has the same effect as Drug B
Alternative Hypothesis
Alternative hypothesis, as denoted by H1 or Ha, is the opposite of null hypothesis, and is
generally accepted when the null hypothesis is rejected.
• H1 = Drug A has a different effect than Drug B
Rejecting the null hypothesis means that there is enough evidence against H0 but not
necessarily that H1 is true; rather, it indicates that H1 is plausible. If H0 is not rejected, it
does not automatically mean that H0 is true, only that there is not enough evidence against H0
in support of H1.
Statistical Significance
The term statistical significance means that an observed effect is unlikely to have occurred by
chance alone. In other words, such a claim or an effect is based on some non-random cause. The
significance level α is a predetermined threshold probability against which statistical
significance is judged. This value is represented as a percentage that is set at the start of the
hypothesis testing, often at a value of 5%. H0 is rejected when the p-value is less than or equal
to α, meaning the test results would be unlikely if H0 were true and so do not support H0. In that
case, the original claim is statistically significant.
P-Value
The p-value is the probability of getting a value, calculated from the sample, as extreme as or
more extreme than the original observed value under the assumption that the null hypothesis is
true. A p-value is used in weighing the test results to establish whether the original claim is
statistically significant or not.
If the p-value is less than or equal to α, then there is strong evidence against H0 and H0 is
rejected. If the p-value is greater than α, then there is weak evidence against H0 and H0 is not
rejected.
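As a minimal sketch of how a p-value is compared against α, the following Python example runs an
exact permutation test on two small samples for Drug A and Drug B. The data values, the function
name permutation_p_value, and the choice of a permutation test are assumptions made purely for
illustration; the workbook does not prescribe a particular test.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(group_a, group_b):
    """Two-sided exact permutation test for a difference in means.

    Under H0 (both groups have the same effect) the group labels are
    interchangeable, so every way of splitting the pooled values into
    two groups is equally likely.
    """
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    observed = abs(mean(group_a) - mean(group_b))

    extreme = 0
    total = 0
    for chosen in combinations(range(len(pooled)), n_a):
        chosen_set = set(chosen)
        a = [pooled[i] for i in chosen_set]
        b = [pooled[i] for i in range(len(pooled)) if i not in chosen_set]
        # Count splits that are as extreme as, or more extreme than, the observed one.
        if abs(mean(a) - mean(b)) >= observed - 1e-12:
            extreme += 1
        total += 1
    return extreme / total


# Hypothetical response measurements for each drug.
drug_a = [12, 14, 15, 16, 18]
drug_b = [20, 21, 23, 24, 26]

alpha = 0.05  # significance level chosen before the test
p_value = permutation_p_value(drug_a, drug_b)
decision = "reject H0" if p_value <= alpha else "fail to reject H0"
print(f"p-value = {p_value:.4f}, decision: {decision}")
```

Because α is fixed before the data is examined, the decision rule stays the same regardless of
which test produces the p-value.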
Visualization
Bar Graph
A bar graph, also known as a bar chart, is a graph generally used to view values of discrete
variables that can be ordinal or nominal, and can also be used to view discrete distributions.
Each discrete value is represented as a category on the x-axis, while the y-axis is used to
display the count of each category, as illustrated in the example in Figure 4.25. Each count
is represented by a rectangle, called a bar, whose height shows the category count.
Generally, there are gaps between the bars in a bar graph.
Line Graph
A line graph is similar to a bar graph and is used for displaying numerical ordinal data where,
instead of using a bar, a single point represents each value and all points are then joined
together using a line. Line graphs are often used to analyze data over time or trends, and should
not be used to display nominal data like product categories. However, ordinal data related to
multiple categories can be shown using a single line graph, as shown in Figure 4.26.
Figure 4.26 - An example of a line graph.
Histogram
A histogram is like a bar graph, but it is used to view values of continuous variables that have
been grouped into intervals. Instead of viewing the frequency of a distribution in tabular form,
a histogram is used to view the distribution in a graphical manner, as shown in Figure 4.27.
Unlike a bar graph, there are no gaps between the bars. With equal intervals, the height of each
bar represents the frequency of the corresponding interval; in all cases, the area of each bar is
proportional to its frequency.
In order to create a histogram, a frequency table with values divided into intervals is required, as
shown in Table 4.6. The intervals must be created without gaps between them, with all values of
a continuous variable covered. Generally, such intervals are equal; however, this is not a
restriction. When unequal intervals are used, a frequency density is calculated for ensuring that
the bar area is in proportion to its frequency, as shown in Figure 4.28.
Table 4.6 - An example of a frequency table with values divided into two intervals.
Figure 4.28 - An example of a histogram with each bar area in proportion to its frequency.
Frequency density shows the concentration of values in a range, and is calculated by dividing the
frequency of an interval by the width of that interval. Instead of the actual frequency,
histograms can also be used to show relative frequencies or probabilities, in which case the
maximum possible value on the y-axis is 1. An example of this is shown in Figure 4.29. Relative
frequencies are proportions of values in each interval. Such a histogram can be created by
dividing the frequency of each interval by the sum of all frequencies.
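The two calculations described above can be sketched in Python as follows. The frequency table is
hypothetical and uses deliberately unequal interval widths; the variable names are illustrative
only.

```python
# Hypothetical frequency table: (lower bound, upper bound, frequency).
# The last interval is twice as wide as the others.
intervals = [(0, 10, 5), (10, 20, 10), (20, 40, 10)]

total_frequency = sum(freq for _, _, freq in intervals)

rows = []
for lower, upper, freq in intervals:
    width = upper - lower
    rows.append({
        "interval": (lower, upper),
        "frequency": freq,
        # Bar height for unequal intervals, so that area = height * width = frequency.
        "frequency_density": freq / width,
        # Proportion of all values that fall in this interval.
        "relative_frequency": freq / total_frequency,
    })

for row in rows:
    print(row)
```

Although the second and third intervals hold the same number of values, the wider interval gets a
lower bar because its values are spread over a larger range.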
Frequency Polygons
Like histograms, frequency polygons can be used to display continuous distributions. However,
these can also be used to compare distributions in terms of their shape, such as whether the
distribution is a normal distribution or skewed, as shown in Figures 4.30 and 4.31. The midpoint
of each interval is used on the x-axis and a point is plotted at the corresponding location on the
y-axis that represents the frequency of the interval.
Figure 4.30 A frequency polygon.
Figure 4.31 A frequency polygon comparing distributions.
Frequency polygons can also be used to view cumulative frequencies. A cumulative frequency
is the total frequency up to a certain interval, as summarized in Table 4.7 and Figure 4.32.
Table 4.7 Cumulative frequency summary in a table.
Figure 4.32 Cumulative frequency summary.
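A cumulative frequency column is simply a running total of the interval frequencies. The short
Python sketch below uses made-up frequencies and upper interval bounds to show the calculation.

```python
from itertools import accumulate

# Hypothetical frequencies for four consecutive intervals.
upper_bounds = [10, 20, 30, 40]
frequencies = [5, 10, 10, 3]

# Running total: each entry is the total frequency up to and including that interval.
cumulative = list(accumulate(frequencies))

for bound, freq, cum in zip(upper_bounds, frequencies, cumulative):
    print(f"up to {bound}: frequency = {freq}, cumulative frequency = {cum}")
```

The last cumulative value always equals the total number of observations.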
Scatter Plot
A scatter plot can be used to view the association between two variables to determine whether
a pattern exists between the variables. It also offers a graphical means of spotting outliers.
Generally, a scatter plot is used to plot variables for correlation and regression analysis.
For regression analysis, the independent variable is plotted on the x-axis and the dependent
variable on the y-axis. Each pair of values is generally marked by a cross or a dot on the graph.
In Figure 4.33, black circles represent overlapping values and highlight the concentration of
values, with red circles indicating outliers.
Figure 4.33 - An example of a scatter plot.
Figure 4.35 An example of a back-to-back stemplot comparing two distributions.
Cross-Tabulation
While not strictly a graphical technique, cross-tabulation, also known as cross-tabs, is a two-way
frequency table used for viewing relationships between two variables. It is also used to evaluate
the performance of a classification model.
The values of the two variables become the actual column or row headers, and the cell values
are the counts of the intersection between the two values. Values from a normal table can be
converted into a cross-tab, as illustrated in the example in Table 4.8.
Table 4.8 - The normal table of individuals to the left is converted into a cross-tab, depicted on the right.
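The conversion from a normal table into a cross-tab can be sketched in Python as below. The
records and the two variables (gender and car ownership) are hypothetical and are not the values
shown in Table 4.8.

```python
from collections import Counter

# Hypothetical "normal" table: one (gender, owns_car) row per individual.
records = [
    ("Male", "Yes"), ("Female", "No"), ("Female", "Yes"),
    ("Male", "Yes"), ("Male", "No"), ("Female", "Yes"),
]

# Count how often each combination of the two values occurs.
counts = Counter(records)

# The distinct values of each variable become the row and column headers.
row_labels = sorted({gender for gender, _ in records})
col_labels = sorted({owns for _, owns in records})

crosstab = {
    row: {col: counts[(row, col)] for col in col_labels}
    for row in row_labels
}

print("Gender".ljust(8) + "".join(col.rjust(5) for col in col_labels))
for row in row_labels:
    print(row.ljust(8) + "".join(str(crosstab[row][col]).rjust(5) for col in col_labels))
```

The same structure, with actual classes as rows and predicted classes as columns, is what is used
when evaluating a classification model.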
Figure 4.36 An example of a box and whisker plot (I).
Quantile-Quantile Plot
A quantile-quantile (q-q) plot is used for comparing distributions with a graph of quantiles of the
two distributions against each other. Based on the similarity of the distributions, q-q plots can be
used to see whether or not the underlying data-generating processes are of similar type.
A q-q plot can also be used to compare observed values against theoretical values or the values
obtained from a model. This provides a means for testing whether a model fits a given
distribution. If the two distributions are the same, the points on the plot follow a 45° line. In
Figure 4.39, quartiles of Distribution A are compared against quartiles of Distribution B.
If the points form a line that is flatter, the distribution plotted on the x-axis has a greater variance
as compared to the distribution plotted on the y-axis, as shown in Figure 4.40.
However, if the points form a steeper line, then the distribution plotted on the y-axis has a
greater variance as compared to the distribution plotted on the x-axis, as shown in Figure 4.41.
If one of the distributions is skewed, then the plot follows an arc, as shown in Figure 4.42.
Any strong deviations from the straight line can indicate the presence of outliers, as shown in
Figure 4.43.
Figure 4.40 Flat line.
Figure 4.41 Steep line.
Figure 4.42 Arc plot.
Figure 4.43 An outlier.
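The relationship between the slope of a q-q plot and the spread of the two distributions can be
checked numerically. The following Python sketch pairs up the deciles of two hypothetical samples
and fits a least squares slope through the resulting points; the sample values and the helper
names qq_points and fitted_slope are assumptions for illustration.

```python
from statistics import quantiles


def qq_points(sample_x, sample_y, n=10):
    """Pair the n - 1 quantile cut points of the two samples (x-axis, y-axis)."""
    return list(zip(quantiles(sample_x, n=n), quantiles(sample_y, n=n)))


def fitted_slope(points):
    """Least squares slope of the straight line through the q-q points."""
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    s_xx = sum((x - mean_x) ** 2 for x, _ in points)
    return s_xy / s_xx


# Hypothetical samples: 'wide' has the same shape as 'narrow' but twice the spread.
narrow = [float(v) for v in range(1, 21)]
wide = [2.0 * v for v in narrow]

slope_same = fitted_slope(qq_points(narrow, narrow))   # identical distributions
slope_steep = fitted_slope(qq_points(narrow, wide))    # y-axis has greater variance
slope_flat = fitted_slope(qq_points(wide, narrow))     # x-axis has greater variance

print(slope_same, slope_steep, slope_flat)
```

A slope near 1 corresponds to the 45° line, a slope above 1 to the steep line, and a slope below 1
to the flat line.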
Lattice Plot
A lattice plot consists of multiple sub-plots arranged in a grid that enables bivariate and
multivariate analyses, with each panel of the grid representing a sub-plot. Different types of
graphs can be plotted as sub-plots for analysis purposes.
Figures 4.44 and 4.45 provide examples of different types of graphs. The first graph shows a
scatter plot of engine size vs. miles per gallon for vehicles with three, four, and five gears. The
second graph is a histogram of miles per gallon for vehicles with three, four, and five gears.
Exercise 4.5: Fill in the Blanks
Correctly identify the visual technique used to display corresponding datasets:
2. ______________________ are often used to analyze data over time or trends, and should
not be used to display nominal data like product categories. However, ordinal data related to
multiple categories can be shown.
3. For a ______________________, a frequency table is first collected with values divided into
intervals. The intervals must be created without gaps between intervals covering all values
of a continuous variable.
7. Data can be graphed visually for further analysis using the following methods:
______________________ plot for comparing two or more distributions and
visualizing a five-number summary. ______________________ plot for comparing exactly
two distributions, and ______________________ plot for managing multiple sub-plots for
bivariate and multivariate analyses.
Part III: Fundamental Big Data Analysis Techniques
The following fundamental analysis techniques will be covered in this section:
• Prediction: Linear Regression
• Classification: k-NN (k-Nearest Neighbors)
• Clustering: k-means
Reading
For a more in-depth discussion on this topic, see the Three Basic Algorithms section on pages
54-55 of the Doing Data Science textbook that accompanies this module.
Prediction: Linear Regression
Linear regression, also known as least squares regression, is a statistical technique for
predicting the values of a continuous dependent variable based on the values of an independent
variable. The dependent and independent variables are also known as response and
explanatory variables respectively.
Linear regression is used to explore the data in order to understand the nature of the
relationship between different variables. As a mathematical relationship between the response
variable and the explanatory variable(s), linear regression assumes that a linear correlation
exists between the response and explanatory variables.
A linear correlation between response and explanatory variables is represented through the line
of best fit, also called the regression line. This is a straight line that passes as closely as
possible through all points on the scatter plot, as illustrated in Figure 4.46.
The linear regression model development starts by expressing the linear relationship in
mathematical form. Once the mathematical form has been established, the next stage is to estimate
the parameters of the model via model fitting. This determines the line of best fit, achieved via
least squares estimation that aims to minimize the sum of squares error (SSE). The last stage is
to evaluate the model using R², mean squared error, or cross-validation.
Being a straight line, the regression line generally cannot pass through every point, and is an
approximation of the actual value of the response variable based on estimated values, as
demonstrated in Figure 4.47. The distance between the actual and the estimated value of the
response variable is the error of estimation. For the best possible estimate of the response
variable, the errors across all points, as represented by the sum of squares error, must be
minimized. The line of best fit is the line that results in the minimum possible sum of squares
error.
Figure 4.47 An example of a straight regression line that cannot pass through all points.
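The least squares estimation described above can be sketched for a single explanatory variable
using the standard closed-form solution. The data points are hypothetical, and the function name
fit_line is an assumption for illustration.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least squares line for one explanatory variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def sum_of_squares_error(xs, ys, intercept, slope):
    """Sum of squared distances between actual and estimated response values."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical explanatory (x) and response (y) values.
x_values = [1, 2, 3, 4, 5]
y_values = [2, 4, 5, 4, 5]

intercept, slope = fit_line(x_values, y_values)
sse = sum_of_squares_error(x_values, y_values, intercept, slope)
print(f"y = {intercept:.2f} + {slope:.2f}x, SSE = {sse:.2f}")

# Any other straight line gives a larger SSE than the line of best fit.
sse_other = sum_of_squares_error(x_values, y_values, intercept + 0.5, slope)
```

Shifting or tilting the fitted line in any direction increases the SSE, which is what makes it the
line of best fit.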
Apart from predicting the value of the response variable, a regression model also describes the
nature of the relationship between the response and the explanatory variables. When the values of
the explanatory variables are on comparable scales, the size of each parameter shows the relative
significance of the respective explanatory variable. The higher the magnitude, the more impact
the explanatory variable has on the response variable. Similarly, the sign of the parameter shows
the direction of the association. A negative sign means negative correlation while a positive
sign means positive correlation.
Mean Squared Error
The mean squared error (MSE) is a measure that tells how close the line of best fit is to the
actual values of the response variable. In other words, the mean squared error quantifies the
variation between the actual values and the estimated values of the response variable as provided
by the regression line, calculated as the average of the squared errors of estimation. The mean
squared error is also used as an estimator of the variance of the errors in the predicted values.
Figure 4.48 - Error term is the actual error, the distance between the point and the point on the grey line. Residual is
the estimated error, the distance between the point and the black line, shown in green.
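A minimal Python sketch of the mean squared error calculation is shown below. It assumes a
hypothetical, already fitted regression line y = 2.2 + 0.6x and divides the sum of squared
residuals by the number of points; some texts divide by the degrees of freedom instead, so the
divisor used here is an assumption.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and estimated values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical data and a hypothetical fitted regression line y = 2.2 + 0.6x.
x_values = [1, 2, 3, 4, 5]
actual_y = [2, 4, 5, 4, 5]
predicted_y = [2.2 + 0.6 * x for x in x_values]

# Residuals: actual value minus the value estimated by the regression line.
residuals = [a - p for a, p in zip(actual_y, predicted_y)]

mse = mean_squared_error(actual_y, predicted_y)
print(f"MSE = {mse:.2f}")
```

A smaller MSE means the regression line sits closer to the actual values of the response variable.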
Coefficient of Determination R²
The coefficient of determination R² is the percentage of variation in the response variable that
is predicted or explained by the explanatory variable, with values that vary between 0 and 1. A
value equal to 0 means that the response variable cannot be predicted from the explanatory
variable, while a value equal to 1 means the response variable can be predicted without any
errors. A value between 0 and 1 gives the fraction of the variation that the model explains.
For a regression with a single explanatory variable, the value of the coefficient of
determination is simply the square of the correlation coefficient r.
The variation refers to the difference between the actual and the mean value of the response
variable. The explainable variation is the difference between the estimated and the mean value
of the response variable.
For example, 0.75 means that 75% of variation in the response variable is explained by the
explanatory variable, while the other 25% remains unexplained and is considered an error.
Instead of simply providing an average value as a measure of fit for the line, the coefficient of
determination provides a value that can be used to gauge the accuracy of the regression model.
The coefficient of determination R² also reveals whether the model is affected by the variation in
the values. A regression model with a lower R² is less stable as compared to one with a higher
R² that estimates well even in the face of variation in the data. For example, the regression
model in Figure 4.50 on the right has a better fit than the model in Figure 4.49 on the left, as
the model on the right has a higher R² value than the one on the left.
Figure 4.49 A regression model with a low R² value.
Figure 4.50 A regression model with a high R² value.
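The definitions above can be checked with a short Python sketch. It fits a least squares line to
hypothetical data, computes R² as one minus the ratio of unexplained variation to total variation,
and compares the result with the square of the correlation coefficient r. The data and variable
names are assumptions for illustration.

```python
from math import sqrt

# Hypothetical explanatory (x) and response (y) values.
x_values = [1, 2, 3, 4, 5]
y_values = [2, 4, 5, 4, 5]

n = len(x_values)
mean_x = sum(x_values) / n
mean_y = sum(y_values) / n

s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(x_values, y_values))
s_xx = sum((x - mean_x) ** 2 for x in x_values)
s_yy = sum((y - mean_y) ** 2 for y in y_values)  # total variation

# Least squares line of best fit.
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x
estimated = [intercept + slope * x for x in x_values]

# Unexplained variation: actual vs. estimated values.
unexplained = sum((y - e) ** 2 for y, e in zip(y_values, estimated))
# Explainable variation: estimated vs. mean value.
explained = sum((e - mean_y) ** 2 for e in estimated)

r_squared = 1 - unexplained / s_yy
r = s_xy / sqrt(s_xx * s_yy)  # correlation coefficient

print(f"R^2 = {r_squared:.2f}, r = {r:.4f}, r^2 = {r * r:.2f}")
```

For this data, 60% of the variation in the response variable is explained by the explanatory
variable, and the remaining 40% is unexplained.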
Linear Regression & Other Techniques
A linear regression model is a kind of correlation between the response and the explanatory
variable(s). By virtue of this characteristic, each explanatory variable can be individually
tested for correlation. Similarly, if the dataset involves time-based elements, then a
simultaneous time series analysis (covered in Module 5) of the response and explanatory
variables may also prove helpful in identifying or testing the relationship between the two.
Linear Regression & High-Veracity Datasets
Low-veracity datasets can adversely impact the accuracy of a regression model. Therefore, it is
necessary to remove any noise during the data acquisition and filtering analysis step of the Big
Data analysis lifecycle, and remove outliers using techniques such as those discussed in the
Outlier Detection section in Module 5.
Low-veracity datasets combined with high volume can pose performance penalties if the
regression model needs to be updated regularly because it will also be unnecessarily applied to
the noise, resulting in the waste of processing resources and time.
Exercise 4.6: Fill in the Blanks
2. A linear correlation between response and explanatory variables is represented through the
______________________ that passes as closely as possible through all points on a
scatter plot.
4. For ______________________, histograms and scatter plots can be used to summarize the
explanatory and response variables to find the respective relevance of each explanatory
variable.
5. With values that vary between 0 and 1, the ______________________ is the percentage of
variation in the response variable that is predicted by the explanatory variable.
Optional Reading
For further discussion on this topic, see the Linear Regression Example on pages 55-68 of the
Doing Data Science textbook accompanying this module.
Classification: k-NN (k-Nearest Neighbors)
k-Nearest Neighbor (k-NN), also known as lazy learning and instance-based learning, is a
black-box classification technique where instances are classified based on their similarity, with a
user-defined (k) number of examples (nearest neighbors). No model is explicitly generated.
Rather, the examples are stored as-is and an instance is classified by first finding the closest k
examples in terms of distance, and then assigning the class based on the class of the majority
of the closest examples.
k-NN is able to classify instances when interactions and relationships that are difficult to explain
and hard to understand exist between a number of features and the target classes. k-NN works
well where the same-class instances share mostly similar feature values and class boundaries
are easily identifiable.
Because of the potentially large number of distance calculations between the examples and the
unseen instance, k-NN is compute-intensive during the classification stage; therefore it is
generally slow and requires large amounts of memory. These issues can be addressed by
running this algorithm in a distributed/parallel environment.
k-NN generally uses Euclidean distance for calculating the closeness between the examples
and unclassified instances. As the distance calculation can be dominated by features measured
in larger units, for example mileage vs. number of doors, feature values are normalized
through min-max normalization or z-score standardization.
Nominal features must be converted into their numerical counterparts by creating new binary
features (0 and 1) for each category of the original nominal feature. The nominal values can
also be compared as-is, in which case the numerical difference is 0 if the values are the same
and 1 otherwise.
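The following Python sketch puts these pieces together: min-max normalization, Euclidean distance,
and a majority vote among the k closest examples. The vehicle data, the class labels, and all
function names are hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import sqrt


def fit_min_max(rows):
    """Return the per-feature minimum and maximum of the stored examples."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]


def scale(row, lows, highs):
    """Min-max normalization: rescale each feature to the range 0 to 1."""
    return tuple(
        (v - lo) / (hi - lo) if hi > lo else 0.0
        for v, lo, hi in zip(row, lows, highs)
    )


def nominal_difference(a, b):
    """Difference for nominal values compared as-is: 0 if the same, 1 otherwise."""
    return 0 if a == b else 1


def euclidean(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_classify(examples, labels, instance, k):
    """Assign the majority class among the k examples closest to the instance."""
    ranked = sorted(zip(examples, labels), key=lambda pair: euclidean(pair[0], instance))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical examples: (mileage, number of doors) with a class label each.
cars = [(20000, 2), (25000, 2), (30000, 2), (90000, 4), (100000, 4), (110000, 5)]
labels = ["sports", "sports", "sports", "family", "family", "family"]

lows, highs = fit_min_max(cars)
scaled_cars = [scale(row, lows, highs) for row in cars]

# The unseen instance is scaled with the same minimum and maximum values.
unseen = scale((27000, 2), lows, highs)
predicted = knn_classify(scaled_cars, labels, unseen, k=3)
print(predicted)
```

Note that no model is built: the examples are stored as-is and every classification requires a
distance calculation against each of them.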
model, and variance refers to the error caused due to variation in the input data. Choosing a
smaller k also means that outliers can affect the classification task.
Choosing the correct value of k depends on the nature of the classification task. For example, if
predicting whether a patient is suffering from a certain disease, it would make sense to err on
the side of caution by choosing a value of k that results in more false positives than false
negatives.
Serious consequences can result from not diagnosing a patient who is actually suffering. However,
when choosing someone to be an astronaut, it may make sense to tune k for getting more false
negatives, as falsely dropping someone who is nearly perfect is not going to result in serious
consequences.
Figure 4.52 illustrates the impact of selecting a smaller and larger k. For k = 1, the closeness
to the outlier, represented by the diamond, results in assigning the class of the outlier example.
When k = 3, classification is unaffected by the outlier, as the majority of example data belongs
to a normal set of values.
Taking the square root of the number of examples is a common rule of thumb for choosing a
starting value for k, although tests must still be performed to validate accuracy.
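A minimal sketch of this approach follows: start from the square root of the number of examples and compare it against other candidate values. Leave-one-out testing is used here as one possible validation method. It is an assumption of this sketch, not a method prescribed by the module, and the data is hypothetical.

```python
import math
from collections import Counter

def knn_predict(examples, query, k):
    """Majority label among the k examples nearest to query."""
    nearest = sorted(examples, key=lambda e: math.dist(e[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def leave_one_out_accuracy(examples, k):
    """Classify each example using all the others and report the hit rate."""
    hits = 0
    for i, (point, label) in enumerate(examples):
        rest = examples[:i] + examples[i + 1:]
        hits += knn_predict(rest, point, k) == label
    return hits / len(examples)

# Hypothetical examples: five of class "a", four of class "b"
examples = [
    ((0.0, 0.0), "a"), ((0.0, 1.0), "a"), ((1.0, 0.0), "a"),
    ((1.0, 1.0), "a"), ((0.5, 0.5), "a"),
    ((10.0, 10.0), "b"), ((10.0, 11.0), "b"),
    ((11.0, 10.0), "b"), ((11.0, 11.0), "b"),
]

k_start = max(1, round(math.sqrt(len(examples))))  # 3 for nine examples
for k in (1, k_start, 7):
    print(k, leave_one_out_accuracy(examples, k))
```

In this small dataset, k = 7 is large enough that the majority class outvotes every instance of the smaller class, which is why the starting value still needs to be tested.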
Within Big Data environments, the impact of choosing a non-optimal value for k decreases, as
there should be a greater number of examples, at close proximity, which will represent the
majority. Even instances belonging to rare classes can be successfully classified due to a larger
representation in the examples.
Optional Reading
For further discussion of this topic, see the k-NN Example on pages 71-81 of the Doing Data
Science textbook.
NOTE
More advanced classification algorithms and the impact of the
five Vs on classification will be discussed in Module 5.
Clustering: k-means
Clustering
Clustering is an unsupervised machine learning technique used to create groups of items where
each group contains similar items but the groups themselves are dissimilar to each other. It is
also known as unsupervised classification, as unlabeled instances are classified according to
the properties of the homogeneous groups.
As an EDA tool to understand the data, clustering can identify any natural grouping within data
or interesting subsets of data for further analysis. Results can be used to pre-process data for
semi-supervised learning, where class labels are created based on the unlabeled training data
that can then be labeled and used for classification, or to select a subset of important features.
While clustering automatically creates homogeneous groups, the machine-generated labels
often carry no real meaning. Humans must analyze the properties of each group and create
meaningful labels as per the nature of the data analysis task, the business domain, or the
individuals to which the data mining results must be communicated.
k-means
k-means is a common clustering algorithm that uses distance as a measure for creating clusters
of homogeneous items. k is a user-defined number that denotes the number of clusters needed
to be created and means refers to the center point of the cluster, or centroid.
The centroid forms the basis for cluster creation around which other similar items that make up
a cluster are located. It is determined from the mean of all point locations that represent the
cluster items in a multidimensional space whose number of dimensions depends on the number
of features of items to cluster. The value of k must be set within 1 ≤ k ≤ n, where n is the total
number of items in the dataset.
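The centroid calculation can be sketched as the per-dimension mean of the points in a cluster. The function name and the sample points are illustrative only.

```python
def centroid(points):
    """Mean of the point locations, computed separately for each dimension."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

# Hypothetical cluster of three items with two features each
cluster = [(1.0, 2.0), (3.0, 4.0), (5.0, 0.0)]
print(centroid(cluster))  # (3.0, 2.0)
```

The same function works for any number of features, because zip(*points) groups the values of each dimension together.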
k-means is similar to k-NN in that it generally uses the same Euclidean distance calculation for
determining closeness between the centroid and the items (represented as points), and in that
it requires the user to specify the k value. Operating in an iterative fashion, k-means begins with less
homogeneous groups of instances and modifies each group during each iteration to attain
increased homogeneity within the group. The process continues until maximum homogeneity
within the groups and maximum heterogeneity between the groups is achieved. The k-means
operation is divided into two stages, assign and update, as defined in the upcoming pages.
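The iteration between the two stages can be sketched as follows. This is a minimal illustration, not a production implementation. It assumes the initial center points are supplied by the caller and that iteration stops once no instance changes cluster, which is one common stopping rule. All names and data are hypothetical.

```python
import math

def assign(points, centers):
    """Assign stage: index of the nearest center for every point."""
    return [
        min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
        for p in points
    ]

def update(points, labels, centers):
    """Update stage: move each center to the mean of its assigned points."""
    new_centers = []
    for c, old in enumerate(centers):
        members = [p for p, label in zip(points, labels) if label == c]
        if not members:
            new_centers.append(old)  # an empty cluster keeps its center
        else:
            new_centers.append(
                tuple(sum(x) / len(members) for x in zip(*members))
            )
    return new_centers

def k_means(points, centers, max_iter=100):
    """Alternate update and assign until the assignments stop changing."""
    labels = assign(points, centers)
    for _ in range(max_iter):
        centers = update(points, labels, centers)
        new_labels = assign(points, centers)
        if new_labels == labels:
            break
        labels = new_labels
    return centers, labels

points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
          (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
initial = [(1.0, 1.0), (2.0, 1.0)]  # deliberately poor starting centers

centers, labels = k_means(points, initial)
print(labels)   # [0, 0, 0, 1, 1, 1]
print(centers)
```

Although both starting centers lie in the same group, the centers shift over the iterations until each group has its own centroid.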
The Assign Stage
Based on the user-specified k value, the algorithm randomly selects k points as cluster center
points that represent the actual instances in a multidimensional feature space and have been
plotted according to the feature values. Each dimension represents a single feature. Instead of
choosing points that actually exist, new points can also be created and chosen as cluster
centers.
Another approach to beginning the assign stage is to arbitrarily allocate instances to k
clusters without selecting any initial center points. When the initial center points are chosen,
each instance is then associated with one of those initial cluster center points that is closest to
it. This closeness is determined by calculating the distance, generally using the Euclidean
distance formula, between the instance (represented by the point) and the initial center point.
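The two ways of starting the assign stage can be sketched as below. The fixed seed is used only to make the sketch repeatable, and the function names and data are hypothetical.

```python
import math
import random

def pick_initial_centers(points, k, seed=0):
    """Randomly select k existing instances as the initial center points."""
    rng = random.Random(seed)
    return rng.sample(points, k)

def arbitrary_allocation(points, k, seed=0):
    """Alternative start: give every instance an arbitrary cluster index."""
    rng = random.Random(seed)
    return [rng.randrange(k) for _ in points]

def nearest_center(point, centers):
    """Index of the center with the smallest Euclidean distance to point."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))

points = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0),
          (5.0, 7.0), (3.5, 5.0), (4.5, 5.0)]

centers = pick_initial_centers(points, k=3)
assignment = [nearest_center(p, centers) for p in points]
print(centers)
print(assignment)
```

Because the initial centers are chosen at random, different seeds can lead to different starting assignments.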
In order to calculate distance, all feature values must be numerical and should be normalized
by adjusting their scale, for example by rescaling 10,000 to 10 if the other feature values lie
between one and ten.
Alternatively, values can be standardized by converting them to z-scores, so that features whose
differences are large do not dominate features with smaller values (income versus age, for
example), or discretized for meaningful results. The resulting clusters can be graphically viewed using
a Voronoi diagram whose lines mark the cluster boundaries. Between two clusters, each line in
the Voronoi diagram depicts the set of points that are equidistant from both center points.
For example, the assign stage can result in a graph of clusters, as shown in Figure 4.53. Where
k = 3, three randomly selected center points represented by stars are initially selected, with
instances allocated to these center points based on their proximity, after calculating their
Euclidean distances.
Figure 4.53 - Stars represent three randomly selected center points, from which the Euclidean distances to the
instances are calculated.
Figure 4.54 - The shifting of the center points and the corresponding cluster boundary shift as a result of the
determination of the cluster centroids.
Figure 4.55 - The reassignment of the highlighted instance (the red circle) from Cluster C to Cluster B as a result of
the shifting of the cluster boundaries.
about the dataset or business constraints, dividing the total number of instances by two and
taking the square root of the result is one way to determine the value of k.
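This rule of thumb can be written as a one-line calculation. The function name is illustrative, and rounding to the nearest whole number is an assumption made here, since k must be an integer.

```python
import math

def rule_of_thumb_k(n_instances):
    """Square root of half the number of instances, rounded, at least 1."""
    return max(1, round(math.sqrt(n_instances / 2)))

print(rule_of_thumb_k(200))   # 10
print(rule_of_thumb_k(5000))  # 50
```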
Retaining instances with missing feature values is important, as such instances may indicate
special groups. Also, removing instances reduces their number, which can impact the
meaningfulness of the generated clusters.
Cluster Distortion
A cluster's degree of homogeneity can be measured by calculating the cluster's distortion. A
cluster's distortion is calculated by taking the sum of squared distances between each of its
points and its centroid. The lower the distortion, the higher the homogeneity, and vice-versa.
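The distortion calculation can be sketched as follows. The two clusters are hypothetical and are chosen so that one is visibly tighter than the other.

```python
def distortion(points, centroid):
    """Sum of squared Euclidean distances between each point and the centroid."""
    return sum(
        (value - center) ** 2
        for point in points
        for value, center in zip(point, centroid)
    )

tight = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (2.0, 2.0)]   # centroid (1.5, 1.5)
loose = [(0.0, 0.0), (0.0, 4.0), (4.0, 0.0), (4.0, 4.0)]   # centroid (2.0, 2.0)

print(distortion(tight, (1.5, 1.5)))  # 2.0
print(distortion(loose, (2.0, 2.0)))  # 32.0
```

The tighter cluster has the lower distortion and is therefore the more homogeneous of the two.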
Optional Reading
For a more in-depth discussion of this topic, see the k-Means example on pages 81-84 of the
Doing Data Science textbook that accompanies this module.
algorithm to support distributed/parallel execution, for efficient and rapid clustering of high-
volume datasets.
Quick mashing-up of a variety of datasets requires a workflow engine mechanism that can
automatically perform various data blending activities in collaboration with the data transfer
engine mechanism(s). However, clustering algorithms based on incremental update
implementations can help to cluster new, additional data within a reduced amount of time to
obtain value faster out of such datasets. The overall value of a clustering effort still requires the
correct interpretation of automatically generated clusters, for which domain expertise is
considered a necessary skill.
1. John, who works for an airline company as a data scientist, is analyzing 5 TB of flight data
in order to predict fuel consumption based on a number of potentially relevant factors such
as altitude, air turbulence, air temperature, air pressure, how often the plane changes
altitude, use of electrical equipment inside the plane, number of engines, weight of reserve
fuel, and thrust change during landing. The underlying Big Data platform runs a number of
other compute-intensive models. Which techniques or algorithms can be applied to develop
an efficient model for predicting fuel consumption based on only relevant factors?
______________________
2. David is working on character recognition software that can match handwritten characters to
a known set of characters belonging to different languages. He has successfully tagged a
number of characters obtained from a variety of handwritten samples from multiple
individuals who are proficient in these languages. Which algorithm can be used to develop
such a model?
______________________
3. Alice, who works for an insurance company, has been asked to analyze a dataset of 8 TB
to determine whether policy holders can be divided into different groups according to
similarities in their profiles. No groups currently exist for reference. Which algorithm should
Alice use to divide the policy holders into a meaningful set of groups?
______________________
4. Robin, who works for the national astronomy association, is tasked with identifying planets
from a very large number of celestial objects. He has already identified a meaningful number
of planets. Which algorithm can Robin use to develop a model for this task?
______________________
5. Elliot is developing a model that can estimate the completion time of a construction project.
He is planning to take into account a number of factors that may impact project completion
time, such as design changes, distance of construction site from the nearest major road,
number of contractors working, skill level of the workforce, and number of accidents. Which
algorithm or technique can Elliot use to develop such a model?
______________________
Exercise Answers
7. Independent variables have values that do not depend on any other variable, but rather
influence other variables. These other variables are known as dependent variables.
8. Random variables can assume a range of values based on probability.
Frequency Distribution B
Normal Distribution C
Probability Distribution D
Sampling Distribution A
1. In a bar graph, each discrete value is represented as a category on the x-axis, while the y-
axis is used to display the count of each category. The actual count is represented using a
rectangle whose height shows the category count.
2. Line graphs are often used to analyze data over time or trends, and should not be used to
display nominal data like product categories. However, ordinal data related to multiple
categories can be shown.
3. For a histogram, a frequency table is first collected with values divided into intervals. The
intervals must be created without gaps between intervals covering all values of a continuous
variable.
4. A scatter plot can be used to view the association between two variables and to find whether a
pattern exists between them. It also offers a graphical means of spotting outliers.
5. Like a histogram, a stem and leaf plot (stemplot) is a graphical technique for analyzing a
distribution that is well-suited for viewing small datasets or samples.
7. Data can be graphed visually for further analysis using the following methods: box and
whisker plot for comparing two or more distributions and visualizing a five-number
summary, quantile-quantile (q-q) plot for comparing exactly two distributions, and lattice
plot for managing multiple sub-plots for bivariate and multivariate analyses.
2. A linear correlation between response and explanatory variables is represented through the
line of best fit (regression line) that passes as closely as possible through all points on a
scatter plot.
3. Mean squared error is known as the estimator for the variance in the predicted value.
4. For multiple linear regression, histograms and scatter plots can be used to summarize the
explanatory and response variables to find the respective relevance of each explanatory
variable.
5. With values that vary between 0 and 1, the coefficient of determination (R²) is the
proportion of variation in the response variable that is predicted by the explanatory
variable.
6. The standard error of estimate measures the accuracy of the predicted values in the
response variable to identify the difference between the estimated values and actual values
and the deviation of values from the regression line.
Exercise 4.7 Answers
1. Correlation
Multiple Linear Regression
2. k-NN
3. k-means
4. k-NN
5. Multiple Linear Regression
Exam B90.04
The course you just completed corresponds to
Exam B90.04, which is an official exam that is part of
the Big Data Science Certified Professional (BDSCP)
program.
This exam can be taken at Pearson VUE testing centers worldwide or via Pearson VUE Online
Proctoring, which enables you to take exams from your home or office workstation with a live
proctor. For more information, visit:
www.bigdatascienceschool.com/exams/
www.pearsonvue.com/arcitura/
www.pearsonvue.com/arcitura/op/ (Online Proctoring)
Contact Information and Resources
AITCP Community
Join the growing international Arcitura IT Certified Professional (AITCP) community by
connecting on official social media platforms: LinkedIn, Twitter, Facebook, and YouTube.
Social media and community links are accessible at:
www.arcitura.com/community
www.servicetechbooks.com/community
Private Instructor-Led Workshops
Certified trainers can deliver workshops on-site at your location with optional on-site proctored
exams. To learn about options and pricing, contact:
[email protected]
or
1-800-579-6582
Automatic Notification
To be automatically notified of changes or updates to the BDSCP program and related resource
sites, send a blank message to:
[email protected]