Unit 3: Exploratory Data Analysis (EDA)
Data analysis involves cleaning, transforming, and analyzing data, and building models to extract specific, relevant insights. These insights are beneficial for making important business decisions in real-time situations. Exploratory Data Analysis is important for any business: it lets data scientists analyze the data before reaching any conclusion and ensures that the results obtained are valid and applicable to business outcomes and goals. This unit provides in-depth information on the techniques and learning models used in Exploratory Data Analysis.
Exploratory Data Analysis (EDA) is one of the techniques used for extracting the vital features and trends that are used by machine learning and deep learning models in Data Science. Thus, EDA has become an important milestone for anyone working in data science. This unit covers the concept, meaning, tools, and techniques of EDA to give complete awareness to a beginner wanting to launch a career in data science. It also lists the fields that regularly apply EDA to promote their business activities.
The Data Science field is now very important in the business world, as it provides many opportunities to make vital business decisions by analyzing data gathered at scale. Understanding the data thoroughly requires exploring it from every aspect. The impactful features enable making meaningful and beneficial decisions; therefore, EDA occupies an invaluable place in Data Science.
The overall objective of exploratory data analysis is to obtain vital insights from the data. This usually includes sub-objectives such as understanding the structure of the data, identifying important variables and their relationships, detecting outliers and missing values, and checking assumptions before modelling.
The role of exploratory data analysis follows from these objectives. After formatting the data, the analysis reveals patterns and trends that help in taking the proper actions required to meet the expected goals of the business. Just as we expect specific tasks to be done by an executive in a particular job position, proper EDA is expected to fully answer queries related to a particular business decision. Since data science involves building models for prediction, those models require the optimum data features to be considered. Thus, EDA ensures that the correct ingredients, in the form of patterns and trends, are made available for training the model to achieve the correct outcome, like a successful recipe. Therefore, carrying out the right EDA with the correct tool on befitting data will help achieve the expected goal.
The key components in an EDA are the main steps undertaken to perform the EDA. These
are as follows:
1. Data Collection
Nowadays, data is generated in huge volumes and various forms belonging to every sector of
human life, like healthcare, sports, manufacturing, tourism, and so on. Every business knows
the importance of using data beneficially by properly analyzing it. However, this depends on
collecting the required data from various sources through surveys, social media, and customer
reviews, to name a few. Without collecting sufficient and relevant data, further activities
cannot begin.
2. Finding All Variables and Understanding Them
When the analysis process starts, the first focus is on the available data, which gives a lot of information. This information contains changing values of various features or characteristics, which helps in understanding them and getting valuable insights from them. This step requires first identifying the important variables that affect the outcome and their possible impact, and it is crucial for the final result expected from any analysis.
3. Cleaning the Dataset
The next step is to clean the data set, which may contain null values and irrelevant information. These are to be removed so that the data contains only values that are relevant and important from the target point of view. This not only reduces time but also the computational power required for estimation. Preprocessing takes care of issues such as null values, outliers, and anomaly detection.
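As a minimal illustration of this cleaning step, the following pandas sketch (with made-up column names and values) counts null values, drops incomplete rows, and flags outliers with the interquartile-range rule.

```python
import pandas as pd
import numpy as np

# Hypothetical dataset with a missing value and an obvious outlier.
df = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 29, 250],        # 250 is an implausible outlier
    "salary": [40000, 52000, 61000, np.nan, 45000, 47000],
})

# 1. Identify null values per column.
print(df.isnull().sum())

# 2. Drop (or impute) rows with missing values.
cleaned = df.dropna()

# 3. Flag outliers using the interquartile range (IQR) rule.
q1, q3 = cleaned["age"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = cleaned[(cleaned["age"] < q1 - 1.5 * iqr) | (cleaned["age"] > q3 + 1.5 * iqr)]
print(outliers)
```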
4. Identify Correlated Variables
Finding the correlation between variables helps to know how a particular variable is related to another. The correlation matrix method gives a clear picture of how different variables correlate, which further helps in understanding vital relationships among them.
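A quick way to obtain such a correlation matrix is pandas' corr() method; the sketch below uses a small dummy DataFrame whose column names are purely illustrative.

```python
import pandas as pd

# Hypothetical numeric dataset; column names and values are made up for illustration.
df = pd.DataFrame({
    "horsepower": [130, 165, 150, 95, 105],
    "weight":     [3504, 3693, 3436, 2372, 2408],
    "mpg":        [18.0, 15.0, 18.0, 24.0, 22.0],
})

# Pearson correlation matrix: values close to +1 or -1 indicate strong relationships.
print(df.corr())
```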
5. Choosing the Right Statistical Methods
As will be seen in later sections, depending on whether the data is categorical or numerical, its size, the type of variables, and the purpose of the analysis, different statistical tools are employed.
6. Visualizing and Analyzing Results
Statistical formulae applied for numerical outputs give fair information, but graphical visuals are more appealing and easier to interpret.
Aspiring data science professionals must understand and practice the above EDA steps to master exploratory data analysis.
Exploratory data analysis is broadly classified as:
1. Univariate
2. Bivariate
3. Multivariate
In univariate analysis, the output is a single variable and all the data collected is for it. There is no cause-and-effect relationship at all. For example, data showing the products produced each month for twelve months. In bivariate analysis, the outcome depends on two variables and the relationship between them is examined, e.g., an employee's age and the salary earned per month.
In multivariate analysis, the outcome depends on more than two variables, e.g., the type of product and quantity sold against the product price, advertising expenses, and discounts offered. The analysis is done on variables that can be numerical or categorical, and the result can be represented in numerical values, visualizations, or graphical form. Accordingly, these analyses can be further classified as non-graphical or graphical.
1. Univariate Non-Graphical
It is the simplest of all types of data analysis used in practice. As the name suggests, uni
means only one variable is considered whose data (referred to as population) is compiled and
studied. The main aim of univariate non-graphical EDA is to find out the details about the
distribution of the population data and to know some specific parameters of statistics. The
significant parameters which are estimated from a distribution point of view are as follows:
● Central Tendency: This term refers to values located at the data's central position or
middle zone. The three generally estimated parameters of central tendency are mean,
median, and mode. Mean is the average of all values in data, while the mode is the
value that occurs the maximum number of times. The Median is the middle value with
equal observations to its left and right.
● Range: The range is the difference between the maximum and minimum value in the
data, thus indicating how much the data is away from the central value on the higher
and lower side.
● Variance and Standard Deviation: Two more useful parameters are variance and standard deviation. Variance is a measure of dispersion that indicates the spread of all data points in a data set. It is the most commonly used measure of dispersion and is the mean of the squared differences between each data point and the mean, while the standard deviation is its square root. The larger the standard deviation, the more widely the data are spread, while a low value indicates that more values cluster near the mean.
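The sketch below computes these univariate non-graphical statistics with pandas on a small dummy sample; the numbers are invented purely for illustration.

```python
import pandas as pd

# Dummy univariate sample.
data = pd.Series([12, 15, 15, 18, 20, 22, 95])    # 95 is a deliberate outlier

print("mean:  ", data.mean())          # average of all values
print("median:", data.median())        # middle value
print("mode:  ", data.mode().iloc[0])  # most frequent value
print("range: ", data.max() - data.min())
print("var:   ", data.var())           # sample variance (ddof=1 by default)
print("std:   ", data.std())           # standard deviation = sqrt(variance)
```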
2. Univariate Graphical
The graphs in this section are based on the Auto MPG dataset available on the UCI repository.
Some common types of univariate graphics are:
● Stem-and-leaf Plots: This is a very simple but powerful EDA method used to display quantitative data in a shortened format. It displays the values in the data set, keeping each observation intact but separating it into a stem (the leading digits) and a leaf (the remaining or trailing digits). However, histograms are mostly used in its place now.
● Histograms (Bar Charts): These plots are used to display both grouped and ungrouped data. Values of the variable are plotted on the x-axis, while the number of observations (frequencies) is plotted on the y-axis. Histograms make it simple to quickly understand your data, revealing properties such as central tendency, dispersion, and outliers. The simplest fundamental graph is a histogram, which is a bar plot in which each bar represents the frequency, i.e., the count or proportion (the ratio of count to the total count of occurrences), for various values.
There are many types of histograms, a few of which are listed below:
1. Simple Bar Charts: These are used to represent categorical variables with
rectangular bars, where the different lengths correspond to the values of the variables.
2. Multiple or Grouped charts: Grouped bar charts are bar charts representing multiple
sets of data items for comparison where a single color is used to denote one specific
series in the dataset.
3. Percentage Bar Charts: These are bar graphs that depict the data in the form of percentages for each observation. A percentage bar chart with dummy values is sketched in the code example after this list.
4. Box Plots: These are used to display the distribution of quantitative values in the data. If the data set consists of categorical variables, the plots can show the comparison between them. Further, if outliers are present in the data, they can be easily identified. These graphs are very useful when comparisons are to be shown as quartiles, i.e., the 25%, 50%, and 75% values.
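As a rough illustration of the plots described above, the following matplotlib/pandas sketch draws a histogram, a percentage bar chart, and a box plot side by side; all values and category names are dummy data.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
mpg = rng.normal(23, 6, 300)                      # stand-in for a numeric column like 'mpg'

fig, axes = plt.subplots(1, 3, figsize=(14, 4))

# Histogram: frequency of values of a single numeric variable.
axes[0].hist(mpg, bins=20)
axes[0].set_title("Histogram")

# Percentage bar chart: each category shown as a share of the total.
counts = pd.Series({"USA": 150, "Europe": 80, "Japan": 70})
(counts / counts.sum() * 100).plot(kind="bar", ax=axes[1])
axes[1].set_ylabel("% of observations")
axes[1].set_title("Percentage bar chart")

# Box plot: quartiles and outliers of the same numeric variable.
axes[2].boxplot(mpg)
axes[2].set_title("Box plot")

plt.tight_layout()
plt.show()
```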
3. Multivariate Non-Graphical
The multivariate non-graphical exploratory data analysis technique is usually used to show
the connection between two or more variables with the help of either cross-tabulation or
statistics.
4. Multivariate Graphical
Graphics are used in multivariate graphical data analysis to show the relationships between two or more variables. Here the outcome depends on more than two variables, while the change-causing variables can also be multiple.
A) Scatter Plot
The essential graphical EDA technique for two quantitative variables is the scatter plot: one variable appears on the x-axis and the other on the y-axis, so every case in your dataset appears as a point. This can be used for bivariate analysis.
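A minimal matplotlib sketch of a scatter plot for two quantitative variables follows; the weight and mpg values are dummy numbers used only for illustration.

```python
import matplotlib.pyplot as plt

# Dummy bivariate data: one quantitative variable per axis.
weight = [2372, 2408, 3436, 3504, 3693]
mpg    = [24.0, 22.0, 18.0, 18.0, 15.0]

plt.scatter(weight, mpg)
plt.xlabel("weight")
plt.ylabel("mpg")
plt.title("Scatter plot of mpg vs. weight")
plt.show()
```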
B) Multivariate Chart
A multivariate chart is a type of control chart used to monitor two or more interrelated process variables. It is beneficial in situations such as process control, where engineers need to monitor multiple parameters together in a single chart. A notable advantage of using multivariate charts is that they help minimize the total number of control charts needed for organizational processes. Pair plots generated using the Seaborn library are a good example of multivariate charts, as they help visualize the relationships between all numerical variables in the entire dataset at once.
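The sketch below builds such a pair plot with Seaborn. It assumes an internet connection for sns.load_dataset, which fetches a small demo 'mpg' dataset; any numeric DataFrame of your own would work the same way.

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Seaborn ships a few demo datasets; 'mpg' is used here purely for illustration.
df = sns.load_dataset("mpg").dropna()

# One scatter plot for every pair of numeric columns, with histograms on the diagonal.
sns.pairplot(df[["mpg", "horsepower", "weight", "acceleration"]])
plt.show()
```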
C) Run Chart
A run chart is a line chart of data drawn over time. In other words, a run chart visually illustrates process performance or data values in a time sequence. Rather than summary statistics, seeing data across time yields a more accurate conclusion. A trend chart or time series plot is another name for a run chart. The sketch below plots dummy sales values over a period of time.
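A minimal run-chart sketch with dummy monthly sales values, including the median reference line often drawn on run charts:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Dummy monthly sales values plotted in time order.
sales = pd.Series(
    [120, 135, 128, 150, 160, 155, 170, 165, 180, 175, 190, 200],
    index=pd.date_range("2023-01-01", periods=12, freq="MS"),
)

sales.plot(marker="o")
plt.axhline(sales.median(), linestyle="--", label="median")  # typical run-chart reference line
plt.ylabel("sales")
plt.title("Run chart of monthly sales")
plt.legend()
plt.show()
```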
D) Bubble Chart
Bubble charts are scatter plots that display multiple circles (bubbles) in a two-dimensional plot. They are used to assess the relationships between three or more numeric variables. In a bubble chart, every single dot corresponds to one data point, and the values of the variables for each point are encoded by its horizontal position, vertical position, dot size, and dot color.
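A minimal bubble-chart sketch in matplotlib, where dummy price, units-sold, and advertising-spend values supply the x position, y position, and bubble size respectively:

```python
import matplotlib.pyplot as plt

# Three numeric variables: x position, y position, and bubble size; a fourth could map to color.
price      = [10, 15, 20, 25, 30]
units_sold = [500, 420, 350, 260, 200]
ad_spend   = [50, 80, 120, 160, 200]   # drives bubble size

plt.scatter(price, units_sold, s=[a * 3 for a in ad_spend], alpha=0.5)
plt.xlabel("price")
plt.ylabel("units sold")
plt.title("Bubble chart (bubble size = advertising spend)")
plt.show()
```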
E) Heat Map
A heat map is a colored graphical representation of multivariate data structured as a matrix
of columns and rows. The heat map transforms the correlation matrix into color coding
and represents these coefficients to visualize the strength of correlation among variables. It
assists in finding the best features suitable for building accurate Machine Learning
models.
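The following Seaborn sketch turns a correlation matrix into a heat map; the DataFrame here is random dummy data with one deliberately correlated column.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Dummy numeric dataset; in practice df would be your own DataFrame.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 4)), columns=["a", "b", "c", "d"])
df["e"] = df["a"] * 0.8 + rng.normal(scale=0.2, size=100)   # correlated with 'a'

# Color-coded correlation matrix with the coefficients annotated in each cell.
sns.heatmap(df.corr(), annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation heat map")
plt.show()
```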
Apart from the above, there is also the ‘Classification or Clustering analysis’ technique
used in EDA. It is an unsupervised type of machine learning used for the classification of
input data into specified categories or clusters exhibiting similar characteristics in various
groups. This can be further used to draw important interpretations in EDA.
Exploratory Data Analysis Tools
1. Python
Python is used for different tasks in EDA, such as finding missing values in data collection,
data description, handling outliers, obtaining insights through charts, etc. The syntax for EDA
libraries like Matplotlib, Pandas, Seaborn, NumPy, Altair, and more in Python is fairly simple
and easy to use for beginners. You can find many open-source packages in Python, such as
D-Tale, AutoViz, PandasProfiling, etc., that can automate the entire exploratory data analysis
process and save time.
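A minimal sketch of a quick first pass over a dataset with plain pandas is shown below; the file name is hypothetical, and the automated-report lines are left commented because the profiling package has been renamed (pandas-profiling is now distributed as ydata-profiling), so the exact import depends on the installed version.

```python
import pandas as pd

df = pd.read_csv("your_data.csv")   # hypothetical file path

# Quick manual EDA with plain pandas.
print(df.shape)            # rows and columns
df.info()                  # dtypes and non-null counts
print(df.describe())       # summary statistics for numeric columns
print(df.isnull().sum())   # missing values per column

# Optional automated report (package renamed from pandas-profiling to ydata-profiling):
# from ydata_profiling import ProfileReport
# ProfileReport(df).to_file("eda_report.html")
```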
2. R
R is an open-source programming language widely used for statistical computing and graphics. Packages such as ggplot2, dplyr, and tidyr are commonly used to explore, summarize, and visualize data during EDA.
3. MATLAB
MATLAB is a well-known commercial tool among engineers since it has very strong mathematical computation abilities. Due to this, it is possible to use MATLAB for EDA, but it requires some basic knowledge of the MATLAB programming language.
EDA assists data analysts in identifying crucial trends quickly through data visualizations
using various graphs, such as box plots and histograms. Businesses also expect to make some
unexpected discoveries in the data while performing EDA, which can help improve certain
existing business strategies.
EDA can assist data analysts in identifying significant mistakes, abnormalities, or missing
values in the existing dataset. Handling the above entities is critical for any organization
before beginning a full study as it ensures correct preprocessing of data and may help save a
significant amount of time by avoiding mistakes later when applying machine learning
models.
Statistics:
1. Descriptive Statistics
2. Inferential Statistics
Descriptive Statistics
Descriptive statistics are brief informational coefficients that summarize a given data set, which can be either a representation of the entire population or a sample of a population. In other words, descriptive statistics is a means of describing the features of a data set by generating summaries about data samples. It is often presented as a summary that explains the contents of the data. For example, a population census may include descriptive statistics regarding the ratio of men and women in a specific city. Descriptive statistics, in short, help describe and understand the features of a specific data set by giving short summaries about the sample and measures of the data. Descriptive statistics are broken down into measures of central tendency and measures of variability (spread). Measures of central tendency include the mean, median, and mode, while measures of variability include standard deviation, variance, minimum and maximum values, kurtosis, and skewness.
All descriptive statistics are either measures of central tendency or measures of variability,
also known as measures of dispersion.
1. Central Tendency (Mean, Median, Mode)
Measures of central tendency focus on the average or middle values of data sets, whereas
measures of variability focus on the dispersion of data. These two measures use graphs, tables
and general discussions to help people understand the meaning of the analyzed data.
Measures of central tendency describe the center position of a distribution for a data set. A
person analyzes the frequency of each data point in the distribution and describes it using
the mean, median, or mode, which measures the most common patterns of the analyzed data
set.
2. Measures of Variability
Measures of variability (or measures of spread) aid in analyzing how dispersed the distribution is for a set of data. For example, while the measures of central tendency may give a person the average of a data set, they do not describe how the data is distributed within the set.
So while the average of the data may be 65 out of 100, there can still be data points at both 1 and 100. Measures of variability help communicate this by describing the shape and spread of the data set. Range, quartiles, absolute deviation, and variance are all examples of measures of variability.
Consider the following data set: 5, 19, 24, 62, 91, 100. The range of that data set is 95, which
is calculated by subtracting the lowest number (5) in the data set from the highest (100).
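The same figures can be reproduced with NumPy; the sketch below computes the range, quartiles, and variance of that data set.

```python
import numpy as np

data = np.array([5, 19, 24, 62, 91, 100])

print("range:    ", data.max() - data.min())          # 100 - 5 = 95
print("quartiles:", np.percentile(data, [25, 50, 75]))
print("variance: ", data.var(ddof=0))                 # population variance
print("std dev:  ", data.std(ddof=0))
```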
3. Distribution
Distribution (or frequency distribution) refers to the number of times a data point occurs or, alternatively, how often it fails to occur. Consider the data set: male, male, female, female, female, other. The distribution of this data can be summarized as: male occurs twice, female occurs three times, and other occurs once.
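With pandas, such a frequency distribution is obtained directly with value_counts(), as in this small sketch using the same categories:

```python
import pandas as pd

responses = pd.Series(["male", "male", "female", "female", "female", "other"])

# Frequency distribution: counts and relative frequencies of each category.
print(responses.value_counts())
print(responses.value_counts(normalize=True))
```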
Inferential Statistics
Inferential statistics helps study a sample of data and make conclusions about its population. In inferential statistics, predictions are made by taking any group of data in which you are interested. It can be defined as taking a random sample of data from a population in order to describe and make inferences about that population. Any group of data that includes all the data you are interested in is known as a population. Inferential statistics basically allows you to make predictions by taking a small sample instead of working on the whole population.
Descriptive statistics have a different function than inferential statistics, which uses data sets to make decisions or apply characteristics from one data set to another.
Imagine another example where a company sells hot sauce. The company gathers data such
as the count of sales, average quantity purchased per transaction, and average sale per day of
the week. All of this information is descriptive, as it tells a story of what actually happened in
the past. In this case, it is not being used beyond being informational.
Let's say the same company wants to roll out a new hot sauce. It gathers the same sales data above, but it crafts the information to make predictions about what the sales of the new hot sauce will be. The act of using descriptive statistics and applying their characteristics to a different data set makes this inferential statistics. We are no longer simply summarizing data; we are using it to predict what will happen next.
One of the main benefits of using descriptive statistics is that they can simplify and organize
large amounts of data into a few numbers or graphs. This can make it easier to grasp the main
features and patterns of your data, as well as identify any outliers or errors. Descriptive
statistics can also help you compare different groups or variables within your data, such as by
using frequency tables, bar charts, or box plots. Descriptive statistics can provide a useful
overview and foundation for further analysis of your data.
One of the main limitations of using descriptive statistics is that they cannot tell you anything
about the relationships, causes, or effects of your data. Descriptive statistics only describe
what the data is, not why it is that way or what it means. For example, you can use
descriptive statistics to calculate the mean and standard deviation of test scores, but you
cannot use them to infer whether the test was easy or hard, or whether the scores were
influenced by other factors. To answer these questions, you need to use inferential statistics,
which test hypotheses and estimate probabilities based on your data.
Hypothesis generation is an educated "guess" about the various factors impacting the business problem that needs to be solved using machine learning. In framing a hypothesis, the data scientist must not know the outcome of the hypothesis, which is generated without reference to any evidence. A hypothesis may be simply defined as a guess; a scientific hypothesis is an intelligent guess.
Hypothesis Testing:
Hypothesis testing is a formal procedure for investigating our ideas about the world
using statistics. It is most often used by scientists to test specific predictions, called
hypotheses, that arise from theories.
1. State your research hypothesis as a null hypothesis (H0) and an alternate hypothesis (Ha or H1).
2. Collect data in a way designed to test the hypothesis.
3. Perform an appropriate statistical test.
4. Decide whether to reject or fail to reject your null hypothesis.
5. Present the findings in your results and discussion section.
The Null Hypothesis is the assumption that the event will not occur. A null hypothesis has no
bearing on the study's outcome unless it is rejected.
The Alternate Hypothesis is the logical opposite of the null hypothesis. The acceptance of the
alternative hypothesis follows the rejection of the null hypothesis. H1 is the symbol for it.
The null hypothesis is typically an equality hypothesis between population parameters; for example, a null hypothesis may claim that the population mean return equals zero. The alternate hypothesis is essentially the inverse of the null hypothesis (e.g., the population mean return is not equal to zero). As a result, they are mutually exclusive, and only one can be correct; one of the two, however, will always be correct.
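As a rough illustration of these steps, the sketch below runs a one-sample t-test with SciPy against the null hypothesis that the population mean return is zero; the sample of returns is randomly generated dummy data.

```python
import numpy as np
from scipy import stats

# Hypothetical sample of daily returns; H0: population mean return = 0.
rng = np.random.default_rng(42)
returns = rng.normal(loc=0.05, scale=1.0, size=50)

t_stat, p_value = stats.ttest_1samp(returns, popmean=0.0)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")

alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis (mean return differs from zero).")
else:
    print("Fail to reject the null hypothesis.")
```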
What is Clustering?
Clustering or cluster analysis is a machine learning technique, which groups the unlabelled
dataset. It can be defined as "A way of grouping the data points into different clusters,
consisting of similar data points. The objects with the possible similarities remain in a group
that has less or no similarities with another group."
It does this by finding similar patterns in the unlabelled dataset, such as shape, size, color, and behavior, and divides the data according to the presence or absence of those patterns.
Clustering itself can be categorized into two types: hard clustering and soft clustering. In hard clustering, a data point can belong to only one cluster, while in soft clustering the output is a probability (likelihood) of the data point belonging to each of the pre-defined number of clusters, so a data point can also be associated with other groups. Apart from this broad division, various other clustering approaches exist. Below are the main clustering methods used in machine learning:
Partitioning Clustering
This is a type of clustering that divides the data into non-hierarchical groups. It is also known as the centroid-based method. The most common example of partitioning clustering is the K-Means clustering algorithm.
In this type, the dataset is divided into a set of k groups, where K defines the number of pre-defined groups. Cluster centers are created in such a way that the distance between data points and their own cluster centroid is minimal compared to the distance to other cluster centroids.
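A minimal K-Means sketch with scikit-learn on dummy two-dimensional data (three artificial groups) might look like this:

```python
import numpy as np
from sklearn.cluster import KMeans

# Dummy 2-D data with three loose groups.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal([0, 0], 0.5, size=(50, 2)),
    rng.normal([5, 5], 0.5, size=(50, 2)),
    rng.normal([0, 5], 0.5, size=(50, 2)),
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)   # one centroid per cluster
print(kmeans.labels_[:10])       # cluster assignment of the first 10 points
```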
Density-Based Clustering
In this method, the clusters are created based upon the density of the data points which are
represented in the data space. The regions that become dense due to the huge number of data
points residing in that region are considered as clusters.
The data points in the sparse regions (regions with very few data points) are considered noise or outliers. The clusters created by these methods can be of arbitrary shape. Examples of density-based clustering algorithms include DBSCAN (Density-Based Spatial Clustering of Applications with Noise) and OPTICS (Ordering Points To Identify the Clustering Structure).
OPTICS follows a similar process as DBSCAN but overcomes one of its drawbacks, i.e.
inability to form clusters from data of arbitrary density. It considers two more parameters
which are core distance and reachability distance. Core distance indicates whether the data
point being considered is core or not by setting a minimum value for it.
Reachability distance is the maximum of the core distance and the value of the distance metric used for calculating the distance between two data points. One thing to note about reachability distance is that it is undefined if the reference data point is not a core point.
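A short scikit-learn sketch of both density-based algorithms on dummy data; eps and min_samples are illustrative values that would normally be tuned to the data.

```python
import numpy as np
from sklearn.cluster import DBSCAN, OPTICS

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal([0, 0], 0.3, size=(60, 2)),   # dense blob
    rng.normal([4, 4], 0.3, size=(60, 2)),   # second dense blob
    rng.uniform(-2, 6, size=(10, 2)),        # sparse noise points
])

db = DBSCAN(eps=0.5, min_samples=5).fit(X)
print("DBSCAN labels (-1 = noise):", set(db.labels_))

# OPTICS adds core distances and reachability distances, so it can handle
# clusters of varying density better than a single global eps.
op = OPTICS(min_samples=5).fit(X)
print("OPTICS labels (-1 = noise):", set(op.labels_))
```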
Distribution Model-Based Clustering
In the distribution model-based clustering method, the data is divided based on the probability that a data point belongs to a particular distribution. The grouping is done by assuming some distribution, most commonly the Gaussian distribution.
An example of this type is the Expectation-Maximization clustering algorithm, which uses Gaussian Mixture Models (GMM).
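A minimal scikit-learn sketch of EM-based clustering with a Gaussian Mixture Model on dummy data; predict_proba exposes the soft cluster memberships.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0, 1, size=(100, 2)),
    rng.normal(6, 1, size=(100, 2)),
])

# Expectation-Maximization fits a mixture of Gaussians; predict_proba gives
# soft (probabilistic) cluster memberships for each point.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gmm.means_)
print(gmm.predict_proba(X[:3]))
```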
Hierarchical Clustering:
In agglomerative clustering, each data point initially acts as its own cluster, and the clusters are then merged one by one. This is one of the most sought-after clustering methods. Divisive clustering is the opposite of agglomerative: it starts with all points in one cluster and divides them to create more clusters. These algorithms create a distance matrix of all the existing clusters and perform the linkage between clusters depending on the linkage criterion. The clustering of the data points is represented using a dendrogram. There are different types of linkages:
o Single Linkage: In single linkage, the distance between two clusters is the shortest distance between points in those two clusters.
o Complete Linkage: In complete linkage, the distance between two clusters is the farthest distance between points in those two clusters.
o Average Linkage: In average linkage, the distance between two clusters is the average distance of every point in one cluster to every point in the other cluster.
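A minimal SciPy sketch of agglomerative clustering on dummy data follows; changing method to 'single', 'complete', or 'average' selects the linkage criterion described above.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, size=(10, 2)), rng.normal(4, 0.5, size=(10, 2))])

# Agglomerative clustering: 'single', 'complete', or 'average' selects the linkage.
Z = linkage(X, method="average")

dendrogram(Z)                       # tree of merges
plt.title("Dendrogram (average linkage)")
plt.show()

labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 clusters
print(labels)
```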
Fuzzy Clustering
In fuzzy clustering, the assignment of the data points in any of the clusters is not decisive.
Here, one data point can belong to more than one cluster. It provides the outcome as the
probability of the data point belonging to each of the clusters. One of the algorithms used in
fuzzy clustering is Fuzzy c-means clustering.
This algorithm is similar in approach to K-Means clustering but differs in the parameters involved in the computation, such as the fuzzifier and the membership values. In this type of clustering method, each data point can belong to more than one cluster. The technique allocates a membership value for each data point with respect to each cluster center, based on the distance between the cluster center and the data point.
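The sketch below is a bare-bones NumPy rendition of the fuzzy c-means update rules on dummy data (libraries such as scikit-fuzzy provide full implementations); it is meant only to show how soft memberships and centers are recomputed.

```python
import numpy as np

# Minimal sketch of the fuzzy c-means update rules (not a full library implementation).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, size=(30, 2)), rng.normal(4, 0.5, size=(30, 2))])

c, m, n_iter = 2, 2.0, 50                    # number of clusters, fuzzifier, iterations
U = rng.random((len(X), c))
U = U / U.sum(axis=1, keepdims=True)         # random memberships that sum to 1 per point

for _ in range(n_iter):
    Um = U ** m
    centers = (Um.T @ X) / Um.sum(axis=0)[:, None]           # membership-weighted centers
    dist = np.linalg.norm(X[:, None, :] - centers[None], axis=2) + 1e-10
    # Membership update: closer centers receive higher membership values.
    p = 2 / (m - 1)
    U = 1.0 / (dist ** p * (1.0 / dist ** p).sum(axis=1, keepdims=True))

print(U[:5])      # soft memberships of the first five points
print(centers)
```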
What is Association?
Association rule learning is a type of unsupervised learning technique that checks for the dependency of one data item on another data item and maps them accordingly so that the result can be more profitable. It tries to find interesting relations or associations among the variables of a dataset.
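As a small illustration, the sketch below mines association rules from a tiny one-hot encoded basket table; it assumes the mlxtend package is installed, and all item names and thresholds are dummy choices.

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# Tiny one-hot encoded transaction table (each row = one basket); dummy data.
transactions = pd.DataFrame({
    "bread":  [1, 1, 0, 1, 1],
    "butter": [1, 1, 0, 0, 1],
    "milk":   [0, 1, 1, 1, 1],
}, dtype=bool)

# Frequent itemsets that appear in at least 40% of the baskets.
itemsets = apriori(transactions, min_support=0.4, use_colnames=True)

# Rules such as {bread} -> {butter}, ranked by confidence.
rules = association_rules(itemsets, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```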