
Exploratory Data Analysis

6.1 Introduction
In the previous chapter, we saw that data preprocessing is an important step in
the data science pipeline. Once we have the pre-processed data, we have to
choose suitable machine learning algorithms to model it. However, before
applying machine learning, we have to answer the following questions:

• How to find the structure of the data?
• How to test assumptions about the data?
• How to select the features that can be used for machine learning methods?
• How to choose suitable machine learning algorithms to model our data set?

Exploratory Data Analysis (EDA) is a process to get familiar with the structure
and important features of a data set. EDA helps us answer the aforementioned
questions by providing a good understanding of what the data contains. It
explores the pre-processed data using suitable visualization tools to find the
structure of the data, its salient features and important patterns.
EDA is also intended to define and refine the selection of important features.
Once EDA is complete, we can perform more complex modeling and machine
learning tasks on these features, such as clustering, regression and
classification.
The goal of EDA is to make sure that the dataset is ready to be used by a
machine learning algorithm. EDA is a valuable step in a data science project:
it validates the results and makes their interpretation easy in the desired
business context. In the following sections, we explain the process of
exploratory data analysis along with practical Python examples.

6.2 Revealing Structure of Data


Knowing the underlying structure of our data enables us to use appropriate
machine learning methods for data modeling and future predictions using
these models. In EDA, several techniques are employed to reveal the
structure of the data. These include:

Univariate visualization is used to generate summary statistics for each
feature or variable in a dataset. We summarize our dataset through descriptive
statistics, which use a variety of statistical measurements to better
understand it. This is also known as data profiling. The goal of univariate
visualization is to develop a solid understanding of the data so we can start
querying and visualizing it in various ways. It uses visualization tools such
as bar plots and histograms to reveal the structure of the data.
Bivariate visualization is performed to find the relationship between two
variables in a given dataset, where one of the two variables can be the target
variable of interest. It uses correlations, scatter plots and line plots to
reveal the structure of the data.
Multivariate visualization is employed to understand interactions between
different fields in the dataset. It uses line plots, scatter plots and matrices
with multiple colors to understand the relationship between various features
of a dataset.
Revealing the underlying structure of data enables us to discover patterns,
spot anomalies such as missing values and outliers, and check assumptions
about the data.
We first download a real-world dataset to reveal its underlying structure.
Suppose we want to predict the price of a house based upon already
available data of houses in a particular city or country. Fortunately, we
have a dataset of house prices in the United States of America:
USA_Housing dataset. This dataset can be downloaded from either
https://www.kaggle.com/aariyan101/usa-housingcsv
or
https://raw.githubusercontent.com/bcbarsness/machine-learning/master/USA_Housing.csv
The USA_Housing dataset contains the following columns or variables:
1. Avg. Area Income: It shows the average income of the residents of
the city.
2. Avg. Area House Age: It shows the average age of houses located
in the same city.
3. Avg. Area Number of Rooms: It shows the average number of
rooms for the houses located in the same city.
4. Avg. Area Number of Bedrooms: It shows the average number of
bedrooms for the houses located in the same city.
5. Area Population: It shows the average population of the city where
the house is located.
6. Price: It shows the price that the house is sold at.
7. Address: It shows the address of the house.
Let us start to explore this dataset. We import necessary libraries first
by typing the following commands.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
We have imported Matplotlib, a commonly used Python library for data
visualization. The more advanced visualization library Seaborn is built
on top of Matplotlib. We use both libraries for plotting and visualizing
our dataset.
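Throughout the rest of this chapter, the dataset is assumed to be loaded into a DataFrame named housing_price. The following sketch shows the read_csv pattern; it uses a tiny inline sample so the snippet is self-contained, but in practice you point read_csv at the GitHub URL given above or at a local file path.

```python
import io
import pandas as pd

# In practice, point read_csv at the GitHub URL given above or at a
# local path, e.g. housing_price = pd.read_csv('USA_Housing.csv').
# A tiny inline sample is used here so the snippet is self-contained.
sample = io.StringIO(
    "Avg. Area Income,Avg. Area House Age,Price\n"
    "79545.46,5.68,1059033.56\n"
    "79248.64,6.00,1505890.91\n"
)
housing_price = pd.read_csv(sample)

print(housing_price.shape)           # number of (rows, columns)
print(list(housing_price.columns))   # column names
```

Loading the full USA_Housing CSV the same way yields a DataFrame of shape (5000, 7), as confirmed later in this chapter.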

Requirements – Anaconda, Jupyter, and Matplotlib

The Python scripts have been executed using a Jupyter notebook. Thus,
we should have Jupyter installed. Since the Anaconda distribution
includes the Matplotlib, NumPy and Pandas libraries, we do not need to
install them separately.

Hands-on Time – Source Codes

The Jupyter notebook containing the source code given in this chapter can
be found in Resources/Chapter 6.ipynb. We suggest that the reader writes
all the code given in this chapter to verify the outputs mentioned in this
chapter.

6.3.1 Line Plot

Line plots are predominantly useful for conveying changes over space
or time. They are mostly used to plot data along a scale divided
into equal intervals, for example, time. Let us generate a simple line
plot.
import matplotlib.pyplot as plt
x1 = [1,2,3]
y1 = [2,4,1]
plt.plot(x1, y1, label = "line 1")
plt.show()

Output:

In this graph, we generate three values each for the variables x1 and y1.
To generate a line plot via the pyplot module, we call the function plot()
and pass it the values for the x and y axes. It is important to mention
that we are using plt as an alias for pyplot.
We can add titles, labels and legends to the generated plots. To do so,
we use the title, xlabel, ylabel and legend methods of the pyplot module,
respectively. We pass string values to these methods, which appear on the
plot as shown below.
x1 = [1,2,3]
y1 = [2,4,1]

plt.plot(x1, y1, label = "line 1")

# naming the x axis
plt.xlabel('x - axis')
# naming the y axis
plt.ylabel('y - axis')
# giving a title to my graph
plt.title('My first line plot')
plt.legend()
plt.show()

Output:

Multiple plots can be drawn on the same figure, as shown in the Python
script given below.
x1 = np.arange(0, 10*(np.pi), 0.1) # generating x and y points to plot
y1 = np.cos(x1) # cosine function from the NumPy library
plt.plot(x1, y1, label = 'cosine') # plotting the points
plt.legend() # shows the legend with labels
plt.xlabel('Angle') # naming the x axis
plt.ylabel('Amplitude') # naming the y axis
x2 = np.arange(0, 10*(np.pi), 0.1) # generating x and y points to plot
y2 = np.cos(x2)*0.1*np.arange(10*(np.pi), 0, -0.1) # decaying cosine function
plt.plot(x2, y2, label = 'Decaying cosine') # plotting the points
plt.legend() # show a legend on the plot
plt.title('Two functions on the same graph') # gives a title to the plot
plt.show() # shows the graph, and removes the text output.

Output:

Let us plot Avg. Area Income against Price from the USA_Housing dataset.
plt.plot(housing_price['Avg. Area Income'], housing_price['Price'], color='red', marker='o')
plt.title('Avg. Area Income Vs Price', fontsize=14)
plt.xlabel('Avg. Area Income', fontsize=14)
plt.ylabel('Price', fontsize=14)
plt.grid(True)
plt.show()

Output:

Since there are 5000 observations, it is evident that a line plot does not
give us a clear picture of the relationship between these two variables.

6.3.2 Scatter Plot


A scatter plot is used to visualize any two variables from a given dataset
in two-dimensions. It uses dots or marks to plot values of two variables,
one along the x-axis and the other along the y-axis.
Scatter plots allow us to observe the relationship between the variables.
If an increase in one variable is accompanied by an increase in the other,
and vice versa, we can conclude that there is a positive linear
relationship between the two variables. However, if an increase in the
first variable is accompanied by a decrease in the second, we say that
there is a negative linear relationship between them. For example, let
us generate a scatter plot between Avg. Area Income and Price.
plt.rcParams['figure.figsize'] = [12,8]
plt.scatter(housing_price['Avg. Area Income'], housing_price['Price'], color='red', marker='o')
plt.title('Avg. Area Income Vs Price', fontsize=14)
plt.xlabel('Avg. Area Income', fontsize=14)
plt.ylabel('Price', fontsize=14)
plt.grid(True)
plt.show()

Output:

This plot shows that there is a positive linear relationship between the
variables Avg. Area Income and Price. To increase the size of the plot
for better readability, we have used the rcParams option of the pyplot
module.
plt.rcParams['figure.figsize'] = [12,8]
If the points or dots are color-coded in a scatter plot, additional
variables can be shown in two dimensions. For example, let us plot
Avg. Area Income against Price and Area Population against Price by
color-coding the plots on the same figure.
plt.scatter(housing_price['Avg. Area Income'], housing_price['Price'], color='red', marker='o', label = 'Avg. Area Income')
plt.scatter(housing_price['Area Population'], housing_price['Price'], color='blue', marker='x', label = 'Area Population')
plt.title('Avg. Area Income Vs Price', fontsize=14)
plt.xlabel('Avg. Area Income (Red), Area Population (blue)', fontsize=14)
plt.ylabel('Price', fontsize=14)
plt.legend()
plt.grid(True)
plt.show()

Output:

6.3.3 Box Plots
A box plot, also called a box and whisker plot, displays the summary of
a dataset or its subset as five numbers. These five numbers are:
1. the minimum value excluding any outliers,
2. the first quartile,
3. the second quartile or median,
4. the third quartile and
5. the maximum value excluding any outliers.
Outliers are shown beyond the minimum and maximum points. A box and
whisker plot is shown in Figure 6.1.

Figure 6.1: A box and whisker plot.
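The five numbers behind a box plot can also be computed directly. A minimal sketch using NumPy's percentile function on the same sample used below (the 1.5 × IQR fences are the standard convention Matplotlib uses for whiskers):

```python
import numpy as np

data = np.array([0, 1, 10, 15, 4, -6, -15, -2, 30, 40, -20, 11])

# Quartiles of the sample
q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1                     # interquartile range

# Conventional whisker fences: points beyond them count as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
minimum = data[data >= lower_fence].min()  # min excluding outliers
maximum = data[data <= upper_fence].max()  # max excluding outliers

print(minimum, q1, median, q3, maximum)  # -20 -3.0 2.5 12.0 30
```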

We can generate a box plot using Matplotlib, Pandas or Seaborn.
To create a box plot using Matplotlib, we can type the following lines of
code.
plt.boxplot([0, 1, 10, 15, 4, -6, -15, -2, 30, 40, -20, 11])
plt.show()

Output:

We can generate multiple box plots on the same figure. Let us produce
the data for the box plots using the numpy.random.randn() function. This
function takes the dimensions of the output as input arguments: here, the
number of rows and the number of columns of the DataFrame to be created.
myframe = pd.DataFrame(np.random.randn(10, 3), columns=['Col1', 'Col2', 'Col3'])
boxplot = myframe.boxplot(column=['Col1', 'Col2', 'Col3'])
Output:

6.3.4 Histogram
A histogram is a plot that shows the frequency distribution or shape of
a dataset or its subset that comprises numeric data. This allows us to
discover the underlying distribution of the data by visual inspection. To
plot a histogram, we pass a collection of numeric values to the method
hist(). For example, the following histogram plots the distribution of
values in the Price column of the USA_Housing dataset.
plt.hist(housing_price['Price'])
plt.show()
Output:

This plot shows that more than 1,200 houses, out of 5,000, have a price
around 1,000,000. A few houses have prices less than 500,000 or greater
than 2,000,000. By default, the method hist() uses 10 bins or groups to
plot the distribution of the data. We can change the number of bins by
using the option bins.
plt.hist(housing_price['Price'], bins = 100)
plt.show()
Output:

From this plot, it can be observed that the house price follows a Normal
or Gaussian distribution, that is, a bell curve. It is important to know
that many machine learning algorithms assume a Gaussian distribution
of features. Thus, it is better for a feature to follow this distribution.
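Visual inspection can be complemented with a formal normality test. The following sketch uses SciPy's normaltest function (assuming SciPy is installed) on synthetic data so the snippet is self-contained; in practice, you would pass a column such as housing_price['Price'] instead.

```python
import numpy as np
from scipy import stats

# Synthetic bell-shaped sample standing in for a feature column;
# in practice use e.g. housing_price['Price'].
rng = np.random.default_rng(0)
sample = rng.normal(loc=1_000_000, scale=300_000, size=5000)

# D'Agostino-Pearson test: the null hypothesis is that the sample
# comes from a normal distribution. A large p-value means we cannot
# reject normality.
statistic, p_value = stats.normaltest(sample)
print(statistic, p_value)
```

A p-value above the chosen significance level (commonly 0.05) is consistent with the feature being normally distributed.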

6.3.5 Bar Chart


If we have categorical or discrete data that can take one of a small set
of values, we use bar charts to show the values of the categories as
rectangular bars whose lengths are proportional to the values. Since
the USA_Housing dataset has a continuous range of prices for houses, it
is not suitable for bar charts. We import the Iris dataset, which
contains three species of the Iris plant: Iris-setosa, Iris-virginica
and Iris-versicolor.
iris_data = pd.read_csv('c:/Users/GNG/Desktop/iris_dataset.csv')
iris_data.info()

Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Id 150 non-null int64
1 SepalLengthCm 150 non-null float64
2 SepalWidthCm 150 non-null float64
3 PetalLengthCm 150 non-null float64
4 PetalWidthCm 150 non-null float64
5 Species 150 non-null object
dtypes: float64(4), int64(1), object(1)
memory usage: 7.2+ KB
To make a bar chart of SepalLengthCm, we may write the following
Python script.
plt.title('SepalLengthCm Vs Species', fontsize=14)
plt.xlabel('Species', fontsize=14)
plt.ylabel('SepalLengthCm', fontsize=14)
plt.bar(iris_data['Species'], iris_data['SepalLengthCm'])
plt.show()

Output:

Similarly, we generate bar charts for SepalWidthCm, PetalLengthCm
and PetalWidthCm.

From the bar charts given above, we observe that the species Iris-virginica
has the highest petal length, petal width and sepal length. The species
Iris-setosa has the smallest petal length, petal width and sepal length.
However, there is a deviation from this trend: Iris-setosa shows the
highest sepal width, followed by Iris-virginica and Iris-versicolor.
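Note that plt.bar draws one bar per row, so passing all 150 raw values against Species overdraws bars at each category, and what remains visible is roughly the tallest value per species. A per-species mean is often clearer. The following is a sketch using a small synthetic iris-like frame so it is self-contained; substitute the real iris_data loaded above.

```python
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # non-interactive backend for scripted runs
import matplotlib.pyplot as plt

# Synthetic stand-in for iris_data; in practice use the DataFrame
# loaded with pd.read_csv above.
iris_like = pd.DataFrame({
    'Species': ['Iris-setosa', 'Iris-setosa',
                'Iris-virginica', 'Iris-virginica'],
    'SepalLengthCm': [5.0, 5.2, 6.4, 6.6],
})

# One bar per species, showing the mean sepal length
mean_lengths = iris_like.groupby('Species')['SepalLengthCm'].mean()
plt.bar(mean_lengths.index, mean_lengths.values)
plt.ylabel('Mean SepalLengthCm')
plt.show()
print(mean_lengths.to_dict())
```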

6.3.6 Pie Charts


A pie chart is a circular statistical chart that is used to display the
percentage distribution of categorical variables. The area of the whole
chart represents 100% of the data, while the areas of the slices denote
the percentage shares of the data.
Pie charts are popular in business communications because they give
a quick summary of various activities such as sales and operations.
They are also used to summarize survey results and resource usage,
for example, memory usage in a computer system.
To draw a pie chart, we use the function pie() in the pyplot module. The
following Python code draws a pie chart showing the world population
by continents.
cont_pop = {'Asia': 4641054775, 'Africa': 1340598147, 'Europe': 747636026, 'North America': 592072212, 'South America': 430759766, 'Australia/Oceania': 42677813}
explode = (0.05, 0.05, 0.05, 0.05, 0.05, 0.05)
plt.pie(cont_pop.values(), explode, labels=cont_pop.keys(), autopct='%1.1f%%', shadow=True)
plt.show()

Output:

The explode option is used to offset each slice some distance from the
center of the plot. The autopct='%1.1f%%' string formatting controls how
the percentages appear on the pie chart.

Further Readings – Matplotlib Plots

To study more about Matplotlib plots, please check Matplotlib's
official documentation for plots.
https://matplotlib.org/tutorials/introductory/pyplot.html#sphx-glr-tutorials-introductory-pyplot-py
You can explore more features of Matplotlib by searching and reading
this documentation.

6.4 Testing Assumptions about Data


Statistical and machine learning models work best if the data follows
some assumptions. These assumptions can be the:
• Independence of features;
• Linear or quadratic dependency of features on the target variable; and
• Normal or Gaussian distribution of features.
One of the main objectives of EDA is to test these assumptions. To
check them, we work with the USA_Housing dataset. We download the
dataset and store it on the Desktop. We can display the first 5
observations of the dataset by using the function head().
housing_price = pd.read_csv('c:/Users/GNG/Desktop/dataset_housing.csv')
housing_price.head()

Output:

It shows that the dataset has 7 variables or features. The last 5
observations of the dataset can be viewed using the function tail().
housing_price.tail()

Output:

It shows that the dataset has 5000 observations. To confirm this finding,
type:
housing_price.shape
Output:
(5000, 7)
Information about the dataset can be obtained using the info() function.
housing_price.info()

Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Avg. Area Income 5000 non-null float64
1 Avg. Area House Age 5000 non-null float64
2 Avg. Area Number of Rooms 5000 non-null float64
3 Avg. Area Number of Bedrooms 5000 non-null float64
4 Area Population 5000 non-null float64
5 Price 5000 non-null float64
6 Address 5000 non-null object
dtypes: float64(6), object(1)
memory usage: 273.6+ KB

We can compute important statistics of the dataset with the function
describe().
housing_price.describe()
Output:

We observe that some statistics are given in floating point; for example,
in Avg. Area Number of Rooms, min equals 3.236194 and max equals
10.759588. This is because these statistics report the minimum and
maximum values of the average number of rooms in an area. Let us find the
number of null or missing values in our dataset.
housing_price.isnull().sum()
Output:
Avg. Area Income 0
Avg. Area House Age 0
Avg. Area Number of Rooms 0
Avg. Area Number of Bedrooms 0
Area Population 0
Price 0
Address 0
dtype: int64

Since there are 5000 observations, we resort to plotting and visualization
to explore the dataset. The following sections present a number of
visualization plots and tools to understand the data better.

Checking Assumption of Normal Distribution of Features


Many machine learning models assume a Normal or Gaussian
distribution of features. We can check whether the features in our
dataset are normally distributed by just plotting the histogram of
features.
plt.hist(housing_price['Avg. Area Income'], bins = 100, label = 'Avg. Area Income')
plt.legend()
plt.show()

plt.hist(housing_price['Avg. Area House Age'], bins = 100, label = 'Avg. Area House Age')
plt.legend()
plt.show()

plt.hist(housing_price['Avg. Area Number of Rooms'], bins = 100, label = 'Avg. Area Number of Rooms')
plt.legend()
plt.show()

plt.hist(housing_price['Avg. Area Number of Bedrooms'], bins = 100, label = 'Avg. Area Number of Bedrooms')
plt.legend()
plt.show()

plt.hist(housing_price['Area Population'], bins = 100, label = 'Area Population')
plt.legend()
plt.show()

Output:
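The nearly identical blocks above can be condensed into a loop over the numeric columns. The following sketch builds a small synthetic stand-in for housing_price so it is self-contained; in practice, the loop runs directly on the real DataFrame.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend for scripted runs
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in with the same kind of column names; in practice
# iterate over the real housing_price DataFrame.
rng = np.random.default_rng(1)
housing_price = pd.DataFrame(
    rng.normal(size=(100, 2)),
    columns=['Avg. Area Income', 'Area Population'])

# Skip non-numeric columns such as Address
numeric_cols = [c for c in housing_price.columns
                if housing_price[c].dtype != object]

for col in numeric_cols:
    plt.hist(housing_price[col], bins=100, label=col)
    plt.legend()
    plt.show()

print(numeric_cols)
```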

Checking Independence of Features
We can also check if the features are independent of each other or not.
One of many possible ways is to draw a scatter plot of every pair of
features. Another way is to draw a scatter matrix, which shows the
histogram of every feature on its diagonal entries and displays scatter
plots of every pair of features elsewhere. We type the following to draw
a scatter matrix.
from pandas.plotting import scatter_matrix
scatter_matrix(housing_price, figsize=(15, 15), diagonal='kde')
plt.show()

Output:

The option diagonal can be set to kde (kernel density estimation, which
is closely related to the histogram) or to hist. If the points on a
scatter plot are distributed such that they form a circle or approximately
a circle, we say that the features are independent or approximately
independent of each other.
A close inspection of this plot reveals that most features, enclosed in
green rectangles, are independent of each other. If the points on a
scatter plot form an angled oval, we say that the features are
dependent. A close inspection of this plot reveals that the target
variable Price depends upon:
• Avg. Area Income,
• Avg. Area House Age,
• Avg. Area Number of Rooms and
• Area Population.

These dependencies are shown in red rectangles in this scatter plot.

6.5 Selecting Important Features / Variables


In order to reduce the computational load and to achieve better accuracy, we
have to identify the most relevant features from a set of data and remove the
irrelevant or less important features which do not contribute much to the
target variable.
Feature selection is a core concept that impacts the performance of a
machine learning model: irrelevant or less important features can degrade
it. The process of feature selection picks those features which contribute
most to the target variable or output. Though there are many statistical
and machine learning based methods to select important features or
variables from a dataset, here we describe a basic method based on the
correlation between features. Correlation is a statistical term that
measures the dependency between two features as a number ranging from
-1 to 1.
If one variable increases as the other increases, and vice versa, we say
that the correlation between the two variables is positive. However, if
one variable decreases as the other increases, and vice versa, we say that
the correlation between the two variables is negative.
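A tiny numeric illustration of positive and negative correlation, using NumPy's corrcoef function (the variable names here are for illustration only):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_up = 2 * x + 1        # increases with x -> positive correlation
y_down = 10 - 3 * x     # decreases as x grows -> negative correlation

# corrcoef returns a 2x2 matrix; the off-diagonal entry is the
# correlation coefficient between the two inputs.
r_pos = np.corrcoef(x, y_up)[0, 1]
r_neg = np.corrcoef(x, y_down)[0, 1]
print(round(r_pos, 2), round(r_neg, 2))  # 1.0 -1.0
```

Perfectly linear relationships give the extreme values 1 and -1; real features fall somewhere in between.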
To find the correlation, we use the method corr(). To plot the correlation
matrix, we import the Seaborn library as sns.
import seaborn as sns
corrmat = housing_price.corr()
feature_ind = corrmat.index
plt.figure(figsize=(12,12))
sns.heatmap(housing_price[feature_ind].corr(), annot=True, cmap="RdYlGn") # plot heat map

plt.show()

Output:

The annot and cmap options display the value of the correlation
coefficient within the square boxes and set the color map of the figure,
respectively.
The values closer to zero in the red boxes indicate that these features
are nearly independent of each other. However, larger values, such as
0.64 between Price and Avg. Area Income, indicate that the house prices
are strongly correlated with the average income of residents of a
particular area. Features having a small correlation with the target
variable can be neglected. For example, the variable Avg. Area Number of
Bedrooms has a correlation of 0.17 with Price and can be neglected.
We have seen a basic method for feature selection. Advanced, machine
learning based methods can also be employed for feature selection and
extraction.
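The rule of thumb described above, dropping features that are weakly correlated with the target, can be written as a short helper function. The following is a sketch on synthetic data; the function name select_features and the threshold of 0.2 are illustrative choices, not part of the original dataset workflow.

```python
import numpy as np
import pandas as pd

def select_features(df, target, threshold=0.2):
    """Keep numeric features whose absolute correlation with the
    target column is at least `threshold`."""
    corr = df.corr()[target].abs()
    keep = corr[(corr >= threshold) & (corr.index != target)]
    return list(keep.index)

# Synthetic stand-in; in practice pass housing_price with target='Price'.
rng = np.random.default_rng(2)
n = 500
income = rng.normal(size=n)            # strongly drives the target
noise = rng.normal(size=n)             # unrelated to the target
price = 3 * income + rng.normal(scale=0.5, size=n)
df = pd.DataFrame({'Income': income, 'Bedrooms': noise, 'Price': price})

print(select_features(df, 'Price'))    # only the correlated feature
```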

Hands-on Time – Exercise

To check your understanding of basic data plotting and visualization
with Matplotlib, complete the following exercise questions. The answers
to the questions are given at the end of the book.
