Python Exploratory Data Analysis
6.1 Introduction
In the previous chapter, we saw that data preprocessing is an important
step of the data science pipeline. Once we get the pre-processed data,
we have to choose suitable machine learning algorithms to model it.
However, before applying machine learning, we have to answer the
following questions:
• What is the underlying structure of the data?
• What are its salient features and important patterns?
• Does the data contain anomalies such as missing values or outliers?
Exploratory Data Analysis (EDA) is a process for getting familiar with the
structure and important features of a dataset. EDA helps us answer the
aforementioned questions by providing us with a good understanding of
what the data contains. It explores the pre-processed data using suitable
visualization tools to find the structure of the data, its salient features
and important patterns.
EDA is also intended to define and refine the selection of important features.
Once EDA is complete, we can perform more complex modeling and
machine learning tasks on these features such as clustering, regression
and classification.
The goal of EDA is to make sure that the dataset is ready to be used in
a machine learning algorithm. EDA is a valuable step in a data science
project: it validates the results and makes their interpretation easy in
the desired business context. In the following sections, we explain
the process of exploratory data analysis along with practical Python
examples.
Univariate visualization is used to generate summary statistics for
each feature or variable in a dataset. We summarize our dataset through
descriptive statistics, which use a variety of statistical measurements to
better understand the dataset. This is also known as data profiling. The
goal of univariate visualization is to gain a solid understanding of the data
so we can start querying and visualizing it in various ways. It uses
visualization tools such as bar plots and histograms to reveal the structure
of the data.
Bivariate visualization is performed to find the relationship between two
variables in a given dataset, where one of the two variables can be the
target variable of interest. It uses correlations, scatter plots and line
plots to reveal the structure of the data.
Multivariate visualization is employed to understand interactions between
different fields in the dataset. It uses line plots, scatter plots and matrices
with multiple colors to understand the relationship between various features
of a dataset.
Revealing the underlying structure of data enables us to discover patterns,
spot anomalies such as missing values and outliers, and check assumptions
about the data.
We first download a real-world dataset to reveal its underlying structure.
Suppose we want to predict the price of a house based upon already
available data of houses in a particular city or country. Fortunately, we
have a dataset of house prices in the United States of America:
USA_Housing dataset. This dataset can be downloaded from either
https://www.kaggle.com/aariyan101/usa-housingcsv
or
https://raw.githubusercontent.com/bcbarsness/machine-learning/master/USA_Housing.csv
The USA_Housing dataset contains the following columns or variables:
1. Avg. Area Income: It shows the average income of the residents of
the city.
2. Avg. Area House Age: It shows the average age of houses located
in the same city.
3. Avg. Area Number of Rooms: It shows the average number of
rooms for the houses located in the same city.
4. Avg. Area Number of Bedrooms: It shows the average number of
bedrooms for the houses located in the same city.
5. Area Population: It shows the average population of the city where
the house is located.
6. Price: It shows the price that the house is sold at.
7. Address: It shows the address of the house.
Let us start to explore this dataset. We import necessary libraries first
by typing the following commands.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
We have imported Matplotlib, a commonly used Python library for data
visualization. The advanced data visualization library Seaborn is built
upon Matplotlib. We use both libraries for the plotting and visualization
of our dataset.
The Python scripts in this chapter have been executed using the Jupyter
notebook. Thus, we should have the Jupyter notebook installed. Since a
typical Jupyter installation, for example via the Anaconda distribution,
ships with the Matplotlib, NumPy and Pandas libraries, we do not
need to install them separately.
The Jupyter notebook containing the source code given in this chapter can
be found in Resources/Chapter 6.ipynb. We suggest that the reader writes
all the code given in this chapter to verify the outputs mentioned in this
chapter.
Line plots are predominantly useful for conveying changes over space
or time. Line plots are mostly used to plot data along a scale divided
into equal intervals, for example, time. Let us generate a simple line
plot.
import matplotlib.pyplot as plt
x1 = [1,2,3]
y1 = [2,4,1]
plt.plot(x1, y1, label = "line 1")
plt.show()
Output:
We can label the axes and add a title and a legend to the plot as follows.
x1 = [1,2,3]
y1 = [2,4,1]
plt.plot(x1, y1, label = "line 1")
# naming the x axis
plt.xlabel('x - axis')
# naming the y axis
plt.ylabel('y - axis')
# giving a title to my graph
plt.title('My first line plot')
plt.legend()
plt.show()
Output:
Multiple plots can be drawn on the same figure as shown in the Python
script given below.
x1 = np.arange(0, 10*(np.pi), 0.1) # Generating x and y points to plot
y1 = np.cos(x1) # cosine function from Numpy library
plt.plot(x1, y1, label = 'cosine') # plotting the points
plt.legend() # shows the legend with labels
plt.xlabel('Angle') # naming the x axis
plt.ylabel('Amplitude') # naming the y axis
x2 = np.arange(0, 10*(np.pi), 0.1) # Generating x and y points to plot
y2 = np.cos(x2)*0.1*np.arange(10*(np.pi),0,-0.1) # decaying cosine function
plt.plot(x2, y2, label = 'Decaying cosine') # plotting the points
plt.legend() # show a legend on the plot
plt.title('Two functions on the same graph') # gives a title to the plot
plt.show() # shows the graph, and removes the text output.
Output:
Let us plot Avg. Area Income against Price from the USA_Housing dataset.
plt.plot(housing_price['Avg. Area Income'], housing_price['Price'], color='red', marker='o')
plt.title('Avg. Area Income Vs Price', fontsize=14)
plt.xlabel('Avg. Area Income', fontsize=14)
plt.ylabel('Price', fontsize=14)
plt.grid(True)
plt.show()
Output:
Since there are 5000 observations, it is evident that a line plot does not
give us a clear picture of the relationship between these two variables.
A scatter plot, which draws an unconnected point for every observation,
is better suited here.
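The scatter plot that produces the next figure is not printed in full here; the following is a minimal sketch consistent with the surrounding text. The small four-row frame below is a stand-in for the real 5000-row housing_price frame, which would already be loaded at this point.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in rows for the real 5000-row housing_price frame
housing_price = pd.DataFrame({
    'Avg. Area Income': [55000, 65000, 75000, 85000],
    'Price': [900000, 1100000, 1300000, 1500000],
})

plt.rcParams['figure.figsize'] = [12, 8]  # enlarge the figure for readability
points = plt.scatter(housing_price['Avg. Area Income'],
                     housing_price['Price'], color='red', marker='o')
plt.title('Avg. Area Income Vs Price', fontsize=14)
plt.xlabel('Avg. Area Income', fontsize=14)
plt.ylabel('Price', fontsize=14)
plt.show()
```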
Output:
This plot shows that there is a positive linear relationship between the
variables Avg. Area Income and Price. To increase the size of the plot for
better readability, we have used the rcParams option of the pyplot module.
plt.rcParams['figure.figsize'] = [12,8]
If the points or dots are color-coded in a scatter plot, additional variables
can be shown in two dimensions. For example, let us plot Avg. Area
Income against Price and Area Population against Price by color-coding
the plots on the same figure.
plt.scatter(housing_price['Avg. Area Income'], housing_price['Price'], color='red', marker='o', label = 'Avg. Area Income')
plt.scatter(housing_price['Area Population'], housing_price['Price'], color='blue', marker='x', label = 'Area Population')
plt.legend() # shows which color corresponds to which variable
plt.show()
Output:
6.3.3 Box Plots
A box plot, also called a box and whisker plot, displays the summary of
a dataset or its subset as five numbers. These five numbers are:
1. The minimum value excluding any outlier,
2. The first quartile,
3. The second quartile or median,
4. The third quartile and
5. The maximum value excluding any outlier.
The outliers are shown beyond the minimum and maximum points. A
box and whisker plot is shown in Figure 6.1.
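The five numbers can also be computed directly with NumPy. The sketch below uses the same data as the Matplotlib box plot example that follows; the 1.5 × IQR rule for the whisker fences is the Matplotlib default.

```python
import numpy as np

data = np.array([0, 1, 10, 15, 4, -6, -15, -2, 30, 40, -20, 11])

# Quartiles give three of the five numbers directly
q1, median, q3 = np.percentile(data, [25, 50, 75])

# Whiskers extend to the most extreme points within 1.5 * IQR of the box
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = data[(data < lower_fence) | (data > upper_fence)]
minimum = data[data >= lower_fence].min()  # minimum excluding outliers
maximum = data[data <= upper_fence].max()  # maximum excluding outliers
print(minimum, q1, median, q3, maximum, outliers)
```

On this data the quartiles are -3.0, 2.5 and 12.0, and the single value 40 falls beyond the upper fence, so it is drawn as an outlier point above the whisker.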
We can generate a box plot using either Matplotlib, Pandas or Seaborn.
To create a box plot using Matplotlib, we can type the following lines of
code.
plt.boxplot([0, 1, 10, 15, 4, -6, -15, -2, 30, 40, -20, 11])
plt.show()
Output:
We can generate multiple box plots on the same figure. Let us produce
the data for the box plots by using the numpy.random.randn() function. Here
it takes two input arguments: the number of rows (the number of values
within each column) and the number of columns (one column per box plot).
myframe = pd.DataFrame(np.random.randn(10, 3), columns=['Col1', 'Col2', 'Col3'])
boxplot = myframe.boxplot(column=['Col1', 'Col2', 'Col3'])
Output:
6.3.4 Histogram
A histogram is a plot that shows the frequency distribution, or shape, of
a dataset or its subset that comprises numeric data. This allows us to
discover the underlying distribution of the data by visual inspection. To
plot a histogram, we pass a collection of numeric values to the method
hist(). For example, the following histogram plots the distribution of
values in the Price column of the USA_Housing dataset.
plt.hist(housing_price['Price'])
plt.show()
Output:
This plot shows that more than 1200 houses, out of 5000, have a price
around 1000000. A few houses have prices less than 500000 and
greater than 2000000. By default, the method hist() uses 10 bins or
groups to plot the distribution of the data. We can change the number
of bins by using the option bins.
plt.hist(housing_price['Price'], bins = 100)
plt.show()
Output:
From this plot, it can be observed that the house price follows a Normal
or Gaussian distribution, that is, a bell curve. It is important to know
that many machine learning algorithms assume a Gaussian distribution
of features. Thus, it is better for a feature to follow this distribution.
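One quick, informal check of this assumption is the sample skewness, which is close to zero for a symmetric, bell-shaped distribution. The sketch below runs the check on synthetic data standing in for the Price column; on the real data, replace sample with housing_price['Price'].values. The 0.5 threshold is only a rough rule of thumb, not a formal test.

```python
import numpy as np

# Synthetic stand-in for housing_price['Price']: 5000 normally distributed prices
rng = np.random.default_rng(0)
sample = rng.normal(loc=1000000, scale=250000, size=5000)

# Sample skewness: the mean of the cubed standardized deviations
mean, std = sample.mean(), sample.std()
skewness = np.mean(((sample - mean) / std) ** 3)

# Rough rule of thumb: |skewness| < 0.5 is consistent with a symmetric distribution
print(abs(skewness) < 0.5)
```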
Next, we explore the well-known Iris dataset, which records sepal and
petal measurements for three species of Iris flowers. Calling
iris_data.info() produces the following output.
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Id 150 non-null int64
1 SepalLengthCm 150 non-null float64
2 SepalWidthCm 150 non-null float64
3 PetalLengthCm 150 non-null float64
4 PetalWidthCm 150 non-null float64
5 Species 150 non-null object
dtypes: float64(4), int64(1), object(1)
memory usage: 7.2+ KB
To make a bar chart of SepalLengthCm, we may write the following
Python script.
plt.title('SepalLengthCm Vs Species', fontsize=14)
plt.xlabel('Species', fontsize=14)
plt.ylabel('SepalLengthCm', fontsize=14)
plt.bar(iris_data['Species'], iris_data['SepalLengthCm'])
plt.show()
Output:
From the bar charts given above, we observe that the species Iris-virginica
has the highest petal length, petal width and sepal length. The species
Iris-setosa has the smallest petal length, petal width and sepal length.
However, there is a deviation from the trend: Iris-setosa shows the highest
sepal width, followed by virginica and versicolor.
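The per-species comparison above can also be computed directly by averaging each measurement per species with groupby(). The sketch below uses a tiny hand-made stand-in frame with the same column names; the real iris_data has 150 rows.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Tiny stand-in for iris_data with the same column names
iris_data = pd.DataFrame({
    'Species': ['Iris-setosa', 'Iris-setosa',
                'Iris-versicolor', 'Iris-versicolor',
                'Iris-virginica', 'Iris-virginica'],
    'SepalLengthCm': [5.0, 5.2, 5.9, 6.1, 6.5, 6.7],
})

# Mean sepal length per species, then a bar chart of the means
means = iris_data.groupby('Species')['SepalLengthCm'].mean()
means.plot(kind='bar')
plt.ylabel('Mean SepalLengthCm')
plt.show()
```

Plotting the group means avoids drawing one overlapping bar per row, so each species gets exactly one bar.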
Output:
The explode option is used to offset each slice from the center of the
pie. The autopct='%.1f%%' format string controls how the percentages
appear on the pie chart.
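The pie chart code itself is not reproduced above; the following is a minimal sketch using these two options. The counts of 50 per species are the known class sizes of the Iris dataset; on the real frame they would come from iris_data['Species'].value_counts().

```python
import matplotlib.pyplot as plt

species = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
counts = [50, 50, 50]  # each Iris species has 50 observations

wedges, texts, autotexts = plt.pie(
    counts,
    labels=species,
    explode=(0.05, 0.05, 0.05),  # pushes every slice slightly away from the center
    autopct='%.1f%%')            # prints each slice's share to one decimal place
plt.title('Iris species distribution')
plt.show()
```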
Many machine learning algorithms assume:
• Linear or quadratic dependency of features on the target variable and
• Normal or Gaussian distribution of features.
One of the main objectives of EDA is to test these assumptions. To
check them, we work with the USA_Housing dataset. We download the
USA_Housing dataset and store it on the Desktop. We can display the
first 5 observations of the dataset by using the function head().
housing_price = pd.read_csv('c:/Users/GNG/Desktop/dataset_housing.csv')
housing_price.head()
Output:
It shows that the dataset has 5000 observations. To confirm this finding,
type:
housing_price.shape
Output:
(5000, 7)
Information about the dataset can be found by using the info() function.
housing_price.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Avg. Area Income 5000 non-null float64
1 Avg. Area House Age 5000 non-null float64
2 Avg. Area Number of Rooms 5000 non-null float64
3 Avg. Area Number of Bedrooms 5000 non-null float64
4 Area Population 5000 non-null float64
5 Price 5000 non-null float64
6 Address 5000 non-null object
dtypes: float64(6), object(1)
memory usage: 273.6+ KB
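The floating-point statistics discussed next come from the describe() method. On the real data the call is simply housing_price.describe(); the sketch below uses a one-column synthetic stand-in so it is self-contained.

```python
import numpy as np
import pandas as pd

# One-column stand-in for housing_price
rng = np.random.default_rng(0)
df = pd.DataFrame({'Avg. Area Number of Rooms': rng.uniform(3.0, 11.0, size=100)})

# describe() reports count, mean, std, min, 25%, 50%, 75% and max per numeric column
stats = df.describe()
print(stats.loc[['min', 'max']])
```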
We observe that some statistics are given as floating-point numbers; for
example, in Avg. Area Number of Rooms, min equals 3.236194 and max
equals 10.759588. This is because these statistics report the minimum and
maximum values of the average number of rooms in an area. Let us find the
number of null or missing values in our dataset.
housing_price.isnull().sum()
Output:
Avg. Area Income 0
Avg. Area House Age 0
Avg. Area Number of Rooms 0
Avg. Area Number of Bedrooms 0
Area Population 0
Price 0
Address 0
dtype: int64
Checking Independence of Features
We can also check whether the features are independent of each other.
One of many possible ways is to draw a scatter plot of every pair of
features. Another way is to draw a scatter matrix, which shows the
distribution of every feature on its diagonal entries and displays scatter
plots of every pair of features. We type the following to draw a scatter
matrix.
from pandas.plotting import scatter_matrix
scatter_matrix(housing_price, figsize=(15, 15), diagonal='kde')
plt.show()
Output:
The option diagonal can be set to kde (kernel density estimation, which
is closely related to the histogram) or to hist. If the points on a
scatter plot are distributed such that they form a circle or approximately
a circle, we say that the features are independent or approximately
independent of each other.
A close inspection of this plot reveals that most features, enclosed in
green rectangles, are independent of each other. If the points on a
scatter plot are distributed such that they form an angled oval, we say
that the features are dependent. A close inspection of this plot reveals
that the target variable Price depends upon:
• Avg. Area Income,
• Avg. Area House Age,
• Avg. Area Number of Rooms and
• Area Population.
plt.show()
Output:
The annot and cmap options display the value of the correlation
coefficient within the square boxes and select the color map used to
draw the figure, respectively.
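The heatmap call that produces this figure is not shown above; the following is a sketch of it. The coolwarm colormap and the synthetic stand-in frame are assumptions; on the real data, pass housing_price.corr() to sns.heatmap().

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Synthetic stand-in: Price depends on income, while Rooms is independent of both
rng = np.random.default_rng(1)
income = rng.normal(70000, 10000, size=500)
price = 10 * income + rng.normal(0, 50000, size=500)
rooms = rng.normal(7, 1, size=500)
df = pd.DataFrame({'Avg. Area Income': income, 'Price': price,
                   'Avg. Area Number of Rooms': rooms})

# annot writes each correlation coefficient inside its square,
# cmap selects the color map used to shade the squares
sns.heatmap(df.corr(), annot=True, cmap='coolwarm')
plt.show()
```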
The values closer to zero in the red boxes indicate that these features
are nearly independent of each other. However, larger values, such as
0.64 between Price and Avg. Area Income, indicate that house prices are
strongly correlated with the average income of the residents of a
particular area. Features having a small correlation with the target
variable can be neglected. For example, the variable Avg. Area Number of
Bedrooms has a correlation of 0.17 with Price, which can be neglected.
We have seen a basic method for feature selection. Advanced,
machine-learning-based methods can be employed for feature selection
and extraction.