Exploratory Data Analysis Using Python
All content following this page was uploaded by Jitendra Pramanik on 04 July 2020.
Abstract: Data need to be analyzed in order to produce good results, and decisions can then be taken on the basis of those results. Examples include recommendation systems, page ranking, demand forecasting and prediction of product purchases. For some leading companies, customer reviews play a great role, so it is useful to analyze the factors that influence the review rating. We have used exploratory data analysis (EDA), in which the data are interpreted in row and column format. We have used Python for the data analysis. It is an object-oriented, interpreted and interactive programming language. It is open source, with a rich set of libraries such as pandas, Matplotlib and seaborn. We have used different types of charts and various parameters to analyze an Amazon review data set that contains reviews of electronic items.

Keywords: Exploratory Data Analysis (EDA); Matplotlib; Seaborn; Visualization; Pandas; Jupyter Notebook

Revised Manuscript Received on October 05, 2019
* Corresponding Author
Kabita Sahoo*, Asst. Professor, Dept. of Computer Science, MITS School of Biotechnology, Utkal University, Bhubaneswar, India.
Abhaya Kumar Samal, Professor, Dept. of Comp. Sc. & Engg., Trident Academy of Technology, Bhubaneswar, India.
Jitendra Pramanik, Asst. Professor, Centurion University of Technology and Management, Odisha, India.
Subhendu Kumar Pani, Associate Professor, Orissa Engineering College, Bhubaneswar, India.

I. INTRODUCTION

Data are growing very fast in today's world, and it is not easy to process them manually. Data analysis and visualization programs allow for reaching even deeper understanding. The programming language Python, with its English commands and easy-to-follow syntax, offers an amazingly powerful (and free!) open-source alternative to traditional techniques and applications.

Data analytics allows businesses to understand their efficiency and performance, and ultimately helps the business make more informed decisions. For example, an e-commerce company might be interested in analyzing customer attributes in order to display targeted ads for improving sales. Data analysis can be applied to almost any aspect of a business if one understands the tools available to process information.

E-commerce companies analyze customer reviews using suitable visualization methods. Exploratory Data Analysis (EDA) is an approach that summarizes the main characteristics of the data and visualizes them with suitable representations. Initial data analysis (IDA) focuses more narrowly on checking assumptions required for model fitting and hypothesis testing, handling missing values and making transformations of variables as needed. EDA encompasses IDA.

EDA quickly describes a data set: the number of rows and columns, missing data, data types and a preview. It cleans corrupted data by handling missing data, invalid data types and incorrect values. It visualizes data distributions with bar charts, histograms and box plots. It calculates and visualizes correlations (relationships) between variables, for example with a heat map.

The rest of the paper is organized as follows: Section II presents a brief review of literature and Section III presents a discussion on various techniques for exploratory data analysis. Section IV discusses how to conduct exploratory data analysis using Python, while Section V presents how to work with data sets to conduct exploratory data analysis. Finally, Section VI presents the concluding remarks.

II. LITERATURE SURVEY

Aindrila Ghosh et al. [1] examined different data exploration tools for exploratory analysis and described some of them.
John T. Behrens [2] described the difference between classical data analysis and exploratory data analysis using different visualization methods.
Chokey Wangmo [3] carried out an exploratory study on bank lending to the SME sector in Bhutan.
Matthew Ntow-Gyamfi et al. [4] carried out an exploratory study on credit risk and loan default among Ghanaian banks.
X. Francis Jency et al. [5] carried out exploratory data analysis for loan prediction depending on the nature of the client. They used machine learning techniques for predictive data analysis.
K. Ulaga Priya et al. [6] carried out exploratory analysis on prediction of loan privilege for customers using random forest. They used R programming for exploratory data analysis.
Bogumil M. Konopka et al. [7] carried out exploratory data analysis of a clinical study group and developed a procedure for exploring multidimensional data.

III. TECHNIQUES FOR EDA

A. Exploratory Data Analysis (EDA)
Primarily, exploratory data analysis is an approach to see what the data can tell us beyond the formal modeling or hypothesis testing task. EDA helps to analyze data sets and summarize their statistical characteristics, focusing on four key aspects: measures of central tendency (the mean, the mode and the median), measures of spread (the standard deviation and the variance), the shape of the distribution and the existence of outliers. In the following paragraphs, we present a description of these key aspects of EDA. As shown in Figure 1, data analysis and visualization techniques are used extensively at every step of the machine learning process. These steps are discussed below:

I. Data Exploration
Published By: Blue Eyes Intelligence Engineering & Sciences Publication
Retrieval Number: L3591081219/2019©BEIESP
DOI: 10.35940/ijitee.L3591.1081219
It is the first stage of data analysis. Here we learn about the content and characteristics of the data set. It tells us the size of the data, reveals missing values and suggests possible relationships among the data. The data are visualized in tabular form to understand their characteristics.

Figure-1: Steps of Machine Learning Process

II. Data Cleaning
It is the process of detecting corrupt data, removing the irrelevant parts of the data and replacing incorrect data with correct data. The actual purpose of data cleaning is to remove errors and validate the data. Data can be cross-checked to remove errors, and issues can be resolved by validating the data.

III. Model Building
We use a statistical model or a machine learning model to describe the variables and how they behave. The model can be supervised or unsupervised. We can use classification or regression models to get the output, and we can visualize the result produced by the model. After that we have to evaluate the model.

IV. Present Result
We can visualize large amounts of complex data using charts, graphs and tables. The human brain processes information readily from charts and graphs, so they are an easy way to convey a concept. They can identify areas that need improvement and clarify the factors involved.

B. Graphical EDA
Fundamentally, graphical exploratory data analysis (GEDA) is the graphical counterpart of the traditional non-graphical EDA: it analyzes data sets to help summarize their statistical characteristics, focusing on the same four key aspects, namely measures of central tendency, measures of spread, the shape of the distribution and the existence of outliers. Further, we have categorized GEDA into Univariate GEDA, Bivariate GEDA and Multivariate GEDA. In the following paragraphs, we discuss these key varieties and aspects of GEDA.

Univariate Graphical EDA
Univariate GEDA provides a statistical summary for each field in the raw data set, that is, a summary of one variable at a time. Examples of this type of GEDA include the cumulative distribution function (CDF), the probability density function (PDF), the box plot and the violin plot. A few of them are discussed below:

I. Histograms
A histogram represents the distribution of numerical data. It relates to one variable rather than two. The entire range of values is divided into a series of intervals. Histograms are mainly used for continuous data. A histogram represents a frequency distribution by means of rectangles whose widths represent the class intervals and whose areas are proportional to the corresponding frequencies. The height of each rectangle represents the average frequency density for the interval. An image histogram is a graphical representation of the tonal distribution of a digital image.

II. Stem Plots
It is otherwise called a stem-and-leaf plot. Here each data value is split into two parts: the leading digit or digits form the stem and the last digit forms the leaf. A stem plot conveys a little more information than a histogram because the individual values remain visible. It is also used for visualization, and comparing data is much easier with it. The numbers are arranged by place value. Stem plots are mainly used for highlighting the mode and are suited to small data sets.

III. Box plots
A box plot gives a good graphical image of the concentration of the data. It shows the central tendency, symmetry, skew and outliers. It can be constructed from five values: the minimum, the first quartile, the median, the third quartile and the maximum. These values are compared to show how close other data values are to them.

Bivariate Graphical EDA
Bivariate GEDA is carried out to understand the connection between each variable in the data set and the target variable of interest, or between any two variables. Examples of this type of GEDA include the box plot and the violin plot.

Multivariate Graphical EDA
Multivariate GEDA is carried out to understand the connections between different fields in the data set, that is, between more than two variables. Examples of this type of GEDA include the pair plot and the 3D scatter plot. The bar graph is the most commonly used graphical technique. Nowadays the box plot is used to show the relationship between two variables. In some cases a pair plot is used to show all variables and their relationships.

I. Side-by-Side Box plots
Side-by-side box plots are used to compare the levels of a variable across all possible values of a categorical variable, or to compare two data sets. They summarize the data for each level of the categorical variable.

II. Scatter plots
A scatter plot uses Cartesian coordinates to display the values of two variables for a set of data. One variable is plotted on the X axis and the other on the Y axis. The data are displayed as a collection of points whose X and Y positions give the values of the two variables.

III. Heat Maps and 3D Surface Plots
We can generate a heat map from all the feature variables. The feature variables are taken as the row and column headers, with each variable versus itself on the diagonal. It is very useful for visualizing the relationships between variables in a high-dimensional space.

IV. EDA IN PYTHON
We are using Python for exploratory data analysis. It is simple to learn and has a rich set of libraries. Its data handling capacity is high. It is open source. It can work with third-party languages and can run on any platform, so a process can be moved from one platform to another. It is easy to read, so developers can understand the code. It offers a variety of libraries, some of which provide powerful visualization tools. Visualization makes it easier to create a clear report.

Pandas
It is the most powerful package for data analysis. With it we can clean, transform, analyze, visualize and store data. Data can be stored on the computer in CSV format. It is built on top of the NumPy package, and it works with plotting functions from Matplotlib and machine learning algorithms in Scikit-learn.

Jupyter Notebook
It allows the code in a particular cell to be executed. It provides a console-based approach to computing through a web-based application. A notebook includes the inputs and outputs of the computation and gives rich media representations of objects.

Applications of EDA
1. Mistakes and anomalies can be detected using EDA.
2. We can gain new insight into various types of data.
3. Outliers in data can be detected.
4. We can test assumptions using EDA.
5. Important factors can be identified using it.
6. We can understand the relationships among various data.
7. Data can speak for itself through visualization.

V. WORKING WITH THE DATA SETS
It's time to explore the data and find out about it. The data we are using belong to an Amazon review data set. We are going to analyze the data with a possible set of options.
1. In the first step we imported the pandas and NumPy packages.
2. After that we imported a fairly large Amazon CSV file as a data frame df. It gives the data set in the form of rows and columns. We used the head( ) method to return the top 5 rows of the data frame; the resulting preview has 5 rows and 20 columns. This is shown in Figure 2 below.
3. We have to choose the right visualization method. When visualizing individual variables, it is important to first understand what type of variable we are dealing with. This helps us find the right visualization method for that variable. For this we imported the Matplotlib and seaborn library packages. We used df.dtypes to list the data type of each column. This is shown in Figure 3 below. As shown in the figure, reviews.doRecommend is of Boolean data type, reviews.id is of float64 data type, reviews.numHelpful and reviews.rating are of int64 data type, and all others are of object data type.
4. We used df.corr( ) to find the pairwise correlations of all columns in the data frame. It gives the correlations between reviews.id, reviews.doRecommend, reviews.numHelpful and reviews.rating. This pairwise correlation is shown in Figure 4 below.
Figure 2: Importing pandas library and head functions showing top 5 rows of the data frame
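Steps 1 to 4 above can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' notebook: the small data frame is a hypothetical stand-in for the Amazon review file (in practice it would be loaded with pd.read_csv from the real path), only the column names are taken from the text, and the heat map is drawn with Matplotlib's imshow as one possible way to display the correlation matrix.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the review data; the values are invented for illustration.
df = pd.DataFrame({
    "reviews.id": [101.0, 102.0, 103.0, 104.0, 105.0, 106.0],
    "reviews.doRecommend": [True, True, False, True, False, True],
    "reviews.numHelpful": [0, 2, 1, 5, 0, 3],
    "reviews.rating": [5, 4, 2, 5, 1, 4],
    "manufacturer": ["Amazon", "Amazon", "Amazon Digital",
                     "Amazon", "Amazon Digital", "Amazon"],
})

preview = df.head()        # step 2: first five rows of the data frame
column_types = df.dtypes   # step 3: one data type per column

# Step 4: pairwise Pearson correlations of the numeric and Boolean columns.
# Selecting the columns explicitly keeps the call independent of the pandas version.
numeric = df.select_dtypes(include=["number", "bool"]).astype(float)
corr = numeric.corr()

# Heat map of the correlation matrix: variables on both axes, 1.0 on the diagonal.
fig, ax = plt.subplots()
image = ax.imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=45, ha="right")
ax.set_yticks(range(len(corr.index)))
ax.set_yticklabels(corr.index)
fig.colorbar(image, ax=ax, label="Pearson correlation")
fig.tight_layout()
```

In a Jupyter notebook, evaluating preview, column_types or corr in a cell displays the corresponding table, and the figure is rendered inline.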
5. We made the scatter plot between reviews.id and reviews.rating. This scatter plot is shown in Figure 5 below.
6. We can find the correlation between reviews.id and reviews.rating and form the scatter plot between them. This correlation between reviews.id and reviews.rating is shown by the scatter plot presented in Figure 6 below.
7. Box Plot: A categorical variable takes a fixed number of possible values and describes a characteristic of a data unit. Its relationship to a numerical variable can be represented by a box plot. We made box plots between:
   1. Manufacturer number and review ratings.
   2. Manufacturer and review ratings.
   This is shown by the box plot between manufacturer number and reviews.rating presented in Figure-7.
8. Count Plot: We used a count plot to count the number of observations. It can be thought of as a histogram across a categorical variable. Its basic options are identical to those of a bar plot. We made the count plot between manufacturer and reviews.id. The box plot between manufacturer number and reviews.rating in the data frame is presented in Figure-8.
Figure-3: Showing data types for each column of the data frame
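The plots described in steps 5 to 8 can be sketched as follows. The text names seaborn for plotting; this sketch draws the same three kinds of plot with Matplotlib alone to keep its dependencies small, so it is an illustration rather than the authors' code. The data frame is again a hypothetical stand-in with invented values, and the fitted line is an ordinary least-squares fit computed with numpy.polyfit.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical stand-in for the review data; the values are invented for illustration.
df = pd.DataFrame({
    "reviews.id": [101.0, 102.0, 103.0, 104.0, 105.0, 106.0],
    "reviews.rating": [5, 4, 2, 5, 1, 4],
    "manufacturer": ["Amazon", "Amazon", "Amazon Digital",
                     "Amazon", "Amazon Digital", "Amazon"],
})

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Steps 5 and 6: scatter plot of two numeric columns with a fitted straight line.
x = df["reviews.id"].to_numpy(dtype=float)
y = df["reviews.rating"].to_numpy(dtype=float)
slope, intercept = np.polyfit(x, y, 1)
axes[0].scatter(x, y)
axes[0].plot(x, slope * x + intercept, color="red")
axes[0].set_xlabel("reviews.id")
axes[0].set_ylabel("reviews.rating")

# Step 7: one box per level of the categorical variable (side-by-side box plots).
ratings_by_brand = {brand: group["reviews.rating"].to_numpy()
                    for brand, group in df.groupby("manufacturer")}
axes[1].boxplot(list(ratings_by_brand.values()))
axes[1].set_xticks(range(1, len(ratings_by_brand) + 1))
axes[1].set_xticklabels(list(ratings_by_brand.keys()))
axes[1].set_ylabel("reviews.rating")

# Step 8: count plot, i.e. a bar per category whose height is the number of rows.
brand_counts = df["manufacturer"].value_counts()
axes[2].bar(list(brand_counts.index), brand_counts.to_numpy())
axes[2].set_ylabel("number of reviews")

fig.tight_layout()
```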
9. Descriptive Statistical Analysis: We used descriptive statistical analysis, which describes the entire data set with a few summary values or metrics. The describe function automatically computes basic statistics for all continuous variables. NaN values are automatically skipped in these statistics. The mean is calculated as the sum of all the values divided by the number of values. For each variable we obtained the count, the mean, the standard deviation (std), the minimum value, the quartiles (25%, 50% and 75%, where the interquartile range is the difference between the 75% and 25% values) and the maximum value. We found all these statistics for reviews.id, reviews.numHelpful and reviews.rating using the describe function. The count plot between Amazon manufacturer and reviews.id is presented in Figure-9.
10. Counts: We used the count function, which returns the number of occurrences. It tells us how many units of each characteristic/variable we have. We obtained the number of brand values and the different categories of electronic products. The output of the describe function is presented in Figure-10. We applied the method "describe" to the variables of type 'object' and obtained the result.
11. Basics of Grouping: We used the "group by" method, which groups data by different categories. The data are grouped based on one or several variables and analysis is performed on the individual groups. The describe method on the variables of type object is presented in Figure-11 below.
12. Here we used the unique( ) method, which returns all the unique values in a column. Figure-12 shows the implementation of the value count function.
Figure-5: Shows reg plot and scatter plot between reviews.id and reviews.rating
Figure-6: Correlation between reviews.id and reviews.rating and the scatter plot between them
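Steps 9 to 12 can be sketched with standard pandas calls. As before, this is a minimal sketch on a hypothetical stand-in data frame with invented values, not the authors' output; the manufacturer column is created with the object data type explicitly so that describe(include=["object"]) behaves the same across pandas versions.

```python
import pandas as pd

# Hypothetical stand-in for the review data; the values are invented for illustration.
df = pd.DataFrame({
    "reviews.numHelpful": [0, 2, 1, 5, 0, 3],
    "reviews.rating": [5, 4, 2, 5, 1, 4],
    "manufacturer": pd.Series(
        ["Amazon", "Amazon", "Amazon Digital",
         "Amazon", "Amazon Digital", "Amazon"],
        dtype=object,
    ),
})

# Step 9: count, mean, std, min, quartiles (25%, 50%, 75%) and max per numeric column.
numeric_summary = df[["reviews.numHelpful", "reviews.rating"]].describe()

# Step 10: describe on object columns gives count, unique, top and freq;
# value_counts gives the number of occurrences of each distinct value.
object_summary = df.describe(include=["object"])
brand_counts = df["manufacturer"].value_counts()

# Step 11: group the rows by a categorical column and analyze each group.
mean_rating_by_brand = df.groupby("manufacturer")["reviews.rating"].mean()

# Step 12: all distinct values of a column, in order of first appearance.
unique_brands = df["manufacturer"].unique()
```

The interquartile range mentioned in step 9 can be read off the summary as numeric_summary.loc["75%"] - numeric_summary.loc["25%"].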
Figure-8: Showing box plot between manufacturer and reviews.rating in the data frame
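A box plot such as the one in Figure-8 is built from the five values named in Section III: the minimum, the first quartile, the median, the third quartile and the maximum. The sketch below computes them with the Python standard library. The sample values are hypothetical, and the 1.5 x IQR rule used to flag outliers is a common plotting convention that the paper does not specify, so it is an assumption here.

```python
import statistics


def five_number_summary(values):
    """Return the minimum, quartiles and maximum that define a box plot."""
    ordered = sorted(values)
    # method="inclusive" interpolates linearly between the observed data points.
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return {"min": ordered[0], "q1": q1, "median": median,
            "q3": q3, "max": ordered[-1]}


def iqr_outliers(values, k=1.5):
    """Flag values lying more than k * IQR outside the quartiles (assumed rule)."""
    summary = five_number_summary(values)
    iqr = summary["q3"] - summary["q1"]
    low = summary["q1"] - k * iqr
    high = summary["q3"] + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical sample with one extreme value.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
summary = five_number_summary(sample)
outliers = iqr_outliers(sample)
```

For this sample the quartiles are 3.25, 5.5 and 7.75, so the interquartile range is 4.5 and only the value 100 falls outside the fences.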