E Commerce
Shipping Data
Introduction
We selected an e-commerce shipping dataset to find out which shipping
mode is preferred, whether men or women buy more, and whether people
like online shopping. The analysis shows that customers prefer
shipment by ship over road and flight, and that women shop more than
men. The rating that ranks highest is 3 out of 5. The detailed
analysis and visual representations are given below.
Libraries
– Pandas
– Numpy
– Matplotlib.pyplot
– Matplotlib
– Sklearn metrics
Data Preprocessing
Data preprocessing is a data mining technique that turns raw data into an understandable
format. Real-world data is often incomplete or contains errors, and these problems are
hard to spot by looking at an Excel sheet. We therefore preprocess the data, both to
clean it and to understand a large dataset before the analysis.
Shape function
Using shape, we find the total number of rows and columns in our dataset. The e-commerce dataset
has 10999 rows and 12 columns, which means that the file contains 10999 records.
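As a minimal sketch of this step, the snippet below reads the shape of a small, hypothetical three-row sample (the sample values are made up for illustration; the real file is not reproduced here). In pandas, `shape` is an attribute that returns a `(rows, columns)` tuple.

```python
import pandas as pd

# Hypothetical three-row sample; the real dataset has 10999 rows and 12 columns.
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "Mode_of_Shipment": ["Ship", "Flight", "Road"],
    "Customer_rating": [3, 5, 2],
})

# shape is an attribute (a tuple), so it is read without parentheses.
n_rows, n_cols = df.shape
print(f"{n_rows} rows, {n_cols} columns")
```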
Describe function
The describe function reports summary statistics for the dataset, such as the
mean, minimum and maximum values, and standard deviation. It gives the values
for the numeric columns and excludes the character columns. The statistical
analysis of the e-commerce dataset is given below.
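The behaviour described above can be sketched on a small, hypothetical sample (the values are invented for illustration). By default, `describe()` summarizes only the numeric columns, so the text column `Gender` is left out of the result.

```python
import pandas as pd

# Hypothetical sample with one text column and two numeric columns.
df = pd.DataFrame({
    "Gender": ["F", "M", "F", "M"],
    "Cost_of_the_Product": [100, 200, 300, 400],
    "Weight_in_gms": [1000, 2000, 3000, 4000],
})

# Returns count, mean, std, min, quartiles and max for numeric columns.
summary = df.describe()
print(summary)
```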
Columns
The columns attribute shows all the column labels in the data frame.
The e-commerce dataset has the columns: 'ID', 'Warehouse_block', 'Mode_of_Shipment',
'Customer_care_calls', 'Customer_rating', 'Cost_of_the_Product', 'Prior_purchases',
'Product_importance', 'Gender', 'Discount_offered', 'Weight_in_gms', 'Reached.on.Time_Y.N'
Data cleanup
Data cleanup lets us remove columns or data that are unnecessary for our dataset or the analysis.
It helps us remove tables and unfinished, unreliable, or inaccurate data. We can also remodel our
dataset during cleanup. In the e-commerce data we do not need the customer care calls column, so
we remove it.
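One way to remove the column is `DataFrame.drop`, sketched below on a hypothetical two-row sample. Whether the original analysis used `drop` or another method is not stated, so treat this as an assumption.

```python
import pandas as pd

# Hypothetical two-row sample containing the column to remove.
df = pd.DataFrame({
    "ID": [1, 2],
    "Customer_care_calls": [4, 3],
    "Customer_rating": [2, 5],
})

# drop returns a new data frame; the original df is left unchanged.
cleaned = df.drop(columns=["Customer_care_calls"])
print(list(cleaned.columns))
```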
Missing Values
Real-world data is not always accurate: values may be missing or unavailable. To look for
missing values in the dataset we use the isnull() or isna() functions. A result of True means that
a value is missing, and False means that it is present. There are no missing values in our
dataset. The results are given below:
Aggregation
By applying aggregate functions we find the minimum and maximum values for shipment and
purchases.
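A minimal sketch of such an aggregation is shown below. The choice of the `Prior_purchases` and `Cost_of_the_Product` columns is an assumption made for illustration, and the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical sample; the columns aggregated here are an assumption.
df = pd.DataFrame({
    "Prior_purchases": [2, 5, 3, 4],
    "Cost_of_the_Product": [150, 210, 96, 180],
})

# agg with a list of function names returns one row per function.
stats = df[["Prior_purchases", "Cost_of_the_Product"]].agg(["min", "max"])
print(stats)
```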
Mean and Maximum function
Group by
– Applying groupby shows that customers rate product importance as high, medium, or
low.
– Grouping by warehouse shows that the dataset has 5 warehouses from which shipments are
made: A, B, C, D, and F.
– Grouping by mode of shipment shows that more shipments go by ship than by flight
or road.
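The groupings listed above can be sketched like this, again on a hypothetical sample whose values are invented for illustration. Counting rows with `size()` is one plausible way to compare shipment modes; the original analysis may have used a different aggregation.

```python
import pandas as pd

# Hypothetical six-row sample.
df = pd.DataFrame({
    "Warehouse_block": ["A", "B", "F", "F", "D", "C"],
    "Mode_of_Shipment": ["Ship", "Ship", "Flight", "Ship", "Road", "Ship"],
    "Product_importance": ["low", "medium", "high", "low", "medium", "low"],
})

# Number of shipments per mode, largest first.
shipments_per_mode = (
    df.groupby("Mode_of_Shipment").size().sort_values(ascending=False)
)

# The distinct warehouse blocks are the keys of the groups.
blocks = sorted(df.groupby("Warehouse_block").groups)

print(shipments_per_mode)
print(blocks)
```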
Data visualization
Data visualization helps us read the data easily. We can make graphs of our dataset, and data is
easier to understand in visual form. Visualization also helps us identify outliers, patterns, and
trends in the data.
Shipment vs. rating
– We made a bar graph of mode of shipment against customer rating. The graph shows that
customers give more ratings to shipments sent by ship, and that parcels sent by ship arrive
more safely than parcels sent by road or flight.
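A bar graph of this kind could be produced as sketched below. How the original graph was built is not stated, so counting orders per rating and shipment mode with `pd.crosstab` is an assumption, as are the sample values and the off-screen `Agg` backend.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen, no display needed
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical six-row sample.
df = pd.DataFrame({
    "Mode_of_Shipment": ["Ship", "Ship", "Flight", "Road", "Ship", "Road"],
    "Customer_rating": [3, 4, 2, 5, 3, 1],
})

# Rows are ratings, columns are shipment modes, cells are order counts.
counts = pd.crosstab(df["Customer_rating"], df["Mode_of_Shipment"])

# One group of bars per rating, one bar per shipment mode.
ax = counts.plot(kind="bar")
ax.set_xlabel("Customer rating")
ax.set_ylabel("Number of orders")
plt.tight_layout()
```

In an interactive session, `plt.show()` would display the figure once the `Agg` line is removed.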
Gender vs. customer rating
The histogram shows that female customers give the products more good ratings than male
customers.
Product cost vs. discount
Importance vs. cost of product
Data splitting
To divide our data into portions for training and testing, we use data splitting.
We first separate the dataset into labels and features so that we can test the validity of our results.
The training set is used to develop a predictive model, and the test set is used to measure the
model's performance. We divide our dataset in half, so 5499 rows are used for testing.
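A half-and-half split can be sketched with scikit-learn's `train_test_split`. The sample values, the choice of feature columns, the use of `Reached.on.Time_Y.N` as the label, and the fixed `random_state` are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical ten-row sample.
df = pd.DataFrame({
    "Cost_of_the_Product": list(range(100, 110)),
    "Weight_in_gms": list(range(1000, 1010)),
    "Reached.on.Time_Y.N": [0, 1] * 5,
})

# Features (inputs) and label (the value to predict); both are assumed here.
X = df[["Cost_of_the_Product", "Weight_in_gms"]]
y = df["Reached.on.Time_Y.N"]

# test_size=0.5 puts half of the rows in the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)
print(len(X_train), len(X_test))
```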
Regression
Linear Regression
Linear regression finds the linear relationship between a target and one or more predictors. There
are many types of regression, but here we apply linear regression. There are two types of linear
regression: simple linear regression, which uses one predictor, and multiple linear regression, which
uses more than one.
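Both types can be sketched with scikit-learn's `LinearRegression`. The data below is synthetic and generated from known coefficients, so it only shows the mechanics; it is not the model fitted on the e-commerce dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Simple linear regression: one predictor, synthetic data with y = 2x + 1.
X_simple = np.array([[1.0], [2.0], [3.0], [4.0]])
y_simple = 2.0 * X_simple[:, 0] + 1.0
simple_model = LinearRegression().fit(X_simple, y_simple)

# Multiple linear regression: two predictors, y = 3*x1 - 2*x2 + 5.
X_multi = np.array([[1.0, 0.0], [2.0, 1.0], [3.0, 0.0], [4.0, 3.0]])
y_multi = 3.0 * X_multi[:, 0] - 2.0 * X_multi[:, 1] + 5.0
multi_model = LinearRegression().fit(X_multi, y_multi)

print(simple_model.coef_, simple_model.intercept_)
print(multi_model.coef_, multi_model.intercept_)
```

Because the synthetic targets are exactly linear in the predictors, the fitted coefficients recover the values used to generate them, up to floating-point error.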
Residual error
Root Mean Square Error
The root mean square error (RMSE) is the square root of the mean squared error
(MSE). Because of this, the RMSE is in the same units as the outcome variable in
the training data. Low RMSE values are desired.
RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2}
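The RMSE definition above can be computed directly with NumPy, as in the sketch below. The helper name `rmse` and the example values are hypothetical.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Square root of the mean of the squared prediction errors."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

# Errors are -1, 0, 2, 0, so the MSE is (1 + 0 + 4 + 0) / 4 = 1.25.
actual = [3, 5, 2, 4]
predicted = [2, 5, 4, 4]
print(rmse(actual, predicted))  # square root of 1.25
```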