ML Practical 1 Code

Download as pdf or txt
Download as pdf or txt
You are on page 1of 1

#Predict the price of the Uber ride from a given pickup point to the agreed drop-off location.

Perform following
tasks:
1. Pre-process the dataset.
2. Identify outliers.
3. Check the correlation.
4. Implement linear regression and random forest regression models.
5. Evaluate the models and compare their respective scores like R2, RMSE, etc. Dataset link: https://www.kaggle.com/datasets/yasserh/uber-fares-dataset

In [1]: #Importing the required libraries

import pandas as pd

import numpy as np

import seaborn as sns

import matplotlib.pyplot as plt

In [2]: #importing the dataset

df = pd.read_csv("uber.csv")

1. Pre-process the dataset.


In [3]: df.head()

Out[3]: Unnamed: 0 key fare_amount pickup_datetime pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude passenger_count

0 24238194 2015-05-07 19:52:06.0000003 7.5 2015-05-07 19:52:06 UTC -73.999817 40.738354 -73.999512 40.723217 1

1 27835199 2009-07-17 20:04:56.0000002 7.7 2009-07-17 20:04:56 UTC -73.994355 40.728225 -73.994710 40.750325 1

2 44984355 2009-08-24 21:45:00.00000061 12.9 2009-08-24 21:45:00 UTC -74.005043 40.740770 -73.962565 40.772647 1

3 25894730 2009-06-26 08:22:21.0000001 5.3 2009-06-26 08:22:21 UTC -73.976124 40.790844 -73.965316 40.803349 3

4 17610152 2014-08-28 17:47:00.000000188 16.0 2014-08-28 17:47:00 UTC -73.925023 40.744085 -73.973082 40.761247 5

In [4]: df.info() #To get the required information of the dataset

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 200000 entries, 0 to 199999

Data columns (total 9 columns):

Unnamed: 0 200000 non-null int64

key 200000 non-null object

fare_amount 200000 non-null float64

pickup_datetime 200000 non-null object

pickup_longitude 200000 non-null float64

pickup_latitude 200000 non-null float64

dropoff_longitude 199999 non-null float64

dropoff_latitude 199999 non-null float64

passenger_count 200000 non-null int64

dtypes: float64(5), int64(2), object(2)

memory usage: 13.7+ MB

In [5]: df.columns #TO get number of columns in the dataset

Index(['Unnamed: 0', 'key', 'fare_amount', 'pickup_datetime',

Out[5]:
'pickup_longitude', 'pickup_latitude', 'dropoff_longitude',

'dropoff_latitude', 'passenger_count'],

dtype='object')

In [6]: df = df.drop(['Unnamed: 0', 'key'], axis= 1) #To drop unnamed column as it isn't required

In [7]: df.head()

Out[7]: fare_amount pickup_datetime pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude passenger_count

0 7.5 2015-05-07 19:52:06 UTC -73.999817 40.738354 -73.999512 40.723217 1

1 7.7 2009-07-17 20:04:56 UTC -73.994355 40.728225 -73.994710 40.750325 1

2 12.9 2009-08-24 21:45:00 UTC -74.005043 40.740770 -73.962565 40.772647 1

3 5.3 2009-06-26 08:22:21 UTC -73.976124 40.790844 -73.965316 40.803349 3

4 16.0 2014-08-28 17:47:00 UTC -73.925023 40.744085 -73.973082 40.761247 5

In [8]: df.shape #To get the total (Rows,Columns)

(200000, 7)
Out[8]:

In [9]: df.dtypes #To get the type of each column

fare_amount float64

Out[9]:
pickup_datetime object

pickup_longitude float64

pickup_latitude float64

dropoff_longitude float64

dropoff_latitude float64

passenger_count int64

dtype: object

Column pickup_datetime is in wrong format (Object). Convert it to DateTime Format


In [10]: df.pickup_datetime = pd.to_datetime(df.pickup_datetime)

In [11]: df.dtypes

fare_amount float64

Out[11]:
pickup_datetime datetime64[ns, UTC]

pickup_longitude float64

pickup_latitude float64

dropoff_longitude float64

dropoff_latitude float64

passenger_count int64

dtype: object

Filling Missing values


In [12]: df.isnull().sum()

fare_amount 0

Out[12]:
pickup_datetime 0

pickup_longitude 0

pickup_latitude 0

dropoff_longitude 1

dropoff_latitude 1

passenger_count 0

dtype: int64

In [13]: df['dropoff_latitude'].fillna(value=df['dropoff_latitude'].mean(),inplace = True)

df['dropoff_longitude'].fillna(value=df['dropoff_longitude'].median(),inplace = True)

In [14]: df.isnull().sum()

fare_amount 0

Out[14]:
pickup_datetime 0

pickup_longitude 0

pickup_latitude 0

dropoff_longitude 0

dropoff_latitude 0

passenger_count 0

dtype: int64

To segregate each time of date and time


In [15]: df= df.assign(hour = df.pickup_datetime.dt.hour,

day= df.pickup_datetime.dt.day,

month = df.pickup_datetime.dt.month,

year = df.pickup_datetime.dt.year,

dayofweek = df.pickup_datetime.dt.dayofweek)

In [16]: df.head()

Out[16]: fare_amount pickup_datetime pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude passenger_count hour day month year dayofweek

0 7.5 2015-05-07 19:52:06+00:00 -73.999817 40.738354 -73.999512 40.723217 1 19 7 5 2015 3

1 7.7 2009-07-17 20:04:56+00:00 -73.994355 40.728225 -73.994710 40.750325 1 20 17 7 2009 4

2 12.9 2009-08-24 21:45:00+00:00 -74.005043 40.740770 -73.962565 40.772647 1 21 24 8 2009 0

3 5.3 2009-06-26 08:22:21+00:00 -73.976124 40.790844 -73.965316 40.803349 3 8 26 6 2009 4

4 16.0 2014-08-28 17:47:00+00:00 -73.925023 40.744085 -73.973082 40.761247 5 17 28 8 2014 3

Here we are going to use Heversine formula to calculate the distance between two points and journey, using the
longitude and latitude values.
Heversine formula
hav(θ) = sin**2(θ/2).

In [19]: from math import *

# function to calculate the travel distance from the longitudes and latitudes

def distance_transform(longitude1, latitude1, longitude2, latitude2):

travel_dist = []

for pos in range(len(longitude1)):

long1,lati1,long2,lati2 = map(radians,[longitude1[pos],latitude1[pos],longitude2[pos],latitude2[pos]])

dist_long = long2 - long1

dist_lati = lati2 - lati1

a = sin(dist_lati/2)**2 + cos(lati1) * cos(lati2) * sin(dist_long/2)**2

c = 2 * asin(sqrt(a))*6371

travel_dist.append(c)

return travel_dist

In [20]: df['dist_travel_km'] = distance_transform(df['pickup_longitude'].to_numpy(),


df['pickup_latitude'].to_numpy(),

df['dropoff_longitude'].to_numpy(),

df['dropoff_latitude'].to_numpy()

In [21]: df.head()

Out[21]: fare_amount pickup_datetime pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude passenger_count hour day month year dayofweek dist_travel_km

0 7.5 2015-05-07 19:52:06+00:00 -73.999817 40.738354 -73.999512 40.723217 1 19 7 5 2015 3 1.683323

1 7.7 2009-07-17 20:04:56+00:00 -73.994355 40.728225 -73.994710 40.750325 1 20 17 7 2009 4 2.457590

2 12.9 2009-08-24 21:45:00+00:00 -74.005043 40.740770 -73.962565 40.772647 1 21 24 8 2009 0 5.036377

3 5.3 2009-06-26 08:22:21+00:00 -73.976124 40.790844 -73.965316 40.803349 3 8 26 6 2009 4 1.661683

4 16.0 2014-08-28 17:47:00+00:00 -73.925023 40.744085 -73.973082 40.761247 5 17 28 8 2014 3 4.475450

In [22]: # drop the column 'pickup_daetime' using drop()

# 'axis = 1' drops the specified column

df = df.drop('pickup_datetime',axis=1)

In [23]: df.head()

Out[23]: fare_amount pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude passenger_count hour day month year dayofweek dist_travel_km

0 7.5 -73.999817 40.738354 -73.999512 40.723217 1 19 7 5 2015 3 1.683323

1 7.7 -73.994355 40.728225 -73.994710 40.750325 1 20 17 7 2009 4 2.457590

2 12.9 -74.005043 40.740770 -73.962565 40.772647 1 21 24 8 2009 0 5.036377

3 5.3 -73.976124 40.790844 -73.965316 40.803349 3 8 26 6 2009 4 1.661683

4 16.0 -73.925023 40.744085 -73.973082 40.761247 5 17 28 8 2014 3 4.475450

Checking outliers and filling them


In [24]: df.plot(kind = "box",subplots = True,layout = (7,2),figsize=(15,20)) #Boxplot to check the outliers

fare_amount AxesSubplot(0.125,0.787927;0.352273x0.0920732)

Out[24]:
pickup_longitude AxesSubplot(0.547727,0.787927;0.352273x0.0920732)

pickup_latitude AxesSubplot(0.125,0.677439;0.352273x0.0920732)

dropoff_longitude AxesSubplot(0.547727,0.677439;0.352273x0.0920732)

dropoff_latitude AxesSubplot(0.125,0.566951;0.352273x0.0920732)

passenger_count AxesSubplot(0.547727,0.566951;0.352273x0.0920732)

hour AxesSubplot(0.125,0.456463;0.352273x0.0920732)

day AxesSubplot(0.547727,0.456463;0.352273x0.0920732)

month AxesSubplot(0.125,0.345976;0.352273x0.0920732)

year AxesSubplot(0.547727,0.345976;0.352273x0.0920732)

dayofweek AxesSubplot(0.125,0.235488;0.352273x0.0920732)

dist_travel_km AxesSubplot(0.547727,0.235488;0.352273x0.0920732)

dtype: object

In [25]: #Using the InterQuartile Range to fill the values

def remove_outlier(df1 , col):

Q1 = df1[col].quantile(0.25)

Q3 = df1[col].quantile(0.75)

IQR = Q3 - Q1

lower_whisker = Q1-1.5*IQR

upper_whisker = Q3+1.5*IQR

df[col] = np.clip(df1[col] , lower_whisker , upper_whisker)

return df1

def treat_outliers_all(df1 , col_list):

for c in col_list:

df1 = remove_outlier(df , c)

return df1

In [26]: df = treat_outliers_all(df , df.iloc[: , 0::])

In [27]: df.plot(kind = "box",subplots = True,layout = (7,2),figsize=(15,20)) #Boxplot shows that dataset is free from outliers

fare_amount AxesSubplot(0.125,0.787927;0.352273x0.0920732)

Out[27]:
pickup_longitude AxesSubplot(0.547727,0.787927;0.352273x0.0920732)

pickup_latitude AxesSubplot(0.125,0.677439;0.352273x0.0920732)

dropoff_longitude AxesSubplot(0.547727,0.677439;0.352273x0.0920732)

dropoff_latitude AxesSubplot(0.125,0.566951;0.352273x0.0920732)

passenger_count AxesSubplot(0.547727,0.566951;0.352273x0.0920732)

hour AxesSubplot(0.125,0.456463;0.352273x0.0920732)

day AxesSubplot(0.547727,0.456463;0.352273x0.0920732)

month AxesSubplot(0.125,0.345976;0.352273x0.0920732)

year AxesSubplot(0.547727,0.345976;0.352273x0.0920732)

dayofweek AxesSubplot(0.125,0.235488;0.352273x0.0920732)

dist_travel_km AxesSubplot(0.547727,0.235488;0.352273x0.0920732)

dtype: object

In [28]: #Uber doesn't travel over 130 kms so minimize the distance

df= df.loc[(df.dist_travel_km >= 1) | (df.dist_travel_km <= 130)]

print("Remaining observastions in the dataset:", df.shape)

Remaining observastions in the dataset: (200000, 12)

In [29]: #Finding inccorect latitude (Less than or greater than 90) and longitude (greater than or less than 180)

incorrect_coordinates = df.loc[(df.pickup_latitude > 90) |(df.pickup_latitude < -90) |

(df.dropoff_latitude > 90) |(df.dropoff_latitude < -90) |

(df.pickup_longitude > 180) |(df.pickup_longitude < -180) |

(df.dropoff_longitude > 90) |(df.dropoff_longitude < -90)

In [30]: df.drop(incorrect_coordinates, inplace = True, errors = 'ignore')

In [31]: df.head()

Out[31]: fare_amount pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude passenger_count hour day month year dayofweek dist_travel_km

0 7.5 -73.999817 40.738354 -73.999512 40.723217 1.0 19 7 5 2015 3 1.683323

1 7.7 -73.994355 40.728225 -73.994710 40.750325 1.0 20 17 7 2009 4 2.457590

2 12.9 -74.005043 40.740770 -73.962565 40.772647 1.0 21 24 8 2009 0 5.036377

3 5.3 -73.976124 40.790844 -73.965316 40.803349 3.0 8 26 6 2009 4 1.661683

4 16.0 -73.929786 40.744085 -73.973082 40.761247 3.5 17 28 8 2014 3 4.475450

In [32]: df.isnull().sum()

fare_amount 0

Out[32]:
pickup_longitude 0

pickup_latitude 0

dropoff_longitude 0

dropoff_latitude 0

passenger_count 0

hour 0

day 0

month 0

year 0

dayofweek 0

dist_travel_km 0

dtype: int64

In [33]: sns.heatmap(df.isnull()) #Free for null values

<matplotlib.axes._subplots.AxesSubplot at 0x8d8af2a080>
Out[33]:

In [34]: corr = df.corr() #Function to find the correlation

In [35]: corr

Out[35]: fare_amount pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude passenger_count hour day month year dayofweek dist_travel_km

fare_amount 1.000000 0.154069 -0.110842 0.218675 -0.125898 0.015778 -0.023623 0.004534 0.030817 0.141277 0.013652 0.844374

pickup_longitude 0.154069 1.000000 0.259497 0.425619 0.073290 -0.013213 0.011579 -0.003204 0.001169 0.010198 -0.024652 0.098094

pickup_latitude -0.110842 0.259497 1.000000 0.048889 0.515714 -0.012889 0.029681 -0.001553 0.001562 -0.014243 -0.042310 -0.046812

dropoff_longitude 0.218675 0.425619 0.048889 1.000000 0.245667 -0.009303 -0.046558 -0.004007 0.002391 0.011346 -0.003336 0.186531

dropoff_latitude -0.125898 0.073290 0.515714 0.245667 1.000000 -0.006308 0.019783 -0.003479 -0.001193 -0.009603 -0.031919 -0.038900

passenger_count 0.015778 -0.013213 -0.012889 -0.009303 -0.006308 1.000000 0.020274 0.002712 0.010351 -0.009749 0.048550 0.009709

hour -0.023623 0.011579 0.029681 -0.046558 0.019783 0.020274 1.000000 0.004677 -0.003926 0.002156 -0.086947 -0.038366

day 0.004534 -0.003204 -0.001553 -0.004007 -0.003479 0.002712 0.004677 1.000000 -0.017360 -0.012170 0.005617 0.003062

month 0.030817 0.001169 0.001562 0.002391 -0.001193 0.010351 -0.003926 -0.017360 1.000000 -0.115859 -0.008786 0.011628

year 0.141277 0.010198 -0.014243 0.011346 -0.009603 -0.009749 0.002156 -0.012170 -0.115859 1.000000 0.006113 0.024278

dayofweek 0.013652 -0.024652 -0.042310 -0.003336 -0.031919 0.048550 -0.086947 0.005617 -0.008786 0.006113 1.000000 0.027053

dist_travel_km 0.844374 0.098094 -0.046812 0.186531 -0.038900 0.009709 -0.038366 0.003062 0.011628 0.024278 0.027053 1.000000

In [36]: fig,axis = plt.subplots(figsize = (10,6))

sns.heatmap(df.corr(),annot = True) #Correlation Heatmap (Light values means highly correlated)

<matplotlib.axes._subplots.AxesSubplot at 0x8d8affc588>
Out[36]:

Dividing the dataset into feature and target values


In [37]: x = df[['pickup_longitude','pickup_latitude','dropoff_longitude','dropoff_latitude','passenger_count','hour','day','month','year','dayofweek','dist_travel_km']]

In [38]: y = df['fare_amount']

Dividing the dataset into training and testing dataset


In [39]: from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test = train_test_split(x,y,test_size = 0.33)

Linear Regression
In [40]: from sklearn.linear_model import LinearRegression

regression = LinearRegression()

In [41]: regression.fit(X_train,y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)


Out[41]:

In [42]: regression.intercept_ #To find the linear intercept

2809.192377415925
Out[42]:

In [43]: regression.coef_ #To find the linear coeeficient

array([ 1.75328304e+01, -9.83172673e+00, 1.54611809e+01, -1.69707270e+01,

Out[43]:
5.40456388e-02, 9.46950748e-03, 1.66720620e-03, 5.40917698e-02,

3.61743634e-01, -3.69474342e-02, 2.00077959e+00])

In [44]: prediction = regression.predict(X_test) #To predict the target values

In [45]: print(prediction)

[10.80422002 4.74707896 9.95283165 ... 5.89597937 17.00144322

5.38487972]

In [46]: y_test

16850 8.50

Out[46]:
181076 4.10

70798 9.30

87421 12.90

169443 22.25

18976 11.00

50921 13.70

199564 14.50

125215 5.30

67510 8.50

85217 22.25

156903 21.50

116795 4.10

112179 16.90

124459 3.70

173299 22.25

51448 19.70

99502 22.25

174467 10.90

78880 20.50

26798 22.25

38501 4.50

63091 12.90

171207 22.25

142238 8.50

101106 7.30

120177 4.50

154585 14.50

75840 5.50

85918 14.00

...

104227 10.10

14172 19.70

49985 3.70

183045 6.50

11927 12.90

93684 4.50

101795 13.70

21444 6.10

85147 8.50

81311 8.00

157686 11.70

194074 6.50

132558 10.50

132616 11.70

188536 5.70

179629 8.90

11277 3.70

147880 7.30

116553 5.70

157394 6.50

103519 13.30

41348 12.90

12608 4.50

6820 5.50

84612 5.00

168836 3.70

39719 21.00

124536 4.90

90432 22.10

12543 4.90

Name: fare_amount, Length: 66000, dtype: float64

Metrics Evaluation using R2, Mean Squared Error, Root Mean Sqared Error
In [47]: from sklearn.metrics import r2_score

In [48]: r2_score(y_test,prediction)

0.7471032194200018
Out[48]:

In [49]: from sklearn.metrics import mean_squared_error

In [50]: MSE = mean_squared_error(y_test,prediction)

In [51]: MSE

7.464818887848474
Out[51]:

In [52]: RMSE = np.sqrt(MSE)

In [53]: RMSE

2.7321820744321696
Out[53]:

Random Forest Regression


In [54]: from sklearn.ensemble import RandomForestRegressor

In [55]: rf = RandomForestRegressor(n_estimators=100) #Here n_estimators means number of trees you want to build before making the prediction

In [56]: rf.fit(X_train,y_train)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,

Out[56]:
max_features='auto', max_leaf_nodes=None,

min_impurity_decrease=0.0, min_impurity_split=None,

min_samples_leaf=1, min_samples_split=2,

min_weight_fraction_leaf=0.0, n_estimators=100,

n_jobs=None, oob_score=False, random_state=None,

verbose=0, warm_start=False)

In [57]: y_pred = rf.predict(X_test)

In [58]: y_pred

array([ 9.7025, 4.744 , 9.202 , ..., 6.468 , 16.2802, 4.47 ])


Out[58]:

Metrics evaluatin for Random Forest


In [59]: R2_Random = r2_score(y_test,y_pred)

In [60]: R2_Random

0.8024361566950065
Out[60]:

In [64]: MSE_Random = mean_squared_error(y_test,y_pred)

MSE_Random

5.831542440662031
Out[64]:

In [65]: RMSE_Random = np.sqrt(MSE_Random)

RMSE_Random

2.4148586792319815
Out[65]:

You might also like