
IET Intelligent Transport Systems
Special Issue: Intelligent Transportation Systems in Smart Cities for Sustainable Environments

Seoul bike trip duration prediction using data mining techniques

ISSN 1751-956X
Received on 29th November 2019
Revised 5th March 2020
Accepted on 17th April 2020
E-First on 14th July 2020
doi: 10.1049/iet-its.2019.0796
www.ietdl.org

Sathishkumar V E1, Jangwoo Park1, Yongyun Cho1


1Department of Information and Communication Engineering, Sunchon National University, Suncheon, Republic of Korea
E-mail: [email protected]

Abstract: Trip duration is the most fundamental measure in all modes of transportation. Hence, it is crucial to predict trip time precisely for the advancement of intelligent transport systems and traveller information systems. In this study, data mining techniques are employed to predict the trip duration of rental bikes in the Seoul Bike sharing system. The prediction is carried out with a combination of Seoul Bike data and weather data. The data used include trip duration, trip distance, pickup and dropoff latitude and longitude, temperature, precipitation, wind speed, humidity, solar radiation, snowfall, ground temperature and 1-hour average dust concentration. Feature engineering is done to extract additional features from the data. Four statistical models are used to predict the trip duration: (a) linear regression, (b) gradient boosting machines, (c) k nearest neighbour and (d) random forest (RF). Four performance metrics, root mean squared error, coefficient of determination, mean absolute error and median absolute error, are used to determine the efficiency of the models. In comparison with the other models, the best model, RF, explains 93% of the variance (R²) in the testing set and 98% in the training set. The outcome proves that RF is effective for the prediction of trip duration.

1 Introduction

According to recent studies, more than 60% of the world's population is expected to dwell in cities, up from about 50% at present [1]. Some cities around the world already offer exemplary mobility at a fair cost and with reduced carbon emissions; on the contrary, other cities are far behind [2]. Urban mobility accounts for roughly 64% of all kilometres travelled in the world. It ought to be re-modelled around inter-modality and networked self-driving vehicles, which also provide a sustainable means of mobility. Mobility on Demand (MOD) systems play a vital part in raising the supply of vehicles and reducing their idle time. Bike-sharing MOD systems already play an effective part in short commuting and as 'last mile' mobility resources on inter-modal trips in several cities. Certain issues prevail in the maintenance, design and management of bike-sharing systems: layout of the station design; fleet size and capacity of the station; detecting broken, lost or stolen bikes; pricing; monitoring of traffic and customer activities to promote virtuous behaviour; and marketing campaigns. System balancing is the hardest endeavour: in the daytime, some stations are likely to be crowded with bike flow while other stations are left empty, which hampers pick-up and drop-off, respectively. So, to restore the balance, several manual techniques, like shifting bikes by trucks, cars and even volunteers, are employed. Data analysis techniques and studies focusing on dynamic systems and optimisation methods are utilised to complement the knowledge base for employing optimum rebalancing policies [3].

Today, bike-sharing systems are blooming across more than 1000 cities around the world [4], particularly in large cities like New York City, Paris, Washington DC, London, Beijing and Barcelona. For a short trip, renting a bike is faster than walking; moreover, it is eco-friendly and more comfortable than driving. This paved the way for its rising popularity [5]. As in other countries, South Korea has its own bike-sharing system, called Ddareungi, set up in 2015; in English it is named Seoul Bike. The system is designed to tackle traffic congestion, high oil prices and air pollution in Seoul, and to create a healthier society, thus improving the quality of life for Seoul residents. Ddareungi was first implemented in October 2015 in selected areas on the right bank of the Han River, Seoul. As months passed, the station count rose above 150 and 1500 bikes were made available. Since 2016, the number of stations has kept rising and now covers districts that were not included earlier. According to July 2016 data, 300 stations were functioning with 3000 bikes. Furthermore, Seoul mayor Park Won-soon affirmed his intention of raising the availability of bikes to 20,000. As of now, there are more than 1500 bike rental stations in Seoul running 24 h with advanced technology.

As advanced traveller information systems and intelligent transportation systems show rapid growth, the data raised by these systems can be valuable for understanding and enhancing procedures in transport companies and other organisations dealing with transportation, i.e. public and private transport companies, logistics companies and local government. Trip duration in public rental bikes is a notable example of a travel analytics problem which profits from data analysis. Knowing the estimated trip duration in advance helps government organisations and also travellers, for instance in making the right choice on route planning and timing. For this reason, information on bike trips (most likely GPS data) obtained from rental bikes can be utilised.

Data mining methods can be employed to predict the duration of the trip. By exploring the data acquired by Seoul Bike, data mining methods link trip duration to certain variables describing the trip, such as the source, destination, weather, time of day, day of the week, month and minute. Several algorithms have been developed and employed to predict trip duration, yet the performance of prediction varies, which raises certain difficulties. Data mining performs the key task of identifying the algorithm that performs best for a particular problem. However, it is already known that there is no single best algorithm for a large domain of problems [6]. Picking an algorithm for a specific problem is done by trial and error or based on an expert's advice. However, neither approach is satisfactory to the end-user who wishes to access the technology in a cost-efficient manner [7]. The ideal learning algorithm to predict trip duration usually varies between rental bike sharing systems, due to differences in use and riding habits. Hence, the choice of algorithm should be made at a lower level, like the rental bike system itself, instead of the global level. Whereas, for systems of multiple data sources in

IET Intell. Transp. Syst., 2020, Vol. 14 Iss. 11, pp. 1465-1474 1465
© The Institution of Engineering and Technology 2020
Fig. 1  Docking stand in Seoul

which the data structure is the same, raising the reliability of the model for a specific source is plausible. It can be achieved by utilising other data sources for training.

In this analysis, a data mining methodology is used to predict the duration of each trip using weather information. Given that weather plays an important role in transportation, weather pattern information is considered a primary determinant in predicting the duration of a bike rental trip. The data is pre-processed and cleansed and then combined with Seoul weather data. Four regression algorithms are used to predict the duration of each rental bike trip, and the best performing algorithm is picked. Fig. 1 presents a view of the docking station of Seoul rental bikes. The knowledge extracted from these trip duration patterns can be used to provide convenient public bike sharing and to develop tourism services. Accurate travel-time prediction is also critical to the development of intelligent transport systems, route planning, navigation applications and advanced traveller information systems.

The structure of the rest of the paper is organised as follows. Section 2 reviews the previous studies on bike-sharing systems and prediction approaches. Section 3 describes the algorithms used in detail. Section 4 deals with the preparation of data and exploratory analysis. Section 5 describes the evaluation indices. Section 6 deals with the discussion of results. Finally, Section 7 concludes the paper.

2 Literature survey

A vast range of research has been conducted on the prediction of trip duration. Travel time is the time required to traverse a path or link between any two points of interest. There are two approaches to the estimation of travel times: point measurement and link measurement [8]. Using actual traffic data, it has been shown that simple prediction methods can provide a significant estimate of the duration of trips beginning shortly (up to 20 min ahead). On the other side, better predictions can be produced with historical data for trips beginning more than 20 min away.

Research by Li et al. [9] deals with the problem of predicting path travel times when only a low number of GPS floating cars are accessible. An algorithm is developed for learning local congestion patterns of a compact set of commonly shared paths from historical data. Given a travel time prediction query, the current trends of congestion around the query path are established from recent trajectories, and its travel time in the near future is then inferred. Mridha and Niloy Ganguly [10] establish a link (road segment) travel time estimation algorithm, named Least Square Estimation with Constraint, that calculates travel time 20% more accurately than existing algorithms. The primary concept is the augmentation of a subset of trips along specific paths utilising logged distance information, rather than fitting an ad hoc 'route-choice' model. An approach to predict travel time by Miura [11] used the Kriging method, a spatial prediction method, as a predictive measure for the travel time of a car in a notional four-dimensional space. Every point in the space signifies a particular trip, and the co-ordinates of a particular point stand for the start and terminal points on the plane. The duration of travel is thus rendered as a function over the four-dimensional space. The prediction depends on the assumption that adjacent points (in the four-dimensional space) have nearly the same travel time. Here, breaking down the travel time from source to destination into link travel times is not necessary. Data from 'probe vehicles' are used in this method for predicting the travel time in the near future. A case study in London reveals this method's feasibility.

For predicting the duration of a trip on a freeway, Kwon et al. [12] used the occupancy and flow data from single loop detectors along with past trip duration data. The same method is utilised by Chien and Kuchipudi [13]. Zhang and Rice [14] proposed using a linear model for predicting the duration of short-term freeway trips, where the duration of a trip is considered as a function of the departure time. The outcomes reveal that the error drifts between 5 and 10% in the case of a small dataset, whereas for a larger dataset the error varies between 8 and 13%. Wu et al. [15] used support vector regression (SVR) for the prediction of trip duration. For their examination, they exploited real highway traffic data. Furthermore, a set of trial-and-error SVR parameters is proposed, which in turn leads to a model that surpasses a baseline model. Balan et al. [16] put forward a system which renders trip information, including the estimated price and trip duration, for travellers. Historical data of paid taxi trips comprising almost 250 million entries is taken into consideration in this study.

Given the fast behavioural change in vehicle networks, applying a single learning algorithm for predicting the travel time of various vehicles over a prolonged duration certainly leads to incorrect predictions. So it is important to figure out the optimum algorithm for each context. Using a trial and error method is one possibility: it intends to find the best fitting algorithm for the particular dataset (i.e. for a specific period and also for a specific vehicle), and the best algorithm is picked out by comparison with other algorithms [17]. This method consumes a lot of time, given the several available alternatives. Meta-learning deals with the problem of selecting the algorithm which leads to an optimum model giving a precise prediction for every trip [18]. The methodology was tried out on data gathered from a Drive-In project, and the results assert meta-learning's capability of picking out the algorithm with optimum precision. Handley et al. [19] investigate the application of machine learning and data mining to boost the prediction of travel times in an automobile. Data collected from the San Diego freeway system and k-nearest neighbour combined with a wrapper are used to choose useful features and parameters for normalisation. The results suggest that three nearest neighbours greatly outperform

predictions available from existing digital maps when using information from freeway sensors. The analyses also show some surprises about the utility of other features such as the day and time of the trip. Hailu and Gau [20] present models that predict two parameters of recreational fishing trips: the trip length and the trip timing within a year. A discrete choice (logit) model connecting the choice of trip timing with scheduling events and the demographic characteristics of anglers and the trip's nature is estimated econometrically. A Tobit model is implemented for assessing the effects of trip and personal characteristics on fishing trip length. The results indicate that the choice of timing and trip duration can be explained well in terms of personal variables and observable trip characteristics. Knowing these connections is a valuable input into the management of tourism/recreational fishing and the development of models for the simulation of tourism/fishing activities.

Lee et al. [21] proposed an algorithm for travel time prediction which utilises rule-based MapReduce grouping of huge-scale trajectory data. Firstly, the algorithm sets rules for grouping based on real traffic statistics and ascertains the effective velocity classes for each part of the road. Secondly, it generates a distributed index using a grid-based map partitioning method. It can also reduce the query processing cost, as only the grid cells which include the query region are retrieved, rather than the whole road network. Also, the query processing time can be minimised by calculating the travel time of the given queries for each segment in parallel.

The above studies provide information about the research carried out so far on trip duration and travel time prediction in various modes of transport. Such studies reflect the need to predict the duration of the trip for the development of various applications. Many techniques are used to predict the duration of the trip, but the use of data mining techniques could be an efficient tool to provide satisfactory results in prediction.

3 Methodology

3.1 Linear regression

Linear regression (LR) is the simplest statistical regression method for identifying the linear link between the independent and the dependent variables. It is done by fitting a linear equation to the observed data [22]. Before fitting the model, it is of utmost importance to check whether there is a connection between the variables or features of interest, using a numerical measure such as the correlation coefficient. The following equation defines an LR line:

Y = a + bX, (1)

where X is the independent variable, Y is the dependent variable, b is the slope of the line and a is the intercept (the value of Y when X = 0). For figuring out the best fitting line, the least-squares method is commonly used: it minimises the sum of squares of the vertical deviations of each point from the line, i.e. the sum of squared residuals.

3.2 Gradient boosting machine

The gradient boosting machine (GBM) model has more benefits than the LR model commonly found in the existing works. It is capable of handling various kinds of independent variables (categorical, continuous etc.) and involves minimal time for data preparation. It fits complex non-linear relationships between trip duration and the independent variables, as trip duration is not required to follow a particular distribution. In decision trees, the response for an independent variable depends on the values of other independent variables higher up the tree, so GBM automatically models interactions between independent variables [23]. Besides, it seizes sharp or subtle variation in the duration of the trip and improves predictive accuracy by boosting. The rest of the section describes the GBM algorithm mathematically.

Consider X as a feature set of explanatory variables and F(x) as the approximation function of the response variable y. This method computes the function F(x) as an additive expansion depending on the basis function h(x; a_m) [24, 25]. Equation (2) represents F(x):

F(x) = Σ_{m=1}^{M} f_m(x) = Σ_{m=1}^{M} β_m h(x; a_m), (2)

where a_m denotes the split locations and terminal node means for each splitting variable in the individual decision tree h(x; a_m), and β_m is determined by minimising a specified loss function L(y, F(x)) = (y − F(x))². For effective estimation, the gradient boosting approach has been proposed [26]. Its algorithm can be summed up as follows [27]:

Step 1: Initialise F_0(x) to be a constant, F_0(x) = arg min_β Σ_{i=1}^{N} L(y_i, β).
Step 2: For m = 1 to M:
  For i = 1, 2, …, N compute the negative gradient
    ỹ_im = −[∂L(y_i, F(x_i)) / ∂F(x_i)] evaluated at F(x) = F_{m−1}(x).
  Fit a regression tree h(x; a_m) to the targets ỹ_im.
  Compute a gradient descent step size as β_m = arg min_β Σ_{i=1}^{N} L(y_i, F_{m−1}(x_i) + β h(x_i; a_m)).
  Update the model as F_m(x) = F_{m−1}(x) + β_m h(x; a_m).
Step 3: Output the final model F(x) = F_M(x).

To overcome the over-fitting issue [4], a learning rate (or shrinkage) is used to scale each base tree model's contribution by introducing a factor ξ (0 < ξ ≤ 1), as given in the following equation:

F_m(x) = F_{m−1}(x) + ξ · β_m h(x; a_m), where 0 < ξ ≤ 1. (3)

The smaller the shrinkage value, the lower the loss function; nonetheless, it requires adding more trees to the model, so there is a trade-off between the learning rate and the number of trees. The other significant factor for the GBM method is the complexity of the tree, which refers to the number of splits fitted in every decision tree. For capturing complex interactions between variables, the complexity of the tree must be increased. Overall, the best performance of the model relies on a collective choice of the number of trees, the learning rate and the tree complexity.

3.3 K nearest neighbour

k-nearest neighbours (kNN) is a non-parametric learning algorithm employed for regression or classification [28]. In both cases the input consists of the k closest training examples in the feature space; the output differs depending on whether kNN is used for regression or classification.

In kNN classification, the result is a class membership. An object is categorised by a relative majority of its neighbours, the object being attributed to the class that is most common among its k nearest neighbours (k is a positive integer).

In kNN regression, the result is the property value of the object, computed by averaging the values of the k nearest neighbours.

It is considered non-parametric because of the absence of an explicit mapping relationship between the independent and dependent variables. The proximity of neighbouring independent variables and the respective dependent variables is used to render the ultimate scores of the test data. K, the parameter specifying the number of neighbouring observations considered, must be chosen prior to scoring. kNN is considered a simple learning algorithm, and when it is applied in practice its performance is considered acceptable.
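As a minimal illustration of the averaging rule used by kNN regression, the sketch below predicts a query point's value as the mean target of its k nearest training examples under Euclidean distance. The toy feature matrix is illustrative only, not the paper's actual data.

```python
# Minimal kNN regression sketch: the prediction for a query point is the mean
# target value of its k closest training examples (Euclidean distance).
import numpy as np

def knn_regress(X_train, y_train, x_query, k=3):
    """Predict by averaging the targets of the k nearest neighbours."""
    dists = np.linalg.norm(X_train - x_query, axis=1)  # distance to every training point
    nearest = np.argsort(dists)[:k]                    # indices of the k closest points
    return y_train[nearest].mean()                     # average their target values

# Toy usage: duration roughly proportional to distance.
X = np.array([[1.0], [2.0], [3.0], [10.0]])
y = np.array([5.0, 10.0, 15.0, 50.0])
pred = knn_regress(X, y, np.array([2.5]), k=2)  # neighbours are 2.0 and 3.0
# pred == (10.0 + 15.0) / 2 == 12.5
```

Choosing k trades off noise sensitivity (small k) against over-smoothing (large k), which is why it must be fixed before scoring.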

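The boosting recursion of Section 3.2 can be sketched for the squared loss, where the negative gradient reduces to the residual y − F_{m−1}(x). Here scikit-learn's DecisionTreeRegressor stands in for the base learner h(x; a_m), and the step size β_m is absorbed into the tree's leaf means, a common simplification rather than the paper's exact procedure.

```python
# Sketch of gradient boosting for squared loss: each round fits a small tree
# to the residuals and adds a shrunken copy of it to the running prediction.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbm_fit(X, y, n_trees=100, shrinkage=0.1, max_depth=2):
    f0 = y.mean()                            # Step 1: constant minimiser of squared loss
    pred = np.full(len(y), f0)
    trees = []
    for _ in range(n_trees):                 # Step 2: one tree per boosting round
        residual = y - pred                  # negative gradient of the squared loss
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        pred += shrinkage * tree.predict(X)  # F_m = F_{m-1} + xi * h(x; a_m)
        trees.append(tree)
    return f0, trees                         # Step 3: final model = f0 + sum of trees

def gbm_predict(f0, trees, X, shrinkage=0.1):
    pred = np.full(len(X), f0)
    for tree in trees:
        pred += shrinkage * tree.predict(X)
    return pred
```

With shrinkage closer to 0 more trees are needed before the loss flattens, which is exactly the learning-rate versus number-of-trees trade-off noted above.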
3.4 Random forest

Random forests (RFs), suggested by Breiman [29], are non-parametric, tree-based ensemble techniques [30, 31]. Unlike traditional statistical methods, RFs consist of easy-to-interpret decision tree models instead of parametric models. A more comprehensive prediction model can be obtained by integrating the results of the decision tree models. The main objective of this research is to predict the Seoul Bike trip duration, addressed in regression mode. RF regression is a non-parametric regression method consisting of a set of K trees {T1(X), T2(X), …, TK(X)}, where X = {x1, x2, …, xp} is a p-dimensional independent vector that forms a forest. The ensemble produces P dependent variables corresponding to each tree Yp (p = 1, 2, …, P). The ultimate result is achieved by computing the mean of all tree predictions. The training process is as follows:

(a) Extract a bootstrap sample from the accessible dataset, i.e. a sample chosen randomly with replacement.
(b) Grow a tree on the bootstrap sample, at each node selecting the best split among a randomly chosen subset of mtry (the number of predictor variables tested at each node) descriptors. mtry serves as a vital tuning factor in the RF algorithm. The tree is grown to maximum size without pruning back.
(c) Step (b) is repeated until the number of trees (ntree) defined by the user is grown, each based on its own bootstrap sample of observations.

For regression, RFs build K regression trees and average the results. Final predicted values are obtained by aggregating the outcomes of each tree. The following equation defines the RF regression predictor, after the K trees {Tk(x)} have been grown:

f(x) = (1/K) Σ_{k=1}^{K} T_k(x). (4)

For each RF regression tree construction, a fresh training set (bootstrap sample) is drawn with replacement from the original training set, so every regression tree is constructed using a randomised sample of the original dataset. The out-of-bag sample is utilised for examining its accuracy, with the Gini index at node t_X(x_i) given in (5):

GI(t_X(x_i)) = 1 − Σ_{j=1}^{m} f²(t_X(x_i), j). (5)

The inherent validation features improve the tree robustness of the RFs while utilising independent test results. RFs have also been shown to be a viable method for regression and classification, and so are utilised in this study.

4 Data preparation and exploratory analysis

The objective of this research is to predict the trip duration as accurately as possible for each rental bike from the various predictors considered, and moreover to compare the performance of the different regression models, linear regression (LR), k nearest neighbours (kNN), gradient boosting machine (GBM) and random forest (RF), in predicting the trip duration.

4.1 Data creation

One year of data (January 2018 to December 2018) is downloaded from the South Korean website Seoul Public Data Park (Open Data), where data about trips made using Seoul Bike throughout the year is available [32]. The time-span of the dataset is 365 days (12 months). The one-year data consists of 9,987,224 entries, which means nearly 10 million trips were made in one year. The fields in the dataset include rental bike number, pickup date and time, pickup station number and address, dropoff date and time, dropoff station number and address, return dock, trip duration in minutes and total distance in metres. This data does not include the latitude and longitude of the pickup and dropoff stations. The latitude and longitude details of the rental bike stations are downloaded from the same website and merged using the station number and address. Since details about some rental stations are not updated, the trips without station details are dropped. After this step, the total number of entries is 9,974,018.

Since weather information is among the most influential data contributing to trip duration and was used in previous research studies, weather information is also added to increase the performance of the prediction models. The weather information is downloaded from the Korean Meteorological Society [33]. Hour-wise weather information is used, and the weather variables are temperature, precipitation, wind speed, humidity, solar radiation, snowfall and ground temperature. South Korea also experiences fine dust, which affects the environment considerably, so this can be used as an influencing variable in trip duration prediction as well. One-hour average fine dust concentration data is therefore added. The fine dust data contains some missing values, and the missing entries are replaced with 0. Fig. 2 shows the whole process involved in data preparation and Fig. 3 presents the whole system flow followed in this research.

Fig. 2  Data creation process
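The data-creation steps above (joining trips to station coordinates by station number, then attaching hour-wise weather) could be sketched with pandas as follows. All frame and column names here are hypothetical stand-ins, not the actual Seoul Open Data schema.

```python
import pandas as pd

# Hypothetical mini-frames standing in for the real Seoul Bike and weather files.
trips = pd.DataFrame({
    "pickup_station": [101, 102],
    "pickup_time": pd.to_datetime(["2018-06-01 08:30", "2018-06-01 09:10"]),
    "duration_min": [12, 25],
})
stations = pd.DataFrame({"station": [101, 102],
                         "lat": [37.55, 37.56], "lon": [126.97, 127.00]})
weather = pd.DataFrame({
    "hour": pd.to_datetime(["2018-06-01 08:00", "2018-06-01 09:00"]),
    "temp_c": [21.5, 23.0],
})

# 1) Attach station coordinates; how="inner" drops trips whose station is
#    unknown, mirroring the paper's removal of trips without station details.
df = trips.merge(stations, left_on="pickup_station", right_on="station", how="inner")

# 2) Attach hour-wise weather by flooring the pickup time to the hour.
df["hour"] = df["pickup_time"].dt.floor("h")
df = df.merge(weather, on="hour", how="left")

# 3) Replace missing weather/dust readings with 0, as in Section 4.1.
df["temp_c"] = df["temp_c"].fillna(0)
```

The same pattern would repeat for dropoff stations and for the dust-concentration file.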

Fig. 3  System flow

Fig. 6  Boxplot of trip duration after outliers removal


Fig. 4  Boxplot of trip duration before outliers removal

Fig. 5  Boxplot of trip distance before outliers removal

4.2 Data preprocessing


To remove noise in the data and to make the prediction algorithms perform better, some basic pre-processing steps from data mining are carried out. Dropping 0 entries in trip duration and trip distance is the first step; after this step, the number of entries came down to 9,830,314. The next step is to remove outliers in the trip duration and trip distance fields. Fig. 4 presents the boxplot of trip duration and Fig. 5 shows the boxplot of the distance field. It can be noted from the figures that there are a lot of outliers in both fields, including a maximum of 5940 (minutes) in trip duration and a maximum of 255,990 (metres) in trip distance. Removing these outliers improves prediction performance, so data lying outside 3 standard deviations from the mean is excluded for both trip duration and trip distance.

Fig. 7  Boxplot of trip distance after outliers removal

Fig. 6 shows the boxplot of trip duration after outlier removal and Fig. 7 shows the boxplot of trip distance after outlier removal. In the box plot, the median is represented by a black line inside the blue rectangle. The thick line above the upper

Fig. 8  Histogram plot for trip duration

Fig. 9  Average trip duration across


(a) Months, (b) Day of the month, (c) Day of the week, (d) Hour of the day
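The three-standard-deviation outlier filter described in Section 4.2 might be sketched as follows; the column name and toy values are illustrative, not the actual dataset.

```python
import pandas as pd

def drop_outliers(df, cols, n_std=3.0):
    """Keep rows within n_std sample standard deviations of the mean in every column."""
    mask = pd.Series(True, index=df.index)
    for col in cols:
        mu, sigma = df[col].mean(), df[col].std()
        mask &= (df[col] - mu).abs() <= n_std * sigma
    return df[mask]

# Toy usage: 100 ordinary durations plus one extreme 5940-minute trip;
# the extreme value falls outside 3 standard deviations and is dropped.
toy = pd.DataFrame({"duration": list(range(100)) + [5940]})
clean = drop_outliers(toy, ["duration"])
```

Note that on very small samples a single extreme value can inflate the standard deviation enough to survive the filter; with millions of rows, as here, the 3-sigma rule behaves as intended.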

whisker represents the outliers, and after the removal of the outliers their quantity is reduced compared to Figs. 4 and 5. After outlier removal, the dataset is reduced to 9,601,139 entries.

4.3 Exploratory data analysis

Exploratory data analysis (EDA) is a common approach which helps in summarising the main characteristics of a dataset, commonly with visual methods. Using a statistical model is optional; mainly, EDA helps to seek pieces of information in the data which lie beyond the hypothesis-testing task or formal modelling. Fig. 8 displays a histogram and the boxplot of the data. It also shows that the data distribution possesses a long tail; it can be understood that most trips are between 0 and 20 min.

Fig. 8 shows the histogram plot for trip duration. This shows that the trip duration data is skewed. Fig. 9 presents the average duration of the trip across months, day of the week, day of the month and hour of the day. From the plots, it is clear that trip duration has a strong time component, and these dependencies are used to predict the trip duration more accurately. From Fig. 9a it is clear that average trip duration is much lower during January, February, November and December, which is the winter season in South Korea. This proves that temperature affects the trip duration. Fig. 9b presents a plot of average trip duration across the day of the month; the trip duration across the day of the month is not stable and there is no clear pattern. Fig. 9c presents average trip duration across the day of the week, and the trip duration is high during the weekends. From Fig. 9d, average trip duration is high during hours 15, 16 and 20, which represent leisure hours with less traffic

Fig. 10  Latitude and Longitude distribution
(a) Pickup Latitude and Longitude distribution, (b) Dropoff Latitude and Longitude distribution

and rush, and less during hours 8 and 18, which represent the morning and evening peak hours in Seoul city, respectively.

Fig. 10a shows the pickup latitude and longitude distribution and Fig. 10b shows the dropoff latitude and longitude distribution. It can be seen that pickup and dropoff longitude is in the range (126.60–127.80) and pickup and dropoff latitude is in the range (37.45–37.70). This proves that all the trips are executed within the Seoul city range, with no potential outliers associated with latitude and longitude.

4.4 Feature engineering

The next step is to create additional features from the date/time variables to make the machine learning algorithms work more efficiently. This process of creating additional features from the existing data by using domain knowledge is known as feature engineering. From the pickup date and time variable, variables such as pickup month, day, hour, minute and day of the week are computed. From the dropoff date and time variable, variables such as dropoff month, day, hour, minute and day of the week are extracted. Although the trip distance is already present in the data, the distance between the pickup station and the dropoff station is also computed using the haversine function, from the pickup and dropoff latitude and longitude details.

Table 1 presents the list of all the features (variables or parameters) and their corresponding abbreviation, type (continuous or categorical) and measurement.

After creating the final dataset, the next step is to check whether the variables considered for predicting the trip duration are correlated with the dependent variable. So a correlation plot is created for finding the relationship among the variables. Fig. 11 shows the pairs plot and displays the correlation values of trip duration with Distance, PLong, PLatd, DLong, DLatd, Haversine, Temp, Precip, Wind, Humid, Snow, GroundTemp and Dust. As can be seen from the plot, the dependent variable Duration has at least some correlation with each of the independent variables. This shows the trip duration variable is associated with the other variables considered in this study. Positive values represent a positive correlation between the variables and negative values represent a negative correlation.

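The haversine computation and the date/time decomposition described in this subsection can be sketched in Python with pandas. This is a minimal illustration only, under assumed column names matching the abbreviations in Table 1 (the paper's actual code is not published):

```python
import numpy as np
import pandas as pd

def haversine_km(plat, plong, dlat, dlong):
    """Great-circle distance (km) between pickup and dropoff points
    given in decimal degrees, via the haversine formula."""
    plat, plong, dlat, dlong = map(np.radians, (plat, plong, dlat, dlong))
    a = (np.sin((dlat - plat) / 2) ** 2
         + np.cos(plat) * np.cos(dlat) * np.sin((dlong - plong) / 2) ** 2)
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))  # 6371 km = mean Earth radius

def add_time_features(df, col, prefix):
    """Derive month, day, hour, minute and day-of-week features
    from a datetime column, e.g. add_time_features(df, "PDtime", "P")."""
    t = pd.to_datetime(df[col])
    df[prefix + "month"] = t.dt.month
    df[prefix + "day"] = t.dt.day
    df[prefix + "hour"] = t.dt.hour
    df[prefix + "min"] = t.dt.minute
    df[prefix + "Dweek"] = t.dt.day_name()
    return df
```

With prefix "P" this yields the Pmonth, Pday, Phour, Pmin and PDweek columns of Table 1; prefix "D" gives the dropoff counterparts.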
IET Intell. Transp. Syst., 2020, Vol. 14 Iss. 11, pp. 1465-1474 1471
© The Institution of Engineering and Technology 2020
The highest correlation value exists between the Duration and Distance variables.
The combined data set is split into a training and a test set using the train_test_split function [34]. A total of 75% of the data is employed for training the models while the remainder is used for testing (Table 2).

Table 1 Data variables and description
Parameters/Features                  Abbreviation  Type         Measurement
date                                 Date          —            year-month-day hour:minute:second
trip duration                        Duration      continuous   1, 2, 3, … 5940
trip distance                        Distance      continuous   1, 2, 3, … 33,290
pickup date and time                 PDtime        —            year-month-day hour:minute:second
dropoff date and time                DDtime        —            year-month-day hour:minute:second
pickup longitude                     PLong         continuous   radians
pickup latitude                      PLatd         continuous   radians
dropoff longitude                    DLong         continuous   radians
dropoff latitude                     DLatd         continuous   radians
haversine distance                   Haversine     continuous   kilometres
pickup month                         Pmonth        categorical  January, February, … December
pickup day                           Pday          categorical  1, 2, 3, … 31
pickup hour                          Phour         categorical  0, 1, 2, … 23
pickup minute                        Pmin          continuous   1, 2, 3, … 60
pickup day of the week               PDweek        categorical  Sunday, Monday, … Saturday
dropoff month                        Dmonth        categorical  January, February, … December
dropoff day                          Dday          categorical  1, 2, 3, … 31
dropoff hour                         Dhour         categorical  0, 1, 2, … 23
dropoff minute                       Dmin          continuous   1, 2, 3, … 60
dropoff day of the week              DDweek        categorical  Sunday, Monday, … Saturday
temperature                          Temp          continuous   °C
precipitation                        Precip        continuous   mm
windspeed                            Wind          continuous   m/s
humidity                             Humd          continuous   %
solar radiation                      Solar         continuous   MJ/m2
snowfall                             Snow          continuous   cm
ground temperature                   GroundTemp    continuous   °C
1 h average fine dust concentration  Dust          continuous   ㎍/㎥

5 Evaluation indices

Multiple evaluation criteria are used for comparing the performance of the regression models. The performance evaluation indices used here are: Root Mean Squared Error (RMSE), R-squared (R2), Median Absolute Error (MedAE) and Mean Absolute Error (MAE).

RMSE is the sample standard deviation of the residuals between the observed and the predicted values. Large errors can be identified using this measure, and the fluctuation of the model response in terms of variance can be evaluated. RMSE is a scale-dependent metric that outputs values in the same units as the measurement. On the other hand, R2 is the coefficient of determination, ranging from 0 to 1, which reflects the fitting quality: a high R2 value signifies that the predicted values fit the observed values well. The formulas to compute the RMSE and R2 values are given in (6) and (7), respectively:

RMSE = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{n}}    (6)

R^2 = 1 - \frac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}    (7)

MAE assesses the prediction accuracy. MAE, a scale-dependent metric, effectively reveals the prediction error by preventing negative and positive errors from offsetting each other. The MAE can be calculated using the following equation:

MAE = \frac{1}{n} \sum_{i=1}^{n} |Y_i - \hat{Y}_i|    (8)

MedAE is particularly interesting as it is robust to outliers. The loss is determined by calculating the median of all the absolute differences between the target and the prediction. With \hat{Y}_i being the value predicted for the ith sample and Y_i being the respective true value, the MedAE computed over n samples is given as follows:

MedAE = \mathrm{median}(|Y_1 - \hat{Y}_1|, \ldots, |Y_n - \hat{Y}_n|)    (9)

where Y_i denotes the actual values, \hat{Y}_i the values predicted by the models, n the sample size and \bar{Y} the sample average.

6 Results and discussion

Four regression algorithms, LR, GBM, KNN and RF, are used to predict the trip duration. Each of the regression algorithms requires the selection of the best hyperparameters to make it perform at its best, so it is crucial to select the optimum hyperparameters.

Since the data set is large (nearly 10 million records), finding the optimal hyperparameters for each of the models is time consuming and computationally expensive, so a random search was performed. For the LR model, the intercept is retained and the model is fitted. For the GBM model, the optimal set of hyperparameters includes α = 0.9, a learning rate of 0.1, a maximum depth of 3, a minimum samples split of 2 and 100 estimators. For KNN, the number of neighbours was set to 5. For RF, the number of estimators (trees) is 10, the minimum samples leaf is 1 and the minimum samples split is 2.

After training each of the models with its best hyperparameters, the performance of each model is evaluated on the testing set using the four metrics RMSE, R2, MedAE and MAE. The models' performance is summarised in Table 3. The RMSE, R2, MedAE and MAE values in the testing phase are 16.48, 0.56, 6.90 and 10.12 for the LR model; 12.58, 0.74, 3.75 and 7.37 for the GBM model; 13.93, 0.69, 2.59 and 6.83 for the KNN model; and 6.25, 0.93, 1.20 and 2.92 for the RF model. The model with the lowest RMSE, MedAE and MAE and the highest R2 is considered the optimum performing model.
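The hyperparameter settings reported above map directly onto scikit-learn estimators [34]. The following sketch instantiates the four models with those values and fits them on a small synthetic stand-in; the real Seoul Bike feature matrix is not reproduced here, so the data below are purely illustrative:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor

# Hyperparameters as reported in the text (scikit-learn defaults where unstated)
models = {
    "LR": LinearRegression(fit_intercept=True),
    "GBM": GradientBoostingRegressor(alpha=0.9, learning_rate=0.1,
                                     max_depth=3, min_samples_split=2,
                                     n_estimators=100),
    "KNN": KNeighborsRegressor(n_neighbors=5),
    "RF": RandomForestRegressor(n_estimators=10, min_samples_leaf=1,
                                min_samples_split=2),
}

# Tiny synthetic stand-in: duration roughly linear in the first feature
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 3))
y = 5 * X[:, 0] + rng.normal(0, 0.5, size=200)

for name, model in models.items():
    model.fit(X, y)  # each model is trained with its fixed hyperparameters
```

In practice these fixed settings would come out of the random search described above (e.g. scikit-learn's RandomizedSearchCV) rather than being hand-coded.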
Fig. 11  Correlation plot

Table 2 Training and testing dataset
Dataset    Number of observations
training   7,200,854 observations and 24 variables
testing    2,400,285 observations and 24 variables
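The 75/25 split summarised in Table 2 corresponds to scikit-learn's train_test_split [34]. A sketch with a small hypothetical stand-in for the 24-variable feature matrix:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the feature matrix and trip-duration target
X = np.arange(400).reshape(100, 4)
y = np.arange(100)

# 75% training / 25% testing, as in Table 2
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)
```

Fixing random_state makes the split reproducible across runs.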
Table 3 Models performance
          Training                         Testing
Models    RMSE   R2    MedAE  MAE     RMSE   R2    MedAE  MAE
LR        16.45  0.56  6.90   10.11   16.48  0.56  6.90   10.12
GBM       12.55  0.74  3.74   7.35    12.58  0.74  3.75   7.37
KNN       11.29  0.79  2.00   5.53    13.93  0.69  2.59   6.83
RF        2.76   0.98  0.40   1.21    6.25   0.93  1.20   2.92
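The four indices of eqns (6)–(9) are available in scikit-learn's metrics module [34]; a sketch on toy values (not the paper's data):

```python
import numpy as np
from sklearn.metrics import (mean_squared_error, r2_score,
                             median_absolute_error, mean_absolute_error)

# Toy observed and predicted trip durations (illustrative values only)
y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.5, 5.0, 8.0, 9.0])

rmse = np.sqrt(mean_squared_error(y_true, y_pred))   # eqn (6)
r2 = r2_score(y_true, y_pred)                        # eqn (7)
mae = mean_absolute_error(y_true, y_pred)            # eqn (8)
medae = median_absolute_error(y_true, y_pred)        # eqn (9)
```

Taking the square root of the mean squared error keeps the sketch compatible across scikit-learn versions.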
The RF model has the best performance and LR produces the worst. RF yields the highest R2 value and the lowest RMSE, MedAE and MAE values. The poor performance of LR arises because the trip duration is not linearly dependent on the independent variables. The performance of the GBM and KNN models is almost similar; there is only about a one-point difference between their RMSE, MedAE and MAE values. The performance of RF is roughly twice that of GBM and KNN. This shows that RF can be used as an effective tool to predict the trip duration.

7 Conclusion

A data mining approach to predicting the trip duration using data recorded by the rental bikes combined with weather information is proposed. The analysis is done with the Seoul Bike data. Four regression techniques, LR, GBM, KNN and RF, are used to predict the trip duration. This statistical data analysis yields interesting outcomes both in the prediction methods and in the exploratory analysis. The experimental results prove that the RF model predicts the trip duration best, with the highest R2 and the lowest error rates compared to LR, GBM and KNN; the RF model exhibits its proficiency in time-series analysis and statistical learning. To date, only a few works have addressed the prediction of trip duration. The results indicate that the RF predictor significantly outperforms the other baseline predictors, which proves the applicability of RF for the prediction of trip duration. Trip duration prediction with a combination of weather data and feature-engineered variables can be used as an effective tool to develop future artificial-intelligence-based transportation. Efficient trip duration prediction can also provide multiple advantages through various applications for users. Future work includes the use of deep learning techniques for trip duration prediction: since deep learning techniques are more effective at learning from big data with automatic feature extraction and deliver high-quality results, deep learning could be used for trip duration prediction.

8 References

[1] Conservancy, Ocean: 'Stemming the tide: land based strategies for a plastic-free ocean', Ocean Conservancy and McKinsey Center for Business and Environment, 2015
[2] Audenhove, V., François-Joseph, O., Dauby, L., et al.: 'The future of urban mobility 2.0: imperatives to shape extended mobility ecosystems of tomorrow', 2014
[3] Calafiore, G.C., Portigliotti, F., Rizzo, A.: 'A network model for an urban bike-sharing system', IFAC-PapersOnLine, 2017, 50, (1), pp. 15633–15638
[4] Wikipedia: 'List of bicycle-sharing systems', 2017
[5] Shaheen, S., Guzman, S., Zhang, H.: 'Bikesharing in Europe, the Americas, and Asia: past, present, and future', Transp. Res. Rec., J. Transp. Res. Board, 2010, 2143, pp. 159–167
[6] Wolpert, D., Macready, W.: 'No free lunch theorems for optimization', IEEE Trans. Evol. Comput., 1997, 1, (1), pp. 67–82
[7] Giraud-Carrier, C., Vilalta, R., Brazdil, P.: 'Introduction to the special issue on meta-learning', Mach. Learn., 2004, 54, (3), pp. 187–193
[8] Turner, S., Eisele, W., Benz, R., et al.: 'Travel time data collection handbook', Federal Highway Administration, Report FHWA-PL-98-035, 1998
[9] Li, Y., Gunopulos, D., et al.: 'Urban travel time prediction using a small number of GPS floating cars'. Proc. of the 25th ACM SIGSPATIAL Int. Conf. on Advances in Geographic Information Systems, USA, 2017, p. 3
[10] Mridha, S., Ganguly, N., et al.: 'Link travel time prediction from large scale endpoint data'. Proc. of the 25th ACM SIGSPATIAL Int. Conf. on Advances in Geographic Information Systems, USA, 2017, p. 71
[11] Miura, H.: 'A study of travel time prediction using universal kriging', Top, 2010, 18, (1), pp. 257–270
[12] Kwon, J., Coifman, B., Bickel, P.: 'Day-to-day travel-time trends and travel-time prediction from loop-detector data', Transp. Res. Rec.: J. Transp. Res. Board, 2000, 1717, (1), pp. 120–129

[13] Chien, S.I.J., Kuchipudi, C.M.: 'Dynamic travel time prediction with real-time and historic data', J. Transp. Eng., 2003, 129, (6), pp. 608–616
[14] Zhang, X., Rice, J.A.: 'Short-term travel time prediction', Transp. Res. C: Emerg. Technol., 2003, 11, (3), pp. 187–210
[15] Wu, C.H., Ho, J.M., Lee, D.T.: 'Travel-time prediction with support vector regression', IEEE Trans. Intell. Transp. Syst., 2004, 5, (4), pp. 276–281
[16] Balan, R.K., Nguyen, K.X., Jiang, L.: 'Real-time trip information service for a large taxi fleet'. Proc. of the 9th Int. Conf. on Mobile Systems, Applications, and Services, MobiSys, ACM, New York, 2011, pp. 99–112
[17] Brazdil, P., Soares, C., Costa, J.D.: 'Ranking learning algorithms: using IBL and meta-learning on accuracy and time results', Mach. Learn., 2003, 50, pp. 251–277
[18] Zarmehri, M.N., Soares, C.: 'Using metalearning for prediction of taxi trip duration using different granularity levels'. Int. Symp. on Intelligent Data Analysis, Cham, 2015, pp. 205–216
[19] Handley, S., Langley, P., Rauscher, F.A.: 'Learning to predict the duration of an automobile trip'. KDD, New York, NY, USA, 1998, pp. 219–223
[20] Hailu, A., Gao, L.: 'Research note: recreational trip timing and duration prediction', Tour. Econ., 2012, 18, (1), pp. 243–251
[21] Lee, H., Hong, S., Kim, H.J., et al.: 'A travel time prediction algorithm using rule-based classification on MapReduce'. Database and Expert Systems Applications, Cham, 2015, pp. 440–452
[22] Neter, J., Isserman, W., Kutner, M.H.: 'Applied linear regression models', 1989
[23] Elith, J., Leathwick, J.R., Hastie, T.: 'A working guide to boosted regression trees', J. Anim. Ecol., 2008, 77, (4), pp. 802–813
[24] De'Ath, G.: 'Boosted trees for ecological modeling and prediction', Ecology, 2007, 88, (1), pp. 243–251
[25] Saha, D., Alluri, P., Gan, A.: 'Prioritizing highway safety manual's crash prediction variables using boosted regression trees', Accident Anal. Prev., 2015, 79, pp. 133–144
[26] Friedman, J.H.: 'Greedy function approximation: a gradient boosting machine', Ann. Stat., 2001, 29, pp. 1189–1232
[27] Ding, C., Wu, X., Yu, G., et al.: 'A gradient boosting logit model to investigate driver's stop-or-run behavior at signalized intersections using high-resolution traffic data', Transp. Res. C, Emerg. Technol., 2016, 72, pp. 225–238
[28] Altman, N.S.: 'An introduction to kernel and nearest-neighbor nonparametric regression', Am. Stat., 1992, 46, (3), pp. 175–185
[29] Breiman, L.: 'Random forests', Mach. Learn., 2001, 45, pp. 5–32
[30] Adusumilli, S., Bhatt, D., Wang, H., et al.: 'A low-cost INS/GPS integration methodology based on random forest regression', Expert Syst. Appl., 2013, 40, pp. 4653–4659
[31] Zhou, J., Shi, X.Z., Du, K., et al.: 'Feasibility of random-forest approach for prediction of ground settlements induced by the construction of a shield-driven tunnel', Int. J. Geomech., 2017, 17, p. 04016129
[32] 'Seoul Open Data', http://data.seoul.go.kr/
[33] 'Korea Meteorological Administration', https://www.kma.go.kr/eng/index.jsp
[34] Pedregosa, F., Varoquaux, G., Gramfort, A., et al.: 'Scikit-learn: machine learning in Python', J. Mach. Learn. Res., 2011, 12, pp. 2825–2830
