Applying Data Mining in Prediction and Classification of Urban Traffic


2009 World Congress on Computer Science and Information Engineering


Sedigheh Khajouei Nejad
Department of Computer, Islamic Azad University, Sirjan Branch, Sirjan, Iran

Farid Seifi
Computer Engineering Department, Iran University of Science and Technology
[email protected]

Hamed Ahmadi
Electrical and Computer Engineering Department, National University of Singapore
[email protected]

Nima Seifi
Engineering Department, NIGC DIST4, Mashhad, Iran
[email protected]

Abstract

Data mining is a branch of computer science that has recently found wide use in enterprises. Applying data mining methods, huge databases can be analyzed and processed. Data mining techniques are usually employed to extract knowledge and models from enormous data sets in order to predict new events, and they are commonly used in fields that generate amounts of data too large to be processed by ordinary methods. During the last decade traffic management has become a field that produces practically unlimited data, and this data requires new processing methods. One of the most important tasks in traffic management is traffic prediction, so data mining methods are a natural choice for generating dependable patterns. In this paper we apply classification methods to learn traffic behavior and predict new events.

1. Introduction

The progress of computer science and the use of modern hardware technologies for sampling data have caused huge amounts of data to be gathered and processed in most fields. Although more sample data can support more accurate decisions, in many cases slow methods of analysis make it impossible to use this amount of data. Alongside the appearance of high-tech sampling devices, new algorithms and software technologies have therefore been developed to survey and analyze the sampled data. Data mining presents methods and algorithms that can help us mine knowledge from data, cluster and classify data, predict future states, and make decisions.

Utilizing apparatus such as sensors, networks (wired or wireless), cameras, and high-speed computers in current traffic systems opens a new field for data mining to demonstrate its capabilities. Traffic prediction and decision making are two of the most important benefits of employing data mining in traffic management, and there are many other examples of applying data mining to traffic. Moreover, the sheer quantity of gathered data, together with the frequent arrival of new data, makes manual analysis impossible; employing data mining techniques therefore becomes necessary.

2. Data Mining and Classification

A brief definition of data mining is: "techniques for analyzing data and mining knowledge from it". Most data mining techniques have two parts. The first part is a learning algorithm, which produces the current pattern, as knowledge, from a learning set. The second part is the test, which evaluates the resulting pattern on a test set and calculates its accuracy. It is vital that the test data differ from the learning set.

In some cases all data is ready and its quantity is known; here we use so-called static data mining algorithms. The pattern is stable, so it is enough to run the learning algorithm once and then use the resulting pattern on the

978-0-7695-3507-4/08 $25.00 © 2008 IEEE


DOI 10.1109/CSIE.2009.906
rest of the data. On the other hand, there are situations where the algorithm cannot have all the data at the beginning and the system receives new data while it runs. Algorithms working on these kinds of data are called stream data mining algorithms. In these cases the pattern may change over time, so running the learning algorithm just once at the beginning and using the mined knowledge for all new data entering the system is not a reasonable method. In stream data mining we should refresh the pattern by re-running the learning algorithm at specified intervals with new sets of learning data. In other words, the old knowledge should be removed and replaced by recent knowledge derived from the new stream of data. This model is shown in Figure 1.

Figure. 1. A model of a stream data classifying algorithm which replaces the previous model with a newer one if the pattern of the data has changed.

In another way of processing stream data, new learning data are used to update the former knowledge, so the available knowledge of the system stays compatible with changes in the data pattern over time. Figure 2 presents this model.

Figure. 2. Classifying algorithm which updates former models.

The traffic problem and its data have a stream nature. The data is collected over time and its pattern may change with the season, oil prices, and holidays. Thus, to analyze traffic we have to use
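As an illustration of the periodic-refresh model of Figure 1, the following minimal Python sketch re-runs a learning step at fixed intervals so that old knowledge is replaced by knowledge from recent data. The majority-class "learner", the window size, and all names here are our own assumptions, not part of the paper:

```python
from collections import Counter, deque

def learn(samples):
    """Toy 'learning algorithm': the resulting model simply predicts
    the majority class of the current learning set."""
    majority = Counter(cls for _, cls in samples).most_common(1)[0][0]
    return lambda record: majority

WINDOW = 4  # refresh the pattern after every WINDOW records (assumed)

def classify_stream(stream, initial_set):
    """Classify a stream, re-running the learner periodically so that
    the old pattern is discarded in favor of one mined from new data."""
    window = deque(initial_set, maxlen=WINDOW)  # keeps only recent data
    model = learn(window)
    predictions = []
    for record, true_class in stream:
        predictions.append(model(record))   # predict with current pattern
        window.append((record, true_class))
        if len(predictions) % WINDOW == 0:  # specified period elapsed:
            model = learn(window)           # remove old knowledge, relearn
    return predictions
```

With an initial learning set of class-1 records and a stream that has drifted to class 2, the predictions switch to class 2 after the first refresh, which is exactly the behavior the figure describes.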
stream data mining methods.

Data mining is a vast field of research with numerous branches, such as classification, clustering, association rules, artificial neural networks (ANN), genetic algorithms, and fuzzy sets and fuzzy logic. In this research we use data mining to learn the effect of particular parameters on traffic, and then employ this knowledge to predict the traffic state and make decisions about it. Classification is one of the data mining branches that can be used in inference learning, knowledge mining, prediction, and decision making. In a classification task a learning set is defined, from which a classifier algorithm results. Each sample in the learning set has some variable fields and a class number that indicates its class. Once the learning and test phases are completed, the classifier predicts the class of new incoming data [1]. In the traffic problem, samples have variable fields such as time, temperature, and day of week. Moreover, each sample is a member of a class that indicates the traffic state under those conditions. The goal is to learn the relation between the variable values of the samples and their classes, in other words the traffic state under those conditions, and to predict the class of new data whose class is unknown. This prediction can be the decision itself, since a default decision or solution can be defined for each of the traffic conditions (classes): new incoming data will be divided into classes according to their conditions, and for each of them the system will make the same decision as for the other members of their class. Numerous classification methods have been presented so far, most of them based on decision trees. In this paper we present classification with decision trees for learning and predicting traffic.

3. Applying decision trees for learning and prediction of traffic conditions

As said before, the goal of classification is, given the variable values of each new event, to map one of the predefined classes in the environment to that event. This prediction is based on the sample's variable values [3, 4]. Different methods are available for this kind of prediction, and classification with decision trees is one of them [5, 6]. A decision tree has a flowchart-like structure: each internal node performs a test on a certain property, each branch of the node represents a result of the test, and a leaf node indicates a class. The top node is called the root of the tree. Classification with decision trees is a two-stage process. In the first stage, a set of training data is used to discover the relation between variable values and classes in decision tree form; the tree is the discovered knowledge that will be employed in the next stage. During the second stage the decision tree is tested with another group of data to find out its accuracy. The

resulting decision tree can then be used in the real world for classification tasks [7]. This classification can be prediction of new situations and also decision making about them. Generally, in a decision tree the root is at the top and the leaf nodes are at the lowest level. A new record enters at the root and is tested there; according to the result of the test it is sent along one of the branches to the next test node. This process continues until the record reaches a leaf node, and that leaf node is the class of the record. All records that reach the same leaf node are grouped into the same class. Moreover, the path to each leaf node is unique, and that path is in fact the rule that maps a record to that class.

The effectiveness of a tree must be measured after its creation. Employing test data, the accuracy of each branch can be measured; as said above, each path from the root to a leaf is a rule whose accuracy should be evaluated. In most cases, removing inaccurate branches increases the prediction power of the tree. Removing branches to make the tree more precise is called "pruning".

We now explain learning and prediction with decision trees through a simple example. The tiny data set shown in Table 1 is considered as the training data.

Table 1. Training data set of traffic
Time    Temperature    Traffic level
7       20             3
7:30    -2             3
8:30    15             2
9       15             2
8:30    6              2
10      -2             1
9       -2             1
9:30    -5             1

Records of this data set have two variable fields: the time of sampling and the temperature of the environment at sampling time. The class of each record indicates the traffic condition at that time and takes one of three values: 1 for very light traffic, 2 for light traffic, and 3 for heavy traffic. As said, this data set has only two variable fields to keep the example simple; in real cases many more parameters must be considered, and there can be more classes as well.

Creation of a tree, in other words learning and pattern discovery, consists of two phases:
• Creation and growing of the tree
• Pruning of the tree

The general creation algorithm for a binary tree is as follows:

Build tree(n, D, ss)
    Apply ss to D to find the splitting criterion
    If n splits
        Use the best split to partition D into D1 and D2
        Build tree(n1, D1, ss)
        Build tree(n2, D2, ss)
    End if
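A runnable version of this recursion might look as follows. This is our own minimal Python sketch: the dictionary node encoding, the `split_fn` parameter standing in for the splitting criterion ss, and the pure-node stopping rule are assumptions, not code from the paper:

```python
def build_tree(data, split_fn):
    """Recursively build a binary decision tree.

    data:     list of (value, cls) pairs over one numeric field (assumed layout)
    split_fn: applied to D to find the splitting criterion; returns a
              threshold, or None if the node should not split
    """
    if not data:
        return {"leaf": None}             # empty partition (safety guard)
    classes = {cls for _, cls in data}
    if len(classes) == 1:                 # pure node: make a leaf
        return {"leaf": classes.pop()}
    threshold = split_fn(data)            # "apply ss to D"
    if threshold is None:                 # node does not split: majority leaf
        return {"leaf": max(classes,
                            key=lambda c: sum(1 for _, x in data if x == c))}
    d1 = [r for r in data if r[0] <= threshold]  # partition D into D1 and D2
    d2 = [r for r in data if r[0] > threshold]
    return {"test": threshold,
            "left": build_tree(d1, split_fn),    # Build tree(n1, D1, ss)
            "right": build_tree(d2, split_fn)}   # Build tree(n2, D2, ss)
```

Any splitting criterion can be plugged in as `split_fn`; the Gini-based choice described next is one such criterion.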

There are different methods for choosing the break (separation) point, and these distinguish the tree creation methods from one another. Some of these methods are:
• Gini Index
• Entropy
• CART
• 2Pj
• Min(Pj)
• C4.5

With the Gini Index, the function is evaluated for each of the parameters, and the parameter with the minimum result is chosen as the test field for making new branches. To select the divergence point, the training data set S must first be sorted on the selected field; then formula (1) must be calculated for each way of forming the branches, in other words for each division of S into two sets S1 and S2. Finally, the divergence point that makes I(S) minimum is selected.

I(S) = |S1|/|S| ∗ I(S1) + |S2|/|S| ∗ I(S2)    (1)

In this research the creation of a decision tree with the Gini Index for the traffic data set of Table 1 is reviewed. First the data set has to be sorted on the time value (Table 2); then, for each time value, I(S) is measured, and finally the point that makes I(S) minimum is chosen as the divergence point.

Table 2. Training data set sorted by time
Time    Temperature    Traffic level
7       20             3
7:30    -2             3
8:30    15             2
8:30    6              2
9       15             2
9       -2             1
9:30    -5             1
10      -2             1
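This split selection can be computed directly. The sketch below is our own Python (the encoding of times as minutes, e.g. 7:30 as 450, and all names are assumptions); it evaluates I(S) for every candidate divergence point of the sorted data set:

```python
def gini(classes):
    """Gini index of a list of class labels: 1 - sum of squared shares."""
    n = len(classes)
    if n == 0:
        return 0.0
    return 1.0 - sum((classes.count(c) / n) ** 2 for c in set(classes))

def split_quality(data, threshold):
    """I(S) = |S1|/|S| * I(S1) + |S2|/|S| * I(S2) for the split value <= threshold."""
    s1 = [cls for value, cls in data if value <= threshold]
    s2 = [cls for value, cls in data if value > threshold]
    n = len(data)
    return len(s1) / n * gini(s1) + len(s2) / n * gini(s2)

# Table 2 records as (time in minutes, traffic level); 7:00 -> 420, 7:30 -> 450, ...
DATA = [(420, 3), (450, 3), (510, 2), (510, 2),
        (540, 2), (540, 1), (570, 1), (600, 1)]

# The divergence point is the candidate with minimum I(S).
best = min({v for v, _ in DATA}, key=lambda t: split_quality(DATA, t))
```

Evaluating every candidate reproduces the I(S) row of Table 5 up to the paper's two-digit truncation, and the selected divergence point is 7:30 (I(S) = 0.375), in agreement with the paper.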

To calculate I(S) for each point, a structure like Table 3 must be created.

Table 3. Defined structure to calculate I(S)
Class     1    2    3
Lower     a    b    c
Higher    d    e    f

In Table 3 the first row lists the classes; the second row gives the number of samples (records) of each class whose value of the selected parameter is lower than the value the table is based on; and the third row gives the number of records of each class whose value is higher than that value. After creating this structure we have:

I(S1) = 1 − (a/(a+b+c))² − (b/(a+b+c))² − (c/(a+b+c))²    (2)

I(S2) = 1 − (d/(d+e+f))² − (e/(d+e+f))² − (f/(d+e+f))²    (3)

I(S) = (a+b+c)/(a+b+c+d+e+f) ∗ I(S1) + (d+e+f)/(a+b+c+d+e+f) ∗ I(S2)    (4)

From these calculations and formulas, Table 4 is obtained:

Table 4. Data structure for time ≤ 7
Class     1    2    3
Lower     0    0    1
Higher    3    3    1

The results for the other time values are shown in Table 5.

Table 5. Results of I(S) for the time parameter
Time    ≤ 7     ≤ 7:30    ≤ 8:30    ≤ 9     ≤ 9:30    ≤ 10
I(S)    0.53    0.37      0.43      0.45    0.57      0.65

As Table 5 shows, 0.37 is the minimum value of I(S); therefore the divergence point is 7:30, and time ≤ 7:30 becomes the dividing condition (Figure 3). Two records of the training set satisfy this condition, and both are members of class 3, so the left node becomes a leaf node. The records at the right node, however, are not all from the same class and must take another test, based on the temperature value.

Figure. 3. First divergence in the tree structure for the traffic data set

Table 6 contains the I(S) test results based on the temperature parameter for the data on the right branch.

Table 6. I(S) results for the temperature parameter
Temperature    ≤ -5    ≤ -2    ≤ 6     ≤ 15
I(S)           0.4     0       0.25    0.5

The minimum I(S) in this test is 0, obtained at temperature -2, so this temperature becomes the divergence point of the right branch. Of the data at this stage, the three records whose class is 1 go to the left branch after the test; they form a leaf node because they are all from the same class. The other three records go to the right branch, which also becomes a leaf node because all three of its members are from the second class. Since all incoming records are now placed in leaf nodes, the tree is complete (Figure 4).

Figure. 4. Complete decision tree of the example

In the real world the number of parameters in the creation phase is larger than in this example, and the classes and the quantity of training records are more numerous as well. After its creation, the tree must be tested with a test set whose classes are known: for each record in the test set the tree predicts a class, and the predicted and real classes are then compared, so that the accuracy percentage of the prediction can be calculated. To predict the class of a sample record we begin at the root node; the condition of each node on the record's path is tested on the record until it reaches a leaf node. Each leaf node represents a class, so the class of the record is the class of the leaf node it has reached. In this way the mined knowledge can be used to predict the class of incoming records.
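The finished example tree (time ≤ 7:30 at the root, temperature ≤ -2 on the right branch) can be exercised directly. The sketch below is our own encoding (field names, minute-valued times, and the dictionary layout are assumptions, not from the paper); it walks a record from the root to a leaf exactly as described above:

```python
# The example's decision tree: root tests time, right branch tests temperature.
TREE = {"field": "time", "threshold": 450,        # 7:30 expressed in minutes
        "left": {"leaf": 3},                      # heavy traffic
        "right": {"field": "temperature", "threshold": -2,
                  "left": {"leaf": 1},            # very light traffic
                  "right": {"leaf": 2}}}          # light traffic

def predict(tree, record):
    """Walk from the root, applying each node's test, until a leaf is
    reached; the leaf's class is the prediction."""
    while "leaf" not in tree:
        branch = "left" if record[tree["field"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree["leaf"]

# Every training record of Table 1 is classified correctly by this tree:
TRAINING = [({"time": 420, "temperature": 20}, 3),
            ({"time": 450, "temperature": -2}, 3),
            ({"time": 510, "temperature": 15}, 2),
            ({"time": 540, "temperature": 15}, 2),
            ({"time": 510, "temperature": 6}, 2),
            ({"time": 600, "temperature": -2}, 1),
            ({"time": 540, "temperature": -2}, 1),
            ({"time": 570, "temperature": -5}, 1)]
assert all(predict(TREE, rec) == cls for rec, cls in TRAINING)
```

A new sample such as time 8:00 (480 minutes) with temperature 10 is routed to the right branch twice and classified as light traffic (class 2).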
4. Conclusion and Future work

As we have argued in this paper, classification with decision trees, a data mining method, is suitable and useful for traffic management and traffic prediction. To support this claim, and to avoid overcomplicating the matter, we explained the creation of a decision tree with a simple example, and using that example we described prediction with a decision tree. To sum up, data mining can be applied in traffic management and traffic prediction, and it can produce successful results there.

Applying other classification methods and comparing them is our future work. Genetic algorithms and artificial neural networks can also be used in traffic management, and combining some of these methods may yield more accurate predictions.

5. References

[1] Qiang Ding, Qin Ding, and William Perrizo, "Decision Tree Classification of Spatial Data Streams Using Peano Count Trees", Proc. SAC '02, 2002, pp. 413-417.

[2] Ruoming Jin and Gagan Agrawal, "Efficient Decision Tree Construction on Streaming Data", Proc. KDD '03, 2003, pp. 571-576.

[3] Nicholas R. Howe, Toni M. Rath, and R. Manmatha, "Boosted Decision Trees for Word Recognition in Handwritten Document Retrieval", Proc. SIGIR '05, 2005, pp. 377-383.

[4] Patrick Knab, Martin Pinzger, and Abraham Bernstein, "Predicting Defect Densities in Source Code Files with Decision Tree Learners", Proc. MSR '06, 2006, pp. 119-125.

[5] J. Gehrke, V. Ganti, R. Ramakrishnan, and W.-Y. Loh, "BOAT - Optimistic Decision Tree Construction", Proc. ACM SIGMOD '99, Philadelphia, USA, 1999, pp. 169-180.

[6] J. Gehrke, R. Ramakrishnan, and V. Ganti, "RainForest - A Framework for Fast Decision Tree Construction of Large Datasets", Proc. VLDB '98, New York, USA, 1998, pp. 416-427.

[7] F. Seifi, H. Ahmadi, and M. Kangavari, "Twins Decision Tree Classification: A Sophisticated Approach to Decision Tree Construction", Proc. ICCSA '07, Florida, USA, 2007, pp. 337-341.

