Abstract—Automatic detection of traffic accidents is an important emerging topic in traffic monitoring systems. Nowadays many urban intersections are equipped with surveillance cameras connected to traffic management systems. Therefore, computer […]

[…] applied computer vision techniques in traffic surveillance systems [1]–[10] for various tasks. Automatic detection of traffic incidents not only saves a great deal of unnecessary manual labor, but the spontaneous feedback also helps the […]
[Figure: Accident detection flowchart. The Euclidean distance between each pair of detected objects is calculated; when more than one road-user is present and the distance falls below a threshold, the pair is passed on for conflict analysis.]
[…] purposely designed with efficient algorithms in order to be applicable in real-time traffic monitoring systems.

A. Road-User Detection

As in most image and video analytics systems, the first step is to locate the objects of interest in the scene. Since we are also interested in the category of the objects, we employ a state-of-the-art object detection method, namely YOLOv4 [11], to locate and classify the road-users in each video frame. The family of YOLO-based deep learning methods demonstrates the best compromise between efficiency and performance among object detectors.

The first version of the You Only Look Once (YOLO) deep learning method was introduced in 2015 [14]. The main idea of this method is to divide the input image into an S × S grid, where each grid cell is either considered background or used for detecting an object. A predefined number (B) of bounding boxes and their corresponding confidence scores are generated for each cell. The intersection over union (IOU) of the ground-truth and predicted boxes is multiplied by the probability of each object to compute the confidence scores. In later versions of YOLO [15], [16], multiple modifications have been made to improve the detection performance while decreasing the computational complexity of the method. Although there are online implementations such as YOLOX [17], the latest official version of the YOLO family is YOLOv4 [11], which improves upon the previous methods in terms of speed and mean average precision (mAP). As illustrated in Fig. 2, the architecture of this version of YOLO is constructed with a CSPDarknet53 model as the backbone network for feature extraction, followed by a neck and a head part. The neck refers to the path aggregation network (PANet) and spatial attention module, and the head is the dense prediction block used for bounding box localization and classification. This architecture is further enhanced by additional techniques referred to as bag of freebies and bag of specials.

Here, we have applied the YOLOv4 [11] model pre-trained on the MS COCO dataset [18] for the task of object detection. Although the model is pre-trained on a dataset with different visual characteristics in terms of object sizes and viewing angles, YOLOv4 proved to generalize well to images with an overhead perspective. We are interested in trajectory conflicts among the most common road-users at regular urban intersections, namely vehicles, pedestrians, and cyclists.
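To make this step concrete, the following is a minimal Python sketch of running a pre-trained YOLOv4 network with OpenCV's DNN module and keeping only the road-user categories. The file names, input size, thresholds, and the mapping of "vehicles, pedestrians, and cyclists" onto COCO class ids are illustrative assumptions, not values taken from the paper.

    import cv2

    # Hypothetical paths to the pre-trained YOLOv4 (MS COCO) config and weights.
    net = cv2.dnn.readNetFromDarknet("yolov4.cfg", "yolov4.weights")
    model = cv2.dnn_DetectionModel(net)
    model.setInputParams(size=(608, 608), scale=1 / 255.0, swapRB=True)

    # Assumed COCO ids for the road-users of interest:
    # 0 person, 1 bicycle, 2 car, 3 motorcycle, 5 bus, 7 truck.
    ROAD_USER_CLASSES = {0, 1, 2, 3, 5, 7}

    def detect_road_users(frame, conf_thr=0.4, nms_thr=0.5):
        """Return [(class_id, confidence, (x, y, w, h)), ...] for road-users."""
        class_ids, confidences, boxes = model.detect(frame, conf_thr, nms_thr)
        return [(int(c), float(s), tuple(b))
                for c, s, b in zip(class_ids, confidences, boxes)
                if int(c) in ROAD_USER_CLASSES]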
[Fig. 2: The YOLOv4 architecture: a CSPDarknet53 backbone, a neck composed of an SPP block (MaxPool 5 and 13) and PANet (Concat + CBL×5 blocks), and the dense prediction head.]
B. Road-User Tracking

Multiple object tracking (MOT) has been intensively studied over the past decades [19] due to its importance in video analytics applications. Here we employ a simple but effective tracking strategy similar to that of the Simple Online and Realtime Tracking (SORT) approach [20]. The Hungarian algorithm [12] is used to associate the detected bounding boxes from frame to frame. Additionally, the Kalman filter [13] is used as the estimation model to predict the future location of each detected object based on its current location, for better association, trajectory smoothing, and the prediction of missed tracks.

The inter-frame displacement of each detected object is estimated by a linear velocity model. The state of each target in the Kalman filter tracking approach is represented as follows:

o_i^t = [x_i, y_i, s_i, r_i, \dot{x}_i, \dot{y}_i, \dot{s}_i]    (1)

where x_i and y_i represent the horizontal and vertical locations of the bounding box center, s_i and r_i represent the bounding box scale and aspect ratio, and \dot{x}_i, \dot{y}_i, \dot{s}_i are the velocities of the parameters x_i, y_i, s_i of object o_i at frame t, respectively. The velocity components are updated when a detection is associated to a target. Otherwise, in case of no association, the state is predicted based on the linear velocity model.
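A sketch of a matching constant-velocity Kalman filter, built here with the filterpy package (one possible choice; the paper does not specify an implementation, and the noise defaults are left untouched for brevity):

    import numpy as np
    from filterpy.kalman import KalmanFilter

    def make_track_filter(x, y, s, r):
        """Constant-velocity Kalman filter over the state of Eq. (1)."""
        kf = KalmanFilter(dim_x=7, dim_z=4)
        kf.F = np.eye(7)                            # state transition matrix
        kf.F[0, 4] = kf.F[1, 5] = kf.F[2, 6] = 1.0  # x += dx, y += dy, s += ds
        kf.H = np.eye(4, 7)                         # we observe x, y, s, r only
        kf.x[:4, 0] = [x, y, s, r]                  # initial state, zero velocities
        return kf

    # Per frame: kf.predict() advances the linear velocity model; call
    # kf.update([x, y, s, r]) only when a detection is associated to the track.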
Considering two adjacent video frames t and t + 1, we will have two sets of objects detected at each frame, as follows:

O^t = \{o_1^t, o_2^t, \ldots, o_n^t\}, \quad O^{t+1} = \{o_1^{t+1}, o_2^{t+1}, \ldots, o_m^{t+1}\}    (2)

Every object o_i in set O^t is paired with an object o_j in set O^{t+1} that minimizes the cost function C(o_i, o_j). The index i \in [N] = \{1, 2, \ldots, N\} denotes the objects detected at the previous frame and the index j \in [M] = \{1, 2, \ldots, M\} represents the new objects detected at the current frame.

In order to efficiently solve the data association problem despite challenging scenarios, such as occlusion, false positive or false negative results from the object detection, overlapping objects, and shape changes, we design a dissimilarity cost function that employs a number of heuristic cues, including appearance, size, intersection over union (IOU), and position.
The appearance distance is calculated based on the histogram correlation between an object o_i and a detection o_j, as follows:

C_{i,j}^{A} = 1 - \frac{\sum_b \left(H_b(o_i) - \bar{H}(o_i)\right) \left(H_b(o_j) - \bar{H}(o_j)\right)}{\sqrt{\sum_b \left(H_b(o_i) - \bar{H}(o_i)\right)^2 \sum_b \left(H_b(o_j) - \bar{H}(o_j)\right)^2}}    (3)

where C_{i,j}^{A} is a value between 0 and 1, b is the bin index, H_b is the histogram of an object in the RGB color space, and \bar{H} is computed as follows:

\bar{H}(o_k) = \frac{1}{B} \sum_b H_b(o_k)    (4)

in which B is the total number of bins in the histogram of an object o_k.
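A direct transcription of Eqs. (3) and (4) in Python; patch_i and patch_j are assumed to be the cropped bounding-box image regions, and the bin count is an illustrative choice:

    import cv2
    import numpy as np

    def appearance_dissimilarity(patch_i, patch_j, bins=8):
        """C^A of Eq. (3): one minus the correlation of the two RGB histograms."""
        def centred_hist(patch):
            h = cv2.calcHist([patch], [0, 1, 2], None,
                             [bins] * 3, [0, 256] * 3).flatten()
            return h - h.mean()                    # subtract H-bar, Eq. (4)
        hi, hj = centred_hist(patch_i), centred_hist(patch_j)
        denom = np.sqrt((hi ** 2).sum() * (hj ** 2).sum())
        return 1.0 - (hi * hj).sum() / denom if denom > 0 else 1.0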
The size dissimilarity is calculated based on the width and height information of the objects:

C_{i,j}^{S} = \frac{1}{2} \left( \frac{|h_i - h_j|}{h_i + h_j} + \frac{|w_i - w_j|}{w_i + w_j} \right)    (5)

where w and h denote the width and height of the object bounding box, respectively. The more the bounding boxes of object o_i and detection o_j differ in size, the more C_{i,j}^{S} approaches one. The position dissimilarity is computed in a similar way:

C_{i,j}^{P} = \frac{1}{2} \left( \frac{|x_i - x_j|}{x_i + x_j} + \frac{|y_i - y_j|}{y_i + y_j} \right)    (6)

where the value of C_{i,j}^{P} is between 0 and 1, approaching 1 as object o_i and detection o_j lie further apart. In addition to the mentioned dissimilarity measures, we also use the IOU value to calculate the Jaccard distance, as follows:

C_{i,j}^{K} = 1 - \frac{|Box(o_i) \cap Box(o_j)|}{|Box(o_i) \cup Box(o_j)|}    (7)

where Box(o_k) denotes the set of pixels contained in the bounding box of object k.

The overall dissimilarity value is calculated as a weighted sum of the four measures:

C_{i,j} = w_a C_{i,j}^{A} + w_s C_{i,j}^{S} + w_p C_{i,j}^{P} + w_k C_{i,j}^{K}    (8)

in which w_a, w_s, w_p, and w_k define the contribution of each dissimilarity value to the total cost function.
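The size, position, and IOU terms of Eqs. (5)-(7) translate directly; boxes are assumed to be in (x, y, w, h) format with (x, y) the centre:

    def size_dissimilarity(bi, bj):
        """C^S of Eq. (5) from the widths and heights of the two boxes."""
        (_, _, wi, hi), (_, _, wj, hj) = bi, bj
        return 0.5 * (abs(hi - hj) / (hi + hj) + abs(wi - wj) / (wi + wj))

    def position_dissimilarity(bi, bj):
        """C^P of Eq. (6), computed over the box centres."""
        (xi, yi, _, _), (xj, yj, _, _) = bi, bj
        return 0.5 * (abs(xi - xj) / (xi + xj) + abs(yi - yj) / (yi + yj))

    def jaccard_distance(bi, bj):
        """C^K of Eq. (7): one minus the IOU of the two boxes."""
        (xi, yi, wi, hi), (xj, yj, wj, hj) = bi, bj
        # Convert centre format to corner format for the overlap test.
        ax1, ay1, ax2, ay2 = xi - wi / 2, yi - hi / 2, xi + wi / 2, yi + hi / 2
        bx1, by1, bx2, by2 = xj - wj / 2, yj - hj / 2, xj + wj / 2, yj + hj / 2
        iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
        ih = max(0.0, min(ay2, by2) - max(ay1, by1))
        inter = iw * ih
        union = wi * hi + wj * hj - inter
        return 1.0 - inter / union if union > 0 else 1.0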
The total cost function is used by the Hungarian algorithm [12] to assign the detected objects at the current frame to the existing tracks. If the dissimilarity between a matched detection and track is above a certain threshold (\tau_d), the match is discarded and the detected object is initiated as a new track.
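Putting Eq. (8) together with the assignment step, a sketch using scipy's Hungarian solver; the weights and threshold are illustrative, tracks and detections are assumed to expose a box and an image patch, and the dissimilarity helpers sketched above are assumed to be in scope:

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def associate(tracks, detections,
                  weights=(0.25, 0.25, 0.25, 0.25), tau_d=0.7):
        """Match tracks to detections by minimising the total cost of Eq. (8)."""
        wa, ws, wp, wk = weights                 # illustrative, not from the paper
        C = np.zeros((len(tracks), len(detections)))
        for i, t in enumerate(tracks):
            for j, d in enumerate(detections):
                C[i, j] = (wa * appearance_dissimilarity(t.patch, d.patch)
                           + ws * size_dissimilarity(t.box, d.box)
                           + wp * position_dissimilarity(t.box, d.box)
                           + wk * jaccard_distance(t.box, d.box))
        rows, cols = linear_sum_assignment(C)    # Hungarian algorithm
        matches = [(i, j) for i, j in zip(rows, cols) if C[i, j] <= tau_d]
        unmatched = set(range(len(detections))) - {j for _, j in matches}
        return matches, unmatched                # unmatched detections start new tracks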
C. Accident Detection

In this section, details about the heuristics used to detect conflicts between a pair of road-users are presented. The conflicts among road-users do not always end in crashes; however, near-accident situations are also of importance to traffic management systems, as they can indicate flaws associated with the signal control system and/or the intersection geometry. Logging and analyzing trajectory conflicts, including severe crashes, mild accidents, and near-accident situations, will help decision-makers improve the safety of urban intersections.

The most common road-users involved in conflicts at intersections are vehicles, pedestrians, and cyclists [22]. Therefore, for this study we focus on the motion patterns of these three major road-users to detect the time and location of trajectory conflicts.

First, the Euclidean distances among all object pairs are calculated in order to identify the objects that are closer to each other than a threshold. These object pairs can potentially engage in a conflict and are therefore chosen for further analysis. The recent motion patterns of each pair of close objects are examined in terms of speed and moving direction.
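This screening step can be written compactly with pairwise distances (a sketch; centers is an N×2 array of bounding box centres in pixels, and the threshold is scene-dependent):

    import numpy as np
    from scipy.spatial.distance import cdist

    def close_pairs(centers, dist_thr):
        """Index pairs (i, j), i < j, of objects within dist_thr of each other."""
        D = cdist(centers, centers)          # all pairwise Euclidean distances
        i, j = np.where(D < dist_thr)
        return [(a, b) for a, b in zip(i, j) if a < b]   # each pair once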
As there may be imperfections in the previous steps, especially in the object detection step, analyzing only two successive frames may lead to inaccurate results. Therefore, a predefined number f of consecutive video frames is used to estimate the speed of each road-user individually. The average bounding box centers associated with each track over the first half and the second half of the f frames are computed. The two averaged points p and q are transformed to real-world coordinates using the inverse of the homography matrix H^{-1}, which is calculated during camera calibration [23] by selecting a number of points on the frame and their corresponding locations on Google Maps [24]. The distance in kilometers can then be calculated by applying the haversine formula [25], as follows:

h = \sin^2\left(\frac{\varphi_q - \varphi_p}{2}\right) + \cos\varphi_p \cdot \cos\varphi_q \cdot \sin^2\left(\frac{\lambda_q - \lambda_p}{2}\right), \quad d_h(p, q) = 2r \arcsin\sqrt{h}    (9)

where \varphi_p and \varphi_q are the latitudes and \lambda_p and \lambda_q are the longitudes of the first and second averaged points p and q, respectively, h is the haversine of the central angle between the two points, r \approx 6371 kilometers is the radius of the Earth, and d_h(p, q) is the real-world distance between the points p and q in kilometers. The speed S of the tracked vehicle can then be estimated as follows:

S = \frac{d_h(p, q) \times 3600 \times fps}{f}    (10)

where fps denotes the frames read per second and S is the estimated vehicle speed in kilometers per hour. Note that if the locations of the bounding box centers among the f frames do not change sizably (more than a threshold), the object is considered to be slow-moving or stalled and is not involved in the speed calculations.

[Fig. 3. The workflow of the speed estimation method demonstrated on a scene from the NVIDIA AI City Challenge 2022 dataset [21]. The diagram chains video frames from scene i through object detection and tracking to bounding boxes and 2D coordinates; camera calibration yields the H matrix linking camera coordinates to Google Maps latitude/longitude via virtual grids projected on the ground plane; the haversine formula converts the projected points to a distance in km, which feeds the speed estimation.]
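Eqs. (9) and (10), together with the homography projection they rely on, in Python; H_inv is assumed to be the image-to-ground homography obtained from the calibration step, and points are (latitude, longitude) pairs in degrees:

    import cv2
    import numpy as np

    EARTH_RADIUS_KM = 6371.0

    def to_latlong(pt, H_inv):
        """Project an image point to (lat, long) with the inverse homography."""
        src = np.array([[pt]], dtype=np.float32)          # shape (1, 1, 2)
        return cv2.perspectiveTransform(src, H_inv)[0, 0]

    def haversine_km(p, q):
        """d_h(p, q) of Eq. (9); p and q are (lat, long) pairs in degrees."""
        phi_p, lam_p, phi_q, lam_q = np.radians([*p, *q])
        h = (np.sin((phi_q - phi_p) / 2) ** 2
             + np.cos(phi_p) * np.cos(phi_q) * np.sin((lam_q - lam_p) / 2) ** 2)
        return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(h))

    def speed_kmh(p, q, fps, f):
        """Eq. (10): km/h over a window of f frames sampled at fps."""
        return haversine_km(p, q) * 3600.0 * fps / f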
Another factor to account for in the detection of accidents and near-accidents is the angle of collision. Traffic accidents include different scenarios, such as rear-end, side-impact, single-car, vehicle-rollover, or head-on collisions, each of which exhibits specific characteristics and motion patterns. Accordingly, our focus is on side-impact collisions in the intersection area, where two or more road-users collide at a considerable angle. The bounding box centers of each road-user are extracted at two points: (i) when they are first observed and (ii) at the time of conflict with another road-user. Then the approaching angle of a pair of road-users a and b is calculated as follows:

m_a = \frac{y_a^t - y_a^{t'}}{x_a^t - x_a^{t'}}, \quad m_b = \frac{y_b^t - y_b^{t''}}{x_b^t - x_b^{t''}}, \quad \theta = \arctan\left(\frac{m_a - m_b}{1 + m_a m_b}\right)    (11)

where \theta denotes the estimated approaching angle, m_a and m_b are the general moving slopes of the road-users a and b with respect to the origin of the video frame, x_a^t, y_a^t, x_b^t, y_b^t represent the center coordinates of the road-users a and b at the current frame, x_a^{t'} and y_a^{t'} are the center coordinates of object a when first observed, and x_b^{t''} and y_b^{t''} are the center coordinates of object b when first observed, respectively.
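Eq. (11) in Python (a sketch; each track is assumed to store its centre at first observation and at the current frame, and perpendicular trajectories, where 1 + m_a m_b = 0, would need special handling):

    import numpy as np

    def approach_angle(track_a, track_b):
        """Theta of Eq. (11) from the first and current centres of two road-users."""
        def slope(track):
            (x0, y0), (xt, yt) = track.first_center, track.current_center
            dx = xt - x0
            return (yt - y0) / (dx if abs(dx) > 1e-6 else 1e-6)  # guard vertical motion
        ma, mb = slope(track_a), slope(track_b)
        return np.degrees(np.arctan((ma - mb) / (1.0 + ma * mb)))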
If the bounding boxes of the object pair overlap each other or are closer than a threshold, the two objects are considered to be close. The trajectories of each pair of close road-users are analyzed with the purpose of detecting possible anomalies that can lead to accidents. The variations in the calculated […]
Fig. 4. Vehicle-to-Vehicle (V2V) traffic accidents at intersections detected by our proposed framework. The red circles indicate the location of the incidents.