1. Introduction
Advanced driving assistance systems (ADAS) are no longer an optional or luxurious component in modern vehicles [1,2]. Instead, they are becoming a core component, especially with the migration towards autonomous vehicles. ADAS covers a number of varying functionalities, such as lane departure warning (LDW), lane keep assist (LKA), lane change merge (LCM), adaptive cruise control (ACC), collision detection and avoidance (CD), night vision, and blind spot detection, to mention a few [1,3,4,5,6,7,8,9,10,11,12,13]. The overall functionality of ADAS is underpinned by a machine vision component whose role is to understand the surroundings, in particular to extract lane boundaries and markings on roads. With ADAS becoming a core component, it is essential that potential errors arising from the machine vision component be as low as possible. However, correctly, consistently, and constantly extracting lane markings across a range of weather conditions is not trivial. In addition, varying lane marking standards, obscured lane markings, splitting and merging of lanes, and shadows of vehicles and objects exacerbate this problem even further [14,15,16,17]. We show a number of such examples in Figure 1.
Road markings can be extracted using image-based sensors, such as monocular or stereo vision cameras, or using LIDAR sensors. Among these, monocular cameras are the most cost-effective, although they lack depth information. Stereo vision cameras, however, can infer depth and hence reconstruct three-dimensional scenes for improved functionality, such as collision detection [4]. LIDAR sensors exploit the fact that road markings are painted with retroreflective paints; the reflected returns can then be used to extract the lane markings. However, LIDAR sensors are, like stereo vision cameras, far more expensive than monocular cameras. As such, seeking a trade-off between performance, reliability, and cost is an important activity in the design process. Treating cost effectiveness as the primary objective, we assume that lane detection is performed on images obtained from a monocular camera system.
The literature on lane detection and tracking is considerably rich, with a variety of techniques covering various application domains, including LDW, LKA, LCM, and CD. Some of these detect lane markings and track them (for example, [19,20]), while the rest perform only detection (for example, [5,13,21]). In particular, we focus on techniques that rely solely on images or videos obtained from monocular vision cameras for lane marking detection followed by tracking. For instance, vision-based lane detection has been used for LDW in [5,11,12,14,22]. These approaches predominantly rely on information such as color, color cues, and edge-specific details. Color cues exploit the color contrast between the lane markings and the road. However, conditions have to be favorable for the contrast differences to be picked up by the lane marking algorithms. Illumination, back lights, shadows, night lights, and weather conditions such as rain and snow can significantly degrade the performance of color-cue-based algorithms. One approach to overcome these limitations is to use the Hough transform along with color cues [23]. However, the Hough transform works well only when the candidate lines are straight and sufficiently visible. Although some preprocessing can improve the detection [21], consistently differentiating lane boundaries from other artifacts, such as shadows and vehicles, remains a challenge.
Inverse Perspective Mapping (IPM) is another approach to determine the lane boundaries in LDW systems. The central idea behind IPM is to remove the perspective distortion of lines that are parallel in the real world [11,18,22]. To do this, images are transformed from the camera view to a bird's-eye view using the camera parameters. During the transformation, the aspect ratios are retained so that gaps and widths between lane boundaries are transformed appropriately. As such, the lane boundaries remain detectable in the transformed space. However, there are several downsides to this approach. Primarily, IPM is often used with fixed camera calibration parameters, which may not always be optimal owing to surface conditions [24]. Furthermore, these transformations are computationally intensive [25], so the real-time utility of these approaches requires careful implementation. Although these issues can reasonably be overcome with techniques such as calibration and adequate compute power [24,25,26], the main limitation is that the transformation is sensitive to obstacles on the road, such as vehicles, and to terrain conditions [27].
As lane markings form pairs of parallel lines, each pair should pass through a vanishing point [28]. This property can be exploited to filter out line segments that do not constitute lanes [29,30,31]. A number of approaches have been proposed in the literature for tracking a single lane, such as [14,17,31,32,33,34,35]. In [17], color, gradient, and line clustering information are used to improve the extraction of lane markings. In [36], lane boundary detection is cast as a multitarget tracking problem based on random finite sets and the PHD filter. In [32], a multilevel image processing and tracking framework is proposed for a monocular camera-based system; as such, it relies heavily on preprocessing of frames. Our approach also uses splines, but our tracking approach is significantly different from the one in [32]. In [33,34], techniques for personalized lane-change maneuvering are discussed. They use driver-specific behaviors, collected as part of the system, to improve the results. Although this can improve the results, such approaches are practically difficult to implement. In [35], lane tracking is simplified by forming the midline of a single lane using B-splines. Although this approach may be useful over a short distance, conditions such as diverging lanes or missing lane markings render it susceptible to bad approximations of the midline, which can easily lead to suboptimal results.
With recent advances in machine learning, particularly supervised learning techniques such as deep learning, it is possible to engineer machine learning models to recognize lane markings. This possibility has been demonstrated in the literature [37,38,39,40,41]. In [38], a special convolutional neural network (CNN), termed spatial CNN (SCNN), was constructed to extract the spatial correlation between objects in an image, with a view to using it to establish the relative positioning of lane markings. In [39], LaneNet, consisting of two deep neural networks, was constructed for lane detection: one network detects lane marking edges, whereas the other groups and clusters lane markings. The lane extraction work described in [38] relies on several integrated techniques, such as the YOLO framework [42] for object detection and convolutional patch networks (CPN) [43,44] for detecting road surfaces and regions of interest. Although these supervised techniques can offer good results, the approach suffers from a number of issues. First, supervised learning techniques rely on labeled datasets or ground truth information. Although this appears trivial, labels have to be provided for each and every pixel that is to be classified as lane marking. Second, the real success of deep learning rests on the volume of data upon which the model is trained. Although the process of securing several thousands of images with labeled pixels can be automated, it is time-consuming. Third, training requires a substantial amount of compute time. Fourth, although various supervised learning techniques can offer good accuracy rates, the explainability of machine learning models is still an emerging area of research, and unlike conventional algorithms, deep neural networks lack the rigor of explainability. This is a serious concern where lives could be at risk. Finally, accuracy rates are never sustained across different datasets; as such, the training process is a continuous one.
This paper aims to develop a tracking technique for the multilane tracking problem based on images/videos captured from a single camera mounted at the front of a vehicle. The algorithm is designed to be real-time, robust, and cost-efficient in terms of sensors. To this end, we first model each lane marking as a spline, defined by a finite set of control points. By treating these splines (and thus the control points) as targets whose motions are defined during frame transitions, we formulate a multitarget tracking problem with an appropriate motion model. The multilane tracking technique proposed in this paper is a careful amalgamation of several existing real-time and robust ideas in the pipeline, with the addition of the new ideas outlined in our contributions.
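To make this representation concrete, the following minimal Python sketch evaluates a lane-line spline from a handful of control points. The control-point coordinates are hypothetical, and SciPy's generic B-spline routines stand in for the spline machinery described in this paper:

```python
import numpy as np
from scipy.interpolate import splprep, splev

# Hypothetical control points (x, y) for one lane-line, listed from the
# bottom of the frame to the top; the values are illustrative only.
ctrl = np.array([[640.0, 720.0], [610.0, 560.0], [590.0, 430.0], [585.0, 320.0]])

# Fit a cubic B-spline through the control points (k=3 requires at least
# four points) and sample it densely for drawing or tracking.
tck, _ = splprep([ctrl[:, 0], ctrl[:, 1]], k=3, s=0)
xs, ys = splev(np.linspace(0.0, 1.0, 50), tck)
curve = np.stack([xs, ys], axis=1)  # 50 (x, y) samples along the spline
```

Tracking a lane-line then reduces to tracking this small set of control points, which is what the remainder of the pipeline operates on.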
We utilize the probabilistic Hough transformation [45] to perform an initial extraction of lane markings. This is then followed by a series of algorithms prior to treating the extracted lanes as targets. The first algorithm in the pipeline performs an initial grouping of the extracted line segments into different splines. This is followed by an algorithm, encapsulating a number of subalgorithms, that robustly differentiates clutter from lane boundaries and manages the evolution of the trajectories of the splines being tracked. We then devise a multitarget tracking algorithm based on a motion model that assumes that the splines transition across frames at a constant rate. The overall solution can be considered a carefully engineered pipeline of algorithms. In doing this, we make the following key contributions:
We develop an algorithm, based on a maximum a posteriori (MAP) estimator [46], to group and cluster different lane segments into unknown spline groups;
find the intensity likelihood ratio of line segments and augment this ratio as a feature in a clustering and probabilistic data association (PDA) filter [47] to distinguish lane markings from clutter; and
propose a new, real-time, multiple extended target tracking (targets whose shape and position change simultaneously) algorithm that works in the presence of clutter, based on the PDA filter, to distinguish and track multiple spline-shaped lane-lines.
The remainder of this paper is organized as follows: In Section 2, we formulate the overall problem and discuss our approach for solving each of the subproblems. This is followed by Section 3, in which we discuss a set of preprocessing steps applied to the input images prior to using our framework of methods. The clustering and estimation of the control points that describe the splines, and the two key algorithms for this purpose, are discussed in Section 4. We then describe the techniques to track multiple splines using the IPDA filter in Section 5. The results of our evaluations are presented in Section 6, and we draw conclusions in Section 7.
4. Clustering and Identification of Control Points
Identification of the lane-line (spline) control points proceeds in three steps:
1. Partition the frame, find the line segments, and compute their intensity likelihood ratios.
2. Predict the control points and validate the measurements.
3. Update the final control points using the MAP estimator.
4.1. Frame Partitioning
Once the preprocessing is over, the next stage of the pipeline extracts the control points. Although we intend to identify a set of control points to model the lanes as splines, the process is much simpler if the splines are small in size and straight in shape. However, the extracted lane markings are seldom straight. One approach to address this issue is to partition each frame into n horizontal tiles, each with an experimentally determined height, so that the lanes in each partition are nearly straight. Figure 4 shows the same image partitioned in two different ways: for two different values of n and with different partition heights.

However, considering the perspective characteristics of the camera and the distance of the lanes from the camera, it is beneficial to have the heights of the partitions in increasing order toward the bottom of the frame. We experimentally determined the number of partitions n and the partition heights, expressed as fractions of the overall height H of the region of interest (ROI), that maximize the extracted information, and we use this configuration throughout the study conducted in this paper.
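For illustration, the sketch below partitions an ROI into n horizontal tiles whose heights increase toward the bottom of the frame. The geometric growth factor and the default tile count are assumed stand-ins for the experimentally determined configuration, which we do not reproduce here:

```python
import numpy as np

def partition_heights(H, n=3, growth=1.5):
    """Split ROI height H into n tile heights that increase toward the
    bottom of the frame; `growth` is an assumed tuning knob."""
    w = growth ** np.arange(n)                 # smallest weight for the top tile
    h = np.floor(H * w / w.sum()).astype(int)
    h[-1] += H - h.sum()                       # absorb rounding error in the last tile
    return h                                   # heights ordered top -> bottom

def partition_frame(roi, n=3, growth=1.5):
    """Return the list of horizontal tiles of `roi` (an image as ndarray)."""
    tiles, y = [], 0
    for h in partition_heights(roi.shape[0], n, growth):
        tiles.append(roi[y:y + h])
        y += h
    return tiles
```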
4.2. Intensity Likelihood Ratio of a Line Segment
For each of the partitions, we apply the probabilistic Hough transform to extract the lane markings. However, the extraction process, akin to edge extraction in most detection techniques, produces a number of broken, small, noncontinuous, and irrelevant line segments. As such, one of the key challenges following the extraction process is to distinguish the lane markings from background noise and clutter. To render a more robust, high-fidelity approach to clutter and noise management, we augment the extractions with the underlying intensity values. More specifically, we define the number of edge points that lie in an extended line segment in partition s as its intensity. The intensity can be extended to cover a set of line segments or a number of pseudomeasurements belonging to a curve. The intensity of an extended line segment is represented as a likelihood ratio, which we define below.
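For concreteness, a per-partition extraction step might look as follows in OpenCV; the Hough parameters here are illustrative assumptions, not the values used in our experiments:

```python
import cv2
import numpy as np

def extract_segments(tile_edges, min_points=15):
    """Probabilistic Hough transform on one binary edge tile (e.g., the
    Canny output for that partition); returns (x1, y1, x2, y2) rows."""
    segs = cv2.HoughLinesP(tile_edges, rho=1, theta=np.pi / 180,
                           threshold=min_points,
                           minLineLength=10, maxLineGap=5)
    return np.empty((0, 4), dtype=int) if segs is None else segs.reshape(-1, 4)
```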
Let $p_0(a)$ be the probability density function (PDF) of the intensity of noise-only detections, and $p_1(a)$ that of the target-originated line-segment detections before thresholding. Furthermore, let $\theta_0$ and $\theta_1$ be the scale parameters for false alarms and clutter, and for the target, respectively. These scale parameters depend on the minimum number of points used in the Hough transform. The noise-only and target-originated measurement density functions are
$$p_0(a_j) = \frac{1}{\theta_0}\, e^{-a_j/\theta_0}, \qquad p_1(a_j) = \frac{1}{\theta_1}\, e^{-a_j/\theta_1},$$
where $a_j$ is the intensity of candidate measurement j. Furthermore, let $\tau$ be the threshold to declare a detection. The probabilities of detection ($P_D$) and false alarm ($P_{FA}$) can be computed as follows:
$$P_D = \int_{\tau}^{\infty} p_1(a)\,\mathrm{d}a, \qquad P_{FA} = \int_{\tau}^{\infty} p_0(a)\,\mathrm{d}a.$$
Although the probability of detection, $P_D$, can be increased by lowering $\tau$, doing so also increases $P_{FA}$. Hence, the choice of $\tau$ cannot be arbitrary. With these, the corresponding probability density functions after thresholding become
$$p_0^{\tau}(a_j) = \frac{1}{P_{FA}}\, p_0(a_j), \qquad a_j \geq \tau, \tag{5}$$
$$p_1^{\tau}(a_j) = \frac{1}{P_D}\, p_1(a_j), \qquad a_j \geq \tau, \tag{6}$$
where $p_0^{\tau}$ and $p_1^{\tau}$ are the probability density functions of a validated measurement $a_j$ (for $a_j \geq \tau$) that is due to noise only and that originates from the target, respectively.

Considering Equations (5) and (6), the line segment intensity likelihood ratio $\rho_j$, which is the likelihood ratio of measurement j with an intensity of $a_j$ edge pixels originating from the target rather than clutter, can be defined as
$$\rho_j = \frac{p_1^{\tau}(a_j)}{p_0^{\tau}(a_j)} = \frac{P_{FA}}{P_D}\cdot\frac{\theta_0}{\theta_1}\, \exp\!\left(-a_j\left(\frac{1}{\theta_1}-\frac{1}{\theta_0}\right)\right).$$
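Under the exponential intensity model above, this ratio has a closed form. The sketch below computes it; the parameter values in the example are assumptions for illustration:

```python
import numpy as np

def intensity_likelihood_ratio(a_j, theta0, theta1, tau):
    """rho_j for a validated segment with intensity a_j (edge-point count),
    under exponential clutter (theta0) and target (theta1) densities with
    detection threshold tau."""
    if a_j < tau:
        return 0.0                        # below threshold: never validated
    p_fa = np.exp(-tau / theta0)          # P_FA for an exponential density
    p_d = np.exp(-tau / theta1)           # P_D likewise
    p0_tau = np.exp(-a_j / theta0) / (theta0 * p_fa)
    p1_tau = np.exp(-a_j / theta1) / (theta1 * p_d)
    return p1_tau / p0_tau

# Example: a segment supported by many edge points scores well above 1,
# i.e., it is far more likely to originate from a lane marking than clutter.
print(intensity_likelihood_ratio(a_j=80, theta0=20.0, theta1=60.0, tau=30.0))
```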
4.3. Pseudomeasurements
The pseudomeasurement set $\mathbf{y}^{j}_{k}$ is the set of control points for the track associated with lane j in frame k. More specifically,
$$\mathbf{y}^{j}_{k} = \left\{ y^{j}_{s,k} \right\}_{s=1}^{n},$$
where $y^{j}_{s,k}$ is the control point contributed by partition s. Furthermore, let $\ell^{j}_{s,k}$ denote the extended line segment in partition s, at time step k, for lane j; that is, $\ell^{j}_{s,k}$ abstracts away a number of pseudomeasurements for each (extended) line segment. Each such measurement is a two-element vector, with one element capturing the pseudomeasurement and the other representing the intensity of the extended line segment as a likelihood ratio, $\rho_j$.
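As a minimal illustration of this two-element representation (the class and field names are ours, not from the paper):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class PseudoMeasurement:
    """One validated pseudomeasurement for a lane spline."""
    point: np.ndarray  # control-point coordinates (x, y) in the frame
    rho: float         # intensity likelihood ratio of the source segment
```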
4.4. MAP Estimator for Measurements with Intensity Feature
Although we expect the pseudomeasurements to closely model the lane-lines, in reality, a number of factors make this process challenging. Examples include, but are not limited to, missed detections, the nondeterministic nature of the preprocessing, and noisy measurements due to clutter. Therefore, it is essential to model these imperfections as part of the process.
To simplify the analysis and derivation, we assume that measurements that originate from targets at a particular sampling instant are received by the sensor only once, with probability of detection $P_D$. The measurement equation can be written as follows:
$$z_j = x + w_j, \qquad j = 1, \dots, m_k,$$
where $z_j$ is the j-th validated measurement, $w_j$ is the measurement noise, and $x$ is the unknown value that we are aiming to estimate in the presence of the measurement noise. We also assume that the measurement noise is independent and zero-mean Gaussian distributed with covariance $R$. In our case, various preprocessing stages, such as thinning and the Hough transform, contribute towards $R$. Thus, $w_j \sim \mathcal{N}(0, R)$, where
$$R = \begin{bmatrix} \sigma_u^2 & 0 \\ 0 & \sigma_v^2 \end{bmatrix}.$$
Because of the condition of the road and the perspective effect of the camera lens, for the values of $\sigma_u$ and $\sigma_v$ we would expect more deviation in the bottom part of the frame, which is closer to the camera, than in the top. We also assume the measurements $z_j$ to be normally distributed around $x$ with covariance $R$, and the prior to be normally distributed around the predicted measurement $\bar{x}$ with covariance $Q$. Thus, $p(z_j \mid x) = \mathcal{N}(z_j;\, x,\, R)$ and $p(x) = \mathcal{N}(x;\, \bar{x},\, Q)$, where
$$Q = \begin{bmatrix} q_u^2 & 0 \\ 0 & q_v^2 \end{bmatrix}.$$
Again, similar to $R$, the perspective effects of the camera cause the values of $q_u$ and $q_v$ to be skewed toward the bottom part of the frame. Furthermore, the covariance $Q$ is often linked to the curvature $\kappa$ of the road. Assuming the maximum standard curvature of highways as a constant parameter, the posterior measurement density is
$$p(x \mid z_j) \propto p(z_j \mid x)\, p(x).$$
Since the measurement and prior noises are assumed to be Gaussian, for a single measurement (i.e., $m_k = 1$), the posterior $p(x \mid z_1)$ can be expressed as
$$p(x \mid z_1) = \mathcal{N}\!\left(x;\, \hat{x},\, P\right),$$
where
$$\hat{x} = \bar{x} + Q\,(Q + R)^{-1}(z_1 - \bar{x}) \quad \text{and} \quad P = \left(Q^{-1} + R^{-1}\right)^{-1}.$$
For a Gaussian distribution, the mean is the optimal maximizing value. Hence,
$$\hat{x}_{\mathrm{MAP}} = \hat{x} = \bar{x} + Q\,(Q + R)^{-1}(z_1 - \bar{x}).$$
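This is the standard precision-weighted Gaussian fusion; a minimal sketch:

```python
import numpy as np

def map_update(x_bar, Q, z, R):
    """MAP estimate for a Gaussian prior N(x_bar, Q) and one measurement
    z with Gaussian noise covariance R."""
    K = Q @ np.linalg.inv(Q + R)          # fusion gain Q (Q + R)^-1
    x_hat = x_bar + K @ (z - x_bar)       # posterior mean = MAP estimate
    P = (np.eye(len(x_bar)) - K) @ Q      # posterior covariance
    return x_hat, P
```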
For $m_k > 1$, the optimal maximized value can be derived using the total probability theorem and the combined residuals as follows:
$$\hat{x}_{\mathrm{MAP}} = \bar{x} + Q\,(Q + R)^{-1} \sum_{j=1}^{m_k} \beta_j \left(z_j - \bar{x}\right),$$
where $\beta_j$ is the association probability, which we define as (see Appendix A for derivations)
$$\beta_j = \frac{\rho_j\, p(z_j \mid Z^{k-1})}{\dfrac{m_k\,(1 - P_D P_G)}{P_D} + \displaystyle\sum_{i=1}^{m_k} \rho_i\, p(z_i \mid Z^{k-1})},$$
where $P_D$ and $P_G$ are the probabilities of detection and gating, respectively, $m_k$ is the number of validated detections at time k, $\rho_j$ is the intensity of the extended line segment as a likelihood ratio, and $p(z_j \mid Z^{k-1})$ is the probability density function of the correct measurement without the intensity feature, defined as
$$p(z_j \mid Z^{k-1}) = \mathcal{N}\!\left(z_j;\, \bar{x},\, S\right),$$
where $S = Q + R$, and $Z^{k-1}$ is the prior information.
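A sketch of this weighted update is given below. The clutter mass term b follows the common nonparametric PDAF convention and is our assumption rather than the exact formulation derived in Appendix A:

```python
import numpy as np
from scipy.stats import multivariate_normal

def pda_map_update(x_bar, Q, R, Z, rho, P_D=0.9, P_G=0.99):
    """PDA-weighted MAP update over the m_k validated measurements Z
    (one row per measurement) with intensity likelihood ratios rho."""
    S = Q + R                                        # innovation covariance
    e = np.array([rho[j] * multivariate_normal.pdf(z, mean=x_bar, cov=S)
                  for j, z in enumerate(Z)])         # weighted likelihoods
    b = len(Z) * (1.0 - P_D * P_G) / P_D             # "all clutter" mass (assumed form)
    beta = e / (b + e.sum())                         # association probabilities
    nu = (beta[:, None] * (Z - x_bar)).sum(axis=0)   # combined residual
    x_hat = x_bar + Q @ np.linalg.inv(S) @ nu        # weighted MAP update
    return x_hat, beta
```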
4.5. Clustering Algorithm for Finding Control Points
Ideally, each partition will have a sufficient number of full measurements so that a spline can be fitted over those measurements. However, in reality, this is seldom the case. The associated challenges are dealt with here using an algorithm that estimates the control points based on the available set of measurements. In particular, we use the MAP estimator (MAPE) to find the optimal control points. These aspects are handled by two algorithms, Algorithms 1 and 2, which are outlined and discussed in detail below.
Algorithm 1 Control Points Estimator
 1: …
 ⋮
 9: …
10: for i = 1; i < N; i++ do
11:   for each … do
12:     for each … do
13:       Initialize()
14:       Predict()
15:       MAPE()    // Update
16:     end for
17:     …
18:   end for
19:   A = RemoveSimilarCurves(A)
20: end for
Algorithm 1 handles each partition separately, but extends the line segments into the next partition wherever needed. For a given partition s, it estimates the control points for each line $l$ using the curvature $\kappa$. Then, the overall set of lines L is used to estimate the control points for that partition using the MAP estimator (see Algorithm 2). These control points are accumulated into A as a list. Notice that the Predict() function finds the predicted control points for each individual line segment $l$ using the curvature vector.
Algorithm 2 MAPE
 1: …
 ⋮
 7: …
 8: Validated-Measurements()
 9: …
10: for j = 0; j < m; j++ do
11:   …
12:   …
13:   …
14: end for
15: …
Algorithm 2 combines the data association and the posterior PDF to adjust the estimated control points. In particular, it uses the IPDA-specific target-to-track association probabilities (covering both track existence and nonexistence) to find the control points based on the candidate control points and the validated measurements. More specifically, the Validated-Measurements() function uses a normalized distance to validate the line segments belonging to each spline. We show a sample outcome of these algorithms in Figure 5. We first show two endpoint measurements (Figure 5a); these points are then corrected using the above algorithms to output the corrected control points (Figure 5b).
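The normalized-distance validation amounts to Mahalanobis gating of the candidate measurements; a minimal sketch, with an illustrative chi-square gate value:

```python
import numpy as np

def validate_measurements(Z, x_bar, S, gate=9.21):
    """Keep measurements whose squared normalized (Mahalanobis) distance
    to the prediction x_bar falls inside the gate; 9.21 is the 99%
    chi-square point for 2 degrees of freedom (illustrative choice)."""
    d = Z - x_bar                                # residuals, one row each
    d2 = np.einsum('ij,jk,ik->i', d, np.linalg.inv(S), d)
    return Z[d2 <= gate]
```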