1. Introduction
Traffic incidents are nonrecurrent events typically caused by crashes, highway construction, severe weather, geohazards, and similar disruptions [1]. Compared with urban areas, mountainous areas, with their special geological and meteorological characteristics, see a higher concentration of fatal traffic crashes and natural disasters [2,3]. In particular, geological hazards are among the major disasters in China’s mountainous regions and seriously threaten the lives and property of local people. In recent years, economic losses caused by geological disasters on roads in China have reached USD 10 billion yearly. An effective way to reduce the potential damage of geohazards is to predict the duration of geohazard accidents accurately and promptly at each location in a mountainous road network, which is a prerequisite for implementing traffic safety treatments [4,5]. Therefore, significant research effort is needed to forecast the duration shortly after a geohazard accident occurs, both to help regional traffic safety managers and engineers develop effective countermeasures and to provide traffic information for travelers.
Most prior studies related to this topic focused on urban roadways or highways, with less consideration of mountainous areas [
6,
7]. For instance, ten accident types were included in the study by Araghi et al. [
8], including broken-down vehicles, broken-down lorries, accidents, fire, flooding, fuel spillage, gas leak, police incident, collapsed maintenance holes, and traffic light failure. Extensive efforts have been made to address accident duration issues on freeways with statistical methods and machine learning methods from various aspects. The earliest accident duration prediction model was linear regression, which assumes a linear relationship between accident duration and the various influencing factors [
9,
10]. However, this assumption is too restrictive because duration does not vary linearly with the influencing factors [6]. Jones et al. [11] then analyzed the frequency and duration of accidents with an analysis model, revealing for the first time a nonlinear relationship between influencing factors and duration. Later, survival analysis models became widely used to model duration. The Kaplan–Meier estimator is the most commonly used nonparametric method in survival analysis for estimating survival functions; an important advantage is that it makes differences between the survival curves of different conditions easy to visualize. Li et al. [
12] developed an analysis model using Kaplan–Meier estimation and found that different factors and clearance methods significantly affected the duration of the two accident groups. Chen and Tian [
13] carried out a study on the impact of different weather conditions, such as cloudy and sunny, on highway traffic accidents through Kaplan–Meier analysis. At the same time, a wide variety of other statistical survival analysis models have been used to analyze and predict the duration of traffic accidents. Alkaabi et al. [
14] used a Weibull model without gamma heterogeneity to investigate the effect of traffic accident characteristics on accident clearance times. In contrast to the linear regression model, the survival analysis model considers both the length of the accident duration and the outcome of the accident, which has made it the classical model in this field. However, these studies have limitations. Traditional survival analysis requires strict assumptions: Cox regression assumes that the hazards associated with different accident characteristics remain proportional over time, and accelerated failure time (AFT) models require the duration to follow a specific parametric distribution. When these assumptions are not met, the results are generally poor [15]. With the rise of machine learning, many promising methods, such as the K-nearest neighbor method [
16], support vector regression [
17], Bayesian networks [
7], and decision trees [
18], have been widely used in modeling accident durations. Compared with statistical approaches, machine learning methods are more flexible and make few or no a priori assumptions about the input variables. Nevertheless, conventional machine learning models are difficult to interpret and cannot handle datasets containing censored data.
Recently, more advanced machine learning methods that can model survival times have been proposed. Unlike conventional machine learning models, survival machine learning models are adapted to include censored data and to provide a full probability of deterioration curve [
19]. The random survival forest (RSF) is a typical survival machine learning model that does not depend on the strict assumptions of traditional survival analysis and addresses its high variability and bias [
20]. The RSF has been used in fields as diverse as medicine [
21], business [
22], and environmental science [
23]. In the field of transportation applications, Wang et al. [
24] considered influencing factors, such as the cause of the event, time of the event, and line-related variables, and used the RSF model to describe the duration of subway service interruption. Lu and Ilgin [
19] used the RSF model for bridge deck deterioration analysis, both to identify the factors influencing deterioration and to predict survival time. Multiple studies have compared the RSF with traditional survival analysis in duration modeling, and the RSF showed better performance than the other methods [
25].
As described above, various models have been used in previous studies to predict the duration of many types of highway accidents. However, the validity of existing methods for assessing the duration of road geological hazard accidents is challenged by two circumstances: (1) Road geological hazards differ from other accidents in their characteristics and therefore in their disposal methods, so existing traffic accident impact assessment results are unsuitable for road geological disasters. (2) Existing traditional survival analysis models and machine learning models do not adequately address the complex influencing factors involved in modeling the duration of highway geological hazard accidents. Therefore, given the characteristics of geohazard incident duration, we combined the Kaplan–Meier (K-M) model and the RSF model to develop a model for predicting the duration of geohazard incidents on roads.
In summary, this study has the following contributions:
(1) As the first study to model the duration of geohazard incidents, this study analyzes the characteristics of geohazard incident duration and effectively identifies the key factors affecting the duration of geohazard accidents.
(2) An integrated prediction model combining K-M and RSF is proposed, which is a relatively new approach. The proposed method is evaluated through practical application in Yunnan, China. The results show that the proposed method predicts accident duration more accurately, which can provide a reference for road management and travelers.
The remainder of this paper is organized as follows:
Section 2 introduces the data used in this article;
Section 3 presents the methods and model evaluation measures used in this study;
Section 4 offers and analyzes the model results;
Section 5 discusses the results and applications of the model; and
Section 6 concludes this study.
2. Data Description
We chose Yunnan Province, China, as the study area. Yunnan Province is located in the southwest border area of China and has complex landforms. Yunnan is prone to geological disasters such as collapse, landslides, and debris flows due to its large number of mountains, steep slopes, and concentrated rainfall. We collected data on geological disaster road blockages in Yunnan Province from January 2018 to December 2020, which frontline emergency responders reported to the Yunnan Provincial Department of Transport Road Bureau. In these text data, each record is related to a variety of items, such as accident route name, route code, stake number, cause of the accidents, description of site conditions, and disposal measures. The raw text data are shown in
Table 1.
Road geological disasters are sudden and highly destructive, and they are an important component of major traffic accidents. The overall duration of a road geological disaster accident, like that of other significant traffic accidents, comprises four phases: notification time, response time, clearance time, and traffic recovery time [26]. Because many factors interact during the traffic recovery phase, its actual duration is difficult to determine. In this work, the accident duration was therefore taken as the sum of the first three phases (notification time, response time, and clearance time), as shown in
Figure 1.
The duration of an accident, $T$, is given by:
$$T = t_{r} - t_{d}$$
where $t_{r}$ is the time stamp of the actual recovery time of the incident and $t_{d}$ is the time stamp when the accident was discovered by the traffic authorities.
It is important to note that survival analysis commonly involves censored data, that is, cases in which the complete duration and outcome of an event are not observed before the study ends, for a range of reasons. The duration of road geohazard incidents is often long, so some incidents are not yet completed at the time of reporting. These data are censored, and the duration $T$ is calculated as
$$T = t_{p} - t_{d}$$
where $t_{p}$ is the time stamp when the incident department reported the disaster to management.
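The two duration definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `incident_duration`, its argument names, and the example time stamps are hypothetical and do not come from the Yunnan dataset.

```python
from datetime import datetime
from typing import Optional, Tuple


def incident_duration(discovered: datetime,
                      recovered: Optional[datetime],
                      reported: datetime) -> Tuple[float, bool]:
    """Return (duration in hours, event_observed) for one incident record.

    If a recovery time stamp exists, the duration runs from discovery to
    recovery and the record is uncensored. Otherwise the duration runs from
    discovery to the report time and the record is flagged as censored.
    """
    if recovered is not None:
        end, observed = recovered, True
    else:
        end, observed = reported, False
    hours = (end - discovered).total_seconds() / 3600.0
    return hours, observed


# Hypothetical records: one cleared incident and one still open when reported.
cleared = incident_duration(datetime(2019, 8, 3, 8, 0),
                            datetime(2019, 8, 3, 20, 30),
                            datetime(2019, 8, 3, 9, 0))
ongoing = incident_duration(datetime(2019, 8, 3, 8, 0),
                            None,
                            datetime(2019, 8, 4, 8, 0))
```

Keeping the observed/censored flag next to the duration is what allows survival models to use the 55 censored samples instead of discarding them.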
Road geological disasters are troublesome and time-consuming to clear. Duration was measured in hours. Some geological disasters last a long time, even more than one month. Only incidents whose interruption lasted no more than 72 h were selected for this study. A total of 349 data samples met this requirement, including 55 censored samples. The mean accident duration was 13.14 h, and the minimum and maximum values were 0.02 h and 71.93 h, respectively. Candidate variables related to temporal characteristics, incident characteristics, and processing status can be extracted from the dataset.
4. Results
4.1. Variable Analysis and Selection
An appropriate choice of explanatory variables can significantly improve the performance of the model. In this section, the heterogeneity of geological hazard incident durations is discussed along multiple dimensions, including space and time. Based on the results of this analysis, we built the candidate variable dataset.
The spatial distribution and kernel density of road geohazard accident frequency and duration are shown in
Figure 3. Road geological hazard incidents in Yunnan Province are spatially heterogeneous. They are located mainly in tectonically active, steep, mountainous, and fragmented areas along the northern provincial border, such as Zhaotong and Diqing cities. However, the ability to respond to and recover from disasters corresponds, to some extent, to the level of economic development of the area. Although Honghe and Dali cities also had many accidents, economic development there has improved local disaster prevention and mitigation capabilities, reduced social vulnerability, and enabled rapid recovery from geological hazards, resulting in relatively short accident durations.
Four incident types, namely, debris flow, subsidence or cracks in the ground, collapse, and landslide, were considered. The frequency and average duration of the four types of incidents are shown in
Figure 4. Among the 349 accidents, debris flow was the most common cause, occurring 129 times (37.0%). However, accident frequency and accident duration were distributed very differently. The average duration of debris flow accidents was 9.14 h, the shortest of the four accident types. Subsidence or cracks in the ground was the least common cause, occurring only 20 times (5.7%). Collapse was the second most common cause (35.0%). The average durations for subsidence or cracks in the ground and for collapse were 13.26 and 13.46 h, respectively, close to the overall average of 13.14 h. Landslides had a medium frequency (22.3%) but the longest average duration (19.24 h).
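As a quick consistency check on these figures, the overall mean duration can be recovered as a frequency-weighted average of the four per-type means. The sketch below makes one assumption: only the counts for debris flow (129) and subsidence or cracks (20) are stated in the text, so the counts for collapse (122) and landslide (78) are inferred here from the reported shares of 35.0% and 22.3% of 349 accidents.

```python
# Counts per incident type; 122 and 78 are inferred from the reported
# percentages (assumption), 129 and 20 are stated in the text.
counts = {
    "debris flow": 129,
    "subsidence or cracks": 20,
    "collapse": 122,
    "landslide": 78,
}

# Reported average duration per incident type, in hours.
mean_hours = {
    "debris flow": 9.14,
    "subsidence or cracks": 13.26,
    "collapse": 13.46,
    "landslide": 19.24,
}

total = sum(counts.values())

# Share of each type in percent, rounded to one decimal place.
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}

# Frequency-weighted mean duration across all incidents.
overall_mean = sum(counts[name] * mean_hours[name] for name in counts) / total
```

Under these inferred counts the weighted mean agrees with the reported overall average of 13.14 h to two decimal places, which suggests the per-type statistics and the overall statistic are mutually consistent.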
To explore the unbalanced distribution of road geological disaster accidents in the temporal dimension, the monthly and temporal distributions of road geohazard occurrences were mapped.
Figure 5 shows that between 2018 and 2020, road geohazard accidents were concentrated in July to September, which accounted for 79.4% of all accidents; August alone accounted for 47.7%, because road geohazards are more likely in the rainy season. The average accident duration differed little between months, with the exception of February, when accidents lasted longer. A possible reason is that February contains the Chinese New Year, the most important holiday in China: maintenance agencies and emergency management departments move shifts earlier and reduce staffing, so fewer incident responders are available and incident duration increases.
The time distribution diagram of the interruption due to road geological disasters is shown in
Figure 6, which shows a marked morning peak. Among the 349 accidents, 177 (50.7%) occurred during the morning peak (8–11 a.m.). However, the average duration of accidents during the morning peak (9.93 h) was below the overall average (13.14 h).
As the analysis above shows, road geohazard accidents, unlike other traffic accidents, are concentrated in tectonically active, steep mountainous, and fragmented areas, and have a high incidence during the rainy season and the morning peak. Meanwhile, the duration of road geohazard accidents is heterogeneous in terms of accident cause and time of occurrence, but the average duration across months does not show heterogeneity. Thus, we extracted information from the dataset about the duration of the incident, the cause of the incident, the condition of the affected roads, and the time of the incident. We then extended the accident-related and incident-handling information into candidate explanatory variables for further analysis, for a total of 11 categorical variables.
Table 2 describes each candidate variable and the 349 accident data points used in the modeling process.
A statistical description of the data revealed significant differences in the duration of road geohazard incidents under different variable conditions. In the following sections, these phenomena and their associated causes are described and analyzed in light of the K-M model results.
4.2. Kaplan–Meier Model Regression Results
Kaplan–Meier curves of the different accident cases, road types, times, and treatment conditions are shown in
Figure 7,
Figure 8,
Figure 9 and
Figure 10, respectively. Six factors were significant at the 5% level: accident type, secondary accidents, detained vehicles or persons, closed road, morning peak, and level of accident management department.
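For readers unfamiliar with how such curves are built, the following is a minimal pure-Python sketch of the Kaplan–Meier product-limit estimator. It is not the implementation used in this study; the function name `kaplan_meier` and the toy data are illustrative assumptions.

```python
def kaplan_meier(durations, observed):
    """Return [(t, S(t))] at each distinct event time, in increasing order.

    durations: incident durations in hours.
    observed:  True if the incident ended (event), False if censored.
    At each event time t, S is multiplied by (1 - d_t / n_t), where d_t is
    the number of events at t and n_t the number of incidents still at risk
    (duration >= t). Censored records only reduce the risk set.
    """
    event_times = sorted({t for t, e in zip(durations, observed) if e})
    survival = 1.0
    curve = []
    for t in event_times:
        at_risk = sum(1 for d in durations if d >= t)
        events = sum(1 for d, e in zip(durations, observed) if d == t and e)
        survival *= 1.0 - events / at_risk
        curve.append((t, survival))
    return curve


# Toy data (hypothetical): five incidents, two of them censored.
toy_durations = [2, 3, 3, 5, 8]
toy_observed = [True, True, False, True, False]
toy_curve = kaplan_meier(toy_durations, toy_observed)
```

In this setting, a higher curve at a given time means that a larger share of incidents in that group is still unresolved, which is how the group comparisons in this section should be read.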
The results show that different accident types have different durations (
Figure 7a). When subsidence or cracks in the ground occur, the survival probability is higher, indicating a prolonged incident duration. The survival curves also show that, beyond 60 h, the survival probability of debris flow and collapse accidents is 0, whereas that of landslide and subsidence or cracks in the ground only approaches 0. This indicates that debris flow and collapse accidents are essentially over after 60 h, whereas subsidence or cracks in the ground and landslides may persist longer.
Figure 7b shows the survival probability for the duration of road geohazard incidents with and without secondary incidents. The survival probability of road geohazard accidents with secondary accidents is higher than the survival probability of road geohazard accidents without secondary accidents of the same duration. The results show that the duration of road geological disasters is longer when secondary accidents occur. The average duration of road geological disasters with secondary accidents is 22.57 h, while without secondary accidents it is 12.35 h.
Figure 8a shows the survival probabilities with and without detained vehicles or persons. The average duration of a road geohazard is 18.94 h with detention and 11.71 h without. When vehicles or people are detained, the survival probability is greater than when there is no detention, which means that detention is associated with a longer accident duration. This result corresponds with the reality that detention makes accidents harder to handle and reduces traffic efficiency.
The Kaplan–Meier model results show that there is a significant difference in the survival probability for the duration of road geohazard incidents occurring in the morning peak and non-morning peak (
Figure 8b). The survival probability of road geohazards occurring outside the morning peak is consistently greater than that of those occurring in the morning peak. Consistent with the results in Figure 6, road geohazard incidents that occur in the morning peak are usually of shorter duration. Beyond 60 h, all accidents that occurred in the morning peak have ended. Generally, incidents occurring during the morning rush hour can be detected and reported more promptly and dealt with more efficiently.
The survival probability of the duration of road geological disasters with or without road closures is shown in
Figure 9a. The small difference between the two survival curves indicates that accident duration differs little with and without road closures. In general, the survival probability for closed roads is slightly higher than that for non-closed roads at the same elapsed duration. One reason is that when a road must be closed, the pavement area that needs to be cleared is larger, so clearance takes longer. The two survival curves almost overlap after 25 h, which is in accordance with the reality that once an accident lasts longer than 25 h, whether the road is closed has little effect on the duration. Specifically, the average durations of accidents on closed and unclosed roads are 15.11 h and 10.02 h, respectively.
The results at different accident management department levels are quite different.
Figure 9b shows the survival probability of road geological hazard duration at different accident management department levels. At the squadron level, accidents are basically handled once the duration exceeds 25 h. By contrast, the survival probability is 0.4 for the battalion and local road bureau levels and 0.5 for the traffic management department level. This result is in line with the fact that the more complex the accident, the higher the level of the accident management department and the more cautious the handling.
Figure 10 shows the results of the K-M model estimates for the four variables with insignificant log-rank values.
Figure 10a shows that accidents with an incident road affected for a length of [0 km, 1 km) are largely dealt with when the duration of the accident is greater than 20 h. Accidents where the accident road is affected for a length of more than 10 km are largely dealt with when the duration of the accident is greater than 57 h. However, the shortest impact lengths have the highest probability of survival, probably due to the fact that accident impact lengths of [0 km, 1 km) are much more frequent than other class frequencies and contain occasional, extremely difficult road geohazard accidents.
Figure 10b–d shows small differences in survival curves in terms of day of the week, type of road, and mechanical maintenance, indicating that the different conditions for the above three variables have a small effect on accident duration. In terms of the magnitude of the log-rank value of the insignificant variables, day of week < road type < length of road affected < mechanical repair.
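The log-rank values referred to above come from a test that compares observed and expected event counts between groups at every event time. A minimal two-group sketch in pure Python is given below; it is an illustration only, the function name `logrank_test` and the toy groups are hypothetical, and variables with more than two levels would need the multi-group form of the test.

```python
import math


def logrank_test(dur_a, obs_a, dur_b, obs_b):
    """Two-group log-rank test; returns (chi-square statistic, p-value).

    At each distinct event time, the observed events in group A are compared
    with the number expected if both groups shared one survival curve. The
    variance term is the usual hypergeometric variance.
    """
    all_dur = list(dur_a) + list(dur_b)
    all_obs = list(obs_a) + list(obs_b)
    event_times = sorted({t for t, e in zip(all_dur, all_obs) if e})

    observed_minus_expected = 0.0
    variance = 0.0
    for t in event_times:
        n_a = sum(1 for d in dur_a if d >= t)  # at risk in group A
        n_b = sum(1 for d in dur_b if d >= t)  # at risk in group B
        d_a = sum(1 for d, e in zip(dur_a, obs_a) if d == t and e)
        d_b = sum(1 for d, e in zip(dur_b, obs_b) if d == t and e)
        n, d = n_a + n_b, d_a + d_b
        observed_minus_expected += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)

    chi2 = observed_minus_expected ** 2 / variance if variance > 0 else 0.0
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Toy check (hypothetical data): identical groups versus clearly separated ones.
same_chi2, same_p = logrank_test([1, 2, 3], [True] * 3, [1, 2, 3], [True] * 3)
diff_chi2, diff_p = logrank_test([1, 2, 3, 4], [True] * 4,
                                 [10, 11, 12, 13], [True] * 4)
```

A variable would be called significant at the 5% level when the resulting p-value is below 0.05.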
4.3. Model Construction and Comparison
In this section, the performance of different prediction models is compared. First, 80% of the data were randomly selected for training the model, and the remaining 20% of the data were used for model testing.
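A random 80/20 split of this kind can be sketched as follows. The helper name `split_records`, the fixed seed, and the use of integer record indices are assumptions made for illustration; the paper does not state how the split was implemented.

```python
import random


def split_records(records, test_fraction=0.2, seed=0):
    """Shuffle records reproducibly and return (train, test) lists."""
    rng = random.Random(seed)  # fixed seed so the split can be repeated
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# 349 record indices, matching the sample size reported in Section 2.
train_ids, test_ids = split_records(range(349))
```

With 349 samples, this rounding rule puts 70 records in the test set and 279 in the training set.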
To investigate whether the variables that were not significant in the K-M statistical analysis would affect the prediction of geohazard incident duration, we used a forward stepwise selection method that starts from the K-M results and adds the other variables one at a time. The optimal RSF model, the survival support vector machine (SSVM) model, and the Cox proportional hazards (CPH) model were constructed. The performance of these three models was compared using Harrell’s C-index, which measures the agreement between predicted risk and actual survival, on both the training and test sets.
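To make the evaluation metric concrete, here is a minimal pure-Python sketch of Harrell's C-index. It is a simplified illustration: the function name `harrell_c_index` is hypothetical, and pairs with tied durations are simply skipped, which differs from some library implementations.

```python
def harrell_c_index(durations, observed, risk_scores):
    """Fraction of comparable pairs that the risk scores order correctly.

    A pair (i, j) is comparable when incident i ended (was observed) and did
    so strictly before duration j. It is concordant when the incident with
    the shorter duration has the higher predicted risk. Tied risk scores
    count as half concordant.
    """
    concordant = 0.0
    comparable = 0
    n = len(durations)
    for i in range(n):
        if not observed[i]:
            continue  # a censored record cannot be the shorter member of a pair
        for j in range(n):
            if durations[i] < durations[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    return concordant / comparable if comparable else float("nan")


# Toy data (hypothetical): perfectly ordered and perfectly reversed risk scores.
toy_t = [2, 4, 6, 8]
toy_e = [True, True, True, True]
c_perfect = harrell_c_index(toy_t, toy_e, [4, 3, 2, 1])
c_reversed = harrell_c_index(toy_t, toy_e, [1, 2, 3, 4])
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, so higher values indicate better agreement between predicted risk and observed duration.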
In the univariate Kaplan–Meier approach, six characteristics (accident type, secondary accidents, detained vehicles or persons, closed road, morning peak, and level of accident management department) were statistically significant (
Figure 7,
Figure 8,
Figure 9 and
Figure 10). Starting with six significant variables, five combinations of variables were constructed according to the log-rank values of insignificant variables, and the test set C-index was calculated for each combination of variables. The results are shown in
Figure 11. As variables were added one after another, the C-index of the RSF showed a relatively steady upward trend. The RSF model performed best with variable combination 3 as input, reaching a C-index of 0.756, the highest of all models. Although the added variables were not statistically significant in the K-M analysis, they still contributed to duration prediction. The SSVM model performed slightly better than the RSF model with variable combinations 1 and 2. The performance of the CPH model, however, hardly changed as the number of variables increased.
We further calculated the AUC of the three models with variable combinations 2, 3, and 4 as input (the combinations with which the three models achieved their best C-indexes). All three models achieved their best AUC with variable combination 3. Although the mean AUC values of the RSF model and SSVM model are equal,
Figure 12 shows that when the survival time is less than 25 h, the prediction performance of the RSF model is significantly better than the prediction performance of other models, but when the survival time is more than 25 h, the prediction performance of the RSF model decreases.
To evaluate whether the RSF model performs better than conventional machine learning models, random forest (RF) and XGBoost models were also constructed, using the best-performing variable combination 3 from the RSF model as input. Since the RF and XGBoost models cannot handle censored data, all three models were constructed on the same uncensored dataset, and the MAE and MSE were calculated for each.
Among the three prediction models, the XGBoost model had the largest prediction error on both metrics. Compared with the XGBoost model, the RF model achieved better prediction performance, reducing the MAE and MSE by an average of 9.2% and 24.2%, respectively. However, the RF model still failed to capture the inherent characteristics of the duration of road geohazard incidents well. Compared with these two methods, the RSF further reduced the prediction error in terms of both MAE and MSE.
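The two error metrics, and the relative reductions quoted above, can be computed as in the sketch below. The helper names and the toy values are illustrative assumptions and are not results from this study.

```python
def mae(y_true, y_pred):
    """Mean absolute error, in the same unit as the durations (hours)."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error, in squared hours."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


def relative_reduction(baseline, improved):
    """Percentage by which `improved` is lower than `baseline`."""
    return 100.0 * (baseline - improved) / baseline


# Toy durations in hours (hypothetical values, for illustration only).
actual = [2.0, 10.0, 30.0]
predicted = [4.0, 7.0, 36.0]
toy_mae = mae(actual, predicted)
toy_mse = mse(actual, predicted)
```

Because the MSE squares each error, a few badly predicted long incidents inflate it far more than the MAE, which is one reason the two metrics can improve by different percentages for the same pair of models.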
In general, survival machine learning models, such as the RSF and SSVM models, are superior to traditional survival analysis models for duration prediction. This is because the CPH model assumes that the hazards for different covariate values remain proportional over time, whereas machine learning models are nonparametric and can better capture the nonlinear relationship between the duration of road geological hazards and the influencing factors. At the same time, the RSF model not only handles censored data but also has better prediction accuracy than ordinary machine learning models. The RSF model proved to be more accurate than the other models (
Figure 12 and
Table 3), and these results may indicate that the RSF model is a more powerful predictor of road geohazard duration.
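For reference, the assumption of the CPH model discussed above is usually written as follows. The symbols $h_0$, $\boldsymbol{\beta}$, and $\mathbf{x}$ are standard textbook notation and are not taken from this paper.

```latex
h(t \mid \mathbf{x}) = h_0(t)\,\exp\!\left(\boldsymbol{\beta}^{\top}\mathbf{x}\right),
\qquad
\frac{h(t \mid \mathbf{x}_1)}{h(t \mid \mathbf{x}_2)}
  = \exp\!\left(\boldsymbol{\beta}^{\top}(\mathbf{x}_1 - \mathbf{x}_2)\right)
```

The ratio on the right does not depend on $t$, so the model forces the relative risk between any two incidents to stay constant over the whole duration. Nonparametric models such as the RSF do not impose this constraint.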
5. Discussion
In this section, the findings and applications of this study are analyzed and discussed. The spatio-temporal analysis of road geohazard characteristics based on real data shows that geological accidents on roads are concentrated in areas with active crustal movement and steep hills. These results are consistent with those of Wei et al. [
32] and Liang et al. [
33], who concluded that geomorphology was the controlling factor for geological accidents such as collapses and landslides. Moreover, rainfall intensity also plays an important role in road geohazards, a result similar to that of Qiu et al. [
34]. This suggests that, during the construction of highways, engineering measures such as drainage ditches in places prone to geological disasters can effectively reduce the probability of accidents such as landslides and debris flows.
Regarding the factors that affect accident duration, statistical analyses conducted in existing studies [
14,
35] suggest that the ease of accident handling plays a key role in determining event duration. Accidents of higher severity tend to result in longer durations and more severe congestion. When people or vehicles are stranded, or when a major highway geological disaster involves secondary accidents, the accident lasts longer.
Analysis of the influencing factors of road geological disaster accidents can help reduce the impact of accidents on traffic through the effective allocation of equipment and personnel [
10]. Based on the findings of this paper and the requirements of traffic management, the following measures are suggested to shorten the duration of road geological accidents: (1) Improve the rapid handling of frequently occurring incidents, such as mudslides, and optimize emergency plans. (2) Allocate personnel and equipment more rationally between morning peak and non-morning peak shifts.
From the results based on simulations and real data, we found that our proposed model can better predict the duration of road geohazard accidents, which has important implications for accident management. For example, it can give road management and maintenance authorities the approximate duration of the work so that they can plan accordingly, which is important for assessing traffic management actions. In addition, as shown in
Figure 13, when the road congestion or disruption caused by various road geological hazards is known, the model can provide reliable traffic information and recommend more efficient route options.
6. Conclusions
In this study, we investigated the characteristics of road geohazard accident duration and used the emerging survival machine learning methods to build a geohazard accident duration prediction model. The data on geological accidents in Yunnan Province were collected from January 2018 to December 2020. Through statistical analysis, the following were found:
(1) The mean geological accident duration was 13.14 h. Road geological accidents are spatially heterogeneous: they are mainly located in tectonically active, steep mountainous, and fragmented areas, and they have a high incidence during the rainy season and the morning peak. Debris flow and collapse were the most common causes of accidents, with average durations of 9.14 h and 13.46 h, respectively. In terms of time, road geohazard accidents were concentrated in July to September, which accounted for 79.4% of the total number of accidents.
(2) The type of accident, secondary accidents, detained vehicles or persons, morning peak, closed roads, and level of accident management were found to have a significant impact on the duration of the geohazard accident.
(3) Among the survival models, the RSF model achieved a C-index of 0.756 and a mean AUC of 0.867. Its C-index was higher than those of the CPH and SSVM models, although its mean AUC equaled that of the SSVM model. With uncensored data, the MAE and MSE of the RSF model were 11.32 and 346.99, respectively, which were better than those of the RF and XGBoost models.
The results of this study can help us to understand the factors affecting the duration of road geological disaster incidents and to implement appropriate strategies that mitigate the impact of incidents on traffic through the effective deployment of equipment and personnel. They can also provide reliable traffic information for travelers and improve the reliability of travel times.