1. Introduction
As catastrophic natural events, forest fires pose a grave threat to ecosystems and to human life and property [1]. In the context of global climate change, the incidence of forest fires has shown a significant upward trend, making forest fire research increasingly urgent. In particular, investigating the origins of forest fires and predicting their propagation patterns have become central tasks in the field. Sophisticated forest fire prediction and monitoring systems have been widely deployed in Europe, the United States, and many other regions. For example, the Canadian Forest Fire Danger Rating System (CFFDRS) [2] and the National Fire Danger Rating System (NFDRS) [3] play an essential role in assessing the risk of forest fires, while the BehavePlus Fire Modeling System (BFMS) [4] and the Fire Area Simulator (FARSITE) [5] show unique advantages in predicting the dynamics of fires. These systems give forest fire managers practical tools to reduce fire risk, maintain ecosystem stability, and protect human lives and property.
In recent years, with the continuous development of technology, machine learning methods have shown particular potential in wildfire control and prediction [
6]. Many researchers have combined various data sources, such as satellite, meteorological, and historical data, with advanced machine learning algorithms to build innovative wildfire risk prediction models. For example, Sim et al. [
7] demonstrated the effectiveness of using Sentinel satellite data and machine learning algorithms to predict wildfire severity, where the random forest model achieved an accuracy of up to 82.3%. Malik et al. developed a systematic approach by integrating and analyzing satellite, meteorological, and historical data to create a machine learning and big data-based wildfire risk prediction model with multiple geographic parameters [
8]. Banerjee’s research [
9] evaluated various environmental parameters to create a fire probability map for the Himachal Pradesh region and compared the performance of different machine learning algorithms. Pérez-Porras et al. [
10] proposed a large-scale fire prediction system that combines physical, statistical, and machine learning methods.
These studies demonstrate the breadth and depth of machine learning’s contribution to wildfire prediction and management [11,12]. Although machine learning algorithms have achieved positive results in forest fire prediction, they still face significant obstacles. The limitations of traditional machine learning models in forest fire prediction primarily manifest in two key aspects [
13,
14,
15]. Firstly, the occurrence of forest fires is influenced by many complex factors, making it exceptionally difficult to find sufficient and appropriate feature parameters to construct effective prediction models. Forest fires are not only closely related to meteorological conditions (such as temperature, humidity, wind speed, and precipitation) but also intertwine with topographic features (such as elevation, slope, and aspect) as well as vegetation coverage (such as vegetation type and density). These factors are interconnected through complex non-linear relationships, which traditional machine learning models often struggle to handle. For example, in some forest fire prediction models, considering only meteorological factors may not accurately predict the occurrence of fires, as topographic factors can influence the direction and speed of fire spread.
Likewise, vegetation factors can determine the intensity of combustion. Secondly, traditional machine learning models often rely on human intervention in model selection and hyperparameter tuning. This process is not only time-consuming and labour-intensive, but the limitations of human judgment and experience can also lead to the selection of suboptimal models, thereby reducing the accuracy of the predictions. Excessive human intervention can also reduce the efficiency of model training, making it difficult for models to adapt promptly to the constantly changing forest environment and fire data [
16,
17].
To overcome the limitations of traditional machine learning models in forest fire prediction, automated machine learning (AutoML) technology [
18,
19] has emerged as a promising solution. AutoML represents a novel methodological approach that deeply integrates machine learning with automation techniques, aiming to comprehensively simplify and optimize the various critical stages of the machine learning process [
20]. The core objective of AutoML is to significantly improve the speed and efficiency of machine learning model development by automating complex operations, including feature engineering, model selection, and hyperparameter tuning. Specifically, in the feature engineering aspect, AutoML can automatically select and construct the most valuable features based on the characteristics and types of the data, such as generating statistical features, time series features, and text features, and performing effective feature selection and dimensionality reduction to extract critical information hidden in the data, thereby enhancing the model’s predictive capabilities [
21,
22]. During the model selection process, AutoML can automatically screen the most suitable machine learning models for a specific task, covering a wide range of classifiers, regressors, and clustering algorithms, and determine the optimal model through an automated model comparison and evaluation mechanism. Most importantly, in the hyperparameter tuning phase, AutoML employs advanced techniques, such as grid search, random search, Bayesian optimization, and genetic algorithms, to automatically search for the optimal combination of model parameters, maximizing the model’s performance. These techniques can efficiently explore different parameter space scales, avoiding the blindness and inefficiency of manual adjustments [
23,
24].
In summary, as forest fires become increasingly severe, the limitations of traditional machine learning models in forest fire prediction have become apparent. By automating and optimizing the stages of the machine learning process, AutoML promises to improve the accuracy and efficiency of forest fire prediction. This allows researchers and practitioners to develop more reliable and adaptable prediction systems and provides more effective decision support for wildfire management, prevention, and mitigation under climate change and other environmental pressures.
2. Materials
2.1. Research Significance
According to the statistical data on forest fires (https://gwis.jrc.ec.europa.eu/apps/country.profile/charts/ba (accessed on 7 September 2024)), from 2002 to 2023, China recorded 251,000 forest fire incidents, as shown in Figure 1. On average, there were approximately 11,400 fire occurrences each year, resulting in a deforested area close to 70.43 million hectares. Considering that China’s forest area is about 175 million hectares, accounting for 3.9% of the global total forest area and ranking fifth in the world, the incidence of forest fires in China constitutes 10% of the worldwide total. These figures highlight the threat of forest fires to the global economy and the immense destruction and loss they cause.
These statistics reveal the frequency and severity of forest fires in China and reflect the long-term impact of forest fires on the ecological environment, biodiversity, climate regulation, and socio-economic structures. Forest fires destroy forest resources and lead to substantial economic losses, including direct property losses and indirect environmental and financial losses. In addition, forest fires may adversely affect human health, cultural heritage, and regional security. Therefore, these data emphasize the urgency of strengthening forest fire prevention, monitoring, and management and the importance of developing practical forecasting tools and response strategies.
In this context, forest fires in the Guangxi Zhuang Autonomous Region are also of significant concern. Located in the subtropical zone, Guangxi has a warm and humid climate that is conducive to the growth of forest vegetation but also increases the fuel available to forest fires. The complexity of Guangxi’s terrain and its extensive mountainous areas make firefighting extraordinarily challenging and the resulting losses particularly severe.
According to statistical data, forest fires in Guangxi have been both frequent and extensive over the past few decades, as depicted in Figure 2. These fires destroy the region’s forest resources and profoundly impact the local ecological environment, biodiversity, climate regulation functions, and socio-economy. The frequent occurrence of forest fires leads to direct environmental and economic losses and threatens the safety of residents.
Therefore, effective prevention and control measures must be implemented in response to the forest fire situation in Guangxi. These include but are not limited to strengthening monitoring and early warning systems for forest fires, raising public awareness of forest fire prevention, optimizing emergency response mechanisms for fires, and utilizing modern technological means to improve the accuracy and efficiency of forest fire prediction. Through these comprehensive management strategies, the incidence of forest fires can be effectively reduced, the damage caused by fires can be mitigated, and the forest resources and ecological environment of Guangxi can be protected.
2.2. Study Area
The Guangxi region (108°18′–112°04′ E, 20°54′–26°23′ N) is located in the southwestern part of southern China. It is an autonomous region with diverse topography and a warm climate, as shown in
Figure 3.
This research selects the Guangxi region as its subject of study based on several key considerations:
Located in southwestern China, Guangxi features a diverse topography, a warm and humid climate, and a high forest coverage rate of approximately 60%, with forest area reaching about hectares. The main tree species include Masson pine, Chinese fir, eucalyptus, camphor tree, teak, and osmanthus. With such abundant forest resources, any forest fire occurrence could enormously impact the ecological environment and biodiversity.
Guangxi’s terrain is complex, with higher elevation in the east and lower elevation in the west. The northeastern region is predominantly mountainous, while the western and southern parts are relatively flat. The elevation ranges from several meters in coastal areas to 1979 m inland. This complex topography makes forest firefighting extremely challenging. During fire incidents, the flames can spread rapidly due to topographical influences, increasing the difficulty of fire control and leading to more severe fire damage.
Guangxi has a subtropical monsoon climate characterized by distinct seasons and abundant rainfall, though precipitation distribution is uneven, with alternating periods of high-temperature drought and heavy rain. During high-temperature dry seasons, vegetation becomes highly flammable, significantly increasing forest fire risks. During concentrated rainfall, floods and other disasters may occur, damaging forest ecosystems and indirectly affecting forest fire occurrence and development patterns. These climatic characteristics make forest fire patterns complex and variable, increasing the difficulty of prediction and prevention.
Guangxi’s unique and challenging forest fire patterns urgently require targeted research and effective preventive measures. AutoML technology can process complex data relationships by automatically analyzing multi-source data collected from the Guangxi region, including meteorological, topographical, and vegetation data, to uncover potential patterns and key factors related to forest fires, providing strong support for precise fire prediction. Its automated model construction and optimization process can quickly adapt to dynamic changes in Guangxi’s forest fire data, and prediction models can be adjusted promptly to improve their timeliness and accuracy.
Given China’s vast territory and significant regional differences in forest resources, topography, and climate conditions, AutoML technology’s flexibility and adaptability enable it to adjust model parameters and algorithms according to regional characteristics quickly, building suitable forest fire prediction models for local conditions. For example, in northern arid regions, the focus can be on analyzing the relationship between climate drought factors and forest fires; in southern humid mountainous areas, careful consideration can be given to the impact of topography and vegetation factors on fires. Through application and optimization in different regions, AutoML technology shows promise in providing comprehensive and efficient technical support for China’s overall forest fire prevention and control work, promoting the advancement of forest resource protection and ecological security assurance levels.
2.3. Data
This study aimed to construct a forest fire occurrence prediction model based on climate, topography, and vegetation data, as shown in Table 1. Climate, topography, and vegetation variables were selected due to their significant influence on forest fires. Climate variables such as temperature, humidity, wind speed, and precipitation directly affect the ignition and spread of forest fires. Temperature and humidity influence the dryness of fuel, with higher temperatures and lower humidity increasing the likelihood of ignition. Wind speed can accelerate the spread of fires, while precipitation can reduce fire risk by moistening the fuel. Topography variables such as elevation, slope, and aspect index are crucial to fire behaviour. Steeper slopes cause fire to spread faster uphill, and different aspects may affect sunlight exposure and wind patterns, thereby influencing fire risk. Vegetation variables, especially NDVI, are closely related to fuel availability and type. Dense vegetation with high NDVI values can fuel fires, and different vegetation types have different combustion characteristics. By considering these variables, we can better understand the complex mechanisms underlying forest fire occurrence and develop more accurate prediction models. This selection allows for a comprehensive assessment of the factors contributing to forest fire risk. For this purpose, we collected data from eight weather stations. Using a proximity analysis tool, we identified the nearest weather station for each fire and non-fire site. Then, based on the weather station’s location and the time of the fire, we extracted the corresponding meteorological data for model analysis and prediction.
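The nearest-station lookup described above can be sketched as follows. This is a minimal illustration that assumes great-circle (haversine) distance as the proximity measure; the proximity analysis tool used in the study may work differently, and the station names and coordinates below are hypothetical.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def nearest_station(site, stations):
    """Return the name of the station closest to site = (lat, lon)."""
    return min(stations, key=lambda name: haversine_km(*site, *stations[name]))


# Hypothetical station coordinates (illustrative only, not the study's stations)
stations = {
    "station_a": (22.8, 108.3),
    "station_b": (25.3, 110.3),
    "station_c": (24.3, 109.4),
}
fire_site = (25.0, 110.0)
closest = nearest_station(fire_site, stations)
print(closest)
```

Once each fire or non-fire site is matched to a station, the meteorological record for the relevant date can be looked up from that station's data.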
Geographic spatial information such as elevation, slope, aspect index, and Normalized Difference Vegetation Index (NDVI) is expressed as raster data. In this study, the Digital Elevation Model (DEM) data of Guangxi were extracted from the GDEMV2 30-m resolution digital elevation dataset, and slope and aspect were then derived from the DEM. The aspect data are in degrees (°) and range from 0° to 360°, while flat areas are represented by −1 in the data.
To ensure the quality and consistency of the dataset, we implemented a series of data preprocessing steps to handle missing values and outliers. We employed a combination of imputation methods for missing data, including mean substitution for numerical variables and mode substitution for categorical variables. Where appropriate, we also used predictive models to estimate missing values based on other variables in the dataset. For outliers, we used a combination of statistical methods and visual inspection. We applied the Interquartile Range (IQR) method to detect and remove outliers that could skew the analysis. Additionally, we utilized scatter plots and box plots to visually identify any data points that deviated significantly from the norm and decided on a case-by-case basis whether to adjust or remove these points.
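The two basic steps named above, mean imputation and IQR-based outlier removal, can be sketched as follows. This is an illustrative sketch only: the 1.5 × IQR fence multiplier is the conventional default and is assumed here, and the wind-speed values are hypothetical.

```python
import statistics


def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = statistics.fmean(observed)
    return [mean if v is None else v for v in values]


def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def remove_outliers(values, k=1.5):
    """Keep only the values that lie inside the IQR fences."""
    lower, upper = iqr_bounds(values, k)
    return [v for v in values if lower <= v <= upper]


# Hypothetical daily mean wind speeds (m/s) with one gap and one extreme value
wind_speed = [2.1, 1.8, None, 2.5, 2.0, 9.0, 2.3, 1.9]
filled = impute_mean(wind_speed)
cleaned = remove_outliers(filled)
print(cleaned)
```

Note that imputing before outlier removal lets an extreme value pull the imputed mean upward; reversing the order is a reasonable alternative when gaps are frequent.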
Given that the original aspect data format is not directly suitable for describing their correlation with the probability of forest fires, this study uses a grid calculation method to convert the raster data of an aspect according to Formula (
1) into an aspect index. The resulting aspect index, which also exists in raster format, facilitates the subsequent construction and analysis of the fire prediction model. This conversion method enhances the applicability and interpretability of the data in forest fire risk assessment. The specific content of Formula (
1) is as follows:
where
represents the aspect angle in the original aspect data (in degrees); the above formula transforms the original aspect angle into a value between −1 and 1. This value can more effectively capture the influence of the aspect on receiving solar radiation, thereby establishing a more direct correlation with the probability of forest fires. Solar radiation significantly affects fire risk by impacting fuel moisture levels and vegetation dryness. Increased solar radiation exposure, especially on south-facing slopes in the northern hemisphere, can accelerate vegetation drying, making it more susceptible to ignition. This effect is amplified in areas with high aspect index values, which receive more direct sunlight and thus retain less moisture, contributing to the likelihood and intensity of wildfires.
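For illustration, one common way to map an aspect angle onto the interval from −1 to 1 is a cosine transform. The sketch below is an assumption and is not necessarily the study's Formula (1): it uses the negative cosine so that south-facing cells (180°) score +1 and north-facing cells score −1, consistent with the description that high index values receive more sunlight. It also assumes the aspect is measured clockwise from north and treats flat cells (coded −1) separately.

```python
import math

FLAT = -1  # the aspect raster marks flat cells with -1


def aspect_index(aspect_deg):
    """Map an aspect angle in degrees (clockwise from north) to [-1, 1].

    ASSUMPTION: negative-cosine transform, so south (180) -> +1 and
    north (0 or 360) -> -1. Flat cells return None so that the caller
    can decide how to handle them.
    """
    if aspect_deg == FLAT:
        return None
    return -math.cos(math.radians(aspect_deg))


for angle in (0, 90, 180, 270, FLAT):
    print(angle, aspect_index(angle))
```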
The processing and analysis of these geospatial data aim to deeply understand the potential risk factors of forest fires in the Guangxi region, especially under the premise of evaluating the impact of climate change. This study aims to explore how climate change affects the risk of forest fires in the Guangxi region. To establish the basic pattern of climate change, we first use time series graphs to examine the changes in four vital meteorological variables: daily precipitation, daily average wind speed, daily average temperature, and daily average relative humidity. Time series analysis of these variables reveals their fluctuation characteristics, trends, and possible seasonal or periodic patterns. Specifically, the daily average temperature shows significant seasonal changes, with high temperatures in summer and low temperatures in winter. Although the relative humidity fluctuates in the annual distribution, its variation range is less pronounced than temperature and precipitation. The fluctuations of daily rainfall and average wind speed are relatively large, as shown in
Figure 4. These comprehensive charts show the changing trends of daily precipitation, daily average wind speed, daily average temperature, and daily average relative humidity in the Guangxi region from 1994 to 2023. By smoothing the data, we can more accurately identify each variable’s long-term trends and seasonal changes, which is crucial for mastering and predicting the meteorological conditions in the Guangxi region and formulating effective forest fire prevention strategies.
In addition to examining the trends and seasonal characteristics of each meteorological time series, this study employed the Pearson correlation coefficient to investigate the interrelationships among the meteorological variables, as depicted in Figure 5. The analysis revealed a generally weak linear correlation among the four meteorological parameters. Because a weak linear correlation does not rule out nonlinear dependence, this finding suggests that nonlinear machine learning models may be more appropriate for predicting forest fire danger levels in the present study. Nonlinear models can more effectively capture the complex interactions and nonlinear relationships between variables, potentially enhancing the accuracy and reliability of forest fire prediction. By utilizing these models, we can gain a deeper understanding of the intricate connections between meteorological conditions and forest fire risk, thereby providing more precise predictive tools for the prevention and management of forest fires.
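The Pearson coefficient measures only linear association, which is why a weak coefficient motivates nonlinear models rather than ruling out a relationship. The sketch below computes the coefficient from its definition and applies it to synthetic values (not the study's data) in which one variable depends perfectly, but nonlinearly, on the other.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


# Synthetic illustration: y_quadratic is fully determined by x, yet r is zero
x = [-2, -1, 0, 1, 2]
y_linear = [2 * v + 1 for v in x]
y_quadratic = [v ** 2 for v in x]

r_linear = pearson_r(x, y_linear)
r_quadratic = pearson_r(x, y_quadratic)
print(r_linear, r_quadratic)
```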
3. Methods
3.1. AutoML Technology Principles
AutoML is a methodology that integrates machine learning with automation techniques, aiming to streamline the various stages of the machine learning process. Its primary objective is to enhance the speed and efficiency of developing machine learning models by automating data preprocessing, feature engineering, model selection, and hyperparameter tuning. AutoML technology enables non-expert data scientists and machine learning practitioners to quickly leverage these techniques to solve real-world problems without delving into underlying algorithms’ complexities and technical details. Furthermore, AutoML can comprehensively explore the model and parameter space to discover more optimized model configurations and parameter combinations, thereby improving the overall performance of machine learning models.
Figure 6 depicts the standard workflow of an AutoML tool.
AutoML technology typically encompasses the following main components:
Data Preprocessing:
AutoML automatically addresses missing values, outliers, and duplicates in raw data. It carries out tasks like data cleaning, feature scaling, and feature encoding to guarantee the quality and consistency of the dataset.
Feature Engineering:
AutoML automatically selects and constructs features to extract valuable insights from the data. Depending on the characteristics and types of the data, it generates statistical features, time series features, text features, and more while also performing feature selection and dimensionality reduction.
Model Selection:
AutoML automatically selects machine learning models suitable for specific tasks. Based on the features of the data and the prediction targets, it picks appropriate classifiers, regressors, clustering algorithms, etc., and conducts model comparison and evaluation.
Hyperparameter Tuning:
AutoML automatically searches for the optimal combination of model parameters to improve model performance. Here are some commonly used techniques for AutoML hyperparameter tuning:
Grid Search:
Conducts an exhaustive search within a predefined parameter grid to find the best parameter combination. Each parameter combination is trained and evaluated to select the best performance. The advantage of grid search is its simplicity and intuitiveness, but it can be computationally expensive when dealing with large parameter spaces.
Random Search:
Randomly selects parameter combinations from a given distribution for evaluation. Random search is more efficient in exploring the parameter space than grid search, especially in large-scale parameter spaces.
Bayesian Optimization:
A hyperparameter tuning method based on probabilistic modeling that constructs a surrogate model to estimate the underlying surface of the objective function. It uses Bayesian inference to select the parameter combinations most likely to improve performance. Bayesian optimization is particularly effective when each evaluation is expensive and the computational budget is small relative to the size of the parameter space.
Genetic Algorithms:
A hyperparameter tuning method based on the principles of biological evolution that simulates evolutionary processes such as natural selection, crossover, and mutation to search for the optimal parameter combination. Genetic algorithms are suitable for complex parameter spaces and multimodal optimization problems.
Automated Hyperparameter Optimization:
A framework that fully integrates multiple tuning methods to automate the parameter tuning process. It can combine methods like Bayesian optimization to select parameter combinations and use grid or random search to refine parameter adjustments.
In practical applications, the most suitable parameter-tuning method can be chosen based on experimental results and resource constraints. Many AutoML frameworks have integrated these methods, making the parameter-tuning process more convenient and efficient.
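The difference between grid search and random search can be made concrete with a small sketch. The objective below is a hypothetical stand-in for a validation score, and the parameter names and ranges are illustrative assumptions; in practice each evaluation would train and validate a model.

```python
import itertools
import random


def validation_score(learning_rate, depth):
    """Hypothetical stand-in for a validation score; peaks at lr=0.1, depth=6."""
    return 1.0 - (learning_rate - 0.1) ** 2 - 0.01 * (depth - 6) ** 2


def grid_search(grid):
    """Evaluate every combination in a predefined grid and keep the best."""
    names = list(grid)
    best = None
    for combo in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, combo))
        score = validation_score(**params)
        if best is None or score > best[0]:
            best = (score, params)
    return best


def random_search(n_trials, seed=0):
    """Sample combinations from distributions and keep the best."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        params = {
            "learning_rate": 10 ** rng.uniform(-3, 0),  # log-uniform in [0.001, 1]
            "depth": rng.randint(2, 10),
        }
        score = validation_score(**params)
        if best is None or score > best[0]:
            best = (score, params)
    return best


grid = {"learning_rate": [0.01, 0.1, 1.0], "depth": [2, 6, 10]}
grid_best = grid_search(grid)            # 9 evaluations on fixed points
random_best = random_search(n_trials=9)  # 9 evaluations on sampled points
print(grid_best, random_best)
```

Both searches use the same budget of nine evaluations; the grid can only test the values it was given, whereas random search probes a different value of each parameter on every trial.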
3.2. AutoGluon Model
With the continuous advancement of machine learning technologies, the complexity of algorithms is also increasing, making it more challenging to integrate the latest and most advanced machine learning methods into the modeling process. To address this challenge, this study adopted an AutoML strategy based on the AutoGluon framework for predicting forest fire danger levels.
AutoGluon [25,26], introduced in 2020, is an open-source AutoML toolkit. Unlike traditional AutoML approaches that focus on algorithm selection and hyperparameter optimization, AutoGluon places greater emphasis on raw data processing and multi-layer model ensembling. It is highly automated, executing tasks such as feature engineering, model selection, and hyperparameter tuning so that users can construct high-performance machine learning models rapidly. The toolkit supports various machine learning tasks, including classification and regression, and can integrate multiple outstanding machine learning algorithms. Furthermore, AutoGluon employs robust training strategies that enable it to achieve excellent performance swiftly. The processing steps of an AutoGluon model include the following key phases:
Data Preparation: Users need to prepare datasets for training and validation, including feature and label data;
Automatic Feature Engineering: AutoGluon automatically performs the feature engineering process, which includes data cleaning, feature selection, and feature transformation, and aims to enhance model performance;
Model Selection: AutoGluon automatically selects appropriate machine learning models for training based on dataset characteristics and task types, such as ensemble learning and deep learning models;
Hyperparameter Tuning: Based on the selected models, AutoGluon automatically performs hyperparameter tuning to find the optimal combination of model parameters;
Model Training: After feature engineering, model selection, and hyperparameter tuning, AutoGluon automatically trains the selected models to achieve the best performance;
Model Integration: AutoGluon employs a multi-layer model integration approach, combining multiple trained models further to enhance the model’s generalization ability and performance;
Model Evaluation and Deployment: AutoGluon evaluates the trained models and selects the best-performing model for deployment.
The fundamental concept of AutoGluon is to enhance the predictive performance of the final model by combining multiple models without the need for an extensive hyperparameter search. AutoGluon relies on stacking algorithms, K-fold cross-bagging algorithms, and multi-layer stacking techniques to improve predictive performance. Specifically, it uses stacking to train multiple independent models on the same dataset and then combines their outputs through a weighted linear model. Simultaneously, it employs K-fold cross-bagging to average the outputs of multiple similarly trained models, reducing the variance of the final predictions. Finally, the outputs of multiple models are combined with the data and processed using multi-layer stacking techniques, as shown in Figure 7. This approach excels due to its simplicity and robustness.
In a multi-layer stacking structure, the base machine learning models are positioned in the lower layers, and their outputs are combined with the original features to form the input features for the upper-layer models. Moreover, repeated K-fold bagging is used to enhance stacking performance and mitigate the risk of overfitting. K-fold bagging randomly splits the dataset into K disjoint subsets and trains K copies of each model, with each copy using K−1 of the subsets for training. Each copy then generates out-of-fold (OOF) predictions on its held-out subset. Higher-level models subsequently receive only these OOF predictions as the lower-level outputs (alongside the original features), never predictions made on data that a model was trained on. This improves predictive accuracy and limits overfitting by preventing training-label information from leaking into the inputs of the next layer.
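The out-of-fold mechanism can be sketched in a few lines. This is a simplified illustration rather than AutoGluon's implementation: the base model is a toy predictor that always returns the training-set mean, the data are synthetic, and only a single bagging repetition is shown.

```python
import random


def kfold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and split them into k disjoint folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_mean_model(y_train):
    """Toy base model: always predicts the mean of its training labels."""
    mean = sum(y_train) / len(y_train)
    return lambda _x: mean


def out_of_fold_predictions(X, y, k=5, seed=0):
    """Train k model copies and predict each sample with the copy that never saw it."""
    folds = kfold_indices(len(X), k, seed)
    oof = [None] * len(X)
    models = []
    for held_out in folds:
        held = set(held_out)
        train_y = [y[i] for i in range(len(X)) if i not in held]
        model = fit_mean_model(train_y)  # trained on the other k-1 folds
        models.append(model)
        for i in held_out:
            oof[i] = model(X[i])
    return oof, models


# Synthetic data: one feature per sample and a binary label
X = [[float(i)] for i in range(10)]
y = [0, 0, 1, 0, 1, 1, 0, 1, 1, 1]
oof, models = out_of_fold_predictions(X, y, k=5)

# Input for the next layer: original features plus the lower-layer OOF prediction
stacked_X = [features + [pred] for features, pred in zip(X, oof)]
print(stacked_X[0])
```

At prediction time on new data, the k copies in `models` would typically be averaged, which is the variance-reducing "bagging" half of the procedure.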
AutoML strategies often integrate transfer learning and reinforcement learning to automate the training of predictive models. Transfer learning focuses on leveraging knowledge or models from previous tasks to enhance learning and performance on related tasks. Conversely, reinforcement learning concentrates on automating the decision-making process in AutoML through trial and error to maximize cumulative rewards. These two techniques enhance machine learning models’ efficiency, generalization ability, and performance in distinct ways, thus driving progress in AutoML.
In this study, to simplify the algorithm selection and hyperparameter optimization process, the AutoML model directly reuses all base machine learning models as components of the stacking model and employs the same hyperparameter values for all models. Additionally, the structure of the AutoML model is automatically adjusted based on the given or default search time. The combination and application of these models allow AutoML to effectively handle complex predictive tasks while maintaining a high degree of automation and flexibility.
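A workflow of this kind might look as follows with AutoGluon's tabular API. This is a hedged sketch, not the study's actual script: the label column name `fire`, the F1 evaluation metric, and the 600-second time limit are illustrative assumptions, and the input tables are expected to be pandas DataFrames holding the climate, topography, and vegetation features.

```python
def train_fire_predictor(train_df, test_df, label="fire", time_limit=600):
    """Fit an AutoGluon tabular predictor and return it with its leaderboard.

    ASSUMPTIONS: `label` is a hypothetical name for the binary fire/non-fire
    column, and `time_limit` (seconds) is an illustrative search budget.
    """
    # Imported lazily so the sketch can be loaded without AutoGluon installed.
    from autogluon.tabular import TabularPredictor

    predictor = TabularPredictor(label=label, eval_metric="f1").fit(
        train_data=train_df,
        time_limit=time_limit,
    )
    # One row per trained model (including the weighted ensemble),
    # scored on the held-out test data.
    leaderboard = predictor.leaderboard(test_df)
    return predictor, leaderboard
```

With default settings, the fitted models and the final weighted ensemble appear as rows of the returned leaderboard, which is the kind of per-model comparison reported in the results tables.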
3.3. Evaluation Metrics
Model evaluation is a crucial step in the machine learning process aimed at assessing the performance of a model on unseen data. When it comes to the classification task of predicting forest fire danger levels, model evaluation encompasses the following four outcomes:
True Positive (TP): Instances that are hazardous and are correctly identified as hazardous by the model.
True Negative (TN): Instances that are actually non-hazardous and are correctly identified as non-hazardous by the model.
False Positive (FP): Instances that are non-hazardous but are incorrectly identified as hazardous by the model.
False Negative (FN): Instances that are actually hazardous but are incorrectly identified as non-hazardous by the model.
This study evaluates the proposed system using the following metrics: accuracy, precision, recall, and F1-score. The range for all these metrics is from 0 to 1. Accuracy is the proportion of correctly predicted instances out of the total number of instances, calculated as follows:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
Precision reflects the proportion of actual positive instances among those predicted as positive by the classifier, and its mathematical expression is the following:
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
Recall measures the proportion of all actual positive instances that the classifier can identify, calculated as follows:
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
The F1-score, also known as the F-score, is the harmonic mean of precision and recall, calculated as follows:
$$F_1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
These metrics collectively provide a comprehensive assessment of model performance. Accuracy indicates overall prediction correctness, precision and recall reflect the model’s ability to identify positive instances from different perspectives, and the F1-score offers a balanced metric considering both precision and recall.
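The four metrics follow directly from the TP, TN, FP, and FN counts defined above. The sketch below uses their standard definitions; the counts passed in are hypothetical and are not results from this study.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    # Guard against empty denominators when a class is never predicted or present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts for illustration only
metrics = classification_metrics(tp=80, tn=90, fp=10, fn=20)
print(metrics)
```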
4. Results
4.1. Comparison of Model Predictions
To comprehensively evaluate the performance of forest fire prediction models, we conducted comparative experiments using the AutoGluon framework alongside a suite of traditional machine learning models. The results are presented in
Table 2 and
Table 3.
Under the AutoGluon framework, the KNeighborsDist classifier performed exceptionally, achieving accuracy, precision, recall, and an F1-score of 0.960. This ranked first among all models and significantly surpassed the best performance of traditional machine learning models, including SVM, BP, and logistic regression, which scored 0.834. This indicates that specific models within the AutoGluon framework can achieve higher prediction accuracy, which is crucial for critical tasks such as forest fire prediction.
Additionally, the performance of the RandomForestGini, RandomForestEntr, ExtraTreesGini, and ExtraTreesEntr classifiers within the AutoGluon framework was very close to that of KNeighborsDist, with metrics at 0.965 or 0.966. This suggests that the ensemble tree models within the AutoGluon framework exhibit high accuracy and stability in forest fire prediction. In contrast, while random forest and AdaBoost among the traditional machine learning models performed decently, their F1-scores were 0.713 and 0.714, respectively, lower than those of similar models within the AutoGluon framework.
Regarding model generalization, KNeighborsDist and the tree-based ensemble models within the AutoGluon framework outperformed most traditional machine learning models in precision and F1-score. There are exceptions, however: the XGBoost model in AutoGluon had an F1-score of 0.580, whereas the XGBoost model in traditional machine learning had an F1-score of 0.717, indicating the superiority of the conventional machine learning model on this dataset. Nevertheless, the strength of the AutoGluon framework lies in its automated model search and stacking techniques, which can quickly find the best-performing model combination on different datasets.
A key advantage of the AutoGluon framework is its automation. It can automatically handle data preprocessing, feature engineering, model selection, and hyperparameter optimization, significantly reducing the time and complexity of model development. This is reflected in the WeightedEnsemble_L2 model in Table 2, which, although not the best performer individually, improves overall performance by combining the predictions of multiple models, an example of AutoGluon’s automated model stacking.
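To illustrate the idea of combining the predictions of multiple models, the sketch below averages per-model class probabilities with fixed weights and picks the most probable danger level. It is not AutoGluon's implementation: the function names, model names, weights, and probability vectors are all hypothetical, and in a real stacked ensemble the weights would be learned from validation data rather than set by hand.

```python
def weighted_ensemble(prob_by_model, weights):
    """Weighted average of per-model class-probability vectors.

    prob_by_model: dict mapping a model name to a list of class probabilities.
    weights: dict mapping the same model names to non-negative weights.
    """
    total = sum(weights[name] for name in prob_by_model)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")

    n_classes = len(next(iter(prob_by_model.values())))
    combined = [0.0] * n_classes
    for name, probs in prob_by_model.items():
        if len(probs) != n_classes:
            raise ValueError(f"{name} has a different number of classes")
        share = weights[name] / total  # normalise so the shares sum to 1
        for k, p in enumerate(probs):
            combined[k] += share * p
    return combined


def predict_level(combined):
    """Map the most probable class index (0..4) to a danger level (1..5)."""
    best_index = max(range(len(combined)), key=combined.__getitem__)
    return best_index + 1


if __name__ == "__main__":
    # Hypothetical probabilities over the five danger levels for one day.
    probs = {
        "model_a": [0.1, 0.2, 0.5, 0.1, 0.1],
        "model_b": [0.0, 0.1, 0.3, 0.5, 0.1],
        "model_c": [0.1, 0.1, 0.2, 0.5, 0.1],
    }
    # Hypothetical, hand-picked weights.
    weights = {"model_a": 0.5, "model_b": 0.3, "model_c": 0.2}

    combined = weighted_ensemble(probs, weights)
    print("combined probabilities:", [round(p, 3) for p in combined])
    print("predicted danger level:", predict_level(combined))
```

In this toy case two of the three models favor level 4, yet the ensemble selects level 3 because the most heavily weighted model is confident in it. This shows how an ensemble's output can differ from a simple majority of its members.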
By comparing the data in Table 2 and Table 3, we can see the advantages of the AutoGluon framework in forest fire prediction. AutoGluon provides high-precision prediction models such as KNeighborsDist and RandomForestGini and simplifies the model construction and optimization process through an automated workflow. This automation reduces manual intervention and is crucial for responding quickly to emergencies such as forest fires.
To visually demonstrate the prediction accuracy, the predicted forest fire danger levels were plotted against the actual levels, as shown in Figure 8. This figure compares actual risk with predicted risk for the different models. Each subplot represents a specific model, with the horizontal axis indicating the period (from 1 January 1994 to 31 December 2023) and the vertical axis representing the range of risk values (1 to 5), which correspond to low, moderate, high, very high, and extremely high forest fire danger levels, respectively.
This figure shows that certain models, such as XGBoost and LightGBMXL, have higher prediction accuracy. In contrast, the prediction capabilities of other models, such as the two versions of the KNN algorithm (KNeighborsDist and KNeighborsUnif), need improvement. Additionally, some models perform better at certain time points, which may suggest better adaptability to data from specific periods. Overall, these results highlight the potential of machine learning techniques for forest fire prediction and the room for further optimization and improvement.
4.2. Model Error Analysis
After evaluating each model’s prediction performance, we further analyzed its error characteristics to better understand its performance in practical applications. The following is an analysis of each model’s error, including the mean error and the number of outliers.
As shown in Figure 9, the WeightedEnsemble_L2 model has a small mean error of 0.3838 and a relatively low number of outliers (31). This indicates that the WeightedEnsemble_L2 model performs well in prediction accuracy and error control. The KNeighborsDist model also performs well, with a mean error of 0.3628 and 16 outliers, further confirming its stability in the forest fire prediction task.
In contrast, tree-based ensemble models such as RandomForestGini, RandomForestEntr, ExtraTreesGini, and ExtraTreesEntr have larger mean errors and more outliers, suggesting that these models may suffer from overfitting to some extent or have insufficient capacity to handle outliers during the prediction process.
Notably, the deep learning models NeuralNetTorch and NeuralNetFastAI have the largest mean errors and the highest number of outliers, which could be due to their susceptibility to noisy data and outliers during the training process.
Overall, models under the AutoGluon framework perform better regarding error control, especially when handling outliers. This may be attributed to using various techniques during the model training process in AutoGluon to enhance model robustness. These analytical results provide significant guidance for understanding the performance of different models in forest fire prediction tasks and for selecting the most appropriate model.
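The two error statistics discussed in this section can be sketched as follows. The paper does not state how the mean error or the outliers were defined, so this example assumes the mean absolute difference between predicted and actual danger levels and flags outliers with the common box-plot rule (errors above Q3 + 1.5 × IQR). The function name and the sample data are hypothetical.

```python
import statistics


def error_summary(actual, predicted, whisker=1.5):
    """Mean absolute error and outlier count for predicted danger levels.

    Assumptions (not taken from the paper): errors are absolute differences,
    and an outlier is an error above Q3 + whisker * IQR.
    """
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if len(actual) < 2:
        raise ValueError("at least two observations are required")

    errors = [abs(p - a) for a, p in zip(actual, predicted)]

    # Quartiles of the error distribution (inclusive method).
    q1, _, q3 = statistics.quantiles(errors, n=4, method="inclusive")
    upper_fence = q3 + whisker * (q3 - q1)

    # Absolute errors are non-negative, so only the upper fence is relevant.
    n_outliers = sum(1 for e in errors if e > upper_fence)

    return {
        "mean_error": statistics.fmean(errors),
        "n_outliers": n_outliers,
    }


if __name__ == "__main__":
    # Hypothetical danger levels (1 to 5) for ten days.
    actual = [1, 2, 3, 4, 5, 3, 2, 1, 4, 5]
    predicted = [1, 2, 3, 4, 5, 3, 2, 1, 4, 1]
    print(error_summary(actual, predicted))
```

Because danger levels are small integers, the interquartile range of the errors is often zero. Under this rule any non-zero error may then count as an outlier, so the outlier counts depend strongly on the definition chosen.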
5. Discussion
The experiments confirm that the AutoML-based forest fire prediction framework constructed in this study is effective in predicting forest fire risk levels. Compared with traditional machine learning models, models under the AutoGluon framework show higher prediction accuracy, especially the KNeighborsDist classifier. This result indicates that AutoML technology has significant advantages in processing complex data such as forest fire data, which are affected by multiple factors.
During the model comparison, we found that the AutoGluon framework not only provides high-precision prediction models such as KNeighborsDist and RandomForestGini but also dramatically simplifies the model construction and optimization process through its automated workflow. This automation is particularly critical in forest fire prediction scenarios: it enables a quick response to emergencies, reduces the uncertainty caused by manual intervention, and improves prediction efficiency. We also noticed that different models perform differently in different aspects. For example, XGBoost among the traditional machine learning models has a higher F1-score on this dataset than XGBoost in the AutoGluon framework. Nevertheless, the advantage of the AutoGluon framework lies in its powerful automated model search and stacking technology, which can swiftly yield optimal model combinations across different datasets.
From the perspective of model error analysis, the WeightedEnsemble_L2 model performs well in prediction accuracy and error control, and the KNeighborsDist model also has good stability. Tree-based ensemble models like RandomForestGini may have overfitting problems, while deep learning models are more susceptible to noisy data and outliers. This prompts us to optimize the model further and improve its robustness in future research, especially its ability to handle outliers.
In addition, this study found differences in the models’ prediction performance at different time points, which may be attributed to seasonal changes in climate conditions, vegetation growth cycles, and other factors. Therefore, further research on how to make models adapt better to these dynamic changes will help improve the accuracy and reliability of predictions.
Finally, AutoML has great potential for application in other regions of China. Its automation and efficiency enable it to adapt to the data characteristics and prediction needs of different regions. In other regions with rich forest resources and high fire risk, such as the northeastern forest area and the southwestern mountainous area, AutoML can draw on the experience of its application in Guangxi to construct a forest fire prediction model suited to the region by integrating local data resources such as meteorological stations and geographic information systems. Its automated data processing and model optimization capabilities can improve prediction efficiency and provide a decision-making basis for local forest fire prevention and management. Meanwhile, with the continuous development of the technology and the improvement of data sharing mechanisms, AutoML can further integrate forest fire-related data nationwide, achieve cross-regional model training and optimization, improve the overall level of forest fire prediction nationwide, and play a more significant role in protecting China’s forest resources and ecological security.
6. Conclusions
This study addresses the challenges faced by traditional machine learning in forest fire prediction by applying AutoML technology to build a prediction framework. The results show that the models built under the AutoGluon framework perform well: the KNeighborsDist classifier achieved an accuracy of 0.960 and also performed well in precision, recall, and F1-score, significantly outperforming traditional machine learning models. This demonstrates the advantages of AutoML technology in handling complex forest fire data affected by multiple factors, supporting more accurate forest fire prediction and offering auxiliary value for forest fire management decision-making.
The AutoML automation features greatly simplify the model building and optimization process, covering data preprocessing, feature engineering, model selection, and hyperparameter tuning. This effectively reduces human intervention and improves model development efficiency and reliability. For example, the WeightedEnsemble_L2 model demonstrates AutoGluon’s powerful ability in model stacking by automatically integrating the prediction results of multiple models to improve overall performance.
However, the model error analysis also reveals problems with some models, such as tree-based ensemble models, which may have an overfitting tendency, and deep learning models, which are susceptible to noisy data and outliers. This points to the direction for subsequent research, where we will focus on optimizing model structures and training algorithms to enhance the robustness of the models to outliers and further improve prediction accuracy.
Based on this research, we will explore multiple directions in the future. Regarding model optimization, we will combine regularization techniques (such as L1 and L2 regularization) with cross-validation techniques (such as K-fold cross-validation) to control model complexity and prevent overfitting, thereby improving model stability and generalization ability. At the same time, we will further optimize the hyperparameter tuning strategy, combining the advantages of Bayesian optimization and random search and developing customized optimization solutions for different model types to fully exploit the potential of the models. To enhance the model’s adaptability to dynamic changes, we will leverage Internet of Things (IoT) and satellite remote sensing technologies so that the model can obtain real-time meteorological, topographic, and vegetation data and adjust its predictions promptly, supporting real-time early warning and the prevention and control of forest fires. In addition, we will explore model integration and fusion technologies in depth, integrating AutoML-based models with traditional physical models to make full use of their respective advantages. For example, the basic fire-spread rules provided by physical models can serve as constraints that guide the training and prediction of machine learning models, improving the models’ physical plausibility and prediction accuracy. We will also develop multi-model fusion strategies, such as weighted average fusion, to integrate the predictions of multiple models and obtain more robust and accurate forest fire risk predictions.
Finally, we will focus on improving the forest fire risk assessment and decision-making support system. We will build a multi-indicator comprehensive evaluation model that considers various factors, such as the likelihood of fire occurrence, potential losses, impact range, and firefighting difficulty, to provide a more comprehensive and scientific basis for forest fire management decisions. Based on the prediction results of the model, we will develop an intelligent decision-making support system that connects fire risk prediction directly to decision-making, automatically generating prevention and control recommendations, including fire warning releases, firefighting resource allocation plans, and evacuation plans, to improve the efficiency and scientific rigor of forest fire prevention and control work.
In summary, this study has significant implications for forest fire management. Precise prediction models can enable proactive planning, effective resource allocation, and mitigation of forest fire losses, thereby protecting ecosystems and human well-being. Technological advancements may expand the application scope of these models, enabling their integration with real-time systems for dynamic prediction and early warning.