Hyperspectral Inversion of Soil Cu Content in Agricultural Land Based on Continuous Wavelet Transform and Stacking Ensemble Learning

Yang, Kai; Wu, Fan; Guo, Hongxu; Chen, Dongbin; Deng, Yirong; Huang, Zaoquan; Han, Cunliang; Chen, Zhiliang; Xiao, Rongbo; Chen, Pengcheng

doi:10.3390/land13111810


by Kai Yang ¹, Fan Wu ¹, Hongxu Guo ¹,*, Dongbin Chen ¹, Yirong Deng ²,*, Zaoquan Huang ², Cunliang Han ², Zhiliang Chen ³, Rongbo Xiao ⁴ and Pengcheng Chen ⁴

¹ School of Architecture and Urban Planning, Guangdong University of Technology, Guangzhou 510090, China

² Guangdong Provincial Academy of Environmental Sciences, Guangzhou 510045, China

³ South China Institute of Environmental Sciences, Ministry of Ecology and Environment, Guangzhou 510535, China

⁴ School of Environmental Science and Engineering, Guangdong University of Technology, Guangzhou 510006, China

* Authors to whom correspondence should be addressed.

Land 2024, 13(11), 1810; https://doi.org/10.3390/land13111810

Submission received: 10 September 2024 / Revised: 24 October 2024 / Accepted: 30 October 2024 / Published: 1 November 2024


Abstract

Heavy metal pollution in agricultural land poses significant threats to both the ecological environment and human health. Therefore, the rapid and accurate prediction of heavy metal content in agricultural soil is crucial for environmental protection and soil remediation. Acknowledging the limitations of traditional single linear or nonlinear machine learning models in terms of prediction accuracy, this study developed an ensemble learning model that integrates multiple linear or nonlinear learning models with a random forest (RF) model to improve both the prediction accuracy and reliability. In this study, we selected a typical copper (Cu) polluted area in the Pearl River Delta of Guangdong Province as the research site and collected Cu content data and indoor soil reflectance spectral data from 269 surface soil samples. First, the soil spectral data were preprocessed using Savitzky–Golay (SG) smoothing, multiplicative scattering correction (MSC), and continuous wavelet transform (CWT) to reduce noise interference. Next, principal components analysis (PCA) was employed to reduce the dimensionality of the preprocessed spectral data, eliminating redundant features and lowering the computational complexity. Finally, based on the dimensionality-reduced data and Cu content, we established a stacked ensemble learning model, where the base models included SVR, PLSR, BPNN, and XGBoost, with RF serving as the meta-model to estimate the soil heavy metal content. To evaluate the performance of the stacking model, we compared its prediction accuracy with that of individual models. The results indicate that, compared to the traditional machine learning models, the prediction accuracy of the stacking model was superior (R² = 0.77; RMSE = 7.65 mg/kg; RPD = 2.29). This suggests that the integrated algorithm demonstrates a greater robustness and generalization capability. 
This study presents a method to improve soil heavy metal content estimation using hyperspectral technology, ensuring a robust model that supports policymakers in making informed decisions about land use, agriculture, and environmental protection.

Keywords: soil; heavy metal; hyperspectral; continuous wavelet transform; stacking model

1. Introduction

Heavy metal contamination in soil is a pressing global environmental issue with significant implications for ecosystems and human health [1]. In the Pearl River Delta, copper (Cu) pollution is particularly concerning due to the region’s high concentration of heavy industries, including electronics manufacturing, metal processing, chemical production, and electroplating. The poor management of copper-containing wastewater and emissions from these industries can easily lead to soil contamination. Additionally, the long-term application of copper-based pesticides and fertilizers contributes to Cu accumulation in the soil. Elevated levels of Cu can enter the human body through the food chain, leading to health problems such as anemia, liver damage, and kidney injury [2,3]. Therefore, the accurate and rapid monitoring of Cu levels in soil is essential. Traditional monitoring techniques are often time-consuming and labor-intensive. These methods may face limitations related to the selection of sampling points and their geographical distribution, which can hinder the accurate representation of the spatial distribution of Cu in soil. Hyperspectral remote sensing technology, a non-contact, rapid, and efficient approach for obtaining detailed soil spectral data, offers a novel solution to this challenge [4].

The principle underlying hyperspectral inversion for estimating soil Cu content is that the Cu concentration in the soil affects the characteristics of its reflectance spectrum, especially in the visible and near-infrared wavelengths [5]. By collecting reflectance spectra from samples with varying Cu concentrations, an inversion model can be established to relate the spectral features to the Cu content. However, the application of hyperspectral inversion for assessing soil heavy metal content still faces two major challenges. Firstly, soil heavy metals are trace elements, which leads to weak and unstable spectral responses. This makes it challenging to extract sensitive information from the spectrum [6]. Secondly, the inherent spatial heterogeneity of soil can reduce the generalization ability and stability of the inversion model, compromising the reliability of the results [7]. Previous studies have indicated that effective spectral transformation methods can significantly improve the accuracy of heavy metal content retrieval. For example, techniques such as reciprocal transformation, the reciprocal logarithm, the first derivative, and the second derivative can help reduce noise interference and enhance spectral response characteristics [8,9]. However, further research has revealed that while conventional mathematical transformation methods can improve model prediction accuracy to some extent, they may also lead to the loss of important information about subtle changes in specific spectral characteristics, which can affect the outcomes of data analysis. Studies have shown that continuous wavelet transform (CWT) offers strong analytical capabilities in both time and frequency domains. CWT can enhance the local characteristic information of spectral signals, thereby improving the correlation with soil components. Consequently, inversion models developed using this method often achieve a higher prediction accuracy [10]. Research by Guo et al. and Wang et al. indicates that CWT has superior noise reduction capabilities compared to traditional data transformation methods, further enhancing the accuracy of the model inversion [11,12].

In addition, hyperspectral data are rich in information, but their highly correlated bands often lead to redundancy, complicating data processing and diminishing modeling efficiency. To tackle this issue, dimensionality reduction is crucial for simplifying the model and enhancing its robustness. Research indicates that principal component analysis (PCA) is an effective statistical technique for addressing the correlation between hyperspectral bands and eliminating redundant information [13]. By identifying principal components, PCA reduces the data’s dimensionality while preserving essential information, thus effectively mitigating the challenges of band correlation and information redundancy in hyperspectral data.

Inversion models are created by integrating characteristic spectral bands with soil pollutant concentration data. The careful selection of the model is crucial, as it significantly influences the precision and reliability of the inversion results. Currently, both linear machine learning methods, such as multiple linear regression (MLR) [14] and partial least squares regression (PLSR) [15], and nonlinear methods, including back propagation neural networks (BPNN) [16] and support vector machine regression (SVR) [17], are widely used in model development. However, the linear models often rely on linear assumptions, which may overlook complex nonlinear relationships, thus limiting the models’ accuracy and adaptability. The limited number of training samples significantly impacts the prediction accuracy of traditional nonlinear machine learning models, particularly regarding parameter selection. While satisfactory results are often achieved with training datasets, validation datasets tend to show a poorer performance [18,19]. To overcome this challenge, an ensemble learning method was utilized to estimate heavy metal content. Guo et al. found that the random forest (RF) model outperformed partial least squares regression (PLSR) for predicting Zn levels [12]. Similarly, Mao et al. demonstrated that extreme gradient boosting (XGBoost) performed better than both SVR and PLSR when predicting Zn, Pb, and Cd [6]. When training data are limited, ensemble learning methods generally show advantages over traditional machine learning approaches. Current techniques, such as bagging and boosting, primarily focus on combining similar machine-learning models by constructing training sets or using iterative training [20,21]. 
However, in cases where training data are scarce and only a single method is used, traditional machine learning or ensemble learning algorithms often struggle to find the optimal solution, resulting in a low prediction accuracy and limited robustness and generalization. To address this, model fusion algorithms are increasingly being developed and applied. For example, Lin et al. used a stacking strategy to integrate multiple models, which improved the accuracy of soil heavy metal content estimation under limited sample conditions [22]. Zou et al. compared the stacking model with four basic machine learning algorithms for inverting soil heavy metal content and found that the stacking model offered greater stability and accuracy [23]. The stacking ensemble model first trains several base models in the first layer and then uses their outputs as inputs to train a meta-model in the second layer, thereby enhancing the generalization accuracy of the predictions.

Therefore, this study focused on typical copper-polluted agricultural soil in the Pearl River Delta. We utilized CWT for spectral preprocessing to reduce spectral noise and applied PCA to lower the dimensionality. A stacking ensemble model was created using SVR, PLSR, BPNN, and XGBoost as base models, with RF as the meta-model to predict the Cu content. This ensemble model was then compared to the individual models to evaluate its effectiveness, enhance the accuracy of soil heavy metal content estimation, improve the robustness and generalization of the inversion model, and provide a solid framework for predicting soil heavy metal content.

2. Materials and Methods

2.1. Study Area

The study area is located in Foshan, Zhuhai, and Zhongshan in the Pearl River Delta of Guangdong Province, with a geographical range of 22.03° to 22.94° N and 113.01° to 113.62° E. The terrain is predominantly flat, with higher elevations in the center and surrounding low-lying areas. The average annual temperature is approximately 22.5 °C, and the annual rainfall typically ranges from 1800 to 2000 mm. This region belongs to the subtropical and marine monsoon climates and is marked by prevailing southeast and northeast winds. The primary soil type is red soil, which is generally acidic. With similar heavy metal concentrations, acidic soils present a higher risk of pollution compared to alkaline soil [24]. Since the introduction of the reform and opening-up policy, this area, recognized as a significant economic triangle in China, has experienced rapid industrial, agricultural, and aquacultural development, resulting in notable pollution challenges. Soil sampling was conducted in potentially contaminated areas based on the distribution of typical industrial enterprises and the prevailing wind direction (Figure 1).

2.2. Soil Sample Collection and Measurement

Soil sampling was conducted based on the spatial distribution of typical industrial enterprises in the study area, resulting in the collection of 269 surface soil samples (20 cm deep) during the fallow period from 15 June to 20 July 2022 around these enterprises (Figure 1). In the laboratory, the samples were air-dried, purified, and ground. Each sample was then split into two equal portions and stored in zip-lock bags for Cu content measurements and spectral measurements.

The soil samples were digested using the HNO₃-HCl-HF-HClO₄ heating digestion method, and the Cu content was analyzed by inductively coupled plasma mass spectrometry (ICP-MS). For the soil hyperspectral measurements, the ASD FieldSpec4 spectrometer (Analytical Spectral Devices Inc., Boulder, CO, USA) was employed. The spectral sampling interval was 1 nm, with a range from 350 nm to 2500 nm, covering a total of 2152 bands. Before conducting the reflectance spectral measurements, the instrument was preheated for 30 min and calibrated against a standard reference whiteboard to achieve a baseline close to a 100% reflectance. The measurements were carried out in a dark room using a 1000 W halogen lamp as the light source, with black velvet cloth placed beneath the soil samples. The light source was positioned at a 30° angle to the vertical and 30 cm away from the soil sample, while the spectrometer probe was also 30 cm above the sample. The soil samples were contained in black plastic dishes that were 5 cm high and 10 cm in diameter. This setup ensured that the measurement area fitted within the probe’s field of view (FOV), thus preventing mixed spectra. For each soil sample, 40 spectral curves were recorded, with 10 curves taken from each direction after rotating the sample three times by 90°. The final soil spectral reflectance data were calculated by averaging the 40 spectral curves.

2.3. Workflow

The workflow of this study is summarized as follows (Figure 2). (1) Data collection: We collected 269 soil samples from various locations around the typical industrial enterprises in the study area. These soil samples were then sent to the laboratory, where we measured both their reflectance spectra and their heavy metal content. (2) Preprocessing: Savitzky–Golay (SG) smoothing, multiplicative scatter correction (MSC), and CWT were employed to reduce noise in the spectral data and enhance its features. (3) Dimensionality reduction: The dimensionality of the data was reduced using PCA. (4) Model construction: A Cu content estimation model was developed based on stacked ensemble learning, and its performance was compared to that of the individual models to validate the effectiveness of this ensemble learning strategy for accurate Cu content estimation.

2.4. Spectral Preprocessing

During the collection of soil spectral data, noise may appear in the spectral reflectance curves due to instrument interference and environmental factors, and the spectral response of soil heavy metals tends to be weak [25]. To enhance the spectral response characteristics of heavy metals, it is essential to preprocess the data to reduce or eliminate noise. First, SG smoothing and MSC were applied to diminish the noise and highlight subtle differences in the spectral curves. Then, CWT was employed to emphasize the local details in the soil spectral data. The Gaussian function, whose shape resembles the absorption features of soil spectra, was selected as the wavelet-generating function. The spectral data were decomposed at scales of 2, 2², 2³, 2⁴, 2⁵, 2⁶, 2⁷, 2⁸, 2⁹, and 2¹⁰. For convenience, these scales are referred to as L1–L10, and the calculation formula is as follows:

W_f(α, τ) = \langle f, φ_{α, τ} \rangle = \int_{-\infty}^{+\infty} f(t) φ_{α, τ}(t) dt    (1)

where W_f(α, τ) is the wavelet transform coefficient, and the wavelet basis function φ_{α, τ}(t) is obtained from the wavelet-generating function φ through scaling and translation, as shown in the following equation:

φ_{α, τ}(t) = \frac{1}{\sqrt{α}} φ\left(\frac{t - τ}{α}\right)    (2)

where f(t) is the spectral reflectance of the soil, t is the spectral band, α is the scale factor, and τ is the translation factor.
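As an illustration of Equations (1) and (2), the following is a minimal NumPy sketch of a CWT evaluated at the dyadic scales 2¹ to 2¹⁰. It is not the authors' MATLAB implementation. The function names `mexican_hat` and `cwt_dyadic` are hypothetical, and the choice of the second derivative of a Gaussian as the wavelet is an assumption, since the text only states that a Gaussian function was used as the wavelet-generating function. The integral is approximated by a sum over bands with a spacing of one band.

```python
import numpy as np

def mexican_hat(x):
    """Second derivative of a Gaussian (up to sign and scaling).
    Assumed wavelet; the paper only says 'Gaussian function'."""
    return (1.0 - x**2) * np.exp(-0.5 * x**2)

def cwt_dyadic(reflectance, n_scales=10, wavelet=mexican_hat):
    """Return an (n_scales, n_bands) array of wavelet coefficients.

    Row k holds W_f(alpha, tau) for alpha = 2**(k + 1) and every
    translation tau on the band grid.
    """
    f = np.asarray(reflectance, dtype=float)
    t = np.arange(f.size)
    # diff[i, j] = t_j - tau_i for every band/translation pair
    diff = t[None, :] - t[:, None]
    coeffs = np.empty((n_scales, f.size))
    for level in range(1, n_scales + 1):
        alpha = 2.0 ** level
        # Eq. (2): scaled and translated wavelet basis
        basis = wavelet(diff / alpha) / np.sqrt(alpha)
        # Eq. (1): inner product of the spectrum with each basis function
        coeffs[level - 1] = basis @ f
    return coeffs

# Usage on a synthetic absorption dip centred at band 100
bands = np.arange(201)
spectrum = -np.exp(-0.5 * ((bands - 100) / 8.0) ** 2)
W = cwt_dyadic(spectrum, n_scales=5)
```

The dense matrix formulation keeps the correspondence with the equations explicit; for full 2152-band spectra an FFT-based convolution or a dedicated wavelet library would be faster.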

2.5. Dimensionality Reduction

The spectral bands collected by ASD instruments (Analytical Spectral Devices Inc., Boulder, CO, USA) often contain redundant information, which can be addressed through dimensionality reduction. To reduce the calculation cost, minimize the risk of overfitting, and enhance the model performance and interpretability, we employed the PCA algorithm [26]. PCA effectively retains the maximum variance while simplifying the data, improving computational efficiency and clarifying the internal structure and relationship within the dataset. To minimize information loss, we included all the principal components with an individual variance contribution greater than 1%.
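The retention rule described above (keep every principal component whose individual variance contribution exceeds 1%) can be sketched as follows. This is a minimal NumPy illustration, not the authors' MATLAB code; the function name `pca_reduce` and the SVD-based formulation are assumptions made for the example.

```python
import numpy as np

def pca_reduce(X, min_ratio=0.01):
    """Project samples onto the principal components whose individual
    explained-variance ratio exceeds min_ratio (1% by default)."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    Xc = X - mean                                  # centre each band
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    variance = s**2 / (X.shape[0] - 1)             # variance per component
    ratio = variance / variance.sum()              # explained-variance ratio
    keep = ratio > min_ratio
    scores = Xc @ Vt[keep].T                       # reduced features
    return scores, Vt[keep], mean, ratio

# Usage: 100 synthetic "spectra" driven by two latent factors
rng = np.random.default_rng(0)
latent = rng.normal(size=(100, 2))
loadings = np.zeros((2, 50))
loadings[0, :25] = 1.0
loadings[1, 25:] = 1.0
X = latent @ loadings + 1e-3 * rng.normal(size=(100, 50))
scores, components, mean, ratio = pca_reduce(X)
```

To avoid information leaking from the validation set, the mean and components would normally be estimated on the training samples only and then applied to the validation samples.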

2.6. Stacking Model Construction

The stacking method is an ensemble algorithm that integrates multiple base models through a meta-model [27]. In this study, it brings together learners from two ensemble families, boosting (XGBoost) and bagging (RF), alongside SVR, PLSR, and BPNN. It operates as a multi-layer learning system in which the base models run in parallel. As shown in Figure 3, the prediction results from SVR, PLSR, BPNN, and XGBoost are fed into the RF model to train the meta-model [28]. Compared to a single model, the stacking method enhances the robustness and generalization of the inversion model. If a base model makes an error, the meta-model can effectively correct it by leveraging the learning behavior of the other base models.

RF effectively enhances the model’s accuracy and generalization ability by constructing numerous decision trees from various subsets of samples and features [29]. It then calculates the average of all the tree predictions to generate the final output. Moreover, the RF algorithm captures nonlinear relationships within the data and automatically evaluates the contribution of each underlying model to optimize the prediction outcomes. This study developed a stacked RF model that incorporates SVR, PLSR, BPNN, and XGBoost as base learners. SVR relies on support vectors to identify the hyperplane, making it well-suited for nonlinear fitting problems [30]. PLSR is a statistical modeling technique primarily used to analyze highly correlated and multicollinear datasets while limiting overfitting [31]. A BPNN adjusts the weights and biases of its neurons through forward and backward propagation until the prediction error reaches a satisfactory level, demonstrating high self-learning capabilities and a broad applicability [17]. XGBoost enhances model performance by evaluating all the feature segmentation points and selecting the optimal segmentation tree based on prior predictions [32].
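The two-layer mechanism can be sketched in a few lines. The example below is a hedged, NumPy-only illustration of stacking, not the configuration used in this study: ridge regression and k-nearest-neighbour regression stand in for the four base learners, a linear model stands in for the RF meta-model, and the use of out-of-fold base predictions to train the meta-model is a common practice that is assumed here rather than taken from the paper. All function names are hypothetical.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge regression with an unpenalized intercept; returns a predict function."""
    Xb = np.column_stack([np.ones(len(X)), X])
    penalty = lam * np.eye(Xb.shape[1])
    penalty[0, 0] = 0.0                      # leave the intercept unpenalized
    w = np.linalg.solve(Xb.T @ Xb + penalty, Xb.T @ y)
    return lambda Z: np.column_stack([np.ones(len(Z)), Z]) @ w

def knn_fit(X, y, k):
    """k-nearest-neighbour regression; returns a predict function."""
    def predict(Z):
        dist = ((Z[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
        nearest = np.argsort(dist, axis=1)[:, :k]
        return y[nearest].mean(axis=1)
    return predict

def out_of_fold_predictions(fitters, X, y, n_folds=5, seed=0):
    """Layer 1: each base learner predicts only samples it was not trained on."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    oof = np.zeros((len(X), len(fitters)))
    for held_out in folds:
        train = np.setdiff1d(np.arange(len(X)), held_out)
        for j, fit in enumerate(fitters):
            oof[held_out, j] = fit(X[train], y[train])(X[held_out])
    return oof

def fit_stack(fitters, meta_fitter, X, y, n_folds=5):
    """Layer 2: train the meta-model on the base learners' out-of-fold outputs."""
    oof = out_of_fold_predictions(fitters, X, y, n_folds)
    meta = meta_fitter(oof, y)
    base = [fit(X, y) for fit in fitters]    # refit base learners on all data
    return lambda Z: meta(np.column_stack([m(Z) for m in base]))

# Usage on synthetic data (stand-in learners, see the note above)
rng = np.random.default_rng(1)
X = rng.normal(size=(150, 3))
y = 2 * X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=150)
fitters = [lambda A, b: ridge_fit(A, b, 1.0), lambda A, b: knn_fit(A, b, 5)]
model = fit_stack(fitters, lambda A, b: ridge_fit(A, b, 0.0), X[:100], y[:100])
pred = model(X[100:])
```

Training the meta-model on out-of-fold predictions prevents it from simply trusting whichever base learner overfits the training set most.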

In this study, the data processing and modeling were completed in MATLAB R2022a. To prevent overfitting with a limited sample size, grid search and 5-fold cross-validation techniques were applied during the model training. The resulting model parameters are shown in Table 1.
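Grid search combined with 5-fold cross-validation can be illustrated as follows. This is a minimal NumPy sketch under stated assumptions, not the authors' MATLAB procedure: ridge regression with a single penalty `lam` is used as a stand-in model, the candidate values are arbitrary, and the function names are hypothetical. The same loop applies to any learner with tunable parameters.

```python
import numpy as np

def ridge_predict(X_train, y_train, X_test, lam):
    """Fit ridge regression (unpenalized intercept) and predict X_test."""
    A = np.column_stack([np.ones(len(X_train)), X_train])
    penalty = lam * np.eye(A.shape[1])
    penalty[0, 0] = 0.0
    w = np.linalg.solve(A.T @ A + penalty, A.T @ y_train)
    return np.column_stack([np.ones(len(X_test)), X_test]) @ w

def grid_search_cv(X, y, lambdas, k=5, seed=0):
    """Return the candidate with the lowest mean validation RMSE over k folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    scores = {}
    for lam in lambdas:
        fold_rmse = []
        for held_out in folds:
            train = np.setdiff1d(np.arange(len(X)), held_out)
            pred = ridge_predict(X[train], y[train], X[held_out], lam)
            fold_rmse.append(np.sqrt(np.mean((pred - y[held_out]) ** 2)))
        scores[lam] = float(np.mean(fold_rmse))
    best = min(scores, key=scores.get)
    return best, scores

# Usage on synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.05 * rng.normal(size=100)
best, scores = grid_search_cv(X, y, [0.01, 1000.0])
```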

2.7. Model Accuracy Evaluation

In this study, three metrics, namely the coefficient of determination (R²), root mean square error (RMSE), and relative prediction deviation (RPD), were used to evaluate the performance of the inversion model. R² represents the goodness of fit of the model, with values ranging from 0 to 1 and values closer to 1 indicating a better fit. The RMSE represents the deviation between the predicted values and the actual values. The RPD is the ratio of the sample standard deviation to the RMSE and assesses the model’s predictive strength. An RPD ≥ 2.0 indicates that the model has a strong predictive ability. When 1.5 < RPD < 2.0, the model can only provide rough estimates of the sample and requires further refinement. If the RPD ≤ 1.5, the model is considered unreliable [33].
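The three metrics and the RPD thresholds can be computed as in the sketch below. The function names are hypothetical, and the use of the sample standard deviation with `ddof=1` is an assumption, since the text does not state which estimator was used.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return (R2, RMSE, RPD) for measured and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residual = y_true - y_pred
    rmse = np.sqrt(np.mean(residual**2))
    r2 = 1.0 - np.sum(residual**2) / np.sum((y_true - y_true.mean()) ** 2)
    rpd = np.std(y_true, ddof=1) / rmse      # assumed: sample SD (ddof=1)
    return float(r2), float(rmse), float(rpd)

def rpd_category(rpd):
    """Map an RPD value to the interpretation bands given in the text."""
    if rpd >= 2.0:
        return "strong"
    if rpd > 1.5:
        return "rough"
    return "unreliable"

# Usage with a small made-up example
r2, rmse, rpd = regression_metrics([1, 2, 3, 4, 5], [1.1, 1.9, 3.2, 3.8, 5.0])
```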

3. Results

3.1. Statistical Analysis of Cu Content in the Study Area

The fundamental statistics of the Cu content in the 269 soil samples from the study area (Table 2) indicate that the Cu content ranged from 3 to 163 mg/kg. The average Cu content was 33.92 mg/kg, with a standard deviation of 24.46 mg/kg, resulting in a coefficient of variation of 72.11%. These results indicate significant differences in Cu pollution across the study area, suggesting an uneven distribution influenced by local human activities, particularly industrial operations. Based on the modeling requirements, the total sample was randomly divided into 190 training samples and 79 validation samples. When compared to the background levels of soil Cu in Guangdong Province and the national standard, the average Cu content in the study area exceeded both the provincial background value (20 mg/kg) and the national standard (20 mg/kg) [34]. The analysis suggests that this phenomenon may be related to the increased intensity of industrialization processes in the region.
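The reported coefficient of variation follows directly from the mean and standard deviation given above, taking the coefficient of variation as the ratio of the standard deviation s to the mean x̄:

```latex
\mathrm{CV} = \frac{s}{\bar{x}} \times 100\%
            = \frac{24.46\ \mathrm{mg/kg}}{33.92\ \mathrm{mg/kg}} \times 100\%
            \approx 72.11\%
```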

3.2. Spectral Preprocessing Based on Continuous Wavelet Transform

The original spectral reflectance curves of the soil samples are shown in Figure 4. The reflectance varied with increasing wavelength and followed a consistent trend across samples. Specifically, the reflectance increased rapidly within the visible range (400–780 nm), remained relatively stable in the near-infrared to shortwave infrared range (780–2100 nm), and gradually declined at longer shortwave infrared wavelengths (2100–2500 nm). Additionally, fluctuations around 1000 nm, caused by the internal hardware of the ASD spectrometer, are noticeable but can be disregarded [35]. The rapid increase in reflectance from 400 nm is related to the presence of iron ions and organic matter. Absorption valleys are observed at 1400 nm and 2200 nm, attributed to the absorption of lattice water, with the latter potentially influenced by the stretching of metal hydroxyl groups [12].

SG smoothing, MSC, and CWT were applied to the original spectral data to enhance the spectral response characteristics. As illustrated in Figure 5a, the SG method effectively removes spikes from the original spectrum, particularly in the 2200–2400 nm range. Following the application of the MSC transformation, the features of the spectral curve become more pronounced, especially at 500 nm, 1350 nm, 1800 nm, and 2200 nm (Figure 5b). The CWT method, when used to decompose the spectrum at scales L1–L10, further highlights less obvious feature peaks, with the shape of the absorption peaks reflecting the spectral characteristics. As the scale increases from L4, distinct spectral features gradually emerge. However, excessively increasing the decomposition scale may lead to the significant removal of high-frequency components, causing the spectral curve to become smoother and resulting in the loss of absorption peaks at various wavelengths (Figure 5c–l). Therefore, at excessively high decomposition scales, capturing useful characteristic spectral information becomes challenging.
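For readers who wish to reproduce the SG and MSC steps, the following is a minimal NumPy sketch. It is not the authors' MATLAB code. The window length of 11 bands and the polynomial order of 2 are assumptions, since the text does not report these settings, and the function names are hypothetical. MSC is implemented in its usual form, regressing each spectrum on the mean spectrum and removing the fitted offset and slope.

```python
import numpy as np

def savgol_smooth(spectrum, window=11, polyorder=2):
    """Savitzky-Golay smoothing: least-squares polynomial fit in a sliding window."""
    half = window // 2
    offsets = np.arange(-half, half + 1, dtype=float)
    # Columns are offsets**0 ... offsets**polyorder
    design = np.vander(offsets, polyorder + 1, increasing=True)
    # Row 0 of the pseudo-inverse yields the fitted value at the window centre
    weights = np.linalg.pinv(design)[0]
    padded = np.pad(np.asarray(spectrum, dtype=float), half, mode="edge")
    # Reverse the weights so that np.convolve performs a sliding dot product
    return np.convolve(padded, weights[::-1], mode="valid")

def msc(spectra):
    """Multiplicative scatter correction against the mean spectrum."""
    X = np.asarray(spectra, dtype=float)
    reference = X.mean(axis=0)
    corrected = np.empty_like(X)
    for i, row in enumerate(X):
        # Fit row = intercept + slope * reference
        slope, intercept = np.polyfit(reference, row, 1)
        corrected[i] = (row - intercept) / slope
    return corrected

# Usage: three synthetic spectra that differ only by offset and gain
base = np.sin(np.linspace(0.0, 3.0, 40)) + 2.0
raw = np.array([0.1 + 0.9 * base, -0.2 + 1.3 * base, 0.05 + 1.0 * base])
smoothed = np.array([savgol_smooth(row) for row in raw])
corrected = msc(raw)
```

In this sketch the spectrum edges are handled by repeating the end values, so the first and last few bands are smoothed less faithfully than the interior.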

The results of this study demonstrate that CWT can effectively highlight feature bands, likely due to its ability to perform time-scale analysis on signals. This finding is supported by several published studies. For instance, wavelet analysis has been shown to effectively capture spectral features through the multi-scale decomposition of spectral data in both the time and frequency domains, facilitating the identification of local signal characteristics [12]. Additionally, another study confirmed that CWT outperforms other methods in noise reduction when estimating soil heavy metal content, as it effectively suppresses noise while preserving essential information, thereby enhancing its estimation accuracy [36].

3.3. Analysis of PCA Reduction Results

PCA was utilized to reduce the dimensionality of the spectral data across 2152 wavebands. As the wavelet decomposition scale increased, the number of principal components required gradually decreased, as shown in Table 3. At larger decomposition scales, CWT progressively decomposed the signal into low-frequency components. This process effectively eliminated high-frequency noise and local fluctuations while preserving the global characteristics of the signal. Because the overall trend of low-frequency information is relatively simple, only a few principal components were needed to explain a significant portion of the variance. Additionally, the attenuation of high-frequency components further reduced noise and intricate details in the data, leading to a reduction in the number of selected principal components. This indicates that as the decomposition scale increased, the overall variability characteristics of the spectral data became more concentrated, resulting in fewer principal components after dimensionality reduction.

3.4. Construction and Accuracy Evaluation of Cu Content Inversion Model

The principal components extracted through the PCA were used as input variables for the following base models: SVR, BPNN, PLSR, and XGBoost. Using RF as the meta-model, the prediction results from all the base models served as input features to construct the stacking model. Figure 6 shows the accuracy results of the validation set for all the models. Across all the decomposition scales of the CWT, the inversion results after the PCA dimensionality reduction (Figure 6a–c) indicate that, at scales L1–L5, the R² values for most of the models were below 0.6, with relatively high RMSE values and RPD values mainly under 2, reflecting low predictive capabilities. At scales L6–L10, all the models, except for the BPNN and PLSR, exhibited a relatively good performance. The BPNN and PLSR models showed strong results only at scales L5 and L6. The validation set typically recorded R² values between 0.6 and 0.8, RMSE values below 10 mg/kg, and RPD values generally exceeding 2. Each model achieved its highest accuracy at scale L6 (Table 4), with the training set and validation set R² values exceeding 0.56 and 0.64, respectively. Among them, the stacking model performed the best, achieving accuracy results for the R², RMSE, and RPD of 0.77, 7.65 mg/kg, and 2.29, respectively. Compared to the SVR, BPNN, PLSR, XGBoost, and RF, the stacking model demonstrated an average increase in R² of 0.098, an average decrease in the RMSE of 1.672, and the highest RPD value, indicating superior predictive capabilities.

Figure 7 shows the fitting plots of the measured and predicted Cu values in the validation set for each model at the L6 scale. The black dashed line represents the 1:1 line, and the red solid line represents the fitting line. The closer the R² value is to 1, the better the model performance is. The graph shows that all the models tended to underestimate high values and overestimate low values. This is likely due to the uneven distribution of the Cu content in the samples, particularly the scarcity of high-value samples, which resulted in incomplete information in the training set. Among the models, the stacking model demonstrated the best fit, with the R² value closest to 1 (R² = 0.77), indicating its effectiveness in mitigating low-value overestimation and high-value underestimation. Overall, the stacking algorithm can integrate the advantages of the base models, enhancing the robustness and generalization capability of the predictions.

4. Discussion

Hyperspectral techniques have the advantages of numerous bands and a high spectral resolution, enabling the capture of subtle spectral features. They have been widely employed in research on soil composition. However, spectral curves can be affected by environmental influences and instrument errors, leading to deviations and noise in the spectrum. Previous studies have shown that preprocessing spectral data can significantly improve the accuracy of inversion models. CWT spectral transformation can effectively reduce spectral noise and reveal hidden information. Wavelet analysis enables the decomposition of spectral data at multiple scales in both time and frequency domains, helping to identify the optimal signal for estimating the physical and chemical properties of soil [37,38]. In CWT processing, different scales capture different frequency information, and some scales perform better by striking a balance between high-frequency noise and low-frequency background interference. High-frequency scales may amplify noise, while low-frequency scales may obscure critical details. Thus, the ideal scale minimizes the effects of noise while preserving spectral features that strongly correlate with heavy metal content. In this study, scale L6 was identified as the most effective scale. This finding aligns with the research of Zhao et al. [39] and Baisong et al. [40], who showed that certain scales can more effectively extract spectral features related to soil composition, thus enhancing the model performance.

In this study, we developed an inversion model to predict heavy metal Cu content using a stacking ensemble learning strategy. The results show that the stacked model enhances prediction accuracy and generalization ability, particularly when sample data are limited. This finding is consistent with the work of Lin et al. [22], Tan et al. [41], and Guo et al. [20], who also utilized stacking, boosting, and bagging strategies to estimate heavy metal content in soil with limited samples, further confirming the effectiveness of the stacking approach. Compared to traditional ensemble learning models, such as the bagging-based RF and the boosting-based XGBoost, the stacking model demonstrates a superior performance. This shows that the stacking ensemble learning strategy effectively combines multiple algorithms and makes full use of the strengths of each base learner. In our study, the stacking model was shown to mitigate potential bias from individual models by integrating the prediction results of multiple algorithms, leading to more accurate inversion outcomes in complex environments. The stacking model provides several advantages from the perspectives of machine learning and data analysis. First, it effectively accounts for the contributions of each base learner and sample, thereby minimizing prediction errors that may arise from underperforming individual models [42]. This is particularly crucial when sample sizes are limited, as smaller datasets can make certain models susceptible to noise and outliers. Second, the stacking model capitalizes on the strengths of various algorithms to mitigate the limitations of any single model during prediction and training. By incorporating diverse algorithms, the model can explore data characteristics more comprehensively, enhancing its flexibility and adaptability. However, constructing a stacking ensemble learning model also presents challenges.
The selection and evaluation of base models, as well as their integration, are critical factors that influence model performance. Since the meta-learner directly utilizes the predictions from base learners as training data, incorporating poorly performing models can lead to significant errors in the validation set. Therefore, careful selection of high-performing base learners is essential to ensure the overall stability and accuracy of the model. In summary, this study demonstrates that the Cu content inversion model, which is based on a stacking ensemble learning strategy, exhibits a strong generalization ability and robust overall performance under limited sample conditions. Future research should focus on further optimizing the selection and combination strategies of base learners to enhance the model’s predictive capabilities, providing a more effective tool for monitoring heavy metals in soil.

The improved accuracy and stability of the stacking model support the identification of pollution hotspots and the development of targeted monitoring strategies. Researchers can apply the established model to hyperspectral data from different regions to pinpoint areas with high pollutant concentrations and generate pollution maps that serve as a basis for field verification. The model could also be updated with historical and newly acquired data, allowing faster responses to environmental change. Such a capability would improve monitoring efficiency and provide timely data support for policymakers, helping them make better-informed decisions on land use and environmental protection. To integrate this hyperspectral approach into existing monitoring frameworks, a dedicated hyperspectral data analysis module could be established, complementing traditional soil sampling and increasing the frequency of heavy metal pollution detection. However, scaling this approach may encounter logistical and technical challenges, such as the need for specialized software and hardware, high equipment costs, and personnel training. Future research should focus on optimizing data acquisition and processing workflows to improve stability and reliability.

This study uses laboratory hyperspectral data, whereas large-scale monitoring of Cu content typically relies on satellite hyperspectral imagery [43]. Integrating satellite data into the developed models presents several challenges. Satellite data often have lower spatial and spectral resolutions than laboratory measurements, which can lead to information loss and decreased accuracy [33]. Atmospheric interference during data acquisition can distort spectral signals, while complex backgrounds, such as vegetation and buildings, may complicate the retrieval of heavy metals [17]. Additionally, satellite observations may not capture the spatial variability of soil composition accurately, which affects the model's generalization capability [44]. To address these challenges, future research could combine satellite data with ground measurements to improve data quality. Effective atmospheric correction methods could reduce interference and improve model reliability. Techniques such as spectral mixture analysis or deep learning for background removal may help isolate soil signals. Lastly, geographic information systems (GIS) and local regression models can better accommodate soil variability and improve generalizability. Together, these strategies have the potential to significantly improve the accuracy and reliability of satellite hyperspectral data for monitoring soil heavy metals.

5. Conclusions

Soil samples from typical industrial enterprises in the Pearl River Delta region of Guangdong Province were analyzed, yielding Cu content data and indoor soil spectral data for 269 samples. CWT was used to capture localized soil spectral reflectance information, and PCA was employed for dimensionality reduction. A stacking ensemble model was then constructed, and its predictive performance for soil Cu was compared with that of the individual models. The results were as follows: (1) CWT can effectively reduce the impact of noise on spectral data and enhance the spectral response of soil Cu. (2) PCA can decrease redundancy in spectral information, significantly lowering the data dimensionality and computational complexity, which in turn improves accuracy. The number of principal components extracted by PCA varies with the CWT decomposition scale, with optimal inversion results achieved at the L6 scale. (3) Compared with the single models, the stacking ensemble model achieved a higher predictive accuracy, with R², RMSE, and RPD values of 0.77, 7.65 mg/kg, and 2.29, respectively. The inversion results demonstrate the stability and reliability of the stacking model, indicating its potential as an effective method for predicting Cu concentration in soil.
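For readers who want to reproduce the three accuracy metrics, the sketch below uses their conventional definitions. It assumes these match the definitions used in this study, and that RPD is the sample standard deviation of the observed values divided by the RMSE. The function name `regression_metrics` is hypothetical.

```python
from math import sqrt
from statistics import fmean, stdev


def regression_metrics(observed, predicted):
    """Return (R2, RMSE, RPD) for paired observed and predicted values."""
    mean_obs = fmean(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    r2 = 1 - ss_res / ss_tot
    rmse = sqrt(ss_res / len(observed))
    rpd = stdev(observed) / rmse  # sample SD (n - 1); an assumption
    return r2, rmse, rpd


# Consistency check against the reported values: testing-sample SD of
# 17.50 mg/kg (Table 2) and stacking validation RMSE of 7.65 mg/kg (Table 4).
reported_rpd = 17.50 / 7.65
print(round(reported_rpd, 2))  # 2.29
```

The ratio of the reported standard deviation to the reported RMSE reproduces the stated RPD of 2.29, which suggests that RPD was computed on the validation samples in this way.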

The content of heavy metals in soil is shaped by a combination of natural factors and anthropogenic activities. The distribution of heavy metals is influenced by the physical and chemical properties of the soil, geographical characteristics, and human interventions. Variations in these factors may impact the accuracy and generalizability of predictive models. To enhance model performance in complex environments, it is essential to incorporate these variables as input parameters in future modeling efforts. Furthermore, the optimization of models and the adjustment of parameters can be intricate and labor-intensive, particularly when dealing with extensive datasets. Consequently, future research should also prioritize the simplification of models and the enhancement of data processing efficiency.

Author Contributions

Conceptualization, K.Y., F.W. and H.G.; Data curation, C.H. and P.C.; Funding acquisition, Y.D. and Z.C.; Investigation, D.C.; Methodology, K.Y., F.W., H.G. and D.C.; Project administration, H.G., Y.D., Z.C. and R.X.; Resources, Z.H., C.H. and P.C.; Software, K.Y.; Supervision, Y.D. and R.X.; Validation, K.Y., F.W. and Z.H.; Visualization, Z.H. and C.H.; Writing—original draft, K.Y., F.W. and D.C.; Writing—review and editing, H.G., Y.D., Z.C. and R.X. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Science and Technology Projects of Zhejiang Province (2022C03168), the National Natural Science Foundation of China (41501184), the Science and Technology Innovation Program of Guangdong Provincial Academy of Environmental Science (HKYKJ-2023005), the National Natural Science Foundation of China-Guangdong Joint Fund Key Project (No. U1911202), the Key Research and Development Program of the Ministry of Science and Technology (No. 2019YFC1805300), and the Natural Science Foundation of Guangdong Province (No. 2019A1515012131).

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to privacy or ethical restrictions.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Bian, Z.; Sun, L.; Tian, K.; Liu, B.; Zhang, X.; Mao, Z.; Huang, B.; Wu, L. Estimation of Heavy Metals in Tailings and Soils Using Hyperspectral Technology: A Case Study in a Tin-Polymetallic Mining Area. Bull. Environ. Contam. Toxicol. 2021, 107, 1022–1031.
2. Yang, X.; Lei, S.; Zhao, Y.; Cheng, W. Use of hyperspectral imagery to detect affected vegetation and heavy metal polluted areas: A coal mining area, China. Geocarto Int. 2022, 37, 2893–2912.
3. Yin, F.; Wu, M.; Liu, L.; Zhu, Y.; Feng, J.; Yin, D.; Yin, C.; Yin, C. Predicting the abundance of copper in soil using reflectance spectroscopy and GF5 hyperspectral imagery. Int. J. Appl. Earth Obs. Geoinf. 2021, 102, 102420.
4. Liu, Y.P.; Luo, Q.; Cheng, H.F. Application and development of hyperspectral remote sensing technology to determine the heavy metal content in soil. J. Agro-Environ. Sci. 2020, 39, 2699–2709.
5. Lee, K.; Kang, S.; Jeon, E.I.; Yu, S.; Kwon, O.S. Exploring correlations between hyper-spectral signatures acquired in the laboratory and in-situ observation for heavy metal concentrations in soil. Spat. Inf. Res. 2018, 26, 497–505.
6. Mao, J.H.; Zhao, H.Q.; Jin, Q.; Wang, X.F.; Miao, Q.F.; Wang, P.; Li, M. Comparative study on the hyperspectral inversion methods for soil heavy metal contents in Hebei lead-zinc tailings reservoir areas. Trans. Chin. Soc. Agric. Eng. 2023, 39, 144–156.
7. Yang, K.; Zhang, W.; Fu, P.; Gao, P.; Cheng, F.; Li, Y. The LH-PSD Analysis Model of Cu Contaminated Soil Spectral Characteristics and Weak Characteristic Information. Spectrosc. Spectr. Anal. 2019, 39, 2228–2236.
8. Tang, C.; Xiao, R.; Ling, B.; Wang, P.; Zheng, J.; Huang, F.; Liu, W. Prediction of Cr and Ni contents in soil from hyperspectral data combined with Al-Fe minerals. Int. J. Remote Sens. 2023, 44, 2781–2797.
9. Fu, P.; Zhang, W.; Yang, K.; Meng, F.; Yao, G.; Liu, P. Using the Hilbert-Huang spectrum transformation to estimate soil lead concentration. Remote Sens. Lett. 2021, 12, 768–777.
10. Qin, X.; Lai, C.; Pan, Z.; Pan, M.; Xiang, Y.; Wang, Y. Recognition of Abnormal-Laying Hens Based on Fast Continuous Wavelet and Deep Learning Using Hyperspectral Images. Sensors 2023, 23, 3645.
11. Wang, X.; Yumiti, M.; Huang, X.; Li, R.; Liu, D. Estimation of Arsenic Content in Soil Based on Continuous Wavelet Transform. Spectrosc. Spectr. Anal. 2023, 43, 206–212.
12. Guo, B.; Bai, H.R.; Zhang, B.; Pei, L.; Zhao, Y.H.; Lei, Y.Z.; Yuan, R.C. Inversion of soil zinc contents using hyperspectral remote sensing based on random forest and continuous wavelet transform in an opencast coal mine. Trans. Chin. Soc. Agric. Eng. 2022, 38, 138–147.
13. Guo, F.; Xu, Z.; Ma, H.H.; Liu, X.J.; Yang, Z.; Tang, S. A Comparative Study of the Hyperspectral Inversion Models Based on the PCA for Retrieving the Cd Content in the Soil. Spectrosc. Spectr. Anal. 2021, 41, 1625–1630.
14. Tian, S.; Wang, S.; Bai, X.; Zhou, D.; Lu, Q.; Wang, M.; Wang, J. Hyperspectral estimation model of soil Pb content and its applicability in different soil types. Acta Geochim. 2020, 39, 423–433.
15. Hou, L.; Li, X.; Li, F. Hyperspectral-based Inversion of Heavy Metal Content in the Soil of Coal Mining Areas. J. Environ. Qual. 2019, 48, 57–63.
16. Han, L.; Chang, S.; Chen, R.; Liu, Z.; Zhao, Y.; Li, R.; Xia, L. Monitoring soil mercury content based on hyperspectral data and machine learning methods. J. Appl. Remote Sens. 2022, 16, 24518.
17. Guo, H.; Yang, K.; Wu, F.; Chen, Y.; Shen, J. Regional Inversion of Soil Heavy Metal Cr Content in Agricultural Land Using Zhuhai-1 Hyperspectral Images. Sensors 2023, 23, 8756.
18. Yuxin, T.; Zhenghai, W.; Peng, X. Quantitative Hyperspectral Inversion of Soil Heavy Metals based on Feature Screening Combined with PSO-BPNN and GA-BPNN Algorithms. Remote Sens. Technol. Appl. 2024, 39, 259–268.
19. Xiaobo, G.; Zhikai, C.; Zhihui, Z.; Tian, C.; Wenlong, L.; Yadan, D. Remote Sensing Inversion of Leaf Area Index of Mulched Winter Wheat Based on Feature Downscaling and Machine Learning. Trans. Chin. Soc. Agric. Mach. 2023, 54, 148–157.
20. Guo, H.; Liu, H.; Wu, S. Simulation, prediction and optimization of typical heavy metals immobilization in swine manure composting by using machine learning models and genetic algorithm. J. Environ. Manag. 2022, 323, 116266.
21. Shi, T.; Zhang, J.; Shen, W.; Wang, J.; Li, X. Machine learning can identify the sources of heavy metals in agricultural soil: A case study in northern Guangdong Province, China. Ecotoxicol. Environ. Safe. 2022, 245, 114107.
22. Lin, N.; Jiang, R.; Li, G.; Yang, Q.; Li, D.; Yang, X. Estimating the heavy metal contents in farmland soil from hyperspectral images based on Stacked AdaBoost ensemble learning. Ecol. Indic. 2022, 143, 109330.
23. Zou, Z.; Wang, Q.; Wu, Q.; Li, M.; Zhen, J.; Yuan, D.; Zhou, M.; Xu, C.; Wang, Y.; Zhao, Y.; et al. Inversion of heavy metal content in soil using hyperspectral characteristic bands-based machine learning method. J. Environ. Manag. 2024, 355, 120503.
24. Zhang, H.; Liu, J.; Li, F. Spatial distributions and controlled factors of heavy metals in surface soils in Guangdong based on the regional geology. Ecol. Environ. Sci. 2011, 20, 646–651.
25. Li, M.; Han, D.; Lu, D.; Lu, X.X.; Chai, C.X.; Liu, W.; Sun, K.X. Research Progress of Universal Model of Near-Infrared Spectroscopy in Agricultural Products and Foods Detection. Spectrosc. Spectr. Anal. 2022, 42, 3355–3360.
26. Jiang, C.; Ren, H.; Wang, Z.; Zeng, H.; Teng, Y.; Zhang, H.; Liu, X.; Jin, D.; Wang, M.; Liu, R.; et al. Estimation of Soil-Related Parameters Using Airborne-Based Hyperspectral Imagery and Ground Data in the Fenwei Plain, China. Remote Sens. 2024, 16, 1129.
27. Xie, B.; Chen, B.; Ma, J.; Chen, J.; Zhou, Y.; Han, X.; Xiong, Z.; Yu, Z.; Huang, F. Rapid Identification of Choy Sum Seeds Infected with Penicillium decumbens Based on Hyperspectral Imaging and Stacking Ensemble Learning. Food Anal. Methods 2024, 17, 416–425.
28. Zhang, H.M.; Chen, L.J.; Liu, W.; Han, W.T.; Zhang, S.Y.; Zhang, F. Estimation of Summer Corn Fractional Vegetation Coverage Based on Stacking Ensemble Learning. Trans. Chin. Soc. Agric. Mach. 2021, 52, 195–202.
29. Wang, Y.; Niu, R.; Hao, M.; Lin, G.; Xiao, Y.; Zhang, H.; Fu, B. A method for heavy metal estimation in mining regions based on SMA-PCC-RF and reflectance spectroscopy. Ecol. Indic. 2023, 154, 110476.
30. Wang, Y.; Niu, R.; Lin, G.; Xiao, Y.; Ma, H.; Zhao, L. Estimate of soil heavy metal in a mining region using PCC-SVM-RFECV-AdaBoost combined with reflectance spectroscopy. Environ. Geochem. Health 2023, 45, 9103–9121.
31. Yang, N.; Han, L.; Liu, M. Inversion of soil heavy metals in metal tailings area based on different spectral transformation and modeling methods. Heliyon 2023, 9, E19782.
32. Yang, J.; Li, X.; Ma, X. Improving the Accuracy of Soil Organic Carbon Estimation: CWT-Random Frog-XGBoost as a Prerequisite Technique for In Situ Hyperspectral Analysis. Remote Sens. 2023, 15, 5294.
33. Li, Y.; Yang, K.; Zhao, H. Scale transfer learning of hyperspectral prediction model of heavy metal content in maize: From laboratory to satellite. Int. J. Remote Sens. 2023, 44, 2590–2610.
34. Yang, L.; Bai, Z.X.; Bo, W.H.; Lin, J.; Yang, J.J.; Chen, T. Analysis and Evaluation of Heavy Metal Pollution in Farmland Soil in China: A Meta-analysis. Environ. Sci. 2024, 5, 2913–2925.
35. Zhang, J.; Wang, M.; Yang, K.; Li, Y.; Li, Y.; Wu, B.; Han, Q. The New Hyperspectral Analysis Method for Distinguishing the Types of Heavy Metal Copper and Lead Pollution Elements. Int. J. Environ. Res. Public Health 2022, 19, 7755.
36. Zhang, S.; Shen, Q.; Nie, C.; Huang, Y.; Wang, J.; Hu, Q.; Ding, X.; Zhou, Y.; Chen, Y. Hyperspectral inversion of heavy metal content in reclaimed soil from a mining wasteland based on different spectral transformation and modeling methods. Spectrochim. Acta Part A Mol. Biomol. Spectrosc. 2019, 211, 393–400.
37. Cheng, X.; Feng, Y.; Guo, A.; Huang, W.; Cai, Z.; Dong, Y.; Guo, J.; Qian, B.; Hao, Z.; Chen, G.; et al. Detection of Rubber Tree Powdery Mildew from Leaf Level Hyperspectral Data Using Continuous Wavelet Transform and Machine Learning. Remote Sens. 2024, 16, 105.
38. Zhang, N.; Zhang, X.; Shang, P.; Yuan, X.; Li, L.; Bai, T. Stratified diagnosis of cotton canopy spectral characteristics based on CWT-SPA and its relationship with moisture, nitrogen, and SPAD values. Int. J. Remote Sens. 2024, 45, 325–350.
39. Zhao, H.; Gan, S.; Yuan, X.; Hu, L.; Wang, J.; Liu, S. Prediction of low Zn concentrations in soil from mountainous areas of central Yunnan Province using a combination of continuous wavelet transform and Boruta algorithm. Int. J. Remote Sens. 2023, 44, 4753–4774.
40. Baisong, A.; Xuemei, W.; Xiaoyu, H.; Baishan, K. Hyperspectral Estimation of Heavy Metal Cadmium Content in Soil based on Continuous Wavelet Transform. Earth Environ. 2023, 51, 246–253.
41. Tan, K.; Ma, W.; Chen, L.; Wang, H.; Du, Q.; Du, P.; Yan, B.; Liu, R.; Li, H. Estimating the distribution trend of soil heavy metals in mining area from HyMap airborne hyperspectral imagery based on ensemble learning. J. Hazard. Mater. 2021, 401, 123288.
42. Yang, Y.; Li, H.; Sun, M.; Liu, X.; Cao, L. A Study on Hyperspectral Soil Moisture Content Prediction by Incorporating a Hybrid Neural Network into Stacking Ensemble Learning. Agronomy 2024, 14, 2054.
43. Zhang, B.; Guo, B.; Zou, B.; Wei, W.; Lei, Y.; Li, T. Retrieving soil heavy metals concentrations based on GaoFen-5 hyperspectral satellite image at an opencast coal mine, Inner Mongolia, China. Environ. Pollut. 2022, 300, 118981.
44. Yao, L.; Xu, M.; Liu, Y.; Niu, R.; Wu, X.; Song, Y. Estimating of heavy metal concentration in agricultural soils from hyperspectral satellite sensor imagery: Considering the sources and migration pathways of pollutants. Ecol. Indic. 2024, 158, 111416.

Figure 1. Location of the study area and sampling distributions.

Figure 2. Flowchart of Cu content estimation.

Figure 3. Integrated learning algorithm construction in stacking.

Figure 4. Original spectra.

Figure 5. Soil reflectance spectral curves based on three spectral preprocessing methods (SG, MSC, CWT).

Figure 6. Inversion accuracy results of PCA at each scale.

Figure 7. Results of the optimal validation set of each model.

Table 1. Model hyperparameters selected by grid search.

Model	Parameter	Value
SVR	Penalty parameter	5.66
SVR	Gamma	0.18
PLSR	Regularization	0.1
BPNN	Epochs	800
BPNN	Learning rate	0.01
XGBoost	Number of decision trees	20
XGBoost	Maximum depth	2
RF	Number of decision trees	400
RF	Minimum number of samples per leaf	2

Table 2. Descriptive statistical analysis of Cu content in soil.

Dataset	Number	Minimum (mg/kg)	Maximum (mg/kg)	Mean (mg/kg)	Standard Deviation (mg/kg)	CV (%)
Total sample	269	3	163	33.92	24.46	72.11
Training sample	190	3	163	33.91	26.83	79.12
Testing sample	79	4	63	34.32	17.50	50.99
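The CV column in Table 2 is the standard deviation expressed as a percentage of the mean. The short check below recomputes it from the tabulated values as a worked example. The dictionary name `table2` is just a label chosen here.

```python
# (standard deviation, mean) in mg/kg, copied from Table 2
table2 = {
    "Total sample": (24.46, 33.92),
    "Training sample": (26.83, 33.91),
    "Testing sample": (17.50, 34.32),
}

# Coefficient of variation: CV (%) = 100 * SD / mean
cv_percent = {name: round(100 * sd / mean, 2) for name, (sd, mean) in table2.items()}
print(cv_percent)
```

The recomputed values match the reported 72.11%, 79.12%, and 50.99%. The testing sample is noticeably less variable than the training sample, which is worth bearing in mind when comparing RPD values between the two sets.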

Table 3. The number of principal components after PCA dimensionality reduction at different decomposition scales.

Scale	Number of principal components	Scale	Number of principal components
L1	189	L6	45
L2	188	L7	25
L3	189	L8	14
L4	148	L9	8
L5	80	L10	5

Table 4. Comparison between single model and stacking model.

Model	Training R²	Training RMSE (mg/kg)	Training RPD	Validation R²	Validation RMSE (mg/kg)	Validation RPD
SVR	0.86	7.98	3.36	0.70	9.38	1.87
RF	0.93	5.19	5.16	0.68	7.66	2.29
BPNN	0.56	11.83	2.27	0.69	8.22	2.13
PLSR	0.56	13.34	2.01	0.64	12.71	1.38
XGBoost	0.82	7.57	3.54	0.65	8.64	2.02
Stacking	0.85	7.82	3.43	0.77	7.65	2.29

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
