Sinkholes can cause significant damage to buildings and infrastructure, and can lead to loss of life in residential and other densely populated areas. The cost of damage for a typical sinkhole is estimated at between $20K and $100K. Florida, sometimes called the sinkhole capital, suffers $200M to $400M in sinkhole damage every year.
Conventional sinkhole detection relies on continuous ground surveying with specialized techniques such as ground penetrating radar (GPR), which can be time consuming and costly. There is an opportunity to combine non-geological open data to help identify areas with a higher risk of sinkhole formation.
By providing ZIP code level sinkhole risk intelligence, we let users of our product view the sinkhole risk levels predicted by our machine learning model for neighborhoods across Florida.
Our sinkhole risk intelligence portal enables users to identify and visualize areas with heightened risks of sinkhole formations through 3 key features:
🔍 Search nearby risks
👓 View risk ratings
🗺️ Visualize data layers
← Click on the image to access the portal and explore Florida neighborhoods with sinkhole risks.
By leveraging open-data and data science techniques, our unique approach to sinkhole risk assessment combines data from sources including:
Sentinel-2 satellite imagery (434K+ 640 m × 640 m tiles),
USDA soil composition (9 soil attributes per satellite image tile),
NOAA weather data (290K+ daily precipitation and temperature records) and
FDEP land subsidence incidents (4K+ incident reports).
Two types of machine learning models were used to predict sinkhole risk. First, a ResNet-50 convolutional neural network extracted land use features by classifying 434K+ satellite images into 10 classes: Annual Crop, Forest, Vegetation, Highway, Industrial, Pasture, Permanent Crop, Residential, River and Sea Lake.
The land use classification produced by ResNet-50, along with data on weather, soil composition and historical sinkhole incidents, was then fed into a second machine learning model, XGBoost, to generate a probability score between 0 and 1.
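To illustrate how a 10-class classifier output can become tabular land use features, here is a minimal sketch. It assumes the classifier emits one raw score (logit) per class and that the predicted class is one-hot encoded; the source does not say how the land use result was encoded, so the function names, the `land_use_` prefix and the sample logits are all hypothetical.

```python
import math

# Class names as listed in the text above.
LAND_USE_CLASSES = [
    "Annual Crop", "Forest", "Vegetation", "Highway", "Industrial",
    "Pasture", "Permanent Crop", "Residential", "River", "Sea Lake",
]

def softmax(logits):
    """Convert raw classifier scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def land_use_features(logits):
    """Return (predicted class name, one-hot feature dict) for one tile."""
    if len(logits) != len(LAND_USE_CLASSES):
        raise ValueError("expected one logit per land use class")
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    # Assumption: one binary column per class (one-hot encoding).
    one_hot = {f"land_use_{name}": int(i == best)
               for i, name in enumerate(LAND_USE_CLASSES)}
    return LAND_USE_CLASSES[best], one_hot

# Hypothetical logits for a single tile; the largest score is at index 3.
label, features = land_use_features(
    [0.1, 0.3, 0.2, 2.5, 0.4, 0.0, 1.1, 0.6, -0.2, -1.0]
)
```

One-hot columns are a common way to hand a categorical prediction to a tree-based model, but the class probabilities themselves could equally be used as features.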
Reverse geocoding was performed to enrich the overall dataset with ZIP code and neighborhood attributes. This enabled sinkhole risk probability scores to be aggregated by ZIP code area and visualized on an interactive map as shown on the sinkhole risk intelligence portal.
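The ZIP code aggregation step can be sketched as follows. This assumes each tile already carries a ZIP code from reverse geocoding and that tile scores are combined with a simple mean; the source only says the scores were "aggregated", so the choice of mean, the function name and the sample values are assumptions.

```python
from collections import defaultdict
from statistics import mean

def aggregate_by_zip(tile_scores):
    """Average tile-level risk probabilities per ZIP code.

    tile_scores: iterable of (zip_code, probability) pairs, one per tile.
    """
    grouped = defaultdict(list)
    for zip_code, probability in tile_scores:
        grouped[zip_code].append(probability)
    # Assumption: mean aggregation. A max or a percentile would flag
    # ZIP codes with a few high-risk tiles more aggressively.
    return {zip_code: mean(values) for zip_code, values in grouped.items()}

# Made-up tile scores for illustration only.
sample_tiles = [("33601", 0.25), ("33601", 0.75), ("32801", 0.10)]
zip_risk = aggregate_by_zip(sample_tiles)
```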
Stage 1
Input open-data sources including 10-meter resolution satellite images from Sentinel-2, NOAA temperature and precipitation data, soil composition and Florida land subsidence incidents.
Stage 2
A ResNet-50 model pre-trained on 27K+ EuroSAT images was used to extract land use features from 400K+ Sentinel-2 satellite images of Florida.
Stage 3
Using the centroid position of each of the 434K+ Sentinel-2 satellite image tiles (each 640 m × 640 m) that together cover the entire state of Florida, the corresponding weather, soil composition and past sinkhole incident data were retrieved. This gave rise to a 34-feature consolidated training dataset used in Stage 4.
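One way to derive a sinkhole history feature from a tile centroid is to count past incidents within a given radius. The sketch below uses the haversine great-circle distance; the function names, the radius values and the coordinates are illustrative assumptions, not the project's actual code or data.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))

def incidents_within(centroid, incidents, radius_miles):
    """Count incident locations no farther than radius_miles from the centroid."""
    lat, lon = centroid
    return sum(
        1 for ilat, ilon in incidents
        if haversine_miles(lat, lon, ilat, ilon) <= radius_miles
    )

# Made-up coordinates: roughly 0.7, 1.2 and 6.9 miles from the centroid.
centroid = (28.0, -82.0)
incidents = [(28.01, -82.0), (28.0, -82.02), (28.1, -82.0)]
count_2_miles = incidents_within(centroid, incidents, 2.0)
```

Scanning every incident for every tile is fine for a sketch; with hundreds of thousands of tiles, a spatial index would be the more practical choice.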
Stage 4
Logistic regression, random forest and XGBoost models were evaluated for the prediction of sinkhole risks. XGBoost was found to perform best based on F1-score and model fit.
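For reference, the F1-score used to compare the candidate models is the harmonic mean of precision and recall. A minimal sketch of the calculation and of picking the best model follows; the labels, predictions and model names are made up for illustration.

```python
def f1_score(y_true, y_pred):
    """F1 for the positive class (1): harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical held-out labels and predictions from two candidate models.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
predictions = {
    "model_a": [1, 1, 0, 1, 0, 0, 0, 0],  # precision 2/3, recall 2/3
    "model_b": [1, 0, 0, 0, 0, 0, 0, 0],  # precision 1, recall 1/3
}
scores = {name: f1_score(y_true, pred) for name, pred in predictions.items()}
best_model = max(scores, key=scores.get)
```

F1 is a sensible metric when positives are rare, as sinkhole incidents are, because plain accuracy would reward a model that never predicts a sinkhole.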
Stage 5
Using the sinkhole risk probability scores (a numeric value between 0 and 1), a 5-level risk classification (Elevated/Moderate/Mild/Minor/Less Unlikely) was defined. The risk levels, aggregated by ZIP code area, were then used to power an interactive map made available on the sinkhole risk intelligence portal.
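Mapping a probability score to one of the five risk levels can be sketched as a simple binning step. The level names come from the text above, but the cut points are an assumption (equal-width bins of 0.2); the thresholds actually used are not stated.

```python
from bisect import bisect_right

# Level names from the text, ordered from lowest to highest risk.
RISK_LEVELS = ["Less Unlikely", "Minor", "Mild", "Moderate", "Elevated"]

# Assumed cut points (equal-width bins); not taken from the source.
CUT_POINTS = [0.2, 0.4, 0.6, 0.8]

def risk_level(probability):
    """Return the risk level whose bin contains the given probability."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    # bisect_right counts how many cut points are <= probability,
    # which is exactly the index of the matching level.
    return RISK_LEVELS[bisect_right(CUT_POINTS, probability)]

levels = [risk_level(p) for p in (0.05, 0.35, 0.79, 0.8, 1.0)]
```

Keeping the cut points in a single list makes it easy to swap in quantile-based thresholds if equal-width bins leave most ZIP codes in the same level.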
As an ensemble algorithm, XGBoost can outperform standard decision tree and random forest models by using boosting to combine weak learners sequentially, so that each new tree learns from the errors of the previous ones. With this model, we achieved the highest F1-score while minimizing overfitting.
The top features, from most to least important, include:
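The idea of sequential boosting can be shown with a toy example. The sketch below is plain gradient boosting with squared-error loss and one-split trees (stumps) on a single feature. It is not XGBoost itself, which adds regularization, second-order gradient information and a logistic objective for classification; all names and data here are illustrative.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression tree (a weak learner) to the residuals."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not right:  # the largest x leaves nothing on the right side
            continue
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:  # every x is identical: predict the mean residual
        mean_residual = sum(residuals) / len(residuals)
        return (xs[0], mean_residual, mean_residual)
    return best[1:]

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value

def boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Add stumps one at a time, each fitted to the current errors."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]  # current mistakes
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x)
                 for p, x in zip(preds, xs)]
    return base, learning_rate, stumps

def predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)

def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

# Toy data: the label tends to switch from 0 to 1 as x grows.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 1, 0, 1, 1, 1]
model = boost(xs, ys)
baseline_mse = mse(ys, [sum(ys) / len(ys)] * len(ys))
boosted_mse = mse(ys, [predict(model, x) for x in xs])
```

Each round fits a stump to what the ensemble still gets wrong, so the training error of the boosted model ends up below that of the constant baseline. The learning rate shrinks each step, which is one of the levers used to limit overfitting.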
Historical sinkhole incidents - Within 2 miles of radial distance
Land use classification - Permanent crop and highway
Weather - Mean cumulative precipitation for the last 2 years and
Soil composition - Percentage of sand, available water storage
The feature importance results can be viewed below:
The visualizations below showcase features that exhibit strong correlations with the probability of sinkhole risks. These features include the water storage amount in soil, the percentage of clay in the ground and sinkhole occurrences in proximity (within a radial distance of at least 0.25 miles). These features contribute to the predictive power of our model in identifying areas that have a higher probability of encountering sinkholes.
Water storage amount
Examining the relationship between water storage amount and the probability of sinkhole occurrence, we see that the lower the water storage amount, the higher the probability of sinkhole risk. See the visualization at the top.
Clay in the ground
Similarly, the lower the proportion of clay in the ground, the higher the risk of sinkholes forming. See the second visualization from the top.
Sinkhole history
As for sinkhole incident history, we observe that sinkholes are more likely to occur in regions where incidents have previously been reported in close proximity. See the third visualization from the top.
Organic matter in soil
We also observed that water storage amount is positively correlated with the amount of organic matter in soil: the higher the water storage amount, the higher the amount of organic matter. See the scatter plot at the bottom.
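Relationships like these can be quantified with the Pearson correlation coefficient, which ranges from -1 (perfectly inverse) to 1 (perfectly proportional). The sketch below computes it from scratch; the sample values are synthetic and deliberately idealized, and are not taken from the project's dataset.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)

# Synthetic, idealized values for illustration only.
water_storage = [1.0, 2.0, 3.0, 4.0, 5.0]
organic_matter = [2.0, 4.0, 6.0, 8.0, 10.0]   # rises with water storage
sinkhole_prob = [0.9, 0.7, 0.5, 0.3, 0.1]     # falls with water storage

r_organic = pearson(water_storage, organic_matter)   # positive
r_sinkhole = pearson(water_storage, sinkhole_prob)   # negative
```

A positive coefficient matches the water storage and organic matter pattern described above, while a negative one matches the inverse relationship between water storage and sinkhole probability.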
Sinkhole history features accounted for 52.67% of the feature importance for our model’s prediction. Similar studies have found that areas that historically have a greater number of sinkholes are more likely to develop sinkholes in the future.
The land use classes were the second largest contributor to our model’s prediction, accounting for 12.17% of the feature importance.
Sinkhole formation can be triggered by drastic changes in temperature and groundwater level. Several reports point to severe drought followed by heavy rainfall as a potential trigger. The weather attributes accounted for 9.82% of the feature importance.
Karst landscapes are susceptible to sinkhole development because they are formed from water-soluble rock (e.g. limestone, marble, or gypsum). Soil composition accounted for 5.50% of the feature importance. The lower the water storage, the higher the risk of sinkhole formation.
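As a quick arithmetic check on the figures quoted above, the four feature groups can be ranked and summed. The percentages are the ones stated in the text; the dictionary keys are just labels chosen for this sketch.

```python
# Feature importance by group, in percent, as quoted in the text.
importance_pct = {
    "sinkhole history": 52.67,
    "land use": 12.17,
    "weather": 9.82,
    "soil composition": 5.50,
}

# Rank the groups from most to least important.
ranked = sorted(importance_pct, key=importance_pct.get, reverse=True)

# Share of total importance covered by the four groups, and what is left.
covered = round(sum(importance_pct.values()), 2)
remaining = round(100 - covered, 2)
```

The four groups together cover 80.16% of the feature importance, which leaves 19.84% that the text does not attribute to a named group.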