A Population Spatialization Model at the Building Scale Using Random Forest
Abstract
1. Introduction
2. Data and Preprocessing
- (1) Population data. The actual population data are used as model validation data in this experiment, and the study area contains 300,722 population records.
- (2) Building data. Buildings are the basic units of the experiment, and the dataset contains 117,116 residential buildings.
- (3) Land use dataset. The Finer Resolution Observation and Monitoring of Global Land Cover (FROM-GLC) map set is used as auxiliary data. The maps have a resolution of approximately 30 m and include eight categories: agricultural land, forest, grassland, shrubland, wetland, water, impervious surfaces, and bare ground.
- (4) DMSP-OLS NTL imagery. Version 4 of the Defense Meteorological Satellite Program Operational Linescan System (DMSP-OLS) nighttime light (NTL) remote sensing dataset, synthesized in 2013, is used as auxiliary data for population spatialization. The native resolution is approximately 1 km, and the data were resampled to 100 m.
- (5) Water systems, road networks, and points of interest (POIs). These features also affect the population distribution to a certain extent. The study area includes 85,876 rivers, 27,706 roads, and 4,524 POI records. The number of POIs of each type is shown in Table 2. For each POI type, we calculated the Euclidean distance from each building to the nearest POI of that type and used the results as model inputs.
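The nearest-POI distance features described in item (5) can be illustrated with a minimal sketch. It assumes that each building is represented by a single point (for example, its centroid) in a projected coordinate system with units of metres; the function name `nearest_poi_distances` and the sample coordinates are hypothetical and are not taken from the study.

```python
import math
from collections import defaultdict


def nearest_poi_distances(buildings, pois):
    """Return {building_id: {poi_type: distance to nearest POI of that type}}.

    buildings: dict mapping building_id -> (x, y), assumed to be projected metres.
    pois: iterable of (poi_type, x, y) tuples in the same coordinate system.
    """
    # Group POI coordinates by type so each type yields one distance feature.
    pois_by_type = defaultdict(list)
    for poi_type, x, y in pois:
        pois_by_type[poi_type].append((x, y))

    result = {}
    for building_id, (bx, by) in buildings.items():
        # Brute-force minimum of the planar Euclidean distance per POI type.
        result[building_id] = {
            poi_type: min(math.hypot(bx - px, by - py) for px, py in points)
            for poi_type, points in pois_by_type.items()
        }
    return result


# Hypothetical sample data (not from the paper).
buildings = {"b1": (0.0, 0.0), "b2": (100.0, 0.0)}
pois = [("Medical", 30.0, 40.0), ("Medical", 100.0, 10.0), ("Banks", 0.0, 200.0)]
features = nearest_poi_distances(buildings, pois)
```

The brute-force loop is adequate for a small sample. With 117,116 buildings and 4,524 POIs, a spatial index such as a KD-tree would likely be preferable.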
3. Methods
3.1. Feature Engineering
3.1.1. Feature Filtering
3.1.2. Feature Standardization
3.2. Model Building and Training
4. Results and Evaluation
4.1. RF Population Spatialization Results
4.2. Comparison with an ML Regression Model
5. Discussion
5.1. Feature Importance Analysis
5.2. Feature Contribution Analysis
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Dataset | Format | Source |
---|---|---|
Population (2017) | Table | Hangzhou Public Security Bureau |
Buildings (2017) | Polygon vector features | Basic geographic information database for Hangzhou, China |
Finer Resolution Observation and Monitoring of Global Land Cover (2017) | Grid, 30 m spatial resolution | Tsinghua University Open Data Set (http://data.ess.tsinghua.edu.cn/, accessed on 1 December 2021) |
DMSP-OLS NTL imagery (2013) | Grid, 1 × 1 km spatial resolution | National Geophysical Data Center, USA (https://ngdc.noaa.gov/eog/, accessed on 1 December 2021) |
Road network (2017) | Line vector features | Basic geographic information database for Hangzhou, China |
POIs (2019) | Point features | Baidu Map API, China |
Water system | Polygon vector features | Basic geographic information database for Hangzhou, China |
Administrative districts (2017) | Polygon vector features | Basic geographic information database for Hangzhou, China |
No. | POI Type | Count | No. | POI Type | Count |
---|---|---|---|---|---|
1 | Medical | 295 | 9 | Nursing homes | 9 |
2 | Sports | 62 | 10 | Self-service | 35 |
3 | Education | 439 | 11 | Recreation | 721 |
4 | Parks | 22 | 12 | Government agencies | 396 |
5 | Markets | 186 | 13 | Shopping | 1416 |
6 | Gas stations | 56 | 14 | Factories | 246 |
7 | Museums | 5 | 15 | Banks | 221 |
8 | Retail | 123 | 16 | Corporations | 292 |
No. | Feature Name | Feature Source | No. | Feature Name | Feature Source |
---|---|---|---|---|---|
1 | Building footprint | Building | 16 | Factory_EDIST | POI |
2 | Night lighting_Min | Night lighting | 17 | Company_EDIST | POI |
3 | Night lighting_Max | Night lighting | 18 | Park_EDIST | POI |
4 | Night lighting_Ave | Night lighting | 19 | Store_EDIST | POI |
5 | Night lighting_Sum | Night lighting | 20 | Gas station_EDIST | POI |
6 | Land use type | Land Use | 21 | Education agency_EDIST | POI |
7 | River system_Cnt | River system | 22 | Retail_EDIST | POI |
8 | River system length_Min | River system | 23 | Market_EDIST | POI |
9 | River system length_Max | River system | 24 | Sports facility_EDIST | POI |
10 | River system length_Sum | River system | 25 | Entertainment_EDIST | POI |
11 | Water area_Min | River system | 26 | Nursing home_EDIST | POI |
12 | Water area_Max | River system | 27 | Medical institution_EDIST | POI |
13 | Water area_Sum | River system | 28 | Bank_EDIST | POI |
14 | Road_EDIST | Road | 29 | Government agency_EDIST | POI |
15 | Museum_EDIST | POI | 30 | Self-service_EDIST | POI |
No. | Parameter | Value Range | Optimal Value |
---|---|---|---|
1 | bootstrap | True, False | True |
2 | oob_score | True, False | True |
3 | n_estimators | 100, 200, …, 1500 | 1100 |
4 | max_features | auto, sqrt, log2 | auto |
5 | max_depth | 1, 2, …, 20 | 16 |
6 | min_samples_leaf | 1, 2, …, 20 | 19 |
7 | min_samples_split | 2, 4, …, 20 | 18 |
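
The search space in the table above can be written down as a short sketch to show how large it is. The parameter names match those of scikit-learn's `RandomForestRegressor`, which is assumed here. The step sizes (100 for `n_estimators`, 2 for `min_samples_split`) are read from the ellipses in the table and are also an assumption. How the space was actually searched is not stated in this section.

```python
from math import prod

# Search space transcribed from the table; step sizes are assumed.
param_grid = {
    "bootstrap": [True, False],
    "oob_score": [True, False],
    "n_estimators": list(range(100, 1501, 100)),
    "max_features": ["auto", "sqrt", "log2"],
    "max_depth": list(range(1, 21)),
    "min_samples_leaf": list(range(1, 21)),
    "min_samples_split": list(range(2, 21, 2)),
}

# Optimal values reported in the table.
best_params = {
    "bootstrap": True,
    "oob_score": True,
    "n_estimators": 1100,
    "max_features": "auto",
    "max_depth": 16,
    "min_samples_leaf": 19,
    "min_samples_split": 18,
}

# Sanity check: every reported optimum lies inside its stated range.
assert all(best_params[name] in values for name, values in param_grid.items())

# Size of the full Cartesian grid.
full_grid_size = prod(len(values) for values in param_grid.values())

# scikit-learn rejects oob_score=True when bootstrap=False,
# so only three of the four bootstrap/oob_score pairs are usable.
valid_pairs = [
    (b, o)
    for b in param_grid["bootstrap"]
    for o in param_grid["oob_score"]
    if b or not o
]
valid_grid_size = full_grid_size // 4 * len(valid_pairs)
```

Under these assumptions the full grid has 720,000 combinations, of which 540,000 are valid, so an exhaustive search with cross-validation would be expensive. Note also that `max_features="auto"` was removed in scikit-learn 1.3; for a regressor it meant using all features, which newer versions express as `max_features=1.0`.
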
Variable Name | Coefficient | Variable Name | Coefficient |
---|---|---|---|
Building footprint | 3.45154761 | Factory_EDIST | −0.09081666 |
Night lighting_Min | 0 | Company_EDIST | 0.28783718 |
Night lighting_Max | −0.32496302 | Park_EDIST | 0.17310836 |
Night lighting_Ave | 0 | Store_EDIST | −0.58551197 |
Night lighting_Sum | −0.06787263 | Gas_EDIST | −0.29694836 |
Land use type | 0.1270625 | Education_EDIST | 0.20848317 |
River system_Cnt | −0.02700125 | Retail_EDIST | 0.78328459 |
River system length_Min | 0.16721964 | Market_EDIST | −0.7959491 |
River system length_Max | 0.3384865 | Sports_EDIST | 0.47081309 |
River system length_Sum | −0.51151035 | Leisure_EDIST | −0.09139956 |
Water area_Min | 0 | Nursing_EDIST | 0.29575162 |
Water area_Max | 0.39497887 | Medical_EDIST | −0.39804979 |
Water area_Sum | 0 | Bank_EDIST | 0.26058286 |
Road_EDIST | −0.30963247 | Government_EDIST | −0.02576417 |
Museum_EDIST | −0.21837239 | Self-service_EDIST | −0.98357816 |
Model Name | MAE | RMSE | R² |
---|---|---|---|
Random forest | 2.52 | 8.2 | 0.44 |
Multiple linear regression | 3.21 | 9.8 | 0.18 |
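
For readers who want to reproduce the three accuracy measures in the table above, the following is a minimal sketch using the standard definitions of MAE, RMSE, and the coefficient of determination. It assumes these standard definitions are the ones used; the formulas themselves are not shown in this section. The function name `regression_metrics` and the sample values are hypothetical.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R^2) for paired observed and predicted values."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]

    # Mean absolute error.
    mae = sum(abs(e) for e in errors) / n

    # Root mean squared error.
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)

    # Coefficient of determination: 1 - residual SS / total SS.
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot

    return mae, rmse, r2


# Hypothetical per-building populations (not from the paper).
y_true = [2.0, 4.0, 6.0, 8.0]
y_pred = [3.0, 4.0, 5.0, 10.0]
mae, rmse, r2 = regression_metrics(y_true, y_pred)
```

Lower MAE and RMSE and a higher R² indicate a better fit, which is the direction in which the random forest row differs from the multiple linear regression row.
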
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Wang, M.; Wang, Y.; Li, B.; Cai, Z.; Kang, M. A Population Spatialization Model at the Building Scale Using Random Forest. Remote Sens. 2022, 14, 1811. https://doi.org/10.3390/rs14081811