A Query-Based Network for Rural Homestead Extraction from VHR Remote Sensing Images
Abstract
1. Introduction
- We propose an end-to-end instance segmentation method based on the Transformer architecture and build a multi-scale deformable attention module in the encoder to aggregate cross-hierarchical global context information and strengthen long-range dependencies. This design also greatly reduces the number of parameters and improves the convergence speed;
- Group queries are built in the decoder, changing the one-to-one label assignment of the original DETR to a many-to-one label assignment, which greatly improves the convergence speed of the decoder. Moreover, no additional parameters are required at the inference stage;
- We propose a method that addresses the difficulty of accurately extracting rural homesteads caused by their dense distribution. The extracted results can be used to derive the number, area, and other information about homesteads.
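The last contribution mentions deriving the number and area of homesteads from the extracted instances. A minimal sketch of that post-processing step follows, assuming each predicted instance is available as a binary mask and that the ground sampling distance of the imagery is known. The function name `homestead_statistics` and the parameter `gsd_m` are hypothetical and do not come from the paper.

```python
from typing import Dict, List

Mask = List[List[int]]  # binary instance mask, 1 = homestead pixel


def homestead_statistics(masks: List[Mask], gsd_m: float) -> Dict[str, object]:
    """Summarise per-instance masks as a count and per-instance areas.

    gsd_m is the ground sampling distance in metres per pixel
    (an assumed input; the real value depends on the imagery).
    """
    pixel_area = gsd_m * gsd_m  # square metres covered by one pixel
    areas = [sum(sum(row) for row in mask) * pixel_area for mask in masks]
    return {
        "count": len(areas),
        "areas_m2": areas,
        "total_area_m2": sum(areas),
    }


# Two toy 3x3 instance masks: one with 4 pixels, one with 2 pixels.
masks = [
    [[1, 1, 0], [1, 1, 0], [0, 0, 0]],
    [[0, 0, 0], [0, 0, 1], [0, 0, 1]],
]
stats = homestead_statistics(masks, gsd_m=0.5)
print(stats)
```

Because instance segmentation yields one mask per homestead, the count is simply the number of masks, which is what makes dense, adjacent homesteads separable in the statistics.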
2. Materials and Methods
2.1. Data Collection
2.2. Methods
2.2.1. Overview of Network Architecture
2.2.2. Multi-Scale Deformable Attention Module
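To make the idea behind this module concrete, the following is a simplified sketch of multi-scale deformable attention for a single query and a single channel, in the spirit of Deformable DETR. It is not the authors' implementation. In a real model the sampling offsets and attention logits are predicted from the query by learned linear layers; here they are passed in directly, and all names (`ms_deform_attn`, `bilinear_sample`, `offsets`, `logits`) are illustrative assumptions.

```python
import math
from typing import List, Sequence, Tuple

FeatureMap = List[List[float]]  # single-channel map, indexed [row][col]


def bilinear_sample(fmap: FeatureMap, x: float, y: float) -> float:
    """Sample fmap at continuous pixel coords (x, y); outside counts as zero."""
    h, w = len(fmap), len(fmap[0])
    x0, y0 = math.floor(x), math.floor(y)
    value = 0.0
    for yi, wy in ((y0, 1 - (y - y0)), (y0 + 1, y - y0)):
        for xi, wx in ((x0, 1 - (x - x0)), (x0 + 1, x - x0)):
            if 0 <= xi < w and 0 <= yi < h:
                value += wx * wy * fmap[yi][xi]
    return value


def softmax(logits: Sequence[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ms_deform_attn(
    levels: List[FeatureMap],
    ref_point: Tuple[float, float],
    offsets: List[List[Tuple[float, float]]],
    logits: List[List[float]],
) -> float:
    """Attend to a few sampled points per feature level around one reference.

    ref_point is (x, y) normalised to [0, 1]; offsets[l][k] is (dx, dy) in
    pixels of level l; logits[l][k] is the unnormalised attention weight.
    """
    # Weights are normalised jointly over all levels and sampling points.
    weights = softmax([v for level in logits for v in level])
    out, i = 0.0, 0
    for fmap, level_offsets in zip(levels, offsets):
        h, w = len(fmap), len(fmap[0])
        # Map the normalised reference to this level (pixel-centre convention).
        px, py = ref_point[0] * w - 0.5, ref_point[1] * h - 0.5
        for dx, dy in level_offsets:
            out += weights[i] * bilinear_sample(fmap, px + dx, py + dy)
            i += 1
    return out


# Two constant feature levels (4x4 of 1.0 and 2x2 of 3.0), one point per level.
levels = [[[1.0] * 4 for _ in range(4)], [[3.0] * 2 for _ in range(2)]]
out = ms_deform_attn(levels, (0.5, 0.5), [[(0.0, 0.0)], [(0.0, 0.0)]], [[0.0], [0.0]])
print(out)
```

The point of the design is that each query aggregates only a handful of sampled locations per level instead of every position in every feature map, so the cost per query no longer grows with the image size.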
2.2.3. Self-Attention Module and Cross-Attention Module
2.2.4. Group Queries Assignment
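As a rough illustration of the many-to-one assignment named in the contributions, the sketch below splits the queries into groups and matches each group to the ground-truth objects one-to-one and independently, so that every ground-truth object ends up with one positive query per group. This follows the general group-wise matching idea and is an assumption about the mechanism, not the paper's code. The brute-force matcher stands in for the Hungarian algorithm and is only suitable for toy sizes; `match_one_to_one` and `group_assign` are hypothetical names.

```python
from itertools import permutations
from typing import List, Tuple


def match_one_to_one(cost: List[List[float]]) -> List[Tuple[int, int]]:
    """Minimum-cost one-to-one matching by brute force.

    cost[q][g] is the cost of assigning query q to ground truth g.
    Assumes there are at least as many queries as ground-truth objects.
    """
    n_q, n_gt = len(cost), len(cost[0])
    best, best_total = None, float("inf")
    for queries in permutations(range(n_q), n_gt):
        total = sum(cost[q][g] for g, q in enumerate(queries))
        if total < best_total:
            best, best_total = queries, total
    return [(q, g) for g, q in enumerate(best)]


def group_assign(cost: List[List[float]], num_groups: int) -> List[Tuple[int, int]]:
    """Match each query group to the ground truths independently."""
    per_group = len(cost) // num_groups
    matches = []
    for k in range(num_groups):
        start = k * per_group
        block = cost[start:start + per_group]
        matches += [(start + q, g) for q, g in match_one_to_one(block)]
    return matches


# 2 groups x 3 queries, 2 ground-truth objects (toy costs).
cost = [
    [0.1, 0.9], [0.8, 0.2], [0.5, 0.5],  # group 0
    [0.7, 0.3], [0.6, 0.6], [0.2, 0.9],  # group 1
]
matches = group_assign(cost, num_groups=2)
print(sorted(matches))
```

Under this assumption, only one group of queries would need to be kept at inference, which is consistent with the statement that no additional parameters are required at that stage.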
2.2.5. Loss Design
3. Results
3.1. Dataset
3.2. Experimental Environment and Details
3.3. Comparison of Experimental Results
3.4. Performance Effect and Transferability of QueryFormer
3.5. Hyperparameters Contrast Experiment
4. Discussion
4.1. What Deformable Attention Learns in QueryFormer Encoder
4.2. What Query Learns in QueryFormer Decoder
4.3. What Group Queries Learn in QueryFormer Decoder
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Method | Backbone | Query Method | AP | AP_S | AP_M | AP_L |
---|---|---|---|---|---|---|
Mask-RCNN | R101 | Anchors | 39.8 | 23.8 | 48.3 | 51.1 |
Cascade-RCNN | R101 | Anchors | 44.5 | 25.9 | 51.8 | 53.5 |
HTC | R101 | Anchors | 47.1 | 28.6 | 53.3 | 59.9 |
QueryInst | R101 | 300 Queries | 46.5 | 28.2 | 54.1 | 60.2 |
Mask2Former | R101 | 300 Queries | 47.7 | 27.4 | 54.8 | 60.8 |
QueryFormer | R101 | 300 Queries | 48.3 | 27.9 | 55.7 | 61.6 |
Mask-RCNN | Swin-B | Anchors | 44.2 | 27.1 | 54.1 | 59.4 |
Cascade-RCNN | Swin-B | Anchors | 46.7 | 27.9 | 55.7 | 61.4 |
HTC | Swin-B | Anchors | 51.1 | 30.4 | 56.5 | 62.3 |
QueryInst | Swin-B | 300 Queries | 50.3 | 30.1 | 56.9 | 62.5 |
Mask2Former | Swin-B | 300 Queries | 52.0 | 29.3 | 57.6 | 63.8 |
QueryFormer | Swin-B | 300 Queries | 52.8 | 29.9 | 58.6 | 64.7 |
Model | Queries | Schedule | AP | AP_S | AP_M | AP_L |
---|---|---|---|---|---|---|
QueryFormer | 100 | 12e | 37.6 | 16.1 | 44.7 | 47.3 |
QueryFormer | 200 | 12e | 39.2 | 18.9 | 47.8 | 52.0 |
QueryFormer | 300 | 12e | 41.4 | 20.6 | 49.5 | 53.7 |
QueryFormer | 400 | 12e | 41.7 | 20.8 | 49.5 | 54.1 |
Model | Groups | Schedule | AP | AP_S | AP_M | AP_L |
---|---|---|---|---|---|---|
QueryFormer | 1 | 12e | 38.7 | 18.4 | 46.1 | 49.7 |
QueryFormer | 2 | 12e | 40.5 | 19.4 | 47.8 | 51.3 |
QueryFormer | 3 | 12e | 40.8 | 19.9 | 48.4 | 52.4 |
QueryFormer | 4 | 12e | 41.1 | 20.3 | 49.1 | 53.2 |
QueryFormer | 5 | 12e | 41.4 | 20.6 | 49.5 | 53.7 |
QueryFormer | 1 | 50e | 50.8 | 27.1 | 55.4 | 61.9 |
QueryFormer | 5 | 50e | 52.8 | 29.9 | 58.6 | 64.7 |
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Wei, R.; Fan, B.; Wang, Y.; Yang, R. A Query-Based Network for Rural Homestead Extraction from VHR Remote Sensing Images. Sensors 2023, 23, 3643. https://doi.org/10.3390/s23073643