Faster and Better: A Lightweight Transformer Network for Remote Sensing Scene Classification
Abstract
1. Introduction
- An MLGC module with low computational cost is proposed, which uses co-representations of multi-level and multi-group features to enrich the diversity of local features in RS scenes (a minimal illustrative sketch follows this list).
- By introducing the MLGC module into an ordinary transformer block, we design a LightFormer block with fewer parameters and FLOPs that captures both rich multi-level local features and long-range dependencies in RS images.
- We build the efficient LTNet from the MLGC module and the LightFormer block for RS scene classification.
- Experiments on four RS scene classification datasets show that LTNet achieves competitive classification performance with less training time.
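To make the first two contributions concrete, the following is a minimal, hypothetical PyTorch sketch of an MLGC-style module: one level of output channels comes from a grouped 3 × 3 convolution on the input, and each further level comes from a cheaper grouped convolution applied to the previous level, after which all levels are concatenated. The class name `MLGCSketch` and the parameters `groups` and `levels` are assumptions made for illustration; this is not the authors' implementation.

```python
# Hypothetical sketch of an MLGC-style (multi-level grouped convolution) module.
# The wiring, class name, and parameters are illustrative assumptions only.
import torch
import torch.nn as nn


class MLGCSketch(nn.Module):
    """Produce output channels in several levels: level 1 from a grouped 3x3
    convolution on the input, each later level from a cheaper grouped convolution
    applied to the previous level; all levels are concatenated."""

    def __init__(self, in_channels: int, out_channels: int, groups: int = 2, levels: int = 2):
        super().__init__()
        assert out_channels % levels == 0, "out_channels must split evenly across levels"
        step = out_channels // levels
        self.stages = nn.ModuleList()
        prev = in_channels
        for _ in range(levels):
            self.stages.append(nn.Sequential(
                nn.Conv2d(prev, step, 3, padding=1, groups=groups, bias=False),
                nn.BatchNorm2d(step),
                nn.ReLU(inplace=True),
            ))
            prev = step  # later levels re-use the previous level's features

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        outputs = []
        for stage in self.stages:
            x = stage(x)          # each level feeds the next
            outputs.append(x)
        # Concatenating all levels enriches the diversity of local features.
        return torch.cat(outputs, dim=1)


if __name__ == "__main__":
    x = torch.randn(1, 64, 32, 32)
    print(MLGCSketch(64, 128, groups=2, levels=2)(x).shape)  # torch.Size([1, 128, 32, 32])
```

Generating part of the channels from already-computed features with grouped convolutions is what keeps the parameter and FLOP counts low, in the spirit of Ghost modules [44].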
2. Related Work
2.1. CNN-Based RS Scene Classification Methods
2.2. Vision Transformer Networks
2.3. Lightweight CNNs
3. Method
3.1. MLGC Module
3.2. LightFormer Block
3.3. LTNet
4. Experiments
4.1. Experiments on Four RS Scene Classification Datasets
4.1.1. Dataset Description
4.1.2. Experimental Settings
4.1.3. Experimental Results
4.2. Experiments on Two Natural Image Classification Datasets
4.2.1. Dataset Description
4.2.2. Experimental Settings
4.2.3. Comparison Experiments on CIFAR-10 and ImageNet
4.2.4. Ablation Experiments on CIFAR-10
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
1. Xiao, Y.; Zhan, Q. A review of remote sensing applications in urban planning and management in China. In Proceedings of the 2009 Joint Urban Remote Sensing Event, Shanghai, China, 20–22 May 2009; IEEE: Piscataway Township, NJ, USA, 2009; pp. 1–5.
2. Martha, T.R.; Kerle, N.; Van Westen, C.J.; Jetten, V.; Kumar, K.V. Segment optimization and data-driven thresholding for knowledge-based landslide detection by object-based image analysis. IEEE Trans. Geosci. Remote Sens. 2011, 49, 4928–4943.
3. Stumpf, A.; Kerle, N. Object-oriented mapping of landslides using Random Forests. Remote Sens. Environ. 2011, 115, 2564–2577.
4. Cheng, G.; Guo, L.; Zhao, T.; Han, J.; Li, H.; Fang, J. Automatic landslide detection from remote-sensing imagery using a scene classification method based on BoVW and pLSA. Int. J. Remote Sens. 2013, 34, 45–59.
5. Tong, X.Y.; Xia, G.S.; Lu, Q.; Shen, H.; Li, S.; You, S.; Zhang, L. Land-cover classification with high-resolution remote sensing images using transferable deep models. Remote Sens. Environ. 2020, 237, 111322.
6. Li, Y.; Zhang, Y.; Tao, C.; Zhu, H. Content-based high-resolution remote sensing image retrieval via unsupervised feature learning and collaborative affinity metric fusion. Remote Sens. 2016, 8, 709.
7. Du, Z.; Li, X.; Lu, X. Local structure learning in high resolution remote sensing image retrieval. Neurocomputing 2016, 207, 813–822.
8. Duan, Y.; Liu, F.; Jiao, L.; Zhao, P.; Zhang, L. SAR image segmentation based on convolutional-wavelet neural network and Markov random field. Pattern Recognit. 2017, 64, 255–267.
9. Jiao, L.; Zhang, S.; Li, L.; Liu, F.; Ma, W. A modified convolutional neural network for face sketch synthesis. Pattern Recognit. 2018, 76, 125–136.
10. Li, L.; Ma, L.; Jiao, L.; Liu, F.; Sun, Q.; Zhao, J. Complex Contourlet-CNN for polarimetric SAR image classification. Pattern Recognit. 2020, 100, 107110.
11. Wang, J.; Duan, Y.; Tao, X.; Xu, M.; Lu, J. Semantic perceptual image compression with a Laplacian pyramid of convolutional networks. IEEE Trans. Image Process. 2021, 30, 4225–4237.
12. Singh, P.; Mazumder, P.; Namboodiri, V.P. Context extraction module for deep convolutional neural networks. Pattern Recognit. 2022, 122, 108284.
13. Cui, Y.; Liu, F.; Jiao, L.; Guo, Y.; Liang, X.; Li, L.; Yang, S.; Qian, X. Polarimetric multipath convolutional neural network for PolSAR image classification. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–18.
14. Nogueira, K.; Penatti, O.A.; Dos Santos, J.A. Towards better exploiting convolutional neural networks for remote sensing scene classification. Pattern Recognit. 2017, 61, 539–556.
15. Cheng, G.; Han, J.; Lu, X. Remote sensing image scene classification: Benchmark and state of the art. Proc. IEEE 2017, 105, 1865–1883.
16. Bazi, Y.; Al Rahhal, M.M.; Alhichri, H.; Alajlan, N. Simple yet effective fine-tuning of deep CNNs using an auxiliary classification loss for remote sensing scene classification. Remote Sens. 2019, 11, 2908.
17. Li, W.; Wang, Z.; Wang, Y.; Wu, J.; Wang, J.; Jia, Y.; Gui, G. Classification of high-spatial-resolution remote sensing scenes method using transfer learning and deep convolutional neural network. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 13, 1986–1995.
18. Lu, X.; Sun, H.; Zheng, X. A feature aggregation convolutional neural network for remote sensing scene classification. IEEE Trans. Geosci. Remote Sens. 2019, 57, 7894–7906.
19. Sun, H.; Li, S.; Zheng, X.; Lu, X. Remote sensing scene classification by gated bidirectional network. IEEE Trans. Geosci. Remote Sens. 2019, 58, 82–96.
20. He, N.; Fang, L.; Li, S.; Plaza, A.; Plaza, J. Remote sensing scene classification using multilayer stacked covariance pooling. IEEE Trans. Geosci. Remote Sens. 2018, 56, 6899–6910.
21. Liu, Y.; Liu, Y.; Ding, L. Scene classification based on two-stage deep feature fusion. IEEE Geosci. Remote Sens. Lett. 2017, 15, 183–186.
22. Xue, W.; Dai, X.; Liu, L. Remote sensing scene classification based on multi-structure deep features fusion. IEEE Access 2020, 8, 28746–28755.
23. Wang, X.; Wang, S.; Ning, C.; Zhou, H. Enhanced feature pyramid network with deep semantic embedding for remote sensing scene classification. IEEE Trans. Geosci. Remote Sens. 2021, 59, 7918–7932.
24. Wang, Q.; Liu, S.; Chanussot, J.; Li, X. Scene classification with recurrent attention of VHR remote sensing images. IEEE Trans. Geosci. Remote Sens. 2019, 57, 1155–1167.
25. Tang, X.; Ma, Q.; Zhang, X.; Liu, F.; Ma, J.; Jiao, L. Attention consistent network for remote sensing scene classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 2030–2045.
26. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems 30 (NIPS), Long Beach, CA, USA, 4–9 December 2017.
27. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of the 9th International Conference on Learning Representations (ICLR), Vienna, Austria, 4 May 2021.
28. Wu, H.; Xiao, B.; Codella, N.; Liu, M.; Dai, X.; Yuan, L.; Zhang, L. CvT: Introducing convolutions to vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, BC, Canada, 11–17 October 2021; pp. 22–31.
29. Yuan, L.; Chen, Y.; Wang, T.; Yu, W.; Shi, Y.; Jiang, Z.H.; Tay, F.E.; Feng, J.; Yan, S. Tokens-to-token ViT: Training vision transformers from scratch on ImageNet. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, BC, Canada, 11–17 October 2021; pp. 558–567.
30. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; pp. 213–229.
31. Sun, Z.; Cao, S.; Yang, Y.; Kitani, K.M. Rethinking transformer-based set prediction for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, BC, Canada, 11–17 October 2021; pp. 3611–3620.
32. Chen, H.; Wang, Y.; Guo, T.; Xu, C.; Deng, Y.; Liu, Z.; Ma, S.; Xu, C.; Xu, C.; Gao, W. Pre-trained image processing transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 12299–12310.
33. Wang, Y.; Xu, Z.; Wang, X.; Shen, C.; Cheng, B.; Shen, H.; Xia, H. End-to-end video instance segmentation with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 8741–8750.
34. Zheng, S.; Lu, J.; Zhao, H.; Zhu, X.; Luo, Z.; Wang, Y.; Fu, Y.; Feng, J.; Xiang, T.; Torr, P.H.; et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 6881–6890.
35. Li, S.; Liu, F.; Jiao, L. Self-training multi-sequence learning with transformer for weakly supervised video anomaly detection. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), Virtual Event, 22 February–1 March 2022.
36. Bazi, Y.; Bashmal, L.; Rahhal, M.M.A.; Dayil, R.A.; Ajlan, N.A. Vision transformers for remote sensing image classification. Remote Sens. 2021, 13, 516.
37. Ma, J.; Li, M.; Tang, X.; Zhang, X.; Liu, F.; Jiao, L. Homo–heterogenous transformer learning framework for RS scene classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 2223–2239.
38. Srinivas, A.; Lin, T.Y.; Parmar, N.; Shlens, J.; Abbeel, P.; Vaswani, A. Bottleneck transformers for visual recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 16519–16529.
39. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861.
40. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 4510–4520.
41. Howard, A.; Sandler, M.; Chu, G.; Chen, L.C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. Searching for MobileNetV3. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1314–1324.
42. Zhang, X.; Zhou, X.; Lin, M.; Sun, J. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 6848–6856.
43. Ma, N.; Zhang, X.; Zheng, H.T.; Sun, J. ShuffleNet V2: Practical guidelines for efficient CNN architecture design. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 116–131.
44. Han, K.; Wang, Y.; Tian, Q.; Guo, J.; Xu, C.; Xu, C. GhostNet: More features from cheap operations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 1580–1589.
45. Hassani, A.; Walton, S.; Shah, N.; Abuduweili, A.; Li, J.; Shi, H. Escaping the big data paradigm with compact transformers. arXiv 2021, arXiv:2104.05704.
46. Graham, B.; El-Nouby, A.; Touvron, H.; Stock, P.; Joulin, A.; Jégou, H.; Douze, M. LeViT: A vision transformer in ConvNet's clothing for faster inference. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, BC, Canada, 11–17 October 2021; pp. 12259–12269.
47. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Miami, FL, USA, 20–25 June 2009; IEEE: Piscataway Township, NJ, USA, 2009; pp. 248–255.
48. Yang, Y.; Newsam, S. Bag-of-visual-words and spatial extensions for land-use classification. In Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems, San Jose, CA, USA, 2–5 November 2010; pp. 270–279.
49. Xia, G.S.; Hu, J.; Hu, F.; Shi, B.; Bai, X.; Zhong, Y.; Zhang, L.; Lu, X. AID: A benchmark data set for performance evaluation of aerial scene classification. IEEE Trans. Geosci. Remote Sens. 2017, 55, 3965–3981.
50. Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Training data-efficient image transformers & distillation through attention. In Proceedings of the International Conference on Machine Learning (ICML), Virtual Event, 18–24 July 2021; pp. 10347–10357.
51. Chen, C.F.R.; Fan, Q.; Panda, R. CrossViT: Cross-attention multi-scale vision transformer for image classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, BC, Canada, 11–17 October 2021; pp. 357–366.
52. Dai, Z.; Cai, B.; Lin, Y.; Chen, J. UP-DETR: Unsupervised pre-training for object detection with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 1601–1610.
53. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable transformers for end-to-end object detection. In Proceedings of the 9th International Conference on Learning Representations (ICLR), Virtual Event, 3–7 May 2021.
54. Mehta, S.; Rastegari, M. MobileViT: Light-weight, general-purpose, and mobile-friendly vision transformer. In Proceedings of the 10th International Conference on Learning Representations (ICLR), Virtual Event, 25–29 April 2022.
55. He, Z.; Yuan, Z.; An, P.; Zhao, J.; Du, B. MFB-LANN: A lightweight and updatable myocardial infarction diagnosis system based on convolutional neural networks and active learning. Comput. Methods Programs Biomed. 2021, 210, 106379.
56. Jiang, X.; Wang, N.; Xin, J.; Xia, X.; Yang, X.; Gao, X. Learning lightweight super-resolution networks with weight pruning. Neural Netw. 2021, 144, 21–32.
57. Qian, X.; Liu, F.; Jiao, L.; Zhang, X.; Guo, Y.; Liu, X.; Cui, Y. Ridgelet-nets with speckle reduction regularization for SAR image scene classification. IEEE Trans. Geosci. Remote Sens. 2021, 59, 9290–9306.
58. Ma, H.; Yang, S.; Feng, D.; Jiao, L.; Zhang, L. Progressive mimic learning: A new perspective to train lightweight CNN models. Neurocomputing 2021, 456, 220–231.
59. Ioannou, Y.; Robertson, D.; Cipolla, R.; Criminisi, A. Deep roots: Improving CNN efficiency with hierarchical filter groups. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1231–1240.
60. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. In Proceedings of the 3rd International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015.
61. Krizhevsky, A.; Hinton, G. Learning Multiple Layers of Features from Tiny Images; Technical Report; University of Toronto: Toronto, ON, Canada, 2009.
62. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
63. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An imperative style, high-performance deep learning library. In Proceedings of the Advances in Neural Information Processing Systems 32 (NIPS), Vancouver, BC, Canada, 8–14 December 2019.
64. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. In Proceedings of the Advances in Neural Information Processing Systems 25 (NIPS), Lake Tahoe, NV, USA, 3–6 December 2012.
65. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 2818–2826.
66. Tan, M.; Le, Q. EfficientNet: Rethinking model scaling for convolutional neural networks. In Proceedings of the International Conference on Machine Learning (ICML), Long Beach, CA, USA, 9–15 June 2019; pp. 6105–6114.
67. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 1–9.
68. Tang, Y.; Han, K.; Guo, J.; Xu, C.; Xu, C.; Wang, Y. GhostNetV2: Enhance cheap operation with long-range attention. Adv. Neural Inf. Process. Syst. (NIPS) 2022, 35, 9969–9982.
69. Luo, J.H.; Wu, J.; Lin, W. ThiNet: A filter level pruning method for deep neural network compression. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 5058–5066.
70. Wang, Y.; Xu, C.; Xu, C.; Xu, C.; Tao, D. Learning versatile filters for efficient convolutional neural networks. In Proceedings of the Advances in Neural Information Processing Systems 31 (NIPS), Montréal, QC, Canada, 3–8 December 2018.
71. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Semantic image segmentation with deep convolutional nets and fully connected CRFs. In Proceedings of the 3rd International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015.
72. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 834–848.
| Layer Name | Output Size | VGG-16 | Ghost-VGG-16 | MLGC-VGG-16 | LightFormer-VGG-16 |
|---|---|---|---|---|---|
| v1 | 32 × 32 | Conv3-64 | Conv3-64 | Conv3-64 | Conv3-64 |
| v2 | | Conv3-64 | Ghost3-64 | MLGC3-64 | MLGC3-64 |
| Max pool | | | | | |
| v3 | 16 × 16 | Conv3-128 | Ghost3-128 | MLGC3-128 | MLGC3-128 |
| v4 | | Conv3-128 | Ghost3-128 | MLGC3-128 | MLGC3-128 |
| Max pool | | | | | |
| v5 | 8 × 8 | Conv3-256 | Ghost3-256 | MLGC3-256 | MLGC3-256 |
| v6 | | Conv3-256 | Ghost3-256 | MLGC3-256 | - |
| v7 | | Conv3-256 | Ghost3-256 | MLGC3-256 | - |
| Max pool | | | | | |
| v8 | 4 × 4 | Conv3-512 | Ghost3-512 | MLGC3-512 | LightFormer-512 |
| v9 | | Conv3-512 | Ghost3-512 | MLGC3-512 | - |
| v10 | | Conv3-512 | Ghost3-512 | MLGC3-512 | - |
| Max pool | | | | | |
| v11 | 2 × 2 | Conv3-512 | Ghost3-512 | MLGC3-512 | - |
| v12 | | Conv3-512 | Ghost3-512 | MLGC3-512 | - |
| v13 | | Conv3-512 | Ghost3-512 | MLGC3-512 | - |
| Max pool, FC-512, FC-10, Soft-max | | | | | |
| Weights (M) | | 15.0 | 7.7 | 6.0 | 0.8 |
| FLOPs (M) | | 315 | 159 | 127 | 55 |
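In the LightFormer-VGG-16 column above, the deeper convolutional stages are dropped and a single LightFormer block at v8 operates on a small (4 × 4) feature map. The sketch below shows one plausible way such a block could pair cheap local mixing with multi-head self-attention for long-range dependencies; the structure, the class name `LightFormerBlockSketch`, and the use of a plain grouped convolution as a stand-in for the MLGC module are illustrative assumptions, not the paper's exact design.

```python
# Hypothetical LightFormer-style block: a lightweight grouped convolution stands in
# for the MLGC module (local features), followed by multi-head self-attention over
# the flattened feature map (long-range dependencies). Illustrative only.
import torch
import torch.nn as nn


class LightFormerBlockSketch(nn.Module):
    def __init__(self, channels: int, heads: int = 4, groups: int = 2):
        super().__init__()
        # Cheap local mixing (stand-in for an MLGC module).
        self.local = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, groups=groups, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        x = self.local(x) + x                      # local features with a residual
        tokens = x.flatten(2).transpose(1, 2)      # (B, H*W, C) token sequence
        y = self.norm(tokens)
        y, _ = self.attn(y, y, y)                  # global self-attention
        tokens = tokens + y                        # residual connection
        return tokens.transpose(1, 2).reshape(b, c, h, w)


if __name__ == "__main__":
    block = LightFormerBlockSketch(channels=512, heads=4)
    print(block(torch.randn(1, 512, 4, 4)).shape)  # torch.Size([1, 512, 4, 4])
```

Because the attention runs on only 16 tokens at this resolution, its cost stays small next to the convolutional stages, which is consistent with the low FLOP count reported for LightFormer-VGG-16 above.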
| Layer Name | Output Size | ResNet-50 | Ghost-ResNet-50 | MLGC-ResNet-50 | LTNet |
|---|---|---|---|---|---|
| r1 | 112 × 112 | Conv7-64, Stride 2 | Conv7-64, Stride 2 | Conv7-64, Stride 2 | Conv7-64, Stride 2 |
| r2 | 56 × 56 | Max pool, Stride 2 | | | |
| r3 | 28 × 28 | | | | |
| r4 | 14 × 14 | | | | |
| r5 | 7 × 7 | | | | |
| Average pool, FC-1000, Soft-max | | | | | |
| Weights (M) | | 25.6 | 13.9 | 13.1 | 8.2 |
| FLOPs (B) | | 4.1 | 2.2 | 1.9 | 1.7 |
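The Weights and FLOPs rows in these architecture tables can be reproduced for any PyTorch model with a plain parameter count and a FLOP counter. The snippet below is a generic sketch: torchvision's `resnet50` stands in for the models above, and fvcore's `FlopCountAnalysis` is one common counter, not necessarily the tool the authors used.

```python
# Generic sketch for filling in the "Weights (M)" and "FLOPs (B)" columns.
# torchvision's resnet50 is a stand-in; fvcore is one common FLOP counter.
import torch
from torchvision.models import resnet50
from fvcore.nn import FlopCountAnalysis

model = resnet50().eval()
params_m = sum(p.numel() for p in model.parameters()) / 1e6          # parameters in millions
flops_b = FlopCountAnalysis(model, (torch.randn(1, 3, 224, 224),)).total() / 1e9

print(f"Weights: {params_m:.1f} M, FLOPs: {flops_b:.1f} B")
```

For a standard ResNet-50 at a 224 × 224 input this should give roughly 25.6 M parameters and about 4.1 B FLOPs, matching the first column of the table above (fvcore counts one FLOP per multiply-accumulate).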
| Pre-trained Network | Weights | FLOPs | Method | Merced (50% Train) | AID (20% Train) | Optimal31 (80% Train) | NWPU (10% Train) |
|---|---|---|---|---|---|---|---|
| VGG-16 [60] | 138.4 M | 15.5 G | Fine-tuning [19] | 96.57 ± 0.38 | 89.49 ± 0.34 | 89.52 ± 0.26 | - |
| | | | Fine-tuning [15] | - | - | - | 87.15 ± 0.45 |
| | | | ARCNet-VGG16 [24] | 96.81 ± 0.14 | 88.75 ± 0.40 | 92.70 ± 0.35 | - |
| | | | GBNet + global feature [19] | 97.05 ± 0.19 | 92.20 ± 0.23 | 93.28 ± 0.27 | - |
| | | | +MSCP [20] | - | 91.52 ± 0.21 | - | 85.33 ± 0.17 |
| | | | ACNet [25] | - | 93.33 ± 0.29 | - | 91.09 ± 0.13 |
| AlexNet [64] | 61.0 M | 724 M | Fine-tuning [15] | - | - | - | 81.22 ± 0.19 |
| | | | ARCNet-AlexNet [24] | - | - | 85.75 ± 0.35 | - |
| | | | +MSCP [20] | - | 88.99 ± 0.38 | - | 81.70 ± 0.23 |
| Inception-v3 [65] | 24 M | 5.7 G | Inception-v3-aux [16] | 97.63 ± 0.20 | 93.52 ± 0.21 | 94.13 ± 0.35 | 89.32 ± 0.33 |
| ResNet-34 [62] | 21.8 M | 3.7 G | ARCNet-ResNet34 [24] | - | - | 91.28 ± 0.45 | - |
| Ghost-ResNet-50 [44] | 13.9 M | 2.0 G | Fine-tuning | 98.25 ± 0.22 | 94.66 ± 0.12 | 94.73 ± 0.58 | 91.79 ± 0.16 |
| EfficientNet-B3 [66] | 12 M | 1.8 G | EfficientNet-B3-aux [16] | 98.22 ± 0.49 | 94.19 ± 0.15 | 94.51 ± 0.75 | 91.08 ± 0.14 |
| GoogLeNet [67] | 6.7 M | 1.5 G | Fine-tuning [15] | - | - | - | 82.57 ± 0.12 |
| | | | GoogLeNet-aux [16] | 97.90 ± 0.34 | 93.25 ± 0.33 | 93.11 ± 0.55 | 89.22 ± 0.25 |
| EfficientNet-B0 [66] | 5.3 M | 0.4 G | EfficientNet-B0-aux [16] | 98.01 ± 0.45 | 93.69 ± 0.11 | 93.97 ± 0.13 | 89.96 ± 0.27 |
| ViT-L/16 [27] | 304.4 M | 61.6 G | Fine-tuning | 98.24 ± 0.21 | 94.44 ± 0.26 | 94.89 ± 0.24 | 90.85 ± 0.16 |
| ViT-B/16 [27] | 86.6 M | 17.6 G | Fine-tuning [36] | 98.14 ± 0.47 | 94.97 ± 0.01 | 95.07 ± 0.12 | 92.60 ± 0.10 |
| MLGC-ResNet-50 () | 13.1 M | 1.9 G | Fine-tuning | 98.48 ± 0.28 | 94.73 ± 0.15 | 95.27 ± 0.36 | 92.16 ± 0.08 |
| LTNet () | 8.2 M | 1.7 G | Fine-tuning | 98.36 ± 0.25 | 94.98 ± 0.08 | 95.70 ± 0.29 | 92.21 ± 0.11 |
| Pre-trained Network | Weights | FLOPs | Method | Merced (50% Train) | AID (20% Train) | Optimal31 (80% Train) | NWPU (10% Train) |
|---|---|---|---|---|---|---|---|
| ViT-L/16 [27] | 304.4 M | 61.6 G | Fine-tuning | 70 min | 234 min | 80 min | 629 min |
| ViT-B/16 [27] | 86.6 M | 17.6 G | Fine-tuning | 40 min | 132 min | 45 min | 336 min |
| MLGC-ResNet-50 () | 13.1 M | 1.9 G | Fine-tuning | 16 min | 59 min | 18 min | 117 min |
| LTNet () | 8.2 M | 1.7 G | Fine-tuning | 17 min | 61 min | 20 min | 124 min |
Model | Weights (M) | FLOPs (M) | Acc. (%) |
---|---|---|---|
MobileNetV1 [39] | 3.2 | 47 | 92.5 |
MobileNetV2 [40] | 2.3 | 68 | 93.2 |
ShuffleNetV1 (g = 3) [42] | 0.9 | 45 | 92.8
ShuffleNetV2 [43] | 1.3 | 45 | 93.5 |
EfficientNet-B0 [66] | 4.0 | 64 | 93.8 |
Ghost-VGG-16 [44] | 7.7 | 160 | 93.5 |
GhostV2-VGG-16 [68] | 9.4 | 188 | 93.6 |
CCT-6/3 × 2 [45] | 3.3 | 241 | 93.6 |
CCT-4/3 × 2 [45] | 0.5 | 46 | 91.5 |
MLGC-VGG-16 () | 2.8 | 97 | 93.8 |
LightFormer-VGG-16 () | 0.8 | 55 | 93.9 |
Model | Weights (M) | FLOPs (M) | Acc. (%) |
---|---|---|---|
Ghost-ResNet-56 | 0.44 | 67 | 92.7 |
MLGC-ResNet-56 (, ) | 0.43 | 65 | 93.0 |
LightFormer-ResNet-56 (, ) | 0.33 | 58 | 93.1 |
Model | Weights (M) | FLOPs (B) | Top-1 Acc. (%) |
---|---|---|---|
Thinet-ResNet-50 [69] | 16.9 | 24.9 | 72.0 |
Versatile-ResNet-50 [70] | 11.0 | 3.0 | 74.5 |
Ghost-ResNet-50 () [44] | 13.9 | 2.2 | 74.7 |
MLGC-ResNet-50 () | 13.1 | 1.9 | 75.5 |
LTNet () | 8.2 | 1.7 | 75.4 |
| Model | s | | | Weights (M) | FLOPs (M) | Acc. (%) |
|---|---|---|---|---|---|---|
| Ghost-VGG-16 | 2 | - | - | 7.7 | 160 | 93.5 |
| MLGC-VGG-16 | - | 2 | 1 | 8.0 | 173 | 94.2 |
| | | 4 | 1 | 6.2 | 134 | 93.4 |
| | | 2 | 2 | 6.0 | 127 | 93.6 |
| | | 8 | 1 | 5.3 | 115 | 92.7 |
| | | 2 | 4 | 5.0 | 104 | 93.5 |
| Ghost-VGG-16 | 3 | - | - | 5.2 | 109 | 92.8 |
| MLGC-VGG-16 | - | 16 | 1 | 4.8 | 105 | 90.8 |
| | | 32 | 1 | 4.6 | 100 | 90.4 |
| | | 2 | 8 | 4.5 | 92 | 93.1 |
| | | 2 | 16 | 4.2 | 87 | 93.1 |
| | | 4 | 2 | 4.1 | 88 | 92.9 |
| | | 2 | 32 | 4.1 | 84 | 93.0 |
| Ghost-VGG-16 | 4 | - | - | 4.0 | 82 | 92.6 |
| Model | MLGC-VGG-16 Layer Name | Weights (M) | FLOPs (M) | Acc. (%) |
|---|---|---|---|---|
| LightFormer-VGG-16 () | - | 6.0 | 127 | 93.6 |
| | v11 | 5.4 | 125 | 93.7 |
| | v12 | 5.4 | 125 | 93.6 |
| | v13 | 5.4 | 125 | 93.7 |
| | v11–v13 | 4.3 | 120 | 93.8 |
| | v8–v13 | 2.8 | 97 | 93.8 |
| | v7–v13 | 2.7 | 88 | 93.3 |
| | v6–v13 | 2.5 | 79 | 92.9 |
| Model | Removed Layer Name | Weights (M) | FLOPs (M) | Acc. (%) |
|---|---|---|---|---|
| LightFormer-VGG-16 () | - | 2.8 | 97 | 93.8 |
| | v9–v13 | 1.2 | 83 | 93.8 |
| | v7 & v9–v13 | 1.0 | 69 | 93.8 |
| | v6–v7 & v9–v13 | 0.8 | 55 | 93.9 |
| | v4 & v6–v7 & v9–v13 | 0.8 | 40 | 92.9 |
| | v2 & v4 & v6–v7 & v9–v13 | 0.7 | 26 | 92.1 |
| Model | Position Embedding | Weights (M) | FLOPs (M) | Acc. (%) |
|---|---|---|---|---|
| LightFormer-VGG-16 () | Yes | 0.8 | 55 | 93.8 |
| | No | | | 93.9 |
| Model | Heads | Positional Embedding | Weights (M) | Acc. (%) |
|---|---|---|---|---|
| MobileNetV2/0.5 [1] | - | - | 0.7 | 84.8 |
| CCT-2/3 × 2 [2] | 4 | Yes | 0.3 | 88.9 |
| LightFormer-VGG-16/0.5 () | 8 | Yes | 0.2 | 90.8 |
| | 4 | Yes | | 90.7 |
| | 4 | No | | 90.7 |
Share and Cite
Huang, X.; Liu, F.; Cui, Y.; Chen, P.; Li, L.; Li, P. Faster and Better: A Lightweight Transformer Network for Remote Sensing Scene Classification. Remote Sens. 2023, 15, 3645. https://doi.org/10.3390/rs15143645