A Feature-Wise Attention Module Based On The Difference With Surrounding Features For Convolutional Neural Networks
RESEARCH ARTICLE
Abstract Attention mechanism has become a widely researched method to improve the performance of convolutional neural networks (CNNs). Most studies focus on designing channel-wise and spatial-wise attention modules but neglect the importance of the unique information on each feature, which is critical for deciding both “what” and “where” to focus. In this paper, a feature-wise attention module is proposed, which can give each feature of the input feature map an attention weight. Specifically, the module is based on the well-known surround suppression in the discipline of neuroscience, and it consists of two sub-modules, the Minus-Square-Add (MSA) operation and a group of learnable non-linear mapping functions. The MSA imitates the surround suppression and defines an energy function which can be applied to each feature to measure its importance. The group of non-linear functions refines the energy calculated by the MSA to more reasonable values. With these two sub-modules, feature-wise attention can be well captured. Meanwhile, due to the simple structure and few parameters of the two sub-modules, the proposed module can easily be integrated into almost any CNN. To verify the performance and effectiveness of the proposed module, several experiments were conducted on the Cifar10, Cifar100, Cinic10, and Tiny-ImageNet datasets. The experimental results demonstrate that the proposed module is flexible and effective for CNNs to improve their performance.

Keywords feature-wise attention, surround suppression, image classification, convolutional neural networks

Received March 7, 2022; accepted September 29, 2022
E-mail: [email protected]

1 Introduction
Convolutional neural networks (CNNs) have shown outstanding performance in tackling a series of computer vision tasks [1−5]. Recently, some studies (e.g., [6−11]) have found that, combined with attention modules, CNNs can achieve better performance.
However, most of the existing attention modules focus on capturing channel-wise and spatial-wise attention but cannot capture both at the same time. This limits their capability of capturing the unique information on each feature, which plays an important role in deciding both “what” and “where” to focus [12]. Meanwhile, from the perspective of the human brain, channel-wise attention and spatial-wise attention correspond to feature-based attention and spatial-based attention in the human brain, respectively [13]. During the visual processing of the human brain, the two attentions co-exist and jointly contribute to the selection of important information, which further demonstrates the necessity of capturing feature-wise attention.
In order to capture feature-wise attention, there are mainly two categories of methods. The first is to design an elaborate encoder-decoder structure (e.g., Residual Attention Network for Image Classification (RANet) [9], Learning Pixel-wise Contextual Attention for Saliency Detection (PiCANet) [14]). The second is to design a simple and efficient method of weight calculation that can be applied to each feature (e.g., A Simple, Parameter-Free Attention Module for Convolutional Neural Networks (SimAM) [10], 3D Attention Map Learning Using Contextual Information for Point Cloud Based Retrieval (PCAN) [15]). The structures designed in the first category of methods are not flexible and modularized enough because of their complexity and numerous parameters. Therefore, the second category of methods may be more general for CNNs. Based on the well-known phenomenon in neuroscience called surround suppression, SimAM proposes a method to calculate the feature-wise attention. Specifically, the surround suppression shows that the most important neurons are those which exhibit significant difference from other neurons. What is more, an important neuron may suppress other neurons that surround it [16]. Conversely, it can be considered that the neurons that present the more notable difference may be more important. By imitating the surround suppression, attention modules can be designed to make CNNs pay more attention to the features that show significant difference from surrounding features and reduce the attention to others. Based on this, SimAM defines a linear function to measure the difference from each feature to others and hence measure the importance of each feature. However, in order to get the values of the parameters of its linear function, SimAM incurs a number of drawbacks. Specifically, it assumes that during the calculation of the
2 Front. Comput. Sci., 2023, 17(6): 176338
energy of each feature, the distribution of the current feature is similar to that of the others. The assumption is reasonable to some extent. However, it is possible that the distribution of important features may differ significantly from that of unimportant ones, which may affect the reliability of its whole module. Meanwhile, it introduces additional hyperparameters, whose values may have markedly different effects on performance. Therefore, it is necessary to explore other ways.
In this paper, we follow the idea of surround suppression and propose another simple module, which can be applied to each feature to capture feature-wise attention. Specifically, the proposed module contains two sub-modules: the Minus-Square-Add (MSA) operation and a group of learnable non-linear mapping functions. The MSA imitates the surround suppression and uses an energy function, defined as the average of the squared euclidean distances from each feature to the others, to measure the difference each feature exhibits from others and hence measure its importance. Moreover, the calculation process can easily be simplified by a series of equivalent transformations without assumptions, and the final form of the MSA is flexible and lightweight. Next, since the MSA cannot control the range of its output very well, the group of non-linear functions is introduced to refine the range by some learnable parameters. As the training of the entire network progresses, the group of non-linear functions can gradually learn the energy which is more reasonable for the features.
In summary, the contributions of this work are as follows:

● A simple method based on the surround suppression in neuroscience, called MSA, is proposed to capture feature-wise attention.
● A series of simple equivalent transformations are derived to speed up energy calculation, and the MSA is turned into a lightweight form.
● A group of learnable non-linear mapping functions is introduced to refine the energy calculated by the MSA, and the effectiveness of combining it with the MSA is verified.

2 Related work
In this section, some representative works on powerful CNN architectures and attention modules are discussed.
Increasingly powerful architectures Some works [17−23] show that by increasing the depth, width, and cardinality of the CNNs, the representation power of the networks can be improved. Furthermore, a series of works (e.g., [24−29]) use neural architecture search (NAS) in the field of automated machine learning (AutoML) to search for the best combination of depth, width, and cardinality for the networks. These have greatly facilitated the development of CNNs, and the existing basic architectures are now very powerful. Attention mechanism is now a popular and effective way to further improve the performance of CNNs without greatly increasing the complexity or the number of parameters. It aims to improve the representation power by telling CNNs where to focus. The aim of this work is to design a lightweight attention module, which can be easily integrated into most CNNs to improve their performance without big changes in architecture.
Channel-wise and spatial-wise attention modules Numerous existing studies on attention mechanism concentrate upon capturing and utilizing channel-wise and spatial-wise attention. Specifically, Squeeze-and-Excitation Networks (SE) [6] uses average-pooled features to compute channel-wise attention from a global view. Convolutional Block Attention Module (CBAM) [7] and Squeeze and Excitation Blocks (scSE) [30] use a convolution module to compute the spatial-wise attention corresponding to the channel attention of SE and combine it with their designed channel-wise attention to capture more useful information. Non-Local Attention [8] introduces an approach that uses the relationships between features in the spatial dimension to capture long-range dependencies. Double Attention Networks (A2-Net) [31] introduces a novel relation function for Non-Local Attention. Dual Attention Network (DANet) [32] shows that Non-Local Attention is a spatial-wise attention module and introduces a corresponding channel-wise attention module. GCNet [33] proposes a simplified Non-Local Attention and integrates it into SE, getting a lightweight channel-wise attention module with the ability to capture long-range dependencies. Gated Channel Transformation for Visual Recognition (GCT) [34] modifies the SE by using an l2 normalization to replace the FC layers and gets a more stable and effective channel-wise attention module. However, all of these attention modules can only capture either channel-wise attention or spatial-wise attention at one time, and they cannot capture feature-wise attention. In contrast, this work aims to design a feature-wise attention module, which can capture and make good use of the unique information on each feature.
Feature-wise attention modules In order to capture feature-wise attention, some methods such as RANet [9] and PiCANet [14] propose a number of well-designed encoder-decoder architectures. SimAM [10] introduces the well-known surround suppression in neuroscience and, based on it, designs a simple weight calculation that can be applied to each feature. Unlike methods such as RANet and PiCANet, the proposed module is more flexible and modularized. Meanwhile, unlike SimAM, the module does not require assumptions or hyperparameters.

3 Method
In this section, details of the proposed module are presented.

3.1 Overview of the proposed module
The starting point of this work is to design a novel feature-wise attention module (the differences among channel-wise, spatial-wise, and feature-wise attention modules are illustrated in Fig. 1) that is simple, free of hyperparameters, free of parameters, modularized, plug-and-play, and effective. However, it is difficult to design an attention module that combines all of these advantages. For more effectiveness, the module introduces some learnable parameters. Specifically, it consists of the MSA and a group of learnable non-linear mapping functions. The MSA defines an
Shuo TAN et al. A feature-wise attention module based on the difference with surrounding features for convolutional neural networks 3
Fig. 1 Comparisons of channel-wise, spatial-wise, and feature-wise attention modules. In each subfigure, the left side represents the input
features and the right side represents the feature weights calculated by different attention modules. Most of the existing attention modules are
channel-wise attention modules (a) and spatial-wise attention modules (b). They give the same attention weights to features in the same channel
or spatial position, while feature-wise attention modules (c) can give each feature its own attention weight
energy function to measure the importance of each feature to capture feature-wise attention. Furthermore, it is simplified by a series of equivalent transformations, and its final form is simple and has no hyperparameters and no parameters. However, the energy calculated by the MSA may be too large to remain effective at all times, the gaps between the calculated energy of different categories of features are not reasonable enough, and there is some noise among the features that get large energy. Therefore, the group of non-linear functions is introduced to solve these problems by some learnable parameters. Specifically, each of its non-linear mapping functions consists of a linear function, a non-linear hyperbolic tangent (tanh) function, and a linear function, and each of them corresponds to a channel. Meanwhile, each of the linear functions possesses two parameters, and hence the group of non-linear functions adds four additional parameters for each channel, which are few and acceptable to most CNNs. Combining these two sub-modules forms the final proposed module. The overview of the whole module can also be seen in Fig. 2. The details of the MSA and the group of non-linear functions are shown in turn in the following two subsections.

3.2 MSA
According to the surround suppression, features that show a more significant difference from surrounding features may be more important. Based on this, the MSA is proposed to measure the difference each feature exhibits from others and hence measure the importance of each feature. Because the MSA can be used for each feature to estimate the importance of individual features, it possesses the ability to capture feature-wise attention. Specifically, because the euclidean distance between two features can measure the difference between them, the average of the squared euclidean distances from each feature to the other features can measure the difference that each feature presents from the others. On the basis of surround suppression, the difference that each feature shows from the others can be used to estimate the importance of that feature. So, we define the above approach of measuring difference as an energy function and use the output of this function to represent the importance of each feature. The energy function is as follows:

$$e_{i,c} = \frac{1}{n-1} \sum_{j=1,\, j \neq i}^{n} (x_{i,c} - x_{j,c})^2, \quad (1)$$

where $x_{i,c}$ and $x_{j,c}$ denote the target feature and the other features in channel $c$ of the feature map $X \in \mathbb{R}^{C \times H \times W}$, $i$ and $j$ are two indexes from 1 to $n$, and $n = H \times W$ denotes the total number of features in channel $c$.
However, it requires complex calculations to compute the energy $e_{i,c}$ directly via Eq. (1). To simplify the calculations for efficiency, Eq. (1) is turned into an easily computable form through simple equivalent transformations.
Specifically, because $(x_{i,c} - x_{i,c})^2 = 0$, the value of $i$ can be added to the range of $j$. Then Eq. (1) is converted to the form as follows:
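As a hedged illustration of why such a conversion is possible (a sketch in assumed notation, not necessarily the form used by the authors): once $j$ ranges over all $n$ features, expanding the square around the channel mean $\mu_c$ gives $\sum_{j=1}^{n} (x_{i,c} - x_{j,c})^2 = n\,[(x_{i,c} - \mu_c)^2 + \sigma_c^2]$, where $\sigma_c^2$ is the population variance of channel $c$. The energy of every feature then needs only two channel statistics. The Python sketch below checks this numerically; the function names and the sample values are hypothetical.

```python
def energy_direct(x):
    """Eq. (1): average squared difference from x[i] to every other feature."""
    n = len(x)
    return [
        sum((x[i] - x[j]) ** 2 for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]


def energy_fast(x):
    """Same values, computed from the channel mean and population variance."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    # subtract the mean, square, add the variance, then rescale by n / (n - 1)
    return [n / (n - 1) * ((v - mean) ** 2 + var) for v in x]


# Hypothetical channel of n = 6 features (e.g., a flattened 2 x 3 map).
channel = [0.1, 0.2, 0.15, 0.9, 0.12, 0.18]
direct = energy_direct(channel)
fast = energy_fast(channel)
```

The direct version costs on the order of n² operations per channel, while the rewritten one is linear in n. In this sample the feature 0.9, which differs most from the rest, receives the largest energy.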
Fig. 2 Overview of the proposed module. Two 1 × 1 convolution modules, each with the number of groups equal to the number of channels, together with a tanh
function, are used to implement the group of non-linear functions
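A 1 × 1 convolution whose number of groups equals the number of channels reduces to one scale and one bias per channel, so the linear, tanh, linear chain described above can be sketched without any deep learning library. The following is a minimal sketch under stated assumptions: the class name, the initial values, and the sample energies are hypothetical, and how the refined energy is finally turned into attention weights is not covered here.

```python
import math


class ChannelwiseMapping:
    """One linear -> tanh -> linear map per channel (4 scalars per channel)."""

    def __init__(self, num_channels):
        # Illustrative initial values; the initialisation is an assumption.
        self.w1 = [1.0] * num_channels
        self.b1 = [0.0] * num_channels
        self.w2 = [1.0] * num_channels
        self.b2 = [0.0] * num_channels

    def num_parameters(self):
        return len(self.w1) + len(self.b1) + len(self.w2) + len(self.b2)

    def __call__(self, energy):
        """energy[c] holds the MSA energies of channel c as a flat list."""
        refined = []
        for c, row in enumerate(energy):
            refined.append([
                self.w2[c] * math.tanh(self.w1[c] * e + self.b1[c]) + self.b2[c]
                for e in row
            ])
        return refined


mapping = ChannelwiseMapping(num_channels=2)
# Hypothetical energies: the value 50.0 stands in for an overly large energy.
refined = mapping([[0.0, 0.5, 50.0], [0.1, 0.2, 0.3]])
```

With these initial values every output stays within [-1, 1], which illustrates how the tanh bounds an energy that would otherwise be unbounded, while the two linear functions let each channel learn its own scale and shift.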
Fig. 3 The exact position where the proposed module is integrated into a ResBlock. The module is applied to each ResBlock of the ResNet
Table 1 Top-1 accuracies (%) for ResNet18 and ResNet50 with different attention modules, SE [6], CBAM [7], ECA [39], GCT [34], SimAM [10] and the
proposed module on Cifar10, Cifar100, and Cinic10 datasets. All results are reported as mean±std over five trials
Model Cifar10 Cifar100 Cinic10
ResNet18 (Baseline) 93.21±0.38 73.37±0.20 84.84±0.45
ResNet18 + SE 93.67±0.19 73.93±0.10 85.49±0.10
ResNet18 + CBAM 93.65±0.13 73.41±0.25 85.27±0.13
ResNet18 + ECA 93.45±0.08 72.11±0.48 84.81±0.13
ResNet18 + GCT 93.15±0.35 73.51±0.31 84.97±0.49
ResNet18 + SimAM 93.57±0.11 74.21±0.21 85.48±0.19
ResNet18 + proposed module 93.78±0.09 74.39±0.09 85.53±0.03
ResNet50 (Baseline) 91.49±0.57 69.58±1.54 83.21±1.06
ResNet50 + SE 92.59±0.40 74.46±0.43 84.84±0.54
ResNet50 + CBAM 93.74±0.21 75.91±0.13 85.46±0.41
ResNet50 + ECA 92.33±1.69 74.73±0.77 85.20±0.51
ResNet50 + GCT 90.84±1.24 69.23±1.37 83.78±0.77
ResNet50 + SimAM 92.55±0.26 71.85±1.59 84.95±0.37
ResNet50 + proposed module 93.08±0.52 75.12±0.49 84.88±0.55
Table 2 Parameters, additional parameters to baseline, FLOPs, Top-1 and Top-5 accuracies (%) for various models with SE [6], CBAM [7], ECA [39], GCT
[34], SimAM [10] and the proposed module on Tiny-ImageNet
Model Parameters + Parameters-to-baseline FLOPs Top-1 Acc/% Top-5 Acc/%
ResNet18 (Baseline) 11.27M 0 2.23G 65.12 84.23
ResNet18 + SE [6] 11.36M 0.0870M 2.23G 66.38 85.44
ResNet18 + CBAM [7] 11.36M 0.0899M 2.23G 66.04 85.05
ResNet18 + ECA [39] 11.27M 24 2.23G 64.94 84.60
ResNet18 + GCT [34] 11.28M 0.0058M 2.23G 66.21 85.15
ResNet18 + SimAM [10] 11.27M 0 2.23G 65.48 84.37
ResNet18 + proposed module 11.28M 0.0077M 2.23G 65.90 85.00
ResNet34 (Baseline) 21.38M 0 4.65G 66.73 85.29
ResNet34 + SE [6] 21.54M 0.1572M 4.65G 67.29 86.16
ResNet34 + CBAM [7] 21.54M 0.1628M 4.65G 67.14 85.80
ResNet34 + ECA [39] 21.38M 48 4.65G 66.22 85.35
ResNet34 + GCT [34] 21.39M 0.0113M 4.65G 67.37 85.92
ResNet34 + SimAM [10] 21.38M 0 4.65G 67.29 85.87
ResNet34 + proposed module 21.39M 0.0151M 4.65G 67.49 85.86
ResNet50 (Baseline) 23.91M 0 5.22G 68.19 86.66
ResNet50 + SE [6] 26.43M 2.5149M 5.23G 69.68 87.83
ResNet50 + CBAM [7] 26.44M 2.5326M 5.23G 69.30 87.81
ResNet50 + ECA [39] 23.91M 48 5.23G 68.63 86.60
ResNet50 + GCT [34] 21.96M 0.0453M 5.22G 69.14 87.13
ResNet50 + SimAM [10] 23.91M 0 5.22G 68.93 87.27
ResNet50 + proposed module 23.97M 0.0604M 5.25G 69.89 87.72
ResNet101 (Baseline) 42.90M 0 10.08G 69.92 87.58
ResNet101 + SE [6] 47.65M 4.7431M 10.10G 71.06 88.53
ResNet101 + CBAM [7] 47.68M 4.7810M 10.09G 70.48 88.09
ResNet101 + ECA [39] 42.90M 99 10.09G 69.52 87.61
ResNet101 + GCT [34] 43.00M 0.0975M 10.08G 71.02 88.33
ResNet101 + SimAM [10] 42.90M 0 10.08G 69.79 87.78
ResNet101 + proposed module 43.03M 0.1300M 10.13G 70.46 88.46
MobileNetV2 (Baseline) 2.54M 0 0.38G 62.65 84.47
MobileNetV2 + SE [6] 2.57M 0.0284M 0.38G 62.39 84.17
MobileNetV2 + CBAM [7] 2.57M 0.0317M 0.38G 62.39 83.89
MobileNetV2 + ECA [39] 2.54M 51 0.38G 61.74 83.50
MobileNetV2 + GCT [34] 2.54M 0.0045M 0.38G 63.55 85.09
MobileNetV2 + SimAM [10] 2.54M 0 0.38G 63.50 84.92
MobileNetV2 + proposed module 2.55M 0.0060M 0.38G 63.62 85.14
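The "Parameters-to-baseline" column for the proposed module can be cross-checked against the four-parameters-per-channel description in Section 3.1. The sketch below is a hedged consistency check, not the authors' accounting: it assumes the standard ResNet block counts and block output widths, and assumes the module is applied once per residual block on the block's output channels.

```python
PARAMS_PER_CHANNEL = 4  # two linear functions, each with a weight and a bias

# Assumed (blocks per stage, block output channels per stage) of standard ResNets.
RESNETS = {
    "ResNet18": ([2, 2, 2, 2], [64, 128, 256, 512]),
    "ResNet34": ([3, 4, 6, 3], [64, 128, 256, 512]),
    "ResNet50": ([3, 4, 6, 3], [256, 512, 1024, 2048]),
    "ResNet101": ([3, 4, 23, 3], [256, 512, 1024, 2048]),
}


def added_parameters(blocks, widths):
    """Extra parameters if every residual block gets one module."""
    return PARAMS_PER_CHANNEL * sum(b * w for b, w in zip(blocks, widths))


added = {name: added_parameters(*cfg) for name, cfg in RESNETS.items()}
added_in_millions = {name: round(n / 1e6, 4) for name, n in added.items()}
```

Under these assumptions the counts are 7680, 15104, 60416 and 130048, which round to the 0.0077M, 0.0151M, 0.0604M and 0.1300M reported in Table 2.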
0.88%, 0.97% and 0.67%, respectively. Moreover, compared with other attention modules, the proposed module is also very competitive (e.g., with ResNet34, ResNet50 and MobileNetV2, the proposed module gets the best top-1 accuracies).
The proposed module is based on very simple operations rather than the stacking of modules such as convolution, global average pooling, etc., and the module may capture and make rational use of feature-wise attention. These are the reasons why the module can achieve strong performance with few parameters.

4.4 Analysis of the MSA
To explore the MSA in more depth, we plot the intermediate results from the processing of the ResNet50 which integrates the MSA in Fig. 4. Combining the visualization results and Visualizing and Understanding Convolutional Networks [40], it can be considered that the front layers of the network mainly extract edge features and the back layers primarily extract features such as texture. From Fig. 4 we can also see that the MSA after the front layers is good at noticing edge features, which are indeed important features for the front layers. Meanwhile, we can observe that the MSA also extracts some noise that should not be focused on. As the back layers are more difficult to understand, we do not show and analyse their visualization results here. To a certain extent, these observations empirically support the effectiveness of the MSA and the necessity of the subsequent group of learnable non-linear mapping functions.

4.5 Ablation studies on Tiny-ImageNet
In this subsection, the results of several ablation studies are presented in Table 3. It can be observed that both the MSA and the group of non-linear functions improve the performance of baselines slightly. This empirically shows that the MSA has an impact on capturing feature-wise attention and that some additional parameters can improve the performance of CNNs. Meanwhile, it is worth noting that by combining these two sub-modules, the performance is improved further. Moreover, unsurprisingly, normalization does not work well, because its process may lose some useful information. These results are good indicators that the two sub-modules can work well together and show the reasonableness of the overall process of the proposed module.
Fig. 4 Visualization results of the intermediate features and the attention weights calculated by the MSA

5 Conclusion
In this paper, inspired by surround suppression and a series of existing attention modules, a feature-wise attention module was proposed. This work focused on how to make good use of the information from each feature, which is critical for deciding both “what” and “where” to focus, and introduced
Table 3 Ablation studies on Tiny-ImageNet
Model Top-1 Acc/% Top-5 Acc/%
ResNet18 (Baseline) 65.12 84.23
ResNet18 + MSA (With normalization) 59.45 80.49
ResNet18 + MSA (With sigmoid) 65.49 84.69
ResNet18 + group of non-linear functions 65.63 84.38
ResNet18 + proposed module 65.90 85.00
ResNet34 (Baseline) 66.73 85.29
ResNet34 + MSA (With normalization) 62.49 82.38
ResNet34 + MSA (With sigmoid) 66.87 85.53
ResNet34 + group of non-linear functions 67.24 85.63
ResNet34 + proposed module 67.49 85.86
ResNet50 (Baseline) 68.19 86.66
ResNet50 + MSA (With normalization) 66.60 86.31
ResNet50 + MSA (With sigmoid) 69.16 87.25
ResNet50 + group of non-linear functions 69.50 87.22
ResNet50 + proposed module 69.89 87.72
ResNet101 (Baseline) 69.92 87.58
ResNet101 + MSA (With normalization) 59.40 81.65
ResNet101 + MSA (With sigmoid) 69.02 87.38
ResNet101 + group of non-linear functions 70.29 87.86
ResNet101 + proposed module 70.46 88.46
MobileNetV2 (Baseline) 62.65 84.47
MobileNetV2 + MSA (With normalization) 52.08 76.87
MobileNetV2 + MSA (With sigmoid) 63.43 84.80
MobileNetV2 + group of non-linear functions 63.11 84.50
MobileNetV2 + proposed module 63.62 85.14
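The ablation rows are easier to compare as differences from the baseline. The following sketch simply subtracts the baseline Top-1 accuracy from each variant, using the values copied from Table 3; the dictionary keys and the helper name are hypothetical.

```python
# Top-1 accuracies (%) copied from Table 3.
TOP1 = {
    "ResNet18": {"baseline": 65.12, "msa_norm": 59.45, "msa_sigmoid": 65.49,
                 "mapping_only": 65.63, "proposed": 65.90},
    "ResNet34": {"baseline": 66.73, "msa_norm": 62.49, "msa_sigmoid": 66.87,
                 "mapping_only": 67.24, "proposed": 67.49},
    "ResNet50": {"baseline": 68.19, "msa_norm": 66.60, "msa_sigmoid": 69.16,
                 "mapping_only": 69.50, "proposed": 69.89},
    "ResNet101": {"baseline": 69.92, "msa_norm": 59.40, "msa_sigmoid": 69.02,
                  "mapping_only": 70.29, "proposed": 70.46},
    "MobileNetV2": {"baseline": 62.65, "msa_norm": 52.08, "msa_sigmoid": 63.43,
                    "mapping_only": 63.11, "proposed": 63.62},
}


def gains_over_baseline(rows):
    """Top-1 difference (percentage points) of each variant to the baseline."""
    base = rows["baseline"]
    return {name: round(acc - base, 2) for name, acc in rows.items()
            if name != "baseline"}


gains = {model: gains_over_baseline(rows) for model, rows in TOP1.items()}
```

Read this way, the full module gives the largest Top-1 gain for every model, the normalization variant is below the baseline everywhere, and the MSA with sigmoid on ResNet101 is the one single-sub-module case that falls below its baseline (by 0.90 points).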
Table 4 Experiments on Tiny-ImageNet to choose non-linear activation function of the non-linear mapping function
Model Top-1 Acc/% Top-5 Acc/%
ResNet18 (Baseline) 65.12 84.23
ResNet18 + proposed module (Using sigmoid) 65.91 84.85
ResNet18 + proposed module (Using tanh) 65.90 85.00
ResNet34 (Baseline) 66.73 85.29
ResNet34 + proposed module (Using sigmoid) 67.20 86.02
ResNet34 + proposed module (Using tanh) 67.49 85.86
ResNet50 (Baseline) 68.19 86.66
ResNet50 + proposed module (Using sigmoid) 68.92 86.98
ResNet50 + proposed module (Using tanh) 69.89 87.72
ResNet101 (Baseline) 69.92 87.58
ResNet101 + proposed module (Using sigmoid) 70.31 87.71
ResNet101 + proposed module (Using tanh) 70.46 88.46
MobileNetV2 (Baseline) 62.65 84.47
MobileNetV2 + proposed module (Using sigmoid) 63.14 85.16
MobileNetV2 + proposed module (Using tanh) 63.62 85.14
Fig. 5 Visualization results using Grad-CAM [41]. The visualization results of SE, CBAM, SimAM, and the proposed module integrated into
ResNet50 on the Tiny-ImageNet validation set, respectively
the MSA. Then, to simplify the calculations for efficiency, the MSA was turned into a simpler form. Furthermore, since the MSA cannot control the range of its output very well, a group of learnable non-linear mapping functions was introduced. Finally, extensive experiments were conducted and the results demonstrate the effectiveness of the whole module.
For future research, Section 4.4 shows that the MSA can extract edge features well, so it is promising to extend the MSA to other computer vision tasks such as object detection and semantic segmentation to help capture boundaries.

Acknowledgements This work was supported by the National Natural Science Fund for Distinguished Young Scholar (No. 62025601).

References
1. Deng J, Dong W, Socher R, Li L J, Li K, Fei-Fei L. ImageNet: a large-scale hierarchical image database. In: Proceedings of 2009 IEEE Conference on Computer Vision and Pattern Recognition. 2009, 248–255
2. Krizhevsky A. Learning multiple layers of features from tiny images. Toronto: University of Toronto, 2009
3. Everingham M, van Gool L, Williams C K I, Winn J, Zisserman A. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 2010, 88(2): 303–338
4. Cordts M, Omran M, Ramos S, Rehfeld T, Enzweiler M, Benenson R, Franke U, Roth S, Schiele B. The cityscapes dataset for semantic urban scene understanding. In: Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. 2016, 3213–3223
5. Lin T Y, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollár P, Zitnick C L. Microsoft COCO: common objects in context. In: Proceedings of the 13th European Conference on Computer Vision. 2014, 740–755
6. Hu J, Shen L, Sun G. Squeeze-and-excitation networks. In: Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2018, 7132–7141
7. Woo S, Park J, Lee J Y, Kweon I S. CBAM: convolutional block attention module. In: Proceedings of the 15th European Conference on Computer Vision. 2018, 3–19
8. Wang X, Girshick R, Gupta A, He K. Non-local neural networks. In: Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2018, 7794–7803
9. Wang F, Jiang M, Qian C, Yang S, Li C, Zhang H, Wang X, Tang X. Residual attention network for image classification. In: Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. 2017, 6450–6458
10. Yang L, Zhang R Y, Li L, Xie X. SimAM: a simple, parameter-free attention module for convolutional neural networks. In: Proceedings of the 38th International Conference on Machine Learning. 2021, 11863–11874
11. Wang L, Zhang L, Qi X, Yi Z. Deep attention-based imbalanced image classification. IEEE Transactions on Neural Networks and Learning Systems, 2022, 33(8): 3320–3330
12. Chen L, Zhang H, Xiao J, Nie L, Shao J, Liu W, Chua T S. SCA-CNN: spatial and channel-wise attention in convolutional networks for image captioning. In: Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. 2017, 6298–6306
13. Carrasco M. Visual attention: the past 25 years. Vision Research, 2011, 51(13): 1484–1525
14. Liu N, Han J, Yang M H. PiCANet: learning pixel-wise contextual attention for saliency detection. In: Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2018, 3089–3098
15. Zhang W, Xiao C. PCAN: 3D attention map learning using contextual information for point cloud based retrieval. In: Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019, 12428–12437
16. Webb B S, Dhruv N T, Solomon S G, Tailby C, Lennie P. Early and late mechanisms of surround suppression in striate cortex of macaque. Journal of Neuroscience, 2005, 25(50): 11666–11675
17. Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. 2014, arXiv preprint arXiv: 1409.1556
18. Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A. Going deeper with convolutions. In: Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. 2015, 1–9
19. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. 2016, 770–778
20. Zagoruyko S, Komodakis N. Wide residual networks. In: Proceedings of British Machine Vision Conference. 2016, 87.1–87.12
21. Huang G, Liu Z, van der Maaten L, Weinberger K Q. Densely connected convolutional networks. In: Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. 2017, 2261–2269
22. Chollet F. Xception: deep learning with depthwise separable convolutions. In: Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. 2017, 1800–1807
23. Xie S, Girshick R, Dollár P, Tu Z, He K. Aggregated residual transformations for deep neural networks. In: Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. 2017, 5987–5995
24. Domhan T, Springenberg J T, Hutter F. Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves. In: Proceedings of the 24th International Conference on Artificial Intelligence. 2015, 3460–3468
25. Ha D, Dai A, Le Q V. Hypernetworks. 2016, arXiv preprint arXiv: 1609.09106
26. Zoph B, Le Q V. Neural architecture search with reinforcement learning. In: Proceedings of the 5th International Conference on Learning Representations. 2017
27. Mendoza H, Klein A, Feurer M, Springenberg J T, Hutter F. Towards automatically-tuned neural networks. In: Proceedings of Workshop on Automatic Machine Learning. 2016, 58–65
28. Bello I, Zoph B, Vasudevan V, Le Q V. Neural optimizer search with reinforcement learning. In: Proceedings of the 34th International Conference on Machine Learning. 2017, 459–468
29. Fernando C, Banarse D, Blundell C, Zwols Y, Ha D, Rusu A A, Pritzel A, Wierstra D. Pathnet: Evolution channels gradient descent in super neural networks. 2017, arXiv preprint arXiv: 1701.08734
30. Roy A G, Navab N, Wachinger C. Recalibrating fully convolutional networks with spatial and channel “squeeze and excitation” blocks. IEEE Transactions on Medical Imaging, 2019, 38(2): 540–549
31. Chen Y, Kalantidis Y, Li J, Yan S, Feng J. A2-nets: double attention networks. 2018, arXiv preprint arXiv: 1810.11579
32. Fu J, Liu J, Tian H, Li Y, Bao Y, Fang Z, Lu H. Dual attention network for scene segmentation. In: Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019, 3141–3149
33. Cao Y, Xu J, Lin S, Wei F, Hu H. GCNet: non-local networks meet squeeze-excitation networks and beyond. In: Proceedings of 2019 IEEE/CVF International Conference on Computer Vision Workshop. 2019, 1971–1980
34. Yang Z, Zhu L, Wu Y, Yang Y. Gated channel transformation for visual recognition. In: Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020, 11791–11800
35. Ioffe S, Szegedy C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: Proceedings of the 32nd International Conference on Machine Learning. 2015, 448–456
36. Darlow L N, Crowley E J, Antoniou A, Storkey A J. CINIC-10 is not ImageNet or CIFAR-10. 2018, arXiv preprint arXiv: 1810.03505
37. Lee C Y, Xie S, Gallagher P, Zhang Z, Tu Z. Deeply-supervised nets. In: Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics. 2015, 562–570
38. Sandler M, Howard A, Zhu M, Zhmoginov A, Chen L C. MobileNetV2: inverted residuals and linear bottlenecks. In: Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2018, 4510–4520
39. Wang Q, Wu B, Zhu P, Li P, Zuo W, Hu Q. ECA-Net: efficient channel attention for deep convolutional neural networks. In: Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020, 11531–11539
40. Zeiler M D, Fergus R. Visualizing and understanding convolutional networks. In: Proceedings of the 13th European Conference on Computer Vision. 2014, 818–833
41. Selvaraju R R, Cogswell M, Das A, Vedantam R, Parikh D, Batra D. Grad-CAM: visual explanations from deep networks via gradient-based localization. In: Proceedings of 2017 IEEE International Conference on Computer Vision. 2017, 618–626
Shuo Tan is currently pursuing the MS degree at the Machine Intelligence Laboratory, College of Computer Science, Sichuan University, China. His current research interests include convolutional neural network and medical image analysis.

Lei Zhang received the BS and MS degrees in mathematics and the PhD degree in computer science from the University of Electronic Science and Technology of China, China in 2002, 2005, and 2008, respectively. She was a Post-Doctoral Research Fellow with the Department of Computer Science and Engineering, Chinese University of Hong Kong, China from 2008 to 2009. She was an Associate Editor of IEEE Transactions on Neural Networks and Learning Systems and an Associate Editor of IEEE Transactions on Cognitive and Developmental Systems. Her current research interests include theory and applications of neural networks based on neocortex computing and big data analysis methods by very deep neural networks.

Xin Shu is currently pursuing the PhD degree with the Machine Intelligence Laboratory, College of Computer Science, Sichuan University, China. His current research interests include neural network and intelligent medical.

Zizhou Wang is currently pursuing the PhD degree with the Machine Intelligence Laboratory, College of Computer Science, Sichuan University, China. His current research interests include neural network and medical image analysis.