Advancements in Image Classification Using Convolutional Neural Network
Abstract—Convolutional Neural Network (CNN) is the state of the art for image classification tasks. Here we have briefly discussed different components of CNN. In this paper, we have explained different CNN architectures for image classification. Through this paper, we have shown the advancements in CNN from LeNet-5 to the latest SENet model. We have discussed the model description and training details of each model. We have also drawn a comparison among those models.

Keywords—AlexNet, CapsNet, Convolutional Neural Network, Deep learning, DenseNet, Image classification, ResNet, SENet.

I. INTRODUCTION

Computer vision consists of different problems such as image classification, localization, segmentation and object detection. Among those, image classification can be considered the fundamental problem, and it forms the basis for other computer vision problems. Until the '90s, only traditional machine learning approaches were used to classify images, and the accuracy and scope of the classification task were bounded by several challenges such as the hand-crafted feature extraction process. In recent years, the deep neural network (DNN), also known as deep learning [1][2], has been able to find complex structure in large data sets using the backpropagation [3] algorithm. Among DNNs, the convolutional neural network has demonstrated excellent achievement in problems of computer vision, especially in image classification.

The Convolutional Neural Network (CNN or ConvNet) is a special type of multi-layer neural network inspired by the mechanism of the optical system of living creatures. Hubel and Wiesel [4] discovered that animal visual cortex cells detect light in small receptive fields. Motivated by this work, in 1980 Kunihiko Fukushima introduced the neocognitron [5], a multi-layered neural network capable of recognizing visual patterns hierarchically through learning. This network is considered the theoretical inspiration for CNN. In 1990 LeCun et al. introduced the practical model of CNN [6][7] and developed LeNet-5 [8]. Training by the backpropagation [9] algorithm helped LeNet-5 recognize visual patterns directly from raw pixels without any separate feature engineering mechanism. Also, the fewer connections and parameters of CNN compared to conventional feedforward neural networks of similar network size made model training easier. But at that time, in spite of several advantages, the performance of CNN on intricate problems such as classification of high-resolution images was limited by the lack of large training data, the lack of better regularization methods and inadequate computing power.

Nowadays we have larger datasets with millions of high-resolution labelled images of thousands of categories, like ImageNet [10] and LabelMe [11]. With the advent of powerful GPU machines and better regularization methods, CNN delivers outstanding performance on image classification tasks. In 2012 a large deep convolutional neural network called AlexNet [12], designed by Krizhevsky et al., showed excellent performance on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [13]. The success of AlexNet became the inspiration for different CNN models such as ZFNet [14], VGGNet [15], GoogLeNet [16], ResNet [17], DenseNet [18], CapsNet [19] and SENet [20] in the following years.

In this study, we have tried to give a review of the advancements of CNN in the area of image classification. We have given a general view of CNN architectures in section II. Section III describes the architecture and training details of different models of CNN. In section IV we have drawn a comparison between various CNN models. Finally, we have concluded our paper in section V.

II. CONVOLUTIONAL NEURAL NETWORK

A typical CNN is composed of single or multiple blocks of convolution and sub-sampling layers, followed by one or more fully connected layers and an output layer, as shown in figure 1.

Fig. 1: Building block of a typical CNN

A. Convolutional Layer

The convolutional layer (conv layer) is the central part of a CNN. Images are generally stationary in nature. That means the statistics of one part of the image are the same as those of any other part. So, a feature learnt in one region can match similar
pattern in another region. In a large image, we take a small section and pass it over all the points in the large image (the input). While passing over any point, we convolve it into a single output position. Each small section of the image that passes over the large image is called a filter (kernel). The filters are later configured through the backpropagation technique. Figure 2 shows a typical convolutional operation.
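A minimal sketch of this sliding-filter operation, assuming PyTorch; the image and kernel sizes here are illustrative, not taken from the paper:

```python
import torch
import torch.nn.functional as F

# A small 3x3 filter slides over a larger single-channel image and produces
# one output value per position. Shapes follow PyTorch's convention of
# (batch, channels, height, width).
image = torch.randn(1, 1, 28, 28)   # the large input image
kernel = torch.randn(1, 1, 3, 3)    # the small filter (kernel)

# Each 3x3 patch of the image is multiplied elementwise with the filter and
# summed into a single output position, as in figure 2.
feature_map = F.conv2d(image, kernel, stride=1, padding=0)
print(feature_map.shape)  # torch.Size([1, 1, 26, 26])
```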
III. DIFFERENT MODELS OF CNN

A. LeNet-5

LeNet-5, as shown in figure 5, has 7 weighted (trainable) layers. Among them are three convolutional layers (C1, C3, C5), two average pooling layers (S2, S4), one fully connected layer (F6) and one output layer. The sigmoid function was used to include non-linearity before each pooling operation. The output layer used Euclidean Radial Basis Function (RBF) units [21] to classify the 10 digits.
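A sketch of the LeNet-5 layer sequence just described, assuming PyTorch; the RBF output units of the original are replaced here by a plain linear layer for simplicity:

```python
import torch
import torch.nn as nn

# LeNet-5 as described: three conv layers (C1, C3, C5), two average-pooling
# layers (S2, S4), one fully connected layer (F6) and an output layer.
class LeNet5(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),     # C1: 32x32x1 -> 28x28x6
            nn.Sigmoid(),                       # non-linearity before pooling
            nn.AvgPool2d(2),                    # S2: 28x28x6 -> 14x14x6
            nn.Conv2d(6, 16, kernel_size=5),    # C3: 14x14x6 -> 10x10x16
            nn.Sigmoid(),
            nn.AvgPool2d(2),                    # S4: 10x10x16 -> 5x5x16
            nn.Conv2d(16, 120, kernel_size=5),  # C5: 5x5x16 -> 1x1x120
            nn.Sigmoid(),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(120, 84),                 # F6
            nn.Sigmoid(),
            nn.Linear(84, num_classes),         # output layer (linear, not RBF)
        )

    def forward(self, x):
        return self.classifier(self.features(x))

logits = LeNet5()(torch.randn(1, 1, 32, 32))  # 32x32 input as in LeNet-5
```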
B. AlexNet

In AlexNet, there are 5 convolutional layers (conv layers) and 3 fully connected (FC) layers. Using rectified linear unit (ReLU) [24] non-linearity after the convolutional and FC layers helped their model to be trained faster than similar networks with tanh units. They have used local response normalization (LRN), called "brightness normalization", after the first and second convolutional layers, which aids generalization. They have used a max-pooling layer after each LRN layer and after the fifth convolutional layer. Figure 6 shows the architectural details of AlexNet, and table II lists its different elements.

Fig. 6: Architecture of AlexNet [12]

TABLE II: Details of different layers of AlexNet

| Layer  | Filter size / stride | Padding | #Filters | Output size | #Parameters |
| Conv-1 | 11×11 / 4 | 0 | 96  | 55×55×96  | 34,848     |
| pool-1 | 3×3 / 2   | - | -   | 27×27×96  | -          |
| Conv-2 | 5×5 / 1   | 2 | 256 | 27×27×256 | 614,400    |
| pool-2 | 3×3 / 2   | - | -   | 13×13×256 | -          |
| Conv-3 | 3×3 / 1   | 1 | 384 | 13×13×384 | 981,504    |
| Conv-4 | 3×3 / 1   | 1 | 384 | 13×13×384 | 1,327,104  |
| Conv-5 | 3×3 / 1   | 1 | 256 | 13×13×256 | 884,736    |
| pool-3 | 3×3 / 2   | - | -   | 6×6×256   | -          |
| FC6    | -         | - | -   | 1×1×4096  | 37,748,736 |
| FC7    | -         | - | -   | 1×1×4096  | 16,777,216 |
| FC8    | -         | - | -   | 1×1×1000  | 4,096,000  |
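A sketch of the stack in Table II, assuming PyTorch; a 227×227 input reproduces the 55×55 first-layer output of the table (the paper reports 224×224 training crops), and the LRN parameters here are illustrative:

```python
import torch
import torch.nn as nn

# The AlexNet convolutional stack following Table II (filter size/stride,
# padding and filter counts per layer); LRN after Conv-1 and Conv-2,
# max-pooling after each LRN layer and after Conv-5, as in the text.
alexnet_features = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(),     # Conv-1
    nn.LocalResponseNorm(5), nn.MaxPool2d(3, stride=2),        # LRN, pool-1
    nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(),   # Conv-2
    nn.LocalResponseNorm(5), nn.MaxPool2d(3, stride=2),        # LRN, pool-2
    nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(),  # Conv-3
    nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(),  # Conv-4
    nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),  # Conv-5
    nn.MaxPool2d(3, stride=2),                                 # pool-3
)
alexnet_classifier = nn.Sequential(
    nn.Flatten(),
    nn.Linear(6 * 6 * 256, 4096), nn.ReLU(), nn.Dropout(0.5),  # FC6
    nn.Linear(4096, 4096), nn.ReLU(), nn.Dropout(0.5),         # FC7
    nn.Linear(4096, 1000),                                     # FC8
)
out = alexnet_classifier(alexnet_features(torch.randn(1, 3, 227, 227)))
```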
1) Dataset used: Krizhevsky et al. designed AlexNet for classification of 1.2 million high-resolution images of 1000 classes for ILSVRC-2010 and ILSVRC-2012 [25]. There are around 1.2 million/50K/150K training/validation/testing images. On ILSVRC, competitors submit two kinds of error rates: top-1 and top-5.

2) Training Details: From the variable-resolution images of ImageNet, AlexNet used down-sampled and centred 256×256 pixel images. To reduce overfitting they have used runtime data augmentation as well as a regularization method called dropout [26]. In data augmentation, they have extracted 10 random translated and horizontally reflected patches of 224×224 pixels, and they have also used principal component analysis (PCA) [27] for RGB channel shifting of the training images. The authors trained AlexNet using stochastic gradient descent (SGD) with a batch size of 128, weight decay of 0.0005 and momentum of 0.9. The weight decay works as a regularizer and also reduces the training error. Their initial learning rate of 0.01 was reduced manually three times by a factor of 10 whenever validation accuracy plateaued. AlexNet was trained on two NVIDIA GTX-580 3 GB GPUs using cross-GPU parallelization for five to six days.
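A minimal sketch of this training recipe, assuming PyTorch; the model is a stand-in, and ReduceLROnPlateau approximates the manual learning-rate schedule described above:

```python
import torch

# SGD with batch size 128, momentum 0.9, weight decay 0.0005, initial
# learning rate 0.01 reduced by 1/10 when validation accuracy plateaus.
model = torch.nn.Linear(10, 10)  # stand-in for the real network
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='max', factor=0.1)
# In the training loop, call scheduler.step(val_accuracy) once per epoch.
```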
The authors have noticed that removing any middle layer degrades the network's performance, so the result depends on the depth of the network. Also, they have used a purely supervised learning approach to simplify their experiment, but they expect that unsupervised pre-training would help if we had adequate computational power to remarkably increase the network size without increasing the amount of corresponding labelled data.

C. ZFNet

In 2014 Zeiler and Fergus presented a CNN called ZFNet [14]. The architecture of AlexNet and ZFNet is almost similar, except that the authors have reduced the first-layer filter size to 7×7 instead of 11×11 and used stride-2 convolutional layers in both the first and second layers to retain more information in those layers' features. In their paper, the authors tried to explain the reason behind the outstanding performance of large deep CNNs. They have used a novel visualization technique, a multi-layered deconvolutional network called deconvnet [28], to map activations at higher layers back to the space of input pixels and recognize which pixels of the input layer are accountable for a given activation in the feature map. Basically, a deconvnet is a reversely ordered convnet. It accepts a feature map as input and applies unpooling using a switch. A switch is basically the position of the maximum within a pooling region, recorded during convolution (a minimal sketch of this switch mechanism follows this subsection). Then they rectify it using ReLU non-linearity and use the transposed version of the filters to rebuild the activity in the layer below which gave rise to the chosen activation.

Fig. 7: Architecture of ZFNet [14]

1) Training Details: ZFNet used the ImageNet dataset of 1.3 million/50k/100k training/validation/testing images. The authors trained their model following [12]. The slight difference is that they substituted the sparse connections of layers 3, 4 and 5 of AlexNet with dense connections in their model, and trained it on a single GTX-580 GPU for 12 days with 70 epochs. They have also experimented with different depths and different filter sizes on the Caltech-101 [29], Caltech-256 [30] and PASCAL-2012 [31] datasets and shown that their model also generalizes well to these datasets.

During training, their visualization technique discovers different properties of CNNs: the projections from each layer, in ascending order, show that the features in the network are hierarchical in nature. For this reason, firstly, the upper layers need a higher number of epochs than the lower layers to converge, and secondly, the network output is stable to translation and scaling. They have used a set of occlusion experiments to check whether the model is sensitive to local or global information.
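The switch-based unpooling described above can be sketched directly in PyTorch, which exposes this mechanism through `return_indices` and `MaxUnpool2d`; the shapes are illustrative:

```python
import torch
import torch.nn as nn

# Max-pooling with return_indices=True records the position of each maximum
# (the "switch"); MaxUnpool2d uses those positions to place activations back.
pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)

feature_map = torch.randn(1, 64, 16, 16)
pooled, switches = pool(feature_map)      # switches: argmax per pooling region
reconstructed = unpool(pooled, switches)  # maxima restored to their positions
print(reconstructed.shape)                # torch.Size([1, 64, 16, 16])
```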
D. VGGNet

Simonyan and Zisserman used a deeper configuration of AlexNet [12], and they proposed it as VGGNet [15]. They have used small filters of size 3×3 for all layers and made the network deeper while keeping the other parameters fixed. They have used a total of 6 different CNN configurations: A, A-LRN, B, C, D (VGG16) and E (VGG19), with 11, 11, 13, 16, 16 and 19 weighted layers respectively. Figure 8 shows the configuration of model D.
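A sketch of the VGG design rule just described, assuming PyTorch; the stage widths shown correspond to the first two stages of configuration D:

```python
import torch.nn as nn

# Stacks of 3x3 convolutions (stride 1, padding 1) separated by 2x2
# max-pooling, doubling the channel width from stage to stage.
def vgg_stage(in_ch: int, out_ch: int, num_convs: int) -> nn.Sequential:
    layers = []
    for i in range(num_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch,
                             kernel_size=3, padding=1),
                   nn.ReLU()]
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)

stage1 = vgg_stage(3, 64, num_convs=2)    # 224x224x3  -> 112x112x64
stage2 = vgg_stage(64, 128, num_convs=2)  # 112x112x64 -> 56x56x128
```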
E. GoogLeNet

GoogLeNet [16] uses inception modules with dimensionality reduction instead of the naive version of the inception module. Figures 9a and 9b show both inception modules. Despite its 22 layers, the number of parameters used in GoogLeNet is 12 times smaller than in AlexNet, but its accuracy is significantly better. All the convolution, reduction and projection layers use ReLU non-linearity. They have used an average pooling layer instead of the fully connected layers. On top of some inception modules, they have used auxiliary classifiers, which are basically smaller CNNs, to combat the vanishing gradient problem and overfitting.
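A sketch of the inception module with dimensionality reduction, assuming PyTorch; the channel widths are illustrative assumptions, and ReLU follows every convolution as the text describes:

```python
import torch
import torch.nn as nn

# 1x1 convolutions shrink the channel count before the expensive 3x3 and 5x5
# convolutions, and the four parallel branches are concatenated channel-wise.
def conv(in_ch, out_ch, k, pad=0):
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, k, padding=pad), nn.ReLU())

class Inception(nn.Module):
    def __init__(self, in_ch: int):
        super().__init__()
        self.b1 = conv(in_ch, 64, 1)                                   # 1x1 branch
        self.b2 = nn.Sequential(conv(in_ch, 96, 1), conv(96, 128, 3, pad=1))
        self.b3 = nn.Sequential(conv(in_ch, 16, 1), conv(16, 32, 5, pad=2))
        self.b4 = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                conv(in_ch, 32, 1))                    # pool projection

    def forward(self, x):
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)

y = Inception(192)(torch.randn(1, 192, 28, 28))  # -> 1 x 256 x 28 x 28
```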
Fig. 10: The architecture of GoogLeNet [16]
F. ResNet

A deeper network can be constructed from a shallower one with additional layers that perform identity mapping, so that the performance of the deeper network and the shallower network should be similar. He et al. have proposed the deep residual learning framework [17] as a solution to the degradation problem. They have included residual mapping (H(x) = F(x) + x) instead of the desired underlying mapping (H(x)) in their network and named their model ResNet [17]. They have also shown that with increased depth, ResNet is easier to optimize and gains accuracy.
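A sketch of the residual mapping H(x) = F(x) + x, assuming PyTorch; the channel-preserving case with an identity shortcut is shown, with batch normalization as used in ResNet:

```python
import torch
import torch.nn as nn

# The block learns the residual F(x) and adds the identity shortcut x back
# before the final ReLU, so the whole block computes H(x) = F(x) + x.
class ResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return torch.relu(self.f(x) + x)  # H(x) = F(x) + x

y = ResidualBlock(64)(torch.randn(1, 64, 56, 56))
```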
G. DenseNet

Huang et al. introduced Dense Convolutional Networks (DenseNet) [18], which include dense blocks in the conventional CNN. The input of a certain layer in a dense block is the concatenation of the outputs of all the previous layers, as shown in figure 12. Here, each layer reuses the features of all previous layers, strengthening feature propagation and reducing the vanishing gradient problem. The use of a small number of filters also reduces the number of parameters.
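A sketch of a dense block, assuming PyTorch; "growth rate" is DenseNet's term for the small, fixed number of feature maps each layer adds:

```python
import torch
import torch.nn as nn

# Every layer receives the concatenation of all previous feature maps and
# emits only growth_rate new maps, which keeps the parameter count low.
class DenseBlock(nn.Module):
    def __init__(self, in_ch: int, growth_rate: int, num_layers: int):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Sequential(
                nn.BatchNorm2d(in_ch + i * growth_rate),
                nn.ReLU(),
                nn.Conv2d(in_ch + i * growth_rate, growth_rate,
                          kernel_size=3, padding=1),
            )
            for i in range(num_layers)
        ])

    def forward(self, x):
        for layer in self.layers:
            x = torch.cat([x, layer(x)], dim=1)  # concat all earlier outputs
        return x

y = DenseBlock(in_ch=64, growth_rate=32, num_layers=4)(torch.randn(1, 64, 28, 28))
# y has 64 + 4*32 = 192 channels
```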
1) Training Details: Huang et al. trained DenseNet on the CIFAR [35], SVHN [36] and ImageNet datasets using SGD, with batch size 64 on the CIFAR and SVHN datasets and batch size 256 on the ImageNet dataset. The initial learning rate was 0.1 and was decreased twice by a factor of 10. They have used a weight decay of 0.0001, Nesterov momentum [37] of 0.9 and dropout of 0.2.

On the C10 [38], C100 [39] and SVHN datasets, DenseNet and DenseNet-BC outperform the error rates of previous CNN architectures. A DenseNet twice as deep as a ResNet gives similar accuracy on the ImageNet dataset with far fewer (by a factor of 2) parameters. The authors found that DenseNet can be scaled to hundreds of layers without optimization difficulty. It also gives consistent improvement as the number of parameters increases, without performance degradation or overfitting. Also, it requires comparatively fewer parameters and less computational power for better performance.
H. CapsNet

Conventional CNNs, described above, suffer from two problems. Firstly, sub-sampling loses the spatial information between higher-level features. Secondly, they face difficulty in generalizing to novel viewpoints: they can deal with translation but cannot detect other dimensions of affine transformation. In 2017, Geoffrey E. Hinton proposed CapsNet [19] to handle these problems. CapsNet has components called capsules. A capsule is a group of neurons, so a layer of CapsNet is basically composed of nested neurons. Unlike in a typical neural network, a capsule is squashed as a whole vector rather than each output unit being squashed individually. So the scalar-output feature detectors of CNNs are replaced by vector-output capsules. Also, max-pooling is replaced by "dynamic routing by agreement", which makes each capsule in each layer go to the next most relevant capsule at the time of forward propagation.

The architecture of a simple CapsNet is shown in figure 14. As the convolutional layer is a 1D layer, no routing is used between this layer and the primary capsule layer.

1) Training details: Training of CapsNet is performed on MNIST images. To compare the test accuracy, they have used one standard CNN (baseline) and two CapsNets with 1 and 3 routing iterations respectively. They have used reconstruction loss as a regularization method. Using a 3-layer CapsNet with 3 routing iterations and with added reconstruction, the authors get a test error of 0.25%.

Though CapsNet has shown outstanding performance on MNIST, it may not perform well on large-scale image datasets like ImageNet. It may also suffer from the vanishing gradient problem.
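A sketch of the whole-vector squashing described above, assuming PyTorch; the function follows the squashing non-linearity of [19], and the capsule shapes are illustrative:

```python
import torch

# A capsule's output vector is squashed as a whole so its length lies in
# (0, 1) and can act as the probability that the entity it represents is
# present: v = (|s|^2 / (1 + |s|^2)) * (s / |s|).
def squash(s: torch.Tensor, dim: int = -1, eps: float = 1e-8) -> torch.Tensor:
    norm_sq = (s * s).sum(dim=dim, keepdim=True)
    norm = torch.sqrt(norm_sq + eps)
    return (norm_sq / (1.0 + norm_sq)) * (s / norm)

capsules = torch.randn(1, 10, 16)  # 10 capsules, 16-dimensional each
v = squash(capsules)               # each 16-d vector squashed as a whole
print(v.norm(dim=-1))              # lengths all in (0, 1)
```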
I. SENet

In 2017, Hu et al. designed the "Squeeze-and-Excitation network" (SENet) [20] and became the winner of ILSVRC-2017. They have reduced the top-5 error rate to 2.25%. Their main contribution is the "Squeeze-and-Excitation" (SE) block, shown in figure 15. Here, F_tr : X → U is a convolutional operation. A squeeze function (F_sq) performs average pooling on each individual channel of the feature map U and produces a 1×1×C-dimensional channel descriptor. An excitation function (F_ex) is a self-gating mechanism made up of three layers: two fully connected layers with a ReLU non-linearity layer in between. It takes the squeezed output as input and produces per-channel modulation weights. By applying the excited output to the feature map U, U is scaled (F_scale) to generate the final output of the SE block.

Fig. 15: A Squeeze-and-Excitation block [20]
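A sketch of the SE block as just described, assuming PyTorch; the reduction ratio r is a hyperparameter of [20], and the channel count below is illustrative:

```python
import torch
import torch.nn as nn

# Squeeze (global average pooling per channel), excitation (FC -> ReLU ->
# FC -> sigmoid gating), then scaling of the feature map U.
class SEBlock(nn.Module):
    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)       # F_sq: 1x1xC descriptor
        self.excite = nn.Sequential(                 # F_ex: self-gating
            nn.Linear(channels, channels // r),
            nn.ReLU(),
            nn.Linear(channels // r, channels),
            nn.Sigmoid(),
        )

    def forward(self, u):
        b, c, _, _ = u.shape
        w = self.excite(self.squeeze(u).view(b, c))  # per-channel weights
        return u * w.view(b, c, 1, 1)                # F_scale: reweight U

x = SEBlock(256)(torch.randn(1, 256, 14, 14))
```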
IV. COMPARISON

TABLE III: Comparative performance of different CNN configurations. The + indicates DenseNet with bottleneck layer and compression (10-crop testing result).

| Name of the CNN | Dataset | Year | Type of CNN | #trained layers | Top-1 (val) | Top-5 (val) | Top-5 (test) |
| AlexNet | ImageNet | 2012 | 1 CNN | 8 | 40.7% | 18.2% | - |
| | | | 5 CNN | - | 38.1% | 16.4% | 16.4% |
| | | | 1 CNN | - | 39.0% | 16.6% | - |
| | | | 7 CNN | - | 36.7% | 15.4% | 15.3% |
| ZFNet | ImageNet | 2013 | 1 CNN | 8 | 38.4% | 16.5% | - |
| | | | 5 CNN - (a) | - | 36.7% | 15.3% | 15.3% |
| | | | 1 CNN with layers 3, 4, 5: 512, 1024, 512 maps - (b) | - | 37.5% | 16.0% | 16.1% |
| | | | 6 CNN, combination of (a) & (b) | - | 36.0% | 14.7% | 14.8% |
| VGGNet | ImageNet | 2014 | ensemble of 7 ConvNets (3-D, 2-C & 2-E) | - | 24.7% | 7.5% | 7.3% |
| | | | ConvNet-D (multi-crop & dense) | 16 | 24.4% | 7.2% | - |
| | | | ConvNet-E (multi-crop & dense) | 19 | 24.4% | 7.1% | - |
| | | | ConvNet-E (multi-crop & dense) | 19 | 24.4% | 7.1% | 7.0% |
| | | | Ensemble of multi-scale ConvNets D & E (multi-crop & dense) | - | 23.7% | 6.8% | 6.8% |
| GoogLeNet | ImageNet | 2014 | 1 CNN with 1 crop | 22 | - | - | 10.07% |
| | | | 1 CNN with 10 crops | - | - | - | 9.15% |
| | | | 1 CNN with 144 crops | - | - | - | 7.89% |
| | | | 7 CNN with 1 crop | - | - | - | 8.09% |
| | | | 7 CNN with 10 crops | - | - | - | 7.62% |
| | | | 7 CNN with 144 crops | - | - | - | 6.67% |
| ResNet | ImageNet | 2015 | plain layer | 18 | 27.94% | - | - |
| | | | ResNet-18 | 18 | 27.88% | - | - |
| | | | plain layer | 34 | 28.54% | 10.02% | - |
| | | | ResNet-34 (zero-padding shortcuts), 10-crop testing - (a) | 34 | 25.03% | 7.76% | - |
| | | | ResNet-34 (projection shortcuts to increase dimension, others are identity shortcuts), 10-crop testing - (b) | 34 | 24.52% | 7.46% | - |
| | | | ResNet-34 (all shortcuts are projection), 10-crop testing - (c) | 34 | 24.19% | 7.40% | - |
| | | | ResNet-50 (with bottleneck layer), 10-crop testing | 50 | 22.85% | 6.71% | - |
| | | | ResNet-101 (with bottleneck layer), 10-crop testing | 101 | 21.75% | 6.05% | - |
| | | | ResNet-152 (with bottleneck layer), 10-crop testing | 152 | 21.43% | 5.71% | - |
| | | | 1 ResNet-34 (b) | 34 | 21.84% | 5.71% | - |
| | | | 1 ResNet-34 (c) | 34 | 21.53% | 5.60% | - |
| | | | 1 ResNet-50 | 50 | 20.74% | 5.25% | - |
| | | | 1 ResNet-101 | 101 | 19.87% | 4.60% | - |
| | | | 1 ResNet-152 | 152 | 19.38% | 4.49% | - |
| | | | Ensemble of 6 models | - | - | - | 3.57% |
| DenseNet | ImageNet | 2016 | DenseNet-121 + | 121 | 23.61% | 6.66% | - |
| | | | DenseNet-169 + | 169 | 22.80% | 5.92% | - |
| | | | DenseNet-201 + | 201 | 22.58% | 5.54% | - |
| | | | DenseNet-264 + | 264 | 20.80% | 5.29% | - |
| SENet | ImageNet | 2017 | SE-ResNet-50 | 50 | 23.29% | 6.62% | - |
| | | | SE-ResNeXt-50 | 50 | 21.10% | 5.49% | - |
| | | | SENet-154 (crop size 320×320 / 299×299) | - | 17.28% | 3.79% | - |
| | | | SENet-154 (crop size 320×320) | - | 16.88% | 3.58% | - |
V. CONCLUSION

In this study, we have discussed the advancements of CNN in image classification tasks. We have shown here that although AlexNet, ZFNet and VGGNet followed the architecture of a conventional CNN model such as LeNet-5, their networks are larger and deeper. We have seen that by combining the inception module and residual blocks with the conventional CNN model, GoogLeNet and ResNet gained better accuracy than by stacking the same building blocks again and again. DenseNet focused on feature reuse to strengthen feature propagation. Though CapsNet reached state-of-the-art achievement on MNIST, it is yet to perform as well as previous CNNs on high-resolution image datasets such as ImageNet. The result of SENet on the ImageNet dataset gives us hope that it may turn out to be useful for other tasks which require strong discriminative features.

REFERENCES

[1] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, May 2015.
[2] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016, http://www.deeplearningbook.org.
[3] R. Hecht-Nielsen, "Theory of the backpropagation neural network," in International 1989 Joint Conference on Neural Networks, 1989, pp. 593–605, vol. 1.
[4] D. H. Hubel and T. N. Wiesel, "Receptive fields and functional architecture of monkey striate cortex," Journal of Physiology (London), vol. 195, pp. 215–243, 1968.
[5] K. Fukushima, "Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position," Biological Cybernetics, vol. 36, no. 4, pp. 193–202, Apr 1980. [Online]. Available: https://doi.org/10.1007/BF00344251
[6] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, "Backpropagation applied to handwritten zip code recognition," Neural Computation, vol. 1, no. 4, pp. 541–551, Dec 1989.
[7] Y. LeCun, B. E. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. E. Hubbard, and L. D. Jackel, "Handwritten digit recognition with a back-propagation network," in Advances in Neural Information Processing Systems 2, D. S. Touretzky, Ed. Morgan-Kaufmann, 1990, pp. 396–404. [Online]. Available: http://papers.nips.cc/paper/293-handwritten-digit-recognition-with-a-back-propagation-network.pdf
[8] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, Nov 1998.
[9] Y. Le Cun, "A theoretical framework for back-propagation," 1988.
[10] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in CVPR, 2009.
[11] B. C. Russell, A. Torralba, K. P. Murphy, and W. T. Freeman, "LabelMe: A database and web-based tool for image annotation," International Journal of Computer Vision, vol. 77, no. 1, pp. 157–173, May 2008. [Online]. Available: https://doi.org/10.1007/s11263-007-0090-8
[12] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2012, pp. 1097–1105. [Online]. Available: http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf
[13] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, "ImageNet large scale visual recognition challenge," Int. J. Comput. Vision, vol. 115, no. 3, pp. 211–252, Dec 2015. [Online]. Available: http://dx.doi.org/10.1007/s11263-015-0816-y
[14] M. D. Zeiler and R. Fergus, "Visualizing and understanding convolutional networks," in Computer Vision – ECCV 2014, D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, Eds. Cham: Springer International Publishing, 2014, pp. 818–833.
[15] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," CoRR, vol. abs/1409.1556, 2014. [Online]. Available: http://arxiv.org/abs/1409.1556
[16] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
[17] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[18] G. Huang, Z. Liu, and K. Q. Weinberger, "Densely connected convolutional networks," CoRR, vol. abs/1608.06993, 2016. [Online]. Available: http://arxiv.org/abs/1608.06993
[19] S. Sabour, N. Frosst, and G. E. Hinton, "Dynamic routing between capsules," CoRR, vol. abs/1710.09829, 2017. [Online]. Available: http://arxiv.org/abs/1710.09829
[20] J. Hu, L. Shen, and G. Sun, "Squeeze-and-excitation networks," CoRR, vol. abs/1709.01507, 2017. [Online]. Available: http://arxiv.org/abs/1709.01507
[21] M. D. Buhmann, "Radial basis functions," Acta Numerica, vol. 9, pp. 1–38, 2000.
[22] Y. LeCun and C. Cortes, "MNIST handwritten digit database," 2010. [Online]. Available: http://yann.lecun.com/exdb/mnist/
[23] L. Bottou, "Large-scale machine learning with stochastic gradient descent," in Proceedings of COMPSTAT'2010, Y. Lechevallier and G. Saporta, Eds. Heidelberg: Physica-Verlag HD, 2010, pp. 177–186.
[24] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in Proceedings of the 27th International Conference on Machine Learning, ser. ICML'10. USA: Omnipress, 2010, pp. 807–814. [Online]. Available: http://dl.acm.org/citation.cfm?id=3104322.3104425
[25] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, "ImageNet large scale visual recognition challenge," International Journal of Computer Vision (IJCV), vol. 115, no. 3, pp. 211–252, 2015.
[26] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Improving neural networks by preventing co-adaptation of feature detectors," CoRR, vol. abs/1207.0580, 2012. [Online]. Available: http://arxiv.org/abs/1207.0580
[27] I. Jolliffe, Principal Component Analysis. Berlin, Heidelberg: Springer Berlin Heidelberg, 2011, pp. 1094–1096. [Online]. Available: https://doi.org/10.1007/978-3-642-04898-2_455
[28] M. D. Zeiler, G. W. Taylor, and R. Fergus, "Adaptive deconvolutional networks for mid and high level feature learning," in 2011 International Conference on Computer Vision, Nov 2011, pp. 2018–2025.
[29] L. Fei-Fei, R. Fergus, and P. Perona, "Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories," in 2004 Conference on Computer Vision and Pattern Recognition Workshop, June 2004.
[30] G. Griffin, A. Holub, and P. Perona, "Caltech-256 image dataset," 2006. [Online]. Available: http://www.vision.caltech.edu/Image_Datasets/Caltech256/
[31] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "Pascal visual object classes challenge 2012 (VOC2012) complete dataset."
[32] M. Lin, Q. Chen, and S. Yan, "Network in network," CoRR, vol. abs/1312.4400, 2013. [Online]. Available: http://arxiv.org/abs/1312.4400
[33] T. Serre, L. Wolf, S. Bileschi, M. Riesenhuber, and T. Poggio, "Robust object recognition with cortex-like mechanisms," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 3, pp. 411–426, March 2007.
[34] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, Q. V. Le, and A. Y. Ng, "Large scale distributed deep networks," in Advances in Neural Information Processing Systems 25, 2012, pp. 1223–1231. [Online]. Available: http://papers.nips.cc/paper/4687-large-scale-distributed-deep-networks.pdf
[35] A. Krizhevsky and G. Hinton, "Learning multiple layers of features from tiny images," Tech. Rep., 2009.
[36] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng, "Reading digits in natural images with unsupervised feature learning," in NIPS Workshop on Deep Learning and Unsupervised Feature Learning 2011, 2011. [Online]. Available: http://ufldl.stanford.edu/housenumbers/nips2011_housenumbers.pdf
[37] I. Sutskever, J. Martens, G. Dahl, and G. Hinton, "On the importance of initialization and momentum in deep learning," in Proceedings of the 30th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol. 28, no. 3. Atlanta, Georgia, USA: PMLR, Jun 2013, pp. 1139–1147. [Online]. Available: http://proceedings.mlr.press/v28/sutskever13.html
[38] A. Krizhevsky, V. Nair, and G. Hinton, "CIFAR-10 (Canadian Institute for Advanced Research)." [Online]. Available: http://www.cs.toronto.edu/~kriz/cifar.html
[39] A. Krizhevsky, V. Nair, and G. E. Hinton, "CIFAR-100 (Canadian Institute for Advanced Research)." [Online]. Available: http://www.cs.toronto.edu/~kriz/cifar.html
[40] S. Xie, R. B. Girshick, P. Dollár, Z. Tu, and K. He, "Aggregated residual transformations for deep neural networks," CoRR, vol. abs/1611.05431, 2016. [Online]. Available: http://arxiv.org/abs/1611.05431
[41] C. Szegedy, S. Ioffe, and V. Vanhoucke, "Inception-v4, Inception-ResNet and the impact of residual connections on learning," CoRR, vol. abs/1602.07261, 2016. [Online]. Available: http://arxiv.org/abs/1602.07261
[42] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, "MobileNets: Efficient convolutional neural networks for mobile vision applications," CoRR, vol. abs/1704.04861, 2017. [Online]. Available: http://arxiv.org/abs/1704.04861
[43] X. Zhang, X. Zhou, M. Lin, and J. Sun, "ShuffleNet: An extremely efficient convolutional neural network for mobile devices," CoRR, vol. abs/1707.01083, 2017. [Online]. Available: http://arxiv.org/abs/1707.01083