Deep Convolutional Neural Networks For Image Classification: Many Slides From Rob Fergus (NYU and Facebook)
Shallow vs. deep architectures
Shallow: Image/Video Pixels → Hand-designed feature extraction → Trainable classifier → Object Class
Deep: Image/Video Pixels → Layer 1 → Layer 2 → Layer 3 → Simple Classifier → Object Class
Inspiration: Neuron cells
Inputs x_1, ..., x_d with weights w_1, ..., w_d
Output: σ(w · x + b)
Sigmoid function: σ(t) = 1 / (1 + e^{-t})
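A minimal NumPy sketch of this single-neuron computation (the input and weight values below are made up for illustration):

```python
import numpy as np

def sigmoid(t):
    """Sigmoid nonlinearity: sigma(t) = 1 / (1 + e^(-t))."""
    return 1.0 / (1.0 + np.exp(-t))

def neuron(x, w, b):
    """Single neuron: sigmoid of the weighted sum of inputs plus a bias."""
    return sigmoid(np.dot(w, x) + b)

# Made-up example values for a 3-input neuron.
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.1, 0.4, -0.2])
print(neuron(x, w, b=0.0))  # a value in (0, 1)
```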
Background: Multi-Layer Neural Networks
Nonlinear classifier
Training: find network weights w to minimize the error between true training labels y_i and estimated labels f_w(x_i):

E(w) = Σ_{i=1}^{N} (y_i − f_w(x_i))²
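A minimal sketch of minimizing this objective by gradient descent, assuming the simplest possible network f_w(x) = σ(w·x + b); the toy AND dataset and all hyperparameters are illustrative, not from the slides:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def train(X, y, lr=0.5, epochs=1000):
    """Gradient descent on E(w) = sum_i (y_i - f_w(x_i))^2
    for the one-neuron network f_w(x) = sigmoid(w.x + b)."""
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.1, size=X.shape[1])
    b = 0.0
    for _ in range(epochs):
        f = sigmoid(X @ w + b)       # predictions f_w(x_i)
        err = f - y                  # dE/df, up to a constant factor of 2
        grad = err * f * (1 - f)     # chain rule through the sigmoid
        w -= lr * X.T @ grad
        b -= lr * grad.sum()
    return w, b

# Toy data: learn the AND of two binary inputs.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1], dtype=float)
w, b = train(X, y)
print(np.round(sigmoid(X @ w + b), 2))  # approaches [0, 0, 0, 1]
```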
Convolutional Neural Networks (CNN, Convnet)
Neural network with specialized connectivity structure
Stack multiple stages of feature extractors
Higher stages compute more global, more invariant features
Classification layer at the end
[Figure: learned convolution filters applied to the Input Image]
1. Convolution
Dependencies are local
Translation invariance
Few parameters (filter weights)
Stride can be greater than 1 (faster, less memory)
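These properties can be read off a direct implementation. A naive NumPy sketch of a single-channel, single-filter convolution with stride (function and variable names are illustrative):

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Valid 2D convolution (strictly, cross-correlation) of a single-channel
    image with one filter. The same few kernel weights are reused at every
    spatial position (parameter sharing), each output depends only on a local
    patch, and stride > 1 subsamples the output."""
    H, W = image.shape
    kH, kW = kernel.shape
    out_h = (H - kH) // stride + 1
    out_w = (W - kW) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i*stride:i*stride+kH, j*stride:j*stride+kW]
            out[i, j] = np.sum(patch * kernel)  # local dependency only
    return out

image = np.arange(36, dtype=float).reshape(6, 6)
edge = np.array([[1.0, -1.0]])                 # tiny horizontal-gradient filter
print(conv2d(image, edge, stride=2).shape)     # (3, 3): stride 2 subsamples both dims
```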
Spatial pooling
Max or sum over local neighborhoods
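A matching NumPy sketch of spatial pooling over non-overlapping neighborhoods, with max or sum selectable (names are illustrative):

```python
import numpy as np

def pool2d(fmap, size=2, op=np.max):
    """Spatial pooling: summarize each non-overlapping size x size
    neighborhood with max (pass op=np.sum for sum-pooling)."""
    H, W = fmap.shape
    out = np.zeros((H // size, W // size))
    for i in range(H // size):
        for j in range(W // size):
            out[i, j] = op(fmap[i*size:(i+1)*size, j*size:(j+1)*size])
    return out

fmap = np.arange(16, dtype=float).reshape(4, 4)
print(pool2d(fmap))             # max pooling
print(pool2d(fmap, op=np.sum))  # sum pooling
```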
4. Normalization
Within or across feature maps
Before or after spatial pooling
[Figure: feature maps before and after contrast normalization]
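A simplified NumPy sketch of divisive contrast normalization across feature maps; real implementations typically also pool over a spatial neighborhood, so treat this as a toy variant:

```python
import numpy as np

def contrast_normalize(fmaps, eps=1e-5):
    """Divisive contrast normalization across feature maps: at each spatial
    position, divide by the standard deviation of responses over all maps
    (eps avoids division by zero in flat regions)."""
    # fmaps: array of shape (num_maps, H, W)
    std = fmaps.std(axis=0, keepdims=True)
    return fmaps / (std + eps)

fmaps = np.random.default_rng(0).normal(size=(8, 5, 5))
print(contrast_normalize(fmaps).std(axis=0).round(2))  # ~1 everywhere
```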
Compare: SIFT Descriptor (Lowe, IJCV 2004)
Image Pixels → Apply oriented filters → Spatial pool (Sum) → Normalize to unit length → Feature Vector
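A schematic NumPy version of this pipeline; it follows the steps above (oriented gradient responses, spatial sum-pooling over a 4×4 grid, unit-length normalization) but is a toy illustration, not Lowe's actual implementation:

```python
import numpy as np

def sift_like_descriptor(patch):
    """SIFT-style pipeline: oriented responses, spatial sum-pooling
    over a 4x4 grid of cells, then L2 normalization to unit length."""
    gy, gx = np.gradient(patch)
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx)
    bins = ((ang + np.pi) / (2 * np.pi) * 8).astype(int) % 8   # 8 orientations
    desc = []
    step = patch.shape[0] // 4                                 # 4x4 spatial grid
    for i in range(4):
        for j in range(4):
            cell = (slice(i*step, (i+1)*step), slice(j*step, (j+1)*step))
            hist = np.bincount(bins[cell].ravel(),
                               weights=mag[cell].ravel(), minlength=8)
            desc.append(hist)                                  # spatial pool (sum)
    desc = np.concatenate(desc)
    return desc / (np.linalg.norm(desc) + 1e-8)                # unit length

patch = np.random.default_rng(0).random((16, 16))
print(sift_like_descriptor(patch).shape)  # (128,) = 4*4 cells x 8 orientations
```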
Compare: Spatial Pyramid Matching (Lazebnik, Schmid, Ponce, CVPR 2006)
SIFT features → Filter with Visual Words → Take max VW response (L-inf normalization) → Multi-scale spatial pool (Sum) → Global image descriptor
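A schematic NumPy version of this pipeline as well; the vocabulary, features, and grid levels are toy stand-ins:

```python
import numpy as np

def spatial_pyramid(features, positions, vocab, levels=(1, 2, 4)):
    """Spatial-pyramid sketch: assign each local feature to its best visual
    word (max response), then sum-pool word counts over grids at several
    scales and concatenate into one global image descriptor."""
    # features: (N, D) local descriptors; positions: (N, 2) in [0, 1)^2
    sims = features @ vocab.T               # response of each feature to each word
    words = sims.argmax(axis=1)             # take max visual-word response
    K = vocab.shape[0]
    pooled = []
    for g in levels:                        # multi-scale spatial pooling
        cells = (positions * g).astype(int).clip(0, g - 1)
        idx = (cells[:, 0] * g + cells[:, 1]) * K + words
        pooled.append(np.bincount(idx, minlength=g * g * K))
    return np.concatenate(pooled)           # global image descriptor

rng = np.random.default_rng(0)
feats, pos = rng.normal(size=(50, 8)), rng.random((50, 2))
vocab = rng.normal(size=(10, 8))            # 10 visual words
print(spatial_pyramid(feats, pos, vocab).shape)  # (10 + 40 + 160,) = (210,)
```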
Convnet Successes
Handwritten text/digits
MNIST (0.17% error [Ciresan et al. 2011])
Arabic & Chinese [Ciresan et al. 2012]
[Bar chart: top-5 error rate (%) for ImageNet challenge entries SuperVision, ISI, Oxford, INRIA, Amsterdam; axis from 0 to 35]
Visualizing Convnets
Patches from validation images that give maximal activation of a given feature map
Layer 2: Top-9 Patches
Layer 3: Top-9 Patches
Layer 4: Top-9 Patches
Layer 5: Top-9 Patches
Evolution of Features During Training
Diagnosing Problems
Visualization of Krizhevsky et al.'s architecture showed some problems with layers 1 and 2 (a large stride of 4 was used)
Alter architecture: smaller stride & filter size
Visualizations look better
Performance improves
Monitor output
[Plot: test error (top-5), values ranging from 0.17 down to 0.10, shown alongside the input image]
How important is depth?
[Series of ablation diagrams: stacks from the Input Image up to the Softmax Output, with groups of layers removed in turn]
Keep up through Layer 5 (Conv + Pool): drop 16 million parameters
Keep up through Layer 3 (Conv): 5.7% drop in performance
Remove Layers 3 and 4, keep Layer 5 (Conv + Pool): drop ~1 million parameters
Remove all intermediate layers: only the Softmax Output on the Input Image remains
Tapping off Features at each Layer
[1] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell, "DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition," arXiv preprint, 2014.
[2] A. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, "CNN Features off-the-shelf: an Astounding Baseline for Recognition," arXiv preprint, 2014.
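A minimal sketch of the off-the-shelf-features recipe from [1, 2]: forward the image through a trained network, stop at layer k, and use the flattened activations as a feature vector for a simple classifier. The layer list below is a hypothetical stand-in for a trained ConvNet:

```python
import numpy as np

def tap_features(layers, image, k):
    """Run an image through the first k layers of a trained network and
    flatten the activations into a fixed-length feature vector."""
    x = image
    for layer in layers[:k]:
        x = layer(x)
    return np.asarray(x).ravel()

# Toy stand-in layers, just to exercise the control flow.
toy_layers = [lambda x: x * 2.0, np.tanh, lambda x: x + 1.0]
print(tap_features(toy_layers, np.ones((2, 2)), k=2))
```

The resulting vectors can then be fed to any off-the-shelf classifier (e.g., a linear SVM) without fine-tuning the network itself.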
CNN features for detection
Object detection system overview. Our system (1) takes an input image, (2) extracts around 2000 bottom-up region proposals, (3) computes features for each proposal using a large convolutional neural network (CNN), and then (4) classifies each region using class-specific linear SVMs. R-CNN achieves a mean average precision (mAP) of 53.7% on PASCAL VOC 2010. For comparison, Uijlings et al. (2013) report 35.1% mAP using the same region proposals, but with a spatial pyramid and bag-of-visual-words approach. The popular deformable part models perform at 33.4%.
R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation," CVPR 2014, to appear.
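A high-level sketch of the four-step pipeline in the caption; propose_regions, cnn_features, and the SVM parameters are hypothetical stand-ins for the components the caption names:

```python
import numpy as np

def rcnn_detect(image, propose_regions, cnn_features, svm_weights, svm_biases):
    """(1) input image -> (2) ~2000 bottom-up region proposals ->
    (3) CNN feature vector per region -> (4) score each region with
    class-specific linear SVMs; return (box, class index, score) triples."""
    detections = []
    for box in propose_regions(image):            # bottom-up proposals
        feat = cnn_features(image, box)           # fixed-length CNN feature
        scores = svm_weights @ feat + svm_biases  # one linear SVM per class
        detections.append((box, int(scores.argmax()), float(scores.max())))
    return detections

# Toy stand-ins just to exercise the control flow.
dets = rcnn_detect(
    image=np.zeros((224, 224, 3)),
    propose_regions=lambda img: [(0, 0, 50, 50), (10, 10, 60, 60)],
    cnn_features=lambda img, box: np.ones(4096),
    svm_weights=np.random.default_rng(0).normal(size=(20, 4096)),
    svm_biases=np.zeros(20),
)
print(dets[0])
```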
CNN features for face verification