CNN and Autoencoder
• A fundamental goal of using CNNs with images is to remove the cumbersome and ultimately
limiting manual feature selection process.
• A convolutional neural network is a feed-forward neural network that is generally used to
analyze visual images by processing data with a grid-like topology. It is also known as
a ConvNet. A convolutional neural network is used to detect and classify objects in an
image.
• They can identify faces, individuals, street signs, and many other aspects of visual data.
They are also good at text and sound analysis.
• CNNs work well in self-driving cars, drones, robotics, and assistive technology for the visually impaired.
• They are good at building position- and rotation-invariant features from raw image data.
• They help in building a more robust feature space based on the raw signal.
In deep learning, a convolutional neural network (CNN/ConvNet) is a
class of deep neural networks, most commonly applied to analyze
visual imagery.
Consider an example: classifying between the letters X and O.
REPRESENTATION OF AN IMAGE IN CNN
In a CNN, every image is represented as an array of pixel values.
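As a hedged illustration (the array values below are invented for this sketch, not taken from these notes), a tiny 5x5 grayscale "X" can be stored as a NumPy array of pixel intensities:

# Illustrative only: a 5x5 "X" image as an array of pixel values,
# where 1 = bright stroke and 0 = dark background.
import numpy as np

x_image = np.array([
    [1, 0, 0, 0, 1],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [1, 0, 0, 0, 1],
], dtype=np.float32)

print(x_image.shape)  # (5, 5) -- height x width; a color image would be (H, W, 3)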
Layers in a Convolutional Neural Network
A convolution neural network has multiple hidden layers that help in extracting information from an
image. The four important layers in CNN are:
1.Convolution layer
2.ReLU layer
3.Pooling layer
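The following is a minimal Keras sketch, not the exact model from these notes, showing the four layers stacked for the earlier X-vs-O example; the input size (28x28 grayscale), filter count, and optimizer are assumptions made for illustration.

# A minimal sketch of the four CNN layers for a binary X-vs-O classifier.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(28, 28, 1)),          # assumed 28x28 grayscale input
    layers.Conv2D(16, kernel_size=(3, 3)),    # 1. Convolution layer
    layers.ReLU(),                            # 2. ReLU layer
    layers.MaxPooling2D(pool_size=(2, 2)),    # 3. Pooling layer
    layers.Flatten(),
    layers.Dense(2, activation="softmax"),    # 4. Fully connected layer (X vs O)
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()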
Filter size: It can be 3x3, 5x5, or 7x7. It is advisable to keep the filter size small.
There are several pooling functions such as the average of the rectangular neighborhood, L2 norm of the
rectangular neighborhood, and a weighted average based on the distance from the central pixel. However, the
most popular process is max pooling, which reports the maximum output from the neighborhood.
If we have an activation map of size W x W x D, a pooling kernel of spatial size F, and stride S, then the size of
the output volume can be determined by the following formula:

W_out = (W - F) / S + 1

so the output volume has size W_out x W_out x D (pooling does not change the depth D).
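As a quick sanity check of this formula, the sketch below plugs in assumed values (W = 32, D = 64, F = 2, S = 2); these numbers are illustrative, not taken from the notes.

# Compute the pooling output volume from the formula above.
def pool_output_size(W, D, F, S):
    out = (W - F) // S + 1
    return out, out, D

print(pool_output_size(W=32, D=64, F=2, S=2))  # (16, 16, 64)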
•Several companies, such as Tesla and Uber, are using convolutional neural networks as the computer vision component of a self-
driving car.
•A self-driving car’s computer vision system must be capable of localization, obstacle avoidance, and path planning.
•Let us consider the case of pedestrian detection. A pedestrian is a kind of obstacle that moves. A convolutional neural network must
be able to identify the location of the pedestrian and extrapolate their current motion in order to determine whether a collision is imminent.
•A convolutional neural network for object detection is slightly more complex than a classification model, in that it must not only
classify an object, but also return the four coordinates of its bounding box.
•Furthermore, the convolutional neural network designer must avoid unnecessary false alarms for irrelevant objects, such as litter, but
also take into account the high cost of miscategorizing a true pedestrian and causing a fatal accident.
•A major challenge for this kind of use is collecting labeled training data. Google’s CAPTCHA system, which is used for authentication on
websites, asks users to categorize images as fire hydrants, traffic lights, cars, etc. This is actually a useful way to collect
labeled training images for purposes such as self-driving cars and Google Street View.
AUTOENCODERS
Parts: the encoder, the code (bottleneck layer), and the decoder.
Traditionally, autoencoders were used for dimensionality reduction or feature learning.
Code size: It represents the number of nodes in the middle layer. Smaller size results in more compression.
Number of nodes per layer: The number of nodes per layer decreases with each subsequent layer of the encoder,
and increases back in the decoder. The decoder is symmetric to the encoder in terms of the layer structure.
Loss function: We use either mean squared error or binary cross-entropy. If the input values are in the range [0, 1],
we typically use cross-entropy; otherwise, we use mean squared error.
This is done by balancing two criteria: the autoencoder must be sensitive enough to the inputs to reconstruct them accurately, yet constrained enough that it does not simply memorize and copy the input to the output.
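A minimal Keras sketch of such an autoencoder is given below; the layer sizes and code size are assumptions for illustration, not values from these notes. It shows a symmetric encoder/decoder and binary cross-entropy for inputs in [0, 1].

# Illustrative fully connected autoencoder for flattened 28x28 images in [0, 1].
from tensorflow.keras import layers, models

code_size = 32  # number of nodes in the middle layer (smaller = more compression)

autoencoder = models.Sequential([
    layers.Input(shape=(784,)),
    layers.Dense(128, activation="relu"),        # encoder: nodes decrease per layer
    layers.Dense(code_size, activation="relu"),  # code / bottleneck
    layers.Dense(128, activation="relu"),        # decoder: symmetric to the encoder
    layers.Dense(784, activation="sigmoid"),     # reconstruction in [0, 1]
])

# Inputs are in [0, 1], so binary cross-entropy is used; otherwise MSE would be typical.
autoencoder.compile(optimizer="adam", loss="binary_crossentropy")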
Autoencoder types
Undercomplete autoencoders
Regularized autoencoders
Sparse autoencoders
Denoising autoencoder
Contractive autoencoders
UNDERCOMPLETE AUTOENCODER
The learning process is described simply as minimizing a loss function L(x, g(f(x))) where:
f is the encoder function, g is the decoder function, and L is a loss function penalizing g(f(x)) for being
dissimilar from x, such as the mean squared error.
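As a hedged sketch of this objective, the toy code below uses linear stand-ins for the encoder f and decoder g (the data and weight shapes are invented for illustration) and evaluates L(x, g(f(x))) as mean squared error.

import numpy as np

def f(x, W_enc):
    # Encoder stand-in: linear projection to a smaller code.
    return x @ W_enc

def g(h, W_dec):
    # Decoder stand-in: linear map from the code back to input space.
    return h @ W_dec

rng = np.random.default_rng(0)
x = rng.random((10, 784))                  # a small batch of flattened "images"
W_enc = 0.01 * rng.normal(size=(784, 32))  # 784 -> 32 (undercomplete code)
W_dec = 0.01 * rng.normal(size=(32, 784))  # 32 -> 784

reconstruction = g(f(x, W_enc), W_dec)
loss = np.mean((x - reconstruction) ** 2)  # L(x, g(f(x))) as mean squared error
print(loss)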
Advantages-
Undercomplete autoencoders do not need any regularization, as they maximize the probability of the data
rather than copying the input to the output.
Drawbacks-
Using an overparameterized model due to a lack of sufficient training data can lead to overfitting.
Applications
Denoising: the input is a clean image plus noise, and the network is trained to reproduce the clean image (see the sketch below).
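A hedged sketch of this setup follows; random arrays stand in for real images, and the architecture and noise level are assumptions. The key point is that the noisy image is the input while the clean image is the training target.

# Illustrative denoising setup: noisy input, clean target.
import numpy as np
from tensorflow.keras import layers, models

x_clean = np.random.rand(256, 784).astype("float32")  # placeholder "clean" images in [0, 1]
x_noisy = np.clip(x_clean + 0.2 * np.random.randn(256, 784), 0.0, 1.0).astype("float32")

denoiser = models.Sequential([
    layers.Input(shape=(784,)),
    layers.Dense(128, activation="relu"),
    layers.Dense(32, activation="relu"),
    layers.Dense(128, activation="relu"),
    layers.Dense(784, activation="sigmoid"),
])
denoiser.compile(optimizer="adam", loss="binary_crossentropy")
denoiser.fit(x_noisy, x_clean, epochs=1, batch_size=32)  # noisy in, clean out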
Image colorization: the input is a black-and-white image, and the network is trained to produce a color image.