Chapter 5

Here α is a hyperparameter and can take a value between 0 and 1. Unless the value is determined by some domain knowledge about the problem, it can be set to 0.5.

The following figure shows a typical classification and localization network architecture. As you can see, the only difference with respect to a typical CNN classification network is the additional regression head on the top right:

Figure 2: Network architecture for image classification and localization

Semantic segmentation

Another class of problem that builds on the basic classification idea is "semantic segmentation." Here the aim is to classify every single pixel of the image as belonging to a single class.

An initial implementation could be to build a classifier network for each pixel, where the input is a small neighborhood around that pixel. In practice, this approach is not very performant, so an improvement might be to run the image through convolutions that increase the feature depth while keeping the image width and height constant. Each pixel then has a feature map that can be sent through a fully connected network that predicts its class. However, in practice this is also quite expensive, and it is not normally used.

A third approach is to use a CNN encoder-decoder network, where the encoder decreases the width and height of the image but increases its depth (number of features), while the decoder uses transposed convolution operations to increase its size and decrease its depth. Transposed convolution (or upsampling) is the process of going in the opposite direction of a normal convolution. The input to this network is the image and the output is the segmentation map.
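The transposed convolution can be illustrated with a minimal single-channel sketch in plain NumPy. This is a simplification: in a real decoder you would use a learned, multi-channel layer such as Keras' `Conv2DTranspose`; the all-ones kernel and stride here are purely illustrative.

```python
import numpy as np

def conv_transpose_2d(x, kernel, stride=2):
    """Minimal single-channel 2D transposed convolution (no padding).

    Each input pixel "paints" a kernel-sized patch into the output,
    scaled by the pixel's value -- the opposite of a normal convolution,
    which gathers a kernel-sized patch into one output pixel.
    """
    h, w = x.shape
    kh, kw = kernel.shape
    out = np.zeros(((h - 1) * stride + kh, (w - 1) * stride + kw))
    for i in range(h):
        for j in range(w):
            out[i * stride:i * stride + kh,
                j * stride:j * stride + kw] += x[i, j] * kernel
    return out

# A 4x4 feature map upsampled with a 2x2 kernel and stride 2 becomes 8x8,
# doubling the spatial size just as a decoder stage would.
feat = np.ones((4, 4))
up = conv_transpose_2d(feat, np.ones((2, 2)))
print(up.shape)  # (8, 8)
```

Note how the output is larger than the input, whereas a normal convolution with the same kernel and stride would shrink it: this is what lets the decoder recover the original image resolution for the segmentation map.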

Advanced Convolutional Neural Networks

A popular implementation of this encoder-decoder architecture is the U-Net (a good implementation is available at https://github.com/jakeret/tf_unet), originally developed for biomedical image segmentation, which has additional skip connections between corresponding layers of the encoder and decoder. The U-Net architecture is shown in the following figure:

Figure 3: U-Net architecture. Source: Pattern Recognition and Image Processing (https://lmb.informatik.uni-freiburg.de/people/ronneber/u-net/)

Object detection

The object detection task is similar to the classification and localization task. The big difference is that now there are multiple objects in the image, and for each one we need to find the class and the bounding box coordinates. In addition, neither the number of objects nor their size is known in advance. As you can imagine, this is a difficult problem, and a fair amount of research has gone into it.

A first approach to the problem might be to create many random crops of the input image and, for each crop, apply the classification and localization network we described earlier. However, such an approach is very wasteful in terms of computation and unlikely to be very successful.
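The brute-force random-crop approach can be sketched as follows. `random_crops` is a hypothetical helper (not from any library), assuming images are NumPy arrays; the key point is that every crop would still have to be pushed through the full classification and localization network, which is exactly why the method is so wasteful.

```python
import numpy as np

def random_crops(image, num_crops, crop_h, crop_w, rng=None):
    """Sample random fixed-size crops from an H x W (x C) image.

    Returns (top, left, crop) tuples. Each crop must be run through the
    classification + localization network separately, so the compute cost
    grows linearly with num_crops.
    """
    rng = rng or np.random.default_rng(0)
    h, w = image.shape[0], image.shape[1]
    crops = []
    for _ in range(num_crops):
        top = int(rng.integers(0, h - crop_h + 1))
        left = int(rng.integers(0, w - crop_w + 1))
        crops.append((top, left, image[top:top + crop_h, left:left + crop_w]))
    return crops

image = np.zeros((224, 224, 3))
crops = random_crops(image, num_crops=5, crop_h=64, crop_w=64)
print(len(crops), crops[0][2].shape)  # 5 (64, 64, 3)
```

Covering objects of unknown position and scale this way would require a very large number of crops at several sizes, which motivates the smarter region-proposal methods developed by later research.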

