
Chapter 5

Depthwise convolution

Let's consider an image with multiple channels. In a normal 2D convolution, the filter is as deep as the input, which allows us to mix channels when generating each element of the output. In depthwise convolution, each channel is kept separate: the filter is split into per-channel filters, each convolution is applied separately, and the results are stacked back together into one tensor.
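The channel-by-channel mechanics can be sketched in plain Python (a minimal illustration assuming "valid" padding and stride 1; in practice you would use an optimized layer such as Keras' DepthwiseConv2D):

```python
def depthwise_conv2d(image, filters):
    """Depthwise 2D convolution: one 2D filter per input channel.

    image:   list of channels, each an H x W list of lists.
    filters: one k x k filter per channel.
    Each channel is convolved with its own filter ("valid" padding,
    stride 1) and the per-channel results are stacked back together.
    """
    output = []
    for channel, filt in zip(image, filters):
        k = len(filt)
        h, w = len(channel), len(channel[0])
        conv = [[sum(channel[i + di][j + dj] * filt[di][dj]
                     for di in range(k) for dj in range(k))
                 for j in range(w - k + 1)]
                for i in range(h - k + 1)]
        output.append(conv)  # channels are never mixed
    return output

# Two 2x2 channels, each with its own 2x2 filter
image = [[[1, 2], [3, 4]], [[2, 2], [2, 2]]]
filters = [[[1, 0], [0, 1]], [[1, 1], [1, 1]]]
print(depthwise_conv2d(image, filters))  # [[[5]], [[8]]]
```

Note that the number of output channels always equals the number of input channels, since no filter ever spans more than one channel.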

Depthwise separable convolution

This convolution should not be confused with the separable convolution. After completing the depthwise convolution, an additional step is performed: a 1×1 convolution across channels (a pointwise convolution), which mixes the per-channel outputs. Depthwise separable convolutions are used in Xception. They are also used in MobileNet, a model particularly useful for mobile and embedded vision applications because of its reduced model size and complexity.
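The appeal of this factorization is the parameter count: a standard k×k convolution from C_in to C_out channels needs k·k·C_in·C_out weights, while the depthwise-plus-pointwise pair needs only k·k·C_in + C_in·C_out. A quick check of the arithmetic (ignoring biases; the layer sizes are an illustrative assumption):

```python
def standard_conv_params(k, c_in, c_out):
    # one k x k x c_in filter per output channel
    return k * k * c_in * c_out

def separable_conv_params(k, c_in, c_out):
    # depthwise: one k x k filter per input channel,
    # pointwise: a 1x1 convolution mixing c_in channels into c_out
    return k * k * c_in + c_in * c_out

# A typical mid-network layer: 3x3 kernel, 128 -> 256 channels
print(standard_conv_params(3, 128, 256))   # 294912
print(separable_conv_params(3, 128, 256))  # 33920, roughly 8.7x fewer
```

This size reduction is exactly why MobileNet favors separable convolutions throughout its architecture.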

In this section, we have discussed all the major forms of convolution. The next section will discuss Capsule networks, a new form of learning introduced in 2017.

Capsule networks

Capsule Networks (CapsNets) are a very recent and innovative type of deep learning network. This technique was introduced at the end of October 2017 in a seminal paper titled Dynamic Routing Between Capsules by Sara Sabour, Nicholas Frosst, and Geoffrey Hinton (https://arxiv.org/abs/1710.09829) [14]. Hinton is one of the fathers of Deep Learning and, therefore, the whole Deep Learning community is excited to see the progress made with capsules. Indeed, CapsNets have already beaten the best CNNs on MNIST classification, which is ... well, impressive!

So what is the problem with CNNs?

In CNNs, each layer "understands" an image at a progressive level of granularity. As we discussed in multiple examples, the first layer will most likely recognize straight lines or simple curves and edges, while subsequent layers will start to understand more complex shapes such as rectangles, up to complex forms such as human faces.

Now, one critical operation used in CNNs is pooling. Pooling aims at creating positional invariance, and it is used after each CNN layer to keep the problem computationally tractable. However, pooling introduces a significant problem because it forces us to lose all the positional data. This is not good. Think about a face: it consists of two eyes, a mouth, and a nose, and what is important is the spatial relationship between these parts (for example, the mouth is below the nose, which is typically below the eyes).
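A tiny example makes the information loss concrete: after 2×2 max pooling, a feature detected in the top-left corner of a patch becomes indistinguishable from the same feature in the bottom-right corner (a toy sketch, not a full CNN layer):

```python
def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on a 2D list of lists."""
    return [[max(x[i][j], x[i][j + 1], x[i + 1][j], x[i + 1][j + 1])
             for j in range(0, len(x[0]) - 1, 2)]
            for i in range(0, len(x) - 1, 2)]

top_left     = [[9, 0], [0, 0]]  # "feature" in the top-left corner
bottom_right = [[0, 0], [0, 9]]  # same feature, opposite corner

# Both patches pool to the same value: the position is gone.
print(max_pool_2x2(top_left), max_pool_2x2(bottom_right))  # [[9]] [[9]]
```

This is precisely the kind of spatial relationship (mouth below nose, nose below eyes) that capsules are designed to preserve.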
