
Chapter 5

In this image, an Inception-v1 network used for vision classification reveals many

fully realized features, such as electronics, screens, Polaroid cameras, buildings,

food, animal ears, plants, and watery backgrounds. Note that grid cells are labeled

with the classification they give most support for. Grid cells are also sized according

to the number of activations that are averaged within. This representation is very

powerful because it allows us to inspect the different layers of a network and how

the activation functions fire in response to the input.

In this section, we have seen many techniques to process images with CNNs. Next,

we'll move on to video processing.

Video

In this section, we move from image processing to video processing. We'll start our

look at video by discussing six ways in which to classify videos with pretrained nets.

Classifying videos with pretrained nets in

six different ways

Classifying videos is an area of active research because of the large amount of data

needed to process this type of media. Memory requirements frequently reach

the limits of modern GPUs, and a distributed form of training across multiple

machines might be required. Researchers are currently exploring several directions

of investigation, with increasing levels of complexity from the first approach to the

sixth, described next. Let's review them.

The first approach consists of classifying one video frame at a time by considering

each one of them as a separate image processed with a 2D CNN. This approach

simply reduces the video classification problem to an image classification problem.

Each video frame "emits" a classification output, and the video is classified by taking

the most frequently chosen category across all frames.
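The frame-voting scheme can be sketched as follows. Note that `classify_frame` is a hypothetical stand-in for any pretrained 2D CNN (for example, one of the models in `tf.keras.applications`); here it is simulated with random scores so the sketch stays self-contained:

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_CLASSES = 5  # illustrative number of categories

def classify_frame(frame):
    # Stand-in for a pretrained 2D CNN: returns the top class for one frame.
    # A real implementation would run the frame through the network.
    scores = rng.random(NUM_CLASSES)
    return int(np.argmax(scores))

def classify_video(frames):
    # Classify each frame independently, then take a majority vote.
    votes = np.bincount(
        [classify_frame(f) for f in frames],
        minlength=NUM_CLASSES,
    )
    return int(np.argmax(votes))

video = [np.zeros((224, 224, 3)) for _ in range(16)]  # 16 dummy frames
label = classify_video(video)
```

The appeal of this approach is that it reuses any off-the-shelf image classifier unchanged; its weakness is that the vote ignores temporal ordering entirely.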

The second approach consists of creating one single network where a 2D CNN is

combined with an RNN (see Chapter 9, Autoencoders). The idea is that the CNN will

take into account the image components and the RNN will take into account the

sequence information for each video. This type of network can be very difficult to

train because of the very high number of parameters to optimize.
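A minimal sketch of this combined architecture in tf.keras is shown below; the layer sizes and frame dimensions are illustrative assumptions, not values from the text. `TimeDistributed` applies the 2D CNN to every frame, and the LSTM then aggregates the per-frame features over time:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_FRAMES, H, W, C, NUM_CLASSES = 16, 64, 64, 3, 5  # illustrative sizes

model = models.Sequential([
    layers.Input(shape=(NUM_FRAMES, H, W, C)),
    # 2D CNN applied to each frame independently
    layers.TimeDistributed(layers.Conv2D(8, 3, activation="relu")),
    layers.TimeDistributed(layers.GlobalAveragePooling2D()),
    # RNN aggregates the sequence of per-frame feature vectors
    layers.LSTM(32),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])

preds = model(tf.zeros((2, NUM_FRAMES, H, W, C)))  # batch of 2 dummy clips
```

In practice the CNN portion is usually a pretrained backbone with frozen weights, which reduces the number of parameters that actually need to be optimized.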

The third approach is to use a 3D ConvNet, where 3D ConvNets are an extension

of 2D ConvNets operating on a 3D tensor (time, image_width, image_height). This

approach is another natural extension of image classification. Again, 3D ConvNets

can be hard to train.
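A 3D ConvNet of this kind can be sketched in tf.keras as follows; again, the filter counts and clip dimensions are illustrative assumptions. The kernels now slide along the time axis as well as the two spatial axes:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(16, 64, 64, 3)),  # (time, height, width, channels)
    # 3x3x3 kernels convolve over time and space simultaneously
    layers.Conv3D(8, kernel_size=3, activation="relu"),
    layers.MaxPooling3D(pool_size=2),
    layers.Conv3D(16, kernel_size=3, activation="relu"),
    layers.GlobalAveragePooling3D(),
    layers.Dense(5, activation="softmax"),
])

preds = model(tf.zeros((2, 16, 64, 64, 3)))  # batch of 2 dummy clips
```

Because every kernel gains a temporal dimension, parameter counts and memory use grow quickly, which is one reason these networks are hard to train.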

