In this image, an Inception-v1 network used for vision classification reveals many fully realized features, such as electronics, screens, Polaroid cameras, buildings, food, animal ears, plants, and watery backgrounds. Note that grid cells are labeled with the classification they support most strongly, and are sized according to the number of activations averaged within them. This representation is very powerful because it allows us to inspect the different layers of a network and see how the activations fire in response to the input.

In this section, we have seen many techniques for processing images with CNNs. Next, we'll move on to video processing.

Video

In this section, we move from image processing to video processing. We'll start our look at video by discussing six ways to classify videos with pretrained nets.

Classifying videos with pretrained nets in six different ways

Classifying videos is an area of active research because of the large amount of data needed to process this type of media. Memory requirements frequently reach the limits of modern GPUs, and a distributed form of training on multiple machines might be required. Researchers are currently exploring different directions of investigation, with increasing levels of complexity from the first approach to the sixth, described next. Let's review them.

The first approach consists of classifying one video frame at a time, considering each of them as a separate image processed with a 2D CNN. This approach simply reduces the video classification problem to an image classification problem. Each video frame "emits" a classification output, and the video is classified by taking the most frequently chosen category across frames.

The second approach consists of creating one single network where a 2D CNN is combined with an RNN (see Chapter 8, Recurrent Neural Networks). The idea is that the CNN will take into account the image components and the RNN will take into account the sequence information across frames. This type of network can be very difficult to train because of the very high number of parameters to optimize.

The third approach is to use a 3D ConvNet, an extension of 2D ConvNets that operates on a 3D tensor (time, image_width, image_height). This approach is another natural extension of image classification. Again, 3D ConvNets can be hard to train.
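To make the third approach concrete, the following is a minimal Keras sketch of a small 3D ConvNet classifier. The clip length, frame size, number of classes, and layer sizes are illustrative assumptions rather than values tied to any particular dataset:

import tensorflow as tf
from tensorflow.keras import layers, models

# Assumed input: clips of 16 RGB frames of 112x112 pixels, 10 target classes.
NUM_FRAMES, HEIGHT, WIDTH, CHANNELS, NUM_CLASSES = 16, 112, 112, 3, 10

model = models.Sequential([
    # Conv3D convolves over (time, height, width) at once.
    layers.Conv3D(32, kernel_size=(3, 3, 3), activation='relu',
                  input_shape=(NUM_FRAMES, HEIGHT, WIDTH, CHANNELS)),
    layers.MaxPooling3D(pool_size=(1, 2, 2)),  # pool only spatially at first
    layers.Conv3D(64, kernel_size=(3, 3, 3), activation='relu'),
    layers.MaxPooling3D(pool_size=(2, 2, 2)),  # now pool over time as well
    layers.GlobalAveragePooling3D(),
    layers.Dense(NUM_CLASSES, activation='softmax')
])
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])

The 3D kernels make the parameter count, and therefore the training cost, grow quickly, which is exactly why these networks can be hard to train.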

The fourth approach is based on a clever idea: instead of using CNNs directly for classification, they can be used for storing offline features for each frame in the video. The idea is that feature extraction can be made very efficient with transfer learning, as shown in a previous chapter. After all features are extracted, they can be passed as a set of inputs into an RNN, which will learn sequences across multiple frames and emit the final classification.
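As an illustration of the fourth approach, the following sketch uses a pretrained InceptionV3 with its classification head removed as a frozen, offline feature extractor, and then trains an LSTM on the resulting sequences of per-frame features. The clip length, number of classes, and layer sizes are again illustrative assumptions:

import tensorflow as tf
from tensorflow.keras import layers, models

# Frozen pretrained CNN producing one 2048-dimensional feature vector per frame.
extractor = tf.keras.applications.InceptionV3(weights='imagenet',
                                              include_top=False, pooling='avg')
extractor.trainable = False

def extract_features(frames):
    # frames: array of shape (num_frames, 299, 299, 3)
    frames = tf.keras.applications.inception_v3.preprocess_input(frames)
    return extractor.predict(frames)  # shape: (num_frames, 2048)

# RNN that classifies a sequence of precomputed per-frame features.
NUM_FRAMES, FEATURE_DIM, NUM_CLASSES = 16, 2048, 10  # assumed values
rnn = models.Sequential([
    layers.Input(shape=(NUM_FRAMES, FEATURE_DIM)),
    layers.LSTM(256),
    layers.Dense(NUM_CLASSES, activation='softmax')
])
rnn.compile(optimizer='adam', loss='categorical_crossentropy',
            metrics=['accuracy'])

Because the features are computed once and stored, only the small recurrent head needs to be trained, which is what makes this approach efficient.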

The fifth approach is a simple variant of the fourth, where the final layer is an MLP instead of an RNN. In certain situations, this approach can be simpler and less expensive in terms of computational requirements.
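Continuing the previous sketch, the fifth approach only swaps the recurrent head for an MLP; here the per-frame features are averaged over time before classification (the pooling choice is our assumption, not a prescribed recipe):

# Fifth approach: average per-frame features over time, then classify with an MLP.
mlp = models.Sequential([
    layers.Input(shape=(NUM_FRAMES, FEATURE_DIM)),
    layers.GlobalAveragePooling1D(),  # collapse the time dimension
    layers.Dense(512, activation='relu'),
    layers.Dense(NUM_CLASSES, activation='softmax')
])
mlp.compile(optimizer='adam', loss='categorical_crossentropy',
            metrics=['accuracy'])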

The sixth approach is a variant of the fourth, where the feature extraction phase is realized with a 3D CNN that extracts spatial and visual features. These features are then passed into either an RNN or an MLP.

Deciding on the best approach is strictly dependent on your specific application, and there is no definitive answer. The first three approaches are generally more computationally expensive and less clever, while the last three are less expensive and frequently achieve better performance.

So far, we have explored how CNNs can be used for image and video applications. In the next section, we will apply these ideas within a text-based context.

Textual documents

What do text and images have in common? At first glance: very little. However, if we represent a sentence or a document as a matrix, then this matrix is not much different from an image matrix, where each cell is a pixel. So, the next question is: how can we represent a piece of text as a matrix?

Well, it is pretty simple: each row of the matrix is a vector that represents a basic unit of the text. Of course, we now need to define what a basic unit is. A simple choice could be to say that the basic unit is a character. Another choice would be to say that a basic unit is a word; yet another is to aggregate similar words together and then denote each aggregation (sometimes called a cluster or embedding) with a representative symbol.

Note that regardless of the specific choice adopted for our basic units, we need a 1:1 map from basic units to integer IDs so that a text can be seen as a matrix. For instance, if we have a document with 10 lines of text and each line is represented by a 100-dimensional embedding, then we will represent our text with a 10×100 matrix. In this very particular "image," a "pixel" in row X and position Y is turned on if sentence X contains the embedding value represented by position Y.
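To make this concrete, here is a minimal sketch, assuming a recent TensorFlow 2.x release where TextVectorization is a standard Keras layer. It maps text to integer IDs and then to embedding vectors, producing a matrix of the kind described above (with rows corresponding to tokens here); the vocabulary size, sequence length, and embedding dimension are illustrative assumptions:

import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE, SEQUENCE_LENGTH, EMBEDDING_DIM = 10000, 10, 100  # assumed values

# 1:1 map from basic units (here, words) to integer IDs.
vectorizer = layers.TextVectorization(max_tokens=VOCAB_SIZE,
                                      output_sequence_length=SEQUENCE_LENGTH)
vectorizer.adapt(["the cat sat on the mat", "dogs and cats living together"])

# Each integer ID maps to a 100-dimensional vector, so a 10-token text
# becomes a 10x100 matrix: the "image" described above.
embedding = layers.Embedding(VOCAB_SIZE, EMBEDDING_DIM)

ids = vectorizer(["the cat sat on the mat"])  # shape: (1, 10)
matrix = embedding(ids)                       # shape: (1, 10, 100)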
