Chapter 5

In this image, an Inception-v1 network used for vision classification reveals many fully realized features, such as electronics, screens, Polaroid cameras, buildings, food, animal ears, plants, and watery backgrounds. Note that grid cells are labeled with the classification they give most support for. Grid cells are also sized according to the number of activations that are averaged within. This representation is very powerful because it allows us to inspect the different layers of a network and how the activation functions fire in response to the input.

In this section, we have seen many techniques to process images with CNNs. Next, we'll move on to video processing.

Video

In this section, we move from image processing to video processing. We'll start our look at video by discussing six ways in which to classify videos with pretrained nets.

Classifying videos with pretrained nets in six different ways

Classifying videos is an area of active research because of the large amount of data needed for processing this type of media. Memory requirements frequently reach the limits of modern GPUs, and a distributed form of training on multiple machines might be required. Researchers are currently exploring different directions of investigation, with increasing levels of complexity from the first approach to the sixth, described next. Let's review them.

The first approach consists of classifying one video frame at a time by considering each one of them as a separate image processed with a 2D CNN. This approach simply reduces the video classification problem to an image classification problem. Each video frame "emits" a classification output, and the video is classified by choosing the category predicted most frequently across its frames.

The second approach consists of creating one single network where a 2D CNN is combined with an RNN (see Chapter 9, Autoencoders). The idea is that the CNN will take into account the image components and the RNN will take into account the sequence information for each video. This type of network can be very difficult to train because of the very high number of parameters to optimize.

The third approach is to use a 3D ConvNet, where 3D ConvNets are an extension of 2D ConvNets operating on a 3D tensor (time, image_width, image_height). This approach is another natural extension of image classification. Again, 3D ConvNets can be hard to train.
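
The voting step of the first approach can be illustrated with a minimal sketch. It assumes that some 2D CNN has already produced one vector of class scores per frame; the function name classify_video_by_vote and the toy scores below are hypothetical and serve only as an illustration, not as the book's implementation.

```python
from collections import Counter

import numpy as np


def classify_video_by_vote(frame_probs):
    """Return the class predicted most often across frames, plus the vote counts.

    frame_probs: array of shape (num_frames, num_classes) holding per-frame
    class scores from a 2D CNN (hypothetical input for this sketch).
    """
    frame_probs = np.asarray(frame_probs)
    # Each frame "emits" one label: the class with the highest score.
    per_frame_labels = frame_probs.argmax(axis=1)
    # Count how often each label was emitted and keep the most frequent one.
    votes = Counter(per_frame_labels.tolist())
    label, _ = votes.most_common(1)[0]
    return label, dict(votes)


# Toy scores for a 4-frame video and 3 classes (made-up numbers).
frame_probs = np.array([
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.8, 0.1, 0.1],
])
video_label, vote_counts = classify_video_by_vote(frame_probs)
```

In this toy example, three of the four frames vote for class 0, so the whole video is assigned class 0. Note that the vote ignores the order of the frames, which is exactly the sequence information the second and third approaches try to exploit.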

Advanced Convolutional Neural Networks

The fourth approach is based on a clever idea: instead of using CNNs directly for classification, they can be used to extract and store features offline for each frame in the video. The idea is that feature extraction can be made very efficient with transfer learning, as shown in a previous chapter. After all features are extracted, they can be passed as a set of inputs into an RNN, which will learn sequences across multiple frames and emit the final classification.

The fifth approach is a simple variant of the fourth, where the final layer is an MLP instead of an RNN. In certain situations, this approach can be simpler and less expensive in terms of computational requirements.

The sixth approach is a variant of the fourth, where the phase of feature extraction is realized with a 3D CNN that extracts spatial and temporal features. These features are then passed into either an RNN or an MLP.

Deciding upon the best approach is strictly dependent on your specific application and there is no definitive answer. The first three approaches are generally more computationally expensive and less clever, while the last three approaches are less expensive and frequently achieve better performance.

So far, we have explored how CNNs can be used for image and video applications. In the next section, we will apply these ideas within a text-based context.

Textual documents

What do text and images have in common? At first glance: very little. However, if we represent a sentence or a document as a matrix, then this matrix is not much different from an image matrix where each cell is a pixel. So, the next question is: how can we represent a piece of text as a matrix?

Well, it is pretty simple: each row of a matrix is a vector that represents a basic unit for the text. Of course, now we need to define what a basic unit is. A simple choice could be to say that the basic unit is a character. Another choice would be to say that a basic unit is a word. Yet another choice is to aggregate similar words together and then denote each aggregation (sometimes called clustering or embedding) with a representative symbol.

Note that regardless of the specific choice adopted for our basic units, we need to have a 1:1 map from basic units into integer IDs so that a text can be seen as a matrix. For instance, if we have a document with 10 lines of text and each line is a 100-dimensional embedding, then we will represent our text with a matrix of 10×100. In this very particular "image," a "pixel" is turned on if sentence X contains the embedding represented by position Y.
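
The mapping from text to matrix can be sketched as follows. This is a minimal illustration that assumes the basic unit is a lowercased, whitespace-separated word rather than a learned embedding; the helper names build_vocab and text_to_matrix and the three sample lines are hypothetical.

```python
import numpy as np


def build_vocab(lines):
    """Assign each distinct word a unique integer ID (a 1:1 map)."""
    vocab = {}
    for line in lines:
        for word in line.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab


def text_to_matrix(lines, vocab):
    """Build a binary matrix: cell (X, Y) is 1 if line X contains the word with ID Y."""
    matrix = np.zeros((len(lines), len(vocab)), dtype=np.uint8)
    for row, line in enumerate(lines):
        for word in line.lower().split():
            matrix[row, vocab[word]] = 1
    return matrix


# Three made-up lines of text; each becomes one row of the "image".
lines = ["the cat sat", "the dog sat down", "a cat ran"]
vocab = build_vocab(lines)
matrix = text_to_matrix(lines, vocab)
```

Here the document has 3 lines and 7 distinct words, so the resulting matrix is 3×7, and each row has a "pixel" turned on for every word that appears in the corresponding line. Replacing words with characters or with cluster symbols changes only how the vocabulary is built, not the shape of the idea.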

