In this image, an Inception-v1 network used for vision classification reveals many fully realized features, such as electronics, screens, Polaroid cameras, buildings, food, animal ears, plants, and watery backgrounds. Note that grid cells are labeled with the classification they give the most support for. Grid cells are also sized according to the number of activations that are averaged within them. This representation is very powerful because it allows us to inspect the different layers of a network and how the activation functions fire in response to the input.

In this section, we have seen many techniques to process images with CNNs. Next, we'll move on to video processing.

Video

In this section, we move from image processing to video processing. We'll start our look at video by discussing six ways in which to classify videos with pretrained nets.

Classifying videos with pretrained nets in six different ways

Classifying videos is an area of active research because of the large amount of data needed to process this type of media. Memory requirements frequently reach the limits of modern GPUs, and a distributed form of training across multiple machines might be required. Researchers are currently exploring different directions of investigation, with increasing levels of complexity from the first approach to the sixth, described next. Let's review them.

The first approach consists of classifying one video frame at a time, considering each of them as a separate image processed with a 2D CNN. This approach simply reduces the video classification problem to an image classification problem. Each video frame "emits" a classification output, and the video is classified by taking the most frequently chosen category across its frames.

The second approach consists of creating one single network where a 2D CNN is combined with an RNN (see Chapter 9, Autoencoders). The idea is that the CNN takes into account the image components while the RNN takes into account the sequence information for each video. This type of network can be very difficult to train because of the very high number of parameters to optimize.

The third approach is to use a 3D ConvNet, where 3D ConvNets are an extension of 2D ConvNets operating on a 3D tensor (time, image_width, image_height). This approach is another natural extension of image classification. Again, 3D ConvNets can be hard to train.
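To make the third approach concrete, here is a minimal sketch of a small 3D ConvNet written with tf.keras. The clip length of 16 frames, the 112×112 resolution, the layer sizes, and the 10 output classes are illustrative assumptions rather than values taken from this chapter:

import tensorflow as tf
from tensorflow.keras import layers

# Illustrative sizes: clips of 16 frames of 112x112 RGB images, 10 video classes.
NUM_FRAMES, HEIGHT, WIDTH, CHANNELS, NUM_CLASSES = 16, 112, 112, 3, 10

model = tf.keras.Sequential([
    layers.Input(shape=(NUM_FRAMES, HEIGHT, WIDTH, CHANNELS)),
    # 3D convolutions slide over time as well as over the two spatial dimensions.
    layers.Conv3D(32, kernel_size=(3, 3, 3), padding="same", activation="relu"),
    layers.MaxPooling3D(pool_size=(1, 2, 2)),
    layers.Conv3D(64, kernel_size=(3, 3, 3), padding="same", activation="relu"),
    layers.MaxPooling3D(pool_size=(2, 2, 2)),
    layers.GlobalAveragePooling3D(),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])

Each training sample is therefore a 4D tensor (frames, height, width, channels), which is one reason why this approach quickly runs into the memory limits mentioned above.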
The fourth approach is based on a clever idea: instead of using a CNN directly for classification, it can be used to store offline features for each frame of the video. The idea is that feature extraction can be made very efficient with transfer learning, as shown in a previous chapter. Once all of the features have been extracted, they can be passed as a sequence of inputs to an RNN, which will learn the sequence across multiple frames and emit the final classification.
The fifth approach is a simple variant of the fourth, where the final layer is an MLP instead of an RNN. In certain situations, this approach can be simpler and less demanding in terms of computational requirements.
The sixth approach is a variant of the fourth, where the feature extraction phase is realized with a 3D CNN that extracts spatial and temporal features. These features are then passed into either an RNN or an MLP.
Deciding on the best approach depends strictly on your specific application, and there is no definitive answer. The first three approaches are generally more computationally expensive and less clever, while the last three are less expensive and frequently achieve better performance.
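As an illustration of the fourth approach, the following hedged sketch uses a pretrained InceptionV3 without its classification head as a frozen per-frame feature extractor, and feeds the resulting sequence of feature vectors to an LSTM. For simplicity, the features are computed inside the same model rather than stored offline, and the clip length and number of classes are assumptions made purely for the example:

import tensorflow as tf
from tensorflow.keras import layers

# Illustrative sizes: clips of 16 frames, 10 video classes.
NUM_FRAMES, NUM_CLASSES = 16, 10

# A pretrained 2D CNN used as a frozen per-frame feature extractor (transfer learning).
feature_extractor = tf.keras.applications.InceptionV3(
    include_top=False, weights="imagenet", pooling="avg")
feature_extractor.trainable = False

frames = layers.Input(shape=(NUM_FRAMES, 299, 299, 3))
# TimeDistributed applies the same CNN to every frame: (batch, frames, 2048) features.
features = layers.TimeDistributed(feature_extractor)(frames)
# The RNN learns the sequence of per-frame features and emits the final classification.
x = layers.LSTM(256)(features)
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)

model = tf.keras.Model(frames, outputs)
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])

Replacing the LSTM with a small MLP over the frame features would give the fifth approach, while swapping the 2D extractor for a 3D CNN would give the sixth.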
So far, we have explored how CNNs can be used for image and video applications.
In the next section, we will apply these ideas within a text-based context.
Textual documents
What do text and images have in common? At first glance: very little. However, if we
represent a sentence or a document as a matrix, then this matrix is not much different
from an image matrix where each cell is a pixel. So, the next question is: how can we
represent a piece of text as a matrix?
Well, it is pretty simple: each row of the matrix is a vector that represents a basic unit of the text. Of course, we now need to define what a basic unit is. A simple choice would be to say that the basic unit is a character. Another choice would be to say that a basic unit is a word; yet another choice is to aggregate similar words together and then denote each aggregation (sometimes called a cluster or embedding) with a representative symbol.
Note that regardless of the specific choice adopted for our basic units, we need a 1:1 map from basic units to integer IDs so that a text can be seen as a matrix. For instance, if we have a document with 10 lines of text and each line is represented by a 100-dimensional embedding, then we will represent our text with a 10×100 matrix. In this very particular "image," a "pixel" at position (X, Y) is turned on if sentence X contains the embedding represented by position Y.
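As a minimal sketch of this idea, assuming word-level basic units that have already been mapped to integer IDs, the snippet below turns each sentence into a 100×100 matrix with an embedding layer and then slides a 1D convolution over it. The vocabulary size, the sequence length, and the binary classification task are illustrative assumptions:

import tensorflow as tf
from tensorflow.keras import layers

# Illustrative sizes: 20,000-word vocabulary, sentences padded to 100 word IDs,
# 100-dimensional embeddings, and a binary classification task.
VOCAB_SIZE, SEQ_LEN, EMBED_DIM = 20000, 100, 100

model = tf.keras.Sequential([
    # Each input is a sentence already converted into integer word IDs
    # (the 1:1 map from basic units to IDs described above).
    layers.Input(shape=(SEQ_LEN,), dtype="int32"),
    # The embedding layer turns the IDs into a SEQ_LEN x EMBED_DIM matrix: the "image".
    layers.Embedding(VOCAB_SIZE, EMBED_DIM),
    # A 1D convolution slides a window of five consecutive words over that matrix.
    layers.Conv1D(128, 5, activation="relu"),
    layers.GlobalMaxPooling1D(),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=["accuracy"])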