    # Builds a weighted random sampler to handle imbalanced classes
    sampler = make_balanced_sampler(y_train_tensor)

    # Uses sampler in the training set to get a balanced data loader
    train_loader = DataLoader(
        dataset=train_dataset, batch_size=16, sampler=sampler)
    val_loader = DataLoader(dataset=val_dataset, batch_size=16)

Patches

There are different ways of breaking up an image into patches. The most straightforward one is simply rearranging the pixels, so let’s start with that one.

Rearranging

TensorFlow has a utility function called tf.image.extract_patches() that does the job, and we’re implementing a simplified version of this function in PyTorch with tensor.unfold() (using only a kernel size and a stride, but no padding or anything else):

    # Adapted from
    # https://discuss.pytorch.org/t/tf-extract-image-patches-in-pytorch/43837
    def extract_image_patches(x, kernel_size, stride=1):
        # x has shape (N, C, H, W)
        n = x.shape[0]
        # Slides a window over the height (dim 2), then over the width (dim 3)
        patches = x.unfold(2, kernel_size, stride)
        patches = patches.unfold(3, kernel_size, stride)
        # Reorders to (N, rows, cols, C, kernel_size, kernel_size)
        patches = patches.permute(0, 2, 3, 1, 4, 5).contiguous()
        # Flattens each patch into a single vector
        return patches.view(n, patches.shape[1], patches.shape[2], -1)

It works as if we were applying a convolution to the image. Each patch is actually a receptive field (the region the filter is moving over to convolve), but, instead of convolving the region, we’re just taking it as it is. The kernel size is the patch size, and the number of patches depends on the stride: the smaller the stride, the more patches. If the stride matches the kernel size, we’re effectively breaking up the image into non-overlapping patches, so let’s do that:

    kernel_size = 4
    patches = extract_image_patches(img, kernel_size, stride=kernel_size)
    patches.shape

Output

    torch.Size([1, 3, 3, 16])

Since kernel size is four, each patch has 16 pixels, and there are nine patches in total. Even though each patch is a tensor of 16 elements, if we plot them as if they were four-by-four images instead, it would look like this.

Figure 10.22 - Sample image, split into patches

It is very easy to see how the image was broken up in the figure above. In reality, though, the Transformer needs a sequence of flattened patches. Let’s reshape them:

    seq_patches = patches.view(-1, patches.size(-1))
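To see the effect of the stride on the number of patches, the sketch below runs the same unfold-based extraction on a dummy tensor. It is only an illustration under stated assumptions: `dummy_img` is a hypothetical single-channel, 12x12 stand-in for the sample image (a shape consistent with the output above, but not taken from the text), `n_patches_per_side()` is a helper introduced here, and the function is restated with the batch size read from the input so the block runs on its own.

```python
import torch


def extract_image_patches(x, kernel_size, stride=1):
    # x is assumed to have shape (N, C, H, W)
    patches = x.unfold(2, kernel_size, stride)
    patches = patches.unfold(3, kernel_size, stride)
    patches = patches.permute(0, 2, 3, 1, 4, 5).contiguous()
    # Batch size is read from the input tensor
    return patches.view(x.shape[0], patches.shape[1], patches.shape[2], -1)


def n_patches_per_side(size, kernel_size, stride):
    # Hypothetical helper: same arithmetic as the output size of
    # an unpadded convolution
    return (size - kernel_size) // stride + 1


# Hypothetical stand-in for the sample image: one channel, 12x12 pixels
dummy_img = torch.arange(144, dtype=torch.float32).view(1, 1, 12, 12)

# Stride equal to the kernel size: non-overlapping patches
non_overlapping = extract_image_patches(dummy_img, kernel_size=4, stride=4)
# Smaller stride: overlapping patches, and more of them
overlapping = extract_image_patches(dummy_img, kernel_size=4, stride=2)

print(non_overlapping.shape)  # torch.Size([1, 3, 3, 16])
print(overlapping.shape)      # torch.Size([1, 5, 5, 16])

# Sequence of flattened patches: one row per patch
seq_patches = non_overlapping.view(-1, non_overlapping.size(-1))
print(seq_patches.shape)      # torch.Size([9, 16])
```

With a stride of two, the same 12x12 input yields a five-by-five grid of overlapping patches instead of a three-by-three grid. Each row of `seq_patches` holds the pixels of one patch in row-major order, so the first row matches the top-left four-by-four corner of the dummy image.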



