Daniel Voigt Godoy - Deep Learning with PyTorch Step-by-Step: A Beginner’s Guide (Leanpub)


The minority class should have the largest weight, so each data point belonging to it gets overrepresented to compensate for the imbalance.

"But these weights do not sum up to one—isn’t it wrong?"

It is common to have weights summing up to one, sure, but this is not required by PyTorch’s weighted sampler. We can get away with having weights inversely proportional to the counts. In this sense, the sampler is very "forgiving." But it is not without its own quirks, unfortunately.

It is not enough to provide a sequence of weights that correspond to each different class in the training set. It requires a sequence containing the corresponding weight for each and every data point in the training set. Even though this is a bit annoying, it is not so hard to accomplish: We can use the labels as indexes of the weights we computed above. It is probably easier to see it in code:

sample_weights = weights[y_train_tensor.squeeze().long()]

print(sample_weights.shape)
print(sample_weights[:10])
print(y_train_tensor[:10].squeeze())

Output

torch.Size([240])
tensor([0.0063, 0.0063, 0.0063, 0.0063, 0.0063, 0.0125, 0.0063,
        0.0063, 0.0063, 0.0063])
tensor([1., 1., 1., 1., 1., 0., 1., 1., 1., 1.])

Since there are 240 images in our training set, we need 240 weights. We squeeze our labels (y_train_tensor) to a single dimension and cast them to long type since we want to use them as indices. The code above shows the first ten elements, so you can actually see the correspondence between class and weight in the resulting tensor.
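In case you are wondering where the weights tensor came from: it was computed earlier in the chapter from the class counts. Since that listing is not part of this excerpt, here is a minimal sketch of how weights inversely proportional to the counts could be obtained; apart from y_train_tensor and weights, the names and the use of unique() here are assumptions, not the book’s original code:

# count how many training points belong to each class
classes, counts = y_train_tensor.squeeze().long().unique(return_counts=True)

# the weight of a class is the inverse of its count, so the minority
# class ends up with the largest weight
weights = 1.0 / counts.float()

Given the values printed above (roughly 0.0125 for class 0 and 0.0063 for class 1), this would correspond to about 80 points of the minority class and 160 points of the majority class, adding up to the 240 training images.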

The sequence of weights is the main argument used to create the WeightedRandomSampler, but not the only one. Let’s take a look at its arguments:

• weights: A sequence of weights like the one we have just computed.
• num_samples: How many samples are going to be drawn from the dataset.
  ◦ A typical value is the length of the sequence of weights, as you’re likely sampling from the whole training set.
• replacement: If True (the default value), it draws samples with replacement.
  ◦ If num_samples equals the length—that is, if the whole training set is used—it makes sense to draw samples with replacement to effectively compensate for the imbalance.
  ◦ It only makes sense to set it to False if num_samples is less than the length of the dataset.
• generator: Optional, it takes a (pseudo) random number Generator that will be used for drawing the samples.
  ◦ To ensure reproducibility, we need to create and assign a generator (which has its own seed) to the sampler, since the manual seed we’ve already set is not enough.

OK, we’ll sample from the whole training set, and we have our sequence of weights ready. We are still missing a generator, though. Let’s create both the generator and the sampler now:

generator = torch.Generator()

sampler = WeightedRandomSampler(
    weights=sample_weights,
    num_samples=len(sample_weights),
    generator=generator,
    replacement=True
)

"Didn’t you say we need to set a seed for the generator?! Where is it?"

Indeed, I said it. We’ll set it soon, after assigning the sampler to the data loader. You’ll understand the reasoning behind this choice shortly, so please bear with me.

Now, let’s (re-)create the data loaders using the weighted sampler with the training set:
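The book’s actual listing for this step is not included in this excerpt. The sketch below shows one way the data loaders could be created; the dataset names (train_dataset, val_dataset), the batch size of 16, and the seed value of 42 are assumptions, not the original code. Note that sampler and shuffle are mutually exclusive, so shuffle is left out of the training loader:

import torch
from torch.utils.data import DataLoader

# the sampler handles the (weighted) shuffling, so shuffle is not set here
train_loader = DataLoader(
    dataset=train_dataset, batch_size=16, sampler=sampler
)
val_loader = DataLoader(dataset=val_dataset, batch_size=16)

# seeding the generator through the loader, as promised in the text above
train_loader.sampler.generator.manual_seed(42)

# quick sanity check: over one full pass of the resampled training set,
# both classes should now show up in roughly equal proportions
labels = torch.cat([y for _, y in train_loader])
print(labels.unique(return_counts=True))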

