Daniel Voigt Godoy - Deep Learning with PyTorch Step-by-Step: A Beginner's Guide (Leanpub)

There is one exception to the "always shuffle" rule, though: time series problems, where shuffling can lead to data leakage.
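For time series, the safe alternative is a chronological split. The sketch below is a minimal illustration (the array and sizes are made up, not the book's code): because nothing is shuffled, every validation point comes strictly after every training point, so no future information leaks into training.

```python
import numpy as np

# Hypothetical ordered series of 100 observations
N = 100
x = np.arange(N, dtype=float)

# Chronological split: no shuffling, so the validation set
# contains only points that come AFTER all training points
split = int(N * .8)
x_train, x_val = x[:split], x[split:]
```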

Train-Validation-Test Split

It is beyond the scope of this book to explain the reasoning behind the train-validation-test split, but there are two points I'd like to make:

1. The split should always be the first thing you do: no preprocessing, no transformations; nothing happens before the split. That's why we do this immediately after the synthetic data generation.

2. In this chapter, we will use only the training set, so I did not bother to create a test set, but I performed a split nonetheless to highlight point #1 :-)

Train-Validation Split

import numpy as np

# N, x, and y come from the synthetic data generation step above

# Shuffles the indices
idx = np.arange(N)
np.random.shuffle(idx)

# Uses first 80 random indices for train
train_idx = idx[:int(N*.8)]
# Uses the remaining indices for validation
val_idx = idx[int(N*.8):]

# Generates train and validation sets
x_train, y_train = x[train_idx], y[train_idx]
x_val, y_val = x[val_idx], y[val_idx]

"Why didn't you use train_test_split() from Scikit-Learn?" you may be asking.

That's a fair point. Later on, we will refer to the indices of the data points belonging to either the train or validation set, instead of the points themselves, so I thought it best to use indices from the very start.
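For comparison, train_test_split() can also be applied to the index array itself, which keeps the same index-based bookkeeping. This is a minimal sketch, not the book's code; N = 100 matches the "first 80 random indices" of the 80% split above, and the seed is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

N = 100  # dataset size implied by the 80% split yielding 80 indices

# Splitting the index array keeps the same bookkeeping as above:
# we still get train_idx and val_idx to index into x and y later
train_idx, val_idx = train_test_split(
    np.arange(N), test_size=0.2, random_state=42
)
```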

