one matrix for each data point, each matrix containing a grid of errors.

The only step missing is to compute the mean squared error. First, we take the square of all errors. Then, we average the squares over all data points. Since our data points are in the first dimension, we use axis=0 to compute this average:

all_losses = (all_errors ** 2).mean(axis=0)
all_losses.shape

Output

(101, 101)

The result is a grid of losses, a matrix of shape (101, 101), each loss corresponding to a different combination of the parameters b and w.

These losses are our loss surface, which can be visualized in a 3D plot, where the vertical axis (z) represents the loss values. If we connect the combinations of b and w that yield the same loss value, we’ll get an ellipse. Then, we can draw this ellipse in the original b x w plane (in blue, for a loss value of 3). This is, in a nutshell, what a contour plot does. From now on, we’ll always use the contour plot, instead of the corresponding 3D version.

Figure 0.4 - Loss surface

In the center of the plot, where parameters (b, w) have values close to (1, 2), the loss is at its minimum value. This is the point we’re trying to reach using gradient descent.
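Putting these pieces together, the whole loss surface can be computed with a few lines of NumPy broadcasting. This is a minimal sketch, not the book’s exact notebook code: it regenerates the synthetic data from earlier in the chapter (true_b = 1, true_w = 2, N = 100), ignores the train/validation split for simplicity, and assumes grids of 101 candidate values centered on the true parameters:

import numpy as np

# Synthetic data, as generated earlier in the chapter
# (true_b = 1, true_w = 2, N = 100, Gaussian noise)
true_b, true_w, N = 1, 2, 100
np.random.seed(42)
x_train = np.random.rand(N, 1)
y_train = true_b + true_w * x_train + .1 * np.random.randn(N, 1)

# Candidate values for each parameter; the (-2, 4) and (-1, 5)
# ranges are an assumption, chosen to center the grids on (1, 2)
b_range = np.linspace(-2, 4, 101)
w_range = np.linspace(-1, 5, 101)
bs, ws = np.meshgrid(b_range, w_range)  # both of shape (101, 101)

# Broadcasting x_train as (N, 1, 1) against the (101, 101) grids
# yields one (101, 101) matrix of predictions per data point
all_predictions = bs + ws * x_train.reshape(-1, 1, 1)
all_errors = all_predictions - y_train.reshape(-1, 1, 1)

# Averaging the squared errors over the data points (axis=0)
# produces the (101, 101) loss surface
all_losses = (all_errors ** 2).mean(axis=0)
print(all_losses.shape)  # (101, 101)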
At the bottom, slightly to the left, there is the random start point, corresponding to our randomly initialized parameters.
This is one of the nice things about tackling a simple problem like a linear
regression with a single feature: We have only two parameters, and thus we can
compute and visualize the loss surface.
Unfortunately, for the absolute majority of problems, computing
the loss surface is not going to be feasible: we have to rely on
gradient descent’s ability to reach a point of minimum, even if it
starts at some random point.
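Since we do have the full grid here, the contour view of Figure 0.4 can be reproduced in a few lines. This is an illustrative Matplotlib sketch reusing bs, ws, and all_losses from the sketch above; the styling is arbitrary, not the book’s plotting code:

import matplotlib.pyplot as plt

# Each contour line connects (b, w) pairs that yield the same loss
fig, ax = plt.subplots(figsize=(6, 6))
cs = ax.contour(bs, ws, all_losses, levels=12)
ax.clabel(cs, inline=True, fontsize=8)   # label each level with its loss
ax.scatter([true_b], [true_w], c='red')  # region of minimum loss
ax.set_xlabel('b')
ax.set_ylabel('w')
ax.set_title('Loss Surface')
plt.show()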
Cross-Sections
Another nice thing is that we can cut a cross-section in the loss surface to check what the loss would look like if one of the parameters were held constant.
Let’s start by making b = 0.52 (the value from b_range that is closest to our initial
random value for b, 0.4967). We cut a cross-section vertically (the red dashed line)
on our loss surface (left plot), and we get the resulting plot on the right:
Figure 0.5 - Vertical cross-section; parameter b is fixed
What does this cross-section tell us? It tells us that, if we keep b constant (at 0.52), the loss, seen from the perspective of parameter w, can be minimized if w is increased (up to some value between 2 and 3).
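In code, extracting this cross-section amounts to fixing a single column of the loss grid. This is a minimal sketch, assuming all_losses, b_range, and w_range from the sketches above (with NumPy’s default meshgrid indexing, b varies along the columns):

# Fix b at the grid value closest to our random initialization (0.4967);
# as stated above, that closest value is 0.52
b_idx = np.argmin(np.abs(b_range - 0.4967))
fixed_b = b_range[b_idx]

# A vertical cross-section is a single column of the loss grid,
# giving the loss as a function of w alone
cross_section = all_losses[:, b_idx]

# The w that minimizes the loss along this cross-section
best_w = w_range[np.argmin(cross_section)]
print(fixed_b, best_w)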