
Output
-3.044811379650508 -1.8337537171510832

Visualizing Gradients

Since the gradient for b is larger (in absolute value, 3.04) than the gradient for w (in absolute value, 1.83), the answer to the question I posed you in the "Cross-Sections" section is: the black curve (b changes, w is constant) yields the largest changes in loss.

"Why is that?"

To answer that, let’s first put both cross-section plots side by side, so we can more easily compare them. What is the main difference between them?

Figure 0.7 - Cross-sections of the loss surface

The curve on the right is steeper. That’s your answer! Steeper curves have larger gradients.
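As a reference for where the two values in the output above come from: for a linear model yhat = b + w * x trained with MSE loss, the gradients have simple closed-form expressions. The sketch below is mine, not the book’s code, and it uses hypothetical data and parameter guesses, so its printed numbers will not match the output above.

import numpy as np

# Hedged sketch: analytic gradients of the MSE loss for yhat = b + w * x.
# Data and current parameter guesses are hypothetical (not the book's).
np.random.seed(42)
x = np.random.rand(80)
y = 1.0 + 2.0 * x + 0.1 * np.random.randn(80)

b, w = 0.5, -0.5                 # current guesses for the parameters
error = (b + w * x) - y          # prediction errors

b_grad = 2 * error.mean()        # d(MSE)/db = 2 * mean(error)
w_grad = 2 * (x * error).mean()  # d(MSE)/dw = 2 * mean(x * error)
print(b_grad, w_grad)            # values depend on the (hypothetical) data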

Cool! That’s the intuition… Now, let’s get a bit more geometrical. So, I am zooming in on the regions given by the red and black squares of Figure 0.7.

From the "Cross-Sections" section, we already know that, to minimize the loss, both b and w needed to be increased. So, keeping in the spirit of using gradients, let’s increase each parameter a little bit (always keeping the other one fixed!). By the way, in this example, a little bit equals 0.12 (for convenience’s sake, so it results in a nicer plot).

What effect do these increases have on the loss? Let’s check it out:

Figure 0.8 - Computing (approximate) gradients, geometrically

On the left plot, increasing w by 0.12 yields a loss reduction of 0.21. The geometrically computed and roughly approximate gradient is given by the ratio between the two values: -1.79. How does this result compare to the actual value of the gradient (-1.83)? It is actually not bad for a crude approximation. Could it be better? Sure, if we make the increase in w smaller and smaller (like 0.01, instead of 0.12), we’ll get better and better approximations. In the limit, as the increase approaches zero, we’ll arrive at the precise value of the gradient. Well, that’s the definition of a derivative!
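In symbols, the ratio we just computed geometrically is a finite-difference quotient, and its limit is the partial derivative of the loss with respect to w (a LaTeX rendering of the standard definition, not an equation taken from the book):

\frac{\partial \, \text{loss}}{\partial w} \;=\; \lim_{\Delta w \to 0} \frac{\text{loss}(b,\, w + \Delta w) \;-\; \text{loss}(b,\, w)}{\Delta w}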

The same reasoning goes for the plot on the right: increasing b by the same 0.12 yields a larger loss reduction of 0.35. Larger loss reduction, larger ratio, larger gradient, and a larger error too, since the geometric approximation (-2.90) is farther away from the actual value (-3.04).
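Both geometric computations are finite-difference approximations, and it is easy to verify numerically that they converge to the analytic gradients as the step shrinks. Here is a minimal sketch reusing the hypothetical model and data from the snippet above (so the printed numbers are illustrative only, not the -1.79 / -2.90 values in the text):

import numpy as np

# Hedged sketch: finite-difference approximations of the gradients,
# using the same hypothetical data and guesses as the earlier snippet.
np.random.seed(42)
x = np.random.rand(80)
y = 1.0 + 2.0 * x + 0.1 * np.random.randn(80)

def mse(b, w):
    # MSE loss of the linear model yhat = b + w * x
    return ((b + w * x - y) ** 2).mean()

b, w = 0.5, -0.5

# Analytic gradients at (b, w), for comparison
error = (b + w * x) - y
b_grad = 2 * error.mean()
w_grad = 2 * (x * error).mean()

# The smaller the step, the closer the ratio gets to the analytic gradient
for step in [0.12, 0.01, 0.0001]:
    approx_b = (mse(b + step, w) - mse(b, w)) / step
    approx_w = (mse(b, w + step) - mse(b, w)) / step
    print(f"step={step}: b {approx_b:.4f} (exact {b_grad:.4f}), "
          f"w {approx_w:.4f} (exact {w_grad:.4f})")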

Time for another question: which curve, red or black, do you like best for reducing the loss? It should be the black one, right? Well, yes, but it is not as straightforward as we’d like it to be. We’ll dig deeper into this in the "Learning Rate" section.

Backpropagation

Now that you’ve learned about computing the gradient of the loss function w.r.t.

