Advanced Deep Learning with Keras


Consider the example of two distributions with disjoint supports: $p_{data}$ places its samples at $(x = 0,\, y \sim U(0,1))$ and $p_g$ at $(x = \theta,\, y \sim U(0,1))$. For $\theta \neq 0$, the JS divergence evaluates to a constant:

$$D_{JS}(p_{data}, p_g) = \frac{1}{2}\sum_{\substack{x=0 \\ y \sim U(0,1)}} 1 \cdot \log\frac{1}{1/2} + \frac{1}{2}\sum_{\substack{x=\theta \\ y \sim U(0,1)}} 1 \cdot \log\frac{1}{1/2} = \frac{1}{2}\log 2 + \frac{1}{2}\log 2 = \log 2$$

while the Wasserstein distance is:

$$W(p_{data}, p_g) = |\theta|$$

Since $D_{JS}$ is a constant, the GAN will not have a sufficient gradient to drive $p_g \to p_{data}$. We'll also find that $D_{KL}$ or reverse $D_{KL}$ is not helpful either. However, with $W(p_{data}, p_g)$ we have a smooth function, so $p_g \to p_{data}$ can be attained by gradient descent. EMD, or Wasserstein 1, therefore seems to be the more logical loss function for optimizing GANs, since $D_{JS}$ fails in situations where the two distributions have minimal to no overlap.

For further understanding, an excellent discussion on distance functions can be found at https://lilianweng.github.io/lil-log/2017/08/20/from-GAN-to-WGAN.html.

Use of Wasserstein loss

Before using EMD or Wasserstein 1, there is one more problem to overcome. It is intractable to exhaust the space of $\prod(p_{data}, p_g)$ to find the $\gamma \in \prod(p_{data}, p_g)$ that attains the infimum. The proposed solution is to use its Kantorovich-Rubinstein dual:

$$W(p_{data}, p_g) = \frac{1}{K}\sup_{\|f\|_L \leq K} \mathbb{E}_{x \sim p_{data}}[f(x)] - \mathbb{E}_{x \sim p_g}[f(x)] \quad \text{(Equation 5.1.16)}$$

Equivalently, EMD is the supremum (roughly, the maximum value) over all K-Lipschitz functions $f: x \to R$. A K-Lipschitz function satisfies the constraint:

$$|f(x_1) - f(x_2)| \leq K|x_1 - x_2| \quad \text{(Equation 5.1.17)}$$

for all $x_1, x_2 \in R$. K-Lipschitz functions have bounded derivatives and are almost always continuously differentiable (for example, $f(x) = |x|$ has bounded derivatives and is continuous, but not differentiable at $x = 0$).

Equation 5.1.16 can be solved by finding a family of K-Lipschitz functions $\{f_w\},\, w \in W$:

$$W(p_{data}, p_g) = \max_{w \in W} \mathbb{E}_{x \sim p_{data}}[f_w(x)] - \mathbb{E}_{x \sim p_g}[f_w(x)] \quad \text{(Equation 5.1.18)}$$
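The parallel-distributions example above can be checked numerically. The following is a minimal sketch (assuming SciPy is available; `js_divergence` is a small helper written here for illustration, not from the book): the EMD between two non-overlapping uniform distributions grows linearly with the separation $\theta$, while the JS divergence stays pinned at $\log 2$.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
theta = 5.0
# p_data: uniform on [0, 1); p_g: the same distribution shifted by theta
u = rng.uniform(0.0, 1.0, 100_000)
v = theta + rng.uniform(0.0, 1.0, 100_000)

# EMD tracks the separation between the two supports
print(wasserstein_distance(u, v))  # ~ 5.0

def js_divergence(p, q, eps=1e-12):
    # JS divergence between two discrete distributions p and q
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log((a + eps) / (b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# discretize both distributions on a common grid
bins = np.linspace(0.0, theta + 1.0, 200)
p, _ = np.histogram(u, bins=bins)
q, _ = np.histogram(v, bins=bins)
p = p / p.sum()
q = q / q.sum()
print(js_divergence(p, q))  # ~ log 2 ≈ 0.693, regardless of theta
```

Doubling `theta` doubles the reported EMD but leaves the JS divergence unchanged, which is exactly why $D_{JS}$ provides no gradient signal here.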

Improved GANs

In the context of GANs, Equation 5.1.18 can be rewritten by sampling from the z-noise distribution and replacing $f_w$ with the discriminator function, $D_w$:

$$W(p_{data}, p_g) = \max_{w \in W} \mathbb{E}_{x \sim p_{data}}[D_w(x)] - \mathbb{E}_{z}[D_w(G(z))] \quad \text{(Equation 5.1.19)}$$

We use the bold letter to highlight the generality to multi-dimensional samples. The final problem we face is how to find the family of functions $w \in W$. The proposed solution we're going to go over is that at every gradient update, the weights of the discriminator, $w$, are clipped between lower and upper bounds (for example, -0.01 and 0.01):

$$w \leftarrow \text{clip}(w, -0.01, 0.01) \quad \text{(Equation 5.1.20)}$$

The small values of $w$ constrain the discriminator to a compact parameter space, thus ensuring Lipschitz continuity.

We can use Equation 5.1.19 as the basis of our new GAN loss functions. EMD, or Wasserstein 1, is the loss function that the generator aims to minimize and the cost function that the discriminator tries to maximize (or, equivalently, the discriminator minimizes $-W(p_{data}, p_g)$):

$$\mathcal{L}^{(D)} = -\mathbb{E}_{x \sim p_{data}} D_w(x) + \mathbb{E}_{z} D_w(G(z)) \quad \text{(Equation 5.1.21)}$$

$$\mathcal{L}^{(G)} = -\mathbb{E}_{z} D_w(G(z)) \quad \text{(Equation 5.1.22)}$$

In the generator loss function, the first term disappears since the generator is not directly optimized with respect to the real data.

The following table shows the difference between the loss functions of GAN and WGAN. For conciseness, we've simplified the notation for $\mathcal{L}^{(D)}$ and $\mathcal{L}^{(G)}$. These loss functions are used in training the WGAN, as shown in Algorithm 5.1.1. Figure 5.1.3 illustrates that the WGAN model is practically the same as the DCGAN model except for the fake/true data labels and loss functions:

Network | Loss Functions | Equation
GAN     | $\mathcal{L}^{(D)} = -\mathbb{E}_{x \sim p_{data}} \log D(x) - \mathbb{E}_{z} \log(1 - D(G(z)))$ | 4.1.1
        | $\mathcal{L}^{(G)} = -\mathbb{E}_{z} \log D(G(z))$ | 4.1.5
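As a toy illustration of Equations 5.1.20 to 5.1.22 (a NumPy sketch, not the book's Keras implementation), consider a hypothetical linear critic $D_w(x) = w \cdot x$ with a single weight; the names `clip_weights`, `d_loss`, and `g_loss` are introduced here for illustration only:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical linear critic D_w(x) = w * x standing in for the
# discriminator of Equation 5.1.19; w is its only weight.
w = 5.0  # an unclipped weight after some gradient update

def clip_weights(w, c=0.01):
    # Equation 5.1.20: w <- clip(w, -0.01, 0.01)
    return float(np.clip(w, -c, c))

w = clip_weights(w)  # w is now 0.01, inside the compact space

x_real = rng.normal(loc=2.0, size=256)  # samples x ~ p_data
x_fake = rng.normal(loc=0.0, size=256)  # samples G(z), z ~ noise

def d_loss(w, x_real, x_fake):
    # Equation 5.1.21: L(D) = -E_{x~p_data} D_w(x) + E_z D_w(G(z))
    return -np.mean(w * x_real) + np.mean(w * x_fake)

def g_loss(w, x_fake):
    # Equation 5.1.22: L(G) = -E_z D_w(G(z))
    return -np.mean(w * x_fake)

# The critic scores real samples higher than fakes, so L(D) is
# negative here: minimizing L(D) maximizes the Wasserstein estimate.
print(d_loss(w, x_real, x_fake), g_loss(w, x_fake))
```

In an actual WGAN, the same clipping step would be applied to every weight tensor of the discriminator network after each optimizer update, as Algorithm 5.1.1 describes.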

