Advanced Deep Learning with Keras
Chapter 5

With $x = 0,\, y \sim U(0,1)$ for $p_{data}$ and $x = \theta,\, y \sim U(0,1)$ for $p_g$:

$$D_{JS}\left(p_{data}, p_g\right) = \frac{1}{2}\mathbb{E}_{x=0,\, y \sim U(0,1)} \log \frac{p_{data}(x,y)}{\frac{1}{2}\left(p_{data}(x,y) + p_g(x,y)\right)} + \frac{1}{2}\mathbb{E}_{x=\theta,\, y \sim U(0,1)} \log \frac{p_g(x,y)}{\frac{1}{2}\left(p_{data}(x,y) + p_g(x,y)\right)} = \frac{1}{2}\log 2 + \frac{1}{2}\log 2 = \log 2$$

$$W\left(p_{data}, p_g\right) = \left|\theta\right|$$

Since $D_{JS}$ is a constant, the GAN will not have a sufficient gradient to drive $p_g \rightarrow p_{data}$. We'll also find that $D_{KL}$ or reverse $D_{KL}$ is not helpful either. However, with $W(p_{data}, p_g)$ we have a smooth function, so $p_g \rightarrow p_{data}$ can be attained by gradient descent. EMD, or Wasserstein 1, seems to be a more logical loss function for optimizing GANs, since $D_{JS}$ fails in situations where the two distributions have minimal to no overlap.

For further understanding, an excellent discussion on distance functions can be found at https://lilianweng.github.io/lil-log/2017/08/20/from-GAN-to-WGAN.html.

Use of Wasserstein loss

Before using EMD or Wasserstein 1, there is one more problem to overcome. It is intractable to exhaust the space of $\prod\left(p_{data}, p_g\right)$ to find the $\gamma \in \prod\left(p_{data}, p_g\right)$ attaining the infimum. The proposed solution is to use its Kantorovich-Rubinstein dual:

$$W\left(p_{data}, p_g\right) = \frac{1}{K} \sup_{\left\|f\right\|_{L} \leq K} \mathbb{E}_{x \sim p_{data}}\left[f(x)\right] - \mathbb{E}_{x \sim p_g}\left[f(x)\right] \quad \text{(Equation 5.1.16)}$$

Equivalently, EMD, where $\left\|f\right\|_{L} \leq 1$, is the supremum (roughly, the maximum value) over all K-Lipschitz functions $f: x \rightarrow \mathbb{R}$. K-Lipschitz functions satisfy the constraint:

$$\left|f\left(x_1\right) - f\left(x_2\right)\right| \leq K\left|x_1 - x_2\right| \quad \text{(Equation 5.1.17)}$$

For all $x_1, x_2 \in \mathbb{R}$, K-Lipschitz functions have bounded derivatives and are almost always continuously differentiable (for example, $f(x) = |x|$ has bounded derivatives and is continuous, but not differentiable at $x = 0$).

Equation 5.1.16 can be solved by finding a family of K-Lipschitz functions $\left\{f_w\right\},\, w \in \mathcal{W}$:

$$W\left(p_{data}, p_g\right) = \max_{w \in \mathcal{W}} \mathbb{E}_{x \sim p_{data}}\left[f_w(x)\right] - \mathbb{E}_{x \sim p_g}\left[f_w(x)\right] \quad \text{(Equation 5.1.18)}$$

[ 131 ]
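The disjoint-support example can be checked numerically. The sketch below is hypothetical (the histogram bins and placements are illustrative assumptions, not from the text): all of $p_{data}$'s mass sits at $x = 0$ and all of $p_g$'s at $x = \theta$, so $D_{JS}$ saturates at $\log 2$ for every $\theta \neq 0$, while the Wasserstein distance $|\theta|$ keeps tracking how far apart the distributions are:

```python
import numpy as np

def js_divergence(p, q):
    # Discrete Jensen-Shannon divergence (natural log):
    # 0.5 * KL(p || m) + 0.5 * KL(q || m) with m = (p + q) / 2.
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0
        return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# p_data: all mass at x = 0; p_g: all mass at x = theta (disjoint supports).
for theta in (1, 2, 3):
    p = np.zeros(4)
    q = np.zeros(4)
    p[0] = 1.0
    q[theta] = 1.0
    # D_JS stays at log 2 ~ 0.6931 no matter how far apart the modes are,
    # while W(p_data, p_g) = |theta| still reflects the distance.
    print(theta, round(js_divergence(p, q), 4), abs(theta))
```

The constant $\log 2$ column is exactly the vanishing-gradient problem described above: moving $\theta$ closer to 0 changes nothing in $D_{JS}$ until the supports overlap, whereas $W$ decreases smoothly.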
Improved GANs
In the context of GANs, Equation 5.1.18 can be rewritten by sampling from the z-noise distribution and replacing $f_w$ with the discriminator function, $D_w$:

$$W\left(p_{data}, p_g\right) = \max_{w \in \mathcal{W}} \mathbb{E}_{\boldsymbol{x} \sim p_{data}}\left[D_w\left(\boldsymbol{x}\right)\right] - \mathbb{E}_{\boldsymbol{z}}\left[D_w\left(\mathcal{G}\left(\boldsymbol{z}\right)\right)\right] \quad \text{(Equation 5.1.19)}$$
We use bold letters to highlight the generality to multi-dimensional samples. The
final problem we face is how to find the family of functions $\left\{f_w\right\},\, w \in \mathcal{W}$. The proposed
solution is that at every gradient update, the weights of the
discriminator, w, are clipped between lower and upper bounds (for example, -0.01
and 0.01):
$$w \leftarrow \text{clip}\left(w, -0.01, 0.01\right) \quad \text{(Equation 5.1.20)}$$
The small values of w constrain the discriminator to a compact parameter space,
thus ensuring Lipschitz continuity.
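As a minimal sketch, clipping per Equation 5.1.20 can be applied right after each discriminator update. The helper below is hypothetical; it only assumes the model exposes Keras-style `get_weights()`/`set_weights()` methods, and uses the clip bound 0.01 from the text:

```python
import numpy as np

def clip_discriminator_weights(model, clip_value=0.01):
    """Enforce Equation 5.1.20: clip every weight tensor of the
    discriminator into [-clip_value, clip_value] after a gradient update,
    keeping the critic within a compact parameter space (K-Lipschitz)."""
    clipped = [np.clip(w, -clip_value, clip_value)
               for w in model.get_weights()]
    model.set_weights(clipped)
```

In a WGAN training loop this would typically be called immediately after each `discriminator.train_on_batch(...)` step, before the next batch is drawn.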
We can use Equation 5.1.19 as the basis of our new GAN loss functions. EMD,
or Wasserstein 1, is the loss function that the generator aims to minimize and the
cost function that the discriminator tries to maximize (or, equivalently, minimize $-W\left(p_{data}, p_g\right)$):

$$\mathcal{L}^{(D)} = -\mathbb{E}_{\boldsymbol{x} \sim p_{data}} D_w\left(\boldsymbol{x}\right) + \mathbb{E}_{\boldsymbol{z}} D_w\left(\mathcal{G}\left(\boldsymbol{z}\right)\right) \quad \text{(Equation 5.1.21)}$$

$$\mathcal{L}^{(G)} = -\mathbb{E}_{\boldsymbol{z}} D_w\left(\mathcal{G}\left(\boldsymbol{z}\right)\right) \quad \text{(Equation 5.1.22)}$$
In the generator loss function, the first term drops out since the generator does not
directly optimize with respect to the real data.
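Equations 5.1.21 and 5.1.22 collapse into a single expression once real samples are labeled +1 and fake samples -1. The sketch below assumes exactly that label convention and a raw (linear-activation) discriminator output; it is written in plain NumPy to stay self-contained, though in Keras the same expression would use the backend mean:

```python
import numpy as np

def wasserstein_loss(y_label, y_pred):
    """Hypothetical sketch: -E[y * D_w], assuming y_label is +1 for real
    and -1 for fake samples, and y_pred is the unbounded critic output.

    With y = +1 this gives -E[D_w(x)], the first term of Equation 5.1.21;
    with y = -1 it gives +E[D_w(G(z))], the second term. The generator,
    trained with y = +1 on fake samples, gets -E[D_w(G(z))], which is
    Equation 5.1.22."""
    return float(-np.mean(y_label * y_pred))
```

This single function can therefore be compiled into both the discriminator and the adversarial model; only the labels fed to `train_on_batch` differ.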
The following table shows the difference between the loss functions of a GAN and
a WGAN. For conciseness, we've simplified the notation for $\mathcal{L}^{(D)}$ and $\mathcal{L}^{(G)}$. These
loss functions are used in training the WGAN, as shown in Algorithm 5.1.1. Figure
5.1.3 illustrates that the WGAN model is practically the same as the DCGAN model
except for the fake/true data labels and loss functions:
| Network | Loss Functions | Equation |
| GAN | $\mathcal{L}^{(D)} = -\mathbb{E}_{\boldsymbol{x} \sim p_{data}} \log D\left(\boldsymbol{x}\right) - \mathbb{E}_{\boldsymbol{z}} \log\left(1 - D\left(\mathcal{G}\left(\boldsymbol{z}\right)\right)\right)$ | 4.1.1 |
| | $\mathcal{L}^{(G)} = -\mathbb{E}_{\boldsymbol{z}} \log D\left(\mathcal{G}\left(\boldsymbol{z}\right)\right)$ | 4.1.5 |
| WGAN | $\mathcal{L}^{(D)} = -\mathbb{E}_{\boldsymbol{x} \sim p_{data}} D_w\left(\boldsymbol{x}\right) + \mathbb{E}_{\boldsymbol{z}} D_w\left(\mathcal{G}\left(\boldsymbol{z}\right)\right)$ | 5.1.21 |
| | $\mathcal{L}^{(G)} = -\mathbb{E}_{\boldsymbol{z}} D_w\left(\mathcal{G}\left(\boldsymbol{z}\right)\right)$ | 5.1.22 |
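The GAN row of the table can be checked numerically. The sketch below is a hypothetical NumPy rendering of Equations 4.1.1 and 4.1.5, assuming `d_real` = $D(\boldsymbol{x})$ and `d_fake` = $D(\mathcal{G}(\boldsymbol{z}))$ are discriminator outputs (probabilities) in (0, 1):

```python
import numpy as np

def gan_discriminator_loss(d_real, d_fake):
    # Equation 4.1.1: -E[log D(x)] - E[log(1 - D(G(z)))]
    return float(-np.mean(np.log(d_real)) - np.mean(np.log(1.0 - d_fake)))

def gan_generator_loss(d_fake):
    # Equation 4.1.5 (non-saturating form): -E[log D(G(z))]
    return float(-np.mean(np.log(d_fake)))
```

At the discriminator's blind spot, where it outputs 0.5 everywhere, the discriminator loss is $2\log 2$ and the generator loss is $\log 2$; unlike the Wasserstein losses, these are bounded below by cross-entropy saturation rather than tracking a distance between distributions.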
[ 132 ]