

Equation 8.1.10 is the core of VAEs. The left-hand side is the term P_θ(x) that we are maximizing, less the error due to the distance of Q_φ(z|x) from the true P_θ(z|x). We can recall that the logarithm does not change the location of maxima (or minima). Given an inference model that provides a good estimate of P_θ(z|x), D_KL(Q_φ(z|x) || P_θ(z|x)) is approximately zero. The first term, P_θ(x|z), on the right-hand side resembles a decoder that takes samples from the inference model to reconstruct the input. The second term is another distance; this time it is between Q_φ(z|x) and the prior P_θ(z).
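For reference, the identity that Equation 8.1.10 expresses (restated here in standard VAE notation; the equation itself appears earlier in the chapter) is:

\log P_\theta(x) - D_{KL}\left(Q_\phi(z|x) \,\|\, P_\theta(z|x)\right) = \mathbb{E}_{z \sim Q_\phi(z|x)}\left[\log P_\theta(x|z)\right] - D_{KL}\left(Q_\phi(z|x) \,\|\, P_\theta(z)\right)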

The left-hand side of Equation 8.1.10 is also known as the variational lower bound or evidence lower bound (ELBO). Since the KL divergence is always non-negative, the ELBO is a lower bound of log P_θ(x). Maximizing the ELBO by optimizing the parameters φ and θ of the neural network means that:

• D_KL(Q_φ(z|x) || P_θ(z|x)) → 0, or the inference model is getting better at encoding the attributes of x in z
• log P_θ(x|z) on the right-hand side of Equation 8.1.10 is maximized, or the decoder model is getting better at reconstructing x from the latent vector z

Optimization

The right-hand side of Equation 8.1.10 carries two important bits of information about the loss function of VAEs. The decoder term, E_{z~Q}[log P_θ(x|z)], means that the generator takes z samples from the output of the inference model to reconstruct the inputs. Maximizing this term implies that we minimize the Reconstruction Loss, L_R.

If the image (data) distribution is assumed to be Gaussian, then MSE can be used. If every pixel (data) is considered a Bernoulli distribution, then the loss function is binary cross-entropy.
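As a concrete sketch of these two choices in Keras (an illustrative helper, not the book's listing; the function name and arguments here are assumptions):

    from tensorflow.keras import backend as K
    from tensorflow.keras.losses import mse, binary_crossentropy

    def vae_reconstruction_loss(inputs, outputs, original_dim, bernoulli=False):
        # inputs/outputs: image tensors with pixel values in [0, 1]
        # original_dim: number of pixels, turning the per-pixel mean into a sum
        if bernoulli:
            # every pixel treated as a Bernoulli distribution
            loss = binary_crossentropy(K.flatten(inputs), K.flatten(outputs))
        else:
            # Gaussian data assumption
            loss = mse(K.flatten(inputs), K.flatten(outputs))
        return loss * original_dim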

The second term, −D_KL(Q_φ(z|x) || P_θ(z)), turns out to be straightforward to evaluate. From Equation 8.1.6, Q_φ is a Gaussian distribution. Typically, P_θ(z) = P(z) = N(0, I) is also a Gaussian with zero mean and standard deviation equal to 1.0. The KL term simplifies to:

-D_{KL}\left(Q_\phi(z|x) \,\|\, P_\theta(z)\right) = \frac{1}{2}\sum_{j=1}^{J}\left(1 + \log\left(\sigma_j\right)^2 - \left(\mu_j\right)^2 - \left(\sigma_j\right)^2\right) (Equation 8.1.11)

where J is the dimensionality of z. Both μ_j and σ_j are functions of x computed through the inference model. To maximize −D_KL, σ_j → 1 and μ_j → 0. The choice of P(z) = N(0, I) stems from the property of the isotropic unit Gaussian, which can be morphed into an arbitrary distribution given a suitable function. From Equation 8.1.11, the KL Loss L_KL is simply D_KL(Q_φ(z|x) || P_θ(z)).
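As a sketch, Equation 8.1.11 maps directly onto a few Keras backend operations (again an illustrative helper; z_mean and z_log_var are assumed to be the inference model's outputs, with the variance parameterized in log space for numerical stability):

    from tensorflow.keras import backend as K

    def vae_kl_loss(z_mean, z_log_var):
        # z_mean, z_log_var: encoder outputs of shape (batch, J)
        # 1 + log(sigma_j^2) - mu_j^2 - sigma_j^2, summed over J dimensions
        kl = 1 + z_log_var - K.square(z_mean) - K.exp(z_log_var)
        kl = K.sum(kl, axis=-1)
        # negate: minimizing D_KL is the same as maximizing -D_KL in 8.1.11
        return -0.5 * kl

A common way to combine the two terms is then vae_loss = K.mean(reconstruction_loss + kl_loss), so that minimizing the total loss maximizes the ELBO.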

