Advanced Deep Learning with Keras

Chapter 10

$$\pi(a_i \mid s_t, \theta) = \operatorname{softmax}(a_i) \quad \text{for } a_i \in \mathcal{A}$$ (Equation 10.1.1)

In that formula, $a_i$ is the $i$-th action. $a_i$ can be the prediction of a neural network or a linear function of state-action features:

$$a_i = \phi(s_t, a_i)^T \theta$$ (Equation 10.1.2)

$\phi(s_t, a_i)$ is any function, such as an encoder, that converts the state-action pair to features. $\pi(a_i \mid s_t, \theta)$ determines the probability of each $a_i$.

For example, in the cartpole balancing problem in the previous chapter, the goal is to keep the pole upright by moving the cart to the left or to the right along a horizontal axis. In this case, $a_0$ and $a_1$ are the probabilities of the left and right movements, respectively. In general, the agent takes the action with the highest probability, $a_t = \arg\max_{a_i} \pi(a_i \mid s_t, \theta)$.
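To make this concrete, the sketch below builds a small discrete softmax policy in Keras for the cartpole case and picks the greedy action, as in Equation 10.1.1. The layer sizes, `state_dim`, and the dummy state are assumptions for illustration only, not the exact model used later in this chapter.

```python
# Minimal sketch (assumptions noted above): a discrete softmax policy,
# pi(a_i | s_t, theta), for a two-action cartpole-like task.
import numpy as np
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.models import Model

state_dim = 4    # assumed cartpole observation size
num_actions = 2  # a_0 = move left, a_1 = move right

state = Input(shape=(state_dim,), name='state')
x = Dense(32, activation='relu')(state)              # learned state features
probs = Dense(num_actions, activation='softmax',
              name='action_probs')(x)                 # Equation 10.1.1
policy = Model(state, probs)

s_t = np.random.uniform(-0.05, 0.05, size=(1, state_dim))  # dummy state
pi = policy.predict(s_t, verbose=0)[0]
greedy_action = np.argmax(pi)  # the action with the highest probability
```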

For continuous action spaces, $\pi(a_t \mid s_t, \theta)$ samples an action from a probability distribution given the state. For example, if the continuous action space is the range $a_t \in [-1.0, 1.0]$, then $\pi(a_t \mid s_t, \theta)$ is usually a Gaussian distribution whose mean and standard deviation are predicted by the policy network. The predicted action is a sample from this Gaussian distribution. To ensure that no invalid prediction is generated, the action is clipped between -1.0 and 1.0.
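As a tiny numerical sketch of that sampling step (the mean and standard deviation below are made-up stand-ins for the policy network's outputs):

```python
# Sketch only: sample a continuous action from the predicted Gaussian and
# clip it to the valid range [-1.0, 1.0]. mu and sigma are dummy values.
import numpy as np

mu, sigma = 0.3, 0.5                       # would come from the policy network
a_t = np.random.normal(loc=mu, scale=sigma)
a_t = float(np.clip(a_t, -1.0, 1.0))       # keep the action valid
```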

Formally, for continuous action spaces, the policy is a sample from a Gaussian

distribution:

$$\pi(a_t \mid s_t, \theta) = a_t \sim \mathcal{N}\big(\mu(s_t), \sigma(s_t)\big)$$ (Equation 10.1.3)

The mean, $\mu$, and standard deviation, $\sigma$, are both functions of the state features:

$$\mu(s_t) = \phi(s_t)^T \theta_\mu$$ (Equation 10.1.4)

$$\sigma(s_t) = \varsigma\left(\phi(s_t)^T \theta_\sigma\right)$$ (Equation 10.1.5)

$\phi(s_t)$ is any function that converts the state to its features. $\varsigma(x) = \log(1 + e^x)$ is the softplus function, which ensures positive values of the standard deviation. One way of implementing the state feature function, $\phi(s_t)$, is to use the encoder of an autoencoder network. At the end of this chapter, we will train an autoencoder and use the encoder part as the state feature function. Training a policy network is therefore a matter of optimizing the parameters $\theta = \left[\theta_\mu \;\; \theta_\sigma\right]$.
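Putting Equations 10.1.3 to 10.1.5 together, the following is a minimal Keras sketch of such a Gaussian policy. The Dense feature layer stands in for $\phi(s_t)$ (this chapter later uses the encoder of a trained autoencoder instead), and the layer sizes and variable names are illustrative assumptions.

```python
# Sketch of a Gaussian policy (Equations 10.1.3 to 10.1.5), with a Dense
# layer standing in for the state feature function phi(s_t).
import numpy as np
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.models import Model

state_dim = 3  # assumed state size

state = Input(shape=(state_dim,), name='state')
features = Dense(32, activation='relu', name='phi')(state)        # phi(s_t)
mu = Dense(1, name='mu')(features)                                 # Equation 10.1.4
sigma = Dense(1, activation='softplus', name='sigma')(features)    # Equation 10.1.5
policy = Model(state, [mu, sigma])

# Act: sample an action from N(mu(s_t), sigma(s_t)) as in Equation 10.1.3
s_t = np.random.randn(1, state_dim)
mu_t, sigma_t = policy.predict(s_t, verbose=0)
a_t = np.random.normal(loc=mu_t[0, 0], scale=sigma_t[0, 0])
a_t = float(np.clip(a_t, -1.0, 1.0))  # clip to the valid action range
```

In this sketch, the weights of the `mu` and `sigma` heads play the role of $\theta_\mu$ and $\theta_\sigma$, so training the model amounts to the parameter optimization described above.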

