
Chapter 11

Here, α is the learning rate, and its value lies in the range [0, 1]. The first term represents the component of the old Q value and the second term the target Q value. Q-learning is good if the number of states and the number of possible actions are small, but for large state spaces and action spaces, Q-learning is simply not scalable.

A better alternative would be to use a deep neural network as a function approximator, approximating the target Q-function for each possible action. The weights of the deep neural network in this case store the Q-table information. There is a separate output unit for each possible action. The network takes the state as its input and returns the predicted target Q value for all possible actions. The question arises: how do we train this network, and what should be the loss function? Well, since our network has to predict the target Q value:

Q_{\text{target}} = R_{t+1} + \gamma \max_{A} Q(S_{t+1}, A)

the loss function should try to reduce the difference between the predicted Q value, Q_predicted, and the target Q value, Q_target. We can do this by defining the loss function as:

\text{loss} = E_{\pi}\left[ Q_{\text{target}}(S, A) - Q_{\text{predicted}}(S, W, A) \right]
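Put together, one training step computes the target from the observed reward and the maximum predicted Q value of the next state, and then takes a gradient step on W. The sketch below assumes a TensorFlow/Keras setup and minimizes the squared difference between the two quantities (the usual practical choice); q_network, optimizer, and the batch tensors are illustrative names, not from the text. A matching network definition follows the architecture description below.

import tensorflow as tf

gamma = 0.99  # discount factor (illustrative value)

def train_step(q_network, optimizer, states, actions, rewards, next_states, dones):
    # Q_target = R_{t+1} + gamma * max_A Q(S_{t+1}, A); terminal transitions get no bootstrap term
    next_q = q_network(next_states)                              # shape: (batch, m)
    q_target = rewards + gamma * tf.reduce_max(next_q, axis=1) * (1.0 - dones)

    with tf.GradientTape() as tape:
        q_values = q_network(states)                             # shape: (batch, m)
        # Q_predicted(S, W, A) for the actions that were actually taken
        action_mask = tf.one_hot(actions, q_values.shape[1])
        q_predicted = tf.reduce_sum(q_values * action_mask, axis=1)
        loss = tf.reduce_mean(tf.square(q_target - q_predicted))

    # Gradient descent on the network weights W so that the loss is minimized
    grads = tape.gradient(loss, q_network.trainable_variables)
    optimizer.apply_gradients(zip(grads, q_network.trainable_variables))
    return loss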

where W represents the training parameters of our deep Q-network, learned using gradient descent so that the loss function is minimized. The general architecture of a DQN is as follows: the network takes an n-dimensional state as input and outputs the Q value of each possible action in the m-dimensional action space. Each layer (including the input) can be a convolutional layer (if we are taking raw pixels as input, convolutional layers make more sense) or a dense layer.
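As a concrete illustration of this architecture, the following sketch builds a small dense Q-network with tf.keras (assumed here, not prescribed by the text); the state size n, action count m, and hidden-layer widths are placeholders.

import tensorflow as tf

n, m = 4, 2   # illustrative: 4-dimensional state, 2 possible actions

# Dense DQN: n-dimensional state in, one Q value per possible action out.
q_network = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(m)   # linear outputs: predicted Q value for each action
])
q_network.build(input_shape=(None, n))
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)

For raw-pixel inputs, the first dense layers would instead be convolutional layers (for example tf.keras.layers.Conv2D) followed by a Flatten layer, in line with the note above.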

