Advanced Deep Learning with Keras

Chapter 9

Initially, the agent assumes a policy that selects a random action 90% of the time and exploits the Q-Table 10% of the time. Suppose the first action is randomly chosen and indicates a move to the right. Figure 9.3.3 illustrates the computation of the new Q value of state (0, 0) for the move-to-the-right action. The next state is (0, 1). The reward is 0, and the maximum of all the next state's Q values is zero. Therefore, the Q value of state (0, 0) for the move-to-the-right action remains 0.

To easily track the initial state and the next state, we use different shades of gray on both the environment and the Q-Table: lighter gray for the initial state and darker gray for the next state. In choosing the next action for the next state, the candidate actions are shown with thicker borders:

Figure 9.3.3: Assuming the action taken by the agent is a move to the right, the update on the Q value of state (0, 0) is shown
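To make the step concrete, the following is a minimal sketch in code of the tabular update just described. It is not the book's listing; the grid dimensions, the action encoding, and names such as q_table, gamma, and epsilon are assumptions made only for this illustration:

import numpy as np

# Sketch of the tabular Q-Learning update described above (assumed setup):
# a 2x3 grid indexed by (row, col), 4 actions (0=left, 1=down, 2=right, 3=up),
# reward 0 for non-terminal moves, and discount factor gamma.
n_rows, n_cols, n_actions = 2, 3, 4
q_table = np.zeros((n_rows, n_cols, n_actions))
gamma = 0.9      # discount factor
epsilon = 0.9    # explore 90% of the time, exploit 10% initially

def choose_action(state, rng=np.random.default_rng()):
    """Epsilon-greedy policy: random action with probability epsilon."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(q_table[state]))

# Worked example from Figure 9.3.3: from state (0, 0), move to the right.
state, action = (0, 0), 2
next_state, reward = (0, 1), 0.0

# Q(s, a) = r + gamma * max_a' Q(s', a')
q_table[state][action] = reward + gamma * np.max(q_table[next_state])
print(q_table[state][action])   # 0.0 -- the Q value of (0, 0) remains zero

Because the reward is 0 and every Q value of the next state (0, 1) is still zero, the update leaves the Q value of state (0, 0) for the move-to-the-right action at 0, matching the figure.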
