Advanced Deep Learning with Keras


Chapter 9: Deep Reinforcement Learning

Initially, the agent follows a policy that selects a random action 90% of the time and exploits the Q-Table 10% of the time. Suppose the first action is randomly chosen and is a move to the right. Figure 9.3.3 illustrates the computation of the new Q value of state (0, 0) for the move-right action. The next state is (0, 1). The reward is 0, and the maximum of all the next state's Q values is zero. Therefore, the Q value of state (0, 0) for the move-right action remains 0.

To make it easy to track the initial state and the next state, we use different shades of gray on both the environment and the Q-Table: lighter gray for the initial state and darker gray for the next state. When choosing the next action for the next state, the candidate actions are drawn with a thicker border.

Figure 9.3.3: Assuming the action taken by the agent is a move to the right, the update on the Q value of state (0, 0) is shown
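To make the step in Figure 9.3.3 concrete, here is a minimal sketch of the tabular Q-Learning update just described, assuming a dictionary-backed Q-Table, a discount factor gamma of 0.9, and a left/down/right/up action encoding; these names and values are illustrative assumptions, not the book's own listing.

import numpy as np

# Illustrative tabular Q-Learning sketch (names and values are assumptions).
actions = ["left", "down", "right", "up"]
q_table = {}           # maps (state, action) -> Q value; unseen entries are 0
gamma = 0.9            # discount factor (assumed value)
epsilon = 0.9          # 90% explore, 10% exploit, as described in the text

def q(state, action):
    return q_table.get((state, action), 0.0)

def choose_action(state):
    """Epsilon-greedy policy: pick a random action 90% of the time."""
    if np.random.rand() < epsilon:
        return np.random.choice(actions)
    return max(actions, key=lambda a: q(state, a))

def update_q(state, action, reward, next_state):
    """Deterministic update: Q(s, a) = r + gamma * max_a' Q(s', a')."""
    q_table[(state, action)] = reward + gamma * max(q(next_state, a)
                                                    for a in actions)

# The update in Figure 9.3.3: from state (0, 0), move right to (0, 1).
# The reward is 0 and all Q values of (0, 1) are 0, so the Q value stays 0.
update_q(state=(0, 0), action="right", reward=0.0, next_state=(0, 1))
print(q((0, 0), "right"))    # 0.0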



Figure 9.3.4: Assuming the action chosen by the agent is a move down, the update on the Q value of state (0, 1) is shown

Figure 9.3.5: Assuming the action chosen by the agent is a move to the right, the update on the Q value of state (1, 1) is shown

Let's suppose that the next randomly chosen action is a move down. Figure 9.3.4 shows no change in the Q value of state (0, 1) for the move-down action. In Figure 9.3.5, the agent's third random action is a move to the right. It encounters the hole, H, and receives a -100 reward. This time, the update is non-zero. The new Q value of state (1, 1) for the move-right action is -100. One episode has just finished, and the agent returns to the Start state.
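Continuing the sketch above, the third transition of the walkthrough can be reproduced. The (1, 2) coordinates assumed for the hole H and the explicit done flag are illustrative assumptions used only to match the numbers in Figure 9.3.5; with all next-state Q values still at zero, the plain update would give the same -100.

def update_q_terminal(state, action, reward, next_state, done):
    """On a terminal transition there is no next state to bootstrap from."""
    if done:
        q_table[(state, action)] = reward
    else:
        update_q(state, action, reward, next_state)

# Third move of the walkthrough: from state (1, 1), move right into the
# hole H (assumed to be at (1, 2)), receiving a -100 reward.
update_q_terminal(state=(1, 1), action="right", reward=-100.0,
                  next_state=(1, 2), done=True)
print(q((1, 1), "right"))    # -100.0, matching Figure 9.3.5

# The episode is over; the agent is placed back at the Start state
# and a new episode begins.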

