
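The snippets in this section rely on a handful of standard imports. A minimal set consistent with the classes and functions used below would be the following (the exact module paths are assumed here and may differ slightly from the original notebook):

from collections import deque
import random

import gym
import numpy as np

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam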
We set up the global values for the maximum number of episodes for which we will be training the agent (EPOCHS), the threshold value at which we consider the environment solved (THRESHOLD), and a bool to indicate whether we want to record the training or not (MONITOR). Please note that, as per the official OpenAI documentation, the CartPole environment is considered solved when the agent is able to maintain the pole in the vertical position for 195 time steps (ticks). In the following code, for the sake of time, we have reduced the THRESHOLD to 45:

EPOCHS = 1000
THRESHOLD = 45
MONITOR = True

Now let us build our DQN. We declare a class DQN and in its __init__() function declare all the hyperparameters and our model. We are also creating the environment inside the DQN class. As you can see, the class is quite general, and you can use it to train on any Gym environment whose state space information can be encompassed in a 1D array:

class DQN():
    def __init__(self, env_string, batch_size=64):
        self.memory = deque(maxlen=100000)
        self.env = gym.make(env_string)
        self.input_size = self.env.observation_space.shape[0]
        self.action_size = self.env.action_space.n
        self.batch_size = batch_size
        self.gamma = 1.0
        self.epsilon = 1.0
        self.epsilon_min = 0.01
        self.epsilon_decay = 0.995
        alpha = 0.01
        alpha_decay = 0.01
        if MONITOR: self.env = gym.wrappers.Monitor(self.env, '../data/'+env_string, force=True)

        # Init model
        self.model = Sequential()
        self.model.add(Dense(24, input_dim=self.input_size, activation='tanh'))
        self.model.add(Dense(48, activation='tanh'))
        self.model.add(Dense(self.action_size, activation='linear'))
        self.model.compile(loss='mse', optimizer=Adam(lr=alpha, decay=alpha_decay))
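As a quick sanity check of the class (this snippet is only an illustration and not part of the original listing; it assumes the Gym environment ID 'CartPole-v0'):

agent = DQN('CartPole-v0')
print(agent.input_size)   # 4: cart position, cart velocity, pole angle, pole angular velocity
print(agent.action_size)  # 2: push the cart to the left or to the right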


The DQN that we have built is a three-layered perceptron; in the following output you can see the model summary. We use the Adam optimizer with learning rate decay:

Figure 1: Summary of the DQN model
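For the CartPole case, with a 4-dimensional state and 2 actions, the parameter counts of the three Dense layers work out to 4×24+24 = 120, 24×48+48 = 1,200, and 48×2+2 = 98, that is, 1,418 trainable parameters in total.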

The variable self.memory will contain our experience replay buffer. We need to add a method for saving the <S,A,R,S'> tuple into the memory and a method to get random samples from it in batches to train the agent. We perform these two functions by defining the class methods remember and replay:

    def remember(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def replay(self, batch_size):
        x_batch, y_batch = [], []
        minibatch = random.sample(self.memory, min(len(self.memory), batch_size))
        for state, action, reward, next_state, done in minibatch:
            y_target = self.model.predict(state)
            y_target[0][action] = reward if done else reward + self.gamma * np.max(self.model.predict(next_state)[0])
            x_batch.append(state[0])
            y_batch.append(y_target[0])
        self.model.fit(np.array(x_batch), np.array(y_batch), batch_size=len(x_batch), verbose=0)
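Two details of replay are worth spelling out. First, the assignment to y_target[0][action] implements the Q-learning target r + γ·max_a' Q(s', a') (just r for terminal transitions), while the targets for the actions that were not taken are left at the network's own predictions, so only the chosen action contributes to the MSE loss. Second, the code assumes that each stored state already carries a leading batch dimension, that is, shape (1, input_size); this is why model.predict(state) can be called directly and why state[0] is appended to x_batch.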

Our agent will use the Epsilon Greedy policy when choosing the action. This is implemented in the following method:

    def choose_action(self, state, epsilon):
        if np.random.random() <= epsilon:
            return self.env.action_space.sample()
        else:
            return np.argmax(self.model.predict(state))

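To see how choose_action, remember, and replay fit together, here is a minimal sketch of a single training episode. It is only an illustration, not the full training routine: it assumes the agent has been created as agent = DQN('CartPole-v0') and that each observation is reshaped to (1, input_size) before being passed to the network, since replay works with state[0] and model.predict(state):

agent = DQN('CartPole-v0')
epsilon = agent.epsilon

# Run one episode, storing each transition in the replay buffer
state = agent.env.reset().reshape(1, agent.input_size)
done = False
while not done:
    action = agent.choose_action(state, epsilon)
    next_state, reward, done, _ = agent.env.step(action)
    next_state = next_state.reshape(1, agent.input_size)
    agent.remember(state, action, reward, next_state, done)
    state = next_state

# Learn from a random minibatch and decay exploration
agent.replay(agent.batch_size)
epsilon = max(agent.epsilon_min, epsilon * agent.epsilon_decay)

A full training run would repeat this over many episodes (up to EPOCHS) and stop once the average reward crosses THRESHOLD.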
