We set up the global values for the maximum number of episodes for which we will be training the agent (EPOCHS), the threshold value at which we consider the environment solved (THRESHOLD), and a Boolean to indicate whether we want to record the training or not (MONITOR). Please note that, as per the official OpenAI documentation, the CartPole environment is considered solved when the agent is able to maintain the pole in the vertical position for 195 time steps (ticks). In the following code, for the sake of time, we have reduced the THRESHOLD to 45:

EPOCHS = 1000
THRESHOLD = 45
MONITOR = True

Now let us build our DQN. We declare a class DQN and in its __init__() function declare all of the hyperparameters and our model. We also create the environment inside the DQN class. As you can see, the class is quite general, and you can use it to train on any Gym environment whose state space information can be encompassed in a 1D array:

def __init__(self, env_string, batch_size=64):
    self.memory = deque(maxlen=100000)
    self.env = gym.make(env_string)
    self.input_size = self.env.observation_space.shape[0]
    self.action_size = self.env.action_space.n
    self.batch_size = batch_size
    self.gamma = 1.0
    self.epsilon = 1.0
    self.epsilon_min = 0.01
    self.epsilon_decay = 0.995
    alpha = 0.01
    alpha_decay = 0.01
    if MONITOR:
        self.env = gym.wrappers.Monitor(self.env,
            '../data/' + env_string, force=True)

    # Init model
    self.model = Sequential()
    self.model.add(Dense(24, input_dim=self.input_size,
        activation='tanh'))
    self.model.add(Dense(48, activation='tanh'))
    self.model.add(Dense(self.action_size, activation='linear'))
    self.model.compile(loss='mse',
        optimizer=Adam(lr=alpha, decay=alpha_decay))
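To see how general this constructor is, you can point it at any Gym environment whose observation is a flat 1D vector and inspect the sizes it infers. The following is a minimal sketch (it assumes the chapter's imports of gym, numpy, and the Keras classes are in place, and uses 'CartPole-v0' purely as an example):

agent = DQN('CartPole-v0')
print(agent.input_size)    # 4: cart position, cart velocity, pole angle, pole angular velocity
print(agent.action_size)   # 2: push the cart left or right
agent.model.summary()      # prints the network architecture summary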
The DQN that we have built is a three-layer perceptron; in the following output you can see the model summary. We use the Adam optimizer with learning rate decay:

Figure 1: Summary of the DQN model
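Since the summary itself is shown as a figure, it is worth noting that, for CartPole's four-dimensional state and two actions, its parameter counts can be reproduced by hand from the layer definitions above:

Dense(24, tanh):     4 x 24 + 24  = 120 parameters
Dense(48, tanh):     24 x 48 + 48 = 1,200 parameters
Dense(2, linear):    48 x 2 + 2   = 98 parameters
Total trainable parameters:        1,418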
The deque self.memory will contain our experience replay buffer. We need to add a method for saving the <S,A,R,S'> tuple into the memory, and a method to get random samples from it in batches to train the agent. We perform these two functions by defining the class methods remember and replay:
def remember(self, state, action, reward, next_state, done):
    # Store the <S, A, R, S', done> tuple in the replay buffer
    self.memory.append((state, action, reward, next_state, done))

def replay(self, batch_size):
    x_batch, y_batch = [], []
    # Sample a random minibatch of past transitions from the replay buffer
    minibatch = random.sample(self.memory,
        min(len(self.memory), batch_size))
    for state, action, reward, next_state, done in minibatch:
        y_target = self.model.predict(state)
        # Q-learning target: r for terminal states,
        # r + gamma * max_a' Q(s', a') otherwise
        y_target[0][action] = reward if done else reward + \
            self.gamma * np.max(self.model.predict(next_state)[0])
        x_batch.append(state[0])
        y_batch.append(y_target[0])
    self.model.fit(np.array(x_batch), np.array(y_batch),
        batch_size=len(x_batch), verbose=0)
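To make the target construction in replay concrete, here is a tiny worked example with made-up numbers (illustrative only, not part of the class): for a non-terminal transition with reward 1.0, gamma = 1.0, and predicted Q-values [0.5, 0.8] for the next state, the target is 1.0 + 1.0 * max(0.5, 0.8) = 1.8. This value overwrites the predicted Q-value of the action that was actually taken, while the other entries of y_target are left untouched so that they contribute no error to the loss:

import numpy as np

# Hypothetical values, purely for illustration
reward, done, gamma = 1.0, False, 1.0
q_next = np.array([0.5, 0.8])     # stand-in for model.predict(next_state)[0]
target = reward if done else reward + gamma * np.max(q_next)
print(target)                      # 1.8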
Our agent uses an epsilon-greedy policy when choosing an action: with probability epsilon it explores by sampling a random action, and otherwise it exploits by choosing the action with the highest predicted Q-value. This is implemented in the following method:
def choose_action(self, state, epsilon):
    if np.random.random() <= epsilon:
        return self.env.action_space.sample()        # explore: random action
    else:
        return np.argmax(self.model.predict(state))  # exploit: best predicted Q-value
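The epsilon passed to choose_action is annealed as training proceeds, so the agent explores heavily at first and relies more and more on its learned Q-values later on. The exact schedule is defined in the training loop later in the chapter; a common multiplicative variant, built only from the epsilon_min and epsilon_decay hyperparameters declared in __init__, is sketched below (an illustration, not necessarily the schedule used in this chapter):

epsilon, epsilon_min, epsilon_decay = 1.0, 0.01, 0.995
for episode in range(EPOCHS):
    # ... run one episode, calling choose_action(state, epsilon) at every step ...
    epsilon = max(epsilon_min, epsilon * epsilon_decay)   # never decay below epsilon_min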