
general, acting to maximize immediate reward can reduce access to future rewards, so that the return may actually be reduced. As γ approaches 1, the objective takes future rewards into account more strongly: the agent becomes more farsighted. We can interpret γ in several ways. It can be seen as an interest rate, as a probability of living another step, or as a mathematical trick to bound the infinite sum.
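As a concrete illustration (a minimal Python sketch, not part of the original notes; the reward sequences and the name discounted_return are made up for the example), the truncated discounted return sum_t gamma^t r_t can be compared for a myopic and a farsighted choice:

def discounted_return(rewards, gamma):
    # Discounted sum of a finite reward sequence: sum_t gamma^t * r_t.
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# A "greedy" sequence pays off immediately, a "patient" one pays off later.
greedy = [10, 0, 0, 0, 0, 0]
patient = [0, 0, 0, 0, 0, 30]

for gamma in (0.1, 0.9):
    print(gamma, discounted_return(greedy, gamma), discounted_return(patient, gamma))
# With gamma = 0.1 the greedy sequence is preferred (10.0 vs. 0.0003);
# with gamma = 0.9 the patient one wins (10.0 vs. about 17.7).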

Another optimality criterion is the average-reward model, in which the agent is supposed to take actions that optimize its long-run average reward:

\lim_{h \to \infty} E\left( \frac{1}{h} \sum_{t=0}^{h} r_t \right)    (7.17)
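As a rough numerical sketch of (7.17) (an assumed example, not from the notes: the cyclic reward pattern and the horizon are made up), the long-run average can be approximated by truncating the sum at a large finite horizon h:

def average_reward(rewards):
    # (1/h) * sum of the first h rewards, with h = len(rewards).
    return sum(rewards) / len(rewards)

# A cyclic reward pattern 2, 0, 1, 2, 0, 1, ... observed for h steps.
h = 9999
stream = [(2, 0, 1)[t % 3] for t in range(h)]
print(average_reward(stream))  # approaches (2 + 0 + 1) / 3 = 1.0 as h grows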

Such a policy is referred to as a gain optimal policy; it can be seen as the limiting case of the infinite-horizon discounted model as the discount factor approaches 1. One problem with this criterion is that there is no way to distinguish between two policies, one of which gains a large amount of reward in the initial phases and the other of which does not. Reward gained on any initial prefix of the agent's life is overshadowed by the long-run average performance. It is possible to generalize this model so that it takes into account both the long-run average and the amount of initial reward that can be gained. In the generalized, bias optimal model, a policy is preferred if it maximizes the long-run average, and ties are broken by the initial extra reward.
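A small made-up example (a sketch with illustrative reward streams that are not from the notes) shows why the average-reward criterion alone cannot separate such policies, while the initial extra reward does:

h = 10_000
stream_a = [5] + [1] * (h - 1)   # large reward in the initial phase, then 1 per step
stream_b = [0] + [1] * (h - 1)   # no initial bonus, then 1 per step

gain_a = sum(stream_a) / h       # about 1.0004; tends to 1 as h grows
gain_b = sum(stream_b) / h       # about 0.9999; tends to 1 as h grows
bias = sum(stream_a) - sum(stream_b)   # = 5, the initial extra reward
print(gain_a, gain_b, bias)      # same gain in the limit; ties broken by the bias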

7.3.3 Markov World

A reinforcement-learning task is described as a Markov decision process, or MDP. If the state and action spaces are finite, then it is called a finite Markov decision process (finite MDP).

A particular finite MDP is defined by its state and action sets and by the one-step dynamics of the environment. Given any state and action, s and a, the probability of each possible next state, s', is

P^{a}_{ss'} = P\{ s_{t+1} = s' \mid s_t = s,\ a_t = a \}    (7.18)
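For a small finite MDP the transition probabilities in (7.18) can simply be tabulated. The following Python sketch (the two-state, two-action MDP and its numbers are invented for illustration) stores P^{a}_{ss'} as a nested table and checks that each row is a probability distribution:

# Hypothetical two-state, two-action MDP: P[s][a][s_next] = P^a_{ss'}.
P = {
    "s0": {"stay": {"s0": 1.0},                 # deterministic transition
           "go":   {"s0": 0.2, "s1": 0.8}},
    "s1": {"stay": {"s1": 1.0},
           "go":   {"s0": 0.9, "s1": 0.1}},
}

# For every state-action pair, the next-state probabilities must sum to 1.
for s, actions in P.items():
    for a, dist in actions.items():
        assert abs(sum(dist.values()) - 1.0) < 1e-9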

These quantities are called transition probabilities. Similarly, given any current state and action, s and a, together with any next state, s', the expected value of the next reward is

R^{a}_{ss'} = E\{ r_{t+1} \mid s_t = s,\ a_t = a,\ s_{t+1} = s' \}    (7.19)

These quantities, P^{a}_{ss'} and R^{a}_{ss'}, completely specify the most important aspects of the dynamics of a finite MDP (only information about the distribution of rewards around the expected value is lost).
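Because P^{a}_{ss'} and R^{a}_{ss'} are just conditional frequencies and conditional means, they can be estimated from observed transitions. The sketch below (the experience tuples are invented for illustration, and the counting scheme is only one possible choice) forms the empirical counterparts of (7.18) and (7.19):

from collections import defaultdict

# Hypothetical experience: observed transitions (s, a, r, s').
transitions = [("s0", "go", 1.0, "s1"), ("s0", "go", 0.0, "s0"),
               ("s0", "go", 1.0, "s1"), ("s1", "stay", 0.5, "s1")]

counts = defaultdict(int)          # N(s, a, s')
reward_sums = defaultdict(float)   # sum of rewards over transitions (s, a, s')
visits = defaultdict(int)          # N(s, a)

for s, a, r, s_next in transitions:
    counts[(s, a, s_next)] += 1
    reward_sums[(s, a, s_next)] += r
    visits[(s, a)] += 1

# Empirical estimates of P^a_{ss'} (eq. 7.18) and R^a_{ss'} (eq. 7.19).
P_hat = {k: counts[k] / visits[k[:2]] for k in counts}
R_hat = {k: reward_sums[k] / counts[k] for k in counts}
print(P_hat[("s0", "go", "s1")])   # 2/3: two of the three "go" steps from s0 reached s1
print(R_hat[("s0", "go", "s1")])   # 1.0: average reward observed on those transitions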

Almost all reinforcement learning algorithms are based on estimating value functions: functions of states (or of state-action pairs) that estimate how good it is for the agent to be in a given state (or how good it is to perform a given action in a given state). The notion of "how good" here is defined in terms of the future rewards that can be expected, that is, in terms of expected return.
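As a minimal sketch of what such a value function looks like in the tabular case (the two-state MDP, the fixed policy, gamma = 0.9, and the number of sweeps are all illustrative assumptions; the backup rule used here is a standard policy-evaluation update, not necessarily the one developed in these notes), a table V(s) can be refined by repeatedly backing up expected one-step reward plus discounted next-state value:

gamma = 0.9
policy = {"s0": "go", "s1": "stay"}            # one fixed action per state
P = {("s0", "go"): {"s0": 0.2, "s1": 0.8},     # P^a_{ss'} for the chosen actions
     ("s1", "stay"): {"s1": 1.0}}
R = {("s0", "go", "s0"): 0.0, ("s0", "go", "s1"): 1.0,
     ("s1", "stay", "s1"): 0.5}                # R^a_{ss'} for the chosen actions

V = {"s0": 0.0, "s1": 0.0}                     # tabular value estimates
for _ in range(50):                            # repeated expected backups
    V = {s: sum(p * (R[(s, policy[s], s2)] + gamma * V[s2])
                for s2, p in P[(s, policy[s])].items())
         for s in V}
print(V)   # larger values mean "better" states under this fixed policy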

