general, acting to maximize immediate reward can reduce access to future rewards so that the return may actually be reduced. As γ approaches 1, the objective takes future rewards into account more strongly: the agent becomes more farsighted. We can interpret γ in several ways. It can be seen as an interest rate, a probability of living another step, or as a mathematical trick to bound the infinite sum.
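As a small illustration (an assumed example, not part of the original notes), the following Python sketch computes the discounted return of a finite reward sequence and shows how the choice of γ trades off immediate against future rewards; the reward values and names are arbitrary.

# Assumed illustration: discounted return sum_t gamma**t * r_t
# for a finite reward sequence.

def discounted_return(rewards, gamma):
    """Return sum over t of gamma**t * rewards[t]."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

rewards = [1.0, 0.0, 0.0, 10.0]            # a large reward arrives late
print(discounted_return(rewards, 0.1))      # myopic agent:    ~1.01
print(discounted_return(rewards, 0.9))      # farsighted agent: ~8.29

With γ = 0.1 the late reward is almost invisible, while with γ = 0.9 it dominates the return, which is what is meant by the agent becoming more farsighted.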
Another optimality criterion is the average-reward model, in which the agent is supposed to take actions that optimize its long-run average reward:

\lim_{h \rightarrow \infty} E\left( \frac{1}{h} \sum_{t=0}^{h} r_t \right) \qquad (7.17)
Such a policy is referred to as a gain optimal policy; it can be seen as the limiting case of the infinite-horizon discounted model as the discount factor approaches 1. One problem with this criterion is that there is no way to distinguish between two policies, one of which gains a large amount of reward in the initial phases and the other of which does not. Reward gained on any initial prefix of the agent's life is overshadowed by the long-run average performance. It is possible to generalize this model so that it takes into account both the long-run average and the amount of initial reward that can be gained. In the generalized, bias optimal model, a policy is preferred if it maximizes the long-run average and ties are broken by the initial extra reward.
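To make this concrete, here is a minimal sketch (an assumed numerical example, not from the original notes) of two reward streams that differ only by a bonus in the initial phase; their empirical long-run averages converge to the same value, which is why the average-reward criterion alone cannot separate them and bias optimality is needed.

# Assumed illustration: two reward streams with the same long-run
# average reward (1.0 per step) but different initial rewards.

def average_reward(rewards):
    """Empirical long-run average (1/h) * sum_t r_t."""
    return sum(rewards) / len(rewards)

h = 10_000
stream_a = [100.0] + [1.0] * (h - 1)   # large reward in the initial phase
stream_b = [0.0] + [1.0] * (h - 1)     # no initial bonus

print(average_reward(stream_a))   # 1.0099... -> tends to 1.0 as h grows
print(average_reward(stream_b))   # 0.9999    -> tends to 1.0 as h grows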
7.3.3 Markov World
A reinforcement-learning task is described as a Markov decision process, or MDP. If the state and action spaces are finite, then it is called a finite Markov decision process (finite MDP). A particular finite MDP is defined by its state and action sets and by the one-step dynamics of the environment. Given any state and action, s and a, the probability of each possible next state, s', is

P^{a}_{ss'} = \Pr\left\{ s_{t+1} = s' \mid s_t = s, a_t = a \right\} \qquad (7.18)
These quantities are called transition probabilities. Similarly, given any current state and action, s and a, together with any next state, s', the expected value of the next reward is

R^{a}_{ss'} = E\left\{ r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s' \right\} \qquad (7.19)

These quantities, P^{a}_{ss'} and R^{a}_{ss'}, completely specify the most important aspects of the dynamics of a finite MDP (only information about the distribution of rewards around the expected value is lost).
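To make the notation concrete, the sketch below shows an assumed toy finite MDP (two states, two actions, values invented for illustration) stored as tables P[s][a][s'] and R[s][a][s'] in the sense of Equations 7.18 and 7.19.

# Assumed toy MDP: states {0, 1}, actions {0, 1}.
# P[s][a][s2] = Pr{ s_{t+1} = s2 | s_t = s, a_t = a }
# R[s][a][s2] = E{ r_{t+1}       | s_t = s, a_t = a, s_{t+1} = s2 }

P = {
    0: {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}},
    1: {0: {0: 0.5, 1: 0.5}, 1: {0: 0.0, 1: 1.0}},
}
R = {
    0: {0: {0: 0.0, 1: 1.0}, 1: {0: 0.0, 1: 2.0}},
    1: {0: {0: 0.0, 1: 1.0}, 1: {0: 0.0, 1: 0.5}},
}

def expected_reward(s, a):
    """Expected immediate reward: sum over s2 of P[s][a][s2] * R[s][a][s2]."""
    return sum(P[s][a][s2] * R[s][a][s2] for s2 in P[s][a])

print(expected_reward(0, 1))   # 0.2*0.0 + 0.8*2.0 = 1.6

Together, the P and R tables are exactly the one-step dynamics referred to above: given them, any planning or learning algorithm has all it needs except the distribution of rewards around their expected values.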
Almost all reinforcement learning algorithms are based on estimating value functions: functions of states (or of state-action pairs) that estimate how good it is for the agent to be in a given state (or how good it is to perform a given action in a given state). The notion of "how good" here is