
Solving Time-dependent Markov Decision Processes

Emmanuel Rachelson (Intelligent Systems Lab, Technical Univ. of Crete)
Patrick Fabiani (ONERA Toulouse)
Frederick Garcia (INRA Toulouse)

September 19th, 2009


UAV patrol mission

(Figure: time-dependent profiles for states (3, 8), (9, 10), (5, 2) and (9, 3).)


Outline

1. Time-dependent MDPs
2. Value iteration in practice: TiMDPpoly
3. Experiments


Modeling background

Sequential decision under probabilistic uncertainty: Markov Decision Process
- Tuple ⟨S, A, p, r, T⟩
- Markovian transition model p(s′|s, a)
- Reward model r(s, a)
- T is a set of timed decision epochs {0, 1, ..., H}
- Infinite (unbounded) horizon: H → ∞

(Figure: decision epochs 0, 1, ..., n, n+1 with states s0, s1, ..., sn, sn+1, transitions p(s_{n+1}|s_n, a_n) and rewards r(s_n, a_n).)


Optimal policies for MDPs

Value of a sequence of actions:
∀ (a_n) ∈ A^ℕ,   V^{(a_n)}(s) = E[ ∑_{δ=0}^{∞} γ^δ r(s_δ, a_δ) ]

Stationary, deterministic, Markovian policies:
D = { π : S → A, s ↦ π(s) = a }

Optimality equation:
V*(s) = max_{π ∈ D} V^π(s) = max_{a ∈ A} [ r(s, a) + γ ∑_{s′ ∈ S} p(s′|s, a) V*(s′) ]
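For reference, this is the classical value iteration scheme that the rest of the talk extends to the time-dependent case; a minimal sketch for a finite MDP (the array encoding, discount factor and tolerance are illustrative assumptions, not part of the original slides):

```python
import numpy as np

def value_iteration(p, r, gamma=0.95, eps=1e-6):
    """Classical value iteration for a finite MDP.

    p: transition model, shape (S, A, S), with p[s, a, s'] = p(s'|s, a)
    r: reward model, shape (S, A)
    Returns the optimal value function V* and a greedy (stationary) policy.
    """
    n_states = p.shape[0]
    v = np.zeros(n_states)
    while True:
        q = r + gamma * (p @ v)        # Q(s, a) = r(s, a) + gamma * sum_s' p(s'|s, a) V(s')
        v_new = q.max(axis=1)          # Bellman backup
        delta = np.max(np.abs(v_new - v))
        v = v_new
        if delta < eps:
            return v, q.argmax(axis=1)
```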


What are we looking for?

One way of framing the UAV patrol problem is to search for policies and value functions that depend on time.

(Figure: a time-dependent value function V(s, t) over t ∈ [0, 5], and the corresponding policy timelines in s1, s2 and s3, where the prescribed action (a1, a2, a3, a6, a7) changes with t.)


Time-dependent MDPs

Definition (TiMDP, [Boyan and Littman, 2001])
- Tuple ⟨S, A, M, L, R, K⟩
- M: set of outcomes µ = (s′_µ, T_µ, P_µ)
- L(µ|s, t, a): probability of triggering outcome µ
- R(µ, t, t′) = r_{µ,t}(t) + r_{µ,τ}(t′ − t) + r_{µ,t′}(t′)

(Figure: example TiMDP transition, where action a1 in s1 has outcome µ1 with probability 0.2 (relative model, T_µ1 = REL, distribution P_µ1) and outcome µ2 with probability 0.8 (absolute model, T_µ2 = ABS, distribution P_µ2), with s2 as a resulting state.)

Boyan, J. A. and Littman, M. L. (2001). Exact Solutions to Time Dependent MDPs. Advances in Neural Information Processing Systems, 13:1026–1032.
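To make the tuple concrete, here is a minimal sketch of how a TiMDP could be encoded as plain data structures (the field names, types and the integer state encoding are illustrative assumptions, not taken from the paper):

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Literal, Tuple

@dataclass
class Outcome:
    """An outcome mu = (s'_mu, T_mu, P_mu) of a TiMDP transition."""
    next_state: int                          # s'_mu
    time_model: Literal["ABS", "REL"]        # T_mu: absolute arrival date or relative duration
    duration_dist: Callable[[float], float]  # P_mu: density over t' (ABS) or over t' - t (REL)

@dataclass
class TiMDP:
    """The tuple <S, A, M, L, R, K> of a time-dependent MDP."""
    states: List[int]                                # S
    actions: List[int]                               # A
    outcomes: Dict[Tuple[int, int], List[Outcome]]   # M, indexed by (s, a)
    L: Callable[[Outcome, int, float, int], float]   # L(mu | s, t, a)
    R: Callable[[Outcome, float, float], float]      # R(mu, t, t')
    K: Callable[[int, float], float]                 # reward rate of waiting in s at time t
```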


TiMDP dynamic programming equation

V(s, t)   = sup_{t′ ≥ t} [ ∫_t^{t′} K(s, θ) dθ + V̄(s, t′) ]      (wait from t to t′, then act)
V̄(s, t)   = max_{a ∈ A} Q(s, t, a)
Q(s, t, a) = ∑_{µ ∈ M} L(µ|s, t, a) · U(µ, t)
U(µ, t)   = ∫_{−∞}^{+∞} P_µ(t′) [R(µ, t, t′) + V(s′_µ, t′)] dt′        if T_µ = ABS
U(µ, t)   = ∫_{−∞}^{+∞} P_µ(t′ − t) [R(µ, t, t′) + V(s′_µ, t′)] dt′    if T_µ = REL

(Figures: the Q-functions Q_n(s, t, a1), Q_n(s, t, a2), Q_n(s, t, a3) as functions of t, and two plots of the resulting value function V(s, t).)
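For illustration only, the backup above can be evaluated numerically on a time grid when the duration distributions P_µ are discrete, as in the exact-resolution case below. This is a minimal sketch (the dictionary-based outcome encoding, the grid and the interpolation are our own choices; the actual algorithm manipulates closed-form piecewise polynomials instead):

```python
import numpy as np

# Illustrative encoding: an outcome mu is a dict with keys
#   "next": next state s'_mu,
#   "T":    "ABS" or "REL",
#   "P":    discrete distribution as a list of (time, probability) pairs,
#   "R":    callable R(t, t') giving the transition reward.
def backup_U(mu, V_next, t_grid):
    """U(mu, t) on a time grid, for a discrete duration distribution P_mu."""
    U = np.zeros_like(t_grid, dtype=float)
    for i, t in enumerate(t_grid):
        for x, prob in mu["P"]:
            t_next = x if mu["T"] == "ABS" else t + x   # absolute date vs. relative duration
            U[i] += prob * (mu["R"](t, t_next) + np.interp(t_next, t_grid, V_next))
    return U

def backup_Q(s, a, outcomes, L, V, t_grid):
    """Q(s, t, a) = sum_mu L(mu | s, t, a) * U(mu, t), evaluated on the grid."""
    q = np.zeros_like(t_grid, dtype=float)
    for mu in outcomes[(s, a)]:
        l = np.array([L(mu, s, t, a) for t in t_grid])  # L(mu | s, t, a)
        q += l * backup_U(mu, V[mu["next"]], t_grid)
    return q
```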




Optimality equation?

Is this DP equation an optimality equation for TiMDPs? If yes, for which criterion?

Rachelson, E., Garcia, F., and Fabiani, P. (2008). Extending the Bellman Equation for MDP to Continuous Actions and Continuous Time in the Discounted Case. In International Symposium on Artificial Intelligence and Mathematics.

Yes: it is an optimality equation for the total reward criterion, under specific hypotheses on the transition and reward models.


Value Iteration for TiMDPs

Solving TiMDPs ↔ solving the optimality equation.



Solving TiMDPs

(Figure: the TiMDP transition example from the definition slide.)

Value iteration: Bellman backups for TiMDPs can be performed exactly if
- L(µ|s, t, a) is piecewise constant,
- R(µ, t, t′) = r_{µ,t}(t) + r_{µ,τ}(t′ − t) + r_{µ,t′}(t′),
- r_{µ,t}(t), r_{µ,τ}(τ), r_{µ,t′}(t′) are piecewise linear,
- P_µ(t′), P_µ(t′ − t) are discrete distributions.

Then V*(s, t) is piecewise linear.
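The key operation that preserves piecewise linearity, the pointwise maximum over actions, can be illustrated on a breakpoint representation; a minimal sketch (the breakpoint encoding and the helper name are our own, and both inputs are assumed to share the same domain):

```python
import numpy as np

def pwl_max(xs_f, ys_f, xs_g, ys_g):
    """Pointwise max of two piecewise linear functions given by breakpoints (xs, ys)."""
    xs = np.union1d(xs_f, xs_g)              # merge the two breakpoint sets
    f = np.interp(xs, xs_f, ys_f)
    g = np.interp(xs, xs_g, ys_g)
    out_x, out_y = [], []
    for i in range(len(xs) - 1):
        a, b = xs[i], xs[i + 1]
        fa, fb, ga, gb = f[i], f[i + 1], g[i], g[i + 1]
        out_x.append(a); out_y.append(max(fa, ga))
        d0, d1 = fa - ga, fb - gb
        if d0 * d1 < 0:                      # the pieces cross inside (a, b): add a new breakpoint
            t = d0 / (d0 - d1)
            out_x.append(a + t * (b - a)); out_y.append(fa + t * (fb - fa))
    out_x.append(xs[-1]); out_y.append(max(f[-1], g[-1]))
    return np.array(out_x), np.array(out_y)
```

The result is piecewise linear again, but generally with more breakpoints than either input, which is the interval-growth issue discussed below.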

What about other, more expressive functions? How does this theoretical result scale to practical resolution?


Extending exact resolution

Piecewise polynomial (PWP) models: L, P_µ, r_i ∈ P_n.

Degree evolution:
P_µ ∈ DP_A,  r_i, V_0 ∈ P_B,  L ∈ P_C   ⇒   d°(V_n) = B + n(A + C + 1)

Degree stability ⇔ A + C = −1.

Exact resolution conditions (degree stability + exact analytical computations):
P_µ ∈ DP_{−1},  r_i ∈ P_4,  L ∈ P_0

If B > 4: approximate root finding.
If A + C > 0: projection scheme of V_n onto P_B.
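As a quick sanity check of the degree formula (our own worked instance): with discrete duration distributions (A = −1), a piecewise constant L (C = 0) and piecewise linear rewards (B = 1), we get d°(V_n) = 1 + n(−1 + 0 + 1) = 1 for every n, so the value function stays piecewise linear across iterations, which recovers the exact case above.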


And in practice?

Experimental result: the number of definition intervals in V_n grows with n and does not necessarily converge.

⇒ numerical problems occur before ‖V_n − V_{n−1}‖ < ε.

(Figure: example V computation as the pointwise maximum of Q_n(s, t, a1), Q_n(s, t, a2) and Q_n(s, t, a3); each maximum introduces new definition intervals.)

→ General case: approximate resolution by piecewise polynomial simplification of the value function, combining degree reduction and interval simplification.


TiMDPpoly: Approximate Value Iteration on TiMDPs

TiMDPpoly polynomial approximation: pout = poly_approx(pin, [l, u], ε, B).
Two phases: incremental refinement and simplification.

(Figure: poly_approx on an interval I; when the maximum error of an attempted fit exceeds ε, the interval is refined into subintervals I1, I2, I3 until pout approximates pin within ε.)

Properties:
- pout ∈ P_B
- ‖pin − pout‖∞ ≤ ε
- suboptimal number of intervals
- good complexity compromise
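A minimal sketch of the refine-by-splitting idea behind poly_approx (the least-squares fit, the sampled error estimate and the bisection strategy are our own illustrative choices; the actual algorithm also includes the simplification phase that merges intervals):

```python
import numpy as np

def poly_approx(p_in, l, u, eps, B, n_samples=64):
    """Approximate p_in on [l, u] by piecewise polynomials of degree <= B,
    with a sampled L-infinity error <= eps. Returns a list of ((l, u), coeffs)."""
    ts = np.linspace(l, u, n_samples)
    ys = np.array([p_in(t) for t in ts])
    coeffs = np.polyfit(ts, ys, B)                      # degree-B least-squares fit on [l, u]
    err = np.max(np.abs(np.polyval(coeffs, ts) - ys))   # estimated max error on the samples
    if err <= eps or u - l < 1e-9:
        return [((l, u), coeffs)]
    mid = 0.5 * (l + u)                                 # refine: split the interval and recurse
    return poly_approx(p_in, l, mid, eps, B) + poly_approx(p_in, mid, u, eps, B)

# Example: approximate a kinked function by piecewise quadratics within 1e-2
pieces = poly_approx(lambda t: abs(np.sin(3.0 * t)), 0.0, 5.0, eps=1e-2, B=2)
```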


TiMDPpoly: Approximate Value Iteration on TiMDPs

Prioritized Sweeping: leveraging the computational effort by ordering Bellman backups, performing them first in the states with the largest value function change.

Moore, A. W. and Atkeson, C. G. (1993). Prioritized Sweeping: Reinforcement Learning with Less Data and Less Time. Machine Learning, 13(1):103–130.


Adapting Prioritized Sweeping to TiMDPs:
1. Pick the highest priority state → s0.
2. Bellman backup → update V̄(s0, t) and V(s0, t), then poly_approx(V(s0, t)).
3. Update the Q values of the predecessor states → Q(s, t, a).
4. Update the priorities → prio(s) = ‖Q − Q_old‖∞.

(Figure: predecessor states s1, s2, s3 whose actions and outcomes lead to s0, with priorities prio(s1), prio(s2), prio(s3) refreshed after the backup in s0.)
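A minimal sketch of this prioritized-sweeping loop (backup, predecessors and q_change stand for the TiMDP operations above; the heap handling and the stopping rule are our own illustrative choices):

```python
import heapq

def prioritized_sweeping(states, backup, predecessors, q_change, eps=1e-3, max_backups=10_000):
    """Asynchronous value iteration ordered by prio(s) = ||Q - Q_old||_inf.

    backup(s)       -- performs the Bellman backup and poly_approx of V(s, t)
    predecessors(s) -- states with an action/outcome leading to s
    q_change(s)     -- ||Q - Q_old||_inf in s after the latest update
    """
    heap = [(-float("inf"), s) for s in states]   # start with every state at maximal priority
    heapq.heapify(heap)
    for _ in range(max_backups):
        if not heap:
            break
        neg_prio, s0 = heapq.heappop(heap)        # pick the highest priority state
        if -neg_prio < eps:
            break                                  # all remaining priorities are below eps
        backup(s0)                                 # update V(s0, t) and re-approximate it
        for s in predecessors(s0):                 # propagate the change upstream
            heapq.heappush(heap, (-q_change(s), s))
```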


TiMDPpoly in a nutshell

TiMDPpoly combines:
- analytical polynomial calculations,
- an L∞-bounded error projection,
- Prioritized Sweeping for TiMDPs.

Analytical operations are one option for representing continuous quantities; the approximation makes the resolution possible, and asynchronous value iteration makes it faster.


The UAV patrol problem

(Figures: the time-dependent profiles of states (3, 8), (9, 10), (5, 2) and (9, 3), and further illustrations of the UAV patrol problem.)


A Mars rover problem

(Figure: the rover's navigation graph with locations 1 to 6, travel durations on the edges, a time-windowed photo goal "photo [ts; te]" and two sample goals.)


Mars rover policy

(Figure: V and π in p = 3 when no goals have been completed yet.)


(Figure: 2D view of π in p = 3 when no goals have been completed yet, over Time (0 to 70) and Energy (0 to 40); the plotted actions are Wait, Recharge, Take Picture, move_to_2, move_to_4 and move_to_5.)


Related work and differences

Representation issues and formal resolution:
- [Feng et al., 2004] extends the idea of [Boyan and Littman, 2001] to continuous state spaces with discrete transition models, using kd-trees to store partitions.
- [Li and Littman, 2005] extends it to continuous state space MDPs with piecewise constant functions, illustrating the need for simplification.
- TiMDPpoly extends it to PWP representations in the one-dimensional case, with a direct generalization to continuous state spaces.
- TiMDPpoly keeps the specific wait action of TiMDPs.


Dynamic programming:
- [Boyan and Littman, 2001], [Feng et al., 2004], [Li and Littman, 2005] → finite horizon optimization
- TiMDPpoly, building on the optimality equation analysis → infinite horizon, asynchronous optimization


Conclusion

- We exploit previous results on observable time in MDPs [Rachelson et al., 2008] to provide a better understanding of TiMDPs.
- TiMDPpoly: an improved value iteration algorithm for solving TiMDPs, with analytical Bellman backups, L∞-bounded value function approximation and asynchronous dynamic programming.


Perspectives

- Generalization to continuous state space MDPs: rectangular partitions? Kuhn triangulations? Spline theory tools.
- Continuous action parameter optimization.
- Prioritizing: prio(s) → prio(s, I).


Thank you for your attention!



References

Boyan, J. A. and Littman, M. L. (2001). Exact Solutions to Time Dependent MDPs. Advances in Neural Information Processing Systems, 13:1026–1032.

Feng, Z., Dearden, R., Meuleau, N., and Washington, R. (2004). Dynamic Programming for Structured Continuous Markov Decision Problems. In Conference on Uncertainty in Artificial Intelligence.

Li, L. and Littman, M. L. (2005). Lazy Approximation for Solving Continuous Finite-Horizon MDPs. In National Conference on Artificial Intelligence.

Rachelson, E., Garcia, F., and Fabiani, P. (2008). Extending the Bellman Equation for MDP to Continuous Actions and Continuous Time in the Discounted Case. In International Symposium on Artificial Intelligence and Mathematics.