Time-dependent MDPs
Value iteration in practice: TiMDPpoly
Experiments
Conclusion
Solving Time-dependent Markov Decision Processes
Emmanuel Rachelson Intelligent Systems Lab Technical Univ. of Crete
Patrick Fabiani ONERA Toulouse
Frederick Garcia INRA Toulouse
September 19th, 2009
Solving Time-dependent Markov Decision Processes 1 / 24
UAV patrol mission
[Figure: patrol mission graph over states (3, 8), (9, 10), (5, 2) and (9, 3), with timed transitions between them.]
Outline
1. Time-dependent MDPs
2. Value iteration in practice: TiMDPpoly
3. Experiments
Modeling background

Sequential decision under probabilistic uncertainty: Markov Decision Process
- Tuple ⟨S, A, p, r, T⟩
- Markovian transition model p(s′|s, a)
- Reward model r(s, a)
- T: a set of timed decision epochs {0, 1, ..., H}
- Infinite (unbounded) horizon: H → ∞

[Figure: trajectory s₀ → s₁ → ... → sₙ → sₙ₊₁, each step taken with probability p(sₙ₊₁|sₙ, aₙ) and reward r(sₙ, aₙ).]
Optimal policies for MDPs

Value of a sequence of actions:
∀ (aₙ) ∈ A^ℕ,  V^(aₙ)(s) = E[ ∑_{δ=0}^{∞} γ^δ r(s_δ, a_δ) ]

Stationary, deterministic, Markovian policy:
D = { π : S → A, s ↦ π(s) = a }

Optimality equation:
V*(s) = max_{π∈D} V^π(s) = max_{a∈A} [ r(s, a) + γ ∑_{s′∈S} p(s′|s, a) V*(s′) ]
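In practice this optimality equation is solved by value iteration, repeating the max-backup until the value function stabilizes. A minimal sketch for a finite MDP (the array layout and function name are illustrative, not from the talk):

```python
import numpy as np

def value_iteration(p, r, gamma=0.95, eps=1e-6):
    """Iterate V <- max_a [r(s,a) + gamma * sum_s' p(s'|s,a) V(s')] to a fixed point.

    p: (A, S, S) transition tensor, p[a, s, s'] = p(s'|s, a)
    r: (S, A) reward matrix
    Returns the optimal value function and a greedy policy.
    """
    S, A = r.shape
    V = np.zeros(S)
    while True:
        # Q(s, a) = r(s, a) + gamma * E[V(s') | s, a]
        Q = r + gamma * np.stack([p[a] @ V for a in range(A)], axis=1)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < eps:
            return V_new, Q.argmax(axis=1)
        V = V_new
```

The same contraction argument that justifies this loop is what the talk later extends to the time-dependent setting.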
What are we looking for?

One way to view the UAV patrol problem is to search for policies and value functions that depend on time.

[Figure: V(s, t) over t for states s₁, s₂, s₃, with the optimal action (a₁, a₂, a₃, a₆, a₇) changing along the time axis.]
Time-dependent MDPs

Definition (TiMDP, [Boyan and Littman, 2001]): tuple ⟨S, A, M, L, R, K⟩
- M: set of outcomes μ = ⟨s′_μ, T_μ, P_μ⟩
- L(μ|s, t, a): probability of triggering outcome μ
- R(μ, t, t′) = r_{μ,t}(t) + r_{μ,τ}(t′ − t) + r_{μ,t′}(t′)

[Figure: action a₁ in state s₁ triggers outcome μ₁ (prob. 0.2, T_{μ₁} = REL, duration distribution P_{μ₁}) or outcome μ₂ (prob. 0.8, T_{μ₂} = ABS, arrival distribution P_{μ₂}), both leading to s₂.]

Boyan, J. A. and Littman, M. L. (2001). Exact Solutions to Time Dependent MDPs. Advances in Neural Information Processing Systems, 13:1026–1032.
TiMDP dynamic programming equation

V(s, t) = sup_{t′ ≥ t} [ ∫_t^{t′} K(s, θ) dθ + V̄(s, t′) ]

V̄(s, t) = max_{a∈A} Q(s, t, a)

Q(s, t, a) = ∑_{μ∈M} L(μ|s, t, a) · U(μ, t)

U(μ, t) = ∫_{−∞}^{∞} P_μ(t′) [R(μ, t, t′) + V(s′_μ, t′)] dt′   if T_μ = ABS
U(μ, t) = ∫_{−∞}^{∞} P_μ(t′ − t) [R(μ, t, t′) + V(s′_μ, t′)] dt′   if T_μ = REL

[Figure: curves Qₙ(s, t, a₁), Qₙ(s, t, a₂), Qₙ(s, t, a₃) over t, whose pointwise maximum gives V̄(s, t); taking the sup over waiting times t′ ≥ t, with dawdling reward K, then yields V(s, t).]
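The integral defining U(μ, t) for a REL outcome can be illustrated numerically by discretizing time. This is only a sanity-check approximation of the equation above, not the exact piecewise-polynomial computation introduced later; the duration density, reward, and grid are made up:

```python
import numpy as np

# Uniform time grid over the decision horizon
t = np.linspace(0.0, 10.0, 201)
dt = t[1] - t[0]

def U_rel(P_tau, R, V_dest):
    """Discretized U(mu, t) = integral of P_mu(t' - t) [R(mu, t, t') + V(s'_mu, t')] dt'."""
    U = np.empty_like(t)
    for i, ti in enumerate(t):
        w = P_tau(t - ti) * dt            # duration density evaluated at t' - t, as weights
        U[i] = np.sum(w * (R(ti, t) + V_dest))
    return U

# Made-up example: triangular duration density around 1 time unit,
# constant reward 1, zero value in the destination state
P = lambda tau: np.maximum(0.0, 1.0 - np.abs(tau - 1.0))
R = lambda ti, tp: 1.0
U = U_rel(P, R, np.zeros_like(t))
# Away from the end of the horizon, U(t) is close to 1: the expected reward of one transition
```

On the same grid, Q(s, t, a) would then be the L(μ|s, t, a)-weighted sum of such U(μ, t) vectors.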
Optimality equation?

Is this DP equation an optimality equation for TiMDPs? And if so, for which criterion?

Rachelson, E., Garcia, F., and Fabiani, P. (2008). Extending the Bellman Equation for MDP to Continuous Actions and Continuous Time in the Discounted Case. In International Symposium on Artificial Intelligence and Mathematics.

Yes — under the total reward criterion, with specific hypotheses on the transition and reward models.
Value Iteration for TiMDPs
Solving TiMDPs ↔ solving the optimality equation.
Solving TiMDPs

[Figure: TiMDP example — action a₁ in s₁ with outcomes μ₁ (prob. 0.2, T_{μ₁} = REL) and μ₂ (prob. 0.8, T_{μ₂} = ABS), both leading to s₂.]

Value iteration: Bellman backups for TiMDPs can be performed exactly if:
- L(μ|s, t, a) is piecewise constant
- R(μ, t, t′) = r_{μ,t}(t) + r_{μ,τ}(t′ − t) + r_{μ,t′}(t′), with r_{μ,t}, r_{μ,τ}, r_{μ,t′} piecewise linear
- P_μ(t′), P_μ(t′ − t) are discrete distributions

Then V*(s, t) is piecewise linear.
What about other, more expressive classes of functions? And how does this theoretical result carry over to practical resolution?
Extending exact resolution

Piecewise polynomial (PWP) models: L, P_μ, r_i ∈ Pₙ.

Degree evolution: if P_μ ∈ DP_A, r_i, V₀ ∈ P_B and L ∈ P_C, then d°(Vₙ) = B + n(A + C + 1).
Degree stability ⇔ A + C = −1.

Exact resolution conditions (degree stability + exact analytical computations):
P_μ ∈ DP₋₁, L ∈ P₀, r_i ∈ P₄.
- If B > 4: approximate root finding.
- If A + C > 0: projection scheme of Vₙ onto P_B.
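The degree-evolution formula can be checked on a single backup piece: integrating a degree-A density piece against a degree-B value piece up to a fixed breakpoint raises the degree to A + B + 1 (here with L constant, i.e. C = 0). A small numerical check with made-up degrees, using exact polynomial integration and the fact that the (d+1)-th finite difference of a degree-d polynomial vanishes:

```python
import numpy as np
from numpy.polynomial import Polynomial as Poly

A, B = 2, 3   # made-up degrees of the duration-density piece and the value piece

def U(t):
    # One backup piece: integral from t to 1 of (t' - t)^A * t'^B dt', computed exactly
    integrand = (Poly([-t, 1.0]) ** A) * (Poly([0.0, 1.0]) ** B)
    F = integrand.integ()
    return F(1.0) - F(t)

ts = np.linspace(0.0, 0.8, 9)
u = np.array([U(x) for x in ts])
d7 = np.diff(u, n=7)   # vanishes: degree(U) <= A + B + 1 = 6
d6 = np.diff(u, n=6)   # constant and nonzero: degree(U) is exactly 6
```

With A + C = −1 (discrete durations, constant L) the same computation leaves the degree at B, which is the stability condition stated above.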
And in practice?

Experimental result: the number of definition intervals in Vₙ grows with n and does not necessarily converge.
⇒ numerical problems occur before ‖Vₙ − Vₙ₋₁‖ < ε, e.g. when computing V as the pointwise maximum of the Qₙ(s, t, aᵢ) curves.

[Figure: Qₙ(s, t, a₁), Qₙ(s, t, a₂), Qₙ(s, t, a₃) over t; taking their maximum multiplies definition intervals.]

→ General case: approximate resolution by piecewise polynomial simplification of the value function:
- degree reduction
- interval simplification
TiMDPpoly: Approximate Value Iteration on TiMDPs

Polynomial approximation: p_out = poly_approx(p_in, [l, u], ε, B), in two phases: incremental refinement and simplification.

[Figure: a first fit of p_in on interval I exceeds the max error ε; I is split into I₁, I₂ (then I₃) until every piece fits within ε, yielding p_out.]

Properties:
- p_out ∈ P_B
- ‖p_in − p_out‖∞ ≤ ε
- suboptimal number of intervals
- good complexity compromise
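The refinement phase of poly_approx can be sketched as a recursive fit-and-split: fit a degree-≤B polynomial on the interval and split whenever the sampled sup-error exceeds ε. This is a simplified reconstruction — the sampling grid, least-squares fit, and midpoint splitting are assumptions, and the simplification phase is omitted:

```python
import numpy as np
from numpy.polynomial import Polynomial as Poly

def poly_approx(f, l, u, eps, B, samples=64):
    """Approximate f on [l, u] by piecewise polynomials of degree <= B, with
    sup error <= eps checked on a sample grid. Returns [(interval, poly), ...]."""
    x = np.linspace(l, u, samples)
    y = f(x)
    p = Poly.fit(x, y, B)                     # least-squares fit on this interval
    if np.max(np.abs(p(x) - y)) <= eps:
        return [((l, u), p)]
    m = 0.5 * (l + u)                         # error too large: split and recurse
    return poly_approx(f, l, m, eps, B, samples) + poly_approx(f, m, u, eps, B, samples)

# |v - 1| is not degree-1 on [0, 2], but is on each half: two pieces
pieces = poly_approx(lambda v: np.abs(v - 1.0), 0.0, 2.0, 1e-6, 1)
```

Midpoint splitting is what makes the number of intervals suboptimal, as the slide notes: a breakpoint chosen by error analysis could do better, at a higher cost per split.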
TiMDPpoly: Approximate Value Iteration on TiMDPs

Prioritized Sweeping: focusing the computational effort by ordering Bellman backups — perform backups in the states with the largest value function change.

Moore, A. W. and Atkeson, C. G. (1993). Prioritized Sweeping: Reinforcement Learning with Less Data and Less Real Time. Machine Learning, 13(1):103–130.
TiMDPpoly: Approximate Value Iteration on TiMDPs

Adapting Prioritized Sweeping to TiMDPs:
1. Pick the highest-priority state → s₀
2. Bellman backup → V(s₀, t), followed by poly_approx(V(s₀, t))
3. Update the Q-values of predecessor states → Q(s, t, a)
4. Update priorities → prio(s) = ‖Q − Q_old‖∞

[Figure: predecessor states s₁, s₂, s₃ reach s₀ through (a10, μ10), (a20, μ20), (a30, μ30); their priorities prio(s₁), prio(s₂), prio(s₃) are updated after the backup in s₀.]
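The loop above is the classic prioritized-sweeping schedule. A discrete-state sketch with a priority heap conveys the idea — TiMDPpoly runs the same loop, but each backup is the time-dependent polynomial backup; the priority rule and thresholds here are illustrative:

```python
import heapq
import numpy as np

def prioritized_sweeping(p, r, gamma=0.95, eps=1e-6):
    """Value iteration with backups ordered by priority, for a finite MDP.
    p: (A, S, S) transitions, r: (S, A) rewards."""
    S, A = r.shape
    V = np.zeros(S)
    # preds[s'] = states that can transition into s' under some action
    preds = [np.nonzero(p[:, :, s2].sum(axis=0))[0] for s2 in range(S)]
    prio = {s: np.inf for s in range(S)}      # back every state up at least once
    heap = [(-np.inf, s) for s in range(S)]
    heapq.heapify(heap)
    while True:
        s0 = None
        while heap:                            # pop the highest current priority
            neg, s = heapq.heappop(heap)
            if -neg == prio.get(s, 0.0):       # skip stale heap entries
                s0 = s
                break
        if s0 is None or prio[s0] < eps:
            break
        q = r[s0] + gamma * p[:, s0, :] @ V    # Bellman backup in s0
        delta = abs(q.max() - V[s0])
        V[s0] = q.max()
        prio[s0] = 0.0
        for s in preds[s0]:                    # propagate priorities backwards
            pr = gamma * p[:, s, s0].max() * delta
            if pr > prio.get(s, 0.0):
                prio[s] = pr
                heapq.heappush(heap, (-pr, s))
    return V
```

Stale heap entries are skipped lazily rather than deleted, a standard pattern when priorities are updated more often than states are popped.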
TiMDPpoly in a nutshell
- Analytical polynomial calculations
- L∞-bounded error projection
- Prioritized Sweeping for TiMDPs

Analytical operations are one option for representing continuous quantities; approximation makes resolution possible, and asynchronous VI makes it faster.
The UAV patrol problem

[Figure: patrol mission graph over states (3, 8), (9, 10), (5, 2) and (9, 3), as on the earlier UAV patrol mission slide.]
The UAV patrol problem
A Mars rover problem

[Figure: rover navigation graph over numbered locations 1–6, with edge traversal durations (5, 4, 3, 12, ...), two "sample" sites, and a "photo" goal with time window [ts; te].]
Mars rover policy

[Figure: V and π in p = 3 when no goals have been completed yet.]
Mars rover policy

π in p = 3 when no goals have been completed yet — 2D view.

[Figure: policy map over Energy (0–40) × Time (0–70); actions: Wait, Recharge, Take Picture, move_to_2, move_to_4, move_to_5.]
Related work and differences
Representation issues and formal resolution
- [Feng et al., 2004] extends the [Boyan and Littman, 2001] idea to continuous state spaces with discrete transition models, using kd-trees to store partitions.
- [Li and Littman, 2005] extends it to continuous state space MDPs and piecewise constant functions, illustrating the need for simplification.
- TiMDPpoly extends it to PWP representations in the one-dimensional case, with a direct generalization to continuous state spaces, and keeps the specific wait action of TiMDPs.
Dynamic Programming
- [Boyan and Littman, 2001, Feng et al., 2004, Li and Littman, 2005] → finite-horizon optimization
- TiMDPpoly, backed by the optimality equation analysis → infinite-horizon, asynchronous optimization
Conclusion
- We exploit previous results on observable time in MDPs [Rachelson et al., 2008] to provide a better understanding of TiMDPs.
- TiMDPpoly: an improved VI algorithm for solving TiMDPs, with analytical Bellman backups, L∞-bounded value function approximation, and asynchronous dynamic programming.
Perspectives
- Generalization to continuous state space MDPs: rectangular partitions? Kuhn triangulations? Spline theory tools.
- Continuous action parameter optimization.
- Finer prioritizing: prio(s) → prio(s, I).
Thank you for your attention!
References

Boyan, J. A. and Littman, M. L. (2001). Exact Solutions to Time Dependent MDPs. Advances in Neural Information Processing Systems, 13:1026–1032.
Feng, Z., Dearden, R., Meuleau, N., and Washington, R. (2004). Dynamic Programming for Structured Continuous Markov Decision Problems. In Conference on Uncertainty in Artificial Intelligence.
Li, L. and Littman, M. L. (2005). Lazy Approximation for Solving Continuous Finite-Horizon MDPs. In National Conference on Artificial Intelligence.
Rachelson, E., Garcia, F., and Fabiani, P. (2008). Extending the Bellman Equation for MDP to Continuous Actions and Continuous Time in the Discounted Case. In International Symposium on Artificial Intelligence and Mathematics.