
Determination and Estimation of Generalized Entropy Rates for Markov Chains

Valérie Girardin, Université de Caen, France

1. Shannon entropy rate
   • Entropy rate
   • Asymptotic Equipartition Property
2. Generalized entropy rates
   • Generalized entropy functionals
   • Determination of the entropy rate
   • Explicit expression for Markov chains
3. Estimation of entropy rates for Markov chains
   • Shannon entropy and finite chains
   • Generalized entropy and denumerable chains

Joint work with:

Gabriela Ciuperca, U. Lyon; Loïck Lhote, ENSICAEN; André Sesboüé, U. Caen.

Shannon entropy rate of a stochastic process

• The entropy up to time n of a random sequence X = (X_n)_{n∈N} with denumerable state space E is
$$-\sum_{i_1,\dots,i_n \in E} p_n(i_1^n) \log p_n(i_1^n),$$
where $p_n(i_1^n) = \mathbb{P}[(X_1, \dots, X_n) = (i_1, \dots, i_n)]$ is the likelihood of the sequence.

• The entropy rate of X is defined by
$$-\frac{1}{n} \sum_{i_1,\dots,i_n \in E} p_n(i_1^n) \log p_n(i_1^n) \longrightarrow H(X), \qquad n \to +\infty,$$
when this quantity is finite.

• Asymptotic Equipartition Property:
$$-\frac{1}{n} \log p_n(X_1^n) \longrightarrow H(X), \qquad n \to +\infty,$$
weak if the convergence is in probability, strong if it holds almost surely.
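As a small illustration (an addition, not part of the original slides), the sketch below simulates a two-state chain with hypothetical transition probabilities and compares $-(1/n)\log p_n(X_1^n)$ along one trajectory with the entropy rate, as the strong AEP predicts.

```python
import numpy as np

# Hypothetical two-state Markov chain (illustration only).
P = np.array([[0.8, 0.2],
              [0.3, 0.7]])
rng = np.random.default_rng(0)

# Stationary distribution pi: left eigenvector of P for the eigenvalue 1.
w, v = np.linalg.eig(P.T)
pi = np.real(v[:, np.argmin(np.abs(w - 1))])
pi /= pi.sum()

# Shannon entropy rate H(X) = -sum_i pi(i) sum_j P(i,j) log P(i,j).
H = -np.sum(pi[:, None] * P * np.log(P))

# One long trajectory; by the strong AEP, -(1/n) log p_n(X_1^n) -> H(X) a.s.
n = 100_000
x = np.empty(n, dtype=int)
x[0] = rng.choice(2, p=pi)
for t in range(1, n):
    x[t] = rng.choice(2, p=P[x[t - 1]])
log_lik = np.log(pi[x[0]]) + np.log(P[x[:-1], x[1:]]).sum()
print(-log_lik / n, "vs", H)
```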

Generalized entropy functionals

The (h, φ)-entropy of any measure ν on E is defined by
$$S_{h(y),\phi(x)}(\nu) = h\Big[\sum_{i \in E} \phi(\nu(i))\Big]$$
if $\sum_{i \in E} \phi(\nu(i))$ is finite, and as $+\infty$ otherwise. The functions $h : \mathbb{R} \to \mathbb{R}$ and $\phi : [0,1] \to \mathbb{R}_+$ are twice continuously differentiable, with either φ concave and h increasing, or φ convex and h decreasing.

Some (h, φ)-entropies:

• Shannon (1948): $h(y) = y$, $\phi(x) = -x \log x$
• Rényi (1961): $h(y) = (1-s)^{-1} \log y$, $\phi(x) = x^s$
• Varma (1966): $h(y) = [t(t-r)]^{-1} \log y$, $\phi(x) = x^{r/t}$
• Havrda and Charvát (1967): $h(y) = y$, $\phi(x) = (1 - 2^{1-s})^{-1}(x - x^s)$
• Arimoto (1971): $h(y) = (t-1)^{-1}(y^t - 1)$, $\phi(x) = x^{1/t}$
• Sharma and Mittal 1 (1975): $h(y) = (r-1)^{-1}[y^{(r-1)/(s-1)} - 1]$, $\phi(x) = x^s$
• Sharma and Mittal 2 (1975): $h(y) = (r-1)^{-1}[\exp((r-1)y) - 1]$, $\phi(x) = -x \log x$
• Taneja (1975): $h(y) = y$, $\phi(x) = -x^s \log x$
• Sharma and Taneja (1975): $h(y) = y$, $\phi(x) = (t-r)^{-1}(x^r - x^t)$
• Tsallis (1988): $h(y) = (r-1)^{-1}(1 - y)$, $\phi(x) = x^r$
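For concreteness, the following minimal sketch (an addition, with a hypothetical distribution ν and the hypothetical helper name h_phi_entropy) evaluates a few of these (h, φ)-entropies for a finite measure.

```python
import numpy as np

def h_phi_entropy(nu, h, phi):
    """(h, phi)-entropy S_{h,phi}(nu) = h(sum_i phi(nu_i)) of a finite measure nu."""
    return h(np.sum(phi(nu)))

nu = np.array([0.5, 0.3, 0.2])
s = 2.0

# Shannon: h(y) = y, phi(x) = -x log x
shannon = h_phi_entropy(nu, lambda y: y, lambda x: -x * np.log(x))
# Renyi of order s: h(y) = log(y) / (1 - s), phi(x) = x^s
renyi = h_phi_entropy(nu, lambda y: np.log(y) / (1 - s), lambda x: x**s)
# Tsallis of order s: h(y) = (1 - y) / (s - 1), phi(x) = x^s
tsallis = h_phi_entropy(nu, lambda y: (1 - y) / (s - 1), lambda x: x**s)

print(shannon, renyi, tsallis)
```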

• The (h, φ)-entropy rate of a random sequence X = (X_n)_{n∈N} with state space E ⊂ N is defined by
$$\frac{1}{n} S_{h(y),\phi(x)}(p_n) \longrightarrow H_{h,\phi}(X), \qquad n \to +\infty,$$
where $p_n(i_0^n) = \mathbb{P}(X_0 = i_0, \dots, X_{n-1} = i_{n-1})$ is the distribution of $(X_0, \dots, X_{n-1})$.

Quasi-power property

The process X satisfies the quasi-power property with parameters [σ0, λ, c, ρ] if:

1. $\sup_{i_0^n \in E^{n+1}} p_n(i_0^n) \longrightarrow 0$ when $n \to \infty$.

2. $\exists \sigma_0 \in\, ]-\infty, 1]$ such that $\forall s > \sigma_0$ and $\forall n \in \mathbb{N}$, the series
$$\Lambda_n(s) = \sum_{i_0^n \in E^{n+1}} p_n(i_0^n)^s$$
is convergent and satisfies
$$\Lambda_n(s) = c(s) \cdot \lambda(s)^n + R_n(s), \qquad \text{with } |R_n(s)| = O\big(\rho(s)^n \lambda(s)^n\big),$$
where c and λ are strictly positive analytic functions for s > σ0, λ is strictly decreasing with λ(1) = c(1) = 1, R_n is also analytic, and ρ(s) < 1.

Remarks:

The quasi-power property says that Λ_n(s) behaves like the n-th power of some analytic function. In dynamical systems theory, Λ_n(s) is called the Dirichlet series of fundamental measures of depth n + 1.

Classi al entropy rates of a random sequen e satisfying the quasi-power property. Entropy

Parameters

Shannon Rényi

s=1 s 6= 1

Varma

r=t r 6= t

Havrda-Charvat

s>1 s=1

Arimoto

Sharma-Mittal 1

s1 t=1 t 1 and s > 1 r = 1 and s > 1 r = 1 and s = 1 r > 1 and s = 1 r1

Entropy rate

−λ′ (1) −λ′ (1)

1 log λ(s) 1−s 1 − 2 λ′ (1) m 1 log λ(r/t) t(t − r) 0 −1 ′ λ (1) log 2 +∞ +∞ −λ′ (1) 0 +∞ 0 −λ′ (1) 1 log λ(s) 1−s (1 − s)−1[exp(−(s − 1)λ′(1)) − 1] +∞ −λ′ (1) 0 +∞ 0 0 −λ′ (1) 0 +∞ −λ′ (1) 0

For an i.i.d. sequence with common distribution ν

Since $p_n(i_0, i_1, \dots, i_n) = \nu(i_0)\nu(i_1)\cdots\nu(i_n)$, the Dirichlet series $\Lambda_n(s)$ can simply be written
$$\Lambda_n(s) = \Big[\sum_{i \in E} \nu(i)^s\Big]^{n+1}.$$
Hence, X satisfies the quasi-power property for s > 0 with functions λ, c and ρ defined by
$$\lambda(s) = \sum_{i \in E} \nu(i)^s, \qquad c(s) = 1 \quad \text{and} \quad \rho(s) = 0.$$
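This identity can be checked by brute force for a small alphabet; the sketch below (an addition, with a hypothetical distribution ν) enumerates all words of $E^{n+1}$.

```python
import itertools
import numpy as np

# Hypothetical i.i.d. distribution on a 3-letter alphabet (illustration only).
nu = np.array([0.5, 0.3, 0.2])
s, n = 0.7, 4

# Brute-force Dirichlet series: sum of p_n(i_0^n)^s over all words of length n + 1.
Lambda_n = sum(np.prod(nu[list(word)]) ** s
               for word in itertools.product(range(len(nu)), repeat=n + 1))

lam = np.sum(nu ** s)            # lambda(s) for the i.i.d. case
print(Lambda_n, lam ** (n + 1))  # the two values coincide
```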

For a finite chain

$$\Lambda_n(s) = \mathbf{1}^t \cdot P_s^n \cdot \nu_s, \qquad \text{where } P_s = (P(i,j)^s)_{i,j \in E},$$
with ν the initial distribution of the chain, and $\nu_s = (\nu(i)^s)_{i \in E}$. The following relation defines the functions λ, c and ρ of the quasi-power property:
$$P_s^n \cdot v = \lambda(s)^n \cdot \langle v, r_s \rangle\, l_s + R_n(s) \cdot v,$$
where λ(s) is the unique dominant eigenvalue of P_s with maximum modulus, with associated left and right eigenvectors l_s and r_s.
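A numerical sketch (an addition, with a hypothetical 3-state matrix): λ(s) can be computed as the Perron eigenvalue of the element-wise power P_s, from which, for instance, the Rényi rate follows.

```python
import numpy as np

# Hypothetical 3-state transition matrix (illustration only).
P = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.3, 0.3, 0.4]])

def lam(s):
    """lambda(s): dominant (Perron) eigenvalue of P_s = (P(i,j)^s)."""
    return np.max(np.real(np.linalg.eigvals(P ** s)))

print(lam(1.0))                  # equals 1 since P is stochastic
s = 2.0
print(np.log(lam(s)) / (1 - s))  # Renyi entropy rate (1-s)^{-1} log lambda(s)
```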

For a denumerable chain

Theorem (Ciuperca, Girardin, Lhote, 2010)

Let X = (X_n) be an ergodic Markov chain with transition matrix P and initial distribution ν. Suppose that:

A. $\sup_{(i,j) \in E^2} P(i,j) < 1$;

B. $\exists \sigma_0 < 1$ such that $\forall s > \sigma_0$,
$$\sup_{i \in E} \sum_{j \in E} P(i,j)^s < +\infty \quad \text{and} \quad \sum_{i \in E} \nu(i)^s < +\infty;$$

C. $\forall \varepsilon > 0$ and $\forall s > \sigma_0$, $\exists A \subset E$ with $|A| < +\infty$ such that
$$\sup_{i \in E} \sum_{j \in E \setminus A} P(i,j)^s < \varepsilon.$$

Then X satisfies the quasi-power property.
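As a hedged illustration (hypothetical chain, not from the slides), conditions A-C can be checked numerically for a denumerable chain with geometric rows $P(i,j) = (1-a_i)a_i^j$, where $a_i$ stays in a compact subset of (0, 1); the truncation levels below are arbitrary.

```python
import numpy as np

# Hypothetical denumerable chain on E = {0, 1, 2, ...} with geometric rows:
# P(i, j) = (1 - a_i) * a_i**j, where a_i stays in [0.3, 0.7] (illustration only).
def a(i):
    return 0.3 + 0.4 * np.sin(i) ** 2

def P(i, j):
    return (1 - a(i)) * a(i) ** j

s = 0.5
I = range(200)   # states used to approximate the suprema over i

# Condition A: sup_{i,j} P(i, j) = sup_i (1 - a_i) <= 0.7 < 1.
print(max(P(i, 0) for i in I))
# Condition B: sup_i sum_j P(i, j)^s = sup_i (1 - a_i)^s / (1 - a_i^s) < infinity.
print(max((1 - a(i)) ** s / (1 - a(i) ** s) for i in I))
# Condition C: the tail sum over j >= J equals (1 - a_i)^s a_i^{sJ} / (1 - a_i^s),
# which is uniformly small in i once J is large.
J = 60
print(max((1 - a(i)) ** s * a(i) ** (s * J) / (1 - a(i) ** s) for i in I))
```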

Proof of the theorem

Lemma. If Assumptions A, B, C hold true, then $P_s : (\ell^1, \|\cdot\|_1) \to (\ell^1, \|\cdot\|_1)$ is a compact operator for all $s > \sigma_0$, where $\ell^1 = \{u = (u_i)_{i \in E} : \|u\|_1 = \sum_{i \in E} |u_i| < \infty\}$.

We deduce from the lemma that the spectrum of P_s is a sequence that converges to zero. Hence, P_s has a finite number of eigenvalues with maximum modulus and there exists a spectral gap separating these dominant eigenvalues from the remainder of the spectrum. Further, since X is ergodic, P_s has a unique dominant eigenvalue λ(s) which, moreover, is positive. Hence,
$$P_s^n u = \lambda(s)^n Q_s u + R_s^n u, \qquad u \in \ell^1,$$
where Q_s is the projector onto the dominant eigenspace and R_s is the part of P_s corresponding to the remainder of the spectrum. The spectral radius of R_s can be written ρ(s)·λ(s) with ρ(s) < 1. Finally,
$$\Lambda_n(s) = \lambda(s)^n \|Q_s \nu_s\|_1 \big(1 + O(\rho(s)^n \lambda(s)^n)\big),$$
which means that X satisfies the quasi-power property. The analyticity of the involved functions is due jointly to the analyticity of $s \mapsto P_s$ and to perturbation arguments.

Theorem. Let X be a random sequence satisfying the quasi-power property with parameters [σ0, λ, c, ρ]. Suppose that
$$\phi(x) \underset{x \to 0}{\sim} c_1 \cdot x^s \cdot (\log x)^k \qquad (P)$$
with $s > \sigma_0$, $c_1 \in \mathbb{R}_+^*$ and $k \in \mathbb{N}^*$. Then the entropy rate $H_{h,\phi}(X)$ is given by the following table.

Value s = 1 (condition on h as x → +∞):
• $h(x) \sim c_2 \cdot x^{1/k}$: entropy rate $c_2 \cdot c_1^{1/k} \cdot |\lambda'(1)|$
• $h(x) = o(x^{1/k})$: entropy rate 0
• $x^{1/k} = o(h(x))$: entropy rate +∞

Value s > 1 (condition on h as x → 0+):
• $h(x) \sim c_2 \cdot \log x$: entropy rate $c_2 \cdot \log \lambda(s)$
• $h(x) = o(\log x)$: entropy rate 0
• $\log x = o(h(x))$: entropy rate +∞

Value σ0 < s < 1 (condition on h as x → +∞):
• $h(x) \sim c_2 \cdot \log x$: entropy rate $c_2 \cdot \log \lambda(s)$
• $h(x) = o(\log x)$: entropy rate 0
• $\log x = o(h(x))$: entropy rate +∞

Proof. $\sup_{i_0^n \in E^{n+1}} \nu_n(i_0^n) \to 0$ and (P) together induce that $\forall \epsilon > 0$, $\exists n_0 \in \mathbb{N}$ such that for $n \geq n_0$ and $i_0^n \in E^{n+1}$,
$$(1-\epsilon)\, c_1 \nu_n(i_0^n)^s \log^k \nu_n(i_0^n) \leq \phi(\nu_n(i_0^n)) \leq (1+\epsilon)\, c_1 \nu_n(i_0^n)^s \log^k \nu_n(i_0^n),$$
from which it follows that
$$(1-\epsilon)\, c_1 \Lambda_n^{(k)}(s) \leq \sum_{i_0^n \in E^{n+1}} \phi(\nu_n(i_0^n)) \leq (1+\epsilon)\, c_1 \Lambda_n^{(k)}(s).$$
Due to the analyticity of all involved functions,
$$\Lambda_n^{(k)}(s) = c(s) \cdot \lambda'(s)^k \cdot n^k \cdot \lambda(s)^{n-k} \cdot [1 + O(1/n)],$$
which yields
$$\Sigma_n := \sum_{i_0^n \in E^{n+1}} \phi(\nu_n(i_0^n)) \sim c_1 \cdot c(s) \cdot \lambda'(s)^k \cdot n^k \cdot \lambda(s)^{n-k}.$$
Since φ is nonnegative, for s = 1 this sum converges polynomially to infinity. This leads to the next equivalences:
$$h(\Sigma_n) \sim c_2 \cdot |c_1|^{1/k} \cdot |\lambda'(1)| \cdot n \quad \text{if } h(x) \sim c_2 \cdot x^{1/k},$$
$$h(\Sigma_n) = o(n) \quad \text{if } h(x) = o(x^{1/k}),$$
$$h(\Sigma_n) \sim s_n \cdot n \text{ with } s_n \to \infty \quad \text{if } x^{1/k} = o(h(x)).$$
Since, by definition, the (h, φ)-entropy rate is the limit of $h(\Sigma_n)/n$ when n tends to infinity, the results follow immediately for s = 1. The other cases can be studied similarly. □


The values of the classical entropy rates of a random sequence satisfying the quasi-power property with parameters [σ0, λ, c, ρ] follow from this theorem; they are the ones collected in the table given above.

Estimation of Shannon entropy rate for a finite Markov chain

For an ergodic Markov chain X = (X_n)_{n∈N} with state space E with s states, transition matrix P = (P(i,j)), where $P(i,j) = \mathbb{P}(X_{n+1} = j \mid X_n = i)$, and stationary distribution π such that πP = π, the entropy rate is
$$H(X) = -\sum_{i \in E} \pi(i) \sum_{j \in E} P(i,j) \log P(i,j) = h(P) \qquad (= -\lambda'(1)).$$
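A short numerical check (an addition, with a hypothetical matrix): the direct formula h(P) and the eigenvalue characterization $-\lambda'(1)$ agree.

```python
import numpy as np

# Hypothetical ergodic transition matrix (illustration only).
P = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.3, 0.3, 0.4]])

# Stationary distribution: normalized left eigenvector of P for the eigenvalue 1.
w, v = np.linalg.eig(P.T)
pi = np.real(v[:, np.argmin(np.abs(w - 1))])
pi /= pi.sum()

# Direct formula: H(X) = -sum_i pi(i) sum_j P(i,j) log P(i,j).
H_direct = -np.sum(pi[:, None] * P * np.log(P))

# Same rate as -lambda'(1), with lambda(s) the dominant eigenvalue of (P(i,j)^s).
def lam(s):
    return np.max(np.real(np.linalg.eigvals(P ** s)))
eps = 1e-6
H_eig = -(lam(1 + eps) - lam(1 - eps)) / (2 * eps)   # central difference for lambda'(1)

print(H_direct, H_eig)  # the two values agree up to numerical error
```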

Proposition (Anderson and Goodman, 1957)

The empirical estimators
$$\widehat{P}_n(i,j) = \frac{\sum_{m=1}^n \mathbf{1}_{\{X_{m-1}=i,\, X_m=j\}}}{\sum_{j \in E} \sum_{m=1}^n \mathbf{1}_{\{X_{m-1}=i,\, X_m=j\}}}$$
are strongly convergent and asymptotically normal:
$$\sqrt{n}\,\big(\widehat{P}_n(i,j) - P(i,j)\big) \xrightarrow{\;\mathcal{L}\;} \mathcal{N}_{s^2}(0, \Gamma^2),$$
where $\Gamma^2_{(i,j),(k,l)} = \delta_{ik}\,[\delta_{jl} P(i,j) - P(i,j)P(i,l)]/\pi(i)$.

• We define the plug-in estimator $\widehat{H}_n = h(\widehat{P}_n)$ of the entropy rate.
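The sketch below (an addition; helper names such as empirical_transition_matrix are hypothetical) builds the Anderson-Goodman estimator from one simulated trajectory and plugs it into h.

```python
import numpy as np

def empirical_transition_matrix(x, n_states):
    """Anderson-Goodman empirical estimator of the transition matrix."""
    counts = np.zeros((n_states, n_states))
    for a, b in zip(x[:-1], x[1:]):
        counts[a, b] += 1
    return counts / counts.sum(axis=1, keepdims=True)

def shannon_rate(P):
    """Plug-in entropy rate h(P) = -sum_i pi(i) sum_j P(i,j) log P(i,j)."""
    w, v = np.linalg.eig(P.T)
    pi = np.real(v[:, np.argmin(np.abs(w - 1))])
    pi /= pi.sum()
    with np.errstate(divide="ignore", invalid="ignore"):
        plogp = np.where(P > 0, P * np.log(P), 0.0)
    return -np.sum(pi[:, None] * plogp)

# Hypothetical chain and one simulated trajectory (illustration only).
P = np.array([[0.8, 0.2], [0.3, 0.7]])
rng = np.random.default_rng(1)
x = [0]
for _ in range(50_000):
    x.append(rng.choice(2, p=P[x[-1]]))
P_hat = empirical_transition_matrix(np.array(x), 2)
print(shannon_rate(P_hat), "vs true rate", shannon_rate(P))
```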

Theorem (Ciuperca and Girardin, 2007)

If the transition probabilities are not uniform, the plug-in estimator $\widehat{H}_n = h(\widehat{P}_n)$ of H(X) is strongly convergent and asymptotically normal. Precisely,
$$\sqrt{n}\,[\widehat{h}_n - H(X)] \xrightarrow{\;\mathcal{L}\;} \mathcal{N}\big(0,\ (\partial_{ij} h)\, \Gamma\, (\partial_{ij} h)'\big),$$
where $\partial_u^v h$ is the differential of order v of h with respect to variable u.

Proof. Continuous mapping theorem and delta method. □

For a two-state chain

The transition matrix of the chain is
$$P = \begin{pmatrix} 1-p & p \\ q & 1-q \end{pmatrix}.$$
The stationary distribution satisfies πP = π, so
$$\pi(0) = \frac{q}{p+q} \quad \text{and} \quad \pi(1) = \frac{p}{p+q}.$$
The entropy rate is
$$H(X) = h(p,q) = \pi(0)\, S_p + \pi(1)\, S_q = \frac{q}{p+q}\,[-p \log p - (1-p)\log(1-p)] + \frac{p}{p+q}\,[-q \log q - (1-q)\log(1-q)].$$

[Figure: entropy rate of a 2-state Markov chain as a function of p and q.]

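A one-function sketch (an addition) of the closed-form rate h(p, q):

```python
import numpy as np

def two_state_rate(p, q):
    """Entropy rate h(p, q) of the chain with transition matrix [[1-p, p], [q, 1-q]]."""
    def S(a):  # Shannon entropy of a Bernoulli(a) distribution (in nats)
        return -a * np.log(a) - (1 - a) * np.log(1 - a)
    return (q * S(p) + p * S(q)) / (p + q)

print(two_state_rate(0.2, 0.3))  # entropy rate in nats for one hypothetical (p, q)
```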

Theorem (Girardin and Sesboüé, 2009)

$\widehat{h}_n = h(\widehat{p}_n, \widehat{q}_n) \xrightarrow{\text{a.s.}} H(X)$.

If the chain is not uniform,
$$\sqrt{n}\,[\widehat{h}_n - H(X)] \xrightarrow{\;\mathcal{L}\;} \mathcal{N}(0, \sigma^2),$$
where
$$\sigma^2 = \Gamma(0,0)^2\,[\partial_1^1 h(p,q)]^2 + \Gamma(1,1)^2\,[\partial_2^1 h(p,q)]^2 = \frac{pq(1-p)}{p+q}\Big[\frac{S_q - S_p}{p+q} - \log\frac{p}{1-p}\Big]^2 + \frac{pq(1-q)}{p+q}\Big[\frac{S_p - S_q}{p+q} - \log\frac{q}{1-q}\Big]^2.$$

For illustration, we have simulated a chain with p = 0.2 and q = 0.3, for which H(X) = 0.559. The first figure shows the pointwise convergence of $\widehat{h}_n$ to H(X) for n = 10 to 5000 by steps of 10 (computation of $\widehat{h}_n$ for 10 ≤ n ≤ 5000 after simulation of one trajectory with length 5000).

[Figure: $\widehat{h}_n$ (Hn) versus n for 0 ≤ n ≤ 5000, converging to H = 0.559.]
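A minimal sketch (an addition) reproducing the spirit of this experiment, with the plug-in estimate evaluated along one simulated trajectory:

```python
import numpy as np

rng = np.random.default_rng(2)
p, q = 0.2, 0.3
P = np.array([[1 - p, p], [q, 1 - q]])

# One trajectory of length 5000, as in the experiment described above.
x = np.empty(5000, dtype=int)
x[0] = 0
for t in range(1, len(x)):
    x[t] = rng.choice(2, p=P[x[t - 1]])

def plug_in_rate(x):
    """Plug-in estimate h(p_hat, q_hat) from the observed transition frequencies."""
    n01 = np.sum((x[:-1] == 0) & (x[1:] == 1)); n0 = np.sum(x[:-1] == 0)
    n10 = np.sum((x[:-1] == 1) & (x[1:] == 0)); n1 = np.sum(x[:-1] == 1)
    p_hat, q_hat = n01 / n0, n10 / n1
    S = lambda a: -a * np.log(a) - (1 - a) * np.log(1 - a)
    return (q_hat * S(p_hat) + p_hat * S(q_hat)) / (p_hat + q_hat)

for n in (100, 1000, 5000):
    print(n, plug_in_rate(x[:n]))
```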

This figure compares the empirical distribution function of $\sqrt{n}\,[\widehat{h}_n - H(X)]/\widehat{\sigma}_n$ to that of the standard normal distribution for different values of 10 ≤ n ≤ 1000 (T = 500 trajectories simulated for each n).

Theorem (Girardin and Sesboüé, 2009)

For a uniform chain, p = q = 1/2, $\widehat{h}_n$ is strongly convergent and $2n[H(X) - \widehat{h}_n] \xrightarrow{\;\mathcal{L}\;} \chi^2(2)$.

Proof.
$$\widehat{h}_n - H(X) = [\partial_1^1 h(p,q)][\widehat{P}(0,1) - p] + [\partial_2^1 h(p,q)][\widehat{P}(1,0) - q] + \tfrac{1}{2}[\partial_1^2 h(p,q)][\widehat{P}(0,1) - p]^2 + \tfrac{1}{2}[\partial_2^2 h(p,q)][\widehat{P}(1,0) - q]^2 + o([\widehat{P}(0,1) - p]^2) + o([\widehat{P}(1,0) - q]^2).$$
For p = q = 1/2 the first-order terms vanish, so that
$$\widehat{h}_n - H(X) = -\frac{1}{2\Gamma(0,0)^2}[\widehat{P}(0,1) - p]^2 - \frac{1}{2\Gamma(1,1)^2}[\widehat{P}(1,0) - q]^2 + o([\widehat{P}(0,1) - p]^2) + o([\widehat{P}(1,0) - q]^2),$$
and the result follows, since $\sqrt{n}\,[\widehat{P}(0,1) - p]/\Gamma(0,0)$ and $\sqrt{n}\,[\widehat{P}(1,0) - q]/\Gamma(1,1)$ are asymptotically standard normal. □



Fχ2(2)(x)

0.0

0.2

0.4

Fn(x)

0.6

0.8

1.0

The last gure ompares the distribution fun tion of 2n[b hn − H(X)] to that of the χ2(2)-distribution for n = 1000. (T = 1000 simulated traje tories for n)

0

2

4

6 x p=q=1/2

8

10
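A Monte Carlo sketch (an addition) of the χ²(2) limit for the uniform chain; since p = q = 1/2 makes the chain an i.i.d. fair-coin sequence, trajectories can be simulated directly:

```python
import numpy as np

rng = np.random.default_rng(3)
n, T = 1000, 1000
H = np.log(2)   # entropy rate (in nats) of the uniform two-state chain, p = q = 1/2

def plug_in_rate(x):
    n01 = np.sum((x[:-1] == 0) & (x[1:] == 1)); n0 = np.sum(x[:-1] == 0)
    n10 = np.sum((x[:-1] == 1) & (x[1:] == 0)); n1 = np.sum(x[:-1] == 1)
    p_hat, q_hat = n01 / n0, n10 / n1
    S = lambda a: -a * np.log(a) - (1 - a) * np.log(1 - a)
    return (q_hat * S(p_hat) + p_hat * S(q_hat)) / (p_hat + q_hat)

stats = []
for _ in range(T):
    x = rng.integers(0, 2, size=n)       # uniform chain: i.i.d. fair coin flips
    stats.append(2 * n * (H - plug_in_rate(x)))

# Compare a few empirical quantiles with those of the chi-square(2) distribution.
print(np.quantile(stats, [0.5, 0.9, 0.95]))   # chi2(2) quantiles: about 1.39, 4.61, 5.99
```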

Estimation of generalized entropy rates

All the entropy rates are finite and non-zero only at a threshold where they equal the Rényi entropy rate up to a multiplicative factor. Therefore, we only estimate the Shannon and Rényi entropy rates, that is
$$h(\theta_0) = -\lambda'(1; \theta_0) \quad \text{and} \quad h_s(\theta_0) = (1-s)^{-1} \log \lambda(s; \theta_0).$$

The transition probabilities of the ergodic chain X with denumerable state space are supposed to depend on $\theta \in \Theta \subset \mathbb{R}^r$, with true value θ_0.

Proposition (Billingsley, 1962). Suppose that:

A. ∀x, the set {y : P(x, y; θ) > 0} does not depend on θ.

B. ∀(x, y), $P_u(x,y;\theta)$, $P_{uv}(x,y;\theta)$ and $P_{uvw}(x,y;\theta)$ are in $C^1(\Theta)$.

C. ∀θ ∈ Θ, there exists a neighborhood N of θ such that ∀u, v, $P_u(x,y;\theta)$ and $P_{uv}(x,y;\theta)$ are uniformly bounded in $L^1(\mu(dy))$ on N and
$$\mathbb{E}_\theta\Big[\sup_{\theta' \in N} |P_u(x, y; \theta')|^2\Big] < +\infty, \qquad \forall u = 1, \dots, r.$$

D. ∃δ > 0 such that $\mathbb{E}_\theta[|P_u(x,y;\theta)|^{2+\delta}]$ is finite.

E. The Fisher information matrix $\sigma(\theta) = \big(\mathbb{E}_\theta[P_u(x,y;\theta)\, P_v(x,y;\theta)]\big)_{u,v}$ is non-singular.

Then a strongly consistent maximum likelihood estimator $\widehat{\theta}_n$ of θ exists. Moreover, $\sqrt{n}\,(\widehat{\theta}_u - \theta_u)$ is asymptotically normal, with covariance matrix $\sigma^{-1}(\theta_0)$.

It is natural to consider the plug-in estimators
$$h(\widehat{\theta}_n) = -\lambda'(1; \widehat{\theta}_n) \quad \text{and} \quad h_s(\widehat{\theta}_n) = (1-s)^{-1} \log \lambda(s; \widehat{\theta}_n)$$
of the Shannon entropy rate and of the Rényi entropy rate.

Theorem. If Billingsley's assumptions are satisfied and if X satisfies the quasi-power property, then $h(\widehat{\theta}_n)$ and $h_s(\widehat{\theta}_n)$ are strongly consistent and asymptotically normal: $\sqrt{n}\,[h(\widehat{\theta}_n) - h(\theta_0)] \to \mathcal{N}(0, \Sigma_1)$, where
$$\Sigma_1 = \Big(\frac{\partial}{\partial\theta}[-\lambda'(1;\theta)]\Big)^t \sigma^{-1}(\theta)\, \Big(\frac{\partial}{\partial\theta}[-\lambda'(1;\theta)]\Big),$$
and $\sqrt{n}\,[h_s(\widehat{\theta}_n) - h_s(\theta_0)] \to \mathcal{N}(0, \Sigma_s)$, where
$$\Sigma_s = \frac{1}{(1-s)^2}\Big(\frac{\partial}{\partial\theta}\lambda(s;\theta)\Big)^t \sigma^{-1}(\theta)\, \Big(\frac{\partial}{\partial\theta}\lambda(s;\theta)\Big).$$

Proof

By perturbation properties of operators, the eigenvalue λ(s) and its derivative λ′(1) are continuous with respect to the perturbed operator P_s. For a parametric chain depending on θ, Assumption B implies that P_s is a continuously differentiable function of θ. Therefore both λ(s; θ) and λ′(s; θ) are continuous with respect to θ. The results follow from the continuous mapping theorem and the delta method. □

G. Ciuperca & V. Girardin (2007). Estimation of the Entropy Rate of a Countable Markov Chain. Communications in Statistics: Theory and Methods, Vol. 36, pp. 2543-2557.

V. Girardin, with A. Sesboüé (2006). Asymptotic study of an estimator of the entropy rate of a two-state Markov chain for one long trajectory. In Bayesian Inference and Maximum Entropy Methods in Science and Engineering, Ed. A. Mohammad-Djafari, AIP Conference Proceedings, Vol. 872, pp. 403-410.

V. Girardin & A. Sesboüé (2009). Comparative Construction of Plug-in Estimators of the Entropy Rate of Two-State Markov Chains. Methodology and Computing in Applied Probability, Vol. 11, pp. 181-200.

G. Ciuperca, V. Girardin & L. Lhote (2010). Computation of Generalized Entropy Rates. Application and Estimation for Countable Markov Chains. Rapport de recherche, Université de Caen, 21 pages; to appear in IEEE Transactions on Information Theory.