“The fourth leg of physics” (Ariel Caticha)

FOUNDATIONS OF COMPUTATIONAL INFERENCE
Nested Sampling and Galilean Monte Carlo

MaxEnt 2010, Chamonix
John Skilling

([email protected])

Maximum Entropy Data Consultants Ltd


Foundations of Computational Inference

PLAN:

  Duration:     50 minutes
  Content:      15 minutes
  # ideas:      2½
  Tone:         Gravitas, Perspective
  Best before:  31 December 1985 (the “Old Fart” talk)

How SHOULD we do Bayesian inference? Start with basic probability and information theory. Seek generality, including large size.

Measure m(x).

Associative symmetry (a ∨ b) ∨ c = a ∨ (b ∨ c) implies m(x ∨ y) = m(x) + m(y), or a function of m.  → Sum Rule!

Probability p(x | t) is a measure on x.

Associativity of implication a ⇒ b ⇒ c ⇒ d implies f( p(x | z) ) = f( p(x | y) ) + f( p(y | z) ).

Consistency with the sum rule implies f = log, so p(x | z) = p(x | y) p(y | z).  → Product Rule! Hence Bayes.
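A one-line numeric check, with made-up probabilities, that the choice f = log turns the additive chaining relation into the familiar product rule:

from math import log, isclose

# Made-up conditional probabilities along a chain x <= y <= z:
p_x_given_y = 0.3                          # p(x | y)
p_y_given_z = 0.5                          # p(y | z)
p_x_given_z = p_x_given_y * p_y_given_z    # product rule: p(x | z) = p(x | y) p(y | z)

# With f = log, the chaining relation f(p(x|z)) = f(p(x|y)) + f(p(y|z)) is exact.
assert isclose(log(p_x_given_z), log(p_x_given_y) + log(p_y_given_z))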

Inference:

  INPUTS: likelihood L(x), prior π(x) dx.    OUTPUTS: evidence Z, posterior P(x) dx.

  L(x) π(x) dx = Z P(x) dx        Likelihood × Prior = Evidence × Posterior

Information: separation of distributions p and q?

  Distances(between p and q) ⊆ { Divergences(to p from q) }

  Divergence(to p from q) = minimum, induced by constraints on p, of the variational potential H(p ; q).

“Eliminative induction”
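A minimal discrete sketch of Likelihood × Prior = Evidence × Posterior, with invented numbers:

# Discrete toy: three hypotheses x with a made-up prior and likelihood.
prior      = [0.5, 0.3, 0.2]          # pi(x)
likelihood = [0.1, 0.4, 0.7]          # L(x)

Z = sum(L * p for L, p in zip(likelihood, prior))            # evidence
posterior = [L * p / Z for L, p in zip(likelihood, prior)]   # P(x)

# L(x) * pi(x) = Z * P(x) holds term by term, and the posterior sums to 1.
assert abs(sum(posterior) - 1.0) < 1e-12
print(Z, posterior)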

Independence symmetry (the direct product of lattices is associative):

  Problem 1 with p1(x1) ← q1(x1), together with Problem 2 with p2(x2) ← q2(x2),
  = joint problem on (x1, x2) with p1(x1) p2(x2) ← q1(x1) q2(x2).

Hence the information

  H(p ; q) = ∫ log( p(x) / q(x) ) p(x) dx   ( = −entropy S )

is the unique divergence to p from q. H is non-commutative, H(p ; q) ≠ H(q ; p), so attempts to define a commutative distance d(p , q) = d(q , p) break the symmetry of independence. The only meaningful divergence requires the source q to support the destination p (q = 0 forces p = 0): “p from q” is a compression.
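A quick numeric illustration, with invented distributions, of the divergence and its non-commutativity; the support condition shows up as a division by zero when q = 0 where p > 0.

from math import log

def H(p, q):
    """Divergence to p from q: sum of p * log(p/q) over the support of p."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]
q = [0.3, 0.3, 0.4]

print(H(p, q), H(q, p))   # generally different: H is non-commutative

# If q fails to support p (q = 0 where p > 0), the divergence blows up:
# H([0.5, 0.5], [1.0, 0.0]) raises ZeroDivisionError from p/0 inside the log.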


Prior supports posterior (L < ∞). Posterior need not support prior (L = 0). Inference is one-way, prior-to-posterior, compressive. The compression factor e^{H(P ; π)} is enormous (H ~ thousands). Need e^{H(p ; q)} samples from q to get 1 from p (compression cost = e^H). So compression must be iterative: shrink by a factor γ = e^{H} at each step, where H is now the information of that step alone.

  prior  --shrink γ-->  ...  --shrink γ-->  ...  --shrink γ-->  posterior

Benefit = ∏ γ = e^{H(P ; π)} as required. Cost = Σ γ. Minimum cost needs balanced compressions, each γ not ≫ O(1).
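A back-of-envelope sketch (H invented for illustration) of why one-shot compression is hopeless while balanced steps are cheap:

import math

H = 1000.0            # made-up information in nats (typically thousands)

# One-shot: need about e^H prior samples to land one posterior sample.
print("direct cost ~ e^H, i.e. about 10^%.0f samples" % (H / math.log(10)))

# Iterative: shrink by a balanced factor gamma = e per step.
gamma = math.e
steps = H / math.log(gamma)          # gamma^steps = e^H  =>  steps = H / ln(gamma)
print("steps =", steps, " total cost ~", steps * gamma, "samples")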


Seek balanced compressions. The prior supports the entire current context, so the intermediate distributions k are of the form

  f_k(x) π(x) dx.

The only function (aka Radon-Nikodym derivative) we have is L(x), so coordinate invariance implies the form

  f_k( L(x) ) π(x) dx

(e.g. simulated annealing takes f_k(L) = L^{β_k}).

[Figure: successive modulating functions f_k and f_{k+1} plotted against L.]

What should f be?


Progressively weight towards higher L. What should the modulating f be?

Practical difficulty: if f varies with L, then f_k may have all its mass in a broad slab while f_{k+1}, although very close to f_k, has all its mass in a spike (exponentially high and thin): a first-order phase change. Slab and spike don't communicate, so exploration won't detect the transition.

[Figure: a likelihood L(x) with a broad slab and a narrow spike, and the modulating functions f_k, f_{k+1} of L.]

Solution: f = constant or 0.
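A tiny numeric sketch, with all numbers invented, of why an L-dependent f such as annealing's f(L) = L^β can fail here:

# Toy "spike and slab" likelihood:
#   slab : prior mass 1,      L = 1
#   spike: prior mass 1e-30,  L = 1e40   (exponentially high and thin)
# Anneal with f(L) = L**beta and watch the mass jump discontinuously.

def spike_fraction(beta):
    slab  = 1.0 * 1.0 ** beta
    spike = 1e-30 * (1e40) ** beta
    return spike / (slab + spike)

for beta in (0.70, 0.74, 0.76, 0.80):
    print(beta, spike_fraction(beta))

# Around beta ~ 0.75 nearly all the mass hops from slab to spike even though
# f_{k+1} is very close to f_k: a first-order phase change that a local
# explorer sitting in the slab will not detect.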


Compression needs to be progressive lower bounds on likelihood L.


Navigate this within an iterate:

  1{ L(x) > constant }

Do not try to navigate this, which can be exponentially harder:

Raw L(x)
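A minimal sketch of the point (names here are hypothetical): within one iterate the explorer only ever asks a yes/no question about L, never its magnitude.

def inside(x, log_L, log_L_star):
    """Is x inside the current contour? The only query exploration needs."""
    return log_L(x) > log_L_star

# The prior is flat inside the contour, so moves are judged solely by this
# binary test; the magnitude of raw L(x) is never needed to navigate it.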



[Diagram: repeated shrink-by-γ steps, each a nested likelihood constraint.]

At each step, have n particles within the constraint. Select r survivors by drawing the new constraint through particle #(r+1) (r = n−1 is most cost-effective). Repopulate with n−r new particles.

[Figure: n = 3 particles → constrain within L(#3), r = 2 survivors → re-populate to n = 3.]

The compression ratio is γ ≈ (n−1)/n, from Pr(γ) = n γ^(n−1): balanced, and known by construction.

This method is nested sampling.
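A self-contained Python sketch of that loop on an invented toy problem (unit-square prior, narrow Gaussian likelihood); the constrained explorer is a plain random walk cloned from a survivor, standing in for whatever sampler one actually prefers. This is illustrative only, not a reference implementation.

import math, random

# Toy problem: unit-square prior, narrow Gaussian likelihood centred at (0.5, 0.5).
sigma2 = 0.01                                   # variance of the Gaussian bump

def log_L(x):
    r2 = (x[0] - 0.5) ** 2 + (x[1] - 0.5) ** 2
    return -r2 / (2 * sigma2)

def sample_prior():
    return [random.random(), random.random()]

def sample_within(logL_star, seed, steps=40, scale=0.05):
    """Clone a surviving particle and random-walk it within the constraint L(x) > L*."""
    x = list(seed)
    for _ in range(steps):
        y = [min(1.0, max(0.0, xi + random.gauss(0.0, scale))) for xi in x]
        if log_L(y) > logL_star:                # prior is flat inside the contour
            x = y
    return x

n = 50                                          # live particles
live = [sample_prior() for _ in range(n)]
Z, X = 0.0, 1.0                                 # running evidence and enclosed prior mass

for k in range(500):
    worst = min(live, key=log_L)
    logL_star = log_L(worst)                    # new constraint through the worst particle
    X_new = X * (n - 1) / n                     # compress by gamma ~ (n-1)/n
    Z += math.exp(logL_star) * (X - X_new)      # weight w = L * dX
    X = X_new
    live.remove(worst)                          # keep r = n-1 survivors
    live.append(sample_within(logL_star, random.choice(live)))   # repopulate to n

Z += sum(math.exp(log_L(x)) for x in live) * X / n   # remaining live-point contribution
print("Z ~", Z, "(analytic ~ 2*pi*sigma2 =", 2 * math.pi * sigma2, ")")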


Robust Bayesian computation SHOULD be done by nested sampling.

Contour L* encloses prior mass X:

  X(L*) = ∫_{L(x) > L*} π(x) dx.

[Figure: nested L contours in x-space.]

Start with the complete prior of mass X0 = 1, enclosed by L0 = 0, and compress:

  X1 = γ1,  X2 = γ1 γ2,  X3 = γ1 γ2 γ3, ...
  L1 = L(x1),  L2 = L(x2),  L3 = L(x3), ...

where xk is the particle location, Lk its computed likelihood, and Xk its estimated enclosed prior mass.

This tabulates the relationship L(X), with known uncertainty.

  Evidence  Z = ∫₀¹ L dX ≈ Σ L ΔX        (weights w = L ΔX)

[Figure: the area under the L(X) curve from X = 0 to 1 is Z.]

  Posterior  P(x) ≈ { xk with weight wk / Z }.

Get log Z ± √(H/n).

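Given the (Lk, Xk) table from a run, the evidence, posterior weights, information H, and the √(H/n) error bar all follow directly. The sketch below uses the analytic L(X) of the Gaussian toy above as a stand-in for recorded values; the numbers are illustrative.

import math

# Post-processing a nested-sampling run: the analytic L(X) of the Gaussian
# toy stands in for a recorded (L_k, X_k) table.
n = 50
c = 2 * math.pi * 0.01                                     # toy's analytic evidence, 2*pi*sigma^2

X = [((n - 1) / n) ** k for k in range(1, 1500)]           # X_k = gamma_1 ... gamma_k
L = [math.exp(-x / c) for x in X]                          # the toy's L(X) curve
X_prev = [1.0] + X[:-1]

w = [Lk * (Xp - Xk) for Lk, Xp, Xk in zip(L, X_prev, X)]   # w_k = L_k * dX_k
Z = sum(w)                                                 # Z = integral of L dX
H = sum((wk / Z) * math.log(Lk / Z) for wk, Lk in zip(w, L))   # information H(P ; pi)

print("Z     ~", Z, "(analytic ~", c, ")")
print("log Z =", math.log(Z), "+/-", math.sqrt(H / n))     # error bar ~ sqrt(H/n)
print("posterior weights w_k/Z sum to", sum(wk / Z for wk in w))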

Nested sampling needs to generate each new particle from the constrained prior, generally by MCMC from a clone of an existing survivor.

[Figure: a clone of an existing particle is evolved to a new, independent position inside the constraint.]

The computational substrate should follow the prior assignment (else huge density changes), so choose coordinates in which the prior is flat. Never make the computer do in arithmetic what you can do in algebra. Seek systematic motion (not random), best done with specular reflection.
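One common way to get a flat substrate (an assumption of this sketch, not prescribed by the slide) is to work in the unit cube of cumulative-probability coordinates u and map to physical parameters through each prior's inverse CDF; the explorer then sees a uniform prior everywhere.

from statistics import NormalDist

# Work in u in [0,1]^d where the prior is flat; map to physical x via inverse CDFs.
# Example priors (invented): x0 ~ Normal(0, 2), x1 ~ Uniform(-5, 5).
def prior_transform(u):
    x0 = NormalDist(mu=0.0, sigma=2.0).inv_cdf(u[0])
    x1 = -5.0 + 10.0 * u[1]
    return [x0, x1]

# The likelihood is then evaluated as log_L(prior_transform(u)), and all
# exploration (random walk, reflections, ...) happens in the flat u-cube,
# where a step of a given size means the same thing everywhere.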



Seek systematic motion (not random), best done with specular reflection. Give the particle a random initial velocity v, step to x+v, and try to keep going: Galilean Monte Carlo (GMC). Flat exploration ought to be easy! But steps must be finite (we can't locate the boundary exactly), and never let the particle escape: it won't come back (volume!).

  Start with (x1, v) where L(x1) is OK.
  x2 = x1 + v
  if L(x2) is OK:  proceed with (x2, v)
  else:
      n  = unit normal at x2
      v' = v − 2 n (n·v)
      x3 = x2 + v'
      if L(x3) is OK:  reflect to (x3, v')
      else:            reverse to (x1, −v)

[Figure: a trajectory from x1 that either proceeds to x2, reflects off the boundary to x3, or reverses.]

If the gradient ∇L is available, use n ∝ ∇L to anticipate the boundary orientation. Tune the step length |v| so that “proceed” dominates moderately (Galileo). Tune the path length N|v| so that the expected number of “reverse” moves is about 1.
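A small, self-contained Python rendering of that step rule on an invented constraint (the unit disc), with the disc's outward normal standing in for ∇L; illustrative only.

import random

# Invented constraint: inside the unit disc counts as "L is OK".
def ok(x):
    return x[0] ** 2 + x[1] ** 2 < 1.0

def boundary_normal(x):
    # Outward direction of the disc boundary; stands in for n along grad L.
    norm = (x[0] ** 2 + x[1] ** 2) ** 0.5
    return [x[0] / norm, x[1] / norm]

def gmc_step(x1, v):
    """One Galilean step: proceed, else reflect, else reverse."""
    x2 = [a + b for a, b in zip(x1, v)]
    if ok(x2):
        return x2, v                                   # proceed
    n = boundary_normal(x2)                            # unit normal at x2
    nv = n[0] * v[0] + n[1] * v[1]
    vr = [vi - 2 * ni * nv for vi, ni in zip(v, n)]    # v' = v - 2 n (n.v)
    x3 = [a + b for a, b in zip(x2, vr)]
    if ok(x3):
        return x3, vr                                  # reflect
    return x1, [-vi for vi in v]                       # reverse

x = [0.0, 0.0]
v = [random.gauss(0, 0.3), random.gauss(0, 0.3)]
for _ in range(20):                                    # a short trajectory
    x, v = gmc_step(x, v)
print(x)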


SUMMARY

WHY:
  Associative symmetries ⇒ probability calculus L(x) π(x) = Z P(x).
  Independence symmetry ⇒ H(p ; q) = ∫ log(p/q) dp.

WHAT:
  Inference is highly compressive, by e^{H(P ; π)}.
  Large compression needs iterations with balanced compressions, each not ≫ O(1).

HOW:
  Robustness (to spike & slab) requires progressive lower bounds on L (Nested Sampling).
  Efficient exploration in ℝ^d needs systematic motion across the flat prior within the constraint (Galilean Monte Carlo, GMC).

Computational inference is becoming a principled discipline. Yesterday's algorithms have solved yesterday's problems. NS/GMC may help to solve your problems and tomorrow's.

With thanks to a long history, recently including: Kevin Knuth for symmetries and foundations, José Bernardo for divergence, Radford Neal for snooker, Farhan Feroz and Michael Betancourt for reflections, and you for listening.
