Entropic Inference

Ariel Caticha
Department of Physics, University at Albany – SUNY
[email protected]

MaxEnt 2010, Chamonix

The goal: To update probabilities when new information becomes available.

[Diagram: an arrow labelled “information” carrying the old probabilities to the new ones.]

Questions:

• What is information?
• What is entropy?
• Why an entropy?
• Which entropy?
• Are Bayesian and entropic methods compatible?

The goal of entropic inference: to update from old beliefs, the prior q(x), to new beliefs, the posterior p(x), when new information becomes available.

We seek a concept of information defined directly in terms of how it affects the beliefs of rational agents.

An analogy from physics:

initial state of motion → final state of motion

Force is whatever induces a change of motion:
$$F = \frac{dp}{dt}$$

Inference is dynamics too!

old beliefs → new beliefs

Information is what induces the change in rational beliefs.

What is information?

Information is what induces the change in rational beliefs.
Information is what constrains rational beliefs.

Mathematical expression: information = constraints on probabilities (for example, an expectation constraint $\int dx\, p(x)\,f(x) = F$).

Entropic Inference

Question: How do we select a distribution from among all those that satisfy the constraints?

Answer: Rank the distributions according to preference. (Skilling)

Transitivity: if p1 is better than p2, and p2 is better than p3, then p1 is better than p3.

To each p assign a real number S[p,q] such that
$$S[p_1, q] > S[p_2, q] > S[p_3, q].$$
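As a small illustration (my own addition, not from the talk) of ranking candidate distributions by such a functional, the sketch below scores a few made-up candidates against a prior using the relative entropy introduced later in the talk; a larger S[p,q] means "preferred".

```python
import numpy as np

def relative_entropy(p, q):
    """S[p, q] = -sum_x p(x) log(p(x)/q(x)); larger values are preferred."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0                      # convention: 0 log 0 = 0
    return -np.sum(p[mask] * np.log(p[mask] / q[mask]))

q = np.array([0.25, 0.25, 0.25, 0.25])            # prior (made up)
candidates = {                                    # hypothetical candidates
    "p1": np.array([0.25, 0.25, 0.25, 0.25]),
    "p2": np.array([0.40, 0.30, 0.20, 0.10]),
    "p3": np.array([0.70, 0.10, 0.10, 0.10]),
}
# Rank the candidates: the one with the largest S[p, q] comes first.
for name, p in sorted(candidates.items(),
                      key=lambda kv: -relative_entropy(kv[1], q)):
    print(name, round(relative_entropy(p, q), 4))
```

The ranking is automatically transitive because it is inherited from the ordering of the real numbers S[p,q].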

Remarks: This answers the question “Why an entropy?” Entropies are real numbers designed to be maximized.

The Method of Maximum Entropy: Select the posterior that maximizes the entropy S[p,q] subject to the available constraints.


Question: Which entropy functional S[p,q]?

Answer: Eliminative induction.

• We want an S[p,q] of universal applicability.
• Select a sufficiently broad family of functionals.
• Identify criteria/principles that must be satisfied.
• Eliminate the functionals that violate the criteria.

Caution: with too many criteria the universal theory might not exist.

Question: What criteria/principles govern the choice of the functional S[p,q]?

Answer: “These are my principles; if you don't like them... I have others.” (Groucho Marx)

Question: What criteria/principles govern the choice of the functional S[p,q]?

Answer: The Principle of Minimal Updating.

Prior information is valuable: do not ignore it. Beliefs ought to be revised, but only to the extent required by the new information. Rather than prescribing what and how to update, we prescribe what not to update. This is designed to maximize objectivity.

Criterion 1: Locality. Local information has local effects. If the information does not refer to a domain D, then p(x|D) is not updated:
$$p(x \mid D) = q(x \mid D).$$

Criterion 2: Coordinate invariance. Coordinates carry no information.

Criterion 3: Consistency for all independent systems. When systems are known to be independent it should not matter whether they are treated jointly or separately.

Remark: this applies to all independent systems, whether identical, similar or very different, whether few or many.

Conclusion: The only ranking of universal applicability consistent with Minimal Updating is given by the relative entropy,
$$S[p,q] = -\int dx\; p(x)\,\log\frac{p(x)}{q(x)}.$$
Other entropies may be useful for other purposes. For the purpose of updating, the only candidate of general applicability is the logarithmic relative entropy.
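An illustrative numerical sketch (mine, with made-up numbers, not part of the slides): maximizing this S[p,q] on a discrete space subject to a single expectation constraint ⟨f⟩ = F gives the familiar exponential form p(x) ∝ q(x) exp(−λ f(x)), with the multiplier λ fixed by the constraint.

```python
import numpy as np
from scipy.optimize import brentq

x = np.arange(5, dtype=float)                  # toy discrete space (made up)
q = np.ones_like(x) / len(x)                   # prior q(x)
f, F = x, 1.2                                  # constraint: <f> = F

def p_of_lambda(lam):
    """Maximizer of S[p,q] subject to <f> = F: p(x) proportional to q(x) exp(-lam f(x))."""
    w = q * np.exp(-lam * f)
    return w / w.sum()

# Solve <f>_p = F for the Lagrange multiplier.
lam = brentq(lambda l: np.sum(p_of_lambda(l) * f) - F, -50.0, 50.0)
p = p_of_lambda(lam)
print("posterior:", np.round(p, 4))
print("<f> =", np.sum(p * f), "  S[p,q] =", -np.sum(p * np.log(p / q)))
```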

Bayes’ rule as Entropic Inference

Maximize the appropriate entropy,
$$S[p,q] = -\int dx\, d\theta\; p(x,\theta)\,\log\frac{p(x,\theta)}{q(x,\theta)}, \qquad q(x,\theta) = q(\theta)\,q(x \mid \theta),$$
constrained by the observed data,
$$\int d\theta\; p(x,\theta) = p(x) = \delta(x - x').$$
Note: this is an ∞ number of constraints, one for each value of x.

Varying with Lagrange multipliers λ(x) for the data constraints and α for normalization,
$$\delta\Big\{\, S + \int dx\; \lambda(x)\,\big[\,p(x) - \delta(x - x')\,\big] + \alpha\,[\text{norm.}] \,\Big\} = 0.$$
The joint posterior is
$$p(x,\theta) = p(x)\,p(\theta \mid x) = \delta(x - x')\,q(\theta \mid x),$$
and the new marginal for θ is
$$p(\theta) = \int dx\; p(x,\theta) = q(\theta \mid x') = q(\theta)\,\frac{q(x' \mid \theta)}{q(x')},$$
which is Bayes’ rule!
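A quick numerical sanity check of this result (my own, on a made-up discrete joint prior): fixing the x-marginal to a delta at the observed x' while leaving the conditionals q(θ|x) untouched, which is the minimal update, reproduces the Bayesian posterior q(θ|x').

```python
import numpy as np

rng = np.random.default_rng(0)
q_joint = rng.random((3, 4))          # made-up joint prior q(x, theta): 3 x-values, 4 thetas
q_joint /= q_joint.sum()
x_obs = 1                             # the observed datum x'

# Entropic update: p(x, theta) = delta(x, x') q(theta | x)
q_theta_given_x = q_joint / q_joint.sum(axis=1, keepdims=True)
p_joint = np.zeros_like(q_joint)
p_joint[x_obs] = q_theta_given_x[x_obs]

# Bayes' rule computed directly: q(theta | x') = q(x', theta) / q(x')
bayes_posterior = q_joint[x_obs] / q_joint[x_obs].sum()

print(np.allclose(p_joint.sum(axis=0), bayes_posterior))   # True
```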

More on Entropic Inference

[Figure: the prior q(x), the constraint surface, and the family p(x|θ) parametrized by (θ1, θ2).]

Maximum entropy selects θ0.
Question: To what extent is θ ≠ θ0 ruled out?

We are asking for the joint distribution P(x,θ). Maximize
$$S[P,Q] = -\int dx\, d\theta\; P(x,\theta)\,\log\frac{P(x,\theta)}{Q(x,\theta)},$$
with $Q(x,\theta) = q(x)\,q(\theta)$ and $P(x,\theta) = P(\theta)\,p(x \mid \theta)$.

Answer:
$$P(\theta)\,d^n\theta \;\propto\; e^{S(\theta)}\, q(\theta)\,d^n\theta, \qquad S(\theta) = -\int dx\; p(x \mid \theta)\,\log\frac{p(x \mid \theta)}{q(x)}.$$
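To make the answer concrete, here is an illustrative sketch (my own; the discretized Gaussian family p(x|θ), the broad prior q(x), and the flat q(θ) are all assumptions): it evaluates S(θ) on a grid and forms P(θ) ∝ exp(S(θ)) q(θ), showing how strongly values θ ≠ θ0 are suppressed.

```python
import numpy as np

x = np.linspace(-10, 10, 2001)
dx = x[1] - x[0]

def gauss(mu, sigma):
    g = np.exp(-0.5 * ((x - mu) / sigma) ** 2)
    return g / (g.sum() * dx)                      # normalized on the grid

q_x = gauss(0.0, 3.0)                              # assumed prior q(x)
thetas = np.linspace(-4.0, 4.0, 81)

def S(theta):
    """S(theta) = -integral dx p(x|theta) log[p(x|theta)/q(x)]."""
    p = gauss(theta, 1.0)
    return -np.sum(p * np.log(p / q_x)) * dx

S_vals = np.array([S(t) for t in thetas])
P_theta = np.exp(S_vals)                           # flat q(theta) assumed
P_theta /= P_theta.sum() * (thetas[1] - thetas[0])

theta0 = thetas[np.argmax(P_theta)]
print("theta_0:", theta0)
print("P(theta=3)/P(theta_0):", P_theta[np.argmin(abs(thetas - 3))] / P_theta.max())
```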

Entropic Inference: Summary

[Figure: the prior q is updated to the posterior p selected by the constraint.]

$$S[p,q] = -\int dx\; p(x)\,\log\frac{p(x)}{q(x)}, \qquad \Pr(dv) \propto e^{S[p,q]}\,dv.$$

Maximize S[p,q] subject to the appropriate constraints. (MaxEnt, Bayes' rule and Large Deviations are special cases.)

Conclusions and remarks

• Information is the constraints.
• Minimal updating: prior information is valuable.
• The tool for updating is (relative) entropy.
• Entropy needs no interpretation.
• MaxEnt, Bayes and Large Deviations are special cases.


Probability Theory – Theory of Inference

[Diagram: a map of inference methods with the labels freq. PT, Orthodox Statistics, Bayesian Inference, MaxEnt, Entropic Inference, and Terra Incognita.]

What is information?

a) Epistemic: what is conveyed by an informative answer. Everyday usage; concerned with meaning.

b) Probabilistic: Shannon information. Communication theory, physics, econometrics... Concerned with the amount of information, not with meaning.

c) Algorithmic: Kolmogorov complexity. Computer science, complexity... Concerned with amount, not meaning (arguably).

Bayes’ rule for repeatable experiments

$$S[p,q] = -\int dx\, d\theta\; p(x,\theta)\,\log\frac{p(x,\theta)}{q(x,\theta)}, \qquad q(x,\theta) = q(\theta)\,q(x_1 \mid \theta)\cdots q(x_n \mid \theta),$$
with $x = \{x_1, \ldots, x_n\}$, and data constraints
$$\int d\theta\, dx_2 \cdots dx_n\; p(x,\theta) = p(x_1) = \delta(x_1 - x_1').$$

Posterior:
$$p(\theta, x_2, \ldots, x_n) = q(\theta \mid x_1')\,q(x_2 \mid \theta)\cdots q(x_n \mid \theta).$$
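A small numerical check (my own, on a made-up discrete model with two repeatable trials): after observing x1 = x1', the joint entropic update changes only the factor attached to x1, leaving the likelihood factors for the unobserved trial intact, exactly as in the posterior above.

```python
import numpy as np

rng = np.random.default_rng(1)
n_theta, n_x = 4, 3
q_theta = rng.dirichlet(np.ones(n_theta))                    # q(theta)
q_x_theta = rng.dirichlet(np.ones(n_x), size=n_theta)        # q(x|theta), one row per theta
x1_obs = 2                                                    # observed x1'

# Factorized posterior from the slide: p(theta, x2) = q(theta|x1') q(x2|theta)
post_theta = q_theta * q_x_theta[:, x1_obs]
post_theta /= post_theta.sum()
p_theta_x2 = post_theta[:, None] * q_x_theta

# Same thing from the full joint q(theta, x1, x2) with the constraint p(x1) = delta(x1 - x1')
q_joint = q_theta[:, None, None] * q_x_theta[:, :, None] * q_x_theta[:, None, :]
constrained = q_joint[:, x1_obs, :]
constrained /= constrained.sum()
print(np.allclose(p_theta_x2, constrained))                   # True
```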

Constraints do not commute in general.

[Figure: starting from the prior q(x), two constraints imposed in sequence.]
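An illustrative sketch of the non-commutativity (mine; the functions and constraint values are made up): imposing two expectation constraints sequentially, maximizing relative entropy at each step, gives different final distributions depending on the order, because the second update need not preserve the first constraint.

```python
import numpy as np
from scipy.optimize import brentq

x = np.arange(6, dtype=float)
q = np.ones_like(x) / len(x)                 # prior
f1, F1 = x, 1.5                               # constraint 1: <x>   = 1.5
f2, F2 = x**2, 6.0                            # constraint 2: <x^2> = 6.0

def maxent_update(prior, f, F):
    """Maximize S[p, prior] subject to <f> = F; p(x) proportional to prior(x) exp(-lam f(x))."""
    def p(lam):
        w = prior * np.exp(-lam * (f - f.mean()))   # shift f for numerical stability
        return w / w.sum()
    lam = brentq(lambda l: np.sum(p(l) * f) - F, -20.0, 20.0)
    return p(lam)

p12 = maxent_update(maxent_update(q, f1, F1), f2, F2)   # constraint 1, then 2
p21 = maxent_update(maxent_update(q, f2, F2), f1, F1)   # constraint 2, then 1
print("1 then 2:", np.round(p12, 3))
print("2 then 1:", np.round(p21, 3))
print("equal?", np.allclose(p12, p21))                   # generally False
```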

Criterion 1: Locality. Local information has local effects. If the information does not refer to a domain D, then p(x|D) is not updated:
$$p(x \mid D) = q(x \mid D).$$

Consequence:
$$S[p,q] = \int dx\; F\big(x,\, p(x),\, q(x)\big).$$

$$S[p,q] = \int dx\; F\big(x,\, p(x),\, q(x)\big)$$

Criterion 2: Coordinate invariance. Coordinates carry no information.

Consequence:
$$S[p,q] = \int dx\; m(x)\,\Phi\!\left(\frac{p(x)}{m(x)},\, \frac{q(x)}{m(x)}\right),$$
where the arguments of Φ are coordinate invariants.

To determine m(x), use Criterion 1 (Locality) again: if there is no new information there is no update.

Consequence: $m(x) \propto q(x)$ is the prior, so
$$S[p,q] = \int dx\; q(x)\,\Phi\!\left(\frac{p(x)}{q(x)}\right).$$

Criterion 3: Consistency for independent systems. When systems are known to be independent it should not matter whether they are treated jointly or separately.

Consequence: a one-parameter family of entropies,
$$S_\eta[p,q] = \int dx\; p(x)\left(\frac{p(x)}{q(x)}\right)^{\!\eta} \quad \text{for } \eta \neq -1, 0,$$
$$S_0[p,q] = -\int dx\; p(x)\,\log\frac{p(x)}{q(x)} \quad \text{for } \eta = 0,$$
$$S_{-1}[p,q] = S_0[q,p] \quad \text{for } \eta = -1.$$
But this applies to two systems with the same η.
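For the η = 0 member the consistency requirement can be checked directly. The short derivation below is my own addition (not on the slides): for independent systems with p(x1,x2) = p1(x1) p2(x2) and q(x1,x2) = q1(x1) q2(x2), the relative entropy is additive, so joint and separate treatments select the same distributions.

```latex
\begin{align*}
S_0[p_1 p_2,\, q_1 q_2]
  &= -\int dx_1\, dx_2\; p_1(x_1)\, p_2(x_2)\,
     \log\frac{p_1(x_1)\, p_2(x_2)}{q_1(x_1)\, q_2(x_2)} \\
  &= -\int dx_1\, p_1 \log\frac{p_1}{q_1} \int dx_2\, p_2
     \;-\; \int dx_2\, p_2 \log\frac{p_2}{q_2} \int dx_1\, p_1 \\
  &= S_0[p_1, q_1] + S_0[p_2, q_2],
\end{align*}
```

using the normalizations $\int dx_1\, p_1 = \int dx_2\, p_2 = 1$.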

For systems with different η's use Criterion 3 again.

Single system 1: use $S_{\eta_1}[p_1, q_1]$.
Single system 2: use $S_{\eta_2}[p_2, q_2]$.
Combined system 1+2: use $S_{\eta}[p_1 p_2,\, q_1 q_2]$.

But this is equivalent to using $S_\eta[p_1, q_1]$ and $S_\eta[p_2, q_2]$. Therefore $\eta = \eta_1$ and $\eta = \eta_2$, so $\eta_1 = \eta_2$:

η is a universal constant!

For N → ∞ systems use Criterion 3 again.

Multinomial distribution:
$$P_N(n_1 \ldots n_m \mid q) = \frac{N!}{n_1! \cdots n_m!}\; q_1^{n_1} \cdots q_m^{n_m}.$$

For large N:
$$P_N(f_1 \ldots f_m \mid q) \;\propto\; \exp\!\big(N\, S_0(f, q)\big), \qquad f_i = \frac{n_i}{N},$$
which again singles out η = 0, with $f_i \approx p_i$ and $\sum_i f_i a_i \approx \sum_i p_i a_i$ (in probability).
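A numerical check of the large-N statement (my own sketch; the probabilities q and frequencies f are made up): the quantity log P_N / N approaches S_0(f,q) = −∑_i f_i log(f_i/q_i) as N grows.

```python
import numpy as np
from scipy.special import gammaln

q = np.array([0.2, 0.5, 0.3])          # made-up probabilities q_i
f = np.array([0.4, 0.4, 0.2])          # made-up frequencies f_i (N f_i integer below)

def log_multinomial(N):
    """log P_N(n_1..n_m | q) with counts n_i = N f_i."""
    n = np.round(N * f).astype(int)
    return gammaln(N + 1) - gammaln(n + 1).sum() + np.sum(n * np.log(q))

S0 = -np.sum(f * np.log(f / q))        # S_0(f, q); a negative number here
for N in (10, 100, 1000, 10000):
    print(N, log_multinomial(N) / N, "->", S0)
```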