Antidata

Carlos C. Rodríguez

March 4, 2006

To Herb.

http://omega.albany.edu:8008/Herb/

Introduction

How can I honestly justify to myself and others, the use of one prior instead of another? Where does prior information come from? These questions are in the mind of anyone confronted with bayesian inference for the first time. The usual: Relax, don't worry, it doesn't really matter much what the prior is... after just a few observations the likelihood dominates over the prior and we all end up agreeing about the posterior... etc., does in fact help to reduce the level of anxiety of a soon to become bayesian acolyte, but the problem remains. Since, what if the prior does matter? Surely there must exist situations where it matters what the prior is. Then what?

Herbert Robbins loved to make fun of the non-empirical faithful bayesians by closing his eyes, shaking, and saying: "and they deliver the prior!". Almost always he performed this act right after he finished showing how he could estimate the prior from the data and behave asymptotically as if he knew what the prior was, allowing him to shrink the value of the loss in a compound decision problem. He would then get back to those bayesians in his mind, again, screaming loud and clear: "they can't learn from data! I can!" and leaving the room whistling the same military boot-camp tune that he arrived with. I like to think that Herb would have been pleased to know that the conjugate priors, that he often used, were in fact best possible in a precise sense. Thus, not only the parameters of the prior, but the family of priors as well, are chosen by objective data.

I started writing this as class notes for an undergraduate statistics course at SUNY Albany. The reader I initially had in mind was a typical student from my class for whom the course was likely to be the first encounter with bayesian inference. But, right after dotting all the i's for the simplest, most canonical, gaussian examples, in one dimension, I realized to my own surprise that I had learned a few things that were not in the textbooks.

First of all, a pedagogical (if not plain conceptual) point. When writing the conjugate prior for a gaussian mean $\mu$ with known variance $\sigma^2$, do not write $N(\mu_0, \sigma_0^2)$ but $N(\mu_0, \sigma^2/\lambda_0)$. With the second form (which, by the way, is the form that comes from the entropic prior) the posterior variance is simply $\sigma^2/(\lambda_0 + n)$. Second, the virtual data interpretation for the natural conjugate prior for the exponential family essentially breaks down unless the correct information volume element $dV$ is used. Third, and the most important lesson that I learned from my own notes, the 1-priors, even though they are not conjugate, are more ignorant than the 0-priors. The 1-priors, just like the 0-priors, can be thought of as based on $\alpha > 0$ virtual observations. However, where the 0-priors add the $\alpha$ virtual observations to the actual $n$ sample points, the 1-priors subtract the $\alpha$ from the $n$! I call this anti-data since $\alpha$ of these points annihilate $\alpha$ of the observations, leaving us with a total of $n - \alpha$. I find this, as far as I know, new statistics phenomenon very pleasing. True ignorance, that claims only the model and the observed data, has a price. To build the prior we must spend some of the information cash in hand. No free lunches. Thus, the posterior confidence intervals for a gaussian mean with unknown variance could end up a little larger than the ones from sampling theory.

The Exponential Family

Let vector $x \in X$ have scalar probability density,

$$p(x|\theta) = \exp\left( \sum_{j=1}^{k} C_j(\theta)\, T_j(x) + T(x) - \log Z(\theta) \right)$$

defined for all $\theta \in \Theta \subset \mathbb{R}^k$. We say that the distribution of $x$ is in the $k$-parameter exponential family generated by the volume element $dV$ in $X$ when the above is the density of probability with respect to this $dV$, i.e.,

$$P[x \in A\,|\,\theta] = \int_A p(x|\theta)\, dV.$$

The data space $X$ is assumed to be independent of $\theta$. The functions $T, C_j, T_j$ for $j = 1, \ldots, k$ and the normalizing constant $Z$ (also known as the partition function) are assumed to be (nice) functions of their arguments. Many well known families of probability distributions are of this type. For example, the Bernoulli, binomial, Poisson, beta, gamma, and Normal families are all of this kind. For the case of the two parameter families (beta, gamma and Normal) we can assume the first parameter, or the second parameter, or none of them to have a given fixed value, and the resulting families are also members of the general exponential family. Let's consider the three cases generated by the Normal as typical illustrations.

Gaussian with known variance: $N(\mu, \sigma^2)$

This is a one parameter family with density (w.r.t. $dx$) given by,

$$p(x|\mu) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x-\mu)^2}{2\sigma^2} \right).$$

By expanding the square and bringing the normalization constant into the exponential we can write,

$$p(x|\mu) = \exp\left( -\frac{x^2}{2\sigma^2} + \frac{\mu}{\sigma^2}\,x - \log Z(\mu) \right)$$

where the partition function is,

$$Z(\mu) = \sqrt{2\pi\sigma^2}\, \exp\left( \frac{\mu^2}{2\sigma^2} \right), \quad \text{and,} \quad C_1 = \mu/\sigma^2, \quad T_1 = x, \quad T = -\frac{x^2}{2\sigma^2}.$$

Gaussian with known mean: $N(\mu, \sigma^2)$

The density is,

$$p(x|\sigma^2) = \exp\left( -\frac{(x-\mu)^2}{2\sigma^2} - \log\sqrt{2\pi\sigma^2} \right).$$

In this case, $T = 0$, $T_1 = (x-\mu)^2$, $C_1 = -1/(2\sigma^2)$ and $Z = \sqrt{2\pi\sigma^2}$.

General Gaussian: $N(\mu, \sigma^2)$, $\theta = (\mu, \sigma^2)$

Here,

$$p(x|\theta) = \exp\left( -\frac{x^2}{2\sigma^2} + \frac{\mu}{\sigma^2}\,x - \frac{\mu^2}{2\sigma^2} - \log\sqrt{2\pi\sigma^2} \right)$$

thus, $T = 0$, $T_1 = x$, $C_1 = \mu/\sigma^2$, $T_2 = x^2$, $C_2 = -1/(2\sigma^2)$ and

$$Z(\theta) = \sqrt{2\pi\sigma^2}\, \exp\left( \frac{\mu^2}{2\sigma^2} \right).$$
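Since decompositions like these are easy to get wrong by a sign or a constant, here is a minimal numerical check of the general Gaussian case, written as a Python sketch; the values of $\mu$ and $\sigma^2$ are arbitrary illustrative choices, not anything from the text.

    import numpy as np
    from scipy.stats import norm

    mu, var = 1.3, 2.0                      # illustrative values
    x = np.linspace(-4.0, 6.0, 11)

    # sufficient statistics and natural parameters of the general Gaussian
    T1, T2 = x, x**2
    C1, C2 = mu / var, -1.0 / (2.0 * var)
    logZ = 0.5 * np.log(2.0 * np.pi * var) + mu**2 / (2.0 * var)

    p_expfam = np.exp(C1 * T1 + C2 * T2 - logZ)
    p_direct = norm.pdf(x, loc=mu, scale=np.sqrt(var))
    assert np.allclose(p_expfam, p_direct)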

Natural Priors for the Exponential Family

When the distribution of a vector $x$ is in the $k$-parameter exponential family, there is a simple recipe for priors. Make a $(k+1)$-parameter exponential family of priors for $\theta$ by defining their scalar probability densities with respect to the information volume element $dV$ in $\{p(\cdot|\theta) : \theta \in \Theta\}$ by,

$$p(\theta|t) = \frac{1}{W(t)} \exp\left( \sum_{j=1}^{k} t_j C_j(\theta) - t_0 \log Z(\theta) \right)$$

where $t = (t_0, t_1, \ldots, t_k) \in \{t : W(t) < \infty\}$ with,

$$W(t) = \int_\Theta \exp\left( \sum_{j=1}^{k} t_j C_j(\theta) - t_0 \log Z(\theta) \right) dV.$$

Prior information is encoded in the values of $t_1, t_2, \ldots, t_k$. The strength of the information supplied with $t$ is measured by $t_0$. Hence,

$$P[\theta \in A\,|\,t] = \int_A p(\theta|t)\, dV$$

where the information volume element $dV$ is given by,

$$dV = \sqrt{\det\left( g_{ij}(\theta) \right)}\, d\theta$$

and,

$$g_{ij}(\theta) = 4 \int_X \partial_i \sqrt{p(x|\theta)}\; \partial_j \sqrt{p(x|\theta)}\; dx$$

are the entries of the information matrix.
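The information matrix, and hence $dV$, can also be approximated numerically. The following sketch (the function names and test values are mine, not the paper's) estimates $g_{ij}$ for the general Gaussian by central differences and quadrature, and checks that $\sqrt{\det g} \propto 1/v^{3/2}$, which is the volume element used later for the $(\mu, \sigma^2)$ prior.

    import numpy as np
    from scipy.integrate import quad

    def sqrt_p(x, mu, v):
        """Square root of the N(mu, v) density."""
        return (2.0 * np.pi * v) ** (-0.25) * np.exp(-(x - mu) ** 2 / (4.0 * v))

    def info_matrix(mu, v, eps=1e-5):
        """g_ij = 4 * int d_i sqrt(p) * d_j sqrt(p) dx, via central differences."""
        theta = np.array([mu, v], dtype=float)
        def d(i, x):
            tp, tm = theta.copy(), theta.copy()
            tp[i] += eps
            tm[i] -= eps
            return (sqrt_p(x, *tp) - sqrt_p(x, *tm)) / (2.0 * eps)
        g = np.empty((2, 2))
        for i in range(2):
            for j in range(2):
                g[i, j] = 4.0 * quad(lambda x: d(i, x) * d(j, x), -np.inf, np.inf)[0]
        return g

    g = info_matrix(0.0, 2.0)
    # expect diag(1/v, 1/(2 v^2)), so sqrt(det g) = 1/(sqrt(2) v^{3/2})
    print(g)
    print(np.sqrt(np.linalg.det(g)), 1.0 / (np.sqrt(2.0) * 2.0 ** 1.5))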

What's so Natural About These Priors?

Here is another one of Herb's screams to the rescue: "Parameters are ghosts. Nobody has ever seen a parameter!". How right. I would add that probability distributions are also ghosts. No one has ever seen a probability distribution either (but that doesn't make me a deFinettian though). In fact all we ever see, as much as we see anything, is data. Thus, a natural (if not the only) way of providing prior information about the ghostly parameters $\theta$ is to write down $t_0$ typical examples $\tilde{x}_1, \tilde{x}_2, \ldots, \tilde{x}_{t_0}$ of data that is expected to resemble, as much as possible, the actual observations. This is sometimes possible to implement even with a set of self-nominated domain experts. It has the added virtue that the more the experts disagree among themselves, the better the prior you end up with. With the prior data in hand it is now very reasonable to want the prior probability around $\theta$ to be proportional to the likelihood of the prior examples, just like the rationale for maximum likelihood. In symbols,

$$p(\theta|\tilde{x}_1, \tilde{x}_2, \ldots, \tilde{x}_{t_0}) \propto p(\tilde{x}_1, \tilde{x}_2, \ldots, \tilde{x}_{t_0}|\theta).$$

Or if you like, just bayes theorem with a uniform (Jeffreys) prior. If the likelihood is assumed to be in the same $k$-parameter exponential family as the actual data (and if that is not a reasonable assumption for the virtual data in hand, then you need to throw away your model and/or interrogate some of your experts) then,

$$p(\theta|\tilde{x}_1, \tilde{x}_2, \ldots, \tilde{x}_{t_0}) \propto \exp\left( \sum_{j=1}^{k} C_j(\theta) \sum_{i=1}^{t_0} T_j(\tilde{x}_i) - t_0 \log Z(\theta) \right)$$

which is the natural conjugate prior for the exponential family with prior parameters,

$$t_j = \sum_{i=1}^{t_0} T_j(\tilde{x}_i) \quad \text{for } j = 1, \ldots, k.$$

Hence, there is a simple and useful interpretation for the inferences obtained with the aid of these priors. A bayesian using the natural conjugate prior for the exponential family acts as if s/he had $t_0$ extra observations with sufficient statistics $t_1, \ldots, t_k$. Thus, reliable prior information should be used with a large value of $t_0$ and weak prior information with a small value of $t_0$. The values of $t$ (and therefore the prior examples) need to be restricted to the set that makes the resulting prior normalizable, otherwise they can't be used. The smallest possible value of $t_0$ that still defines a normalizable prior could be used to define a simple notion of ignorance in this context. We illustrate the general points with the three Normal families considered above.
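As a concrete, purely hypothetical illustration of turning expert examples into prior parameters, the sketch below takes a handful of made-up virtual observations and computes $t = (t_0, t_1, t_2)$ for the general Gaussian, whose sufficient statistics are $T_1(x) = x$ and $T_2(x) = x^2$.

    import numpy as np

    # made-up virtual examples elicited from hypothetical experts
    x_virtual = np.array([0.8, 1.5, -0.2, 2.1])

    t0 = len(x_virtual)            # strength: number of virtual points
    t1 = np.sum(x_virtual)         # sum of T1(x) = x over the virtual data
    t2 = np.sum(x_virtual ** 2)    # sum of T2(x) = x^2 over the virtual data
    print("prior parameters (t0, t1, t2):", t0, t1, t2)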

Posterior Parameters

When the likelihood is a $k$-parameter exponential family and the natural conjugate prior with prior parameter $t$ is used, the posterior after a sample $x^n = (x_1, \ldots, x_n)$ of $n$ iid observations is collected is given by bayes theorem. Using the iid assumption for the data, collecting the exponents of the exponentials, and dropping overall multiplicative constants independent of $\theta$ we obtain,

$$p(\theta|x^n, t) \propto p(x^n|\theta)\, p(\theta|t) \propto \exp\left( \sum_{j=1}^{k} \left( t_j + \sum_{i=1}^{n} T_j(x_i) \right) C_j(\theta) - (t_0 + n)\log Z(\theta) \right) \propto p(\theta|t^{(n)})$$

where the $(k+1)$ new parameters $t^{(n)}$ are obtained from the simple formulas,

$$t_0^{(n)} = t_0 + n, \qquad t_j^{(n)} = t_j + \sum_{i=1}^{n} T_j(x_i) \quad \text{for } j = 1, \ldots, k.$$
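The updating formulas are one line of code. Here is a minimal generic sketch (the helper name, the prior values and the data are all made up for illustration), specialized at the end to the general Gaussian with $T_1(x) = x$ and $T_2(x) = x^2$.

    import numpy as np

    def conjugate_update(t, x, T_funcs):
        """t0 -> t0 + n and tj -> tj + sum_i Tj(x_i), for j = 1, ..., k."""
        t_new = np.asarray(t, dtype=float).copy()
        x = np.asarray(x, dtype=float)
        t_new[0] += len(x)
        for j, T in enumerate(T_funcs, start=1):
            t_new[j] += np.sum(T(x))
        return t_new

    t_prior = [4.0, 4.2, 8.9]                    # illustrative (t0, t1, t2)
    x_obs = [0.3, 1.1, 0.7, 1.9, 0.5]            # illustrative observations
    print(conjugate_update(t_prior, x_obs, [lambda x: x, lambda x: x ** 2]))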

Natural Prior for $\mu|\sigma^2$ is $N(\mu_0, \sigma^2/\lambda_0)$

When the likelihood is Gaussian with a given variance $\sigma^2$, the natural prior for the mean $\theta = \mu$ is the two parameter family obtained by replacing the sufficient statistic $T_1 = x \in \mathbb{R}$ by the prior parameter $t_1 \in \mathbb{R}$, replacing $\log Z$ by $t_0 \log Z$, and dropping multiplicative constants independent of $\mu$ in the exponential family likelihood for the $N(\mu, \sigma^2)$. The scalar probability density with respect to $dV = d\mu$ is,

$$p(\mu|t_0, t_1) \propto \exp\left( -\frac{t_0\,\mu^2}{2\sigma^2} + \frac{t_1\,\mu}{\sigma^2} \right) \propto N(\mu_0, \sigma^2/\lambda_0).$$

The middle expression is integrable over $\mu \in \mathbb{R}$ (the real line) only when $t_0 > 0$ and $t_1 \in \mathbb{R}$. Thus, $\mu_0 = t_1/t_0 \in \mathbb{R}$ and $\lambda_0 = t_0 > 0$. To compute the posterior we just apply the general updating formulas. In this case, the posterior is $N(\mu_n, \sigma^2/\lambda_n)$ where,

$$\lambda_n = \lambda_0 + n, \qquad \mu_n = \frac{\lambda_0\mu_0 + n\bar{x}_n}{\lambda_0 + n}$$

is immediately obtained from the updating formulas for $t_0$ and $t_1$. Notice that the posterior parameters $\lambda_n$ and $\mu_n$ are the number of observations and the mean of the observed data $x_1, x_2, \ldots, x_n$ augmented by $\lambda_0$ extra observations $x_{n+1}, x_{n+2}, \ldots, x_{n+\lambda_0}$ with mean $\mu_0$.
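In code, the whole inference for this case reduces to the two update lines below; this is a sketch with made-up numbers, the known variance and the prior values being arbitrary.

    import numpy as np

    sigma2 = 1.5                            # known variance
    lam0, mu0 = 2.0, 0.0                    # prior: mu ~ N(mu0, sigma2 / lam0)
    x = np.array([0.7, 1.2, 0.4, 1.9])      # observed data

    n = len(x)
    lam_n = lam0 + n
    mu_n = (lam0 * mu0 + n * x.mean()) / lam_n
    print(f"posterior: N({mu_n:.3f}, {sigma2 / lam_n:.3f})")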

Natural Prior for $\sigma^2|\mu$ is $\chi^{-2}(\nu_0, \sigma_0^2)$

When the likelihood of an observation is $x|\sigma^2 \sim N(\mu, \sigma^2)$ we have,

$$p(x|\sigma^2) \propto \exp\left( -\frac{1}{2\sigma^2}(x-\mu)^2 - \frac{1}{2}\log\sigma^2 \right)$$

and the natural prior is given with respect to $dV = d\sigma^2/\sigma^2$ as,

$$p(\sigma^2|t_0, t_1) \propto (\sigma^2)^{-t_0/2} \exp\left( -\frac{t_1}{2\sigma^2} \right).$$

In order for this last function to be integrable over the region $0 < \sigma^2 < \infty$ with respect to $d\sigma^2/\sigma^2$, it is necessary that $t_0 > 0$ and $t_1 > 0$. This can be seen by finding the density of $\lambda = t_1/\sigma^2$ to be $\mathrm{Gamma}(t_0/2, 1/2)$, i.e., a chi-square with $\nu_0 = t_0$ degrees of freedom. We define an inverse chi-square by,

$$\sigma^2 \sim \chi^{-2}(\nu_0, \sigma_0^2) \iff \frac{\nu_0\sigma_0^2}{\sigma^2} \sim \chi^2_{\nu_0}$$

and we say that $\sigma^2$ follows an inverse chi-square with $\nu_0$ degrees of freedom and scale $\sigma_0^2$. Also,

$$\sigma^2 \sim \chi^{-2}(\nu_0, \sigma_0^2) \iff p(\sigma^2)\, d\sigma^2 \propto (\sigma^2)^{-\nu_0/2} \exp\left( -\frac{\nu_0\sigma_0^2}{2\sigma^2} \right) \frac{d\sigma^2}{\sigma^2}.$$

The natural family of priors for $\sigma^2|\mu$ is then $\chi^{-2}(\nu_0, \sigma_0^2)$ with $\nu_0\sigma_0^2 = t_1 > 0$ and $\nu_0 = t_0 > 0$. The posterior parameters $\nu_n$ and $\sigma_n^2$ are obtained from the prior parameters and the observed data as,

$$\nu_n = \nu_0 + n, \qquad \nu_n\sigma_n^2 = \nu_0\sigma_0^2 + \sum_{i=1}^{n} (x_i - \mu)^2.$$

Again, these parameters are the number of observations and the sum of the squared deviations with respect to $\mu$ of the observed data $x_1, \ldots, x_n$ augmented by $\nu_0$ extra points $x_{n+1}, \ldots, x_{n+\nu_0}$ with,

$$\sum_{i=1}^{\nu_0} (x_{n+i} - \mu)^2 = \nu_0\sigma_0^2.$$
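A small sketch of this update (all numbers are illustrative), which also samples the resulting posterior through the defining relation $\nu_n\sigma_n^2/\sigma^2 \sim \chi^2_{\nu_n}$.

    import numpy as np

    mu = 0.0                                 # known mean
    nu0, s0sq = 3.0, 1.0                     # prior: sigma^2 ~ chi^{-2}(nu0, s0sq)
    x = np.array([0.5, -1.2, 0.8, 2.0])      # observed data

    n = len(x)
    nu_n = nu0 + n
    snsq = (nu0 * s0sq + np.sum((x - mu) ** 2)) / nu_n

    # draw sigma^2 from the posterior:  nu_n * snsq / sigma^2 ~ chi^2_{nu_n}
    rng = np.random.default_rng(0)
    sigma2_draws = nu_n * snsq / rng.chisquare(nu_n, size=10_000)
    print(nu_n, snsq, sigma2_draws.mean())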

The Natural Prior for $\theta = (\mu, \sigma^2)$

We show here that when both the mean and the variance of a Gaussian are unknown the natural prior for the vector $(\mu, \sigma^2)$ is simply the product of the two distributions obtained above, i.e.,

$$(\mu, \sigma^2) \sim \chi^{-2}(\nu_0, \sigma_0^2)\, N(\mu_0, \sigma^2/\lambda_0)$$

which is a three parameter family. To see it, just write the likelihood for one observation, disregarding overall proportionality constants independent of $\theta = (\mu, v)$ (where we now let $v = \sigma^2$),

$$p(x|\theta) \propto \exp\left( -\frac{1}{2v}(x-\mu)^2 - \frac{1}{2}\log v \right) \propto \exp\left( -\frac{1}{2v}x^2 + \frac{\mu}{v}x - \frac{\mu^2}{2v} - \frac{1}{2}\log v \right)$$

and use the recipe for the prior: replace $T_1 = x \in \mathbb{R}$ by a parameter $t_1 \in \mathbb{R}$, replace $T_2 = x^2 > 0$ by a parameter $t_2 > 0$, and multiply the rest of the terms in the exponent by the strength parameter $t_0$. In general the vector of prior parameters $t$ needs to be restricted to the set for which the resulting prior is proper. In this case, the information volume element is,

$$dV \propto \frac{d\mu\, dv}{v^{3/2}}$$

and the scalar probability density with respect to this $dV$ is,

$$p(\theta|t)\, dV \propto \exp\left( -t_0\left( \frac{\mu^2}{2v} + \frac{1}{2}\log v \right) + \frac{t_1}{v}\mu - \frac{t_2}{2v} \right) \frac{d\mu\, dv}{v^{3/2}}$$
$$\propto \left[ \exp\left( -\frac{t_2 - t_1^2/t_0}{2v} - \frac{t_0}{2}\log v \right) \frac{dv}{v} \right] \left[ \exp\left( -\frac{t_0}{2v}\left( \mu - \frac{t_1}{t_0} \right)^2 \right) \frac{d\mu}{v^{1/2}} \right]$$
$$\propto \chi^{-2}(\nu_0, \sigma_0^2)\, N(\mu_0, \sigma^2/\lambda_0)\, d\mu\, dv$$

where we made the substitutions,

$$\lambda_0 = \nu_0 = t_0, \qquad \mu_0 = \frac{t_1}{t_0}, \qquad \sigma_0^2 = \frac{t_2}{t_0} - \mu_0^2.$$

To obtain the posterior parameters $\lambda_n$, $\mu_n$ and $\sigma_n^2$ we apply the following general recipe. Combine the sufficient statistics for the observed data $x_1, x_2, \ldots, x_n$, which in this case are the observed mean and variance, i.e., $\sum_{i=1}^{n} x_i = n\bar{x}_n$ and $n\hat{\sigma}_n^2 = \sum_{i=1}^{n} (x_i - \bar{x}_n)^2$, with the sufficient statistics for the $\lambda_0$ virtual extra observations $x_{n+1}, \ldots, x_{n+\lambda_0}$, namely $\sum_{i=1}^{\lambda_0} x_{n+i} = \lambda_0\mu_0$ and $\sum_{i=1}^{\lambda_0} (x_{n+i} - \mu_0)^2 = \nu_0\sigma_0^2$. To obtain the updating formulas we simply pool together all the data, the actual observations with the virtual observations. We then have,

$$\lambda_n = \lambda_0 + n, \qquad \mu_n = \frac{\lambda_0\mu_0 + n\bar{x}_n}{\lambda_0 + n}$$

as the total number of points and the overall mean. Finally, the new variance is obtained simply from,

$$\nu_n\sigma_n^2 = \sum_{i=1}^{\lambda_0 + n} (x_i - \mu_n)^2.$$

This, however, needs to be simplified to an expression containing only available data, since the virtual observations are only given through the values of the sufficient statistics $\lambda_0\mu_0$ and $\nu_0\sigma_0^2$. Towards this end, we split the sum, add and subtract the sample mean $\bar{x}_n$ in the first term, and add and subtract the prior mean $\mu_0$ in the second term, obtaining,

$$\nu_n\sigma_n^2 = \sum_{i=1}^{n} (x_i - \bar{x}_n + \bar{x}_n - \mu_n)^2 + \sum_{i=1}^{\lambda_0} (x_{n+i} - \mu_0 + \mu_0 - \mu_n)^2$$
$$= \sum_{i=1}^{n} (x_i - \bar{x}_n)^2 + n(\bar{x}_n - \mu_n)^2 + \nu_0\sigma_0^2 + \lambda_0(\mu_0 - \mu_n)^2$$
$$= \nu_0\sigma_0^2 + \lambda_0(\mu_0 - \mu_n)^2 + n\left( \hat{\sigma}_n^2 + (\bar{x}_n - \mu_n)^2 \right)$$
$$= \nu_0\sigma_0^2 + n\hat{\sigma}_n^2 + \frac{\lambda_0 n}{\lambda_0 + n}(\bar{x}_n - \mu_0)^2.$$

Either of the last two identities can be computed from the values of the sample and prior means and variances.
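A quick numerical check of the last identity, as a sketch that assumes $\lambda_0 = \nu_0 = 2$ so that the two explicit virtual points $\mu_0 \pm \sigma_0$ have mean $\mu_0$ and squared deviations summing to $\nu_0\sigma_0^2$; all numbers are made up.

    import numpy as np

    mu0, s0 = 1.0, 2.0                            # prior mean and scale
    lam0 = nu0 = 2
    virtual = np.array([mu0 - s0, mu0 + s0])      # the two virtual points

    x = np.array([0.3, 1.7, 2.4, -0.5, 1.1])      # observed data (made up)
    n, xbar, s2hat = len(x), x.mean(), x.var()    # x.var() = (1/n) sum (x_i - xbar)^2

    lam_n = lam0 + n
    mu_n = (lam0 * mu0 + n * xbar) / lam_n

    pooled = np.sum((np.concatenate([x, virtual]) - mu_n) ** 2)
    closed = nu0 * s0**2 + n * s2hat + lam0 * n / lam_n * (xbar - mu0) ** 2
    assert np.isclose(pooled, closed)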

Direct Computation of the Posterior

It is a bit harder to check that the above formulas actually come from a direct application of bayes rule. For completeness we show here all the steps. We have data $x^n = (x_1, \ldots, x_n)$ iid $N(\mu, v)$, the parameter is $\theta = (\mu, v)$ and the prior is the natural prior obtained above, i.e., the joint prior distribution of $(\mu, v)$ is given as the marginal distribution $v \sim \chi^{-2}(\nu_0, \sigma_0^2)$ multiplied by the conditional distribution $\mu|v \sim N(\mu_0, v/\lambda_0)$, with both densities given with respect to the standard Lebesgue measure. By bayes theorem and the underlying assumptions we have that the posterior density, but now with respect to the usual $dV = d\mu\, dv$, is given by,

$$p(\theta|x^n) \propto p(x^n|\theta)\, p(\theta) \propto v^{-n/2} \exp\left( -\frac{1}{2v}\sum_{i=1}^{n}(x_i - \mu)^2 \right) v^{-\nu_0/2 - 1} \exp\left( -\frac{\nu_0\sigma_0^2}{2v} \right) v^{-1/2} \exp\left( -\frac{\lambda_0}{2v}(\mu - \mu_0)^2 \right).$$



Adding and subtracting the sample mean inside the summation, expanding the squares, simplifying, collecting terms, and letting $\hat{\sigma}_n^2$ be the sample variance, i.e., $n\hat{\sigma}_n^2 = \sum_{i=1}^{n}(x_i - \bar{x}_n)^2$, we get,

$$p(\theta|x^n) \propto v^{-\frac{\nu_0+n}{2} - 1} \exp\left( -\frac{1}{2v}\left( \nu_0\sigma_0^2 + n\hat{\sigma}_n^2 \right) \right)\, v^{-1/2} \exp\left( -\frac{1}{2v}\left[ (\lambda_0 + n)\mu^2 - 2\mu(\lambda_0\mu_0 + n\bar{x}_n) \right] \right)\, \exp\left( -\frac{1}{2v}\left[ \lambda_0\mu_0^2 + n\bar{x}_n^2 \right] \right).$$

Now, the middle factor clearly shows that $\mu|v \sim N(\mu_n, v/\lambda_n)$, but we need to complete the square explicitly to be able to identify the marginal distribution of $v$. We must be careful not to add extra multiplicative terms containing the variance $v$, and so what needs to be added to complete the square also needs to be killed explicitly. That's the reason for the second exponential in the second line below.

$$p(\theta|x^n) \propto v^{-\frac{\nu_0+n}{2} - 1} \exp\left( -\frac{1}{2v}\left( \nu_0\sigma_0^2 + n\hat{\sigma}_n^2 \right) \right)\, v^{-1/2} \exp\left( -\frac{\lambda_0 + n}{2v}(\mu - \mu_n)^2 \right) \exp\left( +\frac{\lambda_0 + n}{2v}\mu_n^2 \right) \exp\left( -\frac{1}{2v}\left[ \lambda_0\mu_0^2 + n\bar{x}_n^2 \right] \right).$$

Finally collect all the terms and simplify,

$$p(\theta|x^n) \propto \left[ v^{-1/2} \exp\left( -\frac{\lambda_0 + n}{2v}(\mu - \mu_n)^2 \right) \right] \left[ v^{-\frac{\nu_0+n}{2} - 1} \exp\left( -\frac{1}{2v}\left( \nu_0\sigma_0^2 + n\hat{\sigma}_n^2 + A \right) \right) \right]$$

where,

$$(\lambda_0 + n)A = (\lambda_0 + n)(\lambda_0\mu_0^2 + n\bar{x}_n^2) - (\lambda_0\mu_0 + n\bar{x}_n)^2 = \lambda_0 n(\mu_0^2 + \bar{x}_n^2 - 2\mu_0\bar{x}_n) = \lambda_0 n(\bar{x}_n - \mu_0)^2,$$

which shows that the marginal distribution of $v$ is indeed $\chi^{-2}(\nu_n, \sigma_n^2)$ with the updating formulas for $\nu_n$ and $\sigma_n^2$ as previously given.
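The algebraic step for $A$ is the only place where a slip is easy; here is a one-line symbolic check (a sketch using sympy, with symbol names chosen for readability).

    import sympy as sp

    lam0, n, mu0, xbar = sp.symbols("lambda0 n mu0 xbar", positive=True)
    A = ((lam0 + n) * (lam0 * mu0**2 + n * xbar**2) - (lam0 * mu0 + n * xbar) ** 2) / (lam0 + n)
    assert sp.simplify(A - lam0 * n * (xbar - mu0) ** 2 / (lam0 + n)) == 0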

The Natural Priors are Entropic

The natural conjugate priors for the exponential family are a special case of a larger and more general class of priors known as entropic priors. Entropic priors maximize ignorance and therefore the natural conjugate priors for the exponential family inherit that optimality property. Thus, the priors introduced above are not only convenient, they are also best in a precise objective sense.

Entropy and Entropic Priors

Here I try to minimize technicalities and full generality in favor of easy access to the main ideas. Let's start with entropy. It is a number associated to two probability distributions for the same data space $X$. It measures their intrinsic dissimilarity (or separation) as probability distributions. For distributions $P, Q$ with densities $p$ and $q$, with respect to a $dV$ in $X$, define

$$I(P : Q) = E_p\left[ \log\frac{p(x)}{q(x)} \right] = \int_X p(x) \log\frac{p(x)}{q(x)}\, dV$$

as their entropy. The $I$ is for Information and the $:$ is for ratio. This notation makes $I(P:Q) \neq I(Q:P)$ explicit. The operator $E_p$ denotes expectation assuming $x \sim p$. Even though the definition of $I(P:Q)$ uses the densities $p, q$ and a volume element $dV$, the actual value is independent of all of that! The number $I(P:Q)$ depends only on the probability distributions $P, Q$, not on the choice of $dV$ that is needed for defining the densities $p, q$. There is a continuum of measures of separation $I_\lambda(P:Q)$ for $0 \le \lambda \le 1$ that fills up the gap between $I_0 = I(Q:P)$ and $I_1 = I(P:Q)$. In fact, $I(P:Q)$ is the mean amount of information for discrimination of $P$ from $Q$ when sampling from $P$; $I(Q:P)$ is the same but when sampling from $Q$ instead; $I_\lambda(P:Q)$ is, sort of, the same when sampling from a mixture in between $P$ and $Q$. What's important is that these $I_\lambda$ are essentially the only quantities able to measure the intrinsic separation between probability distributions. In here we'll be considering only the two extremes $I_0$ and $I_1$.

Entropic Priors for the Exponential Family

When the distributions $P, Q$ are in the same exponential family we label them by their parameters $\theta$ and $\phi$ and compute,

$$I(\theta : \phi) = \sum_{j=1}^{k} \left( C_j(\theta) - C_j(\phi) \right) \tau_j - \log\frac{Z(\theta)}{Z(\phi)}$$

straight from the above definition, where,

$$\tau_j = E_\theta[T_j(x)] = E[T_j(x)|\theta] = \tau_j(\theta)$$

is the expected value of the $j$-th sufficient statistic when $x \sim p(\cdot|\theta)$.
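As a sanity check of this formula, the sketch below compares it against a brute-force evaluation of $\int p \log(p/q)$ for two Gaussians with a common known variance; all numerical values are arbitrary.

    import numpy as np
    from scipy.integrate import trapezoid

    sig2 = 2.0
    mu, phi = 1.0, -0.5                      # theta and phi (illustrative means)

    # exponential-family formula: I = (C1(mu) - C1(phi)) * tau1 - log(Z(mu)/Z(phi))
    C1 = lambda m: m / sig2
    logZ = lambda m: m**2 / (2.0 * sig2) + 0.5 * np.log(2.0 * np.pi * sig2)
    tau1 = mu                                # E[T1(x) | mu] = mu
    I_formula = (C1(mu) - C1(phi)) * tau1 - (logZ(mu) - logZ(phi))

    # brute force: integrate p log(p/q) on a grid
    logp = lambda x, m: -(x - m) ** 2 / (2.0 * sig2) - 0.5 * np.log(2.0 * np.pi * sig2)
    x = np.linspace(-20.0, 20.0, 200_001)
    I_direct = trapezoid(np.exp(logp(x, mu)) * (logp(x, mu) - logp(x, phi)), x)

    assert np.isclose(I_formula, (mu - phi) ** 2 / (2.0 * sig2))
    assert np.isclose(I_formula, I_direct)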

The 0-entropic family of prior distributions for $\theta$, with parameters $\alpha > 0$ and $\theta_0 \in \Theta$, is given by the following scalar densities with respect to the information volume in $\Theta$,

$$p(\theta|\alpha, \theta_0) \propto e^{-\alpha I_0(\theta : \theta_0)} = e^{-\alpha I(\theta_0 : \theta)} \propto \exp\left( \alpha \sum_{j=1}^{k} \tau_j C_j(\theta) - \alpha \log Z(\theta) \right)$$

where now $\tau_j = \tau_j(\theta_0)$. This is exactly the density of the natural conjugate prior for the exponential family with prior parameter $t = (\alpha, \alpha\tau) = \alpha(1, \tau_1, \tau_2, \ldots, \tau_k)$.

Example 1: $x|\mu \sim N(\mu, \sigma^2)$

In this case,

$$I(\mu_0 : \mu) = \left( \frac{\mu_0}{\sigma^2} - \frac{\mu}{\sigma^2} \right)\mu_0 - \log\frac{e^{\mu_0^2/2\sigma^2}}{e^{\mu^2/2\sigma^2}} = \frac{\mu_0^2}{\sigma^2} - \frac{\mu_0\,\mu}{\sigma^2} - \frac{\mu_0^2}{2\sigma^2} + \frac{\mu^2}{2\sigma^2} = \frac{\mu^2 - 2\mu_0\mu + \mu_0^2}{2\sigma^2}.$$

Thus,

$$I(\mu_0 : \mu) = I(\mu : \mu_0) = \frac{(\mu - \mu_0)^2}{2\sigma^2}$$

and the 0-entropic prior coincides with the 1-entropic prior. The density w.r.t. $d\mu$ is,

$$p(\mu|\alpha, \mu_0) \propto \exp\left( -\alpha\,\frac{(\mu - \mu_0)^2}{2\sigma^2} \right) \propto N(\mu_0, \sigma^2/\alpha).$$

Example 2a: 0-prior when $x|\sigma^2 \sim N(\mu, \sigma^2)$

For this case we have, $C_1 = -1/(2\sigma^2)$, $T_1 = (x-\mu)^2$ and $Z = \sqrt{2\pi\sigma^2}$. Hence,

$$I(\sigma_0^2 : \sigma^2) = \left( \frac{1}{2\sigma^2} - \frac{1}{2\sigma_0^2} \right)\sigma_0^2 - \frac{1}{2}\log\frac{\sigma_0^2}{\sigma^2} = \frac{\sigma_0^2}{2\sigma^2} - \frac{1}{2} - \frac{1}{2}\log\frac{\sigma_0^2}{\sigma^2}$$

and the element of probability for the 0-entropic prior is computed with $dV = d\sigma^2/\sigma^2$ as,

$$\exp\left( -\alpha I(\sigma_0^2 : \sigma^2) \right) \frac{d\sigma^2}{\sigma^2} \propto (\sigma^2)^{-\alpha/2} \exp\left( -\frac{\alpha\sigma_0^2}{2\sigma^2} \right) \frac{d\sigma^2}{\sigma^2} \propto \chi^{-2}(\alpha, \sigma_0^2)\, d\sigma^2$$

and as expected, coincides with the previously obtained natural conjugate prior for this case. Recall that the conjugate posterior for this case is $\chi^{-2}(n+\alpha, \hat{\sigma}^2_{n+\alpha})$, where we have written $\sigma_n^2$ with a hat and with index $n+\alpha$ to make explicit the fact that it is the variance associated to the sample extended by the $\alpha$ virtual points.

Example 2b: 1-prior when $x|\sigma^2 \sim N(\mu, \sigma^2)$

By interchanging $\sigma^2$ with $\sigma_0^2$ in the previous formula for the entropy we obtain the element of probability for the 1-entropic prior,

$$\exp\left( -\alpha I(\sigma^2 : \sigma_0^2) \right) \frac{d\sigma^2}{\sigma^2} \propto (\sigma^2)^{\alpha/2} \exp\left( -\frac{\alpha\sigma^2}{2\sigma_0^2} \right) \frac{d\sigma^2}{\sigma^2} \propto \chi^{2}(\alpha, \sigma_0^2)\, d\sigma^2$$

where we say that $\sigma^2 \sim \chi^2(\alpha, \sigma_0^2)$ when the density (w.r.t. the usual $d\sigma^2$) of $\alpha\sigma^2/\sigma_0^2$ is $\chi^2_\alpha$, i.e., a chi-square with $\alpha$ degrees of freedom. By bayes theorem, the 1-posterior (i.e., the posterior when the 1-prior is used) w.r.t. the usual $d\sigma^2$ has the form,

$$p(\sigma^2|x^n, \alpha, \sigma_0^2) \propto (\sigma^2)^{\frac{\alpha - n}{2} - 1} \exp\left( -\frac{n\hat{\sigma}_n^2}{2\sigma^2} - \frac{\alpha\sigma^2}{2\sigma_0^2} \right)$$

where $n\hat{\sigma}_n^2 = \sum_{i=1}^{n} (x_i - \mu)^2$.
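To see the anti-data effect numerically, the sketch below normalizes the 0-posterior (the conjugate $\chi^{-2}$, based on $n + \alpha$ points) and the 1-posterior (the density above, effectively based on $n - \alpha$ points) on a grid and compares their posterior means; every number is an arbitrary illustration.

    import numpy as np
    from scipy.integrate import trapezoid

    alpha, s0sq = 4.0, 1.0                   # prior strength and prior scale
    n, ns2hat = 10, 12.0                     # n and n * sigma_hat_n^2 = sum (x_i - mu)^2

    v = np.linspace(0.05, 30.0, 20_000)      # grid for the variance

    # 0-posterior: v^{-(alpha+n)/2 - 1} exp(-(alpha*s0sq + ns2hat) / (2v))
    log_p0 = -(alpha + n) / 2.0 * np.log(v) - np.log(v) - (alpha * s0sq + ns2hat) / (2.0 * v)
    # 1-posterior: v^{(alpha-n)/2 - 1} exp(-ns2hat / (2v) - alpha * v / (2 * s0sq))
    log_p1 = (alpha - n) / 2.0 * np.log(v) - np.log(v) - ns2hat / (2.0 * v) - alpha * v / (2.0 * s0sq)

    def normalize(logp):
        p = np.exp(logp - logp.max())
        return p / trapezoid(p, v)

    p0, p1 = normalize(log_p0), normalize(log_p1)
    print("posterior means:", trapezoid(v * p0, v), trapezoid(v * p1, v))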

The family of GIGs

The above distribution is known as a Generalized Inverse Gaussian (or GIG). A $\mathrm{GIG}(a, b, c)$ has density proportional to

$$\lambda^{a-1} \exp\left( -\frac{b}{\lambda} - c\lambda \right)$$

defined for $\lambda > 0$, and it is normalizable whenever $a \in \mathbb{R}$, $b > 0$ and $c > 0$. When either $b = 0$ or $c = 0$, but not both zero, the GIG becomes a gamma or an inverse-gamma. The normalization constant involves the BesselK function and it is expensive to compute. It is easy to see that the GIGs have the following property,

$$\lambda \sim \mathrm{GIG}(a, b, c) \iff 1/\lambda \sim \mathrm{GIG}(-a, c, b).$$

When $a > 0$ and $c > 0$, the best second order $\mathrm{Gamma}(\alpha, \beta)$ approximation to a $\mathrm{GIG}(a, b, c)$ is obtained by matching the quadratic Taylor polynomials of the log likelihoods expanded about the mode of the GIG. The values of $\alpha$ and $\beta$ are given by the simple formulas,

$$\alpha = a + \frac{2b}{m}, \qquad \beta = \frac{\alpha - 1}{m}$$

where $m$ is the mode of the $\mathrm{GIG}(a, b, c)$, located at,

$$m = \frac{a - 1 + \sqrt{(a-1)^2 + 4bc}}{2c}.$$

Furthermore, when $2b/m$