Geometry of F-likelihood Estimators and F-Max-Ent Theorem
Harsha K V and Subrahamanian Moosath K S
Department of Mathematics, Indian Institute of Space Science and Technology
26 September 2014
1. Introduction

On a statistical manifold $S$, the Fisher information acts as a Riemannian metric, called the Fisher information metric $g$. Amari defined a one-parameter family of connections $\nabla^{\alpha}$, called $\alpha$-connections, using the $\alpha$-embeddings

$$L_{\alpha}(p) = \begin{cases} \dfrac{2}{1-\alpha}\, p^{\frac{1-\alpha}{2}}, & \alpha \neq 1, \\ \log p, & \alpha = 1. \end{cases}$$

The connections $\nabla^{\alpha}$ and $\nabla^{-\alpha}$ are dual with respect to the Fisher information metric $g$.

An exponential family is flat with respect to the $\nabla^{1}$-connection (also known as the exponential connection), and by duality it is also $(-1)$-flat. Hence $(g, \nabla^{1}, \nabla^{-1})$ is a dually flat structure on an exponential family.

Amari has defined an $\alpha$-family for any $\alpha \in \mathbb{R}$. But for $\alpha \neq 1$, an $\alpha$-family is not flat with respect to the $\alpha$-connection.
A $q$-exponential family extends the notion of an exponential family. A $q$-exponential family, which is an $\alpha$-family with $\alpha = 1 - 2q$, has a dually flat structure called the $q$-structure. This $q$-geometry is obtained by the conformal flattening of the $\alpha$-geometry.

Naudts generalized the notion of an exponential family to a large class of families of probability distributions, called the $\phi$-exponential family, and studied its dually flat structure. An information geometric foundation for the deformed exponential family ($\chi$-family) is given by Amari et al.

We extended Amari's $\alpha$-geometry to a new geometry, called $F$-geometry, using an embedding function $F$ of $S$ into the space of random variables $\mathbb{R}^{X}$. When $F$ is one of Amari's $\alpha$-embeddings, the $F$-geometry reduces to the $\alpha$-geometry. In this general embedding case a dualistic structure $(g, \nabla^{F}, \nabla^{H})$ can also be introduced on $S$, where $H$ is the dual embedding of $F$.
Further, we introduced the $(F, G)$-geometry using the embedding $F$ and a positive smooth function $G$. From the idea of $(F, G)$-geometry, we consider an $F$-exponential family, an extension of the $q$-exponential family, which is not flat with respect to the $F$-connection $\nabla^{F}$. A dually flat structure can be defined on an $F$-exponential family by the conformal flattening of the $(F, G)$-geometry.

Using the function $F$ and its inverse, we also generalize the notion of independence to $F$-independence. We then define the $F$-likelihood function and $F$-likelihood estimators and discuss their geometry. An analytic proof of the $F$-version of the max-ent theorem is outlined.
2. F-Geometry

Let $S$ be an $n$-dimensional statistical manifold. Let $F : (0, \infty) \to \mathbb{R}$ be an injective function which is at least twice differentiable. Then $F$ is an embedding of $S$ into $\mathbb{R}^{X}$ which takes each $p(x; \xi) \mapsto F(p(x; \xi))$.

The metric induced by the embedding $F$ is the Fisher information metric $g$, defined by

$$g_{ij}(\theta) = \int \partial_i \ell \, \partial_j \ell \; p(x; \theta)\, dx, \qquad \ell(x; \theta) = \log p(x; \theta).$$

The affine connection induced by the embedding $F$, the $F$-connection, is defined by

$$\Gamma^{F}_{ijk}(\xi) = \int \left( \partial_i \partial_j \ell + \left( 1 + \frac{p F''(p)}{F'(p)} \right) \partial_i \ell \, \partial_j \ell \right) (\partial_k \ell)\; p \, dx.$$

Remark 2.1 Amari's $\alpha$-geometry is a special case of the $F$-geometry.
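Remark 2.1 can be checked numerically. The following is a minimal sketch, assuming the $\alpha$-embedding $L_{\alpha}$ from the introduction is used as $F$; the function names are illustrative and not from the talk. It confirms that the coefficient $1 + pF''(p)/F'(p)$ in the $F$-connection equals $(1-\alpha)/2$, the coefficient that appears in Amari's $\alpha$-connection.

```python
import math

def alpha_embedding_derivs(p, alpha):
    """Return (F'(p), F''(p)) when F is Amari's alpha-embedding (hypothetical helper)."""
    if alpha == 1:
        return 1.0 / p, -1.0 / p**2              # F(p) = log p
    e = (1.0 - alpha) / 2.0                       # F(p) = p**e / e
    return p ** (e - 1.0), (e - 1.0) * p ** (e - 2.0)

def connection_weight(p, alpha):
    """Coefficient 1 + p F''(p) / F'(p) multiplying d_i l * d_j l in the F-connection."""
    d1, d2 = alpha_embedding_derivs(p, alpha)
    return 1.0 + p * d2 / d1

# The weight is independent of p and equals (1 - alpha) / 2.
for alpha in (-1.0, 0.0, 0.5, 1.0):
    for p in (0.1, 1.0, 3.7):
        assert math.isclose(connection_weight(p, alpha), (1.0 - alpha) / 2.0, abs_tol=1e-12)
```

The derivatives are written in closed form rather than computed by finite differences, so the check is exact up to floating-point rounding.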
3. (F, G)-Geometry

Definition 3.1 Let $G : (0, \infty) \to \mathbb{R}$ be a positive smooth function. Define the $G$-metric as

$$g^{G}_{ij}(\theta) = \int \partial_i \ell \, \partial_j \ell \; G(p)\, p \, dx.$$

Using the embedding $F$ and the function $G$, define the $(F, G)$-connection $\nabla^{F,G}$ as

$$\Gamma^{F,G}_{ijk} = \int \left( \partial_i \partial_j \ell + \left( 1 + \frac{p F''(p)}{F'(p)} \right) \partial_i \ell \, \partial_j \ell \right) (\partial_k \ell)\; G(p)\, p \, dx.$$

Theorem 3.2 The $(F, G)$-connection $\nabla^{F,G}$ and the $(H, G)$-connection $\nabla^{H,G}$ are dual connections with respect to the $G$-metric iff the functions $F$ and $H$ satisfy

$$H'(p) = \frac{G(p)}{p F'(p)}.$$

We call such an embedding $H$ a $G$-dual embedding of $F$.
4. F-exponential family

Definition 4.1 Let $F : (0, \infty) \to \mathbb{R}$ be any smooth increasing concave function. Let $Z$ be the inverse function of $F$. The standard form of an $n$-dimensional $F$-exponential family of distributions $S = \{ p(x; \theta) \mid \theta \in E \subseteq \mathbb{R}^n \}$ is written as

$$p(x; \theta) = Z\!\left( \sum_{i=1}^{n} \theta^i x_i - \psi_F(\theta) \right) \quad \text{or} \quad F(p(x; \theta)) = \sum_{i=1}^{n} \theta^i x_i - \psi_F(\theta),$$

where $x = (x_1, \dots, x_n)$ is a set of random variables and $\theta = (\theta^1, \dots, \theta^n)$ are the canonical parameters. $\psi_F(\theta)$ is called the $F$-free energy or the $F$-potential and is determined from the normalization condition

$$\int Z\!\left( \sum_{i=1}^{n} \theta^i x_i - \psi_F(\theta) \right) dx = 1.$$

Define a functional $h_F(\theta)$ as

$$h_F(\theta) = \int \frac{1}{F'(p(x; \theta))}\, dx.$$
Theorem 4.2 The $F$-potential function $\psi_F(\theta)$ is a convex function of $\theta$ and

$$\partial_i \partial_j \psi_F(\theta) = \frac{1}{h_F(\theta)} \int \frac{-p F''(p)}{F'(p)} \, \frac{1}{p}\, \partial_i p \, \partial_j p \, dx.$$

Definition 4.3 Define a Riemannian metric, called the $F$-metric $g^{F}$, by

$$g^{F}_{ij}(\theta) = \partial_i \partial_j \psi_F(\theta).$$

Note that $(g^{F}_{ij})$ is positive definite since $\psi_F$ is a convex function of $\theta$.

Define a divergence of Bregman type using $\psi_F(\theta)$, called the $F$-divergence, as

$$D_F[p(x; \theta_1) : p(x; \theta_2)] = \psi_F(\theta_2) - \psi_F(\theta_1) - \nabla \psi_F(\theta_1) \cdot (\theta_2 - \theta_1).$$

Let $p$ and $r$ be the two distributions parametrized by $\theta_1$ and $\theta_2$ respectively. Then the $F$-divergence can be written as

$$D_F[p : r] = \frac{1}{h_F(\theta_1)} \int \bigl( F(p) - F(r) \bigr) \frac{1}{F'(p)}\, dx.$$
Definition 4.4 For a density function $p$ parametrized by $\theta$, define the $F$-escort probability distribution of $p$ as

$$\hat{p}_F(x) = \frac{1}{h_F(\theta)\, F'(p)}.$$

Using $\hat{p}_F$, define the $\hat{F}$-expectation of a random variable as

$$E_{\hat{p}}(f(x)) = \frac{1}{h_F(\theta)} \int \frac{1}{F'(p)}\, f(x)\, dx.$$

Then the $F$-divergence can be written as

$$D_F[p : r] = E_{\hat{p}}\bigl( F(p) - F(r) \bigr).$$

Lemma 4.5 The metric $g^{D_F}_{ij}$ and the affine connection $\nabla^{D_F}$ induced by the $F$-divergence $D_F$ are given by

$$g^{D_F}_{ij}(\theta) = g^{F}_{ij}(\theta) = \partial_i \partial_j \psi_F(\theta); \qquad \Gamma^{D_F}_{ijk} = \partial_i \partial_j \partial_k \psi_F(\theta).$$

The dual $D_F^{*}$ of $D_F$ induces an affine connection $\nabla^{D_F^{*}}$ defined by $\Gamma^{D_F^{*}}_{ijk} = 0$.
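The escort distribution of Definition 4.4 can be illustrated on a finite sample space. This is a minimal sketch under two assumptions that are not in the talk: integrals are replaced by sums over a discrete distribution, and $F$ is taken to be the $q$-logarithm $\ln_q p = (p^{1-q} - 1)/(1-q)$, whose derivative is $p^{-q}$. All function names are illustrative.

```python
def q_log_prime(p, q):
    """Derivative of the q-logarithm: d/dp ln_q(p) = p**(-q)."""
    return p ** (-q)

def escort(probs, F_prime):
    """Discrete analogue of the F-escort distribution.

    Returns (escort probabilities, h_F), where h_F = sum_x 1 / F'(p(x)).
    Assumes every entry of probs is strictly positive.
    """
    weights = [1.0 / F_prime(p) for p in probs]
    h_F = sum(weights)
    return [w / h_F for w in weights], h_F

def escort_expectation(values, probs, F_prime):
    """Discrete analogue of E_phat(f) = (1 / h_F) * sum_x f(x) / F'(p(x))."""
    esc, _ = escort(probs, F_prime)
    return sum(v * e for v, e in zip(values, esc))

probs = [0.5, 0.3, 0.2]
esc, h = escort(probs, lambda p: q_log_prime(p, 2.0))
# For the q-logarithm the escort weights are proportional to p**q.
```

For $q = 1$ the derivative is $1/p$, so $h_F = 1$ and the escort distribution coincides with $p$ itself.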
5. Dually flat structure of F-Exponential family

The Legendre transformation of the convex function $\psi_F(\theta)$ is given by

$$\eta_i = \partial_i \psi_F(\theta).$$

The dual potential function $\varphi_F$ is called the negative $F$-entropy and is given by

$$\varphi_F(\eta) = \max_{\theta} \{ \theta \cdot \eta - \psi_F(\theta) \} = E_{\hat{p}}(F(p)) = \frac{1}{h_F(\theta)} \int \frac{F(p)}{F'(p)}\, dx.$$

We have

$$\eta_i = \partial_i \psi_F(\theta) = E_{\hat{p}}(x_i); \qquad \partial_i \eta_j = \partial_i \partial_j \psi_F(\theta) = g^{F}_{ij}(\theta).$$

With respect to the dual coordinate system $(\eta_j)$, the metric and the dual connections are given by

$$\tilde{g}^{D_F}_{ij}(\eta) = \partial^i \partial^j \varphi_F(\eta); \qquad \tilde{\Gamma}^{D_F}_{ijk}(\eta) = 0; \qquad \tilde{\Gamma}^{D_F^{*}}_{ijk}(\eta) = \partial^i \partial^j \partial^k \varphi_F(\eta).$$
6. Conformal flattening of (F, G)-geometry

Definition 6.1 Two statistical manifolds $(M, \nabla, h)$ and $(M, \tilde{\nabla}, \tilde{h})$ are said to be $\beta$-conformally equivalent if there exists a positive function $K$ on $M$ such that

$$\tilde{h}(X, Y) = K\, h(X, Y),$$

$$\tilde{h}(\tilde{\nabla}_X Y, Z) = K\, h(\nabla_X Y, Z) - \frac{1+\beta}{2}\, h(X, Y)\, dK(Z) + \frac{1-\beta}{2} \bigl\{ h(Y, Z)\, dK(X) + h(X, Z)\, dK(Y) \bigr\}.$$

In terms of the basis vectors, we can rewrite the above expressions as

$$\tilde{h}(\partial_i, \partial_j) = \tilde{h}_{ij} = K\, h(\partial_i, \partial_j) = K\, h_{ij},$$

$$\tilde{\Gamma}^{\beta}_{ijk} = K\, \Gamma_{ijk} - \frac{1+\beta}{2}\, h_{ij}\, \partial_k K + \frac{1-\beta}{2} \bigl\{ h_{jk}\, \partial_i K + h_{ik}\, \partial_j K \bigr\}.$$

The dually flat structure on an $F$-exponential family is obtained by the conformal flattening of the $(F, G)$-geometry.
Theorem 6.2 The metric $g^{D_F}_{ij}$ induced by the $F$-divergence $D_F$ is obtained by the conformal flattening of the $G$-metric $g^{G}_{ij}$ by a gauge function

$$K(\theta) = \frac{1}{h_F(\theta)}, \qquad \text{with } G(p) = \frac{-p F''(p)}{F'(p)}.$$

Proof:

$$g^{D_F}_{ij}(\theta) = \partial_i \partial_j \psi_F(\theta) = \frac{1}{h_F(\theta)} \int \frac{-p F''(p)}{F'(p)}\, \frac{1}{p}\, \partial_i p \, \partial_j p \, dx = K(\theta)\, g^{G}_{ij},$$

where $K(\theta) = \dfrac{1}{h_F(\theta)}$ and $g^{G}_{ij} = \displaystyle\int \partial_i p \, \partial_j p \, \frac{G(p)}{p}\, dx$ is the $G$-metric with $G(p) = \dfrac{-p F''(p)}{F'(p)}$.

Theorem 6.3 The affine connection $\nabla^{D_F}$ induced by $D_F$ is the $(-1)$-conformal transformation of the $(H, G)$-connection $\nabla^{H,G}$ by the gauge function $K(\theta) = \dfrac{1}{h_F(\theta)}$, where $G(p) = \dfrac{-p F''(p)}{F'(p)}$ and $H$ is the $G$-dual embedding of $F$.
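Proof: The components of the connection $\nabla^{D_F}$ are given by

$$\begin{aligned}
\Gamma^{D_F}_{ijk} ={}& \frac{1}{h_F(\theta)} \int \left( \frac{-p F''(p)}{F'(p)} - \frac{p^2 F'''(p)}{F'(p)} + \frac{2 p^2 (F''(p))^2}{(F'(p))^2} \right) \partial_i \ell \, \partial_j \ell \, \partial_k \ell \; p \, dx \\
&+ \frac{1}{h_F(\theta)} \int \left( \frac{-p F''(p)}{F'(p)} \right) \partial_i \partial_j \ell \, \partial_k \ell \; p \, dx \\
&+ \frac{1}{h_F(\theta)} \int \frac{p F''(p)}{(F'(p))^2}\, \partial_i \ell \, dx \;\; \partial_j \partial_k \psi_F(\theta) \\
&+ \frac{1}{h_F(\theta)} \int \frac{p F''(p)}{(F'(p))^2}\, \partial_j \ell \, dx \;\; \partial_i \partial_k \psi_F(\theta) \\
={}& K(\theta)\, \Gamma^{H,G}_{ijk} + \partial_j K(\theta)\, g^{G}_{ik}(\theta) + \partial_i K(\theta)\, g^{G}_{jk}(\theta),
\end{aligned}$$

with $G(p) = \dfrac{-p F''(p)}{F'(p)}$ and $K(\theta) = \dfrac{1}{h_F(\theta)}$.

Theorem 6.4 The affine connection $\nabla^{D_F^{*}}$ induced by $D_F^{*}$ is the $1$-conformal transformation of the $(F, G)$-connection $\nabla^{F,G}$ by a gauge function $K(\theta) = \dfrac{1}{h_F(\theta)}$, where $G(p) = \dfrac{-p F''(p)}{F'(p)}$.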
7. F-likelihood estimator

Definition 7.1 Let $F$ be an increasing concave function and let $Z$ be its inverse function. Then the $F$-product of two numbers $x, y$ is defined as

$$x \otimes_F y = Z[F(x) + F(y)].$$

The $F$-product satisfies the following properties:

$$Z(x) \otimes_F Z(y) = Z(x + y), \qquad F(x \otimes_F y) = F(x) + F(y).$$

Definition 7.2 Two random variables $X$ and $Y$ are said to be $F$-independent with normalization if the joint probability density function $p_F(x, y)$ is given by the $F$-product of the marginal probability density functions $p_1(x)$ and $p_2(y)$:

$$p_F(x, y) = \frac{p_1(x) \otimes_F p_2(y)}{Z_{p_1, p_2}},$$

where $Z_{p_1, p_2}$ is the normalization defined by

$$Z_{p_1, p_2} = \iint_{\Omega_1 \Omega_2} p_1(x) \otimes_F p_2(y)\, dx\, dy.$$
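The $F$-product of Definition 7.1 can be sketched in code. The example below assumes $F$ is the $q$-logarithm $\ln_q x = (x^{1-q} - 1)/(1-q)$ with inverse $Z = \exp_q$; this choice and the function names are illustrative assumptions, not part of the talk. For $q = 1$ the $F$-product reduces to the ordinary product.

```python
import math

def ln_q(x, q):
    """q-logarithm; reduces to math.log for q == 1."""
    if q == 1.0:
        return math.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

def exp_q(u, q):
    """Inverse of ln_q; defined only where 1 + (1 - q) * u > 0."""
    if q == 1.0:
        return math.exp(u)
    base = 1.0 + (1.0 - q) * u
    if base <= 0.0:
        raise ValueError("argument lies outside the range of ln_q")
    return base ** (1.0 / (1.0 - q))

def f_product(x, y, q):
    """F-product x (*)_F y = Z[F(x) + F(y)] with F = ln_q and Z = exp_q."""
    return exp_q(ln_q(x, q) + ln_q(y, q), q)

x, y = 0.3, 0.5
# Property F(x (*)_F y) = F(x) + F(y), checked for q = 2.
assert math.isclose(ln_q(f_product(x, y, 2.0), 2.0), ln_q(x, 2.0) + ln_q(y, 2.0))
# For q = 1 the F-product is the ordinary product.
assert math.isclose(f_product(x, y, 1.0), x * y)
```

The explicit domain check in `exp_q` matters because $F(x) + F(y)$ need not lie in the range of $F$, in which case the $F$-product is undefined for that pair.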
8. The geometry of F-likelihood estimators

Let $S = \{ p(x; \theta) \mid \theta \in E \subseteq \mathbb{R}^n \}$ be an $n$-dimensional statistical manifold defined on a sample space $\Omega \subseteq \mathbb{R}$ and let $\{x^1, \dots, x^N\}$ be $N$ independent observations from $p(x; \theta) \in S$.

Definition 8.1 The $F$-likelihood function $L_F(\theta)$ is defined as

$$L_F(\theta) = p(x^1; \theta) \otimes_F \cdots \otimes_F p(x^N; \theta).$$

Since $F$ is an increasing function, it is equivalent to consider $F(L_F(\theta))$. By the property $F(x \otimes_F y) = F(x) + F(y)$,

$$F(L_F(\theta)) = F\bigl( p(x^1; \theta) \otimes_F \cdots \otimes_F p(x^N; \theta) \bigr) = \sum_{i=1}^{N} F(p(x^i; \theta)).$$

Definition 8.2 A maximum $F$-likelihood estimator $\hat{\theta}$ is defined as

$$\hat{\theta} = \arg\max_{\theta \in E} L_F(\theta) = \arg\max_{\theta \in E} F(L_F(\theta)).$$
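Definitions 8.1 and 8.2 can be illustrated with a small numerical sketch. It rests on several assumptions that are not in the talk: $F$ is the $q$-logarithm, the model is a one-parameter Bernoulli family on a discrete sample space (the talk works with densities on $\Omega \subseteq \mathbb{R}$), and the maximization is a plain grid search. The function names are hypothetical.

```python
import math

def ln_q(x, q):
    """q-logarithm; reduces to math.log for q == 1."""
    if q == 1.0:
        return math.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

def f_log_likelihood(theta, data, q):
    """F(L_F(theta)) = sum_i F(p(x^i; theta)) for a Bernoulli(theta) model."""
    total = 0.0
    for x in data:
        p = theta if x == 1 else 1.0 - theta
        total += ln_q(p, q)
    return total

def max_f_likelihood(data, q, grid_size=1000):
    """Grid-search maximizer of F(L_F(theta)) over theta in (0, 1)."""
    grid = [k / grid_size for k in range(1, grid_size)]
    return max(grid, key=lambda t: f_log_likelihood(t, data, q))

data = [1, 1, 0, 1]
theta_mle = max_f_likelihood(data, 1.0)   # q = 1: ordinary maximum likelihood
theta_q2 = max_f_likelihood(data, 2.0)    # q = 2: a different maximizer in general
```

With $q = 1$ the objective is the usual log-likelihood, so the estimator is the sample mean; for other $q$ the maximizer generally moves, which is the sense in which the $F$-likelihood estimator generalizes the MLE.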
Let $S = \{ p(x; \theta) \mid \theta \in E \subseteq \mathbb{R}^n \}$ be an $F$-exponential family and let $M$ be a curved $F$-exponential family in $S$. Let $\{x^1, \dots, x^N\}$ be $N$ independent observations from a probability density function $p(x; u) = p(x; \theta(u)) \in M$.

Theorem 8.3 The $F$-likelihood estimator for $M$ is the orthogonal projection of that of $S$ to the submanifold $M$ with respect to the connection $\nabla^{D_F^{*}}$.

Proof: The $F$-likelihood function is given by

$$F(L_F(u)) = \sum_{j=1}^{N} F(p(x^j; u)) = \sum_{j=1}^{N} \left[ \sum_{i=1}^{n} \theta^i(u)\, x_i^j - \psi_F(\theta(u)) \right] = \sum_{i=1}^{n} \theta^i(u) \sum_{j=1}^{N} x_i^j - N \psi_F(\theta(u)).$$

Differentiating with respect to $\theta^i$ gives

$$\partial_i F(L_F(u)) = \sum_{j=1}^{N} x_i^j - N\, \partial_i \psi_F(\theta(u)).$$
Thus the maximum $F$-likelihood estimator for $S$ is given by

$$\hat{\eta}_i = \frac{1}{N} \sum_{j=1}^{N} x_i^j.$$

The canonical divergence $D_F^{*}$ for $S$ can be calculated as

$$D_F^{*}[p(\theta(u)); p(\hat{\eta})] = D_F[p(\hat{\eta}); p(\theta(u))] = \psi_F(\theta(u)) + \varphi_F(\hat{\eta}) - \sum_{i=1}^{n} \theta^i(u)\, \hat{\eta}_i = \varphi_F(\hat{\eta}) - \frac{1}{N} F(L_F(u)).$$

Hence the $F$-likelihood is maximized exactly when the canonical divergence (the dual of the $F$-divergence) is minimized. Equivalently, by the projection theorem, the $F$-likelihood estimator for $M$ is the orthogonal projection of $\hat{\eta}$ to the submanifold $M$ with respect to the connection $\nabla^{D_F^{*}}$.
9. F-Max-Ent Theorem

Definition 9.1 For any probability density function $p(x)$, the $F$-entropy is defined as

$$H_F(p) = -E_{\hat{p}}(F(p)) = \frac{1}{h_F(p)} \int \frac{-F(p)}{F'(p)}\, dx. \qquad (1)$$

When $F(p) = \ln_q p$, the $q$-logarithm, $H_F(p)$ reduces to the $q$-entropy

$$H_q(p) = \frac{1}{1-q} \left( 1 - \frac{1}{h_q(p)} \right),$$

and when $F(p) = \ln p$, $H_F(p)$ reduces to the Shannon entropy $H(p) = -\int p(x) \ln p(x)\, dx$.
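Both reductions of the $F$-entropy can be checked numerically. The sketch below assumes a finite sample space, so the integrals in (1) become sums, and takes $\ln_q p = (p^{1-q} - 1)/(1-q)$ with $h_q(p) = \sum_x p(x)^q$; the discrete setting and the function names are assumptions made for illustration.

```python
import math

def f_entropy(probs, F, F_prime):
    """Discrete analogue of H_F(p) = (1 / h_F) * sum_x -F(p(x)) / F'(p(x)).

    Assumes every entry of probs is strictly positive.
    """
    weights = [1.0 / F_prime(p) for p in probs]
    h_F = sum(weights)
    return -sum(w * F(p) for w, p in zip(weights, probs)) / h_F

def ln_q(p, q):
    """q-logarithm for q != 1."""
    return (p ** (1.0 - q) - 1.0) / (1.0 - q)

def ln_q_prime(p, q):
    """Derivative of the q-logarithm."""
    return p ** (-q)

probs = [0.5, 0.3, 0.2]
q = 2.0

# F = ln: the F-entropy is the Shannon entropy.
shannon = f_entropy(probs, math.log, lambda p: 1.0 / p)
assert math.isclose(shannon, -sum(p * math.log(p) for p in probs))

# F = ln_q: the F-entropy is (1 / (1 - q)) * (1 - 1 / h_q).
h_q = sum(p ** q for p in probs)
q_entropy = f_entropy(probs, lambda p: ln_q(p, q), lambda p: ln_q_prime(p, q))
assert math.isclose(q_entropy, (1.0 - 1.0 / h_q) / (1.0 - q))
```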
Theorem 9.2 Probability distributions maximizing the $F$-entropy $H_F$ under the $F$-linear constraints

$$E_{\hat{p}_F}[c_k(x)] = a_k; \quad k = 1, \dots, m \qquad (2)$$

for $m$ random variables $c_k(x)$ and various values of $a_k \in \mathbb{R}$ form an $m$-dimensional $F$-exponential family

$$F(p(x; \theta)) = \sum_{i=1}^{m} \theta^i c_i(x) - \psi(\theta).$$
Proof: We use the method of Lagrange multipliers and the calculus of variations. Our aim is to maximize

$$H_F(p) = \frac{1}{h_F(p)} \int \frac{-F(p)}{F'(p)}\, dx$$

subject to the $m$ constraints

$$E_{\hat{p}_F}[c_k(x)] = \frac{1}{h_F(p)} \int \frac{c_k(x)}{F'(p)}\, dx = a_k; \quad k = 1, \dots, m.$$

Consider

$$L(p, \lambda_0, \lambda_1, \dots, \lambda_m) = \frac{1}{h_F(p)} \int_0^{\infty} \frac{-F(p)}{F'(p)}\, dx + \lambda_0 \int_0^{\infty} p \, dx + \sum_{i=1}^{m} \lambda_i \frac{1}{h_F(p)} \int_0^{\infty} \frac{c_i(x)}{F'(p)}\, dx - \lambda_0 - \sum_{i=1}^{m} \lambda_i a_i.$$

At the maximum $F$-entropy distribution we have $\dfrac{dL}{dp} = 0$.
Using this we get $\lambda_0 = \dfrac{1}{h_F(p)}$ and

$$\begin{aligned}
F(p) &= \sum_{i=1}^{m} \lambda_i (c_i(x) - a_i) + \frac{1}{h_F(p)} \int_0^{\infty} \frac{F(p)}{F'(p)}\, dx \\
&= \sum_{i=1}^{m} \lambda_i (c_i(x) - a_i) - H_F(p) \\
&= \sum_{i=1}^{m} \lambda_i c_i(x) - \sigma(\lambda_i, a_i),
\end{aligned}$$

where $\sigma(\lambda_i, a_i) = \sum_{i=1}^{m} \lambda_i a_i + H_F(p)$.

Using the $m$ constraints we can solve for $\lambda_i$ to get $\lambda_i = -\dfrac{dH_F(p)}{da_i}$, and the $\lambda_i$'s are the canonical coordinates for the $F$-exponential family. Hence the $a_i$'s are the dual coordinates of the canonical coordinates $\lambda_i$. Using the dual coordinates $\lambda_i$, $a_i$ and their potential functions, $F(p)$ takes the form of an $F$-exponential family. Thus

$$F(p(x; \theta)) = \sum_{i=1}^{m} \theta^i c_i(x) - \psi(\theta).$$
Conclusion
Using a general embedding $F$ and a positive smooth function $G$, the $(F, G)$-geometry is introduced and the geometry of the $F$-exponential family is studied. The dually flat structure of the $F$-exponential family is obtained by the conformal flattening of the $(F, G)$-geometry. Using a generalized notion of independence called $F$-independence, we defined the $F$-likelihood function. Further, the geometry of the $F$-likelihood estimators is discussed. A generalized notion of entropy called the $F$-entropy is given and an analytic proof of the $F$-version of the max-ent theorem is outlined.

Future work

One can further explore the asymptotic behavior of the MLE of the $F$-escort probability density function and its various applications.
Acknowledgement
We would like to express our sincere gratitude to Prof. Shun-ichi Amari for his valuable suggestions and fruitful discussions.
References

H. Matsuzoe and A. Ohara, "Geometry for q-exponential families", Proceedings of the 2nd International Colloquium on Differential Geometry and its Related Fields, Veliko Tarnovo, September 6-10, 2010.
S.I. Amari and A. Ohara, "Geometry of q-Exponential Family of Probability Distributions", Entropy, 13, 1170-1185 (2011).
J. Naudts, "Deformed exponentials and logarithms in generalized thermostatistics", Physica A, 316, 323-334 (2002).
J. Naudts, "Estimators, escort probabilities, and phi-exponential families in statistical physics", J. Ineq. Pure Appl. Math., 5, 102 (2004).
J. Naudts, Generalised Thermostatistics, Springer, London, UK, 2011.
S.I. Amari, A. Ohara, and H. Matsuzoe, "Geometry of deformed exponential families: Invariant, dually flat and conformal geometries", Physica A: Statistical Mechanics and its Applications, 391, 4308-4319 (2012).
K.V. Harsha and K.S. Subrahamanian Moosath, "F-geometry and Amari's α-geometry on a statistical manifold", Entropy, 16(5), 2472-2487 (2014).
T. Kurose, "On the Divergence of 1-conformally Flat Statistical Manifolds", Tohoku Math. J., 46, 427-433 (1994).
T. Kurose, "Conformal-projective geometry of statistical manifolds", Interdisciplinary Information Sciences, 8, 89-100 (2002).
A. Ohara, H. Matsuzoe, and S.I. Amari, "A dually flat structure on the space of escort distributions", Journal of Physics: Conference Series, 201(01), 2010.
Y. Fujimoto and N. Murata, "A generalization of Independence in Naive Bayes Model", Lecture Notes in Computer Science, 153-161 (2010).
S.I. Amari and H. Nagaoka, Methods of Information Geometry, Translations of Mathematical Monographs, Oxford University Press, Oxford, UK, 2000.
Thank You