deBruijn identities: from Shannon, Kullback–Leibler and Fisher to generalized φ-entropies, φ-divergences and φ-Fisher informations

Steeve Zozor and Jean-Marc Brossier
GIPSA-Lab, 11 rue des mathématiques, 38420 Saint Martin d'Hères, France

Abstract. In this paper we propose a generalization of the usual deBruijn identity that links the Shannon differential entropy (or the Kullback–Leibler divergence) and the Fisher information (or the Fisher divergence) of the output of a Gaussian channel. The generalization makes use of φ-entropies on the one hand, and of φ-divergences (of the Csiszár class) on the other hand, as generalizations of the Shannon entropy and of the Kullback–Leibler divergence respectively. The generalized deBruijn identities induce the definition of generalized Fisher informations and generalized Fisher divergences; some such generalizations exist in the literature. Moreover, we provide results that go beyond the Gaussian channel: we are then able to characterize a noisy channel using general measures of mutual information, both for Gaussian and non-Gaussian channels.

Keywords: Generalized deBruijn identities, generalized φ-entropies, φ-divergences of Csiszár, generalized Fisher informations
PACS: 89.70.Cf, 89.70.-a, 02.0.-r, 05.40.-a

INTRODUCTION

The purpose of this paper is to propose extensions of the deBruijn identity that links the entropy and the Fisher information of the output of a Gaussian channel. Such a channel is depicted in figure 1. Let us denote by X its input, ξ_θ = √θ N the additive (zero-mean) Gaussian noise of variance θ, and Y_θ = X + ξ_θ its output. When the input and the noise are independent and X does not depend on θ, the deBruijn identity reads

\[
\frac{\partial H(p_{Y_\theta})}{\partial \theta} = \frac{1}{2}\, J(p_{Y_\theta}),
\tag{1}
\]

where p_{Y_θ} denotes the probability density function (pdf) of Y_θ, defined on a domain Ω, H(p) = −∫_Ω p(x) log p(x) dx stands for the differential entropy associated with a pdf p, and J(p) = ∫_Ω (∂ log p(x)/∂x)² p(x) dx denotes its (nonparametric) Fisher information.

The usual deBruijn identity was reformulated later on by Barron [1, Lemma 1] and Johnson [2, Th. C.1] through the Kullback–Leibler divergence between the output pdf p_{Y_θ} and the pdf p_θ of a Gaussian with the same variance, where the Kullback–Leibler divergence between two pdfs defined on a common domain Ω reads D_{kl}(p‖q) = ∫_Ω p(x) log(p(x)/q(x)) dx. More generally, for two possible inputs X^{(k)}, k = 0, 1, with respective outputs Y_θ^{(k)} of pdf p_{Y_θ^{(k)}}, the Kullback–Leibler divergence between the output pdfs satisfies the identity

\[
\frac{\partial}{\partial \theta} D_{kl}\!\left(p_{Y_\theta^{(1)}} \,\middle\|\, p_{Y_\theta^{(0)}}\right) = -\frac{1}{2}\, J\!\left(p_{Y_\theta^{(1)}} \,\middle\|\, p_{Y_\theta^{(0)}}\right),
\tag{2}
\]

where

\[
J(p \,\|\, q) = \int_\Omega p(x) \left( \frac{\partial}{\partial x} \log \frac{p(x)}{q(x)} \right)^{\!2} dx
\tag{3}
\]

is called the nonparametric Fisher divergence [2, 3].

FIGURE 1: Noisy communication channel of input X, where the noise ξ_θ is parametrized by θ. The input and the noise are assumed independent and the input X is assumed not to depend on parameter θ.
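As a quick numerical illustration of (1), not taken from the paper, the following Python sketch approximates the output pdf of the Gaussian channel by a discrete convolution and compares a finite-difference estimate of ∂H(p_{Y_θ})/∂θ with J(p_{Y_θ})/2. The Laplace input, the grid and the step sizes are arbitrary illustrative choices.

# Numerical sanity check of the deBruijn identity (1):
# d H(p_{Y_theta}) / d theta = J(p_{Y_theta}) / 2 for the Gaussian channel.
import numpy as np

x = np.linspace(-30.0, 30.0, 8001)            # common grid for input and output
dx = x[1] - x[0]
p_X = 0.5 * np.exp(-np.abs(x))                # Laplace(0, 1) input density

def output_pdf(theta):
    """pdf of Y_theta = X + sqrt(theta) N, by numerical convolution."""
    gauss = np.exp(-x**2 / (2.0 * theta)) / np.sqrt(2.0 * np.pi * theta)
    p = np.convolve(p_X, gauss, mode="same") * dx
    return p / (p.sum() * dx)                 # absorb grid/truncation error

def entropy(p):
    """Differential entropy H(p) = -int p log p dx (Riemann sum)."""
    return -np.sum(p * np.log(p)) * dx

def fisher(p):
    """Nonparametric Fisher information J(p) = int p (d log p/dx)^2 dx."""
    score = np.gradient(np.log(p), dx)
    return np.sum(p * score**2) * dx

theta, dth = 1.0, 1e-3
dH = (entropy(output_pdf(theta + dth)) - entropy(output_pdf(theta - dth))) / (2 * dth)
print(dH, 0.5 * fisher(output_pdf(theta)))    # the two values should nearly agree

Any sufficiently regular input density can be substituted for the Laplace choice; the agreement is limited only by the grid resolution and the finite-difference step.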

The Shannon entropy and the Fisher information are fundamental quantities in signal processing, particularly in the communication domain for the first one and in the estimation context for the second one. Both quantify, each in its own manner, a degree of uncertainty or of information about a variable or about a parameter attached to the variable. Although coming from different worlds, these measures are complementary, as illustrated for instance by the Stam inequality [4]. The Fisher information plays an important role in the estimation of a parameter θ_0 attached to a random variable, both through the Cramér–Rao inequality, which lower-bounds the variance of any estimator of the parameter, and through the curvature versus θ of the Kullback–Leibler divergence between the parametric pdf p_θ and the "true" pdf p_{θ_0}, around the true parameter θ_0. The deBruijn identity is important for various reasons: (i) it quantifies the loss or gain of entropy at the output of the Gaussian channel versus the noise level variations (see also Verdú's version dealing with the mutual information [5, 6]); (ii) it is a key point in the proof of Stam's inequality and in that of the entropy power inequality; the latter inequality (or its Fisher information counterpart) is involved in a proof of the central limit theorem [1, 2, 4]. In this paper, we are interested in the use of generalized φ-entropies as measures of uncertainty, or of generalized φ-divergences of the Csiszár class [7–9]. Although the Fisher information seems fundamental in various contexts, even when dealing with general divergences [10], this information measure has already been extended. Such generalizations appear in extensions of classical identities or inequalities associated with the "quadriptych" Shannon entropy – Kullback–Leibler divergence – Fisher information – Gaussian law [11, 12].

φ-DEBRUIJN IDENTITIES FOR THE GAUSSIAN CHANNEL

Entropic formulation

φ-entropies. The generalized φ-entropy of a random variable of pdf p, introduced in [8, 9] and inspired by [7], generalizes the Shannon entropy as

\[
H_\phi(p) = -\int_\Omega \phi(p(x))\, dx,
\tag{4}
\]

where φ is a convex function on [0; +∞). This class of entropies¹ contains well-known cases such as the Shannon entropy for φ(p) = p log p, or the Havrda–Charvát–Rényi–Tsallis entropies [13–16] for φ(p) = (p^α − p)/(α − 1), α ≥ 0 (up to a function, see footnote 1). In the sequel, we limit our study to functions φ that are C², so that convexity reads φ'' ≥ 0.

A φ-deBruijn identity. The key point in the proof of the deBruijn identity is that the pdf p_{ξ_θ} of the noise ξ_θ = √θ N (where N is a zero-mean standard noise) satisfies the heat equation

\[
\frac{\partial p_{\xi_\theta}(x)}{\partial \theta} = \frac{1}{2}\, \frac{\partial^2 p_{\xi_\theta}(x)}{\partial x^2}.
\]

When X is independent of ξ_θ and does not depend on parameter θ, under some regularity conditions on its pdf p_X, one shows that the output pdf p_{Y_θ} also satisfies the heat equation. One then differentiates the entropy of Y_θ with respect to θ: assuming regularity conditions allowing the permutation of differentiation in θ and integration, using the heat equation, and performing an integration by parts, one obtains the deBruijn identity. Reproducing this scheme step by step, we obtain the following φ-deBruijn identity:

\[
\frac{\partial H_\phi(p_{Y_\theta})}{\partial \theta} = \frac{1}{2}\, J_\phi(p_{Y_\theta}),
\tag{5}
\]

where we define a φ-Fisher information under the form

\[
J_\phi(p) = \int_\Omega \left( \frac{\partial \log p(x)}{\partial x} \right)^{\!2} \phi''(p(x))\, p(x)^2\, dx.
\tag{6}
\]
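For the reader's convenience, the chain of computations behind (5) can be sketched as follows (a formal derivation, assuming the regularity conditions mentioned above and that the boundary terms of the integration by parts vanish):

\[
\frac{\partial H_\phi(p_{Y_\theta})}{\partial \theta}
= -\int_\Omega \phi'(p_{Y_\theta})\,\frac{\partial p_{Y_\theta}}{\partial \theta}\, dx
= -\frac{1}{2}\int_\Omega \phi'(p_{Y_\theta})\,\frac{\partial^2 p_{Y_\theta}}{\partial x^2}\, dx
= \frac{1}{2}\int_\Omega \phi''(p_{Y_\theta}) \left(\frac{\partial p_{Y_\theta}}{\partial x}\right)^{\!2} dx
= \frac{1}{2}\, J_\phi(p_{Y_\theta}),
\]

where the second equality uses the heat equation satisfied by p_{Y_θ} and the third an integration by parts.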

Note then the following facts:

• The interpretation of this identity is similar to that of the usual deBruijn one, as it quantifies the variations of the φ-entropy of the output of the Gaussian channel versus the noise variance θ. The φ-Fisher information being positive, the uncertainty on the output increases when the noise level increases, the rate of increase being quantified by J_φ.

• The positive quantity φ''(p(x)) p(x)², correctly normalized, can be interpreted as a φ-escort distribution similar to that defined in [11, 12]. In the Shannon context this distribution reduces to p and the φ-Fisher information reduces to the usual Fisher information. In the Havrda–Charvát–Rényi–Tsallis context, this distribution is the escort distribution introduced in [11, 12] for instance.

¹ Rigorously, the φ-entropy is defined up to a nondecreasing function h, as H_{h,φ}(X) = h(−∫_Ω φ(p(x)) dx).
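As an illustration of (5)–(6) beyond the Shannon case, the following sketch, not taken from the paper, numerically checks the φ-deBruijn identity for the Havrda–Charvát–Rényi–Tsallis choice φ(p) = (p^α − p)/(α − 1), for which φ''(p) p² = α p^α. The Laplace input, the value α = 1.5 and the grid sizes are arbitrary.

# Numerical check of the phi-deBruijn identity (5)-(6) for the
# Havrda-Charvat-Renyi-Tsallis case, where H_phi(p) = (1 - int p^alpha)/(alpha - 1)
# and J_phi(p) = alpha * int (d log p/dx)^2 p^alpha dx.
import numpy as np

alpha = 1.5
x = np.linspace(-30.0, 30.0, 8001)
dx = x[1] - x[0]
p_X = 0.5 * np.exp(-np.abs(x))                      # Laplace(0, 1) input density

def output_pdf(theta):
    """pdf of Y_theta = X + sqrt(theta) N (numerical convolution)."""
    gauss = np.exp(-x**2 / (2.0 * theta)) / np.sqrt(2.0 * np.pi * theta)
    p = np.convolve(p_X, gauss, mode="same") * dx
    return p / (p.sum() * dx)

def H_phi(p):
    """phi-entropy for phi(p) = (p^alpha - p)/(alpha - 1)."""
    return (1.0 - np.sum(p**alpha) * dx) / (alpha - 1.0)

def J_phi(p):
    """phi-Fisher information (6): int (d log p/dx)^2 phi''(p) p^2 dx."""
    score = np.gradient(np.log(p), dx)
    return alpha * np.sum(score**2 * p**alpha) * dx

theta, dth = 1.0, 1e-3
dH = (H_phi(output_pdf(theta + dth)) - H_phi(output_pdf(theta - dth))) / (2 * dth)
print(dH, 0.5 * J_phi(output_pdf(theta)))           # the two values should nearly agree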

Divergence formulation

φ-divergences. Introduced by Csiszár in [7], the so-called φ-divergence between two pdfs is defined as

\[
D_\phi(p \,\|\, q) = \int_\Omega \phi\!\left( \frac{p(x)}{q(x)} \right) q(x)\, dx,
\tag{7}
\]

where q serves as a reference and where φ is convex on [0; +∞). This class² contains many well-known cases, such as the usual Kullback–Leibler divergence for φ(l) = l log l, the Jeffreys divergence for φ(l) = (l − 1) log l, the Jensen–Shannon divergence for φ(l) = (l/2) log l − ((l+1)/2) log((l+1)/2), or the Havrda–Charvát–Rényi–Tsallis divergence for φ(l) = (l^α − 1)/(α − 1), among many others [7, 10, 13, 14, 17]. These divergences are closely linked to the φ-entropies [9], as the Kullback–Leibler divergence is closely linked to the Shannon entropy. As for the φ-entropies, in the sequel we restrict ourselves to functions φ that are C², so that convexity again reads φ'' ≥ 0.

A φ-deBruijn identity in terms of divergences. We concentrate here on the general context of two possible channel inputs X^{(k)}, k = 0, 1, with respective outputs Y_θ^{(k)} of pdf p_{Y_θ^{(k)}}. In the context of φ-divergences, the key point of the derivation lies again in the heat equation satisfied by both pdfs p_{Y_θ^{(k)}}. The same steps as those used to derive the entropic version of the extended identity lead to its divergence version,

\[
\frac{\partial}{\partial \theta} D_\phi\!\left(p_{Y_\theta^{(1)}} \,\middle\|\, p_{Y_\theta^{(0)}}\right) = -\frac{1}{2}\, J_\phi\!\left(p_{Y_\theta^{(1)}} \,\middle\|\, p_{Y_\theta^{(0)}}\right),
\tag{8}
\]

where now the so-called nonparametric φ-Fisher divergence between two pdfs reads

\[
J_\phi(p \,\|\, q) = \int_\Omega \left( \frac{\partial}{\partial x} \log \frac{p(x)}{q(x)} \right)^{\!2} \phi''\!\left( \frac{p(x)}{q(x)} \right) \frac{p(x)^2}{q(x)}\, dx.
\tag{9}
\]

Let us remark that:

• Since the φ-Fisher divergence is nonnegative, this identity can be interpreted as giving a rate of convergence (in terms of φ-divergence) between the densities of the output of the channel for two different inputs as the noise level increases. In particular, when X^{(0)} is chosen Gaussian, Y_θ^{(0)} is also Gaussian: the identity then gives the convergence of the output to the Gaussian as the noise level increases. The rate of convergence is given by the φ-Fisher divergence.

• The quantity φ''(p(x)/q(x)) p(x)²/q(x) can now be interpreted as a φ-escort distribution relative to a reference q. In the Kullback–Leibler context it reduces to the distribution p; in the Havrda–Charvát–Rényi–Tsallis context one obtains the generalized escort distribution introduced in [3, 18].

• For φ(l) = l log l, J_φ is the usual nonparametric Fisher divergence (3). For the Havrda–Charvát–Rényi–Tsallis case φ(l) = (l^α − 1)/(α − 1), the φ-Fisher divergence corresponds to the α-Fisher gain introduced by Hammad in [3]. For the Jensen–Shannon divergence, the associated φ-Fisher divergence is the Jensen–Fisher divergence J_{JS}(p‖q) recently introduced by Sánchez-Moreno et al. in [19] by pure analogy with the Jensen–Shannon divergence. Finally, some of the φ-Fisher divergences can be linked to various extensions of the usual Fisher divergences [11, 12, 20, 21].

² Rigorously, D_{h,φ}(p‖q) = h(∫_Ω φ(p(x)/q(x)) q(x) dx), where h is nondecreasing.
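The divergence identity (8) can also be checked numerically in the same spirit as before. The sketch below, not taken from the paper, treats the Kullback–Leibler case φ(l) = l log l (i.e., identity (2)), with an arbitrarily chosen Laplace input X^{(1)} and standard Gaussian input X^{(0)}.

# Numerical check of identity (8) in the Kullback-Leibler case phi(l) = l log l,
# i.e. identity (2): d/dtheta D_kl = -J(. || .)/2 for the Gaussian channel.
import numpy as np

x = np.linspace(-30.0, 30.0, 8001)
dx = x[1] - x[0]
p_X1 = 0.5 * np.exp(-np.abs(x))                       # Laplace(0, 1) input X^(1)
p_X0 = np.exp(-x**2 / 2.0) / np.sqrt(2.0 * np.pi)     # standard Gaussian input X^(0)

def output_pdf(p_in, theta):
    """pdf of the channel output X + sqrt(theta) N for input density p_in."""
    gauss = np.exp(-x**2 / (2.0 * theta)) / np.sqrt(2.0 * np.pi * theta)
    p = np.convolve(p_in, gauss, mode="same") * dx
    return p / (p.sum() * dx)

def D_kl(p, q):
    """Kullback-Leibler divergence int p log(p/q) dx."""
    return np.sum(p * np.log(p / q)) * dx

def J_div(p, q):
    """Nonparametric Fisher divergence (3): int p (d/dx log(p/q))^2 dx."""
    score = np.gradient(np.log(p / q), dx)
    return np.sum(p * score**2) * dx

theta, dth = 1.0, 1e-3
dD = (D_kl(output_pdf(p_X1, theta + dth), output_pdf(p_X0, theta + dth))
      - D_kl(output_pdf(p_X1, theta - dth), output_pdf(p_X0, theta - dth))) / (2 * dth)
print(dD, -0.5 * J_div(output_pdf(p_X1, theta), output_pdf(p_X0, theta)))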

BEYOND THE GAUSSIAN CHANNEL

A question that naturally arises is whether the usual and/or the extended versions of the deBruijn identity extend to more general channels than the Gaussian one. The answer is given in the following results. Let us consider a channel of the form of fig. 1 where the noise pdf p_{ξ_θ}(x) satisfies the general second-order partial differential equation

\[
\alpha_2(\theta)\, \frac{\partial^2 p_{\xi_\theta}}{\partial \theta^2} + \alpha_1(\theta)\, \frac{\partial p_{\xi_\theta}}{\partial \theta}
= \beta_2(\theta)\, \frac{\partial^2 p_{\xi_\theta}}{\partial x^2} + \beta_1(\theta)\, \frac{\partial p_{\xi_\theta}}{\partial x}.
\tag{10}
\]

Note that for the special case α_2 = 0 this equation is nothing more than a Fokker–Planck equation (for state-independent drift and diffusion) [22]. The pdf p_{Y_θ} of the output of the channel satisfies a similar equation, where β_1 is replaced by −β_1. Under regularity conditions on the input pdf³, proceeding step by step as in the Gaussian channel context, we show that the φ-entropy of the output pdf satisfies the identity

\[
\alpha_2(\theta)\, \frac{\partial^2 H_\phi(p_{Y_\theta})}{\partial \theta^2} + \alpha_1(\theta)\, \frac{\partial H_\phi(p_{Y_\theta})}{\partial \theta}
= \beta_2(\theta)\, J_\phi(p_{Y_\theta}) - \alpha_2(\theta)\, J_\phi^{(\theta)}(p_{Y_\theta}),
\tag{11}
\]

where the parametric φ-Fisher information J_φ^{(θ)}(p_θ) of a pdf parametrized by θ reads

\[
J_\phi^{(\theta)}(p_\theta) = \int_\Omega \left( \frac{\partial \log p_\theta(x)}{\partial \theta} \right)^{\!2} \phi''(p_\theta(x))\, p_\theta(x)^2\, dx.
\tag{12}
\]

³ Typically, we assume uniform convergence of the integrands in the entropy/divergence, allowing the permutation of integration and differentiation with respect to θ; we also assume that the boundary terms vanish when performing the integrations by parts (e.g., p φ(p) vanishes on the border of the integration domain).

Similarly, for two possible inputs X^{(k)}, k = 0, 1, of the channel, with respective outputs Y_θ^{(k)} of pdf p_{Y_θ^{(k)}}, the φ-divergence between these pdfs satisfies the identity

\[
\alpha_2(\theta)\, \frac{\partial^2 D_\phi\!\left(p_{Y_\theta^{(1)}} \| p_{Y_\theta^{(0)}}\right)}{\partial \theta^2}
+ \alpha_1(\theta)\, \frac{\partial D_\phi\!\left(p_{Y_\theta^{(1)}} \| p_{Y_\theta^{(0)}}\right)}{\partial \theta}
= \alpha_2(\theta)\, J_\phi^{(\theta)}\!\left(p_{Y_\theta^{(1)}} \| p_{Y_\theta^{(0)}}\right)
- \beta_2(\theta)\, J_\phi\!\left(p_{Y_\theta^{(1)}} \| p_{Y_\theta^{(0)}}\right),
\tag{13}
\]

where the parametric φ-Fisher divergence J_φ^{(θ)}(p_θ‖q_θ) between two pdfs parametrized by θ reads

\[
J_\phi^{(\theta)}(p_\theta \,\|\, q_\theta) = \int_\Omega \left( \frac{\partial}{\partial \theta} \log \frac{p_\theta(x)}{q_\theta(x)} \right)^{\!2} \phi''\!\left( \frac{p_\theta(x)}{q_\theta(x)} \right) \frac{p_\theta(x)^2}{q_\theta(x)}\, dx.
\tag{14}
\]

Both identities (11) and (13) are difficult to interpret in the general context. However, they contain some variations of the usual deBruijn identities that already exist in the literature:

• For the Lévy channel, ξ_θ = θ² L, the noise (and output) pdf satisfies eq. (10) with α_2 = 1, β_1 = 2 and α_1 = β_2 = 0. In the context of the Kullback–Leibler divergence, the divergence formulation of the deBruijn identity can be found in [2, Th. 5.5] (when X^{(0)} is assumed to be Lévy). For this channel, the parametric φ-Fisher divergence characterizes the curvature of the φ-divergence; its integral gives the rate of convergence of two output laws when the scale factor of the noise increases.

• For the Cauchy channel, ξ_θ = θ C, the pdf satisfies eq. (10) with α_2 = 1, β_2 = −1 and α_1 = β_1 = 0. The generalized deBruijn identity generalizes that of [2, Th. 5.6] (Kullback–Leibler divergence, X^{(0)} being Cauchy): in this case, the sum of the parametric and nonparametric φ-Fisher divergences characterizes the channel (see the numerical sketch after this list).

• Similar results can be obtained for more general laws, such as the so-called generalized q-Gaussians [11].
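As announced in the Cauchy bullet above, here is a toy numerical check, not taken from the paper, of identity (11) for the Cauchy channel in the Shannon case φ(p) = p log p, for which (11) reduces to ∂²H(p_{Y_θ})/∂θ² = −J(p_{Y_θ}) − J^{(θ)}(p_{Y_θ}). The input is arbitrarily chosen Cauchy of scale a, so that the output Y_θ = X + θ C is again Cauchy, of scale a + θ.

# Numerical check of identity (11) for the Cauchy channel
# (alpha_2 = 1, beta_2 = -1, alpha_1 = beta_1 = 0) in the Shannon case:
#   d^2 H / d theta^2 = -J(p_{Y_theta}) - J^(theta)(p_{Y_theta}).
import numpy as np
from scipy.integrate import quad

a = 0.7                                        # arbitrary input scale

def pdf(x, theta):                             # output pdf: Cauchy of scale a + theta
    s = a + theta
    return s / (np.pi * (x**2 + s**2))

def H(theta):                                  # differential entropy -int p log p
    return quad(lambda x: -pdf(x, theta) * np.log(pdf(x, theta)),
                -np.inf, np.inf)[0]

def J_x(theta):                                # nonparametric Fisher information
    s = a + theta                              # score d log p/dx = -2x/(x^2 + s^2)
    return quad(lambda x: pdf(x, theta) * (-2.0 * x / (x**2 + s**2))**2,
                -np.inf, np.inf)[0]

def J_theta(theta):                            # parametric Fisher information
    s = a + theta                              # score d log p/d theta = 1/s - 2s/(x^2 + s^2)
    return quad(lambda x: pdf(x, theta) * (1.0 / s - 2.0 * s / (x**2 + s**2))**2,
                -np.inf, np.inf)[0]

theta, dth = 1.0, 1e-2
d2H = (H(theta + dth) - 2.0 * H(theta) + H(theta - dth)) / dth**2
print(d2H, -J_x(theta) - J_theta(theta))       # both should be close to -1/(a + theta)^2

For this analytically tractable choice, H(p_{Y_θ}) = log(4π(a + θ)) and both sides equal −1/(a + θ)².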

Note finally that a very interesting case concerns the Guo–Shamai–Verdú identity that characterizes the robustness of the Gaussian channel, in terms of input–output mutual information versus the signal-to-noise ratio (SNR) [5, 6]. In this context the input of the channel is a variable X preamplified by a factor √s and the noise is of unit variance (see figure 2). The mutual information between the input X (before preamplification) and the received output Y_s assesses the quality of the transmission, and the Guo–Shamai–Verdú relation thus quantifies the robustness of the transmission versus the SNR s.

FIGURE 2: Noisy communication channel of input X and output Y_s, where the noise ξ is a zero-mean standard Gaussian. The input X is preamplified by a factor √s before its transmission.

In such a context, one easily shows that the output law p_{Y_s} and the conditional law p_{Y_s|X=x} satisfy the same differential equation (10) (parameter s, state y): thus, the same algebra that led to eq. (13) gives here the identity

\[
\frac{\partial}{\partial s} D_\phi\!\left(p_{X,Y_s} \,\middle\|\, p_X\, p_{Y_s}\right) = \frac{1}{2}\, \mathrm{MMSE}_\phi(X|Y_s)
\equiv \frac{1}{2} \int_{\Omega^2} \left(x - \mathbb{E}[X|Y_s = y]\right)^2
\phi''\!\left( \frac{p_{X,Y_s}(x,y)}{p_X(x)\, p_{Y_s}(y)} \right)
\frac{p_{X,Y_s}(x,y)^2}{p_X(x)\, p_{Y_s}(y)}\, dx\, dy.
\]

In the context of the Kullback–Leibler divergence, the left-hand side is nothing more than the input–output Shannon mutual information. Its variations are characterized by the right-hand side, which is nothing more than the minimum mean squared error (MMSE) of the estimation of X from the output Y_s [6]. This result is thus generalized to more general mutual informations and to a generalized MMSE.
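In the Kullback–Leibler case the identity above is the classical Guo–Shamai–Verdú relation ∂I(X;Y_s)/∂s = MMSE(X|Y_s)/2. The following toy check, not taken from the paper, uses an arbitrarily chosen equiprobable binary input X ∈ {−1, +1}, for which E[X|Y_s = y] = tanh(√s y).

# Toy numerical check of the Guo-Shamai-Verdu relation dI(X;Y_s)/ds = MMSE/2
# for an equiprobable binary input X in {-1, +1} and unit-variance Gaussian noise.
import numpy as np
from scipy.integrate import quad

def p_Y(y, s):
    """Output pdf of Y_s = sqrt(s) X + N with X = +/-1 equiprobable, N ~ N(0,1)."""
    g = lambda m: np.exp(-(y - m)**2 / 2.0) / np.sqrt(2.0 * np.pi)
    return 0.5 * (g(np.sqrt(s)) + g(-np.sqrt(s)))

def mutual_info(s):
    """I(X;Y_s) = H(Y_s) - H(Y_s|X), with H(Y_s|X) = 0.5*log(2*pi*e)."""
    h_Y = quad(lambda y: -p_Y(y, s) * np.log(p_Y(y, s)), -np.inf, np.inf)[0]
    return h_Y - 0.5 * np.log(2.0 * np.pi * np.e)

def mmse(s):
    """E[(X - E[X|Y_s])^2], with E[X|Y_s = y] = tanh(sqrt(s) y)."""
    return quad(lambda y: p_Y(y, s) * (1.0 - np.tanh(np.sqrt(s) * y)**2),
                -np.inf, np.inf)[0]

s, ds = 1.0, 1e-3
dI = (mutual_info(s + ds) - mutual_info(s - ds)) / (2 * ds)
print(dI, 0.5 * mmse(s))   # the two values should nearly agree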

DISCUSSION

In this paper, we have proposed generalizations of the usual deBruijn identity, expressed either in terms of entropies or in terms of divergences, in two directions:

• considering general φ-entropies or φ-divergences of the Csiszár class, both being closely linked;

• going beyond the Gaussian channel, provided that the noise pdf satisfies a well-suited second-order differential equation, including as particular cases the Cauchy and the Lévy channels among others.

Without much effort, relations (11) and (13) extend to multivariate laws and to a multivariate parameter θ. We do not give the details in this paper. The key point again relies on an equation of the type (10) satisfied by the law of the channel noise, where the first-order derivatives are replaced by gradients, the second-order derivatives by Hessian matrices, and where the α_i and β_i can be replaced by general linear operators: we then obtain identities of the type (11) and (13), linking gradients and Hessians of the entropy (or of the divergence) to φ-Fisher information (or φ-Fisher divergence) matrices. This allows us to generalize our results to multivariate channels, with correlated components either for the noise or for the input, and thus to generalize the results of [6]. The general interpretation of the identities we obtain, beyond the characterization of a transmission channel, remains open. The question of the potential implications

remains open as well; for instance, can we deduce a generalization of the entropy power inequality, and, as a consequence, a generalization of the central limit theorem for variables with infinite variances?

REFERENCES

1. A. R. Barron. Entropy and the central limit theorem. The Annals of Probability, 14(1):336–342, January 1986.
2. O. Johnson. Information Theory and The Central Limit Theorem. Imperial College Press, London, 2004.
3. P. Hammad. Mesure d'ordre α de l'information au sens de Fisher. Revue de Statistique Appliquée, 26(1):73–84, 1978.
4. A. J. Stam. Some inequalities satisfied by the quantities of information of Fisher and Shannon. Information and Control, 2(2):101–112, June 1959.
5. D. Guo, S. Shamai, and S. Verdú. Mutual information and minimum mean-square error in Gaussian channels. IEEE Transactions on Information Theory, 51(4):1261–1282, April 2005.
6. D. P. Palomar and S. Verdú. Gradient of mutual information in linear vector Gaussian channels. IEEE Transactions on Information Theory, 52(1):141–154, January 2006.
7. I. Csiszár. Information-type measures of difference of probability distributions and indirect observations. Studia Scientiarum Mathematicarum Hungarica, 2:299–318, 1967.
8. M. Salicrú, M. L. Menéndez, D. Morales, and L. Pardo. Asymptotic distribution of (h, φ)-entropies. Communications in Statistics, 22(7):2015–2031, 1993.
9. M. Salicrú. Measures of information associated with Csiszár's divergences. Kybernetika, 30(5):563–573, 1994.
10. I. Vajda. χ^α-divergence and generalized Fisher's information. In Transactions of the 6th Prague Conference on Information Theory, Statistics, Decision Functions and Random Processes, pages 873–886, 1973.
11. J.-F. Bercher. On a (β, q)-generalized Fisher information and inequalities involving q-Gaussian distributions. Journal of Mathematical Physics, 53(6):063303, June 2012.
12. J.-F. Bercher. On generalized Cramér–Rao inequalities, generalized Fisher information and characterizations of generalized q-Gaussian distributions. Journal of Physics A, 45(25):255303, June 2012.
13. A. Rényi. On measures of entropy and information. In Proceedings of the 4th Berkeley Symposium on Mathematical Statistics and Probability, 1:547–561, 1961.
14. J. Havrda and F. Charvát. Quantification method of classification processes: Concept of structural α-entropy. Kybernetika, 3:30–35, 1967.
15. Z. Daróczy. Generalized information functions. Information and Control, 16(1):36–51, March 1970.
16. C. Tsallis. Possible generalization of Boltzmann–Gibbs statistics. Journal of Statistical Physics, 52(1-2):479–487, July 1988.
17. F. Liese and I. Vajda. On divergences and informations in statistics and information theory. IEEE Transactions on Information Theory, 52(10):4394–4412, October 2006.
18. J.-F. Bercher. An amended MaxEnt formulation for deriving Tsallis factors, and associated issues. In 26th International Workshop on Bayesian Inference and Maximum Entropy Methods in Science and Engineering (MaxEnt 2006), volume 872 of AIP Conference Proceedings, pages 441–448, Paris, France, 8–13 July 2006.
19. P. Sánchez-Moreno, A. Zarzo, and J. S. Dehesa. Jensen divergence based on Fisher's information. Journal of Physics A, 45(12):125305, March 2012.
20. O. Johnson and C. Vignat. Some results concerning maximum Rényi entropy distributions. Annales de l'Institut Henri Poincaré (B) Probability and Statistics, 43(3):339–351, May–June 2007.
21. E. Lutwak, S. Lv, D. Yang, and G. Zhang. Extensions of Fisher information and Stam's inequality. IEEE Transactions on Information Theory, 58(3):1319–1327, March 2012.
22. H. Risken. The Fokker–Planck Equation: Methods of Solution and Applications. Springer-Verlag, Heidelberg, 2nd edition, 1989.