Fisher Information Properties

Facultad de Ingeniería y Ciencias Aplicadas, Universidad de los Andes, Monseñor Álvaro del Portillo 12.455, Las Condes, Santiago, Chile
Entropy 2015, 17(7), 4918-4939; https://doi.org/10.3390/e17074918
Submission received: 18 June 2015 / Accepted: 10 July 2015 / Published: 13 July 2015
(This article belongs to the Section Information Theory, Probability and Statistics)

Abstract

A set of Fisher information properties is presented in order to draw a parallel with similar properties of Shannon differential entropy. Already known properties are presented together with new ones, which include: (i) a generalization of mutual information for Fisher information; (ii) a new proof that Fisher information increases under conditioning; (iii) a proof that Fisher information decreases in Markov chains; and (iv) a bound on the estimation error based on Fisher information. This last result is especially important because it completes Fano's inequality, i.e., a lower bound for the estimation error, showing that Fisher information can be used to define an upper bound for this error. In this way, it is shown that Shannon's differential entropy, which quantifies the behavior of the random variable, and the Fisher information, which quantifies the internal structure of the density function that defines the random variable, can be used to characterize the estimation error.

1. Introduction

The birth of information theory was signaled by the publication of Claude Shannon's work [1], which is based on studying the behavior of systems described by density functions. However, long before that work was published, Ronald Fisher had already published the definition of a quantity called Fisher information [2], which sets a hard bound on the capacity to estimate the parameters that define a system [3,4]. Hence, this quantity regulates how well it is possible to determine the internal structure of a system and provides another point of view that can be used to study systems: how they are composed, what they are made of. This work springs from the belief that the combination of these approaches is what completely defines systems: their behavior (Shannon) and their architecture (Fisher). In the following, a series of published results is summarized, together with new results, in order to present a coherent set of Fisher information properties that will hopefully be useful for those who work with this quantity.

1.1. Fisher Information and Other Fields

One connection between Fisher information and the Shannon differential entropy was stated by Kullback [5] (p. 26), who proved that the second derivatives of the Kullback–Leibler divergence with respect to the parameters of the density functions produce the Fisher information matrix terms. Related results were presented by Blahut [6] (p. 300) and Frieden [7] (p. 37). Another important result that also relates these two frameworks is de Bruijn's identity ([8,9] and [10] (p. 672)), which establishes a relation between the derivative of the Shannon differential entropy and the Fisher information when the underlying random variable is subject to Gaussian perturbations. This result was recently generalized to non-Gaussian perturbations [11,12]. A consequence of these results is the convolution inequality for Fisher information ([8,9,13–16]; [10] (p. 674)).
Others have studied the relation between Fisher information and physics. Here, it is important to point out the extreme physical information principle derived by Frieden and others in order to establish a general framework that explains physics [7,17–20]. Of special interest has been the role of Fisher information in generating thermodynamical theory [7,17–22]. It is very common in these approaches to use a special case of Fisher information in which the estimated parameter is a location parameter. In this work, only the original and general Fisher information definition, and not the latter special case, is addressed.
Even thought Shannon’s ideas have been part of the the machine learning tool set for a long time, Fisher information has not followed the same track. Even though Fisher information is intimately connected to estimation theory [23], its use in the development of learning systems has not been well developed yet. Nevertheless, Amari discovered that natural gradient descent, i.e., common gradient descent corrected with the Fisher information matrix terms, takes into account the topology in a more precise manner, allowing for more efficient training procedures [24,25]. The use of Fisher information has also been taken into account in order to design objective functions to lead the estimation procedure. One of them is mixing maximum entropy with minimum Fisher information [26,27]. On the other hand, mixing Shannon’s differential entropy, Fisher information and the central limit theorem has allowed proving that in the presence of large datasets, it is natural to search for minimum Kullback–Leibler, or equivalent, solutions [28].

1.2. Contribution of This Work

This work is focused on presenting already known properties of Fisher information [3,4,7,8,10,29–32] and introducing new ones, such that the reader can have a better grasp of Fisher information and its usefulness. The main results presented in this work are: (i) the generalization of the mutual information concept using Fisher information expressions; (ii) a new proof that conditioning under certain assumptions increases Fisher information; (iii) a proof that, in Markov chains, the Fisher information decreases as the random variables become further away from the estimated parameter; and (iv) an upper bound on the estimation error, which is regulated by the Fisher information.
This work is structured roughly in the same way as the first chapter of the well-known book by Cover and Thomas [30], in order to help the reader draw a parallel between Shannon's quantities and Fisher information.

2. Notation

In the following sections, vectors and matrices are denoted with a bold font [7,31]. Furthermore, density functions are denoted by $f_{X;\theta}(x)$, where $f$ is reserved for density functions, the subscript $X$ corresponds to the name of the random variable, $\theta$ represents the parameters that define the density function and the symbol within the parentheses stands for the instance of the random variable that is used to evaluate the density function. In this way, as an example, a different random variable could be denoted by $f_{Y;\theta}(y)$. A similar notation is used in [33].

3. Fisher Information

Let there be a random variable $X$ and its associated density function $f_{X;\theta}(x)$, which has support $S$ and depends on a set of parameters represented by the vector $\theta \in \Theta$. The value $\theta_k$ is the $k$-th component of $\theta$. According to the original definition designed by Fisher to characterize maximum likelihood estimation [2]:
Definition 1 (Fisher Information). Given a random variable X and its associated density function fX;θ(x), which depends on the parameter vector θ ∈ Θ, and θk is the k-th component of θ, then the Fisher information associated with θk is defined by:
$$ i_F(f_{X;\theta})_{\theta_k} \triangleq \int f_{X;\theta}(x)\left(\frac{\partial \ln f_{X;\theta}(x)}{\partial \theta_k}\right)^2 dx $$
From the definition, it is clear that $i_F(f_{X;\theta})_{\theta_k} \geq 0$. Furthermore, if $f_{X;\theta}$ does not depend on $\theta_k$, then $i_F(f_{X;\theta})_{\theta_k} = 0$.
Example 1. In a Gaussian case with mean µ and standard deviation η, the density function is given by:
$$ f_{X;\mu,\eta}(x) = \frac{1}{\sqrt{2\pi}\,\eta}\exp\left(-\frac{(x-\mu)^2}{2\eta^2}\right) $$
In this case:
$$ \ln f_{X;\mu,\eta}(x) = \ln\frac{1}{\sqrt{2\pi}\,\eta} - \frac{(x-\mu)^2}{2\eta^2} = \ln\frac{1}{\sqrt{2\pi}\,\eta} - \frac{x^2 - 2\mu x + \mu^2}{2\eta^2} $$
If the parameter to be estimated is the mean µ, the previous expression needs to be differentiated with respect to µ:
$$ \frac{d\ln f_{X;\mu,\eta}(x)}{d\mu} = \frac{2x}{2\eta^2} - \frac{2\mu}{2\eta^2} = \frac{x-\mu}{\eta^2} $$
Replacing this into the definition of the Fisher information:
$$ i_F(f_{X;\mu,\eta})_{\mu} = \int f_{X;\mu,\eta}(x)\left(\frac{x-\mu}{\eta^2}\right)^2 dx = \int f_{X;\mu,\eta}(x)\,\frac{x^2 - 2\mu x + \mu^2}{\eta^4}\,dx $$
$$ = \frac{1}{\eta^4}\int f_{X;\mu,\eta}(x)\,x^2\,dx - \frac{2\mu}{\eta^4}\int f_{X;\mu,\eta}(x)\,x\,dx + \frac{\mu^2}{\eta^4}\int f_{X;\mu,\eta}(x)\,dx $$
$$ = \frac{1}{\eta^4}\left\{(\eta^2 + \mu^2) - 2\mu^2 + \mu^2\right\} = \frac{1}{\eta^2} $$
This shows that, for Gaussian density functions, the Fisher information about the mean is the reciprocal of the variance, so the lower bound on the variance of any unbiased estimator of the mean (see the Cramer–Rao bound below) is directly proportional to the variance of the density function.
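This value can be checked numerically. The following sketch is not part of the original article: it estimates the Fisher information as the second moment of the score, using arbitrary assumed values for µ and η, and compares it with 1/η².

```python
# A minimal Monte Carlo sketch (not from the paper) for Example 1: the Fisher
# information about the Gaussian mean is estimated as the sample second moment
# of the score (x - mu)/eta^2. The values of mu, eta and the sample size are
# arbitrary assumptions.
import numpy as np

rng = np.random.default_rng(0)
mu, eta = 1.5, 2.0
x = rng.normal(mu, eta, size=1_000_000)

score = (x - mu) / eta**2               # d ln f_{X;mu,eta}(x) / d mu
fisher_mc = np.mean(score**2)           # Monte Carlo estimate of i_F(f_{X;mu,eta})_mu

print(fisher_mc, 1 / eta**2)            # both should be close to 0.25
```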
There is another expression that can be used to represent the Fisher information.
Theorem 1. Given a random variable X and its associated density function fX;θ(x), which depends on the parameter vector θ ∈ Θ and complies with the boundary condition for θk (see Appendix A), where θk is the k-th component of θ, then the Fisher information associated with θk is equal to:
$$ i_F(f_{X;\theta})_{\theta_k} = -\int f_{X;\theta}(x)\,\frac{\partial^2 \ln f_{X;\theta}(x)}{\partial \theta_k^2}\,dx $$
A proof of this theorem can be found in [34] (p. 373).
Example 2. Continuing the Gaussian example, and using the alternative definition of the Fisher information, the required second derivative is first calculated:
$$ \frac{d^2 \ln f_{X;\mu,\eta}(x)}{d\mu^2} = \frac{d}{d\mu}\left(\frac{x-\mu}{\eta^2}\right) = -\frac{1}{\eta^2} $$
Replacing this into the expression of Theorem 1, the same result is obtained:
$$ i_F(f_{X;\mu,\eta})_{\mu} = -\int f_{X;\mu,\eta}(x)\,\frac{d^2 \ln f_{X;\mu,\eta}(x)}{d\mu^2}\,dx = -\int f_{X;\mu,\eta}(x)\left(-\frac{1}{\eta^2}\right)dx = \frac{1}{\eta^2}\int f_{X;\mu,\eta}(x)\,dx = \frac{1}{\eta^2} $$
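The equivalence between Definition 1 and the expression of Theorem 1 can also be checked symbolically for a non-Gaussian family. The sketch below is not part of the original article; it uses an exponential density parameterized by its mean θ, an illustrative assumption, and both expressions should return 1/θ².

```python
# A small symbolic sketch (not from the paper): Definition 1 and the
# second-derivative form of Theorem 1 are evaluated for an exponential density
# parameterized by its mean theta (an assumed illustrative choice).
import sympy as sp

x, theta = sp.symbols('x theta', positive=True)
f = sp.exp(-x / theta) / theta                        # exponential density with mean theta

score = sp.simplify(sp.diff(sp.log(f), theta))        # d ln f / d theta
curv = sp.simplify(sp.diff(sp.log(f), theta, 2))      # d^2 ln f / d theta^2

i_def1 = sp.integrate(f * score**2, (x, 0, sp.oo))    # Definition 1
i_thm1 = -sp.integrate(f * curv, (x, 0, sp.oo))       # Theorem 1

print(sp.simplify(i_def1), sp.simplify(i_thm1))       # both equal 1/theta**2
```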
The importance of the Fisher information quantity stems from the Cramer–Rao bound [3,4,23,35]:
Theorem 2 (Cramer–Rao Bound). Given a random variable X and its associated density function $f_{X;\theta}(x)$, which depends on the parameter vector θ ∈ Θ and complies with the boundary condition for θk (see Appendix A), where θk is the k-th component of θ, and given that there is an unbiased estimator $\hat{\theta}_k(x)$ of the scalar parameter θk, then:
$$ \frac{1}{i_F(f_{X;\theta})_{\theta_k}} \leq \sigma^2_{\hat{\theta}_k} $$
where:
$$ \sigma^2_{\hat{\theta}_k} \triangleq \int f_{X;\theta}(x)\left(\hat{\theta}_k(x) - \theta_k\right)^2 dx $$
is the variance of the estimator. Proofs of this theorem can be found in [7] (p. 29) and [23] (p. 66).
The Cramer–Rao bound establishes that the reciprocal of the Fisher information is a lower bound on the variance of an unbiased estimator. Any estimator that reaches the bound imposed by the Cramer–Rao theorem is called efficient [34]. It is important to notice that the bound does not depend on the estimator itself; it only depends on $i_F(f_{X;\theta})_{\theta_k}$. In this work, neither the case of biased estimators nor the case in which the parameters themselves are random variables will be analyzed.
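As a hedged numerical illustration (not from the paper), the sketch below checks the Cramer–Rao bound for the exponential family of the previous sketch: with a single observation, the unbiased estimator θ̂(x) = x of the mean has variance θ², which coincides with the reciprocal of the Fisher information 1/θ², so the bound is attained. The parameter value and number of trials are assumptions.

```python
# A minimal Monte Carlo sketch (not from the paper): for a single observation
# of an exponential random variable with mean theta, the unbiased estimator
# theta_hat(x) = x attains the Cramer-Rao bound 1/i_F = theta^2.
import numpy as np

rng = np.random.default_rng(1)
theta = 3.0
x = rng.exponential(theta, size=1_000_000)    # one observation per trial

estimator_variance = np.var(x)                # variance of theta_hat(x) = x
cramer_rao_bound = theta**2                   # 1 / i_F, with i_F = 1/theta^2

print(estimator_variance, cramer_rao_bound)   # both close to 9.0
```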
The following theorem states that the topology of the Fisher information in the density function space is very simple:
Theorem 3. The Fisher information $i_F(f_{X;\theta})_{\theta_k}$ is convex in $f_{X;\theta}$. Proofs of this theorem can be found in [7] (p. 69) and [29].

4. Several Random Variables Depending on θk

4.1. Joint Fisher Information Definition

Definition 2. Given two random variables X and Y and the associated joint density function fX,Y;θ(x, y), which depends on the parameter vector θ ∈ Θ, and θk is the k-th component of θ, then the joint Fisher information associated with θk is defined by:
$$ i_F(f_{X,Y;\theta})_{\theta_k} \triangleq \int\int f_{X,Y;\theta}(x,y)\left(\frac{\partial \ln f_{X,Y;\theta}(x,y)}{\partial \theta_k}\right)^2 dx\,dy $$

4.2. An Equivalent Joint Fisher Information Definition

Theorem 4. Given two random variables X and Y and the associated joint density function fX,Y(x, y), which depends on the parameter vector θ ∈ Θ and complies with the boundary condition for θk (see Appendix A), where θk is the k-th component of θ, then the joint Fisher information associated with θk is equal to:
$$ i_F(f_{X,Y;\theta})_{\theta_k} = -\int\int f_{X,Y;\theta}(x,y)\,\frac{\partial^2 \ln f_{X,Y;\theta}(x,y)}{\partial \theta_k^2}\,dx\,dy $$
Proof. This follows trivially from the alternative definition of the Fisher information. □

4.3. Conditional Fisher Information Definition

Definition 3 (Conditional Fisher Information). The conditional Fisher information of Y given X associated with θk is defined by:
$$ i_F(f_{Y|X;\theta})_{\theta_k} \triangleq \int\int f_{X,Y;\theta}(x,y)\left(\frac{\partial \ln f_{Y|X;\theta}(y|x)}{\partial \theta_k}\right)^2 dx\,dy $$

4.4. Chain Rule for Two Random Variables

The following result was first published by Zamir [32], who used it to produce an alternative proof of the Fisher information inequality. In the following lines, the same chain rule is proven using the results presented in the previous sections.
Theorem 5 (Chain Rule for Two Random Variables). Given a joint density function fX,Y;θ(x, y), which depends on the parameter vector θ ∈ Θ, and given that the density functions comply with the boundary condition for θk (see Appendix A), where θk is the k-th component of θ, then:
$$ i_F(f_{X,Y;\theta})_{\theta_k} = i_F(f_{Y|X;\theta})_{\theta_k} + i_F(f_{X;\theta})_{\theta_k} $$
$$ = i_F(f_{X|Y;\theta})_{\theta_k} + i_F(f_{Y;\theta})_{\theta_k} $$
Proof.
$$ i_F(f_{X,Y;\theta})_{\theta_k} \triangleq \int\int f_{X,Y;\theta}(x,y)\left(\frac{\partial \ln f_{X,Y;\theta}(x,y)}{\partial \theta_k}\right)^2 dx\,dy $$
$$ = \int\int f_{X,Y;\theta}(x,y)\left(\frac{\partial \ln\left(f_{Y|X;\theta}(y|x)\,f_{X;\theta}(x)\right)}{\partial \theta_k}\right)^2 dx\,dy $$
$$ = \int\int f_{X,Y;\theta}(x,y)\left(\frac{\partial \ln f_{Y|X;\theta}(y|x)}{\partial \theta_k} + \frac{\partial \ln f_{X;\theta}(x)}{\partial \theta_k}\right)^2 dx\,dy $$
$$ = i_F(f_{Y|X;\theta})_{\theta_k} + i_F(f_{X;\theta})_{\theta_k} + 2\int\int f_{X,Y;\theta}(x,y)\,\frac{\partial \ln f_{Y|X;\theta}(y|x)}{\partial \theta_k}\,\frac{\partial \ln f_{X;\theta}(x)}{\partial \theta_k}\,dx\,dy $$
but,
$$ \int\int f_{X,Y;\theta}(x,y)\,\frac{\partial \ln f_{Y|X;\theta}(y|x)}{\partial \theta_k}\,\frac{\partial \ln f_{X;\theta}(x)}{\partial \theta_k}\,dx\,dy = \int \frac{\partial f_{X;\theta}(x)}{\partial \theta_k}\left(\int \frac{\partial f_{Y|X;\theta}(y|x)}{\partial \theta_k}\,dy\right)dx $$
If $f_{Y|X;\theta}(y|x)$ complies with the boundary condition with respect to θk (see Appendix A), then:
$$ \int \frac{\partial f_{Y|X;\theta}(y|x)}{\partial \theta_k}\,dy = \frac{\partial}{\partial \theta_k}\int f_{Y|X;\theta}(y|x)\,dy = 0 $$
Therefore, the cross term vanishes and the theorem is proven. The other decomposition is proven analogously. □
When the chain rule is used to compute the Fisher information associated with a parameter, it is important to take into account that all of the terms produced by the chain rule contain derivatives with respect to the same parameter. Because some of these terms may involve density functions that do not depend on that parameter, some of them may be equal to zero.
Example 3. Given the random variable Y = X + N, where X is a Gaussian random variable with mean µ and standard deviation η and N is another Gaussian random variable, independent of X, with mean zero and standard deviation ν, if the joint density function is available and the parameter to be estimated is µ, then:
$$ i_F(f_{Y,X;\mu,\eta,\nu})_{\mu} = i_F(f_{Y|X;\mu,\eta,\nu})_{\mu} + i_F(f_{X;\mu,\eta})_{\mu} $$
$$ = i_F(f_{N;\nu})_{\mu} + i_F(f_{X;\mu,\eta})_{\mu} $$
$$ = i_F(f_{X;\mu,\eta})_{\mu} $$
$$ = \frac{1}{\eta^2} $$
where $i_F(f_{N;\nu})_{\mu} = 0$ because $f_{N;\nu}$ does not depend on µ.
The previous result implies that if the joint density function of the output Y and the input X is available, the noise does not affect the estimation process. This is not surprising, since Y is a corrupted version of X, and it cannot shed more information on µ than that contained in X. Because all of the information hidden in X is available through the joint density function, it makes sense to think that the Fisher information of the joint density function corresponds to that of the marginal distribution fX;µ,η.
Given the density functions mentioned above, it is possible to prove that:
$$ f_{Y;\mu,\eta,\nu}(y) = \frac{1}{\sqrt{2\pi(\eta^2+\nu^2)}}\exp\left(-\frac{(y-\mu)^2}{2(\eta^2+\nu^2)}\right) $$
with Fisher information associated with μ equal to:
$$ i_F(f_{Y;\mu,\eta,\nu})_{\mu} = \frac{1}{\eta^2+\nu^2} $$
Using the other expression for the chain rule:
$$ i_F(f_{Y,X;\mu,\eta,\nu})_{\mu} = i_F(f_{X|Y;\mu,\eta,\nu})_{\mu} + i_F(f_{Y;\mu,\eta,\nu})_{\mu} $$
$$ = i_F(f_{X|Y;\mu,\eta,\nu})_{\mu} + \frac{1}{\eta^2+\nu^2} $$
Using the previous results:
$$ \frac{1}{\eta^2} = i_F(f_{X|Y;\mu,\eta,\nu})_{\mu} + \frac{1}{\eta^2+\nu^2} $$
which implies:
$$ i_F(f_{X|Y;\mu,\eta,\nu})_{\mu} = \frac{\nu^2}{\eta^2(\eta^2+\nu^2)} $$
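A hedged numerical check of this example, not part of the original article, is sketched below: it estimates i_F(f_Y)_µ and i_F(f_{X|Y})_µ by Monte Carlo from the corresponding scores and verifies that they add up to 1/η², as the chain rule requires. The closed form of the Gaussian conditional density of X given Y and all parameter values are assumptions of the sketch.

```python
# A minimal Monte Carlo sketch (not from the paper) checking Example 3:
# i_F(f_{X|Y})_mu + i_F(f_Y)_mu should equal i_F(f_{Y,X})_mu = 1/eta^2.
import numpy as np

rng = np.random.default_rng(2)
mu, eta, nu = 0.7, 1.0, 2.0
n = 1_000_000
x = rng.normal(mu, eta, n)
y = x + rng.normal(0.0, nu, n)

# Score of the marginal f_{Y;mu,eta,nu}: Y ~ N(mu, eta^2 + nu^2).
score_y = (y - mu) / (eta**2 + nu**2)

# Score of the conditional f_{X|Y;mu,eta,nu}: X | Y = y ~ N(m(y), s2), where
# m(y) = mu + r * (y - mu), r = eta^2 / (eta^2 + nu^2), and dm/dmu = 1 - r.
r = eta**2 / (eta**2 + nu**2)
s2 = eta**2 * nu**2 / (eta**2 + nu**2)
score_x_given_y = (x - (mu + r * (y - mu))) / s2 * (1.0 - r)

i_y = np.mean(score_y**2)                      # ~ 1/(eta^2 + nu^2) = 0.2
i_x_given_y = np.mean(score_x_given_y**2)      # ~ nu^2/(eta^2 (eta^2 + nu^2)) = 0.8

print(i_x_given_y + i_y, 1 / eta**2)           # chain rule: both sides ~ 1.0
```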

4.5. Chain Rule for Many Random Variables

In the case of more than two random variables:
Theorem 6 (Chain Rule for Many Random Variables). Given a set of n random variables X1, X2, …, Xn, all of them depending on θk, if the density functions comply with the boundary condition for θk (see Appendix A), then:
$$ i_F(f_{X_1,X_2,\ldots,X_n;\theta})_{\theta_k} = \sum_{i=1}^{n} i_F(f_{X_i|X_{i-1},\ldots,X_1;\theta})_{\theta_k} $$
Proof.
$$ i_F(f_{X_1,X_2,\ldots,X_n;\theta})_{\theta_k} = i_F(f_{X_n,\ldots,X_2|X_1;\theta})_{\theta_k} + i_F(f_{X_1;\theta})_{\theta_k} $$
$$ = i_F(f_{X_n,\ldots,X_3|X_2,X_1;\theta})_{\theta_k} + i_F(f_{X_2|X_1;\theta})_{\theta_k} + i_F(f_{X_1;\theta})_{\theta_k} $$
$$ \vdots $$
$$ = \sum_{i=1}^{n} i_F(f_{X_i|X_{i-1},\ldots,X_1;\theta})_{\theta_k} $$
□
If the n random variables in Theorem 6 are i.i.d., then $i_F(f_{X_1,X_2,\ldots,X_n;\theta})_{\theta_k} = n\,i_F(f_{X;\theta})_{\theta_k}$, where $f_{X;\theta}$ denotes the common marginal density function.
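The i.i.d. case can be illustrated with a short numerical sketch, again not part of the original article: for n i.i.d. Gaussian samples, the joint Fisher information about the mean is n/η², and the sample mean attains the corresponding Cramer–Rao bound η²/n. All parameter values are assumptions.

```python
# A minimal Monte Carlo sketch (not from the paper) of the i.i.d. case of
# Theorem 6 combined with the Cramer-Rao bound.
import numpy as np

rng = np.random.default_rng(3)
mu, eta, n, trials = -0.3, 1.5, 20, 200_000

samples = rng.normal(mu, eta, size=(trials, n))
joint_score = np.sum((samples - mu) / eta**2, axis=1)    # sum of per-sample scores

print(np.mean(joint_score**2), n / eta**2)               # joint Fisher information ~ n/eta^2
print(np.var(samples.mean(axis=1)), eta**2 / n)          # sample-mean variance ~ CR bound
```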

5. Relative Fisher Information Type I

In the following, the relative Fisher information is defined. As far as it was possible to determine, the first definition of the relative Fisher information was given by Otto and Villani [36], who defined it for the translationally-invariant case. This expression has been rediscovered, or simply used, in many applications thereafter in different problems and fields [22,37–44]. It seems that the first general analysis of the relative Fisher information was presented by the author in [45]. The following sections focus on this latter general case, where there is no assumption of translational invariance.
Analogously to the Kullback–Leibler divergence [46], also known as relative entropy, which was designed to establish how much two density functions differ, the relative Fisher information Type I is obtained when the ratio of the two intervening density functions is replaced into the definition of the Fisher information (Definition 1), as is shown in the following definition.
Definition 4. The relative Fisher information Type I is defined by:
$$ d_F^{(I)}(f_{X;\theta}\,\|\,f_{Y;\theta})_{\theta_k} \triangleq \int f_{X;\theta}(x)\left(\frac{\partial}{\partial \theta_k}\ln\left(\frac{f_{X;\theta}(x)}{f_{Y;\theta}(x)}\right)\right)^2 dx $$
The same mechanism can be used to generate a second definition for the relative Fisher information. The same ratio can be replaced into the expression of Theorem 1, producing an alternative and equally valid expression, which is designated as relative Fisher information Type II. This second expression is studied in the following sections.

6. Information Correlation

Definition 5. The information correlation with respect to θk is defined by:
$$ i_C(f_{X,Y;\theta})_{\theta_k} \triangleq \int\int f_{X,Y;\theta}(x,y)\,\frac{\partial \ln f_{X;\theta}(x)}{\partial \theta_k}\,\frac{\partial \ln f_{Y;\theta}(y)}{\partial \theta_k}\,dx\,dy $$
The name information correlation comes from the similarity between this definition and that of the classical correlation coefficient. It is important to keep in mind that it is different from the terms that fill the Fisher information matrix [23].
According to the definition, $i_C(f_{X,X;\theta})_{\theta_k} = i_F(f_{X;\theta})_{\theta_k}$ and $i_C(f_{X,Y;\theta})_{\theta_k} = i_C(f_{Y,X;\theta})_{\theta_k}$.
Example 4. Continuing with the example where Y = X + N, the information correlation between Y and X is given by:
$$ i_C(f_{Y,X;\mu,\eta,\nu})_{\mu} = \int\int f_{Y,X;\mu,\eta,\nu}(y,x)\,\frac{d\ln f_{Y;\mu,\eta,\nu}(y)}{d\mu}\,\frac{d\ln f_{X;\mu,\eta}(x)}{d\mu}\,dy\,dx $$
$$ = \int\int f_{Y|X;\mu,\eta,\nu}(y|x)\,f_{X;\mu,\eta}(x)\,\frac{d\ln f_{Y;\mu,\eta,\nu}(y)}{d\mu}\,\frac{d\ln f_{X;\mu,\eta}(x)}{d\mu}\,dy\,dx $$
$$ = \int\int f_{N;\nu}(y-x)\,f_{X;\mu,\eta}(x)\,\frac{d\ln f_{Y;\mu,\eta,\nu}(y)}{d\mu}\,\frac{d\ln f_{X;\mu,\eta}(x)}{d\mu}\,dy\,dx $$
$$ = \int\int \frac{1}{2\pi\nu\eta}\exp\left(-\frac{1}{2}\left(\frac{(y-x)^2}{\nu^2}+\frac{(x-\mu)^2}{\eta^2}\right)\right)\frac{d\ln f_{Y;\mu,\eta,\nu}(y)}{d\mu}\,\frac{d\ln f_{X;\mu,\eta}(x)}{d\mu}\,dy\,dx $$
where:
$$ \frac{d\ln f_{Y;\mu,\eta,\nu}(y)}{d\mu} = \frac{d}{d\mu}\left(\ln\frac{1}{\sqrt{2\pi(\eta^2+\nu^2)}} - \frac{(y-\mu)^2}{2(\eta^2+\nu^2)}\right) = \frac{y-\mu}{\eta^2+\nu^2} $$
Analogously:
$$ \frac{d\ln f_{X;\mu,\eta}(x)}{d\mu} = \frac{x-\mu}{\eta^2} $$
Replacing these derivatives into the information correlation expression:
$$ i_C(f_{Y,X;\mu,\eta,\nu})_{\mu} = \int\int \frac{1}{2\pi\nu\eta}\exp\left(-\frac{1}{2}\left(\frac{(y-x)^2}{\nu^2}+\frac{(x-\mu)^2}{\eta^2}\right)\right)\left(\frac{y-\mu}{\eta^2+\nu^2}\right)\left(\frac{x-\mu}{\eta^2}\right)dy\,dx $$
Writing $y-\mu = (y-x) + (x-\mu)$, the term containing $(y-x)$ vanishes, because
$$ \int (y-x)\exp\left(-\frac{(y-x)^2}{2\nu^2}\right)dy = 0, $$
while the surviving term reduces to $\frac{1}{(\eta^2+\nu^2)\,\eta^2}\int f_{X;\mu,\eta}(x)\,(x-\mu)^2\,dx = \frac{\eta^2}{(\eta^2+\nu^2)\,\eta^2}$. Hence:
$$ i_C(f_{Y,X;\mu,\eta,\nu})_{\mu} = \frac{1}{\eta^2+\nu^2} $$
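A hedged Monte Carlo check of this example (not part of the original article) follows: the information correlation is estimated as the joint expectation of the product of the two marginal scores and compared with 1/(η² + ν²). Parameter values and sample size are assumptions.

```python
# A minimal Monte Carlo sketch (not from the paper) for Example 4.
import numpy as np

rng = np.random.default_rng(4)
mu, eta, nu = 0.7, 1.0, 2.0
n = 1_000_000
x = rng.normal(mu, eta, n)
y = x + rng.normal(0.0, nu, n)

score_x = (x - mu) / eta**2                  # d ln f_{X;mu,eta}(x) / d mu
score_y = (y - mu) / (eta**2 + nu**2)        # d ln f_{Y;mu,eta,nu}(y) / d mu

print(np.mean(score_x * score_y), 1 / (eta**2 + nu**2))   # both ~ 0.2
```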
Theorem 7. The information correlation is bounded according to:
$$ \left(i_C(f_{X,Y;\theta})_{\theta_k}\right)^2 \leq i_F(f_{X;\theta})_{\theta_k}\,i_F(f_{Y;\theta})_{\theta_k} $$
Proof.
$$ 0 \leq \int\int f_{X,Y;\theta}(x,y)\left(a\,\frac{\partial \ln f_{X;\theta}(x)}{\partial \theta_k} + \frac{\partial \ln f_{Y;\theta}(y)}{\partial \theta_k}\right)^2 dx\,dy $$
which can be reexpressed as:
$$ 0 \leq a^2\,i_F(f_{X;\theta})_{\theta_k} + 2a\,i_C(f_{X,Y;\theta})_{\theta_k} + i_F(f_{Y;\theta})_{\theta_k} $$
This is a second-degree polynomial in a that is non-negative for every possible a. Hence, its discriminant has to comply with $4\left(i_C(f_{X,Y;\theta})_{\theta_k}\right)^2 - 4\,i_F(f_{X;\theta})_{\theta_k}\,i_F(f_{Y;\theta})_{\theta_k} \leq 0$, which proves the theorem. □
Definition 6. The information correlation coefficient is defined by:
$$ \rho_F \triangleq \frac{i_C(f_{X,Y;\theta})_{\theta_k}}{\sqrt{i_F(f_{X;\theta})_{\theta_k}\,i_F(f_{Y;\theta})_{\theta_k}}} $$
Theorem 8. The information correlation coefficient is limited by:
$$ -1 \leq \rho_F \leq 1 $$
Proof. This comes from the definition of the information correlation coefficient and Theorem 7. □
Theorem 9. If at least one of the following conditions:
  • X and Y are independent.
  • Either fX;θ or fY;θ does not depend on θk.
is true, then:
$$ i_C(f_{X,Y;\theta})_{\theta_k} = 0 $$
Proof. If either density function does not depend on θk, its logarithmic derivative vanishes and the integrand of the information correlation is identically zero. If X and Y are independent, the integral factorizes into $\int \frac{\partial f_{X;\theta}(x)}{\partial \theta_k}dx \cdot \int \frac{\partial f_{Y;\theta}(y)}{\partial \theta_k}dy$, and each factor is zero whenever the corresponding density function complies with the boundary condition (see Appendix A). □

7. Mutual Fisher Information Type I

As happens in Shannon’s differential entropy handling, in this work, mutual Fisher information is also defined as relative Fisher information Type I, where the argument is the ratio between a joint density function and the product of its marginals.

7.1. Definition

Definition 7. The mutual Fisher information Type I is defined by:
$$ m_F^{(I)}(f_{X,Y;\theta})_{\theta_k} \triangleq \int\int f_{X,Y;\theta}(x,y)\left(\frac{\partial}{\partial \theta_k}\ln\left(\frac{f_{X,Y;\theta}(x,y)}{f_{X;\theta}(x)\,f_{Y;\theta}(y)}\right)\right)^2 dx\,dy $$
From the definition, it is obvious that $m_F^{(I)}(f_{X,Y;\theta})_{\theta_k} \geq 0$.
Theorem 10. If the boundary condition (see Appendix A) with respect to θk holds for fX,Y;θ(x,y), the mutual Fisher information Type I can be reformulated as a function of the Fisher information as follows:
$$ m_F^{(I)}(f_{X,Y;\theta})_{\theta_k} = i_F(f_{X|Y;\theta})_{\theta_k} - i_F(f_{X;\theta})_{\theta_k} + 2\,i_C(f_{X,Y;\theta})_{\theta_k} $$
$$ = i_F(f_{Y|X;\theta})_{\theta_k} - i_F(f_{Y;\theta})_{\theta_k} + 2\,i_C(f_{X,Y;\theta})_{\theta_k} $$
Proof.
$$ \left(\frac{\partial}{\partial \theta_k}\ln\left(\frac{f_{X,Y;\theta}(x,y)}{f_{X;\theta}(x)\,f_{Y;\theta}(y)}\right)\right)^2 = \left(\frac{\partial \ln f_{X,Y;\theta}(x,y)}{\partial \theta_k} - \frac{\partial \ln f_{X;\theta}(x)}{\partial \theta_k} - \frac{\partial \ln f_{Y;\theta}(y)}{\partial \theta_k}\right)^2 $$
Since $f_{X,Y;\theta}(x,y) = f_{X|Y;\theta}(x|y)\,f_{Y;\theta}(y)$, the scores of the joint density and of $f_{Y;\theta}$ combine. Simplifying:
$$ \left(\frac{\partial}{\partial \theta_k}\ln\left(\frac{f_{X,Y;\theta}(x,y)}{f_{X;\theta}(x)\,f_{Y;\theta}(y)}\right)\right)^2 = \left(\frac{\partial \ln f_{X|Y;\theta}(x|y)}{\partial \theta_k}\right)^2 + \left(\frac{\partial \ln f_{X;\theta}(x)}{\partial \theta_k}\right)^2 - 2\,\frac{\partial \ln f_{X|Y;\theta}(x|y)}{\partial \theta_k}\,\frac{\partial \ln f_{X;\theta}(x)}{\partial \theta_k} $$
Integrating against $f_{X,Y;\theta}(x,y)$, the first two terms produce $i_F(f_{X|Y;\theta})_{\theta_k}$ and $i_F(f_{X;\theta})_{\theta_k}$. Now, for the cross term,
$$ 2\int\int f_{X,Y;\theta}(x,y)\,\frac{\partial \ln f_{X|Y;\theta}(x|y)}{\partial \theta_k}\,\frac{\partial \ln f_{X;\theta}(x)}{\partial \theta_k}\,dx\,dy = 2\int \frac{1}{f_{X;\theta}(x)}\,\frac{\partial f_{X;\theta}(x)}{\partial \theta_k}\left(\int f_{Y;\theta}(y)\,\frac{\partial f_{X|Y;\theta}(x|y)}{\partial \theta_k}\,dy\right)dx $$
Assuming that $f_{X,Y;\theta}$ complies with the boundary condition (see Appendix A) with respect to θk, then:
$$ \frac{\partial f_{X;\theta}(x)}{\partial \theta_k} = \frac{\partial}{\partial \theta_k}\int f_{X,Y;\theta}(x,y)\,dy = \int f_{Y;\theta}(y)\,\frac{\partial f_{X|Y;\theta}(x|y)}{\partial \theta_k}\,dy + \int f_{X|Y;\theta}(x|y)\,\frac{\partial f_{Y;\theta}(y)}{\partial \theta_k}\,dy $$
Hence,
$$ \int f_{Y;\theta}(y)\,\frac{\partial f_{X|Y;\theta}(x|y)}{\partial \theta_k}\,dy = \frac{\partial f_{X;\theta}(x)}{\partial \theta_k} - \int f_{X|Y;\theta}(x|y)\,\frac{\partial f_{Y;\theta}(y)}{\partial \theta_k}\,dy $$
Using the previous result, it is obtained:
$$ 2\int\int f_{X,Y;\theta}(x,y)\,\frac{\partial \ln f_{X|Y;\theta}(x|y)}{\partial \theta_k}\,\frac{\partial \ln f_{X;\theta}(x)}{\partial \theta_k}\,dx\,dy = 2\int \frac{1}{f_{X;\theta}(x)}\left(\frac{\partial f_{X;\theta}(x)}{\partial \theta_k}\right)^2 dx - 2\int\int \frac{f_{X,Y;\theta}(x,y)}{f_{X;\theta}(x)\,f_{Y;\theta}(y)}\,\frac{\partial f_{X;\theta}(x)}{\partial \theta_k}\,\frac{\partial f_{Y;\theta}(y)}{\partial \theta_k}\,dx\,dy $$
$$ = 2\,i_F(f_{X;\theta})_{\theta_k} - 2\,i_C(f_{X,Y;\theta})_{\theta_k} $$
This implies:
$$ m_F^{(I)}(f_{X,Y;\theta})_{\theta_k} = i_F(f_{X|Y;\theta})_{\theta_k} + i_F(f_{X;\theta})_{\theta_k} - 2\,i_F(f_{X;\theta})_{\theta_k} + 2\,i_C(f_{X,Y;\theta})_{\theta_k} = i_F(f_{X|Y;\theta})_{\theta_k} - i_F(f_{X;\theta})_{\theta_k} + 2\,i_C(f_{X,Y;\theta})_{\theta_k} $$
The other result is obtained analogously. □
Example 5. Continuing with the example where Y = X + N, the mutual Fisher information Type I is given by:
$$ m_F^{(I)}(f_{Y,X;\mu,\eta,\nu})_{\mu} = i_F(f_{Y|X;\mu,\eta,\nu})_{\mu} - i_F(f_{Y;\mu,\eta,\nu})_{\mu} + 2\,i_C(f_{Y,X;\mu,\eta,\nu})_{\mu} $$
$$ = i_F(f_{N;\nu})_{\mu} - i_F(f_{Y;\mu,\eta,\nu})_{\mu} + 2\,i_C(f_{Y,X;\mu,\eta,\nu})_{\mu} $$
$$ = 0 - \frac{1}{\eta^2+\nu^2} + \frac{2}{\eta^2+\nu^2} $$
$$ = \frac{1}{\eta^2+\nu^2} $$
where, as in Example 3, $i_F(f_{N;\nu})_{\mu} = 0$ because $f_{N;\nu}$ does not depend on µ.
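The value obtained in Example 5 can be verified directly from Definition 7. The sketch below, not part of the original article, evaluates the score of the density ratio by Monte Carlo for the same Y = X + N setup; parameter values and sample size are assumptions.

```python
# A minimal Monte Carlo sketch (not from the paper) evaluating Definition 7
# directly for Y = X + N; the result should be close to 1/(eta^2 + nu^2).
import numpy as np

rng = np.random.default_rng(5)
mu, eta, nu = 0.7, 1.0, 2.0
n = 1_000_000
x = rng.normal(mu, eta, n)
y = x + rng.normal(0.0, nu, n)

score_joint = (x - mu) / eta**2              # d ln f_{X,Y;mu}/d mu (the f_N factor drops out)
score_x = (x - mu) / eta**2                  # d ln f_{X;mu}/d mu
score_y = (y - mu) / (eta**2 + nu**2)        # d ln f_{Y;mu}/d mu

s = score_joint - score_x - score_y          # score of the ratio in Definition 7
print(np.mean(s**2), 1 / (eta**2 + nu**2))   # mutual Fisher information Type I
```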

7.2. Conditional Mutual Fisher Information of Type I

Definition 8. The conditional information correlation with respect to θk of random variables X and Y given random variable Z is defined by:
$$ i_C(f_{X,Y|Z;\theta})_{\theta_k} \triangleq \int\int\int f_{X,Y,Z;\theta}(x,y,z)\,\frac{\partial \ln f_{X|Z;\theta}(x|z)}{\partial \theta_k}\,\frac{\partial \ln f_{Y|Z;\theta}(y|z)}{\partial \theta_k}\,dx\,dy\,dz $$
Definition 9. The conditional mutual Fisher information of Type I of random variables X and Y given random variable Z is defined by:
$$ m_F^{(I)}(f_{X,Y|Z;\theta})_{\theta_k} \triangleq \int\int\int f_{X,Y,Z;\theta}(x,y,z)\left(\frac{\partial}{\partial \theta_k}\ln\left(\frac{f_{X,Y|Z;\theta}(x,y|z)}{f_{X|Z;\theta}(x|z)\,f_{Y|Z;\theta}(y|z)}\right)\right)^2 dx\,dy\,dz $$
Corollary 1. If the boundary condition (see Appendix A) with respect to θk holds for fX,Y,Z;θ(x, y, z), the conditional mutual Fisher information of Type I of random variables X and Y given random variable Z can be reformulated as a function of the Fisher information as follows:
$$ m_F^{(I)}(f_{X,Y|Z;\theta})_{\theta_k} = i_F(f_{X|Y,Z;\theta})_{\theta_k} - i_F(f_{X|Z;\theta})_{\theta_k} + 2\,i_C(f_{X,Y|Z;\theta})_{\theta_k} $$
$$ = i_F(f_{Y|X,Z;\theta})_{\theta_k} - i_F(f_{Y|Z;\theta})_{\theta_k} + 2\,i_C(f_{X,Y|Z;\theta})_{\theta_k} $$
Proof. This follows analogously to the proof of the simpler case. □

8. Relative Fisher Information Type II

Given that there is an alternative expression for the Fisher information (see Theorem 1), there is another way of defining the relative Fisher information.
Definition 10. The relative Fisher information Type II is defined by:
$$ d_F^{(II)}(f_{X;\theta}\,\|\,f_{Y;\theta})_{\theta_k} \triangleq -\int f_{X;\theta}(x)\,\frac{\partial^2}{\partial \theta_k^2}\ln\left(\frac{f_{X;\theta}(x)}{f_{Y;\theta}(x)}\right)dx $$
Even though both definitions of the relative Fisher information are derived from equivalent expressions for the Fisher information, they are not themselves equivalent. The reason is that the two Fisher information expressions coincide only when their argument is a density function, whereas the argument here is a ratio of density functions, for which the equivalence no longer holds.
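A small symbolic example can make the difference concrete. The sketch below is not part of the original article: it compares both relative Fisher information types for two Gaussian densities with equal variance whose means differ by a shift a, an arbitrary illustrative assumption; Type I gives a²/η⁴ while Type II gives zero.

```python
# A small symbolic sketch (not from the paper): the two relative Fisher
# information definitions evaluated for two equal-variance Gaussians whose
# means are theta and theta + a.
import sympy as sp

x, theta, a = sp.symbols('x theta a', real=True)
eta = sp.symbols('eta', positive=True)

fX = sp.exp(-(x - theta)**2 / (2 * eta**2)) / (sp.sqrt(2 * sp.pi) * eta)
# ln(fX/fY): the normalization constants cancel because the variances are equal.
log_ratio = (-(x - theta)**2 + (x - theta - a)**2) / (2 * eta**2)

score = sp.simplify(sp.diff(log_ratio, theta))      # equals a/eta**2
curv = sp.simplify(sp.diff(log_ratio, theta, 2))    # equals 0

d_type1 = sp.integrate(fX * score**2, (x, -sp.oo, sp.oo))   # relative FI Type I
d_type2 = -sp.integrate(fX * curv, (x, -sp.oo, sp.oo))      # relative FI Type II

print(sp.simplify(d_type1), d_type2)                # a**2/eta**4 versus 0
```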

9. Mutual Fisher Information Type II

Analogously to the definition of the mutual Fisher information Type I, but in this case using the relative Fisher information of Type II, the following definition is obtained:
Definition 11. The mutual Fisher information Type II is defined by:
$$ m_F^{(II)}(f_{X,Y;\theta})_{\theta_k} \triangleq -\int\int f_{X,Y;\theta}(x,y)\,\frac{\partial^2}{\partial \theta_k^2}\ln\left(\frac{f_{X,Y;\theta}(x,y)}{f_{X;\theta}(x)\,f_{Y;\theta}(y)}\right)dx\,dy $$
Theorem 11. If the boundary condition (see Appendix A) with respect to θk holds for the intervening density functions, the mutual Fisher information Type II can be reformulated as a function of the Fisher information as follows:
$$ m_F^{(II)}(f_{X,Y;\theta})_{\theta_k} = i_F(f_{X,Y;\theta})_{\theta_k} - i_F(f_{X;\theta})_{\theta_k} - i_F(f_{Y;\theta})_{\theta_k} $$
Proof.
$$ m_F^{(II)}(f_{X,Y;\theta})_{\theta_k} = -\int\int f_{X,Y;\theta}(x,y)\,\frac{\partial^2}{\partial \theta_k^2}\ln\left(\frac{f_{X,Y;\theta}(x,y)}{f_{X;\theta}(x)\,f_{Y;\theta}(y)}\right)dx\,dy $$
$$ = -\int\int f_{X,Y;\theta}(x,y)\,\frac{\partial^2 \ln f_{X,Y;\theta}(x,y)}{\partial \theta_k^2}\,dx\,dy + \int\int f_{X,Y;\theta}(x,y)\,\frac{\partial^2 \ln f_{X;\theta}(x)}{\partial \theta_k^2}\,dx\,dy + \int\int f_{X,Y;\theta}(x,y)\,\frac{\partial^2 \ln f_{Y;\theta}(y)}{\partial \theta_k^2}\,dx\,dy $$
from which the theorem follows: by Theorem 1, the first term equals $i_F(f_{X,Y;\theta})_{\theta_k}$, while the second and third terms, after marginalizing over the other variable, equal $-i_F(f_{X;\theta})_{\theta_k}$ and $-i_F(f_{Y;\theta})_{\theta_k}$, respectively. □
Corollary 2.
$$ m_F^{(II)}(f_{X,Y;\theta})_{\theta_k} = i_F(f_{X|Y;\theta})_{\theta_k} - i_F(f_{X;\theta})_{\theta_k} $$
$$ = i_F(f_{Y|X;\theta})_{\theta_k} - i_F(f_{Y;\theta})_{\theta_k} $$
Proof. This comes from combining Theorem 11 and the chain rule for Fisher information. □
Example 6. For the example where Y = X + N, the mutual Fisher information Type II is given by:
$$ m_F^{(II)}(f_{Y,X;\mu,\eta,\nu})_{\mu} = i_F(f_{Y|X;\mu,\eta,\nu})_{\mu} - i_F(f_{Y;\mu,\eta,\nu})_{\mu} = i_F(f_{N;\nu})_{\mu} - i_F(f_{Y;\mu,\eta,\nu})_{\mu} $$
$$ = 0 - \frac{1}{\eta^2+\nu^2} $$
$$ = -\frac{1}{\eta^2+\nu^2} $$
Corollary 3.
$$ m_F^{(I)}(f_{X,Y;\theta})_{\theta_k} = m_F^{(II)}(f_{X,Y;\theta})_{\theta_k} + 2\,i_C(f_{X,Y;\theta})_{\theta_k} $$
Proof. This can be deduced from the mutual Fisher information theorems. □
Given that $m_F^{(I)}$ is always greater than or equal to zero, $m_F^{(II)}$ can be positive or negative according to the value of the information correlation, as Example 6 illustrates.

10. Other Properties

10.1. Lower Bound for Fisher Information

Stam’s inequality [8,9,40,4750] states a lower bound for Fisher information, which links Fisher information and Shannon’s entropy power. However, this expression is limited to the special case where the parameters in the Fisher information expression correspond to a location parameter.
A more general result was recently proven by Stein et al. [51], which says that given a multidimensional random variable with density function fX;θ with:
μ ( θ ) = S x f x ; θ ( x ) d x
( θ ) = S ( x μ ( θ ) ) ( x μ ( θ ) ) T f x ; θ ( x ) d x
If the Fisher information matrix is defined by:
F ( f X ; θ ) = S f X ; θ ( x ) ( ln f X ; θ ( x ) θ ) T ( ln f X ; θ ( x ) θ ) d x
then:
F ( f X ; θ ) ¯ ( μ ( θ ) θ ) T 1 ( θ ) ( μ ( θ ) θ )
if μ ( θ ) θ exists. The authors of [51] explain that this is the same as saying that:
0 x T ( F ( f X ; θ ) ( μ ( θ ) θ ) T 1 ( θ ) ( μ ( θ ) θ ) ) x
The previous expression states that the difference of matrices between the large parenthesis is a positive semi-definite matrix. Thus, its diagonal elements are non-negative, and it can be stated:
Corollary 4. The following lower bound for Fisher information holds:
i = 1 m j = 1 m μ i θ k c i j 1 μ j θ k i F ( f X ; θ ) θ k
and c i j 1 stands for the ij-th element of Σ−1 (θ).
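As a hedged numerical illustration of Corollary 4 (not from the paper), the sketch below evaluates both sides of the bound for a scalar log-normal random variable X = exp(Z) with Z ~ N(µ, σ²) and θk = µ, a distribution chosen here, as an assumption, because the bound is then not tight: the left-hand side approaches 1/(e^{σ²} − 1), whereas the Fisher information equals 1/σ².

```python
# A small numerical sketch (not from the paper) of the lower bound in
# Corollary 4 for a scalar log-normal random variable.
import numpy as np

rng = np.random.default_rng(6)
mu, sigma = 0.4, 0.8
z = rng.normal(mu, sigma, 2_000_000)
x = np.exp(z)

# Left-hand side: (d mean / d mu)^2 / variance, with mean = exp(mu + sigma^2/2)
# and d mean / d mu = mean for the log-normal family.
m = np.exp(mu + sigma**2 / 2)
lhs = m**2 / np.var(x)

# Fisher information about mu: the score is (ln x - mu) / sigma^2.
rhs = np.mean(((np.log(x) - mu) / sigma**2) ** 2)

print(lhs, rhs)    # lhs ~ 1/(exp(sigma^2) - 1) < rhs ~ 1/sigma^2
```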

10.2. In Some Cases, Conditioning Increases the Fisher Information

The following result states that, in some cases, conditioning a random variable on another one may increase the Fisher information. This result is a generalization of one previously published by Zamir [32].
Theorem 12 (Conditioning Increases Information). If fY|X;θ depends on θk and fX does not depend on it, then:
$$ i_F(f_{Y;\theta})_{\theta_k} \leq i_F(f_{Y|X;\theta})_{\theta_k} $$
Proof. Given that only fY|X;θ depends on θk, Theorem 9 guarantees that:
$$ i_C(f_{X,Y;\theta})_{\theta_k} = 0 $$
Hence, from the previous mutual Fisher information expressions (which require the boundary condition of Appendix A):
$$ 0 \leq m_F^{(I)}(f_{X,Y;\theta})_{\theta_k} = i_F(f_{Y|X;\theta})_{\theta_k} - i_F(f_{Y;\theta})_{\theta_k} $$
Thus:
$$ i_F(f_{Y;\theta})_{\theta_k} \leq i_F(f_{Y|X;\theta})_{\theta_k} $$
□

10.3. Data Processing Inequality

Following the same analysis done by Cover and Thomas to present the data processing theorem for Shannon entropy [30] and continuing with the work done by Zamir [32], the case where the joint density function of the random variables R, S and T can be expressed by fR,S,T;θ = fR;θ · fS|R;θ · fT|S;θ is considered. In this case, they form a short Markov chain that is represented by R → S → T. Because Markovicity implies conditional independence, it is true that fR,T|S;θ = fR|S;θ · fT|S;θ.
Theorem 13. Given a Markov chain R → S → T, where only fT|S;θ depends on θk, then:
$$ m_F^{(I)}(f_{R,T;\theta})_{\theta_k} \leq m_F^{(I)}(f_{S,T;\theta})_{\theta_k} $$
Proof. From the previous results:
$$ m_F^{(I)}(f_{(R,S),T;\theta})_{\theta_k} = i_F(f_{R,S|T;\theta})_{\theta_k} - i_F(f_{R,S;\theta})_{\theta_k} + 2\,i_C(f_{(R,S),T;\theta})_{\theta_k} $$
$$ = i_F(f_{R|S,T;\theta})_{\theta_k} + i_F(f_{S|T;\theta})_{\theta_k} - i_F(f_{R|S;\theta})_{\theta_k} - i_F(f_{S;\theta})_{\theta_k} + 2\,i_C(f_{(R,S),T;\theta})_{\theta_k} $$
$$ = \left(i_F(f_{R|S,T;\theta})_{\theta_k} - i_F(f_{R|S;\theta})_{\theta_k} + 2\,i_C(f_{R,T|S;\theta})_{\theta_k}\right) - 2\,i_C(f_{R,T|S;\theta})_{\theta_k} + \left(i_F(f_{S|T;\theta})_{\theta_k} - i_F(f_{S;\theta})_{\theta_k} + 2\,i_C(f_{S,T;\theta})_{\theta_k}\right) - 2\,i_C(f_{S,T;\theta})_{\theta_k} + 2\,i_C(f_{(R,S),T;\theta})_{\theta_k} $$
$$ = m_F^{(I)}(f_{R,T|S;\theta})_{\theta_k} + m_F^{(I)}(f_{S,T;\theta})_{\theta_k} - 2\,i_C(f_{R,T|S;\theta})_{\theta_k} - 2\,i_C(f_{S,T;\theta})_{\theta_k} + 2\,i_C(f_{(R,S),T;\theta})_{\theta_k} $$
Analogously:
$$ m_F^{(I)}(f_{(R,S),T;\theta})_{\theta_k} = m_F^{(I)}(f_{S,T|R;\theta})_{\theta_k} + m_F^{(I)}(f_{R,T;\theta})_{\theta_k} - 2\,i_C(f_{S,T|R;\theta})_{\theta_k} - 2\,i_C(f_{R,T;\theta})_{\theta_k} + 2\,i_C(f_{(R,S),T;\theta})_{\theta_k} $$
Because only fT|S;θ depends on θk, and all of the information correlation terms contain derivatives of density functions that do not depend on this parameter, all of the information correlation terms are zero. Hence:
$$ m_F^{(I)}(f_{R,T|S;\theta})_{\theta_k} + m_F^{(I)}(f_{S,T;\theta})_{\theta_k} = m_F^{(I)}(f_{S,T|R;\theta})_{\theta_k} + m_F^{(I)}(f_{R,T;\theta})_{\theta_k} $$
Given that $m_F^{(I)}(f_{R,T|S;\theta})_{\theta_k} = 0$, because R and T are independent given S, and $m_F^{(I)}(f_{S,T|R;\theta})_{\theta_k} \geq 0$, then:
$$ m_F^{(I)}(f_{R,T;\theta})_{\theta_k} \leq m_F^{(I)}(f_{S,T;\theta})_{\theta_k} $$
□
Given that, in the previous proof, all of the information correlation terms are zero, then $m_F^{(II)}(f_{R,T;\theta})_{\theta_k} = m_F^{(I)}(f_{R,T;\theta})_{\theta_k}$ and $m_F^{(II)}(f_{S,T;\theta})_{\theta_k} = m_F^{(I)}(f_{S,T;\theta})_{\theta_k}$. Thus, the following corollary is obtained:
Corollary 5. Given a Markov chain R → S → T, where only fT|S;θ depends on θk, then:
$$ m_F^{(II)}(f_{R,T;\theta})_{\theta_k} \leq m_F^{(II)}(f_{S,T;\theta})_{\theta_k} $$
Proof. This follows directly from the definition of the mutual Fisher information Type II together with the conditional independence provided by the Markovicity of the random variables; in this case, the values of the mutual Fisher information Type I and Type II are identical. □
Using the definition of the mutual Fisher information Type II and the previous expression, a result already proven by Plastino et al. [52] is readily obtained in a simpler way:
Corollary 6. From the previous results, it is obvious that:
$$ i_F(f_{T|R;\theta})_{\theta_k} \leq i_F(f_{T|S;\theta})_{\theta_k} $$
Proof. From Corollary 5:
$$ m_F^{(II)}(f_{R,T;\theta})_{\theta_k} \leq m_F^{(II)}(f_{S,T;\theta})_{\theta_k} $$
$$ i_F(f_{T|R;\theta})_{\theta_k} - i_F(f_{T;\theta})_{\theta_k} \leq i_F(f_{T|S;\theta})_{\theta_k} - i_F(f_{T;\theta})_{\theta_k} $$
$$ i_F(f_{T|R;\theta})_{\theta_k} \leq i_F(f_{T|S;\theta})_{\theta_k} $$
□
In other words, in any Markovian process, the further away the random variables used by the estimator are along the chain, the smaller the available Fisher information and, therefore, the larger the variance of the estimated parameter.
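A hedged numerical illustration of this behavior, not part of the original article, is sketched below for a Gaussian Markov chain R → S → T in which only f_{T|S} depends on the parameter: S = R + N1 and T = S + N2 with N2 ~ N(µ, s2²). All distributions and parameter values are assumptions; the conditional Fisher information given R should come out smaller than the one given S.

```python
# A minimal Monte Carlo sketch (not from the paper) of Corollary 6 on a
# Gaussian Markov chain R -> S -> T where only f_{T|S} depends on mu.
import numpy as np

rng = np.random.default_rng(7)
mu, s1, s2 = 0.3, 1.0, 0.5
n = 1_000_000
r = rng.normal(0.0, 1.0, n)
s = r + rng.normal(0.0, s1, n)
t = s + rng.normal(mu, s2, n)

score_t_given_s = (t - s - mu) / s2**2               # T | S ~ N(s + mu, s2^2)
score_t_given_r = (t - r - mu) / (s1**2 + s2**2)     # T | R ~ N(r + mu, s1^2 + s2^2)

i_t_given_s = np.mean(score_t_given_s**2)            # ~ 1/s2^2 = 4.0
i_t_given_r = np.mean(score_t_given_r**2)            # ~ 1/(s1^2 + s2^2) = 0.8

print(i_t_given_r, '<=', i_t_given_s)
```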

10.4. Upper Bound on Estimation Error

A well-known result states that, for a given variance, of all possible density functions the one that maximizes the differential entropy is the Gaussian density function [30]. Hence, for an arbitrary density function fX, some side information Y and an estimator $\hat{X}$, it is possible to obtain an estimation version of the Fano inequality [10] (p. 255):
$$ \frac{1}{2\pi e}\,e^{2 h_S(f_{X|Y})} \leq E_X\left\{\left(X - \hat{X}(Y)\right)^2\right\} $$
In the context of Fisher information, the same question arises: is it possible to bound the estimation error using this quantity as well? Surprisingly, the answer is yes, but in the form of an upper bound. Thus, Shannon entropy can be used to set lower bounds on the estimation error and Fisher information upper ones. In order to establish this bound, the following setup is defined: a random variable R is given, and a related random variable Y is observed, which, in turn, is used to calculate a function $\hat{R} = g(Y)$. It is desired to bound the probability that $(R - \hat{R})^2 > \varepsilon$. It is important to note that $R \to Y \to \hat{R}$ is a Markov chain and that $\hat{R}$ depends on θ.
Theorem 14. Given a random variable R and an estimator $\hat{R}$ of it, the estimation error is defined by:
$$ E = (R - \hat{R})^2 $$
Then, the probability that the estimation error exceeds some value ε satisfies:
$$ P\{E > \varepsilon\} \leq \frac{i_F(f_{R|\hat{R};\theta})_{\theta_k}}{i_F(f_{R|\hat{R},E=\xi;\theta})_{\theta_k}} $$
for some ξ ∈ [ε, ∞).
Proof. Using the chain rule for Fisher information:
$$ i_F(f_{R,E|\hat{R};\theta})_{\theta_k} = i_F(f_{R|\hat{R},E;\theta})_{\theta_k} + i_F(f_{E|\hat{R};\theta})_{\theta_k} = i_F(f_{E|R,\hat{R};\theta})_{\theta_k} + i_F(f_{R|\hat{R};\theta})_{\theta_k} $$
Using the fact that, given R and $\hat{R}$, E is no longer a random variable, then:
$$ i_F(f_{E|R,\hat{R};\theta})_{\theta_k} = 0 $$
Hence,
$$ i_F(f_{R|\hat{R},E;\theta})_{\theta_k} + i_F(f_{E|\hat{R};\theta})_{\theta_k} = i_F(f_{R|\hat{R};\theta})_{\theta_k} $$
Neglecting $i_F(f_{E|\hat{R};\theta})_{\theta_k}$, which is always greater than or equal to zero, it is obtained:
$$ i_F(f_{R|\hat{R},E;\theta})_{\theta_k} \leq i_F(f_{R|\hat{R};\theta})_{\theta_k} $$
Moreover, the term:
$$ i_F(f_{R|\hat{R},E;\theta})_{\theta_k} = \int_{e}\int\int_{(r,\hat{r}):(r-\hat{r})^2=e} f_{R,\hat{R},E;\theta}(r,\hat{r},e)\left(\frac{\partial \ln f_{R|\hat{R},E;\theta}(r|\hat{r},e)}{\partial \theta_k}\right)^2 dr\,d\hat{r}\,de $$
$$ = \int_{e\leq\varepsilon}\int\int_{(r,\hat{r}):(r-\hat{r})^2=e} f_{R,\hat{R},E;\theta}(r,\hat{r},e)\left(\frac{\partial \ln f_{R|\hat{R},E;\theta}(r|\hat{r},e)}{\partial \theta_k}\right)^2 dr\,d\hat{r}\,de + \int_{e>\varepsilon}\int\int_{(r,\hat{r}):(r-\hat{r})^2=e} f_{R,\hat{R},E;\theta}(r,\hat{r},e)\left(\frac{\partial \ln f_{R|\hat{R},E;\theta}(r|\hat{r},e)}{\partial \theta_k}\right)^2 dr\,d\hat{r}\,de $$
$$ \geq \int_{e>\varepsilon}\int\int_{(r,\hat{r}):(r-\hat{r})^2=e} f_{R,\hat{R},E;\theta}(r,\hat{r},e)\left(\frac{\partial \ln f_{R|\hat{R},E;\theta}(r|\hat{r},e)}{\partial \theta_k}\right)^2 dr\,d\hat{r}\,de $$
$$ = \int_{e>\varepsilon} f_{E;\theta}(e)\int\int_{(r,\hat{r}):(r-\hat{r})^2=e} f_{R,\hat{R}|E;\theta}(r,\hat{r}|e)\left(\frac{\partial \ln f_{R|\hat{R},E;\theta}(r|\hat{r},e)}{\partial \theta_k}\right)^2 dr\,d\hat{r}\,de $$
$$ = \int_{e>\varepsilon} f_{E;\theta}(e)\,i_F(f_{R|\hat{R},E=e;\theta})_{\theta_k}\,de $$
Using the mean value theorem, for some ξ ∈ [ε, ∞):
$$ i_F(f_{R|\hat{R},E=\xi;\theta})_{\theta_k}\int_{e>\varepsilon} f_{E;\theta}(e)\,de = i_F(f_{R|\hat{R},E=\xi;\theta})_{\theta_k}\,P\{E > \varepsilon\} \leq i_F(f_{R|\hat{R},E;\theta})_{\theta_k} \leq i_F(f_{R|\hat{R};\theta})_{\theta_k} $$
Hence:
$$ P\{E > \varepsilon\} \leq \frac{i_F(f_{R|\hat{R};\theta})_{\theta_k}}{i_F(f_{R|\hat{R},E=\xi;\theta})_{\theta_k}} $$
□

11. Discussion

The Fisher information, which sets a bound on how precise the estimation of an unknown parameter of a density function can be, has an associated set of properties that are equivalent to those of Shannon's differential entropy. The properties presented in this work help to understand how to manipulate and use Fisher information in ways that, so far, have been exclusive to Shannon's differential entropy. Among these properties, those of special importance are the generalization of the mutual information concept to the Fisher information realm, a new version of the data processing theorem showing that Fisher information decreases along a Markov chain, and an upper bound on the estimation error of a random variable that is regulated by the Fisher information.

A. Boundary Condition

A general result from calculus, the Leibniz integral rule, establishes that for a function g(x, θk) with suitably differentiable integrand and integration limits the following is true:
$$ \frac{\partial}{\partial \theta_k}\int_{l(\theta_k)}^{u(\theta_k)} g(x,\theta_k)\,dx = g\left(u(\theta_k),\theta_k\right)\frac{\partial u(\theta_k)}{\partial \theta_k} - g\left(l(\theta_k),\theta_k\right)\frac{\partial l(\theta_k)}{\partial \theta_k} + \int_{l(\theta_k)}^{u(\theta_k)} \frac{\partial g(x,\theta_k)}{\partial \theta_k}\,dx $$
In the case of a vector integral, the previous expression applies to all of the components without any loss of generality.
Some of the results in this work use the following condition:
Condition 1 (Boundary Condition). A function complies with the boundary condition if it is possible to neglect the boundary terms in the expression above, such that:
$$ \frac{\partial}{\partial \theta_k}\int g(x,\theta_k)\,dx = \int \frac{\partial g(x,\theta_k)}{\partial \theta_k}\,dx $$
This condition corresponds to what are sometimes called regular cases [34] (p. 373).
It is important to keep in mind that not all density functions comply with this condition. As an example, in calculations that involve the uniform density function, where the parameters define the support, the boundary terms cannot be neglected, and the boundary condition does not hold. Hence, it is always necessary to check whether the condition holds or not. If not, one may arrive at false results.
However, it is always possible to add a smooth function, one that does not change the original function too much, such that the new mathematical expression does comply with the boundary condition. In this way, functions, such as the uniform density function, as an example, can be adjusted to comply with this condition.
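The uniform-density remark can be made explicit with a short symbolic sketch, which is not part of the original article and assumes a uniform density on [0, θ]: differentiating the integral and integrating the derivative give different answers, so the exchange behind the boundary condition is not valid for this family.

```python
# A small symbolic sketch (not from the paper): the boundary condition fails
# for a uniform density on [0, theta], whose support depends on the parameter.
import sympy as sp

x, theta = sp.symbols('x theta', positive=True)
f = 1 / theta                                           # uniform density on [0, theta]

lhs = sp.diff(sp.integrate(f, (x, 0, theta)), theta)    # d/dtheta of 1  ->  0
rhs = sp.integrate(sp.diff(f, theta), (x, 0, theta))    # integral of -1/theta^2  ->  -1/theta

print(lhs, rhs)    # 0 versus -1/theta: the derivative and integral cannot be exchanged
```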

Acknowledgments

The author thanks Alexis Fuentes and Carlos Alarcón for reviewing this work, and helping to improve some expressions. The author also thanks CONICYT Chile for its grant FONDECYT 1120680.

Conflicts of Interest

The author declares no conflict of interest.

References

  1. Shannon, C. A Mathematical Theory of Communication. Bell Syst. Tech. J 1948, 27, 379–423. [Google Scholar]
  2. Fisher, R. Theory of Statistical Estimation. Proc. Camb. Philos. Soc. 1925, 22, 700–725. [Google Scholar]
  3. Rao, C.R. Information and the accuracy attainable in the estimation of statistical parameters. Bull. Calcutta Math. Soc. 1945, 37, 81–89. [Google Scholar]
  4. Cramer, H. Mathematical Methods of Statistics; Princeton University Press: Princeton, NJ, USA, 1945. [Google Scholar]
  5. Kullback, S. Information Theory and Statistics; Dover Publications Inc.: Mineola, NY, USA, 1968. [Google Scholar]
  6. Blahut, R.E. Principles and Practice of Information Theory; Addison-Wesley Publishing Company: Boston, MA, USA, 1987. [Google Scholar]
  7. Frieden, B.R. Science from Fisher Information: A Unification; Cambridge University Press: Cambridge, UK, 2004. [Google Scholar]
  8. Stam, A.J. Some Mathematical Properties of Quantities of Information. Ph.D. Thesis, Technological University of Delft, Delft, The Netherlands, 1959. [Google Scholar]
  9. Stam, A.J. Some inequalities satisfied by the quantities of information of Fisher and Shannon. Inf. Control. 1959, 2, 101–112. [Google Scholar]
  10. Cover, T.; Thomas, J. Elements of Information Theory; John Wiley and Sons, Inc: Hoboken, NJ, USA, 2006. [Google Scholar]
  11. Narayanan, K.R.; Srinivasa, A.R. On the Thermodynamic Temperature of a General Distribution; Cornell University Library: Ithaca, NY, USA, 2007. [Google Scholar]
  12. Guo, D. Relative Entropy and Score Function: New Information-Estimation Relationships through Arbitrary Additive Perturbation, Proceedings of the IEEE International Symposium on Information Theory, Seoul, Korea, 28 June–3 July 2009; pp. 814–818.
  13. Blachman, N.M. The Convolution Inequality for Entropy Powers. IEEE Trans. Inf. Theory 1965, 11, 267–271. [Google Scholar]
  14. Costa, M.H.M.; Cover, T.M. On the Similarity of the Entropy Power Inequality and the Brunn Minkowski Inequality; Technical Report; Stanford University: Stanford, CA, USA, 1983. [Google Scholar]
  15. Zamir, R.; Feder, M. A generalization of the entropy power inequality with applications. IEEE Trans. Inf. Theory 1993, 39, 1723–1728. [Google Scholar]
  16. Lutwak, E.; Yang, D.; Zhang, G. Cramer–Rao and Moment-Entropy Inequalities for Renyi Entropy and Generalized Fisher Information. IEEE Trans. Inf. Theory 2005, 51, 473–478. [Google Scholar]
  17. Frieden, B.R.; Plastino, A.; Plastino, A.R.; Soffer, B.H. Fisher-Based Thermodynamics: Its Legendre Transform and Concavity Properties. Phys. Rev. E 1999, 60, 48–53. [Google Scholar]
  18. Frieden, B.R.; Plastino, A.; Plastino, A.R.; Soffer, B.H. Non-equilibrium thermodynamics and Fisher information: An illustrative example. Phys. Lett. A 2002, 304, 73–78. [Google Scholar]
  19. Frieden, B.R.; Petri, M. Motion-dependent levels of order in a relativistic universe. Phys. Rev. E 2012, 86, 1–5. [Google Scholar]
  20. Frieden, B.R.; Gatenby, R.A. Principle of maximum Fisher information from Hardy’s axioms applied to statistical systems. Phys. Rev. E 2013, 88, 1–6. [Google Scholar]
  21. Flego, S.; Olivares, F.; Plastino, A.; Casas, M. Extreme Fisher Information, Non-Equilibrium Thermodynamics and Reciprocity Relations. Entropy 2011, 13, 184–194. [Google Scholar]
  22. Venkatesan, R.C.; Plastino, A. Legendre transform structure and extremal properties of the relative Fisher information. Phys. Lett. A 2014, 378, 1341–1345. [Google Scholar]
  23. Van Trees, H.L. Detection, Estimation, and Modulation Theory: Part 1; John Wiley and Sons, Inc: Hoboken, NJ, USA, 2001. [Google Scholar]
  24. Amari, S.I. Natural Gradient Works Efficiently in Learning. Neural Comput. 1998, 10, 251–276. [Google Scholar]
  25. Pascanu, R.; Bengio, Y. Revisiting Natural Gradient for Deep Networks; Cornell University Library: Ithaca, NY, USA, 2014; pp. 1–18. [Google Scholar]
  26. Luo, S. Maximum Shannon entropy, minimum Fisher information, and an elementary game. Found. Phys. 2002, 32, 1757–1772. [Google Scholar]
  27. Langley, R.S. Probability Functionals for Self-Consistent and Invariant Inference: Entropy and Fisher Information. IEEE Trans. Inf. Theory 2013, 59, 4397–4407. [Google Scholar]
  28. Zegers, P.; Fuentes, A.; Alarcon, C. Relative Entropy Derivative Bounds. Entropy 2013, 15, 2861–2873. [Google Scholar]
  29. Cohen, M. The Fisher Information and Convexity. IEEE Trans. Inf. Theory 1968, 14, 591–592. [Google Scholar]
  30. Cover, T.; Thomas, J. Elements of Information Theory; John Wiley and Sons, Inc: Hoboken, NJ, USA, 1991. [Google Scholar]
  31. Frieden, B.R. Physics from Fisher Information: A Unification; Cambridge University Press: Cambridge, UK, 1998. [Google Scholar]
  32. Zamir, R. A Proof of the Fisher Information Inequality Via a Data Processing Argument. IEEE Trans. Inf. Theory 1998, 44, 1246–1250. [Google Scholar]
  33. Taubman, D.; Marcellin, M. JPEG2000: Image Compression Fundamentals, Standards, and Practice; Kluwer Academic Publishers: Dordrecht, The Netherlands, 2002. [Google Scholar]
  34. Hogg, R.V.; Craig, A.T. Introduction to Mathematical Statistics; Prentice Hall: Upper Saddle River, NJ, USA, 1995. [Google Scholar]
  35. Frieden, B.R. Probability, Statistical Optics, and Data Testing; Springer-Verlag: Berlin, Germany, 1991. [Google Scholar]
  36. Otto, F.; Villani, C. Generalization of an Inequality by Talagrand and Links with the Logarithmic Sobolev Inequality. J. Funct. Anal. 2000, 173, 361–400. [Google Scholar]
  37. Yáñez, R.J.; Sánchez-Moreno, P.; Zarzo, A.; Dehesa, J.S. Fisher information of special functions and second-order differential equations. J. Math. Phys. 2008, 49, 082104. [Google Scholar]
  38. Gianazza, U.; Savaré, G.; Toscani, G. The wasserstein gradient flow of the fisher information and the quantum drift-diffusion equation. Arch. Ration. Mech. Anal. 2009, 194, 133–220. [Google Scholar]
  39. Verdú, S. Mismatched Estimation and Relative Entropy. IEEE Trans. Inf. Theory 2010, 56, 3712–3720. [Google Scholar]
  40. Hirata, M.; Nemoto, A.; Yoshida, H. An integral representation of the relative entropy. Entropy 2012, 14, 1469–1477. [Google Scholar]
  41. Sánchez-Moreno, P.; Zarzo, A.; Dehesa, J.S. Jensen divergence based on Fisher’s information. J. Phys. A: Math. Theor. 2012, 45, 125305. [Google Scholar]
  42. Yamano, T. Phase space gradient of dissipated work and information: A role of relative Fisher information. J. Math. Phys. 2013, 54, 1–9. [Google Scholar]
  43. Yamano, T. De Bruijn-type identity for systems with flux. Eur. Phys. J. B 2013, 86, 363. [Google Scholar]
  44. Bobkov, S.G.; Chistyakov, G.P.; Gotze, F. Fisher information and the central limit theorem. Probab. Theory Relat. Fields. 2014, 159, 1–59. [Google Scholar]
  45. Zegers, P. Some New Results on The Architecture, Training Process, and Estimation Error Bounds for Learning Machines. Ph.D. Thesis, The University of Arizona, Tucson, AZ, USA, 2002. [Google Scholar]
  46. Kullback, S.; Leibler, R.A. On Information and Sufficiency. Ann. Math. Stat. 1951, 22, 79–86. [Google Scholar]
  47. Lutwak, E.; Yang, D.; Zhang, G. Renyi entropy and generalized Fisher information. IEEE Trans. Inf. Theory 2005, 51, 473–478. [Google Scholar]
  48. Kagan, A.; Yu, T. Some Inequalities Related to the Stam Inequality. Appl. Math. 2008, 53, 195–205. [Google Scholar]
  49. Lutwak, E.; Lv, S.; Yang, D.; Zhang, G. Extensions of Fisher Information and Stam’s Inequality. IEEE Trans. Inf. Theory 2012, 58, 1319–1327. [Google Scholar]
  50. Bercher, J.F. On Generalized Cramér-Rao Inequalities, and an Extension of the Shannon-Fisher-Gauss Setting; Cornell University Library: Ithaca, NY, USA, 2014. [Google Scholar]
  51. Stein, M.; Mezghani, A.; Nossek, J.A. A Lower Bound for the Fisher Information Measure. IEEE Signal Process. Lett. 2014, 21, 796–799. [Google Scholar]
  52. Plastino, A.; Plastino, A. Symmetries of the Fokker-Planck equation and the Fisher-Frieden arrow of time. Phys. Rev. E 1996, 54, 4423–4426. [Google Scholar]
