[time 288] Fisher Information


Stephen P. King (stephenk1@home.com)
Sat, 08 May 1999 09:47:15 -0400


On Mon, 26 Apr 1999, Stephen Paul King wrote:

> Could someone give a definition of Fisher Information that a
> mindless philosopher would understand? :)

Let me start with a couple of intuitive ideas which should help to
orient you. Fisher information is related to the notion of information
(a kind of "entropy") developed by Shannon in 1948, but it is not the
same. Roughly speaking:

 1. Shannon entropy is the volume of a "typical set"; Fisher information
     is the area of a "typical set",

 2. Shannon entropy is allied to "nonparametric statistics"; Fisher
     information is allied to "parametric statistics".

Now, for the definition.

Let f(x,t) be a family of probability densities parametrized by t.
For example,

  f(x,t) = (1/t) exp(-x/t), on x >= 0

In parametric statistics, we want to estimate which t gives the best fit
to a finite data set, say of size n. An estimator is a function from
n-tuple data samples to the set of possible parameter values, e.g.
(0,infty) in the example above. Given an estimator, its bias, as a
function of t, is the difference between the expected value (as we range
over x) of the estimator, according to the density f(.,t), and the
actual value of t. The variance of the estimator, as a function of t, is
the expectation (as we range over x), according to f(.,t), of the
squared difference between t and the value of the estimator. If the bias
vanishes (in this case the estimator is called unbiased), the variance
will usually still be a positive function of t. It is natural to try to
minimize the variance over the set of unbiased estimators defined for a
given family of densities f(.,t).
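
To make the bias and variance of an estimator concrete, here is a
minimal Python sketch, assuming the exponential family above with the
sample mean as the estimator; the parameter value, sample size, and
variable names are just illustrative choices.

  import numpy as np

  rng = np.random.default_rng(0)

  t_true = 2.0      # the parameter value we pretend is unknown
  n = 50            # sample size
  trials = 100_000  # number of simulated data sets

  # Draw `trials` independent samples of size n from f(x,t) = (1/t) exp(-x/t)
  # and apply the estimator (here: the sample mean) to each one.
  samples = rng.exponential(scale=t_true, size=(trials, n))
  estimates = samples.mean(axis=1)

  bias = estimates.mean() - t_true   # near 0: the sample mean is unbiased
  variance = estimates.var()         # near t^2/n

  print(f"bias     ~ {bias:+.4f}")
  print(f"variance ~ {variance:.4f}   (t^2/n = {t_true**2 / n:.4f})")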

Given a family of densities, the score is the logarithmic derivative

 V(x,t) = d/dt log f(x,t) = [d/dt f(x,t)] / f(x,t)

(We are tacitly now assuming some differentiability properties of our
parametrized family of densities.) The mean of the score (as we average
over x) is always zero. The Fisher information is the variance of the
score:

 J(t) = expected value of the square of V(x,t) as we vary x

Notice this is a function of t defined in terms of a specific
parametrized family of densities. (Of course, the definition is readily
generalized to more than one parameter.)
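
For the exponential family above we have log f(x,t) = -log t - x/t, so
the score works out to V(x,t) = (x - t)/t^2 and the Fisher information
to J(t) = 1/t^2. Here is a small Python sketch (with arbitrary
illustrative values) checking by simulation that the score has mean zero
and variance 1/t^2.

  import numpy as np

  rng = np.random.default_rng(1)
  t = 2.0
  x = rng.exponential(scale=t, size=1_000_000)

  # Score of the exponential family, worked out by hand:
  # d/dt log f(x,t) = d/dt (-log t - x/t) = (x - t)/t^2
  score = (x - t) / t**2

  print(f"mean of score     ~ {score.mean():+.5f}  (theory: 0)")
  print(f"variance of score ~ {score.var():.5f}  (theory: 1/t^2 = {1/t**2:.5f})")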

The fundamentally important Cramer-Rao inequality says that

 variance of any unbiased estimator >= 1/J(t)

(If the data consist of n independent observations from f(.,t), the
Fisher information of the whole sample is n J(t), so the bound becomes
1/(n J(t)).) Thus, in parametric statistics one wants to find estimators
which achieve the optimal variance, the reciprocal of the Fisher
information. From this point of view, the larger the Fisher information,
the more precisely one can (using a suitable estimator) fit a
distribution from the given parametrized family to the data.
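
Continuing the exponential example, here is a rough Python sketch (the
numbers are arbitrary) comparing two unbiased estimators of t against
the Cramer-Rao bound t^2/n: the sample mean attains the bound, while n
times the sample minimum, though also unbiased, has much larger
variance.

  import numpy as np

  rng = np.random.default_rng(2)
  t, n, trials = 2.0, 50, 100_000

  samples = rng.exponential(scale=t, size=(trials, n))
  mean_est = samples.mean(axis=1)      # attains the bound: variance t^2/n
  min_est = n * samples.min(axis=1)    # also unbiased, but variance t^2

  bound = t**2 / n                     # 1 / (n J(t)) with J(t) = 1/t^2
  print(f"Cramer-Rao bound          {bound:.4f}")
  print(f"variance of sample mean ~ {mean_est.var():.4f}")
  print(f"variance of n * minimum ~ {min_est.var():.4f}")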

(Incidentally: someone has mentioned the work of Roy Frieden, who has
attempted to relate the Cramer-Rao inequality to the Heisenberg
inequality. See the simple "folklore" theorem (with complete proof) I
posted on a generalized Heisenberg inequality in sci.physics.research a
few months ago--- you should be able to find it using Deja News.)

This setup is more flexible than might at first appear. For instance,
given a density f(x), where x is real, define the family of densities
f(x-t); then the Fisher information is

 J(t) = expectation of [d/dt log f(x-t)]^2

      = int f(x-t) [d/dt log f(x-t)]^2 dx

By a change of variables, we find that for a fixed density f, this is a
constant. In this way, we can change our point of view and define a
(nonlinear) functional on densities f:

 J(f) = int f(x) [f'(x)/f(x)]^2 dx

On the other hand, Shannon's "continuous" entropy is the (nonlinear)
functional:

 H(f) = -int f(x) log f(x) dx
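
As a sanity check on these two functionals, here is a Python sketch
(grid width and spacing are arbitrary choices) that evaluates J(f) and
H(f) by numerical integration for a Gaussian density with standard
deviation sigma, for which the closed forms are J(f) = 1/sigma^2 and
H(f) = (1/2) log(2 pi e sigma^2) in nats.

  import numpy as np

  sigma = 1.5
  x = np.linspace(-12 * sigma, 12 * sigma, 200_001)
  dx = x[1] - x[0]

  f = np.exp(-x**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
  fprime = -x / sigma**2 * f            # exact derivative of the Gaussian density

  J = np.sum(f * (fprime / f)**2) * dx  # int f (f'/f)^2 dx
  H = -np.sum(f * np.log(f)) * dx       # -int f log f dx

  print(f"J(f) ~ {J:.5f}  (theory 1/sigma^2: {1/sigma**2:.5f})")
  print(f"H(f) ~ {H:.5f}  (theory: {0.5*np.log(2*np.pi*np.e*sigma**2):.5f})")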

Suppose that X is a random variable with finite variance and Z is an
independent normally distributed random variable with zero mean and unit
variance ("standard noise"), so that X + sqrt(t) Z is another random
variable, with associated density f_t, representing X perturbed by
noise. Then de Bruijn's identity says that

 J(f_t) = 2 d/dt H(f_t)

and if the limit t -> 0 exists, we have a formula for the Fisher
information of the density f_0 associated with X.
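
One quick way to see de Bruijn's identity numerically is to take X
itself Gaussian with variance s^2, so that X + sqrt(t) Z is Gaussian
with variance s^2 + t and both sides can be evaluated on a grid; the
Python sketch below (a rough check only, with arbitrary grid and step
sizes) finite-differences the entropy in t and compares with J(f_t).

  import numpy as np

  def gaussian(x, var):
      return np.exp(-x**2 / (2 * var)) / np.sqrt(2 * np.pi * var)

  def entropy(f, dx):
      return -np.sum(f * np.log(f)) * dx        # H(f) = -int f log f dx

  def fisher(f, dx):
      fprime = np.gradient(f, dx)               # numerical derivative of f
      return np.sum(f * (fprime / f)**2) * dx   # J(f) = int f (f'/f)^2 dx

  s2, t, dt = 1.0, 0.5, 1e-4                    # var of X, noise level, step in t
  x = np.linspace(-15.0, 15.0, 300_001)
  dx = x[1] - x[0]

  f_t = gaussian(x, s2 + t)                     # density of X + sqrt(t) Z
  dH_dt = (entropy(gaussian(x, s2 + t + dt), dx)
           - entropy(gaussian(x, s2 + t - dt), dx)) / (2 * dt)

  print(f"J(f_t)  ~ {fisher(f_t, dx):.5f}  (theory 1/(s^2+t) = {1/(s2+t):.5f})")
  print(f"2 dH/dt ~ {2 * dH_dt:.5f}")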

See Elements of Information Theory, by Cover & Thomas, Wiley, 1991, for
details on the above and for general orientation to the enormous body of
ideas which constitutes modern information theory, including typical
sets and comment (1) above. Then see some of the many other books which
cover Fisher information in more detail. In one of the books by J. N.
Kapur on maximum entropy you will find a particularly simple and nice
connection between the multivariable Fisher information and Shannon's
discrete "information" (arising from the discrete "entropy"
-sum p_j log p_j).

(Come to think of it, if you search under my name using Deja News you
should find a previous posting of mine in which I gave considerable
detail on some inequalities which are closely related to the area-volume
interpretations of Fisher information and entropy. If you've ever heard
of Hadamard's inequality on matrices, you should definitely look at the
discussion in Cover & Thomas.)

Hope this helps!

Chris Hillman


