**Stephen P. King** (*stephenk1@home.com*)

*Sat, 08 May 1999 09:47:15 -0400*


On Mon, 26 Apr 1999, Stephen Paul King wrote:

*> Could someone give a definition of Fisher Information that a mindless philosopher would understand? :)*

Let me start with a couple of intuitive ideas which should help to orient you. Fisher information is related to the notion of information (a kind of "entropy") developed by Shannon in 1948, but is not the same thing. Roughly speaking:

1. Shannon entropy is the volume of a "typical set"; Fisher information is the area of a "typical set",

2. Shannon entropy is allied to "nonparametric statistics"; Fisher information is allied to "parametric statistics".

Now, for the definition.

Let f(x,t) be a family of probability densities parametrized by t. For example,

f(x,t) = (1/t) exp(-x/t), for x >= 0

In parametric statistics, we want to estimate which t gives the best fit to a finite data set, say of size n. An estimator is a function from n-tuple data samples to the set of possible parameter values, e.g. (0,infty) in the example above. Given an estimator, its bias, as a function of t, is the difference between the expected value (as we range over x) of the estimator, according to the density f(.,t), and the actual value of t. The variance of the estimator, as a function of t, is the expectation (as we range over x), according to f(.,t), of the squared difference between the value of the estimator and its expected value; for an unbiased estimator this is the same as the expected squared difference between the estimator and t. If the bias vanishes (in this case the estimator is called unbiased), the variance will usually still be a positive function of t. It is natural to try to minimize the variance over the set of unbiased estimators defined for a given family of densities f(.,t).
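To make bias and variance concrete, here is a minimal Monte Carlo sketch for the exponential example f(x,t) = (1/t) exp(-x/t). The two estimators (`sample_mean` and the deliberately biased `shrunk_mean`), the parameter values, and the trial count are illustrative assumptions, not part of the original discussion.

```python
import random
import statistics

def sample_mean(data):
    """Unbiased estimator of t for the density (1/t) exp(-x/t)."""
    return sum(data) / len(data)

def shrunk_mean(data):
    """A deliberately biased estimator: divide by n + 1 instead of n."""
    return sum(data) / (len(data) + 1)

def bias_and_variance(estimator, t, n, trials=20000, seed=0):
    """Monte Carlo estimate of the bias and variance of `estimator` at parameter t."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(trials):
        # expovariate takes the rate 1/t, so each draw has mean t
        data = [rng.expovariate(1.0 / t) for _ in range(n)]
        estimates.append(estimator(data))
    bias = statistics.fmean(estimates) - t
    # pvariance measures spread about the empirical mean of the estimates
    variance = statistics.pvariance(estimates)
    return bias, variance

t, n = 2.0, 10
bias_mean, var_mean = bias_and_variance(sample_mean, t, n)
bias_shrunk, var_shrunk = bias_and_variance(shrunk_mean, t, n)
print("sample mean: bias", bias_mean, "variance", var_mean)
print("shrunk mean: bias", bias_shrunk, "variance", var_shrunk)
```

The sample mean should show a bias near zero, while the shrunk version trades a nonzero bias for a smaller variance, which is one reason the minimization is posed over unbiased estimators only.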

Given a family of densities, the score is the logarithmic derivative

V(x,t) = d/dt log f(x,t) = [d/dt f(x,t)] / f(x,t)

(We are now tacitly assuming some differentiability properties of our parametrized family of densities.) Under these assumptions, the mean of the score (as we average over x) is always zero. The Fisher information is the variance of the score:

J(t) = expected value of V(x,t)^2 as we vary x

Notice this is a function of t defined in terms of a specific parametrized family of densities. (Of course, the definition is readily generalized to more than one parameter.)
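As a check on the definitions of the score and the Fisher information, the following sketch evaluates them numerically for the exponential example f(x,t) = (1/t) exp(-x/t), where a short calculation gives V(x,t) = x/t^2 - 1/t and J(t) = 1/t^2. The finite-difference step, the integration cutoff, and the helper names are assumptions chosen for illustration.

```python
import math

def log_density(x, t):
    """log f(x,t) for f(x,t) = (1/t) exp(-x/t), x >= 0."""
    return -math.log(t) - x / t

def score(x, t, h=1e-5):
    """Central finite difference of log f(x,t) with respect to t."""
    return (log_density(x, t + h) - log_density(x, t - h)) / (2.0 * h)

def expectation(g, t, upper_factor=60.0, steps=100000):
    """Midpoint-rule approximation of int_0^infty g(x) f(x,t) dx.

    The integral is truncated at upper_factor * t, where the
    exponential tail is negligible.
    """
    upper = upper_factor * t
    dx = upper / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * dx
        total += g(x) * math.exp(log_density(x, t)) * dx
    return total

t = 2.0
mean_score = expectation(lambda x: score(x, t), t)
fisher_information = expectation(lambda x: score(x, t) ** 2, t)
print("mean of score:", mean_score)
print("Fisher information:", fisher_information, "closed form 1/t^2:", 1.0 / t**2)
```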

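The fundamentally important Cramer-Rao inequality says that

variance of any unbiased estimator >= 1/J(t)

Here J(t) is the Fisher information of the whole sample; for n independent observations it is n times the Fisher information of a single observation.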

Thus, in parametric statistics one wants to find estimators which achieve the optimal variance, the reciprocal of the Fisher information. From this point of view, the larger the Fisher information, the more precisely one can (using a suitable estimator) fit a distribution from the given parametrized family to the data.
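The bound can be illustrated with the exponential example, where a single observation has Fisher information 1/t^2, so n independent observations give a bound of t^2/n. The sketch below compares two unbiased estimators of t: the sample mean, and n times the sample minimum (the minimum of n such draws is again exponential, with mean t/n). Both estimators and all numerical settings are assumptions made for this illustration.

```python
import random
import statistics

def sample_mean(data):
    return sum(data) / len(data)

def scaled_minimum(data):
    # n * min(data) is also unbiased for t, but far less efficient
    return len(data) * min(data)

def mean_and_variance(estimator, t, n, trials=20000, seed=1):
    """Monte Carlo mean and variance of an estimator at parameter t."""
    rng = random.Random(seed)
    estimates = [
        estimator([rng.expovariate(1.0 / t) for _ in range(n)])
        for _ in range(trials)
    ]
    return statistics.fmean(estimates), statistics.pvariance(estimates)

t, n = 2.0, 10
cramer_rao_bound = t**2 / n  # reciprocal of n * (1/t^2)
mean_a, var_a = mean_and_variance(sample_mean, t, n)
mean_b, var_b = mean_and_variance(scaled_minimum, t, n)
print("Cramer-Rao bound:", cramer_rao_bound)
print("sample mean:    mean", mean_a, "variance", var_a)
print("scaled minimum: mean", mean_b, "variance", var_b)
```

In this family the sample mean sits essentially on the bound, while the scaled minimum, although unbiased, has variance close to t^2 regardless of n.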

(Incidentally: someone has mentioned the work of Roy Frieden, who has attempted to relate the Cramer-Rao inequality to the Heisenberg inequality. See the simple "folklore" theorem (with complete proof) I posted on a generalized Heisenberg inequality in sci.physics.research a few months ago; you should be able to find it using Deja News.)

This setup is more flexible than might at first appear. For instance, given a density f(x), where x is real, define the family of densities f(x-t); then the Fisher information is

J(t) = expectation of [d/dt log f(x-t)]^2
     = int f(x-t) [d/dt log f(x-t)]^2 dx

By a change of variables, we find that for a fixed density f, this is a constant. In this way, we can change our point of view and define a (nonlinear) functional on densities f:

J(f) = int f(x) [f'(x)/f(x)]^2 dx

On the other hand, Shannon's "continuous" entropy is the (nonlinear) functional:

H(f) = -int f(x) log f(x) dx
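Both functionals can be evaluated numerically for a concrete density. The sketch below assumes a zero-mean Gaussian f with standard deviation `sigma`, for which the closed forms are J(f) = 1/sigma^2 and H(f) = (1/2) log(2 pi e sigma^2) with natural logarithms, and it also confirms that J is unchanged when f(x) is replaced by f(x-t). The choice of density, the integration window, and the helper names are assumptions for illustration.

```python
import math

def gaussian(sigma):
    """Return the zero-mean Gaussian density with standard deviation sigma."""
    norm = sigma * math.sqrt(2.0 * math.pi)
    return lambda x: math.exp(-0.5 * (x / sigma) ** 2) / norm

def shifted(f, t):
    """The location family member x -> f(x - t)."""
    return lambda x: f(x - t)

def integrate(g, lo, hi, steps=20000):
    """Midpoint rule on [lo, hi]."""
    dx = (hi - lo) / steps
    return sum(g(lo + (i + 0.5) * dx) for i in range(steps)) * dx

def fisher_functional(f, lo, hi, h=1e-4):
    """J(f) = int f(x) [f'(x)/f(x)]^2 dx, with f' from a central difference."""
    def integrand(x):
        fx = f(x)
        if fx <= 0.0:
            return 0.0
        dfx = (f(x + h) - f(x - h)) / (2.0 * h)
        return dfx * dfx / fx
    return integrate(integrand, lo, hi)

def entropy_functional(f, lo, hi):
    """H(f) = -int f(x) log f(x) dx (natural logarithm)."""
    def integrand(x):
        fx = f(x)
        return -fx * math.log(fx) if fx > 0.0 else 0.0
    return integrate(integrand, lo, hi)

sigma = 1.5
f = gaussian(sigma)
width = 12.0 * sigma  # tails beyond this are negligible
J_at_0 = fisher_functional(f, -width, width)
J_at_3 = fisher_functional(shifted(f, 3.0), 3.0 - width, 3.0 + width)
H = entropy_functional(f, -width, width)
print("J(f):", J_at_0, "after shifting by 3:", J_at_3, "closed form:", 1.0 / sigma**2)
print("H(f):", H, "closed form:", 0.5 * math.log(2.0 * math.pi * math.e * sigma**2))
```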

Suppose that X is a random variable with finite variance and Z is an independent normally distributed random variable with zero mean and unit variance ("standard noise"), so that X + sqrt(t) Z is another random variable with density f_t, representing X perturbed by noise. Then de Bruijn's identity says that

J(f_t) = 2 d/dt H(f_t)

and if the limit t -> 0 exists, we have a formula for the Fisher information of the density f_0 associated with X.
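As a sanity check on de Bruijn's identity, here is a short worked case in which everything can be computed in closed form. It assumes, purely for illustration, that X is itself Gaussian with mean zero and variance s^2 (the symbol s is not in the original), and that logarithms are natural.

```latex
\begin{align*}
X \sim N(0, s^2) \;&\Longrightarrow\; X + \sqrt{t}\,Z \sim N(0,\, s^2 + t), \\
H(f_t) &= \tfrac{1}{2} \log\bigl(2\pi e\,(s^2 + t)\bigr), \\
\frac{d}{dt} H(f_t) &= \frac{1}{2\,(s^2 + t)}, \\
J(f_t) &= \frac{1}{s^2 + t} = 2\,\frac{d}{dt} H(f_t), \\
J(f_0) &= \lim_{t \to 0} J(f_t) = \frac{1}{s^2}.
\end{align*}
```

The last line recovers the familiar value 1/s^2 for the Fisher information of a Gaussian location family.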

See Elements of Information Theory, by Cover & Thomas, Wiley, 1991, for details on the above and for general orientation to the enormous body of ideas which constitutes modern information theory, including typical sets and comment (1) above. Then see some of the many other books which cover Fisher information in more detail. In one of the books by J. N. Kapur on maximal entropy you will find a particularly simple and nice connection between the multivariable Fisher information and Shannon's discrete "information" (arising from the discrete "entropy" -sum p_j log p_j).

(Come to think of it, if you search under my name using Deja News you should find a previous posting of mine in which I gave considerable detail on some inequalities which are closely related to the area-volume interpretations of Fisher information and entropy. If you've ever heard of Hadamard's inequality on matrices, you should definitely look at the discussion in Cover & Thomas.)

Hope this helps!

Chris Hillman


*
This archive was generated by hypermail 2.0b3
on Sun Oct 17 1999 - 22:10:31 JST
*