[time 314] Re: Physics from Fisher Information


Stephen Paul King (stephenk1@home.com)
Fri, 14 May 1999 01:16:43 GMT


On 13 May 1999 11:17:24 -0700, Chris Hillman
<hillman@math.washington.edu> wrote:

>On 11 May 1999, john baez wrote:
>
>> In article <01be9aec$b623a020$2897cfa0@sj816bt720500>,
>> Philip Wort <phil.wort@cdott.com> wrote:
>
>> >I find this book is very difficult to follow.
>
>> I wish I could help you, but I can't. When people started talking about
>> Frieden's work I went down to the library to look it up, but I couldn't
>> make much sense out of his papers. I'd hoped his book would be easier
>> to read, but from what you say it sounds like maybe not....
>
>> Does *anyone* understand this stuff? If so, could they explain it?
>
>I had the same reaction to Frieden's papers: I couldn't figure out what he
>was trying to say in the first few paragraphs of the papers I downloaded,
>so I put them aside... (If anyone else wants to have a go, check the
>PROLA archive http://prola.aps.org/search.html)
>
>> I don't even understand what "Fisher information" really is or why
>> people (not just Frieden) are interested in it.
>
>As it happens, the same question came up in bionet.info-theory recently,
>so I quote my reply below. My post was based on what I found in Thomas &
>Cover, so if I've gotten anything wrong, someone should correct what I've
>said!
>
>As for why people are interested... well, statisticians find everything
>Fisher did of enduring interest for one reason or another, it seems :-)
>
>One additional thing, which I'm not sure I remember quite right, but which
>physicists will probably find intriguing, is the notion of a "statistical
>manifold", where one can actually make a Riemannian manifold out of a
>parametrized family of distributions in such a way that the metric turns
>out to be given by the Fisher information and a divergence related to
>Shannon's entropy (the relative entropy) is tied to the corresponding
>connections! Something like that,
>anyway --- it's been a decade since I looked at this. Now you're probably
>thinking what I thought ten years ago, but when I looked at the books on
>this stuff, the expected connections were not apparent. No forms in
>sight, it didn't even look like differential geometry. In fact, it looked
>very ugly :-(
>
>Chris Hillman
>
>=========== BEGIN REPOST [WITH NEW EXAMPLE] =======================
>
>Date: Fri, 7 May 1999 15:23:58 -0700
>From: Chris Hillman <hillman@math.washington.edu>
>Newsgroups: bionet.info-theory
>Subject: Re: Definition of Fisher Information
>
>
>On Mon, 26 Apr 1999, Stephen Paul King wrote:
>
>> Could someone give a definition of Fisher Information that a
>> mindless philosopher would understand? :)
>
>Let me start with a couple of intuitive ideas which should help to orient
>you. Fisher information is related to the notion of information (a kind of
>"entropy") developed by Shannon in 1948, but not the same. Roughly speaking:
>
> 1. Shannon entropy is the volume of a "typical set"; Fisher information
> is the area of a "typical set",
>
> 2. Shannon entropy is allied to "nonparametric statistics"; Fisher
> information is allied to "parametric statistics".
>
>Now, for the definition.
>
>Let f(x,t) be a family of probability densities parametrized by t.
>
>[Example (I just made this up for this repost):
>
>                  2 sin(pi t)
>     f(x,t)  =  --------------  x^t (1-x)^(1-t)
>                  pi t (1-t)
>
>where 0 < x < 1 and 0 < t < 1. (The numerical factor is chosen to ensure
>that the integral over 0 < x < 1 is unity.)
>
>If you take appropriate limits, this family can be extended to -1 < t < 2;
>e.g. (it's fun to plot these as functions of x)
>
> f(x,-1/2) = 8/(3 pi) x^(-1/2) (1-x)^(3/2)
>
> f(x, 0) = 2 (1-x)
>
> f(x, 1/4) = 16 sqrt(2)/(3 pi) (1-x)^(3/4) x^(1/4)
>
> f(x, 1/3) = 9 sqrt(3)/(2 pi) (1-x)^(2/3) x^(1/3)
>
> f(x, 1/2) = 8 sqrt(x-x^2)/pi
>
> f(x, 2/3) = 9 sqrt(3)/(2 pi) (1-x)^(1/3) x^(2/3)
>
> f(x, 1) = 2 x
>
> f(x, 3/2) = 8/(3 pi) x^(3/2) (1-x)^(-1/2)
>
>End of example]
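>
>[Aside, not part of the original example: a quick numerical sanity check,
>as a Python sketch (numpy/scipy; the function f below is just the example
>density). It integrates f(x,t) over 0 < x < 1 for a few values of t and
>should print something very close to 1 each time.
>
>  # Sanity check: the example densities should integrate to 1 over (0,1).
>  import numpy as np
>  from scipy.integrate import quad
>
>  def f(x, t):
>      c = 2.0 * np.sin(np.pi * t) / (np.pi * t * (1.0 - t))
>      return c * x**t * (1.0 - x)**(1.0 - t)
>
>  for t in (0.25, 1.0/3.0, 0.5, 2.0/3.0):
>      total, _ = quad(f, 0.0, 1.0, args=(t,))
>      print(t, total)   # each total should be very close to 1
>
>End of aside]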
>
>In parametric statistics, we want to estimate which t gives the best fit
>to a finite data set, say of size n. An estimator is a function from
>n-tuple data samples to the set of possible parameter values, e.g.
>(0,1) in the example above. Given an estimator, its bias, as a
>function of t, is the difference between the expected value (as we range
>over x) of the estimator, according to the density f(.,t), and the actual
>value of t. The variance of the estimator, as a function of t, is the
>expectation (as we range over x), according to f(.,t), of the squared
>difference between t and the value of the estimator. If the bias vanishes
>(in this case the estimator is called unbiased), the variance will usually
>still be a positive function of t. It is natural to try to minimize the
>variance over the set of unbiased estimators defined for a given family of
>densities f(.,t).
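>
>[Aside, an illustration that is not part of the original argument: the
>example family above is just the Beta(t+1, 2-t) family, so E[X] = (t+1)/3
>and the "moment" estimator 3*(sample mean) - 1 is one unbiased estimator
>of t. The hypothetical Python sketch below (numpy) estimates its bias and
>variance by simulation.
>
>  # Hypothetical moment estimator for the example family (my own choice,
>  # purely to illustrate bias and variance of an estimator).
>  import numpy as np
>
>  rng = np.random.default_rng(0)
>  t_true, n, trials = 0.5, 100, 20000
>
>  # Draw `trials` data sets of size n from Beta(t+1, 2-t).
>  samples = rng.beta(t_true + 1.0, 2.0 - t_true, size=(trials, n))
>  t_hat = 3.0 * samples.mean(axis=1) - 1.0   # estimator applied to each data set
>
>  print("bias     ~", t_hat.mean() - t_true)  # should be close to 0
>  print("variance ~", t_hat.var())            # close to (t+1)(2-t)/(4n) = 0.005625
>
>End of aside]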
>
>Given a family of densities, the score is the logarithmic derivative
>
> V(x,t) = d/dt log f(x,t) = [d/dt f(x,t)] / f(x,t)
>
>[In the example (the one I just made up), if I haven't goofed, we have
>
> V(x,t) = log(x/(1-x)) + pi cot(pi t) + (2t-1)/(t(1-t))
>
>e.g. V(x,1/2) = log(x/(1-x)).]
>
>(We are tacitly now assuming some differentiability properties of our
>parameterized family of densities.)
>
>The mean of the score (as we average over x) is always zero. The Fisher
>information is the variance of the score:
>
> J(t) = expected value of the square of V(x,t) as we vary x
>
>Notice this is a function of t defined in terms of a specific parametrized
>family of densities. (Of course, the definition is readily generalized to
>more than one parameter).
>
>[In the example:
>
> J(-1/2) ~ 5.42516
>
> J(0) = pi^2/3 - 1 ~ 2.28987
>
> J(1/3) ~ 1.90947
>
> J(1/2) ~ 1.8696
>
> J(2/3) ~ 1.90947
>
> J(1) = pi^2/3 - 1 ~ 2.28987
>
> J(3/2) ~ 5.42516
>
>if I didn't goof. Note the expected symmetry of these values.]
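>
>[Aside, a numerical check that is not in the original post: the Python
>sketch below (numpy/scipy) recomputes the score from the closed form given
>earlier, confirms that its mean is essentially zero, and evaluates
>J(t) = E[V^2] by quadrature; it should reproduce the values listed above.
>
>  # Check the score and the Fisher information numerically.
>  import numpy as np
>  from scipy.integrate import quad
>
>  def f(x, t):   # the example density
>      c = 2.0 * np.sin(np.pi * t) / (np.pi * t * (1.0 - t))
>      return c * x**t * (1.0 - x)**(1.0 - t)
>
>  def V(x, t):   # the score, from the closed form above
>      return (np.log(x / (1.0 - x)) + np.pi / np.tan(np.pi * t)
>              + (2.0 * t - 1.0) / (t * (1.0 - t)))
>
>  for t in (1.0/3.0, 0.5, 2.0/3.0):
>      mean_V, _ = quad(lambda x: f(x, t) * V(x, t), 0.0, 1.0)
>      J_t, _ = quad(lambda x: f(x, t) * V(x, t)**2, 0.0, 1.0)
>      print(t, round(mean_V, 6), round(J_t, 5))
>      # means ~ 0; J ~ 1.90947, 1.8696, 1.90947
>
>End of aside]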
>
>The fundamentally important Cramer-Rao inequality says that
>
> variance of any unbiased estimator >= 1/J(t)
>
>Thus, in parametric statistics one wants to find estimators which achieve
>the optimal variance, the reciprocal of the Fisher information. From this
>point of view, the larger the Fisher information, the more precisely
>one can (using a suitable estimator) fit a distribution from the given
>parametrized family to the data.
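>
>[Aside, continuing the hypothetical estimator from the earlier aside: for n
>independent samples the Cramer-Rao bound for an unbiased estimator reads
>variance >= 1/(n J(t)). The little sketch below compares the exact variance
>of the moment estimator 3*(sample mean) - 1 with this bound at t = 1/2,
>using the value J(1/2) ~ 1.8696 listed above; the moment estimator comes
>close to, but does not achieve, the bound.
>
>  # Compare the hypothetical moment estimator with the Cramer-Rao bound.
>  n, t, J_half = 100, 0.5, 1.8696
>
>  var_moment = (t + 1.0) * (2.0 - t) / (4.0 * n)   # exact variance of 3*mean - 1
>  cr_bound = 1.0 / (n * J_half)                    # Cramer-Rao lower bound
>
>  print("moment estimator variance:", var_moment)  # 0.005625
>  print("Cramer-Rao lower bound:   ", cr_bound)    # ~0.00535, slightly smaller
>
>End of aside]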
>
>(Incidentally: someone has mentioned the work of Roy Frieden, who has
>attempted to relate the Cramer-Rao inequality to the Heisenberg
>inequality. See the simple "folklore" theorem (with complete proof) I
>posted on a generalized Heisenberg inequality in sci.physics.research a
>few months ago--- you should be able to find it using Deja News.)
>
>This setup is more flexible than might at first appear. For instance,
>given a density f(x), where x is real, define the family of densities
>f(x-t); then the Fisher information is
>
> J(t) = expectation of [d/dt log f(x-t)]^2
>
> = int f(x-t) [d/dt log f(x-t)]^2 dx
>
>By a change of variables, we find that for a fixed density f, this is a
>constant. In this way, we can change our point of view and define a
>(nonlinear) functional on densities f:
>
> J(f) = int f(x) [f'(x)/f(x)]^2 dx
>
>[The idea now is something like this: J(f) is measuring the precision of
>fitting f to numerical data, up to translation of the distribution. The
>larger J(f) is, the more precisely you can identify a particular
>translation which gives the best fit. I think this is the idea, anyway.]
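>
>[Aside, an example that is not from the original post: for a Gaussian
>density with standard deviation sigma, the translation-family Fisher
>information J(f) above is known to equal 1/sigma^2. The Python sketch
>below (numpy/scipy) checks this by integrating f (f'/f)^2 numerically.
>
>  # J(f) = int f(x) [f'(x)/f(x)]^2 dx for a Gaussian should equal 1/sigma^2.
>  import numpy as np
>  from scipy.integrate import quad
>
>  sigma = 2.0
>
>  def f(x):        # Gaussian density, mean 0, standard deviation sigma
>      return np.exp(-x**2 / (2.0 * sigma**2)) / (sigma * np.sqrt(2.0 * np.pi))
>
>  def fprime(x):   # its derivative: f'(x) = -(x / sigma^2) f(x)
>      return -(x / sigma**2) * f(x)
>
>  # Integrate over a range wide enough to be "the whole line" for this sigma.
>  J, _ = quad(lambda x: f(x) * (fprime(x) / f(x))**2, -40.0, 40.0)
>  print(J, 1.0 / sigma**2)   # both ~ 0.25
>
>End of aside]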
>
>On the other hand, Shannon's "continuous" entropy is the (nonlinear)
>functional:
>
> H(f) = -int f(x) log f(x) dx
>
>Suppose that X is a random variable with finite variance and Z is an
>independent normally distributed random variable with zero mean and unit
>variance ("standard noise"), so that X + sqrt(t) Z is another random
>variable associated with density f_t, representing X perturbed by noise.
>Then de Bruijn's identity says that
>
> J(f_t) = 2 d/dt H(f_t)
>
>and if the limit t-> 0 exists, we have a formula for the Fisher
>information of the density f_0 associated with X.
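>
>[Aside, a worked Gaussian case that is not in the original post: if X is
>N(0, sigma^2), then X + sqrt(t) Z is N(0, sigma^2 + t), so
>H(f_t) = (1/2) log(2 pi e (sigma^2 + t)) and J(f_t) = 1/(sigma^2 + t), and
>de Bruijn's identity can be checked directly. The sketch below does this
>with a finite-difference derivative.
>
>  # Check de Bruijn's identity J(f_t) = 2 d/dt H(f_t) for a Gaussian X.
>  import numpy as np
>
>  sigma2, t, eps = 1.0, 0.3, 1e-6
>
>  def H(t):   # differential entropy of N(0, sigma2 + t), in nats
>      return 0.5 * np.log(2.0 * np.pi * np.e * (sigma2 + t))
>
>  dH_dt = (H(t + eps) - H(t - eps)) / (2.0 * eps)   # central finite difference
>  J_t = 1.0 / (sigma2 + t)                          # Fisher information of N(0, sigma2 + t)
>
>  print(2.0 * dH_dt, J_t)   # both ~ 0.76923
>
>End of aside]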
>
>See Elements of Information Theory, by Cover & Thomas, Wiley, 1991, for
>details on the above and for general orientation to the enormous body of
>ideas which constitutes modern information theory, including typical sets
>and comment (1) above. Then see some of the many other books which cover
>Fisher information in more detail. In one of the books by J. N. Kapur on
>maximum entropy you will find a particularly simple and nice connection
>between the multivariable Fisher information and Shannon's discrete
>"information" (arising from the discrete "entropy" -sum p_j log p_j).
>
>(Come to think of it, if you search under my name using Deja News you
>should find a previous posting of mine in which I gave considerable detail
>on some inequalities which are closely related to the area-volume
>interpretations of Fisher information and entropy. If you've ever heard
>of Hadamard's inequality on matrices, you should definitely look at the
>discussion in Cover & Thomas.)
>
>Hope this helps!
>
>Chris Hillman
>
>


