[time 262] Is thinking more like a thunderstorm or like a calculation?

Stephen P. King (stephenk1@home.com)
Wed, 28 Apr 1999 11:49:30 -0400

Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Next message: Lancelot Fletcher: "[time 263] RE: [time 259] What does an LS perceive?"
Previous message: Stephen P. King: "[time 261] Re: classification of 3-manifolds"

Subject:
             Re: Is thinking more like a thunderstorm or like a
calculation?
        Date: 21 Mar 1999 00:00:00 GMT
       From: minsky@media.mit.edu
Organization: Deja News - The Leader in Internet Discussion
Newsgroups: comp.ai.philosophy

In article <7d08oo$cdq@journal.concentric.net>,
modlin@concentric.net wrote:

Modlin made a few statements about the theorems that Papert and I proved
about
perceptions.

Modlin: Minsky and Papert showed that a SINGLE STAGE (or "layer) of
perceptrons is limited to computing linearly separable functions.

I'm not sure he read the book, although otherwise he knows quite a lot
about
neural networks. The networks that we considered were not single-layer,
but
two-layer networks. That is, we discussed what could be computer by a
linearly separable function of arbitrary Boolean functions. Then we
discussed how large must be the coefficients in the second layer, and
how
these grow with the number of variables presented at the "0-th" layer
--that
is, the inputs to the first layer of Boolean functions. In particular,
we
discussed several classes of functions at the first layer-in particular,
where there is a bound N on the numbers of inputs to the first-layer
neurons
and where, if the input is a two-dimensional 'retina", there is a bound
D on
the diameter of those inputs for each neuron.

Modlin: It is a trivial result, not nearly worth the attention paid it.
Unwarranted emphasis on the point retarded progress in neural modelling
for 20
years, and even now people who quote it without understanding continue
to
propagate the damage it has done.

These results aren't at all trivial. In the 30 years since, no one has
much
improved or generalized our results, except for one important book,
Discrete
Neural Computation, by Kai-Yeung Siu, Vwani Roychoudury, and Tom
Kailath.

Modlin: The same limitation applies to ANY set of primitive functions,
whether Perceptrons, arithmetic operations, or things like AND/OR/NOT
Boolean
logic gates. In one stage you can only compute those primitive
functions.
To compute more functions you have to combine the outputs of the first
steps
with more of the primitives, producing a multi-stage computation.

Modlin: Adding more devices lets you compute more complex functions.
Perceptrons with non-linear or thresholding outputs are boolean
complete, so
you can compute any computable function with a network of them, just as
you
can with NAND, or with AND/OR/NOT, or with many other sets of
primitives.
[...] Technically, one extra layer of intermediate functions between
inputs
and outputs is enough to compute anything. This cannonical form is
inneficient, and practical circuits use many stages.

That's just what we said. Even a shallow net can compute any function,
if
there are enough with a first layer of ANDs and a second layer of ORs
can
compute any function whatever. HOWEVER, in general, this requires a
number
of OR cells that is exponential in the number of INPUT variables. That
is
what the book was about, and we showed, for example, that to make such a
net
count the number of connected objects on the retina requires that sort
of
exponential growth in coefficients or numbers of cells. This is also
true for
other classes of patterns and input restrictions-but very little is yet
known
about this, perhaps because of the mistake that Modlin and many others
have
made about how our book "retarded progress in neural modelling for 20
years".
That rumor, together with not understanding what we did, has had a sad
result. We now know some cases where neural nets perform well, and we
know
many cases where they don't work at all well (e.g., in grammar-like
parsing
problems)-BUT WE KNOW ALMOST NO THEORY ABOUT WHEN THEY WORK AND WHEN
THEY
DON''T WORK WELL. (We also showed several surprising special cases
where the
exponential growth is surprisingly small.)

More important, our theorems in most cases also apply to multilayer
networks,
too-in contradiction to what Modlin claims. In both of those
number-of-input-limited networks, the coefficients still grow
exponentially,
but with exponents that are smaller by the order of the numbers of
layers-for
important classes of problems. However, no one has proven much about
this,
yet.

Modlin did not mention it, but he has done significant research on
"re-entrant" or "recursive" networks, which can be very efficient for
some
kinds of problems. By re-using the inner neurons, one can indeed do much
more-but as Modlin has pointed out in other Usenet messages, such
networks
are equivalent to multi-layer nets (whose depth is the number of
recurrent
operations). We still need better theories of what coefficients each
type of
problems then requires.

I should mention two other bad misunderstandings.

1) There is also a rumor that 'backpropagation' in multilayer nets
avoids the
obstacles we found. However, although backpropagation is sometimes an
efficient way to compute coefficients in multilayer feedforward nets, it
won't help if no adequate set of such coefficients exists-which was the
point
of about half of our book.

3. Almost all of the successful known applications of neural networks
are for
"approximate" solutions to pattern recognition problems-that is, for
systems
that are permitted to make some (usually unspecified) portion of
errors. Our
book was about the theory of which patterns could be correctly
recognized.
So all those people who say that ANNs "work" where we said that they
don't
have changed the definition of 'work" without telling this to all you
poor
folk who didn't actually look at our book. I'm happy to grant that
"frequently works" is often better than nothing at all-but isn't it
professionally disgraceful to not say clearly that you're talking about
a
different subject, when you're criticizing a mathematical treatise?

In fact we did consider "statistical separability" in Chapter 12 (as
opposed
to strictly linear separability), so the distinction should be very
clear to
readers. However, I regret that we didn't write more about that. The
trouble, of course, is that this is an even harder subject!

Finally, although most "histories of ANNs" blame our 1969n book for the
decrease in interest in neural networks after 1970, the big decline
actually
already took place before that. Rosenblatt's book was published in
1959.
There was a great surge in activity, which declined around 1965 when
several
projects found that larger Perceptron machines did not produce much
better
results than smaller one-and we started our own research to try to see
why
this had happened. Not much has been learned since then about this,
perhaps
because of those false rumors about our book. We still have very little
theoretical understanding of which patterns can be efficiently
recognized by
feedforward nets.

It should be noted, though, that the computers of that era did not have
cache,
and typical memory access was the order of 10 microseconds. The
resurgence
occurred when, in the1980s speeds had increased 100000-fold. Now better
coefficients could be found-that is, when they existed.

In other words, most critics did not understand how we defined
"recognizing a
pattern." We meant accepting all cases, and rejecting all non-cases.
It's
true that if some errors are permitted, then more patterns can by
usefully
"recognized". However, then you must live with uncertainty.

Finally, I should add that most of the "successful" applications on
feedforward nets aren't actually done with feedforward nets-because they
usually begin with some external device for 'normalizing' or "centering"
the
machine's focus on the object to be recognized. Naturally, this makes
the
problem much easier. On the positive side, for practical applications,
where
substantial amounts of error are acceptable, feedforward nets are indeed
competitive, and, in many instances, superior to other kinds of standard
statistical recognition schemes. Only recently, by using such methods as
'hidden Markov models, which in effect are more versatile at evolvin

-----------== Posted via Deja News, The Discussion Network ==----------
http://www.dejanews.com/ Search, Read, Discuss, or Start Your
Own

Next message: Lancelot Fletcher: "[time 263] RE: [time 259] What does an LS perceive?"
Previous message: Stephen P. King: "[time 261] Re: classification of 3-manifolds"

This archive was generated by hypermail 2.0b3 on Sun Oct 17 1999 - 22:31:52 JST