**Stephen P. King** (*stephenk1@home.com*)

*Wed, 28 Apr 1999 11:49:30 -0400*


Subject: Re: Is thinking more like a thunderstorm or like a calculation?
Date: 21 Mar 1999 00:00:00 GMT
From: minsky@media.mit.edu
Organization: Deja News - The Leader in Internet Discussion
Newsgroups: comp.ai.philosophy

In article <7d08oo$cdq@journal.concentric.net>, modlin@concentric.net wrote:

Modlin made a few statements about the theorems that Papert and I proved about perceptrons.

Modlin: Minsky and Papert showed that a SINGLE STAGE (or "layer") of perceptrons is limited to computing linearly separable functions.

I'm not sure he read the book, although otherwise he knows quite a lot about neural networks. The networks that we considered were not single-layer, but two-layer networks. That is, we discussed what could be computed by a linearly separable function of arbitrary Boolean functions. Then we discussed how large the coefficients in the second layer must be, and how these grow with the number of variables presented at the "0-th" layer, that is, the inputs to the first layer of Boolean functions. In particular, we discussed several classes of functions at the first layer: the case where there is a bound N on the number of inputs to the first-layer neurons, and the case where, if the input is a two-dimensional "retina", there is a bound D on the diameter of those inputs for each neuron.
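To make the "linearly separable function of Boolean functions" setup concrete, here is a minimal sketch of my own (plain Python, not from the original post): no single threshold unit computes XOR, but one threshold unit applied to two first-layer Boolean functions does.

```python
# A single linear threshold unit cannot compute XOR, but a threshold unit
# over first-layer Boolean functions (two ANDs here) can.
import itertools

def threshold_unit(weights, bias, inputs):
    """Classic perceptron: 1 if w.x + b > 0, else 0."""
    return 1 if sum(w * x for w, x in zip(weights, inputs)) + bias > 0 else 0

xor_cases = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

# Brute-force search over small integer weights: no single unit fits XOR.
single_layer_ok = any(
    all(threshold_unit((w1, w2), b, x) == y for x, y in xor_cases)
    for w1, w2, b in itertools.product(range(-3, 4), repeat=3)
)
print("single threshold unit computes XOR:", single_layer_ok)   # False

# Two stages: first-layer Boolean functions, then one threshold unit (an OR).
def two_layer_xor(x1, x2):
    f1 = x1 and not x2          # first-layer Boolean function
    f2 = x2 and not x1          # first-layer Boolean function
    return threshold_unit((1, 1), 0, (int(f1), int(f2)))   # OR of f1, f2

print(all(two_layer_xor(*x) == y for x, y in xor_cases))   # True
```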

Modlin: It is a trivial result, not nearly worth the attention paid it. Unwarranted emphasis on the point retarded progress in neural modelling for 20 years, and even now people who quote it without understanding continue to propagate the damage it has done.

These results aren't at all trivial. In the 30 years since, no one has much improved or generalized our results, except for one important book, Discrete Neural Computation, by Kai-Yeung Siu, Vwani Roychowdhury, and Tom Kailath.

Modlin: The same limitation applies to ANY set of primitive functions, whether Perceptrons, arithmetic operations, or things like AND/OR/NOT Boolean logic gates. In one stage you can only compute those primitive functions. To compute more functions you have to combine the outputs of the first steps with more of the primitives, producing a multi-stage computation.

Modlin: Adding more devices lets you compute more complex functions. Perceptrons with non-linear or thresholding outputs are Boolean complete, so you can compute any computable function with a network of them, just as you can with NAND, or with AND/OR/NOT, or with many other sets of primitives.
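As an aside from me (not in either post), here is a minimal sketch of the Boolean-completeness point: a single threshold unit realizes NAND, and NAND alone suffices to build any Boolean function.

```python
# A threshold unit with weights (-1, -1) and bias 1.5 computes NAND;
# composing NANDs then yields any Boolean function, XOR shown here.
def step(x):
    return 1 if x > 0 else 0

def nand(a, b):
    # fires unless both inputs are 1
    return step(-1 * a + -1 * b + 1.5)

def xor_from_nands(a, b):            # standard 4-NAND construction of XOR
    t = nand(a, b)
    return nand(nand(a, t), nand(b, t))

print([xor_from_nands(a, b) for a in (0, 1) for b in (0, 1)])  # [0, 1, 1, 0]
```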

Modlin: [...] Technically, one extra layer of intermediate functions between inputs and outputs is enough to compute anything. This canonical form is inefficient, and practical circuits use many stages.

That's just what we said. Even a shallow net with a first layer of ANDs and a second layer of ORs can compute any function whatever, if there are enough cells. HOWEVER, in general, this requires a number of OR cells that is exponential in the number of INPUT variables. That is what the book was about, and we showed, for example, that to make such a net count the number of connected objects on the retina requires that sort of exponential growth in coefficients or numbers of cells. This is also true for other classes of patterns and input restrictions, but very little is yet known about this, perhaps because of the mistake that Modlin and many others have made about how our book "retarded progress in neural modelling for 20 years".

That rumor, together with not understanding what we did, has had a sad result. We now know some cases where neural nets perform well, and we know many cases where they don't work at all well (e.g., in grammar-like parsing problems), BUT WE KNOW ALMOST NO THEORY ABOUT WHEN THEY WORK AND WHEN THEY DON'T WORK WELL. (We also showed several special cases where the exponential growth is surprisingly small.)

More important, our theorems in most cases apply to multilayer networks too, in contradiction to what Modlin claims. In both of those number-of-input-limited networks, the coefficients still grow exponentially, but with exponents that are smaller by the order of the number of layers, for important classes of problems. However, no one has proven much about this yet.

Modlin did not mention it, but he has done significant research on "re-entrant" or "recursive" networks, which can be very efficient for some kinds of problems. By re-using the inner neurons, one can indeed do much more, but as Modlin has pointed out in other Usenet messages, such networks are equivalent to multi-layer nets (whose depth is the number of recurrent operations). We still need better theories of what coefficients each type of problem then requires.
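A minimal sketch of that equivalence (my own, with numpy assumed available): iterating one recurrent update T times computes exactly what a T-layer feedforward net with shared weights computes.

```python
# Unrolling a recurrent update into T identical "layers" gives the same result.
import numpy as np

def step(h, x, W, U):
    return np.tanh(W @ h + U @ x)              # one recurrent update

def run_recurrent(x, W, U, T, h0):
    h = h0
    for _ in range(T):
        h = step(h, x, W, U)                   # re-use the same inner neurons
    return h

def run_unrolled(x, W, U, T, h0):
    layers = [lambda h: step(h, x, W, U)] * T  # T layers sharing one weight matrix
    h = h0
    for layer in layers:
        h = layer(h)
    return h

rng = np.random.default_rng(1)
W, U = rng.normal(size=(3, 3)), rng.normal(size=(3, 2))
x, h0 = rng.normal(size=2), np.zeros(3)
print(np.allclose(run_recurrent(x, W, U, 4, h0), run_unrolled(x, W, U, 4, h0)))  # True
```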

I should mention two other bad misunderstandings.

1) There is also a rumor that 'backpropagation' in multilayer nets avoids the obstacles we found. However, although backpropagation is sometimes an efficient way to compute coefficients in multilayer feedforward nets, it won't help if no adequate set of such coefficients exists, which was the point of about half of our book.
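A minimal sketch of that point (my own, with numpy assumed; a single logistic unit stands in for the degenerate one-layer case): gradient training tunes the weights efficiently, but no weights exist that fit XOR, so training stalls no matter how long it runs.

```python
# Gradient descent on a single logistic unit cannot learn XOR, because no
# adequate set of coefficients exists for it.
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)

rng = np.random.default_rng(0)
w, b = rng.normal(size=2), 0.0

for _ in range(20000):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # sigmoid output
    grad_w = X.T @ (p - y) / len(y)          # gradient of cross-entropy loss
    grad_b = float(np.mean(p - y))
    w -= 0.5 * grad_w
    b -= 0.5 * grad_b

print("predictions:", (p > 0.5).astype(int))  # never matches [0, 1, 1, 0]
```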

2) Almost all of the successful known applications of neural networks are for "approximate" solutions to pattern recognition problems, that is, for systems that are permitted to make some (usually unspecified) portion of errors. Our book was about the theory of which patterns could be correctly recognized.

So all those people who say that ANNs "work" where we said that they don't have changed the definition of "work" without telling this to all you poor folk who didn't actually look at our book. I'm happy to grant that "frequently works" is often better than nothing at all, but isn't it professionally disgraceful not to say clearly that you're talking about a different subject, when you're criticizing a mathematical treatise?

In fact we did consider "statistical separability" in Chapter 12 (as opposed to strictly linear separability), so the distinction should be very clear to readers. However, I regret that we didn't write more about that. The trouble, of course, is that this is an even harder subject!

Finally, although most "histories of ANNs" blame our 1969 book for the decrease in interest in neural networks after 1970, the big decline actually took place before that. Rosenblatt's book was published in 1959. There was a great surge in activity, which declined around 1965 when several projects found that larger Perceptron machines did not produce much better results than smaller ones, and we started our own research to try to see why this had happened. Not much has been learned since then about this, perhaps because of those false rumors about our book. We still have very little theoretical understanding of which patterns can be efficiently recognized by feedforward nets.

It should be noted, though, that the computers of that era did not have caches, and typical memory access was on the order of 10 microseconds. The resurgence occurred in the 1980s, when speeds had increased 100,000-fold. Now better coefficients could be found, that is, when they existed.

In other words, most critics did not understand how we defined "recognizing a pattern." We meant accepting all cases, and rejecting all non-cases. It's true that if some errors are permitted, then more patterns can be usefully "recognized". However, then you must live with uncertainty.

Finally, I should add that most of the "successful" applications of feedforward nets aren't actually done with feedforward nets, because they usually begin with some external device for "normalizing" or "centering" the machine's focus on the object to be recognized. Naturally, this makes the problem much easier. On the positive side, for practical applications, where substantial amounts of error are acceptable, feedforward nets are indeed competitive, and, in many instances, superior to other kinds of standard statistical recognition schemes. Only recently, by using such methods as "hidden Markov models", which in effect are more versatile at evolvin
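Regarding the "centering" step mentioned above, here is a minimal sketch of my own (numpy assumed) of the kind of external normalization that removes translation before the net ever sees the pattern:

```python
# Shift a binary image so its centroid sits at the center of the frame,
# so the downstream net never has to cope with translated inputs.
import numpy as np

def center_pattern(img):
    ys, xs = np.nonzero(img)
    if len(ys) == 0:
        return img
    dy = img.shape[0] // 2 - int(round(ys.mean()))
    dx = img.shape[1] // 2 - int(round(xs.mean()))
    return np.roll(np.roll(img, dy, axis=0), dx, axis=1)

img = np.zeros((9, 9), dtype=int)
img[0:2, 0:2] = 1                      # a small blob in the corner
print(center_pattern(img))             # same blob, moved to the middle
```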


