Re: Curtnetrons Don't Do Parity



Michael Olea <oleaj@xxxxxxxxxxxxx> wrote:
Curt Welch wrote:

Michael Olea <oleaj@xxxxxxxxxxxxx> wrote:
Curt Welch wrote:
Michael Olea <oleaj@xxxxxxxxxxxxx> wrote:

I have a pairing you won't find on amazon. It is quite interesting to
study both "Spikes" and Catania's "Learning" more or less side by side.

Ok, ok, I've ordered that as well. :) hell, if I'm so excited about
learning it wouldn't hurt for me to learn a little bit about it. :)

I know it sounds like I was promoting the book. I really meant I found it
interesting to read them "side by side" (e.g. on alternate evenings). But
I think you will find it worth reading.

Anyway, the notion of work as entropy reduction here is at heart a
simple idea.

Yeah, right. :) Everything is simple once you understand it!

Yeah. It all looks obvious in hindsight.

Suppose a 100ms presentation of the stimulus "red" results in 1024
input pulse trains,

pulse trains? Do you mean 1024 pulses on one input? Or 1024 sets of
pulses where each set is some length? Or 1024 different inputs, each
with some set of pulse trains?

What is a pulse train? What is 1024 pulse trains?

I mean a sequence of pulse events, so a list of event times, over some
finite time window - 100ms, here. So let's say the temporal resolution is
1ms. We divide the window into bins 1ms wide, and mark each bin with a 1
or a 0 (a trick I got from Spikes). Then one pulse train is a bit-string
100 bits long. So 1024 of them would be a collection of 1024 such
bit-strings.
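
The binning described above can be sketched as follows. This is only an illustration: the function name pulse_train_to_bits, its parameters, and the sample event times are made up for the example; the 100ms window and 1ms bins are the ones from the text.

```python
def pulse_train_to_bits(event_times_ms, window_ms=100, bin_ms=1):
    """Turn a list of pulse event times into a bit-string.

    Each bin is marked 1 if at least one pulse falls inside it, else 0.
    Events outside the window are ignored.
    """
    n_bins = window_ms // bin_ms
    bits = [0] * n_bins
    for t in event_times_ms:
        if 0 <= t < window_ms:
            bits[int(t // bin_ms)] = 1
    return "".join(str(b) for b in bits)


# Hypothetical pulse times (ms); two of them land in the same 1ms bin.
train = pulse_train_to_bits([3.2, 3.9, 17.0, 98.5])
print(len(train), train.count("1"))
```

With 1ms bins, two pulses closer together than the resolution collapse into a single 1, which is the price of treating the train as a 100-bit string.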

all equally likely.

Ok, my best guess since you said that is that these "pulse trains" to
you are being treated as unique symbols. So ignoring encoding, it
sounds like you are talking about an information flow of 1024 symbols
all equally likely.

Right.

Ah, right, I think you are saying that RED produces a continuous
_stream_ of different pulse patterns and that there are 1024 different
possible pulse patterns that can be produced, all equally likely, but
you are not saying anything about how many total pulse patterns or
total pulses might be generated in that 100ms.

I think that you've got it. During 100ms some pulse train will occur.
When it is RED that elicited the train it will be any one of 1024 such
trains. Under other conditions, say GREEN, some pulse train drawn from
some other set of such pulse trains (which hopefully does not overlap
much with the RED set), will be elicited.

That is an entropy of 10 bits.

10 bits per pulse train. Right?

Yes.

If
the net mapped all 1024 of those pulse trains (and no other pulse
trains) to a single output pulse train the output would have zero
entropy

OK.

- a
gain of 10 bits.

10 to 0 is a reduction by 10 bits. Why do you call that a gain?

It is a reduction of uncertainty, and so a gain of information.

Ok. I can see that way of looking at it.

This is maybe a little tricky. Reduction of what uncertainty? It is in
this case the uncertainty of which pulse-train will be elicited, given
the stimulus RED. Why is that relevant? Suppose you have a sensor that is
the source of these pulse trains. It responds to RED, YELLOW, and GREEN
with pulse trains drawn from 3 different populations of pulse trains.
These are *populations* of pulse trains (or in a 100ms time window,
bit-strings 100 bits long) because, for one thing, as you know, the
stimuli that make up these categories are themselves sets. Now, the robot
is driving a car, approaching an intersection with a traffic light. The
sensor responds with pulse-train "pt". The most probable state of this
little aspect of the world, the state of the traffic signal, is the
maximum of:

P(COLOR=R|pulse-train=pt) = P(R)*P(pt|R)/P(pt)
P(COLOR=Y|pulse-train=pt) = P(Y)*P(pt|Y)/P(pt)
P(COLOR=G|pulse-train=pt) = P(G)*P(pt|G)/P(pt)

Here P(R), P(Y), P(G) are the "prior" probabilities of the state of the
traffic light - just what fraction of the time, on average, is it red,
yellow, and green.

P(pt|R) is the probability of the state RED resulting in the elicited
pulse-train pt. This distribution has an entropy. 10 bits. There are 1024
pulse trains, each with probability 1/1024, and all other pulse trains
have 0 probability of being elicited by the stimulus RED. Likewise
P(pt|Y) and P(pt|G).

P(pt) is the total fraction of the time the pulse train pt occurs. We
don't need to calculate it to pick the most probable state of the traffic
light because it is just a constant that scales all 3 cases by the same
amount, so if we ignore it we can still pick the most probable state. If
for some reason we need to know the actual probabilities then we would
have to calculate it:

P(pt) = P(R)*P(pt|R) + P(Y)*P(pt|Y) + P(G)*P(pt|G)

Let's suppose that this particular traffic light is RED 45% of the time,
GREEN 45% of the time, and YELLOW 10% of the time:

P(R) = P(G) = 0.45, P(Y) = 0.1.

The uncertainty in the state of the light, before we get a pulse train to
look at, is (using the formula you looked up) a little less than 1.4
bits.

Now let's suppose the sensor is really good, and there is no overlap in
the pulse train populations, no pulse train that is elicited by, say,
both RED and YELLOW. Then after seeing the pulse train the uncertainty as
to the state of the traffic light is 0. We have GAINED (yes, gained :)
about 1.4 bits of information from the pulse train - information ABOUT
the state of the traffic signal. The pulse train may convey much more
information - say the particular shade of RED, but the most it can convey
ABOUT THE STATE OF THE TRAFFIC SIGNAL (I'm not trying to shout at you,
just to emphasize a point of frequent confusion) is a little under 1.4
bits.
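
A small sketch of the calculation in the non-overlapping case. The priors are the ones given above; everything else is an assumption for illustration: each population holds 4 made-up pulse-train labels instead of 1024, and the names PRIOR, POPULATIONS, likelihood and posterior are hypothetical.

```python
import math

# Priors from the text: RED 45%, YELLOW 10%, GREEN 45%.
PRIOR = {"R": 0.45, "Y": 0.10, "G": 0.45}

# Hypothetical, non-overlapping populations of pulse trains (4 each, not 1024).
POPULATIONS = {
    "R": {"r0", "r1", "r2", "r3"},
    "Y": {"y0", "y1", "y2", "y3"},
    "G": {"g0", "g1", "g2", "g3"},
}


def likelihood(pt, color):
    """P(pt|color): uniform over that color's population, 0 elsewhere."""
    pop = POPULATIONS[color]
    return 1.0 / len(pop) if pt in pop else 0.0


def posterior(pt):
    """P(color|pt) by Bayes' rule; assumes pt belongs to some population."""
    joint = {c: PRIOR[c] * likelihood(pt, c) for c in PRIOR}
    p_pt = sum(joint.values())  # P(pt), the normalizing constant
    return {c: j / p_pt for c, j in joint.items()}


def entropy_bits(probs):
    """-Sum Pi Log2(Pi), skipping zero-probability events."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


post = posterior("r2")
prior_bits = entropy_bits(PRIOR.values())
gain = prior_bits - entropy_bits(post.values())
print(post, round(prior_bits, 3), round(gain, 3))
```

Because the populations do not overlap, the posterior puts all its mass on one color, so the gain equals the prior uncertainty, a little under 1.4 bits. With overlapping populations the posterior entropy would be above zero and the gain correspondingly smaller.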

In a more realistic situation, the populations of per class pulse trains
do overlap, some generated by both RED and YELLOW - and, if the
relationships between stimuli and consequences make the distinction
consequential, then these would come to form a category ORANGE. This is
part of the point that "seeing is behavior" - the distinctions in stimuli
that occasion distinctions in consequences of actions are the
distinctions that establish equivalence classes over stimuli, establish
the categories RED and ORANGE, and YELLOW that are "seen" - the
"maximally informative stimulus dimensions" (a phrase I learned from Wild
Bayesian Bill Bialek, not Glen or Catania, but, I think you will agree,
rather EABish). And, of course, the "hardware" imposes limits on
discriminal dimensions - nobody disputes that, but the strawman never
dies.

It's going to take me some time to chew through all the above....

So, the job of the net is not "efficiency" or "compression", but to
partition stimuli into categories along maximally informative stimulus
dimensions. And it will have done useful work to the extent it collapses
the 1024 per class pulse trains down to 1 - in some part of the net. That
does not mean it throws away information, just that it untangles it.
Suppose the 1024 pulse trains corresponding to the category RED passed
through unchanged. The net did no useful work in supporting the
hit-the-brakes behavior.

Well, the way I look at my network, its entire purpose is to create
stimulus classes to map information like the 1024 pulse trains to the
hit-the-brakes behavior. But it doesn't do this because it "knows" all
1024 pulse trains mean "red" beforehand.

The question here is if you had a sensor that produced 1024 different pulse
trains in response to a common stimulus of red, how would our black box
receiving that signal figure out that they were all the same? A
reinforcement learning network learns they are the same by experience - it
basically has to learn to hit the brakes for all those things. They end up
being grouped together not because they are similar, but because everything
in that group was shown through experience to be a good thing to stop for.

Without the benefit of a reward signal to guide the formation of
equivalence classes, why would the black box choose to put those 1024
different pulse trains into the same class? The network doesn't see those
1024 pulse trains as one color, it sees it as 1024 different colors.

Now, there is the possibility of using prediction to lump those together.
That's because red lights tend to stay red for an extended period which
means the odds of one of these 1024 pulse trains following another of the
same set are much higher than the odds of it following some other pulse
train. This type of temporal association could be used to cluster stimulus
signals and create invariant representations. But that didn't seem to be
the issue you were getting at (but maybe what you wrote above touches on
that).

The net's response would be optimal. That is the best
any device can do.

This brings up an issue I have with these definitions of information
that I've always felt there was a problem with. But I'm going to have
to learn and study more before I can intelligently debate it.

Think about the game of Clue. You gain information to the extent the
possibilities are reduced. At the start of the game there is a set of
possibilities (who-done-it,where,with-what), all equally likely. Say that
is (4X8X8) = 256 possibilities. The entropy is 8 bits. As you gain
information you narrow the possibilities, reducing the entropy of the
distribution over possibilities. When you have reduced it to one
possibility, the entropy is zero. You have gained 8 bits of information.
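
The arithmetic can be checked with a few lines. This is a sketch under the same assumption as the text, that all remaining possibilities are equally likely; the names FACTOR_SIZES and information_gained are made up for the example.

```python
import math

# Sizes of the three factors from the text: 4 x 8 x 8.
FACTOR_SIZES = (4, 8, 8)

total = math.prod(FACTOR_SIZES)   # number of possibilities at the start
start_bits = math.log2(total)     # entropy when all are equally likely


def information_gained(remaining):
    """Bits gained once only `remaining` equally likely possibilities are left."""
    return start_bits - math.log2(remaining)


print(total, start_bits)
print(information_gained(64))  # narrowed to a quarter of the possibilities
print(information_gained(1))   # solved: one possibility left
```

Each halving of the set of possibilities is worth exactly one bit, which is why getting from 256 down to 1 yields 8 bits.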

Or more simply, if you receive 8 and send 0, you gain 8 internally. The
output has lost 8 but the internal system has gained 8. I think.

But the issue here I'm trying to grasp is that you are assuming the
only information worth encoding is the "RED" information when you make
your comment about the net's response being optimal. So by attempting
to measure the performance of the net (how close to optimal it is) are
you assuming the goal of the net is to produce an encoding of RED with
zero information redundancy?

No. I was too brief. I should have made it a discrimination problem - what
to do as you approach a traffic light, say.

In other words, these techniques are only useful when for some reason
you care about the efficiency of the encoding. Or something like that.

No, it is not about efficiency - it is about a reduction in the
uncertainty of the state of the world, those states that occasion
consequences.

yeah, but using this logic, it seems to me that a system that received full
sensory input and output a constant stream of 0's would be a case of
maximal information gain about the world.

There seems to be a problem here that this view is only useful if we as
third parties, know what external state of the universe the system is
trying to gain information about. We can then reasonably talk about its
success at gaining that information, just like we can talk about a
communication channel's success at transmitting some desired information.
If we have no way of knowing what information it needs to gain, then it
seems we have nothing to measure in terms of its success.

But, a network receiving a stream of data at 10 bits per stream has no
way of knowing that data is redundant. So it can't just throw it away.
Not without some other information to guide it.

The idea is to group the raw streams into categories relevant to
"decisions".

Well, in my network, I look at it as if there is only one problem - the
formation of the response classes. It doesn't create response classes so
that it can use that to make a decision. The formation of the response
class is the decision that it's made. It's the one and only job that needs
to be done.

There's just stuff I'm not yet grasping....

Is this helping?

Yeah, but I have to take time to chew through it all.

If instead it mapped those pulse trains (and no
others) to 32 output pulse trains, all equally likely, the entropy of
the output would be 5 bits - a net gain of 5 bits.

A loss of 5 bits? :)

Gain. It has helped in some measure to categorize the relevant state of
the world - where "relevance" is those distinctions of state that signal
consequences to actions.

The net did some good. If,
on the other hand, the net mapped those 1024 inputs (and no others) to
2048 output pulse trains, all equally likely, the entropy of the
output would be 11 bits - a net loss of 1 bit. The net did more harm
than good.
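
The three cases (1024 inputs mapped to 1, 32, or 2048 outputs) can be put side by side. A minimal sketch, assuming as in the text that inputs and outputs are each equally likely and that no other pulse trains are involved; the function name net_gain_bits is made up for the example.

```python
import math


def net_gain_bits(n_inputs, n_outputs):
    """Input entropy minus output entropy, both over equally likely trains.

    Positive: the net reduced uncertainty. Negative: it added uncertainty.
    """
    return math.log2(n_inputs) - math.log2(n_outputs)


for n_out in (1, 32, 2048):
    print(n_out, net_gain_bits(1024, n_out))
```

The sign is the whole story here: collapsing 1024 trains to fewer outputs gives a positive number, and spreading them over 2048 outputs gives -1.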

Well, it added 1 bit of data from another source to the signal for each
symbol (pulse train?) output. But calling this "harm" assumes that the
purpose of the network was to maximise the efficiency of the encoding
of the RED signal. Why was that assumption made? How is that
assumption justified as "good" for this network?

No, not to maximize efficiency but to "parse" the stream. If you think of
your output net as asking your input net "what should I do, boss", then
the net should not respond with "lovely shade of red, reminds me of that
ruby cab that day in Milan, ah, Milano...", but with "hit the brakes".

Well, like I said, my nodes have one job. To decide which response class
to assign every pulse. The network as a whole just does a lot more of the
same job and ends up with (potentially) more final response classes.

When you first turn it on, it's already in a default configuration which
causes it to sort all pulses some way. The point of learning is to just
slowly move the boundaries of those response classes in directions that
have been shown to increase rewards.

So the basic idea is how much collapsing, funneling, concentrating of
the input categories into output categories does the net do. That is
the useful work it has done in discriminating categories.

I don't yet grasp your view of "useful" here.

It gets more complex when we lift the qualifiers "equally likely", and
"and no others", but the idea is at heart simple.

Well, from Wikipedia, I learn that the entropy for a discrete event
source is just -Sum Pi Log2(Pi) where Pi is the probability of event i.
That's easy enough to understand as the expected value of the number of
bits per symbol (/event). It's also fairly easy to see that the
entropy (average bits per symbol) of a source drops when the symbols
are not equally likely.
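
The formula and the drop in entropy for unequal probabilities can be seen directly. A short sketch; the function name entropy_bits and the two four-symbol example distributions are made up for illustration.

```python
import math


def entropy_bits(probs):
    """-Sum Pi Log2(Pi): expected bits per symbol for a discrete source."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


uniform = [0.25, 0.25, 0.25, 0.25]   # four equally likely symbols
skewed = [0.70, 0.10, 0.10, 0.10]    # same four symbols, one dominant

print(entropy_bits(uniform))  # 2.0
print(entropy_bits(skewed))   # less than 2.0
```

The uniform source needs the full 2 bits per symbol; the skewed one needs fewer on average because the common symbol is mostly predictable.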

Right.

But, I'm not sure I understand the practical use of these ideas of
information and entropy to learning networks....

Think of a classifier, like a character recognizer. It maps a "stimulus
space" - in this case bitmaps - into categories. Ultimately, there is a
vast collection of bitmaps that it labels as one thing: 'A'. So some
"feature" is useful in that endeavor to the extent it narrows the
possibilities. The task of "feature extraction" is a search for maximally
informative stimulus dimensions in bitmap space.

Ok, but I understand that. My network is already designed to do just
that. By default the nodes balance the probability of pulses being
classified out each side of a single node. The node, by taking one signal,
and splitting it into two signals, has just extracted two features, from
that single signal. And those two feature signals are maximally
informative about the original signal.
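
One way to read "balance the probability" in information terms: a two-way split carries the most information when each side is equally likely. The sketch below only illustrates that point. It assumes a node's routing can be summarized by a single probability p of a pulse going out one side, which is a simplification for the example and not a description of the actual node; binary_entropy_bits is a made-up name.

```python
import math


def binary_entropy_bits(p):
    """Entropy of a two-way split where one side has probability p."""
    if p in (0.0, 1.0):
        return 0.0  # a split that always goes one way tells us nothing
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


# Scan candidate split probabilities and pick the most informative one.
candidates = [i / 100 for i in range(101)]
best_p = max(candidates, key=binary_entropy_bits)
print(best_p, binary_entropy_bits(best_p))
```

A balanced split yields 1 bit per pulse, the maximum for a binary choice; any imbalance yields less.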

Likewise, as this process continues for the entire net, every output
generated by each node inside the net, becomes yet another maximally
informative feature of the stimulus data. A network with 1000 nodes will
by default, extract a set of 1000 maximally informative unique features of
the sensor space. That's its default starting behavior.

Reinforcement learning will then go about matching up those features to the
correct outputs and adjusting their boundaries to maximize reward.

There may be better ways to create maximally informative pulse based
feature signals, so I'm game to try and get a better understanding of it, but
the intent of that goal is something these nets are already doing.

And I really don't grasp the connections between thermodynamic entropy
and information entropy. But I'll figure that out at some point
here... :)

Well, that's where things start to get interesting. I offered a brief
explanation to Wolf, "Information & Uncertainty". I think it made some
sense to him.

Yeah, there's still plenty to be learned. :)

--
Curt Welch http://CurtWelch.Com/
curt@xxxxxxxx http://NewsReader.Com/