Re: behavior as mapping
- From: curt@xxxxxxxx (Curt Welch)
- Date: 06 Jul 2006 02:33:22 GMT
jalegris@xxxxxxxxxxxx wrote:
Curt Welch wrote:
...
And this is exactly how AI needs to work. It needs to be a temporal
reaction machine that, by default, produces behavior with maximal
information (random behaviors). Then, through experience, it slowly
improves its estimates of the value of all these random behaviors and
bends the probabilities of the behaviors in favor of the higher value
ones. In the end, the intelligent behavior emerges from the initial
random behaviors. But the Shannon information content in this
intelligent behavior is far lower than the information content in the
purely random non-intelligent behavior it started with.
Now this brings us around to a problem I have been having with the
definition of information.
I'm just now following up to this old message of yours. I've been saving
it to see if I had any more insight on how to answer. I got more insight
but no clear answers.
The concept of information has never been real clear to me either. The
word seems to be used in a lot of different ways (even by myself).
Stewart and Brook see four types of information.
( http://www.carleton.ca/ics/TechReports/files/2003-06.pdf )
To avoid confusion I'll use upper-case to distinguish their use of
the term:
Type 1: Reduction of local uncertainty. They equate this type of
INFORMATION to the uncertainty in receiving a message, as measured by
Shannon's entropy. This is the same sense of the term I would use.
Type 2: Exploitable regularity. They see this as INFORMATION occurring
in the environment, such as food, that is exploitable by living things.
They view it as the exact opposite of Type 1 because, paradoxically,
its orderly physical structure has relatively low information content.
For example, when food is ingested, energy and physical order are
transferred to the organism, increasing its own physical order and
reducing its information-content. I think this is a confounding of two
distinct processes. First, the occurrence of food in the wild (against
a background of "random noise") is a relatively improbable event that
has relatively high information content and is likely to be remembered.
Second, the ingestion of food and the associated reduction of
information is an interpretation of the reduced thermodynamic entropy
associated with order and available energy.
Type 3: Data to be manipulated or transformed. This is the INFORMATION
of cognitive psychology and the information-processing view of
cognition. They make a clear distinction between Type 1 where
INFORMATION "was something contained in the signal" and the present
type, where INFORMATION "is the signal itself". I think this type
of INFORMATION is just a form of representation.
Type 4: Aboutness. This category of INFORMATION admits intentionality,
such as when a representation is about something. I don't have much use
for this type.
I think ultimately, all information is aboutness. I think it can't be any
other way.
And this is where Shannon's information theory comes in. You can't
measure, or quantify, or even talk about, information, unless you define
some "aboutness" frame of reference to use in your discussion. Once you
make it clear what "aboutness" you are talking about, you can then talk
about how that aboutness is being represented in some physical events
(data/energy flows).
I'm starting to read this book that Michael suggested that gets into a
strong information theory perspective of neuron activity. Either because
of what this book will teach me, or because of what I'll have to go learn
to understand what the book is trying to teach me, I expect to end up with
a better understanding. I really don't understand statistical information
theory and entropy concepts as well as I want to.
But one small part of it I understand very well. That's the ability to
convert a message flow made up of a fixed number of symbols into a binary
message flow, and to quantify the number of bits needed in the binary
message flow to transfer all the same information. This is how you can
talk about the number of bits of information present in a single symbol.
It's just a formal way of saying that if you were using a binary message
system, you would need on average so many bits of data to transmit all the
same "aboutness". This translation is done based on the probability
distribution of the symbols in the starting language. The number of bits
needed to transmit the same "aboutness" (aka meaning) of a symbol that
occurs with probability p is log2(1/p). This assumes not only a
translation to a binary language, but also a translation to a language
with equal probability of each symbol. This gives us a way to normalize,
in a quantitative way, the amount of information being transmitted in a
stream of finite symbols: convert it to the equivalent flow of binary
symbols, and use that as the yardstick of measurement for all information
flows.
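To make the log2(1/p) yardstick concrete, here is a minimal Python sketch. The function name surprisal_bits and the sample probabilities are just my own choices for illustration, not anything taken from the posts above.

```python
import math

def surprisal_bits(p: float) -> float:
    """Bits carried by one occurrence of a symbol with probability p: log2(1/p)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in the interval (0, 1]")
    return math.log2(1.0 / p)

# A symbol we are certain of (p = 1) carries 0 bits; rarer symbols carry more.
for p in (1.0, 0.5, 0.25, 0.2):
    print(f"p = {p:<4} -> {surprisal_bits(p):.3f} bits")
```

Note that a symbol with p = 1 comes out at 0 bits, which is the "nothing new for the receiver" case.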
But this is all based on the idea of a probability of a symbol occurring in
a message stream. A probability, by its very nature, is an indication of a
lack of knowledge about the message stream. We quote probabilities when we
don't have enough information to give the truth. Saying that there is a 20%
chance of a symbol occurring means we know a little about which symbol
might happen next, but not much. And it's what we don't know that allows
us to count how many bits are "in" the data. Because if we knew for a fact
which symbol was going to happen next, there would be 0 bits of data in it
(for us).
So, we can say there are N bits of data in a symbol, but by saying that, we
are also admitting how much we don't know about what will happen next. The
more bits of data we claim to be in the stream, the less we actually know
about it. Which means we are receiving information we didn't know with each
symbol.
So also embedded in these concepts of information and "aboutness" is the
assumption that the receiver must be unaware in order for there to be any
"aboutness" in the message. So "aboutness" information only seems to exist
in the message if it doesn't already exist in the receiver.
It's still all very confusing to me just what we think we are talking about
when we talk about information in many of these different ways.
This however is the only formal measure of "information" I currently
understand - to the limits which I think I understand it.
Anyway, Type 2 seems to correspond to your conception of the
information "reduction" that occurs with learning. Is that what you
mean?
I don't really know what I mean. It's even more confusing when you try to
talk about information content in temporal spike symbols, which don't
directly match the Shannon information concepts; those relate to discrete
symbols which have no temporal information content. All attempts to
understand the information content of spike symbols seem to do so by using
one of various tricks to convert them to an equivalent discrete message
stream. I'm not yet convinced those conversions are the best way to
understand spike symbols (but I've just started on Michael's suggested
book, which explores this very subject in great depth).
But let me try to figure out and explain just what I do mean.
The ideas for how to build better learning systems which I have been
exploring are all based on the concept of building a reinforcement
learning system to deal with environments too complex to fully model.
This means the learning system has no hope of building internal models
which correctly reflect the full current state of the environment.
But, even with this limitation, reinforcement learning must still happen.
And this means, that some internal state model of the environment must be
used to represent the state of the environment. This is because
reinforcement learning requires that statistical data be developed through
experience, associating the state of the environment (the current context),
with possible behaviors.
Whatever state model a given system uses to approximate the state of the
environment, and whatever behavior model it uses, one natural starting
configuration of the system, is to give all possible behaviors, from all
possible states, an equal probability of being selected.
So, if you look at this behavior generator as if it were a message
generator (each behavior selected is a different symbol transmitted to the
environment), you can talk about it transmitting maximal information (in
terms of bits per behavior) to the environment.
This is because Shannon information is maximal when the probabilities of
all symbols are equal. If you have two symbols with unequal probabilities,
.25 and .75, then the first symbol carries log2(1/.25) = 2 bits of
information each time it shows up. But the other symbol only carries
log2(1/.75), or about .42 bits of information. The entropy of this source
(the average expected bits per symbol) is .25*2 + .75*.42 = .815 bits per
symbol (closer to .811 without the rounding).
If on the other hand, the two symbols had equal probability, they would
carry a full 1 bit per symbol and the entropy of the source would be a full
1 bit per symbol.
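The two-symbol comparison can be checked with a few lines of Python. This is only a sketch; the helper name entropy_bits is my own, and it assumes the probabilities are given as a plain list that sums to 1.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits per symbol: the sum of p * log2(1/p)."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Symbols with p == 0 never occur and contribute nothing.
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

print(f"{entropy_bits([0.25, 0.75]):.3f}")  # unequal symbols: below 1 bit
print(f"{entropy_bits([0.5, 0.5]):.3f}")    # equal symbols: the 1-bit maximum
```

Computed without rounding the per-symbol figures, the unequal source works out to roughly 0.811 bits per symbol.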
As it learns which behaviors produce more rewards, that profile of
probabilities will shift. Some behaviors will be used more than others in a
given context. As a result, the information flow out of the system
effectively drops. Its behavior becomes less random and more
predictable. The number of bits per behavior drops as it learns (though
this is dependent on the environment, because it assumes that the optimal
behavior for the problem as created by the environment is something other
than random behavior).
So, the flow of information out of the learning system into the environment
drops (for all interesting problems) as it learns. This is the reduction I
was talking about.
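Here is a toy sketch of that reduction, under assumptions that are entirely mine: a single context, four behaviors, made-up reward probabilities, and a softmax over running value estimates as the way probabilities get "bent" toward higher-value behaviors. It is not meant as the learning system described in this thread, only as one simple instance of the effect.

```python
import math
import random

def entropy_bits(probs):
    """Shannon entropy in bits per behavior."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

def softmax(values, temperature):
    """Turn value estimates into selection probabilities."""
    m = max(values)  # subtract the max for numerical stability
    weights = [math.exp((v - m) / temperature) for v in values]
    total = sum(weights)
    return [w / total for w in weights]

def run(steps=2000, seed=0, temperature=0.1):
    rng = random.Random(seed)
    reward_prob = [0.2, 0.4, 0.6, 0.9]  # hypothetical environment, one context
    q = [0.0] * len(reward_prob)        # value estimate per behavior
    n = [0] * len(reward_prob)          # times each behavior was tried
    initial = entropy_bits(softmax(q, temperature))  # all equal: log2(4) bits
    for _ in range(steps):
        policy = softmax(q, temperature)
        a = rng.choices(range(len(q)), weights=policy)[0]
        r = 1.0 if rng.random() < reward_prob[a] else 0.0
        n[a] += 1
        q[a] += (r - q[a]) / n[a]       # running average of observed reward
    final = entropy_bits(softmax(q, temperature))
    return initial, final

initial, final = run()
print(f"bits per behavior before learning: {initial:.3f}")
print(f"bits per behavior after learning:  {final:.3f}")
```

The starting policy is uniform, so it emits the full 2 bits per behavior; once the value estimates separate, the policy concentrates and the entropy falls below that.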
However, for the system to learn, it must be receiving a flow of
information from the environment, which relates to the value of behaviors
in different contexts. I've not tried to understand the nature of that
flow.
Also, since I deal with reaction machines, there's a flow of sensory
information about the environment into the machine as well (which is
related to what it is learning, but also separate from it). And these
machines I play with translate that flow of sensory information, into a
flow of behaviors.
I believe it's valid to say that in the starting configuration, these
machines pass as much information from the environment, back out to the
environment as possible as behaviors. Through the process of learning,
this flow is reduced - more sensory data is seen as noise, and is filtered
out of the flow, leaving only the flow of the good stuff, back out to the
environment in terms of behaviors.
Also, in this problem of creating the most useful internal state model of
the environment from the incomplete sensory data, I think there's
application for using these formal information ideas to show that the state
model of a finite size, is representing a maximum amount of information
about the full state of the actual environment. So the translation from
sensory data to internal state is a job of trying to maintain a maximal
amount of sensory data in the internal state. Through learning, however, the
system should learn that some states are more valuable to "know about"
than others, and it should adjust its state definitions to track those
with maximal value.
I don't have a full grasp of all this and how you can apply formal
information theories to all of it, but I'm working on it and making some
progress.
--
Curt Welch http://CurtWelch.Com/
curt@xxxxxxxx http://NewsReader.Com/