# Re: behavior as mapping

jalegris@xxxxxxxxxxxx wrote:

Curt Welch wrote:

....

And this is exactly how AI needs to work. It needs to be a temporal
reaction machine that, by default, produces behavior with maximal
information (random behaviors).

You (Curt) are describing a kind of Monte Carlo sampling method of
estimating a probability distribution, the distribution
P(consequences|context,behavior), based on assuming a uniform prior over
the distribution ("random behaviors"); i.e. generate all behaviors with a
priori equal likelihood independently of context. In other words, initially
P(behavior|context) = P(behavior) = 1/nBehaviors.

This is only feasible if the sample space is small and the number of trials
large, and not always even then. You eat a piece of cheese, drink a bottle
of liquid plumber, sit down for 10 minutes, and throw up. Starting with the
uniform prior over the distribution, you next eat a piece of cheese, drink
a bottle of liquid plumber, sit down for 9 minutes 59 seconds and 999,999
microseconds, and stand up. Threw up again? Pick some other behavioral
sequence with equal probability - since you have microsecond temporal
resolution, and assuming the temporal reaction machine takes much less than
10 minutes to eat a piece of cheese or drink a bottle of liquid plumber,
then with very high probability the next trial will be to eat a piece of
cheese, drink a bottle of liquid plumber, and sit down for some number of
microseconds. You will also sample, eventually, drinking 987 thousandths of
a bottle of liquid plumber...

Imagine you were trying to estimate the bias of a coin in a coin-flipping
experiment. The coin has some probability p, a value in the closed interval
[0, 1] of coming up heads (and so probability 1-p of coming up tails).
Assuming a uniform prior distribution for p means all values in [0, 1] have
equal probability density. You flip the coin once, it comes up heads. The
maximum likelihood estimate for p is 1 -- is that your best guess? You flip
a coin once, it comes up heads, and on that basis you conclude that the
most likely probability is it always comes up heads? You flip it 3 times
and you get 2 heads, one tails. Now your maximum likelihood estimate for p
is 2/3. Note that the conditions of the experiment made it impossible for
you to arrive at an estimate of p = 1/2, given the uniform prior over p in
[0, 1]. You flip it long enough and the maximum likelihood estimate will
eventually converge to the true value of p (assuming there is such a thing,
which under normal coin flipping conditions is a reasonable assumption).
But a uniform prior probability (a pdf of constant height over the interval
[0,1]) is not the best choice. A better choice is a prior distribution that
assigns greater prior probability to 1/2 than to 0 or 1 (a pdf with a hump
in the middle of the interval [0, 1]). Both the uniform prior and an
"informative prior" will, given enough trials, *eventually* converge to the
same estimate, but the informative prior will converge more quickly -
unless this is a very unusual coin.
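
To put numbers on the coin example, here is a minimal sketch comparing the estimate under a uniform prior with the estimate under a humped prior. The Beta(a, b) family and the specific choice Beta(5, 5) are my assumptions for illustration; nothing above commits to a particular informative prior. The estimate shown is the posterior mode, which under the uniform prior Beta(1, 1) coincides with the maximum likelihood estimate.

```python
from fractions import Fraction


def map_estimate(heads, flips, a=1, b=1):
    """Posterior mode of p under a Beta(a, b) prior (assumes a, b >= 1).

    With a = b = 1 (uniform prior) this is the maximum likelihood
    estimate heads / flips. With a = b > 1 the prior has a hump at 1/2.
    """
    return Fraction(heads + a - 1, flips + a + b - 2)


# (heads, flips) pairs taken from the example: 1 of 1, then 2 of 3.
for heads, flips in [(1, 1), (2, 3)]:
    uniform = map_estimate(heads, flips)        # Beta(1, 1)
    humped = map_estimate(heads, flips, 5, 5)   # Beta(5, 5), hypothetical choice
    print(f"{heads}/{flips} heads: uniform prior -> {uniform}, "
          f"humped prior -> {humped}")
```

After a single head the uniform prior says p = 1, while the humped prior says 5/9; after 2 heads in 3 flips they say 2/3 and 6/11. Both move toward the observed frequency as flips accumulate, but the humped prior starts out much closer to 1/2.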

Of course in life we never get "enough trials", and distributions are not
uniform. This is probably why, for one thing, when rats throw up they
develop an aversion to the taste of the last thing they ate, and not the
place they ate it. It is not at all clear that building temporal reaction
machines that, initially, jump in front of cars, and drill holes in their
inventors' foreheads with the same equal probability with which they do
anything else is "exactly how AI needs to work".

Then, through experience, it slowly improves its
estimates of the value of all these random behaviors and bends the
probabilities...

"Bends the probabilities" how? This is an on-line estimation update
algorithm (on-line because "experience" is not a batch data set). You need
to specify an algorithm. Different ones will perform better than others on
different distributions. You need to show under what conditions and at what

...of the behaviors in favor of the higher value ones.

It is not the behaviors that have value, it is the consequences that have
value. This is a form of state space search. You are searching for
behaviors that lead to high value consequences - maximize expected value -
which in general will be a probabilistic function of context, the
difference, for example, between
P(consequences|cars-coming,{cross-street,wait}) and
P(consequences|no-cars-coming,{cross-street,wait}).
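
A minimal sketch of that search, assuming the distribution P(consequences|context,behavior) is already known. All the probabilities, consequence names, and values below are made up for illustration.

```python
# Hypothetical P(consequence | context, behavior); numbers are illustrative only.
P_CONSEQUENCE = {
    ("cars-coming", "cross-street"): {"hit": 0.5, "arrive": 0.5},
    ("cars-coming", "wait"): {"delayed": 1.0},
    ("no-cars-coming", "cross-street"): {"arrive": 1.0},
    ("no-cars-coming", "wait"): {"delayed": 1.0},
}

# Hypothetical values attached to consequences, not to behaviors.
VALUE = {"hit": -1000.0, "arrive": 10.0, "delayed": -1.0}

BEHAVIORS = ["cross-street", "wait"]


def expected_value(context, behavior):
    """Expected value of the consequences of a behavior in a context."""
    dist = P_CONSEQUENCE[(context, behavior)]
    return sum(p * VALUE[consequence] for consequence, p in dist.items())


def best_behavior(context):
    """Behavior that maximizes expected value under the given context."""
    return max(BEHAVIORS, key=lambda b: expected_value(context, b))


for context in ["cars-coming", "no-cars-coming"]:
    print(context, "->", best_behavior(context))
```

The same behavior (cross-street) is the best choice in one context and the worst in the other, which is the sense in which value belongs to consequences rather than to behaviors.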

In the end...

There is no end unless both 1) the distribution is stationary, and 2) your
bending algorithm converges (hopefully on something like the actual
distribution), rather than, say, oscillates, or wanders chaotically over a
strange attractor - assuming the dynamics of the probability bender has an
attractor at all.

the intelligent behavior...

The behavior that maximizes the expected value of the consequences under a
given context.

...emerges from the initial random behaviors.

Emerges if the probability bender converges.

But the Shannon information content in this intelligent behavior is far
lower, than the information content in the purely random non-intelligent
behavior it started with.

A careless statement of the information budget, prompting Joe's remark
below. An observer, recording the responses of the temporal reaction
machine, will gain maximal information per response when all responses are
equally likely, that is true - this is the reduction in uncertainty of what
the machine was going to do, from a value that is maximal when all
behaviors are equally likely, to zero once it is known what it did.
However, when the machine's responses were "random", that is, uncorrelated
with context, the machine-responded-thus messages provided zero information
about context (information is always about something). Precisely, it is the
reduction of the entropy of P(context) to the entropy of P(context|response)
wherein "the information content in this intelligent behavior" now resides.
Further, the machine has gained information - the reduction from the
entropy of the distribution P(consequences) to the entropy of the
distribution P(consequences|context,response). Its behavior has become
"informed behavior".
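
A small sketch of that calculation, with a two-context, two-response toy world. The contexts, the policies, and the equal context probabilities are assumptions for illustration. It computes the entropy of P(context) minus the entropy of P(context|response), averaged over responses.

```python
from math import log2


def entropy(dist):
    """Shannon entropy in bits of a dict mapping outcomes to probabilities."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)


def information_about_context(p_context, policy):
    """H(context) - H(context | response) for policy[context][response]."""
    # Joint distribution P(context, response).
    joint = {(c, r): pc * pr
             for c, pc in p_context.items()
             for r, pr in policy[c].items()}
    # Marginal P(response).
    p_response = {}
    for (c, r), p in joint.items():
        p_response[r] = p_response.get(r, 0.0) + p
    # H(context | response), averaged over responses.
    h_cond = 0.0
    for r, pr in p_response.items():
        if pr > 0:
            posterior = {c: joint.get((c, r), 0.0) / pr for c in p_context}
            h_cond += pr * entropy(posterior)
    return entropy(p_context) - h_cond


p_context = {"cars-coming": 0.5, "no-cars-coming": 0.5}
random_policy = {c: {"cross-street": 0.5, "wait": 0.5} for c in p_context}
informed_policy = {
    "cars-coming": {"cross-street": 0.0, "wait": 1.0},
    "no-cars-coming": {"cross-street": 1.0, "wait": 0.0},
}

print(information_about_context(p_context, random_policy))    # 0.0 bits
print(information_about_context(p_context, informed_policy))  # 1.0 bits
```

The random machine's responses tell the observer nothing about the context; the informed machine's responses identify it completely.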

Now this brings us around to a problem I have been having with the
definition of information.

Stewart and Brook see four types of information.
( http://www.carleton.ca/ics/TechReports/files/2003-06.pdf )

A paper, it seems to me, that obscures more than it clarifies. The section
on "type 2" information is especially bad. The statement "In any case, for
both chaos theorists and dynamic systems researchers 'information' is
equated with order, or anti-entropy" is a gross misconception. It is not
surprising that they quote Gleick rather than, say, a chaos theorist or
dynamic systems researcher to support this. The authors slip into the basic
mistake of equating information and entropy (hence the characterization of
anti-entropy as the opposite of "type 1 information"). Entropy is not
information. Information is the reduction of entropy. All the chaos
theorists and dynamical systems researchers whose papers I've read that
discuss the matter at all (and several do, in quite some depth) are quite
clear on this.

An example from molecular biology might help clarify the relationship. For
transcription of DNA to take place RNA polymerase must bind to the DNA at
the start of a gene. Each gene has a sequence tag that acts as a marker of
where to bind. Each "letter" in the sequence tag comes from a 4-character
"alphabet", so each letter conveys at most 2 bits of information. How much
information (about where to bind) do the sequence tags actually contain?
Find all the sequence tags and estimate the probability distribution over
the alphabet at each position in the sequence. If each character at each
position occurs with probability 1, that is with entropy zero, then the
reduction in entropy from 2 bits to zero bits is an information gain of 2
bits - each character of the sequence tag carries maximal information. In
fact the sequence tags are not that regular - the entropies per position do
not go to zero, but they are less than two bits. They contain where-to-bind
information to the extent they are regular - to the extent they reduce the
entropy over equiprobable random strings.
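
The per-position estimate can be sketched as follows. The aligned tags below are made up, and the sketch uses raw observed frequencies with no small-sample correction, which a real analysis would need.

```python
from collections import Counter
from math import log2


def information_per_position(tags):
    """Bits of information at each position of equal-length aligned tags.

    Information = 2 bits (entropy of 4 equiprobable letters)
    minus the observed entropy at that position.
    """
    info = []
    for column in zip(*tags):
        n = len(column)
        counts = Counter(column)
        h = -sum((k / n) * log2(k / n) for k in counts.values())
        info.append(2.0 - h)
    return info


# Hypothetical aligned sequence tags, for illustration only.
tags = ["TATA", "TATT", "TACA", "TAGA"]

for position, bits in enumerate(information_per_position(tags)):
    print(f"position {position}: {bits:.3f} bits")
```

The first two positions are perfectly regular and carry the full 2 bits each; the last two vary and carry less. Summing over positions gives the total where-to-bind information in the tag.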

This is not an idle exercise. We can compute how much information a binding
site *must* convey. We need enough information to distinguish binding sites
from non-binding sites. There are nSites/nGenes distinctions to be made.
Suppose the DNA string were 16 characters long and it contained 8 genes.
Then 1/2 the sites would be binding sites - we would need 1 bit of
information per binding site to distinguish it from its non-binding site
neighbor. If there were 4 genes we would need 2 bits of binding site info.
And so on. In general we need log2(nSites/nGenes) bits of binding site
information per binding site - that is how much information the sequence
tags must contain. And, in the many cases where it has been studied, that
is almost exactly how much information they do contain. There have been
notable exceptions - a few cases where the information was either about
twice or three times the expected amount. It would be like walking along a
one. On that basis, in those cases, it was predicted that some other one or
two molecules in addition to RNA polymerase were binding at these sites,
and these secondary binders were later found.
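
The arithmetic in the 16-character example, as a short sketch (the function name is mine):

```python
from math import log2


def required_bits(n_sites, n_genes):
    """Bits a binding site must convey: log2(nSites / nGenes)."""
    return log2(n_sites / n_genes)


print(required_bits(16, 8))  # 1.0 bit: half the sites are binding sites
print(required_bits(16, 4))  # 2.0 bits: a quarter of the sites are binding sites
```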

One more interesting point from this example. I noted that the entropy per
character in the sequence tags was not reduced to zero, so each character
conveys less than the maximal 2 bits of information. Is this an
inefficiency? Nope. It is energy efficient, in the thermal noise
environment of jiggling molecules, to spread the information out over a
longer sequence (multiple low bit-rate channels rather than a few high
bit-rate channels) to maximize binding likelihood per energy cost - a
little like writing the more fault tolerant "HUNDRED" than "100".

Anyhow, it is the "exploitable regularity" of these strings that accounts
for their information content (about where to bind), which is in no way
opposite to "type 1 information", but in fact the same thing, an entropy
reducing message: "I'M A BINDING SITE", where the prior probability of
being a binding site is nGenes/nSites.

To avoid confusion I'll use upper-case to distinguish their use of
the term:

Type 1: Reduction of local uncertainty. They equate this type of
INFORMATION to the uncertainty in receiving a message, as measured by
Shannon's entropy. This is the same sense of the term I would use.

Type 2: Exploitable regularity. They see this as INFORMATION occurring
in the environment, such as food, that is exploitable by living things.
They view it as the exact opposite of Type 1 because, paradoxically,
its orderly physical structure has relatively low information content.
For example, when food is ingested, energy and physical order are
transferred to the organism, increasing its own physical order and
reducing its information-content. I think this is a confounding of two
distinct processes. First, the occurrence of food in the wild (against
a background of "random noise") is a relatively improbable event that
has relatively high information content and is likely to be remembered.
Second, the ingestion of food and the associated reduction of
information is an interpretation of the reduced thermodynamic entropy
associated with order and available energy.

Type 3: Data to be manipulated or transformed. This is the INFORMATION
of cognitive psychology and the information-processing view of
cognition. They make a clear distinction between Type 1 where
INFORMATION "was something contained in the signal" and the present
type, where INFORMATION "is the signal itself". I think this type
of INFORMATION is just a form of representation.

Just one observation - rhodopsin changes state either through
photoisomerization, or through spontaneous thermal isomerization.

such as when a representation is about something. I don't have much use
for this type.

Well, all information is about something in the sense that it is information
by virtue of reducing the entropy of some distribution. That's what it's

A silly paper (in my opinion, of course).

Anyway, Type 2 seems to correspond to your conception of the
information "reduction" that occurs with learning. Is that what you
mean?

The reduction in the entropy of the machine's behavior is due to its gain of
information about the probable consequences of its behavior. Seems simple
enough.

-- Michael
