Re: behavior as mapping



A few (non-random but rambling) thoughts and questions. How do you intend to
incorporate contiguity issues into your Bayesian approach? As I once pointed
out to Joe, the widely-held definition of positive reinforcement is silent
on the issue of temporal contiguity - it is a contingency-based definition.
Same for Pavlovian conditioning - that is why conditioned taste aversions
can be considered an instance of Pavlovian conditioning (incidentally, other
one-trial pairings of stimuli that produce Pavlovian conditioning are
possible). That is, if response-independent events increase the rate of some
response, we generally do not call this an instance of operant conditioning.
At the same time, it is widely, but not ubiquitously, held that contiguity
is more important than dependency and that maybe even the apparent effects
of dependencies are totally due to strictly temporal relations - relations
that COULD have occurred by accident. Joe's take on the issue was that
behaviorists wanted to be slippery with their definitions (or maybe his
point was that they are merely stupid) so that they could embrace disparate
phenomena. A more reasonable answer, I offered, is that the definition is
simply pragmatic, and pragmatism is, to say the least, a huge issue for a
real experimental science. Response-independent events rarely maintain responding
(I managed to maintain responding, after exposure to somewhat unusual
schedules, under VT schedules in two of three pigeons, but I would like to
repeat that) after exposure to response-dependent schedules. This is extra
important given EAB's move - for pragmatic reasons - to an analysis of
stable states rather than transition states. Similarly, a definition in
terms of contiguity (like Skinner's view) is ultimately not pragmatic
because it has no way to distinguish "superstitious" responding from
elicitation or induction (maybe a better term when used in this sense), and
it is significant that Staddon interpreted the superstition experiment in
terms of the elicitation of different classes of species-typical
respondents. The issue of contiguity vs. dependency can be safely ignored
only when the most fundamental nature of conditioning is not the question.
That is, you will not be able to safely ignore it. How are you going to
explain the effects of a single temporal conjunction of response and event
or event and event, and at the same time explain why response-independent
reinforcement does not generally maintain responding? Do you intend to make
time part of the context in P(response/context)? Or will temporal issues
enter in, if at all, in other ways? For example, maybe the issues are
handled by your added value and response cost notions - issues that are
outside, it seems to me, of the strictly Bayesian calculations. For example,
if a response is regarded as an attempt to evaluate a hypothesis concerning
models of P(event/R1) and P(event/ no R1) and P(event/R2) etc., you can say
that a single temporal conjunction changes the estimate of the probability
that the model is correct by only a little, but since the response is not
very costly, you might as well run with the notion that the response makes
the event fairly likely. But even here, your model would have to somehow
incorporate the fact that an increase in rate of response following a
"pairing" is probably a function of the precise temporal relation (i.e.,
arranging an FR 1 and a Tand FR 1 FT 3.0 s would likely have different
effects). Anyway, I am pretty naïve about Bayesian stuff, but I am slightly
less so than the last time we talked. So, I guess I am sort of asking you
the same questions I did earlier in hopes that I have a better chance of
understanding some of the answers. Does any of this make sense to you?



Incidentally, have you noticed that the methodological approach to
experimentation favored by EABers is sort of consistent with Bayesian
statistics in sort of an intuitive way? In particular, the notion of steady
or, at least, stable states. Basic practice is for the scientist to plot his
or her data every day, and each data point has the potential to alter his or
her opinion about whether or not there are any long-term trends in the
measure. Could Bayesian statistics be used as sort of a formal type of
stability criterion?



Anyway, I'll stop emitting now. I enjoy most of your posts, and agree with
much of what you say that I can understand. I am sometimes dismayed when you
do not attack my enemies. On the other hand, I am delighted when you do.



Glen



"Michael Olea" <oleaj@xxxxxxxxxxxxx> wrote in message
news:pwegg.90667$H71.5593@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx
jalegris@xxxxxxxxxxxx wrote:

Curt Welch wrote:

...

And this is exactly how AI needs to work. It needs to be a temporal
reaction machine that, by default, produces behavior with maximal
information (random behaviors).

You (Curt) are describing a kind of Monte Carlo sampling method of
estimating a probability distribution, the distribution
P(consequences|context,behavior), based on assuming a uniform prior over
the distribution ("random behaviors"); i.e. generate all behaviors with a
priori equal likelihood independently of context. In other words,
initially P(behavior|context) = P(behavior) = 1/nBehaviors.

This is only feasible if the sample space is small and the number of trials
large, and not always even then. You eat a piece of cheese, drink a bottle
of liquid plumber, sit down for 10 minutes, and throw up. Starting with the
uniform prior over the distribution, you next eat a piece of cheese, drink
a bottle of liquid plumber, sit down for 9 minutes 59 seconds and 999,999
microseconds, and stand up. Threw up again? Pick some other behavioral
sequence with equal probability - since you have microsecond temporal
resolution, and assuming the temporal reaction machine takes much less than
10 minutes to eat a piece of cheese or drink a bottle of liquid plumber,
then with very high probability the next trial will be to eat a piece of
cheese, drink a bottle of liquid plumber, and sit down for some number of
microseconds. You will also sample, eventually, drinking 987 thousandths of
a bottle of liquid plumber...

Imagine you were trying to estimate the bias of a coin in a coin-flipping
experiment. The coin has some probability p, a value in the closed interval
[0, 1], of coming up heads (and so probability 1-p of coming up tails).
Assuming a uniform prior distribution for p means all values in [0, 1] have
equal probability density. You flip the coin once, it comes up heads. The
maximum likelihood estimate for p is 1 -- is that your best guess? You flip
a coin once, it comes up heads, and on that basis you conclude that the
most likely probability is that it always comes up heads? You flip it 3
times and you get 2 heads, one tails. Now your maximum likelihood estimate
for p is 2/3. Note that the conditions of the experiment made it impossible
for you to arrive at an estimate of p = 1/2, given the uniform prior over p
in [0, 1]. Flip it long enough and the maximum likelihood estimate will
eventually converge to the true value of p (assuming there is such a thing,
which under normal coin flipping conditions is a reasonable assumption).
But a uniform prior probability (a pdf of constant height over the interval
[0, 1]) is not the best choice. A better choice is a prior distribution that
assigns greater prior probability to 1/2 than to 0 or 1 (a pdf with a hump
in the middle of the interval [0, 1]). Both the uniform prior and an
"informative prior" will, given enough trials, *eventually* converge to the
same estimate, but the informative prior will converge more quickly -
unless this is a very unusual coin.
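
The coin example can be sketched in a few lines of Python. This assumes a
Beta prior on p, which is a conventional choice and not something the
argument depends on: Beta(1,1) is the uniform prior, and Beta(5,5) is an
arbitrary stand-in for a prior with a hump at 1/2. The function names are
made up for illustration. Under the uniform prior the most probable value
of p coincides with the maximum likelihood estimate; the posterior mean is
shown as well because it is pulled toward the prior.

```python
from fractions import Fraction

def mle(heads, flips):
    """Maximum likelihood estimate of p: the observed fraction of heads."""
    return Fraction(heads, flips)

def posterior_mean(heads, flips, a, b):
    """Mean of the Beta(a + heads, b + tails) posterior under a Beta(a, b) prior."""
    return Fraction(a + heads, a + b + flips)

# (heads, flips) after one flip and after three flips, as in the text.
for heads, flips in [(1, 1), (2, 3)]:
    print(flips, "flips:",
          "MLE =", mle(heads, flips),
          "| uniform Beta(1,1) mean =", posterior_mean(heads, flips, 1, 1),
          "| humped Beta(5,5) mean =", posterior_mean(heads, flips, 5, 5))
```

After a single head the humped prior keeps the estimate near 1/2, while
the maximum likelihood estimate jumps straight to 1.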

Of course in life we never get "enough trials", and distributions are not
uniform. This is probably why, for one thing, when rats throw up they
develop an aversion to the taste of the last thing they ate, and not the
place they ate it. It is not at all clear that building temporal reaction
machines that, initially, jump in front of cars, and drill holes in their
inventors' foreheads with the same equal probability with which they do
anything else is "exactly how AI needs to work".

Then, through experience, it slowly improves its
estimates of the value of all these random behaviors and bends the
probabilities...

"Bends the probabilities" how? This is an on-line estimation update
algorithm (on-line because "experience" is not a batch data set). You need
to specify an algorithm. Different ones will perform better than others on
different distributions. You need to show under what conditions and at what
rate your algorithm converges.
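
To make "specify an algorithm" concrete, here is one possible on-line
update, sketched in Python. It is an assumption chosen for illustration,
not the algorithm Curt has in mind: a running sample mean of the value
observed after each behavior, turned into behavior probabilities by a
softmax. The function names and the temperature parameter are hypothetical.

```python
import math

def update_estimate(estimate, count, reward):
    """Incremental sample mean: new estimate after seeing one more reward."""
    count += 1
    estimate += (reward - estimate) / count
    return estimate, count

def behavior_probabilities(values, temperature=1.0):
    """Softmax over value estimates; equal values give a uniform distribution."""
    weights = [math.exp(v / temperature) for v in values]
    total = sum(weights)
    return [w / total for w in weights]

# Two hypothetical behaviors in one context, both starting at value 0.
values, counts = [0.0, 0.0], [0, 0]
print(behavior_probabilities(values))   # uniform before any experience
values[0], counts[0] = update_estimate(values[0], counts[0], reward=1.0)
print(behavior_probabilities(values))   # shifted toward behavior 0
```

Even for something this simple, whether and how fast the estimates
converge depends on the distribution of rewards being stationary.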

...of the behaviors in favor of the higher value ones.

It is not the behaviors that have value, it is the consequences that have
value. This is a form of state space search. You are searching for
behaviors that lead to high value consequences - maximize expected value -
which in general will be a probabilistic function of context, the
difference, for example, between
P(consequences|cars-coming,{cross-street,wait}) and
P(consequences|no-cars-coming,{cross-street,wait}).
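
A minimal Python sketch of "maximize expected value" as a function of
context. The probabilities and values below are made up for illustration
only, and the names `expected_value` and `best_behavior` are hypothetical.

```python
def expected_value(outcome_probs, outcome_values):
    """Sum of P(consequence | context, behavior) * value(consequence)."""
    return sum(p * outcome_values[c] for c, p in outcome_probs.items())

def best_behavior(context, model, outcome_values):
    """Behavior with the highest expected value in the given context."""
    return max(model[context],
               key=lambda b: expected_value(model[context][b], outcome_values))

# Made-up numbers: value is attached to consequences, not to behaviors.
outcome_values = {"arrive": 1.0, "delayed": 0.0, "hit": -100.0}
model = {
    "cars-coming": {
        "cross-street": {"arrive": 0.7, "hit": 0.3},
        "wait": {"delayed": 1.0},
    },
    "no-cars-coming": {
        "cross-street": {"arrive": 1.0},
        "wait": {"delayed": 1.0},
    },
}
for context in model:
    print(context, "->", best_behavior(context, model, outcome_values))
```

The same behavior, cross-street, is the best choice in one context and
the worst in the other.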

In the end...

There is no end unless both 1) the distribution is stationary, and 2) your
bending algorithm converges (hopefully on something like the actual
distribution), rather than, say, oscillates, or wanders chaotically over a
strange attractor - assuming the dynamics of the probability bender has an
attractor at all.

the intelligent behavior...

The behavior that maximizes the expected value of the consequences under a
given context.

...emerges from the initial random behaviors.

Emerges if the probability bender converges.

But the Shannon information content in this intelligent behavior is far
lower than the information content in the purely random non-intelligent
behavior it started with.

A careless statement of the information budget, prompting Joe's remark
below. An observer, recording the responses of the temporal reaction
machine, will gain maximal information per response when all responses are
equally likely, that is true - this is the reduction in uncertainty of what
the machine was going to do, from a value that is maximal when all
behaviors are equally likely, to zero once it is known what it did.
However, when the machine's responses were "random", that is, uncorrelated
with context, the machine-responded-thus messages provided zero information
about context. The "intelligent behavior" is much more informative about
context (information is always about something). Precisely, it is the
reduction of the entropy of P(context) to the entropy of P(context|response)
wherein "the information content in this intelligent behavior" now resides.
Further, the machine has gained information - the reduction from the
entropy of the distribution P(consequences) to the entropy of the
distribution P(consequences|context,response). Its behavior has become
"informed behavior".

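The reduction from the entropy of P(context) to the entropy of
P(context|response) can be computed directly. The Python sketch below
assumes two equally likely contexts and two responses, with made-up joint
probabilities; `information_about_context` is a hypothetical name.

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_about_context(joint):
    """H(context) - H(context | response) for joint[context][response]."""
    contexts = list(joint)
    responses = list(next(iter(joint.values())))
    h_context = entropy([sum(joint[c].values()) for c in contexts])
    h_context_given_response = 0.0
    for r in responses:
        p_r = sum(joint[c][r] for c in contexts)
        if p_r > 0:
            conditional = [joint[c][r] / p_r for c in contexts]
            h_context_given_response += p_r * entropy(conditional)
    return h_context - h_context_given_response

# Responses uncorrelated with context versus responses determined by context.
random_machine = {"A": {"r1": 0.25, "r2": 0.25}, "B": {"r1": 0.25, "r2": 0.25}}
informed_machine = {"A": {"r1": 0.5, "r2": 0.0}, "B": {"r1": 0.0, "r2": 0.5}}
print(information_about_context(random_machine))
print(information_about_context(informed_machine))
```

The random machine's responses carry no information about context; the
informed machine's responses carry the full bit.
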
Now this brings us around to a problem I have been having with the
definition of information.

Stewart and Brook see four types of information.
( http://www.carleton.ca/ics/TechReports/files/2003-06.pdf )

A paper, it seems to me, that obscures more than it clarifies. The section
on "type 2" information is especially bad. The statement "In any case, for
both chaos theorists and dynamic systems researchers 'information' is
equated with order, or anti-entropy" is a gross misconception. It is not
surprising that they quote Gleick rather than, say, a chaos theorist or
dynamic systems researcher to support this. The authors slip into the basic
mistake of equating information and entropy (hence the characterization of
anti-entropy as the opposite of "type 1 information"). Entropy is not
information. Information is the reduction of entropy. All the chaos
theorists and dynamical systems researchers whose papers I've read that
discuss the matter at all (and several do, in quite some depth) are quite
clear on this.

An example from molecular biology might help clarify the relationship. For
transcription of DNA to take place RNA polymerase must bind to the DNA at
the start of a gene. Each gene has a sequence tag that acts as a marker of
where to bind. Each "letter" in the sequence tag comes from a 4-character
"alphabet", so each letter conveys at most 2 bits of information. How much
information (about where to bind) do the sequence tags actually contain?
Find all the sequence tags and estimate the probability distribution over
the alphabet at each position in the sequence. If a single character occurs
at each position with probability 1, that is with entropy zero, then the
reduction in entropy from 2 bits to zero bits is an information gain of 2
bits - each character of the sequence tag carries maximal information. In
fact the sequence tags are not that regular - the entropies per position do
not go to zero, but they are less than two bits. They contain where-to-bind
information to the extent they are regular - to the extent they reduce the
entropy over equiprobable random strings.
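
The per-position calculation can be sketched in Python as follows. The
aligned tags are invented for illustration and are not real binding
sites, the function names are hypothetical, and the entropy is the plain
plug-in estimate from observed frequencies.

```python
import math
from collections import Counter

def position_information(column):
    """2 bits minus the entropy of the observed letters at one position."""
    counts = Counter(column)
    total = len(column)
    h = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return 2.0 - h

def tag_information(tags):
    """Per-position information for a list of equal-length aligned tags."""
    return [position_information(column) for column in zip(*tags)]

# Made-up aligned tags: the first two positions are perfectly regular,
# the last two are only partly regular.
tags = ["TATA", "TATT", "TACA", "TAGG"]
per_position = tag_information(tags)
print(per_position, "total bits:", sum(per_position))
```

With only a handful of tags the plug-in entropy underestimates the true
entropy, so a real analysis would need a small-sample correction.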

This is not an idle exercise. We can compute how much information a binding
site *must* convey. We need enough information to distinguish binding sites
from non-binding sites. There are nSites/nGenes distinctions to be made.
Suppose the DNA string were 16 characters long and it contained 8 genes.
Then 1/2 the sites would be binding sites - we would need 1 bit of
information per binding site to distinguish it from its non-binding site
neighbor. If there were 4 genes we would need 2 bits of binding site info.
And so on. In general we need log2(nSites/nGenes) bits of binding site
information per binding site - that is how much information the sequence
tags must contain. And, in the many cases where it has been studied, that
is almost exactly how much information they do contain. There have been
notable exceptions - a few cases where the information was either about
twice or three times the expected amount. It would be like walking along a
street and finding that the houses had two or three addresses instead of
one. On that basis, in those cases, it was predicted that some other one or
two molecules in addition to RNA polymerase were binding at these sites,
and these secondary binders were later found.
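
The two worked cases follow directly from log2(nSites/nGenes). A short
Python check, with `required_bits` as a hypothetical helper name:

```python
import math

def required_bits(n_sites, n_genes):
    """log2(nSites / nGenes): bits needed to single out a binding site."""
    return math.log2(n_sites / n_genes)

print(required_bits(16, 8))  # 1.0
print(required_bits(16, 4))  # 2.0
```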

One more interesting point from this example. I noted that the entropy per
character in the sequence tags was not reduced to zero, so each character
conveys less than the maximal 2 bits of information. Is this an
inefficiency? Nope. It is energy efficient, in the thermal noise
environment of jiggling molecules, to spread the information out over a
longer sequence (multiple low bit-rate channels rather than a few high
bit-rate channels) to maximize binding likelihood per energy cost - a
little like writing the more fault tolerant "HUNDRED" than "100".

Anyhow, it is the "exploitable regularity" of these strings that accounts
for their information content (about where to bind), which is in no way
opposite to "type 1 information", but in fact the same thing, an entropy
reducing message: "I'M A BINDING SITE", where the prior probability of
being a binding site is nGenes/nSites.

To avoid confusion I'll use upper-case to distinguish their use of
the term:

Type 1: Reduction of local uncertainty. They equate this type of
INFORMATION to the uncertainty in receiving a message, as measured by
Shannon's entropy. This is the same sense of the term I would use.

Type 2: Exploitable regularity. They see this as INFORMATION occurring
in the environment, such as food, that is exploitable by living things.
They view it as the exact opposite of Type 1 because, paradoxically,
its orderly physical structure has relatively low information content.
For example, when food is ingested, energy and physical order are
transferred to the organism, increasing its own physical order and
reducing its information-content. I think this is a confounding of two
distinct processes. First, the occurrence of food in the wild (against
a background of "random noise") is a relatively improbable event that
has relatively high information content and is likely to be remembered.
Second, the ingestion of food and the associated reduction of
information is an interpretation of the reduced thermodynamic entropy
associated with order and available energy.

Type 3: Data to be manipulated or transformed. This is the INFORMATION
of cognitive psychology and the information-processing view of
cognition. They make a clear distinction between Type 1 where
INFORMATION "was something contained in the signal" and the present
type, where INFORMATION "is the signal itself". I think this type
of INFORMATION is just a form of representation.

Just one observation - rhodopsin changes state either through
photoisomerization, or through spontaneous thermal isomerization.

Type 4: Aboutness. This category of INFORMATION admits intentionality,
such as when a representation is about something. I don't have much use
for this type.

Well, all information is about something in the sense that it is information
by virtue of reducing the entropy of some distribution. That's what it's
all about (Mr. Natural).

A silly paper (in my opinion, of course).

Anyway, Type 2 seems to correspond to your conception of the
information "reduction" that occurs with learning. Is that what you
mean?

The reduction in the entropy of the machine's behavior is due to its gain of
information about the probable consequences of its behavior. Seems simple
enough.

-- Michael



