Re: Pigeons, People, and Priors
- From: Michael Olea <oleaj@xxxxxxxxxxxxx>
- Date: Wed, 05 Apr 2006 05:57:58 GMT
Glen M. Sizemore wrote:
MO: I also noticed, you probably thought of this already, that if you let
the variance of the probability generator go to zero you have a continuum
between each of FT-VT, FI-VI, and FR-VR schedules.
GS: Yes, but I don't use a probability generator. I use the same list of
values that approximate a constant probability output (in the modern era,
this is usually a list from which a computer draws without replacement).
This is somewhat standard, for better or worse. One of the things that it
does is make certain variables less variable day-to-day. The reinforcement
rate, for example, under a VI 60 s schedule is very stable, while that for
a random-interval (RI) 60 s schedule is not.
I'm unclear about a couple things, here. First, what is the difference
between a VI schedule and an RI schedule? The only distinction Catania
makes (circa 1984) is that the former are generated by hand (e.g. punching
holes in a tape), and the latter by computer. Second, I'm not sure what you
mean by "a constant probability output". Are you talking about a continuous
"rectangular distribution", i.e. event times uniformly distributed over
some interval [a, b], the distribution with pdf(t) = 1/(b-a); a <= t <= b?
That is one of the distributions I'm mulling over in thinking about VI
schedules. The other main one (a canonical choice for modeling event
times) is the exponential distribution: pdf(t) = lambda * e^-(lambda*t),
where lambda > 0. The former is symmetric and "flat", the latter skewed and
"peaky". Some summary stats:
The Exponential Distribution
============================
pdf: lambda * e^-(lambda * t)
cdf: 1 - e^-(lambda * t)
mean: 1/lambda
median: ln(2)/lambda = ~0.69*mean
mode: 0
variance: 1/lambda^2
skewness: 2
excess kurtosis: 6
entropy: 1 - ln(lambda)
P(a <= t <= b): e^-(lambda * a) - e^-(lambda * b)
P(mean +/- stdv): ~0.86
P(t < mean): ~0.63
P(t > mean): ~0.37
The Uniform Rectangular Distribution
====================================
pdf: 1/(b-a); a <= t <= b
cdf: (t-a)/(b-a)
mean: (a+b)/2
median: (a+b)/2
mode: all values in [a, b]
variance: ((b-a)^2)/12
skewness: 0
excess kurtosis: -6/5
entropy: ln(b - a)
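The summary values in the two tables can be checked numerically. The following is a minimal sketch using only closed-form expressions; the function names and the 60 s example parameters are illustrative assumptions, not part of the original exchange. Entropy is in nats.

```python
import math

def exponential_summary(lam):
    """Closed-form summary values for an exponential distribution with rate lam > 0."""
    mean = 1.0 / lam
    stdv = 1.0 / lam  # for the exponential, the standard deviation equals the mean

    def cdf(t):
        # The distribution has no mass below zero.
        return 1.0 - math.exp(-lam * t) if t > 0 else 0.0

    return {
        "mean": mean,
        "median": math.log(2) / lam,
        "variance": 1.0 / lam ** 2,
        "entropy_nats": 1.0 - math.log(lam),
        "p_within_one_stdv": cdf(mean + stdv) - cdf(mean - stdv),
        "p_below_mean": cdf(mean),
    }

def uniform_summary(a, b):
    """Closed-form summary values for a uniform distribution on [a, b], a < b."""
    width = b - a
    return {
        "mean": (a + b) / 2.0,
        "median": (a + b) / 2.0,
        "variance": width ** 2 / 12.0,
        "entropy_nats": math.log(width),
    }

if __name__ == "__main__":
    # Hypothetical 60 s schedules: exponential with mean 60, uniform on [0, 120].
    for name, stats in (("exponential", exponential_summary(1.0 / 60.0)),
                        ("uniform", uniform_summary(0.0, 120.0))):
        print(name, {k: round(v, 3) for k, v in stats.items()})
```

With the same mean of 60 s, the exponential has variance 3600 s^2 while the uniform on [0, 120] has variance 1200 s^2, which is one concrete way the two candidate VI distributions differ beyond the base rate.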
Now - what have you got that will help me predict what various measures
of responding (especially rate of response) will be as these parameters
are changed? Maybe we should start with postdictions. What have you got
concerning, say, VI and VR response-rate differences? Then we can talk
about PREdictions. And I assume we're currently talking about
stable-states?
MO: It took a while to sort the issues out. Learning theory on its own
does not answer these questions.
GS: Sorry? I don't understand what you mean.
I probably should have used the phrase "statistical learning theory" rather
than "learning theory" throughout. The reason I didn't is that I had in
mind fairly recent results on "complexity" and "predictive information"
that have roots outside traditional SLT, roots in dynamical systems theory
(e.g. chaos), statistical mechanics, and information theory. But these
results have deep connections with SLT and direct bearing on the sorts of
questions SLT seeks to address.
Those questions are largely about the limits of prediction accuracy as a
function of number of observations, how those limits depend on properties
of the process to be predicted and properties of an abstract learning
machine doing the predicting, under a variety of different sets of
assumptions (e.g. stationary vs non-stationary processes). Earlier I talked
about a particular category of learning machine, the supervised learning of
a classifier from a training set, for example a digit recognizer. In this
case the "prediction" is the class label of a digit given a bitmap. The
learning algorithm in effect estimates the conditional distribution
P(label|bitmap) from a set of (bitmap, label) pairs. Once it's "trained"
its "learning" stops.
Here, when I say "(statistical) learning theory" I am talking about the
theory of a different category of learning machine, "unsupervised learning"
of probability distributions. Here there is no training set. Instead these
machines estimate probability distributions from streams of events, and
they continue updating their estimates as long as they are in operation.
The accuracy of these machines is not measured directly by how well they
predict events, but by how well they have learned whatever probability
distribution governs those events.
In the case of a schedule of reinforcement I'm thinking of it as a stream of
triples (context, response, consequence) drawn from some joint probability
distribution more or less stationary over some portion of an experiment.
The learning problem I had in mind is the estimation of the conditional
distribution P(consequence|context,response). The theory I had in mind is
about how quickly and to within what accuracy the estimates of different
kinds of distribution learners converge to the true distribution, depending
on the "complexity" of the distribution - a fundamental limit.
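As an illustration of the kind of distribution learner described here, the following is a minimal sketch of a running estimate of P(consequence|context,response) built from counts over a stream of triples. The class name, the additive pseudocount smoothing, and the example labels are assumptions made for illustration; this is not any of the specific abstract learning machines from the complexity results mentioned above.

```python
class ConditionalEstimator:
    """Running estimate of P(consequence | context, response) from a stream of triples."""

    def __init__(self, consequences, pseudocount=1.0):
        # The set of possible consequences is assumed to be known in advance.
        self.consequences = list(consequences)
        self.pseudocount = pseudocount
        self.counts = {}  # (context, response) -> {consequence: count}

    def update(self, context, response, consequence):
        """Fold one observed triple into the counts; can be called indefinitely."""
        bucket = self.counts.setdefault((context, response), {})
        bucket[consequence] = bucket.get(consequence, 0.0) + 1.0

    def estimate(self, context, response):
        """Return the smoothed conditional distribution over consequences."""
        seen = self.counts.get((context, response), {})
        total = sum(seen.values()) + self.pseudocount * len(self.consequences)
        return {c: (seen.get(c, 0.0) + self.pseudocount) / total
                for c in self.consequences}


if __name__ == "__main__":
    learner = ConditionalEstimator(["food", "none"])
    stream = [("light", "peck", "food")] * 3 + [("light", "peck", "none")]
    for triple in stream:
        learner.update(*triple)
    print(learner.estimate("light", "peck"))
```

The estimator never stops updating, which matches the "no training set" framing; how fast such estimates converge is the question the theory addresses, and this sketch says nothing about that.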
None of this has any direct bearing on rates of responding. "Decision
theory" (also called "Bayesian decision theory", or sometimes "utility
theory") - cost/benefit analysis - is a better fit. If, for example,
responses have zero cost and consequences have some positive benefit then
almost independently of the schedule in effect and almost independently of
any estimate of P(consequence|context,response) the optimal policy is to
emit responses as fast as possible. So cost/benefit issues have to be a
part of any model of response strategies.
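The zero-cost argument can be made concrete under strong simplifying assumptions. The sketch below assumes an interval schedule that arms a reinforcer after an exponentially distributed time with rate lam, restarting once the reinforcer is collected, and responses emitted as a Poisson process with rate r. Under those assumptions the mean time between collected reinforcers is 1/lam + 1/r. The linear cost per response and all function names are hypothetical choices for illustration, not a model taken from the discussion.

```python
import math

def obtained_rate(lam, r):
    """Reinforcers collected per unit time: 1 / (1/lam + 1/r)."""
    if r <= 0:
        return 0.0
    return lam * r / (lam + r)

def net_gain_rate(lam, r, benefit, cost):
    """Benefit per reinforcer times obtained rate, minus cost per response times r."""
    return benefit * obtained_rate(lam, r) - cost * r

def best_response_rate(lam, benefit, cost):
    """Response rate maximizing net_gain_rate under the assumptions above.

    Setting the derivative benefit * lam**2 / (lam + r)**2 - cost to zero
    gives r = lam * (sqrt(benefit / cost) - 1), clipped at zero.
    """
    if cost <= 0:
        # With free responses the gain increases with r without limit.
        return math.inf
    return max(0.0, lam * (math.sqrt(benefit / cost) - 1.0))

if __name__ == "__main__":
    lam = 1.0 / 60.0  # one reinforcer armed per 60 s on average
    print(best_response_rate(lam, benefit=1.0, cost=0.0))
    print(best_response_rate(lam, benefit=1.0, cost=0.0001))
```

With zero cost the optimum is unbounded, which is the "respond as fast as possible" case; any positive cost produces a finite optimum, so the predicted rate depends entirely on the assumed cost and benefit.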
MO: Learning theory in conjunction with utility theory
can be made to pop out definite predictions, but they are dubious on a
couple of counts - one due to some missing data. So, from learning theory
the following is easily proved:
All the schedules of reinforcement in your parameter space belong to the
same complexity class (unless the probability generator is really exotic,
a possibility I will discount) - they have different absolute
complexities, but these are differences of degree, not of kind. They all
have finite (as opposed to divergent) predictive information, bounded by
some constant. Different constants for different schedules, but all in the
class of finite predictive information event streams, so all essentially
of the same degree of inherently low difficulty learning problems.
GS: Sorry, I don't understand any of this - could you start with a
dumbed-down version and get progressively more complex. BTW, what the hell
is "leaning theory"?
"Learning theory" was a poorly chosen phrase for "statistical learning
theory augmented by cool new results from elsewhere".
The "complexity" of a time series, at least the notion of complexity used
here, is a measure of how long it has to be observed before it can be
"understood". A highly regular process, say a periodic process like the
orbit of a planet, has low complexity in the sense that its behavior can be
fully characterized - to any given level of precision - after only a few
observations. Its behavior is fully characterized by a small number of
parameters, and once you learn those there is nothing more to learn from
further observation (you can continue to refine the precision to which you
know the values of those parameters, but you aren't learning anything
fundamentally new about how it behaves, and for any given practical purpose
there is some finite level of precision needed - it takes a finite amount
of information, which you get from a finite number of observations, to
learn those parameters to a given precision). The longer the orbital period
the more observations it takes to achieve a given level of precision, so
the complexity is higher, but still in the realm of finite observations
needed for a given precision.
At the other extreme from a completely regular process is a completely
random process. An example is a sequence of outcomes from a "fair coin".
This too has low complexity in the sense used here - its behavior is fully
characterized by a single parameter P(heads), the probability of the
outcome heads (here equal to 1/2). Again it takes a finite amount of
observation to learn this parameter to any given level of precision.
In the first case, the completely regular case, the behavior of the time
series in the future (e.g. position vector as a function of time) is fully
predictable from a single observation at any time in the past, once the
parameters of its trajectory through spacetime are known. Suppose the 3
spatial coordinates and time are each specified to 32 bits of precision, a
total of 128 bits. Before making any observations and assuming all possible
2^128 values of these coordinates are a priori (prior to observation)
equally likely then the uncertainty in the spacetime coordinates of this
system is 128 bits. After observation the uncertainty is zero bits, the
information gained is 128 bits, so the "predictive information" the entire
past of the time series contains about the entire future of the time series
is 128 bits.
In the completely random case, a sequence of equally likely binary outcomes,
the past states of the sequence have zero predictive information about the
future states of the sequence - all sequences of length n are equally
likely (they each have probability 1/2^n) independently of past states of
the sequence.
Stochastic processes that are neither fully regular nor fully random are
more interesting, more "complex", and harder to learn in the sense of
requiring more observations to characterize their behavior. In a
first-order Markov process the probability of the next state is fully determined
by the current state. In a k-th order Markov process the probability of the
next state depends on the k previous states. There is progressively more to
learn (quantifiable in bits) to characterize the behavior of such systems,
and progressively more information (quantifiable in bits) the past behavior
of the system provides about its future behavior. The dependence of the
future behavior of a stochastic process on its past need not have a sharp
discrete cutoff. It can diminish smoothly, exponentially for example. The
decay constant of the exponential sets an effective correlation time for
the process - at some point the influence of its past behavior (and hence
its contribution to predictive information) drops below the resolution with
which the process can be observed. That is the time scale over which the
behavior of the process becomes essentially uncorrelated. This is all still
in the realm of processes with finite predictive information the past
provides about the future, finite information needed to characterize the
behavior of the process, which can be gleaned from finite observation.
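For the simplest finite case, the predictive information can be computed directly. For a stationary first-order Markov chain, the mutual information between the whole past and the whole future reduces to the mutual information between two successive states, because the current state screens off everything earlier. The sketch below does this for a two-state chain; the function names and the parameterization by the two switching probabilities are assumptions for illustration.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def markov_predictive_information(p01, p10):
    """I(X_t; X_{t+1}) in bits for a stationary two-state chain.

    p01 is P(next = 1 | current = 0); p10 is P(next = 0 | current = 1).
    """
    if p01 + p10 <= 0:
        raise ValueError("at least one switching probability must be positive")
    # Stationary distribution of the chain.
    pi0 = p10 / (p01 + p10)
    pi1 = p01 / (p01 + p10)
    h_next = entropy_bits([pi0, pi1])
    h_next_given_current = (pi0 * entropy_bits([1.0 - p01, p01])
                            + pi1 * entropy_bits([p10, 1.0 - p10]))
    return h_next - h_next_given_current

if __name__ == "__main__":
    print(markov_predictive_information(0.5, 0.5))  # fair coin
    print(markov_predictive_information(1.0, 1.0))  # strict alternation
    print(markov_predictive_information(0.1, 0.1))  # "sticky" chain
```

The fair coin gives zero bits, strict alternation gives exactly one bit (the phase), and intermediate chains fall in between, all bounded by a constant as described above.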
Then there are stochastic processes whose behavior depends to a measurable
extent (for any given observational resolution) on their entire past, their
whole history. Such processes have infinite correlation times. The
predictive information the past provides about the future of such a process
is not finite, not bounded by some constant, but grows without bound with
continued observation (to put it another way, the more you know about the
history of such a process the more you can predict about its future). In
this case the predictive information is "divergent" in the calculus sense
of the term - an infinite series that does not converge to a finite limit
but diverges. The transition from finite to divergent predictive
information, from finite to infinite correlation times (or lengths or other
measures of scale) is seen in physics associated with some phase
transitions (e.g. in the onset of magnetism). I have been discussing
stochastic processes as time series, i.e. processes where the independent
variable is time, but other independent variables are possible (e.g.
length, position on a surface, elements from some vector space, or even
functions in some function space, for example sound waves or light...).
Anyway, predictive information can be divergent rather than finite.
When the predictive information of a stochastic process is divergent it
turns out there are only two possibilities: it either grows as a
logarithmic function of the history of the process or as a power with a
positive fractional exponent less than 1 (e.g. a*t^(1/2)). In both of these
cases the rate of growth of predictive information is sublinear (grows more
slowly than a linear function). On the other hand the entropy of a
stochastic process (i.e. its total information) is extensive, i.e. linear:
double the length of a sequence and its entropy doubles (technically only
true in the limit as the length goes to infinity). So in all cases the
ratio of predictive information to total information goes to zero as the
length of a sequence (or the observation time of a continuous stochastic
process) goes to infinity. Diminishing returns. Predictive information is a
vanishing fraction of total information. In this sense most information is
"useless". Since all actions take some amount of time it is not the state
of the world as it is now but the state of the world as it will be at some
time in the future that is relevant to behavior - hence only the available
predictive information is useful information. And this can be quantified as
the reduction in entropy from that of the distribution P(consequences) to
that of the distribution P(consequences|context,response).
The three possibilities for the rate of growth of the predictive information
of a stochastic process, 1) bounded by some finite constant, 2) logarithmic
growth, or 3) power law growth, give rise to 3 "complexity classes". This
has implications and applications that extend well beyond statistical
learning theory, but much of the relevance to SLT is that each complexity
class gives rise to successively harder learning problems, and to each
complexity class there corresponds a class of abstract learning machines
"optimal" for its class of problems. The "wrong" machine, however, can be a
better choice if rapid "good enough" approximations are more important than
more accurate and detailed approximations that take longer - especially for
nonstationary distributions with relatively rapidly changing statistics.
All schedules of reinforcement in the parametric family you described come
from the class of stochastic processes of finite predictive information. If
we take the 3 parameters you described, allow another 3 to further
parameterize the probability generator (even though you don't use one the
resulting distribution of inter-event times has attributes other than the
base rate), and we specify 32 bits of precision for each one (almost
certainly overkill) then any schedule in the space is fully characterized
by at most 192 bits. That is the amount of information that an abstract
learning machine has to collect from a stream of triples
(context,response,consequence) to estimate the distribution
P(consequence|context,response) to what is almost certainly unwarranted
precision in that differences at that scale almost certainly do not make
discernible differences in the estimated value of
P(consequence|context,response).
But none of this has direct bearing on modeling response rates.
MO: That is there is nothing inherently complex in the event streams
themselves making them difficult to learn,[.]
GS: In what sense does an animal "learn an event stream"?
I wasn't speaking exclusively of animals, but Catania writes:
"The consequences of responding are critical to our understanding of
learning *not because learning follows from them but because they are what
is learned*." (The emphasis is Catania's, not mine.)
"The consequence of responding" can be formulated exactly and succinctly as
the conditional probability distribution P(consequences|context,response),
where each of the 3 terms is a random variable. That is in fact similar in
spirit to statements in Catania's book (though that specific formulation is
my own). This is what I meant by the statement I made a while back that
probably most and maybe even all learning boils down to unsupervised
learning of probability distributions. When consequences are independent of
responses, P(cons|cntx,resp) = P(cons|cntx); "consequences" is then a stimulus
and context is a preceding stimulus, e.g. P(thunder|lightning) or
P(food|tone). That is one category. Another case is:
P({accident,ticket,neither}|{red,yellow,green-light},{stop,drive-through})
i.e. stimulus control.
So in what sense is it reasonable to say an animal has learned a probability
distribution? I would say it is reasonable to the extent that the animal's
behavior is sensitive to features of the distribution - does it depend
only on the mean (and how much difference in the mean for a measurable
difference in behavior) or sensitive also to variance, kurtosis, and skew,
for example? The spiking behavior of the H1 neuron of blowflies is
sensitive in lawful ways to both the mean and variance of the horizontal
angular velocity of the wide-field background (normally due mostly but not
completely to self-motion, or due to stimulus presentation in experiments),
as well as the mean and variance of luminance contrasts in the visual
field, a sensitivity quantifiable in bits (the mutual information between
time-varying stimulus and spike trains) as is the flight behavior of the
whole organism (measured as horizontal torque as a function of time for
flies tethered to a torsion balance or deduced from frame by frame analysis
of animals in free flight), again the sensitivity quantifiable in bits of
mutual information between time-varying stimulus and time-varying response.
These distributions are non-stationary, P(consequence|stimulus,response)
being subject to abrupt changes on a scale of minutes. Flies adapt to these
changes in seconds. The spiking behavior of H1 is sensitive to changes in
stimulus statistics over several scales (at least seconds to hours).
By my comment about there being nothing inherently complex in the streams of
(context, response, consequence) tuple events generated by your schedule
space (which, by the way, is no criticism of the space) I just meant that
there are simple "laws" governing their behavior, albeit probabilistic ones,
easily discoverable by an analyst with a calculator and graph paper from
the tuple streams alone without knowledge of the generating schedule, using
standard techniques (e.g. histograms and the like). And that these "laws"
are also easily discovered on the fly as the events occur by any of the
abstract learning machines corresponding to any one of the 3 complexity
classes described above. The distributions generating the streams are not
*inherently* difficult to learn. The difficulty they present to a pigeon
experiencing them without access to a stopwatch and calculator is another
issue. That is what I meant.
MO: [.]requiring prolonged observation to find the
patterns, etc. Just the opposite. Everything there is to learn about them
as far as predicting them to the extent they can be predicted comes down
to a relatively few bits. All 3 abstract learning machines, from the
simplest to the most complex in the distribution complexity hierarchy
will, based strictly on their general properties, rapidly converge to the
same stable state, making predictions as accurate as possible. All this is
indisputable.
GS: In actual fact (though I can't be sure that I understand a word of
what you're saying) some transitions between stable-states take a very
long time (some not).
By "transitions between stable-states" do you mean a change of the schedule
of reinforcement, say from VI to VT? This sort of transition is a scenario
where the differences between the different kinds of abstract learning
machines become more pronounced (the "fluctuations" in probability
estimates after transitions are more distinctive than the learning curves).
The claim of indisputability above is about the behavior of abstract
machines, not animals.
MO: This on its own does not predict response rates but it's not hard to
construct a utility model with a response-cost function, an event-gain
function, and a policy, either max utility or probability matching, say.
Now response rates can be predicted - but these would be the response rates
of a machine with a policy, not much related to the response rates of an
animal.
GS: If I am understanding you, I can say that much of this has been talked
about beginning with Skinner's "reflex reserve" in 1938. I'm not saying it
is the wrong approach, it just couldn't match the complexity (ordinary
language) of schedule effects. Maximization has been discussed since the
formulation of the matching law in the '60s and continues today. I'm not
sure I know what "probability matching" is.
Bayesian decision theory has been around, under a variety of names, for a
long time, so I would be surprised if no one had used it to analyze
schedules of reinforcement (it has seen substantial use in economics, and
it is a critical component of the image analysis systems I've worked on).
I'm not surprised, if that is indeed what was tried, that it did not match
the complexity of effects of schedules. For one thing it is a normative
theory ("what would Bayes do?" - take whatever action maximizes expected
net gain), not a descriptive one. For another thing it is only provably
optimal (in terms of maximizing net gain) when the probability distribution
over consequences given actions ("decisions") is a) known to the Bayesian
agent, and b) stationary (there may be a formulation for non-stationary
distributions where the "distribution over distributions" is known and
stationary). Neither of these assumptions is likely to be true in general
of ecologically relevant distributions (in fact many naturally occurring
stochastic processes are likely to have logarithmically diverging predictive
information, so the probability distributions that generate them, even if
they are not changing, are only ever incompletely known).
"Probability matching" is a descriptive theory. I think it's just a
different name for something very similar to the "matching law" (based on a
quick glance in Catania). It just means rather than always taking the
action with the greatest expected payoff take actions with a frequency
proportional to expected payoffs. It is a suboptimal policy when the
conditions for max gain hold, but is probably a better policy (leads to
higher gain in the long run) under more realistic conditions. I have never
seen a rigorous derivation of that claim, but the rationale I've heard is
that it balances "exploitation" against "exploration", meaning it leads to
improved learning of P(consequences|response) and so greater expected gain
in the long term.
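The first half of that claim, that matching is suboptimal when the payoff probabilities are known and stationary, is easy to check for a two-alternative case. The sketch below is only that check; the function names and the example payoffs are assumptions, and it does not model the exploration benefit, for which no derivation is offered here.

```python
def maximizing_policy(expected_payoffs):
    """Always choose the action with the greatest expected payoff."""
    best = max(range(len(expected_payoffs)), key=lambda i: expected_payoffs[i])
    return [1.0 if i == best else 0.0 for i in range(len(expected_payoffs))]

def matching_policy(expected_payoffs):
    """Choose each action with frequency proportional to its expected payoff."""
    total = sum(expected_payoffs)
    return [payoff / total for payoff in expected_payoffs]

def expected_gain(policy, expected_payoffs):
    """Expected payoff per choice when actions are taken with the given frequencies."""
    return sum(freq * payoff for freq, payoff in zip(policy, expected_payoffs))

if __name__ == "__main__":
    payoffs = [0.7, 0.3]  # hypothetical known, stationary expected payoffs
    print(expected_gain(maximizing_policy(payoffs), payoffs))
    print(expected_gain(matching_policy(payoffs), payoffs))
```

For payoffs of 0.7 and 0.3, maximizing earns 0.7 per choice and matching earns 0.7*0.7 + 0.3*0.3 = 0.58, so any advantage for matching has to come from conditions this sketch leaves out, such as uncertain or changing payoffs.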
Despite all the caveats above I still think it is worth exploring max gain
models sensu stricto for schedules of reinforcement - much more so than I
did a couple of days ago. I have several reasons. For one thing there are
cases where they are a good match to experimental results (e.g. the
distribution of saccades, arm movement in sensorimotor learning - in these
cases though the utility models are very simple: minimize expected errors).
Secondly, even though it is an idealization it is a benchmark of best
possible results that I would expect to be approximated by real behavior at
least under some conditions, especially if instead of using the precisely
known probability distributions for the schedules as the basis for expected
gain "coarser" approximations are used, either on the basis of speculation,
or if possible from observed resolution and sensitivity of animals to
features of distributions. The idealized models are principled starting
points, to be systematically modified. Third, it is possible that some of
the failures to match of earlier efforts may have been due to overly simple
models. I say that only because detailed probability models quickly become
computationally intractable. The choice was often between a realistic model
for which no results could be calculated and a tractable model with gross
oversimplifications (this was the case in AI, which more or less abandoned
probabilistic methods for a while). However, extensive work on "Bayesian
networks" in the late 1980's vastly increased the space of tractable
probability models, and ongoing extensive work on methods like "particle
filtering", which yield good approximate computations of probabilities,
continues to vastly expand the space of tractable probability models. So
maybe more detailed models would do better. Finally, the deep, far-reaching,
and quite recent (circa 2000) results on complexity and predictive
information may well offer a principled approach to conditions of
optimality for matching law-like policies or even adaptive policies (in a
simpler vein, the spiking behavior of the H1 neuron interpolates between
policies optimized for high luminance contrast conditions and low luminance
contrast conditions, adapting to changing mean and variance in luminance
contrasts, remaining thereby sensitive to "maximally informative stimulus
dimensions", where in this case the problem is simple enough to show that
its behavior is optimal). If it is true that stochastic processes in
natural environments are most frequently logarithmically divergent, or
non-stationary, or both (the statistics of natural distributions being an
active area of research) then it is at least plausible that much of the
behavior of organisms would share some characteristics of the class of
abstract learning machines optimal for those conditions. Therein may lie a
principled approach to "balancing exploitation against exploration". Or
maybe not. It's worth a look.
I said "it's not hard to construct a utility model". That much is true, I
sketched one up last night. What is hard though is to derive its response
to events - nasty, nasty optimization problem I don't quite know yet how to
tackle.
MO: One problem is that there has been no allowance for resolution limits.
The learning machines count events, record event times, count responses
and record response times. In making predictions they have exact counts
and precise times available. Basically they recover the parameters that
drive the schedule, and since these are few the complexity is indeed low.
GS: Can't say I understand much of this.
I meant that these learning machines, as they learn distributions, are in
effect equipped with stopwatches and calculators, so they have no trouble
learning high fidelity versions of the true probability distributions
generated by the schedules, a luxury maybe not afforded animals. I am less
concerned about this now than I was a couple of days ago because the models
are an idealization, best possible results that serve as a point of
departure. These models could be systematically "coarsened". Animals'
behavior is certainly sensitive to some aspects of probability
distributions. A quick look at the "matching law" in Catania describes a
pigeon that emits about twice as many pecks on a VI30-sec key as a VI60-sec
key. So it responds differently to different means and it distinguishes
between a two to one time ratio - so at least 1 bit of resolution in time
over this range. The question is what other features of distributions
influence its behavior, and what is the resolution of differences it can
discriminate. My guess is that the latter is not fixed, but varies, within
limits of course, as such discriminations have consequences. One efficient
approach used in signal processing is a "coarse-to-fine" strategy - e.g.
subdivide a range into upper and lower halves, and as warranted subdivide
one or both of the halves into upper and lower halves, and so on down to a
maximum resolution. Only as much resolution as is needed to make
distinctions is actually used. Anyway, the models the machines learn need
to be blurred in some way, before they can serve as a basis of
utility-driven response models. It would be nice to do that from data.
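The coarse-to-fine idea can be sketched as a recursive halving of a time range. The rule used below for "as warranted" (split a bin only while it holds at least min_count samples, up to max_depth levels) is an assumption chosen for illustration; a criterion tied to behavioral consequences would be closer to what is suggested above.

```python
def coarse_to_fine_bins(samples, low, high, min_count=8, max_depth=4):
    """Return a list of (low, high, count) bins over the half-open range [low, high).

    A bin is split into lower and upper halves only if it still contains at
    least min_count samples and the depth budget is not exhausted.
    """
    inside = [x for x in samples if low <= x < high]
    if max_depth == 0 or len(inside) < min_count:
        return [(low, high, len(inside))]
    mid = (low + high) / 2.0
    return (coarse_to_fine_bins(inside, low, mid, min_count, max_depth - 1)
            + coarse_to_fine_bins(inside, mid, high, min_count, max_depth - 1))

if __name__ == "__main__":
    # Hypothetical inter-event times (s): many short ones, one long one.
    times = [1.0] * 10 + [50.0]
    for bin_low, bin_high, count in coarse_to_fine_bins(times, 0, 64,
                                                        min_count=4, max_depth=3):
        print(bin_low, bin_high, count)
```

Resolution ends up concentrated where the events are: the crowded short-interval region is subdivided down to the depth limit, while the sparse long-interval region stays as a single coarse bin.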
MO: It would take a lot of data I don't have to model realistic
constraints.
GS: Maybe (but I admit that I'm not sure I understand much of what you
say), but do you have any idea how vast the literature is? It is true,
however, that if you want the temporal locus of each event, there isn't a
lot of that. More people collect data that way now. Still, you might be
surprised about what is out there, and the analyses on time series that
have been done.
The "constraints" I had in mind are resolution limits on times and counts
(e.g. distinguishable VR schedules), and sensitivity to features of
distributions not limited to means, but including "higher central moments".
Ideal would be those results already reduced. Or it could be deduced from
detailed event records. Not a pressing issue, though since just setting up
the machinery that could make use of the data, reduced or raw, is in itself
a big job (and not my day job).
MO: The other problem is that I don't know a principled basis for the
utility function.
GS: Yeah, ummm, that's the issue all right. Sort of.
MO: Why max utility? Why probability matching?
GS: Yeah, ummm, that's the issue all right. Sort of.
I'm not as concerned about these as I was.
MO: None of this means animals are not behaving the way an optimal
abstract learning machine + some utility function would behave. It still
might be possible to make qualitative predictions not sensitive to
detailed assumptions. One possibility is qualitative features of
extinction...
GS: If you can even account for the shapes of functions not yet directly
obtained then you are doing well. That is the nature of
schedule-controlled behavior. Do you think that you're the first smart guy
to consider these issues? What do you have about extinction?
I'd be amazed if at least 99% of what I'm thinking about hasn't already been
investigated in depth. It doesn't matter, it is for my own edification
anyway. And since there is much overlap with my real work anyway
investigating these things has some net utility. On the other hand
"predictive information" is a new concept (though related to many earlier
similar ideas), and its implications for learning are I think not widely
appreciated yet.
On extinction I was just thinking that the different abstract learning
machines have characteristically different fluctuation behavior after
transitions from one probability distribution to another. So those
characteristics might be looked for in animal behavior. However, it is a
lot of work to get down to the nitty gritty on specifying in detail
fluctuation signatures. And for now, time permitting, my focus is on
utility models.
-- Michael