Re: Gradual Learning, not Reinforcement Learning
- From: curt@xxxxxxxx (Curt Welch)
- Date: 12 Sep 2006 05:29:39 GMT
"Lars" <LarsFiedler@xxxxxx> wrote:
Hello Curt,
I'm about 50 days late replying to this (I've got a lot of messages I've
fallen behind in replying to....)
you are absolutely right - using a new term ("goal") has no benefit. It
is just another description of a problem. And you ask the right
questions, that I also ask myself:
It seems to me you simply favor the idea of a goal directed learning
machine which is constantly trying to reduce the errors between its
current actions and its goal. But how does such a machine pick the
goals? How does it create sub-goals from the prime goals? How does it
know when a goal should be changed? Of all the goals it might have,
which would it be trying to reach at any moment in time?
So I will try to answer some of these questions. I am not sure about my
answers. And some answers will lead to further questions. But I think
it will lead us the right way. Of course this will be boring to you,
because you think goals can be realized as "reinforcement learning". So
I will explain afterwards what I am missing in your examples of
reinforcement learning.
I prefer the term "ideal" because it implies that there is a process of
getting nearer to the ideal. Maybe an ideal is a special kind of goal.
1.) How could an ideal be realized?
-----------------------------------------------------------
There are two physiological facts that inspired me:
- When a human being waits to do a simple action, e.g. pressing a
button when a certain signal appears, there are neurons in the
prefrontal cortex that fire steadily. It seems the human brain has
tension. Maybe this is a kind of goal or willing or whatever you call
it.
- There are steadily firing neurons in the brainstem, that seem to
represent the state of the body. If they do not fire anymore (because
of damage) the human being loses consciousness or is not awake
anymore. It seems the brainstem is like a constant fire that keeps our
brain in action. And maybe it leads us to actions that keep our body
alive and in homeostasis. (Antonio R. Damasio described this)
Or maybe, it's simply the "power source" that drives our actions - acting
in a way that's not much more interesting than the power supplies for our
computers.
This leads me to the following ideas:
- An ideal must be kept up for a while. So maybe an ideal can be
realized as constantly firing neurons.
Yes, I think that's reasonable and also highly likely to explain some of
our goals.
- An ideal must depend on the needs of the body. So maybe the needs of
the body is set by the constantly firing neurons that represent the
body. If the state of the body changes the ideal changes.
- A constantly firing body is a very different approach than the
input-process-output-Model.
No, not really. Many sensory neurons are constantly firing. This means
there's a constant flow of data into the system in the input-process-output
model. Most inputs used in these models act as though they are constantly
firing.
A system with a constantly firing body will
act - not only react.
My designs have always included an output system that can act on its own.
It's clear we are able to do this. We react to our environment by changing
our actions. There are many ways to implement this. However, the way I
think has the most value is to simply create a system where its
actions are fed back as additional sensory inputs. In other words, the
system is able to sense, and react to, its own actions.
For example, for a human to produce a walking motion, it must be able to
generate this repetitive pattern of leg and arm motions. It needs some
sort of central pattern generator.
If the system was only reacting to its external environment, think about how
hard it would be to learn to produce a pattern like this. We would have to
learn to take the first step, based on our current environment. But by
doing so, we have moved, and the environment has changed. So, now we have
to learn to take the second step, in this new environment. Once we learn
that, we can't "reuse" what we learned about the first step, because now we
are two steps away from where we started - yet another very different
environment. We would in effect, have to learn each step, in each new
location. We couldn't walk down a street we had never seen before, because
we hadn't learned that taking a step was a good thing to do in that
environment.
To produce a generic pattern of outputs, like that which is required for
walking, we must learn to react to our own actions. If we can sense our own
actions, then we can learn to react to our own actions. If we sensed that
we were stepping forward with our right leg, when we sense it has reached a
limit, we can then react to that by stepping forward with our left leg.
When we sense that has reached its limit, we can react to that by moving
our right leg forward. A small set of reactions to our own actions can
then create cyclic behavior patterns (like walking) that are independent
of our external environment. It's an output pattern that can be triggered
in any environment - including walking down a street we have never seen
before.
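To make the walking example concrete, here is a minimal Python sketch. The signal names and the two lookup tables are hypothetical, invented only for illustration. The point is just that a couple of reactions to the system's own sensed actions are enough to keep a cycle going with no external input at all.

```python
# Hypothetical reaction rules: a sensed "own action" event triggers the next action.
REACTIONS = {
    "right_leg_at_limit": "step_left",
    "left_leg_at_limit": "step_right",
}

# Hypothetical body model: what the system senses after performing an action.
SENSED_AFTER = {
    "step_right": "right_leg_at_limit",
    "step_left": "left_leg_at_limit",
}


def walk(first_action, steps):
    """Produce a gait by feeding each action's sensed result back in as input."""
    actions = []
    action = first_action
    for _ in range(steps):
        actions.append(action)
        sensed = SENSED_AFTER[action]  # the system senses its own action
        action = REACTIONS[sensed]     # and reacts to what it sensed
    return actions


gait = walk("step_right", 6)
print(gait)
```

Nothing in the loop refers to the external environment, so the same pattern would run on a street the system has never seen.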
Our reactions to external stimuli are then used in parallel with all our
internal reactions to override the behaviors when needed. Reactions to our
sense of balance are used to fine-tune the walking patterns. Reactions to
the objects around us make us stop walking, or take the actions needed to
turn to the right or left, or speed up, or slow down.
This is how you make a reaction machine act on its own. It does so by
reacting to its own actions.
I also suspect this is how the brain is structured. If the neocortex is
the same basic structure for the whole brain, what makes the motor cortex
different from the sensory cortex? Why is it that the motor cortex seems
to be performing "motor functions"? I think it's because both halves are
in fact sensory reaction systems. One half is reacting to external sensory
signals, and the other half (the motor cortex) is wired to react to the
brain's own actions.
- I think human brains are not only designed to keep the body alive,
which means to get food and so on. I think there must be a design that
gives us pleasure that has only an intellectual reason not a bodily
reason.
Why? What is the evolutionary pressure that exists to justify the creation
of "happiness" in the human brain? And more to the point, what is
happiness?
I can answer these questions in the framework of a reinforcement learning
machine easily. I don't know how to answer them in the framework of a goal
directed machine.
Such a design makes us happy if we predict something.
The value of prediction is obvious for survival. But what does "happy"
mean? You seem to have defined it as a correct prediction with the above
comment. But where does that get us? I predict that if I reach over and
move my computer mouse, the cursor on the screen will move. I predict that
if I press these keys on the keyboard, that letters will show up on the
screen. We make millions of such little predictions for everything we do.
Yet, these things don't seem to make me happy. So it doesn't seem valid to
me to try and link the concept of happiness with a correct prediction.
With
"predict" I mean that the human being has an expectation (realized
similarly to an ideal) that comes true. E.g. a child presses a
light switch and is happy that the light goes on - what he expected. But
let us keep things as simple as possible and let us see how far we come
with "only" food.
- "Subgoals": I do not know yet. Maybe somehow ideals can be
agglomerated to one ideal and the one ideal can be divided into his
elements as some chars can be agglomerated to a word and the word can
be divided into chars. (Ok, this idea is confusing and not on the
design layer.)
Or, connected with that, what are the prime goals and how are they
implemented?
As an educated adult, I can talk about how I'm hungry and my goal is to
find food to eat for dinner tonight. And I can analyze my thoughts that
might end up driving my actions that lead to me getting food (this is
making me hungry :)).
But what about a newborn baby? They seem to have food goals, yet
clearly they have not yet learned to talk and think about what they are
doing like an adult can. They don't have the skills to put together a plan
for getting food. Their skills are limited to what was hard wired into
them at birth - which for example includes the ability to swallow to get
food from the mouth to the stomach - or cry as a way of getting mother to
provide some food. But a baby quickly learns new skills he was not
born with, like the ability to grab a tit and put it in his mouth.
How is this skill learned in the context of everything being goal directed?
How is the goal of getting-tit-in-mouth created? There is no indication we
were born with that as a hard wired goal. Babies don't seem to know what a
tit is at birth. They know how to suck and swallow, but they don't seem to
know what a tit is. Only after exposure to a tit (or a bottle) does the
baby seem to learn that these things are "good", and only after exposure
does the baby form these tit-in-mouth goals.
These are easy to explain in terms of reinforcement, but I don't know how to
explain them in a strict goal directed view. In terms of reinforcement,
the value of the tit is learned by reinforcement. Good stuff happens when
sucking on a tit, so then tit sucking (as opposed to sucking in general -
such as toe sucking) is reinforced as a good behavior. And the tit itself
becomes a predictor (a secondary reinforcer) of good things to come.
Acting as a secondary reinforcer, that sensory signal helps to shape our
grab-tit-and-suck behaviors.
- Economy: There is a general problem in neural networks. How can we
achieve that not all neurons fire at once and that at least one neuron
fires. I call this the "problem of economy". Maybe the solution is
something like this:
There is a special area that usually does not fire immediately to other
areas. And there is a global parameter that rises during a second. This
parameter supports the neurons until one representation is strong
enough to fire into another area and e.g. cause an action. This
mechanism could be something human as we say sometimes: "Just a second,
it comes, I will remember it!". I have not heard of such a parameter in
human brains. But this problem cannot be solved functionally. A functional
solution would mean that a set of neurons must inhibit all other neurons.
This design would need too many connections.
There is a solution to this which I've used in many of my earlier network
designs. It works by adding global activity regulation to the system. The
system is a learning system already, meaning the weights are constantly
being adjusted to regulate which nodes fire, and when. But if the neurons
don't have any understanding of when other neurons are firing, how can they
learn to take turns and not all speak at once? The answer is that the
learning rules must have some knowledge of global activity and must work to
prevent everyone from talking at the same time. One simple way to
implement this is to track global network activity (how many neurons fired
recently), and bias the learning system to push this activity level
towards some central norm. When the network becomes too active, all the
learning is biased in the direction that reduces the odds of nodes firing, and
when the network becomes too quiet, all learning is biased in the direction
that makes the network more active.
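A rough Python sketch of that kind of global activity regulation follows. It is a simplification under stated assumptions, not the code from any of my networks: the function name `regulate`, the fixed per-unit `drives`, and the single shared `bias` (standing in for a bias applied to every unit's learning rule) are all made up for the example.

```python
def regulate(drives, target=0.2, rate=0.05, steps=200):
    """Shift one shared bias until roughly `target` of the units fire.

    `drives` are fixed per-unit input levels (an assumption of this sketch).
    A unit fires when its drive plus the shared bias is above zero.
    """
    bias = 0.0
    history = []
    for _ in range(steps):
        fired = [d + bias > 0.0 for d in drives]
        activity = sum(fired) / len(drives)  # fraction of units that fired
        history.append(activity)
        # Too active: push the bias down. Too quiet: push it up.
        bias += rate * (target - activity)
    return bias, history


drives = [i / 10.0 for i in range(1, 11)]  # 0.1 .. 1.0, so all fire at first
bias, history = regulate(drives)
print(history[0], history[-1])
```

With these numbers every unit fires on the first step, and the shared bias then drifts down until only the two most strongly driven units are still firing.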
This type of global activity bias, when added on top of any other learning,
solves your economy problem. It would be easy to believe the brain used a
similar system that made learning a partial function of total neuron
activity. Since it takes energy to make a neuron fire, which flows to the
neurons through a shared blood source, one way to bias the learning rules
would be for them to simply sense the chemical levels used to
My latest networks solved it in a much simpler way. I switched to a pulse
sorting paradigm instead of a node firing paradigm. By doing this, network
activity is held constant by design.
- There is another problem (that you did not mention): I call it the
"problem of sharpness". I think we agree that in the human brain a set
of neurons represents a certain perception. The neurons in the brain
always have an actual state, that were set from previous perceptions.
Yes, such as the leaky integrate and fire model always has some internal
activation level which is constantly changing and which changes as a result
of other neuron activations.
And maybe at some points different kinds of perceptions (seeing,
hearing) collide.
Collide? They fuse to form new perceptions but I don't understand what you
mean by collide.
But we always have one thought - which is sharp but
not exactly sharp. - I have no general solution to this problem yet.
I don't understand what the problem is. I don't have one "thought" in my
head. There are many things happening, such as I am producing thoughts of
the words as I type them. Sometimes that changes to spelling out a word.
At the same time, I hear the keys clicking, I feel the touch of my fingers
on the keyboard. I hear the fan in my computer. I see things changing on
the screen. I hear noise from the family in the rest of the house. All
these things are "thoughts" which happen in parallel in my brain. What's
so "sharp" about all this buzzing going on in my brain?
Why all this difficult stuff, when there is the easier solution of
reinforcement learning?
What difficult stuff? What's easier?
2.) Why is an ideal necessary?
Curt, if I misunderstood you, please tell me where to find a
description of your system at the design layer.
Only in my posts here. And since I post a lot of stuff that is not about
my ideas about how an intelligent system could be structured, it would be
hard for you to find related posts. It's too hard for me to even find them.
:)
I have not read all the
"DOHs" in this thread :-). As I understood it you think about a system
that has the input-processing-output-model - maybe with feedback loops
but no constantly firing neurons as I described above.
Yeah, actually, I do tend to include constantly firing neurons.
I've been looking at pulse sorting networks that work in software much like
a decision tree. Except it's a network instead of a tree. This system has
many interesting properties, but it's still lacking some powers so there's
work to be done. But yet, it can give you some insight into how I think
things need to work.
With this design, I'm basically using async pulse signals instead of some
more traditional synchronous network where all nodes calculate a new output
value for each clock cycle. And in my typical implementations, I force the
network to process only one pulse at a time. So there is never more than
one pulse in the network at a time. The nodes in the network I've played
with have two outputs, and for each pulse they receive, they must make a
sorting decision, and decide which output to send the pulse down. Each
pulse enters the net on some input path, gets sorted through some path,
and reaches some output.
So, all the intelligence is in the decisions each node makes about how to
sort each pulse it receives. This is a simple reaction system where the
behavior of the nodes is trained by reinforcement learning.
Now, as I talked about above, for a reaction machine to "act" and not just
"react" as you put it, something more is needed. A feedback loop, so that
all outputs of the network are fed back as inputs, allows this to happen.
The network can then learn to react to its own actions and in doing so,
produce any type of complex output patterns.
However, in this pulse sorting net, if I feed a pulse back into the net for
every one that came out, the pulse would be stuck in the net in an
infinite loop. To solve that, the output of the network is used to control
pulse generators, and the output of the pulse generators is what gets fed
back into the network instead of the control pulses.
There are different types of pulse generators I've played with, but one
example is a node that constantly fires at a fixed rate. It's got two
input control paths where it receives pulses from the network. One path
makes the pulse generator reduce its firing rate, and the other makes it
increase its firing rate. This node then has an internal state which is
the firing rate (pulses per second) which it maintains, and control pulses
received from the network make it increase or decrease this internal rate.
These pulse generators are the real outputs of the system, but their
behavior is regulated by the pulse sorting network. And every pulse that
is created by the pulse generator gets duplicated, with one copy being fed
back into the pulse sorting network to allow it to learn to react to what
the system has been doing.
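As an illustration only, a pulse generator of this kind could be sketched in Python as below. The class and method names, the step size, and the rate limits are assumptions added for the example. The description above only fixes the two control paths, the internal rate, and the duplicated output.

```python
class PulseGenerator:
    """Fires at an internal rate; control pulses nudge the rate up or down."""

    def __init__(self, rate_hz=10.0, step_hz=1.0, min_hz=1.0, max_hz=100.0):
        self.rate_hz = rate_hz  # internal state: pulses per second
        self.step_hz = step_hz  # change per control pulse (assumed value)
        self.min_hz = min_hz    # clamping is an assumption of this sketch
        self.max_hz = max_hz

    def speed_up(self):
        """Control pulse arriving on the 'increase' path."""
        self.rate_hz = min(self.max_hz, self.rate_hz + self.step_hz)

    def slow_down(self):
        """Control pulse arriving on the 'reduce' path."""
        self.rate_hz = max(self.min_hz, self.rate_hz - self.step_hz)

    def run(self, seconds):
        """Return (output_pulses, feedback_pulses) as lists of pulse times."""
        count = int(self.rate_hz * seconds)
        interval = 1.0 / self.rate_hz
        times = [(i + 1) * interval for i in range(count)]
        # Every pulse is duplicated: one copy is the real output,
        # the other is fed back into the sorting network.
        return times, list(times)


gen = PulseGenerator(rate_hz=10.0)
gen.speed_up()
gen.speed_up()
gen.slow_down()
output, feedback = gen.run(1.0)
print(gen.rate_hz, len(output))
```

The control pulses never re-enter the network themselves. Only the generator's own pulses do, which is what avoids the infinite loop.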
I ask you similar questions that you have already answered at the
psychological layer but not at the design layer:
- How does such a machine pick the estimation rules? - One rule could
have the aim to reproduce the input as an output like a parrot.
No, it does nothing even close to that. That is not a rule or goal at all.
But
there must be other estimation rules for an intelligent system.
- How does it know when the estimation rule should be changed?
- Of all the estimation rules it might have, which would it be trying
to reach at any moment in time?
- How does it create sub-estimation-rules from the prime estimation
rules?
The intent is for it to work like this...
The machine is a reaction system. By its design, it is forced to make a
reaction to every input pulse. Every sensory input pulse must be sent
somewhere by the network. The only issue is whether the current set of
reaction rules creates useful behavior for the system - you can generally
assume the answer is probably no at the start. The value of all the
current behaviors is evaluated with the help of reinforcement learning.
The only behavior this system has is pulse sorting. Each node is an
independent behavior machine, which is trained by reinforcement. The only
information the nodes have to work with is the pulses which are sent to
them, and the times when they show up. This type of machine is very
much a temporal processing machine because all the decisions about how to
sort each pulse are based on the temporal memory of each node.
The node design I've been looking at for a few years had only one input
path (which was typically a merger of outputs from two other nodes). And
the only thing it used to base its sorting decision on was the amount of
time that had elapsed since the last pulse showed up. This does a lot, but
I've since decided this is not smart enough because it can't make sorting
decisions, for example, that are based on which previous node that pulse came
from - and I've decided that is something which it needs to be able to do.
However, ignoring that, I can explain the old design to give you a flavor
of what I'm thinking. With that, each node maintained an internal time
value which is what it used to make all its sorting decisions. A pulse
that showed up quicker than that would get sorted out the high frequency
output, and all pulses that showed up later than the time limit would get
sorted out the low frequency side. These nodes can be looked at as
frequency sorting nodes because with a constant low frequency input, the
pulses all go out one way, and with a constant high frequency input, they
all go out the other way. But, with complex noisy signals, some pulses go
one way, and others go the other way. The density of pulses going out each
side is just a function of the signal fed it, and of the internal setting
of the pulse sorting reference value.
By default, these nodes have a learning rule which causes the internal
sorting value to seek out a value that will cause, on the long term, an
equal number of pulses to be sorted out each side. So by default, the node
will split the signal in half.
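Here is a small Python sketch of such a frequency sorting node with the default balancing rule. The update rule (nudge the reference value by a fixed step against whichever side the pulse just went out) is one assumed way to get an even split, not necessarily the rule used in the real nodes. The name `SortingNode`, the step size, and the uniform random test signal are likewise made up for the example.

```python
import random


class SortingNode:
    """Sorts each pulse by the time gap since the previous pulse."""

    def __init__(self, threshold=1.0, step=0.01):
        self.threshold = threshold  # internal reference time value, in seconds
        self.step = step            # size of each balancing nudge (assumed)

    def sort(self, gap):
        """Route one pulse and nudge the reference value toward an even split."""
        if gap < self.threshold:
            # Arrived quickly: high frequency side. Make "high" slightly rarer.
            self.threshold -= self.step
            return "high"
        # Arrived late: low frequency side. Make "low" slightly rarer.
        self.threshold += self.step
        return "low"


rng = random.Random(0)
node = SortingNode(threshold=5.0)  # deliberately badly set at the start
counts = {"high": 0, "low": 0}
for i in range(20000):
    gap = rng.uniform(0.0, 2.0)    # a noisy input signal
    side = node.sort(gap)
    if i >= 10000:                 # only count after it has settled
        counts[side] += 1
print(node.threshold, counts)
```

With equal up and down steps the reference value drifts toward the median gap of the input, which is the value that sends half the pulses out each side.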
If for example, you feed one of these nodes from a light detector which is
configured to fire faster for brighter lights, you can then look at what
the outputs of this node would mean. One output would mean "bright light",
and the other output would mean "dim light". So the node is acting as a
pulse classifier. It's sorting the pulses which mean "bright light" out one
side, and the pulses which mean "dim light" out the other side. The default
behavior of the node is to set the split between bright and dim right in the
middle, so that half the pulses from the sensors are classified as bright
light pulses, and half are classified as dim light pulses.
If the network had two light sensor inputs, each of those signals would get
split in half, and then two of those signals would be joined back together.
So the resulting signal, after the joining, would have a logical meaning
something like, "bright light from sensor 1 OR dim light from sensor 2".
In the end, every output function from the network, is some very complex
combination of all the sensory signals after they have been split apart,
and combined back together again, in many different combinations.
The reinforcement learning problem, is to change the definition of those
mapping functions, to make the output reactions, more useful than they are
when the network first starts. But at all times, the outputs are some
function, of the inputs.
Like all reinforcement learning machines, it must have a critic which is
fixed hardware for generating reward signals by monitoring various aspects
of the environment. It generates rewards when "good things" happen, and
either generates a punishment signal, or generates fewer rewards, when "bad
things" happen.
The network, like all reinforcement learning systems, only has one real
goal. Its goal is to change its behavior in ways that will increase the
number of rewards, per time, the machine is receiving from the critic. The
details of how to implement this for this type of network, is what I've
been looking at for a few years now, and not making much progress, but the
basics are easy to understand.
The only behavior the network has is pulse sorting, so that's what is being
rewarded. Each node tracks how much reward it's received, relative to what
it's been doing. If it gets more rewards for sorting pulses out one side
than the other, it will adjust its behavior so that in the future, it will
tend to sort more pulses out the "good" side. This is easy for this type
of node to do since it can simply adjust its internal sorting value a
small amount to make that happen. Assuming the input signal is complex
(very noisy) this will cause slightly more pulses to go out one side than
the other. So, over time, each node tracks rewards relative to its
actions, and adjusts its behavior to try and increase the total rewards.
If need be, it will end up sorting almost all pulses out one side vs the
other, so it will act more like a switch than a signal classifier.
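A hedged Python sketch of that reward tracking is below. The details are assumptions added for illustration: the running-average reward estimate per side, the fixed step, and a toy critic that simply rewards every pulse sent out the "high" side. The default balancing rule is left out here so the reward effect is easy to see.

```python
import random


class RewardedSortingNode:
    """Sorting node that shifts its reference value toward the rewarded side."""

    def __init__(self, threshold=1.0, step=0.01, decay=0.99):
        self.threshold = threshold
        self.step = step
        self.decay = decay
        self.value = {"high": 0.0, "low": 0.0}  # running reward estimate per side

    def sort(self, gap):
        return "high" if gap < self.threshold else "low"

    def learn(self, side, reward):
        """Update the reward estimate for `side`, then favor the better side."""
        v = self.value
        v[side] = self.decay * v[side] + (1.0 - self.decay) * reward
        if v["high"] > v["low"]:
            self.threshold += self.step  # more gaps now sort out "high"
        elif v["low"] > v["high"]:
            self.threshold -= self.step  # more gaps now sort out "low"


rng = random.Random(1)
node = RewardedSortingNode()
sides = []
for _ in range(1000):
    side = node.sort(rng.uniform(0.0, 2.0))
    reward = 1.0 if side == "high" else 0.0  # toy critic (hypothetical)
    node.learn(side, reward)
    sides.append(side)

late = sides[500:]
high_fraction = late.count("high") / len(late)
print(high_fraction)
```

Because this toy critic never rewards the "low" side, the node ends up sending everything out "high", which is the switch-like behavior described above.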
Each node has access to only a very limited amount of data, and each node
only has a very limited amount of power to control what the network as a
whole does. Yet the idea is that many nodes working
together can produce very complex behaviors.
Or to stay at the HORSE-example:
Let us assume the system says "HOXEL" and is rewarded. There must be 2
rules that lead the system nearer to "HORSE":
1. Rule: The system must say words that are similar to "HOXEL". So a
rule must define what means "similar". Maybe this could be achieved by
a neural network with a bit unsharp actions.
2. Rule: The system must have the estimation rule "parrot" at the
moment it says "HOXEL". This is an artificial rule, that has no
counterpart in the human brain. It is built on top of the software of
the neural network.
I do not think artificial rules will bring us any further, because it
restricts the system to learn something special.
Well, the point of this type of network design is to start off in a
maximally complex configuration, so that its output behaviors look nearly
perfectly random. They are not random however, they are very
deterministic. However, the function is so complex that it will look
random to a human, who will see no "purpose" in the complex behavior. The
reward system is expected to reward it at times, just because it gets lucky
(mom stuck a tit in my mouth so I get a reward even though I did nothing).
But it learns from this experience because the recent behaviors of all the
nodes are biased to reflect what has recently happened in the sensory data.
So the nodes all slightly change their behavior to reflect that these
sensory conditions are ones in which it got a reward. Which brings up the
issue of secondary reinforcement. This is something I've not figured out
how to correctly implement in this type of network. But the goal is for
the network to also act as a reward predictor. So it needs to learn that a
given sensory condition is more likely to produce rewards, because it's
seen more rewards in those sensory conditions. It needs to learn that a
tit is a "good thing". Then, in the future, when it does something that
happens to create the "tit" sensory condition (such as turning its head to
the right as a reaction to the sound of mom's voice on the right), that head
turn reaction gets rewarded. This is how the system creates behaviors that
look like goal seeking behaviors - because the system has learned that the
sensory condition of "tit" is a good thing, which means, doing things that
reproduce that sensory condition, is a goal for it.
So the concept of "closeness" you referred to happens a few ways. First, a
good critic design will be one that can reward based on "closeness". The
more the critic can do that, the easier it will be for the system to home
in on a better answer (aka hill climb towards maximal rewards). So in your
HORSE example, it would be far better for the critic to reward based on how
many letters the system got right than to wait until it got them all right,
and then give it a reward. So closeness in this case is defined by the
critic to help guide the learning machine in the right direction.
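For the HORSE example, the difference between the two critics can be sketched in a few lines of Python. The function names and the position-by-position scoring are assumptions chosen for illustration. Any measure that rises as the output gets closer would serve the same purpose.

```python
def closeness_reward(output, target="HORSE"):
    """Reward in [0, 1]: fraction of positions where the letters match."""
    matches = sum(1 for a, b in zip(output, target) if a == b)
    return matches / len(target)


def all_or_nothing_reward(output, target="HORSE"):
    """Reward only when the whole word is right."""
    return 1.0 if output == target else 0.0


for word in ("XXXXX", "HOXEL", "HORSX", "HORSE"):
    print(word, closeness_reward(word), all_or_nothing_reward(word))
```

The graded critic gives "HOXEL" a partial reward (two of five letters match), so there is a slope to climb. The all-or-nothing critic gives it nothing until the whole word appears by chance.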
The other way closeness works is by the actions of the secondary
reinforcement. Since all the nodes in this large network are acting to
measure the "value" of any sensory condition, it will also naturally create
a measure of closeness based on how many of the nodes are in the right
state. The state of each node is a function of recent past sensory data,
so the state of the network as a whole is intended to represent the
system's best understanding of the current state of the entire environment.
If 10 nodes are reporting the environment is in a state of high expected
reward, that's not as good as when 50 nodes are reporting it. This allows
the network to produce an estimate of how good of a state, the environment
is in. The goal of the system is to produce whatever outputs work, to
manipulate the environment into the best possible state for it. So this
natural system of using many parallel networks to dissect, and understand,
the environment, works naturally to make "closeness" predictions to help
guide learning.
So if the critic only rewards for producing "HORSE" correctly, it will take
some time for the network to produce that by random chance (but the default
behavior of the network is to act very complexly (very random like), so
given enough time, it will always happen). It will have to happen many
times before the system starts to both learn the behavior, and to recognize
that sensory condition (sensing that we just produced the output HORSE) as
a "good thing". But as that ability develops, outputs that look close to
"HORSE" will produce partial rewards, and bias the behavior of the system
to produce more outputs close to "HORSE". This will help to quickly reduce
the amount of time between instances of producing "HORSE". Which leads to
more rewards, which leads to better reward predictions, which leads to
better behavior, and next thing you know, the machine is producing nothing
but the word HORSE.
The goal here was never a rule to parrot. The goal was always to make the
critic produce as many rewards as possible. If the critic only rewards for
the behavior of "HORSE" then that's what the system would learn.
But a more interesting critic will not reward for a fixed output. Instead,
it will reward for some important result - like a robot getting its
batteries charged because it's managed to position its solar cells in a
bright light. This type of critic will cause the robot to learn light
seeking behaviors, or dark avoidance behaviors. If it's smart enough, it
might even learn behaviors like turning on a light switch to get more light
in the room. If it's really smart, it might learn to speak English and ask
us to let it outside so it can get direct sunlight. :)
Now on the issue of goals. Humans, very much have goals. And I suspect
many short term goals are represented by neurons constantly firing as you
made reference to at the beginning. A reaction machine that has no short
term goals can't produce a long string of behaviors to reach that goal -
like walking to the kitchen for the purpose of getting a drink of water.
The environment can trigger a lot of goal seeking behavior but you can't
explain all human behavior in those terms. And the internal environment of
the human body can actually act as an environment to the reinforcement
learning brain. So an empty stomach, for example, acts as the goal for the
food gathering behaviors. And it can keep us focused on that goal, instead
of allowing us to be distracted by other things. In a robot, with my type
of network used to drive it, you can for example give it extra inputs to
allow it to understand the state of the robot - such as a battery level
input. That input can act to motivate the robot to return to its charging
station. This creates the "I want to go back to my charger" goal effect.
Humans however have language behaviors. We have the ability to speak
silently to ourselves. When we do this, we seem to activate internal
states that direct our short term goal seeking behaviors (like going to
the kitchen to get a glass of water). Or, seeing that the trash can in the
office doesn't have a trash bag in it, so we go to the kitchen to get a
trash bag. In a very simplistic reaction machine, this is hard to do since
the minute you leave the office, the trash can is no longer part of the
environment and it's no longer there to keep triggering the "get trash bag"
behaviors. Something we see on the way to the kitchen might trigger us to
go to the office and we end up aborting the trash can behaviors. There
needs to be internal state that is maintained in the brain to complete that
task. We need to "remember" what we are trying to do.
I'm not sure how this would be implemented in my type of network, though I
have various thoughts. I think it would develop by the machine first
learning to perform a simple behavior which is triggered directly by the
environment. Like picking up a glass of water and drinking - all triggered
by an internal thirst signal and the sight of the glass of water. But what
happens when we get thirsty, and there is no glass of water? I believe
what needs to happen, is that this tries to trigger the "pick up glass and
drink behavior", but without a real glass around to direct the actions of
the arm the brain is simply triggered to perform "find glass behaviors"
instead. These things, it seems, need to create a persistent bias in the
brain state of "find glass" in the motor cortex, the same way a bias is
created when an output pattern sequence is selected that allows us to keep
walking. I'm thinking it must happen with the help of feedback loops that
allow the system to lock into a given state which represents the "get
glass" goal. The existence of that loop in the motor cortex being active
is what drives the strings of behaviors, all as direct reactions to the
environment and that internal state, to make us seek out a glass, fill it
with water, and then drink.
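
To make the "lock into a given state" idea concrete, here is a minimal
sketch in Python. It is only an illustration under my own assumptions, not a
description of how any real network does it: the function names (step, run),
the percept strings and the single boolean latch are all hypothetical. It
uses the trash bag errand because there the trigger drops out of view as
soon as you leave the office.

```python
# Hypothetical sketch: a pure reaction machine plus one latched state bit.
# The latch stands in for the feedback loop that holds the goal active.

def step(percept, latch):
    """One reaction: map (percept, latched state) to (action, new latch)."""
    if percept == "empty_trash_can":
        return "walk_to_kitchen", True       # trigger seen: set the latch
    if percept == "trash_bags" and latch:
        return "take_bag_to_office", False   # task done: clear the latch
    if latch:
        return "walk_to_kitchen", True       # trigger gone, latch still drives
    return "idle", False


def run(percepts, use_latch=True):
    """Feed a sequence of percepts through the machine and collect actions."""
    latch, actions = False, []
    for percept in percepts:
        action, new_latch = step(percept, latch)
        latch = new_latch if use_latch else False  # no memory when disabled
        actions.append(action)
    return actions


percepts = ["empty_trash_can", "hallway", "hallway", "trash_bags"]
with_latch = run(percepts)
without_latch = run(percepts, use_latch=False)
```

With the latch, the machine keeps walking to the kitchen through the hallway
and fetches the bag. Without it, the behavior dies the moment the trash can
is out of sight, which is the failure described for the simplistic reaction
machine.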
The above is vague and doesn't answer exactly how this might work. But
the general idea, I think, is there. It must learn simple behaviors first,
and then leverage those to create more complex behaviors. When trying to
perform a behavior that worked well in the past, but the needed components
are missing (like the glass), the attempt itself acts as the
environmental state that the system reacts to, which causes us to get the
things we need.
This "getting what is missing" starts off very simple first. If we want to
drink from a bottle, and we don't see the bottle, then we don't know which
way to move our hands. So we learn to turn our head and scan our eyes in
an attempt to find the bottle. So the same things which trigger us to try
and drink, when combined with a "no bottle" condition, trigger the "search
behavior". In time, our "get bottle" behavior grows increasingly complex
and longer lasting - we can think about it and end up walking to the
kitchen and getting the glass out of the cabinet - even being interrupted
by a phone call in the middle of the act.
So, it's all a matter of building complex behaviors, through reinforcement.
The machine starts out producing very random-looking behaviors, but they
are never actually random; they are very deterministic reactions created by
the complex actions of a large number of very simple agents (neurons)
working together. This means when good things happen, these
micro-behaviors can be independently reinforced, always moving the machine
to behaviors that produce better results. Though the behaviors produced by
such systems look like what we call goal-directed behaviors, the only real
goal is to maximize rewards. The internal reward prediction system then
creates a fairly continuous landscape to allow the system to slowly
hill-climb towards the higher ground as it evolves its set of reactions.
Each node in this multi-agent network is in effect solving its own
hill-climbing problem. Though some nodes will get stuck on a local maximum,
other nodes will continue to make progress. And as they change, the
environment changes because the machine is behaving differently, which can
cause nodes which were once stuck on a local maximum to be kicked off it -
allowing them to make progress towards better behaviors.
The key is that all network-level macro behaviors are created by many
different nodes (micro behaviors) working together. But not all micro
behaviors are used at once. The sensory environment defines the current
context, which in turn maps to a different subset of nodes. So
only a subset of the nodes is used for making each decision. This is seen
in my pulse sorting net by the simple fact that only the nodes a pulse
passes through were part of the behavior of the network at that time. A
network could have a million nodes, and only use on average 20 nodes to
sort each pulse. So when it's rewarded, only the nodes recently used are
the ones being trained - the rest of the network is not affected. The
part of the network used, and trained, is always a function of the
current sensory context - the network's idea of the current state of the
environment. Unlike a more traditional neural network, where the entire
network is always used to calculate the outputs, this type of network is
selective, and only part of it is used to make each decision.
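
As a rough illustration of "a million nodes, about 20 used per pulse", here
is a small Python sketch. It is not my pulse sorting net: the binary tree
layout, the TreeNet class, the per-node routing probability and the update
rule are all assumptions picked only because a tree of depth 20 has about a
million routing nodes and each pulse touches exactly 20 of them.

```python
import random

DEPTH = 20  # a pulse visits DEPTH nodes out of 2**DEPTH - 1 routing nodes


class TreeNet:
    """Hypothetical binary tree of routing nodes, created lazily so only
    the nodes that pulses actually reach take up any memory."""

    def __init__(self, depth=DEPTH, seed=0):
        self.depth = depth
        self.p_right = {}               # node id -> probability of going right
        self.rng = random.Random(seed)  # seeded so runs are repeatable

    def total_nodes(self):
        return 2 ** self.depth - 1

    def route(self):
        """Send one pulse from the root to a leaf; return (path, leaf id)."""
        node, path = 1, []              # heap numbering: children 2n and 2n+1
        for _ in range(self.depth):
            p = self.p_right.setdefault(node, 0.5)
            went_right = self.rng.random() < p
            path.append((node, went_right))
            node = 2 * node + (1 if went_right else 0)
        return path, node

    def reinforce(self, path, reward, rate=0.1):
        """Train only the nodes on the path, nudging each one toward the
        choice it just made. Assumes 0 <= reward <= 1."""
        for node, went_right in path:
            target = 1.0 if went_right else 0.0
            p = self.p_right[node]
            self.p_right[node] = p + rate * reward * (target - p)


net = TreeNet()
path, leaf = net.route()
net.reinforce(path, reward=1.0)
```

After one rewarded pulse, only the 20 nodes on that pulse's path have been
changed; every other node in the tree is exactly as it was.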
Also, the number of different "reactions" such a network can produce is
much higher than the number of nodes, since they act together in different
combinations to produce each reaction. A network with hundreds of nodes
can produce billions of different reactions to different environments.
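
A quick back-of-the-envelope check of that claim, as a Python sketch. The
assumption here (mine, for illustration only) is that a reaction is simply
some group of nodes acting together, and that any group of the chosen size
is possible, so this is an upper bound rather than a count for any real
network. The function name count_reactions and the numbers 200 and 5 are
made up for the example.

```python
from math import comb


def count_reactions(num_nodes, nodes_per_reaction):
    """Upper bound on reactions: distinct groups of nodes acting together."""
    return comb(num_nodes, nodes_per_reaction)


# 200 nodes, any 5 of them combining to produce one reaction.
reactions = count_reactions(200, 5)
```

Under those assumptions, 200 nodes used 5 at a time already give about 2.5
billion possible combinations, so "hundreds of nodes, billions of reactions"
is the right order of magnitude.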
It's all just a temporal reaction machine able to produce billions and
billions of different reactions to different environments, tuned by
reinforcement learning to produce increasingly better behaviors over time.
Though these machines produce what we call goal-directed behaviors, the
only goal they really have, the prime goal, is to maximize total rewards over
time, and all the behaviors that look like sub-goals are just reaction
sequences the machine has learned which lead it to higher rewards.
--
Curt Welch http://CurtWelch.Com/
curt@xxxxxxxx http://NewsReader.Com/