Re: Ben G on reinforcement-learning and the wirehead problem



greywolf@xxxxxxxxxx wrote:
Curt Welch wrote:
greywolf@xxxxxxxxxx wrote:
casey wrote:
On Jun 14, 4:29 pm, Wolf K <weki...@xxxxxxxxxxxx> wrote:
casey wrote:
Efforts to rear boys and girls equally fail! We have innate
differences that have nothing to do with conditioning. This
is an observable result of many such experiments. Thus this
is one more example that your belief that we are nothing but
RL machines is wrong.
The concept of "RL machine" implicit in the above is confused/
confusing. And it's not clear whether you subscribe to that
concept yourself, or ascribe it to Curt.
The issue has been about how much of our behaviors are innate
and how much is the result of reinforcement learning.
That's a non-question, like "How long is a piece of string?"

And keep in mind that reinforcement learning is not ex nihilo (as you
appear to suppose).

So what exactly do you think an "RL machine" is?
http://www.cs.ualberta.ca/~sutton/book/ebook/node7.html

"Reinforcement learning is learning what to do--how to map
situations to actions--so as to maximize a numerical reward
signal. The learner is not told which actions to take, as
in most forms of machine learning, but instead must discover
which actions yield the most reward by trying them. In the
most interesting and challenging cases, actions may affect
not only the immediate reward but also the next situation
and, through that, all subsequent rewards. These two
characteristics--trial-and-error search and delayed reward
--are the two most important distinguishing features of
reinforcement learning."

====================

When I first read John's message, I didn't bother to look at the link
to realize it was Sutton's book. No wonder I agreed with it! :)

I looked up the link, and read most of the page from which you quote.
By the 3rd paragraph, the authors are so hopelessly entangled in
anthropomorphic metaphors that their discussion amounts to handwaving.
What's depressing is that they seem unaware that they are speaking in
images.

Well, if that's hand waving to you, read the entire book and all that
hand waving is translated into precise formulas and algorithms in the
following chapters. He most definitely backs up all his hand waving
with hard facts.

So? Formulas and algorithms are just models. They are no better than the
base assumptions (== metaphors!) about the phenomena being modelled.
Think of the difference between Ptolemaic and Keplerian models of the
solar system.

Sure formulas and algorithms are just models. But for the special case of
computers, they model the behavior with 100% accuracy. The algorithm isn't
just a good approximation of what the computer will do, it's an exact
description of what it does (exact to the extent of what it's
predicting).

A model like F=ma on the other hand does not exactly describe where a rock
will hit the ground when you throw it both because it's impossible to
collect perfect data about the starting condition of the rock and because
the model is only an approximation about the behavior of a falling rock.

Computer scientists like Rich Sutton are describing, with 100% accuracy,
how computers behave when they are programmed with various learning
algorithms.

How that machine (a computer running an RL algorithm) is similar to a human,
or a rat, is an area for further study. We can talk about the computer
being a model of human behavior, but that is not what Sutton is working on
(at least not directly). He's working on it indirectly by exploring the
behavior of computers - with the hope that it will lead to a better
understanding of the behavior of humans (I assume).

I claim that any models of AI that assume goals and values, etc, are
wrong.

I don't know what you mean by "assume goals and values".

And it gets worse. Consider this sentence:

"The agent must try a variety of actions and progressively favor those
that appear to be best."

It is quite unnecessary for the agent to have any opinions about,
or evaluations of, any of its behaviours.

That's just not true. In order to actually _implement_ reinforcement
learning, such knowledge is key. Reinforcement learning simply doesn't
work well without it. It's the only way known to solve the delayed
reward problem.

The implementer needs to know, the agent he builds does not.

Ok, well, this becomes a discussion of what "to know" means - and that of
course is a central problem of AI and one which the philosophers have never
resolved either. In theory, we won't know the answer to that until _after_
AI has been fully solved. That is, until after we believe we know all that
is important about how the brain works.

We can ignore that question, and just do what Rich does, which is to study
the behavior of computers and not really claim to know what "to know"
means.

When we debate such subjects, if we want to share our understanding with
others, and if we want to be accepted by others in the field, we are forced
to play a complex game of politics. I don't tend to play such games as
someone who's actually making a living in these fields.

I believe I know what "to know" means, and as such, I use it based on what
I believe to be true, not how I would need to use it to be politically
correct.

I believe the machine actually "knows" and that it's valid to talk like
that.

But humans do have the ability to verbalize some of their knowledge, and it
can be argued (even though I don't agree with such arguments) that
knowledge is limited to what we can verbalize. And the type of knowledge
I'm talking about above is clearly not the ability to verbalize.

I can catch a ball that's thrown to me (sometimes). And in having learned
how to do that, I think it's valid to say that I know how to catch a ball.
In other words, I have knowledge of how to catch a ball. I can't however,
verbalize that knowledge - I can't describe all the complex changes that
happened to my brain and body which were the result of acquiring that bit of
knowledge nor can I correctly verbalize what I do when I catch the ball. I
just catch it.

We certainly can just project our knowledge onto the machine we are
designing and talk _as_ _if_ the machine had knowledge when we are really
just talking about what we know. And I could claim that is what I believe
in an attempt to be politically correct. But it's not what I believe. I
believe the machine that has accumulated some statistical data on past
experience actually _has_ that knowledge.

All you
need in the agent is some method of increasing the odds that a past
behaviour will be repeated when an environmental factor is
re-encountered.

I agree. But I also take the position that any such mechanism (however
it's implemented) does in fact create knowledge in the machine. It allows
the machine to know something.

But to call those methods "knowledge" is IMO stretching
the metaphor beyond the limits of sense.

I don't consider it a metaphor. I consider it to be real knowledge in the
machine.

The question comes down to what happens in a human that allows us to have
knowledge and why is the process that happens in us not just hardware being
conditioned. Clearly some people strongly believe that something
fundamentally more complex is happening in us and as such, even without
knowing what that "more complex" process is, choose to take the stance that
such simple conditioning is not the same as human knowledge. I don't
however agree with that, and I think simple conditioning is the collection
of knowledge in the machine.

As the creator, I don't even hold such knowledge. As the machine interacts
with its environment _it_ and not I, is the one being conditioned. It's
gaining knowledge of what behavior works best, not I. I understand how it
collects knowledge, but it is the one collecting that knowledge, not I. It
has the knowledge, not me.

The implementer must decide
which such encounters to keep track of, for example, in order to
increment a counter whose value is used by the algorithm that computes
the next behaviour.

Yes, as the creator, my design decisions limit what sort of knowledge it
can and will collect. But the knowledge collected is held by the machine,
not by me. It "knows" the value of walking, not I. Just like TD-gammon
has knowledge of the value of a given move in a given board position which
the creator does not have.

The implementer must also build in some method of
keeping track of delayed feedbacks, else they cannot be "rewards." Etc.
But that's architecture, not "knowledge." ("Knowledge" is too
anthropomorphic for my taste, hence the scare quotes.)

Yes, the creator has knowledge of the machine's architecture, but the
machine collects the knowledge (and likely has no knowledge of its own
architecture). Which in RL terms, is the value array it computes over time
from its experience in interacting with the environment.

An RL machine accumulates data from experiments. Every behavior it produces
is an experiment, and the results of these experiments are recorded in how
it updates its internal variables. That accumulated data (the current
values of all its adjustable variables) is the machine's knowledge.

In any other field but AI, the argument that the description is too
anthropomorphic is fine. But here, in AI, it's our job to define what
constitutes machine knowledge. To claim you don't know, just indicates you
haven't solved AI.

Though I could easily be proved wrong one day, I do have strong opinions
about what knowledge is in humans, and in machines.

This goes directly to what I just wrote to John in a previous post
minutes ago. I wrote something to the effect that before the system
can learn to walk, it must first learn to recognize the _value_ of
walking.

"Value" is your abstraction, not the agent's.

Well, again, even though value is my abstraction, I believe the agent does
know the value.

"Hand" is my abstraction as well but would you argue you don't have a hand
just because "hand" is my abstraction? Or would you argue that a wheel is
not round because round is my abstraction? Yes, the abstraction is mine,
but the property that the abstraction labels is in the hand, or the wheel, or
in the machine that has the power to recognize value.

Now this might seem odd, because it might be hard to grasp how a
learning agent who has never walked, can have any comprehension of its
value. But they do, and that's exactly how it can learn such a complex
behaviour so quickly.

I assume you're referring to calves, fawns, foals, etc.

Well, I was thinking of robots, but it applies to animals in the same way.

among other
things. They do not, I think, "recognise the value of walking." They
just do it

Well again, this is just more of the same problem. We use words such as
"recognize value" to describe something humans can do. What is happening
in a human when they recognize value? How do we know if a machine is
duplicating the same sort of process? Is it ever valid to use the words
"recognize value" when talking about a machine other than a biological
human, or does social word usage convention prohibit the application of
such words to anything other than humans?

More typically, when we say a human has "recognized value" it is a process
that happens at the level of language behavior. Such as when someone
recognized the value of the 50% off sale by reading an advertisement and by
potentially talking to themselves about the deal by saying "gee that's a
good deal, maybe I should go buy that". At such level of recognition, they
would be able to verbalize the value they had recognized. So, just like
with knowledge, we could attempt to argue that recognizing value happens
only when language behavior emerges from a human to signify the
recognition. But like knowledge, I don't agree it starts, or ends, there.

We can also simply observe the behavior of a human and make just as strong
an argument about the human's recognition of value by studying their non
verbal actions. When a human walks up to a table with lots of food on it,
which item do they pick? We can label the item they pick as the item with
the most value to the human. There may be no verbalization (internally or
externally) associated with how the decision was arrived at, but yet, the
human "just did it". They recognized and responded to the value without
any high level verbal recognition of the value.

If we are a cave man walking through the woods we might stop and pick up a
rock, and carry it back to our cave. The cave man then later uses that
rock to break bones open to eat the marrow, or to trade for something else
with another cave man. In these sorts of actions we say the cave man
recognized the value of the rock when he picked it up. But yet, this cave
man might have no way to verbalize and explain his actions, or no
understanding of the abstraction of value. He simply did it. I argue the
cave man recognized the value of the rock, even though he didn't understand
at a level that would allow him to verbalize the rock's value by saying
words like "I got the rock because I liked the look of it, and thought it
might be useful, or thought others might like it which means I could trade
them for things I wanted". Even without that language ability, I think it's
valid and accurate to say the cave man recognized the value in the rock.

The value in the rock was in its power to create future rewards for the
cave man. All value translates back to rewards in my view. It's what
value is and where value comes from.

A robot we build that is able to make use of an object to obtain future
rewards I would claim has the power to recognize the value in the object.
If for example, a robot is in a room with red and blue balls, and if it
picks up a red ball and drops it in a box, the robot receives a reward. If
such a robot is able to learn to pick up only the red balls, and not the
blue balls, I would say it's correct and valid to declare that the robot
has recognized the value of red balls. Such a robot I believe is well
within our understanding to build today, and if someone built it, I would
declare that machine as having the power to recognize value.
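
A minimal sketch of the red/blue ball robot in Python. Everything in it is
an assumption made for illustration: the reward of 1 for boxing a red ball
and 0 for a blue one, the epsilon-greedy choice rule, the step size, and the
names train_ball_robot, value, and epsilon. It is not any particular robot's
design, only one simple way a machine could come to prefer red balls from
reward alone.

```python
import random

def train_ball_robot(trials=2000, epsilon=0.1, step=0.1, seed=0):
    """Learn which ball colour is worth picking up, from reward alone."""
    rng = random.Random(seed)
    # The robot's accumulated "knowledge": one value estimate per colour.
    value = {"blue": 0.0, "red": 0.0}
    for _ in range(trials):
        # Mostly pick the colour currently valued highest, sometimes explore.
        if rng.random() < epsilon:
            colour = rng.choice(["red", "blue"])
        else:
            colour = max(value, key=value.get)
        # Assumed environment: boxing a red ball pays 1, a blue ball pays 0.
        reward = 1.0 if colour == "red" else 0.0
        # Nudge the estimate for the chosen colour toward the reward received.
        value[colour] += step * (reward - value[colour])
    return value

if __name__ == "__main__":
    print(train_ball_robot())
```

The value table starts out indifferent, so the robot only comes to favor red
after exploration happens to try it. After that, the stored numbers are what
steer its behavior.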

-- and the most important fact about their learning to walk
in the first few hours after birth is that it's impossible to stop them
from learning how to do it.

Well, you can stop them by not letting their feet touch the ground I would
assume. If that doesn't stop them, then they are not actually learning to
walk, we are instead just seeing their control system finish developing.

As long as there's room for them to move,
the nervous system makes the connections needed to co-ordinate their
elemental behaviours (flexing legs, tensing/relaxing torso muscles, etc)
into the macro-behaviour we call walking.

FWIW, I think building an artificial calf that learns to walk in a few
hours after being switched on would be a major achievement.

It's all about the secondary rewards. About recognizing something as
"good" even though that something isn't a primary reward or doesn't
directly produce a primary reward from the environment. It's all about
predicting future rewards - about recognizing that something in the
environment is a predictor, of a future reward.

I'd like to see a suite of experiments that proves that a newborn horse
can recognise future rewards. I've watched quite a few of them. They
just like to walk and run. A behaviour that persists in the adult horse,
which is why we can train them to be race horses.

If you can train them by operant conditioning then that is proof of their
power to recognize future rewards. The limit of how far out into the
future they can make accurate predictions of rewards is just the limit of
the strength of their learning hardware. It might be limited to 5 seconds
in a horse for all I know. But 5 seconds in the future is still 5 seconds
in the future.

If for example, you show them an apple, and they don't walk over and eat
it, then that shows they don't yet know the value of an apple. But if you
feed them apples, and after that, they start to walk over to you and eat
the apple out of your hand, that shows they understand future rewards.
They understand that the "walking towards the apple" behavior is likely to
produce a future reward.

Walking is good, because it helps us get more rewards. That's why
walking is valuable to us.

It's obvious you haven't studied babies learning to walk.

But if the only way the learning system could recognize
that value, was by first walking over to the food and eating it to
produce a real reward, the learning process would take a billion years
because it would be like waiting for monkeys to type out Shakespeare.
It would be a billion years before the agent just happened to walk over
to the food, and then grab it and put it in its mouth and swallow -
producing the final "real" reward to let, after a billion years, the
agent finally get one reward to indicate that walking might be good.

Well, this thought experiment is a bit off IMO. If the system is capable
of walking at all, it will very quickly bump into things it likes or
needs, such as food. "Rewards", IOW, are inevitable.

Well, that's just the point. Humans aren't "capable of walking" in the
sense that they must first learn to do it. Walking is not a simple
"move forward behavior". It's a very complex sequence of actions combined
with a complex dynamic balancing process. We understand the complexity of
this when we see it takes millions of dollars of engineering to make a
machine walk on two legs poorly. It requires a very complex set of
internal circuits to make it happen - a set of circuits that don't just
magically show up by random chance in a few hours, months, or even years.
If you build a 2 legged robot and lay it down on the ground, and program it to
explore random behaviors, just how long do you think you will have to wait
before it gets up, walks over to the other side of the room, and pushes
the "reward" button located there? Obviously, a very long time - maybe
even millions of years.

But when guided by reinforcement learning, such a thing might be learned
far quicker.

OTOH, if the system cannot walk at all, then you have to posit some
intermediate stages between immobility and walking.

And you have to posit why the machine advances through those states so
quickly.

There are such
intermediate stages, many of them, and it did take millions of years to
evolve them.

Well, it took millions of years to evolve hardware that both had the power
to walk, and had the control system to allow it to do it. But that's all
assumed.

When it comes to learning, if we are talking about learning to walk, we
assume it's got legs with enough power to perform the walking action, and
some array of sensors to make it possible for the machine to be configured
into the required control circuits to make walking happen. We are then
just talking about how long it takes for the learning machine to configure
itself into the correct circuit, and why it would happen in weeks, or
months, instead of in millions of years.

But not because a worm way back then "recognised the value
of walking." It just wiggled, and bumped into food, or a possible mate.
(It's actually more complex, since the worm also sensed chemicals
dissolved out of food, etc.) Those that wiggled better got more food, so
whatever it was in their architecture that enabled better wiggling was
passed on to their offspring. But worms didn't evaluate their behaviour.
They just did it.

Yes, but of course you are now talking about the process of DNA based
evolution which as I've argued before, I claim is also a reinforcement
learning machine. And I also make the claim that it does recognize the
value of behaviors.

The learning machine however is not the individual worm in this example.
It's the entire worm species - that is the collection of all worms alive at
any point in time acting as one large learning machine. When a new worm is
created which includes a new type of behavior (determined by innate
genetics), that worm is a test of the value of that behavior. The more
successful the behavior is in helping the species survive, the more it is
likely to reproduce, and the larger the percentage of worms in the current
population can be expected to make use of the behavior. The percentage of
worms in the population with the genetic trait is the machine's mechanical
tracking of the trait's value and it is the machine's recognition of the
value of that trait. The genes with the most value, are the ones with the
largest population in the worm gene pool.
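
The idea that a trait's share of the population is the species-level record
of that trait's value can be sketched in a few lines of Python. This is a toy
model under loud assumptions: two heritable behaviors, made-up fitness
numbers, and a deterministic proportional-reproduction rule (the standard
discrete replicator update). The names next_generation, evolve, and the trait
labels are hypothetical.

```python
def next_generation(share, fitness):
    """One round of proportional reproduction (discrete replicator update)."""
    # Each behavior's share grows in proportion to its reproductive success.
    weighted = {trait: share[trait] * fitness[trait] for trait in share}
    total = sum(weighted.values())
    return {trait: w / total for trait, w in weighted.items()}

def evolve(share, fitness, generations):
    """Apply the update repeatedly and return the final population shares."""
    for _ in range(generations):
        share = next_generation(share, fitness)
    return share

if __name__ == "__main__":
    # Assumed numbers: a rare new wiggle that yields 10% more offspring.
    start = {"old_wiggle": 0.99, "new_wiggle": 0.01}
    fitness = {"old_wiggle": 1.0, "new_wiggle": 1.1}
    print(evolve(start, fitness, 100))
```

No individual worm evaluates anything here. The population shares are the
only place the "value" of each behavior is recorded, and they shift toward
the behavior that pays off.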

Learning works faster, because the agent learns to recognize elements
of the environment which are predictors of future rewards - it learns
to recognize secondary reinforcers.

Actually, learning often works slower, as anyone who has tried to master
a new skill will tell you.

I was talking about how learning with the help of secondary reinforcers
happens much faster than learning without the help of secondary reinforcers.

Anyhow, what you are describing is operant conditioning.

Yes, I think it is. I claim that operant conditioning and reinforcement
learning are the same thing.

I think you should meditate on the odd fact that walking is learned by a
baby in very short time. In a few weeks, the baby progresses from
sitting down after every step or two to running.

But it takes a year of learning to use its legs in general, and learning to
use the legs to roll over, and to sit up, and to crawl before it gets to
that week where it stands and walks on two feet. Babies don't learn to
walk in two weeks, it's a 12 month learning process. Still, none of that
entire year long learning process would have happened in anything less than
a few million years if it didn't have the help of a good reward prediction
system creating secondary reinforcers to guide that learning.

It masters the skill of
waving a stick at something and hitting it in a week or so.

Again, after a year of learning to first use its eyes, and head, and
hands, and arms, and legs, etc.

To simply claim the week before the year long learning process ends is
where the learning "started" is just silly (but typical of how parents
might think).

By contrast,
it takes months and years to master that combination of walking and
stick waving we call golf. And the frustrations of doing that are not
exactly "rewards." ;-)

Well, golf is never really mastered because there's no end goal. :) I'm
pretty sure if you ask Tiger if he thinks he's mastered the game of golf he
would say no. :)

After having food given to it many times, it learns to recognize the
sight of food as a predictor of a future reward. It learns to
recognize the sight of food getting close to the mouth, as a secondary
reward. In other words, it learns the _value_ of making the food come
close to it.

Once again: the _value_ is something you have abstracted from the
situation. The agent does not need to evaluate anything. It just needs to
have a) responses; and b) responses that change its architecture.

Yes, but a response that changes its architecture in some direction is a
recognition of value. It's the definition of value in my view. I use the
term "directed change" to describe change that has a direction (as opposed
to random change which has no clear direction or purpose or goal in how it
changes).

And once again, you are describing operant conditioning (which, please
note, requires that at least two responses occur when a stimulus is
presented.)

Which two are you talking about?

There's the external stimulus, the response by the agent, the response by
the environment, the response of how the agent changes in response to how
the environment changed, the second external stimulus, and the second (now
potentially slightly different) response by the agent. And that of course
makes it sound like there's a clear delineation between the start and stop
of a stimulus or a response which is not really the case at all when you
get down to implementation details. It's only the case when an experiment
is structured so as to force the clear delineations.

When it takes a single step towards the food, it recognizes the value
of that one simple behavior, as helping to get the food closer. That
acts as a secondary reinforcer which helps to reward that "step towards
the food" behavior. In other words, the agent already learned to
recognize the value of walking, in the fact that it was able to
recognize the _result_ it produced - the result of getting the agent
closer to the food.

Babies don't take steps because they recognise value. They take steps
because it feels good to do so.

What you call "feels good" I call "recognize value". Same idea, just
different words.

The fact that a grown up makes smiley
faces and cooing noises just increases the feel-good feedback. IOW, it's
operant conditioning (and it happens very fast because a baby is a
system optimised for learning to walk.)

Yes, I agree that babies no doubt are optimized for learning to walk and
that speeds up the process. But more important, they are optimized for
strong learning, period, by the inclusion of very strong and effective future
reward _prediction_ hardware.

This is a simple example of trying behaviors, and favoring the one that
appears best.

As you describe it, it's not simple at all. It is a very adult human,
conscious, and top-down method of solving a problem: define it,
hypothesise possible solutions, try them, and evaluate them; repeat with
refinements of the best solutions; etc. It's a very linear process, and
that's one of the reasons I don't think it's an accurate model of how
most real systems learn. Using it as a model for computer learning is
IMO not useful.

Well, how I describe it might seem "linear". How it works in the type of
machines I'm looking at is not linear at all. It's a highly parallel
real time continuous temporal process. It's linear only in the fact that
it's forced to happen over a period of time (aka time is linear and we
can't escape that).

It's an engineering approach, IOW. One that you have learned over many
years of training and practice. (Me too). And what kept you going was
not the eventual payoff (although you no doubt told yourself that from
time to time). It was the fact that you are built to enjoy problem
solving. (Me too.)

Yes, that's true and accurate I would say. But I don't think that "joy of
problem solving" is all that innate in me. I think it was learned by the
fact that the activity tended to reap higher rewards for me. I was simply
better at problem solving than many others, so it produced surprise and
respect and attention in the people around me (like my parents, and
teachers, and friends).

That "joy of problem solving" I believe is mostly a _learned_ secondary
reinforcer. I learned it was a good way to get attention and I became
addicted to that attention (which itself was more learned secondary
reinforcers).

How much can be attributed to innate features and how much was learned is
never easy to figure out, but there's clearly a strong secondary reinforcer
learning system at work guiding our learning. The brain includes a very
strong, and very powerful, future reward prediction system, which is the
primary source of all our conditioning.

And I'll throw this in just to show all this talk is not just hand waving.

Look at formula 6.2 on this page of Sutton's book:

http://www.cs.ualberta.ca/~sutton/book/ebook/node61.html

It shows how the agent's current understanding of the value of a state is
updated in response to each action it performs. The update works by
adjusting the current estimated value of the state, V(S(t)), towards the
current estimated "target", which is r(t+1) + yV(S(t+1)), where y stands
in for the discount factor gamma.

In that formula, r is the current "real" reward received at that point in
time, and the V(S(t+1)) is the system's current estimate of all future
rewards. The system is learning not from just the current real reward
r, but from the sum of the real reward, and its current best estimate of
all future rewards.

The V() array is in fact the system's current understanding of value - it's
the system's current understanding of how the current state of the
environment acts as an estimator of future rewards.

Or, from another perspective, the r is the real reward, and the yV(S(t+1)) is
the secondary reinforcer component of the reward used for learning.
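
In the more usual notation the update described above reads
V(S(t)) <- V(S(t)) + alpha * [r(t+1) + gamma * V(S(t+1)) - V(S(t))].
Below is a minimal Python sketch of it. The toy world is an assumption made
for illustration (a row of five states, an agent that always steps right,
and a reward of 1 on reaching the end), as are the step size alpha, the
discount gamma, and the names td0_update and run_chain.

```python
def td0_update(V, state, reward, next_state, alpha=0.1, gamma=0.9):
    """Move V[state] a step toward the target reward + gamma * V[next_state]."""
    target = reward + gamma * V[next_state]
    V[state] += alpha * (target - V[state])

def run_chain(episodes=500, n_states=5):
    """Run TD(0) on an assumed toy chain and return the learned value array."""
    # States 0..n_states-1 in a row; the agent always steps to the right.
    # Reaching the last state ends the episode and pays a reward of 1.
    V = [0.0] * n_states  # the terminal state's value is never updated
    terminal = n_states - 1
    for _ in range(episodes):
        state = 0
        while state != terminal:
            next_state = state + 1
            reward = 1.0 if next_state == terminal else 0.0
            td0_update(V, state, reward, next_state)
            state = next_state
    return V

if __name__ == "__main__":
    print(run_chain())
```

Only the last step ever receives a real reward, yet the earlier states end
up with values near 0.9, 0.81, and 0.729. Those values come entirely from
the gamma * V(S(t+1)) term, which is the secondary reinforcer component.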

The activity is its own reward - it is self
reinforcing.

Yes, but the system must LEARN the value of the action. It's not innate.
But that's why learning is so effective in shaping highly complex
behaviors in us. It's because the brain has the power (like these
algorithms described in Sutton's book) to learn how to estimate the
probability of receiving future rewards based on the current state of the
environment or based on the current action performed (action and state are
nearly interchangeable concepts in the domain of RL).

IMO, that's a major fact about "reinforcement learning." An
AI enterprise that ignores it will fail.

Which is why it's not ignored and why it's such a central feature of nearly
EVERY formula in Sutton's book. Though he might never use the words "the
action is its own reward", that is exactly what he's talking about.

In case it's not clear, let me show how this translates to
something like a simple board game. These algorithms will try to estimate
the value of a board position - which is the value of the state of the
environment for this domain. The question at hand is whether a given board
position is a predictor of a future reward (winning the game) or a
predictor of lack of reward (losing the game).

In the case of TD-Gammon, the value array was implemented not as a large
array with one element for every possible board position (the game of
Backgammon has too many board positions to make that practical for today's
computers), but as a function in the form of a neural network. Nonetheless,
the function produced a value from 0 to 1 which represented the
probability of the computer winning the game from that board position.

When the program made a move, if it led to a board position with a high
expected probability of winning, then that move was "rewarded". Such an
action was seen as "good" by the program. The move was its own reward,
because the move produced something the program _instantly_ recognized as
"good".

Now in the case of this game, the program can predict with 100% accuracy
how the environment will change before it makes the move. So it knows even
before it makes the move how "enjoyable" the result will be. In a more
complex and real example, the agent can't know for sure how the environment
will change ahead of time, so it has to select an action, and then wait to
see how much "joy" it got out of it once it finds out how the environment
has changed in response to the action.
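
The board game case, where the result of each move is known in advance, can
be sketched as picking the move whose resulting position the value function
rates highest. This is not TD-Gammon's actual code. The toy game, the
stand-in value function, and the names choose_move, toy_apply, and toy_value
are all hypothetical.

```python
def choose_move(position, legal_moves, apply_move, value):
    """Pick the move whose resulting position has the highest estimated value."""
    return max(legal_moves, key=lambda move: value(apply_move(position, move)))

def toy_apply(position, move):
    # Hypothetical toy game: a position is a number and a move adds to it.
    return position + move

def toy_value(position):
    # Stand-in for a learned value function: prefers positions close to 10.
    return 1.0 / (1.0 + abs(10 - position))

if __name__ == "__main__":
    print(choose_move(7, [1, 2, 3, 4], toy_apply, toy_value))  # prints 3
```

In a real program the value function would be the learned estimate of
winning from each position, and the same selection rule would apply
unchanged.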

Everything that we can sense, becomes part of that secondary reinforcer
prediction of future rewards. Not only do we sense how the environment
changes in response to our actions, we sense things like how our arms move
in response to the commands from the brain. When we see that our arms have
picked up the trash and correctly placed the trash in the trash can, we get
a joy out of this whole process - out of seeing our arms move as we wanted
them to, and in reaching the final state of having the trash relocated from
the floor to the trash can. So everything we are able to sense about the
entire process acts as a secondary reinforcer to reward, and strengthen,
our actions. Everything about the activity was the reward for our action
(assuming it all produced good things - which of course it doesn't always
do).

Keep in mind that most people do not enjoy problem solving the way an
engineer does - that's why most people are not engineers, nor aspire to
be.

Yes, that's very true.

OTOH, artists also solve problems, and engineers, significantly
enough, very rarely like to do what artists do. Yet both are problem
solvers. They solve problems in different ways. But they both engage in
self-reinforcing behaviours.

Yes, we all develop our own complex system of secondary reinforcers.

I don't paint because my efforts at painting in the past didn't produce the
sort of rewards that my efforts at mechanical problem solving did. My
system of secondary reinforcers - my brain's prediction of what is "good"
for me, will be very different than what other people's brains predict is
"good" for them.

What is a self-reinforcing behaviour? It's a loop. The agent gets
feedback from its own behaviour, not just from the environment. That
feedback is a reinforcing signal, to use your terminology. But note that
this feedback is not one of "value in the future" (to paraphrase what I
think you mean by "recognising value.")

Oh, it's very much a feedback of estimated future rewards. The value we
recognize is in effect all about future rewards. Nothing is immediate as
John would like to believe it is by oversimplifying the idea of immediate.
It's all an estimate of expected future rewards or "returns" as per the
language of reinforcement learning...

http://www.cs.ualberta.ca/~sutton/book/ebook/node30.html
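
For readers who don't follow the link: in the usual reinforcement learning formulation the "return" is the sum of future rewards, often with later rewards discounted by a factor gamma between 0 and 1. A small sketch, with the function name and the default gamma chosen only for illustration:

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of rewards, where a reward k steps ahead is weighted by gamma**k."""
    g = 0.0
    # Work backwards so each step is: reward now + gamma * (return after it).
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

With rewards [0, 0, 1] and gamma 0.5, the single reward two steps away is worth 0.25 now. That is the sense in which the value recognized at the moment of acting is an estimate about the future rather than something immediate.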

This notion of trying behaviors and favoring the ones that appear best
is, as I said, translated into precise algorithms in the rest of the
book, so even if it sounds like hand waving to you on that page, it's
not in the least bit hand waving by the time you finish the book.

Ptolemaic epicycles.... Fun, and even predictive, within the error range
of observations at the time. (Did you know that observationally there
was initially no discernible difference between Ptolemaic and Keplerian
predictions?

No, I've not studied that history.

Tycho Brahe was able to refine his observations to the
point where it was possible to argue, but not prove, that Kepler's model
was more accurate. The more precise observations that put Ptolemy's
model to rest came later.) IOW, this approach will result in useful
machines, but will not IMO solve the problem of artificial learning
(which it seems is a synonym for artificial intelligence.)

Well, I argue that learning is intelligence and intelligence is learning,
but not everyone supports that view.

Machine learning however I don't consider to be artificial. It's
artificial if you claim it to be a model of human learning, but I just
claim it to be intelligence. Just like a machine that walks on two legs is
not artificial walking, it's just walking.

The question at hand which we have not answered is how close to human
behavior can we get in a machine by implementing a reinforcement learning
algorithm? I believe we will get so close to human behavior from the
machine that none of these points will be debated in the future. People
will simply understand the human brain to be a reinforcement learning
machine controlling our actions.

But if I'm wrong, and it takes a lot of specialized modules which simply
include some elements of learning (maybe many different elements of
learning) along with lots of other specialized features (as people like
John and Dan argue), then no one will consider the brain to be _just_ a
reinforcement learning machine just as no one considers a plane to be
_just_ an airfoil. The airfoil that provides lift is just one of many
elements that make up a working airplane.

The authors state in their introduction: "Rather than directly
theorizing about how people or animals learn, we explore idealized
learning situations and evaluate the effectiveness of various learning
methods." Anyone who believes they can construct idealised learning
situations without theorizing about how it's actually done by humans
and other animals is not likely to produce much of value.

Note that the Sutton book is about computer learning algorithms and not
about humans or animals. When he uses the term "reinforcement
learning" he is NOT making direct reference to human learning, he's
talking about the very specific field of computer research into a class
of _machine_ learning algorithms which are researched by people such as
himself.

I'm aware of that. The enterprise hasn't gotten very far, though. The
best it's been able to do is make smart washing machines and cunning
digital cameras. That is certainly AI, but very limited. Mostly, I don't
see any obvious way to generalise these machines.

Yeah, that's the rub. What works today seems to still be 100 miles away
from human behavior. I think it's actually extremely close, but most won't
understand how close we were until we look back in retrospect after it's
been solved.

OTOH, it's quite possible (even likely IMO) that these limited machines
will turn out to be components of generalised learning machines of the
type you seem to be pursuing.

My very strong belief is that strong generalized learning machines will show
obvious (and shocking to most) levels of intelligence once they are
implemented correctly. I believe once these machines are created, it will
be a sudden, and highly shocking (to most) eye opening experience for
society in general.

It's much like trying to duplicate an encryption algorithm. Close doesn't
count. You can have 99.9% of the encryption algorithm correct, and it will
still be producing the wrong answer just as often as a random number
generator does. I think this is the same case with poorly implemented
reinforcement learning algorithms. We are very close, but they still look
like they have about as much intelligence as a random number generator.

Only time will tell if I'm right.

How close this class of computer algorithms is to what the brain does,
he makes no speculation about in the book as far as I know. What you
quote above is him making it clear he's not attempting to research or
describe human learning in the book.

I'm the only one here making the bold hand-waving claim that human
intelligence _IS_ a reinforcement learning (in the computer science
sense) process. It's my claim, not Sutton's, that got John to look at
that book, and then quote it in _our_ context of debating the human
brain (because it's the foundation of _my_ debate, not his, and not
Sutton's).

I think you are half right. We are "reinforcement learning machines",
but not in the computer sense. Computers are still linear machines -
they appear to do loopy thinking sometimes only because they do several
linear tasks very fast one right after the other. This is so even for
multi-core machines. So far. We are not linear machines.

Well, I agree completely in terms of how most of our software systems are
written. But I don't believe that has anything to do with computers in
general - only in how we tend to structure and write our software.

The type of programs I play with have none of the "linear behavior" you are
thinking of. Though the code is, at the low level, very linear - as it must
be per the design of the computer, it creates a simulation of a higher
level process (parallel signal processing network) which isn't linear in
the least.

You might have been explicitly talking about operant conditioning in
humans or the like when you asked John about reinforcement learning,
but what he quoted you was not a book on human behaviour or human
learning, but a book about computer algorithms.

I don't actually know what Rich Sutton thinks about the connection
between RL algorithms and human intelligence. His work is in AI, but
whether he thinks RL research will explain full human intelligence as I
do, or not, I just don't know.


I'll be reading more of the book.

I'm mulling over a block diagram of an operant-conditioning capable
machine. If it ever gets to the point where I think it may work, I'll
get back to you.

Sure, sounds like fun.

My block diagram is a pulse sorting network trained by reinforcement with
global feedback (and local feedback inside the pulse sorting network).
Though it's implemented as a highly linear system at the lowest level of
sorting pulses one at a time, it's doing this pulse sorting a million times
a second, and each of these pulse sorting paths through the network
constitutes a few hundred or more decisions the system has made. Each node
in the network is in effect its own parallel learning process so a
simulated network with a billion nodes (well within the reach of our
current computer technology) is actually simulating the behavior of a
billion parallel independent and communicating learning processes. Each
node is its own micro reinforcement learning machine, and the behavior of
the entire network is the combined behavior of the society of all these
micro machines working together in parallel in real time.
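
To make the description above a little more concrete, here is a toy sketch. It is NOT the actual design, whose details aren't given here. It assumes a binary tree of nodes, two outputs per node, a single global reward that is never negative, and a simple "strengthen the branch you took" rule. All class and function names are hypothetical, and the local feedback mentioned above is not modelled.

```python
import random

class Node:
    """One micro learner: routes each pulse to one of two outputs."""
    def __init__(self):
        self.weights = [1.0, 1.0]        # preference for output 0 / output 1

    def route(self, rng):
        p0 = self.weights[0] / (self.weights[0] + self.weights[1])
        return 0 if rng.random() < p0 else 1

class SortingNetwork:
    """Toy tree of routing nodes; level k holds 2**k nodes (an assumption)."""
    def __init__(self, depth, seed=0):
        self.depth = depth
        self.rng = random.Random(seed)
        self.nodes = [[Node() for _ in range(2 ** k)] for k in range(depth)]

    def send_pulse(self):
        """Sort one pulse through the network, one node decision per level."""
        position, path = 0, []
        for level in range(self.depth):
            node = self.nodes[level][position]
            choice = node.route(self.rng)
            path.append((node, choice))
            position = 2 * position + choice
        return position, path            # position is the output the pulse left by

    def reinforce(self, path, reward, rate=1.0):
        """Global feedback: every node on the path strengthens its choice.
        Assumes reward >= 0 so the weights stay positive."""
        for node, choice in path:
            node.weights[choice] += rate * reward

def train(network, target, pulses):
    """Reward pulses that exit at `target`; return the list of hits."""
    hits = []
    for _ in range(pulses):
        output, path = network.send_pulse()
        network.reinforce(path, 1.0 if output == target else 0.0)
        hits.append(output == target)
    return hits
```

Even though pulses are processed strictly one at a time, each node only ever adjusts its own two weights. The routing of the whole network is therefore the combined result of many small independent learners, which is the point being made above.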

It's really nothing like how traditional software is structured (but very
much like how many neural network programs are structured). It produces
behavior very much unlike what we normally see on our computers.

--
Curt Welch http://CurtWelch.Com/
curt@xxxxxxxx http://NewsReader.Com/