Re: Q(λ)-learning algorithm question



kartoun@xxxxxxxxx wrote:
Me too, I don't know how to do it (for now). We can collaborate
scientifically if you'd like.

About the above two papers I promised to review:

1) R. Matthew Kretchmar. Parallel Reinforcement Learning. The 6th World
Conference on Systemics, Cybernetics, and Informatics. 2002.

Kretchmar investigates the problem of multiple reinforcement learning
agents attempting to learn the value function of a particular task in
parallel for the n-armed bandit task. A parallel reinforcement learning
solution is suggested to overcome the problem of statistic overwhelming
by an agent’s information that is correspondingly has a larger
accumulated experience than the other agents. To overcome this problem
each agent keeps track of two sets of parameters: (i) one set for the
actual independently experienced trials of a particular agent, and (ii)
an additional set for combined trials among all other agents. The
agents share accumulated experience by keeping separate parameters for
their own independent experience and the combined experience of all
other agents. In addition, the agents can compute an estimate (general
Q value which is a weighted combination of a particular agent with all
the other agents) based upon the global experience. This estimate is
computed from a weighted average of the agent’s own independent
experience and the accumulated experience of all other agents. Results
show that as more as you add agents, learning is accelerated because
there is a larger pool of accumulated experience upon which to base
future estimates. The experiment with 10 parallel agents (the highest
number of agents in Kretchmar's experiments) learns the fastest.

2) Your paper: Odalric-Ambrym Maillard, Rémi Coulom, and Philippe
Preux. Parallelization of the TD(λ) Learning Algorithm.

Odalric-Ambrym et al. present a method to parallelize the TD(λ)
algorithm for reducing computation times for various learning tasks. An
extensition to Kretchmar's is presented in which agents share their
experience by averaging their value functions.

More precisely, the weights of the parametric approximators are averaged, which is not equivalent to averaging the value functions, since the value does not depend linearly on weights in feedforward neural networks.


This is done for
multi-state episodic tasks, using the TD(λ) algorithm and generalizing
function approximators. Experiments using the same TD(λ) algorithm
conducted on three kinds of problems: (i) the pendulum problem; (ii)
the cart-pole problem and (iii) the swimmer problem, show that using
more than an average seven processes reveals no speeding up in
computation times.

Some remarks (in my opinion):
1) You had to mention neural networks in the abstract.
2) "Nevertheless, for this method to be successful, we assume that the
weight changes between two cycles differs little from what we would
have with the sequential method, and this does not hold if P or G is
too big.". Why? Please define "too big".
3) What is a small G? How small?

Numerical values are provided on the plots. In general, what is big and small depends on the problem and approximation architecture.


4) Why more than 7 processes don't speed up computation times? Is there
any reasonable explanation for that? It's task specific I believe - but
you tried it for 3 different tasks, didn't you?

There are two kind of losses: losses because of time lost in communication (processes are running on different machines that communicate with MPI over an ethernet network), and losses because running trials in parallel is not equivalent to running them sequentially. Numerical values are different for the 3 tasks.


Odalric wrote a longer report in French, that I can e-mail to you if you are interested.

5) Where was it published?
6) At what year?

_Seventh European Workshop on Reinforcement Learning_, Naples, Oct 2005


Thanks,

Uri.


Thanks for your feedback,

Rémi
.