REINFORCE algorithm



I've been trying to implement Ronald J. Williams's REINFORCE algorithm,
based on these two papers:
1-
2-
As the reward I'm using a simple function of the form
-abs(p1-39)-abs(p2-73), where p1 and p2 are the parameters to be learned.
It doesn't converge at all. With only one parameter it converges in
about 800 turns, but with more than one there is no convergence.
Worst of all, I don't know what's wrong with my code (obviously)!
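
To make the target concrete, the reward can be written as an anonymous function. This is only a sketch, and the name "reward" is a label chosen here for illustration. Its maximum is 0, reached only at p1=39, p2=73, so that is the only point at which a loop of the form "while r<0" can stop.

```matlab
% Sketch: the test reward as an anonymous function (the name "reward" is assumed).
reward = @(p1,p2) -abs(p1-39)-abs(p2-73);

reward(19,42)   % default starting point: -20-31 = -51
reward(39,73)   % the optimum: 0
```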

Can anyone give me a hand, please? Or is there an existing
implementation of it that I could learn from?
Here's my code in Matlab:

function rltest(p1,p2);

if nargin<2
    p1=19;
    p2=42;
end
% x is Input vector
x=[p1 p2];
% m: no. of inputs; n: no. of output cells. I've assumed output to be
% between 0 and 127
m=size(x,2);
n=6*m;
% Learning Rate:
alpha=.00001;
% ybar and rbar are traces of the outputs and the reward, respectively
ybar=zeros(n,1);
rbar=0;
% Weight Decay rate
delta=0.01;
% Decay rate of the traces ybar and rbar
gama=0.9;
turn=0;

% Matrix of Weights, and Weight vector for bias input:
W=50*rand(n,m);
W0=50*rand(n,1);

% Binary place values used to decode each group of n/m output bits
serie=[2.^[0:(n/m-1)]];

r=-10000;
while r<0
    for i=1:n
        % s is the weighted sum of the inputs to each output unit
        s(i)=sum(W(i,:).*x)+W0(i);
        % Output units are Bernoulli logistic
        f(i)=1/(1+exp(-s(i)));
        y(i)=double(rand<f(i));
        % Weight update, with the last term being weight decay
        deltaW(i,:)=alpha*(r-rbar)*(y(i)-ybar(i)).*x-delta*W(i,:);
        deltaW0(i)=alpha*(r-rbar)*(y(i)-ybar(i))-delta*W0(i);
        ybar(i)=gama*ybar(i)+(1-gama)*y(i);
    end
    r=-abs(p1-39)-abs(p2-73);
    W=deltaW+W;
    W0=deltaW0'+W0;
    rbar=gama*rbar+(1-gama)*r;

    % New inputs are calculated from outputs
    Y=reshape(y,m,n/m);
    for i=1:m
        x(i)=sum(Y(i,:).*serie);
    end

    p1=x(1);
    p2=x(2);
    turn=turn+1;
end
Reward=r
Turn=turn
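
For comparison, here is a minimal sketch of a plain REINFORCE step for Bernoulli-logistic units. It is not the code above: it uses the eligibility (y - f) instead of the output trace (y - ybar), it drops the weight decay and the input weights W (bias weights only), and it computes the reward from the parameters decoded from the current sample. It also assumes 7 bits per parameter, because 6 bits (n=6*m) can only encode values from 0 to 63. The names nbits and p, the value of alpha, and the fixed number of turns are assumptions made for illustration only.

```matlab
% Sketch only: REINFORCE with Bernoulli-logistic units and a reward baseline.
m = 2;                        % number of parameters
nbits = 7;                    % bits per parameter (assumed, covers 0..127)
n = nbits*m;                  % number of output units
alpha = 0.01;                 % learning rate (illustrative value)
gama = 0.9;                   % decay rate of the reward trace
W0 = zeros(n,1);              % bias weights only, for simplicity
rbar = 0;                     % reward baseline (trace of past rewards)
serie = 2.^(0:nbits-1);       % binary place values

for turn = 1:1000
    f = 1./(1+exp(-W0));                % firing probability of each unit
    y = double(rand(n,1) < f);          % sampled binary outputs
    Y = reshape(y, m, nbits);           % row i holds the bits of parameter i
    p = Y*serie';                       % decoded parameters, each in 0..127
    r = -abs(p(1)-39) - abs(p(2)-73);   % reward for THIS sample
    W0 = W0 + alpha*(r-rbar)*(y-f);     % eligibility of a bias weight is (y - f)
    rbar = gama*rbar + (1-gama)*r;      % update the baseline after the step
end
```

The point of the sketch is the pairing: the reward r, the sample y and the probabilities f used in one update all come from the same turn.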
********************
Belera
belera@xxxxxxxxx
