Re: version of Golub Chan LeVeque stable algorithm for sample variance applicable to weighted observations



In article <MPG.1f1891fbfe46c4c2989681@xxxxxxxxxxxxxxxxxxxxxx>, grove001
@umn.edu says...
The subject line says it. Due to either congenital or acquired brain
damage I'm having trouble today deriving the correct version of their
algorithm for an array of data X[1:n], processed serially:

mean = 0.0
var = 0.0
do i=1 to n
mean = 1/i * X[i] + (i-1)/i * mean
var = (i-2)/i^2 * (X[i] - mean)^2 + (i-1)/i * oldVar
end do

As I say, I'm having trouble with the adaptation of the second formula
to the situation where there is a parallel array of weights p[1:n], say
between 0 and 1 exclusive. The adaptation for the mean is simple
enough, e.g.,

mean_w = 0.0
sum_p = 0.0
do i=1 to n
sum_p = sum_p + p[i]
mean_w = p[i]/sum_p * X[i] + (1.0 - p[i]/sum_p) * mean_w
end do

The intuitive extension, namely replacing i everywhere by sum_p and
subtractions of 1 by -p[i] and subtractions of 2 by -2p[i], produces a
variance for small and large i alike that is way off; for large i it is
_very_ approximately twice too large.

The unintuitive extension
var_w = (sum_p - p[i])/i^2*(X[i]-mean_w)^2 + (sum_p - 2*p[i])/i*var_w

comes rather close to the weighted variance, calculated by R or by SAS,
for long sequences of random observations (e.g., N=1000, it's about a
factor of 1.00003385 off per iteration for Gaussian numbers and random
uniform weights), but is way off for short sequences like N=10; so
obviously I am not getting something fundamental here.

Can anybody help out? Thanks in advance.

Will Grove
Dept. of Psychology
U. of Minnesota



Sorry to have bothered you gentleman. My algebra returned to normal.
Here is the update, in Fortran:

sum_p = 0.0
mean = 0.0
var = 0.0
i = 0
DO
i = i + 1
...
! X gets generated in here, it's the integrand
X = .012345 ! for example
...
! p gets generated in here, it's the weight, ranging between 1.0 and
arbitrarily
! high values
p = 567.890 ! for example
...
sum_p_old = sum_p
sum_p = sum_p + p
prob = p / sum_p ! normalize so it has metric of a probability weight
R = sum_p_old / sum_p ! approximately 1 after a relatively few
observations
dev = X - mean
delta_mean = prob * dev ! equals difference between mean of X1...Xn
and mean of X1...X(n-1)
var = prob * (dev - delta_mean)**2 + (1-prob)*R * var
mean = prob * dev + (1-prob) * mean
... calculate convergence check based on (SQRT(var) / i / mean) ...
IF (converged) EXIT
END DO

I got the algebra for this by induction on i=1, 2, 3, ... and then
checked it numerically on n=3, 10, 1000, with Gaussian, uniform, etc.
X's and arbitrary nonnegative p's. Accurate to however many decimal
places R wants to print, so I'm happy.

.



Relevant Pages

  • version of Golub Chan LeVeque stable algorithm for sample variance applicable to weighted observatio
    ... I'm having trouble with the adaptation of the second formula ... to the situation where there is a parallel array of weights p, ... but is way off for short sequences like N=10; ...
    (sci.stat.consult)
  • Been told I have to stop training
    ... Been having lower back trouble for about 18 months now. ... steroid and heat shots in my facet joints to help with the pain. ... also that weights may make it worse. ...
    (uk.rec.bodybuilding)
  • Re: Been told I have to stop training
    ... >Been having lower back trouble for about 18 months now. ... >steroid and heat shots in my facet joints to help with the pain. ... >also that weights may make it worse. ...
    (uk.rec.bodybuilding)
  • Re: Software suggestions
    ... E.g. least squares. ... a "good" set of weights. ... then you are in trouble. ... learning is far from a solved problem. ...
    (sci.stat.math)
  • Re: ISO help in probability
    ... Helmut Giese wrote: ... > probability by multiplying a random value with the associated weight, ... intervals based on the weights you have. ... proc create_wtable {var weights} { ...
    (comp.lang.tcl)