version of Golub Chan LeVeque stable algorithm for sample variance applicable to weighted observations
- From: William M. Grove <grove001@xxxxxxx>
- Date: Fri, 7 Jul 2006 16:45:54 -0500
The subject line says it. Due to either congenital or acquired brain
damage I'm having trouble today deriving the correct version of their
algorithm for an array of data X[1:n], processed serially:
mean = 0.0
var = 0.0
do i=1 to n
mean = 1/i * X[i] + (i-1)/i * mean
var = (i-2)/i^2 * (X[i] - mean)^2 + (i-1)/i * oldVar
end do
As I say, I'm having trouble with the adaptation of the second formula
to the situation where there is a parallel array of weights p[1:n], say
between 0 and 1 exclusive. The adaptation for the mean is simple
enough, e.g.,
mean_w = 0.0
sum_p = 0.0
do i=1 to n
sum_p = sum_p + p[i]
mean_w = p[i]/sum_p * X[i] + (1.0 - p[i]/sum_p) * mean_w
end do
The intuitive extension, namely replacing i everywhere by sum_p and
subtractions of 1 by -p[i] and subtractions of 2 by -2p[i], produces a
variance for small and large i alike that is way off; for large i it is
_very_ approximately twice too large.
The unintuitive extension
var_w = (sum_p - p[i])/i^2*(X[i]-mean_w)^2 + (sum_p - 2*p[i])/i*var_w
comes rather close to the weighted variance, calculated by R or by SAS,
for long sequences of random observations (e.g., N=1000, it's about a
factor of 1.00003385 off per iteration for Gaussian numbers and random
uniform weights), but is way off for short sequences like N=10; so
obviously I am not getting something fundamental here.
Can anybody help out? Thanks in advance.
Will Grove
Dept. of Psychology
U. of Minnesota
.
- Follow-Ups:
- Prev by Date: Re: plotting regression lines after an ANCOVA
- Next by Date: Re: Are graded clinical signs more reliable than dichotomized?
- Previous by thread: Alpha, beta, power, CL, CI
- Next by thread: Re: version of Golub Chan LeVeque stable algorithm for sample variance applicable to weighted observations
- Index(es):
Relevant Pages
|