Re: version of Golub Chan LeVeque stable algorithm for sample variance applicable to weighted observations
- From: William M. Grove <grove001@xxxxxxx>
- Date: Mon, 10 Jul 2006 19:14:19 -0500
In article <MPG.1f1891fbfe46c4c2989681@xxxxxxxxxxxxxxxxxxxxxx>, grove001
@umn.edu says...
The subject line says it. Due to either congenital or acquired brainSorry to have bothered you gentleman. My algebra returned to normal.
damage I'm having trouble today deriving the correct version of their
algorithm for an array of data X[1:n], processed serially:
mean = 0.0
var = 0.0
do i=1 to n
mean = 1/i * X[i] + (i-1)/i * mean
var = (i-2)/i^2 * (X[i] - mean)^2 + (i-1)/i * oldVar
end do
As I say, I'm having trouble with the adaptation of the second formula
to the situation where there is a parallel array of weights p[1:n], say
between 0 and 1 exclusive. The adaptation for the mean is simple
enough, e.g.,
mean_w = 0.0
sum_p = 0.0
do i=1 to n
sum_p = sum_p + p[i]
mean_w = p[i]/sum_p * X[i] + (1.0 - p[i]/sum_p) * mean_w
end do
The intuitive extension, namely replacing i everywhere by sum_p and
subtractions of 1 by -p[i] and subtractions of 2 by -2p[i], produces a
variance for small and large i alike that is way off; for large i it is
_very_ approximately twice too large.
The unintuitive extension
var_w = (sum_p - p[i])/i^2*(X[i]-mean_w)^2 + (sum_p - 2*p[i])/i*var_w
comes rather close to the weighted variance, calculated by R or by SAS,
for long sequences of random observations (e.g., N=1000, it's about a
factor of 1.00003385 off per iteration for Gaussian numbers and random
uniform weights), but is way off for short sequences like N=10; so
obviously I am not getting something fundamental here.
Can anybody help out? Thanks in advance.
Will Grove
Dept. of Psychology
U. of Minnesota
Here is the update, in Fortran:
sum_p = 0.0
mean = 0.0
var = 0.0
i = 0
DO
i = i + 1
...
! X gets generated in here, it's the integrand
X = .012345 ! for example
...
! p gets generated in here, it's the weight, ranging between 1.0 and
arbitrarily
! high values
p = 567.890 ! for example
...
sum_p_old = sum_p
sum_p = sum_p + p
prob = p / sum_p ! normalize so it has metric of a probability weight
R = sum_p_old / sum_p ! approximately 1 after a relatively few
observations
dev = X - mean
delta_mean = prob * dev ! equals difference between mean of X1...Xn
and mean of X1...X(n-1)
var = prob * (dev - delta_mean)**2 + (1-prob)*R * var
mean = prob * dev + (1-prob) * mean
... calculate convergence check based on (SQRT(var) / i / mean) ...
IF (converged) EXIT
END DO
I got the algebra for this by induction on i=1, 2, 3, ... and then
checked it numerically on n=3, 10, 1000, with Gaussian, uniform, etc.
X's and arbitrary nonnegative p's. Accurate to however many decimal
places R wants to print, so I'm happy.
.
- References:
- Prev by Date: Re: goodness of fit test for data over time
- Next by Date: Re: Test of significance of a mean value?
- Previous by thread: Re: version of Golub Chan LeVeque stable algorithm for sample variance applicable to weighted observations
- Next by thread: which model (GLMs)is the best?
- Index(es):
Relevant Pages
|