Re: CLT and regression
- From: "Ray Koopman" <koopman@xxxxxx>
- Date: 19 Apr 2006 02:51:13 -0700
richardstartz@xxxxxxxxxxx wrote:
On Wed, 19 Apr 2006 08:00:56 +0300, "Anon."
<bob.ohara@xxxxxxxxxxxxxxxxx> wrote:
r.c.reulen@xxxxxxxxx wrote:
Hello,
Can someone explain how the central limit theorem is related to
regression analysis? I understand the basics of the CLT, but don't
understand its relationship with regression analysis. I am conducting
an analysis with approx. 1000 subjects. These subjects have a score
between 0-100 on a certain physical functioning (PF) scale. PF is the
dependent variable in my analysis. The distribution of these scores is
highly skewed. Can I still use linear regression analysis? Or should I
go for non-parametric or bootstrapping techniques?
To follow up Ray's post, you should fit the regression (1000 subjects
isn't that large!), and look at the residuals for normality (e.g. with
normal probability plots). They may be normal, in which case you're
fine. If they're not, e.g. if they're skewed, then you could look at
using a Box-Cox transformation to get them normal: i.e. you use a power
transformation (y^alpha, or log(y) if alpha=0 is indicated).
Generally you don't have to be too precise with the transformation: for
positively skewed residuals, trying square root, cube root and log
transformations often gets you close enough to normality.
HTH
Bob
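As a rough illustration of the residual check Bob describes, here is a minimal Python sketch. The data are made up (an assumption, not the OP's PF scores): log(y) is linear in a single predictor with normal errors, so y is right-skewed. Sample skewness of the residuals stands in for a normal probability plot, and the helper names ols_residuals and skewness are hypothetical.

```python
import math
import random
import statistics


def ols_residuals(x, y):
    """Residuals from the least-squares line y = a + b*x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]


def skewness(values):
    """Moment-based sample skewness; roughly 0 for symmetric data."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)


# Made-up data (an assumption): log(y) is linear in x with normal errors,
# so y itself is positive and right-skewed.
rng = random.Random(0)
x = [rng.uniform(0.0, 1.0) for _ in range(1000)]
y = [math.exp(1.0 + xi + rng.gauss(0.0, 0.5)) for xi in x]

# The candidate transformations named in the post, plus the untransformed dv.
transforms = {
    "raw": lambda v: v,
    "square root": math.sqrt,
    "cube root": lambda v: v ** (1.0 / 3.0),
    "log": math.log,
}

# Refit the line after each transformation and measure residual skewness.
residual_skew = {
    name: skewness(ols_residuals(x, [f(v) for v in y]))
    for name, f in transforms.items()
}
for name, value in residual_skew.items():
    print(f"{name:12s} residual skewness = {value:+.2f}")
```

With data built this way the log transformation should bring the residual skewness closest to zero. On real scores the best choice may differ, and a bounded 0-100 scale with zeros would need an offset before taking logs.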
Let me disagree with some of the advice you've been given. With 1,000
observations, it's probably of very little importance that the errors
(or residuals) be normal. The CLT *is* relevant, because it says that
the estimated regression coefficients will be approximately normal so
long as the errors are independent.
Since this is based on an approximating argument, there is some level
of deviation from normality that will make the normal
distribution approximation for the coefficients a poor one.
-*** Startz
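The CLT argument above can be checked by simulation. The sketch below is only an illustration under stated assumptions: one predictor, a fixed design, a true slope of 2, and independent errors drawn from a shifted exponential distribution (skewness 2). None of these values come from the thread.

```python
import random
import statistics


def skewness(values):
    """Moment-based sample skewness; roughly 0 for symmetric data."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)


rng = random.Random(1)
n, reps = 1000, 500          # 1000 subjects, 500 simulated studies
true_slope = 2.0             # assumed value, for illustration only

# Fixed design: the same predictor values are reused in every replication.
x = [rng.uniform(0.0, 1.0) for _ in range(n)]
mx = statistics.fmean(x)
xc = [xi - mx for xi in x]   # centred predictor
sxx = sum(c * c for c in xc)

slopes = []
for _ in range(reps):
    # Independent, strongly skewed errors with mean zero.
    y = [true_slope * xi + rng.expovariate(1.0) - 1.0 for xi in x]
    # Least-squares slope; because xc sums to zero, y need not be centred.
    slopes.append(sum(c * yi for c, yi in zip(xc, y)) / sxx)

print(f"mean of estimated slopes     = {statistics.fmean(slopes):.3f}")
print(f"skewness of estimated slopes = {skewness(slopes):+.3f}")
```

If the approximation holds, the estimated slopes should centre on the true slope and show skewness near zero even though each error term has skewness 2. Shrinking n, or making the errors much heavier-tailed, is one way to see where the approximation starts to degrade.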
Yes, with 1000 cases the sampling distribution of the estimated
weights should be reasonably close to normal, unless the error
distribution is somehow pathological, and the estimated weights will
probably be not too far from the true weights, as long as all the
relevant predictors are included and there are not too many of them.
However, non-normality of the errors hinders inference not so much by
making the sampling distribution of the estimated weights non-normal
as by (a) destroying the independence of the estimate of the error
variance from the estimates of the weights, and (b) destroying the
chi-square-ness of the sampling distribution of the estimated error
variance. Both of these effects make it difficult to establish a
valid confidence region for the weight vector.
In other words, the sample size is probably big enough that the
weights are not too bad, but it may be difficult to say just how
much to trust them or predictions made using them.
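Points (a) and (b) can also be seen in a small simulation. This is a sketch under assumptions not in the post: one predictor, unit error variance, and the intercept of the centred model (the fitted value at the mean of x) used as the "weight" of interest, because that is where the effect of skewness shows most clearly. The helper names are hypothetical.

```python
import random
import statistics


def correlation(u, v):
    """Pearson correlation of two equal-length lists."""
    mu, mv = statistics.fmean(u), statistics.fmean(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sum((a - mu) ** 2 for a in u) ** 0.5
    sv = sum((b - mv) ** 2 for b in v) ** 0.5
    return cov / (su * sv)


def simulate(draw_error, reps=500, n=1000, seed=2):
    """Return (corr(intercept, s2), Var(s2) relative to its chi-square value)."""
    rng = random.Random(seed)
    x = [rng.uniform(0.0, 1.0) for _ in range(n)]
    mx = statistics.fmean(x)
    xc = [xi - mx for xi in x]
    sxx = sum(c * c for c in xc)
    intercepts, variances = [], []
    for _ in range(reps):
        y = [1.0 + 2.0 * c + draw_error(rng) for c in xc]
        a = statistics.fmean(y)                       # intercept at mean of x
        b = sum(c * yi for c, yi in zip(xc, y)) / sxx # slope
        rss = sum((yi - a - b * c) ** 2 for c, yi in zip(xc, y))
        intercepts.append(a)
        variances.append(rss / (n - 2))               # estimated error variance
    # With normal errors and sigma = 1, (n-2)*s2 is chi-square on n-2 df,
    # so Var(s2) = 2/(n-2); the ratio below is then close to 1.
    ratio = statistics.pvariance(variances) / (2.0 / (n - 2))
    return correlation(intercepts, variances), ratio


normal = simulate(lambda rng: rng.gauss(0.0, 1.0))
skewed = simulate(lambda rng: rng.expovariate(1.0) - 1.0)
print(f"normal errors: corr = {normal[0]:+.2f}, variance ratio = {normal[1]:.2f}")
print(f"skewed errors: corr = {skewed[0]:+.2f}, variance ratio = {skewed[1]:.2f}")
```

With normal errors the correlation should sit near zero and the variance ratio near one. With the skewed errors both should move well away from those values, which is the sense in which the usual t and F machinery loses its exact justification even when the estimates themselves look roughly normal.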
I suspect that there are things other than non-normality that the OP
might more profitably worry about, all of them following from the fact
that the dv is bounded (0-100). First, if the skew is due to ceiling
or floor effects, with large tie blocks of cases piled up at one
extreme or the other, then some form of multi-category logistic
regression may be needed, possibly with some constraint on the widths
of the interior categories.
Another result of a bounded dv is the reduced error variance that
attaches to cases whose true values are close to either extreme. From
Gauss-Markov we know that such unrecognized heteroscedasticity makes
the estimated weights less efficient than they could be, and makes the
estimated error variance too optimistic. So, even if there are no
large tie blocks, a weighted least-squares approach may be called for.
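A minimal sketch of what a weighted least-squares fit might look like, assuming a single predictor and assuming the error standard deviation is known up to a constant and shrinks toward the ends of the 0-100 scale. The variance function, the true line, and the helper name weighted_line are all made up for illustration; in practice the variances would have to be estimated.

```python
import random


def weighted_line(x, y, w):
    """Weighted least-squares fit of y = a + b*x, minimising sum(w * resid**2)."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw   # weighted mean of x
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw   # weighted mean of y
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    b = sxy / sxx
    return my - b * mx, b


rng = random.Random(3)
n = 1000
x = [rng.uniform(0.0, 1.0) for _ in range(n)]

# Assumed model: true scores run from 10 to 90, and the error SD is
# proportional to p*(1-p) with p = score/100, so it is smallest near the bounds.
mean = [10.0 + 80.0 * xi for xi in x]
sd = [20.0 * (m / 100.0) * (1.0 - m / 100.0) for m in mean]
y = [rng.gauss(m, s) for m, s in zip(mean, sd)]

# Weights are inverse variances; equal weights reproduce ordinary least squares.
weights = [1.0 / s ** 2 for s in sd]
a_ols, b_ols = weighted_line(x, y, [1.0] * n)
a_wls, b_wls = weighted_line(x, y, weights)
print(f"OLS: intercept = {a_ols:.2f}, slope = {b_ols:.2f}")
print(f"WLS: intercept = {a_wls:.2f}, slope = {b_wls:.2f}")
```

Both fits are unbiased for the line here; the gain from weighting is in efficiency, so the difference shows up over repeated samples rather than in any single fit.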
Finally, if there are many cases near the extremes then a
transformation of the dv may be needed to linearize its relation to
the predictors. I see this as the "proper" reason for transforming the
dv. In general, transforming the dv changes the form of its relation
to the predictors. If a transformation that homogenizes the error
variances also linearizes the relation to the predictors, then fine.
But if a transformation that homogenizes the error variances also
destroys linearity, then either the form of the model should be
changed accordingly, or the transformation should not be done and
the heteroscedasticity should be dealt with in some other way.