Re: Correlation coefficient not suited for small samples!? Cross validation as an alternative?



On 18 Apr 2006 02:40:02 -0700, "zorritillito-googlegroups@xxxxxxxx"
<zorritillito-googlegroups@xxxxxxxx> wrote:

Thank you all very much!

Richard Ulrich wrote:

R-squared is a biased estimate, since the expectation of
R-squared under the null hypothesis of no-relation is
k/(n-1) for k variables in the regression with n cases.

The formula "k/(n-1)" is especially useful to me. I found it
neither in my textbooks nor on the internet. To be sure, may I ask you,

This relation comes from statistical estimation theory, and
I didn't find it in a quick search of the web, either.

if k is - as I would suppose - the number of regressors, not counting
the output variable; and if the formula has any restrictions, such as
normality?

At the limit, it is evident that a regression line with one variable
(and an intercept) will perfectly fit any 2 distinct points -- R^2
is 1.0. The simple extension works out so that 2 variables fit
3 points, and so on. The only restriction that I am aware of is
that the distributions be "continuous", in the sense that there
are no ties, so that all the points are distinct. -- If you let the
same X-vector be paired with two different y values, the R^2 is
not going to be 1.0.
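A small simulation can check the k/(n-1) figure. The sketch below is an
illustration only, not something taken from the thread: it assumes NumPy
is available, it draws independent standard normal X and y so that the
null hypothesis of no relation holds, and the helper names r_squared and
mean_null_r_squared are hypothetical. It also checks the limiting case
k = n-1, where the fit should be exact.

```python
import numpy as np


def r_squared(X, y):
    """R^2 of an ordinary least-squares fit of y on X plus an intercept."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])          # prepend the intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)  # least-squares coefficients
    resid = y - A @ beta
    ss_res = resid @ resid                        # residual sum of squares
    ss_tot = np.sum((y - y.mean()) ** 2)          # total sum of squares
    return 1.0 - ss_res / ss_tot


def mean_null_r_squared(n, k, reps=5000, seed=0):
    """Average R^2 over `reps` samples in which y is unrelated to X.

    Assumption for this sketch: X and y are independent standard normals.
    """
    rng = np.random.default_rng(seed)
    values = [
        r_squared(rng.standard_normal((n, k)), rng.standard_normal(n))
        for _ in range(reps)
    ]
    return float(np.mean(values))


n, k = 10, 3
sim = mean_null_r_squared(n, k)
print("simulated mean R^2:", round(sim, 3))
print("k/(n-1):           ", round(k / (n - 1), 3))

# Limiting case: k = n - 1 regressors (plus intercept) fit n points exactly.
saturated = mean_null_r_squared(4, 3, reps=200)
print("mean R^2 with k = n-1:", saturated)
```

With n = 10 and k = 3 the simulated mean should land close to 1/3, even
though y has nothing to do with X.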



As to the cross validation that I mentioned, I am afraid that this was
a silly thought of mine.


Regression coefficients have less dependence on variance,
so they are a *better* measure of linear relationship.

As far as I understand, the regression coefficient does not work for
my purpose. Since I want to estimate the strength of the linear
relation, I guess I would have to standardize the regression
coefficient - ending up again with the correlation coefficient.

What is your "purpose"?
I mean, I think, you want to consider the universe
of comparisons that you are comparing *this* result to.

If "everyone" talks about correlations, then that is what
you need to refer to. I was saying, especially for a small
sample, you have less assurance that a random sample is going
to represent the range of data. (And correlations suffer
from a truncated range more than regression coefficients do.)



But could not a corrected estimator be a solution to my problem - like

What is your n? Who is having a problem with
looking at r's and treating them as estimates?



"Fisher's approximate unbiased estimator" r*(1+(1-r²)/(2n)) ?
Although this formula does the opposite of what I expected. In my
opinion a smaller sample size should yield a smaller correction factor,
in order to compensate for the opposite tendency of r (the tendency to
yield a bigger absolute value when n is small). Can you tell me what my
mistake is?
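One way to look at this question is to simulate it. The sketch below is
an added illustration under stated assumptions, not part of the original
exchange: it assumes bivariate normal data with a nonzero true
correlation rho, uses only the Python standard library, and the names
pearson_r, fisher_corrected and simulate are hypothetical. Under those
assumptions the average sample r tends to fall slightly below rho when n
is small, which would explain why the correction factor is larger than 1
and grows as n shrinks. The inflation of |r| in small samples concerns
the case of no true relation, which is a different matter from the bias
of r as an estimate of a nonzero rho.

```python
import math
import random


def pearson_r(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def fisher_corrected(r, n):
    """The approximate correction quoted above: r * (1 + (1 - r^2) / (2n))."""
    return r * (1 + (1 - r * r) / (2 * n))


def simulate(rho, n, reps=20000, seed=1):
    """Return (mean raw r, mean corrected r) over `reps` samples of size n.

    Assumption for this sketch: x and y are bivariate normal with
    true correlation rho.
    """
    rng = random.Random(seed)
    raw_sum = 0.0
    corrected_sum = 0.0
    for _ in range(reps):
        x = [rng.gauss(0, 1) for _ in range(n)]
        # y = rho*x + sqrt(1 - rho^2)*noise has correlation rho with x
        y = [rho * a + math.sqrt(1 - rho * rho) * rng.gauss(0, 1) for a in x]
        r = pearson_r(x, y)
        raw_sum += r
        corrected_sum += fisher_corrected(r, n)
    return raw_sum / reps, corrected_sum / reps


rho, n = 0.5, 10
mean_raw, mean_corrected = simulate(rho, n)
print("true rho:        ", rho)
print("mean raw r:      ", round(mean_raw, 4))
print("mean corrected r:", round(mean_corrected, 4))
```

The correction only addresses the average of r. It does nothing about
the large sampling variability of a single r at small n.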

--
Rich Ulrich, wpilib@xxxxxxxx
http://www.pitt.edu/~wpilib/index.html