Re: multicollinearity in regression
- From: "Reef Fish" <Large_Nassau_Grouper@xxxxxxxxx>
- Date: 27 Mar 2006 08:27:44 -0800
sangdonlee@xxxxxxxxx wrote:
There are many excellent statistical experts in this group but let me
give you MY explanation on multicollinearity.
MLR model is
Y=XB
The least squares estimator of B is
B-hat=inv(X'X)X'Y
Finding B-hat involves the computation of the INVERSE matrix of X'X.
Computing the inverse matrix is THE main cause of problems in many
areas. Here, X'X is the same as the covariance or correlation matrix of X,
depending on whether the input columns are normalized or standardized.
This much is correct.
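For readers who want to see the estimator in action, here is a minimal sketch, assuming NumPy and made-up data. The names `X`, `y`, `beta_true` and the coefficient values are illustrative and not from the thread. It solves the normal equations as a linear system rather than forming inv(X'X) explicitly, and compares the result with a library least-squares routine.

```python
import numpy as np

rng = np.random.default_rng(0)          # fixed seed so the run is repeatable
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # intercept + 2 predictors
beta_true = np.array([1.0, 2.0, -3.0])  # hypothetical coefficients
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# B-hat = inv(X'X) X'Y, solved as a linear system instead of inverting X'X
XtX = X.T @ X
b_hat = np.linalg.solve(XtX, X.T @ y)

# The same estimate from a least-squares routine that never forms X'X
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(b_hat)
print(np.allclose(b_hat, b_lstsq))
```

With well-conditioned predictors like these, the two routes agree closely. The trouble discussed in the rest of this post starts when X'X is close to singular.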
When the predictor variables are correlated among themselves,
ill-conditioning, near-singularity, or multicollinearity is said to
exist.
This is NOT true. If a set of X's are correlated, even highly
correlated,
those X's may or may NOT be nearly "linearly dependent".
See http://tinyurl.com/hn28z
for the concept of "linear dependence", and how it is UNRELATED
to correlations, except a correlation of 1.00000. There are other
recent threads that dealt with the meaning of multicollinearity as the
condition of "almost linearly dependent".
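One direction of this distinction can be sketched with synthetic data, assuming NumPy. The construction below is illustrative and not taken from the linked thread. A fifth predictor is built as almost exactly the sum of four independent ones, so the set is nearly linearly dependent even though no pairwise correlation is especially high.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
Z = rng.normal(size=(n, 4))                           # four independent predictors
x5 = Z.sum(axis=1) + rng.normal(scale=0.01, size=n)   # almost an exact linear combination
X = np.column_stack([Z, x5])

R = np.corrcoef(X, rowvar=False)              # 5 x 5 correlation matrix
max_pairwise = np.abs(R - np.eye(5)).max()    # largest off-diagonal |r|
cond_R = np.linalg.cond(R)                    # large value signals near-singularity

print("largest pairwise |r|:", round(max_pairwise, 2))
print("condition number of R:", cond_R)
```

The pairwise correlations with `x5` come out at roughly 0.5, yet the condition number of the correlation matrix is very large. Scanning a correlation table for big entries would miss this kind of dependence.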
If the predictor variables are highly correlated, the standard
errors of the estimated partial regression coefficients are so large
that their interpretation is impossible; therefore the simple
interpretation of the partial regression coefficients as measuring
marginal effects (slope or sensitivity) is unwarranted. If X'X is near
singular, the inverse matrix of X'X is likely to be quite unstable or
possibly even not unique.
The above is partly true (when X'X is nearly singular), but the
reference to correlations should b
For example, let
A = X'X = [ 1   3   4
            3   9  12
            4  12  16 ]
The rank (the number of independent columns or rows) of matrix A is 1,
since the second and the third columns are obtained from the first
column by multiplying it by 3 and 4. Because A is singular, the inverse
matrix, A^-1, does not exist.
However, let's add small random noise to A and call it A1:
A1 = [ 1.02   3.08   4.05
       2.95   9.01  12.01
       4.06  12.01  15.99 ]
That's not a good example for the illustration of your point. A
perturbation of the DATA that makes X'X nearly singular would still
leave X'X a covariance matrix (hence non-negative definite).
Your A1 is asymmetric and has a negative eigenvalue -.504.
Inverse(A1) =
[  -2.7474   -9.8258    8.0760
   25.6679   -2.1502   -4.8863
  -18.5814    4.1098    1.6820 ]
is now irrelevant to the problem of multicollinearity.
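The two matrices can be checked numerically. The sketch below assumes NumPy, and the variable names are illustrative. It confirms that A has rank 1, that A1 is not symmetric (so it cannot be an X'X matrix), and that A1 nonetheless has an ordinary inverse.

```python
import numpy as np

A = np.array([[1, 3, 4],
              [3, 9, 12],
              [4, 12, 16]], dtype=float)
A1 = np.array([[1.02, 3.08, 4.05],
               [2.95, 9.01, 12.01],
               [4.06, 12.01, 15.99]])

rank_A = np.linalg.matrix_rank(A)      # 1: every column is a multiple of the first
is_symmetric = np.allclose(A1, A1.T)   # False: e.g. 3.08 vs 2.95
det_A1 = np.linalg.det(A1)             # small but nonzero, so an inverse exists
inv_A1 = np.linalg.inv(A1)

print(rank_A, is_symmetric, det_A1)
print(np.round(inv_A1, 4))
```

Any X'X computed from real data is symmetric by construction, so a perturbed example is better built by adding noise to the columns of X and then forming X'X.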
Let me state clearly that multicollinearity does not
necessarily prevent MLR from satisfactory prediction IF the new values
are like the ones used to develop the prediction model.
That is true even without the stated condition.
How do we know
the new samples are like the ones used to develop the prediction model?
My approach is to compute the 95% confidence ellipse based on the
principal component scores.
Bad idea. Why principal components? Don't you mean "confidence
ellipsoids"? There are as many principal components as there are
original X's.
< snipped incorrect and incomprehensible paragraphs>
What's the solution to the ill-conditioning? Dozens of methods have
been developed such as LU decomposition, QR decomposition,
eigenvalue/vector decomposition (EVD), singular value decomposition
(SVD),
These are methods to improve the NUMERICAL solution of the
inverse of X'X. They do NOT solve the statistical instability, and
worst of all, the numerically correct solution may be statistically
the WRONG solution (Rubin, Beaton, Barone 1976). The ONLY
solution is to drop one or more of the REDUNDANT variables.
The above has been thoroughly discussed in sci.stat.math
threads relating to multicollinearity.
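The difference between numerical and statistical stability can be sketched with synthetic data, assuming NumPy. The data-generating model and the helper `ols_with_se` are hypothetical and for illustration only. A QR factorisation gives numerically accurate coefficients, but the standard error of the coefficient on `x1` stays very large until the nearly redundant `x2` is dropped.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)   # nearly redundant copy of x1
y = 1.0 + 2.0 * x1 + rng.normal(size=n)    # hypothetical data-generating model

def ols_with_se(X, y):
    """Return OLS coefficients (via QR) and their standard errors."""
    Q, R = np.linalg.qr(X)                 # numerically stable factorisation
    b = np.linalg.solve(R, Q.T @ y)
    resid = y - X @ b
    sigma2 = resid @ resid / (X.shape[0] - X.shape[1])
    R_inv = np.linalg.inv(R)
    cov_b = sigma2 * (R_inv @ R_inv.T)     # equals sigma^2 * inv(X'X)
    return b, np.sqrt(np.diag(cov_b))

X_full = np.column_stack([np.ones(n), x1, x2])
X_drop = np.column_stack([np.ones(n), x1])  # redundant x2 dropped

b_full, se_full = ols_with_se(X_full, y)
b_drop, se_drop = ols_with_se(X_drop, y)

print("SE of x1 coefficient, full model:", se_full[1])
print("SE of x1 coefficient, x2 dropped:", se_drop[1])
```

In this setup the standard error in the full model is far larger than after dropping `x2`, although both fits are computed by the same stable algorithm. This is only a toy case and does not by itself settle which variable is the redundant one in a real problem.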
ridge regression. My approach (which is not perfect, though) is to
use PCA (principal component analysis), PCR (principal component
regression), or PLS (partial least squares).
These methods are NOT recommended because they are ill-justified
and they do NOT solve the problem of multicollinearity, but
DISGUISE it in various forms to arrive at solutions INFERIOR
to any original solution of the problem.
-- Bob.
P.S. My comments on some technical aspects of the OP's problem
should now shed some light on my previous THREE posts replying
to Paul. The subject is far beyond his ability of comprehension
without first taking a sound course in Multiple Regression Analysis.
And you have illustrated that even having courses in MLR and a
Ph.D. degree, one can STILL have all kinds of misconceptions
and errors about the problem/query of the OP.
Sorry I can't give my explanation to your original query since much
information is missing......observational or experimental data,
objective, etc.
Hope this helps.
Sangdon Lee, Ph.D.,
GM Tech. Center.