Re: multicollinearity in regression

Greg Heath wrote:
Anon. wrote:
Paul wrote:
Hi,

After reading numerous articles on the web, I have a couple of
questions.

As Reef Fish is being his normal helpful self, I'll try and make some
sensible suggestions.

1. How do I enter 5 continuous control variables such as LogSize in
regression?
Note I also have 2 continuous independent variables which are not
control variables but on which hypotheses are based. I also have 3
categorical independents which have 2 categories on which hypotheses
are based.
I could use Analysis of Covariance but 2 of the independent variables
are also continuous.
Do I just enter all the variables as Independents into the regression
using Enter method (which is what I have done)?

I don't know what package you're using to do the regression, but you can
simply fit the model as a multiple regression with all of the variables in.

2. The multicollinearity diagnostics indicate that the highest VIF is
2.3, which is less than the rule-of-thumb value of 4 or 2.5 that I have
seen mentioned. However, the Condition Index is 67.899 and seems to be
related to LOGSIZE variable which has a Variance Proportion of .99. The
other variable with a high Variance Proportion (.99) is the Constant.

I must admit that I don't totally understand this, but I assume that
this is suggesting that LOGSIZE is co-linear with another variable (or a
combination of variables). It may be that you can find out what's going
on by making pairwise plots of the covariates, and this will guide you
to seeing what to do.
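
For anyone wanting to reproduce this kind of output, here is a rough sketch of condition indices and variance proportions. It assumes the package follows the usual Belsley-style construction (columns of X, including the constant, scaled to unit length but not centered); the data are simulated and the names `logsize` and `x2` are hypothetical stand-ins. In this setup a large condition index that loads on both the constant and one covariate arises when that covariate has little spread relative to its mean, so it is nearly collinear with the constant rather than with the other covariates.

```python
import numpy as np

def collinearity_diagnostics(X):
    """Condition indices and variance-decomposition proportions.

    X must already contain the constant column. Columns are scaled to
    unit length but NOT centered (assumed Belsley-style construction).
    """
    Xs = X / np.linalg.norm(X, axis=0)
    _, s, Vt = np.linalg.svd(Xs, full_matrices=False)
    cond_index = s.max() / s
    # phi[j, k] = v_jk^2 / s_k^2: share of var(b_j) tied to component k
    phi = (Vt.T ** 2) / (s ** 2)
    proportions = phi / phi.sum(axis=1, keepdims=True)
    # Rows: components, columns: coefficients (constant first)
    return cond_index, proportions.T

rng = np.random.default_rng(0)
n = 200
logsize = rng.normal(10.0, 0.3, n)  # far from zero, small spread (made up)
x2 = rng.normal(0.0, 1.0, n)        # an unrelated covariate (made up)
X = np.column_stack([np.ones(n), logsize, x2])

ci, vp = collinearity_diagnostics(X)
worst = np.argmax(ci)
print("condition indices:", np.round(ci, 1))
print("variance proportions (constant, logsize, x2):", np.round(vp[worst], 2))
```

If the real data behave like this, centering LOGSIZE before fitting should bring the condition index down without changing the slopes.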

If I remove the variable LOGSIZE, then the coefficient of the Constant
shrinks in magnitude from -33 to -.44. None of the signs of the other
coefficients are changed, although one Independent variable is now
significant which wasn't previously.

The change in the coefficient of the Constant isn't surprising,
especially if some of the covariates are distributed a long way from zero.
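
A small simulation illustrates the point; it is only a sketch with made-up numbers, and `logsize`, `x` and the helper `ols` are hypothetical. The constant is the fitted value when every covariate equals zero, so a covariate centered far from zero shifts it by roughly its slope times its mean. Centering that covariate moves the constant by exactly that amount and leaves the slopes alone, and dropping it then changes the constant only slightly.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
logsize = rng.normal(10.0, 0.3, n)  # covariate far from zero (made up)
x = rng.normal(0.0, 1.0, n)
y = 2.0 + 0.5 * x + 0.1 * logsize + rng.normal(0.0, 1.0, n)

def ols(columns, y):
    """Least-squares fit with a constant; returns [constant, slopes...]."""
    X = np.column_stack([np.ones(len(y))] + columns)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

b_raw = ols([x, logsize], y)                        # LOGSIZE as recorded
b_centered = ols([x, logsize - logsize.mean()], y)  # LOGSIZE centered
b_dropped = ols([x], y)                             # LOGSIZE removed

print("constant, raw     :", round(b_raw[0], 3))
print("constant, centered:", round(b_centered[0], 3))
print("constant, dropped :", round(b_dropped[0], 3))
print("slopes unchanged by centering:", np.allclose(b_raw[1:], b_centered[1:]))
```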

I'm guessing that in the model with LOGSIZE, the LOGSIZE coefficient is
pretty small, true? Oh, and the independent variable that has become
significant: how much did the coefficient change? It's possible that it
only moved a bit, from being just non-significant to being just significant.

So should I just report the 2 regression models, one with LOGSIZE
included and one with it excluded?

I think you should try and understand why there seems to be
multicollinearity: it may be that you can then see a sensible approach
(i.e. one based on the substantive problem, not just a set of numbers).
When I get problems like these, I try to report one analysis, and make
a comment along the lines of "...if we include factor X, we get similar
results".

There are people on this list who have a much better understanding of
multicollinearity than I do, so hopefully they'll chime in with some
sensible advice as well.

I always find it helpful to calculate the correlation coefficient
matrix of all variables. This will give you pairwise correlation
information which usually helps to explain most problems with
multicollinearity.

This is patently FALSE, and had been debunked numerous times
in sci.stat.math. "Linear dependence" is a notion in LINEAR
ALGEBRA, whose definition does NOT depend on any notion of
"correlations". In that respect, correlations are completely
USELESS (except the case r = 1.000000) in diagnosing
multicollinearity problems.
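
A minimal sketch of the narrower point that an exact linear dependence can coexist with modest pairwise correlations. The data are simulated and purely illustrative: five independent columns plus a sixth that is their sum.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 1000, 5
Z = rng.normal(size=(n, k))
# Sixth column is the exact sum of the first five: X is rank deficient
X = np.column_stack([Z, Z.sum(axis=1)])

R = np.corrcoef(X, rowvar=False)
off_diag = np.abs(R[~np.eye(k + 1, dtype=bool)])
max_abs_r = off_diag.max()
rank = np.linalg.matrix_rank(X)

print("largest |pairwise correlation|:", round(max_abs_r, 2))
print("columns:", X.shape[1], "rank:", rank)
```

No single pairwise correlation here comes close to 1, yet the matrix has one fewer independent column than it appears to.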

Additional insight, if needed, can be obtained
from pairwise scatter plots. For example, if x2, x4 and x6 are
significantly correlated it sometimes helps to plot x4 and x6
vs x2.

You would only be wasting time and resources on pairwise
scatter plots.

Eigenvalue and eigenvector analysis of the X's is the only way
to sort out and understand the underlying multicollinearity.
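
One way such an analysis might look, as a sketch on simulated data (the variables `x1` to `x4` are hypothetical): take the eigen-decomposition of the correlation matrix of the X's and inspect the eigenvector belonging to the smallest eigenvalue. Its large entries point to the variables involved in the near dependence. Because it works from the correlation matrix, this version would not reveal a dependence that involves the constant.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + x2 + rng.normal(scale=0.05, size=n)  # nearly x1 + x2
x4 = rng.normal(size=n)                        # not involved
X = np.column_stack([x1, x2, x3, x4])

R = np.corrcoef(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(R)  # eigenvalues in ascending order
v = eigvecs[:, 0]                     # direction of least variation

print("eigenvalues:", np.round(eigvals, 4))
print("loadings for smallest eigenvalue (x1..x4):", np.round(v, 2))
```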

It's all DEJA VU.

Use the Google archives and keywords to find what you missed
in sci.stat.math since March 2005.

--- Bob.

Hope this helps.

Greg


