Re: Approximate solution to linear regression
- From: Paige Miller <paige.miller@xxxxxxxxx>
- Date: Thu, 28 Jun 2007 12:28:53 -0700
On Jun 27, 5:10 pm, "vincen...@xxxxxxxxx" <datashap...@xxxxxxxxx>
wrote:
On Jun 26, 11:31 pm, "S.W.Christensen" <s...@xxxxxxxxxxxxxxxx> wrote:
On 26 Jun., 14:30, Paige Miller <paige.mil...@xxxxxxxxx> wrote:
On Jun 22, 3:16 am, "S.W.Christensen" <s...@xxxxxxxxxxxxxxxx> wrote:
On 17 Jun., 21:46, "vincen...@xxxxxxxxx" <datashap...@xxxxxxxxx>
wrote:
Problem can have 40,000 variables, most of them highly correlated.
More variables than observations in some cases.
I haven't looked over your solution in great detail, but I would
suggest that:
1) Group your variables into clusters, based on their correlations
2) Construct an ensemble of regression models, each based on just one
exemplar from each cluster
3) Weight each model conservatively (because you have so many
variables); e.g. equal weighting.
Why create an ad hoc procedure, where the properties are not known,
and where you would have to defend the validity of the procedure, and
where you have to write the code yourself?
If you only choose to create procedures that are not ad hoc, then you
will never create a procedure. Talk about halting civilisation in its
tracks...
By the way: 1) the various parts of the method are thoroughly tested,
2) a method that consistently works has proven itself beyond the need
for defending, 3) writing code yourself is the best guarantee you can
ever have of its correctness (do you have faith in the correctness of
code written by software companies?)
Best regards,
Stefan W. Christensen
Indeed, I'd prefer a proven methodology such as PLS to doing my own
stuff, assuming that obtaining, implementing, and testing the existing
methodology is easier than starting from scratch and reinventing the
wheel.
I believe that I have been misled when I purchased a book (in French)
about PLS regression. It had about 100 pages of complicated
computations spread over several chapters before starting to talk
about PLS regression. In that regard, my own methodology looks a
thousand times simpler / easier to implement.
Another issue with PLS is that it's not trying to achieve exactly what
I want -- start with a model with 200 variables, then add all products
of two variables, that is, add about 20,000 variables. By the way
these are binary variables. My understanding is that PLS starts with
the 20,000 variables, not with the original 200 variables. If I'm
wrong please tell me. I do not pretend to know everything.
Regarding the level of complication: PLS can be written in under 20
lines of MATLAB or SAS/IML code. The actual algorithm itself simply is
not that complicated, but it certainly is more complicated than using
a "canned" regression package where you don't have to write code.
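To give a sense of the size, here is a minimal sketch of single-response PLS (PLS1) in Python with NumPy rather than MATLAB or SAS/IML. The function name pls1_fit and its arguments are made up for illustration, and the sketch assumes one Y column and no missing values; it is not a reference implementation.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """Sketch of PLS1: returns (beta, intercept) so that y_hat = X @ beta + intercept."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    E, f = X - x_mean, y - y_mean          # centered X and y, deflated below
    W, P, q = [], [], []
    for _ in range(n_components):
        w = E.T @ f                        # weight vector: covariance of X with y
        norm = np.linalg.norm(w)
        if norm < 1e-12:                   # nothing left in y that X can explain
            break
        w = w / norm
        t = E @ w                          # scores
        tt = t @ t
        p = E.T @ t / tt                   # X loadings
        q_a = (f @ t) / tt                 # y loading
        E = E - np.outer(t, p)             # deflate X
        f = f - q_a * t                    # deflate y
        W.append(w); P.append(p); q.append(q_a)
    if not W:                              # no component extracted (e.g. constant y)
        return np.zeros(X.shape[1]), y_mean
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    beta = W @ np.linalg.solve(P.T @ W, q) # coefficients on the original X scale
    return beta, y_mean - x_mean @ beta
```

When X has full column rank and n_components equals the number of columns, this reproduces ordinary least squares; the point of PLS is to stop well before that, typically choosing the number of components by cross-validation.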
Regarding adding products of two variables ... again, this can be done
in PLS. It isn't mentioned because you can choose whatever X matrix
you want in PLS; it's entirely up to you. I suppose if you are writing
your own PLS code, you have to write code to create the products of
two variables ... but if you are using SAS PROC PLS, it is only
slightly more difficult than asking PROC PLS to use the original 200
variables, no extra code writing needed.
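For anyone writing their own code, building the product columns is a small step. The following Python/NumPy sketch shows one way to do it; the helper name add_pairwise_products is hypothetical, and it only forms products of distinct variables, on the assumption that the variables are binary (so a squared column would just repeat the original).

```python
from itertools import combinations
import numpy as np

def add_pairwise_products(X):
    """Return X with all products x_i * x_j (i < j) appended as extra columns."""
    X = np.asarray(X)
    pairs = list(combinations(range(X.shape[1]), 2))
    if not pairs:                      # fewer than two columns: nothing to add
        return X, pairs
    products = np.column_stack([X[:, i] * X[:, j] for i, j in pairs])
    return np.hstack([X, products]), pairs

# 200 original variables give 200 * 199 / 2 = 19,900 product columns.
n_products = len(list(combinations(range(200), 2)))
```

The returned list of index pairs records which two original variables each new column came from, which helps when interpreting the fitted coefficients.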
In PLS, you can fit a model of 200 variables, and then fit a model of
200 + approx 20,000 variables and compare them. That would make sense,
and if the larger model has approximately the same R-squared on the Y
side, then you would say that the bigger model hasn't improved things
at all.
Regarding binary variables ... Vincent, you didn't mention that in
your original posting. PLS has been used with binary variables that
are highly correlated, but in my experience, you are unlikely to find
meaningful information because with binary variables and fewer data
points than variables, there are only a limited number of possible
combinations of 0s and 1s -- in other words, many of your 20,000
columns are likely to be identical. But I suppose you never know until
you try.
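One way to check this before fitting anything is to count how many distinct columns the X matrix actually contains. With n observations a binary column can take at most 2^n patterns, and far fewer tend to occur when the variables are highly correlated. The sketch below is in Python with NumPy; the function name count_distinct_columns is made up for illustration.

```python
import numpy as np

def count_distinct_columns(X):
    """Number of distinct columns in X (identical columns are counted once)."""
    X = np.asarray(X)
    return np.unique(X, axis=1).shape[1]

# Tiny example: 3 observations, 4 binary variables, two of which are repeats.
X = np.array([[1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 1, 1, 1]])
n_distinct = count_distinct_columns(X)   # columns 0 and 2 match, as do 1 and 3
```

If the count comes out far below the number of columns, the extra columns carry no additional information for any regression method, PLS included.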
--
Paige Miller
paige\dot\miller \at\ kodak\dot\com