Re: Approximate solution to linear regression
- From: "vincent64@xxxxxxxxx" <datashaping@xxxxxxxxx>
- Date: Sun, 01 Jul 2007 16:07:53 -0000
On Jun 29, 10:03 am, Paige Miller <paige.mil...@xxxxxxxxx> wrote:
On Jun 29, 3:57 am, "vincen...@xxxxxxxxx" <datashap...@xxxxxxxxx>
wrote:
On Jun 28, 12:28 pm, Paige Miller <paige.mil...@xxxxxxxxx> wrote:
On Jun 27, 5:10 pm, "vincen...@xxxxxxxxx" <datashap...@xxxxxxxxx>
wrote:
On Jun 26, 11:31 pm, "S.W.Christensen" <s...@xxxxxxxxxxxxxxxx> wrote:
On 26 Jun., 14:30, Paige Miller <paige.mil...@xxxxxxxxx> wrote:
On Jun 22, 3:16 am, "S.W.Christensen" <s...@xxxxxxxxxxxxxxxx> wrote:
On 17 Jun., 21:46, "vincen...@xxxxxxxxx" <datashap...@xxxxxxxxx>
wrote:
Problem can have 40,000 variables, most of them highly correlated.
More variables than observations in some cases.
I haven't looked over your solution in great detail, but I would
suggest that:
1) Group your variables into clusters, based on their correlations
2) Construct an ensemble of regression models, each based on just one
exemplar from each cluster
3) Weight each model conservatively (because you have so many
variables); e.g. equal weighting.
Why create an ad hoc procedure, where the properties are not known,
and where you would have to defend the validity of the procedure, and
where you have to write the code yourself?
If you only choose to create procedures that are not ad hoc, then you
will never create a procedure. Talk about halting civilisation in its
tracks...
By the way: 1) the various parts of the method are thoroughly tested,
2) a method that consistently works has proven itself beyond the need
for defending, 3) writing code yourself is the best guarantee you can
ever have of its correctness (do you have faith in the correctness of
code written by software companies?)
Best regards,
Stefan W. Christensen
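For readers following the thread, here is a minimal sketch of the three-step suggestion quoted above (cluster correlated variables, build an ensemble of regressions that each use one exemplar per cluster, weight the models equally). It is plain Python with no external libraries. The greedy clustering rule, the 0.8 correlation threshold, all function names and the toy data are assumptions made for illustration; none of them come from the posters' own code.

```python
import random

def corr(a, b):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / (saa * sbb) ** 0.5 if saa > 0 and sbb > 0 else 0.0

def cluster_columns(cols, threshold=0.8):
    """Greedy rule (an assumption): a column joins the first cluster whose seed
    column it correlates with at |r| >= threshold, else it starts a new cluster."""
    clusters = []
    for j, col in enumerate(cols):
        for members in clusters:
            if abs(corr(col, cols[members[0]])) >= threshold:
                members.append(j)
                break
        else:
            clusters.append([j])
    return clusters

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting.
    Assumes A is non-singular, which holds for the toy data below."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for i in range(n):
        piv = max(range(i, n), key=lambda r: abs(M[r][i]))
        M[i], M[piv] = M[piv], M[i]
        for r in range(i + 1, n):
            f = M[r][i] / M[i][i]
            for c in range(i, n + 1):
                M[r][c] -= f * M[i][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][c] * x[c] for c in range(i + 1, n))) / M[i][i]
    return x

def fit_ols(cols, y):
    """Least squares with an intercept, via the normal equations."""
    design = [[1.0] * len(y)] + cols
    A = [[sum(u * v for u, v in zip(a, b)) for b in design] for a in design]
    rhs = [sum(u * v for u, v in zip(a, y)) for a in design]
    return solve(A, rhs)

def ensemble_predict(cols, y, clusters, n_models=5, seed=0):
    """Average (equal weights) the fitted values of n_models regressions,
    each built on one randomly chosen exemplar per cluster."""
    rng = random.Random(seed)
    total = [0.0] * len(y)
    for _ in range(n_models):
        picked = [cols[rng.choice(members)] for members in clusters]
        beta = fit_ols(picked, y)
        for i in range(len(y)):
            total[i] += beta[0] + sum(b * col[i] for b, col in zip(beta[1:], picked))
    return [t / n_models for t in total]

# Made-up toy data: columns 0-1 are near copies, as are columns 2-3.
cols = [
    [1, 2, 3, 4, 5, 6, 7, 8],
    [1.1, 1.9, 3.2, 3.9, 5.1, 6.0, 6.8, 8.1],
    [1, -1, 1, -1, 1, -1, 1, -1],
    [0.9, -1.1, 1.0, -0.8, 1.2, -1.0, 0.9, -1.1],
]
y = [6, 2, 10, 6, 14, 10, 18, 14]   # equals 2*col0 + 3*col2 + 1

clusters = cluster_columns(cols)
fitted = ensemble_predict(cols, y, clusters)
print(clusters)
print([round(v, 2) for v in fitted])
```

With 40,000 variables the pairwise correlation pass is the expensive part, so a real implementation would need something smarter than this greedy loop; the sketch only shows the shape of the idea.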
Indeed, I'd rather use a proven methodology such as PLS than do my
own thing, assuming that obtaining, implementing and testing the
existing methodology is easier than starting from scratch and
reinventing the wheel.
I believe that I have been misled when I purchased a book (in French)
about PLS regression. It had about 100 pages of complicated
computations spread over several chapters before starting to talk
about PLS regression. In that regard, my own methodology looks a
thousand times simpler / easier to implement.
Another issue with PLS is that it's not trying to achieve exactly what
I want -- start with a model with 200 variables, then add all products
of two variables, that is, add about 20,000 variables. By the way
these are binary variables. My understanding is that PLS starts with
the 20,000 variables, not with the original 200 variables. If I'm
wrong please tell me. I do not pretend to know everything.
Regarding the level of complication: PLS can be written in under 20
lines of MATLAB or SAS/IML code. The actual algorithm itself simply is
not that complicated, but it certainly is more complicated than using
a "canned" regression package where you don't have to write code.
Regarding adding products of two variables ... again, this can be done
in PLS. It isn't mentioned because you can choose whatever X matrix
you want in PLS; it's entirely up to you. I suppose if you are writing
your own PLS code, you have to write code to create the products of
two variables ... but if you are using SAS PROC PLS, it is only
slightly more difficult than asking PROC PLS to use the original 200
variables, no extra code writing needed.
In PLS, you can fit a model of 200 variables, and then fit a model of
200 + approx 20,000 variables and compare them. That would make sense
and if the larger model has approximately the same R-squared on the Y
side, then you would say that the bigger model hasn't improved things
at all.
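The comparison described above can be sketched as follows: fit PLS on the original variables, fit it again on the originals plus all pairwise products, and compare R-squared on the Y side. This is a plain-Python, single-response NIPALS (PLS1) written for illustration only; it is not SAS PROC PLS and not the posters' code. The function names, the choice of training R-squared as the yardstick, the number of components and the toy data (3 binary variables instead of 200) are all assumptions.

```python
from itertools import combinations, product

def center(v):
    m = sum(v) / len(v)
    return [x - m for x in v]

def pls1_r2(columns, y, n_components):
    """Training R-squared of a PLS1 (single-response NIPALS) fit.
    columns: list of predictor columns, one value per observation in each."""
    X = [center(c) for c in columns]      # deflated after every component
    r = center(y)                         # current residual of the response
    ss_total = sum(v * v for v in r)
    n = len(r)
    for _ in range(n_components):
        # Weight vector proportional to X'y (here: X' times the y residual).
        w = [sum(a * b for a, b in zip(col, r)) for col in X]
        norm = sum(v * v for v in w) ** 0.5
        if norm < 1e-12:
            break                         # nothing left for X to explain
        w = [v / norm for v in w]
        # Scores t = X w.
        t = [sum(w[j] * X[j][i] for j in range(len(X))) for i in range(n)]
        tt = sum(v * v for v in t)
        if tt < 1e-12:
            break
        q = sum(a * b for a, b in zip(t, r)) / tt          # y loading
        for j, col in enumerate(X):                        # deflate X
            p = sum(a * b for a, b in zip(col, t)) / tt    # x loading
            X[j] = [a - p * b for a, b in zip(col, t)]
        r = [a - q * b for a, b in zip(r, t)]              # deflate y
    return 1.0 - sum(v * v for v in r) / ss_total

# Made-up toy data: every 0/1 combination of 3 binary variables,
# with a response that depends only on the product of the first two.
rows = list(product([0, 1], repeat=3))
main = [[row[j] for row in rows] for j in range(3)]
y = [row[0] * row[1] for row in rows]

# All products of two variables, appended to the original columns.
pairs = [[a * b for a, b in zip(c1, c2)] for c1, c2 in combinations(main, 2)]

r2_main = pls1_r2(main, y, n_components=3)
r2_full = pls1_r2(main + pairs, y, n_components=6)
print("R-squared, original variables:", round(r2_main, 3))
print("R-squared, with pairwise products:", round(r2_full, 3))
```

In this toy case the products clearly help because the response was built from one of them. Training R-squared can only flatter the larger model, so in practice a cross-validated figure would be the fairer comparison.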
Regarding binary variables ... Vincent, you didn't mention that in
your original posting. PLS has been used with binary variables that
are highly correlated, but in my experience, you are unlikely to find
meaningful information because with binary variables and fewer data
points than variables, there are only a limited number of possible
combinations of 0s and 1s -- in other words, many of your 20,000
columns are likely to be identical. But I suppose you never know until
you try.
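A small sketch of the counting argument above, in plain Python: with n observations a binary column can take at most 2^n distinct patterns, so once the number of product columns exceeds that, some of them must coincide. The helper names and the toy data (5 binary variables, 3 observations) are made up for illustration.

```python
from itertools import combinations

def product_columns(columns):
    """All products of two distinct binary columns."""
    return [tuple(a * b for a, b in zip(c1, c2))
            for c1, c2 in combinations(columns, 2)]

def count_distinct(columns):
    """Number of distinct 0/1 patterns among the columns."""
    return len(set(tuple(c) for c in columns))

# Made-up toy data: 5 binary variables observed on 3 data points.
variables = [
    (1, 0, 1),
    (1, 1, 0),
    (0, 1, 1),
    (1, 1, 1),
    (1, 0, 0),
]
n_obs = len(variables[0])

products = product_columns(variables)
n_products = len(products)
n_distinct = count_distinct(products)
upper_bound = 2 ** n_obs        # at most 2^n patterns for n observations

print(n_products, "product columns,", n_distinct, "distinct patterns,",
      "upper bound", upper_bound)
```

Running the same count on the real 20,000 columns before fitting anything would show how much genuinely different information they carry.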
--
Paige Miller
paige\dot\miller \at\ kodak\dot\com
You are right, 90% of the observations in the training set fall in a
few - maybe 500 - statistically significant combinations of 0s and 1s.
So you can get a predictive model such as decision trees to handle 90%
of the observations.
However, outside the training set, the proportion is not 90% anymore,
maybe 85% instead. It is very important to correctly classify the
remaining 15% though, as that's where much of the bad stuff can be
found. So my idea is to use a logistic regression to classify 15% of
the points, combined with a compatible decision tree approach that
actually does not require the creation of decision trees through
pruning / splitting, to classify 85% of the data.
Thanks Paige for your encouraging words about PLS, I'm really very
tempted to use it (or maybe logic regression, not sure what it is, but
it's related to my problem). I'll need to understand a bit more how it
works and stop buying these French books that are filled with
unnecessary theory just for the sake of looking smart, at the expense
of the busy reader like me.
Reference for PLS that is in English, with downloadable data and many
completely worked examples: Multivariate Analysis of Quality, An
Introduction (2001), Martens, H. and Martens, M. John Wiley and Sons,
Chichester.
Also, you mention logistic regression, but unless I'm not aware of
some relevant fact here, logistic regression is what you would use if
the Y variable(s) is binary. I don't believe you can use it if the X
variables are binary.
--
Paige Miller
paige\dot\miller \at\ kodak\dot\com
Thanks Paige. Yes the response is binary, or it is between 0 and 1
(probability), which is typically (in both cases) solved through
logistic regression.
I'm assuming that the concept of PLS applies to regular regression and
logistic regression as well, I don't see why not. I hope I'm not
opening a new can of worms by introducing logistic regression in this
problem. In some instances, my logistic regression has been solved as
a regular regression after applying the appropriate transformation
(group by bin, get response in the form of a continuous 0-1 variable,
apply inverse logistic transformation).
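The transformation described in the last paragraph (group by bin, take the observed proportion in each bin as a continuous 0-1 response, apply the inverse of the logistic function, then run an ordinary regression) might look like the sketch below. It uses a single predictor and plain Python. The bin width, the clipping constant `eps`, the unweighted straight-line fit, the function names and the toy data are all assumptions, not details given in the post.

```python
import math
from collections import defaultdict

def logit(p):
    """Inverse of the logistic function."""
    return math.log(p / (1.0 - p))

def grouped_logit_fit(x, y, bin_width, eps=0.01):
    """Bin x, compute the proportion of y == 1 per bin, transform it with the
    logit, and fit an ordinary least-squares line to (bin mean of x, logit)."""
    bins = defaultdict(list)
    for xi, yi in zip(x, y):
        bins[math.floor(xi / bin_width)].append((xi, yi))
    xs, zs = [], []
    for key in sorted(bins):
        pts = bins[key]
        mean_x = sum(p[0] for p in pts) / len(pts)
        rate = sum(p[1] for p in pts) / len(pts)
        rate = min(max(rate, eps), 1.0 - eps)   # keep the logit finite
        xs.append(mean_x)
        zs.append(logit(rate))
    mx = sum(xs) / len(xs)
    mz = sum(zs) / len(zs)
    slope = (sum((a - mx) * (b - mz) for a, b in zip(xs, zs))
             / sum((a - mx) ** 2 for a in xs))
    intercept = mz - slope * mx
    return intercept, slope

# Made-up toy data: 4 bins of 10 observations with 1, 3, 7 and 9 positives.
x, y = [], []
for b, ones in enumerate([1, 3, 7, 9]):
    for k in range(10):
        x.append(b + 0.5)
        y.append(1 if k < ones else 0)

intercept, slope = grouped_logit_fit(x, y, bin_width=1.0)

def predict(x_new):
    """Map the linear prediction back to a probability."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * x_new)))

print(round(intercept, 3), round(slope, 3), round(predict(2.0), 3))
```

The bins here all have the same size, so an unweighted fit is reasonable; with unequal bins, weighting each bin by its count would be the more usual choice.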