Re: Approximate solution to linear regression



On Jun 20, 7:57 pm, "vincen...@xxxxxxxxx" <datashap...@xxxxxxxxx>
wrote:
On Jun 20, 9:41 am, John Kane <jrkrid...@xxxxxxxxx> wrote:



On Jun 20, 3:11 am, "vincen...@xxxxxxxxx" <datashap...@xxxxxxxxx>
wrote:

On Jun 19, 12:44 pm, Paige Miller <paige.mil...@xxxxxxxxx> wrote:

On Jun 17, 3:46 pm, "vincen...@xxxxxxxxx" <datashap...@xxxxxxxxx>
wrote:

Problem can have 40,000 variables, most of them highly correlated.
More variables than observations in some cases. I came up with an
approach, and my question is

(1) is this an original approach?
(2) more importantly, does it always provide a fairly accurate
solution?

The problem and solution are described at http://datashaping.com/contest14004.shtml.
The newsgroup cannot render the mathematical formatting.

I haven't tried to go through your solution in any detail.

In similar situations, I use Partial Least Squares (PLS) Regression,
which is also an "approximate" method (actually, it's a biased
regression) that doesn't care if you have highly correlated X
variables and many more Xs than observations. If you use the maximum
possible number of dimensions in PLS, you will get an OLS solution
without having to invert a matrix.
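
To make the PLS/OLS remark concrete, here is a minimal pure-Python sketch of single-response PLS (a NIPALS-style PLS1). It is an illustration under stated assumptions, not the pls package or any vendor's implementation: the function names (dot, center, pls1_coefficients) and the tiny dataset are made up for the example, X and y are centered first, and X has full column rank, so the OLS solution exists. The coefficients are built component by component with no matrix inversion; OLS is computed separately by Cramer's rule only as a check.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def center(values):
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def pls1_coefficients(X, y, n_components):
    """PLS1 on centered X (list of rows) and centered y (hypothetical helper)."""
    n_vars = len(X[0])
    Xk = [row[:] for row in X]           # deflated copy of X
    coef = [0.0] * n_vars
    R, P = [], []                        # weights w.r.t. original X, and loadings
    for _ in range(n_components):
        # Weight vector: direction of covariance between deflated X and y.
        w = [dot([row[j] for row in Xk], y) for j in range(n_vars)]
        norm = dot(w, w) ** 0.5
        if norm < 1e-12:                 # nothing left for X to explain
            break
        w = [wj / norm for wj in w]
        t = [dot(row, w) for row in Xk]  # scores
        tt = dot(t, t)
        p = [dot([row[j] for row in Xk], t) / tt for j in range(n_vars)]
        q = dot(y, t) / tt               # regression of y on the scores
        # Express the scores in terms of the original X: t = X r.
        r = w[:]
        for r_prev, p_prev in zip(R, P):
            c = dot(p_prev, w)
            r = [ri - c * rp for ri, rp in zip(r, r_prev)]
        R.append(r)
        P.append(p)
        coef = [b + q * ri for b, ri in zip(coef, r)]
        # Deflate X by removing the part explained by this component.
        Xk = [[x - ti * pj for x, pj in zip(row, p)]
              for row, ti in zip(Xk, t)]
    return coef


# Made-up data: two correlated predictors, five observations.
x1 = center([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = center([2.0, 1.0, 4.0, 3.0, 6.0])
y = center([3.0, 2.0, 7.0, 6.0, 10.0])
X = [list(row) for row in zip(x1, x2)]

pls_one = pls1_coefficients(X, y, 1)     # biased, one component
pls_full = pls1_coefficients(X, y, 2)    # maximum number of components

# OLS for two predictors via Cramer's rule, used only for comparison.
s11, s12, s22 = dot(x1, x1), dot(x1, x2), dot(x2, x2)
s1y, s2y = dot(x1, y), dot(x2, y)
det = s11 * s22 - s12 * s12
ols = [(s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det]

print("PLS, 1 component :", pls_one)
print("PLS, 2 components:", pls_full)
print("OLS              :", ols)
```

With one component the coefficients differ from OLS (that is the bias); with both components they agree with OLS up to rounding. When there are more Xs than observations, OLS is not unique, so this comparison only makes sense in the full-rank case shown here.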

So, with that in mind, it seems to me your approximate solution is
trying to fit into a niche where there already is a solution, and the
PLS solution has proven useful in zillions of published articles. So
unless you can show that your approximate solution has better
properties than PLS, I don't see much of a need for it.

--
Paige Miller
paige\dot\miller \at\ kodak\dot\com

Thanks for your reply. I've heard that Lasso regression does similar
things too. Anyway, being efficient is much more important than being
original in this context: I'm not trying to publish an article, this
is not academic research. If I need to spend $150K to get PLS
regression software (SAS Enterprise Miner) and spend many hours
getting it to work, I'm MUCH better off re-inventing the wheel. So I
could rephrase my question as follows: am I re-inventing the wheel
quite well, meaning my approach is not significantly inferior to PLS
regression?

If PLS will work, try the R package charmingly named PLS. It is Open
Source (i.e., free to the user).
For info: http://mevik.net/work/software/pls.html

To obtain R: http://www.r-project.org/


Thanks a lot. These are great suggestions. I would assume PLS would
work. I bought a book on the subject a while back after reading
somewhere that PLS is supposed to solve my type of problem, but it was
presented in a quite obscure way. The computations seem quite
involved.

R, Splus, JMP, etc. are great but they have limitations: they process
the whole dataset all in memory (as far as I know, and that's how I've
seen it work). That's fine when you have 0.5 gigabyte of data, but
it does not work when you have many gigabytes or terabytes.
Unless...you buy their more expensive "data mining" packages that cost
more than hiring a Ph.D. statistician full time to do the job.
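
One generic way around the in-memory limit, sketched here under assumptions rather than taken from any of the packages named above: least squares only needs the cross-products X'X and X'y, and those can be accumulated one chunk of rows at a time. The names (accumulate_cross_products, solve_linear_system, read_chunks) and the toy data are hypothetical. This is only practical when the number of variables is modest; with 40,000 variables the cross-product matrix is itself 40,000 x 40,000, and with highly correlated variables or more variables than observations the system is singular or ill-conditioned, so plain elimination as below would not be enough.

```python
def accumulate_cross_products(chunks, n_vars):
    """Accumulate X'X and X'y over an iterable of chunks of (x_row, y) pairs."""
    xtx = [[0.0] * n_vars for _ in range(n_vars)]
    xty = [0.0] * n_vars
    for chunk in chunks:                 # only one chunk is held at a time
        for x, y in chunk:
            for i in range(n_vars):
                xty[i] += x[i] * y
                for j in range(n_vars):
                    xtx[i][j] += x[i] * x[j]
    return xtx, xty


def solve_linear_system(a, b):
    """Gaussian elimination with partial pivoting (assumes a is nonsingular)."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def read_chunks():
    """Stand-in for reading a large file piece by piece (toy data).

    Each row is ([1.0, x1, x2], y) with y = 1 + 2*x1 + 3*x2 exactly;
    the leading 1.0 is the intercept column.
    """
    yield [([1.0, 0.0, 0.0], 1.0), ([1.0, 1.0, 0.0], 3.0)]
    yield [([1.0, 0.0, 1.0], 4.0), ([1.0, 1.0, 1.0], 6.0)]
    yield [([1.0, 2.0, 1.0], 8.0)]


xtx, xty = accumulate_cross_products(read_chunks(), n_vars=3)
coefficients = solve_linear_system(xtx, xty)
print(coefficients)  # intercept, slope for x1, slope for x2
```

The memory needed is one chunk plus an n_vars x n_vars matrix, regardless of how many rows the file has.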

Well since R is free, the extensive data mining routines cost X * $0
so they're not much more expensive :)

You may well be beyond the memory limit of R; however, you might
want to discuss this on the R-help list and see if there are
workarounds or possibly more suitable packages:
https://stat.ethz.ch/mailman/listinfo/r-help

It is not an area I know anything about and R may well have another
package that is more suitable.

John Kane, Kingston ON Canada
