Re: OLS fails due to outliers, but relationship is very clear



Pawel, you'll probably get more traction in something like sci.stat.math, but here's my two cents:

It's hard to tell what you mean by a "highly biased" fit, and I'm not sure what your criterion is for calling something an outlier. But a low r^2 doesn't necessarily mean that the OLS model doesn't fit. It may simply mean that the error variance is large relative to the variation in the mean response (the fitted line) over the range of your predictor. In other words, despite the usual description, r^2 is not really a "goodness of fit" statistic; it's a "usefulness of prediction" statistic.
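To make that concrete, here is a minimal Python/NumPy sketch on simulated data. The slope, intercept, and noise level are made-up values for illustration, not anything taken from your data, and the noise is plain Gaussian rather than outlier-prone. The straight-line model is exactly right by construction, yet r^2 comes out tiny because the noise swamps the trend.

```python
import numpy as np

def fit_ols(x, y):
    """Return (slope, intercept, r_squared) for a least-squares straight line."""
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (slope * x + intercept)
    # r^2 = 1 - (residual variance) / (total variance of y)
    r2 = 1.0 - resid.var() / y.var()
    return slope, intercept, r2

# Hypothetical parameters, chosen only for illustration.
true_intercept, true_slope = 1.5, 0.6
noise_sd = 10.0          # large compared with the rise of the line over x

rng = np.random.default_rng(0)
n = 20000
x = rng.uniform(0.0, 10.0, n)
y = true_intercept + true_slope * x + rng.normal(0.0, noise_sd, n)

slope, intercept, r2 = fit_ols(x, y)
print(f"slope={slope:.3f}  intercept={intercept:.3f}  r^2={r2:.3f}")
```

The fitted slope lands close to the true one while r^2 stays small, so a small r^2 by itself says the scatter is large, not that the line is wrong.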

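If you bin and take means (don't forget to weight each bin mean by N_obs, the number of observations in that bin), you'll be estimating a slightly different model (means of observations, not raw observations), but an equivalent one, and your ultimate predictions probably won't change much. You'll just have to interpret your error estimate slightly differently, because it now describes the scatter of the bin means about the line, not the scatter of individual observations.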
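A rough Python/NumPy sketch of that binned fit, again on simulated data with made-up parameters. The quantile-based bin edges and the helper name binned_means are my own choices for illustration; they are one way to let the data imply the bins, not the only one. Note that np.polyfit applies its weights to the unsquared residuals, so weighting by N_obs means passing sqrt(N_obs).

```python
import numpy as np

def binned_means(x, y, n_bins=8):
    """Split x into quantile bins; return per-bin mean x, mean y, and counts."""
    edges = np.quantile(x, np.linspace(0.0, 1.0, n_bins + 1))
    # Use interior edges only, so every point gets an index in 0..n_bins-1.
    idx = np.digitize(x, edges[1:-1])
    counts = np.bincount(idx, minlength=n_bins)
    sum_x = np.bincount(idx, weights=x, minlength=n_bins)
    sum_y = np.bincount(idx, weights=y, minlength=n_bins)
    keep = counts > 0        # heavily tied data could leave a bin empty
    return sum_x[keep] / counts[keep], sum_y[keep] / counts[keep], counts[keep]

# Simulated data; all parameters are hypothetical.
rng = np.random.default_rng(1)
n = 5000
x = rng.uniform(0.0, 10.0, n)
y = 1.5 + 0.6 * x + rng.normal(0.0, 5.0, n)

# Ordinary fit to the raw observations.
slope_raw, intercept_raw = np.polyfit(x, y, 1)
r2_raw = 1.0 - np.var(y - (slope_raw * x + intercept_raw)) / np.var(y)

# Weighted fit to the bin means (weights sqrt(N_obs), see note above).
xb, yb, nb = binned_means(x, y)
slope_bin, intercept_bin = np.polyfit(xb, yb, 1, w=np.sqrt(nb))

# Weighted r^2 of the bin means about the binned fit.
resid_b = yb - (slope_bin * xb + intercept_bin)
ybar = np.average(yb, weights=nb)
r2_bin = 1.0 - np.sum(nb * resid_b**2) / np.sum(nb * (yb - ybar) ** 2)

print(f"raw:    slope={slope_raw:.3f}  r^2={r2_raw:.3f}")
print(f"binned: slope={slope_bin:.3f}  r^2={r2_bin:.3f}")
```

The two slopes should come out nearly the same, while the r^2 of the bin means is far higher. That gap is the "interpret your error estimate differently" part: averaging within bins removes most of the observation-level noise, so the binned r^2 says nothing about how well a single new observation can be predicted.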

Hope this helps.

- Peter Perkins
The MathWorks, Inc.



Pawel Zdziarski wrote:
I am looking at "simple" regression with one independent variable.
Both X's and Y's are very scattered with lots of outliers, and OLS fits a line which is highly biased with ridiculously small R^2.
However, when I bin the X's and look at mean value of Y in each bin, the relationship is very clear:
Bin X   | Mean(Y)
--------+--------
0-0.5   | 1.58
0.5-1   | 2.59
1-2.5   | 3.04
2.5-5   | 4.12
5-7.5   | 6.88
7.5-10  | 5.73
10      | 8
What is the formal/preferred way of approaching this sort of problem? For example, I would like the bin ranges to be implied from my data, rather than picking them arbitrarily myself. And once that's done, I would like to come up with a summary statistic which I could compare between many samples (such as R^2 for OLS).
Generally, what should I search for to learn about statistics in this sort of "bin analysis"?