Re: how to understand the concept of overfitting?

On 8-Dec-2007, citi <loseminds2008@xxxxxxxxx> wrote:

> I am worried about overfitting. I am confused about what constitutes
> overfitting, and it's not clear to me when it occurs. Does it occur only
> when the number of parameters is larger than the number of data points?
> Is there any situation where "more parameters than data points" does not
> cause an overfitting problem? To address the overfitting problem, is it
> definitive that I should reduce the number of parameters to below 100?

Overfitting occurs when you fit noise or anomalies that are unique to the
training data and not part of the overall data distribution. Real-world data
collected through measurements (even in a good laboratory) will have some
noise or error imposed on the actual phenomenon that you are trying to
measure. In the case of survey or data-mining data, there are always extra,
unknown (or at least unmeasured) variables that are not represented in the
model. These unmeasured variables add an effect that is essentially noise in
the data.

As an example, consider a program that generates data for a sine wave, but the
program adds random values to the sine values using a random number generator.
If your goal was simply to fit the training data set, then you could put in a
zillion parameters so that the fitted curve would jump around to try to match
the random noise imposed on the sine wave. But when you apply that model to
another set of data generated by the same program but with a different set of
random values, it will not fit well because it will try to model the old
(training) random values rather than the new ones. So the best model for
accurate prediction of new (unseen) data is one that ignores the noise and does
the best job possible of fitting the sine wave.
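
Here is a minimal sketch of that experiment in Python with NumPy (my choice
of tools, not something from the original program). It fits noisy sine data
with a modest polynomial and with a wildly overparameterized one, then scores
both on a fresh sample from the same generator:

# Sketch only: fit noisy sine data with a modest vs. an overfit
# polynomial and compare errors on new data from the same process.
import numpy as np

rng = np.random.default_rng(0)

def noisy_sine(n):
    # sine wave plus Gaussian noise, like the program described above
    x = np.sort(rng.uniform(0.0, 2.0 * np.pi, n))
    return x, np.sin(x) + rng.normal(scale=0.3, size=n)

x_train, y_train = noisy_sine(30)   # training sample
x_test, y_test = noisy_sine(30)     # new data, same process, new noise

for degree in (5, 25):              # modest vs. "zillion"-parameter fit
    coeffs = np.polyfit(x_train, y_train, degree)  # high degree may warn
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, "
          f"test MSE {test_mse:.3f}")

The high-degree fit will typically show a lower training error but a much
higher error on the fresh sample, which is exactly the overfitting effect
described above.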

Overfitting is a potential problem for most types of predictive models, but
the risk is more extreme for some than for others. A classic case of extreme
overfitting is a decision tree that has one leaf node for every training
case. Such a tree is guaranteed to be 100% accurate on the training data, but
it may do a very poor job on test data. "Pruning" is the process of trimming
decision trees so as to avoid overfitting. Neural networks and SVM models
also have the potential for severe overfitting. Some types of models exhibit
little or no overfitting. Decision tree forests ("Random Forests") appear not
to suffer from overfitting even if you build them with huge numbers of trees.
TreeBoost (boosted decision tree) models also usually do not exhibit
overfitting, though on occasion it can occur.
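
To see the decision-tree case concretely, here is a short sketch using
scikit-learn (my choice of library; the poster's tools are DTREG's). An
unconstrained tree can grow essentially one leaf per training case and score
perfectly on the training set; capping its depth, a crude stand-in for
pruning, usually trades a little training accuracy for better test accuracy:

# Sketch only: an unconstrained tree vs. a depth-limited tree on
# noisy synthetic data (flip_y randomly flips 10% of the labels).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=20, flip_y=0.10,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for depth in (None, 4):   # unlimited growth vs. shallow tree
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_tr, y_tr)
    print(f"max_depth={depth}: train {tree.score(X_tr, y_tr):.2f}, "
          f"test {tree.score(X_te, y_te):.2f}")

The unlimited tree will score 1.00 on the training data but usually does
worse than the shallow tree on the held-out data.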

Your idea of partitioning the data into training and test samples is fine.
Cross validation is better because all of the data is used both for training
and for testing, and it averages out the modeling error. Note that validating
time series data is a special case. With time series data you can't do
classic cross validation, where you randomly select training/test points from
the data set, because the test points would be interleaved time-wise with the
training points. For time series it is necessary to train with data from one
interval and then test with data from a separate, later interval.
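
As a quick sketch of that chronological split, scikit-learn's TimeSeriesSplit
(again my choice of tool, not the poster's) always trains on an earlier
interval and tests on a later one, unlike an ordinary random k-fold split:

# Sketch only: print the train/test index ranges produced by a
# time-ordered split of 100 observations.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

t = np.arange(100)   # stand-in for 100 time-ordered observations
for train_idx, test_idx in TimeSeriesSplit(n_splits=4).split(t):
    print(f"train {train_idx[0]:2d}..{train_idx[-1]:2d}  "
          f"test {test_idx[0]:2d}..{test_idx[-1]:2d}")

Each fold's test interval begins only after its training interval ends, which
is the property described above.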

--
Phil Sherrod
(PhilSherrod 'at' comcast.net)
http://www.dtreg.com (Decision trees, Neural networks, SVM and Genetic modeling)
http://www.nlreg.com (Nonlinear Regression)

