# Re: how to understand the concept of overfitting?

*From*: "Phil Sherrod" <PhilSherrod@xxxxxxxxxxxxxxxxxxxxx>

*Date*: Sat, 8 Dec 2007 16:36:39 GMT

On 8-Dec-2007, citi <loseminds2008@xxxxxxxxx> wrote:

I am worried about "overfitting". I am confused about what

constitutes "overfitting". It's not very clear to me when

"overfitting" occurs. Only when the number of parameters is larger than the

number of data points? Is there any situation where "more parameters

than data points" does not cause overfitting problem? To address the

overfitting problem, is it definitive that I should reduce the number

of parameters to below 100?

Overfitting occurs when you fit noise or anomalies in the training data that

are unique to it and not part of the overall data distribution. Real world

data collected through measurements (even in a good laboratory) will have some

noise or error imposed on the actual phenomenon that you are trying to measure.

In the case of survey or data mining type data, there are always extra,

unknown (or at least not measured) variables that are not present in the model.

These unmeasured variables add an effect that is essentially noise in the

data.

As an example, consider a program that generates data for a sine wave, but the

program adds random values to the sine values using a random number generator.

If your goal was simply to fit the training data set, then you could put in a

zillion parameters so that the fitted curve would jump around to try to match

the random noise imposed on the sine wave. But when you apply that model to

another set of data generated by the same program but with a different set of

random values, it will not fit well because it will try to model the old

(training) random values rather than the new ones. So the best model for

accurate prediction of new (unseen) data is one that ignores the noise and does

the best job possible of fitting the sine wave.
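
The sine wave example can be tried directly. What follows is only a rough sketch under my own assumptions: polynomials stand in for "a model with many parameters", and the noise level, sample sizes, seeds, and the helper names `noisy_sine` and `rmse` are made up for illustration. The high-degree fit will always match the training data at least as closely as the low-degree one, but it typically does worse on fresh data from the same generator.

```python
import numpy as np

def noisy_sine(n_points, noise_sd, seed):
    """Sample sin(x) on [0, 2*pi] and add Gaussian noise (illustrative generator)."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 2.0 * np.pi, n_points)
    y = np.sin(x) + rng.normal(0.0, noise_sd, n_points)
    return x, y

def rmse(y_true, y_pred):
    """Root-mean-square error between observed and predicted values."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Same generator, different random noise for training and test data.
x_train, y_train = noisy_sine(30, 0.3, seed=0)
x_test, y_test = noisy_sine(200, 0.3, seed=1)

results = {}
for degree in (3, 15):  # few parameters vs. many parameters
    model = np.polynomial.Polynomial.fit(x_train, y_train, degree)
    results[degree] = (
        rmse(y_train, model(x_train)),  # error on the data used for fitting
        rmse(y_test, model(x_test)),    # error on unseen data
    )
    print(f"degree {degree:2d}: train RMSE {results[degree][0]:.3f}, "
          f"test RMSE {results[degree][1]:.3f}")
```

The number to watch is the gap between training and test error for each degree: a small training error paired with a much larger test error is the signature of a model that has fitted the noise.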

Overfitting is a potential problem for most types of predictive models, but the

risk is more extreme for some than others. A classic case of extreme

overfitting is a decision tree that has one leaf node for every training case.

Such a tree is guaranteed to be 100% accurate on the training data, but it may

do a very poor job on test data. "Pruning" is the process of trimming decision

trees so as to avoid overfitting. Neural networks and SVM models also have the

potential for severe overfitting. There are some types of models that exhibit

little or no overfitting. Decision tree forests ("Random Forests")

appear to not suffer from overfitting even if you build them with huge numbers

of trees. Also TreeBoost (boosted decision tree) models usually do not exhibit

overfitting, but on occasion it can occur.
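
To make the "one leaf per training case" point concrete, here is a minimal sketch for one-dimensional data. It is not a real decision tree implementation: `fit_leaves` and `predict_leaves` are hypothetical helpers that cut the sorted training points into leaves of equal size, and the larger leaf size is only a crude stand-in for what pruning achieves (real pruning methods choose the splits differently). The data generator, noise level, and sample sizes are also assumptions.

```python
import numpy as np

def fit_leaves(x, y, points_per_leaf):
    """Piecewise-constant fit: each leaf predicts the mean of its training points."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    starts = np.arange(0, len(xs), points_per_leaf)
    # Split thresholds are midpoints between neighbouring leaves.
    thresholds = (xs[starts[1:] - 1] + xs[starts[1:]]) / 2.0
    leaf_values = np.array([ys[s:s + points_per_leaf].mean() for s in starts])
    return thresholds, leaf_values

def predict_leaves(model, x):
    """Look up the leaf each x falls into and return that leaf's value."""
    thresholds, leaf_values = model
    return leaf_values[np.searchsorted(thresholds, x)]

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

rng = np.random.default_rng(0)
x_train = rng.uniform(0.0, 2.0 * np.pi, 200)
y_train = np.sin(x_train) + rng.normal(0.0, 0.4, 200)
x_test = rng.uniform(0.0, 2.0 * np.pi, 1000)
y_test = np.sin(x_test) + rng.normal(0.0, 0.4, 1000)

full = fit_leaves(x_train, y_train, points_per_leaf=1)     # one leaf per case
pruned = fit_leaves(x_train, y_train, points_per_leaf=20)  # stand-in for pruning

train_full = rmse(y_train, predict_leaves(full, x_train))
test_full = rmse(y_test, predict_leaves(full, x_test))
test_pruned = rmse(y_test, predict_leaves(pruned, x_test))
print(f"one leaf per case: train RMSE {train_full:.3f}, test RMSE {test_full:.3f}")
print(f"20 cases per leaf: test RMSE {test_pruned:.3f}")
```

With one leaf per case the training error is zero by construction, because every training point is predicted by its own value. The test error is the number that tells you how much of that fit was noise.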

Your idea of partitioning the data into training and test samples is fine.

Cross validation is better because all of the data is used both for training

and testing, and it averages out the modeling error. Note that validating time

series data is a special case. With time series data you can't do classic cross

validation where you randomly select training/test points from the data set,

because the test points will be interleaved time-wise with the training points.

For time series it is necessary to train with data from one interval and then

test with data from a separate interval.
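
The difference between the two validation schemes comes down to how the row indices are split. The sketch below shows one way to generate them; `kfold_indices` and `chronological_split` are hypothetical helper names, and the chronological version assumes the rows are already sorted by time, oldest first.

```python
import numpy as np

def kfold_indices(n_samples, n_folds, seed=0):
    """Yield (train_idx, test_idx) pairs for shuffled k-fold cross validation."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), n_folds)
    for i in range(n_folds):
        test_idx = np.sort(folds[i])
        # Every fold except the i-th is used for training.
        train_idx = np.sort(np.concatenate(folds[:i] + folds[i + 1:]))
        yield train_idx, test_idx

def chronological_split(n_samples, test_fraction=0.2):
    """Train on the earliest rows and test on the latest (rows assumed time-ordered)."""
    cut = int(round(n_samples * (1.0 - test_fraction)))
    return np.arange(cut), np.arange(cut, n_samples)

# Ordinary cross validation: every row is tested exactly once,
# but test rows are scattered among the training rows.
splits = list(kfold_indices(20, 5))
for train_idx, test_idx in splits:
    print("k-fold test rows:", test_idx)

# Time series: the test interval lies entirely after the training interval.
ts_train, ts_test = chronological_split(20, test_fraction=0.2)
print("time series train rows:", ts_train)
print("time series test rows: ", ts_test)
```

In the k-fold output the test rows are mixed in among the training rows, which is exactly what has to be avoided for time series; the chronological split keeps the two intervals separate.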

--

Phil Sherrod

(PhilSherrod 'at' comcast.net)

http://www.dtreg.com (Decision trees, Neural networks, SVM and Genetic

modeling)

http://www.nlreg.com (Nonlinear Regression)

Computer Guy
