OT: "Rabbit Hunting" in stat/handicapping
- From: "eleaticus" <eleaticus@xxxxxxxxxxxxx>
- Date: Wed, 14 Mar 2007 19:49:51 -0600
Some academics, sports handicappers, and others tend to throw a bunch of
statistical tests into the hat and voilà! pull out a significant
relationship.
The best academics know that the significance result should have been for a
relationship already theorized, but even they do this "rabbit hunting" and
THEN theorize.
Let's take a look at what can be expected if, say, you throw a bunch of
random variables into the correlation hat.
Ten variables give you 45 correlations.
Using the runs test to see what the expectation is in this situation, we use
the formula Nr = n(1-p)(p^r), where r is the 'run length' of minimum
interest and Nr the expected number of such runs (or longer runs) in n
'trials' at the given probability, p.
Here, r = 1. Let's solve at p=.01 for the number of trials n at which the
expected number of runs of length 1 or longer is Nr=1.
1 = n(.99)(.01) = n(.0099), and n = 101, which says in effect that with
about 15 random variables we'd expect one or more highly 'significant' but
meaningless correlations.
(a. We dropped the exponent of p because it is 1.00 here.)
(b. 15 variables give us 15*14/2 = 105 correlations.)
At p=.05, 1 = n(.95)(.05), and n = 21.
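The two solutions above can be checked with a short Python sketch. It only
rearranges the formula Nr = n(1-p)(p^r) to solve for n; the function names
are made up for illustration, and the "variables needed" helper assumes
every pair of variables counts as one trial.

```python
def trials_for_expected_runs(p, r=1, target=1.0):
    """Solve Nr = n(1-p)(p^r) for n, with Nr set to `target`."""
    return target / ((1 - p) * p ** r)


def pair_count(k):
    """Number of distinct correlations among k variables: k(k-1)/2."""
    return k * (k - 1) // 2


def variables_needed(n_trials):
    """Smallest number of variables whose pair count reaches n_trials."""
    k = 2
    while pair_count(k) < n_trials:
        k += 1
    return k


for p in (0.01, 0.05):
    n = trials_for_expected_runs(p)
    k = variables_needed(n)
    print(f"p={p}: n = {n:.2f} trials, reached with {k} variables "
          f"({pair_count(k)} correlations)")
```

For p=.01 this gives n of about 101, which 15 variables (105 correlations)
are the first to reach.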
So, with 10 random variables thrown into the correlation bonnet, we expect a
number of misleading bees amongst them.
Let's check to see how many .05 results we would expect in the 45 'trials'
for 10 variables.
Nr = 45(.95)(.05^1)
Nr = 2.1375.
And in the 105 trials on 15 random variables?
Nr = 105(.95)(.05^1)
Nr = 4.9875.
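A rough simulation can be set beside those figures. The sketch below is not
part of the calculation above: it assumes 30 observations per variable,
independent normal random variables, and the usual table value of .361 for
a two-tailed .05 correlation at 28 degrees of freedom. It counts individual
'significant' correlations, whose expected number is simply n*p (2.25 for
45 trials), a little above the run-based figure because one run can hold
more than one hit.

```python
import math
import random

R_CRIT = 0.361  # assumed table value: two-tailed .05, df = 28 (30 observations)


def expected_runs(n, p, r=1):
    """The formula from the post: Nr = n(1-p)(p^r)."""
    return n * (1 - p) * p ** r


def standardize(xs):
    """Center a column and scale it to unit length, so that the dot
    product of two such columns is their Pearson correlation."""
    mean = sum(xs) / len(xs)
    centered = [x - mean for x in xs]
    norm = math.sqrt(sum(c * c for c in centered))
    return [c / norm for c in centered]


def count_significant(n_vars, n_obs, rng, r_crit=R_CRIT):
    """Generate n_vars random columns and count pairs with |r| >= r_crit."""
    cols = [standardize([rng.gauss(0, 1) for _ in range(n_obs)])
            for _ in range(n_vars)]
    hits = 0
    for i in range(n_vars):
        for j in range(i + 1, n_vars):
            r = sum(a * b for a, b in zip(cols[i], cols[j]))
            if abs(r) >= r_crit:
                hits += 1
    return hits


def mean_hits(n_vars=10, n_obs=30, trials=400, seed=1):
    """Average count of 'significant' correlations over many random draws."""
    rng = random.Random(seed)
    total = sum(count_significant(n_vars, n_obs, rng) for _ in range(trials))
    return total / trials


if __name__ == "__main__":
    print("formula, 45 trials: ", expected_runs(45, 0.05))
    print("formula, 105 trials:", expected_runs(105, 0.05))
    print("simulated mean hits, 10 variables:", mean_hits())
```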
I am subject to frets and anxieties about a similar problem I encounter in
sports handicapping, wherein one tries to 'predict' which team will win and
by what margin.
Amongst other things, I (uniquely times six or seven) devise for each team
both home and away offensive and defensive numbers, independently of each
other (home vs away). [Say vo, vd, ho, hd. v=visitor.]
Obviously, you have more information about a team's offensive and defensive
capability when all games are included in your numbers, but no whole-league
guesstimate of the relationship of home prowess versus away prowess will do
a good job.
So, why not use not just vo, vd, ho, hd but both teams' complete set of four
power-ratings in your multiple regression?
These variables are not random, so why not use all of them?
Well, the visitor's home offensive and defensive numbers do provide
additional prowess information for judging away games but also include what
is misleading info for away games: the part of visitors' ho and hd that is
due to home advantage and to random events that weren't present during the
away games. And similarly for the home team's numbers as a visitor, where
perhaps there was an away disadvantage, with or without travel being
involved.
So, when the multiple regression procedure in effect decides what part of
the visiting team's home numbers do not reinforce its visitor numbers, what
is left includes not only the home advantage but also a different set of
random 'noises' than are present in the as-visitor numbers.
The result is, when you include both teams' full number set of 4 variables
instead of just the two each that are directly relevant, you get a greater
decrement in the 'adjusted' R-square than you would without the extras.
This acknowledges the role of additional noise in the additional information
variable.
I have - on those occasions when I try the full set for curiosity's sake -
sometimes found the net adjusted R-square is no higher than with just half
of the variables, the directly relevant four.
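To put numbers on that penalty, here is a small sketch using the standard
adjusted R-square formula, 1 - (1 - R^2)(n - 1)/(n - k - 1), for n
observations and k predictors. The game count and both raw R-square values
are hypothetical, picked only to show how a small raw gain from four extra
variables can turn into a net loss after adjustment.

```python
def adjusted_r_square(r_square, n_obs, n_predictors):
    """Standard adjusted R-square: 1 - (1 - R^2)(n - 1)/(n - k - 1)."""
    return 1 - (1 - r_square) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Hypothetical figures, for illustration only.
n_games = 200
r2_four = 0.20   # raw R-square with the four directly relevant ratings
r2_eight = 0.21  # raw R-square with all eight ratings

adj_four = adjusted_r_square(r2_four, n_games, 4)
adj_eight = adjusted_r_square(r2_eight, n_games, 8)

print(f"4 variables: raw {r2_four:.3f}, adjusted {adj_four:.4f}")
print(f"8 variables: raw {r2_eight:.3f}, adjusted {adj_eight:.4f}")
```

With these made-up inputs the raw R-square rises with the extra variables
while the adjusted value falls.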
--
(c)eleaticus
ee-lee-AT-i-cus
eleaticus@xxxxxxxxxxxxx