Re: Basic math question - averaging numbers



On 3/10/2011 7:17 PM, stavros wrote:
> I work in an IT job, and I'm doing some analysis on hard disk usage.
> I've been running a process for several weeks which collects data
> every hour. The process records the total number of bytes read during
> that hour, as well as the total number of disk requests for that
> hour. It also records "bytes per request" for that hour, through
> simple division.
>
> I produced a graph displaying the hourly "bytes per request" totals
> for several days. The graph is getting cluttered as I add data for
> new days, so I'd like to change the graph to show daily averages for
> "bytes per request".
>
> My first approach was to go back to the raw data. I summed the total
> number of bytes read during an entire day, and summed the total number
> of disk requests for that day, and calculated "bytes per request"
> through dividing those sums. The resulting graph surprised me - the
> hourly graph had shown regular spikes up into the millions, while the
> daily averages were down in the thousands.

This makes perfectly good sense. Remember the old chestnut about
drowning in a river whose average depth is just six inches? Or the
one about the fellow with his feet in the oven and his head in the
freezer, experiencing a comfortable average temperature?

Think of cars passing by your street corner. There'll be N per
day, an average rate of N/24 per hour. But now count each hour
individually; do you think the 9AM and 3AM values will be alike?
Count them minute by minute; do you expect a smooth graph? Let's get
extreme: Count them microsecond by microsecond. In any microsecond
there's either a car present or there isn't. When there's not, you
calculate an average rate of 0 cars per hour; when a car is present
you get an average of 3.6 billion per hour. Yet N/24 is still as valid
as it ever was, even though it's unlike both zero and 3.6e9.
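
For anyone who wants to check the arithmetic, here is a minimal sketch. The function name, the constant, and the figure of 2400 cars per day are made up for illustration; only the "one car in one microsecond" case comes from the paragraph above.

```python
# Hypothetical helper: extrapolate a count seen in a short window
# to an hourly rate. Names and the 2400-car figure are illustrative only.
MICROSECONDS_PER_HOUR = 60 * 60 * 1_000_000


def hourly_rate(count, window_us):
    """Cars per hour implied by `count` cars seen in `window_us` microseconds."""
    return count * MICROSECONDS_PER_HOUR / window_us


# One-microsecond windows: either no car, or one car.
empty_window = hourly_rate(0, 1)   # 0 cars per hour
car_present = hourly_rate(1, 1)    # 3.6 billion cars per hour

# A whole day as the window: N = 2400 cars (assumed) gives N/24 per hour.
whole_day = hourly_rate(2400, 24 * MICROSECONDS_PER_HOUR)

print(empty_window, car_present, whole_day)
```

The same street yields all three numbers; the only thing that changes is the size of the window.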

Smoothness is a phenomenon of scale. At planetary scale the Earth
is a sphere, but at human scale you can spend years training to climb
the insignificant irregularity called Everest. On the average, the
Earth is smooth; try not to fall to your death on the smoothness.

> I thought this discrepancy might be hard to explain to my boss, who's
> familiar with the hourly graphs. I had the idea for a different
> approach, which was to average the hourly "bytes per request" totals,
> and come up with daily averages this way. This produced completely
> different totals than the first approach.
>
> The question is, which approach is "correct"? I put "correct" in
> quotes, because I have a feeling both are correct in their own way,
> depending on what I'm trying to do. I just want to understand this
> better. Why do these approaches produce completely different totals?
> They seem (to my non-math-inclined brain) to be roughly equivalent,
> but clearly that's not the case.

You have rediscovered "politician's arithmetic." Or else, maybe,
you are learning that averages are adjectives that describe reality,
not facts that determine it.

"On the average," you say, "I smoke twenty cigarettes a day."
But then you reflect: "Well, for one week last year I swore off
cigarettes while I was pursuing that neo-Puritan girl, the one with
the--well, this is a family newsgroup, so let's not get specific, but
anyhow for that one week I averaged zero cigarettes a day." Now you
muse: "Twenty, plus zero, divided by two -- I guess I average ten
cigarettes daily. Don't call *me* a pack-a-day smoker!"

Ponder this line of reasoning, and consider what might be wrong
with it, and I think you'll understand your calculation better.
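
If it helps to see the two calculations side by side, here is a rough sketch. The hourly figures are invented (23 busy hours of small reads plus one nearly idle hour with two huge reads), the function names are hypothetical, and the code assumes every hour has at least one request and that the smoker's year has 52 weeks. It is not your data, only an illustration of the mechanism.

```python
import math

# Invented hourly samples as (bytes_read, requests). Assumes requests > 0.
hours = [(4096 * 1_000_000, 1_000_000)] * 23 + [(4_000_000, 2)]


def ratio_of_sums(samples):
    """Daily bytes per request: total bytes divided by total requests."""
    total_bytes = sum(b for b, _ in samples)
    total_requests = sum(r for _, r in samples)
    return total_bytes / total_requests


def mean_of_ratios(samples):
    """Plain average of the hourly ratios; every hour counts equally."""
    return sum(b / r for b, r in samples) / len(samples)


def weighted_mean_of_ratios(samples):
    """Average of the hourly ratios, each weighted by its request count."""
    total_requests = sum(r for _, r in samples)
    return sum((b / r) * r for b, r in samples) / total_requests


print(ratio_of_sums(hours))            # a little over 4096
print(mean_of_ratios(hours))           # roughly 87000
print(weighted_mean_of_ratios(hours))  # agrees with ratio_of_sums

# The smoker: 51 weeks at 20 a day, 1 week at 0 (52-week year assumed).
naive = (20 + 0) / 2                    # the "ten a day" claim
weighted = (20 * 51 + 0 * 1) / 52       # each rate weighted by its weeks
print(naive, weighted)
```

In this made-up day the idle hour accounts for 2 requests out of about 23 million, yet the plain average gives it one full vote in 24. Weighting each hourly ratio by its request count removes that distortion and reproduces the ratio of the sums.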

--
Eric Sosman
esosman@xxxxxxxxxxxxxxxxxxxx