Re: Basic math question - averaging numbers




"stavros" <stavros@xxxxxxxxxxxxxx> wrote in message
news:4eae8bbc-2082-40bb-b17f-c9616e0bb8fb@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
I work in an IT job, and I'm doing some analysis on hard disk usage.
I've been running a process for several weeks which collects data
every hour. The process records the total number of bytes read during
that hour, as well as the total number of disk requests for that
hour. It also records "bytes per request" for that hour, through
simple division.

I produced a graph displaying the hourly "bytes per request" totals
for several days. The graph is getting cluttered as I add data for
new days, so I'd like to change the graph to show daily averages for
"bytes per request".

My first approach was to go back to the raw data. I summed the total
number of bytes read during an entire day, and summed the total number
of disk requests for that day, and calculated "bytes per request"
by dividing those sums. The resulting graph surprised me - the
hourly graph had shown regular spikes up into the millions, while the
daily averages were down in the thousands.

I thought this discrepancy might be hard to explain to my boss, who's
familiar with the hourly graphs. I had the idea for a different
approach, which was to average the hourly "bytes per request" totals,
and come up with daily averages this way. This produced completely
different totals than the first approach.

The question is, which approach is "correct"? I put "correct" in
quotes, because I have a feeling both are correct in their own way,
depending on what I'm trying to do. I just want to understand this
better. Why do these approaches produce completely different totals?
They seem (to my non-math-inclined brain) to be roughly equivalent,
but clearly that's not the case.

Sorry if this is confusing; I can provide some data samples if that
would help. I know this isn't exactly a puzzle, but I'm sure y'all
can help. Thanks in advance for the education!

Seems to me that you are not seeking the average but an indication of the
spread: what are the lowest, average and largest bytes/request values for
each day?

As others have pointed out, the calculation you have is smoothing the data,
so the spread is not visible.
All you get is blancmange.
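
To make the smoothing concrete, here is a minimal sketch in Python of the
two calculations from the question. The hourly numbers are made up for
illustration (they are not stavros's data), and the function names are
my own. One quiet hour with a single huge request is enough to pull the
two "averages" far apart.

```python
# Hypothetical hourly samples as (bytes_read, requests). Made-up numbers.
hourly = [
    (4_000_000, 1_000),  # busy hour: 4,000 bytes/request
    (6_000_000, 1_500),  # busy hour: 4,000 bytes/request
    (2_000_000, 1),      # quiet hour, one huge request: 2,000,000 bytes/request
    (8_000_000, 2_000),  # busy hour: 4,000 bytes/request
]

def ratio_of_sums(samples):
    """First approach: total bytes for the day / total requests for the day."""
    total_bytes = sum(b for b, _ in samples)
    total_requests = sum(r for _, r in samples)
    return total_bytes / total_requests

def mean_of_ratios(samples):
    """Second approach: plain average of the hourly bytes/request figures.

    Hours with zero requests are skipped, since they have no ratio.
    """
    ratios = [b / r for b, r in samples if r > 0]
    return sum(ratios) / len(ratios)

print(round(ratio_of_sums(hourly)))  # weighted by request count, so the busy hours dominate
print(mean_of_ratios(hourly))        # every hour counts equally, so the spike dominates
```

The first figure answers "how big is a typical request over the whole
day"; the second answers "what does a typical hourly reading look like".
Neither one shows the spread.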

One way of calculating the spread and an "average" over a time period is to
stick with the raw data, literally:
keep your raw hourly statistics of bytes/request, compute the median and
quartiles of that range for each day, and graph those.
(In case anyone doesn't know, since this is OT for rec.puzzles:
the median is the value where 50% of the hourly statistics fall below and
50% fall above within the day. Similarly, the 25% quartile is the value that
25% of the statistics fall below, and likewise for the 75% quartile.
All of these are available in Excel: QUARTILE(range, n) with n = 1, 2 or 3
gives the 25% quartile, the median and the 75% quartile.)
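
If you would rather script it than use Excel, the same idea can be
sketched in Python with the standard library. The data below is made up,
the function name is my own, and I am assuming the "inclusive" method is
the one that lines up with Excel's QUARTILE; check a day by hand before
trusting it.

```python
from statistics import quantiles

# Hypothetical hourly bytes/request readings, keyed by day. Made-up numbers.
bytes_per_request = {
    "day 1": [3800, 3900, 4000, 4000, 4100, 4200, 4300, 2_000_000],
    "day 2": [3700, 3900, 4100, 4100, 4200, 4400, 4500, 4600],
}

def daily_quartiles(values):
    """Return (25% quartile, median, 75% quartile) for one day's readings.

    method="inclusive" treats the readings as the whole population and
    interpolates between neighbouring values (assumed to match Excel).
    """
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return q1, median, q3

for day, values in bytes_per_request.items():
    q1, median, q3 = daily_quartiles(values)
    print(f"{day}: 25%={q1:.0f}  median={median:.0f}  75%={q3:.0f}")
```

Note that the single 2,000,000 reading on "day 1" barely moves any of
the three figures, which is the point of graphing quartiles.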

Why quartiles and not min/max? Because the min and max fluctuate too much
and will worry your boss unnecessarily.


HTH
JJ


