Re: Newbie question: "representativeness" of subset of data
- From: "Paige Miller" <paige.miller@xxxxxxx>
- Date: 18 Nov 2005 12:40:06 -0800
hamish@xxxxxxxxx wrote:
> Hi Paige,
>
> Thanks for your reply. What I mean is something like the following
> (Matlab / Octave code):
>
> dataset = randn(200,1);
> subset = dataset(ceil(200 * rand(20,1)));  % indices 1..200, drawn with replacement
>
> where randn(200,1) produces 200 datapoints in a normal distribution,
> and rand(20,1) produces 20 datapoints uniformly distributed on the
> interval (0,1). These latter are then used as indexes to pick
> datapoints from the former.
>
> If I were just to pick the 20 smallest values in the dataset, the mean,
> variance and skewness would obviously be very different from that of
> the whole dataset. Whereas if I put the datapoints in order and pick
> each 10th one, the shape of that distribution should be similar to the
> original one.
>
> You might say that the latter subset is more "representative" of the
> original dataset. What I'm trying to find is some measure of this
> "representativeness".
Rich Ulrich is correct as usual.
However, I would like to point out a scale of representativeness that I
would use here. If you sample randomly from a distribution (and of
course you do it properly) then the sample is representative. If you
select purposefully from a distribution, or non-randomly from a
distribution, then by definition this selection or sample is not
representative of the distribution.
That's my scale here -- representative, or not representative. If you
use this scale in your work, please give me complete credit.
I don't see any purpose in going further into a statistical test when
BY DEFINITION your non-randomly selected sample is NOT REPRESENTATIVE.
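
The distinction can be illustrated with a minimal sketch. It is written in
Python rather than Matlab/Octave, and the seed and all variable names are
assumptions made only for illustration. It draws one simple random sample
(without replacement) and two non-random selections from the same 200
points, then prints the mean and variance of each:

```python
import random
import statistics

random.seed(0)  # arbitrary fixed seed so the run is repeatable

# Stand-in for the original dataset: 200 draws from a standard normal.
dataset = [random.gauss(0.0, 1.0) for _ in range(200)]

# Simple random sample of 20 points, drawn without replacement.
random_subset = random.sample(dataset, 20)

# Two non-random selections from the same data, for contrast.
ordered = sorted(dataset)
smallest_20 = ordered[:20]   # purposeful: the 20 smallest values
every_tenth = ordered[::10]  # systematic: every 10th ordered value

for name, values in [
    ("full dataset", dataset),
    ("random sample", random_subset),
    ("smallest 20", smallest_20),
    ("every 10th", every_tenth),
]:
    print(f"{name:>13}: mean={statistics.mean(values):+.3f}  "
          f"variance={statistics.variance(values):.3f}")
```

Only the random sample is representative in the sense used above. The
every-10th selection may look similar to the full dataset, but that
similarity comes from how it was constructed, not from sampling.
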
Now, I think, reading between the lines, you are asking a different
question ... I think you might want to be able to answer "how close is
this non-random sample to a normal distribution?" Can I derive some
scale for this? There are many such scales, Kolmogorov-Smirnov being
just one. Each of these many scales has certain advantages and
disadvantages. K-S may or may not be what you are looking for. Having
said all that, I still can't understand the purpose behind this
exercise of comparing non-randomly selected data to a random
distribution.
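
For reference, here is a rough sketch of the one-sample K-S statistic, again
in Python with hypothetical names. It assumes the reference distribution is
a standard normal and computes only the distance D (the largest gap between
the empirical CDF of the subset and the normal CDF), with no p-value, since
a p-value would not mean much for a non-randomly selected subset:

```python
import math
import random


def normal_cdf(x, mu=0.0, sigma=1.0):
    """CDF of a normal distribution, computed via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def ks_statistic(sample, cdf):
    """One-sample K-S statistic: the largest gap between the empirical
    CDF of `sample` and the reference `cdf`."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # The empirical CDF jumps from (i-1)/n to i/n at x; check both sides.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


random.seed(0)  # arbitrary fixed seed so the run is repeatable
dataset = [random.gauss(0.0, 1.0) for _ in range(200)]
ordered = sorted(dataset)

d_smallest = ks_statistic(ordered[:20], normal_cdf)     # 20 smallest values
d_every_tenth = ks_statistic(ordered[::10], normal_cdf)  # every 10th value

print(f"D, 20 smallest values: {d_smallest:.3f}")
print(f"D, every 10th value:   {d_every_tenth:.3f}")
```

D lies between 0 and 1, and smaller values mean the subset's empirical
distribution sits closer to the reference. For real work a tested library
routine such as scipy.stats.kstest is a safer choice than hand-rolled code.
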
--
Paige Miller
paige.miller@xxxxxxx