Re: Newbie question: "representativeness" of subset of data




hamish@xxxxxxxxx wrote:
> Hi Paige,
>
> Thanks for your reply. What I mean is something like the following
> (Matlab / Octave code):
>
> dataset = randn(200,1);
> subset = dataset(ceil(200 * rand(20,1)));
>
> where randn(200,1) produces 200 datapoints in a normal distribution,
> and rand(20,1) produces 20 datapoints uniformly distributed on the
> interval (0,1). These latter are then used as indexes to pick
> datapoints from the former.
>
> If I were just to pick the 20 smallest values in the dataset, the mean,
> variance and skewness would obviously be very different from those of
> the whole dataset. Whereas if I put the datapoints in order and pick
> every 10th one, the shape of that distribution should be similar to
> the original one.
>
> You might say that the latter subset is more "representative" of the
> original dataset. What I'm trying to find is some measure of this
> "representativeness".

Rich Ulrich is correct as usual.

However, I would like to point out a scale of representativeness that I
would use here. If you sample randomly from a distribution (and of
course you do it properly) then the sample is representative. If you
select purposefully from a distribution, or non-randomly from a
distribution, then by definition this selection or sample is not
representative of the distribution.

That's my scale here -- representative, or not representative. If you
use this scale in your work, please give me complete credit.

I don't see any purpose in going further into a statistical test when
BY DEFINITION your non-randomly selected sample is NOT REPRESENTATIVE.
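
As a minimal sketch of the distinction (Octave syntax, intended to run
in Matlab as well), the block below draws one random subset and the two
purposeful subsets described in the quoted post, then prints their
summary statistics next to those of the full dataset. All variable
names are illustrative and not from the thread, and randperm(n, k) is
assumed to be available in the version in use.

```octave
% Sketch only: names are illustrative, and randperm(n, k) assumes a
% reasonably recent Matlab or Octave.
n = 200;                       % size of the full dataset
k = 20;                        % size of each subset
dataset = randn(n, 1);         % 200 draws from a standard normal

% Random selection without replacement: every point is equally likely.
subset_random = dataset(randperm(n, k));

% The two purposeful (non-random) selections from the quoted post.
sorted_data     = sort(dataset);
subset_smallest = sorted_data(1:k);       % the 20 smallest values
subset_every10  = sorted_data(10:10:n);   % every 10th ordered value

% Moment-based skewness, written out to avoid toolbox dependencies.
skew = @(x) mean((x - mean(x)).^3) / mean((x - mean(x)).^2)^1.5;

names   = {'all', 'random', 'smallest', 'every10'};
samples = {dataset, subset_random, subset_smallest, subset_every10};
for i = 1:numel(samples)
  x = samples{i};
  fprintf('%-9s mean = %7.3f  var = %6.3f  skew = %7.3f\n', ...
          names{i}, mean(x), var(x), skew(x));
end
```

Only the first subset is a random sample in the sense used above; the
other two are selected by rule, however similar or different their
summary statistics turn out to be.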

Now, reading between the lines, I think you are asking a different
question: "how close is this non-random sample to a normal
distribution, and is there a scale for that?" There are many such
scales, Kolmogorov-Smirnov being just one. Each has its own advantages
and disadvantages, so K-S may or may not be what you are looking for.
Having said all that, I still can't understand the purpose behind this
exercise of comparing non-randomly selected data to a random
distribution.
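
For reference, the two-sample K-S statistic is the largest vertical gap
between two empirical distribution functions. A minimal sketch in
Octave syntax follows, using base functions only. It compares each
subset with the full dataset rather than with a fitted normal curve,
which is an assumption made here for simplicity. The helper names
ecdf_at and ks_distance are hypothetical, and randperm(n, k) is assumed
to be available.

```octave
% Sketch only: ecdf_at and ks_distance are illustrative helper names.
% ecdf_at(sample, x): fraction of sample values <= each element of x.
ecdf_at = @(sample, x) arrayfun(@(t) mean(sample <= t), x);

% Two-sample K-S statistic: D = max |F_a(x) - F_b(x)|, evaluated at the
% pooled data points, where the step functions can change.
ks_distance = @(a, b) max(abs(ecdf_at(a, [a(:); b(:)]) - ...
                              ecdf_at(b, [a(:); b(:)])));

n = 200;
k = 20;
dataset     = randn(n, 1);
sorted_data = sort(dataset);

subset_random   = dataset(randperm(n, k));   % random, no replacement
subset_smallest = sorted_data(1:k);          % the 20 smallest values
subset_every10  = sorted_data(10:10:n);      % every 10th ordered value

fprintf('D(random)   = %.3f\n', ks_distance(subset_random,   dataset));
fprintf('D(smallest) = %.3f\n', ks_distance(subset_smallest, dataset));
fprintf('D(every10)  = %.3f\n', ks_distance(subset_every10,  dataset));
```

With no tied values, the 20 smallest points give D = 0.9 and the
every-10th selection gives D = 0.045, both by construction. D is used
here only as a descriptive distance: the usual K-S p-value assumes
random sampling, which the two purposeful subsets do not satisfy.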

--
Paige Miller
paige.miller@xxxxxxx
