Re: multiple samples w/ replacement from large datasets



Data Matter <fungile@xxxxxxxxx> wrote:

I'd like to take multiple random samples with replacement from a huge
data set. The size of the data set is such that it would take 15-20
minutes to pass through it once. Is there any trick to generate k
samples without having to pass through the data k times?

DM

The best way to do this will of course depend on what software
is used. Here, though, is a fairly generic pseudo-code
algorithm that will do what I think you want to do. Lots of
comments, not much "code;" hope this isn't overkill and that
I haven't misunderstood your question.


* Assume that the population size PopN is known, and that
* k samples of size SampN are desired. Assume that the population
* contains items j = 1, ..., PopN

* Create k vectors C1, ..., Ck (i = 1, ..., k), each of length PopN.
* Each vector element Ci[j] will be an integer indicating how many
* times the j_th item in the population is to be included in the
* i_th of the k samples.
* Construct the vectors as follows:

* Each element is initially chosen 0 times for each sample
for i = 1 to k {
   for j = 1 to PopN {
      Ci[j] = 0
   }
}
* Now pick SampN spots, with replacement, in each Ci
for i = 1 to k {
   for q = 1 to SampN {
      Pick m = a uniform random integer in 1, ..., PopN
      Ci[m] = Ci[m] + 1
   }
}
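
For what it's worth, the count vectors might be built in Python
roughly as follows. This is only a sketch under assumptions of my
own: the name make_counts is made up, positions are 0-based, and
each Ci is kept as a Counter holding only the positions actually
drawn rather than a full vector of length PopN (which saves memory
when SampN is much smaller than PopN).

```python
import random
from collections import Counter

def make_counts(pop_n, samp_n, k, seed=None):
    """Return k Counters; counts[i][j] = times item j goes into sample i."""
    rng = random.Random(seed)
    counts = []
    for _ in range(k):
        c = Counter()
        for _ in range(samp_n):
            m = rng.randrange(pop_n)  # uniform integer in 0 .. pop_n - 1
            c[m] += 1                 # one more draw of position m
        counts.append(c)
    return counts
```

Each Counter sums to SampN, so every sample ends up with exactly
SampN items once the data are read.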

* Read through the population data once, and
* use the Ci vectors to select k samples from
* the population data
for j = 1 to PopN {
   Read item j from the population data
   for i = 1 to k {
      Put item j Ci[j] times into the i_th sample
   }
}
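
The single pass through the data could look something like this
in Python. Again a sketch, not a definitive implementation: I am
assuming the population is any iterable that yields items in
order, that counts is a list of k mappings from 0-based position
to draw count, and take_samples is a hypothetical name.

```python
def take_samples(population, counts):
    """One pass over population; item j is copied counts[i].get(j, 0)
    times into sample i."""
    samples = [[] for _ in counts]
    for j, item in enumerate(population):   # the only read of the data
        for i, c in enumerate(counts):
            n = c.get(j, 0)
            if n:
                samples[i].extend([item] * n)
    return samples

# Tiny example: a population of 5 items and two samples of size 3.
counts = [{0: 2, 3: 1}, {1: 1, 4: 2}]
samples = take_samples("abcde", counts)
# samples == [['a', 'a', 'd'], ['b', 'e', 'e']]
```

Note that the items in each sample come out in population order;
shuffle each sample afterwards if the order of draws matters.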


--
=-=-=-=-=-=-=-=-=-==-=-=-=
Mike Lacy, Ft Collins CO 80523
Clean out the 'junk' to email me.
