Re: multiple samples w/ replacement from large datasets
- From: Michael.Lacy.junk@xxxxxxxxxxxxx
- Date: 12 Apr 2006 09:00:03 -0700
Data Matter <fungile@xxxxxxxxx> wrote:
I'd like to take multiple random samples with replacement from a huge
data set. The size of the data set is such that it would take 15-20
minutes to pass through it once. Is there any trick to generate k
samples without having to pass through the data k times?
DM
The best way to do this will of course depend on what software
is used. Here, though, is a fairly generic pseudo-code
algorithm that will do what I think you want to do. Lots of
comments, not much "code;" hope this isn't overkill and that
I haven't misunderstood your question.
* Assume that population size PopN is known, and that
* k samples of size SampN are desired. Assume that population
* contains items j = 1, ..., PopN.
* Create k vectors C1, ..., Ck (indexed by i = 1, ..., k), each of length PopN.
* Each vector element Ci[j] will be an integer to indicate how many
* times the j_th item in the population is to be included in the
* i_th of the k samples.
* Construct the vectors as follows:
* Each element is initially chosen 0 times for each sample
for i = 1 to k {
    for j = 1 to PopN {
        Ci[j] = 0
    }
}
* Now pick SampN spots, with replacement, in each Ci
for i = 1 to k {
    for q = 1 to SampN {
        Pick m = a random uniform integer in 1, ..., PopN
        Ci[m] = Ci[m] + 1
    }
}
* Read through the population data once, and
* Use the Ci vectors to select k samples from
* the population data
for j = 1 to PopN {
    Read item j from the population data
    for i = 1 to k {
        Put item j Ci[j] times into the i_th sample
    }
}
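
For anyone who wants something closer to runnable, here is a rough Python
sketch of the same idea. It is only an illustration under a few assumptions
of mine: the function name draw_samples and its parameters are made up,
indices run from 0 rather than 1, and each Ci is kept as a Counter holding
only the positions that were actually picked, rather than a full vector of
PopN zeros, which saves memory when SampN is much smaller than PopN.

```python
import random
from collections import Counter


def draw_samples(population, pop_n, k, samp_n, seed=None):
    """Draw k samples of size samp_n, with replacement, in one pass.

    population: any iterable yielding exactly pop_n items (read once).
    Hypothetical helper for illustration; not from any library.
    """
    rng = random.Random(seed)

    # counts[i][j] = number of times item j goes into sample i.
    # Positions that were never picked are simply absent (count 0).
    counts = [
        Counter(rng.randrange(pop_n) for _ in range(samp_n))
        for _ in range(k)
    ]

    samples = [[] for _ in range(k)]

    # Single pass through the population data.
    for j, item in enumerate(population):
        for i in range(k):
            c = counts[i].get(j, 0)
            if c:
                samples[i].extend([item] * c)
    return samples


if __name__ == "__main__":
    data = (x * x for x in range(1000))  # stands in for the big file
    for s in draw_samples(data, pop_n=1000, k=3, samp_n=5, seed=42):
        print(s)
```

Note that the items in each sample come out in population order, not in
the order they were drawn. If the order matters for what you do next,
shuffle each sample afterwards.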
--
=-=-=-=-=-=-=-=-=-==-=-=-=
Mike Lacy, Ft Collins CO 80523
Clean out the 'junk' to email me.