algorithm - read file only once for Stratified sampling -


if not know distribution (or size/probability) of each subpopulation (stratum), , not know total population size, possible stratified sampling reading file once? thanks.

https://en.wikipedia.org/wiki/stratified_sampling

regards, lin

assuming each record in file can identified being in particular sub-population, , know ahead of time size of random sample want sub-population hold, each sub-population, datastructure allowing reservoir sampling, sub-population (https://en.wikipedia.org/wiki/reservoir_sampling#algorithm_r).

so repeatedly:

read record

find out sub-population in , datastructure representing reservoir sampling sub-population, creating if necessary.

use data-structure , record read reservoir sampling sub-population.

at end have, each sub-population seen, reservoir sampling data-structure containing random sample population.

for case when wish end k of n samples forming stratified sample on different classes of records, don't think can better keeping k of each class , downsampling this. suppose can , give initial block of records organised stratified sample have less k/2 of class kept. follow block huge number of records, of class, underrepresented. in case, final random sample should have more k/2 class, , (if random) there should small non-zero probability more k/2 of randomly chosen records came first block. fact never keep more k/2 of these records first block means probability sampling scheme zero, keeping less k of each class won't work in worst case.

here cheat method. suppose instead of reading records sequentially can read records in order chose. if through stackoverflow see (rather contrived) methods based on cryptography generating random permutation of n items without holding n items in memory @ 1 time, this. keep pool of k records @ time proportions of items in pool stratified sample, adding or removing items pool when forced keep proportions correct. think can because need add item of class x keep proportions correct when have observed item of class x. because went through records in random order claim have random stratified sample. have stratified sample, departure randomness can in items selected particular class. consider permutations select items not of class in same order permutation chosen, select items of class in different orders. if there bias in way items of class selected (as there is) because bias affect different items of class in different ways depending on permutation selected result of random choice between of these different permutations total effect unbiassed.