What if your dataset doesn't fit in RAM? I will present the algorithm I use for shuffling large datasets. It isn't novel; one can find multiple instances of people reinventing it or something similar (in essence it descends from Rao). However, I don't know of anywhere that states the algorithm, shows why it's correct, and gets into the particular practical issues addressed below.
A 2-pass shuffle algorithm: Suppose we have data x_0, …, x_{n-1}. Choose an M sufficiently large that a set of n/M points can be shuffled in RAM using something like Fisher-Yates, but small enough that you can have M open files for writing (with decent buffering). Create M "piles" p_0, …, p_{M-1} that we can write data to. The mental model of a "pile" here is that it's a file you can append to, but in practice you might, say, have several piles exist as datasets in the same HDF5 file. The first pass of the algorithm is to split the data into these M piles, and the second pass shuffles each pile and appends it to the final result.
    // First pass
    create empty piles p[0], …, p[M-1]
    for i = 0, …, n-1 do
        j := uniform random draw from [0, …, M-1]
        append x[i] to pile p[j]

    // Second pass (perhaps done lazily)
    for j = 0, …, M-1 do
        shuffle p[j] in RAM with Fisher-Yates  // or whatever is convenient
        append p[j] to output file
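As a concrete sketch, here is a minimal line-oriented Python version of the two passes. The function name, the use of temporary files as piles, and the newline-delimited record format are my own illustrative choices, not anything prescribed by the algorithm (records containing newlines would need a different encoding):

```python
import os
import random
import tempfile

def two_pass_shuffle(records, num_piles, rng=random):
    """Shuffle an iterable of newline-free string records using pile files."""
    pile_dir = tempfile.mkdtemp()
    piles = [open(os.path.join(pile_dir, "pile%d" % j), "w+")
             for j in range(num_piles)]
    try:
        # First pass: stream each record into a uniformly chosen pile.
        for rec in records:
            piles[rng.randrange(num_piles)].write(rec + "\n")
        # Second pass: shuffle each pile in RAM, then emit it in pile order.
        out = []
        for p in piles:
            p.seek(0)
            block = p.read().splitlines()
            rng.shuffle(block)  # Fisher-Yates under the hood
            out.extend(block)
        return out
    finally:
        for p in piles:
            p.close()
            os.remove(p.name)
        os.rmdir(pile_dir)
```

In real use the second pass would append each shuffled pile to the output file as it goes rather than accumulate a list; returning a list here just keeps the sketch easy to check.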
Example of a shuffle: We start with unshuffled data (top); the first pass leaves M = 6 piles (middle); the second pass yields shuffled data (bottom).
Assuming you have enough memory to satisfy the above constraint on M and assuming that drawing a random number is 𝒪(1), this is a linear time algorithm; the constant factor is dominated by having to read and write each data point twice in external storage (but the reading/writing can be done in blocks rather than one point at a time). Since the reading and writing is stream-oriented, the algorithm still works for data with variable record length.
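A quick way to convince yourself that the two passes really do produce a uniform permutation is to run the scheme in memory on a tiny input many times and count how often each ordering appears. This is a simulation, not a proof; the helper and the trial count are mine:

```python
import itertools
import random
from collections import Counter

def two_pass(xs, M, rng):
    piles = [[] for _ in range(M)]
    for x in xs:                      # first pass: uniform pile assignment
        piles[rng.randrange(M)].append(x)
    out = []
    for p in piles:                   # second pass: shuffle each pile
        rng.shuffle(p)
        out.extend(p)
    return out

rng = random.Random(0)
counts = Counter(tuple(two_pass([0, 1, 2], 2, rng)) for _ in range(60000))
# Each of the 3! = 6 permutations should appear about 10000 times.
for perm in itertools.permutations([0, 1, 2]):
    assert abs(counts[perm] - 10000) < 500
```

With 60000 trials the expected count per permutation is 10000 with a standard deviation of roughly 90, so a tolerance of 500 is a comfortable margin.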
…Appendix: Performance comparison: The 2-pass shuffle seemed so obviously better than random access into a file that I hadn’t bothered to measure how much faster it actually is. One approach works, the other doesn’t, what’s there to measure? But the post was met with a lot of skepticism about whether it is faster at all, apparently on the basis that the 2-pass algorithm has an extra read/write and SSDs are fast. (Even with uncompressed data on local SSDs, sequential traversals are 48× as fast as random access traversals for my data.) So I measured the difference and found that, for my data and how it is stored, the 2-pass approach is 1000× as fast as random access (and that’s before incorporating further improvements to the 2-pass approach that are done in practice, which are to parallelize the first pass and integrate it with the data preprocessing). If this sounds too good to be true, bear in mind that this is not a comparison to some highly-regarded practice; it is a comparison to a bad idea, like quicksort against bubblesort.
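For readers who want to reproduce the sequential-vs-random comparison on their own hardware, here is a rough sketch of the measurement. The record size, file size, and file layout are made up for illustration; note that a file this small will sit in the OS page cache, so the gap it shows will be far smaller than the gap on a dataset that actually exceeds RAM:

```python
import os
import random
import tempfile
import time

RECORD = 4096   # bytes per record (illustrative)
N = 5000        # number of records (~20 MB file)

# Write a scratch file of N fixed-size records.
with tempfile.NamedTemporaryFile(delete=False) as f:
    path = f.name
    for _ in range(N):
        f.write(os.urandom(RECORD))

# Sequential traversal: read records in file order.
t0 = time.perf_counter()
with open(path, "rb") as f:
    while f.read(RECORD):
        pass
seq = time.perf_counter() - t0

# Random-access traversal: seek to each record in a shuffled order.
order = list(range(N))
random.shuffle(order)
t0 = time.perf_counter()
with open(path, "rb") as f:
    for i in order:
        f.seek(i * RECORD)
        f.read(RECORD)
rnd = time.perf_counter() - t0

os.remove(path)
print(f"sequential: {seq:.4f}s  random: {rnd:.4f}s")
```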