söndag 29 december 2019

Iterating in random order with lazy Fisher Yates

While doing research for a talk, I stumbled across the Fisher-Yates shuffle. It is used for shuffling a range of data:


-- To shuffle an array a of N elements (indices 0..N-1):
for i from 0 to N−2 do
     j ← random integer such that ij < N
     exchange a[i] and a[j]


One needs an array a of data to shuffle. But what if the range is the sequence 0, 1, 2, ... N -1   ?
In that case the i:th element a[i] is i, before being touched by the shuffling. Perhaps we could avoid some work if one uses a sparse representation of a?
Only the elements where a[i]!=i need to be stored, the rest are implied.

Another observation is that the i:th step, older elements (those at index [0, i) are not used, so they could be dropped to store space if not needed.


This means it is possible to lazily generate a shuffled range!

I use a std::map for the sparse storage. Here is a ascii art visualization of the algorithm in action. Each row is after step i and N=10, a * means that element is not stored in a.

i = 0 j = 3 a[j](before) = 3 ***0******   size = 1
i = 1 j = 5 a[j](before) = 5 _**0*1****   size = 2
i = 2 j = 8 a[j](before) = 8 __*0*1**2*   size = 3
i = 3 j = 5 a[j](before) = 1 ___0*0**2*   size = 3
i = 4 j = 6 a[j](before) = 6 ____*04*2*   size = 3
i = 5 j = 9 a[j](before) = 9 _____04*20   size = 4
i = 6 j = 8 a[j](before) = 2 ______4*40   size = 3
i = 7 j = 9 a[j](before) = 0 _______*47   size = 2
i = 8 j = 9 a[j](before) = 7 ________44   size = 2
last is 4

The last column is the number of elements stored in the sparse array a.

The amount of elements to store is small in the beginning and the end. I averaged out a number of cases and experimentally found that the amount of elements needed to be stored at step i out of N is:

i*(N-i)/(2N)

which has a maximum N/4 at i=N/2. That means on average, four times less storage (excluding the logarithmic overhead of std::map) need to be stored.
The worst case is min(i,N-i) and the best case is 0.

Perhaps more important, the work is done during iteration, instead of doing all work up front.

I whipped this up into two generic functions. The first one executes a callback on each shuffled number a[j], and the other one iterates through an iterator range in random order.



This is how to use it:

which outputs
Let's randomly print 0 through 4:
1
2
0
3
4
Let's randomly select some words:
seal
amoeba
bird
fish
cow

I implemented this for fun, if you want to do this more efficiently you can use a block cipher instead which requires constant memory and no work up front. I just stumbled across this while doing research for a talk I am giving about using a block  cipher for this purpose.

This is the full implementation: