söndag 29 december 2019

Iterating in random order with lazy Fisher Yates

While doing research for a talk, I stumbled across the Fisher-Yates shuffle. It is used for shuffling a range of data:


-- To shuffle an array a of N elements (indices 0..N-1):
for i from 0 to N−2 do
     j ← random integer such that ij < N
     exchange a[i] and a[j]


One needs an array a of data to shuffle. But what if the range is the sequence 0, 1, 2, ... N -1   ?
In that case the i:th element a[i] is i, before being touched by the shuffling. Perhaps we could avoid some work if one uses a sparse representation of a?
Only the elements where a[i]!=i need to be stored, the rest are implied.

Another observation is that the i:th step, older elements (those at index [0, i) are not used, so they could be dropped to store space if not needed.


This means it is possible to lazily generate a shuffled range!

I use a std::map for the sparse storage. Here is a ascii art visualization of the algorithm in action. Each row is after step i and N=10, a * means that element is not stored in a.

i = 0 j = 3 a[j](before) = 3 ***0******   size = 1
i = 1 j = 5 a[j](before) = 5 _**0*1****   size = 2
i = 2 j = 8 a[j](before) = 8 __*0*1**2*   size = 3
i = 3 j = 5 a[j](before) = 1 ___0*0**2*   size = 3
i = 4 j = 6 a[j](before) = 6 ____*04*2*   size = 3
i = 5 j = 9 a[j](before) = 9 _____04*20   size = 4
i = 6 j = 8 a[j](before) = 2 ______4*40   size = 3
i = 7 j = 9 a[j](before) = 0 _______*47   size = 2
i = 8 j = 9 a[j](before) = 7 ________44   size = 2
last is 4

The last column is the number of elements stored in the sparse array a.

The amount of elements to store is small in the beginning and the end. I averaged out a number of cases and experimentally found that the amount of elements needed to be stored at step i out of N is:

i*(N-i)/(2N)

which has a maximum N/4 at i=N/2. That means on average, four times less storage (excluding the logarithmic overhead of std::map) need to be stored.
The worst case is min(i,N-i) and the best case is 0.

Perhaps more important, the work is done during iteration, instead of doing all work up front.

I whipped this up into two generic functions. The first one executes a callback on each shuffled number a[j], and the other one iterates through an iterator range in random order.



This is how to use it:

int
main()
{
std::random_device r;
std::minstd_rand rnd(r());
std::cout << "Let's randomly print 0 through 4:\n";
lazy_fisher_yates(5, rnd, [](auto i) { std::cout << i << '\n'; });
std::cout << "Let's randomly select some words:\n";
auto text = { "fish", "cow", "bird", "amoeba", "seal" };
lazy_fisher_yates(
begin(text), end(text), rnd, [](auto it) { std::cout << *it << '\n'; });
}
view raw main.cpp hosted with ❤ by GitHub
which outputs
Let's randomly print 0 through 4:
1
2
0
3
4
Let's randomly select some words:
seal
amoeba
bird
fish
cow

I implemented this for fun, if you want to do this more efficiently you can use a block cipher instead which requires constant memory and no work up front. I just stumbled across this while doing research for a talk I am giving about using a block  cipher for this purpose.

This is the full implementation:


/*
*
* Lazy Fisher-Yates iteration
*
* This steps through a range randomly, using the idea from the Fisher Yates
* shuffle. It does not shuffle the range and then iterate, instead it
* stores a sparse representation of the index range.
*
* The amount of memory reguired increases the closer to the middle one is.
* Experimentally, it seems like the expected amount of memory at step i
* out of N is i*(N-i)/2N (so at most N/4 when i=N/2).
* The underlying store is however a std::map, so expect N log N memory need.
*
* By Paul Dreik 2019 https://www.pauldreik.se/
* License: Boost 1.0
* SPDX-License-Identifier: BSL-1.0
*/
#pragma once
#include <cassert>
#include <map>
#include <random>
#include <utility>
// invokes cb with a each integer in [0,N) in random order
template<typename Integer, typename URBG, typename Callback>
void
lazy_fisher_yates(Integer N, URBG&& rng, Callback cb)
{
if (N == 0) {
return;
}
std::map<Integer, Integer> a;
for (Integer i = 0; i < N - 1; ++i) {
std::uniform_int_distribution<Integer> dist(i, N - 1);
Integer j = dist(rng);
// find a[i]
Integer a_i;
if (!a.empty() && a.begin()->first == i) {
a_i = a.begin()->second;
} else {
a_i = i;
}
// find a[j]
Integer a_j;
auto j_it = a.lower_bound(j);
if (j_it != a.end() && j_it->first == j) {
a_j = j_it->second;
j_it->second = a_i;
} else {
a_j = j;
a.insert(j_it, { j, a_i });
}
// drop the first entry to save space if possible
if (a.begin()->first <= i) {
a.erase(a.begin());
}
cb(a_j);
}
assert(a.size() <= 1);
// the last iteration
auto a_j = a.empty() ? N - 1 : a.rbegin()->second;
cb(a_j);
}
// invokes cb on each iterator in [first,last) in random order
template<typename RandomAccessIterator, typename URBG, typename Callback>
void
lazy_fisher_yates(RandomAccessIterator first,
RandomAccessIterator last,
URBG&& rng,
Callback cb)
{
auto N = std::distance(first, last);
lazy_fisher_yates(
N, std::forward<URBG>(rng), [&cb, &first](auto i) { cb(first + i); });
}

Inga kommentarer: