Fast Perfect Shuffling with Computer Assistance

Consider the following problems:

In all three cases you would like to get some collection of objects (tokens, submissions, cards) into a random ordering. More precisely, you want to apply a permutation that is chosen uniformly at random from the set of all possible permutations — a process commonly referred to as shuffling.

(Decorative illustration: two groups of five coloured squares at the borders in different orders, connected by a tangle of lines.)

Shuffling cards is not really a novel idea, and there are many ways to do it. The best is probably a riffle shuffle, or, if the cards are sleeved, a mash shuffle. But there are some problems:

Sometimes, you really want the peace of mind knowing that a stack of objects is in a perfectly random order. (Well; I do!) And since I do not trust myself to choose independent random numbers with uniform distribution (or any kind of distribution) I want to do what I call “computer-assisted shuffling” — a program generates the randomness, and then tells me how to move the objects so that they end up shuffled.

The simple way

Okay, this is not very difficult. We call random.shuffle or equivalent, print the resulting permutation, and then order the objects accordingly. For example:

import random
import sys

n = int(sys.argv[1])    # number of objects to shuffle
objects = list(range(n))
random.shuffle(objects)
print(objects)
shuffle1.py

And then you would get something like the following:

$ python shuffle1.py 5
[3, 1, 0, 4, 2]

So you take the first element and put it into position 3, the second into position 1, and so on.

Actually executing this is quite inconvenient, however. As it turns out, my desk is of finite size, and I do not have space for more than a few stacks of items at a time.

Hence I operationalise by having two stacks — one input stack, where all objects start, and one output stack, where they end up. And instead of executing “move top of input stack to position k in output stack”, I found it easier to execute “move position k of input stack to top of output”.

Since the input stack keeps getting smaller, this requires a small adjustment to the algorithm:

  n = int(sys.argv[1])
  objects = list(range(n))
  random.shuffle(objects)
 
+ for i in range(n):
+     for j in range(i+1,n):
+         objects[j] -= objects[j] >= objects[i]
 
  print(objects)
shuffle2.py
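
To convince myself that this adjustment is correct, here is a small simulation (my own sketch, not one of the attached files) that executes the adjusted instructions against an input stack and an output stack:

import random

n = 5
objects = list(range(n))
random.shuffle(objects)

# Same adjustment as in shuffle2.py: turn positions in the original stack
# into positions in the shrinking input stack.
instructions = objects[:]
for i in range(n):
    for j in range(i + 1, n):
        instructions[j] -= instructions[j] >= instructions[i]

# Simulate the human operator: repeatedly take the item at the given position
# (counting from the top) of the input stack and put it on the output stack.
input_stack = list(range(n))   # index 0 is the top of the stack
output_stack = []
for k in instructions:
    output_stack.append(input_stack.pop(k))

# Read bottom-to-top, the output stack is exactly the permutation `objects`.
print(objects, instructions, output_stack)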

Together with some simple output formatting, we have a first prototype:

$ python shuffle2.py 20
Instructions: Collect the objects into a stack. For each number, take the item at that position from the original stack and move it to an output stack.

   8  0  9 15  0  5
   8  7  6  2  7  8
   5  0  4  4  1  1
   0  0

But there is an issue: this algorithm secretly has Ω(n²) complexity! (This is Big O notation; since I am pedantic and this is a lower bound, Ω is used instead of 𝒪.) I am not talking about the code, mind you (though it is Ω(n²) as well; the bottleneck is always the human). The reason is simple: I do not have random access into the stacks, at least not in constant time. If I have to take the 17th element from the top, then I have to count to 17!

Still, it's not too bad. As we will see, it's a very reasonable approach, especially for small n.

Well, you can probably guess where this article is going. Can we find an algorithm that, when executed by a human, is 𝒪(n log n)?

The model

To answer that question, we have to think about which operations are possible to execute — we have to define an execution model.

So let us model my desk as holding s stacks, where s should be small (say s ∈ 𝒪(1), or s ∈ 𝒪(log n)). We can perform the following operations:

move a b: take the top object of stack a and place it on top of stack b.
transfer a b: pick up all of stack a at once and place it on top of stack b, keeping its order intact.
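
To make the model concrete, here is a tiny interpreter for these two operations (a sketch of mine, not code from the attachments; the stack conventions are assumptions):

# Stacks are Python lists; the last element is the top of the stack.
def move(stacks, a, b):
    # Take the top object of stack a and place it on top of stack b.
    stacks[b].append(stacks[a].pop())

def transfer(stacks, a, b):
    # Place all of stack a, as one unit, on top of stack b.
    stacks[b].extend(stacks[a])
    stacks[a].clear()

stacks = [[4, 3, 2, 1, 0], [], []]   # object 0 is on top of stack 0
move(stacks, 0, 1)
move(stacks, 0, 2)
transfer(stacks, 2, 0)
print(stacks)   # [[4, 3, 2, 1], [0], []]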

Now we can formulate the problem in this model:

All n objects start on stack 0. Output a sequence of move and transfer operations such that, after they have been executed, the objects lie on a single stack in a uniformly random order, using as few operations as possible.

A first try

One might think that it is enough to simply move the elements to a random stack a number of times:

import math, random, sys

n = int(sys.argv[1])
# Deal every object from stack 0 onto stack 1 or 2 at random, then put both
# stacks back; repeat roughly 1.5 * log2(n) times.
k = math.ceil(1.5 * math.log(n, 2))
for it in range(k):
    for i in range(n):
        a = random.choice([1, 2])
        print('move', 0, a)
    print('transfer', 1, 0)
    print('transfer', 2, 0)

This is, essentially, a riffle shuffle in reverse. But this does not give you a uniform distribution! Why not? Well, why would it? Just doing random things will, generally speaking, move you towards a uniform distribution, but you will never reach it.

(If you want a proof, simply consider that the above program makes nk random choices, each of which has precisely 2 possible outcomes with equal probability. Thus there are 2^(nk) equally likely possible sequences of choices, and every output has a probability which is a multiple of 1/2^(nk). Assume the output distribution were uniform. Then every output must have probability 1/n!, thus 2^(nk)/n! is an integer by the prior statement. However, since for n ≥ 3 we have that n! is divisible by 3 and 2^(nk) is not, we get a contradiction. Hence the distribution cannot be uniform.)

How to obtain a uniform permutation

To get to a uniform distribution, you have to aim for it. This is not difficult; see the Fisher-Yates shuffle. (The name is more impressive than the algorithm.)
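
For reference, here is a minimal sketch of the Fisher-Yates shuffle (my own illustration; it is essentially what random.shuffle does internally):

import random

def fisher_yates(lst):
    # Swap position i with a uniformly random position j <= i.
    for i in range(len(lst) - 1, 0, -1):
        j = random.randint(0, i)
        lst[i], lst[j] = lst[j], lst[i]
    return lst

print(fisher_yates(list(range(5))))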

But there is no need to think about this very much! Instead, for our purposes the computer is infinitely fast and perfectly random — we have to execute the resulting instructions by hand, after all, which is always going to be the bottleneck.

Key idea: Start with a uniform permutation and work backwards.

So let us ignore all the details on how one would come up with the random permutation. Instead we start with random.shuffle, which gives us the permutation already, and focus on outputting the shortest sequence of instructions that results in that ordering.

The advantage of this approach is that we do not have to worry about probabilities and uniform distributions anymore. So our problem simplifies to the following:

Given a permutation P of the n objects (already chosen uniformly at random by the computer), output a short sequence of instructions that, when executed, rearranges the stack according to P.

Generating instructions

Now comes a small sleight of hand. Instead of applying the permutation P, we might as well apply its inverse P⁻¹. Since every permutation has exactly one inverse, this does not affect the probability of an outcome. So we get this:

Given a uniformly random permutation P, output a short sequence of instructions that, when executed, rearranges the stack according to P⁻¹.

Why is this useful? Well, now the problem is merely sorting, where P gives us the value of each element. (So we compare elements i, j by comparing the values P(i), P(j).) Hence we can solve the problem by picking some sorting algorithm and executing it. The steps it takes along the way can then be translated into a list of instructions.
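
As a quick illustration of that viewpoint (my own snippet, not from the attachments): sorting the identity list with P as the key yields exactly the inverse permutation of P.

# P[i] is the "value" of element i; sorting by it gives argsort(P), i.e. the inverse of P.
P = [2, 0, 1]
order = sorted(range(len(P)), key=lambda i: P[i])
print(order)                                # [1, 2, 0]
print([P.index(v) for v in range(len(P))])  # the inverse of P: also [1, 2, 0]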

As an example, let us consider bubblesort. This is not actually a good idea, since it is an Ω(n²) algorithm, but hopefully it demonstrates how the execution of the algorithm relates to the list of instructions to generate. (If that seems obvious to you, just skip to the next section.)

First, let us look at a standard bubblesort:

from typing import List

def bubblesort(lst: List[int], P: List[int]) -> List[int]:
    n = len(lst)
    # Repeat n-1 times
    for it in range(n-1):
        # For each adjacent pair...
        for i in range(n-1):
            # ... if they are in the wrong order according to P ...
            if P[lst[i]] > P[lst[i+1]]:
                # ... swap them.
                lst[i], lst[i+1] = lst[i+1], lst[i]
    return lst

Now we can modify the algorithm to also generate instructions for the human operator.

 def bubblesort(lst: List[int], P: List[int]) -> List[int]:
     n = len(lst)
     for it in range(n-1):
         # Keep lst[i] on top of stack 1
+        print('move', 0, 1)
         for i in range(n-1):
             if P[lst[i]] > P[lst[i+1]]:
                 lst[i], lst[i+1] = lst[i+1], lst[i]
                 # Move to second-from-the-top in stack 1
+                print('move', 1, 2)
+                print('move', 0, 1)
+                print('move', 2, 1)
             else:
                 # Move to top
+                print('move', 0, 1)
         # Move stack 1 back to stack 0
         # (We need to reverse the order, so transfer does not work)
+        for i in range(n):
+            print('move', 1, 0)
     return lst
shuffle3.py

You would then use it like this:

lst = list(range(n))
P   = list(range(n))
random.shuffle(P)
bubblesort(lst, P)
shuffle3.py

Here, we sort lst according to P — so we would expect bubblesort(lst, P) to equal the inverse permutation P⁻¹ (element i ends up at position P[i]), and indeed, that is what we get.

Generating efficient instructions

There is a straightforward way of improving upon this: use a better sorting algorithm!

While there are many candidates to choose from, I wanted something that roughly matches the structure of the reverse riffle shuffle from above, since I think that this is the most efficient pattern for human execution. But there are other possibilities! In particular, people have considered sorting algorithms that minimise the number of modifications — that might be effective here as well.

Looking at the reverse riffle shuffle, I now profit from all the time I invested into watching sorting algorithm visualisations:

Key idea: Use LSD radix sort, with a radix of 2.

The idea behind LSD radix sort is simple — we start by splitting the input into two lists, even and odd, where even contains all the elements with an even value, i.e. where the last bit is 0. We then concatenate the two lists, and repeat the process for the second to last bit, and so on.

The code looks like this:

import math
from typing import List

def radixsort(lst: List[int], P: List[int]) -> List[int]:
    n = len(lst)
    # k+1 is the number of bits needed to represent any value of P
    k = math.floor(math.log(max(P), 2))

    for bit in range(k+1):
        lst0 = []
        lst1 = []
        for i in lst:
            # Check whether the bit is set, and append appropriately
            if P[i] >> bit & 1:
                lst1.append(i)
            else:
                lst0.append(i)
        # Concatenate the lists
        lst = lst0 + lst1
    return lst
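
As a quick sanity check (my own snippet, assuming the radixsort above is in scope together with its imports), the result should agree with simply sorting by the key P:

import random

n = 20
P = list(range(n))
random.shuffle(P)
# Both sides are the inverse permutation of P.
assert radixsort(list(range(n)), P) == sorted(range(n), key=lambda i: P[i])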

It is now straightforward to modify the code to also output instructions.

 def radixsort(lst: List[int], P: List[int]) -> List[int]:
     n = len(lst)
     k = math.floor(math.log(max(P), 2))
 
     for bit in range(k+1):
         lst0 = []
         lst1 = []
         for i in lst:
             if P[i] >> bit & 1:
                 lst1.append(i)
+                print('move', 0, 2)
             else:
                 lst0.append(i)
+                print('move', 0, 1)
         lst = lst0 + lst1
+        for i in lst0: print('move', 1, 0)
+        for i in lst1: print('move', 2, 0)
     return lst
shuffle4.py

Similar to before, we cannot use the transfer operation, since we need to preserve the order of the stack. Is it possible to get around this?

Making it even more efficient

To tackle this problem, let us start with the operations we want. Instead of moving all items individually from stacks 1 and 2 back to stack 0, we would like to execute only two transfer operations:

transfer 1 0
transfer 2 0

But this would no longer correspond to lst = lst0 + lst1; instead, it would correspond to reversing the order afterwards:

lst = lst0 + lst1
lst.reverse()

To analyse this, let S_n denote the set of permutations on n elements, i.e. the possible values of lst. I then write f_b: S_n → S_n for the function that corresponds to one iteration of the loop, where b is the index of the bit. Expressed in code, this would be

def f(lst: List[int]):
    lst0, lst1 = [], []
    for i in lst:
        if P[i] >> b & 1: lst1.append(i)
        else:             lst0.append(i)
    return lst0 + lst1

(As you can see, f_b also depends on P. But since P is constant, I'll leave it out.)

Hence the radixsort can be expressed as follows (here ∘ denotes function composition):

sort = f_k ∘ f_(k-1) ∘ ⋯ ∘ f_0

Let us also write rev: S_n → S_n for the function that reverses the input, so

def rev(lst: List[int]):
    lst.reverse()
    return lst

Since the transfer operations will essentially reverse the order after every iteration, we are looking for a representation of sort that looks like the following:

sort = (rev ∘ ?) ∘ ⋯ ∘ (rev ∘ ?)

We need to figure out something for the question marks. Feel free to stop and think about this for a bit!

Okay, my first thought was to consider what happens when we reverse the output of f_b. As an example, consider the list [2,3,5,0,1,4] and b = 0. So for f_0 we split into even and odd elements, and then concatenate the results: [2,0,4] + [3,5,1]. If we reverse this, we get [1,5,3] + [4,0,2]. Two things happened: now the odd elements come before the even ones, and the order of each is reversed.

To write this observation down, let g_b: S_n → S_n be defined just like f_b, but with lst0 and lst1 swapped:

def g(lst: List[int]):
    lst0, lst1 = [], []
    for i in lst:
        if P[i] >> b & 1: lst1.append(i)
        else:             lst0.append(i)
    return lst1 + lst0

While we could also swap the order of the elements in each of lst0 and lst1, there is another key observation that renders this unnecessary: the functions f_b, g_b preserve the relative order of elements in the input, so if, say, 3 comes before 5 in lst and both end up in lst1, then 3 will also precede 5 in lst1. So we can reverse both lst0 and lst1 by reversing the input instead. Combining the above leads to the following:

Key idea: rev ∘ f_b = g_b ∘ rev
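
Before using it, here is a quick check of this identity on random inputs (my own sketch; f and g are the functions from above, with P and b passed explicitly):

import random

def f(lst, P, b):
    lst0 = [i for i in lst if not (P[i] >> b & 1)]
    lst1 = [i for i in lst if P[i] >> b & 1]
    return lst0 + lst1

def g(lst, P, b):
    lst0 = [i for i in lst if not (P[i] >> b & 1)]
    lst1 = [i for i in lst if P[i] >> b & 1]
    return lst1 + lst0

def rev(lst):
    return lst[::-1]

n = 8
P = list(range(n)); random.shuffle(P)
lst = list(range(n)); random.shuffle(lst)
for b in range(3):
    assert rev(f(lst, P, b)) == g(rev(lst), P, b)   # rev o f_b == g_b o rev
print('identity holds')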

This is enough and now the problem reduces to symbol manipulation. Analogously (or by applying rev from both the left and right), we get rev ∘ g_b = f_b ∘ rev, and thus

f_b ∘ f_(b-1) = f_b ∘ rev ∘ rev ∘ f_(b-1) = (rev ∘ g_b) ∘ (rev ∘ f_(b-1))

So if k is odd (i.e. we have an even number of terms f_k, …, f_0), we can write

sort = f_k ∘ f_(k-1) ∘ ⋯ ∘ f_0 = (rev ∘ g_k) ∘ (rev ∘ f_(k-1)) ∘ ⋯ ∘ (rev ∘ g_1) ∘ (rev ∘ f_0)

This is the desired form, where we reverse the order after every step. If k is even, this does not quite work, and one lonely f_k will be left at the end. We could content ourselves with simply reversing the output one final time — but this is not necessary. Instead, we can flip P at the beginning, causing us to sort in the opposite order and end up with the reversed result.

 def radixsort2(lst: List[int], P: List[int]) -> List[int]:
     n = len(lst)
     k = math.floor(math.log(max(P), 2))
 
+    if not k&1: P = [n-1-i for i in P]
     for bit in range(k+1):
         lst0 = []
         lst1 = []
         for i in lst:
-            if P[i] >> bit & 1:
+            if ((P[i] >> bit) ^ bit) & 1:
                 lst1.append(i)
                 print('move', 0, 2)
             else:
                 lst0.append(i)
                 print('move', 0, 1)
         lst = lst0 + lst1
+        lst.reverse()
 
-        for i in lst0: print('move', 1, 0)
-        for i in lst1: print('move', 2, 0)
+        print('transfer', 1, 0)
+        print('transfer', 2, 0)
     return lst
shuffle5.py

As you can see, we first check whether k is even and flip P if so. During the loop, we need to switch lst0 and lst1 every second iteration, so we xor with bit.

In terms of the number of instructions, for n = 5 we improve by a factor of 1.43, and for n = 20 by 1.82 — asymptotically, this will converge to 2, since instead of executing 2n moves every iteration, we have n + 2 instructions (n moves and 2 transfers).
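
To spell out where the n = 20 figure comes from: there are five radix passes, and shuffle4.py issues 2·20 = 40 moves per pass (20 moves out, 20 moves back), i.e. 200 instructions in total, while shuffle5.py issues 20 + 2 = 22 instructions per pass, i.e. 110 in total; and 200/110 ≈ 1.82.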

Finishing touches

Since reading a long list of instructions is tedious and error-prone, we can make the output nicer to read. Additionally, within one iteration of the loop we have a lot of instructions that move the top of stack 0 to either 1 or 2, which are then followed by the two transfers. Since we always have this structure, we can instead simply tell the human operator how many cards to move to stack 1, then to stack 2, and so on. The sequence of move instructions then compresses down to a list of numbers:

move 0 1
move 0 1
move 0 1
move 0 2
move 0 2
move 0 1
move 0 2
move 0 2

would become

3 2 1 2
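
One way to produce this compression (a sketch of mine, not the actual shuffle6.py) is to run-length encode the target stacks of consecutive move instructions:

from itertools import groupby

# Target stack of each successive 'move 0 _' instruction from the example above.
moves = [1, 1, 1, 2, 2, 1, 2, 2]
runs = [len(list(group)) for _, group in groupby(moves)]
print(' '.join(map(str, runs)))   # prints: 3 2 1 2

(Judging by the output below, the actual script also seems to prepend a 0 when the first objects go to the third stack, so that every row starts counting at the second stack.)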

The full output then looks like this:

$ python shuffle6.py 20
Instructions: Take the 20 objects and put them in one stack. Make space for two more stacks. To execute one row of numbers, take as many objects from the top of the first stack and place them onto the two other stacks, alternating between them. For example, to execute the row "1 2 3 2 1 2", you deal the top object of the first stack onto the second stack, the next one onto the third stack, the one after that again on the third stack, the three objects after that onto the second one, and so on. Take care to deal one object at a time. To execute a transfer, move the second stack into the place of the first stack, and the third stack on top of it.

  0 1 6 1 1 2
  1 3 2 3
transfer

  1 1 1 1 2 3
  2 2 1 1 1 1
  2 1
transfer

  2 1 1 1 1 2
  2 1 2 1 2 1
  2 1
transfer

  1 1 1 1 1 2
  2 4 1 1 1 1
  1 2
transfer

  1 2 1 1 1 1
  13
transfer

Benchmarks

(Image: a stack of irregular pieces of paper and cardboard.)

While finding a theoretically optimal method is fun, my original goal was to actually shuffle things quickly in the real world. So I performed some benchmarks. I used four different kinds of objects: standard playing cards; a stack of paper intended to simulate assignments; coins; and hexagonal cardboard tokens from a game.

I then tested three algorithms: simple, which is shuffle2.py from above; radix, the radix sort corresponding to shuffle6.py; and riffle, which is a standard riffle shuffle.

Type          n    simple   radix   riffle
Cards         8      11s      13s      20s
             16      26s      30s      21s
             32      82s      71s      30s
             52     185s     140s      46s
Assignments  20     133s     193s
             32     251s     264s
             40     430s     497s
Coins         8      25s      30s
             16      96s     103s
Tokens        8      12s      18s
             16      43s      45s
             32     141s     131s
             50     524s     283s

Not too bad! Clearly, if you specifically want to shuffle cards and can tolerate imperfect randomness, doing a riffle shuffle is much faster. But if you need perfect randomness, you can still get it within a reasonable amount of time. The simple shuffle will likely be best until n ≈ 30, at which point the radix shuffle takes over.

For assignments and coins, it seems that the simple shuffle is the way to go, at least unless you have many of them (and are willing to spend an unreasonable amount of time shuffling).

The cardboard tokens are where the radix shuffle shines. Already at n = 16 it matches the simple shuffle, and for larger numbers of tokens it dramatically outperforms it. This is due to the difficulty of handling the tokens — since I cannot pick up more than 20 or so at once, finding a specific position becomes quite tedious. Still, as long as you can hold them in one hand, the simple shuffle would likely be faster.

In the end, it is probably expensive enough to get perfect randomness via the radix shuffle that you would not bother in most situations. But the true treasure is, as always, the algorithms you made along the way!

Future work: What do we do when we have more cards than can feasibly be riffle shuffled? Imagine you want to shuffle 360 cards. Sure, you are willing to tolerate some amount of imperfect randomness — but how can you get a procedure that gives you at least some guarantees?

Attachments

name size
shuffle1.py 193 B
shuffle2.py 527 B
shuffle3.py 725 B
shuffle4.py 822 B
shuffle5.py 878 B
shuffle6.py 2.4 KiB
everything 1.7 KiB

Written by Philipp Czerner, 2025-05-04