Post by James Cook [home]

June 4, 2021

*In which I implement a magical but impractical algorithm.*

In 2014, a new algorithm was discovered with a surprising superpower. It's described in a paper which begins:

> Imagine the following scenario. You want to perform a computation that requires more memory than you currently have available on your computer. One way of dealing with this problem is by installing a new hard drive. As it turns out you have a hard drive but it is full with data, pictures, movies, files, etc. ...

(*Computing with a full memory: Catalytic space*, by Harry Buhrman, Richard Cleve, Michal Koucký, Bruno Loff, and Florian Speelman.)

The paper goes on to describe an algorithm which is able to use memory that's already filled with unrelated important (and incompressible) data. The algorithm makes temporary changes, but in the end, the borrowed memory is back the way it started.

If that sounds impossible, we're on the same page! That was my reaction too, and I've been fascinated by it ever since. I recently decided to implement a version of it, just for fun.

*(If you're already familiar with catalytic space algorithms, you might be interested in a small original contribution: in the place where Buhrman et al use Fermat's Little Theorem, my implementation uses an alternative that's easier if you want to do all your arithmetic mod 2^{64}. See the section "Last trick: compare to zero" below for details. It came out of one of my many conversations with Ian Mertz.)*

I wrote reachability.c, which solves the directed graph reachability problem using borrowed memory. Give it a graph, and it will tell you whether there is a path from node 0 to node 1. The first line of the input should be the number of nodes in the graph, and the adjacency matrix should follow. You can try it out:

```
$ cat examples/cycle_5
5
1 0 1 0 0
1 1 0 0 0
0 0 1 1 0
0 0 0 1 1
0 1 0 0 1
$ ./reachability < examples/cycle_5
Reachable.
$ cat examples/two_cliques_10
10
1 0 1 0 1 0 1 0 1 0
0 1 0 1 0 1 0 1 0 1
1 0 1 0 1 0 1 0 1 0
0 1 0 1 0 1 0 1 0 1
1 0 1 0 1 0 1 0 1 0
0 1 0 1 0 1 0 1 0 1
1 0 1 0 1 0 1 0 1 0
0 1 0 1 0 1 0 1 0 1
1 0 1 0 1 0 1 0 1 0
0 1 0 1 0 1 0 1 0 1
$ ./reachability < examples/two_cliques_10
Not reachable.
```

The algorithm uses only O(log n) ordinary memory, where n is the number of nodes. (Actually, my implementation uses a constant amount, and is limited to graphs with fewer than 2^{32} nodes.) The rest of the memory is the input, which it treats as read-only, and O(n^{3} log n) bits of "borrowed" memory.

For comparison, the standard Floyd-Warshall algorithm uses Θ(n^{2}) memory. Solving reachability in O(log n) memory without borrowing is probably impossible: it would imply the complexity class NL is equal to L, which would be surprising. The problem can be solved with O((log n)^{2}) memory using Savitch's theorem, but with a runtime of n^{Θ(log n)}: worse than any polynomial.

To simulate borrowing, the main program fills an array with random data, lends it to the algorithm, and then compares it to a backup to check that the algorithm correctly restored it.
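That harness can be sketched in a few lines of C. This is only an illustration (the names here are hypothetical, not the ones in reachability.c), with a toy borrower standing in for the real algorithm:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Sketch of the borrowing harness: fill an array with random data, keep
   a backup, lend the array to the algorithm, then check that every word
   was restored.  Returns 1 if the borrowed memory came back intact. */
static int borrow_and_check(void (*algorithm)(uint64_t *, size_t), size_t n) {
    uint64_t *catalyst = malloc(n * sizeof *catalyst);
    uint64_t *backup = malloc(n * sizeof *backup);
    for (size_t i = 0; i < n; i++)
        catalyst[i] = ((uint64_t)rand() << 32) ^ (uint64_t)rand();
    memcpy(backup, catalyst, n * sizeof *catalyst);
    algorithm(catalyst, n);  /* may make temporary changes */
    int restored = memcmp(catalyst, backup, n * sizeof *catalyst) == 0;
    free(catalyst);
    free(backup);
    return restored;
}

/* A toy borrower that scribbles on the array and then undoes its changes. */
static void toy_borrower(uint64_t *catalyst, size_t n) {
    for (size_t i = 0; i < n; i++) catalyst[i] += i;  /* temporary change */
    for (size_t i = 0; i < n; i++) catalyst[i] -= i;  /* restore */
}
```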

As far as I know, I'm the first person to actually implement this technique since it was described in 2014. That's probably because it's not useful in practice, as I'll explain in the next section.

The original algorithm from the paper is more general than my implementation. It can be used to solve some other problems, like computing the determinant of a matrix. To use some complexity theory jargon, it can solve anything in the class TC^{1}. It still uses only O(log n) bits of ordinary, non-borrowed memory.

At this point, an imaginative reader will see nearly limitless possibilities. Not enough memory in your computer to run your web browser and your favourite game at the same time? Just have the game borrow from the web browser, and you've effectively doubled your RAM!

Unfortunately, it doesn't work like that, for a few reasons:

- These algorithms are slow. The runtime of reachability.c is something like Θ(n^{8}). In practical terms, it took 14 seconds to solve the 10-node example above. (The Θ(n^{3} log n) borrowed-memory requirement is more tolerable. If I've done my math right, it borrowed about 27 KiB for the 10-node example.)
- Even if you don't care about speed, not everything can be implemented this way. For example, it's unlikely you can solve linear programs with only O(log n) bits of ordinary memory. More generally, the technique is designed for programs that process their input, compute an answer, then exit. It's unclear whether it could be used for interactive software.
- Even if you somehow managed to convert the game to borrow the web browser's memory, you couldn't use them both at once: you'd have to pause the browser until the game finished running, since the game will be making temporary changes to the browser's memory.

Despite these caveats, I think it's really cool that it's even possible to make use of borrowed memory, however impractical it may be. So, I went ahead and implemented it.

*If you want to follow along in the code while you read this section, look at reachability_with_stack.cc instead of reachability.c. I'll explain the difference later, when we get to section "The runtime stack".*

Central to the algorithm is the idea of *transparent computation*. A normal algorithm will write intermediate results to memory, like this:

```
unsigned int x;
x = (... difficult calculation ...);
```

Instead, a transparent computation adds the result to a value stored in memory:

```
x = x + (... difficult calculation ...);
```

This allows the change to be reversed later on, by repeating the same calculation and subtracting instead of adding:

```
x = x - (... difficult calculation ...);
```

This kind of repetition is the main reason reachability.c is so slow, but it is also how it manages to restore the contents of the memory it borrowed (in this case, the variable `x`).
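Here's a toy sketch of that add-then-subtract round trip (not code from the actual program). Unsigned 64-bit arithmetic wraps mod 2^{64}, so the subtraction always restores `x` exactly:

```c
#include <stdint.h>

/* Borrowed memory: x already holds someone else's data (here, 42). */
static uint64_t x = 42;

/* Stand-in for some expensive computation with a fixed result. */
static uint64_t difficult_calculation(void) { return 1234567; }

/* Transparent version: add inc times the result instead of overwriting x.
   Calling with inc = 1 and then inc = -1 leaves x exactly as it started,
   because unsigned arithmetic is mod 2^64. */
static void compute_transparently(int64_t inc) {
    x += (uint64_t)inc * difficult_calculation();
}
```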

The `add_reachability` function in reachability_with_stack.cc works this way. The function is responsible for computing whether it's possible to get from node u to node v in 2^{log_path_length} steps, for every pair (u, v) simultaneously. `add_reachability` encodes its result by adding it to the borrowed-memory array `*catalyst_output`: if you can get from u to v, then it adds `increment` to `(*catalyst_output)[u][v]`. For example, suppose our adjacency matrix is:

```
1 0 1 0
1 1 0 0
0 0 1 1
0 1 0 1
```

and the initial content of `*catalyst_output` is:

```
5 7 8 3
2 3 3 6
3 2 4 2
4 2 8 8
```

Then calling `add_reachability` with `increment=1` and `log_path_length=0` should change `*catalyst_output` to:

```
6 7 9 3
3 4 3 6
3 2 5 3
4 3 8 9
```

The `catalyst_aux` and `catalyst_rotation` parameters point to more borrowed memory, which `add_reachability` restores before returning. After calling `add_reachability` with `increment=1`, you can restore `*catalyst_output` to its initial state too by running it a second time with `increment=-1`.

All the math is done modulo 2^{64}. In theory, the exponent should depend on log n, but I'm happy to assume log n < 64 to keep things simple.

At first glance, transparent computation might look useless: we added the result to the value stored in memory, but how do we know what that result was? The following sections show that we can make useful progress with computations that incorporate the values stored in memory both before and after the function is run.

The function `rotate_array` shifts all the values in an array in a cycle. For example, if `array` starts out with

```
8 2 5 7 2 3 1 4
```

then after the call `rotate_array(&array, 2)`, it will contain:

```
5 7 2 3 1 4 8 2
```

Now, suppose we're able to transparently compute a value `A`: we've written a subroutine `compute_A(inc)` which has the effect `x = x + inc * A`. What happens if we do the following?

```
void f() {
  rotate_array(&array, -x);
  compute_A(1);   /* x = x + A; */
  rotate_array(&array, x);
  compute_A(-1);  /* x = x - A; */
}
```

If the initial value of `x` is `x0`, then the two calls to `rotate_array` amount to a total rotation of `-x0 + (x0 + A)` = `A`. So, `f()` has done a different kind of transparent computation: instead of adding `A` to a variable, it rotates an array by `A`.

(Caveat: since we're doing everything modulo 2^{64}, this won't work unless `array.size()` divides 2^{64}.)
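Here's a small self-contained sketch of this trick. The `compute_A` below is a made-up transparent computation whose hidden value is A = 3, and the array has 8 elements so that its size divides 2^{64}:

```c
#include <stdint.h>
#include <string.h>

#define N 8               /* array size; 8 divides 2^64, as the caveat requires */

static uint64_t x;        /* borrowed variable used by compute_A */
static uint64_t array[N]; /* borrowed array that gets rotated */

/* Hypothetical transparent computation with hidden value A = 3. */
static void compute_A(int64_t inc) { x += (uint64_t)inc * 3; }

/* Rotate array left by (amount mod N) positions. */
static void rotate_array(uint64_t amount) {
    uint64_t tmp[N];
    for (int i = 0; i < N; i++) tmp[i] = array[(i + amount) % N];
    memcpy(array, tmp, sizeof array);
}

/* The trick: the two rotations total -x0 + (x0 + A) = A steps,
   and x ends up back at its original value. */
static void f(void) {
    rotate_array(-x);   /* unsigned negation is mod-2^64 negation */
    compute_A(1);       /* x = x + A */
    rotate_array(x);
    compute_A(-1);      /* x = x - A */
}

/* Self-check: after f(), the array is rotated by A = 3 and x is restored. */
static int first_trick_works(void) {
    for (int i = 0; i < N; i++) array[i] = 100 + i;
    x = 12345;          /* arbitrary pre-existing borrowed contents */
    f();
    return array[0] == 103 && x == 12345;
}
```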

The original algorithm by Buhrman et al doesn't use this trick. I use it as part of "Last trick: compare to zero", explained below, which replaces the original algorithm's use of Fermat's Little Theorem.

In an ordinary algorithm, if we're able to compute two values `A` and `B`, then computing the product `A*B` is simple:

```
x = compute_A();
y = compute_B();
z = x * y;
```

But what if we're using transparent computation?

Suppose `compute_A(inc)` has the effect `x = x + inc * A` and `compute_B(inc)` has the effect `y = y + inc * B`. Here's a neat procedure from Buhrman et al:

```
z = z + x * y;
compute_A(1);
z = z - x * y;
compute_B(1);
z = z + x * y;
compute_A(-1);
z = z - x * y;
compute_B(-1);
```

If you carefully work out what's stored in the variables `x`, `y` and `z` after each step, you'll find that these eight lines of code are equivalent to:

```
z = z + A * B;
```

Since we don't have direct access to `A` or `B`, we can't directly run `z = z + A * B`. That's why we use the longer version.

Notice this relies entirely on borrowed memory (`x`, `y` and `z`). We didn't allocate any new memory to hold intermediate results, and `z` can be restored to its initial value by performing the opposite computation.
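The eight steps can be checked directly with toy stand-ins. Here `compute_A` and `compute_B` are made-up transparent computations with hidden values A = 6 and B = 7, and `x`, `y`, `z` start with arbitrary "borrowed" contents:

```c
#include <stdint.h>

/* Borrowed variables, with arbitrary pre-existing contents. */
static uint64_t x = 11, y = 22, z = 33;

/* Hypothetical transparent computations with hidden values A = 6, B = 7. */
static void compute_A(int64_t inc) { x += (uint64_t)inc * 6; }
static void compute_B(int64_t inc) { y += (uint64_t)inc * 7; }

/* The eight steps from Buhrman et al.  Net effect: z = z + A*B,
   with x and y back at their original values. */
static void add_product(void) {
    z = z + x * y;
    compute_A(1);
    z = z - x * y;
    compute_B(1);
    z = z + x * y;
    compute_A(-1);
    z = z - x * y;
    compute_B(-1);
}
```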

The first two tricks can be combined to rotate an array by `A*B`:

```
rotate_array(&array, x*y);
compute_A(1);
rotate_array(&array, -x*y);
compute_B(1);
rotate_array(&array, x*y);
compute_A(-1);
rotate_array(&array, -x*y);
compute_B(-1);
```

The overall effect of those 8 lines is:

```
rotate_array(&array, A*B);
```
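The combination can be verified with the same kind of toy set-up as before (made-up transparent subroutines with hidden values A = 3 and B = 2, and an 8-element array so the size divides 2^{64}):

```c
#include <stdint.h>
#include <string.h>

#define N 8  /* array size; must divide 2^64 */

static uint64_t x = 5, y = 4;  /* borrowed variables, arbitrary contents */
static uint64_t array[N];      /* borrowed array */

/* Hypothetical transparent computations with hidden values A = 3, B = 2. */
static void compute_A(int64_t inc) { x += (uint64_t)inc * 3; }
static void compute_B(int64_t inc) { y += (uint64_t)inc * 2; }

/* Rotate array left by (amount mod N) positions. */
static void rotate_array(uint64_t amount) {
    uint64_t tmp[N];
    for (int i = 0; i < N; i++) tmp[i] = array[(i + amount) % N];
    memcpy(array, tmp, sizeof array);
}

/* Net effect: rotate array by A*B = 6 steps; x and y are restored. */
static void rotate_by_product(void) {
    rotate_array(x * y);
    compute_A(1);
    rotate_array(-(x * y));
    compute_B(1);
    rotate_array(x * y);
    compute_A(-1);
    rotate_array(-(x * y));
    compute_B(-1);
}

/* Self-check: array rotated by 6, borrowed variables untouched. */
static int combined_trick_works(void) {
    for (int i = 0; i < N; i++) array[i] = i;
    rotate_by_product();
    return array[0] == 6 && x == 5 && y == 4;
}
```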

This is how the `rotate_by_reachability` function is implemented: the recursive calls to `add_reachability` compute whether there are paths of length 2^{log_path_length-1} from u to w and from w to v, respectively, and the results are multiplied so that `(*catalyst_rotation)[u][v]` gets rotated one step if both paths exist. (Here, we're using multiplication as an AND gate.) This is done for all possible values of u, v and w, so that the final amount by which `(*catalyst_rotation)[u][v]` is rotated equals the number of different nodes w through which u can reach v.

Suppose again that we have a way to compute some value `A` transparently. The last ingredient we need is a way to do this:

```
if (A != 0) {
  x = x + 1;
}
```

Buhrman et al manage this with some fancy math involving Fermat's Little Theorem and binomial coefficients. Since I want to do all my math modulo 2^{64}, Fermat's Little Theorem won't work for me. Instead, I do something else.

Suppose `rotate_by_A(inc)` has the effect of rotating `array` by `inc*A` (e.g. using the first trick, above). Then we can do this:

```
rotate_by_A(1);
x = x - array[0];
rotate_by_A(-1);
array[0] = array[0] - 1;
rotate_by_A(1);
x = x + 1 + array[0];
rotate_by_A(-1);
array[0] = array[0] + 1;
```

To understand what this does, first notice that it's equivalent to:

```
x = x - array[A % array.size()];
array[0] = array[0] - 1;
x = x + 1 + array[A % array.size()];
array[0] = array[0] + 1;
```

although we can't run that code since we don't have direct access to the value `A`.

Now, consider what happens when `A` = `0`. Then we have:

```
x = x - array[0];
array[0] = array[0] - 1;
x = x + 1 + array[0];
array[0] = array[0] + 1;
```

In the end, nothing is changed. On the other hand, if `A` ≠ `0` (mod `array.size()`), the lines `array[0] = array[0] - 1` and `array[0] = array[0] + 1` have no effect on what happens to `x`, so effectively we have:

```
x = x - array[A];
x = x + 1 + array[A];
```

which is the same as just:

```
x = x + 1;
```

So, our 8-line procedure is equivalent to:

```
if ((A % array.size()) != 0) {
  x = x + 1;
}
```

which is exactly what we wanted, as long as we can guarantee `A < array.size()`.
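The whole trick can be checked with a toy mock-up. Note the `rotate_by_A` below cheats by reading A directly, which the real algorithm of course can't do; it's only there so the surrounding eight lines have something to call:

```c
#include <stdint.h>
#include <string.h>

#define N 8  /* array size; must divide 2^64 */

static uint64_t x = 1000;  /* borrowed variable */
static uint64_t array[N];  /* borrowed array */

/* Mock transparent rotation: rotates array by inc*A steps.  A real
   implementation would only have indirect access to A. */
static uint64_t A;
static void rotate_by_A(int64_t inc) {
    uint64_t amount = (uint64_t)inc * A;
    uint64_t tmp[N];
    for (int i = 0; i < N; i++) tmp[i] = array[(i + amount) % N];
    memcpy(array, tmp, sizeof array);
}

/* The eight lines above.  Net effect: x gains 1 iff A % N != 0,
   with the array restored either way. */
static void add_if_nonzero(void) {
    rotate_by_A(1);
    x = x - array[0];
    rotate_by_A(-1);
    array[0] = array[0] - 1;
    rotate_by_A(1);
    x = x + 1 + array[0];
    rotate_by_A(-1);
    array[0] = array[0] + 1;
}

/* Self-check both branches: A = 3 adds exactly 1; A = 0 changes nothing. */
static int last_trick_works(void) {
    for (int i = 0; i < N; i++) array[i] = 50 + i;
    A = 3;
    add_if_nonzero();
    int ok = (x == 1001);
    A = 0;
    add_if_nonzero();
    return ok && x == 1001 && array[0] == 50;
}
```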

This trick is used to implement `add_reachability` using recursive calls to `rotate_by_reachability`. The latter subroutine rotates `(*catalyst_rotation)[u][v]` by a number of steps equal to the number of different nodes w through which you can travel from u to v. Using this trick, `add_reachability` is able to add 1 to `(*catalyst_output)[u][v]` if that number is at least one, and add 0 otherwise.

The algorithm uses a pair of mutually recursive subroutines:

- `add_reachability(log_path_length, increment)` adds `increment` to `(*catalyst_output)[u][v]` as long as there's a path from u to v of length at most 2^{log_path_length}.
- `rotate_by_reachability(log_path_length, increment)` rotates `(*catalyst_rotation)[u][v]` by `increment` times the number of intermediate nodes w such that there are paths u->w and w->v, both of length at most 2^{log_path_length-1}.

Each is implemented in terms of the other using the three tricks described above, except that the base case `add_reachability(0, increment)` just directly reads the input.

So far, everything we've done is "transparent", so our algorithm effectively does this:

```
void f(int inc) {
  if (there's a path from 0 to 1) {
    x = x + inc;
  }
}
```

To turn it into a program that actually returns the answer, we can do this:

```
int save;
f(1);
save = x;
f(-1);
if (x == save) {
  printf("Not reachable.\n");
} else {
  printf("Reachable.\n");
}
```

This requires running our algorithm twice, and uses just one extra variable's worth of non-borrowed memory (`int save`).
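A minimal mock-up of this extraction step, with the transparent computation replaced by a stub that always finds a path:

```c
#include <stdint.h>

static uint64_t x = 999;           /* borrowed memory */
static const int path_exists = 1;  /* stub for the real computation */

/* Transparent form of the answer: x changes by inc iff there's a path. */
static void f(int64_t inc) {
    if (path_exists) x += (uint64_t)inc;
}

/* Run f twice and compare: x differs from the saved copy iff f(1)
   actually changed it.  Either way, x ends up restored. */
static int reachable(void) {
    uint64_t save;
    f(1);
    save = x;
    f(-1);
    return x != save;
}
```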

There's one last problem. As the two mutually recursive functions call each other, they use the runtime stack to store their state. In all, they will use O(log n) stack frames. I'm taking the point of view that a single C integer counts as O(log n) bits of memory, so those stack frames together count as O((log n)^{2}) memory.

Fortunately, each stack frame really only needs 2 bits. reachability.c re-implements the algorithm using two 64-bit ints, `call_stack_add` and `call_stack_rotate`, in place of the runtime stack. Each stack frame takes 2 bits, so we can manage a total of 64 frames, which is enough for graphs with fewer than 2^{32} nodes. As a happy side-effect, reachability.c seems to run significantly faster. The downside is that it's even harder to understand than reachability_with_stack.cc.
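This isn't the actual encoding from reachability.c, but the general idea of a hand-rolled stack with two bits per frame looks something like this sketch:

```c
#include <stdint.h>

/* A manual call stack packed two bits per frame into one 64-bit word,
   so up to 32 frames here (reachability.c uses two such words for 64). */
static uint64_t call_stack;
static int depth;

static void push_frame(uint64_t frame) {       /* frame must fit in 2 bits */
    call_stack |= (frame & 3) << (2 * depth);
    depth++;
}

static uint64_t pop_frame(void) {
    depth--;
    uint64_t frame = (call_stack >> (2 * depth)) & 3;
    call_stack &= ~(3ULL << (2 * depth));      /* clear the popped frame */
    return frame;
}
```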

This second version also uses a single array of borrowed memory (`uint64_t *catalyst`) instead of three separate ones. I decided to write it in C because I wanted some practice. Overall, I hope the simpler set-up makes it a little easier to verify that there's no sleight-of-hand going on: the program really is only using a constant amount of non-borrowed memory.

Download the source here:

On my unix-like system, I can compile them like this:

```
$ cc -O2 -oreachability reachability.c
$ c++ -O2 -oreachability_with_stack reachability_with_stack.cc
```

Hopefully something similar works for you. You can try running them on the two example graphs from the beginning of this post (search the text for `examples/cycle_5`) or invent your own. Each program simply prints "`Reachable.`" if there's a path from node 0 to node 1, and "`Not reachable.`" otherwise.