Post by James Cook [home]

How to borrow memory

June 4, 2021

In which I implement a magical but impractical algorithm.

In 2014, a new algorithm was discovered with a surprising superpower. It's described in a paper which begins:

Imagine the following scenario. You want to perform a computation that requires more memory than you currently have available on your computer. One way of dealing with this problem is by installing a new hard drive. As it turns out you have a hard drive but it is full with data, pictures, movies, files, etc. ...

(Computing with a full memory: Catalytic space by Harry Buhrman, Richard Cleve, Michal Koucký, Bruno Loff, and Florian Speelman).

The paper goes on to describe an algorithm which is able to use memory that's already filled with unrelated important (and incompressible) data. The algorithm makes temporary changes, but in the end, the borrowed memory is back the way it started.

If that sounds impossible, we're on the same page! That was my reaction too, and I've been fascinated by it ever since. I recently decided to implement a version of it, just for fun.

(If you're already familiar with catalytic space algorithms, you might be interested in a small original contribution: in the place where Buhrman et al use Fermat's Little Theorem, my implementation uses an alternative that's easier if you want to do all your arithmetic mod 264. See the section "Last trick: compare to zero" below for details. It came out of one of my many conversations with Ian Mertz.)

It really works...

I wrote reachability.c, which solves the directed graph reachability problem using borrowed memory. Give it a graph, and it will tell you whether there is a path from node 0 to node 1. The first line of the input should be the number of nodes in the graph, and the adjacency matrix should follow. You can try it out:

$ cat examples/cycle_5
1 0 1 0 0
1 1 0 0 0
0 0 1 1 0
0 0 0 1 1
0 1 0 0 1
$ ./reachability < examples/cycle_5
$ cat examples/two_cliques_10
1 0 1 0 1 0 1 0 1 0
0 1 0 1 0 1 0 1 0 1
1 0 1 0 1 0 1 0 1 0
0 1 0 1 0 1 0 1 0 1
1 0 1 0 1 0 1 0 1 0
0 1 0 1 0 1 0 1 0 1
1 0 1 0 1 0 1 0 1 0
0 1 0 1 0 1 0 1 0 1
1 0 1 0 1 0 1 0 1 0
0 1 0 1 0 1 0 1 0 1
$ ./reachability < examples/two_cliques_10
Not reachable.

The algorithm uses only O(log n) ordinary memory, where n is the number of nodes. (Actually, my implementation uses a constant amount, and is limited to graphs with fewer than 232 nodes.) The rest of the memory is the input, which it treats as read-only, and O(n3 log n) bits of "borrowed" memory.

For comparison, the standard Floyd-Warshall algorithm uses Θ(n^3) memory. It's probably impossible to do it in O(log n) memory without borrowing — doing so would imply the complexity class NL is equal to L, which would be surprising. It can be solved with O((log n)2) memory using Savitch's theorem, but with a runtime of nΘ(log n): worse than any polynomial.

To simulate borrowing, the main program fills an array with random data, lends it to the algorithm, and then compares it to a backup to check that the algorithm correctly restored it.

As far as I know, I'm the first person to actually implement this technique since it was first described in 2014. That's probably because it's not useful in practice, as I'll explain in the next section.

The original algorithm from the paper is more general than my implementation. It can be used to solve some other problems, like computing the determinant of a matrix. To use some complexity theory jargon, it can solve anything in the class TC1. It still uses only O(log n) bits of ordinary, non-borrowed memory.

...but don't get too excited.

At this point, an imaginative reader will see nearly limitless possibilities. Not enough memory in your computer to run your web browser and your favourite game at the same time? Just have the game borrow from the web browser, and you've effectively doubled your RAM!

Unfortunately, it doesn't work like that, for a few reasons:

Despite these caveats, I think it's really cool that it's even possible to make use of borrowed memory, however impractical it may be. So, I went ahead and implemented it.

How it works

If you want to follow along in the code while you read this section, look at instead of reachability.c. I'll explain the difference later, when we get to section "The runtime stack".

Transparent computation

Central to the algorithm is the idea of transparent computation. A normal algorithm will write intermediate results to memory, like this:

unsigned int x;
x = (... difficult calculation ...);

Instead, a transparent computation adds the result to a value stored in memory:

x = x + (... difficult calculation ...);

This allows the change to be reversed later on, by repeating the same calculation and subtracting instead of adding:

x = x - (... difficult calculation ...);

This kind of repetition is the main reason reachability.c is so slow, but it is also how it manages to restore the contents of the memory it borrowed (in this case, the variable x).

The add_reachability function in works this way. The function is responsible for computing whether it's possible to get from node u to node v in 2log_path_length steps, for every pair (u, v) simultaneously. add_reachability encodes its result by adding it to the borrowed-memory array *catalyst_output: if you can get from u to v, then it adds increment to (*catalyst_output)[u][v]. For example, suppose our adjacency matrix is:

1 0 1 0
1 1 0 0
0 0 1 1
0 1 0 1

and the initial content of *catalyst_output is:

5 7 8 3
2 3 3 6
3 2 4 2
4 2 8 8

Then calling add_reachability with increment=1 and log_path_length=0 should change *catalyst_output to:

6 7 9 3
3 4 3 6
3 2 5 3
4 3 8 9

The catalyst_aux and catalyst_rotation parameters point to more borrowed memory, which add_reachability restores before returning. After calling add_reachability with increment=1, you can restore *catalyst_output to its initial state too by running it a second time with increment=-1.

All the math is done modulo 264. In theory, the exponent should depend on log n, but I'm happy to assume log n < 64 to keep things simple.

At first glance, transparent computation might look useless: we added the result to the value stored in memory, but how do we know what that result was? The following sections show that we can make useful progress with computations that incorporate the values stored in memory both before and after the function is run.

First trick: converting addition to rotation

The function rotate_array shifts all the values in an array in a cycle. For example, if array starts out with

8 2 5 7 2 3 1 4

then after the call rotate_array(&array, 2), it will contain:

5 7 2 3 1 4 8 2

Now, suppose we're able to transparently compute a value A: we've written a subroutine compute_A(inc) which has the effect x = x + inc * A. What happens if we do the following?

void f() {
  rotate_array(&array, -x);
  compute_A(1);   /* x = x + A; */
  rotate_array(&array, x);
  compute_A(-1);  /* x = x - A; */

If the initial value of x is x0, then the two calls to rotate_array amount to a total rotation of -x0+(x0+A) = A. So, f() has done a different kind of transparent computation: instead of adding A to a variable, it rotates an array by A.

(Caveat: since we're doing everything modulo 264, this won't work unless array.size() divides 264.)

The original algorithm by Buhrman et al doesn't use this trick. I use it as part of "Last trick: compared to zero" explained below, which replaces the original algorithm's use of Fermat's Little Theorem.

Second trick: multiplying

In an ordinary algorithm, if we're able to compute two values A and B, then computing the product A*B is simple:

x = compute_A();
y = compute_B();
z = x * y;

But what if we're using transparent computation?

Suppose compute_A(inc) has the effect x = x + inc * A and compute_B(inc) has the effect y = y + inc * B. Here's a neat procedure from Buhrman et al:

z = z + x * y;
z = z - x * y;
z = z + x * y;
z = z - x * y;

If you carefully work out what's stored in the variables x, y and z after each step, you'll find that these eight lines of code are equivalent to:

z = z + A * B;

Since we don't have direct access to A or B, we can't directly run z = z + A * B. That's why we use the longer version.

Notice this relies entirely on borrowed memory (x, y and z). We didn't allocate any new memory to hold intermediate results, and z can be restored to its initial value by performing the opposite computation.

The first two tricks can be combined to rotate an array by A*B:

rotate_array(&array, x*y);
rotate_array(&array, -x*y);
rotate_array(&array, x*y);
rotate_array(&array, -x*y);

The overall effect of those 8 lines is:

rotate_array(&array, A*B);

This is how the rotate_by_reachability function is implemented: the recursive calls to add_reachability compute whether there are paths of length 2log_path_length-1 from u to w and w to v, respectively, and the results are multiplied so that (*catalyst_rotation)[u][v] gets rotated one step if both paths exist. (Here, we're using multiplication as an AND gate.) This is done for all possible values of u, v and w, so that the final amount by which (*catalyst_rotation)[u][v] is rotated equals the number of different ws through which u can reach v.

Last trick: compare to zero

Suppose again that we have a way to compute some value A transparently. The last ingredient we need is a way to do this:

if (A != 0) {
  x = x + 1;

Buhrman et al manage this with some fancy math involving Fermat's Little Theorem and binomial coefficients. Since I want to do all my math modulo 264, Fermat's Little Theorem won't work for me. Instead, I do something else.

Suppose rotate_by_A(inc) has the effect of rotating array by inc*A (e.g. using the First trick, above). Then we can do this:

x = x - array[0];
array[0] = array[0] - 1;
x = x + 1 + array[0];
array[0] = array[0] + 1;

To understand what this does, first notice that it's equivalent to:

x = x - array[A % array.size()];
array[0] = array[0] - 1;
x = x + 1 + array[A % array.size()];
array[0] = array[0] + 1;

although we can't run that code since we don't have direct access to the value A.

Now, consider what happens when A = 0. Then we have:

x = x - array[0];
array[0] = array[0] - 1;
x = x + 1 + array[0];
array[0] = array[0] + 1;

In the end, nothing is changed. On the other hand, if A0 (mod array.size()), the lines array[0] = array[0] - 1 and array[0] = array[0] + 1 have no effect on what happens to x, so effectively we have:

x = x - array[A];
x = x + 1 + array[A];

which is the same as just:

x = x + 1

So, our 8-line procedure is equivalent to

if ((A % array.size()) != 0) {
  x = x + 1;

which is exactly what we wanted, as long as we can guarantee A < array.size().

This trick is used to implement add_reachability using recursive calls to rotate_by_reachability. The latter subroutine rotates (*catalyst_rotation)[u][v] by a number of steps equal to the number of different nodes w through which you can travel from u to v. Using this trick, add_reachability is able to add 1 to (*catalyst_output)[u][v] if that number is at least one, and add 0 otherwise.

Putting the algorithm together

The algorithm uses a pair of mutually recursive subroutines:

Each is implemented in terms of the other using the three tricks described above, except that the base case add_reachability(0, increment) just directly reads the input.

Getting an answer out of a transparent computation

So far, everything we've done is "transparent", so our algorithm effectively does this:

void f(int inc) {
  if (there's a path from 0 to 1) {
    x = x + inc;

To turn it into a program that actually returns the answer, we can do this:

int save;
save = x;
if (x == save) {
  printf("Not reachable.\n");
} else {

This requires running our algorithm twice, and uses just one extra variable's worth of non-borrowed memory (int save).

The runtime stack

There's one last problem. As the two mutually recursive functions call each other, they use the runtime stack to store their state. In all, they will use O(log n) stack frames. I'm taking the point of view that a single C integer counts as O(log n) bits of memory, so those stack frames together count as O((log n)2) memory.

Fortunately, each stack frame really only needs 2 bits. reachability.c re-implements the algorithm using two 64-bit ints call_stack_add and call_stack_rotate in place of the runtime stack. Each stack frame takes 2 bits, so we can manage a total of 64 frames, which is enough for graphs of size less than 232. As a happy side-effect, reachability.c seems to run significantly faster. The downside is that it's even harder to understand than

This second version also uses a single array of borrowed memory (uint64_t *catalyst) instead of three separate ones. I decided to write it in C, because I wanted some practice. Overall, I hope the simpler set-up makes it a little easier to verify that there's no sleight-of-hand going on: the program really is only using a constant amount of non-borrowed memory.


Download the source here:

On my unix-like system, I can compile them like this:

$ cc -O2 -oreachability reachability.c
$ c++ -O2 -oreachability_with_stack

Hopefully something similar works for you. You can try running them on the two example graphs from the beginning of this post (search the text for examples/cycle_5) or invent your own. Each program simply prints "Reachable." if there's a path from node 0 to node 1, and "Not reachable." otherwise.