Column 2 of Programming Pearls is titled “Aha! Algorithms”. The first problem starts,

Given a tape that contains at most 1,000,000 twenty-bit integers in random order, find a twenty-bit integer that isn’t on the tape…

(The newer edition of the book asks the same question with larger numbers and without referencing tapes). The point of the constraints are: you don’t have enough memory to load the data at once.

The chapter goes on to describe how a binary search can be applied, even though the set of numbers isn’t ordered. You can partition the set by looking at the first bit of each number. After this split, whichever set has fewer numbers represents the set that is missing at least one number. From there, apply the same logic to the new set, looking at the second bit in each number, and so on. In the end, one half of the numbers will be empty, where a number would have been if it existed in the set!

I turned into a crazy person when I was answering this problem. I was filling my notebook with scribbles of trees:

Over the past couple days I’ve been looking for some fun distractions to keep me sane while I go through the book. Yesterday I played with PhantomJS for a little bit. Today I worked on automatically generating the trees I was scribbling before. I took a page from Jon Bentley’s (the author of Programming Pearls) paper “A System for Algorithm Animation Tutorial and User Manual”:

A sixty-line C simulation program was augmented with eight printf statements…

It sounds nicer to say I took it from a paper, but I just added some debug statements to the algorithm as it went. They’re mostly human readable, but happen to be pretty close to the output needed to create a graph with GraphViz. A little one-liner post-processing:

`go test -v | grep ^info: | perl -alnE 'BEGIN{say "digraph G {";};END{say "}"}; print "@F[1..$#F];"' | dot -Tpng -o out.png`

And with that mouthful, pretty pictures!

This shows the process of finding a missing 5-bit integer in a set starting with 30 integers. The numbers are split by bitwise-ANDing them to the edge label. I chose to start at the least significant bit. At the end of the process, one of the two child nodes will be empty, representing a missing integer. For fun, here’s the run for 1 million 20-bit integers:

I wish I did this sooner! It didn’t take too much time, and would have saved me from making mistakes in my notebook. It’s also rewarding to see something pretty in the end. On to the next one!

The bitfield is a good approach! This question ends by asking how you’d solve it with ample memory (the bitmap) and if you only had tiny amounts of memory but ample disk space (the solution in this post). Nowadays IO is much more precious!

To be fair to the book, the second edition uses “4 billion 32-bit integers”, so the intent of the question is “you have a whole lotta numbers”. A new edition would be much higher (as high as it takes for the bitfield solution to be impractical for the second answer :P)

Thanks for commenting!

The binary search is definitely memory efficient, but man, you’d need to traverse your list a *lot* of times. And if it’s on a tape, do you need to load it yourself for each pass? Yikes!

My gut reaction was to do it with a bitfield initialised to zeroes, then go through the list once setting the bit that corresponds to each number to one. When you’re done, you just search the bitfield for a zero, and voila! For the example given, my napkin calculations say it will require 128KiB of memory. Huge compared to the memory requirements of the binary search, bit still far less than the list requires.