Superimposed code



A superimposed code such as Zatocoding is a kind of hash code that was popular in marginal punched-card systems.

Marginal punched-card systems
Many names, some of them trademarked, have been used for marginal punched-card systems: edge-notched cards, slotted cards, E-Z Sort, Zatocards, McBee, McBee Keysort, Flexisort, Velom, Rocket, etc. The center of each card held the relevant information—typically the name and author of a book, research paper, or journal article on a nearby shelf; and a list of subjects and keywords. Some sets of cards contained all the information required by the user on the card itself, handwritten, typewritten, or on microfilm (aperture card). Every card in a stack had the same set of pre-punched holes. The user would find the particular cards relevant to a search by aligning the holes in the set of cards (using a card holder or card tray), inserting one or more knitting-needle-like rods all the way through the stack, so the desired cards (which had been notched or cut open) fell out from the irrelevant cards in the collection (left un-notched), which remain on the needle(s). A user could repeat this selection many times to form a complex Boolean searching query. A card that was relevant to 2 or more subjects would have the slot(s) for each of those subjects cut out, so that card would drop out when either one or the other or both subjects was selected. The "superimposed code" coding systems, such as Zatocoding, saved space by entering several or all subjects in the same field; such a "superimposed code" stores much more information in less space, but at the cost of occasional "false" selections.

Once you have a collection of index cards, one per book, research paper, or journal article in a library, with a list of keywords (subjects) discussed in a particular book written on that book's card, the "obvious way" to code those subjects is to count up the total number of subjects used in the entire collection R, make a row of R holes near the top of every card, and for each subject actually discussed in a particular book, cut a slot from the hole corresponding to that subject in the card corresponding to that book. Naturally, this also requires a separate list of every subject used in the collection that indicates which hole is punched for each subject. Unfortunately, there may be thousands of distinct subjects in the collection, and it is impractical to punch thousands of holes in every card. While it may not seem possible to use less than 1 hole per subject, superimposed code systems can solve this problem.

Superimposed codes
The Zatocoding system of information retrieval was developed by Calvin Mooers in 1947.

Calvin Mooers invented Zatocoding at M.I.T., a mechanical information retrieval system based on superimposed codes, and formed the Zator Company in 1947 to commercialize its applications. The particular superimposed code used in that system is called Zatocoding, while the marginal-punched card information retrieval system as a whole is called "Zator".

Setting up a superimposed code for a particular library goes something like this:
 * Going through every card in the index, a list of all R subjects used in this particular library is created, and the maximum number of subjects r actually written on a single card is noted. (For example, say we have 8000 subjects, and the librarian decides to index only the top r=4 subjects per book).
 * The librarian looks at the physical edge-notched card, and notes the number of holes N in each card. (If N >= R, then we could use the "obvious way" mentioned above—the whole point of Zatocoding is that it works even when N is much less than R).
 * The librarian chooses some number n of slots per subject—typically $$n = N( 1- 2^{- \frac{1}{r} } )$$
 * On the list of all R subjects, for each subject write down which holes will be slotted for that subject. Rather than slotting one hole per subject in "the obvious way", a superimposed code will slot n holes per subject. (There are several ways to pick these patterns—those distinguish between the various superimposed codes; we discuss them below).
 * When a new book comes in, make a new card for it:
 * Get a blank card with the standard N holes in it and write down the name of the book, etc. in the middle.
 * Write down the subjects covered by the book on the card.
 * For each of the top r subjects, look up that subject in the big list, and see which n slots to cut for that subject, and cut them.
 * When the card is finished, it may have up to r*n slots cut into it—but more likely at least some of the subject slot patterns overlapped, resulting in only v < r*n slots.

Later, when we need to find books on some particular subject, we look up that subject in our list of all R subjects, find the corresponding slot pattern of n slots, and put n needles are through the whole stack in that pattern. All of the cards that have been cut with that pattern will fall out. It is possible that a few other, undesired cards may also fall out—cards who have several subjects whose hole patterns overlap in such a way as to mimic the desired pattern. The probability F of some undesired card with v slots cut in it falling through when we select some pattern of n needles is approximately $$F = \left(\frac{v}{N}\right)^n$$. Most systems have a N large enough and r small enough such that, v < N/2 (i.e., the card is less than half-punched), so that probability of an undesired card falling through is less than $$F < \left(\frac{1}{2}\right)^n$$.

There are several different ways to choose which holes will be slotted for each subject.

(Several variations of Zatocoding were developed. Bourne describes a variant "for newer retrieval systems that require high performance of the superimposed coding system", using an approach Mooers published in 1959. )

Zatocoding
Setting up a Zatocode for a particular list of R subjects goes something like this:
 * For the first subject, pick n of the N slots randomly.
 * For the second subject, pick n of the N slots randomly—but make sure this pattern is not identical to the first subject.
 * For the R'th subject, pick n of the N slots randomly—but make sure it's not identical to any previous subject.
 * For the R'th subject, pick n of the N slots randomly—but make sure it's not identical to any previous subject.

Other superimposed codes
A Zatocode requires a code book that lists every subject and a randomly generated notch code associated with each one. Other "direct" superimposed codes have a fixed hash function for transforming the letters in (one spelling of) a subject into a notch code. Such codes require a much shorter code book that describes the translation of letters in a word to the corresponding notch code, and can in principle easily add new subjects without changing the code book.

A Bloom filter can be considered a kind of superimposed code.