Crypto-PAn

Crypto-PAn (Cryptography-based Prefix-preserving Anonymization) is a cryptographic algorithm for anonymizing IP addresses while preserving their subnet structure. That is, the algorithm encrypts any string of bits $$x$$ to a new string $$E_k(x)$$, while ensuring that for any pair of bit-strings $$x, y$$ which share a common prefix of length $$p$$, their images $$E_k(x), E_k(y)$$ also share a common prefix of length $$p$$. A mapping with this property is called prefix-preserving. In this way, Crypto-PAn is a kind of format-preserving encryption.

The mathematical outline of Crypto-PAn was developed by Jinliang Fan, Jun Xu, Mostafa H. Ammar (all of Georgia Tech) and Sue B. Moon. It was inspired by the IP address anonymization done by Greg Minshall's TCPdpriv program circa 1996.

Algorithm


Intuitively, Crypto-PAn encrypts a bit-string of length $$n$$ by descending a binary tree of depth $$n$$, one step for each bit in the string. Each of the binary tree's $$2^n - 1$$ non-leaf nodes has been given a value of "0" or "1", according to some pseudo-random function seeded by the encryption key. At each step $$i$$ of the descent, the algorithm computes the $$i$$th bit of the output by XORing the $$i$$th bit of the input with the value of the current node.
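The tree descent can be sketched in a few lines of Python. This is only an illustration under stated assumptions: HMAC-SHA-256 stands in for the unspecified pseudo-random function, bit-strings are represented as Python strings of "0" and "1" characters, and the names `node_bit`, `anonymize_bits` and `common_prefix_len` are hypothetical helpers, not part of any Crypto-PAn implementation.

```python
import hashlib
import hmac


def node_bit(key: bytes, prefix: str) -> int:
    """Pseudo-random 0/1 label of the tree node reached by following `prefix`.

    Assumption: HMAC-SHA-256 stands in for "some pseudo-random function
    seeded by the encryption key"; only the top bit of the digest is used.
    """
    digest = hmac.new(key, prefix.encode("ascii"), hashlib.sha256).digest()
    return digest[0] >> 7


def anonymize_bits(key: bytes, bits: str) -> str:
    """Prefix-preserving map on a string of '0'/'1' characters."""
    out = []
    for i, bit in enumerate(bits):
        # The node visited at step i is identified by the i input bits
        # consumed so far, so inputs with equal prefixes visit equal nodes.
        out.append(str(int(bit) ^ node_bit(key, bits[:i])))
    return "".join(out)


def common_prefix_len(a: str, b: str) -> int:
    """Number of leading characters on which a and b agree."""
    n = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        n += 1
    return n


key = b"example key (hypothetical)"
x = "1100101011110000"
y = "1100101100000000"  # agrees with x on exactly the first 7 bits
ex, ey = anonymize_bits(key, x), anonymize_bits(key, y)
print(common_prefix_len(x, y), common_prefix_len(ex, ey))  # both lengths are 7
```

The two outputs agree on the first 7 bits because both inputs visit the same nodes there, and they differ at the next bit because the same node value is XORed with two different input bits.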

The reference implementation takes a 256-bit key. The first 128 bits of the key material are used to initialize an AES-128 cipher in ECB mode. The second 128 bits of the key material are encrypted with the cipher to produce a 128-bit padding block $$\mathit{pad}$$.

Given a 32-bit IPv4 address $$x$$, the reference implementation performs the following operation for each bit $$x_i$$ of the input: Compose a 128-bit input block $$I_i = x_{[0,i)} \mathit{pad}_{[i,128)}$$, that is, the first $$i$$ bits of $$x$$ followed by the last $$128 - i$$ bits of the padding block. Encrypt $$I_i$$ with the cipher to produce a 128-bit output block $$O_i$$. Finally, XOR the first bit of that output block with the $$i$$th bit of $$x$$, and append the result, $$x_i \oplus O_{i,0}$$, onto the output bitstring. Once all 32 bits of the output bitstring have been computed, the result is returned as the anonymized output $$E_k(x)$$ which corresponds to the original input $$x$$.
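The per-bit block construction can be sketched as follows. To keep the example runnable with only the Python standard library, `encrypt_block` is a stand-in built from HMAC-SHA-256 and is not AES-128; a real implementation would replace it with single-block AES-128 encryption. The sketch takes the leading bit of each output block as the pseudo-random bit, which is how the reference C++ code is understood to behave. The key value and the names `make_pad` and `anonymize_ipv4` are hypothetical.

```python
import hashlib
import hmac
import ipaddress

MASK128 = (1 << 128) - 1


def encrypt_block(cipher_key: bytes, block: bytes) -> bytes:
    """STAND-IN for AES-128 encryption of one 16-byte block (assumption).

    Truncated HMAC-SHA-256 is used only so the sketch has no third-party
    dependencies; it is a keyed pseudo-random function, not a block cipher.
    """
    return hmac.new(cipher_key, block, hashlib.sha256).digest()[:16]


def make_pad(key: bytes) -> int:
    """Encrypt the second 128 bits of the key under the first 128 bits."""
    if len(key) != 32:
        raise ValueError("expected a 256-bit (32-byte) key")
    return int.from_bytes(encrypt_block(key[:16], key[16:]), "big")


def anonymize_ipv4(key: bytes, addr: int) -> int:
    """Prefix-preserving anonymization of a 32-bit address given as an int."""
    cipher_key = key[:16]
    pad = make_pad(key)
    x128 = addr << 96  # place the address in the top 32 bits of a block
    result = 0
    for i in range(32):
        low = MASK128 >> i    # selects bit positions [i, 128)
        high = low ^ MASK128  # selects bit positions [0, i)
        # I_i: first i bits of the address, then pad bits from position i on.
        block = (x128 & high) | (pad & low)
        out = encrypt_block(cipher_key, block.to_bytes(16, "big"))
        flip = out[0] >> 7    # leading bit of the output block O_i
        x_i = (addr >> (31 - i)) & 1
        result = (result << 1) | (x_i ^ flip)
    return result


key = bytes(range(32))  # hypothetical 256-bit key, for illustration only
addr = int(ipaddress.IPv4Address("192.0.2.1"))
print(ipaddress.IPv4Address(anonymize_ipv4(key, addr)))
```

Because the block at step $$i$$ depends only on the first $$i$$ address bits, two addresses in the same /24 network produce outputs that also agree on their first 24 bits.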

The reference implementation does not implement deanonymization; that is, it does not provide a function $$D_k$$ such that $$D_k(E_k(x)) = x$$. However, decryption can be implemented almost identically to encryption, just making sure to compose each input block $$I_i = x_{[0,i)} \mathit{pad}_{[i,128)}$$ using the plaintext bits of $$x$$ decrypted so far, rather than using the ciphertext bits: $$I_i \neq E_k(x)_{[0,i)} \mathit{pad}_{[i,128)}$$.
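A minimal sketch of such a decryption routine is shown below, next to the matching encryption so that the round trip can be checked. As assumptions for illustration: `encrypt_block` is an HMAC-SHA-256 stand-in for AES-128 (not the real cipher), the leading bit of each output block is used as the pseudo-random bit, and the names `flip_bit`, `anonymize` and `deanonymize` are hypothetical.

```python
import hashlib
import hmac

MASK128 = (1 << 128) - 1


def encrypt_block(cipher_key: bytes, block: bytes) -> bytes:
    """STAND-IN for AES-128 encryption of one 16-byte block (assumption)."""
    return hmac.new(cipher_key, block, hashlib.sha256).digest()[:16]


def flip_bit(cipher_key: bytes, pad: int, x: int, i: int) -> int:
    """Leading output bit for the block I_i built from the top i bits of x."""
    low = MASK128 >> i    # bit positions [i, 128) come from the pad
    high = low ^ MASK128  # bit positions [0, i) come from x
    block = ((x << 96) & high) | (pad & low)
    return encrypt_block(cipher_key, block.to_bytes(16, "big"))[0] >> 7


def _setup(key: bytes):
    cipher_key = key[:16]
    pad = int.from_bytes(encrypt_block(cipher_key, key[16:]), "big")
    return cipher_key, pad


def anonymize(key: bytes, x: int) -> int:
    cipher_key, pad = _setup(key)
    y = 0
    for i in range(32):
        shift = 31 - i
        y |= (((x >> shift) & 1) ^ flip_bit(cipher_key, pad, x, i)) << shift
    return y


def deanonymize(key: bytes, y: int) -> int:
    cipher_key, pad = _setup(key)
    x = 0  # plaintext bits recovered so far, already in their final positions
    for i in range(32):
        shift = 31 - i
        # Only the top i bits of x are read here, and those are exactly the
        # plaintext bits recovered in earlier iterations.
        x |= (((y >> shift) & 1) ^ flip_bit(cipher_key, pad, x, i)) << shift
    return x


key = bytes(range(32))  # hypothetical 256-bit key, for illustration only
original = 0xC0000201   # 192.0.2.1
assert deanonymize(key, anonymize(key, original)) == original
```

Note that decryption only ever runs the cipher in the forward direction; the block cipher's own decryption function is never needed.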

The reference implementation does not implement encryption of bitstrings of lengths other than 32; for example, it does not support the anonymization of 128-bit IPv6 addresses. In practice, the 32-bit Crypto-PAn algorithm can be used in "ECB mode" itself, so that a 128-bit string $$x_{[0,128)}$$ might be anonymized as $$E_k(x_{[0,32)}) E_k(x_{[32,64)}) E_k(x_{[64,96)}) E_k(x_{[96,128)})$$. This approach preserves the prefix structure of the 128-bit string, but does leak information about the lower-order chunks; for example, an anonymized IPv6 address consisting of the same 32-bit ciphertext repeated four times is likely the special address :: (all zeros), which thus reveals the encryption of the all-zero 32-bit plaintext.
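The chunked approach and its leak can be illustrated with a short sketch. The 32-bit map `pp32` below is a stand-in prefix-preserving function built on HMAC-SHA-256, not the AES-based reference construction, and `anonymize_ipv6_chunked` is a hypothetical name; both are assumptions made so the example is self-contained.

```python
import hashlib
import hmac
import ipaddress


def pp32(key: bytes, x: int) -> int:
    """Stand-in 32-bit prefix-preserving map (assumption, not the AES one)."""
    y = 0
    for i in range(32):
        shift = 31 - i
        prefix = x >> (shift + 1)  # the i bits that precede bit i
        # Include i in the message so prefixes of different lengths differ.
        msg = bytes([i]) + prefix.to_bytes(4, "big")
        flip = hmac.new(key, msg, hashlib.sha256).digest()[0] >> 7
        y |= (((x >> shift) & 1) ^ flip) << shift
    return y


def anonymize_ipv6_chunked(
    key: bytes, addr: ipaddress.IPv6Address
) -> ipaddress.IPv6Address:
    """Anonymize each 32-bit chunk independently ("ECB mode")."""
    value = int(addr)
    out = 0
    for shift in (96, 64, 32, 0):
        chunk = (value >> shift) & 0xFFFFFFFF
        out |= pp32(key, chunk) << shift
    return ipaddress.IPv6Address(out)


key = b"example key (hypothetical)"
leaky = anonymize_ipv6_chunked(key, ipaddress.IPv6Address("::"))
print(leaky.exploded)  # the same 32-bit chunk, four times over
```

Since every chunk of the all-zero address is the same plaintext, all four output chunks are the same ciphertext, and an observer who guesses the plaintext learns how the all-zero chunk is encrypted wherever it appears.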

In principle, the reference implementation's approach (building 128-bit input blocks $$I_i$$) can be extended up to 128 bits. Beyond 128 bits, a different approach would have to be used; but the fundamental algorithm (descending a binary tree whose nodes are marked with a pseudo-random function of the key material) remains valid.

Implementations

Crypto-PAn's C++ reference implementation was written in 2002 by Jinliang Fan.

In 2005, David Stott of Lucent made some improvements to the C++ reference implementation, including a deanonymization routine. Stott also observed that the algorithm preserves prefix structure while destroying suffix structure; running the Crypto-PAn algorithm on a bit-reversed string will preserve any existing suffix structure while destroying prefix structure. Thus, running the algorithm first on the input string, and then again on the bit-reversed output of the first pass, destroys both prefix and suffix structure. (However, once the suffix structure has been destroyed, destroying the remaining prefix structure can be accomplished far more efficiently by simply feeding the non-reversed output to AES-128 in ECB mode. There is no particular reason to reuse Crypto-PAn in the second pass.)

A Perl implementation was written in 2005 by John Kristoff. Python and Ruby implementations also exist.

Versions of the Crypto-PAn algorithm are used for data anonymization in many applications, including NetSniff and CAIDA's CoralReef library.