Database and storage systems with control over spatial locality and window or page-level compression (or a moral equivalent) can achieve fantastic compression rates when similar pieces of data are placed close together.
Back in 2004, Google used Bigtable to store crawled web page contents and obtained 10-to-1 space reductions despite using very fast compression, better than the 3-to-1 or 4-to-1 ratios typical of slower algorithms like GZip. This was achieved by keying all pages from the same host so that they were stored close to each other.
Likewise, a year later git used locality to improve delta compression in pack files.
Similarly to git, the Sync Appliance uses structures resembling Merkle trees in content-addressed storage to power the time travel (point-in-time recovery) functionality. Unlike git, though, this is not used for file data, and it is possible to purge files or versions without having to rewrite the history. There is also no separate stop-the-world packing phase or blob GC; everything is done incrementally.
git uses simple heuristics to order objects and pick candidates to perform delta compression against. Is it possible to find a more general way to improve the spatial locality of arbitrary blobs in content-addressed storage? You're probably guessing it is, and if so, you're right.
Generalization
git uses a very application-specific heuristic. Quoting from Linus:
The delta ordering is (wait for it):
- first sort by the “basename” of the object, as defined by the name the object was first reached through when generating the object list
- within the same basename, sort by size of the object
- but always sort different types separately (commits first).
That’s not exactly it, but it’s very close.
git works with opaque blobs with two properties, name and type, and orders them without inspecting or assuming anything about the contents.
If all you have is a bunch of opaque blobs and no properties, there is not much you can do. On the other hand, if you impose the tiniest bit of structure on the blobs, the way is open.
Conceptually only a tiny step above serialization (a list of bytes), lists of things (each represented as bytes) are about the most general structure anything can be encoded into. Complex graphs are but sets of vertices and edges (in other words, lists where the order is discarded), which in turn are but a list of "vertices or edges"; you get the drill.
If you can control spatial locality of lists of things, you can control locality for just about anything you can serialize.
Locality-sensitive hashing (LSH)
So you want to put related lists of things together: given two lists \( l_1 \) and \( l_2 \), derive two keys \( k_1 = k(l_1) \) and \( k_2 = k(l_2) \) such that \( k_1 \) and \( k_2 \) are close (ideally, share a common prefix) when \( l_1 \) and \( l_2 \) are similar. This way, in the storage system, the bindings \( k_1 \rightarrow \text{serialize}(l_1) \) and \( k_2 \rightarrow \text{serialize}(l_2) \) will be placed close to each other, and hopefully window or page compression will make them take up much less space than if the two pieces of data were far apart.
A list can be thought of as a set of elements and their indexes, so a generic method for the latter will work for the former.
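Concretely, the list-to-set view amounts to pairing each element with its index (a minimal illustrative sketch; the helper name is made up):

```python
def as_set(lst):
    """View a list as the set of its (index, element) pairs.

    This makes a list a special case of a set, so any set-based
    similarity machinery (like the Jaccard index below) applies to it.
    """
    return set(enumerate(lst))

print(as_set(["a", "b", "a"]))  # {(0, 'a'), (1, 'b'), (2, 'a')}
```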
The measure of similarity between two sets is the Jaccard index, defined as the size of the intersection divided by the size of the union:

\[ J(A, B) = \frac{|A \cap B|}{|A \cup B|} \]
It turns out the Jaccard index of two sets equals the probability that the minimal members of each set, according to a random permutation, are the same. In other words: say you have a hash function \( h \). You apply it to each element in each set, obtaining \( h(A_1) \)…\( h(A_n) \) and \( h(B_1) \)…\( h(B_m) \). Retain the values \( A_i \) and \( B_j \) which have the minimal hashes \( \min_i\ h(A_i) \) and \( \min_j\ h(B_j) \) in each respective set. Now, the probability that \( A_i = B_j \) is the Jaccard index \( J(A,B) \). Using hashing this way to estimate the Jaccard index is, unsurprisingly, called MinHash.
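The estimate sharpens if you repeat the trick with several independent hash functions and count how often the minima agree. A minimal sketch in Python; the SHA-1-based element hash, the salting scheme, and the signature size of 200 are illustrative choices, not from the original:

```python
import hashlib

def h(x: str) -> int:
    """Illustrative element hash: first 8 bytes of SHA-1 as an integer."""
    return int.from_bytes(hashlib.sha1(x.encode()).digest()[:8], "big")

def jaccard(a: set, b: set) -> float:
    """Exact Jaccard index: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b)

def minhash_signature(s: set, k: int = 200) -> list:
    """Simulate k independent hash functions by salting with the index,
    keeping the minimal hash under each one."""
    return [min(h(f"{i}:{e}") for e in s) for i in range(k)]

def estimated_jaccard(a: set, b: set, k: int = 200) -> float:
    """Fraction of signature positions where the minima agree; each
    position matches with probability J(A, B), so this estimates it."""
    sa, sb = minhash_signature(a, k), minhash_signature(b, k)
    return sum(x == y for x, y in zip(sa, sb)) / k

A = {"a", "b", "c", "d"}
B = {"b", "c", "d", "e"}
print(jaccard(A, B))            # exactly 0.6
print(estimated_jaccard(A, B))  # close to 0.6
```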
It follows that if you use the minimal hashes as prefixes of the keys used to store the sets \( A \) and \( B \), similar sets will be close with a probability equal to their degree of similarity.
Using LSH with Content addressed storage (CAS)
In typical CAS implementations, data blobs are keyed by their cryptographic hash (digest). Probably the most popular CAS implementation out there, git identifies data blobs, trees, etc. by their SHA-1 digest.
Being “content addressed” means that the key for a datum D is computed deterministically from D, \( k_D = k(D) \), in a way that avoids collisions for different values of D. git operates at the lowest level, where D is a sequence of bytes and \( k = \text{SHA-1} \). (Remember git uses separate heuristics for delta candidate determination.)
If a modicum of structure is retained and you work with lists of things, nothing forces you to use \( k(L)=\text{SHA-1}(\text{serialize}(L)) \). Any deterministic function that avoids collisions will do.
In particular, you can use the following function to derive a key for a list \( L \):

\[ k(L) = \text{MinHash}(L)\ |\ \text{SHA-1}(\text{serialize}(L)) \]

where \( | \) is just “concatenation” and \( \text{MinHash}(L) \) is the minimal hash over all the values in \( L \).
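A sketch of this key derivation in Python; the length-prefixed serialization, the 8-byte element-hash width, and byte-string elements are assumptions for illustration, not the Sync Appliance's actual format:

```python
import hashlib

def element_hash(e: bytes) -> bytes:
    """Illustrative per-element hash: first 8 bytes of SHA-1."""
    return hashlib.sha1(e).digest()[:8]

def serialize(lst) -> bytes:
    """Naive length-prefixed serialization (a stand-in for the real format)."""
    return b"".join(len(e).to_bytes(4, "big") + e for e in lst)

def key(lst) -> bytes:
    """k(L) = MinHash(L) | SHA-1(serialize(L)).

    The MinHash prefix clusters similar lists in a sorted key space;
    the full digest keeps the key deterministic and collision-resistant.
    """
    return min(element_hash(e) for e in lst) + hashlib.sha1(serialize(lst)).digest()

# Similar lists share the 8-byte prefix with probability J(A, B),
# so their serialized forms tend to land near each other in the store.
l1 = [b"README", b"src", b"tests"]
l2 = [b"README", b"src", b"docs"]
print(key(l1).hex())
print(key(l2).hex())
```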
Now, the hash function that operates on the list elements is specific to their kind, but the overall scheme is generic. The Sync Appliance uses it to store lists of id -> filename associations, directory entries (very much like git ‘tree’ blobs), etc.
Conclusion
Spatial locality of similar items in storage is desirable when you have compression, as it allows great space reductions even with very fast compression algorithms.
Content-addressed storage is usually implemented by using the digest of an opaque blob directly as the key. This leads to poor spatial locality for revisions of a blob, as a single changed byte in the serialized form leads to a completely different cryptographic digest (and thus key).
Nothing forces you to use the digest directly though: all you need is a deterministic function injective in practice. By prefixing the digest with a locality-sensitive hash (LSH), you can preserve good spatial locality.
If the blobs are not entirely opaque, but can be seen as sets of things, MinHash is a suitable and easy-to-compute LSH. This technique generalizes to arbitrarily complex structures, because any complex graph can be expressed (and will be, when serialized) as a set of vertices and edges.