Database and storage systems with control over spatial locality and window or
page-level compression (or a moral equivalent) can achieve fantastic
compression rates when similar pieces of data are placed close together.
Back in 2004, Bigtable
was used by Google to store crawled web page contents, obtaining 10-to-1
space reductions, despite using very fast compression. The space gains were
better than the 3-to-1 or 4-to-1 ratios resulting typically from
slower algorithms like GZip. This was achieved by storing (keying) all pages
from the same host close to each other.
Likewise, a year later git used locality
to improve delta compression in pack files.
Similarly to git, the Sync Appliance uses structures resembling Merkle
trees in content-addressed storage
to power the time travel (point in time recovery) functionality. Unlike git,
though, this is not used for file data, and it is possible to purge files or
versions without having the rewrite the history (also, there’s no separate
stop-the-world packing phase or blob GC, everything is done incrementally).
git uses simple heuristics to order objects and pick candidates to perform
delta compression against. Is it possible to find a more general way to
improve spatial locality of arbitrary blobs in content-addressed storage?
You’re probably guessing it is, and you’re right if so.
Continue Reading