Florian Winkelbauer

Dreaming of Git with Chunks

I see more and more people using content addressable storage together with content defined chunking (CDC) to pull off interesting applications. Here are a few examples:

I’d love to see a distributed version control system (DVCS) that is based on CDC (with support for encryption and compression). So far, I have only found the Attaca project, which seems to be unmaintained at the moment. I believe that a “CDC-based Git clone” would offer some interesting possibilities. We could:

Design Ideas

Git uses four components to build its internal data structure:

The major difference between Git and a CDC-based DVCS would be that a single file might be split into one or more chunks. This leaves us with two new problems:

Addressing a File

While Git can use a single hash to find a specific file, we need three pieces of information to do the same:

Instead of a file name, I believe that a UUID might be even better to uniquely identify a file. This way, we could keep track of a file, even if its name changes over time. In some cases we might even be able to detect a rename operation by identifying the file based on its unchanged chunks.
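To make the idea a bit more concrete, here is a minimal sketch in Python. It assumes that every file is tracked as a record holding a UUID and an ordered list of chunk hashes, and that a rename is guessed by comparing chunk sets. The names `FileEntry`, `chunk_overlap` and `detect_rename`, the use of SHA-256, and the 0.5 threshold are all assumptions made for illustration, not part of any existing tool.

```python
import hashlib
import uuid
from dataclasses import dataclass, field
from typing import List


def chunk_id(data: bytes) -> str:
    """Content address of a chunk: the SHA-256 hex digest of its bytes."""
    return hashlib.sha256(data).hexdigest()


@dataclass
class FileEntry:
    """Hypothetical per-file record: a stable identity plus an ordered chunk list."""
    name: str
    chunks: List[str]
    # The UUID stays the same even if `name` changes later on.
    file_id: str = field(default_factory=lambda: str(uuid.uuid4()))


def chunk_overlap(a: FileEntry, b: FileEntry) -> float:
    """Jaccard similarity of the two files' chunk sets (0.0 to 1.0)."""
    set_a, set_b = set(a.chunks), set(b.chunks)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def detect_rename(removed: FileEntry, added: FileEntry, threshold: float = 0.5) -> bool:
    """Guess that `added` is a renamed `removed` if enough chunks are shared."""
    return chunk_overlap(removed, added) >= threshold


old = FileEntry("notes.txt", [chunk_id(b"a"), chunk_id(b"b"), chunk_id(b"c")])
new = FileEntry("docs/notes.md", [chunk_id(b"a"), chunk_id(b"b"), chunk_id(b"d")])

if detect_rename(old, new):
    # Carry the identity over so the file's history stays connected.
    new.file_id = old.file_id
```

A set-based comparison ignores chunk order and duplicates, which keeps the sketch short. A real implementation would probably want a smarter heuristic.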

Construct Files in a Cache

Before we can run operations similar to git diff, we have to reconstruct a file from its chunks. To simplify such operations, we could build an internal cache for a specific commit. Keep in mind that a commit is immutable, which means that such a cache could be operated in an “append-only” fashion: an entry is written once and never has to be updated. While this approach seems to be pretty straightforward, we would need to implement some form of retention policy in order to keep our overall disk space consumption in check.
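Such a cache could look roughly like the following Python sketch. It assumes a chunk store that maps chunk hashes to bytes, cache keys made of a commit ID and a file ID, and a retention policy that evicts the least recently used entries once a byte budget is exceeded. The class `CommitFileCache` and all of its parameters are hypothetical, and an in-memory dictionary stands in for what would be files on disk.

```python
import hashlib
from collections import OrderedDict
from typing import Dict, List, Tuple


class CommitFileCache:
    """Hypothetical cache of reconstructed files, keyed by (commit_id, file_id)."""

    def __init__(self, chunk_store: Dict[str, bytes], max_bytes: int) -> None:
        self.chunk_store = chunk_store  # chunk hash -> chunk bytes
        self.max_bytes = max_bytes      # retention budget
        self.used_bytes = 0
        self._entries: "OrderedDict[Tuple[str, str], bytes]" = OrderedDict()

    def get(self, commit_id: str, file_id: str, chunks: List[str]) -> bytes:
        key = (commit_id, file_id)
        if key in self._entries:
            # Commits are immutable, so a cached entry never goes stale.
            self._entries.move_to_end(key)
            return self._entries[key]
        # Cache miss: rebuild the file by concatenating its chunks in order.
        data = b"".join(self.chunk_store[c] for c in chunks)
        self._entries[key] = data
        self.used_bytes += len(data)
        self._evict()
        return data

    def _evict(self) -> None:
        # Drop least recently used entries until we are within budget,
        # but always keep the entry that was just added.
        while self.used_bytes > self.max_bytes and len(self._entries) > 1:
            _, old = self._entries.popitem(last=False)
            self.used_bytes -= len(old)


store = {hashlib.sha256(b).hexdigest(): b for b in (b"hello ", b"world", b"!!!")}
ids = list(store)
cache = CommitFileCache(store, max_bytes=12)
first = cache.get("commit-1", "file-1", ids[:2])           # b"hello world"
second = cache.get("commit-2", "file-1", [ids[0], ids[2]])  # evicts the first entry
```

Entries are only ever added or evicted, never modified, which matches the “append-only” idea. Evicting by recency is just one possible retention policy.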