Dreaming of Git with Chunks

I see more and more people using content addressable storage together with content defined chunking (CDC) to pull off interesting applications. Here are a few examples:

I'd love to see a distributed version control system (DVCS) that is based on CDC, with support for encryption and compression. So far, I have only found the Attaca project, which seems to be unmaintained at the moment. I believe that a "CDC-based Git clone" would offer some interesting possibilities. We could:

Design Ideas

Git uses four components to build its internal data structure:

  • Blobs to store the actual file content
  • Trees to create a "snapshot" of a repository
  • Commits to add meta information to trees and to create the repository history
  • References (branches and tags) to point to a specific point in the commit graph
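As a point of comparison for the ideas below, here is a small sketch of how Git derives the ID of a blob: it hashes a short header followed by the complete file content. The function name git_blob_id is my own, and the sketch only covers the classic SHA-1 object format.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the object ID Git assigns to a blob with this content (SHA-1 format)."""
    # Git prefixes the content with "blob <size in bytes>\0" before hashing.
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()


if __name__ == "__main__":
    original = b"line one\nline two\n"
    edited = b"line one\nline 2\n"
    # Any change to the file yields a completely different blob ID.
    print(git_blob_id(original))
    print(git_blob_id(edited))
```

Because the hash covers the whole file, even a tiny edit produces an entirely new blob. Splitting files into chunks is meant to limit the new data to the chunks that actually changed.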

The major difference between Git and a CDC-based DVCS would be that a single file might be split into one or more chunks. This leaves us with two new problems:

  • We have to keep track of which chunks make up a file
  • We need to do some additional work to regain features such as git diff
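To make the chunking step more concrete, here is a minimal sketch of content defined chunking with a gear-style rolling hash. Everything in it is an assumption for illustration: the size limits, the mask, the lookup table and the names chunk and store are made up, and a real implementation would tune these values carefully.

```python
import hashlib
from typing import Dict, List

# Hypothetical parameters, chosen small so the effect is visible on tiny inputs.
MIN_SIZE = 64            # never cut before this many bytes
MAX_SIZE = 1024          # always cut at this many bytes
MASK = 0xFC000000        # cut when the top 6 bits of the 32-bit hash are zero

# Deterministic pseudo-random table: one 32-bit value per possible byte.
GEAR = [
    int.from_bytes(hashlib.sha256(bytes([i])).digest()[:4], "big")
    for i in range(256)
]


def chunk(data: bytes) -> List[bytes]:
    """Split data at positions that depend on the content, not on fixed offsets."""
    chunks = []
    start = 0
    h = 0
    for i, byte in enumerate(data):
        # Shifting pushes old bytes out of the 32-bit hash over time.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        length = i - start + 1
        if (length >= MIN_SIZE and (h & MASK) == 0) or length >= MAX_SIZE:
            chunks.append(data[start:i + 1])
            start = i + 1
            h = 0
    if start < len(data):
        chunks.append(data[start:])  # trailing remainder
    return chunks


def store(data: bytes, objects: Dict[str, bytes]) -> List[str]:
    """Store each chunk under its SHA-256 hash and return the ordered hash list."""
    hashes = []
    for piece in chunk(data):
        digest = hashlib.sha256(piece).hexdigest()
        objects.setdefault(digest, piece)  # identical chunks are stored once
        hashes.append(digest)
    return hashes
```

The ordered list returned by store is the "which chunks make up a file" information from the first bullet. Joining the stored chunks in that order gives back the original bytes.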

Addressing a File

While Git can use a single hash to find a specific file, we need three pieces of information to do the same:

  • A unique identifier
  • A list of hashes to find all current chunks
  • Information about how to process the data in case of encryption and/or compression

Instead of a file name, I believe that a UUID might be even better to uniquely identify a file. This way, we could keep track of a file, even if its name changes over time. In some cases we might even be able to detect a rename operation by identifying the file based on its unchanged chunks.
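A rough sketch of such a file record could look like the following. The class FileRecord, its field names, the Jaccard-based overlap measure and the threshold of 0.5 are all assumptions of mine, not part of any existing tool.

```python
import uuid
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FileRecord:
    path: str                          # current name, free to change over time
    chunk_hashes: List[str]            # ordered hashes of the current chunks
    compression: Optional[str] = None  # e.g. "zstd"; None means uncompressed
    encryption: Optional[str] = None   # e.g. "aes-256-gcm"; None means plain
    # Stable identifier that survives renames.
    file_id: str = field(default_factory=lambda: str(uuid.uuid4()))


def chunk_overlap(a: FileRecord, b: FileRecord) -> float:
    """Share of distinct chunks that both records have in common (Jaccard index)."""
    set_a, set_b = set(a.chunk_hashes), set(b.chunk_hashes)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def looks_like_rename(deleted: FileRecord, added: FileRecord,
                      threshold: float = 0.5) -> bool:
    """Guess whether an added file is a renamed version of a deleted one."""
    return deleted.path != added.path and chunk_overlap(deleted, added) >= threshold


if __name__ == "__main__":
    old = FileRecord("notes.txt", ["h1", "h2", "h3"])
    new = FileRecord("docs/notes.txt", ["h1", "h2", "h4"])
    print(looks_like_rename(old, new))
```

If the overlap check succeeds, the new record could simply inherit the file_id of the old one, so that the history of the file stays connected across the rename.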

Construct Files in a Cache

Before we can run operations similar to git diff, we have to reconstruct a file from its chunks. To simplify such operations, we could build an internal cache for a specific commit. Keep in mind that a commit is immutable, which means that a cached file never has to be updated and the cache could be operated in an "append-only" fashion. While this approach seems pretty straightforward, we would need to implement some form of retention policy to keep our overall disk space consumption in check.
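One way this could work is sketched below. It keeps reconstructed files in memory and uses a least-recently-used policy as the retention rule, which is only one of several possible choices. The class FileCache, the (commit_id, file_id) key and the byte budget are hypothetical, and the sketch ignores decompression and decryption.

```python
from collections import OrderedDict
from typing import Dict, List, Tuple


class FileCache:
    """Cache of reconstructed files, keyed by (commit_id, file_id)."""

    def __init__(self, objects: Dict[str, bytes], max_bytes: int):
        self.objects = objects        # chunk store: hash -> chunk bytes
        self.max_bytes = max_bytes    # retention budget
        self.used = 0
        self.entries: "OrderedDict[Tuple[str, str], bytes]" = OrderedDict()

    def get(self, commit_id: str, file_id: str, chunk_hashes: List[str]) -> bytes:
        key = (commit_id, file_id)
        if key in self.entries:
            # Commits are immutable, so a cached entry never becomes stale.
            self.entries.move_to_end(key)
            return self.entries[key]
        data = b"".join(self.objects[h] for h in chunk_hashes)
        self.entries[key] = data
        self.used += len(data)
        self._evict()
        return data

    def _evict(self) -> None:
        # Drop the least recently used entries, but keep the newest one.
        while self.used > self.max_bytes and len(self.entries) > 1:
            _, old = self.entries.popitem(last=False)
            self.used -= len(old)


if __name__ == "__main__":
    store = {"a": b"hello ", "b": b"world"}
    cache = FileCache(store, max_bytes=1024)
    print(cache.get("commit-1", "file-1", ["a", "b"]))
```

A diff between two commits would then call get once per side and compare the resulting byte strings with any ordinary diff routine.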

Published: 2020-02-08