Dreaming of Git with Chunks
I see more and more people using content addressable storage together with content defined chunking (CDC) to pull off interesting applications. Here are a few examples:
- League Of Legends' patch downloader
- Asuran
- rdedup
- borg
- restic
I'd love to see a distributed version control system (DVCS) which is based on CDC (which support for encryption and compression). So far, I have only found the Attaca project, which seems to be unmaintained at the moment. I believe that a "CDC based Git clone" would offer some interesting possibilities. We could:
- Keep track of our backups on our local system, while still being able to synchronize changes with one or more remote locations
- Support large binary data so that game developers (or other developers who have to deal with large assets) could use a DVCS
- Track build artifacts (packages, containers, binaries, …) using commits and
branches. This would allow us to build update mechanisms for our applications
based on
push
,pull
andfetch
operations (similar to the League Of Legends post above) - Use a version control system as an alternative to tools such as Dropbox, NextCloud or Syncthing
Design Ideas
Git uses four components to build its internal data structure:
- Blobs to store the actual file content
- Trees to create a "snapshot" of a repository
- Commits to add meta information to trees and to create the repository history
- References (branches and tags) to point to a specific point in the commit graph
The major difference between Git and a CDC-based DVCS would be, that a single file might be split into one more chunks. This leaves us with two new problems:
- We have to keep track of which chunks make up a file
- We need to do some additional work so that we re-gain features such as
git diff
Addressing a File
While Git can use a single hash to find a specific file, we need three pieces of information to do the same:
- A unique identifier
- A list of hashes to find all current chunks
- Information about how to process the data in case of encryption and/or compression
Instead of a file name, I believe that a UUID might be even better to uniquely identify a file. This way, we could keep track of a file, even if its name changes over time. In some cases we might even be able to detect a rename operation by identifying the file based on its unchanged chunks.
Construct Files in a Cache
Before we can run operations similar to git diff
, we have to reconstruct a
file based on its chunks. To simplify such operations, we could build an
internal cache for a specific commit. Keep in mind that a commit is immutable,
which means that such a cache could be operated in an "append-only" fashion.
While this approach seems to be pretty straightforward, we would need to
implement some form of retention policy in order to keep our overall disk space
consumption in line.