Chunkyard Explained

Since my last blog post about Chunkyard the project has reached v1.1.1. Chunkyard is now a single .NET project and does no longer rely on components written in other languages. I also took the time to write down what the project should and should not do in the README file.

I believe that the overall internal data structure is stable, so I'd like to take the time to document how a file is stored.

Creating The Repository

So let's imagine that we want to store a snapshot of the hello-world.txt file using the command:

chunkyard create -r ~/chunkyard-repository -f ~/Documents/hello-world.txt

After the initial password prompt Chunkyard will perform the following operations:

  • The file hello-world.txt is read and split into pieces (chunks) using the FastCdc algorithm. Since text files tend to be rather small, we'll most likely end up with a single chunk
  • Each chunk will be AES (Galois/Counter Mode) encrypted using a 256 bit key which is derived from the password. Note: These choices are based on this NDC Conference talk by Stephen Haunts
  • Every encrypted chunk will be stored in a content addressable storage using its SHA256 hash value

In order the reconstruct the hello-world.txt file, we need to keep track of several pieces of data:

  • Parameters to reproduce the 256 bit key
  • The file name
  • The address of every file chunk in the storage system
  • The AES GCM tag and the cryptographic nonce to decrypt the file chunks

Here's an overview of the file structure of the ~/chunkyard-repository directory:

chunkyard-repository/
  content/
    sha256/
      11/
        116d11f1a1a7301a720848382893cb931e781f31f93eeae3cbb88106b3d88ba5
      58/
        58c767a0b5c211fd26d2869ea36691b98270b7db5f22038d9c902ecdc8a818d8
  reflog/
      0.json

The file 116d11f1a1a7301a720848382893cb931e781f31f93eeae3cbb88106b3d88ba5 contains our encrypted hello-world.txt file.

All the information to reconstruct a snapshot of hello-world.txt is also stored as an encrypted chunk. Here's how the snapshot file 58c767a0b5c211fd26d2869ea36691b98270b7db5f22038d9c902ecdc8a818d8 looks like when it is decrypted:

{
  "CreationTime": "2020-10-03T13:05:45.5405235+02:00",
  "ContentReferences": [
    {
      "Name": "hello-world.txt",
      "Nonce": "rqesO2bAdlHs/3de",
      "Chunks": [
        {
          "ContentUri": "sha256://116d11f1a1a7301a720848382893cb931e781f31f93eeae3cbb88106b3d88ba5",
          "Tag": "d17sWwQ6DXtw7n9ud2NX/g=="
        }
      ]
    }
  ]
}

In this example the snapshot contains a single file which consists of a single chunk (116d11f1a1a7301a720848382893cb931e781f31f93eeae3cbb88106b3d88ba5).

Finally 0.json contains a reference to the above snapshot as well as the salt and hashing iterations parameters needed to create our cryptographic key:

{
  "LogId": "5411e7c2-315e-4b48-9345-e2ddfc44c0ad",
  "ContentReference": {
    "Name": ".chunkyard-snapshot",
    "Nonce": "WOjfK9yWujps+w1+",
    "Chunks": [
      {
        "ContentUri": "sha256://58c767a0b5c211fd26d2869ea36691b98270b7db5f22038d9c902ecdc8a818d8",
        "Tag": "C/YhL/z9Q/rUtR1hVOv5fw=="
      }
    ]
  },
  "Salt": "GsKFv9f5GcrEn12U",
  "Iterations": 1000
}

Create A New Backup

Let's change the content of our hello-world.txt example and re-run the create command. Here's what happened to the ~/chunkyard-repository directory:

chunkyard-repository/
  content/
    sha256/
      11/
        116d11f1a1a7301a720848382893cb931e781f31f93eeae3cbb88106b3d88ba5
      30/
        30a54e8cda4e6cc66a92eb93e28d8b9ca646c71b7d9414c120d6ebb7e34dea2f
      58/
        58c767a0b5c211fd26d2869ea36691b98270b7db5f22038d9c902ecdc8a818d8
      5b/
        5b9d6713a8cba3ecaa8b68fa71a11a2dab88ec266bbb42660fb3a1dbfb34b401
  reflog/
    0.json
    1.json

Chunkyard did not change any existing file, instead it created new ones. These new JSON structures look like this:

The snapshot:

{
  "CreationTime": "2020-10-03T13:16:39.1510417+02:00",
  "ContentReferences": [
    {
      "Name": "hello-world.txt",
      "Nonce": "rqesO2bAdlHs/3de",
      "Chunks": [
        {
          "ContentUri": "sha256://5b9d6713a8cba3ecaa8b68fa71a11a2dab88ec266bbb42660fb3a1dbfb34b401",
          "Tag": "AhAY/F2rAjjFosaFnElhtw=="
        }
      ]
    }
  ]
}

And the 1.json file:

{
  "LogId": "5411e7c2-315e-4b48-9345-e2ddfc44c0ad",
  "ContentReference": {
    "Name": ".chunkyard-snapshot",
    "Nonce": "3sJtajbLoV6PZAWx",
    "Chunks": [
      {
        "ContentUri": "sha256://30a54e8cda4e6cc66a92eb93e28d8b9ca646c71b7d9414c120d6ebb7e34dea2f",
        "Tag": "hAQVhHsH7V14U/bwufvMww=="
      }
    ]
  },
  "Salt": "GsKFv9f5GcrEn12U",
  "Iterations": 1000
}

Restoring Files

Finally, let's walk through the performed operations when we are restoring a snapshot using:

chunkyard restore -r ~/chunkyard-repository -d ~/chunkyard-restored
  • Prompt the user for a password
  • Read the unencrypted JSON file found in the reflog directory. In our example this would be 1.json
  • Create the cryptographic key based on the password, salt and iteration parameters
  • Decrypt and reconstruct the snapshot based on the information found in 1.json
  • Decrypt and reconstruct every file in the snapshot

Closing Thoughts

Chunkyard is by far not as sophisticated as other modern backup tools, but building the project helped me to learn more about encryption, chunking and content addressable storage. I would not recommend anyone to use or rely on Chunkyard, but to rather use it as an opportunity to get a basic understanding of how these other tools work.

Published: 2020-10-03