brain dump about archiving & stuff

life and death of a project

Around 2018, I decided that the way I dealt with my data was lame. My useful admin files were scattered across multiple computers, some old projects could only be found on hard drives pried from dead computers, and looking for a rarely used scan was a chore.

qbe was in good shape so I thought about taking a bit of time to solve the issue. My recently started professional life and my young kid did not leave me with that much free time, but it would surely be enough to design a decent archival strategy. Hmm, not quite.

I had just read Vannevar Bush's As We May Think essay that describes a hypothetical memex machine. In his words: "A memex is a device in which an individual stores all his books, records, and communications, and which is mechanized so that it may be consulted with exceeding speed and flexibility." That sounded exactly like what I needed. But rather than a physical machine, I was envisioning something I could create: software. My memex project was born.

In summer 2021, a few years after the idea first crossed my mind, I backburnered my memex work. Its current incarnation mostly meets my needs and I use it to archive all my personal data. However, the time it took to reach this point is far beyond anything I could have imagined.

Several factors can explain the outcome of the memex project.

In the final stretch, when all that was left to do was to whip up enough code to get the mayonnaise to take, it started to get a bit discouraging. And, taking a step back, I could not help but notice that excessive focus on protecting what I had done, to the point of stalling the rest, had an undeniably macabre dimension!

To be done with the negative tone, there were a couple of wins.

Still, I am not fully satisfied with how things went. Side projects should not become something you have to complete; yet I feel like the memex did. I also stalled quite a bit on qbe, sometimes leaving great emails from contributors unanswered for months. I apologize if you recognize yourself! The team of people contributing to qbe is truly technically world class; I got really lucky with you all!

many requirements

Now, if you care for it, here is my original list of requirements:

Security
The memex would contain personal data, so I wanted it to be backed by solid cryptography. Doing so would let me share it with untrusted cloud storage providers.

First class history
Not only should the data be part of the memex, but also a trail of how it came to be there. My model here was source control: all edits would be recorded and form a continuous chain of snapshots. The last snapshot could, for example, be used to compute a delta with the local copy, and possibly spot unintended edits or deletions.

Decentralization
The memex as a logical artifact could be designed so that I can consult it and edit it from different locations: at home, at work, on the go. I simply have to use the computing medium at hand. Ideally though, not all machines should be granted access to the entirety of the memex archive. A basic access control mechanism would prevent accessing sensitive data on less trustworthy machines—such as the one provided by my employer.

Heterogeneous storage
While I was at it tiering data, I found it natural to consider tiered storage. Even though the memex would contain everything, not everything is equally precious. Some files, such as downloaded movies, can be lost harmlessly and do not need to be backed by expensive redundant storage.

It was clear to me from the start that their union was feature bloat and that I could not hope to design an ergonomic system that meets them all. In the end, I believe the memex fully meets "security" and "first class history". It meets the "decentralization" requirement only to some extent, because it is not yet easy to have multiple writing machines.

Without further ado, let us jump to the implementation.

secret content-addressed storage

WARNING
What follows is a graphic technical account of memex archives. The level of detail is such that, by reading it carefully, one can reconstruct a memex reader. That is on purpose: this page is going to end up on archive.org (or similar) and guarantee the perennity of memex archives. Nonetheless, the text should remain readable so, if you have your reasons, stick with me. Otherwise, that is probably as far as you want to go.

After struggling with different prototypes for a while, I converged on the idea that building some kind of encrypted git might be a good direction. The data would of course have to be encrypted, but archives should remain easy to synchronize.

With this plan in mind, in 2019, I singled out a first piece of infrastructure: sdar. sdar stands for secret deduplicated archive. It provides encrypted, content-addressed storage. The sdar tool can add data to the archive stash; each piece of data added to the stash is given an address computed from the data itself. Eventually, the stash can be consolidated into a segment of the archive. An archive is the sum of its segments. The stash is mere on-disk intermediate storage used to create segments. Obviously, the sdar tool can also retrieve data from an archive using addresses.

In the final scheme, each segment of an archive represents one update to the memex. The user modifies their local copy of the memex, consolidates all the local changes into a segment, then uploads the segment to some remote storage. This way, very little information is leaked to the backup hosting service: only update sizes and rate of change — pretty much the least amount of information that could leak. Synchronization with other versions of the archive could also be easy (assuming segment names do not conflict): just rsync the segment directory.

The cryptography is handled by nacl's cryptobox. sdar loads the archive key from ~/sdar.key (or some other user-supplied path) and, for each segment, it generates a new key pair. The segment's private key is used to encrypt the data for the public archive key. This way, it is possible to create a new segment without having access to the private part of the archive key. Put another way, it is possible to add to an archive without knowing the read key. The detailed layout of an sdar key file is:

name   | range     | content
-------+-----------+---------------------------
magic  |   0 -   7 | 20 2f 18 06 44 de 56 7a
salt   |   8 -  39 | scrypt salt (random)
b3key  |  40 -  71 | blake3 key (random)
pubkey |  72 - 103 | public archive key
cipher | 104 - 151 | (see below)

cipher = nacl_secretbox(
  prvkey | 0 - 31 | private archive key,
  scrypt(passphrase, salt, N=16384, r=8, p=1))

Key files are split in two parts: the first 104 bytes are clear while the last 48 are encrypted using nacl secretbox. This last part contains the private part of the archive key and is only used when reading. The secretbox function from nacl requires both a key and a nonce; sdar derives them from the user's passphrase with a single call to scrypt. scrypt produces 56 bytes: the first 24 bytes are used as the nonce while the last 32 form the key. The salt used by scrypt is found in the range [8 - 39] of the key file. When ingesting data, sdar uses a keyed blake3 sum to compute unique identifiers for blocks. From this point on, any time I refer to a blake3 sum, it will be a keyed sum using b3key as the key.
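To make the layout concrete, here is a rough Go sketch of loading a key file. It is not sdar's code: the names are mine and I assume the x/crypto scrypt and nacl/secretbox packages, but it follows the byte ranges above.

    package main

    import (
        "bytes"
        "errors"
        "fmt"
        "os"

        "golang.org/x/crypto/nacl/secretbox"
        "golang.org/x/crypto/scrypt"
    )

    var keyMagic = []byte{0x20, 0x2f, 0x18, 0x06, 0x44, 0xde, 0x56, 0x7a}

    type archiveKey struct {
        salt   [32]byte // scrypt salt
        b3key  [32]byte // blake3 key, used to address blocks
        pubkey [32]byte // public archive key, enough for writing
        prvkey [32]byte // private archive key, needed for reading
    }

    func readKeyFile(path, passphrase string) (*archiveKey, error) {
        raw, err := os.ReadFile(path)
        if err != nil {
            return nil, err
        }
        if len(raw) != 152 || !bytes.Equal(raw[0:8], keyMagic) {
            return nil, errors.New("not an sdar key file")
        }
        k := new(archiveKey)
        copy(k.salt[:], raw[8:40])
        copy(k.b3key[:], raw[40:72])
        copy(k.pubkey[:], raw[72:104])
        // A single scrypt call yields 56 bytes: 24 bytes of secretbox
        // nonce followed by 32 bytes of secretbox key.
        kdf, err := scrypt.Key([]byte(passphrase), k.salt[:], 16384, 8, 1, 56)
        if err != nil {
            return nil, err
        }
        var nonce [24]byte
        var boxkey [32]byte
        copy(nonce[:], kdf[:24])
        copy(boxkey[:], kdf[24:])
        // The last 48 bytes hold the private archive key: 32 plain bytes
        // plus 16 bytes of secretbox overhead.
        prv, ok := secretbox.Open(nil, raw[104:152], &nonce, &boxkey)
        if !ok {
            return nil, errors.New("wrong passphrase")
        }
        copy(k.prvkey[:], prv)
        return k, nil
    }

    func main() {
        // passphrase hardcoded for illustration only
        k, err := readKeyFile(os.Getenv("HOME")+"/sdar.key", "correct horse")
        if err != nil {
            fmt.Println(err)
            return
        }
        fmt.Printf("archive public key: %x\n", k.pubkey)
    }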

Archives are identified by their root directory and follow this simple structure.

arch/
  cache
  seg/
    4fb87932feff5763299abba007414164
    e911349455f055170f85c2bebedd4567
  stash/

The cache file is merely here to enable speedy reads. Like the stash's content, the cache file contains some unencrypted information and should not be shared with untrusted parties. Only segments can safely be shared. Segments have pseudo-random names that are 16 bytes expressed as hexadecimal. The names are in fact the first half of the public key stored in the corresponding segment. Having random segment names makes it easy to merge two archives: just copy over all the segments; if there is a name conflict, either you will die the same day from being sucked into a black hole, or the two archives shared some content.

When data is streamed into an archive, sdar splits it into blocks using a content-defined chunking (CDC) algorithm. Chunking serves two purposes: first, it limits the maximal amount of data that needs to reside in memory (blocks have a maximum size of 2MiB), and second, it allows efficient deduplication of large files that differ only in a few places. For example, you will benefit from chunking-based deduplication if you store VM images in an sdar archive. The blocks created by CDC are stored compressed and encrypted in segments.
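I will not detail sdar's exact chunker, but a generic gear-hash CDC loop looks roughly like the sketch below. Only the 2MiB maximum above and the 512KiB minimum mentioned further down come from the text; the gear table and cut mask are illustrative assumptions.

    package cdc

    import "math/rand"

    const (
        minBlock = 512 << 10     // minimum block size
        maxBlock = 2 << 20       // maximum block size
        cutMask  = (1 << 20) - 1 // illustrative: roughly 1MiB average chunks
    )

    // gear is a fixed table of 256 random values; any stable table works,
    // this one is only an example.
    var gear [256]uint64

    func init() {
        r := rand.New(rand.NewSource(0x5da2))
        for i := range gear {
            gear[i] = r.Uint64()
        }
    }

    // cutPoint returns the length of the next chunk taken from data: it
    // rolls a gear hash over the bytes and cuts when the masked hash is
    // zero, while respecting the minimum and maximum block sizes.
    func cutPoint(data []byte) int {
        if len(data) <= minBlock {
            return len(data)
        }
        n := len(data)
        if n > maxBlock {
            n = maxBlock
        }
        var h uint64
        for i := minBlock; i < n; i++ {
            h = h<<1 + gear[data[i]]
            if h&cutMask == 0 {
                return i + 1
            }
        }
        return n
    }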

Segments start with a header of 40 clear bytes. Then comes an encrypted metadata block, followed by the actual data blocks, and finally, an index.

Clear header
  name   | range     | content
  -------+-----------+---------------------------
  magic  |   0 -   7 | 4c 00 7b f8 62 aa 9a 4e
  pubkey |   8 -  39 | segment public key

Encrypted metadata
  name   | range     | content
  -------+-----------+---------------------------
  nitem  |   0 -   7 | number of items
  dlen   |   8 -  15 | byte size of the data blocks
  date   |  16 -  23 | creation unix time
  msg    |  24 - 151 | user message

Encrypted data blocks (dlen bytes)

Encrypted index (nitem items)

The segment file name is a hex encoding of the bytes [8 - 23]. The rest of the segment is a sequence of individually encrypted blocks. These blocks can all be decrypted with the private part of the archive key. The nonce used to encrypt a block is derived from a 64-bit signed integer N specific to that block. The first 8 bytes of the nonce are the big endian encoding of N and the 16 subsequent bytes are all set to zero.

The encrypted metadata derives its nonce from the integer -1. It contains the total size of the encrypted data blocks as well as the number of items in the index. Readers use this information to find the beginning of the index and subsequently read it. Additionally, the segment creation time as well as a user-defined message (not necessarily nul terminated) are part of the metadata block. These last two pieces of data are not necessary for reading the segment, but may be useful to users. All the integers are serialized in big endian format. It is noteworthy that without the decryption key it is impossible to know how much of the segment is the index and how much is the data. That is, from a segment file alone we cannot infer if the segment holds a single big item, or multiple small ones.
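As an illustration, here is a sketch of the nonce derivation and of opening the metadata block with nacl's box primitive. The helper names are mine; the 168-byte ciphertext length is simply the 152 plain bytes of the metadata table plus the 16 bytes of cipher overhead.

    package seg

    import (
        "encoding/binary"

        "golang.org/x/crypto/nacl/box"
    )

    // blockNonce builds the 24-byte nonce of the block identified by the
    // signed integer n: big endian two's complement of n in the first 8
    // bytes, the remaining 16 bytes left at zero.
    func blockNonce(n int64) *[24]byte {
        var nonce [24]byte
        binary.BigEndian.PutUint64(nonce[:8], uint64(n))
        return &nonce
    }

    type meta struct {
        nitem uint64 // number of index items
        dlen  uint64 // byte size of the data blocks
        date  uint64 // creation unix time
        msg   []byte // user message
    }

    // openMeta decrypts the metadata block that follows the 40-byte clear
    // header. It was encrypted for the public archive key with the
    // segment's private key, so the reader opens it with the private
    // archive key and the segment public key found in the clear header.
    func openMeta(segment []byte, archPrv *[32]byte) (*meta, bool) {
        var segPub [32]byte
        copy(segPub[:], segment[8:40])
        plain, ok := box.Open(nil, segment[40:40+152+16], blockNonce(-1), &segPub, archPrv)
        if !ok {
            return nil, false
        }
        return &meta{
            nitem: binary.BigEndian.Uint64(plain[0:8]),
            dlen:  binary.BigEndian.Uint64(plain[8:16]),
            date:  binary.BigEndian.Uint64(plain[16:24]),
            msg:   plain[24:152],
        }, true
    }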

After decoding the metadata, a reader will want to jump straight to the index. The index is essentially an array of nitem items. Chunks of 52428 items are grouped to form blocks — the last block may be smaller. The first index block is encrypted with nonce -2, the second with nonce -3, and so on. Items are 40 bytes long so that a single block fits in 2MiB of memory (the maximum block size). Items contain two fields:

name   | range     | content
-------+-----------+---------------------------
b3sum  |   0 -  31 | blake3 keyed sum
szcmp  |  32 -  39 | plain size & compression bit

The blake3 sum (keyed by b3key) is computed over the entire raw content of the block. The second field is the big endian encoding of 2S+C, where S is the byte size of the block payload and C is a bit indicating whether the block content was compressed or not. Importantly, if the block content was compressed, S is the size after compression. This is the most useful number when reading a segment: by adding the cipher overhead of 16 bytes to it, we find how long the encrypted data for this block is in the segment file.
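Decoding an index item is then only a matter of splitting the 2S+C word; a possible sketch, with my own names:

    package seg

    import "encoding/binary"

    type item struct {
        b3sum      [32]byte // keyed blake3 sum of the raw block content
        size       uint64   // S: byte size of the stored payload
        compressed bool     // C: whether the payload is lz4-compressed
    }

    // decodeItem reads one 40-byte index item as laid out above.
    func decodeItem(raw []byte) item {
        var it item
        copy(it.b3sum[:], raw[0:32])
        szcmp := binary.BigEndian.Uint64(raw[32:40])
        it.size = szcmp >> 1
        it.compressed = szcmp&1 == 1
        return it
    }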

Data blocks are not very interesting. They are just plain user data, possibly compressed using lz4 if that happens to decrease their size. They are of course encrypted and their nonce is derived from the offset at which they live in the data part of the segment. So, for example, the first block has an all zeroes nonce.

Stitching things together, let us see how to read a data block from a segment file using a blake3 sum as the query. First, decrypt the metadata block using the archive key and the segment public key (stored in the clear header). From the metadata, find the position of the index in the file. Read chunks of the index until the requested blake3 sum is found. While reading the index, add 16+S to an accumulator X for each entry that is not the desired one. The offset X gives us where the block is in the data part of the segment. For our entry of interest we also have S and C. X is used to derive the proper nonce, and decryption yields S bytes of payload. Depending on whether C is set or not, send the payload to lz4 for decompression. And we are done.
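The index scan of the previous paragraph could look like the sketch below, reusing the item type from the index-item sketch; findBlock returns the offset X of the block within the data part, which also serves as its nonce integer.

    // findBlock walks the decoded index and accumulates the encrypted
    // length (S plus 16 bytes of cipher overhead) of every block that
    // precedes the one whose keyed blake3 sum matches the query.
    func findBlock(index []item, want [32]byte) (uint64, item, bool) {
        var offset uint64
        for _, it := range index {
            if it.b3sum == want {
                return offset, it, true
            }
            offset += it.size + 16
        }
        return 0, item{}, false
    }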

I can barely believe you stuck with me this far! If you did, you might have noticed that I never explained how the chunks created by CDC are organized when a big value is ingested. When a single value spans multiple blocks, sdar creates a tree of blocks. Blocks at the leaves contain value chunks, and internal blocks contain the blake3 sums of blocks at the next level. Trees are limited to depth 2. For efficient seeking, internal blocks also store the size of the subtrees. The following diagram may help in understanding the scheme.


{ ------ internal nodes ------ } { - leaf nodes - }

                                       +-----+
                    +-------+       -- | 123 |
+-------+       --- | def   |     /    +-----+
| abc   |     /     +-------+    /     | A   |
+-------+    /      | 123 1 | --       +-----+
| def 3 | --        | 987 2 | -- 
| 012 1 | --        +-------+    \    +-----+
+-------+    \                     -- | 987 |
              \     +-------+         +-----+
                --- | 012   |         | BC  |
                    +-------+         +-----+
                    | 456 1 | -- 
                    +-------+    \    +-----+
                                   -- | 456 |
                                      +-----+
                                      | D   |
                                      +-----+

In the figure, blake3 sums are represented with three hex digits. Blocks are represented as boxes with a header showing the blake3 sum of their content. The topmost block abc is the root of a tree of depth 2 that stores the content ABCD split over four blocks. In the block abc, the entry def is bundled with 3, the total size of the tree with root block def. In sdar, internal blocks hold at most 52428 entries. The entries are composed of a blake3 sum of 32 bytes, followed by 8 bytes to represent the size of the corresponding subtree, again encoded in big endian. The entry count is chosen so that internal blocks do not exceed 2MiB. Trees being limited to depth 2 is not a problem in practice: the CDC algorithm will not produce blocks smaller than 512KiB, so the minimum value size required to overflow sdar is over 1PiB.

Finally, sdar addresses are simply the combination of a level in {0,1,2} and a blake3 sum. sdar prints addresses with the level first, then a hex representation of the blake3 sum. The level is used when reading to know the depth of the block tree to process. Continuing with the example tree from above, the topmost node would have address 2abc. If you experiment with sdar, you can take an address starting with 1 and change the level to 0 to observe the content of the root block.
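Putting the tree and the addresses together, reading back a whole value could look like this sketch, where readBlock stands for the index lookup and block decryption described earlier; again, all names are mine.

    package seg

    import "bytes"

    // An sdar address: a level in {0,1,2} plus a keyed blake3 sum.
    type addr struct {
        level int
        b3sum [32]byte
    }

    // readValue reassembles the value stored under an address. At level 0
    // the block is the value itself; at higher levels the block is a
    // sequence of 40-byte entries (keyed blake3 sum, then big endian
    // subtree size) pointing to the next level down.
    func readValue(a addr, readBlock func([32]byte) []byte, out *bytes.Buffer) {
        blk := readBlock(a.b3sum)
        if a.level == 0 {
            out.Write(blk)
            return
        }
        for i := 0; i+40 <= len(blk); i += 40 {
            var child addr
            child.level = a.level - 1
            copy(child.b3sum[:], blk[i:i+32])
            // blk[i+32:i+40] holds the subtree size and enables seeking
            // without reading everything; it is ignored in this sketch.
            readValue(child, readBlock, out)
        }
    }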

sdar is pretty fast overall. I chose the primitive algorithms (lz4, gear hash, blake3, ...) so that an honorable throughput can be achieved. The times reported below were obtained on an Intel NUC with an i5-7260U processor and a Samsung SSD on NVMe.

In the next section, I explain how the memex tool is built on top of sdar.

git clone

The memex tool is pretty much a git clone with a slightly poorer cli (can you believe this?!). It was straightforward to build on top of sdar, just a lot of boilerplate. It is a Go program that calls sdar as a subprocess and communicates with it using some pkt-line-like format.

Files are ingested as standalone sdar values, directories are represented by a simple binary format that lists the addresses of their content. And finally, full snapshots, or commits, hold the address of the root directory as well as some metadata.

Diffing, snapshot listing, fetching, and syncing are implemented naively by walking the data in the sdar archive.

Commit objects follow the format below. They start with a rather long magic sequence of 4 bytes so that they can be easily recovered from an sdar archive and its key, for example if the head commit's address has been lost.

type    | content
--------+----------------
byte[4] | 17 ee 7b a6
string  | message
uvarint | commit time
addr    | root directory object
addr    | previous commit

memex objects use an ad-hoc binary serialization scheme. Integers are stored as varints: the integer is split into K packs of seven bits such that the last pack is non-zero; the first K-1 packs are or-ed with 128 and output in order, then the last pack is output as is (i.e., without its msb set). This scheme has the advantage of using few bytes when serializing small integers. Strings are serialized with their size first output as a varint, then the string content. Addresses are serialized as sequences of 33 bytes: the first byte is constrained to be in {0,1,2} and represents the level, while the 32 others are the blake3 sum part of the address.
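As a sketch, serializing a commit object could look like the snippet below. I assume the uvarint described above is the usual least-significant-group-first encoding (the one in Go's encoding/binary); the helper names are mine.

    package memex

    import (
        "bytes"
        "encoding/binary"
    )

    // Addr is a serialized address: a level byte in {0,1,2} followed by
    // the 32-byte keyed blake3 sum.
    type Addr struct {
        Level byte
        B3sum [32]byte
    }

    func putUvarint(b *bytes.Buffer, v uint64) {
        var tmp [binary.MaxVarintLen64]byte
        b.Write(tmp[:binary.PutUvarint(tmp[:], v)])
    }

    func putString(b *bytes.Buffer, s string) {
        putUvarint(b, uint64(len(s)))
        b.WriteString(s)
    }

    func putAddr(b *bytes.Buffer, a Addr) {
        b.WriteByte(a.Level)
        b.Write(a.B3sum[:])
    }

    // encodeCommit serializes a commit object following the table above.
    func encodeCommit(msg string, time uint64, root, prev Addr) []byte {
        var b bytes.Buffer
        b.Write([]byte{0x17, 0xee, 0x7b, 0xa6}) // commit magic
        putString(&b, msg)
        putUvarint(&b, time)
        putAddr(&b, root)
        putAddr(&b, prev)
        return b.Bytes()
    }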

Directory objects follow this format.

type    | content
--------+----------------
byte[1] | 11
uvarint | number of entries following
Entry[] | entries

Entry
  type    | content
  --------+----------------
  byte[1] | kind in: reg(0), lnk(1), dir(2)
  data    | depends on the kind, see below

reg data (regular file)
  type    | content
  --------+----------------
  addr    | content address
  string  | name
  beint16 | mode (permissions)
  uvarint | mtime (unix time)
  uvarint | size
  beint64 | xxh checksum

lnk data (symbolic link)
  type    | content
  --------+----------------
  string  | name
  string  | destination

dir data (directory)
  type    | content
  --------+----------------
  addr    | directory object address
  string  | name
  beint16 | mode (permissions)
  uvarint | mtime (unix time)
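For completeness, here is a sketch of serializing one reg entry, reusing the helpers from the commit sketch; the names are again mine, only the byte layout comes from the tables above.

    // putRegEntry appends one reg(0) entry to a directory object being
    // built, reusing putAddr, putString and putUvarint from the commit
    // sketch.
    func putRegEntry(b *bytes.Buffer, content Addr, name string,
        mode uint16, mtime, size, xxh uint64) {
        b.WriteByte(0) // kind: reg
        putAddr(b, content)
        putString(b, name)
        var be [8]byte
        binary.BigEndian.PutUint16(be[:2], mode)
        b.Write(be[:2]) // beint16 mode
        putUvarint(b, mtime)
        putUvarint(b, size)
        binary.BigEndian.PutUint64(be[:], xxh)
        b.Write(be[:]) // beint64 xxh checksum
    }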

As you can see, the memex tool is not a proper backup tool: it cannot handle special files and skips a ton of filesystem metadata. I actually think that it may store too much metadata and that I should have omitted mtimes altogether since they tend to pollute diffs.

Despite being conceptually quite easy, programming the memex tool required about 3k lines of Go. There might have been a language better suited to this task.

code

You may be interested in getting a copy of the code by now, if only out of curiosity. I do not really plan to make it widely available since I have very little interest in tailoring it to others' needs. However, if you still think that you want it, ping me and I will send you a copy.

Cheers!