Added a comment: Borg vs. restic, some design considerations

2018-12-05 14:36:45 +00:00 · 2018-12-05 14:36:45 +00:00 · d810efe844
commit d810efe844
parent b30fb7fad7
1 changed files with 13 additions and 0 deletions
--- a/doc/todo/borg_special_remote/comment_4_75611482f67e5d52f50d52fdb8c68e8b._comment
+++ b/doc/todo/borg_special_remote/comment_4_75611482f67e5d52f50d52fdb8c68e8b._comment
@@ -0,0 +1,13 @@
+[[!comment format=mdwn
+ username="michael@ff03af62c7fd492c75066bda2fbf02370f5431f4"
+ nickname="michael"
+ avatar="http://cdn.libravatar.org/avatar/125bdfa8a2b91432c072615364bc3fa1"
+ subject="Borg vs. restic, some design considerations"
+ date="2018-12-05T14:36:45Z"
+ content="""
+While looking for a new, reliable, de-duplicating backup system, I read through the design documentation of [borg](https://borgbackup.readthedocs.io/en/stable/internals/data-structures.html#archives) and [restic](https://restic.readthedocs.io/en/latest/100_references.html#design). Although the design of restic seems much simpler and quite straightforward, I chose borg in the end because of its support for compression and its more efficient removal of single backups. In addition, borg's RAM usage [seems](https://blog.stickleback.dk/borg-or-restic/) to be lower.
+
+Here are some comments on how usable each would be as a git annex storage backend. Note that they are all based on my understanding of the design documents that describe how restic and borg store data. It is quite possible that I have misunderstood something, or that some parts are simply impossible due to implementation details. Furthermore, I am fairly sure that what I propose is not possible with the current external APIs of git annex and borg.
+
+For neither of them does it seem a good idea to store an individual archive (borg) or snapshot (restic) per file, since both assume that the list of archives/snapshots is reasonably small, can be presented to the user as a single list, and can be pruned by rules about how many to keep per timespan (though that applies per group of archives/snapshots). Borg stores the descriptions of all archives in a single item, the manifest, which means the whole list has to be rewritten whenever an archive is added. Restic stores each snapshot as a JSON document in a directory, which might scale better but is probably still not a good idea. I think that instead of storing individual files, git annex should store the whole set of exported files in a single archive/snapshot, i.e., store some kind of (virtual) directory structure in borg or restic that represents all items to be stored. Then, whenever git annex syncs with the borg/restic remote, a new archive/snapshot would be added. The user could use the time-based pruning rules to remove old snapshots. This would also integrate well with using the same borg/restic repository for other backups. It might seem that this makes retrieving a single file quite inefficient. However, both borg and restic split a file into a list of chunks and store information about where these chunks can be found. It should therefore be possible for a borg/restic special remote to store just this list of chunks for every annexed file. To get a single file, git annex would then only need to ask for these chunks. For restoring many files, particularly from a non-local restic repository, this might be very inefficient, though, as restic might need to download a lot of data just to get these chunks. In that case, fetching the whole latest archive/snapshot might be more efficient (as far as I understand, restic then downloads each pack of chunks only once and writes all of them directly to the files that need them).
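
To make the chunk-list idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: `ChunkStore`, `put` and `get` are hypothetical names and not part of the borg, restic or git annex APIs, and the fixed-size splitting only stands in for the content-defined chunking that both tools actually use.

```python
import hashlib

# Tiny fixed chunk size, for illustration only; borg and restic use
# content-defined chunking with much larger chunks.
CHUNK_SIZE = 4


def split_chunks(data, size=CHUNK_SIZE):
    # Split a byte string into consecutive pieces of at most `size` bytes.
    return [data[i:i + size] for i in range(0, len(data), size)]


class ChunkStore:
    # Hypothetical stand-in for the chunk storage of a borg/restic repository.
    def __init__(self):
        self.chunks = {}  # chunk id -> chunk bytes
        self.index = {}   # annex key -> ordered list of chunk ids

    def put(self, key, data):
        ids = []
        for chunk in split_chunks(data):
            chunk_id = hashlib.sha256(chunk).hexdigest()
            # Identical chunks are stored only once (de-duplication).
            self.chunks.setdefault(chunk_id, chunk)
            ids.append(chunk_id)
        self.index[key] = ids

    def get(self, key):
        # Retrieving one file only touches the chunks on its own list.
        return b"".join(self.chunks[c] for c in self.index[key])
```

With such a per-key list, fetching one annexed file would not require reading the archive/snapshot metadata at all, provided the remote can fetch chunks by id.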
Restic stores a separate object for every directory; it contains a list of subdirectories and files, and each file entry contains a list of chunks. To add or remove files from a snapshot in restic, git annex would just need to run the chunker on files not present in the previous snapshot and could reuse the stored chunk IDs for files that are already present. However, each snapshot would create a completely new directory object. Without subdirectories, this basically means the list of all files has to be rewritten for every snapshot. Subdirectories would help, but only if few subdirectories are modified. Due to the nature of hashing, which spreads new keys more or less evenly over the hash directories, this seems unlikely for a git annex special remote (although it does make backups of unchanged directories very efficient). Borg has no such directory structure; instead it stores the metadata of every file in one large stream. This stream is split into chunks of around 128 KiB, so only the parts where changes occurred need to be stored again. The list of these metadata chunks still needs to be stored, but it is much smaller. Again, everything needed to store a file could be generated without the actual source file if the chunk IDs are available. In fact, this is what borg does with its files cache, which stores for every file of the previous backup both properties such as size, timestamp and inode number (to detect modifications) and the list of chunks. If borg finds the same file again, it just uses the stored chunk list. If the git annex borg special remote could also keep the order of all previously present files the same, basically all metadata chunks would be reused. However, I don't know whether borg assumes any order on the files.
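
The files cache lookup described above could look roughly like the following sketch. It reflects my reading of the design document, not borg's actual code: `CacheEntry`, `chunk_ids_for` and the `read_file` callback are hypothetical names, and the 4-byte fixed chunking is again only a placeholder for the real chunker.

```python
import hashlib
from typing import List, NamedTuple


class CacheEntry(NamedTuple):
    # What is remembered per file from the previous backup.
    size: int
    mtime_ns: int
    inode: int
    chunk_ids: List[str]


def chunk_ids_for(path, size, mtime_ns, inode, cache, read_file):
    # Return the chunk list for `path`, reading the file only if it changed.
    entry = cache.get(path)
    if entry and (entry.size, entry.mtime_ns, entry.inode) == (size, mtime_ns, inode):
        # Looks unmodified: reuse the stored chunk list without reading data.
        return entry.chunk_ids
    data = read_file(path)
    # Placeholder chunker: fixed 4-byte pieces instead of content-defined chunks.
    ids = [hashlib.sha256(data[i:i + 4]).hexdigest() for i in range(0, len(data), 4)]
    cache[path] = CacheEntry(size, mtime_ns, inode, ids)
    return ids
```

A special remote that already knows the chunk list for a key would be in the same position as the cache-hit branch: it could emit the file's metadata without access to the file content.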
Note that borg needs to know which chunks an archive references, because it keeps reference counts for all chunks to determine whether a chunk is still needed; simply reusing the metadata chunks without reading their content is therefore definitely not possible. Restic has no such reference counts; it has to iterate over all trees to determine whether a chunk can be deleted (which [seems](https://blog.stickleback.dk/borg-or-restic/) to be terribly slow). Either way, both approaches to cleaning up chunks require that every chunk to be kept is referenced by some file contained in some archive/snapshot.
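
The difference between the two cleanup strategies can be sketched as follows. This is a strongly simplified model under my own assumptions, not the code of either tool: archives/snapshots are plain dictionaries mapping file names to chunk id lists, and all function names are made up for illustration.

```python
from collections import Counter


def build_refcounts(archives):
    # borg-like bookkeeping: count how often each chunk is referenced.
    counts = Counter()
    for files in archives.values():
        for chunk_ids in files.values():
            counts.update(chunk_ids)
    return counts


def delete_archive_refcount(archives, counts, name):
    # Only the deleted archive is read; chunks whose count drops to zero are freed.
    freed = set()
    for chunk_ids in archives.pop(name).values():
        for chunk_id in chunk_ids:
            counts[chunk_id] -= 1
            if counts[chunk_id] == 0:
                del counts[chunk_id]
                freed.add(chunk_id)
    return freed


def unreferenced_by_traversal(all_chunks, snapshots):
    # restic-like cleanup: walk every remaining snapshot to find live chunks.
    live = set()
    for files in snapshots.values():
        for chunk_ids in files.values():
            live.update(chunk_ids)
    return set(all_chunks) - live
```

In both variants a chunk survives only if some file in some remaining archive/snapshot lists it, which is why a special remote could not keep chunks alive with a private chunk list alone.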
+"""]]