git-annex/doc/design/assistant/deltas.mdwn

Speed up syncing of modified versions of existing files.

One simple way is to find the key of the old version of a file that's
being transferred, so it can be used as the basis for rsync, or any
other similar transfer protocol.
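
One way that could look for a download, as a minimal sketch (the
`transferWithBasis` helper, paths, and rsync flags are illustrative, not
git-annex's actual transfer code; it assumes the old key's content is still
present locally):

    import System.Directory (copyFile)
    import System.Process (callProcess)

    -- Seed the destination with the old key's content, then let rsync's
    -- delta-transfer algorithm send only the blocks that changed.
    transferWithBasis
      :: FilePath  -- ^ locally present content of the old key (the basis)
      -> String    -- ^ rsync source for the new version on the remote
      -> FilePath  -- ^ temp file the new key is downloaded to
      -> IO ()
    transferWithBasis basis src tmp = do
      copyFile basis tmp
      -- --inplace updates tmp against its existing (basis) content
      callProcess "rsync" ["--inplace", "--partial", src, tmp]
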
For remotes that don't use rsync, use a rolling checksum based chunker,
such as BuzHash. This will produce [[chunks]], which can be stored on the
remote as regular Keys -- where, unlike the fixed-size chunk keys, the
SHA256 part of these keys is the checksum of the chunk they contain.
Once that's done, it's easy to avoid uploading chunks that have been sent
to the remote before.
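
A minimal sketch of such a chunker, using a Buzhash-style rolling checksum
over a sliding window (the window size, boundary mask, per-byte hash, and
all the function names here are illustrative choices, not settled
parameters):

    {-# LANGUAGE BangPatterns #-}

    import Crypto.Hash (SHA256 (..), hashWith)
    import Data.Bits (rotateL, shiftR, xor, (.&.))
    import qualified Data.ByteString as B
    import Data.Word (Word64, Word8)

    windowSize :: Int
    windowSize = 48

    -- Cut a chunk where the low 13 bits of the rolling hash are zero,
    -- giving an average chunk size of around 8 KiB.
    boundaryMask :: Word64
    boundaryMask = 0x1fff

    -- Random-looking value for each byte (a splitmix64-style mixer);
    -- a real Buzhash uses a fixed table of 256 random 64-bit constants.
    hashByte :: Word8 -> Word64
    hashByte b = z2 `xor` (z2 `shiftR` 31)
      where
        z0 = fromIntegral b + 0x9e3779b97f4a7c15
        z1 = (z0 `xor` (z0 `shiftR` 30)) * 0xbf58476d1ce4e5b9
        z2 = (z1 `xor` (z1 `shiftR` 27)) * 0x94d049bb133111eb

    -- Split input into content-defined chunks; identical data produces
    -- the same chunk boundaries even when it shifts position in a file.
    chunk :: B.ByteString -> [B.ByteString]
    chunk bs
      | B.null bs = []
      | otherwise = c : chunk rest
      where
        (c, rest) = B.splitAt (boundary bs) bs

    -- Index at which to cut the next chunk off the front of the input.
    boundary :: B.ByteString -> Int
    boundary bs = go windowSize (B.foldl' roll 0 (B.take windowSize bs))
      where
        len = B.length bs
        roll h byte = rotateL h 1 `xor` hashByte byte
        -- Slide the window one byte at a time: rotate, drop the byte
        -- leaving the window, add the byte entering it.
        go !i !h
          | i >= len = len
          | h .&. boundaryMask == 0 = i
          | otherwise = go (i + 1) $
              rotateL h 1
                `xor` rotateL (hashByte (B.index bs (i - windowSize))) windowSize
                `xor` hashByte (B.index bs i)

    -- Name a chunk as a regular key whose SHA256 part is the checksum of
    -- the chunk it contains (key syntax simplified).
    chunkKey :: B.ByteString -> String
    chunkKey c = "SHA256-s" ++ show (B.length c) ++ "--" ++ show (hashWith SHA256 c)
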
When retrieving a new version of a file, there would need to be a way to get
the list of chunk keys that constitute the new version. Probably best to
store this list on the remote. Then there needs to be a way to find which
of those chunks are available in locally present files, so that the locally
available chunks can be extracted, and combined with the chunks that need
to be downloaded, to reconstitute the file.
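
That reassembly step might look like this minimal sketch (`reconstitute`,
`ChunkKey`, and the Map of locally available chunks are illustrative; how
that map gets populated is what the two ideas below are about):

    import qualified Data.ByteString as B
    import qualified Data.Map.Strict as M

    type ChunkKey = String

    -- Rebuild the new version from its chunk key list, preferring chunks
    -- that can be extracted from locally present files and downloading
    -- only the ones that cannot.
    reconstitute
      :: M.Map ChunkKey B.ByteString    -- ^ chunks available from local files
      -> (ChunkKey -> IO B.ByteString)  -- ^ download one chunk from the remote
      -> [ChunkKey]                     -- ^ chunk keys making up the new version
      -> IO B.ByteString
    reconstitute local fetch keys = B.concat <$> mapM get keys
      where
        get k = maybe (fetch k) return (M.lookup k local)
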
To find which chunks are locally available, here are 2 ideas:

1. Use a single basis file, eg an old version of the file. Re-chunk it, and
   use its chunks. Slow, but simple. (A sketch of this follows the list.)
2. Some kind of database of locally available chunks. Would need to be kept
   up-to-date as files are added, and as files are downloaded.
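
A minimal sketch of idea 1, reusing `chunk` and `chunkKey` from the chunker
sketch above (`indexBasisFile` and the in-memory Map are illustrative; idea 2
would amount to persisting such a map and updating it as files are added and
downloaded):

    import qualified Data.ByteString as B
    import qualified Data.Map.Strict as M

    -- Re-chunk a single basis file (eg the old version of the file) and
    -- index its chunks by key, yielding the map of locally available
    -- chunks that the reconstitution sketch above consumes.
    indexBasisFile :: FilePath -> IO (M.Map String B.ByteString)
    indexBasisFile path = do
      content <- B.readFile path
      return $ M.fromList [ (chunkKey c, c) | c <- chunk content ]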