diff --git a/doc/design/assistant/syncing.mdwn b/doc/design/assistant/syncing.mdwn
index 5b2a11aa61..df9a771b13 100644
--- a/doc/design/assistant/syncing.mdwn
+++ b/doc/design/assistant/syncing.mdwn
@@ -1,28 +1,7 @@
 Once files are added (or removed or moved), need to send those changes to
 all the other git clones, at both the git level and the key/value level.
 
-## efficiency
-
-Currently after each file transfer (upload or download), a git sync is done
-to all remotes. This is rather a lot of work, also it prevents collecting
-presence changes to the git-annex branch into larger commits, which would
-save disk space over time.
-
-In many cases, this sync is necessary. For example, when a file is uploaded
-to a transfer remote, the location change needs to be synced out so that
-other clients know to grab it.
-
-Or, when downloading a file from a drive, the sync lets other locally
-paired repositories know we got it, so they can download it from us.
-OTOH, this is also a case where a sync is sometimes unnecessary, since
-if we're going to upload the file to them after getting it, the sync
-only perhaps lets them start downloading it before our transfer queue
-reaches a point where we'd upload it.
-
-Do we need all the mapping stuff discussed below to know when we can avoid
-syncs?
-
-## TODO
+## misc TODO
 
 * Test MountWatcher on LXDE.
 * Add a hook, so when there's a change to sync, a program can be run
@@ -51,30 +30,11 @@ syncs?
 and fall back to some other method -- either storing deferred downloads
 on disk, or perhaps scheduling a TransferScanner run to get back into sync.
 
-## data syncing
+## More efficient syncing
 
-There are two parts to data syncing. First, map the network and second,
-decide what to sync when.
+See [[syncing/efficiency]]
 
-Mapping the network can reuse code in `git annex map`. Once the map is
-built, we want to find paths through the network that reach all nodes
-eventually, with the least cost. This is a minimum spanning tree problem,
-except with a directed graph, so really a Arborescence problem.
-
-With the map, we can determine which nodes to push new content to. Then we
-need to control those data transfers, sending to the cheapest nodes first,
-and with appropriate rate limiting and control facilities.
-
-This probably will need lots of refinements to get working well.
-
-### first pass: flood syncing
-
-Before mapping the network, the best we can do is flood all files out to every
-reachable remote. This is worth doing first, since it's the simplest way to
-get the basic functionality of the assistant to work. And we'll need this
-anyway.
-
-## TransferScanner
+## TransferScanner efficiency
 
 The TransferScanner thread needs to find keys that need to be Uploaded
 to a remote, or Downloaded from it.
diff --git a/doc/design/assistant/syncing/efficiency.mdwn b/doc/design/assistant/syncing/efficiency.mdwn
new file mode 100644
index 0000000000..7da721a2c4
--- /dev/null
+++ b/doc/design/assistant/syncing/efficiency.mdwn
@@ -0,0 +1,73 @@
+Currently, the git-annex assistant syncs with remotes in a way that is
+dumb, and potentially inefficient:
+
+1. Files are transferred to each reachable remote whose
+   [[preferred_content]] setting indicates it wants the file.
+
+2. After each file transfer (upload or download), a git sync
+   is done to all the remotes, to update location log information.
+
+## unnecessary transfers
+
+There are network topologies where #1 is massively inefficient.
+For example:
+
+
+  laptopA-----laptopB-----laptopC
+      \         |             /
+       \---cloud based repo--/
+
+
+When laptopA has a new file, it will first send it to laptopB. It will then
+check if the cloud based transfer repository wants a copy. It will, because
+laptopC has not yet gotten a copy. So laptopA will proceed with a slow
+upload to the cloud, while meanwhile laptopB is sending the file over fast
+LAN to laptopC.
+
+(The more common case with no laptopC happens to work efficiently.
+So does the case where laptopA is paired with laptopC.)
+
+## unnecessary syncing
+
+Less importantly, the constant git syncing after each transfer is rather a
+lot of work, and prevents collecting multiple presence changes to the git-annex
+branch into larger commits, which would save disk space over time.
+
+In many cases, this sync is necessary. For example, when a file is uploaded
+to a transfer remote, the location change needs to be synced out so that
+other clients know to grab it.
+
+Or, when downloading a file from a drive, the sync lets other locally
+paired repositories know we got it, so they can download it from us.
+OTOH, this is also a case where a sync is sometimes unnecessary, since
+if we're going to upload the file to them after getting it, the sync
+only perhaps lets them start downloading it before our transfer queue
+reaches a point where we'd upload it.
+
+It would be good to find a way to detect when syncing is not immediately
+necessary, and defer it.
+
+## mapping
+
+Mapping the repository network has the potential to get git-annex the
+information it needs to avoid unnecessary transfers and/or unnecessary
+syncing.
+
+Mapping the network can reuse code in `git annex map`. Once the map is
+built, we want to find paths through the network that reach all nodes
+eventually, with the least cost. This is a minimum spanning tree problem,
+except with a directed graph, so really an arborescence problem.
+
+A significant problem in mapping is that nodes are mobile; they can move
+between networks over time. This breaks LAN based paths through the
+network. Mapping would need a way to detect this. Note that individual
+git-annex assistants can tell when they've switched networks by using the
+`networkConnectedNotifier`.
+
+## P2P signaling
+
+Another approach that might help with these problems is if git-annex
+repositories have a non-git out of band signaling mechanism. This could,
+for example, be used by laptopB to tell laptopA that it's trying to send
+a file directly to laptopC. laptopA could then defer the upload to the
+cloud for a while.