split off a page

2013-12-02 13:24:47 -04:00 · bc786b6f06
commit bc786b6f06
parent 98453b83b2
2 changed files with 77 additions and 44 deletions
--- a/doc/design/assistant/syncing.mdwn
+++ b/doc/design/assistant/syncing.mdwn
@@ -1,28 +1,7 @@
 Once files are added (or removed or moved), need to send those changes to
 all the other git clones, at both the git level and the key/value level.

-## efficiency
-
-Currently after each file transfer (upload or download), a git sync is done
-to all remotes. This is rather a lot of work, also it prevents collecting
-presence changes to the git-annex branch into larger commits, which would
-save disk space over time.
-
-In many cases, this sync is necessary. For example, when a file is uploaded
-to a transfer remote, the location change needs to be synced out so that
-other clients know to grab it.
-
-Or, when downloading a file from a drive, the sync lets other locally
-paired repositories know we got it, so they can download it from us. 
-OTOH, this is also a case where a sync is sometimes unnecessary, since
-if we're going to upload the file to them after getting it, the sync
-only perhaps lets them start downloading it before our transfer queue
-reaches a point where we'd upload it.
-
-Do we need all the mapping stuff discussed below to know when we can avoid
-syncs?
-
-## TODO
+## misc TODO

 * Test MountWatcher on LXDE.
 * Add a hook, so when there's a change to sync, a program can be run
@@ -51,30 +30,11 @@ syncs?
  and fall back to some other method -- either storing deferred downloads
  on disk, or perhaps scheduling a TransferScanner run to get back into sync.

-## data syncing
+## More efficient syncing

-There are two parts to data syncing. First, map the network and second,
-decide what to sync when.
+See [[syncing/efficiency]]

-Mapping the network can reuse code in `git annex map`. Once the map is
-built, we want to find paths through the network that reach all nodes
-eventually, with the least cost. This is a minimum spanning tree problem,
-except with a directed graph, so really a Arborescence problem.
-
-With the map, we can determine which nodes to push new content to. Then we
-need to control those data transfers, sending to the cheapest nodes first,
-and with appropriate rate limiting and control facilities.
-
-This probably will need lots of refinements to get working well.
-
-### first pass: flood syncing
-
-Before mapping the network, the best we can do is flood all files out to every
-reachable remote. This is worth doing first, since it's the simplest way to
-get the basic functionality of the assistant to work. And we'll need this
-anyway.
-
-## TransferScanner
+## TransferScanner efficiency

 The TransferScanner thread needs to find keys that need to be Uploaded
 to a remote, or Downloaded from it.
--- a/doc/design/assistant/syncing/efficiency.mdwn
+++ b/doc/design/assistant/syncing/efficiency.mdwn
@@ -0,0 +1,73 @@
+Currently, the git-annex assistant syncs with remotes in a way that is
+dumb, and potentially inefficient:
+
+1. Files are transferred to each reachable remote whose
+   [[preferred_content]] setting indicates it wants the file.
+
+2. After each file transfer (upload or download), a git sync
+   is done to all the remotes, to update location log information.
+
+## unnecessary transfers
+
+There are network topologies where #1 is massively inefficient.
+For example:
+
+<pre>
+  laptopA-----laptopB-----laptopC
+      \         |             /
+       \---cloud based repo--/
+</pre>
+
+When laptopA has a new file, it will first send it to laptopB. It will then
+check if the cloud based transfer repository wants a copy. It will, because
+laptopC has not yet gotten a copy. So laptopA will proceed with a slow
+upload to the cloud, while meanwhile laptopB is sending the file over fast
+LAN to laptopC.
+
+(The more common case with no laptopC happens to work efficiently.
+So does the case where laptopA is paired with laptopC.)
+
+## unnecessary syncing
+
+Less importantly, the constant git syncing after each transfer is rather a
+lot of work, and prevents collecting multiple presence changes to the git-annex 
+branch into larger commits, which would save disk space over time.
+
+In many cases, this sync is necessary. For example, when a file is uploaded
+to a transfer remote, the location change needs to be synced out so that
+other clients know to grab it.
+
+Or, when downloading a file from a drive, the sync lets other locally
+paired repositories know we got it, so they can download it from us. 
+OTOH, this is also a case where a sync is sometimes unnecessary, since
+if we're going to upload the file to them after getting it, the sync
+only perhaps lets them start downloading it before our transfer queue
+reaches a point where we'd upload it.
+
+It would be good to find a way to detect when syncing is not immediately
+necessary, and defer it.
+
+## mapping
+
+Mapping the repository network has the potential to get git-annex the
+information it needs to avoid unnecessary transfers and/or unnecessary
+syncing.
+
+Mapping the network can reuse code in `git annex map`. Once the map is
+built, we want to find paths through the network that reach all nodes
+eventually, with the least cost. This is a minimum spanning tree problem,
+except with a directed graph, so really an arborescence problem.
+
+A significant problem in mapping is that nodes are mobile: they can move
+between networks over time. This breaks LAN based paths through the
+network. Mapping would need a way to detect this. Note that individual
+git-annex assistants can tell when they've switched networks by using the
+`networkConnectedNotifier`.
+
+## P2P signaling
+
+Another approach that might help with these problems is if git-annex
+repositories have a non-git out of band signaling mechanism. This could,
+for example, be used by laptopB to tell laptopA that it's trying to send 
+a file directly to laptopC. laptopA could then defer the upload to the
+cloud for a while.
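
The mapping idea in the new `efficiency.mdwn` page can be illustrated with a small sketch. It is not git-annex code and not part of this commit. It assumes a hypothetical `LINKS` table of made-up link costs for the laptopA/laptopB/laptopC/cloud topology, and it uses Dijkstra's cheapest-path search as a simpler stand-in for the minimum arborescence computation the page mentions. The function name `cheapest_routes` is likewise invented for illustration.

```python
import heapq

# Hypothetical link costs (lower is cheaper); the numbers are illustrative only.
# LAN links between laptops cost 1, links to the cloud repo cost 10.
LINKS = {
    "laptopA": {"laptopB": 1, "cloud": 10},
    "laptopB": {"laptopA": 1, "laptopC": 1, "cloud": 10},
    "laptopC": {"laptopB": 1, "cloud": 10},
    "cloud": {"laptopA": 10, "laptopB": 10, "laptopC": 10},
}


def cheapest_routes(links, source):
    """Return {node: (total_cost, previous_hop)} for each node reachable from source."""
    best = {source: (0, None)}
    queue = [(0, source)]
    while queue:
        cost, node = heapq.heappop(queue)
        if cost > best[node][0]:
            continue  # stale queue entry, a cheaper route was already found
        for neighbour, link_cost in links.get(node, {}).items():
            new_cost = cost + link_cost
            if neighbour not in best or new_cost < best[neighbour][0]:
                best[neighbour] = (new_cost, node)
                heapq.heappush(queue, (new_cost, neighbour))
    return best


routes = cheapest_routes(LINKS, "laptopA")
for node, (cost, via) in sorted(routes.items()):
    print(node, cost, via)
```

Under these assumed costs, laptopC is reached most cheaply through laptopB rather than through the cloud, which is the information laptopA would need in order to skip or defer the slow cloud upload. A real map would also have to cope with nodes moving between networks, as the page notes.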