split off a page

2013-12-02 13:24:47 -04:00 · 2013-12-02 13:24:47 -04:00 · bc786b6f06
commit bc786b6f06
parent 98453b83b2
2 changed files with 77 additions and 44 deletions
--- a/doc/design/assistant/syncing.mdwn
+++ b/doc/design/assistant/syncing.mdwn
@@ -1,28 +1,7 @@
 Once files are added (or removed or moved), need to send those changes to
 all the other git clones, at both the git level and the key/value level.
-## efficiency
+## misc TODO
 Currently after each file transfer (upload or download), a git sync is done
 to all remotes. This is rather a lot of work, also it prevents collecting
 presence changes to the git-annex branch into larger commits, which would
 save disk space over time.
 In many cases, this sync is necessary. For example, when a file is uploaded
 to a transfer remote, the location change needs to be synced out so that
 other clients know to grab it.
 Or, when downloading a file from a drive, the sync lets other locally
 paired repositories know we got it, so they can download it from us. 
 OTOH, this is also a case where a sync is sometimes unnecessary, since
 if we're going to upload the file to them after getting it, the sync
 only perhaps lets them start downloading it before our transfer queue
 reaches a point where we'd upload it.
 Do we need all the mapping stuff discussed below to know when we can avoid
 syncs?
 ## TODO
 * Test MountWatcher on LXDE.
 * Add a hook, so when there's a change to sync, a program can be run
@@ -51,30 +30,11 @@ syncs?
  and fall back to some other method -- either storing deferred downloads
  on disk, or perhaps scheduling a TransferScanner run to get back into sync.
-## data syncing
+## More efficient syncing
-There are two parts to data syncing. First, map the network and second,
+See [[syncing/efficiency]]
 decide what to sync when.
-Mapping the network can reuse code in `git annex map`. Once the map is
+## TransferScanner efficiency
 built, we want to find paths through the network that reach all nodes
 eventually, with the least cost. This is a minimum spanning tree problem,
 except with a directed graph, so really an arborescence problem.
 With the map, we can determine which nodes to push new content to. Then we
 need to control those data transfers, sending to the cheapest nodes first,
 and with appropriate rate limiting and control facilities.
 This probably will need lots of refinements to get working well.
 ### first pass: flood syncing
 Before mapping the network, the best we can do is flood all files out to every
 reachable remote. This is worth doing first, since it's the simplest way to
 get the basic functionality of the assistant to work. And we'll need this
 anyway.
 ## TransferScanner
 The TransferScanner thread needs to find keys that need to be Uploaded
 to a remote, or Downloaded from it.
--- a/doc/design/assistant/syncing/efficiency.mdwn
+++ b/doc/design/assistant/syncing/efficiency.mdwn
@@ -0,0 +1,73 @@
 Currently, the git-annex assistant syncs with remotes in a way that is
 dumb, and potentially inefficient:
 1. Files are transferred to each reachable remote whose
   [[preferred_content]] setting indicates it wants the file.
 2. After each file transfer (upload or download), a git sync
   is done to all the remotes, to update location log information.
 ## unnecessary transfers
 There are network topologies where #1 is massively inefficient.
 For example:
 <pre>
  laptopA-----laptopB-----laptopC
      \         |             /
       \---cloud based repo--/
 </pre>
 When laptopA has a new file, it will first send it to laptopB. It will then
 check if the cloud based transfer repository wants a copy. It will, because
 laptopC has not yet gotten a copy. So laptopA will proceed with a slow
 upload to the cloud, while meanwhile laptopB is sending the file over fast
 LAN to laptopC.
 (The more common case with no laptopC happens to work efficiently.
 So does the case where laptopA is paired with laptopC.)
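
The scenario above can be replayed as a small simulation. This is only a sketch in Python, not git-annex code: `transfer_repo_wants` is a hypothetical stand-in for the cloud repository's [[preferred_content]] check, simplified to "wants the file while some client still lacks it".

```python
# Hypothetical model of the three-laptop topology; all names are illustrative.
CLIENTS = {"laptopA", "laptopB", "laptopC"}


def transfer_repo_wants(have_file):
    """Simplified stand-in for the cloud repo's preferred content check:
    it wants the file for as long as some client still lacks it."""
    return bool(CLIENTS - have_file)


have_file = {"laptopA"}  # laptopA has just added the file
transfers = []

# laptopA sends to its paired peer over the LAN first.
transfers.append(("laptopA", "laptopB", "LAN"))
have_file.add("laptopB")

# laptopA then checks the cloud repo; laptopC has no copy yet, so it wants one.
if transfer_repo_wants(have_file):
    transfers.append(("laptopA", "cloud", "WAN"))

# Meanwhile laptopB forwards the file to laptopC over the LAN.
transfers.append(("laptopB", "laptopC", "LAN"))
have_file.add("laptopC")

# Once laptopC has the file, the cloud repo no longer wants it at all.
upload_was_needed = transfer_repo_wants(have_file)
```

The slow WAN upload is queued even though, once laptopB finishes, no client needs the cloud copy.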
 ## unnecessary syncing
 Less importantly, the constant git syncing after each transfer is rather a
 lot of work, and prevents collecting multiple presence changes to the git-annex 
 branch into larger commits, which would save disk space over time.
 In many cases, this sync is necessary. For example, when a file is uploaded
 to a transfer remote, the location change needs to be synced out so that
 other clients know to grab it.
 Or, when downloading a file from a drive, the sync lets other locally
 paired repositories know we got it, so they can download it from us. 
 OTOH, this is also a case where a sync is sometimes unnecessary, since
 if we're going to upload the file to them after getting it, the sync
 only perhaps lets them start downloading it before our transfer queue
 reaches a point where we'd upload it.
 It would be good to find a way to detect when syncing is not immediately
 necessary, and defer it.
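
One possible shape for deferring syncs, sketched in Python under loud assumptions: the `SyncBatcher` class, its `urgent` flag and the `max_pending` threshold are all hypothetical, and nothing here is existing assistant code. Changes that other repositories need to see right away trigger a sync immediately; everything else waits and goes out in one batch.

```python
class SyncBatcher:
    """Collects location log changes and decides when to git sync.

    Hypothetical sketch: `urgent` marks changes other repositories must
    learn about right away (e.g. an upload to a transfer remote).
    """

    def __init__(self, max_pending=10):
        self.max_pending = max_pending
        self.pending = []  # changes recorded but not yet synced
        self.syncs = []    # each entry is one batch that was synced

    def record(self, key, remote, urgent):
        """Record a presence change; sync now only if it cannot wait."""
        self.pending.append((key, remote))
        if urgent or len(self.pending) >= self.max_pending:
            self.flush()

    def flush(self):
        """Stand-in for a single git sync covering all pending changes."""
        if self.pending:
            self.syncs.append(list(self.pending))
            self.pending.clear()


batcher = SyncBatcher(max_pending=10)
batcher.record("key1", "laptopB", urgent=False)  # deferred
batcher.record("key2", "laptopB", urgent=False)  # deferred
batcher.record("key3", "cloud", urgent=True)     # forces one sync for all three
```

Three transfers result in one sync instead of three. A real implementation would presumably also flush on a timer, so deferred changes are never held indefinitely.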
 ## mapping
 Mapping the repository network has the potential to get git-annex the
 information it needs to avoid unnecessary transfers and/or unnecessary
 syncing.
 Mapping the network can reuse code in `git annex map`. Once the map is
 built, we want to find paths through the network that reach all nodes
 eventually, with the least cost. This is a minimum spanning tree problem,
 except with a directed graph, so really an arborescence problem.
 A significant problem in mapping is that nodes are mobile: they can move
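
To make the arborescence idea concrete, here is a sketch in Python of the Chu-Liu/Edmonds algorithm, the standard method for finding a minimum-cost arborescence. It returns only the total cost, not the chosen edges. The node numbering and edge costs are invented for illustration and do not come from `git annex map`.

```python
def min_arborescence_cost(n, root, edges):
    """Chu-Liu/Edmonds: cost of the cheapest set of directed edges that
    reaches every node from `root`, or None if some node is unreachable.
    `edges` is a list of (source, target, cost) over nodes 0..n-1."""
    total = 0
    while True:
        # Pick the cheapest incoming edge for every node.
        in_w = [float("inf")] * n
        pre = [-1] * n
        for u, v, w in edges:
            if u != v and w < in_w[v]:
                in_w[v], pre[v] = w, u
        in_w[root] = 0
        if any(w == float("inf") for w in in_w):
            return None
        # Detect cycles among the chosen edges and label each as one component.
        count = 0
        comp = [-1] * n
        seen = [-1] * n
        for i in range(n):
            total += in_w[i]
            v = i
            while seen[v] != i and comp[v] == -1 and v != root:
                seen[v] = i
                v = pre[v]
            if v != root and comp[v] == -1:
                u = pre[v]
                while u != v:
                    comp[u] = count
                    u = pre[u]
                comp[v] = count
                count += 1
        if count == 0:
            return total  # no cycles: the chosen edges form the arborescence
        for i in range(n):
            if comp[i] == -1:
                comp[i] = count
                count += 1
        # Contract cycles and reduce costs by what has already been paid.
        edges = [(comp[u], comp[v], w - in_w[v])
                 for u, v, w in edges if comp[u] != comp[v]]
        root, n = comp[root], count


# Assumed costs: LAN hop = 1, upload to cloud = 10, download from cloud = 5.
A, B, C, CLOUD = 0, 1, 2, 3
links = [(A, B, 1), (B, A, 1), (B, C, 1), (C, B, 1),
         (A, CLOUD, 10), (B, CLOUD, 10), (C, CLOUD, 10),
         (CLOUD, A, 5), (CLOUD, B, 5), (CLOUD, C, 5)]
cost = min_arborescence_cost(4, A, links)
```

With these assumed costs, the cheapest way to reach every node from laptopA is two LAN hops plus a single upload to the cloud.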
 between networks over time. This breaks LAN based paths through the
 network. Mapping would need a way to detect this. Note that individual
 git-annex assistants can tell when they've switched networks by using the
 `networkConnectedNotifier`.
 ## P2P signaling
 Another approach that might help with these problems is if git-annex
 repositories have a non-git out of band signaling mechanism. This could,
 for example, be used by laptopB to tell laptopA that it's trying to send 
 a file directly to laptopC. laptopA could then defer the upload to the
 cloud for a while.
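
A rough sketch of how such hints could feed into the upload decision, in Python. Everything here is an assumption made for illustration: the `TransferHints` class, the shape of the announcement, and the five minute deferral window are not part of git-annex.

```python
class TransferHints:
    """Remembers out of band announcements like "someone is already
    sending this key to that repository" for a limited time."""

    def __init__(self, defer_seconds=300):
        self.defer_seconds = defer_seconds
        self.hints = {}  # (key, target) -> time at which the hint expires

    def announce(self, key, target, now):
        """Called when a peer signals it is sending `key` to `target`."""
        self.hints[(key, target)] = now + self.defer_seconds

    def should_defer_upload(self, key, still_missing, now):
        """Defer the cloud upload only while every repository that still
        lacks the key is covered by an unexpired hint."""
        return bool(still_missing) and all(
            self.hints.get((key, target), 0) > now for target in still_missing
        )


hints = TransferHints(defer_seconds=300)
# laptopB tells laptopA it is sending key1 directly to laptopC.
hints.announce("key1", "laptopC", now=0)
defer_early = hints.should_defer_upload("key1", {"laptopC"}, now=10)
defer_late = hints.should_defer_upload("key1", {"laptopC"}, now=400)
```

Letting hints expire means that if laptopB's direct transfer stalls, laptopA falls back to the cloud upload rather than waiting forever.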