Merge branch 'master' into assistant

commit abe5a73d3f

5 changed files with 122 additions and 60 deletions
37 doc/design/assistant/blog/day_43__simple_scanner.mdwn Normal file
@@ -0,0 +1,37 @@
Milestone: I can run `git annex assistant`, plug in a USB drive, and it
automatically transfers files to get the USB drive and current repo back in
sync.

I decided to implement the naive scan, to find files needing to be
transferred. So it walks through `git ls-files` and checks each file
in turn. I've deferred less expensive, more sophisticated approaches to later.
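As a rough sketch (the names here are illustrative stand-ins, not git-annex's actual internals), the naive scan amounts to walking the `git ls-files` output and running a per-file check on each entry:

[[!format haskell """
-- Hypothetical sketch of the naive scan; `checkFile` stands in for
-- whatever per-file check decides whether a transfer is needed.
import System.Process (readProcess)

naiveScan :: (FilePath -> IO ()) -> IO ()
naiveScan checkFile = do
        out <- readProcess "git" ["ls-files"] ""
        mapM_ checkFile (lines out)
"""]]

This is O(number of files in the working copy) per remote, which is why the blog calls it naive.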

I did some work on the TransferQueue, which now keeps track of the length
of the queue, and can block attempts to add Transfers to it if it gets too
long. This was a nice use of STM, which let me implement that without using
any locking.

[[!format haskell """
atomically $ do
        sz <- readTVar (queuesize q)
        if sz <= wantsz
                then enqueue schedule q t (stubInfo f remote)
                else retry -- blocks until queuesize changes
"""]]

Anyway, the point was that, as the scan finds Transfers to do,
it doesn't build up a really long TransferQueue, but instead is blocked
from running further until some of the files get transferred. The resulting
interleaving of the scan thread with transfer threads means that transfers
start fairly quickly upon a USB drive being plugged in, and kind of hides
the inefficiencies of the scanner, which will most of the time be
swamped out by the IO-bound large data transfers.

---

At this point, the assistant should do a good job of keeping repositories
in sync, as long as they're all interconnected, or on removable media
like USB drives. There's lots more work to be done to handle use cases
where repositories are not well-connected, but since the assistant's
[[syncing]] now covers at least a couple of use cases, I'm ready to move
on to the next phase. [[Webapp]], here we come!
@@ -5,14 +5,72 @@ all the other git clones, at both the git level and the key/value level.

* At startup, and possibly periodically, or when the network connection
  changes, or some heuristic suggests that a remote was disconnected from
  us for a while, queue remotes for processing by the TransferScanner.
* Ensure that when a remote receives content, and updates its location log,
  it syncs that update back out. Prerequisite for:
* After git sync, identify new content that we don't have that is now available
  on remotes, and transfer. (Needed when we have a uni-directional connection
  to a remote, so it won't be uploading content to us.) Note: Does not
  need to use the TransferScanner, if we get and check a list of the changed
  files.

## longer-term TODO

* Test MountWatcher on LXDE.
* git-annex needs a simple speed control knob, which can be plumbed
  through to, at least, rsync. A good job for an hour in an
  airport somewhere.
* Find a way to probe available outgoing bandwidth, to throttle so
  we don't bufferbloat the network to death.
* Investigate the XMPP approach like dvcs-autosync does, or other ways of
  signaling a change out of band.
* Add a hook, so when there's a change to sync, a program can be run
  and do its own signaling.
* --debug will show often unnecessary work being done. Optimise.
* This assumes the network is connected. It's often not, so the
  [[cloud]] needs to be used to bridge between LANs.
* Configurability, including only enabling git syncing but not data transfer;
  only uploading new files but not downloading, and only downloading
  files in some directories and not others. See for use cases:
  [[forum/Wishlist:_options_for_syncing_meta-data_and_data]]
* Speed up git syncing by using the cached ssh connection for it too.
  (Will need to use `GIT_SSH`, which needs to point to a command to run,
  not a shell command line.)
* Map the network of git repos, and use that map to calculate
  optimal transfers to keep the data in sync. Currently a naive flood fill
  is done instead.
* Find a more efficient way for the TransferScanner to find the transfers
  that need to be done to sync with a remote. Currently it walks the git
  working copy and checks each file.

## misc todo

* --debug will show often unnecessary work being done. Optimise.
* It would be nice if, when a USB drive is connected,
  syncing starts automatically. Use dbus on Linux?

## data syncing

There are two parts to data syncing. First, map the network and second,
decide what to sync when.

Mapping the network can reuse code in `git annex map`. Once the map is
built, we want to find paths through the network that reach all nodes
eventually, with the least cost. This is a minimum spanning tree problem,
except with a directed graph, so really an arborescence problem.

With the map, we can determine which nodes to push new content to. Then we
need to control those data transfers, sending to the cheapest nodes first,
and with appropriate rate limiting and control facilities.

This probably will need lots of refinements to get working well.

### first pass: flood syncing

Before mapping the network, the best we can do is flood all files out to every
reachable remote. This is worth doing first, since it's the simplest way to
get the basic functionality of the assistant to work. And we'll need this
anyway.
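The flood-fill first pass can be sketched in a few lines (purely illustrative; `Remote` and `upload` are hypothetical stand-ins, not git-annex's types): with no network map, every file is simply sent to every reachable remote.

[[!format haskell """
-- Hypothetical stand-in for a reachable remote.
type Remote = String

-- Stand-in for queueing an upload of one file to one remote.
upload :: Remote -> FilePath -> IO ()
upload r f = putStrLn ("queue upload of " ++ f ++ " to " ++ r)

-- Flood syncing: with no map of the network, push every file
-- to every reachable remote.
floodSync :: [Remote] -> [FilePath] -> IO ()
floodSync remotes files =
        mapM_ (\r -> mapM_ (upload r) files) remotes
"""]]

The later, map-based approach would replace the outer loop with a walk of the computed arborescence, so each node receives content from only its cheapest upstream peer.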

## TransferScanner

@@ -21,6 +79,8 @@ to a remote, or Downloaded from it.

How to find the keys to transfer? I'd like to avoid potentially
expensive traversals of the whole git working copy if I can.
(Currently, the TransferScanner does do the naive and possibly expensive
scan of the git working copy.)

One way would be to do a git diff between the (unmerged) git-annex branches
of the git repo, and its remote. Parse that for lines that add a key to
@@ -53,58 +113,6 @@ one. Probably worth handling the case where a remote is connected
while in the middle of such a scan, so part of the scan needs to be
redone to check it.

## done

1. Can use `git annex sync`, which already handles bidirectional syncing.
10 doc/forum/Fixing_up_corrupt_annexes.mdwn Normal file

@@ -0,0 +1,10 @@
I was wondering how one recovers from...

<pre>
(Recording state in git...)
error: invalid object 100644 8f154c946adc039af5240cc650a0a95c840e6fa6 for '041/5a4/SHA256-s6148--7ddcf853e4b16e77ab8c3c855c46867e6ed61c7089c334edf98bbdd3fb3a89ba.log'
fatal: git-write-tree: error building trees
git-annex: failed to read sha from git write-tree
</pre>

The above was caught when I ran a "git annex fsck --fast" to check my stash of files.
@@ -0,0 +1,7 @@
[[!comment format=mdwn
 username="http://joeyh.name/"
 subject="comment 1"
 date="2012-07-24T22:00:35Z"
 content="""
This is a corrupt git repository. See [[tips/what_to_do_when_a_repository_is_corrupted]].
"""]]
@@ -16,4 +16,4 @@ are present.

If you later found the drive, you could let git-annex know it's found
like so:

	git annex semitrust usbdrive
Joey Hess