Merge branch 'master' into assistant

Conflicts:
	doc/design/assistant/syncing.mdwn

commit d6f65aed16
Joey Hess 2012-07-06 18:26:40 -06:00
8 changed files with 140 additions and 46 deletions

@@ -102,9 +102,13 @@ keyValueE size source = keyValue size source >>= maybe (return Nothing) addE
 }
 
 selectExtension :: FilePath -> String
-selectExtension = join "." . reverse . take 2 . takeWhile shortenough .
-	reverse . split "." . takeExtensions
+selectExtension f
+	| null es = ""
+	| otherwise = join "." ("":es)
 	where
+		es = filter (not . null) $ reverse $
+			take 2 $ takeWhile shortenough $
+			reverse $ split "." $ takeExtensions f
 		shortenough e
 			| '\n' `elem` e = False -- newline in extension?!
 			| otherwise = length e <= 4 -- long enough for "jpeg"
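Read on its own, the new logic keeps at most the last two short extensions of a filename. Here is a self-contained sketch of that behavior (a hypothetical reconstruction: git-annex's `split`/`join` come from `Data.String.Utils` in MissingH; `splitOn`/`intercalate` stand in here so the sketch needs only base and filepath):

```haskell
import Data.List (intercalate)
import System.FilePath (takeExtensions)

-- splitOn stands in for Data.String.Utils.split.
splitOn :: Char -> String -> [String]
splitOn c s = case break (== c) s of
    (a, "") -> [a]
    (a, _:rest) -> a : splitOn c rest

-- Keep at most the last two extensions, each at most 4 characters long.
selectExtension :: FilePath -> String
selectExtension f
    | null es = ""
    | otherwise = intercalate "." ("" : es)
  where
    es = filter (not . null) $ reverse $
        take 2 $ takeWhile shortenough $
        reverse $ splitOn '.' $ takeExtensions f
    shortenough e
        | '\n' `elem` e = False -- newline in extension?!
        | otherwise = length e <= 4 -- long enough for "jpeg"

-- selectExtension "archive.tar.gz" == ".tar.gz"
-- selectExtension "song.flac"      == ".flac"
-- selectExtension "README"         == ""
```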

@@ -0,0 +1,10 @@
[[!comment format=mdwn
username="https://www.google.com/accounts/o8/id?id=AItOawldKnauegZulM7X6JoHJs7Gd5PnDjcgx-E"
nickname="Matt"
subject="Source code"
date="2012-07-06T00:12:15Z"
content="""
Hi Joey,
Is the source code for git-annex assistant available somewhere?
"""]]

@@ -0,0 +1,7 @@
[[!comment format=mdwn
username="http://joeyh.name/"
subject="comment 2"
date="2012-07-06T00:21:43Z"
content="""
It's in the `assistant` branch of git://git-annex.branchable.com/
"""]]

@@ -0,0 +1,28 @@
My laptop's SSD died this morning. I had some work from yesterday
committed to the git repo on it, but not pushed as it didn't build.
Luckily I was able to get that off the SSD, which is now a read-only
drive -- even mounting it fails with fsck write errors.
Wish I'd realized the SSD was dying before the day before my trip to
Nicaragua..

Getting back to a useful laptop used most of my time and energy today.

I did manage to fix transfers to not block the rest of the assistant's
threads. The problem was that, without Haskell's threaded runtime, waiting
on something like an rsync command blocks all threads. To fix this,
transfers are now run in separate processes.

Also added code to allow multiple transfers to run at once. Each transfer
takes up a slot, with the number of free slots tracked by a `QSemN`.
This allows the transfer starting thread to block until a slot frees up,
and then run the transfer.

This needs to be extended to be aware of transfers initiated by remotes.
The transfer watcher thread should detect those starting and stopping
and update the `QSemN` accordingly. It would also be nice if transfers
initiated by remotes were delayed when there are no free slots for them
... but I have not thought of a good way to do that.

There's a bug somewhere in the new transfer code: when two transfers are
queued close together, the second one is lost and never happens.
I would debug this, but I'm spent for the day.
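The slot scheme could look roughly like this (a hypothetical sketch, not the assistant's actual code; names like `startTransfer` are made up):

```haskell
import Control.Concurrent
import Control.Concurrent.QSemN
import Control.Exception (finally)
import Control.Monad (void)
import System.Process (system)

-- Hypothetical sketch: each transfer runs as a separate process (so a
-- blocked rsync can't stall the runtime), and a QSemN caps concurrency.
startTransfer :: QSemN -> String -> IO ThreadId
startTransfer slots cmd = do
    waitQSemN slots 1 -- block until a transfer slot frees up
    forkIO $ void (system cmd) `finally` signalQSemN slots 1

main :: IO ()
main = do
    slots <- newQSemN 2 -- at most two transfers at once
    mapM_ (startTransfer slots)
        ["echo transfer 1", "echo transfer 2", "echo transfer 3"]
    threadDelay 500000 -- crude; real code would wait for completion
```

The `finally` matters: the slot is returned even if the transfer process fails, which is exactly the kind of accounting the transfer watcher thread would also need for remote-initiated transfers.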

@@ -3,27 +3,6 @@ all the other git clones, at both the git level and the key/value level.
 
 ## action items
 
-* on-disk transfers in progress information files (read/write/enumerate)
-  **done**
-* locking for the files, so redundant transfer races can be detected,
-  and failed transfers noticed **done**
-* transfer info for git-annex-shell **done**
-* update files as transfers proceed. See [[progressbars]]
-  (updating for downloads is easy; for uploads is hard)
-* add Transfer queue TChan **done**
-* add TransferInfo Map to DaemonStatus for tracking transfers in progress.
-  **done**
-* Poll transfer in progress info files for changes (use inotify again!
-  wow! hammer, meet nail..), and update the TransferInfo Map **done**
-* enqueue Transfers (Uploads) as new files are added to the annex by
-  Watcher. **done**
-* enqueue Transfers (Downloads) as new dangling symlinks are noticed by
-  Watcher. **done**
-* Write basic Transfer handling thread. Multiple such threads need to be
-  able to be run at once. Each will need its own independent copy of the
-  Annex state monad. **done**
-* Write transfer control thread, which decides when to launch transfers.
-  **done**
 * Check that download transfer triggering code works (when a symlink appears
   and the remote does *not* upload to us.
 * Investigate why transfers seem to block other git-annex assistant work.
@@ -35,31 +14,23 @@ all the other git clones, at both the git level and the key/value level.
 * git-annex needs a simple speed control knob, which can be plumbed
   through to, at least, rsync. A good job for an hour in an
   airport somewhere.
+* file transfer processes are not waited for, contain the zombies.
 
-## git syncing
+## longer-term TODO
 
-1. Can use `git annex sync`, which already handles bidirectional syncing.
-   When a change is committed, launch the part of `git annex sync` that pushes
-   out changes. **done**; changes are pushed out to all remotes in parallel
-1. Watch `.git/refs/remotes/` for changes (which would be pushed in from
-   another node via `git annex sync`), and run the part of `git annex sync`
-   that merges in received changes, and follow it by the part that pushes out
-   changes (sending them to any other remotes).
-   [The watching can be done with the existing inotify code! This avoids needing
-   any special mechanism to notify a remote that it's been synced to.]
-   **done**
-1. Periodically retry pushes that failed. **done** (every half an hour)
-1. Also, detect if a push failed due to not being up-to-date, pull,
-   and repush. **done**
-2. Use a git merge driver that adds both conflicting files,
-   so conflicts never break a sync. **done**
-3. Investigate the XMPP approach like dvcs-autosync does, or other ways of
+* Investigate the XMPP approach like dvcs-autosync does, or other ways of
   signaling a change out of band.
-4. Add a hook, so when there's a change to sync, a program can be run
+* Add a hook, so when there's a change to sync, a program can be run
   and do its own signaling.
-5. --debug will show often unnecessary work being done. Optimise.
+* --debug will show often unnecessary work being done. Optimise.
-6. It would be nice if, when a USB drive is connected,
+* It would be nice if, when a USB drive is connected,
   syncing starts automatically. Use dbus on Linux?
+  * This assumes the network is connected. It's often not, so the
+    [[cloud]] needs to be used to bridge between LANs.
+* Configurability, including only enabling git syncing but not data transfer;
+  only uploading new files but not downloading, and only downloading
+  files in some directories and not others. See for use cases:
+  [[forum/Wishlist:_options_for_syncing_meta-data_and_data]]
 
 ## misc todo
@@ -90,7 +61,42 @@ reachable remote. This is worth doing first, since it's the simplest way to
 get the basic functionality of the assistant to work. And we'll need this
 anyway.
 
-## other considerations
+## done
 
-This assumes the network is connected. It's often not, so the
-[[cloud]] needs to be used to bridge between LANs.
+1. Can use `git annex sync`, which already handles bidirectional syncing.
+   When a change is committed, launch the part of `git annex sync` that pushes
+   out changes. **done**; changes are pushed out to all remotes in parallel
+1. Watch `.git/refs/remotes/` for changes (which would be pushed in from
+   another node via `git annex sync`), and run the part of `git annex sync`
+   that merges in received changes, and follow it by the part that pushes out
+   changes (sending them to any other remotes).
+   [The watching can be done with the existing inotify code! This avoids needing
+   any special mechanism to notify a remote that it's been synced to.]
+   **done**
+1. Periodically retry pushes that failed. **done** (every half an hour)
+1. Also, detect if a push failed due to not being up-to-date, pull,
+   and repush. **done**
+2. Use a git merge driver that adds both conflicting files,
+   so conflicts never break a sync. **done**
+* on-disk transfers in progress information files (read/write/enumerate)
+  **done**
+* locking for the files, so redundant transfer races can be detected,
+  and failed transfers noticed **done**
+* transfer info for git-annex-shell **done**
+* update files as transfers proceed. See [[progressbars]]
+  (updating for downloads is easy; for uploads is hard)
+* add Transfer queue TChan **done**
+* add TransferInfo Map to DaemonStatus for tracking transfers in progress.
+  **done**
+* Poll transfer in progress info files for changes (use inotify again!
+  wow! hammer, meet nail..), and update the TransferInfo Map **done**
+* enqueue Transfers (Uploads) as new files are added to the annex by
+  Watcher. **done**
+* enqueue Transfers (Downloads) as new dangling symlinks are noticed by
+  Watcher. **done**
+* Write basic Transfer handling thread. Multiple such threads need to be
+  able to be run at once. Each will need its own independent copy of the
+  Annex state monad. **done**
+* Write transfer control thread, which decides when to launch transfers.
+  **done**
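One of the done items watches `.git/refs/remotes/` with the existing inotify code, so a push from another node triggers a merge. A minimal standalone sketch of that idea using hinotify (hypothetical; the real Watcher code differs, and newer hinotify versions take the watched path as a ByteString):

```haskell
import System.INotify
import Control.Concurrent (threadDelay)
import Control.Monad (forever)

-- Hypothetical sketch: when a `git annex sync` from another node pushes
-- a ref update into .git/refs/remotes/, trigger the merge-and-push step.
main :: IO ()
main = do
    i <- initINotify
    _ <- addWatch i [Create, MoveIn, CloseWrite] ".git/refs/remotes"
        (\_event -> mergeAndPush)
    forever $ threadDelay 1000000 -- keep the main thread alive
  where
    mergeAndPush :: IO ()
    mergeAndPush = putStrLn "would run the merge part of `git annex sync`"
```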

@@ -0,0 +1,12 @@
I have several remotes which are not always accessible. For example they can
be on hosts only accessible by LAN or on a portable hard drive which is not
plugged in. When running `sync`, these remotes are checked as well, leading to
unnecessary error messages and possibly git-annex waiting for a few minutes
on each remote for a timeout.
In this situation it would be useful to mark some remotes as offline
(`git annex offline <remotename>`), so that git-annex would not even attempt
to contact them. Then, I could configure my system to automatically, for example,
mark a portable hard disk remote online when plugging it in, and offline when
unplugging it, and similarly marking remotes offline and online depending on
whether I have an internet connection or a connection to a specific network.

@@ -0,0 +1,14 @@
[[!comment format=mdwn
username="http://joeyh.name/"
subject="comment 1"
date="2012-07-06T13:04:07Z"
content="""
You can already do this:

	git config remote.foo.annex-ignore true
There's no need to do anything for portable drives that are sometimes mounted and sometimes not -- git-annex will automatically avoid using repositories in directories that do not currently exist.
I thought git-annex also had a way to run a command and use its exit status to control whether a repo was
ignored or not, but it seems I never actually implemented that. It might be worth adding, although the command would necessarily run whenever git-annex is transferring data around.
"""]]

@@ -0,0 +1,13 @@
Since _transfer queueing_ and syncing of data work now in the assistant branch (been playing with it), there are times when I really don't want to sync the data; I would like to just sync meta-data and manually do a _get_ on files that I want, or selectively sync data in a subtree.
It would be nice for the syncing/watch feature to have the option of syncing only *meta-data* or *meta-data and data*; I think this sort of option was already planned? It would also be nice to be able to automatically sync data for only a subtree.
My use case is, I have a big stash of files somewhere at home or work, and I want to keep what I am actually using on my laptop and be able to selectively take just a subtree or a set of subtrees of files. I would not always want to suck down all the data, but still have the functionality to add files, push them upstream, and sync meta-data.
that is...
> * Site A: big master annex in a server room with lots of disk (or machines), watches a directory and syncs both data and meta-data; it should always try to pull data from all its child repos. That way I will always have a master copy of my data somewhere. It would be even nicer if I could have clones of the annex, where each annex is on a different machine configured to only sync a subtree of files, so I can distribute my annex across different systems and disks.
> * Site A: machine A: syncs Folder A
> * Site A: machine B: syncs Folder B
> * and so on with selectively syncing sites and directories
> * Laptop: has a clone of the annex, and watches a directory, syncs meta-data as usual and only uploads files to a remote (all or a designated one) but it never downloads files automatically or it should only occur inside a selected subtree.