Merge branch 'master' into assistant

Conflicts:
	doc/design/assistant/syncing.mdwn

commit d6f65aed16

8 changed files with 140 additions and 46 deletions
@@ -102,9 +102,13 @@ keyValueE size source = keyValue size source >>= maybe (return Nothing) addE

Before:

	selectExtension :: FilePath -> String
	selectExtension = join "." . reverse . take 2 . takeWhile shortenough .
		reverse . split "." . takeExtensions

After:

	selectExtension :: FilePath -> String
	selectExtension f
		| null es = ""
		| otherwise = join "." ("":es)
	  where
		es = filter (not . null) $ reverse $
			take 2 $ takeWhile shortenough $
			reverse $ split "." $ takeExtensions f
		shortenough e
			| '\n' `elem` e = False -- newline in extension?!
			| otherwise = length e <= 4 -- long enough for "jpeg"
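For reference, the new extension-selection logic can be tried standalone. This sketch uses only modules that ship with GHC; the real module uses `join` and `split` from the MissingH library, so the hand-rolled `splitDots` below is a stand-in for `split "."`:

```haskell
-- Standalone sketch of the new selectExtension behaviour.
-- splitDots is a stand-in for MissingH's split "."; everything else mirrors
-- the diff above.
import Data.List (intercalate)
import System.FilePath (takeExtensions)

splitDots :: String -> [String]
splitDots s = case break (== '.') s of
  (a, [])       -> [a]
  (a, _ : rest) -> a : splitDots rest

selectExtension :: FilePath -> String
selectExtension f
  | null es = ""
  | otherwise = intercalate "." ("" : es)
  where
    es = filter (not . null) $ reverse $
         take 2 $ takeWhile shortenough $
         reverse $ splitDots $ takeExtensions f
    shortenough e
      | '\n' `elem` e = False -- newline in extension?!
      | otherwise = length e <= 4 -- long enough for "jpeg"

main :: IO ()
main = mapM_ (print . selectExtension)
  ["archive.tar.gz", "photo.jpeg", "noext", "odd.superlongext.txt"]
```

Note how at most two short extension components are kept (`.tar.gz`), while an over-long component like `superlongext` is dropped, leaving just `.txt`.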
@@ -0,0 +1,10 @@
[[!comment format=mdwn
 username="https://www.google.com/accounts/o8/id?id=AItOawldKnauegZulM7X6JoHJs7Gd5PnDjcgx-E"
 nickname="Matt"
 subject="Source code"
 date="2012-07-06T00:12:15Z"
 content="""
Hi Joey,

Is the source code for git-annex assistant available somewhere?
"""]]

@@ -0,0 +1,7 @@
[[!comment format=mdwn
 username="http://joeyh.name/"
 subject="comment 2"
 date="2012-07-06T00:21:43Z"
 content="""
It's in the `assistant` branch of git://git-annex.branchable.com/
"""]]
28	doc/design/assistant/blog/day_26__dying_drives.mdwn	Normal file

@@ -0,0 +1,28 @@
My laptop's SSD died this morning. I had some work from yesterday
committed to the git repo on it, but not pushed as it didn't build.
Luckily I was able to get that off the SSD, which is now a read-only
drive -- even mounting it fails with fsck write errors.

Wish I'd realized the SSD was dying before the day before my trip to
Nicaragua.. Getting back to a useful laptop used most of my time and
energy today.

I did manage to fix transfers to not block the rest of the assistant's
threads. The problem was that, without Haskell's threaded runtime, waiting
on something like an rsync command blocks all threads. To fix this,
transfers are now run in separate processes.

Also added code to allow multiple transfers to run at once. Each transfer
takes up a slot, with the number of free slots tracked by a `QSemN`.
This allows the transfer starting thread to block until a slot frees up,
and then run the transfer.

This needs to be extended to be aware of transfers initiated by remotes.
The transfer watcher thread should detect those starting and stopping
and update the `QSemN` accordingly. It would also be nice if transfers
initiated by remotes were delayed when there are no free slots for them
... but I have not thought of a good way to do that.

There's a bug somewhere in the new transfer code: when two transfers are
queued close together, the second one is lost and doesn't happen.
I would debug this, but I'm spent for the day.
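The slot scheme described above can be sketched with `QSemN` from base. The names (`maxSlots`, the `threadDelay` stand-in for a transfer process) are illustrative, not the assistant's actual code:

```haskell
-- Sketch of slot-limited concurrent transfers using a QSemN,
-- assuming 2 slots and 5 pending transfers (illustrative numbers).
import Control.Concurrent (forkIO, threadDelay)
import Control.Concurrent.QSemN
import Control.Exception (bracket_)
import Control.Monad (forM_)

maxSlots :: Int
maxSlots = 2

-- Block until a transfer slot frees up, run the action, then release the slot.
inSlot :: QSemN -> IO a -> IO a
inSlot slots = bracket_ (waitQSemN slots 1) (signalQSemN slots 1)

main :: IO ()
main = do
  slots <- newQSemN maxSlots
  finished <- newQSemN 0
  forM_ [1 .. 5 :: Int] $ \_ ->
    forkIO $ do
      inSlot slots (threadDelay 10000) -- stand-in for running a transfer
      signalQSemN finished 1
  waitQSemN finished 5 -- block until all 5 transfers are done
  putStrLn "all transfers finished"
```

`bracket_` guarantees the slot is released even if a transfer throws, which is the same reason the assistant must notice failed transfers and free their slots.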

@@ -3,27 +3,6 @@ all the other git clones, at both the git level and the key/value level.

## action items

* on-disk transfers in progress information files (read/write/enumerate)
  **done**
* locking for the files, so redundant transfer races can be detected,
  and failed transfers noticed **done**
* transfer info for git-annex-shell **done**
* update files as transfers proceed. See [[progressbars]]
  (updating for downloads is easy; for uploads is hard)
* add Transfer queue TChan **done**
* add TransferInfo Map to DaemonStatus for tracking transfers in progress.
  **done**
* Poll transfer in progress info files for changes (use inotify again!
  wow! hammer, meet nail..), and update the TransferInfo Map **done**
* enqueue Transfers (Uploads) as new files are added to the annex by
  Watcher. **done**
* enqueue Transfers (Downloads) as new dangling symlinks are noticed by
  Watcher. **done**
* Write basic Transfer handling thread. Multiple such threads need to be
  able to be run at once. Each will need its own independent copy of the
  Annex state monad. **done**
* Write transfer control thread, which decides when to launch transfers.
  **done**
* Check that download transfer triggering code works (when a symlink appears
  and the remote does *not* upload to us).
* Investigate why transfers seem to block other git-annex assistant work.
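The Transfer queue items above (a `TChan` of queued transfers drained by transfer handling threads) can be sketched like this. The `Transfer` type, its constructors, and the fixed count of three are illustrative only, not the assistant's real types:

```haskell
-- Sketch of a TChan-based transfer queue drained by a handler thread.
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar
import Control.Concurrent.STM
import Control.Monad (forM_, replicateM_)

data Transfer = Upload FilePath | Download FilePath
  deriving Show

main :: IO ()
main = do
  queue <- newTChanIO
  finished <- newEmptyMVar
  -- transfer handling thread: drains the queue in FIFO order
  _ <- forkIO $ do
    replicateM_ 3 $ do
      t <- atomically (readTChan queue)
      putStrLn ("running " ++ show t)
    putMVar finished ()
  -- the Watcher would enqueue these as it notices new files / dangling symlinks
  forM_ [Upload "new-file", Download "dangling-link", Upload "another"] $
    atomically . writeTChan queue
  takeMVar finished
```

Several handler threads can read the same `TChan` safely, which is why the design calls for each to carry its own copy of the Annex state monad.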

@@ -35,31 +14,23 @@ all the other git clones, at both the git level and the key/value level.

* git-annex needs a simple speed control knob, which can be plumbed
  through to, at least, rsync. A good job for an hour in an
  airport somewhere.
* file transfer processes are not waited for; contain the zombies.

## longer-term TODO

* Investigate the XMPP approach like dvcs-autosync does, or other ways of
  signaling a change out of band.
* Add a hook, so when there's a change to sync, a program can be run
  and do its own signaling.
* --debug will show often unnecessary work being done. Optimise.
* It would be nice if, when a USB drive is connected,
  syncing starts automatically. Use dbus on Linux?
* This assumes the network is connected. It's often not, so the
  [[cloud]] needs to be used to bridge between LANs.
* Configurability, including only enabling git syncing but not data transfer;
  only uploading new files but not downloading, and only downloading
  files in some directories and not others. See for use cases:
  [[forum/Wishlist:_options_for_syncing_meta-data_and_data]]

## misc todo

@ -90,7 +61,42 @@ reachable remote. This is worth doing first, since it's the simplest way to
|
|||
get the basic functionality of the assistant to work. And we'll need this
|
||||
anyway.
|
||||
|
||||
## other considerations
|
||||
## done
|
||||
|
||||
This assumes the network is connected. It's often not, so the
|
||||
[[cloud]] needs to be used to bridge between LANs.
|
||||
1. Can use `git annex sync`, which already handles bidirectional syncing.
|
||||
When a change is committed, launch the part of `git annex sync` that pushes
|
||||
out changes. **done**; changes are pushed out to all remotes in parallel
|
||||
1. Watch `.git/refs/remotes/` for changes (which would be pushed in from
|
||||
another node via `git annex sync`), and run the part of `git annex sync`
|
||||
that merges in received changes, and follow it by the part that pushes out
|
||||
changes (sending them to any other remotes).
|
||||
[The watching can be done with the existing inotify code! This avoids needing
|
||||
any special mechanism to notify a remote that it's been synced to.]
|
||||
**done**
|
||||
1. Periodically retry pushes that failed. **done** (every half an hour)
|
||||
1. Also, detect if a push failed due to not being up-to-date, pull,
|
||||
and repush. **done**
|
||||
2. Use a git merge driver that adds both conflicting files,
|
||||
so conflicts never break a sync. **done**
|
||||
|
||||
* on-disk transfers in progress information files (read/write/enumerate)
|
||||
**done**
|
||||
* locking for the files, so redundant transfer races can be detected,
|
||||
and failed transfers noticed **done**
|
||||
* transfer info for git-annex-shell **done**
|
||||
* update files as transfers proceed. See [[progressbars]]
|
||||
(updating for downloads is easy; for uploads is hard)
|
||||
* add Transfer queue TChan **done**
|
||||
* add TransferInfo Map to DaemonStatus for tracking transfers in progress.
|
||||
**done**
|
||||
* Poll transfer in progress info files for changes (use inotify again!
|
||||
wow! hammer, meet nail..), and update the TransferInfo Map **done**
|
||||
* enqueue Transfers (Uploads) as new files are added to the annex by
|
||||
Watcher. **done**
|
||||
* enqueue Tranferrs (Downloads) as new dangling symlinks are noticed by
|
||||
Watcher. **done**
|
||||
* Write basic Transfer handling thread. Multiple such threads need to be
|
||||
able to be run at once. Each will need its own independant copy of the
|
||||
Annex state monad. **done**
|
||||
* Write transfer control thread, which decides when to launch transfers.
|
||||
**done**
|
||||
|
|
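The merge-driver item above (add both conflicting versions so a conflict never breaks a sync) is custom code in git-annex; git's built-in `union` merge driver gives a feel for the idea on line-oriented files. This demo is an approximation only, not git-annex's actual driver:

```shell
# Approximation: git's built-in union merge driver keeps lines from both
# sides instead of conflicting. git-annex implements its own merge driver.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email demo@example.com
git config user.name demo
echo '*.log merge=union' > .gitattributes
echo base-line > state.log
git add -A
git commit -qm base
main=$(git rev-parse --abbrev-ref HEAD)
git checkout -qb side
echo side-line >> state.log
git commit -qam side
git checkout -q "$main"
echo main-line >> state.log
git commit -qam main
git merge -q -m merged side   # no conflict: union keeps both lines
grep -q side-line state.log && grep -q main-line state.log && echo merged-cleanly
```

With the `union` driver declared in `.gitattributes`, the divergent commits merge cleanly and `state.log` ends up containing both added lines.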
12	doc/forum/Wishlist:_mark_remotes_offline.mdwn	Normal file

@@ -0,0 +1,12 @@
I have several remotes which are not always accessible. For example, they can
be on hosts only accessible by LAN, or on a portable hard drive which is not
plugged in. When running sync, these remotes are checked as well, leading to
unnecessary error messages and possibly git-annex waiting a few minutes
on each remote for a timeout.

In this situation it would be useful to mark some remotes as offline
(`git annex offline <remotename>`), so that git-annex would not even attempt
to contact them. Then I could configure my system to automatically, for example,
mark a portable hard disk remote online when plugging it in and offline when
unplugging it, and similarly mark remotes offline and online depending on
whether I have an internet connection or a connection to a specific network.
@@ -0,0 +1,14 @@
[[!comment format=mdwn
 username="http://joeyh.name/"
 subject="comment 1"
 date="2012-07-06T13:04:07Z"
 content="""
You can already do this:

	git config remote.foo.annex-ignore true

There's no need to do anything for portable drives that are sometimes mounted
and sometimes not -- git-annex will automatically avoid using repositories in
directories that do not currently exist.

I thought git-annex also had a way to run a command and use its exit status to
control whether a repo was ignored or not, but it seems I never actually
implemented that. It might be worth adding, although the command would
necessarily run whenever git-annex is transferring data around.
"""]]

@@ -0,0 +1,13 @@
Since _transfer queueing_ and syncing of data now work in the assistant branch
(I've been playing with it), there are times when I really don't want to sync
the data; I would like to just sync meta-data and manually do a _get_ on files
that I want, or selectively sync data in a subtree.

It would be nice for the syncing/watch feature to have the option of syncing
only *meta-data* or *meta-data and data*; I think this sort of option was
already planned? It would also be nice to be able to automatically sync data
for only a subtree.

My use case is: I have a big stash of files somewhere at home or work, and I
want to keep what I am actually using on my laptop, and be able to selectively
take a subtree or a set of subtrees of files. I would not always want to suck
down all the data, but still have the functionality to add files, push them
upstream, and sync meta-data.

That is...

> * Site A: big master annex in a server room with lots of disk (or machines);
>   watches a directory and syncs both data and meta-data; it should always try
>   to pull data from all its child repos. That way I will always have a master
>   copy of my data somewhere. It would be even nicer if I could have clones of
>   the annex, where each annex is on a different machine which is configured
>   to only sync a subtree of files, so I can distribute my annex across
>   different systems and disks.
> * Site A: machine A: syncs Folder A
> * Site A: machine B: syncs Folder B
> * and so on, with selectively syncing sites and directories
> * Laptop: has a clone of the annex, and watches a directory; syncs meta-data
>   as usual and only uploads files to a remote (all or a designated one), but
>   it never downloads files automatically, or only does so inside a selected
>   subtree.