Merge branch 'master' of ssh://git-annex.branchable.com

This commit is contained in:
Joey Hess 2012-08-24 12:17:44 -04:00
commit 13fa141cd3
5 changed files with 91 additions and 40 deletions

View file

@ -22,3 +22,9 @@ The original file also has sha512 ead9db1f34739014a216239d9624bce74d92fe723de065
>> And what sha512 does the file in .git/annex/bad have **now**? (fsck
>> preserves the original filename; this says nothing about what the
>> current checksum is, if the file has been corrupted). --[[Joey]]

The same, as it's the file I was trying to inject:
ead9db1f34739014a216239d9624bce74d92fe723de06505f9b94cb4c063142ba42b04546f11d3d33869b736e40ded2ff779cb32b26aa10482f09407df0f3c8d .git/annex/bad/SHA512E-s94402560--ead9db1f34739014a216239d9624bce74d92fe723de06505f9b94cb4c063142ba42b04546f11d3d33869b736e40ded2ff779cb32b26aa10482f09407df0f3c8d.Moon.avi
That's what puzzles me: it is the same file, but for some weird reason git-annex thinks it's not.
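
A minimal sketch (not git-annex code) of the check being discussed here: a SHA512E key name embeds the expected sha512 between the `--` separator and the extension, so it can be compared directly with whatever `sha512sum` now reports for the file sitting in `.git/annex/bad`.

```haskell
-- Sketch only: extract the digest a SHA512E key name promises and compare it
-- with the digest computed from the file on disk (e.g. sha512sum's output).
import Data.List (isPrefixOf)

-- Split at the first occurrence of a separator, dropping the separator.
breakOn :: String -> String -> (String, String)
breakOn _ [] = ([], [])
breakOn sep s@(c:cs)
  | sep `isPrefixOf` s = ([], drop (length sep) s)
  | otherwise          = let (a, b) = breakOn sep cs in (c : a, b)

-- e.g. "SHA512E-s94402560--<digest>.Moon.avi" -> "<digest>"
digestFromKey :: String -> String
digestFromKey = takeWhile (/= '.') . snd . breakOn "--"

-- True when the file's current digest matches the one the key name expects.
contentIntact :: String -> String -> Bool
contentIntact keyName currentDigest = digestFromKey keyName == currentDigest
```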

View file

@ -0,0 +1,26 @@
Implemented everything I planned out yesterday: Expensive scans are only
done once per remote (unless the remote changed while it was disconnected),
and failed transfers are logged so they can be retried later.

Changed the TransferScanner to prefer to scan low cost remotes first,
as a crude form of scheduling lower-cost transfers first.
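
The scheduling really is crude; a minimal sketch of the idea (with a stand-in `Remote` type and illustrative cost values, not git-annex's own) is just sorting the remotes to be scanned by their configured cost:

```haskell
-- Sketch, not the actual TransferScanner: scan cheaper remotes first so that
-- lower-cost transfers get queued ahead of expensive ones.
import Data.List (sortOn)

data Remote = Remote { remoteName :: String, remoteCost :: Int } deriving Show

-- Order remotes for scanning, cheapest first.
scanOrder :: [Remote] -> [Remote]
scanOrder = sortOn remoteCost

main :: IO ()
main = mapM_ (putStrLn . remoteName) $ scanOrder
  [ Remote "offsite-server" 250  -- cost values are illustrative only
  , Remote "usb-drive" 100
  , Remote "home-server" 200
  ]
```
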
A whole bunch of interesting syncing scenarios should work now. I have not
tested them all in detail, but to the best of my knowledge, all these
should work:

* Connect to the network. It starts syncing with a networked remote.
  Disconnect the network. Reconnect, and it resumes where it left off.
* Migrate between networks (ie, home to cafe to work). Any transfers
  that can only happen on one LAN are retried on each new network you
  visit, until they succeed.
One that is not working, but is soooo close:

* Plug in a removable drive. Some transfers start. Yank the plug.
  Plug it back in. All necessary transfers resume, and it ends up
  fully in sync, no matter how many times you yank that cable.

That's not working because of an infelicity in the MountWatcher.
It doesn't notice when the drive gets unmounted, so it ignores
the new mount event.
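
For illustration only (this is not the real MountWatcher; the event and handler names are made up), the fix amounts to also acting on unmount events, so that a replugged drive's mount point is no longer mistaken for one that is already known:

```haskell
-- Sketch: track the set of known mount points, and forget a mount point when
-- it is unmounted, so a later mount of the same path is treated as new.
import Data.IORef
import qualified Data.Set as S

type MountPoint = FilePath

data MountEvent = Mounted MountPoint | Unmounted MountPoint

handleEvent :: IORef (S.Set MountPoint) -> (MountPoint -> IO ()) -> MountEvent -> IO ()
handleEvent known startSync ev = case ev of
  Unmounted mp -> modifyIORef known (S.delete mp)  -- forget it; a replug counts as new
  Mounted mp -> do
    seen <- readIORef known
    if mp `S.member` seen
      then return ()                               -- already known, nothing to do
      else do
        modifyIORef known (S.insert mp)
        startSync mp                               -- kick off transfers for the drive
```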

View file

@ -0,0 +1,10 @@
[[!comment format=mdwn
username="https://www.google.com/accounts/o8/id?id=AItOawmBUR4O9mofxVbpb8JV9mEbVfIYv670uJo"
nickname="Justin"
subject="comment 1"
date="2012-08-23T21:25:48Z"
content="""
Do encrypted rsync remotes resume quickly as well?

One thing I noticed was that if a `copy --to` an encrypted rsync remote gets interrupted, it will remove the tmp file and re-encrypt the whole file before resuming rsync.
"""]]

View file

@ -3,42 +3,12 @@ all the other git clones, at both the git level and the key/value level.
## immediate action items

* Fix MountWatcher to notice umounts and remounts of drives.
* A remote may lose content it had before, so when requeuing
  a failed download, check the location log to see if the remote still has
  the content, and if not, queue a download from elsewhere, as sketched
  below after this list. (And, a remote
  may get content we were uploading from elsewhere, so check the location
  log when queuing a failed Upload too.)
* Ensure that when a remote receives content, and updates its location log,
  it syncs that update back out. Prerequisite for:
* After git sync, identify new content that we don't have that is now available
@ -67,18 +37,17 @@ all the other git clones, at both the git level and the key/value level.
  files in some directories and not others. See for use cases:
  [[forum/Wishlist:_options_for_syncing_meta-data_and_data]]
* speed up git syncing by using the cached ssh connection for it too.
  Will need to use `GIT_SSH`, which needs to point to a command to run,
  not a shell command line. Beware that the network connection may have
  bounced and the cached ssh connection not be usable.
* Map the network of git repos, and use that map to calculate
  optimal transfers to keep the data in sync. Currently a naive flood fill
  is done instead.
* Find a more efficient way for the TransferScanner to find the transfers
  that need to be done to sync with a remote. Currently it walks the git
  working copy and checks each file. That probably needs to be done once,
  but further calls to the TransferScanner could eg, look at the delta
  between the last scan and the current one in the git-annex branch.
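
A sketch of the requeue rule from the list above. The helpers here are made up stand-ins for git-annex internals: the location log lookup and the transfer queue are simply passed in as functions.

```haskell
type UUID = String
type Key = String

data Remote = Remote { remoteName :: String, remoteUUID :: UUID }

-- Before retrying a failed download from a remote, check the location log;
-- if that remote no longer has the content, queue a download from another
-- remote that does.
requeueDownload
  :: (Key -> IO [UUID])        -- look up which repository UUIDs have the key
  -> (Remote -> Key -> IO ())  -- queue a download from the given remote
  -> [Remote]                  -- all known remotes
  -> Remote                    -- the remote the failed download came from
  -> Key
  -> IO ()
requeueDownload locationLog queueDownload remotes failedFrom key = do
  locs <- locationLog key
  if remoteUUID failedFrom `elem` locs
    then queueDownload failedFrom key                   -- it still has it: retry
    else case filter ((`elem` locs) . remoteUUID) remotes of
           (r:_) -> queueDownload r key                 -- download from elsewhere
           []    -> return ()                           -- nobody has it right now
```
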
## data syncing
@ -196,3 +165,33 @@ redone to check it.
  drives are mounted. **done**
* It would be nice if, when a USB drive is connected,
  syncing starts automatically. Use dbus on Linux? **done**
* Optimisations in 5c3e14649ee7c404f86a1b82b648d896762cbbc2 temporarily
  broke content syncing in some situations, which need to be added back.
  **done**

  Now syncing a disconnected remote only starts a transfer scan if the
  remote's git-annex branch has diverged, which indicates it probably has
  new files. But that leaves open the cases where the local repo has
  new files; and where the two repos git branches are in sync, but the
  content transfers are lagging behind; and where the transfer scan has
  never been run.

  Need to track locally whether we're believed to be in sync with a remote.
  This includes:

  * All local content has been transferred to it successfully.
  * The remote has been scanned once for data to transfer from it, and all
    transfers initiated by that scan succeeded.

  Note the complication that, if it's initiated a transfer, our queued
  transfer will be thrown out as unnecessary. But if its transfer then
  fails, that needs to be noticed.

  If we're going to track failed transfers, we could just set a flag,
  and use that flag later to initiate a new transfer scan. We need a flag
  in any case, to ensure that a transfer scan is run for each new remote.
  The flag could be `.git/annex/transfer/scanned/uuid`.

  But, if failed transfers are tracked, we could also record them, in
  order to retry them later, without the scan. I'm thinking about a
  directory like `.git/annex/transfer/failed/{upload,download}/uuid/`,
  which failed transfer log files could be moved to.
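
A minimal sketch of the directory layout proposed above (the helper names are mine, not git-annex's): a failed transfer's log file is moved under the per-remote failed directory, so it can be retried later without a full scan.

```haskell
-- Sketch: record a failed transfer by moving its log file to
-- .git/annex/transfer/failed/{upload,download}/<uuid>/
import System.Directory (createDirectoryIfMissing, renameFile)
import System.FilePath ((</>), takeFileName)

data Direction = Upload | Download

-- Directory holding failed transfer logs for one remote and direction.
failedDir :: FilePath -> Direction -> String -> FilePath
failedDir gitDir direction uuid =
  gitDir </> "annex" </> "transfer" </> "failed" </> d </> uuid
  where d = case direction of
              Upload   -> "upload"
              Download -> "download"

-- Move the transfer log of a failed transfer into the retry directory.
recordFailedTransfer :: FilePath -> Direction -> String -> FilePath -> IO ()
recordFailedTransfer gitDir direction uuid transferLog = do
  let dest = failedDir gitDir direction uuid
  createDirectoryIfMissing True dest
  renameFile transferLog (dest </> takeFileName transferLog)
```

The `.git/annex/transfer/scanned/uuid` flag mentioned above could be as simple as an empty file, created once the first scan of that remote completes.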

View file

@ -0,0 +1,10 @@
[[!comment format=mdwn
username="https://www.google.com/accounts/o8/id?id=AItOawnY9ObrNrQuRp8Xs0XvdtJJssm5cp4NMZA"
nickname="alan"
subject="Rackspace Cloud Files support?"
date="2012-08-23T21:00:11Z"
content="""
Any chance I could bribe you to set up Rackspace Cloud Files support? We are using them and would hate to have an S3 bucket only for this.
https://github.com/rackspace/python-cloudfiles
"""]]