Merge branch 'master' of ssh://git-annex.branchable.com

This commit is contained in:
Joey Hess 2012-08-24 12:17:44 -04:00
commit 13fa141cd3
5 changed files with 91 additions and 40 deletions

View file

@ -22,3 +22,9 @@ The original file also has sha512 ead9db1f34739014a216239d9624bce74d92fe723de065
>> And what sha512 does the file in .git/annex/bad have **now**? (fsck
>> preserves the original filename; this says nothing about what the
>> current checksum is, if the file has been corrupted). --[[Joey]]

The same, as it's the file I was trying to inject:
ead9db1f34739014a216239d9624bce74d92fe723de06505f9b94cb4c063142ba42b04546f11d3d33869b736e40ded2ff779cb32b26aa10482f09407df0f3c8d .git/annex/bad/SHA512E-s94402560--ead9db1f34739014a216239d9624bce74d92fe723de06505f9b94cb4c063142ba42b04546f11d3d33869b736e40ded2ff779cb32b26aa10482f09407df0f3c8d.Moon.avi
That's what puzzles me: it is the same file, but for some weird reason git-annex thinks it's not.
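
A minimal sketch (not git-annex code) of the check being discussed here: a SHA512E key name embeds the expected sha512 between the `--` separator and the extension, so it can be compared directly with whatever `sha512sum` now reports for the file sitting in `.git/annex/bad`.

```haskell
-- Sketch only: extract the digest a SHA512E key name promises and compare it
-- with the digest computed from the file on disk (e.g. sha512sum's output).
import Data.List (isPrefixOf)

-- Split at the first occurrence of a separator, dropping the separator.
breakOn :: String -> String -> (String, String)
breakOn _ [] = ([], [])
breakOn sep s@(c:cs)
  | sep `isPrefixOf` s = ([], drop (length sep) s)
  | otherwise          = let (a, b) = breakOn sep cs in (c : a, b)

-- e.g. "SHA512E-s94402560--<digest>.Moon.avi" -> "<digest>"
digestFromKey :: String -> String
digestFromKey = takeWhile (/= '.') . snd . breakOn "--"

-- True when the file's current digest matches the one the key name expects.
contentIntact :: String -> String -> Bool
contentIntact keyName currentDigest = digestFromKey keyName == currentDigest
```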

View file

@ -0,0 +1,26 @@
Implemented everything I planned out yesterday: Expensive scans are only
done once per remote (unless the remote changed while it was disconnected),
and failed transfers are logged so they can be retried later.

Changed the TransferScanner to prefer to scan low cost remotes first,
as a crude form of scheduling lower-cost transfers first.
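
The scheduling really is crude; a minimal sketch of the idea (with a stand-in `Remote` type and illustrative cost values, not git-annex's own) is just sorting the remotes to be scanned by their configured cost:

```haskell
-- Sketch, not the actual TransferScanner: scan cheaper remotes first so that
-- lower-cost transfers get queued ahead of expensive ones.
import Data.List (sortOn)

data Remote = Remote { remoteName :: String, remoteCost :: Int } deriving Show

-- Order remotes for scanning, cheapest first.
scanOrder :: [Remote] -> [Remote]
scanOrder = sortOn remoteCost

main :: IO ()
main = mapM_ (putStrLn . remoteName) $ scanOrder
  [ Remote "offsite-server" 250  -- cost values are illustrative only
  , Remote "usb-drive" 100
  , Remote "home-server" 200
  ]
```
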
A whole bunch of interesting syncing scenarios should work now. I have not
tested them all in detail, but to the best of my knowledge, all these
should work:

* Connect to the network. It starts syncing with a networked remote.
  Disconnect the network. Reconnect, and it resumes where it left off.
* Migrate between networks (ie, home to cafe to work). Any transfers
  that can only happen on one LAN are retried on each new network you
  visit, until they succeed.
One that is not working, but is soooo close:

* Plug in a removable drive. Some transfers start. Yank the plug.
  Plug it back in. All necessary transfers resume, and it ends up
  fully in sync, no matter how many times you yank that cable.

That's not working because of an infelicity in the MountWatcher.
It doesn't notice when the drive gets unmounted, so it ignores
the new mount event.
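
For illustration only (this is not the real MountWatcher; the event and handler names are made up), the fix amounts to also acting on unmount events, so that a replugged drive's mount point is no longer mistaken for one that is already known:

```haskell
-- Sketch: track the set of known mount points, and forget a mount point when
-- it is unmounted, so a later mount of the same path is treated as new.
import Data.IORef
import qualified Data.Set as S

type MountPoint = FilePath

data MountEvent = Mounted MountPoint | Unmounted MountPoint

handleEvent :: IORef (S.Set MountPoint) -> (MountPoint -> IO ()) -> MountEvent -> IO ()
handleEvent known startSync ev = case ev of
  Unmounted mp -> modifyIORef known (S.delete mp)  -- forget it; a replug counts as new
  Mounted mp -> do
    seen <- readIORef known
    if mp `S.member` seen
      then return ()                               -- already known, nothing to do
      else do
        modifyIORef known (S.insert mp)
        startSync mp                               -- kick off transfers for the drive
```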

View file

@ -0,0 +1,10 @@
[[!comment format=mdwn
username="https://www.google.com/accounts/o8/id?id=AItOawmBUR4O9mofxVbpb8JV9mEbVfIYv670uJo"
nickname="Justin"
subject="comment 1"
date="2012-08-23T21:25:48Z"
content="""
Do encrypted rsync remotes resume quickly as well?

One thing I noticed was that if a `copy --to` an encrypted rsync remote gets interrupted, it will remove the tmp file and re-encrypt the whole file before resuming rsync.
"""]]

View file

@ -3,42 +3,12 @@ all the other git clones, at both the git level and the key/value level.
## immediate action items

* Fix MountWatcher to notice umounts and remounts of drives.
* A remote may lose content it had before, so when requeuing
  a failed download, check the location log to see if the remote still has
  the content, and if not, queue a download from elsewhere, as sketched
  below after this list. (And, a remote
  may get content we were uploading from elsewhere, so check the location
  log when queuing a failed Upload too.)
* Ensure that when a remote receives content, and updates its location log,
  it syncs that update back out. Prerequisite for:
* After git sync, identify new content that we don't have that is now available
@ -67,18 +37,17 @@ all the other git clones, at both the git level and the key/value level.
  files in some directories and not others. See for use cases:
  [[forum/Wishlist:_options_for_syncing_meta-data_and_data]]
* speed up git syncing by using the cached ssh connection for it too.
  Will need to use `GIT_SSH`, which needs to point to a command to run,
  not a shell command line. Beware that the network connection may have
  bounced and the cached ssh connection not be usable.
* Map the network of git repos, and use that map to calculate
  optimal transfers to keep the data in sync. Currently a naive flood fill
  is done instead.
* Find a more efficient way for the TransferScanner to find the transfers
  that need to be done to sync with a remote. Currently it walks the git
  working copy and checks each file. That probably needs to be done once,
  but further calls to the TransferScanner could eg, look at the delta
  between the last scan and the current one in the git-annex branch.
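
A sketch of the requeue rule from the list above. The helpers here are made up stand-ins for git-annex internals: the location log lookup and the transfer queue are simply passed in as functions.

```haskell
type UUID = String
type Key = String

data Remote = Remote { remoteName :: String, remoteUUID :: UUID }

-- Before retrying a failed download from a remote, check the location log;
-- if that remote no longer has the content, queue a download from another
-- remote that does.
requeueDownload
  :: (Key -> IO [UUID])        -- look up which repository UUIDs have the key
  -> (Remote -> Key -> IO ())  -- queue a download from the given remote
  -> [Remote]                  -- all known remotes
  -> Remote                    -- the remote the failed download came from
  -> Key
  -> IO ()
requeueDownload locationLog queueDownload remotes failedFrom key = do
  locs <- locationLog key
  if remoteUUID failedFrom `elem` locs
    then queueDownload failedFrom key                   -- it still has it: retry
    else case filter ((`elem` locs) . remoteUUID) remotes of
           (r:_) -> queueDownload r key                 -- download from elsewhere
           []    -> return ()                           -- nobody has it right now
```
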
## data syncing
@ -196,3 +165,33 @@ redone to check it.
  drives are mounted. **done**
* It would be nice if, when a USB drive is connected,
  syncing starts automatically. Use dbus on Linux? **done**
* Optimisations in 5c3e14649ee7c404f86a1b82b648d896762cbbc2 temporarily
  broke content syncing in some situations, which need to be added back.
  **done**

  Now syncing a disconnected remote only starts a transfer scan if the
  remote's git-annex branch has diverged, which indicates it probably has
  new files. But that leaves open the cases where the local repo has
  new files; and where the two repos git branches are in sync, but the
  content transfers are lagging behind; and where the transfer scan has
  never been run.

  Need to track locally whether we're believed to be in sync with a remote.
  This includes:

  * All local content has been transferred to it successfully.
  * The remote has been scanned once for data to transfer from it, and all
    transfers initiated by that scan succeeded.

  Note the complication that, if it's initiated a transfer, our queued
  transfer will be thrown out as unnecessary. But if its transfer then
  fails, that needs to be noticed.

  If we're going to track failed transfers, we could just set a flag,
  and use that flag later to initiate a new transfer scan. We need a flag
  in any case, to ensure that a transfer scan is run for each new remote.
  The flag could be `.git/annex/transfer/scanned/uuid`.

  But, if failed transfers are tracked, we could also record them, in
  order to retry them later, without the scan. I'm thinking about a
  directory like `.git/annex/transfer/failed/{upload,download}/uuid/`,
  which failed transfer log files could be moved to.
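
A minimal sketch of the directory layout proposed above (the helper names are mine, not git-annex's): a failed transfer's log file is moved under the per-remote failed directory, so it can be retried later without a full scan.

```haskell
-- Sketch: record a failed transfer by moving its log file to
-- .git/annex/transfer/failed/{upload,download}/<uuid>/
import System.Directory (createDirectoryIfMissing, renameFile)
import System.FilePath ((</>), takeFileName)

data Direction = Upload | Download

-- Directory holding failed transfer logs for one remote and direction.
failedDir :: FilePath -> Direction -> String -> FilePath
failedDir gitDir direction uuid =
  gitDir </> "annex" </> "transfer" </> "failed" </> d </> uuid
  where d = case direction of
              Upload   -> "upload"
              Download -> "download"

-- Move the transfer log of a failed transfer into the retry directory.
recordFailedTransfer :: FilePath -> Direction -> String -> FilePath -> IO ()
recordFailedTransfer gitDir direction uuid transferLog = do
  let dest = failedDir gitDir direction uuid
  createDirectoryIfMissing True dest
  renameFile transferLog (dest </> takeFileName transferLog)
```

The `.git/annex/transfer/scanned/uuid` flag mentioned above could be as simple as an empty file, created once the first scan of that remote completes.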

View file

@ -0,0 +1,10 @@
[[!comment format=mdwn
username="https://www.google.com/accounts/o8/id?id=AItOawnY9ObrNrQuRp8Xs0XvdtJJssm5cp4NMZA"
nickname="alan"
subject="Rackspace Cloud Files support?"
date="2012-08-23T21:00:11Z"
content="""
Any chance I could bribe you to set up Rackspace Cloud Files support? We are using them and would hate to have an S3 bucket only for this.
https://github.com/rackspace/python-cloudfiles
"""]]