Merge branch 'master' into assistant

Conflicts:
	debian/changelog

Updated changelog for assistant and webapp
Joey Hess 2012-08-27 13:31:54 -04:00
commit b12db9ef92
17 changed files with 284 additions and 11 deletions

debian/changelog

@@ -1,13 +1,22 @@
git-annex (3.20120808) UNRELEASED; urgency=low
git-annex (3.20120826) UNRELEASED; urgency=low
* assistant: New command, a daemon which does everything watch does,
as well as automatically syncing file contents between repositories.
* webapp: New command (and FreeDesktop menu item) that allows managing
and configuring the assistant in a web browser.
* init: If no description is provided for a new repository, one will
automatically be generated, like "joey@gnu:~/foo"
-- Joey Hess <joeyh@debian.org> Mon, 27 Aug 2012 13:27:39 -0400
git-annex (3.20120825) unstable; urgency=low
* S3: Add fileprefix setting.
* Pass --use-agent to gpg when in no tty mode. Thanks, Eskild Hustvedt.
* init: If no description is provided for a new repository, one will
automatically be generated, like "joey@gnu:~/foo"
* Bugfix: Fix fsck in SHA*E backends, when the key contains composite
extensions, as added in 3.20120721.
-- Joey Hess <joeyh@debian.org> Thu, 09 Aug 2012 13:51:47 -0400
-- Joey Hess <joeyh@debian.org> Sat, 25 Aug 2012 10:00:10 -0400
git-annex (3.20120807) unstable; urgency=low

@@ -22,3 +22,14 @@ The original file also has sha512 ead9db1f34739014a216239d9624bce74d92fe723de065
>> And what sha512 does the file in .git/annex/bad have **now**? (fsck
>> preserves the original filename; this says nothing about what the
>> current checksum is, if the file has been corrupted). --[[Joey]]
The same, as it's the file I was trying to inject:
ead9db1f34739014a216239d9624bce74d92fe723de06505f9b94cb4c063142ba42b04546f11d3d33869b736e40ded2ff779cb32b26aa10482f09407df0f3c8d .git/annex/bad/SHA512E-s94402560--ead9db1f34739014a216239d9624bce74d92fe723de06505f9b94cb4c063142ba42b04546f11d3d33869b736e40ded2ff779cb32b26aa10482f09407df0f3c8d.Moon.avi
That's what puzzles me: it is the same file, but for some weird reason git-annex thinks it's not.
> Ok, reproduced and fixed the bug. The "E" backends recently got support
> for 2 levels of filename extensions, but were not made to drop them both
> when fscking. [[done]] (I'll release a fixed version probably tomorrow;
> fix is in git now.) --[[Joey]]

@@ -0,0 +1,36 @@
Today, added a thread that deals with recovering when there's been a loss
of network connectivity. When the network's down, the normal immediate
syncing of changes of course doesn't work. So this thread detects when the
network comes back up, and does a pull+push to network remotes, and
triggers scanning for file content that needs to be transferred.
I used dbus again, to detect events generated by both network-manager and
wicd when they've successfully brought an interface up. Or, if they're not
available, it polls every 30 minutes.
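The polling fallback can be sketched very simply. This is a minimal illustration, not the assistant's actual thread code, with the interval passed as a parameter (the 30-minute figure comes from the text above):

```haskell
import Control.Concurrent
import Control.Monad (forever)

-- Fallback sketch for when neither network-manager nor wicd is
-- available on dbus: just run the sync action on a fixed interval.
-- The real assistant prefers dbus signals when it can get them.
pollingThread :: Int -> IO () -> IO ThreadId
pollingThread micros sync = forkIO $ forever $ do
    threadDelay micros  -- e.g. 30 * 60 * 1000000 for 30 minutes
    sync

main :: IO ()
main = do
    done <- newEmptyMVar
    _ <- pollingThread 1000 (putMVar done True)  -- 1ms, for demo only
    ok <- takeMVar done  -- blocks until the "sync" has run once
    print ok
```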
When the network comes up, in addition to the git pull+push, it also
currently does a full scan of the repo to find files whose contents
need to be transferred to get fully back into sync.
I think it'll be ok for some git pulls and pushes to happen when
moving to a new network, or resuming a laptop (or every 30 minutes when
resorting to polling). But the transfer scan is currently really too heavy
to be appropriate to do every time in those situations. I have an idea for
avoiding that scan when the remote's git-annex branch has not changed. But
I need to refine it, to handle cases like this:
1. a new remote is added
2. file contents start being transferred to (or from) it
3. the network is taken down
4. all the queued transfers fail
5. the network comes back up
6. the transfer scan needs to know the remote was not all in sync
before #3, and so should do a full scan despite the git-annex branch
not having changed
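The decision the scenario above forces can be sketched as a small predicate. These names are hypothetical, not from the git-annex source; the point is that an unchanged branch alone is not enough evidence of sync:

```haskell
-- Illustrative sketch of the scan decision; names are made up.
data RemoteState = RemoteState
    { branchDiverged  :: Bool  -- remote's git-annex branch changed
    , everScanned     :: Bool  -- a full transfer scan completed before
    , failedTransfers :: Int   -- transfers that failed while connected
    }

-- A full scan is needed unless there is positive evidence of sync:
-- branch unchanged, a scan has run before, and nothing failed.
needsFullScan :: RemoteState -> Bool
needsFullScan st =
    branchDiverged st || not (everScanned st) || failedTransfers st > 0

main :: IO ()
main = do
    -- the scenario above: branch unchanged, but transfers failed at #4
    print (needsFullScan (RemoteState False True 4))  -- True
    print (needsFullScan (RemoteState False True 0))  -- False
```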
---
Doubled the ram in my netbook, which I use for all development. Yesod needs
rather a lot of ram to compile and link, and this should make me quite a
lot more productive. I was struggling with OOM killing bits of chromium
during my last week of development.

@@ -0,0 +1,8 @@
[[!comment format=mdwn
username="https://www.google.com/accounts/o8/id?id=AItOawmubB1Sj2rwFoVdZYvGV0ACaQUJQyiJXJI"
nickname="Paul"
subject="Amazon Glacier"
date="2012-08-23T06:32:24Z"
content="""
Do you think git-annex could support [Amazon Glacier](http://aws.amazon.com/glacier/) as a backend?
"""]]

@@ -0,0 +1,21 @@
Woke up this morning with most of the design for a smarter approach to
[[syncing]] in my head. (This is why I sometimes slip up and tell people I
work on this project 12 hours a day..)
To keep the current `assistant` branch working while I make changes
that break use cases that are working, I've started
developing in a new branch, `assistant-wip`.
In it, I've started getting rid of unnecessary expensive transfer scans.
First optimisation I've done is to detect when a remote that was
disconnected has diverged its `git-annex` branch from the local branch.
Only when that's the case does a new transfer scan need to be done, to find
out what new stuff might be available on that remote, to have caused the
change to its branch, while it was disconnected.
That broke a lot of stuff. I have a plan to fix it written down in
[[syncing]]. It'll involve keeping track of whether a transfer scan has
ever been done (if not, one should be run), and recording logs when
transfers failed, so those failed transfers can be retried when the
remote gets reconnected.

@@ -0,0 +1,26 @@
Implemented everything I planned out yesterday: Expensive scans are only
done once per remote (unless the remote changed while it was disconnected),
and failed transfers are logged so they can be retried later.
Changed the TransferScanner to prefer to scan low cost remotes first,
as a crude form of scheduling lower-cost transfers first.
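That crude scheduling amounts to a stable sort on remote cost. A minimal sketch, with a hypothetical `Remote` type standing in for the real one (which carries a numeric cost annotation, lower meaning cheaper):

```haskell
import Data.List (sortBy)
import Data.Ord (comparing)

-- Hypothetical Remote; in git-annex a local drive would have a
-- lower cost than a network remote.
data Remote = Remote { remoteName :: String, remoteCost :: Int }
    deriving Show

-- Scan the cheapest remotes first, so their transfers queue first.
scanOrder :: [Remote] -> [Remote]
scanOrder = sortBy (comparing remoteCost)

main :: IO ()
main = mapM_ (putStrLn . remoteName) $
    scanOrder [Remote "s3" 500, Remote "usbdrive" 100, Remote "origin" 200]
    -- prints: usbdrive, origin, s3
```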
A whole bunch of interesting syncing scenarios should work now. I have not
tested them all in detail, but to the best of my knowledge, all these
should work:
* Connect to the network. It starts syncing with a networked remote.
Disconnect the network. Reconnect, and it resumes where it left off.
* Migrate between networks (ie, home to cafe to work). Any transfers
that can only happen on one LAN are retried on each new network you
visit, until they succeed.
One that is not working, but is soooo close:
* Plug in a removable drive. Some transfers start. Yank the plug.
Plug it back in. All necessary transfers resume, and it ends up
fully in sync, no matter how many times you yank that cable.
That's not working because of an infelicity in the MountWatcher.
It doesn't notice when the drive gets unmounted, so it ignores
the new mount event.

@@ -0,0 +1,10 @@
[[!comment format=mdwn
username="https://www.google.com/accounts/o8/id?id=AItOawmBUR4O9mofxVbpb8JV9mEbVfIYv670uJo"
nickname="Justin"
subject="comment 1"
date="2012-08-23T21:25:48Z"
content="""
Do encrypted rsync remotes resume quickly as well?
One thing I noticed was that if a copy --to an encrypted rsync remote gets interrupted it will remove the tmp file and re-encrypt the whole file before resuming rsync.
"""]]

@@ -0,0 +1,33 @@
Working toward getting the data syncing to happen robustly,
so a bunch of improvements.
* Got unmount events to be noticed, so unplugging and replugging
a removable drive will resume the syncing to it. There's really no
good unmount event available on dbus in kde, so it uses a heuristic
there.
* Avoid requeuing a download from a remote that no longer has a key.
* Run a full scan on startup, for multiple reasons, including dealing with
crashes.
Ran into a strange issue: Occasionally the assistant will run `git-annex
copy` and it will not transfer the requested file. It seems that
when the copy command runs `git ls-files`, it does not see the file
it's supposed to act on in its output.
Eventually I figured out what's going on: When updating the git-annex
branch, it sets `GIT_INDEX_FILE`, and of course environment settings are
not thread-safe! So there's a race between threads that access
the git-annex branch, and the Transferrer thread, or any other thread
that might expect to look at the normal git index.
Unfortunately, I don't have a fix for this yet.. Git's only interface for
using a different index file is `GIT_INDEX_FILE`. It seems I have a lot of
code to tear apart, to push back the setenv until after forking every git
command. :(
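The direction that fix points in can be sketched: rather than `setEnv` (which mutates process-global state, hence the race), hand each forked git command its own environment. This is only an illustration of the idea, not the eventual git-annex code, and `gitWithIndex` is a made-up name:

```haskell
import System.Environment (getEnvironment)
import System.Process

-- Run a git command with GIT_INDEX_FILE set only in the child's
-- environment, leaving the parent's (shared, thread-unsafe)
-- environment untouched.
gitWithIndex :: FilePath -> [String] -> IO String
gitWithIndex indexfile args = do
    environ <- getEnvironment
    readCreateProcess
        (proc "git" args)
            { env = Just (("GIT_INDEX_FILE", indexfile) : environ) }
        ""
```

Two threads can now run git against different index files concurrently without stepping on each other's `GIT_INDEX_FILE`.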
Before I figured out the root problem, I developed a workaround for the
symptom I was seeing. I added a `git-annex transferkey`, which is
optimised to be run by the assistant, and avoids running `git ls-files`, so
avoids the problem. While I plan to fix this environment variable problem
properly, `transferkey` turns out to be so much faster than how it was
using `copy` that I'm going to keep it.

@@ -3,9 +3,16 @@ all the other git clones, at both the git level and the key/value level.
## immediate action items
* At startup, and possibly periodically, or when the network connection
changes, or some heuristic suggests that a remote was disconnected from
us for a while, queue remotes for processing by the TransferScanner.
* The syncing code currently doesn't run for special remotes. While
transferring the git info about special remotes could be a complication,
if we assume that's synced between existing git remotes, it should be
possible for them to do file transfers to/from special remotes.
* Often several remotes will be queued for full TransferScanner scans,
and the scan does the same thing for each .. so it would be better to
combine them into one scan in such a case.
* Sometimes a Download gets queued from a slow remote, and then a fast
remote becomes available, and a Download is queued from it. Would be
good to sort the transfer queue to run fast Downloads (and Uploads) first.
* Ensure that when a remote receives content, and updates its location log,
it syncs that update back out. Prerequisite for:
* After git sync, identify new content that we don't have that is now available
@@ -34,14 +41,17 @@ all the other git clones, at both the git level and the key/value level.
files in some directories and not others. See for use cases:
[[forum/Wishlist:_options_for_syncing_meta-data_and_data]]
* speed up git syncing by using the cached ssh connection for it too
(will need to use `GIT_SSH`, which needs to point to a command to run,
not a shell command line)
Will need to use `GIT_SSH`, which needs to point to a command to run,
not a shell command line. Beware that the network connection may have
bounced and the cached ssh connection not be usable.
* Map the network of git repos, and use that map to calculate
optimal transfers to keep the data in sync. Currently a naive flood fill
is done instead.
* Find a more efficient way for the TransferScanner to find the transfers
that need to be done to sync with a remote. Currently it walks the git
working copy and checks each file.
working copy and checks each file. That probably needs to be done once,
but further calls to the TransferScanner could, eg, look at the delta
between the last scan and the current one in the git-annex branch.
## misc todo
@@ -163,3 +173,42 @@ redone to check it.
finishes. **done**
* Test MountWatcher on KDE, and add whatever dbus events KDE emits when
drives are mounted. **done**
* It would be nice if, when a USB drive is connected,
syncing starts automatically. Use dbus on Linux? **done**
* Optimisations in 5c3e14649ee7c404f86a1b82b648d896762cbbc2 temporarily
broke content syncing in some situations, which need to be added back.
**done**
Now syncing a disconnected remote only starts a transfer scan if the
remote's git-annex branch has diverged, which indicates it probably has
new files. But that leaves open the cases where the local repo has
new files; and where the two repos git branches are in sync, but the
content transfers are lagging behind; and where the transfer scan has
never been run.
Need to track locally whether we're believed to be in sync with a remote.
This includes:
* All local content has been transferred to it successfully.
* The remote has been scanned once for data to transfer from it, and all
transfers initiated by that scan succeeded.
Note the complication that, if it's initiated a transfer, our queued
transfer will be thrown out as unnecessary. But if its transfer then
fails, that needs to be noticed.
If we're going to track failed transfers, we could just set a flag,
and use that flag later to initiate a new transfer scan. We need a flag
in any case, to ensure that a transfer scan is run for each new remote.
The flag could be `.git/annex/transfer/scanned/uuid`.
But, if failed transfers are tracked, we could also record them, in
order to retry them later, without the scan. I'm thinking about a
directory like `.git/annex/transfer/failed/{upload,download}/uuid/`,
which failed transfer log files could be moved to.
* A remote may lose content it had before, so when requeuing
a failed download, check the location log to see if the remote still has
the content, and if not, queue a download from elsewhere. (And, a remote
may get content we were uploading from elsewhere, so check the location
log when queuing a failed Upload too.) **done**
* Fix MountWatcher to notice umounts and remounts of drives. **done**
* Run transfer scan on startup. **done**
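The `.git/annex/transfer/failed/{upload,download}/uuid/` layout floated above is just path construction; a sketch (the uuid shown is made up, and none of these names are from the real source):

```haskell
import System.FilePath ((</>))

data Direction = Upload | Download

-- Directory holding the log files of failed transfers, one
-- directory per remote uuid, per direction.
failedTransferDir :: FilePath -> Direction -> String -> FilePath
failedTransferDir gitdir direction uuid =
    gitdir </> "annex" </> "transfer" </> "failed" </> d </> uuid
  where
    d = case direction of
        Upload   -> "upload"
        Download -> "download"

main :: IO ()
main = putStrLn (failedTransferDir ".git" Download "793bf851-2b13")
```

Moving a failed transfer's log file here, instead of deleting it, is what lets those transfers be retried later without a full scan.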

@@ -0,0 +1,3 @@
I tried to compile the assistant branch on Ubuntu 12.04. But it depends on the DBus library, which does not compile with some gibberish errors. Is there a way to solve this?

@@ -0,0 +1,28 @@
[[!comment format=mdwn
username="http://joeyh.name/"
ip="4.152.246.119"
subject="comment 1"
date="2012-08-25T13:06:31Z"
content="""
Hmm, let's see...
If the gibberish error is ouyay orgotfay otay otay elltay emay utwhay ethay
roreay asway, then we can figure it out, surely..
If the gibberish error looks something like Ḩ̶̞̗̓ͯ̅͒ͪͫe̢ͦ̊ͭͭͤͣ̂͏̢̳̦͔̬ͅ ̣̘̹̄̕͢Ç̛͈͔̹̮̗͈͓̞ͨ͂͑ͅo̿ͥͮ̿͢͏̧̹̗̪͇̫m̷̢̞̙͑̊̔ͧ̍ͩ̇̚ę̜͑̀͝s̖̱̝̩̞̻͐͂̐́̂̇̆͂
.. your use of cabal
has accidentally summoned Cthulhu! Back slowly away from the monitor!
Otherwise, you might try installing the `libdbus-1-dev` package with apt,
which might make cabal install the haskell dbus bindings successfully. Or
you could just install the `libghc-dbus-dev` package, which contains the
necessary haskell library pre-built. But I don't know if it's in Ubuntu
12.04; it only seems to be available in quantal
<http://packages.ubuntu.com/search?keywords=libghc-dbus-dev>
Or you could even build it with the Makefile, rather than using cabal.
The Makefile has a `-DWITH_DBUS` setting in it that can be removed to build
the fallback mode that doesn't use dbus.
"""]]

@@ -0,0 +1,10 @@
[[!comment format=mdwn
username="http://joeyh.name/"
ip="4.152.246.119"
subject="comment 2"
date="2012-08-25T13:11:37Z"
content="""
I fnordgot to mention, cabal can be configured to not build with dbus too. The relevant incantation is:
    cabal install git-annex --flags=\"-Dbus\"
"""]]

@@ -0,0 +1,8 @@
[[!comment format=mdwn
username="https://me.yahoo.com/speredenn#aaf38"
nickname="Jean-Baptiste Carré"
subject="comment 3"
date="2012-08-21T18:15:48Z"
content="""
You're totally right: The UUIDs are the same. So it shouldn't matter if there are many repositories pointing to the same folder, as you state it. Thanks a lot!
"""]]

@@ -11,6 +11,11 @@ sudo cabal update
cabal install git-annex --bindir=$HOME/bin
</pre>
Do not forget to add your ~/bin folder to your PATH variable. In your .bashrc, for example:
<pre>
PATH=~/bin:/usr/local/bin:$PATH
</pre>
See also:
* [[forum/OSX__39__s_haskell-platform_statically_links_things]]

@@ -0,0 +1,6 @@
git-annex 3.20120825 released with [[!toggle text="these changes"]]
[[!toggleable text="""
* S3: Add fileprefix setting.
* Pass --use-agent to gpg when in no tty mode. Thanks, Eskild Hustvedt.
* Bugfix: Fix fsck in SHA*E backends, when the key contains composite
extensions, as added in 3.20120721."""]]

@@ -0,0 +1,10 @@
[[!comment format=mdwn
username="https://www.google.com/accounts/o8/id?id=AItOawnY9ObrNrQuRp8Xs0XvdtJJssm5cp4NMZA"
nickname="alan"
subject="Rackspace Cloud Files support?"
date="2012-08-23T21:00:11Z"
content="""
Any chance I could bribe you to set up Rackspace Cloud Files support? We are using them and would hate to have an S3 bucket only for this.
https://github.com/rackspace/python-cloudfiles
"""]]

@@ -1,5 +1,5 @@
Name: git-annex
Version: 3.20120807
Version: 3.20120825
Cabal-Version: >= 1.8
License: GPL
Maintainer: Joey Hess <joey@kitenet.net>