Merge branch 'master' into assistant
commit
abe5a73d3f
5 changed files with 122 additions and 60 deletions
37
doc/design/assistant/blog/day_43__simple_scanner.mdwn
Normal file
|
@ -0,0 +1,37 @@
|
||||||
|
Milestone: I can run `git annex assistant`, plug in a USB drive, and it
|
||||||
|
automatically transfers files to get the USB drive and current repo back in
|
||||||
|
sync.
|
||||||
|
|
||||||
|
I decided to implement the naive scan, to find files needing to be
|
||||||
|
transferred. So it walks through `git ls-files` and checks each file
|
||||||
|
in turn. I've deferred less expensive, more sophisticated approaches to later.
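
The naive scan described above can be sketched roughly as follows. This is only an illustration, not the assistant's actual scanner: `splitNul`, `trackedFiles`, and `naiveScan` are hypothetical names, and the `needsTransfer` check is a placeholder for whatever test decides that a file's content is missing on one side. It assumes `git` is on the PATH and that the program is run inside a git repository.

```haskell
import Control.Monad (filterM)
import System.Process (readProcess)

-- Split the NUL-separated output of `git ls-files -z` into file names.
splitNul :: String -> [FilePath]
splitNul s = case break (== '\0') s of
    ("", "") -> []
    (f, "") -> [f]
    (f, _ : rest) -> f : splitNul rest

-- List every file git tracks in the repository at the given path.
-- readProcess throws an exception if git exits with a nonzero status.
trackedFiles :: FilePath -> IO [FilePath]
trackedFiles repo =
    splitNul <$> readProcess "git" ["-C", repo, "ls-files", "-z"] ""

-- Naive scan: run the (hypothetical) check on every file in turn and
-- keep the ones that need a transfer.
naiveScan :: (FilePath -> IO Bool) -> [FilePath] -> IO [FilePath]
naiveScan needsTransfer = filterM needsTransfer

main :: IO ()
main = do
    files <- trackedFiles "."
    -- Placeholder check that flags every file; a real check would
    -- consult where the file's content is known to be present.
    todo <- naiveScan (const (return True)) files
    mapM_ putStrLn todo
```

The cost is one check per tracked file, which is why the post calls this approach naive and defers smarter ones.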
|
||||||
|
|
||||||
|
I did some work on the TransferQueue, which now keeps track of the length
|
||||||
|
of the queue, and can block attempts to add Transfers to it if it gets too
|
||||||
|
long. This was a nice use of STM, which let me implement that without using
|
||||||
|
any locking.
|
||||||
|
|
||||||
|
[[!format haskell """
atomically $ do
    sz <- readTVar (queuesize q)
    if sz <= wantsz
        then enqueue schedule q t (stubInfo f remote)
        else retry -- blocks until queuesize changes
"""]]
|
||||||
|
|
||||||
|
Anyway, the point was that, as the scan finds Transfers to do,
|
||||||
|
it doesn't build up a really long TransferQueue, but instead is blocked
|
||||||
|
from running further until some of the files get transferred. The resulting
|
||||||
|
interleaving of the scan thread with transfer threads means that transfers
|
||||||
|
start fairly quickly upon a USB drive being plugged in, and kind of hides
|
||||||
|
the inefficiencies of the scanner, which will most of the time be
|
||||||
|
swamped out by the IO bound large data transfers.
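
To make the blocking behaviour concrete, here is a small self-contained sketch of a size-limited queue built on STM. It is not the assistant's TransferQueue: `BoundedQueue`, `enqueueBounded`, and `dequeue` are hypothetical names, the queue is a plain list, and only the `sz <= wantsz` test and the use of `retry` are taken from the snippet in the post.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.STM
import Control.Monad (forM_, replicateM)

-- Hypothetical stand-in for the TransferQueue: the queued items plus a
-- counter tracking how many are queued.
data BoundedQueue a = BoundedQueue
    { queueitems :: TVar [a]
    , queuesize :: TVar Int
    }

newBoundedQueue :: IO (BoundedQueue a)
newBoundedQueue = BoundedQueue <$> newTVarIO [] <*> newTVarIO 0

-- Add an item, blocking while the queue holds more than wantsz items.
enqueueBounded :: Int -> BoundedQueue a -> a -> STM ()
enqueueBounded wantsz q v = do
    sz <- readTVar (queuesize q)
    if sz <= wantsz
        then do
            modifyTVar' (queueitems q) (++ [v])
            writeTVar (queuesize q) (sz + 1)
        else retry -- re-run the transaction once queuesize changes

-- Remove the oldest item, blocking while the queue is empty.
dequeue :: BoundedQueue a -> STM a
dequeue q = do
    items <- readTVar (queueitems q)
    case items of
        [] -> retry
        (v : rest) -> do
            writeTVar (queueitems q) rest
            modifyTVar' (queuesize q) (subtract 1)
            return v

main :: IO ()
main = do
    q <- newBoundedQueue
    -- The "scanner" thread is held back whenever the queue is full.
    _ <- forkIO $ forM_ [1 .. 10 :: Int] $ atomically . enqueueBounded 2 q
    -- The "transfer" side drains the queue, which unblocks the scanner.
    received <- replicateM 10 (atomically (dequeue q))
    print received
```

Because `retry` only wakes the blocked transaction when a `TVar` it read has changed, the producer needs no explicit lock or condition variable; the consumer's update to `queuesize` is what lets it continue.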
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
At this point, the assistant should do a good job of keeping repositories
|
||||||
|
in sync, as long as they're all interconnected, or on removable media
|
||||||
|
like USB drives. There's lots more work to be done to handle use cases
|
||||||
|
where repositories are not well-connected, but since the assistant's
|
||||||
|
[[syncing]] now covers at least a couple of use cases, I'm ready to move
|
||||||
|
on to the next phase. [[Webapp]], here we come!
|
|
@ -5,14 +5,72 @@ all the other git clones, at both the git level and the key/value level.
|
||||||
|
|
||||||
* At startup, and possibly periodically, or when the network connection
|
* At startup, and possibly periodically, or when the network connection
|
||||||
changes, or some heuristic suggests that a remote was disconnected from
|
changes, or some heuristic suggests that a remote was disconnected from
|
||||||
us for a while, queue remotes for processing by the TransferScanner,
|
us for a while, queue remotes for processing by the TransferScanner.
|
||||||
to queue Transfers of files it or we're missing.
|
* Ensure that when a remote receives content, and updates its location log,
|
||||||
* After git sync, identify content that we don't have that is now available
|
it syncs that update back out. Prerequisite for:
|
||||||
|
* After git sync, identify new content that we don't have that is now available
|
||||||
on remotes, and transfer. (Needed when we have a uni-directional connection
|
on remotes, and transfer. (Needed when we have a uni-directional connection
|
||||||
to a remote, so it won't be uploading content to us.)
|
to a remote, so it won't be uploading content to us.) Note: Does not
|
||||||
But first, need to ensure that when a remote
|
need to use the TransferScanner, if we get and check a list of the changed
|
||||||
receives content, and updates its location log, it syncs that update
|
files.
|
||||||
out.
|
|
||||||
|
## longer-term TODO
|
||||||
|
|
||||||
|
* Test MountWatcher on LXDE.
|
||||||
|
* git-annex needs a simple speed control knob, which can be plumbed
|
||||||
|
through to, at least, rsync. A good job for an hour in an
|
||||||
|
airport somewhere.
|
||||||
|
* Find a way to probe available outgoing bandwidth, to throttle so
|
||||||
|
we don't bufferbloat the network to death.
|
||||||
|
* Investigate the XMPP approach like dvcs-autosync does, or other ways of
|
||||||
|
signaling a change out of band.
|
||||||
|
* Add a hook, so when there's a change to sync, a program can be run
|
||||||
|
and do its own signaling.
|
||||||
|
* --debug will show often unnecessary work being done. Optimise.
|
||||||
|
* This assumes the network is connected. It's often not, so the
|
||||||
|
[[cloud]] needs to be used to bridge between LANs.
|
||||||
|
* Configurability, including only enabling git syncing but not data transfer;
|
||||||
|
only uploading new files but not downloading, and only downloading
|
||||||
|
files in some directories and not others. See for use cases:
|
||||||
|
[[forum/Wishlist:_options_for_syncing_meta-data_and_data]]
|
||||||
|
* speed up git syncing by using the cached ssh connection for it too
|
||||||
|
(will need to use `GIT_SSH`, which needs to point to a command to run,
|
||||||
|
not a shell command line)
|
||||||
|
* Map the network of git repos, and use that map to calculate
|
||||||
|
optimal transfers to keep the data in sync. Currently a naive flood fill
|
||||||
|
is done instead.
|
||||||
|
* Find a more efficient way for the TransferScanner to find the transfers
|
||||||
|
that need to be done to sync with a remote. Currently it walks the git
|
||||||
|
working copy and checks each file.
|
||||||
|
|
||||||
|
## misc todo
|
||||||
|
|
||||||
|
* --debug will show often unnecessary work being done. Optimise.
|
||||||
|
* It would be nice if, when a USB drive is connected,
|
||||||
|
syncing starts automatically. Use dbus on Linux?
|
||||||
|
|
||||||
|
## data syncing
|
||||||
|
|
||||||
|
There are two parts to data syncing. First, map the network and second,
|
||||||
|
decide what to sync when.
|
||||||
|
|
||||||
|
Mapping the network can reuse code in `git annex map`. Once the map is
|
||||||
|
built, we want to find paths through the network that reach all nodes
|
||||||
|
eventually, with the least cost. This is a minimum spanning tree problem,
|
||||||
|
except with a directed graph, so really an arborescence problem.
|
||||||
|
|
||||||
|
With the map, we can determine which nodes to push new content to. Then we
|
||||||
|
need to control those data transfers, sending to the cheapest nodes first,
|
||||||
|
and with appropriate rate limiting and control facilities.
|
||||||
|
|
||||||
|
This probably will need lots of refinements to get working well.
|
||||||
|
|
||||||
|
### first pass: flood syncing
|
||||||
|
|
||||||
|
Before mapping the network, the best we can do is flood all files out to every
|
||||||
|
reachable remote. This is worth doing first, since it's the simplest way to
|
||||||
|
get the basic functionality of the assistant to work. And we'll need this
|
||||||
|
anyway.
|
||||||
|
|
||||||
## TransferScanner
|
## TransferScanner
|
||||||
|
|
||||||
|
@ -21,6 +79,8 @@ to a remote, or Downloaded from it.
|
||||||
|
|
||||||
How to find the keys to transfer? I'd like to avoid potentially
|
How to find the keys to transfer? I'd like to avoid potentially
|
||||||
expensive traversals of the whole git working copy if I can.
|
expensive traversals of the whole git working copy if I can.
|
||||||
|
(Currently, the TransferScanner does do the naive and possibly expensive
|
||||||
|
scan of the git working copy.)
|
||||||
|
|
||||||
One way would be to do a git diff between the (unmerged) git-annex branches
|
One way would be to do a git diff between the (unmerged) git-annex branches
|
||||||
of the git repo, and its remote. Parse that for lines that add a key to
|
of the git repo, and its remote. Parse that for lines that add a key to
|
||||||
|
@ -53,58 +113,6 @@ one. Probably worth handling the case where a remote is connected
|
||||||
while in the middle of such a scan, so part of the scan needs to be
|
while in the middle of such a scan, so part of the scan needs to be
|
||||||
redone to check it.
|
redone to check it.
|
||||||
|
|
||||||
## longer-term TODO
|
|
||||||
|
|
||||||
* Test MountWatcher on LXDE.
|
|
||||||
* git-annex needs a simple speed control knob, which can be plumbed
|
|
||||||
through to, at least, rsync. A good job for an hour in an
|
|
||||||
airport somewhere.
|
|
||||||
* Find a way to probe available outgoing bandwidth, to throttle so
|
|
||||||
we don't bufferbloat the network to death.
|
|
||||||
* Investigate the XMPP approach like dvcs-autosync does, or other ways of
|
|
||||||
signaling a change out of band.
|
|
||||||
* Add a hook, so when there's a change to sync, a program can be run
|
|
||||||
and do its own signaling.
|
|
||||||
* --debug will show often unnecessary work being done. Optimise.
|
|
||||||
* This assumes the network is connected. It's often not, so the
|
|
||||||
[[cloud]] needs to be used to bridge between LANs.
|
|
||||||
* Configurability, including only enabling git syncing but not data transfer;
|
|
||||||
only uploading new files but not downloading, and only downloading
|
|
||||||
files in some directories and not others. See for use cases:
|
|
||||||
[[forum/Wishlist:_options_for_syncing_meta-data_and_data]]
|
|
||||||
* speed up git syncing by using the cached ssh connection for it too
|
|
||||||
(will need to use `GIT_SSH`, which needs to point to a command to run,
|
|
||||||
not a shell command line)
|
|
||||||
|
|
||||||
## misc todo
|
|
||||||
|
|
||||||
* --debug will show often unnecessary work being done. Optimise.
|
|
||||||
* It would be nice if, when a USB drive is connected,
|
|
||||||
syncing starts automatically. Use dbus on Linux?
|
|
||||||
|
|
||||||
## data syncing
|
|
||||||
|
|
||||||
There are two parts to data syncing. First, map the network and second,
|
|
||||||
decide what to sync when.
|
|
||||||
|
|
||||||
Mapping the network can reuse code in `git annex map`. Once the map is
|
|
||||||
built, we want to find paths through the network that reach all nodes
|
|
||||||
eventually, with the least cost. This is a minimum spanning tree problem,
|
|
||||||
except with a directed graph, so really an arborescence problem.
|
|
||||||
|
|
||||||
With the map, we can determine which nodes to push new content to. Then we
|
|
||||||
need to control those data transfers, sending to the cheapest nodes first,
|
|
||||||
and with appropriate rate limiting and control facilities.
|
|
||||||
|
|
||||||
This probably will need lots of refinements to get working well.
|
|
||||||
|
|
||||||
### first pass: flood syncing
|
|
||||||
|
|
||||||
Before mapping the network, the best we can do is flood all files out to every
|
|
||||||
reachable remote. This is worth doing first, since it's the simplest way to
|
|
||||||
get the basic functionality of the assistant to work. And we'll need this
|
|
||||||
anyway.
|
|
||||||
|
|
||||||
## done
|
## done
|
||||||
|
|
||||||
1. Can use `git annex sync`, which already handles bidirectional syncing.
|
1. Can use `git annex sync`, which already handles bidirectional syncing.
|
||||||
|
|
10
doc/forum/Fixing_up_corrupt_annexes.mdwn
Normal file
|
@ -0,0 +1,10 @@
|
||||||
|
I was wondering how does one recover from...
|
||||||
|
|
||||||
|
<pre>
|
||||||
|
(Recording state in git...)
|
||||||
|
error: invalid object 100644 8f154c946adc039af5240cc650a0a95c840e6fa6 for '041/5a4/SHA256-s6148--7ddcf853e4b16e77ab8c3c855c46867e6ed61c7089c334edf98bbdd3fb3a89ba.log'
|
||||||
|
fatal: git-write-tree: error building trees
|
||||||
|
git-annex: failed to read sha from git write-tree
|
||||||
|
</pre>
|
||||||
|
|
||||||
|
The above was caught when I ran "git annex fsck --fast" to check my stash of files.
|
|
@ -0,0 +1,7 @@
|
||||||
|
[[!comment format=mdwn
|
||||||
|
username="http://joeyh.name/"
|
||||||
|
subject="comment 1"
|
||||||
|
date="2012-07-24T22:00:35Z"
|
||||||
|
content="""
|
||||||
|
This is a corrupt git repository. See [[tips/what_to_do_when_a_repository_is_corrupted]]
|
||||||
|
"""]]
|
|
@ -16,4 +16,4 @@ are present.
|
||||||
If you later found the drive, you could let git-annex know it's found
|
If you later found the drive, you could let git-annex know it's found
|
||||||
like so:
|
like so:
|
||||||
|
|
||||||
git annex semitrusted usbdrive
|
git annex semitrust usbdrive
|
||||||
|
|