blog for the day and design update

This commit is contained in:
Joey Hess 2012-06-21 20:02:00 -04:00
parent 8a0d6d83f4
commit f27da7a1cc
2 changed files with 55 additions and 3 deletions

View file

@ -0,0 +1,44 @@
Pondering [[syncing]] today. I will be doing syncing of the git repository
first, and working on syncing of file data later.
The former seems straightforward enough, since we just want to push all
changes to everywhere. Indeed, git-annex already has a [[sync]] command
that uses a smart technique to allow syncing between clones without a
central bare repository. (Props to Joachim Breitner for that.)
But it's not all easy. Syncing should happen as fast as possible, so
changes show up without delay. Eventually it'll need to support syncing
between nodes that cannot directly contact one-another. Syncing needs to
deal with nodes coming and going; one example of that is a USB drive being
plugged in, which should immediatly be synced, but network can also come
and go, so it should periodically retry nodes it failed to sync with. To
start with, I'll be focusing on fast syncing between directly connected
nodes, but I have to keep this wider problem space in mind.
One problem with `git annex sync` is that it has to be run in both clones
in order for changes to fully propigate. This is because git doesn't allow
pushing changes into a non-bare repository; so instead it drops off a new
branch in `.git/refs/remotes/$foo/synced/master`. Then when it's run locally
it merges that new branch into `master`.
So, how to trigger a clone to run `git annex sync` when syncing to it?
Well, I just realized I have spent two weeks developing something that can
be repurposed to do that! [[Inotify]] can watch for changes to
`.git/refs/remotes`, and the instant a change is made, the local sync
process can be started. This avoids needing to make another ssh connection
to trigger the sync, so is faster and allows the data to be transferred
over another protocol than ssh, which may come in handy later.
So, in summary, here's what will happen when a new file is created:
1. inotify event causes the file to be added to the annex, and
immediately committed.
2. new branch is pushed to remotes (probably in parallel)
3. remotes notice new sync branch and merge it
4. (data sync, TBD later)
5. file is fully synced and available
Steps 1, 2, and 3 should all be able to be accomplished in under a second.
The speed of `git push` making a ssh connection will be the main limit
to making it fast. (Perhaps I should also reuse git-annex's existing ssh
connection caching code?)

View file

@ -3,13 +3,21 @@ all the other git clones, at both the git level and the key/value level.
## git syncing ## git syncing
1. At regular intervals, just run `git annex sync`, which already handles 1. Can use `git annex sync`, which already handles bidirectional syncing.
bidirectional syncing. When a change is committed, launch the part of `git annex sync` that pushes
out changes.
1. Watch `.git/refs/remotes/` for changes (which would be pushed in from
another node via `git annex sync`), and run the part of `git annex sync`
that merges in received changes, and follow it by the part that pushes out
changes (sending them to any other remotes).
[The watching can be done with the existing inotify code! This avoids needing
any special mechanism to notify a remote that it's been synced to.]
2. Use a git merge driver that adds both conflicting files, 2. Use a git merge driver that adds both conflicting files,
so conflicts never break a sync. so conflicts never break a sync.
3. Investigate the XMPP approach like dvcs-autosync does, or other ways of 3. Investigate the XMPP approach like dvcs-autosync does, or other ways of
signaling a change out of band. signaling a change out of band.
4. Add a hook, so when there's a change to sync, a program can be run. 4. Add a hook, so when there's a change to sync, a program can be run
and do its own signaling.
## data syncing ## data syncing