121 lines
5.4 KiB
Markdown
121 lines
5.4 KiB
Markdown
Once files are added (or removed or moved), need to send those changes to
|
|
all the other git clones, at both the git level and the key/value level.
|
|
|
|
## action items
|
|
|
|
* on-disk transfers in progress information files (read/write/enumerate)
|
|
**done**
|
|
* locking for the files, so redundant transfer races can be detected,
|
|
and failed transfers noticed **done**
|
|
* transfer info for git-annex-shell **done**
|
|
* update files as transfers proceed. See [[progressbars]]
|
|
(updating for downloads is easy; for uploads is hard)
|
|
* add Transfer queue TChan **done**
|
|
* add TransferInfo Map to DaemonStatus for tracking transfers in progress.
|
|
**done**
|
|
* Poll transfer in progress info files for changes (use inotify again!
|
|
wow! hammer, meet nail..), and update the TransferInfo Map **done**
|
|
* enqueue Transfers (Uploads) as new files are added to the annex by
|
|
Watcher.
|
|
* enqueue Tranferrs (Downloads) as new dangling symlinks are noticed by
|
|
Watcher.
|
|
* Write basic Transfer handling thread. Multiple such threads need to be
|
|
able to be run at once. Each will need its own independant copy of the
|
|
Annex state monad.
|
|
* Write transfer control thread, which decides when to launch transfers.
|
|
* At startup, and possibly periodically, look for files we have that
|
|
location tracking indicates remotes do not, and enqueue Uploads for
|
|
them. Also, enqueue Downloads for any files we're missing.
|
|
* Find a way to probe available outgoing bandwidth, to throttle so
|
|
we don't bufferbloat the network to death.
|
|
* git-annex needs a simple speed control knob, which can be plumbed
|
|
through to, at least, rsync. A good job for an hour in an
|
|
airport somewhere.
|
|
|
|
## git syncing
|
|
|
|
1. Can use `git annex sync`, which already handles bidirectional syncing.
|
|
When a change is committed, launch the part of `git annex sync` that pushes
|
|
out changes. **done**; changes are pushed out to all remotes in parallel
|
|
1. Watch `.git/refs/remotes/` for changes (which would be pushed in from
|
|
another node via `git annex sync`), and run the part of `git annex sync`
|
|
that merges in received changes, and follow it by the part that pushes out
|
|
changes (sending them to any other remotes).
|
|
[The watching can be done with the existing inotify code! This avoids needing
|
|
any special mechanism to notify a remote that it's been synced to.]
|
|
**done**
|
|
1. Periodically retry pushes that failed. **done** (every half an hour)
|
|
1. Also, detect if a push failed due to not being up-to-date, pull,
|
|
and repush. **done**
|
|
2. Use a git merge driver that adds both conflicting files,
|
|
so conflicts never break a sync. **done**
|
|
3. Investigate the XMPP approach like dvcs-autosync does, or other ways of
|
|
signaling a change out of band.
|
|
4. Add a hook, so when there's a change to sync, a program can be run
|
|
and do its own signaling.
|
|
5. --debug will show often unnecessary work being done. Optimise.
|
|
6. It would be nice if, when a USB drive is connected,
|
|
syncing starts automatically. Use dbus on Linux?
|
|
|
|
## data syncing
|
|
|
|
There are two parts to data syncing. First, map the network and second,
|
|
decide what to sync when.
|
|
|
|
Mapping the network can reuse code in `git annex map`. Once the map is
|
|
built, we want to find paths through the network that reach all nodes
|
|
eventually, with the least cost. This is a minimum spanning tree problem,
|
|
except with a directed graph, so really a Arborescence problem.
|
|
|
|
With the map, we can determine which nodes to push new content to. Then we
|
|
need to control those data transfers, sending to the cheapest nodes first,
|
|
and with appropriate rate limiting and control facilities.
|
|
|
|
This probably will need lots of refinements to get working well.
|
|
|
|
### first pass: flood syncing
|
|
|
|
Before mapping the network, the best we can do is flood all files out to every
|
|
reachable remote. This is worth doing first, since it's the simplest way to
|
|
get the basic functionality of the assistant to work. And we'll need this
|
|
anyway.
|
|
|
|
### transfer tracking
|
|
|
|
* Upload added to queue by the watcher thread when it adds content.
|
|
* Download added to queue by the watcher thread when it seens new symlinks
|
|
that lack content.
|
|
* Transfer threads started/stopped as necessary to move data.
|
|
(May sometimes want multiple threads downloading, or uploading, or even both.)
|
|
|
|
startTransfer :: TransferQueue -> Transfer -> Annex ()
|
|
startTransfer q transfer = error "TODO"
|
|
|
|
stopTransfer :: TransferQueue -> TransferID -> Annex ()
|
|
stopTransfer q transfer = error "TODO"
|
|
|
|
The assistant needs to find out when `git-annex-shell` is receiving or
|
|
sending (triggered by another remote), so it can add data for those too.
|
|
This is important to avoid uploading content to a remote that is already
|
|
downloading it from us, or vice versa, as well as to in future let the web
|
|
app manage transfers as user desires.
|
|
|
|
For files being received, it can see the temp file, but other than lsof
|
|
there's no good way to find the pid (and I'd rather not kill blindly).
|
|
|
|
For files being sent, there's no filesystem indication. So git-annex-shell
|
|
(and other git-annex transfer processes) should write a status file to disk.
|
|
|
|
Can use file locking on these status files to claim upload/download rights,
|
|
which will avoid races.
|
|
|
|
This status file can also be updated periodically to show amount of transfer
|
|
complete (necessary for tracking uploads).
|
|
|
|
## other considerations
|
|
|
|
It would be nice if, when a USB drive is connected,
|
|
syncing starts automatically. Use dbus on Linux?
|
|
|
|
This assumes the network is connected. It's often not, so the
|
|
[[cloud]] needs to be used to bridge between LANs.
|