Once files are added (or removed or moved), those changes need to be sent to
all the other git clones, at both the git level and the key/value level.

## efficiency

Currently after each file transfer (upload or download), a git sync is done
to all remotes. This is rather a lot of work, and it also prevents collecting
presence changes to the git-annex branch into larger commits, which would
save disk space over time.

In many cases, this sync is necessary. For example, when a file is uploaded
to a transfer remote, the location change needs to be synced out so that
other clients know to grab it.

Or, when downloading a file from a drive, the sync lets other locally
paired repositories know we got it, so they can download it from us.
OTOH, this is also a case where a sync is sometimes unnecessary: if we're
going to upload the file to them after getting it, the sync at best lets
them start downloading it before our transfer queue reaches the point
where we'd upload it to them.

Do we need all the mapping stuff discussed below to know when we can avoid
syncs?

## TODO

* Test MountWatcher on LXDE.
* Add a hook, so when there's a change to sync, a program can be run
and do its own signaling.
* --debug often shows unnecessary work being done. Optimise.
* Configurability, including enabling only git syncing but not data transfer;
only uploading new files but not downloading; and only downloading
files in some directories and not others. See for use cases:
[[forum/Wishlist:_options_for_syncing_meta-data_and_data]]
* speed up git syncing by using the cached ssh connection for it too.
Will need to use `GIT_SSH`, which needs to point to a command to run,
not a shell command line. Beware that the network connection may have
bounced and the cached ssh connection may not be usable. (See the first
sketch after this list.)
* Map the network of git repos, and use that map to calculate
optimal transfers to keep the data in sync. Currently a naive flood fill
is done instead. Maybe use XMPP as a side channel to learn about the
network topology?
* Find a more efficient way for the TransferScanner to find the transfers
that need to be done to sync with a remote. Currently it walks the git
working copy and checks each file. That probably needs to be done once,
but further calls to the TransferScanner could, e.g., look at the delta
between the last scan and the current one in the git-annex branch.
* [[use multiple transfer slots|todo/Slow_transfer_for_a_lot_of_small_files.]]
* The TransferQueue's list of deferred downloads could theoretically
grow without bounds in memory. Limit it to a given number of entries,
and fall back to some other method -- either storing deferred downloads
on disk, or perhaps scheduling a TransferScanner run to get back into sync.
(A sketch of such a bounded queue follows this list.)
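
The `GIT_SSH` item above could work along these lines. A minimal sketch,
assuming a hypothetical wrapper program at `/path/to/ssh-wrapper` that
execs ssh with the cached connection's control socket; the wrapper path
and function name are illustrative, not git-annex's actual code:

```haskell
import System.Process (proc, createProcess, waitForProcess, env)
import System.Environment (getEnvironment)

-- Run a git command with GIT_SSH pointing at a wrapper program, so
-- git's ssh invocations can reuse a cached connection. GIT_SSH must
-- name a program to exec, not a shell command line.
gitWithCachedSsh :: [String] -> IO ()
gitWithCachedSsh args = do
  environ <- getEnvironment
  let environ' = ("GIT_SSH", "/path/to/ssh-wrapper")
               : filter ((/= "GIT_SSH") . fst) environ
  (_, _, _, pid) <- createProcess (proc "git" args) { env = Just environ' }
  _ <- waitForProcess pid
  return ()
```
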
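
And for the deferred-downloads item, a bounded queue could behave like
this sketch (names are made up): when the in-memory bound is reached,
rather than queueing more, it just flags that a TransferScanner run is
needed to get back into sync:

```haskell
import qualified Data.Sequence as S

-- Illustrative stand-in for the TransferQueue's deferred downloads.
data Deferred a = Deferred
  { bound    :: Int      -- maximum entries to hold in memory
  , entries  :: S.Seq a  -- the deferred downloads
  , needScan :: Bool     -- whether a full transfer scan is now needed
  }

-- Defer a download, or note that a scan is needed once the bound is hit.
defer :: a -> Deferred a -> Deferred a
defer x d
  | S.length (entries d) < bound d = d { entries = entries d S.|> x }
  | otherwise = d { needScan = True }
```
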
## data syncing

There are two parts to data syncing. First, map the network; second,
decide what to sync when.

Mapping the network can reuse code in `git annex map`. Once the map is
built, we want to find paths through the network that reach all nodes
eventually, with the least cost. This is a minimum spanning tree problem,
except with a directed graph, so really an arborescence problem.
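
The edge-selection step could start out like this. A minimal sketch with
illustrative types (not git-annex's): for each repo other than the root,
keep only the cheapest edge pointing at it, which is the first step of the
Chu-Liu/Edmonds arborescence algorithm; a full solution would also have to
detect and contract any cycles this selection creates:

```haskell
import Data.List (minimumBy, nub)
import Data.Ord (comparing)

type Repo = String

-- A directed, costed edge in the repo map: a possible transfer from
-- one repo to another, weighted by the remote's cost.
data Edge = Edge { from :: Repo, to :: Repo, cost :: Int } deriving Show

-- For every repo except the root, keep the cheapest incoming edge.
cheapestIncoming :: Repo -> [Edge] -> [Edge]
cheapestIncoming root es =
  [ minimumBy (comparing cost) incoming
  | n <- nub (map to es)
  , n /= root
  , let incoming = filter ((== n) . to) es
  ]
```
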
With the map, we can determine which nodes to push new content to. Then we
need to control those data transfers, sending to the cheapest nodes first,
and with appropriate rate limiting and control facilities.

This will probably need lots of refinement to work well.

### first pass: flood syncing

Before mapping the network, the best we can do is flood all files out to every
reachable remote. This is worth doing first, since it's the simplest way to
get the basic functionality of the assistant to work. And we'll need this
anyway.
## TransferScanner

The TransferScanner thread needs to find keys that need to be Uploaded
to a remote, or Downloaded from it.

How to find the keys to transfer? I'd like to avoid potentially
expensive traversals of the whole git working copy if I can.
(Currently, the TransferScanner does do the naive and possibly expensive
scan of the git working copy.)

One way would be to do a git diff between the (unmerged) git-annex branches
of the git repo, and its remote. Parse that for lines that add a key to
either, and queue transfers. That should work fairly efficiently when the
remote is a git repository. Indeed, git-annex already does such a diff
when it's doing a union merge of data into the git-annex branch. It
might even be possible to have the union merge and scan use the same
git diff data.
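
Such a diff-based scan might look roughly like this sketch. It assumes the
remote's git-annex branch is tracked at `refs/remotes/<remote>/git-annex`,
and that a key's location log is stored at a path ending in `<key>.log` on
the branch, so the key can be recovered from a changed file's name:

```haskell
import System.Process (readProcess)
import Data.List (isSuffixOf)
import System.FilePath (takeFileName, dropExtension)

-- List keys whose location logs differ between the local git-annex
-- branch and a remote's (unmerged) copy of it; each may need a
-- transfer queued in one direction or the other.
changedKeys :: String -> IO [String]
changedKeys remote = do
  out <- readProcess "git"
    [ "diff", "--name-only"
    , "git-annex", "refs/remotes/" ++ remote ++ "/git-annex"
    ] ""
  return [ dropExtension (takeFileName f)
         | f <- lines out
         , ".log" `isSuffixOf` f ]
```
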
But that approach has several problems:

1. The list of keys it would generate wouldn't have associated git
filenames, so the UI couldn't show the user what files were being
transferred.
2. Worse, without filenames, any later features to exclude
files/directories from being transferred wouldn't work.
3. Looking at a git diff of the git-annex branches would find keys
that were added to either side while the two repos were disconnected.
But if the two repos' keys were not fully in sync before they
disconnected (which is quite possible; transfers could be incomplete),
the diff would not show those older out-of-sync keys.

The remote could also be a special remote. In this case, I have to either
traverse the git working copy, or perhaps traverse the whole git-annex
branch (which would have the same problems with filenames not being
available).

If a traversal is done, it should check all remotes, not just
one. It's probably worth handling the case where a remote is connected
in the middle of such a scan, so part of the scan needs to be
redone to check it.
## done

1. Can use `git annex sync`, which already handles bidirectional syncing.
When a change is committed, launch the part of `git annex sync` that pushes
out changes. **done**; changes are pushed out to all remotes in parallel
1. Watch `.git/refs/remotes/` for changes (which would be pushed in from
another node via `git annex sync`), and run the part of `git annex sync`
that merges in received changes, and follow it by the part that pushes out
changes (sending them to any other remotes).
[The watching can be done with the existing inotify code! This avoids needing
any special mechanism to notify a remote that it's been synced to.]
**done**
1. Periodically retry pushes that failed. **done** (every half an hour)
1. Also, detect if a push failed due to not being up-to-date, pull,
and repush. **done**
1. Use a git merge driver that adds both conflicting files,
so conflicts never break a sync. **done**

* on-disk transfer-in-progress information files (read/write/enumerate)
**done**
* locking for the files, so redundant transfer races can be detected,
and failed transfers noticed **done**
* transfer info for git-annex-shell **done**
* update files as transfers proceed. See [[progressbars]]
(updating for downloads is easy; for uploads is hard)
* add Transfer queue TChan **done**
* add TransferInfo Map to DaemonStatus for tracking transfers in progress.
**done**
* Poll transfer-in-progress info files for changes (use inotify again!
wow! hammer, meet nail..), and update the TransferInfo Map **done**
* enqueue Transfers (Uploads) as new files are added to the annex by
Watcher. **done**
* enqueue Transfers (Downloads) as new dangling symlinks are noticed by
Watcher. **done**
(Note: Needs the git-annex branch to be merged before the tree is merged,
so it knows where to download from. Checked, and this is the case.)
* Write basic Transfer handling thread. Multiple such threads need to be
able to run at once. Each will need its own independent copy of the
Annex state monad. **done**
* Write transfer control thread, which decides when to launch transfers.
**done**
* Transfer watching has a race on kqueue systems, which makes finished
fast transfers not be noticed by the TransferWatcher, which in turn
prevents the transfer slot from being freed and any further transfers
from happening. So this approach is too fragile to rely on for
maintaining the TransferSlots. Instead, need [[todo/assistant_threaded_runtime]],
which would allow reliably running something when a transfer thread
finishes. **done**
* Test MountWatcher on KDE, and add whatever dbus events KDE emits when
drives are mounted. **done**
* It would be nice if, when a USB drive is connected,
syncing starts automatically. Use dbus on Linux? **done**
* Optimisations in 5c3e14649ee7c404f86a1b82b648d896762cbbc2 temporarily
broke content syncing in some situations, which need to be added back.
**done**

Now syncing a disconnected remote only starts a transfer scan if the
remote's git-annex branch has diverged, which indicates it probably has
new files. But that leaves open the cases where the local repo has
new files; where the two repos' git branches are in sync, but the
content transfers are lagging behind; and where the transfer scan has
never been run.

Need to track locally whether we're believed to be in sync with a remote.
This includes:

* All local content has been transferred to it successfully.
* The remote has been scanned once for data to transfer from it, and all
transfers initiated by that scan succeeded.
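
A sketch of that per-remote state (field names are illustrative, not
git-annex's actual types):

```haskell
-- Whether we believe we're fully in sync with a given remote.
data RemoteSyncState = RemoteSyncState
  { allContentSent  :: Bool -- all local content transferred successfully
  , scannedOnce     :: Bool -- scanned once for data to transfer from it
  , scanTransfersOk :: Bool -- all transfers initiated by that scan succeeded
  }
```
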
Note the complication that, if the remote has initiated a transfer, our
queued transfer will be thrown out as unnecessary. But if its transfer then
fails, that needs to be noticed.

If we're going to track failed transfers, we could just set a flag,
and use that flag later to initiate a new transfer scan. We need a flag
in any case, to ensure that a transfer scan is run for each new remote.
The flag could be `.git/annex/transfer/scanned/uuid`.
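
For example (a sketch following the path suggested above; the function
names are made up):

```haskell
import System.Directory (createDirectoryIfMissing, doesFileExist)
import System.FilePath ((</>))

-- Flag file recording that the remote with this uuid has been scanned.
scannedFlag :: FilePath -> String -> FilePath
scannedFlag gitdir uuid =
  gitdir </> "annex" </> "transfer" </> "scanned" </> uuid

markScanned :: FilePath -> String -> IO ()
markScanned gitdir uuid = do
  createDirectoryIfMissing True
    (gitdir </> "annex" </> "transfer" </> "scanned")
  writeFile (scannedFlag gitdir uuid) ""

needsScan :: FilePath -> String -> IO Bool
needsScan gitdir uuid = not <$> doesFileExist (scannedFlag gitdir uuid)
```
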
But, if failed transfers are tracked, we could also record them, in
order to retry them later, without the scan. I'm thinking about a
directory like `.git/annex/transfer/failed/{upload,download}/uuid/`,
which failed transfer log files could be moved to.
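
Recording a failure could then be as simple as moving the transfer's log
file aside (a sketch; `direction` would be "upload" or "download", and
the paths follow the layout suggested above):

```haskell
import System.Directory (createDirectoryIfMissing, renameFile)
import System.FilePath ((</>), takeFileName)

-- Move a failed transfer's log file into the per-remote failed
-- directory, so it can be requeued later without a full scan.
recordFailed :: FilePath -> String -> String -> FilePath -> IO ()
recordFailed gitdir direction uuid logfile = do
  let dir = gitdir </> "annex" </> "transfer"
              </> "failed" </> direction </> uuid
  createDirectoryIfMissing True dir
  renameFile logfile (dir </> takeFileName logfile)
```
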
* A remote may lose content it had before, so when requeuing
a failed download, check the location log to see if the remote still has
the content, and if not, queue a download from elsewhere. (And, a remote
may get content we were uploading from elsewhere, so check the location
log when queuing a failed Upload too.) **done**
* Fix MountWatcher to notice umounts and remounts of drives. **done**
* Run transfer scan on startup. **done**
* Often several remotes will be queued for full TransferScanner scans,
and the scan does the same thing for each, so it would be better to
combine them into one scan in such a case. **done**
* The syncing code currently doesn't run for special remotes. While
transferring the git info about special remotes could be a complication,
if we assume that's synced between existing git remotes, it should be
possible for them to do file transfers to/from special remotes.
**done**

* The transfer code doesn't always manage to transfer file contents.

Besides reconnection events, there are two places where transfers get queued:

1. When the committer commits a file, it queues uploads.
2. When the watcher sees a broken symlink be created, it queues downloads.

Consider a doubly-linked chain of three repositories: A, B, and C.
(C and A do not directly communicate.)

* File is added to A.
* A uploads its content to B.
* At the same time, A git syncs to B.
* Once B gets the git sync, it git syncs to C.
* When C's watcher sees the file appear, it tries to download it. But if
B had not finished receiving the file from A, C doesn't know B has it,
and cannot download it from anywhere.

Possible solution: After B receives content, it could queue uploads of it
to all remotes that it doesn't know have it yet, which would include C.
**done**

In practice, this had the problem that when C receives the content,
it will queue uploads of it, which can send back to B (or to some other repo
that already has the content) and loop, until the git-annex branches catch
up and break the cycle.

To avoid that problem, incoming uploads should not result in a transfer
info file being written when the key is already present. **done**
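
That guard might look like this (a sketch; the presence check and file
paths stand in for git-annex's real ones):

```haskell
import Control.Monad (unless)
import System.Directory (doesFileExist)

-- On an incoming upload, only write the transfer info file if the
-- key's content is not already present; otherwise advertising the
-- transfer would restart the upload cycle described above.
noteIncomingUpload :: FilePath -> FilePath -> IO ()
noteIncomingUpload contentfile infofile = do
  present <- doesFileExist contentfile
  unless present $
    writeFile infofile ""  -- stand-in for the real transfer info
```
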
Possible solution: C could record a deferred download. (Similar to a failed
download, but with an unknown source.) When C next receives a git-annex
branch push, it could try to queue deferred downloads. **done**

Note that this solution won't cover use cases the other does. For example,
connect a USB drive A; B syncs files from it, and then should pass them to C.
If the files are not new, C won't immediately request them from B.

* Running the assistant in a fresh clone of a repository, it sometimes
skips downloading a file, while successfully downloading all the rest.
There does not seem to be an error message. This will sometimes reproduce
(in a fresh clone each time) several times in a row, but then stops happening,
which has prevented me from debugging it.
This could possibly have been caused by the bug fixed in
750c4ac6c282d14d19f79e0711f858367da145e4.
Provisionally closed.