git-annex/doc/design/assistant/syncing.mdwn

Once files are added (or removed or moved), need to send those changes to
all the other git clones, at both the git level and the key/value level.

## immediate action items

* pause, resume, and pause of a transfer fails... The first pause is ok,
  and the first resume. The second pause seems to block forever when
  it signals the transfer thread. I've checked: ThreadID is correct. Thread
  is still running. No exception is thrown. WTF? (Once or twice, it worked,
  but then blocked next time paused.)

## longer-term TODO

* Test MountWatcher on LXDE.
* git-annex needs a simple speed control knob, which can be plumbed
  through to, at least, rsync. A good job for an hour in an
  airport somewhere.
* Find a way to probe available outgoing bandwidth, to throttle so
  we don't bufferbloat the network to death.
* Investigate the XMPP approach like dvcs-autosync does, or other ways of
   signaling a change out of band.
* Add a hook, so when there's a change to sync, a program can be run
   and do its own signaling.
* --debug often shows unnecessary work being done. Optimise.
* This assumes the network is connected. It's often not, so the
  [[cloud]] needs to be used to bridge between LANs.
* Configurability, including only enabling git syncing but not data transfer;
  only uploading new files but not downloading, and only downloading
  files in some directories and not others. See for use cases:
  [[forum/Wishlist:_options_for_syncing_meta-data_and_data]]
* Speed up git syncing by using the cached ssh connection for it too.
  Will need to use `GIT_SSH`, which needs to point to a command to run,
  not a shell command line. Beware that the network connection may have
  bounced and the cached ssh connection not be usable.
* Map the network of git repos, and use that map to calculate
  optimal transfers to keep the data in sync. Currently a naive flood fill
  is done instead.
* Find a more efficient way for the TransferScanner to find the transfers
  that need to be done to sync with a remote. Currently it walks the git
  working copy and checks each file. That probably needs to be done once,
  but further calls to the TransferScanner could, eg, look at the delta
  between the last scan and the current one in the git-annex branch.
* Ensure that when a remote receives content, and updates its location log,
  it syncs that update back out. Prerequisite for:
* After git sync, identify new content that we don't have that is now available
  on remotes, and transfer. (Needed when we have a uni-directional connection
  to a remote, so it won't be uploading content to us.) Note: Does not
  need to use the TransferScanner, if we get and check a list of the changed
  files.
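
A rough sketch of the `GIT_SSH` item above, in Python purely for illustration. Since `GIT_SSH` has to name a single program, the ssh options go into a small wrapper script. The wrapper name `git-ssh-wrapper`, the helper functions, and the choice of passing the cached connection's socket with `ssh -S` are assumptions made for this example, not how the assistant does it.

```python
import os
import shlex
import stat

def make_git_ssh_wrapper(directory, control_socket):
    """Write an executable wrapper that runs ssh with a shared control socket.

    GIT_SSH must name a single program, so the extra ssh options live
    inside the wrapper script rather than in the variable itself.
    """
    path = os.path.join(directory, "git-ssh-wrapper")  # hypothetical name
    script = '#!/bin/sh\nexec ssh -S {} "$@"\n'.format(shlex.quote(control_socket))
    with open(path, "w") as f:
        f.write(script)
    # Make the script executable by its owner.
    os.chmod(path, os.stat(path).st_mode | stat.S_IXUSR)
    return path

def git_env(wrapper_path):
    """Environment for running git so that it uses the wrapper for ssh."""
    env = dict(os.environ)
    env["GIT_SSH"] = wrapper_path
    return env
```

A caller would then run something like `subprocess.run(["git", "push", "origin"], env=git_env(wrapper))`. Because the network connection may have bounced, it could be worth probing the socket first (ssh has `-O check` for that) before trusting it.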

## data syncing

There are two parts to data syncing. First, map the network and second,
decide what to sync when.

Mapping the network can reuse code in `git annex map`. Once the map is
built, we want to find paths through the network that reach all nodes
eventually, with the least cost. This is a minimum spanning tree problem,
except with a directed graph, so really an arborescence problem.
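
One known technique for the minimum-cost arborescence is the Chu-Liu/Edmonds algorithm. A minimal sketch in Python, for illustration only: it assumes repos are numbered 0..n-1 and that every usable connection has already been given a numeric cost (how to assign costs is not decided here). To stay short it returns only the total cost, not the chosen edges.

```python
def min_arborescence_cost(n, root, edges):
    """Cost of the cheapest set of edges reaching every node from root.

    Nodes are numbered 0..n-1; edges are (source, target, cost) tuples.
    Returns None when some node cannot be reached from root.
    """
    total = 0
    while True:
        inf = float("inf")
        in_cost = [inf] * n   # cheapest incoming edge cost per node
        pre = [-1] * n        # source of that cheapest edge
        for u, v, w in edges:
            if u != v and w < in_cost[v]:
                in_cost[v], pre[v] = w, u
        if any(in_cost[v] == inf for v in range(n) if v != root):
            return None
        in_cost[root] = 0

        comp = [-1] * n       # id of the contracted cycle a node is in
        visited = [-1] * n
        count = 0
        for v in range(n):
            total += in_cost[v]
            x = v
            # Follow cheapest incoming edges backwards until we hit the
            # root, an already-contracted node, or our own trail.
            while visited[x] != v and comp[x] == -1 and x != root:
                visited[x] = v
                x = pre[x]
            if x != root and comp[x] == -1:
                # Found a new cycle; give all of its nodes one id.
                y = pre[x]
                while y != x:
                    comp[y] = count
                    y = pre[y]
                comp[x] = count
                count += 1
        if count == 0:
            return total      # no cycles: the chosen edges form a tree
        for v in range(n):
            if comp[v] == -1:
                comp[v] = count
                count += 1
        # Contract cycles, reducing costs by what was already paid.
        edges = [(comp[u], comp[v], w - in_cost[v])
                 for u, v, w in edges if comp[u] != comp[v]]
        root, n = comp[root], count
```

For example, with the local repo as node 0, `min_arborescence_cost(3, 0, [(0, 2, 1), (2, 1, 2), (0, 1, 10)])` prefers relaying through node 2 over the expensive direct link to node 1.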

With the map, we can determine which nodes to push new content to. Then we
need to control those data transfers, sending to the cheapest nodes first,
and with appropriate rate limiting and control facilities.

This probably will need lots of refinements to get working well.

### first pass: flood syncing

Before mapping the network, the best we can do is flood all files out to every
reachable remote. This is worth doing first, since it's the simplest way to
get the basic functionality of the assistant to work. And we'll need this
anyway.
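
Flood syncing is simple enough to sketch in a few lines. This is an illustration in Python, not the assistant's code: `is_reachable` and `has_key` are hypothetical callbacks, and skipping remotes that already have a key is an added assumption (it would mean consulting the location log).

```python
from collections import deque

def flood_queue(keys, remotes, is_reachable, has_key):
    """Queue an upload of every key to every reachable remote lacking it."""
    queue = deque()
    for remote in remotes:
        if not is_reachable(remote):
            continue
        for key in keys:
            # Assumption: avoid queueing uploads the remote does not need.
            if not has_key(remote, key):
                queue.append(("upload", key, remote))
    return queue
```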

## TransferScanner

The TransferScanner thread needs to find keys that need to be Uploaded
to a remote, or Downloaded from it.

How to find the keys to transfer? I'd like to avoid potentially
expensive traversals of the whole git working copy if I can.
(Currently, the TransferScanner does do the naive and possibly expensive
scan of the git working copy.)

One way would be to do a git diff between the (unmerged) git-annex branches
of the git repo, and its remote. Parse that for lines that add a key to
either, and queue transfers. That should work fairly efficiently when the
remote is a git repository. Indeed, git-annex already does such a diff
when it's doing a union merge of data into the git-annex branch. It
might even be possible to have the union merge and scan use the same
git diff data.
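
To make the diff idea concrete, here is a hedged Python sketch that picks out keys from a unified diff of two git-annex branches. It assumes location logs are stored as `<hash dirs>/<key>.log` with lines of the form `<timestamp> <1|0> <uuid>`; treat that layout, and the function name, as assumptions of the example.

```python
import posixpath

def keys_added_in_diff(diff_text):
    """Return {key: set of uuids} for location log lines added in a unified diff."""
    added = {}
    current_key = None
    for line in diff_text.splitlines():
        if line.startswith("+++ "):
            path = line[4:].strip()
            if path.startswith("b/"):
                path = path[2:]
            name = posixpath.basename(path)
            # Assumption: per-key logs live in subdirectories; top-level
            # files such as uuid.log are not location logs.
            if "/" in path and name.endswith(".log"):
                current_key = name[:-len(".log")]
            else:
                current_key = None
        elif line.startswith("+") and current_key is not None:
            fields = line[1:].split()
            # Assumed log line layout: "<timestamp> <1|0> <uuid>"
            if len(fields) >= 3 and fields[1] == "1":
                added.setdefault(current_key, set()).add(fields[2])
    return added
```

The result maps each key to the repositories that newly claim to have it, which is enough to queue transfers but carries no filenames.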

But that approach has several problems:

1. The list of keys it would generate wouldn't have associated git
   filenames, so the UI couldn't show the user what files were being
   transferred.
2. Worse, without filenames, any later features to exclude
   files/directories from being transferred wouldn't work.
3. Looking at a git diff of the git-annex branches would find keys
   that were added to either side while the two repos were disconnected.
   But if the two repos' keys were not fully in sync before they
   disconnected (which is quite possible; transfers could be incomplete),
   the diff would not show those older out of sync keys.

The remote could also be a special remote. In this case, I have to either
traverse the git working copy, or perhaps traverse the whole git-annex
branch (which would have the same problems with filenames not being
available).

If a traversal is done, should check all remotes, not just
one. Probably worth handling the case where a remote is connected
while in the middle of such a scan, so part of the scan needs to be
redone to check it.

## done

1. Can use `git annex sync`, which already handles bidirectional syncing.
   When a change is committed, launch the part of `git annex sync` that pushes
   out changes. **done**; changes are pushed out to all remotes in parallel
1. Watch `.git/refs/remotes/` for changes (which would be pushed in from
   another node via `git annex sync`), and run the part of `git annex sync`
   that merges in received changes, and follow it by the part that pushes out
   changes (sending them to any other remotes).
   [The watching can be done with the existing inotify code! This avoids needing
   any special mechanism to notify a remote that it's been synced to.]  
   **done**
1. Periodically retry pushes that failed.  **done** (every half an hour)
1. Also, detect if a push failed due to not being up-to-date, pull,
   and repush. **done**
2. Use a git merge driver that adds both conflicting files,
   so conflicts never break a sync. **done**
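
The push, pull, repush behaviour in the list above could look roughly like this. A Python sketch for illustration only: `push` and `pull` are hypothetical callables standing in for the relevant parts of `git annex sync`, and the `"ok"`/`"stale"`/`"failed"` results and `max_attempts` limit are invented for the example.

```python
def push_with_recovery(push, pull, max_attempts=3):
    """Try to push; when rejected as out of date, pull and push again.

    push() returns "ok", "stale", or "failed"; pull() returns True on success.
    Returns True once a push succeeds, False if the caller should retry later.
    """
    for _ in range(max_attempts):
        result = push()
        if result == "ok":
            return True
        # Only an out-of-date rejection is worth an immediate pull and retry.
        if result != "stale" or not pull():
            return False
    return False
```

A `False` result would leave the push to the periodic retry (every half an hour).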

* on-disk transfers in progress information files (read/write/enumerate)
  **done**
* locking for the files, so redundant transfer races can be detected,
  and failed transfers noticed **done**
* transfer info for git-annex-shell **done**
* update files as transfers proceed. See [[progressbars]]
  (updating for downloads is easy; for uploads is hard)
* add Transfer queue TChan **done**
* add TransferInfo Map to DaemonStatus for tracking transfers in progress.
  **done**
* Poll transfer in progress info files for changes (use inotify again!
  wow! hammer, meet nail..), and update the TransferInfo Map **done**
* enqueue Transfers (Uploads) as new files are added to the annex by
  Watcher. **done**
* enqueue Transfers (Downloads) as new dangling symlinks are noticed by
  Watcher. **done**
  (Note: Needs git-annex branch to be merged before the tree is merged,
  so it knows where to download from. Checked and this is the case.)
* Write basic Transfer handling thread. Multiple such threads need to be
  able to be run at once. Each will need its own independent copy of the
  Annex state monad. **done**
* Write transfer control thread, which decides when to launch transfers.
  **done**
* Transfer watching has a race on kqueue systems, which makes finished
  fast transfers not be noticed by the TransferWatcher. Which in turn
  prevents the transfer slot being freed and any further transfers
  from happening. So, this approach is too fragile to rely on for
  maintaining the TransferSlots. Instead, need [[todo/assistant_threaded_runtime]],
  which would allow running something for sure when a transfer thread
  finishes. **done**
* Test MountWatcher on KDE, and add whatever dbus events KDE emits when
  drives are mounted. **done**
* It would be nice if, when a USB drive is connected, 
  syncing starts automatically. Use dbus on Linux? **done**
* Optimisations in 5c3e14649ee7c404f86a1b82b648d896762cbbc2 temporarily
  broke content syncing in some situations, which need to be added back.
  **done**

  Now syncing a disconnected remote only starts a transfer scan if the
  remote's git-annex branch has diverged, which indicates it probably has
  new files. But that leaves open the cases where the local repo has
  new files; and where the two repos' git branches are in sync, but the
  content transfers are lagging behind; and where the transfer scan has
  never been run.

  Need to track locally whether we're believed to be in sync with a remote.
  This includes:
  * All local content has been transferred to it successfully.
  * The remote has been scanned once for data to transfer from it, and all
    transfers initiated by that scan succeeded.

  Note the complication that, if it's initiated a transfer, our queued
  transfer will be thrown out as unnecessary. But if its transfer then
  fails, that needs to be noticed.

  If we're going to track failed transfers, we could just set a flag,
  and use that flag later to initiate a new transfer scan. We need a flag
  in any case, to ensure that a transfer scan is run for each new remote.
  The flag could be `.git/annex/transfer/scanned/uuid`.

  But, if failed transfers are tracked, we could also record them, in 
  order to retry them later, without the scan. I'm thinking about a
  directory like `.git/annex/transfer/failed/{upload,download}/uuid/`,
  which failed transfer log files could be moved to.
* A remote may lose content it had before, so when requeuing
  a failed download, check the location log to see if the remote still has
  the content, and if not, queue a download from elsewhere. (And, a remote
  may get content we were uploading from elsewhere, so check the location
  log when queuing a failed Upload too.) **done**
* Fix MountWatcher to notice umounts and remounts of drives. **done**
* Run transfer scan on startup. **done**
* Often several remotes will be queued for full TransferScanner scans,
  and the scan does the same thing for each .. so it would be better to
  combine them into one scan in such a case. **done**
* The syncing code currently doesn't run for special remotes. While
  transferring the git info about special remotes could be a complication,
  if we assume that's synced between existing git remotes, it should be
  possible for them to do file transfers to/from special remotes.
  **done**
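
The scanned flag and failed-transfer directories discussed in the list above can be sketched as follows. This is a Python illustration of the layout only; the function names are made up, and `annex_dir` stands for `.git/annex`.

```python
import os
import shutil

def scanned_flag(annex_dir, uuid):
    """Path of the flag file recording that a remote has been scanned."""
    return os.path.join(annex_dir, "transfer", "scanned", uuid)

def needs_scan(annex_dir, uuid):
    """A remote needs a transfer scan until its flag file exists."""
    return not os.path.exists(scanned_flag(annex_dir, uuid))

def mark_scanned(annex_dir, uuid):
    path = scanned_flag(annex_dir, uuid)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    open(path, "a").close()

def record_failure(annex_dir, direction, uuid, info_file):
    """Move a transfer's info file into failed/<direction>/<uuid>/."""
    if direction not in ("upload", "download"):
        raise ValueError(direction)
    dest_dir = os.path.join(annex_dir, "transfer", "failed", direction, uuid)
    os.makedirs(dest_dir, exist_ok=True)
    dest = os.path.join(dest_dir, os.path.basename(info_file))
    shutil.move(info_file, dest)
    return dest

def failed_transfers(annex_dir, direction, uuid):
    """List recorded failures for a remote, so they can be retried without a scan."""
    d = os.path.join(annex_dir, "transfer", "failed", direction, uuid)
    return sorted(os.listdir(d)) if os.path.isdir(d) else []
```

When requeuing an entry from `failed_transfers`, the location log would still need to be checked first, since the remote may have lost or gained the content in the meantime.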