
Once files are added (or removed or moved), those changes need to be sent to
all the other git clones, at both the git level and the key/value level.

## git syncing

1. Can use `git annex sync`, which already handles bidirectional syncing.
   When a change is committed, launch the part of `git annex sync` that pushes
   out changes. **done**; changes are pushed out to all remotes in parallel
1. Watch `.git/refs/remotes/` for changes (which would be pushed in from
   another node via `git annex sync`), and run the part of `git annex sync`
   that merges in received changes, and follow it by the part that pushes out
   changes (sending them to any other remotes).
   [The watching can be done with the existing inotify code! This avoids needing
   any special mechanism to notify a remote that it's been synced to.]  
   **done**
1. Periodically retry pushes that failed.  **done** (every half an hour)
1. Also, detect if a push failed due to not being up-to-date, pull,
   and repush. **done**
2. Use a git merge driver that adds both conflicting files,
   so conflicts never break a sync.
3. Investigate an XMPP approach, like the one dvcs-autosync uses, or other
   ways of signaling a change out of band.
4. Add a hook, so when there's a change to sync, a program can be run
   and do its own signaling.
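
The retry behaviour in the list above can be sketched roughly as follows. This is only an illustration, not git-annex's implementation: the `push` and `pull` callables and the outcome labels are hypothetical stand-ins for the corresponding parts of `git annex sync`, and remotes are handled one at a time here rather than in parallel.

```python
from typing import Callable, Iterable, Set

# Hypothetical outcome labels for a push attempt (not real git-annex values).
OK = "ok"
NOT_UP_TO_DATE = "not-up-to-date"
FAILED = "failed"


def push_with_repull(remote: str,
                     push: Callable[[str], str],
                     pull: Callable[[str], None]) -> bool:
    """Push to one remote; if rejected for being out of date, pull and repush once."""
    result = push(remote)
    if result == NOT_UP_TO_DATE:
        pull(remote)
        result = push(remote)
    return result == OK


def push_all(remotes: Iterable[str],
             push: Callable[[str], str],
             pull: Callable[[str], None]) -> Set[str]:
    """Return the remotes whose push failed, to be retried on the next periodic pass."""
    return {r for r in remotes if not push_with_repull(r, push, pull)}
```

A caller would keep the returned set and run `push_all` on it again later, for example on the half-hourly retry mentioned above.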

## data syncing

There are two parts to data syncing: first, map the network; second,
decide what to sync when.

Mapping the network can reuse code in `git annex map`. Once the map is
built, we want to find paths through the network that reach all nodes
eventually, with the least cost. This is a minimum spanning tree problem,
except with a directed graph, so really an arborescence problem.
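
As a rough illustration of the arborescence idea, the sketch below computes the cost of a minimum-cost arborescence with the Chu-Liu/Edmonds algorithm. That is a standard technique for this problem, not something this design commits to. The node names, the edge list format `(source, destination, cost)` and the function name are all assumptions made for the example. It returns only the total cost; recovering the chosen edges needs extra bookkeeping that is omitted here.

```python
from typing import Hashable, List, Optional, Sequence, Tuple

Edge = Tuple[Hashable, Hashable, float]


def min_arborescence_cost(nodes: Sequence[Hashable], root: Hashable,
                          edges: Sequence[Edge]) -> Optional[float]:
    """Cost of the cheapest set of edges reaching every node from root, or None."""
    index = {name: i for i, name in enumerate(nodes)}  # assumes unique node names
    es: List[Tuple[int, int, float]] = [(index[u], index[v], w) for u, v, w in edges]
    n, r, total = len(nodes), index[root], 0
    inf = float("inf")
    while True:
        # Pick the cheapest incoming edge for every node.
        in_w = [inf] * n
        pre = [-1] * n
        for u, v, w in es:
            if u != v and w < in_w[v]:
                in_w[v], pre[v] = w, u
        if any(v != r and in_w[v] == inf for v in range(n)):
            return None  # some node cannot be reached from the root
        in_w[r] = 0
        # Detect cycles among the chosen edges and number them as components.
        comp = [-1] * n
        visited = [-1] * n
        cnt = 0
        for v in range(n):
            total += in_w[v]
            x = v
            while visited[x] != v and comp[x] == -1 and x != r:
                visited[x] = v
                x = pre[x]
            if x != r and comp[x] == -1:  # x lies on a new cycle
                y = pre[x]
                while y != x:
                    comp[y] = cnt
                    y = pre[y]
                comp[x] = cnt
                cnt += 1
        if cnt == 0:
            return total  # no cycles: the chosen edges form an arborescence
        # Contract each cycle into one node and reduce the edge costs.
        for v in range(n):
            if comp[v] == -1:
                comp[v] = cnt
                cnt += 1
        es = [(comp[u], comp[v], w - in_w[v]) for u, v, w in es if comp[u] != comp[v]]
        r, n = comp[r], cnt
```

For example, with a hypothetical `laptop` node as the root, two servers that are expensive to reach from it (cost 10 each) but cheap to reach from each other (cost 1), the cheapest plan pushes to one server and lets it forward to the other.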

With the map, we can determine which nodes to push new content to. Then we
need to control those data transfers, sending to the cheapest nodes first,
and with appropriate rate limiting and control facilities.
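
A minimal sketch of the "cheapest nodes first" ordering, assuming each pending transfer is described by a cost, a remote name and a key. The `TransferQueue` class and its method names are hypothetical, and rate limiting is left out entirely.

```python
import heapq
from itertools import count
from typing import List, Optional, Tuple


class TransferQueue:
    """Pending transfers, handed out cheapest first (hypothetical helper)."""

    def __init__(self) -> None:
        self._heap: List[Tuple[float, int, str, str]] = []
        self._seq = count()  # tie-breaker: equal costs keep insertion order

    def add(self, cost: float, remote: str, key: str) -> None:
        """Queue a transfer of key to remote at the given cost."""
        heapq.heappush(self._heap, (cost, next(self._seq), remote, key))

    def next_transfer(self) -> Optional[Tuple[str, str]]:
        """Return the cheapest pending (remote, key), or None if the queue is empty."""
        if not self._heap:
            return None
        _cost, _seq, remote, key = heapq.heappop(self._heap)
        return remote, key
```

A transfer controller could call `next_transfer` whenever it has spare capacity, which is one place where rate limiting could later be added.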

This probably will need lots of refinements to get working well.

## other considerations

It would be nice if, when a USB drive is connected,
syncing starts automatically. Use dbus on Linux?

This assumes the network is connected. It's often not, so the
[[cloud]] needs to be used to bridge between LANs.