Currently, the git-annex assistant syncs with remotes in a way that is
dumb, and potentially inefficient:

1. Files are transferred to each reachable remote whose
   [[preferred_content]] setting indicates it wants the file.

2. After each file transfer (upload or download), a git sync
   is done to all the remotes, to update location log information.

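To make the cost of these two rules concrete, here is a minimal sketch that lists the actions they produce. It is not the assistant's real code: `Remote`, `wants`, and `naive_sync` are hypothetical names, a plain set of file names stands in for the preferred content check, and it assumes the git sync only goes to reachable remotes.

```python
from dataclasses import dataclass, field


@dataclass
class Remote:
    name: str
    reachable: bool = True
    # Stand-in for evaluating the remote's preferred content setting.
    wants: set = field(default_factory=set)


def naive_sync(files, remotes):
    """List the actions produced by rules 1 and 2, in order."""
    actions = []
    for f in files:
        for r in remotes:
            # Rule 1: transfer to every reachable remote that wants the file.
            if r.reachable and f in r.wants:
                actions.append(("transfer", f, r.name))
                # Rule 2: git sync after each transfer
                # (assumption: only reachable remotes can be synced).
                for s in remotes:
                    if s.reachable:
                        actions.append(("sync", s.name))
    return actions
```

With T transfers and R reachable remotes this does T × R syncs, which is the overhead the sections below try to reduce.
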
## unnecessary transfers

There are network topologies where #1 is massively inefficient.
For example:

<pre>
laptopA-----laptopB-----laptopC
   \            |           /
    \---cloud based repo--/
</pre>

When laptopA has a new file, it will first send it to laptopB. It will then
check if the cloud-based transfer repository wants a copy. It will, because
laptopC has not yet gotten a copy. So laptopA will proceed with a slow
upload to the cloud, while meanwhile laptopB is sending the file over fast
LAN to laptopC.

(The more common case with no laptopC happens to work efficiently.
So does the case where laptopA is paired with laptopC.)

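A rough back-of-the-envelope model of the topology above shows how lopsided the two routes are. The link speeds and file size are made-up illustrative numbers, and `links`, `transfer_time`, and `arrival_via` are hypothetical helpers, not anything in git-annex.

```python
# Hypothetical link speeds in MB/s; the numbers are illustrative only.
LAN, WAN = 100.0, 1.0

links = {
    ("laptopA", "laptopB"): LAN,
    ("laptopB", "laptopC"): LAN,
    ("laptopA", "cloud"): WAN,
    ("laptopB", "cloud"): WAN,
    ("laptopC", "cloud"): WAN,
}


def transfer_time(size_mb, src, dst):
    """Seconds to move size_mb over the link between src and dst."""
    speed = links.get((src, dst)) or links.get((dst, src))
    if speed is None:
        raise ValueError(f"no link between {src} and {dst}")
    return size_mb / speed


def arrival_via(size_mb, path):
    """Seconds until the file reaches the end of path, one hop at a time."""
    return sum(transfer_time(size_mb, a, b) for a, b in zip(path, path[1:]))


size = 500  # MB, arbitrary
lan_route = arrival_via(size, ["laptopA", "laptopB", "laptopC"])
cloud_route = arrival_via(size, ["laptopA", "cloud", "laptopC"])
```

Under these assumptions the LAN relay delivers the file to laptopC long before laptopA's cloud upload would even finish, so the upload is wasted effort.
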
## unnecessary syncing

Less importantly, the constant git syncing after each transfer is rather a
lot of work, and prevents collecting multiple presence changes to the git-annex
branch into larger commits, which would save disk space over time.

In many cases, this sync is necessary. For example, when a file is uploaded
to a transfer remote, the location change needs to be synced out so that
other clients know to grab it.

Or, when downloading a file from a drive, the sync lets other locally
paired repositories know we got it, so they can download it from us.
OTOH, this is also a case where a sync is sometimes unnecessary, since
if we're going to upload the file to them after getting it, the sync
at most lets them start downloading it before our transfer queue
reaches the point where we'd upload it.

It would be good to find a way to detect when syncing is not immediately
necessary, and defer it.

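One possible shape for deferring syncs, as a sketch only: collect location changes and sync once per batch, unless a change is flagged as something other repositories are waiting on. `SyncBatcher`, `urgent`, and `max_pending` are hypothetical, and the hard part (deciding which changes are urgent) is left to the caller.

```python
class SyncBatcher:
    """Collect location changes and sync once per batch.

    `urgent` marks changes other repositories are waiting on, for
    example an upload to a transfer remote.
    """

    def __init__(self, do_sync, max_pending=10):
        self.do_sync = do_sync          # callback taking a list of changes
        self.max_pending = max_pending  # hypothetical threshold
        self.pending = []

    def record(self, change, urgent=False):
        """Note a location change; sync now only if it cannot wait."""
        self.pending.append(change)
        if urgent or len(self.pending) >= self.max_pending:
            self.flush()

    def flush(self):
        """Sync everything collected so far as one batch."""
        if self.pending:
            batch, self.pending = self.pending, []
            self.do_sync(batch)
```

A download from a drive that we are about to upload to the paired repositories anyway could be recorded with `urgent=False`, so it rides along with the next sync instead of triggering its own.
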
## mapping

Mapping the repository network has the potential to get git-annex the
information it needs to avoid unnecessary transfers and/or unnecessary
syncing.

Mapping the network can reuse code in `git annex map`. Once the map is
built, we want to find paths through the network that reach all nodes
eventually, with the least cost. This is a minimum spanning tree problem,
except with a directed graph, so really an arborescence problem.

A significant problem in mapping is that nodes are mobile: they can move
between networks over time. This breaks LAN-based paths through the
network. Mapping would need a way to detect this. Note that individual
git-annex assistants can tell when they've switched networks by using the
`networkConnectedNotifier`.

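To illustrate the arborescence idea and the effect of a node moving, here is a brute-force sketch for tiny maps. It is not how `git annex map` works; the node names, edge costs, and helper names are all assumptions, and a real implementation would more likely use the Chu-Liu/Edmonds algorithm rather than trying every combination.

```python
from itertools import product


def min_arborescence(nodes, root, cost):
    """Brute-force minimum-cost arborescence rooted at `root`.

    `cost` maps directed edges (src, dst) to a number. Each non-root node
    picks one incoming edge; a choice is valid when every node can be
    traced back to the root. Exponential, so only for tiny maps.
    Returns (total_cost, {node: parent}), or None if a node is unreachable.
    """
    others = [n for n in nodes if n != root]
    incoming = [[s for (s, d) in cost if d == n] for n in others]
    best = None
    for parents in product(*incoming):
        parent = dict(zip(others, parents))
        if not all(_reaches_root(n, parent, root) for n in others):
            continue
        total = sum(cost[(parent[n], n)] for n in others)
        if best is None or total < best[0]:
            best = (total, parent)
    return best


def _reaches_root(node, parent, root):
    seen = set()
    while node != root:
        if node in seen:
            return False  # stuck in a cycle that never reaches the root
        seen.add(node)
        node = parent[node]
    return True


A, B, C, CLOUD = "laptopA", "laptopB", "laptopC", "cloud"
LAN, WAN = 1, 100  # illustrative costs, not git-annex's real values


def both_ways(pairs, c):
    return {e: c for a, b in pairs for e in ((a, b), (b, a))}


cost = {**both_ways([(A, B), (B, C)], LAN),
        **both_ways([(A, CLOUD), (B, CLOUD), (C, CLOUD)], WAN)}
nodes = [A, B, C, CLOUD]
on_lan = min_arborescence(nodes, A, cost)

# laptopC moves to another network: its LAN edges disappear from the map.
roaming = {e: c for e, c in cost.items() if not (C in e and c == LAN)}
off_lan = min_arborescence(nodes, A, roaming)
```

While laptopC is on the LAN, the cheapest tree reaches it through laptopB; once its LAN edges are dropped, the only route left is through the cloud, which is why the map has to notice network changes.
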
## P2P signaling

Another approach that might help with these problems is if git-annex
repositories have a non-git, out-of-band signaling mechanism. This could,
for example, be used by laptopB to tell laptopA that it's trying to send
a file directly to laptopC. laptopA could then defer the upload to the
cloud for a while.

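As a sketch of how laptopA might act on such a signal, assuming some transport already delivers "I am sending KEY to DEST" announcements: `UploadDeferrer`, its method names, and the `hold_seconds` grace period are all hypothetical, and no such protocol is defined here.

```python
import time


class UploadDeferrer:
    """Track 'I am sending KEY to DEST' announcements from peers."""

    def __init__(self, hold_seconds=300, clock=time.monotonic):
        self.hold_seconds = hold_seconds  # hypothetical grace period
        self.clock = clock
        self.announced = {}  # (key, dest) -> time the announcement arrived

    def on_announcement(self, key, dest):
        """A peer says it is sending `key` directly to `dest`."""
        self.announced[(key, dest)] = self.clock()

    def should_upload_now(self, key, waiting_clients):
        """Upload to the cloud unless every waiting client is being served
        directly by a peer and that announcement is still fresh."""
        now = self.clock()
        for dest in waiting_clients:
            seen = self.announced.get((key, dest))
            if seen is None or now - seen > self.hold_seconds:
                return True
        return False
```

The grace period matters because the direct transfer might fail or laptopC might leave the LAN; once the announcement goes stale, laptopA falls back to the cloud upload.
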
## syncing only requested content

See [[adhoc_routing]]