split off a page
parent 98453b83b2
commit bc786b6f06
2 changed files with 77 additions and 44 deletions
@@ -1,28 +1,7 @@
Once files are added (or removed or moved), need to send those changes to
all the other git clones, at both the git level and the key/value level.

## efficiency

Currently after each file transfer (upload or download), a git sync is done
to all remotes. This is rather a lot of work, also it prevents collecting
presence changes to the git-annex branch into larger commits, which would
save disk space over time.

In many cases, this sync is necessary. For example, when a file is uploaded
to a transfer remote, the location change needs to be synced out so that
other clients know to grab it.

Or, when downloading a file from a drive, the sync lets other locally
paired repositories know we got it, so they can download it from us.
OTOH, this is also a case where a sync is sometimes unnecessary, since
if we're going to upload the file to them after getting it, the sync
only perhaps lets them start downloading it before our transfer queue
reaches a point where we'd upload it.

Do we need all the mapping stuff discussed below to know when we can avoid
syncs?

## TODO
## misc TODO

* Test MountWatcher on LXDE.
* Add a hook, so when there's a change to sync, a program can be run
@@ -51,30 +30,11 @@ syncs?
and fall back to some other method -- either storing deferred downloads
on disk, or perhaps scheduling a TransferScanner run to get back into sync.

## data syncing
## More efficient syncing

There are two parts to data syncing. First, map the network and second,
decide what to sync when.
See [[syncing/efficiency]]

Mapping the network can reuse code in `git annex map`. Once the map is
built, we want to find paths through the network that reach all nodes
eventually, with the least cost. This is a minimum spanning tree problem,
except with a directed graph, so really an arborescence problem.

With the map, we can determine which nodes to push new content to. Then we
need to control those data transfers, sending to the cheapest nodes first,
and with appropriate rate limiting and control facilities.

This probably will need lots of refinements to get working well.

### first pass: flood syncing

Before mapping the network, the best we can do is flood all files out to every
reachable remote. This is worth doing first, since it's the simplest way to
get the basic functionality of the assistant to work. And we'll need this
anyway.

## TransferScanner
## TransferScanner efficiency

The TransferScanner thread needs to find keys that need to be Uploaded
to a remote, or Downloaded from it.
73
doc/design/assistant/syncing/efficiency.mdwn
Normal file
@@ -0,0 +1,73 @@
Currently, the git-annex assistant syncs with remotes in a way that is
dumb, and potentially inefficient:

1. Files are transferred to each reachable remote whose
   [[preferred_content]] setting indicates it wants the file.

2. After each file transfer (upload or download), a git sync
   is done to all the remotes, to update location log information.

## unnecessary transfers

There are network topologies where #1 is massively inefficient.
For example:

<pre>
laptopA-----laptopB-----laptopC
   \           |           /
    \---cloud based repo--/
</pre>

When laptopA has a new file, it will first send it to laptopB. It will then
check if the cloud based transfer repository wants a copy. It will, because
laptopC has not yet gotten a copy. So laptopA will proceed with a slow
upload to the cloud, while laptopB is meanwhile sending the file over the
fast LAN to laptopC.

(The more common case with no laptopC happens to work efficiently.
So does the case where laptopA is paired with laptopC.)
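
The three cases above can be reproduced with a small model. This is only a sketch: the function name `uploads_from` and the rule used for the transfer repository (it wants a file while any client still lacks it) are assumptions made for illustration, not git-annex code or its actual preferred content logic.

```python
def uploads_from(source, remotes_in_order, clients):
    """Return the remotes that `source` uploads a new file to.

    `present_on` is the source's own view of the location log, which only
    grows as its own uploads finish (hypothetical simplification).
    """
    present_on = {source}
    sent = []
    for remote in remotes_in_order:
        if remote in clients:
            # A client wants the file if it does not have it yet.
            wants = remote not in present_on
        else:
            # Assumed transfer-repository rule: wanted while some client
            # is not yet known to have a copy.
            wants = not clients <= present_on
        if wants:
            sent.append(remote)
            present_on.add(remote)
    return sent


three = {"laptopA", "laptopB", "laptopC"}
two = {"laptopA", "laptopB"}

# laptopA cannot see laptopC, so the slow cloud upload still happens.
print(uploads_from("laptopA", ["laptopB", "cloud"], three))
# → ['laptopB', 'cloud']

# No laptopC: the cloud upload is skipped.
print(uploads_from("laptopA", ["laptopB", "cloud"], two))
# → ['laptopB']

# laptopA paired with laptopC: again no cloud upload.
print(uploads_from("laptopA", ["laptopB", "laptopC", "cloud"], three))
# → ['laptopB', 'laptopC']
```

The redundant upload comes purely from laptopA's limited view: it has no way to know that laptopB is already serving laptopC.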

## unnecessary syncing

Less importantly, the constant git syncing after each transfer is rather a
lot of work, and prevents collecting multiple presence changes to the git-annex
branch into larger commits, which would save disk space over time.

In many cases, this sync is necessary. For example, when a file is uploaded
to a transfer remote, the location change needs to be synced out so that
other clients know to grab it.

Or, when downloading a file from a drive, the sync lets other locally
paired repositories know we got it, so they can download it from us.
OTOH, this is also a case where a sync is sometimes unnecessary, since
if we're going to upload the file to them after getting it, the sync
only perhaps lets them start downloading it before our transfer queue
reaches a point where we'd upload it.

It would be good to find a way to detect when syncing is not immediately
necessary, and defer it.
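
One possible shape for such deferral is sketched below. Everything in it is hypothetical: the class `DeferredSync`, the `urgent` flag (standing in for cases like an upload to a transfer remote, where other clients need to learn of the change promptly) and the `max_delay` bound are assumptions for illustration, not part of the assistant.

```python
class DeferredSync:
    """Collect location changes and push them out in batches (sketch)."""

    def __init__(self, max_delay=60.0):
        self.max_delay = max_delay  # assumed upper bound on deferral, in seconds
        self.pending = []           # location changes not yet synced
        self.oldest = None          # time the oldest pending change was recorded
        self.syncs = []             # each entry stands for one git sync

    def record(self, change, now, urgent=False):
        if not self.pending:
            self.oldest = now
        self.pending.append(change)
        # The delay is only checked when a change arrives; a real
        # implementation would also need a timer to flush idle batches.
        if urgent or now - self.oldest >= self.max_delay:
            self.flush()

    def flush(self):
        if self.pending:
            self.syncs.append(list(self.pending))
            self.pending.clear()
            self.oldest = None


d = DeferredSync(max_delay=60.0)
d.record("got file1 from drive", now=0)
d.record("got file2 from drive", now=10)
d.record("sent file3 to transfer remote", now=20, urgent=True)
print(len(d.syncs), len(d.syncs[0]))  # → 1 3
```

Three presence changes end up in one sync, which is the kind of larger commit to the git-annex branch described above. Deciding which changes count as urgent is the hard part, and this sketch does not attempt it.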

## mapping

Mapping the repository network has the potential to get git-annex the
information it needs to avoid unnecessary transfers and/or unnecessary
syncing.

Mapping the network can reuse code in `git annex map`. Once the map is
built, we want to find paths through the network that reach all nodes
eventually, with the least cost. This is a minimum spanning tree problem,
except with a directed graph, so really an arborescence problem.

A significant problem in mapping is that nodes are mobile: they can move
between networks over time. This breaks LAN based paths through the
network. Mapping would need a way to detect this. Note that individual
git-annex assistants can tell when they've switched networks by using the
`networkConnectedNotifier`.
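
To make the least-cost idea from this section concrete, here is a rough sketch. It does not solve the minimum arborescence problem (that would need something like the Chu-Liu/Edmonds algorithm); as a simpler stand-in it builds a shortest-path tree with Dijkstra's algorithm, rooted at the repository that has the new file. The graph and its edge costs (1 for LAN, 10 for cloud) are made up for illustration.

```python
import heapq


def cheapest_paths(graph, source):
    """Dijkstra's algorithm over a directed graph {node: {neighbor: cost}}.

    Returns (cost, parent): the cheapest known cost to reach each node,
    and the node each one should receive the file from.
    """
    cost = {source: 0}
    parent = {}
    heap = [(0, source)]
    while heap:
        c, node = heapq.heappop(heap)
        if c > cost[node]:
            continue  # stale heap entry
        for neighbor, edge_cost in graph.get(node, {}).items():
            new_cost = c + edge_cost
            if new_cost < cost.get(neighbor, float("inf")):
                cost[neighbor] = new_cost
                parent[neighbor] = node
                heapq.heappush(heap, (new_cost, neighbor))
    return cost, parent


# Hypothetical map of the example network; costs are illustrative only.
graph = {
    "laptopA": {"laptopB": 1, "cloud": 10},
    "laptopB": {"laptopA": 1, "laptopC": 1, "cloud": 10},
    "laptopC": {"laptopB": 1, "cloud": 10},
    "cloud": {"laptopA": 10, "laptopB": 10, "laptopC": 10},
}

cost, parent = cheapest_paths(graph, "laptopA")
print(parent["laptopC"], cost["laptopC"])  # → laptopB 2
```

With this map, laptopC is reached through laptopB at cost 2 rather than through the cloud at cost 20. The map only helps while it is current, which is where the mobility problem above comes in.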

## P2P signaling

Another approach that might help with these problems is for git-annex
repositories to have a non-git, out-of-band signaling mechanism. This could,
for example, be used by laptopB to tell laptopA that it's trying to send
a file directly to laptopC. laptopA could then defer the upload to the
cloud for a while.
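
A minimal sketch of how laptopA might act on such a signal follows. No signaling mechanism exists in this design yet, so the class `UploadPlanner`, its method names and the `defer_seconds` window are all assumptions made for illustration.

```python
class UploadPlanner:
    """Decide whether a cloud upload can wait, based on peer signals (sketch)."""

    def __init__(self, defer_seconds=300.0):
        self.defer_seconds = defer_seconds  # assumed length of "a while"
        self.in_flight = {}                 # (key, destination) -> time signal arrived

    def on_signal(self, key, destination, now):
        """A peer reports that it is sending `key` directly to `destination`."""
        self.in_flight[(key, destination)] = now

    def should_upload_to_cloud(self, key, clients_lacking, now):
        """Upload unless every client lacking the key is covered by a recent signal."""
        for client in clients_lacking:
            seen = self.in_flight.get((key, client))
            if seen is None or now - seen > self.defer_seconds:
                return True
        return False


planner = UploadPlanner(defer_seconds=300.0)
print(planner.should_upload_to_cloud("file1", {"laptopC"}, now=0))    # → True
planner.on_signal("file1", "laptopC", now=0)   # laptopB says it is sending to laptopC
print(planner.should_upload_to_cloud("file1", {"laptopC"}, now=10))   # → False
print(planner.should_upload_to_cloud("file1", {"laptopC"}, now=400))  # → True
```

The signal expires after the deferral window, so a stalled direct transfer only delays the cloud upload and does not cancel it.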