split off a page
parent
98453b83b2
commit
bc786b6f06
2 changed files with 77 additions and 44 deletions
doc/design/assistant
|
@ -1,28 +1,7 @@
|
||||||
Once files are added (or removed or moved), need to send those changes to
|
Once files are added (or removed or moved), need to send those changes to
|
||||||
all the other git clones, at both the git level and the key/value level.
|
all the other git clones, at both the git level and the key/value level.
|
||||||
|
|
||||||
## efficiency
|
## misc TODO
|
||||||
|
|
||||||
Currently after each file transfer (upload or download), a git sync is done
|
|
||||||
to all remotes. This is rather a lot of work, also it prevents collecting
|
|
||||||
presence changes to the git-annex branch into larger commits, which would
|
|
||||||
save disk space over time.
|
|
||||||
|
|
||||||
In many cases, this sync is necessary. For example, when a file is uploaded
|
|
||||||
to a transfer remote, the location change needs to be synced out so that
|
|
||||||
other clients know to grab it.
|
|
||||||
|
|
||||||
Or, when downloading a file from a drive, the sync lets other locally
|
|
||||||
paired repositories know we got it, so they can download it from us.
|
|
||||||
OTOH, this is also a case where a sync is sometimes unnecessary, since
|
|
||||||
if we're going to upload the file to them after getting it, the sync
|
|
||||||
only perhaps lets them start downloading it before our transfer queue
|
|
||||||
reaches a point where we'd upload it.
|
|
||||||
|
|
||||||
Do we need all the mapping stuff discussed below to know when we can avoid
|
|
||||||
syncs?
|
|
||||||
|
|
||||||
## TODO
|
|
||||||
|
|
||||||
* Test MountWatcher on LXDE.
|
* Test MountWatcher on LXDE.
|
||||||
* Add a hook, so when there's a change to sync, a program can be run
|
* Add a hook, so when there's a change to sync, a program can be run
|
||||||
|
@ -51,30 +30,11 @@ syncs?
|
||||||
and fall back to some other method -- either storing deferred downloads
|
and fall back to some other method -- either storing deferred downloads
|
||||||
on disk, or perhaps scheduling a TransferScanner run to get back into sync.
|
on disk, or perhaps scheduling a TransferScanner run to get back into sync.
|
||||||
|
|
||||||
## data syncing
|
## More efficient syncing
|
||||||
|
|
||||||
There are two parts to data syncing. First, map the network and second,
|
See [[syncing/efficiency]]
|
||||||
decide what to sync when.
|
|
||||||
|
|
||||||
Mapping the network can reuse code in `git annex map`. Once the map is
|
## TransferScanner efficiency
|
||||||
built, we want to find paths through the network that reach all nodes
|
|
||||||
eventually, with the least cost. This is a minimum spanning tree problem,
|
|
||||||
except with a directed graph, so really an arborescence problem.
|
|
||||||
|
|
||||||
With the map, we can determine which nodes to push new content to. Then we
|
|
||||||
need to control those data transfers, sending to the cheapest nodes first,
|
|
||||||
and with appropriate rate limiting and control facilities.
|
|
||||||
|
|
||||||
This probably will need lots of refinements to get working well.
|
|
||||||
|
|
||||||
### first pass: flood syncing
|
|
||||||
|
|
||||||
Before mapping the network, the best we can do is flood all files out to every
|
|
||||||
reachable remote. This is worth doing first, since it's the simplest way to
|
|
||||||
get the basic functionality of the assistant to work. And we'll need this
|
|
||||||
anyway.
|
|
||||||
|
|
||||||
## TransferScanner
|
|
||||||
|
|
||||||
The TransferScanner thread needs to find keys that need to be Uploaded
|
The TransferScanner thread needs to find keys that need to be Uploaded
|
||||||
to a remote, or Downloaded from it.
|
to a remote, or Downloaded from it.
|
||||||
|
|
73
doc/design/assistant/syncing/efficiency.mdwn
Normal file
|
@ -0,0 +1,73 @@
|
||||||
|
Currently, the git-annex assistant syncs with remotes in a way that is
|
||||||
|
dumb, and potentially inefficient:
|
||||||
|
|
||||||
|
1. Files are transferred to each reachable remote whose
|
||||||
|
[[preferred_content]] setting indicates it wants the file.
|
||||||
|
|
||||||
|
2. After each file transfer (upload or download), a git sync
|
||||||
|
is done to all the remotes, to update location log information.
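
The two steps above can be sketched roughly as follows. This is only an illustration in Python, not the assistant's actual code: `Remote`, `naive_distribute`, and the `wanted` set (a stand-in for evaluating a [[preferred_content]] expression) are all hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Remote:
    """Hypothetical model of a remote; not a git-annex data type."""
    name: str
    reachable: bool = True
    wanted: set = field(default_factory=set)  # stand-in for preferred content
    has: set = field(default_factory=set)     # files already present


def naive_distribute(filename, remotes):
    """Return the list of actions the current behaviour would perform."""
    actions = []
    for remote in remotes:
        # Step 1: transfer to each reachable remote that wants the file.
        if remote.reachable and filename in remote.wanted and filename not in remote.has:
            remote.has.add(filename)
            actions.append(("transfer", remote.name))
            # Step 2: after each transfer, git sync with all reachable remotes.
            for other in remotes:
                if other.reachable:
                    actions.append(("sync", other.name))
    return actions
```

With two reachable remotes that both want a file, this performs two transfers and four syncs, which is the cost the rest of this page tries to reduce.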
|
||||||
|
|
||||||
|
## unnecessary transfers
|
||||||
|
|
||||||
|
There are network topologies where #1 is massively inefficient.
|
||||||
|
For example:
|
||||||
|
|
||||||
|
<pre>
|
||||||
|
laptopA-----laptopB-----laptopC
|
||||||
|
\ | /
|
||||||
|
\---cloud based repo--/
|
||||||
|
</pre>
|
||||||
|
|
||||||
|
When laptopA has a new file, it will first send it to laptopB. It will then
|
||||||
|
check if the cloud based transfer repository wants a copy. It will, because
|
||||||
|
laptopC has not yet gotten a copy. So laptopA will proceed with a slow
|
||||||
|
upload to the cloud, while meanwhile laptopB is sending the file over fast
|
||||||
|
LAN to laptopC.
|
||||||
|
|
||||||
|
(The more common case with no laptopC happens to work efficiently.
|
||||||
|
So does the case where laptopA is paired with laptopC.)
|
||||||
|
|
||||||
|
## unnecessary syncing
|
||||||
|
|
||||||
|
Less importantly, the constant git syncing after each transfer is rather a
|
||||||
|
lot of work, and prevents collecting multiple presence changes to the git-annex
|
||||||
|
branch into larger commits, which would save disk space over time.
|
||||||
|
|
||||||
|
In many cases, this sync is necessary. For example, when a file is uploaded
|
||||||
|
to a transfer remote, the location change needs to be synced out so that
|
||||||
|
other clients know to grab it.
|
||||||
|
|
||||||
|
Or, when downloading a file from a drive, the sync lets other locally
|
||||||
|
paired repositories know we got it, so they can download it from us.
|
||||||
|
OTOH, this is also a case where a sync is sometimes unnecessary, since
|
||||||
|
if we're going to upload the file to them after getting it, the sync
|
||||||
|
only perhaps lets them start downloading it before our transfer queue
|
||||||
|
reaches a point where we'd upload it.
|
||||||
|
|
||||||
|
It would be good to find a way to detect when syncing is not immediately
|
||||||
|
necessary, and defer it.
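
One possible shape for such deferral is sketched below, under stated assumptions: the caller knows whether a change is urgent (for example, an upload to a transfer remote), and non-urgent changes can wait until a batch fills up. `DeferredSync`, `record`, `flush`, and `max_pending` are hypothetical names, not part of git-annex.

```python
class DeferredSync:
    """Collect location changes and sync them in batches (hypothetical sketch)."""

    def __init__(self, sync, max_pending=10):
        self.sync = sync                # callable that receives a list of changes
        self.max_pending = max_pending  # assumed batch size limit
        self.pending = []

    def record(self, key, remote, urgent=False):
        """Note that `key` is now present on `remote`; sync at once if urgent."""
        self.pending.append((key, remote))
        if urgent or len(self.pending) >= self.max_pending:
            self.flush()

    def flush(self):
        """Sync all pending changes as a single batch, if there are any."""
        if self.pending:
            batch, self.pending = self.pending, []
            self.sync(batch)
```

Batching this way would also let several presence changes land in one larger commit on the git-annex branch. How to decide that a change is urgent is the open question; the sketch just takes it as a flag.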
|
||||||
|
|
||||||
|
## mapping
|
||||||
|
|
||||||
|
Mapping the repository network has the potential to get git-annex the
|
||||||
|
information it needs to avoid unnecessary transfers and/or unnecessary
|
||||||
|
syncing.
|
||||||
|
|
||||||
|
Mapping the network can reuse code in `git annex map`. Once the map is
|
||||||
|
built, we want to find paths through the network that reach all nodes
|
||||||
|
eventually, with the least cost. This is a minimum spanning tree problem,
|
||||||
|
except with a directed graph, so really an arborescence problem.
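
To make the arborescence idea concrete, here is a minimal sketch of the Chu-Liu/Edmonds algorithm in Python. It only computes the total cost of the cheapest set of directed links that reaches every node from a root. The function name, the node numbering, and the link costs in the example are assumptions made for illustration; git-annex does not provide this.

```python
def min_arborescence_cost(n, root, edges):
    """Cost of a minimum arborescence rooted at `root`, or None if some node
    is unreachable. `edges` is a list of (source, target, cost) tuples over
    nodes numbered 0..n-1."""
    total = 0
    while True:
        # Pick the cheapest incoming edge for every node.
        in_w = [float("inf")] * n
        pre = [-1] * n
        for u, v, w in edges:
            if u != v and w < in_w[v]:
                in_w[v], pre[v] = w, u
        if any(v != root and in_w[v] == float("inf") for v in range(n)):
            return None
        in_w[root] = 0
        # Look for cycles among the chosen edges.
        count = 0
        idx = [-1] * n
        vis = [-1] * n
        for v in range(n):
            total += in_w[v]
            u = v
            while vis[u] != v and idx[u] == -1 and u != root:
                vis[u] = v
                u = pre[u]
            if u != root and idx[u] == -1:
                # Found a new cycle; give all its nodes one shared index.
                x = pre[u]
                while x != u:
                    idx[x] = count
                    x = pre[x]
                idx[u] = count
                count += 1
        if count == 0:
            return total
        # Contract each cycle into a single node and adjust edge costs.
        for v in range(n):
            if idx[v] == -1:
                idx[v] = count
                count += 1
        edges = [(idx[u], idx[v], w - in_w[v])
                 for u, v, w in edges if idx[u] != idx[v]]
        root = idx[root]
        n = count


# Example: 0=laptopA, 1=laptopB, 2=laptopC, 3=cloud; LAN links cost 1,
# cloud links cost 10 (made-up numbers).
LINKS = [(0, 1, 1), (1, 0, 1), (1, 2, 1), (2, 1, 1),
         (0, 3, 10), (1, 3, 10), (2, 3, 10),
         (3, 0, 10), (3, 1, 10), (3, 2, 10)]
```

With these made-up costs, `min_arborescence_cost(4, 0, LINKS)` is 12: laptopA sends to laptopB, laptopB to laptopC, and one laptop uploads to the cloud. A real implementation would also need to return the chosen links, not only their cost.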
|
||||||
|
|
||||||
|
A significant problem in mapping is that nodes are mobile: they can move
|
||||||
|
between networks over time. This breaks LAN based paths through the
|
||||||
|
network. Mapping would need a way to detect this. Note that individual
|
||||||
|
git-annex assistants can tell when they've switched networks by using the
|
||||||
|
`networkConnectedNotifier`.
|
||||||
|
|
||||||
|
## P2P signaling
|
||||||
|
|
||||||
|
Another approach that might help with these problems is if git-annex
|
||||||
|
repositories have a non-git, out-of-band signaling mechanism. This could,
|
||||||
|
for example, be used by laptopB to tell laptopA that it's trying to send
|
||||||
|
a file directly to laptopC. laptopA could then defer the upload to the
|
||||||
|
cloud for a while.
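
A rough sketch of how the receiving side of such signaling might decide to defer is given below. Everything here is hypothetical: the `TransferAnnouncements` class, the idea that an announcement expires after a `ttl` in seconds, and the notion of a list of clients still waiting for the file are assumptions for illustration, not an existing git-annex protocol.

```python
import time


class TransferAnnouncements:
    """Remember which peers announced direct transfers (hypothetical sketch)."""

    def __init__(self, ttl=300.0):
        self.ttl = ttl   # assumed lifetime of an announcement, in seconds
        self.seen = {}   # (key, destination) -> time the announcement arrived

    def announce(self, key, destination, now=None):
        """Record that some peer is sending `key` directly to `destination`."""
        self.seen[(key, destination)] = time.monotonic() if now is None else now

    def should_defer_cloud_upload(self, key, waiting_clients, now=None):
        """Defer only if every client still lacking `key` is covered by a
        fresh announcement; otherwise the cloud upload is still needed."""
        now = time.monotonic() if now is None else now
        return bool(waiting_clients) and all(
            now - self.seen.get((key, client), float("-inf")) <= self.ttl
            for client in waiting_clients
        )
```

In the example above, laptopA would call `should_defer_cloud_upload` with laptopC as the only waiting client after laptopB's announcement arrives. Once the announcement expires without the transfer completing, the upload to the cloud goes ahead as before.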
|