designing new git-annex-shell multi

This commit was supported by the NSF-funded DataLad project.
Joey Hess 2018-03-06 14:48:44 -04:00
parent dd63b4e744
commit f42baedd4c
3 changed files with 42 additions and 2 deletions

@@ -0,0 +1,33 @@
As shown by the benchmarks
*[[here|todo/speed_up_transfers_over_ssh+rsync_--_directly_reuse_the_same_connection__63__]]*,
there is some overhead for each file transfer to an ssh remote, to set up
the rsync connection. The idea is to extend git-annex-shell with a command
or commands that don't use rsync for transferring objects, and that can
handle transferring or otherwise operating on multiple objects inside a
single tcp session.

This new command might only be used when a transfer does not need to be
resumed; git-annex could fall back to rsync for resuming.

Of course, when talking with a git-annex-shell that does not support this
new command, git-annex would still need to fall back to the old commands
using rsync. And it should remember, for the session, that the remote
doesn't support the new command.

It could use sftp, but that seems rather difficult; it would need to lock
down sftp-server to only write annexed objects to the right repository.
And using sftp would mean that git-annex would have to figure out the
filenames to use for annexed objects in the remote repository, rather than
letting git-annex-shell on the remote work that out.
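
For illustration, an annexed object in a (non-bare) remote repository
lives at a path like the one below, where "xx" and "yy" stand in for two
hash directories derived from the key; the client would have to compute
all of this itself:

    .git/annex/objects/xx/yy/KEY/KEY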

So it seems better not to use sftp, and instead to roll our own simple
file transfer protocol.
So, "git-annex-shell -c multi" would speak a protocol over stdin/stdout
that essentially contains the commands inannex, lockcontent, dropkey,
recvkey, and sendkey.
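
A rough sketch of the messages such a protocol might contain, written as
a Haskell type (all names here are hypothetical, simply mirroring the
commands listed above):

    -- Hypothetical message set for the proposed protocol; the real
    -- design would likely reuse P2P.Protocol's Message type instead.
    newtype Key = Key String deriving (Show)
    newtype Len = Len Integer deriving (Show)

    data Message
        = INANNEX Key      -- query whether the key's content is present
        | LOCKCONTENT Key  -- prevent the key's content from being dropped
        | DROPKEY Key      -- remove the key's content
        | SENDKEY Key      -- ask the remote to send the key's content
        | RECVKEY Key Len  -- announce an upload of Len bytes of content
        | DATA Len         -- header for the Len bytes that follow it
        | SUCCESS
        | FAILURE
        deriving (Show)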

P2P.Protocol already contains just such a protocol, used over tor.
That protocol even supports resuming interrupted transfers.
It has some things, such as auth, that this wouldn't need, but it would
be good to unify with it as much as possible.
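
For example, resuming an interrupted upload works roughly like this in
that protocol (the message names come from P2P.Protocol; the sizes are
made up for illustration):

    client: PUT file key
    server: PUT-FROM 524288    (this much arrived before the interruption)
    client: DATA 524288
            (the remaining 524288 bytes of the content)
    server: SUCCESS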

@@ -1,6 +1,10 @@
-A sftp backend would be nice because gpg operations could be pipelined to the network transfer, not requiring the creation of a full file to disk with gpg before the network transmission, as it happens with rsync.
+A sftp special remote would be nice because gpg operations could be
+pipelined to the network transfer, not requiring the creation of a full
+file to disk with gpg before the network transmission, as it happens with
+the rsync special remote.
-There should be some libraries that can handle the sftp connections and transfers. I read that even curl has support for that.
+There should be some libraries that can handle the sftp connections and
+transfers. I read that even curl has support for that.
> Another reason to build this is that sftp has a `SFTP_FXP_STAT` request
> that can get disk free space information; for example: "echo df | sftp user@host"

@@ -32,3 +32,6 @@ ATM, even with ControlPersist=yes, on a fast interconnection between hosts (so i
both hosts do not show any high CPU load
+> [[closed|done]]; wrung out all the perf gains we can without
+> [[accellerate_ssh_remotes_with_git-annex-shell_mass_protocol]] --[[Joey]]