Merge branch 'master' into multifromto

Joey Hess 2018-10-02 10:17:46 -04:00
commit 6e4f73aa3d
GPG key ID: DB12DB0FF05F8F38
4 changed files with 173 additions and 0 deletions


@ -0,0 +1,120 @@
### Please describe the problem.
Our http://datasets.datalad.org has been serving git-annex repositories, some of them with content, via git's "dumb" HTTP support. For various reasons (performance, progress reporting by git upon clone) we want to switch to the [Smart HTTP](https://git-scm.com/book/en/v2/Git-on-the-Server-Smart-HTTP) `git-http-backend`. A sample deployment is at http://datasets-dev.datalad.org/.
I followed the docs to set it up and added only one more configuration tweak:
```
RewriteEngine On
RewriteCond "%{HTTP_USER_AGENT}" "(git)"
RewriteRule ^(.*)$ "/git/$1" [PT]
```
so that people can still browse the website in a browser, but whenever `git` accesses it, the request is directed to the `git-http-backend` CGI served under the `/git/` prefix (`ScriptAlias /git/ /usr/lib/git-core/git-http-backend/`).
Everything seems to work nicely on the git side, BUT I am having difficulty getting git-annex to retrieve annexed files from it:
### What version of git-annex are you using? On what operating system?
6.20180913+git149-g23bd27773-1~ndall+1
### Please provide any additional information below.
[[!format sh """
$> builtin cd /tmp/; rm -rf raiders; git clone http://datasets-dev.datalad.org/labs/haxby/raiders/ ; cd raiders; git annex get sub-rid000005/anat/sub-rid000005_run-01_T1w_defacemask.nii.gz
Cloning into 'raiders'...
remote: Counting objects: 17926, done.
remote: Compressing objects: 100% (7203/7203), done.
remote: Total 17926 (delta 7356), reused 15517 (delta 6237)
Receiving objects: 100% (17926/17926), 1.23 MiB | 6.53 MiB/s, done.
Resolving deltas: 100% (7356/7356), done.
README.md masks/ stimulus/ sub-rid000014/ sub-rid000028/ sub-rid000038/ task-raiders_bold.json
dataset_description.json scripts/ sub-rid000005/ sub-rid000015/ sub-rid000029/ sub-rid000042/
derivatives/ sourcedata/ sub-rid000011/ sub-rid000020/ sub-rid000033/ sub-rid000043/
(merging origin/git-annex into git-annex...)
(recording state in git...)
get sub-rid000005/anat/sub-rid000005_run-01_T1w_defacemask.nii.gz download failed: Not Found
Remote origin not usable by git-annex; setting annex-ignore
(not available)
Try making some of these repositories available:
41e5039d-1750-43d2-8bea-89897d969326 -- /mnt/datasets/datalad/crawl/labs/haxby/raiders
87d7db62-683d-43b2-b594-baeb420ae7a6 -- .
afde6679-1f2f-41f2-935a-93e7e3d70274 -- nastase@head1:~/BIDS/haxby/raiders
de53ce43-2c07-4971-8de8-0445c596f7dc -- datalad-public-ro
(Note that these git remotes have annex-ignore set: origin)
failed
(recording state in git...)
git-annex: get: 1 failed
"""]]
This fails because the `config` file is under the `.git/` subdirectory there, and git-annex doesn't try to access it at all to deduce the UUID, so it marks origin as annex-ignore.
But if I add the `.git` suffix to the URL, then:
[[!format sh """
(git)hopa:/tmp/raiders[master]
$> builtin cd /tmp/; rm -rf raiders; git clone http://datasets-dev.datalad.org/labs/haxby/raiders/.git/ ; cd raiders; git annex get sub-rid000005/anat/sub-rid000005_run-01_T1w_defacemask.nii.gz
Cloning into 'raiders'...
remote: Counting objects: 17926, done.
remote: Compressing objects: 100% (7203/7203), done.
remote: Total 17926 (delta 7356), reused 15517 (delta 6237)
Receiving objects: 100% (17926/17926), 1.23 MiB | 5.08 MiB/s, done.
Resolving deltas: 100% (7356/7356), done.
README.md masks/ stimulus/ sub-rid000014/ sub-rid000028/ sub-rid000038/ task-raiders_bold.json
dataset_description.json scripts/ sub-rid000005/ sub-rid000015/ sub-rid000029/ sub-rid000042/
derivatives/ sourcedata/ sub-rid000011/ sub-rid000020/ sub-rid000033/ sub-rid000043/
(merging origin/git-annex into git-annex...)
(recording state in git...)
get sub-rid000005/anat/sub-rid000005_run-01_T1w_defacemask.nii.gz (from origin...)
download failed: Not Found
download failed: Not Found
Unable to access these remotes: origin
Try making some of these repositories available:
41e5039d-1750-43d2-8bea-89897d969326 -- /mnt/datasets/datalad/crawl/labs/haxby/raiders
87d7db62-683d-43b2-b594-baeb420ae7a6 -- .
afde6679-1f2f-41f2-935a-93e7e3d70274 -- nastase@head1:~/BIDS/haxby/raiders
de53ce43-2c07-4971-8de8-0445c596f7dc -- datalad-public-ro [origin]
failed
(recording state in git...)
git-annex: get: 1 failed
"""]]
because it fails to find the file at either of the two locations it tries under `.git/annex/objects`. Here is the Apache log:
```
10.31.191.134 - - [01/Oct/2018:13:01:58 -0400] "GET /labs/haxby/raiders/.git//config HTTP/1.1" 206 501 "-" "-"
10.31.191.134 - - [01/Oct/2018:13:01:58 -0400] "GET /labs/haxby/raiders/.git//annex/objects/Z8/f1/MD5E-s41438--06c245e709e7d40a90ed48c6c3b58295.nii.gz/MD5E-s41438--06c245e709e7d40a90ed48c6c3b58295.nii.gz HTTP/1.1" 404 243 "-" "git-annex/6.20180913+git149-g23bd27773-1~ndall+1"
10.31.191.134 - - [01/Oct/2018:13:01:58 -0400] "GET /labs/haxby/raiders/.git//annex/objects/681/5d0/MD5E-s41438--06c245e709e7d40a90ed48c6c3b58295.nii.gz/MD5E-s41438--06c245e709e7d40a90ed48c6c3b58295.nii.gz HTTP/1.1" 404 243 "-" "git-annex/6.20180913+git149-g23bd27773-1~ndall+1"
```
where it seems to assume a different layout:
[[!format sh """
$> ls -dl $webroot//labs/haxby/raiders/.git/annex/objects/*/*/MD5E-s41438--06c245e709e7d40a90ed48c6c3b58295.nii.gz
drwxrwsr-x 1 yoh datalad 104 Sep 26 2016 /mnt/btrfs/manual-snapshots/srv-20180928/datasets.datalad.org/www///labs/haxby/raiders/.git/annex/objects/Z8/f1/MD5E-s41438--06c245e709e7d40a90ed48c6c3b58295.nii.gz/
"""]]
which is the layout git-annex uses when working with dumb HTTP:
```
10.31.191.134 - - [01/Oct/2018:13:09:53 -0400] "GET /labs/haxby/raiders/.git//config HTTP/1.1" 206 501 "-" "-"
10.31.191.134 - - [01/Oct/2018:13:09:53 -0400] "GET /labs/haxby/raiders/.git//annex/objects/Z8/f1/MD5E-s41438--06c245e709e7d40a90ed48c6c3b58295.nii.gz/MD5E-s41438--06c245e709e7d40a90ed48c6c3b58295.nii.gz HTTP/1.1" 200 41679 "-" "git-annex/6.20180913+git149-g23bd27773-1~ndall+1"
```
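The two object requests in the first log look like git-annex trying two hash directory layouts for the same key: a mixed-case one (`Z8/f1`, matching the directory on disk) and a lowercase one (`681/5d0`). The sketch below is a rough illustration, not git-annex's code. It assumes the lowercase layout is the first six hex digits of the MD5 of the key, split 3/3, as the git-annex internals documentation describes it; the function names `hashdir_lower` and `object_url` are made up for this example. The mixed-case layout uses a different encoding and is not reproduced here.

```python
import hashlib


def hashdir_lower(key: str) -> str:
    """Two-level directory built from the first six hex digits of the key's MD5.

    Assumption: this mirrors the lowercase layout described in the
    git-annex internals documentation.
    """
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return f"{digest[:3]}/{digest[3:6]}"


def object_url(repo_url: str, hashdir: str, key: str) -> str:
    """URL of an annexed object: <repo>/annex/objects/<hashdir>/<key>/<key>."""
    # The trailing slash is stripped here, so unlike the logged requests
    # the result has no "//" before "annex".
    return f"{repo_url.rstrip('/')}/annex/objects/{hashdir}/{key}/{key}"


if __name__ == "__main__":
    key = "MD5E-s41438--06c245e709e7d40a90ed48c6c3b58295.nii.gz"
    repo = "http://datasets-dev.datalad.org/labs/haxby/raiders/.git/"
    # Candidate URLs in the order seen in the log: mixed-case first, then lowercase.
    for hashdir in ("Z8/f1", hashdir_lower(key)):
        print(object_url(repo, hashdir, key))
```

If the assumption holds, both requests are expected behavior, and the 404 on the `Z8/f1` path (which exists on disk) points at the server side rather than at a layout mismatch.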
So I wonder whether I need to do something on my end in configuring apache2, or whether something could/should be done on the git-annex side. Ideally I would like to be able to clone them without adding the `.git/` suffix to the URL.
Also note that `git-annex` does not seem to send any User-Agent value when accessing the `config` file:
```
10.31.191.134 - - [01/Oct/2018:13:12:45 -0400] "GET /labs/haxby/raiders/.git//config HTTP/1.1" 206 501 "-" "-"
```
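One possible reading of these logs (an assumption, not something confirmed in this report): the `config` request carries no User-Agent, so it bypasses the rewrite rule and is served as a static file, while the object requests carry `git-annex/...`, which matches the unanchored `(git)` pattern and would be handed to `git-http-backend`, which does not serve `annex/objects`. The Python sketch below only emulates the regex match of the `RewriteCond`. The `^git/` pattern is a hypothetical stricter alternative, and the git and browser agent strings are illustrative.

```python
import re

# Pattern from the RewriteCond above; Apache applies it as an unanchored regex.
REWRITE_PATTERN = re.compile(r"(git)")

# Hypothetical stricter pattern: stock git identifies itself as "git/<version>".
STRICT_PATTERN = re.compile(r"^git/")


def routed_to_backend(user_agent: str, pattern: re.Pattern = REWRITE_PATTERN) -> bool:
    """Return True if a request with this User-Agent would be rewritten to /git/."""
    return pattern.search(user_agent) is not None


if __name__ == "__main__":
    agents = [
        "git/2.19.0",  # illustrative git client version
        "git-annex/6.20180913+git149-g23bd27773-1~ndall+1",  # from the log
        "",  # logged as "-": no User-Agent header sent
        "Mozilla/5.0",  # illustrative browser
    ]
    for ua in agents:
        print(repr(ua), routed_to_backend(ua), routed_to_backend(ua, STRICT_PATTERN))
```

With the original pattern, both `git/...` and `git-annex/...` are rewritten; with the stricter one, only `git/...` is, which would leave git-annex's object requests to the static file handler.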


@ -0,0 +1,7 @@
Started work on <http://git-annex.branchable.com/todo/to_and_from_multiple_remotes/>.
It's going slowly; I had to start with a large refactoring. So far, option
parsing is working, and a few commands are almost working, but concurrency
is not working right, and concurrency is the main reason to want to support
this (along with remote groups).
Today's work was supported by Jake Vosloo [on Patreon](https://patreon.com/joeyh).


@ -0,0 +1,9 @@
[[!comment format=mdwn
username="yarikoptic"
avatar="http://cdn.libravatar.org/avatar/f11e9c84cb18d26a1748c33b48c924b4"
subject="comment 2"
date="2018-10-01T16:29:19Z"
content="""
I think that is correct.
But isn't `annex init` also indirectly invoked by any annex command, e.g. if I just do `git clone URL ; cd DIR; git annex get FILEs`?
"""]]


@ -23,8 +23,45 @@ a remote. That risks using a lot of memory if one remote is very slow.
The queue would need to be capped at some amount, and when full,
delay until the laggard remotes catch up.
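A minimal sketch of the capped-queue idea, in Python rather than git-annex's Haskell. It assumes one bounded queue and one worker per remote; `send_to_all`, `send`, and `cap` are hypothetical names, not git-annex APIs.

```python
import queue
import threading


def send_to_all(keys, remotes, send, cap=2):
    """Send every key to every remote, holding at most `cap` pending keys per remote."""
    queues = {remote: queue.Queue(maxsize=cap) for remote in remotes}
    sent = {remote: [] for remote in remotes}

    def worker(remote):
        q = queues[remote]
        while True:
            key = q.get()
            if key is None:  # sentinel: no more work for this remote
                return
            send(remote, key)
            sent[remote].append(key)

    threads = [threading.Thread(target=worker, args=(r,)) for r in remotes]
    for t in threads:
        t.start()
    for key in keys:
        for q in queues.values():
            q.put(key)  # blocks while this remote's queue is full
    for q in queues.values():
        q.put(None)
    for t in threads:
        t.join()
    return sent
```

Because `put` blocks when a queue is full, the producer stalls until the slowest remote drains an item, which bounds memory at roughly `cap` keys per remote.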
Bonus: git annex sync --content -J2 already works, but it has the same
problem described above and ought to be able to be fixed the same way.
Also worth noting that with --auto and -J, git-annex may make more transfers
than preferred content settings demand, because it will start several
transfers to different remotes at once. If only one copy is needed
among all the remotes, it won't notice and a copy will be sent to all
remotes. I think this is something the user can understand though?
----
Looking at commands that support --to and --from and what each should do,
there is a lot of diversity.
git annex move --to of course removes the local copy. So if moving to
multiple remotes it would need to delay that removal until it's sent to all
of them. And it really only ought to try to remove the local copy once
at the end, not once per remote moved to.
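The ordering described for `move --to` can be sketched as follows. This is an illustration only: `move_to_remotes`, `upload`, and `drop_local` are hypothetical callables, and keeping the local copy when any upload fails is an assumption about the desired behavior.

```python
def move_to_remotes(key, remotes, upload, drop_local):
    """Upload key to every remote; drop the local copy once, only if all succeeded."""
    # Attempt every upload before deciding anything about the local copy.
    results = {remote: upload(remote, key) for remote in remotes}
    if results and all(results.values()):
        drop_local(key)  # a single removal attempt, after the last upload
        return True
    return False  # assumption: keep the local copy if any upload failed
```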
git annex move --from ought to spread the load among remotes with -Jn,
and once a file is downloaded, it needs to try to remove it from all the
remotes.
git annex mirror --from mirrors one remote; mirroring from
multiple remotes does not really make any sense. mirror --to multiple
could be done.
git annex unused --from seems unlikely to make sense with multiple remotes,
since it would result in a list of keys distributed among them, and what
would be done with that? Perhaps a git annex drop --from multiple remotes,
but that would be inefficient. A shell script looping over remotes makes
more sense if the user wants to drop unused from multiple remotes.
git annex get/fsck/copy/export/transferkey/drop/dropunused all make sense to
support multiple remotes. But with -Jn the operations that get files behave
differently than the operations that drop files. The gets want to balance
load among the remotes, while the drops and uploads need to run each
action over each remote.
Seems two runners are needed with different concurrency behavior, one that
balances the load among remotes, and one that runs the same action against
multiple remotes concurrently.
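The difference between the two runners can be sketched in Python as below. This is not git-annex's design: `run_balanced` and `run_on_all` are hypothetical names, `jobs` stands in for -Jn, and round-robin assignment is a deliberately simple stand-in for real load balancing.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import cycle


def _run(pairs, action, jobs):
    """Run action(remote, key) for each (key, remote) pair on a pool of `jobs` threads."""
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        results = list(pool.map(lambda pair: action(pair[1], pair[0]), pairs))
    return dict(zip(pairs, results))


def run_balanced(keys, remotes, action, jobs=2):
    """Get-style runner: each key is handled by exactly one remote (round-robin)."""
    return _run(list(zip(keys, cycle(remotes))), action, jobs)


def run_on_all(keys, remotes, action, jobs=2):
    """Drop/upload-style runner: every key is handled by every remote."""
    return _run([(key, remote) for key in keys for remote in remotes], action, jobs)
```

Both runners share the same thread pool logic and differ only in how (key, remote) pairs are generated, which suggests the distinction could live in the work list rather than in the concurrency machinery.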