Merge branch 'master' into multifromto
commit 6e4f73aa3d
4 changed files with 173 additions and 0 deletions
@@ -0,0 +1,120 @@
### Please describe the problem.

Our http://datasets.datalad.org has been providing git-annex repos, some of them with content, via git's "dumb" HTTP support. For various reasons (performance, progress reporting by git upon clone) we want to switch to the [Smart HTTP](https://git-scm.com/book/en/v2/Git-on-the-Server-Smart-HTTP) `git-http-backend` backend. A sample deployment is at http://datasets-dev.datalad.org/.

I followed the docs to set it up and added just one more configuration tweak:

```
RewriteEngine On

RewriteCond "%{HTTP_USER_AGENT}" "(git)"

RewriteRule ^(.*)$ "/git/$1" [PT]
```

so that people can still browse the website in a browser, but whenever `git` accesses it, the request is passed to the `git-http-backend` CGI served under the `/git/` prefix (`ScriptAlias /git/ /usr/lib/git-core/git-http-backend/`).
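
To make the intent of the rule concrete, here is a minimal Python sketch of the decision it encodes. It is an illustration, not Apache's implementation: `rewrite` is a hypothetical helper, the sample User-Agent strings are made up, and the handling of the path's leading slash (which depends on where the rule lives in the Apache config) is ignored.

```python
import re

# Mirrors RewriteCond "%{HTTP_USER_AGENT}" "(git)": an unanchored regex search.
GIT_AGENT = re.compile(r"(git)")


def rewrite(path, user_agent):
    """Return the path a request would be served from under the rule above."""
    if GIT_AGENT.search(user_agent or ""):
        # RewriteRule ^(.*)$ "/git/$1" [PT]: prefix the whole matched path.
        return "/git/" + path
    # No match: the request is left alone and served as before.
    return path


# Hypothetical requests, for illustration only.
print(rewrite("labs/haxby/raiders/info/refs", "git/2.11.0"))
print(rewrite("labs/haxby/raiders/", "Mozilla/5.0"))
```

Because the condition is an unanchored match, any agent string containing `git` anywhere takes the first branch.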

Everything seems to work nicely on the git side, BUT I am having difficulty getting git-annex to fetch annexed files from it:

### What version of git-annex are you using? On what operating system?

6.20180913+git149-g23bd27773-1~ndall+1


### Please provide any additional information below.

[[!format sh """
$> builtin cd /tmp/; rm -rf raiders; git clone http://datasets-dev.datalad.org/labs/haxby/raiders/ ; cd raiders; git annex get sub-rid000005/anat/sub-rid000005_run-01_T1w_defacemask.nii.gz
Cloning into 'raiders'...
remote: Counting objects: 17926, done.
remote: Compressing objects: 100% (7203/7203), done.
remote: Total 17926 (delta 7356), reused 15517 (delta 6237)
Receiving objects: 100% (17926/17926), 1.23 MiB | 6.53 MiB/s, done.
Resolving deltas: 100% (7356/7356), done.
README.md masks/ stimulus/ sub-rid000014/ sub-rid000028/ sub-rid000038/ task-raiders_bold.json
dataset_description.json scripts/ sub-rid000005/ sub-rid000015/ sub-rid000029/ sub-rid000042/
derivatives/ sourcedata/ sub-rid000011/ sub-rid000020/ sub-rid000033/ sub-rid000043/
(merging origin/git-annex into git-annex...)
(recording state in git...)
get sub-rid000005/anat/sub-rid000005_run-01_T1w_defacemask.nii.gz download failed: Not Found

Remote origin not usable by git-annex; setting annex-ignore
(not available)
Try making some of these repositories available:
41e5039d-1750-43d2-8bea-89897d969326 -- /mnt/datasets/datalad/crawl/labs/haxby/raiders
87d7db62-683d-43b2-b594-baeb420ae7a6 -- .
afde6679-1f2f-41f2-935a-93e7e3d70274 -- nastase@head1:~/BIDS/haxby/raiders
de53ce43-2c07-4971-8de8-0445c596f7dc -- datalad-public-ro

(Note that these git remotes have annex-ignore set: origin)
failed
(recording state in git...)
git-annex: get: 1 failed
"""]]

fails because the `config` file is under the `.git/` subdirectory there, and git-annex doesn't try to access it at all to deduce the UUID, so it marks origin as annex-ignore.

But if I add the `.git` suffix to the URL, then:

[[!format sh """
(git)hopa:/tmp/raiders[master]
$> builtin cd /tmp/; rm -rf raiders; git clone http://datasets-dev.datalad.org/labs/haxby/raiders/.git/ ; cd raiders; git annex get sub-rid000005/anat/sub-rid000005_run-01_T1w_defacemask.nii.gz
Cloning into 'raiders'...
remote: Counting objects: 17926, done.
remote: Compressing objects: 100% (7203/7203), done.
remote: Total 17926 (delta 7356), reused 15517 (delta 6237)
Receiving objects: 100% (17926/17926), 1.23 MiB | 5.08 MiB/s, done.
Resolving deltas: 100% (7356/7356), done.
README.md masks/ stimulus/ sub-rid000014/ sub-rid000028/ sub-rid000038/ task-raiders_bold.json
dataset_description.json scripts/ sub-rid000005/ sub-rid000015/ sub-rid000029/ sub-rid000042/
derivatives/ sourcedata/ sub-rid000011/ sub-rid000020/ sub-rid000033/ sub-rid000043/
(merging origin/git-annex into git-annex...)
(recording state in git...)
get sub-rid000005/anat/sub-rid000005_run-01_T1w_defacemask.nii.gz (from origin...)
download failed: Not Found
download failed: Not Found

Unable to access these remotes: origin

Try making some of these repositories available:
41e5039d-1750-43d2-8bea-89897d969326 -- /mnt/datasets/datalad/crawl/labs/haxby/raiders
87d7db62-683d-43b2-b594-baeb420ae7a6 -- .
afde6679-1f2f-41f2-935a-93e7e3d70274 -- nastase@head1:~/BIDS/haxby/raiders
de53ce43-2c07-4971-8de8-0445c596f7dc -- datalad-public-ro [origin]
failed
(recording state in git...)
git-annex: get: 1 failed
"""]]

because it fails to find those two files under `.git/annex/objects`. Here is the Apache log:

```
10.31.191.134 - - [01/Oct/2018:13:01:58 -0400] "GET /labs/haxby/raiders/.git//config HTTP/1.1" 206 501 "-" "-"

10.31.191.134 - - [01/Oct/2018:13:01:58 -0400] "GET /labs/haxby/raiders/.git//annex/objects/Z8/f1/MD5E-s41438--06c245e709e7d40a90ed48c6c3b58295.nii.gz/MD5E-s41438--06c245e709e7d40a90ed48c6c3b58295.nii.gz HTTP/1.1" 404 243 "-" "git-annex/6.20180913+git149-g23bd27773-1~ndall+1"

10.31.191.134 - - [01/Oct/2018:13:01:58 -0400] "GET /labs/haxby/raiders/.git//annex/objects/681/5d0/MD5E-s41438--06c245e709e7d40a90ed48c6c3b58295.nii.gz/MD5E-s41438--06c245e709e7d40a90ed48c6c3b58295.nii.gz HTTP/1.1" 404 243 "-" "git-annex/6.20180913+git149-g23bd27773-1~ndall+1"
```

where it seems to assume a different layout:

[[!format sh """
$> ls -dl $webroot//labs/haxby/raiders/.git/annex/objects/*/*/MD5E-s41438--06c245e709e7d40a90ed48c6c3b58295.nii.gz
drwxrwsr-x 1 yoh datalad 104 Sep 26 2016 /mnt/btrfs/manual-snapshots/srv-20180928/datasets.datalad.org/www///labs/haxby/raiders/.git/annex/objects/Z8/f1/MD5E-s41438--06c245e709e7d40a90ed48c6c3b58295.nii.gz/
"""]]

which git-annex assumes when working with the dumb HTTP:

```
10.31.191.134 - - [01/Oct/2018:13:09:53 -0400] "GET /labs/haxby/raiders/.git//config HTTP/1.1" 206 501 "-" "-"

10.31.191.134 - - [01/Oct/2018:13:09:53 -0400] "GET /labs/haxby/raiders/.git//annex/objects/Z8/f1/MD5E-s41438--06c245e709e7d40a90ed48c6c3b58295.nii.gz/MD5E-s41438--06c245e709e7d40a90ed48c6c3b58295.nii.gz HTTP/1.1" 200 41679 "-" "git-annex/6.20180913+git149-g23bd27773-1~ndall+1"
```
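
The object requests in these logs share one shape: the repository URL, then `annex/objects/`, a two-level hash directory, and the key repeated as both directory and file name. A minimal Python sketch of that URL construction, inferred only from the log lines (the function name `object_urls` is hypothetical, and the hash directories are copied from the log, not computed):

```python
def object_urls(repo_url, key, hash_dirs):
    """Build one candidate URL per hash directory layout for an annexed key."""
    # The logged requests contain a double slash (".git//annex"); this sketch
    # normalizes it to a single slash.
    base = repo_url.rstrip("/")
    return [f"{base}/annex/objects/{h}/{key}/{key}" for h in hash_dirs]


key = "MD5E-s41438--06c245e709e7d40a90ed48c6c3b58295.nii.gz"
for url in object_urls(
    "http://datasets-dev.datalad.org/labs/haxby/raiders/.git/",
    key,
    ["Z8/f1", "681/5d0"],  # the two hash directories seen in the log
):
    print(url)
```

Only the `Z8/f1` variant corresponds to a directory that exists on disk in the `ls` output above.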

So I wonder whether I need to change something in my apache2 configuration, or whether something could/should be done on the git-annex side. Ideally I would like to be able to just clone them without adding the `.git/` suffix to the URL.

Also note that `git-annex` does not seem to send any user agent value when accessing the `config` file:

```
10.31.191.134 - - [01/Oct/2018:13:12:45 -0400] "GET /labs/haxby/raiders/.git//config HTTP/1.1" 206 501 "-" "-"
```
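
A guess worth checking, not something confirmed here: the `(git)` condition in the rewrite rule is an unanchored match, so it may also catch the `git-annex/...` agent used for the object requests, while the agent-less `config` request slips past it. The Python sketch below only tests the pattern against the two agent values from the access logs; `matches_rule` is a hypothetical helper, and whether this explains the 404 responses is an assumption.

```python
import re

# Same pattern as in RewriteCond "%{HTTP_USER_AGENT}" "(git)".
condition = re.compile(r"(git)")

# User-Agent fields copied from the access log; "-" means none was sent.
agents = {
    "config request": "-",
    "annex object request": "git-annex/6.20180913+git149-g23bd27773-1~ndall+1",
}


def matches_rule(agent):
    """True if the rewrite condition would match this User-Agent value."""
    return condition.search(agent) is not None


for label, agent in agents.items():
    outcome = "matches the condition" if matches_rule(agent) else "does not match"
    print(f"{label}: {outcome}")
```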
doc/devblog/day_522__multi.mdwn
@@ -0,0 +1,7 @@
Started work on <http://git-annex.branchable.com/todo/to_and_from_multiple_remotes/>.
It's going slowly; I had to start with a large refactoring. So far, option
parsing is working, and a few commands are almost working, but concurrency
is not working right, and concurrency is the main reason to want to support
this (along with remote groups).

Today's work was supported by Jake Vosloo [on Patreon](https://patreon.com/joeyh).
@@ -0,0 +1,9 @@
[[!comment format=mdwn
username="yarikoptic"
avatar="http://cdn.libravatar.org/avatar/f11e9c84cb18d26a1748c33b48c924b4"
subject="comment 2"
date="2018-10-01T16:29:19Z"
content="""
I think that is correct.
But isn't `annex init` also indirectly invoked by any annex command, e.g. if I just do `git clone URL ; cd DIR; git annex get FILEs`?
"""]]
@@ -23,8 +23,45 @@ a remote. That risks using a lot of memory if one remote is very slow.
The queue would need to be capped at some amount, and when full,
delay until the laggard remotes catch up.
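
A rough Python sketch of such a cap, assuming one bounded queue per remote (the per-remote layout, `feed_remotes`, `transfer` and `cap` are all hypothetical names chosen for illustration; git-annex itself is written in Haskell and may structure this differently):

```python
import queue
import threading


def feed_remotes(keys, remotes, transfer, cap=2):
    """Send every key to every remote, holding at most `cap` queued keys per remote."""
    queues = {r: queue.Queue(maxsize=cap) for r in remotes}

    def worker(remote):
        while True:
            key = queues[remote].get()
            if key is None:  # sentinel: no more work for this remote
                return
            transfer(key, remote)

    threads = [threading.Thread(target=worker, args=(r,)) for r in remotes]
    for t in threads:
        t.start()
    for key in keys:
        for q in queues.values():
            # put() blocks while a queue is full, so the producer waits for
            # the slowest remote instead of buffering without bound.
            q.put(key)
    for q in queues.values():
        q.put(None)
    for t in threads:
        t.join()
```

The blocking `put` is what bounds memory: the producer can never run more than `cap` keys ahead of the slowest remote.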

Bonus: git annex sync --content -J2 already works, but it has the same
problem described above and ought to be able to be fixed the same way.

Also worth noting that with --auto and -J, git-annex may make more transfers
than preferred content settings demand, because it will start several
transfers to different remotes at once. If only one copy is needed
among all the remotes, it won't notice and a copy will be sent to all
remotes. I think this is something the user can understand though?

----

Looking at commands that support --to and --from and what each should do,
there is a lot of diversity.

git annex move --to of course removes the local copy. So if moving to
multiple remotes it would need to delay that removal until it's sent to all
of them. And it really only ought to try to remove the local copy once
at the end, not once per remote moved to.
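
The ordering described for move --to can be sketched in Python as follows. This is only an illustration of "send to all, then drop once"; `move_to_remotes`, `send` and `drop_local` are hypothetical names, and the real command would also need numcopies checks that are left out here.

```python
def move_to_remotes(key, remotes, send, drop_local):
    """Send key to every remote, then drop the local copy once if all sends worked."""
    # send(key, remote) is assumed to return True on success.
    results = {remote: send(key, remote) for remote in remotes}
    if results and all(results.values()):
        # A single removal at the end, not one attempt per remote.
        drop_local(key)
        return True
    # At least one remote did not get the key: keep the local copy.
    return False
```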

git annex move --from ought to spread the load among remotes with -Jn,
and once a file is downloaded, it needs to try to remove it from all the
remotes.

git annex mirror --from mirrors one remote; mirroring from
multiple remotes does not really make any sense. mirror --to multiple
could be done.

git annex unused --from seems unlikely to make sense with multiple remotes,
since it would result in a list of keys distributed among them, and what
would be done with that? Perhaps a git annex drop --from multiple remotes,
but that would be inefficient. A shell script looping over remotes makes
more sense if the user wants to drop unused from multiple remotes.

git annex get/fsck/copy/export/transferkey/drop/dropunused all make sense to
support multiple remotes. But with -Jn the operations that get files behave
differently than the operations that drop files. The gets want to balance
load among the remotes, while the drops and uploads need to run each
action over each remote.

Seems two runners are needed with different concurrency behavior, one that
balances the load among remotes, and one that runs the same action against
multiple remotes concurrently.
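
To illustrate the difference between the two runners, here is a small Python sketch using threads. It is a toy model under stated assumptions, not the planned implementation: `run_balanced`, `run_fanout` and the `action(key, remote)` callable are hypothetical names.

```python
import queue
from concurrent.futures import ThreadPoolExecutor


def run_balanced(keys, remotes, action):
    """Each key is handled by exactly one remote; an idle remote pulls the next key."""
    pending = queue.Queue()
    for key in keys:
        pending.put(key)
    done = []  # (key, remote) pairs, in completion order

    def worker(remote):
        while True:
            try:
                key = pending.get_nowait()
            except queue.Empty:
                return
            action(key, remote)
            done.append((key, remote))

    with ThreadPoolExecutor(max_workers=len(remotes)) as pool:
        futures = [pool.submit(worker, remote) for remote in remotes]
        for f in futures:
            f.result()  # re-raise any error from a worker
    return done


def run_fanout(keys, remotes, action):
    """Every key is handled by every remote, several actions running at once."""
    pairs = [(key, remote) for key in keys for remote in remotes]
    with ThreadPoolExecutor(max_workers=len(remotes)) as pool:
        futures = [pool.submit(action, key, remote) for key, remote in pairs]
        for f in futures:
            f.result()
    return pairs
```

With the balanced runner a fast remote naturally ends up serving more keys, which suits gets; the fan-out runner performs `len(keys) * len(remotes)` actions, which suits drops and uploads.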