Merge branch 'master' of ssh://git-annex.branchable.com

Joey Hess 2020-06-30 12:27:19 -04:00
commit 1d335520df
GPG key ID: DB12DB0FF05F8F38
4 changed files with 99 additions and 0 deletions


@@ -0,0 +1,47 @@
When debugging some ssh-related datalad tests that hang with newer
git-annex versions, I noticed that there was a regression in the
treatment of annex-ssh-options in c8fec6ab0 (Fix a minor bug that
caused options provided with -c to be passed multiple times to git,
2020-03-16).

Here's a demo script. Pointing `SSHURL` to any ssh-accessible annex
repo should do. In the case below, the target is an annex repo with
one commit and no files in the working tree.

[[!format sh """
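# Any ssh-accessible annex repo should work here.
SSHURL="smaug:/home/kyle/scratch/repo"
# Work in a fresh temporary directory.
cd "$(mktemp -d "${TMPDIR:-/tmp}"/gx-ssh-opts-XXXXXXX)"
git clone "$SSHURL" ./ >/dev/null 2>&1
# Disable built-in ssh caching and supply custom ssh options, then show
# the ssh invocations that git-annex reports in its debug output.
git annex init \
    -c annex.sshcaching=false \
    -c remote.origin.annex-ssh-options="-o ControlMaster=auto -S CACHE" \
    --debug 2>&1 | grep 'read: ssh'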
"""]]
With the parent of the above commit checked out (b166223d4), the
script outputs
```
[2020-06-30 11:09:43.853918422] read: ssh ["smaug","-o","ControlMaster=auto","-S","CACHE","-n","-T","git-annex-shell 'configlist' '/home/kyle/scratch/repo' '--debug'"]
```
With c8fec6ab0 checked out, it outputs
```
[2020-06-30 11:11:03.833678263] read: ssh ["smaug","-S",".git/annex/ssh/smaug","-o","ControlMaster=auto","-o","ControlPersist=yes","-n","-T","git-annex-shell 'configlist' '/home/kyle/scratch/repo' '--debug'"]
[2020-06-30 11:11:04.448046366] read: ssh ["-O","stop","-S","smaug","-o","ControlMaster=auto","-o","ControlPersist=yes","localhost"]
```

It looks like the options specified via
`remote.origin.annex-ssh-options` are dropped, and git-annex switches
to using its built-in ssh caching.

A recent commit on master (95b8b4a5a) shows the same behavior.

I've tried to work through the config-related handling and understand
why the condition from c8fec6ab0 results in the ssh options being
dropped, but I haven't made any progress yet.
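
The comparison between the two debug outputs can also be checked mechanically. The following is a minimal sketch, not part of git-annex: `has_ssh_opts` is a hypothetical helper that reads `--debug` output on stdin and succeeds only if the first `read: ssh` line contains the given option and value as adjacent arguments.

```sh
#!/bin/sh
# Hypothetical helper (not part of git-annex): succeed if the first
# `read: ssh` debug line on stdin contains option $1 immediately
# followed by value $2, e.g. "-S","CACHE".
has_ssh_opts() {
    # Print only the first matching line, then stop reading.
    sed -n '/read: ssh/{p;q;}' |
        # Fixed-string match on the quoted, comma-separated argument pair.
        grep -F -q -- "\"$1\",\"$2\""
}
```

Assuming the demo script's pipeline, replacing the final `grep 'read: ssh'` with `has_ssh_opts -S CACHE && echo ok` should print `ok` only when the configured options reach ssh.
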
[[!meta author=kyle]]
[[!tag projects/datalad]]


@@ -0,0 +1,12 @@
[[!comment format=mdwn
username="https://bmwiedemann.zq1.de/"
nickname="bmwiedemann"
avatar="http://cdn.libravatar.org/avatar/96f3cd71c3d677f31ed8f79ffb8fb343a8282c085731f405997ff3ef77a1a71b"
subject="comment 2"
date="2020-06-30T15:26:10Z"
content="""
We have been building openSUSE Haskell packages sequentially since 2017-07-14 for that reason:
<https://build.opensuse.org/package/rdiff/devel:languages:haskell/ghc-rpm-macros?linkrev=base&rev=79>

Non-determinism from filesystem readdir order is another, independent class of issue here.
"""]]


@@ -0,0 +1,33 @@
[[!comment format=mdwn
username="kyle"
avatar="http://cdn.libravatar.org/avatar/7d6e85cde1422ad60607c87fa87c63f3"
subject="comment 3"
date="2020-06-30T16:26:17Z"
content="""
[ I don't have a good understanding of the build issues here. I'm
sorry if this isn't relevant. ]

> I found, the build becomes reproducible, when using a filesystem
> with deterministic readdir order such as disorderfs with sort mode.

This reminded me of an issue with Guix's git-annex build:
<https://debbugs.gnu.org/cgi/bugreport.cgi?bug=33922>. In that case,
the nondeterminism came from \"package database files that are
generated by ghc-pkg (where readdir is used and the result isnt
sorted)\".

The fix on Guix's end, 5de93cdba7 (gnu: ghc-8: Patch ghc-pkg for
reproducibility, 2019-01-17), was at the level of the ghc package. It
replaced the following line in utils/ghc-pkg/Main.hs
```
confs = map (path </>) $ filter (\".conf\" `isSuffixOf`) fs
```
with
```
confs = map (path </>) $ filter (\".conf\" `isSuffixOf`) (sort fs)
```
"""]]

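The principle behind the ghc-pkg patch quoted in the comment above (sort a directory listing before using it) can be illustrated outside Haskell as well. This is only a sketch in shell, with a hypothetical function name `list_confs`; it is not how ghc-pkg or Guix implement anything, and it assumes file names without embedded newlines.

```sh
#!/bin/sh
# Illustrative only: list the *.conf files directly inside directory $1
# in a stable order. `find` reports entries in readdir order, which can
# differ between filesystems; sorting under LC_ALL=C makes the result
# independent of that order and of the current locale.
list_confs() {
    find "$1" -maxdepth 1 -type f -name '*.conf' | LC_ALL=C sort
}
```

Any tool that writes its output in the order returned by `list_confs` should then produce the same bytes regardless of the underlying filesystem.
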

@@ -0,0 +1,7 @@
Hello Joeyh,

Overall the performance of git-annex is good for me. However, one case where git-annex could improve is with "git annex sync --content --all", as it takes 20 minutes just to traverse all keys without uploading/downloading anything in my repo. I've looked at the code (learning some Haskell along the way) and I think it's due to getting the location logs via git cat-file. I see two ways performance could be improved:

1. Use "git cat-file --batch-all-objects --unordered" and traverse the keys in whatever order that outputs the location logs.
2. Cache the location logs in the sqlite database.

Other than that, git-annex has really solved all my file syncing and archival needs and is just awesome!
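
To make option 1 a little more concrete, here is a rough sketch in shell. It is not git-annex code: `blob_ids` is a hypothetical filter that keeps only blob object names from `git cat-file --batch-check` output. Note the limits of the idea as sketched: `--batch-all-objects` enumerates every object in the repository without path names, so deciding which blobs are location logs, and for which key, is left open here.

```sh
#!/bin/sh
# Illustrative filter, not git-annex code: read `git cat-file --batch-check`
# output (default format: "<object name> <type> <size>") on stdin and
# print the object names of blobs only.
blob_ids() {
    awk '$2 == "blob" { print $1 }'
}

# Possible use inside a repository (not run here): enumerate all objects
# in on-disk order, keep the blobs, and read them in one batch process.
#   git cat-file --batch-all-objects --unordered --batch-check \
#       | blob_ids | git cat-file --batch
```

Whether this would actually beat the current traversal depends on how many unrelated blobs the repository holds, which I have not measured.
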