Merge branch 'master' into v8

Joey Hess 2020-01-01 14:26:43 -04:00
commit 2cea674d1e
No known key found for this signature in database
GPG key ID: DB12DB0FF05F8F38
44 changed files with 665 additions and 140 deletions

@ -75,3 +75,5 @@ Thanks for having a look.
[[!meta author=kyle]]
[[!tag projects/datalad]]
> [[fixed|done]] --[[Joey]]

@ -0,0 +1,42 @@
[[!comment format=mdwn
username="joey"
subject="""comment 2"""
date="2019-12-27T06:22:23Z"
content="""
On second thought, making the clean filter check for non-annexed files
would prevent use cases like annex.largefiles=largerthan(100kb)
from working as the user intended, where a small file starts out
non-annexed and gets annexed once it grows too large. Users certainly rely on
that, and this bug, which only affects an edge case, does not justify breaking
it.
What would work is to make the clean filter detect when a file's content
has not changed, though its mtime (or inode) has changed. In that case,
it's reasonable for the clean filter to ignore annex.largefiles and keep
the content represented in git however it already was (non-annexed or
annexed).
To detect that, in the case where the file in the index is not annexed:
First check if the file size is the same as the
size in the index. If it is, run git hash-object on the file, and see if
the sha1 is the same as in the index. This avoids hashing any unusually
large files, so the clean filter only gets a bit slower.
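
The check described above for the non-annexed case can be sketched in shell. This is only an illustration, not the code git-annex uses; the function name `unchanged_since_index` is made up for the example. It assumes the index entry is a plain git blob with no end-of-line or other conversion applied, and it passes `--no-filters` so that hashing the file does not itself invoke the clean filter.

```shell
# Sketch only: succeeds (exit 0) when FILE's content is byte-identical to
# the blob recorded for it in the index. Hypothetical helper name.
unchanged_since_index() {
    file=$1
    # "git ls-files -s" prints: <mode> <sha> <stage><TAB><path>
    index_sha=$(git ls-files -s -- "$file" | awk '{print $2; exit}')
    [ -n "$index_sha" ] || return 1
    # Cheap check first: compare sizes, so large changed files are not hashed.
    index_size=$(git cat-file -s "$index_sha") || return 1
    file_size=$(wc -c < "$file" | tr -d ' ')
    [ "$file_size" = "$index_size" ] || return 1
    # Sizes match: hash the raw content and compare with the index sha.
    [ "$(git hash-object --no-filters -- "$file")" = "$index_sha" ]
}
```

A touched but otherwise unmodified file passes this check, while any change in size returns early without hashing.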
And when the file in the index is annexed, check if the file size is the
same as the size of the annexed key. If it is, verify that the file content
matches the key (typically by hashing). Cases where keys lack size or
don't use a checksum could lead to false positives or negatives though.
Although, I've not managed to find a version of this bug that makes an
annexed file get converted to git unintentionally, so maybe this part does
not need to be done?
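
For the annexed case, the size comparison depends on the key carrying a size field. A rough sketch of reading it follows, assuming the usual key layout of a backend name, an optional `-s<bytes>` field, and `--` before the key name. The helper `key_size` is hypothetical, and the sample keys in the comments are invented.

```shell
# Sketch only: print the size recorded in a key, or nothing when the key
# has no size field. Hypothetical helper name.
key_size() {
    printf '%s\n' "$1" | sed -n 's/^[A-Z0-9_]*-s\([0-9][0-9]*\).*--.*$/\1/p'
}

# key_size "SHA256E-s31390--0123abcd.mp3"   prints 31390
# key_size "URL--http-example"              prints nothing (no size field)
```

An empty result is the "keys lack size" case mentioned above, where the size shortcut cannot be applied.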
----
Or.. Since the root of the problem is temporarily overriding annex.largefiles,
it could just be documented that it's not a good idea to use
-c annex.largefiles=anything/nothing, because such broad overrides
can affect other files than the ones you intended.
(And since the documented methods of converting files from annexed to git and
git to annexed use such overrides, that documentation would need to be
changed.)
"""]]

@ -0,0 +1,16 @@
[[!comment format=mdwn
username="joey"
subject="""comment 3"""
date="2019-12-27T17:11:42Z"
content="""
A variant of this where an annexed unlocked file is added first,
then the file is touched, and then some other file is added
with -c annex.largefiles=nothing does result in the clean filter sending
the whole annexed file content back to git, rather than keeping it annexed.
For whatever reason, git does not store that content in .git/objects or
update the index for that file though, so it doesn't show up as a change.
So *apparently* that variant is only potentially an expensive cat of a
large annexed file, and does not need to be dealt with. Unless git
sometimes behaves otherwise.
"""]]

@ -0,0 +1,45 @@
[[!comment format=mdwn
username="joey"
subject="""comment 4"""
date="2019-12-27T18:41:12Z"
content="""
It's almost possible to get the same unwanted conversion without any git
races:

    echo content-git > file-git
    sleep 2
    git add file-git
    git commit -m add
    echo foo > file-git
    echo content-annex > file-annex
    git -c annex.largefiles=anything annex add file-annex

In this case, git currently does not run the modified file-git through the
clean filter in the last line, so the annex.largefiles=anything doesn't
affect it.
But, as far as I can see, there's nothing preventing a future version
of git from deciding it does want to run file-git through the clean filter
in this case.
I am not going to try to guard against such a thing happening.
As far as I can see, anything that the clean filter can possibly do to
avoid such a situation will cripple existing use cases of
annex.largefiles, like largerthan() as mentioned above.
The user has told git-annex to annex "anything", and if git
decides to run the clean filter while that is in effect, caveat emptor.
Which is not to say I'm not going to fix the specific case this bug was
filed about. I actually have a fix developed now. But just to say that
setting annex.largefiles=anything/nothing temporarily is a blunt instrument,
and you risk accidental conversion when using it, and so it would be a good
idea to not do that.
One idea: Make `git-annex add --annex` and `git-annex add --git`
add a specific file to annex or git, bypassing annex.largefiles and all
other configuration and state. This could also be used to easily switch
a file from one storage to the other. I'd hope the existence of that
would prevent one-off setting of annex.largefiles=anything/nothing.
[[todo/git_annex_add_option_to_control_to_where]]
"""]]

@ -0,0 +1,58 @@
[[!comment format=mdwn
username="kyle"
avatar="http://cdn.libravatar.org/avatar/7d6e85cde1422ad60607c87fa87c63f3"
subject="comment 5"
date="2019-12-28T21:06:46Z"
content="""
Thanks for the explanation and the fix.
> For whatever reason, git becomes confused about whether this file is
> modified. I seem to recall that git distrusts information it recorded in
> its own index if the mtime of the index file is too close to the
> mtime recorded inside it, or something like that.
I see. I think the problem and associated workaround you're referring
to is described in git's Documentation/technical/racy-git.txt.
> Note that, you can accomplish the same thing without setting
> annex.largefiles, assuming a current version of git-annex:
>
>     git add file-git
>     git annex add file-annex
>
> I think the only reason for setting annex.largefiles in either of the two
> places you did is if there's a default value that you want to
> temporarily override?
Right. DataLad's methods that are responsible for calling out to `git
annex add` have a `git={None,False,True}` parameter. By default
(`None`), DataLad just calls `git annex add ...` and lets any
configuration in the repo control whether the file goes to git or is
annexed. But with `git=True` or `git=False`, the `annex add` call
includes a `-c annex.largefiles=` argument with a value of `nothing`
or `anything`, respectively.
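
To make that mapping concrete, here is a small shell sketch. It is not DataLad's code; the function name and the mode words `git`, `annex`, and `default` are placeholders for `git=True`, `git=False`, and `git=None`. It only prints the command that would be run.

```shell
# Sketch only: print the command line for each of the three modes.
# Mode names are hypothetical stand-ins for DataLad's git=True/False/None.
annex_add_command() {
    mode=$1
    shift
    case $mode in
        git)   echo "git -c annex.largefiles=nothing annex add $*" ;;
        annex) echo "git -c annex.largefiles=anything annex add $*" ;;
        *)     echo "git annex add $*" ;;
    esac
}
```

For example, `annex_add_command git notes.txt` prints `git -c annex.largefiles=nothing annex add notes.txt`.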
> But just to say that setting annex.largefiles=anything/nothing
> temporarily is a blunt instrument, and you risk accidental
> conversion when using it, and so it would be a good idea to not do
> that.
Noted. As mentioned above, DataLad's default behavior is to honor the
repo's `annex.largefiles` configuration. And the documentation for
`datalad save`, DataLad's main user-facing entry point for `annex
add`, recommends that the user configure .gitattributes rather than
using the option that leads to calling `annex add` with `-c
annex.largefiles=nothing`.
> One idea: Make `git-annex add --annex` and `git-annex add --git`
> add a specific file to annex or git, bypassing annex.largefiles and all
> other configuration and state. This could also be used to easily switch
> a file from one storage to the other. I'd hope the existence of that
> would prevent one-off setting of annex.largefiles=anything/nothing.
As far as I can see, those flags would completely cover DataLad's
one-off setting of `annex.largefiles=anything/nothing`. They map
directly to DataLad's `git=False/True` option described above. So,
from DataLad's perspective, they'd be very useful and welcome.
"""]]

@ -0,0 +1,34 @@
[[!comment format=mdwn
username="sirio@84e81889437b3f6208201a26e428197c6045c337"
nickname="sirio"
avatar="http://cdn.libravatar.org/avatar/9f3a0cfaf4825081710b652cc0b438a4"
subject="Duplicate 'gcrypt-id' may be the issue?"
date="2019-12-29T22:10:26Z"
content="""
Had a repo exhibit this behavior just now:
- commit graph `XX -> YY`
- host `A` @ commit `YY`
- host `B` @ commit `XX` (1 behind)
- remotes `hub` and `lab` both @ commit `XX`
- `B` pushes and pulls from both `hub` and `lab`: OK
- `A` pushes to `hub` (updates to commit `YY`): OK
- `B` pulls from `hub`: FAIL with *Packfile does not match digest*
- `B` pulls from `lab`: OK
- `B` pushes to `hub`: FAIL with *Packfile does not match digest*
- `A` pulls from `hub`: OK
- `A` pulls from `lab`: OK
When looking in `.git/config` I noticed that `remote.hub.gcrypt-id` and `remote.lab.gcrypt-id` were identical.
To fix, I:
- removed `remote.hub.gcrypt-id` from `.git/config` on both `A` and `B`
- deleted and re-created a blank repo on `hub`
- `git push hub` on `B`
- `git pull hub master` on `A`
This resulted in a new and unique value for `remote.hub.gcrypt-id`, which is the same on both `A` and `B`.
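
For anyone wanting to check for the same condition, a rough sketch follows. The helper name is made up, and it assumes only what is described above: that each remote's id is stored under `remote.<name>.gcrypt-id` in the git config. It reads `key value` lines on stdin and prints any id that appears for more than one remote.

```shell
# Sketch only: print every gcrypt-id value that occurs more than once.
# Input lines look like: remote.<name>.gcrypt-id <value>
duplicate_gcrypt_ids() {
    awk '{ count[$2]++ } END { for (id in count) if (count[id] > 1) print id }'
}

# Possible use inside a repository:
#   git config --get-regexp '^remote\..*\.gcrypt-id$' | duplicate_gcrypt_ids
```

No output means every remote has its own id.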
I have not had time to dig into why, but this is the only thread I can find about this problem, so I figured I would log this somewhere for posterity.
"""]]

@ -0,0 +1,97 @@
### Please describe the problem.
I have the following repos
a - group manual - all content currently originates on this repo (OS X 10.14.4)
b - group backup - this is a rclone special backed by google drive
c - this is the underlying git repo on gitlab.com
d - group backup - a server that is supposed to backup everything (OS X 10.14.4)
Assistant is running on a and d
It is not guaranteed that a and d will be able to directly connect, however, they both have very good connectivity to b and c
When I add a set of files into a (using git-annex add) the non-annex files get checked into the git repo and pushed to c. Similarly, the content (annex files) gets pushed to b. This is confirmed by git-annex list --allrepos
Within an hour or so, d will know about the files that were added (git-annex list) and the git log shows that it is on the same commit as a and c.
However, the assistant on d never does the git-annex sync --content
If I manually run git-annex sync --content on d, all is updated as expected.
I've made no changes to the groupwants, group, etc. settings
### What steps will reproduce the problem?
Create a repo with a central git upstream and a special remote via rclone on gdrive. Clone that repo on another machine that can also see the upstream and the special remote, but isn't directly connected to the originator of the repo.
Add annex-handled files to the original repo.
Check the status of the git upstream, the special remote, and the clone.
After the failure is confirmed, run git annex sync --content to confirm that the mechanics still work.
### What version of git-annex are you using? On what operating system?
Both hosts are OSX 10.14.4 and are running 7.20191218
### Please provide any additional information below.
This is from the assistant on the clone. It is running in debug mode.
[[!format sh """
[2019-12-30 17:44:09.362492] main: starting assistant version 7.20191114
[2019-12-30 17:44:14.532638] TransferScanner: Syncing with origin
(scanning...) [2019-12-30 17:44:14.590159] Watcher: Performing startup scan
ControlSocket .git/annex/ssh/git@gitlab already exists, disabling multiplexing
Disallowed command
Everything up-to-date
Disallowed command
Disallowed command
Disallowed command
fatal: Pathspec 'workflow/cc-archive-exif/LICENSE' is in submodule 'workflow/cc-archive-exif'
fatal: Pathspec 'workflow/cc-archive-exif/LICENSE' is in submodule 'workflow/cc-archive-exif'
fatal: Pathspec 'workflow/cc-archive-exif/LICENSE' is in submodule 'workflow/cc-archive-exif'
fatal: Pathspec 'workflow/cc-archive-exif/LICENSE' is in submodule 'workflow/cc-archive-exif'
fatal: Pathspec 'workflow/cc-archive-exif/LICENSE' is in submodule 'workflow/cc-archive-exif'
fatal: Pathspec 'workflow/cc-archive-exif/LICENSE' is in submodule 'workflow/cc-archive-exif'
fatal: Pathspec 'workflow/cc-archive-exif/LICENSE' is in submodule 'workflow/cc-archive-exif'
fatal: Pathspec 'workflow/cc-archive-exif/LICENSE' is in submodule 'workflow/cc-archive-exif'
fatal: Pathspec 'workflow/cc-archive-exif/LICENSE' is in submodule 'workflow/cc-archive-exif'
fatal: Pathspec 'workflow/cc-archive-exif/LICENSE' is in submodule 'workflow/cc-archive-exif'
fatal: Pathspec 'workflow/cc-archive-exif/LICENSE' is in submodule 'workflow/cc-archive-exif'
git cat-file EOF: user error
fd:38: hFlush: resource vanished (Broken pipe)
fd:38: hFlush: resource vanished (Broken pipe)
Disallowed command
(started...)
[2019-12-30 17:44:33.097035] Committer: Committing changes to git
(recording state in git...)
[2019-12-30 17:44:33.176213] Pusher: Syncing with origin
Everything up-to-date
Disallowed command
<<A bunch of white space lines removed for brevity>>
Disallowed command
Disallowed command
Disallowed command
Disallowed command
Disallowed command
# End of transcript or log.
"""]]
### Have you had any luck using git-annex before? (Sometimes we get tired of reading bug reports all day and a lil' positive end note does wonders)
Yes - I can run this manually, and overall this is great - I would just love to get this automated....

@ -20,3 +20,5 @@ If strict matching (not sure yet about a use case where it would really be neede
[[!meta author=yoh]]
[[!tag projects/datalad]]
> Looks like we're agreed this is not necessary, so [[done]] --[[Joey]]

@ -0,0 +1,30 @@
### Please describe the problem.
enable-tor on an OSX box (with magic-wormhole and tor installed via brew) fails miserably.
### What steps will reproduce the problem?
run git-annex enable-tor - multiple fails, see details.
### What version of git-annex are you using? On what operating system?
7.20191106
OSX 10.14.5
### Please provide any additional information below.
The first failure is that enable-tor can't run as root. Instead, I call it with sudo git-annex enable-tor <UID>
The second failure is that you try and write into /etc/tor/torrc - which is not where torrc is located on a brew installed tor - it's in /usr/local/etc/tor/torrc. I made a symlink to get around that problem.
The third failure is a complaint about systemctl not being present. I looked in Utility/Tor.hs and saw you were trying to call for a reload of tor. To hack around that, I wrote a script called systemctl that simply called 'brew services' with the args passed in ( brew services $1 $2 ).
After that, I still get the error: git-annex: tor failed to create hidden service, perhaps the tor service is not running
I have restarted tor manually, and it is indeed running. It looks like something is failing in setting up the Onion socket, but I can't see what
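
For reference, the systemctl wrapper described under the third failure amounts to something like the sketch below. It is written as a function so it can be tried without installing anything; the `BREW` variable is an addition for testing only and is not part of the original script. Whether `brew services` accepts the action that git-annex passes to systemctl has not been checked here.

```shell
# Sketch only: forward the first two arguments to "brew services".
# BREW is a hypothetical override so the command can be swapped out.
systemctl_shim() {
    "${BREW:-brew}" services "$1" "$2"
}
```

Setting `BREW=echo` before calling `systemctl_shim restart tor` prints the arguments that would be handed to brew instead of running it.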
### Have you had any luck using git-annex before? (Sometimes we get tired of reading bug reports all day and a lil' positive end note does wonders)
I love it - using it to protect my photo archive now using a central special repo (rclone) for the data, and a gitlab repo for the base.