Merge branch 'master' of ssh://git-annex.branchable.com

Joey Hess 2021-04-14 12:55:18 -04:00
commit 5ee14db037
No known key found for this signature in database
GPG key ID: DB12DB0FF05F8F38
3 changed files with 94 additions and 0 deletions


@ -0,0 +1,69 @@
I've been debugging an intermittent DataLad test failure
(<https://github.com/datalad/datalad/issues/5300>) that is related to
an unlocked annex file whose content switches to being tracked by git.
Basically
* `git annex add` file A to the annex.
* Configure `annex.largefiles` in a way that would have sent file A
to git.
* If file A's mtime matches the index's, adding file B triggers the
clean filter to run on file A, so A's content is sent to git even
though only an unrelated file was added.
This sequence looks pretty close to a situation described in a comment
on the bug report below, except that `annex.largefiles` is configured
persistently in the repository rather than via a temporary `-c
annex.largefiles` override.
https://git-annex.branchable.com/bugs/A_case_where_file_tracked_by_git_unexpectedly_becomes_annex_pointer_file/#comment-215a295d83c8a08806d4f9c65ae52b10
As a concrete example, here's a demo that configures .txt files to be
added to git, but then forces the addition of an unlocked annex file
with `--force-large`.
[[!format sh """
cd "$(mktemp -d "${TMPDIR:-/tmp}"/ga-XXXXXXX)" || exit 1
git version
git annex version | head -1
git init -q
git annex init
git config annex.addunlocked true
printf '*.txt annex.largefiles=nothing\n' >.gitattributes
git add .gitattributes
git commit -m"configured annex.largefiles"
echo a >foo.txt
git annex add --force-large foo.txt
git diff
"""]]
```
git version 2.31.1.394.g7d1e84936f
git-annex version: 8.20210330
init (scanning for unlocked files...)
ok
(recording state in git...)
[master (root-commit) 0018dd1] configured annex.largefiles
1 file changed, 1 insertion(+)
create mode 100644 .gitattributes
add foo.txt
ok
(recording state in git...)
diff --git a/foo.txt b/foo.txt
index 4580ed7..7898192 100644
--- a/foo.txt
+++ b/foo.txt
@@ -1 +1 @@
-/annex/objects/SHA256E-s2--87428fc522803d31065e7bce3cf03fe475096631e5e07bbd7a0fde60c4cf25c7.txt
+a
```
Is the above showing expected behavior? That is, if
`annex.largefiles` is configured to send a file to git, the clean
filter will move it there the next time it runs on it?
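
For anyone reading the diff, the key in the pointer line can be cross-checked by hand. This is only a sketch: it assumes the SHA256E backend names keys as `SHA256E-s<size>--<sha256><extension>` (which matches the key shown in the output) and that `sha256sum` from coreutils is available.

```sh
# Recompute the key that the pointer file in the demo refers to.
# `echo a >foo.txt` wrote the single line "a" followed by a newline.
content='a'

# Size in bytes of the file content (tr strips padding some wc versions add).
size=$(printf '%s\n' "$content" | wc -c | tr -d ' ')

# SHA-256 of the same bytes; sha256sum prints "<hash>  -", keep the hash only.
hash=$(printf '%s\n' "$content" | sha256sum | cut -d' ' -f1)

# Assumed key layout: backend, size, hash, then the file's extension.
key="SHA256E-s${size}--${hash}.txt"
echo "$key"
```

The printed key should match the one on the `-` line of the diff, which confirms that the index still holds the annex pointer while the work tree holds the plain content.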


@ -0,0 +1,5 @@
When fsck'ing a remote repo, files seem to be copied from the remote to a local dir (thus written to disk), read back again for checksumming, and then deleted.
This is slow and wastes precious SSD erase cycles, which is especially problematic for special remotes because they can only be fsck'd "remotely" (AFAIK).
Instead, remote files should be directly piped into an in-memory checksum function and never written to disk on the machine performing the fsck.
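
To illustrate the idea, here is a rough sketch of checksumming a stream without a temporary file. It is not how git-annex implements fsck; `fetch_content` is a hypothetical stand-in for whatever retrieves an object from a remote, and the expected hash is simply the SHA-256 of the two bytes `a\n`.

```sh
# Hypothetical stand-in for "read the object from the remote": any command
# that writes the content to stdout would do here.
fetch_content() {
    printf 'a\n'
}

# Checksum the verifier expects (SHA-256 of "a\n").
expected=87428fc522803d31065e7bce3cf03fe475096631e5e07bbd7a0fde60c4cf25c7

# The content flows through the pipe straight into sha256sum; nothing is
# written to a temporary file on the machine doing the check.
actual=$(fetch_content | sha256sum | cut -d' ' -f1)

if [ "$actual" = "$expected" ]; then
    status=ok
else
    status=failed
fi
echo "fsck (streamed): $status"
```

With this shape, the only disk I/O on the checking machine is whatever the transfer itself needs, and memory use stays constant regardless of file size.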


@ -0,0 +1,20 @@
I need to import ~10TB of data from content synced from Google storage. I have a BTRFS file system (so CoW is available), and initialized the remote with
```
git annex initremote buckets type=directory importtree=yes encryption=none directory=../buckets/
```
and `../buckets` resides on the same mount point, so `cp --reflink=always ../buckets/....` works out nicely.
But when I ran `git annex import --from=buckets -J 10 master`, I saw that
- no `cp` is invoked by git-annex (according to `pstree`)
- output IO roughly equals input IO (in `dstat`), suggesting that the data is actually being copied
I have not looked in detail inside the copied keys to see whether they share storage extents etc., so my observation could still be wrong.
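
A quick way to check extent sharing outside of git-annex could look like the sketch below. It assumes GNU `cp` (for `--reflink`) and, optionally, `filefrag` from e2fsprogs; the temporary file names are made up for illustration. `--reflink=auto` is used so the script also runs on file systems without CoW, where it falls back to a regular copy.

```sh
# Scratch directory; for a meaningful result it should live on the btrfs mount.
workdir=$(mktemp -d "${TMPDIR:-/tmp}"/reflink-XXXXXXX) || exit 1
printf 'example payload\n' >"$workdir/src"

# Share extents where the file system supports it, otherwise copy normally.
cp --reflink=auto "$workdir/src" "$workdir/dst"

# Either way the two files must have identical content.
cmp -s "$workdir/src" "$workdir/dst" && echo "identical content"

# Where available, filefrag lists each file's extents; on btrfs, reflinked
# extents are expected to carry a "shared" flag. Errors (for example on
# file systems that do not support extent mapping) are ignored.
if command -v filefrag >/dev/null 2>&1; then
    filefrag -v "$workdir/src" "$workdir/dst" || true
fi
```

Running the `filefrag -v` step on a file under `../buckets` and on the corresponding object under `.git/annex/objects` would show whether the import shared extents or made a full copy.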
Joey, is git-annex 8.20210223-1~ndall+1 expected to take advantage of CoW in such a case, or is that still a TODO? Is there any "workaround"? (E.g., maybe I could just `cp --reflink=always`, `git annex add`, and then do the import, and annex would just reuse the keys already CoWed?)
[[!meta author=yoh]]
[[!tag projects/datalad]]