Merge branch 'master' of ssh://git-annex.branchable.com
This commit is contained in:
commit
5ee14db037
3 changed files with 94 additions and 0 deletions
@@ -0,0 +1,69 @@
I've been debugging an intermittent DataLad test failure
(<https://github.com/datalad/datalad/issues/5300>) that is related to
an unlocked annex file whose content switches to being tracked by git.
Basically:

* `git annex add` file A to the annex.

* Configure `annex.largefiles` in a way that would have sent file A
  to git.

* If file A's mtime matches the index's, adding an unrelated file B
  triggers the clean filter to run on file A, which sends file A's
  content to git.

This sequence looks pretty close to a situation described in a comment
of the bug report below, except that `annex.largefiles` is configured
persistently in the repository rather than via a temporary
`-c annex.largefiles` override.

https://git-annex.branchable.com/bugs/A_case_where_file_tracked_by_git_unexpectedly_becomes_annex_pointer_file/#comment-215a295d83c8a08806d4f9c65ae52b10

As a concrete example, here's a demo that configures .txt files to be
added to git, but then forces the addition of an unlocked annex file
with `--force-large`.

[[!format sh """
cd "$(mktemp -d "${TMPDIR:-/tmp}"/ga-XXXXXXX)" || exit 1

git version
git annex version | head -1

git init -q
git annex init
git config annex.addunlocked true

printf '*.txt annex.largefiles=nothing\n' >.gitattributes
git add .gitattributes
git commit -m"configured annex.largefiles"

echo a >foo.txt
git annex add --force-large foo.txt

git diff
"""]]

```
git version 2.31.1.394.g7d1e84936f
git-annex version: 8.20210330
init (scanning for unlocked files...)
ok
(recording state in git...)
[master (root-commit) 0018dd1] configured annex.largefiles
1 file changed, 1 insertion(+)
create mode 100644 .gitattributes
add foo.txt
ok
(recording state in git...)
diff --git a/foo.txt b/foo.txt
index 4580ed7..7898192 100644
--- a/foo.txt
+++ b/foo.txt
@@ -1 +1 @@
-/annex/objects/SHA256E-s2--87428fc522803d31065e7bce3cf03fe475096631e5e07bbd7a0fde60c4cf25c7.txt
+a
```
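
For reading the diff above: the removed line is the annex pointer that was staged, and the added line is the raw content. The key in the pointer encodes the size and SHA-256 of that content. A minimal sketch, assuming the SHA256E backend's `SHA256E-s<size>--<hash><extension>` key layout and a `sha256sum` binary on the path; the variable names are illustrative only:

```sh
# Rebuild the key for a 2-byte file containing "a\n", like foo.txt in the demo.
tmp=$(mktemp)
printf 'a\n' >"$tmp"

size=$(wc -c <"$tmp" | tr -d ' ')          # byte count of the content
hash=$(sha256sum "$tmp" | cut -d' ' -f1)   # hex SHA-256 digest of the content
key="SHA256E-s${size}--${hash}.txt"        # ".txt" is taken from the name foo.txt

printf '/annex/objects/%s\n' "$key"
rm -f "$tmp"
```

The printed line should match the `-` line of the diff, which confirms that the index held a pointer to exactly the content now shown as `+a`.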

Is the above expected behavior? That is, if `annex.largefiles` is
configured to send a file to git, will the clean filter move it there
the next time it runs on that file?
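
When investigating this, it may help to confirm which `annex.largefiles` value the attributes assign to a given path. A small sketch using plain `git check-attr` in a throwaway repository (the `repo` directory and the `foo.bin` path are made up for illustration). Note that this only reports the `.gitattributes` value, not any `git config annex.largefiles` setting:

```sh
# Throwaway repository with the same attribute line as the demo.
repo=$(mktemp -d)
git -C "$repo" init -q
printf '*.txt annex.largefiles=nothing\n' >"$repo/.gitattributes"

# Ask git which annex.largefiles value the attributes give each path.
# The paths do not need to exist for check-attr to answer.
txt_attr=$(git -C "$repo" check-attr annex.largefiles -- foo.txt)
bin_attr=$(git -C "$repo" check-attr annex.largefiles -- foo.bin)

printf '%s\n%s\n' "$txt_attr" "$bin_attr"
rm -rf "$repo"
```

A path matched by `*.txt` reports `nothing`, while any other path reports `unspecified`.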
5
doc/todo/Fsck_remote_files_in-flight.mdwn
Normal file
@@ -0,0 +1,5 @@
When fsck'ing a remote repo, files seem to be copied from the remote to a local dir (thus written to disk), read back again for checksumming, and then deleted.

This is very time-inefficient and wastes precious SSD erase cycles, which is especially problematic for special remotes because they can only be fsck'd "remotely" (AFAIK).

Instead, remote files should be piped directly into an in-memory checksum function and never written to disk on the machine performing the fsck.
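
The data flow being requested can be sketched in a few lines of shell. This is only an illustration of hashing a stream without a temporary file, not how git-annex implements fsck. `stream_from_remote` is a hypothetical stand-in for whatever would deliver a key's content on stdout, and the expected digest is simply the SHA-256 of the two bytes `a\n`:

```sh
# Hypothetical stand-in: emit a key's content from the remote on stdout.
stream_from_remote() { printf 'a\n'; }

# Digest the key claims to have (here: SHA-256 of "a\n").
expected=87428fc522803d31065e7bce3cf03fe475096631e5e07bbd7a0fde60c4cf25c7

# The pipe feeds the hash function directly; nothing is written to disk.
actual=$(stream_from_remote | sha256sum | cut -d' ' -f1)

if [ "$actual" = "$expected" ]; then
    result=ok
else
    result=mismatch
fi
echo "$result"
```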
@@ -0,0 +1,20 @@
I need to import ~10TB of data from content synced from Google storage. I have a BTRFS file system (so CoW is available), and initialized the remote with

```
git annex initremote buckets type=directory importtree=yes encryption=none directory=../buckets/
```

and `../buckets` resides on the same mount point, so `cp --reflink=always ../buckets/....` works out nicely.

But when I ran `git annex import --from=buckets -J 10 master` I saw that

- no `cp` is invoked by git-annex (according to `pstree`)
- I have an equal amount of output IO and input IO (in `dstat`), suggesting that I am actually copying data

I have not looked closely inside the copied keys to see whether they share storage extents etc., so my observation could still be wrong.
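
Whether two files share extents can be checked directly rather than inferred from IO counters. A rough sketch, assuming GNU `cp` and, for the last step, `btrfs-progs`; the temporary directory and file names are made up, and on a non-btrfs path the copy silently falls back to a normal one:

```sh
# Create a file and a copy that shares extents where the filesystem allows it.
dir=$(mktemp -d)
head -c 65536 /dev/urandom >"$dir/orig"
cp --reflink=auto "$dir/orig" "$dir/copy"

# The content must be identical whether or not extents are shared.
if cmp -s "$dir/orig" "$dir/copy"; then same=yes; else same=no; fi
echo "identical content: $same"

# On btrfs, the "Set shared" column shows how much data is shared.
if command -v btrfs >/dev/null 2>&1; then
    btrfs filesystem du -s "$dir" 2>/dev/null || echo "not a btrfs path"
fi
rm -rf "$dir"
```

The same `btrfs filesystem du` check could be pointed at a file under the imported directory together with the matching annex object to see whether the import reflinked or copied.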

Joey, is git-annex 8.20210223-1~ndall+1 expected to take advantage of CoW in such a case, or is that still a TODO? Is there any workaround? (E.g., maybe I could just `cp --reflink=always`, `git annex add`, and then do the import, and annex would just reuse the keys already CoWed?)

[[!meta author=yoh]]
[[!tag projects/datalad]]