diff --git a/doc/forum/one-off_unlocked_annex_files_that_go_against_large.mdwn b/doc/forum/one-off_unlocked_annex_files_that_go_against_large.mdwn
new file mode 100644
index 0000000000..7220551228
--- /dev/null
+++ b/doc/forum/one-off_unlocked_annex_files_that_go_against_large.mdwn
@@ -0,0 +1,69 @@
+I've been debugging an intermittent DataLad test failure
+() that is related to
+an unlocked annex file whose content switches to being tracked by git.
+Basically
+
+ * `git annex add` file A to the annex.
+
+ * Configure `annex.largefiles` in a way that would have sent file A
+   to git.
+
+ * If file A's mtime matches the index's, adding an unrelated file B
+   triggers the clean filter to run on file A, which sends file A's
+   content to git.
+
+This sequence looks pretty close to a situation described in a comment
+of the bug report below, except that `annex.largefiles` is configured
+persistently in the repository rather than via a temporary `-c
+annex.largefiles` override.
+
+https://git-annex.branchable.com/bugs/A_case_where_file_tracked_by_git_unexpectedly_becomes_annex_pointer_file/#comment-215a295d83c8a08806d4f9c65ae52b10
+
+As a concrete example, here's a demo that configures .txt files to be
+added to git, but then forces the addition of an unlocked annex file
+with `--force-large`.
+
+[[!format sh """
+cd "$(mktemp -d "${TMPDIR:-/tmp}"/ga-XXXXXXX)" || exit 1
+
+git version
+git annex version | head -1
+
+git init -q
+git annex init
+git config annex.addunlocked true
+
+printf '*.txt annex.largefiles=nothing\n' >.gitattributes
+git add .gitattributes
+git commit -m"configured annex.largefiles"
+
+echo a >foo.txt
+git annex add --force-large foo.txt
+
+git diff
+"""]]
+
+```
+git version 2.31.1.394.g7d1e84936f
+git-annex version: 8.20210330
+init (scanning for unlocked files...)
+ok
+(recording state in git...)
+[master (root-commit) 0018dd1] configured annex.largefiles
+ 1 file changed, 1 insertion(+)
+ create mode 100644 .gitattributes
+add foo.txt
+ok
+(recording state in git...)
+diff --git a/foo.txt b/foo.txt
+index 4580ed7..7898192 100644
+--- a/foo.txt
++++ b/foo.txt
+@@ -1 +1 @@
+-/annex/objects/SHA256E-s2--87428fc522803d31065e7bce3cf03fe475096631e5e07bbd7a0fde60c4cf25c7.txt
++a
+```
+
+Is the above showing expected behavior? That is, if
+`annex.largefiles` is configured to send a file to git, the clean
+filter will move it there the next time it runs on it?
diff --git a/doc/todo/Fsck_remote_files_in-flight.mdwn b/doc/todo/Fsck_remote_files_in-flight.mdwn
new file mode 100644
index 0000000000..1b88fb0b22
--- /dev/null
+++ b/doc/todo/Fsck_remote_files_in-flight.mdwn
@@ -0,0 +1,5 @@
+When fsck'ing a remote repo, files seem to be copied from the remote to a local dir (thus written to disk), read back again for checksumming, and then deleted.
+
+This is very time-inefficient and wastes precious SSD erase cycles, which is especially problematic in the case of special remotes because they can only be fsck'd "remotely" (AFAIK).
+
+Instead, remote files should be directly piped into an in-memory checksum function and never written to disk on the machine performing the fsck.
diff --git a/doc/todo/import_from_directory_does_not_use_cp_--reflink__63___.mdwn b/doc/todo/import_from_directory_does_not_use_cp_--reflink__63___.mdwn
new file mode 100644
index 0000000000..18b6128a6b
--- /dev/null
+++ b/doc/todo/import_from_directory_does_not_use_cp_--reflink__63___.mdwn
@@ -0,0 +1,20 @@
+I need to import ~10TB of content synced from Google storage. I have a BTRFS file system (so CoW is available), and initialized the remote with
+
+```
+git annex initremote buckets type=directory importtree=yes encryption=none directory=../buckets/
+```
+
+and `../buckets` resides on the same mount point, so `cp --reflink=always ../buckets/....` works out nicely.
+
+But when I ran `git annex import --from=buckets -J 10 master` I saw that
+
+- no `cp` is invoked by git-annex (according to `pstree`)
+- the amount of output IO equals the amount of input IO (in `dstat`), suggesting that the data is actually being copied
+
+I have not looked closely at the copied keys to see whether they share storage extents etc., so my observation could still be wrong.
+
+Joey, is git-annex 8.20210223-1~ndall+1 expected to take advantage of CoW in such a case, or is that still a TODO? Is there any "workaround"? (E.g., maybe I could just `cp --reflink=always`, `git annex add`, and then do the import, and annex would just reuse the keys already CoWed?)
+
+[[!meta author=yoh]]
+[[!tag projects/datalad]]
+