Merge branch 'master' of ssh://git-annex.branchable.com

Joey Hess 2024-05-20 15:52:44 -04:00
commit 644ed44ec1

@@ -0,0 +1,9 @@
I've inherited maintenance of an annex repo that goes back 10+ years with about 500 files, and I am not an expert git-annex user. The writable remote that people push annex files to is a file:// URL that is only accessible from in-house machines. At least a couple of times a day, a cron job rsyncs that remote to a read-only copy accessible via an https:// URL. If new content is added via the writable file:// URL, it does not show up via the read-only https:// URL until after the rsync runs.
The workflow is that developers create sandboxes with forks and/or clones of the main git repo and may or may not use the annex. These sandboxes can be fleeting and may be deleted and/or abandoned. I do not have access to all the machines where developers may have located a sandbox, e.g., one with unlocked, modified annex files that have never been pushed to the remote.
Recently, when creating new sandboxes and getting the annex files via the read-only https:// remote, we've begun to see frequent messages that "<annex file>" is not available, then "these git remotes have annex-ignore set: datasrc origin", followed by "maybe adding some of these git remotes ..." with a long list of remotes, sometimes more than 100 per file. The list of IDs and remote paths often includes old and/or deleted developer sandboxes. Alternatively, some references are valid, but the sandbox there may have been clobbered and re-created several times under the same path. These errors cause git-annex to exit with non-zero status, which breaks automation.
After reading various postings, I ran "git annex fsck" (which reports no errors), followed by "git annex sync", which did commit and push some changes. I then ran the rsync script to update the read-only remote. This does not always fix the errors; it may do so temporarily, and then new errors occur for files that have not been touched or pushed recently.
I don't know git-annex well enough to understand the source of these errors (short of the rsync failing, which I don't see happening). I think our small project would be fine with the read-only remote as the primary, falling back to the writable remote. But I don't see why we need git-annex to track every sandbox that has ever checked out the annex. Some postings indicate you can drop all unwanted/old references for a particular file by specifying a numeric range for the ID/path entries. I don't see this as a practical solution for ~500 files, each with many references (only some of which are current). I have not tried turning off annex-ignore for datasrc and origin, but from what I've read it is not clear to me that this will change the behavior. Happy to learn about other options to diagnose/remedy these errors.
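To get a sense of how many stale sandbox references are involved, here is a rough sketch of the kind of tally I have in mind. It assumes that each line of "git annex whereis --json" is a JSON object with a "whereis" list whose entries carry "uuid" and "description" fields; I have not verified that against our git-annex version. The function name `tally_locations` and the sample records are made up for illustration.

```python
import json
from collections import Counter


def tally_locations(json_lines):
    """Count, per repository, how many files list it as holding a copy.

    Assumed input: one JSON object per line, shaped like
    {"file": ..., "whereis": [{"uuid": ..., "description": ...}, ...]}
    (an assumption about `git annex whereis --json`, not verified here).
    """
    counts = Counter()
    for line in json_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        for loc in record.get("whereis", []):
            counts[(loc.get("uuid"), loc.get("description"))] += 1
    return counts


# Synthetic records, invented for illustration only.
sample = [
    '{"file": "a.dat", "whereis": [{"uuid": "u1", "description": "old sandbox"},'
    ' {"uuid": "u2", "description": "datasrc"}]}',
    '{"file": "b.dat", "whereis": [{"uuid": "u1", "description": "old sandbox"}]}',
]

for (uuid, description), n in tally_locations(sample).most_common():
    print(n, uuid, description)
```

On a real repo, the lines would come from the command output instead of `sample`, for example by passing `sys.stdin` to `tally_locations` and piping "git annex whereis --json" into the script. Repositories that turn out to be long-gone sandboxes might then be candidates for "git annex dead", if that is the right tool for this.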