Merge branch 'master' of ssh://git-annex.branchable.com

Joey Hess 2020-01-10 14:53:00 -04:00
commit 4a135934ff
No known key found for this signature in database
GPG key ID: DB12DB0FF05F8F38
3 changed files with 114 additions and 0 deletions

@ -0,0 +1,13 @@
[[!comment format=mdwn
username="satya.ortiz-gagne@a4c92de91eb4fd5ae8fc9893bb4fd674a19f2e59"
nickname="satya.ortiz-gagne"
avatar="http://cdn.libravatar.org/avatar/79c93025f174cd2aff98fbb952702c09"
subject="using hardlinks"
date="2020-01-10T16:11:47Z"
content="""
Thanks for your help. Yes, I believe that a post-checkout hook could do the trick, but I really like your idea of using a FUSE filesystem. Thanks a lot for sharing. I also believe this could be the basis for progressively getting the content of an indexed archive (like .zip) as it is needed.
The worktree is a very interesting feature, but I'm also using [DataLad 0.11.8](https://www.datalad.org/), which is unfortunately incompatible with it for the moment.
As for my objective to not use locked files, I initially thought that a script from a library I was using to preprocess some data was failing because the files were symlinks, but I couldn't reproduce the failure. Unfortunately, too many factors changed, so I'm just going to assume I was doing something wrong. Still, it would sometimes be useful to work with unlocked files when I'm doing a multi-phase (multi-commit) preprocessing of a big file. In that case, a phase would modify the file, triggering a copy when it is unlocked, and then annex the modified file. I would be interested in skipping the copy to save a significant amount of time and space, since the intermediate states of the file are only temporary. The checksums are still useful to make sure each phase executed correctly. But that is very specific and will not happen too often, so I'm fine with workarounds.
"""]]

@ -0,0 +1,80 @@
Intro
=====
This experience report
describes steps I've taken for recovering from a situation where
an *unrelated* git-annex remote was accidentally merged into a repository.
It is posted to the forum for use by anyone who finds themselves in the same situation
(especially myself…).

The root cause of the issue was a copy-pasted `git remote add` gone wrong,
and a subsequent `git annex sync`, that "contaminated" the rest of my remotes.
That led to `git annex info` showing the union of all the repositories available to the two repositories,
and `fsck --all` runs looking for files from any repository.

It should go without saying, but here it is anyway:
**Following these steps can eventually lead to data loss**.
The precautions I've taken are

* knowing that two complete copies of the data sets exist,
* having a filesystem level snapshot of at least one of those copies, and
* not starting any file dropping until all remotes have completed fscks at the end.

Identifying the last good state
===============================
By looking for the first occurrence of the UUID of one of the bad new remotes
in `git log --patch git-annex`,
I've identified the last good git-annex state before the merge.
I tagged that state with `git tag before-accidental-merging-with-other-server 83c1b945c2428cefa968aec587229f6a87649de6`.

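
The search for the first occurrence can also be scripted. What follows is only a sketch under assumptions: `last_good_before` is a hypothetical helper name, it uses `git log -S` (pickaxe) instead of reading the full `--patch` output, and it assumes the accident shows up as a commit on the first-parent line of the local `git-annex` branch. Its result should be checked by hand before anything is reset to it.

```shell
# Hypothetical helper (not part of git or git-annex).
# Prints the parent of the oldest commit on the first-parent line of the
# git-annex branch whose diff changes the number of occurrences of $1.
last_good_before() {
    needle=$1
    first_bad=$(git log --first-parent -m --reverse --format=%H \
        -S"$needle" git-annex -- | head -n 1)
    # No commit touches the string: nothing to report.
    [ -n "$first_bad" ] || return 1
    # Fails quietly if the first match is a root commit without a parent.
    git rev-parse --verify --quiet "$first_bad^"
}

# Usage (the UUID is a placeholder):
#   last_good_before 00000000-0000-0000-0000-000000000000
```

Restricting the walk with `--first-parent` is a design choice: without it, commits from the foreign repository's own history could match first, because that history mentions the UUID from its very beginning.
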
Removing potentially mergeable information
==========================================
git-annex is eager to pull in updates lying around --
while this is usually a good thing,
here it incurs the danger of resurrecting the accident.

On all remotes that were accessed since the accident,
I've executed this to remove both the local synced/git-annex branch
and any memory of cached remote branches:

    $ git branch -D synced/git-annex
    $ git branch -r | sed 's@remotes/@@' | xargs git branch -d -r

and restore the git-annex branch:

    $ git branch -f git-annex 83c1b945c2428cefa968aec587229f6a87649de6

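
A possible variant of the cleanup step, given here as a sketch and not as what was actually run: `forget_remote_state` is a hypothetical function name, and it works on ref names from `git for-each-ref` instead of parsing `git branch -r`, whose output can include lines such as `origin/HEAD -> origin/master`.

```shell
# Hypothetical helper (not part of git or git-annex).
# Removes the local synced/git-annex branch and every remote-tracking ref
# of the repository in the current directory.
forget_remote_state() {
    # Delete the local synced/git-annex branch only if it exists.
    if git show-ref --verify --quiet refs/heads/synced/git-annex; then
        git branch -q -D synced/git-annex
    fi
    # Collect the ref names first, then delete them one by one.
    # --no-deref makes a symbolic ref like <remote>/HEAD itself go away
    # instead of the branch it points to.
    refs=$(git for-each-ref --format='%(refname)' refs/remotes)
    for ref in $refs; do
        git update-ref --no-deref -d "$ref"
    done
}
```

The remote-tracking refs come back with the next fetch, so this only helps if the remotes themselves have been cleaned up as well.
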
That proved to be insufficient --
after I had first only done this,
things looked good for a while and then after the first `git annex fsck --fast`,
the remotes were back again.
The only file large enough to contain the offending data in .git/annex was .git/annex/index,
so I've removed it, backed by the statement in [[internals]] that it is safe to remove:

    $ rm .git/annex/index

(I did that on all remotes; on bare ones the path is `annex/index`.)

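
To avoid picking the wrong path by hand on bare and non-bare repositories, the location can be derived from git itself. This is a small sketch; `remove_annex_index` is a hypothetical name, and it assumes the file sits at `annex/index` below the git directory, as described above.

```shell
# Hypothetical helper (not part of git or git-annex).
# Removes annex/index below the git directory of the current repository.
# `git rev-parse --git-dir` prints ".git" in a normal checkout and "." in a
# bare repository, so the same call covers both layouts.
remove_annex_index() {
    git_dir=$(git rev-parse --git-dir) || return 1
    rm -f -- "$git_dir/annex/index"
}
```
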
Verification
============
To ensure everyone is on the same page,
I've run `git annex sync`;
its speed already showed that now there's no information about a second repository being transferred.
Subsequently, I've run `git annex fsck --all` in all locations.
(That *did* show that I should previously have marked some keys as dead when they were migrated from SHA256E to SHA256,
but that's beside the point here).
Even after a sync following the above,
no traces of the bad merge (be it in the form of a repository or of a file from there) have shown up any more.
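
One more check that can be scripted is whether a known bad UUID is still mentioned anywhere in the tip of the local git-annex branch. The sketch below assumes that a plain fixed-string search is good enough; `annex_branch_mentions` is a hypothetical name. It looks neither at remote branches nor at `.git/annex/index`, so it complements the fsck runs and does not replace them.

```shell
# Hypothetical helper (not part of git or git-annex).
# Succeeds (exit status 0) if the fixed string $1 occurs in any file at the
# tip of the local git-annex branch, and fails otherwise.
annex_branch_mentions() {
    git grep -q -F -e "$1" git-annex --
}

# Usage (the UUID is a placeholder):
#   if annex_branch_mentions 00000000-0000-0000-0000-000000000000; then
#       echo "still present"
#   fi
```
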
-- [[chrysn]]

@ -0,0 +1,21 @@
[[!comment format=mdwn
username="https://christian.amsuess.com/chrysn"
nickname="chrysn"
avatar="http://christian.amsuess.com/avatar/c6c0d57d63ac88f3541522c4b21198c3c7169a665a2f2d733b4f78670322ffdc"
subject="Summary; Application: shared thumbnails"
date="2020-01-10T08:41:18Z"
content="""
There are two conflicting approaches to mtimes:
* Treat them as local artifacts
This works great with Make, and generally with any software that works on \"is newer than\" properties.
* Treat them as preservation-worthy file attributes
This is generally required by tools that compare time stamps by identity.
Both approaches break tools that expect the other, and no single out-of-the-box choice will make all users happy. Tools like metastore, a bespoke solution like etckeeper's generated mkdir/chmod file or a git-annex solution like [[storing the full mtime at genmetadata time|bugs/file_modification_time_should_be_stored_in_exactly_one_metadata_field/]] with a (local or repository-wide) option to set the mtime at annex-get time would be convenient.
One more application where this would be relevant is sharing generated thumbnails among clones of repositories (to eventually maybe even have them available when the full files are not present) following the [XDG specification on shared thumbnail repositories](https://specifications.freedesktop.org/thumbnail-spec/thumbnail-spec-latest.html#SHARED). Not only does that design rely on the mtimes of the thumbnail and the file to match, it even encodes the mtime again inside the thumbnail, practically requiring all checkouts to not only have consistent mtimes between thumbnails and files, but identical ones.
"""]]