Merge branch 'master' of ssh://git-annex.branchable.com

This commit is contained in:
Joey Hess 2021-08-16 17:30:04 -04:00
commit ffa1f6ed30
GPG key ID: DB12DB0FF05F8F38
6 changed files with 297 additions and 0 deletions


@ -1,5 +1,7 @@
### Please describe the problem.
edit: I said "lustre", but it is actually Isilon underneath; my knowledge of the system was wrong/outdated/mixed up.

Finally we got what should be a normal "POSIX" NFS mount (no ACLs) on a local HPC system. I immediately ventured to test git-annex (and datalad) on that file system, using git-annex 8.20210803-g99bb214 and git 2.30.2, both coming from conda.
With basic operations it felt like it works ok.


@ -0,0 +1,254 @@
My repositories went through several migrations, and migrations leave duplicate files behind.
The man page warns about this and suggests using `git annex unused` to find them.
I'd usually copy the new files around, drop the old ones and mark the old ones dead,
but alas, manual steps are not always done.

To clean up after sloppy migrations,
or after deciding that I really want to get rid of the old files,
I'd like a tool to

* find files that were migrated,
* do some sanity checks,
* drop them from all the remotes (with extreme prejudice, as even unused files are usually subject to numcopies), and
* declare them dead (which can only happen once they're gone from all remotes).

I've written a tool that does all of this. It's available below (copy-paste; attachments seem to be disabled);
see its head for documentation.

This is a tool for forcibly dropping files,
so use it with adequate caution and at your own risk.

Since then, I can reclaim my disk space *and* run `git annex fsck --all` with good results again.

--[[chrysn]]
```python
#!/usr/bin/env python3
"""
Perform in sequence, with a stops for confirmation before acting:
* Enumerate files that were replaced with same-sized ones of different hashes;
these were probably migrated.
* Show commits in which these happened. The user should verify these were all migrations.
* Check that no to-be-dropped file are part of the current HEAD.
* Run an fsck (without --all, just on the present files) to ensure that no
recent migrations created local-only files if local-only doesn't satisfy numcopies.
* FORCE DROP THEM FROM ALL REMOTES.
* Mark them dead.
Marking files as dead will stop when a remote is encountered that has a copy
but is not accessible; this is a consequence of `dead` not working while known
copies are around.
This is not tuned for performance; it tries to avoid any O(n^2) or worse
behavior, and should complete data acquisition (or at least produce output)
within minutes even on a 150000 file, 1000 commit repository. (The slowest
parts being the fsck and the enumeration of whereis data take the longest
time).
The actual dropping takes quite a while, as each drop and dead are done
individually. (Some commands have --batch but not for --key). There are no
checks to reduce work if files were already declared dead. To avoid cluttering
the git-annex branch's history, the full run is rolled into a single commit.
Warnings:
Things like this are notoriously hard to run with a backup in place (because
your backup probably *is* another git-annex from which files will be removed
here); volume level snapshots can be helpful here, as they allow you to run
things and evaluate the outcome while retaining a way to roll back.
Use at your own risk.
"""
import json
import os.path
import stat
import subprocess
import sys
from typing import Optional
from dulwich import diff_tree, walk
from dulwich.repo import Repo
if not __debug__:
raise RuntimeError("I'm using asserts for validation here, please don't make me rewrite them to ifs")
if not os.path.exists('.git'):
raise RuntimeError("Just to make sure the fsck catches everything, please run in an annex's root")
repo = Repo('.')
def hash_from_link(link: bytes) -> str:
"""Return the hash of a git-annex link if a link looks like one"""
if b'/.git/annex/' not in link:
return
link = link.decode('utf8') # whoever uses non-UTF8 file extensions won't have my pity
link = link.lstrip('./')
link = link.removeprefix('git/annex/objects/')
if link.count('/') > 1:
# newer XX/YY/hash/hash scheme
link = link[6:]
dirname, _, filename = link.partition('/')
assert dirname == filename
return filename
def parse_hash(h: str) -> (str, Optional[int], str):
"""Given a hash, the hash algorithm, the file size if indicated, and the hash value"""
if ':' in h: # hoping it never shows up in an extension
# old style
halg, hval = h.split(':', 1)
return (halg, None, hval)
halg_params, _, hval = h.partition('--')
size = None
halg, *params = halg_params.split('-')
for p in params:
if p[0] == 's':
size = int(p[1:])
else:
raise ValueError("Unknown extension: %s" % p)
# hval may need trimming still for extensions
return (halg, size, hval)
# Commits in which migrations happened, mapped to set of all reasons found there
migrating_commits = {}
# Old / migrated hashes that are to be removed
hashes_to_kill = set()
seen_commits = set()
new_commits = {repo.head(),}
while new_commits:
current_commits = new_commits
seen_commits = seen_commits.union(current_commits)
new_commits = set()
for c in current_commits:
c = repo[c]
new_commits = new_commits.union(c.parents)
if len(c.parents) != 1:
# Not expecting any migrations in merge commits
continue
for (eold, enew) in diff_tree.walk_trees(repo, repo[c.parents[0]].tree, c.tree, True):
if eold.sha == enew.sha:
continue
if eold.mode != stat.S_IFLNK or enew.mode != stat.S_IFLNK:
continue
old_full = hash_from_link(repo[eold.sha].data)
new_full = hash_from_link(repo[enew.sha].data)
if old_full is None or new_full is None:
continue # was not a git-annex link
old_hash, old_len, old_hashval = parse_hash(old_full)
new_hash, new_len, new_hashval = parse_hash(new_full)
if (old_hash, old_len, old_hashval) == (new_hash, new_len, new_hashval):
migration_type = "git annex version upgrade"
assert ':' in old_full, "Well what else could it be?"
# While they were migrations, git-annex really has made sure
# they got carried over; modern git annex operations don't even
# work on them any more
continue
elif (old_hash, old_hashval) == (new_hash, new_hashval) and \
old_len is None and new_len is not None:
migration_type = "Length added"
elif old_hash != new_hash and \
(old_len is None or old_len == new_len):
migration_type = "Hash change from %s to %s" % (old_hash, new_hash)
else:
print("Spurious migration", old_full, new_full)
print((old_hash, old_len, old_hashval))
print((new_hash, new_len, new_hashval))
raise RuntimeError("Not understanding this migration, exiting")
assert old_full != new_full, "We don't want to drop still in used keys, and these should have been caught before"
hashes_to_kill.add(old_full)
migrating_commits.setdefault(c, set()).add(migration_type)
new_commits = new_commits.difference(seen_commits)
print("These commits were found to have migrations:\n")
for (commit, reasons) in sorted(migrating_commits.items(), key=lambda cr: cr[0].commit_time):
print("Author:", commit.author.decode('utf8'))
print(commit.message.decode('utf8').strip())
print("Migrations:", ", ".join(reasons))
print()
print("Hashes that were migrated away from: %d" % len(hashes_to_kill))
# a real tree and not a digraph, so easier than the commit walking
files_checked = 0
bad_files = []
trees_to_check = [repo[repo.head()].tree]
while trees_to_check:
old_trees = trees_to_check
trees_to_check = []
for t in list(old_trees):
for item in repo[t].items():
if item.mode == stat.S_IFDIR:
trees_to_check.append(item.sha)
if item.mode == stat.S_IFLNK:
link = repo[item.sha].data
full = hash_from_link(link)
if full in hashes_to_kill:
bad_files.append(item.path)
files_checked += 1
if bad_files:
print("Some hashes that have been migrated away from were still around, files with these names:")
print(bad_files)
sys.exit(1)
print("Checked %d symlinks in HEAD, none of them points to an old hash any more" % files_checked)
print("performing a non-`--all` fsck...")
subprocess.check_call('git annex fsck --fast --quiet', shell=True)
print("Checked that the files that *are* in the tree are properly distributed.")
print("Gathering whereis data to decide where to drop from...")
whereall = subprocess.Popen(['git', 'annex', 'whereis', '--json', '--all'], stdout=subprocess.PIPE)
hashes_to_kill_remotes = {}
for line in whereall.stdout:
wherethis = json.loads(line)
if wherethis['key'] not in hashes_to_kill:
continue
remotes = {None if r['here'] else r['uuid'] for r in wherethis['whereis'] + wherethis['untrusted']}
if remotes:
hashes_to_kill_remotes[wherethis['key']] = remotes
if hashes_to_kill_remotes:
wheretodrop = {r or "here" for r in set.union(*hashes_to_kill_remotes.values())}
else:
wheretodrop = set()
print(f"Found {len(hashes_to_kill_remotes)} migrated hashes still around on remotes {wheretodrop}")
print()
print("If you want to really drop all of them, enter `force drop and declare them dead` here:")
line = input()
if line != "force drop and declare them dead":
print("Good choice. (Sorry if you mistyped...)")
sys.exit(0)
try:
subprocess.check_call(["git", "-c", "annex.commitmessage=updates before running migrate-mark-dead.py", "annex", "merge"])
annex_no_autocommit = ["git", "-c", "annex.alwayscommit=false", "annex"]
# Network first, to ensure the password prompts come fast even when most files are dead already
for (i, (key, remotes)) in enumerate(hashes_to_kill_remotes.items()):
for r in remotes:
if r is None:
# Can't be run with `--from here`
subprocess.check_call(annex_no_autocommit + ['drop', '--force', '--key', key])
else:
subprocess.check_call(annex_no_autocommit + ['drop', '--force', '--key', key, '--from', r])
if (i % 10 == 0):
print(f"Dropped {i} ({100 * i/len(hashes_to_kill_remotes):.1f}% of) present hashes")
for i, key in enumerate(hashes_to_kill):
subprocess.check_call(annex_no_autocommit + ['dead', '--key', key])
if (i % 100 == 0):
print(f"Marked {i} ({100 * i/len(hashes_to_kill):.1f}% of) unused hashes as dead")
finally:
subprocess.check_call(["git", "-c", "annex.commitmessage=ran migrate-mark-dead.py", "annex", "merge"])
```
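For illustration, this is how the script's two parsing helpers behave on typical inputs (standalone copies of the functions, so the snippet runs on its own; the key values and symlink target are made up, not taken from a real repository):

```python
from typing import Optional

def hash_from_link(link: bytes) -> Optional[str]:
    # copy of the script's helper: extract the key from an annex symlink target
    if b'/.git/annex/' not in link:
        return None
    text = link.decode('utf8').lstrip('./').removeprefix('git/annex/objects/')
    if text.count('/') > 1:
        text = text[6:]  # strip the XX/YY/ hash-directory prefix
    dirname, _, filename = text.partition('/')
    assert dirname == filename
    return filename

def parse_hash(h: str) -> "tuple[str, Optional[int], str]":
    # copy of the script's helper: split a key into (algorithm, size, value)
    if ':' in h:  # old-style key format
        halg, hval = h.split(':', 1)
        return (halg, None, hval)
    halg_params, _, hval = h.partition('--')
    size = None
    halg, *params = halg_params.split('-')
    for p in params:
        if p[0] == 's':
            size = int(p[1:])
        else:
            raise ValueError("Unknown extension: %s" % p)
    return (halg, size, hval)

# a made-up modern key with a size field and an extension
print(parse_hash("SHA256E-s1048576--0123abcd.tar.gz"))
# → ('SHA256E', 1048576, '0123abcd.tar.gz')

# a made-up old-style key without a size
print(parse_hash("SHA1:0123abcd"))
# → ('SHA1', None, '0123abcd')

# a made-up symlink target as git stores it for annexed files
print(hash_from_link(b"../../.git/annex/objects/Wx/q5/SHA256-s5--aaaa/SHA256-s5--aaaa"))
# → SHA256-s5--aaaa
```

Note that, as the script's comment says, the value returned by `parse_hash` may still carry a file extension for E-style backends; the migration detection only compares whole parsed tuples, so this does not matter there.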


@ -0,0 +1,8 @@
[[!comment format=mdwn
username="spwhitton"
avatar="http://cdn.libravatar.org/avatar/9c3f08f80e67733fd506c353239569eb"
subject="annex-to-annex"
date="2021-08-13T22:08:57Z"
content="""
[annex-to-annex](https://manpages.debian.org/buster-backports/libgit-annex-perl/annex-to-annex.1p.en.html) does something similar.
"""]]


@ -0,0 +1,11 @@
[[!comment format=mdwn
username="https://christian.amsuess.com/chrysn"
nickname="chrysn"
avatar="http://christian.amsuess.com/avatar/c6c0d57d63ac88f3541522c4b21198c3c7169a665a2f2d733b4f78670322ffdc"
subject="annex-to-annex"
date="2021-08-15T11:47:30Z"
content="""
Thanks for the pointer, I was unaware of that tool. Indeed, for local-only repositories, annex-to-annex avoids these troubles altogether; I suppose that with remotes, running annex-to-annex on all of them in a coordinated fashion does too.

IIUC, the annex-to-annex-dropunused tool does a similar cleanup to this one, provided annex-to-annex was used in the first place. It seems not to mark these files as dead, though, so a `git annex fsck --all` will fail from then on.
"""]]


@ -0,0 +1,9 @@
[[!comment format=mdwn
username="matthias.risze@9f2c8f7faed4cac1905d1bf1ee4524d708c13688"
nickname="matthias.risze"
avatar="http://cdn.libravatar.org/avatar/c9f7f022a1d62c39497b72c56a6a535e"
subject="type=git special remote cannot be enabled, no uuid is generated"
date="2021-08-16T13:02:04Z"
content="""
Following the instructions here, I cannot enable the remote. The error message is: `git-annex: Unknown remote name.`. I assume this is because git-annex does not create a uuid for the type=git special remote, presumably because none is set for the actual git remote (the annex-uuid key does not exist for the existing git remote with the same URL). This is the relevant line generated in remote.log: `autoenable=true location=<ssh-url> name=<name> type=git timestamp=1629118438.628919s`; as you can see, there is no uuid at the beginning. Any ideas whether this is a bug or whether the instructions are outdated?
"""]]


@ -0,0 +1,13 @@
[[!comment format=mdwn
username="https://christian.amsuess.com/chrysn"
nickname="chrysn"
avatar="http://christian.amsuess.com/avatar/c6c0d57d63ac88f3541522c4b21198c3c7169a665a2f2d733b4f78670322ffdc"
subject="Another example"
date="2021-08-15T17:42:54Z"
content="""
The program at [[forum/Migrate_mark_files_dead]] shows again how batch-key would be useful, here for `git annex drop --from remote` and `git annex dead`.

I don't have numbers as I can't run it in batch mode, but comparing with other multi-file batch drop operations, I guesstimate this makes the difference between a script running for an hour invoking git-annex-drop a thousand times (with interruptions whenever the SSH agent decides to ask for confirmation of a key again) and five minutes with --batch-key.

As with the original use case of annex-to-web, filtering is not an issue for this application.
"""]]