findkeys: New command, very similar to git-annex find but operating on keys

I've long been asked for `git-annex find --all` or something like that,
but pushed back on it because I feel that the command is analogous to
find(1) and so it would be surprising for it to list keys rather than
files. So instead, add a new findkeys subcommand.

Note that the use of withKeyOptions is rather strange because usually
that is used to fall back to --all rather than listing files, but here
it's made to default to --all like behavior and never list files.
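
The difference described above can be sketched roughly as follows. This is
a simplified, standalone model and not git-annex's real code: the Seek
type and the usualSeek and findkeysSeek functions are hypothetical names
made up for illustration, and the real withKeyOptions takes more
parameters than shown here.

```haskell
module Main where

-- Hypothetical stand-in for the key options a command can receive.
data KeyOptions = WantAllKeys
  deriving (Show, Eq)

-- Hypothetical summary of what a command ends up seeking over.
data Seek
  = SeekKeys KeyOptions   -- iterate over keys
  | SeekWorkTreeFiles     -- iterate over files in the work tree
  deriving (Show, Eq)

-- The usual pattern: seek over keys only when a key option was given,
-- and otherwise fall back to the files in the work tree.
usualSeek :: Maybe KeyOptions -> Seek
usualSeek (Just ko) = SeekKeys ko
usualSeek Nothing = SeekWorkTreeFiles

-- findkeys always supplies WantAllKeys, so the file fallback is never
-- reached.
findkeysSeek :: Seek
findkeysSeek = usualSeek (Just WantAllKeys)

main :: IO ()
main = do
  print (usualSeek Nothing)
  print findkeysSeek
```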

A performance thing that could be improved is that withKeyOptions
always reads and caches location logs. But findkeys with no options does
not need them, so it could be made faster. That caching does speed up
options like --in though. This is really just a subset of a more general
performance issue: --all sometimes reads location logs unnecessarily.
Anyway, it needs to read the location log in order to checkDead,
and it seems good that findkeys does skip dead keys.

Also, cleaned up comments on git-annex-find man page asking for --all
option.
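
The default that findkeys shares with find (only list content that is
present, unless a matching option was given) could be modeled roughly like
this. It is a pure sketch under stated assumptions: the Seeker type here is
a hypothetical one-field stand-in for AnnexedFileSeeker, and the Bool
argument stands in for the result of the `limited` check that the real
function runs in the Annex monad.

```haskell
module Main where

-- Hypothetical one-field stand-in for AnnexedFileSeeker.
newtype Seeker = Seeker { checkContentPresent :: Maybe Bool }
  deriving (Show, Eq)

-- Pure model of the helper: the first argument says whether the user
-- specified any limit (matching option).
contentPresentUnlessLimited :: Bool -> Seeker -> Seeker
contentPresentUnlessLimited islimited s = s
  { checkContentPresent = if islimited
      then Nothing    -- a limit was given: do not require present content
      else Just True  -- no limit: only consider present content
  }

-- Starting point, as in the seekers passed to the helper in this commit.
base :: Seeker
base = Seeker { checkContentPresent = Nothing }

main :: IO ()
main = do
  print (contentPresentUnlessLimited False base)
  print (contentPresentUnlessLimited True base)
```

In this model, a plain `findkeys` corresponds to the False case, and
something like `findkeys --anything` or `findkeys --not --in=here`
corresponds to the True case.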

Sponsored-by: Dartmouth College's DANDI project
Joey Hess 2023-01-17 14:42:29 -04:00
parent a522a41a42
commit f8bc208e89
24 changed files with 172 additions and 205 deletions

@ -22,6 +22,7 @@ git-annex (10.20221213) UNRELEASED; urgency=medium
* web: Add urlinclude and urlexclude configuration settings.
* Added an optional cost= configuration to all special remotes.
* adb: Support the remote.name.cost and remote.name.cost-command configs.
* findkeys: New command, very similar to git-annex find but operating on keys.
-- Joey Hess <id@joeyh.name> Mon, 12 Dec 2022 13:04:54 -0400

@ -70,6 +70,7 @@ import qualified Command.PreCommit
import qualified Command.PostReceive
import qualified Command.FilterBranch
import qualified Command.Find
import qualified Command.FindKeys
import qualified Command.FindRef
import qualified Command.Whereis
import qualified Command.WhereUsed
@ -208,6 +209,7 @@ cmds testoptparser testrunner mkbenchmarkgenerator = map addGitAnnexCommonOption
, Command.AddUnused.cmd
, Command.FilterBranch.cmd
, Command.Find.cmd
, Command.FindKeys.cmd
, Command.FindRef.cmd
, Command.Whereis.cmd
, Command.WhereUsed.cmd

@ -239,15 +239,23 @@ annexedMatchingOptions :: [AnnexOption]
annexedMatchingOptions = concat
[ keyMatchingOptions'
, fileMatchingOptions' Limit.LimitAnnexFiles
, anythingNothingOptions
, combiningOptions
, timeLimitOption
, sizeLimitOption
]
-- Matching options that can operate on keys as well as files.
-- Options to match properties of keys.
keyMatchingOptions :: [AnnexOption]
keyMatchingOptions = keyMatchingOptions' ++ combiningOptions ++ timeLimitOption ++ sizeLimitOption
keyMatchingOptions = concat
[ keyMatchingOptions'
, anythingNothingOptions
, combiningOptions
, timeLimitOption
, sizeLimitOption
]
-- Matching options that can operate on keys as well as files.
keyMatchingOptions' :: [AnnexOption]
keyMatchingOptions' =
[ annexOption (setAnnexState . Limit.addIn) $ strOption
@ -378,7 +386,11 @@ fileMatchingOptions' lb =
<> help "match files smaller than a size"
<> hidden
)
, annexFlag (setAnnexState Limit.addAnything)
]
anythingNothingOptions :: [AnnexOption]
anythingNothingOptions =
[ annexFlag (setAnnexState Limit.addAnything)
( long "anything"
<> help "match all files"
<> hidden

@ -43,28 +43,26 @@ optParser desc = FindOptions
<*> parseBatchOption False
parseFormatOption :: Parser Utility.Format.Format
parseFormatOption =
parseFormatOption = parseFormatOption' "${file}\0"
parseFormatOption' :: String -> Parser Utility.Format.Format
parseFormatOption' print0format =
option (Utility.Format.gen <$> str)
( long "format" <> metavar paramFormat
<> help "control format of output"
)
<|> flag' (Utility.Format.gen "${file}\0")
<|> flag' (Utility.Format.gen print0format)
( long "print0"
<> help "output filenames terminated with nulls"
<> help "use nulls to separate output rather than lines"
)
seek :: FindOptions -> CommandSeek
seek o = do
unless (isJust (keyOptions o)) $
checkNotBareRepo
islimited <- limited
let seeker = AnnexedFileSeeker
seeker <- contentPresentUnlessLimited $ AnnexedFileSeeker
{ startAction = start o
-- only files with content present are shown, unless
-- the user has requested others via a limit
, checkContentPresent = if islimited
then Nothing
else Just True
, checkContentPresent = Nothing
, usesLocationLog = False
}
case batchOption o of
@ -77,6 +75,17 @@ seek o = do
where
ww = WarnUnmatchLsFiles
-- Default to needing content to be present, but if the user specified a
-- limit, content does not need to be present.
contentPresentUnlessLimited :: AnnexedFileSeeker -> Annex AnnexedFileSeeker
contentPresentUnlessLimited s = do
islimited <- limited
return $ s
{ checkContentPresent = if islimited
then Nothing
else Just True
}
start :: FindOptions -> SeekInput -> RawFilePath -> Key -> CommandStart
start o _ file key = startingCustomOutput key $ do
showFormatted (formatOption o) file

Command/FindKeys.hs (new file)

@ -0,0 +1,47 @@
{- git-annex command
-
- Copyright 2023 Joey Hess <id@joeyh.name>
-
- Licensed under the GNU AGPL version 3 or higher.
-}
module Command.FindKeys where
import Command
import qualified Utility.Format
import qualified Command.Find
cmd :: Command
cmd = withAnnexOptions [keyMatchingOptions] $ Command.Find.mkCommand $
command "findkeys" SectionQuery "lists available keys"
paramNothing (seek <$$> optParser)
data FindKeysOptions = FindKeysOptions
{ formatOption :: Maybe Utility.Format.Format
}
optParser :: CmdParamsDesc -> Parser FindKeysOptions
optParser _ = FindKeysOptions
<$> optional (Command.Find.parseFormatOption' "${key}\0")
seek :: FindKeysOptions -> CommandSeek
seek o = do
seeker <- Command.Find.contentPresentUnlessLimited $ AnnexedFileSeeker
{ checkContentPresent = Nothing
, usesLocationLog = False
-- startAction is not actually used since this
-- is not used to seek files
, startAction = \_ _ key -> start' o key
}
withKeyOptions (Just WantAllKeys) False seeker
(commandAction . start o)
(const noop) (WorkTreeItems [])
start :: FindKeysOptions -> (SeekInput, Key, ActionItem) -> CommandStart
start o (_si, key, _ai) = start' o key
start' :: FindKeysOptions -> Key -> CommandStart
start' o key = startingCustomOutput key $ do
Command.Find.showFormatted (formatOption o) (serializeKey' key)
(Command.Find.formatVars key (AssociatedFile Nothing))
next $ return True

@ -85,7 +85,7 @@ finds files in the current directory and its subdirectories.
[[git-annex-whereis]](1)
[[git-annex-list]](1)
[[git-annex-findkeys]](1)
# AUTHOR

@ -1,9 +0,0 @@
[[!comment format=mdwn
username="Ilya_Shlyakhter"
avatar="http://cdn.libravatar.org/avatar/1647044369aa7747829c38b9dcc84df0"
subject="Re: &quot;Confirming all annexed files exist elsewhere?&quot;"
date="2022-07-28T17:40:19Z"
content="""
Could you [[`git-annex-untrust`|git-annex-untrust]] the laptop repo, do a [[`git-annex-sync`|git-annex-sync]], then [[`git-annex-fsck`|git-annex-fsck]] to check that the files have enough trusted copies (as set in your `numcopies` setting)?
"""]]

@ -1,19 +0,0 @@
[[!comment format=mdwn
username="joey"
subject="""Re: Confirming all annexed files exist elsewhere?"""
date="2022-07-28T16:10:17Z"
content="""
@Dan `findref` never supported listing all keys either.
Yours is the best argument I've seen so far for wanting `find --all`.
But the fact that this command is about listing files, not keys, still
makes that seem out of scope for it.
Using `whereis` would certainly do what you want. Another option would
be to `untrust` the repository that you are going to be deleting, and then
run `fsck --all`. Although that would report potentially other problems
besides files that are only present in that repository.
Finally, there's the bare metal option, which is also the fastest:
`find .git/annex/objects -type f`
"""]]

@ -1,26 +0,0 @@
[[!comment format=mdwn
username="Dan"
avatar="http://cdn.libravatar.org/avatar/986de9e060699ae70ff7c31342393adc"
subject="Re: Confirming all annexed files exist elsewhere?"
date="2022-07-28T23:20:03Z"
content="""
Thanks Joey and Ilya for the nigh simultaneous untrust then fsck suggestion.
I was able to get everything squared away using the `whereis` approach as a sort of poor man's dry run, then running the copy command I described, and then using my `whereis` again to convince myself that nothing was left behind, although I imagine I'm not the first one to be retiring a repo and so hopefully these comments will be of use to future users.
While my particular problem is solved, I just wanted to add some additional input RE adding `--all` to find.
I appreciate that Joey thinks that `find` is about \"listing files, not keys\" (and if anyone's opinion here is authoritative, it should be his), but this was not my expectation as a user (although I would agree it was called `findfiles` or something like that), so I just wanted to share my experience trying to accomplish this task.
Given what I understand of `git-annex`, my instinct was to reach for a command that would let me use the powerful matching options against all known keys, so I looked over the list of commands to try to identify something that would do this.
Right away, `find` leapt out as the natural candidate, but I couldn't get it to work how I wanted, so the next obvious choice was `list`, but that also didn't work.
It was only when I looked at the this wiki page for `find` and saw discussion of adding support for `--all` that I started searching for commands that *did* accept `--all`, and I stumbled upon `whereis`, but this required a fair deal of detective work on my part.
FWIW, `whereis` is, IMHO, just as much about listing files at particular paths as `find` is (the documentation for both describes the argument as `[path ...]`; it only typically talks about keys when `--all` is passed, and so `whereis` taking `--all` when `find` does not seems unbalanced given that `whereis` seems like a tool that would be built on top of `find`.
I think there's a similar asymmetry with `list` since it's described as being \"similar to `git annex whereis` but a more compact display.\"
Now that the `--all` genie is somewhat out of the bottle it might be too late for this, but I wonder if a `findkeys` command would help fill this need while obviating the need for `--all` being passed to most other commands.
It would be unequivocally about finding keys and not files, and its output could be say a list of keys delimited by newlines (or perhaps optionally null's to make it play nice with commands that accept `-z`).
If the user wanted to know more about the keys that matched their query, the output of this command could then be piped to `whereis`, `examinekey`, and other commands that support the `--batch` and/or `-z` option.
Of course, instead of defining a new command, this functionality could be absorbed into `find --all`.
I realize that I can accomplish precisely what I describe above with e.g., `git annex whereis --all --format='${key}\n'`, which is great now that I know it's possible under `whereis`, but as a new user I would expect to find this functionality in `find` (which helpfully already supports `--format=`) before I thought to check `whereis`.
"""]]

@ -1,10 +0,0 @@
[[!comment format=mdwn
username="joey"
subject="""comment 13"""
date="2022-07-29T16:34:56Z"
content="""
The reason I think that `git-annex find` is limited to operating on files
is because it is analagous to the `find` command. It would violate least
surprise to some extent for it to operate on keys. `git-annex whereis` has
no such expectation.
"""]]

@ -1,12 +0,0 @@
[[!comment format=mdwn
username="Dan"
avatar="http://cdn.libravatar.org/avatar/986de9e060699ae70ff7c31342393adc"
subject="comment 14"
date="2022-07-29T22:02:56Z"
content="""
Ah, I hadn't considered the parallel to the standard `find` command, but now that you mention that I understand where you're coming from and can appreciate why `whereis` is free of this association.
Still, I would think that a user who, after looking at the docs for `git annex find`, specified `--all` because they wanted to operate on keys would not be surprised.
I notice that the man page for `git annex find` already has a \"SEE ALSO\" reference to `git annex whereis`.
Could this be expanded so that it more clearly and prominently advises the reader who is looking to query against all known keys to check out the `--all` argument to `git annex whereis` as well as its `--format=` option if \"whereis\" information is not actually of interest?
"""]]

@ -1,8 +0,0 @@
[[!comment format=mdwn
username="Atemu"
avatar="http://cdn.libravatar.org/avatar/d1f0f4275931c552403f4c6707bead7a"
subject="comment 15"
date="2022-07-30T15:30:31Z"
content="""
How about a `git annex find --keys` option? That way it's crystal clear you're searching among keys rather than files.
"""]]

@ -1,8 +0,0 @@
[[!comment format=mdwn
username="yarikoptic"
avatar="http://cdn.libravatar.org/avatar/f11e9c84cb18d26a1748c33b48c924b4"
subject="could be of help to DANDI"
date="2023-01-09T21:36:43Z"
content="""
FWIW, having `find --keys` or `findkeys` could have been of good help in DANDI project to ensure for paranoid of us that all data (previous versions included) was backed up.
"""]]

@ -1,10 +0,0 @@
[[!comment format=mdwn
username="joey"
subject="""comment 17"""
date="2023-01-16T19:18:15Z"
content="""
Reminder that this is a man page, and place to maybe ask for clarification
about that documentation. It is not a place to post feature requests.
Opened a todo, [[todo/findkeys]]
"""]]

@ -1,9 +0,0 @@
[[!comment format=mdwn
username="disteph@02005197c6b0e3d92255823d62c08dbe6d7a4d52"
nickname="disteph"
avatar="http://cdn.libravatar.org/avatar/a12e6e0852d5a1985b8684b17202561c"
subject="--all ?"
date="2018-10-15T23:57:27Z"
content="""
It would be awesome if there was the same --all option, and if the command produced something in bare repos, just like with the move / copy commands. At the moment for instance, it seems running git annex find on a bare repo returns nothing. There would be the question of the output format, but I guess one key per line is the obvious format one wants. I personally would want to run git annex find before a git annex move/copy, as a kind of dry-run, i.e. just to see the list of keys that will be transferred.
"""]]

@ -1,13 +0,0 @@
[[!comment format=mdwn
username="joey"
subject="""comment 2"""
date="2018-10-16T14:40:57Z"
content="""
You can use `git annex findref master` in a bare repository, which is like
find but operates on some branch.
I am not convinced that find --all would really be that useful, since it
would have to display keys and not filenames, and find is all about displaying
filenames. I did make find error out in a bare repo rather than not doing
anything.
"""]]

@ -1,8 +0,0 @@
[[!comment format=mdwn
username="CandyAngel"
avatar="http://cdn.libravatar.org/avatar/15c0aade8bec5bf004f939dd73cf9ed8"
subject="comment 3"
date="2018-10-16T15:05:19Z"
content="""
Just bear in mind that [findref doesn't work with all the matching options](https://git-annex.branchable.com/todo/Support_for_include_matching_option_in_findref/).
"""]]

@ -1,13 +0,0 @@
[[!comment format=mdwn
username="disteph@02005197c6b0e3d92255823d62c08dbe6d7a4d52"
nickname="disteph"
avatar="http://cdn.libravatar.org/avatar/a12e6e0852d5a1985b8684b17202561c"
subject="comment 4"
date="2018-10-18T08:25:34Z"
content="""
> You can use `git annex findref master` in a bare repository, which is like find but operates on some branch.
>
> I am not convinced that find --all would really be that useful, since it would have to display keys and not filenames, and find is all about displaying filenames. I did make find error out in a bare repo rather than not doing anything.
Thanks for the quick answer and for the tip. `findref` still displays file names, so OK, I can pipe the output with `lookupkey` to have the corresponding list of keys. Still, my understanding is that the computation is not the same as a potential `find --all` (or `find` on bare repos), in the sense that commands like `move --all` (or `move` on bare repos) only scan the files that are present in the repo, whereas `git annex findref master` looks at the whole branch regardless of where the files are. Sure, I can filter it with `findref master --in=here`, but the computational cost wouldn't be the same, would it? (imagining that my repo contains orders of magnitude fewer files than the branch) Also, `move --all` catches past versions of files that are still in the repo, i.e. \"unused files\", whereas I guess `findref master --in=here` would miss them? It's just that commands like `move --all` start by doing the job I want before taking an action on the files, so I just wish there was a \"no-action\" version of them. A `--dry-run` option in `move` and `copy` would be good enough. I tried to trick the `move` command with a `move --all ... --from=here --to=here` but of course I was outsmarted by the command :-)
"""]]

@ -1,11 +0,0 @@
[[!comment format=mdwn
username="disteph@02005197c6b0e3d92255823d62c08dbe6d7a4d52"
nickname="disteph"
avatar="http://cdn.libravatar.org/avatar/a12e6e0852d5a1985b8684b17202561c"
subject="comment 5"
date="2018-10-24T11:42:15Z"
content="""
`findref` still displays file names, so OK, I can pipe the output with `lookupkey` to have the corresponding list of keys
Actually, that's not even true: `lookupkey` doesn't seem to work on a bare repo. So I don't see how I can get the list of keys that are going to be moved or copied when a `git annex move ...` or `git annex copy ...` is run from a bare repo.
"""]]

@ -1,26 +0,0 @@
[[!comment format=mdwn
username="Dan"
avatar="http://cdn.libravatar.org/avatar/986de9e060699ae70ff7c31342393adc"
subject="Confirming all annexed files exist elsewhere?"
date="2022-07-27T16:35:21Z"
content="""
I'm preparing to recycle an aging laptop that has a few git-annex repos on it. I'd like to confirm that anything in its annex(es) exist in at least one other place and want to confirm what I'm doing to check this makes sense.
At first glance, it seems like the appropriate way to do this is with `git annex find --in here --not --copies=2` (where the latter predicate should be equal to testing for copies strictly less than 2).
Since I have recently `git annex sync`-ed, this doesn't turn up anything.
However, if I understand everything correctly, this *only* checks files that are reachable from my current working tree.
Thus, if there are a bunch of files in my (not currently checked out) `dev` branch that are *not* in my worktree, then this query will not discover them.
I can get them to be considered with `git annex find --in here --not --copies=2 --branch dev`.
I could manually (or via script) loop over all of my branches and repeat `git annex find --in here --not --copies=2 --branch ${branch}` to check all of my branches.
However, this will only check the tips.
Suppose there's a file that previously existed (solely) in my master branch, but at some point it was `git rm`-ed. Then unless I specify using `--branch` a TREEISH that has that file, it will not be considered.
So, I need to use some sort of query tool that supports both the `--all` flag as well as all of the matching options.
The only thing I was able to find was `whereis`, so I can run `git annex whereis --all --in here --not --copies=2` in order to identify keys corresponding to files that are (a) locally available but where (b) the number of copies is not 2 or greater (i.e., it is here and only here).
I suppose I could also just plunge ahead with `git annex copy --to ${remote} --all --in here --not --copies=2`, but it's reassuring to be able to run the query and see what would need to get moved (as well as to see the query come back empty before I wipe the hard drive).
Is this an appropriate use of `git annex whereis`, or is there a way that I can use `git annex find` to accomplish this, or perhaps some other query tool?
In essence, I just want a way of querying "all" of the objects that git annex has ever known about using all of the standard matching options.
I see discussion above regarding the lack of `--all` support for `git annex find`, which at the time suggested using `findref` instead but it seems like that has been deprecated in favor of `find`.
"""]]

@ -0,0 +1,73 @@
# NAME
git-annex findkeys - lists available keys
# SYNOPSIS
git annex findkeys
# DESCRIPTION
Outputs a list of keys known to git-annex.
# OPTIONS
* matching options
The [[git-annex-matching-options]](1)
can be used to specify which keys to list.
By default, the findkeys command only lists keys whose content is
currently present. Specifying any of the matching options will override
this default behavior and match on all keys that git-annex knows about.
To list all keys, present or not, specify `--anything`.
To list keys whose content is not present, specify `--not --in=here`
* `--print0`
Output keys terminated with nulls, for use with `xargs -0`
* `--format=value`
Use custom output formatting.
The value is a format string, in which '${var}' is expanded to the
value of a variable. To right-justify a variable with whitespace,
use '${var;width}' ; to left-justify a variable, use '${var;-width}';
to escape unusual characters in a variable, use '${escaped_var}'
These variables are available for use in formats: key, backend,
bytesize, humansize, keyname, hashdirlower, hashdirmixed, mtime (for
the mtime field of a WORM key).
Also, '\\n' is a newline, '\\000' is a NULL, etc.
The default output format is the same as `--format='${key}\\n'`
* `--json`
Output the list of keys in JSON format.
This is intended to be parsed by programs that use
git-annex. Each line of output is a JSON object.
* `--json-error-messages`
Messages that would normally be output to standard error are included in
the json instead.
* Also the [[git-annex-common-options]](1) can be used.
# SEE ALSO
[[git-annex]](1)
[[git-annex-find]](1)
# AUTHOR
Joey Hess <id@joeyh.name>
Warning: Automatically converted into a man page by mdwn2man. Edit with care.

@ -487,6 +487,12 @@ content from the key-value store.
See [[git-annex-inprogress]](1) for details.
* `findkeys`
Similar to `git-annex find`, but operating on keys.
See [[git-annex-findkeys]](1) for details.
# METADATA COMMANDS
* `metadata [path ...]`

@ -3,7 +3,7 @@ all keys in the repository, or all known keys.
[[git-annex-find]] has a long comment section with users wanting some way
to do this, but I am strongly of the opinion that `git-annex find` should
list files, not keys, like `find(1)` does. See my comments there.
list files, not keys, like `find(1)` does.
--[[Joey]]
findkeys could support --format like find, but without `${file}`, and
@ -11,13 +11,10 @@ findkeys could support --format like find, but without `${file}`, and
useful? The other options wouldn't apply to it, except for matching options
like --in that can operate on keys.
Seems that some users are looking for a way to list all keys known to
git-annex, while other users are looking for a way to list only keys
present in the local repo. If it defaults to listing all known keys,
--in=here can be used to get the other behavior. OTOH, `git-annex find`
defaults to listing only files that are present, so it would be
a little inconsistent for findkeys to default to listing all keys.
I think I prefer the inconsistency over needing a different option than
--in=here though. --[[Joey]]
Like `git-annex find`, findkeys should default to listing only keys
whose content is present. But when an option like --in=remote
or --anything is used, it should, like find, not be limited to that.
[[!tag projects/dandi]]
> [[done]]! --[[Joey]]

@ -67,6 +67,7 @@ Extra-Source-Files:
doc/git-annex-export.mdwn
doc/git-annex-filter-branch.mdwn
doc/git-annex-find.mdwn
doc/git-annex-findkeys.mdwn
doc/git-annex-findref.mdwn
doc/git-annex-fix.mdwn
doc/git-annex-forget.mdwn
@ -727,6 +728,7 @@ Executable git-annex
Command.FilterBranch
Command.FilterProcess
Command.Find
Command.FindKeys
Command.FindRef
Command.Fix
Command.Forget