Each command that first checks preferred content (and/or required
content) and then does something that can change the sizes of
repositories needs to call prepareLiveUpdate, and plumb it through the
preferred content check and the location log update.
So far, only Command.Drop is done. Many other commands that don't need
to do this have been updated to keep working.
There may be some calls to NoLiveUpdate in places where that should be
done. All will need to be double checked.
Not currently in a compilable state.
updateRepoSize is only called on the UUID of a repository, not any
cluster it might be a node of. But overLocationLogs and overLocationLogsJournal
were inclusing cluster UUIDs. So it was inconsistent.
Currently I don't see any reason to calculate RepoSize for a cluster.
It's not even clear what it should mean, the total size of all nodes, or
the amount of information stored in the cluster in total?
This will be used to prime the RepoSizes database, which will always
contain values that correpond to information in the git-annex branch, so
without anything from journal files.
Factored out overJournalFileContents which will later be used to
update Annex.reposizes to include information from journal files.
This will be partitcularly important to support private UUIDs which only
ever get to journal files and not to the branch.
5afbea25e7 changed it to ignore journal
files that did not correspond to a key in the git-annex branch. However,
when there is a private journal, that can happen.
Neither behavior is fully correct, so keep the old incorrect behavior
rather than introducing a new differently incorrect behavior.
I plan to eventually make git-annex info use Annex.reposizes instead of
calculating it itself, and once Annex.reposizes handles this all
correctly, this will be a moot problem.
git-annex info was displaying a message that didn't make sense in
context.
In calcRepoSizes, it seems better to return the information from the
git-annex branch, rather than giving up. Especially since balanced
preferred content uses it, and we can't just give up evaluating a
preferred content expression if git-annex is to be usable in such a
readonly repo.
Commit 6d7ecd9e5d nobly wanted git-annex
to behave the same with such unmerged branches as it does when it can
merge them. But for the purposes of preferred content, it seems to me
there's a sense that such an unmerged branch is the same as a remote we
have not pulled from. The balanced preferred content will either way
operate under outdated information, and so make not the best choices.
Thanks to previous work in 11cc9f1933,
this is almost entirely free, it only needs to do some additional map
lookups and math.
The strictness annotations keep the memory use from blowing up.
Sponsored-by: unqueued on Patreon
This can take a lot of memory. I decided to violate the usual rule in
git-annex that it operate in constant memory no matter how many annexed
objects. In this case, it would be hard to be fast without using a big
map of the location logs. The main difficulty here is that there can be
many git-annex branches and it needs to display a consistent view at a
point in time, which means merging information from multiple git-annex
branches.
I have not checked if there are any laziness leaks in this code. It
takes 1 gb to run in my big repo, which is around what I estimated
before writing it.
2 options that are documented are not yet implemented.
Small bug: With eg --when=1h, it will display at 12:00 then 1:10 if the
next change after 12:59 is then. Then it waits until after 2:10 to
display the next change. It ought to wait until after 2:00.
Sponsored-by: Brock Spratlen on Patreon
Factored out overLocationLogs from CmdLine.Seek, which can calculate this
pretty fast even in a large repo. In my big repo, the time to run git-annex
info went up from 1.33s to 8.5s.
Note that the "backend usage" stats are for annexed files in the working
tree only, not all annexed files. This new data source would let that be
changed, but that would be a confusing behavior change. And I cannot
retitle it either, out of fear something uses the current title (eg parsing
the json).
Also note that, while time says "402108maxresident" in my big repo now,
up from "54092maxresident", top shows the RES constant at 64mb, and it
was 48mb before. So I don't think there is a memory leak. I tried using
deepseq to force full evaluation of addKeyCopies and memory use didn't
change, which also says no memory leak. And indeed, not even calling
addKeyCopies resulted in the same memory use. Probably the increased memory
usage is buffering the stream of data from git in overLocationLogs.
Sponsored-by: Brett Eisenberg on Patreon
I considered a more wide-ranging config option to make other commands
also show dead repositories. But it would be difficult to implement that
because Remote.keyLocations is used to get locations, filtering out dead
repos, and commands like get then try to use those locations. So a config
setting would make dead repos sometimes be acted on by commands.
Sponsored-by: unqueued on Patreon
This does, as a side effect, make long notes in json output not
be indented. The indentation is only needed to offset them
underneath the display of the file they apply to, so that's ok.
Sponsored-by: Brock Spratlen on Patreon
Added StringContainingQuotedPath, which is used for ActionItemOther.
In the process, checked every ActionItemOther for those containing
filenames, and made them use quoting.
Sponsored-by: Graham Spencer on Patreon
This is by no means complete, but escaping filenames in actionItemDesc does
cover most commands.
Note that for ActionItemBranchFilePath, the value is branch:file, and I
choose to only quote the file part (if necessary). I considered quoting the
whole thing. But, branch names cannot contain control characters, and while
they can contain unicode, git coes not quote unicode when displaying branch
names. So, it would be surprising for git-annex to quote unicode in a
branch name.
The find command is the most obvious command that still needs to be
dealt with. There are probably other places that filenames also get
displayed, eg embedded in error messages.
Some other commands use ActionItemOther with a filename, I think that
ActionItemOther should either be pre-sanitized, or should explicitly not
be used for filenames, so that needs more work.
When --json is used, unicode does not get escaped, but control
characters were already escaped in json.
(Key escaping may turn out to be needed, but I'm ignoring that for now.)
Sponsored-by: unqueued on Patreon
Works around this bug in unix-compat:
https://github.com/jacobstanley/unix-compat/issues/56
getFileStatus and other FilePath using functions in unix-compat do not do
UNC conversion on Windows.
Made Utility.RawFilePath use convertToWindowsNativeNamespace to do the
necessary conversion on windows to support long filenames.
Audited all imports of System.PosixCompat.Files to make sure that no
functions that operate on FilePath were imported from it. Instead, use
the equvilants from Utility.RawFilePath. In particular the
re-export of that module in Common had to be removed, which led to lots
of other changes throughout the code.
The changes to Build.Configure, Build.DesktopFile, and Build.TestConfig
make Utility.Directory not be needed to build setup. And so let it use
Utility.RawFilePath, which depends on unix, which cannot be in
setup-depends.
Sponsored-by: Dartmouth College's Datalad project
info: Fix reversion in last release involving handling of unsupported input
by continuing to handle any other inputs, before exiting nonzero at the
end.
Sponsored-by: Dartmouth College's Datalad project
Used to fail with a bad error message, indicating there was no
repository with the specified name, or something like that. Now, suggest
they use the uuid to disambiguate.
* info, enableremotemote, renameremote: Avoid a confusing message when more
than one repository matches the user provided name.
* info: Exit nonzero when the input is not supported.
Sponsored-by: Kevin Mueller on Patreon
--backend is no longer a global option, and is only accepted by commands
that actually need it.
Three commands that used to support backend but don't any longer are
watch, webapp, and assistant. It would be possible to make them support it,
but I doubt anyone used the option with these. And in the case of webapp
and assistant, the option was handled inconsistently, only taking affect
when the command is run with an existing git-annex repo, not when it
creates a new one.
Also, renamed GlobalOption etc to AnnexOption. Because there are many
options of this type that are not actually global (any more) and get
added to commands that need them.
Sponsored-by: Kevin Mueller on Patreon
Use cases include using git-annex init --no-autoenable and then going back
and enabling the special remotes that have autoenable configured. As well
as just querying to remember which ones have it enabled.
It lists all special remotes that have autoenable=yes whether currently
enabled or not. And it can be used with --json.
I pondered making this "git-annex info autoenable", but that seemed wrong
because then if the use has a directory named "autoenable", it's unclear
what they are asking for. (Although "git-annex info remote" may be
similarly unclear.) Making it an option does mean that it can't be provided
via --batch though.
Sponsored-by: Dartmouth College's Datalad project
The patch that removed it did not break anything, since the strings it's
used on are all ASCII not unicode. But I like making sure to use
packString everywhere just in case the code later changes in a way that
needs it.
In aeson 2.0, Text has been replaced by the Key type and HashMap by the
KeyMap interface. Accomodating this required adding some CPP in order to
still be able to compile with aeson < 2.0. The required changes were:
* Prevent Key from being re-exported by Utilities.Aeson, as it clashes
with git-annex's own Key type.
* Fix up convertion from String/Text to Key (or Text in aeson 1.*) in a
couple of places
* Import Data.Aeson.KeyMap instead of Data.HashMap.Strict, as they are
mostly API-compatible. insertWith needs to be replaced by unionWith,
however, as KeyMap lacks the former function.
File matching options like --include will be rejected in situations where
there is no filename to match against. (Or where there is a filename but
it's not relative to the cwd, or otherwise seemed too bothersome to match
against.)
The addition of listKeys' was necessary to avoid using more memory in the
common case of "git-annex info". Adding a filterM would have caused the
list to buffer in memory and not stream. This is an ugly hack, but listKeys
had previously run Annex operations inside unafeInterleaveIO (for direct
mode). And matching against a matcher should hopefully not change any Annex
state.
This does allow for eg `git-annex info somefile --include=*.ext`
although why someone would want to do that I don't really know. But it
seems to make sense to allow it.
But, consider: `git-annex info ./somefile --include=somefile`
This does not match, so will not display info about somefile.
If the user really wants to, they can `--include=./somefile`.
Using matching options like --copies or --in=remote seems likely to be
slower than git-annex find with those options, because unlike such
commands, info does not have optimised streaming through the matcher.
Note that `git-annex info remote` is not the same as
`git-annex info --in remote`. The former shows info about all files in
the remote. The latter shows local keys that are also in that remote.
The output should make that clear, but this still seems like a point
where users could get confused.
Sponsored-by: Jochen Bartl on Patreon
Reject combinations of --batch (or --batch-keys) with options like --all or
--key or with filenames.
Most commands ignored the non-batch items when batch mode was enabled.
For some reason, addurl and dropkey both processed first the specified
non-batch items, followed by entering batch mode. Changed them to also
error out, for consistency.
Sponsored-by: Dartmouth College's Datalad project
It would be difficult to make Annex.Branch.files query the unmerged
git-annex branches. Might be possible, similar to what was discussed in
7f6b2ca49c but again I decided to make it
not do anything in that situation to start with before adding such a
complicated thing.
git-annex info uses it when getting info about a repostory. The choices
were to make that fail with an error, or display the info it can, and
change the output slightly for the bits of info it cannot access. While
that is a behavior change, and I want to avoid any behavior changes due
to unmerged git-annex branches in a read-only repo, displaying a message
that is not a number seems unlikely to break anything that was consuming
a number, any worse than throwing an exception would. Probably.
Also git-annex unused --from origin is made to throw an error, but
it would fail later anyway when trying to write to the unused log files.
Sponsored-by: Dartmouth College's Datalad project
New --batch-keys option added to these commands: get, drop, move, copy, whereis
git-annex-matching-options had to be reworded since some of its options
can be used to match on keys, not only files.
Sponsored-by: Luke Shumaker on Patreon
This eliminates the distinction between decodeBS and decodeBS', encodeBS
and encodeBS', etc. The old implementation truncated at NUL, and the
primed versions had to do extra work to avoid that problem. The new
implementation does not truncate at NUL, and is also a lot faster.
(Benchmarked at 2x faster for decodeBS and 3x for encodeBS; more for the
primed versions.)
Note that filepath-bytestring 1.4.2.1.8 contains the same optimisation,
and upgrading to it will speed up to/fromRawFilePath.
AFAIK, nothing relied on the old behavior of truncating at NUL. Some
code used the faster versions in places where I was sure there would not
be a NUL. So this change is unlikely to break anything.
Also, moved s2w8 and w82s out of the module, as they do not involve
filesystem encoding really.
Sponsored-by: Shae Erisson on Patreon
Not yet used, but allows getting the size of items in the tree fairly
cheaply.
I noticed that CmdLine.Seek uses ls-tree and the feeds the files into
another long-running process to check their size. That would be an
example of a place that might be sped up by using this. Although in that
particular case, it only needs to know the size of unlocked files, not
locked. And since enabling --long probably doubles the ls-tree runtime
or more, the overhead of using it there may outwweigh the benefit.
MatchingKey is not the thing to use when matching on actual worktreee
files.
Fix reversion in 8.20201116 that made include= and exclude= in
preferred/required content expressions match a path relative to the current
directory, rather than the path from the top of the repository.
Lots of nice wins from this in avoiding unncessary work, and I think
nothing got slower.
This commit was sponsored by Boyd Stephen Smith Jr. on Patreon.
Anything that needs to examine the file content will fail to match,
or fall back to other available information. But the intent is that the
matcher be checked for matchNeedsFileContent and only be used if it does
not, so the exact behavior doesn't much matter as it should never
happen.
The real point of this is to not need to provide a dummy content file
when matching.
This commit was sponsored by Martin D on Patreon.
Eliminate a zombie that was only cleaned up by the later zombie cleanup
code.
This is still not ideal, it would be cleaner if it used conduit or
something, and if the thread gets killed before waiting, it won't stop
the process.
Only remaining zombies are in CmdLine.Seek
The use case of this field is mostly to support -J combined with --json.
When that is implemented, a user will be able to look at the field to
determine which of the requests they have sent it corresponds to.
The field typically has a single value in its list, but in some cases
mutliple values (eg 2 command-line params) are combined together and the
list will have more.
Note that json parsing was already non-strict, so old git-annex metadata
--json --batch can be fed json produced by the new git-annex and will
not stumble over the new field.
No behavior changes (hopefully), just adding SeekInput and plumbing it
through to the JSON display code for later use.
Over the course of 2 grueling days.
withFilesNotInGit reimplemented in terms of seekHelper
should be the only possible behavior change. It seems to test as
behaving the same.
Note that seekHelper dummies up the SeekInput in the case where
segmentPaths' gives up on sorting the expanded paths because there are
too many input paths. When SeekInput later gets exposed as a json field,
that will result in it being a little bit wrong in the case where
100 or more paths are passed to a git-annex command. I think this is a
subtle enough problem to not matter. If it does turn out to be a
problem, fixing it would require splitting up the input
parameters into groups of < 100, which would make git ls-files run
perhaps more than is necessary. May want to revisit this, because that
fix seems fairly low-impact.
Clean build under ghc 8.8.3, which seems to do better at finding cases
where two imports both provide the same symbol, and warns about one of
them.
This commit was sponsored by Ilya Shlyakhter on Patreon.