Fix bug caused by recent optimisations that could make git-annex not see
recently recorded status information when configured with
annex.alwayscommit=false.
This does mean that --all can end up processing the same key more than once,
but before the optimisations that introduced this bug, it used to also behave
that way. So I didn't try to fix that; it's an edge case and anyway git-annex
behaves well when run on the same key repeatedly.
I am not too happy with the use of a MVar to buffer the list of files in the
journal. I guess it doesn't defeat lazy streaming of the list, if that
list is actually generated lazily, and anyway the size of the journal is
normally capped and small, so if configs are changed to make it huge and
this code path fire, git-annex using enough memory to buffer it all is not a
large problem.
Fix bug caused by recent optimisations that could make git-annex not see
recently recorded status information when configured with
annex.alwayscommit=false.
When not using --all, precaching only gets triggered when the
command actually needs location logs, and so there's no speed hit there.
This is a minor speed hit for --all, because it precaches even when the
location log is not actually going to be used, and so checking the journal
is not necessary. It would have been possible to defer checking the journal
until the cache gets used. But that would complicate the usual Branch.get
code path with two different kinds of caches, and the speed hit is really
minimal. A better way to speed up --all, later, would be to avoid
precaching at all when the location log is not going to be used.
Not yet used, but allows getting the size of items in the tree fairly
cheaply.
I noticed that CmdLine.Seek uses ls-tree and the feeds the files into
another long-running process to check their size. That would be an
example of a place that might be sped up by using this. Although in that
particular case, it only needs to know the size of unlocked files, not
locked. And since enabling --long probably doubles the ls-tree runtime
or more, the overhead of using it there may outwweigh the benefit.
Added LinkType to ProvidedInfo, and unified MatchingKey with
ProvidedInfo. They're both used in the same way, so there was no real
reason to keep separate.
Note that addLocked and addUnlocked still set matchNeedsFileName,
because to handle MatchingFile, they do need it. However, they
don't use it when MatchingInfo is provided. This should be ok,
the --branch case will be able skip checking matchNeedsFileName,
since it will provide a filename in any case.
Previously such nonsensical combinations always treated the matching option
as if it didn't match.
For now, made find --branch refuse matching options that need a
filename, because one is not provided to them in a way they'll use.
There's an open bug report to support it, but making it error out is
better than the old behavior of not finding what it was asked to.
Also, made --mimetype combined with eg --all work, by looking at the
object file when operating on keys.
It got broken in several ways by the streaming seeking optimisations
around version 8.20201007.
Moved time limit checking out of the matcher, which was a hack in the
first place. So everywhere that uses Limit.getMatcher needs to check
time limit. Well, almost everywhere. Command.Info uses it, but it does
not make sense to time limit getting info. And Command.MultiCast uses it
just to build up a list of files that then get passed to a command, so
it would never have hit the timeout in a useful way.
This implementation is a little more expensive when at time limit than
necessary, since it continues seeking only to discard everything after the
time limit. I did try making it close the file handles to force a faster
shutdown, but that didn't work and hung. Could certianly be improved
somehow, but seeking is probably not the expensive bit when a time limit
is hit, so this seems acceptable for now.
MatchingKey is not the thing to use when matching on actual worktreee
files.
Fix reversion in 8.20201116 that made include= and exclude= in
preferred/required content expressions match a path relative to the current
directory, rather than the path from the top of the repository.
The problem was this line:
cleanup = and <$> sequence (map snd v)
That caused all of v to be held onto until the end, when the cleanup action
was run.
I could not seem to find a bang pattern that avoided the leak, so I
resorted to a IORef, rather clunky, but not a performance problem because
it will only be written once per git ls-files, so typically just 1 time.
This commit was sponsored by Mark Reidenbach on Patreon.
Anything that needs to examine the file content will fail to match,
or fall back to other available information. But the intent is that the
matcher be checked for matchNeedsFileContent and only be used if it does
not, so the exact behavior doesn't much matter as it should never
happen.
The real point of this is to not need to provide a dummy content file
when matching.
This commit was sponsored by Martin D on Patreon.
This was the last one marked as a zombie. There might be others I don't
know about, but except for in the hypothetical case of a thread dying
due to an async exception before it can wait on a process it started, I
don't know of any.
It would probably be safe to remove the reapZombies now, but let's wait
and so that in its own commit in case it turns out to cause problems.
This commit was sponsored by Boyd Stephen Smith Jr. on Patreon.
Sped up seeking for files to operate on, when using options like --copies
or --in, by around 20%.
Benchmark showed an increase for --copies from 155 seconds to 121
seconds, and --in remote will be similar to that.
For --in here, the speedup was less, 5-10% or so.
(both warm cache)
This commit was sponsored by Jack Hill on Patreon.
No behavior changes (hopefully), just adding SeekInput and plumbing it
through to the JSON display code for later use.
Over the course of 2 grueling days.
withFilesNotInGit reimplemented in terms of seekHelper
should be the only possible behavior change. It seems to test as
behaving the same.
Note that seekHelper dummies up the SeekInput in the case where
segmentPaths' gives up on sorting the expanded paths because there are
too many input paths. When SeekInput later gets exposed as a json field,
that will result in it being a little bit wrong in the case where
100 or more paths are passed to a git-annex command. I think this is a
subtle enough problem to not matter. If it does turn out to be a
problem, fixing it would require splitting up the input
parameters into groups of < 100, which would make git ls-files run
perhaps more than is necessary. May want to revisit this, because that
fix seems fairly low-impact.
Avoid complaining that a file with "is beyond a symbolic link" when the
filepath is absolute and the symlink in question is not actually inside the
git repository.
This assumes that inodes remain stable while the command is running.
I think they always will, the filesystems where they are unstable change
them across mounts. (If inodes were not stable, it would just complain about
symlinks in the path that are not inside the working tree.)
(On windows, I don't want to assume anything about inodes, they could be
random numbers for all I know. But if they were, this would still be ok, as
long as windows doesn't have symlinks that are detected by isSymbolicLink.
Which seems a fair bet.)
Part of workTreeItems is trying detect a case
where git porcelain refuses to process a file, and where
git ls-files silently outputs nothing. But, it's hard to perfectly
replicate git's behavior, and besides, git's behavior could change.
So it could be that we warn, but then git ls-files does not skip over
it, and so git-annex also processes it after warning about it.
So, if we think we have a problem with a parameter, display the warning,
and skip processing it at all.
Implementing this was complicated by needing to handle the case where
all command-line parameters get filtered out this way. Which is
different than the case where there are none, because we don't want to
operate on all files in this new case..
And add a test case for that.
This certianly loses some of the 2x performance improvement in file
seeking that seekFilteredKeys led to, because now it has to stat the
worktree files again. Without benchmarking, I expect there will still be
a sizable improvement, and also the git-annex branch precaching that
seekFilteredKeys can do will still be a win of its approach.
Also worth noting that lookupKey, when the file DNE, check if it's in an
adjusted branch with hidden files, and if so, finds the key for the
file anyway. That was intended to make git-annex sync --content be able
to process those files, but a side effect was that, when a file was
deleted but the deletion not yet staged, git-annex commands used to
still list it. That was actually a bug. This commit fixes that bug too.
(git-annex sync --content on such a branch does not use seekFilteredKeys
so was not affected by the reversion or by this behavior change)
This commit was sponsored by Jake Vosloo on Patreon.
This was a bit disappointing, I was hoping for a 2x speedup. But, I think
the metadata lookup is wasting a lot of time and also needs to be made to
stream.
The changes to catObjectStreamLsTree were benchmarked to not also speed
up --all around 3% more. Seems I managed to make it polymorphic after all.
This solves the same problem as commit b4d0f6dfc2
but in a better way, that should make processing pointer files maximally
fast. If there is a mixture of pointer files and symlinks, the first
symlinks until the pointer file are handled maximally fast, while the
ones after that go via the slightly slower path.
This removes all calls to inAnnex, except for some involving --batch.
It may be that the batch code could get a similar speedup, but I don't
know if people habitually pass a huge number of files through --batch
that git-annex does not need to do anything to process, so I skipped it
for now.
A few calls to ifAnnexed remain, and might be worth doing more to
convert. In particular, Command.Sync has one that would probably speed
it up by a good amount.
(also removed some dead code from Command.Lock)
This is all good, except for one small problem... When a pointer file
has to be fed into the metadata cat-file, it's possible for a
non-pointer file that comes after it to get fed into the main cat-file
first, so the two files will be processed in a different order than the
user specified.
So, while this is the fast way, I guess I'll have to change it to be
slower, but sequential..
This is only implemented for git-annex get so far. It makes git-annex
get nearly twice as fast in a repo with 10k files, all of them present!
But, see the TODO for some caveats.
This assumes that no location log files will have a newline or carriage
return in their name. catObjectStream skips any such files due to
cat-file not supporting them.
Keys have been prevented from containing newlines since 2011,
commit 480495beb4. If some old repo
had a key with a newline in it, --all will just skip processing that key.
Other things, like .git/annex/unused files certianly assume no newlines in
keys too, and AFAICR, such keys never actually worked.
Carriage return is escaped by preSanitizeKeyName since 2013. WORM keys
generated before that point could perhaps contain a CR. (URL probably not,
http probably doesn't support an URL with a raw CR in it.) So, added
a warning in fsck about such keys. Although, fsck --all will naturally
skip them, so won't be able to warn about them. Not entirely
satisfactory, but I'll bet there are not really any such keys in
existence.
Thanks to Lukey for finding this optimisation.
Added annex.skipunknown git config, that can be set to false to change the
behavior of commands like `git annex get foo*`, to not skip over files/dirs
that are not checked into git and are explicitly listed in the command
line.
Significant complexity was needed to handle git-annex add, which uses some
git ls-files calls, but needs to not use --error-unmatch because of course
the files are not known to git.
annex.skipunknown is planned to change to default to false in a
git-annex release in early 2022. There's a todo for that.
This relicates git's behavior. It adds a few stat calls for the command
line parameters, so there is some minor slowdown, but even with thousands
of parameters it will not be very noticable, and git does the same statting
in similar circumstances.
Note that this does not prevent eg "git annex add symlink"; the symlink
will be added to git as usual. And "git annex find symlink" will silently
list nothing as well. It's only "symlink/foo" or "subdir/symlink/foo" that
triggers the warning.
The git add behavior changes could be avoided if it turns out to be
really annoying, but then it would need to behave the old way when
annex.dotfiles=false and the new way when annex.dotfiles=true. I'd
rather not have the config option result in such divergent behavior as
`git annex add .` skipping a dotfile (old) vs adding to annex (new).
Note that the assistant always adds dotfiles to the annex.
This is surprising, but not new behavior. Might be worth making it also
honor annex.dotfiles, but I wonder if perhaps some user somewhere uses
it and keeps large files in a directory that happens to begin with a
dot. Since dotfiles and dotdirs are a unix culture thing, and the
assistant users may not be part of that culture, it seems best to keep
its current behavior for now.
git-annex find is now RawFilePath end to end, no string conversions.
So is git-annex get when it does not need to get anything.
So this is a major milestone on optimisation.
Benchmarks indicate around 30% speedup in both commands.
Probably many other performance improvements. All or nearly all places
where a file is statted use RawFilePath now.
Adds a dependency on filepath-bytestring, an as yet unreleased fork of
filepath that operates on RawFilePath.
Git.Repo also changed to use RawFilePath for the path to the repo.
This does eliminate some RawFilePath -> FilePath -> RawFilePath
conversions. And filepath-bytestring's </> is probably faster.
But I don't expect a major performance improvement from this.
This is mostly groundwork for making Annex.Location use RawFilePath,
which will allow for a conversion-free pipleline.
Finally builds (oh the agoncy of making it build), but still very
unmergable, only Command.Find is included and lots of stuff is badly
hacked to make it compile.
Benchmarking vs master, this git-annex find is significantly faster!
Specifically:
num files old new speedup
48500 4.77 3.73 28%
12500 1.36 1.02 66%
20 0.075 0.074 0% (so startup time is unchanged)
That's without really finishing the optimization. Things still to do:
* Eliminate all the fromRawFilePath, toRawFilePath, encodeBS,
decodeBS conversions.
* Use versions of IO actions like getFileStatus that take a RawFilePath.
* Eliminate some Data.ByteString.Lazy.toStrict, which is a slow copy.
* Use ByteString for parsing git config to speed up startup.
It's likely several of those will speed up git-annex find further.
And other commands will certianly benefit even more.
Goal is to make git-annex faster by using ByteString for all the
worktree traversal. For now, this is focusing on Command.Find,
in order to benchmark how much it helps. (All other commands are
temporarily disabled)
Currently in a very bad unbuildable in-between state.
The goal is to be able to run CommandStart in the main thread when -J is
used, rather than unncessarily passing it off to a worker thread, which
incurs overhead that is signficant when the CommandStart is going to
quickly decide to stop.
To do that, the message it displays needs to be displayed in the worker
thread, after the CommandStart has run.
Also, the change will mean that CommandStart will no longer necessarily
run with the same Annex state as CommandPerform. While its docs already
said it should avoid modifying Annex state, I audited all the
CommandStart code as part of the conversion. (Note that CommandSeek
already sometimes runs with a different Annex state, and that has not been
a source of any problems, so I am not too worried that this change will
lead to breakage going forward.)
The only modification of Annex state I found was it calling
allowMessages in some Commands that default to noMessages. Dealt with
that by adding a startCustomOutput and a startingUsualMessages.
This lets a command start with noMessages and then select the output it
wants for each CommandStart.
One bit of breakage: onlyActionOn has been removed from commands that used it.
The plan is that, since a StartMessage contains an ActionItem,
when a Key can be extracted from that, the parallel job runner can
run onlyActionOn' automatically. Then commands won't need to worry about
this detail. Future work.
Otherwise, this was a fairly straightforward process of making each
CommandStart compile again. Hopefully other behavior changes were mostly
avoided.
In a few cases, a command had a CommandStart that called a CommandPerform
that then called showStart multiple times. I have collapsed those
down to a single start action. The main command to perhaps suffer from it
is Command.Direct, which used to show a start for each file, and no
longer does.
Another minor behavior change is that some commands used showStart
before, but had an associated file and a Key available, so were changed
to ShowStart with an ActionItemAssociatedFile. That will not change the
normal output or behavior, but --json output will now include the key.
This should not break it for anyone using a real json parser.
This does not change the overall license of the git-annex program, which
was already AGPL due to a number of sources files being AGPL already.
Legally speaking, I'm adding a new license under which these files are
now available; I already released their current contents under the GPL
license. Now they're dual licensed GPL and AGPL. However, I intend
for all my future changes to these files to only be released under the
AGPL license, and I won't be tracking the dual licensing status, so I'm
simply changing the license statement to say it's AGPL.
(In some cases, others wrote parts of the code of a file and released it
under the GPL; but in all cases I have contributed a significant portion
of the code in each file and it's that code that is getting the AGPL
license; the GPL license of other contributors allows combining with
AGPL code.)
Added graftTree but it's buggy.
Should use graftTree in Annex.Branch.graftTreeish; it will be faster
than the current implementation there.
Started Annex.Import, but untested and it doesn't yet handle tree
grafting.