Large speed up to importing trees from special remotes that contain a lot
of files, by only processing changed files.
Benchmarks:
Importing from a special remote that has 10000 files, that have all been
imported before, and 1 new file sped up from 26.06 to 2.59 seconds.
An import with no change and 10000 unchanged files sped up from 24.3 to
1.99 seconds.
Going up to 20000 files, an import with no changes sped up from
125.95 to 3.84 seconds.
Sponsored-by: k0ld on Patreon
Speed up importing trees from special remotes somewhat by avoiding
redundant writes to sqlite database.
Before, import would write to both the git-annex branch and also to the
sqlite database. But then the next time it was run, needsUpdateFromLog
would see the branch had changed, so run updateFromLog, which would make
the same writes to the sqlite database a second time.
Now import writes only to the git-annex branch. The next time it's run,
needsUpdateFromLog sees that the branch has changed and so calls
updateFromLog, which updates the sqlite database.
Why defer the write to the sqlite database like this? It seems that it
could write to the database as it goes, and at the end call
recordAnnexBranchTree to indicate that the information in the git-annex
branch has all been written to the cidsdb. That would avoid the second
import doing extra work.
But, there could be other processes running at the same time, and one of
them may update the git-annex branch, eg merging a remote git-annex branch
into it. Any cids logs on that merged git-annex branch would not be
reflected in the cidsdb yet. If the import then called
recordAnnexBranchTree, the cidsdb would never get updated with that merged
information.
I don't think there's a good way to prevent, or to detect that situation.
So, it can't call recordAnnexBranchTree at the end. So it might as well
wait until the next run and do updateFromLog then. It could instead do
updateFromLog at the end, but it's going to check needsUpdateFromLog
at the beginning anyway.
Note that the database writes were queued, so there is already a cidmap
that is used to remember changes that the current process has made.
So, omitting database writes can't change the behavior of the current
process.
Also note that thirdpartypopulatedimport uses recordcidkeyindb, which
reflects what it already did. That code path does not use the cidmap,
but does not need to query it either. It might be possible to make that
code path also only update the git-annex branch and not the db, but I
haven't checked.
Sponsored-by: Noam Kremen on Patreon
I noticed git-annex was using a lot of CPU when downloading from youtube,
and was not displaying progress. Turns out that yt-dlp (and I think also
youtube-dl) sometimes only knows an estimated size, not the actual size,
and displays the progress output slightly differently for that. That broke
the parser. And, the parser was feeding chunks that failed to parse back
as a remainder, which caused it to try to re-parse the entire output each
time, so it got slower and slower.
Using --progress-template like this should avoid parsing problems as well
as future proof against output changes. But it will work with only yt-dlp.
So, this seemed like the right time to deprecate youtube-dl, and default
to yt-dlp when available.
git-annex will still use youtube-dl if that's all that's available.
However, since the progress parser for youtube-dl was buggy, and I don't
want to maintain two different progress parsers (especially since
youtube-dl is no longer in debian unstable having been replaced by
yt-dlp), made git-annex no longer try to parse youtube-dl's progress.
Also, updated docs for yt-dlp being default. It did not seem worth
renaming annex.youtube-dl-options and annex.youtube-dl-command.
Note that yt-dlp does not seem to document the fields available in the
progress template. I found them by reading the source and looking at
the templates it uses internally. Also note that the use of "i" (rather
than "s") in progressTemplate makes it display floats rounded to integers;
particularly the estimated total size can be a float. That also does not
seem to be documented but I assume is a python thing?
Sponsored-by: Joshua Antonishen on Patreon