I saw a nearly 2 minute speed up from this, in a repo with 56000 files some
of which are preferred content of the special remote and others not. In
such a case, addBackExportExcluded has to do a lot of work, which is
unncessary when the tree is unchanged.
When using sync --content, preferred content checking of that many files
takes about 1 minute. So this speeds up sync --content by 3x.
When using git-annex import, the speed up is much larger.
Sponsored-by: Nicholas Golder-Manning on Patreon
When importing from a special remote, support preferred content expressions
that use terms that match on keys (eg "present", "copies=1"). Such terms
are ignored when importing, since the key is not known yet.
When "standard" or "groupwanted" is used, the terms in those
expressions also get pruned accordingly.
This does allow setting preferred content to "not (copies=1)" to make a
special remote into a "source" type of repository. Importing from it will
import all files. Then exporting to it will drop all files from it.
In the case of setting preferred content to "present", it's pruned on
import, so everything gets imported from it. Then on export, it's applied,
and everything in it is left on it, and no new content is exported to it.
Since the old behavior on these preferred content expressions was for
importtree to error out, there's no backwards compatability to worry about.
Except that sync/pull/etc will now import where before it errored out.
This can reduce the size of the branch by up to 8%. My test was
running git-annex add 1000 times on one file each.
Lots of different high-resolution timestamps were recorded before
and eliminating those, after packing, the git repo was 8% smaller.
Due to the use of vector clocks, high resolution timestamps are
not necessary to make clear which information is most recent when
eg, a value is changed repeatedly in the same second. In such a
case, the vector clock will be advanced to the next second after
the last modification. For example, running
git-annex numcopies 1; git-annex numcopies 2
The first will record the current second, while the next records
the second after that even if it runs in the same second.
As for conflicting information written to two different clones of the
repository, this will make git-annex sometimes pick information that
was written earlier in a second over information written later in the
same second. Usually git-annex does not write conflicting information,
but there are some cases where it could. Eg, storing an object on a remote
can update the remote state log with some state. If two repos both store the
same object, and end up storing different remote state for some reason,
this can result in one that ran a tiny bit later winning. Such a situation
seems unlikely to be user visible. And a small amount of clock skew could
already result in such things.
The only case I can think of where this might be a user visible change
is if a configuration command like git-annex numcopies is being run
in 2 clones of a repository on the same machine at very
close to the same time. Then the user will know which they ran last,
and git-annex won't.
If that did become a problem, this could be dialed back to eg log
milliseconds with still some space saving.
migrate: Support adding size to URL keys that were added with --relaxed, by
running eg: git-annex migrate --backend=URL foo
Since url keys cannot be generated, that used to fail. Make it notice that
the backend is not changed, and just get the size of the content.
Sponsored-by: Brock Spratlen on Patreon
This is most of the way there, but not quite working.
The layout of migrate.tree/ needs to be changed to follow this approach.
git log will list all the files in tree order, so the new layout needs
to alternate old and new keys. Can that be done? git may not document
tree order, or may not preserve it here.
Alternatively, change to using git log --format=raw and extract
the tree header from that, then use
git diff --raw $tree:migrate.tree/old $tree:migrate.tree/new
That will be a little more expensive, but only when there are lots of
migrations.
Sponsored-by: Joshua Antonishen on Patreon
This will allow distributed migration: Start a migration in one clone of
a repo, and then update other clones.
commitMigration is a bit of a bear.. There is some inversion of control
that needs some TMVars. Also streamLogFile's finalizer does not handle
recording the trees, so an interrupt at just the wrong time can cause
migration.log to be emptied but the git-annex branch not updated.
Sponsored-by: Graham Spencer on Patreon
"Could only verify the existence of 0 out of 1 necessary copy"
does not sound right, but neither does it with "copies".
Kept the "1" rather than "only" or such since numcopies is mentioned.
Sponsored-by: Brock Spratlen on Patreon
Make git-annex get/copy/move --from foo override configuration of
remote.foo.annex-ignore, as documented.
This already worked for remotes supporting hasKeyCheap. For others though,
git-annex copy --from foo would silently not do anything, while
git-annex copy --to foo would use the annex-ignored remote.
Also improved the annex-ignore docs, to reflect that `git-annex get`
without --from will skip using annex-ignored remotes, for example.
Sponsored-by: Dartmouth College's DANDI project
Noticed that Semigroup instance of Map is not suitable to use
for MapLog. For example, it behaved like this:
ghci> parseTrustLog "foo 1 timestamp=10\nfoo 2 timestamp=11" <> parseTrustLog "foo X timestamp=12"
fromList [(UUID "foo",LogEntry {changed = VectorClock 11s, value = SemiTrusted})]
Which was wrong, it lost the newer DeadTrusted value.
Luckily, nothing used that Semigroup when operating on a MapLog. And this
provides a safe instance.
Sponsored-by: Graham Spencer on Patreon
In particular, the mergedrefs file was written with CR added to each line,
but read without CRLF handling. This resulted in each update of the file
adding CR to each line in it, growing the number of lines, while also
preventing the optimisation from working, so it remerged unncessarily.
writeFile and readFile do NewlineMode translation on Windows. But the
ByteString conversion prevented that from happening any longer.
I've audited for other cases of this, and found three more
(.git/annex/index.lck, .git/annex/ignoredrefs, and .git/annex/import/). All
of those also only prevent optimisations from working. Some other files are
currently both read and written with ByteString, but old git-annex may have
written them with NewlineMode translation. Other files are at risk for
breakage later if the reader gets converted to ByteString.
This is a minimal fix, but should be enough, as long as I remember to use
fileLines when splitting a ByteString into lines. This leaves files written
using ByteString without CR added, but that's ok because old git-annex has
no difficulty reading such files.
When the mergedrefs file has gotten lines that end with "\r\r\r\n", this
will eventually clean it up. Each update will remove a single trailing CR.
Note that S8.lines is still used in eg Command.Unused, where it is parsing
git show-ref, and similar in Git/*. git commands don't include CR in their
output so that's ok.
Sponsored-by: Joshua Antonishen on Patreon
Minor optimisation, but a win in every case, except for a couple where
it's a wash.
Note that replaceFile still takes a FilePath, because it needs to
operate on Chars to truncate unicode filenames properly.
Note that the use of s2w8 in genUUIDInNameSpace made it truncate unicode
characters. Luckily, genUUIDInNameSpace is only ever used on ASCII
strings as far as I can determine. In particular, git-remote-gcrypt's
gcrypt-id is an ASCII string.
Note the use of fromString and toString from Data.ByteString.UTF8 dated
back to commit 9b93278e8a. Back then it
was using the dataenc package for base64, which operated on Word8 and
String. But with the switch to sandi, it uses ByteString, and indeed
fromB64' and toB64' were already using ByteString without that
complication. So I think there is no risk of such an encoding related
breakage.
I also tested the case that 9b93278e8a
fixed:
git-annex metadata -s foo='a …' x
git-annex metadata x
metadata x
foo=a …
In Remote.Helper.Encryptable, it was avoiding using Utility.Base64
because of that UTF8 conversion. Since that's no longer done, it can
just use it now.
This does not improve Annex.Branch.files at all, since it still uses ++ to
combine the lists, so forcing all but the last one.
But when there are a lot of files in the private journal, it does avoid
--all (or a bare repo) from buffering the filenames in memory.
See commit 653b719472 for prior discussion of
this buffering.
Sponsored-by: Graham Spencer on Patreon
importfeed: Use caching database to avoid needing to list urls on every
run, and avoid using too much memory.
Benchmarking in my podcasts repo, importfeed got 1.42 seconds faster,
and memory use dropped from 203000k to 59408k.
Database.ImportFeed is Database.ContentIdentifier with the serial number
filed off. There is a bit of code duplication I would like to avoid,
particularly recordAnnexBranchTree, and getAnnexBranchTree. But these use
the persistent sqlite tables, so despite the code being the same, they
cannot be factored out.
Since this database includes the contentidentifier metadata, it will be
slightly redundant if a sqlite database is ever added for metadata. I
did consider making such a generic database and using it for this. But,
that would then need importfeed to update both the url database and the
metadata database, which is twice as much work diffing the git-annex
branch trees. Or would entagle updating two databases in a complex way.
So instead it seems better to optimise the database that
importfeed needs, and if the metadata database is used by another command,
use a little more disk space and do a little bit of redundant work to
update it.
Sponsored-by: unqueued on Patreon
git-annex only writes regular files there, but other things may drop junk
like empty .DAV directories around the tree. And trying to hash such things
can have weird and hard to understand effects. So it seems best to do a
small amount of work in statting the journal file to make sure it's a
regular file.
Sponsored-by: Jack Hill on Patreon
Avoid using curl when annex.security.allowed-ip-addresses is set but
neither annex.web-options nor annex.security.allowed-url-schemes is set to
a value that needs curl.
Bug introduced in 840bd50390
Sponsored-By: Brock Spratlen on Patreon
Fix behavior when importing a tree from a directory remote when the
directory does not exist. An empty tree was imported, rather than the
import failing. Merging that tree would delete every file in the
branch, if those files had been exported to the directory before.
The problem was that dirContentsRecursive returned [] when the directory
did not exist. Better for it to throw an exception. But in commit
74f0d67aa3 back in 2012, I made it never
theow exceptions, because exceptions throw inside unsafeInterleaveIO become
untrappable when the list is being traversed.
So, changed it to list the contents of the directory before entering
unsafeInterleaveIO. So exceptions are thrown for the directory. But still
not if it's unable to list the contents of a subdirectory. That's less of a
problem, because the subdirectory does exist (or if not, it got removed
after being listed, and it's ok to not include it in the list). A
subdirectory that has permissions that don't allow listing it will have its
contents omitted from the list still.
(Might be better to have it return a type that includes indications of
errors listing contents of subdirectories?)
The rest of the changes are making callers of dirContentsRecursive
use emptyWhenDoesNotExist when they relied on the behavior of it not
throwing an exception when the directory does not exist. Note that
it's possible some callers of dirContentsRecursive that used to ignore
permissions problems listing a directory will now start throwing exceptions
on them.
The fix to the directory special remote consisted of not making its
call in listImportableContentsM use emptyWhenDoesNotExist. So it will
throw an exception as desired.
Sponsored-by: Joshua Antonishen on Patreon
Significant startup speed increase by avoiding repeatedly checking if some
remote git-annex branch refs need to be merged when it is not newer.
One way this could happen is when there are 2 remotes that are themselves
connected. The git-annex branch on the first remote gets updated. Then the
second remote pulls from the first, and merges in its git-annex branch.
Then the local repo pulls from the second remote, and merges its git-annex
branch. At this point, a pull from the first remote will get a git-annex
branch that is not newer, but is not on the merged refs list.
In my big repo, git-annex startup time dropped from 4 seconds to 0.1 seconds.
There were 5 to 10 such remote refs out of 18 remotes.
Sponsored-by: Graham Spencer on Patreon
git-annex test hang when running git-annex add in an adjusted unlocked
branch. I couldn't seem to reproduce the hang outside the test suite.
Seems that the code added in 26a9ea12d1
was buggy, and as that commit was made without testing it, building with
unix-2.8 exposed the bug.
I don't fully understand the bug, which involves fdToHandle
and then closing the fd, vs closing the handle. May somehow involve
laziness or forcing around the S.hGet? Using hClose solved it
in any case.
(Also eliminated checkcontentfollowssymlinks to fix a build warning
when it's not used.)
And annex.largefiles and annex.addunlocked.
Also git-annex matchexpression --explain explains why its input
expression matches or fails to match.
When there is no limit, avoid explaining why the lack of limit
matches. This is also done when no preferred content expression is set,
although in a few cases it defaults to a non-empty matcher, which will
be explained.
Sponsored-by: Dartmouth College's DANDI project
Currently it only displays explanations of options like --in and --copies.
In the future, it should explain preferred content expression evaluation
and other decisions.
The explanations of a few things could be better. In particular,
"standard" will just appear as-is (or as "!standard" if it doesn't
match), rather than explaining why the standard preferred content expression
for the group matches or not.
Currently as implemented, it goes to stdout, and so commands like
git-annex find that have custom output will not display --explain
information. Perhaps that should change, dunno.
Sponsored-by: Dartmouth College's DANDI project
explain is a kind of debug message, but not formatted in the same way.
So it makes sense to reuse the debug machinery for it, since that is
already quite optimised.
Sponsored-by: Dartmouth College's DANDI project
yt-dlp when resumed was observed having written the same filename twice
into the file list. Perhaps once by the first download and once by the
resumed one?
This causes changes to the original branch to get merged with a single
sync. Before, it took 2 syncs; the first happened to update the synced/
branch, and the second merged changes from the synced/ branch into the
ajusted branch.
Using mergeToAdjustedBranch when tomerge == origbranch is probably
overkill, but it does work fine.
Sponsored-By: the NIH-funded NICEMAN (ReproNim TR&D3) project
Bug fix: Re-running git-annex adjust or sync when in an adjusted branch
would overwrite the original branch, losing any commits that had been made
to it since the adjusted branch was created.
When git-annex adjust is run in this situation, it will display a warning
about the diverged branches.
When git-annex sync is run in this situation, mergeToAdjustedBranch
will merge the changes from the original branch to the adjusted branch.
So it does not need to display the divergence warning.
Note that for some reason, I'm needing to run sync twice for that to
happen. The first run does not do the merge and the second does. I'm unsure
why and so am not fully done with this bug.
Sponsored-By: the NIH-funded NICEMAN (ReproNim TR&D3) project
Running git config --list inside .git then fails, so better to only
do that when --git-dir was specified explicitly. Otherwise, when the
repository is not bare, run the command inside the working tree.
Also make init detect when the uuid it just set cannot be read and fail
with an error, in case git changes something that breaks this later.
I still don't actually understand why git-annex add/assist -J2 was
affected but -J1 was not. But I did show that it was skipping writing to
the location log, because the uuid was NoUUID.
Sponsored-by: Graham Spencer on Patreon
Fixes a crash by git-annex repair when .git/annex/journal/ does not exist.
Normally the journal directory is created before withJournalHandle gets
run, but git-annex repair can be run in a situation where it does not
exist.
Rather than after it, which can leave one wondering what file it's
downloading.
youtubeDl should not ever return Right Nothing in normal operation,
becaause it's already asked youtube-dl if it supports the url. So it
would have to succeed at that, then not download any file, but also
exit successfully, in order for the new error message to display.
Also display the name of yt-dlp when using it.
Sometimes resuming an interrupted download will fail to resume and download
more files with different names. That resulted in the workdir having
multiple files at the end, which causes git-annex to give up because it
does not know what was downloaded.
To fix this, use a yt-dlp feature, which appends to a file the name of each
file after it's finished downloading it. So the presence of other cruft in
the workdir will not confuse git-annex.
* config: Added the --show-origin and --for-file options.
* config: Support annex.numcopies and annex.mincopies.
There is a little bit of redundancy here with other code elsewhere that
combines the various configs and selects which to use. But really only
for the special case of annex.numcopies, which is a git config that does
not override the annex branch setting and for annex.mincopies, which does
not have a git config but does have gitattributes settings as well as the
annex branch setting.
That seems small enough, and unlikely enough to grow into a mess that it was
worth supporting annex.numcopies and annex.mincopies in git-annex config
--show-origin. Because these settings are a prime thing that someone might
get confused about and want to know where they were configured.
And, it followed that git-annex config might as well support those two
for --set and --get as well. While this is redundant with the speclialized
commands, it's only a little code and it makes it more consistent.
Note that --set does not have as nice output as numcopies/mincopies
commands in some special cases like setting to 0 or a negative number.
It does avoid setting to a bad value thanks to the smart
constructors (eg configuredNumCopies).
As for other git-annex branch configurations that are not set by git-annex
config, things like trust and wanted that are specific to a repository
don't map to a git config name, so don't really fit into git-annex config.
And they are only configured in the git-annex branch with no local override
(at least so far), so --show-origin would not be useful for them.
Sponsored-by: Dartmouth College's DANDI project
It just so happens that everywhere that checks attrs other than
annex.largefiles parses the value further, and failed to parse
unspecifiedAttr in a way that behaved the same as if nothing was set. So
this is not a bug fix or behavior change. What it does so is prevent
future uses of checkAttr from needing to remember to handle checking for
this edge case.
Sponsored-by: Dartmouth College's DANDI project
This avoids bottlenecking on git check-ignore in a particular situation.
Also, there may have been a correctness issue with it not having updated it.
When the exportdb is already up-to-date, this is not expensive. And the
exportdb is updated elsewhere, so usually it is up-to-date.
Sponsored-by: Joshua Antonishen on Patreon
Speeds up sync in an adjusted branch by avoiding re-adjusting the branch
unncessarily, particularly when it is adjusted with --hide-missing or
--unlock-present.
When there are a lot of files, that was the majority of the time of a
--no-content sync.
Uses a log file, which is updated when content presence changes. This
adds a little bit of overhead to every file get/drop when on such an
adjusted branch. The overhead is minimal for get of any size of file,
but might be noticable for drop in some cases. It seems like a reasonable
trade-off. It would be possible to update the log file only at the end, but
then it would not happen if the command is interrupted.
When not in an adjusted branch, there should be no additional overhead.
(getCurrentBranch is an MVar read, and it avoids the MVar read of
getGitConfig.)
Note that this does not deal with situations such as:
git checkout master, git-annex get, git checkout adjusted branch,
git-annex sync. The sync won't know that the adjusted branch needs to be
updated. Dealing with that would add overhead to operation in non-adjusted
branches, which I don't like. Also, there are other situations like having
two adjusted branches that both need to be updated like this, and switching
between them and sync not updating.
This does mean a behavior change to sync, since it did previously deal
with those situations. But, the documentation did not say that it did.
The man pages only talk about sync updating the adjusted branch after
it transfers content.
I did consider making sync keep track of content it transferred (and
dropped) and only update the adjusted branch then, not to catch up to other
changes made previously. That would perform better. But it seemed rather
hard to implement, and also it would have problems with races with a
concurrent get/drop, which this implementation avoids.
And it seemed pretty likely someone had gotten used to get/drop followed by
sync updating the branch. It seems much less likely someone is switching
branches, doing get/drop, and then switching back and expecting sync to update
the branch.
Re-running git-annex adjust still does a full re-adjusting of the branch,
for anyone who needs that.
Sponsored-by: Leon Schuermann on Patreon
Introduced recently in commit 64fc34b3da.
adjustBranch changes the sha that is recorded for the current branch
(eg the adjusted branch). So, have to get the original sha before
calling it.
Sponsored-by: Jack Hill on Patreon
Updating an adjusted branch can take a while when there are a lot of
files. HEAD was detached at the start, so if eg git-annex sync was
interrupted at the wrong point, there was a possibly wide window where
it would leave the repo with HEAD detached.
There's still a window, just much narrower. I don't know if it's
possible to close the window entirely. While git can clearly update
the currently checked out branch in eg git merge, it doesn't seem to
provide another way to do it.
Sponsored-by: Graham Spencer on Patreon
Reading from the cidsdb is responsible for about 25% of the runtime of
an import. Since the cidmap is used to store the same information in
ram, the cidsdb is not written to during an import any longer. And so,
if it started off empty (and updateFromLog wasn't needed), those reads
can just be skipped.
This is kind of a cheesy optimisation, since after any import from any
special remote, the database will no longer be empty, so it's a single
use optimisation. But it's probably not uncommon to start by importing a
lot of files, and it can save a lot of time then.
Sponsored-by: Brock Spratlen on Patreon
Large speed up to importing trees from special remotes that contain a lot
of files, by only processing changed files.
Benchmarks:
Importing from a special remote that has 10000 files, that have all been
imported before, and 1 new file sped up from 26.06 to 2.59 seconds.
An import with no change and 10000 unchanged files sped up from 24.3 to
1.99 seconds.
Going up to 20000 files, an import with no changes sped up from
125.95 to 3.84 seconds.
Sponsored-by: k0ld on Patreon
For simplicity, I've not tried to make it handle History yet, so when
there is a history, a full import will still be done. Probably the right
way to handle history is to first diff from the current tree to the last
imported tree. Then, diff from the current tree to each of the
historical trees, and recurse through the history diffing from child tree
to parent tree.
I don't think that will need a record of the previously imported
historical trees, and so Logs.Import doesn't store them. Although I did
leave room for future expansion in that log just in case.
Next step will be to change importTree to importChanges and modify
recordImportTree et all to handle it, by using adjustTree.
Sponsored-by: Brett Eisenberg on Patreon
This gets the trees built, but it does not use them. Next step will be
to remember the tree for next time an import is done, and diff between
old and new trees to find the files that have changed.
Added --missing to the mktree parameters. That only disables a check, so
it's ok to do everywhere mktree is used. It probably also speeds up
mktree to disable the check.
Note that git fsck does not complain about the resulting tree objects
that point to shas that are not in the repository. Even with --strict.
A quick benchmark, importing 10000 files, this slowed it down
from 2:04.06 to 2:04.28. So it will more than pay for itself.
Sponsored-by: Luke Shumaker on Patreon