* add: Significantly speed up adding lots of non-large files to git,
by disabling the annex smudge filter when running git add.
* add --force-small: Run git add rather than updating the index itself,
so that any smudge filters other than the annex one that may be enabled
will be used.
Split out two todos for things that were mentioned as still open items
in there. Most of the others were already dealt with. I didn't open a
new todo for the import from readonly S3 bucket because I guess if
someone needs that, they can ask for it.
Especially from borg, where the content identifier logs
all end up being the same identical file!
But also, for other imports, the location tracking logs can,
in some cases, be identical files.
Bonus optimisation: Avoid looking up (and parsing when set)
GIT_ANNEX_VECTOR_CLOCK env var every time a log is written to.
Although the lookup does happen at startup even when no
log will be written now.
Note that, after changing it with enableremote, syncing won't rescan
known archives in the borg repo using the changed config. Probably not a
problem?
Also used File in some places where filenames that could theoretically
start with - are passed to borg, to avoid it confusing them with
options.
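A minimal sketch of the idea behind File, for illustration only. The names `CommandParam`, `toArg` and `toArgs` are assumptions, and this is simplified compared to whatever the real code does: a filename that starts with - gets a ./ prefix, so the program being run cannot parse it as an option.

```haskell
-- Illustrative sketch; names and details are assumptions.
data CommandParam
    = Param String  -- an option or other literal argument, passed as-is
    | File FilePath -- a filename, which must never look like an option

-- Convert a parameter to the string passed to the command.
-- A filename starting with '-' is prefixed with "./"; anything else
-- is passed through unchanged.
toArg :: CommandParam -> String
toArg (Param s) = s
toArg (File f@('-':_)) = "./" ++ f
toArg (File f) = f

toArgs :: [CommandParam] -> [String]
toArgs = map toArg
```

With this, `toArgs [Param "list", File "-rf"]` passes `./-rf` to the command rather than something that looks like an option.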
Don't want to try to use these remotes as key/value remotes, which will
surely fail. It only recently became possible for importtree to be set
w/o exporttree, so before this code was ok.
(cherry picked from commit 97599cb0f7f4115aa5a3e81a91ee3d1d6c52dc84)
git -c was already propagated via environment, but need this for
consistency.
Also, notice it does not use gitAnnexChildProcess to run the
transferrer. So nothing is done to avoid it taking the
pid lock. It's possible that the caller is already doing something that
took the pid lock, and if so, the transferrer will certainly fail,
since it needs to take the pid lock too. This may prevent combining
annex.stalldetection with annex.pidlock, but I have not verified it's
really a problem. If it was, it seems git-annex would have to take
the pid lock when starting a transferrer, and hold it until shutdown,
or would need to take pid lock when starting to use a transferrer,
and hold it until done with a transfer and then drop it. The latter
would require starting the transferrer with pid locking disabled for the
child process, so assumes that the transferrer does not do anything that
needs locking when not running a transfer.
* Guard against running in a repo where annex.uuid is set but
annex.version is not set, or vice-versa.
* Avoid autoinit when a repo does not have annex.version or annex.uuid
set, but has a git-annex objects directory, suggesting it was used
by git-annex before.
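The two checks above could be sketched roughly like this. It is only an illustration under assumptions: `AnnexConfig`, `InitState` and `checkInit` are hypothetical names, and the sketch reads "does not have annex.version or annex.uuid set" as meaning neither is set.

```haskell
-- Hypothetical view of the relevant repository state.
data AnnexConfig = AnnexConfig
    { annexUUID :: Maybe String
    , annexVersion :: Maybe String
    , hasObjectsDir :: Bool -- a git-annex objects directory exists
    }

data InitState
    = Initialized
    | Inconsistent String -- refuse to run, with an explanation
    | NeedsManualInit     -- looks previously used; do not autoinit
    | CanAutoInit
    deriving (Eq, Show)

checkInit :: AnnexConfig -> InitState
checkInit c = case (annexUUID c, annexVersion c) of
    (Just _, Just _) -> Initialized
    (Just _, Nothing) -> Inconsistent "annex.uuid is set but annex.version is not"
    (Nothing, Just _) -> Inconsistent "annex.version is set but annex.uuid is not"
    (Nothing, Nothing)
        | hasObjectsDir c -> NeedsManualInit
        | otherwise -> CanAutoInit
```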
I need to think about this some more, not clear if it's a todo
item specific to stalldetection at all. Remotes with this behavior
also show no progress when run with -J. And some other remotes don't
update any progress meters at all, eg adb is that way and so are hook
remotes and of course external remotes don't have to send progress info.
Done on unix, could not quite implement it on windows.
The signal library gets part of the way needed for windows.
But I had to open https://github.com/pmlodawski/signal/issues/1 because
it lacks raiseSignal.
Also, I don't know what the equivalent of getProcessGroupIDOf is on
windows. And System.Process does not provide a way to send any signal to
a process group except for SIGINT.
This commit was sponsored by Boyd Stephen Smith Jr. on Patreon.
All callers adjusted to update it themselves.
In Command.ReKey, and Command.SetKey, the cleanup action already did,
so it was updating the log twice before.
This fixes a bug when annex.stalldetection is set, as now
Command.Transferrer can skip updating the location log, and let it be
updated by the calling process.
This is to avoid breakage when upgrading or downgrading git-annex with a
process running that uses the interface. It's better to keep the
compatibility code for a few years than worry about such breakage.
This commit was sponsored by Brett Eisenberg on Patreon.
New config annex.stalldetection, remote.name.annex-stalldetection, which
can be used to deal with remotes that stall during transfers, or are
sometimes too slow to want to use.
This commit was sponsored by Luke Shumaker on Patreon.
That seems to be the last thing needed for message serialization.
Although it's only used in the assistant currently, so hard to tell if I
forgot something.
At this point, it should be possible to start using transferkeys
when performing transfers, which will allow killing a transferkeys
process if a transfer times out or stalls. But that's for another day.
This commit was sponsored by Ethan Aubin.
Added annex.adjustedbranchrefresh git config to update adjusted branches
set up by git-annex adjust --unlock-present/--hide-missing.
Note, in a few cases, I was not able to make the adjusted branch
be updated in calls to moveAnnex, because information about what
file corresponds to a key is not available. They are:
* If two files, foo and bar, point to one key, then eg, `git annex get foo` will
update the branch to unlock foo, but will not unlock bar, because it
does not know about it. Might be fixable by making `git annex get
bar` do something besides skipping bar?
* git-annex-shell recvkey likewise (so sends over ssh from old versions
of git-annex)
* git-annex setkey
* git-annex transferkey if the user does not use --file
* git-annex multicast sends keys with no associated file info
Doing a single full refresh at the end, after any incremental refresh,
will deal with those edge cases.
Note that, the way the SeekInput parser is written to support batch mode,
it's actually possible to do git-annex examinekey
"SHA1--foo foo.tar.gz" --migrate-to-backend=SHA1E
While that might be kind of useful to support multiple migrations not using
batch mode, I have not documented it. It would be better to take pairs of
key and file in that case.
In cases where numcopies checks prevented the resumed move from dropping
the object from the source repository, it now relies on a log of recent
moves to replicate the behavior of the interrupted command.
Performance: Probably noticeable impact, since it has to add to the log,
check the log, and remove from the log. Seems worth it to avoid this
annoying edge case. The log functions are pretty well optimised to avoid
unnecessary work.
A performance improvement to make later would be to avoid cleanup doing
anything if it's not written to the log file, and has confirmed that the
log file does not contain the log line.
This commit was sponsored by Jake Vosloo on Patreon.
The problem was this line:
cleanup = and <$> sequence (map snd v)
That caused all of v to be held onto until the end, when the cleanup action
was run.
I could not seem to find a bang pattern that avoided the leak, so I
resorted to an IORef, rather clunky, but not a performance problem because
it will only be written once per git ls-files, so typically just 1 time.
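Roughly, the shape of the IORef approach looks like the sketch below. This is an illustration and not the actual code: `Producer` and `runAll` are hypothetical names, and whether a leak is really avoided depends on how lazily the results are produced, which this strict sketch does not model. The point is only that the combined cleanup refers to the IORef, and not to the list of (results, cleanup) pairs.

```haskell
import Data.IORef (newIORef, modifyIORef', readIORef)

-- Each producer returns some results plus a cleanup action that
-- reports success.
type Producer a = IO ([a], IO Bool)

-- Run all producers, returning the concatenated results and a single
-- combined cleanup action. Each cleanup action is stashed in the IORef
-- as its producer runs.
runAll :: [Producer a] -> IO ([a], IO Bool)
runAll producers = do
    ref <- newIORef []
    results <- mapM (go ref) producers
    return (concat results, cleanupAll ref)
  where
    go ref p = do
        (xs, cleanup) <- p
        modifyIORef' ref (cleanup :)
        return xs
    -- Run every stashed cleanup; succeed only if all of them do.
    cleanupAll ref = do
        cleanups <- readIORef ref
        and <$> sequence cleanups
```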
This commit was sponsored by Mark Reidenbach on Patreon.
Ensure that checkCanAdd is used everywhere a file is added to git,
so git add is run with -f, presumably avoiding the work it would usually
do to check ignores.
This avoids import with --no-content and with --content potentially
generating two different trees, leading to a merge conflict when run in
two different clones of a repo. And it's necessary groundwork to make
git-annex sync --no-content import from special remotes that support
importKey.
Only the directory special remote currently supports importKey, and it
generates the same key as git-annex usually does, so there is no
behavior change for it.
Future special remotes will need to take care when adding importKey,
if it generates different keys. Added some warnings about that to
comments.
This commit was sponsored by Noam Kremen on Patreon.
Import small files into git, the same as is done when importing with content.
Which means, for small files, --no-content does download them.
If the largefiles expression needs the file content available
(due to mimetype or mimeencoding being used), the import will fail.
This commit was sponsored by Jake Vosloo on Patreon.
Sped up seeking for files to operate on, when using options like --copies
or --in, by around 20%.
Benchmark showed a decrease for --copies from 155 seconds to 121
seconds, and --in remote will be similar to that.
For --in here, the speedup was less, 5-10% or so.
(both warm cache)
This commit was sponsored by Jack Hill on Patreon.
Sped up seeking to around twice as fast, by avoiding a pass over the
worktree files when preferred content expressions of the local repo and
remotes don't use include=/exclude=.
Thanks to Lukey for identifying the optimisation.
This commit was sponsored by Brock Spratlen on Patreon.
matchNeedsFileContent is not used yet, but shows how to add information
about terminals. That one would be needed for
https://git-annex.branchable.com/todo/sync_fast_import/
Note the tricky bit in Annex.FileMatcher.call where it folds over the
included matcher to propagate the information.
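To illustrate what propagating information about terminals means, here is a small sketch. The types are simplified stand-ins, not the real ones: `Matcher`, `Terminal`, `anyTerminal` and `callTerminal` are assumed names, and `matchNeedsFileName` is an assumed sibling of matchNeedsFileContent. A terminal that calls an included matcher gets its flags by folding over that matcher.

```haskell
-- Illustrative types only; not the actual definitions.

-- A boolean expression whose leaves are terminals of type t.
data Matcher t
    = MAny
    | MAnd (Matcher t) (Matcher t)
    | MOr (Matcher t) (Matcher t)
    | MNot (Matcher t)
    | MOp t

-- A terminal, tagged with static information about what it needs in
-- order to be evaluated.
data Terminal = Terminal
    { termName :: String
    , matchNeedsFileName :: Bool    -- assumed flag, eg for include=
    , matchNeedsFileContent :: Bool -- eg for a mimetype check
    }

-- True when any terminal in the expression satisfies the predicate.
anyTerminal :: (t -> Bool) -> Matcher t -> Bool
anyTerminal _ MAny = False
anyTerminal p (MAnd a b) = anyTerminal p a || anyTerminal p b
anyTerminal p (MOr a b) = anyTerminal p a || anyTerminal p b
anyTerminal p (MNot a) = anyTerminal p a
anyTerminal p (MOp t) = p t

-- A terminal that calls another, included matcher inherits that
-- matcher's needs by folding over it.
callTerminal :: String -> Matcher Terminal -> Terminal
callTerminal name included = Terminal
    { termName = name
    , matchNeedsFileName = anyTerminal matchNeedsFileName included
    , matchNeedsFileContent = anyTerminal matchNeedsFileContent included
    }
```

A caller can then ask `anyTerminal matchNeedsFileContent` of a whole expression before deciding whether file content has to be available.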
This commit was sponsored by Svenne Krap on Patreon.
getPid returns Nothing if the process has already been stopped, and in that
case, the pid will not be displayed. I think that would only happen if
waitForProcess or similar gets called more than once on the same process
handle though.
getPid on unix has an overhead of only an MVar read. On Windows it needs to
make a syscall, so will probably be more expensive. While the added expense
happens even when debug logging is disabled, it should be small enough
compared with the overhead of starting a process that it's not a problem.
(It does occur to me that a debugM that took an IO String could only run it
when debugging is really enabled, which would improve performance. It does
not seem possible to use the current hslogger interface to do that though;
it does not expose the information that would be needed.)
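For what it's worth, such a debug function might look like the sketch below, if the program tracked the debug setting itself. `debugIO` and the IORef Bool flag are hypothetical, and this does not use hslogger at all; it only shows that the IO String is not run when debugging is off.

```haskell
import Control.Monad (when)
import Data.IORef
import System.IO (hPutStrLn, stderr)

-- Hypothetical: the IORef records whether debug output is enabled.
-- The message action is only run when it is, so any cost of building
-- the message (such as looking up a pid) is skipped otherwise.
debugIO :: IORef Bool -> IO String -> IO ()
debugIO enabled mkmsg = do
    on <- readIORef enabled
    when on $ mkmsg >>= hPutStrLn stderr
```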
With some hints for the user for what to do.
Took care to avoid changing the json output. It would have been ok to add
the new separated lists to it, in addition to the old list, but I didn't
do that because I didn't see much point.
Also tested what happens if the other special remote has importtree=yes
and exporttree=yes, and in that case, download via httpalso works too,
without needing to implement any importtree methods here.
It might be possible to make it automatically set exporttree=yes if the
--sameas does. Didn't try, will probably be layering issues.
Or perhaps it should be inherited by sameas like some
other configs? But then, wouldn't it also make sense to inherit
importtree=yes? But as shown here, it's not needed by this kind of
remote.
"http" was too generic and easy to confuse with web. The new name makes
clear it's used in addition to some other remote. And other protocols
can use the same naming scheme.
Reduce the number of directories listed in libdirs, which makes the linker
check a lot fewer dead ends looking for directories.
Eliminated some directories that didn't really contain shared libraries,
or only contained the linker.
That left only 2, one in lib and one in usr/lib, so consolidate those two.
Doing it this way, rather than just consolidating all libs that might exist
into a single directory means that, if there are optimised versions of some
libs, eg in lib/subarch/foo.so, and lib/subarch2/foo.so, they don't get
moved around in a way that would make the linker pick the wrong one.
Sped up seeking files to drop by 2x, and also some performance
improvements to checking numcopies.
Interestingly, the seek speedup is not due to precaching, but I think is
due to calling getParsed earlier.
Annex.Drop had to be changed to check inAnnex there, since it was removed
from Command.Drop. All other users of Command.Drop already checked inAnnex
themselves.
This commit was sponsored by Ryan Newton on Patreon.