This avoids import with --no-content and with --content potentially
generating two different trees, leading to a merge conflict when run in
two different clones of a repo. And it's necessary groundwork to make
git-annex sync --no-content import from special remotes that support
importKey.
Only the directory special remote currently supports importKey, and it
generates the same key as git-annex usually does, so there is no
behavior change for it.
Future special remotes will need to take care when adding importKey,
if it generates different keys. Added some warnings about that to
comments.
This commit was sponsored by Noam Kremen on Patreon.
Import small files into git, the same as is done when importing with content.
Which means, for small files, --no-content does download them.
If the largefiles expression needs the file content available
(due to mimetype or mimeencoding being used), the import will fail.
This commit was sponsored by Jake Vosloo on Patreon.
Sped up seeking for files to operate on, when using options like --copies
or --in, by around 20%.
Benchmark showed an increase for --copies from 155 seconds to 121
seconds, and --in remote will be similar to that.
For --in here, the speedup was less, 5-10% or so.
(both warm cache)
This commit was sponsored by Jack Hill on Patreon.
Sped up seeking to around twice as fast, by avoiding a pass over the
worktree files when preferred content expressions of the local repo and
remotes don't use include=/exclude=.
Thanks to Lukey for identifying the optimisation.
This commit was sponsored by Brock Spratlen on Patreon.
matchNeedsFileContent is not used yet, but shows how to add information
about terminals. That one would be needed for
https://git-annex.branchable.com/todo/sync_fast_import/
Note the tricky bit in Annex.FileMatcher.call where it folds over the
included matcher to propagate the information.
This commit was sponsored by Svenne Krap on Patreon.
getPid returns Nothing if the process has already been stopped, and in that
case, the pid will not be displayed. I think that would only happen if
waitForProcess or similar gets called more than once on the same process
handle though.
getPid on unix has an overhead of only a MVar read. On Windows it needs to
make a syscall, so will be probably more expensive. While the added expense
happens even when debug logging is disabled, it should be small enough
compared with the overhead of starting a process that it's not a problem.
(It does occur to me that a debugM that took an IO String could only run it
when debugging is really enabled, which would improve performance. It does
not seem possible to use the current hslogger interface to do that though;
it does not expose the information that would be needed.)
With some hints for the user for what to do.
Took care to avoid changing the json output. It would have been ok to add
the new separated lists to it, in addition to the old list, but I didn't
do that because I didn't see much point.
add, addurl, importfeed, import: Added --no-check-gitignore option
for finer grained control than using --force.
(--force is used for too many different things, and at least one
of these also uses it for something else. I would like to reduce
--force's footprint until it only forces drops or a few other data
losses. For now, --force still disables checking ignores too.)
addunused: Don't check .gitignores when adding files. This is a behavior
change, but I justify it by analogy with git add of a gitignored file
adding it, asking to add all unused files back should add them all back,
not skip some. The old behavior was surprising.
In Command.Lock and Command.ReKey, CheckGitIgnore False does not change
behavior, it only makes explicit what is done. Since these commands are run
on annexed files, the file is already checked into git, so git add won't
check ignores.
--batch combined with -J now runs batch requests concurrently for many
commands. Before, the combination was accepted, but did not enable
concurrency. Since the output of batch requests can be in any order, --json
with the new "input" field is recommended to be used, to determine which
batch request each response corresponds to.
If --json is not used, batch mode still runs concurrently, using the usual
concurrent-output. That will not be very useful for most batch mode users,
probably, but who knows.
If a program was using --batch -J before, and was parsing non-json output,
this could break it. But, it was relying on git-annex not supporting
concurrency despite it being enabled, so it should have expected concurrent
output. So, I think that's ok.
annex.jobs does not enable concurrency in --batch mode, because that would
confuse programs that use --batch but don't expect concurrency.
So these special remotes are always supported.
IIRC these build flags were added because the dep chains were a bit too
long, or perhaps because the libraries were not available in Debian stable,
or something like that. That was long ago, those reasons no longer apply,
and users get confused when builtin special remotes are not available, so
it seems best to remove the build flags now.
If this does cause a problem it can be reverted of course..
This commit was sponsored by Jochen Bartl on Patreon.
Works better with automatic merge conflict resolution than git's ususual
default of "conflict".
This is not done when automatic merge conflict resolution is disabled.
This commit was sponsored by Mark Reidenbach on Patreon.
This case was handled by cleanConflictCruft, but only when the annexed
file's object was present. When not present, it left the annexed file
with the original name, not checked into git, while adding the variant
file. So, add an explicit deletion of the deleted file in this case.
My specific case where this happened actually involves
merge.directoryRenames=conflict. After a merge involving that,
the situation was the file appears as "added by them", because that
caused the file that they added to be moved into a directory we renamed.
That case is the same as them adding a modified version of the file,
while we deleted it. (Except for the history of the file, since it's a
new file, but this doesn't look at history.)
This commit was sponsored by Boyd Stephen Smith Jr. on Patreon.
This does not actually change how the merge conflict is resolved when
one side deleted the file, but it was not documented before, and I think
it only worked by accident.
This commit was sponsored by Brett Eisenberg on Patreon.
One reason is, 5 is an arbitrary number so ought to be configurable.
The real reason though, is I wanted to make the man page explain when
forward retry can override annex.retry, and having a config made the
man page easier to write.
Fixed several cases where files were created without file mode bits that
the umask would usually set. This included exports to the directory special
remote, torrent files used by the bittorrent special remote, hooks written
by git-annex init, and some log files in .git/annex/
Audited all calls, looking for ones that didn't want the umask bits to be
set. All such turned out to already set the specific restrictive file mode
they wanted.
Also audited for other calls to openTempFile, and all are ok,
except for viaTmp which will need further work.
Remote.Directory fixed to set umask mode when writing to an export,
although it has another one using viaTmp that's not fixed.
Will make exports that are published via a http server running as
another user work, for example.
Remote.BitTorrent fixed to set umask mode when downloading the torrent
file. Normally this does not matter as that file does not hang around
after the download, but if a bittorrent download were started by one user,
got interrupted and then another user ran it, this will let them access
the torrent file created by the first user.
Also tested what happens if the other special remote has importtree=yes
and exporttree=yes, and in that case, download via httpalso works too,
without needing to implement any importtree methods here.
It might be possible to make it automatically set exporttree=yes if the
--sameas does. Didn't try, will probably be layering issues.
Or perhaps it should be inherited by sameas like some
other configs? But then, wouldn't it also make sense to inherit
importree=yes? But as shown here, it's not needed by this kind of
remote.
"http" was too generic and easy to confuse with web. The new name makes
clear it's used in addition to some other remote. And other protocols
can use the same naming scheme.
They normally shutdown when the GNUPGHOME directory is deleted, but on
NFS they keep the directory from being deleted. And also, this avoids
a number of them piling up while the test suite is running.
Fixes reversion in 8.20200617 that made annex.pidlock being enabled result
in some commands stalling, particularly those needing to autoinit.
Renamed runsGitAnnexChildProcess to make clearer where it should be
used.
Arguably, it would be better to have a way to make any process git-annex
runs have the env var set. But then it would need to take the pid lock
when running any and all processes, and that would be a problem when
git-annex runs two processes concurrently. So, I'm left doing it ad-hoc
in places where git-annex really does run a child process, directly
or indirectly via a particular git command.
addurl: Fix reversion in 7.20190322 that made --file not be honored when
youtube-dl was used to download media.
8758f9c561 was on the right track, but missed that | otherwise prevented
the code it added from being used.
Also, refactored out a common function.
This commit was sponsored by Graham Spencer on Patreon.
TVars were not updated atomically, which was ok when each thread got its
own External that was the only thing using these TVars. But, with the
async extension, several External instances can share the same var, so
it needs to be a TMVar to avoid read/write conflicts.
In particular, this makes PREPARE only be sent once.
Moving jobid generation to the git-annex side lets it be simplified a
lot.
Note that it will also be possible to generate one jobid per connection,
rather than a new job per request. That will make overflow not an issue,
and will avoid some work, and will simplify some of the code.
Receive loop looks right. Still need the send loop.
And, a complication is that some messages git-annex
sends need to be wrapped in REPLY_ASYNC, while others
do not. So will probably need to split externalSend
into two.
Now that I've started implementation, I see it's really necessary that
every message the special remote sends use the protocol, otherwise
nasty edge cases abound.
Since there's a race here, and since Kyle saw an exception leak out,
which I have not been able to reproduce that. See my comment for what
I think might be going on.
Note that, I used tryNonAsync, because it seems a later tryNonAsync
caught the exception. I don't actually understand how it did, as I
understand exception classification, it's the data type, not the way it
was thrown. One possibility is that the async exception may have been wrapped
in some other, non-async exception, and Show displayed it the same way.
Avoid complaining that a file with "is beyond a symbolic link" when the
filepath is absolute and the symlink in question is not actually inside the
git repository.
This assumes that inodes remain stable while the command is running.
I think they always will, the filesystems where they are unstable change
them across mounts. (If inodes were not stable, it would just complain about
symlinks in the path that are not inside the working tree.)
(On windows, I don't want to assume anything about inodes, they could be
random numbers for all I know. But if they were, this would still be ok, as
long as windows doesn't have symlinks that are detected by isSymbolicLink.
Which seems a fair bet.)
Part of workTreeItems is trying detect a case
where git porcelain refuses to process a file, and where
git ls-files silently outputs nothing. But, it's hard to perfectly
replicate git's behavior, and besides, git's behavior could change.
So it could be that we warn, but then git ls-files does not skip over
it, and so git-annex also processes it after warning about it.
So, if we think we have a problem with a parameter, display the warning,
and skip processing it at all.
Implementing this was complicated by needing to handle the case where
all command-line parameters get filtered out this way. Which is
different than the case where there are none, because we don't want to
operate on all files in this new case..
sanitizeFilePath was changed to sanitize leading '.', but ImportFeed was
running it on parts of the template. So eg the leading '.' in the extension
got sanitized.
Note the added case for sanitizeLeadingFilePathCharacter ('/':_)
-- this was added because, if the template is title/episode and the title
is not set, it would expand to "/episode". So this is another potential
security fix.
Reduce the number of directories listed in libdirs, which makes the linker
check a lot less dead ends looking for directories.
Eliminated some directories that didn't really contain shared libraries,
or only contained the linker.
That left only 2, one in lib and one in usr/lib, so consolidate those two.
Doing it this way, rather than just consolidating all libs that might exist
into a single directory means that, if there are optimised versions of some
libs, eg in lib/subarch/foo.so, and lib/subarch2/foo.so, they don't get
moved around in a way that would make the linker pick the wrong one.
Sped up seeking files to drop by 2x, and also some performance
improvements to checking numcopies.
Interestingly, the seek speedup is not due to precaching, but I think is
due to calling getParsed earlier.
Annex.Drop had to be changed to check inAnnex there, since it was removed
from Command.Drop. All other users of Command.Drop already checked inAnnex
themselves.
This commit was sponsored by Ryan Newton on Patreon.
And add a test case for that.
This certianly loses some of the 2x performance improvement in file
seeking that seekFilteredKeys led to, because now it has to stat the
worktree files again. Without benchmarking, I expect there will still be
a sizable improvement, and also the git-annex branch precaching that
seekFilteredKeys can do will still be a win of its approach.
Also worth noting that lookupKey, when the file DNE, check if it's in an
adjusted branch with hidden files, and if so, finds the key for the
file anyway. That was intended to make git-annex sync --content be able
to process those files, but a side effect was that, when a file was
deleted but the deletion not yet staged, git-annex commands used to
still list it. That was actually a bug. This commit fixes that bug too.
(git-annex sync --content on such a branch does not use seekFilteredKeys
so was not affected by the reversion or by this behavior change)
This commit was sponsored by Jake Vosloo on Patreon.
This was a bit disappointing, I was hoping for a 2x speedup. But, I think
the metadata lookup is wasting a lot of time and also needs to be made to
stream.
The changes to catObjectStreamLsTree were benchmarked to not also speed
up --all around 3% more. Seems I managed to make it polymorphic after all.
planned to use for an optimisation
most things using stagedDetails were not expecting to get dup files in a
conflicted merge and deal with them, so converted them to use
inRepoDetails.
Turns out the %(rest) trick was not needed. Instead, just maintain a
list of files we've asked for, and each cat-file response is for the
next file in the list.
This actually benchmarks 25% faster than before! Very surprising, but it
must be due to needing to shove less data through the pipe, and parse
less.
This assumes that no location log files will have a newline or carriage
return in their name. catObjectStream skips any such files due to
cat-file not supporting them.
Keys have been prevented from containing newlines since 2011,
commit 480495beb4. If some old repo
had a key with a newline in it, --all will just skip processing that key.
Other things, like .git/annex/unused files certianly assume no newlines in
keys too, and AFAICR, such keys never actually worked.
Carriage return is escaped by preSanitizeKeyName since 2013. WORM keys
generated before that point could perhaps contain a CR. (URL probably not,
http probably doesn't support an URL with a raw CR in it.) So, added
a warning in fsck about such keys. Although, fsck --all will naturally
skip them, so won't be able to warn about them. Not entirely
satisfactory, but I'll bet there are not really any such keys in
existence.
Thanks to Lukey for finding this optimisation.