No behavior changes (hopefully), just adding SeekInput and plumbing it
through to the JSON display code for later use.
Over the course of 2 grueling days.
withFilesNotInGit reimplemented in terms of seekHelper
should be the only possible behavior change. It seems to test as
behaving the same.
Note that seekHelper dummies up the SeekInput in the case where
segmentPaths' gives up on sorting the expanded paths because there are
too many input paths. When SeekInput later gets exposed as a json field,
that will result in it being a little bit wrong in the case where
100 or more paths are passed to a git-annex command. I think this is a
subtle enough problem to not matter. If it does turn out to be a
problem, fixing it would require splitting up the input
parameters into groups of < 100, which would make git ls-files run
perhaps more than is necessary. May want to revisit this, because that
fix seems fairly low-impact.
So these special remotes are always supported.
IIRC these build flags were added because the dep chains were a bit too
long, or perhaps because the libraries were not available in Debian stable,
or something like that. That was long ago, those reasons no longer apply,
and users get confused when builtin special remotes are not available, so
it seems best to remove the build flags now.
If this does cause a problem it can be reverted of course..
This commit was sponsored by Jochen Bartl on Patreon.
Works better with automatic merge conflict resolution than git's ususual
default of "conflict".
This is not done when automatic merge conflict resolution is disabled.
This commit was sponsored by Mark Reidenbach on Patreon.
This case was handled by cleanConflictCruft, but only when the annexed
file's object was present. When not present, it left the annexed file
with the original name, not checked into git, while adding the variant
file. So, add an explicit deletion of the deleted file in this case.
My specific case where this happened actually involves
merge.directoryRenames=conflict. After a merge involving that,
the situation was the file appears as "added by them", because that
caused the file that they added to be moved into a directory we renamed.
That case is the same as them adding a modified version of the file,
while we deleted it. (Except for the history of the file, since it's a
new file, but this doesn't look at history.)
This commit was sponsored by Boyd Stephen Smith Jr. on Patreon.
One reason is, 5 is an arbitrary number so ought to be configurable.
The real reason though, is I wanted to make the man page explain when
forward retry can override annex.retry, and having a config made the
man page easier to write.
Fixed several cases where files were created without file mode bits that
the umask would usually set. This included exports to the directory special
remote, torrent files used by the bittorrent special remote, hooks written
by git-annex init, and some log files in .git/annex/
Audited all calls, looking for ones that didn't want the umask bits to be
set. All such turned out to already set the specific restrictive file mode
they wanted.
"http" was too generic and easy to confuse with web. The new name makes
clear it's used in addition to some other remote. And other protocols
can use the same naming scheme.
They normally shutdown when the GNUPGHOME directory is deleted, but on
NFS they keep the directory from being deleted. And also, this avoids
a number of them piling up while the test suite is running.
Fixes reversion in 8.20200617 that made annex.pidlock being enabled result
in some commands stalling, particularly those needing to autoinit.
Renamed runsGitAnnexChildProcess to make clearer where it should be
used.
Arguably, it would be better to have a way to make any process git-annex
runs have the env var set. But then it would need to take the pid lock
when running any and all processes, and that would be a problem when
git-annex runs two processes concurrently. So, I'm left doing it ad-hoc
in places where git-annex really does run a child process, directly
or indirectly via a particular git command.
addurl: Fix reversion in 7.20190322 that made --file not be honored when
youtube-dl was used to download media.
8758f9c561 was on the right track, but missed that | otherwise prevented
the code it added from being used.
Also, refactored out a common function.
This commit was sponsored by Graham Spencer on Patreon.
I'm sure this used to work, but somewhere along the line something or
things (getCost and getAvailability I think, probably others)
started catching the exception and not displaying it. So, show warnings.
Avoid complaining that a file with "is beyond a symbolic link" when the
filepath is absolute and the symlink in question is not actually inside the
git repository.
This assumes that inodes remain stable while the command is running.
I think they always will, the filesystems where they are unstable change
them across mounts. (If inodes were not stable, it would just complain about
symlinks in the path that are not inside the working tree.)
(On windows, I don't want to assume anything about inodes, they could be
random numbers for all I know. But if they were, this would still be ok, as
long as windows doesn't have symlinks that are detected by isSymbolicLink.
Which seems a fair bet.)
sanitizeFilePath was changed to sanitize leading '.', but ImportFeed was
running it on parts of the template. So eg the leading '.' in the extension
got sanitized.
Note the added case for sanitizeLeadingFilePathCharacter ('/':_)
-- this was added because, if the template is title/episode and the title
is not set, it would expand to "/episode". So this is another potential
security fix.
Reduce the number of directories listed in libdirs, which makes the linker
check a lot less dead ends looking for directories.
Eliminated some directories that didn't really contain shared libraries,
or only contained the linker.
That left only 2, one in lib and one in usr/lib, so consolidate those two.
Doing it this way, rather than just consolidating all libs that might exist
into a single directory means that, if there are optimised versions of some
libs, eg in lib/subarch/foo.so, and lib/subarch2/foo.so, they don't get
moved around in a way that would make the linker pick the wrong one.
Sped up seeking files to drop by 2x, and also some performance
improvements to checking numcopies.
Interestingly, the seek speedup is not due to precaching, but I think is
due to calling getParsed earlier.
Annex.Drop had to be changed to check inAnnex there, since it was removed
from Command.Drop. All other users of Command.Drop already checked inAnnex
themselves.
This commit was sponsored by Ryan Newton on Patreon.
And add a test case for that.
This certianly loses some of the 2x performance improvement in file
seeking that seekFilteredKeys led to, because now it has to stat the
worktree files again. Without benchmarking, I expect there will still be
a sizable improvement, and also the git-annex branch precaching that
seekFilteredKeys can do will still be a win of its approach.
Also worth noting that lookupKey, when the file DNE, check if it's in an
adjusted branch with hidden files, and if so, finds the key for the
file anyway. That was intended to make git-annex sync --content be able
to process those files, but a side effect was that, when a file was
deleted but the deletion not yet staged, git-annex commands used to
still list it. That was actually a bug. This commit fixes that bug too.
(git-annex sync --content on such a branch does not use seekFilteredKeys
so was not affected by the reversion or by this behavior change)
This commit was sponsored by Jake Vosloo on Patreon.
This was a bit disappointing, I was hoping for a 2x speedup. But, I think
the metadata lookup is wasting a lot of time and also needs to be made to
stream.
The changes to catObjectStreamLsTree were benchmarked to not also speed
up --all around 3% more. Seems I managed to make it polymorphic after all.
Turns out the %(rest) trick was not needed. Instead, just maintain a
list of files we've asked for, and each cat-file response is for the
next file in the list.
This actually benchmarks 25% faster than before! Very surprising, but it
must be due to needing to shove less data through the pipe, and parse
less.
This assumes that no location log files will have a newline or carriage
return in their name. catObjectStream skips any such files due to
cat-file not supporting them.
Keys have been prevented from containing newlines since 2011,
commit 480495beb4. If some old repo
had a key with a newline in it, --all will just skip processing that key.
Other things, like .git/annex/unused files certianly assume no newlines in
keys too, and AFAICR, such keys never actually worked.
Carriage return is escaped by preSanitizeKeyName since 2013. WORM keys
generated before that point could perhaps contain a CR. (URL probably not,
http probably doesn't support an URL with a raw CR in it.) So, added
a warning in fsck about such keys. Although, fsck --all will naturally
skip them, so won't be able to warn about them. Not entirely
satisfactory, but I'll bet there are not really any such keys in
existence.
Thanks to Lukey for finding this optimisation.
The cache was removed way back in 2012,
commit 3417c55189
Then I forgot I had removed it! I remember clearly multiple times when I
thought, "this reads the same data twice, but the cache will avoid that
being very expensive".
The reason it was removed was it messed up the assistant noticing when
other processes made changes. That same kind of problem has recently
been addressed when adding the optimisation to avoid reading the journal
unnecessarily.
Indeed, enableInteractiveJournalAccess is run in just the
right places, so can just piggyback on it to know when it's not safe
to use the cache.
adb shell has sha256sum sha1sum and some others, so they could be used.
They're provided by toybox, so seem about as likely to keep
working as find and stat, which it already depends on.
Or to not add a dep, could use stat the same as getExportContentIdentifier
to get a mtime, and make a WORM key. But do I really want this to
default to WORM?
Unsure what's the best path, so punting for now.
Only supported by some special remotes: directory
I need to check the rest and they're currently missing methods until I do.
git-annex sync --no-content does not yet use this to do imports
Always run Git.Config.store, so when the git config gets reloaded,
the override gets re-added to it, and changeGitRepo then calls extractGitConfig
on it and sees the annex.* settings from the override.
Remove any prior occurance of -c v and add it to the end. This way,
-c foo=1 -c foo=2 -c foo=1 will pass -c foo=1 to git, rather than -c foo=2
Note that, if git had some multiline config that got built up by
multiple -c's, this would not work still. But it never worked because
before the bug got fixed in the first place, the -c value was repeated
many times, so the multivalue thing would have been wrong. I don't think
-c can be used with multiline configs anyway, though git-config does
talk about them?
Trivial since git-annex cannot remove, but do an active checkKey verification
anyway, in case the data was lost somehow.
This commit was sponsored by Ryan Newton on Patreon.
Made several special remotes support locking content on them while
dropping, which allows dropping from another special remote when the
content will only remain on a special remote of these types.
In both cases, verify the content is present actively, because it's
certianly possible for things other than git-annex to have removed it.
Worth thinking about what to do if at some later point, git-lfs gains
support for dropping content, and a content locking operation.
That would probably need a transition; first would need to make lockContent
use the locking operation. Then, once enough time had passed that we can
assume any git-annex operating on the git-lfs remote had that change,
git-annex could finally allow dropping from git-lfs.
Or, it could be that git-lfs gains support for dropping content, but not
locking it. In that case, it seems this commit would need to be reverted,
and then wait long enough for that git-annex to be everywhere, and only
then can git-annex safely support dropping from git-lfs.
So, the assumption made in this commit could lead to bother later.. But I
think it's actually highly unlikely git-lfs does ever support dropping;
it's outside their centralized model. Probably. :) Worth keeping in mind as
the same assumption is made about other special remotes though.
This commit was sponsored by Ethan Aubin.
git is making that configurable, and configuring it globally would break
the test suite in a few places.
No other part of git-annex assumes any branch name. Renamed a few
placeholders to make that clearer.
This commit was sponsored by Jake Vosloo on Patreon.
Otherwise use the vendored copy as before.
The library is in Debian testing but not stable. Once it reaches
stable, the vendored copy can be removed.
Did not add it to debian/control because IIRC that's used to build
git-annex on stable too, possibly. However, the Debian maintainer will
probably want to make the package depend on libghc-http-client-restricted-dev
This commit was sponsored by Ilya Shlyakhter on Patreon.
Otherwise use the vendored copy as before.
The library is in Debian testing but not stable. Once it reaches
stable, the vendored copy can be removed.
Did not add it to debian/control because IIRC that's used to build
git-annex on stable too, possibly. However, the Debian maintainer will
probably want to make the package depend on libghc-git-lfs-dev.
This commit was sponsored by Ilya Shlyakhter on Patreon.
Fix a deadlock that could occur after git-annex got an unlocked file,
causing the command to hang indefinitely.
Known to happen on vfat filesystems, possibly others.
Note that a deadlock is still theoretically possible, if anything
smudge --clean does causes it to run the git queue for some other
reason.
Apparently that doesn't happen, but will need to keep an eye on it.
That made eg git-annex get of an unlocked file hang until the
annex.pidlocktimeout and then fail.
This fix should be fully thread safe no matter what else git-annex is
doing.
Only using runsGitAnnexChildProcess in the one place it's known to be a
problem. Could audit for all places where git-annex runs itself as a child
and add it to all of them, later.
Fix bug that made creds not be stored in git when a special remote was
initialized with gpg encryption, but without an explicit embedcreds=yes.
(Yet nother regression introduced in version 7.20200202.7. 5th so far.)
This makes the creds get saved, since only things recorded there will be
saved.
IIRC, unparsedRemoteConfig was not originally available when I
implemented this; now that it is things get a bit simpler.
More could probably be simplified, is externalConfigChanges needed at
all?
This does not entirely fix the bugs though, because creds are only
embedded when embedcreds=yes, but not when encryption=pubkey is used
without embedcreds=yes.
checkpresentkey: When no remote is specified, try all remotes, not only
ones that the location log says contain the key. This is what the
documentation has always said it did.
Still try the logged remotes first, because they are far more likely to
have the key.
* Improve display of problems auto-initializing or upgrading local git
remotes.
* When a local git remote cannot be initialized because it has no
git-annex branch or a .noannex file, avoid displaying a message about it.
So stop documenting it, and stop offering it as a choice in the assistant.
Removed the code that parses it into S3.ReducedRedundancy, because
S3.OtherStorageClass with the value will work just the same and avoids a
special case for a deprecated this.
The ContentIdentifier can contain almost anything, so could have characters
that are not fit for the filesystem, or might be longer than a key usually
is, or contain a newline, or .... genKeyName deals with those problems.
This should not present a back-compat issue, because this is a temporary
key used while downloading the imported file, before the real key for it
can be generated.
Since an external process can be in the middle of some operation when an
async exception is received, it has to be shut down then. Using
cleanupProcess will close its IO handles and send it a SIGTERM.
If a special remote choses to catch SIGTERM, it's fine for it to do some
cleanup then, but until it finishes, git-annex will be blocked waiting
for it. If a special remote blocked SIGTERM, it would cause a hang.
Mentioned in docs.
Also, in passing, fixed a FD leak, it was not closing the error handle
when shutting down the external. In practice that didn't matter before because
it was only run when git-annex was itself shutting down, but now that it
can run on exception, it would have been a problem.
Added annex.skipunknown git config, that can be set to false to change the
behavior of commands like `git annex get foo*`, to not skip over files/dirs
that are not checked into git and are explicitly listed in the command
line.
Significant complexity was needed to handle git-annex add, which uses some
git ls-files calls, but needs to not use --error-unmatch because of course
the files are not known to git.
annex.skipunknown is planned to change to default to false in a
git-annex release in early 2022. There's a todo for that.
Try to enable special remotes configured with autoenable=yes when git-annex
auto-initialization happens in a new clone of an existing repo. Previously,
git-annex init had to be explicitly run to enable them. That was a bit of a
wart of a special case for users to need to keep in mind.
Special remotes cannot display anything when autoenabled this way, to avoid
interfering with the output of git-annex query commands.
Any error messages will be hidden, and if it fails, nothing is displayed.
The user will realize the remote isn't enable when they try to use it,
and can run git-annex init manually then to try the autoenable again and
see what failed.
That seems like a reasonable approach, and it's less complicated than
communicating something across a pipe in order to display it as a side
message. Other reason not to do that is that, if the first command the
user runs is one like git-annex find that has machine readable output,
any message about autoenable failing would need to not be displayed anyway.
So better to not display a failure message ever, for consistency.
(Had to split out Remote.List.Util to avoid an import cycle.)
Fix a crash or potentially not all files being exported when sync -J
--content is used with an export remote.
Crash as described in fixed bug report.
waitForAllRunningCommandActions inserted in several points where all the
commandActions started before need to have finished before moving on to
the next stage of the export. A race across those points could have
maybe resulted in not all files being exported, or a wrong tree being
export.
For example, changeExport starting up an action like
a rename of A to B. Then, with that action still running, fillExport
uploading a new A, *before* the rename occurred. That race seems
unlikely to have happened. There are some other ones that this also
fixes.
move --to, copy --to, mirror --to: When concurrency is enabled, run cleanup
actions in separate job pool from uploads.
transferStages was confusingly named, it's only useful when doing downloads
as then the verify actions can be run concurrently with other downloads.
For commands that upload, there will be more concurrency from running
cleanup actions in a separate job pool.
As for sync, I left it using downloadStages although that's not optimal
for the part of a sync that uploads. Perhaps it should use the union of
both?
Already supported --json, but not that.
Also checked all other commands that only support --json, and the only
other one that does transfers is fsck (--from), which it did not seem worth
adding --json-progress to really.
Fix bug that made enableremote of S3 and webdav remotes, that have
embedcreds=yes, fail to set up the embedded creds, so accessing the remotes
failed.
(Regression introduced in version 7.20200202.7 in when reworking all the
remote configs to be parsed.)
Root problem is that parseEncryptionConfig excludes all other config keys
except encryption ones, so it is then unable to find the
credPairRemoteField. And since that field is not required to be
present, it proceeds as if it's not, rather than failing in any visible
way.
This causes it to not find any creds, and so it does not cache
them. When when the S3 remote tries to make a S3 connection, it finds no
creds, so assumes it's being used in no-creds mode, and tries to find a
public url. With no public url available, it fails, but the failure doesn't
say a lack of creds is the problem.
Fix is to provide setRemoteCredPair with a ParsedRemoteConfig, so the full
set of configs of the remote can be parsed. A bit annoying to need to
parse the remote config before the full config (as returned by
setRemoteCredPair) is available, but this avoids the problem.
I assume webdav also had the problem by inspection, but didn't try to
reproduce it with it.
Also, getRemoteCredPair used getRemoteConfigValue to get a ProposedAccepted
String, but that does not seem right. Now that it runs that code, it
crashed saying it had just a String.
Remotes that have already been enableremoted, and so lack the cached creds
file will work after this fix, because getRemoteCredPair will extract
the creds from the remote config, writing the missing file.
This commit was sponsored by Ilya Shlyakhter on Patreon.
One way this can be used is to remove all urls for some website that went
away:
git-annex whereis --format '${file} ${url}\0' | \
grep -z whatever.com | git-annex rmurl --batch -z
Combining ${url} and ${uuid} is a bit of a combinatorial explosion.
It didn't seem worth only outputting a uuid alongside an url belonging
to it, so each uuid is output beside each url.
When storing content on remote fails, always display a reason why.
Since the Storer used by special remotes already did, this mostly affects
git remotes, but not entirely. For example, if git-lfs failed to connect to
the endpoint, it used to silently return False.
This relicates git's behavior. It adds a few stat calls for the command
line parameters, so there is some minor slowdown, but even with thousands
of parameters it will not be very noticable, and git does the same statting
in similar circumstances.
Note that this does not prevent eg "git annex add symlink"; the symlink
will be added to git as usual. And "git annex find symlink" will silently
list nothing as well. It's only "symlink/foo" or "subdir/symlink/foo" that
triggers the warning.
* addurl --preserve-filename: New option, uses server-provided filename
without any sanitization, but with some security checking.
Not yet implemented for remotes other than the web.
* addurl, importfeed: Avoid adding filenames with leading '.', instead
it will be replaced with '_'.
This might be considered a security fix, but a CVE seems unwattanted.
It was possible for addurl to create a dotfile, which could change
behavior of some program. It was also possible for a web server to say
the file name was ".git" or "foo/.git". That would not overrwrite the
.git directory, but would cause addurl to fail; of course git won't
add "foo/.git".
sanitizeFilePath is too opinionated to remain in Utility, so moved it.
The changes to mkSafeFilePath are because it used sanitizeFilePath.
In particular:
isDrive will never succeed, because "c:" gets munged to "c_"
".." gets sanitized now
".git" gets sanitized now
It will never be null, because sanitizeFilePath keeps the length
the same, and splitDirectories never returns a null path.
Also, on the off chance a web server suggests a filename of "",
ignore that, rather than trying to save to such a filename, which would
fail in some way.
git-lfs repos that encrypt the annexed content but not the git repo only
need --force passed to initremote, allow enableremote and autoenable of
such remotes without forcing again.
Needing --force again particularly made autoenable of such a repo not work.
And once such a repo has been set up, it seems a second --force when
enabling it elsewhere has little added value. It does tell the user about
the possibly insecure configuration, but if the git repo has already been
pushed to that remote in the clear, data has already been exposed. The goal
of that --force was not to prevent every situation where such an exposure
can happen -- anyone who sets up a public git repo and pushes to it will
expose things similarly and git-annex is not involved. Instead, the purpose
of the --force is to point out to the user that they're asking for a
configuration where encryption is inconsistently applied.
To use S3 Signature Version 4. Some S3 services seem to require v4, while
others may only support v2, which remains the default.
I'm also not sure if v4 works correctly in all cases, there is this
upstream bug report: https://github.com/aristidb/aws/issues/262
I've only tested it against the default S3 endpoint.
37b42e72e7 made it catch exceptions but
thought they were unlikely to be useful to display, which may be right when
a git command fails, but not in the case yoh found.
Now the warning gets displayed, which is better than an arcane git error.
The warning is still kind of ugly, especially when the pull later in the
sync will clear up what it warns about. But, this is an unusual situation
not likely to happen, and if there is no remote to pull from, the warning
message is needed or the sync will seem to succeed despite not merging the
synced master branch.
Would still be better if it could merge the synced master branch in this
situation, making an empty commit to master to do it seems wrong, and
otherwise it would need a whole separate code path, and would bypass using
git merge in favor of say, setting master to the syned branch. Which would
bypass git configs like arguably merge.ff and certianly
merge.verifySignatures. So don't want to do that.
User reported git@my.gitlab.foo:username/myrepo.git didn't work with
git-repair, because it rewrites it to an url
ssh://git@my.gitlab.foo/~/username/myrepo.git
and the /~/ was not something the hosting site supported.
Since git-annex still generally needs the repo url to be well, an url, did
not change the conversion code. But in this case, we're running git fetch,
so we might as well pass it the remote name rather than the url.
Did a quick audit of repoLocation uses to see if there was anything else
like this problem elsewhere, and didn't see any. But this is not the first
time this special case in git and git-annex's attempt to de-special-case
it has caused a problem..
* Display a warning message when a remote uses a protocol, such as
git://, that git-annex does not support. Silently skipping such a
remote was confusing behavior.
It sets annex-ignore, so the warning is only displayed once.
* Also display a warning message when a remote, without a known uuid,
is located in a directory that does not currently exist, to avoid
silently skipping such a remote.
This is a bit more debatable, since git-annex get will say,
try making repository available. And since it does not set annex-ignore,
the warning will be displayed repeatedly. It's also an extreme edge case,
I don't think I've ever seen it happen in real life.
Due to eg, too long a path to the agent socket, caused by running gpg in a
container where /run is not mounted, and/or some other gpg behavior like
unnecessarily making relative paths to its home directory absolute.
When the required content is set to "groupwanted", use whatever expression
has been set in groupwanted as the required content of the repo, similar to
how setting required content to "standard" already worked.
addurl: When run with --fast on an url that
annex.security.allowed-ip-addresses prevents accessing, display a more
useful message.
(Also importfeed --fast potentially.)
Do not sync with a faster remote that was not specified.
That old behavior was only documented in the changelog, and was certianly
surprising. It also meant adding --fast made it slower..
readonly=true is used to make an external special remote that does not
need the external program to be installed. It was stored in the
remote.log by default, and so every time it was specified in an
enableremote or initremote, whatever value was used became the new
default for subsequent enableremotes of that remote.
That was surprising, and I consider it to be a bug.
It does not make much sense to pass it to initremote because then how
would you populate that remote with anything? You would have to
enableremote elsewhere, and store content there. I'm assuming nobody
used it that way.
Someone might rely on passing it to enableremote once, and then that
being inherited in other clones. But that is not how it's documented to
be used. It is barely documented in git-annex at all, only in the
external special remote protocol, and the documentation there says to
"Document that this external special remote can be used in readonly
mode." (by the user of it passing readonly=true to enableremote). The
one external special remote that I know of that does document that is
<https://github.com/bgilbert/gcsannex> (the one that motivated adding
it). That one's docs do say to pass it to enableremote.
So, it seemed safe to make this behavior change. If someone was in fact
relying on one of those behaviors, all their current repos will still
work as they configured them (although they will need to deal
with the related change in 9f3c2dfeda).
In new clones, they will find enableremote fails, complaining the
external program is not in path. An easy enough problem to recover from.
get --from, move --from: When used with a local git remote, these used to
silently skip files that the location log thought were present on the
remote, when the remote actually no longer contained them. Since that
behavior could be surprising, now instead display a warning.
I got very confused when I encountered this behavior, since it was silently
skipping a file I needed that whereis said was on the remote.
get without --from already displayed a "unable to access these remotes"
message, which while a bit misleading in that the remote is likely
accessible, but just doesn't contain the file, at least indicated something
went wrong.
Having get --from display a warning makes it in line with get
w/o --from, so seems certianly ok. It might be there are situations where
move --from is used, on eg a whole directory, and the user only wants to
move whatever is present in the remote, and is perfectly ok with files
that are not present being skipped. So I'm less sure about the new warning
being ok there. OTOH, only local git remotes avoiding displaying a warning
in that case too, so this just brings them into line with other remotes.
(Also note that this makes it a little bit faster when dealing with a lot of
files, since it avoids a redundant stat of the file.)
Limited to min of -JN or number of CPU cores, because it will often be
CPU bound, once it's read the gitignore file for a directory.
In some situations it's more disk bound, but in any case it's unlikely
to be the main bottleneck that -J is used to avoid. Eg, when dropping,
this is used for numcopies checks, but the main bottleneck will be
accessing the remotes to verify presence. So the user might decide to
-J32 that, but having 32 check-attr processes would just waste however
many filehandles they open, and probably worsen their performance due to
CPU contention.
Note that, I first tried just letting up to the -JN be started. However,
even when it's no bottleneck at all, that still results in all of them
being started. Why? Well, all the worker threads start up nearly
simulantaneously, so there's a thundering herd..
Avoid running a large number of git cat-file child processes when run with
a large -J value.
This implementation takes care to avoid adding any overhead to git-annex
when run without -J. When run with -J, there is a small bit of added
overhead, to manipulate the resource pool. That optimisation added a
fair bit of complexity.
Avoid repeatedly opening keys db when accessing a local git remote and -J
is used.
What was happening was that Remote.Git.onLocal created a new annex state
as each thread started up. The way the MVar was used did not prevent that.
And that, in turn, led to repeated opening of the keys db, as well as
probably other extra work or resource use.
Also managed to get rid of Annex.remoteannexstate, and it turned out there
was an unncessary Maybe in the keysdbhandle, since the handle starts out
closed.
This change does impact git-annex config
eg "git annex config --set annex.addunlocked on"
will store "on" and new git-annex will understand that value, while
old git-annex will error:
git-annex: bad annex.addunlocked configuration in git annex config:
Parse failure: near "on"
That seems acceptable.
Not special remote configs that are only documented as =true or =false
however. Having git-annex support other values for those would break
backwards compatability when used with old versions of git-annex. And
older versions ignore invalid special remote configs.. That would not
be a good combination.
Eg"core.bare" is the same as "core.bare = true".
Note that git treats "core.bare =" the same as "core.bare = false", so the
code had to become more complicated in order to treat the absense of a
value differently than an empty value. Ugh.
Git has an obnoxious special case in git config, a line "foo" is the same
as "foo = true". That means there is no way to examine the output of
git config and tell if it was run with --null or not, since a "foo"
in the first line could be such a boolean, or could be followed by its
value on the next line if --null were used.
So, rather than trying to do such a detection, track the style of config
at all the points where it's generated.
Around 3% total speedup.
Profiling git annex find --not --in web, it's now bytestring end-to-end,
and there is only a little added overhead in eg accessing the Annex
state MVar (3%). The rest of the runtime is spent reading symlinks, and
in attoparsec.
This feels like the end of the optimisation road, without a major change
like caching information for faster queries.
The only price paid is one additional MVar read per write to the journal.
Presumably writing a journal file dominiates over a MVar read time by
several orders of magnitude.
--batch does not get the speedup because then it needs to notice when
another process has made a change. Also made the assistant and other damon
modes bypass the optimisation, which would not help them anyway.
After a user completely ignored the display of the exception probably
because it didn't make sense..
This does make it a little bit slower since it checks adb is in path each
time before running it. Also, it might display a lot of warnings about it
not being installed.
This commit was sponsored by Ilya Shlyakhter on Patreon.
Improve git-annex's ability to find the path to its program, especially
when it needs to run itself in another repo to upgrade it.
Some parts of the code used readProgramFile, probably because I forgot that
programPath exists.
I noticed this when a git-annex auto-upgrade failed because it was running
git-annex upgrade --autoonly, but the code to run git-annex used
readProgramFile, which happened to point to an older build of git-annex.
Upgrade other repos than the current one by running git-annex upgrade
inside them, which avoids problems with upgrade code making assumptions
that the cwd will be inside the repo being upgraded.
In particular, this fixes a problem where upgrading a v7 repo to v8 caused
an ugly git error message.
I actually could not find a way to make Upgrade.V7 work properly
without changing directory to the remote. Once I got git ls-files to work,
the git cat-file failed because :path can only be used in the current git
repo.
This means it will still be a .git file when git-annex init runs. That's
ok, the repo probably contains no annexed objects yet, and even if it does,
git-annex init does not care if symlinks in the worktree don't point to the
objects.
I made init, at the end, run the conversion code. Not really necessary
because the next git-annex command could do it just as well. But, this
avoids commands that don't normally write to the repo needing to write to
it, which might avoid some problem or other, and seems worth avoiding
generally.
While git ls-files can actually be used on a repo that is not in the
cwd, it works inconsistently. For example, this fails:
git --git-dir=../foo/.git --work-tree=../foo ls-files ../foo
But change some of the paths to absolute and it will succeed. That seems
like a bug in git.
OTOH, this succeeds:
git --git-dir=../foo/.git --work-tree=../foo ls-files
But, that lists paths relative to the top of the --work-tree,
rather than the usual listing them relative to the cwd. Because the cwd
is not in the repo. And so anything parsing the ls-files output of that
is likely to operate on files in the wrong location. Indeed, there is
code in Upgrade/ that has this problem!
md5sum is part of busybox, so is probably available unless it were compiled
out. If md5sum (or cut for that matter) is not available, it will
still use the whole path to $base, otherwise hash it.
Of course it's possible for md5sum to be available sometimes and not others
on the same system; in such an event the locales would be built twice for
the same bundle. The cleanup code will delete both sets once that
version of the bundle is upgraded.
git-annex config: Only allow configs be set that are ones git-annex
actually supports reading from repo-global config, to avoid confused users
trying to set other configs with this.
* whereis: If a remote fails to report on urls where a key
is located, display a warning, rather than giving up and not displaying
any information.
* When external special remotes fail but neglect to provide an error
message, say what request failed, which is better than displaying an
empty error message to the user.
Fix serious regression in gcrypt and encrypted git-lfs remotes.
Since version 7.20200202.7, git-annex incorrectly stored content
on those remotes without encrypting it.
Problem was, Remote.Git enumerates all git remotes, including git-lfs
and gcrypt. It then dispatches to those. So, Remote.List used the
RemoteConfigParser from Remote.Git, instead of from git-lfs or gcrypt,
and that parser does not know about encryption fields, so did not
include them in the ParsedRemoteConfig. (Also didn't include other
fields specific to those remotes, perhaps chunking etc also didn't
get through.)
To fix, had to move RemoteConfig parsing down into the generate methods
of each remote, rather than doing it in Remote.List.
And a consequence of that was that ParsedRemoteConfig had to change to
include the RemoteConfig that got parsed, so that testremote can
generate a new remote based on an existing remote.
(I would have rather fixed this just inside Remote.Git, but that was not
practical, at least not w/o re-doing work that Remote.List already did.
Big ugly mostly mechanical patch seemed preferable to making git-annex
slower.)
* init --version: When the version given is one that automatically
upgrades to a newer version, use the newer version instead.
* Auto upgrades from older repo versions, like v5, now jump right to v8.
Fix some cases where handling of keys with extensions varied depending on
the locale.
A filename with a unicode extension would before generate a key with an
extension in a unicode locale, but not in LANG=C, because the extension
was not all alphanumeric. Also the the length of the extension could be
counted differently depending on the locale.
In a non-unicode locale, git-annex migrate would see that the extension
was not all alphanumeric and want to "upgrade" it. Now that doesn't happen.
As far as backwards compatability, this does mean that unicode
extensions are counted by the number of bytes, not number of characters.
So, if someone is using unicode extensions, they may find git-annex
stops using them when adding files, because their extensions are too
long. Keys already in their repo with the "too long" extensions will
still work though, so this only prevents adding the same content with
the same extension generating the same key. Documented this by
documenting that annex.maxextensionlength is a number of bytes.
Also, if a filename has an extension that is not valid utf-8 and the
locale is utf-8, the extension will be allowed now, and an old
git-annex, in the same locale would not, and would also want to
"upgrade" that.
Unfortunately, cabal puts the binary in a very complicated path
and does not provide any good way to get it out, leaving no good choice
except to use find.
It may be possible to use cabal (new-)install --symlink-bindir,
and ask it to symlink to pwd, but with my older version of cabal,
that does not work.
The stack branch will probably also break once it uses a newer cabal,
didn't try to deal with that.
* Added sync --only-annex, which syncs the git-annex branch and annexed
content but leaves managing the other git branches up to you.
* Added annex.synconlyannex git config setting, which can also be set with
git-annex config to configure sync in all clones of the repo.
Use case is then the user has their own git workflow, and wants to use
git-annex without disrupting that, so they sync --only-annex to get the
git-annex stuff in sync in addition to their usual git workflow.
When annex.synconlyannex is set, --not-only-annex can be used to override
it.
It's not entirely clear what --only-annex --commit or --only-annex
--push should do, and I left that combination not documented because I
don't know if I might want to change the current behavior, which is that
such options do not override the --only-annex. My gut feeling is that
there is no good reasons to use such combinations; if you want to use
your own git workflow, you'll be doing your own committing and pulling
and pushing.
A subtle question is, how should import/export special remotes be handled?
Importing updates their remote tracking branch and merges it into master.
If --only-annex prevented that git branch stuff, then it would prevent
exporting to the special remote, in the case where it has changes that
were not imported yet, because there would be a unresolved conflict.
I decided that it's best to treat the fact that there's a remote tracking
branch for import/export as an implementation detail in this case. The more
important thing is that an import/export special remote is entirely annexed
content, and so it makes a lot of sense that --only-annex will still sync
with it.
Fix support for repositories tuned with annex.tune.branchhash1=true,
including --all not working and git-annex log not displaying anything for
annexed files.
fsck --from remote: Fix a concurrency bug that could make it incorrectly
detect that content in the remote is corrupt, and remove it, resulting in
data loss.
* When git-annex is built with a ssh that does not support ssh connection
caching, default annex.sshcaching to false, but let the user override it.
* Improve warning messages further when ssh connection caching cannot
be used, to clearly state why.
This is untested because of rain, also I am operating from truncated
copiler error messages in a bug report that also doesn't mention what the
library version is. Still, it should work.
May break builds with old ghc, in particular DerivingStrategies is
I think fairly new? The pragmas could be ifdefed if necessary. Works with
ghc 8.6.5.
A warning message is unsatisfying. But erroring out is too hard a failure,
especially since it may well work fine if the user has enabled passwordless
ssh.
I did think about falling back to one ssh connection at a time in this
case, but it would have needed a rework of every ssh call, which
seems far overboard for such a niche problem. There's no single place where
git-annex runs ssh, so no one place that it could block a concurrent call
on a semaphore. And, even if it did fall back to one ssh connection at a
time, it seems to me that doing so without warning the user about the
problem just invites bug reports like "git-annex is ignoring my -J2 and
only doing one download at a time". So a warning is needed, and I suppose
is good enough.
using git credential to get the password
One thing this doesn't do is wrap the password prompting inside the prompt
action. So with -J, the output can be a bit garbled.
Http remotes that do expose a git config file, but are not initialized
resulted in an ugly and unncessary error message, now sqelched.
When git-annex-shell configlist is run w/o the autoinit field, it may
not generate a uuid for the repository. So in that case, it's not
unexpected for the config it does list to not include a UUID, and
dumping out the config in a warning message is not needed.
If configlist is asked to autoinit and we don't get back a config with a
UUID in it, that suggests some problem, and what we got back may not be
a config at all but some diagnostic message, so it does make sense to
output it then.
The use case is basically the user having forgotten, so --help would be
best, but it would be quite hard to include this in --help, since it may
even have to spin up an external special remote program.
I also considered --umm but typoed it the first time I tried it as
--uum, and while memorable, it's too cutesy. --whatelse is good because
it explicitly asks, what other params, besides the ones I've given?
Special remote programs that use GETCONFIG/SETCONFIG are recommended
to implement it.
The description is not yet used, but will be useful later when adding a way
to make initremote list all accepted configs.
configParser now takes a RemoteConfig parameter. Normally, that's not
needed, because configParser returns a parter, it does not parse it
itself. But, it's needed to look at externaltype and work out what
external remote program to run for LISTCONFIGS.
Note that, while externalUUID is changed to a Maybe UUID, checkExportSupported
used to use NoUUID. The code that now checks for Nothing used to behave
in some undefined way if the external program made requests that
triggered it.
Also, note that in externalSetup, once it generates external,
it parses the RemoteConfig strictly. That generates a
ParsedRemoteConfig, which is thrown away. The reason it's ok to throw
that away, is that, if the strict parse succeeded, the result must be
the same as the earlier, lenient parse.
initremote of an external special remote now runs the program three
times. First for LISTCONFIGS, then EXPORTSUPPORTED, and again
LISTCONFIGS+INITREMOTE. It would not be hard to eliminate at least
one of those, and it should be possible to only run the program once.
This is a first step toward that goal, using the ProposedAccepted type
in RemoteConfig lets initremote/enableremote reject bad parameters that
were passed in a remote's configuration, while avoiding enableremote
rejecting bad parameters that have already been stored in remote.log
This does not eliminate every place where a remote config is parsed and a
default value is used if the parse false. But, I did fix several
things that expected foo=yes/no and so confusingly accepted foo=true but
treated it like foo=no. There are still some fields that are parsed with
yesNo but not not checked when initializing a remote, and there are other
fields that are parsed in other ways and not checked when initializing a
remote.
This also lays groundwork for rejecting unknown/typoed config keys.
Git will eventually switch to sha2 and there will not be one single
shaSize anymore, but two (40 and 64).
Changed all parsers for git plumbing output to support both sizes of
shas.
One potential problem this does not deal with is, if somewhere in
git-annex it reads two shas from different sources, and compares them
to see if they're the same sha, it would fail if they're sha1 and sha256
of the same value. I don't know if that will really be a concern.
ifAnnexed in a bare repo passes to git cat-file :./filename , which it
refuses to do since the repo is bare.
Note that, reinject somefile someannexedfile in a bare repo silently does
nothing, because someannexedfile is never actually an annexed worktree
file, because the repo is bare.
options make it easier to override annex.largefiles configuration
(and potentially safer as it avoids bugs like the smudge bug fixed
in the last release)
Deleted some old comments that were posted to the man page discussing such
options.
Updated docs that used -c annex.largefiles to use the options.
Note that addSmallOverridden was needed to avoid the clean filter running
on the file. It would be possible to make addFile also update the index
directly, rather than going via git add. However, it was not necessary,
and I want to avoid breaking on some edge case, particularly if the code in
addSmallOverridden has some oversight.
Also, when annex.addunlocked is set and annex.largefiles does not match a file,
git annex add --force-large works, but git status will then show the file
as added, with a unstaged modification. The unstaged modification adds the
file to git. This is identical behavior to using -c annex.largefiles=nothing
when annex.addunlocked is set. This does not prevent committing what was
intended to be added. I have not gotten to the bottom of why git thinks
the file is modified and runs it through the clean filter in this case.
smudge: When annex.largefiles=anything, files that were already stored in
git, and have not been modified could sometimes be converted to being
stored in the annex. Changes in 7.20191024 made this more of a problem.
This case is now detected and prevented.
The git add behavior changes could be avoided if it turns out to be
really annoying, but then it would need to behave the old way when
annex.dotfiles=false and the new way when annex.dotfiles=true. I'd
rather not have the config option result in such divergent behavior as
`git annex add .` skipping a dotfile (old) vs adding to annex (new).
Note that the assistant always adds dotfiles to the annex.
This is surprising, but not new behavior. Might be worth making it also
honor annex.dotfiles, but I wonder if perhaps some user somewhere uses
it and keeps large files in a directory that happens to begin with a
dot. Since dotfiles and dotdirs are a unix culture thing, and the
assistant users may not be part of that culture, it seems best to keep
its current behavior for now.