directory: When cp supports reflinks, use it when getting content from a
directory special remote.
Not yet for imports from directory though, and not for store.
Note that, when it's chunked, using cp --reflink would not speed it up, and
when reflink was not supported, would unnecessarily write the chunk to a
file before reading it back in. So, only using a fileRetriever in the
NoChunks case is necessary to keep chunking fast.
fileCopier is told not to verify, because the special remote interface
does not yet support verification in passing. AFAICS, fileCopies can
never return False when not verifying so the added giveup should never
actually happen.
When downloading content from a remote, if the content is able to be
verified during the transfer, skip checksumming it a second time.
Note that in this case, the fsck output does not include "(checksum)"
which it does when the checksumming is done separately from the download.
This commit was sponsored by Brock Spratlen on Patreon.
When git-annex transferrer started up, and the journal contained something,
it would commit it to the git-annex branch. This caused excess commits to
the branch, in cases where normally several changes would be journalled and
committed together. That generated some excess git objects and was also
just noisy on stdout.
Since transferrer uses enableInteractiveBranchAccess, it does not need to
commit journalled changes, since the optimisation that avoids checking
the journal when reading from the branch is disabled for processes that
call that.
This commit was sponsored by Svenne Krap on Patreon.
persistent stopped using askLogFunc, and the thing to use is askLoggerIO
from monad-logger. Bumped the dep to the first version that contained that.
Note that the i386ancient build uses a newer monad-logger than 0.3.10,
so the new versioned dep should not break it, and presumably nothing else
either.
This commit was sponsored by Noam Kremen on Patreon.
Which generated unusual git trees that could confuse git merge,
since they incorrectly had 2 subtrees with the same name.
Root of the bug was a) not testing that at all! but also
b) confusing graftdirs, which contains eg "foo/bar" with
non-recursively read trees, which would contain eg "bar"
when reading a subtree of "foo".
It's worth noting that Annex.Import uses graftTree, but it really
shouldn't have needed to. Eg, when importing into foo/bar from a remote,
it's enough to generate a tree of foo/bar/x, foo/bar/y, and does not
include other files that are at the top of the master branch. It uses
graftTree, so it does include the other files, as well as the foo/bar
tree. git merge will do the same thing for both trees. With that said,
switching it away from graftTree would result in another import
generating a new commit that seems to delete files that were there in a
previous commit, so it probably has to keep using graftTree since it
used it before.
This commit was sponsored by Kevin Mueller on Patreon.
Note that a key with no size field that is hard linked will
result in listImportableContents reporting a file size of 0,
rather than the actual size of the file. One result is that
the progress meter when getting the file will seem to get stuck
at 100%. Another is that the remote's preferred content expression,
if it tries to match against file size, will treat it as an empty file.
I don't see a way to improve the latter behavior, and the former behavior
is a minor enough problem.
This commit was sponsored by Jake Vosloo on Patreon.
Keys stored on the filesystem are mangled by keyFile to avoid problem
chars. So, that mangling has to be reversed when parsing files from a
borg backup back to a key.
The directory special remote also so mangles them. Some other special
remotes do not; eg S3 just serializes the key -- but S3 object names are
not limited to filesystem valid filenames anyway, so a S3 server must
not map them directly to files in any case. It seems unlikely that a
borg backup of some such special remote will get broken by this change.
This commit was sponsored by Graham Spencer on Patreon.
New error message:
Remote foo not usable by git-annex; setting annex-ignore
http://localhost/foo/config download failed: Configuration of annex.security.allowed-ip-addresses does not allow accessing address ::1
If git config parse fails, or the git config file is not available at the url,
a better error message for that is also shown.
This commit was sponsored by Mark Reidenbach on Patreon.
It changed parseOnly in the ByteString.Lazy module to take a lazy, not
strict ByteString. In all these cases though, we actually had a strict
ByteString, so the most efficient fix, which also happens to avoid needing
ifdefs, is to use the non-lazy module instead.
This commit was sponsored by Denis Dzyubenko on Patreon.
Seems that hasOrigin was never finding origin's git-annex branch, so a new
one got created each time. And so then it later needed to merge the two
branches, which is expensive.
Added --no-track to git branch to avoid it displaying a message about
setting up tracking branches. Of course there's no reason to make the
git-annex branch a tracking branch since git-annex auto-merges it.
Can beet to false to avoid some expensive things needed to support unlocked
files.
See my comment for why this only controls what init sets up, and not other
behavior.
I didn't bother with making the v5 upgrade code path look at this, though
it easily could, because the docs say to run git-annex init after setting
it to make it take effect.
I don't think this was really intentional behavior. It may be that it was
useful to include it so it could be passed to rmurl, since without it rmurl
would not actually remove the url. Since that was changed earlier today,
now seems like a good time to clean up the display of these urls.
This commit was sponsored by Jochen Bartl on Patreon.
fsck: When --from is used in combination with --all or similar options, do
not verify required content, which can't be checked properly when operating
on keys.
This commit was sponsored by Boyd Stephen Smith Jr. on Patreon.
box.com already had a special case, since its renaming was known buggy.
In its case, renaming to the temp file succeeds, but then renaming the temp
file to final destination fails.
Then this 4shared server has buggy handling of renames across directories.
While already worked around with for the temp files when storing exports
now being in the same directory as the final filename, that also affected
renameExport when the file moves between directories.
I'm not entirely clear what happens on the 4shared server when it fails
this way. It kind of looks like it may rename the file to destination and
then still fail.
To handle both, when rename fails, delete both the source and the
destination, and fall back to uploading the content again. In the box.com
case, the temp file is the source, and deleting it makes sure the temp file
gets cleaned up. In the 4shared case, the file may have been renamed to the
destination and so cleaning that up avoids any interference with the
re-upload to the destination.
unregisterurl: Fix a bug that caused an url to not be unregistered when it
is claimed by a special remote other than the web.
See commit f175d4cc90 for rationalle.
* rmurl: When youtube-dl was used for an url, it no longer needs to be
prefixed with "yt:" in order to be removed.
* rmurl: If an url is both used by the web and also claimed by another
special remote, fix a bug that caused the url to to not be removed.
The youtube-dl change is a consequence of how the bug fix is implemented.
But I also think it's the right thing to do. Consider that, before,
git-annex addurl $url followed by git-annex rmurl $url would not remove the
url in the case where youtube-dl was used. That was surprising behavior.
In the unlikely case where a special remote claims an url, and it's been
added using OtherDownloader, but it was also added already as a web url,
it seems better for rmurl to remove both than to arbitrarily remove only one.
And in the case the bug report was filed for, when an url was added as a
web url, but a special remote now claims it, that should not prevent rmurl
removing the web url.
Calling setUrlMissing lets other callers of it behave differently.
Probably the calls to it in eg, Remote.External and Remote.BitTorrent are
fine, since they don't mangle the url and just remove what was provided,
and the OtherDownloader form of a bittorrent url, respectively.
I suspect unregisterurl needs to have a similar change made to rmurl, for
similar reasons.
When autoenabling special remotes of type S3, weddav, or glacier, do not
take login credentials from environment variables, as the user may not be
expecting the autoenable to happen, and may have those set for other
purposes.
Like import was using ActionItemWorkTreeFile, it's ok to use it for export,
even though it might not correspond with a file in the work tree.
And renamed it to ActionItemTreeFile to make that clearer.
Note that when an export has to rename files, it still uses
ActionItemOther, so file will still be null in that case, but as no file is
being transferred, that seems ok.
import: When the previously exported tree contained a submodule,
preserve it in the imported tree so it does not get deleted.
The export exclude log, which was used for non-preferred content,
now also includes the submodules. Since the log format is git ls-tree
output, this does not break backwards compatibility.
This mostly affects OSX and (possibly) Windows, but the Windows
installer does not bundle git. The linux standalone builds are not
updated yet pending debian stable getting a backport of the security
fix, but the security hole is unlikely to affect linux as
case-insensitive filesystems that support symlinks are a rarity on it.
Using the linux standalone build on windows via WSL is another way it
could be affected.
This commit was sponsored by Brett Eisenberg on Patreon.
Which access a remote using rsync over ssh, and which git pushes to much
more efficiently than ssh urls.
There was some old partial support for rsync URIs from 2013, but it seemed
incomplete, and did not use rsync over ssh. Weird.
I'm not sure if there's any remaining benefit to using the non-rsync url
forms with gcrypt, now that this is implemented? Updated docs to encourage
using the rsync urls.
This commit was sponsored by Svenne Krap on Patreon.
Git.Remote.parseRemoteLocation had a hack to handle URIs that contained
characters like spaces, which is something git unfortunately allows
despite not being a valid URI. However, that hack looked for "//" to
guess something was an URI, and these gcrypt URIs, being to a local
path, don't contain that. So instead escape all illegal characters and
check if the resulting thing is an URI.
And that was already done by Git.Construct.fromUrl, so
internally the gcrypt URI with a space looks like "gcrypt::foo%20bar"
and that needs to be de-escaped when converting back from URI to local
repo path.
This change might also allow a few other almost-valid URIs to be handled
as URIs by git-annex. None that contain "//" will change, and any
behavior change should result in git-annex doing closer to a right thing
than it did before, probably.
This commit was sponsored by Noam Kremen on Patreon.
Previously such nonsensical combinations always treated the matching option
as if it didn't match.
For now, made find --branch refuse matching options that need a
filename, because one is not provided to them in a way they'll use.
There's an open bug report to support it, but making it error out is
better than the old behavior of not finding what it was asked to.
Also, made --mimetype combined with eg --all work, by looking at the
object file when operating on keys.
Implemented by generalizing registerurl. Without the implicit batch mode
of registerurl since that is only a backwards compatability thing
(see commit 1d1054faa6).
unannex, uninit: When an annexed file is modified, don't overwrite the
modified version with an older version from the annex
This commit was sponsored by Mark Reidenbach on Patreon.
This benchmarks only slightly faster than the old git-annex. Eg, for a 1
gb file, 14.56s vs 15.57s. (On a ram disk; there would certianly be
more of an effect if the file was written to disk and didn't stay in
cache.)
Commenting out the updateIncremental calls make the same run in 6.31s.
May be that overhead in the implementation, other than the actual
checksumming, is slowing it down. Eg, MVar access.
(I also tried using 10x larger chunks, which did not change the speed.)
Changing to the P2P protocol broke this, because preseedTmp copies
the local copy of the object to the temp file, and then the P2P transfer
sees the right length file and uses it as-is.
When git-annex-shell is too old and rsync is used, it did verify the
content, and when the local repo does not have the object it did verify the
content.
Checksum as content is received from a remote git-annex repository, rather
than doing it in a second pass.
Not tested at all yet, but I imagine it will work!
Not implemented for any special remotes, and also not implemented for
copies from local remotes. It may be that, for local remotes, it will
suffice to use rsync, rely on its checksumming, and simply return Verified.
(It would still make a checksumming pass when cp is used for COW, I guess.)
See my comment in the next commit for some details about why
Verified needs a hash with preimage resistance. As far as tahoe goes,
it's fully cryptographically secure.
I think that bup could also return Verified. However, the Retriever
interface does not currenly support that.
When a git remote is configured with an absolute path, use that path,
rather than making it relative. If it's configured with a relative path,
use that.
Git.Construct.fromPath changed to preserve the path as-is,
rather than making it absolute. And Annex.new changed to not
convert the path to relative. Instead, Git.CurrentRepo.get
generates a relative path.
A few things that used fromAbsPath unncessarily were changed in passing to
use fromPath instead. I'm seeing fromAbsPath as a security check,
while before it was being used in some cases when the path was
known absolute already. It may be that fromAbsPath is not really needed,
but only git-annex-shell uses it now, and I'm not 100% sure that there's
not some input that would cause a relative path to be used, opening a
security hole, without the security check. So left it as-is.
Test suite passes and strace shows the configured remote url is used
unchanged in the path into it. I can't be 100% sure there's not some code
somewhere that takes an absolute path to the repo and converts it to
relative and uses it, but it seems pretty unlikely that the code paths used
for a git remote would call such code. One place I know of is gitAnnexLink,
but I'm pretty sure that git remotes never deal with annex symlinks. If
that did get called, it generates a path relative to cwd, which would have
been wrong before this change as well, when operating on a remote.
When annex.stalldetection is not enabled, and a likely stall is detected,
display a suggestion to enable it.
Note that the progress meter display is not taken down when displaying
the message, so it will display like this:
0% 8 B 0 B/s
Transfer seems to have stalled. To handle stalling transfers, configure annex.stalldetection
0% 10 B 0 B/s
Although of course if it's really stalled, it will never update
again after the message. Taking down the progress meter and starting
a new one doesn't seem too necessary given how unusual this is,
also this does help show the state it was at when it stalled.
Use of uninterruptibleCancel here is ok, the thread it's canceling
only does STM transactions and sleeps. The annex thread that gets
forked off is separate to avoid it being canceled, so that it
can be joined back at the end.
A module cycle required moving from dupState the precaching of the
remote list. Doing it at startConcurrency should cover all the cases
where the remote list is used in concurrent actions.
This commit was sponsored by Kevin Mueller on Patreon.
annex.stalldetection can now be set to "true" to make git-annex do
automatic stall detection when it detects a remote is updating its transfer
progress consistently enough.
This commit was sponsored by Luke Shumaker on Patreon.
Seems only fair, that, like git runs git-annex, git-annex runs
git-annex-foo.
Implementation relies on O.forwardOptions, so that any options are passed
through to the addon program. Note that this includes options before the
subcommand, eg: git-annex -cx=y foo
Unfortunately, git-annex eats the --help/-h options.
This is because it uses O.hsubparser, which injects that option into each
subcommand. Seems like this should be possible to avoid somehow, to let
commands display their own --help, instead of the dummy one git-annex
displays.
The two step searching mirrors how git works, it makes finding
git-annex-foo fast when "git annex foo" is run, but will also support fuzzy
matching, once findAllAddonCommands gets implemented.
This commit was sponsored by Dr. Land Raider on Patreon.
I suspect this is a bug in cabal sdist, because with
Includes: Utility/libkqueue.h
the file is not included, but putting it in extra-files does
get it into the tarball.
Fix an oddity in matching options and preferred content expressions such as
"foo (bar or baz)", which was incorrectly handled as if it were "(foo or
bar) and baz)" rather than the intended "foo and (bar or baz)"
Seemed like a change to consume should be able to handle this case
better, but I was having trouble writing it that way, so instead added
a separate pass that inserts the implicit ands explicitly. Also added
several test cases to make sure versions with and without explicit ands
generate the same.
Missed this when implementing it because of the default case catching
the new constructor. So, removed that default case to make sure
future types of adjusted branches don't make the same mistake.
Complicated by git-annex addurl --fast which adds the file whose content
is not present, so it needs to stay unlocked when on such a branch.
This commit was sponsored by Brock Spratlen on Patreon.
Fixed that, and made parserLsTree accept the space as well as tab.
Fixes a reversion that made import of a tree from a special remote result in
a merge that deleted files that were not preferred content of that special
remote.
Avoids the smudge --clean filter failing because URL keys do not support
genKey. Instead the modified content will be added using the default
backend.
This commit was sponsored by Jochen Bartl on Patreon.
This avoids the smudge --clean filter failing on the URL keys.
git checkout runs the post-checkout hook, which runs smudge --update.
That populates all the pointer files, but it neglected to store their inode
caches in the keys db. With that done, and the keys db flushed before
smudge --clean gets run (by restagePointerFile), the isUnmodifiedCheap
check can tell the file is not modified, so will not try to re-ingest it,
which does not work with URL keys because they do not support genKey.
It also seems possible that the isUnmodifiedCheap was also failing for
non-URL keys, which would cause them to be re-ingested, leading to a lot of
extra work. I have not verified that, but don't see why it wouldn't have
happened. So this probably also speeds up checking out adjusted branches.
This commit was sponsored by Boyd Stephen Smith Jr. on Patreon.
This is probably a reversion, but not sure what caused it. By the time
Annex.Init runs fixupUnusualReposAfterInit, another git-annex process has
at least sometimes already done the necessary fixups. (Eg, one run
indirectly by a git command.) But since the Repo is cached, it doesn't
realize and does them again. So, avoid crashing when git config --unset
fails.
This commit was sponsored by Jack Hill on Patreon.
Directory special remotes with importtree=yes now avoid unncessary overhead
when inodes of files have changed, as happens whenever a FAT filesystem
gets remounted.
A few unusual edge cases of modifications won't be detected and
imported. I think they're unusual enough not to be a concern. It would
be possible to add a config setting that controls whether to compare
inodes too, but does not seem worth bothering the user about currently.
I chose to continue to use the InodeCache serialization, just with the
inode zeroed. This way, if I later change my mind or make it
configurable, can parse it back to an InodeCache and operate on it. The
overhead of storing a 0 in the content identifier log seems worth it.
There is a one-time cost to this change; all directory special remotes
with importtree=yes will re-hash all files once, and will update the
content identifier logs with zeroed inodes.
This commit was sponsored by Brett Eisenberg on Patreon.
Including the non-standard URI form that git-remote-gcrypt uses for rsync.
Eg, "ook://foo:bar" cannot be parsed because "bar" is not a valid port
number. But git could have a remote with that, it would try to run
git-remote-ook to handle it. So, git-annex has to allow for such things,
rather than crashing.
This commit was sponsored by Luke Shumaker on Patreon.
Since that can lead to data loss, which should never be enabled by an
option other than --force.
I suppose that using --trust was in some situation, safer than --force,
because it doesn't entirely disable checking for data loss, but only
disables checking involving data that is on the specified repository.
But it seems better to be able to say that data loss only happens with
--force.
This commit was sponsored by Graham Spencer on Patreon.
Since unconsidered use of trusted repositories can lead to data loss.
Trusted has always been this way, but it used to be acceptable for
git-annex to be set up so that data could be lost without using --force,
and most or all other ways that can happen have already been eliminated.
This commit was sponsored by Mark Reidenbach on Patreon.
This is conceptually very simple, just making a 1 that was hard coded be
exposed as a config option. The hard part was plumbing all that, and
dealing with complexities like reading it from git attributes at the
same time that numcopies is read.
Behavior change: When numcopies is set to 0, git-annex used to drop
content without requiring any copies. Now to get that (highly unsafe)
behavior, mincopies also needs to be set to 0. It seemed better to
remove that edge case, than complicate mincopies by ignoring it when
numcopies is 0.
This commit was sponsored by Denis Dzyubenko on Patreon.
It got broken in several ways by the streaming seeking optimisations
around version 8.20201007.
Moved time limit checking out of the matcher, which was a hack in the
first place. So everywhere that uses Limit.getMatcher needs to check
time limit. Well, almost everywhere. Command.Info uses it, but it does
not make sense to time limit getting info. And Command.MultiCast uses it
just to build up a list of files that then get passed to a command, so
it would never have hit the timeout in a useful way.
This implementation is a little more expensive when at time limit than
necessary, since it continues seeking only to discard everything after the
time limit. I did try making it close the file handles to force a faster
shutdown, but that didn't work and hung. Could certianly be improved
somehow, but seeking is probably not the expensive bit when a time limit
is hit, so this seems acceptable for now.
* add: Significantly speed up adding lots of non-large files to git,
by disabling the annex smudge filter when running git add.
* add --force-small: Run git add rather than updating the index itself,
so any other smudge filters than the annex one that may be enabled will
be used.
And vice-versa, but it's better to use '/' for portability.
Notably, standardPreferredContent contains "archive/*" and that might not
match if the filename ends up coming in with the slashes the other way
around.
I do think this was a reversion, but I have not tracked back to what
version. While involving the remote config, it's not the same class of
problems that I kept having to chase down for a while after the remote
config parser reworking.
This is common in some feeds, which might mix some items with enclosures,
with others that link to posts or whatever. Before this, it would try to
use youtube-dl and fail, or if youtube-dl was not allowed, it would
incorrectly complain that an url was supported by youtube-dl.
MatchingKey is not the thing to use when matching on actual worktreee
files.
Fix reversion in 8.20201116 that made include= and exclude= in
preferred/required content expressions match a path relative to the current
directory, rather than the path from the top of the repository.
* Guard against running in a repo where annex.uuid is set but
annex.version is set, or vice-versa.
* Avoid autoinit when a repo does not have annex.version or annex.uuid
set, but has a git-annex objects directory, suggesting it was used
by git-annex before.
There was no particular reason not to support this, other than maybe a lack
of a use case. One use case would of course be a remote that you want to
avoid overwriting content on. A new use case is the idea of importing from
backups, eg borg, where exporting is not necessarily supported at all.
This commit was sponsored by Brock Spratlen on Patreon.
New config annex.stalldetection, remote.name.annex-stalldetection, which
can be used to deal with remotes that stall during transfers, or are
sometimes too slow to want to use.
This commit was sponsored by Luke Shumaker on Patreon.
Avoid spurious "verification of content failed" message when downloading
content from a ssh or tor remote fails due to the remote no longer having a
copy of the content.
The P2P protocol already handled this case by sending DATA 0, followed by
VALID. But VALID was not really right, because the data is not the
requested data. So, send DATA 0, followed by INVALID. Old versions of
git-annex handle INVALID the same as VALID in this case. Now new versions
avoid displaying an incorrect message.
It would be better for the P2P protocol to have a different way to indicate
this, like perhaps sending INVALID without DATA. But that would be a
breaking change and need a new protocol verison. Since INVALID already is
part of the protocol and already needs to be handled, using it for this
special case too seems ok, and avoids the complication of another protocol
version.
This commit was sponsored by Jochen Bartl on Patreon.
This is an edge case, which happened to be triggered by the P2P protocol
seeing DATA 0. When reading 0 bytes, getting an empty string does
not mean the handle has reached EOF.
I verified there was in fact a bug, where get of an empty file followed
by another file would get the empty file and then fail
with "handle is closed". This fixes it.
This commit was sponsored by Boyd Stephen Smith Jr. on Patreon.
Reversion introduced in version 8.20201007, one release after the 1st
release with the extension.
Surprisingly, hClose can hang if another thread is reading from the
handle. This is because it uses takeMVar.
The use of cancel here does mean that, if receiveMessageAddonProcess
or Remote.External.AsyncExtension.receiveloop allocated some resource in
a non-async-exception safe way, they might not get a chance to clean it up.
They do not appear to, and anyway, this only happens when git-annex is
shutting down, so any recource that did leak would not be a problem.
This commit was sponsored by Boyd Stephen Smith Jr. on Patreon.
Only displaying git-annex and git command output when something went wrong.
A few could still leak stderr. These include the couple of calls
to readProcess, which reads stdin but lets stderr through. But they don't
leak any usually, so probably only would when failing anyway.
Currently, there is no excess output at all!
This commit was sponsored by Brock Spratlen on Patreon.
Fix a bug that could make resuming a download from the web fail when the
entire content of the file is actually already present locally.
What a mess that Request can throw exceptions or not, depending on how
it's configured. Makes it very hard if you need to handle some specific
http status codes in a function like this! Implementing everything two
ways did not seem appealing, if possible at all, so I decided to
override the Request if it did come configured to throw exception on
non-2xx http status. Other exceptions, like from http-client-restricted,
or due to a redirect to a non-http url, still get thrown.
This commit was sponsored by Luke Shumaker on Patreon.
Prevent windows assistant from trying (and failing) to upgrade itself,
which has never been supported on windows.
The new windows build is made with UPGRADE_LOCATION set, which enabled this
code path that had never run on windows before, and doesn't work. I don't
want to try to support self-upgrade on windows, or generally on other OS's
than the ones where its working, so added a check for that. This way the
build can keep setting UPGRADE_LOCATION and if some later git-annex does
learn how to upgrade itself on some OS, it won't need changing the build
setup.
Added annex.adjustedbranchrefresh git config to update adjusted branches
set up by git-annex adjust --unlock-present/--hide-missing.
Note, in a few cases, I was not able to make the adjusted branch
be updated in calls to moveAnnex, because information about what
file corresponds to a key is not available. They are:
* If two files point to one file, then eg, `git annex get foo` will
update the branch to unlock foo, but will not unlock bar, because it
does not know about it. Might be fixable by making `git annex get
bar` do something besides skipping bar?
* git-annex-shell recvkey likewise (so sends over ssh from old versions
of git-annex)
* git-annex setkey
* git-annex transferkey if the user does not use --file
* git-annex multicast sends keys with no associated file info
Doing a single full refresh at the end, after any incremental refresh,
will deal with those edge cases.
An --unlock-present branch reverses back to a branch where
all files that get modified or renamed become locked, even if they were
originally unlocked. This is the same that reversing a --unlock branch
works, and the new name makes that commonality more clear.
Like --hide-missing the branch does not get updated when content
availability changes.
Seems to basically work, but sync does not update it yet.
Also, when a file is present and so unlocked, git mv followed by
git-annex sync results in the basis branch being updated to contain the
file with the new name, unlocked. This seems different than what
happens in an adjusted unlocked branch, where the commit propigates back
locked. Probably the reverse adjustment code needs to be improved to
handle this case.
Note that, the way the SeekInput parser is written to support batch mode,
it's actually possible to do git-annex examinekey
"SHA1--foo foo.tar.gz" --migrate-to-backend=SHA1E
While that might be kind of useful to support multiple migrations not using
batch mode, I have not documented it. It would be better to take pairs of
key and file in that case.
Warn when adding a annex symlink or pointer file that uses a key that is
not known to the repository, to prevent confusion if the user has copied it
from some other repository.
This commit was sponsored by Jake Vosloo on Patreon.
This fixes a bug where a file that was not preferred content could be
transferred to a remote. This happened when the file got deleted after
the sync started running.
The only time checkMatcher is run without a Key is in calls to
checkFileMatcher, which are only done by add, addurl, import, and
smudge --clean. Those won't be affected by this kind of race. Anything
else that might be precaching and have a similar race as sync will also
be fixed, but I don't know if it actually affected anything other than
sync.
As well as fixing a bug, this also probably makes sync and --auto faster
by avoiding the redundant key lookup.
This commit was sponsored by Graham Spencer on Patreon.
Make sure to relay any remaining stderr from the process after it has
shut down, rather than closing stderr just before shutdown. This avoids
a situation where the process is still running and tries to write to
stderr, getting a SIGPIPE. And, it ensures that no stderr output is
lost.
This may fix a problem encountered by datalad on windows, where it hangs
during the external special remote shutdown.
Before commit a49d300545, it closed stdin
and stdout, but left stderr open, and never killed the stderr waiter
thread, which presumably exited on its own. For async exception
safety, do need to at make sure that thread gets waited on, as that
commit does, but it introduced this problem.
Note that, the process's stdout is closed before waiting on it. It's too
late for anything it writes to stdout to be processed, and since we're
not going to consume any such writes, this avoids the process getting
blocked writing to stdout due to us not reading what it's buffered. This
does mean that if the process writes to stdout too late, it will get a
SIGPIPE. (This was already the case before the above-mentioned commit.)
In practice, I think only the protocol's ERROR is allowed to be
sent at a point where this could happen.
Because it's a special character on Windows ("c:").
Use same technique already used for '/' and '\'.
I didn't record how I generated their encoded forms before, so am sure
there was a better way, but the way I did it now is to look at
ghci> encodeFilePath "∕"
"\226\136\149"
And then the difference from that to "\56546\56456\56469"
is adding 56320 to each, to get up to the escaped code plane.
See comment for why I think handling ':' is ok, but that other illegal
windows filenames won't. Note that, this should be enough to make the
test suite always work. Other windows illegal filenames will fail at
checkout time when it tries to put the illegal filename on the
filesystem.
This is a cleaner build than on Jenkins because the whole environment setup
is handled by the CI config, at least up to the point of "get a random bag
of Windows bytes".
Also, the Jenkins autobuilder has been intermittently failing for a long
time, not due to any problem with git-annex but just a failure to clean up
directories.
Also, this build runs the test suite, and it is (mostly) passing. Test
suite always failed in the jenkins environment.
Also, this build includes libmagic.
Here is the build workflow used by github actions:
https://github.com/datalad/datalad-extensions/blob/master/.github/workflows/build-git-annex-windows.yaml
The libmagic build has its own workflow:
https://github.com/datalad/file-windows/blob/master/.github/workflows/build.yml
(Also cleaned up some windows build cruft I don't use anymore.)
There is no build-version file to link to. I've opened a todo requesting
one: https://github.com/datalad/datalad-extensions/issues/55
This fixes the bug.
Note, it's only done when GIT_DIR is set. When it's not set,
Git.Construct already handled it. This is why it was only noticed with this
git submodule command.
This commit was sponsored by Brett Eisenberg on Patreon.
When a special remote has chunking enabled, but no chunk sizes are
recorded (or the recorded ones are not found), speculatively try chunks
using the configured chunk size.
This makes eg, git-annex fsck --from remote be able to fix up the
location log of a file that the git-annex branch does not indicate is
stored on the remote.
Note that fsck does *not* fix up the chunk log to indicate the chunk
size. So, changing the chunk config of the remote after that will still
prevent accessing the chunks stored on it. Maybe fsck should, but I
wanted to start with this and see if it's needed.
In cases where numcopies checks prevented the resumed move from dropping
the object from the source repository, it now relies on a log of recent
moves to replicate the behavior of the interrupted command.
Performance: Probably noticable impact, since it has to add to the log,
check the log, and remove from the log. Seems worth it to avoid this
annoying edge case. The log functions are pretty well optimised to avoid
unncessary work.
An performance improvement to make later would be to avoid cleanup doing
anything if it's not written to the log file, and has confirmed that the
log file does not contain the log line.
This commit was sponsored by Jake Vosloo on Patreon.
The problem was this line:
cleanup = and <$> sequence (map snd v)
That caused all of v to be held onto until the end, when the cleanup action
was run.
I could not seem to find a bang pattern that avoided the leak, so I
resorted to a IORef, rather clunky, but not a performance problem because
it will only be written once per git ls-files, so typically just 1 time.
This commit was sponsored by Mark Reidenbach on Patreon.
inet_addr was removed, but all this needs is localhost, so hardcoding it
should work fine.
It may be that this windows ifdef is no longer needed. It was added in 2013
with a note that getAddrInfo didn't work on windows, but it seems likely
such a problem would have been fixed since.
Only turning it off when the criterion library is not installed.
Not enabled for osx or i386ancient yet since that will need some
invesitgation to update their respective stack.yaml files.
Broke a while ago during optimisation work, and not noticed since the flag
is disabled by default.
This commit was sponsored by Brock Spratlen on Patreon.
It seemed best to do this, for consistency with every other way files can
get into a git-annex repo. Although it's just a bit strange that a local
.gitignore file affects the pseudo-commits made for the remote that's
imported from.
This commit was sponsored by Brett Eisenberg on Patreon.
Which lets progress be displayed when doing concurrent downloads.
Amoung other things, like --json-progress etc.
The youtube-dl output is no longer displayed, except for any errors.
This commit was sponsored by Denis Dzyubenko on Patreon.
Ensure that checkCanAdd is used everywhere a file is added to git,
so git add is run with -f, presumably avoiding the work it would usually
do to check ignores.
sync: When run without --content, import without copying from
importtree=yes directory special remotes. (Other special remotes may
support this later as well.)
This commit was sponsored by Svenne Krap on Patreon.
Import small files into git, the same as is done when importing with content.
Which means, for small files, --no-content does download them.
If the largefiles expression needs the file content available
(due to mimetype or mimeencoding being used), the import will fail.
This commit was sponsored by Jake Vosloo on Patreon.
Sped up seeking for files to operate on, when using options like --copies
or --in, by around 20%.
Benchmark showed an increase for --copies from 155 seconds to 121
seconds, and --in remote will be similar to that.
For --in here, the speedup was less, 5-10% or so.
(both warm cache)
This commit was sponsored by Jack Hill on Patreon.
Sped up seeking to around twice as fast, by avoiding a pass over the
worktree files when preferred content expressions of the local repo and
remotes don't use include=/exclude=.
Thanks to Lukey for identifying the optimisation.
This commit was sponsored by Brock Spratlen on Patreon.
getPid returns Nothing if the process has already been stopped, and in that
case, the pid will not be displayed. I think that would only happen if
waitForProcess or similar gets called more than once on the same process
handle though.
getPid on unix has an overhead of only a MVar read. On Windows it needs to
make a syscall, so will be probably more expensive. While the added expense
happens even when debug logging is disabled, it should be small enough
compared with the overhead of starting a process that it's not a problem.
(It does occur to me that a debugM that took an IO String could only run it
when debugging is really enabled, which would improve performance. It does
not seem possible to use the current hslogger interface to do that though;
it does not expose the information that would be needed.)
With some hints for the user for what to do.
Took care to avoid changing the json output. It would have been ok to add
the new separated lists to it, in addition to the old list, but I didn't
do that because I didn't see much point.
add, addurl, importfeed, import: Added --no-check-gitignore option
for finer grained control than using --force.
(--force is used for too many different things, and at least one
of these also uses it for something else. I would like to reduce
--force's footprint until it only forces drops or a few other data
losses. For now, --force still disables checking ignores too.)
addunused: Don't check .gitignores when adding files. This is a behavior
change, but I justify it by analogy with git add of a gitignored file
adding it, asking to add all unused files back should add them all back,
not skip some. The old behavior was surprising.
In Command.Lock and Command.ReKey, CheckGitIgnore False does not change
behavior, it only makes explicit what is done. Since these commands are run
on annexed files, the file is already checked into git, so git add won't
check ignores.
--batch combined with -J now runs batch requests concurrently for many
commands. Before, the combination was accepted, but did not enable
concurrency. Since the output of batch requests can be in any order, --json
with the new "input" field is recommended to be used, to determine which
batch request each response corresponds to.
If --json is not used, batch mode still runs concurrently, using the usual
concurrent-output. That will not be very useful for most batch mode users,
probably, but who knows.
If a program was using --batch -J before, and was parsing non-json output,
this could break it. But, it was relying on git-annex not supporting
concurrency despite it being enabled, so it should have expected concurrent
output. So, I think that's ok.
annex.jobs does not enable concurrency in --batch mode, because that would
confuse programs that use --batch but don't expect concurrency.
The use case of this field is mostly to support -J combined with --json.
When that is implemented, a user will be able to look at the field to
determine which of the requests they have sent it corresponds to.
The field typically has a single value in its list, but in some cases
mutliple values (eg 2 command-line params) are combined together and the
list will have more.
Note that json parsing was already non-strict, so old git-annex metadata
--json --batch can be fed json produced by the new git-annex and will
not stumble over the new field.
No behavior changes (hopefully), just adding SeekInput and plumbing it
through to the JSON display code for later use.
Over the course of 2 grueling days.
withFilesNotInGit reimplemented in terms of seekHelper
should be the only possible behavior change. It seems to test as
behaving the same.
Note that seekHelper dummies up the SeekInput in the case where
segmentPaths' gives up on sorting the expanded paths because there are
too many input paths. When SeekInput later gets exposed as a json field,
that will result in it being a little bit wrong in the case where
100 or more paths are passed to a git-annex command. I think this is a
subtle enough problem to not matter. If it does turn out to be a
problem, fixing it would require splitting up the input
parameters into groups of < 100, which would make git ls-files run
perhaps more than is necessary. May want to revisit this, because that
fix seems fairly low-impact.
So these special remotes are always supported.
IIRC these build flags were added because the dep chains were a bit too
long, or perhaps because the libraries were not available in Debian stable,
or something like that. That was long ago, those reasons no longer apply,
and users get confused when builtin special remotes are not available, so
it seems best to remove the build flags now.
If this does cause a problem it can be reverted of course..
This commit was sponsored by Jochen Bartl on Patreon.
Works better with automatic merge conflict resolution than git's ususual
default of "conflict".
This is not done when automatic merge conflict resolution is disabled.
This commit was sponsored by Mark Reidenbach on Patreon.
This case was handled by cleanConflictCruft, but only when the annexed
file's object was present. When not present, it left the annexed file
with the original name, not checked into git, while adding the variant
file. So, add an explicit deletion of the deleted file in this case.
My specific case where this happened actually involves
merge.directoryRenames=conflict. After a merge involving that,
the situation was the file appears as "added by them", because that
caused the file that they added to be moved into a directory we renamed.
That case is the same as them adding a modified version of the file,
while we deleted it. (Except for the history of the file, since it's a
new file, but this doesn't look at history.)
This commit was sponsored by Boyd Stephen Smith Jr. on Patreon.
One reason is, 5 is an arbitrary number so ought to be configurable.
The real reason though, is I wanted to make the man page explain when
forward retry can override annex.retry, and having a config made the
man page easier to write.
Fixed several cases where files were created without file mode bits that
the umask would usually set. This included exports to the directory special
remote, torrent files used by the bittorrent special remote, hooks written
by git-annex init, and some log files in .git/annex/
Audited all calls, looking for ones that didn't want the umask bits to be
set. All such turned out to already set the specific restrictive file mode
they wanted.
"http" was too generic and easy to confuse with web. The new name makes
clear it's used in addition to some other remote. And other protocols
can use the same naming scheme.
They normally shutdown when the GNUPGHOME directory is deleted, but on
NFS they keep the directory from being deleted. And also, this avoids
a number of them piling up while the test suite is running.
Fixes reversion in 8.20200617 that made annex.pidlock being enabled result
in some commands stalling, particularly those needing to autoinit.
Renamed runsGitAnnexChildProcess to make clearer where it should be
used.
Arguably, it would be better to have a way to make any process git-annex
runs have the env var set. But then it would need to take the pid lock
when running any and all processes, and that would be a problem when
git-annex runs two processes concurrently. So, I'm left doing it ad-hoc
in places where git-annex really does run a child process, directly
or indirectly via a particular git command.
addurl: Fix reversion in 7.20190322 that made --file not be honored when
youtube-dl was used to download media.
8758f9c561 was on the right track, but missed that | otherwise prevented
the code it added from being used.
Also, refactored out a common function.
This commit was sponsored by Graham Spencer on Patreon.
I'm sure this used to work, but somewhere along the line something or
things (getCost and getAvailability I think, probably others)
started catching the exception and not displaying it. So, show warnings.
Avoid complaining that a file with "is beyond a symbolic link" when the
filepath is absolute and the symlink in question is not actually inside the
git repository.
This assumes that inodes remain stable while the command is running.
I think they always will, the filesystems where they are unstable change
them across mounts. (If inodes were not stable, it would just complain about
symlinks in the path that are not inside the working tree.)
(On windows, I don't want to assume anything about inodes, they could be
random numbers for all I know. But if they were, this would still be ok, as
long as windows doesn't have symlinks that are detected by isSymbolicLink.
Which seems a fair bet.)
sanitizeFilePath was changed to sanitize leading '.', but ImportFeed was
running it on parts of the template. So eg the leading '.' in the extension
got sanitized.
Note the added case for sanitizeLeadingFilePathCharacter ('/':_)
-- this was added because, if the template is title/episode and the title
is not set, it would expand to "/episode". So this is another potential
security fix.
Reduce the number of directories listed in libdirs, which makes the linker
check a lot less dead ends looking for directories.
Eliminated some directories that didn't really contain shared libraries,
or only contained the linker.
That left only 2, one in lib and one in usr/lib, so consolidate those two.
Doing it this way, rather than just consolidating all libs that might exist
into a single directory means that, if there are optimised versions of some
libs, eg in lib/subarch/foo.so, and lib/subarch2/foo.so, they don't get
moved around in a way that would make the linker pick the wrong one.
Sped up seeking files to drop by 2x, and also some performance
improvements to checking numcopies.
Interestingly, the seek speedup is not due to precaching, but I think is
due to calling getParsed earlier.
Annex.Drop had to be changed to check inAnnex there, since it was removed
from Command.Drop. All other users of Command.Drop already checked inAnnex
themselves.
This commit was sponsored by Ryan Newton on Patreon.
And add a test case for that.
This certianly loses some of the 2x performance improvement in file
seeking that seekFilteredKeys led to, because now it has to stat the
worktree files again. Without benchmarking, I expect there will still be
a sizable improvement, and also the git-annex branch precaching that
seekFilteredKeys can do will still be a win of its approach.
Also worth noting that lookupKey, when the file DNE, check if it's in an
adjusted branch with hidden files, and if so, finds the key for the
file anyway. That was intended to make git-annex sync --content be able
to process those files, but a side effect was that, when a file was
deleted but the deletion not yet staged, git-annex commands used to
still list it. That was actually a bug. This commit fixes that bug too.
(git-annex sync --content on such a branch does not use seekFilteredKeys
so was not affected by the reversion or by this behavior change)
This commit was sponsored by Jake Vosloo on Patreon.
This was a bit disappointing, I was hoping for a 2x speedup. But, I think
the metadata lookup is wasting a lot of time and also needs to be made to
stream.
The changes to catObjectStreamLsTree were benchmarked to not also speed
up --all around 3% more. Seems I managed to make it polymorphic after all.
Turns out the %(rest) trick was not needed. Instead, just maintain a
list of files we've asked for, and each cat-file response is for the
next file in the list.
This actually benchmarks 25% faster than before! Very surprising, but it
must be due to needing to shove less data through the pipe, and parse
less.
This assumes that no location log files will have a newline or carriage
return in their name. catObjectStream skips any such files due to
cat-file not supporting them.
Keys have been prevented from containing newlines since 2011,
commit 480495beb4. If some old repo
had a key with a newline in it, --all will just skip processing that key.
Other things, like .git/annex/unused files certianly assume no newlines in
keys too, and AFAICR, such keys never actually worked.
Carriage return is escaped by preSanitizeKeyName since 2013. WORM keys
generated before that point could perhaps contain a CR. (URL probably not,
http probably doesn't support an URL with a raw CR in it.) So, added
a warning in fsck about such keys. Although, fsck --all will naturally
skip them, so won't be able to warn about them. Not entirely
satisfactory, but I'll bet there are not really any such keys in
existence.
Thanks to Lukey for finding this optimisation.
The cache was removed way back in 2012,
commit 3417c55189
Then I forgot I had removed it! I remember clearly multiple times when I
thought, "this reads the same data twice, but the cache will avoid that
being very expensive".
The reason it was removed was it messed up the assistant noticing when
other processes made changes. That same kind of problem has recently
been addressed when adding the optimisation to avoid reading the journal
unnecessarily.
Indeed, enableInteractiveJournalAccess is run in just the
right places, so can just piggyback on it to know when it's not safe
to use the cache.
adb shell has sha256sum sha1sum and some others, so they could be used.
They're provided by toybox, so seem about as likely to keep
working as find and stat, which it already depends on.
Or to not add a dep, could use stat the same as getExportContentIdentifier
to get a mtime, and make a WORM key. But do I really want this to
default to WORM?
Unsure what's the best path, so punting for now.
Only supported by some special remotes: directory
I need to check the rest and they're currently missing methods until I do.
git-annex sync --no-content does not yet use this to do imports
Always run Git.Config.store, so when the git config gets reloaded,
the override gets re-added to it, and changeGitRepo then calls extractGitConfig
on it and sees the annex.* settings from the override.
Remove any prior occurance of -c v and add it to the end. This way,
-c foo=1 -c foo=2 -c foo=1 will pass -c foo=1 to git, rather than -c foo=2
Note that, if git had some multiline config that got built up by
multiple -c's, this would not work still. But it never worked because
before the bug got fixed in the first place, the -c value was repeated
many times, so the multivalue thing would have been wrong. I don't think
-c can be used with multiline configs anyway, though git-config does
talk about them?
Trivial since git-annex cannot remove, but do an active checkKey verification
anyway, in case the data was lost somehow.
This commit was sponsored by Ryan Newton on Patreon.
Made several special remotes support locking content on them while
dropping, which allows dropping from another special remote when the
content will only remain on a special remote of these types.
In both cases, verify the content is present actively, because it's
certianly possible for things other than git-annex to have removed it.
Worth thinking about what to do if at some later point, git-lfs gains
support for dropping content, and a content locking operation.
That would probably need a transition; first would need to make lockContent
use the locking operation. Then, once enough time had passed that we can
assume any git-annex operating on the git-lfs remote had that change,
git-annex could finally allow dropping from git-lfs.
Or, it could be that git-lfs gains support for dropping content, but not
locking it. In that case, it seems this commit would need to be reverted,
and then wait long enough for that git-annex to be everywhere, and only
then can git-annex safely support dropping from git-lfs.
So, the assumption made in this commit could lead to bother later.. But I
think it's actually highly unlikely git-lfs does ever support dropping;
it's outside their centralized model. Probably. :) Worth keeping in mind as
the same assumption is made about other special remotes though.
This commit was sponsored by Ethan Aubin.
git is making that configurable, and configuring it globally would break
the test suite in a few places.
No other part of git-annex assumes any branch name. Renamed a few
placeholders to make that clearer.
This commit was sponsored by Jake Vosloo on Patreon.
Otherwise use the vendored copy as before.
The library is in Debian testing but not stable. Once it reaches
stable, the vendored copy can be removed.
Did not add it to debian/control because IIRC that's used to build
git-annex on stable too, possibly. However, the Debian maintainer will
probably want to make the package depend on libghc-http-client-restricted-dev
This commit was sponsored by Ilya Shlyakhter on Patreon.
Otherwise use the vendored copy as before.
The library is in Debian testing but not stable. Once it reaches
stable, the vendored copy can be removed.
Did not add it to debian/control because IIRC that's used to build
git-annex on stable too, possibly. However, the Debian maintainer will
probably want to make the package depend on libghc-git-lfs-dev.
This commit was sponsored by Ilya Shlyakhter on Patreon.
Fix a deadlock that could occur after git-annex got an unlocked file,
causing the command to hang indefinitely.
Known to happen on vfat filesystems, possibly others.
Note that a deadlock is still theoretically possible, if anything
smudge --clean does causes it to run the git queue for some other
reason.
Apparently that doesn't happen, but will need to keep an eye on it.
That made eg git-annex get of an unlocked file hang until the
annex.pidlocktimeout and then fail.
This fix should be fully thread safe no matter what else git-annex is
doing.
Only using runsGitAnnexChildProcess in the one place it's known to be a
problem. Could audit for all places where git-annex runs itself as a child
and add it to all of them, later.
Fix bug that made creds not be stored in git when a special remote was
initialized with gpg encryption, but without an explicit embedcreds=yes.
(Yet nother regression introduced in version 7.20200202.7. 5th so far.)
This makes the creds get saved, since only things recorded there will be
saved.
IIRC, unparsedRemoteConfig was not originally available when I
implemented this; now that it is things get a bit simpler.
More could probably be simplified, is externalConfigChanges needed at
all?
This does not entirely fix the bugs though, because creds are only
embedded when embedcreds=yes, but not when encryption=pubkey is used
without embedcreds=yes.
checkpresentkey: When no remote is specified, try all remotes, not only
ones that the location log says contain the key. This is what the
documentation has always said it did.
Still try the logged remotes first, because they are far more likely to
have the key.
* Improve display of problems auto-initializing or upgrading local git
remotes.
* When a local git remote cannot be initialized because it has no
git-annex branch or a .noannex file, avoid displaying a message about it.
So stop documenting it, and stop offering it as a choice in the assistant.
Removed the code that parses it into S3.ReducedRedundancy, because
S3.OtherStorageClass with the value will work just the same and avoids a
special case for a deprecated this.
The ContentIdentifier can contain almost anything, so could have characters
that are not fit for the filesystem, or might be longer than a key usually
is, or contain a newline, or .... genKeyName deals with those problems.
This should not present a back-compat issue, because this is a temporary
key used while downloading the imported file, before the real key for it
can be generated.
Since an external process can be in the middle of some operation when an
async exception is received, it has to be shut down then. Using
cleanupProcess will close its IO handles and send it a SIGTERM.
If a special remote choses to catch SIGTERM, it's fine for it to do some
cleanup then, but until it finishes, git-annex will be blocked waiting
for it. If a special remote blocked SIGTERM, it would cause a hang.
Mentioned in docs.
Also, in passing, fixed a FD leak, it was not closing the error handle
when shutting down the external. In practice that didn't matter before because
it was only run when git-annex was itself shutting down, but now that it
can run on exception, it would have been a problem.
Added annex.skipunknown git config, that can be set to false to change the
behavior of commands like `git annex get foo*`, to not skip over files/dirs
that are not checked into git and are explicitly listed in the command
line.
Significant complexity was needed to handle git-annex add, which uses some
git ls-files calls, but needs to not use --error-unmatch because of course
the files are not known to git.
annex.skipunknown is planned to change to default to false in a
git-annex release in early 2022. There's a todo for that.
Try to enable special remotes configured with autoenable=yes when git-annex
auto-initialization happens in a new clone of an existing repo. Previously,
git-annex init had to be explicitly run to enable them. That was a bit of a
wart of a special case for users to need to keep in mind.
Special remotes cannot display anything when autoenabled this way, to avoid
interfering with the output of git-annex query commands.
Any error messages will be hidden, and if it fails, nothing is displayed.
The user will realize the remote isn't enable when they try to use it,
and can run git-annex init manually then to try the autoenable again and
see what failed.
That seems like a reasonable approach, and it's less complicated than
communicating something across a pipe in order to display it as a side
message. Other reason not to do that is that, if the first command the
user runs is one like git-annex find that has machine readable output,
any message about autoenable failing would need to not be displayed anyway.
So better to not display a failure message ever, for consistency.
(Had to split out Remote.List.Util to avoid an import cycle.)
Fix a crash or potentially not all files being exported when sync -J
--content is used with an export remote.
Crash as described in fixed bug report.
waitForAllRunningCommandActions inserted in several points where all the
commandActions started before need to have finished before moving on to
the next stage of the export. A race across those points could have
maybe resulted in not all files being exported, or a wrong tree being
export.
For example, changeExport starting up an action like
a rename of A to B. Then, with that action still running, fillExport
uploading a new A, *before* the rename occurred. That race seems
unlikely to have happened. There are some other ones that this also
fixes.
move --to, copy --to, mirror --to: When concurrency is enabled, run cleanup
actions in separate job pool from uploads.
transferStages was confusingly named, it's only useful when doing downloads
as then the verify actions can be run concurrently with other downloads.
For commands that upload, there will be more concurrency from running
cleanup actions in a separate job pool.
As for sync, I left it using downloadStages although that's not optimal
for the part of a sync that uploads. Perhaps it should use the union of
both?
Already supported --json, but not that.
Also checked all other commands that only support --json, and the only
other one that does transfers is fsck (--from), which it did not seem worth
adding --json-progress to really.
Fix bug that made enableremote of S3 and webdav remotes, that have
embedcreds=yes, fail to set up the embedded creds, so accessing the remotes
failed.
(Regression introduced in version 7.20200202.7 in when reworking all the
remote configs to be parsed.)
Root problem is that parseEncryptionConfig excludes all other config keys
except encryption ones, so it is then unable to find the
credPairRemoteField. And since that field is not required to be
present, it proceeds as if it's not, rather than failing in any visible
way.
This causes it to not find any creds, and so it does not cache
them. When when the S3 remote tries to make a S3 connection, it finds no
creds, so assumes it's being used in no-creds mode, and tries to find a
public url. With no public url available, it fails, but the failure doesn't
say a lack of creds is the problem.
Fix is to provide setRemoteCredPair with a ParsedRemoteConfig, so the full
set of configs of the remote can be parsed. A bit annoying to need to
parse the remote config before the full config (as returned by
setRemoteCredPair) is available, but this avoids the problem.
I assume webdav also had the problem by inspection, but didn't try to
reproduce it with it.
Also, getRemoteCredPair used getRemoteConfigValue to get a ProposedAccepted
String, but that does not seem right. Now that it runs that code, it
crashed saying it had just a String.
Remotes that have already been enableremoted, and so lack the cached creds
file will work after this fix, because getRemoteCredPair will extract
the creds from the remote config, writing the missing file.
This commit was sponsored by Ilya Shlyakhter on Patreon.
One way this can be used is to remove all urls for some website that went
away:
git-annex whereis --format '${file} ${url}\0' | \
grep -z whatever.com | git-annex rmurl --batch -z
Combining ${url} and ${uuid} is a bit of a combinatorial explosion.
It didn't seem worth only outputting a uuid alongside an url belonging
to it, so each uuid is output beside each url.
When storing content on remote fails, always display a reason why.
Since the Storer used by special remotes already did, this mostly affects
git remotes, but not entirely. For example, if git-lfs failed to connect to
the endpoint, it used to silently return False.
This relicates git's behavior. It adds a few stat calls for the command
line parameters, so there is some minor slowdown, but even with thousands
of parameters it will not be very noticable, and git does the same statting
in similar circumstances.
Note that this does not prevent eg "git annex add symlink"; the symlink
will be added to git as usual. And "git annex find symlink" will silently
list nothing as well. It's only "symlink/foo" or "subdir/symlink/foo" that
triggers the warning.
* addurl --preserve-filename: New option, uses server-provided filename
without any sanitization, but with some security checking.
Not yet implemented for remotes other than the web.
* addurl, importfeed: Avoid adding filenames with leading '.', instead
it will be replaced with '_'.
This might be considered a security fix, but a CVE seems unwattanted.
It was possible for addurl to create a dotfile, which could change
behavior of some program. It was also possible for a web server to say
the file name was ".git" or "foo/.git". That would not overrwrite the
.git directory, but would cause addurl to fail; of course git won't
add "foo/.git".
sanitizeFilePath is too opinionated to remain in Utility, so moved it.
The changes to mkSafeFilePath are because it used sanitizeFilePath.
In particular:
isDrive will never succeed, because "c:" gets munged to "c_"
".." gets sanitized now
".git" gets sanitized now
It will never be null, because sanitizeFilePath keeps the length
the same, and splitDirectories never returns a null path.
Also, on the off chance a web server suggests a filename of "",
ignore that, rather than trying to save to such a filename, which would
fail in some way.
git-lfs repos that encrypt the annexed content but not the git repo only
need --force passed to initremote, allow enableremote and autoenable of
such remotes without forcing again.
Needing --force again particularly made autoenable of such a repo not work.
And once such a repo has been set up, it seems a second --force when
enabling it elsewhere has little added value. It does tell the user about
the possibly insecure configuration, but if the git repo has already been
pushed to that remote in the clear, data has already been exposed. The goal
of that --force was not to prevent every situation where such an exposure
can happen -- anyone who sets up a public git repo and pushes to it will
expose things similarly and git-annex is not involved. Instead, the purpose
of the --force is to point out to the user that they're asking for a
configuration where encryption is inconsistently applied.
To use S3 Signature Version 4. Some S3 services seem to require v4, while
others may only support v2, which remains the default.
I'm also not sure if v4 works correctly in all cases, there is this
upstream bug report: https://github.com/aristidb/aws/issues/262
I've only tested it against the default S3 endpoint.
37b42e72e7 made it catch exceptions but
thought they were unlikely to be useful to display, which may be right when
a git command fails, but not in the case yoh found.
Now the warning gets displayed, which is better than an arcane git error.
The warning is still kind of ugly, especially when the pull later in the
sync will clear up what it warns about. But, this is an unusual situation
not likely to happen, and if there is no remote to pull from, the warning
message is needed or the sync will seem to succeed despite not merging the
synced master branch.
Would still be better if it could merge the synced master branch in this
situation, making an empty commit to master to do it seems wrong, and
otherwise it would need a whole separate code path, and would bypass using
git merge in favor of say, setting master to the syned branch. Which would
bypass git configs like arguably merge.ff and certianly
merge.verifySignatures. So don't want to do that.
User reported git@my.gitlab.foo:username/myrepo.git didn't work with
git-repair, because it rewrites it to an url
ssh://git@my.gitlab.foo/~/username/myrepo.git
and the /~/ was not something the hosting site supported.
Since git-annex still generally needs the repo url to be well, an url, did
not change the conversion code. But in this case, we're running git fetch,
so we might as well pass it the remote name rather than the url.
Did a quick audit of repoLocation uses to see if there was anything else
like this problem elsewhere, and didn't see any. But this is not the first
time this special case in git and git-annex's attempt to de-special-case
it has caused a problem..
* Display a warning message when a remote uses a protocol, such as
git://, that git-annex does not support. Silently skipping such a
remote was confusing behavior.
It sets annex-ignore, so the warning is only displayed once.
* Also display a warning message when a remote, without a known uuid,
is located in a directory that does not currently exist, to avoid
silently skipping such a remote.
This is a bit more debatable, since git-annex get will say,
try making repository available. And since it does not set annex-ignore,
the warning will be displayed repeatedly. It's also an extreme edge case,
I don't think I've ever seen it happen in real life.
Due to eg, too long a path to the agent socket, caused by running gpg in a
container where /run is not mounted, and/or some other gpg behavior like
unnecessarily making relative paths to its home directory absolute.
When the required content is set to "groupwanted", use whatever expression
has been set in groupwanted as the required content of the repo, similar to
how setting required content to "standard" already worked.
addurl: When run with --fast on an url that
annex.security.allowed-ip-addresses prevents accessing, display a more
useful message.
(Also importfeed --fast potentially.)
Do not sync with a faster remote that was not specified.
That old behavior was only documented in the changelog, and was certianly
surprising. It also meant adding --fast made it slower..
readonly=true is used to make an external special remote that does not
need the external program to be installed. It was stored in the
remote.log by default, and so every time it was specified in an
enableremote or initremote, whatever value was used became the new
default for subsequent enableremotes of that remote.
That was surprising, and I consider it to be a bug.
It does not make much sense to pass it to initremote because then how
would you populate that remote with anything? You would have to
enableremote elsewhere, and store content there. I'm assuming nobody
used it that way.
Someone might rely on passing it to enableremote once, and then that
being inherited in other clones. But that is not how it's documented to
be used. It is barely documented in git-annex at all, only in the
external special remote protocol, and the documentation there says to
"Document that this external special remote can be used in readonly
mode." (by the user of it passing readonly=true to enableremote). The
one external special remote that I know of that does document that is
<https://github.com/bgilbert/gcsannex> (the one that motivated adding
it). That one's docs do say to pass it to enableremote.
So, it seemed safe to make this behavior change. If someone was in fact
relying on one of those behaviors, all their current repos will still
work as they configured them (although they will need to deal
with the related change in 9f3c2dfeda).
In new clones, they will find enableremote fails, complaining the
external program is not in path. An easy enough problem to recover from.
get --from, move --from: When used with a local git remote, these used to
silently skip files that the location log thought were present on the
remote, when the remote actually no longer contained them. Since that
behavior could be surprising, now instead display a warning.
I got very confused when I encountered this behavior, since it was silently
skipping a file I needed that whereis said was on the remote.
get without --from already displayed a "unable to access these remotes"
message, which while a bit misleading in that the remote is likely
accessible, but just doesn't contain the file, at least indicated something
went wrong.
Having get --from display a warning makes it in line with get
w/o --from, so seems certianly ok. It might be there are situations where
move --from is used, on eg a whole directory, and the user only wants to
move whatever is present in the remote, and is perfectly ok with files
that are not present being skipped. So I'm less sure about the new warning
being ok there. OTOH, only local git remotes avoiding displaying a warning
in that case too, so this just brings them into line with other remotes.
(Also note that this makes it a little bit faster when dealing with a lot of
files, since it avoids a redundant stat of the file.)
Limited to min of -JN or number of CPU cores, because it will often be
CPU bound, once it's read the gitignore file for a directory.
In some situations it's more disk bound, but in any case it's unlikely
to be the main bottleneck that -J is used to avoid. Eg, when dropping,
this is used for numcopies checks, but the main bottleneck will be
accessing the remotes to verify presence. So the user might decide to
-J32 that, but having 32 check-attr processes would just waste however
many filehandles they open, and probably worsen their performance due to
CPU contention.
Note that, I first tried just letting up to the -JN be started. However,
even when it's no bottleneck at all, that still results in all of them
being started. Why? Well, all the worker threads start up nearly
simulantaneously, so there's a thundering herd..
Avoid running a large number of git cat-file child processes when run with
a large -J value.
This implementation takes care to avoid adding any overhead to git-annex
when run without -J. When run with -J, there is a small bit of added
overhead, to manipulate the resource pool. That optimisation added a
fair bit of complexity.
Avoid repeatedly opening keys db when accessing a local git remote and -J
is used.
What was happening was that Remote.Git.onLocal created a new annex state
as each thread started up. The way the MVar was used did not prevent that.
And that, in turn, led to repeated opening of the keys db, as well as
probably other extra work or resource use.
Also managed to get rid of Annex.remoteannexstate, and it turned out there
was an unncessary Maybe in the keysdbhandle, since the handle starts out
closed.
This change does impact git-annex config
eg "git annex config --set annex.addunlocked on"
will store "on" and new git-annex will understand that value, while
old git-annex will error:
git-annex: bad annex.addunlocked configuration in git annex config:
Parse failure: near "on"
That seems acceptable.
Not special remote configs that are only documented as =true or =false
however. Having git-annex support other values for those would break
backwards compatability when used with old versions of git-annex. And
older versions ignore invalid special remote configs.. That would not
be a good combination.
Eg"core.bare" is the same as "core.bare = true".
Note that git treats "core.bare =" the same as "core.bare = false", so the
code had to become more complicated in order to treat the absense of a
value differently than an empty value. Ugh.
Git has an obnoxious special case in git config, a line "foo" is the same
as "foo = true". That means there is no way to examine the output of
git config and tell if it was run with --null or not, since a "foo"
in the first line could be such a boolean, or could be followed by its
value on the next line if --null were used.
So, rather than trying to do such a detection, track the style of config
at all the points where it's generated.
Around 3% total speedup.
Profiling git annex find --not --in web, it's now bytestring end-to-end,
and there is only a little added overhead in eg accessing the Annex
state MVar (3%). The rest of the runtime is spent reading symlinks, and
in attoparsec.
This feels like the end of the optimisation road, without a major change
like caching information for faster queries.
The only price paid is one additional MVar read per write to the journal.
Presumably writing a journal file dominiates over a MVar read time by
several orders of magnitude.
--batch does not get the speedup because then it needs to notice when
another process has made a change. Also made the assistant and other damon
modes bypass the optimisation, which would not help them anyway.
After a user completely ignored the display of the exception probably
because it didn't make sense..
This does make it a little bit slower since it checks adb is in path each
time before running it. Also, it might display a lot of warnings about it
not being installed.
This commit was sponsored by Ilya Shlyakhter on Patreon.
Improve git-annex's ability to find the path to its program, especially
when it needs to run itself in another repo to upgrade it.
Some parts of the code used readProgramFile, probably because I forgot that
programPath exists.
I noticed this when a git-annex auto-upgrade failed because it was running
git-annex upgrade --autoonly, but the code to run git-annex used
readProgramFile, which happened to point to an older build of git-annex.
Upgrade other repos than the current one by running git-annex upgrade
inside them, which avoids problems with upgrade code making assumptions
that the cwd will be inside the repo being upgraded.
In particular, this fixes a problem where upgrading a v7 repo to v8 caused
an ugly git error message.
I actually could not find a way to make Upgrade.V7 work properly
without changing directory to the remote. Once I got git ls-files to work,
the git cat-file failed because :path can only be used in the current git
repo.
This means it will still be a .git file when git-annex init runs. That's
ok, the repo probably contains no annexed objects yet, and even if it does,
git-annex init does not care if symlinks in the worktree don't point to the
objects.
I made init, at the end, run the conversion code. Not really necessary
because the next git-annex command could do it just as well. But, this
avoids commands that don't normally write to the repo needing to write to
it, which might avoid some problem or other, and seems worth avoiding
generally.
While git ls-files can actually be used on a repo that is not in the
cwd, it works inconsistently. For example, this fails:
git --git-dir=../foo/.git --work-tree=../foo ls-files ../foo
But change some of the paths to absolute and it will succeed. That seems
like a bug in git.
OTOH, this succeeds:
git --git-dir=../foo/.git --work-tree=../foo ls-files
But, that lists paths relative to the top of the --work-tree,
rather than the usual listing them relative to the cwd. Because the cwd
is not in the repo. And so anything parsing the ls-files output of that
is likely to operate on files in the wrong location. Indeed, there is
code in Upgrade/ that has this problem!
md5sum is part of busybox, so is probably available unless it were compiled
out. If md5sum (or cut for that matter) is not available, it will
still use the whole path to $base, otherwise hash it.
Of course it's possible for md5sum to be available sometimes and not others
on the same system; in such an event the locales would be built twice for
the same bundle. The cleanup code will delete both sets once that
version of the bundle is upgraded.
git-annex config: Only allow configs be set that are ones git-annex
actually supports reading from repo-global config, to avoid confused users
trying to set other configs with this.
* whereis: If a remote fails to report on urls where a key
is located, display a warning, rather than giving up and not displaying
any information.
* When external special remotes fail but neglect to provide an error
message, say what request failed, which is better than displaying an
empty error message to the user.
Fix serious regression in gcrypt and encrypted git-lfs remotes.
Since version 7.20200202.7, git-annex incorrectly stored content
on those remotes without encrypting it.
Problem was, Remote.Git enumerates all git remotes, including git-lfs
and gcrypt. It then dispatches to those. So, Remote.List used the
RemoteConfigParser from Remote.Git, instead of from git-lfs or gcrypt,
and that parser does not know about encryption fields, so did not
include them in the ParsedRemoteConfig. (Also didn't include other
fields specific to those remotes, perhaps chunking etc also didn't
get through.)
To fix, had to move RemoteConfig parsing down into the generate methods
of each remote, rather than doing it in Remote.List.
And a consequence of that was that ParsedRemoteConfig had to change to
include the RemoteConfig that got parsed, so that testremote can
generate a new remote based on an existing remote.
(I would have rather fixed this just inside Remote.Git, but that was not
practical, at least not w/o re-doing work that Remote.List already did.
Big ugly mostly mechanical patch seemed preferable to making git-annex
slower.)
* init --version: When the version given is one that automatically
upgrades to a newer version, use the newer version instead.
* Auto upgrades from older repo versions, like v5, now jump right to v8.
Fix some cases where handling of keys with extensions varied depending on
the locale.
A filename with a unicode extension would before generate a key with an
extension in a unicode locale, but not in LANG=C, because the extension
was not all alphanumeric. Also the the length of the extension could be
counted differently depending on the locale.
In a non-unicode locale, git-annex migrate would see that the extension
was not all alphanumeric and want to "upgrade" it. Now that doesn't happen.
As far as backwards compatability, this does mean that unicode
extensions are counted by the number of bytes, not number of characters.
So, if someone is using unicode extensions, they may find git-annex
stops using them when adding files, because their extensions are too
long. Keys already in their repo with the "too long" extensions will
still work though, so this only prevents adding the same content with
the same extension generating the same key. Documented this by
documenting that annex.maxextensionlength is a number of bytes.
Also, if a filename has an extension that is not valid utf-8 and the
locale is utf-8, the extension will be allowed now, and an old
git-annex, in the same locale would not, and would also want to
"upgrade" that.
Unfortunately, cabal puts the binary in a very complicated path
and does not provide any good way to get it out, leaving no good choice
except to use find.
It may be possible to use cabal (new-)install --symlink-bindir,
and ask it to symlink to pwd, but with my older version of cabal,
that does not work.
The stack branch will probably also break once it uses a newer cabal,
didn't try to deal with that.
* Added sync --only-annex, which syncs the git-annex branch and annexed
content but leaves managing the other git branches up to you.
* Added annex.synconlyannex git config setting, which can also be set with
git-annex config to configure sync in all clones of the repo.
Use case is then the user has their own git workflow, and wants to use
git-annex without disrupting that, so they sync --only-annex to get the
git-annex stuff in sync in addition to their usual git workflow.
When annex.synconlyannex is set, --not-only-annex can be used to override
it.
It's not entirely clear what --only-annex --commit or --only-annex
--push should do, and I left that combination not documented because I
don't know if I might want to change the current behavior, which is that
such options do not override the --only-annex. My gut feeling is that
there is no good reasons to use such combinations; if you want to use
your own git workflow, you'll be doing your own committing and pulling
and pushing.
A subtle question is, how should import/export special remotes be handled?
Importing updates their remote tracking branch and merges it into master.
If --only-annex prevented that git branch stuff, then it would prevent
exporting to the special remote, in the case where it has changes that
were not imported yet, because there would be a unresolved conflict.
I decided that it's best to treat the fact that there's a remote tracking
branch for import/export as an implementation detail in this case. The more
important thing is that an import/export special remote is entirely annexed
content, and so it makes a lot of sense that --only-annex will still sync
with it.
Fix support for repositories tuned with annex.tune.branchhash1=true,
including --all not working and git-annex log not displaying anything for
annexed files.
fsck --from remote: Fix a concurrency bug that could make it incorrectly
detect that content in the remote is corrupt, and remove it, resulting in
data loss.
* When git-annex is built with a ssh that does not support ssh connection
caching, default annex.sshcaching to false, but let the user override it.
* Improve warning messages further when ssh connection caching cannot
be used, to clearly state why.
This is untested because of rain, also I am operating from truncated
copiler error messages in a bug report that also doesn't mention what the
library version is. Still, it should work.
May break builds with old ghc, in particular DerivingStrategies is
I think fairly new? The pragmas could be ifdefed if necessary. Works with
ghc 8.6.5.
A warning message is unsatisfying. But erroring out is too hard a failure,
especially since it may well work fine if the user has enabled passwordless
ssh.
I did think about falling back to one ssh connection at a time in this
case, but it would have needed a rework of every ssh call, which
seems far overboard for such a niche problem. There's no single place where
git-annex runs ssh, so no one place that it could block a concurrent call
on a semaphore. And, even if it did fall back to one ssh connection at a
time, it seems to me that doing so without warning the user about the
problem just invites bug reports like "git-annex is ignoring my -J2 and
only doing one download at a time". So a warning is needed, and I suppose
is good enough.
using git credential to get the password
One thing this doesn't do is wrap the password prompting inside the prompt
action. So with -J, the output can be a bit garbled.
Http remotes that do expose a git config file, but are not initialized
resulted in an ugly and unncessary error message, now sqelched.
When git-annex-shell configlist is run w/o the autoinit field, it may
not generate a uuid for the repository. So in that case, it's not
unexpected for the config it does list to not include a UUID, and
dumping out the config in a warning message is not needed.
If configlist is asked to autoinit and we don't get back a config with a
UUID in it, that suggests some problem, and what we got back may not be
a config at all but some diagnostic message, so it does make sense to
output it then.
The use case is basically the user having forgotten, so --help would be
best, but it would be quite hard to include this in --help, since it may
even have to spin up an external special remote program.
I also considered --umm but typoed it the first time I tried it as
--uum, and while memorable, it's too cutesy. --whatelse is good because
it explicitly asks, what other params, besides the ones I've given?
Special remote programs that use GETCONFIG/SETCONFIG are recommended
to implement it.
The description is not yet used, but will be useful later when adding a way
to make initremote list all accepted configs.
configParser now takes a RemoteConfig parameter. Normally, that's not
needed, because configParser returns a parter, it does not parse it
itself. But, it's needed to look at externaltype and work out what
external remote program to run for LISTCONFIGS.
Note that, while externalUUID is changed to a Maybe UUID, checkExportSupported
used to use NoUUID. The code that now checks for Nothing used to behave
in some undefined way if the external program made requests that
triggered it.
Also, note that in externalSetup, once it generates external,
it parses the RemoteConfig strictly. That generates a
ParsedRemoteConfig, which is thrown away. The reason it's ok to throw
that away, is that, if the strict parse succeeded, the result must be
the same as the earlier, lenient parse.
initremote of an external special remote now runs the program three
times. First for LISTCONFIGS, then EXPORTSUPPORTED, and again
LISTCONFIGS+INITREMOTE. It would not be hard to eliminate at least
one of those, and it should be possible to only run the program once.
This is a first step toward that goal, using the ProposedAccepted type
in RemoteConfig lets initremote/enableremote reject bad parameters that
were passed in a remote's configuration, while avoiding enableremote
rejecting bad parameters that have already been stored in remote.log
This does not eliminate every place where a remote config is parsed and a
default value is used if the parse false. But, I did fix several
things that expected foo=yes/no and so confusingly accepted foo=true but
treated it like foo=no. There are still some fields that are parsed with
yesNo but not not checked when initializing a remote, and there are other
fields that are parsed in other ways and not checked when initializing a
remote.
This also lays groundwork for rejecting unknown/typoed config keys.
Git will eventually switch to sha2 and there will not be one single
shaSize anymore, but two (40 and 64).
Changed all parsers for git plumbing output to support both sizes of
shas.
One potential problem this does not deal with is, if somewhere in
git-annex it reads two shas from different sources, and compares them
to see if they're the same sha, it would fail if they're sha1 and sha256
of the same value. I don't know if that will really be a concern.
ifAnnexed in a bare repo passes to git cat-file :./filename , which it
refuses to do since the repo is bare.
Note that, reinject somefile someannexedfile in a bare repo silently does
nothing, because someannexedfile is never actually an annexed worktree
file, because the repo is bare.
options make it easier to override annex.largefiles configuration
(and potentially safer as it avoids bugs like the smudge bug fixed
in the last release)
Deleted some old comments that were posted to the man page discussing such
options.
Updated docs that used -c annex.largefiles to use the options.
Note that addSmallOverridden was needed to avoid the clean filter running
on the file. It would be possible to make addFile also update the index
directly, rather than going via git add. However, it was not necessary,
and I want to avoid breaking on some edge case, particularly if the code in
addSmallOverridden has some oversight.
Also, when annex.addunlocked is set and annex.largefiles does not match a file,
git annex add --force-large works, but git status will then show the file
as added, with a unstaged modification. The unstaged modification adds the
file to git. This is identical behavior to using -c annex.largefiles=nothing
when annex.addunlocked is set. This does not prevent committing what was
intended to be added. I have not gotten to the bottom of why git thinks
the file is modified and runs it through the clean filter in this case.
smudge: When annex.largefiles=anything, files that were already stored in
git, and have not been modified could sometimes be converted to being
stored in the annex. Changes in 7.20191024 made this more of a problem.
This case is now detected and prevented.
The git add behavior changes could be avoided if it turns out to be
really annoying, but then it would need to behave the old way when
annex.dotfiles=false and the new way when annex.dotfiles=true. I'd
rather not have the config option result in such divergent behavior as
`git annex add .` skipping a dotfile (old) vs adding to annex (new).
Note that the assistant always adds dotfiles to the annex.
This is surprising, but not new behavior. Might be worth making it also
honor annex.dotfiles, but I wonder if perhaps some user somewhere uses
it and keeps large files in a directory that happens to begin with a
dot. Since dotfiles and dotdirs are a unix culture thing, and the
assistant users may not be part of that culture, it seems best to keep
its current behavior for now.
eg, `git-annex get . ..` used to order the files strangly, because it
did not realize that when git ls-files output eg "foo", that should be
grouped with the first set of files and not the second set.
Fixed by making dirContains "." "./foo" = True
which makes sense, because dirContains ".." "../foo" = True
* annex.addunlocked can be set to an expression with the same format used by
annex.largefiles, in case you want to default to unlocking some files but
not others.
* annex.addunlocked can be configured by git-annex config.
Added a git-annex-matching-expression man page, broken out from
tips/largefiles.
A tricky consequence of this is that git-annex add --relaxed
honors annex.addunlocked, but an expression might want to know the size
or content of an url, which it's not going to download. I decided it was
better not to fail, and just dummy up some plausible data in that case.
Performance impact should be negligible. The global config is already
loaded for annex.largefiles. The expression only has to be parsed once,
and in the simple true/false case, it should not do any additional work
matching it.
e53070c1f quietly made it set the local git config too, but that was never
documented anywhere, and it had surprising results. If I set
annex.largefiles globally in a repo, I would expect to be able to change it
in another repo, and the original repo would get the change and use it,
rather than being stuck on the old value set there.
And, if I have a local annex.largefiles and set a different global default,
I'd be surprised to have my local setting overwritten.
annex.securehashesonly does need to be set locally, since it's a security
feature and the global is only a default until it gets set locally. So
special cased.
annex.largefiles can be configured by git-annex config, to more easily set
a default that will also be used by clones, without needing to shoehorn the
expression into the gitattributes file. The git config and gitattributes
override that.
Whenever something is added to git-annex config, we have to consider what
happens if a user puts a purposfully bad value in there. Or, if a new
git-annex adds some new value that an old git-annex can't parse.
In this case, a global annex.largefiles that can't be parsed currently
makes an error be thrown. That might not be ideal, but the gitattribute
behaves the same, and is almost equally repo-global.
Performance notes:
git-annex add and addurl construct a matcher once
and uses it for every file, so the added time penalty for reading the global
config log is minor. If the gitattributes annex.largefiles were deprecated,
git-annex add would get around 2% faster (excluding hashing), because
looking that up for each file is not fast. So this new way of setting
it is progress toward speeding up add.
git-annex smudge does need to load the log every time. As well as checking
the git attribute. Not ideal. Setting annex.gitaddtoannex=false avoids
both overheads.
git-annex find is now RawFilePath end to end, no string conversions.
So is git-annex get when it does not need to get anything.
So this is a major milestone on optimisation.
Benchmarks indicate around 30% speedup in both commands.
Probably many other performance improvements. All or nearly all places
where a file is statted use RawFilePath now.
Since the sqlite branch uses blobs extensively, there are some
performance benefits, ByteStrings now get stored and retrieved w/o
conversion in some cases like in Database.Export.
This will speed up the common case where a Key is deserialized from
disk, but is then serialized to build eg, the path to the annex object.
Previously attempted in 4536c93bb2
and reverted in 96aba8eff7.
The problems mentioned in the latter commit are addressed now:
Read/Show of KeyData is backwards-compatible with Read/Show of Key from before
this change, so Types.Distribution will keep working.
The Eq instance is fixed.
Also, Key has smart constructors, avoiding needing to remember to update
the cached serialization.
Used git-annex benchmark:
find is 7% faster
whereis is 3% faster
get when all files are already present is 5% faster
Generally, the benchmarks are running 0.1 seconds faster per 2000 files,
on a ram disk in my laptop.
Used to work but was broken in version 7.20181031, specifically commit
5ab0f48ffb.
That this was not noticed over at least 1 daylight savings time zone
changes makes me wonder if the TSDelta stuff is still needed.
Perhaps the mtime on Windows no longer changes when the time zone is changed?
(cherry picked from commit 09ee6b0ccb)
* benchmark: Changed --databases to take a parameter specifiying the size
of the database to benchmark.
* benchmark --databases: Display size of the populated database.
* benchmark --databases: Improve the "addAssociatedFile to (new)"
benchmark to really add new values, not overwriting old values.
Eg:
git clone url --bare r
git --git-dir r annex init
This resulted in worktree = Just "." and so several things that check
worktree to determine when the repo is bare ran code paths intended for
non-bare. One such code path[1] ran git checkout with --worktree=. which
actually makes it ignore core.bare config, and so the current directory
got populated with a checkout of the master branch in this example. There
was probably also other breakage.
The fix is a bit complicated because whether the repo is bare is not
known until after Git.Config reads the config, but Git.Config handles
setting the RepoLocations's worktree when core.worktree is set. So have
to assume the worktree is the cwd, let core.worktree override that,
and then if the repo turns out to be bare, it's set back to Nothing.
(And then GIT_WORK_TREE can still override all of that.)
[1] switchHEADBack, which runs even when the clone is not from a bare repo.
* git-lfs: The url provided to initremote/enableremote will now be
stored in the git-annex branch, allowing enableremote to be used without
an url. initremote --sameas can be used to add additional urls.
* git-lfs: When there's a git remote with an url that's known to be
used for git-lfs, automatically enable the special remote.
Reasons to do this include:
1. I've gotten pretty used to git-annex's own progress display, which is
used for all transfers over ssh (except to old git-annex-shell),
and for most special remote transfers. It's getting to seem weird to see
the rsync progress display instead.
2. When -J was used, the rsync output could not be shown, and so there was
no progress display. Now there will be.
Progress will also be displayed now when cp CoW is used. But I'd expect a CoW
copy to typically run so fast that the progress display will barely be
noticable.
This commit was sponsored by Peter on Patreon.
So that binaries in that directory can find the library next to them,
where they get modified to look.
This is a hack; it would be better for OSXMkLibs to build a list of what
libraries are needed where.
Unsure if this is needed due to a recent reversion, or is an older
problem, so updated changelog accordingly.
Putting the binaries in bundle/git-core/bin didn't work on OSX,
linker can't find the libraries next to those binaries where it expects to.
So instead put the binaries in the progDir.
See the comment for a trace of the deadlock.
Added a new StartStage. New worker threads begin in the StartStage.
Once a thread is ready to do work, it moves away from the StartStage,
and no thread will ever transition back to it.
A thread that blocks waiting on another thread that is processing
the same key will block while in the StartStage. That other thread
will never switch back to the StartStage, and so the deadlock is avoided.
Convert Utility.Url to return Either String so the error message can be
displated in the annex monad and so captured.
(When curl is used, its errors are still not caught.)
Used to work but was broken in version 7.20181031, specifically commit
5ab0f48ffb.
That this was not noticed over at least 1 daylight savings time zone
changes makes me wonder if the TSDelta stuff is still needed.
Perhaps the mtime on Windows no longer changes when the time zone is changed?
Fix bug that lost modifications to unlocked files when init is re-ran in an
already initialized repo.
In retrospect needing scanUnlockedFiles False in the direct mode upgrade
path was a good hint that it was unsafe when used with True.
However, this bug did not affect upgrade from v5. In such an upgrade, an
unlocked file that is modified is left as-is. The only place
scanUnlockedFiles True did overwrite modified unlocked files is during an
git-annex init of a repo that was already initialized by git-annex.
(I also tried a scenario where the repo had not been initialized by
git-annex yet, but was cloned from a v7 repo with an unlocked file, and the
pointer file replaced with some other content, and the data loss did not
occur in that situation.)
Since the fixed scanUnlockedFiles avoids overwriting non-pointer files,
it should be safe to run in any situation, so there's no need any longer
for the parameter.
Rescued from commit 11d6e2e260 which removed
db benchmarks in favor of benchmarking arbitrary git-annex commands. Which
is nice and general, but microbenchmarks are useful too.
Code change should be trvial, but not yet implemented. This
significantly complicated the task of documenting how git-annex works.
I'm not sure how useful the annex.gitaddtoannex confguration is after
this change; seems that if a user has an annex.largefiles they will want
it applied consistently. But the last thing I want to hear is more
complaining from users about git add doing something they don't want it
to.
There's a pretty high risk users who got used to the git add behavior
and don't have annex.largefiles configured will miss the NEWS and
complain bitterly about their suddenly bloated repositories. Oh well.
Removed outdated comments about the old behavior to avoid confusion.
I don't know if I've found all the places that griping spread to.
Added annex.gitaddtoannex configuration. Setting it to false prevents
git add from usually adding files to the annex.
(Unless the file was annexed before, or a renamed annexed file is detected.)
Currently left at true; some users are encouraging it be set to false.
When the submodule's parent repo has an adjusted unlocked branch,
it gets cloned by git, but git checks out master. git annex init then
fails because it wants to enter the adjusted branch, but:
adjusted branch adjusted/master(unlocked) already exists.
Aborting because that branch may have changes that have not yet reached master
Note that init actually then exits 0, leaving master checked out.
This could also happen, absent submodules, if the parent repo has
an adjusted unlocked branch, but it is not checked out. In the more common
case where that branch is checked out, the clone uses the same branch,
so no problem.
The choices to fix this:
* Init could delete the existing adjusted branch, and re-adjust.
But then running init inside an adjusted branch on a crippled filesystem
would lose any changes that have not been synced back to master.
* Init could sync any changes back to master, but that would be very surprising
behavior for it.
* Init could simply check out the existing adjusted branch. If the branch
is diverged from master, well, sync will sort that out later.
This mirrors the behavior of cloning a repo that has an adjusted branch
checked out that has not yet been synced back to master.
Picked this choice.
Homebrew now has eg:
datalads-imac:~ joey$ ls -l /Users/joey/homebrew/Cellar/git/2.23.0/libexec/git-core
total 36776
lrwxr-xr-x 1 joey staff 13 Aug 29 13:38 git -> ../../bin/git
lrwxr-xr-x 1 joey staff 13 Aug 29 13:38 git-add -> ../../bin/git
So the target of the symlink also needs to be installed now.
Doing it in shell code was too hairy for my dentistry-addled brain, so
reimplemented in haskell. Also using it for building linuxstandalone.
forget --drop-dead: Remove several classes of git-annex log files when they
become empty, further reducing the size of the git-annex branch.
Noticed while testing sameas uuid removal, but it could happen other times
too.
An empty log file is always treated by git-annex the same as no file
being present, and when the files are per-key, it can be a sizable space
saving to exclude them from the tree.
This solves the problem of sameas remotes trampling over per-remote
state. Used for:
* per-remote state, of course
* per-remote metadata, also of course
* per-remote content identifiers, because two remote implementations
could in theory generate the same content identifier for two different
peices of content
While chunk logs are per-remote data, they don't use this, because the
number and size of chunks stored is a common property across sameas
remotes.
External special remote had a complication, where it was theoretically
possible for a remote to send SETSTATE or GETSTATE during INITREMOTE or
EXPORTSUPPORTED. Since the uuid of the remote is typically generate in
Remote.setup, it would only be possible to pass a Maybe
RemoteStateHandle into it, and it would otherwise have to construct its
own. Rather than go that route, I decided to send an ERROR in this case.
It seems unlikely that any existing external special remote will be
affected. They would have to make up a git-annex key, and set state for
some reason during INITREMOTE. I can imagine such a hack, but it doesn't
seem worth complicating the code in such an ugly way to support it.
Unfortunately, both TestRemote and Annex.Import needed the Remote
to have a new field added that holds its RemoteStateHandle.
Work around git cat-file --batch's odd stripping of carriage return from
the end of the line (some windows infection), avoiding crashing when the
repo contains a filename ending in a carriage return.
When dropping an unlocked file, preserve its mtime, which avoids git status
unncessarily running the clean filter on the file.
If the index file has close to the same mtime as a work tree file, git will
not trust the index to be up-to-date, and re-runs the clean filter
unncessarily. Preserving the mtime when depopulating a pointer file avoids
git status doing a little (or maybe a lot) of unncessary work.
There are other places that the mtime could be preserved, including other
places where pointer files are written perhaps, but also
populatePointerFile. But, I don't know of cases where those lead to git
status doing unncessary work, so I just fixed the one I'm aware of for now.
fsck --incremental/--more: Fix bug that prevented the incremental fsck
information from being updated every 5 minutes as it was supposed to be; it
was only updated after 1000 files were checked, which may be more files
that are possible to fsck in a given fsck time window.
Thanks to Peter Simons for help with analysis of this bug.
Auditing for other cases of the same mistake, the keys db also had it
backwards. This seems unlikely to really have been a problem;
it would need associated files updates etc to be coming in slowly for some
reason and then be interrupted to cause any problem.
IIRC the design of the keys db assumes that any interruped
operation will be restarted, and so it can lose any buffered database
updates safely.
Fix bug in handling of annex.largefiles that use largerthan/smallerthan.
When adding a modified file, it incorrectly used the file size of the old
version of the file, not the current size.
That was the only largefiles limit that didn't directly look at the file on
disk already. Added a new type to keep straight the two different ways such
a limit can be matched. I kind of wanted to extend MatchingFile or FileInfo
to indicate that the matcher is supposed to operate on files from disk or
annex, but it turned out to be too complex to implement it that way.
This also changes the LimitAnnexFiles case when lookupFileKey does not find
a key. It used to fall back to statting the file, now it always returns
False. I doubt the old code could really get to that point, but if it
somehow does, it's better for preferred content matching to be consistent.
gksu is no longer in debian, even stable
kdesu in debian is not installed in PATH any longer, though the executable
is still present under /usr/lib
pkexec is packagekit's replacement for those older commands.
Straightforward, except for the issue of how to reverse LockAdjustment.
With --unlock, a commit that modifies/adds unlocked files gets reverse
adjusted to use locked files. That's fairly reasonable, I think.
But reversing --lock by unlocking all modified files feels wrong. Maybe
that's just because repositories typically seem to still have mostly
locked files in them (unless one is in an adjusted unlocked branch of
course!)
It may be that eventually how to reverse both will need to be configurable,
I don't know.
Had a report of close throwing ErrorBusy on CIFS.
Retrying up to 16 seconds is a balance between hopefully waiting long
enough for the problem to clear up and waiting so long that git-annex seems
to hang.
The new dependency is free; persistent depends on unliftio-core.
I just had a test that crashed at cleanup on linux with:
.t/gpgtest/12/S.gpg-agent.browser: removeDirectoryRecursive:removeContentsRecursive:removePathRecursive:removeContentsRecursive:removePathRecursive:removeContentsRecursive:removePathRecursive:getSymbolicLinkStatus: does not exist (No such file or directory)
sleeping 10 seconds and will retry directory cleanup
git-annex: .t/gpgtest/14/S.gpg-agent.browser: removeDirectoryRecursive:removeContentsRecursive:removePathRecursive:removeContentsRecursive:removePathRecursive:removeContentsRecursive:removePathRecursive:getSymbolicLinkStatus: does not exist (No such file or directory)
removePathForcibly is supposed to be more robust to things in the directory vanishing while it's running, etc.
Will probably avoid such crashes.
It was added to directory-1.2.7, which comes with ghc since 8.0.2.
Since base >= 4.11.1.0 means ghc 8.4.4, I expect all builds will have it,
but I ifdefed it to be sure.
This avoids some extra work, but I don't think it was possible for two ssh
endpoint discoveries run concurrently to both prompt for the ssh password;
Annex.Ssh itself deals with concurrency.
This is mostly groundwork for http password prompting.
This allows the rpm to be built anywhere the necessary build deps are
available (including on debian) and the resulting package will work on as
broad a range of rpm distributions as the libc/kernel supports.
The DistributionUpdate changes to use the new script have not yet been
tested.
debian oldoldstable has 2.1, and that's what i386ancient uses. It would be
better to require git 2.2, which is needed to use adjusted branches, but
can't do that w/o losing support for some old linux kernels or a
complicated git backport.
This brings back .git/annex/misctmp, but only for init. If an init
is interrupted while probing using that temp directory, the files it left
will get deleted 1 week later by a subsequent git-annex run.
Can be set to false to prevent any automatic repository upgrades.
Also, removed direct mode specific upgrade code in Annex.Init, and made
needsUpgrade always include the name/path of the repo, so if
there's a problem it's clear what repo has the problem.
And, made needsUpgrade catch any exceptions that might occur during the
upgrade, so it can display a more useful error message than just the
exception.
That git fixed a memory leak that could cause an OOM during the upgrade.
Most git-annex builds have a new enough git already.
OSX git was upgraded with brew.
Linux i386ancient build's git was too old. Upgrading it to a fixed
git didn't work (due to the newer git not working with the old ssh,
https://bugs.chromium.org/p/git/issues/detail?id=7 )
Choices to deal with that were:
* Somehow make direct mode upgrade work with the old git, avoiding its
OOM problem. One way would be to switch the repo to indirect mode
first, and so upgrade to a repo with locked files. Not good when
the filesystem does not support symlinks.
* backport the OOM fix from git 2.22
(And do what about the version number so git-annex knows it's fixed?)
* backport openssh (and possibly more stuff)
* move the i386ancient build to at least Debian stretch (still backporting git)
But this will make it no longer work with some of the ancient kernels it
targets.
Of those, backporting the OOM fix seemed the best approach. Put "oomfix"
in the git version number to indicate it.
I have not automated building the git backport, so here's the patch I
used:
diff -ur orig/git-2.1.4/convert.c git-2.1.4/convert.c
--- orig/git-2.1.4/convert.c 2014-12-18 18:42:18.000000000 +0000
+++ git-2.1.4/convert.c 2019-08-29 20:05:04.371872338 +0100
@@ -404,7 +404,7 @@
if (start_async(&async))
return 0; /* error was already reported */
- if (strbuf_read(&nbuf, async.out, len) < 0) {
+ if (strbuf_read(&nbuf, async.out, 0) < 0) {
error("read from external filter %s failed", cmd);
ret = 0;
}
diff -ur orig/git-2.1.4/GIT-VERSION-GEN git-2.1.4/GIT-VERSION-GEN
--- orig/git-2.1.4/GIT-VERSION-GEN 2014-12-18 18:42:18.000000000 +0000
+++ git-2.1.4/GIT-VERSION-GEN 2019-08-29 20:06:39.132743228 +0100
@@ -1,7 +1,7 @@
#!/bin/sh
GVF=GIT-VERSION-FILE
-DEF_VER=v2.1.4
+DEF_VER=v2.1.4.oomfix
LF='
'
diff -ur orig/git-2.1.4/configure git-2.1.4/configure
--- orig/git-2.1.4/configure 2014-12-18 18:42:19.000000000 +0000
+++ git-2.1.4/configure 2019-08-29 20:27:45.896380015 +0100
@@ -580,8 +580,8 @@
# Identity of this package.
PACKAGE_NAME='git'
PACKAGE_TARNAME='git'
-PACKAGE_VERSION='2.1.4'
-PACKAGE_STRING='git 2.1.4'
+PACKAGE_VERSION='2.1.4.oomfix'
+PACKAGE_STRING='git 2.1.4.oomfix'
PACKAGE_BUGREPORT='git@vger.kernel.org'
PACKAGE_URL=''
diff -ur orig/git-2.1.4/version git-2.1.4/version
--- orig/git-2.1.4/version 2014-12-18 18:42:19.000000000 +0000
+++ git-2.1.4/version 2019-08-29 20:06:17.572545210 +0100
@@ -1 +1 @@
-2.1.4
+2.1.4.oomfix
Three reasons:
* Committing as part of an upgrade is very unusual and unexpected.
* The commit was failing with a weird error message when done during an
automatic upgrade.
* Let me remove more of that sweet^Whorrible direct mode code.
* Automatically convert direct mode repositories to v7 with adjusted
unlocked branches and set annex.thin.
* init: When run on a crippled filesystem with --version=5,
will error out, since version 7 is needed for adjusted unlocked branch.
* direct: This command always errors out as direct mode is no longer
supported.
* indirect: This command has become a deprecated noop.
* proxy: This command is deprecated because it was only needed in direct
mode. (But it continues to work.)
Also removed mentions of direct mode throughough the documentation.
I have not removed all the direct mode code yet.
When upgrading a direct mode repo to v7 with adjusted unlocked branches,
fix a bug that prevented annex.thin from taking effect for the files in
working tree.
The hard links used to be ok, but commit 8e22114735 accidentially
broke them. It repopulates the worktree file, which is already a hard link,
and when it's creating the new file, the link count is already 2, and so it
doesn't make a hard link then.
Rather than direct mode, which this is a small step on the path to
removing.
Init on a crippled filesystem already used v7 adjusted branches,
and like that, this doesn't pose any interoperability issues with old
versions of git-annex that clone the same repo, because files are only
unlocked on the adjusted branch.
When file matching options are specified when getting info of
something other than a directory, they won't have any effect, so error out
to avoid confusion.
This commit was sponsored by mo on Patreon.
Systems such as Debian that have overridden the default fpath will need to
set ZSH_COMPLETIONS_PATH.
I feel that Debian is causing unncessary complexity by making this change,
and have filed a bug report about it.
This also means that when git-annex is installed with PREFIX=/usr/local
it will use /usr/local/share/zsh/site-functions which works with probably
all versions of zsh.
Its repeated opening and writing to the sqlite database somehow caused
inode cache information to occasionally be lost.
This loses code coverage, since running git-annex as a child process
prevents tracking what parts of the code are exercised. I have not looked
at the code coverage in a long time. It would probably be possible to
collect code coverage for the child procesess and merge it together.
* merge: When run with a branch parameter, merges from that branch.
This is especially useful when using an adjusted branch, because
it applies the same adjustment to the branch before merging it.
On second thought, the extra time running the test suite is worth it.
It will be gained back once we finally get rid of direct mode.
There are two failing tests, same two that have been failing on windows
(though the failure does not look identical). So this should also spare me
the Windows VM while fixing.
This way a failure to clean up the main repo dir from a previous pass
can't result in reusing that repo, which won't be configured right for the
current pass.
I saw the installer not defaulting to any installation directory,
and I had to manually enter C:\Program Files\Git
Maybe it was choosing gitInstallDir32, and that was empty? Or the
conditional somehow failed. Simplifying so it will hopefully work again.
This typo would make "git cat-file cat-file" fail, and the way it's used,
I think it broke querying all info from filenames containing newlines,
because the other queries are only run when it succeeds.
Use the same optimisation for --in=here as has always been used for --in=.
rather than the slow code path that unncessarily queries the git-annex
branch.
It looks like when "here" got added as an alias for "." back in 2012, I
forgot about this place.
Also sped up some very unlikely ways of referring to the current
repository.
Note that, this could in some rare corner case cause a behavior
change, if the git-annex branch and inAnnex disagree about whether content
is present in the local repository. But --in=. already behaved
that way, and the truth on the ground should win also.
In 40ecf58d4b I changed the license of code I
wrote from GPL to AGPL. But, two files containing code I wrote combined
with code by others were updated to say their license is AGPL, while in
fact part of it was (the code I wrote) but part remained under the original
license (the code written by others).
Remote/Ddar.hs is now changed entirely back to GPL 3.
Annex/DirHashes.hs stays AGPL, but I broke out Utility/MD5.hs with the code
not written by me, and corrected its license statement to GPL-2, which
is the actual version of the GPL included with the code in its original
distribution at http://www.cs.ox.ac.uk/people/ian.lynagh/md5/
rsync is only needed for rsync special remotes and git-annex-shell from
Debian oldstable. Since the library situation on windows for rsync required
a particular 32 bit build of git for it to work, and may also somehow need
git-annex to be 32 bit build, it's better to not include it.
This commit was sponsored by Jake Vosloo on Patreon.
Improved probing when CoW copies can be made between files on the same
drive. Now supports CoW between BTRFS subvolumes. And, falls back to rsync
instead of using cp when CoW won't work, eg copies between repos on the
same EXT4 filesystem.
Rather than trying cp --reflink=always for each file copied to a remote,
it's tried once and if it fails it falls back to using rsync thereafter
for the lifetime of the Remote object. That avoids overhead of calling cp
which while small, will add up over a large number of files.
This commit was sponsored by Boyd Stephen Smith Jr. on Patreon.
Support running v7 upgrade in a repo where there is no branch checked out,
but HEAD is set directly to some other ref.
This commit was sponsored by Jack Hill on Patreon.
The cabal file does not yet demand this version because it's not in Debian
yet and only affects use of certian broken http servers, but let's use it
when it's easily available.
using a blake2 variant optimised for 4-way CPUs
This had been deferred because the Debian package of cryptonite, and
possibly other builds, was broken for blake2bp, but I've confirmed #892855
is fixed.
This commit was sponsored by Brett Eisenberg on Patreon.
Drop support for building with ghc older than 8.4.4, and with older
versions of serveral haskell libraries than will be included in Debian 10.
The only remaining version ifdefs in the entire code base are now a couple
for aws!
This commit should only be merged after the Debian 10 release.
And perhaps it will need to wait longer than that; it would make
backporting new versions of git-annex to Debian 9 (stretch) which
has been actively happening as recently as this year.
This commit was sponsored by Ilya Shlyakhter.
Reversion from commit 436f10771, CustomOutput was forcing quiet output
which overrode the json setting.
find happened to be the only command that uses CustomOutput and also
outputs json. (metadata --get does also use CustomOutput and --json does not
enable json output for that, which may be an oversight, but was already the
behavior before this regression.)
init: Fix a reversion in the last release that prevented automatically
generating and setting a description for the repository.
Seemed best to factor out uuidDescMapRaw that does not
have the default mempty descrition behavior.
I don't much like that behavior, but I know things depend on it.
One thing in particular is `git annex info` which lists the uuids and
descriptions; if the current repo has been initialized in some way that
means it does not have a description, it would not show up w/o that.
(Not only repos created due to this bug might lack that. For example a repo
that was marked dead and had --drop-dead delete its git-annex branch info,
and then came back from the dead would similarly not be in the uuid.log.
Also there have been other versions of git-annex that didn't set a default
description; for years there was no default description.)
When downloading an url and the destination file exists but is empty,
avoid using http range to resume, since a range "bytes=0-" is an unusual
edge case that it's best to avoid relying on working.
This is known to fix a case where importfeed downloaded a partial feed from
such a server. Since importfeed uses withTmpFile, the destination always exists
empty, so it would particularly tickle such problem servers. Resuming from 0
is otherwise possible, but unlikely.
get, move, copy, sync: When -J or annex.jobs has enabled concurrency,
checksum verification uses a separate job pool than is used for
downloads, to keep bandwidth saturated.
Not yet done for upload checksum verification, but that only affects
remotes on local disks.
Avoid a delay at startup when concurrency is enabled and there are
rsync or gcrypt special remotes, which was caused by git-annex
opening a ssh connection to the remote too early.
sshOptions makes a connection to the ssh server if one is not already open,
when concurrency is enabled. Avoid doing that at startup, when the remote
list is being built, but the remote may not be used at all.
Instead, rsync/gcrypt now runs sshOptions once per ssh connection to the
server. This should not be significant overhead since Remote.Git already
has the same overhead (as do Bup and Ddar).
The hoped for optimisation of CommandStart with -J did not materialize.
In fact, not runnign CommandStart in parallel is slower than -J3.
So, CommandStart are still run in parallel.
(The actual bad performance I've been seeing with -J in my big repo
has to do with building the remoteList.)
But, this is still progress toward making -J faster, because it gets rid
of the onlyActionOn roadblock in the way of making CommandCleanup jobs
run separate from CommandPerform jobs.
Added OnlyActionOn constructor for ActionItem which fixes the
onlyActionOn breakage in the last commit.
Made CustomOutput include an ActionItem, so even things using it can
specify OnlyActionOn.
In Command.Move and Command.Sync, there were CommandStarts that used
includeCommandAction, so output messages, which is no longer allowed.
Fixed by using startingCustomOutput, but that's still not quite right,
since it prevents message display for the includeCommandAction run
inside it too.
When running multiple concurrent actions, the cleanup phase is run in a
separate queue than the main action queue. This can make some commands
faster, because less time is spent on bookkeeping in between each file
transfer.
But as far as I can see, nothing will be sped up much by this yet, because
all the existing cleanup actions are very light-weight. This is just groundwork
for deferring checksum verification to cleanup time.
This change does mean that if the user expects -J2 will mean that they see no
more than 2 jobs running at a time, they may be surprised to see 4 in some
cases (if the cleanup actions are slow enough to notice).
It might also make sense to enable background cleanup without the -J,
for at least one cleanup action. Indeed, that's the behavior that -J1
has now. At some point in the future, it make make sense to make the
behavior with no -J the same as -J1. The only reason it's not currently
is that git-annex can build w/o concurrent-output, and also any bugs
in concurrent-output (such as perhaps misbehaving on non-VT100 compatible
terminals) are avoided by default by only using it when -J is used.
Add back support for ftp urls, which was disabled as part of the fix for
security hole CVE-2018-10857 (except for configurations which enabled curl
and bypassed public IP address restrictions). Now it will work if allowed
by annex.security.allowed-ip-addresses.
Renamed annex.security.allowed-http-addresses to
annex.security.allowed-ip-addresses because it is not really specific to
the http protocol, also limiting eg, git-annex's use of ftp and via
youtube-dl, several other protocols.
The old name for the config will still work.
If both old and new name are set, the new name will win.
When a remote is configured to be readonly, don't allow changing what's
exported to it.
This was missed in the original export remote implementation, but it makes
sense for a readonly export remote to not be allowed to change.
~/.profile works for bash, but not all other login shells.
This setting PATH is a minor convenience for users, particuarly since
typing on android is so much harder. The usual linux standalone bundle
just expects the user to know how to add it to PATH. I don't want this
code to grow special cases for every possible login shell. So displaying a
message to the presumably minority who don't use bash seems like the best
choice.
Longer term, I'd hope termux gets some way to set an environment variable
for all login shells. Systems using PAM can, via ~/.pam_environment. Or
alternatively, add a git-annex package to termux, even if just an installer
package. I'd rather spend time on either of those than on making this minor
thing support more login shells.
This commit was sponsored by mo on Patreon.
* init: When the repository already has a description, don't change it.
* describe: When run with no description parameter it used to set
the description to "", now it will error out.
Importing from a special remote honors its preferred content too; unwanted
files are not imported. But, some preferred content expressions can't be
checked before files are imported, and trying to import with such an
expression will fail.
Tested this with scenarios including changing the preferred content
expression and making sure merging the import didn't delete files that were
no longer wanted.
There was one minor inefficiency mentioned in the todo that I punted on.
Make the import have the previous import as a parent, so eg `git log --stat`
displays a useful diff.
Also a minor optimisation, only calculate the depth of the imported history
once.
This includes a note about how include= and exclude= match when exporting
a subtree. I don't know if the note is prominent enough, but the
behavior seems unsurprising enough.
Added the ability to run one job per CPU (core), by setting annex.jobs=cpus,
or using option --jobs=cpus or -Jcpus.
Built with future expansion in mind, including not defaulting matching on
Concurrency so more constructors can later be added, and using "cpu"
instead of "0".
Fixes bug that caused git-annex to fail to add a file when another
git-annex process cleaned up the temp directory it was using.
Solution is just to push withOtherTmp out to a higher level, so that
the whole ingest process can be completed inside it.
But in the assistant, that was not practical to do, since withOtherTmp runs
in the Annex monad and the assistant does not. Worked around by introducing
a separate temp directory that only the assistant uses for lockdown.
Since only one assistant can run at a time, it's easy to clean up that
directory of old cruft at startup.
Fix reversion in last release that caused wrong tree to be written to
remote tracking branch after an export of a subtree.
The invariant "commitsha should have the treesha as its tree"
was not met due to a bug. Guarantee it's met by catting the commitsha
to find its actual tree. A little bit slower, but this is not run often.
* Added mimeencoding= term to annex.largefiles expressions.
This is probably mostly useful to match non-text files with eg
"mimeencoding=binary"
* git-annex matchexpression: Added --mimeencoding option.
Switch listContents to being a proper CommandStart, so if it throws an
exception, it will be treated like any other command action that fails.
downloadImport apparently does not ever throw an exception,
and itself uses commandAction, so it can't be a CommandStart.
Fix bug that caused importing from a special remote to repeatedly download
unchanged files when multiple files in the remote have the same content.
Unfortunately, there's really no good way to remove a uniqueness constraint
from a sqlite database. The best that can be done is to make a new table
and copy the data over. But that would require using persistent's
migrations or raw sql, and I don't want to do either.
Instead, a sledgehammer approach: Renamed .git/annex/cid to
.git/annex/cids. When the new database doesn't exist, it will be populated
from the git-annex branch.
Noting deletes the old database. Don't want to delete it out from under
some long-running git-annex process that might be using it. It could
eventually be deleted. But this is such a new feature, probably few repos
have the database in any case.
As well as adding the necessary methods, a few other changes to the adb
remote:
* Use ".annextmp" extension for temp files, to avoid conflict with other
temp files.
* Stop using "echo $?" to get exit status of command inside adb.
There were two problems; first the "echo" just before it meant it was
always 0! And secondly, it seems kind of random on my phone whether it's
1 or 0, not dependant on whether the command seems to have succeeded.
protocol=https implies port=443 and
port=443 implies protocol=https
-- this was necessary because the existing configs set port=443, but
with a protocol setting, users will naturally want to use it, and then
there's no need for them to supply the default https port. So we keep
back-compat, add a nicer way to enable https, and also add support for
non-standard https ports.
In particular, when two files had the same content, and one was unlocked
and modified, with annex.thin that can corrupt the content of the
annex object, and so fsck on the other file should detect that.
getKeyStatus was relying on Database.Keys.getAssociatedFiles to tell
when a file is unlocked, but that can false positive because the
database can list old associated files.
Instead, separate out the case of unlocked object which has multiple
hardlinks when annex.thin is in use.
To support filenames starting with dashes.
To update the config of existing repositories, you can re-run git-annex init.
Perhaps it should check every time for the old config and update it, but
that has several problems:
- read-only repos
- unexpected commands like `git annex find` changing git configs
might be surprising behavior
Since filenames starting with dashes are not super common and the user can
re-init easily enough if their repo needs fixed, I went for the simplest
fix.
Users may want sync to only export, or only import and this is broadly
analagous to push and pull, so it makes sense to use the same
configuration for it.
The branch is only updated once the export is 100% complete. This way,
if an export is started but interrupted and so the remote does not yet
contain some of the files, an import will make a commit on the old
branch, and so won't delete the missing files.
This log, unlike all other current top-level logs, is a new format log.
I have not checked what throwing it at the old log parser did, but it seems
likely it ignored unparsable lines, and so perhaps deleted all lines from
the log.
This fixes a reversion in the ByteString conversion. The old code used
isSpace to decide when the metadata value needs to be base64 encoded,
and that incorrectly changed to only checking if it contained ' '.
Note that only '\n' and '\r' were added and not other sorts of
whitespace that isSpace matches, like '\t' and '\v'. Only the former
would cause problems.
Installing git-annex with stack rsync won't be available.
Also, using the git-annex installer with 64 bit git installs a non-working
rsync binary because it's linked with libraries provided by 32 bit git.
Like with the network-uri split, cabal will automatically turn off the flag
when building with an old network.
I have not tested building with the new network-3.0.0.0 yet; several
other dependencies including aws are still pinned on network-2.*
xporting files with '#' or '?' in their name won't work because urls get
truncated on those. Fail in a better way in this case, and avoid failing
when removing such files from the export, so after the user has renamed the
problem files the export will succeed.
Avoid performing repository fixups for submodules and git-worktrees
when there's a .noannex file that will prevent git-annex from being
used in the repository.
This change is ok as long as the .noannex file is really going to prevent
git-annex from being used. But, init --force could override the file.
Which would result in the repo being initialized without the fixups
having run.
To avoid that situation decided to change init, to not let --force be used
to override a .noannex file. Instead the user can just delete the file.
* fromkey: Added --json.
* fromkey --batch output changed to support using it with --json.
The old output was not parseable for any useful information, so
this is not expected to break anything.
If the worktree file already exists, and is annexed and uses the same
key, avoid failing, nothing needs to be done.
Had to add lookupFileNotHidden to handle the case where an adjust --hide-missing
is in use, and the worktree file was hidden due to the object content
being missing. lookupFile would return the key of the hidden file,
but it makes sense that after fromkey succeeds, the worktree must
contain the file it was supposed to set up.
Need to create the directory after the lock is held, not before.
The other racing process would need to shut down at just the wrong time,
running cleanupOtherTmp.
This commit was sponsored by Boyd Stephen Smith Jr. on Patreon.
This gets back any speed lost in commit
9cebfd7002, and speeds up all uses of S3
remotes that operate on them more than once.
This commit was sponsored by Brett Eisenberg on Patreon.
When key-based retrieval from a S3 remote with exporttree=yes
appendonly=yes fails, fall back to trying to retrieve from the exported
tree. This allows downloads of files that were exported to such a remote
before versioning was enabled on it.
This is useful at least for a transition for users who got into that
situation, so they can download content from their S3 remote. May want to
remove this in the future though, since normally trying to download the
second time is only extra work.
This commit was sponsored by Brock Spratlen on Patreon.
Like the earlier fixed one in Command.Export, it occurred when the same
tree was exported by multiple clones. Previous fix was incomplete since
several other places looked at the list of exported trees to detect when
there was an export conflict. Added a single unified function to avoid
missing any places it needed to be fixed.
This commit was sponsored by mo on Patreon.
Because when git-annex lacks S3 version IDs for files stored in the bucket,
deleting them would cause data loss.
Also because git-annex is not able to download unversioned objects from a bucket
when versioning=yes.
This also prevents setting versioning=no. While that would perhaps be
possible to do safely, it would add complexity, and would mean that if
the user accidentially did enableremote versioning=no, they would not be
able to undo it.
This commit was sponsored by Trenton Cronholm on Patreon.
Needs not yet released version 0.22 of aws library; with older versions
asks the user to configure the bucket versioning themselves.
Note that S3 endpoints that don't support versioning will cause putBucketVersioning
to throw an exception, so initremote will fail.
This commit was sponsored by Jake Vosloo on Patreon.
* webapp: Remove configurator for box.com repository, since their
webdav support is going away at the end of this January.
* webapp: Remove configurator for gitlab, which stopped supporting git-annex
some time ago.
This commit was sponsored by Brock Spratlen on Patreon.
However, rsync still won't work with 64 bit git and
this is still not the documented way to install it.
So, if both 64 and 32 are installed, go with 32.
And if neither git can be found, default to 32.
* Switch to using .git/annex/othertmp for tmp files other than partial
downloads, and make stale files left in that directory when git-annex
is interrupted be cleaned up promptly by subsequent git-annex processes.
* The .git/annex/misctmp directory is no longer used and git-annex will
delete anything lingering in there after it's 1 week old.
Also, in Annex.Ingest, made the filename it uses in the tmp dir be
prefixed with "ingest-" to avoid potentially using a filename used by
some other code.
This will speed up the common case where a Key is deserialized from
disk, but is then serialized to build eg, the path to the annex object.
It means that every place a Key has any of its fields changed, the cache
has to be dropped. I've grepped and found them all. But, it would be
better to avoid that gotcha somehow..
A keyName could contain "/", though this is unlikely and certianly only
ever could happen with WORM keys.
The change to addunused to escape that is no problem at all.
The change to VariantFile to escape it means that different versions of
git-annex could resolve a merge conflict differently in this case, which
is unfortunate. There would be different .variant files used, so the two
resolutions would themselves merge together without additional
conflicts, but the user would have to clean up the extra .variant
files.
Not likely to be any speed gain here, but this completes porting every
log file over.
And, it let me get rid of code copied from ghc and modified, so
simplifying the licensing.
This preserves the workaround for the old bug that caused NoUUID items
to be stored in the log, prefixing log lines with " ". It's now handled
implicitly, by using takeWhile1 (/= ' ') to get the uuid.
There is a behavior change from the old parser, which split the value
into words and then recombined it. That meant that "foo bar" and "foo\tbar"
came out as "foo bar". That behavior was not documented, and seems
surprising; it meant that after a git-annex describe here "foo bar",
you wouldn't get that same string back out when git-annex displayed repo
descriptions.
Otoh, some other parsers relied on the old behavior, and the attoparsec
rewrites had to deal with the issue themselves...
For group.log, there are some edge cases around the user providing a
group name with a leading or trailing space. The old parser would ignore
such excess whitespace. The new parser does too, because the alternative
is to refuse to parse something like " group1 group2 " due to excess
whitespace, which would be even more confusing behavior.
The only git-annex branch log file that is not converted to attoparsec
and bytestring-builder now is transitions.log.
Mostly didn't push the ByteStrings down very deep, but all of these log
files are not written to frequently at all, so slight remaining
innefficiency doesn't matter.
In Logs.UUID, removed the fixBadUUID code that cleaned up after a bug in
git-annex versions 3.20111105-3.20111110. In the unlikely event that a repo was
last touched by that ancient git-annex version, the descriptions of remotes
would appear missing when used with this version of git-annex. That is such minor
breakage, and so unlikely to still be a problem for any repos, that it was not
worth forward-porting that code to ByteString.
Tested on an older ghc by enabling MonadFailDesugaring globally.
In TransferQueue, the lack of a MonadFail for STM exposed what would
normally be a bug in the pattern matching, although in this case an
earlier check that the queue was not empty avoided a pattern match
failure.
This is not as efficient as using ByteStrings throughout, but converting
the String to ByteString is actually significantly faster than the old
parser.
benchmarking parse/old
time 9.657 μs (9.600 μs .. 9.732 μs)
1.000 R² (0.999 R² .. 1.000 R²)
mean 9.703 μs (9.645 μs .. 9.785 μs)
std dev 231.6 ns (161.5 ns .. 323.7 ns)
variance introduced by outliers: 25% (moderately inflated)
benchmarking parse/new
time 834.6 ns (797.1 ns .. 886.9 ns)
0.987 R² (0.976 R² .. 0.999 R²)
mean 816.4 ns (802.7 ns .. 845.1 ns)
std dev 62.39 ns (37.66 ns .. 108.4 ns)
variance introduced by outliers: 82% (severely inflated)
There is a small behavior change from the old parsePOSIXTime,
which accepted any amount of trailing whitespace after the timestamp.
That behavior was not documented, and it doesn't seem anything relied on it.
It used to display the "bad feed content" message indicating there were no
enclosures found, which was misleading when the http request for the feed
failed.
This commit was sponsored by Ewen McNeill on Patreon.
downloadUrl uses meteredFile, which sets up one progress meter,
and Remote.Web also uses metered, so two progress meters are displayed for
the same download.
Reversion introduced with the http-conduit switch in
c34152777b -- I don't know why the extra
call to metered was added there.
When -J is not used, the extra progress meter didn't display,
but an extra blank line did get output, which is also fixed.
This commit was sponsored by John Pellman on Patreon.
init: When --version=5 is passed on a crippled filesystem, use a v5 direct
mode repo as requested, rather than upgrading to v7 adjusted unlocked.
Fixed test suite on crippled filesystems, making it request --version=5
to test direct mode.
Deleting directories is one of the great unsolved problems of CS, thanks to
abominations like NFS lock files and Windows and races with other processes
cleaning up after themselves in the background. The gpg test harness
sometimes failed to delete its temp directory on NFS. Avoid the problem
class by not deleting it at all, and putting it inside the tmp repo being
tested. The test suite's more robust (and/or nonsensical) workarounds for
deleting its test dir will thus be used, hopefully avoiding the problem
until an OS finds a new way to violate POSIX and the laws of nature.
Note that this means that the .gnupg directory will be on whatever
filesystem the test suite is being run on, which may be a lesser quality
filesystem than gpg is really expecting. Gpg does not seem to need to
write sockets etc to there so this seems ok. The only known problem is
that if the filesystem forces a directory mode like 777, gpg will warn
about unsafe home directory perms, but it still works.
This fixes a bug with the numcopies counting when using sync --content.
It did not always pass the local repo uuid to handleDropsFrom, and so the
numcopies counting was off by one, and unwanted local content would only be
dropped when there were numcopies+1 remote copies.
Also, support dropping local content that has reached an
exporttree remote that is not untrusted (currently only S3 remotes
with versioning).
* Fix bug upgrading from direct mode to v7: when files in the repository
were already committed as v7 unlocked files elsewhere, and the
content was present in the direct mode repository, the annexed files
got their full content checked into git.
* Fix bug that caused v7 unlocked files in a direct mode repository
to get locked when committing.
This commit was sponsored by Nick Piper on Patreon.
When a file was already unlocked, but the annex object was present, the
upgrade process populated the unlocked file, but neglected to update the
index.
This commit was sponsored by Jochen Bartl on Patreon.
webdav: When initializing, avoid trying to make a directory at the top of
the webdav server, which could never accomplish anything and failed on
nextcloud servers. (Reversion introduced in version 6.20170925.)
This commit was sponsored by mo on patreon.
No deprecation warning at run time, just one on the man page.
One thing findref remains able to do that find cannot is to run in a bare
repo. Find was made to refuse to run in a bare repo because it seemed
confusing for it to not list any files ever in that situation. It would be
better for find --branch to work in a bare repo but not without --branch
but I don't currently have a way to do that.
Probably a better solution would be to make git-annex in a bare repo
default to --branch master or something like that instead of --all.
This commit was sponsored by Denis Dzyubenko on Patreon.
* findref: Support file matching options: --include, --exclude,
--want-get, --want-drop, --largerthan, --smallerthan, --accessedwithin
* Commands supporting --branch now apply file matching options --include,
--exclude, --want-get, --want-drop to filenames from the branch.
Previously, combining --branch with those would fail to match anything.
* add, import, findref: Support --time-limit.
This commit was sponsored by Jake Vosloo on Patreon.
When public access is used for the remote, it complained that the user
needed to set creds to use it, which was just wrong.
When creds were being used, it fell back from trying to use the version ID
to just accessing the key in the bucket, which was ok for non-export
remotes, but wrong for buckets.
In both cases, display a hopefully useful warning.
This should only come up when an existing S3 remote has been exported
to, and then later versioning was enabled.
Note that it would perhaps be possible to fall back from trying to use
retrieveKeyFile when it fails and instead use retrieveKeyFileFromExport,
which may work when S3 version ID is missing. But there are problems
with that approach; how to tell when retrieveKeyFile has failed due to this
rather than a network problem etc? Anyway, that approach would only work
until the file in the export got overwritten, and then it would no
longer be accessible. And with versioning enabled, the user wants old
versions of objects to remain accessible, so it seems better to warn
about the problem as soon as possible, so they can go back and add S3
version IDs.
This work is supported by the NIH-funded NICEMAN (ReproNim TR&D3) project.
Note that it does not prevent storing p2p access tokens or multicast
encryption keys, since those are not cached; the previous commit
established the distinction.
How well this works depends on how often getRemoteCredPair is called and
how expensive it is. In some cases setting this will result in an annoying
number of gpg password prompts and/or slowdowns due to reading creds
from the git-annex branch and decrypting, which could be improved by calling
getRemoteCredPair less often.
This commit was sponsored by Ilya Shlyakhter on Patreon.
dropunused: When an unused object file has gotten modified, eg due to
annex.thin being set, don't silently skip it, but display a warning and let
--force drop it.
This commit was sponsored by Ethan Aubin.
info: When used with an exporttree remote, includes an "exportedtree" info,
which is the tree last exported to the remote. During an export conflict,
multiple values will be listed.
This commit was sponsored by John Pellman on Patreon.
* init: When a crippled filesystem causes an adjusted unlocked branch to
be used, set repo version to 7, which it neglected to do before.
* init: When on a crippled filesystem, and the git version is too old
to use an adjusted unlocked branch, fall back to using direct mode.
This commit was sponsored by Ilya Shlyakhter on Patreon.
Seems that youtube-dl --get-filename on a playlist lists all the filenames
for the playlist, which can take quite some time. The code already only
took the first name, so --no-playlist can speed it up a lot.
This commit was sponsored by Brett Eisenberg on Patreon.
And added stack-lts-9.9.yaml to support old versions of stack.
The i386 ancient autobuilder needs stack-lts-9.9.yaml; the OSX autobuilder
may also use it for a while, and it's needed to build on eg debian stable.