Rather than the error that occurred when trying to download the unchunked
content, which is less likely to actually be stored in the remote.
Sponsored-by: Boyd Stephen Smith Jr. on Patreon
git does not crash when there's a remote configured for a user who does
not exist, and this prevents git-annex from crashing too. Consider that
a user might exist on one system but not another, and the git repo be
moved between systems. So not crashing is desirable.
Note that git fetch seems to mishandle a remote path like ~foo/bar
when the user does not exist. While it does access ./~foo/bar,
and gets as far as running git-upload-pack on the path,
it then complains there is no such repo. So different parts of git seem
to be doing different things in that edge case. Anyway, git-annex does
not need to be bug-for-bug compatible with git.
Sponsored-by: Jack Hill on Patreon
test: Put gpg temp home directory in system temp directory, not filesystem
being tested.
Since I've found indications gpg can fail talking to the agent when the
socket ends up on eg, fat. And to hopefully fix this bug report I've
followed up on.
The main risk in using the system temp dir is that TMPDIR could be set to a
long directory path, which is too long to put a unix socket in. To
partially ameliorate that risk, it uses either an absolute or a relative
path, whichever is shorter. (Hopefully gpg will not convert it to a longer
form of the path..)
If the user sets TMPDIR to something so long a path to it +
"S.gpg-agent" is too long, I suppose that's their issue to deal with.
Sponsored-by: Dartmouth College's Datalad project
This negotiation is not supported by versions of git-annex older
than 6.20180312. Well, maybe really 6.20180227 or so, but using that in
the changelog simplifies things since it was the version for the other
changes as well.
See commit c81768d425 for the back story.
As well as allowing for future protocol improvements, this will result
in negotiating protocol version 1, which is an improvement over the default
version 0.
In fact, it looks like no supported version of git-annex will use
protocol version 0, since version 1 was introduced in 6.20180227.
Still, removing the code for version 0 seems unnecessary.
See commit 31e1adc005.
Sponsored-by: Brett Eisenberg on Patreon.
* Removed support for accessing git remotes that use versions of
git-annex older than 6.20180312.
* git-annex-shell: Removed several commands that were only needed to
support git-annex versions older than 6.20180312.
(lockcontent, recvkey, sendkey, transferinfo, commit)
The P2P protocol was added in that version, and used ever since, so
this code was only needed for interop with older versions.
"git-annex-shell commit" is used by newer git-annex versions, though
unnecessarily so, because the p2pstdio command makes a single commit at
shutdown. Luckily, it was run with stderr and stdout sent to /dev/null,
and non-zero exit status or other exceptions were caught and ignored. So
it could be removed from git-annex-shell too.
git-annex-shell inannex, recvkey, sendkey, and dropkey are still used by
gcrypt special remotes accessed over ssh, so those had to be kept.
It would probably be possible to convert that to using the P2P protocol,
but it would be another multi-year transition.
Some git-annex-shell fields were able to be removed. I hoped to remove
all of them, and the very concept of them, but unfortunately autoinit
is used by git-annex sync, and gcrypt uses remoteuuid.
The main win here is really in Remote.Git, removing piles of hairy fallback
code.
Sponsored-by: Luke Shumaker
This improves the borg special remote memory usage, by
letting it only load one archive's worth of filenames into memory at a
time, and building up a larger tree out of the chunks.
When a borg repository has many archives, git-annex could easily OOM
before. Now, it will use only memory proportional to the number of
annexed keys in an archive.
Minor implementation wart: Each new chunk re-opens the content
identifier database, and also a new vector clock is used for each chunk.
This is a minor inefficiency only; the use of continuations makes
it hard to avoid, although putting the database handle into a Reader
monad would be one way to fix it.
It may later be possible to extend the ImportableContentsChunkable
interface to remotes that are not third-party populated. However, that
would perhaps need an interface that does not use continuations.
The ImportableContentsChunkable interface currently does not allow
populating the top of the tree with anything other than subtrees. It
would be easy to extend it to allow putting files in that tree, but borg
doesn't need that so I left it out for now.
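Roughly, the shape of the chunked interface is something like the
following sketch (hypothetical, simplified types, not the actual
ImportableContentsChunkable definition):
    -- Each chunk carries the subtree it populates, the files in that
    -- chunk, and a continuation that produces the next chunk, so only
    -- one archive's worth of filenames is held in memory at a time.
    data ImportChunk loc cid = ImportChunk
        { chunkSubTree :: [String]
        , chunkContents :: [(loc, cid)]
        , nextChunk :: Maybe (IO (ImportChunk loc cid))
        }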
Sponsored-by: Noam Kremen on Patreon
When the progress display gets longer, and then shorter again, it causes
the cursor to jitter back and forth. Somehow I never noticed this until
this morning, but then it became intolerable to watch.
To fix it, pad the progress display to the maximum length it's occupied.
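The fix amounts to something like this sketch (illustrative only): track
the widest rendering seen so far and pad shorter ones out to that width.
    padProgress :: Int -> String -> (Int, String)
    padProgress prevwidth s = (w, s ++ replicate (w - length s) ' ')
      where
        w = max prevwidth (length s)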
Sponsored-by: Svenne Krap on Patreon
This adds the overhead of a copy whenever converting to/from ExportLocation and
ImportLocation.
borg: Some improvements to memory use when importing a lot of archives.
(It's still pretty bad.)
Sponsored-by: Mark Reidenbach on Patreon
Commit 4bf7940d6b introduced this
problem, but was otherwise doing a good thing. Problem being
that fileRef "/foo" used to return ":./foo", which was actually wrong,
but as long as there was no foo in the local repository, catKey
could operate on it without crashing. After that fix though, the relative
path would be eg "../../foo", resulting in fileRef returning
":./../../foo", which will make git cat-file crash since that's
not a valid path in the repo.
Fix is simply to make fileRef detect paths outside the repo and return
Nothing. Then catKey can be skipped. This needed several bugfixes to
dirContains as well, in previous commits.
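Conceptually the check is along these lines (a simplified sketch with a
made-up name, not the real fileRef or dirContains):
    import System.FilePath (isRelative, splitDirectories)

    -- Only produce a ":./path" ref for paths inside the repository's
    -- working tree; for anything else return Nothing so catKey is skipped.
    fileRefSketch :: FilePath -> Maybe String
    fileRefSketch p
        | isRelative p && ".." `notElem` splitDirectories p = Just (":./" ++ p)
        | otherwise = Nothing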
In Command.Smudge, this led to needing to check for Nothing. That case
should actually never happen, because the fileoutsiderepo check will
detect it earlier.
Sponsored-by: Brock Spratlen on Patreon
sync --content: Avoid a redundant checksum of a file that was
incrementally verified, when used on NTFS and perhaps other filesystems.
When sync has just gotten the content, it does not need to check inAnnex a
second time. On NTFS, for some reason the write of the inode cache after
it gets the content is not immediately able to be read, and with an
empty/non-matching inode cache due to that stale data, inAnnex falls back
to hashing the whole object to determine if it's present.
Sponsored-by: Brock Spratlen on Patreon
This method avoids breaking test_readonly. Just check if the dest file
exists, and avoid CoW probing when it does, so when CoW probing fails,
it can resume where the previous non-CoW copy left off.
If CoW has been probed already to work, delete the dest file
since a CoW copy will presumably work. It seems like it would be almost
as good to just skip CoW copying in this case too, but consider that the
dest file might have started to be copied from some other remote, not
using CoW, but CoW has been probed to work to copy from the current
place.
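Sketched out, the decision looks roughly like this (hypothetical helper,
not the actual Annex.CopyFile code):
    import System.Directory (doesFileExist, removeFile)

    data CoWStatus = CoWProbedWorks | CoWUnknown

    -- Returns True when a CoW copy should be attempted.
    prepareDest :: CoWStatus -> FilePath -> IO Bool
    prepareDest cowstatus dest = do
        exists <- doesFileExist dest
        case (cowstatus, exists) of
            -- CoW is known to work, so the partial file is not needed;
            -- remove it and let the CoW copy replace it.
            (CoWProbedWorks, True) -> removeFile dest >> return True
            -- CoW has not been probed; keep the partial file so a
            -- non-CoW copy can resume where it left off.
            (CoWUnknown, True) -> return False
            -- No dest file yet, safe to probe or copy with CoW.
            _ -> return True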
Sponsored-by: Dartmouth College's Datalad project
commit 63d508e885 broke test_readonly.
When a local git remote is readonly, tryCopyCoW run to copy a file
from it failed at withOtherTmp.
Sponsored-by: Dartmouth College's Datalad project
Disabling git-annex branch update for this command is
ok, because it does not use any information from the branch,
but only logs the location when it adds a key.
Sponsored-by: Dartmouth College's Datalad project
The new method is much better. It avoids unrestrained transfer at the
beginning (except for the first block), and keeps right at or a few kb/s
below the configured limit, with very little variation in the actual
reported bandwidth.
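The general shape of the approach is something like this sketch
(illustrative only, not the actual metering code): the progress callback
sleeps whenever the observed rate gets ahead of the configured limit.
    import Control.Concurrent (threadDelay)
    import Control.Monad (when)
    import Data.Time.Clock.POSIX (POSIXTime, getPOSIXTime)

    throttle :: Integer -> POSIXTime -> Integer -> IO ()
    throttle bytespersecond starttime bytessofar = do
        now <- getPOSIXTime
        let elapsed = realToFrac (now - starttime) :: Double
        let allowed = fromIntegral bytespersecond * elapsed
        let ahead = fromIntegral bytessofar - allowed
        when (ahead > 0) $
            -- Sleep just long enough for the allowance to catch up.
            threadDelay (ceiling (ahead / fromIntegral bytespersecond * 1000000))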
Removed the /s part of the config as it's not needed.
Ready to merge.
Sponsored-by: Luke Shumaker on Patreon
Bug fix: Git configs such as annex.verify were incorrectly overriding
per-remote git configs such as remote.name.annex-verify.
This dates all the way back to 2013,
commit 8a5b397ac4,
where hlint apparently somehow confused me into parsing in the wrong
order. Before that it was correct.
Amazing that no one has noticed until now.
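The intended precedence is simply the per-remote setting first, with the
global one as fallback, as in this sketch (hypothetical lookup, not the
actual Types.GitConfig parser); the 2013 bug had the two lookups swapped:
    import Control.Applicative ((<|>))

    getVerifyConfig :: String -> [(String, Bool)] -> Maybe Bool
    getVerifyConfig remotename config =
        lookup ("remote." ++ remotename ++ ".annex-verify") config
            <|> lookup "annex.verify" config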
Sponsored-by: Kevin Mueller on Patreon
Probably this fixes a reversion, but I don't know what version broke it.
This does use withOtherTmp for a temp file that could be quite large,
although it is a reflink copy that will not actually take up any space
as long as the file it was copied from still exists. So if the CoW copy
succeeds but git-annex is interrupted just before that temp file gets
renamed into the usual .git/annex/tmp/ location, there is a risk that
the other temp directory ends up cluttered with a larger temp file than
usual. It will eventually be cleaned up, and the chances of this being
a problem are small, so this seems like an acceptable thing to do.
Sponsored-by: Shae Erisson on Patreon
Added annex.bwlimit and remote.name.annex-bwlimit config that works for git
remotes and many but not all special remotes.
This nearly works, at least for a git remote on the same disk. With it set
to 100kb/1s, the meter displays an actual bandwidth of 128 kb/s, with
occasional spikes to 160 kb/s. So it needs to delay just a bit longer...
I'm unsure why.
However, at the beginning a lot of data flows before it determines the
right bandwidth limit. A granularity of less than 1s would probably improve
that.
And, I don't know yet if it makes sense to have it be 100kb/1s rather than
100kb/s. Is there a situation where the user would want a larger
granularity? Does granularity need to be configurable at all? I really only
used that format for the config in order to reuse an existing parser.
This can't support external special remotes, or ones that themselves
shell out to an external command. (Well, it could, but it would involve
pausing and resuming the child process tree, which seems very hard to
implement and very strange besides.) There could also be some built-in
special remotes that it still doesn't work for, due to them not having a
progress meter whose display blocks the bandwidth-using thread. But I
don't think any of them actually run downloads in a separate thread from
the thread that displays the progress meter.
Sponsored-by: Graham Spencer on Patreon
This should complete the fix started in
6329997ac4, fixing the actual cause of the
test suite failure this time.
Sponsored-by: Dartmouth College's Datalad project
* When downloading urls fail, explain which urls failed for which
reasons.
* web: Avoid displaying a warning when downloading one url failed
but another url later succeeded.
Some other uses of downloadUrl use urls that are effectively internal use,
and should not all be displayed to the user on failure. Eg, Remote.Git
tries different urls where content could be located depending on how the
remote repo is set up. Exposing those urls to the user would lead to wild
goose chases. So had to parameterize it to control whether it displays urls
or not.
A side effect of this change is that when there are some youtube urls
and some regular urls, it will try regular urls first, even if the
youtube urls are listed first. This seems like an improvement if
anything, but in any case there's no defined order of urls that it's
supposed to use.
Sponsored-by: Dartmouth College's Datalad project
And fail with an informative message.
I don't think ACLs can prevent removing the write bit, but I'm not sure,
so kept it mentioning them as a possibility.
Should git-annex lock also check if the write bits are able to be removed?
Maybe, but the case I know about with xattrs involves cp -a copying NFS
xattrs, and it's the copy of the file that is the problem. So when locking
a file, I guess it will not be the copy.
Sponsored-by: Dartmouth College's Datalad project
New --batch-keys option added to these commands: get, drop, move, copy, whereis
git-annex-matching-options had to be reworded since some of its options
can be used to match on keys, not only files.
Sponsored-by: Luke Shumaker on Patreon
It would be better if the Arbitrary instance avoided generating impossible
filenames like "foo/c:bar", but probably this is the only place that splits
the file from the directory and then uses the file without the directory,
at least in the quickcheck properties.
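One way that could look, as a hedged sketch (hypothetical newtype, not the
actual test suite instance):
    import Test.QuickCheck

    newtype SafeFilePath = SafeFilePath FilePath
        deriving (Show)

    instance Arbitrary SafeFilePath where
        -- Avoid ':' so impossible names like "foo/c:bar" are never generated.
        arbitrary = SafeFilePath <$>
            arbitrary `suchThat` (\p -> not (null p) && ':' `notElem` p)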
Sponsored-by: Svenne Krap on Patreon
This was unlikely to cause any problem, but it is unsightly to mention
normally hidden refs, and it might have done a bit of unnecessary work to
check that ref.
Sponsored-by: Noam Kremen on Patreon
And that should be all the special remotes supporting it on linux now,
except for in the odd edge case here and there.
Sponsored-by: Dartmouth College's DANDI project
Except when configuration makes curl be used. It did not seem worth
trying to tail the file when curl is downloading.
But when an interrupted download is resumed, it does not read the whole
existing file to hash it. Same reason discussed in
commit 7eb3742e4b76d1d7a487c2c53bf25cda4ee5df43; that could take a long
time with no progress being displayed. And also there's an open http
request, which needs to be consumed; taking a long time to hash the file
might cause it to time out.
Also in passing implemented it for git and external special remotes when
downloading from the web. Several others like S3 are within striking
distance now as well.
Sponsored-by: Dartmouth College's DANDI project
Added fileRetriever', which will let the remaining special remotes
eventually also support incremental verify.
Sponsored-by: Dartmouth College's DANDI project
It uses tailVerify to hash the file while it's being written.
This is able to sometimes avoid a separate checksum step. Although
if the file gets written quickly enough, tailVerify may not see it
get created before the write finishes, and the checksum still happens.
Testing with the directory special remote, incremental checksumming did
not happen. But then I disabled the copy CoW probing, and it did work.
What's going on with that is the CoW probe creates an empty file on
failure, then deletes it, and then the file is created again. tailVerify
will open the first, empty file, and so fails to read the content that
gets written to the file that replaces it.
The directory special remote really ought to be able to avoid needing to
use tailVerify, and while other special remotes could do things that
cause similar problems, they probably don't. And if they do, it just
means the checksum doesn't get done incrementally.
Sponsored-by: Dartmouth College's DANDI project
Simply feed each chunk in turn to the incremental verifier.
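In spirit it's just a fold of the chunks into an incremental hash, as in
this sketch using cryptonite directly (the real code goes through
git-annex's incremental verifier interface):
    import Crypto.Hash (Digest, SHA256, hashInit, hashUpdate, hashFinalize)
    import qualified Data.ByteString as B

    -- Hash the chunks as they are retrieved, so no separate checksum
    -- pass over the assembled file is needed.
    hashChunks :: [B.ByteString] -> Digest SHA256
    hashChunks = hashFinalize . foldl hashUpdate hashInit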
When resuming an interrupted retrieve, it does not do incremental
verification. That would need to read the file, up to the resume point,
and feed it to the incremental verifier. That seems easy to get wrong.
Also it would mean extra work done before the transfer can start. Which
would complicate displaying progress, and would perhaps not appear to the
user as if it was resuming from where it left off. Instead, in that
situation, return UnVerified, and let the verification be done in a
separate pass.
Granted, Annex.CopyFile does manage all that, but it's not complicated
by dealing with chunks too.
Sponsored-by: Dartmouth College's DANDI project
Several special remotes verify content while it is being retrieved,
avoiding a separate checksum pass. They are: S3, bup, ddar, and
gcrypt (with a local repository).
Not done when using chunking, yet.
Complicated by Retriever needing to change to be polymorphic. Which in turn
meant RankNTypes is needed, and also needed some code changes. The
change in Remote.External does not change behavior at all but avoids
the type checking failing because of a "rigid, skolem type" which
"would escape its scope". So I refactored slightly to make the type
checker's job easier there.
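For flavor, the shape of such a polymorphic type is roughly this (heavily
simplified and hypothetical, not the real Retriever): the inner forall,
letting the callback pick its own result type, is what requires RankNTypes.
    {-# LANGUAGE RankNTypes #-}

    type Retriever key content =
        forall a. key -> (content -> IO a) -> IO a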
Unfortunately, directory uses fileRetriever (except when chunked),
so it is not among the improved ones. Fixing that would need a way for
FileRetriever to return a Verification. But, since the file retrieved
may be encrypted or chunked, it would be extra work to always
incrementally checksum the file while retrieving it. Hm.
Some other special remotes use fileRetriever, and so don't get incremental
verification, but could be converted to byteRetriever later. One is
GitLFS, which uses downloadConduit, which writes to the file, so could
verify as it goes. Other special remotes like web could too, but don't
use Remote.Helper.Special and so will need to be addressed separately.
Sponsored-by: Dartmouth College's DANDI project
* Deal with clock skew, both forwards and backwards, when logging
information to the git-annex branch.
* GIT_ANNEX_VECTOR_CLOCK can now be set to a fixed value (eg 1)
rather than needing to be advanced each time a new change is made.
* Misuse of GIT_ANNEX_VECTOR_CLOCK will no longer confuse git-annex.
When changing a file in the git-annex branch, the vector clock to use is now
determined by first looking at the current time (or GIT_ANNEX_VECTOR_CLOCK
when set), and comparing it to the newest vector clock already in use in
that file. If a newer time stamp was already in use, advance it forward by
a second instead.
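A minimal sketch of that rule (simplified representation, not the real
vector clock type):
    newtype VectorClock = VectorClock Integer  -- seconds, simplified
        deriving (Eq, Ord, Show)

    -- Use the current clock unless the log already contains a newer
    -- timestamp; in that case advance the newest one by a second.
    advanceVectorClock :: VectorClock -> [VectorClock] -> VectorClock
    advanceVectorClock now inuse
        | null newer = now
        | otherwise = let VectorClock n = maximum newer in VectorClock (n + 1)
      where
        newer = filter (> now) inuse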
When the clock is set to a time in the past, this avoids logging with
an old timestamp, which would risk that log line later being ignored in favor
of "newer" line that is really not newer.
When a log entry has been made with a clock that was set far ahead in the
future, this avoids newer information being logged with an older timestamp
and so being ignored in favor of that future-timestamped information.
Once all clocks get fixed, this will result in the vector clocks being
incremented, until finally enough time has passed that time gets back ahead
of the vector clock value, and then it will return to usual operation.
(This latter situation is not ideal, but it seems the best that can be done.
The issue with it is, since all writers will be incrementing the last
vector clock they saw, there's no way to tell when one writer made a write
significantly later in time than another, so the earlier write might
arbitrarily be picked when merging. This problem is why git-annex uses
timestamps in the first place, rather than pure vector clocks.)
Advancing forward by 1 second is somewhat arbitrary. setDead
advances a timestamp by just 1 picosecond, and the vector clock could
too. But then it would interfere with setDead, which wants to be
overruled by any change. So it could use 2 picoseconds or something,
but that seems weird. It could just as well advance it forward by a
minute or whatever, but then it would be harder for real time to catch
up with the vector clock when forward clock slew had happened.
A complication is that many log files contain several different pieces of
information, and it may be best to only use vector clocks for the same piece
of information. For example, a key's location log file contains
InfoPresent/InfoMissing for each UUID, and it only looks at the vector
clocks for the UUID that is being changed, and not other UUIDs.
Although exactly where the dividing line is can be hard to determine.
Consider metadata logs, where a field "tag" can have multiple values set
at different times. Should it advance forward past the last tag?
Probably. What about when a different field is set, should it look at
the clocks of other fields? Perhaps not, but currently it does, and
this does not seem like it will cause any problems.
Another one I'm not entirely sure about is the export log, which is
keyed by (fromuuid, touuid). So if multiple repos are exporting to the
same remote, different vector clocks can be used for that remote.
It looks like that's probably ok, because it does not try to determine
what order things occurred when there was an export conflict.
Sponsored-by: Jochen Bartl on Patreon
git-annex get when run as the first git-annex command in a new repo did not
populate unlocked files. (Reversion in version 8.20210621)
I am not entirely happy with this, because I don't understand how
428c91606b caused the problem in the first
place, and I don't fully understand how skipping calling scanAnnexedFiles
during autoinit avoids the problem.
Kept the explicit call to scanAnnexedFiles during git-annex init,
so that when reconcileStaged is expensive, it can be made to run then,
rather than at some later point when the information is needed.
Sponsored-by: Brock Spratlen on Patreon
Eg, showImprecise 1 1.99 returned "1.1" rather than "2". The 9 rounded
upward to 10, and that was wrongly used as the decimal, rather than
carrying the 1.
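A hedged sketch of the corrected rounding (illustrative, close in spirit
but not necessarily identical to the actual fix):
    showImprecise :: Int -> Double -> String
    showImprecise precision n
        | remainder' == 0 = show int'
        | otherwise = show int' ++ "." ++ dropTrailing0s (pad (show remainder'))
      where
        int = floor n :: Integer
        remainder = round ((n - fromIntegral int) * 10 ^ precision) :: Integer
        -- When the fraction rounds all the way up (eg 1.99 at precision 1),
        -- carry the 1 into the integer part rather than using the 10 as
        -- decimal digits.
        (int', remainder')
            | remainder == 10 ^ precision = (int + 1, 0)
            | otherwise = (int, remainder)
        pad s = replicate (precision - length s) '0' ++ s
        dropTrailing0s = reverse . dropWhile (== '0') . reverse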
Sponsored-by: Jack Hill on Patreon
An easy way to see this in action is to have an unlocked file, and touch the
object file.
While all code that compares inode caches for object files needs to be
prepared for this kind of problem and fall back to verification, having
fsck notice it and correct it is cheap (as long as fsck is being run
anyway) and ensures that if it happens for some unusual reason, there's a
way for the user to notice that it's happening.
Note that, when annex.thin is in use, the earlier call to isUnmodified
(and also potentially earlier calls to inAnnex in eg, verifyLocationLog)
will fix up the same problem silently. That might prevent the warning
being displayed, although probably it still will be, because the
Database.Keys write of the InodeCache will be queued but will not have
happened yet. I can't see a way to improve this, but it's not great.
Sponsored-by: Dartmouth College's Datalad project
This is a result of an audit of every use of getInodeCaches,
to find places that misbehave when the annex object is not in the inode
cache, despite pointer files for the same key being in the inode cache.
Unfortunately, that is the case for objects that were in v7 repos that
upgraded to v8. Added a note about this gotcha to getInodeCaches.
Database.Keys.reconcileStaged, when annex.thin is set, would fail to
populate pointer files in this situation. Changed it to check if the
annex object is unmodified the same way inAnnex does, falling back to a
checksum if the inode cache is not recorded.
Sponsored-by: Dartmouth College's Datalad project
Fix bug that caused some transfers to incorrectly fail with "content
changed while it was being sent", when the content was not changed.
While I don't know how to reproduce the problem that several people
reported, it is presumably due to the inode cache somehow being stale.
So check isUnmodified', and if it's not modified, include the file's
current inode cache in the set to accept, when checking for modification
after the transfer.
That seems like the right thing to do for another reason: The failure
says the file changed while it was being sent, but if the object file was
changed before the transfer started, that's wrong. So it needs to check
before allowing the transfer at all if the file is modified.
(Other calls to sameInodeCache or elemInodeCaches, when operating on inode
caches from the database, could also be problematic if the inode cache is
somehow getting stale. This does not address such problems.)
Sponsored-by: Dartmouth College's Datalad project