Improve annex.stalldetection to handle remotes that update progress less
frequently than the configured time period.
In particular, this makes remotes that don't report progress but are
chunked work when transferring a single chunk takes longer than the
specified time period.
Any remotes that just have very low update granulatity would also be
handled by this.
The change to Remote.Helper.Chunked avoids an extra progress update when
resuming an interrupted upload. In that case, the code saw first Nothing
and then Just the already transferred number of bytes, which defeated this
new heuristic. This change will mean that, when resuming an interrupted
upload to a chunked remote that does not do its own progress reporting, the
progress display does not start out displaying the amount sent so far,
until after the first chunk is sent. This behavior change does not seem
like a major problem.
About the scalefudgefactor, it seems reasonable to expect subsequent chunks
to take no more than 1.5 times as long as the first chunk to transfer.
Could set it to 1, but then any chunk taking a little longer would be
treated as a stall. 2 also seems a likely value. Even 10 might be fine?
Sponsored-by: Dartmouth College's DANDI project
Improve disk free space checking when transferring unsized keys to
local git remotes. Since the size of the object file is known, can
check that instead.
Getting unsized keys from local git remotes does not check the actual
object size. It would be harder to handle that direction because the size
check is run locally, before anything involving the remote is done. So it
doesn't know the size of the file on the remote.
Also, transferring unsized keys to other remotes, including ssh remotes and
p2p remotes don't do disk size checking for unsized keys. This would need a
change in protocol.
(It does seem like it would be possible to implement the same thing for
directory special remotes though.)
In some sense, it might be better to not ever do disk free checking for
unsized keys, than to do it only sometimes. A user might notice this
direction working and consider it a bug that the other direction does not.
On the other hand, disk reserve checking is not implemented for most
special remotes at all, and yet it is implemented for a few, which is also
inconsistent, but best effort. And so doing this best effort seems to make
some sense. Fundamentally, if the user wants the size to always be checked,
they should not use unsized keys.
Sponsored-by: Brock Spratlen on Patreon
This works well, and it interoperates with gpg in my testing (although some
SOP commands might choose to use a profile that does not so caveat emptor).
Note that for creating the Cipher, gpg --gen-random is still used. SOP
does not have an eqivilant, and as long as the user has gpg around,
which seems likely, it doesn't matter that it uses gpg here, it's not being
used for encryption. That seemed better than implementing a second way
to get high quality entropy, at least for now.
The need for the sop command to run in an empty directory has each call
to encrypt and decrypt creating a new temporary directory. That is some
unncessary overhead, though probably swamped by the overhead of running
the sop command. This could be improved in the future by passing an
already empty directory to them, or a sufficiently empty directory
(.git/annex/tmp would probably suffice).
Sponsored-by: Brett Eisenberg on Patreon
This allows getting rid of the ugly and error prone handling of
"bag of bytes" String in Remote.Helper.Encryptable.
Avoiding breakage like that dealt with by commit
9862d64bf9
And allows converting Utility.Gpg to use ByteString for IO, which is
a welcome change.
Tested the new git-annex interoperability with old, using all 3
encryption= types.
Sponsored-By: the NIH-funded NICEMAN (ReproNim TR&D3) project
Note the use of fromString and toString from Data.ByteString.UTF8 dated
back to commit 9b93278e8a. Back then it
was using the dataenc package for base64, which operated on Word8 and
String. But with the switch to sandi, it uses ByteString, and indeed
fromB64' and toB64' were already using ByteString without that
complication. So I think there is no risk of such an encoding related
breakage.
I also tested the case that 9b93278e8a
fixed:
git-annex metadata -s foo='a …' x
git-annex metadata x
metadata x
foo=a …
In Remote.Helper.Encryptable, it was avoiding using Utility.Base64
because of that UTF8 conversion. Since that's no longer done, it can
just use it now.
Fix behavior when importing a tree from a directory remote when the
directory does not exist. An empty tree was imported, rather than the
import failing. Merging that tree would delete every file in the
branch, if those files had been exported to the directory before.
The problem was that dirContentsRecursive returned [] when the directory
did not exist. Better for it to throw an exception. But in commit
74f0d67aa3 back in 2012, I made it never
theow exceptions, because exceptions throw inside unsafeInterleaveIO become
untrappable when the list is being traversed.
So, changed it to list the contents of the directory before entering
unsafeInterleaveIO. So exceptions are thrown for the directory. But still
not if it's unable to list the contents of a subdirectory. That's less of a
problem, because the subdirectory does exist (or if not, it got removed
after being listed, and it's ok to not include it in the list). A
subdirectory that has permissions that don't allow listing it will have its
contents omitted from the list still.
(Might be better to have it return a type that includes indications of
errors listing contents of subdirectories?)
The rest of the changes are making callers of dirContentsRecursive
use emptyWhenDoesNotExist when they relied on the behavior of it not
throwing an exception when the directory does not exist. Note that
it's possible some callers of dirContentsRecursive that used to ignore
permissions problems listing a directory will now start throwing exceptions
on them.
The fix to the directory special remote consisted of not making its
call in listImportableContentsM use emptyWhenDoesNotExist. So it will
throw an exception as desired.
Sponsored-by: Joshua Antonishen on Patreon
Reading from the cidsdb is responsible for about 25% of the runtime of
an import. Since the cidmap is used to store the same information in
ram, the cidsdb is not written to during an import any longer. And so,
if it started off empty (and updateFromLog wasn't needed), those reads
can just be skipped.
This is kind of a cheesy optimisation, since after any import from any
special remote, the database will no longer be empty, so it's a single
use optimisation. But it's probably not uncommon to start by importing a
lot of files, and it can save a lot of time then.
Sponsored-by: Brock Spratlen on Patreon
This makes annexFileMode be just an application of setAnnexPerm',
which avoids having 2 functions that do different versions of the same
thing.
Fixes some buggy behavior for some combinations of core.sharedRepository
and umask.
Sponsored-by: Jack Hill on Patreon
This does, as a side effect, make long notes in json output not
be indented. The indentation is only needed to offset them
underneath the display of the file they apply to, so that's ok.
Sponsored-by: Brock Spratlen on Patreon
Converted warning and similar to use StringContainingQuotedPath. Most
warnings are static strings, some do refer to filepaths that need to be
quoted, and others don't need quoting.
Note that, since quote filters out control characters of even
UnquotedString, this makes all warnings safe, even when an attacker
sneaks in a control character in some other way.
When json is being output, no quoting is done, since json gets its own
quoting.
This does, as a side effect, make warning messages in json output not
be indented. The indentation is only needed to offset warning messages
underneath the display of the file they apply to, so that's ok.
Sponsored-by: Brett Eisenberg on Patreon
giveup changed to filter out control characters. (It is too low level to
make it use StringContainingQuotedPath.)
error still does not, but it should only be used for internal errors,
where the message is not attacker-controlled.
Changed a lot of existing error to giveup when it is not strictly an
internal error.
Of course, other exceptions can still be thrown, either by code in
git-annex, or a library, that include some attacker-controlled value.
This does not guard against those.
Sponsored-by: Noam Kremen on Patreon
Works around this bug in unix-compat:
https://github.com/jacobstanley/unix-compat/issues/56
getFileStatus and other FilePath using functions in unix-compat do not do
UNC conversion on Windows.
Made Utility.RawFilePath use convertToWindowsNativeNamespace to do the
necessary conversion on windows to support long filenames.
Audited all imports of System.PosixCompat.Files to make sure that no
functions that operate on FilePath were imported from it. Instead, use
the equvilants from Utility.RawFilePath. In particular the
re-export of that module in Common had to be removed, which led to lots
of other changes throughout the code.
The changes to Build.Configure, Build.DesktopFile, and Build.TestConfig
make Utility.Directory not be needed to build setup. And so let it use
Utility.RawFilePath, which depends on unix, which cannot be in
setup-depends.
Sponsored-by: Dartmouth College's Datalad project
Combined with commit 0ffc59d341, this
fixes the case where there are duplicate files on the special remote,
and the first gets modified/deleted, while the second is still present.
directory, adb: Fixed a bug when importtree=yes, and multiple files in the
special remote have the same content, that caused it to refuse to get a
file from the special remote, incorrectly complaining that it had changed,
due to only accepting the inode+mtime of one file (that was since modified
or deleted) and not accepting the inode+mtime of other duplicate files.
Sponsored-by: Max Thoursie on Patreon
This partly fixes an issue where there are duplicate files in the
special remote, and the first file gets swapped with another duplicate,
or deleted. The swap case is fixed by this, the deleted case will need
other changes.
This makes retrieveExportWithContentIdentifier take a list of allowed
ContentIdentifier, same as storeExportWithContentIdentifier,
removeExportWithContentIdentifier, and
checkPresentExportWithContentIdentifier.
Of the special remotes that support importtree, borg is a special case
and does not use content identifiers, S3 I assume can't get mixed up
like this, directory certainly has the problem, and adb also appears to
have had the problem.
Sponsored-by: Graham Spencer on Patreon
WIP: This is mostly complete, but there is a problem: createDirectoryUnder
throws an error when annex.dbdir is set to outside the git repo.
annex.dbdir is a workaround for filesystems where sqlite does not work,
due to eg, the filesystem not properly supporting locking.
It's intended to be set before initializing the repository. Changing it
in an existing repository can be done, but would be the same as making a
new repository and moving all the annexed objects into it. While the
databases get recreated from the git-annex branch in that situation, any
information that is in the databases but not stored in the branch gets
lost. It may be that no information ever gets stored in the databases
that cannot be reconstructed from the branch, but I have not verified
that.
Sponsored-by: Dartmouth College's Datalad project
Some small wins, almost certianly swamped by the system calls, but still
worthwhile progress on the RawFilePath conversion.
Sponsored-by: Erik Bjäreholt on Patreon
Fix retrival of an empty file that is stored in a special remote with
chunking enabled.
The speculative chunk stuff caused a reversion by adding an empty list for
the empty file. Which is just wrong; the empty file is still stored on the
remote, and should be retrieved like any other file. It uses 1 chunk, so
`max 1` is the simple fix.
Sponsored-by: Noam Kremen on Patreon
Commit 36133f27c0 had a boolean flip in it,
aaargh.
Special remotes with importtree=yes or exporttree=yes are once again
treated as untrusted, since files stored in them can be deleted or modified
at any time.
Sponsored-by: Kevin Mueller on Patreon
None of the special remotes do it yet, but this lays the groundwork.
Added MustFinishIncompleteVerify so that, when an incremental verify is
started but not complete, it can be forced to finish it. Otherwise, it
would have skipped doing it when verification is disabled, but
verification must always be done when retrievin from export remotes
since files can be modified during retrieval.
Note that retrieveExportWithContentIdentifier doesn't support incremental
verification yet. And I'm not sure if it can -- it doesn't know the Key
before it downloads the content. It seems a new API call would need to
be split out of that, which is provided with the key.
Sponsored-by: Dartmouth College's Datalad project
This was needed when supporting old git-annex-shell that do not support
p2pstdio yet, in order to cleanly fall back to the old interface without
error messages being displayed. That is no longer supported, so simplify
to not intercept error messages.
Sponsored-by: Dartmouth College's DANDI project
git-lfs: Fix interoperability with gitlab's implementation of the git-lfs
protocol, which requests Content-Encoding chunked.
Sponsored-by: Dartmouth College's Datalad project
Rather than the error that occurred when trying to download the unchunked
content, which is less likely to actually be stored in the remote.
Sponsored-by: Boyd Stephen Smith Jr. on Patreon
* Removed support for accessing git remotes that use versions of
git-annex older than 6.20180312.
* git-annex-shell: Removed several commands that were only needed to
support git-annex versions older than 6.20180312.
(lockcontent, recvkey, sendkey, transferinfo, commit)
The P2P protocol was added in that version, and used ever since, so
this code was only needed for interop with older versions.
"git-annex-shell commit" is used by newer git-annex versions, though
unnecessarily so, because the p2pstdio command makes a single commit at
shutdown. Luckily, it was run with stderr and stdout sent to /dev/null,
and non-zero exit status or other exceptions are caught and ignored. So,
that was able to be removed from git-annex-shell too.
git-annex-shell inannex, recvkey, sendkey, and dropkey are still used by
gcrypt special remotes accessed over ssh, so those had to be kept.
It would probably be possible to convert that to using the P2P protocol,
but it would be another multi-year transition.
Some git-annex-shell fields were able to be removed. I hoped to remove
all of them, and the very concept of them, but unfortunately autoinit
is used by git-annex sync, and gcrypt uses remoteuuid.
The main win here is really in Remote.Git, removing piles of hairy fallback
code.
Sponsored-by: Luke Shumaker
This improves the borg special remote memory usage, by
letting it only load one archive's worth of filenames into memory at a
time, and building up a larger tree out of the chunks.
When a borg repository has many archives, git-annex could easily OOM
before. Now, it will use only memory proportional to the number of
annexed keys in an archive.
Minor implementation wart: Each new chunk re-opens the content
identifier database, and also a new vector clock is used for each chunk.
This is a minor innefficiency only; the use of continuations makes
it hard to avoid, although putting the database handle into a Reader
monad would be one way to fix it.
It may later be possible to extend the ImportableContentsChunkable
interface to remotes that are not third-party populated. However, that
would perhaps need an interface that does not use continuations.
The ImportableContentsChunkable interface currently does not allow
populating the top of the tree with anything other than subtrees. It
would be easy to extend it to allow putting files in that tree, but borg
doesn't need that so I left it out for now.
Sponsored-by: Noam Kremen on Patreon
RemoteGitConfig parsing looks for annex.bwlimit when a remote
does not have a per-remote config for it, so no need for a separate
gobal config.
Sponsored-by: Svenne Krap on Patreon
Added annex.bwlimit and remote.name.annex-bwlimit config that works for git
remotes and many but not all special remotes.
This nearly works, at least for a git remote on the same disk. With it set
to 100kb/1s, the meter displays an actual bandwidth of 128 kb/s, with
occasional spikes to 160 kb/s. So it needs to delay just a bit longer...
I'm unsure why.
However, at the beginning a lot of data flows before it determines the
right bandwidth limit. A granularity of less than 1s would probably improve
that.
And, I don't know yet if it makes sense to have it be 100ks/1s rather than
100kb/s. Is there a situation where the user would want a larger
granularity? Does granulatity need to be configurable at all? I only used that
format for the config really in order to reuse an existing parser.
This can't support for external special remotes, or for ones that
themselves shell out to an external command. (Well, it could, but it
would involve pausing and resuming the child process tree, which seems
very hard to implement and very strange besides.) There could also be some
built-in special remotes that it still doesn't work for, due to them not
having a progress meter whose displays blocks the bandwidth using thread.
But I don't think there are actually any that run a separate thread for
downloads than the thread that displays the progress meter.
Sponsored-by: Graham Spencer on Patreon
IncrementalVerifier moved to Utility.Hash, which will let Utility.Url
use it later.
It's perhaps not really specific to hashing, but making a separate
module just for the data type seemed unncessary.
Sponsored-by: Dartmouth College's DANDI project
This fixes the recent reversion that annex.verify is not honored,
because retrieveChunks was passed RemoteVerify baser, but baser
did not have export/import set up.
Sponsored-by: Dartmouth College's DANDI project
Added fileRetriever', which will let the remaining special remotes
eventually also support incremental verify.
Sponsored-by: Dartmouth College's DANDI project
Now it's run in VerifyStage.
I thought about keeping the file handle open, and resuming reading where
tailVerify left off. But that risks leaking open file handles, until the
GC closes them, if the deferred verification does not get resumed. Since
that could perhaps happen if there's an exception somewhere, I decided
that was too unsafe.
Instead, re-open the file, seek, and resume.
Sponsored-by: Dartmouth College's DANDI project
It uses tailVerify to hash the file while it's being written.
This is able to sometimes avoid a separate checksum step. Although
if the file gets written quickly enough, tailVerify may not see it
get created before the write finishes, and the checksum still happens.
Testing with the directory special remote, incremental checksumming did
not happen. But then I disabled the copy CoW probing, and it did work.
What's going on with that is the CoW probe creates an empty file on
failure, then deletes it, and then the file is created again. tailVerify
will open the first, empty file, and so fails to read the content that
gets written to the file that replaces it.
The directory special remote really ought to be able to avoid needing to
use tailVerify, and while other special remotes could do things that
cause similar problems, they probably don't. And if they do, it just
means the checksum doesn't get done incrementally.
Sponsored-by: Dartmouth College's DANDI project
Simply feed each chunk in turn to the incremental verifier.
When resuming an interrupted retrieve, it does not do incremental
verification. That would need to read the file, up to the resume point,
and feed it to the incremental verifier. That seems easy to get wrong.
Also it would mean extra work done before the transfer can start. Which
would complicate displaying progress, and would perhaps not appear to the
user as if it was resuming from where it left off. Instead, in that
situation, return UnVerified, and let the verification be done in a
separate pass.
Granted, Annex.CopyFile does manage all that, but it's not complicated
by dealing with chunks too.
Sponsored-by: Dartmouth College's DANDI project
Several special remotes verify content while it is being retrieved,
avoiding a separate checksum pass. They are: S3, bup, ddar, and
gcrypt (with a local repository).
Not done when using chunking, yet.
Complicated by Retriever needing to change to be polymorphic. Which in turn
meant RankNTypes is needed, and also needed some code changes. The
change in Remote.External does not change behavior at all but avoids
the type checking failing because of a "rigid, skolem type" which
"would escape its scope". So I refactored slightly to make the type
checker's job easier there.
Unfortunately, directory uses fileRetriever (except when chunked),
so it is not amoung the improved ones. Fixing that would need a way for
FileRetriever to return a Verification. But, since the file retrieved
may be encrypted or chunked, it would be extra work to always
incrementally checksum the file while retrieving it. Hm.
Some other special remotes use fileRetriever, and so don't get incremental
verification, but could be converted to byteRetriever later. One is
GitLFS, which uses downloadConduit, which writes to the file, so could
verify as it goes. Other special remotes like web could too, but don't
use Remote.Helper.Special and so will need to be addressed separately.
Sponsored-by: Dartmouth College's DANDI project