Improve disk free space checking when transferring unsized keys to
local git remotes. Since the size of the object file is known, can
check that instead.
Getting unsized keys from local git remotes does not check the actual
object size. It would be harder to handle that direction because the size
check is run locally, before anything involving the remote is done. So it
doesn't know the size of the file on the remote.
Also, transferring unsized keys to other remotes, including ssh remotes and
p2p remotes don't do disk size checking for unsized keys. This would need a
change in protocol.
(It does seem like it would be possible to implement the same thing for
directory special remotes though.)
In some sense, it might be better to not ever do disk free checking for
unsized keys, than to do it only sometimes. A user might notice this
direction working and consider it a bug that the other direction does not.
On the other hand, disk reserve checking is not implemented for most
special remotes at all, and yet it is implemented for a few, which is also
inconsistent, but best effort. And so doing this best effort seems to make
some sense. Fundamentally, if the user wants the size to always be checked,
they should not use unsized keys.
Sponsored-by: Brock Spratlen on Patreon
This is groundwork for making special remotes like borg be skipped by
sync when on an offline drive.
Added AVAILABILITY UNAVAILABLE reponse and the UNAVAILABLERESPONSE extension
to the external special remote protocol. The extension is needed because
old git-annex, if it sees that response, will display a warning
message. (It does continue as if the remote is globally available, which
is acceptable, and the warning is only displayed at initremote due to
remote.name.annex-availability caching, but still it seemed best to make
this a protocol extension.)
The remote.name.annex-availability git config is no longer used any
more, and is documented as such. It was only used by external special
remotes to cache the availability, to avoid needing to start the
external process every time. Now that availability is queried as an
Annex action, the external is only started by sync (and the assistant),
when they actually check availability.
Sponsored-by: Nicholas Golder-Manning on Patreon
Failure to remove is not treated as a problem, and no permissions
modifications are done, to avoid unexpected states.
Sponsored-by: Luke Shumaker on Patreon
Works around this bug in unix-compat:
https://github.com/jacobstanley/unix-compat/issues/56
getFileStatus and other FilePath using functions in unix-compat do not do
UNC conversion on Windows.
Made Utility.RawFilePath use convertToWindowsNativeNamespace to do the
necessary conversion on windows to support long filenames.
Audited all imports of System.PosixCompat.Files to make sure that no
functions that operate on FilePath were imported from it. Instead, use
the equvilants from Utility.RawFilePath. In particular the
re-export of that module in Common had to be removed, which led to lots
of other changes throughout the code.
The changes to Build.Configure, Build.DesktopFile, and Build.TestConfig
make Utility.Directory not be needed to build setup. And so let it use
Utility.RawFilePath, which depends on unix, which cannot be in
setup-depends.
Sponsored-by: Dartmouth College's Datalad project
Note that when this is specified and an older git-annex is used to
enableremote such a special remote, it will simply ignore the cost= field
and use whatever the default cost is.
In passing, fixed adb to support the remote.name.cost and
remote.name.cost-command configs.
Sponsored-by: Dartmouth College's DANDI project
This partly fixes an issue where there are duplicate files in the
special remote, and the first file gets swapped with another duplicate,
or deleted. The swap case is fixed by this, the deleted case will need
other changes.
This makes retrieveExportWithContentIdentifier take a list of allowed
ContentIdentifier, same as storeExportWithContentIdentifier,
removeExportWithContentIdentifier, and
checkPresentExportWithContentIdentifier.
Of the special remotes that support importtree, borg is a special case
and does not use content identifiers, S3 I assume can't get mixed up
like this, directory certainly has the problem, and adb also appears to
have had the problem.
Sponsored-by: Graham Spencer on Patreon
Improve handling of directory special remotes with importtree=yes whose
ignoreinode setting has been changed. (By either enableremote or by
upgrading to commit 3e2f1f73cbc5fc10475745b3c3133267bd1850a7.)
When getting a file from such a remote, accept the content that would have
been accepted with the previous ignoreinode setting.
After a change to ignoreinode, importing a tree from the remote will
re-import and generate new content identifiers using the new config. So
when ignoreinode has changed to no, the inodes will be learned, and after
that point, a change in an inode will be detected as a change. Before
re-importing, a change in an inode will be ignored, as it was before the
ignoreinode change. This seems acceptble, because the user can re-import
immediately if they urgently need to add inodes. And if not, they'll
do it sometime, presumably, and the change will take effect then.
Sponsored-by: Erik Bjäreholt on Patreon
Fix crash importing from a directory special remote that contains a broken
symlink.
The crash was in listImportableContentsM but some other places in
Remote.Directory also seemed like they could have the same problem.
Also audited for other places that have such a problem. Not all calls
to getFileStatus are bad, in some cases it's better to crash on something
unexpected. For example, `git-annex import path` when the path is a broken
symlink should crash, the same as when it does not exist. Many of the
getFileStatus calls are like that, particularly when they involve
.git/annex/objects which should never have a broken symlink in it.
Fixed a few other possible cases of the problem.
Sponsored-by: Lawrence Brogan on Patreon
Too big a footgun.
This does not prevent attackers who can write to the directory being
imported from racing the check. But they can cause anything to be imported
anyway, so would be limited to getting the legacy import to follow into a
directory they do not write to, and move files out of it into the annex.
(The directory special remote does not have that problem since it does not
move files.)
Sponsored-by: Jack Hill on Patreon
This should not change the behavior of it, unless there are multiple top
directories, and then it should behave the same as if there was a single
top directory that was actually above the directory to be created.
Sponsored-by: Dartmouth College's Datalad project
On Windows, that does not support long paths
https://github.com/jacobstanley/unix-compat/issues/56
Instead, use System.Directory.renamePath, which does support long paths.
Sponsored-by: Dartmouth College's Datalad project
None of the special remotes do it yet, but this lays the groundwork.
Added MustFinishIncompleteVerify so that, when an incremental verify is
started but not complete, it can be forced to finish it. Otherwise, it
would have skipped doing it when verification is disabled, but
verification must always be done when retrievin from export remotes
since files can be modified during retrieval.
Note that retrieveExportWithContentIdentifier doesn't support incremental
verification yet. And I'm not sure if it can -- it doesn't know the Key
before it downloads the content. It seems a new API call would need to
be split out of that, which is provided with the key.
Sponsored-by: Dartmouth College's Datalad project
Directory special remotes with importtree=yes have changed to once more
take inodes into account. This will cause extra work when importing from a
directory on a FAT filesystem that changes inodes on every mount.
To avoid that extra work, set ignoreinodes=yes when initializing a new
directory special remote, or change the configuration of your existing
remote: git-annex enableremote foo ignoreinodes=yes
This will mean a one-time re-import of all contents from every directory
special remote due to the changed setting.
73df633a62 thought
it was too unlikely that there would be modifications that the inode number
was needed to notice. That was probably right; it's very unlikely that a
file will get modified and end up with the same size and mtime as before.
But, what was not considered is that a program like NextCloud might write
two files with different content so closely together that they share the
mtime. The inode is necessary to detect that situation.
Sponsored-by: Max Thoursie on Patreon
This improves the borg special remote memory usage, by
letting it only load one archive's worth of filenames into memory at a
time, and building up a larger tree out of the chunks.
When a borg repository has many archives, git-annex could easily OOM
before. Now, it will use only memory proportional to the number of
annexed keys in an archive.
Minor implementation wart: Each new chunk re-opens the content
identifier database, and also a new vector clock is used for each chunk.
This is a minor innefficiency only; the use of continuations makes
it hard to avoid, although putting the database handle into a Reader
monad would be one way to fix it.
It may later be possible to extend the ImportableContentsChunkable
interface to remotes that are not third-party populated. However, that
would perhaps need an interface that does not use continuations.
The ImportableContentsChunkable interface currently does not allow
populating the top of the tree with anything other than subtrees. It
would be easy to extend it to allow putting files in that tree, but borg
doesn't need that so I left it out for now.
Sponsored-by: Noam Kremen on Patreon
commit 63d508e885 broke test_readonly.
When a local git remote is readonly, tryCopyCoW run to copy a file
from it failed at withOtherTmp.
Sponsored-by: Dartmouth College's Datalad project
Probably this fixes a reversion, but I don't know what version broke it.
This does use withOtherTmp for a temp file that could be quite large.
Though albeit a reflink copy that will not actually take up any space
as long as the file it was copied from still exists. So if the copy cow
succeeds but git-annex is interrupted just before that temp file gets
renamed into the usual .git/annex/tmp/ location, there is a risk that
the other temp directory ends up cluttered with a larger temp file than
later. It will eventually be cleaned up, and the changes of this being
a problem are small, so this seems like an acceptable thing to do.
Sponsored-by: Shae Erisson on Patreon
Added fileRetriever', which will let the remaining special remotes
eventually also support incremental verify.
Sponsored-by: Dartmouth College's DANDI project
Several special remotes verify content while it is being retrieved,
avoiding a separate checksum pass. They are: S3, bup, ddar, and
gcrypt (with a local repository).
Not done when using chunking, yet.
Complicated by Retriever needing to change to be polymorphic. Which in turn
meant RankNTypes is needed, and also needed some code changes. The
change in Remote.External does not change behavior at all but avoids
the type checking failing because of a "rigid, skolem type" which
"would escape its scope". So I refactored slightly to make the type
checker's job easier there.
Unfortunately, directory uses fileRetriever (except when chunked),
so it is not amoung the improved ones. Fixing that would need a way for
FileRetriever to return a Verification. But, since the file retrieved
may be encrypted or chunked, it would be extra work to always
incrementally checksum the file while retrieving it. Hm.
Some other special remotes use fileRetriever, and so don't get incremental
verification, but could be converted to byteRetriever later. One is
GitLFS, which uses downloadConduit, which writes to the file, so could
verify as it goes. Other special remotes like web could too, but don't
use Remote.Helper.Special and so will need to be addressed separately.
Sponsored-by: Dartmouth College's DANDI project
directory: When cp supports reflinks, use it when getting content from a
directory special remote.
Not yet for imports from directory though, and not for store.
Note that, when it's chunked, using cp --reflink would not speed it up, and
when reflink was not supported, would unnecessarily write the chunk to a
file before reading it back in. So, only using a fileRetriever in the
NoChunks case is necessary to keep chunking fast.
fileCopier is told not to verify, because the special remote interface
does not yet support verification in passing. AFAICS, fileCopies can
never return False when not verifying so the added giveup should never
actually happen.
Don't accept the cid of the temp file that the content has just been
written to as something we will accept if another file has that same
content. There's no reason to, and on FAT, due to mtime resolution,
the test suite hit just such a case.
This fixes a reversion from 73df633a62
which removed inode from the ContentIdentifier.
Directory special remotes with importtree=yes now avoid unncessary overhead
when inodes of files have changed, as happens whenever a FAT filesystem
gets remounted.
A few unusual edge cases of modifications won't be detected and
imported. I think they're unusual enough not to be a concern. It would
be possible to add a config setting that controls whether to compare
inodes too, but does not seem worth bothering the user about currently.
I chose to continue to use the InodeCache serialization, just with the
inode zeroed. This way, if I later change my mind or make it
configurable, can parse it back to an InodeCache and operate on it. The
overhead of storing a 0 in the content identifier log seems worth it.
There is a one-time cost to this change; all directory special remotes
with importtree=yes will re-hash all files once, and will update the
content identifier logs with zeroed inodes.
This commit was sponsored by Brett Eisenberg on Patreon.
May actually work now.
Note that, importKey now has to add the size to the key if it's supposed
to have size. Remote.Directory relied on the importer adding the size,
which is no longer done, so it was changed; it was the only one.
This way, importKey does not need to behave differently between regular
and thirdpartypopulated imports.
This is to support, eg a borg repo as a special remote, which is
populated not by running git-annex commands, but by using borg. Then
git-annex sync lists the content of the remote, learns which files are
annex objects, and treats those as present in the remote.
So, most of the import machinery is reused, to a new purpose. While
normally importtree maintains a remote tracking branch, this does not,
because the files stored in the remote are annex object files, not
user-visible filenames. But, internally, a git tree is still generated,
of the files on the remote that are annex objects. This tree is used
by retrieveExportWithContentIdentifier, etc. As with other import/export
remotes, that the tree is recorded in the export log, and gets grafted
into the git-annex branch.
importKey changed to be able to return Nothing, to indicate when an
ImportLocation is not an annex object and so should be skipped from
being included in the tree.
It did not seem to make sense to have git-annex import do this, since
from the user's perspective, it's not like other imports. So only
git-annex sync does it.
Note that, git-annex sync does not yet download objects from such
remotes that are preferred content. importKeys is run with
content downloading disabled, to avoid getting the content of all
objects. Perhaps what's needed is for seekSyncContent to be run with these
remotes, but I don't know if it will just work (in particular, it needs
to avoid trying to transfer objects to them), so I skipped that for now.
(Untested and unused as of yet.)
This commit was sponsored by Jochen Bartl on Patreon.
9cb250f7be got the ones in RawFilePath,
but there were others that used the one from unix-compat, which fails at
runtime on windows. To avoid this,
import System.PosixCompat.Files hiding removeLink
This commit was sponsored by Ethan Aubin.
Lots of nice wins from this in avoiding unncessary work, and I think
nothing got slower.
This commit was sponsored by Boyd Stephen Smith Jr. on Patreon.
nukeFile replaced with removeWhenExistsWith removeLink, which allows
using RawFilePath. Utility.Directory cannot use RawFilePath since setup
does not depend on posix.
This commit was sponsored by Graham Spencer on Patreon.
Also audited for other calls to openTempFile, and all are ok,
except for viaTmp which will need further work.
Remote.Directory fixed to set umask mode when writing to an export,
although it has another one using viaTmp that's not fixed.
Will make exports that are published via a http server running as
another user work, for example.
Remote.BitTorrent fixed to set umask mode when downloading the torrent
file. Normally this does not matter as that file does not hang around
after the download, but if a bittorrent download were started by one user,
got interrupted and then another user ran it, this will let them access
the torrent file created by the first user.
Only supported by some special remotes: directory
I need to check the rest and they're currently missing methods until I do.
git-annex sync --no-content does not yet use this to do imports