Transfers from or to a local git repo could fail without a reason being
given, if the content failed to verify, or if the object file's stat
changed while it was being copied. Now display messages in these cases.
Sponsored-by: Jack Hill on Patreon
Which could happen occasionally before when concurrency is enabled.
While not much of a problem when it did happen, better to avoid it. Also,
since it seems likely the gpg-agent sometimes fails in such a situation,
this makes it not happen when running a single git-annex command with
concurrency enabled.
This commit was sponsored by Jake Vosloo on Patreon.
This fixes fsck of a remote that uses chunking displaying
(checking remotename) (checking remotename)" for every chunk.
Also, some remotes displayed the message, and others did not, with no
consistency. It was originally displayed only when accessing remotes
that were expensive or might involve a password prompt, I think, but
nothing in the API said when to do it so it became an inconsistent mess.
Originally I thought fsck should always display it. But it only displays
in fsck --from remote, so the user knows the remote is being accessed,
so there is no reason to tell them it's accessing it over and over.
It was also possible for git-annex move to sometimes display it twice,
due to checking if content is present twice. But, the user of move
specifies --from/--to, so it does not need to display when it's
accessing the remote, as the user expects it to access the remote.
git-annex get might display it, but only if the remote also supports
hasKeyCheap, which is really only local git remotes, which didn't
display it always; and in any case nothing displayed it before hasKeyCheap,
which is checked first, so I don't think this needs to display it ever.
mirror is like move. And that's all the main places it would have been
displayed.
This commit was sponsored by Jochen Bartl on Patreon.
If it's passed a ConfigKey such as annex.version, avoid returning
an empty remote name and return Nothing instead. Also, foo.bar.baz is
not treated as a remote named "bar".
directory: When cp supports reflinks, use it when getting content from a
directory special remote.
Not yet for imports from directory though, and not for store.
Note that, when it's chunked, using cp --reflink would not speed it up, and
when reflink was not supported, would unnecessarily write the chunk to a
file before reading it back in. So, only using a fileRetriever in the
NoChunks case is necessary to keep chunking fast.
fileCopier is told not to verify, because the special remote interface
does not yet support verification in passing. AFAICS, fileCopies can
never return False when not verifying so the added giveup should never
actually happen.
Had to add to AnnexRead an indication of whether debugging is enabled.
Could have just made setupConsole not install a debug output action that
outputs, and have enableDebug be what installs that, but then in the
common case where there is no debug selector, and so all debug output is
selected, it would run the debug output action every time, which entails
an IORef access. Which would make fastDebug too slow..
This uses a DebugSelector, rather than debug levels, which will allow
for a later option like --debug-from=Process to only
see debuging about running processes.
The module name that contains the thing being debugged is used as the
DebugSelector (in most cases; does not need to be a hard and fast rule).
Debug calls were changed to add that. hslogger did not display
that first parameter to debugM, but the DebugSelector does get
displayed.
Also fastDebug will allow doing debugging in places that are used in
tight loops, with the DebugSelector coming from the Annex Reader
essentially for free. Not done yet.
Values in AnnexRead can be read more efficiently, without MVar overhead.
Only a few things have been moved into there, and the performance
increase so far is not likely to be noticable.
This is groundwork for putting more stuff in there, particularly a value
that indicates if debugging is enabled.
The obvious next step is to change option parsing to not run in the
Annex monad to set values in AnnexState, and instead return a pure value
that gets stored in AnnexRead.
Note that a key with no size field that is hard linked will
result in listImportableContents reporting a file size of 0,
rather than the actual size of the file. One result is that
the progress meter when getting the file will seem to get stuck
at 100%. Another is that the remote's preferred content expression,
if it tries to match against file size, will treat it as an empty file.
I don't see a way to improve the latter behavior, and the former behavior
is a minor enough problem.
This commit was sponsored by Jake Vosloo on Patreon.
Keys stored on the filesystem are mangled by keyFile to avoid problem
chars. So, that mangling has to be reversed when parsing files from a
borg backup back to a key.
The directory special remote also so mangles them. Some other special
remotes do not; eg S3 just serializes the key -- but S3 object names are
not limited to filesystem valid filenames anyway, so a S3 server must
not map them directly to files in any case. It seems unlikely that a
borg backup of some such special remote will get broken by this change.
This commit was sponsored by Graham Spencer on Patreon.
New error message:
Remote foo not usable by git-annex; setting annex-ignore
http://localhost/foo/config download failed: Configuration of annex.security.allowed-ip-addresses does not allow accessing address ::1
If git config parse fails, or the git config file is not available at the url,
a better error message for that is also shown.
This commit was sponsored by Mark Reidenbach on Patreon.
Not yet used, but allows getting the size of items in the tree fairly
cheaply.
I noticed that CmdLine.Seek uses ls-tree and the feeds the files into
another long-running process to check their size. That would be an
example of a place that might be sped up by using this. Although in that
particular case, it only needs to know the size of unlocked files, not
locked. And since enabling --long probably doubles the ls-tree runtime
or more, the overhead of using it there may outwweigh the benefit.
box.com already had a special case, since its renaming was known buggy.
In its case, renaming to the temp file succeeds, but then renaming the temp
file to final destination fails.
Then this 4shared server has buggy handling of renames across directories.
While already worked around with for the temp files when storing exports
now being in the same directory as the final filename, that also affected
renameExport when the file moves between directories.
I'm not entirely clear what happens on the 4shared server when it fails
this way. It kind of looks like it may rename the file to destination and
then still fail.
To handle both, when rename fails, delete both the source and the
destination, and fall back to uploading the content again. In the box.com
case, the temp file is the source, and deleting it makes sure the temp file
gets cleaned up. In the 4shared case, the file may have been renamed to the
destination and so cleaning that up avoids any interference with the
re-upload to the destination.
When autoenabling special remotes of type S3, weddav, or glacier, do not
take login credentials from environment variables, as the user may not be
expecting the autoenable to happen, and may have those set for other
purposes.
This may work better in some webdav server that gets confused at
cross-collection renamed. I don't know, let's find out.
The only real downside of doing this is that the temp files are not all
in the top-level collection, in case an interrupted run leaves one
behind. But that does not seem especially significant.
Which access a remote using rsync over ssh, and which git pushes to much
more efficiently than ssh urls.
There was some old partial support for rsync URIs from 2013, but it seemed
incomplete, and did not use rsync over ssh. Weird.
I'm not sure if there's any remaining benefit to using the non-rsync url
forms with gcrypt, now that this is implemented? Updated docs to encourage
using the rsync urls.
This commit was sponsored by Svenne Krap on Patreon.
Avoiding using a callback simplifies this and should make it easier to
implement incremental checksumming, which will need to happen partly in
writeRetrievedContent and partly in retrieveChunks.
This benchmarks only slightly faster than the old git-annex. Eg, for a 1
gb file, 14.56s vs 15.57s. (On a ram disk; there would certianly be
more of an effect if the file was written to disk and didn't stay in
cache.)
Commenting out the updateIncremental calls make the same run in 6.31s.
May be that overhead in the implementation, other than the actual
checksumming, is slowing it down. Eg, MVar access.
(I also tried using 10x larger chunks, which did not change the speed.)
This is groundwork for calculating checksums while copying, rather than
in a separate pass, but that's not done yet. For now, avoid using rsync
(and cp on Windows), and instead read and write the file ourselves, with
resume handling.
Benchmarking vs old git-annex that used rsync, this is faster,
at least once the file size is larger than a couple of MB.
Changing to the P2P protocol broke this, because preseedTmp copies
the local copy of the object to the temp file, and then the P2P transfer
sees the right length file and uses it as-is.
When git-annex-shell is too old and rsync is used, it did verify the
content, and when the local repo does not have the object it did verify the
content.
This could perhaps have caused a hard link to be made when the content
of the object was modified. I don't think that actually happened,
because the annexed file would have to be unlocked, with annex.thin, for
the object to get modified, and in that case, a hard link is not made.
However, to be sure, run the check.
Note that it seemed best to run the check only once, although the
current implementation is fast and safe to run repeatedly.
Checksum as content is received from a remote git-annex repository, rather
than doing it in a second pass.
Not tested at all yet, but I imagine it will work!
Not implemented for any special remotes, and also not implemented for
copies from local remotes. It may be that, for local remotes, it will
suffice to use rsync, rely on its checksumming, and simply return Verified.
(It would still make a checksumming pass when cp is used for COW, I guess.)
See my comment in the next commit for some details about why
Verified needs a hash with preimage resistance. As far as tahoe goes,
it's fully cryptographically secure.
I think that bup could also return Verified. However, the Retriever
interface does not currenly support that.
When a git remote is configured with an absolute path, use that path,
rather than making it relative. If it's configured with a relative path,
use that.
Git.Construct.fromPath changed to preserve the path as-is,
rather than making it absolute. And Annex.new changed to not
convert the path to relative. Instead, Git.CurrentRepo.get
generates a relative path.
A few things that used fromAbsPath unncessarily were changed in passing to
use fromPath instead. I'm seeing fromAbsPath as a security check,
while before it was being used in some cases when the path was
known absolute already. It may be that fromAbsPath is not really needed,
but only git-annex-shell uses it now, and I'm not 100% sure that there's
not some input that would cause a relative path to be used, opening a
security hole, without the security check. So left it as-is.
Test suite passes and strace shows the configured remote url is used
unchanged in the path into it. I can't be 100% sure there's not some code
somewhere that takes an absolute path to the repo and converts it to
relative and uses it, but it seems pretty unlikely that the code paths used
for a git remote would call such code. One place I know of is gitAnnexLink,
but I'm pretty sure that git remotes never deal with annex symlinks. If
that did get called, it generates a path relative to cwd, which would have
been wrong before this change as well, when operating on a remote.
When annex.stalldetection is not enabled, and a likely stall is detected,
display a suggestion to enable it.
Note that the progress meter display is not taken down when displaying
the message, so it will display like this:
0% 8 B 0 B/s
Transfer seems to have stalled. To handle stalling transfers, configure annex.stalldetection
0% 10 B 0 B/s
Although of course if it's really stalled, it will never update
again after the message. Taking down the progress meter and starting
a new one doesn't seem too necessary given how unusual this is,
also this does help show the state it was at when it stalled.
Use of uninterruptibleCancel here is ok, the thread it's canceling
only does STM transactions and sleeps. The annex thread that gets
forked off is separate to avoid it being canceled, so that it
can be joined back at the end.
A module cycle required moving from dupState the precaching of the
remote list. Doing it at startConcurrency should cover all the cases
where the remote list is used in concurrent actions.
This commit was sponsored by Kevin Mueller on Patreon.
Don't accept the cid of the temp file that the content has just been
written to as something we will accept if another file has that same
content. There's no reason to, and on FAT, due to mtime resolution,
the test suite hit just such a case.
This fixes a reversion from 73df633a62
which removed inode from the ContentIdentifier.
Directory special remotes with importtree=yes now avoid unncessary overhead
when inodes of files have changed, as happens whenever a FAT filesystem
gets remounted.
A few unusual edge cases of modifications won't be detected and
imported. I think they're unusual enough not to be a concern. It would
be possible to add a config setting that controls whether to compare
inodes too, but does not seem worth bothering the user about currently.
I chose to continue to use the InodeCache serialization, just with the
inode zeroed. This way, if I later change my mind or make it
configurable, can parse it back to an InodeCache and operate on it. The
overhead of storing a 0 in the content identifier log seems worth it.
There is a one-time cost to this change; all directory special remotes
with importtree=yes will re-hash all files once, and will update the
content identifier logs with zeroed inodes.
This commit was sponsored by Brett Eisenberg on Patreon.
After the last commit, it was able to throw errors just due to an
unparseable url. This avoids needing to worry about that, as long
as the call site has already checked that it has a parseable url.
Including the non-standard URI form that git-remote-gcrypt uses for rsync.
Eg, "ook://foo:bar" cannot be parsed because "bar" is not a valid port
number. But git could have a remote with that, it would try to run
git-remote-ook to handle it. So, git-annex has to allow for such things,
rather than crashing.
This commit was sponsored by Luke Shumaker on Patreon.
This code I'm reverting works. But it has a problem: The export db and
log and the ContentIdentifier db and log still list the content as being
stored in the remote. So when I ran borg create again and stored the
content in borg again in a new archive, git-annex sync noticed that, but
since it didn't update the tree for the old archives, it then thought
the content that had been removed from them was still in them, and so
git-annex get failed in an ugly way:
Include pattern 'tmp/x/.git/annex/objects/pX/ZJ/SHA256E-s0--e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855/SHA256E-s0--e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855' never matched.
[2020-12-28 16:40:44.878952393] process [933616] done ExitFailure 1
user error (borg ["extract","/tmp/b::abs4","tmp/x/.git/annex/objects/pX/ZJ/SHA256E-s0--e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855/SHA256E-s0--e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"] exited 1)
It does not seem worth it to update the git tree for the export when dropping
content, that would make drop of many files very expensive in git tree objects
created. So, let's not support this I suppose..