Cache high-resolution mtimes for improved detection of modified files in v7
(and direct mode).
Including on Windows.
With back-compat support so old low-res mtimes won't break anything, and
so the new information also won't break old versions of git-annex.
I've seen intermittent failures of the test suite with v6 for a long time,
it seems to have possibly gotten worse with the changes around v7. Or just
being unlucky; all tests failed today.
Seen on amd64 and i386 builders, repeatedly but intermittently:
unused: FAIL (4.86s)
Test.hs:928:
git diff did not show changes to unlocked file
And I think other such failures, all involving v7/v6 mode tests.
I managed to reproduce the unused failure with --keep-failures,
and inside the repo, git diff was indeed not showing any changes for
the modified unlocked file.
The two stats will be the same other than mtime; the old and new files have
the same size and inode, since the test case writes to the file and then
overwrites it.
Indeed, notice the identical timestamps:
builder@orca:~/gitbuilder/build/.t/tmprepo335$ echo 1 > foo; stat foo; echo 2 > foo; stat foo
File: foo
Size: 2 Blocks: 8 IO Block: 4096 regular file
Device: 801h/2049d Inode: 3546179 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 1000/ builder) Gid: ( 1000/ builder)
Access: 2018-10-29 22:14:10.894942036 +0000
Modify: 2018-10-29 22:14:10.894942036 +0000
Change: 2018-10-29 22:14:10.894942036 +0000
Birth: -
File: foo
Size: 2 Blocks: 8 IO Block: 4096 regular file
Device: 801h/2049d Inode: 3546179 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 1000/ builder) Gid: ( 1000/ builder)
Access: 2018-10-29 22:14:10.894942036 +0000
Modify: 2018-10-29 22:14:10.898942036 +0000
Change: 2018-10-29 22:14:10.898942036 +0000
Birth: -
I'm seeing this in Linux VMs; it doesn't happen on my laptop. I've also
not experienced the intermittent test suite failures on my laptop.
So, I hope that this small delay will avoid the problem.
Update: I didn't, indeed I then reproduced the same failure on my
laptop, so it must be due to something else. But keeping this change anyway
since not needing to worry about lowish-resolution mtime in the test suite seems
worthwhile.
Removed undocumented special case in handling of a CHECKURL-MULTI response
with only a single file listed. Rather than ignoring the url that was in
the response, use it. This allows external special remotes that want to
provide some better url to do so, although I don't entirely agree with
using CHECKURL-MULTI to accomplish that. I'm more of the feeling that an
undocumented special case that throws data away is just not a good idea.
This could in theory break some external special remote program that relied
on the current behavior, but its seems unlikely that it would because such
a program must already handle the multiple url case, unless it only ever
provides a single url response to CHECKURL-MULTI.
Make addurl --file work with a single item CHECKURL-MULTI response.
It already did for external special remotes due to the special case,
but now it also will for builtin ones like the BitTorrent special remote.
This commit was sponsored by Ilya Shlyakhter on Patron.
This completes initial support for --hide-missing, although the
assistant still needs to be updated and it perhaps needs to be sped up,
and maybe there needs to be a way for git-annex get to operate on
missing files. Opened some more todos for those things.
This commit was sponsored by Henrik Riomar.
This relies on git ls-files --with-tree, which I'm using in a way that
its man page does not document. Hm. I emailed the git list to try to get
the docs improved, but at least the git test suite does test the same
kind of use case I'm using here.
Performance impact when not in an adjusted branch is limited to some
additional MVar accesses, and a single git call to determine the name of
the current branch. So very minimal.
When in an adjusted branch, the performance impact is
in Annex.WorkTree.lookupFile, which starts doing an equal amount of work
for files that didn't exist as it already did for files that were
unlocked.
This commit was sponsored by Jochen Bartl on Patreon.
* At long last there's a way to hide annexed files whose content
is missing from the working tree: git-annex adjust --hide-missing
* When already in an adjusted branch, running git-annex adjust
again will update the branch as needed. This is mostly
useful with --hide-missing to hide/unhide files after their content
has been dropped or received.
Still needs integration with sync and the assistant, and not as fast as it
could be, but already usable.
This commit was sponsored by Ethan Aubin.
That could cause git-annex to get confused about whether a locked file's
content was present, when the object file got touched.
Unfortunately this means more work sometimes when annex.thin is set,
since it has to checksum the file to tell if it's still got the right
content.
Had to suppress output when inAnnex calls isUnmodified, otherwise
"(checksum...)" would be printed in places it ought not to be,
eg "git annex get" could turn out not need to get anything, and
so only display that.
This commit was sponsored by Ole-Morten Duesund on Patreon.
* Added arm64 Linux standalone build. (No autobuilder yet.)
* Improved termux installation process.
Added git-annex-install.sh script to avoid user needing to type as much in
termux. The scope of this script is limited; runshell handles the rest.
Runshell runs termux-fix-shebang on the shell scripts. The problem is
the bundled bin/sh script, deleting that script also works, but then the
others probably use the system Android /bin/sh, which could be old or
broken or not posix or whatever. Using termux sh to run the scripts is
better.
This commit was sponsored by Eric Drechsel on Patreon.
Added annex.jobs setting, which is like using the -J option.
Of course, -J overrides annex.jobs.
This commit was sponsored by Trenton Cronholm on Patreon.
The error message displayed used to only come from curl/wget and perhaps
was clearer than the one displayed now that http-client is used. In any
case, it does make sense to hide it because git-annex prints its own
warning message.
This commit was sponsored by Jake Vosloo on Patreon.
Only display the warning when the current branch has a tree that is not
the same as the tree in the export.
Note that it doesn't check to see if the current tree is
in incompleteExportedTreeish; it might be worth checking that and reminding
the user about an incomplete export, but when export tracking is not
configured, they are probably not in the right clone of the repository to
resolve the incomplete export.
This commit was sponsored by Ethan Aubin.
Added remote.name.annex-security-allow-unverified-downloads, a per-remote
setting for annex.security.allow-unverified-downloads.
This commit was sponsored by Brock Spratlen on Patreon.
Added annex.maxextensionlength for use cases where extensions longer than 4
characters are needed.
This commit was sponsored by Henrik Riomar on Patreon.
Makes git annex whereis display the versionId urls.
And, when a s3 remote is enabled without creds, git-annex will use the
versionId urls to access its contents.
This commit was sponsored by Fernando Jimenez on Patreon.
Since the same key can be stored in a versioned S3 bucket multiple times
with different version IDs, this allows tracking them all. Not currently
needed, but if we ever want to drop from a versioned S3 bucket, we'll
need to know them all.
This commit was supported by the NSF-funded DataLad project.
Actually very straightforward reuse of the metadata log file code.
Although I had to add a todo item as git-annex forget won't clean up
dead remote's metadata yet.
This would be worth adding to the external special remote interface
sometime. Have not opened a todo though, guess I'll wait until something
needs it.
This commit was supported by the NSF-funded DataLad project.
Have to store the S3 object along with the version ID, so retrieval can
use the same object.
This commit was supported by the NSF-funded DataLad project.
Only done when versioning=yes is configured. It could always do it when
S3 sends back a version id, but there may be buckets that have
versioning enabled by accident, so it seemed better to honor the
configuration.
S3's docs say version IDs are "randomly generated", so presumably
storing the same content twice gets two different ones not the same one.
So I considered storing a list of version IDs for a key. That would
allow removing the key completely. But.. The way Logs.RemoteState works,
when there are multiple writers, the last writer wins. So storing a list
would need a different log format that merges, which seemed overkill to support
removing a key from an append-only remote.
Note that Logs.RemoteState for S3 is now dedicated to version IDs.
If something else needs to be stored, a new log will be needed to do it.
This commit was supported by the NSF-funded DataLad project.
Make exporttree=yes remotes that are appendonly not be untrusted, and not force
verification of content, since the usual concerns about losing data when an
export is updated by someone else don't apply.
Note that all the remote operations on keys are left as usual for
appendonly export remotes, except for storing content.
This commit was supported by the NSF-funded DataLad project.
Make `git annex export` check appendonly when removing a file from an
export, and not update the location log, since the remote still contains
the content.
This commit was supported by the NSF-funded DataLad project.
Last of the known v6 races.
This also makes git add of a pointer file populate it when its content
is present in the annex. Which makes sense to do, I think.
This commit was supported by the NSF-funded DataLad project.
Update pointer file next time reconcileStaged is run to recover from the
race.
Note that restagePointerFile causes git to run the clean filter,
and that will run reconcileStaged. So, normally by the time the git
annex get/drop command finishes, the race has already been dealt with.
It may be that, in some case, that won't happen and the race will be
dealt with at a later point. git-annex could run reconcileStaged at
shutdown if that becomes a problem.
This does not handle the situation where the git mv is committed before
git-annex gets a chance to run again. git commit does run the clean
filter, and that happens to re-inject the content if it was supposed to
be dropped but is still populated. But, the case where the file was
supposed to be gotten but is not populated is not handled yet.
This commit was supported by the NSF-funded DataLad project.
Check just before running update-index if the worktree file's content is
still the same, don't update it when it's been modified. This narrows
the race window a lot, from possibly minutes or hours, to seconds or
less.
(Use replaceFile so that the worktree update happens atomically,
allowing the InodeCache of the new worktree file to itself be gathered
w/o any other race.)
This doesn't eliminate the race; it can still occur in the window before
update-index runs. When annex.queue is large, a lot of files will be
statted by the checks, and so the window may still be large enough to be a
problem.
When only a few files are being processed, the window is as small as it
is in the race where a modification gets overwritten by git-annex when
it updates the worktree. Or maybe as small as whatever race git
checkout/pull/merge may have when the worktree gets modified during it.
Still, I've kept a todo about this race.
This commit was supported by the NSF-funded DataLad project.
Use git update-index --refresh, since it's a little bit more
efficient and the user can be told to run it if a locked index prevents
git-annex from running it.
This also fixes the problem where an annexed file was deleted in the index
and a get of another file that uses the same key caused the index update to
add back the deleted file. update-index will not add back the deleted file.
Documented in tips/unlocked_files.mdwn the gotcha that the index update
may conflict with other operations. I can't see any way to possibly avoid
that conflict.
One new todo about a race that causes a modification to be accidentially
staged.
Note that the assistant only flushes the git command queue when it
commits a modification. I have not tested the assistant with v6 unlocked
files, but assume most users of the assistant won't care if the index
shows a file as modified for a while.
This commit was supported by the NSF-funded DataLad project.
After updating the worktree for an add/drop, update git's index, so git
status will not show the files as modified.
What actually happens is that the index update removes the inode
information from the index. The next git status (or similar) run
then has to do some work. It runs the clean filter.
So, this depends on the clean filter being reasonably fast and on git
not leaking memory when running it. Both problems were fixed in
a96972015d, but only for git 2.5. Anyone
using an older git will see very expensive git status after an add/drop.
This uses the same git update-index queue as other parts of git-annex, so
the actual index update is fairly efficient. Of course, updating the index
does still have some overhead. The annex.queuesize config will control how
often the index gets updated when working on a lot of files.
This is an imperfect workaround... Added several todos about new
problems this workaround causes. Still, this seems a lot better than the
old behavior.
This commit was supported by the NSF-funded DataLad project.
v6 add: Take advantage of improved SIGPIPE handler in git 2.5 to speed up
the clean filter by not reading the file content from the pipe. This also
avoids git buffering the whole file content in memory.
When built with an older git, still consumes stdin. If built with a newer
git and used with an older one, it breaks, but that's acceptable --
checking the git version every time would make repeated smudge runs slow.
This commit was supported by the NSF-funded DataLad project.
Added annex.commitmessage config that can specify a commit message for the
git-annex branch instead of the usual "update".
This commit was supported by the NSF-funded DataLad project.
addurl: When security configuration prevents downloads with youtube-dl,
still check if the url is one that it supports, and fail downloading it,
instead of downloading the raw web page.
Security fix! Allowing any schemes, particularly file: and
possibly others like scp: allowed file exfiltration by anyone who had
write access to the git repository, since they could add an annexed file
using such an url, or using an url that redirected to such an url,
and wait for the victim to get it into their repository and send them a copy.
* Added annex.security.allowed-url-schemes setting, which defaults
to only allowing http and https URLs. Note especially that file:/
is no longer enabled by default.
* Removed annex.web-download-command, since its interface does not allow
supporting annex.security.allowed-url-schemes across redirects.
If you used this setting, you may want to instead use annex.web-options
to pass options to curl.
With annex.web-download-command removed, nearly all url accesses in
git-annex are made via Utility.Url via http-client or curl. http-client
only supports http and https, so no problem there.
(Disabling one and not the other is not implemented.)
Used curl --proto to limit the allowed url schemes.
Note that this will cause git annex fsck --from web to mark files using
a disallowed url scheme as not being present in the web. That seems
acceptable; fsck --from web also does that when a web server is not available.
youtube-dl already disabled file: itself (probably for similar
reasons). The scheme check was also added to youtube-dl urls for
completeness, although that check won't catch any redirects it might
follow. But youtube-dl goes off and does its own thing with other
protocols anyway, so that's fine.
Special remotes that support other domain-specific url schemes are not
affected by this change. In the bittorrent remote, aria2c can still
download magnet: links. The download of the .torrent file is
otherwise now limited by annex.security.allowed-url-schemes.
This does not address any external special remotes that might download
an url themselves. Current thinking is all external special remotes will
need to be audited for this problem, although many of them will use
http libraries that only support http and not curl's menagarie.
The related problem of accessing private localhost and LAN urls is not
addressed by this commit.
This commit was sponsored by Brett Eisenberg on Patreon.
In keyUrls, the GitConfig is used only by annexLocations
to support configured Differences. Since such configurations affect all
clones of a repository, the local repo's GitConfig must have the same
information as the remote's GitConfig would have. So, used getGitConfig
to get the local GitConfig, which is cached and so available cheaply.
That actually fixed a bug noone had ever noticed: keyUrls is
used for remotes accessed over http. The full git config of such a
remote is normally not available, so the remoteGitConfig that keyUrls
used would not have the necessary information in it.
In copyFromRemoteCheap', it uses gitAnnexLocation,
which does need the GitConfig of the remote repo itself in order to
check if it's crippled, supports symlinks, etc. So, made the
State include that GitConfig, cached. The use of gitAnnexLocation is
within a (not $ Git.repoIsUrl repo) guard, so it's local, and so
its git config will always be read and available.
(Note that gitAnnexLocation in turn calls annexLocations, so the
Differences config it uses in this case comes from the remote repo's
GitConfig and not from the local repo's GitConfig. As explained above
this is ok since they must have the same value.)
Not very happy with this mess of different GitConfigs not type-safe and
some read only sometimes etc. Very hairy. Think I got it this change
right. Test suite passes..
This commit was sponsored by Ethan Aubin.
Unfortunately one more use remains..
This should be just as fast as the other method. The remote's Git.Repo
has already had its config read, so Annex.new's call to Git.Config.read
is a noop.
Thid commit was sponsored by andrea rota.
Fixed annex-checkuuid implementation, so that remotes configured that way
can be used. This was 100% broken from the first commit of it, oops.
This commit was sponsored by Øyvind Andersen Holm.
Makes it allow writes, but not deletion of annexed content. Note that
securing pushes to the git repository is left up to the user.
This commit was sponsored by Jack Hill on Patreon.
The old git-annex Android app is now deprecated in favor of running
git-annex in termux. I suspect all or nearly all of these no longer apply.
This commit was sponsored by Jochen Bartl on Patreon.
Fix regression in last release that crashes when using --all or running
git-annex in a bare repository. May have also affected git-annex unused and
git-annex info.
Reversed the order of the (++) in Annex.Branch.files so --all will stream
lazily still when there are not a bunch of uncommitted journal files.
Added a todo to maybe improve this later.
This commit was sponsored by Trenton Cronholm on Patreon.
Added some tweaks to make git-annex work in termux on Android. The regular
arm standalone tarball now works in termux.
I guess the test for "$base/bin/git" is not really necessary, since it
tests for git-annex. Since that gets deleted on android, removed that test.
These are pretty hackish hacks, especially adding it to PATH. The goal is
to make it work well enough out of the box on Android.
This commit was sponsored by Eric Drechsel on Patreon.
* For url downloads, git-annex now defaults to using a http library,
rather than wget or curl. But, if annex.web-options is set, it will
use curl. To use the .netrc file, run:
git config annex.web-options --netrc
* git-annex no longer uses wget (and wget is no longer shipped with
git-annex builds).
Note that curl is always run in silent mode, since the new API for
download has a MeterUpdate and doesn't make way for curl progress
output. It might be worth writing a parser for curl's progress output
to update the meter when using it, but I didn't bother with this edge
case for now.
This commit was supported by the NSF-funded DataLad project.
Remote.S3 and Remote.Helper.Http both had similar code to sink a
http-conduit Response to a file; refactor out sinkResponseFile.
downloadC downloads an url to a file using http-conduit, and supports
resuming. Falls back to curl to handle urls that http-conduit does not
support. This is not used yet, but the goal is to replace download with
it.
git-annex.cabal: conduit-extra was not actually used for a long time,
remove the dep. conduit moves into the main dependency list, but since
http-conduit was already in there, and it depends on conduit, that's not
really adding a new build dep.
This commit was supported by the NSF-funded DataLad project.
Enable HTTP connection reuse across multiple files, when git-annex
uses http-conduit. Before, a new Manager was created each time
Utility.Url used it. Now, a single Manager gets created the first time,
so connections are reused.
Doesn't help when external programs are used for url download,
but does speed up addurl --fast, fsck --from web, etc.
Testing fsck --fast --from web with 3 files, over high-latency
satellite internet, it sped up from 19.37s to 14.96s.
This commit was supported by the NSF-funded DataLad project.
git annex testremote passes.
exportree not implemented yet, although the documentation talks about it,
since it will be the main way this remote will be used.
The adb push/pull progress is displayed for now; it would be better
to consume it and use it to update the git-annex progress bar.
This commit was sponsored by andrea rota.
There are a lot of different variants and sizes, I suppose we might as well
export all the common ones.
Bump dep to cryptonite to 0.16, earlier versions lacked BLAKE2 support.
Even android has 0.16 or newer.
On Debian, Blake2bp_512 is buggy, so I have omitted it for now.
http://bugs.debian.org/892855
This commit was sponsored by andrea rota.