Better to make it not be surprising and slow, than surprising and fast.
--raw can be used when it needs to be really fast.
Implemented adding a youtube-dl supported url to an existing file.
This commit was sponsored by andrea rota.
Fully working, including --fast/--relaxed.
Note that, while git-annex addurl --relaxed is not going to check
youtube-dl, I kept git annex importfeed --relaxed checking it.
Thinking is that, let's not break people's importfeed cron jobs, and
importfeed does not typically have to check a large number of new items,
so it's ok if it's a little bit slower when used with youtube playlist
feeds.
importfeed's behavior is also improved (?) when a feed has links in it
to non-media files. Before, those were skipped. Now, the content of the
link is downloaded. This had to be done, because trying to use
youtube-dl is slow, and if those were skipped, it would have to check
every time importfeed was run. While this behavior change may not be
desirable for some feeds, that intersperse links to web pages with
enclosures, it will be desirable for other feeds, that have
non-enclosure directy links to media files.
Remove old quvi modules.
This commit was sponsored by Øyvind Andersen Holm.
As it was getting too expensive to patch out use of the "new" syscalls
We could revisit this if someone has hardware with an older kernel
that's still being maintained, but I've verified that the Synology
NAS that had used a too old kernel version has been updated to 2.6.32.
Was trying to rmdir the file, which had already been deleted, and when that
failed, it skipped trying to delete the parent directories.
Noticed the bug through testremote, but it can't itself detect such
problems as there is no enumeration in the API.
This commit was sponsored by Brock Spratlen on Patreon.
As long as the class of remotes supports exporting, it's tested whether
or not the remote is configured with exporttree=yes.
Also, made testremote of a remote configured with exporttree=yes
disable that configuration for testing non-export storage.
This commit was supported by the NSF-funded DataLad project.
When there are multiple urls for a file, still treat it as being present
in the web when some urls don't work, as long as at least one url does
work.
This is consistent with the other web methods handling of multiple urls.
This commit was sponsored by Ole-Morten Duesund on Patreon.
Actual problem is the keyName was set to "Ref \"sha\"", which led to
this follow-on failure since it contained a space.
The bad data would also get into the export database when exporting to a
non-external special remote. Looking briefly at that, I don't think the bad
data will lead to anything more than a re-upload of the file content
now that the problem has been fixed.
This commit was sponsored by Peter Hogg on Patreon.
Seems I forgot to fully test that feature when documenting it.
git rev-parse needs a colon after a branch to de-reference the tree
it points to, rather than the commit. But that had it adding an extra
colon when the user specified "branch:subdir". So, check if there is a
colon before adding one.
This commit was sponsored by Francois Marier on Patreon.
Windows: Fix reversion that caused the path used to link to annexed
content include the drive letter and full path, rather than being
relative. (`git annex fix` will fix up after this problem).
I've not identified the commit that brought the reversion (probably it
happened this spring when I was removing MisingH and last touched
Utility.Path). Likely commit 18b9a4b8024115db67ae309fdaf54e1553037529?
The problem is that relPathDirToFile got called two paths that had the
slashes different ways around. Since takeDrive includes the first slash,
this made two paths on the same drive seem different and it bailed.
(ifdefs around this to avoid doing extra work on non-windows)
This commit was sponsored by Jack Hill on Patreon.
Get ugly reversion out of CHANGELOG.
Also, relocated the windows stack.yaml to top, and updated windows build
instructions.
This commit was sponsored by Henrik Riomar on Patreon.
wget was broken even in the previous old release of the windows bundle,
this is not new breakage. msys-idn-11.dll and probably more would be needed
to use it. git for windows includes msys-idn2-0.dll instead.
Code for terminating processes on Windows is not linking anymore;
made a warning be displayed instead. This breaks restarting the
assistant and git annex assistant --stop.
I hope to see the code added to the Win32 library, where it should fit
better and should avoid whatever problem is making the linker not like it
when included in git-annex. I opened an issue requesting its addition,
here: https://github.com/haskell/win32/issues/91
This commit was sponsored by Thomas Hochstein on Patreon.
This avoids all the complication about redundant work discussed in
the previous try at fixing this. At the expense of needing each command
that could have the problem to be patched to simply wrap the action in
onlyActionOn once the key is known. But there do not seem to be many
such commands.
onlyActionOn' should not be used with a CommandStart (or CommandPerform),
although the types do allow it. onlyActionOn handles running the whole
CommandStart chain. I couldn't immediately see a way to avoid mistken
use of onlyActionOn'.
This commit was supported by the NSF-funded DataLad project.
After a false start, I found a fairly non-intrusive way to deal with it.
Although it only handles transfers -- there may be issues with eg
concurrent dropping of the same key, or other operations.
There is no added overhead when -J is not used, other than an added
inAnnex check. When -J is used, it has to maintain and check a small
Set, which should be negligible overhead.
It could output some message saying that the transfer is being done by
another thread. Or it could even display the same progress info for both
files that are being downloaded since they have the same content. But I
opted to keep it simple, since this is rather an edge case, so it just
doesn't say anything about the transfer of the file until the other
thread finishes.
Since the deferred transfer action still runs, actions that do more than
transfer content will still get a chance to do their other work. (An
example of something that needs to do such other work is P2P.Annex,
where the download always needs to receive the content from the peer.)
And, if the first thread fails to complete a transfer, the second thread
can resume it.
But, this unfortunately means that there's a risk of redundant work
being done to transfer a key that just got transferred.
That's not ideal, but should never cause breakage; the same
thing can occur when running two separate git-annex processes.
The get/move/copy/mirror --from commands had extra inAnnex checks added,
inside the download actions. Without those checks, the first thread
downloaded the content, and then the second thread woke up and
downloaded the same content redundantly.
move/copy/mirror --to is left doing redundant uploads for now. It
would need a second checkPresent of the remote inside the upload
to avoid them, which would be expensive. A better way to avoid
redundant work needs to be found..
This commit was supported by the NSF-funded DataLad project.
git annex add, git annex lock etc make multiple seek passes,
and each seek pass checked that files existed. That was unncessary
redundant work.
Fixed by adding a new WorkTreeItem type, make seek actions use it,
and check that the files exist when constructing it.
This commit was supported by the NSF-funded DataLad project.
Before, there was a window where interrupting an add could result in the
file being moved into the annex, with no symlink yet created.
This commit was supported by the NSF-funded DataLad project.
when storing files there, since that collection is created by initremote.
(This seems to work around some brokenness of the box.com webdav server
which was entering a redirect loop.)
Note that the fix makes locationParent return Nothing instead of "."
when there's no parent directory between the path and the top of the webdav
repo.
This commit was sponsored by André Pereira on Patreon.
In my git-annex repos, I found some stale transfer info files
without lock files.
Pass a mode to tryLockExclusive, so it will create the lock file if
not present, and so not fail to clean up such transfer info files.
Normally, transfer info files are accompanied by a lock file.
But, when alwaysRunTransfer is used, the locking can fail
and it will still write the transfer info file. Perhaps there are other
cases too? Note that mkProgressUpdater's meter
writes to the transfer info file too, and it might be possible for
that meter to fire after runTransfer has cleaned up.
This commit was sponsored by andrea rota.
Fix process and file descriptor leak that was exposed when git-annex was
built with ghc 8.2.1. Apparently ghc has changed its behavior of GC
of open file handles that are pipes to running processes. That
broke git-annex test on OSX due to running out of FDs.
Audited for all uses of Annex.new and made stopCoProcesses be called
once it's done with the state. Fixed several places that might have
leaked in other situations than running the test suite.
This commit was sponsored by Ewen McNeill.
Using annexeval to run probeCrippledFileSystem' caused Git.CurrentRepo.get
to be run. Fixed easily since probeCrippledFileSystem' had no need to use
the Annex monad.
This commit was sponsored by Ethan Aubin.
When the external special remote program crashed, a newline
could be output, which messed up the expected output for --batch mode.
Avoid checking EXPORTSUPPORTED for special remotes that are
not configured to use exports. The datalad special remote apparently is/was
buggy and crashed on EXPORTSUPPORTED. Anyway, there's no need to send
it when the configuration doesn't need it.
This commit was supported by the NSF-funded DataLad project.
Also deletes any tagged pushes that the assistant might have done,
since those would also prevent resetting a branch back.
This commit was sponsored by andrea rota.
Motivation is to remove all metadata when it gets copied from a previous
version of the file, and that is not deisrable.
This commit was supported by the NSF-funded DataLad project.
This is similar to the pusher thread, but a separate thread because git
pushes can be done in parallel with exports, and updating a big export
should not prevent other git pushes going out in the meantime.
The exportThread only runs at most every 30 seconds, since updating an
export is more expensive than pushing. This may need to be tuned.
Added a separate channel for export commits; the committer records a
commit in that channel.
Also, reconnectRemotes records a dummy commit, to make the exporter
thread wake up and make sure all exports are up-to-date. So,
connecting a drive with a directory special remote export will
immediately update it, and getting online will automatically
update S3 and WebDAV exports.
The transfer queue is not involved in exports. Instead, failed
exports are retried much like failed pushes.
This commit was sponsored by Ewen McNeill.
Done to avoid a "tmp" directory appearing in webdav exports.
Also affects non-export webdav remotes, so interrupted uploads using the
old path will not overwrite it. However, PUT is quite likely to be
implemented atomically on web servers anyway, so I doubt this will cause
problems.
inDAVLocation does not url-escape, and so exporting a filename with spaces
to box.com at least resulted in a error 400.
It might also have affected storing keys on a webdav remote, if the key
contained a space or other problem character. Pretty unlikely.
I emailed Clint about the inDAVLocation gotcha, but seems best to fix it
here.
This commit was supported by the NSF-funded DataLad project.
webdav: Checking if a non-existent file is present on Box.com triggered a
bug in its webdav support that generates an infinite series of redirects.
It seems to redirect foo to foo/ to foo/index.php to
foo/index.php/index.php ... Why a webdav endpoint would behave this way
who knows.
Deal with such problems by assuming such behavior means the file is not
present.
Can't simply disable following redirects, because the webdav endpoint could
legitimately be redirected to a new endpoint. So, when this happens
10 redirects have to be followed, before it gives up and assumes this means
the file does not exist.
This commit was supported by the NSF-funded DataLad project.
This basically works, but there's a bug when renaming a file that leaves
a .git-annex-temp-content-key file in the webdav store, that never gets
cleaned up.
Also, exporting files with spaces to box.com seems to fail; perhaps it
does not support it?
This commit was supported by the NSF-funded DataLad project.
In a test, I uploaded a pdf, and several files were derived from it.
After removing the pdf, the derived files went away after approximatly
half an hour. This window does not seem worth warning about every time.
Documented it in the tip.
Removal works, only derives are a potential issue, so allow removing
with a warning. This way, unexporting a file works, and behavior is
consistent with IA remotes whether or not exporttree=yes.
Also tested exporting filenames containing unicode, spaces, underscores.
All worked, despite the IA's faq saying it doesn't.
This commit was sponsored by Trenton Cronholm on Patreon.
It opens a http connection per file exported, but then so does git
annex copy --to s3.
Decided not to munge exported filenames for IA. Too large a chance of
the munging having confusing results. Instead, export of files not
supported by IA, eg with spaces in their name, will fail.
This commit was supported by the NSF-funded DataLad project.
https://github.com/haskell/cabal/issues/4655
This means that when a module is conditionally imported via ifdef
depending on the OS or build flags, the cabal file has to mirror the
same logic there to only list the module then.
Since there are lots of OS's and lots of combinations of build flags
here, it's rather difficult to know if the cabal file has been completelty
correctly updated to match the source code.
So I am very unhappy with needing to update things in two places. I've
only tested this on linux with most build flags enables; this will
probably need significant time and testing to catch every cabal file
tweak that this change to Cabal requires. And it will be a continual
source of compile failures going forward when the code is modified and
the cabal file not also updated.
DRY DRY DRY, I repeat myself, but: DRY! Sigh..
(Also, had to remove all Build.* that are standalone programs from the
Other-Modules list, because since cabal passes those modules to ghc when
building git-annex, it complains that they use module Main. Those
modules are only used when building with the Makefile anyway, so this
change shouldn't break anything.)
This commit was sponsored by Thomas Hochstein on Patreon.
Security fix: Disallow hostname starting with a dash, which would get
passed to ssh and be treated an option. This could be used by an attacker
who provides a crafted ssh url (for eg a git remote) to execute arbitrary
code via ssh -oProxyCommand.
No CVE has yet been assigned for this hole.
The same class of security hole recently affected git itself,
CVE-2017-1000117.
Method: Identified all places where ssh is run, by git grep '"ssh"'
Converted them all to use a SshHost, if they did not already, for
specifying the hostname.
SshHost was made a data type with a smart constructor, which rejects
hostnames starting with '-'.
Note that git-annex already contains extensive use of Utility.SafeCommand,
which fixes a similar class of problem where a filename starting with a
dash gets passed to a program which treats it as an option.
This commit was sponsored by Jochen Bartl on Patreon.
Fix the external special remotes git-annex-remote-ipfs,
git-annex-remote-torrent and the example.sh template to correctly support
filenames with spaces.
This commit was sponsored by John Peloquin on Patreon.
External special remotes will refuse to operate on keys with spaces in
their names. That has never worked correctly due to the design of the
external special remote protocol. Display an error message suggesting
migration.
Not super happy with this, but it's a pragmatic solution. Better than
complicating the external special remote interface and all external special
remotes.
Note that I only made it use SafeKey in Request, not Response. git-annex
does not construct a Response, so that would not add any safety. And
presumably, if git-annex avoids feeding any such keys to an external
special remote, it will never have a reason to make a Response using such a
key. If it did, it would result in a protocol error anyway.
There's still a Serializeable instance for Key; it's used by P2P.Protocol.
There, the Key is always in the final position, so it's ok if it contains
spaces.
Note that the protocol documentation has been fixed to say that the File
may contain spaces. One way that can happen, even though the Key can't,
is when using direct mode, and the work tree filename contains spaces.
When sending such a file to the external special remote the worktree
filename is used.
This commit was sponsored by Thom May on Patreon.
To work around the problem that the external special remote protocol does
not support keys containing spaces.
This commit was sponsored by Denis Dzyubenko on Patreon.
Added remote configuration settings annex-ignore-command and
annex-sync-command, which are dynamic equivilants of the annex-ignore
and annex-sync configurations.
For this I needed a new DynamicConfig infrastructure. Its implementation
should be as fast as before when there is no dynamic config, and it caches
so shell commands are only run once.
Note that annex-ignore-command exits nonzero when the remote should be ignored.
While that may seem backwards, it allows using the same command for it as
for annex-sync-command when you want to disable both.
This commit was sponsored by Trenton Cronholm on Patreon.
By forking a worker process and only deleting the test directory once it exits.
This way, if a test leaves files open, they'll get closed when the worker
exits, so avoiding failure to delete open files on Windows, and failure to
delete directories due to NFS lock files.
If a test leaves a git worker process running, the closed pipes should
cause the worker to exit too, also avoiding the problem there. The 10
second sleep ought to give plenty of time for such worker processes to
exit, although this is of course a race.
Finally, even if test directory fails to be deleted still,
it won't appear as if the last test in the test suite failed; the error
will be displayed at the very end.
This commit was supported by the NSF-funded DataLad project.
Should fix this:
lock (v6 --force): FAIL
Exception: .git/annex/keys: removeDirectoryRecursive: unsatisfied constraints (Directory not empty)
Verified that the test case still catches the regression it's meant to.
This commit was supported by the NSF-funded DataLad project.
Can be used to override the default timestamps used in log files in the
git-annex branch. This is a dangerous environment variable; use with
caution.
Note that this only affects writing to the logs on the git-annex branch.
It is not used for metadata in git commits (other env vars can be set for
that).
There are many other places where timestamps are still used, that don't
get committed to git, but do touch disk. Including regular timestamps
of files, and timestamps embedded in some files in .git/annex/, including
the last fsck timestamp and timestamps in transfer log files.
A good way to find such things in git-annex is to get for getPOSIXTime and
getCurrentTime, although some of the results are of course false positives
that never hit disk (unless git-annex gets swapped out..)
So this commit does NOT necessarily make git-annex comply with some HIPPA
privacy regulations; it's up to the user to determine if they can use it in
a way compliant with such regulations.
Benchmarking: It takes 0.00114 milliseconds to call getEnv
"GIT_ANNEX_VECTOR_CLOCK" when that env var is not set. So, 100 thousand log
files can be written with an added overhead of only 0.114 seconds. That
should be by far swamped by the actual overhead of writing the log files
and making the commit containing them.
This commit was supported by the NSF-funded DataLad project.
QuickCheck added an Arbitrary instance for CTime aka EpochTime. However,
while git-annex's instance disallowed times before the epoch, QuickCheck's
does not. So, rather than using its instance, convert from an Integer.
This commit was sponsored by Thomas Hochstein on Patreon.
Don't trust OSX FSEvents's eventFlagItemModified to be called when the last
writer of a file closes it; apparently that sometimes does not happen,
which prevented files from being quickly added.
This commit was sponsored by John Peloquin on Patreon.
optparse-applicative-0.14.0.0 adds support for these, so have the
Makefile install their scripts when built with it.
CmdLine/GitAnnex/Options.hs now uses action "file" in cmdParams,
which affects the bash and zsh completions, letting them complete
filenames for subcommands that use that. This is not needed for
bash, since bash-completion.bash enables -o bashdefault, which
lets it complete filenames too. But it does not seem to break the bash
completions. It is needed for zsh; the zsh completion otherwise
does not complete filenames. The fish completion will always complete
filenames no matter what. Messy.
This commit was sponsored by Denis Dzyubenko on Patreon.
Previously, only sync branches were merged. This makes regular git push
into a repository watched by the assistant auto-merge.
While this does hardcode an assumption about what the remote tracking
branch is named, which some unusual git configurations won't match,
git-annex sync already made the same assumption.
Also, changed behavior when a tracking branch like
refs/remotes/synced/not/master is received. When on the master branch,
that used to get merged into it, but it's the tracking branch for
not/master, so should only be merged in when on the not/master branch.
This commit was sponsored by Ewen McNeill.
* Added annex.resolvemerge configuration, which can be set to false to
disable the usual automatic merge conflict resolution done by git-annex
sync and the assistant.
* sync: Added --no-resolvemerge option.
Note that disabling merge conflict resolution is probably not a good idea
in a direct mode repo or adjusted branch. Since updates to both are done
outside the usual work tree, if it fails the tree is not left in a
conflicted state, and it would be hard to manually resolve the conflict.
Still, made annex.resolvemerge be supported in those cases for consistency.
This commit was sponsored by Riku Voipio.
When setting metadata of a file that did not exist, no error message was
displayed, unlike getting metadata and most other git-annex commands. Fixed
this oversight.
Note that, if the file exists but is not annexed, there's no error.
This is the same behavior as other git-annex commands.
This commit was supported by the NSF-funded DataLad project.
* move --to=here moves from all reachable remotes to the local repository.
The output of move --from remote is changed slightly, when the remote and
local both have the content. It used to say:
move foo ok
Now:
move foo (from theremote...) ok
That was done so that, when move --to=here is used and the content is
locally present and also in several remotes, it's clear which remotes the
content gets dropped from.
Note that move --to=here will report an error if a non-reachable remote
contains the file, even if the local repository also contains the file. I
think that's reasonable; the user may be intending to move all other copies
of the file from remotes.
OTOH, if a copy of the file is believed to be present in some repository
that is not a configured remote, move --to=here does not report an error.
So a little bit inconsistent, but erroring in this case feels wrong.
copy --to=here came along for free, but it's basically the same behavior as
git-annex get, and probably with not as good messages in edge cases
(especially on failure), so I've not documented it.
This commit was sponsored by Anthony DeRobertis on Patreon.
See my comment. This only avoids the problem for -J; two git-annex
processes started at the same time could still both try to write to
.git/config and one fail. That would be very unlikely though, and it
doesn't really seem worth adding an additional layer of locking around
.git/config.
This commit was supported by the NSF-funded DataLad project.
orElse is great, but was not the right thing to use here because
waitTakeLock could retry for other reasons than the lock being held,
which made tryTakeLock fail when it shouldn't.
Instead, move the code to tryTakeLock and implement waitTakeLock using
tryTakeLock and retry.
(Also, in runTransfer, when checkSaneLock fails, dropLock to avoid leaking a
lock handle.)
This commit was supported by the NSF-funded DataLad project.
When built with concurrent-output 1.9, ssh password prompts will no longer
interfere with the -J display.
To avoid flicker, only done when ssh actually does need to prompt;
ssh is first run in batch mode and if that succeeds the connection is up
and no need to clear regions.
This commit was supported by the NSF-funded DataLad project.
Might want to remove this when it gets fixed, in case adjusted branches are
used in a repo with a great many refs, which would become unnecessarily
slow.
This commit was supported by the NSF-funded DataLad project.
Removed dependency on MissingH, instead depending on the split
library.
After laying groundwork for this since 2015, it
was mostly straightforward. Added Utility.Tuple and
Utility.Split. Eyeballed System.Path.WildMatch while implementing
the same thing.
Since MissingH's progress meter display was being used, I re-implemented
my own. Bonus: Now progress is displayed for transfers of files of
unknown size.
This commit was sponsored by Shane-o on Patreon.
When ssh connection caching is enabled (and when GIT_ANNEX_USE_GIT_SSH is
not set), only one ssh password prompt will be made per host, and only one
ssh password prompt will be made at a time.
This also fixes a race in prepSocket's stale ssh connection stopping
when run with -J. It was possible for one thread to start a cached ssh
connection, and another thread to immediately stop it, resulting in excess
connections being made.
This commit was supported by the NSF-funded DataLad project.
It takes a single key-value backend, rather than the unncessary and confusing list.
The old option still works if set.
Simplified some old old code too.
This commit was sponsored by Thomas Hochstein on Patreon.
fsck already special-cased dead keys to make --all not report errors with
them, and it makes sense to also expand that to whereis. I think it makes
sense for dead keys to be skipped by all uses of --all, so mistakes can be
completely forgotten about and not come back to haunt us.
The speed impact of testing if the key is dead is negligible for fsck and
whereis, since they use the location log anyway and it gets cached.
This does slow down a few commands that support --all, in particular
metadata --all runs around 2x as slow. I don't think metadata
--all is often used though. It might slow down copy/move/mirror
--all and get --all.
log --all is not affected (does not use the normal --all machinery).
Dead keys will still be processed by --incomplete, --branch,
--failed, and --key. Although it would be unlikely for a dead key to
ave in incomplete or failed transfer. It seems to make perfect sense for
--branch to process keys on the branch, even if dead.
(fsck's special-casing of dead keys was left in, so if one of these options
causes a dead key to be fscked, there will be a nice message.)
This commit was supported by the NSF-funded DataLad project.
Unlike git add -u, git annex add -u does not update the index for files
removed from the working tree. But then, "git add ." stages removals,
and "git annex add ." does not, so that's an existing divergence.
Seems that --update --batch would need to run git ls-files once per line of
batch input, which would surely be too slow, so just throw an error for
that.
This commit was supported by the NSF-funded DataLad project.
This was never supported before. And it doesn't re-encrypt the
gcrypt repo to the new gcrypt-participants, but it does at least now not
crash, and set gcrypt-participants.
This commit was sponsored by andrea rota.
They were silently ignored, a reversion introduced in 6.20160527.
I don't like this regular git remote special case in enableremote, but I
can't see a way to get rid of it. So, check if the existing remote is
a Remote.Git
This commit was sponsored by Trenton Cronholm on Patreon.
This is necessary because as feared, the extra -n parameter that git-annex
passes breaks uses of these environment variables that expect exactly the
parameters that git passes.
For example, see https://github.com/datalad/datalad/issues/1456
It would of course be possible to pre-close stdin before running ssh so not
needing the -n, and I think that would not even break ssh's password
caching. But it would probably involve a lot of work, possibly would need
to deal with some layering violations, and would be error-prone. The really
clean fix would be to make all the ssh stuff return a CreateProcess, which
could have the handle closed when appropriate, but that would be a large
reworing of the code base.
This commit was supported by the NSF-funded DataLad project.
The former can be useful to make remotes that don't get fully synced with
local changes, which comes up in a lot of situations.
The latter was mostly added for symmetry, but could be useful (though less
likely to be).
Implementing `remote.<name>.annex-pull` was a bit tricky, as there's no one
place where git-annex pulls/fetches from remotes. I audited all
instances of "fetch" and "pull". A few cases were left not checking this
config:
* Git.Repair can try to pull missing refs from a remote, and if the local
repo is corrupted, that seems a reasonable thing to do even though
the config would normally prevent it.
* Assistant.WebApp.Gpg and Remote.Gcrypt and Remote.Git do fetches
as part of the setup process of a remote. The config would probably not
be set then, and having the setup fail seems worse than honoring it if it
is already set.
I have not prevented all the code that does a "merge" from merging branches
from remotes with remote.<name>.annex-pull=false. That could perhaps
be done, but it would need a way to map from branch name to remote name,
and the way refspecs work makes that hard to get really correct. So if the
user fetches manually, the git-annex branch will get merged, for example.
Anther way of looking at/justifying this is that the setting is called
"annex-pull", not "annex-merge".
This commit was supported by the NSF-funded DataLad project.
They are handled close the same as they are by git. However, unlike git,
git-annex sometimes needs to pass the -n parameter when using these.
So, this has the potential for breaking some setup, and perhaps there ought
to be a ANNEX_USE_GIT_SSH=1 needed to use these. But I'd rather avoid that
if possible, so let's see if anyone complains.
Almost all places where "ssh" was run have been changed to support the env
vars. Anything still calling sshOptions does not support them. In
particular, rsync special remotes don't. Seems that annex-rsync-transport
already gives sufficient control there.
(Fixed in passing: Remote.Helper.Ssh.toRepo used to extract
remoteAnnexSshOptions and pass them to sshOptions, which was redundant
since sshOptions also extracts those.)
This commit was sponsored by Jeff Goeke-Smith on Patreon.
Fix bug when used with a recently cloned repository, where
"merging" messages were included in the output of configlist (and perhaps
other commands) and caused a "Failed to get annex.uuid configuration"
error.
This does not seem to have been a reversion.
I saw this with configlist, but it seems possible for other commands to be
effected, and it might not always happen only after a fresh clone. Eg, if a
foo/git-annex branch is pushed to the remote, the next git-annex-shell will
auto-merge it and display the message.
Decided to run all git-annex-shell commands with noMessages,
even ones that don't currently use stdout for structured communication.
Better to keep open the possibility for using stdout in the future.
This commit was supported by the NSF-funded DataLad project
The bug was that withFile closes the handle afterwards, but the content
of the file was not read due to laziness. Using readFile avoids it.
This commit was sponsored by Nick Daly on Patreon.
findShellCommand needs a full path to a file in order to check it for a
shebang on Windows. It was being run with only the base name of the external
special remote program, which would only work when it was in the current
directory.
This is why users in
https://github.com/DanielDent/git-annex-remote-rclone/pull/10 and elsewhere
were complaining that the previous improvements to git-annex didn't make
git-remote-rclone work on Windows.
Also, reworked checkearlytermination, which while it worked, seemed
to rely on a race condition. And, improved its error messages.
This commit was sponsored by Shane-o on Patreon.
It was distributing jobs to remotes that were not being used by any other
job. But, suppose that there are only 2 remotes, and -J10. In such a case,
the first 2 downloads would be distributed amoung the 2 remotes, but
the other 8 would all go to remote #1. Improved by keeping a counter
of how many jobs are assigned to a remote, and prefer remotes with fewer
jobs.
Note use of Data.Map.Strict to avoid blowing up space. I kept the
bang-patterns as-is, although probably not needed with Data.Map.Strict.
This commit was sponsored by Jack Hill on Patreon.
The slowdown is not going to be large in typical small-ish repos.
And it does not seem to matter if the assistant reacts a little bit slower
in situations involving the expensive scan, since:
a) Those situations typically involve getting back in sync after something
has changed on a remote, often after a disconnect of some duration.
So taking a few seconds more is not noticable.
b) If the scan finds things that it needs to do, it will start
blocking anyway after 10 transfers are queued (due to use of
queueTransferWhenSmall). So, only the speed of finding the first 10
transfers will be impacted by this change.
This commit was sponsored by Jochen Bartl on Patreon.
It was relying on segmentPaths to work correctly, so when it didn't,
sometimes the file that did not exist got matched up with a non-null
list of results. Fixed by always checking if each parameter exists.
There are two reason segmentPaths might not work correctly.
For one, it assumes that when the original list of paths
has more than 100 paths, it's not worth paying the CPU cost to
preserve input orders.
And then, it fails when a directory such as "." or ".." or
/path/to/repo is in the input list, and the list of found paths
does not start with that same thing. It should probably not be using
dirContains, but something else.
But, it's not clear how to handle this fully. Consider
when [".", "subdir"] has been expanded by git ls-files to
["subdir/1", "subdir/2"]
-- Both of the inputs contained those results, so there's
no one right answer for segmentPaths. All these would be equally valid:
[["subdir/1", "subdir/2"], []]
[[], ["subdir/1", "subdir/2"]]
[["subdir/1"], [""subdir/2"]]
So I've not tried to improve segmentPaths.
* init: When annex.securehashesonly has been set with git-annex config,
copy that value to the annex.securehashesonly git config.
* config --set: As well as setting value in git-annex branch,
set local gitconfig. This is needed especially for
annex.securehashesonly, which is read only from local gitconfig and not
the git-annex branch.
doc/todo/sha1_collision_embedding_in_git-annex_keys.mdwn has the
rationalle for doing it this way. There's no perfect solution; this
seems to be the least-bad one.
This commit was supported by the NSF-funded DataLad project.
Added --securehash option to match files using a secure hash function, and
corresponding securehash preferred content expression.
This commit was sponsored by Ethan Aubin.
Cryptographically secure hashes can be forced to be used in a repository,
by setting annex.securehashesonly. This does not prevent the git repository
from containing files with insecure hashes, but it does prevent the content
of such files from being pulled into .git/annex/objects from another
repository.
We want to make sure that at no point does git-annex accept content into
.git/annex/objects that is hashed with an insecure key. Here's how it
was done:
* .git/annex/objects/xx/yy/KEY/ is kept frozen, so nothing can be
written to it normally
* So every place that writes content must call, thawContent or modifyContent.
We can audit for these, and be sure we've considered all cases.
* The main functions are moveAnnex, and linkToAnnex; these were made to
check annex.securehashesonly, and are the main security boundary
for annex.securehashesonly.
* Most other calls to modifyContent deal with other files in the KEY
directory (inode cache etc). The other ones that mess with the content
are:
- Annex.Direct.toDirectGen, in which content already in the
annex directory is moved to the direct mode file, so not relevant.
- fix and lock, which don't add new content
- Command.ReKey.linkKey, which manually unlocks it to make a
copy.
* All other calls to thawContent appear safe.
Made moveAnnex return a Bool, so checked all callsites and made them
deal with a failure in appropriate ways.
linkToAnnex simply returns LinkAnnexFailed; all callsites already deal
with it failing in appropriate ways.
This commit was sponsored by Riku Voipio.
Yesterday's SHA1 collision attack could be used to generate eg:
SHA256-sfoo--whatever.good
SHA256-sfoo--whatever.bad
Such that they collide. A repository with the good one could have the
bad one swapped in and signed commits would still verify.
I've already mitigated this.
I am not happy that I had to put backend-specific code in file2key. But
it would be very difficult to avoid this layering violation.
Most of the time, when parsing a Key from a symlink target, git-annex
never looks up its Backend at all, so adding this check to a method of
the Backend object would not work.
The Key could be made to contain the appropriate
Backend, but since Backend is parameterized on an "a" that is fixed to
the Annex monad later, that would need Key to change to "Key a".
The only way to clean this up that I can see would be to have the Key
contain a LowlevelBackend, and put the validation in LowlevelBackend.
Perhaps later, but that would be an extensive change, so let's not do
it in this commit which may want to cherry-pick to backports.
This commit was sponsored by Ethan Aubin.
* Run curl with -S, so HTTP errors are displayed, even when
it's otherwise silent.
* When downloading in --json or --quiet mode, use curl in preference
to wget, since curl is able to display only errors to stderr, unlike
wget.
This does mean that downloadQuiet is only silent on stdout, not necessarily
on stderr, which affects a couple other calls of it. For example,
downloading the .git/config of a http remote may show an error message now,
perhaps with slightly suboptimal formatting due to other output.
This adds one extra line of output when a download is successful,
after the progress bar. I don't much like that, but wget does not provide a
way to show HTTP errors without it.
sync: When syncing with a local repository located on a crippled
filesystem, run the post-receive hook there, since it wouldn't get run
otherwise. This makes pushing to repos on FAT-formatted removable drives
update them when receive.denyCurrentBranch=updateInstead.
Made Remote.Git export onLocal, which was cleaned up to not have so many
caveats about its use.
This commit was sponsored by Jeff Goeke-Smith on Patreon.
* Added post-recieve hook, which makes updateInstead work with direct
mode and adjusted branches.
* init: Set up the post-receive hook.
This commit was sponsored by Fernando Jimenez on Patreon.
config group groupwanted numcopies schedule wanted required: Avoid
displaying extraneous messages about repository auto-init, git-annex branch
merging, etc, when being used to get information.
By displaying error messages from the remote then it fails to update
its checked out branch.
Error messages in the default receive.denyCurrentBranch are still
suppressed, which matches user expectations.
This commit was sponsored by Nick Daly on Patreon.
... to avoid it consuming stdin that it shouldn't.
This fixes git-annex-checkpresentkey --batch remote, which didn't output
results for all keys passed into it.
Other git-annex commands that communicate with a remote over ssh may also
have been consuming stdin that they shouldn't have, which could have
impacted using them in eg, shell scripts. For example, a shell script
reading files from stdin and passing them to git annex drop would be
impacted by this bug, whenever git annex drop ran git-annex-shell
checkpresent, it would consume part/all of the stdin that the shell script
was supposed to consume.
Fixed by adding a ConsumeStdin parameter to Annex.Ssh.sshOptions, which
is used throughout git-annex to run ssh (in order for ssh connection
caching to work). Every call site was checked to see if it used
CreatePipe for stdin, and if not was marked NoConsumeStdin.
At first I wanted to make it go ahead and merge into the newborn branch,
so made it use Git.Branch.currentUnsafe to get the current branch. But that
failed:
fatal: ambiguous argument 'refs/heads/master..refs/heads/synced/master':
unknown revision or path not in the working tree.
A whole nother code path to handle merging into newborn branches seemed
excessive, so went with displaying a warning and propigating failure
status.
This commit was sponsored by Brock Spratlen on Patreon.
Refactored some common code into initDb.
This only deals with the problem when creating new databases. If a repo
got bad permissions into it, it's up to the user to deal with it.
This commit was sponsored by Ole-Morten Duesund on Patreon.
The check was broken in two ways.. First, nowhere did it error out when
checkUUIDFile found a different UUID already in the file. Instead,
it overwrote the uuid file.
And, checkUUIDFile's implementation was for some reason always failing with
a ConnectionClosed exception. Apparently something to do with using two
different runResourceT's and a response getting GCed inbetween. I'm pretty
sure that used to work, but changed to a more obviously correct
implementation.
This commit was sponsored by Peter Hogg on Patreon.
Probing for hard link support in the pid locking code is redundant since
git-annex init already probes that. But, it didn't seem worth threading
that data through; the pid locking code runs at most once per git-annex
process, and only on unusual filesystems. Optimising a single hard link
and unlink isn't worth it.
This commit was sponsored by Francois Marier on Patreon.
Git does not provide a switch to find out where this directory is, and
while the git-init man page says it will always be in
/usr/share/git-core/templates, that's not the case on OSX with git
installed from homebrew. So, I used a hack taking the --man-path and
constructing a path from that. Works on both Debian and OSX at least.
This is the same as running git annex reinject --known, followed by
git-annex import. The advantage to having it in one command is that it
only has to hash each file once; the two commands have to
hash the imported files a second time.
This commit was sponsored by Shane-o on Patreon.
import: --deduplicate and --skip-duplicates were implemented inneficiently;
they unncessarily hashed each file twice. They have been improved to only
hash once.
The new approach is to lock down (minimally) and hash files, and then
reuse that information when importing them.
This was rather tricky, especially in detecting changes to files while
they are being imported.
The output of import changed slightly. While before it silently skipped
over files with eg --skip-duplicates, now it shows each file as it starts
to act on it. Since every file is hashed first thing, it would otherwise
not be clear what file import is chewing on. (Actually, it wasn't clear
before when any of the duplicates switches were used.)
This commit was sponsored by Alexander Thompson on Patreon.
Before, only content known to be present somewhere was considered a
duplicate. Now, any content that has been annexed before will be considered
a duplicate, even if all annexed copies of the data have been lost.
Note that --clean-duplicates and --deduplicate still check numcopies,
so won't delete duplicate files unless there's an annexed copy.
This makes import use the same method as reinject --known.
The man page already said that duplicate meant "its content is either
present in the local repository already, or git-annex knows of another
repository that contains it, or it was present in the annex before but has
been removed now". So, this is really only bringing the implementation into
line with the man page.
This commit was sponsored by Jochen Bartl on Patreon.
Wormhole pairing will start to provide an appid to wormhole on 2021-12-31.
An appid can't be provided now because Debian stable is going to ship a
older version of git-annex that does not provide an appid. Assumption is
that by 2021-12-31, this version of git-annex will be shipped in a Debian
stable release. If that turns out to not be the case, this change will need
to be cherry-picked into the git-annex in Debian stable, or its wormhole
pairing will break.
This commit was sponsored by Thomas Hochstein on Patreon.
.. which can be set to true to make git annex sync default to --content.
This may become the default at some point in the future.
As well as being configuable by git config, it can be configured by
git-annex config to control the default behavior in all clones of a
repository.
Had to add a separate --no-content switch to we can tell if it's been
explicitly set, and should override annex.synccontent. If --content was the
default, this complication would not be necessary.
This commit was sponsored by Jake Vosloo on Patreon.
... to control the default behavior in all clones of a repository.
This includes a new Configurable data type, so the GitConfig type indicates
which values can be configured this way.
The implementation should be quite efficient; the config log is only read
once, and only when a Configurable value has not already been set by
git-config.
Indeed, it would be nice in the future to extend this, so that git-config
is itself only read on demand. Some commands may not need to look at the
git configuration at all.
This commit was sponsored by Trenton Cronholm on Patreon.
Argh, didn't need an accumulator here!
I think I use accumulators a lot more than I need to when recusively
processing lists..
This commit was sponsored by Jeff Goeke-Smith on Patreon.
This makes it a little bit slower since it has to check file size,
but worth it to fix a potential memory use problem.
This commit was sponsored by Fernando Jimenez on Patreon.
Turns out that Data.List.Utils.split is slow and makes a lot of
allocations. Here's a much simpler single character splitter that behaves
the same (even in wacky corner cases) while running in half the time and
75% the allocations.
As well as being an optimisation, this helps move toward eliminating use of
missingh.
(Data.List.Split.splitOn is nearly as slow as Data.List.Utils.split and
allocates even more.)
I have not benchmarked the effect on git-annex, but would not be surprised
to see some parsing of eg, large streams from git commands run twice as
fast, and possibly in less memory.
This commit was sponsored by Boyd Stephen Smith Jr. on Patreon.
esqueleto finally got fixed, thanks to @bitemyapp
Since XMPP was removed, the previous build failures related to it should
no longer be a problem either.
Meanwhile, lts-5.18 fails to build anymore on Debian due to linker
hardening breaking the version of ghc stack uses with that version.
This commit was sponsored by Francois Marier on Patreon.
Any config names can be set using this; git-annex commands will only look
at specific ones that make sense and are worth the overhead of querying the
branch.
This might also be useful for storing whatever other config-type stuff the
user might want to shove into the git-annex branch.
This commit was sponsored by Jochen Bartl on Patreon.
Docs say vicfg can configure everything from git-annex branch,
so it ought to configure numcopies.
Note that commenting out existing numcopies does not unset it.
This commit was sponsored by Thom May on Patreon.
This way we know that after enable-tor, the tor hidden service is fully
published and working, and so there should be no problems with it at
pairing time.
It has to start up its own temporary listener on the hidden service. It
would be nice to have it start the remotedaemon running, so that extra
step is not needed afterwards. But, there may already be a remotedaemon
running, in communication with the assistant and we don't want to start
another one. I thought about trying to HUP any running remotedaemon, but
Windows does not make it easy to do that. In any case, having the user
start the remotedaemon themselves lets them know it needs to be running
to serve the hidden service.
This commit was sponsored by Boyd Stephen Smith Jr. on Patreon.
weasel explained that apparmor limits on what files tor can read do not
apply to sockets (because they're not files). And apparently the
problems I was seeing with hidden services not being accessible had to
do with onion address propigation and not the location of the socket
file.
remotedaemon looks up the HiddenServicePort in torrc, so if it was
previously configured with the socket in /etc, that will still work.
This commit was sponsored by Denis Dzyubenko on Patreon.
This reverts commit 3037feb1bf.
On second thought, this was an overcomplication of what should be the
lowest-level primitive. Let's build bi-directional links at the pairing
level with eg magic wormhole.
Both the local and remote git repositories get remotes added
pointing at one-another.
Makes pairing twice as easy!
Security: The new LINK command in the protocol can be sent repeatedly,
but only by a peer who has authenticated with us. So, it's entirely safe to
add a link back to that peer, or to some other peer it knows about.
Anything we receive over such a link, the peer could send us over the
current connection.
There is some risk of being flooded with LINKs, and adding too many
remotes. To guard against that, there's a hard cap on the number of remotes
that can be set up this way. This will only be a problem if setting up
large p2p networks that have exceptional interconnectedness.
A new, dedicated authtoken is created when sending LINK.
This also allows, in theory, using a p2p network like tor, to learn about
links on other networks, like telehash.
This commit was sponsored by Bruno BEAUFILS on Patreon.
Revert ServerAliveInterval change in 6.20161111, which caused problems
with too many old versions of ssh and unusual ssh configurations.
It should have not been needed anyway since ssh is supposted to
have TCPKeepAlive enabled by default.
1 microsecond delay is ugly.. but, maintaining an queue of a list of timestamps
and taking a new one from the queue each time around, or maintaining a timestamp
counter, would probably be slower.