when storing files there, since that collection is created by initremote.
(This seems to work around some brokenness of the box.com webdav server
which was entering a redirect loop.)
Note that the fix makes locationParent return Nothing instead of "."
when there's no parent directory between the path and the top of the webdav
repo.
This commit was sponsored by André Pereira on Patreon.
Fix process and file descriptor leak that was exposed when git-annex was
built with ghc 8.2.1. Apparently ghc has changed its behavior of GC
of open file handles that are pipes to running processes. That
broke git-annex test on OSX due to running out of FDs.
Audited for all uses of Annex.new and made stopCoProcesses be called
once it's done with the state. Fixed several places that might have
leaked in other situations than running the test suite.
This commit was sponsored by Ewen McNeill.
When the external special remote program crashed, a newline
could be output, which messed up the expected output for --batch mode.
Avoid checking EXPORTSUPPORTED for special remotes that are
not configured to use exports. The datalad special remote apparently is/was
buggy and crashed on EXPORTSUPPORTED. Anyway, there's no need to send
it when the configuration doesn't need it.
This commit was supported by the NSF-funded DataLad project.
Now when one repository has exported a tree, another repository can get
files from the export, after syncing.
There's a bug: While the database update works, somehow the database on
disk does not get updated, and so the database update is run the next
time, etc. Wasn't able to figure out why yet.
This commit was sponsored by Ole-Morten Duesund on Patreon.
Use ExportTree rather than ExportedLocation for retrieveKeyFile and
checkPresent. When another remote exported the content, ExportTree will
be populated, but ExportedLocation will not be.
It would be possible to implement storeKey to exports as well, but it
risks performing a lot of unncessary work when another repository
already stored the key on the export and the local repository doesn't
know about it.
The only way to avoid that work would be for storeKey to use checkPresentExport
before uploading. But, the other repository could have changed the
exported tree as well, so that can't be trusted, and if it were used in
storeKey, could result in bad information getting into the location log.
This commit was sponsored by Bruno BEAUFILS on Patreon.
New table needed to look up what filenames are used in the currently
exported tree, for reasons explained in export.mdwn.
Also, added smart constructors for ExportLocation and ExportDirectory to
make sure they contain filepaths with the right direction slashes.
And some code refactoring.
This commit was sponsored by Francois Marier on Patreon.
There does not seem to be a use case for supporting that, and it would
need a lot of complication to support it in a way that allows eventual
consistency when two repositories are updating the same export.
This commit was sponsored by Henrik Riomar on Patreon.
Done to avoid a "tmp" directory appearing in webdav exports.
Also affects non-export webdav remotes, so interrupted uploads using the
old path will not overwrite it. However, PUT is quite likely to be
implemented atomically on web servers anyway, so I doubt this will cause
problems.
The subtle part of this is what happens when the remote fails to remove
an empty directory. The removal from the export needs to fail in that
case, so the removal will be tried again later. However, removeExportLocation
has already been run and changed the export db, so if the next run
checks getExportLocation, it might decide nothing remains to be done,
leaving the empty directory.
Dealt with that by making removeEmptyDirectories, handle a failure
by calling addExportLocation, reverting the database changes so the next
run will be guaranteed to try deleting the empty directory again.
This commit was sponsored by Thomas Hochstein on Patreon.
Not yet called by Command.Export.
WebDAV needs this to clean up empty collections. Also, example.sh turned
out to not be cleaning up directories when removing content
from them, so it made sense for it to use this.
Remote.Directory did not need it, and since its cleanup method for empty
directories is more efficient than what Command.Export will need to do
to find empty directories, it uses Nothing so that extra work can be
avoided.
This commit was sponsored by Thom May on Patreon.
Apparently box.com renaming is just buggy. I tried a couple of fixes:
* In case the http Manager was opening multiple connections and reaching
different backend servers, I tried limiting the number of connections
to 1. Didn't help.
* To make sure it was not a http connection reuse problem, I tried
rewriting how exportAction works, so that the same http connection
is clearly open. Didn't help.
So, disable renaming of exports for box.com. It would be good to test it
with some other webdav server.
This commit was sponsored by John Peloquin on Patreon.
Use tmp/key when exporting, so the whole export directory structure does
not have to be created under tmp/
This commit was sponsored by Denis Dzyubenko on Patreon.
inDAVLocation does not url-escape, and so exporting a filename with spaces
to box.com at least resulted in a error 400.
It might also have affected storing keys on a webdav remote, if the key
contained a space or other problem character. Pretty unlikely.
I emailed Clint about the inDAVLocation gotcha, but seems best to fix it
here.
This commit was supported by the NSF-funded DataLad project.
webdav: Checking if a non-existent file is present on Box.com triggered a
bug in its webdav support that generates an infinite series of redirects.
It seems to redirect foo to foo/ to foo/index.php to
foo/index.php/index.php ... Why a webdav endpoint would behave this way
who knows.
Deal with such problems by assuming such behavior means the file is not
present.
Can't simply disable following redirects, because the webdav endpoint could
legitimately be redirected to a new endpoint. So, when this happens
10 redirects have to be followed, before it gives up and assumes this means
the file does not exist.
This commit was supported by the NSF-funded DataLad project.
This basically works, but there's a bug when renaming a file that leaves
a .git-annex-temp-content-key file in the webdav store, that never gets
cleaned up.
Also, exporting files with spaces to box.com seems to fail; perhaps it
does not support it?
This commit was supported by the NSF-funded DataLad project.
In a test, I uploaded a pdf, and several files were derived from it.
After removing the pdf, the derived files went away after approximatly
half an hour. This window does not seem worth warning about every time.
Documented it in the tip.
Since renameExport is allowed to fail for any reason, and its failure is
always recovered from by doing a new upload and deleting the old
content, this avoids unnecessary noise.
Copying a file on the IA failed, apparently something wrong with their
emulation of S3:
S3Error {s3StatusCode = Status {statusCode = 400, statusMessage = "Bad Request"}, s3ErrorCode = "InvalidArgument", s3ErrorMessage = "Invalid Argument", s3ErrorResource = Just "x-(amz|archive)-copy-source header is bad: 'joeyh-public-test2/foo'", s3ErrorHostId = Nothing, s3ErrorAccessKeyId = Nothing, s3ErrorStringToSign = Nothing, s3ErrorBucket = Nothing, s3ErrorEndpointRaw = Nothing, s3ErrorEndpoint = Nothing}
This commit was sponsored by Jake Vosloo on Patreon.
Removal works, only derives are a potential issue, so allow removing
with a warning. This way, unexporting a file works, and behavior is
consistent with IA remotes whether or not exporttree=yes.
Also tested exporting filenames containing unicode, spaces, underscores.
All worked, despite the IA's faq saying it doesn't.
This commit was sponsored by Trenton Cronholm on Patreon.
It opens a http connection per file exported, but then so does git
annex copy --to s3.
Decided not to munge exported filenames for IA. Too large a chance of
the munging having confusing results. Instead, export of files not
supported by IA, eg with spaces in their name, will fail.
This commit was supported by the NSF-funded DataLad project.
Don't allow "exporttree=yes" to be set when the special remote
does not support exports. That would be confusing since the user would
set up a special remote for exports, but `git annex export` to it would
later fail.
This commit was supported by the NSF-funded DataLad project.
Straightforward enough, except for the needed belt-and-suspenders sanity
checks to avoid foot shooting due to exports not being key/value stores.
* Even when annex.verify=false, always verify from exports.
* Only get files from exports that use a backend that supports
checksum verification.
* Never trust exports, even if the user says to, because then
`git annex drop` would drop content if the export seemed to contain
a copy.
This commit was supported by the NSF-funded DataLad project.
* Only export to remotes that were initialized to support it.
* Prevent storing key/value on export remotes.
* Prevent enabling exporttree=yes and encryption in the same remote.
SetupStage Enable was changed to take the old RemoteConfig.
This allowed only setting exporttree when initially setting up a
remote, and not configuring it later after stuff might already be stored
in the remote.
Went with =yes rather than =true for consistency with other parts of
git-annex. Changed docs accordingly.
This commit was supported by the NSF-funded DataLad project.
This will allow disabling exports for remotes that are not configured to
allow them. Also, exportSupported will be useful for the external
special remote to probe.
This commit was supported by the NSF-funded DataLad project
This avoids needing to deal with the complexity of partially transferred
files in the export. We'd not be able to resume uploading to such a file
anyway, so just avoid them.
The implementation in Remote.Directory is not completely ideal, because
it could leave the temp file hanging around in the export directory.
This only happens if it's killed with -9, or there's a power failure;
normally viaTmp cleans up after itself, even when interrupted. I could
not see a better way to do it though, since the export directory might
be the root of a filesystem.
Also some design thoughts on resuming, which depend on storeExport being
atomic.
This commit was sponsored by Fernando Jimenez on Partreon.
Rather than providing the key to export, provide the file.
When exporting a treeish that contains files that are not annexed,
this will let the content of those files also be exported.
There's still a Key in the interface; it will be used by the external
special remote protocol. A SHA1 key can be used when exporting
non-annexed files.
This commit was sponsored by Brock Spratlen on Patreon.
Implemented so far for the directory special remote.
Several remotes don't make sense to export to. Regular Git remotes,
obviously, do not. Bup remotes almost certianly do not, since bup would
need to be used to extract the export; same store for Ddar. Web and
Bittorrent are download-only. GCrypt is always encrypted so exporting to
it would be pointless. There's probably no point complicating the Hook
remotes with exporting at this point. External, S3, Glacier, WebDAV,
Rsync, and possibly Tahoe should be modified to support export.
Thought about trying to reuse the storeKey/retrieveKeyFile/removeKey
interface, rather than adding a new interface. But, it seemed better to
keep it separate, to avoid a complicated interface that sometimes
encrypts/chunks key/value storage and sometimes users non-key/value
storage. Any common parts can be factored out.
Note that storeExport is not atomic.
doc/design/exporting_trees_to_special_remotes.mdwn has some things in
the "resuming exports" section that bear on this decision. Basically,
I don't think, at this time, that an atomic storeExport would help with
resuming, because exports are not key/value storage, and we can't be
sure that a partially uploaded file is the same content we're currently
trying to export.
Also, note that ExportLocation will always use unix path separators.
This is important, because users may export from a mix of windows and
unix, and it avoids complicating the API with path conversions,
and ensures that in such a mix, they always use the same locations for
exports.
This commit was sponsored by Bruno BEAUFILS on Patreon.
Security fix: Disallow hostname starting with a dash, which would get
passed to ssh and be treated an option. This could be used by an attacker
who provides a crafted ssh url (for eg a git remote) to execute arbitrary
code via ssh -oProxyCommand.
No CVE has yet been assigned for this hole.
The same class of security hole recently affected git itself,
CVE-2017-1000117.
Method: Identified all places where ssh is run, by git grep '"ssh"'
Converted them all to use a SshHost, if they did not already, for
specifying the hostname.
SshHost was made a data type with a smart constructor, which rejects
hostnames starting with '-'.
Note that git-annex already contains extensive use of Utility.SafeCommand,
which fixes a similar class of problem where a filename starting with a
dash gets passed to a program which treats it as an option.
This commit was sponsored by Jochen Bartl on Patreon.
External special remotes will refuse to operate on keys with spaces in
their names. That has never worked correctly due to the design of the
external special remote protocol. Display an error message suggesting
migration.
Not super happy with this, but it's a pragmatic solution. Better than
complicating the external special remote interface and all external special
remotes.
Note that I only made it use SafeKey in Request, not Response. git-annex
does not construct a Response, so that would not add any safety. And
presumably, if git-annex avoids feeding any such keys to an external
special remote, it will never have a reason to make a Response using such a
key. If it did, it would result in a protocol error anyway.
There's still a Serializeable instance for Key; it's used by P2P.Protocol.
There, the Key is always in the final position, so it's ok if it contains
spaces.
Note that the protocol documentation has been fixed to say that the File
may contain spaces. One way that can happen, even though the Key can't,
is when using direct mode, and the work tree filename contains spaces.
When sending such a file to the external special remote the worktree
filename is used.
This commit was sponsored by Thom May on Patreon.
Added remote configuration settings annex-ignore-command and
annex-sync-command, which are dynamic equivilants of the annex-ignore
and annex-sync configurations.
For this I needed a new DynamicConfig infrastructure. Its implementation
should be as fast as before when there is no dynamic config, and it caches
so shell commands are only run once.
Note that annex-ignore-command exits nonzero when the remote should be ignored.
While that may seem backwards, it allows using the same command for it as
for annex-sync-command when you want to disable both.
This commit was sponsored by Trenton Cronholm on Patreon.
Removed dependency on MissingH, instead depending on the split
library.
After laying groundwork for this since 2015, it
was mostly straightforward. Added Utility.Tuple and
Utility.Split. Eyeballed System.Path.WildMatch while implementing
the same thing.
Since MissingH's progress meter display was being used, I re-implemented
my own. Bonus: Now progress is displayed for transfers of files of
unknown size.
This commit was sponsored by Shane-o on Patreon.
This was never supported before. And it doesn't re-encrypt the
gcrypt repo to the new gcrypt-participants, but it does at least now not
crash, and set gcrypt-participants.
This commit was sponsored by andrea rota.
They are handled close the same as they are by git. However, unlike git,
git-annex sometimes needs to pass the -n parameter when using these.
So, this has the potential for breaking some setup, and perhaps there ought
to be a ANNEX_USE_GIT_SSH=1 needed to use these. But I'd rather avoid that
if possible, so let's see if anyone complains.
Almost all places where "ssh" was run have been changed to support the env
vars. Anything still calling sshOptions does not support them. In
particular, rsync special remotes don't. Seems that annex-rsync-transport
already gives sufficient control there.
(Fixed in passing: Remote.Helper.Ssh.toRepo used to extract
remoteAnnexSshOptions and pass them to sshOptions, which was redundant
since sshOptions also extracts those.)
This commit was sponsored by Jeff Goeke-Smith on Patreon.
findShellCommand needs a full path to a file in order to check it for a
shebang on Windows. It was being run with only the base name of the external
special remote program, which would only work when it was in the current
directory.
This is why users in
https://github.com/DanielDent/git-annex-remote-rclone/pull/10 and elsewhere
were complaining that the previous improvements to git-annex didn't make
git-remote-rclone work on Windows.
Also, reworked checkearlytermination, which while it worked, seemed
to rely on a race condition. And, improved its error messages.
This commit was sponsored by Shane-o on Patreon.
sync: When syncing with a local repository located on a crippled
filesystem, run the post-receive hook there, since it wouldn't get run
otherwise. This makes pushing to repos on FAT-formatted removable drives
update them when receive.denyCurrentBranch=updateInstead.
Made Remote.Git export onLocal, which was cleaned up to not have so many
caveats about its use.
This commit was sponsored by Jeff Goeke-Smith on Patreon.
... to avoid it consuming stdin that it shouldn't.
This fixes git-annex-checkpresentkey --batch remote, which didn't output
results for all keys passed into it.
Other git-annex commands that communicate with a remote over ssh may also
have been consuming stdin that they shouldn't have, which could have
impacted using them in eg, shell scripts. For example, a shell script
reading files from stdin and passing them to git annex drop would be
impacted by this bug, whenever git annex drop ran git-annex-shell
checkpresent, it would consume part/all of the stdin that the shell script
was supposed to consume.
Fixed by adding a ConsumeStdin parameter to Annex.Ssh.sshOptions, which
is used throughout git-annex to run ssh (in order for ssh connection
caching to work). Every call site was checked to see if it used
CreatePipe for stdin, and if not was marked NoConsumeStdin.
The check was broken in two ways.. First, nowhere did it error out when
checkUUIDFile found a different UUID already in the file. Instead,
it overwrote the uuid file.
And, checkUUIDFile's implementation was for some reason always failing with
a ConnectionClosed exception. Apparently something to do with using two
different runResourceT's and a response getting GCed inbetween. I'm pretty
sure that used to work, but changed to a more obviously correct
implementation.
This commit was sponsored by Peter Hogg on Patreon.
Most remotes have an idempotent setup that can be reused for
enableremote, but in a few cases, it needs to tell which, and whether
a UUID was provided to setup was used.
This is groundwork for making initremote be able to provide a UUID.
It should not change any behavior.
Note that it would be nice to make the UUID always be provided to setup,
and make setup not need to generate and return a UUID. What prevented
this simplification is Remote.Git.gitSetup, which needs to reuse the
UUID of the git remote when setting it up, and so has to return that
UUID.
This commit was sponsored by Thom May on Patreon.
Turns out that Data.List.Utils.split is slow and makes a lot of
allocations. Here's a much simpler single character splitter that behaves
the same (even in wacky corner cases) while running in half the time and
75% the allocations.
As well as being an optimisation, this helps move toward eliminating use of
missingh.
(Data.List.Split.splitOn is nearly as slow as Data.List.Utils.split and
allocates even more.)
I have not benchmarked the effect on git-annex, but would not be surprised
to see some parsing of eg, large streams from git commands run twice as
fast, and possibly in less memory.
This commit was sponsored by Boyd Stephen Smith Jr. on Patreon.
The attacker could just send a very lot of data, with no \n and it would
all be buffered in memory until the kernel killed git-annex or perhaps OOM
killed some other more valuable process.
This is a low impact security hole, only affecting communication between
local git-annex and git-annex-shell on the remote system. (With either
able to be the attacker). Only those with the right ssh key can do it. And,
there are probably lots of ways to construct git repositories that make git
use a lot of memory in various ways, which would have similar impact as
this attack.
The fix in P2P/IO.hs would have been higher impact, if it had made it to a
released version, since it would have allowed DOSing the tor hidden
service without needing to authenticate.
(The LockContent and NotifyChanges instances may not be really
exploitable; since the line is read and ignored, it probably gets read
lazily and does not end up staying buffered in memory.)
Display progress meter on send and receive from remote.
Added a new hGetMetered that can read an exact number of bytes (or
less), updating a meter as it goes.
This commit was sponsored by Andreas on Patreon.
In case the repo on the peer changes uuid (eg by a new repo being moved
into place).
Also, added some warning messages when unable to communicate with a
peer.
This commit was sponsored by Anthony DeRobertis on Patreon.
Not tested at all, but it just might work.
Only known problem is that progress is not updated when storing to a P2P
remote.
This commit was sponsored by Nick Daly on Patreon.
Similar to GCrypt remotes, P2P remotes have an url, so Remote.Git has to
separate them out and handle them, passing off to Remote.P2P.
This commit was sponsored by Ignacio on Patreon.
git upload-pack makes some uncessary writes in sequence, this tries to
gather them together to avoid needing to send multiple DATA packets when
just one will do.
In a small pull, this reduces the average number of DATA packets from
4.5 to 2.5.
Still a couple bugs:
* Closing the connection to the server leaves git upload-pack /
receive-pack running, which could be used to DOS.
* Sometimes the data is transferred, but it fails at the end, sometimes
with:
git-remote-tor-annex: <socket: 10>: commitBuffer: resource vanished (Broken pipe)
Must be a race condition around shutdown.
Almost working, but there's a bug in the relaying.
Also, made tor hidden service setup pick a random port, to make it harder
to port scan.
This commit was sponsored by Boyd Stephen Smith Jr. on Patreon.
This is most of the way to having the p2p protocol working over tor
hidden services, at least enough to do git push/pull.
The free monad was split into two, one for network operations and the
other for local (Annex) operations. This will allow git-remote-tor-annex
to run only an IO action, not needing the Annex monad.
This commit was sponsored by Remy van Elst on Patreon.
A bit tricky since Proto doesn't support threads. Rather than adding
threading support to it, ended up using a callback that waits for both
data on a Handle, and incoming messages at the same time.
This commit was sponsored by Denis Dzyubenko on Patreon.
Is content locking needed in the P2P protocol? Based on re-reading
bugs/concurrent_drop--from_presence_checking_failures.mdwn,
I think so: Peers can form cycles, and multiple peers can all be trying
to drop the same content.
So, added content locking to the protocol, with some difficulty.
The implementation is fine as far as it goes, but note the warning
comment for lockContentWhile -- if the connection to the peer is dropped
unexpectedly, the peer will then unlock the content, and yet the local
side will still think it's locked.
To be honest I'm not sure if Remote.Git's lockKey for ssh remotes
doesn't have the same problem. It checks that the
"ssh remote git-annex-shell lockcontent"
process has not exited, but if the connection closes afer that check,
the lockcontent command will unlock it, and yet the local side will
still think it's locked.
Probably this needs to be fixed by eg, making lockcontent catch any
execptions due to the connection closing, and in that case, wait a
significantly long time before dropping the lock.
This commit was sponsored by Anthony DeRobertis on Patreon.
For use with tor hidden services, and perhaps other transports later.
Based on Utility.SimpleProtocol, it's a line-based protocol,
interspersed with transfers of bytestrings of a specified size.
Implementation of the local and remote sides of the protocol is done
using a free monad. This lets monadic code be included here, without
tying it to any particular way to get bytes peer-to-peer.
This adds a dependency on the haskell package "free", although that
was probably pulled in transitively from other dependencies already.
This commit was sponsored by Jeff Goeke-Smith on Patreon.
ghc 8 added backtraces on uncaught errors. This is great, but git-annex was
using error in many places for a error message targeted at the user, in
some known problem case. A backtrace only confuses such a message, so omit it.
Notably, commands like git annex drop that failed due to eg, numcopies,
used to use error, so had a backtrace.
This commit was sponsored by Ethan Aubin.
* S3: Support the special case endpoint needed for the cn-north-1 region.
* Webapp: Don't list the Frankfurt region, as this (and some other new
regions) need V4 authorization which the aws library does not yet use.
This commit was sponsored by Nick Daly on Patreon.
If a transfer fails for some reason, but some data managed to be sent, the
transfer will be retried. (The assistant already did this.)
Possible impacts:
* More ssh prompts if ssh needs to prompt for a password to connect to a
host, or is prompting about some other problem like a ssh key mismatch.
* More data transfer due to retrying, epecially when a remote does not
support resuming a transfer.
In the worst case, a lot of data will be transferred but it fails before
the end, and then all that data gets transferred again plus one byte more;
repeat until it manages to get the whole file.
Multiple external special remote processes for the same remote will be
started as needed when using -J.
This should not beak any existing external special remotes, because running
multiple git-annex commands at the same time could already start multiple
processes for the same external special remotes.
Note that get --from foo --failed will get things that a previous get --from bar
tried and failed to get, etc. I considered making --failed only retry
transfers from the same remote, but it was easier, and seems more useful,
to not have the same remote requirement.
Noisy due to some refactoring into Types/
Ignore exceptions when getting the cost and availability for the remote,
and return sane defaults. These defaults are not cached, so if a special
remote program has a transient problem, it will re-query it later.
Removed the instance LensGpgEncParams RemoteConfig because it encouraged
code that does not take the RemoteGitConfig into account.
RemoteType's setup was changed to take a RemoteGitConfig,
although the only place that is able to provide a non-empty one is
enableremote, when it's changing an existing remote. This led to several
folow-on changes, and got RemoteGitConfig plumbed through.
This is useful for makking a special remote that anyone with a clone of the
repo and your public keys can upload files to, but only you can decrypt the
files stored in it.
The naming is unofrtunately not consistent, but the gnupg-options
were only used for encrypting, and it's too late to change that.
It would be nice to have a third setting that is always passed to gnupg,
but ~/.gnupg/options can be used to specify such global options when really
needed.
Since git-annex unsets these when started, they have to be explicitly
propigated. Also, this makes --git-dir and --work-tree settings be
reflected in the environment.
The need for this came up in
https://github.com/DanielDent/git-annex-remote-rclone/issues/3
Fix hang when dropping content needs to lock the content on a ssh remote,
which occurred when the remote has git-annex version 5.20151019 or newer.
Analysis: `race` runs 2 threads at once, and the hGetLine finishes first.
So, it tries to cancel the waitForProcess, but unfortunately that is making
a foreign call and so cannot be canceled. The remote git-annex-shell
is waiting for a line on stdin before it will exit. Deadlock.
This only occurred sometimes; I reproduced it going from darkstar to
elephant, but not from darkstar to darkstar. Not sure how that fits into
the above analysis -- perhaps a race condition is also involved?
Fixed by not using `race`; now the hGetLine will fail with an exception
if the remote git-annex-shell exits without any output.
sshOptions is now designed for working out ssh options only, and may
insert the extra options it is given to the middle. So it is incorrect
to call it with the remote parameters at the end. Instead, append them
to its return value.
This half regressed in 5be7ba7, and presumably regressed fully when
sshOptions was changed some time later.
That trailing slash is needed for legacy chunked mode, because it puts the
chunks in a subdir under the key. But, outside legacy chunked mode, it's BS
and it's amazing it worked at all with some webdav servers.
* Removed the webapp-secure build flag, rolling it into the webapp build
flag.
* Removed the quvi and tahoe build flags, which only adds aeson to
the core dependencies.
* Removed the feed build flag, which only adds feed to the core
dependencies.
Build flags have cost in both code complexity and also make Setup configure
have to work harder to find a usable set of build flags when some
dependencies are missing.
In copyFromRemote, it used to check isDirect, but that was not needed;
the remote is sending the file, so it doesn't matter if the local,
receiving repository is in direct mode or not. And, since the content is not
present, yet, it's certianly not unlocked. Note that, the remote may indeed
be sending an unlocked file, but sendkey uses sendAnnex, which will detect
if the file is modified before or during transfer, and will exit nonzero,
aborting the upload. So, the receiver doesn't need any checks.
In copyToRemote, it forces recvkey to verify content whenever it's being
sent from a v6 repository. recvkey is almost always going to verify content
anyway, unless annex.verify is not set. So, this doesn't make it any more
expensive, except for in that unusual configuration. The alternative would
be to change the recvkey interface, so that the sender checks afterwards if
what it was sending changed, and the receiver then throws out the bad
transfer. That would be less expensive for the reciever, as it would not
need to do a checksum verification. But, it would mean another network
round trip, and since rsync closes the connection, it would need to open
another ssh connection to do this. Even with connction caching, that would
add latency to uploads. It would also complicate the interface, especially
because an older git-annex-shell would not have the new interface
available. For these reasons, I prefer punting on that at this time, and
instead someone might set annex.verify=false and be unhappy that it still
verifies..
(One other gotcha not dealt with is that a v5 repo could be upgraded to v6
while an upload is in progress, and a file unlocked and modified.)
(Also, I double-checked Remote.GCrypt's calls to rsyncParamsRemote, and
they're fine. When a file is being uploaded to gcrypt, or any other special
repository, it is mediated by sendAnnex, so changes will be detected at
that level and the special remote implementation doesn't need to worry
about them.)
The direct flag is also set when sending unlocked content, to support old
versions of git-annex-shell. At some point, the direct flag will be
removed, and only the unlocked flag will be used.
Had everything available, just didn't combine the progress meter with the
other places progress is sent to update it. (And to a remote repo already
did show progress.)
Most special remotes should already display progress meters with -J,
same as without it. One exception to this is the web, since it relies on
wget/curl progress display without -J. Still todo..
* Fix failure to build with aws-0.13.0.
* When built with aws-0.13.0, the S3 special remote can be used to create
google nearline buckets, by setting storageclass=NEARLINE.
Instead, only display transport error if the configlist output doesn't
include an annex.uuid line, even an empty one.
A recent change made git-annex init try to get all the remote uuids, and so
the transport error would be displayed by it. It was also displayed when
eg, copying files to a remote that had no uuid yet.
sideAction is for things not generally related to the current action being
performed. And, it adds a newline after the side action. This was not the
right thing to use for stuff like "checksum", where doing a checksum is
part of the git annex get process, and indeed we want it to display
"(checksum...) ok"
/dev/null stderr; ssh is still able to display a password prompt
despite this
Show some messages so the user knows it's locking a remote, and
knows if that locking failed.
Also, rename lockContent to lockContentExclusive
inAnnexSafe should perhaps be eliminated, and instead use
`lockContentShared inAnnex`. However, I'm waiting on that, as there are
only 2 call sites for inAnnexSafe and it's fiddly.
In c6632ee5c8, it actually only handled
uploading objects to a shared repository. To avoid verification when
downloading objects from a shared repository, was a lot harder.
On the plus side, if the process of downloading a file from a remote
is able to verify its content on the side, the remote can indicate this
now, and avoid the extra post-download verification.
As of yet, I don't have any remotes (except Git) using this ability.
Some more work would be needed to support it in special remotes.
It would make sense for tahoe to implicitly verify things downloaded from it;
as long as you trust your tahoe server (which typically runs locally),
there's cryptographic integrity. OTOH, despite bup being based on shas,
a bup repo under an attacker's control could have the git ref used for an
object changed, and so a bup repo shouldn't implicitly verify. Indeed,
tahoe seems unique in being trustworthy enough to implicitly verify.
* When annex objects are received into git repositories, their checksums are
verified then too.
* To get the old, faster, behavior of not verifying checksums, set
annex.verify=false, or remote.<name>.annex-verify=false.
* setkey, rekey: These commands also now verify that the provided file
matches the key, unless annex.verify=false.
* reinject: Already verified content; this can now be disabled by
setting annex.verify=false.
recvkey and reinject already did verification, so removed now duplicate
code from them. fsck still does its own verification, which is ok since it
does not use getViaTmp, so verification doesn't happen twice when using fsck
--from.
Since I want git-annex to keep building on debian stable, I need to still
support the old http-client, which required explicit calls to
closeManager, or use of withManager to get Managers to close at appropriate
times. This is not needed in the new version, and so they added a
deprecation warning. IMHO much too early, because look at the mess I had to
go through to avoid that deprecation warning while supporting both
versions..
Added support for storageclass=STANDARD_IA to use Amazon's
new Infrequently Accessed storage.
Also allows using storageclass=NEARLINE to use Google's NearLine storage.
The necessary changes to aws to support this are in
https://github.com/aristidb/aws/pull/176
When gpg.program is configured, it's used to get the command to run for
gpg. Useful on systems that have only a gpg2 command or want to use it
instead of the gpg command.
This only makes sense for public repos, that are not chunked, so
that there's a 1:1 from Key in the git-annex repo to file on the remote.
Rather than making every remote implementation deal with that, just disable
whereisKey when it doesn't make sense.
Note that, if an url is added to the web log for such a remote, it's not
distinguishable from another url that might be added for the web remote.
(Because the web log doesn't distinguish which remote owns a plain url.
Urls with a downloader set are distinguishable, but we're not using them
here.)
This seems ok-ish.. In such a case, both remotes will try to use both
urls, and both remotes should be able to.
The only issue I see is that dropping a file from the web remote will
remove both urls in this case. This is not often done, and could even
be considered a feature, I suppose.
Note that I had one in Annex.Action.startup too, but it resulted in a weird
message printed by ssh, "channel 2: bad ext data". I don't know why, but
it only happened when transferinfo was run, so I wonder
if 983a95f021 introduced a fragility somehow.
While cryptohash has SHA3 support, it has not been updated for the final
version of the spec. Note that cryptonite has not been ported to all arches
that cryptohash builds on yet.
Now it suffices to run git remote add, followed by git-annex sync. Now the
remote is automatically initialized for use by git-annex, where before the
git-annex branch had to manually be pushed before using git-annex sync.
Note that this involved changes to git-annex-shell, so if the remote is
using an old version, the manual push is still needed.
Implementation required git-annex-shell be changed, so configlist can
autoinit a repository even when no git-annex branch has been pushed yet.
Unfortunate because we'll have to wait for it to get deployed to servers
before being able to rely on this change in the documentation.
Did consider making git-annex sync push the git-annex branch to repos that
didn't have a uuid, but this seemed difficult to do without complicating it
in messy ways.
It would be cleaner to split a command out from configlist to handle
the initialization. But this is difficult without sacrificing backwards
compatability, for users of old git-annex versions which would not use the
new command.
"checkPresent baser" was wrong; the baser has a dummy checkPresent action
not the real one. So, to fix this, we need to call preparecheckpresent to
get a checkpresent action that can be used to check if chunks are present.
Note that, for remotes like S3, this means that the preparer is run,
which opens a S3 handle, that will be used for each checkpresent of a
chunk. That's a good thing; if we're resuming an upload that's already many
chunks in, it'll reuse that same http connection for each chunk it checks.
Still, it's not a perfectly ideal thing, since this is a different http
connection that the one that will be used to upload chunks. It would be
nice to improve the API so that both use the same http connection.
Note that it's possible for a S3 bucket to be configured to allow public
access, but for git-annex to not know that it is. I chose to not show the
url unless public=yes.
In my tests, this has to be set when uploading a file to the bucket
and then the file can be accessed using the bucketname.s3.amazonaws.com
url.
Setting it when creating the bucket didn't seem to make the whole bucket
public, or allow accessing files stored in it. But I have gone ahead and
also sent it when creating the bucket just in case that is needed in some
case.
This removes a bit of complexity, and should make things faster
(avoids tokenizing Params string), and probably involve less garbage
collection.
In a few places, it was useful to use Params to avoid needing a list,
but that is easily avoided.
Problems noticed while doing this conversion:
* Some uses of Params "oneword" which was entirely unnecessary
overhead.
* A few places that built up a list of parameters with ++
and then used Params to split it!
Test suite passes.
This is especially useful because the caller doesn't need to generate valid
url keys, which involves some escaping of characters, and may involve
taking a md5sum of the url if it's too long.
The one exception is in Utility.Daemon. As long as a process only
daemonizes once, which seems reasonable, and as long as it avoids calling
checkDaemon once it's already running as a daemon, the fcntl locking
gotchas won't be a problem there.
Annex.LockFile has it's own separate lock pool layer, which has been
renamed to LockCache. This is a persistent cache of locks that persist
until closed.
This is not quite done; lockContent stil needs to be converted.
Only the assistant uses these, and only the assistant cleans them up, so
make only git annex transferkeys write them,
There is one behavior change from this. If glacier is being used, and a
manual git annex get --from glacier fails because the file isn't available
yet, the assistant will no longer later see that failed transfer file and
retry the get. Hope no-one depended on that old behavior.
I've tested all the dataenc to sandi conversions except Assistant.XMPP,
and all have unchanged behavior, including behavior on large unicode code
points.
For example, it failed to get files from a bucket named S3.
Also fixes `git annex initremote UPPERCASE type=S3`, which failed with the
new aws library, with a signing error message.
The directory special remote was not affected in its normal configuration,
since annex-directory is an absolute path normally. But it could fail
when a relative path was used.
The git remote was affected even when an absolute path to it was used in
.git/config, since git-annex now converts all such paths to relative.
It sounds worse than it is. ;)
Some external special remotes may run commands that display progress on
stderr. If git-annex is run with --quiet, this should filter out such
displays while letting the errors through.
Came up with a generic way to filter out progress messages while keeping
errors, for commands that use stderr for both.
--json mode will disable command outputs too.
Useful for things like ipfs that don't use regular urls.
An external special remote can add a regular url to a key, and then
git-annex get will download it from the web. But for ipfs, we want to
instead tell git-annex that the uri uses OtherDownloader. Before this
change, the external special remote protocol lacked a way to do that.
The fix is to stop using w82s, which does not properly reconstitute unicode
strings. Instrad, use utf8 bytestring to get the [Word8] to base64. This
passes unicode through perfectly, including any invalid filesystem encoded
characters.
Note that toB64 / fromB64 are also used for creds and cipher
embedding. It would be unfortunate if this change broke those uses.
For cipher embedding, note that ciphers can contain arbitrary bytes (should
really be using ByteString.Char8 there). Testing indicated it's not safe to
use the new fromB64 there; I think that characters were incorrectly
combined.
For credpair embedding, the username or password could contain unicode.
Before, that unicode would fail to round-trip through the b64.
So, I guess this is not going to break any embedded creds that worked
before.
This bug may have affected some creds before, and if so,
this change will not fix old ones, but should fix new ones at least.
Most of the time, there will be no discreprancy between programPath and
readProgramFile.
But, the programFile might have been written by an old version of git-annex
that is still installed, while a newer one is currently running. In this
case, we want to run the same one that's currently running.
This is especially important for things like the GIT_SSH=git-annex used for
ssh connection caching.
The only code that still uses readProgramFile directly is the upgrade code,
which needs to know where the standalone git-annex was installed, in order to
upgrade it.