Leverage the new chunked remotes to automatically resume downloads.
Sort of like rsync, although of course not as efficient since this
needs to start at a chunk boundary.
But, unlike rsync, this method will work for S3, WebDAV, external
special remotes, etc, etc. Only directory special remotes so far,
but many more soon!
This implementation will also properly handle starting a download
from one remote, interrupting, and resuming from another one, and so on.
(Resuming interrupted chunked uploads is similarly doable, although
slightly more expensive.)
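A minimal sketch of the resume arithmetic described above, not the actual git-annex code. It assumes chunks are numbered from 1 and that a partial trailing chunk in the temp file is discarded; the function name `resume_point` is hypothetical.

```python
from typing import Tuple


def resume_point(bytes_on_disk: int, chunk_size: int) -> Tuple[int, int]:
    """Return (first_chunk_to_fetch, bytes_to_keep) for a resumed download.

    Assumption: chunks are numbered from 1, and any partial chunk at the
    end of the temp file is thrown away, since a resume can only start
    at a chunk boundary.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    complete_chunks = bytes_on_disk // chunk_size
    return complete_chunks + 1, complete_chunks * chunk_size
```

Because the result depends only on the size of the temp file, the same calculation applies whichever remote the earlier chunks came from.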
This commit was sponsored by Thomas Djärv.
When chunk=0, always try the unchunked key first. This avoids the overhead
of needing to read the git-annex branch to find the chunkcount.
However, if the unchunked key is not present, go on and try the chunks.
Also, when removing a chunked key, update the chunkcounts even when
chunk=0.
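The lookup order for chunk=0 can be sketched roughly as follows. This is an illustration only: `retrieve_chunk0`, `fetch`, and `load_chunk_key_lists` are hypothetical names, and the latter stands in for the (comparatively expensive) read of the chunk counts from the git-annex branch.

```python
from typing import Callable, Iterable, List, Optional


def retrieve_chunk0(
    key: str,
    fetch: Callable[[str], Optional[bytes]],
    load_chunk_key_lists: Callable[[str], Iterable[List[str]]],
) -> Optional[bytes]:
    """Try the unchunked key first; fall back to recorded chunk lists."""
    data = fetch(key)
    if data is not None:
        # Fast path: the branch was never consulted.
        return data
    # Slow path: only now look up which chunkings were recorded.
    for chunk_keys in load_chunk_key_lists(key):
        parts = [fetch(k) for k in chunk_keys]
        if all(p is not None for p in parts):
            return b"".join(parts)
    return None
```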
No need to process each L.ByteString chunk, instead ask it to split.
Doesn't seem to have really sped things up much, but it also made the code
simpler.
Note that this does (and already did) buffer in memory. It seems that only
the directory special remote could take advantage of streaming chunks to
files w/o buffering, so probably won't add an interface to allow for that.
This will allow things like WebDAV to open a single persistent connection
and reuse it for all the chunked data.
The crazy types allow for some nice code reuse.
Push it down from needing to be done in every Storer,
to being checked once inside ChunkedEncryptable.
Also, catch exceptions from PrepareStorer and PrepareRetriever,
just in case.
I'd have liked to keep these two concepts entirely separate,
but they are entangled: storing a key in an encrypted and chunked remote
needs to generate chunk keys, encrypt the keys, chunk the data, encrypt the
chunks, and send them to the remote. Similar for retrieval, etc.
So, here's an implementation of all of that.
The total win here is that every remote was implementing encrypted storage
and retrieval, and now it can move into this single place. I expect this
to result in several hundred lines of code being removed from git-annex
eventually!
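To make the ordering of the store steps concrete, here is a rough sketch. It is not the ChunkedEncryptable code: the chunk key naming is invented for illustration, and `encrypt_key`, `encrypt_data`, and `send` are caller-supplied placeholders rather than real git-annex or gpg interfaces.

```python
from typing import Callable


def store_chunked_encrypted(
    key: str,
    data: bytes,
    chunk_size: int,
    encrypt_key: Callable[[str], str],
    encrypt_data: Callable[[bytes], bytes],
    send: Callable[[str, bytes], None],
) -> int:
    """Chunk, encrypt, and send an object; return the number of chunks."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    count = 0
    for n, start in enumerate(range(0, len(data), chunk_size), start=1):
        # Hypothetical chunk key naming, for illustration only.
        chunk_key = f"{key}-chunk{n}"
        chunk = data[start:start + chunk_size]
        # Both the key and the content are encrypted before anything
        # reaches the remote.
        send(encrypt_key(chunk_key), encrypt_data(chunk))
        count = n
    return count
```

With one such function in a shared place, an individual remote would only need to supply something like `send`.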
This commit was sponsored by Henrik Ahlgren.
Not yet used by any special remotes, but should not be too hard to add it
to most of them.
storeChunks is the hairy bit! It's loosely based on
Remote.Directory.storeLegacyChunked. The object is read in using a lazy
bytestring, which is streamed through, creating chunks as needed, without
ever buffering more than 1 chunk in memory.
Getting the progress meter update to work right was also fun, since
progress meter values are absolute. Finessed by constructing an offset
meter.
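A small sketch of the two ideas above: streaming one chunk at a time, and wrapping an absolute progress meter in an offset. This is a Python analogue, not the Haskell storeChunks; the names `offset_meter`, `iter_chunks`, and `store_chunks` are made up here.

```python
import io
from typing import BinaryIO, Callable, Iterator

Meter = Callable[[int], None]  # called with an absolute byte count


def offset_meter(base: Meter, offset: int) -> Meter:
    """Turn per-chunk byte counts into absolute counts for the base meter."""
    def meter(n: int) -> None:
        base(offset + n)
    return meter


def iter_chunks(src: BinaryIO, chunk_size: int) -> Iterator[bytes]:
    """Yield chunks lazily, so at most one chunk is held at a time."""
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            return
        yield chunk


def store_chunks(src: BinaryIO, chunk_size: int,
                 store_chunk: Callable[[int, bytes, Meter], None],
                 meter: Meter) -> int:
    """Store each chunk with a meter offset by the bytes already sent."""
    sent = 0
    count = 0
    for chunk in iter_chunks(src, chunk_size):
        count += 1
        store_chunk(count, chunk, offset_meter(meter, sent))
        sent += len(chunk)
    return count
```

Each chunk's storer can report progress from zero, while the user still sees a single meter that only moves forward.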
This commit was sponsored by Richard Collins.
Slightly tricky as they are not normal UUIDBased logs, but are instead maps
from (uuid, chunksize) to chunkcount.
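The shape of such a log, as opposed to a plain per-UUID map, might be modelled like this. It is a sketch of the data structure only; the helper names are hypothetical and nothing here reflects the on-disk format.

```python
from typing import Dict, List, Tuple

# (uuid, chunksize) -> chunkcount
ChunkLog = Dict[Tuple[str, int], int]


def record_chunks(log: ChunkLog, uuid: str, chunk_size: int,
                  chunk_count: int) -> ChunkLog:
    """Return a copy of the log with one (uuid, chunksize) entry set."""
    updated = dict(log)
    updated[(uuid, chunk_size)] = chunk_count
    return updated


def chunkings_for(log: ChunkLog, uuid: str) -> List[Tuple[int, int]]:
    """List the (chunksize, chunkcount) pairs recorded for one remote."""
    return sorted((size, count) for (u, size), count in log.items()
                  if u == uuid)
```

Keying on the pair lets one remote hold the same object under more than one chunk size, for example after its chunk setting was changed.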
This commit was sponsored by Frank Thomas.
Moved old legacy chunking code, and cleaned up the directory and webdav
remotes' use of it, so when no chunking is configured, that code is not
used.
The config for new style chunking will be chunk=1M instead of chunksize=1M.
There should be no behavior changes from this commit.
This commit was sponsored by Andreas Laas.
This is a security/usability tradeoff. To avoid exposing the gpg key ids
of those who can decrypt the repository, users can unset
gcrypt-publish-participants.
The gcrypt-publish-participants option is available in my fork of
git-remote-gcrypt.
This commit was sponsored by Christopher Kernahan.
Catch an exception when ensureInitialized is run in a non-initted
repository. In this case, just read the git config, so that the Git.Repo
object is not LocalUnknown, which is what is used to represent remotes
on eg, drives that are not connected.
The assistant already got this right, and like with the assistant, this
causes an implicit git-annex init of the local remote on the second sync,
once the git-annex branch has been pushed to it.
See this comment for more analysis:
http://git-annex.branchable.com/todo/Recovering_from_a_bad_sync/#comment-64e469a2c1969829ee149cbb41b1c138
This commit was sponsored by jscit.
It is useful to be able to specify an alternative git-annex-shell
program to execute on the remote, e.g., to run a version not on the
PATH. Use remote.<name>.annex-shell if specified, instead of the
default "git-annex-shell" i.e., first so-named executable on the
PATH.
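The fallback rule is simple enough to state as code. In this sketch the git configuration is modelled as a plain dict, and the function name is hypothetical.

```python
from typing import Mapping


def annex_shell_command(git_config: Mapping[str, str], remote_name: str) -> str:
    """Pick the program to run on the remote.

    Uses remote.<name>.annex-shell when set, otherwise the default
    "git-annex-shell", which the remote resolves via its PATH.
    """
    return git_config.get(f"remote.{remote_name}.annex-shell",
                          "git-annex-shell")
```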
Version 5.20140227 broke creation of glacier repositories, not including
the datacenter and vault in their configuration. This bug is fixed, but
glacier repositories set up with the broken version of git-annex need to
have the datacenter and vault set in order to be usable. This can be done
using git annex enableremote to add the missing settings. For details, see
http://git-annex.branchable.com/bugs/problems_with_glacier/
Motivation: Hook scripts for nautilus or other file managers
need to provide the user with feedback that a file is being downloaded.
This commit was sponsored by THM Schoemaker.
Benchmarking this with 1000 small files being copied, the time reduced from
15.98s to 14.64s -- an 8% improvement in the non-data-transfer overhead of
git-annex copy.
This allows eg, putting .git/annex/tmp on a ram disk, if the disk IO
of temp object files is too annoying (and if you don't want to keep
partially transferred objects across reboots).
.git/annex/misctmp must be on the same filesystem as the git work tree,
since files are moved to there in a way that will not work cross-device,
as well as symlinked into there.
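One way to check the same-filesystem constraint on a Unix-like system is to compare device ids. This is a generic sketch, not something git-annex is known to do; `same_filesystem` is a hypothetical helper.

```python
import os


def same_filesystem(path_a: str, path_b: str) -> bool:
    """True when both paths live on the same device.

    A rename between paths on different devices fails (EXDEV on Unix),
    which is why a directory used for such moves cannot be relocated
    to another filesystem such as a ram disk.
    """
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev
```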
I first wanted to put the tmp objects in .git/annex/objects/tmp, but
that would pose transition problems on upgrade when partially transferred
objects existed.
git annex info does not currently show the size of .git/annex/misctmp,
since it should stay small. It would also be ok to make something clean it
out, periodically.
This breakage seems to have been caused way back in a1eded86,
but I am pretty sure rsync.net support has not been entirely
broken since last April. AFAICS, the generated .ssh/config
has not changed since then -- it has never included a Username setting
line. So, I am puzzled at when this regression was introduced.
Note that the breakage only affected checkpresent and remove. Upload and
download use the ssh connection caching, which includes a -l username.
Removed instance, got it all to build using fromRef. (With a few things
that really need to show something using a ref for debugging stubbed out.)
Then added back Read instance, and made Logs.View use it for serialization.
This changes the view log format.
Potentially fixes some FD leak if an action on an opened file handle fails
for some reason. There have been some hard to reproduce reports of
git-annex leaking FDs, and this may solve them.
Similar to the assistant, this honors any configured preferred content
expressions.
I am not entirely happy with the implementation. It would be nicer if
the seek function returned a list of actions which included the individual
file gets and copies and drops, rather than the current list of calls to
syncContent. This would allow getting rid of the somewhat redundant display
of "sync file [ok|failed]" after the get/put display.
But, to do that, withFilesInGit would need to somehow be able to construct
such a mixed action list. And it would be less efficient than the current
implementation, which is able to reuse several values between eg get and
drop.
Note that currently this does not try to satisfy numcopies when
getting/putting files (numcopies are of course checked when dropping
files!) This makes it like the assistant, and unlike get --auto
and copy --auto, which do duplicate files when numcopies is not yet
satisfied. I don't know if this is the right decision; it only seemed to
make sense to have this parallel the assistant as far as possible to start
with, since I know the assistant works.
This commit was sponsored by Øyvind Andersen Holm.
Known problems:
1. Tries to run tahoe start when the daemon is already running.
2. If multiple tahoe remotes are set up on the same computer,
they will have the same node.url configured by default,
and this confuses tahoe commands.
This commit was sponsored by LeastAuthority.com
This allows a remote to store a piece of arbitrary state associated with a
key. This is needed to support Tahoe, where the file-cap is calculated from
the data stored in it, and used to retrieve a key later. Glacier also would
be much improved by using this.
GETSTATE and SETSTATE are added to the external special remote protocol.
Note that the state is left as-is even when a key is removed from a remote.
It's up to the remote to decide when it wants to clear the state.
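From the remote's side, using the two new messages could look roughly like this. The sketch assumes the line-based protocol in which the remote writes a request and, for GETSTATE, git-annex answers with a `VALUE` line; that reply form is my reading of the protocol documentation and should be checked against it. The helper names are hypothetical.

```python
from typing import TextIO


def set_state(out: TextIO, key: str, value: str) -> None:
    """Ask git-annex to record state for a key (no reply is awaited)."""
    out.write(f"SETSTATE {key} {value}\n")
    out.flush()


def get_state(out: TextIO, inp: TextIO, key: str) -> str:
    """Ask git-annex for the stored state; empty string if none is set."""
    out.write(f"GETSTATE {key}\n")
    out.flush()
    reply = inp.readline().rstrip("\n")
    if reply == "VALUE":
        return ""
    if reply.startswith("VALUE "):
        return reply[len("VALUE "):]
    raise ValueError(f"unexpected reply: {reply!r}")
```

A Tahoe-style remote would call `set_state` with the file-cap after a successful store, and `get_state` before retrieving the key.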
The remote state log, $KEY.log.rmt, is a UUID-based log. However,
rather than using the old UUID-based log format, I created a new variant
of that format. The new variant is more space efficient (since it lacks the
"timestamp=" hack), and easier to parse (and the parser doesn't mess with
whitespace in the value), and avoids compatibility cruft in the old one.
This seemed worth cleaning up for these new files, since there could be a
lot of them, while before UUID-based logs were only used for a few log
files at the top of the git-annex branch. The transition code has also
been updated to handle these new UUID-based logs.
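As an illustration of why a format without the "timestamp=" suffix is easier to parse, here is a sketch that assumes each line has the layout `timestamp uuid value`. That layout is an assumption made for this example, not something stated above, and `parse_new_log_line` is a hypothetical name.

```python
from typing import Tuple


def parse_new_log_line(line: str) -> Tuple[str, str, str]:
    """Split a line into (timestamp, uuid, value).

    Assumed layout: "timestamp uuid value". Splitting at most twice
    leaves any whitespace inside the value untouched.
    """
    parts = line.split(" ", 2)
    if len(parts) < 2:
        raise ValueError(f"malformed log line: {line!r}")
    value = parts[2] if len(parts) == 3 else ""
    return parts[0], parts[1], value
```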
This commit was sponsored by Daniel Hofer.
This was unexpectedly difficult because of a dependency cycle. To parse a
preferred content expression involves several things that need to operate
on the list of remotes. Which needs Remote.External. The only way to avoid
this cycle (I tried breaking it at several points) was to skip parsing the
expression in SETWANTED.
That's sorta ok, because git-annex already has to deal with unparsable
preferred content expressions being stored, in order to handle eg,
upgrades. But I'm still not very happy that I cannot check it.
I feel this is a strong indication that I need to beware of further
bloating the special remote protocol interface.