Ensure that each file has something written to it, even if the bytestring
chunk size is greater than the configured chunksize.
This means we may write a bit larger than the configured value, but only
when the configured value is very small; ie, < 8 kb.
Files are now written to a tmp directory in the remote, and once all
chunks are written, etc, it's moved into the final place atomically.
For now, checkpresent still checks every single chunk of a file, because
the old method could leave partially transferred files with some chunks
present and others not.
Both the directory and webdav special remotes used to have to buffer
the whole file contents before it could be decrypted, as they read
from chunks. Now the chunks are streamed through gpg with no buffering.
Transfer info files are updated when the callback is called, updating
the number of bytes transferred.
Left unused p variables at every place the callback should be used.
Which is rather a lot..
Currently only the web special remote is readonly, but it'd be possible to
also have readonly drives, or other remotes. These are handled in the
assistant by only downloading from them, and never trying to upload to
them.
In order to record a semi-useful filename associated with the key,
this required plumbing the filename all the way through to the remotes'
storeKey and retrieveKeyFile.
Note that there is potential for deadlock here, narrowly avoided.
Suppose the repos are A and B. A sends file foo to B, and at the same
time, B gets file foo from A. So, A locks its upload transfer info file,
and then locks B's download transfer info file. At the same time,
B is taking the two locks in the opposite order. This is only not a
deadlock because the lock code does not wait, and aborts. So one of A or
B's transfers will be aborted and the other transfer will continue.
Whew!
getConfig got a remote-specific config, and this confusing name caused it
to be used a couple of places that only were interested in global configs.
Rename to getRemoteConfig and make getConfig only get global configs.
There are no behavior changes here, but remote.<name>.annex-web-options
never actually worked (and per-remote web options is a very unlikely to be
useful case so I didn't make it work), so fix the documentation for it.
That was actually really easy. But, when getting a file from an encrypted
directory special remote, no meter can be shown, because the total file
size is not known.
Avoiding writing files larger than a specified size is useful on certian
things. For example, box.com has a file size limit of 100 mb. Could also
be useful on really crappy removable media.
This drops the >>! and >>? with the nice low fixity. IfElse does have
undocumented >>=>>! and >>=>>? operators, but I deem that too fishy.
Anyway, using whenM and unlessM is easier; I sometimes mixed the operators
up.
When moving a file to the remote failed, and partially transferred content
was left behind in the directory, re-running the same move would think it
succeeded and delete the local copy.
I reproduced data loss when moving files to a partition that was almost
full. Interrupting a transfer could have similar results.
Easily fixed by using a temp file which is then moved atomically into place
once the transfer completes.
I've audited other calls to copyFileExternal, and other special remote
file transfer code; everything else seems to use temp files correctly
(rsync, git), or otherwise use atomic transfers (bup, S3).
Directory special remotes will now always store keys in the lowercase name,
which avoids the complication of catching failures to create the mixed case
name.
Git remotes using http will now try the lowercase name first.
Supporting multiple directory hash types will allow converting to a
different one, without a flag day.
gitAnnexLocation now checks which of the possible locations have a file.
This means more statting of files. Several places currently use
gitAnnexLocation and immediately check if the returned file exists;
those need to be optimised.
git-annex-shell inannex now returns always 0, 1, or 100 (the last when
it's unclear if content is currently in the index due to it currently being
moved or dropped).
(Actual locking code still not yet written.)
Many functions took the repo as their first parameter. Changing it
consistently to be the last parameter allows doing some useful things with
currying, that reduce boilerplate.
In particular, g <- gitRepo is almost never needed now, instead
use inRepo to run an IO action in the repo, and fromRepo to get
a value from the repo.
This also provides more opportunities to use monadic and applicative
combinators.
Only one place need to filter the list of remotes for ignored remotes:
keyPossibilities. Make the full list available to everything else.
This allows getting rid of the special case handing for --from and --to
to make ignored remotes not be ignored with those options.
These are defined in ifelse, but it's not currently available and I don't
want to pull in a library for 6 lines of code anyhow.
Also, ifelse sets the fixity to 1, which does not allow >>? error $ ...
This was a most surprising leak. It occurred in the process that is forked
off to feed data to gpg. That process was passed a lazy ByteString of
input, and ghc seemed to not GC the ByteString as it was lazily read
and consumed, so memory slowly leaked as the file was read and passed
through gpg to bup.
To fix it, I simply changed the feeder to take an IO action that returns
the lazy bytestring, and fed the result directly to hPut.
AFAICS, this should change nothing WRT buffering. But somehow it makes
ghc's GC do the right thing. Probably I triggered some weakness in ghc's
GC (version 6.12.1).
(Note that S3 still has this leak, and others too. Fixing it will involve
another dance with the type system.)
Update: One theory I have is that this has something to do with
the forking of the feeder process. Perhaps, when the ByteString
is produced before the fork, ghc decides it need to hold a pointer
to the start of it, for some reason -- maybe it doesn't realize that
it is only used in the forked process.
to avoid some issues with git on OSX with the mixed-case directories. No
migration is needed; the old mixed case hash directories are still read;
new information is written to the new directories.
Two machines might have access to the same directory remote on different
paths, so don't include the path in its persistent config, instead use
the git config to record it.