New approach is to do it the expensive way for the first 100 paths
on the command line, but then assume the user doesn't care about order too
much and fall back to the cheap way that does not preserve order.
In this situation, curl -o exits successfully without creating the output
file.
There was already a workaround for curl file:/// but I did not realize this
also affected regular url downloads.
To fix it, pre-create the destination file before starting curl.
Since we cannot always know the size of an url before trying to download
it, let's always do this.
Note that since curl is told -C -, we have to consider if this
makes curl try to do a ranged download, which might fail on some servers
where a regular download would have succeeded. My testing indicates
this isn't a problem; since the file is empty, curl seems to not try to
do a ranged download.
Original report: https://github.com/datalad/datalad/issues/79
Curl bug report: https://github.com/bagder/curl/issues/183
The fix is to stop using w82s, which does not properly reconstitute unicode
strings. Instrad, use utf8 bytestring to get the [Word8] to base64. This
passes unicode through perfectly, including any invalid filesystem encoded
characters.
Note that toB64 / fromB64 are also used for creds and cipher
embedding. It would be unfortunate if this change broke those uses.
For cipher embedding, note that ciphers can contain arbitrary bytes (should
really be using ByteString.Char8 there). Testing indicated it's not safe to
use the new fromB64 there; I think that characters were incorrectly
combined.
For credpair embedding, the username or password could contain unicode.
Before, that unicode would fail to round-trip through the b64.
So, I guess this is not going to break any embedded creds that worked
before.
This bug may have affected some creds before, and if so,
this change will not fix old ones, but should fix new ones at least.
hGetSomeString reads one byte at a time, so unicode bytes are not composed.
The problem comes when outputting that to the console with hPut; that
tried to apply the handle's encoding, and so we get mojibake.
Instead, use ByteStrings, and only convert it to a string for parsing, not
for display.
Note that there are a couple of other things that use hGetSomeString,
which I've left as-is for now.
This reverts commit a7f05c007b.
Consider: relPathDirToFile (absPathFrom "/tmp/repo/xxx" "y/bar") "/tmp/repo/.git/annex/objects/xxx"
This needs to always yield "../../../.git/annex/objects/xxx" but on
Windows, it is "..\\..\\/tmp/repo/.git/annex/objects/xxx"
This is necessary for interop between inode caches created on unix and
windows. Which is more important than supporting inodecaches for large keys
with the wrong size, which are broken anyway.
There should be no slowdown from this change, except on Windows.
Avoid using fileSize which maxes out at just 2 gb on Windows.
Instead, use hFileSize, which doesn't have a bounded size.
Fixes support for files > 2 gb on Windows.
Note that the InodeCache code only needs to compare a file size,
so it doesn't matter it the file size wraps. So it has been
left as-is. This was necessary both to avoid invalidating existing inode
caches, and because the code passed FileStatus around and would have become
more expensive if it called getFileSize.
This commit was sponsored by Christian Dietrich.
Reverts 965e106f24
Unfortunately, this caused breakage on Windows, and possibly elsewhere,
because parentDir and takeDirectory do not behave the same when there is a
trailing directory separator.
parentDir is less safe than takeDirectory, especially when working
with relative FilePaths. It's really only useful in loops that
want to terminate at /
This commit was sponsored by Audric SCHILTKNECHT.
This allows the git repository to be moved while git-annex is running in
it, with fewer problems.
On Windows, this avoids some of the problems with the absurdly small
MAX_PATH of 260 bytes. In particular, git-annex repositories should
work in deeper/longer directory structures than before. See
http://git-annex.branchable.com/bugs/__34__git-annex:_direct:_1_failed__34___on_Windows/
There are several possible ways this change could break git-annex:
1. If it changes its working directory while it's running, that would
be Bad News. Good news everyone! git-annex never does so. It would also
break thread safety, so all such things were stomped out long ago.
2. parentDir "." -> "" which is not a valid path. I had to fix one
instace of this, and I should probably wipe all calls to parentDir out
of the git-annex code base; it was never a good idea.
3. Things like relPathDirToFile require absolute input paths,
and code assumes that the git repo path is absolute and passes it to it
as-is. In the case of relPathDirToFile, I converted it to not make
this assumption.
Currently, the test suite has 16 failures.
Getting rid of build warning
warning: 'statfs64' is deprecated: first deprecated in OS X 10.6
[-Wdeprecated-declarations]
10.6 is much older than the oldest git-annex OSX port, so won't break
anything.
More aggressive rsync params fixup for windows. Param may contain a url, or
a file path, so check if it looks like a local file path and if so, fix it
up.
On windows only, rsyncUrlIsPath will treat c:foo as a path, rather than as
a rsyncurl starting with a host "c".
This should be essentially no-op change for hGetContentsMetered, since it
always gets the entire contents. So the only difference is that each chunk
of the lazy bytestring will always be the full chunk size. So, I'm pretty
sure this is safe. Also, the only current users of hGetContentsMetered are
reading files, so the stream won't block for long in the middle.
The improvement is that hGetUntilMetered will always get some multiple of
the defaultChunkSize. This will allow the S3 multipart code to pick a fixed
size and know that hGetUntilMetered will really get that size.
(cherry picked from commit bd09046291)
Didn't know that this library existed!
This includes making git-annex not re-exec itself on start on windows, and
making the test suite on Windows run tests without forking.
This reverts commit dd667844b6
and commit e6eff0e951.
Those commits were fine, except the android autobuilder currently has a bit
of a mess of yesod versions and broke. Better to wait on this.
Only use that when building with ancient yesod, which does not include it.
This also let me remove ifdefs in the file to support building with the new
version of yesod.
Found these with:
git grep "^ " $(find -type f -name \*.hs) |grep -v ': where'
Unfortunately there is some inline hamlet that cannot use tabs for
indentation.
Also, Assistant/WebApp/Bootstrap3.hs is a copy of a module and so I'm
leaving it as-is.
This fixes all instances of " \t" in the code base. Most common case
seems to be after a "where" line; probably vim copied the two space layout
of that line.
Done as a background task while listening to episode 2 of the Type Theory
podcast.
This avoids cp -a overriding the default mode acls that the user might have
set in a git repository.
With GNU cp, this behavior change should not be a breaking change, because
git-anex also uses rsync sometimes in the same situation, and has only ever
preserved timestamps when using rsync.
Systems without GNU cp will no longer use cp -a, but instead just cp.
So, timestamps will no longer be preserved. Preserving timestamps when
copying between repos is not guaranteed anyway.
Closes: #729757
Note that while before checkTransfer this called getLock with WriteLock,
getLockStatus's use of ReadLock will also notice any exclusive locks.
Since transfer info files are only locked exclusively, never shared,
there is no behavior change.
Also, fixes checkLocked to actually return Just False when the file
exists, but is not locked.
Also fixes a test suite failures introduced in recent commits, where
inAnnexSafe failed in indirect mode, since it tried to open the lock file
ReadWrite. This is why the new checkLocked opens it ReadOnly.
This commit was sponsored by Chad Horohoe.
Added a convenience Utility.LockFile that is not a windows/posix
portability shim, but still manages to cut down on the boilerplate around
locking.
This commit was sponsored by Johan Herland.
(With the exception of daemon pid locking.)
This fixes at part of #758630. I reproduced the assistant locking eg, a
removable drive's annex journal lock file and forking a long-running
git-cat-file process that inherited that lock.
This did not affect Windows.
Considered doing a portable Utility.LockFile layer, but git-annex uses
posix locks in several special ways that have no direct Windows equivilant,
and it seems like it would mostly be a complication.
This commit was sponsored by Protonet.
The hoary old HTTP library was only used when checking if an url exists,
when curl was not available. It had many problems, including not supporting
https at all.
Now, this is done using http-conduit for all urls that it supports. Falls
back to curl for any url that http-conduit doesn't like (probably ftp etc,
but could also be an url that its parser chokes on for whatever reason).
This adds a new dependency on http-conduit, but webdav support already
indirectly depended on that, and the s3-aws branch also uses it.
This opens up the possibility of using http-conduit for large file
downloads, but for now I've left it using wget/curl.
This commit was sponsored by Paul Tötterman.
FileID type changed, needs Arbitrary instance.
On the plus side, getFileStatus on Windows now actually gets file id's,
not always 0, so direct mode is safer there now.
Removed old extensible-exceptions, only needed for very old ghc.
Made webdav use Utility.Exception, to work after some changes in DAV's
exception handling.
Removed Annex.Exception. Mostly this was trivial, but note that
tryAnnex is replaced with tryNonAsync and catchAnnex replaced with
catchNonAsync. In theory that could be a behavior change, since the former
caught all exceptions, and the latter don't catch async exceptions.
However, in practice, nothing in the Annex monad uses async exceptions.
Grepping for throwTo and killThread only find stuff in the assistant,
which does not seem related.
Command.Add.undo is changed to accept a SomeException, and things
that use it for rollback now catch non-async exceptions, rather than
only IOExceptions.
This reaping of any processes came to cause me problems when redoing the
rsync special remote -- a gpg process that was running gets waited on and
the place that then checks its return code fails.
I cannot reproduce any zombies when using the rsync special remote.
But I still can when using a normal git remote, accessed over ssh.
There is 1 zombie per file downloaded without this horrible hack enabled.
So, move the hack to only be used in that case.
This only performs some basic tests so far; no testing of chunking or
resuming. Also, the existing encryption type of the remote is used; it
would be good later to derive an encrypted and a non-encrypted version of
the remote and test them both.
This commit was sponsored by Joseph Liu.
Some remotes like External need to run store and retrieve actions in Annex,
not IO. In order to do that lift, I had to dive pretty deep into the
utilities, making Utility.Gpg and Utility.Tmp be partly converted to using
MonadIO, and Control.Monad.Catch for exception handling.
There should be no behavior changes in this commit.
This commit was sponsored by Michael Barabanov.
Leverage the new chunked remotes to automatically resume downloads.
Sort of like rsync, although of course not as efficient since this
needs to start at a chunk boundry.
But, unlike rsync, this method will work for S3, WebDAV, external
special remotes, etc, etc. Only directory special remotes so far,
but many more soon!
This implementation will also properly handle starting a download
from one remote, interrupting, and resuming from another one, and so on.
(Resuming interrupted chunked uploads is similarly doable, although
slightly more expensive.)
This commit was sponsored by Thomas Djärv.
Not yet used by any special remotes, but should not be too hard to add it
to most of them.
storeChunks is the hairy bit! It's loosely based on
Remote.Directory.storeLegacyChunked. The object is read in using a lazy
bytestring, which is streamed though, creating chunks as needed, without
ever buffering more than 1 chunk in memory.
Getting the progress meter update to work right was also fun, since
progress meter values are absolute. Finessed by constructing an offset
meter.
This commit was sponsored by Richard Collins.
Configure crashed on systems with that process and without eg, sha256sum.
The rest of the code in configure looks to work ok, since it uses sh -c to
probe for commands, and sh is always in path so it works.
Dunno about all the rest of git-annex. Not a huge amount of external
program use, other than git, so perhaps this won't be a large pain.
Note that boolSystem can throw an exception now if the program doesn't
exist. Could easily be changed back to False.
Minor because normally only 1 FD is leaked per git-annex run. However,
the test suite leaks a few hundred FDs, and this broke it on the Debian
autobuilders, which seem to have a tigher than usual ulimit.
The leak was introduced by the lazy getDirectoryContents' that was
introduced in e6330988dd in order to scale to
millions of journal files -- if the lazy list was never fully consumed, the
directory handle did not get closed.
Instead, pull in openDirectory/readDirectory/closeDirectory code that I
already developed and submitted in a patch to the haskell directory library
earlier. Using this in journalDirty avoids the place that the lazy list
caused a problem. And using it in stageJournal eliminates the need for
getDirectoryContents'.
The getJournalFiles* functions are switched back to using the regular
strict getDirectoryContents. I'm not sure if those always consume the whole
list, so this avoids any leak. And the things that call those are things
like git annex unused, which also look at every file committed to the
git-annex branch, so would need more work to scale to insane numbers of
files anyway.
Yes, this means that git annex webapp on windows execs git-annex, which
execs itself to set env, and the execs itself again to redirect logs.
This is disgusting. This is Windows(TM).
Rather than calculating the TSDelta once, and caching it, this now
reads the inode sential file's InodeCache file once, and then each time a
new InodeCache is generated, looks at the sentinal file to get the current
delta.
This way, if the time zone changes while git-annex is running, it will
adapt.
This adds some inneffiency, but only on Windows, and only 1 stat per new
file added. The worst innefficiency is that `git annex status` and
`git annex sync` will now (on Windows) stat the inode sentinal file once per
file in the repo.
It would be more efficient to use getCurrentTimeZone, rather than needing
to stat the sentinal file. This should be easy to do, once the time
package gets my bugfix patch.
This commit was sponsored by Jürgen Lüters.
On Windows, changing the time zone causes the apparent mtime of files to
change. This confuses git-annex, which natually thinks this means the files
have actually been modified (since THAT'S WHAT A MTIME IS FOR, BILL <sheesh>).
Work around this stupidity, by using the inode sentinal file to detect if
the timezone has changed, and calculate a TSDelta, which will be applied
when generating InodeCaches.
This should add no overhead at all on unix. Indeed, I sped up a few
things slightly in the refactoring.
Seems to basically work! But it has a big known problem:
If the timezone changes while the assistant (or a long-running command)
runs, it won't notice, since it only checks the inode cache once, and
so will use the old delta for all new inode caches it generates for new
files it's added. Which will result in them seeming changed the next time
it runs.
This commit was sponsored by Vincent Demeester.
Deal with FAT's low resolution timestamps, which in combination with
Linux's caching of higher res timestamps while a FAT is mounted, caused
direct mode repositories on FAT to seem to have modified files after they
were unmounted and remounted.
This commit was sponsored by Fabrice Rossi.
This version of wai changed the type of Middleware, so I cannot seem
to liftIO inside it. So, got rid of a lot of not really needed
complexity to use System.Log.Logger's logging stuff, and just use
the standard wai stdout logger when debug logging is enabled.
Format may change some, and it logs http to stdout instead of stderr
now. Doesn't matter for the webapp since both go to the same log anyway.
This avoids ssh prompting for passwords on stdin, ever.
It may also change other behavior of other programs, as there is no
controlling terminal now. However, setsid was already done when running the
assistant in daemon mode, so any behavior changes should not be really new.
For sync, saves 1 ssh connection per remote. For remotedaemon, the same
ssh connection that is already open to run git-annex-shell notifychanges
is reused to pull from the remote.
Only potential problem is that this also enables connection caching
when the assistant syncs with a ssh remote. Including the sync it does
when a network connection has just come up. In that case, cached ssh
connections are likely to be stale, and so using them would hang.
Until I'm sure such problems have been dealt with, this commit needs to
stay on the remotecontrol branch, and not be merged to master.
This commit was sponsored by Alexandre Dupas.
Code was still buggy, it turns out (though the recursion checker caught
it). In the case of (Schedule (Monthly Nothing) AnyTime), where the last
run was on yyyy-12-31, it looped forever.
Also, the handling of (Schedule (Yearly Nothing) AnyTime) was wacky where
the last run was yyyy-12-31. It would suggest a window starting on the 3rd
for the next run (because 31 mod 28 is 3).
I think that originally I was wanted to avoid running on 01-01 if it had
just run on 12-31. But the code didn't accomplish this, and it's not
necessary anyway. This is supposed to calculate the next window meeting the
schedule, and for (Schedule (Monthly Nothing), the window starts at 01-01
and runs through 01-31. If that causes two back-to-back runs, well the next
one will not be until 02-01 at the earliest.
Also, back-to-back runs can be avoided, if desired, by using Divisible 2.
This includes checking when dropping files that any required content
configuration is satisfied. However, it does not yet include an active
check on the required content; the location log is trusted when checking
the required content expression.
Motivation: Hook scripts for nautilus or other file managers
need to provide the user with feedback that a file is being downloaded.
This commit was sponsored by THM Schoemaker.
For some reason this was working w/o a cast before, despite POSIXTime etc
being newtypes. It stopped working with the new QuickCheck:
Utility/QuickCheck.hs:31:33:
No instance for (Integral POSIXTime)
arising from a use of `arbitrarySizedIntegral'
Possible fix: add an instance declaration for (Integral POSIXTime)
In the first argument of `nonNegative', namely
`arbitrarySizedIntegral'
In the expression: nonNegative arbitrarySizedIntegral
In an equation for `arbitrary':
arbitrary = nonNegative arbitrarySizedIntegral
Debian stable's warp-tls is too old to support the new https feature well,
so only use http with that old version.
Note that the webapp still depends on warp-tls, because the TLSSettings
type is used.
(And a vpop command, which is still a bit buggy.)
Still need to do vadd and vrm, though this also adds their documentation.
Currently not very happy with the view log data serialization. I had to
lose the TDFA regexps temporarily, so I can have Read/Show instances of
View. I expect the view log format will change in some incompatable way
later, probably adding last known refs for the parent branch to View
or something like that.
Anyway, it basically works, although it's a bit slow looking up the
metadata. The actual git branch construction is about as fast as it can be
using the current git plumbing.
This commit was sponsored by Peter Hogg.
Promosing work toward metadata driven filter branches. A few methods
to construct them are stubbed out; all the data types and pure code
seems good.
This commit was sponsored by Walter Somerville.
The ctrl-c hack used before didn't actually seem to work.
No haskell libraries expose TerminateProcess. I tried just calling it via
FFI, but got segfaults, probably to do with the wacky process handle not
being managed correctly. Moving it all into one C function worked.
This was hell. The EvilLinker hack was just final icing on the cake.
We all know what the cake was made of.
A very haskell commit! Just data types, instances to serialize the metadata
to a nice format, and QuickCheck tests.
This commit was sponsored by Andreas Leha.
git-annex has been using MissingH's `abdNormPath` forever, but that's
unmaintained and possibly buggy, and doesn't work on Windows. I've been
wanting to get rid of it for some time, and finally did today, writing a
`simplifyPath` that does the things git-annex needs and will work with all
the Windows filename craziness, and takes advantage of the more modern
System.FilePath to be quite a simple peice of code. A QuickCheck test found
no important divergences from absNormPath. A good first step to making
git-annex not depend on MissingH at all.
And it fixed some weird behaviors on Windows like
`git annex add ..\subdir\file` not working.
Note that absNormPathUnix has been left alone for now.
Seems I punted on this while porting before. This hack relies on DOS not
using / in filenames, it's effectively an alternate path separatr in at
least current versions of windows..
Potentially fixes some FD leak if an action on an opened file handle fails
for some reason. There have been some hard to reproduce reports of
git-annex leaking FDs, and this may solve them.
So that remotes that use a persistent network connection are restarted.
A remote might keep open a long duration network connection, and could
fail to deal well with losing the connection. This is particularly a
concern now that we have external special reotes. An external
special remote that is implemented naively might open the connection only
when PREPARE is sent, and if it loses connection, throw errors on each
request that is made.
(Note that the ssh connection caching should not have this problem; if the
long-duration ssh process loses connection, the named pipe is disconnected
and the next ssh attempt will reconnect. Also, XMPP already deals with
disconnection robustly in its own way.)
There's no way for git-annex to know if a lost network connection actually
affects a given remote, which might have a transfer in process. It does not
make sense to force kill the transferkeys process every time the NetWatcher
detects a change. (Especially because the NetWatcher sometimes polls 1
change per hour.)
In any case, the NetWatcher only detects connection to a network, not
disconnection. So if a transfer is in progress over the network, and the
network goes down, that will need to time out on its own.
An alternate approch that was considered is to use a separate transferkeys
process for each remote, and detect when a request fails, and assume that
means that process is in a failing state and restart it. The problem with
that approach is that if a resource is not available and a remote fails
every time, it degrades to starting a new transferkeys process for every
file transfer, which is too expensive.
Instead, this commit only handles the network reconnection case, and restarts
transferkeys only once the network has reconnected and another transfer needs
to be made. So, a transferkeys process will be reused for 1 hour, or until the
next network connection.
----
The NotificationBroadcaster was rewritten to use TMVars rather than MSampleVars,
to allow checking without blocking if a notification has been received.
----
This commit was sponsored by Tobias Brunner.
This has not been tested at all. It compiles!
The only known missing things are support for encryption, and for get/set
of special remote configuration, and of key state. (The latter needs
separate work to add a new per-key log file to store that state.)
Only thing I don't much like is that initremote needs to be passed both
type=external and externaltype=foo. It would be better to have just
type=foo
Most of this is quite straightforward code, that largely wrote itself given
the types. The only tricky parts were:
* Need to lock the remote when using it to eg make a request, because
in theory git-annex could have multiple threads that each try to use
a remote at the same time. I don't think that git-annex ever does
that currently, but better safe than sorry.
* Rather than starting up every external special remote program when
git-annex starts, they are started only on demand, when first used.
This will avoid slowdown, especially when running fast git-annex query
commands. Once started, they keep running until git-annex stops, currently,
which may not be ideal, but it's hard to know a better time to stop them.
* Bit of a chicken and egg problem with caching the cost of the remote,
because setting annex-cost in the git config needs the remote to already
be set up. Managed to finesse that.
This commit was sponsored by Lukas Anzinger.
The shell code was nasty, and buggy. New haskell code is much nicer,
and it's easy to do complicated calculations to properly convert possibly
absolute symlinks between libraries into relative links using it.
transferkeys had used special FDs for communication, but that would be
quite annoying to do in Windows.
Instead, use stdin and stdout. But, to avoid commands like rsync stomping
on them and messing up the communications channel, they're duplicated to a
different handle; stdin is replaced with a null handle, and stdout is
replaced with a copy of stderr. This should all work in windows too.
Stopping in progress transfers may work on windows.. if the types unify
anyway. ;) May need some more porting.
In 0.9, -v shows version, rather than controlling verbosity.
Still need to port to 0.9, this just avoids massively confusing addurl when
quvi prints its version and exits successfully, on urls that it cannot be
used with.
Allow AnyTime events that still have time to occur in the current day to
fall in a window covering the current day, instead of waiting until the
next day in the Recurrance.
Currently only implemented for local git remotes. May try to add support
to git-annex-shell for ssh remotes later. Could concevably also be
supported by some special remote, although that seems unlikely.
Cronner user this when available, and when not falls back to
fsck --fast --from remote
git annex fsck --from does not itself use this interface.
To do so, I would need to pass --fast and all other options that influence
fsck on to the git annex fsck that it runs inside the remote. And that
seems like a lot of work for a result that would be no better than
cd remote; git annex fsck
This may need to be revisited if git-annex-shell gets support, since it
may be the case that the user cannot ssh to the server to run git-annex
fsck there, but can run git-annex-shell there.
This commit was sponsored by Damien Diederen.
addurl: Improve message when adding url with wrong size to existing file.
Before the message suggested the url didn't exist.
Fixed handling of URL keys that have no recorded size. Before, if the key
has no size, the url also had to not declare any size, which was unlikely
and wrong, or it was taken to not exist. This probably would mostly affect
keys that were added to the annex with addurl --relaxed.
Turns out that forkProcess masks async exceptions. Unmask them so that the
daemon code can use them for thread IPC.
There is some risk this introduces breakage in git-annex, but it would be
breakage that would already occur when the assistant was run with
--foreground.
Wow! This was hairy, but about 10x less hairy than expected actually!
A bit more recursion than I really like, since I think in theory all
of this date stuff can be calulated using some formulas I am too lazy too
look up. But this doesn't matter in practice; I asked it for
nextTime (Schedule (Divisible 100 (Yearly 7)) (SpecificTime 23 59) (MinutesDuration 10)) Nothing
.. and it calculated (NextTimeExactly 2100-01-07 23:59:00) in milliseconds.
Rather similar to crontab, although with a different format.
But with less emphasis on per-minute scheduling.
Also, supports weekly events, which cron makes too hard.
Also, has a duration field.
SHA3 is still waiting for final standardization.
Although this is looking less likely given
https://www.cdt.org/blogs/joseph-lorenzo-hall/2409-nist-sha-3
In the meantime, cryptohash implements skein, and it's used by some of the
haskell ecosystem (for yesod sessions, IIRC), so this implementation is
likely to continue working. Also, I've talked with the cryprohash author
and he's a reasonable guy.
It makes sense to have an alternate high security hash, in case some
horrible attack is found against SHA2 tomorrow, or in case SHA3 comes out
and worst fears are realized.
I'd also like to support using skein for HMAC. But no hurry there and
a new version of cryptohash has much nicer HMAC code, so I will probably
wait until I can use that version.
Overridable with --user-agent option.
Not yet done for S3 or WebDAV due to limitations of libraries used --
nether allows a user-agent header to be specified.
This commit sponsored by Michael Zehrer.
I forgot I had <$$> hidden away in Utility.Applicative.
It allows doing the same kind of currying as does >=*>
and I found using it made the code more readable for me.
(*>=> was not used)
This is a massive win on OSX, which doesn't have a sha256sum normally.
Only use external hash commands when the file is > 1 mb,
since cryptohash is quite close to them in speed.
SHA is still used to calculate HMACs. I don't quite understand
cryptohash's API for those.
Used the following benchmark to arrive at the 1 mb number.
1 mb file:
benchmarking sha256/internal
mean: 13.86696 ms, lb 13.83010 ms, ub 13.93453 ms, ci 0.950
std dev: 249.3235 us, lb 162.0448 us, ub 458.1744 us, ci 0.950
found 5 outliers among 100 samples (5.0%)
4 (4.0%) high mild
1 (1.0%) high severe
variance introduced by outliers: 10.415%
variance is moderately inflated by outliers
benchmarking sha256/external
mean: 14.20670 ms, lb 14.17237 ms, ub 14.27004 ms, ci 0.950
std dev: 230.5448 us, lb 150.7310 us, ub 427.6068 us, ci 0.950
found 3 outliers among 100 samples (3.0%)
2 (2.0%) high mild
1 (1.0%) high severe
2 mb file:
benchmarking sha256/internal
mean: 26.44270 ms, lb 26.23701 ms, ub 26.63414 ms, ci 0.950
std dev: 1.012303 ms, lb 925.8921 us, ub 1.122267 ms, ci 0.950
variance introduced by outliers: 35.540%
variance is moderately inflated by outliers
benchmarking sha256/external
mean: 26.84521 ms, lb 26.77644 ms, ub 26.91433 ms, ci 0.950
std dev: 347.7867 us, lb 210.6283 us, ub 571.3351 us, ci 0.950
found 6 outliers among 100 samples (6.0%)
import Crypto.Hash
import Data.ByteString.Lazy as L
import Criterion.Main
import Common
testfile :: FilePath
testfile = "/run/shm/data" -- on ram disk
main = defaultMain
[ bgroup "sha256"
[ bench "internal" $ whnfIO internal
, bench "external" $ whnfIO external
]
]
sha256 :: L.ByteString -> Digest SHA256
sha256 = hashlazy
internal :: IO String
internal = show . sha256 <$> L.readFile testfile
external :: IO String
external = do
s <- readProcess "sha256sum" [testfile]
return $ fst $ separate (== ' ') s
Now the webapp can generate a gpg key that is dedicated for use by
git-annex. Since the key is single use, much of the complexity of
generating gpg keys is avoided.
Note that the key has no password, because gpg-agent is not available
everywhere the assistant is installed. This is not a big security problem
because the key is going to live on the same disk as the git annex
repository, so an attacker with access to it can look directly in the
repository to see the same files that get stored in the encrypted
repository on the removable drive.
There is no provision yet for backing up keys.
This commit sponsored by Robert Beaty.
Note that Utility.Format.prop_idempotent_deencode does not hold
now that hex escaped characters are supported. quickcheck fails to notice
this, so I have left it as-is for now.
Cipher is now a datatype
data Cipher = Cipher String | MacOnlyCipher String
which makes more precise its interpretation MAC-only vs. MAC + used to
derive a key for symmetric crypto.
With the initremote parameters "encryption=pubkey keyid=788A3F4C".
/!\ Adding or removing a key has NO effect on files that have already
been copied to the remote. Hence using keyid+= and keyid-= with such
remotes should be used with care, and make little sense unless the point
is to replace a (sub-)key by another. /!\
Also, a test case has been added to ensure that the cipher and file
contents are encrypted as specified by the chosen encryption scheme.
Having one module that knows about all the filenames used on the branch
allows working back from an arbitrary filename to enough information about
it to implement dropping dead remotes and doing other log file compacting
as part of a forget transition.
/!\ It is to be noted that revoking a key does NOT necessarily prevent
the owner of its private part from accessing data on the remote /!\
The only sound use of `keyid-=` is probably to replace a (sub-)key by
another, where the private part of both is owned by the same
person/entity:
git annex enableremote myremote keyid-=2512E3C7 keyid+=788A3F4C
Reference: http://git-annex.branchable.com/bugs/Using_a_revoked_GPG_key/
* Other change introduced by this patch:
New keys now need to be added with option `keyid+=`, and the scheme
specified (upon initremote only) with `encryption=`. The motivation for
this change is to open for new schemes, e.g., strict asymmetric
encryption.
git annex initremote myremote encryption=hybrid keyid=2512E3C7
git annex enableremote myremote keyid+=788A3F4C
Instead of populating the second-level Bloom filter with every key
referenced in every Git reference, consider only those which differ
from what's referenced in the index.
Incidentaly, unlike with its old behavior, staged
modifications/deletion/... will now be detected by 'unused'.
Credits to joeyh for the algorithm. :-)
When quvi is installed, git-annex addurl automatically uses it to detect
when an page is a video, and downloads the video file.
web special remote: Also support using quvi, for getting files,
or checking if files exist in the web.
This commit was sponsored by Mark Hepburn. Thanks!
<RichiH> i richih@eudyptes (git)-[master] ~git/debconf-share/debconf13/photos/chrysn % rm /home/richih/work/git/debconf-share/.git/annex/tmp/SHA256E-s3044235--693b74fcb12db06b5e79a8b99d03e2418923866506ee62d24a4e9ae8c5236758.JPG
<RichiH> richih@eudyptes (git)-[master] ~git/debconf-share/debconf13/photos/chrysn % git annex get P8060008.JPG
<RichiH> get P8060008.JPG (from website...) --2013-08-21 21:42:45-- http://annex.debconf.org/debconf-share/.git//annex/objects/1a4/67d/SHA256E-s3044235--693b74fcb12db06b5e79a8b99d03e2418923866506ee62d24a4e9ae8c5236758.JPG/SHA256E-s3044235--693b74fcb12db06b5e79a8b99d03e2418923866506ee62d24a4e9ae8c5236758.JPG
<RichiH> Resolving annex.debconf.org (annex.debconf.org)... 5.153.231.227, 2001:41c8:1000:19::227:2
<RichiH> Connecting to annex.debconf.org (annex.debconf.org)|5.153.231.227|:80... connected.
<RichiH> HTTP request sent, awaiting response... 404 Not Found
<RichiH> 2013-08-21 21:42:45 ERROR 404: Not Found.
<RichiH> File `/home/richih/work/git/debconf-share/.git/annex/tmp/SHA256E-s3044235--693b74fcb12db06b5e79a8b99d03e2418923866506ee62d24a4e9ae8c5236758.JPG' already there; not retrieving.
<RichiH> Unable to access these remotes: website
<RichiH> Try making some of these repositories available:
<RichiH> 3e0356ac-0743-11e3-83a5-1be63124a102 -- website (annex.debconf.org)
<RichiH> a7495021-9f2d-474e-80c7-34d29d09fec6 -- chrysn@hephaistos:~/data/projects/debconf13/debconf-share
<RichiH> eb8990f7-84cd-4e6b-b486-a5e71efbd073 -- joeyh passport usb drive
<RichiH> f415f118-f428-4c68-be66-c91501da3a93 -- joeyh laptop
<RichiH> failed
<RichiH> git-annex: get: 1 failed
<RichiH> richih@eudyptes (git)-[master] ~git/debconf-share/debconf13/photos/chrysn %
I was not able to reproduce the failure, but I did reproduce that
wget -O http://404/ results in an empty file being written.
This runs git-cat-file in non-batch mode for all files with spaces.
If a directory tree has a lot of them, and is in direct mode, even "git
annex add" when there are few new files will need a *lot* of forks!
The only reason buffering the whole file content to get the sha is not a
memory leak is that git-annex only ever uses this on symlinks.
This needs to be reverted as soon as a fix is available in git!
A git pathspec is a filename, except when it starts with ':', it's taken
to refer to a branch, etc. Rather than special case ':', any filename
starting with anything unusual is prefixed with "./"
This could have been a real mess to deal with, but luckily SafeCommand
is already extensively used and so we know at the type level the difference
between parameters that are files, and parameters that are command options.
Testing did show that Git.Queue was not using SafeCommand on
filenames fed to xargs. (Filenames starting with '-' worked before only
because -- was used to separate filenames from options when calling eg git
add.)
The test suite now passes with filenames starting with ':'. However, I did
not keep that change to it, because such filenames are probably not legal
on windows, and I have enough ugly windows ifdefs in there as it is.
This commit was sponsored by Otavio Salvador. Thanks!
Started with a problem when running addurl on a really long url,
because the whole url is munged into the filename. Ended up doing
a fairly extensive review for places where filenames could get too large,
although it's hard to say I'm not missed any..
Backend.Url had a 128 character limit, which is fine when the limit is 255,
but not if it's a lot shorter on some systems. So check the pathconf()
limit. Note that this could result in fromUrl creating different keys
for the same url, if run on systems with different limits. I don't see
this is likely to cause any problems. That can already happen when using
addurl --fast, or if the content of an url changes.
Both Command.AddUrl and Backend.Url assumed that urls don't contain a
lot of multi-byte unicode, and would fail to truncate an url that did
properly.
A few places use a filename as the template to make a temp file.
While that's nice in that the temp file name can be easily related back to
the original filename, it could lead to `git annex add` failing to add a
filename that was at or close to the maximum length.
Note that in Command.Add.lockdown, the template is still derived from the
filename, just with enough space left to turn it into a temp file.
This is an important optimisation, because the assistant may lock down
a bunch of files all at once, and using the same template for all of them
would cause openTempFile to iterate through the same set of names,
looking for an unused temp file. I'm not very happy with the relatedTemplate
hack, but it avoids that slowdown.
Backend.WORM does not limit the filename stored in the key.
I have not tried to change that; so git annex add will fail on really long
filenames when using the WORM backend. It seems better to preserve the
invariant that a WORM key always contains the complete filename, since
the filename is the only unique material in the key, other than mtime and
size. Since nobody has complained about add failing (I think I saw it
once?) on WORM, probably it's ok, or nobody but me uses it.
There may be compatability problems if using git annex addurl --fast
or the WORM backend on a system with the 255 limit and then trying to use
that repo in a system with a smaller limit. I have not tried to deal with
those.
This commit was sponsored by Alexander Brem. Thanks!
The icon files will be installed when running make install or cabal
install. Did not try to run update-icon-caches, since I think it's debian
specific, and dh_icons will take care of that for the Debian package.
Using the favicon as a 16x16 icon. At 24x24 the svg displays pretty well,
although the dotted lines are rather faint. The svg is ok at all higher
resolutions.
The standalone linux build auto-installs the desktop and autostart files
when run. I have not made it auto-install the icon file too, because
a) that would take more work to include them in the tarball and find them
b) it would need to be an install to ~/.icons/, and I don't know if that
really works!
This is a compromise. I would like to nice every thread except for the
webapp thread, but it's not practical to do so. That would need every
thread to run as a bound thread, which could add significant overhead.
And any forkIO would escape the nice level.