A very haskell commit! Just data types, instances to serialize the metadata
to a nice format, and QuickCheck tests.
This commit was sponsored by Andreas Leha.
git-annex has been using MissingH's `abdNormPath` forever, but that's
unmaintained and possibly buggy, and doesn't work on Windows. I've been
wanting to get rid of it for some time, and finally did today, writing a
`simplifyPath` that does the things git-annex needs and will work with all
the Windows filename craziness, and takes advantage of the more modern
System.FilePath to be quite a simple peice of code. A QuickCheck test found
no important divergences from absNormPath. A good first step to making
git-annex not depend on MissingH at all.
And it fixed some weird behaviors on Windows like
`git annex add ..\subdir\file` not working.
Note that absNormPathUnix has been left alone for now.
Seems I punted on this while porting before. This hack relies on DOS not
using / in filenames, it's effectively an alternate path separatr in at
least current versions of windows..
Potentially fixes some FD leak if an action on an opened file handle fails
for some reason. There have been some hard to reproduce reports of
git-annex leaking FDs, and this may solve them.
So that remotes that use a persistent network connection are restarted.
A remote might keep open a long duration network connection, and could
fail to deal well with losing the connection. This is particularly a
concern now that we have external special reotes. An external
special remote that is implemented naively might open the connection only
when PREPARE is sent, and if it loses connection, throw errors on each
request that is made.
(Note that the ssh connection caching should not have this problem; if the
long-duration ssh process loses connection, the named pipe is disconnected
and the next ssh attempt will reconnect. Also, XMPP already deals with
disconnection robustly in its own way.)
There's no way for git-annex to know if a lost network connection actually
affects a given remote, which might have a transfer in process. It does not
make sense to force kill the transferkeys process every time the NetWatcher
detects a change. (Especially because the NetWatcher sometimes polls 1
change per hour.)
In any case, the NetWatcher only detects connection to a network, not
disconnection. So if a transfer is in progress over the network, and the
network goes down, that will need to time out on its own.
An alternate approch that was considered is to use a separate transferkeys
process for each remote, and detect when a request fails, and assume that
means that process is in a failing state and restart it. The problem with
that approach is that if a resource is not available and a remote fails
every time, it degrades to starting a new transferkeys process for every
file transfer, which is too expensive.
Instead, this commit only handles the network reconnection case, and restarts
transferkeys only once the network has reconnected and another transfer needs
to be made. So, a transferkeys process will be reused for 1 hour, or until the
next network connection.
----
The NotificationBroadcaster was rewritten to use TMVars rather than MSampleVars,
to allow checking without blocking if a notification has been received.
----
This commit was sponsored by Tobias Brunner.
This has not been tested at all. It compiles!
The only known missing things are support for encryption, and for get/set
of special remote configuration, and of key state. (The latter needs
separate work to add a new per-key log file to store that state.)
Only thing I don't much like is that initremote needs to be passed both
type=external and externaltype=foo. It would be better to have just
type=foo
Most of this is quite straightforward code, that largely wrote itself given
the types. The only tricky parts were:
* Need to lock the remote when using it to eg make a request, because
in theory git-annex could have multiple threads that each try to use
a remote at the same time. I don't think that git-annex ever does
that currently, but better safe than sorry.
* Rather than starting up every external special remote program when
git-annex starts, they are started only on demand, when first used.
This will avoid slowdown, especially when running fast git-annex query
commands. Once started, they keep running until git-annex stops, currently,
which may not be ideal, but it's hard to know a better time to stop them.
* Bit of a chicken and egg problem with caching the cost of the remote,
because setting annex-cost in the git config needs the remote to already
be set up. Managed to finesse that.
This commit was sponsored by Lukas Anzinger.
The shell code was nasty, and buggy. New haskell code is much nicer,
and it's easy to do complicated calculations to properly convert possibly
absolute symlinks between libraries into relative links using it.
transferkeys had used special FDs for communication, but that would be
quite annoying to do in Windows.
Instead, use stdin and stdout. But, to avoid commands like rsync stomping
on them and messing up the communications channel, they're duplicated to a
different handle; stdin is replaced with a null handle, and stdout is
replaced with a copy of stderr. This should all work in windows too.
Stopping in progress transfers may work on windows.. if the types unify
anyway. ;) May need some more porting.
In 0.9, -v shows version, rather than controlling verbosity.
Still need to port to 0.9, this just avoids massively confusing addurl when
quvi prints its version and exits successfully, on urls that it cannot be
used with.
Allow AnyTime events that still have time to occur in the current day to
fall in a window covering the current day, instead of waiting until the
next day in the Recurrance.
Currently only implemented for local git remotes. May try to add support
to git-annex-shell for ssh remotes later. Could concevably also be
supported by some special remote, although that seems unlikely.
Cronner user this when available, and when not falls back to
fsck --fast --from remote
git annex fsck --from does not itself use this interface.
To do so, I would need to pass --fast and all other options that influence
fsck on to the git annex fsck that it runs inside the remote. And that
seems like a lot of work for a result that would be no better than
cd remote; git annex fsck
This may need to be revisited if git-annex-shell gets support, since it
may be the case that the user cannot ssh to the server to run git-annex
fsck there, but can run git-annex-shell there.
This commit was sponsored by Damien Diederen.
addurl: Improve message when adding url with wrong size to existing file.
Before the message suggested the url didn't exist.
Fixed handling of URL keys that have no recorded size. Before, if the key
has no size, the url also had to not declare any size, which was unlikely
and wrong, or it was taken to not exist. This probably would mostly affect
keys that were added to the annex with addurl --relaxed.
Turns out that forkProcess masks async exceptions. Unmask them so that the
daemon code can use them for thread IPC.
There is some risk this introduces breakage in git-annex, but it would be
breakage that would already occur when the assistant was run with
--foreground.
Wow! This was hairy, but about 10x less hairy than expected actually!
A bit more recursion than I really like, since I think in theory all
of this date stuff can be calulated using some formulas I am too lazy too
look up. But this doesn't matter in practice; I asked it for
nextTime (Schedule (Divisible 100 (Yearly 7)) (SpecificTime 23 59) (MinutesDuration 10)) Nothing
.. and it calculated (NextTimeExactly 2100-01-07 23:59:00) in milliseconds.
Rather similar to crontab, although with a different format.
But with less emphasis on per-minute scheduling.
Also, supports weekly events, which cron makes too hard.
Also, has a duration field.
SHA3 is still waiting for final standardization.
Although this is looking less likely given
https://www.cdt.org/blogs/joseph-lorenzo-hall/2409-nist-sha-3
In the meantime, cryptohash implements skein, and it's used by some of the
haskell ecosystem (for yesod sessions, IIRC), so this implementation is
likely to continue working. Also, I've talked with the cryprohash author
and he's a reasonable guy.
It makes sense to have an alternate high security hash, in case some
horrible attack is found against SHA2 tomorrow, or in case SHA3 comes out
and worst fears are realized.
I'd also like to support using skein for HMAC. But no hurry there and
a new version of cryptohash has much nicer HMAC code, so I will probably
wait until I can use that version.
Overridable with --user-agent option.
Not yet done for S3 or WebDAV due to limitations of libraries used --
nether allows a user-agent header to be specified.
This commit sponsored by Michael Zehrer.
I forgot I had <$$> hidden away in Utility.Applicative.
It allows doing the same kind of currying as does >=*>
and I found using it made the code more readable for me.
(*>=> was not used)
This is a massive win on OSX, which doesn't have a sha256sum normally.
Only use external hash commands when the file is > 1 mb,
since cryptohash is quite close to them in speed.
SHA is still used to calculate HMACs. I don't quite understand
cryptohash's API for those.
Used the following benchmark to arrive at the 1 mb number.
1 mb file:
benchmarking sha256/internal
mean: 13.86696 ms, lb 13.83010 ms, ub 13.93453 ms, ci 0.950
std dev: 249.3235 us, lb 162.0448 us, ub 458.1744 us, ci 0.950
found 5 outliers among 100 samples (5.0%)
4 (4.0%) high mild
1 (1.0%) high severe
variance introduced by outliers: 10.415%
variance is moderately inflated by outliers
benchmarking sha256/external
mean: 14.20670 ms, lb 14.17237 ms, ub 14.27004 ms, ci 0.950
std dev: 230.5448 us, lb 150.7310 us, ub 427.6068 us, ci 0.950
found 3 outliers among 100 samples (3.0%)
2 (2.0%) high mild
1 (1.0%) high severe
2 mb file:
benchmarking sha256/internal
mean: 26.44270 ms, lb 26.23701 ms, ub 26.63414 ms, ci 0.950
std dev: 1.012303 ms, lb 925.8921 us, ub 1.122267 ms, ci 0.950
variance introduced by outliers: 35.540%
variance is moderately inflated by outliers
benchmarking sha256/external
mean: 26.84521 ms, lb 26.77644 ms, ub 26.91433 ms, ci 0.950
std dev: 347.7867 us, lb 210.6283 us, ub 571.3351 us, ci 0.950
found 6 outliers among 100 samples (6.0%)
import Crypto.Hash
import Data.ByteString.Lazy as L
import Criterion.Main
import Common
testfile :: FilePath
testfile = "/run/shm/data" -- on ram disk
main = defaultMain
[ bgroup "sha256"
[ bench "internal" $ whnfIO internal
, bench "external" $ whnfIO external
]
]
sha256 :: L.ByteString -> Digest SHA256
sha256 = hashlazy
internal :: IO String
internal = show . sha256 <$> L.readFile testfile
external :: IO String
external = do
s <- readProcess "sha256sum" [testfile]
return $ fst $ separate (== ' ') s
Now the webapp can generate a gpg key that is dedicated for use by
git-annex. Since the key is single use, much of the complexity of
generating gpg keys is avoided.
Note that the key has no password, because gpg-agent is not available
everywhere the assistant is installed. This is not a big security problem
because the key is going to live on the same disk as the git annex
repository, so an attacker with access to it can look directly in the
repository to see the same files that get stored in the encrypted
repository on the removable drive.
There is no provision yet for backing up keys.
This commit sponsored by Robert Beaty.
Note that Utility.Format.prop_idempotent_deencode does not hold
now that hex escaped characters are supported. quickcheck fails to notice
this, so I have left it as-is for now.
Cipher is now a datatype
data Cipher = Cipher String | MacOnlyCipher String
which makes more precise its interpretation MAC-only vs. MAC + used to
derive a key for symmetric crypto.
With the initremote parameters "encryption=pubkey keyid=788A3F4C".
/!\ Adding or removing a key has NO effect on files that have already
been copied to the remote. Hence using keyid+= and keyid-= with such
remotes should be used with care, and make little sense unless the point
is to replace a (sub-)key by another. /!\
Also, a test case has been added to ensure that the cipher and file
contents are encrypted as specified by the chosen encryption scheme.
Having one module that knows about all the filenames used on the branch
allows working back from an arbitrary filename to enough information about
it to implement dropping dead remotes and doing other log file compacting
as part of a forget transition.
/!\ It is to be noted that revoking a key does NOT necessarily prevent
the owner of its private part from accessing data on the remote /!\
The only sound use of `keyid-=` is probably to replace a (sub-)key by
another, where the private part of both is owned by the same
person/entity:
git annex enableremote myremote keyid-=2512E3C7 keyid+=788A3F4C
Reference: http://git-annex.branchable.com/bugs/Using_a_revoked_GPG_key/
* Other change introduced by this patch:
New keys now need to be added with option `keyid+=`, and the scheme
specified (upon initremote only) with `encryption=`. The motivation for
this change is to open for new schemes, e.g., strict asymmetric
encryption.
git annex initremote myremote encryption=hybrid keyid=2512E3C7
git annex enableremote myremote keyid+=788A3F4C
Instead of populating the second-level Bloom filter with every key
referenced in every Git reference, consider only those which differ
from what's referenced in the index.
Incidentaly, unlike with its old behavior, staged
modifications/deletion/... will now be detected by 'unused'.
Credits to joeyh for the algorithm. :-)
When quvi is installed, git-annex addurl automatically uses it to detect
when an page is a video, and downloads the video file.
web special remote: Also support using quvi, for getting files,
or checking if files exist in the web.
This commit was sponsored by Mark Hepburn. Thanks!
<RichiH> i richih@eudyptes (git)-[master] ~git/debconf-share/debconf13/photos/chrysn % rm /home/richih/work/git/debconf-share/.git/annex/tmp/SHA256E-s3044235--693b74fcb12db06b5e79a8b99d03e2418923866506ee62d24a4e9ae8c5236758.JPG
<RichiH> richih@eudyptes (git)-[master] ~git/debconf-share/debconf13/photos/chrysn % git annex get P8060008.JPG
<RichiH> get P8060008.JPG (from website...) --2013-08-21 21:42:45-- http://annex.debconf.org/debconf-share/.git//annex/objects/1a4/67d/SHA256E-s3044235--693b74fcb12db06b5e79a8b99d03e2418923866506ee62d24a4e9ae8c5236758.JPG/SHA256E-s3044235--693b74fcb12db06b5e79a8b99d03e2418923866506ee62d24a4e9ae8c5236758.JPG
<RichiH> Resolving annex.debconf.org (annex.debconf.org)... 5.153.231.227, 2001:41c8:1000:19::227:2
<RichiH> Connecting to annex.debconf.org (annex.debconf.org)|5.153.231.227|:80... connected.
<RichiH> HTTP request sent, awaiting response... 404 Not Found
<RichiH> 2013-08-21 21:42:45 ERROR 404: Not Found.
<RichiH> File `/home/richih/work/git/debconf-share/.git/annex/tmp/SHA256E-s3044235--693b74fcb12db06b5e79a8b99d03e2418923866506ee62d24a4e9ae8c5236758.JPG' already there; not retrieving.
<RichiH> Unable to access these remotes: website
<RichiH> Try making some of these repositories available:
<RichiH> 3e0356ac-0743-11e3-83a5-1be63124a102 -- website (annex.debconf.org)
<RichiH> a7495021-9f2d-474e-80c7-34d29d09fec6 -- chrysn@hephaistos:~/data/projects/debconf13/debconf-share
<RichiH> eb8990f7-84cd-4e6b-b486-a5e71efbd073 -- joeyh passport usb drive
<RichiH> f415f118-f428-4c68-be66-c91501da3a93 -- joeyh laptop
<RichiH> failed
<RichiH> git-annex: get: 1 failed
<RichiH> richih@eudyptes (git)-[master] ~git/debconf-share/debconf13/photos/chrysn %
I was not able to reproduce the failure, but I did reproduce that
wget -O http://404/ results in an empty file being written.