Batch detection is heuristic, so can sometimes fail. I observed one such
failure while starting up in a repository with 87000 files. After the first
several batches of ~5000 files, it fell out of batch mode, and never
re-entered it, and so made many more commits of a few files at a time
than necessary.
So, let's always use batch mode when in the startup scan. This avoids the
heuristic there, at least.
There is clearly also room to improve the heuristic. Possibly 10 files is
too high a bar to be found during a commit, on a system that can commit
quickly.
Fixes a test case I received where a corrupted repo was repaired, but the
git-annex branch was not. The root of the problem was that the
MissingObject returned by the repair code was not necessarily a complete
set of all objects that might have been deleted during the repair.
So, stop trying to return that at all, and instead make the index file
checking code explicitly verify that each object the index uses is present.
Previous test did not notice if there is a dangling symlink.
Also, if a directory exists with the same name as the imported file, that
cannot work, so don't let --force have an effect.
Note that the hash backends were made to stop printing a (checksum..)
message as part of this, since it showed up without a file when deciding
whether to act on a file. Should have probably removed that message a while
ago anyway, I suppose.
The assistant's commit code also always avoids git commit, for simplicity.
Indirect mode sync still does a git commit -a to catch unstaged changes.
Note that this means that direct mode sync no longer runs the pre-commit
hook or any other hooks git commit might call. The git annex pre-commit
hook action for direct mode is however explicitly run. (The assistant
already ran git commit with hooks disabled, so no change there.)
Not yet wired up to restart the assistant on upgrade; that needs careful
sanity checking to wait until the upgrade is done before restarting.
Used the DirWatcher here, so it gets events for any changes to the
directory containing the program file. (But not subdirs.) This is necessary
in order to detect when the file is renamed as part of the upgrade, which
an inotify on a single file would not detect. (Also, I have DirWatcher code,
but not FileWatcher code.)
Note that upgrades that remove or rename a whole directory tree containing
the executable will *not* trigger this code. So eg, deleting and replacing
the whole standalone tarball dir tree won't work -- but untarring it
over top will. So should dpkg package upgrades.
Added programPath, using a new GHC feature to find the full path to the
executable. The fallback code for old GHC or unsupported OS is less good;
its worst failure mode would be either failing to find the program, and so
not checking for upgrades, or finding a git-annex that's in PATH, but is
not the one running.
This commit was sponsored by John Roepke.
In 0.9, -v shows version, rather than controlling verbosity.
Still need to port to 0.9, this just avoids massively confusing addurl when
quvi prints its version and exits successfully, on urls that it cannot be
used with.
Because that allowed writing to symlinks of files that are not present,
which followed the link and put bad content in an object location.
fsck: Fix up .git/annex/object directory permissions.
This commit was sponsored by an anonymous bitcoin donor.
Complicated by such repositories potentially being repos that should have
an annex.uuid, but it failed to be gotten, perhaps due to the past ssh repo
setup bugs. This is handled now by an Upgrade Repository button.
Adding the file moved it to the annex, and then tried to set the mode.
Error unwind then moved the file back, and so the watcher saw the file get
deleted and then added back, and so tried again..
Direct mode repositories can now have core.bare=true set, to prevent
accidentally running git commands that try to operate on the work tree,
and so do the wrong thing.
This is not yet the default, and it causes known problems for git-annex sync
due to receive.denyCurrentBranch not working in bare repositories.
This commit was sponsored by Richard Hartmann.
The -c option now not only modifies the git configuration seen by
git-annex, but it is passed along to every git command git-annex runs.
This was easy to plumb through because gitCommandLine is already used to
construct every git command line, to add --git-dir and --work-tree
I think both of these are all that's affected, but I went ahead and fixed
all the remotes that set their config to M.empty to instead store the
actual config. Who knows what will expect it to be actually present in
future, the Remote instance of getGpgEncParams came to..
I was able to reproduce something very like this bug by starting
pairing separately on both computers under poor network conditions (ie,
weak wifi on my front porch). Neither computer showed an alert for the
PairReq messages it was seeing (intermittently) from the other.
So, I've made a new PairReq message that has not been seen before
always make the alert pop up, even if the assistant thinks it is
in the middle of its own pairing process (or even another pairing
process with a different box on the LAN).
(This shouldn't cause a rogue PairAck to disrupt a pairing process part
way through.)
Copies files out of the annex. This avoids an unannex of one file breaking
other files that link to the same content. Also, it means that the content
remains in the annex using up space until cleaned up with "git annex
unused".
(The behavior of unannex --fast has not changed; it still hard
links to content in the annex. --fast was not made the default because it
is potentially unsafe; editing such a hard linked file can unexpectedly
change content stored in the annex.)
catKeyFileHEAD is still checked too, because when doing a git commit with
unlocked files, the file gets staged to the index, so is not typechanged
there.
(This is also why git annex add foo; git annex unlock foo; git commit -a
does not re-annex foo, because there is no indication left that it was
added.)
Note that this case is only fully automatically resolved in direct mode.
In indirect mode, git merge moves the file to file~HEAD, and replaces it
with the directory, and leaves the file in unmerged state, and sync doesn't
yet change that.
I have not actually tested with 1.8.5, which is not yet relesaed, but
git.git commit f7cd8c50b9ab83e084e8f52653ecc8d90665eef2 changes -z
to also apply to output, without regards to back-compat. (But with pretty
good reasons.)
New code should work with both versions, by fingerprinting for NULs and
newlines.
addurl: Improve message when adding url with wrong size to existing file.
Before the message suggested the url didn't exist.
Fixed handling of URL keys that have no recorded size. Before, if the key
has no size, the url also had to not declare any size, which was unlikely
and wrong, or it was taken to not exist. This probably would mostly affect
keys that were added to the annex with addurl --relaxed.
Thought was that this would be faster than a map, since a vector can be
updated more efficiently. It turns out to not seem to matter; runtime and
memory usage are basically identical.
The control socket path passed to ssh needs to be 17 characters shorter
than the maximum unix domain socket length, because ssh appends stuff to it
to make a temporary filename. Closes: #725512
Also, take the shorter of the relative and the absolute paths to the
socket. Typically the relative path will be a lot shorter (unless
deep inside a subdirectory of the repository), and so using it will
avoid flirting with the maximum safe socket lenghts in more situations,
and so lead to less breakage if all my attempts at fixing this are
still buggy.
Extends the index.lock handling to other git lock files. I surveyed
all lock files used by git, and found more than I expected. All are
handled the same in git; it leaves them open while doing the operation,
possibly writing the new file content to the lock file, and then closes
them when done.
The gc.pid file is excluded because it won't affect the normal operation
of the assistant, and waiting for a gc to finish on startup wouldn't be
good.
All threads except the webapp thread wait on the new startup sanity checker
thread to complete, so they won't try to do things with git that fail
due to stale lock files. The webapp thread mostly avoids doing that kind of
thing itself. A few configurators might fail on lock files, but only if the
user is explicitly trying to run them. The webapp needs to start
immediately when the user has opened it, even if there are stale lock
files.
Arranging for the threads to wait on the startup sanity checker was a bit
of a bear. Have to get all the NotificationHandles set up before the
startup sanity checker runs, or they won't see its signal. Perhaps
the NotificationBroadcaster is not the best interface to have used for
this. Oh well, it works.
This commit was sponsored by Michael Jakl
FAT has a lot of characters it does not allow in filenames, like ? and *
It's probably the worst offender, but other filesystems also have
limitiations.
In 2011, I made keyFile escape : to handle FAT, but missed the other
characters. It also turns out that when I did that, I was also living
dangerously; any existing keys that contained a : had their object
location change. Oops.
So, adding new characters to escape to keyFile is out. Well, it would be
possible to make keyFile behave differently on a per-filesystem basis, but
this would be a real nightmare to get right. Consider that a rsync special
remote uses keyFile to determine the filenames to use, and we don't know
the underlying filesystem on the rsync server..
Instead, I have gone for a solution that is backwards compatable and
simple. Its only downside is that already generated URL and WORM keys
might not be able to be stored on FAT or some other filesystem that
dislikes a character used in the key. (In this case, the user can just
migrate the problem keys to a checksumming backend. If this became a big
problem, fsck could be made to detect these and suggest a migration.)
Going forward, new keys that are created will escape all characters that
are likely to cause problems. And if some filesystem comes along that's
even worse than FAT (seems unlikely, but here it is 2013, and people are
still using FAT!), additional characters can be added to the set that are
escaped without difficulty.
(Also, made WORM limit the part of the filename that is embedded in the key,
to deal with filesystem filename length limits. This could have already
been a problem, but is more likely now, since the escaping of the filename
can make it longer.)
This commit was sponsored by Ian Downes
SHA3 is still waiting for final standardization.
Although this is looking less likely given
https://www.cdt.org/blogs/joseph-lorenzo-hall/2409-nist-sha-3
In the meantime, cryptohash implements skein, and it's used by some of the
haskell ecosystem (for yesod sessions, IIRC), so this implementation is
likely to continue working. Also, I've talked with the cryprohash author
and he's a reasonable guy.
It makes sense to have an alternate high security hash, in case some
horrible attack is found against SHA2 tomorrow, or in case SHA3 comes out
and worst fears are realized.
I'd also like to support using skein for HMAC. But no hurry there and
a new version of cryptohash has much nicer HMAC code, so I will probably
wait until I can use that version.
gcrypt needs to be able to fast-forward the master branch. If a git
repository is set up with git init --shared --bare, it gets that set, and
pushing to it will then fail, even when it's up-to-date.
This happened because the transferrer process did not know about the new
remote. remoteFromUUID crashed, which crashed the transferrer. When it was
restarted, the new one knew about the new remote so all further files would
transfer, but the one file would temporarily not be, until transfers retried.
Fixed by making remoteFromUUID not crash, and try reloading the remote list
if it does not know about a remote.
Note that this means that remoteFromUUID does not only return Nothing anymore
when the UUID is the UUID of the local repository. So had to change some code
that dependend on that assumption.
Overridable with --user-agent option.
Not yet done for S3 or WebDAV due to limitations of libraries used --
nether allows a user-agent header to be specified.
This commit sponsored by Michael Zehrer.
This is motivated by a user report that the assistant was repeatedly
retrying transfers of files that had been deleted (in direct mode, so
removing the only copy).
Note that the glacier code retries failed transfers after a while to retry
downloads that have aged long enough to be available. This is ok; if we're
doing a full transfer scan we'll retry on every file that is still in the
git tree.
Also note that this makes the assistant less likely to get every file
referenced by old revs of the git tree. Not something the assistant tries
to ensure anyway, so I feel this is acceptable.
This is a massive win on OSX, which doesn't have a sha256sum normally.
Only use external hash commands when the file is > 1 mb,
since cryptohash is quite close to them in speed.
SHA is still used to calculate HMACs. I don't quite understand
cryptohash's API for those.
Used the following benchmark to arrive at the 1 mb number.
1 mb file:
benchmarking sha256/internal
mean: 13.86696 ms, lb 13.83010 ms, ub 13.93453 ms, ci 0.950
std dev: 249.3235 us, lb 162.0448 us, ub 458.1744 us, ci 0.950
found 5 outliers among 100 samples (5.0%)
4 (4.0%) high mild
1 (1.0%) high severe
variance introduced by outliers: 10.415%
variance is moderately inflated by outliers
benchmarking sha256/external
mean: 14.20670 ms, lb 14.17237 ms, ub 14.27004 ms, ci 0.950
std dev: 230.5448 us, lb 150.7310 us, ub 427.6068 us, ci 0.950
found 3 outliers among 100 samples (3.0%)
2 (2.0%) high mild
1 (1.0%) high severe
2 mb file:
benchmarking sha256/internal
mean: 26.44270 ms, lb 26.23701 ms, ub 26.63414 ms, ci 0.950
std dev: 1.012303 ms, lb 925.8921 us, ub 1.122267 ms, ci 0.950
variance introduced by outliers: 35.540%
variance is moderately inflated by outliers
benchmarking sha256/external
mean: 26.84521 ms, lb 26.77644 ms, ub 26.91433 ms, ci 0.950
std dev: 347.7867 us, lb 210.6283 us, ub 571.3351 us, ci 0.950
found 6 outliers among 100 samples (6.0%)
import Crypto.Hash
import Data.ByteString.Lazy as L
import Criterion.Main
import Common
testfile :: FilePath
testfile = "/run/shm/data" -- on ram disk
main = defaultMain
[ bgroup "sha256"
[ bench "internal" $ whnfIO internal
, bench "external" $ whnfIO external
]
]
sha256 :: L.ByteString -> Digest SHA256
sha256 = hashlazy
internal :: IO String
internal = show . sha256 <$> L.readFile testfile
external :: IO String
external = do
s <- readProcess "sha256sum" [testfile]
return $ fst $ separate (== ' ') s
Done using a mode witness, which ensures it's fixed everywhere.
Fixing catFileKey was a bear, because git cat-file does not provide a
nice way to query for the mode of a file and there is no other efficient
way to do it. Oh, for libgit2..
Note that I am looking at tree objects from HEAD, rather than the index.
Because I cat-file cannot show a tree object for the index.
So this fix is technically incomplete. The only cases where it matters
are:
1. A new large file has been directly staged in git, but not committed.
2. A file that was committed to HEAD as a symlink has been staged
directly in the index.
This could be fixed a lot better using libgit2.
Now can tell if a repo uses gcrypt or not, and whether it's decryptable
with the current gpg keys.
This closes the hole that undecryptable gcrypt repos could have before been
combined into the repo in encrypted mode.
To support this, a core.gcrypt-id is stored by git-annex inside the git
config of a local gcrypt repository, when setting it up.
That is compared with the remote's cached gcrypt-id. When different, a
drive has been changed. git-annex then looks up the remote config for
the uuid mapped from the core.gcrypt-id, and tweaks the configuration
appropriately. When there is no known config for the uuid, it will refuse to
use the remote.
Note that it would be possible to extend the display to show all
repositories. But there can be a lot of repositories that are not set up as
remotes, and it would significantly clutter the display to show them all.
Since we're not showing all repositories, it's not worth trying to show
numcopies count either.
I decided to embrace these limitations and call the command remotes.
Use rsync for gcrypt remotes that are not local to the disk.
(Note that I have punted on supporting http transport for now, it doesn't
seem likely to be very useful.)
This was mostly quite easy, it just uses the rsync special remote to handle
the transfers. The git repository url is converted to a RsyncOptions
structure, which required parsing it separately, since the rsync special
remote only supports rsync urls, which use a different format.
Note that annexed objects are now stored at the top of the gcrypt repo,
rather than inside annex/objects. This simplified the rsync suport,
since it doesn't have to arrange to create that directory. And git-annex
is not going to be run directly within gcrypt repos -- or if in some
strance scenario it was, it would make sense for it to not see the
encrypted objects.
This commit was sponsored by Sheila Miguez
With the initremote parameters "encryption=pubkey keyid=788A3F4C".
/!\ Adding or removing a key has NO effect on files that have already
been copied to the remote. Hence using keyid+= and keyid-= with such
remotes should be used with care, and make little sense unless the point
is to replace a (sub-)key by another. /!\
Also, a test case has been added to ensure that the cipher and file
contents are encrypted as specified by the chosen encryption scheme.
Wrote nice pure transition calculator, and ugly code to stage its results
into the git-annex branch. Also had to split up several Log modules
that Annex.Branch needed to use, but that themselves used Annex.Branch.
The transition calculator is limited to looking at and changing one file at
a time. While this made the implementation relatively easy, it precludes
transitions that do stuff like deleting old url log files for keys that are
being removed because they are no longer present anywhere.
When quvi is installed, git-annex addurl automatically uses it to detect
when an page is a video, and downloads the video file.
web special remote: Also support using quvi, for getting files,
or checking if files exist in the web.
This commit was sponsored by Mark Hepburn. Thanks!