This bug was introduced in 82a6db8fe8,
which improved handling of adding very large numbers of files by ensuring
that a minimum number of max size commits (5000 files each) were done.
I accidentially made it wait for another change to appear after such a max
size commit, even if a lot of queued changes were already accumulated.
That resulted in a stall when it got to the end. Now fixed to not wait
any longer than necessary to ensure the watcher has had time to wake back
up after the max size commit.
This commit was sponsored by Michael Linksvayer. Thanks!
This is a laziness problem. Despite the bang pattern on newfiles, the list
was not being fully evaluated before cleanup was called. Moving cleanup out
to after the list is actually used fixes this.
More evidence that I should be using ResourceT or pipes, if any was needed.
This affected both the hourly NetWatcherFallback thread and the syncing
when network connection is detected.
It was a reversion of sorts, introduced in
8861e270be, when annex-ignore was changed to
not control git syncing. I forgot to make it check annex-sync at that
point.
This was the last place in git-annex that could remove data referred to by
the git history, without being forced.
Like drop, dropunused checks remotes, and honors the global annex.numcopies
setting. (However, .gitattributes settings cannot apply to unused files.)
I only added this to the presense messages that are really intended for
presence. The ones used for tunneling git etc don't have the tag, because
that would waste bandwidth.
In direct mode, it's best to whenever possible not move direct mode files
out of the way, and so I made unannex avoid touching the direct mode file at
all.
That actually turns out to be easy, because in direct mode, unlike indirect
mode, the pre-commit hook won't get confused if the unannexed file later
gets added back by git add. So there's no need to commit the unannex right
away; it can be staged for the user to commit later. This also means that
unannex in direct mode is a lot faster than in indirect mode!
Another subtle bit is the bookkeeping that is done when unannexing a direct
mode file. The inode cache needs to be removed so that when uninit runs
getKeysPresent, it doesn't see the cache and think the key is still
present and crash when it's not.
This commit is sponsored by Douglas Butts. Thanks!
This is ok to do now that the socket filename never needs to be mapped back
to a hostname.
Short hostnames will still appear in the clear, which is less obfuscated.
So this cannot possibly make ssh connection caching fail for a hostname it
used to work for.
gmail.com has some XMPP SRV records, but does not itself respond to XMPP
traffic, although it does accept connections on port 5222. So if a user
entered the wrong password, it would try all the SRVs and fall back to
trying gmail, and hang at that point.
This seems the right thing to do, not just a workaround.
I wanted to try to guard against it in Command.Add too, but it's a case of
garbage in, garbage out. Once Command.Add has been told it's dealing with a
dummy symlink, it goes and deletes it, and even though the object it
thinks it points to is not present in the annex, it's Command.Add is still
doing the right thing to go ahead and add the broken symlink. So the two
fixes I was able to put in will have to do.
I thought at first this was a Windows specific problem, but it's not;
this affects checking any non-bare repository exported via http. Which is
a potentially important use case!
The actual bug was the case where Right False was returned by the first url
short-curcuited later checks. But the whole method used felt like code
I'd no longer write, and the use of undefined was particularly disgusting.
So I rewrote it.
Also added an action display.
This commit was sponsored by Eric Hanchrow. Thanks!
Cabal does not seem to have a way to check if flag A is set and then, if
flag B is set, add a dep. Instead, it makes flag B get unset if the
dep is not available.
Note that've told me:
We'll see how it goes, but I think this could be a permanent offer for
your userbase. People using git-annex are clueful and won't be a big
support burden for us, so it's a win-win.
The icon files will be installed when running make install or cabal
install. Did not try to run update-icon-caches, since I think it's debian
specific, and dh_icons will take care of that for the Debian package.
Using the favicon as a 16x16 icon. At 24x24 the svg displays pretty well,
although the dotted lines are rather faint. The svg is ok at all higher
resolutions.
The standalone linux build auto-installs the desktop and autostart files
when run. I have not made it auto-install the icon file too, because
a) that would take more work to include them in the tarball and find them
b) it would need to be an install to ~/.icons/, and I don't know if that
really works!
annexLocations uses OS-native directory separators, but for an url,
it needs to use / even on Windows.
This is an ugly workaround. Could parameterize a lot of stuff in
annexLocations to fix it better. I suspect this is probably the only place
it's needed though.
migrate wants to know the associated filename, in order to look up
the new backend. Can't do that with --all
migrate --all --backend=newvalue could be useful to support, in the future.
I spent a long time worrying about this problem with --all, that it cannot
check .gitattributes files for numcopies settings, and so would not be
entirely safe to use. The solution turns out to be simple, just don't
implement `git annex drop --all`. drop is the only command that needs to
check numcopies (move can also reduce the number of copies, but explicitly
bypasses numcopies settings).
Use cases that might need a drop --all are probably better served by using
unused and dropunused, which already work in a bare repository.
The ssh setup first runs ssh to the real hostname, to probe if a ssh key is
needed. If one is, it generates a mangled hostname that uses a key. This
mangled hostname was being used to ssh into the server to set up the key.
But if the server already had the key set up, and it was locked down, the
setup would fail. This changes it to use the real hostname when sshing in
to set up the key, which avoids the problem.
Note that it will redundantly set up the key on the ssh server. But it's
the same key; the ssh key generation code uses the key if it already
exists.
A common failure mode for direct mode has been for files to end up still
stored in indirect mode. While I hope that doesn't happen anymore, fsck
should deal with it.
This is a compromise. I would like to nice every thread except for the
webapp thread, but it's not practical to do so. That would need every
thread to run as a bound thread, which could add significant overhead.
And any forkIO would escape the nice level.
Better to have a working test suite that doesn't test a few things
than no working test suite.
Most of the disabled stuff is because for some reason "git annex sync"
doesn't work when run inside the test suite. Looks like PATH problems.
The directory and rsync special remotes seem broken on Windows, or
maybe the tests are. Pretty sure the hook special remote test is broken.
Yeah, that didn't actually work. Got error messages like it couldn't read
from the control socket, so probably ssh doesn't really support that on
Windows, at least the cygwin ssh build I'm using.
This write permission frobbing is very appropriate in indirect mode,
since annexed objects are stored as immutably as can be managed. But not
in direct mode, where files should be able to be modified at any time.
There are already sufficient guards that there's no need to prevent a file
being written to while it's being ingested, in direct mode. The inode cache
will detect (most) types of modifications, and the add will fail. Then a
re-add should be done. The assistant should get another inotify change
event, and automatically add the new version of the file.
Fuzz tests have shown that git cat-file --batch sometimes stops running.
It's not yet known why (no error message; repo seems ok). But this is
something we can deal with in the CoProcess framework, since all 3 types of
long-running git processes should be restartable if they fail.
Note that, as implemented, only IO errors are caught. So an error thrown
by the reveiver, when it sees something that is not valid output from
git cat-file (etc) will not cause a restart. I don't want it to retry
if git commands change their output or are just outputting garbage.
This does mean that if the command did a partial output and crashed in the
middle, it would still not be restarted.
There is currently no guard against restarting a command repeatedly, if,
for example, it crashes repeatedly on startup.
The checkpresent hook can return either True or, False, or fail with a message
if it cannot successfully check the remote. Currently for glacier, when
--trust-glacier is not set, it always returns False. Crucially, in the case
when a file is in glacier, this is telling git-annex it's not there, so copy
re-uploads it. This is not desirable; it breaks using glacier-cli to retreive
that file later, and it wastes money/bandwidth.
What if it instead, when the glacier inventory is missing a
file, it returns False. And when the glacier inventory has a file, unless
--trust-glacier is set, it *fails*.
The result would be:
* `git annex copy --to glacier` would only send things not listed in inventory. If a file is listed in the inventory, `copy`
would complain that --trust-glacier` is not set, and not re-upload the file.
* `git annex drop` would only trust that glacier has a file when --trust-glacier is set. Behavior unchanged.
* `git annex move --to glacier`, when the file is not listed in inventory, would send the file, and delete it locally. Behavior unchanged.
* `git annex move --to glacier`, when the file is listed in inventory, would only trust that glacier has the file when --trust-glacier is set
* `git annex copy --from glacier` / `git annex get`, when the file is located in glacier, would trust the location log, and attempt to get the file from glacier.
Ie, when there'a a conflicted merge we may get foo.variant-xxxx
created in a merge. If a second merge conflict occurs on that new file,
it was not falling back to putting in the whole key (which should stop
the merge conflicts happening for good, but is ugly).
The current manual mode preferred content expression is:
"present and (((exclude=*/archive/* and exclude=archive/*) or (not (copies=archive:1 or copies=smallarchive:1))) or (not copies=semitrusted+:1))"
The old matcher misparsed this, to basically:
OR (present and (...)) (not copies=semitrusted+:1))
The paren handling and indeed the whole conversion from tokens to the
matcher was just wrong. The new way may not be the cleverest, but I think
it is correct, and you can see how it pattern matches structurally against
the expressions when parsing them.
That expression is now parsed to:
MAnd (MOp <function>)
(MOr (MOr (MAnd (MOp <function>) (MOp <function>)) (MNot (MOr (MOp <function>) (MOp <function>))))
(MNot (MOp <function>)))
Which appears correct, and behaves correct in testing.
Also threw in a simplifier, so the final generated Matcher has less
unnecessary clutter in it. Mostly so that I could more easily read &
confirm them.
Also, added a simple test of the Matcher to the test suite.
There is a small chance of badly formed preferred content expressions
behaving differently than before due to this rewrite.
I noticed that when my modem hung up and redialed, my xmpp client was left
sending messages into the void. This will also handle any idle
disconnection issues.
This bug was turned up by the test suite, running fsck in direct mode.
A repository was cloned, was put into direct mode, was fscked, and fsck
incorrectly said that no copy existed of a file, that was actually present
in origin.
This turned out to occur because fsck first did a Annex.Branch.change,
recording that it did not locally have the file. That was recorded in the
journal. Since neither the git annex direct not the fsck had yet needed to
read any info from the branch, but had only made changes to it, the
origin/git-annex branch was not yet merged in. So the journal got a
location log entry written to it, but this did not include
the location log info for the origin. When fsck then did a
Annex.Branch.get, it trusted the journal was cosnsitent, and returned it,
again w/o merging from origin/git-annex. This latter behavior is the
actual bug.
Refer to commit e9bfa8eaed for the thinking
behind it being ok to make a change to a file on the branch, without
first merging the branch. That thinking still stands. However, it means
that files in the journal cannot be trusted to be consistent if the branch
has not been merged. So, to fix, just enure the branch gets merged, even
when reading from the journal.
In tests, this does not seem to cause any extra merging. Except, of course,
in the one case described above. But git annex add, etc, are able to make
changes w/o first merging the branch.
The bug was in movein, which just replaceFile'd the file with a symlink,
even if it already had the desired content, before trying to pull the
content out of the annex and replace the symlink with it.
That was ok-ish for non conflicted merges, where if the file existed it would
be an old version of the content. But for conflicted merges, the automatic
merge resolver has already run, and will have already put the desired
content into the file for the local variant.
Also, made removeDirect not trust that the associated files map is correct.
Only if it can verify that another file has the content will it not move it
into .git/annex/objects.
As seen in this bug report, the lifted exception handling using the StateT
monad throws away state changes when an action throws an exception.
http://git-annex.branchable.com/bugs/git_annex_fork_bombs_on_gpg_file/
.. Which can result in cached values being redundantly calculated, or other
possibly worse bugs when the annex state gets out of sync with reality.
This switches from a StateT AnnexState to a ReaderT (MVar AnnexState).
All changes to the state go via the MVar. So when an Annex action is
running inside an exception handler, and it makes some changes, they
immediately go into affect in the MVar. If it then throws an exception
(or even crashes its thread!), the state changes are still in effect.
The MonadCatchIO-transformers change is actually only incidental.
I could have kept on using lifted-base for the exception handling.
However, I'd have needed to write a new instance of MonadBaseControl
for the new monad.. and I didn't write the old instance.. I begged Bas
and he kindly sent it to me. Happily, MonadCatchIO-transformers is
able to derive a MonadCatchIO instance for my monad.
This is a deep level change. It passes the test suite! What could it break?
Well.. The most likely breakage would be to code that runs an Annex action
in an exception handler, and *wants* state changes to be thrown away.
Perhaps the state changes leaves the state inconsistent, or wrong. Since
there are relatively few places in git-annex that catch exceptions in the
Annex monad, and the AnnexState is generally just used to cache calculated
data, this is unlikely to be a problem.
Oh yeah, this change also makes Assistant.Types.ThreadedMonad a bit
redundant. It's now entirely possible to run concurrent Annex actions in
different threads, all sharing access to the same state! The ThreadedMonad
just adds some extra work on top of that, with its own MVar, and avoids
such actions possibly stepping on one-another's toes. I have not gotten
rid of it, but might try that later. Being able to run concurrent Annex
actions would simplify parts of the Assistant code.
Run the same code git-annex used to get the sha, including its sanity
checking. Much better than old grep. Should detect FreeBSD systems with
sha commands that output in stange format.
This after fielding a bug where git-annex was built with a sha256 program
whose output checked out, but was then run with one that output lines
like:
SHA256 (file) = <sha here>
Which it then parsed as having a SHA256 of "SHA256"!
Now the output of the command is required to be of the right length,
and contain only the right characters.
It's possible for files in indirect mode to have a direct mode mapping
file. Probably from when they were in direct mode. In this case,
toDirectGen tried to copy the content from the direct mode file that the
mapping said had it. But, being in indirect mode, it didn't really have the
content. So it did nothing. This fix makes it always move the content from
.git/annex/objects/ when it's there.
Observed that the pushed refs were received, but not merged into master.
The merger never saw an add event for these refs. Either git is not writing
to a new file and renaming it into place, or the inotify code didn't notice
that. Changed it to also watch for modify events and that seems to have
fixed it!
Note that the note on df88c51334 turned out
to be wrong. Multiple repository pairing over XMPP does work, because the
annex-uuid of the xmpp remote is updated to the uuid of the new repo
when pairing takes place. So the push from it is accepted. (And the other
UUIDs are listed in uuid.log, so pushes from those repositories also are
accepted of course.)
The root of the problem is that toInodeCache sees a non-symlink, and so
goes on and generates a new inode cache for the dummy symlink.
Any place that toInodeCache, or sameFileStatus, or genInodeCache are called
may need to deal with this case. Although many of them are ok. For example,
prepSendAnnex calls sameInodeCache, which calls genInodeCache.. but if
the file content is not present, the InodeCache generated for its standin
file is appropriately not the same, and so it returns Nothing.
I've audited some, but have to say I'm not happy with this; it should be
handled at the type level somehow, or a toInodeCache wrapper be used that
is aware of dummy symlinks.
(The Watcher already dealt with it, via the guardSymlinkStandin function.)
This is so git remotes on servers without git-annex installed can be used
to keep clients' git repos in sync.
This is a behavior change, but since annex-sync can be set to disable
syncing with a remote, I think it's acceptable.
assistant: Work around horrible, terrible, very bad behavior of
gnome-keyring, by not storing special-purpose ssh keys in ~/.ssh/*.pub.
Apparently gnome-keyring apparently will load and indiscriminately use such
keys in some cases, even if they are not using any of the standard ssh key
names. Instead store the keys in ~/.ssh/annex/, which gnome-keyring will
not check.
Note that neither I nor #debian-devel were able to quite reproduce this
problem, but I believe it exists, and that this fixes it. And it certianly
won't hurt anything..
Most remotes have meters in their implementations of retrieveKeyFile
already. Simply hooking these up to the transfer log makes that information
available. Easy peasy.
This is particularly valuable information for encrypted remotes, which
otherwise bypass the assistant's polling of temp files, and so don't have
good progress bars yet.
Still some work to do here (see progressbars.mdwn changes), but this
is entirely an improvement from the lack of progress bars for encrypted
downloads.
* addurl: Register transfer so the webapp can see it.
* addurl: Automatically retry downloads that fail, as long as some
additional content was downloaded.
Fixed by storing a list of cached inodes for a key, instead of just one.
Backwards compatability note: An old git-annex version will fail to parse
an inode cache file that has been written by a new version, and has
multiple items. It will succees if just one. So old git-annexes will have
even worse behavior when there are duplicated files, if that is possible.
I don't think it will be a problem. (Famous last words.)
Also, note that it doesn't expire old and unused inode caches for a key.
It would be possible to add this if needed; just look through the
associated files for a key and if there are more cached inodes, throw out
any not corresponding to associated files. Unless a file is being copied
repeatedly and the old copy deleted, this lack of expiry should not be a
problem.
* since this is a crippled filesystem anyway, git-annex doesn't use
symlinks on it
* so there's no reason to use the mixed case hash directories that we're
stuck using to avoid breaking everyone's symlinks to the content
* so we can do what is already done for all bare repos, and make non-bare
repos on crippled filesystems use the all-lower case hash directories
* which are, happily, all 3 letters long, so they cannot conflict with
mixed case hash directories
* so I was able to 100% fix this and even resuming `git annex add` in the
test case will recover and it will all just work.
This avoids commit churn by the assistant when eg,
replacing a file with a symlink.
But, just as importantly, it prevents the working tree being left with a
deleted file if git-annex, or perhaps the whole system, crashes at the
wrong time.
(It also probably avoids confusing displays in file managers.)
My test case for this bug is to have the assistant running and syncing to
a remote, and create a file in the annex. Then at the command line run
git annex drop. The assistant sees that the file is gone, sees it's a wanted
file, and downloads it from the remote.
With a directory special remote and a small file, I was seeing around 1
time in 3, a race where the file got unstaged from git after it got
downloaded.
Looking at what direct mode content managing code does in this case, it
deletes the symlink, and then adds the file content back. It would be
possible, sometimes, to avoid removing the symlink and do this atomically.
And I probably should.. but in some cases, particularly where the file
needs to be run through `cp` (multiple direct mode files with same
content), there's no way to atomically replace the symlink with the
content.
Anyway, the bug turns out to be something that the watcher does right for
indirect mode, but not for direct mode. When it got an add event, it
checked to see if this was a new file, or one we've already added. In the
latter case, no add event was queued. But that means that only the rm event
is queued, and so it unstages the file.
Fixed by queueing an add event even when the file is already in git.
Tested by running hundreds of drops in a loop; file remained staged.
I would have sort of liked to put this in .gitattributes, but it seems
it does not support multi-word attribute values. Also, making this a single
config setting makes it easy to only parse the expression once.
A natural next step would be to make the assistant `git add` files that
are not annex.largefiles. OTOH, I don't think `git annex add` should
`git add` such files, because git-annex command line tools are
not in the business of wrapping git command line tools.
There was confusion in different parts of the progress bar code about
whether an update contained the total number of bytes transferred, or the
number of bytes transferred since the last update. One way this bug
showed up was progress bars that seemed to stick at zero for a long time.
In order to fix it comprehensively, I add a new BytesProcessed data type,
that is explicitly a total quantity of bytes, not a delta.
Note that this doesn't necessarily fix every problem with progress bars.
Particularly, buffering can now cause progress bars to seem to run ahead
of transfers, reaching 100% when data is still being uploaded.
When a page is loaded, the javascript requests an notification url, and
does long polling on the url to be informed of changes. But if a change
occured before the notification url was requested, it would not be notified
of that change, and so the page display would not update.
I fixed this by *always* updating the page display after it gets
the notification url. This is extra work, but the overhead is not noticable
in the other overhead of loading a page.
(A nicer way would be to somehow record the version of a page initially
loaded, and then compare it with the current version when getting the
notification url, and only force an update if it's changed. But getting
the "version" of the different parts of the page that use long polling
is difficult.)
Move all the binaries and libraries under a bundle/ subdirectory;
so when it's in PATH only git-annex, runshell, and git-annex-webapp
will be available.
A long time ago I made Remote be an instance of the Ord typeclass, with an
implementation that compared the costs of Remotes. That seemed like a good
idea at the time, as it saved typing.. But at the time I was still making
custom Read and Show instances too. I've since learned that this is *not* a
good idea, and neither is making custom Ord instances, without deep thought
about the possible sets of values in a type. Haskell typeclasses are not a
toy.
This Ord instance came around and bit me when I put Remotes into a Set,
because now remotes with the same cost appeared to be in the Set even if
they were not. Also affected putting Remotes into a Map.
Rarely does a bug go this deep. I've fixed it comprehensively, first
removing the Ord instance entirely, and fixing the places that wanted to
order remotes by cost to do it explicitly. Then adding back an Ord instance
that is much more sane. Also by checking the rest of the Ord instances in
the code base (which were all ok).
While doing that, I found lots of places that kept remotes in Maps and
Sets. All of it was probably subtly broken in one way or another before
this fix, but it would be hard to say exactly how the bugs would
manifest.
This may work around google talk's horrible presence handling, in which
clients often don't learn about other clients, at least when using the same
account. This way, every time we start a git push over xmpp, we'll waste
bandwidth asking clients to please try again to identify themselves.
Just before starting a transfer, do one last check that it's still
preferred content.
I was just doing this for uploads, as part of the smarter flood filling
bug, but realized it's also possible for a download that was preferred
content to change to not be before the download begins, so check that too.
* addurl: Escape invalid characters in urls, rather than failing to
use an invalid url.
* addurl: Properly handle url-escaped characters in file:// urls.
The comments correctly noted that the remote could drop the key and
yet False be returned due to some problem that occurred afterwards.
For example, if it's a network remote, it could drop the key just
as the network goes down, and so things timeout and a nonzero exit
from ssh is propigated through and False returned.
However... Most of the time, this scenario will not have happened.
False will mean the remote was not available or could not drop the key
at all.
So, instead of assuming the worst, just trust the status we have.
If we get it wrong, and the scenario above happened, our location
log will think the remote has the key. But the remote's location
log (assuming it has one) will know it dropped it, and the next sync
will regain consistency.
For a special remote, with no location log, our location log will be wrong,
but this is no different than the situation where someone else dropped
the key from the remote and we've not synced with them. The standard
paranoia about not trusting the location log to be the last word about
whether a remote has a key will save us from these situations. Ie,
if we try to drop the file, we'll actively check the remote,
and determine the inconsistency then.
This cleaned up the code quite a bit; now the committer just looks at the
Change to see if it's a change that needs to have a transfer queued for it.
If I later want to add dropping keys for files that were removed, or
something like that, this should make it straightforward.
This also fixes a bug. In direct mode, moving a file out of an archive
directory failed to start a transfer to get its content. The problem
was that the file had not been committed to git yet, and so the transfer
code didn't want to touch it, since fileKey failed to get its key.
Only starting transfers after a commit avoids this problem.
Noticed that, At startup or network reconnect, git push messages were sent,
often before presence info has been gathered, so were not sent to any
buddies.
To fix this, keep track of which buddies have seen such messages,
and when new presence is received from a buddy that has not yet seen it,
resend.
This is done only for push initiation messages, so very little data needs
to be stored.
Make manualPull send push requests over XMPP.
When reconnecting with remotes, those that are XMPP remotes cannot
immediately be pulled from and scanned, so instead maintain a set of
(probably) desynced remotes, and put XMPP remotes on it. (This set could be
used in other ways later, if we can detect we're out of sync with other
types of remotes.)
The merger handles detecting when a XMPP push is received from a desynced
remote, and triggers a scan then, if they have in fact diverged.
This has one known bug: A single XMPP remote can have multiple clients
behind it. When this happens, only the UUID of one client is recorded
as the UUID of the XMPP remote. Pushes from the other XMPP clients will not
trigger a scan. If the client whose UUID is expected responds to the push
request, it'll work, but when that client is offline, we're SOL.
For example, copy --to such a remote would send the file, but as NoUUID was
its uuid, no location log update was done. And drop --from such a remote
would not do anything, because NoUUID is not listed in the location log..
assistant: Fix bug in direct mode that could occur when a symlink is moved
out of an archive directory, and resulted in the file not being set to
direct mode when it was transferred.
The bug was that the direct mode mapping was not up-to-date when the
transferrer finished. So, finding no direct mode place to store the object,
it was put into .git/annex in indirect mode.
To fix this, just make the watcher update the direct mode mapping to
include the new file before it starts the transfer. (Seems we don't need to
update it to remove the old file if the link was moved, because the direct
mode code will notice it's not present and the mapping gets updated for its
removal later.)
The reason this was a race, and was probably not seen often is because
the committer came along and updated the direct mode mapping as part of
adding the moved symlink. But when the file was sufficiently small or
the remote sufficiently fast, this could happen after the transfer
finished.
Looking through the git sources (documentation is unclear),
it seems commit doesn't ever trigger git-gc, mostly fetching and merging
seems to. I cannot easily override the setting in all those places, so
instead set gc.auto in git config when initializing a repository with
the assistant.
This does mean that the user cannot set gc.auto=0 and completely avoid
repacks, as the assistant does it daily. But, it only does it after there
are 100x the default number of loose objects, so this is probably not going
to be too annoying.
A transfer is queued, but if the file has already been transferred to the
remote before, the transfer is skipped. In this case, it needs to perform
any actions it would normally take after finishing the transfer, like
dropping the local object.
This cannot completely guard against a runaway log event, and only runs
every hour anyway, but it should avoid most problems with very
long-running, active assistants using up too much space.
The transfer queue can grow larger than 10 when queueing transfers for
files that were just received, as well as requeueing failed transfers.
I probably need to do some work to prevent that, as it could use a lot of
RAM. But for now, cap the number of displayed transfers in the webapp, to
avoid flooding the browser.
I have seen some other programs do this, and think it's pretty cool. Means
you can test wherever it's deployed, as well as at build time.
My other reason for doing it is less happy. Cabal's handling of test suites
sucks, requiring duplicated info, and even when that's done, it fails to
preprocess hsc files here. Building it in avoids that and avoids having
to explicitly tell cabal to enable test suites, which would then make it
link the test executable every time, which is unnecessarily slow.
This also has the benefit that now "make fast test" does a max speed build
and tests it.
The only thing lost is ./ghci
Speed: make fast used to take 20 seconds here, when rebuilding from
touching Command/Unused.hs. With cabal, it's 29 seconds.
Pity that the library does not provide a function to extract the status
code from the StatusCodeException, so when they had to add a new field, it
breaks every single place that does it.
Two fixes. First, and most importantly, relax the isLinkToAnnex check
to only look for /annex/objects/, not [^|/].git/annex/objects. If
GIT_DIR is used with a detached work tree, the git directory is
not necessarily named .git.
There are important caveats with doing that at all, since git-annex will
make symlinks that point at GIT_DIR, which means that the relative path
between GIT_DIR and GIT_WORK_TREE needs to remain stable across all clones
of the repository.
----
The other fix is just fixing crazy and wrong code that, when GIT_DIR is
set, expects to still find a git repository in the path below the work
tree, and uses some of its configuration, and some of GIT_DIR. What was I
thinking, and why can't I seem to get this code right?
In general, git-annex does not try to preserve file permissions. For
example, they don't round trip through special remotes. So it's ok to not
preserve them for git remotes either.
On crippled filesystems, rsync has been observed failing after the file
was transferred because it couldn't set some permission or other.
Adding a file that is already annexed, but has been modified, was broken in
direct mode.
This fix makes the new content be added. It does have the problem that
re-running `git annex add` will checksum and re-add the content repeatedly,
until it's committed. This happens because the key associated with the file
does not change until the new one gets committed, so it keeps thinking the
file has changed.
Refactored annex link code into nice clean new library.
Audited and dealt with calls to createSymbolicLink.
Remaining calls are all safe, because:
Annex/Link.hs: ( liftIO $ createSymbolicLink linktarget file
only when core.symlinks=true
Assistant/WebApp/Configurators/Local.hs: createSymbolicLink link link
test if symlinks can be made
Command/Fix.hs: liftIO $ createSymbolicLink link file
command only works in indirect mode
Command/FromKey.hs: liftIO $ createSymbolicLink link file
command only works in indirect mode
Command/Indirect.hs: liftIO $ createSymbolicLink l f
refuses to run if core.symlinks=false
Init.hs: createSymbolicLink f f2
test if symlinks can be made
Remote/Directory.hs: go [file] = catchBoolIO $ createSymbolicLink file f >> return True
fast key linking; catches failure to make symlink and falls back to copy
Remote/Git.hs: liftIO $ catchBoolIO $ createSymbolicLink loc file >> return True
ditto
Upgrade/V1.hs: liftIO $ createSymbolicLink link f
v1 repos could not be on a filesystem w/o symlinks
Audited and dealt with calls to readSymbolicLink.
Remaining calls are all safe, because:
Annex/Link.hs: ( liftIO $ catchMaybeIO $ readSymbolicLink file
only when core.symlinks=true
Assistant/Threads/Watcher.hs: ifM ((==) (Just link) <$> liftIO (catchMaybeIO $ readSymbolicLink file))
code that fixes real symlinks when inotify sees them
It's ok to not fix psdueo-symlinks.
Assistant/Threads/Watcher.hs: mlink <- liftIO (catchMaybeIO $ readSymbolicLink file)
ditto
Command/Fix.hs: stopUnless ((/=) (Just link) <$> liftIO (catchMaybeIO $ readSymbolicLink file)) $ do
command only works in indirect mode
Upgrade/V1.hs: getsymlink = takeFileName <$> readSymbolicLink file
v1 repos could not be on a filesystem w/o symlinks
Audited and dealt with calls to isSymbolicLink.
(Typically used with getSymbolicLinkStatus, but that is just used because
getFileStatus is not as robust; it also works on pseudolinks.)
Remaining calls are all safe, because:
Assistant/Threads/SanityChecker.hs: | isSymbolicLink s -> addsymlink file ms
only handles staging of symlinks that were somehow not staged
(might need to be updated to support pseudolinks, but this is
only a belt-and-suspenders check anyway, and I've never seen the code run)
Command/Add.hs: if isSymbolicLink s || not (isRegularFile s)
avoids adding symlinks to the annex, so not relevant
Command/Indirect.hs: | isSymbolicLink s -> void $ flip whenAnnexed f $
only allowed on systems that support symlinks
Command/Indirect.hs: whenM (liftIO $ not . isSymbolicLink <$> getSymbolicLinkStatus f) $ do
ditto
Seek.hs:notSymlink f = liftIO $ not . isSymbolicLink <$> getSymbolicLinkStatus f
used to find unlocked files, only relevant in indirect mode
Utility/FSEvents.hs: | Files.isSymbolicLink s = runhook addSymlinkHook $ Just s
Utility/FSEvents.hs: | Files.isSymbolicLink s ->
Utility/INotify.hs: | Files.isSymbolicLink s ->
Utility/INotify.hs: checkfiletype Files.isSymbolicLink addSymlinkHook f
Utility/Kqueue.hs: | Files.isSymbolicLink s = callhook addSymlinkHook (Just s) change
all above are lower-level, not relevant
Audited and dealt with calls to isSymLink.
Remaining calls are all safe, because:
Annex/Direct.hs: | isSymLink (getmode item) =
This is looking at git diff-tree objects, not files on disk
Command/Unused.hs: | isSymLink (LsTree.mode l) = do
This is looking at git ls-tree, not file on disk
Utility/FileMode.hs:isSymLink :: FileMode -> Bool
Utility/FileMode.hs:isSymLink = checkMode symbolicLinkMode
low-level
Done!!
In indirect mode, now checks the inode cache to detect changes to a file.
Note that a file can still be changed if a process has it open for write,
after landing in the annex.
In direct mode, some checking of the inode cache was done before, but
from a much later point, so fewer modifications could be detected. Now it's
as good as indirect mode.
On crippled filesystems, no lock down is done before starting to add a
file, so checking the inode cache is the only protection we have.
git annex init probes for crippled filesystems, and sets direct mode, as
well as `annex.crippledfilesystem`.
Avoid manipulating permissions of files on crippled filesystems.
That would likely cause an exception to be thrown.
Very basic support in Command.Add for cripped filesystems; avoids the lock
down entirely since doing it needs both permissions and hard links.
Will make this better soon.
Been meaning to do this for some time; Android port was last straw.
Note that newer versions of the uuid library have a Data.UUID.V4 that
generates random UUIDs slightly more cleanly, but Debian has an old version
of the library, so I do it slightly round-about.
These files were left behind, and made getKeysPresent find keys that were
not present. It would be expensive to make getKeysPresent check that the
actual key files are present (it just lists the directories). But that's not
needed if we just clean up the stale cache and mapping files.
To handle systems that were in direct mode and got switched back with stale
direct mode files, made cleanObjectLoc remove all files in the key's directory.
git annex unused will still list keys that are gone but for which the stale
direct mode files exists. To deal with that, made dropunused remove the key's
directory even if the key does not seem to be present.
Making the pre-commit hook look at git diff-index to find changed direct
mode files and update the mappings works pretty well.
One case where it does not work is when a file is git annex added, and then
git rmed, and then this is committed. That's a no-op commit, so the hook
probably doesn't even run, and it certianly never notices that the file
was deleted, so the mapping will still have the original filename in it.
For this and other reasons, it's important that the mappings still be
treated as possibly inconsistent.
Also, the assistant now allows the pre-commit hook to run when in direct
mode, so the mappings also get updated there.
The most common way for a mapping to be stale is when a file was deleted,
or renamed. Nothing updates the mappings for deletions yet.
But they can also become stale in other ways. For example a file can
be modified.
So, the mapping is not trusted to be consistent. When we get a key,
only replace symlinks that still point to that key with its content.
When we drop a key, only put back symlinks for files that still have
the direct mode content.
New setting, can be used to disable autocommit of changed files by the
assistant, while it still does data syncing and other tasks.
Also wired into webapp UI
Union merges involving two or more repositories could sometimes result in
data from one repository getting lost. This could result in the location
log data becoming wrong, and fsck being needed to fix it.
NB: I audited for any other occurrences of this problem. There are other
places than union merge where multiple changes are fed into update-index
in a stream, but they all involve working copy files being staged, or their
deletion being staged, and in this case it's fine for the later changes
to override the earlier ones.
It used to not log to daemon.log when a repository was first created, and
when starting the webapp. Now both do. Redirecting stdout and stderr to the
log is tricky when starting the webapp, because the web browser may want to
communicate with the user. (Either a console web browser, or web.browser = echo)
This is handled by restoring the original fds when running the browser.
A while ago I added code to support recent versions of shakespeare-js,
(commit fe11b3a940). But it seems that resulted
in quoting of all strings inserted into javascript files, which means it's
now impossible to do the type of metaprogramming that longpolling.julius
relied on. I have found another way to accomplish the same thing without
needing to generate unique function names. Hopefully it's portable.
Opinion of shakespeare-js now at rock bottom. One of these days, this
needs to be redone to use Fay.
since some systems may have configuration problems or other issues that
prevent web browsers from connecting to the right localhost IP for the
webapp.
Tested on both ipv4 and ipv6 localhost. Url for the latter looks like:
http://[::1]:50676
The expensive scan uses lookupFile, but in direct mode, that doesn't work
for files that are present. So the scan was not finding things that are
present that need to be uploaded. (It did find things not present that
needed to be downloaded.)
Now lookupFile also works in direct mode. Note that it still prefers
symlinks on disk to info committed to git, in direct mode. This is
necessary to make things like Assistant.Threads.Watcher.onAddSymlink
work correctly, when given a new symlink not yet checked into git (or
replacing a file checked into git).
Would like to also have restart UI, but that's rather harder to do,
seems it'd need to start another copy of the webapp, and redirect the
browser to its new url, but running two assistants in the same repo at
the same time isn't good.
* SHA*E backends: Exclude non-alphanumeric characters from extensions.
* migrate: Remove leading \ in SHA* checksums, and non-alphanumerics
from extensions of SHA*E keys.
* Bugfix: Remove leading \ from checksums output by sha*sum commands,
when the filename contains \ or a newline. Closes: #696384
* fsck: Still accept checksums with a leading \ as valid, now that
above bug is fixed.
* migrate: Remove leading \ in checksums
The newline after the filename was included in it.
This was generally benign -- mostly these filenames are just displayed,
and the newline didn't matter.
But in the assistant, it caused unexpected dropping of preferred
content.
A characteristic of this bug is that the drop was displayed like this:
drop some_file
ok
* get/copy --auto: Transfer data even if it would exceed numcopies,
when preferred content settings want it.
* drop --auto: Fix dropping content when there are no preferred content
settings.
It was doubly broken; both missing a slash, and containing
"runshell git-annex", while some parts of the code expected it to be a
simple path to a program. This appears to include the transfer queue
runner, and the code that starts a new assistant process when switching to
another repository in the webapp.
Ensure that each file has something written to it, even if the bytestring
chunk size is greater than the configured chunksize.
This means we may write a bit larger than the configured value, but only
when the configured value is very small; ie, < 8 kb.
Files are now written to a tmp directory in the remote, and once all
chunks are written, etc, it's moved into the final place atomically.
For now, checkpresent still checks every single chunk of a file, because
the old method could leave partially transferred files with some chunks
present and others not.
Both the directory and webdav special remotes used to have to buffer
the whole file contents before it could be decrypted, as they read
from chunks. Now the chunks are streamed through gpg with no buffering.
This adds a dep on haskell's async library, but since that's been
added to the recent haskell platform release, it should not be
much hardship to my poor long-suffering library chasing users.
Wrote a better git remote name sanitizer. Git blows up on lots of weird
stuff, especially if it starts the remote name, but I managed to get
some common punctuation working.
Adjust build deps to ensure that only a fixed version of the library will
be used.
Also, removed the bound thread stuff, which I now think was (probably)
a red herring.
MountWatcher can't do this, because it uses the session dbus,
and won't have access to the new DBUS_SESSION_BUS_ADDRESS if a new session
is started.
Bumped dbus library version, FD leak in it is fixed.
For now, when dbus goes away, the assistant keeps running but does not fall
back or reconnect. To do so needs more changes to the DBus library; in
particular a connectSessionWith and connectSystemWith to let me specify
my own clientThreadRunner.
So, it might be called sha1sum, or on some other OS, it might be called
sha1. It might be hidden away off of PATH on that OS. That's just expected
insanity; UNIX has been this way since 1980's. And these days, nobody even
gives the flying flip about standards that we briefly did in the 90's
after the first round of unix wars.
But it's the 2010's now, and we've certainly learned something.
So, let's make it so sometimes sha1 is a crazy program that wants to run as
root so it can lock memory while prompting for a passphrase, and outputting
binary garbage. Yes, that'd be wise. Let's package that in major Linux
distros, too, so users can stumble over it.
bup 0.25 does not accept that; and bup split reads from stdin by
default if no file is given. I'm not sure what version of bup changed this.
This only affected bup special remotes that were encrypted.
Monitors git-annex branch for changes, which are noticed by the Merger
thread whenever the branch ref is changed (either due to an incoming push,
or a local change), and refreshes cached config values for modified config
files.
Rate limited to run no more often than once per minute. This is important
because frequent git-annex branch changes happen when files are being
added, or transferred, etc.
A primary use case is that, when preferred content changes are made,
and get pushed to remotes, the remotes start honoring those settings.
Other use cases include propigating repository description and trust
changes to remotes, and learning when a remote has added a new special
remote, so the webapp can present the GUI to enable that special remote
locally.
Also added a uuid.log cache. All other config files already had caches.
in= was problimatic in two ways. First, it referred to a remote by name,
but preferred content expressions can be evaluated elsewhere, where that
remote doesn't exist, or a different remote has the same name. This name
lookup code could error out at runtime. Secondly, in= seemed pretty useless.
in=here did not cause content to be gotten, but it did let present content
be dropped.
present is more useful, although "not present" is unstable and should be
avoided.
When in a subdir, both the normal filepath, and the filepath relative to
the top of the git repo are needed for matching. The former for key lookup,
and the latter for include/exclude to match against. Previously, key lookup
didn't work in this situation.
The old code was just wrong in taking fromPath of GIT_DIR -- that made an
localUnknown location with the GIT_DIR in it, which only worked by
accident, and failed in submodules.
Now that this is handled correctly, git-annex can be used in git submodules.
Also, fixed infelicity where Git.CurrentRepo and Git.Config.updateLocation
were both dealing with core.worktree. Now updateLocation handles it for
Local as well as for LocalUnknown repos.
Aka solve the github problem.
Note that it's possible the initial configlist will fail for some network
reason etc, and then the fetch succeeds. In this case, a usable remote gets
disabled. But it does print a message, and this only happens once per
remote, so that seems ok.
There was one forkProcess lurking in test.hs, and that seems to be
responsible for recent buildd failures on amd64 and armhf. I was able to
reproduce it pretty easily on amd64, and even once on i386, and it was
clearly that same bad old threaded runtime hang. So removing this
forkProcess should fix it. Odd that it lurked for some months before
popping up.
Setting GIT_INDEX_FILE clobbers the rest of the environment, making git
not read ~/.gitconfig, and blow up if GECOS didn't have a name for the
user.
I'm not entirely happy with getEnvironment being run every time now,
that's somewhat expensive. It may make sense to just set GIT_COMMITTER_*
and GIT_AUTHOR_*, but I worry that clobbering the rest could break PATH,
or GIT_PATH, or something else that might be used by a command run in here.
And caching the environment is not a good idea either; it can change..
webapp: Adds newly created repositories to one of these groups:
clients, drives, servers
This is heuristic, but it's a pretty good heuristic, and can always be
configured.
Both when queueing downloads, and uploads, consults the preferred content
settings.
I didn't make it check yet when requeing failed transfers or queuing
deferred downloads; dealing with the preferred content settings (or indeed,
other settings) changing while the assistant is running still needs work.
Incomplete; I need to finish parsing and saving. This will also be used
for editing transfer control expresssions.
Removed the group display from the status output, I didn't really
like that format, and vicfg can be used to see as well as edit rempository
group membership.
Makes it safe to use git annex unlock with the watcher/assistant.
And also to mix use of the watcher/assistant with regular files stored in git.
Long ago, I had avoided doing this check, except during the startup scan,
because it would be slow to run ls-files repeatedly.
But then I added the lsof check, and to make that fast, got it to detect
batch file adds. So let's move the ls-files check to also occur when it'll
have a batch, and can check them all with one call.
This does slow down adding a single file by just a bit, but really only
a little bit. (The lsof check is probably more expensive.) It also
speeds up the startup scan, especially when there are lots of new files
found by the scan.
Also, fixed the sleep for annex.delayadd to not run while the threadstate
lock is held, so it doesn't unnecessarily freeze everything else.
Also, --force no longer makes it skip the lsof check, which was not
documented, and seems never a good idea.
At recommends it causes avahi-daemon to be pulled in on upgrade, which is
just too annoying to deal with avoiding on servers. MDNS is needed for
robust peering, but probably most desktop systems have it anyway; it's in
task-desktop.
This means that anyone serving up the webapp to users as a service
(ie, without providing any git-annex binary at all to the user) still needs
to provide a link to the source code for it, including any modifications
they may make.
This may make git-annex be covered by the AGPL as a whole when it is built
with the webapp. If in doubt, you should ask a lawyer.
When git-annex is built with the webapp disabled, no AGPLed code is used.
Even building in the assistant does not pull in AGPLed code.
This fixes a problem I was seeing in the assistant where two remotes would
attempt to sync with one another at the same time, and both failed pushing
the diverged git-annex branch. Then when both tried to resolve the failed
push, they each modified their git-annex branch, which again each blocked
the other from pushing into it. The result was that the git-annex
branches were perpetually diverged (despite having the same content!) and
once the assistant fell into this trap, it couldn't get out and always
had to do the slow push/fail/pull/merge/push/fail cycle.
The default backend used when adding files to the annex is changed from
SHA256 to SHA256E, to simplify interoperability with OSX, media players,
and various programs that needlessly look at symlink targets.
To get old behavior, add a .gitattributes containing: * annex.backend=SHA256