Fix bug caused by recent optimisations that could make git-annex not see
recently recorded status information when configured with
annex.alwayscommit=false.
This does mean that --all can end up processing the same key more than once,
but before the optimisations that introduced this bug, it used to also behave
that way. So I didn't try to fix that; it's an edge case and anyway git-annex
behaves well when run on the same key repeatedly.
I am not too happy with the use of a MVar to buffer the list of files in the
journal. I guess it doesn't defeat lazy streaming of the list, if that
list is actually generated lazily, and anyway the size of the journal is
normally capped and small, so if configs are changed to make it huge and
this code path fire, git-annex using enough memory to buffer it all is not a
large problem.
Fix bug caused by recent optimisations that could make git-annex not see
recently recorded status information when configured with
annex.alwayscommit=false.
When not using --all, precaching only gets triggered when the
command actually needs location logs, and so there's no speed hit there.
This is a minor speed hit for --all, because it precaches even when the
location log is not actually going to be used, and so checking the journal
is not necessary. It would have been possible to defer checking the journal
until the cache gets used. But that would complicate the usual Branch.get
code path with two different kinds of caches, and the speed hit is really
minimal. A better way to speed up --all, later, would be to avoid
precaching at all when the location log is not going to be used.
This adds a separate journal, which does not currently get committed to
an index, but is planned to be committed to .git/annex/index-private.
Changes that are regarding a UUID that is private will get written to
this journal, and so will not be published into the git-annex branch.
All log writing should have been made to indicate the UUID it's
regarding, though I've not verified this yet.
Currently, no UUIDs are treated as private yet, a way to configure that
is needed.
The implementation is careful to not add any additional IO work when
privateUUIDsKnown is False. It will skip looking at the private journal
at all. So this should be free, or nearly so, unless the feature is
used. When it is used, all branch reads will be about twice as expensive.
It is very lucky -- or very prudent design -- that Annex.Branch.change
and maybeChange are the only ways to change a file on the branch,
and Annex.Branch.set is only internal use. That let Annex.Branch.get
always yield any private information that has been recorded, without
the risk that Annex.Branch.set might be called, with a non-private UUID,
and end up leaking the private information into the git-annex branch.
And, this relies on the way git-annex union merges the git-annex branch.
When reading a file, there can be a public and a private version, and
they are just concacenated together. That will be handled the same as if
there were two diverged git-annex branches that got union merged.
Reads of cached data are not debugged, only cache misses are, and since
many commands pre-cache location log data, this avoids a slew of
fastDebug calls when running commands such as git-annex get --from
Seems that hasOrigin was never finding origin's git-annex branch, so a new
one got created each time. And so then it later needed to merge the two
branches, which is expensive.
Added --no-track to git branch to avoid it displaying a message about
setting up tracking branches. Of course there's no reason to make the
git-annex branch a tracking branch since git-annex auto-merges it.
Not yet used, but allows getting the size of items in the tree fairly
cheaply.
I noticed that CmdLine.Seek uses ls-tree and the feeds the files into
another long-running process to check their size. That would be an
example of a place that might be sped up by using this. Although in that
particular case, it only needs to know the size of unlocked files, not
locked. And since enabling --long probably doubles the ls-tree runtime
or more, the overhead of using it there may outwweigh the benefit.
9cb250f7be got the ones in RawFilePath,
but there were others that used the one from unix-compat, which fails at
runtime on windows. To avoid this,
import System.PosixCompat.Files hiding removeLink
This commit was sponsored by Ethan Aubin.
nukeFile replaced with removeWhenExistsWith removeLink, which allows
using RawFilePath. Utility.Directory cannot use RawFilePath since setup
does not depend on posix.
This commit was sponsored by Graham Spencer on Patreon.
Eliminate a zombie that was only cleaned up by the later zombie cleanup
code.
This is still not ideal, it would be cleaner if it used conduit or
something, and if the thread gets killed before waiting, it won't stop
the process.
Only remaining zombies are in CmdLine.Seek
The cache was removed way back in 2012,
commit 3417c55189
Then I forgot I had removed it! I remember clearly multiple times when I
thought, "this reads the same data twice, but the cache will avoid that
being very expensive".
The reason it was removed was it messed up the assistant noticing when
other processes made changes. That same kind of problem has recently
been addressed when adding the optimisation to avoid reading the journal
unnecessarily.
Indeed, enableInteractiveJournalAccess is run in just the
right places, so can just piggyback on it to know when it's not safe
to use the cache.
Avoid running a large number of git cat-file child processes when run with
a large -J value.
This implementation takes care to avoid adding any overhead to git-annex
when run without -J. When run with -J, there is a small bit of added
overhead, to manipulate the resource pool. That optimisation added a
fair bit of complexity.
The journal read optimisation in aeca7c220 later got fixed in eedd73b84
to stage and commit any files that were left in the journal by a
previous git-annex run. That's necessary for the optimisation to work
correctly. But it also meant that alwayscommit=false started committing
the previous git-annex processes journalled changes, which defeated the
purpose of the config setting entirely.
So, disable the optimisation when alwayscommit=false, leaving the
files in the journal and not committing them. See my comments on the bug
report for why this seemed the best approach.
Also fixes a problem when annex.merge-annex-branches=false and there
are changes in the journal. That config indirectly prevents committing
the journal. (Which seems a bit odd given its name, but it always has..)
So, when there were changes in the journal, perhaps left there due to
alwayscommit=false being set before, the optimisation would prevent
git-annex from reading the journal files, and it would operate with out
of date information.
aeca7c2207 was predicated on the
assumption that updateTo would stage any journal files, but in one case
it did not actually do so. The test suite happened to expose the bug.
The only price paid is one additional MVar read per write to the journal.
Presumably writing a journal file dominiates over a MVar read time by
several orders of magnitude.
--batch does not get the speedup because then it needs to notice when
another process has made a change. Also made the assistant and other damon
modes bypass the optimisation, which would not help them anyway.
Fix support for repositories tuned with annex.tune.branchhash1=true,
including --all not working and git-annex log not displaying anything for
annexed files.
git-annex find is now RawFilePath end to end, no string conversions.
So is git-annex get when it does not need to get anything.
So this is a major milestone on optimisation.
Benchmarks indicate around 30% speedup in both commands.
Probably many other performance improvements. All or nearly all places
where a file is statted use RawFilePath now.
Adds a dependency on filepath-bytestring, an as yet unreleased fork of
filepath that operates on RawFilePath.
Git.Repo also changed to use RawFilePath for the path to the repo.
This does eliminate some RawFilePath -> FilePath -> RawFilePath
conversions. And filepath-bytestring's </> is probably faster.
But I don't expect a major performance improvement from this.
This is mostly groundwork for making Annex.Location use RawFilePath,
which will allow for a conversion-free pipleline.
Finally builds (oh the agoncy of making it build), but still very
unmergable, only Command.Find is included and lots of stuff is badly
hacked to make it compile.
Benchmarking vs master, this git-annex find is significantly faster!
Specifically:
num files old new speedup
48500 4.77 3.73 28%
12500 1.36 1.02 66%
20 0.075 0.074 0% (so startup time is unchanged)
That's without really finishing the optimization. Things still to do:
* Eliminate all the fromRawFilePath, toRawFilePath, encodeBS,
decodeBS conversions.
* Use versions of IO actions like getFileStatus that take a RawFilePath.
* Eliminate some Data.ByteString.Lazy.toStrict, which is a slow copy.
* Use ByteString for parsing git config to speed up startup.
It's likely several of those will speed up git-annex find further.
And other commands will certianly benefit even more.
forget --drop-dead: Remove several classes of git-annex log files when they
become empty, further reducing the size of the git-annex branch.
Noticed while testing sameas uuid removal, but it could happen other times
too.
An empty log file is always treated by git-annex the same as no file
being present, and when the files are per-key, it can be a sizable space
saving to exclude them from the tree.
It would have been a lot less round-about to just make git annex dead
also add the uuids of sameas remotes to the trust.log as dead.
But, that would fail in the case where there's an unmerged other clone
that has a sameas remote that the current repo does not know about.
Then it would not get marked as dead.
Handling it at transition time avoids that scenario.
Note that the generation of trustmap' in dropDead should only
happen once, due to the partial application.
This is only done for correctness sake; I don't see any way that it
would have caused a problem here. The jlog file escaped withOtherTmp
so another process could swoop in and delete it, but the file is only
used as a buffer for a list of filenames, and its handle gets rewound
and they're read back out, which will still work even if it's already
been deleted.
The only reason I didn't just pre-delete the file and keep the handle
open is I'm not sure that works on all OS's (eg Windows). If there was
a problem that this fixed it might involve an OS that doesn't support
deleting an open file or something like that.
This does not change the overall license of the git-annex program, which
was already AGPL due to a number of sources files being AGPL already.
Legally speaking, I'm adding a new license under which these files are
now available; I already released their current contents under the GPL
license. Now they're dual licensed GPL and AGPL. However, I intend
for all my future changes to these files to only be released under the
AGPL license, and I won't be tracking the dual licensing status, so I'm
simply changing the license statement to say it's AGPL.
(In some cases, others wrote parts of the code of a file and released it
under the GPL; but in all cases I have contributed a significant portion
of the code in each file and it's that code that is getting the AGPL
license; the GPL license of other contributors allows combining with
AGPL code.)
Added graftTree but it's buggy.
Should use graftTree in Annex.Branch.graftTreeish; it will be faster
than the current implementation there.
Started Annex.Import, but untested and it doesn't yet handle tree
grafting.
* Switch to using .git/annex/othertmp for tmp files other than partial
downloads, and make stale files left in that directory when git-annex
is interrupted be cleaned up promptly by subsequent git-annex processes.
* The .git/annex/misctmp directory is no longer used and git-annex will
delete anything lingering in there after it's 1 week old.
Also, in Annex.Ingest, made the filename it uses in the tmp dir be
prefixed with "ingest-" to avoid potentially using a filename used by
some other code.
Not likely to be any speed gain here, but this completes porting every
log file over.
And, it let me get rid of code copied from ghc and modified, so
simplifying the licensing.
This preserves the workaround for the old bug that caused NoUUID items
to be stored in the log, prefixing log lines with " ". It's now handled
implicitly, by using takeWhile1 (/= ' ') to get the uuid.
There is a behavior change from the old parser, which split the value
into words and then recombined it. That meant that "foo bar" and "foo\tbar"
came out as "foo bar". That behavior was not documented, and seems
surprising; it meant that after a git-annex describe here "foo bar",
you wouldn't get that same string back out when git-annex displayed repo
descriptions.
Otoh, some other parsers relied on the old behavior, and the attoparsec
rewrites had to deal with the issue themselves...
For group.log, there are some edge cases around the user providing a
group name with a leading or trailing space. The old parser would ignore
such excess whitespace. The new parser does too, because the alternative
is to refuse to parse something like " group1 group2 " due to excess
whitespace, which would be even more confusing behavior.
The only git-annex branch log file that is not converted to attoparsec
and bytestring-builder now is transitions.log.
Most of the individual logs are not converted yet, only presense logs
have an efficient ByteString Builder implemented so far. The rest
convert to and from String.
Added getStaged, to get the versions of git-annex branch files staged in its
index, and use during transitions so the result of merging sibling branches
is used.
The catFileStop in performTransitionsLocked is absolutely necessary,
without that the bug still occurred, because git cat-file was already
running and was looking at the old index file.
Note that getLocal still has cat-file look at the git-annex branch, not the
index. It might be faster if it looked at the index, but probably only
marginally so, and I've not benchmarked it to see if it's faster at all. I
didn't want to change unrelated behavior as part of this bug fix. And as
the need for catFileStop shows, using the index file has added
complications.
Anyway, it still seems fine for getLocal to look at the git-annex branch,
because normally the index file is updated just before the git-annex branch
is committed, and so they'll contain the same information. It's only during
a transition that the two diverge.
This commit was sponsored by Paul Walmsley in honor of Mark Phillips.
Added annex.commitmessage config that can specify a commit message for the
git-annex branch instead of the usual "update".
This commit was supported by the NSF-funded DataLad project.
This was badly named, it's a not a blob necessarily, but anything that a
tree can refer to.
Also removed the Show instance which was used for serialization to git
format, instead use fmtTreeItemType.
This commit was supported by the NSF-funded DataLad project.
Fix regression in last release that crashes when using --all or running
git-annex in a bare repository. May have also affected git-annex unused and
git-annex info.
Reversed the order of the (++) in Annex.Branch.files so --all will stream
lazily still when there are not a bunch of uncommitted journal files.
Added a todo to maybe improve this later.
This commit was sponsored by Trenton Cronholm on Patreon.
In Annex.Branch.branch, the (++) was killing laziness.
Rewrote so it streams lazily.
filterM also kills laziness, so made loggedKeys use a Unchecked type,
and check if the key is dead in the seek loop.
Note that loggedKeysFor still buffers, so git-annex info <remote> and
git-annex unused --from remote still use more memory than necessary.
Also removed some unused functions from Annex.Journal.
Added annex.merge-annex-branches config setting which can be used to
disable automatic merge of git-annex branches.
I wonder if git-annex merge/sync/assistant should disable this
setting? Not sure yet, so have not done so. May be that users will not set
it in git config, but pass it via -c to commands that need it.
Checking the config setting adds a very small overhead, but it's
only checked once per command so should be insignificant.
This commit was supported by the NSF-funded DataLad project.
Repositories that are upgraded from before that version to this
one will not break, but will just not see the benefit of the mergedrefs log
speeding things up, until one new ref gets merged in.
And for tab completion, by not unnessessarily statting paths to remotes,
which used to cause eg, spin-up of removable drives.
Got rid of the remotes member of Git.Repo. This was a bit painful.
Remote.Git modifies the list of remotes as it reads their configs,
so still need a persistent list of remotes. So, put it in as
Annex.gitremotes. It's only populated by getGitRemotes, so commands
like examinekey that don't care about remotes won't do so.
This commit was sponsored by Jake Vosloo on Patreon.
Fix more places where files in .git/annex/ were written with modes that
did not take the core.sharedRepository config into account.
This commit was sponsored by Jeff Goeke-Smith on Patreon.
Fourth or fifth try at this and finally found a way to make it work.
Absurd amount of busy-work forced on me by change in cabal's behavior.
Split up Utility modules that need posix stuff out of ones used by
Setup. Various other hacks around inability for Setup to use anything
that ifdefs a use of unix.
Probably lost a full day of my life to this.
This is how build systems make their users hate them. Just saying.
Needs ghc 7.6.1, so minimum base version increased slightly. All builds
are well above this version of ghc, and debian oldstable is as well.
Code that could use lambdacase can be found by running:
git grep -B 1 'case ' | less
and searching in less for "<-"
This commit was sponsored by andrea rota.
It was not getting old lines removed, because the tree graft confused
the updater, so it union merged from the previous git-annex branch,
which still contained the old lines. Fixed by carefully using setIndexSha.
This commit was supported by the NSF-funded DataLad project.
So it will be available later and elsewhere, even after GC.
I first though to use git update-index to do this, but feeding it a line
with a tree object seems to always cause it to generate a git subtree
merge. So, fell back to using the Git.Tree interface to maniupulate the
trees, and not involving the git-annex branch index file at all.
This commit was sponsored by Andreas Karlsson.
Removed dependency on MissingH, instead depending on the split
library.
After laying groundwork for this since 2015, it
was mostly straightforward. Added Utility.Tuple and
Utility.Split. Eyeballed System.Path.WildMatch while implementing
the same thing.
Since MissingH's progress meter display was being used, I re-implemented
my own. Bonus: Now progress is displayed for transfers of files of
unknown size.
This commit was sponsored by Shane-o on Patreon.
Any config names can be set using this; git-annex commands will only look
at specific ones that make sense and are worth the overhead of querying the
branch.
This might also be useful for storing whatever other config-type stuff the
user might want to shove into the git-annex branch.
This commit was sponsored by Jochen Bartl on Patreon.
ghc 8 added backtraces on uncaught errors. This is great, but git-annex was
using error in many places for a error message targeted at the user, in
some known problem case. A backtrace only confuses such a message, so omit it.
Notably, commands like git annex drop that failed due to eg, numcopies,
used to use error, so had a backtrace.
This commit was sponsored by Ethan Aubin.
This affected git annex view. It turns out that some other places
that use GIT_INDEX_FILE were already working around the bug. I removed the
workaround from Annex.Branch since the new workaround will do.
This does mean that it has to write out temp files containing updated
objects for the merge. So may use more disk space, and disk IO, but that
should generally win out over needing to launch N separate
git hash-object processes.
Implemented with no additional overhead of compares etc.
This is safe to do for presence logs because of their locality of change;
a given repo's presence logs are only ever changed in that repo, or in a
repo that has just been actively changing the content of that repo.
So, we don't need to worry about a split-brain situation where there'd
be disagreement about the location of a key in a repo. And so, it's ok to
not update the timestamp when that's the only change that would be made
due to logging presence info.
This removes a bit of complexity, and should make things faster
(avoids tokenizing Params string), and probably involve less garbage
collection.
In a few places, it was useful to use Params to avoid needing a list,
but that is easily avoided.
Problems noticed while doing this conversion:
* Some uses of Params "oneword" which was entirely unnecessary
overhead.
* A few places that built up a list of parameters with ++
and then used Params to split it!
Test suite passes.
* init: Repository tuning parameters can now be passed when initializing a
repository for the first time. For details, see
http://git-annex.branchable.com/tuning/
* merge: Refuse to merge changes from a git-annex branch of a repo
that has been tuned in incompatable ways.
This fixes all instances of " \t" in the code base. Most common case
seems to be after a "where" line; probably vim copied the two space layout
of that line.
Done as a background task while listening to episode 2 of the Type Theory
podcast.
Minor because normally only 1 FD is leaked per git-annex run. However,
the test suite leaks a few hundred FDs, and this broke it on the Debian
autobuilders, which seem to have a tigher than usual ulimit.
The leak was introduced by the lazy getDirectoryContents' that was
introduced in e6330988dd in order to scale to
millions of journal files -- if the lazy list was never fully consumed, the
directory handle did not get closed.
Instead, pull in openDirectory/readDirectory/closeDirectory code that I
already developed and submitted in a patch to the haskell directory library
earlier. Using this in journalDirty avoids the place that the lazy list
caused a problem. And using it in stageJournal eliminates the need for
getDirectoryContents'.
The getJournalFiles* functions are switched back to using the regular
strict getDirectoryContents. I'm not sure if those always consume the whole
list, so this avoids any leak. And the things that call those are things
like git annex unused, which also look at every file committed to the
git-annex branch, so would need more work to scale to insane numbers of
files anyway.
Eg after git-annex add has run on 2 million files in one go.
Slightly unhappy with the neeed to use a temp file here, but I cannot see
any other alternative (see comments on the bug report).
This commit was sponsored by Hamish Coleman.
Support users who have set commit.gpgsign, by disabling gpg signatures for
git-annex branch commits and commits made by the assistant.
The thinking here is that a user sets commit.gpgsign intending the commits
that they manually initiate to be gpg signed. But not commits made in the
background, whether by a deamon or implicitly to the git-annex branch.
gpg signing those would be at best a waste of CPU and at worst would fail,
or flood the user with gpg passphrase prompts, or put their signature on
changes they did not directly do.
See Debian bug #753720.
Also makes all commits done by git-annex go through a few central control
points, to make such changes easier in future.
Also disables commit.gpgsign in the test suite.
This commit was sponsored by Antoine Boegli.
There was a tricky bit here, when it does combine, the edit form is shown,
and so the info needs to be committed to the new repository, but then
pulled into the current one. And caches need to be invalidated for it
to be visible in the edit form.
Allow any encoding to be used, as with filenames (but utf8 is the sane
choice). Affects metadata and repository descriptions, and preferred
content expressions.
The question of what's the right encoding for the git-annex branch is a
vexing one. utf-8 would be a nice choice, but this leaves the possibility
of bad data getting into a git-annex branch somehow, and this resulting in
git-annex crashing with encoding errors, which is a failure mode I want to
avoid.
(Also, preferred content expressions can refer to filenames, and filenames
can have any encoding, so limiting to utf-8 would not be ideal.)
The union merge code already took care to not assume any encoding for a
file. Except it assumes that any \n is a literal newline, and not part of
some encoding of a character that happens to contain a newline. (At least
utf-8 avoids using newline for anything except liternal newlines.)
Adapted the git-annex branch code to use this same approach.
Note that there is a potential interop problem with Windows, since
FileSystemEncoding doesn't work there, and instead things are always
decoded as utf-8. If someone uses non-utf8 encoding for data on the
git-annex branch, this can lead to an encoding error on windows. However,
this commit doesn't actually make that any worse, because the union merge
code would similarly fail with an encoding error on windows in that
situation.
This commit was sponsored by Kyle Meyer.
Removed instance, got it all to build using fromRef. (With a few things
that really need to show something using a ref for debugging stubbed out.)
Then added back Read instance, and made Logs.View use it for serialization.
This changes the view log format.
(And a vpop command, which is still a bit buggy.)
Still need to do vadd and vrm, though this also adds their documentation.
Currently not very happy with the view log data serialization. I had to
lose the TDFA regexps temporarily, so I can have Read/Show instances of
View. I expect the view log format will change in some incompatable way
later, probably adding last known refs for the parent branch to View
or something like that.
Anyway, it basically works, although it's a bit slow looking up the
metadata. The actual git branch construction is about as fast as it can be
using the current git plumbing.
This commit was sponsored by Peter Hogg.
The assistant's commit code also always avoids git commit, for simplicity.
Indirect mode sync still does a git commit -a to catch unstaged changes.
Note that this means that direct mode sync no longer runs the pre-commit
hook or any other hooks git commit might call. The git annex pre-commit
hook action for direct mode is however explicitly run. (The assistant
already ran git commit with hooks disabled, so no change there.)