My ByteString rewrite oversimplified it, resulting in any _ in a journal
file turning into a / in the git-annex branch, which was often the wrong
filename, or sometimes (//) an invalid filename that git
refused to add.
git-annex find is now RawFilePath end to end, no string conversions.
So is git-annex get when it does not need to get anything.
So this is a major milestone on optimisation.
Benchmarks indicate around 30% speedup in both commands.
Probably many other performance improvements. All or nearly all places
where a file is statted use RawFilePath now.
Finally builds (oh the agoncy of making it build), but still very
unmergable, only Command.Find is included and lots of stuff is badly
hacked to make it compile.
Benchmarking vs master, this git-annex find is significantly faster!
Specifically:
num files old new speedup
48500 4.77 3.73 28%
12500 1.36 1.02 66%
20 0.075 0.074 0% (so startup time is unchanged)
That's without really finishing the optimization. Things still to do:
* Eliminate all the fromRawFilePath, toRawFilePath, encodeBS,
decodeBS conversions.
* Use versions of IO actions like getFileStatus that take a RawFilePath.
* Eliminate some Data.ByteString.Lazy.toStrict, which is a slow copy.
* Use ByteString for parsing git config to speed up startup.
It's likely several of those will speed up git-annex find further.
And other commands will certianly benefit even more.
This does not change the overall license of the git-annex program, which
was already AGPL due to a number of sources files being AGPL already.
Legally speaking, I'm adding a new license under which these files are
now available; I already released their current contents under the GPL
license. Now they're dual licensed GPL and AGPL. However, I intend
for all my future changes to these files to only be released under the
AGPL license, and I won't be tracking the dual licensing status, so I'm
simply changing the license statement to say it's AGPL.
(In some cases, others wrote parts of the code of a file and released it
under the GPL; but in all cases I have contributed a significant portion
of the code in each file and it's that code that is getting the AGPL
license; the GPL license of other contributors allows combining with
AGPL code.)
* Switch to using .git/annex/othertmp for tmp files other than partial
downloads, and make stale files left in that directory when git-annex
is interrupted be cleaned up promptly by subsequent git-annex processes.
* The .git/annex/misctmp directory is no longer used and git-annex will
delete anything lingering in there after it's 1 week old.
Also, in Annex.Ingest, made the filename it uses in the tmp dir be
prefixed with "ingest-" to avoid potentially using a filename used by
some other code.
Most of the individual logs are not converted yet, only presense logs
have an efficient ByteString Builder implemented so far. The rest
convert to and from String.
Fix regression in last release that crashes when using --all or running
git-annex in a bare repository. May have also affected git-annex unused and
git-annex info.
Reversed the order of the (++) in Annex.Branch.files so --all will stream
lazily still when there are not a bunch of uncommitted journal files.
Added a todo to maybe improve this later.
This commit was sponsored by Trenton Cronholm on Patreon.
In Annex.Branch.branch, the (++) was killing laziness.
Rewrote so it streams lazily.
filterM also kills laziness, so made loggedKeys use a Unchecked type,
and check if the key is dead in the seek loop.
Note that loggedKeysFor still buffers, so git-annex info <remote> and
git-annex unused --from remote still use more memory than necessary.
Also removed some unused functions from Annex.Journal.
Fourth or fifth try at this and finally found a way to make it work.
Absurd amount of busy-work forced on me by change in cabal's behavior.
Split up Utility modules that need posix stuff out of ones used by
Setup. Various other hacks around inability for Setup to use anything
that ifdefs a use of unix.
Probably lost a full day of my life to this.
This is how build systems make their users hate them. Just saying.
Removed old extensible-exceptions, only needed for very old ghc.
Made webdav use Utility.Exception, to work after some changes in DAV's
exception handling.
Removed Annex.Exception. Mostly this was trivial, but note that
tryAnnex is replaced with tryNonAsync and catchAnnex replaced with
catchNonAsync. In theory that could be a behavior change, since the former
caught all exceptions, and the latter don't catch async exceptions.
However, in practice, nothing in the Annex monad uses async exceptions.
Grepping for throwTo and killThread only find stuff in the assistant,
which does not seem related.
Command.Add.undo is changed to accept a SomeException, and things
that use it for rollback now catch non-async exceptions, rather than
only IOExceptions.
Minor because normally only 1 FD is leaked per git-annex run. However,
the test suite leaks a few hundred FDs, and this broke it on the Debian
autobuilders, which seem to have a tigher than usual ulimit.
The leak was introduced by the lazy getDirectoryContents' that was
introduced in e6330988dd in order to scale to
millions of journal files -- if the lazy list was never fully consumed, the
directory handle did not get closed.
Instead, pull in openDirectory/readDirectory/closeDirectory code that I
already developed and submitted in a patch to the haskell directory library
earlier. Using this in journalDirty avoids the place that the lazy list
caused a problem. And using it in stageJournal eliminates the need for
getDirectoryContents'.
The getJournalFiles* functions are switched back to using the regular
strict getDirectoryContents. I'm not sure if those always consume the whole
list, so this avoids any leak. And the things that call those are things
like git annex unused, which also look at every file committed to the
git-annex branch, so would need more work to scale to insane numbers of
files anyway.
Allow any encoding to be used, as with filenames (but utf8 is the sane
choice). Affects metadata and repository descriptions, and preferred
content expressions.
The question of what's the right encoding for the git-annex branch is a
vexing one. utf-8 would be a nice choice, but this leaves the possibility
of bad data getting into a git-annex branch somehow, and this resulting in
git-annex crashing with encoding errors, which is a failure mode I want to
avoid.
(Also, preferred content expressions can refer to filenames, and filenames
can have any encoding, so limiting to utf-8 would not be ideal.)
The union merge code already took care to not assume any encoding for a
file. Except it assumes that any \n is a literal newline, and not part of
some encoding of a character that happens to contain a newline. (At least
utf-8 avoids using newline for anything except liternal newlines.)
Adapted the git-annex branch code to use this same approach.
Note that there is a potential interop problem with Windows, since
FileSystemEncoding doesn't work there, and instead things are always
decoded as utf-8. If someone uses non-utf8 encoding for data on the
git-annex branch, this can lead to an encoding error on windows. However,
this commit doesn't actually make that any worse, because the union merge
code would similarly fail with an encoding error on windows in that
situation.
This commit was sponsored by Kyle Meyer.
This allows eg, putting .git/annex/tmp on a ram disk, if the disk IO
of temp object files is too annoying (and if you don't want to keep
partially transferred objects across reboots).
.git/annex/misctmp must be on the same filesystem as the git work tree,
since files are moved to there in a way that will not work cross-device,
as well as symlinked into there.
I first wanted to put the tmp objects in .git/annex/objects/tmp, but
that would pose transition problems on upgrade when partially transferred
objects existed.
git annex info does not currently show the size of .git/annex/misctemp,
since it should stay small. It would also be ok to make something clean it
out, periodically.
My implementation does not guard against double locking of the journal. But
it does ensure that the journal is always locked when operated on, by using
a type that is only produced by lockJournal, and which is required as a
parameter of all functions that operate on the journal.
Note that I had to add the fooStale functions for cases where it does not
make sense to lock the journal when querying it. I was more concerned about
ensuring that anything that modifies the journal is locked.
setJournalFile's implementation ensures that any query of the journal will
get one value or the other atomically, even if the journal is being changed
at the time.
Made fromDirect check that a file in the tree has good content (and is not
a broken symlink either) before copying it to another file that has the
same key.
Made replaceFile clean up the temp file if the action that creates it, or
the file replacement action fails.
This cache prevented noticing changes made by another process.
The case I just ran into involved the assistant dropping a file, which
cached its presence info. Then the same file was downloaded again,
but the assistant didn't know its presence info had changed.
I don't see a way to keep this cache. Will instead rely on the OS level
file cache, for files in the journal. May need to add more higher-level
caching of info that it's ok to have a potentially stale copy of,
although much of git-annex already does so.