When git-annex transferrer started up, and the journal contained something,
it would commit it to the git-annex branch. This caused excess commits to
the branch, in cases where normally several changes would be journalled and
committed together. That generated some excess git objects and was also
just noisy on stdout.
Since transferrer uses enableInteractiveBranchAccess, it does not need to
commit journalled changes, since the optimisation that avoids checking
the journal when reading from the branch is disabled for processes that
call that.
This commit was sponsored by Svenne Krap on Patreon.
My worry was that a preferred content expression that matches on metadata
would have removed the location log from cache, causing an expensive
re-read when a Seek action later checked the location log.
Especially when the --all optimisation in the previous commit
pre-cached the location log.
This also means that the --all optimisation could cache the metadata log
too, if it wanted too, but not currently done.
The cache is a list, with the most recently accessed file first. That
optimises it for the common case of reading the same file twice, eg a
get, examine, followed by set reads it twice. And sync --content reads the
location log 3 times in a row commonly.
But, as a list, it should not be made to be too long. I thought about
expanding it to 5 items, but that seemed unlikely to be a win commonly
enough to outweigh the extra time spent checking the cache.
Clearly there could be some further benchmarking and tuning here.
The cache was removed way back in 2012,
commit 3417c55189
Then I forgot I had removed it! I remember clearly multiple times when I
thought, "this reads the same data twice, but the cache will avoid that
being very expensive".
The reason it was removed was it messed up the assistant noticing when
other processes made changes. That same kind of problem has recently
been addressed when adding the optimisation to avoid reading the journal
unnecessarily.
Indeed, enableInteractiveJournalAccess is run in just the
right places, so can just piggyback on it to know when it's not safe
to use the cache.
The journal read optimisation in aeca7c220 later got fixed in eedd73b84
to stage and commit any files that were left in the journal by a
previous git-annex run. That's necessary for the optimisation to work
correctly. But it also meant that alwayscommit=false started committing
the previous git-annex processes journalled changes, which defeated the
purpose of the config setting entirely.
So, disable the optimisation when alwayscommit=false, leaving the
files in the journal and not committing them. See my comments on the bug
report for why this seemed the best approach.
Also fixes a problem when annex.merge-annex-branches=false and there
are changes in the journal. That config indirectly prevents committing
the journal. (Which seems a bit odd given its name, but it always has..)
So, when there were changes in the journal, perhaps left there due to
alwayscommit=false being set before, the optimisation would prevent
git-annex from reading the journal files, and it would operate with out
of date information.
The only price paid is one additional MVar read per write to the journal.
Presumably writing a journal file dominiates over a MVar read time by
several orders of magnitude.
--batch does not get the speedup because then it needs to notice when
another process has made a change. Also made the assistant and other damon
modes bypass the optimisation, which would not help them anyway.
This does not change the overall license of the git-annex program, which
was already AGPL due to a number of sources files being AGPL already.
Legally speaking, I'm adding a new license under which these files are
now available; I already released their current contents under the GPL
license. Now they're dual licensed GPL and AGPL. However, I intend
for all my future changes to these files to only be released under the
AGPL license, and I won't be tracking the dual licensing status, so I'm
simply changing the license statement to say it's AGPL.
(In some cases, others wrote parts of the code of a file and released it
under the GPL; but in all cases I have contributed a significant portion
of the code in each file and it's that code that is getting the AGPL
license; the GPL license of other contributors allows combining with
AGPL code.)
This cache prevented noticing changes made by another process.
The case I just ran into involved the assistant dropping a file, which
cached its presence info. Then the same file was downloaded again,
but the assistant didn't know its presence info had changed.
I don't see a way to keep this cache. Will instead rely on the OS level
file cache, for files in the journal. May need to add more higher-level
caching of info that it's ok to have a potentially stale copy of,
although much of git-annex already does so.