This adds a separate journal, which does not currently get committed to
an index, but is planned to be committed to .git/annex/index-private.
Changes that are regarding a UUID that is private will get written to
this journal, and so will not be published into the git-annex branch.
All log writing should have been made to indicate the UUID it's
regarding, though I've not verified this yet.
Currently, no UUIDs are treated as private yet, a way to configure that
is needed.
The implementation is careful to not add any additional IO work when
privateUUIDsKnown is False. It will skip looking at the private journal
at all. So this should be free, or nearly so, unless the feature is
used. When it is used, all branch reads will be about twice as expensive.
It is very lucky -- or very prudent design -- that Annex.Branch.change
and maybeChange are the only ways to change a file on the branch,
and Annex.Branch.set is only internal use. That let Annex.Branch.get
always yield any private information that has been recorded, without
the risk that Annex.Branch.set might be called, with a non-private UUID,
and end up leaking the private information into the git-annex branch.
And, this relies on the way git-annex union merges the git-annex branch.
When reading a file, there can be a public and a private version, and
they are just concacenated together. That will be handled the same as if
there were two diverged git-annex branches that got union merged.
Especially from borg, where the content identifier logs
all end up being the same identical file!
But also, for other imports, the location tracking logs can,
in some cases, be identical files.
Bonus optimisation: Avoid looking up (and parsing when set)
GIT_ANNEX_VECTOR_CLOCK env var every time a log is written to.
Although the lookup does happen at startup even when no
log will be written now.
Finally builds (oh the agoncy of making it build), but still very
unmergable, only Command.Find is included and lots of stuff is badly
hacked to make it compile.
Benchmarking vs master, this git-annex find is significantly faster!
Specifically:
num files old new speedup
48500 4.77 3.73 28%
12500 1.36 1.02 66%
20 0.075 0.074 0% (so startup time is unchanged)
That's without really finishing the optimization. Things still to do:
* Eliminate all the fromRawFilePath, toRawFilePath, encodeBS,
decodeBS conversions.
* Use versions of IO actions like getFileStatus that take a RawFilePath.
* Eliminate some Data.ByteString.Lazy.toStrict, which is a slow copy.
* Use ByteString for parsing git config to speed up startup.
It's likely several of those will speed up git-annex find further.
And other commands will certianly benefit even more.
This does not change the overall license of the git-annex program, which
was already AGPL due to a number of sources files being AGPL already.
Legally speaking, I'm adding a new license under which these files are
now available; I already released their current contents under the GPL
license. Now they're dual licensed GPL and AGPL. However, I intend
for all my future changes to these files to only be released under the
AGPL license, and I won't be tracking the dual licensing status, so I'm
simply changing the license statement to say it's AGPL.
(In some cases, others wrote parts of the code of a file and released it
under the GPL; but in all cases I have contributed a significant portion
of the code in each file and it's that code that is getting the AGPL
license; the GPL license of other contributors allows combining with
AGPL code.)
This preserves the workaround for the old bug that caused NoUUID items
to be stored in the log, prefixing log lines with " ". It's now handled
implicitly, by using takeWhile1 (/= ' ') to get the uuid.
There is a behavior change from the old parser, which split the value
into words and then recombined it. That meant that "foo bar" and "foo\tbar"
came out as "foo bar". That behavior was not documented, and seems
surprising; it meant that after a git-annex describe here "foo bar",
you wouldn't get that same string back out when git-annex displayed repo
descriptions.
Otoh, some other parsers relied on the old behavior, and the attoparsec
rewrites had to deal with the issue themselves...
For group.log, there are some edge cases around the user providing a
group name with a leading or trailing space. The old parser would ignore
such excess whitespace. The new parser does too, because the alternative
is to refuse to parse something like " group1 group2 " due to excess
whitespace, which would be even more confusing behavior.
The only git-annex branch log file that is not converted to attoparsec
and bytestring-builder now is transitions.log.
There should be some speed gains here, especially for chunk and remote
state logs, which are queried once per key.
Now only old-format uuid-based logs still need to be converted to attoparsec.
Mostly didn't push the ByteStrings down very deep, but all of these log
files are not written to frequently at all, so slight remaining
innefficiency doesn't matter.
In Logs.UUID, removed the fixBadUUID code that cleaned up after a bug in
git-annex versions 3.20111105-3.20111110. In the unlikely event that a repo was
last touched by that ancient git-annex version, the descriptions of remotes
would appear missing when used with this version of git-annex. That is such minor
breakage, and so unlikely to still be a problem for any repos, that it was not
worth forward-porting that code to ByteString.
Probably not any particular speedup in this, since most of these logs
are not written to often. Possibly chunk log writing is sped up, but
writes to chunk logs are interleaved with expensive data transfers to
remotes, so unlikely to be a noticiable speedup.
Most of the individual logs are not converted yet, only presense logs
have an efficient ByteString Builder implemented so far. The rest
convert to and from String.
Can be used to override the default timestamps used in log files in the
git-annex branch. This is a dangerous environment variable; use with
caution.
Note that this only affects writing to the logs on the git-annex branch.
It is not used for metadata in git commits (other env vars can be set for
that).
There are many other places where timestamps are still used, that don't
get committed to git, but do touch disk. Including regular timestamps
of files, and timestamps embedded in some files in .git/annex/, including
the last fsck timestamp and timestamps in transfer log files.
A good way to find such things in git-annex is to get for getPOSIXTime and
getCurrentTime, although some of the results are of course false positives
that never hit disk (unless git-annex gets swapped out..)
So this commit does NOT necessarily make git-annex comply with some HIPPA
privacy regulations; it's up to the user to determine if they can use it in
a way compliant with such regulations.
Benchmarking: It takes 0.00114 milliseconds to call getEnv
"GIT_ANNEX_VECTOR_CLOCK" when that env var is not set. So, 100 thousand log
files can be written with an added overhead of only 0.114 seconds. That
should be by far swamped by the actual overhead of writing the log files
and making the commit containing them.
This commit was supported by the NSF-funded DataLad project.
This includes checking when dropping files that any required content
configuration is satisfied. However, it does not yet include an active
check on the required content; the location log is trusted when checking
the required content expression.
This was unexpectedly difficult because of a depdenency cycle. To parse a
preferred content expression involves several things that need to operate
on the list of remotes. Which needs Remote.External. The only way to avoid
this cycle (I tried breaking it at several points) was to skip parsing the
expression in SETWANTED.
That's sorta ok, because git-annex already has to deal with unparsable
preferred content expressions being stored, in order to handle eg,
upgrades. But I'm still not very happy that I cannot check it.
I feel this is a strong indication that I need to beware of further
bloating the special remote protocol interface.