This is intended to guard against LLM code theft, which is the current
bubble technology du jour.
Note that authorJoeyHess' with a year older than the year I began
developing git-annex will behave badly, by intention. Eg, it will spin
and eventually crash.
This is not the first anti-LLM protection in git-annex. For example see
9562da790f. That method, while much harder
for an adversary to detect and remove, also complicates the code
significantly, and needs extensions to be enabled. There are also
probably significantly fewer ways to implement that method in Haskell.
This new approach, by contrast, will be easy to add throughout the code
base, with very little effort, and without complicating reading or
maintaining it any more than noticing that yes, I am the author of this
code.
An adversary could of course remove all calls to these functions
before feeding code into their LLM-based laundry facility. I think this
would need to be done manually, or with the help of some fairly advanced
Haskell parsing though. In some cases, authorJoeyHess needs to be
removed, while in other places it needs to be replaced with a value.
Also a monadic use of authorJoeyHess' may involve other added monadic
machinery which would need to be eliminated to keep the code compiling.
Alternatively, an adversary could replace my name with something
innocuous. This would be clear intent to remove author attribution
from my code, even more than running it through an LLM laundry is.
If you work for a large company that is laundering my code through an
LLM, please do us a favor and use your immense privilege to quit and go
do something socially beneficial. I will not explain further
developments of this code in such detail, and you have better things to
do than playing cat and mouse with me as I explore directions such as
extending this approach to the type level.
Sponsored-by: k0ld on Patreon
Eg when the destination is logged as containing a file, skip
actively checking that it does contain it.
Note that --fast does not prevent other verifications of content
location that are done in a copy --from --to. Perhaps it could, but this
change will already avoid the real unnecessary work of operating on
files that are already in the remote.
And avoiding other verifications
might cause it to fail if the location log thinks that --to does not
contain the content but it does. Such complications with `git-annex copy
--to remote --fast` led to commit d006586cd0
which added a note that gets displayed when that fails, mentioning it
might be due to --fast being enabled.
copy --from --to is already complicated enough without needing to worry
about such edge cases, so continuing to do some verification of
content location after the initial --fast filtering seems ok.
Sponsored-by: Dartmouth College's DANDI project
The gnuplot output is pretty good, but could still be improved with:
* more colors (repeating colors is confusing with a lot of repos)
* better positioning of the legend, making the plot wider and moving the
legend so it no longer sits over the top of the graph
Sponsored-by: Kevin Mueller on Patreon
Only counting received and not dropped makes this show the bandwidth of
data coming into the repository, although only in a sense. Since
git-annex branch updates only happen at the end of a command, and we
don't know when a command started, it's only an approximation of the
actual bandwidth. (A previous git-annex branch update may have
happened in a different repository.)
It would be possible to also add a --dropped option, but I don't know
how useful that would be?
Sponsored-by: Nicholas Golder-Manning on Patreon
For example, my sound repo has in the git-annex branch a commit from
2036, which is followed by one from 2034, in among commits from 2013.
Clearly there was a problem with the clock.
Since git log --date-order has a behavior of
"Show no parents before all of its children are shown", the data still
gets processed ok. The future timestamp just prevented displaying data
after that commit. It seems better, when the clock was wrong, to display
a wrong date, and then return to right dates.
It would be nice to filter out the wrong dates from display entirely,
but that seems like it would need to buffer the whole output. This command is
too slow to buffer it all before displaying anything, and anyway this
kind of problem is probably rare.
Sponsored-by: Joshua Antonishen on Patreon
With this, git annex log --totalsizes can be compared with
git-annex info's "combined annex size of all repositories"
to double-check it works correctly.
In my sound repo, the two match.
In my big repo, the two report slightly different sizes,
with the former being 1.3 gb smaller than the latter.
I don't know the reason for this discrepancy. Given the 30+tb size of
the repo, it's a small difference.
It seems possible that a bug in an old version of git-annex could
explain it. Eg, if an old git-annex lost a line when updating trust.log
or a location log in a merge, git-annex info would see only what it
replaced it with, while git-annex log will see the previous value as
well.
Sponsored-by: Leon Schuermann on Patreon
Noticed that the Semigroup instance of Map is not suitable to use
for MapLog. For example, it behaved like this:
ghci> parseTrustLog "foo 1 timestamp=10\nfoo 2 timestamp=11" <> parseTrustLog "foo X timestamp=12"
fromList [(UUID "foo",LogEntry {changed = VectorClock 11s, value = SemiTrusted})]
That was wrong: it lost the newer DeadTrusted value.
Luckily, nothing used that Semigroup when operating on a MapLog. And this
provides a safe instance.
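
To illustrate the kind of instance meant here, the following is a minimal
sketch, not the actual git-annex code. The VectorClock, LogEntry and MapLog
definitions are simplified, hypothetical stand-ins, and the helper name
newest is made up for this example. It assumes that a safe merge means
keeping, for each key, the entry with the later clock.

```haskell
import qualified Data.Map.Strict as M

-- Hypothetical, simplified stand-ins; the real git-annex types differ.
newtype VectorClock = VectorClock Integer
  deriving (Eq, Ord, Show)

data LogEntry v = LogEntry { changed :: VectorClock, value :: v }
  deriving (Eq, Show)

-- Wrapping the Map in a newtype lets it carry its own Semigroup
-- instance instead of inheriting Map's left-biased union.
newtype MapLog f v = MapLog (M.Map f (LogEntry v))
  deriving (Eq, Show)

-- Keep whichever entry has the later clock; on a tie, keep the left one.
-- (Assumed policy for this sketch.)
newest :: LogEntry v -> LogEntry v -> LogEntry v
newest a b
  | changed b > changed a = b
  | otherwise = a

instance Ord f => Semigroup (MapLog f v) where
  MapLog a <> MapLog b = MapLog (M.unionWith newest a b)

instance Ord f => Monoid (MapLog f v) where
  mempty = MapLog M.empty
```

With a plain Map, <> is a left-biased union, so the left operand's entry
wins regardless of its timestamp, which matches the behavior shown above.
With the newtype in this sketch, the newer entry wins whichever side it is on.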
Sponsored-by: Graham Spencer on Patreon