git-annex/doc/todo/proving_preferred_content_behavior.mdwn
Joey Hess 340bdd0dac
treat "not present" in preferred content as invalid
Detect when a preferred content expression contains "not present", which
would lead to repeatedly getting and then dropping files, and make it never
match. This also applies to "not balanced" and "not sizebalanced".

--explain will tell the user when this happens

Note that getMatcher calls matchMrun' and does not check for unstable
negated limits. While there is no --present anyway, if there was,
it would not make sense for --not --present to complain about
instability and fail to match.
2024-09-03 13:50:06 -04:00

[[!toc ]]
## motivating examples
Preferred content expressions can be complicated to write and reason about.
A complex expression can involve lots of repositories that can get into
different states, and needs to be written to avoid unwanted behavior.
It would be very handy to provide some way to prove things about the behavior
of preferred content expressions, or a way to simulate the behavior of a
network of git-annex repositories with a given preferred content configuration.
The worst case of this is `not present`, where the file gets dropped and
transferred over and over again. The docs warn against using that one. But
they can't warn about every bad preferred content expression.
Mostly, git-annex manages to keep things stable that seem like they would
not be. Consider repo A, which is not in group foo, and repo B, which is. A
has preferred content "onlyingroup=foo". This will make A want a file that
is in B. And once it has it, it will not want to drop it. That's because
when dropping, it considers if it would be preferred content after the
drop. In this case it would, so it doesn't drop it.
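
To make the difference between the two cases concrete, here is a toy model in Python. It is not git-annex code: `step`, `history` and the way expressions are written as plain functions are all made up for illustration. It assumes that `present` reflects the actual current state, while location-based terms like `onlyingroup` are evaluated as if the drop had already happened.

```python
# Toy model, not git-annex's implementation. An expression is a function
# taking (present_here, locations) and returning whether the file is wanted.
# locations is the set of repositories the expression should treat as
# containing the file.

GROUPS = {"A": set(), "B": {"foo"}}  # hypothetical setup: only B is in group foo


def not_present(present_here, locations):
    return not present_here


def onlyingroup_foo(present_here, locations):
    # Simplified: some repository has the file, and all that do are in foo.
    return bool(locations) and all("foo" in GROUPS[r] for r in locations)


def step(repo, expr, locations):
    """Run one get/drop pass for repo and return the new set of locations."""
    if repo not in locations:
        if expr(False, locations):
            return locations | {repo}  # get
        return locations
    # Assumption: location-based terms see the state after the drop,
    # while `present` still sees the file as present.
    after_drop = locations - {repo}
    if not expr(True, after_drop):
        return after_drop  # drop
    return locations


def history(repo, expr, locations, passes=4):
    """Record whether repo holds the file after each pass."""
    seen = []
    for _ in range(passes):
        locations = step(repo, expr, locations)
        seen.append(repo in locations)
    return seen
```

Starting with the file only in B, `history("A", not_present, {"B"})` flips between having and not having the file on every pass, while `history("A", onlyingroup_foo, {"B"})` settles as soon as A has the file.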
## balanced preferred content
When [[design/balanced_preferred_content]] is added, a whole new level of
complexity will exist in preferred content expressions, because now an
expression does not make a file be wanted by a single repository, but
shards the files among repositories in a group.
And up until this point preferred content expressions have behaved the same no
matter the sizes of the underlying repositories, but balanced preferred
content does take repository fullness into account, which further
complicates fully understanding the behavior.
Notice that `fullybalanced()` is not stable when used
on its own, and so `balanced()` adds an "or present" to stabilize it.
And so `not balanced()` includes `not present`, which is bad!
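
Spelling out the negation shows where the `not present` comes from. This is a minimal sketch that treats `fullybalanced()` and `present` as plain booleans; the Python function names are only illustrative.

```python
from itertools import product


def balanced(fullybalanced, present):
    # balanced() as described above: fullybalanced() with "or present" added.
    return fullybalanced or present


def not_balanced_expanded(fullybalanced, present):
    # De Morgan: not (a or b) == (not a) and (not b)
    return (not fullybalanced) and (not present)


# Check that the two forms agree on all four combinations of inputs.
agree = all(
    (not balanced(f, p)) == not_balanced_expanded(f, p)
    for f, p in product([False, True], repeat=2)
)
```

So `not balanced()` can never match a file that is present, and it does match an absent file whenever `fullybalanced()` does not. That is the same get-then-drop loop as a bare `not present`.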
## proof
What could be proved about a preferred content expression?
No idea really. Would be interesting to consider what formal methods can
do here. Could a SAT solver be used somehow for example?
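
One small property that could perhaps be checked mechanically: an expression should never want a file while it is absent and then stop wanting it once it is present. The sketch below brute-forces that over every combination of the other terms. It assumes the other terms are independent booleans that do not change when the file arrives, which is not true of location-based terms, so treat it as a starting point only. The function names and the `other` term are hypothetical. For big expressions a SAT solver could stand in for the enumeration.

```python
from itertools import product


def get_then_drop_cases(expr, atoms):
    """Return the assignments of `atoms` for which expr wants an absent
    file but no longer wants it once it is present."""
    bad = []
    for values in product([False, True], repeat=len(atoms)):
        env = dict(zip(atoms, values))
        wanted_absent = expr(present=False, **env)
        wanted_present = expr(present=True, **env)
        if wanted_absent and not wanted_present:
            bad.append(env)
    return bad


def other_or_present(present, other):
    # Shaped like "something or present".
    return other or present


def other_and_not_present(present, other):
    # Shaped like "something and not present".
    return other and not present
```

Here `get_then_drop_cases(other_or_present, ["other"])` comes back empty, while `get_then_drop_cases(other_and_not_present, ["other"])` reports the case where `other` is true.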
## static analysis
Clearly `not present` is a problematic preferred content expression. It
would be good if git-annex warned and/or refused to set such an expression
if it could detect it. Similarly `not groupwanted` could be detected as a
problem when the group's preferred content expression contains `present`.
> This is now detected and such an unstable expression never matches.
> --explain explains why too.
>
> Note that the detection will not be triggered by `"not (not present)"`,
> but it will by `"include=* or (not present)"` even though that is always
> stable, because `"include=*"` always matches and so what it's ORed with
> doesn't matter. Probably no one will set something like that in real life
> though.
>
> It's problematic to make `git-annex wanted` warn about it. Consider
> if in one repository, groupwanted is set to "present". In another
> repository, which is disconnected, wanted is set to "not groupwanted".
> Both operations are ok, but upon merging the two repositories,
> the combined effect is that "not present" has been set.
>
> So while it could warn sometimes on setting "not present",
> it would sometimes not be able to. Better to not warn inconsistently.
> --[[Joey]]
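
A purely syntactic check with the behavior described above could look something like this. It is a sketch in Python, not how git-annex implements it: the tiny expression tree and `has_unstable_negation` are invented here. It flags `present`, `balanced` or `sizebalanced` when they sit under an odd number of negations, and does not try to work out whether the rest of the expression makes that harmless.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Atom:
    name: str


@dataclass(frozen=True)
class Not:
    inner: object


@dataclass(frozen=True)
class And:
    left: object
    right: object


@dataclass(frozen=True)
class Or:
    left: object
    right: object


# Terms that are unstable when negated.
UNSTABLE_WHEN_NEGATED = {"present", "balanced", "sizebalanced"}


def has_unstable_negation(expr, negated=False):
    """True if a term from UNSTABLE_WHEN_NEGATED sits under an odd
    number of Not nodes. Purely syntactic, so it can flag expressions
    that are in fact stable."""
    if isinstance(expr, Atom):
        return negated and expr.name in UNSTABLE_WHEN_NEGATED
    if isinstance(expr, Not):
        return has_unstable_negation(expr.inner, not negated)
    # And / Or: either branch may contain the problem.
    return (has_unstable_negation(expr.left, negated)
            or has_unstable_negation(expr.right, negated))
```

With this, `Not(Atom("present"))` is flagged, `Not(Not(Atom("present")))` is not, and `Or(Atom("include=*"), Not(Atom("present")))` is flagged even though it is always stable.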
## simulation
Simulation seems fairly straightforward: just simulate the network of
git-annex repositories with random files of different sizes and
metadata. Or use the current files and metadata.
Be sure to enforce invariants like numcopies the same as git-annex does.
Since users can write preferred content expressions, this should be
targeted at being used by end users.
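
As a rough idea of the shape such a simulator could take, here is a small Python sketch. Everything in it is hypothetical (`random_files`, `simulate`, expressions as plain functions), and it only models file sizes and numcopies, not metadata, groups or transfer costs. It assumes `present` sees the real state while the set of locations passed to the expression on drop is the state after the drop.

```python
import random


def random_files(count, seed=0):
    """Make `count` files with random sizes in bytes (reproducible via seed)."""
    rng = random.Random(seed)
    return {f"file{i}": rng.randint(1, 10**6) for i in range(count)}


def simulate(wanted, sizes, start, numcopies=1, max_passes=20):
    """Run get/drop passes until nothing changes.

    wanted: repo name -> function(size, present, locations) -> bool
    sizes: file name -> size
    start: repo that holds every file initially
    Returns (stable, locations), where locations maps file -> set of repos.
    """
    locations = {f: {start} for f in sizes}
    for _ in range(max_passes):
        changed = False
        for repo, expr in wanted.items():
            for f, here in locations.items():
                if repo not in here:
                    # Getting needs some other repo to copy from.
                    if here and expr(sizes[f], False, here):
                        here.add(repo)
                        changed = True
                else:
                    rest = here - {repo}
                    # Never drop below numcopies, whatever the expression says.
                    if len(rest) >= numcopies and not expr(sizes[f], True, rest):
                        here.discard(repo)
                        changed = True
        if not changed:
            return True, locations
    return False, locations
```

A configuration that settles returns `stable` as true within a couple of passes. One containing something like `not present` keeps changing until `max_passes` runs out and returns false, which is the kind of thing an end user would want to be told about.
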
[[!tag projects/openneuro]]