git-annex/doc/todo/proving_preferred_content_behavior.mdwn

Preferred content expressions can be complicated to write and reason about.
A complex expression can involve lots of repositories that can get into
different states, and needs to be written to avoid unwanted behavior.

It would be very handy to provide some way to prove things about behavior
of preferred content expressions, or a way to simulate the behavior of a
network of git-annex repositories with a given preferred content configuration

## motivating examples

For example, consider two reposities A and B. A is in group M and B is in
group N. A has preferred content `not inallgroup=N` and B has `not inallgroup=M`.

If A contains a file, then B will want to also get a copy. And things
stabilize there. But if the file is removed from A, then B also wants to
remove it. And once B has removed it, A wants a copy of it. And then B also
wants a copy of it. So the result is that the file got transferred twice,
to end up right back where we started.

The worst case of this is `not present`, where the file gets dropped and
transferred over and over again. The docs warn against using that one. But
they can't warn about every bad preferred content expression.

## balanced preferred content

When [[design/balanced_preferred_content]] is added, a whole new level of
complexity will exist in preferred content expressions, because now an
expression does not make a file be wanted by a single repository, but
shards the files amoung repositories in a group.

And up until this point preferred content expressions have behaved the same no
matter the sizes of the underlying repositories, but balanced preferred
content does take repository fullness into account, which further
complicates fully understanding the behavior.

Notice that `balanced()` (in the current design) is not stable when used
on its own, and has to be used as part of a larger expression to make it
stable, eg:

    ((balanced(backup) and not (copies=backup:1)) or present

So perhaps `balanced()` should include the other checks in it,
to avoid the user shooting themselves in the foot. On the other
hand, if `balanced()` implicitly contains `present`, then `not balanced()`
would include `not present`, which is bad!

(For that matter, what does `not balanced()` even do currently?)

## proof

What could be proved about a preferred content expression?

No idea really. Would be interesting to consider what formal methods can
do here. Could a SAT solver be used somehow for example?

## static analysis

Clearly `not present` is an problematic preferred content expression. It
would be good if git-annex warned and/or refused to set such an expression
if it could detect it. Similarly `not groupwanted` could be detected as a
problem when the group's preferred content expression contains `present`.

Is there is a more general purpose and not expensive way to detect such
problematic expressions, that can find problems such as the
`not inallgroup=N` example above?

## simulation

Simulation seems fairly straightforward, just simulate the network of
git-annex repositories with random files with different sizes and
metadata. Be sure to enforce invariants like numcopies the same as
git-annex does.

Since users can write preferred content expressions, this should be
targeted at being used by end users.

[[!tag projects/openneuro]]