git-annex/doc/bugs/concurrent_drop--from_presence_checking_failures.mdwn

Concurrent dropping of a file has problems when drop --from is
used. (Also when the assistant or sync --content decided to drop from a
remote.)

[[!toc]]

# refresher

First, let's remember how it works in the case where we're just dropping
from 2 repos concurrently. git-annex uses locking to detect and prevent
data loss:

<pre>
Two repos, each with a file:

A (has)
B (has)

A wants from drop from A         B wants to drop from B
A locks it                       B locks it
A checks if B has it             B checks if A has it
  (does, but locked, so fails)     (does, but locked, so fails)
A fails to drop it               B fails to drop it

The two processes are racing, so there are other orderings to
consider, for example:

A wants from drop from A        B wants to drop from B
A locks it
A checks if B has it (succeeds)
A drops it from A		B locks it
                                B checks if A has it (fails)
				B fails to drop it

Which is also ok.

A wants from drop from A        B wants to drop from B
A locks it
A checks if B has it (succeeds)
                                B locks it
                                B checks if A has it
				  (does, but locked, so fails)
A drops it                      B fails to drop it

Yay, still ok.
</pre>

Locking works in those cases to prevent concurrent dropping of a file.

# the bug

But, when drop --from is used, the locking doesn't work:

<pre>
Two repos, each with a file:

A (has)
B (has)

A wants to drop from B                  B wants to drop from A
A checks to see if A has it (succeeds)  B checks to see if B has it (succeeds)
A tells B to drop it                    B tells A to drop it
B locks it, drops it                    A locks it, drops it

No more copies remain!
</pre>

Verified this one in the wild (adding an appropriate sleep to force the
race).

Best fix here seems to be for A to lock the content on A
as part of its check of numcopies, and keep it locked
while it's asking B to drop it. Then when B tells A to drop it,
it'll be locked and that'll fail (and vice-versa).

# the bug part 2

<pre>
Three repos; C might be a special remote, so w/o its own locking:

A       C (has)
B (has)

A wants to drop from C         B wants to drop from B
                               B locks it
A checks if B has it           B checks if C has it (does)
 (does, but locked, so fails)  B drops it

Copy remains in C. But, what if the race goes the other way?

A wants to drop from C          B wants to drop from B
A checks if B has it (succeeds)
A drops it from C               B locks it
                                B checks if C has it (does not)

So ok, but then:

A wants to drop from C          B wants to drop from B
A checks if B has it (succeeds)
                                B locks it
                                B checks if C has it (does)
A drops it from C               B drops it from B

No more copies remain!
</pre>

To fix this, seems that A should not just check if B has it, but lock
the content on B and keep it locked while A is dropping from C.
This would prevent B dropping the content from itself while A is in the
process of dropping from C.

That would mean replacing the call to `git-annex-shell inannex`
with a new command that locks the content.

Note that this is analgous to the fix above; in both cases
the change is from checking if content is in a location, to locking it in
that location while performing a drop from another location.

# the bug part 3 (where it gets really nasty)

<pre>
4 repos; C and D might be special remotes, so w/o their own locking:

A      C (has)
B      D (has)

B wants to drop from C        A wants to drop from D
B checks if D has it (does)   A checks if C has it (does)
B drops from C                A drops from D

No more copies remain!
</pre>

How do we get locking in this case?

Adding locking to C and D is not a general option, because special remotes
are dumb key/value stores; they may have no locking operations.

## a solution: require locking

What could be done is, change from checking if the remote has content, to
trying to lock it there. If the remote doesn't support locking, it can't
be guaranteed to have a copy. Require N locked copies for a drop to
succeed.

So, drop --from would no longer be supported in these configurations.
To drop the content from C, B would have to --force the drop, or move the
content from C to B, and then drop it from B.

### impact when using assistant/sync --content

Need to consider whether this might cause currently working topologies
with the assistant/sync --content to no longer work. Eg, might content
pile up in a transfer remote?

> The assistant checks after any transfer of an object if it should drop
> it from anywhere. So, it gets/puts, and later drops.
> Similarly, for sync --content, it first gets, then puts, and finally drops.

> When dropping an object from remotes(s) + local, in `handleDropsFrom`,
> it drops from local first. So, this would cause content pile-up unless
> changed.
>
> Also, when numcopies > 1, a toplogy like
> `A(transfer) -- B(client) -- specials(backup)` would never be able to drop
> the file from A, because the specials don't support locking and it can't
> guarantee the content will remain on them.
>
> One solution might be to make sync --content/the assistant generate
> move operations, which can then ignore numcopies (like `move` does).
> So, move from A to B and then copy to the specials.
>
> Using moves does lead to a decrease in robustness. For example, in
> the topology `A(transfer) -- B(client) -- C (backup)`, with numcopies=2,
> and C intermittently connected, the current
> behavior with sync --content/assistant is for an object to reach B
> and then later C, and only then be removed from A.
> If moves were used, the object moves from A to B, and so there's only
> 1 copy instead of the 2 as before, in the interim until C gets connected.

## a solution: require (minimal) locking

Instead of requiring N locked copies of content when dropping,
require only 1 locked copy. Check that content is on the other N-1
remotes w/o requiring locking (but use it if the remote supports locking).

This seems likely to behave similarly to using moves to work around the
limitations of the earlier solution, and should be easier to implement in
the assistant/sync --content, as well as less impactful on the manual user.

Unlike using moves, it does not decrease robustness, most of the time;
barring the kind of race this bug is about, numcopies behaves as desired.
When there is a race, some of the non-locked copies might be removed,
dipping below numcopies, but the 1 locked copy remains, so the data is not
entirely lost.

Dipping below desired numcopies in an unusual race condition, and then
doing extra work later to recover may be good enough.

Note that this solution will still result in drop --from failing in some
situations where it works now; manual users still need to switch their
workflows to using moves in such situations.