From 9ed32ce62b5d6c9c78481a8055e7991ed12f6f77 Mon Sep 17 00:00:00 2001
From: Joey Hess
Date: Thu, 22 Oct 2020 19:23:48 -0400
Subject: [PATCH] urk

---
 ..._loss_when_unsized_key_stored_chunked.mdwn | 41 +++++++++++++++++++
 1 file changed, 41 insertions(+)
 create mode 100644 doc/bugs/possible_data_loss_when_unsized_key_stored_chunked.mdwn

diff --git a/doc/bugs/possible_data_loss_when_unsized_key_stored_chunked.mdwn b/doc/bugs/possible_data_loss_when_unsized_key_stored_chunked.mdwn
new file mode 100644
index 0000000000..1183398175
--- /dev/null
+++ b/doc/bugs/possible_data_loss_when_unsized_key_stored_chunked.mdwn
@@ -0,0 +1,41 @@
+When a key has no known size (e.g. from addurl --relaxed), I think data loss
+could occur in this situation:
+
+* repo A has an object for the key with size X
+* repo B has an object for the same key with size Y (!= X)
+* repo A transfers to the special remote
+* then B transfers to the special remote
+* B transfers one more chunk than A, because of the different size
+* B actually "resumes" after the last chunk A uploaded. So now the remote
+  contains A's chunks, followed by B's extra chunk.
+* A and B sync up, which merges the chunk logs. Since that log
+  uses "key:chunksize" as the log key, and the two logs have two different
+  ones, one will win or come first in the merged log. Suppose it's
+  the entry for B. So, the log will then be interpreted as saying the
+  number of chunks is B's.
+* Now when the object is retrieved from the special remote, it will
+  retrieve and concatenate A's chunks, followed by B's extra chunk.
+
+So this is corruption at least. It can be recovered from, but to do so
+you have to know the original length of A's object. Note that most keys
+with unknown size also have no checksum to use to verify them, so it would
+be easy for this to happen and not be caught.
+
+(Alternatively, after B transfers, it can sync with A, drop, and get
+the content back from the special remote. Same result by another route,
+and without needing any particular git-annex branch merge behavior to
+happen, so easier to reproduce. (I have not tried either yet.))
+
+A simultaneous upload by A and B might cause unrecoverable data loss
+if they e.g. alternate chunks. Unsure if that can really happen.
+
+If A starts to transfer, sends some chunks, but is interrupted, and B
+then transfers, resuming after the last chunk A stored, that would be
+data loss.
+
+It might be best to just disable storing in chunks for keys of unknown size,
+since it can fail so badly with them, and they're kind of a side thing?
+
+(Retrieval could continue to be supported, for whatever is already stored,
+hopefully without being corrupted already.)
+--[[Joey]]
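
To make the retrieval corruption concrete, here is a minimal, self-contained Haskell sketch of the scenario in the bullet list above (git-annex is written in Haskell, but this is not git-annex code: `chunkSize`, the `Remote` map, and the object contents are all invented for illustration). It models resume as "skip any chunk number already present on the remote" and retrieval as "concatenate the first N chunks, where N comes from whichever merged chunk log entry won":

    -- Simulation of two repos storing different-sized objects for the
    -- same unsized key on a chunking special remote. Not git-annex code.
    import Data.List (foldl')
    import qualified Data.Map as M

    chunkSize :: Int
    chunkSize = 3

    -- Split an object's content into fixed-size chunks (last may be short).
    chunksOf :: Int -> [a] -> [[a]]
    chunksOf _ [] = []
    chunksOf n xs = take n xs : chunksOf n (drop n xs)

    -- The remote maps chunk numbers to stored chunk contents.
    type Remote = M.Map Int String

    -- Upload skips chunks already present on the remote, mirroring the
    -- "resume after the last chunk the other repo uploaded" behavior.
    upload :: String -> Remote -> Remote
    upload obj remote = foldl' store remote (zip [1 ..] (chunksOf chunkSize obj))
      where
        store r (n, c)
          | n `M.member` r = r               -- resume: chunk already stored
          | otherwise      = M.insert n c r

    -- Retrieval concatenates the first nchunks chunks, where nchunks comes
    -- from the "key:chunksize" log entry that won the merge.
    retrieve :: Int -> Remote -> String
    retrieve nchunks r = concatMap (r M.!) [1 .. nchunks]

    main :: IO ()
    main = do
      let objA = replicate 8 'A'   -- repo A's object for the unsized key
          objB = replicate 11 'B'  -- repo B's object: same key, larger size
          remote = upload objB (upload objA M.empty)
          nchunksB = length (chunksOf chunkSize objB)  -- B's log entry wins
      putStrLn (retrieve nchunksB remote)

Running this prints "AAAAAAAABB": A's three chunks followed by B's extra fourth chunk, which matches neither repo's object. And since a key of unknown size typically also has no checksum, nothing flags the corruption on retrieval, which is why the suggested mitigation above amounts to refusing to chunk keys whose size is unknown.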