From 9ed32ce62b5d6c9c78481a8055e7991ed12f6f77 Mon Sep 17 00:00:00 2001
From: Joey Hess
Date: Thu, 22 Oct 2020 19:23:48 -0400
Subject: [PATCH] urk

---
 ..._loss_when_unsized_key_stored_chunked.mdwn | 41 +++++++++++++++++++
 1 file changed, 41 insertions(+)
 create mode 100644 doc/bugs/possible_data_loss_when_unsized_key_stored_chunked.mdwn

diff --git a/doc/bugs/possible_data_loss_when_unsized_key_stored_chunked.mdwn b/doc/bugs/possible_data_loss_when_unsized_key_stored_chunked.mdwn
new file mode 100644
index 0000000000..1183398175
--- /dev/null
+++ b/doc/bugs/possible_data_loss_when_unsized_key_stored_chunked.mdwn
@@ -0,0 +1,41 @@
+When a key has no known size (e.g. from addurl --relaxed), I think data loss
+could occur in this situation:
+
+* repo A has an object for the key with size X
+* repo B has an object for the same key with size Y (!= X)
+* repo A transfers to the special remote
+* then B transfers to the special remote
+* B transfers one more chunk than A, because of the different size
+* B actually "resumes" after the last chunk A uploaded. So now the remote
+  contains A's chunks, followed by B's extra chunk.
+* A and B sync up, which merges the chunk logs. Since that log
+  uses "key:chunksize" as the log key, and the two logs have two different
+  ones, one will win or come first in the merged log. Suppose it's
+  the entry for B. So, the log will then be interpreted as saying the
+  number of chunks is B's.
+* Now when the object is retrieved from the special remote, it will
+  retrieve and concatenate A's chunks, followed by B's extra chunk.
+
+So this is corruption at least. It can be recovered from, but to do so
+you have to know the original length of A's object. Note that most keys
+with unknown size also have no checksum to use to verify them, so it would
+be easy for this to happen and not be caught.
+
+(Alternatively, after B transfers, it can sync with A, drop, and get
+the content back from the special remote. Same result by another route,
+and without needing any particular git-annex branch merge behavior to
+happen, so easier to reproduce. (I have not tried either yet.))
+
+A simultaneous upload by A and B might cause unrecoverable data loss
+if they e.g. alternate chunks. Unsure if that can really happen.
+
+If A starts to transfer, sends some chunks, but is interrupted, and B
+then transfers, resuming after the last chunk A stored, that would be
+data loss.
+
+It might be best to just disable storing in chunks for keys of unknown size,
+since it can fail so badly with them, and they're kind of a side thing?
+
+(Retrieval could continue to be supported, for whatever is already stored,
+hopefully without being corrupted already.)
+--[[Joey]]
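
To make the retrieval corruption concrete, here is a minimal, self-contained Haskell sketch of the scenario in the bullet list above (git-annex is written in Haskell, but this is not git-annex code: `chunkSize`, the `Remote` map, and the object contents are all invented for illustration). It models resume as "skip any chunk number already present on the remote" and retrieval as "concatenate the first N chunks, where N comes from whichever merged chunk log entry won":

    -- Simulation of two repos storing different-sized objects for the
    -- same unsized key on a chunking special remote. Not git-annex code.
    import Data.List (foldl')
    import qualified Data.Map as M

    chunkSize :: Int
    chunkSize = 3

    -- Split an object's content into fixed-size chunks (last may be short).
    chunksOf :: Int -> [a] -> [[a]]
    chunksOf _ [] = []
    chunksOf n xs = take n xs : chunksOf n (drop n xs)

    -- The remote maps chunk numbers to stored chunk contents.
    type Remote = M.Map Int String

    -- Upload skips chunks already present on the remote, mirroring the
    -- "resume after the last chunk the other repo uploaded" behavior.
    upload :: String -> Remote -> Remote
    upload obj remote = foldl' store remote (zip [1 ..] (chunksOf chunkSize obj))
      where
        store r (n, c)
          | n `M.member` r = r               -- resume: chunk already stored
          | otherwise      = M.insert n c r

    -- Retrieval concatenates the first nchunks chunks, where nchunks comes
    -- from the "key:chunksize" log entry that won the merge.
    retrieve :: Int -> Remote -> String
    retrieve nchunks r = concatMap (r M.!) [1 .. nchunks]

    main :: IO ()
    main = do
      let objA = replicate 8 'A'   -- repo A's object for the unsized key
          objB = replicate 11 'B'  -- repo B's object: same key, larger size
          remote = upload objB (upload objA M.empty)
          nchunksB = length (chunksOf chunkSize objB)  -- B's log entry wins
      putStrLn (retrieve nchunksB remote)

Running this prints "AAAAAAAABB": A's three chunks followed by B's extra fourth chunk, which matches neither repo's object. And since a key of unknown size typically also has no checksum, nothing flags the corruption on retrieval, which is why the suggested mitigation above amounts to refusing to chunk keys whose size is unknown.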