When a key has no known size (e.g. from addurl --relaxed), I think data loss
could occur in this situation:

* repo A has an object for the key with size X
* repo B has an object for the same key with size Y (!= X)
* repo A transfers to the special remote
* then B transfers to the special remote
* B transfers one more chunk than A, because of the different size
* B actually "resumes" after the last chunk A uploaded. So now the remote
contains A's chunks, followed by B's extra chunk.
* A and B sync up, which merges the chunk logs. Since that log
uses "key:chunksize" as the log key, and the two logs have two different
entries, one will win or come first in the merged log. Suppose it's
the entry for B. So the merged log will be interpreted as the number of
chunks being B's.
* Now when the object is retrieved from the special remote, it will
retrieve and concatenate A's chunks, followed by B's extra chunk
(see the sketch after this list).

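To make this concrete, here is a minimal sketch of the failure. This is
not git-annex code; the chunk size, object sizes, resume behavior, and
retrieval by concatenation are modeled on the scenario above, with
made-up names.

    import qualified Data.ByteString.Char8 as B

    chunkSize :: Int
    chunkSize = 4

    -- Split an object into fixed-size chunks (the last may be short).
    chunksOf :: B.ByteString -> [B.ByteString]
    chunksOf b
        | B.null b = []
        | otherwise = c : chunksOf rest
      where
        (c, rest) = B.splitAt chunkSize b

    main :: IO ()
    main = do
        let objA = B.pack "AAAAAAAAAA"    -- A's object, size X = 10 (3 chunks)
            objB = B.pack "BBBBBBBBBBBBB" -- B's object, size Y = 13 (4 chunks)
            -- A uploads all of its chunks.
            stored = chunksOf objA
            -- B "resumes" after the last chunk A stored, so it only
            -- uploads its one extra chunk.
            stored' = stored ++ drop (length stored) (chunksOf objB)
            -- Suppose B's entry won the chunk log merge, so retrieval
            -- fetches and concatenates all four stored chunks.
            retrieved = B.concat stored'
        print retrieved -- "AAAAAAAAAAB": 11 bytes; neither A's object nor B's

In this example A's object could be recovered by truncating the result
back to X bytes, which is why recovery needs A's original length.
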
So this is corruption at least; it can be recovered from, but to do so
you have to know the original length of A's object. Note that most keys
with unknown size also have no checksum to use to verify them, so it would
be easy for this to happen and not be caught.

(Alternatively, after B transfers, it can sync with A, drop, and get
the content back from the special remote. Same result by another route,
and without needing any particular git-annex branch merge behavior to
happen, so easier to reproduce. (I have not tried either yet.))

A simultaneous upload by A and B might cause unrecoverable data loss
if they e.g. alternate chunks. Unsure if that can really happen.

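For the alternating case, here is a sketch under the (unverified)
assumption that concurrent uploads race per chunk slot, with the last
write to each slot winning; the names are made up.

    -- Each stored slot holds whichever uploader's write landed last.
    -- With A's and B's chunks interleaved, neither original object can
    -- be reassembled, even if both sizes are known.
    racedChunks :: [a] -> [a] -> [Bool] -> [a]
    racedChunks as bs aWon = zipWith3 pick as bs aWon
      where
        pick a _ True  = a
        pick _ b False = b

    -- eg racedChunks [a1,a2,a3] [b1,b2,b3] [True,False,True] == [a1,b2,a3]
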
If A starts to transfer, sends some chunks, but is interrupted, and B
then transfers, resuming after the last chunk A stored, that would be data
loss. (The remote would hold A's leading chunks followed by the rest of
B's, the same kind of mix as above.)

It might be best to just disable storing in chunks for keys of unknown size,
since it can fail so badly with them, and they're kind of a side thing?

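A minimal sketch of that guard, assuming a key type whose size field is
optional (as it is for addurl --relaxed keys); the names are
illustrative, not the actual git-annex API.

    import Data.Maybe (isJust)

    data Key = Key
        { keyName :: String
        , keySize :: Maybe Integer -- Nothing when the size is unknown
        }

    -- Only use chunked storage when the key's size is known; store
    -- sizeless keys as a single object, so a "resume" can never mix
    -- chunks from two different objects.
    shouldChunk :: Key -> Bool
    shouldChunk = isJust . keySize
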
(Could continue to support retrieval, for whatever is already stored,
hopefully w/o it having been corrupted already.)

--[[Joey]]