design work

Joey Hess 2022-07-15 15:06:40 -04:00
parent a2b1f369d1
commit 8bc9381d8d
2 changed files with 112 additions and 0 deletions

@@ -0,0 +1,39 @@
[[!comment format=mdwn
username="joey"
subject="""comment 10"""
date="2022-07-15T15:50:00Z"
content="""
Ran `git-annex whereis --quiet` over 10000 annexed files. With journal
locking on read, it took 11.71 seconds. Without journal locking,
it took 11.73 seconds. No speed difference.

And strace showed why: it only opened the journal directory once, noticed
it was empty, and skipped ever trying to read any files from it! If there
are files, it stages them and still manages to not need to read from the
journal after that. Nice optimisation from earlier this year. :-)
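
Roughly, the optimisation amounts to scanning the journal directory once
and remembering whether it was empty. A sketch only, not the real code;
`scanJournal`, `readFromBranch`, and `stageJournalThenReadFromBranch` are
made-up stand-ins for the real plumbing:

```haskell
import Data.IORef (IORef, newIORef, readIORef)
import System.Directory (doesDirectoryExist, listDirectory)

-- Remembers whether the journal directory had any files when scanned.
newtype JournalState = JournalState (IORef Bool)

-- Scan the journal directory once, up front.
scanJournal :: FilePath -> IO JournalState
scanJournal journaldir = do
    exists <- doesDirectoryExist journaldir
    files <- if exists then listDirectory journaldir else return []
    JournalState <$> newIORef (null files)

-- Later reads skip the journal entirely when it started out empty.
readAnnexData :: JournalState -> FilePath -> IO String
readAnnexData (JournalState wasempty) file = do
    empty <- readIORef wasempty
    if empty
        then readFromBranch file
        else stageJournalThenReadFromBranch file

-- Stand-ins for the real branch/journal access.
readFromBranch, stageJournalThenReadFromBranch :: FilePath -> IO String
readFromBranch _ = return ""
stageJournalThenReadFromBranch _ = return ""
```
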
I thought that --batch commands would still check the journal files,
but surprisingly, they don't seem to. That was a bug:
[[bugs/batch_commands_miss_journalled_changes_made_while_running]]

After fixing that, I benchmarked feeding 10000 filenames into `git-annex
whereis --batch`. With journal locking on read, it took 18.43 seconds.
Without journal locking, it took 17.22 seconds. Before that bug fix,
with or without journal locking, it took 16.59 seconds.

So, if the slowdown caused by journal locking on read is a problem for
anyone, a mode could be added that makes --batch not check the journal for
changes made after the command started. That would make it run as fast as
before that bug fix.

There might be commands other than --batch commands that both read and
write git-annex branch data, and so end up checking the journal on every
read, since writing invalidates the above optimisation. Not sure which
commands those would be, maybe `git-annex drop`? Anyway, such commands are
probably doing more expensive things than locking the journal; they're not
query commands.

That makes me ok with adding the locking on read, if needed for append.
(Or similar added overheads to journal reads.)

For now, I've committed it to the `append` branch.
"""]]

@@ -0,0 +1,73 @@
[[!comment format=mdwn
username="joey"
subject="""comment 11"""
date="2022-07-15T17:58:27Z"
content="""
The remaining problem with appending is crash safety. If an append is
not atomic, a journal file could end up having a truncated line written to
it.

That seems unlikely, but see the bugzilla page above; it can happen on a
kill signal at least.

So, can append somehow be made atomic? How about this:

Make `.git/annex/journal-append/`, which contains append files: the same
as journal files, but in the process of being appended to. Have it also
contain size files, each containing a number, the size of the append file
before anything got appended to it.

Then, to append to a journal file (a code sketch follows this list):

1. When the journal file exists, and the append file does not,
   move the journal file to the append file.
2. When the journal file does not exist and the append file does,
   truncate it to the size in the size file. (If the size file does not
   exist, skip truncating.)
3. When the journal and append file both exist, truncate the append file,
   and add the journal file's content to what is going to be appended.
   (This is in case an old git-annex wrote a new value to the
   journal file, not knowing about the append file.)
4. Write the current size of the append file to the size file.
5. Append to the append file.
6. Move the append file back to the journal file.
7. Delete the size file.
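
Roughly, in code, the protocol could look like the following. This is only
a sketch, not the real implementation: `appendToJournal`,
`truncateToRecordedSize`, and `fileSizeOrZero` are made-up names, the
caller is assumed to already hold the journal write lock, and error
handling is elided.

```haskell
import Control.Monad (when)
import System.Directory (doesFileExist, getFileSize, removeFile, renameFile)
import System.IO (IOMode(AppendMode, ReadWriteMode), hPutStr, hSetFileSize, withFile)

-- Append one line (including its trailing newline) to a journal file,
-- going via the append file and size file in .git/annex/journal-append/.
appendToJournal :: FilePath -> FilePath -> FilePath -> String -> IO ()
appendToJournal journalfile appendfile sizefile line = do
    jexists <- doesFileExist journalfile
    aexists <- doesFileExist appendfile
    extra <- case (jexists, aexists) of
        -- step 1: start the append by moving the journal file aside
        (True, False) -> do
            renameFile journalfile appendfile
            return ""
        -- step 2: a previous append was interrupted; undo its torn write
        (False, True) -> do
            truncateToRecordedSize appendfile sizefile
            return ""
        -- step 3: an old git-annex wrote a fresh journal file while an
        -- interrupted append was pending; undo the torn write and fold
        -- the journal file's content into what gets appended
        (True, True) -> do
            truncateToRecordedSize appendfile sizefile
            j <- readFile journalfile
            length j `seq` return j
        -- nothing journalled yet; the append below creates the file
        (False, False) -> return ""
    -- step 4: record the pre-append size so a crash can be rolled back
    sz <- fileSizeOrZero appendfile
    writeFile sizefile (show sz)
    -- step 5: the append itself
    withFile appendfile AppendMode $ \h -> hPutStr h (extra ++ line)
    -- step 6: the rename atomically makes the completed append visible
    renameFile appendfile journalfile
    -- step 7
    removeFile sizefile

-- Truncate an append file back to the size recorded before the append
-- started; when there is no size file, the append had not started, so skip.
truncateToRecordedSize :: FilePath -> FilePath -> IO ()
truncateToRecordedSize appendfile sizefile = do
    hassize <- doesFileExist sizefile
    when hassize $ do
        sz <- read <$> readFile sizefile
        sz `seq` withFile appendfile ReadWriteMode (\h -> hSetFileSize h sz)

fileSizeOrZero :: FilePath -> IO Integer
fileSizeOrZero f = do
    exists <- doesFileExist f
    if exists then getFileSize f else return 0
```
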
When reading journalled files, it would need to also check the append
file, and only read up to the recorded size of it. When both the append
file and the journal file exist, it would read both and combine them. This
change would slow down reads slightly, though as seen in comment #10,
mostly only for --batch commands.

(It may not be necessary to lock on read actually. It can check for the
append file and read the size file. If a write is happening at the
same time, the size file may not exist yet, or may have been deleted
already. In either case, reading the whole append file is ok.
Should be possible to make this race-safe without locking.)
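
A sketch of that read path, again only an approximation (`readJournalled`
and `readIfExists` are made up, the exact ordering needed to be fully
race-safe is glossed over, and combining is done by simple concatenation):

```haskell
import qualified Data.ByteString as B
import System.Directory (doesFileExist)

-- Read the journalled content for a file, taking a pending append into
-- account. How the two parts should really be combined is left open;
-- here they are simply concatenated.
readJournalled :: FilePath -> FilePath -> FilePath -> IO B.ByteString
readJournalled journalfile appendfile sizefile = do
    a <- readIfExists appendfile
    hassize <- doesFileExist sizefile
    a' <- if hassize
        -- an append may be in progress; only trust the recorded prefix
        then do
            sz <- read <$> readFile sizefile
            return (B.take sz a)
        -- no size file: either no append has started, or one has
        -- finished, so the whole append file is good
        else return a
    -- when the journal file also exists (eg written by an old git-annex),
    -- combine it with the append file's content
    j <- readIfExists journalfile
    return (a' <> j)

readIfExists :: FilePath -> IO B.ByteString
readIfExists f = do
    exists <- doesFileExist f
    if exists then B.readFile f else return B.empty
```
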
When staging the journal, it would need to first handle any interrupted
appends, by checking if any append files exist (a recovery sketch follows
this list):

1. Truncate the append file to the value in the size file.
2. Read the value of the file from the branch.
3. Append the value of the file from the branch to the append file.
   (This is to handle the case where an old git-annex has written
   divergent data to the branch; see below.)
4. Move the append file back to the journal.
5. Delete the size file.
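
A sketch of that recovery pass, reusing `truncateToRecordedSize` from the
append sketch above. The `.size` naming convention and the
`catFileFromBranch` argument, standing in for however the value is read
from the git-annex branch, are placeholders:

```haskell
import Control.Monad (forM_, when)
import Data.List (isSuffixOf)
import System.Directory (doesFileExist, listDirectory, removeFile, renameFile)
import System.FilePath ((</>))
import System.IO (IOMode(AppendMode), hPutStr, withFile)

-- Finish (or roll back) any appends that were interrupted, before the
-- journal is staged.
recoverInterruptedAppends
    :: FilePath                 -- .git/annex/journal-append/
    -> FilePath                 -- .git/annex/journal/
    -> (FilePath -> IO String)  -- read the file's value from the branch
    -> IO ()
recoverInterruptedAppends appenddir journaldir catFileFromBranch = do
    fs <- listDirectory appenddir
    forM_ (filter (not . isSizeFile) fs) $ \f -> do
        let appendfile = appenddir </> f
            sizefile = appendfile ++ ".size"
            journalfile = journaldir </> f
        -- step 1: drop the torn partial line
        -- (truncateToRecordedSize is the helper from the append sketch)
        truncateToRecordedSize appendfile sizefile
        -- steps 2 and 3: fold in the branch's value, in case an old
        -- git-annex committed divergent data while the append was pending
        branchval <- catFileFromBranch f
        withFile appendfile AppendMode $ \h -> hPutStr h branchval
        -- step 4: make it a normal journal file again
        renameFile appendfile journalfile
        -- step 5
        hassize <- doesFileExist sizefile
        when hassize (removeFile sizefile)
  where
    isSizeFile = (".size" `isSuffixOf`)
```
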
----

When a new git-annex is doing an append and an old git-annex is also in
use, the old git-annex will not see files in the journal that are in the
process of being appended to. So it might use out of date information for
queries.

When making a write, though, the old git-annex has always first done a read
with the journal locked, so it will block until the append is complete, and
so will not use out of date information for writes.

The old git-annex will only miss data when something was written to the
journal but not committed to the branch, and then an append happened but
got interrupted. It will not see that data, and might make its own
divergent changes, that get committed to the branch. The new git-annex
will need to deal with this when handling interrupted appends.
"""]]