design work

Joey Hess 2022-07-15 15:06:40 -04:00
parent a2b1f369d1
commit 8bc9381d8d
2 changed files with 112 additions and 0 deletions

@@ -0,0 +1,39 @@
[[!comment format=mdwn
username="joey"
subject="""comment 10"""
date="2022-07-15T15:50:00Z"
content="""
Ran `git-annex whereis --quiet` over 10000 annexed files. With journal
locking on read, it took 11.71 seconds. Without journal locking,
it took 11.73 seconds. No speed difference.

And strace showed why: it only opened the journal directory once, noticed
it was empty, and skipped ever trying to read any files from it! If there
are files, it stages them and still manages to not need to read from the
journal after that. Nice optimisation from earlier this year. :-)
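
Roughly, the optimisation amounts to scanning the journal directory once
and remembering whether it was empty. A sketch only, not the real code;
`scanJournal`, `readFromBranch`, and `stageJournalThenReadFromBranch` are
made-up stand-ins for the real plumbing:

```haskell
import Data.IORef (IORef, newIORef, readIORef)
import System.Directory (doesDirectoryExist, listDirectory)

-- Remembers whether the journal directory had any files when scanned.
newtype JournalState = JournalState (IORef Bool)

-- Scan the journal directory once, up front.
scanJournal :: FilePath -> IO JournalState
scanJournal journaldir = do
    exists <- doesDirectoryExist journaldir
    files <- if exists then listDirectory journaldir else return []
    JournalState <$> newIORef (null files)

-- Later reads skip the journal entirely when it started out empty.
readAnnexData :: JournalState -> FilePath -> IO String
readAnnexData (JournalState wasempty) file = do
    empty <- readIORef wasempty
    if empty
        then readFromBranch file
        else stageJournalThenReadFromBranch file

-- Stand-ins for the real branch/journal access.
readFromBranch, stageJournalThenReadFromBranch :: FilePath -> IO String
readFromBranch _ = return ""
stageJournalThenReadFromBranch _ = return ""
```
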
I thought that --batch commands would still check the journal files,
but surprisingly, they don't seem to. That was a bug:
[[bugs/batch_commands_miss_journalled_changes_made_while_running]]

After fixing that, I benchmarked feeding 10000 filenames into `git-annex
whereis --batch`. With journal locking on read, it took 18.43 seconds.
Without journal locking, it took 17.22 seconds. Before that bug fix,
with or without journal locking, it took 16.59 seconds.

So, if the slowdown caused by journal locking on read is a problem for
anyone, a mode could be added that makes --batch not check the journal for
changes made after the command started. That would make it run as fast as
before that bug fix.

There might be commands other than --batch commands that both read and
write git-annex branch data, and so end up checking the journal on every
read, since writing invalidates the above optimisation. Not sure which
commands those would be, maybe `git-annex drop`? Anyway, such commands are
probably doing more expensive things than locking the journal; they're not
query commands.

That makes me ok with adding the locking on read, if needed for append.
(Or similar added overheads to journal reads.)

For now, I've committed it to the `append` branch.
"""]]

@@ -0,0 +1,73 @@
[[!comment format=mdwn
username="joey"
subject="""comment 11"""
date="2022-07-15T17:58:27Z"
content="""
The remaining problem with appending is crash safety. If an append is
not atomic, a journal file could end up having a truncated line written to
it.

That seems unlikely, but see the bugzilla page above; it can happen on a
kill signal at least.

So, can append somehow be made atomic? How about this:

Make `.git/annex/journal-append/`, which contains append files: the same
as journal files, but in the process of being appended to. Have it also
contain size files, each containing a number, the size of the append file
before anything got appended to it.

Then, to append to a journal file (a code sketch follows this list):

1. When the journal file exists, and the append file does not,
   move the journal file to the append file.
2. When the journal file does not exist and the append file does,
   truncate it to the size in the size file. (If the size file does not
   exist, skip truncating.)
3. When the journal and append file both exist, truncate the append file,
   and add the journal file's content to what is going to be appended.
   (This is in case an old git-annex wrote a new value to the
   journal file, not knowing about the append file.)
4. Write the current size of the append file to the size file.
5. Append to the append file.
6. Move the append file back to the journal file.
7. Delete the size file.
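
Roughly, in code, the protocol could look like the following. This is only
a sketch, not the real implementation: `appendToJournal`,
`truncateToRecordedSize`, and `fileSizeOrZero` are made-up names, the
caller is assumed to already hold the journal write lock, and error
handling is elided.

```haskell
import Control.Monad (when)
import System.Directory (doesFileExist, getFileSize, removeFile, renameFile)
import System.IO (IOMode(AppendMode, ReadWriteMode), hPutStr, hSetFileSize, withFile)

-- Append one line (including its trailing newline) to a journal file,
-- going via the append file and size file in .git/annex/journal-append/.
appendToJournal :: FilePath -> FilePath -> FilePath -> String -> IO ()
appendToJournal journalfile appendfile sizefile line = do
    jexists <- doesFileExist journalfile
    aexists <- doesFileExist appendfile
    extra <- case (jexists, aexists) of
        -- step 1: start the append by moving the journal file aside
        (True, False) -> do
            renameFile journalfile appendfile
            return ""
        -- step 2: a previous append was interrupted; undo its torn write
        (False, True) -> do
            truncateToRecordedSize appendfile sizefile
            return ""
        -- step 3: an old git-annex wrote a fresh journal file while an
        -- interrupted append was pending; undo the torn write and fold
        -- the journal file's content into what gets appended
        (True, True) -> do
            truncateToRecordedSize appendfile sizefile
            j <- readFile journalfile
            length j `seq` return j
        -- nothing journalled yet; the append below creates the file
        (False, False) -> return ""
    -- step 4: record the pre-append size so a crash can be rolled back
    sz <- fileSizeOrZero appendfile
    writeFile sizefile (show sz)
    -- step 5: the append itself
    withFile appendfile AppendMode $ \h -> hPutStr h (extra ++ line)
    -- step 6: the rename atomically makes the completed append visible
    renameFile appendfile journalfile
    -- step 7
    removeFile sizefile

-- Truncate an append file back to the size recorded before the append
-- started; when there is no size file, the append had not started, so skip.
truncateToRecordedSize :: FilePath -> FilePath -> IO ()
truncateToRecordedSize appendfile sizefile = do
    hassize <- doesFileExist sizefile
    when hassize $ do
        sz <- read <$> readFile sizefile
        sz `seq` withFile appendfile ReadWriteMode (\h -> hSetFileSize h sz)

fileSizeOrZero :: FilePath -> IO Integer
fileSizeOrZero f = do
    exists <- doesFileExist f
    if exists then getFileSize f else return 0
```
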
When reading journalled files, it would need to also check the append
file, and only read up to the recorded size of it. When both the append
file and the journal file exist, it would read both and combine them. This
change would slow down reads slightly, though as seen in comment #10,
mostly only for --batch commands.

(It may not be necessary to lock on read actually. It can check for the
append file and read the size file. If a write is happening at the
same time, the size file may not exist yet, or may have been deleted
already. In either case, reading the whole append file is ok.
Should be possible to make this race-safe without locking.)
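
A sketch of that read path, again only an approximation (`readJournalled`
and `readIfExists` are made up, the exact ordering needed to be fully
race-safe is glossed over, and combining is done by simple concatenation):

```haskell
import qualified Data.ByteString as B
import System.Directory (doesFileExist)

-- Read the journalled content for a file, taking a pending append into
-- account. How the two parts should really be combined is left open;
-- here they are simply concatenated.
readJournalled :: FilePath -> FilePath -> FilePath -> IO B.ByteString
readJournalled journalfile appendfile sizefile = do
    a <- readIfExists appendfile
    hassize <- doesFileExist sizefile
    a' <- if hassize
        -- an append may be in progress; only trust the recorded prefix
        then do
            sz <- read <$> readFile sizefile
            return (B.take sz a)
        -- no size file: either no append has started, or one has
        -- finished, so the whole append file is good
        else return a
    -- when the journal file also exists (eg written by an old git-annex),
    -- combine it with the append file's content
    j <- readIfExists journalfile
    return (a' <> j)

readIfExists :: FilePath -> IO B.ByteString
readIfExists f = do
    exists <- doesFileExist f
    if exists then B.readFile f else return B.empty
```
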
When staging the journal, it would need to first handle any interrupted
appends, by checking if any append files exist (a recovery sketch follows
this list):

1. Truncate the append file to the value in the size file.
2. Read the value of the file from the branch.
3. Append the value of the file from the branch to the append file.
   (This is to handle the case where an old git-annex has written
   divergent data to the branch; see below.)
4. Move the append file back to the journal.
5. Delete the size file.
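
A sketch of that recovery pass, reusing `truncateToRecordedSize` from the
append sketch above. The `.size` naming convention and the
`catFileFromBranch` argument, standing in for however the value is read
from the git-annex branch, are placeholders:

```haskell
import Control.Monad (forM_, when)
import Data.List (isSuffixOf)
import System.Directory (doesFileExist, listDirectory, removeFile, renameFile)
import System.FilePath ((</>))
import System.IO (IOMode(AppendMode), hPutStr, withFile)

-- Finish (or roll back) any appends that were interrupted, before the
-- journal is staged.
recoverInterruptedAppends
    :: FilePath                 -- .git/annex/journal-append/
    -> FilePath                 -- .git/annex/journal/
    -> (FilePath -> IO String)  -- read the file's value from the branch
    -> IO ()
recoverInterruptedAppends appenddir journaldir catFileFromBranch = do
    fs <- listDirectory appenddir
    forM_ (filter (not . isSizeFile) fs) $ \f -> do
        let appendfile = appenddir </> f
            sizefile = appendfile ++ ".size"
            journalfile = journaldir </> f
        -- step 1: drop the torn partial line
        -- (truncateToRecordedSize is the helper from the append sketch)
        truncateToRecordedSize appendfile sizefile
        -- steps 2 and 3: fold in the branch's value, in case an old
        -- git-annex committed divergent data while the append was pending
        branchval <- catFileFromBranch f
        withFile appendfile AppendMode $ \h -> hPutStr h branchval
        -- step 4: make it a normal journal file again
        renameFile appendfile journalfile
        -- step 5
        hassize <- doesFileExist sizefile
        when hassize (removeFile sizefile)
  where
    isSizeFile = (".size" `isSuffixOf`)
```
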
----

When a new git-annex is doing an append and an old git-annex is also in
use, the old git-annex will not see files in the journal that are in the
process of being appended to. So it might use out of date information for
queries.

When making a write, though, the old git-annex has always first done a read
with the journal locked, so it will block until the append is complete, and
so will not use out of date information for writes.

The old git-annex will only miss data when something was written to the
journal but not committed to the branch, and then an append happened but
got interrupted. It will not see that data, and might make its own
divergent changes, that get committed to the branch. The new git-annex
will need to deal with this when handling interrupted appends.
"""]]