design work
parent a2b1f369d1
commit 8bc9381d8d
2 changed files with 112 additions and 0 deletions
@@ -0,0 +1,39 @@
[[!comment format=mdwn
username="joey"
subject="""comment 10"""
date="2022-07-15T15:50:00Z"
content="""
Ran `git-annex whereis --quiet` over 10000 annexed files. With journal
locking on read, it took 11.71 seconds. Without journal locking,
it took 11.73 seconds. No speed difference.

And strace showed why: This only opened the journal directory once, noticed
it was empty, and skipped ever trying to read any files from it! If there
are files, it stages them and still manages to not need to read from the
journal after that. Nice optimisation from earlier this year. :-)
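
As a rough illustration of why the locking cost vanished, here is a toy Python model of the behaviour strace showed: the journal directory is listed once, and if it is empty, later reads skip it. The names `BranchReader`, `note_write` and `read_branch` are made up for this sketch and are not git-annex's actual code. Staging of a non-empty journal is not modelled.

```python
import os


class BranchReader:
    # Toy model only; all names are hypothetical, not git-annex internals.

    def __init__(self, journal_dir, read_branch):
        self.journal_dir = journal_dir
        self.read_branch = read_branch  # callable: file name -> bytes
        # List the journal directory once, up front. If it is empty,
        # reads can go straight to the branch from then on.
        self.journal_ignorable = not os.listdir(journal_dir)
        self.journal_reads = 0  # counts how often the journal was consulted

    def note_write(self):
        # A write by this process may put files in the journal, so the
        # "journal is empty" shortcut no longer holds.
        self.journal_ignorable = False

    def read(self, name):
        if not self.journal_ignorable:
            self.journal_reads += 1
            try:
                with open(os.path.join(self.journal_dir, name), "rb") as f:
                    return f.read()
            except FileNotFoundError:
                pass  # nothing journalled for this file
        return self.read_branch(name)
```

With an empty journal, `read` never touches the journal directory again, which would explain why locking it made no measurable difference in this benchmark.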

I thought that --batch commands would still check the journal files,
but surprisingly, they did not. That was a bug:
[[bugs/batch_commands_miss_journalled_changes_made_while_running]]

After fixing that, I benchmarked feeding 10000 filenames into
`git-annex whereis --batch`. With journal locking on read, it took 18.43
seconds. Without journal locking, it took 17.22 seconds. Before that bug fix,
with or without journal locking, it took 16.59 seconds.
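
To put those --batch numbers in proportion, a small Python calculation using only the three timings quoted above (the variable names are just for this sketch): locking on read adds about 1.21 seconds, roughly 7% over the unlocked run, and about 11% over the time before the bug fix.

```python
FILES = 10000

# Wall-clock seconds from the --batch benchmark runs quoted above.
batch_locked = 18.43      # journal locking on read
batch_unlocked = 17.22    # no journal locking
batch_before_fix = 16.59  # before the bug fix, with or without locking


def overhead(slow, fast):
    # Returns (extra seconds, extra as a percentage of the faster run).
    extra = slow - fast
    return extra, 100.0 * extra / fast


lock_extra, lock_pct = overhead(batch_locked, batch_unlocked)
fix_extra, fix_pct = overhead(batch_unlocked, batch_before_fix)
total_extra, total_pct = overhead(batch_locked, batch_before_fix)

print(f"locking:  +{lock_extra:.2f}s ({lock_pct:.1f}%), "
      f"{1000 * lock_extra / FILES:.3f} ms per file")
print(f"bug fix:  +{fix_extra:.2f}s ({fix_pct:.1f}%)")
print(f"combined: +{total_extra:.2f}s ({total_pct:.1f}%)")
```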

So, if the slowdown caused by journal locking on read is a problem for
anyone, a mode could be added that makes --batch not check the journal for
changes made after the command started. That would make it run as fast as
before that bug fix.

There might be commands other than --batch commands that both read and
write git-annex branch data, and so end up checking the journal on every
read, since writing invalidates the above optimisation. Not sure what
commands those would be, maybe `git-annex drop`? Anyway, such commands are
probably doing more expensive things than locking the journal; they're not
query commands.

That makes me ok with adding the locking on read, if needed for append.
(Or similar added overheads to journal reads.)
For now, I've committed it to the `append` branch.
"""]]
@@ -0,0 +1,73 @@
[[!comment format=mdwn
username="joey"
subject="""comment 11"""
date="2022-07-15T17:58:27Z"
content="""
The remaining problem with appending is crash safety. If an append is
not atomic, a journal file could end up having a truncated line written to
it.

That seems unlikely, but see the bugzilla page above; it can happen on a
kill signal at least.

So, can append somehow be made atomic? How about this:

Make a `.git/annex/journal-append/` directory that contains append files,
which are the same as journal files, but in the process of being appended to.
And make it also contain size files, each of which contains a number, the
size of the append file before anything got appended to it.

Then, to append to a journal file:

1. When the journal file exists, and the append file does not,
   move the journal file to the append file.
2. When the journal file does not exist and the append file does,
   truncate it to the size in the size file. (If the size file does not
   exist, skip truncating.)
3. When the journal and append file both exist, truncate the append file,
   and add the journal file's content to what is going to be appended.
   (This is in case an old git-annex wrote a new value to the
   journal file, not knowing about the append file.)
4. Write the current size of the append file to the size file.
5. Append to the append file.
6. Move the append file back to the journal file.
7. Delete the size file.
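
The append procedure above could be sketched in Python roughly as follows. This is only an illustration under several assumptions that the proposal does not spell out: the size file is named by adding `.size` to the append file's name, an unreadable size file is treated like a missing one, "truncate the append file" in the both-exist case means truncating to the recorded size, a file that exists in neither directory starts at size 0, and the caller already holds the journal lock. The function names are hypothetical.

```python
import os


def recorded_size(size_file):
    # Returns the size stored in the size file, or None if there is no
    # usable size file (in which case the caller skips truncating).
    try:
        with open(size_file) as f:
            return int(f.read().strip())
    except (FileNotFoundError, ValueError):
        return None


def append_to_journal(journal_dir, append_dir, name, data):
    # Assumption: the caller already holds the journal lock.
    journal = os.path.join(journal_dir, name)
    append = os.path.join(append_dir, name)
    size_file = append + ".size"  # hypothetical naming scheme
    pending = b""

    have_journal = os.path.exists(journal)
    have_append = os.path.exists(append)

    if have_journal and not have_append:
        # Normal case: take the journal file out of the journal.
        os.rename(journal, append)
    elif have_append:
        # An earlier append was interrupted: discard whatever it wrote.
        size = recorded_size(size_file)
        if size is not None:
            os.truncate(append, size)
        if have_journal:
            # An old git-annex wrote the journal file in the meantime;
            # carry its content along with the new data.
            with open(journal, "rb") as f:
                pending = f.read()

    # Record the pre-append size (0 when the file is new).
    before = os.path.getsize(append) if os.path.exists(append) else 0
    with open(size_file, "w") as f:
        f.write(str(before))
        f.flush()
        os.fsync(f.fileno())

    # The append itself; a crash here leaves the size file behind.
    with open(append, "ab") as f:
        f.write(pending + data)
        f.flush()
        os.fsync(f.fileno())

    # Publish the result with a rename, then drop the size file.
    os.replace(append, journal)
    os.remove(size_file)
```

The point of the ordering is that a crash at any step leaves either the old journal file, or an append file plus enough information (the size file) to cut it back to its last complete state.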

When reading journalled files, it would need to also check the append
file, and only read the recorded size. When both the append file and the
journal file exist, it would read both and combine them. This change would
slow down reads slightly, though as seen in comment #10, mostly only for
--batch commands.

(It may not be necessary to lock on read actually. It can check for the
append file and read the size file. If a write is happening at the
same time, the size file may not exist yet, or may have been deleted
already. In either case, reading the whole append file is ok.
Should be possible to make this race-safe without locking.)
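
A minimal Python sketch of that read path, without locking. It assumes the same hypothetical `.size` naming for size files, and that the append file's content comes before the journal file's content when both exist. It does not claim to close every race: for example, if the writer moves the append file back between the two `open` calls, the same content could be read twice.

```python
import os


def read_journalled(journal_dir, append_dir, name):
    journal = os.path.join(journal_dir, name)
    append = os.path.join(append_dir, name)
    size_file = append + ".size"  # hypothetical naming scheme
    parts = []

    try:
        with open(append, "rb") as f:
            try:
                with open(size_file) as s:
                    limit = int(s.read().strip())
            except (FileNotFoundError, ValueError):
                limit = None  # no usable size file: read the whole file
            parts.append(f.read() if limit is None else f.read(limit))
    except FileNotFoundError:
        pass  # no append in progress for this file

    try:
        with open(journal, "rb") as f:
            parts.append(f.read())
    except FileNotFoundError:
        pass  # nothing in the journal itself

    # None means nothing is journalled; the caller would fall back to
    # the git-annex branch.
    return b"".join(parts) if parts else None
```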

When staging the journal, it would need to first handle any interrupted
appends, by checking if any append files exist.

1. Truncate the append file to the value in the size file.
2. Read the value of the file from the branch.
3. Append the value of the file from the branch to the append file.
   (This is to handle a case with old git-annex having written
   divergent data to the branch, see below.)
4. Move the append file back to the journal.
5. Delete the size file.
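
The recovery steps above might look like this in Python. Again a sketch only: `read_branch` is a hypothetical callable standing in for reading a file's value from the git-annex branch (returning empty bytes when the branch has no such file), the `.size` naming is an assumption, and a missing or unreadable size file is assumed to mean "skip truncating". The case where a journal file of the same name also exists is not handled here.

```python
import os


def recover_interrupted_appends(journal_dir, append_dir, read_branch):
    # read_branch: hypothetical callable, file name -> bytes.
    for entry in sorted(os.listdir(append_dir)):
        if entry.endswith(".size"):
            continue  # size files are handled with their append file
        append = os.path.join(append_dir, entry)
        size_file = append + ".size"  # hypothetical naming scheme

        # Drop whatever the interrupted append managed to write.
        try:
            with open(size_file) as f:
                os.truncate(append, int(f.read().strip()))
        except (FileNotFoundError, ValueError):
            pass  # assumption: without a usable size, skip truncating

        # Add the branch's value, in case an old git-annex committed
        # divergent data there.
        with open(append, "ab") as f:
            f.write(read_branch(entry))
            f.flush()
            os.fsync(f.fileno())

        # Move the append file back to the journal, then clean up.
        os.replace(append, os.path.join(journal_dir, entry))
        try:
            os.remove(size_file)
        except FileNotFoundError:
            pass
```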

----

When a new git-annex is doing an append and an old git-annex is also in use,
the old git-annex will not see files in the journal that are in the process
of being appended to. So it might use out-of-date information for queries.
When it's making a write, it has always first read with the journal locked,
so it will block until the append is complete. So it will not use
out-of-date information for writes.

Only when something was written to the journal, but not committed to the
branch, and then an append happened but got interrupted will the old
git-annex miss data. It will not see that data, and might make its own
divergent changes, that get committed to the branch. The new git-annex
will need to deal with this when handling interrupted appends.
"""]]