optimise journal writes to not mkdir journal directory when it already exists

Sponsored-by: Dartmouth College's DANDI project
Joey Hess 2022-07-14 12:28:16 -04:00
parent 5e407304a2
commit ad467791c1
4 changed files with 19 additions and 4 deletions


@@ -81,14 +81,16 @@ setJournalFile _jl ru file content = withOtherTmp $ \tmp -> do
 		( return gitAnnexPrivateJournalDir
 		, return gitAnnexJournalDir
 		)
-	createAnnexDirectory jd
 	-- journal file is written atomically
 	let jfile = journalFile file
 	let tmpfile = tmp P.</> jfile
-	liftIO $ do
+	let write = liftIO $ do
 		withFile (fromRawFilePath tmpfile) WriteMode $ \h ->
 			writeJournalHandle h content
 		moveFile tmpfile (jd P.</> jfile)
+	-- avoid overhead of creating the journal directory when it already
+	-- exists
+	write `catchIO` (const (createAnnexDirectory jd >> write))
 
 data JournalledContent
 	= NoJournalledContent
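For reference, the same write-first, mkdir-on-failure pattern outside of git-annex's Annex monad might look like the minimal sketch below; `writeIntoDir` is a hypothetical helper, not git-annex's actual API, and it omits the atomic write-to-tmp-then-rename step that setJournalFile performs:

```haskell
import Control.Exception (IOException, catch)
import System.Directory (createDirectoryIfMissing)
import System.FilePath ((</>))

-- Write a file under dir, creating dir only when the first attempt fails.
-- In the common case the directory already exists, so the mkdir syscalls
-- are skipped entirely; on failure the directory (and any missing parents)
-- is created and the write is retried once.
writeIntoDir :: FilePath -> FilePath -> String -> IO ()
writeIntoDir dir name content = write `catch` retry
  where
    write = writeFile (dir </> name) content
    retry :: IOException -> IO ()
    retry _ = createDirectoryIfMissing True dir >> write
```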


@@ -27,3 +27,5 @@ May be changes to those .web files in journal could be done "in place" by append
 may be there is a way to "stagger" those --batch additions somehow so all thousands of URLs are added in a single "run" thus having a single "copy/move" and locking/stat'ing syscalls?
 
 PS More information could be found at [dandisets/issues/225](https://github.com/dandi/dandisets/issues/225 )
+
+[[!tag projects/dandi]]


@@ -9,9 +9,10 @@ randomly distributed?
 
 It sounds like it's more randomly distributed, if you're walking a tree and
 adding each file you encounter, and some of them have the same content so
-the same url and key.
+the same key.
 
-If it was not randomly distributed, a nice optimisation would be for
+But your strace shows repeated writes for the same key, so maybe they bunch
+up? If it was not randomly distributed, a nice optimisation would be for
 registerurl to buffer urls as long as the key is the same, and then do a
 single write for that key of all the urls. But it can't really buffer like
 that if it's randomly distributed; the buffer could use a large amount of
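The buffering idea mentioned there could be sketched roughly as follows; these are assumed names for illustration, not git-annex's real registerurl code:

```haskell
import Data.Function (on)
import Data.List (groupBy)

-- Hypothetical sketch: collapse consecutive (key, url) pairs that share a
-- key into a single write per run. Only adjacent duplicates are grouped,
-- so memory use stays bounded on randomly ordered input, but the saving
-- only materialises when urls for the same key arrive together.
registerUrlsBuffered :: (String -> [String] -> IO ()) -> [(String, String)] -> IO ()
registerUrlsBuffered writeKeyUrls pairs =
    mapM_ (\run -> writeKeyUrls (fst (head run)) (map snd run))
          (groupBy ((==) `on` fst) pairs)
```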


@ -0,0 +1,10 @@
[[!comment format=mdwn
username="joey"
subject="""comment 3"""
date="2022-07-14T16:16:35Z"
content="""
I've optimised away the repeated mkdir of the journal.
Probably not a big win in this particular edge case, but a nice general
win..
"""]]