optimise journal writes to not mkdir journal directory when it already exists

Sponsored-by: Dartmouth College's DANDI project
Joey Hess 2022-07-14 12:28:16 -04:00
parent 5e407304a2
commit ad467791c1
4 changed files with 19 additions and 4 deletions


@@ -81,14 +81,16 @@ setJournalFile _jl ru file content = withOtherTmp $ \tmp -> do
 		( return gitAnnexPrivateJournalDir
 		, return gitAnnexJournalDir
 		)
-	createAnnexDirectory jd
 	-- journal file is written atomically
 	let jfile = journalFile file
 	let tmpfile = tmp P.</> jfile
-	liftIO $ do
+	let write = liftIO $ do
 		withFile (fromRawFilePath tmpfile) WriteMode $ \h ->
 			writeJournalHandle h content
 		moveFile tmpfile (jd P.</> jfile)
+	-- avoid overhead of creating the journal directory when it already
+	-- exists
+	write `catchIO` (const (createAnnexDirectory jd >> write))
 
 data JournalledContent
 	= NoJournalledContent
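For reference, the same write-first, mkdir-on-failure pattern outside of git-annex's Annex monad might look like the minimal sketch below; `writeIntoDir` is a hypothetical helper, not git-annex's actual API, and it omits the atomic write-to-tmp-then-rename step that setJournalFile performs:

```haskell
import Control.Exception (IOException, catch)
import System.Directory (createDirectoryIfMissing)
import System.FilePath ((</>))

-- Write a file under dir, creating dir only when the first attempt fails.
-- In the common case the directory already exists, so the mkdir syscalls
-- are skipped entirely; on failure the directory (and any missing parents)
-- is created and the write is retried once.
writeIntoDir :: FilePath -> FilePath -> String -> IO ()
writeIntoDir dir name content = write `catch` retry
  where
    write = writeFile (dir </> name) content
    retry :: IOException -> IO ()
    retry _ = createDirectoryIfMissing True dir >> write
```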


@@ -27,3 +27,5 @@ May be changes to those .web files in journal could be done "in place" by append
 may be there is a way to "stagger" those --batch additions somehow so all thousands of URLs are added in a single "run" thus having a single "copy/move" and locking/stat'ing syscalls?
 
 PS More information could be found at [dandisets/issues/225](https://github.com/dandi/dandisets/issues/225 )
+
+[[!tag projects/dandi]]


@@ -9,9 +9,10 @@ randomly distributed?
 
 It sounds like it's more randomly distributed, if you're walking a tree and
 adding each file you encounter, and some of them have the same content so
-the same url and key.
+the same key.
 
-If it was not randomly distributed, a nice optimisation would be for
+But your strace shows repeated writes for the same key, so maybe they bunch
+up? If it was not randomly distributed, a nice optimisation would be for
 registerurl to buffer urls as long as the key is the same, and then do a
 single write for that key of all the urls. But it can't really buffer like
 that if it's randomly distributed; the buffer could use a large amount of
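The buffering idea mentioned there could be sketched roughly as follows; these are assumed names for illustration, not git-annex's real registerurl code:

```haskell
import Data.Function (on)
import Data.List (groupBy)

-- Hypothetical sketch: collapse consecutive (key, url) pairs that share a
-- key into a single write per run. Only adjacent duplicates are grouped,
-- so memory use stays bounded on randomly ordered input, but the saving
-- only materialises when urls for the same key arrive together.
registerUrlsBuffered :: (String -> [String] -> IO ()) -> [(String, String)] -> IO ()
registerUrlsBuffered writeKeyUrls pairs =
    mapM_ (\run -> writeKeyUrls (fst (head run)) (map snd run))
          (groupBy ((==) `on` fst) pairs)
```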


@ -0,0 +1,10 @@
[[!comment format=mdwn
username="joey"
subject="""comment 3"""
date="2022-07-14T16:16:35Z"
content="""
I've optimised away the repeated mkdir of the journal.
Probably not a big win in this particular edge case, but a nice general
win..
"""]]