From 5e407304a2fa53743dbb6a640f7f9ca3290eb94f Mon Sep 17 00:00:00 2001
From: Joey Hess
Date: Thu, 14 Jul 2022 12:13:28 -0400
Subject: [PATCH] comment with a question

---
 ..._98cd9e5dd449cbd834f53d97fd5dfc13._comment | 19 +++++++
 ..._a4fce84f5777ed582fa599778835455f._comment | 55 +++++++++++++++++++
 2 files changed, 74 insertions(+)
 create mode 100644 doc/todo/registerurl__58___do_changes_in_journal___34__in_place__34____63__/comment_1_98cd9e5dd449cbd834f53d97fd5dfc13._comment
 create mode 100644 doc/todo/registerurl__58___do_changes_in_journal___34__in_place__34____63__/comment_2_a4fce84f5777ed582fa599778835455f._comment

diff --git a/doc/todo/registerurl__58___do_changes_in_journal___34__in_place__34____63__/comment_1_98cd9e5dd449cbd834f53d97fd5dfc13._comment b/doc/todo/registerurl__58___do_changes_in_journal___34__in_place__34____63__/comment_1_98cd9e5dd449cbd834f53d97fd5dfc13._comment
new file mode 100644
index 0000000000..2d0d6eb3a1
--- /dev/null
+++ b/doc/todo/registerurl__58___do_changes_in_journal___34__in_place__34____63__/comment_1_98cd9e5dd449cbd834f53d97fd5dfc13._comment
@@ -0,0 +1,19 @@
+[[!comment format=mdwn
+ username="joey"
+ subject="""comment 1"""
+ date="2022-07-14T15:17:41Z"
+ content="""
+You have not ruled out the flat directory structure being a problem,
+if your system has different performance than mine. It would be good if
+you could try the simple test I showed there, to check whether
+reading/writing a file in a large directory is indeed a problem.
+
+Anyway, nice observation here; growing such a large log file
+one line at a time with rewrites is of course going to be slow.
+That would be a nice optimisation target.
+
+(Also, the redundant mkdir/stat/etc on every write are not helping
+performance. git-annex never rmdirs the journal, so those could easily be
+eliminated by only doing a mkdir when a write fails because the
+journal directory does not exist.)
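The mkdir-on-failure idea in the parenthetical above can be sketched as
follows. This is a Python sketch, not git-annex's actual Haskell code, and
the function and directory names are hypothetical; it only illustrates the
pattern of optimistically writing and creating the directory on failure.

```python
import os

def journal_write(journal_dir, name, content):
    # Optimistic path: assume the journal directory already exists,
    # so most writes pay no mkdir/stat cost at all.
    path = os.path.join(journal_dir, name)
    try:
        with open(path, "w") as f:
            f.write(content)
    except FileNotFoundError:
        # Rare path: the journal directory was missing (e.g. fresh
        # repository). Create it and retry the write once. Since
        # git-annex never rmdirs the journal, this happens at most once.
        os.makedirs(journal_dir, exist_ok=True)
        with open(path, "w") as f:
            f.write(content)
```

Because the directory is never removed, the failure branch runs at most
once per repository lifetime, instead of paying a mkdir/stat on every write.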
+"""]] diff --git a/doc/todo/registerurl__58___do_changes_in_journal___34__in_place__34____63__/comment_2_a4fce84f5777ed582fa599778835455f._comment b/doc/todo/registerurl__58___do_changes_in_journal___34__in_place__34____63__/comment_2_a4fce84f5777ed582fa599778835455f._comment new file mode 100644 index 0000000000..6efc0a53a4 --- /dev/null +++ b/doc/todo/registerurl__58___do_changes_in_journal___34__in_place__34____63__/comment_2_a4fce84f5777ed582fa599778835455f._comment @@ -0,0 +1,55 @@ +[[!comment format=mdwn + username="joey" + subject="""comment 2""" + date="2022-07-14T15:23:04Z" + content=""" +@yoh is your use of registerurl typically going to add all the urls for a +given key in succession, and then move on to the next key? Or are the keys +randomly distributed? + +It sounds like it's more randomly distributed, if you're walking a tree and +adding each file you encounter, and some of them have the same content so +the same url and key. + +If it was not randomly distributed, a nice optimisation would be for +registerurl to buffer urls as long as the key is the same, and then do a +single write for that key of all the urls. But it can't really buffer like +that if it's randomly distributed; the buffer could use a large amount of +memory then. + +---- + +Simply appending would be a nice optimisation, but setUrlPresent currently +compacts the log after adding the new url to it. That handles the case +where the url was already in the log, so it does not get logged twice. +Compacting here is not strictly necessary (the log is also compacted when +its queried), but committing a log file with many copies of the same url +would slow down later accesses of the log file. + +So it needs to check before appending if the url is already in the log +(with the same or newer vector clock). This would still be worth doing, +it avoids the repeated rewrites. But there would still be the overhead +of rereading the file each time. + +Hmm... 
+git-annex could cache the last journal file it wrote, and only use
+that cache to check whether the line it's writing is already in the file
+before appending to the file. Using the cache this way seems like it could
+avoid needing to invalidate the cache when some other process modifies the
+journal file. There are two cases:
+
+1. The other process removes the line that was already in the file, which
+   is being logged a second time. In this case the line stays removed.
+   This is the same as if the other process had waited until after the
+   second write and removed it then, so I think this is ok.
+2. The other process adds the same line itself. This is unlikely due to
+   vector clocks, but it could happen. In this case, the file gets two
+   copies of the same line, which is a little inefficient but does not
+   change any behavior.
+
+Unfortunately, with random distribution, as discussed above, that caching
+would not help, since git-annex can't cache every log for every key.
+
+Anyway, writes are the more expensive thing, so it's still worth
+implementing appending, even if it still needs to read the log file
+first.
+"""]]
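The tradeoff discussed in this comment, between the current
rewrite-and-compact behaviour and the proposed check-then-append, can be
sketched as below. This is a hypothetical Python sketch operating on a log
as a list of lines, not git-annex's Haskell implementation; real log lines
also carry vector clocks, which this sketch omits.

```python
def set_url_present_compacting(log_lines, url):
    # Current behaviour: add the url, then compact out duplicate lines,
    # which forces the caller to rewrite the entire log file each time.
    lines = log_lines + [url]
    seen, compacted = set(), []
    for line in lines:
        if line not in seen:
            seen.add(line)
            compacted.append(line)
    return compacted  # caller rewrites the whole file with this

def set_url_present_appending(log_lines, url):
    # Proposed behaviour: reread the log, and append only when the url is
    # not already logged. Duplicates are still avoided, but the caller
    # only ever appends one line, never rewriting the file.
    if url in log_lines:
        return log_lines
    return log_lines + [url]  # caller appends just the new line
```

Both keep the committed log free of duplicates; the second avoids the
repeated rewrites, at the cost of rereading the file on each call, which is
the remaining overhead the caching idea above tries to address.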