comment with a question

2022-07-14 12:13:28 -04:00
commit 5e407304a2
parent c4cca7e6c6
2 changed files with 74 additions and 0 deletions
--- a/doc/todo/registerurl58_do_changes_in_journal_34in_place3463/comment_1_98cd9e5dd449cbd834f53d97fd5dfc13._comment
+++ b/doc/todo/registerurl58_do_changes_in_journal_34in_place3463/comment_1_98cd9e5dd449cbd834f53d97fd5dfc13._comment
@@ -0,0 +1,19 @@
 [[!comment format=mdwn
 username="joey"
 subject="""comment 1"""
 date="2022-07-14T15:17:41Z"
 content="""
 You have not ruled out the flat directory structure being a problem,
 if your system has different performance than mine. It would be good if you
 could try the simple test I showed there to check if reading/writing a file
 to a large directory is indeed a problem.
 Anyway, nice observation here; growing such a large log file
 one line at a time with rewrites is of course gonna be slow.
 That would be a nice optimisation target.
 (Also, the redundant mkdir/stat/etc on every write are not helping
 performance. git-annex never rmdirs the journal, so those should be easy
 to eliminate by only doing a mkdir when a write fails due to the
 journal directory not existing.)
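A rough sketch of that mkdir-on-failure idea, in Python rather than git-annex's Haskell. The `write_journal_file` helper and its arguments are hypothetical, not git-annex's actual interface:

```python
import os

def write_journal_file(journal_dir, name, data):
    """Write data to journal_dir/name, creating the directory only on failure."""
    path = os.path.join(journal_dir, name)
    try:
        # Common case: the journal directory already exists, so no
        # mkdir or stat is needed before the write.
        with open(path, "wb") as f:
            f.write(data)
    except FileNotFoundError:
        # The journal directory does not exist yet; create it and retry once.
        os.makedirs(journal_dir, exist_ok=True)
        with open(path, "wb") as f:
            f.write(data)
```

Since the directory is never removed once created, the fallback branch should only run on the first write.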
 """]]
--- a/doc/todo/registerurl58_do_changes_in_journal_34in_place3463/comment_2_a4fce84f5777ed582fa599778835455f._comment
+++ b/doc/todo/registerurl58_do_changes_in_journal_34in_place3463/comment_2_a4fce84f5777ed582fa599778835455f._comment
@@ -0,0 +1,55 @@
 [[!comment format=mdwn
 username="joey"
 subject="""comment 2"""
 date="2022-07-14T15:23:04Z"
 content="""
@yoh is your use of registerurl typically going to add all the urls for a
 given key in succession, and then move on to the next key? Or are the keys
 randomly distributed?
 It sounds like it's more randomly distributed, if you're walking a tree and
 adding each file you encounter, and some of them have the same content so
 the same url and key.
 If it was not randomly distributed, a nice optimisation would be for
 registerurl to buffer urls as long as the key is the same, and then do a
 single write for that key of all the urls. But it can't really buffer like
 that if it's randomly distributed; the buffer could use a large amount of
 memory then.
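To illustrate the buffering idea, here is a minimal Python sketch. `buffered_register` and the `write_urls` callback are made-up names for illustration; this is not how registerurl is implemented:

```python
def buffered_register(pairs, write_urls):
    """Group consecutive (key, url) pairs that share a key into one write.

    write_urls(key, urls) is a hypothetical callback that records all
    the urls for a key in a single write.
    """
    current_key = None
    buffered = []
    for key, url in pairs:
        if key != current_key and buffered:
            # The key changed, so flush what was buffered for the old key.
            write_urls(current_key, buffered)
            buffered = []
        current_key = key
        buffered.append(url)
    if buffered:
        write_urls(current_key, buffered)
```

The buffer only ever holds the urls of one key, so memory stays bounded, but when the same key recurs non-consecutively it gets several writes, which is why this only pays off when the keys arrive grouped.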
 ----
 Simply appending would be a nice optimisation, but setUrlPresent currently
 compacts the log after adding the new url to it. That handles the case
 where the url was already in the log, so it does not get logged twice.
 Compacting here is not strictly necessary (the log is also compacted when
 it's queried), but committing a log file with many copies of the same url
 would slow down later accesses of the log file.
 So it needs to check before appending if the url is already in the log
 (with the same or newer vector clock). This would still be worth doing,
 it avoids the repeated rewrites. But there would still be the overhead
 of rereading the file each time.
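A sketch of check-then-append, in Python. It assumes a simplified, hypothetical log format of one `<clock> <url>` pair per line with a numeric clock; the real git-annex log format and vector clocks are more involved:

```python
def append_if_new(path, clock, url):
    """Append 'clock url' unless the log already has url with the same
    or a newer clock. Returns True if a line was appended."""
    try:
        with open(path, "r", encoding="utf-8") as f:
            for line in f:
                old_clock, _, old_url = line.rstrip("\n").partition(" ")
                if old_url == url and float(old_clock) >= clock:
                    # Already logged; skip the write entirely.
                    return False
    except FileNotFoundError:
        pass  # No log yet, so the line is certainly new.
    # Append instead of rewriting the whole file.
    with open(path, "a", encoding="utf-8") as f:
        f.write(f"{clock} {url}\n")
    return True
```

This still reads the whole file on every call, but the write is a single append rather than a rewrite.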
 Hmm... git-annex could cache the last journal file it wrote, and only use
 that cache to check if the line it's writing is already in the file before
 appending to the file. Using the cache this way seems like it could avoid
 needing to invalidate the cache when some other process modifies the
 journal file. There are two cases:
 1. The other process removes the line that was already in the file, which
   is being logged a second time. In this case the line stays removed.
   This is the same as if the other process had waited until after the
   second time and removed it then, and so I think this is ok.
 2. The other process adds the same line itself. This is unlikely due to
   vector clocks, but it could happen. In this case, the file gets two
   copies of the same line, which is a little inefficient but does not
   change any behavior.
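The caching scheme and its two cases can be sketched as follows, in Python. `AppendingJournal` is a hypothetical class, and it matches lines exactly rather than comparing vector clocks, which is a simplification:

```python
class AppendingJournal:
    """Remembers the lines of the last journal file written, and uses that
    only to decide whether an append can be skipped."""

    def __init__(self):
        self.cached_path = None
        self.cached_lines = set()

    def append_line(self, path, line):
        """Append line to path unless the cache says it is already there.
        Returns True if the line was appended."""
        if path != self.cached_path:
            # Different file than last time: read it to refill the cache.
            try:
                with open(path, encoding="utf-8") as f:
                    lines = {l.rstrip("\n") for l in f}
            except FileNotFoundError:
                lines = set()
            self.cached_path, self.cached_lines = path, lines
        if line in self.cached_lines:
            return False
        with open(path, "a", encoding="utf-8") as f:
            f.write(line + "\n")
        self.cached_lines.add(line)
        return True
```

The cache is never invalidated when the file changes on disk. In case 1 a line removed by another process stays removed, and in case 2 a line added by another process ends up in the file twice.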
 Unfortunately, with random distribution, as discussed above, that caching
 would not help since git-annex can't cache every log for every key.
 Anyway, writes are the more expensive thing, so it's still worth
 implementing appending, even if it still needs to read the log file
 first.
 """]]