From 5e407304a2fa53743dbb6a640f7f9ca3290eb94f Mon Sep 17 00:00:00 2001
From: Joey Hess
Date: Thu, 14 Jul 2022 12:13:28 -0400
Subject: [PATCH] comment with a question

---
 ..._98cd9e5dd449cbd834f53d97fd5dfc13._comment | 19 +++++++
 ..._a4fce84f5777ed582fa599778835455f._comment | 55 +++++++++++++++++++
 2 files changed, 74 insertions(+)
 create mode 100644 doc/todo/registerurl__58___do_changes_in_journal___34__in_place__34____63__/comment_1_98cd9e5dd449cbd834f53d97fd5dfc13._comment
 create mode 100644 doc/todo/registerurl__58___do_changes_in_journal___34__in_place__34____63__/comment_2_a4fce84f5777ed582fa599778835455f._comment

diff --git a/doc/todo/registerurl__58___do_changes_in_journal___34__in_place__34____63__/comment_1_98cd9e5dd449cbd834f53d97fd5dfc13._comment b/doc/todo/registerurl__58___do_changes_in_journal___34__in_place__34____63__/comment_1_98cd9e5dd449cbd834f53d97fd5dfc13._comment
new file mode 100644
index 0000000000..2d0d6eb3a1
--- /dev/null
+++ b/doc/todo/registerurl__58___do_changes_in_journal___34__in_place__34____63__/comment_1_98cd9e5dd449cbd834f53d97fd5dfc13._comment
@@ -0,0 +1,19 @@
+[[!comment format=mdwn
+ username="joey"
+ subject="""comment 1"""
+ date="2022-07-14T15:17:41Z"
+ content="""
+You have not ruled out the flat directory structure being a problem,
+if your system has different performance than mine. It would be good if
+you could try the simple test I showed there, to check whether
+reading/writing a file in a large directory is indeed a problem.
+
+Anyway, nice observation here; growing such a large log file
+one line at a time with rewrites is of course going to be slow.
+That would be a nice optimisation target.
+
+(Also, the redundant mkdir/stat/etc on every write are not helping
+performance. git-annex never rmdirs the journal, so those could easily be
+eliminated by only doing a mkdir when a write fails because the
+journal directory does not exist.)
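The mkdir-on-failure idea in the parenthetical above can be sketched as
follows. This is a Python sketch, not git-annex's actual Haskell code, and
the function and directory names are hypothetical; it only illustrates the
pattern of optimistically writing and creating the directory on failure.

```python
import os

def journal_write(journal_dir, name, content):
    # Optimistic path: assume the journal directory already exists,
    # so most writes pay no mkdir/stat cost at all.
    path = os.path.join(journal_dir, name)
    try:
        with open(path, "w") as f:
            f.write(content)
    except FileNotFoundError:
        # Rare path: the journal directory was missing (e.g. fresh
        # repository). Create it and retry the write once. Since
        # git-annex never rmdirs the journal, this happens at most once.
        os.makedirs(journal_dir, exist_ok=True)
        with open(path, "w") as f:
            f.write(content)
```

Because the directory is never removed, the failure branch runs at most
once per repository lifetime, instead of paying a mkdir/stat on every write.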
+"""]] diff --git a/doc/todo/registerurl__58___do_changes_in_journal___34__in_place__34____63__/comment_2_a4fce84f5777ed582fa599778835455f._comment b/doc/todo/registerurl__58___do_changes_in_journal___34__in_place__34____63__/comment_2_a4fce84f5777ed582fa599778835455f._comment new file mode 100644 index 0000000000..6efc0a53a4 --- /dev/null +++ b/doc/todo/registerurl__58___do_changes_in_journal___34__in_place__34____63__/comment_2_a4fce84f5777ed582fa599778835455f._comment @@ -0,0 +1,55 @@ +[[!comment format=mdwn + username="joey" + subject="""comment 2""" + date="2022-07-14T15:23:04Z" + content=""" +@yoh is your use of registerurl typically going to add all the urls for a +given key in succession, and then move on to the next key? Or are the keys +randomly distributed? + +It sounds like it's more randomly distributed, if you're walking a tree and +adding each file you encounter, and some of them have the same content so +the same url and key. + +If it was not randomly distributed, a nice optimisation would be for +registerurl to buffer urls as long as the key is the same, and then do a +single write for that key of all the urls. But it can't really buffer like +that if it's randomly distributed; the buffer could use a large amount of +memory then. + +---- + +Simply appending would be a nice optimisation, but setUrlPresent currently +compacts the log after adding the new url to it. That handles the case +where the url was already in the log, so it does not get logged twice. +Compacting here is not strictly necessary (the log is also compacted when +its queried), but committing a log file with many copies of the same url +would slow down later accesses of the log file. + +So it needs to check before appending if the url is already in the log +(with the same or newer vector clock). This would still be worth doing, +it avoids the repeated rewrites. But there would still be the overhead +of rereading the file each time. + +Hmm... 
+git-annex could cache the last journal file it wrote, and only use
+that cache to check whether the line it's writing is already in the file
+before appending to the file. Using the cache this way seems like it could
+avoid needing to invalidate the cache when some other process modifies the
+journal file. There are two cases:
+
+1. The other process removes the line that was already in the file, which
+   is being logged a second time. In this case the line stays removed.
+   This is the same as if the other process had waited until after the
+   second write and removed it then, so I think this is ok.
+2. The other process adds the same line itself. This is unlikely due to
+   vector clocks, but it could happen. In this case, the file gets two
+   copies of the same line, which is a little inefficient but does not
+   change any behavior.
+
+Unfortunately, with random distribution, as discussed above, that caching
+would not help, since git-annex can't cache every log for every key.
+
+Anyway, writes are the more expensive thing, so it's still worth
+implementing appending, even if it still needs to read the log file
+first.
+"""]]
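The tradeoff discussed in this comment, between the current
rewrite-and-compact behaviour and the proposed check-then-append, can be
sketched as below. This is a hypothetical Python sketch operating on a log
as a list of lines, not git-annex's Haskell implementation; real log lines
also carry vector clocks, which this sketch omits.

```python
def set_url_present_compacting(log_lines, url):
    # Current behaviour: add the url, then compact out duplicate lines,
    # which forces the caller to rewrite the entire log file each time.
    lines = log_lines + [url]
    seen, compacted = set(), []
    for line in lines:
        if line not in seen:
            seen.add(line)
            compacted.append(line)
    return compacted  # caller rewrites the whole file with this

def set_url_present_appending(log_lines, url):
    # Proposed behaviour: reread the log, and append only when the url is
    # not already logged. Duplicates are still avoided, but the caller
    # only ever appends one line, never rewriting the file.
    if url in log_lines:
        return log_lines
    return log_lines + [url]  # caller appends just the new line
```

Both keep the committed log free of duplicates; the second avoids the
repeated rewrites, at the cost of rereading the file on each call, which is
the remaining overhead the caching idea above tries to address.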