progress in my head

2021-10-06 14:45:12 -04:00 · 2021-10-06 14:45:12 -04:00 · 153f3600fb
commit 153f3600fb
parent cc66c9f9ad
3 changed files with 74 additions and 12 deletions
--- a/doc/bugs/borg_special_remote_memory_usage_high_for_large_borg_repo/comment_10_40a8fbf3c4140e955f7e1503db824aaf._comment
+++ b/doc/bugs/borg_special_remote_memory_usage_high_for_large_borg_repo/comment_10_40a8fbf3c4140e955f7e1503db824aaf._comment
@@ -0,0 +1,35 @@
+[[!comment format=mdwn
+ username="joey"
+ subject="""comment 10"""
+ date="2021-10-06T17:09:50Z"
+ content="""
+There is still a big PINNED spike though. I measured this memory use:
+
+	115344 post listContents
+	133816 post importKeys
+	236676 post recordImportTree
+
+listContents produces an `ImportableContents (ContentIdentifier, ByteSize)`
+and that gets transformed through importKeys 
+to `ImportableContents (Either Sha Key)`. The GC should be able to
+free up the first as it's being traversed, but PINNED still goes up during
+that, and memory increases by 20% or so.
+
+Then recordImportTree calls mktreeitem and treeItemsToTree, which between
+them double the memory.
+
+So I think I understand where the memory use is, although why it's PINNED
+is still not clear, and unpinning could still help. I did try converting
+TopFilePath to ShortByteString, since TreeItems contain them, but it didn't
+reduce the amount PINNED and actually used more memory.
+
+To avoid the allocation entirely, it seems that borg's
+listImportableContents would need to generate a Tree itself, rather than
+using ImportableContents. And it could, probably fairly efficiently, but it
+would not be able to reuse the tree import interface as it does now.
+
+(borg could return an `ImportableContents (Either Sha Key)` more easily,
+and still reuse part of the interface, but the conversion to that only
+uses 20% or so of memory so it's not a big enough win. Also when I looked
+at it, it was still not going to be an easy refactoring.)
+"""]]
--- a/doc/bugs/borg_special_remote_memory_usage_high_for_large_borg_repo/comment_11_fe04d3da8859101ba1649fdd9d5ee39e._comment
+++ b/doc/bugs/borg_special_remote_memory_usage_high_for_large_borg_repo/comment_11_fe04d3da8859101ba1649fdd9d5ee39e._comment
@@ -0,0 +1,39 @@
+[[!comment format=mdwn
+ username="joey"
+ subject="""comment 11"""
+ date="2021-10-06T18:03:23Z"
+ content="""
+@tomdhunt the tree is being stored in git, so the natural way
+to do something like a difference encoding would be a series of trees
+in a commit sequence.
+
+The tree import interface does support that, but borg remote 
+doesn't bother and puts all the items in a single tree. But even if it did,
+it would still populate the same ImportableContents data structure with
+the same amount of data, just in a different layout.
+
+But maybe this line of thinking does point toward a solution. Suppose that
+there was a way for listImportableContents to generate an
+ImportableContentsChunk that contained a subtree, and a continuation to get
+the next subtree. Then each subtree's worth of ImportableContents would be
+passed through to recordImportTree (a version omitting the parts of it that
+commit the tree), and only one subtree at a time would occupy memory. At
+the end a tree would be constructed containing all the subtrees, and
+committed. 
+
+For borg, each archive would be a subtree; 500k filenames will fit in memory
+or at least fit better than `365*500k`.
+
+The interface I'm thinking about is something like this:
+
+	data ChunkedImportableContents info
+		= ImportableContentsChunk
+			{ importableContentsRoot :: ImportLocation
+			, importableContentsSubTree :: [(ImportLocation, info)]
+			-- ^ locations are relative to importableContentsRoot
+			, importableContentsContinuation :: Annex (ChunkedImportableContents info)
+			}
+		| ImportableContentsComplete (ImportableContents info)
+
+This is a promising idea!
+"""]]
--- a/doc/bugs/borg_special_remote_memory_usage_high_for_large_borg_repo/comment_7_f59d9c51716892240ebd12fa80a2e58b._comment
+++ b/doc/bugs/borg_special_remote_memory_usage_high_for_large_borg_repo/comment_7_f59d9c51716892240ebd12fa80a2e58b._comment
@@ -8,16 +8,4 @@ and the -hc profile is unchanged. So the pinned memory is not in refs.

 Also tried converting Key to use ShortByteString. That was a win!
 My 20 borg archive test case is down from 320 mb to 242 mb.
-
-Looking at Command.SyncpullThirdPartyPopulated,
-it calls listContents, which calls borg's listImportableContents,
-and produces an `ImportableContents (ContentIdentifier, ByteSize)`
-then that gets passed through importKeys to produce
-an `ImportableContents (Either Sha Key)`. Probably
-double memory is used while doing that conversion, unless
-the GC manages to free the first one while it's traversed.
-
-If borg's listImportableContents included a Key (which it does
-produce already only to throw away!) that might 
-eliminate the big spike just before treeItemsToTree.
 """]]