progress in my head

2021-10-06 14:45:12 -04:00 · 153f3600fb
commit 153f3600fb
parent cc66c9f9ad
3 changed files with 74 additions and 12 deletions
--- a/doc/bugs/borg_special_remote_memory_usage_high_for_large_borg_repo/comment_10_40a8fbf3c4140e955f7e1503db824aaf._comment
+++ b/doc/bugs/borg_special_remote_memory_usage_high_for_large_borg_repo/comment_10_40a8fbf3c4140e955f7e1503db824aaf._comment
@@ -0,0 +1,35 @@
 [[!comment format=mdwn
 username="joey"
 subject="""comment 10"""
 date="2021-10-06T17:09:50Z"
 content="""
 There is still a big PINNED spike though. I measured this memory use:
 	115344 post listContents
 	133816 post importKeys
 	236676 post recordImportTree
 listContents produces an `ImportableContents (ContentIdentifier, ByteSize)`
 and that gets transformed through importKeys 
 to `ImportableContents (Either Sha Key)`. The GC should be able to
 free up the first as it's being traversed, but PINNED still goes up during
 that, and memory increases by 20% or so.
 Then recordImportTree calls mktreeitem and treeItemsToTree, which between
 them double the memory.
 So I think I understand where the memory use is, although why it's PINNED
 is still not clear, and unpinning could still help. I did try converting
 TopFilePath to ShortByteString, since TreeItems contain them, but it didn't
 reduce the amount PINNED and actually used more memory.
 To avoid the allocation entirely, it seems that borg's
 listImportableContents would need to generate a Tree itself, rather than
 using ImportableContents. And it could, probably fairly efficiently, but it
 would not be able to reuse the tree import interface as it does now.
 (borg could return an `ImportableContents (Either Sha Key)` more easily,
 and still reuse part of the interface, but the conversion to that only
 uses 20% or so of memory so it's not a big enough win. Also when I looked
 at it, it was still not going to be an easy refactoring.)
 """]]
--- a/doc/bugs/borg_special_remote_memory_usage_high_for_large_borg_repo/comment_11_fe04d3da8859101ba1649fdd9d5ee39e._comment
+++ b/doc/bugs/borg_special_remote_memory_usage_high_for_large_borg_repo/comment_11_fe04d3da8859101ba1649fdd9d5ee39e._comment
@@ -0,0 +1,39 @@
 [[!comment format=mdwn
 username="joey"
 subject="""comment 11"""
 date="2021-10-06T18:03:23Z"
 content="""
 @tomdhunt the tree is being stored in git, so the natural way
 to do something like a difference encoding would be a series of trees
 in a commit sequence.
 The tree import interface does support that, but the borg remote
 doesn't bother and puts all the items in a single tree. But even if it did,
 it would still populate the same ImportableContents data structure with
 the same amount of data, just in a different layout.
 But maybe this line of thinking does point toward a solution.. Suppose that
 there was a way for listImportableContents to generate an
 ImportableContentsChunk that contained a subtree, and a continuation to get
 the next subtree. Then each subtree's worth of ImportableContents would be
 passed through to recordImportTree (a version omitting the parts of it that
 commit the tree), and only one subtree at a time would occupy memory. At
 the end a tree would be constructed containing all the subtrees, and
 committed.
 For borg, each archive would be a subtree; 500k filenames will fit in memory
 or at least fit better than `365*500k`.
 The interface I'm thinking about is something like this:
 	data ChunkedImportableContents info
 		= ImportableContentsChunk
 			{ importableContentsRoot :: ImportLocation
 			, importableContentsSubTree :: [(ImportLocation, info)]
 			-- ^ locations are relative to importableContentsRoot
 			, importableContentsContinuation :: Annex (ChunkedImportableContents info)
 			}
 		| ImportableContentsComplete (ImportableContents info)
 This is a promising idea!
 """]]
--- a/doc/bugs/borg_special_remote_memory_usage_high_for_large_borg_repo/comment_7_f59d9c51716892240ebd12fa80a2e58b._comment
+++ b/doc/bugs/borg_special_remote_memory_usage_high_for_large_borg_repo/comment_7_f59d9c51716892240ebd12fa80a2e58b._comment
@@ -8,16 +8,4 @@ and the -hc profile is unchanged. So the pinned memory is not in refs.
 Also tried converting Key to use ShortByteString. That was a win!
 My 20 borg archive test case is down from 320 mb to 242 mb.
 Looking at Command.Sync.pullThirdPartyPopulated,
 it calls listContents, which calls borg's listImportableContents,
 and produces an `ImportableContents (ContentIdentifier, ByteSize)`
 then that gets passed through importKeys to produce
 an `ImportableContents (Either Sha Key)`. Probably
 double memory is used while doing that conversion, unless
 the GC manages to free the first one while it's traversed.
 If borg's listImportableContents included a Key (which it does
 produce already only to throw away!) that might 
 eliminate the big spike just before treeItemsToTree.
 """]]