progress in my head

This commit is contained in:
Joey Hess 2021-10-06 14:45:12 -04:00
parent cc66c9f9ad
commit 153f3600fb
No known key found for this signature in database
GPG key ID: DB12DB0FF05F8F38
3 changed files with 74 additions and 12 deletions

View file

@ -0,0 +1,35 @@
[[!comment format=mdwn
username="joey"
subject="""comment 10"""
date="2021-10-06T17:09:50Z"
content="""
There is still a big PINNED spike though. I measured this memory use:
115344 post listContents
133816 post importKeys
236676 post recordImportTree
listContents produces an `ImportableContents (ContentIdentifier, ByteSize)`
and that gets transformed through importKeys
to `ImportableContents (Either Sha Key)`. The GC should be able to
free up the first as it's being traversed, but PINNED still goes up during
that, and memory increases by 20% or so.
Then recordImportTree calls mktreeitem and treeItemsToTree, which between
then double the memory.
So I think I understand where the memory use is, although why it's PINNED
is still not clear, and unpinning could still help. I did try converting
TopFilePath to ShortByteString, since TreeItems contain them, but it didn't
reduce the amount PINNED and actually used more memory.
To avoid the allocation entirely, it seems that borg's
listImportableContents would need to generate a Tree itself, rather than
using ImportableContents. And it could, probably fairly efficiently, but it
would not be able to reuse the tree import interface as it does now.
(borg could return a `ImportableContents (Either Sha Key)` more easily,
and still reuse part of the interface, but the conversion to that only
uses 20% or so of memory so it's not a big enough win. Also when I looked
at it, it was still not going to be an easy refactoring.)
"""]]

View file

@ -0,0 +1,39 @@
[[!comment format=mdwn
username="joey"
subject="""comment 11"""
date="2021-10-06T18:03:23Z"
content="""
@tomdhunt the tree is being stored in git, so the natural way
to do something like a difference encoding would be a series of trees
in a commit sequence.
The tree import interface does support that, but borg remote
doesn't bother and puts all the items in a single tree. But even if it did,
it would still populate the same ImportableContents data structure with
the same amount of data just a different layout.
But maybe this line of thinking does point toward a solution.. Suppose that
there was a way for listImportableContents to generate an
ImportableContentsChunk that contained a subtree, and a continuation to get
the next subtree. Then each subtree's worth of ImportableContents would be
passed through to recordImportTree (a version omitting the parts of it that
commit the tree), and only one subtree at a time would occupy memory. At
the end a tree would be constucted containing all the subtrees, and
committed.
For borg, each archive would be a subtree; 500k filenames will fit in memory
or at least fit better than `365*500k`.
The interface I'm thinking about is something like this:
data ChunkedImportableContents info
= ImportableContentsChunk
{ importableContentsRoot :: ImportLocation
, importableContentsSubTree :: [(ImportLocation, info)]
-- ^ locations are relative to importableContentsRoot
, importableContentsContinuation :: Annex (ChunkedImportableContents info)
}
| ImportableContentsComplete (ImportableContents info)
This is a promising idea!
"""]]

View file

@ -8,16 +8,4 @@ and the -hc profile is unchanged. So the pinned memory is not in refs.
Also tried converting Key to use ShortByteString. That was a win!
My 20 borg archive test case is down from 320 mb to 242 mb.
Looking at Command.SyncpullThirdPartyPopulated,
it calls listContents, which calls borg's listImportableContents,
and produces an `ImportableContents (ContentIdentifier, ByteSize)`
then that gets passed through importKeys to produce
an `ImportableContents (Either Sha Key)`. Probably
double memory is used while doing that conversion, unless
the GC manages to free the first one while it's traversed.
If borg's listImportableContents included a Key (which it does
produce already only to throw away!) that might
eliminate the big spike just before treeItemsToTree.
"""]]