This commit is contained in:
Joey Hess 2024-11-14 15:27:00 -04:00
parent f724ff0388
commit b9b3e1257d
No known key found for this signature in database
GPG key ID: DB12DB0FF05F8F38
4 changed files with 66 additions and 0 deletions


@@ -0,0 +1,26 @@
[[!comment format=mdwn
username="joey"
subject="""comment 2"""
date="2024-11-14T18:23:54Z"
content="""
No, it does not request versions from S3 when versioning is not enabled.

This feels fairly similar to
[[git-annex-import_stalls_and_uses_all_ram_available]],
but I don't think it's really the same: that one used versioning, and relied
on preferred content to filter the wanted files.

Is the size of the whole bucket under the fileprefix, in your case, large
enough that storing a list of all the files (without the versions) could
logically take as much memory as you're seeing? At one point you said it
was 7k files, but later hundreds of thousands, so I'm confused about how
big it is.

Is this bucket supposed to be public? I am having difficulty finding an
initremote command that works.

It also seems quite possible, looking at the code, that it's keeping all
the responses from S3 in memory until it gets done with listing all the
files, which would further increase memory use.
I don't see any `O(N^2)` operations though.
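The difference between accumulating every response and processing pages as
they arrive can be sketched like this (a hypothetical illustration in
Python, not git-annex's actual Haskell code):

```python
def list_all_accumulating(pages):
    # Keeps every page's contents alive in one list until the whole
    # listing finishes, so memory grows with the size of the bucket.
    results = []
    for page in pages:
        results.extend(page)
    return results

def list_all_streaming(pages):
    # Yields keys as each page arrives; a page can be garbage-collected
    # as soon as its keys have been consumed, so memory stays flat.
    for page in pages:
        yield from page
```

Both produce the same keys; only the retention behavior differs.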
"""]]


@@ -0,0 +1,13 @@
[[!comment format=mdwn
username="joey"
subject="""comment 3"""
date="2024-11-14T18:50:37Z"
content="""
This is the initremote for it:

    git-annex initremote dandiarchive type=S3 encryption=none fileprefix=dandisets/ bucket=dandiarchive publicurl=https://dandiarchive.s3.amazonaws.com/ signature=anonymous host=s3.amazonaws.com datacenter=US importtree=yes

It started at 1 API call per second, but it slowed down as memory rapidly
went up: 3 GB in a few minutes, so I think there is definitely a memory
leak involved.
"""]]


@@ -0,0 +1,9 @@
[[!comment format=mdwn
username="joey"
subject="""comment 4"""
date="2024-11-14T19:05:48Z"
content="""
I suspect one way the CLI tool is faster, aside from not leaking memory,
is that there is a max-keys parameter that git-annex is not using.
Less pagination would speed it up.
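The effect on pagination can be roughed out: each ListObjects response
returns at most max-keys objects, so a full listing needs a request per
page. This is a back-of-envelope sketch with hypothetical numbers; the
function name is illustrative, not an actual git-annex or aws API:

```python
import math

def request_count(total_keys, max_keys=1000):
    # One ListObjects request per page of up to max_keys objects;
    # an empty bucket still takes one request to discover it is empty.
    return max(1, math.ceil(total_keys / max_keys))
```

So 7k files would take 7 requests, while hundreds of thousands of files
would take hundreds of requests, each adding round-trip latency.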
"""]]


@@ -0,0 +1,18 @@
[[!comment format=mdwn
username="joey"
subject="""comment 5"""
date="2024-11-14T19:21:33Z"
content="""
Apparently `gbrNextMarker` is Nothing despite the response being truncated. So
git-annex is looping forever, getting the same first page each time, and
storing it all in a list.

I think this is a bug in the aws library, or I'm using it wrong.
It looks for a NextMarker in the response XML, but according to
<https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListObjects.html>:

> This element is returned only if you have the delimiter request parameter
> specified. If the response does not include the NextMarker element and it is
> truncated, you can use the value of the last Key element in the response as the
> marker parameter in the subsequent request to get the next set of object keys.
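The fallback the documentation describes can be sketched as follows
(illustrative Python with made-up field names, not the aws library's
actual types):

```python
def next_request_marker(is_truncated, next_marker, keys):
    # Decide the marker for the next ListObjects request.
    if not is_truncated:
        return None          # listing is complete, no further request
    if next_marker is not None:
        return next_marker   # only present when a delimiter was requested
    return keys[-1]          # fall back to the last Key of this page
```

Using the last Key when NextMarker is absent would break the loop of
re-fetching the same first page.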
"""]]