comments
parent
f724ff0388
commit
b9b3e1257d
4 changed files with 66 additions and 0 deletions
@ -0,0 +1,26 @@
[[!comment format=mdwn
username="joey"
subject="""comment 2"""
date="2024-11-14T18:23:54Z"
content="""
No, it does not request versions from S3 when versioning is not enabled.

This feels fairly similar to
[[git-annex-import_stalls_and_uses_all_ram_available]].
But I don't think it's really the same, that one used versioning, and relied
on preferred content to filter the wanted files.

Is the size of the whole bucket under the fileprefix, in your case, large
enough that storing a list of all the files (without the versions) could
logically take as much memory as you're seeing? At one point you said it
was 7k files, but later hundreds of thousands, so I'm confused about how
big it is.

Is this bucket supposed to be public? I am having difficulty finding an
initremote command that works.

It also seems quite possible, looking at the code, that it's keeping all
the responses from S3 in memory until it gets done with listing all the
files, which would further increase memory use.
I don't see any `O(N^2)` operations though.
"""]]
@ -0,0 +1,13 @@
[[!comment format=mdwn
username="joey"
subject="""comment 3"""
date="2024-11-14T18:50:37Z"
content="""
This is the initremote for it:

    git-annex initremote dandiarchive type=S3 encryption=none fileprefix=dandisets/ bucket=dandiarchive publicurl=https://dandiarchive.s3.amazonaws.com/ signature=anonymous host=s3.amazonaws.com datacenter=US importtree=yes

It started at 1 API call per second, but it slowed down as memory rapidly
went up. 3 GB in a few minutes, so I think there is definitely a memory
leak involved.
"""]]
@ -0,0 +1,9 @@
[[!comment format=mdwn
username="joey"
subject="""comment 4"""
date="2024-11-14T19:05:48Z"
content="""
I suspect one way the CLI tool is faster, aside from not leaking memory,
is that there is a `max-keys` parameter that git-annex is not using.
Less pagination would speed it up.
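
As a rough illustration of why page size matters, here is a small Python sketch of how many list requests a full listing takes. The function name and the key counts are made up for the example (the counts only echo the "7k" and "hundreds of thousands" figures mentioned earlier in this thread). As far as I know, S3 returns at most 1000 keys per list response, so this only helps when the page size in use is below that.

```python
import math


def list_requests_needed(total_keys: int, page_size: int) -> int:
    # Hypothetical helper: number of list round trips for a full listing.
    # Even an empty listing costs one request.
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    return max(1, math.ceil(total_keys / page_size))


if __name__ == "__main__":
    # Illustrative numbers only, not measurements of the real bucket.
    for total in (7_000, 300_000):
        for page_size in (100, 1000):
            n = list_requests_needed(total, page_size)
            print(f"{total} keys at {page_size} per page: {n} requests")
```

At roughly one API call per second, the difference between 70 and 7 requests (or 3000 and 300) is most of the wall-clock time.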
"""]]
@ -0,0 +1,18 @@
[[!comment format=mdwn
username="joey"
subject="""comment 5"""
date="2024-11-14T19:21:33Z"
content="""
Apparently gbrNextMarker is Nothing despite the response being truncated. So
git-annex is looping forever, getting the same first page each time, and
storing it all in a list.

I think this is a bug in the aws library, or I'm using it wrong.
It looks for a NextMarker in the response XML, but according to
<https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListObjects.html>

> This element is returned only if you have the delimiter request parameter
> specified. If the response does not include the NextMarker element and it is
> truncated, you can use the value of the last Key element in the response as the
> marker parameter in the subsequent request to get the next set of object keys.
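
The fallback described in that quote can be sketched roughly like this. This is Python rather than the Haskell that git-annex and the aws library actually use, and it is not their code: `Page`, `fake_list_objects` and `list_all_keys` are hypothetical names, and the "bucket" is a toy in-memory list so the loop can be run without talking to S3.

```python
from typing import Callable, Iterator, List, NamedTuple, Optional


class Page(NamedTuple):
    # Hypothetical stand-in for one parsed ListObjects response.
    keys: List[str]
    is_truncated: bool
    next_marker: Optional[str]  # absent unless a delimiter was requested


def fake_list_objects(all_keys: List[str], marker: Optional[str],
                      max_keys: int) -> Page:
    # Toy in-memory bucket: returns the keys sorting strictly after `marker`
    # and never sets next_marker, like a request made without a delimiter.
    remaining = [k for k in sorted(all_keys) if marker is None or k > marker]
    return Page(remaining[:max_keys], len(remaining) > max_keys, None)


def list_all_keys(fetch: Callable[[Optional[str]], Page]) -> Iterator[str]:
    # Yields keys page by page instead of accumulating every response.
    marker: Optional[str] = None
    while True:
        page = fetch(marker)
        yield from page.keys
        if not page.is_truncated:
            return
        # Prefer NextMarker; otherwise fall back to the last key of the page.
        next_marker = page.next_marker or (page.keys[-1] if page.keys else None)
        if next_marker is None or next_marker == marker:
            # Without this check a missing marker refetches the same page forever.
            raise RuntimeError("truncated listing made no progress")
        marker = next_marker


if __name__ == "__main__":
    bucket = [f"dandisets/{i:04d}" for i in range(25)]  # made-up keys
    keys = list(list_all_keys(lambda m: fake_list_objects(bucket, m, 10)))
    print(len(keys))
```

The no-progress check is the part that would have turned the endless loop into an immediate error, and yielding keys as they arrive avoids holding every response in memory at once.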
"""]]