comments
parent f724ff0388
commit b9b3e1257d
4 changed files with 66 additions and 0 deletions
@@ -0,0 +1,26 @@
[[!comment format=mdwn
username="joey"
subject="""comment 2"""
date="2024-11-14T18:23:54Z"
content="""
No, it does not request versions from S3 when versioning is not enabled.

This feels fairly similar to
[[git-annex-import_stalls_and_uses_all_ram_available]].
But I don't think it's really the same; that one used versioning and relied
on preferred content to filter the wanted files.

Is the size of the whole bucket under the fileprefix, in your case, large
enough that storing a list of all the files (without the versions) could
logically take as much memory as you're seeing? At one point you said it
was 7k files, but later hundreds of thousands, so I'm confused about how
big it is.

Is this bucket supposed to be public? I am having difficulty finding an
initremote command that works.

It also seems quite possible, looking at the code, that it's keeping all
the responses from S3 in memory until it gets done with listing all the
files, which would further increase memory use.
I don't see any `O(N^2)` operations though.
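For illustration, a hypothetical sketch (not git-annex's actual code) of
why that pattern costs memory: accumulating every page into one list
keeps all the responses alive until the listing completes, while
streaming each page to its consumer lets earlier pages be garbage
collected.

    import qualified Data.Text as T

    -- Hypothetical: fetch one page of keys given an optional marker,
    -- returning the keys plus the marker for the next page (if any).
    type Fetch = Maybe T.Text -> IO ([T.Text], Maybe T.Text)

    -- Accumulating: every page's keys are retained until the final
    -- page arrives, so memory grows with the whole listing.
    listAccum :: Fetch -> IO [T.Text]
    listAccum fetch = go Nothing
      where
        go marker = do
            (keys, next) <- fetch marker
            case next of
                Nothing -> return keys
                Just _  -> (keys ++) <$> go next

    -- Streaming: each page is handed to the consumer and can be
    -- collected before the next request is made.
    listStream :: Fetch -> (T.Text -> IO ()) -> IO ()
    listStream fetch consume = go Nothing
      where
        go marker = do
            (keys, next) <- fetch marker
            mapM_ consume keys
            maybe (return ()) (\_ -> go next) next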
"""]]
@@ -0,0 +1,13 @@
[[!comment format=mdwn
username="joey"
subject="""comment 3"""
date="2024-11-14T18:50:37Z"
content="""
This is the initremote for it:

    git-annex initremote dandiarchive type=S3 encryption=none fileprefix=dandisets/ bucket=dandiarchive publicurl=https://dandiarchive.s3.amazonaws.com/ signature=anonymous host=s3.amazonaws.com datacenter=US importtree=yes

It started at 1 API call per second, but it slowed down as memory rapidly
went up: 3 GB in a few minutes. So I think there is definitely a memory
leak involved.
"""]]
@@ -0,0 +1,9 @@
[[!comment format=mdwn
username="joey"
subject="""comment 4"""
date="2024-11-14T19:05:48Z"
content="""
I suspect one way the CLI tool is faster, aside from not leaking memory,
is that there is a max-keys parameter that git-annex is not using.
Less pagination would speed it up.
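For illustration, a sketch of how that parameter could be set with the
aws library's GetBucket request (getBucket and gbMaxKeys are that
library's names as I understand them; the 1000 value, S3's documented
per-page maximum, is an assumption, and any actual speedup depends on
what page size the server defaults to):

    import qualified Data.Text as T
    import Aws.S3 (GetBucket(..), getBucket)

    -- Ask S3 for the largest page it will return rather than leaving
    -- gbMaxKeys unset, to cut down on paginated round trips.
    listReq :: T.Text -> T.Text -> GetBucket
    listReq bucket prefix = (getBucket bucket)
        { gbPrefix = Just prefix
        , gbMaxKeys = Just 1000  -- assumed cap; see note above
        }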
"""]]
@@ -0,0 +1,18 @@
[[!comment format=mdwn
username="joey"
subject="""comment 5"""
date="2024-11-14T19:21:33Z"
content="""
Apparently gbrNextMarker is Nothing despite the response being truncated. So
git-annex is looping forever, getting the same first page each time, and
storing it all in a list.

I think this is a bug in the aws library, or I'm using it wrong.
It looks for a NextMarker in the response XML, but according to
<https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListObjects.html>

> This element is returned only if you have the delimiter request parameter
> specified. If the response does not include the NextMarker element and it is
> truncated, you can use the value of the last Key element in the response as the
> marker parameter in the subsequent request to get the next set of object keys.
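So a workaround on the git-annex side looks possible. A minimal sketch,
assuming the aws library's GetBucketResponse fields (the nextMarker
helper is my name; request and retry plumbing are elided):

    import Control.Applicative ((<|>))
    import qualified Data.Text as T
    import Aws.S3 (GetBucketResponse(..), ObjectInfo(..))

    -- Marker for the next ListObjects request: prefer gbrNextMarker,
    -- but when the response is truncated and that element is absent,
    -- fall back to the last listed Key, per the docs quoted above.
    nextMarker :: GetBucketResponse -> Maybe T.Text
    nextMarker r
        | not (gbrIsTruncated r) = Nothing  -- listing is complete
        | otherwise = gbrNextMarker r <|> lastKey (gbrContents r)
      where
        lastKey [] = Nothing
        lastKey os = Just (objectKey (last os))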
"""]]