comments
parent f724ff0388
commit b9b3e1257d
4 changed files with 66 additions and 0 deletions
@@ -0,0 +1,26 @@
[[!comment format=mdwn
username="joey"
subject="""comment 2"""
date="2024-11-14T18:23:54Z"
content="""
No, it does not request versions from S3 when versioning is not enabled.

This feels fairly similar to
[[git-annex-import_stalls_and_uses_all_ram_available]].
But I don't think it's really the same; that one used versioning and relied
on preferred content to filter the wanted files.

Is the size of the whole bucket under the fileprefix, in your case, large
enough that storing a list of all the files (without the versions) could
logically take as much memory as you're seeing? At one point you said it
was 7k files, but later hundreds of thousands, so I'm confused about how
big it is.

Is this bucket supposed to be public? I am having difficulty finding an
initremote command that works.

It also seems quite possible, looking at the code, that it's keeping all
the responses from S3 in memory until it gets done with listing all the
files, which would further increase memory use.
I don't see any `O(N^2)` operations though.
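For illustration, a hypothetical sketch (not git-annex's actual code) of
why that pattern costs memory: accumulating every page into one list
keeps all the responses alive until the listing completes, while
streaming each page to its consumer lets earlier pages be garbage
collected.

    import qualified Data.Text as T

    -- Hypothetical: fetch one page of keys given an optional marker,
    -- returning the keys plus the marker for the next page (if any).
    type Fetch = Maybe T.Text -> IO ([T.Text], Maybe T.Text)

    -- Accumulating: every page's keys are retained until the final
    -- page arrives, so memory grows with the whole listing.
    listAccum :: Fetch -> IO [T.Text]
    listAccum fetch = go Nothing
      where
        go marker = do
            (keys, next) <- fetch marker
            case next of
                Nothing -> return keys
                Just _  -> (keys ++) <$> go next

    -- Streaming: each page is handed to the consumer and can be
    -- collected before the next request is made.
    listStream :: Fetch -> (T.Text -> IO ()) -> IO ()
    listStream fetch consume = go Nothing
      where
        go marker = do
            (keys, next) <- fetch marker
            mapM_ consume keys
            maybe (return ()) (\_ -> go next) next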
"""]]
@@ -0,0 +1,13 @@
[[!comment format=mdwn
username="joey"
subject="""comment 3"""
date="2024-11-14T18:50:37Z"
content="""
This is the initremote for it:

    git-annex initremote dandiarchive type=S3 encryption=none fileprefix=dandisets/ bucket=dandiarchive publicurl=https://dandiarchive.s3.amazonaws.com/ signature=anonymous host=s3.amazonaws.com datacenter=US importtree=yes

It started at 1 API call per second, but it slowed down as memory rapidly
went up: 3 GB in a few minutes. So I think there is definitely a memory
leak involved.
"""]]
@@ -0,0 +1,9 @@
[[!comment format=mdwn
username="joey"
subject="""comment 4"""
date="2024-11-14T19:05:48Z"
content="""
I suspect one way the CLI tool is faster, aside from not leaking memory,
is that there is a max-keys parameter that git-annex is not using.
Less pagination would speed it up.
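For illustration, a sketch of how that parameter could be set with the
aws library's GetBucket request (getBucket and gbMaxKeys are that
library's names as I understand them; the 1000 value, S3's documented
per-page maximum, is an assumption, and any actual speedup depends on
what page size the server defaults to):

    import qualified Data.Text as T
    import Aws.S3 (GetBucket(..), getBucket)

    -- Ask S3 for the largest page it will return rather than leaving
    -- gbMaxKeys unset, to cut down on paginated round trips.
    listReq :: T.Text -> T.Text -> GetBucket
    listReq bucket prefix = (getBucket bucket)
        { gbPrefix = Just prefix
        , gbMaxKeys = Just 1000  -- assumed cap; see note above
        }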
"""]]
@@ -0,0 +1,18 @@
[[!comment format=mdwn
username="joey"
subject="""comment 5"""
date="2024-11-14T19:21:33Z"
content="""
Apparently gbrNextMarker is Nothing despite the response being truncated. So
git-annex is looping forever, getting the same first page each time, and
storing it all in a list.

I think this is a bug in the aws library, or I'm using it wrong.
It looks for a NextMarker in the response XML, but according to
<https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListObjects.html>

> This element is returned only if you have the delimiter request parameter
> specified. If the response does not include the NextMarker element and it is
> truncated, you can use the value of the last Key element in the response as the
> marker parameter in the subsequent request to get the next set of object keys.
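So a workaround on the git-annex side looks possible. A minimal sketch,
assuming the aws library's GetBucketResponse fields (the nextMarker
helper is my name; request and retry plumbing are elided):

    import Control.Applicative ((<|>))
    import qualified Data.Text as T
    import Aws.S3 (GetBucketResponse(..), ObjectInfo(..))

    -- Marker for the next ListObjects request: prefer gbrNextMarker,
    -- but when the response is truncated and that element is absent,
    -- fall back to the last listed Key, per the docs quoted above.
    nextMarker :: GetBucketResponse -> Maybe T.Text
    nextMarker r
        | not (gbrIsTruncated r) = Nothing  -- listing is complete
        | otherwise = gbrNextMarker r <|> lastKey (gbrContents r)
      where
        lastKey [] = Nothing
        lastKey os = Just (objectKey (last os))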
"""]]