This commit is contained in:
Joey Hess 2024-11-14 15:27:00 -04:00
parent f724ff0388
commit b9b3e1257d
No known key found for this signature in database
GPG key ID: DB12DB0FF05F8F38
4 changed files with 66 additions and 0 deletions


@@ -0,0 +1,26 @@
[[!comment format=mdwn
username="joey"
subject="""comment 2"""
date="2024-11-14T18:23:54Z"
content="""
No, it does not request versions from S3 when versioning is not enabled.

This feels fairly similar to
[[git-annex-import_stalls_and_uses_all_ram_available]],
but I don't think it's really the same: that one used versioning, and relied
on preferred content to filter the wanted files.

Is the size of the whole bucket under the fileprefix, in your case, large
enough that storing a list of all the files (without the versions) could
logically take as much memory as you're seeing? At one point you said it
was 7k files, but later hundreds of thousands, so I'm confused about how
big it is.

Is this bucket supposed to be public? I am having difficulty finding an
initremote command that works.

It also seems quite possible, looking at the code, that it's keeping all
the responses from S3 in memory until it gets done with listing all the
files, which would further increase memory use.
I don't see any `O(N^2)` operations though.
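The difference between accumulating every response and processing pages as
they arrive can be sketched like this (a hypothetical illustration in
Python, not git-annex's actual Haskell code):

```python
def list_all_accumulating(pages):
    # Keeps every page's contents alive in one list until the whole
    # listing finishes, so memory grows with the size of the bucket.
    results = []
    for page in pages:
        results.extend(page)
    return results

def list_all_streaming(pages):
    # Yields keys as each page arrives; a page can be garbage-collected
    # as soon as its keys have been consumed, so memory stays flat.
    for page in pages:
        yield from page
```

Both produce the same keys; only the retention behavior differs.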
"""]]


@@ -0,0 +1,13 @@
[[!comment format=mdwn
username="joey"
subject="""comment 3"""
date="2024-11-14T18:50:37Z"
content="""
This is the initremote for it:

    git-annex initremote dandiarchive type=S3 encryption=none fileprefix=dandisets/ bucket=dandiarchive publicurl=https://dandiarchive.s3.amazonaws.com/ signature=anonymous host=s3.amazonaws.com datacenter=US importtree=yes

It started at 1 API call per second, but it slowed down as memory rapidly
went up: 3 GB in a few minutes, so I think there is definitely a memory
leak involved.
"""]]


@@ -0,0 +1,9 @@
[[!comment format=mdwn
username="joey"
subject="""comment 4"""
date="2024-11-14T19:05:48Z"
content="""
I suspect one way the CLI tool is faster, aside from not leaking memory,
is that there is a max-keys parameter that git-annex is not using.
Less pagination would speed it up.
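The effect on pagination can be roughed out: each ListObjects response
returns at most max-keys objects, so a full listing needs a request per
page. This is a back-of-envelope sketch with hypothetical numbers; the
function name is illustrative, not an actual git-annex or aws API:

```python
import math

def request_count(total_keys, max_keys=1000):
    # One ListObjects request per page of up to max_keys objects;
    # an empty bucket still takes one request to discover it is empty.
    return max(1, math.ceil(total_keys / max_keys))
```

So 7k files would take 7 requests, while hundreds of thousands of files
would take hundreds of requests, each adding round-trip latency.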
"""]]


@@ -0,0 +1,18 @@
[[!comment format=mdwn
username="joey"
subject="""comment 5"""
date="2024-11-14T19:21:33Z"
content="""
Apparently `gbrNextMarker` is Nothing despite the response being truncated. So
git-annex is looping forever, getting the same first page each time, and
storing it all in a list.

I think this is a bug in the aws library, or I'm using it wrong.
It looks for a NextMarker in the response XML, but according to
<https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListObjects.html>:

> This element is returned only if you have the delimiter request parameter
> specified. If the response does not include the NextMarker element and it is
> truncated, you can use the value of the last Key element in the response as the
> marker parameter in the subsequent request to get the next set of object keys.
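The fallback the documentation describes can be sketched as follows
(illustrative Python with made-up field names, not the aws library's
actual types):

```python
def next_request_marker(is_truncated, next_marker, keys):
    # Decide the marker for the next ListObjects request.
    if not is_truncated:
        return None          # listing is complete, no further request
    if next_marker is not None:
        return next_marker   # only present when a delimiter was requested
    return keys[-1]          # fall back to the last Key of this page
```

Using the last Key when NextMarker is absent would break the loop of
re-fetching the same first page.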
"""]]