diff --git a/doc/todo/way_to_limit_recursion_for_import__47__export_S3_tree/comment_1_842a1243cd6f15004a178f607912ca33._comment b/doc/todo/way_to_limit_recursion_for_import__47__export_S3_tree/comment_1_842a1243cd6f15004a178f607912ca33._comment
new file mode 100644
index 0000000000..a7cca3fb31
--- /dev/null
+++ b/doc/todo/way_to_limit_recursion_for_import__47__export_S3_tree/comment_1_842a1243cd6f15004a178f607912ca33._comment
@@ -0,0 +1,14 @@
+[[!comment format=mdwn
+ username="joey"
+ subject="""comment 1"""
+ date="2026-01-08T13:49:52Z"
+ content="""
+Paths in preferred content expressions match relative to the top, so
+this preferred content expression will match only md files in the top,
+and files in the docs subdirectory:
+
+`include=docs/* or include=*.md`
+
+Only preferred content is downloaded, but S3 is still queried for the
+entire list of files in the bucket.
+"""]]
diff --git a/doc/todo/way_to_limit_recursion_for_import__47__export_S3_tree/comment_2_f5a391a3e62284e0c503139eade4fdda._comment b/doc/todo/way_to_limit_recursion_for_import__47__export_S3_tree/comment_2_f5a391a3e62284e0c503139eade4fdda._comment
new file mode 100644
index 0000000000..b1a0c2585c
--- /dev/null
+++ b/doc/todo/way_to_limit_recursion_for_import__47__export_S3_tree/comment_2_f5a391a3e62284e0c503139eade4fdda._comment
@@ -0,0 +1,32 @@
+[[!comment format=mdwn
+ username="joey"
+ subject="""comment 2"""
+ date="2026-01-08T14:16:26Z"
+ content="""
+I do think it would be possible to avoid the overhead of listing the
+contents of subdirectories that are not preferred content. At
+least sometimes.
+
+When a bucket is listed with a "/" delimiter, S3 does not recurse into
+subdirectories. Eg, if the bucket contains "foo", "bar/...", and "baz/...",
+the response will list only the file "foo", and CommonPrefixes contains
+"bar" and "baz".
+
+So, git-annex could make that request, and then if "include=bar/*" is not
+in preferred content, but "include=foo/*" is, it could make a request to
+list files prefixed by "foo/". And so avoid listing all the files in "bar".
+
+If preferred content contained "include=foo/x/*" and "include=foo/y/*",
+when CommonPrefixes includes "foo", git-annex could follow up with 2 requests
+to list those subdirectories.
+
+So this ends up making at most 1 additional request per subdirectory included
+in preferred content.
+
+When preferred content excludes a subdirectory though, more requests would
+be needed. For "exclude=bar/*", if the response lists 100 other
+subdirectories in CommonPrefixes, it would need to make 100 separate
+requests to list those while avoiding listing bar. That could easily be
+more expensive than the current behavior. So it does not seem to make sense
+to try to optimise handling of excludes.
+"""]]
diff --git a/doc/todo/way_to_limit_recursion_for_import__47__export_S3_tree/comment_3_0914c14c2b2b97bd0c79f3d9c990719f._comment b/doc/todo/way_to_limit_recursion_for_import__47__export_S3_tree/comment_3_0914c14c2b2b97bd0c79f3d9c990719f._comment
new file mode 100644
index 0000000000..e18502378a
--- /dev/null
+++ b/doc/todo/way_to_limit_recursion_for_import__47__export_S3_tree/comment_3_0914c14c2b2b97bd0c79f3d9c990719f._comment
@@ -0,0 +1,42 @@
+[[!comment format=mdwn
+ username="joey"
+ subject="""comment 3"""
+ date="2026-01-08T14:46:44Z"
+ content="""
+There are some complications in possible preferred content expressions:
+
+"include=foo*/*" -- we want "foo/*" but also "foooooom/*"... but what if
+there are 100 such subdirectories? It would be an unexpected cost to need
+to make so many requests. Like exclude=, the optimisation should not be
+used in this case.
+
+"include=foo/bar" -- we want only this file.. so would prefer to avoid
+recursing through the rest of foo. If there are multiple ones like this
+that are all in the same subdirectory, it might be nice to make
+one single request to find them all. But this seems like an edge case,
+and one request per include is probably acceptable.
+
+Here's a design:
+
+1. Get preferred content expression of the remote.
+2. Filter for "include=" that contain a "/" in the value. If none are
+ found, do the usual full listing of the bucket.
+3. If any of those includes contain a glob before a "/", do the usual full
+ listing of the bucket. (This handles the "include=foo*/*" case)
+4. Otherwise, list the top level of the bucket with delimiter set to "/".
+5. Include all the top-level files in the list.
+6. Filter the includes to ones that start with a subdirectory in the
+ CommonPrefixes.
+7. For each remaining include, make a request to list the bucket, with
+ the prefix set to the non-glob directory from the include. For example,
+ for "include=foo/bar/*", set prefix to "foo/bar/", but for
+ "include=foo/*bar", set prefix to "foo/". And for "include=foo/bar",
+ set prefix to "foo/".
+8. Add back the prefixes to each file in the responses.
+
+Note that, step #1 hides some complexity, because currently preferred
+content is loaded and parsed to a MatchFiles, which does not allow
+introspecting to get the expression. Since we only care about include
+expressions, it would suffice to add to MatchFiles a
+`matchInclude :: Maybe String` which gets set for includes.
+"""]]
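The planning part of the design in comment 3 (steps 2, 3, 6 and 7) can be sketched without touching S3 at all. This is only an illustration in Python, not git-annex code: the function names `has_glob_before_slash`, `include_prefix` and `plan_listing` are hypothetical, the include globs are assumed to have already been extracted from the preferred content expression (the part that step 1 says is currently hard), and `common_prefixes` stands in for the CommonPrefixes of the top-level listing made in step 4.

```python
# Characters treated as glob syntax in this sketch (an assumption;
# git-annex's own glob matcher may differ).
GLOB_CHARS = set("*?[")


def has_glob_before_slash(pattern):
    """Step 3: True if the directory part of the include contains a glob."""
    directory = pattern.rsplit("/", 1)[0]
    return any(c in GLOB_CHARS for c in directory)


def include_prefix(pattern):
    """Step 7: the directory part of an include, with a trailing "/"."""
    return pattern.rsplit("/", 1)[0] + "/"


def plan_listing(includes, common_prefixes):
    """Return None when the usual full listing is needed, otherwise the
    sorted list of prefixes that would each get one listing request."""
    # Step 2: only includes that name a subdirectory are of interest.
    with_slash = [p for p in includes if "/" in p]
    if not with_slash:
        return None
    # Step 3: a glob before a "/" (eg "foo*/*") disables the optimisation.
    if any(has_glob_before_slash(p) for p in with_slash):
        return None
    # Step 6: keep includes whose top-level directory actually exists.
    # Trailing delimiters on the common prefixes are stripped first.
    top_dirs = {d.rstrip("/") for d in common_prefixes}
    kept = [p for p in with_slash if p.split("/", 1)[0] in top_dirs]
    # Step 7: one request per distinct prefix.
    return sorted({include_prefix(p) for p in kept})


if __name__ == "__main__":
    print(plan_listing(["foo/x/*", "foo/y/*", "bar/*"], ["foo/", "baz/"]))
```

Collecting the prefixes into a set means that several includes in the same subdirectory, such as "foo/a" and "foo/b", share a single "foo/" request. That is a choice made in this sketch, not something the design requires.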