design

2026-01-08 11:26:44 -04:00 · 2026-01-08 11:26:44 -04:00 · 890deeaaa6
commit 890deeaaa6
parent d1c3c7f52c
3 changed files with 88 additions and 0 deletions
--- a/doc/todo/way_to_limit_recursion_for_import47export_S3_tree/comment_1_842a1243cd6f15004a178f607912ca33._comment
+++ b/doc/todo/way_to_limit_recursion_for_import47export_S3_tree/comment_1_842a1243cd6f15004a178f607912ca33._comment
@@ -0,0 +1,14 @@
+[[!comment format=mdwn
+ username="joey"
+ subject="""comment 1"""
+ date="2026-01-08T13:49:52Z"
+ content="""
+Paths in preferred content expressions match relative to the top, so
+this preferred content expression will match only md files in the top,
+and files in the docs subdirectory:
+
+`include=docs/* or include=*.md`
+
+Only preferred content is downloaded, but S3 is still queried for the
+entire list of files in the bucket.
+"""]]
--- a/doc/todo/way_to_limit_recursion_for_import47export_S3_tree/comment_2_f5a391a3e62284e0c503139eade4fdda._comment
+++ b/doc/todo/way_to_limit_recursion_for_import47export_S3_tree/comment_2_f5a391a3e62284e0c503139eade4fdda._comment
@@ -0,0 +1,32 @@
+[[!comment format=mdwn
+ username="joey"
+ subject="""comment 2"""
+ date="2026-01-08T14:16:26Z"
+ content="""
+I do think it would be possible to avoid the overhead of listing the
+contents of subdirectories that are not preferred content. At
+least sometimes.
+
+When a bucket is listed with a "/" delimiter, S3 does not recurse into
+subdirectories. Eg, if the bucket contains "foo", "bar/...", and "baz/...",
+the response will list only the file "foo", and CommonPrefixes will contain
+"bar/" and "baz/".
+
+So, git-annex could make that request, and then if "include=bar/*" is not
+in preferred content, but "include=baz/*" is, it could make a request to
+list files prefixed by "baz/", and so avoid listing all the files in "bar".
+
+If preferred content contained "include=foo/x/*" and "include=foo/y/*",
+when CommonPrefixes includes "foo/", git-annex could follow up with 2 requests
+to list those subdirectories.
+
+So this ends up making at most 1 additional request per subdirectory included
+in preferred content.
+
+When preferred content excludes a subdirectory though, more requests would
+be needed. For "exclude=bar/*", if the response lists 100 other
+subdirectories in CommonPrefixes, it would need to make 100 separate
+requests to list those while avoiding listing bar. That could easily be
+more expensive than the current behavior. So it does not seem to make sense
+to try to optimise handling of excludes.
+"""]]
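
A minimal sketch of the delimiter listing described in comment 2, written as a pure Haskell function over an in-memory list of keys rather than a real S3 request. The names `Listing` and `listWithDelimiter` are hypothetical and are not part of git-annex or of any S3 library. Key ordering and pagination of a real listing are not modelled. S3 reports common prefixes with the trailing delimiter, so the sketch does the same.

```haskell
import Data.List (isPrefixOf, nub)

-- | Result of one simulated listing request (hypothetical type).
data Listing = Listing
  { files          :: [String]  -- keys directly under the prefix
  , commonPrefixes :: [String]  -- one entry per "subdirectory", with trailing "/"
  } deriving (Show, Eq)

-- | Simulate listing the keys under a prefix with "/" as the delimiter.
-- A key whose remainder after the prefix still contains a "/" is rolled
-- up into a common prefix instead of being listed individually.
listWithDelimiter :: String -> [String] -> Listing
listWithDelimiter prefix keys = Listing
  { files = [ k | (k, rest) <- matching, '/' `notElem` rest ]
  , commonPrefixes = nub
      [ prefix ++ takeWhile (/= '/') rest ++ "/"
      | (_, rest) <- matching, '/' `elem` rest ]
  }
  where
    -- Pair each key under the prefix with the part that follows the prefix.
    matching = [ (k, drop (length prefix) k) | k <- keys, prefix `isPrefixOf` k ]
```

With the keys "foo", "bar/a", "bar/b/c" and "baz/d", listing with an empty prefix yields the file "foo" and the common prefixes "bar/" and "baz/". A follow-up call with prefix "baz/" then lists only "baz/d", without touching anything under "bar/".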
--- a/doc/todo/way_to_limit_recursion_for_import47export_S3_tree/comment_3_0914c14c2b2b97bd0c79f3d9c990719f._comment
+++ b/doc/todo/way_to_limit_recursion_for_import47export_S3_tree/comment_3_0914c14c2b2b97bd0c79f3d9c990719f._comment
@@ -0,0 +1,42 @@
+[[!comment format=mdwn
+ username="joey"
+ subject="""comment 3"""
+ date="2026-01-08T14:46:44Z"
+ content="""
+There are some complications in possible preferred content expressions:
+
+"include=foo*/*" -- we want "foo/*" but also "foooooom/*"... but what if
+there are 100 such subdirectories? It would be an unexpected cost to need
+to make so many requests. Like exclude=, the optimisation should not be
+used in this case.
+
+"include=foo/bar" -- we want only this file.. so would prefer to avoid
+recursing through the rest of foo. If there are multiple ones like this
+that are all in the same subdirectory, it might be nice to make
+one single request to find them all. But this seems like an edge case,
+and one request per include is probably acceptable.
+
+Here's a design:
+
+1. Get preferred content expression of the remote.
+2. Filter for "include=" that contain a "/" in the value. If none are
+   found, do the usual full listing of the bucket.
+3. If any of those includes contain a glob before a "/", do the usual full
+   listing of the bucket. (This handles the "include=foo*/*" case.)
+4. Otherwise, list the top level of the bucket with delimiter set to "/".
+5. Include all the top-level files in the list.
+6. Filter the includes to ones that start with a subdirectory in the
+   CommonPrefixes.
+7. For each remaining include, make a request to list the bucket, with
+   the prefix set to the non-glob directory from the include. For example,
+   for "include=foo/bar/*", set prefix to "foo/bar/", but for
+   "include=foo/*bar", set prefix to "foo/". And for "include=foo/bar",
+   set prefix to "foo/". 
+8. Add back the prefixes to each file in the responses.
+
+Note that step #1 hides some complexity, because currently preferred
+content is loaded and parsed to a MatchFiles, which does not allow
+introspecting to get the expression. Since we only care about include
+expressions, it would suffice to add to MatchFiles a 
+`matchInclude :: Maybe String` which gets set for includes.
+"""]]
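
The planning part of the design in comment 3 (steps 2, 3, 6 and 7) can be sketched as pure Haskell. This is an illustration under stated assumptions, not git-annex code: `Plan`, `planListing`, `prefixesToList`, `splitLastSlash` and `isGlobChar` are hypothetical names, the input is assumed to be the values of the "include=" terms already extracted from the expression (the part step 1 leaves open), and the set of glob metacharacters is assumed to be `*`, `?` and `[`.

```haskell
import Data.List (isPrefixOf)
import Data.Maybe (mapMaybe)

-- | What to ask the remote for (hypothetical type).
data Plan
  = FullListing            -- the usual listing of the whole bucket
  | TopLevelThen [String]  -- delimiter listing of the top level, then one
                           -- listing per prefix (after step 6 filtering)
  deriving (Show, Eq)

-- | Glob metacharacters assumed for this sketch.
isGlobChar :: Char -> Bool
isGlobChar c = c `elem` "*?["

-- | Split an include value at its last "/" into (directory part, rest).
-- The directory part keeps its trailing "/". Nothing when there is no "/".
splitLastSlash :: String -> Maybe (String, String)
splitLastSlash v
  | '/' `elem` v = Just (reverse dirRev, reverse baseRev)
  | otherwise    = Nothing
  where
    (baseRev, dirRev) = break (== '/') (reverse v)

-- | Steps 2, 3 and 7: decide between a full listing and per-prefix listings.
planListing :: [String] -> Plan
planListing includes
  | null dirs                 = FullListing        -- step 2: no include has a "/"
  | any (any isGlobChar) dirs = FullListing        -- step 3: glob before a "/"
  | otherwise                 = TopLevelThen dirs  -- step 7: one prefix per include
  where
    dirs = map fst (mapMaybe splitLastSlash includes)

-- | Step 6: keep only prefixes that fall under a subdirectory reported in
-- CommonPrefixes (assumed to carry a trailing "/", as S3 returns them).
prefixesToList :: [String] -> Plan -> [String]
prefixesToList _ FullListing = []
prefixesToList commonPrefixes (TopLevelThen ps) =
  [ p | p <- ps, any (`isPrefixOf` p) commonPrefixes ]
```

For the includes "foo/bar/*", "foo/*bar" and "foo/bar", `planListing` gives the prefixes "foo/bar/", "foo/" and "foo/", matching the examples in step 7, while "foo*/*" or an expression with only "*.md" falls back to `FullListing`. The sketch issues one prefix per include and does not merge duplicates, in line with the remark that one request per include is probably acceptable.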