Joey Hess 2026-01-08 11:26:44 -04:00
commit 890deeaaa6
GPG key ID: DB12DB0FF05F8F38
3 changed files with 88 additions and 0 deletions

@@ -0,0 +1,14 @@
[[!comment format=mdwn
username="joey"
subject="""comment 1"""
date="2026-01-08T13:49:52Z"
content="""
Paths in preferred content expressions are matched relative to the top of
the repository, so this preferred content expression will match only md
files in the top-level directory, and files in the docs subdirectory:

`include=docs/* or include=*.md`

Only preferred content is downloaded, but S3 is still queried for the
entire list of files in the bucket.
"""]]

@@ -0,0 +1,32 @@
[[!comment format=mdwn
username="joey"
subject="""comment 2"""
date="2026-01-08T14:16:26Z"
content="""
I do think it would be possible to avoid the overhead of listing the
contents of subdirectories that are not preferred content. At
least sometimes.

When a bucket is listed with a "/" delimiter, S3 does not recurse into
subdirectories. Eg, if the bucket contains "foo", "bar/...", and "baz/...",
the response will list only the file "foo", and CommonPrefixes will contain
"bar/" and "baz/".

So, git-annex could make that request, and then if "include=bar/*" is not
in preferred content, but "include=baz/*" is, it could make a request to
list files prefixed by "baz/". And so avoid listing all the files in "bar".

If preferred content contained "include=baz/x/*" and "include=baz/y/*",
when CommonPrefixes includes "baz/", git-annex could follow up with 2 requests
to list those subdirectories.

So this ends up making at most 1 additional request per subdirectory included
in preferred content.
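
To make the request pattern concrete, here is a rough pure model of a delimiter listing. It is not real S3 client code: `Listing` and `listWithDelimiter` are names made up for this sketch, keys come back in input order rather than sorted, and common prefixes are returned with the trailing delimiter.

```haskell
import Data.List (nub, stripPrefix)
import Data.Maybe (mapMaybe)

-- | Result of one simulated listing request (hypothetical type).
data Listing = Listing
  { contents :: [String]        -- keys listed individually
  , commonPrefixes :: [String]  -- rolled-up "subdirectories"
  } deriving (Show, Eq)

-- | Pure model of a list request with a prefix and a one-character
-- delimiter. A key that still contains the delimiter after the prefix
-- is rolled up into a common prefix instead of being listed.
listWithDelimiter :: Char -> String -> [String] -> Listing
listWithDelimiter delim prefix keys = Listing files (nub dirs)
  where
    -- Remainders of the keys that start with the prefix.
    rests = mapMaybe (stripPrefix prefix) keys
    files = [prefix ++ r | r <- rests, delim `notElem` r]
    dirs = [prefix ++ takeWhile (/= delim) r ++ [delim] | r <- rests, delim `elem` r]
```

With keys "foo", "bar/a", "bar/b" and "baz/c", an empty prefix gives the file "foo" plus the two common prefixes, and a follow-up with prefix "baz/" returns only "baz/c", so nothing under "bar/" is ever listed.
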

When preferred content excludes a subdirectory though, more requests would
be needed. For "exclude=bar/*", if the response lists 100 other
subdirectories in CommonPrefixes, git-annex would need to make 100 separate
requests to list those while avoiding listing bar. That could easily be
more expensive than the current behavior. So it does not seem to make sense
to try to optimise handling of excludes.
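
A back-of-the-envelope version of that comparison, assuming each list response carries at most 1000 entries (S3's page size) and that common prefixes count toward that limit. The function names and the sizes in the example are made up for illustration.

```haskell
-- | Requests needed to list n entries, assuming at most 1000 entries
-- per response and that even an empty listing costs one request.
pagesFor :: Int -> Int
pagesFor n = max 1 ((n + 999) `div` 1000)

-- | Current behavior: page through every key in the bucket.
-- Arguments: top-level files, files in the excluded subdirectory,
-- and the file counts of the other subdirectories.
fullListingCost :: Int -> Int -> [Int] -> Int
fullListingCost topFiles excluded others =
  pagesFor (topFiles + excluded + sum others)

-- | Hypothetical exclude optimisation: one delimiter listing of the
-- top level (files plus every common prefix, including the excluded
-- one), then a separate listing for each non-excluded subdirectory.
perSubdirCost :: Int -> [Int] -> Int
perSubdirCost topFiles others =
  pagesFor (topFiles + length others + 1) + sum (map pagesFor others)
```

With 10 top-level files, 5000 files in bar, and 100 other subdirectories of 20 files each, `fullListingCost 10 5000 (replicate 100 20)` is 8 requests while `perSubdirCost 10 (replicate 100 20)` is 101.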
"""]]

@@ -0,0 +1,42 @@
[[!comment format=mdwn
username="joey"
subject="""comment 3"""
date="2026-01-08T14:46:44Z"
content="""
There are some complications in possible preferred content expressions:

"include=foo*/*" -- we want "foo/*" but also "foooooom/*"... but what if
there are 100 such subdirectories? It would be an unexpected cost to need
to make so many requests. As with exclude=, the optimisation should not be
used in this case.

"include=foo/bar" -- we want only this file, so would prefer to avoid
recursing through the rest of foo. If there are multiple includes like this
that are all in the same subdirectory, it might be nice to make
one single request to find them all. But this seems like an edge case,
and one request per include is probably acceptable.

Here's a design:

1. Get the preferred content expression of the remote.
2. Filter for "include=" terms that contain a "/" in the value. If none are
   found, do the usual full listing of the bucket.
3. If any of those includes contain a glob before a "/", do the usual full
   listing of the bucket. (This handles the "include=foo*/*" case.)
4. Otherwise, list the top level of the bucket with delimiter set to "/".
5. Include all the top-level files in the list.
6. Filter the includes to ones that start with a subdirectory in the
   CommonPrefixes.
7. For each remaining include, make a request to list the bucket, with
   the prefix set to the non-glob directory from the include. For example,
   for "include=foo/bar/*", set prefix to "foo/bar/", but for
   "include=foo/*bar", set prefix to "foo/". And for "include=foo/bar",
   set prefix to "foo/".
8. Add back the prefixes to each file in the responses.
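
Steps 2, 3, 6 and 7 can be sketched as pure functions over the include values. This is only an illustration: the function names are invented, the set of glob metacharacters is an assumption, CommonPrefixes are taken to carry a trailing "/", and duplicate prefixes are merged, which goes slightly beyond one request per include.

```haskell
import Data.List (isPrefixOf, nub)

-- | Characters treated as glob metacharacters (an assumption).
isGlobChar :: Char -> Bool
isGlobChar c = c `elem` "*?["

-- | Split a path on '/'. Always returns at least one component.
splitPath :: String -> [String]
splitPath s = case break (== '/') s of
  (c, []) -> [c]
  (c, _ : rest) -> c : splitPath rest

-- | Step 3: is there a glob in any component other than the last?
globBeforeSlash :: String -> Bool
globBeforeSlash inc = any (any isGlobChar) (init (splitPath inc))

-- | Steps 2 and 3: fall back to the usual full listing when no include
-- names a subdirectory, or when one has a glob before a "/".
needsFullListing :: [String] -> Bool
needsFullListing includes = null withSlash || any globBeforeSlash withSlash
  where
    withSlash = filter ('/' `elem`) includes

-- | Step 7: the directory part of an include, with a trailing slash.
includePrefix :: String -> String
includePrefix inc = concatMap (++ "/") (init (splitPath inc))

-- | Steps 6 and 7: given the include values and the CommonPrefixes from
-- the top-level listing, the prefixes to request next.
followUpPrefixes :: [String] -> [String] -> [String]
followUpPrefixes includes prefixes = nub (map includePrefix present)
  where
    present = filter (\i -> any (`isPrefixOf` i) prefixes) includes
```

For the three examples in step 7, `includePrefix` gives "foo/bar/", "foo/" and "foo/" respectively.
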

Note that step #1 hides some complexity, because currently preferred
content is loaded and parsed to a MatchFiles, which does not allow
introspecting to get the expression. Since we only care about include
expressions, it would suffice to add to MatchFiles a
`matchInclude :: Maybe String` which gets set for includes.
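
Roughly what that could look like, using a heavily simplified stand-in for MatchFiles. The real record has more fields and a different matching type; `includeTerm`, `otherTerm` and `includesOf` are hypothetical helpers, and only `matchInclude` is the proposed addition.

```haskell
import Data.Maybe (mapMaybe)

-- | Simplified stand-in for git-annex's MatchFiles, not the real type.
data MatchFiles = MatchFiles
  { matchAction :: FilePath -> Bool  -- simplified matching function
  , matchInclude :: Maybe String     -- proposed: the glob of an include= term
  }

-- | Hypothetical constructor for an include= term that records its glob.
includeTerm :: String -> (FilePath -> Bool) -> MatchFiles
includeTerm glob matcher =
  MatchFiles { matchAction = matcher, matchInclude = Just glob }

-- | Hypothetical constructor for any other term.
otherTerm :: (FilePath -> Bool) -> MatchFiles
otherTerm matcher = MatchFiles { matchAction = matcher, matchInclude = Nothing }

-- | Collect the include globs from the terms of a parsed expression.
includesOf :: [MatchFiles] -> [String]
includesOf = mapMaybe matchInclude
```

The listing code would then only need the result of something like `includesOf`, without parsing the expression a second time.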
"""]]