design
parent
d1c3c7f52c
commit
890deeaaa6
3 changed files with 88 additions and 0 deletions
|
|
@ -0,0 +1,14 @@
|
|||
[[!comment format=mdwn
|
||||
username="joey"
|
||||
subject="""comment 1"""
|
||||
date="2026-01-08T13:49:52Z"
|
||||
content="""
|
||||
Paths in preferred content expressions match relative to the top, so
|
||||
this preferred content expression will match only md files in the top,
|
||||
and files in the docs subdirectory:
|
||||
|
||||
`include=docs/* or include=*.md`
|
||||
|
||||
Only preferred content is downloaded, but S3 is still queried for the
|
||||
entire list of files in the bucket.
|
||||
"""]]
|
||||
|
|
@ -0,0 +1,32 @@
|
|||
[[!comment format=mdwn
|
||||
username="joey"
|
||||
subject="""comment 2"""
|
||||
date="2026-01-08T14:16:26Z"
|
||||
content="""
|
||||
I do think it would be possible to avoid the overhead of listing the
|
||||
contents of subdirectories that are not preferred content. At
|
||||
least sometimes.
|
||||
|
||||
When a bucket is listed with a "/" delimiter, S3 does not recurse into
|
||||
subdirectories. Eg, if the bucket contains "foo", "bar/...", and "baz/...",
|
||||
the response will list only the file "foo", and CommonPrefixes contains
|
||||
"bar/" and "baz/".
|
||||
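To make the delimiter behaviour concrete, here is a rough Python sketch that simulates a single list request over an in-memory list of keys. It is not git-annex code and makes no S3 calls; `list_level` and the sample bucket are made up for illustration. Note that S3 reports each common prefix with the trailing delimiter.

```python
def list_level(keys, prefix="", delimiter="/"):
    # Simulates one list request: keys directly under the prefix are
    # returned as contents, deeper keys are rolled up into common prefixes.
    contents = []
    common_prefixes = []
    for key in sorted(keys):
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter in rest:
            # Roll up everything up to and including the first delimiter.
            rolled = prefix + rest.split(delimiter, 1)[0] + delimiter
            if rolled not in common_prefixes:
                common_prefixes.append(rolled)
        else:
            contents.append(key)
    return contents, common_prefixes


bucket = ["foo", "bar/1", "bar/sub/2", "baz/3"]
print(list_level(bucket))                 # (['foo'], ['bar/', 'baz/'])
print(list_level(bucket, prefix="bar/"))  # (['bar/1'], ['bar/sub/'])
```

The second call shows that the same request shape, with a prefix set, lists one level of a single subdirectory without touching the others.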
|
||||
So, git-annex could make that request, and then if "include=bar/*" is not
|
||||
in preferred content, but "include=foo/*" is, it could make a request to
|
||||
list files prefixed by "foo/". And so avoid listing all the files in "bar".
|
||||
|
||||
If preferred content contained "include=foo/x/*" and "include=foo/y/*",
|
||||
when CommonPrefixes includes "foo", git-annex could follow up with 2 requests
|
||||
to list those subdirectories.
|
||||
|
||||
So this ends up making at most 1 additional request per subdirectory included
|
||||
in preferred content.
|
||||
|
||||
When preferred content excludes a subdirectory though, more requests would
|
||||
be needed. For "exclude=bar/*", if the response lists 100 other
|
||||
subdirectories in CommonPrefixes, it would need to make 100 separate
|
||||
requests to list those while avoiding listing bar. That could easily be
|
||||
more expensive than the current behavior. So it does not seem to make sense
|
||||
to try to optimise handling of excludes.
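A back-of-the-envelope comparison, as a Python sketch. It assumes each list request returns at most 1000 keys (the S3 page size limit) and ignores pagination of the top-level request. The function names and the bucket shape are hypothetical.

```python
import math

PAGE_SIZE = 1000  # S3 returns at most 1000 keys per list request


def full_listing_requests(total_keys):
    # Current behaviour: page through the whole bucket.
    return max(1, math.ceil(total_keys / PAGE_SIZE))


def exclude_listing_requests(keys_per_subdir, excluded):
    # One delimiter request for the top level, then a paginated
    # listing of every subdirectory that is not excluded.
    requests = 1
    for subdir, count in keys_per_subdir.items():
        if subdir not in excluded:
            requests += max(1, math.ceil(count / PAGE_SIZE))
    return requests


# "bar" plus 100 other subdirectories, 50 files each.
subdirs = {"bar": 50}
for n in range(100):
    subdirs["dir%d" % n] = 50

print(full_listing_requests(sum(subdirs.values())))  # 6
print(exclude_listing_requests(subdirs, {"bar"}))    # 101
```

With many small subdirectories, skipping one excluded directory costs far more requests than simply paging through everything.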
|
||||
"""]]
|
||||
|
|
@ -0,0 +1,42 @@
|
|||
[[!comment format=mdwn
|
||||
username="joey"
|
||||
subject="""comment 3"""
|
||||
date="2026-01-08T14:46:44Z"
|
||||
content="""
|
||||
There are some complications in possible preferred content expressions:
|
||||
|
||||
"include=foo*/*" -- we want "foo/*" but also "foooooom/*"... but what if
|
||||
there are 100 such subdirectories? It would be an unexpected cost to need
|
||||
to make so many requests. Like exclude=, the optimisation should not be
|
||||
used in this case.
|
||||
|
||||
"include=foo/bar" -- we want only this file.. so would prefer to avoid
|
||||
recursing through the rest of foo. If there are multiple ones like this
|
||||
that are all in the same subdirectory, it might be nice to make
|
||||
one single request to find them all. But this seems like an edge case,
|
||||
and one request per include is probably acceptable.
|
||||
|
||||
Here's a design:
|
||||
|
||||
1. Get preferred content expression of the remote.
|
||||
2. Filter for "include=" that contain a "/" in the value. If none are
|
||||
found, do the usual full listing of the bucket.
|
||||
3. If any of those includes contain a glob before a "/", do the usual full
|
||||
listing of the bucket. (This handles the "include=foo*/*" case.)
|
||||
4. Otherwise, list the top level of the bucket with delimiter set to "/".
|
||||
5. Include all the top-level files in the list.
|
||||
6. Filter the includes to ones that start with a subdirectory in the
|
||||
CommonPrefixes.
|
||||
7. For each remaining include, make a request to list the bucket, with
|
||||
the prefix set to the non-glob directory from the include. For example,
|
||||
for "include=foo/bar/*", set prefix to "foo/bar/", but for
|
||||
"include=foo/*bar", set prefix to "foo/". And for "include=foo/bar",
|
||||
set prefix to "foo/".
|
||||
8. Add back the prefixes to each file in the responses.
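
Steps 2, 3, 6 and 7 could be sketched roughly like this in Python. This is only an illustration, not git-annex code: the function names are invented, the include values are passed in as plain strings, and the set of glob characters (`*`, `?`, `[`) is an assumption. Step 3 is read here as "a glob in any directory component", not only the first one.

```python
GLOB_CHARS = "*?["  # assumed set of glob metacharacters


def has_glob(s):
    return any(c in s for c in GLOB_CHARS)


def can_optimise(includes):
    # Step 2: only includes whose value contains a "/" are of interest.
    with_dir = [i for i in includes if "/" in i]
    if not with_dir:
        return False
    # Step 3: a glob in the directory part means a full listing.
    return not any(has_glob(i.rsplit("/", 1)[0]) for i in with_dir)


def prefixes_to_list(includes, common_prefixes):
    # Steps 6 and 7: keep includes whose top-level directory showed up in
    # CommonPrefixes, and use the directory part of each one as the prefix.
    wanted = set()
    for inc in includes:
        if "/" not in inc:
            continue
        top = inc.split("/", 1)[0] + "/"
        if top in common_prefixes:
            wanted.add(inc.rsplit("/", 1)[0] + "/")
    return sorted(wanted)


includes = ["foo/bar/*", "foo/*bar", "foo/bar", "qux/*", "*.md"]
if can_optimise(includes):
    print(prefixes_to_list(includes, ["foo/", "baz/"]))  # ['foo/', 'foo/bar/']
```

The sketch merges includes that map to the same prefix, which is a small deviation from one request per include. "qux/*" is dropped because "qux/" was not in the CommonPrefixes, so nothing in the bucket can match it.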
|
||||
|
||||
Note that step #1 hides some complexity, because currently preferred
|
||||
content is loaded and parsed to a MatchFiles, which does not allow
|
||||
introspecting to get the expression. Since we only care about include
|
||||
expressions, it would suffice to add to MatchFiles a
|
||||
`matchInclude :: Maybe String` which gets set for includes.
|
||||
"""]]
|
||||