Added a comment

This commit is contained in:
prancewit 2022-09-13 21:36:38 +00:00 committed by admin
parent 9f5f960548
commit bef6eb5d02

View file

@ -0,0 +1,21 @@
[[!comment format=mdwn
username="prancewit"
avatar="http://cdn.libravatar.org/avatar/f6cc165b68a5cca3311f9a1cd7fd027c"
subject="comment 4"
date="2022-09-13T21:36:38Z"
content="""
At a high level, I see 2 possible ways in which special remotes can optimize for larger queries
* Pre-fetch or cache state of existing keys (Mostly useful only for no-op requests. For instance, pre-fetch the list of keys in the remote enabling no-op REMOVEs, but hard to tell if there's been a separate change since the fetch)
* Batch multiple requests in a single call. (Batching can be done before or after sending the SUCCESS response to git-annex with corresponding results)
> So it might be that a protocol extension would be useful, some way for git-annex to indicate that it is blocked waiting on current requests to finish.
I can think of a few related ways to do this:
* Have the remote send ACKs to notify that it's ready for the next request, and send SUCCESS when the request is actually completed. The remote can then have the flexibility to run them in whatever batch/async manner suitable. In the case of chunk-keys, git-annex could rapidly send successive keys in sequence since there's no additional lookup required making it pretty efficient.
* Have git-annex send some kind of group identifier (all chunks of same key might be grouped together) or delimiter(eg: GROUP_COMPLETED). This acts as a hint that these requests could be batched together without any obligation from the remote to do so. Coupled with a guarantee that all items in one group will be sent sequentially, the first item that belongs to a different group provides a guarantee that the previous group is completed. In this case, the SUCCESS for the last item could be taken to mean that the entire group is completed. One issue here is that this could leak some information in encrypted repositories.
* Define a CACHE_TIMEOUT_SECONDS. This could be used by the remote to decide if any pre-fetched or cached data can be trusted or if they should be re-checked. Git-annex would use this during merge/sync with other remotes to determine if there's a conflict that needs to be handled differently. (Seems too complicated TBH, but trying to see how we can make pre-fetch/caching work)
"""]]