Merge branch 'master' of ssh://git-annex.branchable.com

This commit is contained in:
Joey Hess 2022-09-13 21:36:16 -04:00
commit e3d19c7674
No known key found for this signature in database
GPG key ID: DB12DB0FF05F8F38
6 changed files with 61 additions and 1 deletions

View file

@@ -290,3 +290,5 @@ UBUNTU_CODENAME=jammy
### Have you had any luck using git-annex before? (Sometimes we get tired of reading bug reports all day and a lil' positive end note does wonders)
Sure! Lots! We use it to share a large open access dataset at https://github.com/spine-generic, and [I'm working on](https://github.com/neuropoly/gitea/pull/1) helping other researchers share their datasets on their own infrastructure using git-annex + gitea.
[[done]]

View file

@@ -0,0 +1,11 @@
[[!comment format=mdwn
username="nick.guenther@e418ed3c763dff37995c2ed5da4232a7c6cee0a9"
nickname="nick.guenther"
avatar="http://cdn.libravatar.org/avatar/9e85c6ca61c3f877fef4f91c2bf6e278"
subject="comment 6"
date="2022-09-13T23:32:01Z"
content="""
That's awesome! Thanks very much, Joey.
I'll mark this done now :)
"""]]

View file

@@ -22,7 +22,7 @@ git annex required wasabi-west groupwanted
git annex group machine1 active
git annex group machine2 active
git annex groupwanted anything
git annex groupwanted active anything
# from machine1
git annex sync -a origin machine2 wasabi-east wasabi-west

View file

@@ -0,0 +1,10 @@
[[!comment format=mdwn
username="pat"
avatar="http://cdn.libravatar.org/avatar/6b552550673a6a6df3b33364076f8ea8"
subject="comment 3"
date="2022-09-13T21:14:02Z"
content="""
hrm… as you can see in my post, I AM using “anything” as the wanted content. So I would expect all of the remotes (wasabi and machines) to get all of the file versions. But that's not happening. It's behaving more like “used” would.
I will try “anything or unused”, even though it seems like “or unused” should be unnecessary.
"""]]

View file

@@ -0,0 +1,16 @@
[[!comment format=mdwn
username="prancewit"
avatar="http://cdn.libravatar.org/avatar/f6cc165b68a5cca3311f9a1cd7fd027c"
subject="My current use case"
date="2022-09-13T19:45:23Z"
content="""
Thanks for the response, Joey.
Let me start by providing more details on the use case where I noticed this slowness.
I was using a slow remote with a lot of chunks. I stopped the upload and wanted to do a cleanup of the uploaded chunks. That's when I noticed that git-annex was requesting a removal of each chunk individually, even ones that never actually got uploaded.
In this particular case, I could \"preload\" the data to make it faster, since I knew which chunks were valid and which ones weren't (though I actually just waited it out).
Also, like you mentioned, this MULTIREMOVE is most useful for this specific case, so a more generic solution would definitely be much better.
"""]]

View file

@@ -0,0 +1,21 @@
[[!comment format=mdwn
username="prancewit"
avatar="http://cdn.libravatar.org/avatar/f6cc165b68a5cca3311f9a1cd7fd027c"
subject="comment 4"
date="2022-09-13T21:36:38Z"
content="""
At a high level, I see two possible ways in which special remotes can optimize for larger queries:
* Pre-fetch or cache the state of existing keys. (Mostly useful for no-op requests: for instance, pre-fetching the list of keys in the remote enables no-op REMOVEs, though it's hard to tell whether the remote has changed since the fetch.)
* Batch multiple requests into a single call. (Batching can be done before or after sending the SUCCESS response to git-annex with the corresponding results.)
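The pre-fetch idea could be sketched roughly like this (not real special remote code; `list_keys` and `delete` stand in for whatever backend calls the remote actually has, and the cached set can go stale if another client changes the remote after the listing):

```python
# Sketch: cache the remote's key list once, then treat REMOVE of a key
# that was never uploaded as a local no-op, skipping the round trip.
# list_keys/delete are hypothetical stand-ins for the remote's backend.

class CachingRemover:
    def __init__(self, list_keys, delete):
        self._delete = delete
        self._present = set(list_keys())  # one up-front listing

    def remove(self, key):
        if key not in self._present:
            return "REMOVE-SUCCESS"       # never uploaded: no-op
        self._delete(key)                 # real deletion round trip
        self._present.discard(key)
        return "REMOVE-SUCCESS"
```

With thousands of chunk keys that were never uploaded, this turns most of the removal pass into local set lookups.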
> So it might be that a protocol extension would be useful, some way for git-annex to indicate that it is blocked waiting on current requests to finish.
I can think of a few related ways to do this:
* Have the remote send ACKs to notify that it's ready for the next request, and send SUCCESS only when the request is actually completed. The remote then has the flexibility to run requests in whatever batched or async manner suits it. In the case of chunk keys, git-annex could rapidly send successive keys in sequence, since no additional lookup is required, making it pretty efficient.
* Have git-annex send some kind of group identifier (all chunks of the same key might be grouped together) or delimiter (e.g. GROUP_COMPLETED). This acts as a hint that these requests could be batched together, without any obligation on the remote to do so. Coupled with a guarantee that all items in one group are sent sequentially, the first item belonging to a different group guarantees that the previous group is complete; the SUCCESS for the last item could then be taken to mean the entire group is completed. One issue here is that this could leak some information in encrypted repositories.
* Define a CACHE_TIMEOUT_SECONDS. The remote could use this to decide whether pre-fetched or cached data can be trusted or should be re-checked, and git-annex could use it during merge/sync with other remotes to determine whether there's a conflict that needs to be handled differently. (Seems too complicated TBH, but trying to see how pre-fetch/caching could be made to work.)
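To make the ACK idea concrete, here is a rough sketch (hedged: \"ACK\" is not part of the current external special remote protocol, and `delete_batch` is an assumed backend call that removes many keys in one request; a real remote would also need a trigger, such as the proposed \"git-annex is blocked\" signal, to know when to flush):

```python
# Sketch of the hypothetical ACK extension: ACK each REMOVE immediately
# so the caller can pipeline the next request, defer the real work, and
# report per-key SUCCESS only after a batched deletion.

class PipelinedRemover:
    def __init__(self, delete_batch):
        self._delete_batch = delete_batch  # assumed bulk-delete backend call
        self._pending = []

    def remove(self, key):
        self._pending.append(key)  # defer; let the caller keep sending
        return "ACK"

    def flush(self):
        # Called when the caller signals it is blocked waiting on results.
        batch, self._pending = self._pending, []
        if batch:
            self._delete_batch(batch)  # one request for the whole batch
        return [(k, "REMOVE-SUCCESS") for k in batch]
```

The per-key SUCCESS responses are only emitted at flush time, which keeps the existing one-reply-per-request contract while still letting the backend see the whole batch at once.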
"""]]