fix addurl concurrency issue

addurl: Support adding the same url to multiple files at the same time when
using -J with --batch --with-files.

Implementation was easier than expected; I was able to reuse OnlyActionOn.

While it will download the url's content multiple times, that seems like
the best thing to do; see my comment for why.

Sponsored-by: Dartmouth College's DANDI project
Joey Hess 2021-10-27 16:15:41 -04:00
parent a3cdff3fd5
commit eb95ed4863
5 changed files with 42 additions and 2 deletions


@@ -27,6 +27,8 @@ git-annex (8.20211012) UNRELEASED; urgency=medium
     the hook to freeze the file in ways that prevent moving it, such as
     removing the Windows delete permission.
     Thanks, Reiko Asakura.
+  * addurl: Support adding the same url to multiple files at the same
+    time when using -J with --batch --with-files.

  -- Joey Hess <id@joeyh.name>  Mon, 11 Oct 2021 14:09:13 -0400


@@ -372,11 +372,21 @@ checkRaw o a
 - a filename. It's not displayed then for output consistency,
 - but is added to the json when available. -}
 startingAddUrl :: SeekInput -> URLString -> AddUrlOptions -> CommandPerform -> CommandStart
-startingAddUrl si url o p = starting "addurl" (ActionItemOther (Just url)) si $ do
+startingAddUrl si url o p = starting "addurl" ai si $ do
 	case fileOption (downloadOptions o) of
 		Nothing -> noop
 		Just file -> maybeShowJSON $ JSONChunk [("file", file)]
 	p
+  where
+	-- Avoid failure when the same url is downloaded concurrently
+	-- to two different files, by using OnlyActionOn with a key
+	-- based on the url. Note that this may not be the actual key
+	-- that is used for the download; later size information may be
+	-- available and get added to it. That's ok, this is only
+	-- used to prevent two threads running concurrently when that would
+	-- likely fail.
+	ai = OnlyActionOn urlkey (ActionItemOther (Just url))
+	urlkey = Backend.URL.fromUrl url Nothing

 showDestinationFile :: FilePath -> Annex ()
 showDestinationFile file = do

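The OnlyActionOn machinery itself is not part of this diff. As a rough conceptual sketch of the idea it relies on — serializing concurrent workers that target the same url-derived key — here is a Python analogue (all names hypothetical; this is not git-annex's actual Haskell implementation):

```python
import threading

# Hypothetical sketch of per-key serialization, mirroring the idea
# behind OnlyActionOn: at most one thread may act on a given key at
# a time; other threads targeting the same key block until it is done.
_locks = {}
_locks_guard = threading.Lock()

def only_action_on(key, action):
    """Run action(), ensuring at most one thread acts on `key` at once."""
    with _locks_guard:
        # One lock per key; all threads using the same key share it.
        lock = _locks.setdefault(key, threading.Lock())
    with lock:
        return action()
```

Since the key here is derived from the url alone (like `urlkey` above, built without size information), two batch items naming the same url map to the same lock even though the final content Key is not yet known.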

@@ -2,3 +2,5 @@ While running `git-annex addurl --batch --with-files --jobs 10 --json --json-err
 [[!meta author=jwodder]]
 [[!tag projects/dandi]]
+
+> [[fixed|done]] --[[Joey]]


@@ -8,7 +8,7 @@ downloading, and the key transfer machinery prevents redundant downloads
 of the same Key at the same time.

 Arguably, the problem is not where the message gets put, but that
-it fails when adding an url to two different paths.
+it fails when adding an url to two different paths at the same time.

 I have, though, moved that message so it will appear in error-messages.
 """]]


@@ -0,0 +1,26 @@
+[[!comment format=mdwn
+ username="joey"
+ subject="""comment 5"""
+ date="2021-10-27T18:56:23Z"
+ content="""
+The best solution I can find is for it to notice when another thread is
+downloading the same url, and wait until it finishes. Then proceed
+with downloading the url for a second time.
+
+It's not very satisfying to re-download. But once the url Key is downloaded,
+it does not keep that url Key populated, but hashes the content and moves
+the content to the final Key. It would be a real complication to
+communicate, across threads, what Key the content ended up at, and have the
+waiting thread use that. And addurl is already complicated well beyond a
+point I am comfortable with.
+
+Also, the content of an url can of course change over time. If I feed
+"$url foo" into git-annex addurl --batch -J10 and then some time
+later, I feed "$url bar", I might expect that file bar gets whatever
+content the url has now, not the content that the url had back when I added
+the same url to file foo. And if I cared about avoiding re-downloading,
+I could add the url to the first file, and then copy the annex link to the
+second file myself.
+
+Implemented this approach.
+"""]]