From eb95ed4863015dc991de946b403c450e482637d7 Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Wed, 27 Oct 2021 16:15:41 -0400 Subject: [PATCH] fix addurl concurrency issue addurl: Support adding the same url to multiple files at the same time when using -J with --batch --with-files. Implementation was easier than expected, was able to reuse OnlyActionOn. While it will download the url's content multiple times, that seems like the best thing to do; see my comment for why. Sponsored-by: Dartmouth College's DANDI project --- CHANGELOG | 2 ++ Command/AddUrl.hs | 12 ++++++++- ...durl_failure_has_empty_error-messages.mdwn | 2 ++ ..._74e37e96d01bfb8b9521000d5faa7e53._comment | 2 +- ..._1b87390e69204b42878930cc1614437e._comment | 26 +++++++++++++++++++ 5 files changed, 42 insertions(+), 2 deletions(-) create mode 100644 doc/bugs/addurl_failure_has_empty_error-messages/comment_5_1b87390e69204b42878930cc1614437e._comment diff --git a/CHANGELOG b/CHANGELOG index 9091a2de3f..3ff1917879 100644 --- a/CHANGELOG +++ b/CHANGELOG @@ -27,6 +27,8 @@ git-annex (8.20211012) UNRELEASED; urgency=medium the hook to freeze the file in ways that prevent moving it, such as removing the Windows delete permission. Thanks, Reiko Asakura. + * addurl: Support adding the same url to multiple files at the same + time when using -J with --batch --with-files. -- Joey Hess Mon, 11 Oct 2021 14:09:13 -0400 diff --git a/Command/AddUrl.hs b/Command/AddUrl.hs index f256aed2d8..a24d4c8c9b 100644 --- a/Command/AddUrl.hs +++ b/Command/AddUrl.hs @@ -372,11 +372,21 @@ checkRaw o a - a filename. It's not displayed then for output consistency, - but is added to the json when available. -} startingAddUrl :: SeekInput -> URLString -> AddUrlOptions -> CommandPerform -> CommandStart -startingAddUrl si url o p = starting "addurl" (ActionItemOther (Just url)) si $ do +startingAddUrl si url o p = starting "addurl" ai si $ do case fileOption (downloadOptions o) of Nothing -> noop Just file -> maybeShowJSON $ JSONChunk [("file", file)] p + where + -- Avoid failure when the same url is downloaded concurrently + -- to two different files, by using OnlyActionOn with a key + -- based on the url. Note that this may not be the actual key + -- that is used for the download; later size information may be + -- available and get added to it. That's ok, this is only + -- used to prevent two threads running concurrently when that would + -- likely fail. + ai = OnlyActionOn urlkey (ActionItemOther (Just url)) + urlkey = Backend.URL.fromUrl url Nothing showDestinationFile :: FilePath -> Annex () showDestinationFile file = do diff --git a/doc/bugs/addurl_failure_has_empty_error-messages.mdwn b/doc/bugs/addurl_failure_has_empty_error-messages.mdwn index 6e5246cda3..3ade4c1737 100644 --- a/doc/bugs/addurl_failure_has_empty_error-messages.mdwn +++ b/doc/bugs/addurl_failure_has_empty_error-messages.mdwn @@ -2,3 +2,5 @@ While running `git-annex addurl --batch --with-files --jobs 10 --json --json-err [[!meta author=jwodder]] [[!tag projects/dandi]] + +> [[fixed|done]] --[[Joey]] diff --git a/doc/bugs/addurl_failure_has_empty_error-messages/comment_4_74e37e96d01bfb8b9521000d5faa7e53._comment b/doc/bugs/addurl_failure_has_empty_error-messages/comment_4_74e37e96d01bfb8b9521000d5faa7e53._comment index a276239e11..7411b045d1 100644 --- a/doc/bugs/addurl_failure_has_empty_error-messages/comment_4_74e37e96d01bfb8b9521000d5faa7e53._comment +++ b/doc/bugs/addurl_failure_has_empty_error-messages/comment_4_74e37e96d01bfb8b9521000d5faa7e53._comment @@ -8,7 +8,7 @@ downloading, and the key transfer machinery prevents redundant downloads of the same Key at the same time. Arguably, the problem is not where the message gets put, but that -it fails when adding an url to two different paths. +it fails when adding an url to two different paths at the same time. I have, though, moved that message so it will appear in error-messages. """]] diff --git a/doc/bugs/addurl_failure_has_empty_error-messages/comment_5_1b87390e69204b42878930cc1614437e._comment b/doc/bugs/addurl_failure_has_empty_error-messages/comment_5_1b87390e69204b42878930cc1614437e._comment new file mode 100644 index 0000000000..009320608e --- /dev/null +++ b/doc/bugs/addurl_failure_has_empty_error-messages/comment_5_1b87390e69204b42878930cc1614437e._comment @@ -0,0 +1,26 @@ +[[!comment format=mdwn + username="joey" + subject="""comment 5""" + date="2021-10-27T18:56:23Z" + content=""" +The best solution I can find is for it to notice when another thread is +downloading the same url, and wait until it finishes. Then proceed +with downloading the url for a second time. + +It's not very satisfying to re-download. But once the url Key is downloaded, +it does not keep that url Key populated, but hashes the content and moves +the content to the final Key. It would be a real complication to +communicate, across threads, what Key the content ended up at, and have the +waiting thread use that. And addurl is already complicated well beyond a +point I am comfortable with. + +Also, the content of an url can of course change over time. If I feed +"$url foo" into git-annex addurl --batch -J10 and then some time +later, I feed "$url bar", I might expect that file bar gets whatever +content the url has now, not the content that the url had back when I added +the same url to file foo. And if I cared about avoiding re-downloading, +I could add the url to the first file, and then copy the annex link to the +second file myself. + +Implemented this approach. +"""]]