fix addurl concurrency issue
addurl: Support adding the same url to multiple files at the same time
when using -J with --batch --with-files.

Implementation was easier than expected, was able to reuse OnlyActionOn.

While it will download the url's content multiple times, that seems like
the best thing to do; see my comment for why.

Sponsored-by: Dartmouth College's DANDI project
This commit is contained in:
parent a3cdff3fd5
commit eb95ed4863
5 changed files with 42 additions and 2 deletions
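The guard in the patch below is keyed on a key derived only from the url, so two batch entries naming the same url contend on the same key even though their final, content-addressed keys are only known after download. A minimal sketch of that idea, using hypothetical types rather than git-annex's real Key or Backend.URL:

    -- Rough illustration (hypothetical types, not git-annex's real Key or
    -- Backend.URL) of why a url-only key works as a concurrency guard:
    -- both workers handed the same url build the same key, because the key
    -- carries just the url and an optional size, never the content hash
    -- that is only known after the download finishes.
    data UrlKey = UrlKey
        { keyUrl :: String
        , keySize :: Maybe Integer
        }
        deriving (Eq, Ord, Show)

    -- Shaped like the Backend.URL.fromUrl url Nothing call in the patch.
    fromUrl :: String -> Maybe Integer -> UrlKey
    fromUrl = UrlKey

    main :: IO ()
    main = do
        let k1 = fromUrl "https://example.com/data" Nothing  -- for file foo
            k2 = fromUrl "https://example.com/data" Nothing  -- for file bar
        print (k1 == k2)  -- True: both workers contend on one guard key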
@@ -27,6 +27,8 @@ git-annex (8.20211012) UNRELEASED; urgency=medium
     the hook to freeze the file in ways that prevent moving it, such as
     removing the Windows delete permission.
     Thanks, Reiko Asakura.
+  * addurl: Support adding the same url to multiple files at the same
+    time when using -J with --batch --with-files.
 
  -- Joey Hess <id@joeyh.name>  Mon, 11 Oct 2021 14:09:13 -0400
 
@@ -372,11 +372,21 @@ checkRaw o a
  - a filename. It's not displayed then for output consistency,
  - but is added to the json when available. -}
 startingAddUrl :: SeekInput -> URLString -> AddUrlOptions -> CommandPerform -> CommandStart
-startingAddUrl si url o p = starting "addurl" (ActionItemOther (Just url)) si $ do
+startingAddUrl si url o p = starting "addurl" ai si $ do
 	case fileOption (downloadOptions o) of
 		Nothing -> noop
 		Just file -> maybeShowJSON $ JSONChunk [("file", file)]
 	p
+  where
+	-- Avoid failure when the same url is downloaded concurrently
+	-- to two different files, by using OnlyActionOn with a key
+	-- based on the url. Note that this may not be the actual key
+	-- that is used for the download; later size information may be
+	-- available and get added to it. That's ok, this is only
+	-- used to prevent two threads running concurrently when that would
+	-- likely fail.
+	ai = OnlyActionOn urlkey (ActionItemOther (Just url))
+	urlkey = Backend.URL.fromUrl url Nothing
 
 showDestinationFile :: FilePath -> Annex ()
 showDestinationFile file = do
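A minimal, self-contained sketch of the behaviour the patch relies on, as described in comment 5 further down: at most one worker acts on a given key at a time, and a later worker waits for it to finish and then runs its own action anyway (here, a second download). The names Guard and onlyActionOn are hypothetical; this is not git-annex's actual OnlyActionOn machinery.

    import Control.Concurrent.STM
    import Control.Exception (bracket_)
    import qualified Data.Set as S

    -- Set of keys currently being acted on.
    type Guard k = TVar (S.Set k)

    newGuard :: IO (Guard k)
    newGuard = newTVarIO S.empty

    -- Block while another worker holds the key, then hold it for the
    -- duration of the action and release it afterwards.
    onlyActionOn :: Ord k => Guard k -> k -> IO a -> IO a
    onlyActionOn guard k action = bracket_ acquire release action
      where
        acquire = atomically $ do
            inflight <- readTVar guard
            if k `S.member` inflight
                then retry  -- another worker is on this key; wait
                else writeTVar guard (S.insert k inflight)
        release = atomically $ modifyTVar' guard (S.delete k)

With a key built only from the url (like urlkey above), two addurl workers handed the same url serialize, while workers on different urls still run in parallel.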
@@ -2,3 +2,5 @@ While running `git-annex addurl --batch --with-files --jobs 10 --json --json-err
 
 [[!meta author=jwodder]]
 [[!tag projects/dandi]]
+
+> [[fixed|done]] --[[Joey]]
@@ -8,7 +8,7 @@ downloading, and the key transfer machinery prevents redundant downloads
 of the same Key at the same time.
 
 Arguably, the problem is not where the message gets put, but that
-it fails when adding an url to two different paths.
+it fails when adding an url to two different paths at the same time.
 
 I have, though, moved that message so it will appear in error-messages.
 """]]
@@ -0,0 +1,26 @@
+[[!comment format=mdwn
+ username="joey"
+ subject="""comment 5"""
+ date="2021-10-27T18:56:23Z"
+ content="""
+The best solution I can find is for it to notice when another thread is
+downloading the same url, and wait until it finishes. Then proceed
+with downloading the url for a second time.
+
+It's not very satisfying to re-download. But once the url Key is downloaded,
+it does not keep that url Key populated, but hashes the content and moves
+the content to the final Key. It would be a real complication to
+communicate, across threads, what Key the content ended up at, and have the
+waiting thread use that. And addurl is already complicated well beyond a
+point I am comfortable with.
+
+Also, the content of an url can of course change over time. If I feed
+"$url foo" into git-annex addurl --batch -J10 and then some time
+later, I feed "$url bar", I might expect that file bar gets whatever
+content the url has now, not the content that the url had back when I added
+the same url to file foo. And if I cared about avoiding re-downloading,
+I could add the url to the first file, and then copy the annex link to the
+second file myself.
+
+Implemented this approach.
+"""]]
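To illustrate why the waiting thread cannot simply reuse the first thread's result, here is a toy sketch of the re-keying step the comment describes: the downloaded content is hashed and moved to its final, content-addressed key, so nothing is left populated under the temporary url key. The helper name, key string, and object path are made up for illustration and do not match git-annex's real layout.

    -- Toy sketch of "hashes the content and moves the content to the
    -- final Key" (hypothetical helper; not git-annex code, and the key
    -- format and object path below are simplified).
    import qualified Data.ByteString as B
    import qualified Data.ByteString.Lazy as L
    import Data.Digest.Pure.SHA (sha256, showDigest)  -- from the "SHA" package
    import System.Directory (renameFile)

    rekeyDownload :: FilePath -> IO FilePath
    rekeyDownload tmp = do
        content <- B.readFile tmp  -- strict read, so the handle is closed
        let finalKey = "SHA256-s" ++ show (B.length content)
                    ++ "--" ++ showDigest (sha256 (L.fromStrict content))
            dest = "annex-objects/" ++ finalKey
        renameFile tmp dest
        return dest

A second file added from the same url later may therefore end up at a different final key, which is consistent with letting it get whatever content the url has at that time.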