clean shutdown of cluster connection when PUT is interrupted

An interrupted `git-annex copy --to` a cluster via the http server
failed when repeated. The http server output "transfer already in
progress, or unable to take transfer lock". Apparently a second
connection was opened to the cluster, because the first connection
never got shut down.

Turned out the problem was that, when proxying to a cluster, the proxy
would read a short ByteString from the client and send that to the
nodes. But that left the nodes waiting for more data. Meanwhile, the
proxy was expecting a SUCCESS/FAILURE message from the nodes, so it
never returned, and the cluster connection stayed open.
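
As a rough illustration of the shape of the fix (a hypothetical sketch, not
git-annex's actual code): after relaying the client's upload to the nodes,
compare how many bytes actually arrived with how many were announced, and
close the node connections when the count falls short, so neither side is
left blocked waiting on the other. The names `Node`, `sendChunk`,
`closeNode`, and `relayToNodes` are placeholders for this sketch only.

```haskell
import Control.Monad (when)
import qualified Data.ByteString.Lazy as L

-- Hypothetical stand-in for a connection to one cluster node.
newtype Node = Node { nodeName :: String }

-- Pretend to relay a chunk of the client's upload to a node.
sendChunk :: Node -> L.ByteString -> IO ()
sendChunk n b =
    putStrLn (nodeName n ++ " <- " ++ show (L.length b) ++ " bytes")

-- Pretend to shut down the connection to a node.
closeNode :: Node -> IO ()
closeNode n = putStrLn ("closing connection to " ++ nodeName n)

-- Relay the body to every node. If the client sent fewer bytes than
-- it announced, close the node connections instead of leaving the
-- nodes blocked waiting for the rest of the data (and the proxy
-- blocked waiting for their SUCCESS/FAILURE replies).
relayToNodes :: Int -> [Node] -> L.ByteString -> IO ()
relayToNodes expectedlen nodes body = do
    mapM_ (`sendChunk` body) nodes
    let got = fromIntegral (L.length body)
    when (got /= expectedlen) $
        mapM_ closeNode nodes

main :: IO ()
main = relayToNodes 100 [Node "node1", Node "node2"] (L.replicate 42 0)
```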
Joey Hess 2024-07-28 14:15:28 -04:00
parent bdde6d829c
commit 5e205f215d
2 changed files with 13 additions and 10 deletions

@@ -558,7 +558,6 @@ proxyRequest proxydone proxyparams requestcomplete requestmessage protoerrhandler
 		(const protoerr)
 	relayPUTMulti minoffset remotes k (Len datalen) _ = do
-		let totallen = datalen + minoffset
 		-- Tell each remote how much data to expect, depending
 		-- on the remote's offset.
 		rs <- forMC (proxyConcurrencyConfig proxyparams) remotes $ \r@(remoteside, remoteoffset) ->
@@ -569,6 +568,8 @@ proxyRequest proxydone proxyparams requestcomplete requestmessage protoerrhandler
 		protoerrhandler (send (catMaybes rs) minoffset) $
 			client $ net $ receiveBytes (Len datalen) nullMeterUpdate
 	  where
+		totallen = datalen + minoffset
 		chunksize = fromIntegral defaultChunkSize
 		-- Stream the lazy bytestring out to the remotes in chunks.
@@ -593,13 +594,21 @@ proxyRequest proxydone proxyparams requestcomplete requestmessage protoerrhandler
 						return r
 					else return (Just r)
 			if L.null b'
-				then sent (catMaybes rs')
+				then do
+					-- If we didn't receive as much
+					-- data as expected, close
+					-- connections to all the remotes,
+					-- because they are still waiting
+					-- on the rest of the data.
+					when (n' /= totallen) $
+						mapM_ (closeRemoteSide . fst) rs
+					sent (catMaybes rs')
 				else send (catMaybes rs') n' b'
 		sent [] = proxydone
 		sent rs = relayDATAFinishMulti k (map fst rs)
 	runRemoteSideOrSkipFailed remoteside a =
 		runRemoteSide remoteside a >>= \case
 			Right v -> return (Just v)
 			Left _ -> do
@@ -640,7 +649,7 @@ proxyRequest proxydone proxyparams requestcomplete requestmessage protoerrhandler
 		net receiveMessage
 	  where
 		finish a = do
 			storeduuids <- forMC (proxyConcurrencyConfig proxyparams) rs $ \r ->
 				runRemoteSideOrSkipFailed r a >>= \case
 					Just (Just resp) ->
 						relayPUTRecord k r resp

@@ -28,12 +28,6 @@ Planned schedule of work:
 ## work notes
-* An interrupted `git-annex copy --to` a cluster via the http server,
-  when repeated, fails. The http server outputs "transfer already in
-  progress, or unable to take transfer lock". Apparently a second
-  connection gets opened to the cluster, because the first connection
-  never got shut down.
 * When part of a file has been sent to a cluster via the http server,
   the transfer interrupted, and another node is added to the cluster,
   and the transfer of the file performed again, there is a failure