split up remaining items from todo/git-annex_proxies and close it!

Joey Hess committed 2024-10-30 14:49:54 -04:00
parent 9b7378fb79
commit 87871f724e
GPG key ID: DB12DB0FF05F8F38 (no known key found for this signature in database)
3 changed files with 27 additions and 82 deletions


@@ -400,6 +400,11 @@ liveRepoOffsets (RepoSizeHandle (Just h) _) wantedsizechange = H.queryDb h $ do
 			map (\(k, v) -> (k, [v])) $
 				fromMaybe [] $
 					M.lookup u livechanges
+		-- This could be optimised to a single SQL join, rather
+		-- than querying once for each live change. That would make
+		-- it less expensive when there are a lot happening at the
+		-- same time. Persistent is not capable of that join,
+		-- it would need a dependency on esqueleto.
 		livechanges' <- combinelikelivechanges <$>
 			filterM (nonredundantlivechange livechangesbykey u)
 				(fromMaybe [] $ M.lookup u livechanges)
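The hunk above works over a per-UUID map of live changes. As a rough illustration of the data shape involved (hypothetical `UUID` and `SizeChange` types and `groupByUUID`/`changesFor` helpers, not git-annex's real code), the `map (\(k, v) -> (k, [v]))` grouping combined with `M.lookup u livechanges` amounts to:

```haskell
import qualified Data.Map.Strict as M

-- Hypothetical stand-ins for git-annex's real types, for illustration only.
type UUID = String
type SizeChange = Integer

-- Group per-repository live size changes into lists keyed by UUID,
-- the same shape the surrounding code builds before filtering out
-- redundant changes.
groupByUUID :: [(UUID, SizeChange)] -> M.Map UUID [SizeChange]
groupByUUID = M.fromListWith (++) . map (\(k, v) -> (k, [v]))

-- Look up one repository's live changes, defaulting to none.
changesFor :: UUID -> M.Map UUID [SizeChange] -> [SizeChange]
changesFor = M.findWithDefault []

main :: IO ()
main = do
	let lc = groupByUUID [("u1", 10), ("u2", -5), ("u1", 3)]
	print (changesFor "u1" lc)  -- later entries are prepended by fromListWith
	print (changesFor "u3" lc)  -- no live changes for this UUID
```

The comment added by the commit notes that doing the per-change lookups as one SQL join would be cheaper when many live changes exist at once; that would need esqueleto, since Persistent alone cannot express the join.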


@@ -0,0 +1,7 @@
+`git-annex info` in the limitedcalc path in cachedAllRepoData
+double-counts redundant information from the journal due to using
+overLocationLogs. In the other path it does not (any more; it used to but
+live repo sizes fixed that), and this should be fixed for consistency
+and correctness.
+
+(This is a deferred item from the [[todo/git-annex_proxies]] megatodo.) --[[Joey]]
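The double counting described in this new todo can be illustrated with a minimal sketch (hypothetical data and names, not git-annex's real types): when a journalled location is already reflected in the location log data, counting both sources inflates the total.

```haskell
import Data.List (nub)

-- Hypothetical example data, for illustration only.
logLocations :: [String]
logLocations = ["uuid-a", "uuid-b"]   -- locations from the location log

journalLocations :: [String]
journalLocations = ["uuid-b"]         -- journal entry already folded into the log

-- Counting both sources naively counts uuid-b twice.
naiveCount :: Int
naiveCount = length (logLocations ++ journalLocations)

-- Deduplicating first gives the correct count.
correctCount :: Int
correctCount = length (nub (logLocations ++ journalLocations))

main :: IO ()
main = print (naiveCount, correctCount)  -- prints (3,2)
```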


@@ -1,4 +1,4 @@
-This is a summary todo covering several subprojects, which would extend
+This is a summary todo covering several subprojects, which extend
 git-annex to be able to use proxies which sit in front of a cluster of
 repositories.
@@ -12,7 +12,7 @@ repositories.
 [[!toc ]]
 
-## planned schedule
+## plan
 
 Joey has received funding to work on this.
 Planned schedule of work:
@@ -24,94 +24,27 @@ Planned schedule of work:
 * September: proving behavior of balanced preferred content with proxies
 * October: streaming through proxy to special remotes (especially S3)
 
+> This project is now complete! [[done]] --[[Joey]]
+
 [[!tag projects/openneuro]]
 
-## remaining things to do in October
+## some todos that spun off from this project and didn't get implemented during it:
 
-* Possibly some of the deferred items listed in following sections:
+For balanced preferred content and maxsize tracking:
 
-## items deferred until later for balanced preferred content and maxsize tracking
+* [[todo/assistant_does_not_use_LiveUpdate]]
+* [[todo/git-annex_info_with_limit_overcounts]]
 
-* The assistant is using NoLiveUpdate, but it should be possible to plumb
-  a LiveUpdate through it from preferred content checking to location log
-  updating.
+For p2p protocol over http:
 
-* `git-annex info` in the limitedcalc path in cachedAllRepoData
-  double-counts redundant information from the journal due to using
-  overLocationLogs. In the other path it does not (any more; it used to),
-  and this should be fixed for consistency and correctness.
+* [[p2phttp_serve_multiple_repositories]]
+* [[git-remote-annex_support_for_p2phttp]]
 
-* getLiveRepoSizes has a filterM getRecentChange over the live updates.
-  This could be optimised to a single SQL join. There are usually not many
-  live updates, but sometimes there will be a great many recent changes,
-  so it might be worth doing this optimisation. Persistent is not capable
-  of this, would need dependency added on esqueleto.
+For proxying:
 
-## items deferred until later for p2p protocol over http
+* [[proxying_for_p2phttp_and_tor-annex_remotes]]
+* [[faster_proxying]]
+* [[smarter_use_of_disk_when_proxying]]
 
-* Support proxying to git remotes that use annex+http urls. This needs a
-  translation from P2P protocol to servant-client to P2P protocol.
-
-* Should be possible to use a git-remote-annex annex::$uuid url as
-  remote.foo.url with remote.foo.annexUrl using annex+http, and so
-  not need a separate web server to serve the git repository. Doesn't work
-  currently because git-remote-annex urls only support special remotes.
-  It would need a new form of git-remote-annex url, eg:
-  annex::$uuid?annex+http://example.com/git-annex/
-
-* `git-annex p2phttp` could support systemd socket activation. This would
-  allow making a systemd unit that listens on port 80.
-
-## items deferred until later for [[design/passthrough_proxy]]
-
-* Check annex.diskreserve when proxying for special remotes
-  to avoid the proxy's disk filling up with the temporary object file
-  cached there.
-
-* Resuming an interrupted download from a proxied special remote makes the proxy
-  re-download the whole content. It could instead keep some of the
-  object files around when the client does not send SUCCESS. This would
-  use more disk, but could minimize to eg, the last 2 or so.
-  The design doc has some more thoughts about this.
-
-* Getting a key from a cluster currently picks from among
-  the lowest cost remotes at random. This could be smarter,
-  eg prefer to avoid using remotes that are doing other transfers at the
-  same time.
-
-* The cost of a proxied node that is accessed via an intermediate gateway
-  is currently the same as a node accessed via the cluster gateway.
-  To fix this, there needs to be some way to tell how many hops through
-  gateways it takes to get to a node. Currently the only way is to
-  guess based on number of dashes in the node name, which is not satisfying.
-  Even counting hops is not very satisfying, one cluster gateway could
-  be much more expensive to traverse than another one.
-  If seriously tackling this, it might be worth making enough information
-  available to use spanning tree protocol for routing inside clusters.
-
-* Speed: A proxy to a local git repository spawns git-annex-shell
-  to communicate with it. It would be more efficient to operate
-  directly on the Remote. Especially when transferring content to/from it.
-  But: When a cluster has several nodes that are local git repositories,
-  and is sending data to all of them, this would need an alternate
-  interface than `storeKey`, which supports streaming of chunks
-  of a ByteString.
-
-* Use `sendfile()` to avoid data copying overhead when
-  `receiveBytes` is being fed right into `sendBytes`.
-  Library to use:
-  <https://hackage.haskell.org/package/hsyscall-0.4/docs/System-Syscall.html>
-
-* Support using a proxy when its url is a P2P address.
-  (Eg tor-annex remotes.)
-
-* When an upload to a cluster is distributed to multiple special remotes,
-  a temporary file is written for each one, which may even happen in
-  parallel. This is a lot of extra work and may use excess disk space.
-  It should be possible to only write a single temp file.
-  (With streaming this wouldn't be an issue.)
-
 ## completed items for October's work on streaming through proxy to special remotes