This is a summary todo covering several subprojects, which would extend
git-annex to be able to use proxies which sit in front of a cluster of
repositories.

1. [[design/passthrough_proxy]]
2. [[design/p2p_protocol_over_http]]
3. [[design/balanced_preferred_content]]
4. [[todo/track_free_space_in_repos_via_git-annex_branch]]
5. [[todo/proving_preferred_content_behavior]]

Joey has received funding to work on this.
Planned schedule of work:

* June: git-annex proxy
* July, part 1: git-annex proxy support for exporttree
* July, part 2: p2p protocol over http
* August: balanced preferred content
* September: streaming through proxy to special remotes (especially S3)
* October: proving behavior of balanced preferred content with proxies

[[!tag projects/openneuro]]

# work notes

In development on the `proxy` branch.

For June's work on [[design/passthrough_proxy]], implementation plan:

* UUID discovery via git-annex branch. Add a log file listing UUIDs
  accessible via proxy UUIDs. It also will contain the names
  of the remotes that the proxy is a proxy for,
  from the perspective of the proxy. (done)

* Add `git-annex updateproxy` command (done)

* Remote instantiation for proxies. (done)

* Implement git-annex-shell proxying to git remotes. (done)

* Proxy should update location tracking information for proxied remotes,
  so it is available to other users who sync with it. (done)

* Implement `git-annex updatecluster` command (done)

* Implement cluster UUID insertion on location log load, and removal
  on location log store. (done)

* Omit cluster UUIDs when constructing drop proofs, since lockcontent will
  always fail on a cluster. (done)

* Don't count cluster UUID as a copy. (done)

* Tab complete proxied remotes and clusters in e.g. the `--from` option. (done)

* Getting a key from a cluster should proxy from one of the nodes that has
  it. (done)

* Implement cluster drops, trying to remove from all nodes, and returning
  which UUIDs it was dropped from. (done)

* Getting a key from a cluster currently always selects the lowest cost
  remote, and always the same remote if cost is the same. Should
  round-robin among remotes, and prefer to avoid using remotes that
  other git-annex processes are currently using.

* Implement upload with fanout and reporting back additional UUIDs over P2P
  protocol. (done, but need to check for fencepost errors on resume of
  incomplete upload with remotes at different points)

* On upload to cluster, send to nodes where it's preferred content, and not
  to other nodes.

* Problem: `move --from cluster` in "does this make it worse"
  check may fail to realize that dropping from multiple nodes does in fact
  make it worse.

* Support annex.jobs for clusters.

* On upload to a cluster, as well as fanout to nodes, if the key is
  preferred content of the proxy repository, store it there.
  (But not when preferred content is not configured.)
  And on download from a cluster, if the proxy repository has the content,
  get it from there to avoid the overhead of proxying to a node.

* Basic proxying to special remote support (non-streaming).

* Support proxies-of-proxies better, e.g. foo-bar-baz.
  Currently, it does work, but `git-annex updateproxy` has to be run on
  foo in order for it to notice that the bar-baz proxied remote exists,
  and record it as foo-bar-baz. Make it skip recording proxies of
  proxies like that, and instead automatically generate those from the log.
  (With cycle prevention there of course.)

* Cycle prevention including cluster-in-cluster cycles. See design.

* Optimise proxy speed. See design for ideas.

* Use `sendfile()` to avoid data copying overhead when
  `receiveBytes` is being fed right into `sendBytes`.

* Encryption and chunking. See design for issues.

* Indirect uploads (to be considered). See design.

* Support using a proxy when its url is a P2P address.
  (E.g. tor-annex remotes.)
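
For the open item above about getting a key from a cluster, here is a rough sketch of one possible selection policy. It is illustrative Python, not git-annex code. The names `Node`, `NodePicker` and the `in_use` flag are hypothetical, and it assumes that cost takes priority over idleness, which the item does not specify. How to detect that another git-annex process is using a remote is left out entirely.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Node:
    """Hypothetical stand-in for a cluster node that has the key."""
    name: str
    cost: int
    in_use: bool = False  # assumption: set when another process is using the node


class NodePicker:
    """Round-robins among the lowest cost nodes, preferring idle ones."""

    def __init__(self) -> None:
        self._turn = 0  # number of picks made so far

    def pick(self, nodes: Sequence[Node]) -> Optional[Node]:
        if not nodes:
            return None
        # Keep only the nodes that share the lowest cost.
        lowest = min(n.cost for n in nodes)
        cheapest = [n for n in nodes if n.cost == lowest]
        # Prefer nodes that no other process is using; fall back to all
        # of the cheapest nodes when every one of them is busy.
        idle = [n for n in cheapest if not n.in_use]
        candidates = sorted(idle or cheapest, key=lambda n: n.name)
        # Rotate through the candidates on successive picks.
        choice = candidates[self._turn % len(candidates)]
        self._turn += 1
        return choice
```

With two nodes of equal cost, successive calls to `pick` alternate between them, rather than always returning the same one. A higher cost node is only used when no lower cost node has the key.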