This is a summary todo covering several subprojects, which together would
extend git-annex to be able to use proxies that sit in front of a cluster
of repositories.

1. [[design/passthrough_proxy]]
2. [[design/p2p_protocol_over_http]]
3. [[design/balanced_preferred_content]]
4. [[todo/track_free_space_in_repos_via_git-annex_branch]]
5. [[todo/proving_preferred_content_behavior]]

Joey has received funding to work on this.
Planned schedule of work:

* June: git-annex proxy
* July, part 1: git-annex proxy support for exporttree
* July, part 2: p2p protocol over http
* August: balanced preferred content
* September: streaming through proxy to special remotes (especially S3)
* October: proving behavior of balanced preferred content with proxies

[[!tag projects/openneuro]]

# work notes

In development on the `proxy` branch.

For June's work on [[design/passthrough_proxy]], implementation plan:

* UUID discovery via git-annex branch. Add a log file listing UUIDs
  accessible via proxy UUIDs. It will also contain the names
  of the remotes that the proxy is a proxy for,
  from the perspective of the proxy. (done)

* Add `git-annex updateproxy` command (done)

* Remote instantiation for proxies. (done)

* Implement git-annex-shell proxying to git remotes. (done)

* Proxy should update location tracking information for proxied remotes,
  so it is available to other users who sync with it. (done)

* Implement `git-annex updatecluster` command (done)

* Implement cluster UUID insertion on location log load, and removal
  on location log store. (done)

* Omit cluster UUIDs when constructing drop proofs, since lockcontent will
  always fail on a cluster. (done)

* Don't count cluster UUID as a copy. (done)

* Tab complete proxied remotes and clusters in eg the `--from` option. (done)

* Getting a key from a cluster should proxy from one of the nodes that has
  it. (done)

* Implement cluster drops, trying to remove from all nodes, and returning
  which UUIDs it was dropped from. (done)

* Getting a key from a cluster currently always selects the lowest cost
  remote, and always the same remote if cost is the same. Should
  round-robin among remotes, and prefer to avoid using remotes that
  other git-annex processes are currently using.

* Implement upload with fanout and reporting back additional UUIDs over P2P
  protocol. (done, but need to check for fencepost errors on resume of
  incomplete upload with remotes at different points)

* On upload to cluster, send to nodes where it's preferred content, and not
  to other nodes.

* Problem: The "does this make it worse" check in `move --from cluster`
  may fail to realize that dropping from multiple nodes does in fact
  make it worse.

* Bug: When a cluster has one node, copying a file to it does not update
  the location log to say the content is present on it. It's returning
  SUCCESS rather than SUCCESS-PLUS.

* Support annex.jobs for clusters.

* On upload to a cluster, as well as fanout to nodes, if the key is
  preferred content of the proxy repository, store it there.
  (But not when preferred content is not configured.)
  And on download from a cluster, if the proxy repository has the content,
  get it from there to avoid the overhead of proxying to a node.

* Basic proxying to special remote support (non-streaming).

* Support proxies-of-proxies better, eg foo-bar-baz.
  Currently, it does work, but `git-annex updateproxy` has to be run
  on foo in order for it to notice the bar-baz proxied remote exists,
  and record it as foo-bar-baz. Make it skip recording proxies of
  proxies like that, and instead automatically generate those from the log.
  (With cycle prevention there of course.)

* Cycle prevention including cluster-in-cluster cycles. See design.

* Optimise proxy speed. See design for ideas.

* Use `sendfile()` to avoid data copying overhead when
  `receiveBytes` is being fed right into `sendBytes`.

* Encryption and chunking. See design for issues.

* Indirect uploads (to be considered). See design.

* Support using a proxy when its url is a P2P address.
  (Eg tor-annex remotes.)
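
Rough sketch of the node selection item above (round-robin among nodes,
avoiding ones in use). This is illustrative Python, not git-annex code.
`Node`, `pick_node`, the `in_use` set and the `counter` argument are all
hypothetical names, and it assumes that round-robin only happens among the
nodes tied for lowest cost, which the item above does not actually specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    # Hypothetical stand-in for a cluster node (a proxied remote).
    name: str
    cost: int


def pick_node(candidates, in_use, counter):
    """Pick a node to get a key from.

    candidates: nodes known to have the key
    in_use: names of nodes other processes are currently using (assumed
            to be known somehow; how to learn this is not sketched here)
    counter: number that increases with each request, drives round-robin
    """
    if not candidates:
        return None
    lowest = min(n.cost for n in candidates)
    # Sort by name so the rotation order is stable between calls.
    cheapest = sorted((n for n in candidates if n.cost == lowest),
                      key=lambda n: n.name)
    # Prefer idle nodes, but fall back to busy ones rather than failing.
    idle = [n for n in cheapest if n.name not in in_use]
    pool = idle or cheapest
    return pool[counter % len(pool)]
```

With nodes `a` and `b` at the same cost, successive counter values
alternate between them, and `a` is skipped while it is listed in `in_use`.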
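
Rough sketch of the cluster UUID insertion/removal item above, again in
illustrative Python rather than git-annex code. The function names and the
`clusters` mapping (cluster UUID to set of node UUIDs) are hypothetical, and
it assumes a cluster UUID is inserted whenever at least one of its nodes is
listed in the location log. Clusters containing clusters are ignored here.

```python
def add_cluster_uuids(locations, clusters):
    """On location log load: add each cluster that has a node listed.

    locations: UUIDs read from the location log
    clusters: dict mapping cluster UUID -> set of node UUIDs
    """
    locs = set(locations)
    result = set(locs)
    for cluster_uuid, nodes in clusters.items():
        # Check against the loaded locations only, so one inserted
        # cluster UUID cannot cause another to be inserted.
        if nodes & locs:
            result.add(cluster_uuid)
    return result


def remove_cluster_uuids(locations, clusters):
    """On location log store: never write cluster UUIDs to the log."""
    return {u for u in locations if u not in clusters}
```

Loading and then storing a log gives back the original set of node UUIDs,
so cluster UUIDs only ever exist in memory.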