git-annex/doc/todo/git-remote-annex.mdwn

git-remote-annex will be a program that allows push/pull/clone of a git
repository to many types of git-annex special remote.

This is a redesign and reimplementation of git-remote-datalad-annex.
It will be a safer implementation, will support incremental pushes, and
will be available to users who don't use datalad.
--[[Joey]]

---

This is implemented and working. Remaining todo list for it:

* Test incremental push edge cases involving checkprereq.

* Cloning from an annex:: url with importtree=yes doesn't work
  (with or without exporttree=yes). This is because the ContentIdentifier
  db is not populated. It should be possible to work around this.

* It would be nice if git-annex could generate an annex:: url
  for a special remote and show it to the user, eg when
  they have set the shorthand "annex::" url, so they know the full url.
  `git-annex info $remote` could also display it.
  Currently, the user has to remember how the special remote was
  configured and replicate it all in the url.

  There are some difficulties to doing this, including that
  RemoteConfig can have hidden fields that should be omitted.

* initremote/enableremote could have an option that sets the url of a
  special remote to an annex:: url. This would make it easier to use
  git-remote-annex, since the user would not need to set up the url
  themselves. (Also it would then avoid setting `skipFetchAll = true`.)

* datalad-annex supports cloning from the web special remote,
  using a url that contains the result of pushing to, eg, a directory
  special remote.
  `datalad-annex::https://example.com?type=web&url={noquery}`
  Supporting something like this would be good.

* Improve behavior in push races. A race can overwrite a change
  to the MANIFEST and lose work that was pushed from the other repo.
  From the user's perspective, that situation is the same as if one repo
  pushed new work, then the other repo did a git push --force, overwriting
  the first repo's push. In the first repo, another push will then fail as
  a non fast-forward, and the user can recover as usual. This is probably
  okish.

  But.. a MANIFEST overwrite will leave bundle files in the remote that
  are not listed in the MANIFEST. It seems likely that git-annex could
  detect that after the fact and clean it up. Eg, if it caches
  the last MANIFEST it uploaded, next time it downloads the MANIFEST
  it can check if there are bundle files in the old one that are not
  in the new one. If so, it can drop those bundle files from the remote.
  (May be unsafe, see below section on bundle deletion problems.)

* A push race can also appear to the user as if they pushed a ref, but then
  it got deleted from the remote. This happens when two pushes are
  pushing different ref names. This might be harder for the user to
  notice; git fetch does not indicate that a remote ref got deleted.
  They would have to use git fetch --prune to notice the deletion.
  Once the user does notice, they can re-push their ref to recover.
  Can this be improved?
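
The cleanup idea in the MANIFEST overwrite item above (cache the last
MANIFEST that was uploaded, compare it with the next one downloaded)
comes down to a set difference. A minimal sketch, assuming a MANIFEST can
be modeled as a list of bundle names; `stale_bundles` and the bundle names
are hypothetical and this is not how git-annex implements it:

```python
def stale_bundles(cached_manifest, downloaded_manifest):
    """Bundles listed in the MANIFEST we last uploaded that the
    MANIFEST now on the remote no longer lists. These are the
    candidates for dropping from the remote."""
    still_listed = set(downloaded_manifest)
    return [b for b in cached_manifest if b not in still_listed]


# Hypothetical bundle names: our push listed bundle-1 and bundle-2,
# then a racing push overwrote the MANIFEST with bundle-2 and bundle-3.
cached = ["bundle-1", "bundle-2"]
downloaded = ["bundle-2", "bundle-3"]
print(stale_bundles(cached, downloaded))
```

Whether it is safe to actually drop the candidates is a separate question,
see the section on bundle deletion problems.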

## bundle deletion problems

Deleting bundles results in some problems involving races,
detailed below, that result in the manifest file listing a bundle that has
been deleted. Which breaks cloning, and is data loss, and so *must*
be solved before release.

* A race between an incremental push and a full push can result in
  a bundle that the incremental push is based on being deleted by the full
  push, and then the incremental push's manifest file being written later.
  That will prevent cloning or some pulls from working.

  A fix: Make each full push (and emptying push) also write to a fallback
  manifest file that is only written by full pushes (and emptying pushes),
  not incremental pushes. When fetching the main manifest file, always
  check that all bundles mentioned in it are still in the remote. If any
  are missing, fetch and use the fallback manifest file instead.

* A race between two full pushes can also result in the manifest file listing
  a bundle that has been deleted:

  Start with a full push of bundle A.

  Then there are 2 racing full pushes X and Y, of bundles A and B
  respectively, with this series of operations:

  1. Y: write bundle B
  1. Y: read manifest (listing A)
  1. Y: write B to manifest
  1. X: write bundle A
  1. Y: delete bundle A
  1. X: read manifest (listing B)
  1. X: write A to manifest
  1. X: delete bundle B

  Which results in a manifest that lists A, but that bundle was deleted.
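
The interleaving of the two full pushes above can be replayed in a few
lines to check the outcome. This is only a simulation, assuming the remote
is modeled as a set of bundle names and the manifest as a list, and that
each full push deletes the bundles that were listed in the manifest it
read. `replay_full_push_race` is a hypothetical name.

```python
def replay_full_push_race():
    remote = {"A"}      # bundles stored in the remote (after the initial full push)
    manifest = ["A"]    # bundles listed in the manifest file

    remote.add("B")             # 1. Y: write bundle B
    y_seen = list(manifest)     # 2. Y: read manifest (listing A)
    manifest = ["B"]            # 3. Y: write B to manifest
    remote.add("A")             # 4. X: write bundle A
    for b in y_seen:            # 5. Y: delete bundle A
        remote.discard(b)
    x_seen = list(manifest)     # 6. X: read manifest (listing B)
    manifest = ["A"]            # 7. X: write A to manifest
    for b in x_seen:            # 8. X: delete bundle B
        remote.discard(b)

    return manifest, remote


manifest, remote = replay_full_push_race()
missing = [b for b in manifest if b not in remote]
print(manifest, sorted(remote), missing)
```

The replay ends with the manifest listing A while the remote holds no
bundles at all.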

The problems above *could* be solved by not deleting bundles, but that is
unsatisfactory.

Old bundles could be deleted after some period of time. But a process can
be suspended at any point and resumed later, so the race windows can be
arbitrarily wide.

What if only emptying pushes delete bundles? If a manifest file refers to a
bundle that has been deleted, that can be treated the same as if the
manifest file was empty, because we know that, for that bundle to have been
deleted, there must have been an emptying push. So this would work.
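
A minimal sketch of that rule on the fetching side, assuming the manifest
is a list of bundle names and that there is some way to ask the remote
whether a bundle is present. `effective_manifest` and `remote_has` are
hypothetical names, not git-annex functions.

```python
def effective_manifest(manifest, remote_has):
    """Return the bundles to fetch. If any listed bundle is gone from
    the remote, an emptying push must have happened, so behave as if
    the manifest was empty."""
    if all(remote_has(bundle) for bundle in manifest):
        return list(manifest)
    return []


# Hypothetical remote contents, modeled as a set of bundle names.
present = {"A", "B"}
print(effective_manifest(["A", "B"], present.__contains__))
print(effective_manifest(["A", "C"], present.__contains__))
```

This only holds as long as nothing other than an emptying push ever
deletes a bundle.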

It is kind of a cop-out, because it requires the user to do an emptying
push from time to time. But by doing that, the user will expect that
someone who pulls at that point gets an empty repository.

Note that a race between an emptying push and a ref push will result in the
emptying push winning, so the ref push is lost. This is the same behavior
as can happen in a push race not involving deletion though, and any
improvements that are made to the UI around that will also help with this.