This adds two new key types to git-annex, GITMANIFEST and GITBUNDLE.

GITMANIFEST--$UUID is the manifest for a git repository stored in the
git-annex repository with that UUID.

GITBUNDLE--$UUID-sha256 is a git bundle.
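
As a rough illustration of the two key names, here is a minimal Python sketch. It assumes that the `sha256` part of a GITBUNDLE key is the lowercase hex SHA-256 digest of the bundle file's content; the function names `manifest_key` and `bundle_key` are hypothetical and not part of git-annex.

```python
import hashlib


def manifest_key(uuid: str) -> str:
    """Name of the manifest key for the repository with this UUID."""
    return f"GITMANIFEST--{uuid}"


def bundle_key(uuid: str, bundle: bytes) -> str:
    """Name of the key for one git bundle.

    Assumption: "sha256" in the key name is the hex SHA-256 digest
    of the bundle's bytes.
    """
    digest = hashlib.sha256(bundle).hexdigest()
    return f"GITBUNDLE--{uuid}-{digest}"
```

Since the name is derived from the content, the same bundle always maps to the same key.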
# format of the manifest file
An ordered list of bundle keys, one per line.
(Lines end with unix `"\n"`, not `"\r\n"`.)
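
The format is simple enough to sketch in a few lines of Python. This is only an illustration: the names `parse_manifest` and `format_manifest` are hypothetical, and the UTF-8 decoding, the skipping of empty lines, and the rejection of `"\r"` are assumptions rather than documented behavior.

```python
def parse_manifest(data: bytes) -> list[str]:
    """Return the bundle keys listed in a manifest, in order."""
    text = data.decode("utf-8")  # assumption: keys are plain text
    if "\r" in text:
        # Assumption: treat CRLF line endings as an error.
        raise ValueError("manifest lines must end with LF, not CRLF")
    # Empty lines (such as the one after the final newline) are skipped.
    return [line for line in text.split("\n") if line]


def format_manifest(keys: list[str]) -> bytes:
    """Serialize an ordered list of bundle keys, one per line."""
    return "".join(key + "\n" for key in keys).encode("utf-8")
```

Parsing the output of `format_manifest` gives back the same ordered list of keys.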
# fetching
1. download GITMANIFEST for the uuid of the special remote
2. download each listed GITBUNDLE key that we don't have
3. `git fetch` from each new bundle in order
   (note that later bundles can update refs from the versions in previous
   bundles)
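
The fetch steps can be sketched as follows. This is a simplified model, not the real implementation: the manifest is assumed to be already downloaded and parsed (step 1), and `download` and `fetch_bundle` are hypothetical callbacks standing in for retrieving a key from the special remote and running `git fetch` on the bundle file. `resolve_refs` models the note about later bundles, with each bundle's refs given as a plain dict.

```python
from typing import Callable


def fetch_new_bundles(
    manifest_keys: list[str],
    have: set[str],
    download: Callable[[str], None],
    fetch_bundle: Callable[[str], None],
) -> list[str]:
    """Download and fetch every listed bundle that is not in `have`."""
    new_keys = [key for key in manifest_keys if key not in have]
    for key in new_keys:
        download(key)        # step 2: get bundles we don't have
    for key in new_keys:
        fetch_bundle(key)    # step 3: fetch in manifest order
    have.update(new_keys)
    return new_keys


def resolve_refs(bundle_refs: list[dict[str, str]]) -> dict[str, str]:
    """Combine the refs of bundles in manifest order; later bundles win."""
    refs: dict[str, str] = {}
    for one_bundle in bundle_refs:
        refs.update(one_bundle)
    return refs
```

Processing bundles strictly in manifest order is what lets a later bundle replace the version of a ref stored in an earlier one.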
# pushing (incrementally)
This is how pushes are usually done.
1. create git bundle of all refs that are being pushed and have changed,
   and objects since the previously pushed refs
2. hash to calculate GITBUNDLE key
3. upload GITBUNDLE key
4. download current manifest
5. append GITBUNDLE key to manifest and upload the updated manifest
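
A minimal Python model of an incremental push, using an in-memory dict in place of the special remote. The bundle is taken as already created (step 1). The name `push_incremental`, the dict-based `store`, and the use of a hex SHA-256 digest in the key name are all assumptions made for illustration, as is writing the updated manifest back to the store at the end.

```python
import hashlib


def push_incremental(store: dict[str, bytes], uuid: str, bundle: bytes) -> str:
    """Store `bundle` and list it at the end of the manifest."""
    # Step 2: hash to calculate the GITBUNDLE key (assumed hex SHA-256).
    key = f"GITBUNDLE--{uuid}-{hashlib.sha256(bundle).hexdigest()}"
    # Step 3: upload the bundle.
    store[key] = bundle
    # Step 4: download the current manifest (empty on the first push).
    manifest_name = f"GITMANIFEST--{uuid}"
    current = store.get(manifest_name, b"").decode("utf-8")
    keys = [line for line in current.split("\n") if line]
    # Step 5: append the new key, then write the manifest back (assumed).
    keys.append(key)
    store[manifest_name] = "".join(k + "\n" for k in keys).encode("utf-8")
    return key
```

Uploading the bundle before touching the manifest means the manifest never lists a bundle that is not yet stored.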
# pushing (full)
Note that this can be used to replace incrementals with a single bundle for
performance. It is also the only way to handle a push that deletes a
previously pushed ref.
1. create git bundle containing all refs stored in the repository, and all
   objects
2. hash to calculate GITBUNDLE key name
3. upload GITBUNDLE key
4. download old manifest
5. upload new manifest listing only the single new GITBUNDLE
6. delete all other GITBUNDLEs that were listed in the old manifest
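
A full push can be modelled in the same way. Again this is only a sketch: `push_full` and the dict-based `store` are hypothetical, the key name assumes a hex SHA-256 digest, and the guard that avoids deleting the new bundle when it happens to match an old key is an added safety assumption.

```python
import hashlib


def push_full(
    store: dict[str, bytes], uuid: str, bundle: bytes
) -> tuple[str, list[str]]:
    """Replace all listed bundles with one bundle; return (key, deleted)."""
    # Steps 2 and 3: name the bundle by its hash and upload it.
    key = f"GITBUNDLE--{uuid}-{hashlib.sha256(bundle).hexdigest()}"
    store[key] = bundle
    # Download the old manifest to learn which bundles it listed.
    manifest_name = f"GITMANIFEST--{uuid}"
    old = store.get(manifest_name, b"").decode("utf-8")
    old_keys = [line for line in old.split("\n") if line]
    # Upload a new manifest listing only the single new bundle.
    store[manifest_name] = (key + "\n").encode("utf-8")
    # Delete the other bundles that the old manifest listed.
    deleted = []
    for old_key in old_keys:
        if old_key != key and old_key in store:
            del store[old_key]
            deleted.append(old_key)
    return key, deleted
```

Old bundles are deleted only after the new manifest is in place, so a reader never sees a manifest that points at a deleted bundle.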
# multiple GITMANIFEST files
Usually there will only be one per special remote, but it's possible for
multiple special remotes to point to the same object storage, and if so
multiple GITMANIFEST objects can be stored.

It follows that the UUID of the special remote has to be included in the
annex:// URI, to know which GITMANIFEST to use when cloning from it.