From 9b116870a69fa3889b715489bc4f1c235f7d49cf Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Sat, 6 Apr 2024 05:28:29 -0400 Subject: [PATCH 01/53] added docs for git-remote-annex special remote contents Designed with the help of Timothy Sanders and Michael Hanke at Distribits 2024 --- doc/internals/git-remote-annex.mdwn | 24 ++++++++++++++++++++++++ 1 file changed, 24 insertions(+) create mode 100644 doc/internals/git-remote-annex.mdwn diff --git a/doc/internals/git-remote-annex.mdwn b/doc/internals/git-remote-annex.mdwn new file mode 100644 index 0000000000..49e97400ce --- /dev/null +++ b/doc/internals/git-remote-annex.mdwn @@ -0,0 +1,24 @@ +This adds a GIT-- object type to git-annex. + +GIT--manifest is the manifest + +GIT--hash is a git bundle + +# format of the manifest file + +An ordered list of bundle keys, one per line. + +# fetching + +1. download manifest +2. download each listed GIT bundle object that we don't have +3. fetch from bundles in timestamp order + +# pushing + +1. create git bundle, hash to calculate GIT bundle object name +2. upload GIT bundle object +3. download current manifest +4. add to manifest with current time, and upload + + From f900c56ca362f15c10c422b89e9bfdef356dd879 Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Sat, 6 Apr 2024 08:30:51 -0400 Subject: [PATCH 02/53] parameterize manifest on UUID and expand slightly --- doc/internals/git-remote-annex.mdwn | 43 +++++++++++++++++++++-------- 1 file changed, 32 insertions(+), 11 deletions(-) diff --git a/doc/internals/git-remote-annex.mdwn b/doc/internals/git-remote-annex.mdwn index 49e97400ce..0334e53edb 100644 --- a/doc/internals/git-remote-annex.mdwn +++ b/doc/internals/git-remote-annex.mdwn @@ -1,8 +1,9 @@ -This adds a GIT-- object type to git-annex. +This adds two new object types to git-annex, GITMANIFEST and a GITBUNDLE. -GIT--manifest is the manifest +GITMANIFEST-$UUID is the manifest for a git repository stored in the +git-annex repository with that UUID. 
-GIT--hash is a git bundle +GITBUNDLE--sha256 is a git bundle. # format of the manifest file @@ -10,15 +11,35 @@ An ordered list of bundle keys, one per line. # fetching -1. download manifest -2. download each listed GIT bundle object that we don't have -3. fetch from bundles in timestamp order +1. download GITMANIFEST for the uuid of the special remote +2. download each listed GITBUNDLE object that we don't have +3. git fetch from bundles in timestamp order -# pushing +# pushing (incrementally) -1. create git bundle, hash to calculate GIT bundle object name -2. upload GIT bundle object -3. download current manifest -4. add to manifest with current time, and upload +1. create git bundle containing refs to push, and objects since + the previously pushed refs +2. hash to calculate GITBUNDLE key +3. upload GITBUNDLE object +4. download current manifest +5. append GITBUNDLE key to manifest +# pushing (replacing incrementals with single bundle) +1. create git bundle containing refs to push and all objects +2. hash to calculate GITBUNDLE object name +3. upload GITBUNDLE object +4. download current manifest +5. remove all old GITBUNDLES from the manifest, and add new GITBUNDLE at + the end. Note that it's possible for the manifest to contain GITBUNDLES + that were not in the last fetched manifest, if so those must be + preserved, and the new GITBUNDLE appended + +# multiple GITMANIFEST files + +Usually there will only be one per special remote, but it's possible for +multiple special remotes to point to the same object storage, and if so +multiple GITMANIFEST objects can be stored. + +It follows that the UUID of the special remote has to be included in the +annex:// uri, to know which GITMANIFEST to use when cloning from it. 
From 6ff4300bd19c7ad508328e797a17f86364ae5136 Mon Sep 17 00:00:00 2001
From: Joey Hess
Date: Thu, 25 Apr 2024 16:38:34 -0400
Subject: [PATCH 03/53] proof of concept for push to git bundles with MANIFEST

This is a shell script, so not final code, and it does not use git-annex at all, but it shows how to push to git bundles, listed in a MANIFEST, the same as the git-remote-annex program will eventually do.

While developing this, I realized that the design needed to be changed slightly regarding where refs are stored. Since a push can delete a ref from a remote, storing each newly pushed ref in a bundle won't work, because deleting a ref would then entail deleting all old bundles and re-uploading from scratch.

So instead, only the refs in the last bundle listed in the MANIFEST are the active refs. Any refs in prior bundles are just old refs that were stored previously (a reflog as it were).

That means that, in a situation where two different people are pushing to the same special remote from different repos, whoever pushes last wins. Any refs pushed by the other person earlier will be ignored. This may not be desirable, and git-annex might be able to use the git-annex branch to detect such situations and rescue the refs that got lost. Even without such a recovery process though, the refs that the other person thought they pushed will be preserved in their refs/namespaces/mine, so a pull followed by a push will generally resolve the situation.

Note that the use of refs/namespaces/mine in the bundle is not really desirable, and it might be worth making a local clone of the repo in order to set up the refs that will be put in the bundle, which seems to be the only way to avoid needing that. But it does need to maintain refs/namespaces/mine/ in the git repo in order to remember which refs have been pushed to the remote before, so that they can be included in the next bundle pushed. A name that includes the remote uuid will be needed in the final implementation.
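To illustrate the "last bundle listed in the MANIFEST holds the active refs" rule described above, here is a minimal sketch. The function name `last_bundle` is hypothetical and is not part of this patch; it only assumes the manifest is a plain text file with one bundle name per line.

```shell
#!/bin/sh
# Hypothetical helper (an illustration, not the patch's code): print the
# last non-empty line of a manifest file. Under the design described in
# this commit message, that names the bundle whose refs are the active ones.
last_bundle () {
	# $1: path to the manifest, one bundle name per line
	[ -e "$1" ] || return 0
	# drop blank lines, then keep only the final entry
	grep -v '^$' "$1" | tail -n 1
}
```

The refs in that bundle could then be listed with something like `git bundle list-heads "$(last_bundle MANIFEST)"`, which is the same git command the script uses for `list`.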
Anyway, this shell script seems to fully work, including incremental pushing, force pushing, and pushes that delete refs. Sponsored-by: Brett Eisenberg on Patreon --- doc/internals/git-remote-annex.mdwn | 16 ++- git-remote-annex | 152 ++++++++++++++++++++++++++++ 2 files changed, 163 insertions(+), 5 deletions(-) create mode 100755 git-remote-annex diff --git a/doc/internals/git-remote-annex.mdwn b/doc/internals/git-remote-annex.mdwn index 0334e53edb..0e916c99ff 100644 --- a/doc/internals/git-remote-annex.mdwn +++ b/doc/internals/git-remote-annex.mdwn @@ -1,6 +1,6 @@ This adds two new object types to git-annex, GITMANIFEST and a GITBUNDLE. -GITMANIFEST-$UUID is the manifest for a git repository stored in the +GITMANIFEST--$UUID is the manifest for a git repository stored in the git-annex repository with that UUID. GITBUNDLE--sha256 is a git bundle. @@ -9,16 +9,21 @@ GITBUNDLE--sha256 is a git bundle. An ordered list of bundle keys, one per line. +The last bundle in the list provides all refs that are currently stored in +the repository. The bundles before it in the list can incrementally provide +objects, but not refs. + # fetching 1. download GITMANIFEST for the uuid of the special remote 2. download each listed GITBUNDLE object that we don't have -3. git fetch from bundles in timestamp order +3. `git bundle unpack` each bundle in order +4. `git fetch` from the last bundle listed in the manifest # pushing (incrementally) -1. create git bundle containing refs to push, and objects since - the previously pushed refs +1. create git bundle all refs that will be stored in the repository, + and objects since the previously pushed refs 2. hash to calculate GITBUNDLE key 3. upload GITBUNDLE object 4. download current manifest @@ -26,7 +31,8 @@ An ordered list of bundle keys, one per line. # pushing (replacing incrementals with single bundle) -1. create git bundle containing refs to push and all objects +1. 
create git bundle containing all refs stored in the repository, and all + objects 2. hash to calculate GITBUNDLE object name 3. upload GITBUNDLE object 4. download current manifest diff --git a/git-remote-annex b/git-remote-annex new file mode 100755 index 0000000000..6171e9d7b2 --- /dev/null +++ b/git-remote-annex @@ -0,0 +1,152 @@ +#!/bin/sh + +set -x + +# remember the refs that were uploaded already +git for-each-ref refs/namespaces/mine/ > .git/old-refs + +# Unfortunately, git bundle omits prerequisites that are omitted once, +# even if they are used by a later ref. +# For example, where x is a ref that points at A, and y is a ref +# that points at B (which has A as its parent), git bundle x A..y +# will omit inclding the x ref in the bundle at all. +check_prereq () { + # So, if a sha is one of the other refs that will be included in the + # bundle, it cannot be treated as a prerequisite. + if git for-each-ref refs/namespaces/mine/ | grep -Pv "\t$2$" | awk '{print $1}' | grep -q "$1"; then + echo "$2" + else + # And, if one of the other refs that will be included in the bundle + # is an ancestor of the sha, it cannot be treated as a prerequisite. + if [ -n "$(for x in $(git for-each-ref refs/namespaces/mine/ | grep -Pv "\t$2$" | awk '{print $1}'); do git log --oneline -n1 $x..$1; done)" ]; then + echo "$2" + else + echo "$1..$2" + fi + fi +} + +while read foo; do + case "$foo" in + capabilities) + echo fetch + echo push + echo + ;; + list*) + if [ -e "MANIFEST" ]; then + # Only list the refs in the last bundle + # listed in the manifest. Each push + # includes all refs in its bundle. + f=$(tail -n 1 MANIFEST) + if [ -n "$f" ]; then + # refs in the bundle may end up prefixed with refs/namespaces/mine/ + # when the intent is for the bundle to include a + # ref with the name that comes after that. 
+ git bundle list-heads $f | sed 's/refs\/namespaces\/mine\///' + fi + fi + echo + ;; + fetch*) + dofetch=1 + ;; + push*) + set -- $foo + x="$2" + # src ref if prefixed with a + in a forced push + srcref="$(echo "$x" | cut -d : -f 1 | sed 's/^\+//')" + dstref="$(echo "$x" | cut -d : -f 2)" + if [ -z "$srcref" ]; then + git update-ref -d refs/namespaces/mine/"$dstref" + else + # Need to create a bundle containing $dstref, but + # don't want to overwrite that ref in the local + # repo. Unfortunately, git bundle does not support + # GIT_NAMESPACE, so it's not possible to do that + # without making a clone of the whole git repo. + # Instead, just create a ref under the namespace + # refs/namespaces/mine/ that will be put in the + # bundle. + git update-ref refs/namespaces/mine/"$dstref" "$srcref" + fi + dopush=1 + ;; + # docs say a blank line ends communication, but that's not + # accurate, actually a blank line comes after a series of + # fetch or push commands, and also according to the docs, + # another series of commands could follow + "") + if [ "$dofetch" ]; then + if [ -e "MANIFEST" ]; then + for f in $(cat MANIFEST); do + git bundle unbundle "$f" >/dev/null 2>&1 + done + fi + echo + dofetch="" + fi + if [ "$dopush" ]; then + if [ -z "$(git for-each-ref refs/namespaces/mine/)" ]; then + # deleted all refs + if [ -e "MANIFEST" ]; then + for f in $(cat MANIFEST); do + rm "$f" + done + rm MANIFEST + touch MANIFEST + fi + else + # set REPUSH=1 to do a full push + # rather than incremental + if [ "$REPUSH" ]; then + rm MANIFEST + rm *.bundle + git for-each-ref refs/namespaces/mine/ | awk '{print $3}' | \ + git bundle create --quiet new.bundle --stdin + else + # incremental bundle + IFS=" +" + (for l in $(git for-each-ref refs/namespaces/mine/); do + r=$(echo "$l" | awk '{print $3}') + newsha=$(echo "$l" | awk '{print $1}') + oldsha=$(grep -P "\t$r$" .git/old-refs | awk '{print $1}') + if [ -n "$oldsha" ]; then + # include changes from $oldsha to $r when there are 
some + if [ -n "$(git log --oneline $oldsha..$r)" ]; then + check_prereq "$oldsha" "$r" + else + if [ "$oldsha" = "$newsha" ]; then + # $r is unchanged from last push, so include + # the minimum data to make the bundle contain $r + rparentsha=$(git log -n 2 "$r" --format='%H' | tail -n+2) + if [ -n "$rparentsha" ]; then + check_prereq "$rparentsha" "$r" + else + # $r has no parent so include it as is + echo "$r" + fi + else + # $oldsha is not a parent of $r, so + # include $r and all its parents + echo "$r" + fi + fi + else + # no old version was pushed so include $r and all its parents + echo "$r" + fi + done) \ + | git bundle create --quiet new.bundle --stdin + fi + sha1=$(sha1sum new.bundle | awk '{print $1}') + mv new.bundle "$sha1.bundle" + echo "$sha1.bundle" >> MANIFEST + fi + echo + dopush="" + fi + ;; + esac +done From 99491f572fae84d705af17d6a02fb793c9599af8 Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Fri, 26 Apr 2024 12:53:53 -0400 Subject: [PATCH 04/53] TOPDIR --- git-remote-annex | 38 ++++++++++++++++++++------------------ 1 file changed, 20 insertions(+), 18 deletions(-) diff --git a/git-remote-annex b/git-remote-annex index 6171e9d7b2..6cd3d5772e 100755 --- a/git-remote-annex +++ b/git-remote-annex @@ -1,5 +1,7 @@ #!/bin/sh +TOPDIR=.. + set -x # remember the refs that were uploaded already @@ -34,16 +36,16 @@ while read foo; do echo ;; list*) - if [ -e "MANIFEST" ]; then + if [ -e "$TOPDIR/MANIFEST" ]; then # Only list the refs in the last bundle # listed in the manifest. Each push # includes all refs in its bundle. - f=$(tail -n 1 MANIFEST) + f=$(tail -n 1 $TOPDIR/MANIFEST) if [ -n "$f" ]; then # refs in the bundle may end up prefixed with refs/namespaces/mine/ # when the intent is for the bundle to include a # ref with the name that comes after that. 
- git bundle list-heads $f | sed 's/refs\/namespaces\/mine\///' + git bundle list-heads $TOPDIR/$f | sed 's/refs\/namespaces\/mine\///' fi fi echo @@ -78,9 +80,9 @@ while read foo; do # another series of commands could follow "") if [ "$dofetch" ]; then - if [ -e "MANIFEST" ]; then - for f in $(cat MANIFEST); do - git bundle unbundle "$f" >/dev/null 2>&1 + if [ -e "$TOPDIR/MANIFEST" ]; then + for f in $(cat $TOPDIR/MANIFEST); do + git bundle unbundle "$TOPDIR/$f" >/dev/null 2>&1 done fi echo @@ -89,21 +91,21 @@ while read foo; do if [ "$dopush" ]; then if [ -z "$(git for-each-ref refs/namespaces/mine/)" ]; then # deleted all refs - if [ -e "MANIFEST" ]; then - for f in $(cat MANIFEST); do - rm "$f" + if [ -e "$TOPDIR/MANIFEST" ]; then + for f in $(cat $TOPDIR/MANIFEST); do + rm "$TOPDIR/$f" done - rm MANIFEST - touch MANIFEST + rm $TOPDIR/MANIFEST + touch $TOPDIR/MANIFEST fi else # set REPUSH=1 to do a full push # rather than incremental if [ "$REPUSH" ]; then - rm MANIFEST - rm *.bundle + rm $TOPDIR/MANIFEST + rm $TOPDIR/*.bundle git for-each-ref refs/namespaces/mine/ | awk '{print $3}' | \ - git bundle create --quiet new.bundle --stdin + git bundle create --quiet $TOPDIR/new.bundle --stdin else # incremental bundle IFS=" @@ -138,11 +140,11 @@ while read foo; do echo "$r" fi done) \ - | git bundle create --quiet new.bundle --stdin + | git bundle create --quiet $TOPDIR/new.bundle --stdin fi - sha1=$(sha1sum new.bundle | awk '{print $1}') - mv new.bundle "$sha1.bundle" - echo "$sha1.bundle" >> MANIFEST + sha1=$(sha1sum $TOPDIR/new.bundle | awk '{print $1}') + mv $TOPDIR/new.bundle "$TOPDIR/$sha1.bundle" + echo "$sha1.bundle" >> $TOPDIR/MANIFEST fi echo dopush="" From 8b56d6b283e8a3faaa69aa0a3d2e19a9873b350c Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Fri, 26 Apr 2024 14:33:11 -0400 Subject: [PATCH 05/53] fix conflicting push situation In a situation where there are two repos that are diverged and each pushes in turn to git-remote-annex, the first to push updates 
it. Then the second push fails because it is not a fast-forward.

The problem is, before git push fails with "non-fast-forward", it actually calls git-remote-annex with push. So, to the user it appears as if the push failed, but it actually reached the remote, and overwrote the other push!

The only solution to this seems to be for git-remote-annex push to notice when a non-force-push would overwrite a ref stored in the remote, and refuse to push that ref, returning an error to git. This seems strange: why would git make remote helpers implement that when it later checks the same thing itself?

With this fix, it's still possible for a race to overwrite a change to the MANIFEST and lose work that was pushed from the other repo. But that needs two pushes to be running at the same time. From the user's perspective, that situation is the same as if one repo pushed new work, then the other repo did a git push --force, overwriting the first repo's push. In the first repo, another push will then fail as a non-fast-forward, and the user can recover as usual.

But, a MANIFEST overwrite will leave bundle files in the remote that are not listed in the MANIFEST. It seems likely that git-annex will eventually be able to detect that after the fact and clean it up. E.g., it can learn all bundles that are stored in the remote using the location log, and compare them to the MANIFEST to find bundles that got lost.

The race can also appear to the user as if they pushed a ref, but then it got deleted from the remote. This happens when two pushes are pushing different ref names. This might be harder for the user to notice; git fetch does not indicate that a remote ref got deleted. They would have to use git fetch --prune to notice the deletion. Once the user does notice, they can re-push their ref to recover.
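The idea of comparing the stored bundles against the MANIFEST to find lost ones can be sketched roughly as follows. This is only an illustration under assumptions: it uses the directory layout of this proof of concept script (a MANIFEST file next to the *.bundle files) rather than the git-annex location log, and the function name `unlisted_bundles` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (an illustration, not the patch's code): print the
# names of bundle files in a directory that its MANIFEST does not list.
unlisted_bundles () {
	# $1: directory holding MANIFEST and *.bundle files
	#     (what the script calls TOPDIR)
	dir="$1"
	for f in "$dir"/*.bundle; do
		# an unmatched glob stays literal, so skip names that do not exist
		[ -e "$f" ] || continue
		name=$(basename "$f")
		# -x: match the whole line, -F: fixed string, -q: no output
		if ! grep -qxF -- "$name" "$dir/MANIFEST" 2>/dev/null; then
			echo "$name"
		fi
	done
}
```

Anything this prints is a candidate for cleanup, or for recovery of the refs it contains.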
Sponsored-by: Jack Hill on Patreon --- doc/internals/git-remote-annex.mdwn | 8 +- git-remote-annex | 165 ++++++++++++++++++---------- 2 files changed, 107 insertions(+), 66 deletions(-) diff --git a/doc/internals/git-remote-annex.mdwn b/doc/internals/git-remote-annex.mdwn index 0e916c99ff..309c74087b 100644 --- a/doc/internals/git-remote-annex.mdwn +++ b/doc/internals/git-remote-annex.mdwn @@ -35,11 +35,9 @@ objects, but not refs. objects 2. hash to calculate GITBUNDLE object name 3. upload GITBUNDLE object -4. download current manifest -5. remove all old GITBUNDLES from the manifest, and add new GITBUNDLE at - the end. Note that it's possible for the manifest to contain GITBUNDLES - that were not in the last fetched manifest, if so those must be - preserved, and the new GITBUNDLE appended +4. download old manifest +4. upload new manifest listing only the single new GITBUNDLE +5. delete all other GITBUNDLEs that were listed in the old manifest # multiple GITMANIFEST files diff --git a/git-remote-annex b/git-remote-annex index 6cd3d5772e..cab2c418bc 100755 --- a/git-remote-annex +++ b/git-remote-annex @@ -7,6 +7,8 @@ set -x # remember the refs that were uploaded already git for-each-ref refs/namespaces/mine/ > .git/old-refs +rm -f .git/push-response + # Unfortunately, git bundle omits prerequisites that are omitted once, # even if they are used by a later ref. # For example, where x is a ref that points at A, and y is a ref @@ -42,10 +44,13 @@ while read foo; do # includes all refs in its bundle. f=$(tail -n 1 $TOPDIR/MANIFEST) if [ -n "$f" ]; then + # stash the listed refs for later + # checking in push + git bundle list-heads $TOPDIR/$f > .git/listed-refs # refs in the bundle may end up prefixed with refs/namespaces/mine/ # when the intent is for the bundle to include a # ref with the name that comes after that. 
- git bundle list-heads $TOPDIR/$f | sed 's/refs\/namespaces\/mine\///' + sed 's/refs\/namespaces\/mine\///' .git/listed-refs fi fi echo @@ -56,21 +61,46 @@ while read foo; do push*) set -- $foo x="$2" - # src ref if prefixed with a + in a forced push + # src ref is prefixed with a + in a forced push + forcedpush="" + if echo "$x" | cut -d : -f 1 | egrep -q '^\+'; then + forcedpush=1 + fi srcref="$(echo "$x" | cut -d : -f 1 | sed 's/^\+//')" dstref="$(echo "$x" | cut -d : -f 2)" + # Need to create a bundle containing $dstref, but + # don't want to overwrite that ref in the local + # repo. Unfortunately, git bundle does not support + # GIT_NAMESPACE, so it's not possible to do that + # without making a clone of the whole git repo. + # Instead, just create a ref under the namespace + # refs/namespaces/mine/ that will be put in the + # bundle. + mydstref=refs/namespaces/mine/"$dstref" if [ -z "$srcref" ]; then - git update-ref -d refs/namespaces/mine/"$dstref" + git update-ref -d "$mydstref" + touch .git/push-response + echo "ok $dstref" >> .git/push-response else - # Need to create a bundle containing $dstref, but - # don't want to overwrite that ref in the local - # repo. Unfortunately, git bundle does not support - # GIT_NAMESPACE, so it's not possible to do that - # without making a clone of the whole git repo. - # Instead, just create a ref under the namespace - # refs/namespaces/mine/ that will be put in the - # bundle. - git update-ref refs/namespaces/mine/"$dstref" "$srcref" + if [ ! 
"$forcedpush" ]; then + # check if the push would overwrite + # work in the ref currently stored in the + # remote, if so refuse to do it + prevsha=$(grep " $mydstref$" .git/listed-refs | awk '{print $1}') + newsha=$(git rev-parse "$srcref") + if [ -n "$prevsha" ] && [ "$prevsha" != "$newsha" ] && [ -z "$(git log --oneline $prevsha..$newsha 2>/dev/null)" ]; then + touch .git/push-response + echo "error $dstref non-fast-forward" >> .git/push-response + else + touch .git/push-response + echo "ok $dstref" >> .git/push-response + git update-ref "$mydstref" "$srcref" + fi + else + git update-ref "$mydstref" "$srcref" + touch .git/push-response + echo "ok $dstref" >> .git/push-response + fi fi dopush=1 ;; @@ -89,63 +119,76 @@ while read foo; do dofetch="" fi if [ "$dopush" ]; then - if [ -z "$(git for-each-ref refs/namespaces/mine/)" ]; then - # deleted all refs - if [ -e "$TOPDIR/MANIFEST" ]; then - for f in $(cat $TOPDIR/MANIFEST); do - rm "$TOPDIR/$f" - done - rm $TOPDIR/MANIFEST - touch $TOPDIR/MANIFEST - fi + # if some refs cannot be pushed, refuse to + # push anything. It would be difficult to + # push only some refs, because the bundle + # needs to contain all refs, and some refs + # on the remote may contain objects we have + # not fetched yet. 
+ if egrep -q "^error" .git/push-response; then + sed 's/^ok \(.*\)/error \1 unable to push this due to other pushed ref being non-fast-forward/' .git/push-response > .git/push-response.new + mv .git/push-response.new .git/push-response else - # set REPUSH=1 to do a full push - # rather than incremental - if [ "$REPUSH" ]; then - rm $TOPDIR/MANIFEST - rm $TOPDIR/*.bundle - git for-each-ref refs/namespaces/mine/ | awk '{print $3}' | \ - git bundle create --quiet $TOPDIR/new.bundle --stdin + if [ -z "$(git for-each-ref refs/namespaces/mine/)" ]; then + # deleted all refs + if [ -e "$TOPDIR/MANIFEST" ]; then + for f in $(cat $TOPDIR/MANIFEST); do + rm "$TOPDIR/$f" + done + rm $TOPDIR/MANIFEST + touch $TOPDIR/MANIFEST + fi else - # incremental bundle - IFS=" + # set REPUSH=1 to do a full push + # rather than incremental + if [ "$REPUSH" ]; then + rm $TOPDIR/MANIFEST + rm $TOPDIR/*.bundle + git for-each-ref refs/namespaces/mine/ | awk '{print $3}' | \ + git bundle create --quiet $TOPDIR/new.bundle --stdin + else + # incremental bundle + IFS=" " - (for l in $(git for-each-ref refs/namespaces/mine/); do - r=$(echo "$l" | awk '{print $3}') - newsha=$(echo "$l" | awk '{print $1}') - oldsha=$(grep -P "\t$r$" .git/old-refs | awk '{print $1}') - if [ -n "$oldsha" ]; then - # include changes from $oldsha to $r when there are some - if [ -n "$(git log --oneline $oldsha..$r)" ]; then - check_prereq "$oldsha" "$r" - else - if [ "$oldsha" = "$newsha" ]; then - # $r is unchanged from last push, so include - # the minimum data to make the bundle contain $r - rparentsha=$(git log -n 2 "$r" --format='%H' | tail -n+2) - if [ -n "$rparentsha" ]; then - check_prereq "$rparentsha" "$r" + (for l in $(git for-each-ref refs/namespaces/mine/); do + r=$(echo "$l" | awk '{print $3}') + newsha=$(echo "$l" | awk '{print $1}') + oldsha=$(grep -P "\t$r$" .git/old-refs | awk '{print $1}') + if [ -n "$oldsha" ]; then + # include changes from $oldsha to $r when there are some + if [ -n "$(git log 
--oneline $oldsha..$r)" ]; then + check_prereq "$oldsha" "$r" + else + if [ "$oldsha" = "$newsha" ]; then + # $r is unchanged from last push, so include + # the minimum data to make the bundle contain $r + rparentsha=$(git log -n 2 "$r" --format='%H' | tail -n+2) + if [ -n "$rparentsha" ]; then + check_prereq "$rparentsha" "$r" + else + # $r has no parent so include it as is + echo "$r" + fi else - # $r has no parent so include it as is + # $oldsha is not a parent of $r, so + # include $r and all its parents echo "$r" fi - else - # $oldsha is not a parent of $r, so - # include $r and all its parents - echo "$r" - fi - fi - else - # no old version was pushed so include $r and all its parents - echo "$r" - fi - done) \ - | git bundle create --quiet $TOPDIR/new.bundle --stdin + fi + else + # no old version was pushed so include $r and all its parents + echo "$r" + fi + done) \ + | git bundle create --quiet $TOPDIR/new.bundle --stdin + fi + sha1=$(sha1sum $TOPDIR/new.bundle | awk '{print $1}') + mv $TOPDIR/new.bundle "$TOPDIR/$sha1.bundle" + echo "$sha1.bundle" >> $TOPDIR/MANIFEST fi - sha1=$(sha1sum $TOPDIR/new.bundle | awk '{print $1}') - mv $TOPDIR/new.bundle "$TOPDIR/$sha1.bundle" - echo "$sha1.bundle" >> $TOPDIR/MANIFEST fi + cat .git/push-response + rm -f .git/push-response echo dopush="" fi From e5cfaf003caee40021ee0e65896781db3cb7486a Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Fri, 26 Apr 2024 17:11:30 -0400 Subject: [PATCH 06/53] found a bug --- git-remote-annex | 9 +++++++++ 1 file changed, 9 insertions(+) diff --git a/git-remote-annex b/git-remote-annex index cab2c418bc..1770ef2a07 100755 --- a/git-remote-annex +++ b/git-remote-annex @@ -1,5 +1,14 @@ #!/bin/sh +# BUG: +# In one repo, make a new commit on master, and git push remote master +# In a second repo, make a new branch foo, make a new commit in foo, and +# git push remote foo +# This second push overwrites the master branch pushed from the first repo +# with an old version. 
+# Need to fetch new revs before push or rethink including all revs in most +# recent bundle. + TOPDIR=.. set -x From fc37243ffe225b5ddc6486e557491429ab22a510 Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Tue, 30 Apr 2024 13:51:43 -0400 Subject: [PATCH 07/53] convert git-remote-annex to not include old pushed refs in new bundle Rather than requiring the last listed bundle in the manifest include all refs that are in the remote, build up refs from each bundle listed in the manifest. This fixes a bug where pushing first a new branch foo from one clone, and then pushing a new branch bar from another clone, caused the second push to lose branch foo. Now the second push will add a new bundle, but the foo ref in the bundle from the first push will still be used. Pushing a deletion of a ref now has to delete all bundles and push a new bundle with only the remaining refs in it. In a "list for-push", it now has to unbundle all bundles, in order for a deletion repush to have available all objects. (And a non-deletion push can also rely on refs/namespaces/mine/ being up-to-date.) It would have been possible to fix the bug by only making it do that unbundling in "list for-push", without changing what's stored in the bundles. But I think I prefer to populate the bundles this way. For one thing, deleting a pushed ref now really deletes all data relating to it, rather than leaving it present in old bundles. For another, it's easier to explain since there is no special case for the last bundle. And, it will often result in smaller bundles. Note that further efficiency gains are possible with respect to what objects are included in an incremental bundle. Two XXX comments document how to reduce excess objects. It didn't seem worth implementing those optimisations in this proof of concept code. 
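The rule that refs are built up from every bundle in the manifest, with later bundles overriding earlier ones, can be sketched as below. This is a minimal illustration and not the patch's own code, which does the same job with a perl one-liner. The function name `merge_listed_refs` is hypothetical, and the sort is added only to make the output order predictable.

```shell
#!/bin/sh
# Hypothetical helper (an illustration, not the patch's code): merge
# "<sha> <refname>" lines, given in manifest order, so that the last sha
# seen for each ref name wins.
merge_listed_refs () {
	# stdin: concatenated ref listings of each bundle, oldest bundle first
	# stdout: one "<sha> <refname>" line per ref, sorted by ref name
	awk '{ sha[$2] = $1 } END { for (r in sha) print sha[r], r }' | sort -k 2
}
```

For example, feeding it the concatenated output of `git bundle list-heads` for each bundle, in manifest order, yields the ref list that the remote currently holds.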
Sponsored-by: Brock Spratlen on Patreon --- doc/internals/git-remote-annex.mdwn | 19 +-- git-remote-annex | 181 ++++++++++++++-------------- 2 files changed, 104 insertions(+), 96 deletions(-) diff --git a/doc/internals/git-remote-annex.mdwn b/doc/internals/git-remote-annex.mdwn index 309c74087b..2a0ee0267a 100644 --- a/doc/internals/git-remote-annex.mdwn +++ b/doc/internals/git-remote-annex.mdwn @@ -9,27 +9,30 @@ GITBUNDLE--sha256 is a git bundle. An ordered list of bundle keys, one per line. -The last bundle in the list provides all refs that are currently stored in -the repository. The bundles before it in the list can incrementally provide -objects, but not refs. - # fetching 1. download GITMANIFEST for the uuid of the special remote 2. download each listed GITBUNDLE object that we don't have -3. `git bundle unpack` each bundle in order -4. `git fetch` from the last bundle listed in the manifest +3. `git fetch` from each new bundle in order + (note that later bundles can update refs from the versions in previous + bundles) # pushing (incrementally) -1. create git bundle all refs that will be stored in the repository, +This is how pushes are usually done. + +1. create git bundle of all refs that are being pushed and have changed, and objects since the previously pushed refs 2. hash to calculate GITBUNDLE key 3. upload GITBUNDLE object 4. download current manifest 5. append GITBUNDLE key to manifest -# pushing (replacing incrementals with single bundle) +# pushing (full) + +Note that this can be used to replace incrementals with a single bundle for +performance. It is also the only way to handle a push that deletes a +previously pushed ref. 1. 
create git bundle containing all refs stored in the repository, and all objects diff --git a/git-remote-annex b/git-remote-annex index 1770ef2a07..408fd211f5 100755 --- a/git-remote-annex +++ b/git-remote-annex @@ -1,21 +1,9 @@ #!/bin/sh -# BUG: -# In one repo, make a new commit on master, and git push remote master -# In a second repo, make a new branch foo, make a new commit in foo, and -# git push remote foo -# This second push overwrites the master branch pushed from the first repo -# with an old version. -# Need to fetch new revs before push or rethink including all revs in most -# recent bundle. - TOPDIR=.. set -x -# remember the refs that were uploaded already -git for-each-ref refs/namespaces/mine/ > .git/old-refs - rm -f .git/push-response # Unfortunately, git bundle omits prerequisites that are omitted once, @@ -26,12 +14,12 @@ rm -f .git/push-response check_prereq () { # So, if a sha is one of the other refs that will be included in the # bundle, it cannot be treated as a prerequisite. - if git for-each-ref refs/namespaces/mine/ | grep -Pv "\t$2$" | awk '{print $1}' | grep -q "$1"; then + if git show-ref $push_refs | grep -v " $2$" | awk '{print $1}' | grep -q "$1"; then echo "$2" else # And, if one of the other refs that will be included in the bundle # is an ancestor of the sha, it cannot be treated as a prerequisite. 
- if [ -n "$(for x in $(git for-each-ref refs/namespaces/mine/ | grep -Pv "\t$2$" | awk '{print $1}'); do git log --oneline -n1 $x..$1; done)" ]; then + if [ -n "$(for x in $(git show-ref $push_refs | grep -v " $2$" | awk '{print $1}'); do git log --oneline -n1 $x..$1; done)" ]; then echo "$2" else echo "$1..$2" @@ -39,6 +27,12 @@ check_prereq () { fi } +addnewbundle () { + sha1=$(sha1sum $TOPDIR/new.bundle | awk '{print $1}') + mv $TOPDIR/new.bundle "$TOPDIR/$sha1.bundle" + echo "$sha1.bundle" >> $TOPDIR/MANIFEST +} + while read foo; do case "$foo" in capabilities) @@ -48,19 +42,41 @@ while read foo; do ;; list*) if [ -e "$TOPDIR/MANIFEST" ]; then - # Only list the refs in the last bundle - # listed in the manifest. Each push - # includes all refs in its bundle. - f=$(tail -n 1 $TOPDIR/MANIFEST) - if [ -n "$f" ]; then - # stash the listed refs for later - # checking in push - git bundle list-heads $TOPDIR/$f > .git/listed-refs - # refs in the bundle may end up prefixed with refs/namespaces/mine/ - # when the intent is for the bundle to include a - # ref with the name that comes after that. - sed 's/refs\/namespaces\/mine\///' .git/listed-refs + for f in $(cat $TOPDIR/MANIFEST); do + git bundle list-heads $TOPDIR/$f >> .git/listed-refs-new + if [ "$foo" = "list for-push" ]; then + # Get all the objects from the bundle. This is done here so that + # refs/namespaces/mine can be updated with what was listed, + # and so what when a full repush needs to be done, everything + # gets pushed. + git bundle unbundle "$TOPDIR/$f" >/dev/null 2>&1 + fi + done + perl -e 'while (<>) { if (m/(.*) (.*)/) { $seen{$2}=$1 } }; foreach my $k (keys %seen) { print "$seen{$k} $k\n" }' < .git/listed-refs-new > .git/listed-refs + rm -f .git/listed-refs-new + + # when listing for a push, update refs/namespaces/mine to match what was + # listed. This is necessary in order for a full repush to know what to push. 
+ if [ "$foo" = "list for-push" ]; then + for r in $(git for-each-ref refs/namespaces/mine/ | awk '{print $3}'); do + git update-ref -d "$r" + done + IFS=" + " + for x in $(cat .git/listed-refs); do + sha="$(echo "$x" | cut -d ' ' -f 1)" + r="$(echo "$x" | cut -d ' ' -f 2)" + git update-ref "$r" "$sha" + done + unset IFS fi + + # respond to git with a list of refs + sed 's/refs\/namespaces\/mine\///' .git/listed-refs + # .git/listed-refs is later checked in push + else + rm -f .git/listed-refs + touch .git/listed-refs fi echo ;; @@ -87,6 +103,9 @@ while read foo; do # bundle. mydstref=refs/namespaces/mine/"$dstref" if [ -z "$srcref" ]; then + # To delete a ref, have to do a repush of + # all remaining refs. + REPUSH=1 git update-ref -d "$mydstref" touch .git/push-response echo "ok $dstref" >> .git/push-response @@ -104,11 +123,13 @@ while read foo; do touch .git/push-response echo "ok $dstref" >> .git/push-response git update-ref "$mydstref" "$srcref" + push_refs="$mydstref $push_refs" fi else git update-ref "$mydstref" "$srcref" touch .git/push-response echo "ok $dstref" >> .git/push-response + push_refs="$mydstref $push_refs" fi fi dopush=1 @@ -128,72 +149,56 @@ while read foo; do dofetch="" fi if [ "$dopush" ]; then - # if some refs cannot be pushed, refuse to - # push anything. It would be difficult to - # push only some refs, because the bundle - # needs to contain all refs, and some refs - # on the remote may contain objects we have - # not fetched yet. 
- if egrep -q "^error" .git/push-response; then - sed 's/^ok \(.*\)/error \1 unable to push this due to other pushed ref being non-fast-forward/' .git/push-response > .git/push-response.new - mv .git/push-response.new .git/push-response + if [ -z "$(git for-each-ref refs/namespaces/mine/)" ]; then + # deleted all refs + if [ -e "$TOPDIR/MANIFEST" ]; then + for f in $(cat $TOPDIR/MANIFEST); do + rm "$TOPDIR/$f" + done + rm $TOPDIR/MANIFEST + touch $TOPDIR/MANIFEST + fi else - if [ -z "$(git for-each-ref refs/namespaces/mine/)" ]; then - # deleted all refs - if [ -e "$TOPDIR/MANIFEST" ]; then - for f in $(cat $TOPDIR/MANIFEST); do - rm "$TOPDIR/$f" - done - rm $TOPDIR/MANIFEST - touch $TOPDIR/MANIFEST - fi + # set REPUSH=1 to do a full push + # rather than incremental + if [ "$REPUSH" ]; then + rm $TOPDIR/MANIFEST + rm $TOPDIR/*.bundle + git for-each-ref refs/namespaces/mine/ | awk '{print $3}' | \ + git bundle create --quiet $TOPDIR/new.bundle --stdin + addnewbundle else - # set REPUSH=1 to do a full push - # rather than incremental - if [ "$REPUSH" ]; then - rm $TOPDIR/MANIFEST - rm $TOPDIR/*.bundle - git for-each-ref refs/namespaces/mine/ | awk '{print $3}' | \ - git bundle create --quiet $TOPDIR/new.bundle --stdin - else - # incremental bundle - IFS=" -" - (for l in $(git for-each-ref refs/namespaces/mine/); do - r=$(echo "$l" | awk '{print $3}') - newsha=$(echo "$l" | awk '{print $1}') - oldsha=$(grep -P "\t$r$" .git/old-refs | awk '{print $1}') - if [ -n "$oldsha" ]; then - # include changes from $oldsha to $r when there are some - if [ -n "$(git log --oneline $oldsha..$r)" ]; then - check_prereq "$oldsha" "$r" - else - if [ "$oldsha" = "$newsha" ]; then - # $r is unchanged from last push, so include - # the minimum data to make the bundle contain $r - rparentsha=$(git log -n 2 "$r" --format='%H' | tail -n+2) - if [ -n "$rparentsha" ]; then - check_prereq "$rparentsha" "$r" - else - # $r has no parent so include it as is - echo "$r" - fi - else - # $oldsha is 
not a parent of $r, so - # include $r and all its parents - echo "$r" - fi - fi + # incremental bundle + for r in $push_refs; do + newsha=$(git show-ref "$r" | awk '{print $1}') + oldsha=$(grep " $r$" .git/listed-refs | awk '{print $1}') + if [ -n "$oldsha" ]; then + # include changes from $oldsha to $r when there are some + if [ -n "$(git log --oneline $oldsha..$r)" ]; then + check_prereq "$oldsha" "$r" else - # no old version was pushed so include $r and all its parents - echo "$r" - fi - done) \ - | git bundle create --quiet $TOPDIR/new.bundle --stdin + if [ "$oldsha" = "$newsha" ]; then + # $r is unchanged from last push, so no need to push it + : + else + # $oldsha is not a parent of $r, so + # include $r and all its parents + # XXX (this could be improved by checking other refs that were pushed + # and only including changes from them) + echo "$r" + fi + fi + else + # no old version was pushed so include $r and all its parents + # XXX (this could be improved by checking other refs that were pushed + # and only including changes from them) + echo "$r" + fi + done > .git/tobundle + if [ -s ".git/tobundle" ]; then + git bundle create --quiet $TOPDIR/new.bundle --stdin < ".git/tobundle" + addnewbundle fi - sha1=$(sha1sum $TOPDIR/new.bundle | awk '{print $1}') - mv $TOPDIR/new.bundle "$TOPDIR/$sha1.bundle" - echo "$sha1.bundle" >> $TOPDIR/MANIFEST fi fi cat .git/push-response From 7a9633312e544e1838fb4a6097840e3a660be2ec Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Tue, 30 Apr 2024 14:40:05 -0400 Subject: [PATCH 08/53] got git clone from git-remote-annex prototype working eg git clone annex://`pwd` when the MANIFEST file is in the pwd. This is easy in the prototype, just use $GIT_DIR, but in git-annex, it will need to automatically git-annex init, and set up the special remote, in order to be able to download the manifest and bundle keys from it. 
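The URL handling in the prototype can be sketched as follows. This is only a minimal illustration that reuses the sed expression from the patch; the function name url_to_topdir is made up here and is not part of the script.

```shell
#!/bin/sh
# Hypothetical helper (not in the patch): strip the annex:// scheme prefix
# from the URL git passes to the helper, the same way the prototype
# computes TOPDIR.
url_to_topdir () {
	printf '%s\n' "$1" | sed 's/^annex:\/\///'
}
```

For example, after git clone annex://`pwd` run in /home/user/repo, the URL would be annex:///home/user/repo, and url_to_topdir would print /home/user/repo, which is where the prototype looks for MANIFEST.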
Sponsored-by: k0ld on Patreon --- git-remote-annex | 51 ++++++++++++++++++++++++------------------------ 1 file changed, 26 insertions(+), 25 deletions(-) diff --git a/git-remote-annex b/git-remote-annex index 408fd211f5..2e818ed420 100755 --- a/git-remote-annex +++ b/git-remote-annex @@ -1,10 +1,11 @@ #!/bin/sh +URL="$2" -TOPDIR=.. +TOPDIR="$(echo "$URL" | sed 's/^annex:\/\///')" set -x -rm -f .git/push-response +rm -f $GIT_DIR/push-response # Unfortunately, git bundle omits prerequisites that are omitted once, # even if they are used by a later ref. @@ -43,7 +44,7 @@ while read foo; do list*) if [ -e "$TOPDIR/MANIFEST" ]; then for f in $(cat $TOPDIR/MANIFEST); do - git bundle list-heads $TOPDIR/$f >> .git/listed-refs-new + git bundle list-heads $TOPDIR/$f >> $GIT_DIR/listed-refs-new if [ "$foo" = "list for-push" ]; then # Get all the objects from the bundle. This is done here so that # refs/namespaces/mine can be updated with what was listed, @@ -52,8 +53,8 @@ while read foo; do git bundle unbundle "$TOPDIR/$f" >/dev/null 2>&1 fi done - perl -e 'while (<>) { if (m/(.*) (.*)/) { $seen{$2}=$1 } }; foreach my $k (keys %seen) { print "$seen{$k} $k\n" }' < .git/listed-refs-new > .git/listed-refs - rm -f .git/listed-refs-new + perl -e 'while (<>) { if (m/(.*) (.*)/) { $seen{$2}=$1 } }; foreach my $k (keys %seen) { print "$seen{$k} $k\n" }' < $GIT_DIR/listed-refs-new > $GIT_DIR/listed-refs + rm -f $GIT_DIR/listed-refs-new # when listing for a push, update refs/namespaces/mine to match what was # listed. This is necessary in order for a full repush to know what to push. 
@@ -63,7 +64,7 @@ while read foo; do done IFS=" " - for x in $(cat .git/listed-refs); do + for x in $(cat $GIT_DIR/listed-refs); do sha="$(echo "$x" | cut -d ' ' -f 1)" r="$(echo "$x" | cut -d ' ' -f 2)" git update-ref "$r" "$sha" @@ -72,11 +73,11 @@ while read foo; do fi # respond to git with a list of refs - sed 's/refs\/namespaces\/mine\///' .git/listed-refs - # .git/listed-refs is later checked in push + sed 's/refs\/namespaces\/mine\///' $GIT_DIR/listed-refs + # $GIT_DIR/listed-refs is later checked in push else - rm -f .git/listed-refs - touch .git/listed-refs + rm -f $GIT_DIR/listed-refs + touch $GIT_DIR/listed-refs fi echo ;; @@ -107,28 +108,28 @@ while read foo; do # all remaining refs. REPUSH=1 git update-ref -d "$mydstref" - touch .git/push-response - echo "ok $dstref" >> .git/push-response + touch $GIT_DIR/push-response + echo "ok $dstref" >> $GIT_DIR/push-response else if [ ! "$forcedpush" ]; then # check if the push would overwrite # work in the ref currently stored in the # remote, if so refuse to do it - prevsha=$(grep " $mydstref$" .git/listed-refs | awk '{print $1}') + prevsha=$(grep " $mydstref$" $GIT_DIR/listed-refs | awk '{print $1}') newsha=$(git rev-parse "$srcref") if [ -n "$prevsha" ] && [ "$prevsha" != "$newsha" ] && [ -z "$(git log --oneline $prevsha..$newsha 2>/dev/null)" ]; then - touch .git/push-response - echo "error $dstref non-fast-forward" >> .git/push-response + touch $GIT_DIR/push-response + echo "error $dstref non-fast-forward" >> $GIT_DIR/push-response else - touch .git/push-response - echo "ok $dstref" >> .git/push-response + touch $GIT_DIR/push-response + echo "ok $dstref" >> $GIT_DIR/push-response git update-ref "$mydstref" "$srcref" push_refs="$mydstref $push_refs" fi else git update-ref "$mydstref" "$srcref" - touch .git/push-response - echo "ok $dstref" >> .git/push-response + touch $GIT_DIR/push-response + echo "ok $dstref" >> $GIT_DIR/push-response push_refs="$mydstref $push_refs" fi fi @@ -171,7 +172,7 @@ while read 
foo; do # incremental bundle for r in $push_refs; do newsha=$(git show-ref "$r" | awk '{print $1}') - oldsha=$(grep " $r$" .git/listed-refs | awk '{print $1}') + oldsha=$(grep " $r$" $GIT_DIR/listed-refs | awk '{print $1}') if [ -n "$oldsha" ]; then # include changes from $oldsha to $r when there are some if [ -n "$(git log --oneline $oldsha..$r)" ]; then @@ -194,15 +195,15 @@ while read foo; do # and only including changes from them) echo "$r" fi - done > .git/tobundle - if [ -s ".git/tobundle" ]; then - git bundle create --quiet $TOPDIR/new.bundle --stdin < ".git/tobundle" + done > $GIT_DIR/tobundle + if [ -s "$GIT_DIR/tobundle" ]; then + git bundle create --quiet $TOPDIR/new.bundle --stdin < "$GIT_DIR/tobundle" addnewbundle fi fi fi - cat .git/push-response - rm -f .git/push-response + cat $GIT_DIR/push-response + rm -f $GIT_DIR/push-response echo dopush="" fi From 90b389369ffdd72686b6615204477793c111e29c Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Mon, 6 May 2024 12:07:05 -0400 Subject: [PATCH 09/53] fix name of gitremote-helpers The git man page has that name. 
--- CmdLine/GitRemoteTorAnnex.hs | 2 +- doc/git-remote-tor-annex.mdwn | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/CmdLine/GitRemoteTorAnnex.hs b/CmdLine/GitRemoteTorAnnex.hs index 3349304fb2..c3aa13c3f0 100644 --- a/CmdLine/GitRemoteTorAnnex.hs +++ b/CmdLine/GitRemoteTorAnnex.hs @@ -25,7 +25,7 @@ run (_remotename:address:[]) = forever $ "capabilities" -> putStrLn "connect" >> ready "connect git-upload-pack" -> go UploadPack "connect git-receive-pack" -> go ReceivePack - l -> giveup $ "git-remote-helpers protocol error at " ++ show l + l -> giveup $ "gitremote-helpers protocol error at " ++ show l where (onionaddress, onionport) | '/' `elem` address = parseAddressPort $ diff --git a/doc/git-remote-tor-annex.mdwn b/doc/git-remote-tor-annex.mdwn index 4e41de877b..e32b711e4c 100644 --- a/doc/git-remote-tor-annex.mdwn +++ b/doc/git-remote-tor-annex.mdwn @@ -21,7 +21,7 @@ service, its first line is used as the authtoken. # SEE ALSO -git-remote-helpers(1) +gitremote-helpers(1) [[git-annex]](1) From a8cef2bf85b8fa12774290644f595e94a2f55f8e Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Mon, 6 May 2024 12:44:33 -0400 Subject: [PATCH 10/53] added man page for git-remote-annex And document remote..git-remote-annex-max-bundles which will configure it. datalad-annex uses a similar url format, but with some enhancements. See https://github.com/datalad/datalad-next/blob/main/datalad_next/gitremotes/datalad_annex.py I added the UUID to the URL, because it is needed in order to pick out which manifest file to use. The design allows for a single key/value store to have several special remotes all stored in it, and so the manifest includes the UUID in its name. While datalad-annex allows datalad-annex::?, and allows referencing pieces of the url in the parameters, needing the UUID prevents git-remote-annex from supporting that syntax. And anyway, it is a complication and I want to keep things simple for now.
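To illustrate why the UUID has to be in the URL, here is a rough sketch of how a manifest name could be derived from it. It assumes the annex:: URL format described in this commit and the GITMANIFEST--$UUID naming from the internals doc; manifest_key_for_url is a hypothetical name and this is not how git-annex itself is implemented.

```shell
#!/bin/sh
# Hypothetical helper, not part of git-annex: pull the UUID out of an
# annex:: URL and print the name of the manifest that would be used.
manifest_key_for_url () {
	rest="${1#annex::}"     # drop the scheme prefix
	uuid="${rest%%\?*}"     # drop any ?param=value... suffix
	printf 'GITMANIFEST--%s\n' "$uuid"
}
```

Two special remotes that share one key/value store would have different UUIDs, so under this naming their manifests would get different names and not collide.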
Sponsored-by: unqueued on Patreon --- doc/git-annex.mdwn | 11 ++++++ doc/git-remote-annex.mdwn | 79 +++++++++++++++++++++++++++++++++++++++ 2 files changed, 90 insertions(+) create mode 100644 doc/git-remote-annex.mdwn diff --git a/doc/git-annex.mdwn b/doc/git-annex.mdwn index 30028a45e2..c32d5854d3 100644 --- a/doc/git-annex.mdwn +++ b/doc/git-annex.mdwn @@ -1644,6 +1644,17 @@ Remotes are configured using these settings in `.git/config`. remotes, and is set when using [[git-annex-initremote]](1) with the `--private` option. +* `remote..git-remote-annex-max-bundles` + + When using [[git-remote-annex]] to store a git repository in a special + remote, this configures how many separate git bundle objects to store + in the special remote before re-pushing a single git bundle that contains + the entire git repository. + + The default is 100, which aims to avoid often needing to often re-upload, + while preventing a new clone needing to download too many objects. Set to + 0 to disable re-pushing. + * `remote..annex-bare` Can be used to tell git-annex if a remote is a bare repository diff --git a/doc/git-remote-annex.mdwn b/doc/git-remote-annex.mdwn new file mode 100644 index 0000000000..f94799143f --- /dev/null +++ b/doc/git-remote-annex.mdwn @@ -0,0 +1,79 @@ +# NAME + +git-remote-annex - remote helper program to store git repository in a git-annex special remote + +# SYNOPSIS + +git fetch annex::uuid?param=value&param=value... + +# DESCRIPTION + +This is a git remote helper program that allows git to clone, +pull and push from a git repository that is stored in a git-annex +special remote. + +It can be used with any special remote except those that use +encryption=shared or encryption=hybrid. (Since those types of encryption +rely on on cipher that is checked into the git repository, cloning from +such a special remote would present a chicken and egg problem.)
+ +The format of the remote URL is "annex::" followed by the UUID of the +special remote, and then followed by all of the configuration parameters of +the special remote. + +For example, to clone from a directory special remote: + + git clone annex::358ff77e-0bc3-11ef-bc49-872e6695c0e3?type=directory&encryption=none&directory=/mnt/foo/ + +When a special remote needs some additional credentials to be provided, +they are not included in the URL, and need to be provided when cloning from +the special remote. That is typically done by setting environment +variables. Some special remotes may also need environment variables to be +set when pulling or pushing. + +As a useful shorthand, when the special remote has already been enabled, +the configuration parameters can be omitted. For example: + + git push annex::358ff77e-0bc3-11ef-bc49-872e6695c0e3 master + +This also makes it easy to configure git to use an existing special remote: + + git config remote.foo.url annex::358ff77e-0bc3-11ef-bc49-872e6695c0e3 + git fetch foo master + +Configuring the url like that is automatically done when cloning from a +special remote, but not by [[git-annex-initremote]](1) and +[[git-annex-enableremote]](1). + +The git repository is stored in the special remote using special annex objects +with names starting with "GITMANIFEST--" and "GITBUNDLE--". For details about +how the git repository is stored, see + + +Pushes to a special remote are usually done incrementally. However, +sometimes the whole git repository (but not the annex) needs to be +re-uploaded. This is done when deleting a ref from the remote. It's also +done when too many git bundles accumulate in the special remote, as +configured by the `remote..git-remote-annex-max-bundles` git config. + +Like any git repository, a git repository stored on a special remote can +have conflicting things pushed to it from different places. 
This mostly +works the same as any other git repository, eg a push that overwrites other +work will be prevented unless forced. However, it is possible, when +conflicting pushes are being done at the same time, for one of the pushes +to be overwritten by the other one. In this sitiation, the push will appear +to have succeeded, but pulling later will show the true situation. + +# SEE ALSO + +gitremote-helpers(1) + +[[git-annex]](1) + +[[git-annex-initremote]](1) + +# AUTHOR + +Joey Hess + +Warning: Automatically converted into a man page by mdwn2man. Edit with care. From 0be9f7a2c605476c09eab496b27bc2da660b5db5 Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Mon, 6 May 2024 12:51:44 -0400 Subject: [PATCH 11/53] add UUID to GITBUNDLE The UUID is included in the GITMANIFEST in order to allow a single key/value store to be used to store several special remotes, without any namespacing. In that situation though, if the same ref is pushed to two special remotes, it will result in git bundles with the same content. Which is ok, until a re-push happens to one of the special remote. At that point, the old git bundle will be deleted. That will prevent fetching it from the other special remote, where the re-push has not happened. Adding the UUID avoids this problem. --- doc/internals/git-remote-annex.mdwn | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/internals/git-remote-annex.mdwn b/doc/internals/git-remote-annex.mdwn index 2a0ee0267a..c973d62e51 100644 --- a/doc/internals/git-remote-annex.mdwn +++ b/doc/internals/git-remote-annex.mdwn @@ -3,7 +3,7 @@ This adds two new object types to git-annex, GITMANIFEST and a GITBUNDLE. GITMANIFEST--$UUID is the manifest for a git repository stored in the git-annex repository with that UUID. -GITBUNDLE--sha256 is a git bundle. +GITBUNDLE--$UUID--sha256 is a git bundle. 
# format of the manifest file From a01d64a4ad3cbfaa79b994f60cd217abb403371b Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Mon, 6 May 2024 12:58:38 -0400 Subject: [PATCH 12/53] add git-remote-annex stub and build machinery Renamed git-remote-annex.sh, keeping it around for now for reference. Sponsored-by: Graham Spencer on Patreon --- Build/Standalone.hs | 1 + CmdLine/GitRemoteAnnex.hs | 21 +++++++++++++++++++++ Makefile | 8 ++++++-- git-annex.cabal | 1 + git-annex.hs | 4 +++- git-remote-annex => git-remote-annex.sh | 0 6 files changed, 32 insertions(+), 3 deletions(-) create mode 100644 CmdLine/GitRemoteAnnex.hs rename git-remote-annex => git-remote-annex.sh (100%) diff --git a/Build/Standalone.hs b/Build/Standalone.hs index 98c3b2cc89..367527430a 100644 --- a/Build/Standalone.hs +++ b/Build/Standalone.hs @@ -224,6 +224,7 @@ installGitAnnex topdir = go (topdir "bin") error "strip failed" createSymbolicLink "git-annex" (bindir "git-annex-shell") createSymbolicLink "git-annex" (bindir "git-remote-tor-annex") + createSymbolicLink "git-annex" (bindir "git-remote-annex") main :: IO () main = getArgs >>= go diff --git a/CmdLine/GitRemoteAnnex.hs b/CmdLine/GitRemoteAnnex.hs new file mode 100644 index 0000000000..4c92f2150a --- /dev/null +++ b/CmdLine/GitRemoteAnnex.hs @@ -0,0 +1,21 @@ +{- git-remote-annex program + - + - Copyright 2024 Joey Hess + - + - Licensed under the GNU AGPL version 3 or higher. 
+ -} + +module CmdLine.GitRemoteAnnex where + +import Common +import qualified Annex +import qualified Git.CurrentRepo +import Annex.UUID +import Annex.Action + +run :: [String] -> IO () +run (_remotename:address:[]) = forever $ + getLine >>= \case + l -> giveup $ "gitremote-helpers protocol error at " ++ show l +run (_remotename:[]) = giveup "remote address not configured" +run _ = giveup "expected remote name and address parameters" diff --git a/Makefile b/Makefile index 06ebebdea3..fc64201f72 100644 --- a/Makefile +++ b/Makefile @@ -1,4 +1,4 @@ -all=git-annex git-annex-shell mans docs +all=git-annex git-annex-shell git-remote-annex mans docs # set to "./Setup" if you lack a cabal program. Or can be set to "stack" BUILDER?=cabal @@ -70,6 +70,9 @@ git-annex: tmp/configure-stamp git-annex-shell: git-annex ln -sf git-annex git-annex-shell +git-remote-annex: git-annex + ln -sf git-annex git-remote-annex + # These are not built normally. git-union-merge.1: doc/git-union-merge.mdwn ./Build/mdwn2man git-union-merge 1 doc/git-union-merge.mdwn > git-union-merge.1 @@ -90,6 +93,7 @@ install-bins: build install -d $(DESTDIR)$(PREFIX)/bin install git-annex $(DESTDIR)$(PREFIX)/bin ln -sf git-annex $(DESTDIR)$(PREFIX)/bin/git-annex-shell + ln -sf git-annex $(DESTDIR)$(PREFIX)/bin/git-remote-annex ln -sf git-annex $(DESTDIR)$(PREFIX)/bin/git-remote-tor-annex install-desktop: build Build/InstallDesktopFile @@ -141,7 +145,7 @@ clean: doc/.ikiwiki html dist tags Build/SysConfig Build/Version \ Setup Build/InstallDesktopFile Build/Standalone \ Build/DistributionUpdate Build/BuildVersion Build/MakeMans \ - git-annex-shell git-union-merge .tasty-rerun-log + git-annex-shell git-remote-annex git-union-merge .tasty-rerun-log find . -name \*.o -exec rm {} \; find . 
-name \*.hi -exec rm {} \; diff --git a/git-annex.cabal b/git-annex.cabal index a207606d83..97ae80382f 100644 --- a/git-annex.cabal +++ b/git-annex.cabal @@ -606,6 +606,7 @@ Executable git-annex CmdLine.GitAnnexShell.Fields CmdLine.AnnexSetter CmdLine.Option + CmdLine.GitRemoteAnnex CmdLine.GitRemoteTorAnnex CmdLine.Seek CmdLine.Usage diff --git a/git-annex.hs b/git-annex.hs index 89a9350b40..88117b4508 100644 --- a/git-annex.hs +++ b/git-annex.hs @@ -1,6 +1,6 @@ {- git-annex main program dispatch - - - Copyright 2010-2016 Joey Hess + - Copyright 2010-2024 Joey Hess - - Licensed under the GNU AGPL version 3 or higher. -} @@ -13,6 +13,7 @@ import Network.Socket (withSocketsDo) import qualified CmdLine.GitAnnex import qualified CmdLine.GitAnnexShell +import qualified CmdLine.GitRemoteAnnex import qualified CmdLine.GitRemoteTorAnnex import qualified Test import qualified Benchmark @@ -35,6 +36,7 @@ main = sanitizeTopLevelExceptionMessages $ withSocketsDo $ do where run ps n = case takeFileName n of "git-annex-shell" -> CmdLine.GitAnnexShell.run ps + "git-remote-annex" -> CmdLine.GitRemoteAnnex.run ps "git-remote-tor-annex" -> CmdLine.GitRemoteTorAnnex.run ps _ -> CmdLine.GitAnnex.run Test.optParser Test.runner Benchmark.mkGenerator ps diff --git a/git-remote-annex b/git-remote-annex.sh similarity index 100% rename from git-remote-annex rename to git-remote-annex.sh From 306ea42447feccf6e2962cc6ac5197483b12dfb8 Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Mon, 6 May 2024 13:06:22 -0400 Subject: [PATCH 13/53] improve git-remote-annex docs renamed the git config to something shorter too --- doc/git-annex.mdwn | 2 +- doc/git-remote-annex.mdwn | 11 ++++++----- 2 files changed, 7 insertions(+), 6 deletions(-) diff --git a/doc/git-annex.mdwn b/doc/git-annex.mdwn index c32d5854d3..ceb5fbb60b 100644 --- a/doc/git-annex.mdwn +++ b/doc/git-annex.mdwn @@ -1644,7 +1644,7 @@ Remotes are configured using these settings in `.git/config`. 
remotes, and is set when using [[git-annex-initremote]](1) with the `--private` option. -* `remote..git-remote-annex-max-bundles` +* `remote..max-bundles`, `annex.max-bundles` When using [[git-remote-annex]] to store a git repository in a special remote, this configures how many separate git bundle objects to store diff --git a/doc/git-remote-annex.mdwn b/doc/git-remote-annex.mdwn index f94799143f..ab0f8a930e 100644 --- a/doc/git-remote-annex.mdwn +++ b/doc/git-remote-annex.mdwn @@ -1,6 +1,6 @@ # NAME -git-remote-annex - remote helper program to store git repository in a git-annex special remote +git-remote-annex - remote helper program to store a git repository in a git-annex special remote # SYNOPSIS @@ -14,7 +14,7 @@ special remote. It can be used with any special remote except those that use encryption=shared or encryption=hybrid. (Since those types of encryption -rely on on cipher that is checked into the git repository, cloning from +rely on a cipher that is checked into the git repository, cloning from such a special remote would present a chicken and egg problem.) The format of the remote URL is "annex::" followed by the UUID of the @@ -36,7 +36,8 @@ the configuration parameters can be omitted. For example: git push annex::358ff77e-0bc3-11ef-bc49-872e6695c0e3 master -This also makes it easy to configure git to use an existing special remote: +This also makes it easy to configure the url for an existing special remote, +making it usable by git: git config remote.foo.url annex::358ff77e-0bc3-11ef-bc49-872e6695c0e3 git fetch foo master @@ -52,9 +53,9 @@ how the git repository is stored, see Pushes to a special remote are usually done incrementally. However, sometimes the whole git repository (but not the annex) needs to be -re-uploaded. This is done when deleting a ref from the remote. It's also +re-uploaded. That is done when deleting a ref from the remote. 
It's also done when too many git bundles accumulate in the special remote, as -configured by the `remote..git-remote-annex-max-bundles` git config. +configured by the `remote..max-bundles` git config. Like any git repository, a git repository stored on a special remote can have conflicting things pushed to it from different places. This mostly From f17fa48b7cd7d69c8813a9e62867b237c43a939f Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Mon, 6 May 2024 13:13:39 -0400 Subject: [PATCH 14/53] ignore git-remote-annex --- .gitignore | 1 + 1 file changed, 1 insertion(+) diff --git a/.gitignore b/.gitignore index 2d0859233f..e21cbf9c80 100644 --- a/.gitignore +++ b/.gitignore @@ -13,6 +13,7 @@ Build/BuildVersion Build/MakeMans git-annex git-annex-shell +git-remote-annex man git-union-merge git-union-merge.1 From 4b94fc371e8d52f024b7767678fe9de0db9cb657 Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Mon, 6 May 2024 14:07:27 -0400 Subject: [PATCH 15/53] implement gitremote-helpers protocol parsing Sponsored-by: Leon Schuermann on Patreon --- CmdLine/GitRemoteAnnex.hs | 104 +++++++++++++++++++++++++++++++++++--- 1 file changed, 97 insertions(+), 7 deletions(-) diff --git a/CmdLine/GitRemoteAnnex.hs b/CmdLine/GitRemoteAnnex.hs index 4c92f2150a..84ce91b1eb 100644 --- a/CmdLine/GitRemoteAnnex.hs +++ b/CmdLine/GitRemoteAnnex.hs @@ -7,15 +7,105 @@ module CmdLine.GitRemoteAnnex where -import Common +import Annex.Common import qualified Annex import qualified Git.CurrentRepo import Annex.UUID -import Annex.Action run :: [String] -> IO () -run (_remotename:address:[]) = forever $ - getLine >>= \case - l -> giveup $ "gitremote-helpers protocol error at " ++ show l -run (_remotename:[]) = giveup "remote address not configured" -run _ = giveup "expected remote name and address parameters" +run (_remotename:url:[]) = do + state <- Annex.new =<< Git.CurrentRepo.get + Annex.eval state (run' url) +run (_remotename:[]) = giveup "remote url not configured" +run _ = giveup "expected remote name 
and url parameters" + +run' :: String -> Annex () +run' url = go =<< lines <$> liftIO getContents + where + go (l:ls) = + let (c, v) = splitLine l + in case c of + "capabilities" -> capabilities >> go ls + "list" -> case v of + "" -> list False >> go ls + "for-push" -> list True >> go ls + _ -> protocolError l + "fetch" -> fetch (l:ls) >>= go + "push" -> push (l:ls) >>= go + _ -> protocolError l + go [] = return () + +protocolError :: String -> a +protocolError l = giveup $ "gitremote-helpers protocol error at " ++ show l + +capabilities :: Annex () +capabilities = do + liftIO $ putStrLn "fetch" + liftIO $ putStrLn "push" + liftIO $ putStrLn "" + liftIO $ hFlush stdout + +list :: Bool -> Annex () +list forpush = error "TODO list" + +-- Any number of fetch commands can be sent by git, asking for specific +-- things. We fetch everything new at once, so find the end of the fetch +-- commands (which is supposed to be a blank line) before fetching. +fetch :: [String] -> Annex [String] +fetch (l:ls) = case splitLine l of + ("fetch", _) -> fetch ls + ("", _) -> do + fetch' + return ls + _ -> do + fetch' + return (l:ls) +fetch [] = do + fetch' + return [] + +fetch' :: Annex () +fetch' = error "TODO fetch" + +push :: [String] -> Annex [String] +push ls = do + let (refspecs, ls') = collectRefSpecs ls + error "TODO push refspecs" + return ls' + +data RefSpec = RefSpec + { forcedPush :: Bool + , srcRef :: Maybe String -- empty when deleting a ref + , dstRef :: String + } + deriving (Show) + +-- Any number of push commands can be sent by git, specifying the refspecs +-- to push. They should be followed by a blank line. 
+collectRefSpecs :: [String] -> ([RefSpec], [String]) +collectRefSpecs = go [] + where + go c (l:ls) = case splitLine l of + ("push", refspec) -> go (parseRefSpec refspec:c) ls + ("", _) -> (c, ls) + _ -> (c, (l:ls)) + go c [] = (c, []) + +parseRefSpec :: String -> RefSpec +parseRefSpec ('+':s) = (parseRefSpec s) { forcedPush = True } +parseRefSpec s = + let (src, cdst) = break (== ':') s + dst = if null cdst then cdst else drop 1 cdst + in RefSpec + { forcedPush = False + , srcRef = if null src then Nothing else Just src + , dstRef = dst + } + +-- "foo bar" to ("foo", "bar") +-- "foo" to ("foo", "") +splitLine :: String -> (String, String) +splitLine l = + let (c, sv) = break (== ' ') l + v = if null sv then sv else drop 1 sv + in (c, v) From f4ba6e0c1e4eac8ab95bc9fc1449f1c51cf962b5 Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Mon, 6 May 2024 14:50:41 -0400 Subject: [PATCH 16/53] add annex: url parser Changed the format of the url to use annex: rather than annex:: The reason is that in the future, might want to support an url that includes an uriAuthority part, eg: annex://foo@example.com:42/358ff77e-0bc3-11ef-bc49-872e6695c0e3?type=directory&encryption=none&directory=/mnt/foo/" To parse that foo@example.com:42 as an uriAuthority it needs to start with annex: rather than annex:: That would also need something to be done with uriAuthority, and also the uriPath (the UUID) is prefixed with "/" in that example. So the current parser won't handle that example currently. But this leaves the possibility for expansion. 
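The URL format described above can be sketched in shell. This is illustrative only: parse_annex_url is a made-up name, the actual parser is the Haskell parseSpecialRemoteUrl added by this patch, and unlike that parser the sketch does not percent-decode keys or values or handle an authority part.

```shell
#!/bin/sh
# Illustrative sketch only. parse_annex_url is a hypothetical name; the
# real parser is parseSpecialRemoteUrl in CmdLine/GitRemoteAnnex.hs.
# Unlike that parser, this does not percent-decode keys or values.
parse_annex_url () {
	case "$1" in
		annex::*) echo "annex: URL malformed" >&2; return 1 ;;
		annex:*) ;;
		*) echo "Not an annex: URL" >&2; return 1 ;;
	esac
	rest="${1#annex:}"
	uuid="${rest%%\?*}"       # everything before the first "?"
	if [ -z "$uuid" ]; then
		echo "annex: URL did not include a UUID" >&2
		return 1
	fi
	echo "uuid $uuid"
	case "$rest" in
		*\?*) query="${rest#*\?}" ;;   # everything after the first "?"
		*) query="" ;;
	esac
	# Emit one "param KEY VALUE" line per key=value pair.
	printf '%s\n' "$query" | tr '&' '\n' | while IFS= read -r kv; do
		case "$kv" in
			"") ;;
			*=*) echo "param ${kv%%=*} ${kv#*=}" ;;
			*) echo "param $kv" ;;
		esac
	done
}
```

Called with the directory special remote URL from the man page, this would print the UUID line followed by one line each for type, encryption and directory. An old-style annex:: URL is rejected as malformed, similar to the (':':_) case in the Haskell parser.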
Sponsored-by: Joshua Antonishen on Patreon --- CmdLine/GitRemoteAnnex.hs | 36 +++++++++++++++++++++++++++++++++--- doc/git-remote-annex.mdwn | 10 +++++----- 2 files changed, 38 insertions(+), 8 deletions(-) diff --git a/CmdLine/GitRemoteAnnex.hs b/CmdLine/GitRemoteAnnex.hs index 84ce91b1eb..2d3a506a3c 100644 --- a/CmdLine/GitRemoteAnnex.hs +++ b/CmdLine/GitRemoteAnnex.hs @@ -11,11 +11,14 @@ import Annex.Common import qualified Annex import qualified Git.CurrentRepo import Annex.UUID +import Network.URI run :: [String] -> IO () -run (_remotename:url:[]) = do - state <- Annex.new =<< Git.CurrentRepo.get - Annex.eval state (run' url) +run (_remotename:url:[]) = case parseSpecialRemoteUrl url of + Left e -> giveup e + Right src -> do + state <- Annex.new =<< Git.CurrentRepo.get + Annex.eval state (run' url) run (_remotename:[]) = giveup "remote url not configured" run _ = giveup "expected remote name and url parameters" @@ -109,3 +112,30 @@ splitLine l = let (c, sv) = break (== ' ') l v = if null sv then sv else drop 1 sv in (c, v) + +data SpecialRemoteConfig = SpecialRemoteConfig + { specialRemoteUUID :: UUID + , specialRemoteParams :: [(String, String)] + } + deriving (Show) + +-- The url for a special remote looks like +-- annex:uuid?param=value&param=value...
+parseSpecialRemoteUrl :: String -> Either String SpecialRemoteConfig +parseSpecialRemoteUrl s = case parseURI s of + Nothing -> Left "URL parse failed" + Just u -> case uriScheme u of + "annex:" -> case uriPath u of + "" -> Left "annex: URL did not include a UUID" + (':':_) -> Left "annex: URL malformed" + p -> Right $ SpecialRemoteConfig + { specialRemoteUUID = toUUID p + , specialRemoteParams = parsequery u + } + _ -> Left "Not an annex: URL" + where + parsequery u = map parsekv $ splitc '&' (drop 1 (uriQuery u)) + parsekv s = + let (k, sv) = break (== '=') s + v = if null sv then sv else drop 1 sv + in (unEscapeString k, unEscapeString v) diff --git a/doc/git-remote-annex.mdwn b/doc/git-remote-annex.mdwn index ab0f8a930e..74b946ebac 100644 --- a/doc/git-remote-annex.mdwn +++ b/doc/git-remote-annex.mdwn @@ -4,7 +4,7 @@ git-remote-annex - remote helper program to store a git repository in a git-anne # SYNOPSIS -git fetch annex::uuid?param=value&param=value... +git fetch annex:uuid?param=value&param=value... # DESCRIPTION @@ -17,13 +17,13 @@ encryption=shared or encryption=hybrid. (Since those types of encryption rely on a cipher that is checked into the git repository, cloning from such a special remote would present a chicken and egg problem.) -The format of the remote URL is "annex::" followed by the UUID of the +The format of the remote URL is "annex:" followed by the UUID of the special remote, and then followed by all of the configuration parameters of the special remote. For example, to clone from a directory special remote: - git clone annex::358ff77e-0bc3-11ef-bc49-872e6695c0e3?type=directory&encryption=none&directory=/mnt/foo/ + git clone annex:358ff77e-0bc3-11ef-bc49-872e6695c0e3?type=directory&encryption=none&directory=/mnt/foo/ When a special remote needs some additional credentials to be provided, they are not included in the URL, and need to be provided when cloning from @@ -34,12 +34,12 @@ set when pulling or pushing.
As a useful shorthand, when the special remote has already been enabled, the configuration parameters can be omitted. For example: - git push annex::358ff77e-0bc3-11ef-bc49-872e6695c0e3 master + git push annex:358ff77e-0bc3-11ef-bc49-872e6695c0e3 master This also makes it easy to configure the url for an existing special remote, making it usable by git: - git config remote.foo.url annex::358ff77e-0bc3-11ef-bc49-872e6695c0e3 + git config remote.foo.url annex:358ff77e-0bc3-11ef-bc49-872e6695c0e3 git fetch foo master Configuring the url like that is automatically done when cloning from a From 483887591dfbbab42b74b60447f6b9d2ccd95540 Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Mon, 6 May 2024 16:25:55 -0400 Subject: [PATCH 17/53] working toward git-remote-annex using a special remote Not quite there yet. Also, changed the format of GITBUNDLE keys to use only one '-' after the UUID. A sha256 does not contain that character, so can just split at the last one. Amusingly, the sha256 will probably not actually be verified. A git bundle contains its own checksums that git uses to verify it. And if someone wanted to replace the content of a GITBUNDLE object, they could just edit the manifest to use a new one whose sha256 does verify. Sponsored-by: Nicholas Golder-Manning --- CmdLine/GitRemoteAnnex.hs | 134 ++++++++++++++++++++++------ Command/Fsck.hs | 4 +- doc/internals/git-remote-annex.mdwn | 6 +- 3 files changed, 115 insertions(+), 29 deletions(-) diff --git a/CmdLine/GitRemoteAnnex.hs b/CmdLine/GitRemoteAnnex.hs index 2d3a506a3c..8331d89429 100644 --- a/CmdLine/GitRemoteAnnex.hs +++ b/CmdLine/GitRemoteAnnex.hs @@ -5,38 +5,53 @@ - Licensed under the GNU AGPL version 3 or higher. 
-} +{-# LANGUAGE OverloadedStrings #-} + module CmdLine.GitRemoteAnnex where import Annex.Common import qualified Annex import qualified Git.CurrentRepo +import qualified Remote import Annex.UUID +import Types.Remote +import Types.Key import Network.URI +import Utility.Tmp +import Utility.Metered +import qualified Data.ByteString as B +import qualified Data.ByteString.Char8 as B8 +import qualified Data.ByteString.Short as S run :: [String] -> IO () run (_remotename:url:[]) = case parseSpecialRemoteUrl url of Left e -> giveup e Right src -> do state <- Annex.new =<< Git.CurrentRepo.get - Annex.eval state (run' url) + Annex.eval state (run' src) run (_remotename:[]) = giveup "remote url not configured" run _ = giveup "expected remote name and url parameters" -run' :: String -> Annex () -run' url = go =<< lines <$> liftIO getContents +run' :: SpecialRemoteConfig -> Annex () +run' src = + -- Prevent any usual git-annex output to stdout, because + -- the output of this command is being parsed by git. 
+ doQuietAction $ do + rmt <- getSpecialRemote src + go rmt =<< lines <$> liftIO getContents where - go (l:ls) = + go rmt (l:ls) = let (c, v) = splitLine l in case c of - "capabilities" -> capabilities >> go ls + "capabilities" -> capabilities >> go rmt ls "list" -> case v of - "" -> list False >> go ls - "for-push" -> list True >> go ls + "" -> list rmt False >> go rmt ls + "for-push" -> list rmt True >> go rmt ls _ -> protocolError l - "fetch" -> fetch (l:ls) >>= go - "push" -> push (l:ls) >>= go + "fetch" -> fetch rmt (l:ls) >>= go rmt + "push" -> push rmt (l:ls) >>= go rmt _ -> protocolError l - go [] = return () + go _ [] = return () protocolError :: String -> a protocolError l = giveup $ "gitremote-helpers protocol error at " ++ show l @@ -48,30 +63,30 @@ capabilities = do liftIO $ putStrLn "" liftIO $ hFlush stdout -list :: Bool -> Annex () -list forpush = error "TODO list" +list :: Remote -> Bool -> Annex () +list rmt forpush = error "TODO list" -- Any number of fetch commands can be sent by git, asking for specific -- things. We fetch everything new at once, so find the end of the fetch -- commands (which is supposed to be a blank line) before fetching. 
-fetch :: [String] -> Annex [String] -fetch (l:ls) = case splitLine l of - ("fetch", _) -> fetch ls +fetch :: Remote -> [String] -> Annex [String] +fetch rmt (l:ls) = case splitLine l of + ("fetch", _) -> fetch rmt ls ("", _) -> do - fetch' + fetch' rmt return ls _ -> do - fetch' + fetch' rmt return (l:ls) -fetch [] = do - fetch' +fetch rmt [] = do + fetch' rmt return [] -fetch' :: Annex () -fetch' = error "TODO fetch" +fetch' :: Remote -> Annex () +fetch' rmt = error "TODO fetch" -push :: [String] -> Annex [String] -push ls = do +push :: Remote -> [String] -> Annex [String] +push rmt ls = do let (refspecs, ls') = collectRefSpecs ls error "TODO push refspecs" return ls' @@ -135,7 +150,76 @@ parseSpecialRemoteUrl s = case parseURI s of _ -> Left "Not an annex: URL" where parsequery u = map parsekv $ splitc '&' (drop 1 (uriQuery u)) - parsekv s = - let (k, sv) = break (== '=') s + parsekv kv = + let (k, sv) = break (== '=') kv v = if null sv then sv else drop 1 sv in (unEscapeString k, unEscapeString v) + +getSpecialRemote :: SpecialRemoteConfig -> Annex Remote +getSpecialRemote src + -- annex:uuid with no query string uses an existing remote + | null (specialRemoteParams src) = + Remote.byUUID (specialRemoteUUID src) >>= \case + Just rmt -> if thirdPartyPopulated (remotetype rmt) + then giveup "Cannot use this thirdparty-populated special remote as a git remote" + else return rmt + Nothing -> giveup $ "Cannot find an existing special remote with UUID " + ++ fromUUID (specialRemoteUUID src) + -- Given the configuration of a special remote, create a + -- Remote object to access the special remote. + -- This needs to avoid storing the configuration in the git-annex + -- branch (which would be redundant and also the configuration + -- provided may differ in some small way from the configuration + -- that is stored in the git repository inside the remote, which + -- should not be changed). It also needs to avoid creating a git + -- remote in .git/config. 
+ | otherwise = error "TODO conjure up a new special remote out of thin air" + -- XXX one way to do it would be to make a temporary git repo, + -- initremote in there, and use that for accessing the special + -- remote, rather than the current git repo. But can this be + -- avoided? + +-- A key that is used for the manifest of the git repository stored in a +-- special remote with the specified uuid. +manifestKey :: UUID -> Key +manifestKey u = mkKey $ \kd -> kd + { keyName = S.toShort (fromUUID u) + , keyVariety = OtherKey "GITMANIFEST" + } + +-- A key that is used for the git bundle with the specified sha256 +-- that is stored in a special remote with the specified uuid. +gitbundleKey :: UUID -> B.ByteString -> Key +gitbundleKey u sha256 = mkKey $ \kd -> kd + { keyName = S.toShort (fromUUID u <> "-" <> sha256) + , keyVariety = OtherKey "GITBUNDLE" + } + +-- The manifest contains an ordered list of git bundle keys. +newtype Manifest = Manifest [Key] + +-- Downloads the Manifest, or if it does not exist, returns an empty +-- Manifest. +-- +-- Throws errors if the remote cannot be accessed or the download fails, +-- or if the manifest file cannot be parsed. +downloadManifest :: Remote -> Annex Manifest +downloadManifest rmt = ifM (checkPresent rmt mk) + ( withTmpFile "GITMANIFEST" $ \tmp tmph -> do + liftIO $ hClose tmph + _ <- retrieveKeyFile rmt mk + (AssociatedFile Nothing) tmp + nullMeterUpdate NoVerify + ks <- map deserializeKey' . 
B8.lines <$> liftIO (B.readFile tmp) + Manifest <$> checkvalid [] ks + , return (Manifest []) + ) + where + mk = manifestKey (Remote.uuid rmt) + + checkvalid c [] = return (reverse c) + checkvalid c (Just k:ks) = case fromKey keyVariety k of + OtherKey "GITBUNDLE" -> checkvalid (k:c) ks + _ -> giveup $ "Wrong type of key in manifest " ++ serializeKey k + checkvalid _ (Nothing:_) = + giveup $ "Error parsing manifest " ++ serializeKey mk diff --git a/Command/Fsck.hs b/Command/Fsck.hs index 3e3de1ea86..344d21ac81 100644 --- a/Command/Fsck.hs +++ b/Command/Fsck.hs @@ -38,6 +38,7 @@ import Utility.CopyFile import Git.FilePath import Utility.PID import Utility.InodeCache +import Utility.Metered import Annex.InodeSentinal import qualified Database.Keys import qualified Database.Fsck as FsckDb @@ -206,8 +207,7 @@ performRemote key afile numcopies remote = ) , return Nothing ) - getfile' tmp = Remote.retrieveKeyFile remote key (AssociatedFile Nothing) (fromRawFilePath tmp) dummymeter (RemoteVerify remote) - dummymeter _ = noop + getfile' tmp = Remote.retrieveKeyFile remote key (AssociatedFile Nothing) (fromRawFilePath tmp) nullMeterUpdate (RemoteVerify remote) getcheap tmp = case Remote.retrieveKeyFileCheap remote of Just a -> isRight <$> tryNonAsync (a key afile (fromRawFilePath tmp)) Nothing -> return False diff --git a/doc/internals/git-remote-annex.mdwn b/doc/internals/git-remote-annex.mdwn index c973d62e51..7ec8d76515 100644 --- a/doc/internals/git-remote-annex.mdwn +++ b/doc/internals/git-remote-annex.mdwn @@ -3,11 +3,13 @@ This adds two new object types to git-annex, GITMANIFEST and a GITBUNDLE. GITMANIFEST--$UUID is the manifest for a git repository stored in the git-annex repository with that UUID. -GITBUNDLE--$UUID--sha256 is a git bundle. +GITBUNDLE--$UUID-sha256 is a git bundle. # format of the manifest file -An ordered list of bundle keys, one per line. +An ordered list of bundle keys, one per line. + +(Lines end with unix `"\n"`, not `"\r\n"`.) 
# fetching From c7731cdbd94521c579a418313647201e950e6c5c Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Tue, 7 May 2024 13:42:12 -0400 Subject: [PATCH 18/53] add Backend.GitRemoteAnnex Making GITBUNDLE be in the backend list allows those keys to be hashed to verify, both when git-remote-annex downloads them, and by other transfers and by git fsck. GITMANIFEST is not in the backend list, because those keys will never be stored in .git/annex/objects and can't be verified in any case. This does mean that git-annex version will include GITBUNDLE in the list of backends. Also documented these in backends.mdwn Sponsored-by: Kevin Mueller on Patreon --- Backend/GitRemoteAnnex.hs | 76 +++++++++++++++++++++++++++++ Backend/Hash.hs | 14 ++++-- Backend/Variety.hs | 2 + Types/Key.hs | 4 ++ doc/backends.mdwn | 17 +++++-- doc/internals/git-remote-annex.mdwn | 10 ++-- git-annex.cabal | 1 + 7 files changed, 110 insertions(+), 14 deletions(-) create mode 100644 Backend/GitRemoteAnnex.hs diff --git a/Backend/GitRemoteAnnex.hs b/Backend/GitRemoteAnnex.hs new file mode 100644 index 0000000000..bb825a917e --- /dev/null +++ b/Backend/GitRemoteAnnex.hs @@ -0,0 +1,76 @@ +{- Backends for git-remote-annex. + - + - GITBUNDLE keys store git bundles + - GITMANIFEST keys store ordered lists of GITBUNDLE keys + - + - Copyright 2024 Joey Hess + - + - Licensed under the GNU AGPL version 3 or higher. + -} + +{-# LANGUAGE OverloadedStrings #-} + +module Backend.GitRemoteAnnex ( + backends, + genGitBundleKey, + genManifestKey, +) where + +import Annex.Common +import Types.Key +import Types.Backend +import Utility.Hash +import Utility.Metered +import qualified Backend.Hash as Hash + +import qualified Data.ByteString.Short as S + +backends :: [Backend] +backends = [gitbundle] + +gitbundle :: Backend +gitbundle = Backend + { backendVariety = GitBundleKey + , genKey = Nothing + -- ^ Not provided because these keys can only be generated by + -- git-remote-annex. 
+ , verifyKeyContent = Just $ Hash.checkKeyChecksum sameCheckSum hash + , verifyKeyContentIncrementally = Just (liftIO . incrementalVerifier) + , canUpgradeKey = Nothing + , fastMigrate = Nothing + , isStableKey = const True + , isCryptographicallySecure = Hash.cryptographicallySecure hash + , isCryptographicallySecureKey = const $ pure $ + Hash.cryptographicallySecure hash + } + +-- git bundle keys use the sha256 hash. +hash :: Hash.Hash +hash = Hash.SHA2Hash (HashSize 256) + +incrementalVerifier :: Key -> IO IncrementalVerifier +incrementalVerifier = + mkIncrementalVerifier sha2_256_context "checksum" . sameCheckSum + +sameCheckSum :: Key -> String -> Bool +sameCheckSum key s = s == expected + where + -- The checksum comes after a UUID. + expected = reverse $ takeWhile (/= '-') $ reverse $ + decodeBS $ S.fromShort $ fromKey keyName key + +genGitBundleKey :: UUID -> RawFilePath -> MeterUpdate -> Annex Key +genGitBundleKey remoteuuid file meterupdate = do + filesize <- liftIO $ getFileSize file + s <- Hash.hashFile hash file meterupdate + return $ mkKey $ \k -> k + { keyName = S.toShort $ fromUUID remoteuuid <> "-" <> encodeBS s + , keyVariety = GitBundleKey + , keySize = Just filesize + } + +genManifestKey :: UUID -> Key +genManifestKey u = mkKey $ \kd -> kd + { keyName = S.toShort (fromUUID u) + , keyVariety = OtherKey "GITMANIFEST" + } diff --git a/Backend/Hash.hs b/Backend/Hash.hs index 9768550adf..f2684755b7 100644 --- a/Backend/Hash.hs +++ b/Backend/Hash.hs @@ -1,6 +1,6 @@ {- git-annex hashing backends - - - Copyright 2011-2021 Joey Hess + - Copyright 2011-2024 Joey Hess - - Licensed under the GNU AGPL version 3 or higher. 
-} @@ -12,6 +12,10 @@ module Backend.Hash ( testKeyBackend, keyHash, descChecksum, + Hash(..), + cryptographicallySecure, + hashFile, + checkKeyChecksum ) where import Annex.Common @@ -78,7 +82,7 @@ genBackend :: Hash -> Backend genBackend hash = Backend { backendVariety = hashKeyVariety hash (HasExt False) , genKey = Just (keyValue hash) - , verifyKeyContent = Just $ checkKeyChecksum hash + , verifyKeyContent = Just $ checkKeyChecksum sameCheckSum hash , verifyKeyContentIncrementally = Just $ checkKeyChecksumIncremental hash , canUpgradeKey = Just needsUpgrade , fastMigrate = Just trivialMigrate @@ -123,14 +127,14 @@ keyValueE hash source meterupdate = keyValue hash source meterupdate >>= addE source (const $ hashKeyVariety hash (HasExt True)) -checkKeyChecksum :: Hash -> Key -> RawFilePath -> Annex Bool -checkKeyChecksum hash key file = catchIOErrorType HardwareFault hwfault $ do +checkKeyChecksum :: (Key -> String -> Bool) -> Hash -> Key -> RawFilePath -> Annex Bool +checkKeyChecksum issame hash key file = catchIOErrorType HardwareFault hwfault $ do fast <- Annex.getRead Annex.fast exists <- liftIO $ R.doesPathExist file case (exists, fast) of (True, False) -> do showAction (UnquotedString descChecksum) - sameCheckSum key + issame key <$> hashFile hash file nullMeterUpdate _ -> return True where diff --git a/Backend/Variety.hs b/Backend/Variety.hs index b4da6f2a96..a48933c88a 100644 --- a/Backend/Variety.hs +++ b/Backend/Variety.hs @@ -18,12 +18,14 @@ import qualified Backend.External import qualified Backend.Hash import qualified Backend.WORM import qualified Backend.URL +import qualified Backend.GitRemoteAnnex {- Regular backends. Does not include externals or VURL. -} regularBackendList :: [Backend] regularBackendList = Backend.Hash.backends ++ Backend.WORM.backends ++ Backend.URL.backends + ++ Backend.GitRemoteAnnex.backends {- The default hashing backend. 
-} defaultHashBackend :: Backend diff --git a/Types/Key.hs b/Types/Key.hs index f942b4e55c..6d8956b0a0 100644 --- a/Types/Key.hs +++ b/Types/Key.hs @@ -219,6 +219,7 @@ data KeyVariety | WORMKey | URLKey | VURLKey + | GitBundleKey -- A key that is handled by some external backend. | ExternalKey S.ByteString HasExt -- Some repositories may contain keys of other varieties, @@ -253,6 +254,7 @@ hasExt (MD5Key (HasExt b)) = b hasExt WORMKey = False hasExt URLKey = False hasExt VURLKey = False +hasExt GitBundleKey = False hasExt (ExternalKey _ (HasExt b)) = b hasExt (OtherKey s) = (snd <$> S8.unsnoc s) == Just 'E' @@ -282,6 +284,7 @@ formatKeyVariety v = case v of WORMKey -> "WORM" URLKey -> "URL" VURLKey -> "VURL" + GitBundleKey -> "GITBUNDLE" ExternalKey s e -> adde e ("X" <> s) OtherKey s -> s where @@ -347,6 +350,7 @@ parseKeyVariety "MD5E" = MD5Key (HasExt True) parseKeyVariety "WORM" = WORMKey parseKeyVariety "URL" = URLKey parseKeyVariety "VURL" = VURLKey +parseKeyVariety "GITBUNDLE" = GitBundleKey parseKeyVariety b | "X" `S.isPrefixOf` b = let b' = S.tail b diff --git a/doc/backends.mdwn b/doc/backends.mdwn index da37597902..d24fe0e654 100644 --- a/doc/backends.mdwn +++ b/doc/backends.mdwn @@ -79,10 +79,6 @@ content of an annexed file remains unchanged. passing it to a shell script. These types of keys are distinct from URLs/URIs that may be attached to a key (using any backend) indicating the key's location on the web or in one of [[special_remotes]]. -* `GIT` -- This is used internally by git-annex when exporting trees - containing files stored in git, rather than git-annex. It represents a - git sha. This is never used for git-annex links, but information about - keys of this type is stored in the git-annex branch. ## external backends @@ -100,6 +96,19 @@ Like with git-annex's builtin backends, you can add "E" to the end of the name of an external backend, to get a version that includes the file extension in the key. 
+## internal use backends + +Keys using these backends can sometimes be visible, but they are used by +git-annex for its own purposes, and not for your annexed files. + +* `GIT` -- This is used internally by git-annex when exporting trees + containing files stored in git, rather than git-annex. It represents a + git sha. This is never used for git-annex links, but information about + keys of this type is stored in the git-annex branch. +* `GITBUNDLE` and `GITMANIFEST` -- Used by [[git-remote-annex]] to store + a git repository in a special remote. See + [[this_page|internals/git-remote-annex]] for details about these. + ## notes If you want to be able to prove that you're working with the same file diff --git a/doc/internals/git-remote-annex.mdwn b/doc/internals/git-remote-annex.mdwn index 7ec8d76515..5c9203931d 100644 --- a/doc/internals/git-remote-annex.mdwn +++ b/doc/internals/git-remote-annex.mdwn @@ -1,4 +1,4 @@ -This adds two new object types to git-annex, GITMANIFEST and a GITBUNDLE. +This adds two new key types to git-annex, GITMANIFEST and a GITBUNDLE. GITMANIFEST--$UUID is the manifest for a git repository stored in the git-annex repository with that UUID. @@ -14,7 +14,7 @@ An ordered list of bundle keys, one per line. # fetching 1. download GITMANIFEST for the uuid of the special remote -2. download each listed GITBUNDLE object that we don't have +2. download each listed GITBUNDLE key that we don't have 3. `git fetch` from each new bundle in order (note that later bundles can update refs from the versions in previous bundles) @@ -26,7 +26,7 @@ This is how pushes are usually done. 1. create git bundle of all refs that are being pushed and have changed, and objects since the previously pushed refs 2. hash to calculate GITBUNDLE key -3. upload GITBUNDLE object +3. upload GITBUNDLE key 4. download current manifest 5. append GITBUNDLE key to manifest @@ -38,8 +38,8 @@ previously pushed ref. 1. 
create git bundle containing all refs stored in the repository, and all objects -2. hash to calculate GITBUNDLE object name -3. upload GITBUNDLE object +2. hash to calculate GITBUNDLE key name +3. upload GITBUNDLE key 4. download old manifest 4. upload new manifest listing only the single new GITBUNDLE 5. delete all other GITBUNDLEs that were listed in the old manifest diff --git a/git-annex.cabal b/git-annex.cabal index 97ae80382f..3c6ad98118 100644 --- a/git-annex.cabal +++ b/git-annex.cabal @@ -580,6 +580,7 @@ Executable git-annex Author Backend Backend.External + Backend.GitRemoteAnnex Backend.Hash Backend.URL Backend.Utilities From 8d58a2354809866f7a6097dc8981f736dd69cf93 Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Tue, 7 May 2024 14:22:04 -0400 Subject: [PATCH 19/53] add git for-each-ref binding Sponsored-by: Luke T. Shumaker on Patreon --- Git/Ref.hs | 11 ++++++++++- 1 file changed, 10 insertions(+), 1 deletion(-) diff --git a/Git/Ref.hs b/Git/Ref.hs index 2d2874a7ef..72e8b15cd4 100644 --- a/Git/Ref.hs +++ b/Git/Ref.hs @@ -1,6 +1,6 @@ {- git ref stuff - - - Copyright 2011-2020 Joey Hess + - Copyright 2011-2024 Joey Hess - - Licensed under the GNU AGPL version 3 or higher. -} @@ -165,6 +165,15 @@ matchingUniq refs repo = nubBy uniqref <$> matching refs repo list :: Repo -> IO [(Sha, Ref)] list = matching' [] [] +{- Lists refs using for-each-ref. -} +forEachRef :: [CommandParam] -> Repo -> IO [(Sha, Branch)] +forEachRef ps repo = map gen . S8.lines <$> + pipeReadStrict (Param "for-each-ref" : ps ++ [format]) repo + where + format = Param "--format=%(objectname) %(refname)" + gen l = let (r, b) = separate' (== fromIntegral (ord ' ')) l + in (Ref r, Ref b) + {- Deletes a ref. This can delete refs that are not branches, - which git branch --delete refuses to delete. 
-} delete :: Sha -> Ref -> Repo -> IO () From e1447dc2e264b929a99d4ab34623f09de6c5e3fd Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Tue, 7 May 2024 14:22:41 -0400 Subject: [PATCH 20/53] add git bundle interface Sponsored-by: mycroft on Patreon --- Git/Bundle.hs | 25 +++++++++++++++++++++++++ git-annex.cabal | 1 + 2 files changed, 26 insertions(+) create mode 100644 Git/Bundle.hs diff --git a/Git/Bundle.hs b/Git/Bundle.hs new file mode 100644 index 0000000000..7b1b1adc15 --- /dev/null +++ b/Git/Bundle.hs @@ -0,0 +1,25 @@ +{- git bundles + - + - Copyright 2024 Joey Hess + - + - Licensed under the GNU AGPL version 3 or higher. + -} + +module Git.Bundle where + +import Common +import Git +import Git.Command + +import Data.Char (ord) +import qualified Data.ByteString.Char8 as S8 + +listHeads :: FilePath -> Repo -> IO [(Sha, Ref)] +listHeads bundle repo = map gen . S8.lines <$> + pipeReadStrict [Param "bundle", Param "list-heads", File bundle] repo + where + gen l = let (s, r) = separate' (== fromIntegral (ord ' ')) l + in (Ref s, Ref r) + +unbundle :: FilePath -> Repo -> IO () +unbundle bundle = runQuiet [Param "bundle", Param "unbundle", File bundle] diff --git a/git-annex.cabal b/git-annex.cabal index 3c6ad98118..53f9d5d786 100644 --- a/git-annex.cabal +++ b/git-annex.cabal @@ -760,6 +760,7 @@ Executable git-annex Git.AutoCorrect Git.Branch Git.BuildVersion + Git.Bundle Git.CatFile Git.CheckAttr Git.CheckIgnore From 947cf1c34552bd942ed0cf6601990df1338fe8b2 Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Tue, 7 May 2024 14:37:29 -0400 Subject: [PATCH 21/53] back to annex:: for git-remote-annex url Oh, turns out git needs two colons to use a gitremote-helper. Ok. 
--- CmdLine/GitRemoteAnnex.hs | 6 +++--- doc/git-remote-annex.mdwn | 20 ++++++++------------ 2 files changed, 11 insertions(+), 15 deletions(-) diff --git a/CmdLine/GitRemoteAnnex.hs b/CmdLine/GitRemoteAnnex.hs index 8331d89429..115ca3c90e 100644 --- a/CmdLine/GitRemoteAnnex.hs +++ b/CmdLine/GitRemoteAnnex.hs @@ -135,18 +135,18 @@ data SpecialRemoteConfig = SpecialRemoteConfig deriving (Show) -- The url for a special remote looks like --- annex:uuid?param=value&param=value... +-- annex::uuid?param=value&param=value... parseSpecialRemoteUrl :: String -> Either String SpecialRemoteConfig parseSpecialRemoteUrl s = case parseURI s of Nothing -> Left "URL parse failed" Just u -> case uriScheme u of "annex:" -> case uriPath u of "" -> Left "annex: URL did not include a UUID" - (':':_) -> Left "annex: URL malformed" - p -> Right $ SpecialRemoteConfig + (':':p) -> Right $ SpecialRemoteConfig { specialRemoteUUID = toUUID p , specialRemoteParams = parsequery u } + _ -> Left "annex: URL malformed" _ -> Left "Not an annex: URL" where parsequery u = map parsekv $ splitc '&' (drop 1 (uriQuery u)) diff --git a/doc/git-remote-annex.mdwn b/doc/git-remote-annex.mdwn index 74b946ebac..4fefb1bd36 100644 --- a/doc/git-remote-annex.mdwn +++ b/doc/git-remote-annex.mdwn @@ -4,7 +4,7 @@ git-remote-annex - remote helper program to store a git repository in a git-anne # SYNOPSIS -git fetch annex:uuid?param=value&param=value... +git fetch annex::uuid?param=value&param=value... # DESCRIPTION @@ -17,13 +17,13 @@ encryption=shared or encryption=hybrid. (Since those types of encryption rely on a cipher that is checked into the git repository, cloning from such a special remote would present a chicken and egg problem.) -The format of the remote URL is "annex:" followed by the UUID of the +The format of the remote URL is "annex::" followed by the UUID of the special remote, and then followed by all of the configuration parameters of the special remote. 
For example, to clone from a directory special remote: - git clone annex:358ff77e-0bc3-11ef-bc49-872e6695c0e3?type=directory&encryption=none&directory=/mnt/foo/ + git clone annex::358ff77e-0bc3-11ef-bc49-872e6695c0e3?type=directory&encryption=none&directory=/mnt/foo/ When a special remote needs some additional credentials to be provided, they are not included in the URL, and need to be provided when cloning from @@ -31,16 +31,12 @@ the special remote. That is typically done by setting environment variables. Some special remotes may also need environment variables to be set when pulling or pushing. -As a useful shorthand, when the special remote has already been enabled, -the configuration parameters can be omitted. For example: +When configuring the url of an existing special remote, a +shorter url of "annex::" is sufficient. For example: - git push annex:358ff77e-0bc3-11ef-bc49-872e6695c0e3 master - -This also makes it easy to configure the url for an existing special remote, -making it usable by git: - - git config remote.foo.url annex:358ff77e-0bc3-11ef-bc49-872e6695c0e3 - git fetch foo master + git-annex initremote foo type=directory encryption=none directory=/mnt/foo + git config remote.foo.url annex:: + git push foo master Configuring the url like that is automatically done when cloning from a special remote, but not by [[git-annex-initremote]](1) and From a89e8f6bad7a445892decc6ce56779a5eb4ffa03 Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Tue, 7 May 2024 15:01:22 -0400 Subject: [PATCH 22/53] skip remotes with an annex:: url These remotes are not regular git remotes, they are special remotes that git uses git-remote-annex to access. 
Sponsored-by: Jack Hill on Patreon --- Remote/Git.hs | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/Remote/Git.hs b/Remote/Git.hs index bba505e378..a234fd0fbb 100644 --- a/Remote/Git.hs +++ b/Remote/Git.hs @@ -92,7 +92,7 @@ list :: Bool -> Annex [Git.Repo] list autoinit = do c <- fromRepo Git.config rs <- mapM (tweakurl c) =<< Annex.getGitRemotes - mapM (configRead autoinit) rs + mapM (configRead autoinit) (filter (not . isGitRemoteAnnex) rs) where annexurl r = remoteConfig r "annexurl" tweakurl c r = do @@ -103,6 +103,9 @@ list autoinit = do Git.Construct.remoteNamed n $ Git.Construct.fromRemoteLocation (Git.fromConfigValue url) False g +isGitRemoteAnnex :: Git.Repo -> Bool +isGitRemoteAnnex r = "annex::" `isPrefixOf` Git.repoLocation r + {- Git remotes are normally set up using standard git commands, not - git-annex initremote and enableremote. - From cdcf2fe3a2f23c2aaf5acb7c6e6af1db35eb2411 Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Tue, 7 May 2024 15:13:41 -0400 Subject: [PATCH 23/53] git-remote-annex can fetch from an existing special remote Tested using a manually populated directory special remote. Pushing is still to be done. So is fetching from special remotes configured via the annex:: url. 
Sponsored-by: Brock Spratlen on Patreon --- CmdLine/GitRemoteAnnex.hs | 244 ++++++++++++++++++++++++++++---------- 1 file changed, 181 insertions(+), 63 deletions(-) diff --git a/CmdLine/GitRemoteAnnex.hs b/CmdLine/GitRemoteAnnex.hs index 115ca3c90e..361feaabf3 100644 --- a/CmdLine/GitRemoteAnnex.hs +++ b/CmdLine/GitRemoteAnnex.hs @@ -11,9 +11,14 @@ module CmdLine.GitRemoteAnnex where import Annex.Common import qualified Annex -import qualified Git.CurrentRepo import qualified Remote -import Annex.UUID +import qualified Git.CurrentRepo +import qualified Git.Ref +import qualified Git.Branch +import qualified Git.Bundle +import Git.Types +import Backend.GitRemoteAnnex +import Annex.Transfer import Types.Remote import Types.Key import Network.URI @@ -21,14 +26,18 @@ import Utility.Tmp import Utility.Metered import qualified Data.ByteString as B import qualified Data.ByteString.Char8 as B8 -import qualified Data.ByteString.Short as S +import qualified Data.Map.Strict as M run :: [String] -> IO () -run (_remotename:url:[]) = case parseSpecialRemoteUrl url of - Left e -> giveup e - Right src -> do - state <- Annex.new =<< Git.CurrentRepo.get - Annex.eval state (run' src) +run (remotename:url:[]) = + -- git strips the "annex::" prefix of the url + -- when running this command, so add it back + let url' = "annex::" ++ url + in case parseSpecialRemoteNameUrl remotename url' of + Left e -> giveup e + Right src -> do + state <- Annex.new =<< Git.CurrentRepo.get + Annex.eval state (run' src) run (_remotename:[]) = giveup "remote url not configured" run _ = giveup "expected remote name and url parameters" @@ -38,20 +47,33 @@ run' src = -- the output of this command is being parsed by git. 
doQuietAction $ do rmt <- getSpecialRemote src - go rmt =<< lines <$> liftIO getContents + ls <- lines <$> liftIO getContents + go rmt ls emptyState where - go rmt (l:ls) = + go rmt (l:ls) st = let (c, v) = splitLine l in case c of - "capabilities" -> capabilities >> go rmt ls + "capabilities" -> capabilities >> go rmt ls st "list" -> case v of - "" -> list rmt False >> go rmt ls - "for-push" -> list rmt True >> go rmt ls + "" -> list st rmt False >>= go rmt ls + "for-push" -> list st rmt True >>= go rmt ls _ -> protocolError l - "fetch" -> fetch rmt (l:ls) >>= go rmt - "push" -> push rmt (l:ls) >>= go rmt + "fetch" -> fetch st rmt (l:ls) >>= \ls' -> go rmt ls' st + "push" -> push st rmt (l:ls) >>= \ls' -> go rmt ls' st + "" -> return () _ -> protocolError l - go _ [] = return () + go _ [] _ = return () + +data State = State + { manifestCache :: Maybe Manifest + , trackingRefs :: M.Map Ref Sha + } + +emptyState :: State +emptyState = State + { manifestCache = Nothing + , trackingRefs = mempty + } protocolError :: String -> a protocolError l = giveup $ "gitremote-helpers protocol error at " ++ show l @@ -63,30 +85,71 @@ capabilities = do liftIO $ putStrLn "" liftIO $ hFlush stdout -list :: Remote -> Bool -> Annex () -list rmt forpush = error "TODO list" +list :: State -> Remote -> Bool -> Annex State +list st rmt forpush = do + manifest <- downloadManifest rmt + l <- forM (inManifest manifest) $ \k -> do + b <- downloadGitBundle rmt k + heads <- inRepo $ Git.Bundle.listHeads b + -- Get all the objects from the bundle. This is done here + -- so that the tracking refs can be updated with what is + -- listed, and so what when a full repush is done, all + -- objects are available to be pushed. + when forpush $ + inRepo $ Git.Bundle.unbundle b + -- The bundle may contain tracking refs, or regular refs, + -- make sure we're operating on regular refs. + return $ map (\(s, r) -> (fromTrackingRef rmt r, s)) heads + + -- Later refs replace earlier refs with the same name. 
+ let refmap = M.fromList $ concat l + let reflist = M.toList refmap + let trackingrefmap = M.mapKeys (toTrackingRef rmt) refmap + + -- When listing for a push, update the tracking refs to match what + -- was listed. This is necessary in order for a full repush to know + -- what to push. + when forpush $ + updateTrackingRefs rmt trackingrefmap + + -- Respond to git with a list of refs. + liftIO $ do + forM_ reflist $ \(ref, sha) -> + B8.putStrLn $ fromRef' sha <> " " <> fromRef' ref + -- Newline terminates list of refs. + putStrLn "" + hFlush stdout + + -- Remember the tracking refs. + return $ st + { manifestCache = Just manifest + , trackingRefs = trackingrefmap + } -- Any number of fetch commands can be sent by git, asking for specific -- things. We fetch everything new at once, so find the end of the fetch -- commands (which is supposed to be a blank line) before fetching. -fetch :: Remote -> [String] -> Annex [String] -fetch rmt (l:ls) = case splitLine l of - ("fetch", _) -> fetch rmt ls +fetch :: State -> Remote -> [String] -> Annex [String] +fetch st rmt (l:ls) = case splitLine l of + ("fetch", _) -> fetch st rmt ls ("", _) -> do - fetch' rmt + fetch' st rmt return ls _ -> do - fetch' rmt + fetch' st rmt return (l:ls) -fetch rmt [] = do - fetch' rmt +fetch st rmt [] = do + fetch' st rmt return [] -fetch' :: Remote -> Annex () -fetch' rmt = error "TODO fetch" +fetch' :: State -> Remote -> Annex () +fetch' st rmt = do + manifest <- maybe (downloadManifest rmt) pure (manifestCache st) + forM_ (inManifest manifest) $ \k -> + downloadGitBundle rmt k >>= inRepo . 
Git.Bundle.unbundle -push :: Remote -> [String] -> Annex [String] -push rmt ls = do +push :: State -> Remote -> [String] -> Annex [String] +push st rmt ls = do let (refspecs, ls') = collectRefSpecs ls error "TODO push refspecs" return ls' @@ -128,16 +191,27 @@ splitLine l = v = if null sv then sv else drop 1 sv in (c, v) -data SpecialRemoteConfig = SpecialRemoteConfig - { specialRemoteUUID :: UUID - , specialRemoteParams :: [(String, String)] - } +data SpecialRemoteConfig + = SpecialRemoteConfig + { specialRemoteUUID :: UUID + , specialRemoteParams :: [(String, String)] + } + | ExistingSpecialRemote RemoteName deriving (Show) -- The url for a special remote looks like --- annex::uuid?param=value&param=value... +-- "annex::uuid?param=value&param=value..." +-- +-- Also accept an url of "annex::", when a remote name is provided, +-- to use an already enabled special remote. +parseSpecialRemoteNameUrl :: String -> String -> Either String SpecialRemoteConfig +parseSpecialRemoteNameUrl remotename url + | url == "annex::" && remotename /= url = Right $ + ExistingSpecialRemote remotename + | otherwise = parseSpecialRemoteUrl url + parseSpecialRemoteUrl :: String -> Either String SpecialRemoteConfig -parseSpecialRemoteUrl s = case parseURI s of +parseSpecialRemoteUrl url = case parseURI url of Nothing -> Left "URL parse failed" Just u -> case uriScheme u of "annex:" -> case uriPath u of @@ -156,15 +230,13 @@ parseSpecialRemoteUrl s = case parseURI s of in (unEscapeString k, unEscapeString v) getSpecialRemote :: SpecialRemoteConfig -> Annex Remote -getSpecialRemote src - -- annex:uuid with no query string uses an existing remote - | null (specialRemoteParams src) = - Remote.byUUID (specialRemoteUUID src) >>= \case - Just rmt -> if thirdPartyPopulated (remotetype rmt) - then giveup "Cannot use this thirdparty-populated special remote as a git remote" - else return rmt - Nothing -> giveup $ "Cannot find an existing special remote with UUID " - ++ fromUUID (specialRemoteUUID src) 
+getSpecialRemote (ExistingSpecialRemote remotename) = + Remote.byNameOnly remotename >>= \case + Just rmt -> if thirdPartyPopulated (remotetype rmt) + then giveup "Cannot use this thirdparty-populated special remote as a git remote" + else return rmt + Nothing -> giveup $ "There is no special remote named " ++ remotename +getSpecialRemote src@(SpecialRemoteConfig {}) -- Given the configuration of a special remote, create a -- Remote object to access the special remote. -- This needs to avoid storing the configuration in the git-annex @@ -179,30 +251,18 @@ getSpecialRemote src -- remote, rather than the current git repo. But can this be -- avoided? --- A key that is used for the manifest of the git repository stored in a --- special remote with the specified uuid. -manifestKey :: UUID -> Key -manifestKey u = mkKey $ \kd -> kd - { keyName = S.toShort (fromUUID u) - , keyVariety = OtherKey "GITMANIFEST" - } - --- A key that is used for the git bundle with the specified sha256 --- that is stored in a special remote with the specified uuid. -gitbundleKey :: UUID -> B.ByteString -> Key -gitbundleKey u sha256 = mkKey $ \kd -> kd - { keyName = S.toShort (fromUUID u <> "-" <> sha256) - , keyVariety = OtherKey "GITBUNDLE" - } - -- The manifest contains an ordered list of git bundle keys. -newtype Manifest = Manifest [Key] +newtype Manifest = Manifest { inManifest :: [Key] } -- Downloads the Manifest, or if it does not exist, returns an empty -- Manifest. -- -- Throws errors if the remote cannot be accessed or the download fails, -- or if the manifest file cannot be parsed. +-- +-- This downloads the manifest to a temporary file, rather than using +-- the usual Annex.Transfer.download. The content of manifests is not +-- stable, and so it needs to re-download it fresh every time. 
downloadManifest :: Remote -> Annex Manifest downloadManifest rmt = ifM (checkPresent rmt mk) ( withTmpFile "GITMANIFEST" $ \tmp tmph -> do @@ -215,11 +275,69 @@ downloadManifest rmt = ifM (checkPresent rmt mk) , return (Manifest []) ) where - mk = manifestKey (Remote.uuid rmt) + mk = genManifestKey (Remote.uuid rmt) checkvalid c [] = return (reverse c) checkvalid c (Just k:ks) = case fromKey keyVariety k of - OtherKey "GITBUNDLE" -> checkvalid (k:c) ks + GitBundleKey -> checkvalid (k:c) ks _ -> giveup $ "Wrong type of key in manifest " ++ serializeKey k checkvalid _ (Nothing:_) = giveup $ "Error parsing manifest " ++ serializeKey mk + +-- Downloads a git bundle to the annex objects directory, unless +-- the object file is already present. Returns the filename of the object +-- file. +-- +-- Throws errors if the download fails, or the checksum does not verify. +-- +-- This does not update the location log to indicate that the local +-- repository contains the git bundle object. Reasons not to include: +-- 1. When this is being used in a git clone, the repository will not have +-- a UUID yet. +-- 2. It would unnecessarily bloat the git-annex branch, which would then +-- lead to more things needing to be pushed to the special remote, +-- and so more things pulled from it, etc. +-- 3. Git bundle objects are not usually transferred between repositories +-- except special remotes (although the user can if they want to). +downloadGitBundle :: Remote -> Key -> Annex FilePath +downloadGitBundle rmt k = + ifM (download rmt k (AssociatedFile Nothing) stdRetry noNotification) + ( decodeBS <$> calcRepo (gitAnnexLocation k) + , giveup $ "Failed to download " ++ serializeKey k + ) + +-- Tracking refs are used to remember the refs that are currently on the +-- remote. This is different from git's remote tracking branches, since it +-- needs to track all refs on the remote, not only the refs that the user +-- chooses to fetch. 
+-- +-- For refs/heads/master, the tracking ref is +-- refs/namespaces/git-remote-annex/uuid/refs/heads/master, +-- using the uuid of the remote. See gitnamespaces(7). +trackingRefPrefix :: Remote -> B.ByteString +trackingRefPrefix rmt = "refs/namespaces/git-remote-annex/" + <> fromUUID (Remote.uuid rmt) <> "/" + +toTrackingRef :: Remote -> Ref -> Ref +toTrackingRef rmt (Ref r) = Ref $ trackingRefPrefix rmt <> r + +-- If the ref is not a tracking ref, it is returned as-is. +fromTrackingRef :: Remote -> Ref -> Ref +fromTrackingRef rmt = Git.Ref.removeBase (decodeBS (trackingRefPrefix rmt)) + +-- Update the tracking refs to be those in the map, and no others. +updateTrackingRefs :: Remote -> M.Map Ref Sha -> Annex () +updateTrackingRefs rmt new = do + old <- inRepo $ Git.Ref.forEachRef + [Param (decodeBS (trackingRefPrefix rmt))] + + -- Delete all tracking refs that are not in the map. + forM_ (filter (\p -> M.notMember (fst p) new) old) $ \(s, r) -> + inRepo $ Git.Ref.delete s r + + -- Update all changed tracking refs. + let oldmap = M.fromList (map (\(s, r) -> (r, s)) old) + forM_ (M.toList new) $ \(r, s) -> + case M.lookup r oldmap of + Just s' | s' == s -> noop + _ -> inRepo $ Git.Branch.update' r s From df5011ec43e1869d82eb80566088badbff97ffd4 Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Tue, 7 May 2024 15:34:55 -0400 Subject: [PATCH 24/53] git-remote-annex: fix hang on fetch Sponsored-by: k0ld on Patreon --- CmdLine/GitRemoteAnnex.hs | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/CmdLine/GitRemoteAnnex.hs b/CmdLine/GitRemoteAnnex.hs index 361feaabf3..000c5b5238 100644 --- a/CmdLine/GitRemoteAnnex.hs +++ b/CmdLine/GitRemoteAnnex.hs @@ -147,6 +147,10 @@ fetch' st rmt = do manifest <- maybe (downloadManifest rmt) pure (manifestCache st) forM_ (inManifest manifest) $ \k -> downloadGitBundle rmt k >>= inRepo . Git.Bundle.unbundle + -- Newline indicates end of fetch. 
+ liftIO $ do + putStrLn "" + hFlush stdout push :: State -> Remote -> [String] -> Annex [String] push st rmt ls = do From 59fc2005ecb84711e831e42281a3a1cf5f918cc0 Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Wed, 8 May 2024 16:55:45 -0400 Subject: [PATCH 25/53] git clone support for git-remote-annex Also support using annex:: urls that specify the whole special remote config. Both of these cases need a special remote to be initialized enough to use it, which means writing to .git/config but not to the git-annex branch. When cloning, the remote is left set up in .git/config, so further use of it, by git-annex or git-remote-annex will work. When using git with an annex:: url, a temporary remote is written to .git/config, but then removed at the end. While that's a little bit ugly, the fact is that the Remote interface expects that it's ok to set git configs of the remote that is being initialized. And it's nowhere near as ugly as the alternative of making a temporary git repository and initializing the special remote in there. Cloning from a repository that does not contain a git-annex branch and then later running git-annex init is currently broken, although I've gotten most of the way there to supporting it. See cleanupInitialization FIXME. Special shout out to git clone for running gitremote-helpers with GIT_DIR set, but not in the git repository and with GIT_WORK_TREE not set. Resulting in needing the fixupRepo hack. Sponsored-by: unqueued on Patreon --- Annex/Init.hs | 10 +- Annex/SpecialRemote/Config.hs | 2 +- CmdLine/GitRemoteAnnex.hs | 221 ++++++++++++++++++++++++++++------ Git/Remote.hs | 8 +- 4 files changed, 196 insertions(+), 45 deletions(-) diff --git a/Annex/Init.hs b/Annex/Init.hs index 6a499e4771..b9478ae4f2 100644 --- a/Annex/Init.hs +++ b/Annex/Init.hs @@ -1,6 +1,6 @@ {- git-annex repository initialization - - - Copyright 2011-2022 Joey Hess + - Copyright 2011-2024 Joey Hess - - Licensed under the GNU AGPL version 3 or higher. 
-} @@ -12,6 +12,7 @@ module Annex.Init ( checkInitializeAllowed, ensureInitialized, autoInitialize, + autoInitialize', isInitialized, initialize, initialize', @@ -256,10 +257,13 @@ guardSafeToUseRepo a = ifM (inRepo Git.Config.checkRepoConfigInaccessible) - Checks repository version and handles upgrades too. -} autoInitialize :: Annex [Remote] -> Annex () -autoInitialize remotelist = getInitializedVersion >>= maybe needsinit checkUpgrade +autoInitialize = autoInitialize' autoInitializeAllowed + +autoInitialize' :: Annex Bool -> Annex [Remote] -> Annex () +autoInitialize' check remotelist = getInitializedVersion >>= maybe needsinit checkUpgrade where needsinit = - whenM (initializeAllowed <&&> autoInitializeAllowed) $ do + whenM (initializeAllowed <&&> check) $ do initialize Nothing Nothing autoEnableSpecialRemotes remotelist diff --git a/Annex/SpecialRemote/Config.hs b/Annex/SpecialRemote/Config.hs index fff2c88c1d..7fbd0d4191 100644 --- a/Annex/SpecialRemote/Config.hs +++ b/Annex/SpecialRemote/Config.hs @@ -1,6 +1,6 @@ {- git-annex special remote configuration - - - Copyright 2019-2023 Joey Hess + - Copyright 2019-2024 Joey Hess - - Licensed under the GNU AGPL version 3 or higher. 
-} diff --git a/CmdLine/GitRemoteAnnex.hs b/CmdLine/GitRemoteAnnex.hs index 000c5b5238..a2462ea770 100644 --- a/CmdLine/GitRemoteAnnex.hs +++ b/CmdLine/GitRemoteAnnex.hs @@ -13,30 +13,48 @@ import Annex.Common import qualified Annex import qualified Remote import qualified Git.CurrentRepo +import qualified Git import qualified Git.Ref import qualified Git.Branch import qualified Git.Bundle -import Git.Types -import Backend.GitRemoteAnnex +import qualified Git.Remote +import qualified Git.Remote.Remove +import qualified Annex.SpecialRemote as SpecialRemote +import qualified Annex.Branch +import qualified Types.Remote as R import Annex.Transfer -import Types.Remote +import Backend.GitRemoteAnnex +import Config +import Types.RemoteConfig +import Types.ProposedAccepted import Types.Key -import Network.URI +import Types.GitConfig +import Git.Types +import Logs.Difference +import Annex.Init +import Annex.Content +import Remote.List +import Remote.List.Util import Utility.Tmp +import Utility.Env import Utility.Metered + +import Network.URI import qualified Data.ByteString as B import qualified Data.ByteString.Char8 as B8 import qualified Data.Map.Strict as M +import System.FilePath.ByteString as P run :: [String] -> IO () -run (remotename:url:[]) = +run (remotename:url:[]) = -- git strips the "annex::" prefix of the url -- when running this command, so add it back let url' = "annex::" ++ url in case parseSpecialRemoteNameUrl remotename url' of Left e -> giveup e Right src -> do - state <- Annex.new =<< Git.CurrentRepo.get + repo <- getRepo + state <- Annex.new repo Annex.eval state (run' src) run (_remotename:[]) = giveup "remote url not configured" run _ = giveup "expected remote name and url parameters" @@ -45,10 +63,10 @@ run' :: SpecialRemoteConfig -> Annex () run' src = -- Prevent any usual git-annex output to stdout, because -- the output of this command is being parsed by git. 
- doQuietAction $ do - rmt <- getSpecialRemote src - ls <- lines <$> liftIO getContents - go rmt ls emptyState + doQuietAction $ + withSpecialRemote src $ \rmt -> do + ls <- lines <$> liftIO getContents + go rmt ls emptyState where go rmt (l:ls) st = let (c, v) = splitLine l @@ -198,7 +216,9 @@ splitLine l = data SpecialRemoteConfig = SpecialRemoteConfig { specialRemoteUUID :: UUID - , specialRemoteParams :: [(String, String)] + , specialRemoteConfig :: RemoteConfig + , specialRemoteName :: Maybe RemoteName + , specialRemoteUrl :: String } | ExistingSpecialRemote RemoteName deriving (Show) @@ -212,48 +232,114 @@ parseSpecialRemoteNameUrl :: String -> String -> Either String SpecialRemoteConf parseSpecialRemoteNameUrl remotename url | url == "annex::" && remotename /= url = Right $ ExistingSpecialRemote remotename - | otherwise = parseSpecialRemoteUrl url + | "annex::" `isPrefixOf` remotename = parseSpecialRemoteUrl url Nothing + | otherwise = parseSpecialRemoteUrl url (Just remotename) -parseSpecialRemoteUrl :: String -> Either String SpecialRemoteConfig -parseSpecialRemoteUrl url = case parseURI url of +parseSpecialRemoteUrl :: String -> Maybe RemoteName -> Either String SpecialRemoteConfig +parseSpecialRemoteUrl url remotename = case parseURI url of Nothing -> Left "URL parse failed" Just u -> case uriScheme u of "annex:" -> case uriPath u of "" -> Left "annex: URL did not include a UUID" (':':p) -> Right $ SpecialRemoteConfig { specialRemoteUUID = toUUID p - , specialRemoteParams = parsequery u + , specialRemoteConfig = parsequery u + , specialRemoteName = remotename + , specialRemoteUrl = url } _ -> Left "annex: URL malformed" _ -> Left "Not an annex: URL" where - parsequery u = map parsekv $ splitc '&' (drop 1 (uriQuery u)) + parsequery u = M.fromList $ + map parsekv $ splitc '&' (drop 1 (uriQuery u)) parsekv kv = let (k, sv) = break (== '=') kv v = if null sv then sv else drop 1 sv - in (unEscapeString k, unEscapeString v) + in (Proposed (unEscapeString k), 
Proposed (unEscapeString v)) -getSpecialRemote :: SpecialRemoteConfig -> Annex Remote -getSpecialRemote (ExistingSpecialRemote remotename) = +-- Runs an action with a Remote as specified by the SpecialRemoteConfig. +withSpecialRemote :: SpecialRemoteConfig -> (Remote -> Annex a) -> Annex a +withSpecialRemote (ExistingSpecialRemote remotename) a = + getEnabledSpecialRemoteByName remotename >>= + maybe (giveup $ "There is no special remote named " ++ remotename) + a +withSpecialRemote cfg@(SpecialRemoteConfig {}) a = case specialRemoteName cfg of + -- The name could be the name of an existing special remote, + -- if so use it as long as its UUID matches the UUID from the url. + Just remotename -> getEnabledSpecialRemoteByName remotename >>= \case + Just rmt + | R.uuid rmt == specialRemoteUUID cfg -> a rmt + | otherwise -> giveup $ "The uuid in the annex:: url does not match the uuid of the remote named " ++ remotename + -- When cloning from an annex:: url, + -- this is used to set up the origin remote. + Nothing -> (initremote remotename >>= a) + `finally` cleanupInitialization + Nothing -> inittempremote + `finally` cleanupInitialization + where + -- Initialize a new special remote with the provided configuration + -- and name. + -- + -- The configuration is not stored in the git-annex branch, because + -- it's expected that the git repository stored on the special + -- remote includes its configuration, perhaps under a different + -- name, and perhaps slightly different (when the annex:: url + -- omitted some unimportant part of the configuration). 
+ initremote remotename = do + let c = M.insert SpecialRemote.nameField (Proposed remotename) + (specialRemoteConfig cfg) + t <- either giveup return (SpecialRemote.findType c) + dummycfg <- liftIO dummyRemoteGitConfig + (c', _u) <- R.setup t R.Init (Just (specialRemoteUUID cfg)) + Nothing c dummycfg + `onException` cleanupremote remotename + setConfig (remoteConfig c' "url") (specialRemoteUrl cfg) + remotesChanged + getEnabledSpecialRemoteByName remotename >>= \case + Just rmt -> case checkSpecialRemoteProblems rmt of + Nothing -> return rmt + Just problem -> do + cleanupremote remotename + giveup problem + Nothing -> do + cleanupremote remotename + giveup "Unable to find special remote after setup." + + -- Temporarily initialize a special remote, and remove it after + -- the action is run. + inittempremote = + let remotename = Git.Remote.makeLegalName $ + "annex-temp-" ++ fromUUID (specialRemoteUUID cfg) + in bracket + (initremote remotename) + (const $ cleanupremote remotename) + a + + cleanupremote remotename = do + l <- inRepo Git.Remote.listRemotes + when (remotename `elem` l) $ + inRepo $ Git.Remote.Remove.remove remotename + +-- When a special remote has already been enabled, just use it. +getEnabledSpecialRemoteByName :: RemoteName -> Annex (Maybe Remote) +getEnabledSpecialRemoteByName remotename = Remote.byNameOnly remotename >>= \case - Just rmt -> if thirdPartyPopulated (remotetype rmt) - then giveup "Cannot use this thirdparty-populated special remote as a git remote" - else return rmt - Nothing -> giveup $ "There is no special remote named " ++ remotename -getSpecialRemote src@(SpecialRemoteConfig {}) - -- Given the configuration of a special remote, create a - -- Remote object to access the special remote. 
- -- This needs to avoid storing the configuration in the git-annex - -- branch (which would be redundant and also the configuration - -- provided may differ in some small way from the configuration - -- that is stored in the git repository inside the remote, which - -- should not be changed). It also needs to avoid creating a git - -- remote in .git/config. - | otherwise = error "TODO conjure up a new special remote out of thin air" - -- XXX one way to do it would be to make a temporary git repo, - -- initremote in there, and use that for accessing the special - -- remote, rather than the current git repo. But can this be - -- avoided? + Nothing -> return Nothing + Just rmt -> + maybe (return (Just rmt)) giveup + (checkSpecialRemoteProblems rmt) + +-- Avoid using special remotes that are thirdparty populated, because +-- there is no way to push the git repository keys into one. +-- +-- XXX Avoid using special remotes that are encrypted by key +-- material stored in the git repository, since that would present a +-- chicken and egg problem when cloning. +checkSpecialRemoteProblems :: Remote -> Maybe String +checkSpecialRemoteProblems rmt + | R.thirdPartyPopulated (R.remotetype rmt) = + Just "Cannot use this thirdparty-populated special remote as a git remote" + | otherwise = Nothing -- The manifest contains an ordered list of git bundle keys. newtype Manifest = Manifest { inManifest :: [Key] } @@ -268,12 +354,12 @@ newtype Manifest = Manifest { inManifest :: [Key] } -- the usual Annex.Transfer.download. The content of manifests is not -- stable, and so it needs to re-download it fresh every time. 
downloadManifest :: Remote -> Annex Manifest -downloadManifest rmt = ifM (checkPresent rmt mk) +downloadManifest rmt = ifM (R.checkPresent rmt mk) ( withTmpFile "GITMANIFEST" $ \tmp tmph -> do liftIO $ hClose tmph - _ <- retrieveKeyFile rmt mk + _ <- R.retrieveKeyFile rmt mk (AssociatedFile Nothing) tmp - nullMeterUpdate NoVerify + nullMeterUpdate R.NoVerify ks <- map deserializeKey' . B8.lines <$> liftIO (B.readFile tmp) Manifest <$> checkvalid [] ks , return (Manifest []) @@ -345,3 +431,58 @@ updateTrackingRefs rmt new = do case M.lookup r oldmap of Just s' | s' == s -> noop _ -> inRepo $ Git.Branch.update' r s + +-- git clone does not bother to set GIT_WORK_TREE when running this +-- program, and it does not run it inside the new git repo either. +-- GIT_DIR is set to the new git directory. So, have to override +-- the worktree to be the parent of the gitdir. +getRepo :: IO Repo +getRepo = getEnv "GIT_WORK_TREE" >>= \case + Just _ -> Git.CurrentRepo.get + Nothing -> fixup <$> Git.CurrentRepo.get + where + fixup r@(Repo { location = loc@(Local { worktree = Just _ }) }) = + r { location = loc { worktree = Just (P.takeDirectory (gitdir loc)) } } + fixup r = r + +-- This is run after git has used this process to fetch or push from a +-- special remote that was specified using a git-annex url. If the git +-- repository was not initialized for use by git-annex already, it is still +-- not initialized at this point. +-- +-- It's important that initialization not be done by this process until +-- git has fetched any git-annex branch from the special remote. That +-- git-annex branch may have Differences, and prematurely initializing the +-- local repository would then create a git-annex branch that can't merge +-- with the one from the special remote. +-- +-- If there is still not a sibling git-annex branch, this deletes all annex +-- objects for git bundles from the annex objects directory, and deletes +-- the annex objects directory. 
That is necessary to avoid the +-- Annex.Init.objectDirNotPresent check preventing a later initialization. +-- And if the later initialization includes Differences, the git bundle +-- objects downloaded by this process would be in the wrong locations. +-- +-- When there is now a sibling git-annex branch, this handles +-- initialization. When the initialized git-annex branch has Differences, +-- the git bundle objects are in the wrong place, so have to be deleted. +-- +-- FIXME git-annex branch is unfortunately created during git clone from a +-- special remote. Should not be for this to work. +cleanupInitialization :: Annex () +cleanupInitialization = ifM Annex.Branch.hasSibling + ( do + autoInitialize' (pure True) remoteList + differences <- allDifferences <$> recordedDifferences + when (differences /= mempty) $ + deletebundleobjects + , deletebundleobjects + ) + where + deletebundleobjects = do + annexobjectdir <- fromRepo gitAnnexObjectDir + ks <- listKeys InAnnex + forM_ ks $ \k -> case fromKey keyVariety k of + GitBundleKey -> lockContentForRemoval k noop removeAnnex + _ -> noop + void $ liftIO $ tryIO $ removeDirectory (decodeBS annexobjectdir) diff --git a/Git/Remote.hs b/Git/Remote.hs index 9cdaad61ca..4eb6780fcc 100644 --- a/Git/Remote.hs +++ b/Git/Remote.hs @@ -1,6 +1,6 @@ {- git remote stuff - - - Copyright 2012-2021 Joey Hess + - Copyright 2012-2024 Joey Hess - - Licensed under the GNU AGPL version 3 or higher. -} @@ -13,6 +13,7 @@ module Git.Remote where import Common import Git import Git.Types +import Git.Command import Data.Char import qualified Data.Map as M @@ -23,6 +24,11 @@ import Network.URI import Git.FilePath #endif +{- Lists all currently existing git remotes. -} +listRemotes :: Repo -> IO [RemoteName] +listRemotes repo = map decodeBS . S8.lines + <$> pipeReadStrict [Param "remote"] repo + {- Is a git config key one that specifies the url of a remote? 
-} isRemoteUrlKey :: ConfigKey -> Bool isRemoteUrlKey = isRemoteKey "url" From 797f27ab0517e0021363791ff269300f2ba095a5 Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Wed, 8 May 2024 18:07:26 -0400 Subject: [PATCH 26/53] handle cloning from a special remote that does not contain a git-annex branch It did not seem possible to avoid creating a git-annex branch while git-remote-annex is running. Special remotes can even store their own state in it. So instead, if it didn't exist before git-remote-annex created it, it deletes it at the end. This does possibly allow a race condition, where git-annex init and perhaps other git-annex writing commands are run, that writes to the git-annex branch, at the same time a git-remote-annex process is being run by git fetch/push with a full annex:: url. Those writes would be lost. If the repository has already been initialized before git-remote-annex, that race won't happen. So it's pretty unlikely. Sponsored-by: Graham Spencer on Patreon --- CmdLine/GitRemoteAnnex.hs | 81 ++++++++++++++++++++++++--------------- 1 file changed, 50 insertions(+), 31 deletions(-) diff --git a/CmdLine/GitRemoteAnnex.hs b/CmdLine/GitRemoteAnnex.hs index a2462ea770..3beb4f12a1 100644 --- a/CmdLine/GitRemoteAnnex.hs +++ b/CmdLine/GitRemoteAnnex.hs @@ -21,7 +21,7 @@ import qualified Git.Remote import qualified Git.Remote.Remove import qualified Annex.SpecialRemote as SpecialRemote import qualified Annex.Branch -import qualified Types.Remote as R +import qualified Types.Remote as Remote import Annex.Transfer import Backend.GitRemoteAnnex import Config @@ -44,6 +44,7 @@ import qualified Data.ByteString as B import qualified Data.ByteString.Char8 as B8 import qualified Data.Map.Strict as M import System.FilePath.ByteString as P +import qualified Utility.RawFilePath as R run :: [String] -> IO () run (remotename:url:[]) = @@ -60,11 +61,12 @@ run (_remotename:[]) = giveup "remote url not configured" run _ = giveup "expected remote name and url parameters" run' 
:: SpecialRemoteConfig -> Annex () -run' src = +run' src = do + sab <- startAnnexBranch -- Prevent any usual git-annex output to stdout, because -- the output of this command is being parsed by git. doQuietAction $ - withSpecialRemote src $ \rmt -> do + withSpecialRemote src sab $ \rmt -> do ls <- lines <$> liftIO getContents go rmt ls emptyState where @@ -258,24 +260,24 @@ parseSpecialRemoteUrl url remotename = case parseURI url of in (Proposed (unEscapeString k), Proposed (unEscapeString v)) -- Runs an action with a Remote as specified by the SpecialRemoteConfig. -withSpecialRemote :: SpecialRemoteConfig -> (Remote -> Annex a) -> Annex a -withSpecialRemote (ExistingSpecialRemote remotename) a = +withSpecialRemote :: SpecialRemoteConfig -> StartAnnexBranch -> (Remote -> Annex a) -> Annex a +withSpecialRemote (ExistingSpecialRemote remotename) _ a = getEnabledSpecialRemoteByName remotename >>= maybe (giveup $ "There is no special remote named " ++ remotename) a -withSpecialRemote cfg@(SpecialRemoteConfig {}) a = case specialRemoteName cfg of +withSpecialRemote cfg@(SpecialRemoteConfig {}) sab a = case specialRemoteName cfg of -- The name could be the name of an existing special remote, -- if so use it as long as its UUID matches the UUID from the url. Just remotename -> getEnabledSpecialRemoteByName remotename >>= \case Just rmt - | R.uuid rmt == specialRemoteUUID cfg -> a rmt + | Remote.uuid rmt == specialRemoteUUID cfg -> a rmt | otherwise -> giveup $ "The uuid in the annex:: url does not match the uuid of the remote named " ++ remotename -- When cloning from an annex:: url, -- this is used to set up the origin remote. Nothing -> (initremote remotename >>= a) - `finally` cleanupInitialization + `finally` cleanupInitialization sab Nothing -> inittempremote - `finally` cleanupInitialization + `finally` cleanupInitialization sab where -- Initialize a new special remote with the provided configuration -- and name. 
@@ -290,7 +292,7 @@ withSpecialRemote cfg@(SpecialRemoteConfig {}) a = case specialRemoteName cfg of (specialRemoteConfig cfg) t <- either giveup return (SpecialRemote.findType c) dummycfg <- liftIO dummyRemoteGitConfig - (c', _u) <- R.setup t R.Init (Just (specialRemoteUUID cfg)) + (c', _u) <- Remote.setup t Remote.Init (Just (specialRemoteUUID cfg)) Nothing c dummycfg `onException` cleanupremote remotename setConfig (remoteConfig c' "url") (specialRemoteUrl cfg) @@ -337,7 +339,7 @@ getEnabledSpecialRemoteByName remotename = -- chicken and egg problem when cloning. checkSpecialRemoteProblems :: Remote -> Maybe String checkSpecialRemoteProblems rmt - | R.thirdPartyPopulated (R.remotetype rmt) = + | Remote.thirdPartyPopulated (Remote.remotetype rmt) = Just "Cannot use this thirdparty-populated special remote as a git remote" | otherwise = Nothing @@ -354,12 +356,12 @@ newtype Manifest = Manifest { inManifest :: [Key] } -- the usual Annex.Transfer.download. The content of manifests is not -- stable, and so it needs to re-download it fresh every time. downloadManifest :: Remote -> Annex Manifest -downloadManifest rmt = ifM (R.checkPresent rmt mk) +downloadManifest rmt = ifM (Remote.checkPresent rmt mk) ( withTmpFile "GITMANIFEST" $ \tmp tmph -> do liftIO $ hClose tmph - _ <- R.retrieveKeyFile rmt mk + _ <- Remote.retrieveKeyFile rmt mk (AssociatedFile Nothing) tmp - nullMeterUpdate R.NoVerify + nullMeterUpdate Remote.NoVerify ks <- map deserializeKey' . B8.lines <$> liftIO (B.readFile tmp) Manifest <$> checkvalid [] ks , return (Manifest []) @@ -445,16 +447,29 @@ getRepo = getEnv "GIT_WORK_TREE" >>= \case r { location = loc { worktree = Just (P.takeDirectory (gitdir loc)) } } fixup r = r +-- Records what the git-annex branch was at the beginning of this command. 
+data StartAnnexBranch + = AnnexBranchExistedAlready Ref + | AnnexBranchCreatedEmpty Ref + +startAnnexBranch :: Annex StartAnnexBranch +startAnnexBranch = ifM (null <$> Annex.Branch.siblingBranches) + ( AnnexBranchCreatedEmpty <$> Annex.Branch.getBranch + , AnnexBranchExistedAlready <$> Annex.Branch.getBranch + ) + -- This is run after git has used this process to fetch or push from a -- special remote that was specified using a git-annex url. If the git -- repository was not initialized for use by git-annex already, it is still -- not initialized at this point. -- --- It's important that initialization not be done by this process until --- git has fetched any git-annex branch from the special remote. That --- git-annex branch may have Differences, and prematurely initializing the --- local repository would then create a git-annex branch that can't merge --- with the one from the special remote. +-- If the git-annex branch did not exist when this command started, +-- the current contents of it were created in passing by this command, +-- which is hard to avoid. But if a git-annex branch is fetched from the +-- special remote and contains Differences, it would not be possible to +-- merge it into the git-annex branch that was created while running this +-- command. To avoid that problem, when the git-annex branch was created +-- at the start of this command, it's deleted. -- -- If there is still not a sibling git-annex branch, this deletes all annex -- objects for git bundles from the annex objects directory, and deletes @@ -466,18 +481,22 @@ getRepo = getEnv "GIT_WORK_TREE" >>= \case -- When there is now a sibling git-annex branch, this handles -- initialization. When the initialized git-annex branch has Differences, -- the git bundle objects are in the wrong place, so have to be deleted. --- --- FIXME git-annex branch is unfortunately created during git clone from a --- special remote. Should not be for this to work. 
-cleanupInitialization :: Annex () -cleanupInitialization = ifM Annex.Branch.hasSibling - ( do - autoInitialize' (pure True) remoteList - differences <- allDifferences <$> recordedDifferences - when (differences /= mempty) $ - deletebundleobjects - , deletebundleobjects - ) +cleanupInitialization :: StartAnnexBranch -> Annex () +cleanupInitialization sab = do + case sab of + AnnexBranchExistedAlready _ -> noop + AnnexBranchCreatedEmpty _ -> do + inRepo $ Git.Branch.delete Annex.Branch.fullname + indexfile <- fromRepo gitAnnexIndex + liftIO $ removeWhenExistsWith R.removeLink indexfile + ifM Annex.Branch.hasSibling + ( do + autoInitialize' (pure True) remoteList + differences <- allDifferences <$> recordedDifferences + when (differences /= mempty) $ + deletebundleobjects + , deletebundleobjects + ) where deletebundleobjects = do annexobjectdir <- fromRepo gitAnnexObjectDir From f2d17cf154dfdcccaf3d1cad37c5855c9ad2a7ef Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Thu, 9 May 2024 16:11:16 -0400 Subject: [PATCH 27/53] git-remote-annex: mostly implemented pushing Full pushing will probably work, but is untested. Incremental pushing is not implemented yet. While a fairly straightforward port of the shell prototype, the details of exactly how to get the objects to the remote were tricky. And the prototype did not consider how to deal with partial failures and interruptions. I've taken considerable care to make sure it always leaves things in a consistent state when interrupted or when it loses access to a remote in the middle of a push. 
Sponsored-by: Leon Schuermann on Patreon --- CmdLine/GitRemoteAnnex.hs | 229 ++++++++++++++++++++++++++++++++++++-- Git/Bundle.hs | 13 +++ Git/Ref.hs | 26 ++++- 3 files changed, 254 insertions(+), 14 deletions(-) diff --git a/CmdLine/GitRemoteAnnex.hs b/CmdLine/GitRemoteAnnex.hs index 3beb4f12a1..a4a9a421e9 100644 --- a/CmdLine/GitRemoteAnnex.hs +++ b/CmdLine/GitRemoteAnnex.hs @@ -13,7 +13,6 @@ import Annex.Common import qualified Annex import qualified Remote import qualified Git.CurrentRepo -import qualified Git import qualified Git.Ref import qualified Git.Branch import qualified Git.Bundle @@ -78,8 +77,10 @@ run' src = do "" -> list st rmt False >>= go rmt ls "for-push" -> list st rmt True >>= go rmt ls _ -> protocolError l - "fetch" -> fetch st rmt (l:ls) >>= \ls' -> go rmt ls' st - "push" -> push st rmt (l:ls) >>= \ls' -> go rmt ls' st + "fetch" -> fetch st rmt (l:ls) + >>= \ls' -> go rmt ls' st + "push" -> push st rmt (l:ls) + >>= \(ls', st') -> go rmt ls' st' "" -> return () _ -> protocolError l go _ [] _ = return () @@ -140,7 +141,7 @@ list st rmt forpush = do putStrLn "" hFlush stdout - -- Remember the tracking refs. + -- Remember the tracking refs and manifest. return $ st { manifestCache = Just manifest , trackingRefs = trackingrefmap @@ -172,16 +173,128 @@ fetch' st rmt = do putStrLn "" hFlush stdout -push :: State -> Remote -> [String] -> Annex [String] +-- Note that the git bundles that are generated to push contain +-- tracking refs, rather than the actual refs that the user requested to +-- push. This is done because git bundle does not allow creating a bundle +-- that contains refs with different names than the ones in the git +-- repository. Consider eg, git push remote foo:bar, where the destination +-- ref is bar, but there may be no bar ref locally, or the bar ref may +-- be different than foo. If git bundle supported GIT_NAMESPACE, it would +-- be possible to generate a bundle that contains the specified refs. 
+push :: State -> Remote -> [String] -> Annex ([String], State) push st rmt ls = do let (refspecs, ls') = collectRefSpecs ls - error "TODO push refspecs" - return ls' + (responses, trackingrefs) <- calc refspecs ([], trackingRefs st) + (ok, st') <- if M.null trackingrefs + then pushEmpty st rmt + else if any forcedPush refspecs + then fullPush st rmt (M.keys trackingrefs) + -- TODO: support max-bundles config + else incrementalPush st rmt + (trackingRefs st) trackingrefs + if ok + then do + sendresponses responses + -- Update the tracking refs to reflect the push. + updateTrackingRefs rmt trackingrefs + return (ls', st' { trackingRefs = trackingrefs }) + else do + sendresponses $ + map (const "error push failed") refspecs + return (ls', st') + where + calc + :: [RefSpec] + -> ([B.ByteString], M.Map Ref Sha) + -> Annex ([B.ByteString], M.Map Ref Sha) + calc [] (responses, trackingrefs) = + return (reverse responses, trackingrefs) + calc (r:rs) (responses, trackingrefs) = + let tr = toTrackingRef rmt (dstRef r) + okresp m = pure + ( ("ok " <> fromRef' (dstRef r)):responses + , m + ) + errresp msg = pure + ( ("error " <> fromRef' (dstRef r) <> " " <> msg):responses + , trackingrefs + ) + in calc rs =<< case srcRef r of + Just srcref + | forcedPush r -> okresp $ + M.insert tr srcref trackingrefs + | otherwise -> ifM (isfastforward srcref tr) + ( okresp $ + M.insert tr srcref trackingrefs + , errresp $ fromRef' (dstRef r) + <> " non-fast-forward" + ) + Nothing -> okresp $ M.delete tr trackingrefs + + -- Check if the push is a fast-forward that will not overwrite work + -- in the ref currently stored in the remote. This seems redundant + -- to git's own checking for non-fast-forwards. But unfortunately, + -- before git push checks that, it actually tells us to push. + -- That seems likely to be a bug in git, and this is a workaround. 
+ isfastforward newref tr = case M.lookup tr (trackingRefs st) of + Nothing -> pure False + Just prevsha -> inRepo $ Git.Ref.isAncestor prevsha newref + + -- Send responses followed by newline to indicate end of push. + sendresponses responses = liftIO $ do + mapM_ B8.putStrLn responses + putStrLn "" + hFlush stdout + +-- Full push of the specified refs to the remote. +-- All git bundle objects listed in the old manifest will be +-- deleted after successful upload of the new git bundle and manifest. +-- +-- If this is interrupted, or loses access to the remote mid way through, it +-- will leave the remote with unused bundle keys on it, but every bundle +-- key listed in the manifest will exist, so it's in a consistent, usable +-- state. +-- +-- However, the manifest is replaced by first dropping the object and then +-- uploading a new one. Interrupting that will leave the remote without a +-- manifest, which will appear as if all tracking branches were deleted +-- from it. +fullPush :: State -> Remote -> [Ref] -> Annex (Bool, State) +fullPush st rmt refs = flip catchNonAsync failed $ do + oldmanifest <- maybe (downloadManifest rmt) pure (manifestCache st) + bundlekey <- generateAndUploadGitBundle rmt refs oldmanifest + uploadManifest rmt (Manifest [bundlekey]) + let st' = st { manifestCache = Nothing } + ok <- allM (dropKey rmt) $ + filter (/= bundlekey) (inManifest oldmanifest) + return (ok, st') + where + failed ex = do + liftIO $ hPutStrLn stderr $ + "Push faild (" ++ show ex ++ ")" + return (False, st) + +-- Incremental push of only the refs that changed. +incrementalPush :: State -> Remote -> M.Map Ref Sha -> M.Map Ref Sha -> Annex (Bool, State) +incrementalPush st rmt oldtrackingrefs newtrackingrefs = do + error "TODO incrementalPush" + +-- When the push deletes all refs from the remote, drop the manifest +-- and all bundles that were listed in it. 
The manifest is dropped +-- first so if this is interrupted, only unused bundles will remain in the +-- remote, rather than leaving the remote with a manifest that refers to +-- missing bundles. +pushEmpty :: State -> Remote -> Annex (Bool, State) +pushEmpty st rmt = do + manifest <- maybe (downloadManifest rmt) pure (manifestCache st) + ok <- allM (dropKey rmt) + (genManifestKey (Remote.uuid rmt) : inManifest manifest) + return (ok, st { manifestCache = Nothing }) data RefSpec = RefSpec { forcedPush :: Bool - , srcRef :: Maybe String -- empty when deleting a ref - , dstRef :: String + , srcRef :: Maybe Ref -- ^ Nothing when deleting a ref + , dstRef :: Ref } deriving (Show) @@ -201,10 +314,15 @@ parseRefSpec ('+':s) = (parseRefSpec s) { forcedPush = True } parseRefSpec s = let (src, cdst) = break (== ':') s dst = if null cdst then cdst else drop 1 cdst + deletesrc = null src in RefSpec - { forcedPush = False - , srcRef = if null src then Nothing else Just src - , dstRef = dst + -- To delete a ref, have to do a force push of all + -- remaining refs. + { forcedPush = deletesrc + , srcRef = if deletesrc + then Nothing + else Just (Ref (encodeBS src)) + , dstRef = Ref (encodeBS dst) } -- "foo bar" to ("foo", "bar") @@ -376,6 +494,50 @@ downloadManifest rmt = ifM (Remote.checkPresent rmt mk) checkvalid _ (Nothing:_) = giveup $ "Error parsing manifest " ++ serializeKey mk +-- Uploads the Manifest to the remote. +-- +-- Throws errors if the remote cannot be accessed or the upload fails. +-- +-- The manifest key is first dropped from the remote, then the new +-- content is uploaded. This is necessary because the same key is used, +-- and behavior of remotes is undefined when sending a key that is +-- already present on the remote, but with different content. +-- +-- Note that if this is interrupted or loses access to the remote part +-- way through, it may leave the remote without a manifest file. That will +-- appear as if all refs have been deleted from the remote. 
+-- FIXME It should be possible to remember when that happened, by writing +-- state to a file before, and then the next time git-remote-annex is run, it +-- could recover from the situation. +uploadManifest :: Remote -> Manifest -> Annex () +uploadManifest rmt manifest = + withTmpFile "GITMANIFEST" $ \tmp tmph -> do + liftIO $ forM_ (inManifest manifest) $ \bundlekey -> + B8.hPutStrLn tmph (serializeKey' bundlekey) + liftIO $ hClose tmph + -- Remove old manifest if present. + Remote.removeKey rmt mk + -- storeKey needs the key to be in the annex objects + -- directory, so put the manifest file there temporarily. + -- Using linkOrCopy rather than moveAnnex to avoid updating + -- InodeCache database. Also, works even when the repository + -- is configured to require only cryptographically secure + -- keys, which it is not. + objfile <- calcRepo (gitAnnexLocation mk) + unlessM (isJust <$> linkOrCopy mk (toRawFilePath tmp) objfile Nothing) + uploadfailed + -- noRetry because manifest content is not stable + ok <- upload rmt mk (AssociatedFile Nothing) + noRetry noNotification + -- Don't leave the manifest key in the annex objects + -- directory. + unlinkAnnex mk + unless ok + uploadfailed + where + mk = genManifestKey (Remote.uuid rmt) + uploadfailed = giveup $ "Failed to upload " ++ serializeKey mk + -- Downloads a git bundle to the annex objects directory, unless -- the object file is already present. Returns the filename of the object -- file. @@ -398,6 +560,49 @@ downloadGitBundle rmt k = , giveup $ "Failed to download " ++ serializeKey k ) +-- Uploads a git bundle from the annex objects directory to the remote. +-- +-- Throws errors if the upload fails. +-- +-- This does not update the location log to indicate that the remote +-- contains the git bundle object. 
+uploadGitBundle :: Remote -> Key -> Annex () +uploadGitBundle rmt k = + unlessM (upload rmt k (AssociatedFile Nothing) stdRetry noNotification) $ + giveup $ "Failed to upload " ++ serializeKey k + +-- Generates a git bundle that contains the specified refs, ingests +-- it into the local objects directory, and uploads its key to the special +-- remote. +-- +-- If the key is present in the provided manifest, avoids uploading it. +-- +-- On failure, an exception is thrown, and nothing is added to the local +-- objects directory. +generateAndUploadGitBundle :: Remote -> [Ref] -> Manifest -> Annex Key +generateAndUploadGitBundle rmt refs manifest = + withTmpFile "GITBUNDLE" $ \tmp tmph -> do + liftIO $ hClose tmph + inRepo $ Git.Bundle.create tmp refs + bundlekey <- genGitBundleKey (Remote.uuid rmt) + (toRawFilePath tmp) nullMeterUpdate + unless (bundlekey `elem` (inManifest manifest)) $ do + unlessM (moveAnnex bundlekey (AssociatedFile Nothing) (toRawFilePath tmp)) $ + giveup "Unable to push" + uploadGitBundle rmt bundlekey + `onException` unlinkAnnex bundlekey + return bundlekey + +dropKey :: Remote -> Key -> Annex Bool +dropKey rmt k = tryNonAsync (Remote.removeKey rmt k) >>= \case + Right () -> return True + Left ex -> do + liftIO $ hPutStrLn stderr $ + "Failed to drop " + ++ serializeKey k + ++ " (" ++ show ex ++ ")" + return False + -- Tracking refs are used to remember the refs that are currently on the -- remote. This is different from git's remote tracking branches, since it -- needs to track all refs on the remote, not only the refs that the user diff --git a/Git/Bundle.hs b/Git/Bundle.hs index 7b1b1adc15..2d90f20a34 100644 --- a/Git/Bundle.hs +++ b/Git/Bundle.hs @@ -23,3 +23,16 @@ listHeads bundle repo = map gen . 
S8.lines <$> unbundle :: FilePath -> Repo -> IO () unbundle bundle = runQuiet [Param "bundle", Param "unbundle", File bundle] + +create :: FilePath -> [Ref] -> Repo -> IO () +create bundle refs repo = pipeWrite + [ Param "bundle" + , Param "create" + , Param "--quiet" + , File bundle + , Param "--stdin" + ] repo writerefs + where + writerefs h = do + mapM_ (S8.hPutStrLn h . fromRef') refs + hClose h diff --git a/Git/Ref.hs b/Git/Ref.hs index 72e8b15cd4..fd7d2da0c8 100644 --- a/Git/Ref.hs +++ b/Git/Ref.hs @@ -174,8 +174,9 @@ forEachRef ps repo = map gen . S8.lines <$> gen l = let (r, b) = separate' (== fromIntegral (ord ' ')) l in (Ref r, Ref b) -{- Deletes a ref. This can delete refs that are not branches, - - which git branch --delete refuses to delete. -} +{- Deletes a ref when it contains the specified sha. + - This can delete refs that are not branches, which + - git branch --delete refuses to delete. -} delete :: Sha -> Ref -> Repo -> IO () delete oldvalue ref = run [ Param "update-ref" @@ -184,6 +185,14 @@ delete oldvalue ref = run , Param $ fromRef oldvalue ] +{- Deletes a ref no matter what it contains. -} +delete' :: Ref -> Repo -> IO () +delete' ref = run + [ Param "update-ref" + , Param "-d" + , Param $ fromRef ref + ] + {- Gets the sha of the tree a ref uses. - - The ref may be something like a branch name, and it could contain @@ -201,6 +210,19 @@ tree (Ref ref) = extractSha <$$> pipeReadStrict -- de-reference commit objects to the tree else ref <> ":" +{- Check if the first ref is an ancestor of the second ref. + - + - Note that if the two refs point to the same commit, it is considered + - to be an ancestor of itself. + -} +isAncestor :: Ref -> Ref -> Repo -> IO Bool +isAncestor r1 r2 = runBool + [ Param "merge-base" + , Param "--ancestor" + , Param (fromRef r1) + , Param (fromRef r2) + ] + {- Checks if a String is a legal git ref name. 
- - The rules for this are complex; see git-check-ref-format(1) -} From 303933152978902a1894c416292c0bb285bd4181 Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Fri, 10 May 2024 13:32:37 -0400 Subject: [PATCH 28/53] git-remote-annex: incremental pushing Untested Sponsored-by: Joshua Antonishen on Patreon --- CmdLine/GitRemoteAnnex.hs | 115 ++++++++++++++++++++++++++++++-------- Git/Bundle.hs | 40 +++++++++++-- 2 files changed, 128 insertions(+), 27 deletions(-) diff --git a/CmdLine/GitRemoteAnnex.hs b/CmdLine/GitRemoteAnnex.hs index a4a9a421e9..62f739cedc 100644 --- a/CmdLine/GitRemoteAnnex.hs +++ b/CmdLine/GitRemoteAnnex.hs @@ -226,8 +226,7 @@ push st rmt ls = do | otherwise -> ifM (isfastforward srcref tr) ( okresp $ M.insert tr srcref trackingrefs - , errresp $ fromRef' (dstRef r) - <> " non-fast-forward" + , errresp "non-fast-forward" ) Nothing -> okresp $ M.delete tr trackingrefs @@ -237,8 +236,8 @@ push st rmt ls = do -- before git push checks that, it actually tells us to push. -- That seems likely to be a bug in git, and this is a workaround. isfastforward newref tr = case M.lookup tr (trackingRefs st) of - Nothing -> pure False Just prevsha -> inRepo $ Git.Ref.isAncestor prevsha newref + Nothing -> pure True -- Send responses followed by newline to indicate end of push. sendresponses responses = liftIO $ do @@ -260,24 +259,92 @@ push st rmt ls = do -- manifest, which will appear as if all tracking branches were deleted -- from it. 
fullPush :: State -> Remote -> [Ref] -> Annex (Bool, State) -fullPush st rmt refs = flip catchNonAsync failed $ do +fullPush st rmt refs = guardPush st $ do oldmanifest <- maybe (downloadManifest rmt) pure (manifestCache st) - bundlekey <- generateAndUploadGitBundle rmt refs oldmanifest + let bs = map Git.Bundle.fullBundleSpec refs + bundlekey <- generateAndUploadGitBundle rmt bs oldmanifest uploadManifest rmt (Manifest [bundlekey]) - let st' = st { manifestCache = Nothing } ok <- allM (dropKey rmt) $ filter (/= bundlekey) (inManifest oldmanifest) - return (ok, st') - where - failed ex = do - liftIO $ hPutStrLn stderr $ - "Push faild (" ++ show ex ++ ")" - return (False, st) + return (ok, st { manifestCache = Nothing }) + +guardPush :: State -> Annex (Bool, State) -> Annex (Bool, State) +guardPush st a = catchNonAsync a $ \ex -> do + liftIO $ hPutStrLn stderr $ + "Push faild (" ++ show ex ++ ")" + return (False, st { manifestCache = Nothing }) -- Incremental push of only the refs that changed. +-- +-- No refs were deleted (that causes a fullPush), but new refs may +-- have been added. 
incrementalPush :: State -> Remote -> M.Map Ref Sha -> M.Map Ref Sha -> Annex (Bool, State) -incrementalPush st rmt oldtrackingrefs newtrackingrefs = do - error "TODO incrementalPush" +incrementalPush st rmt oldtrackingrefs newtrackingrefs = guardPush st $ do + bs <- calc [] (M.toList newtrackingrefs) + liftIO $ hPutStrLn stderr (show bs) + oldmanifest <- maybe (downloadManifest rmt) pure (manifestCache st) + bundlekey <- generateAndUploadGitBundle rmt bs oldmanifest + uploadManifest rmt (Manifest [bundlekey]) + return (True, st { manifestCache = Nothing }) + where + calc c [] = return (reverse c) + calc c ((ref, sha):refs) = case M.lookup ref oldtrackingrefs of + Just oldsha + | oldsha == sha -> calc c refs -- unchanged + | otherwise -> + ifM (inRepo $ Git.Ref.isAncestor oldsha ref) + ( use $ checkprereq oldsha ref + , use $ findotherprereq ref sha + ) + Nothing -> use $ findotherprereq ref sha + where + use a = do + bs <- a + calc (bs:c) refs + + -- Unfortunately, git bundle will let a prerequisite specified + -- for one ref prevent it including another ref. For example, + -- where x is a ref that points at A, and y is a ref that points at + -- B (which has A as its parent), git bundle x A..y + -- will omit including the x ref in the bundle at all. + -- + -- But we need to include all (changed) refs that the user + -- specified to push in the bundle. So, only include the sha + -- as a prerequisite when it will not prevent including another + -- changed ref in the bundle. 
+ checkprereq prereq ref = + ifM (anyM shadows $ M.elems $ M.delete ref changedrefs) + ( pure $ Git.Bundle.fullBundleSpec ref + , pure $ Git.Bundle.BundleSpec + { Git.Bundle.preRequisiteRef = Just prereq + , Git.Bundle.includeRef = ref + } + ) + where + shadows s + | s == prereq = pure True + | otherwise = inRepo $ Git.Ref.isAncestor s prereq + changedrefs = M.differenceWith + (\a b -> if a == b then Nothing else Just a) + newtrackingrefs oldtrackingrefs + + -- When the old tracking ref is not able to be used as a + -- prerequisite, this to find some other ref that was previously + -- pushed that can be used as a prerequisite instead. This can + -- optimise the bundle size a bit in edge cases. + -- + -- For example, a forced push of branch foo that resets it back + -- several commits can use a previously pushed bar as a prerequisite + -- if it's an ancestor of foo. + findotherprereq ref sha = + findotherprereq' ref sha (M.elems oldtrackingrefs) + findotherprereq' ref _ [] = pure (Git.Bundle.fullBundleSpec ref) + findotherprereq' ref sha (l:ls) + | l == sha = findotherprereq' ref sha ls + | otherwise = ifM (inRepo $ Git.Ref.isAncestor l ref) + ( checkprereq l ref + , findotherprereq' ref sha ls + ) -- When the push deletes all refs from the remote, drop the manifest -- and all bundles that were listed in it. The manifest is dropped @@ -506,7 +573,7 @@ downloadManifest rmt = ifM (Remote.checkPresent rmt mk) -- Note that if this is interrupted or loses access to the remote part -- way through, it may leave the remote without a manifest file. That will -- appear as if all refs have been deleted from the remote. --- FIXME It should be possible to remember when that happened, by writing +-- XXX It should be possible to remember when that happened, by writing -- state to a file before, and then the next time git-remote-annex is run, it -- could recover from the situation. 
uploadManifest :: Remote -> Manifest -> Annex () @@ -571,19 +638,23 @@ uploadGitBundle rmt k = unlessM (upload rmt k (AssociatedFile Nothing) stdRetry noNotification) $ giveup $ "Failed to upload " ++ serializeKey k --- Generates a git bundle that contains the specified refs, ingests --- it into the local objects directory, and uploads its key to the special --- remote. +-- Generates a git bundle, ingests it into the local objects directory, +-- and uploads its key to the special remote. -- --- If the key is present in the provided manifest, avoids uploading it. +-- If the key is already present in the provided manifest, avoids +-- uploading it. -- -- On failure, an exception is thrown, and nothing is added to the local -- objects directory. -generateAndUploadGitBundle :: Remote -> [Ref] -> Manifest -> Annex Key -generateAndUploadGitBundle rmt refs manifest = +generateAndUploadGitBundle + :: Remote + -> [Git.Bundle.BundleSpec] + -> Manifest + -> Annex Key +generateAndUploadGitBundle rmt bs manifest = withTmpFile "GITBUNDLE" $ \tmp tmph -> do liftIO $ hClose tmph - inRepo $ Git.Bundle.create tmp refs + inRepo $ Git.Bundle.create tmp bs bundlekey <- genGitBundleKey (Remote.uuid rmt) (toRawFilePath tmp) nullMeterUpdate unless (bundlekey `elem` (inManifest manifest)) $ do diff --git a/Git/Bundle.hs b/Git/Bundle.hs index 2d90f20a34..caa4d12ec9 100644 --- a/Git/Bundle.hs +++ b/Git/Bundle.hs @@ -5,6 +5,8 @@ - Licensed under the GNU AGPL version 3 or higher. -} +{-# LANGUAGE OverloadedStrings #-} + module Git.Bundle where import Common @@ -24,15 +26,43 @@ listHeads bundle repo = map gen . S8.lines <$> unbundle :: FilePath -> Repo -> IO () unbundle bundle = runQuiet [Param "bundle", Param "unbundle", File bundle] -create :: FilePath -> [Ref] -> Repo -> IO () -create bundle refs repo = pipeWrite +-- Specifies what to include in the bundle. 
+data BundleSpec = BundleSpec + { preRequisiteRef :: Maybe Ref + -- ^ Do not include this Ref, or any objects reachable from it + -- in the bundle. This should be an ancestor of the includeRef. + , includeRef :: Ref + -- ^ Include this Ref and objects reachable from it in the bundle, + -- unless filtered out by the preRequisiteRef of this BundleSpec + -- or any other one that is included in the bundle. + } + deriving (Show) + +-- Include the ref and all objects reachable from it in the bundle. +-- (Unless another BundleSpec is included that has a preRequisiteRef +-- that filters out the ref or other objects.) +fullBundleSpec :: Ref -> BundleSpec +fullBundleSpec r = BundleSpec + { preRequisiteRef = Nothing + , includeRef = r + } + +create :: FilePath -> [BundleSpec] -> Repo -> IO () +create bundle revs repo = pipeWrite [ Param "bundle" , Param "create" , Param "--quiet" , File bundle , Param "--stdin" - ] repo writerefs + ] repo writer where - writerefs h = do - mapM_ (S8.hPutStrLn h . fromRef') refs + writer h = do + forM_ revs $ \bs -> + case preRequisiteRef bs of + Nothing -> S8.hPutStrLn h $ + fromRef' (includeRef bs) + Just pr -> S8.hPutStrLn h $ + fromRef' pr + <> ".." <> + fromRef' (includeRef bs) hClose h From ef5e9aa0823a5fd6893afd6247a7a0d4fbebc07c Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Fri, 10 May 2024 13:55:46 -0400 Subject: [PATCH 29/53] git-remote-annex working A few bugfixes. Have not tested extensively, but a push followed by a clone worked. 
Sponsored-by: Nicholas Golder-Manning on Patreon --- CmdLine/GitRemoteAnnex.hs | 12 ++++++++---- 1 file changed, 8 insertions(+), 4 deletions(-) diff --git a/CmdLine/GitRemoteAnnex.hs b/CmdLine/GitRemoteAnnex.hs index 62f739cedc..25c7f6b9c0 100644 --- a/CmdLine/GitRemoteAnnex.hs +++ b/CmdLine/GitRemoteAnnex.hs @@ -32,6 +32,7 @@ import Git.Types import Logs.Difference import Annex.Init import Annex.Content +import Annex.Perms import Remote.List import Remote.List.Util import Utility.Tmp @@ -185,6 +186,7 @@ push :: State -> Remote -> [String] -> Annex ([String], State) push st rmt ls = do let (refspecs, ls') = collectRefSpecs ls (responses, trackingrefs) <- calc refspecs ([], trackingRefs st) + updateTrackingRefs rmt trackingrefs (ok, st') <- if M.null trackingrefs then pushEmpty st rmt else if any forcedPush refspecs @@ -195,10 +197,10 @@ push st rmt ls = do if ok then do sendresponses responses - -- Update the tracking refs to reflect the push. - updateTrackingRefs rmt trackingrefs return (ls', st' { trackingRefs = trackingrefs }) else do + -- Restore the old tracking refs + updateTrackingRefs rmt (trackingRefs st) sendresponses $ map (const "error push failed") refspecs return (ls', st') @@ -271,7 +273,7 @@ fullPush st rmt refs = guardPush st $ do guardPush :: State -> Annex (Bool, State) -> Annex (Bool, State) guardPush st a = catchNonAsync a $ \ex -> do liftIO $ hPutStrLn stderr $ - "Push faild (" ++ show ex ++ ")" + "Push failed (" ++ show ex ++ ")" return (False, st { manifestCache = Nothing }) -- Incremental push of only the refs that changed. @@ -591,7 +593,9 @@ uploadManifest rmt manifest = -- is configured to require only cryptographically secure -- keys, which it is not. 
objfile <- calcRepo (gitAnnexLocation mk) - unlessM (isJust <$> linkOrCopy mk (toRawFilePath tmp) objfile Nothing) + res <- modifyContentDir objfile $ + linkOrCopy mk (toRawFilePath tmp) objfile Nothing + unless (isJust res) uploadfailed -- noRetry because manifest content is not stable ok <- upload rmt mk (AssociatedFile Nothing) From 1250bb26a0ea9180c3e697d4cfcf82745bf625a4 Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Fri, 10 May 2024 13:59:35 -0400 Subject: [PATCH 30/53] reject annex:: url that omits a uuid Such as annex::?type=foo&... I accidentally left out the uuid when creating one, and the result is it appears to clone an empty repository. So let's guard against that mistake. --- CmdLine/GitRemoteAnnex.hs | 13 +++++++------ 1 file changed, 7 insertions(+), 6 deletions(-) diff --git a/CmdLine/GitRemoteAnnex.hs b/CmdLine/GitRemoteAnnex.hs index 25c7f6b9c0..38fce37e9c 100644 --- a/CmdLine/GitRemoteAnnex.hs +++ b/CmdLine/GitRemoteAnnex.hs @@ -283,7 +283,6 @@ guardPush st a = catchNonAsync a $ \ex -> do incrementalPush :: State -> Remote -> M.Map Ref Sha -> M.Map Ref Sha -> Annex (Bool, State) incrementalPush st rmt oldtrackingrefs newtrackingrefs = guardPush st $ do bs <- calc [] (M.toList newtrackingrefs) - liftIO $ hPutStrLn stderr (show bs) oldmanifest <- maybe (downloadManifest rmt) pure (manifestCache st) bundlekey <- generateAndUploadGitBundle rmt bs oldmanifest uploadManifest rmt (Manifest [bundlekey]) @@ -430,11 +429,13 @@ parseSpecialRemoteUrl url remotename = case parseURI url of Just u -> case uriScheme u of "annex:" -> case uriPath u of "" -> Left "annex: URL did not include a UUID" - (':':p) -> Right $ SpecialRemoteConfig - { specialRemoteUUID = toUUID p - , specialRemoteConfig = parsequery u - , specialRemoteName = remotename - , specialRemoteUrl = url + (':':p) + | null p -> Left "annex: URL did not include a UUID" + | otherwise -> Right $ SpecialRemoteConfig + { specialRemoteUUID = toUUID p + , specialRemoteConfig = parsequery u + ,
specialRemoteName = remotename + , specialRemoteUrl = url } _ -> Left "annex: URL malformed" _ -> Left "Not an annex: URL" From 4d0543932ea101053937b0f9b152d5b20e9381de Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Fri, 10 May 2024 14:40:38 -0400 Subject: [PATCH 31/53] pushEmpty: upload empty manifest --- CmdLine/GitRemoteAnnex.hs | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/CmdLine/GitRemoteAnnex.hs b/CmdLine/GitRemoteAnnex.hs index 38fce37e9c..b597b7ae1c 100644 --- a/CmdLine/GitRemoteAnnex.hs +++ b/CmdLine/GitRemoteAnnex.hs @@ -347,14 +347,15 @@ incrementalPush st rmt oldtrackingrefs newtrackingrefs = guardPush st $ do , findotherprereq' ref sha ls ) --- When the push deletes all refs from the remote, drop the manifest --- and all bundles that were listed in it. The manifest is dropped --- first so if this is interrupted, only unused bundles will remain in the --- remote, rather than leaving the remote with a manifest that refers to --- missing bundles. +-- When the push deletes all refs from the remote, upload an empty +-- manifest and then drop all bundles that were listed in it. +-- The manifest is emptired first so if this is interrupted, only +-- unused bundles will remain in the remote, rather than leaving the +-- remote with a manifest that refers to missing bundles. pushEmpty :: State -> Remote -> Annex (Bool, State) pushEmpty st rmt = do manifest <- maybe (downloadManifest rmt) pure (manifestCache st) + uploadManifest rmt (Manifest []) ok <- allM (dropKey rmt) (genManifestKey (Remote.uuid rmt) : inManifest manifest) return (ok, st { manifestCache = Nothing }) From dfb09ad1ad99898591b50debebb4fb3d8215698b Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Fri, 10 May 2024 14:41:18 -0400 Subject: [PATCH 32/53] preparing to merge git-remote-annex Update its todo with remaining items. Add changelog entry. 
Simplified internals document to no longer be notes to myself, but target users who want to understand how the data is stored and might want to extract these repos manually. Sponsored-by: Kevin Mueller on Patreon --- CHANGELOG | 3 + doc/future_proofing.mdwn | 4 +- doc/internals.mdwn | 7 ++ doc/internals/git-remote-annex.mdwn | 55 ++++++--------- doc/todo/git-remote-annex.mdwn | 101 +++++++++++++++++++++++----- 5 files changed, 116 insertions(+), 54 deletions(-) diff --git a/CHANGELOG b/CHANGELOG index 754da07f08..224c735dbf 100644 --- a/CHANGELOG +++ b/CHANGELOG @@ -1,5 +1,8 @@ git-annex (10.20240431) UNRELEASED; urgency=medium + * git-remote-annex: New program which allows pushing a git repo to a + git-annex special remote, and cloning from a special remote. + (Based on Michael Hanke's git-remote-datalad-annex.) * Typo fixes. Thanks, Yaroslav Halchenko diff --git a/doc/future_proofing.mdwn b/doc/future_proofing.mdwn index 369aa7d890..84883d060f 100644 --- a/doc/future_proofing.mdwn +++ b/doc/future_proofing.mdwn @@ -49,5 +49,5 @@ problem: [[fairly simple shell script using standard tools|tips/Decrypting_files_in_special_remotes_without_git-annex]] (gpg and openssl) can decrypt files stored on such a remote, as long as you have access to the encryption keys (which - are stored in the git-annex branch of the repository, sometimes - encrypted with your gpg key). + for some types of encryption are stored in the git-annex branch of + the repository, sometimes encrypted with your gpg key). diff --git a/doc/internals.mdwn b/doc/internals.mdwn index 09225312fc..f0ee6f66c1 100644 --- a/doc/internals.mdwn +++ b/doc/internals.mdwn @@ -2,6 +2,8 @@ In the world of git, we're not scared about internal implementation details, and sometimes we like to dive in and tweak things by hand. Here's some documentation to that end. +[[!toc ]] + ## The .git/ directory ### `.git/annex/objects/aa/bb/*/*` @@ -364,3 +366,8 @@ of actual annexed files. 
These trees are recorded in history of the git-annex branch, but the head of the git-annex branch will never contain them. + +## Other internals documentation + +* [[git-remote-annex]] documents how git repositories are stored + on special remotes when using git with "annex::" urls. diff --git a/doc/internals/git-remote-annex.mdwn b/doc/internals/git-remote-annex.mdwn index 5c9203931d..f85965810b 100644 --- a/doc/internals/git-remote-annex.mdwn +++ b/doc/internals/git-remote-annex.mdwn @@ -1,3 +1,6 @@ +The [[git-remote-annex|/git-remote-annex]] command allows pushing a git +repository to a special remote, and later cloning from it. + This adds two new key types to git-annex, GITMANIFEST and a GITBUNDLE. GITMANIFEST--$UUID is the manifest for a git repository stored in the @@ -11,44 +14,26 @@ An ordered list of bundle keys, one per line. (Lines end with unix `"\n"`, not `"\r\n"`.) -# fetching - -1. download GITMANIFEST for the uuid of the special remote -2. download each listed GITBUNDLE key that we don't have -3. `git fetch` from each new bundle in order - (note that later bundles can update refs from the versions in previous - bundles) - -# pushing (incrementally) - -This is how pushes are usually done. - -1. create git bundle of all refs that are being pushed and have changed, - and objects since the previously pushed refs -2. hash to calculate GITBUNDLE key -3. upload GITBUNDLE key -4. download current manifest -5. append GITBUNDLE key to manifest - -# pushing (full) - -Note that this can be used to replace incrementals with a single bundle for -performance. It is also the only way to handle a push that deletes a -previously pushed ref. - -1. create git bundle containing all refs stored in the repository, and all - objects -2. hash to calculate GITBUNDLE key name -3. upload GITBUNDLE key -4. download old manifest -4. upload new manifest listing only the single new GITBUNDLE -5. 
delete all other GITBUNDLEs that were listed in the old manifest - # multiple GITMANIFEST files Usually there will only be one per special remote, but it's possible for multiple special remotes to point to the same object storage, and if so multiple GITMANIFEST objects can be stored. -It follows that the UUID of the special remote has to be included in the -annex:// uri, to know which GITMANIFEST to use when cloning from it. +This is why the UUID of the special remote is included in the GITMANIFEST +key, and in the annex:: uri. + +# manually cloning from these files + +If you are unable to use git-annex and need to clone a git repository +stored in such a special remote, this procedure will work: + +* Find and download the GITMANIFEST +* Download each listed GITBUNDLE +* `git fetch` from each new bundle in order. + (Note that later bundles can update refs from the versions in previous + bundles.) + +When the special remote is encryptee, the GITMANIFEST and GITBUNDLE will +also be encrypted. To decrypt those manually, see this +[[fairly simple shell script using standard tools|tips/Decrypting_files_in_special_remotes_without_git-annex]]. diff --git a/doc/todo/git-remote-annex.mdwn b/doc/todo/git-remote-annex.mdwn index 05ea923975..da5c5dbab6 100644 --- a/doc/todo/git-remote-annex.mdwn +++ b/doc/todo/git-remote-annex.mdwn @@ -4,21 +4,88 @@ repository to any git-annex special remote. This is a redesign and reimplementation of git-remote-datalad-annex. It will be a safer implementation, will support incremental pushes, and will be available to users who don't use datalad. - -Work is in the `git-remote-annex` branch, currently we have a design for -the core data files and operations. - - -Also, that branch has a proof of concept implementation in a shell script. -Though it doesn't yet use special remotes at all, it is able to do -incremental pushes to git bundles with a manifest. 
- -I still need to do some design work around using the git-annex branch to -detect concurrent push situations where changes to the manifest get lost, -and re-add those changes to it later. - -Also, it's not clear what will happen when two people make conflicting pushes -to a ref, the goal would be to replicate git push to a regular git remote, -but that may not be entirely possible. This will need to be investigated -further. --[[Joey]] + +--- + +This is implememented and working. Remaining todo list for it: + +* Need to test all types of pushes, barely tested at all. + +* Support exporttree=yes remotes. + +* Support max-bundles config + +* Need to mention git-remote-annex in special remotes page, and perhaps + write a tip for it. Also link to it from git-annex man page. + +* initremote could optionally configure the url to a special remote + to an annex:: url. This would make it easier to use git-remote-annex, + since the user would not need to set up the url themselves. + (Also it would then avoid setting `skipFetchAll = true`) + +* Prevent using with remotes that are encrypted using a cipher + stored in the repo. Chicken and egg problem cloning from + such a remote. Maybe allow advanced users to force it? + +* When the remote has no manifest, a pull from it should fail, + while a push should succeed. Otherwise, it can be confusing + to clone from a wrong url, since it fails to download + a manifest and so appears as if the remote is empty. + +* See XXX in uploadManifest about recovering from a situation + where the remote is left with a deleted manifest when a push + is interrupted part way through. This should be recoverable + by caching the manifest locally and re-uploading it when + the remote has no manifest. + +* datalad-annex supports cloning from the web special remote, + using an url that contains the result of pushing to eg, a directory + special remote. 
+ `datalad-annex::https://example.com?type=web&url={noquery}` + Supporting something like this would be good. + +* It would be nice if git-annex could generate an annex:: url + for a special remote and show it to the user, eg when + they have set the shorthand "annex::" url, so they know the full url. + `git-annex info $remote` could also display it. + Currently, the user has to remember how the special remote was + configured and replicate it all in the url. + + There are some difficulties to doing this, including that + RemoteConfig can have hidden fields that should be omitted, + and that some, like type=directory, remove some configs + (eg directory=) in their setup action. + +* Improve behavior in push races. A race can overwrite a change + to the MANIFEST and lose work that was pushed from the other repo. + From the user's perspective, that situation is the same as if one repo + pushed new work, then the other repo did a git push --force, overwriting + the first repo's push. In the first repo, another push will then fail as + a non fast-forward, and the user can recover as usual. This is probably + okish. + + But.. a MANIFEST overwrite will leave bundle files in the remote that + are not listed in the MANIFEST. It seems likely that git-annex could + detect that after the fact and clean it up. Eg, if it caches + the last MANIFEST it uploaded, next time it downloads the MANIFEST + it can check if there are bundle files in the old one that are not + in the new one. If so, it can drop those bundle files from the remote. + +* A push race can also appear to the user as if they pushed a ref, but then + it got deleted from the remote. This happens when two pushes are + pushing different ref names. This might be harder for the user to + notice; git fetch does not indicate that a remote ref got deleted. + They would have to use git fetch --prune to notice the deletion. + Once the user does notice, they can re-push their ref to recover. + Can this be improved? 
+ +* The race condition described in + [[!commit 797f27ab0517e0021363791ff269300f2ba095a5]] + where before git-annex init is run in a repo, + using git-remote-annex and at the same time git-annex init can lose + changes that the latter command (and ones after it) write to the + git-annex branch. + + This should be fixable by making git-remote-annex not write to the + git-annex branch, but to eg, a temporary journal directory. From 1f62bc861a057ec52aff0b15a07e01485899d99e Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Fri, 10 May 2024 15:09:56 -0400 Subject: [PATCH 33/53] delete shell prototype --- git-remote-annex.sh | 212 -------------------------------------------- 1 file changed, 212 deletions(-) delete mode 100755 git-remote-annex.sh diff --git a/git-remote-annex.sh b/git-remote-annex.sh deleted file mode 100755 index 2e818ed420..0000000000 --- a/git-remote-annex.sh +++ /dev/null @@ -1,212 +0,0 @@ -#!/bin/sh -URL="$2" - -TOPDIR="$(echo "$URL" | sed 's/^annex:\/\///')" - -set -x - -rm -f $GIT_DIR/push-response - -# Unfortunately, git bundle omits prerequisites that are omitted once, -# even if they are used by a later ref. -# For example, where x is a ref that points at A, and y is a ref -# that points at B (which has A as its parent), git bundle x A..y -# will omit inclding the x ref in the bundle at all. -check_prereq () { - # So, if a sha is one of the other refs that will be included in the - # bundle, it cannot be treated as a prerequisite. - if git show-ref $push_refs | grep -v " $2$" | awk '{print $1}' | grep -q "$1"; then - echo "$2" - else - # And, if one of the other refs that will be included in the bundle - # is an ancestor of the sha, it cannot be treated as a prerequisite. 
- if [ -n "$(for x in $(git show-ref $push_refs | grep -v " $2$" | awk '{print $1}'); do git log --oneline -n1 $x..$1; done)" ]; then - echo "$2" - else - echo "$1..$2" - fi - fi -} - -addnewbundle () { - sha1=$(sha1sum $TOPDIR/new.bundle | awk '{print $1}') - mv $TOPDIR/new.bundle "$TOPDIR/$sha1.bundle" - echo "$sha1.bundle" >> $TOPDIR/MANIFEST -} - -while read foo; do - case "$foo" in - capabilities) - echo fetch - echo push - echo - ;; - list*) - if [ -e "$TOPDIR/MANIFEST" ]; then - for f in $(cat $TOPDIR/MANIFEST); do - git bundle list-heads $TOPDIR/$f >> $GIT_DIR/listed-refs-new - if [ "$foo" = "list for-push" ]; then - # Get all the objects from the bundle. This is done here so that - # refs/namespaces/mine can be updated with what was listed, - # and so what when a full repush needs to be done, everything - # gets pushed. - git bundle unbundle "$TOPDIR/$f" >/dev/null 2>&1 - fi - done - perl -e 'while (<>) { if (m/(.*) (.*)/) { $seen{$2}=$1 } }; foreach my $k (keys %seen) { print "$seen{$k} $k\n" }' < $GIT_DIR/listed-refs-new > $GIT_DIR/listed-refs - rm -f $GIT_DIR/listed-refs-new - - # when listing for a push, update refs/namespaces/mine to match what was - # listed. This is necessary in order for a full repush to know what to push. 
- if [ "$foo" = "list for-push" ]; then - for r in $(git for-each-ref refs/namespaces/mine/ | awk '{print $3}'); do - git update-ref -d "$r" - done - IFS=" - " - for x in $(cat $GIT_DIR/listed-refs); do - sha="$(echo "$x" | cut -d ' ' -f 1)" - r="$(echo "$x" | cut -d ' ' -f 2)" - git update-ref "$r" "$sha" - done - unset IFS - fi - - # respond to git with a list of refs - sed 's/refs\/namespaces\/mine\///' $GIT_DIR/listed-refs - # $GIT_DIR/listed-refs is later checked in push - else - rm -f $GIT_DIR/listed-refs - touch $GIT_DIR/listed-refs - fi - echo - ;; - fetch*) - dofetch=1 - ;; - push*) - set -- $foo - x="$2" - # src ref is prefixed with a + in a forced push - forcedpush="" - if echo "$x" | cut -d : -f 1 | egrep -q '^\+'; then - forcedpush=1 - fi - srcref="$(echo "$x" | cut -d : -f 1 | sed 's/^\+//')" - dstref="$(echo "$x" | cut -d : -f 2)" - # Need to create a bundle containing $dstref, but - # don't want to overwrite that ref in the local - # repo. Unfortunately, git bundle does not support - # GIT_NAMESPACE, so it's not possible to do that - # without making a clone of the whole git repo. - # Instead, just create a ref under the namespace - # refs/namespaces/mine/ that will be put in the - # bundle. - mydstref=refs/namespaces/mine/"$dstref" - if [ -z "$srcref" ]; then - # To delete a ref, have to do a repush of - # all remaining refs. - REPUSH=1 - git update-ref -d "$mydstref" - touch $GIT_DIR/push-response - echo "ok $dstref" >> $GIT_DIR/push-response - else - if [ ! 
"$forcedpush" ]; then - # check if the push would overwrite - # work in the ref currently stored in the - # remote, if so refuse to do it - prevsha=$(grep " $mydstref$" $GIT_DIR/listed-refs | awk '{print $1}') - newsha=$(git rev-parse "$srcref") - if [ -n "$prevsha" ] && [ "$prevsha" != "$newsha" ] && [ -z "$(git log --oneline $prevsha..$newsha 2>/dev/null)" ]; then - touch $GIT_DIR/push-response - echo "error $dstref non-fast-forward" >> $GIT_DIR/push-response - else - touch $GIT_DIR/push-response - echo "ok $dstref" >> $GIT_DIR/push-response - git update-ref "$mydstref" "$srcref" - push_refs="$mydstref $push_refs" - fi - else - git update-ref "$mydstref" "$srcref" - touch $GIT_DIR/push-response - echo "ok $dstref" >> $GIT_DIR/push-response - push_refs="$mydstref $push_refs" - fi - fi - dopush=1 - ;; - # docs say a blank line ends communication, but that's not - # accurate, actually a blank line comes after a series of - # fetch or push commands, and also according to the docs, - # another series of commands could follow - "") - if [ "$dofetch" ]; then - if [ -e "$TOPDIR/MANIFEST" ]; then - for f in $(cat $TOPDIR/MANIFEST); do - git bundle unbundle "$TOPDIR/$f" >/dev/null 2>&1 - done - fi - echo - dofetch="" - fi - if [ "$dopush" ]; then - if [ -z "$(git for-each-ref refs/namespaces/mine/)" ]; then - # deleted all refs - if [ -e "$TOPDIR/MANIFEST" ]; then - for f in $(cat $TOPDIR/MANIFEST); do - rm "$TOPDIR/$f" - done - rm $TOPDIR/MANIFEST - touch $TOPDIR/MANIFEST - fi - else - # set REPUSH=1 to do a full push - # rather than incremental - if [ "$REPUSH" ]; then - rm $TOPDIR/MANIFEST - rm $TOPDIR/*.bundle - git for-each-ref refs/namespaces/mine/ | awk '{print $3}' | \ - git bundle create --quiet $TOPDIR/new.bundle --stdin - addnewbundle - else - # incremental bundle - for r in $push_refs; do - newsha=$(git show-ref "$r" | awk '{print $1}') - oldsha=$(grep " $r$" $GIT_DIR/listed-refs | awk '{print $1}') - if [ -n "$oldsha" ]; then - # include changes from $oldsha 
to $r when there are some - if [ -n "$(git log --oneline $oldsha..$r)" ]; then - check_prereq "$oldsha" "$r" - else - if [ "$oldsha" = "$newsha" ]; then - # $r is unchanged from last push, so no need to push it - : - else - # $oldsha is not a parent of $r, so - # include $r and all its parents - # XXX (this could be improved by checking other refs that were pushed - # and only including changes from them) - echo "$r" - fi - fi - else - # no old version was pushed so include $r and all its parents - # XXX (this could be improved by checking other refs that were pushed - # and only including changes from them) - echo "$r" - fi - done > $GIT_DIR/tobundle - if [ -s "$GIT_DIR/tobundle" ]; then - git bundle create --quiet $TOPDIR/new.bundle --stdin < "$GIT_DIR/tobundle" - addnewbundle - fi - fi - fi - cat $GIT_DIR/push-response - rm -f $GIT_DIR/push-response - echo - dopush="" - fi - ;; - esac -done From 97b309b56e30a0cbc9e503cbf5fd3bc56c173e94 Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Mon, 13 May 2024 09:03:43 -0400 Subject: [PATCH 34/53] extend manifest with keys to be deleted This will eventually be used to recover from an interrupted fullPush and drop the old bundle keys it was unable to delete. Sponsored-by: Luke T. 
Shumaker on Patreon --- CmdLine/GitRemoteAnnex.hs | 31 ++++++++++++++++++++++------- doc/internals/git-remote-annex.mdwn | 4 ++++ doc/todo/git-remote-annex.mdwn | 3 +++ 3 files changed, 31 insertions(+), 7 deletions(-) diff --git a/CmdLine/GitRemoteAnnex.hs b/CmdLine/GitRemoteAnnex.hs index b597b7ae1c..47800d29a6 100644 --- a/CmdLine/GitRemoteAnnex.hs +++ b/CmdLine/GitRemoteAnnex.hs @@ -40,6 +40,7 @@ import Utility.Env import Utility.Metered import Network.URI +import Data.Either import qualified Data.ByteString as B import qualified Data.ByteString.Char8 as B8 import qualified Data.Map.Strict as M @@ -265,7 +266,7 @@ fullPush st rmt refs = guardPush st $ do oldmanifest <- maybe (downloadManifest rmt) pure (manifestCache st) let bs = map Git.Bundle.fullBundleSpec refs bundlekey <- generateAndUploadGitBundle rmt bs oldmanifest - uploadManifest rmt (Manifest [bundlekey]) + uploadManifest rmt (Manifest [bundlekey] []) ok <- allM (dropKey rmt) $ filter (/= bundlekey) (inManifest oldmanifest) return (ok, st { manifestCache = Nothing }) @@ -285,7 +286,7 @@ incrementalPush st rmt oldtrackingrefs newtrackingrefs = guardPush st $ do bs <- calc [] (M.toList newtrackingrefs) oldmanifest <- maybe (downloadManifest rmt) pure (manifestCache st) bundlekey <- generateAndUploadGitBundle rmt bs oldmanifest - uploadManifest rmt (Manifest [bundlekey]) + uploadManifest rmt (Manifest [bundlekey] []) return (True, st { manifestCache = Nothing }) where calc c [] = return (reverse c) @@ -355,7 +356,7 @@ incrementalPush st rmt oldtrackingrefs newtrackingrefs = guardPush st $ do pushEmpty :: State -> Remote -> Annex (Bool, State) pushEmpty st rmt = do manifest <- maybe (downloadManifest rmt) pure (manifestCache st) - uploadManifest rmt (Manifest []) + uploadManifest rmt (Manifest [] []) ok <- allM (dropKey rmt) (genManifestKey (Remote.uuid rmt) : inManifest manifest) return (ok, st { manifestCache = Nothing }) @@ -533,7 +534,14 @@ checkSpecialRemoteProblems rmt | otherwise = Nothing -- The 
manifest contains an ordered list of git bundle keys. -newtype Manifest = Manifest { inManifest :: [Key] } +-- +-- There is a second list of git bundle keys that are no longer +-- used and should be deleted. +data Manifest = + Manifest + { inManifest :: [Key] + , outManifest :: [Key] + } -- Downloads the Manifest, or if it does not exist, returns an empty -- Manifest. @@ -551,9 +559,12 @@ downloadManifest rmt = ifM (Remote.checkPresent rmt mk) _ <- Remote.retrieveKeyFile rmt mk (AssociatedFile Nothing) tmp nullMeterUpdate Remote.NoVerify - ks <- map deserializeKey' . B8.lines <$> liftIO (B.readFile tmp) - Manifest <$> checkvalid [] ks - , return (Manifest []) + (outks, inks) <- partitionEithers . map parseline . B8.lines + <$> liftIO (B.readFile tmp) + Manifest + <$> checkvalid [] inks + <*> checkvalid [] (filter (`notElem` inks) outks) + , return (Manifest [] []) ) where mk = genManifestKey (Remote.uuid rmt) @@ -565,6 +576,12 @@ downloadManifest rmt = ifM (Remote.checkPresent rmt mk) checkvalid _ (Nothing:_) = giveup $ "Error parsing manifest " ++ serializeKey mk + parseline l + | "-" `B.isPrefixOf` l = + Left $ deserializeKey' $ B.drop 1 l + | otherwise = + Right $ deserializeKey' l + -- Uploads the Manifest to the remote. -- -- Throws errors if the remote cannot be accessed or the upload fails. diff --git a/doc/internals/git-remote-annex.mdwn b/doc/internals/git-remote-annex.mdwn index f85965810b..efaae84ab4 100644 --- a/doc/internals/git-remote-annex.mdwn +++ b/doc/internals/git-remote-annex.mdwn @@ -12,6 +12,10 @@ GITBUNDLE--$UUID-sha256 is a git bundle. An ordered list of bundle keys, one per line. +Additionally, there may be bundle keys that are prefixed with "-". +These keys are not part of the current content of the git remote +and are in the process of being deleted. + (Lines end with unix `"\n"`, not `"\r\n"`.) 
# multiple GITMANIFEST files diff --git a/doc/todo/git-remote-annex.mdwn b/doc/todo/git-remote-annex.mdwn index da5c5dbab6..53c71c3746 100644 --- a/doc/todo/git-remote-annex.mdwn +++ b/doc/todo/git-remote-annex.mdwn @@ -33,6 +33,9 @@ This is implememented and working. Remaining todo list for it: to clone from a wrong url, since it fails to download a manifest and so appears as if the remote is empty. +* Improve recovery from interrupted push by using outManifest to clean up + after it. (Requires populating outManifest.) + * See XXX in uploadManifest about recovering from a situation where the remote is left with a deleted manifest when a push is interrupted part way through. This should be recoverable From 424afe46d7541fbfcbf0322887c4a70339b75309 Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Mon, 13 May 2024 09:33:15 -0400 Subject: [PATCH 35/53] fix incremental push to preserve existing bundle keys in manifest Also broke Manifest out to its own type with a smart constructor. Sponsored-by: mycroft on Patreon --- CmdLine/GitRemoteAnnex.hs | 23 +++++++------------- Types/GitRemoteAnnex.hs | 44 +++++++++++++++++++++++++++++++++++++++ git-annex.cabal | 1 + 3 files changed, 52 insertions(+), 16 deletions(-) create mode 100644 Types/GitRemoteAnnex.hs diff --git a/CmdLine/GitRemoteAnnex.hs b/CmdLine/GitRemoteAnnex.hs index 47800d29a6..b99b7848f0 100644 --- a/CmdLine/GitRemoteAnnex.hs +++ b/CmdLine/GitRemoteAnnex.hs @@ -10,6 +10,7 @@ module CmdLine.GitRemoteAnnex where import Annex.Common +import Types.GitRemoteAnnex import qualified Annex import qualified Remote import qualified Git.CurrentRepo @@ -266,7 +267,7 @@ fullPush st rmt refs = guardPush st $ do oldmanifest <- maybe (downloadManifest rmt) pure (manifestCache st) let bs = map Git.Bundle.fullBundleSpec refs bundlekey <- generateAndUploadGitBundle rmt bs oldmanifest - uploadManifest rmt (Manifest [bundlekey] []) + uploadManifest rmt (mkManifest [bundlekey] []) ok <- allM (dropKey rmt) $ filter (/= bundlekey) 
(inManifest oldmanifest) return (ok, st { manifestCache = Nothing }) @@ -286,7 +287,7 @@ incrementalPush st rmt oldtrackingrefs newtrackingrefs = guardPush st $ do bs <- calc [] (M.toList newtrackingrefs) oldmanifest <- maybe (downloadManifest rmt) pure (manifestCache st) bundlekey <- generateAndUploadGitBundle rmt bs oldmanifest - uploadManifest rmt (Manifest [bundlekey] []) + uploadManifest rmt (oldmanifest <> mkManifest [bundlekey] []) return (True, st { manifestCache = Nothing }) where calc c [] = return (reverse c) @@ -356,7 +357,7 @@ incrementalPush st rmt oldtrackingrefs newtrackingrefs = guardPush st $ do pushEmpty :: State -> Remote -> Annex (Bool, State) pushEmpty st rmt = do manifest <- maybe (downloadManifest rmt) pure (manifestCache st) - uploadManifest rmt (Manifest [] []) + uploadManifest rmt mempty ok <- allM (dropKey rmt) (genManifestKey (Remote.uuid rmt) : inManifest manifest) return (ok, st { manifestCache = Nothing }) @@ -533,16 +534,6 @@ checkSpecialRemoteProblems rmt Just "Cannot use this thirdparty-populated special remote as a git remote" | otherwise = Nothing --- The manifest contains an ordered list of git bundle keys. --- --- There is a second list of git bundle keys that are no longer --- used and should be deleted. -data Manifest = - Manifest - { inManifest :: [Key] - , outManifest :: [Key] - } - -- Downloads the Manifest, or if it does not exist, returns an empty -- Manifest. -- @@ -561,10 +552,10 @@ downloadManifest rmt = ifM (Remote.checkPresent rmt mk) nullMeterUpdate Remote.NoVerify (outks, inks) <- partitionEithers . map parseline . 
B8.lines <$> liftIO (B.readFile tmp) - Manifest + mkManifest <$> checkvalid [] inks - <*> checkvalid [] (filter (`notElem` inks) outks) - , return (Manifest [] []) + <*> checkvalid [] outks + , return mempty ) where mk = genManifestKey (Remote.uuid rmt) diff --git a/Types/GitRemoteAnnex.hs b/Types/GitRemoteAnnex.hs new file mode 100644 index 0000000000..8dae944e59 --- /dev/null +++ b/Types/GitRemoteAnnex.hs @@ -0,0 +1,44 @@ +{- git-remote-annex types + - + - Copyright 2024 Joey Hess + - + - Licensed under the GNU AGPL version 3 or higher. + -} + +module Types.GitRemoteAnnex + ( Manifest + , mkManifest + , inManifest + , outManifest + ) where + +import Types.Key + +import qualified Data.Semigroup as Sem + +-- The manifest contains an ordered list of git bundle keys. +-- +-- There is a second list of git bundle keys that are no longer +-- used and should be deleted. This list should never contain keys +-- that are in the first list. +data Manifest = + Manifest + { inManifest :: [Key] + , outManifest :: [Key] + } + deriving (Show) + +-- Smart constructor for Manifest. Preserves outManifest invariant. 
+mkManifest + :: [Key] -- ^ inManifest + -> [Key] -- ^ outManifest + -> Manifest +mkManifest inks outks = Manifest inks (filter (`notElem` inks) outks) + +instance Monoid Manifest where + mempty = Manifest [] [] + +instance Sem.Semigroup Manifest where + a <> b = mkManifest + (inManifest a <> inManifest b) + (outManifest a <> outManifest b) diff --git a/git-annex.cabal b/git-annex.cabal index ef9875098d..5540828726 100644 --- a/git-annex.cabal +++ b/git-annex.cabal @@ -942,6 +942,7 @@ Executable git-annex Types.Export Types.FileMatcher Types.GitConfig + Types.GitRemoteAnnex Types.Group Types.Import Types.IndexFiles From 3f848564ac6ce0eb320cf2fa4d8bfab76ece7630 Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Mon, 13 May 2024 09:47:21 -0400 Subject: [PATCH 36/53] refuse to fetch from a remote that has no manifest Otherwise, it can be confusing to clone from a wrong url, since it fails to download a manifest and so appears as if the remote exists but is empty. Sponsored-by: Jack Hill on Patreon --- CmdLine/GitRemoteAnnex.hs | 47 ++++++++++++++++++++++------------ doc/todo/git-remote-annex.mdwn | 5 ---- 2 files changed, 31 insertions(+), 21 deletions(-) diff --git a/CmdLine/GitRemoteAnnex.hs b/CmdLine/GitRemoteAnnex.hs index b99b7848f0..6d495435f7 100644 --- a/CmdLine/GitRemoteAnnex.hs +++ b/CmdLine/GitRemoteAnnex.hs @@ -111,7 +111,9 @@ capabilities = do list :: State -> Remote -> Bool -> Annex State list st rmt forpush = do - manifest <- downloadManifest rmt + manifest <- if forpush + then downloadManifestWhenPresent rmt + else downloadManifestOrFail rmt l <- forM (inManifest manifest) $ \k -> do b <- downloadGitBundle rmt k heads <- inRepo $ Git.Bundle.listHeads b @@ -168,7 +170,7 @@ fetch st rmt [] = do fetch' :: State -> Remote -> Annex () fetch' st rmt = do - manifest <- maybe (downloadManifest rmt) pure (manifestCache st) + manifest <- maybe (downloadManifestOrFail rmt) pure (manifestCache st) forM_ (inManifest manifest) $ \k -> downloadGitBundle rmt k >>= inRepo 
. Git.Bundle.unbundle -- Newline indicates end of fetch. @@ -264,7 +266,8 @@ push st rmt ls = do -- from it. fullPush :: State -> Remote -> [Ref] -> Annex (Bool, State) fullPush st rmt refs = guardPush st $ do - oldmanifest <- maybe (downloadManifest rmt) pure (manifestCache st) + oldmanifest <- maybe (downloadManifestWhenPresent rmt) pure + (manifestCache st) let bs = map Git.Bundle.fullBundleSpec refs bundlekey <- generateAndUploadGitBundle rmt bs oldmanifest uploadManifest rmt (mkManifest [bundlekey] []) @@ -285,7 +288,7 @@ guardPush st a = catchNonAsync a $ \ex -> do incrementalPush :: State -> Remote -> M.Map Ref Sha -> M.Map Ref Sha -> Annex (Bool, State) incrementalPush st rmt oldtrackingrefs newtrackingrefs = guardPush st $ do bs <- calc [] (M.toList newtrackingrefs) - oldmanifest <- maybe (downloadManifest rmt) pure (manifestCache st) + oldmanifest <- maybe (downloadManifestWhenPresent rmt) pure (manifestCache st) bundlekey <- generateAndUploadGitBundle rmt bs oldmanifest uploadManifest rmt (oldmanifest <> mkManifest [bundlekey] []) return (True, st { manifestCache = Nothing }) @@ -350,13 +353,14 @@ incrementalPush st rmt oldtrackingrefs newtrackingrefs = guardPush st $ do ) -- When the push deletes all refs from the remote, upload an empty --- manifest and then drop all bundles that were listed in it. --- The manifest is emptired first so if this is interrupted, only +-- manifest and then drop all bundles that were listed in the manifest. +-- The manifest is emptied first so if this is interrupted, only -- unused bundles will remain in the remote, rather than leaving the -- remote with a manifest that refers to missing bundles. 
pushEmpty :: State -> Remote -> Annex (Bool, State) pushEmpty st rmt = do - manifest <- maybe (downloadManifest rmt) pure (manifestCache st) + manifest <- maybe (downloadManifestWhenPresent rmt) pure + (manifestCache st) uploadManifest rmt mempty ok <- allM (dropKey rmt) (genManifestKey (Remote.uuid rmt) : inManifest manifest) @@ -534,17 +538,27 @@ checkSpecialRemoteProblems rmt Just "Cannot use this thirdparty-populated special remote as a git remote" | otherwise = Nothing --- Downloads the Manifest, or if it does not exist, returns an empty --- Manifest. +-- Downloads the Manifest when present in the remote. When not present, +-- returns an empty Manifest. +downloadManifestWhenPresent :: Remote -> Annex Manifest +downloadManifestWhenPresent rmt = fromMaybe mempty <$> downloadManifest rmt + +-- Downloads the Manifest, or fails if the remote does not contain it. +downloadManifestOrFail :: Remote -> Annex Manifest +downloadManifestOrFail rmt = + maybe (giveup "No git repository found in this remote.") return + =<< downloadManifest rmt + +-- Downloads the Manifest or Nothing if the remote does not contain a +-- manifest. -- -- Throws errors if the remote cannot be accessed or the download fails, -- or if the manifest file cannot be parsed. --- --- This downloads the manifest to a temporary file, rather than using --- the usual Annex.Transfer.download. The content of manifests is not --- stable, and so it needs to re-download it fresh every time. -downloadManifest :: Remote -> Annex Manifest +downloadManifest :: Remote -> Annex (Maybe Manifest) downloadManifest rmt = ifM (Remote.checkPresent rmt mk) + -- Downloads to a temporary file, rather than using + -- the usual Annex.Transfer.download. The content of manifests is + -- not stable, and so it needs to re-download it fresh every time. 
( withTmpFile "GITMANIFEST" $ \tmp tmph -> do liftIO $ hClose tmph _ <- Remote.retrieveKeyFile rmt mk @@ -552,10 +566,11 @@ downloadManifest rmt = ifM (Remote.checkPresent rmt mk) nullMeterUpdate Remote.NoVerify (outks, inks) <- partitionEithers . map parseline . B8.lines <$> liftIO (B.readFile tmp) - mkManifest + m <- mkManifest <$> checkvalid [] inks <*> checkvalid [] outks - , return mempty + return (Just m) + , return Nothing ) where mk = genManifestKey (Remote.uuid rmt) diff --git a/doc/todo/git-remote-annex.mdwn b/doc/todo/git-remote-annex.mdwn index 53c71c3746..25626939d3 100644 --- a/doc/todo/git-remote-annex.mdwn +++ b/doc/todo/git-remote-annex.mdwn @@ -28,11 +28,6 @@ This is implememented and working. Remaining todo list for it: stored in the repo. Chicken and egg problem cloning from such a remote. Maybe allow advanced users to force it? -* When the remote has no manifest, a pull from it should fail, - while a push should succeed. Otherwise, it can be confusing - to clone from a wrong url, since it fails to download - a manifest and so appears as if the remote is empty. - * Improve recovery from interrupted push by using outManifest to clean up after it. (Requires populating outManifest.) From 34eae54ff99f14504352d81a95c1ec3833e38e28 Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Mon, 13 May 2024 11:37:47 -0400 Subject: [PATCH 37/53] git-remote-annex support exporttree=yes remotes Put the annex objects in .git/annex/objects/ inside the export remote. This way, when importing from the remote, they will be filtered out. Note that, when importtree=yes, content identifiers are used, and this means that pushing to a remote updates the git-annex branch. Urk. Will need to try to prevent that later, but I already had a todo about that for other reasons. Untested! 
Sponsored-By: Brock Spratlen on Patreon --- CmdLine/GitRemoteAnnex.hs | 182 +++++++++++++++++++++------- Types/Difference.hs | 1 + doc/internals/git-remote-annex.mdwn | 7 +- doc/todo/git-remote-annex.mdwn | 8 +- 4 files changed, 151 insertions(+), 47 deletions(-) diff --git a/CmdLine/GitRemoteAnnex.hs b/CmdLine/GitRemoteAnnex.hs index 6d495435f7..000f2ae5a8 100644 --- a/CmdLine/GitRemoteAnnex.hs +++ b/CmdLine/GitRemoteAnnex.hs @@ -25,15 +25,19 @@ import qualified Types.Remote as Remote import Annex.Transfer import Backend.GitRemoteAnnex import Config +import Types.Key import Types.RemoteConfig import Types.ProposedAccepted -import Types.Key +import Types.Export import Types.GitConfig +import Types.Difference import Git.Types import Logs.Difference import Annex.Init +import Annex.UUID import Annex.Content import Annex.Perms +import Annex.SpecialRemote.Config import Remote.List import Remote.List.Util import Utility.Tmp @@ -45,8 +49,9 @@ import Data.Either import qualified Data.ByteString as B import qualified Data.ByteString.Char8 as B8 import qualified Data.Map.Strict as M -import System.FilePath.ByteString as P +import qualified System.FilePath.ByteString as P import qualified Utility.RawFilePath as R +import qualified Data.Set as S run :: [String] -> IO () run (remotename:url:[]) = @@ -526,6 +531,28 @@ getEnabledSpecialRemoteByName remotename = maybe (return (Just rmt)) giveup (checkSpecialRemoteProblems rmt) +parseManifest :: B.ByteString -> Either String Manifest +parseManifest b = + let (outks, inks) = partitionEithers $ map parseline $ B8.lines b + in case (checkvalid [] inks, checkvalid [] outks) of + (Right inks', Right outks') -> + Right $ mkManifest inks' outks' + (Left err, _) -> Left err + (_, Left err) -> Left err + where + parseline l + | "-" `B.isPrefixOf` l = + Left $ deserializeKey' $ B.drop 1 l + | otherwise = + Right $ deserializeKey' l + + checkvalid c [] = Right (reverse c) + checkvalid c (Just k:ks) = case fromKey keyVariety k of + 
GitBundleKey -> checkvalid (k:c) ks + _ -> Left $ "Wrong type of key in manifest " ++ serializeKey k + checkvalid _ (Nothing:_) = + Left "Error parsing manifest" + -- Avoid using special remotes that are thirdparty populated, because -- there is no way to push the git repository keys into one. -- @@ -555,38 +582,39 @@ downloadManifestOrFail rmt = -- Throws errors if the remote cannot be accessed or the download fails, -- or if the manifest file cannot be parsed. downloadManifest :: Remote -> Annex (Maybe Manifest) -downloadManifest rmt = ifM (Remote.checkPresent rmt mk) - -- Downloads to a temporary file, rather than using - -- the usual Annex.Transfer.download. The content of manifests is - -- not stable, and so it needs to re-download it fresh every time. - ( withTmpFile "GITMANIFEST" $ \tmp tmph -> do - liftIO $ hClose tmph - _ <- Remote.retrieveKeyFile rmt mk - (AssociatedFile Nothing) tmp - nullMeterUpdate Remote.NoVerify - (outks, inks) <- partitionEithers . map parseline . B8.lines - <$> liftIO (B.readFile tmp) - m <- mkManifest - <$> checkvalid [] inks - <*> checkvalid [] outks - return (Just m) - , return Nothing - ) +downloadManifest rmt = getKeyExportLocations rmt mk >>= \case + Nothing -> ifM (Remote.checkPresent rmt mk) + ( gettotmp $ \tmp -> + Remote.retrieveKeyFile rmt mk + (AssociatedFile Nothing) tmp + nullMeterUpdate Remote.NoVerify + , return Nothing + ) + Just locs -> getexport locs where mk = genManifestKey (Remote.uuid rmt) - checkvalid c [] = return (reverse c) - checkvalid c (Just k:ks) = case fromKey keyVariety k of - GitBundleKey -> checkvalid (k:c) ks - _ -> giveup $ "Wrong type of key in manifest " ++ serializeKey k - checkvalid _ (Nothing:_) = - giveup $ "Error parsing manifest " ++ serializeKey mk + -- Downloads to a temporary file, rather than using eg + -- Annex.Transfer.download that would put it in the object + -- directory. 
The content of manifests is not stable, and so + -- it needs to re-download it fresh every time, and the object + -- file should not be stored locally. + gettotmp dl = withTmpFile "GITMANIFEST" $ \tmp tmph -> do + liftIO $ hClose tmph + _ <- dl tmp + b <- liftIO (B.readFile tmp) + case parseManifest b of + Right m -> return (Just m) + Left err -> giveup err - parseline l - | "-" `B.isPrefixOf` l = - Left $ deserializeKey' $ B.drop 1 l - | otherwise = - Right $ deserializeKey' l + getexport [] = return Nothing + getexport (loc:locs) = + ifM (Remote.checkPresentExport (Remote.exportActions rmt) mk loc) + ( gettotmp $ \tmp -> + Remote.retrieveExport (Remote.exportActions rmt) + mk loc tmp nullMeterUpdate + , getexport locs + ) -- Uploads the Manifest to the remote. -- @@ -610,7 +638,7 @@ uploadManifest rmt manifest = B8.hPutStrLn tmph (serializeKey' bundlekey) liftIO $ hClose tmph -- Remove old manifest if present. - Remote.removeKey rmt mk + dropKey' rmt mk -- storeKey needs the key to be in the annex objects -- directory, so put the manifest file there temporarily. -- Using linkOrCopy rather than moveAnnex to avoid updating @@ -622,9 +650,8 @@ uploadManifest rmt manifest = linkOrCopy mk (toRawFilePath tmp) objfile Nothing unless (isJust res) uploadfailed - -- noRetry because manifest content is not stable - ok <- upload rmt mk (AssociatedFile Nothing) - noRetry noNotification + ok <- (uploadGitObject rmt mk >> pure True) + `catchNonAsync` (const (pure False)) -- Don't leave the manifest key in the annex objects -- directory. unlinkAnnex mk @@ -650,22 +677,46 @@ uploadManifest rmt manifest = -- 3. Git bundle objects are not usually transferred between repositories -- except special remotes (although the user can if they want to). 
downloadGitBundle :: Remote -> Key -> Annex FilePath -downloadGitBundle rmt k = - ifM (download rmt k (AssociatedFile Nothing) stdRetry noNotification) +downloadGitBundle rmt k = getKeyExportLocations rmt k >>= \case + Nothing -> dlwith $ + download rmt k (AssociatedFile Nothing) stdRetry noNotification + Just locs -> dlwith $ + anyM getexport locs + where + dlwith a = ifM a ( decodeBS <$> calcRepo (gitAnnexLocation k) , giveup $ "Failed to download " ++ serializeKey k ) --- Uploads a git bundle from the annex objects directory to the remote. + getexport loc = catchNonAsync (getexport' loc) (const (pure False)) + getexport' loc = + getViaTmp rsp vc k (AssociatedFile Nothing) Nothing $ \tmp -> do + v <- Remote.retrieveExport (Remote.exportActions rmt) + k loc (decodeBS tmp) nullMeterUpdate + return (True, v) + rsp = Remote.retrievalSecurityPolicy rmt + vc = Remote.RemoteVerify rmt + +-- Uploads a bundle or manifest object from the annex objects directory +-- to the remote. -- -- Throws errors if the upload fails. -- -- This does not update the location log to indicate that the remote --- contains the git bundle object. -uploadGitBundle :: Remote -> Key -> Annex () -uploadGitBundle rmt k = - unlessM (upload rmt k (AssociatedFile Nothing) stdRetry noNotification) $ - giveup $ "Failed to upload " ++ serializeKey k +-- contains the git object. 
+uploadGitObject :: Remote -> Key -> Annex () +uploadGitObject rmt k = getKeyExportLocations rmt k >>= \case + Just (loc:_) -> do + objfile <- fromRawFilePath <$> calcRepo (gitAnnexLocation k) + Remote.storeExport (Remote.exportActions rmt) objfile k loc nullMeterUpdate + _ -> + unlessM (upload rmt k (AssociatedFile Nothing) retry noNotification) $ + giveup $ "Failed to upload " ++ serializeKey k + where + retry = case fromKey keyVariety k of + GitBundleKey -> stdRetry + -- Manifest keys are not stable + _ -> noRetry -- Generates a git bundle, ingests it into the local objects directory, -- and uploads its key to the special remote. @@ -689,12 +740,12 @@ generateAndUploadGitBundle rmt bs manifest = unless (bundlekey `elem` (inManifest manifest)) $ do unlessM (moveAnnex bundlekey (AssociatedFile Nothing) (toRawFilePath tmp)) $ giveup "Unable to push" - uploadGitBundle rmt bundlekey + uploadGitObject rmt bundlekey `onException` unlinkAnnex bundlekey return bundlekey dropKey :: Remote -> Key -> Annex Bool -dropKey rmt k = tryNonAsync (Remote.removeKey rmt k) >>= \case +dropKey rmt k = tryNonAsync (dropKey' rmt k) >>= \case Right () -> return True Left ex -> do liftIO $ hPutStrLn stderr $ @@ -703,6 +754,49 @@ dropKey rmt k = tryNonAsync (Remote.removeKey rmt k) >>= \case ++ " (" ++ show ex ++ ")" return False +dropKey' :: Remote -> Key -> Annex () +dropKey' rmt k = getKeyExportLocations rmt k >>= \case + Nothing -> Remote.removeKey rmt k + Just locs -> forM_ locs $ \loc -> + Remote.removeExport (Remote.exportActions rmt) k loc + +getKeyExportLocations :: Remote -> Key -> Annex (Maybe [ExportLocation]) +getKeyExportLocations rmt k = do + cfg <- Annex.getGitConfig + u <- getUUID + return $ keyExportLocations rmt k cfg u + +-- When the remote contains a tree, the git keys are stored +-- inside the .git/annex/objects/ directory in the remote. +-- +-- The first ExportLocation in the returned list is the one that +-- is the same as the local repository would use. 
But it's possible +-- that one of the others in the list was used by another repository to +-- upload a git key. +keyExportLocations :: Remote -> Key -> GitConfig -> UUID -> Maybe [ExportLocation] +keyExportLocations rmt k cfg uuid + | exportTree (Remote.config rmt) || importTree (Remote.config rmt) = + Just $ map (\p -> mkExportLocation (".git" P.</> p)) $ + concatMap (`annexLocationsNonBare` k) cfgs + | otherwise = Nothing + where + -- When git-annex has not been initialized yet (eg, when cloning), + -- the Differences are unknown, so make a version of the GitConfig + -- with and without the OneLevelObjectHash difference. + cfgs + | uuid /= NoUUID = [cfg] + | hasDifference OneLevelObjectHash (annexDifferences cfg) = + [ cfg + , cfg { annexDifferences = mempty } + ] + | otherwise = + [ cfg + , cfg + { annexDifferences = mkDifferences + (S.singleton OneLevelObjectHash) + } + ] + -- Tracking refs are used to remember the refs that are currently on the -- remote. This is different from git's remote tracking branches, since it -- needs to track all refs on the remote, not only the refs that the user diff --git a/Types/Difference.hs b/Types/Difference.hs index 0617dc22a1..93a175c76c 100644 --- a/Types/Difference.hs +++ b/Types/Difference.hs @@ -17,6 +17,7 @@ module Types.Difference ( differenceConfigVal, hasDifference, listDifferences, + mkDifferences, ) where import Utility.PartialPrelude diff --git a/doc/internals/git-remote-annex.mdwn b/doc/internals/git-remote-annex.mdwn index efaae84ab4..8fff1eff4e 100644 --- a/doc/internals/git-remote-annex.mdwn +++ b/doc/internals/git-remote-annex.mdwn @@ -18,6 +18,11 @@ and are in the process of being deleted. (Lines end with unix `"\n"`, not `"\r\n"`.) +# exporttree=yes remotes + +In an exporttree=yes remote, the GITMANIFEST and GITBUNDLE objects are +stored in the remote, under the `.git/annex/objects/` path. 
+ # multiple GITMANIFEST files Usually there will only be one per special remote, but it's possible for @@ -38,6 +43,6 @@ stored in such a special remote, this procedure will work: (Note that later bundles can update refs from the versions in previous bundles.) -When the special remote is encryptee, the GITMANIFEST and GITBUNDLE will +When the special remote is encrypted, the GITMANIFEST and GITBUNDLE will also be encrypted. To decrypt those manually, see this [[fairly simple shell script using standard tools|tips/Decrypting_files_in_special_remotes_without_git-annex]]. diff --git a/doc/todo/git-remote-annex.mdwn b/doc/todo/git-remote-annex.mdwn index 25626939d3..2bb50d2226 100644 --- a/doc/todo/git-remote-annex.mdwn +++ b/doc/todo/git-remote-annex.mdwn @@ -12,7 +12,7 @@ This is implememented and working. Remaining todo list for it: * Need to test all types of pushes, barely tested at all. -* Support exporttree=yes remotes. +* Need to test exporttree=yes remotes. * Support max-bundles config @@ -35,7 +35,7 @@ This is implememented and working. Remaining todo list for it: where the remote is left with a deleted manifest when a push is interrupted part way through. This should be recoverable by caching the manifest locally and re-uploading it when - the remote has no manifest. + the remote has no manifest or prompting the user to merge and re-push. * datalad-annex supports cloning from the web special remote, using an url that contains the result of pushing to eg, a directory @@ -87,3 +87,7 @@ This is implememented and working. Remaining todo list for it: This should be fixable by making git-remote-annex not write to the git-annex branch, but to eg, a temporary journal directory. + + Also, when the remote uses importree=yes, pushing to it updates + content identifiers, which currently get recorded in the git-annex + branch. It would be good to avoid that being written as well. 
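
As an illustration of the manifest format that `parseManifest` handles in the patch above, here is a minimal standalone sketch. It is not the code from the patch: it assumes keys are plain `String`s, that a bundle key can be recognised by a `GITBUNDLE-` prefix, and the `Manifest` record is a simplified stand-in for git-annex's real type (the real code works on ByteStrings and uses `deserializeKey'`).

```haskell
import Data.Either (partitionEithers)
import Data.List (isPrefixOf)

-- Simplified stand-in for git-annex's Manifest (assumption: keys as Strings).
data Manifest = Manifest
    { inManifest :: [String]   -- bundles currently in use
    , outManifest :: [String]  -- bundles in the process of being deleted
    } deriving (Show, Eq)

-- Lines starting with "-" name outgoing bundles; all other lines name
-- current bundles. Any key that is not a bundle key is an error.
parseManifest :: String -> Either String Manifest
parseManifest s = Manifest <$> mapM checkvalid inks <*> mapM checkvalid outks
  where
    (outks, inks) = partitionEithers (map parseline (lines s))
    parseline l
        | "-" `isPrefixOf` l = Left (drop 1 l)
        | otherwise = Right l
    -- Assumption: a "GITBUNDLE-" prefix is enough to identify a bundle key.
    checkvalid k
        | "GITBUNDLE-" `isPrefixOf` k = Right k
        | otherwise = Left ("Wrong type of key in manifest " ++ k)

main :: IO ()
main = do
    print (parseManifest "GITBUNDLE-s10--aaa\n-GITBUNDLE-s20--bbb\n")
    print (parseManifest "SHA256-s10--ccc\n")
```

Returning `Either String` rather than throwing mirrors the change in the patch, where the caller decides whether a parse failure should `giveup`.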
From 13a6a20716d33e66bcd1eb39c2f081e2671e1cb5 Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Mon, 13 May 2024 13:52:58 -0400 Subject: [PATCH 38/53] fix --is-ancestor option --- Git/Ref.hs | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Git/Ref.hs b/Git/Ref.hs index fd7d2da0c8..2767ae339c 100644 --- a/Git/Ref.hs +++ b/Git/Ref.hs @@ -218,7 +218,7 @@ tree (Ref ref) = extractSha <$$> pipeReadStrict isAncestor :: Ref -> Ref -> Repo -> IO Bool isAncestor r1 r2 = runBool [ Param "merge-base" - , Param "--ancestor" + , Param "--is-ancestor" , Param (fromRef r1) , Param (fromRef r2) ] From 552b000ef1a218a14e3f27bb66ec8319a1d091df Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Mon, 13 May 2024 14:30:18 -0400 Subject: [PATCH 39/53] update --- doc/todo/git-remote-annex.mdwn | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/doc/todo/git-remote-annex.mdwn b/doc/todo/git-remote-annex.mdwn index 2bb50d2226..96b5fcd994 100644 --- a/doc/todo/git-remote-annex.mdwn +++ b/doc/todo/git-remote-annex.mdwn @@ -10,9 +10,12 @@ will be available to users who don't use datalad. This is implememented and working. Remaining todo list for it: -* Need to test all types of pushes, barely tested at all. +* Test pushes that delete branches. -* Need to test exporttree=yes remotes. +* Test incremental pushes that don't fast-forward. + +* exporttree=yes remote works, but cloning one from the annex:: url + does not, somehow exportTree is not set then. * Support max-bundles config From ddf05c271b150f5c5311fa7a68fbbadf03faba80 Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Mon, 13 May 2024 14:35:17 -0400 Subject: [PATCH 40/53] fix cloning from an annex:: remote with exporttree=yes Updating the remote list needs the config to be written to the git-annex branch, which was not done for good reasons. 
While it would be possible to instead use Remote.List.remoteGen without writing to the branch, I already have a plan to discard git-annex branch writes made by git-remote-annex, so the simplest fix is to write the config to the branch. Sponsored-by: k0ld on Patreon --- CmdLine/GitRemoteAnnex.hs | 10 +++------- doc/todo/git-remote-annex.mdwn | 9 ++++++--- 2 files changed, 9 insertions(+), 10 deletions(-) diff --git a/CmdLine/GitRemoteAnnex.hs b/CmdLine/GitRemoteAnnex.hs index 000f2ae5a8..e21128e7fd 100644 --- a/CmdLine/GitRemoteAnnex.hs +++ b/CmdLine/GitRemoteAnnex.hs @@ -22,6 +22,7 @@ import qualified Git.Remote.Remove import qualified Annex.SpecialRemote as SpecialRemote import qualified Annex.Branch import qualified Types.Remote as Remote +import qualified Logs.Remote import Annex.Transfer import Backend.GitRemoteAnnex import Config @@ -481,20 +482,15 @@ withSpecialRemote cfg@(SpecialRemoteConfig {}) sab a = case specialRemoteName cf where -- Initialize a new special remote with the provided configuration -- and name. - -- - -- The configuration is not stored in the git-annex branch, because - -- it's expected that the git repository stored on the special - -- remote includes its configuration, perhaps under a different - -- name, and perhaps slightly different (when the annex:: url - -- omitted some unimportant part of the configuration). 
initremote remotename = do let c = M.insert SpecialRemote.nameField (Proposed remotename) (specialRemoteConfig cfg) t <- either giveup return (SpecialRemote.findType c) dummycfg <- liftIO dummyRemoteGitConfig - (c', _u) <- Remote.setup t Remote.Init (Just (specialRemoteUUID cfg)) + (c', u) <- Remote.setup t Remote.Init (Just (specialRemoteUUID cfg)) Nothing c dummycfg `onException` cleanupremote remotename + Logs.Remote.configSet u c' setConfig (remoteConfig c' "url") (specialRemoteUrl cfg) remotesChanged getEnabledSpecialRemoteByName remotename >>= \case diff --git a/doc/todo/git-remote-annex.mdwn b/doc/todo/git-remote-annex.mdwn index 96b5fcd994..7683eff17d 100644 --- a/doc/todo/git-remote-annex.mdwn +++ b/doc/todo/git-remote-annex.mdwn @@ -14,9 +14,6 @@ This is implememented and working. Remaining todo list for it: * Test incremental pushes that don't fast-forward. -* exporttree=yes remote works, but cloning one from the annex:: url - does not, somehow exportTree is not set then. - * Support max-bundles config * Need to mention git-remote-annex in special remotes page, and perhaps @@ -91,6 +88,12 @@ This is implememented and working. Remaining todo list for it: This should be fixable by making git-remote-annex not write to the git-annex branch, but to eg, a temporary journal directory. + Also, cloning currently writes the special remote config into remote.log, + which might be slightly different in some way than the config in + remote.log for the same remote. cloning should not change the stored + config of a remote, but that branch write was necessary. So throwing + away the branch write is also good for this case. + Also, when the remote uses importree=yes, pushing to it updates content identifiers, which currently get recorded in the git-annex branch. It would be good to avoid that being written as well. 
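
Cloning from an exporttree=yes remote relies on `keyExportLocations` (added earlier in this series) to guess where the git keys live before the local repository has a UUID. The sketch below models only that guessing step. `HashLayout`, `otherLayout` and `candidateLayouts` are hypothetical names for illustration; the real code toggles the `OneLevelObjectHash` difference in a `GitConfig`, and in one branch clears all differences rather than just that one.

```haskell
-- Hypothetical stand-in for the presence or absence of the
-- OneLevelObjectHash difference.
data HashLayout = TwoLevel | OneLevel
    deriving (Show, Eq)

otherLayout :: HashLayout -> HashLayout
otherLayout TwoLevel = OneLevel
otherLayout OneLevel = TwoLevel

-- When the repository is initialized its layout is known, so only that
-- layout is used. Otherwise the configured layout is tried first and the
-- alternative second.
candidateLayouts :: Bool -> HashLayout -> [HashLayout]
candidateLayouts initialized configured
    | initialized = [configured]
    | otherwise = [configured, otherLayout configured]

main :: IO ()
main = do
    print (candidateLayouts True TwoLevel)
    print (candidateLayouts False TwoLevel)
    print (candidateLayouts False OneLevel)
```

Keeping the locally preferred layout first matches the comment in the patch that the first `ExportLocation` is the one the local repository would use.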
From 8bf6dab615a82a727e04fb2299c1b088cf1a0f29 Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Mon, 13 May 2024 14:42:25 -0400 Subject: [PATCH 41/53] update --- doc/todo/git-remote-annex.mdwn | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/doc/todo/git-remote-annex.mdwn b/doc/todo/git-remote-annex.mdwn index 7683eff17d..7d115739cc 100644 --- a/doc/todo/git-remote-annex.mdwn +++ b/doc/todo/git-remote-annex.mdwn @@ -16,6 +16,10 @@ This is implememented and working. Remaining todo list for it: * Support max-bundles config +* Cloning from an annex:: url with importtree=yes doesn't work + (with or without exporttree=yes). This is because the ContentIdentifier + db is not populated. + * Need to mention git-remote-annex in special remotes page, and perhaps write a tip for it. Also link to it from git-annex man page. From 6f1039900d473c6bdb6d2d378e2ef6b12b44901a Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Tue, 14 May 2024 13:52:20 -0400 Subject: [PATCH 42/53] prevent using git-remote-annex with unsuitable special remote configs I hope to support importtree=yes eventually, but it does not currently work. Added remote.<name>.allow-encrypted-gitrepo that needs to be set to allow using it with encrypted git repos. Note that even encryption=pubkey uses a cipher stored in the git repo to encrypt the keys stored in the remote. While it would be possible to not encrypt the GITBUNDLE and GITMANIFEST keys, and then allow using encryption=pubkey, it doesn't currently work, and that would be a complication that I doubt is worth it. 
--- CmdLine/GitRemoteAnnex.hs | 17 +++++++++++++++- Remote/Helper/Encryptable.hs | 19 ++++++++++------- Types/GitConfig.hs | 6 +++++- doc/git-annex.mdwn | 18 +++++++++++++---- doc/git-remote-annex.mdwn | 5 ----- doc/todo/git-remote-annex.mdwn | 37 +++++++++++++++------------------- 6 files changed, 63 insertions(+), 39 deletions(-) diff --git a/CmdLine/GitRemoteAnnex.hs b/CmdLine/GitRemoteAnnex.hs index e21128e7fd..626b4e4ce3 100644 --- a/CmdLine/GitRemoteAnnex.hs +++ b/CmdLine/GitRemoteAnnex.hs @@ -23,6 +23,7 @@ import qualified Annex.SpecialRemote as SpecialRemote import qualified Annex.Branch import qualified Types.Remote as Remote import qualified Logs.Remote +import Remote.Helper.Encryptable (parseEncryptionMethod) import Annex.Transfer import Backend.GitRemoteAnnex import Config @@ -32,6 +33,7 @@ import Types.ProposedAccepted import Types.Export import Types.GitConfig import Types.Difference +import Types.Crypto import Git.Types import Logs.Difference import Annex.Init @@ -558,8 +560,21 @@ parseManifest b = checkSpecialRemoteProblems :: Remote -> Maybe String checkSpecialRemoteProblems rmt | Remote.thirdPartyPopulated (Remote.remotetype rmt) = - Just "Cannot use this thirdparty-populated special remote as a git remote" + Just $ "Cannot use this thirdparty-populated special" + ++ " remote as a git remote." + | importTree (Remote.config rmt) = + Just $ "Using importtree=yes special remotes as git remotes" + ++ " is not yet supported." + | parseEncryptionMethod (unparsedRemoteConfig (Remote.config rmt)) /= Right NoneEncryption + && not (remoteAnnexAllowEncryptedGitRepo (Remote.gitconfig rmt)) = + Just $ "Using an encrypted special remote as a git" + ++ " remote makes it impossible to clone" + ++ " from it. 
If you will never need to" + ++ " clone from this remote, set: git config " + ++ decodeBS allowencryptedgitrepo ++ " true" | otherwise = Nothing + where + ConfigKey allowencryptedgitrepo = remoteAnnexConfig rmt "allow-encrypted-gitrepo" -- Downloads the Manifest when present in the remote. When not present, -- returns an empty Manifest. diff --git a/Remote/Helper/Encryptable.hs b/Remote/Helper/Encryptable.hs index 884d53d7bf..9f4bd7fcb1 100644 --- a/Remote/Helper/Encryptable.hs +++ b/Remote/Helper/Encryptable.hs @@ -14,6 +14,7 @@ module Remote.Helper.Encryptable ( encryptionAlreadySetup, encryptionConfigParsers, parseEncryptionConfig, + parseEncryptionMethod, remoteCipher, remoteCipher', embedCreds, @@ -85,7 +86,7 @@ encryptionFieldParser :: RemoteConfigFieldParser encryptionFieldParser = RemoteConfigFieldParser { parserForField = encryptionField , valueParser = \v c -> Just . RemoteConfigValue - <$> parseEncryptionMethod (fmap fromProposedAccepted v) c + <$> parseEncryptionMethod' v c , fieldDesc = FieldDesc "how to encrypt data stored in the special remote" , valueDesc = Just $ ValueDesc $ intercalate " or " (M.keys encryptionMethods) @@ -100,14 +101,18 @@ encryptionMethods = M.fromList , ("sharedpubkey", SharedPubKeyEncryption) ] -parseEncryptionMethod :: Maybe String -> RemoteConfig -> Either String EncryptionMethod -parseEncryptionMethod (Just s) _ = case M.lookup s encryptionMethods of - Just em -> Right em - Nothing -> Left badEncryptionMethod +parseEncryptionMethod :: RemoteConfig -> Either String EncryptionMethod +parseEncryptionMethod c = parseEncryptionMethod' (M.lookup encryptionField c) c + +parseEncryptionMethod' :: Maybe (ProposedAccepted String) -> RemoteConfig -> Either String EncryptionMethod +parseEncryptionMethod' (Just s) _ = + case M.lookup (fromProposedAccepted s) encryptionMethods of + Just em -> Right em + Nothing -> Left badEncryptionMethod -- Hybrid encryption is the default when a keyid is specified without -- an encryption field, or 
when there's a cipher already but no encryption -- field. -parseEncryptionMethod Nothing c +parseEncryptionMethod' Nothing c | M.member (Accepted "keyid") c || M.member cipherField c = Right HybridEncryption | otherwise = Left badEncryptionMethod @@ -162,7 +167,7 @@ encryptionSetup c gc = do maybe (genCipher pc gpgcmd) (updateCipher pc gpgcmd) (extractCipher pc) where -- The type of encryption - encryption = parseEncryptionMethod (fromProposedAccepted <$> M.lookup encryptionField c) c + encryption = parseEncryptionMethod c -- Generate a new cipher, depending on the chosen encryption scheme genCipher pc gpgcmd = case encryption of Right NoneEncryption -> return (c, NoEncryption) diff --git a/Types/GitConfig.hs b/Types/GitConfig.hs index 26540b8484..49fde98fc3 100644 --- a/Types/GitConfig.hs +++ b/Types/GitConfig.hs @@ -362,6 +362,7 @@ data RemoteGitConfig = RemoteGitConfig , remoteAnnexStopCommand :: Maybe String , remoteAnnexSpeculatePresent :: Bool , remoteAnnexBare :: Maybe Bool + , remoteAnnexAllowEncryptedGitRepo :: Bool , remoteAnnexRetry :: Maybe Integer , remoteAnnexForwardRetry :: Maybe Integer , remoteAnnexRetryDelay :: Maybe Seconds @@ -430,8 +431,11 @@ extractRemoteGitConfig r remotename = do , remoteAnnexTrustLevel = notempty $ getmaybe "trustlevel" , remoteAnnexStartCommand = notempty $ getmaybe "start-command" , remoteAnnexStopCommand = notempty $ getmaybe "stop-command" - , remoteAnnexSpeculatePresent = getbool "speculate-present" False + , remoteAnnexSpeculatePresent = + getbool "speculate-present" False , remoteAnnexBare = getmaybebool "bare" + , remoteAnnexAllowEncryptedGitRepo = + getbool "allow-encrypted-gitrepo" False , remoteAnnexRetry = getmayberead "retry" , remoteAnnexForwardRetry = getmayberead "forward-retry" , remoteAnnexRetryDelay = Seconds diff --git a/doc/git-annex.mdwn b/doc/git-annex.mdwn index c8581ed713..561c8c1836 100644 --- a/doc/git-annex.mdwn +++ b/doc/git-annex.mdwn @@ -1634,10 +1634,6 @@ Remotes are configured using these 
settings in `.git/config`. configured by the trust and untrust commands. The value can be any of "trusted", "semitrusted" or "untrusted". -* `remote.<name>.annex-availability` - - This configuration setting is no longer used. - * `remote.<name>.annex-speculate-present` Set to "true" to make git-annex speculate that this remote may contain the @@ -1663,11 +1659,25 @@ Remotes are configured using these settings in `.git/config`. while preventing a new clone needing to download too many objects. Set to 0 to disable re-pushing. +* `remote.<name>.allow-encrypted-gitrepo` + + Setting this to true allows using [[git-remote-annex]] to push the git + repository to an encrypted special remote. + + That is not allowed by default, because it is impossible to git clone + from an encrypted special remote, since it needs encryption keys stored + in the remote. So take care that, if you set this, you don't rely + on the encrypted special remote being the only copy of your git repository. + * `remote.<name>.annex-bare` Can be used to tell git-annex if a remote is a bare repository or not. Normally, git-annex determines this automatically. +* `remote.<name>.annex-availability` + + This configuration setting is no longer used. + * `remote.<name>.annex-ssh-options` Options to use when using ssh to talk to this remote. diff --git a/doc/git-remote-annex.mdwn b/doc/git-remote-annex.mdwn index 4fefb1bd36..3da0961c5b 100644 --- a/doc/git-remote-annex.mdwn +++ b/doc/git-remote-annex.mdwn @@ -12,11 +12,6 @@ This is a git remote helper program that allows git to clone, pull and push from a git repository that is stored in a git-annex special remote. -It can be used with any special remote except those that use -encryption=shared or encryption=hybrid. (Since those types of encryption -rely on a cipher that is checked into the git repository, cloning from -such a special remote would present a chicken and egg problem.) 
- The format of the remote URL is "annex::" followed by the UUID of the special remote, and then followed by all of the configuration parameters of the special remote. diff --git a/doc/todo/git-remote-annex.mdwn b/doc/todo/git-remote-annex.mdwn index 7d115739cc..6863f56655 100644 --- a/doc/todo/git-remote-annex.mdwn +++ b/doc/todo/git-remote-annex.mdwn @@ -1,5 +1,5 @@ -git-remote-annex will be a program that allows push/pull of a git -repository to any git-annex special remote. +git-remote-annex will be a program that allows push/pull/clone of a git +repository to many types of git-annex special remote. This is a redesign and reimplementation of git-remote-datalad-annex. It will be a safer implementation, will support incremental pushes, and @@ -23,14 +23,20 @@ This is implememented and working. Remaining todo list for it: * Need to mention git-remote-annex in special remotes page, and perhaps write a tip for it. Also link to it from git-annex man page. -* initremote could optionally configure the url to a special remote - to an annex:: url. This would make it easier to use git-remote-annex, - since the user would not need to set up the url themselves. - (Also it would then avoid setting `skipFetchAll = true`) +* It would be nice if git-annex could generate an annex:: url + for a special remote and show it to the user, eg when + they have set the shorthand "annex::" url, so they know the full url. + `git-annex info $remote` could also display it. + Currently, the user has to remember how the special remote was + configured and replicate it all in the url. -* Prevent using with remotes that are encrypted using a cipher - stored in the repo. Chicken and egg problem cloning from - such a remote. Maybe allow advanced users to force it? + There are some difficulties to doing this, including that + RemoteConfig can have hidden fields that should be omitted. + +* initremote/enableremote could have an option that configures the url to a + special remote to a annex:: url. 
This would make it easier to use + git-remote-annex, since the user would not need to set up the url + themselves. (Also it would then avoid setting `skipFetchAll = true`) * Improve recovery from interrupted push by using outManifest to clean up after it. (Requires populating outManifest.) @@ -47,18 +53,6 @@ This is implememented and working. Remaining todo list for it: `datalad-annex::https://example.com?type=web&url={noquery}` Supporting something like this would be good. -* It would be nice if git-annex could generate an annex:: url - for a special remote and show it to the user, eg when - they have set the shorthand "annex::" url, so they know the full url. - `git-annex info $remote` could also display it. - Currently, the user has to remember how the special remote was - configured and replicate it all in the url. - - There are some difficulties to doing this, including that - RemoteConfig can have hidden fields that should be omitted, - and that some, like type=directory, remove some configs - (eg directory=) in their setup action. - * Improve behavior in push races. A race can overwrite a change to the MANIFEST and lose work that was pushed from the other repo. From the user's perspective, that situation is the same as if one repo @@ -101,3 +95,4 @@ This is implememented and working. Remaining todo list for it: Also, when the remote uses importree=yes, pushing to it updates content identifiers, which currently get recorded in the git-annex branch. It would be good to avoid that being written as well. 
+ From 8ad768fdba9d4035fd22c6beaa6d3c5684780c99 Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Tue, 14 May 2024 13:57:56 -0400 Subject: [PATCH 43/53] todo --- doc/todo/git-remote-annex.mdwn | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/doc/todo/git-remote-annex.mdwn b/doc/todo/git-remote-annex.mdwn index 6863f56655..5dd42c44f7 100644 --- a/doc/todo/git-remote-annex.mdwn +++ b/doc/todo/git-remote-annex.mdwn @@ -10,6 +10,11 @@ will be available to users who don't use datalad. This is implememented and working. Remaining todo list for it: +* git-annex unused --from remote should not treat git manifest and bundle + keys as unused, since that could lead to data loss. It's fine for + git-annex unused on the local repo to treat those as unused since they're + only a local cache. + * Test pushes that delete branches. * Test incremental pushes that don't fast-forward. @@ -92,7 +97,7 @@ This is implememented and working. Remaining todo list for it: config of a remote, but that branch write was necessary. So throwing away the branch write is also good for this case. - Also, when the remote uses importree=yes, pushing to it updates + Also, when the remote uses importtree=yes, pushing to it updates content identifiers, which currently get recorded in the git-annex branch. It would be good to avoid that being written as well. 
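
The guards that PATCH 42 adds to `checkSpecialRemoteProblems` can be summarised in a standalone sketch. This is a simplified model under stated assumptions, not the patch's code: `RemoteInfo` and `checkProblems` are hypothetical names, the messages are abbreviated, and the `encrypted` flag stands in for the patch's test that the parsed encryption method is anything other than `Right NoneEncryption` (so a config that fails to parse also counts as encrypted there).

```haskell
-- Hypothetical, simplified view of a special remote's configuration.
data RemoteInfo = RemoteInfo
    { thirdPartyPopulated :: Bool
    , importTree :: Bool
    , encrypted :: Bool              -- any encryption= other than none
    , allowEncryptedGitRepo :: Bool  -- the allow-encrypted-gitrepo setting
    }

-- Returns a description of the first problem found, or Nothing when the
-- remote is suitable for use as a git remote. The order of the guards
-- follows the order used in the patch.
checkProblems :: RemoteInfo -> Maybe String
checkProblems r
    | thirdPartyPopulated r =
        Just "thirdparty-populated remotes cannot be used as git remotes"
    | importTree r =
        Just "importtree=yes remotes are not yet supported"
    | encrypted r && not (allowEncryptedGitRepo r) =
        Just "encrypted remotes cannot be cloned from; opt in to push anyway"
    | otherwise = Nothing

main :: IO ()
main = mapM_ (print . checkProblems)
    [ RemoteInfo False False False False  -- plain remote: no problem
    , RemoteInfo False False True False   -- encrypted, not opted in
    , RemoteInfo False False True True    -- encrypted, opted in
    , RemoteInfo False True False False   -- importtree=yes
    ]
```

The opt-in only lifts the encryption check; thirdparty-populated and importtree=yes remotes are rejected regardless of it.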
From 0bf72ef10306891a99f6dd706e79ed7d8e26edd4 Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Tue, 14 May 2024 14:23:40 -0400 Subject: [PATCH 44/53] max-git-bundles config for git-remote-annex --- CmdLine/GitRemoteAnnex.hs | 18 +++++++++++++----- Types/GitConfig.hs | 9 ++++++--- doc/git-annex.mdwn | 6 +++--- doc/git-remote-annex.mdwn | 14 +++++++------- doc/todo/git-remote-annex.mdwn | 2 -- 5 files changed, 29 insertions(+), 20 deletions(-) diff --git a/CmdLine/GitRemoteAnnex.hs b/CmdLine/GitRemoteAnnex.hs index 626b4e4ce3..6bc802401c 100644 --- a/CmdLine/GitRemoteAnnex.hs +++ b/CmdLine/GitRemoteAnnex.hs @@ -203,7 +203,6 @@ push st rmt ls = do then pushEmpty st rmt else if any forcedPush refspecs then fullPush st rmt (M.keys trackingrefs) - -- TODO: support max-bundles config else incrementalPush st rmt (trackingRefs st) trackingrefs if ok @@ -276,6 +275,10 @@ fullPush :: State -> Remote -> [Ref] -> Annex (Bool, State) fullPush st rmt refs = guardPush st $ do oldmanifest <- maybe (downloadManifestWhenPresent rmt) pure (manifestCache st) + fullPush' oldmanifest st rmt refs + +fullPush' :: Manifest -> State -> Remote -> [Ref] -> Annex (Bool, State) +fullPush' oldmanifest st rmt refs =do let bs = map Git.Bundle.fullBundleSpec refs bundlekey <- generateAndUploadGitBundle rmt bs oldmanifest uploadManifest rmt (mkManifest [bundlekey] []) @@ -295,12 +298,17 @@ guardPush st a = catchNonAsync a $ \ex -> do -- have been added. 
incrementalPush :: State -> Remote -> M.Map Ref Sha -> M.Map Ref Sha -> Annex (Bool, State) incrementalPush st rmt oldtrackingrefs newtrackingrefs = guardPush st $ do - bs <- calc [] (M.toList newtrackingrefs) oldmanifest <- maybe (downloadManifestWhenPresent rmt) pure (manifestCache st) - bundlekey <- generateAndUploadGitBundle rmt bs oldmanifest - uploadManifest rmt (oldmanifest <> mkManifest [bundlekey] []) - return (True, st { manifestCache = Nothing }) + if length (inManifest oldmanifest) + 1 > remoteAnnexMaxGitBundles (Remote.gitconfig rmt) + then fullPush' oldmanifest st rmt (M.keys newtrackingrefs) + else go oldmanifest where + go oldmanifest = do + bs <- calc [] (M.toList newtrackingrefs) + bundlekey <- generateAndUploadGitBundle rmt bs oldmanifest + uploadManifest rmt (oldmanifest <> mkManifest [bundlekey] []) + return (True, st { manifestCache = Nothing }) + calc c [] = return (reverse c) calc c ((ref, sha):refs) = case M.lookup ref oldtrackingrefs of Just oldsha diff --git a/Types/GitConfig.hs b/Types/GitConfig.hs index 49fde98fc3..b24ae48eb8 100644 --- a/Types/GitConfig.hs +++ b/Types/GitConfig.hs @@ -362,7 +362,6 @@ data RemoteGitConfig = RemoteGitConfig , remoteAnnexStopCommand :: Maybe String , remoteAnnexSpeculatePresent :: Bool , remoteAnnexBare :: Maybe Bool - , remoteAnnexAllowEncryptedGitRepo :: Bool , remoteAnnexRetry :: Maybe Integer , remoteAnnexForwardRetry :: Maybe Integer , remoteAnnexRetryDelay :: Maybe Seconds @@ -374,6 +373,8 @@ data RemoteGitConfig = RemoteGitConfig , remoteAnnexBwLimitDownload :: Maybe BwRate , remoteAnnexAllowUnverifiedDownloads :: Bool , remoteAnnexConfigUUID :: Maybe UUID + , remoteAnnexMaxGitBundles :: Int + , remoteAnnexAllowEncryptedGitRepo :: Bool {- These settings are specific to particular types of remotes - including special remotes. 
-} @@ -434,8 +435,6 @@ extractRemoteGitConfig r remotename = do , remoteAnnexSpeculatePresent = getbool "speculate-present" False , remoteAnnexBare = getmaybebool "bare" - , remoteAnnexAllowEncryptedGitRepo = - getbool "allow-encrypted-gitrepo" False , remoteAnnexRetry = getmayberead "retry" , remoteAnnexForwardRetry = getmayberead "forward-retry" , remoteAnnexRetryDelay = Seconds @@ -480,6 +479,10 @@ extractRemoteGitConfig r remotename = do , remoteAnnexDdarRepo = getmaybe "ddarrepo" , remoteAnnexHookType = notempty $ getmaybe "hooktype" , remoteAnnexExternalType = notempty $ getmaybe "externaltype" + , remoteAnnexMaxGitBundles = + fromMaybe 100 (getmayberead "max-git-bundles") + , remoteAnnexAllowEncryptedGitRepo = + getbool "allow-encrypted-gitrepo" False } where getbool k d = fromMaybe d $ getmaybebool k diff --git a/doc/git-annex.mdwn b/doc/git-annex.mdwn index 561c8c1836..8489f45c07 100644 --- a/doc/git-annex.mdwn +++ b/doc/git-annex.mdwn @@ -1648,16 +1648,16 @@ Remotes are configured using these settings in `.git/config`. remotes, and is set when using [[git-annex-initremote]](1) with the `--private` option. -* `remote.<name>.max-bundles`, `annex.max-bundles` +* `remote.<name>.max-git-bundles`, `annex.max-git-bundles` When using [[git-remote-annex]] to store a git repository in a special remote, this configures how many separate git bundle objects to store - in the special remote before re-pushing a single git bundle that contains + in the special remote before re-uploading a single git bundle that contains the entire git repository. The default is 100, which aims to avoid often needing to often re-upload, while preventing a new clone needing to download too many objects. Set to - 0 to disable re-pushing. + 0 to disable re-uploading. 
* `remote.<name>.allow-encrypted-gitrepo` diff --git a/doc/git-remote-annex.mdwn b/doc/git-remote-annex.mdwn index 3da0961c5b..912b9702b5 100644 --- a/doc/git-remote-annex.mdwn +++ b/doc/git-remote-annex.mdwn @@ -20,12 +20,6 @@ For example, to clone from a directory special remote: git clone annex::358ff77e-0bc3-11ef-bc49-872e6695c0e3?type=directory&encryption=none&directory=/mnt/foo/ -When a special remote needs some additional credentials to be provided, -they are not included in the URL, and need to be provided when cloning from -the special remote. That is typically done by setting environment -variables. Some special remotes may also need environment variables to be -set when pulling or pushing. - When configuring the url of an existing special remote, a shorter url of "annex::" is sufficient. For example: @@ -37,6 +31,12 @@ Configuring the url like that is automatically done when cloning from a special remote, but not by [[git-annex-initremote]](1) and [[git-annex-enableremote]](1). +When a special remote needs some additional credentials to be provided, +they are not included in the URL, and need to be provided when cloning from +the special remote. That is typically done by setting environment +variables. Some special remotes may also need environment variables to be +set when pulling or pushing. + The git repository is stored in the special remote using special annex objects with names starting with "GITMANIFEST--" and "GITBUNDLE--". For details about how the git repository is stored, see @@ -46,7 +46,7 @@ Pushes to a special remote are usually done incrementally. However, sometimes the whole git repository (but not the annex) needs to be re-uploaded. That is done when deleting a ref from the remote. It's also done when too many git bundles accumulate in the special remote, as -configured by the `remote.<name>.max-bundles` git config. +configured by the `remote.<name>.max-git-bundles` git config. 
Like any git repository, a git repository stored on a special remote can have conflicting things pushed to it from different places. This mostly diff --git a/doc/todo/git-remote-annex.mdwn b/doc/todo/git-remote-annex.mdwn index 5dd42c44f7..514e2f2a38 100644 --- a/doc/todo/git-remote-annex.mdwn +++ b/doc/todo/git-remote-annex.mdwn @@ -19,8 +19,6 @@ This is implememented and working. Remaining todo list for it: * Test incremental pushes that don't fast-forward. -* Support max-bundles config - * Cloning from an annex:: url with importtree=yes doesn't work (with or without exporttree=yes). This is because the ContentIdentifier db is not populated. From 24af51e66d14ac01d8ca52e0bbe257d6bba3c2e1 Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Tue, 14 May 2024 15:12:07 -0400 Subject: [PATCH 45/53] git-annex unused --from remote skips its git-remote-annex keys This turns out to only be necessary in edge cases. Most of the time, git-annex unused --from remote doesn't see git-remote-annex keys at all, because it does not record a location log for them. On the other hand, git-annex unused does find them, since it does not rely on the location log. And that's good because they're a local cache that the user should be able to drop. If, however, the user ran git-annex unused and then git-annex move --unused --to remote, the keys would have a location log for that remote. Then git-annex unused --from remote would see them, and would consider them unused. Even when they are present on the special remote they belong to. And that risks losing data if they drop the keys from the special remote, but didn't expect it would delete git branches they had pushed to it. So, make git-annex unused --from skip git-remote-annex keys whose uuid is the same as the remote.
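
The check described above can be sketched roughly as follows. This is a simplified illustration, not the code in this patch: it uses plain Strings and hypothetical names (Variety, SketchKey, belongsToRemote, skipOwnKeys) in place of git-annex's Key and UUID types. It assumes that a bundle key's name is the remote's UUID followed by "-" and a checksum, and that a manifest key's name is just the UUID.

```haskell
import Data.List (dropWhileEnd)

-- Simplified stand-in for git-annex's KeyVariety (hypothetical type).
data Variety = GitBundle | GitManifest | OtherVariety
  deriving (Eq, Show)

-- Simplified stand-in for a git-annex Key (hypothetical type).
data SketchKey = SketchKey
  { variety :: Variety
  , name :: String
  } deriving (Eq, Show)

-- Drop the trailing "-<checksum>" from a bundle key name, leaving the UUID.
bundleUUID :: String -> String
bundleUUID = dropLast . dropWhileEnd (/= '-')
  where
    dropLast s = take (length s - 1) s

-- Is the key a manifest or bundle key that belongs to the remote
-- with this uuid?
belongsToRemote :: String -> SketchKey -> Bool
belongsToRemote uuid k = case variety k of
  GitBundle -> bundleUUID (name k) == uuid
  GitManifest -> name k == uuid
  OtherVariety -> False

-- Keep only the keys that may be reported as unused on the remote.
skipOwnKeys :: String -> [SketchKey] -> [SketchKey]
skipOwnKeys uuid = filter (not . belongsToRemote uuid)
```

Under these assumptions, a manifest or bundle key that got a location log for its own remote (for example via git-annex move --unused --to remote) is filtered out, while bundle keys belonging to some other remote's uuid are still listed.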
--- Backend/GitRemoteAnnex.hs | 32 ++++++++++++++++++++++++++++++-- Command/Unused.hs | 6 ++++-- Types/Key.hs | 3 +++ doc/todo/git-remote-annex.mdwn | 5 ----- 4 files changed, 37 insertions(+), 9 deletions(-) diff --git a/Backend/GitRemoteAnnex.hs b/Backend/GitRemoteAnnex.hs index bb825a917e..84da8aee44 100644 --- a/Backend/GitRemoteAnnex.hs +++ b/Backend/GitRemoteAnnex.hs @@ -14,6 +14,7 @@ module Backend.GitRemoteAnnex ( backends, genGitBundleKey, genManifestKey, + isGitRemoteAnnexKey, ) where import Annex.Common @@ -24,9 +25,10 @@ import Utility.Metered import qualified Backend.Hash as Hash import qualified Data.ByteString.Short as S +import qualified Data.ByteString.Char8 as B8 backends :: [Backend] -backends = [gitbundle] +backends = [gitbundle, gitmanifest] gitbundle :: Backend gitbundle = Backend @@ -44,6 +46,19 @@ gitbundle = Backend Hash.cryptographicallySecure hash } +gitmanifest :: Backend +gitmanifest = Backend + { backendVariety = GitManifestKey + , genKey = Nothing + , verifyKeyContent = Nothing + , verifyKeyContentIncrementally = Nothing + , canUpgradeKey = Nothing + , fastMigrate = Nothing + , isStableKey = const True + , isCryptographicallySecure = False + , isCryptographicallySecureKey = const $ pure False + } + -- git bundle keys use the sha256 hash. hash :: Hash.Hash hash = Hash.SHA2Hash (HashSize 256) @@ -72,5 +87,18 @@ genGitBundleKey remoteuuid file meterupdate = do genManifestKey :: UUID -> Key genManifestKey u = mkKey $ \kd -> kd { keyName = S.toShort (fromUUID u) - , keyVariety = OtherKey "GITMANIFEST" + , keyVariety = GitManifestKey } + +{- Is the key a manifest or bundle key that belongs to the special remote + - with this uuid? -} +isGitRemoteAnnexKey :: UUID -> Key -> Bool +isGitRemoteAnnexKey u k = + case fromKey keyVariety k of + GitBundleKey -> sameuuid $ + -- Remove the checksum that comes after the UUID. + B8.dropEnd 1 . 
B8.dropWhileEnd (/= '-') + GitManifestKey -> sameuuid id + _ -> False + where + sameuuid f = fromUUID u == f (S.fromShort (fromKey keyName k)) diff --git a/Command/Unused.hs b/Command/Unused.hs index eebe24ca36..75cf94a3e2 100644 --- a/Command/Unused.hs +++ b/Command/Unused.hs @@ -1,6 +1,6 @@ {- git-annex command - - - Copyright 2010-2016 Joey Hess + - Copyright 2010-2024 Joey Hess - - Licensed under the GNU AGPL version 3 or higher. -} @@ -34,6 +34,7 @@ import Logs.View (is_branchView) import Annex.BloomFilter import qualified Database.Keys import Annex.InodeSentinal +import Backend.GitRemoteAnnex (isGitRemoteAnnexKey) import qualified Data.Map as M import qualified Data.ByteString as S @@ -104,7 +105,8 @@ checkRemoteUnused remotename refspec = go =<< Remote.nameToUUID remotename _ <- check "" (remoteUnusedMsg r remotename) (remoteunused u) 0 next $ return True remoteunused u = loggedKeysFor u >>= \case - Just ks -> excludeReferenced refspec ks + Just ks -> filter (not . isGitRemoteAnnexKey u) + <$> excludeReferenced refspec ks Nothing -> giveup "This repository is read-only." check :: String -> ([(Int, Key)] -> String) -> Annex [Key] -> Int -> Annex Int diff --git a/Types/Key.hs b/Types/Key.hs index b883ac0d9b..2d901c0af7 100644 --- a/Types/Key.hs +++ b/Types/Key.hs @@ -220,6 +220,7 @@ data KeyVariety | URLKey | VURLKey | GitBundleKey + | GitManifestKey -- A key that is handled by some external backend. 
| ExternalKey S.ByteString HasExt -- Some repositories may contain keys of other varieties, @@ -255,6 +256,7 @@ hasExt WORMKey = False hasExt URLKey = False hasExt VURLKey = False hasExt GitBundleKey = False +hasExt GitManifestKey = False hasExt (ExternalKey _ (HasExt b)) = b hasExt (OtherKey s) = (snd <$> S8.unsnoc s) == Just 'E' @@ -285,6 +287,7 @@ formatKeyVariety v = case v of URLKey -> "URL" VURLKey -> "VURL" GitBundleKey -> "GITBUNDLE" + GitManifestKey -> "GITMANIFEST" ExternalKey s e -> adde e ("X" <> s) OtherKey s -> s where diff --git a/doc/todo/git-remote-annex.mdwn b/doc/todo/git-remote-annex.mdwn index 514e2f2a38..bbb779a7b8 100644 --- a/doc/todo/git-remote-annex.mdwn +++ b/doc/todo/git-remote-annex.mdwn @@ -10,11 +10,6 @@ will be available to users who don't use datalad. This is implememented and working. Remaining todo list for it: -* git-annex unused --from remote should not treat git manifest and bundle - keys as unused, since that could lead to data loss. It's fine for - git-annex unused on the local repo to treat those as unused since they're - only a local cache. - * Test pushes that delete branches. * Test incremental pushes that don't fast-forward. From 23c4125ed48df553ceefe24858b5c695a9674d2f Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Tue, 14 May 2024 15:23:45 -0400 Subject: [PATCH 46/53] mention other commands shipped with git-annex in SEE ALSO in man page --- doc/git-annex.mdwn | 4 ++++ doc/todo/git-remote-annex.mdwn | 2 +- 2 files changed, 5 insertions(+), 1 deletion(-) diff --git a/doc/git-annex.mdwn b/doc/git-annex.mdwn index 8489f45c07..2053269cf2 100644 --- a/doc/git-annex.mdwn +++ b/doc/git-annex.mdwn @@ -2215,6 +2215,10 @@ More git-annex documentation is available on its web site, If git-annex is installed from a package, a copy of its documentation should be included, in, for example, `/usr/share/doc/git-annex/`. 
+* [[git-annex-shell]](1) +* [[git-remote-annex]](1) +* [[git-remote-tor-annex]](1) + # AUTHOR Joey Hess diff --git a/doc/todo/git-remote-annex.mdwn b/doc/todo/git-remote-annex.mdwn index bbb779a7b8..9e83e9b44a 100644 --- a/doc/todo/git-remote-annex.mdwn +++ b/doc/todo/git-remote-annex.mdwn @@ -19,7 +19,7 @@ This is implememented and working. Remaining todo list for it: db is not populated. * Need to mention git-remote-annex in special remotes page, and perhaps - write a tip for it. Also link to it from git-annex man page. + write a tip for it. * It would be nice if git-annex could generate an annex:: url for a special remote and show it to the user, eg when From 0722c504c5742377c1985cae376a54898629be15 Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Tue, 14 May 2024 15:31:16 -0400 Subject: [PATCH 47/53] update docs for git-remote-annex --- doc/special_remotes.mdwn | 21 +++++++++++++++------ doc/todo/git-remote-annex.mdwn | 3 --- 2 files changed, 15 insertions(+), 9 deletions(-) diff --git a/doc/special_remotes.mdwn b/doc/special_remotes.mdwn index 7399ba34a8..04f2feb9c6 100644 --- a/doc/special_remotes.mdwn +++ b/doc/special_remotes.mdwn @@ -5,8 +5,7 @@ directory. But, git-annex also extends git's concept of remotes, with these special types of remotes. These can be used by git-annex to store and retrieve -the content of files. They cannot be used by other git commands, and -the git history is not stored in them. +the content of files. * [[adb]] (for Android devices) * [[Amazon_Glacier|glacier]] @@ -94,15 +93,25 @@ To initialize a new special remote, use the special remote you want to use for details about configuration and examples of how to initremote it. -Once a special remote has been initialize, other clones of the repository can +Once a special remote has been initialized, other clones of the repository can also enable it, by using [[git-annex enableremote|git-annex-enableremote]] with the same name that was used to initialize it. 
(Run the command without any name to get a list of available special remotes.) Initializing or enabling a special remote adds it as a remote of your git -repository. You can't use git commands like `git pull` with the remote -(usually, there are exceptions like [[git-lfs]]), but you can use git-annex -commands. +repository. + +## Storing a git repository in a special remote + +Most special remotes do not include a clone of the git repository +by default, so you can't use commands like `git push` and `git pull` +with them. (There are some exceptions like [[git-lfs]].) + +But it is possible to store a git repository in many special remotes, +using the [[git-remote-annex]] command. This involves configuring +the remote with an "annex::" url. It's even possible to `git clone` +from a special remote using such an url. See the documentation of +[[git-remote-annex]] for details. ## Unused content on special remotes diff --git a/doc/todo/git-remote-annex.mdwn b/doc/todo/git-remote-annex.mdwn index 9e83e9b44a..a027b36556 100644 --- a/doc/todo/git-remote-annex.mdwn +++ b/doc/todo/git-remote-annex.mdwn @@ -18,9 +18,6 @@ This is implememented and working. Remaining todo list for it: (with or without exporttree=yes). This is because the ContentIdentifier db is not populated. -* Need to mention git-remote-annex in special remotes page, and perhaps - write a tip for it. - * It would be nice if git-annex could generate an annex:: url for a special remote and show it to the user, eg when they have set the shorthand "annex::" url, so they know the full url. 
From 7dd2a67c418b90cf71ad9223a06038bf03f7886c Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Tue, 14 May 2024 15:33:47 -0400 Subject: [PATCH 48/53] fix names of new git configs --- doc/git-annex.mdwn | 4 ++-- doc/git-remote-annex.mdwn | 2 +- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/doc/git-annex.mdwn b/doc/git-annex.mdwn index 2053269cf2..bb0a6172cf 100644 --- a/doc/git-annex.mdwn +++ b/doc/git-annex.mdwn @@ -1648,7 +1648,7 @@ Remotes are configured using these settings in `.git/config`. remotes, and is set when using [[git-annex-initremote]](1) with the `--private` option. -* `remote..max-git-bundles`, `annex.max-git-bundles` +* `remote..annex-max-git-bundles`, `annex.max-git-bundles` When using [[git-remote-annex]] to store a git repository in a special remote, this configures how many separate git bundle objects to store @@ -1659,7 +1659,7 @@ Remotes are configured using these settings in `.git/config`. while preventing a new clone needing to download too many objects. Set to 0 to disable re-uploading. -* `remote..allow-encrypted-gitrepo` +* `remote..annex-allow-encrypted-gitrepo` Setting this to true allows using [[git-remote-annex]] to push the git repository to an encrypted special remote. diff --git a/doc/git-remote-annex.mdwn b/doc/git-remote-annex.mdwn index 912b9702b5..52d9a11ccb 100644 --- a/doc/git-remote-annex.mdwn +++ b/doc/git-remote-annex.mdwn @@ -46,7 +46,7 @@ Pushes to a special remote are usually done incrementally. However, sometimes the whole git repository (but not the annex) needs to be re-uploaded. That is done when deleting a ref from the remote. It's also done when too many git bundles accumulate in the special remote, as -configured by the `remote..max-git-bundles` git config. +configured by the `remote..annex-max-git-bundles` git config. Like any git repository, a git repository stored on a special remote can have conflicting things pushed to it from different places. 
This mostly From 169e673ad4034d890782106b0dc8d16c855bf677 Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Tue, 14 May 2024 16:01:24 -0400 Subject: [PATCH 49/53] result of some testing --- doc/todo/git-remote-annex.mdwn | 17 +++++++++++++++-- 1 file changed, 15 insertions(+), 2 deletions(-) diff --git a/doc/todo/git-remote-annex.mdwn b/doc/todo/git-remote-annex.mdwn index a027b36556..3d1bb8e75a 100644 --- a/doc/todo/git-remote-annex.mdwn +++ b/doc/todo/git-remote-annex.mdwn @@ -10,9 +10,22 @@ will be available to users who don't use datalad. This is implememented and working. Remaining todo list for it: -* Test pushes that delete branches. +* Bug: Problem with forced push: -* Test incremental pushes that don't fast-forward. + joey@darkstar:~/tmp/bench5/a#ook>git push d ook --force + fatal: bad revision 'refs/namespaces/git-remote-annex/d5a263c6-1c28-432a-a161-914476ae5390/refs/heads/git-annex' + Push failed (user error (git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","bundle","create","--quiet","/home/joey/tmp/GITBUNDLE2690124-1","--stdin"] exited 128)) + warning: helper reported unexpected status of push + Everything up-to-date + + This was preceeded by pushing the git-annex branch and master, + then making 3 commits and pushing each of them in turn. + Then reset back one commit, try to push (which fails as + non-fast-forward), and force push as shown then fails. + + So the problem is not the forced push itself, which works + if a non-forced push is not tried before it, but something + with that specific situation. * Cloning from an annex:: url with importtree=yes doesn't work (with or without exporttree=yes). This is because the ContentIdentifier From 2dfffa062111847d7a6a4a3daf85121e53535236 Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Tue, 14 May 2024 16:17:27 -0400 Subject: [PATCH 50/53] bugfix When pushing branch foo, we don't want to delete other tracking branches. In particular, a full push needs all the tracking branches. 
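
The intended behaviour can be sketched as a pure function. This is only an illustration under simplifying assumptions: refs and shas are plain Strings, and RefChange and planTrackingRefs are hypothetical names that do not appear in the patch. The real code performs the deletions and updates in the git repository rather than returning a plan.

```haskell
import qualified Data.Map as M

type RefName = String
type ShaStr = String

-- A change to apply to the tracking refs (hypothetical type).
data RefChange = Delete RefName | Update RefName ShaStr
  deriving (Eq, Show)

-- Work out how to get from the old tracking refs to the new ones.
-- Tracking refs missing from the new map are only deleted when
-- deleteold is set, so pushing one branch leaves the tracking refs
-- of other branches alone.
planTrackingRefs
  :: Bool
  -> M.Map RefName ShaStr
  -> M.Map RefName ShaStr
  -> [RefChange]
planTrackingRefs deleteold old new = deletions ++ updates
  where
    deletions
      | deleteold = [ Delete r | r <- M.keys old, M.notMember r new ]
      | otherwise = []
    -- Only refs that are new or whose sha changed need an update.
    updates =
      [ Update r s | (r, s) <- M.toList new, M.lookup r old /= Just s ]
```

With deleteold unset, as push now uses while calculating what to push, a map that only mentions branch foo yields no Delete for the other branches.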
--- CmdLine/GitRemoteAnnex.hs | 18 ++++++++++-------- doc/todo/git-remote-annex.mdwn | 17 +---------------- 2 files changed, 11 insertions(+), 24 deletions(-) diff --git a/CmdLine/GitRemoteAnnex.hs b/CmdLine/GitRemoteAnnex.hs index 6bc802401c..0af908c5b7 100644 --- a/CmdLine/GitRemoteAnnex.hs +++ b/CmdLine/GitRemoteAnnex.hs @@ -144,7 +144,7 @@ list st rmt forpush = do -- was listed. This is necessary in order for a full repush to know -- what to push. when forpush $ - updateTrackingRefs rmt trackingrefmap + updateTrackingRefs True rmt trackingrefmap -- Respond to git with a list of refs. liftIO $ do @@ -198,7 +198,7 @@ push :: State -> Remote -> [String] -> Annex ([String], State) push st rmt ls = do let (refspecs, ls') = collectRefSpecs ls (responses, trackingrefs) <- calc refspecs ([], trackingRefs st) - updateTrackingRefs rmt trackingrefs + updateTrackingRefs False rmt trackingrefs (ok, st') <- if M.null trackingrefs then pushEmpty st rmt else if any forcedPush refspecs @@ -211,7 +211,7 @@ push st rmt ls = do return (ls', st' { trackingRefs = trackingrefs }) else do -- Restore the old tracking refs - updateTrackingRefs rmt (trackingRefs st) + updateTrackingRefs True rmt (trackingRefs st) sendresponses $ map (const "error push failed") refspecs return (ls', st') @@ -835,15 +835,17 @@ toTrackingRef rmt (Ref r) = Ref $ trackingRefPrefix rmt <> r fromTrackingRef :: Remote -> Ref -> Ref fromTrackingRef rmt = Git.Ref.removeBase (decodeBS (trackingRefPrefix rmt)) --- Update the tracking refs to be those in the map, and no others. -updateTrackingRefs :: Remote -> M.Map Ref Sha -> Annex () -updateTrackingRefs rmt new = do +-- Update the tracking refs to be those in the map. +-- When deleteold is set, any other tracking refs are deleted. 
+updateTrackingRefs :: Bool -> Remote -> M.Map Ref Sha -> Annex () +updateTrackingRefs deleteold rmt new = do old <- inRepo $ Git.Ref.forEachRef [Param (decodeBS (trackingRefPrefix rmt))] -- Delete all tracking refs that are not in the map. - forM_ (filter (\p -> M.notMember (fst p) new) old) $ \(s, r) -> - inRepo $ Git.Ref.delete s r + when deleteold $ + forM_ (filter (\p -> M.notMember (fst p) new) old) $ \(s, r) -> + inRepo $ Git.Ref.delete s r -- Update all changed tracking refs. let oldmap = M.fromList (map (\(s, r) -> (r, s)) old) diff --git a/doc/todo/git-remote-annex.mdwn b/doc/todo/git-remote-annex.mdwn index 3d1bb8e75a..9e69bcdb2c 100644 --- a/doc/todo/git-remote-annex.mdwn +++ b/doc/todo/git-remote-annex.mdwn @@ -10,22 +10,7 @@ will be available to users who don't use datalad. This is implememented and working. Remaining todo list for it: -* Bug: Problem with forced push: - - joey@darkstar:~/tmp/bench5/a#ook>git push d ook --force - fatal: bad revision 'refs/namespaces/git-remote-annex/d5a263c6-1c28-432a-a161-914476ae5390/refs/heads/git-annex' - Push failed (user error (git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","bundle","create","--quiet","/home/joey/tmp/GITBUNDLE2690124-1","--stdin"] exited 128)) - warning: helper reported unexpected status of push - Everything up-to-date - - This was preceeded by pushing the git-annex branch and master, - then making 3 commits and pushing each of them in turn. - Then reset back one commit, try to push (which fails as - non-fast-forward), and force push as shown then fails. - - So the problem is not the forced push itself, which works - if a non-forced push is not tried before it, but something - with that specific situation. +* Test incremental push edge cases involving checkprereq. * Cloning from an annex:: url with importtree=yes doesn't work (with or without exporttree=yes). 
This is because the ContentIdentifier From d24d8870c54313e92dcccecafff17dd5f477d5e4 Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Wed, 15 May 2024 14:33:13 -0400 Subject: [PATCH 51/53] todo --- doc/todo/git-remote-annex.mdwn | 41 +++++++++++++++++----------------- 1 file changed, 20 insertions(+), 21 deletions(-) diff --git a/doc/todo/git-remote-annex.mdwn b/doc/todo/git-remote-annex.mdwn index 9e69bcdb2c..36b3cf9155 100644 --- a/doc/todo/git-remote-annex.mdwn +++ b/doc/todo/git-remote-annex.mdwn @@ -10,6 +10,26 @@ will be available to users who don't use datalad. This is implememented and working. Remaining todo list for it: +* Cloning writes the new special remote config into remote.log, + and *deletes* other special remote configs. + + The remote config from the url may be slightly different as well + than the existing one. Cloning should not write it. + +* The race condition described in + [[!commit 797f27ab0517e0021363791ff269300f2ba095a5]] + where before git-annex init is run in a repo, + using git-remote-annex and at the same time git-annex init can lose + changes that the latter command (and ones after it) write to the + git-annex branch. + + This should be fixable by making git-remote-annex not write to the + git-annex branch, but to eg, a temporary journal directory. + + Also, when the remote uses importtree=yes, pushing to it updates + content identifiers, which currently get recorded in the git-annex + branch. It would be good to avoid that being written as well. + * Test incremental push edge cases involving checkprereq. * Cloning from an annex:: url with importtree=yes doesn't work @@ -68,24 +88,3 @@ This is implememented and working. Remaining todo list for it: They would have to use git fetch --prune to notice the deletion. Once the user does notice, they can re-push their ref to recover. Can this be improved? 
 - -* The race condition described in - [[!commit 797f27ab0517e0021363791ff269300f2ba095a5]] - where before git-annex init is run in a repo, - using git-remote-annex and at the same time git-annex init can lose - changes that the latter command (and ones after it) write to the - git-annex branch. - - This should be fixable by making git-remote-annex not write to the - git-annex branch, but to eg, a temporary journal directory. - - Also, cloning currently writes the special remote config into remote.log, - which might be slightly different in some way than the config in - remote.log for the same remote. cloning should not change the stored - config of a remote, but that branch write was necessary. So throwing - away the branch write is also good for this case. - - Also, when the remote uses importtree=yes, pushing to it updates - content identifiers, which currently get recorded in the git-annex - branch. It would be good to avoid that being written as well. - From adcebbae47271dcd539c112952094e9aa9a84b55 Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Wed, 15 May 2024 17:33:38 -0400 Subject: [PATCH 52/53] clean up git-remote-annex git-annex branch handling Implemented alternateJournal, which git-remote-annex uses to avoid any writes to the git-annex branch while setting up a special remote from an annex:: url. That prevents the remote.log from being overwritten with the special remote configuration from the url, which might not be 100% the same as the existing special remote configuration. It also prevents the overwrite from deleting other entries that were already in the remote.log. Also, when the branch was created by git-remote-annex, only delete it at the end if nothing else has been written to it by another command. This fixes the race condition described in 797f27ab0517e0021363791ff269300f2ba095a5, where git-remote-annex set up the branch and git-annex init and other commands were run at the same time and their writes to the branch were lost.
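
The two rules described above can be sketched in isolation. This is a simplified illustration, not the patch itself: SketchBranchState, StartBranch, journalDir and shouldDeleteBranch are hypothetical names, paths and shas are plain Strings, and the real code applies the same alternate directory to both the regular and the private journal.

```haskell
import Data.Maybe (fromMaybe)

-- Stand-in for the part of BranchState that matters here.
newtype SketchBranchState = SketchBranchState
  { alternateJournal :: Maybe FilePath }

-- Pick the journal directory: the alternate one when it is set,
-- otherwise the usual "journal" directory under the annex directory.
journalDir :: SketchBranchState -> FilePath -> FilePath
journalDir st annexdir = withTrailingSlash $
  fromMaybe (annexdir ++ "/journal") (alternateJournal st)
  where
    withTrailingSlash d
      | not (null d) && last d == '/' = d
      | otherwise = d ++ "/"

-- What the git-annex branch was when the command started.
data StartBranch = ExistedAlready String | CreatedEmpty String

-- The branch is only deleted at the end when this command created it
-- and it still points at the same sha, meaning nothing else has
-- written to it in the meantime.
shouldDeleteBranch :: StartBranch -> String -> Bool
shouldDeleteBranch (ExistedAlready _) _ = False
shouldDeleteBranch (CreatedEmpty start) current = start == current
```

Because journal writes go to a temporary directory while the alternate is set, throwing that directory away discards them, and the sha comparison keeps a concurrent git-annex init from losing its writes.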
--- Annex/Branch.hs | 3 +- Annex/Journal.hs | 41 ++++++++++++---------- Annex/Locations.hs | 17 +++++---- CmdLine/GitRemoteAnnex.hs | 64 +++++++++++++++++++++++----------- Types/BranchState.hs | 5 ++- doc/todo/git-remote-annex.mdwn | 20 ----------- 6 files changed, 84 insertions(+), 66 deletions(-) diff --git a/Annex/Branch.hs b/Annex/Branch.hs index bcc9ae114d..717cbc0400 100644 --- a/Annex/Branch.hs +++ b/Annex/Branch.hs @@ -727,7 +727,8 @@ stageJournal :: JournalLocked -> Annex () -> Annex () stageJournal jl commitindex = withIndex $ withOtherTmp $ \tmpdir -> do prepareModifyIndex jl g <- gitRepo - let dir = gitAnnexJournalDir g + st <- getState + let dir = gitAnnexJournalDir st g (jlogf, jlogh) <- openjlog (fromRawFilePath tmpdir) withHashObjectHandle $ \h -> withJournalHandle gitAnnexJournalDir $ \jh -> diff --git a/Annex/Journal.hs b/Annex/Journal.hs index ea6327606d..54dd3317ef 100644 --- a/Annex/Journal.hs +++ b/Annex/Journal.hs @@ -7,7 +7,7 @@ - All files in the journal must be a series of lines separated by - newlines. - - - Copyright 2011-2022 Joey Hess + - Copyright 2011-2024 Joey Hess - - Licensed under the GNU AGPL version 3 or higher. -} @@ -23,6 +23,8 @@ import qualified Git import Annex.Perms import Annex.Tmp import Annex.LockFile +import Annex.BranchState +import Types.BranchState import Utility.Directory.Stream import qualified Utility.RawFilePath as R @@ -82,9 +84,10 @@ privateUUIDsKnown' = not . S.null . annexPrivateRepos . 
Annex.gitconfig -} setJournalFile :: Journalable content => JournalLocked -> RegardingUUID -> RawFilePath -> content -> Annex () setJournalFile _jl ru file content = withOtherTmp $ \tmp -> do + st <- getState jd <- fromRepo =<< ifM (regardingPrivateUUID ru) - ( return gitAnnexPrivateJournalDir - , return gitAnnexJournalDir + ( return (gitAnnexPrivateJournalDir st) + , return (gitAnnexJournalDir st) ) -- journal file is written atomically let jfile = journalFile file @@ -106,9 +109,10 @@ newtype AppendableJournalFile = AppendableJournalFile (RawFilePath, RawFilePath) - branch. -} checkCanAppendJournalFile :: JournalLocked -> RegardingUUID -> RawFilePath -> Annex (Maybe AppendableJournalFile) checkCanAppendJournalFile _jl ru file = do + st <- getState jd <- fromRepo =<< ifM (regardingPrivateUUID ru) - ( return gitAnnexPrivateJournalDir - , return gitAnnexJournalDir + ( return (gitAnnexPrivateJournalDir st) + , return (gitAnnexJournalDir st) ) let jfile = jd P. journalFile file ifM (liftIO $ R.doesPathExist jfile) @@ -176,14 +180,12 @@ data GetPrivate = GetPrivate Bool -} getJournalFileStale :: GetPrivate -> RawFilePath -> Annex JournalledContent getJournalFileStale (GetPrivate getprivate) file = do - -- Optimisation to avoid a second MVar access. st <- Annex.getState id - let g = Annex.repo st liftIO $ if getprivate && privateUUIDsKnown' st then do - x <- getfrom (gitAnnexJournalDir g) - getfrom (gitAnnexPrivateJournalDir g) >>= \case + x <- getfrom (gitAnnexJournalDir (Annex.branchstate st) (Annex.repo st)) + getfrom (gitAnnexPrivateJournalDir (Annex.branchstate st) (Annex.repo st)) >>= \case Nothing -> return $ case x of Nothing -> NoJournalledContent Just b -> JournalledContent b @@ -193,7 +195,7 @@ getJournalFileStale (GetPrivate getprivate) file = do -- happens in a merge of two -- git-annex branches. Just x' -> x' <> y - else getfrom (gitAnnexJournalDir g) >>= return . \case + else getfrom (gitAnnexJournalDir (Annex.branchstate st) (Annex.repo st)) >>= return . 
\case Nothing -> NoJournalledContent Just b -> JournalledContent b where @@ -219,18 +221,20 @@ discardIncompleteAppend v {- List of existing journal files in a journal directory, but without locking, - may miss new ones just being added, or may have false positives if the - journal is staged as it is run. -} -getJournalledFilesStale :: (Git.Repo -> RawFilePath) -> Annex [RawFilePath] +getJournalledFilesStale :: (BranchState -> Git.Repo -> RawFilePath) -> Annex [RawFilePath] getJournalledFilesStale getjournaldir = do - g <- gitRepo - fs <- liftIO $ catchDefaultIO [] $ - getDirectoryContents $ fromRawFilePath (getjournaldir g) + st <- Annex.getState id + let d = getjournaldir (Annex.branchstate st) (Annex.repo st) + fs <- liftIO $ catchDefaultIO [] $ + getDirectoryContents (fromRawFilePath d) return $ filter (`notElem` [".", ".."]) $ map (fileJournal . toRawFilePath) fs {- Directory handle open on a journal directory. -} -withJournalHandle :: (Git.Repo -> RawFilePath) -> (DirectoryHandle -> IO a) -> Annex a +withJournalHandle :: (BranchState -> Git.Repo -> RawFilePath) -> (DirectoryHandle -> IO a) -> Annex a withJournalHandle getjournaldir a = do - d <- fromRepo getjournaldir + st <- Annex.getState id + let d = getjournaldir (Annex.branchstate st) (Annex.repo st) bracket (opendir d) (liftIO . closeDirectory) (liftIO . a) where -- avoid overhead of creating the journal directory when it already @@ -239,9 +243,10 @@ withJournalHandle getjournaldir a = do `catchIO` (const (createAnnexDirectory d >> opendir d)) {- Checks if there are changes in the journal. 
-} -journalDirty :: (Git.Repo -> RawFilePath) -> Annex Bool +journalDirty :: (BranchState -> Git.Repo -> RawFilePath) -> Annex Bool journalDirty getjournaldir = do - d <- fromRawFilePath <$> fromRepo getjournaldir + st <- getState + d <- fromRawFilePath <$> fromRepo (getjournaldir st) liftIO $ (not <$> isDirectoryEmpty d) `catchIO` (const $ doesDirectoryExist d) diff --git a/Annex/Locations.hs b/Annex/Locations.hs index 9b465dce8d..ee5b6d690f 100644 --- a/Annex/Locations.hs +++ b/Annex/Locations.hs @@ -118,6 +118,7 @@ import Key import Types.UUID import Types.GitConfig import Types.Difference +import Types.BranchState import qualified Git import qualified Git.Types as Git import Git.FilePath @@ -528,15 +529,19 @@ gitAnnexTransferDir r = {- .git/annex/journal/ is used to journal changes made to the git-annex - branch -} -gitAnnexJournalDir :: Git.Repo -> RawFilePath -gitAnnexJournalDir r = - P.addTrailingPathSeparator $ gitAnnexDir r P. "journal" +gitAnnexJournalDir :: BranchState -> Git.Repo -> RawFilePath +gitAnnexJournalDir st r = P.addTrailingPathSeparator $ + case alternateJournal st of + Nothing -> gitAnnexDir r P. "journal" + Just d -> d {- .git/annex/journal.private/ is used to journal changes regarding private - repositories. -} -gitAnnexPrivateJournalDir :: Git.Repo -> RawFilePath -gitAnnexPrivateJournalDir r = - P.addTrailingPathSeparator $ gitAnnexDir r P. "journal-private" +gitAnnexPrivateJournalDir :: BranchState -> Git.Repo -> RawFilePath +gitAnnexPrivateJournalDir st r = P.addTrailingPathSeparator $ + case alternateJournal st of + Nothing -> gitAnnexDir r P. "journal-private" + Just d -> d {- Lock file for the journal. 
-} gitAnnexJournalLock :: Git.Repo -> RawFilePath diff --git a/CmdLine/GitRemoteAnnex.hs b/CmdLine/GitRemoteAnnex.hs index 0af908c5b7..d1eac6dfd8 100644 --- a/CmdLine/GitRemoteAnnex.hs +++ b/CmdLine/GitRemoteAnnex.hs @@ -21,6 +21,7 @@ import qualified Git.Remote import qualified Git.Remote.Remove import qualified Annex.SpecialRemote as SpecialRemote import qualified Annex.Branch +import qualified Annex.BranchState import qualified Types.Remote as Remote import qualified Logs.Remote import Remote.Helper.Encryptable (parseEncryptionMethod) @@ -32,6 +33,7 @@ import Types.RemoteConfig import Types.ProposedAccepted import Types.Export import Types.GitConfig +import Types.BranchState import Types.Difference import Types.Crypto import Git.Types @@ -44,6 +46,7 @@ import Annex.SpecialRemote.Config import Remote.List import Remote.List.Util import Utility.Tmp +import Utility.Tmp.Dir import Utility.Env import Utility.Metered @@ -485,10 +488,9 @@ withSpecialRemote cfg@(SpecialRemoteConfig {}) sab a = case specialRemoteName cf | otherwise -> giveup $ "The uuid in the annex:: url does not match the uuid of the remote named " ++ remotename -- When cloning from an annex:: url, -- this is used to set up the origin remote. - Nothing -> (initremote remotename >>= a) - `finally` cleanupInitialization sab - Nothing -> inittempremote - `finally` cleanupInitialization sab + Nothing -> specialRemoteFromUrl sab + (initremote remotename >>= a) + Nothing -> specialRemoteFromUrl sab inittempremote where -- Initialize a new special remote with the provided configuration -- and name. @@ -869,27 +871,48 @@ getRepo = getEnv "GIT_WORK_TREE" >>= \case -- Records what the git-annex branch was at the beginning of this command. data StartAnnexBranch - = AnnexBranchExistedAlready Ref - | AnnexBranchCreatedEmpty Ref + = AnnexBranchExistedAlready Sha + | AnnexBranchCreatedEmpty Sha +{- Run early in the command, gets the initial state of the git-annex + - branch. 
+ - + - If the branch does not exist yet, it's created here. This is done + - because it's hard to avoid the branch being created by this command, + - so tracking the sha of the created branch allows cleaning it up later. + -} startAnnexBranch :: Annex StartAnnexBranch startAnnexBranch = ifM (null <$> Annex.Branch.siblingBranches) ( AnnexBranchCreatedEmpty <$> Annex.Branch.getBranch , AnnexBranchExistedAlready <$> Annex.Branch.getBranch ) --- This is run after git has used this process to fetch or push from a --- special remote that was specified using a git-annex url. If the git --- repository was not initialized for use by git-annex already, it is still --- not initialized at this point. +-- This runs an action that will set up a special remote that +-- was specified using an annex url. -- +-- Setting up a special remote needs to write its config to the git-annex +-- branch. And using a special remote may also write to the branch. +-- But in this case, writes to the git-annex branch need to be avoided, +-- so that cleanupInitialization can leave things in the right state. +-- +-- So this prevents commits to the git-annex branch, and redirects all +-- journal writes to a temporary directory, so that all writes +-- to the git-annex branch by the action will be discarded. +specialRemoteFromUrl :: StartAnnexBranch -> Annex a -> Annex a +specialRemoteFromUrl sab a = withTmpDir "journal" $ \tmpdir -> do + Annex.overrideGitConfig $ \c -> + c { annexAlwaysCommit = False } + Annex.BranchState.changeState $ \st -> + st { alternateJournal = Just (toRawFilePath tmpdir) } + a `finally` cleanupInitialization sab + -- If the git-annex branch did not exist when this command started, --- the current contents of it were created in passing by this command, --- which is hard to avoid. 
But if a git-annex branch is fetched from the --- special remote and contains Differences, it would not be possible to --- merge it into the git-annex branch that was created while running this --- command. To avoid that problem, when the git-annex branch was created --- at the start of this command, it's deleted. +-- it was created empty by this command, and this command has avoided +-- making any other commits to it. If nothing else has written to the +-- branch while this command was running, the branch will be deleted. +-- That allows for the git-annex branch that is fetched from the special +-- remote to contain Differences, which would prevent it from being merged +-- with the git-annex branch created by this command. -- -- If there is still not a sibling git-annex branch, this deletes all annex -- objects for git bundles from the annex objects directory, and deletes @@ -905,10 +928,11 @@ cleanupInitialization :: StartAnnexBranch -> Annex () cleanupInitialization sab = do case sab of AnnexBranchExistedAlready _ -> noop - AnnexBranchCreatedEmpty _ -> do - inRepo $ Git.Branch.delete Annex.Branch.fullname - indexfile <- fromRepo gitAnnexIndex - liftIO $ removeWhenExistsWith R.removeLink indexfile + AnnexBranchCreatedEmpty r -> + whenM ((r ==) <$> Annex.Branch.getBranch) $ do + inRepo $ Git.Branch.delete Annex.Branch.fullname + indexfile <- fromRepo gitAnnexIndex + liftIO $ removeWhenExistsWith R.removeLink indexfile ifM Annex.Branch.hasSibling ( do autoInitialize' (pure True) remoteList diff --git a/Types/BranchState.hs b/Types/BranchState.hs index 129a17b349..d79a1c70a6 100644 --- a/Types/BranchState.hs +++ b/Types/BranchState.hs @@ -36,7 +36,10 @@ data BranchState = BranchState -- process need to be noticed while the current process is running? -- (This makes the journal always be read, and avoids using the -- cache.) 
+ , alternateJournal :: Maybe RawFilePath + -- ^ use this directory for all journals, rather than the + -- gitAnnexJournalDir and gitAnnexPrivateJournalDir. } startBranchState :: BranchState -startBranchState = BranchState False False False [] [] [] False +startBranchState = BranchState False False False [] [] [] False Nothing diff --git a/doc/todo/git-remote-annex.mdwn b/doc/todo/git-remote-annex.mdwn index 36b3cf9155..3608360983 100644 --- a/doc/todo/git-remote-annex.mdwn +++ b/doc/todo/git-remote-annex.mdwn @@ -10,26 +10,6 @@ will be available to users who don't use datalad. This is implememented and working. Remaining todo list for it: -* Cloning writes the new special remote config into remote.log, - and *deletes* other special remote configs. - - The remote config from the url may be slightly different as well - than the existing one. Cloning should not write it. - -* The race condition described in - [[!commit 797f27ab0517e0021363791ff269300f2ba095a5]] - where before git-annex init is run in a repo, - using git-remote-annex and at the same time git-annex init can lose - changes that the latter command (and ones after it) write to the - git-annex branch. - - This should be fixable by making git-remote-annex not write to the - git-annex branch, but to eg, a temporary journal directory. - - Also, when the remote uses importtree=yes, pushing to it updates - content identifiers, which currently get recorded in the git-annex - branch. It would be good to avoid that being written as well. - * Test incremental push edge cases involving checkprereq. 
* Cloning from an annex:: url with importtree=yes doesn't work From b1b6e35d4c2790e324e388625de79384f20849b3 Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Wed, 15 May 2024 17:41:55 -0400 Subject: [PATCH 53/53] reorg todo --- doc/todo/git-remote-annex.mdwn | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/doc/todo/git-remote-annex.mdwn b/doc/todo/git-remote-annex.mdwn index 3608360983..2d46d5701c 100644 --- a/doc/todo/git-remote-annex.mdwn +++ b/doc/todo/git-remote-annex.mdwn @@ -16,6 +16,15 @@ This is implememented and working. Remaining todo list for it: (with or without exporttree=yes). This is because the ContentIdentifier db is not populated. +* Improve recovery from interrupted push by using outManifest to clean up + after it. (Requires populating outManifest.) + +* See XXX in uploadManifest about recovering from a situation + where the remote is left with a deleted manifest when a push + is interrupted part way through. This should be recoverable + by caching the manifest locally and re-uploading it when + the remote has no manifest or prompting the user to merge and re-push. + * It would be nice if git-annex could generate an annex:: url for a special remote and show it to the user, eg when they have set the shorthand "annex::" url, so they know the full url. @@ -31,15 +40,6 @@ This is implememented and working. Remaining todo list for it: git-remote-annex, since the user would not need to set up the url themselves. (Also it would then avoid setting `skipFetchAll = true`) -* Improve recovery from interrupted push by using outManifest to clean up - after it. (Requires populating outManifest.) - -* See XXX in uploadManifest about recovering from a situation - where the remote is left with a deleted manifest when a push - is interrupted part way through. This should be recoverable - by caching the manifest locally and re-uploading it when - the remote has no manifest or prompting the user to merge and re-push. 
- * datalad-annex supports cloning from the web special remote, using an url that contains the result of pushing to eg, a directory special remote.