The file corruption consists of each chunk of the file being duplicated.
Since chunks are typically a fixed size, it would certianly be possible
to get from a corrupted file back to the original file. But this is still
bad data loss.
Reversion was in commit fcc052bed8.
Luckily that did not make the most recent release.
It works when using git-annex sync/push/assist, or when manually sending
all content to the proxied remote before pushing to the proxy remote.
But when the push comes before the content is sent, sending content does
not update the exported tree.
(When possible, of course it may not be there, or it may get renamed from
there for another exported file first. Or the remote may not support
renames.)
This will avoids redundant uploads.
An example case where this is important: Proxying to a exporttree remote,
a file is uploaded to it but is not yet in an exported tree. When the
exported tree is pushed, the remote needs to be updated by exporting to
it. In this case, the proxy doesn't have a copy of the file, so it would
need to download it from annexobjects before uploading it to the final
location. With this optimisation, it can just rename it.
However: If a key is used twice in an exported tree, it seems a proxy
will need to download and reupload anyway. Unless a copy operation is
added to exporttree remotes..
This avoids needing to re-upload the file again to get it to the
annexobjects location, which git-annex sync was doing when it was
preferred content.
If the file is not preferred content, sync will drop it from the
annexobjects location.
If the file has been deleted from the tree, it will remain in the
annexobjects location until an unused/dropunused pass is done.
Decided not to use the annexobjects location for exportTempName.
There doesn't seem to be any actual benefit to doing that, because an
export that renames to exportTempName always renames it back from that
to another location.
Also the annexobjects directory won't actually help with the paired
rename issue.
The file in the annexobjects location may have been renamed from a
previously exported file that got deleted in a subsequent export.
Or it may be renamed to annexobjects temporarily before being renamed to
another name (to handle eg pairwise renames).
But, an exported file is not guaranteed to contain the content of the
key that the local repository last exported there. Another tree could
have been exported from elsewhere in the meantime.
So, files in annexobjects do not necessarily have the content of their
key. And so have to be strongly verified when retrieving. The same as
is done when retrieving exported files.
Removing the key from the annexobjects location when it's in the
exported tree would leave it in the exported tree, and so succeeding
would update the location log incorrectly. But this also can't remove it
from the exported tree, because that would cause import tree to see a
file got deleted. So, refuse to remove in this situation.
It would be possible to remove from the annexobjects location and then
fail. Then if a key somehow got stored in both the annexobjects location
and the exported tree location(s), the duplicate would be resolved. Not
doing this because first, I don't know how that situation could happen,
and second, it seems wrong for a failed remove to have a side-effect
like that.
There was no good reason for it to be using annexLocationsNonBare,
and exporttree=yes annexobjects=yes is going to use annexLocationsBare,
so this should as well for consistency.
Since all returned ExportLocations are tried when retrieving objects,
this won't break backwards compatability.
This fixes a problem with datalad's test suite, where loading the cluster
log happened to cause the git-annex branch commits to take a different
shape, with an additional commit.
It's also faster though, since many commands don't need the cluster log.
Just fill Annex.clusters with a thunk.
Sponsored-by: the NIH-funded NICEMAN (ReproNim TR&D3) project