remove false starts, simplify

Joey Hess 2018-08-29 14:12:18 -04:00
parent 5b78952f78
commit dad627fa9e
GPG key ID: DB12DB0FF05F8F38

@@ -5,62 +5,12 @@ content and it can be accessed using a version ID (that S3 returns when
storing the content). So it should be possible for git-annex to allow
downloading old versions of files from such a remote.
## remote pair approach
Basically, store the S3 version ID in git-annex branch and support
downloading using it.
One way would be for the S3 remote, when storing a file to an S3 bucket
that is known to support versioning, to add a url using the S3 version ID
to the web remote.
However, some remotes that support versioning won't be accessible via the
web, so that's not a general solution.
(Also, S3 buckets only support web access when configured to be public.)
This generalizes to a pair of remotes: it could be S3+web, or S3 could instantiate
two remotes automatically and use the second for versioned data.
Note that location tracking info has to be carefully managed, to avoid
there appearing to be two copies of data that's only really stored in one place.
When uploading to S3, it should not yet add the url or mark the content
as present in the web. Then when dropping from S3, after the
drop succeeds, it can mark the content as present in the web and add its url.
There's still a potential race there, since the remote does not update location
tracking when dropping; the caller of the remote does. So if S3 marks content
as being present in the web, it will briefly appear present in both locations
and break numcopies counting. The API would need to be extended to avoid this race.
> Ah, but: exporttree remotes are always untrusted for other reasons,
> so location tracking is less of a problem. Even if location tracking
> shows the content in two places, a drop will skip the exporttree remote
> so will only treat the pair as one copy.
>
> So the location tracking problem is limited to --copies=N matching incorrectly,
> and whereis listing both locations, and some preferred content
> expressions behaving in surprising ways.
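
To make the difference between the two kinds of counting concrete, here is a toy model in Python. It is only a sketch of the reasoning above, not git-annex's code: the remote names `s3` and `s3oldversions`, the `Trust` enum, and both functions are made up for illustration, and it assumes, as described above, that drop checks skip untrusted remotes while a plain `--copies=N` match counts every listed location.

```python
from enum import Enum


class Trust(Enum):
    TRUSTED = "trusted"
    SEMITRUSTED = "semitrusted"
    UNTRUSTED = "untrusted"


def copies_for_drop(locations, trust):
    """Count copies as a drop safety check would: untrusted remotes are skipped."""
    return sum(1 for remote in locations if trust[remote] is not Trust.UNTRUSTED)


def copies_for_matching(locations):
    """Count copies as a plain --copies=N match would: every listed location counts."""
    return len(locations)


# Hypothetical pair: the exporttree remote is untrusted, the versioned one is not.
trust = {"s3": Trust.UNTRUSTED, "s3oldversions": Trust.SEMITRUSTED}

# Location tracking (wrongly) lists both remotes of the pair for one key.
locations = {"s3", "s3oldversions"}

drop_count = copies_for_drop(locations, trust)
match_count = copies_for_matching(locations)
```

In this model the drop check still sees a single copy, so numcopies is not fooled, but the matching count sees two, which is the remaining inaccuracy described above.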
Unfortunately this remote pair approach will leak out into git-annex's interface;
it will show two remotes. Not a problem for S3+web really, but if S3 instantiates
an S3oldversions remote, that necessarily adds the potential for confusion,
and adds complexity in configuration of preferred content settings, repo groups,
etc.
> Could flip it; make the main remote track the versioned data, and the
> exporttree remote be secondary. Since only git-annex export/sync need to
> access that remote, they could have a special case to look for such a
> secondary remote and act on it. All other commands would only operate on
> the main remote. Indeed, the secondary remote would not need to be
> in the RemoteList at all.
>
> Doesn't avoid preferred content etc complexity, still.
## location tracking approach
Another way is to store the S3 version ID in git-annex branch and support
downloading using it. But this has the problem that dropping makes
git-annex think it's not in S3 any more, while what we want for export
is for it to be removed from the current bucket, but still tracked as
present in S3.
But this has the problem that dropping makes git-annex think it's not in S3
any more, while what we want for export is for it to be removed from the
current bucket, but still tracked as present in S3.
The drop from S3 could fail, or "succeed" in a way that prevents the location
tracking being updated to say it lacks the content. Failing is how bup deals
@@ -75,7 +25,8 @@ and make at sync --content/assistant use that.
Note that git-annex export does not rely on location tracking to determine
which files still need to be sent to an export. It uses the export database
to keep track of that.
to keep track of that. This is important, because the location tracking
won't be updated, as discussed above.
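
A minimal sketch of that idea, in Python, with a plain dict standing in for the export database. The function name, the data layout, and the key strings are all assumptions made for illustration; git-annex's real export database is not structured like this. The point is only that the decision is made by comparing the tree to be exported against what the database records, without consulting location tracking.

```python
def files_to_send(tree, export_db):
    """Return the exported paths whose key differs from what the export
    database records as already sent. Location tracking is not consulted."""
    return sorted(path for path, key in tree.items() if export_db.get(path) != key)


# Tree being exported: path -> key of the current version (made-up keys).
tree = {"foo.txt": "KEY-2", "bar.txt": "KEY-3"}

# Toy export database: path -> key that was last sent to the export.
export_db = {"foo.txt": "KEY-1", "bar.txt": "KEY-3"}

pending = files_to_send(tree, export_db)
```

Here only `foo.txt` is pending, whatever location tracking says about `KEY-1` or `KEY-2`.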
## final plan