starting api design

Joey Hess 2019-02-11 15:47:18 -04:00
parent b7991248db
commit 87987c78cf


Updating the working copy is then done by merging the import treeish.
This way, conflicts will be detected and handled as normal by git.
----
## content identifiers
The remote is responsible for collecting a list of the files currently in
it, along with a content identifier for each.
What if two repositories import a tree from the same remote? In that case,
perhaps different trees would be imported, and merged into master. So the
two repositories then have differing masters, which can be reconciled by
merging as usual.
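The comparison of a recorded listing against a fresh one could be sketched
like this. All the type definitions here are hypothetical stand-ins for
illustration, not git-annex's real types:

```haskell
import qualified Data.Map.Strict as M

-- Hypothetical stand-ins for the real types.
type ExportLocation = FilePath
newtype ContentIdentifier = ContentIdentifier String
	deriving (Eq, Show)

-- Compare the content identifiers recorded at the last import with a
-- fresh listing from the remote; any file that is new, or whose
-- identifier changed, needs to be imported.
changedFiles
	:: M.Map ExportLocation ContentIdentifier  -- recorded at last import
	-> [(ExportLocation, ContentIdentifier)]   -- fresh listing
	-> [ExportLocation]
changedFiles known = map fst . filter changed
  where
	changed (loc, cid) = M.lookup loc known /= Just cid
```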
----
Since exporttree remotes don't have content identifier information yet, it
needs to be collected the first time import tree is used. (Or import
everything, but that is probably too expensive.) Any modifications made to
exported files before the first import tree would not be noticed. That seems
acceptable as long as it only affects exporttree remotes created before
this feature was added.

What if repo A is used to import tree from R for a while, and the user gets
used to editing files on R and importing them, but then stops using A and
switches to clone B? B would not have the content identifier information
that A did. It seems that in this case, B needs to re-download everything,
to build up the map of content identifiers
(anything could have changed since the last time A imported).
That seems too expensive!

Would storing content identifiers in the git-annex branch be too
expensive? Probably not. For S3 with versioning, a content identifier is
already stored. When the content identifier is (mtime, size, inode),
that's a small amount of data. The maximum size of a content identifier
could be limited to the size of a typical hash, and if a remote for some
reason produces something larger, it could simply be hashed to generate
the content identifier.
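That size cap could look something like the following. FNV-1a is used here
only to keep the sketch dependency-free; it stands in for whatever real
hash would actually be chosen, and the names are made up:

```haskell
import Data.Bits (xor)
import Data.Char (ord)
import Data.Word (Word64)
import Numeric (showHex)

-- Cap content identifiers at roughly the size of a typical hash.
maxIdentifierLen :: Int
maxIdentifierLen = 64

-- FNV-1a, a stand-in hash for illustration only.
fnv1a :: String -> Word64
fnv1a = foldl step 14695981039346656037
  where
	step h c = (h `xor` fromIntegral (ord c)) * 1099511628211

-- Pass small identifiers through unchanged; hash oversized ones.
capIdentifier :: String -> String
capIdentifier cid
	| length cid <= maxIdentifierLen = cid
	| otherwise = showHex (fnv1a cid) ""
```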
## race conditions

TODO
If multiple repos can access the remote at the same time, then there's a
potential problem when one is exporting a new tree, and the other one is
importing from the remote.
> to be on the remote. (May need to reword that prompt.)
> --[[Joey]]
## api design
Pulling all of the above together, this is an extension to the
ExportActions api.
    listContents :: Annex [(ExportLocation, ContentIdentifier)]

    getContentIdentifier :: ExportLocation -> Annex (Maybe ContentIdentifier)

    retrieveExportWithContentIdentifier :: ExportLocation -> ContentIdentifier -> FilePath -> MeterUpdate -> Annex Bool

    storeExportWithContentIdentifier :: FilePath -> Key -> ExportLocation -> MeterUpdate -> Annex (Maybe ContentIdentifier)
retrieveExportWithContentIdentifier is used when downloading a new file from
the remote that listContents found. retrieveExport can't be used because
it has a Key parameter and the key is not yet known in this case.
storeExportWithContentIdentifier is used to get the content identifier
corresponding to what was just stored. It can either get the content
identifier in reply to the store (as S3 does with versioning), or it can
store to a temp location, get the content identifier of that, and then
rename the content into place. When there's a race with a concurrent
writer, it needs to avoid getting the ContentIdentifier for data written by
the other writer.
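The store-to-temp-then-rename variant could be sketched like this, assuming
a filesystem-like remote where rename is atomic; the function and parameter
names are invented for the sketch:

```haskell
import System.Directory (getFileSize, getModificationTime, renameFile)

-- Store content via a temp file, reading back an (mtime, size) content
-- identifier from the temp file *before* renaming it into place, so a
-- concurrent writer to the final location cannot influence the
-- identifier that gets recorded.
storeViaTemp :: FilePath -> FilePath -> String -> IO String
storeViaTemp tmpfile destfile content = do
	writeFile tmpfile content
	-- Only this process writes to tmpfile, so stat-ing it is race-free.
	mtime <- getModificationTime tmpfile
	size <- getFileSize tmpfile
	renameFile tmpfile destfile
	return (show (mtime, size))
```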
TODO what's needed to work around the other race condition discussed above?
----
See also [[adb_special_remote]].