For publishing content from a git-annex repository, it would be useful to
be able to export a tree of files to a special remote, using the filenames
and content from the tree.

(See also [[todo/export]] and [[todo/dumb, unsafe, human-readable_backend]])

[[!toc ]]

## configuring a special remote for tree export

If a special remote already has files stored in it, switching it to be a
tree export would result in a mix of files named by key and by filename.
That's not desirable. So, the user should set up a new special remote
when they want to export a tree. (It would also be possible to drop all content
from an existing special remote and reuse it, but there does not seem to be
much benefit in doing so.)

Add a new `initremote` configuration `exporttree=yes`, that cannot be
changed by `enableremote`:

    git annex initremote myexport type=... exporttree=yes

It does not make sense to encrypt an export, so `exporttree=yes` requires
`encryption=none`.

Note that the particular tree to export is not specified yet. This is
because the tree that is exported to a special remote may change.
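The two configuration rules above (exporttree is fixed at `initremote` time,
and `exporttree=yes` requires `encryption=none`) can be sketched as a small
check. This is only an illustration in Python, not git-annex's code; the
function name `check_export_config` and the dict representation of a remote's
configuration are assumptions made for the example.

```python
def check_export_config(new_config, old_config=None):
    """Return an error message, or None when the configuration is acceptable.

    new_config: settings given on the command line (hypothetical dict form).
    old_config: the remote's stored settings when run via enableremote,
                or None when run via initremote.
    """
    if old_config is not None:
        # enableremote: exporttree may be repeated, but not changed.
        if ("exporttree" in new_config
                and new_config["exporttree"] != old_config.get("exporttree")):
            return "exporttree can only be set by initremote"
        effective = {**old_config, **new_config}
    else:
        effective = dict(new_config)

    # An export cannot be encrypted.
    if (effective.get("exporttree") == "yes"
            and effective.get("encryption") != "none"):
        return "exporttree=yes requires encryption=none"
    return None
```

For example, `check_export_config({"exporttree": "yes"}, {"encryption": "none"})`
returns an error, because the remote was not initialized as an export.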

## exporting a treeish

To export a treeish, the user can run:

    git annex export $treeish --to myexport

That does all necessary uploads etc to make the special remote contain
the tree of files. The treeish can be a tag, a branch, or a tree.

If a file's content is not present, it won't be exported. Re-running the
same export later should export files whose content has become present.
(This likely means a second pass, and needs location tracking to track
which files are in the export.)

Users may sometimes want to export multiple treeishes to a single special
remote. For example, exporting several tags. The interface could be made
more complicated to support that, eg by putting the treeishes in
subdirectories on the special remote. But that's not necessary, because the
user can use git commands to graft trees together into a larger tree, and
export that larger tree.

If an export is interrupted, running it again should resume where it left
off.

It would also be nice to have a way to say, "I want to export the master branch",
and have git-annex sync and the assistant automatically update the export.
This could be done by recording the treeish in eg, refs/remotes/myexport/HEAD.
git-annex export could do this by default (if the user doesn't want the export
to track the branch, they could instead export a tree or a tag).

## updating an export

The user can at any time re-run git-annex export with a new treeish
to change what's exported. While some use cases for git annex export
involve publishing datasets that are intended to remain immutable,
other use cases include eg, making a tree of files available to a computer
that can't run git-annex, and in such use cases, the tree needs to be able
to be updated.

To efficiently update an export, git-annex can diff the tree
that was exported with the new tree. The naive approach is to upload
new and modified files and remove deleted files.

With rename detection, if the special remote supports moving files,
more efficient updates can be done. It gets complicated; consider two files
that swap names.

If the special remote supports copying files, that would also make some
updates more efficient.
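A minimal sketch of planning such an update, in Python. It assumes a tree can
be represented as a dict mapping filename to key, which is a simplification
for illustration. The function `plan_update` and the `.export-tmp-N` temporary
names are hypothetical. Routing every rename through a temporary name is one
possible way to cope with two files swapping names; it is not necessarily
what git-annex should do.

```python
def plan_update(old_tree, new_tree, can_rename=False):
    """Plan the steps that turn an export of old_tree into new_tree.

    Both trees are dicts mapping filename -> key. Returns a list of
    ("remove", name), ("rename", src, dst) and ("upload", name, key)
    steps, in the order they should be run.
    """
    # Files on the remote whose current content is not wanted under that name.
    stale = {n: k for n, k in old_tree.items() if new_tree.get(n) != k}
    # Files in the new tree that the remote does not yet have right.
    needed = {n: k for n, k in new_tree.items() if old_tree.get(n) != k}

    to_temp, from_temp = [], []
    if can_rename:
        for dst in sorted(needed):
            key = needed[dst]
            # Look for stale content that could be moved instead of uploaded.
            src = next((n for n in sorted(stale) if stale[n] == key), None)
            if src is not None:
                temp = f".export-tmp-{len(to_temp)}"  # hypothetical temp name
                to_temp.append(("rename", src, temp))
                from_temp.append(("rename", temp, dst))
                del stale[src]

    renamed = {dst for _, _, dst in from_temp}
    removes = [("remove", n) for n in sorted(stale)]
    uploads = [("upload", n, needed[n])
               for n in sorted(needed) if n not in renamed]
    # Move content out of the way first, then delete leftovers, then
    # move content into its final place, and upload whatever is missing.
    return to_temp + removes + from_temp + uploads
```

With `can_rename=False` this is the naive approach. With `can_rename=True`,
exporting `{"a": "K2", "b": "K1"}` over `{"a": "K1", "b": "K2"}` yields four
renames and no uploads, since neither file is overwritten before its content
has been moved aside.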

## changes to special remote interface

This needs some additional methods added to special remotes, and to
the [[external_special_remote_protocol]].

Here are the changes to the latter:

* `EXPORTSUPPORTED`  
  Used to check if a special remote supports exports. The remote
  responds with either `EXPORTSUPPORTED-SUCCESS` or
  `EXPORTSUPPORTED-FAILURE`.
* `EXPORT Name`  
  Comes immediately before each of the following requests,
  specifying the name of the exported file. It will be in the form
  of a relative path, and may contain path separators, whitespace,
  and other special characters.
* `TRANSFEREXPORT STORE|RETRIEVE Key File`  
  Requests the transfer of a File on local disk to or from the previously
  provided Name on the special remote.  
  Note that it's important that, while a file is being stored,
  `CHECKPRESENTEXPORT` not indicate it's present until all the data has
  been transferred.  
  The remote responds with either `TRANSFER-SUCCESS` or
  `TRANSFER-FAILURE`, and a remote where exports do not make sense
  may always fail.
* `CHECKPRESENTEXPORT Key`  
  Requests the remote to check if the previously provided Name is present
  in it.  
  The remote responds with `CHECKPRESENT-SUCCESS`, `CHECKPRESENT-FAILURE`,
  or `CHECKPRESENT-UNKNOWN`.
* `REMOVEEXPORT Key`  
  Requests the remote to remove content stored by `TRANSFEREXPORT`
  with the previously provided Name.  
  The remote responds with either `REMOVE-SUCCESS` or
  `REMOVE-FAILURE`.
* `RENAMEEXPORT Key NewName`  
  Requests the remote rename a file stored on it from the previously
  provided Name to the NewName.  
  The remote responds with `RENAMEEXPORT-SUCCESS`,
  `RENAMEEXPORT-FAILURE`, or with `RENAMEEXPORT-UNSUPPORTED` if an efficient
  rename cannot be done.

To support old external special remote programs that have not been updated
to support exports, git-annex will need to handle an `ERROR` response
when using any of the above.
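To make the request flow concrete, here is a rough Python sketch of the
remote's side of these requests, backed by a local directory. The class
`ExportHandler` and its `handle` method are hypothetical. Replies are written
as the bare names listed above, since any parameters they might carry are not
specified here, and `UNSUPPORTED-REQUEST` is the reply the existing external
special remote protocol uses for requests a program does not know. A real
program would also speak the rest of the protocol over stdin and stdout, and
would have to reject names that escape the export directory.

```python
import os
import shutil
import tempfile

EXPORT_REQUESTS = ("TRANSFEREXPORT", "CHECKPRESENTEXPORT",
                   "REMOVEEXPORT", "RENAMEEXPORT")


class ExportHandler:
    """Toy handler for the export requests, storing files under `root`."""

    def __init__(self, root):
        self.root = root
        self.name = None  # set by the most recent EXPORT request

    def _path(self, name):
        return os.path.join(self.root, name)

    def handle(self, line):
        """Process one request line; return the reply, or None if no reply."""
        request, _, rest = line.partition(" ")
        if request == "EXPORTSUPPORTED":
            return "EXPORTSUPPORTED-SUCCESS"
        if request == "EXPORT":
            self.name = rest  # may contain whitespace, so keep it whole
            return None
        if request not in EXPORT_REQUESTS:
            return "UNSUPPORTED-REQUEST"
        if self.name is None:
            return "ERROR no EXPORT request was received first"
        target = self._path(self.name)
        if request == "TRANSFEREXPORT":
            action, _key, local_file = rest.split(" ", 2)
            return self._transfer(action, local_file, target)
        if request == "CHECKPRESENTEXPORT":
            present = os.path.isfile(target)
            return "CHECKPRESENT-SUCCESS" if present else "CHECKPRESENT-FAILURE"
        if request == "REMOVEEXPORT":
            try:
                os.remove(target)
            except FileNotFoundError:
                pass  # already gone, which is the desired end state
            except OSError:
                return "REMOVE-FAILURE"
            return "REMOVE-SUCCESS"
        # Only RENAMEEXPORT Key NewName is left.
        _key, new_name = rest.split(" ", 1)
        try:
            new_path = self._path(new_name)
            os.makedirs(os.path.dirname(new_path), exist_ok=True)
            os.replace(target, new_path)
        except OSError:
            return "RENAMEEXPORT-FAILURE"
        return "RENAMEEXPORT-SUCCESS"

    def _transfer(self, action, local_file, target):
        try:
            if action == "STORE":
                directory = os.path.dirname(target)
                os.makedirs(directory, exist_ok=True)
                # Write under a temporary name and rename into place, so the
                # file never appears under Name until it is complete.
                fd, temp = tempfile.mkstemp(dir=directory)
                os.close(fd)
                shutil.copyfile(local_file, temp)
                os.replace(temp, target)
            elif action == "RETRIEVE":
                shutil.copyfile(target, local_file)
            else:
                return "TRANSFER-FAILURE"
        except OSError:
            return "TRANSFER-FAILURE"
        return "TRANSFER-SUCCESS"
```

Storing via a temporary file followed by a rename is one way to meet the
requirement that `CHECKPRESENTEXPORT` not report a file as present while it
is still being stored.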

## location tracking

Since not all the files in an exported treeish may have content
present when the export is done, location tracking will be needed so that
getting the files and exporting again transfers their content.

Does a copy of a file exported to a special remote count as a copy
of a file as far as [[numcopies]] goes? Should `git annex get` download
a file from an export?

The problem is that special remotes with exports are not
key/value stores. The content of a file can change, and if multiple
repositories can export to a special remote, they can be out of sync about
what files are exported to it.

Possible solution: Make exporttree=yes cause the special remote to
be untrusted, and rely on annex.verify to catch cases where the content
of a file on a special remote has changed. This would work well enough
except for when the WORM or URL backend is used. So, prevent the user
from exporting such keys. Also, force verification on for such special
remotes, don't let it be turned off.

The same file contents may be in a treeish multiple times under different
filenames. That complicates using location tracking. One file may have been
exported and the other not, and location tracking says that the content
is present in the export.
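The rule proposed above, refusing to export keys whose content cannot be
verified, could look roughly like this in Python. It relies on the backend
name being the part of a key before the first `-`; the function names are
made up for this sketch.

```python
# Backends whose keys carry no checksum, so changed content on the
# remote could not be detected by verification.
UNVERIFIABLE_BACKENDS = {"WORM", "URL"}


def key_backend(key):
    """Return the backend name, which is the key's text before the first '-'."""
    return key.split("-", 1)[0]


def can_export(key):
    """True if the key's content can be verified after downloading it."""
    return key_backend(key) not in UNVERIFIABLE_BACKENDS
```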

## recording exported filenames in git-annex branch

In order to download the content of a key from a file exported
to a special remote, the filename that was exported needs to somehow
be recorded in the git-annex branch. How to do this? The filename could
be included in the location tracking log or a related log file, or
the exported tree could be grafted into the git-annex branch
(under eg, `exported/uuid/`). Which way uses less space in the git repository?

Grafting in the exported tree records the necessary data, but the
file-to-key map needs to be reversed to support downloading from an export.
It would be too expensive to traverse the tree each time to hunt for a key;
instead, this would need a database that gets populated once by traversing
the tree.

On the other hand, for updating what's exported, having access to the old
exported tree seems perfect, because it and the new tree can be diffed to
find what changes need to be made to the special remote.

If the filenames are stored in the location tracking log, the exported tree
could be reconstructed, but it would take O(N) queries to git, where N is
the total number of keys git-annex knows about; updating exports of small
subsets of large repositories would be expensive. So grafting in the
exported tree seems the better approach.
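As a sketch of the reversed map mentioned above, the following Python uses an
SQLite table that is filled in one pass over the exported tree and then
queried by key. The table layout and the names `build_export_db` and
`files_for_key` are assumptions for illustration, and the tree is again
simplified to a dict mapping filename to key.

```python
import sqlite3


def build_export_db(exported_tree, path=":memory:"):
    """Populate a key -> filename database from one traversal of the tree."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS exported ("
        " annex_key TEXT NOT NULL,"
        " filename TEXT NOT NULL,"
        " PRIMARY KEY (annex_key, filename))"
    )
    db.executemany(
        "INSERT OR IGNORE INTO exported (annex_key, filename) VALUES (?, ?)",
        [(key, filename) for filename, key in exported_tree.items()],
    )
    db.commit()
    return db


def files_for_key(db, key):
    """Return every exported filename that holds the content of key."""
    rows = db.execute(
        "SELECT filename FROM exported WHERE annex_key = ? ORDER BY filename",
        (key,),
    )
    return [filename for (filename,) in rows]
```

The primary key starts with the key column, so looking up the files for a
key does not need to scan the whole table.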

## export conflicts

What if different repositories can access the same special remote,
and different trees get exported to it concurrently?

This would be very hard to untangle, because it's hard to know what
content was exported to a file last, and thus what content the file
actually has. The location log's timestamps might give a hint,
but clocks vary too much to trust them.

Also, if the exported tree is grafted in to the git-annex branch,
there would be a merge conflict. Union merging would *scramble* the exported
tree, so even if a smart merge is added, old versions of git-annex would
corrupt the exported tree.

To avoid that problem, add a log file `export.log` that contains the uuid
of the remote that was exported to, and the sha1 of the exported tree.
To avoid the exported tree being GCed, do graft it in to the git-annex
branch, but follow that with a commit that removes the tree again,
and only update `refs/heads/git-annex` after making both commits.

If `export.log` contains multiple active exports of different trees,
there was an export conflict. Short of downloading the whole export to
checksum it, or deleting the whole export, what can be done to resolve it?

In this case, git-annex knows both exported trees. Have the user provide
a tree that resolves the conflict as they desire (it could be the same as
one of the exported trees, or some merge of them or an entirely new tree).
The UI to do this can just be another `git annex export $tree --to remote`.
To resolve, diff each exported tree in turn against the resolving tree
and delete all files that differ.
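The resolution step can be sketched in Python as follows, with each tree
simplified to a dict mapping filename to key. The function name
`files_to_delete` is hypothetical, and a file is treated as differing when
the resolving tree lacks it or has a different key for it.

```python
def files_to_delete(exported_trees, resolving_tree):
    """List files to delete from the remote to resolve an export conflict.

    exported_trees: the conflicting trees recorded as active exports.
    resolving_tree: the tree the user chose as the resolution.
    """
    doomed = set()
    for tree in exported_trees:
        for filename, key in tree.items():
            # The remote may hold this tree's version of the file, which
            # is not what the resolving tree wants under that name.
            if resolving_tree.get(filename) != key:
                doomed.add(filename)
    return sorted(doomed)
```

After these deletions, any file still on the remote has the same content in
the resolving tree as in every conflicting tree that contains it, so an
ordinary export of the resolving tree can fill in the rest.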

## when to update export.log for efficient resuming of exports

When should `export.log` be updated? Possibilities:

* Before performing any work, to set the goal.
* After the export is fully successful, to record the current state.
* After some mid-point.

Lots of things could go wrong during an export. A file might fail to be
transferred or only part of it be transferred; a file's content might not
be present to transfer at all. The export could be interrupted part way.
Updating the export.log at the right point in time is important to handle
these cases efficiently.

If the export.log is updated first, then it's only a goal and does not tell
us what's been done already.

If the export.log is updated only after complete success, then the common
case of some files not having content locally present will prevent it from
being updated. When we resume, we again don't know what's been done
already.

If the export.log is updated after deleting any files from the
remote that are not the same in the new treeish as in the old treeish,
and as long as TRANSFEREXPORT STORE is atomic, then when resuming we can
trust CHECKPRESENTEXPORT to only find files that have the correct content
for the current treeish. (Unless a conflicting export was made from
elsewhere, but in that case, the conflict resolution will have to fix up
later.)

Efficient resuming can then first check if the location log says the
export contains the content. (If not, transfer a copy.) If the location
log says the export contains the content, use CHECKPRESENTEXPORT to see if
the file exists, and if not transfer a copy. The CHECKPRESENTEXPORT check
deals with the case where the treeish has two files with the same content.
If we have a key-to-files map for the export, then we can skip the
CHECKPRESENTEXPORT check when there's only one file using a key. So,
resuming can be quite efficient.
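The resume decision described above can be written down as a small Python
function. The name `needs_transfer` and its callback parameters are
hypothetical: `location_log_has` stands in for a location log lookup, and
`check_present` for a CHECKPRESENTEXPORT round trip to the remote.

```python
def needs_transfer(filename, key, location_log_has, check_present,
                   key_to_files=None):
    """Decide whether a file must be (re)transferred when resuming an export.

    location_log_has(key) -> bool: does the location log say the export
        contains this key's content?
    check_present(filename) -> bool: does the file exist on the remote?
    key_to_files: optional dict mapping key -> list of exported filenames.
    """
    if not location_log_has(key):
        return True
    # When only one file uses the key, the location log alone is enough.
    if key_to_files is not None and len(key_to_files.get(key, ())) == 1:
        return False
    # Otherwise the content may be present under another filename only,
    # so ask the remote about this particular file.
    return not check_present(filename)
```

The remote is only contacted for keys that the location log already lists and
that may be used by more than one file, which keeps the number of round trips
small.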