4da763439b
Removed incorrect UniqueKey key in db schema; a key can appear multiple times with different files. The database has to be flushed after each removal. But when adding files to the export, lots of changes can be queued up without flushing. So it's still fairly efficient. If large removals of files from exports are too slow, an alternative would be to make two passes over the diff, one pass queueing deletions from the database, then a flush and then a second pass updating the location log. But that would use more memory, and need to look up exportKey twice per removed file, so I've avoided that optimisation for now. This commit was supported by the NSF-funded DataLad project.
For publishing content from a git-annex repository, it would be useful to
be able to export a tree of files to a special remote, using the filenames
and content from the tree.

(See also [[todo/export]] and [[todo/dumb, unsafe, human-readable_backend]])

[[!toc ]]

## configuring a special remote for tree export

If a special remote already has files stored in it, switching it to be a
tree export would result in a mix of files named by key and by filename.
That's not desirable. So, the user should set up a new special remote
when they want to export a tree. (It would also be possible to drop all content
from an existing special remote and reuse it, but there does not seem to be
much benefit in doing so.)

Add a new `initremote` configuration `exporttree=yes`, that cannot be
changed by `enableremote`:

    git annex initremote myexport type=... exporttree=yes

It does not make sense to encrypt an export, so `exporttree=yes` requires
`encryption=none`.

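The two configuration rules above can be sketched as a pair of checks. This is only an illustration in Python, not how git-annex implements it; the function names and the dict representation of the configuration are assumptions made for the example.

```python
def validate_initremote(config):
    """Check an initremote configuration, given as a dict of settings.

    Hypothetical helper: returns the config when valid, raises ValueError
    when exporttree=yes is combined with any encryption.
    """
    if config.get("exporttree") == "yes" and config.get("encryption") != "none":
        raise ValueError("exporttree=yes requires encryption=none")
    return config


def validate_enableremote(old_config, new_config):
    """Merge settings passed to enableremote, refusing to change exporttree.

    Hypothetical helper: a missing exporttree setting is treated as "no",
    and an enableremote call that omits it leaves the old value in place.
    """
    old = old_config.get("exporttree", "no")
    new = new_config.get("exporttree", old)
    if new != old:
        raise ValueError("exporttree cannot be changed by enableremote")
    return {**old_config, **new_config}
```

With these checks, `validate_initremote({"exporttree": "yes", "encryption": "shared"})` is rejected, and so is an `enableremote` that tries to turn `exporttree` on for an existing remote.
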
Note that the particular tree to export is not specified yet. This is
because the tree that is exported to a special remote may change.

## exporting a treeish

To export a treeish, the user can run:

    git annex export $treeish --to myexport

That does all necessary uploads etc. to make the special remote contain
the tree of files. The treeish can be a tag, a branch, or a tree.

If a file's content is not present, it won't be exported. Re-running the
same export later should export files whose content has become present.
(This likely means a second pass, and needs location tracking to track
which files are in the export.)

Users may sometimes want to export multiple treeishes to a single special
remote. For example, exporting several tags. The interface could be made
more complicated to support that, putting the treeishes in subdirectories
on the special remote etc. But that's not necessary, because the user can
use git commands to graft trees together into a larger tree, and export
that larger tree.

If an export is interrupted, running it again should resume where it left
off.

It would also be nice to have a way to say, "I want to export the master branch",
and have git-annex sync and the assistant automatically update the export.
This could be done by recording the treeish in eg, `refs/remotes/myexport/HEAD`.
git-annex export could do this by default (if the user doesn't want the export
to track the branch, they could instead export a tree or a tag).

## updating an export

The user can at any time re-run git-annex export with a new treeish
to change what's exported. While some use cases for git annex export
involve publishing datasets that are intended to remain immutable,
other use cases include eg, making a tree of files available to a computer
that can't run git-annex, and in such use cases, the tree needs to be able
to be updated.

To efficiently update an export, git-annex can diff the tree
that was exported with the new tree. The naive approach is to upload
new and modified files and remove deleted files.

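The naive approach can be sketched as follows. This is a minimal illustration in Python, assuming each tree has already been flattened into a dict from exported filename to key; `plan_update` is a hypothetical name, and a modified file is listed both as a removal (of the old content) and as an upload (of the new content).

```python
def plan_update(old_tree, new_tree):
    """Return (uploads, removals) that turn old_tree into new_tree.

    Both trees are dicts mapping an exported filename to the key of its
    content. This representation is an assumption of the sketch.
    """
    # Files that are gone, or whose content changed, are removed first.
    removals = sorted(f for f, k in old_tree.items() if new_tree.get(f) != k)
    # Files that are new, or whose content changed, are uploaded.
    uploads = sorted(f for f, k in new_tree.items() if old_tree.get(f) != k)
    return uploads, removals
```

Unchanged files appear in neither list, so re-exporting the same treeish plans no work at all.
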
With rename detection, if the special remote supports moving files,
more efficient updates can be done. It gets complicated; consider two files
that swap names.

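One possible way to handle the swap case, sketched in Python, is to rename in two phases: first move every source to a temporary name, then move each temporary name to its destination. This is an assumption of the sketch rather than a settled part of the design, and the temporary name pattern is made up for illustration.

```python
def two_phase_renames(renames):
    """Turn a list of (old_name, new_name) pairs into safe rename steps.

    Every source is first moved to a temporary name, so that no step
    overwrites a file that still needs to be moved.
    """
    to_temp = []
    from_temp = []
    for i, (src, dst) in enumerate(renames):
        tmp = f".export-tmp-{i}"  # hypothetical temporary name
        to_temp.append((src, tmp))
        from_temp.append((tmp, dst))
    return to_temp + from_temp


def apply_renames(remote, steps):
    """Simulate rename steps on a dict of filename -> content."""
    remote = dict(remote)
    for src, dst in steps:
        if dst in remote:
            raise ValueError(f"{dst} would be overwritten")
        remote[dst] = remote.pop(src)
    return remote
```

This costs two rename requests per file. A smarter planner would only use temporary names for files that are part of a cycle.
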
If the special remote supports copying files, that would also make some
updates more efficient.

## changes to special remote interface

This needs some additional methods added to special remotes, and to
the [[external_special_remote_protocol]].

Here are the changes to the latter:

* `EXPORTSUPPORTED`  
  Used to check if a special remote supports exports. The remote
  responds with either `EXPORTSUPPORTED-SUCCESS` or
  `EXPORTSUPPORTED-FAILURE`.
* `EXPORT Name`  
  Comes immediately before each of the following requests,
  specifying the name of the exported file. It will be in the form
  of a relative path, and may contain path separators, whitespace,
  and other special characters.
* `TRANSFEREXPORT STORE|RETRIEVE Key File`  
  Requests the transfer of a File on local disk to or from the previously
  provided Name on the special remote.  
  Note that it's important that, while a file is being stored,
  `CHECKPRESENTEXPORT` must not indicate it's present until all the data
  has been transferred.  
  The remote responds with either `TRANSFER-SUCCESS` or
  `TRANSFER-FAILURE`, and a remote where exports do not make sense
  may always fail.
* `CHECKPRESENTEXPORT Key`  
  Requests the remote to check if the previously provided Name is present
  in it.  
  The remote responds with `CHECKPRESENT-SUCCESS`, `CHECKPRESENT-FAILURE`,
  or `CHECKPRESENT-UNKNOWN`.
* `REMOVEEXPORT Key`  
  Requests the remote to remove content stored by `TRANSFEREXPORT`
  with the previously provided Name.  
  The remote responds with either `REMOVE-SUCCESS` or
  `REMOVE-FAILURE`.
* `RENAMEEXPORT Key NewName`  
  Requests the remote rename a file stored on it from the previously
  provided Name to the NewName.  
  The remote responds with `RENAMEEXPORT-SUCCESS`,
  `RENAMEEXPORT-FAILURE`, or with `RENAMEEXPORT-UNSUPPORTED` if an efficient
  rename cannot be done.

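To make the request sequence concrete, here is a toy handler for these requests, backed by a local directory. It is a sketch in Python under several assumptions: replies are the bare words listed above (a real protocol may attach arguments to them), removing an absent file is treated as success, and the class and method names are invented for the example. It does no validation of Name, which a real remote would need.

```python
import os
import shutil
import tempfile


class DirectoryExport:
    """Toy handler for the proposed export requests (hypothetical class)."""

    def __init__(self, root):
        self.root = root
        self.name = None  # set by the most recent EXPORT request

    def _path(self, name):
        return os.path.join(self.root, name)

    def handle(self, line):
        cmd, _, rest = line.partition(" ")
        if cmd == "EXPORTSUPPORTED":
            return "EXPORTSUPPORTED-SUCCESS"
        if cmd == "EXPORT":
            self.name = rest  # may contain whitespace and path separators
            return None  # no reply; the next request uses this name
        if cmd == "TRANSFEREXPORT":
            direction, _key, file = rest.split(" ", 2)
            return self._transfer(direction, file)
        if cmd == "CHECKPRESENTEXPORT":
            present = os.path.isfile(self._path(self.name))
            return "CHECKPRESENT-SUCCESS" if present else "CHECKPRESENT-FAILURE"
        if cmd == "REMOVEEXPORT":
            try:
                os.remove(self._path(self.name))
            except FileNotFoundError:
                pass  # assumption: already absent counts as removed
            except OSError:
                return "REMOVE-FAILURE"
            return "REMOVE-SUCCESS"
        if cmd == "RENAMEEXPORT":
            _key, newname = rest.split(" ", 1)
            try:
                new = self._path(newname)
                os.makedirs(os.path.dirname(new), exist_ok=True)
                os.replace(self._path(self.name), new)
                return "RENAMEEXPORT-SUCCESS"
            except OSError:
                return "RENAMEEXPORT-FAILURE"
        return "ERROR unknown request"

    def _transfer(self, direction, file):
        dest = self._path(self.name)
        try:
            if direction == "STORE":
                directory = os.path.dirname(dest)
                os.makedirs(directory, exist_ok=True)
                # Copy to a temporary file, then rename it into place, so the
                # Name only becomes present once all data has been written.
                fd, tmp = tempfile.mkstemp(dir=directory)
                os.close(fd)
                shutil.copyfile(file, tmp)
                os.replace(tmp, dest)
            else:
                shutil.copyfile(dest, file)
            return "TRANSFER-SUCCESS"
        except OSError:
            return "TRANSFER-FAILURE"
```

Storing via a temporary file and a rename is one way to meet the requirement that `CHECKPRESENTEXPORT` not report a file as present while it is still being stored.
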
To support old external special remote programs that have not been updated
to support exports, git-annex will need to handle an `ERROR` response
when using any of the above.

## location tracking

Since not all the files in an exported treeish may have content
present when the export is done, location tracking will be needed so that
getting the files and exporting again transfers their content.

Does a copy of a file exported to a special remote count as a copy
of a file as far as [[numcopies]] goes? Should `git annex get` download
a file from an export?

The problem is that special remotes with exports are not
key/value stores. The content of a file can change, and if multiple
repositories can export a special remote, they can be out of sync about
what files are exported to it.

Possible solution: Make `exporttree=yes` cause the special remote to
be untrusted, and rely on annex.verify to catch cases where the content
of a file on a special remote has changed. This would work well enough
except for when the WORM or URL backend is used. So, prevent the user
from exporting such keys. Also, force verification on for such special
remotes, don't let it be turned off.

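The check for keys that cannot be verified could look something like this. It is a small Python sketch that assumes a key's name starts with its backend name followed by a `-`; the function names and the example keys are made up for illustration.

```python
# Backends whose keys carry no checksum, so changed content on the
# remote could not be detected by verification.
UNVERIFIABLE_BACKENDS = {"WORM", "URL"}


def key_backend(key):
    """Return the backend part of a key (assumes a BACKEND-... layout)."""
    return key.split("-", 1)[0]


def can_export(key):
    """True if content with this key may be sent to an exporttree remote."""
    return key_backend(key) not in UNVERIFIABLE_BACKENDS
```

An export would skip, or refuse, any file for which `can_export` returns False.
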
The same file contents may be in a treeish multiple times under different
filenames. That complicates using location tracking. One file may have been
exported and the other not, yet location tracking says that the content
is present in the export. A sqlite database is needed to keep track of
this.

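A minimal sketch of such a database, using Python's `sqlite3` module, might look like this. The table name, column names and helper functions are assumptions made for the example. The important point is that there is no uniqueness constraint on the key alone, since one key can be exported under several filenames.

```python
import sqlite3


def open_export_db(path=":memory:"):
    """Open (or create) a database recording which files were exported."""
    db = sqlite3.connect(path)
    # Only the (key, file) pair is unique; a key may have many files.
    db.execute(
        "CREATE TABLE IF NOT EXISTS exported ("
        "key TEXT NOT NULL, file TEXT NOT NULL, UNIQUE (key, file))"
    )
    return db


def add_exported(db, key, file):
    db.execute("INSERT OR IGNORE INTO exported (key, file) VALUES (?, ?)", (key, file))


def remove_exported(db, key, file):
    db.execute("DELETE FROM exported WHERE key = ? AND file = ?", (key, file))


def exported_files(db, key):
    """List the filenames under which a key is currently exported."""
    rows = db.execute("SELECT file FROM exported WHERE key = ? ORDER BY file", (key,))
    return [row[0] for row in rows]
```

With this, the location log only needs to say the content is gone from the export once `exported_files` returns an empty list for the key.
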
## recording exported filenames in git-annex branch

In order to download the content of a key from a file exported
to a special remote, the filename that was exported needs to somehow
be recorded in the git-annex branch. How to do this? The filename could
be included in the location tracking log or a related log file, or
the exported tree could be grafted into the git-annex branch
(under eg, `exported/uuid/`). Which way uses less space in the git repository?

Grafting in the exported tree records the necessary data, but the
file-to-key map needs to be reversed to support downloading from an export.
It would be too expensive to traverse the tree each time to hunt for a key;
instead, a database would be needed that gets populated once by traversing
the tree.

On the other hand, for updating what's exported, having access to the old
exported tree seems perfect, because it and the new tree can be diffed to
find what changes need to be made to the special remote.

If the filenames are stored in the location tracking log, the exported tree
could be reconstructed, but it would take O(N) queries to git, where N is
the total number of keys git-annex knows about; updating exports of small
subsets of large repositories would be expensive. So grafting in the
exported tree seems the better approach.

## export conflicts

What if different repositories can access the same special remote,
and different trees get exported to it concurrently?

This would be very hard to untangle, because it's hard to know what
content was exported to a file last, and thus what content the file
actually has. The location log's timestamps might give a hint,
but clocks vary too much to trust them.

Also, if the exported tree is grafted in to the git-annex branch,
there would be a merge conflict. Union merging would *scramble* the exported
tree, so even if a smart merge is added, old versions of git-annex would
corrupt the exported tree.

To avoid that problem, add a log file `export.log` that contains the uuid
of the remote that was exported to, and the sha1 of the exported tree.
To avoid the exported tree being GCed, do graft it in to the git-annex
branch, but follow that with a commit that removes the tree again,
and only update `refs/heads/git-annex` after making both commits.

If `export.log` contains multiple active exports of different trees,
there was an export conflict. Short of downloading the whole export to
checksum it, or deleting the whole export, what can be done to resolve it?

In this case, git-annex knows both exported trees. Have the user provide
a tree that resolves the conflict as they desire (it could be the same as
one of the exported trees, or some merge of them or an entirely new tree).
The UI to do this can just be another `git annex export $tree --to remote`.
To resolve, diff each exported tree in turn against the resolving tree
and delete all files that differ.

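The resolution step can be sketched as follows. This Python example assumes each tree is available as a dict from filename to key, and `files_to_delete` is a hypothetical name. A file is deleted if any of the conflicting exports could have left different content under that name than the resolving tree wants.

```python
def files_to_delete(exported_trees, resolving_tree):
    """Return the filenames to delete from the remote to resolve a conflict.

    exported_trees: the trees recorded as active exports (filename -> key).
    resolving_tree: the tree the user chose to resolve the conflict.
    """
    doomed = set()
    for tree in exported_trees:
        for file, key in tree.items():
            # The file is missing from the resolving tree, or this export
            # may have stored other content under its name.
            if resolving_tree.get(file) != key:
                doomed.add(file)
    return sorted(doomed)
```

After the deletions, the usual export of the resolving tree uploads whatever is now missing. Files that both exports agree on, and that the resolving tree keeps, are left alone.
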
## when to update export.log for efficient resuming of exports

When should `export.log` be updated? Possibilities:

* Before performing any work, to set the goal.
* After the export is fully successful, to record the current state.
* After some mid-point.

Lots of things could go wrong during an export. A file might fail to be
transferred or only part of it be transferred; a file's content might not
be present to transfer at all. The export could be interrupted part way.
Updating `export.log` at the right point in time is important to handle
these cases efficiently.

If `export.log` is updated first, then it's only a goal and does not tell
us what's been done already.

If `export.log` is updated only after complete success, then the common
case of some files not having content locally present will prevent it from
being updated. When we resume, we again don't know what's been done
already.

If `export.log` is updated after deleting any files from the
remote that are not the same in the new treeish as in the old treeish,
and as long as `TRANSFEREXPORT STORE` is atomic, then when resuming we can
trust `CHECKPRESENTEXPORT` to only find files that have the correct content
for the current treeish. (Unless a conflicting export was made from
elsewhere, but in that case, the conflict resolution will have to fix
things up later.)

Efficient resuming can then first check if the location log says the
export contains the content. (If not, transfer a copy.) If the location
log says the export contains the content, use `CHECKPRESENTEXPORT` to see if
the file exists, and if not transfer a copy. The `CHECKPRESENTEXPORT` check
deals with the case where the treeish has two files with the same content.
If we have a key-to-files map for the export, then we can skip the
`CHECKPRESENTEXPORT` check when there's only one file using a key. So,
resuming can be quite efficient.

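The resume logic described above can be written as a single decision function. This is a Python sketch; the function name and its parameters are assumptions made for the example, with `check_present` standing in for a `CHECKPRESENTEXPORT` request.

```python
def needs_transfer(file, location_log_has_key, files_for_key, check_present):
    """Decide whether a file must be transferred when resuming an export.

    location_log_has_key: True if the location log says the export
        contains the file's content.
    files_for_key: filenames in the treeish that share this file's key
        (from a key-to-files map), or None if no such map is available.
    check_present: callable taking a filename, returning True if the
        remote reports the exported file as present.
    """
    if not location_log_has_key:
        return True
    if files_for_key is not None and len(files_for_key) == 1:
        # Only one file uses the key, so the location log is enough.
        return False
    # Several files share the content; ask the remote about this one.
    return not check_present(file)
```

In the common case of one file per key, resuming needs no requests to the remote at all for files that were already exported.
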