git-annex/doc/todo/export_preferred_content.mdwn
2019-05-14 11:49:23 -04:00


`git annex export` normally exports all files in the specified tree,
which is generally what the user wants.

But, in some situations, the user may want to export a subset of files,
in a way that can be well expressed by a preferred content expression.
For example, they may want to export .mp3 files but not the .wav
files used to produce those.
Or, export podcasts, but not ones in an "old" directory that have already
been listened to.

It seems doable to make `git annex export` honor whatever
preferred content settings have been configured for the remote.
(And `git annex sync --content` too.)

Problem: A preferred content expression include=subdir/foo or
exclude=subdir/bar matches relative to the top of the repository.
But `git annex export` may be exporting a sub-tree, and it has no way
of knowing where a provided sub-tree sha is rooted within the larger tree.

What it could do is, when provided "master:subdir", know that it's operating
within subdir and prefix that to filenames when matching preferred content.
But that would be inconsistent behavior and could violate least surprise.

It may be better to add a note that preferred content expressions (include=,
exclude=, etc.) match relative to the top of the exported tree when exporting
a subtree.

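
The difference between the two matching rules can be sketched as follows. This is only an illustration: `matches_include` and `subtree_paths` are hypothetical helpers, and Python's `fnmatch` globbing is assumed as a stand-in for git-annex's own glob matching, which may differ in detail.

```python
from fnmatch import fnmatchcase
from typing import List


def matches_include(glob: str, path: str) -> bool:
    """Hypothetical stand-in for an include= term: glob match on the whole path."""
    return fnmatchcase(path, glob)


def subtree_paths(repo_paths: List[str], subdir: str) -> List[str]:
    """Paths as they appear inside the exported subtree, relative to its top."""
    prefix = subdir.rstrip("/") + "/"
    return [p[len(prefix):] for p in repo_paths if p.startswith(prefix)]


repo_paths = ["subdir/foo", "subdir/bar", "other/foo"]
exported = subtree_paths(repo_paths, "subdir")

# Matching relative to the top of the exported tree:
# the repository-relative glob no longer matches anything.
relative_to_tree = [p for p in exported if matches_include("subdir/foo", p)]

# Matching with the prefix added back, which is only possible when the
# subtree was named in a form like "master:subdir".
with_prefix = [p for p in exported if matches_include("subdir/foo", "subdir/" + p)]
```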
----
> `git annex import` of a tree from a special remote would also be
> influenced by this.
>
> It would make sense for the ImportableContents to have files
> that are not preferred content filtered out of it. Eg, if a .wav file
> is added to the remote, it shouldn't be downloaded. Or a better example,
> if directory Music is excluded from an android remote, importing from
> it should exclude that directory.
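
A minimal sketch of that filtering, under stated assumptions: the importable contents are modelled as a plain mapping from path to size (not the real ImportableContents type), and the `wanted` predicate hard-codes two exclude= style globs using Python's `fnmatch`. All names here are hypothetical.

```python
from fnmatch import fnmatchcase
from typing import Callable, Dict

# Hypothetical model of what a remote lists: path on the remote -> size in bytes.
Importable = Dict[str, int]


def filter_importable(contents: Importable,
                      wanted: Callable[[str], bool]) -> Importable:
    """Keep only the files that the remote's preferred content wants."""
    return {path: size for path, size in contents.items() if wanted(path)}


def wanted(path: str) -> bool:
    # Stand-in for "exclude=Music/* and exclude=*.wav".
    return not (fnmatchcase(path, "Music/*") or fnmatchcase(path, "*.wav"))


remote = {
    "Music/song.mp3": 4_000_000,
    "DCIM/photo.jpg": 2_000_000,
    "rec/take1.wav": 50_000_000,
}
filtered = filter_importable(remote, wanted)
```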
----
> Problem: If a tree is exported with, eg, no .wav files, and then an import
> is made from the remote, the imported tree necessarily lacks .wav files,
> so the remote tracking branch will have a tree with no .wav
> files. Merging that into master will delete all the .wav files.
>
> If the remote tracking branch has a disconnected history from master,
> then git wouldn't delete files on
> merge. But: This would prevent actual deletions made on the special
> remote from happening in master too. So not a good idea.
>
> So it seems that, when updating the remote tracking branch for an import,
> the files that were excluded from being exported to it need to be added
> back in. So that tree of excluded files needs to somehow be kept track of
> when exporting, or generated from records.
>
> To generate the excluded tree, would need the whole tree that was
> exported, and the remote's preferred content expression at export time.
> But expressions like inallgroup would also need to look at location
> tracking info at that time. So it would need to remember the
> head of the git-annex branch at export time and query against that
> version of the branch for preferred content and location tracking.
> (And use of `git-annex forget` could break it.)
>
> It seems easier to instead record the tree of excluded files somewhere.
> Logs.Export already records the whole exported tree in the git-annex
> branch, so extend it to also record the tree of excluded files.
> Complication: Export conflicts.
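
The record-and-add-back idea can be sketched like this. It is a simplification under stated assumptions: trees are flat mappings from path to key, `split_export` and `tracking_tree` are hypothetical names, imported entries are assumed to win on a path collision, and export conflicts are not handled at all.

```python
from typing import Callable, Dict, Tuple

# Hypothetical flat model of a git tree: path -> key.
Tree = Dict[str, str]


def split_export(tree: Tree, wanted: Callable[[str], bool]) -> Tuple[Tree, Tree]:
    """At export time, split the tree into what is sent and what is excluded."""
    exported = {p: k for p, k in tree.items() if wanted(p)}
    excluded = {p: k for p, k in tree.items() if not wanted(p)}
    return exported, excluded


def tracking_tree(imported: Tree, excluded: Tree) -> Tree:
    """At import time, add the recorded excluded files back in."""
    merged = dict(excluded)
    merged.update(imported)  # assumption: imported entries win on collision
    return merged


master = {"a.mp3": "K1", "a.wav": "K2", "b.mp3": "K3"}
exported, excluded = split_export(master, lambda p: not p.endswith(".wav"))

# Later, on the remote, b.mp3 was deleted and c.mp3 was added.
imported = {"a.mp3": "K1", "c.mp3": "K4"}
tracking = tracking_tree(imported, excluded)
# a.wav survives, while the deletion of b.mp3 still shows up in the tree.
```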
----
> Matching a preferred content expression at import time before the content
> is downloaded means that the imported key may not yet be known. (Only
> when the ContentIdentifier is known can it be mapped back to an
> already known key.) This is a problem for every preferred content term
> that relates to a key.
>
> Maybe the problem expressions can be guessed:
>
> * For copies, lackingcopies, approxlackingcopies, and inallgroup,
> the number of copies could be assumed to be 1 (the remote being
> imported from). But if it turns out to hash to a known key,
> they would have matched wrong.
>
> * For inbackend and securehash, the backend that will be used for the
> import is probably known. But if annex.largefiles becomes
> supported for imports, it would not be any longer.
>
> * For smallerthan, largerthan, the file size of an import is known.
>
> * For metadata, if we assume the imported file is new content,
> it has no metadata attached. But if it turns out to hash
> to a known key, this would have matched wrong.
>
> * For present, the content is in the remote, so it's definitely present.
>
> * For unused, the file is going to be added to the tree, so its key
> will definitely not be unused.
>
> So in some cases the guess is wrong and a problem expression
> matches when it should not. This either results in a file being imported
> that should not be, or a file not being imported that should be.
> In the former case, when the file reaches the master branch and
> a later export is done, the file may or may not be preferred content
> for the special remote then, and when it's not it will get removed from
> the special remote.
>
> So for example: The user sets a preferred content expression of
> "metadata=notforexport=true" and has some files with that set.
> Then they import from a remote, and it downloads a new file that happens
> to have the same content as one of those files. The new file gets
> added to their master branch, and they export to the remote and the
> new file is then removed from the remote. Seems fairly ok?
>
> Another example: The user sets a preferred content expression of "not
> inallgroup=backup". The import/export remote is not in that group.
> They import from it, and find that no new files that are added to the
> remote ever get imported. That seems to be what they asked for.
>
> Another example: The user sets a preferred content expression of "not
> inallgroup=exports". The import/export remote *is* in that group,
> and so are several other import/export remotes.
> They import from it, and find that no new files that are added to the
> remote ever get imported. Even if the same file got added to all other
> remotes in that group. This seems surprising!
>
> Maybe better than guessing would be to limit preferred content
> expression matching for importing to terms that don't require guessing.
> If an expression is found to require guessing, display a warning and
> make the whole expression match. OR download the content
> from the remote, generate a key from it, and match the preferred
> content expression at that point. That avoids any surprises at
> the expense of an unnecessary download. As long as the ContentIdentifier to
> Key mapping gets updated, it will only download a given file unnecessarily
> one time.
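
Both options can be sketched roughly as below. Everything here is an assumption made for illustration: the `NEEDS_KEY` set is one reading of the list of problem terms above, the tokenizer is a crude split rather than git-annex's real expression parser, and `key_for` with its `download_and_hash` callback is a hypothetical helper standing in for the ContentIdentifier to Key mapping.

```python
import re
from typing import Callable, Dict, List

# Assumption: terms treated as needing a key, taken from the list above.
NEEDS_KEY = {"copies", "lackingcopies", "approxlackingcopies", "inallgroup",
             "inbackend", "securehash", "metadata"}


def terms_needing_key(expression: str) -> List[str]:
    """Option 1: find terms that cannot be matched before the key is known.

    A caller could warn and make the whole expression match when this
    list is non-empty.
    """
    tokens = re.findall(r"[^\s()]+", expression)
    return sorted({t.split("=", 1)[0] for t in tokens} & NEEDS_KEY)


def key_for(cid: str, cid_to_key: Dict[str, str],
            download_and_hash: Callable[[str], str]) -> str:
    """Option 2: download once to learn the key, then reuse the mapping."""
    if cid not in cid_to_key:
        cid_to_key[cid] = download_and_hash(cid)
    return cid_to_key[cid]


downloads: List[str] = []


def fake_download_and_hash(cid: str) -> str:
    # Stand-in for fetching the content and generating a key from it.
    downloads.append(cid)
    return "KEY-for-" + cid


cache: Dict[str, str] = {}
first = key_for("cid1", cache, fake_download_and_hash)
second = key_for("cid1", cache, fake_download_and_hash)
```

In this sketch the second lookup of the same content identifier is served from the mapping, so the download happens only once.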