git-annex/doc/todo/export_preferred_content.mdwn
Joey Hess e06feb7316
honor preferred content when importing
Importing from a special remote honors its preferred content too; unwanted
files are not imported. But, some preferred content expressions can't be
checked before files are imported, and trying to import with such an
expression will fail.

Tested this with scenarios including changing the preferred content
expression and making sure merging the import didn't delete files that were
no longer wanted.

There was one minor inefficiency mentioned in the todo that I punted on.
2019-05-21 14:38:06 -04:00


`git annex export` normally exports all files in the specified tree,
which is generally what the user wants.
But, in some situations, the user may want to export a subset of files,
in a way that can be well expressed by a preferred content expression.
> started work on this in the `preferred` branch. --[[Joey]]
>
> > And [[done]]! --[[Joey]]

For example, they may want to export .mp3 files but not the .wav
files used to produce those.
Or, export podcasts, but not ones in an "old" directory that have already
been listened to.
It seems doable to make `git annex export` honor whatever
preferred content settings have been configured for the remote.
(And `git annex sync --content` too.)
> done

Logs.Export already records the tree that the user chose to export
into the git-annex branch. Should excluded files be present in that
tree or not? A good reason to leave them out is that if the preferred content
settings change, the next export will pick up on the change, since
the exported tree differs from the tree to be exported.
So: Make export of a tree filter that tree through the preferred
content of the remote, and use the new tree as the tree that really
gets exported, recording it in the git-annex branch. But the remote
tracking branch will point to the tree that the user chose to export.
> done
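
The filtering step described above can be pictured with a minimal sketch. It assumes a tree is just a mapping from file path to key, and it stands in for the remote's preferred content with a plain predicate. `filter_tree`, `no_wav` and the key names are hypothetical, not git-annex functions or data.

```python
from fnmatch import fnmatch

def filter_tree(tree, wants):
    """Return the subset of tree (path -> key) that the remote wants."""
    return {path: key for path, key in tree.items() if wants(path)}

# Hypothetical stand-in for a preferred content expression like exclude=*.wav
def no_wav(path):
    return not fnmatch(path, "*.wav")

chosen = {"song.mp3": "KEY1", "song.wav": "KEY2"}
exported = filter_tree(chosen, no_wav)   # recorded as the exported tree
tracking = chosen                        # remote tracking branch keeps the user's tree

# If the expression later changes, the newly filtered tree differs from the
# recorded one, so the next export notices there is work to do.
refiltered = filter_tree(chosen, lambda path: True)
needs_export = refiltered != exported
```

Recording the filtered tree rather than the chosen one is what makes the final comparison useful: a change in preferred content shows up as a tree difference.
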
Problem: A preferred content expression include=subdir/foo or
exclude=subdir/bar matches relative to the top of the repository.
But `git annex export` may be exporting a sub-tree, and it has no way
of knowing where a provided sub-tree sha is rooted within the larger tree.
What it could do is when provided "master:subdir" know that it's operating
within subdir and prefix that to filenames when matching preferred content.
But that would be inconsistent behavior and could violate least surprise.
It may be better to add a note that preferred content expressions include=
exclude= etc match relative to the top of the exported tree when exporting
a subtree.
> done
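
A small sketch of what "relative to the top of the exported tree" means in practice. It uses Python's `fnmatch` as a rough stand-in for glob matching, which is an assumption and may differ from git-annex's own glob rules; `matches_include` is a hypothetical helper.

```python
from fnmatch import fnmatch

def matches_include(pattern, path_in_exported_tree):
    """Match a glob against a path relative to the top of the exported tree."""
    return fnmatch(path_in_exported_tree, pattern)

# Exporting master: paths are relative to the top of the repository.
whole = matches_include("subdir/foo", "subdir/foo")

# Exporting master:subdir: the same file is now just "foo", so the old
# pattern stops matching and a pattern relative to the subtree is needed.
sub_old = matches_include("subdir/foo", "foo")
sub_new = matches_include("foo", "foo")
```
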
Note: Each `git-annex sync --content` re-filters the exported tree.
Unnecessary work. If there were a way to look up the original tree that
corresponds with the filtered exported tree, that could be avoided.

----
> `git annex import` of a tree from a special remote would also be
> influenced by this.
>
> It would make sense for the ImportableContents to have files
> that are not preferred content filtered out of it. Eg, if a .wav file
> is added to the remote, it shouldn't be downloaded. Or a better example,
> if directory Music is excluded from an android remote, importing from
> it should exclude that directory.
> > done
## import after limited export
> Problem: If a tree is exported with eg, no .wav files, and then an import
> is made from the remote, and necessarily lacks .wav files, the remote
> tracking branch will be updated with a tree with no .wav
> files. Merging that into master will delete all the .wav files.
>
> So it seems that, when updating the remote tracking branch for an import,
> the files that were excluded from being exported to it need to be added
> back in. So that tree of excluded files needs to somehow be kept track of
> when exporting.
>
> Complication: The export might happen from one clone and then another
> clone imports. The clones might not sync in between. Seems all that
> the importing clone can rely on is its local state.
>
> If importing with no remote tracking branch existing yet, the import will
> create one with a disconnected history, and so it's ok to import a tree
> missing excluded files; merging a disconnected history won't delete
> those files from master.
>
> In the multiple clone case, the importing clone can't rely on information
> from the exporting clone, but if the importing clone only ever imports
> it's fine; if it exports it needs to take that into account for
> subsequent imports.
>
> So, the only case where the excluded files
> need to be added back is when there was a previous export done from
> the current repo. The list of excluded files in the export can
> be recorded locally and added back to the import.
>
> > done
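
As an illustration of the conclusion above, here is a sketch of adding locally recorded excluded files back into an imported tree before updating the remote tracking branch. Trees are modelled as path to key mappings, and `tree_for_tracking_branch` is a hypothetical name; how git-annex actually records the excluded files is not shown here.

```python
def tree_for_tracking_branch(imported, excluded_at_export):
    """Add back files that were left out of the last local export.

    Files present in the import win, since they reflect the remote's state.
    """
    merged = dict(excluded_at_export)
    merged.update(imported)
    return merged

excluded = {"song.wav": "KEY2"}          # recorded locally at export time
imported = {"song.mp3": "KEY1", "new.mp3": "KEY3"}
result = tree_for_tracking_branch(imported, excluded)

# A clone that has never exported has nothing recorded, so the imported
# tree is used unchanged.
import_only = tree_for_tracking_branch(imported, {})
```

Merging `result` into master keeps `song.wav`, because the tree on the remote tracking branch still contains it.
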
## matching preferred content expressions on import
> Matching a preferred content expression at import time before the content
> is downloaded means that the imported key may not yet be known. (Only
> when the ContentIdentifier is known can it be mapped back to an
> already known key.) This is a problem for every preferred content term
> that relates to a key.
>
> Maybe the problem expressions can be guessed:
>
> * For copies, lackingcopies, approxlackingcopies, and inallgroup,
> the number of copies could be assumed to be 1 (the remote being
> imported from). But if it turns out to hash to a known key,
> they would have matched wrong.
>
> * For inbackend and securehash, the backend that will be used for the
> import is probably known. But if annex.largefiles becomes
> supported for imports, it would not be any longer.
>
> * For metadata, if we assume the imported file is new content,
> it has no metadata attached. But if it turns out to hash
> to a known key, this would have matched wrong.
>
> * For present, the content is in the remote, so it's definitely present.
>
> * For unused, the file is going to be added to the tree, so its key
> will definitely not be unused.
>
> So in some cases the guess is wrong and a problem expression
> matches when it should not. This either results in a file being imported
> that should not, or a file not being imported that should be.
> In the former case, when the file reaches the master branch and
> a later export is done, the file may or may not be preferred content
> for the special remote then, and when it's not it will get removed from
> the special remote.
>
> So for example: The user sets a preferred content expression of
> "metadata=notforexport=true" and has some files with that set.
> Then they import from a remote, and it downloads a new file that happens
> to have the same content as one of those files. The new file gets
> added to their master branch, and they export to the remote and the
> new file is then removed from the remote. Seems fairly ok?
>
> Another example: The user sets a preferred content expression of "not
> inallgroup=backup". The import/export remote is not in that group.
> They import from it, and find that no new files that are added to the
> remote ever get imported. That seems to be what they asked for.
>
> Another example: The user sets a preferred content expression of "not
> inallgroup=exports". The import/export remote *is* in that group,
> and so are several other import/export remotes.
> They import from it, and find that no new files that are added to the
> remote ever get imported. Even if the same file got added to all the other
> remotes in that group. This seems surprising!
>
> Maybe better than guessing would be to limit preferred content
> expression matching for importing to terms that don't require the key.
> If an expression is found to require the key, display a warning and
> don't import.
>
> OR download the content
> from the remote, generate a key from it, and re-match the preferred
> content expression. That avoids any surprises and supports all
> expressions at the expense of an unnecessary download. As long as the
> ContentIdentifier to Key mapping gets updated, it will only download
> a given file unnecessarily one time.
>
> Which approach is better? Note that almost all of the standard groups
> do depend on the key. But it seems very likely that most actual
> uses of this feature would involve the name or size of a file that's
> being imported, and nothing else.
>
> Imagine an import of all non-mp3s from the phone, and the phone has
> a 20 gb mp3 collection. Downloading them all just to check the preferred
> content expression would be an enormous amount of unnecessary work.
> If the user started with `exclude=*.mp3`, they'd expect it to be fast,
> and if they changed to `exclude=*.mp3 or metadata=tag=podcast`
> and it did all that extra work, that would be surprising.
> > done; it seemed to make sense at least at first to make import
> > fail when the preferred content depended on a key.
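
The "refuse to import when the expression needs a key" approach can be sketched roughly as follows. The set of key-dependent terms is taken from the list above and is an assumption, not necessarily the full set git-annex checks. The tokenizer is deliberately naive, and `key_dependent_terms` and `check_importable` are hypothetical names.

```python
import re

# Terms the discussion above lists as relating to a key (an assumption:
# not necessarily the complete set).
KEY_TERMS = {"copies", "lackingcopies", "approxlackingcopies", "inallgroup",
             "inbackend", "securehash", "metadata", "present", "unused"}

def key_dependent_terms(expression):
    """Return the names of terms in the expression that need the key."""
    tokens = re.findall(r"[^\s()]+", expression)
    names = [tok.split("=", 1)[0] for tok in tokens]
    return sorted(set(names) & KEY_TERMS)

def check_importable(expression):
    """Raise if the expression cannot be matched before download."""
    bad = key_dependent_terms(expression)
    if bad:
        raise ValueError("cannot import; expression depends on key: "
                         + ", ".join(bad))

fast = key_dependent_terms("exclude=*.mp3")
slow = key_dependent_terms("exclude=*.mp3 or metadata=tag=podcast")
```

With this check, the `exclude=*.mp3` case stays fast, and adding `metadata=tag=podcast` produces an explicit failure rather than a surprising bulk download.
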
## different preferred content for export and import?
There may be cases where this makes sense. For example, I might make my phone
prefer all content that has some metadata set, but want to import all files
from my phone (or all files except those in the music directory).
OTOH, that config would cause files imported from the phone to be removed
from it on the next export, unless the necessary metadata got set; git
annex sync --content would not work well.
Better example: Make the phone want all content that is in the laptop
group, so all files on my laptop export to the phone but not others that I
have archived. But want to import all files from the phone except for mp3s.
One way would be new preferred content terminals that match when importing
and exporting:

    (exporting and copies=laptop:1) or (importing and exclude=*.mp3)

This needs preferred content expressions for import to be able to
support things like copies= that need to know the key, as discussed above.
If the import preferred content expression is limited to not include those
terms, the above example can't be used at import time. Unless it can be
simplified before being checked for those terms:

    (false and copies=laptop:1) or (true and exclude=*.mp3)
    false or (true and exclude=*.mp3)
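
The simplification above can be sketched as constant folding over a tiny expression tree. This assumes expressions are already parsed into nested `("and" | "or", left, right)` tuples with terms as plain strings. `simplify` and `at_import` are hypothetical names, and the `importing` and `exporting` terminals are only the idea floated here, not existing terms.

```python
def simplify(expr):
    """Fold constant True/False out of nested ("and"|"or", left, right) tuples."""
    if not isinstance(expr, tuple):
        return expr
    op, left, right = expr[0], simplify(expr[1]), simplify(expr[2])
    if op == "and":
        if left is False or right is False:
            return False
        if left is True:
            return right
        if right is True:
            return left
    elif op == "or":
        if left is True or right is True:
            return True
        if left is False:
            return right
        if right is False:
            return left
    return (op, left, right)

def at_import(expr):
    """Substitute importing=True and exporting=False, then simplify."""
    def subst(e):
        if isinstance(e, tuple):
            return (e[0], subst(e[1]), subst(e[2]))
        return {"importing": True, "exporting": False}.get(e, e)
    return simplify(subst(expr))

expr = ("or", ("and", "exporting", "copies=laptop:1"),
              ("and", "importing", "exclude=*.mp3"))
result = at_import(expr)
```

Here `result` is just the `exclude=*.mp3` term: the `copies=` term sits in a branch that folds to false, so it never needs a key at import time.
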
Or there could be separate expressions for import and export.
And this kind of makes sense; the normal preferred content expression
controls what is stored on a remote, so affects export, and the import
preferred content expression is only used to fine-tune which files get
imported from it.
Problem: Suppose a tree is exported to a special remote, and the tree includes some
mp3 files. And then an import is done, excluding mp3 files. That will
create a remote tracking branch that lacks the mp3 files; merging it will
delete them from master. That could be surprising! This is the inverse of
the "import after limited export" problem, but it seems it
can't be solved in a similar way.
That problem might be a fatal blow to this idea of separate expressions
for import and export. But it's worse than that! --
Problem: Suppose a tree is exported to a special remote and the tree
includes mp3 files. Then the remote's preferred content is set to exclude
mp3 files. Then on import from the remote, a tree will be constructed that
lacks the mp3 files that were exported before; merging it will delete them
from master.
That could be avoided by making the import notice that the preferred
content expression has changed, and so throw away the old remote tracking
branch history and import a commit with no parent, avoiding the deletion.
But, if some other file got deleted from the special remote after the
export, the import would then not delete it.
Alternatively, when a preferred content expression doesn't match a file at
import, could check if the same file is known to be present on the remote
as of the last import or export. (With same or different content.) If so,
assume the preferred content has changed and that the user does not want to
delete this file, so keep it in the import anyway. This way the import does
not delete files from master, and when the next export removes it from
the remote it will still not get deleted from master.
> done
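
A rough sketch of the last alternative: a file that fails the preferred content check is still kept in the import when it was known to be on the remote as of the last import or export. The data structures and the name `import_tree` are hypothetical, and the predicate is a stand-in for a real preferred content expression.

```python
def import_tree(remote_files, wants, known_on_remote):
    """Pick which remote files go into the imported tree.

    A file that fails the preferred content check is kept anyway when it was
    known to be on the remote as of the last import or export.
    """
    return {path: key for path, key in remote_files.items()
            if wants(path) or path in known_on_remote}

remote_files = {"a.mp3": "KEY1", "b.txt": "KEY2", "c.mp3": "KEY3"}
known = {"a.mp3", "b.txt"}               # paths from the last export

def no_mp3(path):
    return not path.endswith(".mp3")

tree = import_tree(remote_files, no_mp3, known)
```

`a.mp3` survives because it was exported before the expression changed, while the new and unwanted `c.mp3` is not imported. A file deleted from the remote is absent from `remote_files`, so the import still reflects that deletion.
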