git-annex/doc/design/metadata.mdwn
Joey Hess d0fce426c4 pre-commit-annex hook script to automatically extract metadata from lots of types of files
Using the extract(1) program to do the heavy lifting.

Decided to make git-annex run pre-commit-annex when committing. Since
git-annex pre-commit also runs it, it'll be run when git commit is run too,
via the pre-commit hook. This basically gives back the pre-commit hook
that git-annex took away. The implementation avoids repeatedly looking
for the hook script when the assistant is running and committing
repeatedly; only checks if the hook is available once.

To make the script simpler, made git-annex metadata -s field?=value
only set a field when it's not already got a value.

This commit was sponsored by bak.
2014-03-02 20:11:58 -04:00

[[!toc]]
# metadata
Attach an arbitrary set of metadata to a key. This consists of any number
of fields. Each field has an unordered set of values. The special field
"tag" has as its values any tags that are set for the key.
Store in git-annex branch, next to location log files.
Storage needs to support union merging, including removing an old value
of a field, and adding a new value of a field.
# filtered branches
See [[tips/metadata_driven_views]]
The reason to use specially named filtered branches is that the branch
name makes it self-documenting how the repository is currently filtered.
## unmatched files in filtered branches
TODO Files not matching the view should be able to be included in
the filtered branch, in a special location, an "other" directory.
For example, it could make an "other" directory containing files
without a tag when viewing by tag.
It might be nice, if in a two level view, for the other directories
to nest. For example, `other/2014/file`. However, that leads to a
performance problem: When adding a level to a view, it has to look at each
file in the "other" directory and generate a view for it too. With a lot
of files, that'd be slow.
Instead, why not replicate the parent branch's directory structure inside
the "other" directory? Then the directory tree only has to be constructed
once, and can be left alone when refining a view.
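
A minimal sketch of that layout, assuming a view by a single field. The
name `view_path` and its arguments are hypothetical, and the real view
filenames are mangled differently; this only shows where unmatched files
would go.

```python
def view_path(original_path, values):
    """Return the paths a file would get in a single-field view.

    values is the set of values the file has for the viewed field.
    A file with no value keeps its original directory structure under
    "other/", so that subtree does not depend on the fields being viewed.
    """
    if not values:
        return ["other/" + original_path]
    name = original_path.rsplit("/", 1)[-1]
    return sorted(value + "/" + name for value in values)
```

Since the path under "other" does not depend on which fields are being
viewed, refining the view would not need to touch it.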
## operations while on filtered branch
* If files are removed and git commit called, git-annex should remove the
relevant metadata from the files. **done**
(Currently, only metadata used for visible subdirs is added and removed
this way.)
(Also, this is not usable in direct mode, because deleting the
file actually deletes it.)
* If a file is moved into a new subdirectory while in a view branch,
a tag is added with the subdir name. This allows on the fly tagging.
**done**
* `git annex sync` should avoid pushing out the view branch, but
it should check if there are changes to the metadata pulled in, and update
the branch to reflect them.
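
The first two items can be sketched as a mapping from a change in the
view to metadata changes. This assumes a view by tag with one level of
subdirs, and assumes a move is treated as a removal from the old subdir
plus an addition to the new one. `tag_changes` is a hypothetical name, and
the `+`/`-` notation is borrowed from the log example further down.

```python
def tag_changes(old_path, new_path):
    """Tag changes implied by a change to a file in a view by tag.

    Paths are relative to the top of the view. None as old_path means the
    file is new in the view; None as new_path means it was removed.
    """
    def subdir(path):
        if path is None or "/" not in path:
            return None
        return path.split("/", 1)[0]

    old, new = subdir(old_path), subdir(new_path)
    changes = []
    if old is not None and old != new:
        changes.append("-" + old)   # left the old subdir: unset that tag
    if new is not None and new != old:
        changes.append("+" + new)   # entered a new subdir: set that tag
    return changes
```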
## automatically added metadata
When annex.genmetadata is set, git annex add automatically attaches
some metadata to a file. Currently year and month fields, from its mtime.
There's also a pre-commit-annex hook script.
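
Deriving the year and month fields from an mtime could look roughly like
this. It is only a sketch: using UTC and unpadded month numbers are
assumptions, not something this page specifies, and `mtime_fields` is a
hypothetical name.

```python
import time

def mtime_fields(mtime):
    """Return year and month metadata fields for a file's mtime.

    mtime is seconds since the epoch. UTC is used here so the result does
    not depend on the local timezone (an assumption of this sketch).
    """
    t = time.gmtime(mtime)
    return {"year": str(t.tm_year), "month": str(t.tm_mon)}
```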
## directory hierarchy metadata
From the original filename used in the master branch, when
constructing a view, generate fields. For example foo/bar/baz.mp3
would get /=foo, foo/=bar, foo/bar/=baz, and .=mp3.
Note that dir/=subdir allows a view to use `dir/=*` and only
match one level of subdirs with the glob. So it is better than dir=foo/bar
as the metadata. (Alternatively, could do special glob matching.)
This allows using whatever directory hierarchy exists to inform the view,
without locking the view into using it.
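
A sketch of how those fields could be generated, following the
foo/bar/baz.mp3 example above. `path_fields` and `matches` are
hypothetical names, and Python's `fnmatch` stands in for whatever glob
matching the view code really uses.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

def path_fields(path):
    """Generate per-level directory fields from a filename."""
    p = PurePosixPath(path)
    fields = {}
    prefix = ""
    # Each directory level, and finally the basename without extension,
    # becomes the value of a field named after the path leading to it.
    for part in list(p.parent.parts) + [p.stem]:
        fields[prefix or "/"] = part
        prefix += part + "/"
    if p.suffix:
        fields["."] = p.suffix[1:]
    return fields

def matches(fields, spec):
    """Check a "field=glob" spec against the generated fields."""
    field, _, pattern = spec.partition("=")
    return field in fields and fnmatchcase(fields[field], pattern)

fields = path_fields("foo/bar/baz.mp3")
# Each value is a single path component, so "foo/=*" can only ever
# match one level. With a flat value like dir=foo/bar, a plain "*"
# would match across the slash as well.
```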
Complication: Refining a view only looks at the filenames in the view,
so the same metadata has to be derived from those filenames, unless there
is persistent storage. Luckily, the filenames used in the views currently
include the subdirs.
# other uses for metadata
Uses are not limited to view branches.
`git annex checkoutmeta year=2014 talk` in a subdir of master could create the
same tree of files that a filter would. The user can then commit that if desired.
Or, they could run additional commands like `git annex fadd` to refine the
tree of files in the subdir.
Metadata can be used for configuring numcopies. One way would be a
numcopies=n value attached to a file. But perhaps better would be to make
the numcopies.log allow configuring numcopies based on which files have
other metadata.
Other programs could query git-annex for the metadata of files in the work
tree, and do whatever they want with it.
# filenames
The hard part of this is actually getting a useful filename to put in the
view branch, since git-annex only has a key which the user will not
want to see.
* Could use filename metadata for the key, recorded by git-annex add (which
may not correspond to filenames being used in regular git branches like
master for the key).
* Could use the .map files to get a filename, but this is somewhat
arbitrary (.map can contain multiple filenames), and is only
currently supported in direct mode.
* Current approach: Have a reference branch (eg master) and walk it to
find filenames and
keys. Fine as long as it can be done efficiently. Also allows including
the subdirectory a file is in, potentially. cwebber points out that this
is essentially a form of tracking branch. Which implies it will need to
be updatable when the reference branch changes. Should be doable via
diff-tree.
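
A sketch of the filename lookup in the current approach, assuming indirect
mode, where each annexed file in the reference branch is a symlink into
`.git/annex/objects/` whose last path component is the key. The function
names are hypothetical, and how the symlink targets are read out of the
branch is left out.

```python
import posixpath

def key_from_link(target):
    """Return the key a symlink target points at, or None if the
    target is not an annex object link."""
    if "/annex/objects/" not in target:
        return None
    return posixpath.basename(target)

def filenames_by_key(links):
    """Map each key to the filenames in the reference branch using it.

    links maps filename -> symlink target. One key can have several
    filenames, so the values are lists.
    """
    result = {}
    for filename, target in sorted(links.items()):
        key = key_from_link(target)
        if key is not None:
            result.setdefault(key, []).append(filename)
    return result
```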
Note that we have to take care to avoid generating conflicting filenames.
The current approach is to embed the full directory structure inside the
filename in the view branch.
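
To illustrate the idea only (this is not the mangling git-annex uses),
percent-encoding the whole path yields a flat filename that still carries
the directory structure, cannot collide for distinct paths, and can be
reversed. `flatten` and `unflatten` are hypothetical names.

```python
from urllib.parse import quote, unquote

def flatten(path):
    """Embed the full directory structure in a single filename."""
    # safe="" makes quote() encode "/" too, so the result has no slashes.
    return quote(path, safe="")

def unflatten(name):
    """Recover the original path from a flattened filename."""
    return unquote(name)
```

The cost is longer filenames, which is where the length gotcha at the end
of this page comes from.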
## union merge properties
While the storage could just list all the current values of a field on a
line with a timestamp, that's not good enough. Two disconnected
repositories can make changes to the values of a field (setting and
unsetting tags for example) and when this is union merged back together,
the changes need to be able to be replayed in order to determine which
values we end up with.
To make that work, we log not only when a field is set to a value,
but when a value is unset as well.
For example, here two different remotes added tags, and then later
a tag was removed:

    1287290776.765152s tag +foo +bar
    1287290991.152124s tag +baz
    1291237510.141453s tag -bar

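
A sketch of replaying such a log after a union merge. The line format is
taken from the example above; that values contain no whitespace is an
assumption of this sketch, and `parse_line` and `replay` are hypothetical
names.

```python
def parse_line(line):
    """Split a log line into (timestamp, field, changes)."""
    stamp, field, *changes = line.split()
    return float(stamp.rstrip("s")), field, changes

def replay(lines):
    """Replay set/unset changes in timestamp order.

    A union merge just combines the lines of both sides, so duplicates
    are dropped and the order of the input does not matter.
    """
    current = {}
    for _, field, changes in sorted(parse_line(l) for l in set(lines)):
        values = current.setdefault(field, set())
        for change in changes:
            if change.startswith("+"):
                values.add(change[1:])
            elif change.startswith("-"):
                values.discard(change[1:])
    return current

log_a = ["1287290776.765152s tag +foo +bar", "1291237510.141453s tag -bar"]
log_b = ["1287290991.152124s tag +baz"]
merged = replay(log_b + log_a)
```

Logging only the current values would lose the fact that bar was
deliberately unset, and a merge could bring it back.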
# efficient metadata lookup
Looking up metadata for view generation so far requires traversing all keys
in the git-annex branch. This is slow. A fast cache is needed.
TODO
# direct mode issues
TODO (direct mode is currently not supported with view branches)
Checking out a view branch can result in any number of copies of a file
appearing in different directories. No problem in indirect mode, but
in direct mode these are real, expensive copies.
But, it's worth supporting direct mode!
So, possible approaches:
* Before checking out a view branch, calculate how much space will
be used by duplicates and refuse if not enough is free.
* Only check out one file, and omit the copies. Keep track of which
files were omitted, and make sure that when committing on the branch,
that metadata is not removed. Has the downside that files can seem
to randomly move around in the tree as their metadata changes.
* Disallow view branch checkouts that have duplicate files.
This would cripple it some, but perhaps not too badly?
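
The first approach could be sketched like this, assuming the view is
available as a mapping from path to key and the size of each key is
known. All names here are hypothetical.

```python
from collections import Counter

def extra_space_needed(view_files, key_sizes):
    """Bytes used by duplicate copies when a view is checked out.

    view_files maps each path in the view to its key; key_sizes maps
    each key to its size in bytes. The first copy of a key is already
    present, so only the additional copies count.
    """
    copies = Counter(view_files.values())
    return sum(key_sizes[key] * (n - 1) for key, n in copies.items())

def can_check_out(view_files, key_sizes, free_bytes):
    """Refuse the checkout when the duplicates would not fit."""
    return extra_space_needed(view_files, key_sizes) <= free_bytes
```

In Python, `shutil.disk_usage(path).free` would be one way to get
`free_bytes`.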
# gotchas
* Checking out a view branch can remove the current subdir. May be worth
detecting when this happens and helping the user.
**done**
* Git has a complex set of rules for what is legal in a ref name.
View branch names will need to filter out any illegal stuff. **done**
* Metadata should be copied to the new key when adding a modified version
of a file. **done**
* Filesystems that are not case sensitive (including case preserving OSX)
will cause problems if view branches try to use different cases for
2 directories representing a metadata field.
Solution might be to compare field names case-insensitively, and
pick one representation consistently. **done**
* Assistant needs to know about views, so it can update metadata when
files are moved around inside them. TODO
* What happens if git annex add or the assistant adds a new file while on a
view? If the file is not also added to the master branch, it will be lost
when exiting the view. TODO
* The filename mangling can result in a filename in a view
that is too long for its containing filesystem. Should detect and do
something reasonable to avoid. TODO
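
For the ref name item above, a sketch of filtering a metadata value so it
can be used as one component of a branch name. It covers the main rules
from git-check-ref-format but is not claimed to be exhaustive, and it is
not the code git-annex uses; `sanitize_component` is a hypothetical name.

```python
import re

# Rejected anywhere in a ref name: ASCII control characters, space,
# DEL, and the characters ~ ^ : ? * [ \
_FORBIDDEN = re.compile(r"[\x00-\x20\x7f~^:?*\[\\]")

def sanitize_component(text, replacement="_"):
    """Make text usable as a single component of a ref name."""
    text = _FORBIDDEN.sub(replacement, text)
    text = text.replace("/", replacement)    # stay within one component
    text = text.replace("..", replacement)   # ".." is not allowed
    text = text.replace("@{", replacement)   # "@{" is not allowed
    text = text.lstrip(".")                  # cannot begin with "."
    # Cannot end with ".lock" or "."; repeat until neither applies.
    while text.endswith(".lock") or text.endswith("."):
        text = text[:-5] if text.endswith(".lock") else text[:-1]
    return text or replacement
```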