every idea that came to me in my sleep. there were rather a lot of them

Joey Hess 2014-02-11 11:37:53 -04:00
parent 5e8dee6cb0
commit aa06e913e5


@@ -4,11 +4,12 @@
 Attach an arbitrary set of metadata to a key.
-Metadata can be tags, but it can also be fields with values (ie, date=xxx,
-conference=yyy).
 Store in git-annex branch, next to location log files.
+Metadata can be tags, but it can also be fields with values (ie, date=xxx,
+conference=yyy). Fields can have multiple values, for example
+multiple authors.
 Storage needs to support union merging, including removing tags, and
 changing values.
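One way the union-merge requirement could be met is an append-only log in which removals are recorded as entries too. A minimal sketch, assuming a hypothetical entry shape of (timestamp, field, value, is_set) and modeling tags as values of a "tag" field; the design above does not specify the log format, so none of these names come from git-annex:

```python
from typing import Dict, Iterable, Set, Tuple

# Hypothetical log entry: (timestamp, field, value, is_set).
# is_set=False records a removal, so removals survive a union merge.
LogEntry = Tuple[float, str, str, bool]


def union_merge(a: Iterable[LogEntry], b: Iterable[LogEntry]) -> Set[LogEntry]:
    """Merge two logs by keeping every entry from both sides."""
    return set(a) | set(b)


def current_metadata(log: Iterable[LogEntry]) -> Dict[str, Set[str]]:
    """Replay entries oldest first; the newest entry for each
    (field, value) pair decides whether that value is present."""
    state: Dict[Tuple[str, str], bool] = {}
    for _timestamp, field, value, is_set in sorted(log):
        state[(field, value)] = is_set
    result: Dict[str, Set[str]] = {}
    for (field, value), is_set in state.items():
        if is_set:
            result.setdefault(field, set()).add(value)
    return result
```

Changing a value is then a removal of the old value plus an addition of the new one, and the merge result does not depend on which side is merged into which.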
@@ -20,6 +21,7 @@ when adding it.
 Could also automatically attach permissions.
 A git hook could be run by git annex add to gather more metadata.
+For example, by examining MP3 metadata.
 Also auto adds metadata when adding files to filter branches. See below.
@@ -28,40 +30,62 @@ Also auto adds metadata when adding files to filter branches. See below.
 From the ctime, some additional
 metadata is derived, at least year=yyyy and probably also month, etc.
-Should be a general mechanism for this.
+This is probably not stored anywhere. It's computed on demand by a pure
+function from the other metadata.
+From the set of tags a file has, a "tag" field is derived, which has the
+value of each tag. See example below.
+Should be a general mechanism for this. (It probably generalizes to
+sql queries if we want to go that far.)
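The "pure function from the other metadata" idea could look roughly like this. A sketch only: the function name, the use of UTC, and the zero-padded month are assumptions, not part of the design above.

```python
from datetime import datetime, timezone
from typing import Dict, Set


def derived_metadata(ctime: float, tags: Set[str]) -> Dict[str, Set[str]]:
    """Compute derived fields on demand; nothing here is stored.

    Assumes ctime is a POSIX timestamp and interprets it in UTC.
    """
    t = datetime.fromtimestamp(ctime, tz=timezone.utc)
    return {
        "year": {f"{t.year:04d}"},
        "month": {f"{t.month:02d}"},
        # The derived "tag" field has one value per tag on the file.
        "tag": set(tags),
    }
```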
 # filtered branches
 `git annex filter year=2014 talk` should create a new branch
-filtered/talk/year=2014 containing only files tagged with that, and
+filtered/year=2014/talk containing only files tagged with that, and
 have git check it out. In this example, all files appear in top level
 directory of repo; no subdirs.
 `git annex fadd haskell` switches to branch
-filtered/haskell/talk/year=2014 with only the haskell talks.
+filtered/year=2014/talk/haskell with only the haskell talks.
 `git annex fadd year=2013 year=2012` switches to branch
-filtered/haskell/talk/year=2012,2013,2014. This has subdirectories 2012,
+filtered/year=2012,2013,2014/talk/haskell. This has subdirectories 2012,
 2013 and 2014 with the matching talks.
-`git annex frm haskell` switches to
-filtered/talk/year=2012,2013,2014, which has all available talks in it.
+Patterns can be used in both the values of fields, and in matching tags.
+So, `year=20*` could be used to match years, and `foo/*` matches any
+tag in the foo namespace. Or even `*` to match *all* tags.
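The pattern matching described here could be sketched with Python's `fnmatch` module. This is an illustration under assumptions: `matches` is a hypothetical helper, and `fnmatch` also understands `?` and `[...]`, which the design above does not mention. Comma-separated value lists are not handled.

```python
from fnmatch import fnmatchcase
from typing import Dict, Set


def matches(term: str, tags: Set[str], fields: Dict[str, Set[str]]) -> bool:
    """Check one filter term against a file's tags and fields.

    A term containing "=" is field=pattern and is matched against the
    values of that field; any other term is matched against the tags.
    """
    if "=" in term:
        field, pattern = term.split("=", 1)
        return any(fnmatchcase(v, pattern) for v in fields.get(field, set()))
    return any(fnmatchcase(t, term) for t in tags)
```

With this sketch, `*` matches any file that has at least one tag, and a file with no tags matches no tag pattern at all.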
-`git annex filteradd conference=fosdem conference=icfp` switches to branch
-filtered/conference=fosdem,icfp/talk/year=2012,2013,2014. Now we need
-to either nest the subdirectories, or make fosdem-2014, icfp-2013, etc.
-May need an option to choose this. Note that user may prefer to have year
-first or conference first, so may need an option for that as well.
+`git annex frm haskell` switches to
+filtered/year=2012,2013,2014/talk, which has all available talks in it.
+`git annex fadd conference=fosdem conference=icfp` switches to branch
+filtered/year=2012,2013,2014/talk/conference=fosdem,icfp. Now there
+are nested subdirectories. They follow the format of the branch,
+so 2013/icfp, 2014/fosdem, etc.
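How a file lands in the nested subdirectories could be computed along these lines. A sketch, assuming (from the examples above) that only terms listing several values create a directory level, that levels follow the order of the branch name, and that values are literal rather than patterns; `filtered_paths` is a hypothetical name.

```python
from itertools import product
from typing import Dict, List, Set


def filtered_paths(filter_terms: List[str], tags: Set[str],
                   fields: Dict[str, Set[str]], filename: str) -> List[str]:
    """Return every path the file would get in the filtered branch,
    or an empty list when the file does not match the filter."""
    levels: List[List[str]] = []
    for term in filter_terms:
        if "=" in term:
            field, wanted = term.split("=", 1)
            hits = sorted(set(wanted.split(",")) & fields.get(field, set()))
            if not hits:
                return []
            if "," in wanted:
                # Multi-valued terms become one directory level each.
                levels.append(hits)
        elif term not in tags:
            return []
    return ["/".join(combo + (filename,)) for combo in product(*levels)]
```

A file whose field has several of the wanted values gets one path per combination, which is where the duplicate files discussed under direct mode come from.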
+`git annex filter tag=haskell,debian` uses the "tag" field that is
+automatically derived from the set of tags. So this yields a branch
+with haskell and debian subdirectories, containing the files tagged with
+either.
+To see all tags, `git annex filter tag=*` !
+Files not matching the filter can be included, by using
+`git annex filter --unmatched=other`. That puts all such files into
+the subdirectory other.
+Sometimes you want to see files that do not match a tag, while still
+getting subdirectories for
 Note that old filter branches can be deleted when switching to a new one.
-There is no need to retain them. Unless the user has committed non
-git-annexed files to them, In which case, urk.
+There is no need to retain them. Unless the user has committed non-annexed
+files to them, In which case, urk. The only reason to use specially named
+filtered branches is because it makes self-documenting how the repository
+is currently filtered.
-These command should probably refuse to do anything if run from within a
-subdir of the work tree that would get deleted by checking out the new
-filtered branch.
-# operations while on filter branch
+## operations while on filtered branch
 * If files are removed and git commit called, git-annex should remove the
 relevant metadata from the files. **possibly** It's not clear that
@@ -69,6 +93,8 @@ filtered branch.
 branch (especially if it's derived metadata like the year).
 Also, this is not usable in direct mode because deleting the
 file.. actually deletes it.
+* If a file is moved into a new subdirectory while in a filter branch,
+a tag is added with the subdir name. This allows on the fly tagging.
 * `git annex sync` should avoid pushing out the filter branch, but
 it should check if there are changes to the metadata pulled in, and update
 the branch to reflect them.
@@ -85,6 +111,11 @@ same tree of files filter would. The user can then commit that if desired.
 Or, they could run additional commands like `git annex fadd` to refine the
 tree of files in the subdir.
+Metadata can be used for configuring numcopies. One way would be a
+numcopies=n value attached to a file. But perhaps better would be to make
+the numcopies.log allow configuring numcopies based on which files have
+other metadata.
 Other programs could query git-annex for the metadata of files in the work
 tree, and do whatever it wants with it.
@@ -97,11 +128,59 @@ want to see.
 * Could use filename metadata for the key, recorded by git-annex add (which
 may not correspond to filenames being used in regular git branches like
 master for the key).
-* Couod use the .map files to get a filename, but this is somewhat
+* Could use the .map files to get a filename, but this is somewhat
 arbitrary (.map can contain multiple filenames), and is only
 currently supported in direct mode.
+Note that any of these filenames can in theory conflict. May need to use
+`.variant-*` like sync does on conflict to allow 2 files with same name in
+same filtered branch.
 # efficient metadata lookup
 Looking up metadata for filtering so far requires traversing all keys in
 the git-annex branch. This is slow. A fast cache is needed.
+# direct mode issues
+Checking out a filter branch can result in any number of copies of a file
+appearing in different directories. No problem in indirect mode, but
+in direct mode these are real, expensive copies.
+But, it's worth supporting direct mode!
+So, possible approaches:
+* Before checking out a filter branch, calculate how much space will
+be used by duplicates and refuse if not enough is free.
+* Only check out one file, and omit the copies. Keep track of which
+files were omitted, and make sure that when committing on the branch,
+that metadata is not removed. Has the downside that files can seem
+to randomly move around in the tree as their metadata changes.
+* Disallow filter branch checkouts that have duplicate files.
+Note that duplicate files can only occur when filtering on the content
+of values, not tags. And values can be used in some simple cases w/o
+duplicate files. This would cripple it some, but perhaps not too badly?
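The first approach, refusing the checkout when duplicates would not fit, could be estimated as below. A sketch with hypothetical inputs: a map from each key to the paths it would be checked out at, and a map from each key to its size in bytes. It ignores filesystem overhead and any space the checkout frees.

```python
import shutil
from typing import Dict, List


def duplicate_bytes(paths_by_key: Dict[str, List[str]],
                    size_by_key: Dict[str, int]) -> int:
    """Extra bytes needed for the second and later copies of each key."""
    return sum(size_by_key[key] * (len(paths) - 1)
               for key, paths in paths_by_key.items()
               if len(paths) > 1)


def checkout_fits(paths_by_key: Dict[str, List[str]],
                  size_by_key: Dict[str, int],
                  worktree: str = ".") -> bool:
    """True when the filesystem holding worktree has room for the copies."""
    free = shutil.disk_usage(worktree).free
    return duplicate_bytes(paths_by_key, size_by_key) <= free
```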
+# gotchas
+* Checking out a filter branch can remove the current subdir. May be worth
+detecting when this happens and leaving behind an empty directory so the
+user can navigate back up.
+* Git has a complex set of rules for what is legal in a ref name.
+Filter branch names will need to filter out any illegal stuff.
+* Filesystems that are not case sensitive (including case preserving OSX)
+will cause problems if filter branches try to use different cases for
+2 directories representing the value of some metadata. But, users
+probably want at least case-preserving metadata values.
+Solution might be to compare metadata case-insensitively, and
+pick one representation consistently, so if, for example an author
+field uses mixed case, it will be used in the filter branch.
+Alternatively, it could escape `A` to `_A` when such a filesystem
+is detected and avoid collisions that way (double `_` to escape it).
+This latter option is ugly, but so are non-posix filesystems.. and it
+also solves any similar issues with case-colliding filenames.
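The `_A` escaping idea from the last gotcha could work roughly as follows. A sketch with hypothetical function names, aimed at ASCII names; characters whose lowercase form is more than one character are not considered.

```python
def escape_case(name: str) -> str:
    """Prefix each cased-up character with "_" and double any literal "_",
    so two names differing only by case no longer collide on a
    case-insensitive filesystem."""
    out = []
    for ch in name:
        if ch == "_":
            out.append("__")
        elif ch.lower() != ch:
            out.append("_" + ch)
        else:
            out.append(ch)
    return "".join(out)


def unescape_case(escaped: str) -> str:
    """Invert escape_case: "_" always introduces one escaped character."""
    out = []
    i = 0
    while i < len(escaped):
        if escaped[i] == "_" and i + 1 < len(escaped):
            out.append(escaped[i + 1])
            i += 2
        else:
            out.append(escaped[i])
            i += 1
    return "".join(out)
```

Under this scheme `ICFP` and `icfp` map to `_I_C_F_P` and `icfp`, which stay distinct even when the filesystem folds case.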