434 lines
16 KiB
Markdown
434 lines
16 KiB
Markdown
git-annex should use smudge/clean filters. v6 mode
|
|
|
|
### problems keeping v6 experimental
|
|
|
|
* Checking out a different branch causes git to smudge all changed files,
|
|
and write their content. This does not honor annex.thin. A warning
|
|
message is printed in this case.
|
|
|
|
This is particularly wasteful when checking out an adjusted unlocked
|
|
branch, which causes 2x the space to be used.
|
|
|
|
"git annex proxy" could be used to handle this.
|
|
Make it run the git command with smudge filters disabled and then
|
|
scan through the changed files in the work tree, and update pointer files
|
|
to be hard links to their content.
|
|
|
|
git-annex adjust and git-annex sync could both use that internally
|
|
when checking out the adjusted branch, and merging a branch into HEAD.
|
|
|
|
Or: Make the smudge filter not provide the actual file content, but the
|
|
pointer, at least when used in a checkout. And install a post-checkout
|
|
hook that populates the worktree files that were checked out.
|
|
Of course, it will also need to update the index..
|
|
|
|
(My enhanced smudge/clean patch set also fixed this problem, in a much
|
|
nicer way...)
|
|
|
|
* When git runs the smudge filter, it buffers all its output in ram before
|
|
writing it to a file. So, checking out a branch with a large v6 unlocked files
|
|
can cause git to use a lot of memory.
|
|
|
|
This needs to be fixed in git, but my proposed interface in
|
|
<http://thread.gmane.org/gmane.comp.version-control.git/294425> would
|
|
avoid the problem for git checkout, since it would use the new interface
|
|
and not the smudge filter.
|
|
|
|
Last verified with git 2.18 in 2018.
|
|
|
|
Note that the long-running filter process interface has the same problem.
|
|
|
|
The annex.thin idea above could work around this problem.
|
|
|
|
|
|
## other warts
|
|
|
|
* There are several v6 bugs that are edge cases and
|
|
need more info or analysis. None of these seem like blockers
|
|
to keep v6 experimental or to replacing direct mode with v6.
|
|
|
|
- <http://git-annex.branchable.com/bugs/assistant_crashes_in_TransferScanner/>
|
|
- <http://git-annex.branchable.com/bugs/v6_appears_to_not_thin/>
|
|
- <http://git-annex.branchable.com/bugs/Metadata_views_in_v6_repo_upgraded_from_direct_mode_act_strangely/>
|
|
- <http://git-annex.branchable.com/bugs/git-annex-sync_sometimes_fails_in_submodule_in_V6_adjusted_branch/>
|
|
|
|
### long term todos
|
|
|
|
* Potentially: Use git's new `filter.<driver>.process` interface, which will
|
|
let only 1 git-annex process be started by git when processing
|
|
multiple files, and so should be faster.
|
|
|
|
See [[todo/Long_Running_Filter_Process]] .. it's not currently actually a
|
|
win but might be a good way to improve git to work better with v6.
|
|
|
|
* Eventually (but not yet), make v6 the default for new repositories.
|
|
Note that the assistant forces repos into direct mode; that will need to
|
|
be changed then, and it should enable annex.thin instead.
|
|
|
|
* Later still, remove support for direct mode, and enable automatic
|
|
v5 to v6 upgrades.
|
|
|
|
### historical notes
|
|
|
|
2013: Currently, this does not look likely to work. In particular,
|
|
the clean filter needs to consume all stdin from git, which consists of the
|
|
entire content of the file. It cannot optimise by directly accessing
|
|
the file in the repository, because git may be cleaning a different
|
|
version of the file during a merge.
|
|
|
|
So every `git status` would need to read the entire content of all
|
|
available files, and checksum them, which is too expensive.
|
|
|
|
> Update from GitTogether: Peff thinks a new interface could be added to
|
|
> git to handle this sort of case in an efficient way.. just needs someone
|
|
> to do the work. --[[Joey]]
|
|
|
|
>> Update 2015: git status only calls the clean filter for files
|
|
>> that the index says are modified, so this is no longer a problem.
|
|
>> --[[Joey]]
|
|
|
|
### background
|
|
|
|
The clean filter is run when files are staged for commit. So a user could copy
|
|
any file into the annex, git add it, and git-annex's clean filter causes
|
|
the file's key to be staged, while its value is added to the annex.
|
|
|
|
The smudge filter is run when files are checked out. Since git annex
|
|
repos have partial content, this would not git annex get the file content.
|
|
Instead, if the content is not currently available, it would need to do
|
|
something like return empty file content. (Sadly, it cannot create a
|
|
symlink, as git still wants to write the file afterwards.)
|
|
|
|
So the nice current behavior of unavailable files being clearly missing due
|
|
to dangling symlinks, would be lost when using smudge/clean filters.
|
|
(Contact git developers to get an interface to do this?)
|
|
|
|
Instead, we get the nice behavior of not having to remeber to `git annex
|
|
add` files, and just being able to use `git add` or `git commit -a`,
|
|
and have it use git-annex when .gitattributes says to. Also, annexed
|
|
files can be directly modified without having to `git annex unlock`.
|
|
|
|
### configuration
|
|
|
|
In .gitattributes, the user would put something like "* filter=git-annex".
|
|
This way they could control which files are annexed vs added normally.
|
|
|
|
It would also be good to allow using this without having to specify
|
|
the files in .gitattributes. Just use "* filter=git-annex" there, and then
|
|
let git-annex decide which files to annex and which to pass through the
|
|
smudge and clean filters as-is. The smudge filter can just read a little of
|
|
its input to see if it's a pointer to an annexed file. The clean filter
|
|
could apply annex.largefiles to decide whether to annex a file's content or
|
|
not.
|
|
|
|
For files not configured this way in .gitattributes, git-annex could
|
|
continue to use its symlink method -- this would preserve backwards
|
|
compatability, and even allow mixing the two methods in a repo as desired.
|
|
(But not switching an existing repo between indirect and direct modes;
|
|
the user decides which mode to use when adding files to the repo.)
|
|
|
|
### clean
|
|
|
|
The trick is doing it efficiently. Since git a2b665d, v1.7.4.1,
|
|
something like this works to provide a filename to the clean script:
|
|
|
|
git config --global filter.huge.clean huge-clean %f
|
|
|
|
This could avoid it needing to read all the current file content from stdin
|
|
when doing eg, a git status or git commit. Instead it is passed the
|
|
filename that git is operating on, in the working directory.
|
|
(Update: No, doesn't work; git may be cleaning a different file content
|
|
than is currently on disk, and git requires all stdin be consumed too.)
|
|
|
|
So, WORM could just look at that file and easily tell if it is one
|
|
it already knows (same mtime and size). If so, it can short-circuit and
|
|
do nothing, file content is already cached.
|
|
|
|
SHA1 has a harder job. Would not want to re-sha1 the file every time,
|
|
probably. So it'd need a local cache of file stat info, mapped to known
|
|
objects.
|
|
|
|
But: Even with %f, git actually passes the full file content to the clean
|
|
filter, and if it fails to consume it all, it will crash (may only happen
|
|
if the file is larger than some chunk size; tried with 500 mb file and
|
|
saw a SIGPIPE.) This means unnecessary works needs to be done,
|
|
and it slows down *everything*, from `git status` to `git commit`.
|
|
**showstopper** I have sent a patch to the git mailing list to address
|
|
this. <http://marc.info/?l=git&m=131465033512157&w=2> (Update: apparently
|
|
can't be fixed.)
|
|
|
|
> Update: I tried this again (2015) and it seems that git status and git
|
|
> add avoid re-sending the file content to the clean filter, as long as the
|
|
> file stat has not changed. I'm not sure when git started doing that,
|
|
> but it seems to avoid this problem.
|
|
> --[[Joey]]
|
|
|
|
### smudge
|
|
|
|
The smudge script can also be provided a filename with %f, but it
|
|
cannot directly write to the file or git gets unhappy.
|
|
|
|
> Still the case in 2015. Means an unnecesary read and pipe of the file
|
|
> even if the content is already locally available on disk. --[[Joey]]
|
|
|
|
### partial checkouts
|
|
|
|
.. Are very important, otherwise a repo can't scale past the size of the
|
|
smallest client's disk!
|
|
|
|
It would be nice if the smudge filter could hard link a work
|
|
tree file to the annex object.
|
|
|
|
But currently, the smudge filter can't modify the work tree file on its own
|
|
-- git always modifies the file after getting the output of the smudge
|
|
filter, and will stumble over any modifications that the smudge filter
|
|
makes. And, it's important that the smudge filter never fail as that will
|
|
leave the repo in a bad state.
|
|
|
|
Seems the best that can be done is for the smudge filter to copy from the
|
|
annex object when the object is present. When it's not present, the smudge
|
|
filter should provide a pointer to its content.
|
|
|
|
The clean filter should detect when it's operating on that pointer file.
|
|
|
|
I've a demo implementation of this technique in the scripts below.
|
|
|
|
### deduplication
|
|
|
|
.. Is nice; needing 2 copies of every annexed file is annoying.
|
|
|
|
Unfortunately, when using smudge/clean, `git merge` does not preserve a
|
|
smudged file in the work tree when renaming it. It instead deletes the old
|
|
file and asks the smudge filter to smudge the new filename.
|
|
|
|
So, copies need to be maintained in .git/annex/objects, though it's ok
|
|
to use hard links to the work tree files. (Although somewhat unsafe
|
|
since modification of the file will lose the old version. annex.thin
|
|
setting can enable this.)
|
|
|
|
Even if hard links are used, smudge needs to output the content of an
|
|
annexed file, which will result in duplication when merging in renames of
|
|
files.
|
|
|
|
### design
|
|
|
|
Goal: Get rid of current direct mode, using smudge/clean filters instead to
|
|
cover the same use cases, more flexibly and robustly.
|
|
|
|
Use case 1:
|
|
|
|
A user wants to be able to edit files, and git-add, git commit,
|
|
without needing to worry about using git-annex to unlock files, add files,
|
|
etc.
|
|
|
|
Use case 2:
|
|
|
|
Using git-annex on a crippled filesystem that does not support symlinks.
|
|
|
|
Data:
|
|
|
|
* An annex pointer file has as its first line the git-annex key
|
|
that it's standing in for (prefixed with "annex/objects/", similar to
|
|
an annex symlink target). Subsequent lines of the file might
|
|
be a message saying that the file's content is not currently available.
|
|
An annex pointer file is checked into the git repository the same way
|
|
that an annex symlink is checked in.
|
|
* A file map is maintained by git-annex, to keep track of the keys
|
|
that are used by files in the working tree.
|
|
|
|
Configuration:
|
|
|
|
* .gitattributes tells git which files to use git-annex's smudge/clean
|
|
filters with. Typically, all files except for dotfiles:
|
|
|
|
* filter=annex
|
|
.* !filter
|
|
|
|
* annex.largefiles tells git-annex which files should in fact be put in
|
|
the annex. Other files are passed through the smudge/clean as-is and
|
|
have their contents stored in git.
|
|
|
|
* annex.direct is repurposed to configure how git-annex adds files.
|
|
When set to false, it adds symlinks and when true it adds pointer files.
|
|
|
|
git-annex clean:
|
|
|
|
* Run by `git add` (and diff and status, etc), and passed the
|
|
filename, as well as fed the file content on stdin.
|
|
|
|
Look at configuration to decide if this file's content belongs in the
|
|
annex. If not, output the file content to stdout.
|
|
|
|
Generate annex key from filename and content from stdin.
|
|
|
|
Hard link (annex.thin) or copy .git/annex/objects to the file,
|
|
if it doesn't already exist.
|
|
|
|
This is done to prevent losing the only copy of a file when eg
|
|
doing a git checkout of a different branch, or merging a commit that
|
|
renames or deletes a file. But, with annex.thin no attempt is made to
|
|
protect the object from being modified. If a user wants to
|
|
protect object contents from modification, they should use
|
|
`git annex add`, not `git add`, or they can `git annex lock` after adding,
|
|
or not enable annex.thin.
|
|
|
|
Update file map.
|
|
|
|
Output the pointer file content to stdout.
|
|
|
|
git-annex smudge:
|
|
|
|
* Run by eg `git checkout`
|
|
and passed the filename, as well as fed the pointer file content on stdin.
|
|
|
|
Update file map.
|
|
|
|
When an object is present in the annex, outputs its content to stdout.
|
|
Otherwise, outputs the file pointer content.
|
|
|
|
git annex direct/indirect:
|
|
|
|
Previously these commands switched in and out of direct mode.
|
|
Now they become no-ops.
|
|
|
|
git annex lock/unlock:
|
|
|
|
Makes sense for these to change to switch files between using
|
|
git-annex symlinks and pointers. So, this provides both a way to
|
|
transition repositories to using pointers, and a cleaner unlock/lock
|
|
for repos using symlinks.
|
|
|
|
unlock will stage a pointer file, and will link the content of the object
|
|
from .git/annex/objects to the work tree file.
|
|
|
|
lock will replace the current work tree file with the symlink, and stage it,
|
|
and lock down the permissions of the annex object.
|
|
|
|
#### file map
|
|
|
|
The file map needs to map from `Key -> [File]`. `File -> Key`
|
|
seems useful to have, but in practice is not worthwhile.
|
|
|
|
Drop and get operations need to know what files in the work tree use a
|
|
given key in order to update the work tree. And, we don't want to
|
|
overwrite a work tree file if it's been modified when dropping or getting.
|
|
|
|
git-annex commands that look at annex symlinks to get keys to act on will
|
|
need fall back to either consulting the file map, or looking at the staged
|
|
file to see if it's a pointer to a key. So a `File -> Key` map is a possible
|
|
optimisation.
|
|
|
|
Question: If the smudge/clean filters update the file map incrementally
|
|
based on the pointer files they generate/see, will the result
|
|
always be consistent with the content of the working tree?
|
|
|
|
This depends on when git calls the smudge/clean filters and on what.
|
|
In particular:
|
|
|
|
* Does the clean filter always get called when adding a relevant
|
|
file to git? Yes.
|
|
* Is the clean filter called at any other time? Yes, for example
|
|
git diff will clean relevant modified files to generate the diff.
|
|
So, the clean filter may see file versions that have not yet been staged
|
|
in git.
|
|
* Is the clean filter ever passed content not in the work tree?
|
|
I don't think so, but not 100% sure.
|
|
* Is the smudge filter always called when git updates a relevant file
|
|
in the work tree? Yes.
|
|
* Is the smudge filter called at any other time? Seems unlikely but then
|
|
there could be situations with a detached work tree or such.
|
|
* Does git call any useful hooks when removing a file from the work tree,
|
|
or converting it to not be annexed, or for `git mv` of an annexed file?
|
|
No!
|
|
|
|
From this analysis, any file map generated by the smudge/clean filters
|
|
is necessary potentially innaccurate. It may list deleted files.
|
|
It may or may not reflect current unstaged changes from the work tree.
|
|
|
|
|
|
Follows that any use of the file map needs to verify the info from it,
|
|
and throw out bad cached info (updating the map to match reality).
|
|
|
|
When downloading a key, check if the files listed in the file map are
|
|
still pointer files in the work tree, and only replace them with the
|
|
content if so.
|
|
|
|
When dropping a key, check if the files listed for it in the file map are
|
|
unmodified in the work tree, and are staged as pointers to the key,
|
|
and only reset them to the pointers if so. Note that this means that
|
|
a modified work tree file that has not yet been staged, but that
|
|
corresponds to a key, won't be reset when the key is dropped.
|
|
This is probably not a big deal; the user will either add the
|
|
file, which will add the key back, or reset the file.
|
|
|
|
Does the `File -> Key` map have any benefits given this innaccuracy?
|
|
Answer seems to be no; any answer that map gives may be innaccurate and
|
|
needs to be verified by looking at actual repo content, so might as well
|
|
just look at the repo content in the first place..
|
|
|
|
#### Upgrading
|
|
|
|
annex.version changes to 6
|
|
|
|
git config for filter.annex.smudge and filter.annex.clean is set up.
|
|
|
|
.gitattributes is updated with a stock configuration,
|
|
unless it already mentions "filter=annex".
|
|
|
|
Upgrading a direct mode repo needs to switch it out of bare mode, and
|
|
needs to run `git annex unlock` on all files (or reach the same result).
|
|
So will need to stage changes to all annexed files.
|
|
|
|
When a repo has some clones indirect and some direct, the upgraded repo
|
|
will have all files unlocked, necessarily in all clones. This happens
|
|
automatically, because when the direct repos are upgraded that causes the
|
|
files to be unlocked, while the indirect upgrades don't touch the files.
|
|
|
|
----
|
|
|
|
### test files
|
|
|
|
huge-smudge:
|
|
|
|
<pre>
|
|
#!/bin/sh
|
|
read f
|
|
file="$1"
|
|
echo "smudging $f" >&2
|
|
if [ -e ~/$f ]; then
|
|
cat ~/$f # possibly expensive copy here
|
|
else
|
|
echo "$f not available"
|
|
fi
|
|
</pre>
|
|
|
|
huge-clean:
|
|
|
|
<pre>
|
|
#!/bin/sh
|
|
file="$1"
|
|
cat >/tmp/file
|
|
# in real life, this should be done more efficiently, not trying to read
|
|
# the whole file content!
|
|
if grep -q 'not available' /tmp/file; then
|
|
awk '{print $1}' /tmp/file # provide what we would if the content were avail!
|
|
exit 0
|
|
fi
|
|
echo "cleaning $file" >&2
|
|
# XXX store file content here
|
|
echo $file
|
|
</pre>
|
|
|
|
.gitattributes:
|
|
|
|
<pre>
|
|
*.huge filter=huge
|
|
</pre>
|
|
|
|
in .git/config:
|
|
|
|
<pre>
|
|
[filter "huge"]
|
|
clean = huge-clean %f
|
|
smudge = huge-smudge %f
|
|
<pre>
|