git-annex should use smudge/clean filters.

----

Update: Currently, this does not look likely to work. In particular,
the clean filter needs to consume all stdin from git, which consists of the
entire content of the file. It cannot optimise by directly accessing
the file in the repository, because git may be cleaning a different
version of the file during a merge.

So every `git status` would need to read the entire content of all
available files, and checksum them, which is too expensive.

> Update from GitTogether: Peff thinks a new interface could be added to
> git to handle this sort of case in an efficient way; it just needs someone
> to do the work. --[[Joey]]

>> Update 2015: git status only calls the clean filter for files
>> that the index says are modified, so this is no longer a problem.
>> --[[Joey]]

----

The clean filter is run when files are staged for commit. So a user could copy
any file into the annex, `git add` it, and git-annex's clean filter causes
the file's key to be staged, while its value is added to the annex.

The smudge filter is run when files are checked out. Since git-annex
repos have partial content, the smudge filter would not `git annex get` the
file content. Instead, if the content is not currently available, it would
need to do something like return empty file content. (Sadly, it cannot create
a symlink, as git still wants to write the file afterwards.)

So the nice current behavior of unavailable files being clearly missing due
to dangling symlinks would be lost when using smudge/clean filters.
(Contact git developers to get an interface to do this?)

Instead, we get the nice behavior of not having to remember to `git annex
add` files, and just being able to use `git add` or `git commit -a`,
and have it use git-annex when .gitattributes says to. Also, annexed
files can be directly modified without having to `git annex unlock`.

### configuration

In .gitattributes, the user would put something like "* filter=git-annex".
By choosing the pattern, they could control which files are annexed vs
added normally.

It would also be good to allow using this without having to specify
the files in .gitattributes. Just use "* filter=git-annex" there, and then
let git-annex decide which files to annex and which to pass through the
smudge and clean filters as-is. The smudge filter can just read a little of
its input to see if it's a pointer to an annexed file. The clean filter
could apply annex.largefiles to decide whether to annex a file's content or
not.

For files not configured this way in .gitattributes, git-annex could
continue to use its symlink method -- this would preserve backwards
compatibility, and even allow mixing the two methods in a repo as desired.
(But not switching an existing repo between indirect and direct modes;
the user decides which mode to use when adding files to the repo.)

### clean

The trick is doing it efficiently. Since git a2b665d, v1.7.4.1,
something like this works to provide a filename to the clean script:

    git config --global filter.huge.clean huge-clean %f

This could avoid it needing to read all the current file content from stdin
when doing eg, a git status or git commit. Instead it is passed the
filename that git is operating on, in the working directory.
(Update: No, doesn't work; git may be cleaning a different file content
than is currently on disk, and git requires all stdin be consumed too.)

So, WORM could just look at that file and easily tell if it is one
it already knows (same mtime and size). If so, it can short-circuit and
do nothing, as the file content is already cached.

SHA1 has a harder job. Would not want to re-sha1 the file every time,
probably. So it'd need a local cache of file stat info, mapped to known
objects.

But: Even with %f, git actually passes the full file content to the clean
filter, and if the filter fails to consume it all, git will crash (may only
happen if the file is larger than some chunk size; tried with a 500 MB file
and saw a SIGPIPE.) This means unnecessary work needs to be done,
and it slows down *everything*, from `git status` to `git commit`.
**showstopper** I have sent a patch to the git mailing list to address
this. <http://marc.info/?l=git&m=131465033512157&w=2> (Update: apparently
can't be fixed.)

> Update: I tried this again (2015) and it seems that git status and git
> add avoid re-sending the file content to the clean filter, as long as the
> file stat has not changed. I'm not sure when git started doing that,
> but it seems to avoid this problem.
> --[[Joey]]

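
The stat cache idea for SHA1 could look roughly like the sketch below. It is only an illustration: the cache location (`CACHE`), the function names, and the `SHA1--<hash>` key format are assumptions made for this example, and `stat -c` is GNU coreutils syntax. As the update above notes, git may be cleaning content that differs from what is on disk, so a cache like this could only ever act as a hint.

```shell
# Hypothetical cache file; each line is "size mtime path<TAB>key".
CACHE="${CACHE:-${TMPDIR:-/tmp}/annex-stat-cache}"

# Print "size mtime" for a file (GNU stat syntax).
file_stat() {
    stat -c '%s %Y' -- "$1"
}

# Print the cached key if the file's stat info is unchanged; fail otherwise.
cached_key() {
    want="$(file_stat "$1") $1"
    [ -f "$CACHE" ] || return 1
    awk -F '\t' -v want="$want" '
        $1 == want { key = $2; found = 1 }
        END { if (found) print key; exit !found }
    ' "$CACHE"
}

# Record the key for a file together with its current stat info.
remember_key() {
    printf '%s %s\t%s\n' "$(file_stat "$1")" "$1" "$2" >> "$CACHE"
}

# Print the key for a file, hashing its content only on a cache miss.
key_for() {
    cached_key "$1" && return 0
    key="SHA1--$(sha1sum < "$1" | cut -d ' ' -f 1)"
    remember_key "$1" "$key"
    printf '%s\n' "$key"
}
```

Calling `key_for somefile` twice in a row hashes the file only the first time; the second call is answered from the cache as long as size and mtime are unchanged.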
### smudge

The smudge script can also be provided a filename with %f, but it
cannot directly write to the file or git gets unhappy.

> Still the case in 2015. Means an unnecessary read and pipe of the file
> even if the content is already locally available on disk. --[[Joey]]

### partial checkouts

It's important that git-annex supports partial checkouts of the content of
a repository. This allows repositories to be checked out when there is not
enough disk space available for all files in the repository.

The way git-lfs uses smudge/clean filters, which is similar to that
described above, does not support partial checkouts; it always tries to
download the contents of all files. Indeed, git-lfs seems to keep 2 copies
of newly added files; one in the work tree and one in .git/lfs/objects/,
at least before it sends the latter to the server. This lack of control
over which data is checked out, and the duplication of the data, limit the
usefulness of git-lfs on truly large amounts of data.

To support partial checkouts, `git annex get` and `git annex drop` need to
be able to be used.

To avoid data duplication when adding a new object, the clean filter could
hard link from the work tree file to the annex object. However, the
user could change the work tree file w/o breaking the hard link, and this
would corrupt the annexed object. Could remove write permissions to avoid
that (mostly), but that would lose some of the benefits of smudge/clean as
the user wouldn't be able to modify annexed files.

> (This may be one of those things where different tradeoffs meet different
> users' needs and so a repo could be switched between the two modes as
> needed.)

The smudge filter can't modify the work tree file on its own -- git always
modifies the file after getting the output of the smudge filter, and will
stumble over any modifications that the smudge filter makes. And, it's
important that the smudge filter never fail as that will leave the repo in
a bad state.

So, to support partial checkouts and avoid data duplication, the smudge
filter should provide some dummy content, probably including the key of the
file. (The clean filter should detect when it's operating on that dummy
content, and provide the same key as it would if the file content was
present.)

To get the real content, use `git annex get`. (A `post-checkout` hook could
run that on all files if the user wants that behavior, or a config setting
could make the smudge filter automatically get files' contents.)

I've a demo implementation of this technique in the scripts below.

### design

Goal: Get rid of current direct mode, using smudge/clean filters instead to
cover the same use cases, more flexibly and robustly.

Use case 1:

A user wants to be able to edit files, and git-add, git commit,
without needing to worry about using git-annex to unlock files, add files,
etc.

Use case 2:

Using git-annex on a crippled filesystem that does not support symlinks.

Data:

* An annex pointer file has as its first line the git-annex key
  that it's standing in for. Subsequent lines of the file might
  be a message saying that the file's content is not currently available.
  An annex pointer file is checked into the git repository the same way
  that an annex symlink is checked in.
* file2key maps are maintained by git-annex, to keep track of
  which files are pointers to which keys.

Configuration:

* .gitattributes tells git which files to use git-annex's smudge/clean
  filters with. Typically, all files except for dotfiles:

        * filter=annex
        .* !filter

* annex.largefiles tells git-annex which files should in fact be put in
  the annex. Other files are passed through the smudge/clean as-is and
  have their contents stored in git.

git-annex clean:

* Run by `git add` (and diff and status, etc), and passed the
  filename, as well as fed the file content on stdin.

  Look at configuration to decide if this file's content belongs in the
  annex. If not, output the file content to stdout.

  Generate annex key from filename and content from stdin.

  Hard link the file into .git/annex/objects, if the object doesn't already
  exist there. (On platforms not supporting hardlinks, copy the file to
  .git/annex/objects.)

  This is done to prevent losing the only copy of a file when eg
  doing a git checkout of a different branch. But, no attempt is made to
  protect the object from being modified. If a user wants to
  protect object contents from modification, they should use
  `git annex add`, not `git add`, or they can `git annex lock` after adding.

  There could be a configuration knob to cause a copy to be made to
  .git/annex/objects -- useful for those crippled filesystems. It might
  also drop that copy once the object gets uploaded to another repo ...
  But that gets complicated quickly.

  Update file2key map.

  Output the pointer file content to stdout.

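
The clean steps listed above might be sketched as a shell function along these lines. This is a rough illustration, not how git-annex implements it: `annex_clean`, the `FILE2KEY` map file, the `LARGE_BYTES` size threshold (a crude stand-in for annex.largefiles), and the key format are all assumptions made for the example, and the key here is derived from content only.

```shell
ANNEX_OBJECTS="${ANNEX_OBJECTS:-.git/annex/objects}"
FILE2KEY="${FILE2KEY:-.git/annex/file2key}"   # hypothetical map file
LARGE_BYTES="${LARGE_BYTES:-1048576}"         # stand-in for annex.largefiles

annex_clean() {
    file="$1"
    tmp="$(mktemp)" || return 1
    cat > "$tmp"                              # always consume all of stdin
    size="$(wc -c < "$tmp" | tr -d ' ')"

    # Small files are passed through and stored in git as-is. A pointer
    # file is small too, so it comes out unchanged here.
    if [ "$size" -lt "$LARGE_BYTES" ]; then
        cat "$tmp"
        rm -f "$tmp"
        return 0
    fi

    # Illustrative key: backend name, size, then a content hash.
    key="SHA256-s${size}--$(sha256sum < "$tmp" | cut -d ' ' -f 1)"

    mkdir -p "$ANNEX_OBJECTS" "$(dirname "$FILE2KEY")"
    if [ ! -e "$ANNEX_OBJECTS/$key" ]; then
        # Hard link only when the work tree file really holds the content
        # being cleaned; otherwise keep the copy that was read from stdin.
        if [ -f "$file" ] && cmp -s "$file" "$tmp" &&
           ln "$file" "$ANNEX_OBJECTS/$key" 2>/dev/null; then
            :
        else
            cp "$tmp" "$ANNEX_OBJECTS/$key"
        fi
    fi
    rm -f "$tmp"

    printf '%s\t%s\n' "$file" "$key" >> "$FILE2KEY"   # update file2key map
    printf '%s\n' "$key"                              # pointer file content
}
```

The `cmp` check costs an extra read of the file, but it guards against the case mentioned earlier where git is cleaning a version other than the one on disk.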
git-annex smudge:

* Run by eg `git checkout` and passed the filename, as well as fed
  the pointer file content on stdin.

  Updates file2key map.

  Outputs the same pointer file content to stdout.

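
A smudge filter along these lines might look like the sketch below. It reads the first line of its input to decide whether it is looking at a pointer, records the mapping, and passes the content through untouched. The names `annex_smudge` and `looks_like_key`, the `FILE2KEY` path, and especially the pattern used to recognise a key are guesses made for illustration, not the actual git-annex key grammar.

```shell
FILE2KEY="${FILE2KEY:-.git/annex/file2key}"   # hypothetical map file

# Succeeds if the argument looks like a key: a single word that starts with
# an uppercase backend name and contains "--" before a non-empty name part.
looks_like_key() {
    case "$1" in
        *[!A-Za-z0-9._-]*) return 1 ;;        # spaces or unusual characters
        [A-Z]*--?*) return 0 ;;
        *) return 1 ;;
    esac
}

annex_smudge() {
    file="$1"
    tmp="$(mktemp)" || return 1
    cat > "$tmp"                              # consume all of stdin
    # Only the first line is needed to recognise a pointer file.
    first="$(head -n 1 "$tmp" | tr -d '\000')"
    if looks_like_key "$first"; then
        mkdir -p "$(dirname "$FILE2KEY")"
        printf '%s\t%s\n' "$file" "$first" >> "$FILE2KEY"
    fi
    cat "$tmp"                                # output is unchanged either way
    rm -f "$tmp"
}
```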
git annex direct/indirect:

Previously these commands switched in and out of direct mode.
Now they become no-ops.

git annex lock/unlock:

Makes sense for these to change to switch files between using
git-annex symlinks and pointers. So, this provides both a way to
transition repositories to using pointers, and a cleaner unlock/lock
for repos using symlinks.

unlock will stage a pointer file, and will copy the content of the object
out of .git/annex/objects to the work tree file. (Might want a --hardlink
switch.)

lock will replace the current work tree file with the symlink, and stage it.
Note that multiple work tree files could point to the same object.
So, if the link count is > 1, replace the annex object with a copy of
itself to break such a hard link. Always finish by locking down the
permissions of the annex object.

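
The last two steps of lock (breaking a shared hard link, then locking down permissions) could be sketched as follows. `lock_object` is a hypothetical helper name, and `stat -c` assumes GNU coreutils; replacing the work tree file with a symlink and staging it are left out.

```shell
# $1 is the path of the annex object under .git/annex/objects.
lock_object() {
    obj="$1"
    links="$(stat -c '%h' -- "$obj")" || return 1
    if [ "$links" -gt 1 ]; then
        # Another work tree file shares this inode. Copy the object and
        # rename the copy over it, so the other name keeps the old inode
        # and later edits through it cannot reach the annex object.
        tmp="$obj.tmp.$$"
        cp -p -- "$obj" "$tmp" && mv -f -- "$tmp" "$obj" || return 1
    fi
    chmod a-w -- "$obj"                       # lock down the permissions
}
```

Renaming a copy into place, rather than rewriting the object in place, is what separates the two names; writing into the existing file would change both.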
All other git-annex commands that look at annex symlinks to get keys will
need to fall back to checking if a given work tree file is stored in git as
a pointer file. This can be done by checking the file2key map (or by looking
it up in the index).

Note that I have not verified if file2key maps can be maintained
consistently using the smudge/clean filters. Seems likely to work,
based on when I see smudge/clean filters being run. The file2key
optimisation may not be needed though; looking at the index
might be fast enough.

----

### test files

huge-smudge:

<pre>
#!/bin/sh
# stdin carries the staged content (the key); $1 is the filename from %f
read -r f
file="$1"
echo "smudging $f" >&2
if [ -e "$HOME/$f" ]; then
	cat "$HOME/$f" # possibly expensive copy here
else
	echo "$f not available"
fi
</pre>

huge-clean:

<pre>
#!/bin/sh
# $1 is the filename from %f; stdin must be consumed in full
file="$1"
tmp=$(mktemp) || exit 1
trap 'rm -f "$tmp"' EXIT
cat >"$tmp"
# in real life, this should be done more efficiently, not trying to read
# the whole file content!
if grep -q 'not available' "$tmp"; then
	awk '{print $1}' "$tmp" # provide what we would if the content were avail!
	exit 0
fi
echo "cleaning $file" >&2
# XXX store file content here
echo "$file"
</pre>

.gitattributes:

<pre>
*.huge filter=huge
</pre>

in .git/config:

<pre>
[filter "huge"]
	clean = huge-clean %f
	smudge = huge-smudge %f
</pre>