smudge update: Not practical.
parent a64c16bf7d
commit b96443364e
1 changed files with 29 additions and 9 deletions
@@ -1,5 +1,18 @@
 git-annex should use smudge/clean filters.
 
+----
+
+Update: Currently, this does not look likely to work. In particular,
+the clean filter needs to consume all stdin from git, which consists of the
+entire content of the file. It cannot optimise by directly accessing
+the file in the repository, because git may be cleaning a different
+version of the file during a merge.
+
+So every `git status` would need to read the entire content of all
+available files, and checksum them, which is too expensive.
+
+----
+
 The clean filter is run when files are staged for commit. So a user could copy
 any file into the annex, git add it, and git-annex's clean filter causes
 the file's key to be staged, while its value is added to the annex.
@@ -8,7 +21,7 @@ The smudge filter is run when files are checked out. Since git annex
 repos have partial content, this would not git annex get the file content.
 Instead, if the content is not currently available, it would need to do
 something like return empty file content. (Sadly, it cannot create a
-symlink, as git still wants to write the file afterwards.
+symlink, as git still wants to write the file afterwards.)
 
 So the nice current behavior of unavailable files being clearly missing due
 to dangling symlinks, would be lost when using smudge/clean filters.
@@ -39,7 +52,13 @@ of a file, rather than reading its symlink, git-annex would need to
 look up the git blob associated with the file -- this can be done
 efficiently using the existing code in `Branch.catFile`.
 
-### efficiency
+The clean filter would inject the file's content into the annex, and hard
+link from the annex to the file. Avoiding duplication of data.
+
+The smudge filter can't do that, so to avoid duplication of data, it
+might always create an empty file. To get the content, `git annex get`
+could be used (which would hard link it). A `post-checkout` hook might
+be used to set up hard links for all currently available content.
 
 #### clean
 
@@ -51,6 +70,8 @@ something like this works to provide a filename to the clean script:
 This could avoid it needing to read all the current file content from stdin
 when doing eg, a git status or git commit. Instead it is passed the
 filename that git is operating on, in the working directory.
+(Update: No, doesn't work; git may be cleaning a different file content
+than is currently on disk, and git requires all stdin be consumed too.)
 
 So, WORM could just look at that file and easily tell if it is one
 it already knows (same mtime and size). If so, it can short-circuit and
@@ -66,15 +87,14 @@ if the file is larger than some chunk size; tried with 500 mb file and
 saw a SIGPIPE.) This means unnecessary works needs to be done,
 and it slows down *everything*, from `git status` to `git commit`.
 **showstopper** I have sent a patch to the git mailing list to address
-this. <http://marc.info/?l=git&m=131465033512157&w=2>
+this. <http://marc.info/?l=git&m=131465033512157&w=2> (Update: apparently
+can't be fixed.)
 
 #### smudge
 
 The smudge script can also be provided a filename with %f, but it
 cannot directly write to the file or git gets unhappy.
 
-
-
 ### dealing with partial content availability
 
 The smudge filter cannot be allowed to fail, that leaves the tree and
@@ -111,15 +131,15 @@ huge-clean:
 <pre>
 #!/bin/sh
 file="$1"
+cat >/tmp/file
 # in real life, this should be done more efficiently, not trying to read
 # the whole file content!
-if grep -q 'not available' "$file"; then
-awk '{print $1}' "$file" # provide what we would if the content were avail!
+if grep -q 'not available' /tmp/file; then
+awk '{print $1}' /tmp/file # provide what we would if the content were avail!
 exit 0
 fi
 echo "cleaning $file" >&2
-ls -l "$file" >&2
-ln -f "$file" ~/$file # can't delete temp file
+# XXX store file content here
 echo $file
 </pre>
 
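The update note in the first hunk says the clean filter must consume all of stdin and checksum it. A minimal sketch of that cost follows, in shell like the `huge-clean` script. Nothing in it comes from the commit itself: the function name `clean_key`, the use of `mktemp` and GNU `sha256sum`, and the printed key layout (which only imitates a size-plus-checksum key) are all assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical sketch, not git-annex code: what a checksumming clean
# filter has to do for every file git hands it.

clean_key() {
    # git pipes the whole file content to stdin, and it may differ from
    # what is on disk (for example during a merge), so spool all of it
    # rather than peeking at the working tree.
    tmp=$(mktemp) || return 1
    cat >"$tmp"

    # Size and checksum both need a full pass over the content; this is
    # the per-file cost that makes `git status` expensive.
    size=$(wc -c <"$tmp" | tr -d ' ')
    sum=$(sha256sum "$tmp" | cut -d ' ' -f 1)
    rm -f "$tmp"

    # Illustrative key layout only (an assumption, see the note above).
    printf 'SHA256-s%s--%s\n' "$size" "$sum"
}
```

A filter script built on this would end with a bare `clean_key` call, so that git's stdin flows into it and the printed key is what gets staged. The work grows with the total size of the files git decides to clean, which is the objection the update raises.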