smudge design
parent
59c2001d2f
commit
33fb0de1a3
2 changed files with 224 additions and 33 deletions
56
doc/devblog/day_339_smudging_out_direct_mode.mdwn
Normal file
|
@ -0,0 +1,56 @@
|
||||||
|
I'm considering ways to get rid of direct mode, replacing it with something
|
||||||
|
better implemented using [[todo/smudge]] filters.
|
||||||
|
|
||||||
|
## git-lfs
|
||||||
|
|
||||||
|
I started by trying out git-lfs, to see what I can learn from it. My
|
||||||
|
feeling is that git-lfs brings an admirable simplicity to using git with
|
||||||
|
large files. For example, it uses a push-hook to automatically
|
||||||
|
upload file contents before pushing a branch.
|
||||||
|
|
||||||
|
But its simplicity comes at the cost of being centralized. You can't make a
|
||||||
|
git-lfs repository locally and clone it onto another drive and have the local
|
||||||
|
repositories interoperate to pass file contents around. Everything has to
|
||||||
|
go back through a centralized server. I'm willing to pay complexity costs
|
||||||
|
for decentralization.
|
||||||
|
|
||||||
|
Its simplicity also means that the user doesn't have much control over what
|
||||||
|
files are present in their checkout of a repository. git-lfs downloads
|
||||||
|
all the files in the work tree. It doesn't have facilities for dropping
|
||||||
|
files to free up space, or for configuring a repository to only want to get
|
||||||
|
a subset of files in the first place. Some of this could be added to it
|
||||||
|
I suppose.
|
||||||
|
|
||||||
|
## replacing direct mode
|
||||||
|
|
||||||
|
Anyway, as smudge/clean filters stand now, they can't be used to set up
|
||||||
|
git-annex symlinks; their interface doesn't allow it. But, I was able to
|
||||||
|
think up a design that uses smudge/clean filters to cover the same use
|
||||||
|
cases that direct mode covers now.
|
||||||
|
|
||||||
|
Thanks to the clean filter, adding a file with `git add` would check in a
|
||||||
|
small file that points to the git-annex object. When a file has been added
|
||||||
|
this way, the file in the work tree remains the only copy of the object
|
||||||
|
until you use git-annex to copy it to another repository. So if you modify
|
||||||
|
the work tree file, you can lose the old version of the object.
|
||||||
|
|
||||||
|
This is analogous to how direct mode works now, and it avoids needing to
|
||||||
|
store 2 copies of every file in the local repository.
|
||||||
|
|
||||||
|
In the same repository, you could also use `git annex add` to check
|
||||||
|
in a git-annex symlink, which would protect the object from modification,
|
||||||
|
in the good old indirect mode way. `git annex lock` and `git annex unlock`
|
||||||
|
could switch a file between those two modes.
|
||||||
|
|
||||||
|
So this allows mixing directly writable annexed files and locked down
|
||||||
|
annexed files in the same repository. All regular git commands and all
|
||||||
|
git-annex commands can be used on both sorts of files.
|
||||||
|
|
||||||
|
That's much more flexible than the current direct mode, and I think it will
|
||||||
|
be able to be implemented in a simpler, more scalable, and robust way too.
|
||||||
|
I can lose the direct mode merge code, and remove hundreds of lines of
|
||||||
|
other special cases for direct mode.
|
||||||
|
|
||||||
|
The downside, perhaps, is that for a repository to be usable on a crippled
|
||||||
|
filesystem, all the files in it will need to be unlocked. A file can't
|
||||||
|
easily be unlocked in one checkout and locked in another checkout.
|
|
@ -15,6 +15,10 @@ available files, and checksum them, which is too expensive.
|
||||||
> git to handle this sort of case in an efficient way.. just needs someone
|
> git to handle this sort of case in an efficient way.. just needs someone
|
||||||
> to do the work. --[[Joey]]
|
> to do the work. --[[Joey]]
|
||||||
|
|
||||||
|
>> Update 2015: git status only calls the clean filter for files
|
||||||
|
>> that the index says are modified, so this is no longer a problem.
|
||||||
|
>> --[[Joey]]
|
||||||
|
|
||||||
----
|
----
|
||||||
|
|
||||||
The clean filter is run when files are staged for commit. So a user could copy
|
The clean filter is run when files are staged for commit. So a user could copy
|
||||||
|
@ -36,35 +40,26 @@ add` files, and just being able to use `git add` or `git commit -a`,
|
||||||
and have it use git-annex when .gitattributes says to. Also, annexed
|
and have it use git-annex when .gitattributes says to. Also, annexed
|
||||||
files can be directly modified without having to `git annex unlock`.
|
files can be directly modified without having to `git annex unlock`.
|
||||||
|
|
||||||
### design
|
### configuration
|
||||||
|
|
||||||
In .gitattributes, the user would put something like "* filter=git-annex".
|
In .gitattributes, the user would put something like "* filter=git-annex".
|
||||||
This way they could control which files are annexed vs added normally.
|
This way they could control which files are annexed vs added normally.
|
||||||
|
|
||||||
(git-annex could have further controls to allow eg, passing small files
|
It would also be good to allow using this without having to specify
|
||||||
through to regular processing. At least .gitattributes is a special case,
|
the files in .gitattributes. Just use "* filter=git-annex" there, and then
|
||||||
it should never be annexed...)
|
let git-annex decide which files to annex and which to pass through the
|
||||||
|
smudge and clean filters as-is. The smudge filter can just read a little of
|
||||||
|
its input to see if it's a pointer to an annexed file. The clean filter
|
||||||
|
could apply annex.largefiles to decide whether to annex a file's content or
|
||||||
|
not.
|
||||||
|
|
||||||
For files not configured this way, git-annex could continue to use
|
For files not configured this way in .gitattributes, git-annex could
|
||||||
its symlink method -- this would preserve backwards compatibility,
|
continue to use its symlink method -- this would preserve backwards
|
||||||
and even allow mixing the two methods in a repo as desired.
|
compatibility, and even allow mixing the two methods in a repo as desired.
|
||||||
|
(But not switching an existing repo between indirect and direct modes;
|
||||||
|
the user decides which mode to use when adding files to the repo.)
|
||||||
|
|
||||||
To find files in the repository that are annexed, git-annex would do
|
### clean
|
||||||
`ls-files` as now, but would check if found files have the appropriate
|
|
||||||
filter, rather than the current symlink checks. To determine the key
|
|
||||||
of a file, rather than reading its symlink, git-annex would need to
|
|
||||||
look up the git blob associated with the file -- this can be done
|
|
||||||
efficiently using the existing code in `Branch.catFile`.
|
|
||||||
|
|
||||||
The clean filter would inject the file's content into the annex, and hard
|
|
||||||
link from the annex to the file. Avoiding duplication of data.
|
|
||||||
|
|
||||||
The smudge filter can't do that, so to avoid duplication of data, it
|
|
||||||
might always create an empty file. To get the content, `git annex get`
|
|
||||||
could be used (which would hard link it). A `post-checkout` hook might
|
|
||||||
be used to set up hard links for all currently available content.
|
|
||||||
|
|
||||||
#### clean
|
|
||||||
|
|
||||||
The trick is doing it efficiently. Since git a2b665d, v1.7.4.1,
|
The trick is doing it efficiently. Since git a2b665d, v1.7.4.1,
|
||||||
something like this works to provide a filename to the clean script:
|
something like this works to provide a filename to the clean script:
|
||||||
|
@ -100,27 +95,167 @@ can't be fixed.)
|
||||||
> but it seems to avoid this problem.
|
> but it seems to avoid this problem.
|
||||||
> --[[Joey]]
|
> --[[Joey]]
|
||||||
|
|
||||||
#### smudge
|
### smudge
|
||||||
|
|
||||||
The smudge script can also be provided a filename with %f, but it
|
The smudge script can also be provided a filename with %f, but it
|
||||||
cannot directly write to the file or git gets unhappy.
|
cannot directly write to the file or git gets unhappy.
|
||||||
|
|
||||||
> Still the case in 2015. Means an unnecessary read and pipe of the file
|
> Still the case in 2015. Means an unnecessary read and pipe of the file
|
||||||
> even if the content is already locally available on disk. --[[Joey]]
|
> even if the content is already locally available on disk. --[[Joey]]
|
||||||
|
|
||||||
### dealing with partial content availability
|
### partial checkouts
|
||||||
|
|
||||||
The smudge filter cannot be allowed to fail, that leaves the tree and
|
It's important that git-annex supports partial checkouts of the content of
|
||||||
index in a weird state. So if a file's content is requested by calling
|
a repository. This allows repositories to be checked out when there's not
|
||||||
the smudge filter, the trick is to instead provide dummy content,
|
available disk space for all files in the repository.
|
||||||
indicating it is not available (and perhaps saying to run "git-annex get").
|
|
||||||
|
|
||||||
Then, in the clean filter, it has to detect that it's cleaning a file
|
The way git-lfs uses smudge/clean filters, which is similar to that
|
||||||
with that dummy content, and make sure to provide the same identifier as
|
described above, does not support partial checkouts; it always tries to
|
||||||
it would if the file content was there.
|
download the contents of all files. Indeed, git-lfs seems to keep 2 copies
|
||||||
|
of newly added files; one in the work tree and one in .git/lfs/objects/,
|
||||||
|
at least before it sends the latter to the server. This lack of control
|
||||||
|
over which data is checked out and duplication of the data limits the
|
||||||
|
usefulness of git-lfs on truly large amounts of data.
|
||||||
|
|
||||||
|
To support partial checkouts, `git annex get` and `git annex drop` need to
|
||||||
|
be able to be used.
|
||||||
|
|
||||||
|
To avoid data duplication when adding a new object, the clean filter could
|
||||||
|
hard link from the work tree file to the annex object. Although the
|
||||||
|
user could change the work tree file w/o breaking the hard link and this
|
||||||
|
would corrupt the annexed object. Could remove write permissions to avoid
|
||||||
|
that (mostly), but that would lose some of the benefits of smudge/clean as
|
||||||
|
the user wouldn't be able to modify annexed files.
|
||||||
|
> This may be one of those things where different tradeoffs meet different
|
||||||
|
> users' needs and so a repo could be switched between the two modes as
|
||||||
|
> needed.
|
||||||
|
|
||||||
|
The smudge filter can't modify the work tree file on its own -- git always
|
||||||
|
modifies the file after getting the output of the smudge filter, and will
|
||||||
|
stumble over any modifications that the smudge filter makes. And, it's
|
||||||
|
important that the smudge filter never fail as that will leave the repo in
|
||||||
|
a bad state.
|
||||||
|
|
||||||
|
So, to support partial checkouts and avoid data duplication, the smudge
|
||||||
|
filter should provide some dummy content, probably including the key of the
|
||||||
|
file. (The clean filter should detect when it's operating on that dummy
|
||||||
|
content, and provide the same key as it would if the file content was
|
||||||
|
present.)
|
||||||
|
|
||||||
|
To get the real content, use `git annex get`. (A `post-checkout` hook could
|
||||||
|
run that on all files if the user wants that behavior, or a config setting
|
||||||
|
could make the smudge filter automatically get files' contents.)
|
||||||
|
|
||||||
I've a demo implementation of this technique in the scripts below.
|
I've a demo implementation of this technique in the scripts below.
|
||||||
|
|
||||||
|
### design
|
||||||
|
|
||||||
|
Goal: Get rid of current direct mode, using smudge/clean filters instead to
|
||||||
|
cover the same use cases, more flexibly and robustly.
|
||||||
|
|
||||||
|
Use case 1:
|
||||||
|
|
||||||
|
A user wants to be able to edit files, and git-add, git commit,
|
||||||
|
without needing to worry about using git-annex to unlock files, add files,
|
||||||
|
etc.
|
||||||
|
|
||||||
|
Use case 2:
|
||||||
|
|
||||||
|
Using git-annex on a crippled filesystem that does not support symlinks.
|
||||||
|
|
||||||
|
Data:
|
||||||
|
|
||||||
|
* An annex pointer file has as its first line the git-annex key
|
||||||
|
that it's standing in for. Subsequent lines of the file might
|
||||||
|
be a message saying that the file's content is not currently available.
|
||||||
|
An annex pointer file is checked into the git repository the same way
|
||||||
|
that an annex symlink is checked in.
|
||||||
|
* file2key maps are maintained by git-annex, to keep track of
|
||||||
|
what files are pointers at keys.
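
As a rough sketch of the pointer file format described above (key on the first line, optional message after it), something like the following could build and recognize pointer content. The `NOTICE` text, the function names, and the backend-prefix heuristic are all made up for illustration; none of this is git-annex's actual code.

```python
from typing import Optional

# Hypothetical notice text; the design only says later lines "might be a message".
NOTICE = "This file's content is not currently available. Try: git annex get"


def make_pointer(key: str) -> str:
    """Build pointer file content: the key on the first line, then a notice."""
    return f"{key}\n{NOTICE}\n"


def parse_pointer(content: bytes,
                  known_backends=("SHA256E", "SHA256", "WORM")) -> Optional[str]:
    """Return the key if content looks like a pointer file, else None.

    Only the first line is examined, so a filter needs to read just a
    little of its input. The backend-prefix check is an assumed heuristic.
    """
    first_line = content.split(b"\n", 1)[0]
    try:
        key = first_line.decode("ascii")
    except UnicodeDecodeError:
        return None  # binary data, certainly not a pointer
    backend = key.split("-", 1)[0]
    if backend in known_backends and "--" in key:
        return key
    return None
```

Since only the first line matters, the same check would serve both the smudge filter (is this a pointer?) and the clean filter (am I being fed dummy content?).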
|
||||||
|
|
||||||
|
Configuration:
|
||||||
|
|
||||||
|
* .gitattributes tells git which files to use git-annex's smudge/clean
|
||||||
|
filters with. Typically, all files except for dotfiles:
|
||||||
|
|
||||||
|
* filter=annex
|
||||||
|
.* !filter
|
||||||
|
|
||||||
|
* annex.largefiles tells git-annex which files should in fact be put in
|
||||||
|
the annex. Other files are passed through the smudge/clean as-is and
|
||||||
|
have their contents stored in git.
|
||||||
|
|
||||||
|
git-annex clean:
|
||||||
|
|
||||||
|
* Run by `git add` (and diff and status, etc), and passed the
|
||||||
|
filename, as well as fed the file content on stdin.
|
||||||
|
|
||||||
|
Look at configuration to decide if this file's content belongs in the
|
||||||
|
annex. If not, output the file content to stdout.
|
||||||
|
|
||||||
|
Generate annex key from filename and content from stdin.
|
||||||
|
|
||||||
|
Hard link .git/annex/objects to the file, if it doesn't already exist.
|
||||||
|
(On platforms not supporting hardlinks, copy the file to
|
||||||
|
.git/annex/objects.)
|
||||||
|
|
||||||
|
This is done to prevent losing the only copy of a file when eg
|
||||||
|
doing a git checkout of a different branch. But, no attempt is made to
|
||||||
|
protect the object from being modified. If a user wants to
|
||||||
|
protect object contents from modification, they should use
|
||||||
|
`git annex add`, not `git add`, or they can `git annex lock` after adding.
|
||||||
|
|
||||||
|
There could be a configuration knob to cause a copy to be made to
|
||||||
|
.git/annex/objects -- useful for those crippled filesystems. It might
|
||||||
|
also drop that copy once the object gets uploaded to another repo ...
|
||||||
|
But that gets complicated quickly.
|
||||||
|
|
||||||
|
Update file2key map.
|
||||||
|
|
||||||
|
Output the pointer file content to stdout.
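
The clean steps above could be sketched roughly like this. It is only an illustration under several assumptions: the key is laid out SHA256E-style, objects live in one flat `objects_dir` (the real annex uses hashed subdirectories), `file2key` is a plain dict, and `is_large` stands in for whatever the annex.largefiles check ends up being. It also leaves out detecting dummy pointer content on stdin.

```python
import hashlib
import os
import shutil


def annex_key(filename: str, content: bytes) -> str:
    """Derive a key from filename and content (SHA256E-style layout, assumed)."""
    ext = os.path.splitext(filename)[1]
    digest = hashlib.sha256(content).hexdigest()
    return f"SHA256E-s{len(content)}--{digest}{ext}"


def clean(filename: str, content: bytes, objects_dir: str,
          file2key: dict, is_large) -> bytes:
    """Return what the clean filter would write to stdout for one file."""
    # Configuration says this file belongs in git: pass it through as-is.
    if not is_large(filename, content):
        return content
    key = annex_key(filename, content)
    obj = os.path.join(objects_dir, key)
    if not os.path.exists(obj):
        os.makedirs(objects_dir, exist_ok=True)
        try:
            os.link(filename, obj)          # hard link, so no second copy
        except OSError:
            shutil.copyfile(filename, obj)  # platform without hard links
    file2key[filename] = key                # update the file2key map
    return (key + "\n").encode()            # pointer content for git
```

Note that the hard link means the object is not protected: editing the work tree file edits the object too, which is the tradeoff described above.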
|
||||||
|
|
||||||
|
git-annex smudge:
|
||||||
|
|
||||||
|
* Run by eg `git checkout` and passed the filename, as well as fed
|
||||||
|
the pointer file content on stdin.
|
||||||
|
|
||||||
|
Updates file2key map.
|
||||||
|
|
||||||
|
Outputs the same pointer file content to stdout.
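
A minimal sketch of the smudge side, under the same caveats: `file2key` is a plain dict here, and `looks_like_key` is a guessed heuristic for telling a pointer from ordinary content that was passed through as-is. The important properties are that it never raises and never changes what it was given.

```python
def looks_like_key(line: str) -> bool:
    """Assumed heuristic: 'BACKEND-...--...' with no spaces."""
    return ("--" in line
            and " " not in line
            and line.split("-", 1)[0].isalnum())


def smudge(filename: str, content: bytes, file2key: dict) -> bytes:
    """Record which key a pointer file stands in for; output input unchanged."""
    # Decoding with errors="replace" cannot raise, so binary input is safe.
    first_line = content.split(b"\n", 1)[0].decode("ascii", errors="replace")
    if looks_like_key(first_line):
        file2key[filename] = first_line
    # Never fail and never alter the content: git writes this to the work tree.
    return content
```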
|
||||||
|
|
||||||
|
git annex direct/indirect:
|
||||||
|
|
||||||
|
Previously these commands switched in and out of direct mode.
|
||||||
|
Now they become no-ops.
|
||||||
|
|
||||||
|
git annex lock/unlock:
|
||||||
|
|
||||||
|
Makes sense for these to change to switch files between using
|
||||||
|
git-annex symlinks and pointers. So, this provides both a way to
|
||||||
|
transition repositories to using pointers, and a cleaner unlock/lock
|
||||||
|
for repos using symlinks.
|
||||||
|
|
||||||
|
unlock will stage a pointer file, and will copy the content of the object
|
||||||
|
out of .git/annex/objects to the work tree file. (Might want a --hardlink
|
||||||
|
switch.)
|
||||||
|
|
||||||
|
lock will replace the current work tree file with the symlink, and stage it.
|
||||||
|
Note that multiple work tree files could point to the same object.
|
||||||
|
So, if the link count is > 1, replace the annex object with a copy of
|
||||||
|
itself to break such a hard link. Always finish by locking down the
|
||||||
|
permissions of the annex object.
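
The hard link handling in lock might look something like this sketch. The function name and arguments are hypothetical, staging the symlink with git is left out, and it assumes the work tree file has not been modified since it was added.

```python
import os
import shutil
import stat


def lock(work_file: str, obj: str) -> None:
    """Replace work_file with a symlink to obj, then lock the object down."""
    os.remove(work_file)
    os.symlink(os.path.abspath(obj), work_file)
    if os.stat(obj).st_nlink > 1:
        # Some other work tree file still shares this inode; swap in a
        # private copy so edits to that file cannot corrupt the object.
        tmp = obj + ".tmp"
        shutil.copyfile(obj, tmp)
        os.replace(tmp, obj)
    # Always finish by removing all write bits from the object.
    mode = os.stat(obj).st_mode
    os.chmod(obj, mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))
```

Checking the link count after removing the work tree file avoids an unnecessary copy when that file was the only other link to the object.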
|
||||||
|
|
||||||
|
All other git-annex commands that look at annex symlinks to get keys will
|
||||||
|
need to fall back to checking if a given work tree file is stored in git as
|
||||||
|
a pointer file. This can be done by checking the file2key map (or by looking
|
||||||
|
it up in the index).
|
||||||
|
|
||||||
|
Note that I have not verified if file2key maps can be maintained
|
||||||
|
consistently using the smudge/clean filters. Seems likely to work,
|
||||||
|
based on when I see smudge/clean filters being run. The file2key
|
||||||
|
optimisation may not be needed though, looking at the index
|
||||||
|
might be fast enough.
|
||||||
|
|
||||||
----
|
----
|
||||||
|
|
||||||
### test files
|
### test files
|
||||||
|
|