smudge design

2015-11-23 16:53:05 -04:00 · 2015-11-23 16:53:05 -04:00 · 33fb0de1a3
commit 33fb0de1a3
parent 59c2001d2f
2 changed files with 224 additions and 33 deletions
--- a/doc/devblog/day_339_smudging_out_direct_mode.mdwn
+++ b/doc/devblog/day_339_smudging_out_direct_mode.mdwn
@@ -0,0 +1,56 @@
 I'm considering ways to get rid of direct mode, replacing it with something
 better implemented using [[todo/smudge]] filters.
 ## git-lfs
 I started by trying out git-lfs, to see what I can learn from it. My
 feeling is that git-lfs brings an admirable simplicity to using git with
 large files. For example, it uses a push-hook to automatically
 upload file contents before pushing a branch.
 But its simplicity comes at the cost of being centralized. You can't make a
 git-lfs repository locally and clone it onto another drive and have the local
 repositories interoperate to pass file contents around. Everything has to
 go back through a centralized server. I'm willing to pay complexity costs
 for decentralization.
 Its simplicity also means that the user doesn't have much control over what
 files are present in their checkout of a repository. git-lfs downloads
 all the files in the work tree. It doesn't have facilities for dropping
 files to free up space, or for configuring a repository to only want to get
 a subset of files in the first place. Some of this could be added to it 
 I suppose.
 ## replacing direct mode
 Anyway, as smudge/clean filters stand now, they can't be used to set up
 git-annex symlinks; their interface doesn't allow it. But, I was able to
 think up a design that uses smudge/clean filters to cover the same use
 cases that direct mode covers now.
 Thanks to the clean filter, adding a file with `git add` would check in a
 small file that points to the git-annex object. When a file has been added
 this way, the file in the work tree remains the only copy of the object
 until you use git-annex to copy it to another repository. So if you modify
 the work tree file, you can lose the old version of the object.
 This is analogous to how direct mode works now, and it avoids needing to
 store 2 copies of every file in the local repository.
 In the same repository, you could also use `git annex add` to check
 in a git-annex symlink, which would protect the object from modification,
 in the good old indirect mode way. `git annex lock` and `git annex unlock` 
 could switch a file between those two modes.
 So this allows mixing directly writable annexed files and locked down
 annexed files in the same repository. All regular git commands and all
 git-annex commands can be used on both sorts of files.
 That's much more flexible than the current direct mode, and I think it can
 be implemented in a simpler, more scalable, and more robust way too.
 I can lose the direct mode merge code, and remove hundreds of lines of
 other special cases for direct mode.
 The downside, perhaps, is that for a repository to be usable on a crippled
 filesystem, all the files in it will need to be unlocked. A file can't
 easily be unlocked in one checkout and locked in another checkout.
--- a/doc/todo/smudge.mdwn
+++ b/doc/todo/smudge.mdwn
@@ -15,6 +15,10 @@ available files, and checksum them, which is too expensive.
 > git to handle this sort of case in an efficient way.. just needs someone
 > to do the work. --[[Joey]] 
 >> Update 2015: git status only calls the clean filter for files
 >> that the index says are modified, so this is no longer a problem.
 >> --[[Joey]]
 ----
 The clean filter is run when files are staged for commit. So a user could copy
@@ -36,35 +40,26 @@ add` files, and just being able to use `git add` or `git commit -a`,
 and have it use git-annex when .gitattributes says to. Also, annexed
 files can be directly modified without having to `git annex unlock`.
-### design
+### configuration
 In .gitattributes, the user would put something like "* filter=git-annex".
 This way they could control which files are annexed vs added normally.
-(git-annex could have further controls to allow eg, passing small files
+It would also be good to allow using this without having to specify
-through to regular processing. At least .gitattributes is a special case,
+the files in .gitattributes. Just use "* filter=git-annex" there, and then
-it should never be annexed...)
+let git-annex decide which files to annex and which to pass through the
 smudge and clean filters as-is. The smudge filter can just read a little of
 its input to see if it's a pointer to an annexed file. The clean filter
 could apply annex.largefiles to decide whether to annex a file's content or
 not.
-For files not configured this way, git-annex could continue to use
+For files not configured this way in .gitattributes, git-annex could
-its symlink method -- this would preserve backwards compatability,
+continue to use its symlink method -- this would preserve backwards
-and even allow mixing the two methods in a repo as desired.
+compatibility, and even allow mixing the two methods in a repo as desired.
 (But not switching an existing repo between indirect and direct modes;
 the user decides which mode to use when adding files to the repo.)
-To find files in the repository that are annexed, git-annex would do
+### clean
 `ls-files` as now, but would check if found files have the appropriate
 filter, rather than the current symlink checks. To determine the key
 of a file, rather than reading its symlink, git-annex would need to
 look up the git blob associated with the file -- this can be done
 efficiently using the existing code in `Branch.catFile`.
 The clean filter would inject the file's content into the annex, and hard
 link from the annex to the file. Avoiding duplication of data.
 The smudge filter can't do that, so to avoid duplication of data, it
 might always create an empty file. To get the content, `git annex get`
 could be used (which would hard link it). A `post-checkout` hook might
 be used to set up hard links for all currently available content.
 #### clean
 The trick is doing it efficiently. Since git a2b665d, v1.7.4.1,
 something like this works to provide a filename to the clean script:
@@ -100,27 +95,167 @@ can't be fixed.)
 > but it seems to avoid this problem.
 > --[[Joey]]
-#### smudge
+### smudge
 The smudge script can also be provided a filename with %f, but it
 cannot directly write to the file or git gets unhappy.
 > Still the case in 2015. Means an unnecessary read and pipe of the file
-> even if the content is already locally available on disk. --[[Joey]]
+> even if the content is already locally available on disk. --[[Joey]]
-### dealing with partial content availability
+### partial checkouts
-The smudge filter cannot be allowed to fail, that leaves the tree and
+It's important that git-annex supports partial checkouts of the content of
-index in a weird state. So if a file's content is requested by calling
+a repository. This allows repositories to be checked out when there's not
-the smudge filter, the trick is to instead provide dummy content,
+enough disk space available for all files in the repository.
 indicating it is not available (and perhaps saying to run "git-annex get").
-Then, in the clean filter, it has to detect that it's cleaning a file
+The way git-lfs uses smudge/clean filters, which is similar to that
-with that dummy content, and make sure to provide the same identifier as
+described above, does not support partial checkouts; it always tries to
-it would if the file content was there. 
+download the contents of all files. Indeed, git-lfs seems to keep 2 copies
 of newly added files; one in the work tree and one in .git/lfs/objects/,
 at least before it sends the latter to the server. This lack of control
 over which data is checked out and duplication of the data limits the
 usefulness of git-lfs on truly large amounts of data.
 To support partial checkouts, `git annex get` and `git annex drop` need to
 be usable.
 To avoid data duplication when adding a new object, the clean filter could
 hard link from the work tree file to the annex object. But then the
 user could change the work tree file without breaking the hard link, and this
 would corrupt the annexed object. Removing write permissions would avoid
 that (mostly), but that would lose some of the benefits of smudge/clean as
 the user wouldn't be able to modify annexed files. 
 > This may be one of those things where different tradeoffs meet different
 > users' needs, and so a repo could be switched between the two modes as
 > needed.
 The smudge filter can't modify the work tree file on its own -- git always
 modifies the file after getting the output of the smudge filter, and will
 stumble over any modifications that the smudge filter makes. And, it's
 important that the smudge filter never fail as that will leave the repo in
 a bad state.
 So, to support partial checkouts and avoid data duplication, the smudge
 filter should provide some dummy content, probably including the key of the
 file. (The clean filter should detect when it's operating on that dummy
 content, and provide the same key as it would if the file content was
 present.)
 To get the real content, use `git annex get`. (A `post-checkout` hook could
 run that on all files if the user wants that behavior, or a config setting
 could make the smudge filter automatically get files' contents.)
 I've a demo implementation of this technique in the scripts below.
 ### design
 Goal: Get rid of current direct mode, using smudge/clean filters instead to
 cover the same use cases, more flexibly and robustly.
 Use case 1:
 A user wants to be able to edit files, and git-add, git commit,
 without needing to worry about using git-annex to unlock files, add files,
 etc.
 Use case 2:
 Using git-annex on a crippled filesystem that does not support symlinks.
 Data:
 * An annex pointer file has as its first line the git-annex key
  that it's standing in for. Subsequent lines of the file might
  be a message saying that the file's content is not currently available.
  An annex pointer file is checked into the git repository the same way
  that an annex symlink is checked in.
 * file2key maps are maintained by git-annex, to keep track of
  what files are pointers at keys.
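
The pointer file format described above can be sketched in a few lines of Python. This is only an illustration, not git-annex's implementation: `format_pointer`, `parse_pointer`, and the notice text are hypothetical names, and the test for "looks like a key" (a single whitespace-free token containing `--`) is an assumption made for the sketch.

```python
from typing import Optional

# Hypothetical notice text; the design only says later lines "might be" a message.
UNAVAILABLE_NOTE = "This file's content is not currently available; run: git annex get"


def format_pointer(key: str, available: bool = True) -> str:
    """Build pointer file text: the key on line 1, an optional notice after it."""
    lines = [key]
    if not available:
        lines.append(UNAVAILABLE_NOTE)
    return "\n".join(lines) + "\n"


def parse_pointer(text: str) -> Optional[str]:
    """Return the key from the first line, or None if this is not a pointer.

    Assumption for this sketch: a key is one whitespace-free token that
    contains "--".
    """
    first_line = text.split("\n", 1)[0].strip()
    if len(first_line.split()) == 1 and "--" in first_line:
        return first_line
    return None
```

Only the first line is ever interpreted, so the notice can change or be dropped without changing which key the file stands in for.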
 Configuration: 
 * .gitattributes tells git which files to use git-annex's smudge/clean
  filters with. Typically, all files except for dotfiles:
 	* filter=annex
 	.* !filter
 * annex.largefiles tells git-annex which files should in fact be put in 
  the annex. Other files are passed through the smudge/clean as-is and
  have their contents stored in git.
 git-annex clean:
 * Run by `git add` (and diff and status, etc), and passed the
  filename, as well as fed the file content on stdin.
  Look at configuration to decide if this file's content belongs in the
  annex. If not, output the file content to stdout.
  Generate annex key from filename and content from stdin.
  Hard link .git/annex/objects to the file, if it doesn't already exist.
  (On platforms not supporting hardlinks, copy the file to
  .git/annex/objects.)
  This is done to prevent losing the only copy of a file when eg
  doing a git checkout of a different branch. But, no attempt is made to 
  protect the object from being modified. If a user wants to
  protect object contents from modification, they should use
  `git annex add`, not `git add`, or they can `git annex lock` after adding.
  There could be a configuration knob to cause a copy to be made to
  .git/annex/objects -- useful for those crippled filesystems. It might
  also drop that copy once the object gets uploaded to another repo ...
  But that gets complicated quickly.
  Update file2key map.
  Output the pointer file content to stdout.
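
A rough Python sketch of the clean steps listed above, under several assumptions: the key is derived from the content alone (size plus SHA-256), objects live in a flat `objects_dir` rather than whatever layout git-annex really uses, the file2key map is a plain dict, and `is_large` stands in for the annex.largefiles check. `make_key` and `clean` are hypothetical names.

```python
import hashlib
import os
import shutil
from typing import Callable, Dict


def make_key(content: bytes) -> str:
    """Hypothetical key scheme for this sketch: SHA256-s<size>--<hex digest>."""
    return "SHA256-s%d--%s" % (len(content), hashlib.sha256(content).hexdigest())


def clean(path: str, content: bytes, objects_dir: str,
          file2key: Dict[str, str],
          is_large: Callable[[str, bytes], bool]) -> bytes:
    """Return what the clean filter would write to stdout for one file."""
    if not is_large(path, content):
        return content  # not annexed: pass through, git stores it as-is

    key = make_key(content)
    obj = os.path.join(objects_dir, key)  # flat layout, a simplification
    if not os.path.exists(obj):
        os.makedirs(objects_dir, exist_ok=True)
        try:
            os.link(path, obj)  # hard link, so no second copy of the data
        except OSError:
            shutil.copyfile(path, obj)  # no hard links here: fall back to a copy

    file2key[path] = key           # update the file2key map
    return (key + "\n").encode()   # pointer file content
```

The object is deliberately left writable through the work tree file, matching the note above that `git add` makes no attempt to protect it.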
 git-annex smudge:
 * Run by eg `git checkout` and passed the filename, as well as fed
  the pointer file content on stdin.
  Updates file2key map.
  Outputs the same pointer file content to stdout.
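
The smudge side is small enough to sketch too. The assumptions here: the filter only peeks at the first `PEEK_BYTES` of its input to decide whether it is a pointer, the file2key map is a dict, and a stale map entry is dropped when the input turns out not to be a pointer (the design does not say what should happen in that case). `pointer_key` and `smudge` are hypothetical names.

```python
from typing import Dict, Optional

PEEK_BYTES = 1024  # arbitrary limit for this sketch


def pointer_key(data: bytes) -> Optional[str]:
    """Look at the start of the input and return the key if it is a pointer."""
    head = data[:PEEK_BYTES]
    if b"\n" not in head and len(data) > PEEK_BYTES:
        return None  # first line is too long to be a key
    try:
        first_line = head.split(b"\n", 1)[0].decode("ascii").strip()
    except UnicodeDecodeError:
        return None  # binary content, not a pointer
    # Assumption: a key is one whitespace-free token containing "--".
    if len(first_line.split()) == 1 and "--" in first_line:
        return first_line
    return None


def smudge(path: str, data: bytes, file2key: Dict[str, str]) -> bytes:
    """Update the map and pass the input through unchanged."""
    key = pointer_key(data)
    if key is not None:
        file2key[path] = key
    else:
        file2key.pop(path, None)  # assumption: forget stale entries
    return data  # never fails, never rewrites the content
```

Since the output always equals the input, this filter cannot leave the work tree in a bad state; getting real content stays the job of `git annex get`.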
 git annex direct/indirect:
  Previously these commands switched in and out of direct mode.
  Now they become no-ops.
 git annex lock/unlock:
  Makes sense for these to change to switch files between using
  git-annex symlinks and pointers. So, this provides both a way to
  transition repositories to using pointers, and a cleaner unlock/lock
  for repos using symlinks.
  unlock will stage a pointer file, and will copy the content of the object
  out of .git/annex/objects to the work tree file. (Might want a --hardlink
  switch.)
  lock will replace the current work tree file with the symlink, and stage it.
  Note that multiple work tree files could point to the same object.
  So, if the link count is > 1, replace the annex object with a copy of
  itself to break such a hard link. Always finish by locking down the
  permissions of the annex object.
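
The hard link handling in lock could look something like the following Python sketch. It only covers the annex object side (breaking the link and removing write permission), not replacing the work tree file with a symlink or staging it. `lock_down_object` is a hypothetical name, and the `.tmp` suffix is an arbitrary choice.

```python
import os
import shutil
import stat


def lock_down_object(obj_path: str) -> None:
    """Break any hard link to the annex object, then make it read-only."""
    if os.stat(obj_path).st_nlink > 1:
        # Some work tree file shares this inode. Replace the object with a
        # copy of itself so later edits to that file cannot change it.
        tmp = obj_path + ".tmp"
        shutil.copy2(obj_path, tmp)
        os.replace(tmp, obj_path)

    # Always finish by removing write permission from the object.
    mode = os.stat(obj_path).st_mode
    os.chmod(obj_path, mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))
```

The copy costs disk space only when a link actually exists, so locking a file that was never hard linked just changes its permissions.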
 All other git-annex commands that look at annex symlinks to get keys will
 need to fall back to checking if a given work tree file is stored in git as
 a pointer file. This can be done by checking the file2key map (or by looking
 it up in the index).
 Note that I have not verified if file2key maps can be maintained
 consistently using the smudge/clean filters. Seems likely to work,
 based on when I see smudge/clean filters being run. The file2key
 optimisation may not be needed though, since looking at the index 
 might be fast enough.
 ----
 ### test files