new todo
parent 8034f2e9bb
commit 9121154a75
3 changed files with 47 additions and 8 deletions

@@ -90,12 +90,12 @@ And here are the consequences of git-annex's workarounds:
 
 * When `git-annex filter-process` is enabled, it cannot use the trick
   described above that `git-annex smudge --clean` uses to avoid git
-  piping the whole content of large files through it. This mainly slows
-  down `git add` when it is being used with an annex.largefiles
-  configuration to add a large file to the annex. (Making filter-process
-  incrementally hash the content git passes to it will mostly avoid
-  this performance problem, though it may always be a little bit slower
-  than `git-annex smudge --clean` due to the data piping.)
+  piping the whole content of large files through it. The whole file
+  content has to be read, even when git-annex does not need to see it.
+  This mainly slows down `git add` when it is being used with an
+  annex.largefiles configuration to add a large file to the annex,
+  by about 5%. ([[todo/incremental_hashing_for_add]] would improve
+  performance.)
 
 * In a rare situation, git-annex would like to get git to run the clean
   filter, but it cannot because git has the index locked. So, git-annex has

doc/todo/incremental_hashing_for_add.mdwn (new file, 40 lines)

@@ -0,0 +1,40 @@

When `git-annex filter-process` is enabled, `git add` pipes the content of
files into it, but that's thrown away, and the file is read again by git-annex
to generate a hash. It would improve performance to hash the content
provided via the pipe.
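
Something along these lines is the idea, sketched in Haskell since that's
what git-annex is written in, using the cryptonite library: hash each chunk
as it is consumed, while still passing the data through. This is
illustrative only; a real clean filter in filter-process mode speaks git's
pkt-line protocol, and none of these names are git-annex internals.

    -- Sketch: incrementally hash stdin while passing it through,
    -- instead of re-reading the file afterwards just to hash it.
    import Crypto.Hash (Context, SHA256, hashInit, hashUpdate, hashFinalize)
    import qualified Data.ByteString as B
    import System.IO (stdin, stdout, hIsEOF)

    main :: IO ()
    main = go (hashInit :: Context SHA256)
      where
        go ctx = do
            eof <- hIsEOF stdin
            if eof
                then print (hashFinalize ctx)  -- the content hash, for free
                else do
                    chunk <- B.hGetSome stdin 65536
                    B.hPut stdout chunk        -- pass the content through
                    go (hashUpdate ctx chunk)  -- hash it as it goes by
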
When filter-process is not enabled, `git-annex smudge --clean` reads
the file to hash it, then reads it a second time to copy it into
.git/annex/objects. When annex.addunlocked is enabled, `git annex add`
does the same. It would improve performance to read the file once,
copying and hashing it at the same time.
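
Read-once could look something like this sketch; `hashAndCopy` is a
made-up name, not anything in git-annex:

    -- Sketch: a single pass that both writes the copy and feeds the
    -- hash, halving the reads compared to hash-then-copy.
    import Crypto.Hash (Digest, SHA256, hashInit, hashUpdate, hashFinalize)
    import qualified Data.ByteString as B
    import System.IO (IOMode(..), withBinaryFile)

    hashAndCopy :: FilePath -> FilePath -> IO (Digest SHA256)
    hashAndCopy src dest =
        withBinaryFile src ReadMode $ \hin ->
            withBinaryFile dest WriteMode $ \hout ->
                let go ctx = do
                        chunk <- B.hGetSome hin 65536
                        if B.null chunk
                            then return (hashFinalize ctx)
                            else do
                                B.hPut hout chunk
                                go (hashUpdate ctx chunk)
                in go hashInit
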
The `incrementalhash` branch has a start at implementing this.
I lost steam on this branch when I realized that it would need to
re-implement Annex.Ingest.ingest in order to populate
.git/annex/objects/. And it's not as simple as writing an object file
and moving it into place there, because annex.thin means a hard link should
be made, and if the filesystem supports CoW, that should be used rather
than writing the file again.
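
For illustration, the populate step has to do something like the sketch
below. `populateObject` and `copyReflink` are hypothetical names; the real
logic lives in Annex.Ingest and git-annex's CoW support:

    -- Sketch: annex.thin wants a hard link; otherwise prefer a CoW
    -- clone when the filesystem supports one, else an ordinary copy.
    import Control.Exception (SomeException, try)
    import System.Directory (copyFile)
    import System.Posix.Files (createLink)

    populateObject :: Bool -> FilePath -> FilePath -> IO ()
    populateObject thin src obj
        | thin = createLink src obj  -- annex.thin: share the one copy
        | otherwise = do
            r <- try (copyReflink src obj) :: IO (Either SomeException ())
            either (\_ -> copyFile src obj) return r

    -- Hypothetical stand-in: real code would use a reflink ioctl or
    -- cp --reflink; this version always fails over to a plain copy.
    copyReflink :: FilePath -> FilePath -> IO ()
    copyReflink _ _ = ioError (userError "reflink not supported here")
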
A benchmark showed that `git add` of a 1 GB file
is about 5% slower with filter-process enabled than it is
with filter-process disabled. That's due to the piping overhead to
filter-process ([[todo/git_smudge_clean_interface_suboptiomal]]).
`git-annex add` with `annex.addunlocked` performs about the same as
`git add` with filter-process disabled.

`git-annex add` without `annex.addunlocked` is about 25% faster than those,
and only reads the file once. But it also does not copy the file, so of
course it's faster, and always will be.

The disk cache probably helps them a fair amount, unless it's too small
to hold the file. So it's not clear how much implementing this would
really speed them up.

This does not really affect default configurations.
Performance is only impacted when annex.addunlocked or
annex.largefiles is configured, and in a few cases
where an already annexed file is added by `git add` or `git commit -a`.

So is the complication of implementing this worth it? Users who
need maximum speed can use `git-annex add`.

@@ -15,5 +15,4 @@ could change and if it does, these things could be included.
 * Possibly enable `git-annex filter-process` by default. If the tradeoffs
   seem worth it.
 
-  It does not currently incrementally hash, so implementing that first
-  would improve the tradeoffs.
+  May want to implement [[incremental_hashing_for_add]] first.