new todo

2021-11-09 13:30:52 -04:00
commit 9121154a75
parent 8034f2e9bb
3 changed files with 47 additions and 8 deletions
--- a/doc/todo/git_smudge_clean_interface_suboptiomal.mdwn
+++ b/doc/todo/git_smudge_clean_interface_suboptiomal.mdwn
@@ -90,12 +90,12 @@ And here's the consequences of git-annex's workarounds:
 * When `git-annex filter-process` is enabled, it cannot use the trick
  described above that `git-annex smudge --clean` uses to avoid git
-  piping the whole content of large files through it. This mainly slows
-  down `git add` when it is being used with an annex.largefiles
-  confguration to add a large file to the annex. (Making filter-process
-  incrementally hash the content git passes to it will mostly avoid
-  this performance problem though it may always be a little bit slower
-  than `git-annex smudge --clean` due to the data piping.)
+  piping the whole content of large files through it. The whole file
+  content has to be read, even when git-annex does not need to see it.
+  This mainly slows down `git add` when it is being used with an
+  annex.largefiles confguration to add a large file to the annex,
+  by about 5%. ([[todo/incremental_hashing_for_add]] would improve
+  performance)
 * In a rare situation, git-annex would like to get git to run the clean
  filter, but it cannot because git has the index locked. So, git-annex has
--- a/doc/todo/incremental_hashing_for_add.mdwn
+++ b/doc/todo/incremental_hashing_for_add.mdwn
@@ -0,0 +1,40 @@
+When `git-annex filter-process` is enabled, `git add` pipes the content of
+files into it, but that's thrown away, and the file is read again by git-annex
+to generate a hash. It would improve performance to hash the content
+provided via the pipe.
+When filter-process is not enabled, `git-annex smudge --clean` reads
+the file to hash it, then reads it a second time to copy it into
+.git/annex/objects. When annex.addunlocked is enabled, `git annex add`
+does the same. It would improve performance to read once and copy and
+hash at the same time.
+The `incrementalhash` branch has a start at implementing this.
+I lost steam on this branch when I realized that it would need to
+re-implement Annex.Ingest.ingest in order to populate
+.git/annex/objects/. And it's not as simple as writing an object file
+and moving it into place there, because annex.thin means a hard link should
+be made, and if the filesystem supports CoW, that should be used rather
+than writing the file again.
+A benchmark showed that `git add` of a 1 gb file
+is about 5% slower with filter-process enabled than it is
+with filter-process disabled. That's due to the piping overhead to
+filter-process ([[todo/git_smudge_clean_interface_suboptiomal]]).
+`git-annex add` with `annex.addunlocked` has similar performance
+as `git add` with filter-process disabled.
+`git-annex add` without `annex.addunlocked` is about 25% faster than those,
+and only reads the file once, but it also does not copy the file, so of
+course it's faster, and always will be.
+Probably disk cache helps them a fair amount, unless it's too small.
+So it's not clear how much implementing this would really speed them up.
+This does not really affect default configurations.
+Performance is only impacted when annex.addunlocked or
+annex.largefiles is configured, and in a few cases
+where an already annexed file is added by `git add` or `git commit -a`.
+So is the complication of implementing this worth it? Users who
+need maximum speed can use `git-annex add`.
--- a/doc/todo/v9_changes.mdwn
+++ b/doc/todo/v9_changes.mdwn
@@ -15,5 +15,4 @@ could change and if it does, these things could be included.
 * Possibly enable `git-annex filter-process` by default. If the tradeoffs
  seem worth it.
-  It does not currently incrementally hash, so implementing that first
+  May want to implement [[incremental_hashing_for_add]] first.
  would improve the tradeoffs.
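
The new todo item proposes reading the content once and hashing it while it is copied, rather than reading it a second time. The following is only a rough sketch of that general idea in Python, not git-annex's code (git-annex is written in Haskell, and the `incrementalhash` branch is not shown here). The function name `copy_and_hash`, the chunk size, and the choice of SHA-256 are assumptions made for illustration. It also ignores the complications the todo item raises: annex.thin hard links, CoW copies, and populating .git/annex/objects.

```python
import hashlib
import io

# Hypothetical chunk size, chosen only for this illustration.
CHUNK_SIZE = 1024 * 1024


def copy_and_hash(src, dst, chunk_size=CHUNK_SIZE):
    """Copy a binary stream to dst in a single pass, hashing as it goes.

    src can be a regular file or a pipe opened in binary mode, so the
    same loop covers both "hash what arrives on the pipe" and "read the
    file once, copy and hash at the same time".
    Returns (hex digest, number of bytes copied).
    """
    hasher = hashlib.sha256()  # SHA-256 is an assumption, not a requirement
    total = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break  # end of input
        hasher.update(chunk)  # feed the hash incrementally
        dst.write(chunk)      # write the same bytes to the destination
        total += len(chunk)
    return hasher.hexdigest(), total


if __name__ == "__main__":
    source = io.BytesIO(b"example content " * 1000)
    target = io.BytesIO()
    digest, size = copy_and_hash(source, target, chunk_size=4096)
    print(digest, size)
```

Whether a single pass like this is a real win is the open question in the todo item: if the disk cache already absorbs the second read, the saving may be small.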