update old todo item with what still needs doing

removed old comments that are no longer relevant
Joey Hess 2019-05-10 13:52:09 -04:00
parent daa0c6c1c6
commit ae562ad4d7
6 changed files with 24 additions and 68 deletions


@@ -11,8 +11,8 @@ Speed improvements, including:
   <https://git-annex.branchable.com/todo/git_smudge_clean_interface_suboptiomal/>
   <https://git-annex.branchable.com/todo/Long_Running_Filter_Process/>
 
-* Enable parallelism by default.
-  <https://git-annex.branchable.com/todo/config_option_to_use_all_processors_by_default/>
+* Improve parallelism.
+  <https://git-annex.branchable.com/todo/parallel_possibilities>
 
 * [[todo/sqlite_database_improvements]] to avoid using String and better
   encode data stored in sqlite.


@@ -1,13 +1,24 @@
-One of my reasons for using haskell was that it provides the possibility of
-some parallel processing. Although since git-annex hits the filesystem
-heavily and mostly runs other git commands, maybe not a whole lot.
-
-Anyway, each git-annex command is broken down into a series of independent
-actions, which has some potential for parallelism.
-
-Each action has 3 distinct phases, basically "check", "perform", and
-"cleanup". The perform actions are probably parallelizable; the cleanup may be
-(but not if it has to run git commands to stage state; it can queue
-commands though); the checks should be easily parallelizable, although they
-may access the disk or run minor git query commands, so it would probably
-not be wise to run too many of them at once.
+git-annex has good support for running commands in parallel, but there
+are still some things that could be improved, tracked here:
+
+* Maybe support -Jn in more commands. Just needs changing a few lines of code
+  and testing each.
+
+* Maybe extend --jobs/annex.jobs for more control. `--jobs=cpus` is already
+  supported; it might be good to have `--jobs=cpus-1` to leave a spare
+  cpu to avoid contention, or `--jobs=remotes*2` to run 2 jobs per remote.
+
+* Parallelism is often used when the user wants to fully saturate the pipe
+  to a remote, since having some extra transfers running avoids being
+  delayed while git-annex runs cleanup actions, checksum verification,
+  and other non-transfer stuff.
+
+  But the user will sometimes be disappointed, because every job
+  can still end up stuck doing checksum verification at the same time,
+  so the pipe to the remote is not saturated.
+
+  Running cleanup actions in a separate queue from the main job queue
+  wouldn't be sufficient for this, because verification is done as part
+  of the same action that transfers content. That needs to somehow be
+  refactored into a cleanup action that ingests the file, which could
+  then run in a separate queue (see the sketch below).
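To make that last bullet concrete, here is a minimal Haskell sketch of the separate-queue idea. It is not git-annex's actual code: `transferFile` and `verifyChecksum` are hypothetical stand-ins for the real transfer and verification actions. Download jobs hand each finished file to a dedicated verifier thread over a bounded queue, so checksumming never occupies a transfer slot:

```haskell
import Control.Concurrent.Async (async, wait, mapConcurrently_)
import Control.Concurrent.STM

data Job = Verify FilePath | Done

-- Download a file, then queue it for verification instead of
-- verifying it inline.
downloadAndQueue :: TBQueue Job -> FilePath -> IO ()
downloadAndQueue q f = do
  -- transferFile f  -- hypothetical: fetch the content from the remote
  atomically $ writeTBQueue q (Verify f)

-- Single consumer that checksums downloaded files until told to stop.
verifier :: TBQueue Job -> IO ()
verifier q = do
  job <- atomically $ readTBQueue q
  case job of
    Done      -> return ()
    Verify _f -> do
      -- verifyChecksum _f  -- hypothetical: checksum the downloaded file
      verifier q

main :: IO ()
main = do
  q <- newTBQueueIO 16  -- bounded, so downloads pause if verification lags
  v <- async (verifier q)
  mapConcurrently_ (downloadAndQueue q) ["f1", "f2", "f3"]
  atomically $ writeTBQueue q Done
  wait v
```

The bounded queue applies backpressure: if verification falls far behind, downloads wait rather than piling up unverified files on disk.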


@@ -1,8 +0,0 @@
[[!comment format=mdwn
username="https://www.google.com/accounts/o8/id?id=AItOawkptNW1PzrVjYlJWP_9e499uH0mjnBV6GQ"
nickname="Christian"
subject="comment 1"
date="2011-04-08T12:41:43Z"
content="""
I also think that fetching keys via rsync can be done by one rsync process when the keys are fetched from one host. This would avoid establishing a new TCP connection for every file.
"""]]


@@ -1,12 +0,0 @@
[[!comment format=mdwn
username="http://ertai.myopenid.com/"
nickname="npouillard"
subject="comment 2"
date="2011-05-20T20:14:15Z"
content="""
I agree with Christian.
One should first make better use of connections to remotes before exploring parallel possibilities: the requests and answers should be pipelined.
Of course, this could be implemented using the parallelism and concurrency features of Haskell.
"""]]


@@ -1,12 +0,0 @@
[[!comment format=mdwn
username="http://joeyh.name/"
ip="4.154.4.90"
subject="comment 3"
date="2013-07-17T19:59:50Z"
content="""
Note that git-annex now uses locks to communicate among multiple processes, so it's now possible to eg run two `git annex get` processes, and one will skip over the file the other is downloading and go on to the next file, and so on.
This is an especially nice speedup when downloading encrypted data, since the decryption of one file will tend to happen while the other process is downloading the next file (assuming the files are of approximately the same size, and that decryption takes approximately as long as downloading).
The only thing preventing this from being done by threads in one process, enabled by a -jN option, is that the output would be a jumbled mess.
"""]]


@@ -1,23 +0,0 @@
[[!comment format=mdwn
username="joey"
subject="""comment 4"""
date="2015-11-04T21:00:02Z"
content="""
Now many git-annex commands support -Jn, and the output is not a jumbled
mess, thanks to the concurrent-output library.
At this point all I see that needs doing is:
* Maybe support -Jn in more commands. Just needs changing a few lines of
code and testing each.
* It would be nice to be able to run cleanup actions in the "background",
after a command has otherwise succeeded, even when -Jn is not used.
In particular, when getting files, their checksum is verified after
download. That would nicely parallelize with the next file being
downloaded.
This could also be implemented using concurrent-output, but it would then
have to drive the display even when -J is not used. I'm not yet sure
enough about it to use it by default.
"""]]