diff --git a/doc/design/roadmap.mdwn b/doc/design/roadmap.mdwn
index 9095e9d4b5..8e43ca269a 100644
--- a/doc/design/roadmap.mdwn
+++ b/doc/design/roadmap.mdwn
@@ -11,8 +11,8 @@
 Speed improvements, including:
-* Enable parallelism by default.
-
+* Improve parallelism.
+
 * [[todo/sqlite_database_improvements]] to avoid using String and better encode data stored in sqlite.
diff --git a/doc/todo/parallel_possibilities.mdwn b/doc/todo/parallel_possibilities.mdwn
index 9c0e69e294..28c6b8ebed 100644
--- a/doc/todo/parallel_possibilities.mdwn
+++ b/doc/todo/parallel_possibilities.mdwn
@@ -1,13 +1,24 @@
-One of my reasons for using haskell was that it provides the possibility of
-some parallell processing. Although since git-annex hits the filesystem
-heavily and mostly runs other git commands, maybe not a whole lot.
+git-annex has good support for running commands in parallel, but there
+are still some things that could be improved, tracked here:
 
-Anyway, each git-annex command is broken down into a series of independant
-actions, which has some potential for parallelism.
+* Maybe support -Jn in more commands. Just needs changing a few lines of code
+  and testing each.
 
-Each action has 3 distinct phases, basically "check", "perform", and
-"cleanup". The perform actions are probably parellizable; the cleanup may be
-(but not if it has to run git commands to stage state; it can queue
-commands though); the check should be easily parallelizable, although they
-may access the disk or run minor git query commands, so would probably not
-want to run too many of them at once.
+* Maybe extend --jobs/annex.jobs for more control. `--jobs=cpus` is already
+  supported; it might be good to have `--jobs=cpus-1` to leave a spare
+  cpu to avoid contention, or `--jobs=remotes*2` to run 2 jobs per remote.
+
+* Parallelism is often used when the user wants to fully saturate the pipe
+  to a remote, since having some extra transfers running avoids being
+  delayed while git-annex runs cleanup actions, checksum verification,
+  and other non-transfer stuff.
+
+  But, the user will sometimes be disappointed, because every job
+  can still end up stuck doing checksum verification at the same time,
+  so the pipe to the remote is not saturated.
+
+  Running cleanup actions in a separate queue from the main job queue
+  wouldn't be sufficient for this, because verification is done as part
+  of the same action that transfers content. That needs to somehow be
+  refactored to a cleanup action that ingests the file, and then
+  the cleanup action can be run in a separate queue.
diff --git a/doc/todo/parallel_possibilities/comment_1_d8e34fc2bc4e5cf761574608f970d496._comment b/doc/todo/parallel_possibilities/comment_1_d8e34fc2bc4e5cf761574608f970d496._comment
deleted file mode 100644
index 4aceb3abd3..0000000000
--- a/doc/todo/parallel_possibilities/comment_1_d8e34fc2bc4e5cf761574608f970d496._comment
+++ /dev/null
@@ -1,8 +0,0 @@
-[[!comment format=mdwn
- username="https://www.google.com/accounts/o8/id?id=AItOawkptNW1PzrVjYlJWP_9e499uH0mjnBV6GQ"
- nickname="Christian"
- subject="comment 1"
- date="2011-04-08T12:41:43Z"
- content="""
-I also think, that fetching keys via rsync can be done by one rsync process, when the keys are fetched from one host. This would avoid establishing a new TCP connection for every file.
-"""]]
diff --git a/doc/todo/parallel_possibilities/comment_2_adb76f06a7997abe4559d3169a3181c3._comment b/doc/todo/parallel_possibilities/comment_2_adb76f06a7997abe4559d3169a3181c3._comment
deleted file mode 100644
index 6ecce52c42..0000000000
--- a/doc/todo/parallel_possibilities/comment_2_adb76f06a7997abe4559d3169a3181c3._comment
+++ /dev/null
@@ -1,12 +0,0 @@
-[[!comment format=mdwn
- username="http://ertai.myopenid.com/"
- nickname="npouillard"
- subject="comment 2"
- date="2011-05-20T20:14:15Z"
- content="""
-I agree with Christian.
-
-One should first make a better use of connections to remotes before exploring parallel possibilities. One should pipeline the requests and answers.
-
-Of course this could be implemented using parallel&concurrency features of Haskell to do this.
-"""]]
diff --git a/doc/todo/parallel_possibilities/comment_3_145fb974f45da99b7d4b117a3699cccf._comment b/doc/todo/parallel_possibilities/comment_3_145fb974f45da99b7d4b117a3699cccf._comment
deleted file mode 100644
index 0d646a7a80..0000000000
--- a/doc/todo/parallel_possibilities/comment_3_145fb974f45da99b7d4b117a3699cccf._comment
+++ /dev/null
@@ -1,12 +0,0 @@
-[[!comment format=mdwn
- username="http://joeyh.name/"
- ip="4.154.4.90"
- subject="comment 3"
- date="2013-07-17T19:59:50Z"
- content="""
-Note that git-annex now uses locks to communicate among multiple processes, so it's now possible to eg run two `git annex get` processes, and one will skip over the file the other is downloading and go on to the next file, and so on.
-
-This is an especially nice speedup when downloading encrypted data, since the decryption of one file will tend to happen while the other process is downloading the next file (assuming files of approximately the same size, and that decryption takes approxiately as long as downloading).
-
-The only thing preventing this being done by threads in one process, enabled by a -jN option, is that the output would be a jumbled mess.
-"""]]
diff --git a/doc/todo/parallel_possibilities/comment_4_229af3089d01eef2bb5a9a9c0610a73c._comment b/doc/todo/parallel_possibilities/comment_4_229af3089d01eef2bb5a9a9c0610a73c._comment
deleted file mode 100644
index 46e399ffef..0000000000
--- a/doc/todo/parallel_possibilities/comment_4_229af3089d01eef2bb5a9a9c0610a73c._comment
+++ /dev/null
@@ -1,23 +0,0 @@
-[[!comment format=mdwn
- username="joey"
- subject="""comment 4"""
- date="2015-11-04T21:00:02Z"
- content="""
-Now, many git-annex commands support -Jn, the output is not a jumbled mess
-thanks to the concurrent-output library.
-
-At this point all I see that needs doing is:
-
-* Maybe support -Jn in more commands. Just needs changing a few lines of
-  code and testing each.
-
-* It would be nice to be able to run cleanup actions in the "background",
-  after a command has otherwise succeeded, even when -Jn is not used.
-  In particular, when getting files, their checksum is verified after
-  download. That would nicely parellize with the next file being
-  downloaded.
-
-  This could be implemented also using concurrent-output, but it would then
-  have to drive the display even when -J is not used. I'm not yet sure
-  enough about it to use it by default.
-"""]]