update old todo item with what still needs doing

removed old comments that are no longer relevant
Joey Hess 2019-05-10 13:52:09 -04:00
parent daa0c6c1c6
commit ae562ad4d7
6 changed files with 24 additions and 68 deletions


@@ -11,8 +11,8 @@ Speed improvements, including:
   <https://git-annex.branchable.com/todo/git_smudge_clean_interface_suboptiomal/>
   <https://git-annex.branchable.com/todo/Long_Running_Filter_Process/>
 
-* Enable parallelism by default.
-  <https://git-annex.branchable.com/todo/config_option_to_use_all_processors_by_default/>
+* Improve parallelism.
+  <https://git-annex.branchable.com/todo/parallel_possibilities>
 
 * [[todo/sqlite_database_improvements]] to avoid using String and better
   encode data stored in sqlite.


@@ -1,13 +1,24 @@
-One of my reasons for using haskell was that it provides the possibility of
-some parallel processing. Although since git-annex hits the filesystem
-heavily and mostly runs other git commands, maybe not a whole lot.
-
-Anyway, each git-annex command is broken down into a series of independent
-actions, which has some potential for parallelism.
-
-Each action has 3 distinct phases, basically "check", "perform", and
-"cleanup". The perform actions are probably parallelizable; the cleanup may be
-(but not if it has to run git commands to stage state; it can queue
-commands though); the checks should be easily parallelizable, although they
-may access the disk or run minor git query commands, so it would probably
-not be wise to run too many of them at once.
+git-annex has good support for running commands in parallel, but there
+are still some things that could be improved, tracked here:
+
+* Maybe support -Jn in more commands. Just needs changing a few lines of code
+  and testing each.
+
+* Maybe extend --jobs/annex.jobs for more control. `--jobs=cpus` is already
+  supported; it might be good to have `--jobs=cpus-1` to leave a spare
+  cpu to avoid contention, or `--jobs=remotes*2` to run 2 jobs per remote.
+
+* Parallelism is often used when the user wants to fully saturate the pipe
+  to a remote, since having some extra transfers running avoids being
+  delayed while git-annex runs cleanup actions, checksum verification,
+  and other non-transfer stuff.
+
+  But the user will sometimes be disappointed, because every job
+  can still end up stuck doing checksum verification at the same time,
+  so the pipe to the remote is not saturated.
+
+  Running cleanup actions in a separate queue from the main job queue
+  wouldn't be sufficient for this, because verification is done as part
+  of the same action that transfers content. That needs to somehow be
+  refactored into a cleanup action that ingests the file, which could
+  then run in a separate queue (see the sketch below).
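To make that last bullet concrete, here is a minimal Haskell sketch of the separate-queue idea. It is not git-annex's actual code: `transferFile` and `verifyChecksum` are hypothetical stand-ins for the real transfer and verification actions. Download jobs hand each finished file to a dedicated verifier thread over a bounded queue, so checksumming never occupies a transfer slot:

```haskell
import Control.Concurrent.Async (async, wait, mapConcurrently_)
import Control.Concurrent.STM

data Job = Verify FilePath | Done

-- Download a file, then queue it for verification instead of
-- verifying it inline.
downloadAndQueue :: TBQueue Job -> FilePath -> IO ()
downloadAndQueue q f = do
  -- transferFile f  -- hypothetical: fetch the content from the remote
  atomically $ writeTBQueue q (Verify f)

-- Single consumer that checksums downloaded files until told to stop.
verifier :: TBQueue Job -> IO ()
verifier q = do
  job <- atomically $ readTBQueue q
  case job of
    Done      -> return ()
    Verify _f -> do
      -- verifyChecksum _f  -- hypothetical: checksum the downloaded file
      verifier q

main :: IO ()
main = do
  q <- newTBQueueIO 16  -- bounded, so downloads pause if verification lags
  v <- async (verifier q)
  mapConcurrently_ (downloadAndQueue q) ["f1", "f2", "f3"]
  atomically $ writeTBQueue q Done
  wait v
```

The bounded queue applies backpressure: if verification falls far behind, downloads wait rather than piling up unverified files on disk.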


@@ -1,8 +0,0 @@
[[!comment format=mdwn
username="https://www.google.com/accounts/o8/id?id=AItOawkptNW1PzrVjYlJWP_9e499uH0mjnBV6GQ"
nickname="Christian"
subject="comment 1"
date="2011-04-08T12:41:43Z"
content="""
I also think that fetching keys via rsync can be done by one rsync process when the keys are fetched from one host. This would avoid establishing a new TCP connection for every file.
"""]]


@@ -1,12 +0,0 @@
[[!comment format=mdwn
username="http://ertai.myopenid.com/"
nickname="npouillard"
subject="comment 2"
date="2011-05-20T20:14:15Z"
content="""
I agree with Christian.
One should first make better use of connections to remotes before exploring parallel possibilities: the requests and answers should be pipelined.
Of course, this could be implemented using the parallelism and concurrency features of Haskell.
"""]]


@@ -1,12 +0,0 @@
[[!comment format=mdwn
username="http://joeyh.name/"
ip="4.154.4.90"
subject="comment 3"
date="2013-07-17T19:59:50Z"
content="""
Note that git-annex now uses locks to communicate among multiple processes, so it's now possible to eg run two `git annex get` processes, and one will skip over the file the other is downloading and go on to the next file, and so on.
This is an especially nice speedup when downloading encrypted data, since the decryption of one file will tend to happen while the other process is downloading the next file (assuming the files are of approximately the same size, and that decryption takes approximately as long as downloading).
The only thing preventing this from being done by threads in one process, enabled by a -jN option, is that the output would be a jumbled mess.
"""]]


@@ -1,23 +0,0 @@
[[!comment format=mdwn
username="joey"
subject="""comment 4"""
date="2015-11-04T21:00:02Z"
content="""
Now many git-annex commands support -Jn, and the output is not a jumbled
mess, thanks to the concurrent-output library.
At this point all I see that needs doing is:
* Maybe support -Jn in more commands. Just needs changing a few lines of
code and testing each.
* It would be nice to be able to run cleanup actions in the "background",
after a command has otherwise succeeded, even when -Jn is not used.
In particular, when getting files, their checksum is verified after
download. That would nicely parallelize with the next file being
downloaded.
This could also be implemented using concurrent-output, but it would then
have to drive the display even when -J is not used. I'm not yet sure
enough about it to use it by default.
"""]]