git-annex/doc/design/assistant/blog/day_8__speed.mdwn
2012-07-08 13:04:35 -06:00


Since last post, I've worked on speeding up `git annex watch`'s startup time
in a large repository.

The problem was that its initial scan was naively staging every symlink in
the repository, even though most of them are, presumably, staged correctly
already. This was done in case the user copied or moved some symlinks
around while `git annex watch` was not running -- we want to notice and
commit such changes at startup.

Since the scan already has the `stat` info for each symlink, it can look at
the `ctime` to see if the symlink was made recently, and only stage it if so.
This sped up startup in my big repo from longer than I cared to wait (10+
minutes, or half an hour while profiling) to a minute or so. Of course,
inotify events are already serviced during startup, so making it scan
quickly is really only important so people don't think it's a resource hog.
First impressions are important. :)

But what does "made recently" mean exactly? Well, my answer is possibly
over-engineered, but most of it is really groundwork for things I'll need
later anyway. I added a new data structure for tracking the status of the
daemon, which is periodically written to disk by another thread (thread #6!)
to `.git/annex/daemon.status`. Currently it looks like this; I anticipate
adding lots more info as I move into the [[syncing]] stage:

    lastRunning:1339610482.47928s
    scanComplete:True

So, only symlinks made after the daemon was last running need to be
expensively staged on startup. Although, as RichiH pointed out,
this fails if the clock is changed. But I have been planning to have a
cleanup thread anyway, that will handle this, and other
potential problems, so I think that's ok.
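
To make the "made recently" test concrete, here is a minimal Python sketch. It is not the actual git-annex code (which is Haskell). The function names are hypothetical, and the parser assumes only the two `key:value` lines shown above, with `lastRunning` being epoch seconds followed by an `s`.

```python
import os


def parse_daemon_status(text):
    """Parse 'key:value' lines like the daemon.status sample into a dict."""
    status = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        key, _, value = line.partition(":")
        status[key.strip()] = value.strip()
    return status


def last_running_seconds(status):
    """Return lastRunning as float epoch seconds, or None if it is absent.

    Assumes the value looks like '1339610482.47928s' (trailing 's').
    """
    raw = status.get("lastRunning")
    if raw is None:
        return None
    return float(raw.rstrip("s"))


def needs_staging(path, last_running):
    """True if the symlink changed after the daemon was last running.

    With no recorded last run, every symlink is staged, which matches
    the old, slow behaviour.
    """
    if last_running is None:
        return True
    # lstat, not stat: we want the symlink's own ctime, not its target's.
    return os.lstat(path).st_ctime > last_running
```

Note that a comparison like this inherits the clock-change weakness mentioned above: if the system clock jumps backwards, a new symlink can look older than `lastRunning` and be skipped.
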

Judging by an strace of the startup scan, it's fairly tight now. There are some repeated
`getcwd` syscalls that could be optimised out for a minor speedup.

----

Added the sanity check thread. Thread #7! It currently only does one sanity
check per day, but the sanity check is a fairly lightweight job,
so I may make it run more frequently. OTOH, it may never ever find a
problem, so once per day seems a good compromise.
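
A thread that wakes up at a fixed interval can be sketched in a few lines of Python. This is only an illustration of the pattern, not how git-annex does it. `run_periodically` and its parameters are hypothetical names, and a daily check would pass `24 * 60 * 60` as the interval.

```python
import threading


def run_periodically(interval_seconds, job, stop_event):
    """Call job() every interval_seconds until stop_event is set.

    Event.wait returns False on timeout (time to run the job) and
    True once the event is set (time to exit the loop).
    """
    while not stop_event.wait(interval_seconds):
        job()


def start_sanity_checker(job, interval_seconds=24 * 60 * 60):
    """Start the periodic job in a daemon thread; return (thread, stop_event)."""
    stop_event = threading.Event()
    thread = threading.Thread(
        target=run_periodically,
        args=(interval_seconds, job, stop_event),
        daemon=True,
    )
    thread.start()
    return thread, stop_event
```

Making the check run more often is then just a matter of passing a smaller interval.
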

Currently it's only checking that all files in the tree are properly staged
in git. I might make it run `git annex fsck` later, but fscking the whole tree
once per day is a bit much. Perhaps it should only fsck a few files per
day? TBD
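
One possible way to spread an fsck over many days, purely as a sketch of the idea and not a decided design, is to walk a sorted file list in fixed-size batches, one batch per day. The `daily_batch` function and its parameters are hypothetical.

```python
def daily_batch(files, day_index, batch_size):
    """Pick the slice of files to check on a given day.

    Files are sorted so the order is stable between runs, and the
    starting point advances by batch_size each day, wrapping around
    so that every file is eventually revisited.
    """
    ordered = sorted(files)
    if not ordered or batch_size <= 0:
        return []
    size = min(batch_size, len(ordered))
    start = (day_index * batch_size) % len(ordered)
    # Doubling the list lets a batch run past the end and wrap around.
    return (ordered + ordered)[start:start + size]
```

With five files and a batch size of two, day 0 checks the first two files, day 1 the next two, and day 2 the last file plus the first one again.
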

Currently any problems found in the sanity check are just fixed and logged.
It would be good to do something about getting problems that might indicate
bugs fed back to me, in a privacy-respecting way. TBD

----

I also refactored the code, which was getting far too large to all be in
one module.

I have been thinking about renaming `git annex watch` to `git annex assistant`,
but I think I'll leave the command name as-is. Some users might
want a simple watcher and stager, without the assistant's other features
like syncing and the webapp. So the next stage of the
[[roadmap|design/assistant]] will be a different command that also runs
`watch`.

At this point, I feel I'm done with the first phase of [[inotify]].
It has a couple of known bugs, but it's ready for brave beta testers to try.
I trust it enough to be running it on my live data.