Kickstarter is over. Yay!

Today I worked on the bug where `git annex watch` turned regular files
that were already checked into git into symlinks. So I made it check
if a file is already in git before trying to add it to the annex.

The tricky part was doing this check quickly. Unless I want to write my
own git index parser (or use one from Hackage), this check requires running
`git ls-files`, once per file to be added. That won't fly if a huge
tree of files is being moved or unpacked into the watched directory.
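
The per-file check itself is simple to express; the cost is entirely in
spawning git each time. A minimal sketch of such a check (my own helper,
not git-annex's actual code):

    import System.Process (readProcess)

    -- Ask git to list the file from its index; nonempty output means
    -- the file is already checked into git. Note this spawns one git
    -- process per file queried, which is the expensive part.
    inGit :: FilePath -> IO Bool
    inGit file = do
        out <- readProcess "git" ["ls-files", "--cached", "--", file] ""
        return (not (null out))
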
Instead, I made it only do the check during `git annex watch`'s initial
scan of the tree. This should be OK, because once it's running, you
won't be adding new files to git anyway, since it'll automatically annex
new files. This is good enough for now, but there are at least two problems
with it:

* Someone might `git merge` in a branch that has some regular files,
  and it would add the merged-in files to the annex.
* Once `git annex watch` is running, if you modify a file that was
  checked into git as a regular file, the new version will be added
  to the annex.

I'll probably come back to this issue, and may well find myself directly
querying git's index.
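
Short of parsing the index directly, another option would be to ask git
for the whole index in one call and keep the answer in memory for the
scan to consult. A sketch of that idea (again my own code, not what
git-annex does):

    import qualified Data.Set as S
    import System.Process (readProcess)

    -- List the entire index with a single git call and build a set
    -- of tracked paths, replacing a process spawn per file with one
    -- spawn plus cheap in-memory lookups.
    trackedFiles :: IO (S.Set FilePath)
    trackedFiles = do
        out <- readProcess "git" ["ls-files", "--cached", "-z"] ""
        return $ S.fromList $ filter (not . null) $ splitNul out
      where
        -- Split NUL-delimited output from git ls-files -z.
        splitNul s = case break (== '\0') s of
            (w, "")     -> [w]
            (w, _:rest) -> w : splitNul rest
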
---

I've started work on fixing the memory leak I see when running `git annex
watch` in a large repository (40 thousand files). As always with a Haskell
memory leak, I crack open [Real World Haskell's chapter on profiling](http://book.realworldhaskell.org/read/profiling-and-optimization.html).
Eventually this yields a nice graph of the problem:

[[!img profile.png alt="memory profile"]]

So, it looks like a few minor memory leaks, and one huge leak. I stared
at this for a while, tried a few things, and got a much better result:

[[!img profile2.png alt="memory profile"]]
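
A common culprit in Haskell leaks like this is a lazy accumulator piling
up thunks; as an illustration of the general pattern and its fix (not
necessarily the leak here):

    import Data.List (foldl')

    -- Lazy foldl builds a chain of unevaluated (+) thunks, using
    -- memory proportional to the length of the list.
    leaky :: [Int] -> Int
    leaky = foldl (+) 0

    -- foldl' forces the accumulator at each step, so the same sum
    -- runs in constant space.
    notLeaky :: [Int] -> Int
    notLeaky = foldl' (+) 0
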
I may come back later and try to improve this further, but it's not bad
memory usage now. Still, it's rather slow to start up in such a large
repository, and its initial scan is still doing too much work. I need to
optimize more...