Merge branch 'master' into watch
commit
24da48816d
4 changed files with 111 additions and 50 deletions
45
doc/design/assistant/blog/day_7__bugfixes.mdwn
Normal file
@ -0,0 +1,45 @@
Kickstarter is over. Yay!

Today I worked on the bug where `git annex watch` turned regular files
that were already checked into git into symlinks. So I made it check
if a file is already in git before trying to add it to the annex.

The tricky part was doing this check quickly. Unless I want to write my
own git index parser (or use one from Hackage), this check requires running
`git ls-files`, once per file to be added. That won't fly if a huge
tree of files is being moved or unpacked into the watched directory.
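
To illustrate the cost trade-off described above, here is a minimal sketch of a batched check: list the index once with `git ls-files -z` and test candidate paths against the resulting set, rather than spawning one process per file. This is not what the commit does (it limits the check to the initial scan), and the function names `parse_ls_files`, `tracked_files` and `needs_annexing` are hypothetical, chosen only for this example.

```python
import subprocess
from typing import List, Set


def parse_ls_files(output: bytes) -> Set[str]:
    """Split NUL-separated `git ls-files -z` output into a set of paths."""
    return {
        entry.decode("utf-8", "surrogateescape")
        for entry in output.split(b"\0")
        if entry  # skip the empty piece after the trailing NUL
    }


def tracked_files(repo_dir: str) -> Set[str]:
    """Run `git ls-files -z` once and return every path in the index."""
    result = subprocess.run(
        ["git", "ls-files", "-z"],
        cwd=repo_dir,
        check=True,
        capture_output=True,
    )
    return parse_ls_files(result.stdout)


def needs_annexing(candidates: List[str], tracked: Set[str]) -> List[str]:
    """Keep only the candidates that are not already checked into git."""
    return [path for path in candidates if path not in tracked]
```

The set is only a snapshot: anything staged after `tracked_files` runs is missed, so a long-running watcher would have to refresh it or fall back to a per-file check.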

Instead, I made it only do the check during `git annex watch`'s initial
scan of the tree. This should be ok, because once it's running, you
won't be adding new files to git anyway, since it'll automatically annex
new files. This is good enough for now, but there are at least two problems
with it:

* Someone might `git merge` in a branch that has some regular files,
  and it would add the merged-in files to the annex.
* Once `git annex watch` is running, if you modify a file that was
  checked into git as a regular file, the new version will be added
  to the annex.

I'll probably come back to this issue, and may well find myself directly
querying git's index.

---

I've started work to fix the memory leak I see when running `git annex
watch` in a large repository (40 thousand files). As always with a Haskell
memory leak, I crack open [Real World Haskell's chapter on profiling](http://book.realworldhaskell.org/read/profiling-and-optimization.html).

Eventually this yields a nice graph of the problem:

[[!img profile.png alt="memory profile"]]

So, looks like a few minor memory leaks, and one huge leak. Stared
at this for a while and tried a few things, and got a much better result:

[[!img profile2.png alt="memory profile"]]

I may come back later and try to improve this further, but it's not bad memory
usage. But, it's still rather slow to start up in such a large repository,
and its initial scan is still doing too much work. I need to optimize
more..
BIN
doc/design/assistant/blog/day_7__bugfixes/profile.png
Normal file
Binary file not shown.
After Width: | Height: | Size: 46 KiB |
BIN
doc/design/assistant/blog/day_7__bugfixes/profile2.png
Normal file
Binary file not shown.
After Width: | Height: | Size: 226 KiB |
@ -1,44 +1,53 @@
Finish "git annex watch" command, which runs, in the background, watching via
inotify for changes, and automatically annexing new files, etc.

There is a `watch` branch in git that adds such a command. To make this
really useful, it needs to:
There is a `watch` branch in git that adds the command.

- on startup, add any files that have appeared since last run **done**
- on startup, fix the symlinks for any renamed links **done**
- on startup, stage any files that have been deleted since last run
  (seems to require a `git commit -a` on startup, or at least a
  `git add --update`, which will notice deleted files) **done**
- notice new files, and git annex add **done**
- notice renamed files, auto-fix the symlink, and stage the new file location
  **done**
- handle cases where directories are moved outside the repo, and stop
  watching them **done**
- when a whole directory is deleted or moved, stage removal of its
  contents from the index **done**
- notice deleted files and stage the deletion
  (tricky; there's a race with add since it replaces the file with a symlink..)
  **done**
- Gracefully handle when the default limit of 8192 inotified directories
  is exceeded. This can be tuned by root, so help the user fix it.
  **done**
- periodically auto-commit staged changes (avoid autocommitting when
  lots of changes are coming in) **done**
- coalesce related add/rm events for speed and less disk IO **done**
- don't annex `.gitignore` and `.gitattributes` files **done**
- run as a daemon **done**
- tunable delays before adding new files, etc
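
The item above about the limit of 8192 inotified directories can be sketched roughly as follows. This is only an illustration under stated assumptions: it reads the standard Linux limit file `/proc/sys/fs/inotify/max_user_watches` and builds a hint naming the `fs.inotify.max_user_watches` sysctl. The function names and the message wording are hypothetical and not taken from git-annex.

```python
from pathlib import Path
from typing import Optional

# Standard Linux location of the per-user inotify watch limit.
LIMIT_FILE = Path("/proc/sys/fs/inotify/max_user_watches")


def read_watch_limit(limit_file: Path = LIMIT_FILE) -> Optional[int]:
    """Return the current limit, or None if it cannot be read (e.g. not Linux)."""
    try:
        return int(limit_file.read_text().strip())
    except (OSError, ValueError):
        return None


def limit_advice(needed: int, limit: int) -> str:
    """Build a hint for the user; an empty string means the limit suffices."""
    if needed <= limit:
        return ""
    # Hypothetical wording; the sysctl name itself is the real Linux knob.
    return (
        f"Too many directories to watch ({needed} needed, limit is {limit}). "
        f"As root, raise it with: sysctl -w fs.inotify.max_user_watches={needed}"
    )
```

Keeping the message builder separate from the file read makes the advice easy to check without touching `/proc`.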
## known bugs

* A process has a file open for write, another one closes it,
  and so it's added. Then the first process modifies it.

  Or, if a process has a file open for write when `git annex watch` starts
  up, the file will be added to the annex. If the process later continues
  writing, it will change content in the annex.

  This changes content in the annex, and fsck will later catch
  the inconsistency.

  Possible fixes:

  * Somehow track or detect if a file is open for write by any processes.
  * Or, when possible, making a copy-on-write copy before adding the file
    would avoid this.
  * Or, as a last resort, make an expensive copy of the file and add that.
  * Tracking file opens and closes with inotify could tell if any other
    processes have the file open. But there are problems.. It doesn't
    seem to differentiate between files opened for read and for write.
    And there would still be a race after the last close and before it's
    injected into the annex, where it could be opened for write again.
    Would need to detect that and undo the annex injection or something.

* If a file is checked into git as a normal file and gets modified
  (or merged, etc), it will be converted into an annexed file.
  See [[blog/day_7__bugfixes]]

## todo

- Support OSes other than Linux; it only uses inotify currently.
  OSX and FreeBSD use the same mechanism, and there is a Haskell interface
  for it,
- Run niced and ioniced? Seems to make sense, this is a background job.
- configurable option to only annex files meeting certain size or
  filename criteria
- option to check files not meeting annex criteria into git directly
- option to check files not meeting annex criteria into git directly,
  automatically
- honor .gitignore, not adding files it excludes (difficult, probably
  needs my own .gitignore parser to avoid excessive running of git commands
  to check for ignored files)
- Possibly, when a directory is moved out of the annex location,
  unannex its contents.
- Support OSes other than Linux; it only uses inotify currently.
  OSX and FreeBSD use the same mechanism, and there is a Haskell interface
  for it,
  unannex its contents. (Does inotify tell us where the directory moved
  to so we can access it?)

## the races

@ -61,25 +70,6 @@ Many races need to be dealt with by this code. Here are some of them.
  Fixed this problem; now it hard links the file to a temp directory and
  operates on the hard link, which is also made unwritable.

* A process has a file open for write, another one closes it, and so it's
  added. Then the first process modifies it.

  **Currently unfixed**; this changes content in the annex, and fsck will
  later catch the inconsistency.

  Possible fixes:

  * Somehow track or detect if a file is open for write by any processes.
  * Or, when possible, making a copy-on-write copy before adding the file
    would avoid this.
  * Or, as a last resort, make an expensive copy of the file and add that.
  * Tracking file opens and closes with inotify could tell if any other
    processes have the file open. But there are problems.. It doesn't
    seem to differentiate between files opened for read and for write.
    And there would still be a race after the last close and before it's
    injected into the annex, where it could be opened for write again.
    Would need to detect that and undo the annex injection or something.

* File is added and then replaced with another file before the annex add
  makes its symlink.

@ -108,3 +98,29 @@ Many races need to be dealt with by this code. Here are some of them.
  Not a problem; the removal event removes the old file from the index, and
  the add event adds the new one.
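
One of the races in this section is described as fixed by hard linking the file into a temp directory and making the link unwritable. A minimal sketch of that idea, assuming the temp directory is on the same filesystem as the file (hard links cannot cross filesystems); the helper name `lock_down` is hypothetical and this is not git-annex's actual code:

```python
import os
import stat


def lock_down(path: str, tmp_dir: str) -> str:
    """Hard link `path` into `tmp_dir` and clear all write bits.

    The link shares the file's inode, so removing write permission through
    the link also applies to the original name.
    """
    link = os.path.join(tmp_dir, os.path.basename(path))
    os.link(path, link)  # raises OSError if tmp_dir is on another filesystem
    mode = stat.S_IMODE(os.stat(link).st_mode)
    os.chmod(link, mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))
    return link
```

Clearing the write bits stops later opens for write, but it does nothing about a process that already holds a writable file descriptor, which is the case listed under known bugs.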

## done

- on startup, add any files that have appeared since last run **done**
- on startup, fix the symlinks for any renamed links **done**
- on startup, stage any files that have been deleted since last run
  (seems to require a `git commit -a` on startup, or at least a
  `git add --update`, which will notice deleted files) **done**
- notice new files, and git annex add **done**
- notice renamed files, auto-fix the symlink, and stage the new file location
  **done**
- handle cases where directories are moved outside the repo, and stop
  watching them **done**
- when a whole directory is deleted or moved, stage removal of its
  contents from the index **done**
- notice deleted files and stage the deletion
  (tricky; there's a race with add since it replaces the file with a symlink..)
  **done**
- Gracefully handle when the default limit of 8192 inotified directories
  is exceeded. This can be tuned by root, so help the user fix it.
  **done**
- periodically auto-commit staged changes (avoid autocommitting when
  lots of changes are coming in) **done**
- coalesce related add/rm events for speed and less disk IO **done**
- don't annex `.gitignore` and `.gitattributes` files **done**
- run as a daemon **done**
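
The list above mentions coalescing related add/rm events. The notes do not say how git-annex does this, so the following is only one possible reading: within a batch, keep a single event per path, namely the most recent one. The `Event` shape and the `coalesce` name are assumptions made for this sketch.

```python
from typing import Dict, Iterable, List, Tuple

Event = Tuple[str, str]  # (kind, path), where kind is "add" or "rm"


def coalesce(events: Iterable[Event]) -> List[Event]:
    """Collapse a batch so each path appears once, with its latest event."""
    latest: Dict[str, str] = {}
    for kind, path in events:
        # Re-insert so the path moves to the position of its newest event.
        latest.pop(path, None)
        latest[path] = kind
    return [(kind, path) for path, kind in latest.items()]
```

With this rule a file that is added, removed and added again within one batch costs a single add, which is where the saving in disk IO would come from.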