Merge branch 'master' into watch

Joey Hess 2012-06-17 15:45:35 -04:00
commit bf3339e5b7
6 changed files with 177 additions and 37 deletions

@ -0,0 +1,54 @@
A rather frustrating and long day coding went like this:
## 1-3 pm
Wrote a single function. All any Haskell programmer needs to know about it
is its type signature:
Lsof.queryDir :: FilePath -> IO [(FilePath, LsofOpenMode, ProcessInfo)]
When I'm spending another hour or two taking a unix utility like lsof and
parsing its output, which in this case is in a rather complicated
machine-parsable output format, I often wish unix streams were strongly
typed, which would avoid this bother.
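To give a feel for what that parsing involves, here is a minimal sketch, not the actual `Lsof` module. It assumes lsof was run with something like `-F pcan`, so that each output line is a one-character field tag followed by a value, and that the `a` (access mode) line of a file comes before its `n` (name) line. The `LsofOpenMode` constructors, the `ProcessInfo` fields, and `parseLsof` itself are hypothetical stand-ins for whatever the real code uses.

```haskell
-- Hypothetical stand-ins for the types named in the signature above.
data LsofOpenMode = OpenReadWrite | OpenReadOnly | OpenWriteOnly | OpenUnknown
  deriving (Show, Eq)

data ProcessInfo = ProcessInfo { procPid :: Int, procCmd :: String }
  deriving (Show, Eq)

-- Parse line-oriented lsof field output (assumed field set: p, c, f, a, n).
--   p = pid (starts a new process), c = command name,
--   f = file descriptor (starts a new file), a = access mode, n = file name.
-- A result is emitted on each n line, using the most recent process and mode.
parseLsof :: String -> [(FilePath, LsofOpenMode, ProcessInfo)]
parseLsof = go (ProcessInfo 0 "") OpenUnknown . lines
  where
    go _ _ [] = []
    go p m (l:ls) = case l of
      ('p':v) -> go (ProcessInfo (readPid v) "") OpenUnknown ls
      ('c':v) -> go (p { procCmd = v }) m ls
      ('f':_) -> go p OpenUnknown ls        -- new file: forget the old mode
      ('a':v) -> go p (parseMode v) ls
      ('n':v) -> (v, m, p) : go p m ls
      _       -> go p m ls                  -- ignore fields we did not ask for

    -- An unparsable pid falls back to 0 rather than failing.
    readPid v = case reads v of
      [(n, "")] -> n
      _         -> 0

    parseMode "r" = OpenReadOnly
    parseMode "w" = OpenWriteOnly
    parseMode "u" = OpenReadWrite
    parseMode _   = OpenUnknown
```

Even this toy version has to carry state from line to line, which is the sort of bother a typed stream would remove.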
## 3-9 pm
Six hours spent making it defer annexing files until the commit thread
wakes up and is about to make a commit. Why did it take so horribly long?
Well, there were a number of complications, and some really bad bugs
involving races that were hard to reproduce reliably enough to deal with.
In other words, I was lost in the weeds for a lot of those hours...
At one point, something glorious happened, and it was always making exactly
one commit for batch mode modifications of a lot of files (like untarring
them). Unfortunately, I had to lose that gloriousness due to another
potential race, which, while unlikely, would have made the program deadlock
if it happened.
So, it's back to making 2 or 3 commits per batch mode change. I also have a
buglet that sometimes causes a second, empty commit after a file is added.
I know why (the inotify event for the symlink gets in late,
after the commit); will try to improve commit frequency later.
## 9-11 pm
Put the capstone on the day's work, by calling lsof on a directory full
of hardlinks to the files that are about to be annexed, to check if any
are still open for write.
This works great! Starting up `git annex watch` when processes have files
open is no longer a problem, and even if you're evil enough to try having
multiple processes open the same file, it will complain and not annex it
until all the writers close it.
(Well, someone really evil could turn the write bit back on after git annex
clears it, and open the file again, but then really evil people can do
that to files in `.git/annex/objects` too, and they'll get their just
deserts when `git annex fsck` runs. So, that's ok..)
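The check described above boils down to splitting the quarantined files into those that can be annexed now and those that still have a writer. A rough sketch of that split, under assumptions: `OpenMode`, `mayWrite`, and `splitQuarantine` are hypothetical names, the second argument stands for already-parsed lsof results, and treating an unknown mode as writable is my choice here, not necessarily what git-annex does.

```haskell
import Data.List (partition)

-- Hypothetical, simplified open mode as reported by lsof.
data OpenMode = ReadOnly | WriteOnly | ReadWrite | Unknown
  deriving (Eq, Show)

-- True when a mode means some process may still write to the file.
-- Unknown counts as writable, to err on the side of not annexing.
mayWrite :: OpenMode -> Bool
mayWrite ReadOnly = False
mayWrite _        = True

-- Split the quarantined files into (safe to annex, still open for write),
-- given the (file, mode) pairs lsof reported for the quarantine directory.
-- A file is held back if any process has it open in a writable mode.
splitQuarantine :: [FilePath] -> [(FilePath, OpenMode)] -> ([FilePath], [FilePath])
splitQuarantine quarantined opens = partition safe quarantined
  where
    busy   = [f | (f, m) <- opens, mayWrite m]
    safe f = f `notElem` busy
```

Files in the second list would be left alone until their writers close them, and a file with several openers is held back as long as any one of them can write.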
----
Anyway, will beat on it more tomorrow, and if all is well, this will finally
go out to the beta testers.

@ -0,0 +1,9 @@
[[!comment format=mdwn
username="http://dieter-be.myopenid.com/"
nickname="dieter"
subject="comment 1"
date="2012-06-16T09:14:26Z"
content="""
maybe at some point, your tool could show \"warning, the following files are still open and are hence not being annexed\"
to avoid any nasty surprises of a file not being annexed and the user not realizing it.
"""]]

@ -7,43 +7,6 @@ There is a `watch` branch in git that adds the command.
## known bugs
* A process has a file open for write, another one closes it,
and so it's added. Then the first process modifies it.
Or, if a process has a file open for write when `git annex watch` starts
up, the file will be added to the annex. If the process later continues
writing, it will change content in the annex, and fsck will later catch
the inconsistency.
Possible fixes:
* Somehow track or detect if a file is open for write by any processes.
`lsof` could be used, although it would be a little slow.
Here's one way to avoid the slowdown: When a file is being added,
set it read-only, and hard-link it into a quarantine directory,
remembering both filenames.
Then use the batch change mode code to detect batch adds and bundle
them together.
Just before committing, lsof the quarantine directory. Any files in
it that are still open for write can just have their write bit turned
back on and be deleted from quarantine, to be handled when their writer
closes. Files that pass quarantine get added as usual. This avoids
repeated lsof calls slowing down adds, but does add a constant factor
overhead (0.25 seconds lsof call) before any add gets committed.
* Or, when possible, making a copy-on-write copy before adding the file
would avoid this.
* Or, as a last resort, make an expensive copy of the file and add that.
* Tracking file opens and closes with inotify could tell if any other
processes have the file open. But there are problems.. It doesn't
seem to differentiate between files opened for read and for write.
And there would still be a race after the last close and before it's
injected into the annex, where it could be opened for write again.
Would need to detect that and undo the annex injection or something.
* If a file is checked into git as a normal file and gets modified
(or merged, etc), it will be converted into an annexed file.
See [[blog/day_7__bugfixes]]
@ -54,6 +17,51 @@ There is a `watch` branch in git that adds the command.
I'd also like to support OSX and if possible the BSDs.
* kqueue ([haskell bindings](http://hackage.haskell.org/package/kqueue))
is supported by FreeBSD, OSX, and other BSDs.
In kqueue, to watch for changes to a file, you have to have an open file
descriptor to the file. This wouldn't scale.
Apparently, a directory can be watched, and events are generated when
files are added/removed from it. You then have to scan to find which
files changed. [example](https://developer.apple.com/library/mac/#samplecode/FileNotification/Listings/Main_c.html#//apple_ref/doc/uid/DTS10003143-Main_c-DontLinkElementID_3)
Gamin does the best it can with just kqueue, supplemented by polling.
The source file `server/gam_kqueue.c` makes for interesting reading.
Using gamin to do the heavy lifting is one option.
([haskell bindings](http://hackage.haskell.org/package/hlibfam) for FAM;
gamin shares the API)
kqueue does not seem to provide a way to tell when a file gets closed,
only when it's initially created. Poses problems..
* [man page](http://www.freebsd.org/cgi/man.cgi?query=kqueue&apropos=0&sektion=0&format=html)
* <https://github.com/gorakhargosh/watchdog/blob/master/src/watchdog/observers/kqueue.py> (good example program)
* hfsevents ([haskell bindings](http://hackage.haskell.org/package/hfsevents))
is OSX specific.
Originally it was only directory level, and you were only told a
directory had changed and not which file. Based on the haskell
binding's code, from OSX 10.7.0, file level events were added.
This will be harder for me to develop for, since I don't have access to
OSX machines..
hfsevents does not seem to provide a way to tell when a file gets closed,
only when it's initially created. Poses problems..
* <https://developer.apple.com/library/mac/#documentation/Darwin/Conceptual/FSEvents_ProgGuide/Introduction/Introduction.html>
* <http://pypi.python.org/pypi/MacFSEvents/0.2.8> (good example program)
* <https://github.com/gorakhargosh/watchdog/blob/master/src/watchdog/observers/fsevents.py> (good example program)
* Windows has a Win32 ReadDirectoryChangesW, and perhaps other things.
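For the kqueue case above, where a directory event only says that something changed, the rescan amounts to comparing the directory listing from before the event with the one taken after it. A minimal sketch of that comparison, assuming each listing contains no duplicate names; `Change` and `diffListing` are hypothetical names, and this is not taken from gamin or from the kqueue bindings.

```haskell
import Data.List ((\\))

-- A change found by comparing two listings of the same directory.
data Change = Added FilePath | Removed FilePath
  deriving (Eq, Show)

-- Compare the listing from before an event with the listing after it.
-- Names only in the old listing were removed; names only in the new
-- listing were added. Files whose contents changed are not detected here.
diffListing :: [FilePath] -> [FilePath] -> [Change]
diffListing old new = map Removed (old \\ new) ++ map Added (new \\ old)
```

The two listings could come from something like `listDirectory` in the `directory` package, called once at startup and again after each event. Detecting modified files as well would need stat information kept alongside the names.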
## beyond Linux
I'd also like to support OSX and if possible the BSDs.
* kqueue ([haskell bindings](http://hackage.haskell.org/package/kqueue))
is supported by FreeBSD, OSX, and other BSDs.
@ -171,3 +179,40 @@ Many races need to be dealt with by this code. Here are some of them.
- coalesce related add/rm events for speed and less disk IO **done**
- don't annex `.gitignore` and `.gitattributes` files **done**
- run as a daemon **done**
- A process has a file open for write, another one closes it,
and so it's added. Then the first process modifies it.
Or, if a process has a file open for write when `git annex watch` starts
up, the file will be added to the annex. If the process later continues
writing, it will change content in the annex, and fsck will later catch
the inconsistency.
Possible fixes:
* Somehow track or detect if a file is open for write by any processes.
`lsof` could be used, although it would be a little slow.
Here's one way to avoid the slowdown: When a file is being added,
set it read-only, and hard-link it into a quarantine directory,
remembering both filenames.
Then use the batch change mode code to detect batch adds and bundle
them together.
Just before committing, lsof the quarantine directory. Any files in
it that are still open for write can just have their write bit turned
back on and be deleted from quarantine, to be handled when their writer
closes. Files that pass quarantine get added as usual. This avoids
repeated lsof calls slowing down adds, but does add a constant factor
overhead (0.25 seconds lsof call) before any add gets committed. **done**
* Or, when possible, making a copy-on-write copy before adding the file
would avoid this.
* Or, as a last resort, make an expensive copy of the file and add that.
* Tracking file opens and closes with inotify could tell if any other
processes have the file open. But there are problems.. It doesn't
seem to differentiate between files opened for read and for write.
And there would still be a race after the last close and before it's
injected into the annex, where it could be opened for write again.
Would need to detect that and undo the annex injection or something.

@ -0,0 +1,16 @@
[[!comment format=mdwn
username="http://joeyh.name/"
ip="4.154.6.135"
subject="comment 1"
date="2012-06-15T19:25:59Z"
content="""
Sure, you can simply:
cp annexedfile ~
Or just attach the file right from the git repository to an email, like any other file. Should work fine.
If you wanted to copy a whole directory to export, you'd need to use the -L flag to make cp follow the symlinks and copy the real contents:
cp -r -L annexeddirectory /media/usbdrive/
"""]]

@ -0,0 +1,8 @@
[[!comment format=mdwn
username="http://denis.laxalde.org/"
nickname="dlax"
subject="nautilus"
date="2012-06-15T19:57:31Z"
content="""
Ah! I was fooled by nautilus which is not able to properly handle symlinks when copying. It copies links instead of target [[!gnomebug 623580]].
"""]]

@ -0,0 +1,8 @@
[[!comment format=mdwn
username="http://joeyh.name/"
ip="4.154.6.135"
subject="comment 3"
date="2012-06-16T03:26:37Z"
content="""
That nautilus behavior is a bad thing when trying to export files out, but it's a good thing when just moving files around inside your repository...
"""]]