| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 	Mandatory File Locking For The Linux Operating System | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 		Andy Walker <andy@lysaker.kvaerner.no> | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 			   15 April 1996 | 
					
						
							| 
									
										
										
										
											2007-09-25 11:57:19 -04:00
										 |  |  | 		     (Updated September 2007) | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2007-09-25 11:57:19 -04:00
										 |  |  | 0. Why you should avoid mandatory locking | 
					
						
							|  |  |  | ----------------------------------------- | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | The Linux implementation is prey to a number of difficult-to-fix race | 
					
						
							|  |  |  | conditions which in practice make it not dependable: | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	- The write system call checks for a mandatory lock only once | 
					
						
							|  |  |  | 	  at its start.  It is therefore possible for a lock request to | 
					
						
							|  |  |  | 	  be granted after this check but before the data is modified. | 
					
						
							|  |  |  | 	  A process may then see file data change even while a mandatory | 
					
						
							|  |  |  | 	  lock was held. | 
					
						
							|  |  |  | 	- Similarly, an exclusive lock may be granted on a file after | 
					
						
							|  |  |  | 	  the kernel has decided to proceed with a read, but before the | 
					
						
							|  |  |  | 	  read has actually completed, and the reading process may see | 
					
						
							|  |  |  | 	  the file data in a state which should not have been visible | 
					
						
							|  |  |  | 	  to it. | 
					
						
							|  |  |  | 	- Similar races make the claimed mutual exclusion between lock | 
					
						
							|  |  |  | 	  and mmap similarly unreliable. | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 
 | 
					
						
							|  |  |  | 1. What is  mandatory locking? | 
					
						
							|  |  |  | ------------------------------ | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | Mandatory locking is kernel enforced file locking, as opposed to the more usual | 
					
						
							|  |  |  | cooperative file locking used to guarantee sequential access to files among | 
					
						
							|  |  |  | processes. File locks are applied using the flock() and fcntl() system calls | 
					
						
							|  |  |  | (and the lockf() library routine which is a wrapper around fcntl().) It is | 
					
						
							|  |  |  | normally a process' responsibility to check for locks on a file it wishes to | 
					
						
							|  |  |  | update, before applying its own lock, updating the file and unlocking it again. | 
					
						
							|  |  |  | The most commonly used example of this (and in the case of sendmail, the most | 
					
						
							|  |  |  | troublesome) is access to a user's mailbox. The mail user agent and the mail | 
					
						
							|  |  |  | transfer agent must guard against updating the mailbox at the same time, and | 
					
						
							|  |  |  | prevent reading the mailbox while it is being updated. | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | In a perfect world all processes would use and honour a cooperative, or | 
					
						
							|  |  |  | "advisory" locking scheme. However, the world isn't perfect, and there's | 
					
						
							|  |  |  | a lot of poorly written code out there. | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | In trying to address this problem, the designers of System V UNIX came up | 
					
						
							|  |  |  | with a "mandatory" locking scheme, whereby the operating system kernel would | 
					
						
							|  |  |  | block attempts by a process to write to a file that another process holds a | 
					
						
							|  |  |  | "read" -or- "shared" lock on, and block attempts to both read and write to a  | 
					
						
							|  |  |  | file that a process holds a "write " -or- "exclusive" lock on. | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | The System V mandatory locking scheme was intended to have as little impact as | 
					
						
							|  |  |  | possible on existing user code. The scheme is based on marking individual files | 
					
						
							|  |  |  | as candidates for mandatory locking, and using the existing fcntl()/lockf() | 
					
						
							|  |  |  | interface for applying locks just as if they were normal, advisory locks. | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | Note 1: In saying "file" in the paragraphs above I am actually not telling | 
					
						
							|  |  |  | the whole truth. System V locking is based on fcntl(). The granularity of | 
					
						
							|  |  |  | fcntl() is such that it allows the locking of byte ranges in files, in addition | 
					
						
							|  |  |  | to entire files, so the mandatory locking rules also have byte level | 
					
						
							|  |  |  | granularity. | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | Note 2: POSIX.1 does not specify any scheme for mandatory locking, despite | 
					
						
							|  |  |  | borrowing the fcntl() locking scheme from System V. The mandatory locking | 
					
						
							|  |  |  | scheme is defined by the System V Interface Definition (SVID) Version 3. | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 2. Marking a file for mandatory locking | 
					
						
							|  |  |  | --------------------------------------- | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | A file is marked as a candidate for mandatory locking by setting the group-id | 
					
						
							|  |  |  | bit in its file mode but removing the group-execute bit. This is an otherwise | 
					
						
							|  |  |  | meaningless combination, and was chosen by the System V implementors so as not | 
					
						
							|  |  |  | to break existing user programs. | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | Note that the group-id bit is usually automatically cleared by the kernel when | 
					
						
							|  |  |  | a setgid file is written to. This is a security measure. The kernel has been | 
					
						
							|  |  |  | modified to recognize the special case of a mandatory lock candidate and to | 
					
						
							|  |  |  | refrain from clearing this bit. Similarly the kernel has been modified not | 
					
						
							|  |  |  | to run mandatory lock candidates with setgid privileges. | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 3. Available implementations | 
					
						
							|  |  |  | ---------------------------- | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | I have considered the implementations of mandatory locking available with | 
					
						
							|  |  |  | SunOS 4.1.x, Solaris 2.x and HP-UX 9.x. | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | Generally I have tried to make the most sense out of the behaviour exhibited | 
					
						
							|  |  |  | by these three reference systems. There are many anomalies. | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | All the reference systems reject all calls to open() for a file on which | 
					
						
							|  |  |  | another process has outstanding mandatory locks. This is in direct | 
					
						
							|  |  |  | contravention of SVID 3, which states that only calls to open() with the | 
					
						
							|  |  |  | O_TRUNC flag set should be rejected. The Linux implementation follows the SVID | 
					
						
							|  |  |  | definition, which is the "Right Thing", since only calls with O_TRUNC can | 
					
						
							|  |  |  | modify the contents of the file. | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | HP-UX even disallows open() with O_TRUNC for a file with advisory locks, not | 
					
						
							|  |  |  | just mandatory locks. That would appear to contravene POSIX.1. | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | mmap() is another interesting case. All the operating systems mentioned | 
					
						
							|  |  |  | prevent mandatory locks from being applied to an mmap()'ed file, but  HP-UX | 
					
						
							|  |  |  | also disallows advisory locks for such a file. SVID actually specifies the | 
					
						
							|  |  |  | paranoid HP-UX behaviour. | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | In my opinion only MAP_SHARED mappings should be immune from locking, and then | 
					
						
							|  |  |  | only from mandatory locks - that is what is currently implemented. | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | SunOS is so hopeless that it doesn't even honour the O_NONBLOCK flag for | 
					
						
							|  |  |  | mandatory locks, so reads and writes to locked files always block when they | 
					
						
							|  |  |  | should return EAGAIN. | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | I'm afraid that this is such an esoteric area that the semantics described | 
					
						
							|  |  |  | below are just as valid as any others, so long as the main points seem to | 
					
						
							|  |  |  | agree.  | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 4. Semantics | 
					
						
							|  |  |  | ------------ | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 1. Mandatory locks can only be applied via the fcntl()/lockf() locking | 
					
						
							|  |  |  |    interface - in other words the System V/POSIX interface. BSD style | 
					
						
							|  |  |  |    locks using flock() never result in a mandatory lock. | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 2. If a process has locked a region of a file with a mandatory read lock, then | 
					
						
							|  |  |  |    other processes are permitted to read from that region. If any of these | 
					
						
							|  |  |  |    processes attempts to write to the region it will block until the lock is | 
					
						
							|  |  |  |    released, unless the process has opened the file with the O_NONBLOCK | 
					
						
							|  |  |  |    flag in which case the system call will return immediately with the error | 
					
						
							|  |  |  |    status EAGAIN. | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 3. If a process has locked a region of a file with a mandatory write lock, all | 
					
						
							|  |  |  |    attempts to read or write to that region block until the lock is released, | 
					
						
							|  |  |  |    unless a process has opened the file with the O_NONBLOCK flag in which case | 
					
						
							|  |  |  |    the system call will return immediately with the error status EAGAIN. | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 4. Calls to open() with O_TRUNC, or to creat(), on a existing file that has | 
					
						
							|  |  |  |    any mandatory locks owned by other processes will be rejected with the | 
					
						
							|  |  |  |    error status EAGAIN. | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 5. Attempts to apply a mandatory lock to a file that is memory mapped and | 
					
						
							|  |  |  |    shared (via mmap() with MAP_SHARED) will be rejected with the error status | 
					
						
							|  |  |  |    EAGAIN. | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 6. Attempts to create a shared memory map of a file (via mmap() with MAP_SHARED) | 
					
						
							|  |  |  |    that has any mandatory locks in effect will be rejected with the error status | 
					
						
							|  |  |  |    EAGAIN. | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 5. Which system calls are affected? | 
					
						
							|  |  |  | ----------------------------------- | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | Those which modify a file's contents, not just the inode. That gives read(), | 
					
						
							|  |  |  | write(), readv(), writev(), open(), creat(), mmap(), truncate() and | 
					
						
							|  |  |  | ftruncate(). truncate() and ftruncate() are considered to be "write" actions | 
					
						
							|  |  |  | for the purposes of mandatory locking. | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | The affected region is usually defined as stretching from the current position | 
					
						
							|  |  |  | for the total number of bytes read or written. For the truncate calls it is | 
					
						
							|  |  |  | defined as the bytes of a file removed or added (we must also consider bytes | 
					
						
							|  |  |  | added, as a lock can specify just "the whole file", rather than a specific | 
					
						
							|  |  |  | range of bytes.) | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | Note 3: I may have overlooked some system calls that need mandatory lock | 
					
						
							|  |  |  | checking in my eagerness to get this code out the door. Please let me know, or | 
					
						
							|  |  |  | better still fix the system calls yourself and submit a patch to me or Linus. | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 6. Warning! | 
					
						
							|  |  |  | ----------- | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | Not even root can override a mandatory lock, so runaway processes can wreak | 
					
						
							|  |  |  | havoc if they lock crucial files. The way around it is to change the file | 
					
						
							|  |  |  | permissions (remove the setgid bit) before trying to read or write to it. | 
					
						
							|  |  |  | Of course, that might be a bit tricky if the system is hung :-( | 
					
						
							|  |  |  | 
 |