| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | /*
 | 
					
						
							|  |  |  |  *  linux/fs/file_table.c | 
					
						
							|  |  |  |  * | 
					
						
							|  |  |  |  *  Copyright (C) 1991, 1992  Linus Torvalds | 
					
						
							|  |  |  |  *  Copyright (C) 1997 David S. Miller (davem@caip.rutgers.edu) | 
					
						
							|  |  |  |  */ | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | #include <linux/string.h>
 | 
					
						
							|  |  |  | #include <linux/slab.h>
 | 
					
						
							|  |  |  | #include <linux/file.h>
 | 
					
						
							| 
									
										
										
										
											2008-04-24 07:44:08 -04:00
										 |  |  | #include <linux/fdtable.h>
 | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | #include <linux/init.h>
 | 
					
						
							|  |  |  | #include <linux/module.h>
 | 
					
						
							|  |  |  | #include <linux/fs.h>
 | 
					
						
							|  |  |  | #include <linux/security.h>
 | 
					
						
							|  |  |  | #include <linux/eventpoll.h>
 | 
					
						
							| 
									
										
										
										
											2005-09-09 13:04:13 -07:00
										 |  |  | #include <linux/rcupdate.h>
 | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | #include <linux/mount.h>
 | 
					
						
							| 
									
										
										
										
											2006-01-11 12:17:46 -08:00
										 |  |  | #include <linux/capability.h>
 | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | #include <linux/cdev.h>
 | 
					
						
							| 
									
										
											  
											
												[PATCH] inotify
inotify is intended to correct the deficiencies of dnotify, particularly
its inability to scale and its terrible user interface:
        * dnotify requires the opening of one fd per each directory
          that you intend to watch. This quickly results in too many
          open files and pins removable media, preventing unmount.
        * dnotify is directory-based. You only learn about changes to
          directories. Sure, a change to a file in a directory affects
          the directory, but you are then forced to keep a cache of
          stat structures.
        * dnotify's interface to user-space is awful.  Signals?
inotify provides a more usable, simple, powerful solution to file change
notification:
        * inotify's interface is a system call that returns a fd, not SIGIO.
	  You get a single fd, which is select()-able.
        * inotify has an event that says "the filesystem that the item
          you were watching is on was unmounted."
        * inotify can watch directories or files.
Inotify is currently used by Beagle (a desktop search infrastructure),
Gamin (a FAM replacement), and other projects.
See Documentation/filesystems/inotify.txt.
Signed-off-by: Robert Love <rml@novell.com>
Cc: John McCutchan <ttb@tentacle.dhs.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
											
										 
											2005-07-12 17:06:03 -04:00
										 |  |  | #include <linux/fsnotify.h>
 | 
					
						
							| 
									
										
										
										
											2006-03-07 21:55:35 -08:00
										 |  |  | #include <linux/sysctl.h>
 | 
					
						
							| 
									
										
											  
											
												fs: scale files_lock
Improve scalability of files_lock by adding per-cpu, per-sb files lists,
protected with an lglock. The lglock provides fast access to the per-cpu lists
to add and remove files. It also provides a snapshot of all the per-cpu lists
(although this is very slow).
One difficulty with this approach is that a file can be removed from the list
by another CPU. We must track which per-cpu list the file is on with a new
variable in the file struct (packed into a hole on 64-bit archs). Scalability
could suffer if files are frequently removed from a different CPU's list.
However loads with frequent removal of files imply short interval between
adding and removing the files, and the scheduler attempts to avoid moving
processes too far away. Also, even in the case of cross-CPU removal, the
hardware has much more opportunity to parallelise cacheline transfers with N
cachelines than with 1.
A worst-case test of 1 CPU allocating files subsequently being freed by N CPUs
degenerates to contending on a single lock, which is no worse than before. When
more than one CPU is allocating files, even if they are always freed by
different CPUs, there will be more parallelism than the single-lock case.
Testing results:
On a 2 socket, 8 core opteron, I measure the number of times the lock is taken
to remove the file, the number of times it is removed by the same CPU that
added it, and the number of times it is removed by the same node that added it.
Booting:    locks=  25049 cpu-hits=  23174 (92.5%) node-hits=  23945 (95.6%)
kbuild -j16 locks=2281913 cpu-hits=2208126 (96.8%) node-hits=2252674 (98.7%)
dbench 64   locks=4306582 cpu-hits=4287247 (99.6%) node-hits=4299527 (99.8%)
So a file is removed from the same CPU it was added by over 90% of the time.
It remains within the same node 95% of the time.
Tim Chen ran some numbers for a 64 thread Nehalem system performing a compile.
                throughput
2.6.34-rc2      24.5
+patch          24.9
                us      sys     idle    IO wait (in %)
2.6.34-rc2      51.25   28.25   17.25   3.25
+patch          53.75   18.5    19      8.75
So significantly less CPU time spent in kernel code, higher idle time and
slightly higher throughput.
Single threaded performance difference was within the noise of microbenchmarks.
That is not to say a penalty does not exist: the code is larger and requires
more memory accesses, so it will be slightly slower.
Cc: linux-kernel@vger.kernel.org
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
											
										 
											2010-08-18 04:37:38 +10:00
										 |  |  | #include <linux/lglock.h>
 | 
					
						
							| 
									
										
										
										
											2006-03-07 21:55:35 -08:00
										 |  |  | #include <linux/percpu_counter.h>
 | 
					
						
							| 
									
										
											  
											
											
										 
											2010-08-18 04:37:38 +10:00
										 |  |  | #include <linux/percpu.h>
 | 
					
						
							| 
									
										
										
										
											2012-06-24 09:56:45 +04:00
										 |  |  | #include <linux/hardirq.h>
 | 
					
						
							|  |  |  | #include <linux/task_work.h>
 | 
					
						
							| 
									
										
										
										
											2009-12-16 04:53:03 -05:00
										 |  |  | #include <linux/ima.h>
 | 
					
						
							| 
									
										
										
										
											2006-03-07 21:55:35 -08:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2011-07-26 16:09:06 -07:00
										 |  |  | #include <linux/atomic.h>
 | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2009-12-04 15:47:36 -05:00
										 |  |  | #include "internal.h"
 | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | /* sysctl tunables... */ | 
					
						
							|  |  |  | struct files_stat_struct files_stat = { | 
					
						
							|  |  |  | 	.max_files = NR_FILE | 
					
						
							|  |  |  | }; | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2012-10-09 14:49:54 -07:00
										 |  |  | DEFINE_STATIC_LGLOCK(files_lglock); | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2008-12-10 09:35:45 -08:00
										 |  |  | /* SLAB cache for file structures */ | 
					
						
							|  |  |  | static struct kmem_cache *filp_cachep __read_mostly; | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2006-03-07 21:55:35 -08:00
										 |  |  | static struct percpu_counter nr_files __cacheline_aligned_in_smp; | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2012-07-20 23:05:59 +04:00
										 |  |  | static void file_free_rcu(struct rcu_head *head) | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | { | 
					
						
							| 
									
										
										
										
											2008-11-14 10:39:25 +11:00
										 |  |  | 	struct file *f = container_of(head, struct file, f_u.fu_rcuhead); | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	put_cred(f->f_cred); | 
					
						
							| 
									
										
										
										
											2006-03-07 21:55:35 -08:00
										 |  |  | 	kmem_cache_free(filp_cachep, f); | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2006-03-07 21:55:35 -08:00
										 |  |  | static inline void file_free(struct file *f) | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | { | 
					
						
							| 
									
										
										
										
											2006-03-07 21:55:35 -08:00
										 |  |  | 	percpu_counter_dec(&nr_files); | 
					
						
							| 
									
										
										
										
											2008-02-15 14:38:01 -08:00
										 |  |  | 	file_check_state(f); | 
					
						
							| 
									
										
										
										
											2006-03-07 21:55:35 -08:00
										 |  |  | 	call_rcu(&f->f_u.fu_rcuhead, file_free_rcu); | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2006-03-07 21:55:35 -08:00
										 |  |  | /*
 | 
					
						
							|  |  |  |  * Return the total number of open files in the system | 
					
						
							|  |  |  |  */ | 
					
						
							| 
									
										
										
										
											2010-10-26 14:22:44 -07:00
										 |  |  | static long get_nr_files(void) | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | { | 
					
						
							| 
									
										
										
										
											2006-03-07 21:55:35 -08:00
										 |  |  | 	return percpu_counter_read_positive(&nr_files); | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2006-03-07 21:55:35 -08:00
										 |  |  | /*
 | 
					
						
							|  |  |  |  * Return the maximum number of open files in the system | 
					
						
							|  |  |  |  */ | 
					
						
							| 
									
										
										
										
											2010-10-26 14:22:44 -07:00
										 |  |  | unsigned long get_max_files(void) | 
					
						
							| 
									
										
										
										
											2005-09-09 13:04:13 -07:00
										 |  |  | { | 
					
						
							| 
									
										
										
										
											2006-03-07 21:55:35 -08:00
										 |  |  | 	return files_stat.max_files; | 
					
						
							| 
									
										
										
										
											2005-09-09 13:04:13 -07:00
										 |  |  | } | 
					
						
							| 
									
										
										
										
											2006-03-07 21:55:35 -08:00
										 |  |  | EXPORT_SYMBOL_GPL(get_max_files); | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | /*
 | 
					
						
							|  |  |  |  * Handle nr_files sysctl | 
					
						
							|  |  |  |  */ | 
					
						
							|  |  |  | #if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS)
 | 
					
						
							| 
									
										
										
										
											2009-09-23 15:57:19 -07:00
										 |  |  | int proc_nr_files(ctl_table *table, int write, | 
					
						
							| 
									
										
										
										
											2006-03-07 21:55:35 -08:00
										 |  |  |                      void __user *buffer, size_t *lenp, loff_t *ppos) | 
					
						
							|  |  |  | { | 
					
						
							|  |  |  | 	files_stat.nr_files = get_nr_files(); | 
					
						
							| 
									
										
										
										
											2010-10-26 14:22:44 -07:00
										 |  |  | 	return proc_doulongvec_minmax(table, write, buffer, lenp, ppos); | 
					
						
							| 
									
										
										
										
											2006-03-07 21:55:35 -08:00
										 |  |  | } | 
					
						
							|  |  |  | #else
 | 
					
						
							| 
									
										
										
										
											2009-09-23 15:57:19 -07:00
										 |  |  | int proc_nr_files(ctl_table *table, int write, | 
					
						
							| 
									
										
										
										
											2006-03-07 21:55:35 -08:00
										 |  |  |                      void __user *buffer, size_t *lenp, loff_t *ppos) | 
					
						
							|  |  |  | { | 
					
						
							|  |  |  | 	return -ENOSYS; | 
					
						
							|  |  |  | } | 
					
						
							|  |  |  | #endif
 | 
					
						
							| 
									
										
										
										
											2005-09-09 13:04:13 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | /* Find an unused file structure and return a pointer to it.
 | 
					
						
							| 
									
										
										
										
											2013-02-14 20:41:04 -05:00
										 |  |  |  * Returns an error pointer if some error happened, e.g. we are over the | 
					
						
							|  |  |  |  * file structures limit, run out of memory, or the operation is not permitted. | 
					
						
							| 
									
										
										
										
											2008-02-15 14:37:26 -08:00
										 |  |  |  * | 
					
						
							|  |  |  |  * Be very careful using this.  You are responsible for | 
					
						
							|  |  |  |  * getting write access to any mount that you might assign | 
					
						
							|  |  |  |  * to this filp, if it is opened for write.  If this is not | 
					
						
							|  |  |  |  * done, you will unbalance the mount's writer count | 
					
						
							|  |  |  |  * and trigger a warning at __fput() time. | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  |  */ | 
					
						
							|  |  |  | struct file *get_empty_filp(void) | 
					
						
							|  |  |  | { | 
					
						
							| 
									
										
										
										
											2008-11-14 10:39:18 +11:00
										 |  |  | 	const struct cred *cred = current_cred(); | 
					
						
							| 
									
										
										
										
											2010-10-26 14:22:44 -07:00
										 |  |  | 	static long old_max; | 
					
						
							| 
									
										
										
										
											2013-02-14 20:41:04 -05:00
										 |  |  | 	struct file *f; | 
					
						
							|  |  |  | 	int error; | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 
 | 
					
						
							|  |  |  | 	/*
 | 
					
						
							|  |  |  | 	 * Privileged users can go above max_files | 
					
						
							|  |  |  | 	 */ | 
					
						
							| 
									
										
										
										
											2006-03-07 21:55:35 -08:00
										 |  |  | 	if (get_nr_files() >= files_stat.max_files && !capable(CAP_SYS_ADMIN)) { | 
					
						
							|  |  |  | 		/*
 | 
					
						
							|  |  |  | 		 * percpu_counters are inaccurate.  Do an expensive check before | 
					
						
							|  |  |  | 		 * we go and fail. | 
					
						
							|  |  |  | 		 */ | 
					
						
							| 
									
										
										
										
											2007-10-16 23:25:44 -07:00
										 |  |  | 		if (percpu_counter_sum_positive(&nr_files) >= files_stat.max_files) | 
					
						
							| 
									
										
										
										
											2006-03-07 21:55:35 -08:00
										 |  |  | 			goto over; | 
					
						
							|  |  |  | 	} | 
					
						
							| 
									
										
										
										
											2005-06-23 00:09:50 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2007-10-16 23:26:19 -07:00
										 |  |  | 	f = kmem_cache_zalloc(filp_cachep, GFP_KERNEL); | 
					
						
							| 
									
										
										
										
											2013-02-14 20:41:04 -05:00
										 |  |  | 	if (unlikely(!f)) | 
					
						
							|  |  |  | 		return ERR_PTR(-ENOMEM); | 
					
						
							| 
									
										
										
										
											2005-06-23 00:09:50 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2006-03-07 21:55:35 -08:00
										 |  |  | 	percpu_counter_inc(&nr_files); | 
					
						
							| 
									
										
										
										
											2011-02-04 18:13:24 +00:00
										 |  |  | 	f->f_cred = get_cred(cred); | 
					
						
							| 
									
										
										
										
											2013-02-14 20:41:04 -05:00
										 |  |  | 	error = security_file_alloc(f); | 
					
						
							|  |  |  | 	if (unlikely(error)) { | 
					
						
							|  |  |  | 		file_free(f); | 
					
						
							|  |  |  | 		return ERR_PTR(error); | 
					
						
							|  |  |  | 	} | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2006-03-23 03:01:03 -08:00
										 |  |  | 	INIT_LIST_HEAD(&f->f_u.fu_list); | 
					
						
							| 
									
										
										
										
											2008-07-26 00:39:17 -04:00
										 |  |  | 	atomic_long_set(&f->f_count, 1); | 
					
						
							| 
									
										
										
										
											2005-06-23 00:09:50 -07:00
										 |  |  | 	rwlock_init(&f->f_owner.lock); | 
					
						
							| 
									
										
										
										
											2009-02-06 13:52:43 -07:00
										 |  |  | 	spin_lock_init(&f->f_lock); | 
					
						
							| 
									
										
										
										
											2006-03-23 03:01:03 -08:00
										 |  |  | 	eventpoll_init_file(f); | 
					
						
							| 
									
										
										
										
											2005-06-23 00:09:50 -07:00
										 |  |  | 	/* f->f_version: 0 */ | 
					
						
							|  |  |  | 	return f; | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | over: | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 	/* Ran out of filps - report that */ | 
					
						
							| 
									
										
										
										
											2006-03-07 21:55:35 -08:00
										 |  |  | 	if (get_nr_files() > old_max) { | 
					
						
							| 
									
										
										
										
											2010-10-26 14:22:44 -07:00
										 |  |  | 		pr_info("VFS: file-max limit %lu reached\n", get_max_files()); | 
					
						
							| 
									
										
										
										
											2006-03-07 21:55:35 -08:00
										 |  |  | 		old_max = get_nr_files(); | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 	} | 
					
						
							| 
									
										
										
										
											2013-02-14 20:41:04 -05:00
										 |  |  | 	return ERR_PTR(-ENFILE); | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | } | 
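The limit check at the top of get_empty_filp() reads the cheap, approximate counter first and pays for an exact sum only when that read says the limit has been reached. A minimal single-threaded sketch of the pattern follows. It is not the kernel's percpu_counter: `struct approx_counter`, `NR_SLOTS`, `BATCH`, `counter_add()`, `counter_read()`, `counter_sum()` and `may_allocate()` are hypothetical names, and the real implementation uses true per-CPU storage and locking.

```c
#include <stdbool.h>

#define NR_SLOTS 4	/* hypothetical stand-in for the number of CPUs */
#define BATCH    8	/* fold a slot into the global total at this size */

struct approx_counter {
	long global;		/* cheap to read, may lag behind */
	long local[NR_SLOTS];	/* per-slot deltas not yet folded in */
};

/* Update one slot; fold it into the global total once it grows large. */
static void counter_add(struct approx_counter *c, int slot, long n)
{
	c->local[slot] += n;
	if (c->local[slot] >= BATCH || c->local[slot] <= -BATCH) {
		c->global += c->local[slot];
		c->local[slot] = 0;
	}
}

/* Fast, approximate read, clamped at zero. */
static long counter_read(const struct approx_counter *c)
{
	return c->global < 0 ? 0 : c->global;
}

/* Slow, exact read: walks every slot. Also clamped at zero. */
static long counter_sum(const struct approx_counter *c)
{
	long sum = c->global;

	for (int i = 0; i < NR_SLOTS; i++)
		sum += c->local[i];
	return sum < 0 ? 0 : sum;
}

/* Same shape as the check in get_empty_filp(): cheap test, then exact. */
static bool may_allocate(const struct approx_counter *c, long max,
			 bool privileged)
{
	if (counter_read(c) >= max && !privileged) {
		if (counter_sum(c) >= max)
			return false;
	}
	return true;
}
```

As in the function above, the exact sum is consulted only after the approximate read reports the limit, so the count can briefly exceed the limit while the slots have not yet been folded in.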
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2007-10-16 23:31:13 -07:00
										 |  |  | /**
 | 
					
						
							|  |  |  |  * alloc_file - allocate and initialize a 'struct file' | 
					
						
							|  |  |  |  * @path: the (dentry, vfsmount) pair for the new file | 
					
						
							|  |  |  |  * @mode: the mode with which the new file will be opened | 
					
						
							|  |  |  |  * @fop: the 'struct file_operations' for the new file | 
					
						
							|  |  |  |  * | 
					
						
							|  |  |  |  * Use this instead of get_empty_filp() to get a new | 
					
						
							|  |  |  |  * 'struct file'.  Do so because of the same initialization | 
					
						
							|  |  |  |  * pitfalls listed for init_file().  This is a | 
					
						
							|  |  |  |  * preferred interface to using init_file(). | 
					
						
							|  |  |  |  * | 
					
						
							|  |  |  |  * If all the callers of init_file() are eliminated, its | 
					
						
							|  |  |  |  * code should be moved into this function. | 
					
						
							|  |  |  |  */ | 
					
						
							| 
									
										
										
										
											2009-08-09 00:52:35 +04:00
										 |  |  | struct file *alloc_file(struct path *path, fmode_t mode, | 
					
						
							|  |  |  | 		const struct file_operations *fop) | 
					
						
							| 
									
										
										
										
											2007-10-16 23:31:13 -07:00
										 |  |  | { | 
					
						
							|  |  |  | 	struct file *file; | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	file = get_empty_filp(); | 
					
						
							| 
									
										
										
										
											2013-02-14 20:41:04 -05:00
										 |  |  | 	if (IS_ERR(file)) | 
					
						
							| 
									
										
										
										
											2012-09-12 20:11:55 -07:00
										 |  |  | 		return file; | 
					
						
							| 
									
										
										
										
											2007-10-16 23:31:13 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2009-08-09 00:52:35 +04:00
										 |  |  | 	file->f_path = *path; | 
					
						
							| 
									
										
										
										
											2013-03-01 19:48:30 -05:00
										 |  |  | 	file->f_inode = path->dentry->d_inode; | 
					
						
							| 
									
										
										
										
											2009-08-09 00:52:35 +04:00
										 |  |  | 	file->f_mapping = path->dentry->d_inode->i_mapping; | 
					
						
							| 
									
										
										
										
											2007-10-16 23:31:13 -07:00
										 |  |  | 	file->f_mode = mode; | 
					
						
							|  |  |  | 	file->f_op = fop; | 
					
						
							| 
									
										
										
										
											2008-02-15 14:37:48 -08:00
										 |  |  | 
 | 
					
						
							|  |  |  | 	/*
 | 
					
						
							|  |  |  | 	 * These mounts don't really matter in practice | 
					
						
							|  |  |  | 	 * for r/o bind mounts.  They aren't userspace- | 
					
						
							|  |  |  | 	 * visible.  We do this for consistency, and so | 
					
						
							|  |  |  | 	 * that we can do debugging checks at __fput() | 
					
						
							|  |  |  | 	 */ | 
					
						
							| 
									
										
										
										
											2009-08-09 00:52:35 +04:00
										 |  |  | 	if ((mode & FMODE_WRITE) && !special_file(path->dentry->d_inode->i_mode)) { | 
					
						
							| 
									
										
										
										
											2008-02-15 14:38:01 -08:00
										 |  |  | 		file_take_write(file); | 
					
						
							| 
									
										
										
										
											2009-12-16 12:48:44 -08:00
										 |  |  | 		WARN_ON(mnt_clone_write(path->mnt)); | 
					
						
							| 
									
										
										
										
											2008-02-15 14:37:48 -08:00
										 |  |  | 	} | 
					
						
							| 
									
										
										
										
											2010-11-02 10:13:07 -04:00
										 |  |  | 	if ((mode & (FMODE_READ | FMODE_WRITE)) == FMODE_READ) | 
					
						
							|  |  |  | 		i_readcount_inc(path->dentry->d_inode); | 
					
						
							| 
									
										
										
										
											2009-08-08 23:56:29 +04:00
										 |  |  | 	return file; | 
					
						
							| 
									
										
										
										
											2007-10-16 23:31:13 -07:00
										 |  |  | } | 
					
						
							| 
									
										
										
										
											2009-12-16 12:43:11 -08:00
										 |  |  | EXPORT_SYMBOL(alloc_file); | 
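get_empty_filp() and alloc_file() report failure through an error pointer rather than NULL, so callers test the result with IS_ERR(). The sketch below re-creates that convention in userspace to show how an errno value travels inside the pointer. The three helpers imitate the kernel's ERR_PTR(), PTR_ERR() and IS_ERR() and are not the kernel definitions. `struct toy_file` and `toy_alloc()` are hypothetical. The sketch assumes error values map to the top 4095 addresses, which are never valid pointers.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_ERRNO 4095	/* largest errno value encoded in a pointer */

/* Userspace imitations of the kernel helpers, for illustration only. */
static inline void *ERR_PTR(long error)
{
	return (void *)(intptr_t)error;
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

static inline int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

struct toy_file {	/* hypothetical stand-in for struct file */
	int mode;
};

/* Fails with -ENFILE or -ENOMEM, never returns NULL. */
static struct toy_file *toy_alloc(int limit_reached)
{
	struct toy_file *f;

	if (limit_reached)
		return ERR_PTR(-ENFILE);

	f = calloc(1, sizeof(*f));
	if (!f)
		return ERR_PTR(-ENOMEM);
	return f;
}
```

A caller checks `IS_ERR(f)` and, when it is true, recovers the error code with `PTR_ERR(f)`. A plain NULL test would miss these failures.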
					
						
							| 
									
										
										
										
											2007-10-16 23:31:13 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2008-02-15 14:37:31 -08:00
										 |  |  | /**
 | 
					
						
							|  |  |  |  * drop_file_write_access - give up ability to write to a file | 
					
						
							|  |  |  |  * @file: the file to which we will stop writing | 
					
						
							|  |  |  |  * | 
					
						
							|  |  |  |  * This is a central place which will give up the ability | 
					
						
							|  |  |  |  * to write to @file, along with access to write through | 
					
						
							|  |  |  |  * its vfsmount. | 
					
						
							|  |  |  |  */ | 
					
						
							| 
									
										
										
										
											2012-02-12 02:38:16 -05:00
										 |  |  | static void drop_file_write_access(struct file *file) | 
					
						
							| 
									
										
										
										
											2008-02-15 14:37:31 -08:00
										 |  |  | { | 
					
						
							| 
									
										
										
										
											2008-02-15 14:37:48 -08:00
										 |  |  | 	struct vfsmount *mnt = file->f_path.mnt; | 
					
						
							| 
									
										
										
										
											2008-02-15 14:37:31 -08:00
										 |  |  | 	struct dentry *dentry = file->f_path.dentry; | 
					
						
							|  |  |  | 	struct inode *inode = dentry->d_inode; | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	put_write_access(inode); | 
					
						
							| 
									
										
										
										
											2008-02-15 14:38:01 -08:00
										 |  |  | 
 | 
					
						
							|  |  |  | 	if (special_file(inode->i_mode)) | 
					
						
							|  |  |  | 		return; | 
					
						
							|  |  |  | 	if (file_check_writeable(file) != 0) | 
					
						
							|  |  |  | 		return; | 
					
						
							| 
									
										
										
										
											2012-06-12 16:20:35 +02:00
										 |  |  | 	__mnt_drop_write(mnt); | 
					
						
							| 
									
										
										
										
											2008-02-15 14:38:01 -08:00
										 |  |  | 	file_release_write(file); | 
					
						
							| 
									
										
										
										
											2008-02-15 14:37:31 -08:00
										 |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2010-05-26 15:13:55 -04:00
										 |  |  | /* the real guts of fput() - releasing the last reference to file
 | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  |  */ | 
					
						
							| 
									
										
										
										
											2010-05-26 15:13:55 -04:00
										 |  |  | static void __fput(struct file *file) | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | { | 
					
						
							| 
									
										
										
										
											2006-12-08 02:36:35 -08:00
										 |  |  | 	struct dentry *dentry = file->f_path.dentry; | 
					
						
							|  |  |  | 	struct vfsmount *mnt = file->f_path.mnt; | 
					
						
							| 
									
										
										
										
											2013-06-13 23:37:49 +01:00
										 |  |  | 	struct inode *inode = file->f_inode; | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 
 | 
					
						
							|  |  |  | 	might_sleep(); | 
					
						
							| 
									
										
											  
											
											
										 
											2005-07-12 17:06:03 -04:00
										 |  |  | 
 | 
					
						
							|  |  |  | 	fsnotify_close(file); | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 	/*
 | 
					
						
							|  |  |  | 	 * The function eventpoll_release() should be the first called | 
					
						
							|  |  |  | 	 * in the file cleanup chain. | 
					
						
							|  |  |  | 	 */ | 
					
						
							|  |  |  | 	eventpoll_release(file); | 
					
						
							|  |  |  | 	locks_remove_flock(file); | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2008-10-31 23:28:30 +00:00
										 |  |  | 	if (unlikely(file->f_flags & FASYNC)) { | 
					
						
							|  |  |  | 		if (file->f_op && file->f_op->fasync) | 
					
						
							|  |  |  | 			file->f_op->fasync(-1, file, 0); | 
					
						
							|  |  |  | 	} | 
					
						
							| 
									
										
										
										
											2011-03-16 22:48:43 -04:00
										 |  |  | 	ima_file_free(file); | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 	if (file->f_op && file->f_op->release) | 
					
						
							|  |  |  | 		file->f_op->release(inode, file); | 
					
						
							|  |  |  | 	security_file_free(file); | 
					
						
							| 
									
										
										
										
											2011-03-16 18:17:54 +01:00
										 |  |  | 	if (unlikely(S_ISCHR(inode->i_mode) && inode->i_cdev != NULL && | 
					
						
							|  |  |  | 		     !(file->f_mode & FMODE_PATH))) { | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 		cdev_put(inode->i_cdev); | 
					
						
							| 
									
										
										
										
											2011-03-16 18:17:54 +01:00
										 |  |  | 	} | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 	fops_put(file->f_op); | 
					
						
							| 
									
										
										
										
											2006-10-02 02:17:15 -07:00
										 |  |  | 	put_pid(file->f_owner.pid); | 
					
						
							| 
									
										
										
										
											2010-11-02 10:13:07 -04:00
										 |  |  | 	if ((file->f_mode & (FMODE_READ | FMODE_WRITE)) == FMODE_READ) | 
					
						
							|  |  |  | 		i_readcount_dec(inode); | 
					
						
							| 
									
										
										
										
											2008-02-15 14:37:31 -08:00
										 |  |  | 	if (file->f_mode & FMODE_WRITE) | 
					
						
							|  |  |  | 		drop_file_write_access(file); | 
					
						
							| 
									
										
										
										
											2006-12-08 02:36:35 -08:00
										 |  |  | 	file->f_path.dentry = NULL; | 
					
						
							|  |  |  | 	file->f_path.mnt = NULL; | 
					
						
							| 
									
										
										
										
											2013-03-01 19:48:30 -05:00
										 |  |  | 	file->f_inode = NULL; | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 	file_free(file); | 
					
						
							|  |  |  | 	dput(dentry); | 
					
						
							|  |  |  | 	mntput(mnt); | 
					
						
							|  |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2013-07-08 14:24:16 -07:00
										 |  |  | static LLIST_HEAD(delayed_fput_list); | 
					
						
							| 
									
										
										
										
											2012-06-24 09:56:45 +04:00
										 |  |  | static void delayed_fput(struct work_struct *unused) | 
					
						
							|  |  |  | { | 
					
						
							| 
									
										
										
										
											2013-07-08 14:24:16 -07:00
										 |  |  | 	struct llist_node *node = llist_del_all(&delayed_fput_list); | 
					
						
							|  |  |  | 	struct llist_node *next; | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	for (; node; node = next) { | 
					
						
							|  |  |  | 		next = llist_next(node); | 
					
						
							|  |  |  | 		__fput(llist_entry(node, struct file, f_u.fu_llist)); | 
					
						
							| 
									
										
										
										
											2012-06-24 09:56:45 +04:00
										 |  |  | 	} | 
					
						
							|  |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | static void ____fput(struct callback_head *work) | 
					
						
							|  |  |  | { | 
					
						
							|  |  |  | 	__fput(container_of(work, struct file, f_u.fu_rcuhead)); | 
					
						
							|  |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | /*
 | 
					
						
							|  |  |  |  * If a kernel thread really needs to have the final fput() it has done | 
					
						
							|  |  |  |  * to complete, call this.  The only user right now is the boot - we | 
					
						
							|  |  |  |  * *do* need to make sure our writes to binaries on initramfs have | 
					
						
							|  |  |  |  * not left us with opened struct file waiting for __fput() - execve() | 
					
						
							|  |  |  |  * won't work without that.  Please, don't add more callers without | 
					
						
							|  |  |  |  * very good reasons; in particular, never call that with locks | 
					
						
							|  |  |  |  * held and never call that from a thread that might need to do | 
					
						
							|  |  |  |  * some work on any kind of umount. | 
					
						
							|  |  |  |  */ | 
					
						
							|  |  |  | void flush_delayed_fput(void) | 
					
						
							|  |  |  | { | 
					
						
							|  |  |  | 	delayed_fput(NULL); | 
					
						
							|  |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | static DECLARE_WORK(delayed_fput_work, delayed_fput); | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2010-05-26 15:13:55 -04:00
										 |  |  | void fput(struct file *file) | 
					
						
							|  |  |  | { | 
					
						
							| 
									
										
										
										
											2012-06-24 09:56:45 +04:00
										 |  |  | 	if (atomic_long_dec_and_test(&file->f_count)) { | 
					
						
							|  |  |  | 		struct task_struct *task = current; | 
					
						
							| 
									
										
										
										
											2013-06-14 21:09:47 +02:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2012-06-24 09:56:45 +04:00
										 |  |  | 		file_sb_list_del(file); | 
					
						
							| 
									
										
										
										
											2013-06-14 21:09:47 +02:00
										 |  |  | 		if (likely(!in_interrupt() && !(task->flags & PF_KTHREAD))) { | 
					
						
							|  |  |  | 			init_task_work(&file->f_u.fu_rcuhead, ____fput); | 
					
						
							|  |  |  | 			if (!task_work_add(task, &file->f_u.fu_rcuhead, true)) | 
					
						
							|  |  |  | 				return; | 
					
						
							| 
									
										
										
										
											2013-07-08 14:24:15 -07:00
										 |  |  | 			/*
 | 
					
						
							|  |  |  | 			 * After this task has run exit_task_work(), | 
					
						
							|  |  |  | 			 * task_work_add() will fail.  free_ipc_ns()-> | 
					
						
							|  |  |  | 			 * shm_destroy() can do this.  Fall through to delayed | 
					
						
							|  |  |  | 			 * fput to avoid leaking *file. | 
					
						
							|  |  |  | 			 */ | 
					
						
							| 
									
										
										
										
											2012-06-24 09:56:45 +04:00
										 |  |  | 		} | 
					
						
							| 
									
										
										
										
											2013-07-08 14:24:16 -07:00
										 |  |  | 
 | 
					
						
							|  |  |  | 		if (llist_add(&file->f_u.fu_llist, &delayed_fput_list)) | 
					
						
							|  |  |  | 			schedule_work(&delayed_fput_work); | 
					
						
							| 
									
										
										
										
											2012-06-24 09:56:45 +04:00
										 |  |  | 	} | 
					
						
							|  |  |  | } | 
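The fput() path above drops a reference and, on the last drop, queues the file on a lock-free list; only the push that finds the list empty schedules the cleanup work. What follows is a minimal user-space sketch of that pattern using C11 atomics instead of the kernel's llist and workqueue APIs. The names `struct obj`, `pending_add`, `pending_drain`, and `obj_put` are hypothetical and are not part of file_table.c.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space stand-in for struct file. */
struct obj {
	atomic_long refcount;
	struct obj *next;	/* link used only once the object is dead */
	int released;		/* set by the cleanup step, for illustration */
};

/* Head of a lock-free LIFO of objects awaiting cleanup. */
static _Atomic(struct obj *) pending = NULL;

/* Push onto the pending list; returns true if the list was empty before. */
static bool pending_add(struct obj *o)
{
	struct obj *head = atomic_load(&pending);

	do {
		o->next = head;
	} while (!atomic_compare_exchange_weak(&pending, &head, o));
	return head == NULL;
}

/* Detach the whole list in one step, then clean up each entry. */
static int pending_drain(void)
{
	struct obj *node = atomic_exchange(&pending, NULL);
	int n = 0;

	while (node) {
		struct obj *next = node->next;	/* read the link before cleanup */

		node->released = 1;
		node = next;
		n++;
	}
	return n;
}

/* Drop one reference; returns true if the caller should schedule a drain. */
static bool obj_put(struct obj *o)
{
	assert(o != NULL);
	if (atomic_fetch_sub(&o->refcount, 1) != 1)
		return false;		/* other references remain */
	return pending_add(o);
}
```

In this sketch a caller that gets `true` from `obj_put()` is responsible for arranging a later `pending_drain()`. Later last-reference drops that find the list non-empty return `false`, because a drain is already owed.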
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | /*
 | 
					
						
							|  |  |  |  * synchronous analog of fput(); for kernel threads that might be needed | 
					
						
							|  |  |  |  * in some umount() (and thus can't use flush_delayed_fput() without | 
					
						
							|  |  |  |  * risking deadlocks), need to wait for completion of __fput() and know | 
					
						
							|  |  |  |  * for this specific struct file it won't involve anything that would | 
					
						
							|  |  |  |  * need them.  Use only if you really need it - at the very least, | 
					
						
							|  |  |  |  * don't blindly convert fput() by kernel thread to that. | 
					
						
							|  |  |  |  */ | 
					
						
							|  |  |  | void __fput_sync(struct file *file) | 
					
						
							|  |  |  | { | 
					
						
							|  |  |  | 	if (atomic_long_dec_and_test(&file->f_count)) { | 
					
						
							|  |  |  | 		struct task_struct *task = current; | 
					
						
							|  |  |  | 		file_sb_list_del(file); | 
					
						
							|  |  |  | 		BUG_ON(!(task->flags & PF_KTHREAD)); | 
					
						
							| 
									
										
										
										
											2010-05-26 15:13:55 -04:00
										 |  |  | 		__fput(file); | 
					
						
							| 
									
										
										
										
											2012-06-24 09:56:45 +04:00
										 |  |  | 	} | 
					
						
							| 
									
										
										
										
											2010-05-26 15:13:55 -04:00
										 |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | EXPORT_SYMBOL(fput); | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | void put_filp(struct file *file) | 
					
						
							|  |  |  | { | 
					
						
							| 
									
										
										
										
											2008-07-26 00:39:17 -04:00
										 |  |  | 	if (atomic_long_dec_and_test(&file->f_count)) { | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 		security_file_free(file); | 
					
						
							| 
									
										
										
										
											2010-08-18 04:37:35 +10:00
										 |  |  | 		file_sb_list_del(file); | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 		file_free(file); | 
					
						
							|  |  |  | 	} | 
					
						
							|  |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
											  
											
												fs: scale files_lock
Improve scalability of files_lock by adding per-cpu, per-sb files lists,
protected with an lglock. The lglock provides fast access to the per-cpu lists
to add and remove files. It also provides a snapshot of all the per-cpu lists
(although this is very slow).
One difficulty with this approach is that a file can be removed from the list
by another CPU. We must track which per-cpu list the file is on with a new
variable in the file struct (packed into a hole on 64-bit archs). Scalability
could suffer if files are frequently removed from a different CPU's list.
However, loads with frequent removal of files imply a short interval between
adding and removing the files, and the scheduler attempts to avoid moving
processes too far away. Also, even in the case of cross-CPU removal, the
hardware has much more opportunity to parallelise cacheline transfers with N
cachelines than with 1.
A worst-case test of 1 CPU allocating files subsequently being freed by N CPUs
degenerates to contending on a single lock, which is no worse than before. When
more than one CPU are allocating files, even if they are always freed by
different CPUs, there will be more parallelism than the single-lock case.
Testing results:
On a 2 socket, 8 core opteron, I measure the number of times the lock is taken
to remove the file, the number of times it is removed by the same CPU that
added it, and the number of times it is removed by the same node that added it.
Booting:    locks=  25049 cpu-hits=  23174 (92.5%) node-hits=  23945 (95.6%)
kbuild -j16 locks=2281913 cpu-hits=2208126 (96.8%) node-hits=2252674 (98.7%)
dbench 64   locks=4306582 cpu-hits=4287247 (99.6%) node-hits=4299527 (99.8%)
So a file is removed from the same CPU it was added by over 90% of the time.
It remains within the same node 95% of the time.
Tim Chen ran some numbers for a 64 thread Nehalem system performing a compile.
                throughput
2.6.34-rc2      24.5
+patch          24.9
                us      sys     idle    IO wait (in %)
2.6.34-rc2      51.25   28.25   17.25   3.25
+patch          53.75   18.5    19      8.75
So significantly less CPU time spent in kernel code, higher idle time and
slightly higher throughput.
Single threaded performance difference was within the noise of microbenchmarks.
That is not to say that no penalty exists: the code is larger and requires more
memory accesses, so it will be slightly slower.
Cc: linux-kernel@vger.kernel.org
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
											
										 
											2010-08-18 04:37:38 +10:00
										 |  |  | static inline int file_list_cpu(struct file *file) | 
					
						
							|  |  |  | { | 
					
						
							|  |  |  | #ifdef CONFIG_SMP
 | 
					
						
							|  |  |  | 	return file->f_sb_list_cpu; | 
					
						
							|  |  |  | #else
 | 
					
						
							|  |  |  | 	return smp_processor_id(); | 
					
						
							|  |  |  | #endif
 | 
					
						
							|  |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | /* helper for file_sb_list_add to reduce ifdefs */ | 
					
						
							|  |  |  | static inline void __file_sb_list_add(struct file *file, struct super_block *sb) | 
					
						
							|  |  |  | { | 
					
						
							|  |  |  | 	struct list_head *list; | 
					
						
							|  |  |  | #ifdef CONFIG_SMP
 | 
					
						
							|  |  |  | 	int cpu; | 
					
						
							|  |  |  | 	cpu = smp_processor_id(); | 
					
						
							|  |  |  | 	file->f_sb_list_cpu = cpu; | 
					
						
							|  |  |  | 	list = per_cpu_ptr(sb->s_files, cpu); | 
					
						
							|  |  |  | #else
 | 
					
						
							|  |  |  | 	list = &sb->s_files; | 
					
						
							|  |  |  | #endif
 | 
					
						
							|  |  |  | 	list_add(&file->f_u.fu_list, list); | 
					
						
							|  |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | /**
 | 
					
						
							|  |  |  |  * file_sb_list_add - add a file to the sb's file list | 
					
						
							|  |  |  |  * @file: file to add | 
					
						
							|  |  |  |  * @sb: sb to add it to | 
					
						
							|  |  |  |  * | 
					
						
							|  |  |  |  * Use this function to associate a file with the superblock of the inode it | 
					
						
							|  |  |  |  * refers to. | 
					
						
							|  |  |  |  */ | 
					
						
							| 
									
										
										
										
											2010-08-18 04:37:35 +10:00
										 |  |  | void file_sb_list_add(struct file *file, struct super_block *sb) | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | { | 
					
						
							| 
									
										
										
										
											2012-05-08 13:32:02 +09:30
										 |  |  | 	lg_local_lock(&files_lglock); | 
					
						
							| 
									
										
											  
											
											
										 
											2010-08-18 04:37:38 +10:00
										 |  |  | 	__file_sb_list_add(file, sb); | 
					
						
							| 
									
										
										
										
											2012-05-08 13:32:02 +09:30
										 |  |  | 	lg_local_unlock(&files_lglock); | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
											  
											
											
										 
											2010-08-18 04:37:38 +10:00
										 |  |  | /**
 | 
					
						
							|  |  |  |  * file_sb_list_del - remove a file from the sb's file list | 
					
						
							|  |  |  |  * @file: file to remove | 
					
						
					
						
							|  |  |  |  * | 
					
						
							|  |  |  |  * Use this function to remove a file from its superblock. | 
					
						
							|  |  |  |  */ | 
					
						
							| 
									
										
										
										
											2010-08-18 04:37:35 +10:00
										 |  |  | void file_sb_list_del(struct file *file) | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | { | 
					
						
							| 
									
										
										
										
											2005-10-30 15:02:16 -08:00
										 |  |  | 	if (!list_empty(&file->f_u.fu_list)) { | 
					
						
							| 
									
										
										
										
											2012-05-08 13:32:02 +09:30
										 |  |  | 		lg_local_lock_cpu(&files_lglock, file_list_cpu(file)); | 
					
						
							| 
									
										
										
										
											2005-10-30 15:02:16 -08:00
										 |  |  | 		list_del_init(&file->f_u.fu_list); | 
					
						
							| 
									
										
										
										
											2012-05-08 13:32:02 +09:30
										 |  |  | 		lg_local_unlock_cpu(&files_lglock, file_list_cpu(file)); | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 	} | 
					
						
							|  |  |  | } | 
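file_sb_list_add() records which per-CPU list a file was put on, and file_sb_list_del() takes the lock for that recorded CPU rather than the one it happens to run on. Below is a minimal user-space sketch of the same bookkeeping, assuming a fixed number of slots and a tiny `atomic_flag` spinlock in place of the kernel's lglock. `NR_SLOTS`, `struct node`, `struct slot`, and the helper functions are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define NR_SLOTS 4	/* hypothetical stand-in for the number of CPUs */

struct node {
	struct node *prev, *next;
	int home;		/* slot the node was added on, like f_sb_list_cpu */
};

struct slot {
	atomic_flag lock;	/* tiny spinlock guarding this slot's list */
	struct node head;	/* sentinel of a circular doubly linked list */
};

static struct slot slots[NR_SLOTS];

static void slots_init(void)
{
	for (int i = 0; i < NR_SLOTS; i++) {
		atomic_flag_clear(&slots[i].lock);
		slots[i].head.prev = slots[i].head.next = &slots[i].head;
	}
}

static void slot_lock(int i)
{
	while (atomic_flag_test_and_set_explicit(&slots[i].lock,
						 memory_order_acquire))
		;	/* spin until the flag was previously clear */
}

static void slot_unlock(int i)
{
	atomic_flag_clear_explicit(&slots[i].lock, memory_order_release);
}

/* Add on the caller's slot and remember it for later removal. */
static void node_add(struct node *n, int cur)
{
	assert(cur >= 0 && cur < NR_SLOTS);
	slot_lock(cur);
	n->home = cur;
	n->next = slots[cur].head.next;
	n->prev = &slots[cur].head;
	n->next->prev = n;
	slots[cur].head.next = n;
	slot_unlock(cur);
}

/* Remove from the slot the node was added on, not the caller's slot. */
static void node_del(struct node *n)
{
	int home = n->home;

	slot_lock(home);
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->prev = n->next = n;	/* leave it self-linked, as list_del_init() does */
	slot_unlock(home);
}

/* Count the entries on one slot while holding that slot's lock. */
static int slot_count(int i)
{
	int c = 0;

	slot_lock(i);
	for (struct node *p = slots[i].head.next; p != &slots[i].head; p = p->next)
		c++;
	slot_unlock(i);
	return c;
}
```

Storing the home slot in the node costs one integer per object, but it lets removal stay on a single per-slot lock even when the remover runs elsewhere.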
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
											  
											
											
										 
											2010-08-18 04:37:38 +10:00
										 |  |  | #ifdef CONFIG_SMP
 | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | /*
 | 
					
						
							|  |  |  |  * These macros iterate all files on all CPUs for a given superblock. | 
					
						
							|  |  |  |  * files_lglock must be held globally. | 
					
						
							|  |  |  |  */ | 
					
						
							|  |  |  | #define do_file_list_for_each_entry(__sb, __file)		\
 | 
					
						
							|  |  |  | {								\ | 
					
						
							|  |  |  | 	int i;							\ | 
					
						
							|  |  |  | 	for_each_possible_cpu(i) {				\ | 
					
						
							|  |  |  | 		struct list_head *list;				\ | 
					
						
							|  |  |  | 		list = per_cpu_ptr((__sb)->s_files, i);		\ | 
					
						
							|  |  |  | 		list_for_each_entry((__file), list, f_u.fu_list) | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | #define while_file_list_for_each_entry				\
 | 
					
						
							|  |  |  | 	}							\ | 
					
						
							|  |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | #else
 | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | #define do_file_list_for_each_entry(__sb, __file)		\
 | 
					
						
							|  |  |  | {								\ | 
					
						
							|  |  |  | 	struct list_head *list;					\ | 
					
						
							|  |  |  | 	list = &(__sb)->s_files;				\ | 
					
						
							|  |  |  | 	list_for_each_entry((__file), list, f_u.fu_list) | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | #define while_file_list_for_each_entry				\
 | 
					
						
							|  |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | #endif
 | 
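The do_file_list_for_each_entry / while_file_list_for_each_entry pair above is a split-brace idiom: the first macro opens a block and the loops, and the second supplies the matching closing braces. Here is a minimal user-space sketch of the same idiom over a small two-level table. `NR_BUCKETS`, `for_each_value`, `end_each_value`, and `sum_all` are hypothetical names chosen for illustration.

```c
#include <stddef.h>

#define NR_BUCKETS 3	/* hypothetical bucket count */

/* Opens a block and two nested loops; must be closed by end_each_value. */
#define for_each_value(tbl_, lens_, v_)					\
{									\
	for (size_t b_ = 0; b_ < NR_BUCKETS; b_++) {			\
		for (size_t i_ = 0; i_ < (lens_)[b_] &&			\
		     ((v_) = (tbl_)[b_][i_], 1); i_++)

/* Closes the outer loop and the block opened by for_each_value. */
#define end_each_value							\
	}								\
}

/* Sum every non-negative value across all buckets. */
static int sum_all(int tbl[NR_BUCKETS][4], const size_t lens[NR_BUCKETS])
{
	int v, sum = 0;

	for_each_value(tbl, lens, v) {
		if (v < 0)
			continue;	/* moves on to the next entry in this bucket */
		sum += v;
	} end_each_value
	return sum;
}
```

Because the loop body sits directly inside the innermost `for`, `continue` behaves as it would in an ordinary loop, while `break` leaves only the current bucket and not the whole walk.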
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2009-04-26 20:25:56 +10:00
										 |  |  | /**
 | 
					
						
							|  |  |  |  *	mark_files_ro - mark all files read-only | 
					
						
							|  |  |  |  *	@sb: superblock in question | 
					
						
							|  |  |  |  * | 
					
						
							|  |  |  |  *	All files are marked read-only.  We don't care about pending | 
					
						
							|  |  |  |  *	delete files so this should be used in 'force' mode only. | 
					
						
							|  |  |  |  */ | 
					
						
							|  |  |  | void mark_files_ro(struct super_block *sb) | 
					
						
							|  |  |  | { | 
					
						
							|  |  |  | 	struct file *f; | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2012-05-08 13:32:02 +09:30
										 |  |  | 	lg_global_lock(&files_lglock); | 
					
						
							| 
									
										
											  
											
												fs: scale files_lock
fs: scale files_lock
Improve scalability of files_lock by adding per-cpu, per-sb files lists,
protected with an lglock. The lglock provides fast access to the per-cpu lists
to add and remove files. It also provides a snapshot of all the per-cpu lists
(although this is very slow).
One difficulty with this approach is that a file can be removed from the list
by another CPU. We must track which per-cpu list the file is on with a new
variale in the file struct (packed into a hole on 64-bit archs). Scalability
could suffer if files are frequently removed from different cpu's list.
However loads with frequent removal of files imply short interval between
adding and removing the files, and the scheduler attempts to avoid moving
processes too far away. Also, even in the case of cross-CPU removal, the
hardware has much more opportunity to parallelise cacheline transfers with N
cachelines than with 1.
A worst-case test of 1 CPU allocating files subsequently being freed by N CPUs
degenerates to contending on a single lock, which is no worse than before. When
more than one CPU are allocating files, even if they are always freed by
different CPUs, there will be more parallelism than the single-lock case.
Testing results:
On a 2 socket, 8 core opteron, I measure the number of times the lock is taken
to remove the file, the number of times it is removed by the same CPU that
added it, and the number of times it is removed by the same node that added it.
Booting:    locks=  25049 cpu-hits=  23174 (92.5%) node-hits=  23945 (95.6%)
kbuild -j16 locks=2281913 cpu-hits=2208126 (96.8%) node-hits=2252674 (98.7%)
dbench 64   locks=4306582 cpu-hits=4287247 (99.6%) node-hits=4299527 (99.8%)
So a file is removed by the same CPU that added it over 90% of the time.
It remains within the same node over 95% of the time.
Tim Chen ran some numbers for a 64 thread Nehalem system performing a compile.
                throughput
2.6.34-rc2      24.5
+patch          24.9
                us      sys     idle    IO wait (in %)
2.6.34-rc2      51.25   28.25   17.25   3.25
+patch          53.75   18.5    19      8.75
So significantly less CPU time is spent in kernel code, with higher idle time
and slightly higher throughput.
The single-threaded performance difference was within the noise of
microbenchmarks. That is not to say a penalty does not exist: the code is
larger and more memory accesses are required, so it will be slightly slower.
Cc: linux-kernel@vger.kernel.org
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
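
The scheme described in this commit message can be sketched in userspace C. This is only an illustration under stated assumptions, not the kernel's lglock code: every `demo_*` name and `DEMO_NR_CPUS` is hypothetical, a C11 `atomic_flag` spinlock stands in for the per-cpu lock, and the "CPU" is passed in as a plain index. It shows the three points the message makes: adding takes only one list's lock, each file records which list it is on so that any CPU can remove it, and a snapshot of all lists has to take every lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define DEMO_NR_CPUS 4	/* hypothetical stand-in for the number of CPUs */

struct demo_file {
	struct demo_file *prev, *next;
	int list_cpu;		/* which list the file is on, -1 if none */
};

struct demo_list {
	atomic_flag lock;	/* one lock per list */
	struct demo_file head;	/* circular list sentinel */
	unsigned long count;
};

static struct demo_list demo_lists[DEMO_NR_CPUS];

static void demo_lock(struct demo_list *l)
{
	while (atomic_flag_test_and_set_explicit(&l->lock, memory_order_acquire))
		;	/* spin until the flag was previously clear */
}

static void demo_unlock(struct demo_list *l)
{
	atomic_flag_clear_explicit(&l->lock, memory_order_release);
}

static void demo_lists_init(void)
{
	for (int i = 0; i < DEMO_NR_CPUS; i++) {
		atomic_flag_clear(&demo_lists[i].lock);
		demo_lists[i].head.prev = &demo_lists[i].head;
		demo_lists[i].head.next = &demo_lists[i].head;
		demo_lists[i].count = 0;
	}
}

/* Fast path: only the adding CPU's list lock is taken. */
static void demo_file_add(struct demo_file *f, int cpu)
{
	struct demo_list *l = &demo_lists[cpu];

	demo_lock(l);
	f->list_cpu = cpu;	/* remember the list for later removal */
	f->next = l->head.next;
	f->prev = &l->head;
	l->head.next->prev = f;
	l->head.next = f;
	l->count++;
	demo_unlock(l);
}

/* Removal may run anywhere; the recorded index selects the lock. */
static void demo_file_del(struct demo_file *f)
{
	struct demo_list *l;

	assert(f->list_cpu >= 0 && f->list_cpu < DEMO_NR_CPUS);
	l = &demo_lists[f->list_cpu];
	demo_lock(l);
	f->prev->next = f->next;
	f->next->prev = f->prev;
	f->prev = f->next = NULL;
	f->list_cpu = -1;
	l->count--;
	demo_unlock(l);
}

/* Slow path: take every lock to get a consistent view of all lists. */
static unsigned long demo_count_all(void)
{
	unsigned long total = 0;

	for (int i = 0; i < DEMO_NR_CPUS; i++)
		demo_lock(&demo_lists[i]);
	for (int i = 0; i < DEMO_NR_CPUS; i++)
		total += demo_lists[i].count;
	for (int i = DEMO_NR_CPUS - 1; i >= 0; i--)
		demo_unlock(&demo_lists[i]);
	return total;
}
```

Calling `demo_lists_init()` once, then `demo_file_add(&f, cpu)` and later `demo_file_del(&f)` from any thread, exercises the fast paths. `demo_count_all()` plays the role of the slow global snapshot.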
											
										 
											2010-08-18 04:37:38 +10:00
										 |  |  | 	do_file_list_for_each_entry(sb, f) { | 
					
						
							| 
									
										
										
										
											2013-01-23 17:07:38 -05:00
										 |  |  | 		if (!S_ISREG(file_inode(f)->i_mode)) | 
					
						
							| 
									
										
										
										
											2009-04-26 20:25:56 +10:00
										 |  |  | 			continue; | 
					
						
							|  |  |  | 		if (!file_count(f)) | 
					
						
							|  |  |  | 			continue; | 
					
						
							|  |  |  | 		if (!(f->f_mode & FMODE_WRITE)) | 
					
						
							|  |  |  | 			continue; | 
					
						
							| 
									
										
										
										
											2010-03-05 13:42:01 -08:00
										 |  |  | 		spin_lock(&f->f_lock); | 
					
						
							| 
									
										
										
										
											2009-04-26 20:25:56 +10:00
										 |  |  | 		f->f_mode &= ~FMODE_WRITE; | 
					
						
							| 
									
										
										
										
											2010-03-05 13:42:01 -08:00
										 |  |  | 		spin_unlock(&f->f_lock); | 
					
						
							| 
									
										
										
										
											2009-04-26 20:25:56 +10:00
										 |  |  | 		if (file_check_writeable(f) != 0) | 
					
						
							|  |  |  | 			continue; | 
					
						
							| 
									
										
										
										
											2012-12-05 14:40:14 +01:00
										 |  |  | 		__mnt_drop_write(f->f_path.mnt); | 
					
						
							| 
									
										
										
										
											2009-04-26 20:25:56 +10:00
										 |  |  | 		file_release_write(f); | 
					
						
							| 
									
										
											  
											
											
										 
											2010-08-18 04:37:38 +10:00
										 |  |  | 	} while_file_list_for_each_entry; | 
					
						
							| 
									
										
										
										
											2012-05-08 13:32:02 +09:30
										 |  |  | 	lg_global_unlock(&files_lglock); | 
					
						
							| 
									
										
										
										
											2009-04-26 20:25:56 +10:00
										 |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | void __init files_init(unsigned long mempages) | 
					
						
							|  |  |  | {  | 
					
						
							| 
									
										
										
										
											2010-10-26 14:22:44 -07:00
										 |  |  | 	unsigned long n; | 
					
						
							| 
									
										
										
										
											2008-12-10 09:35:45 -08:00
										 |  |  | 
 | 
					
						
							|  |  |  | 	filp_cachep = kmem_cache_create("filp", sizeof(struct file), 0, | 
					
						
							|  |  |  | 			SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL); | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	/*
 | 
					
						
							|  |  |  | 	 * One file with associated inode and dcache is very roughly 1K. | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 	 * Per default don't use more than 10% of our memory for files.  | 
					
						
							|  |  |  | 	 */  | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	n = (mempages * (PAGE_SIZE / 1024)) / 10; | 
					
						
							| 
									
										
										
										
											2010-10-26 14:22:44 -07:00
										 |  |  | 	files_stat.max_files = max_t(unsigned long, n, NR_FILE); | 
					
						
							| 
									
										
										
										
											2005-09-09 13:04:13 -07:00
										 |  |  | 	files_defer_init(); | 
					
						
							| 
									
										
										
										
											2012-05-08 13:32:02 +09:30
										 |  |  | 	lg_lock_init(&files_lglock, "files_lglock"); | 
					
						
							| 
									
										
										
										
											2006-06-23 02:05:41 -07:00
										 |  |  | 	percpu_counter_init(&nr_files, 0); | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | }  |
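
The sizing rule in files_init() can be checked with a small standalone sketch. It assumes roughly 1K per open file (as the comment in the function states) and caps the default at 10% of memory, with a fixed floor. `demo_max_files` and `DEMO_NR_FILE` are hypothetical names. The value 8192 is assumed to match the kernel's NR_FILE, and the page size is passed as a parameter instead of using PAGE_SIZE.

```c
#define DEMO_NR_FILE 8192UL	/* assumed floor, mirroring NR_FILE */

/*
 * mempages:  number of memory pages available
 * page_size: page size in bytes (a multiple of 1024)
 *
 * Returns the default limit on open files: one tenth of memory
 * expressed in KiB (about 1K per file), but never below the floor.
 */
static unsigned long demo_max_files(unsigned long mempages,
				    unsigned long page_size)
{
	unsigned long n = (mempages * (page_size / 1024)) / 10;

	return n > DEMO_NR_FILE ? n : DEMO_NR_FILE;
}
```

For example, 262144 pages of 4096 bytes (1 GiB) is 1048576 KiB, so the default limit is 104857 files. A machine with only 1024 such pages would compute 409 and be raised to the floor of 8192.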