[PATCH] VFS: update documentation
This patch brings the now out-of-date Documentation/filesystems/vfs.txt back to life. Thanks to Carsten Otte, Trond Myklebust, and Anton Altaparmakov for their help on updating this documentation. Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This commit is contained in:
		
					parent
					
						
							
								952b649272
							
						
					
				
			
			
				commit
				
					
						5ea626aac1
					
				
			
		
					 1 changed files with 325 additions and 114 deletions
				
			
		|  | @ -1,35 +1,27 @@ | ||||||
| /* -*- auto-fill -*-                                                         */ |  | ||||||
| 
 | 
 | ||||||
| 		Overview of the Virtual File System | 	      Overview of the Linux Virtual File System | ||||||
| 
 | 
 | ||||||
| 		Richard Gooch <rgooch@atnf.csiro.au> | 	Original author: Richard Gooch <rgooch@atnf.csiro.au> | ||||||
| 
 | 
 | ||||||
| 			      5-JUL-1999 | 		  Last updated on August 25, 2005 | ||||||
|  | 
 | ||||||
|  |   Copyright (C) 1999 Richard Gooch | ||||||
|  |   Copyright (C) 2005 Pekka Enberg | ||||||
|  | 
 | ||||||
|  |   This file is released under the GPLv2. | ||||||
| 
 | 
 | ||||||
| 
 | 
 | ||||||
| Conventions used in this document                                     <section> | What is it? | ||||||
| ================================= |  | ||||||
| 
 |  | ||||||
| Each section in this document will have the string "<section>" at the |  | ||||||
| right-hand side of the section title. Each subsection will have |  | ||||||
| "<subsection>" at the right-hand side. These strings are meant to make |  | ||||||
| it easier to search through the document. |  | ||||||
| 
 |  | ||||||
| NOTE that the master copy of this document is available online at: |  | ||||||
| http://www.atnf.csiro.au/~rgooch/linux/docs/vfs.txt |  | ||||||
| 
 |  | ||||||
| 
 |  | ||||||
| What is it?                                                           <section> |  | ||||||
| =========== | =========== | ||||||
| 
 | 
 | ||||||
| The Virtual File System (otherwise known as the Virtual Filesystem | The Virtual File System (otherwise known as the Virtual Filesystem | ||||||
| Switch) is the software layer in the kernel that provides the | Switch) is the software layer in the kernel that provides the | ||||||
| filesystem interface to userspace programs. It also provides an | filesystem interface to userspace programs. It also provides an | ||||||
| abstraction within the kernel which allows different filesystem | abstraction within the kernel which allows different filesystem | ||||||
| implementations to co-exist. | implementations to coexist. | ||||||
| 
 | 
 | ||||||
| 
 | 
 | ||||||
| A Quick Look At How It Works                                          <section> | A Quick Look At How It Works | ||||||
| ============================ | ============================ | ||||||
| 
 | 
 | ||||||
| In this section I'll briefly describe how things work, before | In this section I'll briefly describe how things work, before | ||||||
|  | @ -38,7 +30,8 @@ when user programs open and manipulate files, and then look from the | ||||||
| other view which is how a filesystem is supported and subsequently | other view which is how a filesystem is supported and subsequently | ||||||
| mounted. | mounted. | ||||||
| 
 | 
 | ||||||
| Opening a File                                                     <subsection> | 
 | ||||||
|  | Opening a File | ||||||
| -------------- | -------------- | ||||||
| 
 | 
 | ||||||
| The VFS implements the open(2), stat(2), chmod(2) and similar system | The VFS implements the open(2), stat(2), chmod(2) and similar system | ||||||
|  | @ -77,7 +70,7 @@ back to userspace. | ||||||
| 
 | 
 | ||||||
| Opening a file requires another operation: allocation of a file | Opening a file requires another operation: allocation of a file | ||||||
| structure (this is the kernel-side implementation of file | structure (this is the kernel-side implementation of file | ||||||
| descriptors). The freshly allocated file structure is initialised with | descriptors). The freshly allocated file structure is initialized with | ||||||
| a pointer to the dentry and a set of file operation member functions. | a pointer to the dentry and a set of file operation member functions. | ||||||
| These are taken from the inode data. The open() file method is then | These are taken from the inode data. The open() file method is then | ||||||
| called so the specific filesystem implementation can do it's work. You | called so the specific filesystem implementation can do it's work. You | ||||||
|  | @ -102,7 +95,8 @@ filesystem or driver code at the same time, on different | ||||||
| processors. You should ensure that access to shared resources is | processors. You should ensure that access to shared resources is | ||||||
| protected by appropriate locks. | protected by appropriate locks. | ||||||
| 
 | 
 | ||||||
| Registering and Mounting a Filesystem                              <subsection> | 
 | ||||||
|  | Registering and Mounting a Filesystem | ||||||
| ------------------------------------- | ------------------------------------- | ||||||
| 
 | 
 | ||||||
| If you want to support a new kind of filesystem in the kernel, all you | If you want to support a new kind of filesystem in the kernel, all you | ||||||
|  | @ -123,17 +117,21 @@ updated to point to the root inode for the new filesystem. | ||||||
| It's now time to look at things in more detail. | It's now time to look at things in more detail. | ||||||
| 
 | 
 | ||||||
| 
 | 
 | ||||||
| struct file_system_type                                               <section> | struct file_system_type | ||||||
| ======================= | ======================= | ||||||
| 
 | 
 | ||||||
| This describes the filesystem. As of kernel 2.1.99, the following | This describes the filesystem. As of kernel 2.6.13, the following | ||||||
| members are defined: | members are defined: | ||||||
| 
 | 
 | ||||||
| struct file_system_type { | struct file_system_type { | ||||||
| 	const char *name; | 	const char *name; | ||||||
| 	int fs_flags; | 	int fs_flags; | ||||||
| 	struct super_block *(*read_super) (struct super_block *, void *, int); |         struct super_block *(*get_sb) (struct file_system_type *, int, | ||||||
| 	struct file_system_type * next; |                                        const char *, void *); | ||||||
|  |         void (*kill_sb) (struct super_block *); | ||||||
|  |         struct module *owner; | ||||||
|  |         struct file_system_type * next; | ||||||
|  |         struct list_head fs_supers; | ||||||
| }; | }; | ||||||
| 
 | 
 | ||||||
|   name: the name of the filesystem type, such as "ext2", "iso9660", |   name: the name of the filesystem type, such as "ext2", "iso9660", | ||||||
|  | @ -141,51 +139,97 @@ struct file_system_type { | ||||||
| 
 | 
 | ||||||
|   fs_flags: various flags (i.e. FS_REQUIRES_DEV, FS_NO_DCACHE, etc.) |   fs_flags: various flags (i.e. FS_REQUIRES_DEV, FS_NO_DCACHE, etc.) | ||||||
| 
 | 
 | ||||||
|   read_super: the method to call when a new instance of this |   get_sb: the method to call when a new instance of this | ||||||
| 	filesystem should be mounted | 	filesystem should be mounted | ||||||
| 
 | 
 | ||||||
|   next: for internal VFS use: you should initialise this to NULL |   kill_sb: the method to call when an instance of this filesystem | ||||||
|  | 	should be unmounted | ||||||
| 
 | 
 | ||||||
| The read_super() method has the following arguments: |   owner: for internal VFS use: you should initialize this to THIS_MODULE in | ||||||
|  |   	most cases. | ||||||
|  | 
 | ||||||
|  |   next: for internal VFS use: you should initialize this to NULL | ||||||
|  | 
 | ||||||
|  | The get_sb() method has the following arguments: | ||||||
| 
 | 
 | ||||||
|   struct super_block *sb: the superblock structure. This is partially |   struct super_block *sb: the superblock structure. This is partially | ||||||
| 	initialised by the VFS and the rest must be initialised by the | 	initialized by the VFS and the rest must be initialized by the | ||||||
| 	read_super() method | 	get_sb() method | ||||||
|  | 
 | ||||||
|  |   int flags: mount flags | ||||||
|  | 
 | ||||||
|  |   const char *dev_name: the device name we are mounting. | ||||||
| 
 | 
 | ||||||
|   void *data: arbitrary mount options, usually comes as an ASCII |   void *data: arbitrary mount options, usually comes as an ASCII | ||||||
| 	string | 	string | ||||||
| 
 | 
 | ||||||
|   int silent: whether or not to be silent on error |   int silent: whether or not to be silent on error | ||||||
| 
 | 
 | ||||||
| The read_super() method must determine if the block device specified | The get_sb() method must determine if the block device specified | ||||||
| in the superblock contains a filesystem of the type the method | in the superblock contains a filesystem of the type the method | ||||||
| supports. On success the method returns the superblock pointer, on | supports. On success the method returns the superblock pointer, on | ||||||
| failure it returns NULL. | failure it returns NULL. | ||||||
| 
 | 
 | ||||||
| The most interesting member of the superblock structure that the | The most interesting member of the superblock structure that the | ||||||
| read_super() method fills in is the "s_op" field. This is a pointer to | get_sb() method fills in is the "s_op" field. This is a pointer to | ||||||
| a "struct super_operations" which describes the next level of the | a "struct super_operations" which describes the next level of the | ||||||
| filesystem implementation. | filesystem implementation. | ||||||
| 
 | 
 | ||||||
|  | Usually, a filesystem uses generic one of the generic get_sb() | ||||||
|  | implementations and provides a fill_super() method instead. The | ||||||
|  | generic methods are: | ||||||
| 
 | 
 | ||||||
| struct super_operations                                               <section> |   get_sb_bdev: mount a filesystem residing on a block device | ||||||
|  | 
 | ||||||
|  |   get_sb_nodev: mount a filesystem that is not backed by a device | ||||||
|  | 
 | ||||||
|  |   get_sb_single: mount a filesystem which shares the instance between | ||||||
|  |   	all mounts | ||||||
|  | 
 | ||||||
|  | A fill_super() method implementation has the following arguments: | ||||||
|  | 
 | ||||||
|  |   struct super_block *sb: the superblock structure. The method fill_super() | ||||||
|  |   	must initialize this properly. | ||||||
|  | 
 | ||||||
|  |   void *data: arbitrary mount options, usually comes as an ASCII | ||||||
|  | 	string | ||||||
|  | 
 | ||||||
|  |   int silent: whether or not to be silent on error | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
|  | struct super_operations | ||||||
| ======================= | ======================= | ||||||
| 
 | 
 | ||||||
| This describes how the VFS can manipulate the superblock of your | This describes how the VFS can manipulate the superblock of your | ||||||
| filesystem. As of kernel 2.1.99, the following members are defined: | filesystem. As of kernel 2.6.13, the following members are defined: | ||||||
| 
 | 
 | ||||||
| struct super_operations { | struct super_operations { | ||||||
| 	void (*read_inode) (struct inode *); |         struct inode *(*alloc_inode)(struct super_block *sb); | ||||||
| 	int (*write_inode) (struct inode *, int); |         void (*destroy_inode)(struct inode *); | ||||||
| 	void (*put_inode) (struct inode *); | 
 | ||||||
| 	void (*drop_inode) (struct inode *); |         void (*read_inode) (struct inode *); | ||||||
| 	void (*delete_inode) (struct inode *); | 
 | ||||||
| 	int (*notify_change) (struct dentry *, struct iattr *); |         void (*dirty_inode) (struct inode *); | ||||||
| 	void (*put_super) (struct super_block *); |         int (*write_inode) (struct inode *, int); | ||||||
| 	void (*write_super) (struct super_block *); |         void (*put_inode) (struct inode *); | ||||||
| 	int (*statfs) (struct super_block *, struct statfs *, int); |         void (*drop_inode) (struct inode *); | ||||||
| 	int (*remount_fs) (struct super_block *, int *, char *); |         void (*delete_inode) (struct inode *); | ||||||
| 	void (*clear_inode) (struct inode *); |         void (*put_super) (struct super_block *); | ||||||
|  |         void (*write_super) (struct super_block *); | ||||||
|  |         int (*sync_fs)(struct super_block *sb, int wait); | ||||||
|  |         void (*write_super_lockfs) (struct super_block *); | ||||||
|  |         void (*unlockfs) (struct super_block *); | ||||||
|  |         int (*statfs) (struct super_block *, struct kstatfs *); | ||||||
|  |         int (*remount_fs) (struct super_block *, int *, char *); | ||||||
|  |         void (*clear_inode) (struct inode *); | ||||||
|  |         void (*umount_begin) (struct super_block *); | ||||||
|  | 
 | ||||||
|  |         void (*sync_inodes) (struct super_block *sb, | ||||||
|  |                                 struct writeback_control *wbc); | ||||||
|  |         int (*show_options)(struct seq_file *, struct vfsmount *); | ||||||
|  | 
 | ||||||
|  |         ssize_t (*quota_read)(struct super_block *, int, char *, size_t, loff_t); | ||||||
|  |         ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t); | ||||||
| }; | }; | ||||||
| 
 | 
 | ||||||
| All methods are called without any locks being held, unless otherwise | All methods are called without any locks being held, unless otherwise | ||||||
|  | @ -193,43 +237,62 @@ noted. This means that most methods can block safely. All methods are | ||||||
| only called from a process context (i.e. not from an interrupt handler | only called from a process context (i.e. not from an interrupt handler | ||||||
| or bottom half). | or bottom half). | ||||||
| 
 | 
 | ||||||
|  |   alloc_inode: this method is called by inode_alloc() to allocate memory | ||||||
|  |  	for struct inode and initialize it. | ||||||
|  | 
 | ||||||
|  |   destroy_inode: this method is called by destroy_inode() to release | ||||||
|  |   	resources allocated for struct inode. | ||||||
|  | 
 | ||||||
|   read_inode: this method is called to read a specific inode from the |   read_inode: this method is called to read a specific inode from the | ||||||
| 	mounted filesystem. The "i_ino" member in the "struct inode" |         mounted filesystem.  The i_ino member in the struct inode is | ||||||
| 	will be initialised by the VFS to indicate which inode to | 	initialized by the VFS to indicate which inode to read. Other | ||||||
| 	read. Other members are filled in by this method | 	members are filled in by this method. | ||||||
|  | 
 | ||||||
|  | 	You can set this to NULL and use iget5_locked() instead of iget() | ||||||
|  | 	to read inodes.  This is necessary for filesystems for which the | ||||||
|  | 	inode number is not sufficient to identify an inode. | ||||||
|  | 
 | ||||||
|  |   dirty_inode: this method is called by the VFS to mark an inode dirty. | ||||||
| 
 | 
 | ||||||
|   write_inode: this method is called when the VFS needs to write an |   write_inode: this method is called when the VFS needs to write an | ||||||
| 	inode to disc.  The second parameter indicates whether the write | 	inode to disc.  The second parameter indicates whether the write | ||||||
| 	should be synchronous or not, not all filesystems check this flag. | 	should be synchronous or not, not all filesystems check this flag. | ||||||
| 
 | 
 | ||||||
|   put_inode: called when the VFS inode is removed from the inode |   put_inode: called when the VFS inode is removed from the inode | ||||||
| 	cache. This method is optional | 	cache. | ||||||
| 
 | 
 | ||||||
|   drop_inode: called when the last access to the inode is dropped, |   drop_inode: called when the last access to the inode is dropped, | ||||||
| 	with the inode_lock spinlock held. | 	with the inode_lock spinlock held. | ||||||
| 
 | 
 | ||||||
| 	This method should be either NULL (normal unix filesystem | 	This method should be either NULL (normal UNIX filesystem | ||||||
| 	semantics) or "generic_delete_inode" (for filesystems that do not | 	semantics) or "generic_delete_inode" (for filesystems that do not | ||||||
| 	want to cache inodes - causing "delete_inode" to always be | 	want to cache inodes - causing "delete_inode" to always be | ||||||
| 	called regardless of the value of i_nlink) | 	called regardless of the value of i_nlink) | ||||||
| 
 | 
 | ||||||
| 	The "generic_delete_inode()" behaviour is equivalent to the | 	The "generic_delete_inode()" behavior is equivalent to the | ||||||
| 	old practice of using "force_delete" in the put_inode() case, | 	old practice of using "force_delete" in the put_inode() case, | ||||||
| 	but does not have the races that the "force_delete()" approach | 	but does not have the races that the "force_delete()" approach | ||||||
| 	had.  | 	had.  | ||||||
| 
 | 
 | ||||||
|   delete_inode: called when the VFS wants to delete an inode |   delete_inode: called when the VFS wants to delete an inode | ||||||
| 
 | 
 | ||||||
|   notify_change: called when VFS inode attributes are changed. If this |  | ||||||
| 	is NULL the VFS falls back to the write_inode() method. This |  | ||||||
| 	is called with the kernel lock held |  | ||||||
| 
 |  | ||||||
|   put_super: called when the VFS wishes to free the superblock |   put_super: called when the VFS wishes to free the superblock | ||||||
| 	(i.e. unmount). This is called with the superblock lock held | 	(i.e. unmount). This is called with the superblock lock held | ||||||
| 
 | 
 | ||||||
|   write_super: called when the VFS superblock needs to be written to |   write_super: called when the VFS superblock needs to be written to | ||||||
| 	disc. This method is optional | 	disc. This method is optional | ||||||
| 
 | 
 | ||||||
|  |   sync_fs: called when VFS is writing out all dirty data associated with | ||||||
|  |   	a superblock. The second parameter indicates whether the method | ||||||
|  | 	should wait until the write out has been completed. Optional. | ||||||
|  | 
 | ||||||
|  |   write_super_lockfs: called when VFS is locking a filesystem and forcing | ||||||
|  |   	it into a consistent state.  This function is currently used by the | ||||||
|  | 	Logical Volume Manager (LVM). | ||||||
|  | 
 | ||||||
|  |   unlockfs: called when VFS is unlocking a filesystem and making it writable | ||||||
|  |   	again. | ||||||
|  | 
 | ||||||
|   statfs: called when the VFS needs to get filesystem statistics. This |   statfs: called when the VFS needs to get filesystem statistics. This | ||||||
| 	is called with the kernel lock held | 	is called with the kernel lock held | ||||||
| 
 | 
 | ||||||
|  | @ -238,21 +301,31 @@ or bottom half). | ||||||
| 
 | 
 | ||||||
|   clear_inode: called then the VFS clears the inode. Optional |   clear_inode: called then the VFS clears the inode. Optional | ||||||
| 
 | 
 | ||||||
|  |   umount_begin: called when the VFS is unmounting a filesystem. | ||||||
|  | 
 | ||||||
|  |   sync_inodes: called when the VFS is writing out dirty data associated with | ||||||
|  |   	a superblock. | ||||||
|  | 
 | ||||||
|  |   show_options: called by the VFS to show mount options for /proc/<pid>/mounts. | ||||||
|  | 
 | ||||||
|  |   quota_read: called by the VFS to read from filesystem quota file. | ||||||
|  | 
 | ||||||
|  |   quota_write: called by the VFS to write to filesystem quota file. | ||||||
|  | 
 | ||||||
| The read_inode() method is responsible for filling in the "i_op" | The read_inode() method is responsible for filling in the "i_op" | ||||||
| field. This is a pointer to a "struct inode_operations" which | field. This is a pointer to a "struct inode_operations" which | ||||||
| describes the methods that can be performed on individual inodes. | describes the methods that can be performed on individual inodes. | ||||||
| 
 | 
 | ||||||
| 
 | 
 | ||||||
| struct inode_operations                                               <section> | struct inode_operations | ||||||
| ======================= | ======================= | ||||||
| 
 | 
 | ||||||
| This describes how the VFS can manipulate an inode in your | This describes how the VFS can manipulate an inode in your | ||||||
| filesystem. As of kernel 2.1.99, the following members are defined: | filesystem. As of kernel 2.6.13, the following members are defined: | ||||||
| 
 | 
 | ||||||
| struct inode_operations { | struct inode_operations { | ||||||
| 	struct file_operations * default_file_ops; | 	int (*create) (struct inode *,struct dentry *,int, struct nameidata *); | ||||||
| 	int (*create) (struct inode *,struct dentry *,int); | 	struct dentry * (*lookup) (struct inode *,struct dentry *, struct nameidata *); | ||||||
| 	int (*lookup) (struct inode *,struct dentry *); |  | ||||||
| 	int (*link) (struct dentry *,struct inode *,struct dentry *); | 	int (*link) (struct dentry *,struct inode *,struct dentry *); | ||||||
| 	int (*unlink) (struct inode *,struct dentry *); | 	int (*unlink) (struct inode *,struct dentry *); | ||||||
| 	int (*symlink) (struct inode *,struct dentry *,const char *); | 	int (*symlink) (struct inode *,struct dentry *,const char *); | ||||||
|  | @ -261,25 +334,22 @@ struct inode_operations { | ||||||
| 	int (*mknod) (struct inode *,struct dentry *,int,dev_t); | 	int (*mknod) (struct inode *,struct dentry *,int,dev_t); | ||||||
| 	int (*rename) (struct inode *, struct dentry *, | 	int (*rename) (struct inode *, struct dentry *, | ||||||
| 			struct inode *, struct dentry *); | 			struct inode *, struct dentry *); | ||||||
| 	int (*readlink) (struct dentry *, char *,int); | 	int (*readlink) (struct dentry *, char __user *,int); | ||||||
| 	struct dentry * (*follow_link) (struct dentry *, struct dentry *); |         void * (*follow_link) (struct dentry *, struct nameidata *); | ||||||
| 	int (*readpage) (struct file *, struct page *); |         void (*put_link) (struct dentry *, struct nameidata *, void *); | ||||||
| 	int (*writepage) (struct page *page, struct writeback_control *wbc); |  | ||||||
| 	int (*bmap) (struct inode *,int); |  | ||||||
| 	void (*truncate) (struct inode *); | 	void (*truncate) (struct inode *); | ||||||
| 	int (*permission) (struct inode *, int); | 	int (*permission) (struct inode *, int, struct nameidata *); | ||||||
| 	int (*smap) (struct inode *,int); | 	int (*setattr) (struct dentry *, struct iattr *); | ||||||
| 	int (*updatepage) (struct file *, struct page *, const char *, | 	int (*getattr) (struct vfsmount *mnt, struct dentry *, struct kstat *); | ||||||
| 				unsigned long, unsigned int, int); | 	int (*setxattr) (struct dentry *, const char *,const void *,size_t,int); | ||||||
| 	int (*revalidate) (struct dentry *); | 	ssize_t (*getxattr) (struct dentry *, const char *, void *, size_t); | ||||||
|  | 	ssize_t (*listxattr) (struct dentry *, char *, size_t); | ||||||
|  | 	int (*removexattr) (struct dentry *, const char *); | ||||||
| }; | }; | ||||||
| 
 | 
 | ||||||
| Again, all methods are called without any locks being held, unless | Again, all methods are called without any locks being held, unless | ||||||
| otherwise noted. | otherwise noted. | ||||||
| 
 | 
 | ||||||
|   default_file_ops: this is a pointer to a "struct file_operations" |  | ||||||
| 	which describes how to open and then manipulate open files |  | ||||||
| 
 |  | ||||||
|   create: called by the open(2) and creat(2) system calls. Only |   create: called by the open(2) and creat(2) system calls. Only | ||||||
| 	required if you want to support regular files. The dentry you | 	required if you want to support regular files. The dentry you | ||||||
| 	get should not have an inode (i.e. it should be a negative | 	get should not have an inode (i.e. it should be a negative | ||||||
|  | @ -328,31 +398,143 @@ otherwise noted. | ||||||
| 	you want to support reading symbolic links | 	you want to support reading symbolic links | ||||||
| 
 | 
 | ||||||
|   follow_link: called by the VFS to follow a symbolic link to the |   follow_link: called by the VFS to follow a symbolic link to the | ||||||
| 	inode it points to. Only required if you want to support | 	inode it points to.  Only required if you want to support | ||||||
| 	symbolic links | 	symbolic links.  This function returns a void pointer cookie | ||||||
|  | 	that is passed to put_link(). | ||||||
|  | 
 | ||||||
|  |   put_link: called by the VFS to release resources allocated by | ||||||
|  |   	follow_link().  The cookie returned by follow_link() is passed to | ||||||
|  | 	to this function as the last parameter.  It is used by filesystems | ||||||
|  | 	such as NFS where page cache is not stable (i.e. page that was | ||||||
|  | 	installed when the symbolic link walk started might not be in the | ||||||
|  | 	page cache at the end of the walk). | ||||||
|  | 
 | ||||||
|  |   truncate: called by the VFS to change the size of a file.  The i_size | ||||||
|  |  	field of the inode is set to the desired size by the VFS before | ||||||
|  | 	this function is called.  This function is called by the truncate(2) | ||||||
|  | 	system call and related functionality. | ||||||
|  | 
 | ||||||
|  |   permission: called by the VFS to check for access rights on a POSIX-like | ||||||
|  |   	filesystem. | ||||||
|  | 
 | ||||||
|  |   setattr: called by the VFS to set attributes for a file.  This function is | ||||||
|  |   	called by chmod(2) and related system calls. | ||||||
|  | 
 | ||||||
|  |   getattr: called by the VFS to get attributes of a file.  This function is | ||||||
|  |   	called by stat(2) and related system calls. | ||||||
|  | 
 | ||||||
|  |   setxattr: called by the VFS to set an extended attribute for a file. | ||||||
|  |   	Extended attribute is a name:value pair associated with an inode. This | ||||||
|  | 	function is called by setxattr(2) system call. | ||||||
|  | 
 | ||||||
|  |   getxattr: called by the VFS to retrieve the value of an extended attribute | ||||||
|  |   	name.  This function is called by getxattr(2) function call. | ||||||
|  | 
 | ||||||
|  |   listxattr: called by the VFS to list all extended attributes for a given | ||||||
|  |   	file.  This function is called by listxattr(2) system call. | ||||||
|  | 
 | ||||||
|  |   removexattr: called by the VFS to remove an extended attribute from a file. | ||||||
|  |   	This function is called by removexattr(2) system call. | ||||||
| 
 | 
 | ||||||
| 
 | 
 | ||||||
| struct file_operations                                                <section> | struct address_space_operations | ||||||
|  | =============================== | ||||||
|  | 
 | ||||||
|  | This describes how the VFS can manipulate mapping of a file to page cache in | ||||||
|  | your filesystem. As of kernel 2.6.13, the following members are defined: | ||||||
|  | 
 | ||||||
|  | struct address_space_operations { | ||||||
|  | 	int (*writepage)(struct page *page, struct writeback_control *wbc); | ||||||
|  | 	int (*readpage)(struct file *, struct page *); | ||||||
|  | 	int (*sync_page)(struct page *); | ||||||
|  | 	int (*writepages)(struct address_space *, struct writeback_control *); | ||||||
|  | 	int (*set_page_dirty)(struct page *page); | ||||||
|  | 	int (*readpages)(struct file *filp, struct address_space *mapping, | ||||||
|  | 			struct list_head *pages, unsigned nr_pages); | ||||||
|  | 	int (*prepare_write)(struct file *, struct page *, unsigned, unsigned); | ||||||
|  | 	int (*commit_write)(struct file *, struct page *, unsigned, unsigned); | ||||||
|  | 	sector_t (*bmap)(struct address_space *, sector_t); | ||||||
|  | 	int (*invalidatepage) (struct page *, unsigned long); | ||||||
|  | 	int (*releasepage) (struct page *, int); | ||||||
|  | 	ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec *iov, | ||||||
|  | 			loff_t offset, unsigned long nr_segs); | ||||||
|  | 	struct page* (*get_xip_page)(struct address_space *, sector_t, | ||||||
|  | 			int); | ||||||
|  | }; | ||||||
|  | 
 | ||||||
|  |   writepage: called by the VM write a dirty page to backing store. | ||||||
|  | 
 | ||||||
|  |   readpage: called by the VM to read a page from backing store. | ||||||
|  | 
 | ||||||
|  |   sync_page: called by the VM to notify the backing store to perform all | ||||||
|  |   	queued I/O operations for a page. I/O operations for other pages | ||||||
|  | 	associated with this address_space object may also be performed. | ||||||
|  | 
 | ||||||
|  |   writepages: called by the VM to write out pages associated with the | ||||||
|  |   	address_space object. | ||||||
|  | 
 | ||||||
|  |   set_page_dirty: called by the VM to set a page dirty. | ||||||
|  | 
 | ||||||
|  |   readpages: called by the VM to read pages associated with the address_space | ||||||
|  |   	object. | ||||||
|  | 
 | ||||||
|  |   prepare_write: called by the generic write path in VM to set up a write | ||||||
|  |   	request for a page. | ||||||
|  | 
 | ||||||
|  |   commit_write: called by the generic write path in VM to write page to | ||||||
|  |   	its backing store. | ||||||
|  | 
 | ||||||
|  |   bmap: called by the VFS to map a logical block offset within object to | ||||||
|  |   	physical block number. This method is use by for the legacy FIBMAP | ||||||
|  | 	ioctl. Other uses are discouraged. | ||||||
|  | 
 | ||||||
|  |   invalidatepage: called by the VM on truncate to disassociate a page from its | ||||||
|  |   	address_space mapping. | ||||||
|  | 
 | ||||||
|  |   releasepage: called by the VFS to release filesystem specific metadata from | ||||||
|  |   	a page. | ||||||
|  | 
 | ||||||
|  |   direct_IO: called by the VM for direct I/O writes and reads. | ||||||
|  | 
 | ||||||
|  |   get_xip_page: called by the VM to translate a block number to a page. | ||||||
|  | 	The page is valid until the corresponding filesystem is unmounted. | ||||||
|  | 	Filesystems that want to use execute-in-place (XIP) need to implement | ||||||
|  | 	it.  An example implementation can be found in fs/ext2/xip.c. | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
|  | struct file_operations | ||||||
| ====================== | ====================== | ||||||
| 
 | 
 | ||||||
| This describes how the VFS can manipulate an open file. As of kernel | This describes how the VFS can manipulate an open file. As of kernel | ||||||
| 2.1.99, the following members are defined: | 2.6.13, the following members are defined: | ||||||
| 
 | 
 | ||||||
| struct file_operations { | struct file_operations { | ||||||
| 	loff_t (*llseek) (struct file *, loff_t, int); | 	loff_t (*llseek) (struct file *, loff_t, int); | ||||||
| 	ssize_t (*read) (struct file *, char *, size_t, loff_t *); | 	ssize_t (*read) (struct file *, char __user *, size_t, loff_t *); | ||||||
| 	ssize_t (*write) (struct file *, const char *, size_t, loff_t *); | 	ssize_t (*aio_read) (struct kiocb *, char __user *, size_t, loff_t); | ||||||
|  | 	ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *); | ||||||
|  | 	ssize_t (*aio_write) (struct kiocb *, const char __user *, size_t, loff_t); | ||||||
| 	int (*readdir) (struct file *, void *, filldir_t); | 	int (*readdir) (struct file *, void *, filldir_t); | ||||||
| 	unsigned int (*poll) (struct file *, struct poll_table_struct *); | 	unsigned int (*poll) (struct file *, struct poll_table_struct *); | ||||||
| 	int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long); | 	int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long); | ||||||
|  | 	long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long); | ||||||
|  | 	long (*compat_ioctl) (struct file *, unsigned int, unsigned long); | ||||||
| 	int (*mmap) (struct file *, struct vm_area_struct *); | 	int (*mmap) (struct file *, struct vm_area_struct *); | ||||||
| 	int (*open) (struct inode *, struct file *); | 	int (*open) (struct inode *, struct file *); | ||||||
|  | 	int (*flush) (struct file *); | ||||||
| 	int (*release) (struct inode *, struct file *); | 	int (*release) (struct inode *, struct file *); | ||||||
| 	int (*fsync) (struct file *, struct dentry *); | 	int (*fsync) (struct file *, struct dentry *, int datasync); | ||||||
| 	int (*fasync) (struct file *, int); | 	int (*aio_fsync) (struct kiocb *, int datasync); | ||||||
| 	int (*check_media_change) (kdev_t dev); | 	int (*fasync) (int, struct file *, int); | ||||||
| 	int (*revalidate) (kdev_t dev); |  | ||||||
| 	int (*lock) (struct file *, int, struct file_lock *); | 	int (*lock) (struct file *, int, struct file_lock *); | ||||||
|  | 	ssize_t (*readv) (struct file *, const struct iovec *, unsigned long, loff_t *); | ||||||
|  | 	ssize_t (*writev) (struct file *, const struct iovec *, unsigned long, loff_t *); | ||||||
|  | 	ssize_t (*sendfile) (struct file *, loff_t *, size_t, read_actor_t, void *); | ||||||
|  | 	ssize_t (*sendpage) (struct file *, struct page *, int, size_t, loff_t *, int); | ||||||
|  | 	unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long); | ||||||
|  | 	int (*check_flags)(int); | ||||||
|  | 	int (*dir_notify)(struct file *filp, unsigned long arg); | ||||||
|  | 	int (*flock) (struct file *, int, struct file_lock *); | ||||||
| }; | }; | ||||||
| 
 | 
 | ||||||
| Again, all methods are called without any locks being held, unless | Again, all methods are called without any locks being held, unless | ||||||
|  | @ -362,8 +544,12 @@ otherwise noted. | ||||||
| 
 | 
 | ||||||
|   read: called by read(2) and related system calls |   read: called by read(2) and related system calls | ||||||
| 
 | 
 | ||||||
|  |   aio_read: called by io_submit(2) and other asynchronous I/O operations | ||||||
|  | 
 | ||||||
|   write: called by write(2) and related system calls |   write: called by write(2) and related system calls | ||||||
| 
 | 
 | ||||||
|  |   aio_write: called by io_submit(2) and other asynchronous I/O operations | ||||||
|  | 
 | ||||||
|   readdir: called when the VFS needs to read the directory contents |   readdir: called when the VFS needs to read the directory contents | ||||||
| 
 | 
 | ||||||
|   poll: called by the VFS when a process wants to check if there is |   poll: called by the VFS when a process wants to check if there is | ||||||
|  | @ -372,18 +558,25 @@ otherwise noted. | ||||||
| 
 | 
 | ||||||
|   ioctl: called by the ioctl(2) system call |   ioctl: called by the ioctl(2) system call | ||||||
| 
 | 
 | ||||||
|  |   unlocked_ioctl: called by the ioctl(2) system call. Filesystems that do not | ||||||
|  |   	require the BKL should use this method instead of the ioctl() above. | ||||||
|  | 
 | ||||||
|  |   compat_ioctl: called by the ioctl(2) system call when 32 bit system calls | ||||||
|  |  	 are used on 64 bit kernels. | ||||||
|  | 
 | ||||||
|   mmap: called by the mmap(2) system call |   mmap: called by the mmap(2) system call | ||||||
| 
 | 
 | ||||||
|   open: called by the VFS when an inode should be opened. When the VFS |   open: called by the VFS when an inode should be opened. When the VFS | ||||||
| 	opens a file, it creates a new "struct file" and initialises | 	opens a file, it creates a new "struct file". It then calls the | ||||||
| 	the "f_op" file operations member with the "default_file_ops" | 	open method for the newly allocated file structure. You might | ||||||
| 	field in the inode structure. It then calls the open method | 	think that the open method really belongs in | ||||||
| 	for the newly allocated file structure. You might think that | 	"struct inode_operations", and you may be right. I think it's | ||||||
| 	the open method really belongs in "struct inode_operations", | 	done the way it is because it makes filesystems simpler to | ||||||
| 	and you may be right. I think it's done the way it is because | 	implement. The open() method is a good place to initialize the | ||||||
| 	it makes filesystems simpler to implement. The open() method | 	"private_data" member in the file structure if you want to point | ||||||
| 	is a good place to initialise the "private_data" member in the | 	to a device structure | ||||||
| 	file structure if you want to point to a device structure | 
 | ||||||
|  |   flush: called by the close(2) system call to flush a file | ||||||
| 
 | 
 | ||||||
|   release: called when the last reference to an open file is closed |   release: called when the last reference to an open file is closed | ||||||
| 
 | 
 | ||||||
|  | @ -392,6 +585,23 @@ otherwise noted. | ||||||
|   fasync: called by the fcntl(2) system call when asynchronous |   fasync: called by the fcntl(2) system call when asynchronous | ||||||
| 	(non-blocking) mode is enabled for a file | 	(non-blocking) mode is enabled for a file | ||||||
| 
 | 
 | ||||||
|  |   lock: called by the fcntl(2) system call for F_GETLK, F_SETLK, and F_SETLKW | ||||||
|  |   	commands | ||||||
|  | 
 | ||||||
|  |   readv: called by the readv(2) system call | ||||||
|  | 
 | ||||||
|  |   writev: called by the writev(2) system call | ||||||
|  | 
 | ||||||
|  |   sendfile: called by the sendfile(2) system call | ||||||
|  | 
 | ||||||
|  |   get_unmapped_area: called by the mmap(2) system call | ||||||
|  | 
 | ||||||
|  |   check_flags: called by the fcntl(2) system call for F_SETFL command | ||||||
|  | 
 | ||||||
|  |   dir_notify: called by the fcntl(2) system call for F_NOTIFY command | ||||||
|  | 
 | ||||||
|  |   flock: called by the flock(2) system call | ||||||
|  | 
 | ||||||
| Note that the file operations are implemented by the specific | Note that the file operations are implemented by the specific | ||||||
| filesystem in which the inode resides. When opening a device node | filesystem in which the inode resides. When opening a device node | ||||||
| (character or block special) most filesystems will call special | (character or block special) most filesystems will call special | ||||||
|  | @ -400,29 +610,28 @@ driver information. These support routines replace the filesystem file | ||||||
| operations with those for the device driver, and then proceed to call | operations with those for the device driver, and then proceed to call | ||||||
| the new open() method for the file. This is how opening a device file | the new open() method for the file. This is how opening a device file | ||||||
| in the filesystem eventually ends up calling the device driver open() | in the filesystem eventually ends up calling the device driver open() | ||||||
| method. Note the devfs (the Device FileSystem) has a more direct path | method. | ||||||
| from device node to device driver (this is an unofficial kernel |  | ||||||
| patch). |  | ||||||
| 
 | 
 | ||||||
| 
 | 
 | ||||||
| Directory Entry Cache (dcache)                                        <section> | Directory Entry Cache (dcache) | ||||||
| ------------------------------ | ============================== | ||||||
|  | 
 | ||||||
| 
 | 
 | ||||||
| struct dentry_operations | struct dentry_operations | ||||||
| ======================== | ------------------------ | ||||||
| 
 | 
 | ||||||
| This describes how a filesystem can overload the standard dentry | This describes how a filesystem can overload the standard dentry | ||||||
| operations. Dentries and the dcache are the domain of the VFS and the | operations. Dentries and the dcache are the domain of the VFS and the | ||||||
| individual filesystem implementations. Device drivers have no business | individual filesystem implementations. Device drivers have no business | ||||||
| here. These methods may be set to NULL, as they are either optional or | here. These methods may be set to NULL, as they are either optional or | ||||||
| the VFS uses a default. As of kernel 2.1.99, the following members are | the VFS uses a default. As of kernel 2.6.13, the following members are | ||||||
| defined: | defined: | ||||||
| 
 | 
 | ||||||
| struct dentry_operations { | struct dentry_operations { | ||||||
| 	int (*d_revalidate)(struct dentry *); | 	int (*d_revalidate)(struct dentry *, struct nameidata *); | ||||||
| 	int (*d_hash) (struct dentry *, struct qstr *); | 	int (*d_hash) (struct dentry *, struct qstr *); | ||||||
| 	int (*d_compare) (struct dentry *, struct qstr *, struct qstr *); | 	int (*d_compare) (struct dentry *, struct qstr *, struct qstr *); | ||||||
| 	void (*d_delete)(struct dentry *); | 	int (*d_delete)(struct dentry *); | ||||||
| 	void (*d_release)(struct dentry *); | 	void (*d_release)(struct dentry *); | ||||||
| 	void (*d_iput)(struct dentry *, struct inode *); | 	void (*d_iput)(struct dentry *, struct inode *); | ||||||
| }; | }; | ||||||
|  | @ -451,6 +660,7 @@ Each dentry has a pointer to its parent dentry, as well as a hash list | ||||||
| of child dentries. Child dentries are basically like files in a | of child dentries. Child dentries are basically like files in a | ||||||
| directory. | directory. | ||||||
| 
 | 
 | ||||||
|  | 
 | ||||||
| Directory Entry Cache APIs | Directory Entry Cache APIs | ||||||
| -------------------------- | -------------------------- | ||||||
| 
 | 
 | ||||||
|  | @ -471,7 +681,7 @@ manipulate dentries: | ||||||
| 	"d_delete" method is called | 	"d_delete" method is called | ||||||
| 
 | 
 | ||||||
|   d_drop: this unhashes a dentry from its parents hash list. A |   d_drop: this unhashes a dentry from its parents hash list. A | ||||||
| 	subsequent call to dput() will dellocate the dentry if its | 	subsequent call to dput() will deallocate the dentry if its | ||||||
| 	usage count drops to 0 | 	usage count drops to 0 | ||||||
| 
 | 
 | ||||||
|   d_delete: delete a dentry. If there are no other open references to |   d_delete: delete a dentry. If there are no other open references to | ||||||
|  | @ -507,16 +717,16 @@ up by walking the tree starting with the first component | ||||||
| of the pathname and using that dentry along with the next | of the pathname and using that dentry along with the next | ||||||
| component to look up the next level and so on. Since it | component to look up the next level and so on. Since it | ||||||
| is a frequent operation for workloads like multiuser | is a frequent operation for workloads like multiuser | ||||||
| environments and webservers, it is important to optimize | environments and web servers, it is important to optimize | ||||||
| this path. | this path. | ||||||
| 
 | 
 | ||||||
| Prior to 2.5.10, dcache_lock was acquired in d_lookup and thus | Prior to 2.5.10, dcache_lock was acquired in d_lookup and thus | ||||||
| in every component during path look-up. Since 2.5.10 onwards, | in every component during path look-up. Since 2.5.10 onwards, | ||||||
| fastwalk algorithm changed this by holding the dcache_lock | fast-walk algorithm changed this by holding the dcache_lock | ||||||
| at the beginning and walking as many cached path component | at the beginning and walking as many cached path component | ||||||
| dentries as possible. This signficantly decreases the number | dentries as possible. This significantly decreases the number | ||||||
| of acquisition of dcache_lock. However it also increases the | of acquisition of dcache_lock. However it also increases the | ||||||
| lock hold time signficantly and affects performance in large | lock hold time significantly and affects performance in large | ||||||
| SMP machines. Since 2.5.62 kernel, dcache has been using | SMP machines. Since 2.5.62 kernel, dcache has been using | ||||||
| a new locking model that uses RCU to make dcache look-up | a new locking model that uses RCU to make dcache look-up | ||||||
| lock-free. | lock-free. | ||||||
|  | @ -527,7 +737,7 @@ protected the hash chain, d_child, d_alias, d_lru lists as well | ||||||
| as d_inode and several other things like mount look-up. RCU-based | as d_inode and several other things like mount look-up. RCU-based | ||||||
| changes affect only the way the hash chain is protected. For everything | changes affect only the way the hash chain is protected. For everything | ||||||
| else the dcache_lock must be taken for both traversing as well as | else the dcache_lock must be taken for both traversing as well as | ||||||
| updating. The hash chain updations too take the dcache_lock. | updating. The hash chain updates too take the dcache_lock. | ||||||
| The significant change is the way d_lookup traverses the hash chain, | The significant change is the way d_lookup traverses the hash chain, | ||||||
| it doesn't acquire the dcache_lock for this and rely on RCU to | it doesn't acquire the dcache_lock for this and rely on RCU to | ||||||
| ensure that the dentry has not been *freed*. | ensure that the dentry has not been *freed*. | ||||||
|  | @ -535,14 +745,15 @@ ensure that the dentry has not been *freed*. | ||||||
| 
 | 
 | ||||||
| Dcache locking details | Dcache locking details | ||||||
| ---------------------- | ---------------------- | ||||||
|  | 
 | ||||||
| For many multi-user workloads, open() and stat() on files are | For many multi-user workloads, open() and stat() on files are | ||||||
| very frequently occurring operations. Both involve walking | very frequently occurring operations. Both involve walking | ||||||
| of path names to find the dentry corresponding to the | of path names to find the dentry corresponding to the | ||||||
| concerned file. In 2.4 kernel, dcache_lock was held | concerned file. In 2.4 kernel, dcache_lock was held | ||||||
| during look-up of each path component. Contention and | during look-up of each path component. Contention and | ||||||
| cacheline bouncing of this global lock caused significant | cache-line bouncing of this global lock caused significant | ||||||
| scalability problems. With the introduction of RCU | scalability problems. With the introduction of RCU | ||||||
| in linux kernel, this was worked around by making | in Linux kernel, this was worked around by making | ||||||
| the look-up of path components during path walking lock-free. | the look-up of path components during path walking lock-free. | ||||||
| 
 | 
 | ||||||
| 
 | 
 | ||||||
|  | @ -562,7 +773,7 @@ Some of the important changes are : | ||||||
| 2. Insertion of a dentry into the hash table is done using | 2. Insertion of a dentry into the hash table is done using | ||||||
|    hlist_add_head_rcu() which take care of ordering the writes - |    hlist_add_head_rcu() which take care of ordering the writes - | ||||||
|    the writes to the dentry must be visible before the dentry |    the writes to the dentry must be visible before the dentry | ||||||
|    is inserted. This works in conjuction with hlist_for_each_rcu() |    is inserted. This works in conjunction with hlist_for_each_rcu() | ||||||
|    while walking the hash chain. The only requirement is that |    while walking the hash chain. The only requirement is that | ||||||
|    all initialization to the dentry must be done before hlist_add_head_rcu() |    all initialization to the dentry must be done before hlist_add_head_rcu() | ||||||
|    since we don't have dcache_lock protection while traversing |    since we don't have dcache_lock protection while traversing | ||||||
|  | @ -584,7 +795,7 @@ Some of the important changes are : | ||||||
|    the same.  In some sense, dcache_rcu path walking looks like |    the same.  In some sense, dcache_rcu path walking looks like | ||||||
|    the pre-2.5.10 version. |    the pre-2.5.10 version. | ||||||
| 
 | 
 | ||||||
| 5. All dentry hash chain updations must take the dcache_lock as well as | 5. All dentry hash chain updates must take the dcache_lock as well as | ||||||
|    the per-dentry lock in that order. dput() does this to ensure |    the per-dentry lock in that order. dput() does this to ensure | ||||||
|    that a dentry that has just been looked up in another CPU |    that a dentry that has just been looked up in another CPU | ||||||
|    doesn't get deleted before dget() can be done on it. |    doesn't get deleted before dget() can be done on it. | ||||||
|  | @ -640,10 +851,10 @@ handled as described below : | ||||||
|    Since we redo the d_parent check and compare name while holding |    Since we redo the d_parent check and compare name while holding | ||||||
|    d_lock, lock-free look-up will not race against d_move(). |    d_lock, lock-free look-up will not race against d_move(). | ||||||
| 
 | 
 | ||||||
| 4. There can be a theoritical race when a dentry keeps coming back | 4. There can be a theoretical race when a dentry keeps coming back | ||||||
|    to original bucket due to double moves. Due to this look-up may |    to original bucket due to double moves. Due to this look-up may | ||||||
|    consider that it has never moved and can end up in a infinite loop. |    consider that it has never moved and can end up in a infinite loop. | ||||||
|    But this is not any worse that theoritical livelocks we already |    But this is not any worse that theoretical livelocks we already | ||||||
|    have in the kernel. |    have in the kernel. | ||||||
| 
 | 
 | ||||||
| 
 | 
 | ||||||
|  |  | ||||||
		Loading…
	
	Add table
		Add a link
		
	
		Reference in a new issue
	
	 Pekka J Enberg
				Pekka J Enberg