Making Filesystems Exportable
=============================

Most filesystem operations require a dentry (or two) as a starting
point.  Local applications have a reference-counted hold on suitable
dentries via open file descriptors or cwd/root.  However, remote
applications that access a filesystem via a remote filesystem protocol
such as NFS may not be able to hold such a reference, and so need a
different way to refer to a particular dentry.  As the alternative
form of reference needs to be stable across renames, truncates, and
server reboots (among other things, though these tend to be the most
problematic), there is no simple answer like 'filename'.

The mechanism discussed here allows each filesystem implementation to
specify how to generate an opaque (outside of the filesystem) byte
string for any dentry, and how to find an appropriate dentry for any
given opaque byte string.
This byte string will be called a "filehandle fragment" as it
corresponds to part of an NFS filehandle.

A filesystem which supports the mapping between filehandle fragments
and dentries will be termed "exportable".


Dcache Issues
-------------

The dcache normally contains a proper prefix of any given filesystem
tree.  This means that if any filesystem object is in the dcache, then
all of the ancestors of that filesystem object are also in the dcache.
As normal access is by filename, this prefix is created naturally and
maintained easily (by each object maintaining a reference count on
its parent).

However, when objects are included into the dcache by interpreting a
filehandle fragment, there is no automatic creation of a path prefix
for the object.  This leads to two related but distinct features of
the dcache that are not needed for normal filesystem access.

1/ The dcache must sometimes contain objects that are not part of the
   proper prefix, i.e. that are not connected to the root.
2/ The dcache must be prepared for a newly found (via ->lookup) directory
   to already have a (non-connected) dentry, and must be able to move
   that dentry into place (based on the parent and name in the
   ->lookup).  This is particularly needed for directories as
   it is a dcache invariant that directories only have one dentry.

To implement these features, the dcache has:

a/ A dentry flag DCACHE_DISCONNECTED which is set on
   any dentry that might not be part of the proper prefix.
   This is set when anonymous dentries are created, and cleared when a
   dentry is noticed to be a child of a dentry which is in the proper
   prefix.

b/ A per-superblock list "s_anon" of dentries which are the roots of
   subtrees that are not in the proper prefix.  These dentries, as
   well as the proper prefix, need to be released at unmount time.  As
   these dentries will not be hashed, they are linked together on the
   d_hash list_head.

c/ Helper routines to allocate anonymous dentries, and to help attach
   loose directory dentries at lookup time. They are:
    d_alloc_anon(inode) will return a dentry for the given inode.
      If the inode already has a dentry, one of those is returned.
      If it doesn't, a new anonymous (IS_ROOT and
        DCACHE_DISCONNECTED) dentry is allocated and attached.
      In the case of a directory, care is taken that only one dentry
      can ever be attached.
    d_splice_alias(inode, dentry) will make sure that there is a
      dentry with the same name and parent as the given dentry, and
      which refers to the given inode.
      If the inode is a directory and already has a dentry, then that
      dentry is d_moved over the given dentry.
      If the passed dentry gets attached, care is taken that this is
      mutually exclusive to a d_alloc_anon operation.
      If the passed dentry is used, NULL is returned, else the used
      dentry is returned.  This corresponds to the calling pattern of
      ->lookup.
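
The behaviour of the two helpers can be illustrated with a small
userspace model.  This is only a sketch: the toy_* types and functions
are invented for the illustration and are not the kernel's, locking and
reference counting are left out, and an anonymous root is modelled
simply as a dentry that is its own parent.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace toy model; none of these types are the kernel's. */
#define TOY_DISCONNECTED 0x1u	/* stands in for DCACHE_DISCONNECTED */

struct toy_dentry;

struct toy_inode {
	int is_dir;
	struct toy_dentry *alias;	/* an existing dentry, or NULL */
};

struct toy_dentry {
	struct toy_dentry *parent;	/* itself for an anonymous root */
	const char *name;
	struct toy_inode *inode;
	unsigned int flags;
};

/* Model of d_alloc_anon: reuse an existing dentry, else set up an
 * anonymous one in the caller-provided storage. */
static struct toy_dentry *toy_alloc_anon(struct toy_inode *inode,
		struct toy_dentry *storage)
{
	if (inode->alias)
		return inode->alias;
	storage->parent = storage;	/* IS_ROOT in this model */
	storage->name = "";
	storage->inode = inode;
	storage->flags = TOY_DISCONNECTED;
	inode->alias = storage;
	return storage;
}

/* Model of d_splice_alias: returns NULL if the passed dentry was used,
 * otherwise the existing dentry that was moved into its place. */
static struct toy_dentry *toy_splice_alias(struct toy_inode *inode,
		struct toy_dentry *dentry)
{
	assert(dentry->inode == NULL);
	if (inode->is_dir && inode->alias) {
		struct toy_dentry *old = inode->alias;

		/* the "d_move" step: take over parent and name */
		old->parent = dentry->parent;
		old->name = dentry->name;
		return old;
	}
	dentry->inode = inode;
	if (!inode->alias)
		inode->alias = dentry;
	return NULL;
}
```

In this model, a directory first reached through toy_alloc_anon and
later found by name gets its one existing dentry moved into place,
while a plain file with no dentry has the passed dentry attached and
NULL returned.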


Filesystem Issues
-----------------

For a filesystem to be exportable it must:

   1/ provide the filehandle fragment routines described below.
   2/ make sure that d_splice_alias is used rather than d_add
      when ->lookup finds an inode for a given parent and name.
      Typically the ->lookup routine will end:
		/* inode found: reuse any existing (disconnected) dentry */
		if (inode)
			return d_splice_alias(inode, dentry);
		/* nothing found: hash a negative dentry */
		d_add(dentry, inode);
		return NULL;
	}

A filesystem implementation declares that instances of the filesystem
are exportable by setting the s_export_op field in the struct
super_block.  This field must point to a "struct export_operations"
which could potentially be full of NULLs, though normally at
least get_parent will be set.
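
The "NULL means use the default" convention can be sketched in plain
userspace C.  All toy_* names below are hypothetical stand-ins, not the
kernel structures, and the failing default is modelled by returning
NULL rather than a kernel error pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins; the real structures live in the kernel headers. */
struct toy_dentry {
	int ino;
	struct toy_dentry *parent;
};

struct toy_export_ops {
	/* NULL means "use the default" */
	struct toy_dentry *(*get_parent)(struct toy_dentry *child);
};

struct toy_super_block {
	const struct toy_export_ops *s_export_op;	/* NULL: not exportable */
};

/* Default in this model: finding a parent simply fails. */
static struct toy_dentry *toy_default_get_parent(struct toy_dentry *child)
{
	(void)child;
	return NULL;
}

static int toy_is_exportable(const struct toy_super_block *sb)
{
	return sb->s_export_op != NULL;
}

/* Pick the filesystem's hook if it is set, else the default. */
static struct toy_dentry *toy_call_get_parent(const struct toy_super_block *sb,
		struct toy_dentry *child)
{
	assert(toy_is_exportable(sb));
	if (sb->s_export_op->get_parent)
		return sb->s_export_op->get_parent(child);
	return toy_default_get_parent(child);
}

/* A filesystem-supplied hook for the model. */
static struct toy_dentry *toyfs_get_parent(struct toy_dentry *child)
{
	return child->parent;
}
```

A superblock whose table is entirely NULL still counts as exportable
here; it just gets the failing default for get_parent.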

The primary operations are decode_fh and encode_fh.
decode_fh takes a filehandle fragment and tries to find or create a
dentry for the object referred to by the filehandle.
encode_fh takes a dentry and creates a filehandle fragment which can
later be used to find/create a dentry for the same object.

decode_fh will probably make use of "find_exported_dentry".
This function lives in the "exportfs" module which a filesystem does
not need unless it is being exported.  So rather than calling
find_exported_dentry directly, each filesystem should call it through
the find_exported_dentry pointer in its export_operations table.
This field is set correctly by the exporting agent (e.g. nfsd) when a
filesystem is exported, and before any export operations are called.

find_exported_dentry needs three support functions from the
filesystem:
  get_name.  When given a parent dentry and a child dentry, this
    should find a name in the directory identified by the parent
    dentry, which leads to the object identified by the child dentry.
    If no get_name function is supplied, a default implementation is
    provided which uses vfs_readdir to find potential names, and
    matches inode numbers to find the correct match.

  get_parent.  When given a dentry for a directory, this should return
    a dentry for the parent.  Quite possibly the parent dentry will
    have been allocated by d_alloc_anon.
    The default get_parent function just returns an error so any
    filehandle lookup that requires finding a parent will fail.
    ->lookup("..") is *not* used as a default as it can leave ".."
    entries in the dcache which are too messy to work with.

  get_dentry.  When given an opaque datum, this should find the
    implied object and create a dentry for it (possibly with
    d_alloc_anon).
    The opaque datum is whatever is passed down by the decode_fh
    function, and is often simply a fragment of the filehandle
    fragment.
    decode_fh passes two datums through find_exported_dentry: one that
    should be used to identify the target object, and one that can be
    used to identify the object's parent, should that be necessary.
    The default get_dentry function assumes that the datum contains an
    inode number and a generation number, and it attempts to get the
    inode using "iget" and check its validity by matching the
    generation number.  A filesystem should only depend on the default
    if iget can safely be used this way.
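
The generation check made by the default get_dentry can be modelled in
userspace as below.  This is a sketch under stated assumptions: the
toy_* names are invented, "iget" is replaced by a search of a fixed
table, the generation number is assumed to change whenever an inode
number is reused, and the final step of wrapping the inode in a dentry
is omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct toy_inode {
	uint32_t ino;
	uint32_t generation;	/* assumed to change when ino is reused */
	int in_use;
};

/* The opaque datum assumed by the default: inode number + generation. */
struct toy_datum {
	uint32_t ino;
	uint32_t generation;
};

/* Stand-in for "iget": look an inode up by number in a fixed table. */
static struct toy_inode *toy_iget(struct toy_inode *table, size_t n,
		uint32_t ino)
{
	for (size_t i = 0; i < n; i++)
		if (table[i].in_use && table[i].ino == ino)
			return &table[i];
	return NULL;
}

/* Returns the inode the datum refers to, or NULL if no live inode
 * matches both the number and the generation (a stale handle). */
static struct toy_inode *toy_get_dentry(struct toy_inode *table, size_t n,
		const struct toy_datum *d)
{
	struct toy_inode *inode;

	assert(table != NULL && d != NULL);
	inode = toy_iget(table, n, d->ino);
	if (!inode)
		return NULL;
	if (inode->generation != d->generation)
		return NULL;	/* same number, different file */
	return inode;
}
```

A datum whose inode number exists but whose generation differs is
rejected, which is what keeps an old filehandle from silently resolving
to an unrelated file that inherited the same inode number.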

If decode_fh and/or encode_fh are left as NULL, then default
implementations are used.  These defaults are suitable for ext2 and
extremely similar filesystems (like ext3).

The default encode_fh creates a filehandle fragment from the inode
number and generation number of the target together with the inode
number and generation number of the parent (if the parent is
required).
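
A fragment of the shape just described might be built as in the
following userspace sketch.  The toy_* names, the signature, and the
numeric type values are assumptions made for the illustration, not the
values used by the kernel's default.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative type values; the numbers are assumptions of this sketch. */
enum {
	TOY_FH_INO_GEN = 1,		/* target only: 2 words */
	TOY_FH_INO_GEN_PARENT = 2,	/* target + parent: 4 words */
	TOY_FH_TOO_SMALL = 255		/* buffer cannot hold the fragment */
};

struct toy_obj {
	uint32_t ino;
	uint32_t generation;
};

/*
 * Fill fh[] with 32-bit words and return the type.  On entry *max_words
 * is the room available; on success it is set to the words used.
 * parent is NULL when the parent is not required.
 */
static int toy_encode_fh(const struct toy_obj *target,
		const struct toy_obj *parent,
		uint32_t *fh, int *max_words)
{
	int need = parent ? 4 : 2;

	assert(target && fh && max_words);
	if (*max_words < need)
		return TOY_FH_TOO_SMALL;
	fh[0] = target->ino;
	fh[1] = target->generation;
	if (parent) {
		fh[2] = parent->ino;
		fh[3] = parent->generation;
	}
	*max_words = need;
	return parent ? TOY_FH_INO_GEN_PARENT : TOY_FH_INO_GEN;
}
```

The returned type, not the word count, is what a matching decoder would
later rely on to know whether parent information is present.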

The default decode_fh extracts the target and parent datums from the
filehandle assuming the format used by the default encode_fh and
passes them to find_exported_dentry.

A filehandle fragment consists of an array of 1 or more 4-byte words,
together with a one-byte "type".
The decode_fh routine should not depend on the stated size that is
passed to it.  This size may be larger than the original filehandle
generated by encode_fh, in which case it will have been padded with
nuls.  Rather, the encode_fh routine should choose a "type" which
indicates to decode_fh how much of the filehandle is valid, and how
it should be interpreted.
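
The rule that the type, and not the stated size, governs decoding can
be sketched in userspace as follows.  The toy_* names and the type
values are hypothetical; the stated size is used only as a bounds
check, so trailing nul padding is ignored.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative type values; assumptions of this sketch. */
enum { TOY_FH_INO_GEN = 1, TOY_FH_INO_GEN_PARENT = 2 };

struct toy_datum {
	uint32_t ino;
	uint32_t generation;
};

/* Number of meaningful words for a type, or 0 for an unknown type. */
static int toy_fh_valid_words(int type)
{
	switch (type) {
	case TOY_FH_INO_GEN:
		return 2;
	case TOY_FH_INO_GEN_PARENT:
		return 4;
	default:
		return 0;
	}
}

/*
 * Decode according to the type only.  stated_words may exceed the valid
 * length because of padding.  Returns 0 on success and -1 for an
 * unknown type or a fragment too short for its type; *has_parent
 * reports whether *parent was filled in.
 */
static int toy_decode_fh(const uint32_t *fh, int stated_words, int type,
		struct toy_datum *target, struct toy_datum *parent,
		int *has_parent)
{
	int valid = toy_fh_valid_words(type);

	assert(fh && target && parent && has_parent);
	if (valid == 0 || stated_words < valid)
		return -1;
	target->ino = fh[0];
	target->generation = fh[1];
	*has_parent = (valid == 4);
	if (*has_parent) {
		parent->ino = fh[2];
		parent->generation = fh[3];
	}
	return 0;
}
```

An eight-word buffer carrying a two-word fragment decodes exactly like
the unpadded fragment, because only the first two words are consulted
for that type.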