177 lines
		
	
	
	
		
			7.9 KiB
			
		
	
	
	
		
			Text
		
	
	
	
	
	
		
		
			
		
	
	
			177 lines
		
	
	
	
		
			7.9 KiB
			
		
	
	
	
		
			Text
		
	
	
	
	
	
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								Making Filesystems Exportable
							 | 
						||
| 
								 | 
							
								=============================
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								Most filesystem operations require a dentry (or two) as a starting
							 | 
						||
| 
								 | 
							
								point.  Local applications have a reference-counted hold on suitable
							 | 
						||
| 
								 | 
							
								dentrys via open file descriptors or cwd/root.  However remote
							 | 
						||
| 
								 | 
							
								applications that access a filesystem via a remote filesystem protocol
							 | 
						||
| 
								 | 
							
								such as NFS may not be able to hold such a reference, and so need a
							 | 
						||
| 
								 | 
							
								different way to refer to a particular dentry.  As the alternative
							 | 
						||
| 
								 | 
							
								form of reference needs to be stable across renames, truncates, and
							 | 
						||
| 
								 | 
							
								server-reboot (among other things, though these tend to be the most
							 | 
						||
| 
								 | 
							
								problematic), there is no simple answer like 'filename'.
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								The mechanism discussed here allows each filesystem implementation to
							 | 
						||
| 
								 | 
							
								specify how to generate an opaque (out side of the filesystem) byte
							 | 
						||
| 
								 | 
							
								string for any dentry, and how to find an appropriate dentry for any
							 | 
						||
| 
								 | 
							
								given opaque byte string.
							 | 
						||
| 
								 | 
							
								This byte string will be called a "filehandle fragment" as it
							 | 
						||
| 
								 | 
							
								corresponds to part of an NFS filehandle.
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								A filesystem which supports the mapping between filehandle fragments
							 | 
						||
| 
								 | 
							
								and dentrys will be termed "exportable".
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								Dcache Issues
							 | 
						||
| 
								 | 
							
								-------------
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								The dcache normally contains a proper prefix of any given filesystem
							 | 
						||
| 
								 | 
							
								tree.  This means that if any filesystem object is in the dcache, then
							 | 
						||
| 
								 | 
							
								all of the ancestors of that filesystem object are also in the dcache.
							 | 
						||
| 
								 | 
							
								As normal access is by filename this prefix is created naturally and
							 | 
						||
| 
								 | 
							
								maintained easily (by each object maintaining a reference count on
							 | 
						||
| 
								 | 
							
								its parent).
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								However when objects are included into the dcache by interpreting a
							 | 
						||
| 
								 | 
							
								filehandle fragment, there is no automatic creation of a path prefix
							 | 
						||
| 
								 | 
							
								for the object.  This leads to two related but distinct features of
							 | 
						||
| 
								 | 
							
								the dcache that are not needed for normal filesystem access.
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								1/ The dcache must sometimes contain objects that are not part of the
							 | 
						||
| 
								 | 
							
								   proper prefix. i.e that are not connected to the root.
							 | 
						||
| 
								 | 
							
								2/ The dcache must be prepared for a newly found (via ->lookup) directory
							 | 
						||
| 
								 | 
							
								   to already have a (non-connected) dentry, and must be able to move
							 | 
						||
| 
								 | 
							
								   that dentry into place (based on the parent and name in the
							 | 
						||
| 
								 | 
							
								   ->lookup).   This is particularly needed for directories as
							 | 
						||
| 
								 | 
							
								   it is a dcache invariant that directories only have one dentry.
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								To implement these features, the dcache has:
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								a/ A dentry flag DCACHE_DISCONNECTED which is set on
							 | 
						||
| 
								 | 
							
								   any dentry that might not be part of the proper prefix.
							 | 
						||
| 
								 | 
							
								   This is set when anonymous dentries are created, and cleared when a
							 | 
						||
| 
								 | 
							
								   dentry is noticed to be a child of a dentry which is in the proper
							 | 
						||
| 
								 | 
							
								   prefix. 
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								b/ A per-superblock list "s_anon" of dentries which are the roots of
							 | 
						||
| 
								 | 
							
								   subtrees that are not in the proper prefix.  These dentries, as
							 | 
						||
| 
								 | 
							
								   well as the proper prefix, need to be released at unmount time.  As
							 | 
						||
| 
								 | 
							
								   these dentries will not be hashed, they are linked together on the
							 | 
						||
| 
								 | 
							
								   d_hash list_head.
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								c/ Helper routines to allocate anonymous dentries, and to help attach
							 | 
						||
| 
								 | 
							
								   loose directory dentries at lookup time. They are:
							 | 
						||
| 
								 | 
							
								    d_alloc_anon(inode) will return a dentry for the given inode.
							 | 
						||
| 
								 | 
							
								      If the inode already has a dentry, one of those is returned.
							 | 
						||
| 
								 | 
							
								      If it doesn't, a new anonymous (IS_ROOT and
							 | 
						||
| 
								 | 
							
								        DCACHE_DISCONNECTED) dentry is allocated and attached.
							 | 
						||
| 
								 | 
							
								      In the case of a directory, care is taken that only one dentry
							 | 
						||
| 
								 | 
							
								      can ever be attached.
							 | 
						||
| 
								 | 
							
								    d_splice_alias(inode, dentry) will make sure that there is a
							 | 
						||
| 
								 | 
							
								      dentry with the same name and parent as the given dentry, and
							 | 
						||
| 
								 | 
							
								      which refers to the given inode.
							 | 
						||
| 
								 | 
							
								      If the inode is a directory and already has a dentry, then that
							 | 
						||
| 
								 | 
							
								      dentry is d_moved over the given dentry.
							 | 
						||
| 
								 | 
							
								      If the passed dentry gets attached, care is taken that this is
							 | 
						||
| 
								 | 
							
								      mutually exclusive to a d_alloc_anon operation.
							 | 
						||
| 
								 | 
							
								      If the passed dentry is used, NULL is returned, else the used
							 | 
						||
| 
								 | 
							
								      dentry is returned.  This corresponds to the calling pattern of
							 | 
						||
| 
								 | 
							
								      ->lookup.
							 | 
						||
| 
								 | 
							
								  
							 | 
						||
| 
								 | 
							
								 
							 | 
						||
| 
								 | 
							
								Filesystem Issues
							 | 
						||
| 
								 | 
							
								-----------------
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								For a filesystem to be exportable it must:
							 | 
						||
| 
								 | 
							
								 
							 | 
						||
| 
								 | 
							
								   1/ provide the filehandle fragment routines described below.
							 | 
						||
| 
								 | 
							
								   2/ make sure that d_splice_alias is used rather than d_add
							 | 
						||
| 
								 | 
							
								      when ->lookup finds an inode for a given parent and name.
							 | 
						||
| 
								 | 
							
								      Typically the ->lookup routine will end:
							 | 
						||
| 
								 | 
							
										if (inode)
							 | 
						||
| 
								 | 
							
											return d_splice(inode, dentry);
							 | 
						||
| 
								 | 
							
										d_add(dentry, inode);
							 | 
						||
| 
								 | 
							
										return NULL;
							 | 
						||
| 
								 | 
							
									}
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								  A file system implementation declares that instances of the filesystem
							 | 
						||
| 
								 | 
							
								are exportable by setting the s_export_op field in the struct
							 | 
						||
| 
								 | 
							
								super_block.  This field must point to a "struct export_operations"
							 | 
						||
| 
								 | 
							
								struct which could potentially be full of NULLs, though normally at
							 | 
						||
| 
								 | 
							
								least get_parent will be set.
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								 The primary operations are decode_fh and encode_fh.  
							 | 
						||
| 
								 | 
							
								decode_fh takes a filehandle fragment and tries to find or create a
							 | 
						||
| 
								 | 
							
								dentry for the object referred to by the filehandle.
							 | 
						||
| 
								 | 
							
								encode_fh takes a dentry and creates a filehandle fragment which can
							 | 
						||
| 
								 | 
							
								later be used to find/create a dentry for the same object.
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								decode_fh will probably make use of "find_exported_dentry".
							 | 
						||
| 
								 | 
							
								This function lives in the "exportfs" module which a filesystem does
							 | 
						||
| 
								 | 
							
								not need unless it is being exported.  So rather that calling
							 | 
						||
| 
								 | 
							
								find_exported_dentry directly, each filesystem should call it through
							 | 
						||
| 
								 | 
							
								the find_exported_dentry pointer in it's export_operations table.
							 | 
						||
| 
								 | 
							
								This field is set correctly by the exporting agent (e.g. nfsd) when a
							 | 
						||
| 
								 | 
							
								filesystem is exported, and before any export operations are called.
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								find_exported_dentry needs three support functions from the
							 | 
						||
| 
								 | 
							
								filesystem:
							 | 
						||
| 
								 | 
							
								  get_name.  When given a parent dentry and a child dentry, this
							 | 
						||
| 
								 | 
							
								    should find a name in the directory identified by the parent
							 | 
						||
| 
								 | 
							
								    dentry, which leads to the object identified by the child dentry.
							 | 
						||
| 
								 | 
							
								    If no get_name function is supplied, a default implementation is
							 | 
						||
| 
								 | 
							
								    provided which uses vfs_readdir to find potential names, and
							 | 
						||
| 
								 | 
							
								    matches inode numbers to find the correct match.
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								  get_parent.  When given a dentry for a directory, this should return 
							 | 
						||
| 
								 | 
							
								    a dentry for the parent.  Quite possibly the parent dentry will
							 | 
						||
| 
								 | 
							
								    have been allocated by d_alloc_anon.  
							 | 
						||
| 
								 | 
							
								    The default get_parent function just returns an error so any
							 | 
						||
| 
								 | 
							
								    filehandle lookup that requires finding a parent will fail.
							 | 
						||
| 
								 | 
							
								    ->lookup("..") is *not* used as a default as it can leave ".."
							 | 
						||
| 
								 | 
							
								    entries in the dcache which are too messy to work with.
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								  get_dentry.  When given an opaque datum, this should find the
							 | 
						||
| 
								 | 
							
								    implied object and create a dentry for it (possibly with
							 | 
						||
| 
								 | 
							
								    d_alloc_anon). 
							 | 
						||
| 
								 | 
							
								    The opaque datum is whatever is passed down by the decode_fh
							 | 
						||
| 
								 | 
							
								    function, and is often simply a fragment of the filehandle
							 | 
						||
| 
								 | 
							
								    fragment.
							 | 
						||
| 
								 | 
							
								    decode_fh passes two datums through find_exported_dentry.  One that 
							 | 
						||
| 
								 | 
							
								    should be used to identify the target object, and one that can be
							 | 
						||
| 
								 | 
							
								    used to identify the object's parent, should that be necessary.
							 | 
						||
| 
								 | 
							
								    The default get_dentry function assumes that the datum contains an
							 | 
						||
| 
								 | 
							
								    inode number and a generation number, and it attempts to get the
							 | 
						||
| 
								 | 
							
								    inode using "iget" and check it's validity by matching the
							 | 
						||
| 
								 | 
							
								    generation number.  A filesystem should only depend on the default
							 | 
						||
| 
								 | 
							
								    if iget can safely be used this way.
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								If decode_fh and/or encode_fh are left as NULL, then default
							 | 
						||
| 
								 | 
							
								implementations are used.  These defaults are suitable for ext2 and 
							 | 
						||
| 
								 | 
							
								extremely similar filesystems (like ext3).
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								The default encode_fh creates a filehandle fragment from the inode
							 | 
						||
| 
								 | 
							
								number and generation number of the target together with the inode
							 | 
						||
| 
								 | 
							
								number and generation number of the parent (if the parent is
							 | 
						||
| 
								 | 
							
								required).
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								The default decode_fh extract the target and parent datums from the
							 | 
						||
| 
								 | 
							
								filehandle assuming the format used by the default encode_fh and
							 | 
						||
| 
								 | 
							
								passed them to find_exported_dentry.
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								A filehandle fragment consists of an array of 1 or more 4byte words,
							 | 
						||
| 
								 | 
							
								together with a one byte "type".
							 | 
						||
| 
								 | 
							
								The decode_fh routine should not depend on the stated size that is
							 | 
						||
| 
								 | 
							
								passed to it.  This size may be larger than the original filehandle
							 | 
						||
| 
								 | 
							
								generated by encode_fh, in which case it will have been padded with
							 | 
						||
| 
								 | 
							
								nuls.  Rather, the encode_fh routine should choose a "type" which
							 | 
						||
| 
								 | 
							
								indicates the decode_fh how much of the filehandle is valid, and how
							 | 
						||
| 
								 | 
							
								it should be interpreted.
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								 
							 |