Making Filesystems Exportable
=============================

Most filesystem operations require a dentry (or two) as a starting
point.  Local applications have a reference-counted hold on suitable
dentries via open file descriptors or cwd/root.  However, remote
applications that access a filesystem via a remote filesystem protocol
such as NFS may not be able to hold such a reference, and so need a
different way to refer to a particular dentry.  As the alternative
form of reference needs to be stable across renames, truncates, and
server reboots (among other things, though these tend to be the most
problematic), there is no simple answer like 'filename'.

The mechanism discussed here allows each filesystem implementation to
specify how to generate an opaque (outside of the filesystem) byte
string for any dentry, and how to find an appropriate dentry for any
given opaque byte string.
This byte string will be called a "filehandle fragment" as it
corresponds to part of an NFS filehandle.

A filesystem which supports the mapping between filehandle fragments
and dentries will be termed "exportable".


Dcache Issues
-------------

The dcache normally contains a proper prefix of any given filesystem
tree.  This means that if any filesystem object is in the dcache, then
all of the ancestors of that filesystem object are also in the dcache.
As normal access is by filename, this prefix is created naturally and
maintained easily (by each object maintaining a reference count on
its parent).

However, when objects are included into the dcache by interpreting a
filehandle fragment, there is no automatic creation of a path prefix
for the object.  This leads to two related but distinct features of
the dcache that are not needed for normal filesystem access.

1/ The dcache must sometimes contain objects that are not part of the
   proper prefix, i.e. that are not connected to the root.
2/ The dcache must be prepared for a newly found (via ->lookup) directory
   to already have a (non-connected) dentry, and must be able to move
   that dentry into place (based on the parent and name in the
   ->lookup).  This is particularly needed for directories as
   it is a dcache invariant that directories only have one dentry.

To implement these features, the dcache has:

a/ A dentry flag DCACHE_DISCONNECTED which is set on
   any dentry that might not be part of the proper prefix.
   This is set when anonymous dentries are created, and cleared when a
   dentry is noticed to be a child of a dentry which is in the proper
   prefix.

b/ A per-superblock list "s_anon" of dentries which are the roots of
   subtrees that are not in the proper prefix.  These dentries, as
   well as the proper prefix, need to be released at unmount time.  As
   these dentries will not be hashed, they are linked together on the
   d_hash list_head.

c/ Helper routines to allocate anonymous dentries, and to help attach
   loose directory dentries at lookup time. They are:
    d_alloc_anon(inode) will return a dentry for the given inode.
      If the inode already has a dentry, one of those is returned.
      If it doesn't, a new anonymous (IS_ROOT and
        DCACHE_DISCONNECTED) dentry is allocated and attached.
      In the case of a directory, care is taken that only one dentry
      can ever be attached.
    d_splice_alias(inode, dentry) will make sure that there is a
      dentry with the same name and parent as the given dentry, and
      which refers to the given inode.
      If the inode is a directory and already has a dentry, then that
      dentry is d_moved over the given dentry.
      If the passed dentry gets attached, care is taken that this is
      mutually exclusive with a d_alloc_anon operation.
      If the passed dentry is used, NULL is returned, else the used
      dentry is returned.  This corresponds to the calling pattern of
      ->lookup.

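The return convention of the two helpers can be illustrated with a small
userspace toy model.  This is only a sketch: the toy_* names, the single
"alias" pointer per inode and the rule used here for clearing the flag are
assumptions made for illustration, not the kernel implementation, and no
locking is modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_DISCONNECTED 0x1	/* stands in for DCACHE_DISCONNECTED */

struct toy_dentry;

struct toy_inode {
	int is_dir;			/* non-zero for directories */
	struct toy_dentry *alias;	/* toy: one tracked dentry per inode */
};

struct toy_dentry {
	struct toy_dentry *parent;	/* points to itself when IS_ROOT */
	const char *name;
	struct toy_inode *inode;
	unsigned flags;
};

/* Return the inode's existing dentry, or allocate an anonymous one. */
static struct toy_dentry *toy_alloc_anon(struct toy_inode *inode)
{
	struct toy_dentry *d;

	if (inode->alias)
		return inode->alias;
	d = calloc(1, sizeof(*d));
	if (!d)
		return NULL;
	d->parent = d;			/* IS_ROOT: its own parent */
	d->name = "";
	d->inode = inode;
	d->flags = TOY_DISCONNECTED;
	inode->alias = d;
	return d;
}

/*
 * Attach inode at dentry's parent and name.  Returns NULL if the passed
 * dentry was used, otherwise the pre-existing dentry that was moved.
 */
static struct toy_dentry *toy_splice_alias(struct toy_inode *inode,
					   struct toy_dentry *dentry)
{
	struct toy_dentry *old = inode->alias;

	if (inode->is_dir && old) {
		old->parent = dentry->parent;	/* "d_move" into place */
		old->name = dentry->name;
		/* toy rule: connected once the new parent is connected */
		if (!(dentry->parent->flags & TOY_DISCONNECTED))
			old->flags &= ~TOY_DISCONNECTED;
		return old;
	}
	dentry->inode = inode;
	inode->alias = dentry;
	return NULL;
}
```

A caller that mimics ->lookup would return whatever toy_splice_alias
returns: NULL when its own dentry was used, or the moved dentry otherwise.
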
Filesystem Issues
-----------------

For a filesystem to be exportable it must:

   1/ provide the filehandle fragment routines described below.
   2/ make sure that d_splice_alias is used rather than d_add
      when ->lookup finds an inode for a given parent and name.
      Typically the ->lookup routine will end:
		if (inode)
			return d_splice_alias(inode, dentry);
		d_add(dentry, inode);
		return NULL;
	}


A filesystem implementation declares that instances of the filesystem
are exportable by setting the s_export_op field in the struct
super_block.  This field must point to a "struct export_operations"
struct which could potentially be full of NULLs, though normally at
least get_parent will be set.

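As a rough userspace sketch of this arrangement, the following shows a
table of function pointers where a NULL entry means "use the default".
The toy_* and myfs_* names and the cut-down two-member table are
hypothetical; the real struct export_operations has more members and
different signatures.

```c
#include <assert.h>
#include <stddef.h>

struct toy_dentry {
	unsigned long ino;
};

/* Hypothetical, cut-down stand-in for struct export_operations. */
struct toy_export_ops {
	int (*encode_fh)(const struct toy_dentry *d, unsigned int *fh);
	struct toy_dentry *(*get_parent)(struct toy_dentry *child);
};

struct toy_super_block {
	const struct toy_export_ops *s_export_op;	/* NULL: not exportable */
};

/* Toy default, used when the filesystem leaves encode_fh as NULL. */
static int toy_default_encode_fh(const struct toy_dentry *d, unsigned int *fh)
{
	fh[0] = (unsigned int)d->ino;
	return 1;	/* words written */
}

/* Dispatch through the table, falling back to the default on NULL. */
static int toy_encode(const struct toy_super_block *sb,
		      const struct toy_dentry *d, unsigned int *fh)
{
	const struct toy_export_ops *ops = sb->s_export_op;

	if (!ops)
		return -1;	/* filesystem did not declare itself exportable */
	if (ops->encode_fh)
		return ops->encode_fh(d, fh);
	return toy_default_encode_fh(d, fh);
}

/* A filesystem that sets only get_parent and relies on the defaults. */
static struct toy_dentry myfs_root = { 2 };

static struct toy_dentry *myfs_get_parent(struct toy_dentry *child)
{
	(void)child;
	return &myfs_root;	/* toy: every directory hangs off the root */
}

static const struct toy_export_ops myfs_export_ops = {
	.get_parent = myfs_get_parent,
};
```
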
The primary operations are decode_fh and encode_fh.
decode_fh takes a filehandle fragment and tries to find or create a
dentry for the object referred to by the filehandle.
encode_fh takes a dentry and creates a filehandle fragment which can
later be used to find/create a dentry for the same object.

decode_fh will probably make use of "find_exported_dentry".
This function lives in the "exportfs" module which a filesystem does
not need unless it is being exported.  So rather than calling
find_exported_dentry directly, each filesystem should call it through
the find_exported_dentry pointer in its export_operations table.
This field is set correctly by the exporting agent (e.g. nfsd) when a
filesystem is exported, and before any export operations are called.

find_exported_dentry needs three support functions from the
filesystem:
  get_name.  When given a parent dentry and a child dentry, this
    should find a name in the directory identified by the parent
    dentry, which leads to the object identified by the child dentry.
    If no get_name function is supplied, a default implementation is
    provided which uses vfs_readdir to find potential names, and
    matches inode numbers to find the correct match.

  get_parent.  When given a dentry for a directory, this should return
    a dentry for the parent.  Quite possibly the parent dentry will
    have been allocated by d_alloc_anon.
    The default get_parent function just returns an error so any
    filehandle lookup that requires finding a parent will fail.
    ->lookup("..") is *not* used as a default as it can leave ".."
    entries in the dcache which are too messy to work with.

  get_dentry.  When given an opaque datum, this should find the
    implied object and create a dentry for it (possibly with
    d_alloc_anon).
    The opaque datum is whatever is passed down by the decode_fh
    function, and is often simply a fragment of the filehandle
    fragment.
    decode_fh passes two datums through find_exported_dentry: one that
    should be used to identify the target object, and one that can be
    used to identify the object's parent, should that be necessary.
    The default get_dentry function assumes that the datum contains an
    inode number and a generation number, and it attempts to get the
    inode using "iget" and check its validity by matching the
    generation number.  A filesystem should only depend on the default
    if iget can safely be used this way.

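The idea behind the default get_name (walk the parent directory and match
inode numbers) can be sketched in userspace as follows.  The directory is
modelled as a plain array; struct toy_dirent and toy_get_name are
hypothetical names, and the real default obtains its entries through
vfs_readdir rather than from an array.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One directory entry as a readdir-style scan would report it. */
struct toy_dirent {
	unsigned long ino;
	const char *name;
};

/*
 * Scan the entries of the parent directory and copy out the first name
 * whose inode number matches the child.  Returns 0 on success, or -1 if
 * no entry matches or the name does not fit in the buffer.
 */
static int toy_get_name(const struct toy_dirent *entries, size_t count,
			unsigned long child_ino, char *name, size_t len)
{
	size_t i;

	for (i = 0; i < count; i++) {
		if (entries[i].ino != child_ino)
			continue;
		if (strlen(entries[i].name) >= len)
			return -1;
		strcpy(name, entries[i].name);
		return 0;
	}
	return -1;
}
```

A filesystem that can map a child to its name more cheaply than a full
directory scan has a reason to supply its own get_name.
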
If decode_fh and/or encode_fh are left as NULL, then default
implementations are used.  These defaults are suitable for ext2 and
extremely similar filesystems (like ext3).

The default encode_fh creates a filehandle fragment from the inode
number and generation number of the target together with the inode
number and generation number of the parent (if the parent is
required).

The default decode_fh extracts the target and parent datums from the
filehandle, assuming the format used by the default encode_fh, and
passes them to find_exported_dentry.


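A minimal userspace sketch of an encoder in the style described for the
default encode_fh is given below.  The layout (inode number, generation,
then the parent's pair) follows the description above; the numeric type
values, the toy_* names and the -1 error return are assumptions chosen
for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct toy_inode {
	uint32_t ino;
	uint32_t generation;
};

/* Illustrative type values; assumed here, not taken from this document. */
enum {
	TOY_FH_INO_GEN = 1,		/* target only: 2 words */
	TOY_FH_INO_GEN_PARENT = 2,	/* target + parent: 4 words */
};

/*
 * Fill fh[] from the target and, if given, the parent.  On entry *len is
 * the capacity of fh[] in 4-byte words; on success it is set to the
 * number of words used.  Returns the fragment type, or -1 if fh[] is
 * too small.
 */
static int toy_encode_fh(const struct toy_inode *target,
			 const struct toy_inode *parent,
			 uint32_t *fh, int *len)
{
	int need = parent ? 4 : 2;

	if (*len < need)
		return -1;
	fh[0] = target->ino;
	fh[1] = target->generation;
	if (parent) {
		fh[2] = parent->ino;
		fh[3] = parent->generation;
	}
	*len = need;
	return parent ? TOY_FH_INO_GEN_PARENT : TOY_FH_INO_GEN;
}
```
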
A filehandle fragment consists of an array of 1 or more 4-byte words,
together with a one-byte "type".
The decode_fh routine should not depend on the stated size that is
passed to it.  This size may be larger than the original filehandle
generated by encode_fh, in which case it will have been padded with
nuls.  Rather, the encode_fh routine should choose a "type" which
indicates to decode_fh how much of the filehandle is valid, and how
it should be interpreted.

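The following userspace sketch shows a decoder that lets the type, not
the stated size, decide how many words are meaningful, so that nul
padding is ignored.  The type values (1 for target only, 2 for target
plus parent) and the toy_* names are assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>

struct toy_datum {
	uint32_t ino;
	uint32_t generation;
};

/*
 * Decode a fragment of `size` words.  The type alone selects how many
 * words are valid; `size` is only checked as a lower bound, since it may
 * include padding.  Returns 1 if a parent datum was filled in, 0 if only
 * the target was, and -1 for an unknown type or a short fragment.
 */
static int toy_decode_fh(const uint32_t *fh, int size, int type,
			 struct toy_datum *target, struct toy_datum *parent)
{
	int valid;	/* words that the type says are meaningful */

	switch (type) {
	case 1:
		valid = 2;
		break;
	case 2:
		valid = 4;
		break;
	default:
		return -1;
	}
	if (size < valid)
		return -1;

	target->ino = fh[0];
	target->generation = fh[1];
	if (valid == 4) {
		parent->ino = fh[2];
		parent->generation = fh[3];
	}
	return valid == 4;
}
```

With this shape, a type 1 fragment padded out to eight words decodes the
same way as an unpadded one.
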