Fixes generated by 'codespell' and manually reviewed. Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
		
			
				
	
	
		
Path walking and name lookup locking
====================================

Path resolution is the finding of a dentry corresponding to a path name string,
by performing a path walk. Typically, for every open(), stat() etc., the path
name will be resolved. Paths are resolved by walking the namespace tree,
starting with the first component of the pathname (eg. root or cwd) with a
known dentry, then finding the child of that dentry, which is named the next
component in the path string. Then repeating the lookup from the child dentry
and finding its child with the next element, and so on.

Since it is a frequent operation for workloads like multiuser environments and
web servers, it is important to optimize this code.

Path walking synchronisation history:
Prior to 2.5.10, dcache_lock was acquired in d_lookup (dcache hash lookup) and
thus in every component during path look-up. From 2.5.10 onwards, the fast-walk
algorithm changed this by holding the dcache_lock at the beginning and walking
as many cached path component dentries as possible. This significantly
decreases the number of acquisitions of dcache_lock. However it also increases
the lock hold time significantly and affects performance on large SMP machines.
Since the 2.5.62 kernel, dcache has been using a new locking model that uses
RCU to make dcache look-up lock-free.

All the above algorithms required taking a lock and reference count on the
dentry that was looked up, so that it may be used as the basis for walking the
next path element. This is inefficient and unscalable. It is inefficient
because the locks and atomic operations required for every dentry element slow
things down. It is not scalable because many parallel applications that are
path-walk intensive tend to do path lookups starting from a common dentry
(usually, the root "/" or current working directory). So contention on these
common path elements causes lock and cacheline queueing.

Since 2.6.38, RCU is used to make a significant part of the entire path walk
(including dcache look-up) completely "store-free" (so, no locks, atomics, or
even stores into cachelines of common dentries). This is known as "rcu-walk"
path walking.

Path walking overview
=====================

A name string specifies a start (root directory, cwd, fd-relative) and a
sequence of elements (directory entry names), which together refer to a path in
the namespace. A path is represented as a (dentry, vfsmount) tuple. The name
elements are sub-strings, separated by '/'.

Name lookups will want to find a particular path that a name string refers to
(usually the final element, or parent of final element). This is done by taking
the path given by the name's starting point (which we know in advance -- eg.
current->fs->cwd or current->fs->root) as the first parent of the lookup. Then
iteratively for each subsequent name element, look up the child of the current
parent with the given name and if it is not the desired entry, make it the
parent for the next lookup.
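
The iteration described above can be sketched in user-space C. This is only an
illustration under stated assumptions: struct toy_dentry, toy_lookup_child()
and toy_path_walk() are hypothetical names, children live in a small fixed
array rather than the dcache hash, and permissions, symlinks, mount points,
"." and ".." are all ignored.

```c
#include <stddef.h>
#include <string.h>

#define TOY_MAX_CHILDREN 4

/* Hypothetical stand-in for a dentry: a name plus a few children. */
struct toy_dentry {
    const char *name;
    size_t nr_children;
    struct toy_dentry *children[TOY_MAX_CHILDREN];
};

/* Find the child of 'parent' whose name is the first 'len' bytes of 'name'. */
static struct toy_dentry *toy_lookup_child(const struct toy_dentry *parent,
                                           const char *name, size_t len)
{
    for (size_t i = 0; i < parent->nr_children; i++) {
        struct toy_dentry *child = parent->children[i];

        if (strlen(child->name) == len &&
            memcmp(child->name, name, len) == 0)
            return child;
    }
    return NULL;
}

/*
 * Walk 'path' one '/'-separated element at a time, starting at 'start'.
 * Each child that is found becomes the parent of the next lookup.
 * Returns the final element, or NULL if any element is missing.
 */
static struct toy_dentry *toy_path_walk(struct toy_dentry *start,
                                        const char *path)
{
    struct toy_dentry *cur = start;

    while (cur != NULL) {
        size_t len;

        while (*path == '/')    /* skip separators */
            path++;
        if (*path == '\0')      /* no elements left */
            break;
        len = strcspn(path, "/");
        cur = toy_lookup_child(cur, path, len);
        path += len;
    }
    return cur;
}
```

With a toy tree "/" -> "home" -> "npiggin" -> "a.c", walking
"/home/npiggin/a.c" from the root and walking "a.c" from the "npiggin" entry
reach the same toy_dentry, which mirrors how an absolute and a cwd-relative
name can refer to the same path.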

A parent, of course, must be a directory, and we must have appropriate
permissions on the parent inode to be able to walk into it.

Turning the child into a parent for the next lookup requires more checks and
procedures. Symlinks essentially substitute the symlink name for the target
name in the name string, and require some recursive path walking.  Mount points
must be followed into (thus changing the vfsmount that subsequent path elements
refer to), switching from the mount point path to the root of the particular
mounted vfsmount. These behaviours are variously modified depending on the
exact path walking flags.

Path walking then must, broadly, do several particular things:
- find the start point of the walk;
- perform permissions and validity checks on inodes;
- perform dcache hash name lookups on (parent, name element) tuples;
- traverse mount points;
- traverse symlinks;
- lookup and create missing parts of the path on demand.

Safe store-free look-up of dcache hash table
============================================

Dcache name lookup
------------------
In order to look up a dcache (parent, name) tuple, we take a hash on the tuple
and use that to select a bucket in the dcache-hash table. The list of entries
in that bucket is then walked, and we do a full comparison of each entry
against our (parent, name) tuple.
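
As a rough user-space sketch of that bucket selection and full comparison (not
the kernel's code): the names toy_entry, toy_hash(), toy_insert() and
toy_lookup() are hypothetical, the hash is plain FNV-1a with the parent pointer
mixed in (swapped in here for illustration, not the hash the dcache uses), and
there is no RCU or locking at all.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_HASH_BUCKETS 64

/* Hypothetical hash-chain entry keyed by the (parent, name) tuple. */
struct toy_entry {
    const void *parent;
    const char *name;
    struct toy_entry *next;
};

static struct toy_entry *toy_table[TOY_HASH_BUCKETS];

/* Hash the tuple: FNV-1a over the name, then mix in the parent pointer. */
static size_t toy_hash(const void *parent, const char *name)
{
    uint64_t h = 14695981039346656037ULL;   /* FNV-1a offset basis */

    for (const char *p = name; *p != '\0'; p++) {
        h ^= (unsigned char)*p;
        h *= 1099511628211ULL;              /* FNV-1a prime */
    }
    h ^= (uint64_t)(uintptr_t)parent;
    h *= 1099511628211ULL;
    return (size_t)(h % TOY_HASH_BUCKETS);
}

/* Add an entry at the head of its bucket's chain. */
static void toy_insert(struct toy_entry *e)
{
    size_t bucket = toy_hash(e->parent, e->name);

    e->next = toy_table[bucket];
    toy_table[bucket] = e;
}

/* Walk one bucket and do a full comparison of parent and name. */
static struct toy_entry *toy_lookup(const void *parent, const char *name)
{
    struct toy_entry *e;

    for (e = toy_table[toy_hash(parent, name)]; e != NULL; e = e->next) {
        if (e->parent == parent && strcmp(e->name, name) == 0)
            return e;
    }
    return NULL;
}
```

Because the parent is part of both the hash and the comparison, two entries
with the same name under different parents are kept apart, and a hash
collision alone never produces a false match.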

The hash lists are RCU protected, so list walking is not serialised with
concurrent updates (insertion, deletion from the hash). This is a standard RCU
list application with the exception of renames, which will be covered below.

Parent and name members of a dentry, as well as its membership in the dcache
hash, and its inode are protected by the per-dentry d_lock spinlock. A
reference is taken on the dentry (while the fields are verified under d_lock),
and this stabilises its d_inode pointer and actual inode. This gives a stable
point to perform the next step of our path walk against.

These members are also protected by d_seq seqlock, although this offers
read-only protection and no durability of results, so care must be taken when
using d_seq for synchronisation (see seqcount based lookups, below).

Renames
-------
Back to the rename case. In usual RCU protected lists, the only operations that
will happen to an object are insertion, and then eventually removal from the
list. The object will not be reused until an RCU grace period is complete.
This ensures the RCU list traversal primitives can run over the object without
problems (see RCU documentation for how this works).

However when a dentry is renamed, its hash value can change, requiring it to be
moved to a new hash list. Allocating and inserting a new alias would be
expensive and also problematic for directory dentries. Latency would be far too
high to wait for a grace period after removing the dentry and before inserting
it in the new hash bucket. So what is done is to insert the dentry into the
new list immediately.

However, when the dentry's list pointers are updated to point to objects in the
new list before waiting for a grace period, this can result in a concurrent RCU
lookup of the old list veering off into the new (incorrect) list and missing
the remaining dentries on the list.

There is no fundamental problem with walking down the wrong list, because the
dentry comparisons will never match. However it is fatal to miss a matching
dentry. So a seqlock is used to detect when a rename has occurred, and so the
lookup can be retried.

         1      2      3
        +---+  +---+  +---+
hlist-->| N-+->| N-+->| N-+->
head <--+-P |<-+-P |<-+-P |
        +---+  +---+  +---+

Rename of dentry 2 may require it to be deleted from the above list, and
inserted into a new list. Deleting 2 gives the following list.

         1             3
        +---+         +---+     (don't worry, the longer pointers do not
hlist-->| N-+-------->| N-+->    impose a measurable performance overhead
head <--+-P |<--------+-P |      on modern CPUs)
        +---+         +---+
          ^      2      ^
          |    +---+    |
          |    | N-+----+
          +----+-P |
               +---+

This is a standard RCU-list deletion, which leaves the deleted object's
pointers intact, so a concurrent list walker that is currently looking at
object 2 will correctly continue to object 3 when it is time to traverse the
next object.

However, when inserting object 2 onto a new list, we end up with this:

         1             3
        +---+         +---+
hlist-->| N-+-------->| N-+->
head <--+-P |<--------+-P |
        +---+         +---+
                 2
               +---+
               | N-+---->
          <----+-P |
               +---+

Because we didn't wait for a grace period, there may be a concurrent lookup
still at 2. Now when it follows 2's 'next' pointer, it will walk off into
another list without ever having checked object 3.

A related, but distinctly different, issue is that of rename atomicity versus
lookup operations. If a file is renamed from 'A' to 'B', a lookup must only
find either 'A' or 'B'. So if a lookup of 'A' returns NULL, a subsequent lookup
of 'B' must succeed (note the reverse is not true).

Between deleting the dentry from the old hash list, and inserting it on the new
hash list, a lookup may find neither 'A' nor 'B' matching the dentry. The same
rename seqlock is also used to cover this race in much the same way, by
retrying a negative lookup result if a rename was in progress.
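
The "retry a negative result" rule can be sketched in user-space C11 as
follows. Everything here is an assumption made for illustration: struct
toy_rename_seq, toy_lookup_retry() and the demo callback toy_racy_lookup() are
hypothetical names, the counter is a bare atomic rather than a real seqlock,
and the callback merely simulates a rename that completes while the first
lookup attempt is running.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical rename sequence counter: odd while a rename is in progress. */
struct toy_rename_seq {
    atomic_uint seq;
};

/* One lookup attempt; returns the matching object, or NULL for a miss. */
typedef void *(*toy_lookup_fn)(void *ctx);

/*
 * Run 'lookup' and trust any hit. A miss is only trusted if no rename was
 * in progress when we started and none completed while we were looking;
 * otherwise the lookup is retried.
 */
static void *toy_lookup_retry(struct toy_rename_seq *rs,
                              toy_lookup_fn lookup, void *ctx)
{
    for (;;) {
        unsigned int start =
            atomic_load_explicit(&rs->seq, memory_order_acquire);
        void *found = lookup(ctx);

        if (found != NULL)
            return found;
        atomic_thread_fence(memory_order_acquire);
        if ((start & 1u) == 0 &&
            atomic_load_explicit(&rs->seq, memory_order_relaxed) == start)
            return NULL;    /* genuine miss */
        /* a rename raced with this attempt: try again */
    }
}

/* Demo context: the first attempt races with a rename and misses. */
struct toy_race_demo {
    struct toy_rename_seq *rs;
    int attempts;
    void *target;
};

static void *toy_racy_lookup(void *ctx)
{
    struct toy_race_demo *demo = ctx;

    if (demo->attempts++ == 0) {
        /* simulate a whole rename (begin and end) during the list walk */
        atomic_fetch_add(&demo->rs->seq, 2);
        return NULL;
    }
    return demo->target;
}
```

In the demo, the first attempt misses because of the simulated rename, the
changed counter is noticed, and the second attempt finds the target. A miss
with an unchanged, even counter is returned to the caller as a real miss.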

Seqcount based lookups
----------------------
In refcount based dcache lookups, d_lock is used to serialise access to
the dentry, stabilising it while comparing its name and parent and then
taking a reference count (the reference count then gives a stable place to
start the next part of the path walk from).

As explained above, we would like to do path walking without taking locks or
reference counts on intermediate dentries along the path. To do this, a
per-dentry seqlock (d_seq) is used to take a "coherent snapshot" of what the
dentry looks like (its name, parent, and inode). That snapshot is then used to
start the next part of the path walk. When loading the coherent snapshot under
d_seq, care must be taken to load the members up-front, and use those pointers
rather than reloading from the dentry later on (otherwise we'd have interesting
things like d_inode going NULL underneath us, if the name was unlinked).

Also important is to avoid performing any destructive operations (pretty much:
no non-atomic stores to shared data), and to recheck the seqcount when we are
"done" with the operation. Retry or abort if the seqcount does not match.
Avoiding destructive or changing operations means we can easily unwind from
failure.

What this means is that a caller, provided they are holding the RCU lock to
protect the dentry object from disappearing, can perform a seqcount based
lookup which does not increment the refcount on the dentry or write to
it in any way. This returned dentry can be used for subsequent operations,
provided that d_seq is rechecked after that operation is complete.

Inodes are also rcu freed, so the seqcount lookup dentry's inode may also be
queried for permissions.

With these two parts of the puzzle, we can do path lookups without taking
locks or refcounts on dentry elements.
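
A minimal user-space sketch of taking and validating such a snapshot with C11
atomics is shown here. It is not the kernel's seqcount API: struct
toy_seq_dentry, struct toy_snapshot and the toy_*() helpers are hypothetical,
writers are assumed to be serialised by some other lock, and object lifetime
(the job of RCU in the text above) is not modelled at all.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical dentry with a sequence count guarding two members. */
struct toy_seq_dentry {
    atomic_uint d_seq;              /* odd while a writer is updating */
    _Atomic(const void *) parent;
    _Atomic(const void *) inode;
};

/* Private copy of the members, loaded once and never reloaded. */
struct toy_snapshot {
    unsigned int seq;
    const void *parent;
    const void *inode;
};

/* Writer side: bump to odd, update members, bump back to even. */
static void toy_write_begin(struct toy_seq_dentry *d)
{
    atomic_fetch_add_explicit(&d->d_seq, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);
}

static void toy_write_end(struct toy_seq_dentry *d)
{
    atomic_fetch_add_explicit(&d->d_seq, 1, memory_order_release);
}

/* Reader side: load everything up-front; nothing is stored to the dentry. */
static void toy_take_snapshot(struct toy_seq_dentry *d,
                              struct toy_snapshot *snap)
{
    snap->seq = atomic_load_explicit(&d->d_seq, memory_order_acquire);
    snap->parent = atomic_load_explicit(&d->parent, memory_order_relaxed);
    snap->inode = atomic_load_explicit(&d->inode, memory_order_relaxed);
}

/* True if the snapshot was coherent and the dentry has not changed since. */
static bool toy_snapshot_valid(struct toy_seq_dentry *d,
                               const struct toy_snapshot *snap)
{
    atomic_thread_fence(memory_order_acquire);
    return (snap->seq & 1u) == 0 &&
           atomic_load_explicit(&d->d_seq, memory_order_relaxed) == snap->seq;
}
```

A reader works only from its toy_snapshot copy and calls toy_snapshot_valid()
when it is done. If a writer cleared the inode in the meantime, the check
fails, but the reader's copy of the old inode pointer is untouched, so it can
retry or abort without having dereferenced a pointer reloaded from the dentry.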

RCU-walk path walking design
============================

Path walking code now has two distinct modes, ref-walk and rcu-walk. ref-walk
is the traditional[*] way of performing dcache lookups using d_lock to
serialise concurrent modifications to the dentry and take a reference count on
it. ref-walk is simple and obvious, and may sleep, take locks, etc while path
walking is operating on each dentry. rcu-walk uses seqcount based dentry
lookups, and can perform lookup of intermediate elements without any stores to
shared data in the dentry or inode. rcu-walk cannot be applied to all cases,
eg. if the filesystem must sleep or perform non-trivial operations, rcu-walk
must be switched to ref-walk mode.

[*] RCU is still used for the dentry hash lookup in ref-walk, but not the full
    path walk.

Where ref-walk uses a stable, refcounted "parent" to walk the remaining
path string, rcu-walk uses a d_seq protected snapshot. When looking up a
child of this parent snapshot, we open the d_seq critical section on the child
before closing the d_seq critical section on the parent. This gives an
interlocking ladder of snapshots to walk down.

     proc 101
      /----------------\
     / comm:    "vi"    \
    /  fs.root: dentry0  \
    \  fs.cwd:  dentry2  /
     \                  /
      \----------------/

So when vi wants to open("/home/npiggin/a.c", O_RDWR), then it will
start from current->fs->root, which is a pinned dentry. Alternatively,
"./a.c" would start from cwd; both names refer to the same path in
the context of proc 101.

     dentry0
    +---------------------+   rcu-walk begins here, we note d_seq, check the
    | name:    "/"        |   inode's permission, and then look up the next
    | inode:   10         |   path element which is "home"...
    | children:"home", ...|
    +---------------------+
              |
     dentry1  V
    +---------------------+   ... which brings us here. We find dentry1 via
    | name:    "home"     |   hash lookup, then note d_seq and compare name
    | inode:   678        |   string and parent pointer. When we have a match,
    | children:"npiggin"  |   we now recheck the d_seq of dentry0. Then we
    +---------------------+   check inode and look up the next element.
              |
     dentry2  V
    +---------------------+   Note: if dentry0 is now modified, lookup is
    | name:    "npiggin"  |   not necessarily invalid, so we need only keep a
    | inode:   543        |   parent for d_seq verification, and grandparents
    | children:"a.c", ... |   can be forgotten.
    +---------------------+
              |
     dentry3  V
    +---------------------+   At this point we have our destination dentry.
    | name:    "a.c"      |   We now take its d_lock, verify d_seq of this
    | inode:   14221      |   dentry. If that checks out, we can increment
    | children:NULL       |   its refcount because we're holding d_lock.
    +---------------------+

Taking a refcount on a dentry from rcu-walk mode, by taking its d_lock,
re-checking its d_seq, and then incrementing its refcount is called
"dropping rcu" or dropping from rcu-walk into ref-walk mode.
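
The three steps named above (lock, recheck, increment) might look like this in
a user-space sketch. The field names echo the text, but struct toy_ref_dentry
and toy_drop_rcu() are hypothetical, and an atomic_flag spin loop stands in for
the d_lock spinlock.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical dentry carrying a lock, a sequence count and a refcount. */
struct toy_ref_dentry {
    atomic_flag d_lock;     /* toy spinlock standing in for d_lock */
    atomic_uint d_seq;      /* sequence count noted during rcu-walk */
    unsigned int d_count;   /* reference count, protected by d_lock */
};

/*
 * Try to turn an rcu-walk snapshot (identified by 'seq') into a real
 * reference. Returns false, taking no reference, if the dentry changed
 * after the snapshot was taken.
 */
static bool toy_drop_rcu(struct toy_ref_dentry *d, unsigned int seq)
{
    bool ok;

    while (atomic_flag_test_and_set_explicit(&d->d_lock,
                                             memory_order_acquire))
        ;   /* spin until the lock is ours */
    ok = atomic_load_explicit(&d->d_seq, memory_order_relaxed) == seq;
    if (ok)
        d->d_count++;   /* safe: we hold d_lock */
    atomic_flag_clear_explicit(&d->d_lock, memory_order_release);
    return ok;
}
```

On success the caller holds a counted reference and can carry on in ref-walk
mode from this dentry; on failure nothing was written except the lock itself,
so the caller is free to fall back to a full restart.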

It is, in some sense, a bit of a house of cards. If the seqcount check of the
parent snapshot fails, the house comes down, because we had closed the d_seq
section on the grandparent, so we have nothing left to stand on. In that case,
the path walk must be fully restarted (which we do in ref-walk mode, to avoid
livelocks). It is costly to have a full restart, but fortunately they are
quite rare.

When we reach a point where sleeping is required, or a filesystem callout
requires ref-walk, then instead of restarting the walk, we attempt to drop rcu
at the last known good dentry we have. Avoiding a full restart in ref-walk in
these cases is fundamental for performance and scalability because blocking
operations such as creates and unlinks are not uncommon.

The detailed design for rcu-walk is like this:
* LOOKUP_RCU is set in nd->flags, which distinguishes rcu-walk from ref-walk.
* Take the RCU lock for the entire path walk, starting with the acquisition
  of the starting path (eg. root/cwd/fd-path). So now dentry refcounts are
  not required for dentry persistence.
* synchronize_rcu is called when unregistering a filesystem, so we can
  access d_ops and i_ops during rcu-walk.
* Similarly take the vfsmount lock for the entire path walk. So now mnt
  refcounts are not required for persistence. Also we are free to perform mount
  lookups, and to assume dentry mount points and mount roots are stable up and
  down the path.
* Have a per-dentry seqlock to protect the dentry name, parent, and inode,
  so we can load this tuple atomically, and also check whether any of its
  members have changed.
* Dentry lookups (based on parent, candidate string tuple) recheck the parent
  sequence after the child is found in case anything changed in the parent
  during the path walk.
* inode is also RCU protected so we can load d_inode and use the inode for
  limited things.
* i_mode, i_uid, i_gid can be tested for exec permissions during path walk.
* i_op can be loaded.
* When the destination dentry is reached, drop rcu there (ie. take d_lock,
  verify d_seq, increment refcount).
* If seqlock verification fails anywhere along the path, do a full restart
  of the path lookup in ref-walk mode. -ECHILD tends to be used (for want of
  a better errno) to signal an rcu-walk failure.
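
The order of the checks on one rung of the ladder, and the use of -ECHILD to
ask for a restart, can be sketched as below. This is a single-threaded
illustration of ordering only, under loud assumptions: struct toy_rcu_dentry,
toy_rcu_step() and toy_rcu_walk() are hypothetical, each dentry has at most one
cached child, the name and child fields are plain (unprotected) members, and
no RCU, vfsmount lock or permission check is modelled.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical dentry: a sequence count, a name and one cached child. */
struct toy_rcu_dentry {
    atomic_uint d_seq;
    const char *name;
    struct toy_rcu_dentry *child;   /* single cached child, for brevity */
};

/*
 * One rung of the ladder: find the child, note its d_seq, and only then
 * recheck the parent's d_seq. On success the caller may forget the parent.
 * Returns 0, or -ECHILD if rcu-walk cannot continue from here.
 */
static int toy_rcu_step(struct toy_rcu_dentry *parent, unsigned int parent_seq,
                        const char *name, struct toy_rcu_dentry **childp,
                        unsigned int *child_seq)
{
    struct toy_rcu_dentry *child = parent->child;

    if (child == NULL || strcmp(child->name, name) != 0)
        return -ECHILD;     /* uncached path element */
    *child_seq = atomic_load_explicit(&child->d_seq, memory_order_acquire);
    atomic_thread_fence(memory_order_acquire);
    if (atomic_load_explicit(&parent->d_seq, memory_order_relaxed)
        != parent_seq)
        return -ECHILD;     /* parent changed under us */
    *childp = child;
    return 0;
}

/* Walk 'n' names below 'start'; any error means "redo it in ref-walk". */
static int toy_rcu_walk(struct toy_rcu_dentry *start,
                        const char *const *names, size_t n,
                        struct toy_rcu_dentry **result)
{
    struct toy_rcu_dentry *cur = start;
    unsigned int seq = atomic_load_explicit(&cur->d_seq, memory_order_acquire);

    for (size_t i = 0; i < n; i++) {
        struct toy_rcu_dentry *next;
        unsigned int next_seq;
        int err = toy_rcu_step(cur, seq, names[i], &next, &next_seq);

        if (err)
            return err;
        cur = next;         /* the child becomes the new parent */
        seq = next_seq;
    }
    *result = cur;
    return 0;
}
```

Only the current parent's sequence number is carried from one step to the
next, which matches the note in the diagram above that grandparents can be
forgotten. A caller that sees -ECHILD would repeat the whole lookup in
ref-walk mode; that fallback is left out of the sketch.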

The cases where rcu-walk cannot continue are:
* NULL dentry (ie. any uncached path element)
* Following links

It may be possible eventually to make following links rcu-walk aware.

Uncached path elements will always require dropping to ref-walk mode, at the
very least because i_mutex needs to be grabbed, and objects allocated.

Final note:
"store-free" path walking is not strictly store free. We take vfsmount lock
and refcounts (both of which can be made per-cpu), and we also store to the
stack (which is essentially CPU-local), and we also have to take locks and
refcount on final dentry.

The point is that shared data, where practically possible, is not locked
or stored into. The result is massive improvements in performance and
scalability of path resolution.


Interesting statistics
======================

The following table gives rcu lookup statistics for a few simple workloads
(2s12c24t Westmere, debian non-graphical system). The restart column counts
ungraceful drops: attempts to drop rcu that fail due to d_seq failure and
require the entire path lookup to be done again. The other columns count
successful rcu-drops that were required before the final element: nodentry for
a missing dentry, revalidate for a filesystem revalidate routine requiring rcu
drop, permission for a permission check requiring drop, and link for symlink
traversal requiring drop.

         rcu-lookups restart       nodentry          link revalidate permission
bootup         47121       0           4624          1010      10283       7852
dbench      25386793       0 6778659(26.7%)            55        549       1156
kbuild       2696672      10    64442(2.3%)  108764(4.0%)          1       1590
git diff       39605       0             28             2          0        106
vfstest     24185492    4945   708725(2.9%) 1076136(4.4%)          0       2651

What this shows is that failed rcu-walk lookups, ie. ones that are restarted
entirely with ref-walk, are quite rare. Even the "vfstest" case, which
specifically has concurrent renames/mkdir/rmdir/creat/unlink/etc to exercise
such races, does not show a large number of restarts.

Dropping from rcu-walk to ref-walk means that we have encountered a dentry
where the reference count needs to be taken for some reason. This is either
because we have reached the target of the path walk, or because we have
encountered a condition that can't be resolved in rcu-walk mode.  Ideally, we
drop rcu-walk only when we have reached the target dentry, so the other
statistics show where this does not happen.

Note that a graceful drop from rcu-walk mode due to something such as the
dentry not existing (which can be common) is not necessarily a failure of
the rcu-walk scheme, because some elements of the path may have been walked in
rcu-walk mode. The further we get from common path elements (such as cwd or
root), the less contended the dentry is likely to be. The closer we are to
common path elements, the more likely they will exist in dentry cache.


Papers and other documentation on dcache locking
================================================

1. Scaling dcache with RCU (http://linuxjournal.com/article.php?sid=7124).

2. http://lse.sourceforge.net/locking/dcache/dcache.html