Commit graph

93021 commits

Author SHA1 Message Date
Trond Myklebust
a2b2bb8822 NFSv4: Attempt to use machine credentials in SETCLIENTID calls
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-04-19 16:54:59 -04:00
Trond Myklebust
7c67db3a8a NFSv4: Reintroduce machine creds
We need to try to ensure that we always use the same credentials whenever
we re-establish the clientid on the server. If not, the server won't
recognise that we're the same client, and so may not allow us to recover
state.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-04-19 16:54:56 -04:00
Trond Myklebust
78ea323be6 NFSv4: Don't use cred->cr_ops->cr_name in nfs4_proc_setclientid()
With the recent change to generic creds, we can no longer use
cred->cr_ops->cr_name to distinguish between RPCSEC_GSS principals and
AUTH_SYS/AUTH_NULL identities. Replace it with the rpc_authops->au_name
instead...

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-04-19 16:54:53 -04:00
Fred Isaman
4410924157 nfs: fix printout of multiword bitfields
Benny points out that zero-padding of multiword bitfields is necessary,
and that delimiting each word is nice to avoid endianess confusion.

bhalevy: without zero padding output can be ambiguous. Also,
since the printed array of two 32-bit unsigned integers is not a
64-bit number, delimiting the output with a semicolon makes more sense.

Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-04-19 16:54:50 -04:00
Benny Halevy
856dff3d38 nfs: return negative error value from nfs{,4}_stat_to_errno
All use sites for nfs{,4}_stat_to_errno negate their return value.
It's more efficient to return a negative error from the stat_to_errno convertors
rather than negating its return value everywhere. This also produces slightly
smaller code.

Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-04-19 16:54:47 -04:00
Trond Myklebust
d11d10cc05 NLM/lockd: Ensure client locking calls use correct credentials
Now that we've added the 'generic' credentials (that are independent of the
rpc_client) to the nfs_open_context, we can use those in the NLM client to
ensure that the lock/unlock requests are authenticated to whoever
originally opened the file.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-04-19 16:54:43 -04:00
Trond Myklebust
c4d7c402b7 NFS: Remove the buggy lock-if-signalled case from do_setlk()
Both NLM and NFSv4 should be able to clean up adequately in the case where
the user interrupts the RPC call...

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-04-19 16:53:52 -04:00
Trond Myklebust
5f50c0c6d6 NLM/lockd: Fix a race when cancelling a blocking lock
We shouldn't remove the lock from the list of blocked locks until the
CANCEL call has completed since we may be racing with a GRANTED callback.

Also ensure that we send an UNLOCK if the CANCEL request failed. Normally
that should only happen if the process gets hit with a fatal signal.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-04-19 16:53:49 -04:00
Trond Myklebust
6b4b3a752b NLM/lockd: Ensure that nlmclnt_cancel() returns results of the CANCEL call
Currently, it returns success as long as the RPC call was sent. We'd like
to know if the CANCEL operation succeeded on the server.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-04-19 16:53:45 -04:00
Trond Myklebust
8ec7ff7444 NLM: Remove the signal masking in nlmclnt_proc/nlmclnt_cancel
The signal masks have been rendered obsolete by the preceding patch.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-04-19 16:53:42 -04:00
Trond Myklebust
dc9d8d0481 NLM/lockd: convert __nlm_async_call to use rpc_run_task()
Peter Staubach comments:

> In the course of investigating testing failures in the locking phase of
> the Connectathon testsuite, I discovered a couple of things.  One was
> that one of the tests in the locking tests was racy when it didn't seem
> to need to be and two, that the NFS client asynchronously releases locks
> when a process is exiting.
...
> The Single UNIX Specification Version 3 specifies that:  "All locks
> associated with a file for a given process shall be removed when a file
> descriptor for that file is closed by that process or the process holding
> that file descriptor terminates.".
>
> This does not specify whether those locks must be released prior to the
> completion of the exit processing for the process or not.  However,
> general assumptions seem to be that those locks will be released.  This
> leads to more deterministic behavior under normal circumstances.

The following patch converts the NFSv2/v3 locking code to use the same
mechanism as NFSv4 for sending asynchronous RPC calls and then waiting for
them to complete. This ensures that the UNLOCK and CANCEL RPC calls will
complete even if the user interrupts the call, yet satisfies the
above request for synchronous behaviour on process exit.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-04-19 16:53:39 -04:00
Trond Myklebust
5e7f37a76f NLM/lockd: Add a reference counter to struct nlm_rqst
When we replace the existing synchronous RPC calls with asynchronous calls,
the reference count will be needed in order to allow us to examine the
result of the RPC call.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-04-19 16:53:36 -04:00
Trond Myklebust
536ff0f809 NFSv4: Ensure we don't corrupt fl->fl_flags in nfs4_proc_unlck
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-04-19 16:53:33 -04:00
Trond Myklebust
4a9af59fee NLM/lockd: Ensure we don't corrupt fl->fl_flags in nlmclnt_unlock()
Also fix up nlmclnt_lock() so that it doesn't pass modified versions of
fl->fl_flags to nlmclnt_cancel() and other helpers.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-04-19 16:53:30 -04:00
Trond Myklebust
1e799b673c SUNRPC: Fix read ordering problems with req->rq_private_buf.len
We want to ensure that req->rq_private_buf.len is updated before
req->rq_received, so that call_decode() doesn't use an old value for
req->rq_rcv_buf.len.

In 'call_decode()' itself, instead of using task->tk_status (which is set
using req->rq_received) must use the actual value of
req->rq_private_buf.len when deciding whether or not the received RPC reply
is too short.

Finally ensure that we set req->rq_rcv_buf.len to zero when retrying a
request. A typo meant that we were resetting req->rq_private_buf.len in
call_decode(), and then clobbering that value with the old rq_rcv_buf.len
again in xprt_transmit().

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-04-19 16:53:20 -04:00
Trond Myklebust
c1d519312d NFSv4: Only increment the sequence id if the server saw it
It is quite possible that the OPEN, CLOSE, LOCK, LOCKU,... compounds fail
before the actual stateful operation has been executed (for instance in the
PUTFH call). There is no way to tell from the overall status result which
operations were executed from the COMPOUND.

The fix is to move incrementing of the sequence id into the XDR layer,
so that we do it as we process the results from the stateful operation.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-04-19 16:53:15 -04:00
Trond Myklebust
35d05778e2 NFSv4: Remove bogus call to nfs4_drop_state_owner() in _nfs4_open_expired()
There should be no need to invalidate a perfectly good state owner just
because of a stale filehandle. Doing so can cause the state recovery code
to break, since nfs4_get_renew_cred() and nfs4_get_setclientid_cred() rely
on finding active state owners.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-04-19 16:53:12 -04:00
Trond Myklebust
dbae4c73f0 NFS: Ensure that rpc_run_task() errors are propagated back to the caller
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-04-19 16:53:08 -04:00
Trond Myklebust
c9d8f89d98 NFS: Ensure that the write code cleans up properly when rpc_run_task() fails
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-04-19 16:53:05 -04:00
Trond Myklebust
fdd1e74c89 NFS: Ensure that the read code cleans up properly when rpc_run_task() fails
In the case of readpage() we need to ensure that the pages get unlocked,
and that the error is flagged.

In the case of O_DIRECT, we need to ensure that the pages are all released.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-04-19 16:53:01 -04:00
Trond Myklebust
73e3302f60 NFS: Fix nfs_wb_page() to always exit with an error or a clean page
It is possible for nfs_wb_page() to sometimes exit with 0 return value, yet
the page is left in a dirty state.
For instance in the case where the server rebooted, and the COMMIT request
failed, then all the previously "clean" pages which were cached by the
server, but were not guaranteed to have been writted out to disk,
have to be redirtied and resent to the server.
The fix is to have nfs_wb_page_priority() check that the page is clean
before it exits...

This fixes a condition that triggers the BUG_ON(PagePrivate(page)) in
nfs_create_request() when we're in the nfs_readpage() path.

Also eliminate a redundant BUG_ON(!PageLocked(page)) while we're at it. It
turns out that clear_page_dirty_for_io() has the exact same test.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-04-19 16:52:58 -04:00
Trond Myklebust
080a1f148d SUNRPC: Don't attempt to destroy expired RPCSEC_GSS credentials..
..and always destroy using a 'soft' RPC call. Destroying GSS credentials
isn't mandatory; the server can always cope with a few credentials not
getting destroyed in a timely fashion.

This actually fixes a hang situation. Basically, some servers will decide
that the client is crazy if it tries to destroy an RPC context for which
they have sent an RPCSEC_GSS_CREDPROBLEM, and so will refuse to talk to it
for a while.
The regression therefor probably was introduced by commit
0df7fb74fb.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-04-19 16:52:54 -04:00
Trond Myklebust
b6ddf64ffe SUNRPC: Fix up xprt_write_space()
The rest of the networking layer uses SOCK_ASYNC_NOSPACE to signal whether
or not we have someone waiting for buffer memory. Convert the SUNRPC layer
to use the same idiom.
Remove the unlikely()s in xs_udp_write_space and xs_tcp_write_space. In
fact, the most common case will be that there is nobody waiting for buffer
space.

SOCK_NOSPACE is there to tell the TCP layer whether or not the cwnd was
limited by the application window. Ensure that we follow the same idiom as
the rest of the networking layer here too.

Finally, ensure that we clear SOCK_ASYNC_NOSPACE once we wake up, so that
write_space() doesn't keep waking things up on xprt->pending.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-04-19 16:52:44 -04:00
Trond Myklebust
24b74bf0c9 SUNRPC: Fix a bug in call_decode()
call_verify() can, under certain circumstances, free the RPC slot. In that
case, our cached pointer 'req = task->tk_rqstp' is invalid. Bug was
introduced in commit 220bcc2afd (SUNRPC:
Don't call xprt_release in call refresh).

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-04-19 16:52:33 -04:00
Matthew Wilcox
d2f5e80862 Deprecate the asm/semaphore.h files in feature-removal-schedule.
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
2008-04-19 13:49:34 -04:00
Ingo Molnar
486fdae214 sched: build fix
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-19 19:45:01 +02:00
Viktor Radnai
b9b158fe1c sched: better rt-group documentation
Viktor was nice enough to enhance the document based on my replies to
his questions on the subject.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-19 19:45:01 +02:00
Ingo Molnar
c24b7c5244 sched: features fix
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-19 19:45:00 +02:00
Peter Zijlstra
f00b45c145 sched: /debug/sched_features
provide a text based interface to the scheduler features; this saves the
'user' from setting bits using decimal arithmetic.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-19 19:45:00 +02:00
Ingo Molnar
06379aba52 sched: add SCHED_FEAT_DEADLINE
unused at the moment.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-19 19:45:00 +02:00
Peter Zijlstra
7ba2e74ab5 sched: debug: show a weight tree
Print a tree of weights.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-19 19:45:00 +02:00
Peter Zijlstra
8f1bc385cf sched: fair: weight calculations
In order to level the hierarchy, we need to calculate load based on the
root view. That is, each task's load is in the same unit.

             A
            / \
           B   1
          / \
         2   3

To compute 1's load we do:

	   weight(1)
	--------------
	 rq_weight(A)

To compute 2's load we do:

	  weight(2)      weight(B)
	------------ * -----------
	rq_weight(B)   rw_weight(A)

This yields load fractions in comparable units.

The consequence is that it changes virtual time. We used to have:

                time_{i}
  vtime_{i} = ------------
               weight_{i}

  vtime = \Sum vtime_{i} = time / rq_weight.

But with the new way of load calculation we get that vtime equals time.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-19 19:45:00 +02:00
Peter Zijlstra
4a55bd5e97 sched: fair-group: de-couple load-balancing from the rb-trees
De-couple load-balancing from the rb-trees, so that I can change their
organization.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-19 19:45:00 +02:00
Peter Zijlstra
ac884dec6d sched: fair-group scheduling vs latency
Currently FAIR_GROUP sched grows the scheduler latency outside of
sysctl_sched_latency, invert this so it stays within.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-19 19:45:00 +02:00
Peter Zijlstra
58d6c2d72f sched: rt-group: optimize dequeue_rt_stack
Now that the group hierarchy can have an arbitrary depth the O(n^2) nature
of RT task dequeues will really hurt. Optimize this by providing space to
store the tree path, so we can walk it the other way.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-19 19:45:00 +02:00
Peter Zijlstra
d19ca30874 sched: debug: add some debug code to handle the full hierarchy
Add some extra debug output so we can get a better overview of the
full hierarchy.

We print the cgroup path after each cfs_rq, so we can see what group
we're looking at.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-19 19:45:00 +02:00
Peter Zijlstra
18d95a2832 sched: fair-group: SMP-nice for group scheduling
Implement SMP nice support for the full group hierarchy.

On each load-balance action, compile a sched_domain wide view of the full
task_group tree. We compute the domain wide view when walking down the
hierarchy, and readjust the weights when walking back up.

After collecting and readjusting the domain wide view, we try to balance the
tasks within the task_groups. The current approach is a naively balance each
task group until we've moved the targeted amount of load.

Inspired by Srivatsa Vaddsgiri's previous code and Abhishek Chandra's H-SMP
paper.

XXX: there will be some numerical issues due to the limited nature of
     SCHED_LOAD_SCALE wrt to representing a task_groups influence on the
     total weight. When the tree is deep enough, or the task weight small
     enough, we'll run out of bits.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Abhishek Chandra <chandra@cs.umn.edu>
CC: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-19 19:45:00 +02:00
Hidetoshi Seto
1d3504fcf5 sched, cpuset: customize sched domains, core
[rebased for sched-devel/latest]

 - Add a new cpuset file, having levels:
     sched_relax_domain_level

 - Modify partition_sched_domains() and build_sched_domains()
   to take attributes parameter passed from cpuset.

 - Fill newidle_idx for node domains which currently unused but
   might be required if sched_relax_domain_level become higher.

 - We can change the default level by boot option 'relax_domain_level='.

Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-19 19:45:00 +02:00
Hidetoshi Seto
4d5f35533f sched, cpuset: customize sched domains, docs
This patch introduces new feature of cpuset - sched domain customization.

This version provides a per-cpuset file 'sched_relax_domain_level' that
enable us to change the searching range of scheduler, which used to limit
how many cpus the scheduler searches at some schedule events, such as
wakening task and running out of runqueue.

Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-19 19:45:00 +02:00
Peter Zijlstra
b758149c02 sched: prepatory code movement
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-19 19:45:00 +02:00
Peter Zijlstra
b40b2e8eb5 sched: rt: multi level group constraints
multi level rt constraints

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-19 19:45:00 +02:00
Peter Zijlstra
f473aa5e02 sched: task_group hierarchy
Add the full parent<->child relation thing into task_groups as well.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-19 19:45:00 +02:00
Peter Zijlstra
eff766a65c sched: fix the task_group hierarchy for UID grouping
UID grouping doesn't actually have a task_group representing the root of
the task_group tree. Add one.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-19 19:45:00 +02:00
Dhaval Giani
ec7dc8ac73 sched: allow the group scheduler to have multiple levels
This patch makes the group scheduler multi hierarchy aware.

[a.p.zijlstra@chello.nl: rt-parts and assorted fixes]
Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-19 19:44:59 +02:00
Dhaval Giani
354d60c2ff sched: mix tasks and groups
This patch allows tasks and groups to exist in the same cfs_rq. With this
change the CFS group scheduling follows a 1/(M+N) model from a 1/(1+N)
fairness model where M tasks and N groups exist at the cfs_rq level.

[a.p.zijlstra@chello.nl: rt bits and assorted fixes]
Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-19 19:44:59 +02:00
Ingo Molnar
ea736ed5d3 sched: fix checks
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-19 19:44:59 +02:00
Peter Zijlstra
112f53f5d7 sched: old sleeper bonus
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-19 19:44:59 +02:00
Mike Travis
cd8ba7cd9b sched: add new set_cpus_allowed_ptr function
Add a new function that accepts a pointer to the "newly allowed cpus"
cpumask argument.

int set_cpus_allowed_ptr(struct task_struct *p, const cpumask_t *new_mask)

The current set_cpus_allowed() function is modified to use the above
but this does not result in an ABI change.  And with some compiler
optimization help, it may not introduce any additional overhead.

Additionally, to enforce the read only nature of the new_mask arg, the
"const" property is migrated to sub-functions called by set_cpus_allowed.
This silences compiler warnings.

Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-19 19:44:59 +02:00
Mike Travis
e0982e90cd init: move setup of nr_cpu_ids to as early as possible
Move the setting of nr_cpu_ids from sched_init() to start_kernel()
so that it's available as early as possible.

Note that an arch has the option of setting it even earlier if need be,
but it should not result in a different value than the setup_nr_cpu_ids()
function.

Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-19 19:44:59 +02:00
Mike Travis
4bdbaad33d sched: remove another cpumask_t variable from stack
* Remove another cpumask_t variable from stack that was missed in the
      last kernel_sched_c updates.

Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-19 19:44:59 +02:00