When glock_lo_add and rg_lo_add attempt to add an element to the log, they
check to see if has already been added before locking the log. If another
process adds that element to the log in this window between the check and
locking the log, the element will be added to the list twice. This causes
the log element list to become corrupted in such a way that the log element
can never be successfully removed from the list. This patch pulls the
list_empty() check inside the log lock, to remove this window.
Signed-off-by: Benjamin E. Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
The following patch speeds up lock_dlm's locking by moving the sprintf
out from the lock acquisition path and into the lock creation path. This
reduces the amount of CPU time used in acquiring locks by a fair amount.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Acked-by: David Teigland <teigland@redhat.com>
Currently if the lockspace removal fails the misc device associated with a
lockspace is left deleted. After that there is no way to access the orphaned
lockspace from userland.
This patch recreates the misc device if th dlm_release_lockspace fails. I
believe this is better than attempting to remove the lockspace first because
that leaves an unattached device lying around. The potential gap in which there
is no access to the lockspace between removing the misc device and recreating it
is acceptable ... after all the application is trying to remove it, and only new
users of the lockspace will be affected.
Signed-Off-By: Patrick Caulfield <pcaulfie@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Since gcc didn't evaluate the last two terms of the expression in
glock.c:1881 as a constant expression, it resulted in an error on
i386 due to the lack of a 64bit divide instruction. This adds some
brackets to fix the problem.
This was reported by Andrew Morton.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
This patch prevents the printing of a warning message in cases where
the fs is functioning normally by handing off responsibility for
unlinked, but still open inodes, to another node for eventual deallocation.
Also, there is now an improved system for ensuring that such requests
to other nodes do not get lost. The callback on the iopen lock is
only ever called when i_nlink == 0 and when a node is unable to deallocate
it due to it still being in use on another node. When a node receives
the callback therefore, it knows that i_nlink must be zero, so we mark
it as such (in gfs2_drop_inode) in order that it will then attempt
deallocation of the inode itself.
As an additional benefit, queuing a demote request no longer requires
a memory allocation. This simplifies the code for dealing with gfs2_holders
as it removes one special case.
There are two new fields in struct gfs2_glock. gl_demote_state is the
state which the remote node has requested and gl_demote_time is the
time when the request came in. Both fields are only valid when the
GLF_DEMOTE flag is set in gl_flags.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
If we are writing a file, and in the middle of writing the file
another node attempts to get a shared lock on that file (by doing a du for
example) the process doing the writing will hang waiting on lock_page. The
reason for this is because when we have waiters on a exclusive glock, we will go
through and flush out all dirty pages associated with that inode and release the
lock. The problem is that when we flush the dirty pages, we could hit a page
that we have locked durring the generic_file_buffered_write part of this
operation. This patch unlocks the page before we go to dequeue the lock and
locks it immediatly afterwards, since generic_file_buffered_write needs the page
locked when the commit_write is completed. This patch resolves the problem,
however if somebody sees a better way to do this please don't hesistate to yell.
Signed-off-by: Josef Whiter <jwhiter@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
The length of the second element of the kvec array was not initialised before
being added to the first one. This could cause invalid lengths to be passed to
kernel_recvmsg
Signed-Off-By: Patrick Caulfield <pcaulfie@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
If you specify an invalid mount option when trying to mount a gfs2 filesystem,
gfs2 will oops. The attached patch resolves this problem.
Signed-off-by: Josef Whiter <jwhiter@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
The attached patch resolves bz 228540. This adds the capability
for gfs2 to dump gfs2 locks through the debugfs file system.
This used to exist in gfs1 as "gfs_tool lockdump" but it's missing from
gfs2 because all the ioctls were stripped out. Please see the bugzilla
for more history about the fix. This patch is also attached to the bugzilla
record.
The patch is against Steve Whitehouse's latest nmw git tree kernel
(2.6.21-rc1) and has been tested on system trin-10.
Signed-off-by: Robert Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch contains the scheduled removal of the i8xx_tco watchdog
driver.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Wim Van Sebroeck <wim@iguana.be>
Try running this script in an NFS mounted directory (Client relatively
recent - 2.6.18 has the problem as does 2.6.20).
------------------------------------------------------
#!/bin/bash
#
# This script will produce the following errormessage from tar:
#
# tar: newdir/innerdir/innerfile: file changed as we read it
# create dirs
rm -rf nfstest
mkdir -p nfstest/dir/innerdir
# create files (should not be empty)
echo "Hello World!" >nfstest/dir/file
echo "Hello World!" >nfstest/dir/innerdir/innerfile
# problem only happens if we sleep before chmod
sleep 1
# change file modes
chmod -R a+r nfstest
# rename dir
mv nfstest/dir nfstest/newdir
# tar it
tar -cf nfstest/nfstest.tar -C nfstest newdir
# restore old dir name
mv nfstest/newdir nfstest/dir
--------------------------------------------------------
What happens:
The 'chmod -R' does a readdir_plus in each directory and the results
get cached in the page cache. It then updates the ctime on each file
by one second. When this happens, the post-op attributes are used to
update the ctime stored on the client to match the value in the kernel.
The 'mv' calls shrink_dcache_parent on the directory tree which
flushes all the dentries (so a new lookup will be required) but
doesn't flush the inodes or pagecache.
The 'tar' does a readdir on each directory, but (in the case of
'innerdir' at least) satisfies it from the pagecache and uses the
READDIRPLUS data to update all the inodes. In the case of
'innerdir/innerfile', the ctime is out of date.
'tar' then calls 'lstat' on innerdir/innerfile getting an old ctime.
It then opens the file (triggering a GETATTR), reads the content, and
then calls fstat to see if anything has changed. It finds that ctime
has changed and so complains.
The problem seems to be that the cache readdirplus info is kept around
for too long.
My patch below discards pagecache data for directories when
dentry_iput is called on them. This effectively removes the symptom
which convinces me that I correctly understand the problem. However
I'm not convinced that is a proper solution, as there could easily be
other races that trigger the same problem without being affected by
this 'fix'.
One possibility would be to require that readdirplus pagecache data be
only used *once* to instantiate an inode. Somehow it should then be
invalidated so that if the dentry subsequently disappears, it will
cause a new request to the server to fill in the stat data.
Another possibility is to compare the cache_change_attribute on the
inode with something similar for the readdirplus info and reject the
info from readdirplus if it is too old.
I haven't tried to implement these and would value other opinions
before I do.
Thanks,
NeilBrown
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Don't use uninitialsed value for fattr->time_start in readdirplus results.
The 'fattr' structure filled in by nfs3_decode_direct does not get a
value for ->time_start set.
Thus if an entry is for an inode that we already have in cache,
when nfs_readdir_lookup calls nfs_fhget, it will call nfs_refresh_inode
and may update the inode with out-of-date information.
Directories are read a page at a time, so each page could have a
different timestamp that "should" be used to set the time_start for
the fattr for info in that page. However storing the timestamp per
page is awkward. (We could stick in the first 4 bytes and only read 4092
bytes, but that is a bigger code change than I am interested it).
This patch ignores the readdir_plus attributes if a readdir finds the
information already in cache, and otherwise sets ->time_start to the time
the readdir request was sent to the server.
It might be nice to store - in the directory inode - the time stamp for
the earliest readdir request that is still in the page cache, so that we
don't ignore attribute data that we don't have to. This patch doesn't do
that.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
READDIRPLUS can be a performance hindrance when the client is working with
large directories. In addition, some servers still have bugs in their
implementations (e.g. Tru64 returns wrong values for the fsid).
Add a mount flag to enable users to turn it off at mount time following the
implementation in Apple's NFS client.
Signed-off-by: Steve Dickson <steved@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
net/sunrpc/pmap_clnt.c has been replaced by net/sunrpc/rpcb_clnt.c.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
It is arguable whether NFSROOT will support IPv6, and thus whether
rpcb_getport_external needs to support rpcbind versions greater than 2.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Eventually this interface will support versions 3 and 4 of the rpcbind
protocol, which will allow the Linux RPC server to register services on
IPv6 addresses.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Now that we have a version of the portmapper that supports versions 3 and 4
of the rpcbind protocol, use it for new RPC client connections over
sockets.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Introduce a replacement for the in-kernel portmapper client that supports
all 3 versions of the rpcbind protocol. This code is not used yet.
Original code by Groupe Bull updated for the latest kernel, with multiple
bug fixes.
Note that rpcb_clnt.c does not yet support registering via versions 3 and
4 of the rpcbind protocol. That is planned for a later patch.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Currently rpc_malloc sets req->rq_buffer internally. Make this a more
generic interface: return a pointer to the new buffer (or NULL) and
make the caller set req->rq_buffer and req->rq_bufsize. This looks much
more like kmalloc and eliminates the side effects.
To fix a potential deadlock, this patch also replaces GFP_NOFS with
GFP_NOWAIT in rpc_malloc. This prevents async RPCs from sleeping outside
the RPC's task scheduler while allocating their buffer.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The RPC buffer size estimation logic in net/sunrpc/clnt.c always
significantly overestimates the requirements for the buffer size.
A little instrumentation demonstrated that in fact rpc_malloc was never
allocating the buffer from the mempool, but almost always called kmalloc.
To compute the size of the RPC buffer more precisely, split p_bufsiz into
two fields; one for the argument size, and one for the result size.
Then, compute the sum of the exact call and reply header sizes, and split
the RPC buffer precisely between the two. That should keep almost all RPC
buffers within the 2KiB buffer mempool limit.
And, we can finally be rid of RPC_SLACK_SPACE!
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
NLM version 4 requests estimate the call and reply header sizes rather
conservatively, using the very maximum size allowed in the protocol even
though Linux always uses only a small fraction of the allowable space.
Reduce the size of caller and lock arguments to conserve RPC buffer space
while XDR encoding NLM4 arguments. Add compile-time checks to ensure the
hostname string won't overflow NLM protocol maximums.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Currently we do write coalescing in a very inefficient manner: one pass in
generic_writepages() in order to lock the pages for writing, then one pass
in nfs_flush_mapping() and/or nfs_sync_mapping_wait() in order to gather
the locked pages for coalescing into RPC requests of size "wsize".
In fact, it turns out there is actually a deadlock possible here since we
only start I/O on the second pass. If the user signals the process while
we're in nfs_sync_mapping_wait(), for instance, then we may exit before
starting I/O on all the requests that have been queued up.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Do the coalescing of read requests into block sized requests at start of
I/O as we scan through the pages instead of going through a second pass.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
It is redundant, and will interfere with the call to
balance_dirty_pages_ratelimited_nr in generic_file_write().
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The nfs statfs function returns a success code on error, and fills the
output buffer with invalid values. The attached patch makes it return a
correct error code instead.
Signed-off-by: Amnon Aaronsohn <amnonaar@gmail.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
(Modified patch to reinstate the dprintk())
The scheme where the thermal driver displayed the
cooling mode /proc/acpi/thermal_zone/*/cooling_mode
was flawed in two ways.
First, the success of _SCP doesn't actually mean
that the BIOS moved any trip points.
On many BIOS, _SCP is present, but does nothing.
So displaying what _SCP executed actually
was wrong more times than it was right.
Second, examining the relative position of the
trip points when the thermal_zone is added
is insufficient -- as the BIOS reserves the right
to change the trip points at run-time.
The only reliable way for the user to determine if
the thermal zone is in active, passive, or critical
mode is to examine the relative position of the trip points.
The user can do this without the kernel doing it
for them by looking in /proc/acpi/thermal_zone/*/trip_points
New contents for /proc/acpi/thermal_zone/*/cooling_mode:
If _SCP available:
"0 - Active; 1 - Passive\n"
If _SCP unavailable:
"<setting not supported>\n"
Signed-off-by: Len Brown <len.brown@intel.com>
/proc/acpi/thermal_zone/*/trip_points displays
what the kernel reads from the BIOS via ACPI.
If you echo a string of ':' deliminted numbers to this file
then it will change what it displays.
But it shouldn't, since the kernel has no way to communicate
these changes to ACPI thermal zones. ACPI thermal zone
trip points are read-only.
The kernel does have the opportunity to ask the BIOS to change
the trip points with _SCP - Set Cooling Policy.
Request Active Cooling Mode:
# echo 0 > /proc/acpi/thermal_zone/*/cooling_policy
Request Passive Cooling Mode:
# echo 1 > /proc/acpi/thermal_zone/*/cooling_policy
However, in practice it is quite rare for the BIOS
to support the optional _SCP, and it is even more rare
for the BIOS to export an _SCP that actually changes
the trip points.
Signed-off-by: Len Brown <len.brown@intel.com>
The Marvell IDE interface on my machine would hit a BUG_ON() in
lib/iomem.c because it was calling ata_pci_init_one() specifying just a
single port on the host, but that would actually end up trying to
initialize two ports, the second one with bogus information.
This fixes "ata_pci_init_one()" so that it actually passes down the
n_ports variable that it got from the low-level driver to the host
allocation routine ("ata_host_alloc_pinfo()"), which results in the ATA
layer actually having the correct port number information.
And in order to make it all work, I also needed to fix a few places that
had incorrectly hard-coded the fact that a host always had exactly two
ports (both ata_pci_init_bmdma() and ata_request_legacy_irqs() would
just always iterate over both ports).
Acked-by: Jeff Garzik <jeff@garzik.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix TERMINATE layer, type, and ecode values based on
conformance testing.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
If skb allocation fails when we start the device, we call
ipoib_cm_dev_stop() even though ipoib_cm_dev_open() did not run to
completion, so we pass an invalid pointer to ib_destroy_cm_id and get
an oops.
Fix by clearing cm.id on error, and testing it in cm_dev_stop().
This fixes <https://bugs.openfabrics.org/show_bug.cgi?id=561>
Signed-off-by: Michael S. Tsirkin <mst@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Fix the pending mmap code so it doesn't corrupt the list of pending
mmaps and crash the machine when pending mmaps are destroyed without
first being mapped. Also, remove an unused variable, and use standard
kernel lists instead of our own homebrewed linked list implementation
to keep the pending mmap list.
Signed-off-by: Robert Walsh <robert.walsh@qlogic.com>
Signed-off-by: Ralph Campbell <ralph.campbell@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
With mthca, RC QPs can starve each other and even UD QPs on the same
hardware schedule queue. As a result, userspace MPI can starve
e.g. IPoIB traffic, with netdev watchdog warnings getting printed out,
and TCP connections getting stuck or failing.
Reduce the chance of this happening by using three separate hardware
schedule queues: one for userspace RC QPs, one for kernel RC QPs, and
one for all other QPs.
Signed-off-by: Michael S. Tsirkin <mst@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
This fixes a problem which causes too many RC timeouts and
retransmits.
Signed-off-by: Ralph Campbell <ralph.campbell@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
This patch fixes the problem reported by Bernd Schubert <bs@q-leap.de>
with kernel debug options enabled:
BUG: at kernel/lockdep.c:1860 trace_hardirqs_on()
This was caused by using spin_lock_irq()/spin_unlock_irq() from
interrupt context. Fix all the places that might be called from
interrupts to use spin_lock_irqsave()/spin_unlock_irqrestore().
Signed-off-by: Ralph Campbell <ralph.campbell@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
For backwards compatibility, call_platform_enable_wakeup() can return 0
instead of -EIO since we aren't guaranteed to have errno defined.
Cc: David Brownell <david-b@pacbell.net>
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: "Randy.Dunlap" <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add a kvasprintf() function to complement kasprintf().
No in-tree users yet, but I have some coming up.
[akpm@linux-foundation.org: EXPORT it]
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Keir Fraser <keir@xensource.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch changes the docs and behaviour from "all states valid" to "no
states valid" if no .valid callback is assigned. Users of pm_ops that only
need mem sleep can assign pm_valid_only_mem without any overhead, others
will require more elaborate callbacks.
Now that all users of pm_ops have a .valid callback this is a safe thing to
do and prevents things from getting messy again as they were before.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Acked-by: Pavel Machek <pavel@ucw.cz>
Looks-okay-to: Rafael J. Wysocki <rjw@sisk.pl>
Cc: <linux-pm@lists.linux-foundation.org>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Almost all users of pm_ops only support mem sleep, don't check in .valid and
don't reject any others in .prepare so users can be confused if they check
/sys/power/state, especially when new states are added (these would then
result in s-t-r although they're supposed to be something different).
This patch implements a generic pm_valid_only_mem function that is then
exported for users and puts it to use in almost all existing pm_ops.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Cc: David Brownell <david-b@pacbell.net>
Acked-by: Pavel Machek <pavel@ucw.cz>
Cc: linux-pm@lists.linux-foundation.org
Cc: Len Brown <lenb@kernel.org>
Acked-by: Russell King <rmk@arm.linux.org.uk>
Cc: Greg KH <greg@kroah.com>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>