Sometimes cpu_idle_wait gets stuck because it might miss CPUs that are
already in idle, have no tasks waiting to run and have no interrupts going
to them. This is common on bootup when switching cpu idle governors.
This patch gives those CPUs that don't check in an IPI kick.
Background:
-----------
I noticed this while developing the mcount patches: every once in a
while the system would hang. Looking deeper, the hang was always at boot
up when registering init_menu of the cpu_idle menu governor. Talking
with Thomas Gleixner, we discovered that one of the CPUs had no timer
events scheduled for it and it was in idle (running with NO_HZ). So the
CPU would not set the cpu_idle_state bit.
Hitting sysrq-t a few times would eventually route the interrupt to the
stuck CPU and the system would continue.
Note, I would have used the PDA isidle but that is set after the
cpu_idle_state bit is cleared, and would leave a window open where we
may miss being kicked.
hmm, looking closer at this, we still have a small race window between
clearing the cpu_idle_state and disabling interrupts (hence the RFC).
   CPU0:                                     CPU1:
   ---------                                 ---------
   cpu_idle_wait():                          cpu_idle():
      |                                        __get_cpu_var(is_idle) = 1;
      |                                        if (__get_cpu_var(cpu_idle_state)) /* == 0 */
   per_cpu(cpu_idle_state, 1) = 1;             |
   if (per_cpu(is_idle, 1)) /* == 1 */         |
      smp_call_function(1)                     |
      |                                      receives ipi and runs do_nothing.
   wait on map == empty                      idle();
     /* waits forever */
So really we need interrupts off for most of this then. One might think
that we could simply clear the cpu_idle_state from do_nothing, but I'm
assuming that cpu_idle governors can be removed, and this might cause a
race that a governor might be used after the module was removed.
Venki said:
I think your RFC patch is the right solution here. As I see it, there is
no race with your RFC patch. As long as you call a dummy smp_call_function
on all CPUs, we should be OK. We can get rid of cpu_idle_state and the
current wait forever logic altogether with dummy smp_call_function. And so
there won't be any wait forever scenario.
The whole point of cpu_idle_wait() is to make all CPUs come out of idle
loop at least once. The caller will use cpu_idle_wait something like this.
// Want to change idle handler
- Switch global idle handler to always present default_idle
- call cpu_idle_wait so that all cpus come out of idle for an instant
  and stop using old idle pointer and start using default idle
- Change the idle handler to a new handler
- optional cpu_idle_wait if you want all cpus to start using the new
  handler immediately.
Maybe the below 1s patch is a safe bet for .24. But for .25, I would say we
just replace all complicated logic by simple dummy smp_call_function and
remove cpu_idle_state altogether.
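To illustrate the handler switch sequence, here is a minimal user-space
model, not kernel code. It assumes a dummy kick that makes every CPU pass
through the idle loop once and reload the global pointer. pm_idle_model,
cpu_cached, kick_all_cpus() and switch_idle_handler() are names made up for
this sketch.

```c
#define NR_CPUS_MODEL 4

typedef int (*idle_fn)(void);

/* Hypothetical idle routines; the return values only make them distinguishable. */
int default_idle_model(void) { return 0; }
int old_idle_model(void)     { return 1; }
int new_idle_model(void)     { return 2; }

/* The global idle handler that every CPU loads when it (re)enters idle. */
idle_fn pm_idle_model = old_idle_model;

/* The handler each CPU loaded last; stale for as long as the CPU stays idle. */
idle_fn cpu_cached[NR_CPUS_MODEL] = {
	old_idle_model, old_idle_model, old_idle_model, old_idle_model
};

/* Models the dummy smp_call_function: each CPU leaves idle and reloads. */
void kick_all_cpus(void)
{
	int cpu;

	for (cpu = 0; cpu < NR_CPUS_MODEL; cpu++)
		cpu_cached[cpu] = pm_idle_model;
}

/* Returns 1 when no CPU still holds a reference to fn. */
int nobody_uses(idle_fn fn)
{
	int cpu;

	for (cpu = 0; cpu < NR_CPUS_MODEL; cpu++)
		if (cpu_cached[cpu] == fn)
			return 0;
	return 1;
}

void switch_idle_handler(idle_fn next)
{
	pm_idle_model = default_idle_model; /* switch to the always present default */
	kick_all_cpus();                    /* the old handler is now unused */
	pm_idle_model = next;               /* install the new handler */
	kick_all_cpus();                    /* optional: start using it immediately */
}
```

In this model, once switch_idle_handler(new_idle_model) returns,
nobody_uses(old_idle_model) holds, which is the point at which the old
governor could safely go away.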
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Cc: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andi Kleen <ak@suse.de>
Cc: Len Brown <lenb@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Decrement the slave counter only in ->release() callback instead of both
in ->release() and w1 control.
Patch is based on debug work and a preliminary patch made by Henri Laakso.
Henri noticed while debugging that this counter becomes negative after a
w1 slave device is physically removed.
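A rough sketch of the accounting rule, using made-up structures rather than
the real w1 ones: the counter goes up on attach and comes down in exactly
one place, the release callback. All names below are hypothetical.

```c
struct w1_master_model {
	int slave_count;
};

struct w1_slave_model {
	struct w1_master_model *master;
};

void w1_model_attach(struct w1_master_model *m, struct w1_slave_model *sl)
{
	sl->master = m;
	m->slave_count++;
}

/* Control path: starts removal of the slave but leaves the counter alone. */
void w1_model_detach(struct w1_slave_model *sl)
{
	(void)sl;
}

/* Release callback: the only place where the counter is decremented. */
void w1_model_release(struct w1_slave_model *sl)
{
	sl->master->slave_count--;
}
```

If the detach path also decremented the counter, one attach followed by
detach and release would leave it at -1, which matches the symptom Henri
observed.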
Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
Cc: Henri Laakso <henri.laakso@wapice.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kyle McMartin reports sysrq_timer_list_show() can hit the module mutex
from hard interrupt context. These paths don't need to though, since we
long ago changed all the module list manipulation to occur via
stop_machine().
Disabling preemption is enough.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is almost no difference between 32 & 64 bit glue code.
Signed-off-by: Sebastian Siewior <sebastian@breakpoint.cc>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/netdev-2.6:
spidernet MAINTAINERship update
sky2: remove check for PCI wakeup setting from BIOS
sky2: large memory workaround.
fs_enet: check for phydev existence in the ethtool handlers
[usb netdev] asix: fix regression
r8169: fix missing loop variable increment
ip1000: menu location change
Fixed a small typo in the loopback driver
3c509: PnP resource management fix
netxen: fix byte-swapping in tx and rx
netxen: optimize tx handling
netxen: stop second phy correctly
netxen: update driver version
netxen: update MAINTAINERS
endianness noise in tulip_core
de4x5 fixes
xircom_cb endianness fixes
rt2x00: Put 802.11 data on 4 byte boundary
rt2x00: Corectly initialize rt2500usb MAC
rt2x00: Allow rt61 to catch up after a missing tx report
* 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc:
[POWERPC] Fix CPU hotplug when using the SLB shadow buffer
[POWERPC] efika: add phy-handle property for fec_mpc52xx
* 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6:
pnpacpi: print resource shortage message only once
PM: ACPI and APM must not be enabled at the same time
ACPI: apply quirk_ich6_lpc_acpi to more ICH8 and ICH9
ACPICA: fix acpi_serialize hang regression
ACPI : Not register gsi for PCI IDE controller in legacy mode
ACPI: Reintroduce run time configurable max_cstate for !CPU_IDLE case
ACPI: Make sysfs interface in ACPI power optional.
ACPI: EC: Enable boot EC before bus_scan
increase PNP_MAX_PORT to 40 from 24
When RPCSEC/GSS and krb5i is used, requests are padded, typically to a multiple
of 8 bytes. This can make the request look slightly longer than it
really is.
As of
f34b95689d "The NFSv2/NFSv3 server does not handle zero
length WRITE request correctly",
the xdr decode routines for NFSv2 and NFSv3 reject requests that aren't
the right length, so krb5i (for example) WRITE requests can get lost.
This patch relaxes the appropriate test and enhances the related comment.
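The difference between a strict and a relaxed length test can be sketched
as follows. This is an illustration, not the nfsd decode code: both helpers
are hypothetical, dlen stands for the number of data bytes present in the
request and len for the length claimed in the WRITE arguments.

```c
#include <stdint.h>

/* XDR pads opaque data to a multiple of 4 bytes (small lengths assumed). */
uint32_t xdr_padded_len(uint32_t len)
{
	return (len + 3u) & ~3u;
}

/* Strict test: the request must hold exactly the padded payload. */
int write_len_ok_strict(uint32_t dlen, uint32_t len)
{
	return dlen == xdr_padded_len(len);
}

/* Relaxed test: trailing padding is tolerated, a short request is not. */
int write_len_ok_relaxed(uint32_t dlen, uint32_t len)
{
	return dlen >= xdr_padded_len(len);
}
```

With len = 5 and a request padded out to 16 data bytes, the strict test
rejects the request while the relaxed one accepts it.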
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Cc: Peter Staubach <staubach@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
task_ppid_nr_ns is called in three places. One of these should never
have called it. In the other two, using it broke the existing
semantics. This was presumably accidental. If the function had not
been there, it would have been much more obvious to the eye that those
patches were changing the behavior. We don't need this function.
In task_state, the pid of the ptracer is not the ppid of the ptracer.
In do_task_stat, ppid is the tgid of the real_parent, not its pid.
I also moved the call outside of lock_task_sighand, since it doesn't
need it.
In sys_getppid, ppid is the tgid of the real_parent, not its pid.
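A small model of the distinction, with a hypothetical task structure that
only carries the fields discussed here:

```c
struct task_model {
	int pid;                        /* id of this thread */
	int tgid;                       /* id of the thread group (the process) */
	struct task_model *real_parent; /* thread that created this task */
};

/* getppid() semantics as described: the parent's thread group id. */
int ppid_of(const struct task_model *t)
{
	return t->real_parent->tgid;
}
```

When the parent is a non-leader thread (say pid 101 in thread group 100),
ppid_of() yields 100, whereas using the parent's pid would yield 101.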
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When we switched away from the optimized C version
things stopped being monotonic.
The problem is that if we run this with interrupts disabled, we can
see the interrupt pending because the counter reached the limit value.
When this happens the counter has bit 31 set, and the low bits start
counting again from zero.
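One way to keep the reading monotonic in that situation is to add a full
period whenever the limit flag is set. The sketch below assumes a
simplified register layout (bit 31 is the limit flag, the low 31 bits are
the tick count); the real hardware layout and the helper name are not taken
from this patch.

```c
#include <stdint.h>

#define LIMIT_REACHED_BIT (1u << 31)

/*
 * raw:   hypothetical counter register value
 * limit: number of ticks in one timer period
 */
uint32_t ticks_since_last_accounted(uint32_t raw, uint32_t limit)
{
	uint32_t ticks = raw & ~LIMIT_REACHED_BIT;

	/* The counter wrapped but the interrupt has not been serviced yet. */
	if (raw & LIMIT_REACHED_BIT)
		ticks += limit;
	return ticks;
}
```

Without the correction, a reading taken just after the wrap would be
smaller than one taken just before it.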
Reported by Martin Habets.
Signed-off-by: David S. Miller <davem@davemloft.net>
pnpacpi: exceeded the max number of IO resources: 40
While this message is a real error and should thus
remain KERN_ERR (even a new dmesg line is seen as a regression
by some, since it was not printed in 2.6.23...) it is certainly
impolite to print this warning 50 times should you happen to
have the oddball system with 90 io resources under a device...
So print the warning just once.
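A print-once guard can be as simple as a static flag. This is a user-space
sketch with a hypothetical helper name; it writes to stderr instead of
calling printk.

```c
#include <stdio.h>

/* Returns 1 if the message was emitted, 0 if it was suppressed. */
int warn_io_limit_once(int max)
{
	static int warned;

	if (warned)
		return 0;
	warned = 1;
	fprintf(stderr,
		"pnpacpi: exceeded the max number of IO resources: %d\n", max);
	return 1;
}
```

Calling the helper 50 times produces a single line of output.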
In 2.6.25 we'll get rid of the limits altogether
and these warnings will vanish with them.
http://bugzilla.kernel.org/show_bug.cgi?id=9535
Signed-off-by: Len Brown <len.brown@intel.com>
The driver checks status of PCI power management to mark
default setting of Wake On Lan. On some systems this works, but often
it reports that WOL is disabled when it isn't.
This patch gets rid of that check and just reports the wake on
lan status based on the hardware capability.
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
This patch might fix problems with 4G or more of memory.
It stops the driver from doing a small optimization for Tx and Rx,
and instead always sets the high-page on tx/rx descriptors.
Fixes-bug: http://bugzilla.kernel.org/show_bug.cgi?id=9725
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
Way back when (in commit 834f2a4a15, aka
"VFS: Allow the filesystem to return a full file pointer on open intent"
to be exact), Trond changed the open logic to keep track of the original
flags to a file open, in order to pass down the intent of a dentry
lookup to the low-level filesystem.
However, when doing that reorganization, it changed the meaning of
namei_flags, and thus inadvertently changed the test of access mode for
directories (and RO filesystem) to use the wrong flag. So fix those
tests back to use access mode ("acc_mode") rather than the open flag
("flag").
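To see how testing the wrong variable can go unnoticed, consider the sketch
below. The constants mirror the usual Linux values but are redefined with
an _M suffix and should be read as assumptions of this example; the helper
names are hypothetical and this is not the namei code.

```c
#define O_RDONLY_M    0
#define O_WRONLY_M    1
#define O_RDWR_M      2
#define O_ACCMODE_M   3
#define MAY_WRITE_M   2
#define MAY_READ_M    4
#define FMODE_WRITE_M 2

/* Permission bits needed for the access mode encoded in the open flags. */
int acc_mode_of(int open_flag)
{
	switch (open_flag & O_ACCMODE_M) {
	case O_RDONLY_M:
		return MAY_READ_M;
	case O_WRONLY_M:
		return MAY_WRITE_M;
	case O_RDWR_M:
		return MAY_READ_M | MAY_WRITE_M;
	default:
		return 0;
	}
}

/* Wrong: applies a mode bit to the raw open flags. */
int wants_write_wrong(int open_flag)
{
	return (open_flag & FMODE_WRITE_M) != 0;
}

/* Right: tests the derived access mode. */
int wants_write_right(int open_flag)
{
	return (acc_mode_of(open_flag) & MAY_WRITE_M) != 0;
}
```

With these values the wrong test happens to work for O_RDWR but misses
O_WRONLY, so a write-only open of a directory would slip through.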
Issue noticed by Bill Roman at Datalight.
Reported-and-tested-by: Bill Roman <bill.roman@datalight.com>
Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
51bf2976b5 caused a regression in the asix
usbnet driver. usb_control_msg returns the number of bytes read on
success, not 0. Tested with NETGEAR FA120.
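The return convention matters for the error check. In the sketch below,
fake_control_msg() is a hypothetical stand-in that follows the same
convention as usb_control_msg(): a negative errno on failure, otherwise the
number of bytes transferred.

```c
#include <errno.h>

/* Stand-in transfer routine: negative errno or bytes transferred. */
int fake_control_msg(int fail, int size)
{
	return fail ? -EIO : size;
}

/* Buggy check: treats every non-zero return as a failure. */
int read_ok_buggy(int ret)
{
	return ret == 0;
}

/* Fixed check: only negative values are errors. */
int read_ok_fixed(int ret)
{
	return ret >= 0;
}
```

A successful 2-byte read returns 2, which the buggy check reports as an
error and the fixed check accepts.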
Signed-off-by: Russ Dill <Russ.Dill@gmail.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
Move the ip1000 driver into the expected place for gigabit cards
in the configuration menu structure. It should be under the gigabit
cards, not at the top level.
Signed-off-by: Stephen Hemminger <stephen.hemminger@vyatta.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
This is probably a result of the changes from commit
854d836 - [NET]: Dynamically allocate the loopback device, part 2
Signed-off-by: Emil Medve <Emilian.Medve@Freescale.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
In order to release PnP resources a card type must be set to EL3_PNP.
Previously, it was never set hence the PnP resources were not
released and device was left in incorrect state.
Signed-off-by: Krzysztof Helt <krzysztof.h1@wp.pl>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
Here's the reworked patch.
This cleans up some unnecessary byte-swapping while setting up tx and
interpreting rx desc. The 64 bit rx status data should be converted
to host endian format only once and the macros just need to extract
bitfields.
This saves a spate of interrupts on pseries blades caused by buggy
(non-)processing of the rx status ring.
Signed-off-by: Dhananjay Phadke <dhananjay@netxen.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
netxen driver allows limited number of threads simultaneously posting
skb's in tx ring. If transmit slot is unavailable, driver calls
schedule() or loops in xmit_frame().
This patch returns TX_BUSY and lets the stack reschedule the packet if
transmit slot is unavailable. Also removes unnecessary check for tx
timeout in the driver itself, the network stack does that anyway.
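The new behaviour can be pictured with a toy transmit ring. The structure,
the function and the TX_* constants are made up for this sketch; TX_BUSY_M
merely plays the role of the busy return code.

```c
#define TX_OK_M   0
#define TX_BUSY_M 1

struct tx_ring_model {
	int size; /* number of transmit slots */
	int used; /* slots currently occupied */
};

/* Never sleeps or spins: either queues the frame or reports busy. */
int xmit_frame_model(struct tx_ring_model *ring)
{
	if (ring->used >= ring->size)
		return TX_BUSY_M; /* no slot: the caller requeues the packet */
	ring->used++;
	return TX_OK_M;
}
```

The caller, not the driver, decides when to retry a frame that came back
busy.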
Signed-off-by: Dhananjay Phadke <dhananjay@netxen.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
This patch fixes a bug where the second port was not quiesced when the
interface is brought down, which could lead to an unwarranted interrupt
during rmmod / ifdown.
Signed-off-by: Dhananjay Phadke <dhananjay@netxen.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
Bump the driver version to 3.4.18; several fixes have gone in since
version 3.4.2.
Signed-off-by: Dhananjay Phadke <dhananjay@netxen.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
* (trivial) endianness annotations
* don't bother with del_timer() from the inside of timer handler itself
* disable_ast() really ought to do del_timer_sync(), not del_timer()
* clean the timer handling in general.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
* descriptors inside the rx and tx rings are l-e
* don't cpu_to_le32() the argument of outl()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
The bridge code incorrectly causes two POST_ROUTING hook invocations
for DNATed packets that end up on the same bridge device. This
happens because packets with a changed destination address are passed
to dst_output() to make them go through the neighbour output function
again to build a new destination MAC address, before they will continue
through the IP hooks simulated by bridge netfilter.
The resulting hook order is:
PREROUTING (bridge netfilter)
POSTROUTING (dst_output -> ip_output)
FORWARD (bridge netfilter)
POSTROUTING (bridge netfilter)
The deferred hooks used to abort the first POST_ROUTING invocation,
but since the only thing bridge netfilter actually really wants is
a new MAC address, we can avoid going through the IP stack completely
by simply calling the neighbour output function directly.
Tested, reported and lots of data provided by: Damien Thebault <damien.thebault@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch relaxes the default SCSI DMA alignment from 512 bytes to 4
bytes. I remember from previous discussions that usb and firewire have
sector size alignment requirements, so I upped their alignments in the
respective slave allocs.
The reason for doing this is so that we don't get such a huge amount of
copy overhead in bio_copy_user() for udev. (basically all inquiries it
issues can now be directly mapped).
Acked-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
The purpose of this is to allow stacked alignment settings, with the
ultimate queue alignment being set to the largest alignment requirement
in the stack.
The reason for this is so that the SCSI mid-layer can relax the default
alignment requirements (which are basically causing a lot of superfluous
copying to go on in the SG_IO interface) while allowing transports,
devices or HBAs to add stricter limits if they need them.
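The stacking rule amounts to "only ever raise the mask". A minimal sketch
with a hypothetical queue structure, where the alignment is stored as a
mask (3 for 4-byte alignment, 511 for 512-byte alignment):

```c
struct queue_model {
	unsigned int dma_alignment; /* alignment mask */
};

/* Apply a layer's requirement; a weaker request never lowers the mask. */
void update_dma_alignment(struct queue_model *q, unsigned int mask)
{
	if (mask > q->dma_alignment)
		q->dma_alignment = mask;
}

/* Returns 1 if addr satisfies the queue's current alignment. */
int is_dma_aligned(const struct queue_model *q, unsigned long addr)
{
	return (addr & q->dma_alignment) == 0;
}
```

Layers can then be applied in any order: a relaxed default of 3 followed by
a transport asking for 511 ends at 511, and so does the reverse order.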
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
It looks like host_cmd_pool_mutex is necessary here.
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Based on an original patch from: David Martin <tasio@tasio.net>
When trying to get the drive status via ioctl CDROM_DRIVE_STATUS, with
no disk it gives CDS_TRAY_OPEN even if the tray is closed.
ioctl works as expected with ide-cd driver.
Gentoo bug report: http://bugs.gentoo.org/show_bug.cgi?id=196879
Cc: Maarten Bressers <mbres@gentoo.org>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Made-up error codes are bad for two reasons:
1. If they're returned to outside applications, no-one knows what
they mean.
2. Eventually they'll clash with the ever expanding standard error
codes.
The problem error code in question is ETASK. I've replaced this by
ECOMM (communications error on send) a network error code that seems to
most closely relay what ETASK meant.
Acked-by: Darrick J. Wong <djwong@us.ibm.com>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Currently in BSG, errors returned in req->errors aren't passed back to
the calling programme (either via SG_IO or via read/write). Fix this,
while preserving the SCSI convention of returning status in
req->errors.
Now update libsas to return errors correctly instead of ignoring
them.
Acked-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
All SMP tasks sent through bsg generate messages like:
sas: smp_execute_task: task to dev 500605b000001450 response: 0x0 status 0x81
Three times (because the task gets retried). Firstly, don't retry
either overrun or underrun (the data buffer isn't going to change size)
and secondly, just report the underrun but don't set an error for it.
This is necessary so bsg can report back the residual.
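The retry policy described here could look roughly like the following. The
status enum, the callback type and the return values are assumptions of
this sketch, not the libsas definitions.

```c
enum task_status_model {
	STAT_GOOD,
	STAT_UNDERRUN,
	STAT_OVERRUN,
	STAT_OTHER_ERROR
};

#define MAX_RETRIES_MODEL 3

/* Sample task executors, used only to exercise the policy. */
enum task_status_model always_underrun(void) { return STAT_UNDERRUN; }
enum task_status_model always_overrun(void)  { return STAT_OVERRUN; }
enum task_status_model always_failing(void)  { return STAT_OTHER_ERROR; }

/* Runs exec up to MAX_RETRIES_MODEL times; *attempts counts the runs. */
int execute_with_retry(enum task_status_model (*exec)(void), int *attempts)
{
	int retry;

	*attempts = 0;
	for (retry = 0; retry < MAX_RETRIES_MODEL; retry++) {
		enum task_status_model st = exec();

		(*attempts)++;
		if (st == STAT_GOOD || st == STAT_UNDERRUN)
			return 0;  /* underrun is not an error; residual is reported */
		if (st == STAT_OVERRUN)
			return -1; /* the buffer will not grow, so do not retry */
	}
	return -1;
}
```

Only the generic error case uses all three attempts; underrun and overrun
return after the first one.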
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
This adds support for host side SMP processing, via a separate
SMP interpreter file.
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
This patch fixes mptsas_smp_handler to update both din_resid and
dout_resid on success, so that bsg can report back the residual.
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
We need to hold the queue-lock when checking whether we still have a valid
unit/port handle for the task management command, i.e. whether we can issue this
request for this unit/port. If the error recovery is about to close this
unit/port, then it competes for the queue-lock. If the close request issued by
the error recovery wins, then it is guaranteed that this unit/port has been
blocked for other requests.
Signed-off-by: Christof Schmitt <christof.schmitt@de.ibm.com>
Signed-off-by: Martin Peschke <mp3@de.ibm.com>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
We need to hold the queue-lock when checking whether we still have a valid
unit/port handle for the FCP command, i.e. whether we can issue this request for
this unit/port. If the error recovery is about to close this unit/port, then it
competes for the queue-lock. If the close request issued by the error recovery
wins, then it is guaranteed that this unit/port has been blocked for other
requests.
Signed-off-by: Christof Schmitt <christof.schmitt@de.ibm.com>
Signed-off-by: Martin Peschke <mp3@de.ibm.com>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
We need to hold the queue-lock when checking whether we still have a valid port
handle for the ELS command, i.e. whether we can issue this request for this
port. If the error recovery is about to close this port, then it competes for
the queue-lock. If the close request issued by the error recovery wins, then it
is guaranteed that this port has been blocked for other requests.
Signed-off-by: Christof Schmitt <christof.schmitt@de.ibm.com>
Signed-off-by: Martin Peschke <mp3@de.ibm.com>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
We need to hold the queue-lock when checking whether we still have a valid
unit/port handle for the abort command, i.e. whether we can issue this request
for this unit/port. If the error recovery is about to close this unit/port,
then it competes for the queue-lock. If the close request issued by the error
recovery wins, then it is guaranteed that this unit/port has been blocked for
other requests.
Signed-off-by: Christof Schmitt <christof.schmitt@de.ibm.com>
Signed-off-by: Martin Peschke <mp3@de.ibm.com>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
According to the FSF spec, word 0 (bytes 0-3) has the handle
specified with the abort command and word 1 (bytes 4-7) has the
handle for the command to be aborted. Fix the if statements
that try to compare those.
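As an illustration of the word layout only: the helpers below pull the two
32-bit words out of an 8-byte field and compare each against a handle. The
function names, the 32-bit handle type and the use of host byte order are
assumptions of this sketch, not taken from the zfcp code.

```c
#include <stdint.h>
#include <string.h>

/* Copies one 32-bit word (host byte order) out of an 8-byte field. */
uint32_t status_word(const unsigned char field[8], int word)
{
	uint32_t v;

	memcpy(&v, field + 4 * word, sizeof(v));
	return v;
}

/* Word 0 (bytes 0-3): handle specified with the abort command. */
int abort_handle_matches(const unsigned char field[8], uint32_t handle)
{
	return status_word(field, 0) == handle;
}

/* Word 1 (bytes 4-7): handle of the command to be aborted. */
int aborted_cmd_handle_matches(const unsigned char field[8], uint32_t handle)
{
	return status_word(field, 1) == handle;
}
```

Swapping the two words in the comparisons would make both checks fail for
any pair of distinct handles.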
Signed-off-by: Christof Schmitt <christof.schmitt@de.ibm.com>
Signed-off-by: Martin Peschke <mp3@de.ibm.com>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>