* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
sched: refactor wait_for_completion_timeout()
sched: fix wait_for_completion_timeout() spurious failure under heavy load
sched: rt: dont stop the period timer when there are tasks wanting to run
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
xen: don't drop NX bit
xen: mask unwanted pte bits in __supported_pte_mask
xen: Use wmb instead of rmb in xen_evtchn_do_upcall().
x86: fix NULL pointer deref in __switch_to
There is a race in the COW logic. It contains a shortcut to avoid the
COW and reuse the page if we have the sole reference on the page,
however it is possible to have two racing do_wp_page()ers with one
causing the other to mistakenly believe it is safe to take the shortcut
when it is not. This could lead to data corruption.
Process 1 and process2 each have a wp pte of the same anon page (ie.
one forked the other). The page's mapcount is 2. Then they both
attempt to write to it around the same time...
proc1 proc2 thr1 proc2 thr2
CPU0 CPU1 CPU3
do_wp_page() do_wp_page()
trylock_page()
can_share_swap_page()
load page mapcount (==2)
reuse = 0
pte unlock
copy page to new_page
pte lock
page_remove_rmap(page);
trylock_page()
can_share_swap_page()
load page mapcount (==1)
reuse = 1
ptep_set_access_flags (allow W)
write private key into page
read from page
ptep_clear_flush()
set_pte_at(pte of new_page)
Fix this by moving the page_remove_rmap of the old page after the pte
clear and flush. Potentially the entire branch could be moved down
here, but in order to stay consistent, I won't (should probably move all
the *_mm_counter stuff with one patch).
Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Hugh Dickins <hugh@veritas.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 89f5b7da2a ("Reinstate ZERO_PAGE
optimization in 'get_user_pages()' and fix XIP") broke vmware, as
reported by Jeff Chua:
"This broke vmware 6.0.4.
Jun 22 14:53:03.845: vmx| NOT_IMPLEMENTED
/build/mts/release/bora-93057/bora/vmx/main/vmmonPosix.c:774"
and the reason seems to be that there's an old bug in how we handle do
FOLL_ANON on VM_SHARED areas in get_user_pages(), but since it only
triggered if the whole page table was missing, nobody had apparently hit
it before.
The recent changes to 'follow_page()' made the FOLL_ANON logic trigger
not just for whole missing page tables, but for individual pages as
well, and exposed this problem.
This fixes it by making the test for when FOLL_ANON is used more
careful, and also makes the code easier to read and understand by moving
the logic to a separate inline function.
Reported-and-tested-by: Jeff Chua <jeff.chua.linux@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I noted that the 'struct tty_struct *real_tty' is not used in this
function, so I removed the code about 'real_tty'.
Signed-off-by: Gustavo Fernando Padovan <gustavo@las.ic.unicamp.br>
Acked-by: Alan Cox <alan@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Some of the requirement rules are now more relaxed. Also correct a
contradiction in the previous update
Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix the following sparse warnings:
fs/dcache.c:2183:19: warning: symbol 'filp_cachep' was not declared. Should it be static?
fs/dcache.c:115:3: warning: context imbalance in 'dentry_iput' - unexpected unlock
fs/dcache.c:188:2: warning: context imbalance in 'dput' - different lock contexts for basic block
fs/dcache.c:400:2: warning: context imbalance in 'prune_one_dentry' - different lock contexts for basic block
fs/dcache.c:431:22: warning: context imbalance in 'prune_dcache' - different lock contexts for basic block
fs/dcache.c:563:2: warning: context imbalance in 'shrink_dcache_sb' - different lock contexts for basic block
fs/dcache.c:1385:6: warning: context imbalance in 'd_delete' - wrong count at exit
fs/dcache.c:1636:2: warning: context imbalance in '__d_unalias' - unexpected unlock
fs/dcache.c:1735:2: warning: context imbalance in 'd_materialise_unique' - different lock contexts for basic block
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Matthew Wilcox <willy@linux.intel.com>
Acked-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The path that __d_path() computes can become slightly inconsistent when it
races with mount operations: it grabs the vfsmount_lock when traversing mount
points but immediately drops it again, only to re-grab it when it reaches the
next mount point. The result is that the filename computed is not always
consisent, and the file may never have had that name. (This is unlikely, but
still possible.)
Fix this by grabbing the vfsmount_lock for the whole duration of
__d_path().
Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Signed-off-by: John Johansen <jjohansen@suse.de>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Acked-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Current memfree FW has a bug which in some cases, assumes that ICM
pages passed to it are cleared. This patch uses __GFP_ZERO to
allocate all ICM pages passed to the FW. Once firmware with a fix is
released, we can make the workaround conditional on firmware version.
This fixes the bug reported by Arthur Kepner <akepner@sgi.com> here:
http://lists.openfabrics.org/pipermail/general/2008-May/050026.html
Cc: <stable@kernel.org>
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
[ Rewritten to be a one-liner using __GFP_ZERO instead of vmap()ing
ICM memory and memset()ing it to 0. - Roland ]
Signed-off-by: Roland Dreier <rolandd@cisco.com>
fl_insert and fl_remove are not used right now in the kernel. Remove them.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
generic_readlink calls ERR_PTR for negative and positive values
(vfs_readlink returns length of "link"), but it should not
(not an errno) and does not need to.
Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Acked-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Here are some more places where path_{get,put}() can be used instead of
dput()/mntput() pair.
Signed-off-by: Jan Blunck <jblunck@suse.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The POSIX.1 draft spec for futimens()/utimensat() says:
Only a process with the effective user ID equal to the
user ID of the file, *or with write access to the file*,
or with appropriate privileges may use futimens() or
utimensat() with a null pointer as the times argument
or with both tv_nsec fields set to the special value
UTIME_NOW.
The important piece here is "with write access to the file", and
this matters for futimens(), which deals with an argument that
is a file descriptor referring to the file whose timestamps are
being updated, The standard is saying that the "writability"
check is based on the file permissions, not the access mode with
which the file is opened. (This behavior is consistent with the
semantics of FreeBSD's futimes().) However, Linux is currently
doing the latter -- futimens(fd, times) is a library
function implemented as
utimensat(fd, NULL, times, 0)
and within the utimensat() implementation we have the code:
f = fget(dfd); // dfd is 'fd'
...
if (f) {
if (!(f->f_mode & FMODE_WRITE))
goto mnt_drop_write_and_out;
The check should instead be based on the file permissions.
Thanks to Miklos for pointing out how to do this check.
Miklos also pointed out a simplification that could be
made to my first version of this patch, since the checks
for the pathname and file descriptor cases can now be
conflated.
Acked-by: Miklos Szeredi <miklos@szeredi.hu>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Ulrich Drepper <drepper@redhat.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The POSIX.1 draft spec for utimensat() says:
Only a process with the effective user ID equal to the
user ID of the file or with appropriate privileges may use
futimens() or utimensat() with a non-null times argument
that does not have both tv_nsec fields set to UTIME_NOW
and does not have both tv_nsec fields set to UTIME_OMIT.
If this condition is violated, then the error EPERM should result.
However, the current implementation does not generate EPERM if
one tv_nsec field is UTIME_NOW while the other is UTIME_OMIT.
It should give this error for that case.
This patch:
a) Repairs that problem.
b) Removes the now unneeded nsec_special() helper function.
c) Adds some comments to explain the checks that are being
performed.
Thanks to Miklos, who provided comments on the previous iteration
of this patch. As a result, this version is a little simpler and
and its logic is better structured.
Miklos suggested an alternative idea, migrating the
is_owner_or_cap() checks into fs/attr.c:inode_change_ok() via
the use of an ATTR_OWNER_CHECK flag. Maybe we could do that
later, but for now I've gone with this version, which is
IMO simpler, and can be more easily read as being correct.
Acked-by: Miklos Szeredi <miklos@szeredi.hu>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Ulrich Drepper <drepper@redhat.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The POSIX.1 draft spec for utimensat() says that if a times[n].tv_nsec
field is UTIME_OMIT or UTIME_NOW, then the value in the corresponding
tv_sec field is ignored. See the last sentence of this para, from
the spec:
If the tv_nsec field of a timespec structure has
the special value UTIME_NOW, the file's relevant
timestamp shall be set to the greatest value
supported by the file system that is not greater than
the current time. If the tv_nsec field has the
special value UTIME_OMIT, the file's relevant
timestamp shall not be changed. In either case,
the tv_sec field shall be ignored.
However the current Linux implementation requires the tv_sec value to be
zero (or the EINVAL error results). This requirement should be removed.
Acked-by: Miklos Szeredi <miklos@szeredi.hu>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Ulrich Drepper <drepper@redhat.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This patch fixes utimensat() to make its behavior consistent
with that of utime()/utimes() when dealing with files marked
immutable and append-only.
The current utimensat() implementation also returns EPERM if
'times' is non-NULL and the tv_nsec fields are both UTIME_NOW.
For consistency, the
(times != NULL && times[0].tv_nsec == UTIME_NOW &&
times[1].tv_nsec == UTIME_NOW)
case should be treated like the traditional utimes() case where
'times' is NULL. That is, the call should succeed for a file
marked append-only and should give the error EACCES if the file
is marked as immutable.
The simple way to do this is to set 'times' to NULL
if (times[0].tv_nsec == UTIME_NOW && times[1].tv_nsec == UTIME_NOW).
This is also the natural approach, since POSIX.1 semantics consider the
times == {{x, UTIME_NOW}, {y, UTIME_NOW}}
to be exactly equivalent to the case for
times == NULL.
(Thanks to Miklos for pointing this out.)
Patch 3 in this series relies on the simplification provided
by this patch.
Acked-by: Miklos Szeredi <miklos@szeredi.hu>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Ulrich Drepper <drepper@redhat.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
devcgroup_inode_permission() expects MAY_FOO, not FMODE_FOO; kindly
keep your misdesign consistent if you positively have to inflict it
on the kernel.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This patch addresses a very sporadic pi-futex related failure in
highly threaded java apps on large SMP systems.
David Holmes reported that the pi_state consistency check in
lookup_pi_state triggered with his test application. This means that
the kernel internal pi_state and the user space futex variable are out
of sync. First we assumed that this is a user space data corruption,
but deeper investigation revieled that the problem happend because the
pi-futex code is not handling a fault in the futex_lock_pi path when
the user space variable needs to be fixed up.
The fault happens when a fork mapped the anon memory which contains
the futex readonly for COW or the page got swapped out exactly between
the unlock of the futex and the return of either the new futex owner
or the task which was the expected owner but failed to acquire the
kernel internal rtmutex. The current futex_lock_pi() code drops out
with an inconsistent in case it faults and returns -EFAULT to user
space. User space has no way to fixup that state.
When we wrote this code we thought that we could not drop the hash
bucket lock at this point to handle the fault.
After analysing the code again it turned out to be wrong because there
are only two tasks involved which might modify the pi_state and the
user space variable:
- the task which acquired the rtmutex
- the pending owner of the pi_state which did not get the rtmutex
Both tasks drop into the fixup_pi_state() function before returning to
user space. The first task which acquired the hash bucket lock faults
in the fixup of the user space variable, drops the spinlock and calls
futex_handle_fault() to fault in the page. Now the second task could
acquire the hash bucket lock and tries to fixup the user space
variable as well. It either faults as well or it succeeds because the
first task already faulted the page in.
One caveat is to avoid a double fixup. After returning from the fault
handling we reacquire the hash bucket lock and check whether the
pi_state owner has been modified already.
Reported-by: David Holmes <david.holmes@sun.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Holmes <david.holmes@sun.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: <stable@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
kernel/futex.c | 93 ++++++++++++++++++++++++++++++++++++++++++++-------------
1 file changed, 73 insertions(+), 20 deletions(-)
Kevin Winchester reported a GART related direct rendering failure against
linux-next-20080611, which shows up via these log entries:
PCI: Using ACPI for IRQ routing
PCI: Cannot allocate resource region 0 of device 0000:00:00.0
agpgart: Detected AGP bridge 0
agpgart: Aperture conflicts with PCI mapping.
agpgart: Aperture from AGP @ e0000000 size 128 MB
agpgart: Aperture conflicts with PCI mapping.
agpgart: No usable aperture found.
agpgart: Consider rebooting with iommu=memaper=2 to get a good aperture.
instead of the expected:
PCI: Using ACPI for IRQ routing
agpgart: Detected AGP bridge 0
agpgart: Aperture from AGP @ e0000000 size 128 MB
Kevin bisected it down to this change in tip/x86/gart:
"x86: checking aperture size order".
agp check is using request_mem_region(), and could fail if e820 is reserved...
change it back to e820_any_mapped().
Reported-and-bisected-by: "Kevin Winchester" <kjwinchester@gmail.com>
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Tested-by: Kevin Winchester <kjwinchester@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
snd_assert() in save_mixer() and restore_mixer() in sb_mixer.c is
just wrong. The debug code wasn't tested at all, obviously...
Signed-off-by: Takashi Iwai <tiwai@suse.de>
The irq handler may be called before the proper initialization of hardware.
Call snd_aw2_saa7146_setup() before the irq handler registration.
Signed-off-by: Takashi Iwai <tiwai@suse.de>
The Marvell Discovery Duo (MV78xx0) is a family of ARM SoCs featuring
(depending on the model) one or two Feroceon CPU cores with 512K of L2
cache and VFP coprocessors running at (depending on the model) between
800 MHz and 1.2 GHz, and features a DDR2 controller, two PCIe
interfaces that can each run either in x4 or quad x1 mode, three USB
2.0 interfaces, two 3Gb/s SATA II interfaces, a SPI interface, two
TWSI interfaces, a crypto accelerator, IDMA/XOR engines, a SPI
interface, four UARTs, and depending on the model, two or four gigabit
ethernet interfaces.
This patch adds basic support for the platform, and allows booting
on the MV78x00 development board, with functional UARTs, SATA, PCIe,
GigE and USB ports.
Signed-off-by: Stanislav Samsonov <samsonov@marvell.com>
Signed-off-by: Lennert Buytenhek <buytenh@marvell.com>
The Discovery Duo (MV78xx0) has two x4 PCIe ports which can either
be used in x4 mode or in quad x1 mode. This patch adds an accessor
function to the generic plat-orion PCIe handling code to detect in
which of the two modes we're running (which is determined by strap
pins and/or configured by the bootloader).
Signed-off-by: Lennert Buytenhek <buytenh@marvell.com>
Add support for the Feroceon 88fr571-vd CPU core as found in e.g.
the Marvell Discovery Duo family of ARM SoCs.
Signed-off-by: Lennert Buytenhek <buytenh@marvell.com>
The Marvell Kirkwood (88F6000) is a family of ARM SoCs based on a
Shiva CPU core, and features a DDR2 controller, a x1 PCIe interface,
a USB 2.0 interface, a SPI controller, a crypto accelerator, a TS
interface, and IDMA/XOR engines, and depending on the model, also
features one or two Gigabit Ethernet interfaces, two SATA II
interfaces, one or two TWSI interfaces, one or two UARTs, a
TDM/SLIC interface, a NAND controller, an I2S/SPDIF interface, and
an SDIO interface.
This patch adds supports for the Marvell DB-88F6281-BP Development
Board and the RD-88F6192-NAS and the RD-88F6281 Reference Designs,
enabling support for the PCIe interface, the USB interface, the
ethernet interfaces, the SATA interfaces, the TWSI interfaces, the
UARTs, and the NAND controller.
Signed-off-by: Saeed Bishara <saeed@marvell.com>
Signed-off-by: Lennert Buytenhek <buytenh@marvell.com>
Add support for the Shiva 88fr131 CPU core as found in e.g. the
Marvell Kirkwood family of ARM SoCs.
Signed-off-by: Lennert Buytenhek <buytenh@marvell.com>
This patch adds support for the unified Feroceon L2 cache controller
as found in e.g. the Marvell Kirkwood and Marvell Discovery Duo
families of ARM SoCs.
Note that:
- Page table walks are outer uncacheable on Kirkwood and Discovery
Duo, since the ARMv5 spec provides no way to indicate outer
cacheability of page table walks (specifying it in TTBR[4:3] is
an ARMv6+ feature).
This requires adding L2 cache clean instructions to
proc-feroceon.S (dcache_clean_area(), set_pte()) as well as to
tlbflush.h ({flush,clean}_pmd_entry()). The latter case is handled
by defining a new TLB type (TLB_FEROCEON) which is almost identical
to the v4wbi one but provides a TLB_L2CLEAN_FR flag.
- The Feroceon L2 cache controller supports L2 range (i.e. 'clean L2
range by MVA' and 'invalidate L2 range by MVA') operations, and this
patch uses those range operations for all Linux outer cache
operations, as they are faster than the regular per-line operations.
L2 range operations are not interruptible on this hardware, which
avoids potential livelock issues, but can be bad for interrupt
latency, so there is a compile-time tunable (MAX_RANGE_SIZE) which
allows you to select the maximum range size to operate on at once.
(Valid range is between one cache line and one 4KiB page, and must
be a multiple of the line size.)
Signed-off-by: Lennert Buytenhek <buytenh@marvell.com>
This patch adds support for the L1 D cache range operations that
are supported by the Marvell Discovery Duo and Marvell Kirkwood
ARM SoCs.
Signed-off-by: Stanislav Samsonov <samsonov@marvell.com>
Acked-by: Saeed Bishara <saeed@marvell.com>
Reviewed-by: Nicolas Pitre <nico@marvell.com>
Signed-off-by: Lennert Buytenhek <buytenh@marvell.com>
The Marvell Loki (88RC8480) is an ARM SoC based on a Feroceon CPU
core running at between 400 MHz and 1.0 GHz, and features a 64 bit
DDR controller, 512K of internal SRAM, two x4 PCI-Express ports,
two Gigabit Ethernet ports, two 4x SAS/SATA controllers, two UARTs,
two TWSI controllers, and IDMA/XOR engines.
This patch adds support for the Marvell LB88RC8480 Development
Board, enabling the use of the PCIe interfaces, the ethernet
interfaces, the TWSI interfaces and the UARTs.
Signed-off-by: Lennert Buytenhek <buytenh@marvell.com>
Some Feroceon-based SoCs have an MBUS bridge interrupt controller
that requires writing a one instead of a zero to clear edge
interrupt sources such as timer expiry.
This patch adds a new BRIDGE_INT_TIMER1_CLR define, which platform
code can set to either ~BRIDGE_INT_TIMER1 (write-zero-to-clear) or
BRIDGE_INT_TIMER1 (write-one-to-clear) depending on the platform.
Signed-off-by: Lennert Buytenhek <buytenh@marvell.com>
There are a couple more Feroceon-based SoCs out in the field that use
different Variant and Architecture fields in their Main ID registers
-- this patch tweaks the processor match/mask in proc-feroceon.S to
catch those SoCs as well.
Signed-off-by: Ke Wei <kewei@marvell.com>
Signed-off-by: Lennert Buytenhek <buytenh@marvell.com>
Tweak the Feroceon match/mask in arch/arm/boot/compressed/head.S to
match a couple of newer Feroceon cores (such as the 88fr571vd with
CPU ID 0x56155710, and the 88fr131 with CPU ID 0x56251310) as well.
Signed-off-by: Nicolas Pitre <nico@marvell.com>
Signed-off-by: Lennert Buytenhek <buytenh@marvell.com>
Flushing the L1 D cache with a test/clean/invalidate loop is very
easy in software, but it is not the quickest way of doing it, as
there is a lot of overhead involved in re-scanning the cache from
the beginning every time we hit a dirty line.
This patch makes proc-feroceon.S use "clean+invalidate by set/way"
loops according to possible cache configuration of Feroceon CPUs
(either direct-mapped or 4-way set associative).
Signed-off-by: Nicolas Pitre <nico@marvell.com>
Signed-off-by: Lennert Buytenhek <buytenh@marvell.com>
This patch adds support for the Maxtor Shared Storage II hardware.
Signed-off-by: Sylver Bruneau <sylver.bruneau@googlemail.com>
Signed-off-by: Lennert Buytenhek <buytenh@marvell.com>