The RDMA connection manager is fundamentally asynchronous.
Since the async callback context is the client pointer, the
transport in the client struct needs to be set prior to calling
the first async op.
Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
Impact: improve wakeup affinity on NUMA systems, tweak SMP systems
Given the fixes+tweaks to the wakeup-buddy code, re-tweak the domain
balancing defaults on NUMA and SMP systems.
Turn on SD_WAKE_AFFINE which was off on x86 NUMA - there's no reason
why we would not want to have wakeup affinity across nodes as well.
(we already do this in the standard NUMA template.)
lat_ctx on a NUMA box is particularly happy about this change:
before:
| phoenix:~/l> ./lat_ctx -s 0 2
| "size=0k ovr=2.60
| 2 5.70
after:
| phoenix:~/l> ./lat_ctx -s 0 2
| "size=0k ovr=2.65
| 2 2.07
a 2.75x speedup.
pipe-test is similarly happy about it too:
| phoenix:~/sched-tests> ./pipe-test
| 18.26 usecs/loop.
| 14.70 usecs/loop.
| 14.38 usecs/loop.
| 10.55 usecs/loop. # +WAKE_AFFINE on domain0+domain1
| 8.63 usecs/loop.
| 8.59 usecs/loop.
| 9.03 usecs/loop.
| 8.94 usecs/loop.
| 8.96 usecs/loop.
| 8.63 usecs/loop.
Also:
- disable SD_BALANCE_NEWIDLE on NUMA and SMP domains (keep it for siblings)
- enable SD_WAKE_BALANCE on SMP domains
Sysbench+postgresql improves all around the board, quite significantly:
.28-rc3-11474e2c .28-rc3-11474e2c-tune
-------------------------------------------------
1: 571 688 +17.08%
2: 1236 1206 -2.55%
4: 2381 2642 +9.89%
8: 4958 5164 +3.99%
16: 9580 9574 -0.07%
32: 7128 8118 +12.20%
64: 7342 8266 +11.18%
128: 7342 8064 +8.95%
256: 7519 7884 +4.62%
512: 7350 7731 +4.93%
-------------------------------------------------
SUM: 55412 59341 +6.62%
So it's a win both for the runup portion, the peak area and the tail.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
For "unlock" cycles to 16bit devices in 8bit compatibility mode we need
to use the byte addresses 0xaaa and 0x555. These effectively match
the word address 0x555 and 0x2aa, except the latter has its low bit set.
Most chips don't care about the value of the 'A-1' pin in x8 mode,
but some -- like the ST M29W320D -- do. So we need to be careful to
set it where appropriate.
cfi_send_gen_cmd is only ever passed addresses where the low byte
is 0x00, 0x55 or 0xaa. Of those, only addresses ending 0xaa are
affected by this patch, by masking in the extra low bit when the device
is known to be in compatibility mode.
[dwmw2: Do it only when (cmd_ofs & 0xff) == 0xaa]
v4: Fix stupid typo in cfi_build_cmd_addr that failed to compile
I'm writing this patch way to late at night.
v3: Bring all of the work back into cfi_build_cmd_addr
including calling of map_bankwidth(map) and cfi_interleave(cfi)
So every caller doesn't need to.
v2: Only modified the address if we our device_type is larger than our
bus width.
Cc: stable@kernel.org
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Vito Caputo noticed that tcp_recvmsg() returns immediately from
partial reads when MSG_PEEK is used. In particular, this means that
SO_RCVLOWAT is not respected.
Simply remove the test. And this matches the behavior of several
other systems, including BSD.
Signed-off-by: David S. Miller <davem@davemloft.net>
With some net devices types, an IPv6 address configured while the
interface was down can stay 'tentative' forever, even after the interface
is set up. In some case, pending IPv6 DADs are not executed when the
device becomes ready.
I observed this while doing some tests with kvm. If I assign an IPv6
address to my interface eth0 (kvm driver rtl8139) when it is still down
then the address is flagged tentative (IFA_F_TENTATIVE). Then, I set
eth0 up, and to my surprise, the address stays 'tentative', no DAD is
executed and the address can't be pinged.
I also observed the same behaviour, without kvm, with virtual interfaces
types macvlan and veth.
Some easy steps to reproduce the issue with macvlan:
1. ip link add link eth0 type macvlan
2. ip -6 addr add 2003::ab32/64 dev macvlan0
3. ip addr show dev macvlan0
...
inet6 2003::ab32/64 scope global tentative
...
4. ip link set macvlan0 up
5. ip addr show dev macvlan0
...
inet6 2003::ab32/64 scope global tentative
...
Address is still tentative
I think there's a bug in net/ipv6/addrconf.c, addrconf_notify():
addrconf_dad_run() is not always run when the interface is flagged IF_READY.
Currently it is only run when receiving NETDEV_CHANGE event. Looks like
some (virtual) devices doesn't send this event when becoming up.
For both NETDEV_UP and NETDEV_CHANGE events, when the interface becomes
ready, run_pending should be set to 1. Patch below.
'run_pending = 1' could be moved below the if/else block but it makes
the code less readable.
Signed-off-by: Benjamin Thery <benjamin.thery@bull.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix printk format warnings in net/9p.
Built cleanly on 7 arches.
net/9p/client.c:820: warning: format '%llx' expects type 'long long unsigned int', but argument 4 has type 'u64'
net/9p/client.c:820: warning: format '%llx' expects type 'long long unsigned int', but argument 5 has type 'u64'
net/9p/client.c:867: warning: format '%llx' expects type 'long long unsigned int', but argument 4 has type 'u64'
net/9p/client.c:867: warning: format '%llx' expects type 'long long unsigned int', but argument 5 has type 'u64'
net/9p/client.c:932: warning: format '%llx' expects type 'long long unsigned int', but argument 5 has type 'u64'
net/9p/client.c:932: warning: format '%llx' expects type 'long long unsigned int', but argument 6 has type 'u64'
net/9p/client.c:982: warning: format '%llx' expects type 'long long unsigned int', but argument 4 has type 'u64'
net/9p/client.c:982: warning: format '%llx' expects type 'long long unsigned int', but argument 5 has type 'u64'
net/9p/client.c:1025: warning: format '%llx' expects type 'long long unsigned int', but argument 4 has type 'u64'
net/9p/client.c:1025: warning: format '%llx' expects type 'long long unsigned int', but argument 5 has type 'u64'
net/9p/client.c:1227: warning: format '%llx' expects type 'long long unsigned int', but argument 7 has type 'u64'
net/9p/client.c:1227: warning: format '%llx' expects type 'long long unsigned int', but argument 12 has type 'u64'
net/9p/client.c:1227: warning: format '%llx' expects type 'long long unsigned int', but argument 8 has type 'u64'
net/9p/client.c:1227: warning: format '%llx' expects type 'long long unsigned int', but argument 13 has type 'u64'
net/9p/client.c:1252: warning: format '%llx' expects type 'long long unsigned int', but argument 7 has type 'u64'
net/9p/client.c:1252: warning: format '%llx' expects type 'long long unsigned int', but argument 12 has type 'u64'
net/9p/client.c:1252: warning: format '%llx' expects type 'long long unsigned int', but argument 8 has type 'u64'
net/9p/client.c:1252: warning: format '%llx' expects type 'long long unsigned int', but argument 13 has type 'u64'
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Impact: scheduling order fix for group scheduling
For each level in the hierarchy, set the buddy to point to the right entity.
Therefore, when we do the hierarchical schedule, we have a fair chance of
ending up where we meant to.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: improve/change/fix wakeup-buddy scheduling
Currently we only have a forward looking buddy, that is, we prefer to
schedule to the task we last woke up, under the presumption that its
going to consume the data we just produced, and therefore will have
cache hot benefits.
This allows co-waking producer/consumer task pairs to run ahead of the
pack for a little while, keeping their cache warm. Without this, we
would interleave all pairs, utterly trashing the cache.
This patch introduces a backward looking buddy, that is, suppose that
in the above scenario, the consumer preempts the producer before it
can go to sleep, we will therefore miss the wakeup from consumer to
producer (its already running, after all), breaking the cycle and
reverting to the cache-trashing interleaved schedule pattern.
The backward buddy will try to schedule back to the task that woke us
up in case the forward buddy is not available, under the assumption
that the last task will be the one with the most cache hot task around
barring current.
This will basically allow a task to continue after it got preempted.
In order to avoid starvation, we allow either buddy to get wakeup_gran
ahead of the pack.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix cross-class preemption
Inter-class wakeup preemptions should go on class order.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
In 777e208d40 we changed from outputting
field->cpu (a char) to iter->cpu (unsigned int), increasing the resulting
structure size by 3 bytes.
Signed-off-by: Eric Anholt <eric@anholt.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This gets rid of this build warning:
arch/powerpc/platforms/pseries/pci_dlpar.c: In function 'init_phb_dynamic':
arch/powerpc/platforms/pseries/pci_dlpar.c:192: warning: unused variable 'b'
This is one of the very few warnings left in a ppc64_defconfig build and
getting rid of it will make it easier to see future introduced ones (in
fact this was introduced very recently).
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
This fixes this error on Cell when CONFIG_KEXEC = n:
arch/powerpc/platforms/cell/ras.c:299: error: implicit declaration of function 'crash_shutdown_register'
We have to include <asm/kexec.h> because it contains the dummy
definition of crash_shutdown_register that is used when
CONFIG_KEXEC=n, but <linux/kexec.h> doesn't include <asm/kexec.h> in
that case.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Compiling with CONFIG_SMP = n and CONFIG_PS3_LPM != n gives this error:
drivers/ps3/ps3-lpm.c:838: error: implicit declaration of function 'get_hard_smp_processor_id'
This fixes it. We have to include <asm/smp.h> rather than
<linux/smp.h> because the UP definition of get_hard_smp_processor_id()
is in <asm/smp.h>, and <linux/smp.h> only includes <asm/smp.h> if
CONFIG_SMP = y.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Geoff Levand <geoffrey.levand@am.sony.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
The changes to deliver hardware accelerated VLAN packets to packet
sockets (commit bc1d0411) caused a warning for non-NAPI drivers.
The __vlan_hwaccel_rx() function is called directly from the drivers
RX function, for non-NAPI drivers that means its still in RX IRQ
context:
[ 27.779463] ------------[ cut here ]------------
[ 27.779509] WARNING: at kernel/softirq.c:136 local_bh_enable+0x37/0x81()
...
[ 27.782520] [<c0264755>] netif_nit_deliver+0x5b/0x75
[ 27.782590] [<c02bba83>] __vlan_hwaccel_rx+0x79/0x162
[ 27.782664] [<f8851c1d>] atl1_intr+0x9a9/0xa7c [atl1]
[ 27.782738] [<c0155b17>] handle_IRQ_event+0x23/0x51
[ 27.782808] [<c015692e>] handle_edge_irq+0xc2/0x102
[ 27.782878] [<c0105fd5>] do_IRQ+0x4d/0x64
Split hardware accelerated VLAN reception into two parts to fix this:
- __vlan_hwaccel_rx just stores the VLAN TCI and performs the VLAN
device lookup, then calls netif_receive_skb()/netif_rx()
- vlan_hwaccel_do_receive(), which is invoked by netif_receive_skb()
in softirq context, performs the real reception and delivery to
packet sockets.
Reported-and-tested-by: Ramon Casellas <ramon.casellas@cttc.es>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
While adding MIGRATE support to strongSwan, Andreas Steffen noticed that
the selectors provided in XFRM_MSG_ACQUIRE have their family field
uninitialized (those in MIGRATE do have their family set).
Looking at the code, this is because the af-specific init_tempsel()
(called via afinfo->init_tempsel() in xfrm_init_tempsel()) do not set
the value.
Reported-by: Andreas Steffen <andreas.steffen@strongswan.org>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Arnaud Ebalard <arno@natisbad.org>
On omap24xx, INTCPS_SIR_IRQ_OFFSET bits [6:0] contains the current
active interrupt number.
However, on 34xx INTCPS_SIR_IRQ_OFFSET bits [31:7] also contains the
SPURIOUSIRQFLAG, which gets set if the interrupt sorting information
is invalid.
If the SPURIOUSIRQFLAG bits are not ignored, the interrupt code will
occasionally produce a bunch of confusing errors:
irq -33, desc: c02ddcc8, depth: 0, count: 0, unhandled: 0
->handle_irq(): c006f23c, handle_bad_irq+0x0/0x22c
->chip(): 00000000, 0x0
->action(): 00000000
Fix this by masking out only the ACTIVEIRQ bits. Also fix a
confusing comment.
Signed-off-by: Tony Lindgren <tony@atomide.com>
debugfs_create_*() returns NULL if an error occurs, returns -ENODEV
when debugfs is not enabled in the kernel.
Comparing to PATCH v1, because clk_debugfs_init is included in
"#if defined CONFIG_DEBUG_FS", we only need to check NULL return.
Thanks Li Zefan <lizf@cn.fujitsu.com>
debugfs_create_u8() and other function's return value's checking method are
also fixed in this patch.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Tony Lindgren <tony@atomide.com>
Fix these compiler warnings:
gpmc.c: In function 'gpmc_init':
gpmc.c:432: warning: 'return' with a value, in function returning void
gpmc.c:439: warning: 'return' with a value, in function returning void
Signed-off-by: Sanjeev Premi <premi@ti.com>
Signed-off-by: Tony Lindgren <tony@atomide.com>
The block layer dropped the virtual merge feature
(b8b3e16cfe). BIO_VMERGE_BOUNDARY
definition is meaningless now (For IA64, BIO_VMERGE_BOUNDARY has been
meaningless for a long time since IA64 disables the virtual merge
feature).
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Remove the swiotlb prototypes from the architecture code and use the
common header file instead.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
makedumpfile[1] cannot run on ia64 discontigmem kernel, because the member
node_mem_map of struct pgdat_list has invalid value. This patch fixes it.
node_start_pfn shows the start pfn of each node, and node_mem_map should
point 'struct page' of each node's node_start_pfn. On my machine, node0's
node_start_pfn shows 0x400 and its node_mem_map points 0xa0007fffbf000000.
This address is the same as vmem_map, so the node_mem_map points 'struct
page' of pfn 0, even if its node_start_pfn shows 0x400.
The cause is due to the round down of min_pfn in count_node_pages() and
node0's node_mem_map points 'struct page' of inactive pfn (0x0). This
patch fixes it.
makedumpfile[1]: dump filtering command
https://sourceforge.net/projects/makedumpfile/
Signed-off-by: Ken'ichi Ohmichi <oomichi@mxs.nes.nec.co.jp>
Cc: Bernhard Walle <bwalle@suse.de>
Cc: Jay Lan <jlan@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Add the error_recovery_info field to the SAL section header,
as defined in the SAL Spec.
Signed-off-by: Russ Anderson <rja@sgi.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Add partition id, coherence id, and region size to UV to
make life simpler for drivers shared between sn2 & uv.
Signed-off-by: Russ Anderson <rja@sgi.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6:
xfrm: Fix xfrm_policy_gc_lock handling.
niu: Use pci_ioremap_bar().
bnx2x: Version Update
bnx2x: Calling netif_carrier_off at the end of the probe
bnx2x: PCI configuration bug on big-endian
bnx2x: Removing the PMF indication when unloading
mv643xx_eth: fix SMI bus access timeouts
net: kconfig cleanup
fs_enet: fix polling
XFRM: copy_to_user_kmaddress() reports local address twice
SMC91x: Fix compilation on some platforms.
udp: Fix the SNMP counter of UDP_MIB_INERRORS
udp: Fix the SNMP counter of UDP_MIB_INDATAGRAMS
drivers/net/smc911x.c: Fix lockdep warning on xmit.
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/libata-dev:
libata: mask off DET when restoring SControl for detach
libata: implement ATA_HORKAGE_ATAPI_MOD16_DMA and apply it
libata: Fix a potential race condition in ata_scsi_park_show()
sata_nv: fix generic, nf2/3 detection regression
sata_via: restore vt*_prepare_host error handling
sata_promise: add ATA engine reset to reset ops
Impact: fix boot tracer + sched tracer coupling bug
Fix a bug that made the sched_switch tracer unable to run
if set as the current_tracer after the boot tracer.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
This patch applies some corrections suggested by Steven Rostedt.
Change the type of shed_ref into int since it is used
into a Mutex, we don't need it anymore as an atomic
variable in the sched_switch tracer.
Also change the name of the register mutex.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: enhance boot trace output with scheduling events
Use the sched_switch tracer from the boot tracer.
We also can trace schedule events inside the initcalls.
Sched tracing is disabled after the initcall has finished and
then reenabled before the next one is started.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
When init_sched_switch_trace() is called, it has no reason to start
the sched tracer if the sched_ref is not zero.
_ If this is non-zero, the tracer is already used, but we can register it
to the tracing engine. There is already a security which avoid the tracer
probes not to be resgistered twice.
_ If this is zero, this block will not be used.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix race condition in sched_switch tracer
This patch fixes a race condition in the sched_switch tracer. If
several tasks (IE: concurrent initcalls) are playing with
tracing_start_cmdline_record() and tracing_stop_cmdline_record(), the
following situation could happen:
_ Task A and B are using the same tracepoint probe. Task A holds it.
Task B is sleeping and doesn't hold it.
_ Task A frees the sched tracer, then sched_ref is decremented to 0.
_ Task A is preempted and hadn't yet unregistered its tracepoint
probe, then B runs.
_ B increments sched_ref, sees it's 1 and then guess it has to
register its probe. But it has not been freed by task A.
_ A lot of bad things can happen after that...
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: modify boot tracer
We used to disable the initcall tracing at a specified time (IE: end
of builtin initcalls). But we don't need it anymore. It will be
stopped when initcalls are finished.
However we want two things:
_Start this tracing only after pre-smp initcalls are finished.
_Since we are planning to trace sched_switches at the same time, we
want to enable them only during the initcall execution.
For this purpose, this patch introduce two functions to enable/disable
the sched_switch tracing during boot.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2.6.28-rc tightened up the ELF architecture checks on ARM. For
non-EABI it only allows VFP if the hardware supports it. However,
the kernel fails to also inspect the soft-float flag, so it
incorrectly rejects binaries using soft-VFP.
The fix is simple: also check that EF_ARM_SOFT_FLOAT isn't set
before rejecting VFP binaries on non-VFP hardware.
Acked-by: Mikael Pettersson <mikpe@it.uu.se>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Impact: fix sysctl name typo
Steve must have needed more coffee ;-)
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: add SysRq-z support to dump trace buffers
Allows one to force an ftrace dump from sysrq
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: Documentation update only
Update the version that the ftrace document was written for.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: Documentation update only
A lot of changes have gone into ftrace. This patch updates
the ftrace.txt document.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: disable interrupts during trace entry creation (as opposed to preempt)
To help with performance, I set the ftracer to not disable interrupts,
and only to disable preemption. If an interrupt occurred, it would not
be traced, because the function tracer protects itself from recursion.
This may be faster, but the trace output might miss some traces.
This patch makes the fuction trace disable interrupts, but it also
adds a runtime feature to disable preemption instead. It does this by
having two different tracer functions. When the function tracer is
enabled, it will check to see which version is requested (irqs disabled
or preemption disabled). Then it will use the corresponding function
as the tracer.
Irq disabling is the default behaviour, but if the user wants better
performance, with the chance of missing traces, then they can choose
the preempt disabled version.
Running hackbench 3 times with the irqs disabled and 3 times with
the preempt disabled function tracer yielded:
tracing type times entries recorded
------------ -------- ----------------
irq disabled 43.393 166433066
43.282 166172618
43.298 166256704
preempt disabled 38.969 159871710
38.943 159972935
39.325 161056510
Average:
irqs disabled: 43.324 166287462
preempt disabled: 39.079 160300385
preempt is 10.8 percent faster than irqs disabled.
I wrote a patch to count function trace recursion and reran hackbench.
With irq disabled: 1,150 times the function tracer did not trace due to
recursion.
with preempt disabled: 5,117,718 times.
The thousand times with irq disabled could be due to NMIs, or simply a case
where it called a function that was not protected by notrace.
But we also see that a large amount of the trace is lost with the
preempt version.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: use new, consolidated APIs in ftrace plugins
This patch replaces the schedule safe preempt disable code with the
ftrace_preempt_disable() and ftrace_preempt_enable() safe functions.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: add new ftrace-plugin internal APIs
Parts of the tracer needs to be careful about schedule recursion.
If the NEED_RESCHED flag is set, a preempt_enable will call schedule.
Inside the schedule function, the NEED_RESCHED flag is cleared.
The problem arises when a trace happens in the schedule function but before
NEED_RESCHED is cleared. The race is as follows:
schedule()
>> tracer called
trace_function()
preempt_disable()
[ record trace ]
preempt_enable() <<- here's the issue.
[check NEED_RESCHED]
schedule()
[ Repeat the above, over and over again ]
The naive approach is simply to use preempt_enable_no_schedule instead.
The problem with that approach is that, although we solve the schedule
recursion issue, we now might lose a preemption check when not in the
schedule function.
trace_function()
preempt_disable()
[ record trace ]
[Interrupt comes in and sets NEED_RESCHED]
preempt_enable_no_resched()
[continue without scheduling]
The way ftrace handles this problem is with the following approach:
int resched;
resched = need_resched();
preempt_disable_notrace();
[record trace]
if (resched)
preempt_enable_no_sched_notrace();
else
preempt_enable_notrace();
This may seem like the opposite of what we want. If resched is set
then we call the "no_sched" version?? The reason we do this is because
if NEED_RESCHED is set before we disable preemption, there's two reasons
for that:
1) we are in an atomic code path
2) we are already on our way to the schedule function, and maybe even
in the schedule function, but have yet to clear the flag.
Both the above cases we do not want to schedule.
This solution has already been implemented within the ftrace infrastructure.
But the problem is that it has been implemented several times. This patch
encapsulates this code to two nice functions.
resched = ftrace_preempt_disable();
[ record trace]
ftrace_preempt_enable(resched);
This way the tracers do not need to worry about getting it right.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix udelay when "notsc" boot parameter is passed
With notsc passed on commandline, tsc may not be used for
udelays, make sure that we do not use tsc_khz to calculate
the lpj value in such cases.
Reported-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Signed-off-by: Alok N Kataria <akataria@vmware.com>
Cc: <stable@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>