linux-uconsole/kernel
Steven Rostedt 80736c50f9 tracing: Do not record user stack trace from NMI context
commit b6345879cc upstream.

A bug was found with Li Zefan's ftrace_stress_test that caused applications
to segfault during the test.

Placing a tracing_off() in the segfault code, and examining several
traces, I found that the following was always the case. The lock tracer
was enabled (lockdep being required) and userstack was enabled. Testing
this out, I just enabled the two, but that was not good enough. I needed
to run something else that could trigger it. Running a load like hackbench
did not work, but executing a new program would. The following would
trigger the segfault within seconds:

  # echo 1 > /debug/tracing/options/userstacktrace
  # echo 1 > /debug/tracing/events/lock/enable
  # while :; do ls > /dev/null ; done

Enabling the function graph tracer and looking at what was happening
I finally noticed that all cashes happened just after an NMI.

 1)               |    copy_user_handle_tail() {
 1)               |      bad_area_nosemaphore() {
 1)               |        __bad_area_nosemaphore() {
 1)               |          no_context() {
 1)               |            fixup_exception() {
 1)   0.319 us    |              search_exception_tables();
 1)   0.873 us    |            }
[...]
 1)   0.314 us    |  __rcu_read_unlock();
 1)   0.325 us    |    native_apic_mem_write();
 1)   0.943 us    |  }
 1)   0.304 us    |  rcu_nmi_exit();
[...]
 1)   0.479 us    |  find_vma();
 1)               |  bad_area() {
 1)               |    __bad_area() {

After capturing several traces of failures, all of them happened
after an NMI. Curious about this, I added a trace_printk() to the NMI
handler to read the regs->ip to see where the NMI happened. In which I
found out it was here:

ffffffff8135b660 <page_fault>:
ffffffff8135b660:       48 83 ec 78             sub    $0x78,%rsp
ffffffff8135b664:       e8 97 01 00 00          callq  ffffffff8135b800 <error_entry>

What was happening is that the NMI would happen at the place that a page
fault occurred. It would call rcu_read_lock() which was traced by
the lock events, and the user_stack_trace would run. This would trigger
a page fault inside the NMI. I do not see where the CR2 register is
saved or restored in NMI handling. This means that it would corrupt
the page fault handling that the NMI interrupted.

The reason the while loop of ls helped trigger the bug, was that
each execution of ls would cause lots of pages to be faulted in, and
increase the chances of the race happening.

The simple solution is to not allow user stack traces in NMI context.
After this patch, I ran the above "ls" test for a couple of hours
without any issues. Without this patch, the bug would trigger in less
than a minute.

Reported-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-04-01 15:58:12 -07:00
..
gcov microblaze: Enable GCOV_PROFILE_ALL 2009-09-21 14:29:21 +02:00
irq x86: Avoid race condition in pci_enable_msix() 2010-03-15 08:50:06 -07:00
power PM / Hibernate: Fix preallocating of memory 2010-03-15 08:49:46 -07:00
time timekeeping: Prevent oops when GENERIC_TIME=n 2010-04-01 15:58:06 -07:00
trace tracing: Do not record user stack trace from NMI context 2010-04-01 15:58:12 -07:00
.gitignore
acct.c bsdacct: fix uid/gid misreporting 2009-12-18 14:03:52 -08:00
async.c async: Fix lack of boot-time console due to insufficient synchronization 2009-06-08 12:31:53 -07:00
audit.c Audit: send signal info if selinux is disabled 2009-09-24 03:50:26 -04:00
audit.h Fix rule eviction order for AUDIT_DIR 2009-06-24 00:02:38 -04:00
audit_tree.c fix more leaks in audit_tree.c tag_chunk() 2010-01-18 10:19:50 -08:00
audit_watch.c Audit: reorganize struct audit_watch to save 8 bytes 2009-09-24 03:50:25 -04:00
auditfilter.c Audit: clean up all op= output to include string quoting 2009-06-24 00:00:52 -04:00
auditsc.c Audit: rearrange audit_context to save 16 bytes per struct 2009-09-24 03:50:26 -04:00
backtracetest.c
bounds.c
capability.c
cgroup.c cgroups: fix 2.6.32 regression causing BUG_ON() in cgroup_diput() 2010-01-18 10:19:32 -08:00
cgroup_freezer.c cgroups: let ss->can_attach and ss->attach do whole threadgroups at a time 2009-09-24 07:20:58 -07:00
compat.c signals: implement sys_rt_tgsigqueueinfo 2009-04-30 19:24:24 +02:00
configs.c
cpu.c timers, init: Limit the number of per cpu calibration bootup messages 2010-01-28 15:01:14 -08:00
cpuset.c sched: Fix balance vs hotplug race 2010-01-06 15:04:49 -08:00
cred-internals.h
cred.c kernel/cred.c: use kmem_cache_free 2010-02-09 04:51:01 -08:00
delayacct.c headers: taskstats_kern.h trim 2009-09-18 09:48:52 -07:00
dma.c
exec_domain.c
exit.c connector: fix regression introduced by sid connector 2009-10-29 07:39:25 -07:00
extable.c
fork.c Correct nr_processes() when CPUs have been unplugged 2009-11-03 07:52:39 -08:00
freezer.c sched: fix nr_uninterruptible accounting of frozen tasks really 2009-07-18 14:19:53 +02:00
futex.c futex: Handle futex value corruption gracefully 2010-02-23 07:37:43 -08:00
futex_compat.c futex: Fix compat_futex to be same as futex for REQUEUE_PI 2009-08-10 15:41:12 +02:00
groups.c groups: move code to kernel/groups.c 2009-06-16 19:47:48 -07:00
hrtimer.c Merge branch 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2009-10-05 12:04:16 -07:00
hung_task.c sysctl: remove "struct file *" argument of ->proc_handler 2009-09-24 07:21:04 -07:00
itimer.c itimers: Add tracepoints for itimer 2009-08-29 14:10:07 +02:00
kallsyms.c kallsyms: use new arch_is_kernel_text() 2009-09-23 07:39:30 -07:00
Kconfig.freezer
Kconfig.hz
Kconfig.preempt
kexec.c kexec: fix omitting offset in extended crashkernel syntax 2009-07-29 19:10:34 -07:00
kfifo.c kfifo: Use "const" definitions 2009-09-19 13:13:17 -07:00
kgdb.c sysrq, intel_fb: fix sysrq g collision 2009-05-15 07:56:24 -05:00
kmod.c Revert "kmod: fix race in usermodehelper code" 2009-09-23 18:12:10 -07:00
kprobes.c const: constify remaining file_operations 2009-10-01 16:11:11 -07:00
ksysfs.c
kthread.c sched: Fix kthread_bind() by moving the body of kthread_bind() to sched.c 2009-11-03 07:25:00 +01:00
latencytop.c
lockdep.c lockdep: Use cpu_clock() for lockstat 2009-10-09 15:56:44 +02:00
lockdep_internals.h lockdep: BFS cleanup 2009-07-24 10:53:29 +02:00
lockdep_proc.c seq_file: constify seq_operations 2009-09-23 07:39:29 -07:00
lockdep_states.h
Makefile SLOW_WORK: Move slow_work's proc file to debugfs 2009-12-01 08:20:31 -08:00
module.c module: handle ppc64 relocating kcrctabs when CONFIG_RELOCATABLE=y 2010-01-18 10:19:51 -08:00
mutex-debug.c headers: remove sched.h from interrupt.h 2009-10-11 11:20:58 -07:00
mutex-debug.h
mutex.c Merge branch 'linus' into perfcounters/core 2009-06-11 17:55:42 +02:00
mutex.h
notifier.c
ns_cgroup.c cgroups: let ss->can_attach and ss->attach do whole threadgroups at a time 2009-09-24 07:20:58 -07:00
nsproxy.c nsproxy: extract create_nsproxy() 2009-06-18 13:03:56 -07:00
panic.c Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2009-10-08 12:16:35 -07:00
params.c param: fix setting arrays of bool 2009-10-29 08:56:20 +10:30
perf_event.c perf: Honour event state for aux stream data 2010-01-25 10:49:46 -08:00
pid.c mm: also use alloc_large_system_hash() for the PID hash table 2009-09-22 07:17:38 -07:00
pid_namespace.c pidns: deny CLONE_PARENT|CLONE_NEWPID combination 2009-09-24 07:21:04 -07:00
pm_qos_params.c
posix-cpu-timers.c itimers: Add tracepoints for itimer 2009-08-29 14:10:07 +02:00
posix-timers.c time: Introduce CLOCK_REALTIME_COARSE 2009-08-21 21:43:46 +02:00
printk.c printk: add printk_delay to make messages readable for some scenarios 2009-09-23 07:39:28 -07:00
profile.c kernel/profile.c: Switch /proc/irq/prof_cpu_mask to seq_file 2009-09-20 20:15:40 +02:00
ptrace.c ptrace: __ptrace_detach: do __wake_up_parent() if we reap the tracee 2009-09-24 07:20:59 -07:00
rcupdate.c rcu: Move rcu_barrier() to rcutree 2009-10-07 08:11:20 +02:00
rcutorture.c rcu: Clean up code to address Ingo's checkpatch feedback 2009-09-23 19:46:30 +02:00
rcutree.c rcu: Fix note_new_gpnum() uses of ->gpnum 2009-12-18 14:03:01 -08:00
rcutree.h rcu: Remove inline from forward-referenced functions 2009-12-18 14:03:04 -08:00
rcutree_plugin.h rcu: Remove inline from forward-referenced functions 2009-12-18 14:03:04 -08:00
rcutree_trace.c rcu: Make hot-unplugged CPU relinquish its own RCU callbacks 2009-10-07 08:11:20 +02:00
relay.c const: mark struct vm_struct_operations 2009-09-27 11:39:25 -07:00
res_counter.c memcg: some modification to softlimit under hierarchical memory reclaim. 2009-10-01 16:11:13 -07:00
resource.c walk system ram range 2009-09-23 07:39:41 -07:00
rtmutex-debug.c
rtmutex-debug.h
rtmutex-tester.c
rtmutex.c rtmutex: Avoid deadlock in rt_mutex_start_proxy_lock() 2009-08-06 05:50:21 +02:00
rtmutex.h
rtmutex_common.h
rwsem.c
sched.c sched: Don't use possibly stale sched_class 2010-03-15 08:50:17 -07:00
sched_clock.c sched: Fix cpu_clock() in NMIs, on !CONFIG_HAVE_UNSTABLE_SCHED_CLOCK 2010-01-22 15:18:30 -08:00
sched_cpupri.c sched: Add new prio to cpupri before removing old prio 2009-08-02 14:26:09 +02:00
sched_cpupri.h
sched_debug.c sched: Rate-limit newidle 2009-12-18 14:03:13 -08:00
sched_fair.c sched: Fix missing sched tunable recalculation on cpu add/remove 2010-01-28 15:01:11 -08:00
sched_features.h sched: Add new wakeup preemption mode: WAKEUP_RUNNING 2009-09-17 10:17:25 +02:00
sched_idletask.c sched: Simplify sys_sched_rr_get_interval() system call 2009-09-21 09:53:55 +02:00
sched_rt.c sched: Simplify sys_sched_rr_get_interval() system call 2009-09-21 09:53:55 +02:00
sched_stats.h
seccomp.c
semaphore.c
signal.c kernel/signal.c: fix kernel information leak with print-fatal-signals=1 2010-01-18 10:19:33 -08:00
slow-work-debugfs.c SLOW_WORK: Move slow_work's proc file to debugfs 2009-12-01 08:20:31 -08:00
slow-work.c SLOW_WORK: Move slow_work's proc file to debugfs 2009-12-01 08:20:31 -08:00
slow-work.h SLOW_WORK: Move slow_work's proc file to debugfs 2009-12-01 08:20:31 -08:00
smp.c cpumask: remove arch_send_call_function_ipi 2009-09-24 09:34:47 +09:30
softirq.c softirq: add BLOCK_IOPOLL to softirq_to_name 2009-09-17 15:53:44 -04:00
softlockup.c sysctl: remove "struct file *" argument of ->proc_handler 2009-09-24 07:21:04 -07:00
spinlock.c locking: Allow arch-inlined spinlocks 2009-08-31 18:08:50 +02:00
srcu.c
stacktrace.c
stop_machine.c
sys.c Merge branch 'hwpoison-2.6.32' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6 2009-10-29 08:20:00 -07:00
sys_ni.c Merge branch 'master' of /home/davem/src/GIT/linux-2.6/ 2009-09-24 15:13:11 -07:00
sysctl.c kernel/sysctl.c: fix stable merge error in NOMMU mmap_min_addr 2010-01-18 10:19:49 -08:00
sysctl_check.c NET: fix oops at bootime in sysctl code 2010-02-09 04:51:02 -08:00
taskstats.c genetlink: make netns aware 2009-07-12 14:03:27 -07:00
test_kprobes.c
time.c time: Prevent 32 bit overflow with set_normalized_timespec() 2009-09-15 10:17:30 +02:00
timeconst.pl
timer.c Merge branch 'timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2009-09-23 09:46:15 -07:00
tracepoint.c trivial: fix typo "to to" in multiple files 2009-09-21 15:14:55 +02:00
tsacct.c
uid16.c headers: utsname.h redux 2009-09-23 18:13:10 -07:00
up.c
user.c uids: Prevent tear down race 2009-11-02 16:02:39 +01:00
user_namespace.c
utsname.c utsns: extract creeate_uts_ns() 2009-06-18 13:03:55 -07:00
utsname_sysctl.c sysctl: remove "struct file *" argument of ->proc_handler 2009-09-24 07:21:04 -07:00
wait.c locking, sched: Give waitqueue spinlocks their own lockdep classes 2009-08-10 14:43:09 +02:00
workqueue.c workqueue: fix race condition in schedule_on_each_cpu() 2009-11-17 17:40:33 -08:00