linux-uconsole/kernel
Benjamin Herrenschmidt b045b9a265 mm/futex: fix futex writes on archs with SW tracking of dirty & young
commit 2efaca927f upstream.

I haven't reproduced it myself but the fail scenario is that on such
machines (notably ARM and some embedded powerpc), if you manage to hit
that futex path on a writable page whose dirty bit has gone from the PTE,
you'll livelock inside the kernel from what I can tell.

It will go in a loop of trying the atomic access, failing, trying gup to
"fix it up", getting succcess from gup, go back to the atomic access,
failing again because dirty wasn't fixed etc...

So I think you essentially hang in the kernel.

The scenario is probably rare'ish because affected architecture are
embedded and tend to not swap much (if at all) so we probably rarely hit
the case where dirty is missing or young is missing, but I think Shan has
a piece of SW that can reliably reproduce it using a shared writable
mapping & fork or something like that.

On archs who use SW tracking of dirty & young, a page without dirty is
effectively mapped read-only and a page without young unaccessible in the
PTE.

Additionally, some architectures might lazily flush the TLB when relaxing
write protection (by doing only a local flush), and expect a fault to
invalidate the stale entry if it's still present on another processor.

The futex code assumes that if the "in_atomic()" access -EFAULT's, it can
"fix it up" by causing get_user_pages() which would then be equivalent to
taking the fault.

However that isn't the case.  get_user_pages() will not call
handle_mm_fault() in the case where the PTE seems to have the right
permissions, regardless of the dirty and young state.  It will eventually
update those bits ...  in the struct page, but not in the PTE.

Additionally, it will not handle the lazy TLB flushing that can be
required by some architectures in the fault case.

Basically, gup is the wrong interface for the job.  The patch provides a
more appropriate one which boils down to just calling handle_mm_fault()
since what we are trying to do is simulate a real page fault.

The futex code currently attempts to write to user memory within a
pagefault disabled section, and if that fails, tries to fix it up using
get_user_pages().

This doesn't work on archs where the dirty and young bits are maintained
by software, since they will gate access permission in the TLB, and will
not be updated by gup().

In addition, there's an expectation on some archs that a spurious write
fault triggers a local TLB flush, and that is missing from the picture as
well.

I decided that adding those "features" to gup() would be too much for this
already too complex function, and instead added a new simpler
fixup_user_fault() which is essentially a wrapper around handle_mm_fault()
which the futex code can call.

[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: fix some nits Darren saw, fiddle comment layout]
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Reported-by: Shan Hai <haishan.bai@gmail.com>
Tested-by: Shan Hai <haishan.bai@gmail.com>
Cc: David Laight <David.Laight@ACULAB.COM>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Darren Hart <darren.hart@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-08-04 21:58:38 -07:00
..
debug Fix common misspellings 2011-03-31 11:26:23 -03:00
events perf: Fix software event overflow 2011-08-04 21:58:35 -07:00
gcov gcov: disable CONFIG_CONSTRUCTORS when not needed by CONFIG_GCOV_KERNEL 2011-06-15 20:04:01 -07:00
irq genirq: replace irq_gc_ack() with {set,clr}_bit variants (fwd) 2011-07-07 16:02:26 +00:00
power PM / Hibernate: Fix free_unnecessary_pages() 2011-07-06 20:15:23 +02:00
time alarmtimers: Return -ENOTSUPP if no RTC device is present 2011-06-21 16:32:28 -07:00
trace tracing: Have "enable" file use refcounts like the "filter" file 2011-08-04 21:58:38 -07:00
.gitignore
acct.c pass a struct path to vfs_statfs 2010-08-09 16:48:42 -04:00
async.c
audit.c netlink: kill loginuid/sessionid/sid members from struct netlink_skb_parms 2011-03-03 10:55:40 -08:00
audit.h audit: make functions static 2010-10-30 01:42:19 -04:00
audit_tree.c Fix common misspellings 2011-03-31 11:26:23 -03:00
audit_watch.c kill path_lookup() 2011-03-14 09:15:23 -04:00
auditfilter.c netlink: kill loginuid/sessionid/sid members from struct netlink_skb_parms 2011-03-03 10:55:40 -08:00
auditsc.c audit: acquire creds selectively to reduce atomic op overhead 2011-04-27 15:11:03 +02:00
backtracetest.c
bounds.c memcg: remove direct page_cgroup-to-page pointer 2011-03-23 19:46:28 -07:00
capability.c Merge branch 'master' into next 2011-05-19 18:51:57 +10:00
cgroup.c cgroup: remove the ns_cgroup 2011-05-26 17:12:34 -07:00
cgroup_freezer.c cgroups: add per-thread subsystem callbacks 2011-05-26 17:12:34 -07:00
compat.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile 2011-05-25 15:35:32 -07:00
configs.c llseek: automatically add .llseek fop 2010-10-15 15:53:27 +02:00
cpu.c Fix common misspellings 2011-03-31 11:26:23 -03:00
cpuset.c cpuset: Fix cpuset_cpus_allowed_fallback(), don't update tsk->rt.nr_cpus_allowed 2011-05-28 17:02:57 +02:00
crash_dump.c crash_dump: export is_kdump_kernel to modules, consolidate elfcorehdr_addr, setup_elfcorehdr and saved_max_pfn 2011-03-23 19:47:19 -07:00
cred.c Merge branch 'docs-move' of git://git.kernel.org/pub/scm/linux/kernel/git/rdunlap/linux-docs 2011-05-27 10:25:02 -07:00
delayacct.c
dma.c
elfcore.c
exec_domain.c sys_personality: remove the bogus checks in sys_personality()->__set_personality() path 2010-08-09 20:45:05 -07:00
exit.c memcg: clear mm->owner when last possible owner leaves 2011-06-15 20:04:01 -07:00
extable.c extable, core_kernel_data(): Make sure all archs define _sdata 2011-05-20 08:56:56 +02:00
fork.c mm: Fix boot crash in mm_alloc() 2011-05-29 11:32:28 -07:00
freezer.c Freezer: Use SMP barriers 2011-05-17 23:19:17 +02:00
futex.c mm/futex: fix futex writes on archs with SW tracking of dirty & young 2011-08-04 21:58:38 -07:00
futex_compat.c userns: user namespaces: convert several capable() calls 2011-03-23 19:47:08 -07:00
groups.c userns: user namespaces: convert several capable() calls 2011-03-23 19:47:08 -07:00
hrtimer.c hrtimers: Fix typo causing erratic timers 2011-05-25 15:31:58 -07:00
hung_task.c watchdog, hung_task_timeout: Add Kconfig configurable default 2011-04-28 09:13:17 +02:00
irq_work.c irq_work: Use per cpu atomics instead of regular atomics 2010-12-18 15:54:48 +01:00
itimer.c
jump_label.c jump_label: Fix jump_label update for modules 2011-06-29 09:59:17 -04:00
kallsyms.c Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2011-03-25 17:52:22 -07:00
Kconfig.freezer
Kconfig.hz
Kconfig.locks arch:Kconfig.locks Remove unused config option. 2011-04-10 17:01:05 +02:00
Kconfig.preempt
kexec.c PM: Remove sysdev suspend, resume and shutdown operations 2011-05-11 21:37:15 +02:00
kfifo.c kfifo: fix scatterlist usage 2010-10-01 10:50:58 -07:00
kmod.c KEYS/DNS: Fix ____call_usermodehelper() to not lose the session keyring 2011-06-17 09:40:48 -07:00
kprobes.c Merge branch 'for-2.6.38' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu 2011-01-07 17:02:58 -08:00
ksysfs.c kernel/ksysfs.c: expose file_caps_enabled in sysfs 2011-04-19 16:45:51 -07:00
kthread.c cpuset: Fix cpuset_cpus_allowed_fallback(), don't update tsk->rt.nr_cpus_allowed 2011-05-28 17:02:57 +02:00
latencytop.c Fix common misspellings 2011-03-31 11:26:23 -03:00
lockdep.c lockdep: Fix lock_is_held() on recursion 2011-06-07 12:25:50 +02:00
lockdep_internals.h
lockdep_proc.c lockdep: Remove unused 'factor' variable from lockdep_stats_show() 2011-03-23 13:54:47 +01:00
lockdep_states.h
Makefile cgroup: remove the ns_cgroup 2011-05-26 17:12:34 -07:00
module.c Merge branch 'staging-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging-2.6 2011-05-23 12:49:28 -07:00
mutex-debug.c mutex: Use p->on_cpu for the adaptive spin 2011-04-14 08:52:33 +02:00
mutex-debug.h mutex: Use p->on_cpu for the adaptive spin 2011-04-14 08:52:33 +02:00
mutex.c lockdep, mutex: provide mutex_lock_nest_lock 2011-05-25 08:39:17 -07:00
mutex.h mutex: Use p->on_cpu for the adaptive spin 2011-04-14 08:52:33 +02:00
notifier.c
nsproxy.c cgroup: remove the ns_cgroup 2011-05-26 17:12:34 -07:00
padata.c Fix common misspellings 2011-03-31 11:26:23 -03:00
panic.c move x86 specific oops=panic to generic code 2011-03-22 17:44:11 -07:00
params.c params.c: Use new strtobool function to process boolean inputs 2011-05-19 16:55:28 +09:30
pid.c next_pidmap: fix overflow condition 2011-04-18 10:35:30 -07:00
pid_namespace.c pidns: call pid_ns_prepare_proc() from create_pid_namespace() 2011-03-23 19:46:58 -07:00
pm_qos_params.c Merge branch 'idle-release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-idle-2.6 2011-05-29 11:18:09 -07:00
posix-cpu-timers.c hrtimers: Avoid touching inactive timer bases 2011-05-23 13:59:54 +02:00
posix-timers.c posix-timers: RCU conversion 2011-05-24 12:10:51 +02:00
printk.c printk: allocate kernel log buffer earlier 2011-05-25 08:39:48 -07:00
profile.c kernel/profile.c: remove some duplicate code from profile_hits() 2011-05-26 17:12:37 -07:00
ptrace.c ptrace: ptrace_resume() shouldn't wake up !TASK_TRACED thread 2011-05-25 19:20:21 +02:00
range.c kernel/range.c: fix clean_sort_range() for the case of full array 2010-11-12 07:55:31 -08:00
rcupdate.c rcu: Use WARN_ON_ONCE for DEBUG_OBJECTS_RCU_HEAD warnings 2011-05-05 23:16:57 -07:00
rcutiny.c sanitize <linux/prefetch.h> usage 2011-05-20 12:50:29 -07:00
rcutiny_plugin.h rcu: Converge TINY_RCU expedited and normal boosting 2011-05-05 23:16:58 -07:00
rcutorture.c rcu: mark rcutorture boosting callback as being on-stack 2011-05-05 23:16:57 -07:00
rcutree.c rcu: Prevent RCU callbacks from executing before scheduler initialized 2011-07-13 08:17:56 -07:00
rcutree.h rcu: Move RCU_BOOST #ifdefs to header file 2011-06-16 16:12:05 -07:00
rcutree_plugin.h softirq,rcu: Inform RCU of irq_exit() activity 2011-07-20 10:50:12 -07:00
rcutree_trace.c rcu: use softirq instead of kthreads except when RCU_BOOST=y 2011-06-15 23:07:21 -07:00
relay.c Clean up relay_alloc_page_array() slightly by using vzalloc rather than vmalloc and memset 2010-11-05 08:21:34 -07:00
res_counter.c memcg: res_counter_read_u64(): fix potential races on 32-bit machines 2011-03-23 19:46:22 -07:00
resource.c resource: ability to resize an allocated resource 2011-07-06 10:54:08 -07:00
rtmutex-debug.c rtmutex: Simplify PI algorithm and make highest prio task get lock 2011-01-27 21:13:51 -05:00
rtmutex-debug.h
rtmutex-tester.c rtmutex: tester: Remove the remaining BKL leftovers 2011-02-22 22:07:22 +01:00
rtmutex.c rtmutex: Simplify PI algorithm and make highest prio task get lock 2011-01-27 21:13:51 -05:00
rtmutex.h
rtmutex_common.h rtmutex: Simplify PI algorithm and make highest prio task get lock 2011-01-27 21:13:51 -05:00
rwsem.c
sched.c Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2011-07-20 15:56:25 -07:00
sched_autogroup.c Fix common misspellings 2011-03-31 11:26:23 -03:00
sched_autogroup.h sched, autogroup: Stop going ahead if autogroup is disabled 2011-02-23 11:33:59 +01:00
sched_clock.c sched: Add some clock info to sched_debug 2010-11-23 10:29:08 +01:00
sched_cpupri.c
sched_cpupri.h
sched_debug.c sched: Get rid of lock_depth 2011-04-24 13:18:38 +02:00
sched_fair.c sched: Break out cpu_power from the sched_group structure 2011-07-20 18:32:40 +02:00
sched_features.h sched: Allow for overlapping sched_domain spans 2011-07-20 18:32:41 +02:00
sched_idletask.c sched: Drop the rq argument to sched_class::select_task_rq() 2011-04-14 08:52:36 +02:00
sched_rt.c sched: Check if lowest_mask is initialized in find_lowest_rq() 2011-06-15 11:44:48 +02:00
sched_stats.h sched: More sched_domain iterations fixes 2011-05-28 17:02:54 +02:00
sched_stoptask.c sched: Drop the rq argument to sched_class::select_task_rq() 2011-04-14 08:52:36 +02:00
seccomp.c
semaphore.c
signal.c signal: align __lock_task_sighand() irq disabling and RCU 2011-07-20 11:04:54 -07:00
smp.c generic-ipi: Fix kexec boot crash by initializing call_single_queue before enabling interrupts 2011-06-17 10:17:12 +02:00
softirq.c softirq,rcu: Inform RCU of irq_exit() activity 2011-07-20 10:50:12 -07:00
spinlock.c
srcu.c rcu: demote SRCU_SYNCHRONIZE_DELAY from kernel-parameter status 2011-01-14 04:56:49 -08:00
stacktrace.c
stop_machine.c kthread: use kthread_create_on_node() 2011-03-22 17:44:01 -07:00
sys.c Merge branch 'driver-core-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6 2011-05-19 18:24:11 -07:00
sys_ni.c ipc: Add missing sys_ni entries for ipc/compat.c functions 2011-05-20 13:53:02 -07:00
sysctl.c perf: Comment /proc/sys/kernel/perf_event_paranoid to be part of user ABI 2011-06-04 12:22:04 +02:00
sysctl_binary.c open-style analog of vfs_path_lookup() 2011-03-14 09:15:28 -04:00
sysctl_check.c sysctl_check: drop dead code 2011-03-23 19:46:51 -07:00
taskstats.c taskstats: don't allow duplicate entries in listener mode 2011-06-27 18:00:13 -07:00
test_kprobes.c kprobes: Fix selftest to clear flags field for reusing probes 2010-10-14 08:55:27 +02:00
time.c Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2011-03-15 18:53:35 -07:00
timeconst.pl
timer.c timers: Consider slack value in mod_timer() 2011-06-03 15:02:32 +02:00
tracepoint.c jump label: Introduce static_branch() interface 2011-04-04 12:48:08 -04:00
tsacct.c taskstats: use real microsecond granularity for CPU times 2010-10-27 18:03:17 -07:00
uid16.c userns: user namespaces: convert several capable() calls 2011-03-23 19:47:08 -07:00
up.c
user-return-notifier.c Fix common misspellings 2011-03-31 11:26:23 -03:00
user.c userns: add a user_namespace as creator/owner of uts_namespace 2011-03-23 19:46:59 -07:00
user_namespace.c user_ns: improve the user_ns on-the-slab packaging 2011-01-13 08:03:18 -08:00
utsname.c ns proc: Add support for the uts namespace 2011-05-10 14:35:35 -07:00
utsname_sysctl.c
wait.c Fix common misspellings 2011-03-31 11:26:23 -03:00
watchdog.c kernel/watchdog.c: Use proper ANSI C prototypes 2011-05-23 21:07:40 -07:00
workqueue.c Merge branch 'for-2.6.40' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu 2011-05-24 11:53:42 -07:00
workqueue_sched.h