linux-uconsole/kernel
Mel Gorman 627c5c60b4 cpuset: mm: reduce large amounts of memory barrier related damage v3
commit cc9a6c8776 upstream.

Stable note:  Not tracked in Bugzilla. [get|put]_mems_allowed() is extremely
	expensive and severely impacted page allocator performance. This
	is part of a series of patches that reduce page allocator overhead.

Commit c0ff7453bb ("cpuset,mm: fix no node to alloc memory when
changing cpuset's mems") wins a super prize for the largest number of
memory barriers entered into fast paths for one commit.

[get|put]_mems_allowed is incredibly heavy with pairs of full memory
barriers inserted into a number of hot paths.  This was detected while
investigating a large page allocator slowdown introduced some time
after 2.6.32.  The largest portion of this overhead was shown by
oprofile to be at an mfence introduced by this commit into the page
allocator hot path.

For extra style points, the commit introduced the use of yield() in an
implementation of what looks like a spinning mutex.

This patch replaces the full memory barriers on both read and write
sides with a sequence counter with just read barriers on the fast path
side.  This is much cheaper on some architectures, including x86.  The
main bulk of the patch is the retry logic if the nodemask changes in a
manner that can cause a false failure.

While updating the nodemask, a check is made to see if a false failure
is a risk.  If it is, the sequence number gets bumped and parallel
allocators will briefly stall while the nodemask update takes place.
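
The read/retry scheme described above can be sketched in user-space C11.
This is only an illustration, not the kernel code: struct mems and the
mems_* and alloc_node names are hypothetical, the mask is a single 64-bit
word, a single writer is assumed, and the "false failure is a risk" check
is assumed to be "old and new masks do not intersect".

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a task's allowed-node mask and its sequence counter. */
struct mems {
    atomic_uint seq;            /* even: stable, odd: update in progress */
    _Atomic uint64_t allowed;   /* bit n set: node n may be used */
};

static void mems_init(struct mems *m, uint64_t mask)
{
    atomic_init(&m->seq, 0u);
    atomic_init(&m->allowed, mask);
}

/* Reader entry: wait out any in-flight update, then return a cookie. */
static unsigned mems_read_begin(struct mems *m)
{
    unsigned s;
    while ((s = atomic_load_explicit(&m->seq, memory_order_acquire)) & 1u)
        ; /* writer is mid-update: this is the brief stall */
    return s;
}

/* Reader exit: true if no sequenced update happened since the cookie was taken. */
static bool mems_read_ok(struct mems *m, unsigned cookie)
{
    atomic_thread_fence(memory_order_acquire);
    return atomic_load_explicit(&m->seq, memory_order_relaxed) == cookie;
}

/* Writer (single writer assumed): bump the sequence only for risky updates. */
static void mems_update(struct mems *m, uint64_t newmask)
{
    uint64_t old = atomic_load_explicit(&m->allowed, memory_order_relaxed);
    /* Assumption: the update is risky when old and new masks share no node.
     * A real nodemask spans several words, so a reader could then observe
     * a mix of old and new words that contains no usable node at all. */
    bool risky = (old & newmask) == 0;

    assert(newmask != 0);
    if (risky) {
        atomic_fetch_add_explicit(&m->seq, 1u, memory_order_relaxed);
        atomic_thread_fence(memory_order_release);
    }
    atomic_store_explicit(&m->allowed, old | newmask, memory_order_relaxed);
    atomic_store_explicit(&m->allowed, newmask, memory_order_relaxed);
    if (risky)
        atomic_fetch_add_explicit(&m->seq, 1u, memory_order_release);
}

/* Pick the lowest allowed node that has free memory, or -1 on genuine failure. */
static int alloc_node(struct mems *m, uint64_t nodes_with_free)
{
    for (;;) {
        unsigned cookie = mems_read_begin(m);
        uint64_t usable =
            atomic_load_explicit(&m->allowed, memory_order_relaxed) & nodes_with_free;
        bool stable = mems_read_ok(m, cookie);

        if (usable) {
            int node = 0;
            while (!(usable & 1u)) {
                usable >>= 1;
                node++;
            }
            return node;
        }
        if (stable)
            return -1;  /* mask did not change: the failure is real */
        /* mask changed underneath us: the failure may be false, so retry */
    }
}
```

The point of the design is that readers never write shared state: on the
fast path they only load the counter twice, and they retry only when an
allocation failed and the counter moved in the meantime.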

In a page fault test microbenchmark, oprofile samples from
__alloc_pages_nodemask went from 4.53% of all samples to 1.15%.  The
actual results were

                                       3.3.0-rc3             3.3.0-rc3
                                     rc3-vanilla        nobarrier-v2r1
    Clients   1 UserTime          0.07 (  0.00%)        0.08 (-14.19%)
    Clients   2 UserTime          0.07 (  0.00%)        0.07 (  2.72%)
    Clients   4 UserTime          0.08 (  0.00%)        0.07 (  3.29%)
    Clients   1 SysTime           0.70 (  0.00%)        0.65 (  6.65%)
    Clients   2 SysTime           0.85 (  0.00%)        0.82 (  3.65%)
    Clients   4 SysTime           1.41 (  0.00%)        1.41 (  0.32%)
    Clients   1 WallTime          0.77 (  0.00%)        0.74 (  4.19%)
    Clients   2 WallTime          0.47 (  0.00%)        0.45 (  3.73%)
    Clients   4 WallTime          0.38 (  0.00%)        0.37 (  1.58%)
    Clients   1 Flt/sec/cpu  497620.28 (  0.00%)   520294.53 (  4.56%)
    Clients   2 Flt/sec/cpu  414639.05 (  0.00%)   429882.01 (  3.68%)
    Clients   4 Flt/sec/cpu  257959.16 (  0.00%)   258761.48 (  0.31%)
    Clients   1 Flt/sec      495161.39 (  0.00%)   517292.87 (  4.47%)
    Clients   2 Flt/sec      820325.95 (  0.00%)   850289.77 (  3.65%)
    Clients   4 Flt/sec     1020068.93 (  0.00%)  1022674.06 (  0.26%)
    MMTests Statistics: duration
    Sys Time Running Test (seconds)             135.68    132.17
    User+Sys Time Running Test (seconds)         164.2    160.13
    Total Elapsed Time (seconds)                123.46    120.87

The overall improvement is small, but system CPU time is much improved
and roughly in line with what oprofile reported (these figures were
gathered without profiling, so some skew is expected).  The number of
page faults per second is noticeably improved.

For benchmarks like kernel builds, the overall benefit is marginal but
the system CPU time is slightly reduced.

To test the actual bug the commit fixed I opened two terminals.  The
first ran within a cpuset and continually ran a small program that
faulted 100M of anonymous data.  In a second window, the nodemask of the
cpuset was continually randomised in a loop.
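
The original test program is not included here. A minimal sketch of a
program that faults anonymous memory might look like the following; the
fault_anon name is hypothetical, and how the original detected failure is
not stated, so this version only reports an mmap failure.

```c
#define _DEFAULT_SOURCE  /* for MAP_ANONYMOUS under strict -std= modes */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map `bytes` of anonymous memory and write one byte per page so that
 * every page is faulted in. Returns the number of pages touched, or -1
 * if the page size is unavailable or the mapping fails.
 */
static long fault_anon(size_t bytes)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        return -1;

    unsigned char *buf = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;

    long touched = 0;
    for (size_t off = 0; off < bytes; off += (size_t)page) {
        buf[off] = 1;   /* first write to the page triggers the fault */
        touched++;
    }

    munmap(buf, bytes);
    return touched;
}
```

Calling fault_anon((size_t)100 << 20) in a loop from a shell running
inside the cpuset would approximate the 100M workload described above.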

Without the commit, the program would fail every so often (usually
within 10 seconds) and obviously with the commit everything worked fine.
With this patch applied, it also worked fine so the fix should be
functionally equivalent.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Miao Xie <miaox@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-08-01 12:27:20 -07:00
debug kgdb,debug_core: pass the breakpoint struct instead of address and memory 2012-04-13 08:14:07 -07:00
events perf: Fix software event overflow 2011-08-04 21:58:35 -07:00
gcov gcov: disable CONFIG_CONSTRUCTORS when not needed by CONFIG_GCOV_KERNEL 2011-06-15 20:04:01 -07:00
irq genirq: Adjust irq thread affinity on IRQ_SET_MASK_OK_NOCOPY return value 2012-04-13 08:14:06 -07:00
power PM / Hibernate: Enable usermodehelpers in hibernate() error path 2012-04-02 09:27:18 -07:00
time ntp: Fix STA_INS/DEL clearing bug 2012-08-01 12:26:53 -07:00
trace tracing: change CPU ring buffer state from tracing_cpumask 2012-07-16 08:47:51 -07:00
.gitignore
acct.c
async.c
audit.c netlink: kill loginuid/sessionid/sid members from struct netlink_skb_parms 2011-03-03 10:55:40 -08:00
audit.h audit: make functions static 2010-10-30 01:42:19 -04:00
audit_tree.c Fix common misspellings 2011-03-31 11:26:23 -03:00
audit_watch.c kill path_lookup() 2011-03-14 09:15:23 -04:00
auditfilter.c netlink: kill loginuid/sessionid/sid members from struct netlink_skb_parms 2011-03-03 10:55:40 -08:00
auditsc.c audit: acquire creds selectively to reduce atomic op overhead 2011-04-27 15:11:03 +02:00
backtracetest.c
bounds.c memcg: remove direct page_cgroup-to-page pointer 2011-03-23 19:46:28 -07:00
capability.c Merge branch 'master' into next 2011-05-19 18:51:57 +10:00
cgroup.c cgroup: fix to allow mounting a hierarchy by name 2012-01-12 11:35:08 -08:00
cgroup_freezer.c cgroup_freezer: fix freezing groups with stopped tasks 2011-12-09 08:52:27 -08:00
compat.c compat: Fix RT signal mask corruption via sigprocmask 2012-05-21 09:40:04 -07:00
configs.c
cpu.c PM / Sleep: Fix race between CPU hotplug and freezer 2012-01-12 11:35:46 -08:00
cpuset.c cpuset: mm: reduce large amounts of memory barrier related damage v3 2012-08-01 12:27:20 -07:00
crash_dump.c crash_dump: export is_kdump_kernel to modules, consolidate elfcorehdr_addr, setup_elfcorehdr and saved_max_pfn 2011-03-23 19:47:19 -07:00
cred.c cred: copy_process() should clear child->replacement_session_keyring 2012-04-13 08:14:08 -07:00
delayacct.c
dma.c
elfcore.c
exec_domain.c
exit.c ptrace: partially fix the do_wait(WEXITED) vs EXIT_DEAD->EXIT_ZOMBIE race 2012-01-06 14:14:14 -08:00
extable.c extable, core_kernel_data(): Make sure all archs define _sdata 2011-05-20 08:56:56 +02:00
fork.c cpuset: mm: reduce large amounts of memory barrier related damage v3 2012-08-01 12:27:20 -07:00
freezer.c Freezer: Use SMP barriers 2011-05-17 23:19:17 +02:00
futex.c futex: Do not leak robust list to unprivileged process 2012-04-22 16:21:45 -07:00
futex_compat.c futex: Do not leak robust list to unprivileged process 2012-04-22 16:21:45 -07:00
groups.c userns: user namespaces: convert several capable() calls 2011-03-23 19:47:08 -07:00
hrtimer.c hrtimer: Update hrtimer base offsets each hrtimer_interrupt 2012-07-19 08:58:46 -07:00
hung_task.c hung_task: fix false positive during vfork 2012-01-06 14:14:13 -08:00
irq_work.c irq_work: Use per cpu atomics instead of regular atomics 2010-12-18 15:54:48 +01:00
itimer.c
jump_label.c jump_label: jump_label_inc may return before the code is patched 2011-12-09 08:52:50 -08:00
kallsyms.c Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2011-03-25 17:52:22 -07:00
Kconfig.freezer
Kconfig.hz
Kconfig.locks arch:Kconfig.locks Remove unused config option. 2011-04-10 17:01:05 +02:00
Kconfig.preempt
kexec.c PM: Remove sysdev suspend, resume and shutdown operations 2011-05-11 21:37:15 +02:00
kfifo.c
kmod.c kmod: prevent kmod_loop_msg overflow in __request_module() 2011-11-11 09:35:48 -08:00
kprobes.c kprobes: adjust "fix a memory leak in function pre_handler_kretprobe()" 2012-03-12 10:32:57 -07:00
ksysfs.c kernel/ksysfs.c: expose file_caps_enabled in sysfs 2011-04-19 16:45:51 -07:00
kthread.c cpuset: Fix cpuset_cpus_allowed_fallback(), don't update tsk->rt.nr_cpus_allowed 2011-05-28 17:02:57 +02:00
latencytop.c Fix common misspellings 2011-03-31 11:26:23 -03:00
lockdep.c lockdep: Fix lock_is_held() on recursion 2011-06-07 12:25:50 +02:00
lockdep_internals.h
lockdep_proc.c lockdep: Remove unused 'factor' variable from lockdep_stats_show() 2011-03-23 13:54:47 +01:00
lockdep_states.h
Makefile cgroup: remove the ns_cgroup 2011-05-26 17:12:34 -07:00
module.c module: Remove module size limit 2012-04-02 09:27:20 -07:00
mutex-debug.c mutex: Use p->on_cpu for the adaptive spin 2011-04-14 08:52:33 +02:00
mutex-debug.h mutex: Use p->on_cpu for the adaptive spin 2011-04-14 08:52:33 +02:00
mutex.c lockdep, mutex: provide mutex_lock_nest_lock 2011-05-25 08:39:17 -07:00
mutex.h mutex: Use p->on_cpu for the adaptive spin 2011-04-14 08:52:33 +02:00
notifier.c
nsproxy.c cgroup: remove the ns_cgroup 2011-05-26 17:12:34 -07:00
padata.c Fix common misspellings 2011-03-31 11:26:23 -03:00
panic.c lockdep, bug: Exclude TAINT_FIRMWARE_WORKAROUND from disabling lockdep 2012-02-13 11:06:10 -08:00
params.c params.c: Use new strtobool function to process boolean inputs 2011-05-19 16:55:28 +09:30
pid.c next_pidmap: fix overflow condition 2011-04-18 10:35:30 -07:00
pid_namespace.c pidns: call pid_ns_prepare_proc() from create_pid_namespace() 2011-03-23 19:46:58 -07:00
pm_qos_params.c Merge branch 'idle-release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-idle-2.6 2011-05-29 11:18:09 -07:00
posix-cpu-timers.c cputimer: Cure lock inversion 2011-10-25 07:10:14 +02:00
posix-timers.c posix-timers: RCU conversion 2011-05-24 12:10:51 +02:00
printk.c cap_syslog: don't use WARN_ONCE for CAP_SYS_ADMIN deprecation warning 2012-02-03 09:18:57 -08:00
profile.c kernel/profile.c: remove some duplicate code from profile_hits() 2011-05-26 17:12:37 -07:00
ptrace.c ptrace: ptrace_resume() shouldn't wake up !TASK_TRACED thread 2011-05-25 19:20:21 +02:00
range.c kernel/range.c: fix clean_sort_range() for the case of full array 2010-11-12 07:55:31 -08:00
rcupdate.c rcu: Use WARN_ON_ONCE for DEBUG_OBJECTS_RCU_HEAD warnings 2011-05-05 23:16:57 -07:00
rcutiny.c sanitize <linux/prefetch.h> usage 2011-05-20 12:50:29 -07:00
rcutiny_plugin.h rcu: Converge TINY_RCU expedited and normal boosting 2011-05-05 23:16:58 -07:00
rcutorture.c rcu: mark rcutorture boosting callback as being on-stack 2011-05-05 23:16:57 -07:00
rcutree.c rcu: Prevent RCU callbacks from executing before scheduler initialized 2011-07-13 08:17:56 -07:00
rcutree.h rcu: Move RCU_BOOST #ifdefs to header file 2011-06-16 16:12:05 -07:00
rcutree_plugin.h softirq,rcu: Inform RCU of irq_exit() activity 2011-07-20 10:50:12 -07:00
rcutree_trace.c rcu: use softirq instead of kthreads except when RCU_BOOST=y 2011-06-15 23:07:21 -07:00
relay.c relay: prevent integer overflow in relay_open() 2012-02-20 12:48:10 -08:00
res_counter.c memcg: res_counter_read_u64(): fix potential races on 32-bit machines 2011-03-23 19:46:22 -07:00
resource.c resource: ability to resize an allocated resource 2011-07-06 10:54:08 -07:00
rtmutex-debug.c rtmutex: Simplify PI algorithm and make highest prio task get lock 2011-01-27 21:13:51 -05:00
rtmutex-debug.h
rtmutex-tester.c rtmutex: tester: Remove the remaining BKL leftovers 2011-02-22 22:07:22 +01:00
rtmutex.c rtmutex: Simplify PI algorithm and make highest prio task get lock 2011-01-27 21:13:51 -05:00
rtmutex.h
rtmutex_common.h rtmutex: Simplify PI algorithm and make highest prio task get lock 2011-01-27 21:13:51 -05:00
rwsem.c
sched.c sched: Fix the relax_domain_level boot parameter 2012-06-17 11:23:12 -07:00
sched_autogroup.c Fix common misspellings 2011-03-31 11:26:23 -03:00
sched_autogroup.h sched, autogroup: Stop going ahead if autogroup is disabled 2011-02-23 11:33:59 +01:00
sched_clock.c sched: Add some clock info to sched_debug 2010-11-23 10:29:08 +01:00
sched_cpupri.c
sched_cpupri.h
sched_debug.c sched: Get rid of lock_depth 2011-04-24 13:18:38 +02:00
sched_fair.c sched: Break out cpu_power from the sched_group structure 2011-07-20 18:32:40 +02:00
sched_features.h sched: Allow for overlapping sched_domain spans 2011-07-20 18:32:41 +02:00
sched_idletask.c sched: Drop the rq argument to sched_class::select_task_rq() 2011-04-14 08:52:36 +02:00
sched_rt.c sched/rt: Fix task stack corruption under __ARCH_WANT_INTERRUPTS_ON_CTXSW 2012-02-13 11:06:08 -08:00
sched_stats.h sched: More sched_domain iterations fixes 2011-05-28 17:02:54 +02:00
sched_stoptask.c sched: Drop the rq argument to sched_class::select_task_rq() 2011-04-14 08:52:36 +02:00
seccomp.c
semaphore.c
signal.c ptrace: don't clear GROUP_STOP_SIGMASK on double-stop 2011-11-11 09:36:23 -08:00
smp.c generic-ipi: Fix kexec boot crash by initializing call_single_queue before enabling interrupts 2011-06-17 10:17:12 +02:00
softirq.c softirq,rcu: Inform RCU of irq_exit() activity 2011-07-20 10:50:12 -07:00
spinlock.c
srcu.c rcu: demote SRCU_SYNCHRONIZE_DELAY from kernel-parameter status 2011-01-14 04:56:49 -08:00
stacktrace.c
stop_machine.c x86, mtrr: lock stop machine during MTRR rendezvous sequence 2011-08-29 13:29:08 -07:00
sys.c Avoid using variable-length arrays in kernel/sys.c 2011-10-25 07:10:14 +02:00
sys_ni.c ipc: Add missing sys_ni entries for ipc/compat.c functions 2011-05-20 13:53:02 -07:00
sysctl.c sysctl: fix write access to dmesg_restrict/kptr_restrict 2012-04-13 08:14:07 -07:00
sysctl_binary.c binary_sysctl(): fix memory leak 2012-01-06 14:13:50 -08:00
sysctl_check.c sysctl_check: drop dead code 2011-03-23 19:46:51 -07:00
taskstats.c Make TASKSTATS require root access 2011-12-21 12:57:40 -08:00
test_kprobes.c
time.c time: Change jiffies_to_clock_t() argument type to unsigned long 2011-11-11 09:35:52 -08:00
timeconst.pl
timer.c timers: Consider slack value in mod_timer() 2011-06-03 15:02:32 +02:00
tracepoint.c jump label: Introduce static_branch() interface 2011-04-04 12:48:08 -04:00
tsacct.c taskstats: use real microsecond granularity for CPU times 2010-10-27 18:03:17 -07:00
uid16.c userns: user namespaces: convert several capable() calls 2011-03-23 19:47:08 -07:00
up.c
user-return-notifier.c Fix common misspellings 2011-03-31 11:26:23 -03:00
user.c userns: add a user_namespace as creator/owner of uts_namespace 2011-03-23 19:46:59 -07:00
user_namespace.c user_ns: improve the user_ns on-the-slab packaging 2011-01-13 08:03:18 -08:00
utsname.c ns proc: Add support for the uts namespace 2011-05-10 14:35:35 -07:00
utsname_sysctl.c
wait.c Fix common misspellings 2011-03-31 11:26:23 -03:00
watchdog.c kernel/watchdog.c: Use proper ANSI C prototypes 2011-05-23 21:07:40 -07:00
workqueue.c workqueue: skip nr_running sanity check in worker_enter_idle() if trustee is active 2012-06-01 15:12:56 +08:00
workqueue_sched.h