linux-uconsole/kernel
Peter Zijlstra 249cf808ba posix-cpu-timers: Cure SMP wobbles
commit d670ec1317 upstream.

David reported:

  Attached below is a watered-down version of rt/tst-cpuclock2.c from
  GLIBC.  Just build it with "gcc -o test test.c -lpthread -lrt" or
  similar.

  Run it several times, and you will see cases where the main thread
  will measure a process clock difference before and after the nanosleep
  which is smaller than the cpu-burner thread's individual thread clock
  difference.  This doesn't make any sense since the cpu-burner thread
  is part of the top-level process's thread group.

  I've reproduced this on both x86-64 and sparc64 (using both 32-bit and
  64-bit binaries).

  For example:

  [davem@boricha build-x86_64-linux]$ ./test
  process: before(0.001221967) after(0.498624371) diff(497402404)
  thread:  before(0.000081692) after(0.498316431) diff(498234739)
  self:    before(0.001223521) after(0.001240219) diff(16698)
  [davem@boricha build-x86_64-linux]$

  The diff of 'process' should always be >= the diff of 'thread'.

  I make sure to wrap the 'thread' clock measurements the most tightly
  around the nanosleep() call, and that the 'process' clock measurements
  are the outer-most ones.

  ---
  #include <unistd.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <time.h>
  #include <fcntl.h>
  #include <string.h>
  #include <errno.h>
  #include <pthread.h>

  static pthread_barrier_t barrier;

  static void *chew_cpu(void *arg)
  {
	  pthread_barrier_wait(&barrier);
	  while (1)
		  __asm__ __volatile__("" : : : "memory");
	  return NULL;
  }

  int main(void)
  {
	  clockid_t process_clock, my_thread_clock, th_clock;
	  struct timespec process_before, process_after;
	  struct timespec me_before, me_after;
	  struct timespec th_before, th_after;
	  struct timespec sleeptime;
	  unsigned long diff;
	  pthread_t th;
	  int err;

	  err = clock_getcpuclockid(0, &process_clock);
	  if (err)
		  return 1;

	  err = pthread_getcpuclockid(pthread_self(), &my_thread_clock);
	  if (err)
		  return 1;

	  pthread_barrier_init(&barrier, NULL, 2);
	  err = pthread_create(&th, NULL, chew_cpu, NULL);
	  if (err)
		  return 1;

	  err = pthread_getcpuclockid(th, &th_clock);
	  if (err)
		  return 1;

	  pthread_barrier_wait(&barrier);

	  err = clock_gettime(process_clock, &process_before);
	  if (err)
		  return 1;

	  err = clock_gettime(my_thread_clock, &me_before);
	  if (err)
		  return 1;

	  err = clock_gettime(th_clock, &th_before);
	  if (err)
		  return 1;

	  sleeptime.tv_sec = 0;
	  sleeptime.tv_nsec = 500000000;
	  nanosleep(&sleeptime, NULL);

	  err = clock_gettime(th_clock, &th_after);
	  if (err)
		  return 1;

	  err = clock_gettime(my_thread_clock, &me_after);
	  if (err)
		  return 1;

	  err = clock_gettime(process_clock, &process_after);
	  if (err)
		  return 1;

	  diff = process_after.tv_nsec - process_before.tv_nsec;
	  printf("process: before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
		 process_before.tv_sec, process_before.tv_nsec,
		 process_after.tv_sec, process_after.tv_nsec, diff);
	  diff = th_after.tv_nsec - th_before.tv_nsec;
	  printf("thread:  before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
		 th_before.tv_sec, th_before.tv_nsec,
		 th_after.tv_sec, th_after.tv_nsec, diff);
	  diff = me_after.tv_nsec - me_before.tv_nsec;
	  printf("self:    before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
		 me_before.tv_sec, me_before.tv_nsec,
		 me_after.tv_sec, me_after.tv_nsec, diff);

	  return 0;
  }

This is due to us using p->se.sum_exec_runtime in
thread_group_cputime() where we iterate the thread group and sum all
data. This does not take time since the last schedule operation (tick
or otherwise) into account. We can cure this by using
task_sched_runtime() at the cost of having to take locks.

This also means we can (and must) do away with
thread_group_sched_runtime() since the modified thread_group_cputime()
is now more accurate and would deadlock when called from
thread_group_sched_runtime().

Aside of that it makes the function safe on 32 bit systems. The old
code added t->se.sum_exec_runtime unprotected. sum_exec_runtime is a
64bit value and could be changed on another cpu at the same time.

Reported-by: David Miller <davem@davemloft.net>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1314874459.7945.22.camel@twins
Tested-by: David Miller <davem@davemloft.net>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-10-16 14:14:51 -07:00
..
debug Fix common misspellings 2011-03-31 11:26:23 -03:00
events perf: Fix software event overflow 2011-08-04 21:58:35 -07:00
gcov gcov: disable CONFIG_CONSTRUCTORS when not needed by CONFIG_GCOV_KERNEL 2011-06-15 20:04:01 -07:00
irq genirq: Make irq_shutdown() symmetric vs. irq_startup again 2011-10-03 11:40:27 -07:00
power PM / Hibernate: Fix free_unnecessary_pages() 2011-07-06 20:15:23 +02:00
time alarmtimers: Avoid possible denial of service with high freq periodic timers 2011-10-03 11:40:07 -07:00
trace tracing: Have "enable" file use refcounts like the "filter" file 2011-08-04 21:58:38 -07:00
.gitignore
acct.c
async.c
audit.c netlink: kill loginuid/sessionid/sid members from struct netlink_skb_parms 2011-03-03 10:55:40 -08:00
audit.h audit: make functions static 2010-10-30 01:42:19 -04:00
audit_tree.c Fix common misspellings 2011-03-31 11:26:23 -03:00
audit_watch.c kill path_lookup() 2011-03-14 09:15:23 -04:00
auditfilter.c netlink: kill loginuid/sessionid/sid members from struct netlink_skb_parms 2011-03-03 10:55:40 -08:00
auditsc.c audit: acquire creds selectively to reduce atomic op overhead 2011-04-27 15:11:03 +02:00
backtracetest.c
bounds.c memcg: remove direct page_cgroup-to-page pointer 2011-03-23 19:46:28 -07:00
capability.c Merge branch 'master' into next 2011-05-19 18:51:57 +10:00
cgroup.c cgroup: remove the ns_cgroup 2011-05-26 17:12:34 -07:00
cgroup_freezer.c cgroups: add per-thread subsystem callbacks 2011-05-26 17:12:34 -07:00
compat.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile 2011-05-25 15:35:32 -07:00
configs.c llseek: automatically add .llseek fop 2010-10-15 15:53:27 +02:00
cpu.c Fix common misspellings 2011-03-31 11:26:23 -03:00
cpuset.c cpuset: Fix cpuset_cpus_allowed_fallback(), don't update tsk->rt.nr_cpus_allowed 2011-05-28 17:02:57 +02:00
crash_dump.c crash_dump: export is_kdump_kernel to modules, consolidate elfcorehdr_addr, setup_elfcorehdr and saved_max_pfn 2011-03-23 19:47:19 -07:00
cred.c Merge branch 'docs-move' of git://git.kernel.org/pub/scm/linux/kernel/git/rdunlap/linux-docs 2011-05-27 10:25:02 -07:00
delayacct.c
dma.c
elfcore.c
exec_domain.c
exit.c memcg: clear mm->owner when last possible owner leaves 2011-06-15 20:04:01 -07:00
extable.c extable, core_kernel_data(): Make sure all archs define _sdata 2011-05-20 08:56:56 +02:00
fork.c mm: Fix boot crash in mm_alloc() 2011-05-29 11:32:28 -07:00
freezer.c Freezer: Use SMP barriers 2011-05-17 23:19:17 +02:00
futex.c futex: Fix regression with read only mappings 2011-08-15 18:31:32 -07:00
futex_compat.c userns: user namespaces: convert several capable() calls 2011-03-23 19:47:08 -07:00
groups.c userns: user namespaces: convert several capable() calls 2011-03-23 19:47:08 -07:00
hrtimer.c hrtimers: Fix typo causing erratic timers 2011-05-25 15:31:58 -07:00
hung_task.c watchdog, hung_task_timeout: Add Kconfig configurable default 2011-04-28 09:13:17 +02:00
irq_work.c irq_work: Use per cpu atomics instead of regular atomics 2010-12-18 15:54:48 +01:00
itimer.c
jump_label.c jump_label: Fix jump_label update for modules 2011-06-29 09:59:17 -04:00
kallsyms.c Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2011-03-25 17:52:22 -07:00
Kconfig.freezer
Kconfig.hz
Kconfig.locks arch:Kconfig.locks Remove unused config option. 2011-04-10 17:01:05 +02:00
Kconfig.preempt
kexec.c PM: Remove sysdev suspend, resume and shutdown operations 2011-05-11 21:37:15 +02:00
kfifo.c kfifo: fix scatterlist usage 2010-10-01 10:50:58 -07:00
kmod.c KEYS/DNS: Fix ____call_usermodehelper() to not lose the session keyring 2011-06-17 09:40:48 -07:00
kprobes.c Merge branch 'for-2.6.38' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu 2011-01-07 17:02:58 -08:00
ksysfs.c kernel/ksysfs.c: expose file_caps_enabled in sysfs 2011-04-19 16:45:51 -07:00
kthread.c cpuset: Fix cpuset_cpus_allowed_fallback(), don't update tsk->rt.nr_cpus_allowed 2011-05-28 17:02:57 +02:00
latencytop.c Fix common misspellings 2011-03-31 11:26:23 -03:00
lockdep.c lockdep: Fix lock_is_held() on recursion 2011-06-07 12:25:50 +02:00
lockdep_internals.h
lockdep_proc.c lockdep: Remove unused 'factor' variable from lockdep_stats_show() 2011-03-23 13:54:47 +01:00
lockdep_states.h
Makefile cgroup: remove the ns_cgroup 2011-05-26 17:12:34 -07:00
module.c Merge branch 'staging-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging-2.6 2011-05-23 12:49:28 -07:00
mutex-debug.c mutex: Use p->on_cpu for the adaptive spin 2011-04-14 08:52:33 +02:00
mutex-debug.h mutex: Use p->on_cpu for the adaptive spin 2011-04-14 08:52:33 +02:00
mutex.c lockdep, mutex: provide mutex_lock_nest_lock 2011-05-25 08:39:17 -07:00
mutex.h mutex: Use p->on_cpu for the adaptive spin 2011-04-14 08:52:33 +02:00
notifier.c
nsproxy.c cgroup: remove the ns_cgroup 2011-05-26 17:12:34 -07:00
padata.c Fix common misspellings 2011-03-31 11:26:23 -03:00
panic.c move x86 specific oops=panic to generic code 2011-03-22 17:44:11 -07:00
params.c params.c: Use new strtobool function to process boolean inputs 2011-05-19 16:55:28 +09:30
pid.c next_pidmap: fix overflow condition 2011-04-18 10:35:30 -07:00
pid_namespace.c pidns: call pid_ns_prepare_proc() from create_pid_namespace() 2011-03-23 19:46:58 -07:00
pm_qos_params.c Merge branch 'idle-release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-idle-2.6 2011-05-29 11:18:09 -07:00
posix-cpu-timers.c posix-cpu-timers: Cure SMP wobbles 2011-10-16 14:14:51 -07:00
posix-timers.c posix-timers: RCU conversion 2011-05-24 12:10:51 +02:00
printk.c kernel/printk: do not turn off bootconsole in printk_late_init() if keep_bootcon 2011-10-03 11:39:46 -07:00
profile.c kernel/profile.c: remove some duplicate code from profile_hits() 2011-05-26 17:12:37 -07:00
ptrace.c ptrace: ptrace_resume() shouldn't wake up !TASK_TRACED thread 2011-05-25 19:20:21 +02:00
range.c kernel/range.c: fix clean_sort_range() for the case of full array 2010-11-12 07:55:31 -08:00
rcupdate.c rcu: Use WARN_ON_ONCE for DEBUG_OBJECTS_RCU_HEAD warnings 2011-05-05 23:16:57 -07:00
rcutiny.c sanitize <linux/prefetch.h> usage 2011-05-20 12:50:29 -07:00
rcutiny_plugin.h rcu: Converge TINY_RCU expedited and normal boosting 2011-05-05 23:16:58 -07:00
rcutorture.c rcu: mark rcutorture boosting callback as being on-stack 2011-05-05 23:16:57 -07:00
rcutree.c rcu: Prevent RCU callbacks from executing before scheduler initialized 2011-07-13 08:17:56 -07:00
rcutree.h rcu: Move RCU_BOOST #ifdefs to header file 2011-06-16 16:12:05 -07:00
rcutree_plugin.h softirq,rcu: Inform RCU of irq_exit() activity 2011-07-20 10:50:12 -07:00
rcutree_trace.c rcu: use softirq instead of kthreads except when RCU_BOOST=y 2011-06-15 23:07:21 -07:00
relay.c Clean up relay_alloc_page_array() slightly by using vzalloc rather than vmalloc and memset 2010-11-05 08:21:34 -07:00
res_counter.c memcg: res_counter_read_u64(): fix potential races on 32-bit machines 2011-03-23 19:46:22 -07:00
resource.c resource: ability to resize an allocated resource 2011-07-06 10:54:08 -07:00
rtmutex-debug.c rtmutex: Simplify PI algorithm and make highest prio task get lock 2011-01-27 21:13:51 -05:00
rtmutex-debug.h
rtmutex-tester.c rtmutex: tester: Remove the remaining BKL leftovers 2011-02-22 22:07:22 +01:00
rtmutex.c rtmutex: Simplify PI algorithm and make highest prio task get lock 2011-01-27 21:13:51 -05:00
rtmutex.h
rtmutex_common.h rtmutex: Simplify PI algorithm and make highest prio task get lock 2011-01-27 21:13:51 -05:00
rwsem.c
sched.c posix-cpu-timers: Cure SMP wobbles 2011-10-16 14:14:51 -07:00
sched_autogroup.c Fix common misspellings 2011-03-31 11:26:23 -03:00
sched_autogroup.h sched, autogroup: Stop going ahead if autogroup is disabled 2011-02-23 11:33:59 +01:00
sched_clock.c sched: Add some clock info to sched_debug 2010-11-23 10:29:08 +01:00
sched_cpupri.c
sched_cpupri.h
sched_debug.c sched: Get rid of lock_depth 2011-04-24 13:18:38 +02:00
sched_fair.c sched: Break out cpu_power from the sched_group structure 2011-07-20 18:32:40 +02:00
sched_features.h sched: Allow for overlapping sched_domain spans 2011-07-20 18:32:41 +02:00
sched_idletask.c sched: Drop the rq argument to sched_class::select_task_rq() 2011-04-14 08:52:36 +02:00
sched_rt.c sched/rt: Migrate equal priority tasks to available CPUs 2011-10-16 14:14:51 -07:00
sched_stats.h sched: More sched_domain iterations fixes 2011-05-28 17:02:54 +02:00
sched_stoptask.c sched: Drop the rq argument to sched_class::select_task_rq() 2011-04-14 08:52:36 +02:00
seccomp.c
semaphore.c
signal.c signal: align __lock_task_sighand() irq disabling and RCU 2011-07-20 11:04:54 -07:00
smp.c generic-ipi: Fix kexec boot crash by initializing call_single_queue before enabling interrupts 2011-06-17 10:17:12 +02:00
softirq.c softirq,rcu: Inform RCU of irq_exit() activity 2011-07-20 10:50:12 -07:00
spinlock.c
srcu.c rcu: demote SRCU_SYNCHRONIZE_DELAY from kernel-parameter status 2011-01-14 04:56:49 -08:00
stacktrace.c
stop_machine.c x86, mtrr: lock stop machine during MTRR rendezvous sequence 2011-08-29 13:29:08 -07:00
sys.c Add a personality to report 2.6.x version numbers 2011-08-29 13:29:16 -07:00
sys_ni.c ipc: Add missing sys_ni entries for ipc/compat.c functions 2011-05-20 13:53:02 -07:00
sysctl.c perf: Comment /proc/sys/kernel/perf_event_paranoid to be part of user ABI 2011-06-04 12:22:04 +02:00
sysctl_binary.c open-style analog of vfs_path_lookup() 2011-03-14 09:15:28 -04:00
sysctl_check.c sysctl_check: drop dead code 2011-03-23 19:46:51 -07:00
taskstats.c taskstats: don't allow duplicate entries in listener mode 2011-06-27 18:00:13 -07:00
test_kprobes.c kprobes: Fix selftest to clear flags field for reusing probes 2010-10-14 08:55:27 +02:00
time.c Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2011-03-15 18:53:35 -07:00
timeconst.pl
timer.c timers: Consider slack value in mod_timer() 2011-06-03 15:02:32 +02:00
tracepoint.c jump label: Introduce static_branch() interface 2011-04-04 12:48:08 -04:00
tsacct.c taskstats: use real microsecond granularity for CPU times 2010-10-27 18:03:17 -07:00
uid16.c userns: user namespaces: convert several capable() calls 2011-03-23 19:47:08 -07:00
up.c
user-return-notifier.c Fix common misspellings 2011-03-31 11:26:23 -03:00
user.c userns: add a user_namespace as creator/owner of uts_namespace 2011-03-23 19:46:59 -07:00
user_namespace.c user_ns: improve the user_ns on-the-slab packaging 2011-01-13 08:03:18 -08:00
utsname.c ns proc: Add support for the uts namespace 2011-05-10 14:35:35 -07:00
utsname_sysctl.c
wait.c Fix common misspellings 2011-03-31 11:26:23 -03:00
watchdog.c kernel/watchdog.c: Use proper ANSI C prototypes 2011-05-23 21:07:40 -07:00
workqueue.c workqueue: lock cwq access in drain_workqueue 2011-10-03 11:40:31 -07:00
workqueue_sched.h