Commit graph

28,647 commits

Author SHA1 Message Date
Johannes Weiner
3bbcbc8039 UPSTREAM: psi: make disabling/enabling easier for vendor kernels
Mel Gorman reports a hackbench regression with psi that would prohibit
shipping the suse kernel with it default-enabled, but he'd still like
users to be able to opt in at little to no cost to others.

With the current combination of CONFIG_PSI and the psi_disabled bool set
from the commandline, this is a challenge.  Do the following things to
make it easier:

1. Add a config option CONFIG_PSI_DEFAULT_DISABLED that allows distros
   to enable CONFIG_PSI in their kernel but leave the feature disabled
   unless a user requests it at boot-time.

   To avoid double negatives, rename psi_disabled= to psi=.

2. Make psi_disabled a static branch to eliminate any branch costs
   when the feature is disabled.

In terms of numbers before and after this patch, Mel says:

: The following is a comparision using CONFIG_PSI=n as a baseline against
: your patch and a vanilla kernel
:
:                          4.20.0-rc4             4.20.0-rc4             4.20.0-rc4
:                 kconfigdisable-v1r1                vanilla        psidisable-v1r1
: Amean     1       1.3100 (   0.00%)      1.3923 (  -6.28%)      1.3427 (  -2.49%)
: Amean     3       3.8860 (   0.00%)      4.1230 *  -6.10%*      3.8860 (  -0.00%)
: Amean     5       6.8847 (   0.00%)      8.0390 * -16.77%*      6.7727 (   1.63%)
: Amean     7       9.9310 (   0.00%)     10.8367 *  -9.12%*      9.9910 (  -0.60%)
: Amean     12     16.6577 (   0.00%)     18.2363 *  -9.48%*     17.1083 (  -2.71%)
: Amean     18     26.5133 (   0.00%)     27.8833 *  -5.17%*     25.7663 (   2.82%)
: Amean     24     34.3003 (   0.00%)     34.6830 (  -1.12%)     32.0450 (   6.58%)
: Amean     30     40.0063 (   0.00%)     40.5800 (  -1.43%)     41.5087 (  -3.76%)
: Amean     32     40.1407 (   0.00%)     41.2273 (  -2.71%)     39.9417 (   0.50%)
:
: It's showing that the vanilla kernel takes a hit (as the bisection
: indicated it would) and that disabling PSI by default is reasonably
: close in terms of performance for this particular workload on this
: particular machine so;

Link: http://lkml.kernel.org/r/20181127165329.GA29728@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: Mel Gorman <mgorman@techsingularity.net>
Reported-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

(cherry picked from commit e0c274472d)

Bug: 127712811
Test: lmkd in PSI mode
Change-Id: I6cb666fa351e8901df82e4d6931bfec0c5ce230d
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2019-03-21 16:25:27 -07:00
Olof Johansson
b822a6da85 UPSTREAM: kernel/sched/psi.c: simplify cgroup_move_task()
The existing code triggered an invalid warning about 'rq' possibly being
used uninitialized.  Instead of doing the silly warning suppression by
initializa it to NULL, refactor the code to bail out early instead.

Warning was:

  kernel/sched/psi.c: In function `cgroup_move_task':
  kernel/sched/psi.c:639:13: warning: `rq' may be used uninitialized in this function [-Wmaybe-uninitialized]

Link: http://lkml.kernel.org/r/20181103183339.8669-1-olof@lixom.net
Fixes: 2ce7135adc ("psi: cgroup support")
Signed-off-by: Olof Johansson <olof@lixom.net>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

(cherry picked from commit 8fcb2312d1)

Bug: 127712811
Test: lmkd in PSI mode
Change-Id: Id989da224a726082e0cfa5d5d9460bf63d448a93
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2019-03-21 16:25:27 -07:00
Johannes Weiner
dc9cd29ded UPSTREAM: psi: cgroup support
On a system that executes multiple cgrouped jobs and independent
workloads, we don't just care about the health of the overall system, but
also that of individual jobs, so that we can ensure individual job health,
fairness between jobs, or prioritize some jobs over others.

This patch implements pressure stall tracking for cgroups.  In kernels
with CONFIG_PSI=y, cgroup2 groups will have cpu.pressure, memory.pressure,
and io.pressure files that track aggregate pressure stall times for only
the tasks inside the cgroup.

Link: http://lkml.kernel.org/r/20180828172258.3185-10-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Daniel Drake <drake@endlessm.com>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

(cherry picked from commit 2ce7135adc)

Bug: 127712811
Test: lmkd in PSI mode
Change-Id: I163e6657aaa60aa5aab9372616a3bce2a65e90ec
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2019-03-21 16:25:27 -07:00
Johannes Weiner
e550f94252 UPSTREAM: psi: pressure stall information for CPU, memory, and IO
When systems are overcommitted and resources become contended, it's hard
to tell exactly the impact this has on workload productivity, or how close
the system is to lockups and OOM kills.  In particular, when machines work
multiple jobs concurrently, the impact of overcommit in terms of latency
and throughput on the individual job can be enormous.

In order to maximize hardware utilization without sacrificing individual
job health or risk complete machine lockups, this patch implements a way
to quantify resource pressure in the system.

A kernel built with CONFIG_PSI=y creates files in /proc/pressure/ that
expose the percentage of time the system is stalled on CPU, memory, or IO,
respectively.  Stall states are aggregate versions of the per-task delay
accounting delays:

       cpu: some tasks are runnable but not executing on a CPU
       memory: tasks are reclaiming, or waiting for swapin or thrashing cache
       io: tasks are waiting for io completions

These percentages of walltime can be thought of as pressure percentages,
and they give a general sense of system health and productivity loss
incurred by resource overcommit.  They can also indicate when the system
is approaching lockup scenarios and OOMs.

To do this, psi keeps track of the task states associated with each CPU
and samples the time they spend in stall states.  Every 2 seconds, the
samples are averaged across CPUs - weighted by the CPUs' non-idle time to
eliminate artifacts from unused CPUs - and translated into percentages of
walltime.  A running average of those percentages is maintained over 10s,
1m, and 5m periods (similar to the loadaverage).

[hannes@cmpxchg.org: doc fixlet, per Randy]
  Link: http://lkml.kernel.org/r/20180828205625.GA14030@cmpxchg.org
[hannes@cmpxchg.org: code optimization]
  Link: http://lkml.kernel.org/r/20180907175015.GA8479@cmpxchg.org
[hannes@cmpxchg.org: rename psi_clock() to psi_update_work(), per Peter]
  Link: http://lkml.kernel.org/r/20180907145404.GB11088@cmpxchg.org
[hannes@cmpxchg.org: fix build]
  Link: http://lkml.kernel.org/r/20180913014222.GA2370@cmpxchg.org
Link: http://lkml.kernel.org/r/20180828172258.3185-9-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Daniel Drake <drake@endlessm.com>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

(cherry picked from commit eb414681d5)

Bug: 127712811
Test: lmkd in PSI mode
Change-Id: Id00d23c977169b0c4636d92016fc1fee0274be05
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2019-03-21 16:25:27 -07:00
Johannes Weiner
8cd88f5398 UPSTREAM: sched: introduce this_rq_lock_irq()
do_sched_yield() disables IRQs, looks up this_rq() and locks it.  The next
patch is adding another site with the same pattern, so provide a
convenience function for it.

Link: http://lkml.kernel.org/r/20180828172258.3185-8-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Tested-by: Daniel Drake <drake@endlessm.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

(cherry picked from commit 246b3b3342)

Bug: 127712811
Test: lmkd in PSI mode
Change-Id: I24b42cff1624c80633f116b7cb485564f53a30a7
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2019-03-21 16:25:27 -07:00
Johannes Weiner
cdda3cf652 UPSTREAM: sched: sched.h: make rq locking and clock functions available in stats.h
kernel/sched/sched.h includes "stats.h" half-way through the file.  The
next patch introduces users of sched.h's rq locking functions and
update_rq_clock() in kernel/sched/stats.h.  Move those definitions up in
the file so they are available in stats.h.

Link: http://lkml.kernel.org/r/20180828172258.3185-7-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Tested-by: Daniel Drake <drake@endlessm.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

(cherry picked from commit 1f351d7f75)

Bug: 127712811
Test: lmkd in PSI mode
Change-Id: Id342e0ba9a62b49e64f2ce8b87f883ea70230b2f
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2019-03-21 16:25:27 -07:00
Johannes Weiner
4c9c09affa UPSTREAM: sched: loadavg: make calc_load_n() public
It's going to be used in a later patch. Keep the churn separate.

Link: http://lkml.kernel.org/r/20180828172258.3185-6-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Tested-by: Daniel Drake <drake@endlessm.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

(cherry picked from commit 5c54f5b9ed)

Bug: 127712811
Test: lmkd in PSI mode
Change-Id: I50e0cb0dbf20ced329a484493f82ff69ca1ae97a
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2019-03-21 16:25:27 -07:00
Johannes Weiner
2ba18b41d3 BACKPORT: sched: loadavg: consolidate LOAD_INT, LOAD_FRAC, CALC_LOAD
There are several definitions of those functions/macros in places that
mess with fixed-point load averages.  Provide an official version.

[akpm@linux-foundation.org: fix missed conversion in block/blk-iolatency.c]
Link: http://lkml.kernel.org/r/20180828172258.3185-5-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Tested-by: Daniel Drake <drake@endlessm.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

(cherry picked from commit 8508cf3ffa)

Conflicts:
        block/blk-iolatency.c

(1. manual merge to replace stat->rqs.mean with stat.mean)

Bug: 127712811
Test: lmkd in PSI mode
Change-Id: I716b4874491cff75a2355c6d95c64cf02d05e7ee
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2019-03-21 16:25:26 -07:00
Johannes Weiner
580f26b93e UPSTREAM: delayacct: track delays from thrashing cache pages
Delay accounting already measures the time a task spends in direct reclaim
and waiting for swapin, but in low memory situations tasks spend can spend
a significant amount of their time waiting on thrashing page cache.  This
isn't tracked right now.

To know the full impact of memory contention on an individual task,
measure the delay when waiting for a recently evicted active cache page to
read back into memory.

Also update tools/accounting/getdelays.c:

     [hannes@computer accounting]$ sudo ./getdelays -d -p 1
     print delayacct stats ON
     PID     1

     CPU             count     real total  virtual total    delay total  delay average
                     50318      745000000      847346785      400533713          0.008ms
     IO              count    delay total  delay average
                       435      122601218              0ms
     SWAP            count    delay total  delay average
                         0              0              0ms
     RECLAIM         count    delay total  delay average
                         0              0              0ms
     THRASHING       count    delay total  delay average
                        19       12621439              0ms

Link: http://lkml.kernel.org/r/20180828172258.3185-4-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Daniel Drake <drake@endlessm.com>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

(cherry picked from commit b1d29ba82c)

Bug: 127712811
Test: lmkd in PSI mode
Change-Id: I259f693987cf04e6a52ee7e8accf55a17e0de005
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2019-03-21 16:25:26 -07:00
Greg Kroah-Hartman
2e568c979c This is the 4.19.29 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlyJb/EACgkQONu9yGCS
 aT4y0g//b9t9/onhTaXcY/ByPmBAwqNgugi7eYcZqGDBp7aCDOBLF6eOwbhdvvuS
 ZTaZ5eWG3Twz3mZu9vveuskgMci2npDyLPgqBWGzW+Ef5r/xPd40diaI75ZUc68T
 gimWbQ0VANuXKklK6LysBUaVQWE3ilIy6qnnpj0DI3ipNDoE62Ry1LNthuKy+73J
 w6r7uwkb6X/CkXpNB/L4cDdpSy/CvhGQhd6p91lBuE4DfyPqEzslYCokD9aPXp9b
 Fedt/Re+8eULBNcgqPYxkS5pBrbHtqrGf00AMlzC8DkC+GZyDqSP2xjv6AiTfGJd
 uf0/Jvsv2OBnP4aYsbk+uB2z3plzPBgmXxa/1bm+yrGCMvbpi9mMx75HM2joAeVp
 tVN4ZN65kNgJkXCchJTHdQ3s6teOD8Par1czy570HyKBU6l1j3AGArGm+b4WGPWx
 dL+82coojMKxKNdTHfxUXES6QGKp716r3un6mCrKR0xET/SDayzDQMaSM8UOtArK
 ELzNeKzKTc5oBx6i+JfGmY8ZsedpNGCIPpsiuoSYAaon5ZzNbruzOAlDOThs157d
 YezDHZ9XMrx3kN/xYnqZD63x/5egq9REbZGWljeykbNkWcEY74jIkKwNLxqv3P64
 JsLp60owvjzwtzKycjZogNU//GGNTBdb+6pESq4MxJpPTteFWnc=
 =n9iV
 -----END PGP SIGNATURE-----

Merge 4.19.29 into android-4.19

Changes in 4.19.29
	media: uvcvideo: Fix 'type' check leading to overflow
	vti4: Fix a ipip packet processing bug in 'IPCOMP' virtual tunnel
	perf script: Fix crash with printing mixed trace point and other events
	perf core: Fix perf_proc_update_handler() bug
	perf tools: Handle TOPOLOGY headers with no CPU
	perf script: Fix crash when processing recorded stat data
	IB/{hfi1, qib}: Fix WC.byte_len calculation for UD_SEND_WITH_IMM
	iommu/amd: Call free_iova_fast with pfn in map_sg
	iommu/amd: Unmap all mapped pages in error path of map_sg
	riscv: fixup max_low_pfn with PFN_DOWN.
	ipvs: Fix signed integer overflow when setsockopt timeout
	iommu/amd: Fix IOMMU page flush when detach device from a domain
	clk: ti: Fix error handling in ti_clk_parse_divider_data()
	clk: qcom: gcc: Use active only source for CPUSS clocks
	xtensa: SMP: fix ccount_timer_shutdown
	riscv: Adjust mmap base address at a third of task size
	IB/ipoib: Fix for use-after-free in ipoib_cm_tx_start
	selftests: cpu-hotplug: fix case where CPUs offline > CPUs present
	xtensa: SMP: fix secondary CPU initialization
	xtensa: smp_lx200_defconfig: fix vectors clash
	xtensa: SMP: mark each possible CPU as present
	iomap: get/put the page in iomap_page_create/release()
	iomap: fix a use after free in iomap_dio_rw
	xtensa: SMP: limit number of possible CPUs by NR_CPUS
	net: altera_tse: fix msgdma_tx_completion on non-zero fill_level case
	net: hns: Fix for missing of_node_put() after of_parse_phandle()
	net: hns: Restart autoneg need return failed when autoneg off
	net: hns: Fix wrong read accesses via Clause 45 MDIO protocol
	net: stmmac: dwmac-rk: fix error handling in rk_gmac_powerup()
	netfilter: ebtables: compat: un-break 32bit setsockopt when no rules are present
	gpio: vf610: Mask all GPIO interrupts
	selftests: net: use LDLIBS instead of LDFLAGS
	selftests: timers: use LDLIBS instead of LDFLAGS
	nfs: Fix NULL pointer dereference of dev_name
	qed: Fix bug in tx promiscuous mode settings
	qed: Fix LACP pdu drops for VFs
	qed: Fix VF probe failure while FLR
	qed: Fix system crash in ll2 xmit
	qed: Fix stack out of bounds bug
	scsi: libfc: free skb when receiving invalid flogi resp
	scsi: scsi_debug: fix write_same with virtual_gb problem
	scsi: bnx2fc: Fix error handling in probe()
	scsi: 53c700: pass correct "dev" to dma_alloc_attrs()
	platform/x86: Fix unmet dependency warning for ACPI_CMPC
	platform/x86: Fix unmet dependency warning for SAMSUNG_Q10
	net: macb: Apply RXUBR workaround only to versions with errata
	x86/boot/compressed/64: Set EFER.LME=1 in 32-bit trampoline before returning to long mode
	cifs: fix computation for MAX_SMB2_HDR_SIZE
	x86/microcode/amd: Don't falsely trick the late loading mechanism
	arm64: kprobe: Always blacklist the KVM world-switch code
	apparmor: Fix aa_label_build() error handling for failed merges
	x86/kexec: Don't setup EFI info if EFI runtime is not enabled
	proc: fix /proc/net/* after setns(2)
	x86_64: increase stack size for KASAN_EXTRA
	mm, memory_hotplug: is_mem_section_removable do not pass the end of a zone
	mm, memory_hotplug: test_pages_in_a_zone do not pass the end of zone
	lib/test_kmod.c: potential double free in error handling
	fs/drop_caches.c: avoid softlockups in drop_pagecache_sb()
	autofs: drop dentry reference only when it is never used
	autofs: fix error return in autofs_fill_super()
	mm, memory_hotplug: fix off-by-one in is_pageblock_removable
	ARM: OMAP: dts: N950/N9: fix onenand timings
	ARM: dts: omap4-droid4: Fix typo in cpcap IRQ flags
	ARM: dts: sun8i: h3: Add ethernet0 alias to Beelink X2
	arm: dts: meson: Fix IRQ trigger type for macirq
	ARM: dts: meson8b: odroidc1: mark the SD card detection GPIO active-low
	ARM: dts: meson8m2: mxiii-plus: mark the SD card detection GPIO active-low
	ARM: dts: imx6sx: correct backward compatible of gpt
	arm64: dts: renesas: r8a7796: Enable DMA for SCIF2
	arm64: dts: renesas: r8a77965: Enable DMA for SCIF2
	soc: fsl: qbman: avoid race in clearing QMan interrupt
	pinctrl: mcp23s08: spi: Fix regmap allocation for mcp23s18
	wlcore: sdio: Fixup power on/off sequence
	bpftool: Fix prog dump by tag
	bpftool: fix percpu maps updating
	bpf: sock recvbuff must be limited by rmem_max in bpf_setsockopt()
	ARM: pxa: ssp: unneeded to free devm_ allocated data
	arm64: dts: add msm8996 compatible to gicv3
	batman-adv: release station info tidstats
	DTS: CI20: Fix bugs in ci20's device tree.
	usb: phy: fix link errors
	irqchip/gic-v4: Fix occasional VLPI drop
	irqchip/gic-v3-its: Gracefully fail on LPI exhaustion
	irqchip/mmp: Only touch the PJ4 IRQ & FIQ bits on enable/disable
	drm/amdgpu: Add missing power attribute to APU check
	drm/radeon: check if device is root before getting pci speed caps
	drm/amdgpu: Transfer fences to dmabuf importer
	net: stmmac: Fallback to Platform Data clock in Watchdog conversion
	net: stmmac: Send TSO packets always from Queue 0
	net: stmmac: Disable EEE mode earlier in XMIT callback
	irqchip/gic-v3-its: Fix ITT_entry_size accessor
	relay: check return of create_buf_file() properly
	bpf, selftests: fix handling of sparse CPU allocations
	bpf: fix lockdep false positive in percpu_freelist
	bpf: fix potential deadlock in bpf_prog_register
	bpf: Fix syscall's stackmap lookup potential deadlock
	drm/sun4i: tcon: Prepare and enable TCON channel 0 clock at init
	dmaengine: at_xdmac: Fix wrongfull report of a channel as in use
	vsock/virtio: fix kernel panic after device hot-unplug
	vsock/virtio: reset connected sockets on device removal
	dmaengine: dmatest: Abort test in case of mapping error
	selftests: netfilter: fix config fragment CONFIG_NF_TABLES_INET
	selftests: netfilter: add simple masq/redirect test cases
	netfilter: nf_nat: skip nat clash resolution for same-origin entries
	s390/qeth: release cmd buffer in error paths
	s390/qeth: fix use-after-free in error path
	s390/qeth: cancel close_dev work before removing a card
	perf symbols: Filter out hidden symbols from labels
	perf trace: Support multiple "vfs_getname" probes
	MIPS: Remove function size check in get_frame_info()
	Revert "scsi: libfc: Add WARN_ON() when deleting rports"
	i2c: omap: Use noirq system sleep pm ops to idle device for suspend
	drm/amdgpu: use spin_lock_irqsave to protect vm_manager.pasid_idr
	nvme: lock NS list changes while handling command effects
	nvme-pci: fix rapid add remove sequence
	fs: ratelimit __find_get_block_slow() failure message.
	qed: Fix EQ full firmware assert.
	qed: Consider TX tcs while deriving the max num_queues for PF.
	qede: Fix system crash on configuring channels.
	blk-iolatency: fix IO hang due to negative inflight counter
	nvme-pci: add missing unlock for reset error
	Input: wacom_serial4 - add support for Wacom ArtPad II tablet
	Input: elan_i2c - add id for touchpad found in Lenovo s21e-20
	iscsi_ibft: Fix missing break in switch statement
	scsi: aacraid: Fix missing break in switch statement
	x86/PCI: Fixup RTIT_BAR of Intel Denverton Trace Hub
	arm64: dts: zcu100-revC: Give wifi some time after power-on
	arm64: dts: hikey: Give wifi some time after power-on
	arm64: dts: hikey: Revert "Enable HS200 mode on eMMC"
	ARM: dts: exynos: Fix pinctrl definition for eMMC RTSN line on Odroid X2/U3
	ARM: dts: exynos: Add minimal clkout parameters to Exynos3250 PMU
	ARM: dts: exynos: Fix max voltage for buck8 regulator on Odroid XU3/XU4
	drm: disable uncached DMA optimization for ARM and arm64
	netfilter: xt_TEE: fix wrong interface selection
	netfilter: xt_TEE: add missing code to get interface index in checkentry.
	gfs2: Fix missed wakeups in find_insert_glock
	staging: erofs: add error handling for xattr submodule
	staging: erofs: fix fast symlink w/o xattr when fs xattr is on
	staging: erofs: fix memleak of inode's shared xattr array
	staging: erofs: fix race of initializing xattrs of a inode at the same time
	staging: erofs: keep corrupted fs from crashing kernel in erofs_namei()
	cifs: allow calling SMB2_xxx_free(NULL)
	ath9k: Avoid OF no-EEPROM quirks without qca,no-eeprom
	driver core: Postpone DMA tear-down until after devres release
	perf/x86/intel: Make cpuc allocations consistent
	perf/x86/intel: Generalize dynamic constraint creation
	x86: Add TSX Force Abort CPUID/MSR
	perf/x86/intel: Implement support for TSX Force Abort
	Linux 4.19.29

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2019-03-13 14:17:29 -07:00
Martin KaFai Lau
ae26a7109c bpf: Fix syscall's stackmap lookup potential deadlock
[ Upstream commit 7c4cd051ad ]

The map_lookup_elem used to not acquiring spinlock
in order to optimize the reader.

It was true until commit 557c0c6e7d ("bpf: convert stackmap to pre-allocation")
The syscall's map_lookup_elem(stackmap) calls bpf_stackmap_copy().
bpf_stackmap_copy() may find the elem no longer needed after the copy is done.
If that is the case, pcpu_freelist_push() saves this elem for reuse later.
This push requires a spinlock.

If a tracing bpf_prog got run in the middle of the syscall's
map_lookup_elem(stackmap) and this tracing bpf_prog is calling
bpf_get_stackid(stackmap) which also requires the same pcpu_freelist's
spinlock, it may end up with a dead lock situation as reported by
Eric Dumazet in https://patchwork.ozlabs.org/patch/1030266/

The situation is the same as the syscall's map_update_elem() which
needs to acquire the pcpu_freelist's spinlock and could race
with tracing bpf_prog.  Hence, this patch fixes it by protecting
bpf_stackmap_copy() with this_cpu_inc(bpf_prog_active)
to prevent tracing bpf_prog from running.

A later syscall's map_lookup_elem commit f1a2e44a3a ("bpf: add queue and stack maps")
also acquires a spinlock and races with tracing bpf_prog similarly.
Hence, this patch is forward looking and protects the majority
of the map lookups.  bpf_map_offload_lookup_elem() is the exception
since it is for network bpf_prog only (i.e. never called by tracing
bpf_prog).

Fixes: 557c0c6e7d ("bpf: convert stackmap to pre-allocation")
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-03-13 14:02:36 -07:00
Alexei Starovoitov
3bbe6a4212 bpf: fix potential deadlock in bpf_prog_register
[ Upstream commit e16ec34039 ]

Lockdep found a potential deadlock between cpu_hotplug_lock, bpf_event_mutex, and cpuctx_mutex:
[   13.007000] WARNING: possible circular locking dependency detected
[   13.007587] 5.0.0-rc3-00018-g2fa53f892422-dirty #477 Not tainted
[   13.008124] ------------------------------------------------------
[   13.008624] test_progs/246 is trying to acquire lock:
[   13.009030] 0000000094160d1d (tracepoints_mutex){+.+.}, at: tracepoint_probe_register_prio+0x2d/0x300
[   13.009770]
[   13.009770] but task is already holding lock:
[   13.010239] 00000000d663ef86 (bpf_event_mutex){+.+.}, at: bpf_probe_register+0x1d/0x60
[   13.010877]
[   13.010877] which lock already depends on the new lock.
[   13.010877]
[   13.011532]
[   13.011532] the existing dependency chain (in reverse order) is:
[   13.012129]
[   13.012129] -> #4 (bpf_event_mutex){+.+.}:
[   13.012582]        perf_event_query_prog_array+0x9b/0x130
[   13.013016]        _perf_ioctl+0x3aa/0x830
[   13.013354]        perf_ioctl+0x2e/0x50
[   13.013668]        do_vfs_ioctl+0x8f/0x6a0
[   13.014003]        ksys_ioctl+0x70/0x80
[   13.014320]        __x64_sys_ioctl+0x16/0x20
[   13.014668]        do_syscall_64+0x4a/0x180
[   13.015007]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
[   13.015469]
[   13.015469] -> #3 (&cpuctx_mutex){+.+.}:
[   13.015910]        perf_event_init_cpu+0x5a/0x90
[   13.016291]        perf_event_init+0x1b2/0x1de
[   13.016654]        start_kernel+0x2b8/0x42a
[   13.016995]        secondary_startup_64+0xa4/0xb0
[   13.017382]
[   13.017382] -> #2 (pmus_lock){+.+.}:
[   13.017794]        perf_event_init_cpu+0x21/0x90
[   13.018172]        cpuhp_invoke_callback+0xb3/0x960
[   13.018573]        _cpu_up+0xa7/0x140
[   13.018871]        do_cpu_up+0xa4/0xc0
[   13.019178]        smp_init+0xcd/0xd2
[   13.019483]        kernel_init_freeable+0x123/0x24f
[   13.019878]        kernel_init+0xa/0x110
[   13.020201]        ret_from_fork+0x24/0x30
[   13.020541]
[   13.020541] -> #1 (cpu_hotplug_lock.rw_sem){++++}:
[   13.021051]        static_key_slow_inc+0xe/0x20
[   13.021424]        tracepoint_probe_register_prio+0x28c/0x300
[   13.021891]        perf_trace_event_init+0x11f/0x250
[   13.022297]        perf_trace_init+0x6b/0xa0
[   13.022644]        perf_tp_event_init+0x25/0x40
[   13.023011]        perf_try_init_event+0x6b/0x90
[   13.023386]        perf_event_alloc+0x9a8/0xc40
[   13.023754]        __do_sys_perf_event_open+0x1dd/0xd30
[   13.024173]        do_syscall_64+0x4a/0x180
[   13.024519]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
[   13.024968]
[   13.024968] -> #0 (tracepoints_mutex){+.+.}:
[   13.025434]        __mutex_lock+0x86/0x970
[   13.025764]        tracepoint_probe_register_prio+0x2d/0x300
[   13.026215]        bpf_probe_register+0x40/0x60
[   13.026584]        bpf_raw_tracepoint_open.isra.34+0xa4/0x130
[   13.027042]        __do_sys_bpf+0x94f/0x1a90
[   13.027389]        do_syscall_64+0x4a/0x180
[   13.027727]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
[   13.028171]
[   13.028171] other info that might help us debug this:
[   13.028171]
[   13.028807] Chain exists of:
[   13.028807]   tracepoints_mutex --> &cpuctx_mutex --> bpf_event_mutex
[   13.028807]
[   13.029666]  Possible unsafe locking scenario:
[   13.029666]
[   13.030140]        CPU0                    CPU1
[   13.030510]        ----                    ----
[   13.030875]   lock(bpf_event_mutex);
[   13.031166]                                lock(&cpuctx_mutex);
[   13.031645]                                lock(bpf_event_mutex);
[   13.032135]   lock(tracepoints_mutex);
[   13.032441]
[   13.032441]  *** DEADLOCK ***
[   13.032441]
[   13.032911] 1 lock held by test_progs/246:
[   13.033239]  #0: 00000000d663ef86 (bpf_event_mutex){+.+.}, at: bpf_probe_register+0x1d/0x60
[   13.033909]
[   13.033909] stack backtrace:
[   13.034258] CPU: 1 PID: 246 Comm: test_progs Not tainted 5.0.0-rc3-00018-g2fa53f892422-dirty #477
[   13.034964] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014
[   13.035657] Call Trace:
[   13.035859]  dump_stack+0x5f/0x8b
[   13.036130]  print_circular_bug.isra.37+0x1ce/0x1db
[   13.036526]  __lock_acquire+0x1158/0x1350
[   13.036852]  ? lock_acquire+0x98/0x190
[   13.037154]  lock_acquire+0x98/0x190
[   13.037447]  ? tracepoint_probe_register_prio+0x2d/0x300
[   13.037876]  __mutex_lock+0x86/0x970
[   13.038167]  ? tracepoint_probe_register_prio+0x2d/0x300
[   13.038600]  ? tracepoint_probe_register_prio+0x2d/0x300
[   13.039028]  ? __mutex_lock+0x86/0x970
[   13.039337]  ? __mutex_lock+0x24a/0x970
[   13.039649]  ? bpf_probe_register+0x1d/0x60
[   13.039992]  ? __bpf_trace_sched_wake_idle_without_ipi+0x10/0x10
[   13.040478]  ? tracepoint_probe_register_prio+0x2d/0x300
[   13.040906]  tracepoint_probe_register_prio+0x2d/0x300
[   13.041325]  bpf_probe_register+0x40/0x60
[   13.041649]  bpf_raw_tracepoint_open.isra.34+0xa4/0x130
[   13.042068]  ? __might_fault+0x3e/0x90
[   13.042374]  __do_sys_bpf+0x94f/0x1a90
[   13.042678]  do_syscall_64+0x4a/0x180
[   13.042975]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[   13.043382] RIP: 0033:0x7f23b10a07f9
[   13.045155] RSP: 002b:00007ffdef42fdd8 EFLAGS: 00000202 ORIG_RAX: 0000000000000141
[   13.045759] RAX: ffffffffffffffda RBX: 00007ffdef42ff70 RCX: 00007f23b10a07f9
[   13.046326] RDX: 0000000000000070 RSI: 00007ffdef42fe10 RDI: 0000000000000011
[   13.046893] RBP: 00007ffdef42fdf0 R08: 0000000000000038 R09: 00007ffdef42fe10
[   13.047462] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000
[   13.048029] R13: 0000000000000016 R14: 00007f23b1db4690 R15: 0000000000000000

Since tracepoints_mutex will be taken in tracepoint_probe_register/unregister()
there is no need to take bpf_event_mutex too.
bpf_event_mutex is protecting modifications to prog array used in kprobe/perf bpf progs.
bpf_raw_tracepoints don't need to take this mutex.

Fixes: c4f6699dfc ("bpf: introduce BPF_RAW_TRACEPOINT")
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-03-13 14:02:36 -07:00
Alexei Starovoitov
e3bc64c9aa bpf: fix lockdep false positive in percpu_freelist
[ Upstream commit a89fac57b5 ]

Lockdep warns about false positive:
[   12.492084] 00000000e6b28347 (&head->lock){+...}, at: pcpu_freelist_push+0x2a/0x40
[   12.492696] but this lock was taken by another, HARDIRQ-safe lock in the past:
[   12.493275]  (&rq->lock){-.-.}
[   12.493276]
[   12.493276]
[   12.493276] and interrupts could create inverse lock ordering between them.
[   12.493276]
[   12.494435]
[   12.494435] other info that might help us debug this:
[   12.494979]  Possible interrupt unsafe locking scenario:
[   12.494979]
[   12.495518]        CPU0                    CPU1
[   12.495879]        ----                    ----
[   12.496243]   lock(&head->lock);
[   12.496502]                                local_irq_disable();
[   12.496969]                                lock(&rq->lock);
[   12.497431]                                lock(&head->lock);
[   12.497890]   <Interrupt>
[   12.498104]     lock(&rq->lock);
[   12.498368]
[   12.498368]  *** DEADLOCK ***
[   12.498368]
[   12.498837] 1 lock held by dd/276:
[   12.499110]  #0: 00000000c58cb2ee (rcu_read_lock){....}, at: trace_call_bpf+0x5e/0x240
[   12.499747]
[   12.499747] the shortest dependencies between 2nd lock and 1st lock:
[   12.500389]  -> (&rq->lock){-.-.} {
[   12.500669]     IN-HARDIRQ-W at:
[   12.500934]                       _raw_spin_lock+0x2f/0x40
[   12.501373]                       scheduler_tick+0x4c/0xf0
[   12.501812]                       update_process_times+0x40/0x50
[   12.502294]                       tick_periodic+0x27/0xb0
[   12.502723]                       tick_handle_periodic+0x1f/0x60
[   12.503203]                       timer_interrupt+0x11/0x20
[   12.503651]                       __handle_irq_event_percpu+0x43/0x2c0
[   12.504167]                       handle_irq_event_percpu+0x20/0x50
[   12.504674]                       handle_irq_event+0x37/0x60
[   12.505139]                       handle_level_irq+0xa7/0x120
[   12.505601]                       handle_irq+0xa1/0x150
[   12.506018]                       do_IRQ+0x77/0x140
[   12.506411]                       ret_from_intr+0x0/0x1d
[   12.506834]                       _raw_spin_unlock_irqrestore+0x53/0x60
[   12.507362]                       __setup_irq+0x481/0x730
[   12.507789]                       setup_irq+0x49/0x80
[   12.508195]                       hpet_time_init+0x21/0x32
[   12.508644]                       x86_late_time_init+0xb/0x16
[   12.509106]                       start_kernel+0x390/0x42a
[   12.509554]                       secondary_startup_64+0xa4/0xb0
[   12.510034]     IN-SOFTIRQ-W at:
[   12.510305]                       _raw_spin_lock+0x2f/0x40
[   12.510772]                       try_to_wake_up+0x1c7/0x4e0
[   12.511220]                       swake_up_locked+0x20/0x40
[   12.511657]                       swake_up_one+0x1a/0x30
[   12.512070]                       rcu_process_callbacks+0xc5/0x650
[   12.512553]                       __do_softirq+0xe6/0x47b
[   12.512978]                       irq_exit+0xc3/0xd0
[   12.513372]                       smp_apic_timer_interrupt+0xa9/0x250
[   12.513876]                       apic_timer_interrupt+0xf/0x20
[   12.514343]                       default_idle+0x1c/0x170
[   12.514765]                       do_idle+0x199/0x240
[   12.515159]                       cpu_startup_entry+0x19/0x20
[   12.515614]                       start_kernel+0x422/0x42a
[   12.516045]                       secondary_startup_64+0xa4/0xb0
[   12.516521]     INITIAL USE at:
[   12.516774]                      _raw_spin_lock_irqsave+0x38/0x50
[   12.517258]                      rq_attach_root+0x16/0xd0
[   12.517685]                      sched_init+0x2f2/0x3eb
[   12.518096]                      start_kernel+0x1fb/0x42a
[   12.518525]                      secondary_startup_64+0xa4/0xb0
[   12.518986]   }
[   12.519132]   ... key      at: [<ffffffff82b7bc28>] __key.71384+0x0/0x8
[   12.519649]   ... acquired at:
[   12.519892]    pcpu_freelist_pop+0x7b/0xd0
[   12.520221]    bpf_get_stackid+0x1d2/0x4d0
[   12.520563]    ___bpf_prog_run+0x8b4/0x11a0
[   12.520887]
[   12.521008] -> (&head->lock){+...} {
[   12.521292]    HARDIRQ-ON-W at:
[   12.521539]                     _raw_spin_lock+0x2f/0x40
[   12.521950]                     pcpu_freelist_push+0x2a/0x40
[   12.522396]                     bpf_get_stackid+0x494/0x4d0
[   12.522828]                     ___bpf_prog_run+0x8b4/0x11a0
[   12.523296]    INITIAL USE at:
[   12.523537]                    _raw_spin_lock+0x2f/0x40
[   12.523944]                    pcpu_freelist_populate+0xc0/0x120
[   12.524417]                    htab_map_alloc+0x405/0x500
[   12.524835]                    __do_sys_bpf+0x1a3/0x1a90
[   12.525253]                    do_syscall_64+0x4a/0x180
[   12.525659]                    entry_SYSCALL_64_after_hwframe+0x49/0xbe
[   12.526167]  }
[   12.526311]  ... key      at: [<ffffffff838f7668>] __key.13130+0x0/0x8
[   12.526812]  ... acquired at:
[   12.527047]    __lock_acquire+0x521/0x1350
[   12.527371]    lock_acquire+0x98/0x190
[   12.527680]    _raw_spin_lock+0x2f/0x40
[   12.527994]    pcpu_freelist_push+0x2a/0x40
[   12.528325]    bpf_get_stackid+0x494/0x4d0
[   12.528645]    ___bpf_prog_run+0x8b4/0x11a0
[   12.528970]
[   12.529092]
[   12.529092] stack backtrace:
[   12.529444] CPU: 0 PID: 276 Comm: dd Not tainted 5.0.0-rc3-00018-g2fa53f892422 #475
[   12.530043] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014
[   12.530750] Call Trace:
[   12.530948]  dump_stack+0x5f/0x8b
[   12.531248]  check_usage_backwards+0x10c/0x120
[   12.531598]  ? ___bpf_prog_run+0x8b4/0x11a0
[   12.531935]  ? mark_lock+0x382/0x560
[   12.532229]  mark_lock+0x382/0x560
[   12.532496]  ? print_shortest_lock_dependencies+0x180/0x180
[   12.532928]  __lock_acquire+0x521/0x1350
[   12.533271]  ? find_get_entry+0x17f/0x2e0
[   12.533586]  ? find_get_entry+0x19c/0x2e0
[   12.533902]  ? lock_acquire+0x98/0x190
[   12.534196]  lock_acquire+0x98/0x190
[   12.534482]  ? pcpu_freelist_push+0x2a/0x40
[   12.534810]  _raw_spin_lock+0x2f/0x40
[   12.535099]  ? pcpu_freelist_push+0x2a/0x40
[   12.535432]  pcpu_freelist_push+0x2a/0x40
[   12.535750]  bpf_get_stackid+0x494/0x4d0
[   12.536062]  ___bpf_prog_run+0x8b4/0x11a0

It has been explained that is a false positive here:
https://lkml.org/lkml/2018/7/25/756
Recap:
- stackmap uses pcpu_freelist
- The lock in pcpu_freelist is a percpu lock
- stackmap is only used by tracing bpf_prog
- A tracing bpf_prog cannot be run if another bpf_prog
  has already been running (ensured by the percpu bpf_prog_active counter).

Eric pointed out that this lockdep splats stops other
legit lockdep splats in selftests/bpf/test_progs.c.

Fix this by calling local_irq_save/restore for stackmap.

Another false positive had also been worked around by calling
local_irq_save in commit 89ad2fa3f0 ("bpf: fix lockdep splat").
That commit added unnecessary irq_save/restore to fast path of
bpf hash map. irqs are already disabled at that point, since htab
is holding per bucket spin_lock with irqsave.

Let's reduce overhead for htab by introducing __pcpu_freelist_push/pop
function w/o irqsave and convert pcpu_freelist_push/pop to irqsave
to be used elsewhere (right now only in stackmap).
It stops lockdep false positive in stackmap with a bit of acceptable overhead.

Fixes: 557c0c6e7d ("bpf: convert stackmap to pre-allocation")
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-03-13 14:02:36 -07:00
Greg Kroah-Hartman
232bd90cf2 relay: check return of create_buf_file() properly
[ Upstream commit 2c1cf00eea ]

If create_buf_file() returns an error, don't try to reference it later
as a valid dentry pointer.

This problem was exposed when debugfs started to return errors instead
of just NULL for some calls when they do not succeed properly.

Also, the check for WARN_ON(dentry) was just wrong :)

Reported-by: Kees Cook <keescook@chromium.org>
Reported-and-tested-by: syzbot+16c3a70e1e9b29346c43@syzkaller.appspotmail.com
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Rientjes <rientjes@google.com>
Fixes: ff9fb72bc0 ("debugfs: return error values, not NULL")
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-03-13 14:02:35 -07:00
Stephane Eranian
6ec0698f1c perf core: Fix perf_proc_update_handler() bug
[ Upstream commit 1a51c5da5a ]

The perf_proc_update_handler() handles /proc/sys/kernel/perf_event_max_sample_rate
syctl variable.  When the PMU IRQ handler timing monitoring is disabled, i.e,
when /proc/sys/kernel/perf_cpu_time_max_percent is equal to 0 or 100,
then no modification to sysctl_perf_event_sample_rate is allowed to prevent
possible hang from wrong values.

The problem is that the test to prevent modification is made after the
sysctl variable is modified in perf_proc_update_handler().

You get an error:

  $ echo 10001 >/proc/sys/kernel/perf_event_max_sample_rate
  echo: write error: invalid argument

But the value is still modified causing all sorts of inconsistencies:

  $ cat /proc/sys/kernel/perf_event_max_sample_rate
  10001

This patch fixes the problem by moving the parsing of the value after
the test.

Committer testing:

  # echo 100 > /proc/sys/kernel/perf_cpu_time_max_percent
  # echo 10001 > /proc/sys/kernel/perf_event_max_sample_rate
  -bash: echo: write error: Invalid argument
  # cat /proc/sys/kernel/perf_event_max_sample_rate
  10001
  #

Signed-off-by: Stephane Eranian <eranian@google.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Jiri Olsa <jolsa@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1547169436-6266-1-git-send-email-eranian@google.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-03-13 14:02:26 -07:00
Greg Kroah-Hartman
34e9e65731 This is the 4.19.28 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlyEq/IACgkQONu9yGCS
 aT77AhAAgqbxHsKsgqh7JV477xEgLqrLKYw+/6Bx+l79fUZoaR3PRwM62UEzikPQ
 KbvutNuOHOWuJA0Xyj5gqV4SJaqBOkoGNnHFOBi/qjtFUsOlBFrlGpow+0fsP/ay
 Bmo0LhoTvBub4ap7bJt4pwel/elWYVtOkA1Qgv3OCiDkTorYuPTbIUyuAVOJJbRn
 sZ1eKi00CQPrN65Rxgci0g0p/m7JWpvW2zqmDNZJuZZeEmSLdrrZGwt5ExiI6oKz
 CqX/VBGChEesMTEOLsSfRyg6NZW3j4rOUaCzkxDq/Tsh9XqNabhk1jod2p3t7Nmu
 n5Js3ujfuyCKf5tD49Z8xy5A++nYyJLa5jbFnURr2H/ZJPla0CHuc4RoFf/wFYr4
 xQAeA3XXiZZB02n6oZlweUSp7hVnCgLJ4Ev2ctAyPUpyf4ncl+vffzj40bozFvAC
 adJE1UyEJp0xUuRVdx0I+HyueHcWmRIAgs0iz9B5S2KNc4rYDqE+/t+ddrNqjvSF
 +C33nMn+7A+ngmQlwWjOBaZhhQn3qWrWU0ACERbIG/DUD2voBYu3oIfQzKsXTt0V
 erSSDq0KSy73PRKN4Tzf2GnDUAlNITUuyFgWITOg8p29HuXJO00p4MQ7fOMYacyB
 WdwXFB7islUtUoA8nEedZgF7IF1WRh6Iz2HJ5uMTl6pCMSV+8Dk=
 =OvW8
 -----END PGP SIGNATURE-----

Merge 4.19.28 into android-4.19

Changes in 4.19.28
	cpufreq: Use struct kobj_attribute instead of struct global_attr
	staging: erofs: fix mis-acted TAIL merging behavior
	USB: serial: option: add Telit ME910 ECM composition
	USB: serial: cp210x: add ID for Ingenico 3070
	USB: serial: ftdi_sio: add ID for Hjelmslund Electronics USB485
	staging: erofs: fix illegal address access under memory pressure
	staging: erofs: compressed_pages should not be accessed again after freed
	staging: comedi: ni_660x: fix missing break in switch statement
	staging: wilc1000: fix to set correct value for 'vif_num'
	staging: android: ion: fix sys heap pool's gfp_flags
	staging: android: ashmem: Don't call fallocate() with ashmem_mutex held.
	staging: android: ashmem: Avoid range_alloc() allocation with ashmem_mutex held.
	ip6mr: Do not call __IP6_INC_STATS() from preemptible context
	net: dsa: mv88e6xxx: handle unknown duplex modes gracefully in mv88e6xxx_port_set_duplex
	net: dsa: mv8e6xxx: fix number of internal PHYs for 88E6x90 family
	net: sched: put back q.qlen into a single location
	net-sysfs: Fix mem leak in netdev_register_kobject
	qmi_wwan: Add support for Quectel EG12/EM12
	sctp: call iov_iter_revert() after sending ABORT
	sky2: Disable MSI on Dell Inspiron 1545 and Gateway P-79
	team: Free BPF filter when unregistering netdev
	tipc: fix RDM/DGRAM connect() regression
	bnxt_en: Drop oversize TX packets to prevent errors.
	geneve: correctly handle ipv6.disable module parameter
	hv_netvsc: Fix IP header checksum for coalesced packets
	ipv4: Add ICMPv6 support when parse route ipproto
	lan743x: Fix TX Stall Issue
	net: dsa: mv88e6xxx: Fix statistics on mv88e6161
	net: dsa: mv88e6xxx: Fix u64 statistics
	netlabel: fix out-of-bounds memory accesses
	net: netem: fix skb length BUG_ON in __skb_to_sgvec
	net: nfc: Fix NULL dereference on nfc_llcp_build_tlv fails
	net: phy: Micrel KSZ8061: link failure after cable connect
	net: phy: phylink: fix uninitialized variable in phylink_get_mac_state
	net: sit: fix memory leak in sit_init_net()
	net: socket: set sock->sk to NULL after calling proto_ops::release()
	tipc: fix race condition causing hung sendto
	tun: fix blocking read
	xen-netback: don't populate the hash cache on XenBus disconnect
	xen-netback: fix occasional leak of grant ref mappings under memory pressure
	tun: remove unnecessary memory barrier
	net: Add __icmp_send helper.
	net: avoid use IPCB in cipso_v4_error
	ipv4: Return error for RTA_VIA attribute
	ipv6: Return error for RTA_VIA attribute
	mpls: Return error for RTA_GATEWAY attribute
	ipv4: Pass original device to ip_rcv_finish_core
	net: dsa: mv88e6xxx: power serdes on/off for 10G interfaces on 6390X
	net: dsa: mv88e6xxx: prevent interrupt storm caused by mv88e6390x_port_set_cmode
	net/sched: act_ipt: fix refcount leak when replace fails
	net/sched: act_skbedit: fix refcount leak when replace fails
	net: sched: act_tunnel_key: fix NULL pointer dereference during init
	x86/CPU/AMD: Set the CPB bit unconditionally on F17h
	x86/boot/compressed/64: Do not read legacy ROM on EFI system
	tracing: Fix event filters and triggers to handle negative numbers
	usb: xhci: Fix for Enabling USB ROLE SWITCH QUIRK on INTEL_SUNRISEPOINT_LP_XHCI
	applicom: Fix potential Spectre v1 vulnerabilities
	MIPS: irq: Allocate accurate order pages for irq stack
	aio: Fix locking in aio_poll()
	xtensa: fix get_wchan
	gnss: sirf: fix premature wakeup interrupt enable
	USB: serial: cp210x: fix GPIO in autosuspend
	selftests: firmware: fix verify_reqs() return value
	Bluetooth: btrtl: Restore old logic to assume firmware is already loaded
	Bluetooth: Fix locking in bt_accept_enqueue() for BH context
	exec: Fix mem leak in kernel_read_file
	scsi: core: reset host byte in DID_NEXUS_FAILURE case
	bpf: fix sanitation rewrite in case of non-pointers
	Linux 4.19.28

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-03-10 07:23:21 +01:00
Daniel Borkmann
ca490a9873 bpf: fix sanitation rewrite in case of non-pointers
commit 3612af783c upstream.

Marek reported that he saw an issue with the below snippet in that
timing measurements where off when loaded as unpriv while results
were reasonable when loaded as privileged:

    [...]
    uint64_t a = bpf_ktime_get_ns();
    uint64_t b = bpf_ktime_get_ns();
    uint64_t delta = b - a;
    if ((int64_t)delta > 0) {
    [...]

Turns out there is a bug where a corner case is missing in the fix
d3bd7413e0 ("bpf: fix sanitation of alu op with pointer / scalar
type from different paths"), namely fixup_bpf_calls() only checks
whether aux has a non-zero alu_state, but it also needs to test for
the case of BPF_ALU_NON_POINTER since in both occasions we need to
skip the masking rewrite (as there is nothing to mask).

Fixes: d3bd7413e0 ("bpf: fix sanitation of alu op with pointer / scalar type from different paths")
Reported-by: Marek Majkowski <marek@cloudflare.com>
Reported-by: Arthur Fabre <afabre@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/netdev/CAJPywTJqP34cK20iLM5YmUMz9KXQOdu1-+BZrGMAGgLuBWz7fg@mail.gmail.com/T/
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-03-10 07:17:22 +01:00
Pavel Tikhomirov
690e939da7 tracing: Fix event filters and triggers to handle negative numbers
commit 6a072128d2 upstream.

Then tracing syscall exit event it is extremely useful to filter exit
codes equal to some negative value, to react only to required errors.
But negative numbers does not work:

[root@snorch sys_exit_read]# echo "ret == -1" > filter
bash: echo: write error: Invalid argument
[root@snorch sys_exit_read]# cat filter
ret == -1
        ^
parse_error: Invalid value (did you forget quotes)?

Similar thing happens when setting triggers.

These is a regression in v4.17 introduced by the commit mentioned below,
testing without these commit shows no problem with negative numbers.

Link: http://lkml.kernel.org/r/20180823102534.7642-1-ptikhomirov@virtuozzo.com

Cc: stable@vger.kernel.org
Fixes: 80765597bc ("tracing: Rewrite filter logic to be simpler and faster")
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-03-10 07:17:20 +01:00
Connor O'Brien
fe90216264 ANDROID: proc: Add /proc/uid directory
Add support for reporting per-uid information through procfs, roughly
following the approach used for per-tid and per-tgid directories in
fs/proc/base.c.
This also entails some new tracking of which uids have been used, to
avoid losing information when the last task with a given uid exits.

Bug: 72339335
Bug: 127641090
Test: ls /proc/uid/; compare with UIDs in /proc/uid_time_in_state
Change-Id: I0908f0c04438b11ceb673d860e58441bf503d478
Signed-off-by: Connor O'Brien <connoro@google.com>
[AmitP: Fix proc_fill_cache() now that upstream commit
        0168b9e38c ("procfs: switch instantiate_t to d_splice_alias()"),
        switched instantiate() callback to d_splice_alias()]
Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
[astrachan: Folded 97b7790f505e ("ANDROID: proc: fix undefined behavior
            in proc_uid_base_readdir") into this change]
Signed-off-by: Alistair Strachan <astrachan@google.com>
2019-03-06 15:59:21 +00:00
Connor O'Brien
406d53f0c7 ANDROID: cpufreq: track per-task time in state
Add time in state data to task structs, and create
/proc/<pid>/time_in_state files to show how long each individual task
has run at each frequency.
Create a CONFIG_CPU_FREQ_TIMES option to enable/disable this tracking.

Bug: 72339335
Bug: 127641090
Test: Read /proc/<pid>/time_in_state
Change-Id: Ia6456754f4cb1e83b2bc35efa8fbe9f8696febc8
Signed-off-by: Connor O'Brien <connoro@google.com>
[astrachan: Folded the following changes into this patch:
            a6d3de6a7fba ("ANDROID: Reduce use of #ifdef CONFIG_CPU_FREQ_TIMES")
            b89ada5d9c09 ("ANDROID: Fix massive cpufreq_times memory leaks")]
Signed-off-by: Alistair Strachan <astrachan@google.com>
2019-03-06 15:57:25 +00:00
Greg Kroah-Hartman
36d178b3bc This is the 4.19.27 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlx+qs4ACgkQONu9yGCS
 aT4OQQ/8DMkQa61GLVtL1Zm0i2GW+fpgAJj59QP7jRdilolP7BBnrX2Ne26Dx+X0
 bYLJDJw6WexsvLKOkSMol4Y3Q3NsBguC+MasVrrWBHpfntizlaPzsYuTKkyNRG0c
 vjOYyTeAxElWqEh1WaVQhk/juhBbrTtkZIK63h9M60KYb0qHA7TY9oTGokEA8WA0
 11Yuge2aV6cR8zup1coWR9NRvW/uealiENAhr0Jw1ZRtrBqwzQoFyn5psXdbzkjb
 HrwvDa/nkwfJuc/1+grOTF5HfI7D5+giq0iRtvBuChjDwqfJ2IrVRPF/DKSmR0Oz
 Sq1ILUdj61gd2hp4N9zpUE87r2tV2ABg/yZrtDcKFhCiUWnfaaGArn7mm/R7OhCD
 PjRDz7acVvKq3LLenM6xcFsAWUdjOoUWcGk70eO/ZZctW8o+HozxFcVrNmT/75zJ
 LcagplvYLiLB4Gard6niNH8qunmq/jD+GkgdKtF9GFHOjG1QMdQ9d/VxeJWGAjtA
 nE8ZX9EFYgmq14OR+YK0DNOOFBZe10lcR9rMIFp6T9AanY7rVT7LqqG6BUXb226P
 mJmNS9sSJhictucyyf/ej9j0jULBBsAYJA5+fkECHIioewACjFGn1dZfm3BBrDAb
 OoNIwaLmrvjh2iKwsj17OK9Fv8PG6VKwYRFQ/I3OGVouZUX5SEk=
 =edyH
 -----END PGP SIGNATURE-----

Merge 4.19.27 into android-4.19

Changes in 4.19.27
	irq/matrix: Split out the CPU selection code into a helper
	irq/matrix: Spread managed interrupts on allocation
	genirq/matrix: Improve target CPU selection for managed interrupts.
	mac80211: Change default tx_sk_pacing_shift to 7
	scsi: libsas: Fix rphy phy_identifier for PHYs with end devices attached
	drm/msm: Unblock writer if reader closes file
	ASoC: Intel: Haswell/Broadwell: fix setting for .dynamic field
	ALSA: compress: prevent potential divide by zero bugs
	ASoC: Variable "val" in function rt274_i2c_probe() could be uninitialized
	clk: tegra: dfll: Fix a potential Oop in remove()
	clk: sysfs: fix invalid JSON in clk_dump
	clk: vc5: Abort clock configuration without upstream clock
	thermal: int340x_thermal: Fix a NULL vs IS_ERR() check
	usb: dwc3: gadget: synchronize_irq dwc irq in suspend
	usb: dwc3: gadget: Fix the uninitialized link_state when udc starts
	usb: gadget: Potential NULL dereference on allocation error
	selftests: rtc: rtctest: fix alarm tests
	selftests: rtc: rtctest: add alarm test on minute boundary
	genirq: Make sure the initial affinity is not empty
	x86/mm/mem_encrypt: Fix erroneous sizeof()
	ASoC: rt5682: Fix PLL source register definitions
	ASoC: dapm: change snprintf to scnprintf for possible overflow
	ASoC: imx-audmux: change snprintf to scnprintf for possible overflow
	selftests/vm/gup_benchmark.c: match gup struct to kernel
	phy: ath79-usb: Fix the power on error path
	phy: ath79-usb: Fix the main reset name to match the DT binding
	selftests: seccomp: use LDLIBS instead of LDFLAGS
	selftests: gpio-mockup-chardev: Check asprintf() for error
	irqchip/gic-v3-mbi: Fix uninitialized mbi_lock
	ARC: fix __ffs return value to avoid build warnings
	ARC: show_regs: lockdep: avoid page allocator...
	drivers: thermal: int340x_thermal: Fix sysfs race condition
	staging: rtl8723bs: Fix build error with Clang when inlining is disabled
	mac80211: fix miscounting of ttl-dropped frames
	sched/wait: Fix rcuwait_wake_up() ordering
	sched/wake_q: Fix wakeup ordering for wake_q
	futex: Fix (possible) missed wakeup
	locking/rwsem: Fix (possible) missed wakeup
	drm/amd/powerplay: OD setting fix on Vega10
	tty: serial: qcom_geni_serial: Allow mctrl when flow control is disabled
	serial: fsl_lpuart: fix maximum acceptable baud rate with over-sampling
	drm/sun4i: hdmi: Fix usage of TMDS clock
	staging: android: ion: Support cpu access during dma_buf_detach
	direct-io: allow direct writes to empty inodes
	writeback: synchronize sync(2) against cgroup writeback membership switches
	scsi: lpfc: nvme: avoid hang / use-after-free when destroying localport
	scsi: lpfc: nvmet: avoid hang / use-after-free when destroying targetport
	scsi: csiostor: fix NULL pointer dereference in csio_vport_set_state()
	net: altera_tse: fix connect_local_phy error path
	hv_netvsc: Fix ethtool change hash key error
	hv_netvsc: Refactor assignments of struct netvsc_device_info
	hv_netvsc: Fix hash key value reset after other ops
	nvme-rdma: fix timeout handler
	nvme-multipath: drop optimization for static ANA group IDs
	drm/msm: Fix A6XX support for opp-level
	net: usb: asix: ax88772_bind return error when hw_reset fail
	net: dev_is_mac_header_xmit() true for ARPHRD_RAWIP
	ibmveth: Do not process frames after calling napi_reschedule
	mac80211: don't initiate TDLS connection if station is not associated to AP
	mac80211: Add attribute aligned(2) to struct 'action'
	cfg80211: extend range deviation for DMG
	svm: Fix AVIC incomplete IPI emulation
	KVM: nSVM: clear events pending from svm_complete_interrupts() when exiting to L1
	kvm: selftests: Fix region overlap check in kvm_util
	mmc: spi: Fix card detection during probe
	mmc: tmio_mmc_core: don't claim spurious interrupts
	mmc: tmio: fix access width of Block Count Register
	mmc: core: Fix NULL ptr crash from mmc_should_fail_request
	mmc: cqhci: fix space allocated for transfer descriptor
	mmc: cqhci: Fix a tiny potential memory leak on error condition
	mmc: sdhci-esdhc-imx: correct the fix of ERR004536
	mm: enforce min addr even if capable() in expand_downwards()
	drm: Block fb changes for async plane updates
	hugetlbfs: fix races and page leaks during migration
	MIPS: fix truncation in __cmpxchg_small for short values
	MIPS: BCM63XX: provide DMA masks for ethernet devices
	MIPS: eBPF: Fix icache flush end address
	x86/uaccess: Don't leak the AC flag into __put_user() value evaluation
	Linux 4.19.27

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2019-03-05 18:07:53 +01:00
Xie Yongji
9ad6216e8c locking/rwsem: Fix (possible) missed wakeup
[ Upstream commit e158488be2 ]

Because wake_q_add() can imply an immediate wakeup (cmpxchg failure
case), we must not rely on the wakeup being delayed. However, commit:

  e38513905e ("locking/rwsem: Rework zeroing reader waiter->task")

relies on exactly that behaviour in that the wakeup must not happen
until after we clear waiter->task.

[ peterz: Added changelog. ]

Signed-off-by: Xie Yongji <xieyongji@baidu.com>
Signed-off-by: Zhang Yu <zhangyu31@baidu.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: e38513905e ("locking/rwsem: Rework zeroing reader waiter->task")
Link: https://lkml.kernel.org/r/1543495830-2644-1-git-send-email-xieyongji@baidu.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-03-05 17:58:49 +01:00
Peter Zijlstra
2368e6d3bc futex: Fix (possible) missed wakeup
[ Upstream commit b061c38bef ]

We must not rely on wake_q_add() to delay the wakeup; in particular
commit:

  1d0dcb3ad9 ("futex: Implement lockless wakeups")

moved wake_q_add() before smp_store_release(&q->lock_ptr, NULL), which
could result in futex_wait() waking before observing ->lock_ptr ==
NULL and going back to sleep again.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 1d0dcb3ad9 ("futex: Implement lockless wakeups")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-03-05 17:58:49 +01:00
Peter Zijlstra
653a1dbcb0 sched/wake_q: Fix wakeup ordering for wake_q
[ Upstream commit 4c4e373156 ]

Notable cmpxchg() does not provide ordering when it fails, however
wake_q_add() requires ordering in this specific case too. Without this
it would be possible for the concurrent wakeup to not observe our
prior state.

Andrea Parri provided:

  C wake_up_q-wake_q_add

  {
	int next = 0;
	int y = 0;
  }

  P0(int *next, int *y)
  {
	int r0;

	/* in wake_up_q() */

	WRITE_ONCE(*next, 1);   /* node->next = NULL */
	smp_mb();               /* implied by wake_up_process() */
	r0 = READ_ONCE(*y);
  }

  P1(int *next, int *y)
  {
	int r1;

	/* in wake_q_add() */

	WRITE_ONCE(*y, 1);      /* wake_cond = true */
	smp_mb__before_atomic();
	r1 = cmpxchg_relaxed(next, 1, 2);
  }

  exists (0:r0=0 /\ 1:r1=0)

  This "exists" clause cannot be satisfied according to the LKMM:

  Test wake_up_q-wake_q_add Allowed
  States 3
  0:r0=0; 1:r1=1;
  0:r0=1; 1:r1=0;
  0:r0=1; 1:r1=1;
  No
  Witnesses
  Positive: 0 Negative: 3
  Condition exists (0:r0=0 /\ 1:r1=0)
  Observation wake_up_q-wake_q_add Never 0 3

Reported-by: Yongji Xie <elohimes@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-03-05 17:58:49 +01:00
Prateek Sood
5024f0a29a sched/wait: Fix rcuwait_wake_up() ordering
[ Upstream commit 6dc080eeb2 ]

For some peculiar reason rcuwait_wake_up() has the right barrier in
the comment, but not in the code.

This mistake has been observed to cause a deadlock in the following
situation:

    P1					P2

    percpu_up_read()			percpu_down_write()
      rcu_sync_is_idle() // false
					  rcu_sync_enter()
					  ...
      __percpu_up_read()

[S] ,-  __this_cpu_dec(*sem->read_count)
    |   smp_rmb();
[L] |   task = rcu_dereference(w->task) // NULL
    |
    |				    [S]	    w->task = current
    |					    smp_mb();
    |				    [L]	    readers_active_check() // fail
    `-> <store happens here>

Where the smp_rmb() (obviously) fails to constrain the store.

[ peterz: Added changelog. ]

Signed-off-by: Prateek Sood <prsood@codeaurora.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Andrea Parri <andrea.parri@amarulasolutions.com>
Acked-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 8f95c90ceb ("sched/wait, RCU: Introduce rcuwait machinery")
Link: https://lkml.kernel.org/r/1543590656-7157-1-git-send-email-prsood@codeaurora.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-03-05 17:58:49 +01:00
Srinivas Ramana
17fab8914f genirq: Make sure the initial affinity is not empty
[ Upstream commit bddda606ec ]

If all CPUs in the irq_default_affinity mask are offline when an interrupt
is initialized then irq_setup_affinity() can set an empty affinity mask for
a newly allocated interrupt.

Fix this by falling back to cpu_online_mask in case the resulting affinity
mask is zero.

Signed-off-by: Srinivas Ramana <sramana@codeaurora.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-arm-msm@vger.kernel.org
Link: https://lkml.kernel.org/r/1545312957-8504-1-git-send-email-sramana@codeaurora.org
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-03-05 17:58:47 +01:00
Long Li
765c30b318 genirq/matrix: Improve target CPU selection for managed interrupts.
[ Upstream commit e8da8794a7 ]

On large systems with multiple devices of the same class (e.g. NVMe disks,
using managed interrupts), the kernel can affinitize these interrupts to a
small subset of CPUs instead of spreading them out evenly.

irq_matrix_alloc_managed() tries to select the CPU in the supplied cpumask
of possible target CPUs which has the lowest number of interrupt vectors
allocated.

This is done by searching the CPU with the highest number of available
vectors. While this is correct for non-managed CPUs it can select the wrong
CPU for managed interrupts. Under certain constellations this results in
affinitizing the managed interrupts of several devices to a single CPU in
a set.

The book keeping of available vectors works the following way:

 1) Non-managed interrupts:

    available is decremented when the interrupt is actually requested by
    the device driver and a vector is assigned. It's incremented when the
    interrupt and the vector are freed.

 2) Managed interrupts:

    Managed interrupts guarantee vector reservation when the MSI/MSI-X
    functionality of a device is enabled, which is achieved by reserving
    vectors in the bitmaps of the possible target CPUs. This reservation
    decrements the available count on each possible target CPU.

    When the interrupt is requested by the device driver then a vector is
    allocated from the reserved region. The operation is reversed when the
    interrupt is freed by the device driver. Neither of these operations
    affect the available count.

    The reservation persist up to the point where the MSI/MSI-X
    functionality is disabled and only this operation increments the
    available count again.

For non-managed interrupts the available count is the correct selection
criterion because the guaranteed reservations need to be taken into
account. Using the allocated counter could lead to a failing allocation in
the following situation (total vector space of 10 assumed):

		 CPU0	CPU1
 available:	    2	   0
 allocated:	    5	   3   <--- CPU1 is selected, but available space = 0
 managed reserved:  3	   7

 while available yields the correct result.

For managed interrupts the available count is not the appropriate
selection criterion because as explained above the available count is not
affected by the actual vector allocation.

The following example illustrates that. Total vector space of 10
assumed. The starting point is:

		 CPU0	CPU1
 available:	    5	   4
 allocated:	    2	   3
 managed reserved:  3	   3

 Allocating vectors for three non-managed interrupts will result in
 affinitizing the first two to CPU0 and the third one to CPU1 because the
 available count is adjusted with each allocation:

		  CPU0	CPU1
 available:	     5	   4	<- Select CPU0 for 1st allocation
 --> allocated:	     3	   3

 available:	     4	   4	<- Select CPU0 for 2nd allocation
 --> allocated:	     4	   3

 available:	     3	   4	<- Select CPU1 for 3rd allocation
 --> allocated:	     4	   4

 But the allocation of three managed interrupts starting from the same
 point will affinitize all of them to CPU0 because the available count is
 not affected by the allocation (see above). So the end result is:

		  CPU0	CPU1
 available:	     5	   4
 allocated:	     5	   3

Introduce a "managed_allocated" field in struct cpumap to track the vector
allocation for managed interrupts separately. Use this information to
select the target CPU when a vector is allocated for a managed interrupt,
which results in more evenly distributed vector assignments. The above
example results in the following allocations:

		 CPU0	CPU1
 managed_allocated: 0	   0	<- Select CPU0 for 1st allocation
 --> allocated:	    3	   3

 managed_allocated: 1	   0	<- Select CPU1 for 2nd allocation
 --> allocated:	    3	   4

 managed_allocated: 1	   1	<- Select CPU0 for 3rd allocation
 --> allocated:	    4	   4

The allocation of non-managed interrupts is not affected by this change and
is still evaluating the available count.

The overall distribution of interrupt vectors for both types of interrupts
might still not be perfectly even depending on the number of non-managed
and managed interrupts in a system, but due to the reservation guarantee
for managed interrupts this cannot be avoided.

Expose the new field in debugfs as well.

[ tglx: Clarified the background of the problem in the changelog and
  	described it independent of NVME ]

Signed-off-by: Long Li <longli@microsoft.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Michael Kelley <mikelley@microsoft.com>
Link: https://lkml.kernel.org/r/20181106040000.27316-1-longli@linuxonhyperv.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-03-05 17:58:45 +01:00
Dou Liyang
8cae7757e8 irq/matrix: Spread managed interrupts on allocation
[ Upstream commit 76f99ae5b5 ]

Linux spreads out the non managed interrupt across the possible target CPUs
to avoid vector space exhaustion.

Managed interrupts are treated differently, as for them the vectors are
reserved (with guarantee) when the interrupt descriptors are initialized.

When the interrupt is requested a real vector is assigned. The assignment
logic uses the first CPU in the affinity mask for assignment. If the
interrupt has more than one CPU in the affinity mask, which happens when a
multi queue device has less queues than CPUs, then doing the same search as
for non managed interrupts makes sense as it puts the interrupt on the
least interrupt plagued CPU. For single CPU affine vectors that's obviously
a NOOP.

Restructre the matrix allocation code so it does the 'best CPU' search, add
the sanity check for an empty affinity mask and adapt the call site in the
x86 vector management code.

[ tglx: Added the empty mask check to the core and improved change log ]

Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: hpa@zytor.com
Link: https://lkml.kernel.org/r/20180908175838.14450-2-dou_liyang@163.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-03-05 17:58:45 +01:00
Dou Liyang
2948b8875d irq/matrix: Split out the CPU selection code into a helper
[ Upstream commit 8ffe4e61c0 ]

Linux finds the CPU which has the lowest vector allocation count to spread
out the non managed interrupts across the possible target CPUs, but does
not do so for managed interrupts.

Split out the CPU selection code into a helper function for reuse. No
functional change.

Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: hpa@zytor.com
Link: https://lkml.kernel.org/r/20180908175838.14450-1-dou_liyang@163.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-03-05 17:58:45 +01:00
Greg Kroah-Hartman
c97d2b535c This is the 4.19.26 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlx2U68ACgkQONu9yGCS
 aT6rHA/+PGrMAAzSp193bme2rap7P/EzDD0JF0aQOLwizdit5A4zIYSZApVVSaw5
 U8KIQg+fcwIjTvzBmkHddDVVJBEv/VdE8Xkf0EOIOidIQNqMoRQN7ctU1bnKfUyC
 20/xZBlc78BEU4TB4JpNiKkzD3bsl/Gsb2A3VenhHe0H8dbj4oJYyNWWY99/433y
 u1SNmTogLJ0p64+1XWMCTqPJA2xkUIOciFSLiO7d51MfpbguMwcW+T9XSwYat3tp
 uTdfvZYKc4iGPa4pZCZmZY7+Lrxne2OjMhp3bUlp0coKrjlNO848X+kG7gsVZ7EN
 MA7YOXUj/Ngjl+E9EilNY5gybY2hK0IzaaV8W9TRVYvPBjgSwe5Mj9/EXYO0L8q0
 Arla5HsJdRhLX7HDJ41w8BIngko87hetAaTZVCct+F+nnMYGrrMsle+BgYYbnd7r
 NYWB8HwZZ3krkWmAZb+Phpleo3Oj7beum5MLoydSH4pXhTVft5D/V8K2fwLmZtgP
 FsIC1J09jDvXJ5qHTFBY4ugTq8l21YZ2oMPdwcB6X8hBIKji4XydQ0HPrdMyAtXB
 tqUkzyLZlP1uzm9T3E+hyATWeM4ue7qLFacp7cX5LokDWIxlWi7tcIqBHqRZez+y
 S7un8/Lmz8ZvzC674JTZl1iXSHuuK6TFLrwI3mw/CVhg2qRvZtk=
 =EDHP
 -----END PGP SIGNATURE-----

Merge 4.19.26 into android-4.19

Changes in 4.19.26
	ARM: 8834/1: Fix: kprobes: optimized kprobes illegal instruction
	tracing: Fix number of entries in trace header
	MIPS: eBPF: Always return sign extended 32b values
	gpio: MT7621: use a per instance irq_chip structure
	gpio: pxa: avoid attempting to set pin direction via pinctrl on MMP2
	mac80211: Restore vif beacon interval if start ap fails
	mac80211: Use linked list instead of rhashtable walk for mesh tables
	mac80211: Free mpath object when rhashtable insertion fails
	libceph: handle an empty authorize reply
	ceph: avoid repeatedly adding inode to mdsc->snap_flush_list
	numa: change get_mempolicy() to use nr_node_ids instead of MAX_NUMNODES
	proc, oom: do not report alien mms when setting oom_score_adj
	ALSA: hda/realtek - Headset microphone and internal speaker support for System76 oryp5
	ALSA: hda/realtek: Disable PC beep in passthrough on alc285
	KEYS: allow reaching the keys quotas exactly
	backlight: pwm_bl: Fix devicetree parsing with auto-generated brightness tables
	mfd: ti_am335x_tscadc: Use PLATFORM_DEVID_AUTO while registering mfd cells
	pvcalls-front: read all data before closing the connection
	pvcalls-front: don't try to free unallocated rings
	pvcalls-front: properly allocate sk
	pvcalls-back: set -ENOTCONN in pvcalls_conn_back_read
	mfd: twl-core: Fix section annotations on {,un}protect_pm_master
	mfd: db8500-prcmu: Fix some section annotations
	mfd: mt6397: Do not call irq_domain_remove if PMIC unsupported
	mfd: ab8500-core: Return zero in get_register_interruptible()
	mfd: bd9571mwv: Add volatile register to make DVFS work
	mfd: qcom_rpm: write fw_version to CTRL_REG
	mfd: wm5110: Add missing ASRC rate register
	mfd: axp20x: Add AC power supply cell for AXP813
	mfd: axp20x: Re-align MFD cell entries
	mfd: axp20x: Add supported cells for AXP803
	mfd: cros_ec_dev: Add missing mfd_remove_devices() call in remove
	mfd: tps65218: Use devm_regmap_add_irq_chip and clean up error path in probe()
	mfd: mc13xxx: Fix a missing check of a register-read failure
	xen/pvcalls: remove set but not used variable 'intf'
	qed: Fix qed_chain_set_prod() for PBL chains with non power of 2 page count
	qed: Fix qed_ll2_post_rx_buffer_notify_fw() by adding a write memory barrier
	net: hns: Fix use after free identified by SLUB debug
	bpf: Fix [::] -> [::1] rewrite in sys_sendmsg
	selftests/bpf: Test [::] -> [::1] rewrite in sys_sendmsg in test_sock_addr
	watchdog: mt7621_wdt/rt2880_wdt: Fix compilation problem
	net/mlx4: Get rid of page operation after dma_alloc_coherent
	MIPS: ath79: Enable OF serial ports in the default config
	xprtrdma: Double free in rpcrdma_sendctxs_create()
	mlxsw: spectrum_acl: Add cleanup after C-TCAM update error condition
	selftests: forwarding: Add a test for VLAN deletion
	netfilter: nf_tables: fix leaking object reference count
	scsi: qla4xxx: check return code of qla4xxx_copy_from_fwddb_param
	scsi: isci: initialize shost fully before calling scsi_add_host()
	include/linux/compiler*.h: fix OPTIMIZER_HIDE_VAR
	MIPS: jazz: fix 64bit build
	netfilter: nft_flow_offload: Fix reverse route lookup
	bpf: correctly set initial window on active Fast Open sender
	pvcalls-front: Avoid get_free_pages(GFP_KERNEL) under spinlock
	bpf: fix panic in stack_map_get_build_id() on i386 and arm32
	netfilter: nft_flow_offload: fix interaction with vrf slave device
	RDMA/mthca: Clear QP objects during their allocation
	powerpc/8xx: fix setting of pagetable for Abatron BDI debug tool.
	acpi/nfit: Fix race accessing memdev in nfit_get_smbios_id()
	net: stmmac: Fix PCI module removal leak
	net: stmmac: dwxgmac2: Only clear interrupts that are active
	net: stmmac: Check if CBS is supported before configuring
	net: stmmac: Fix the logic of checking if RX Watchdog must be enabled
	net: stmmac: Prevent RX starvation in stmmac_napi_poll()
	isdn: i4l: isdn_tty: Fix some concurrency double-free bugs
	scsi: tcmu: avoid cmd/qfull timers updated whenever a new cmd comes
	scsi: ufs: Fix system suspend status
	scsi: qedi: Add ep_state for login completion on un-reachable targets
	scsi: ufs: Fix geometry descriptor size
	scsi: cxgb4i: add wait_for_completion()
	netfilter: nft_flow_offload: fix checking method of conntrack helper
	always clear the X2APIC_ENABLE bit for PV guest
	drm/meson: add missing of_node_put
	drm/amdkfd: Don't assign dGPUs to APU topology devices
	drm/amd/display: fix PME notification not working in RV desktop
	vhost: return EINVAL if iovecs size does not match the message size
	drm/sun4i: backend: add missing of_node_puts
	pvcalls-front: fix potential null dereference
	selftests: tc-testing: drop test on missing tunnel key id
	selftests: tc-testing: fix tunnel_key failure if dst_port is unspecified
	selftests: tc-testing: fix parsing of ife type
	afs: Don't set vnode->cb_s_break in afs_validate()
	afs: Fix key refcounting in file locking code
	bpf: don't assume build-id length is always 20 bytes
	bpf: zero out build_id for BPF_STACK_BUILD_ID_IP
	selftests/bpf: retry tests that expect build-id
	atm: he: fix sign-extension overflow on large shift
	hwmon: (tmp421) Correct the misspelling of the tmp442 compatible attribute in OF device ID table
	leds: lp5523: fix a missing check of return value of lp55xx_read
	bpf: bpf_setsockopt: reset sock dst on SO_MARK changes
	dpaa_eth: NETIF_F_LLTX requires to do our own update of trans_start
	mlxsw: pci: Return error on PCI reset timeout
	net: bridge: Mark FDB entries that were added by user as such
	mlxsw: spectrum_switchdev: Do not treat static FDB entries as sticky
	selftests: forwarding: Add a test case for externally learned FDB entries
	net/mlx5e: Fix wrong (zero) TX drop counter indication for representor
	isdn: avm: Fix string plus integer warning from Clang
	batman-adv: fix uninit-value in batadv_interface_tx()
	inet_diag: fix reporting cgroup classid and fallback to priority
	ipv6: propagate genlmsg_reply return code
	net: ena: fix race between link up and device initalization
	net/mlx4_en: Force CHECKSUM_NONE for short ethernet frames
	net/mlx5e: Don't overwrite pedit action when multiple pedit used
	net/packet: fix 4gb buffer limit due to overflow check
	net: sfp: do not probe SFP module before we're attached
	sctp: call gso_reset_checksum when computing checksum in sctp_gso_segment
	sctp: set stream ext to NULL after freeing it in sctp_stream_outq_migrate
	team: avoid complex list operations in team_nl_cmd_options_set()
	Revert "socket: fix struct ifreq size in compat ioctl"
	Revert "kill dev_ifsioc()"
	net: socket: fix SIOCGIFNAME in compat
	net: socket: make bond ioctls go through compat_ifreq_ioctl()
	geneve: should not call rt6_lookup() when ipv6 was disabled
	sit: check if IPv6 enabled before calling ip6_err_gen_icmpv6_unreach()
	net_sched: fix a race condition in tcindex_destroy()
	net_sched: fix a memory leak in cls_tcindex
	net_sched: fix two more memory leaks in cls_tcindex
	net/mlx5e: XDP, fix redirect resources availability check
	RDMA/srp: Rework SCSI device reset handling
	KEYS: user: Align the payload buffer
	KEYS: always initialize keyring_index_key::desc_len
	parisc: Fix ptrace syscall number modification
	ARCv2: Enable unaligned access in early ASM code
	ARC: U-boot: check arguments paranoidly
	ARC: define ARCH_SLAB_MINALIGN = 8
	drm/amdgpu: Set DPM_FLAG_NEVER_SKIP when enabling PM-runtime
	gpu: drm: radeon: Set DPM_FLAG_NEVER_SKIP when enabling PM-runtime
	drm/i915/fbdev: Actually configure untiled displays
	drm/amd/display: Fix MST reboot/poweroff sequence
	mac80211: allocate tailroom for forwarded mesh packets
	kvm: x86: Return LA57 feature based on hardware capability
	net: validate untrusted gso packets without csum offload
	net: avoid false positives in untrusted gso validation
	staging: erofs: fix a bug when appling cache strategy
	staging: erofs: complete error handing of z_erofs_do_read_page
	staging: erofs: replace BUG_ON with DBG_BUGON in data.c
	staging: erofs: drop multiref support temporarily
	staging: erofs: remove the redundant d_rehash() for the root dentry
	staging: erofs: atomic_cond_read_relaxed on ref-locked workgroup
	staging: erofs: fix `erofs_workgroup_{try_to_freeze, unfreeze}'
	staging: erofs: add a full barrier in erofs_workgroup_unfreeze
	staging: erofs: {dir,inode,super}.c: rectify BUG_ONs
	staging: erofs: unzip_{pagevec.h,vle.c}: rectify BUG_ONs
	staging: erofs: unzip_vle_lz4.c,utils.c: rectify BUG_ONs
	Revert "bridge: do not add port to router list when receives query with source 0.0.0.0"
	netfilter: nf_tables: fix flush after rule deletion in the same batch
	netfilter: nft_compat: use-after-free when deleting targets
	netfilter: ipv6: Don't preserve original oif for loopback address
	netfilter: nfnetlink_osf: add missing fmatch check
	netfilter: ipt_CLUSTERIP: fix sleep-in-atomic bug in clusterip_config_entry_put()
	udlfb: handle unplug properly
	pinctrl: max77620: Use define directive for max77620_pinconf_param values
	net: phylink: avoid resolving link state too early
	Linux 4.19.26

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2019-02-27 10:23:18 +01:00
Stanislav Fomichev
5c6fdd877e bpf: zero out build_id for BPF_STACK_BUILD_ID_IP
[ Upstream commit 4af396ae48 ]

When returning BPF_STACK_BUILD_ID_IP from stack_map_get_build_id_offset,
make sure that build_id field is empty. Since we are using percpu
free list, there is a possibility that we might reuse some previous
bpf_stack_build_id with non-zero build_id.

Fixes: 615755a77b ("bpf: extend stackmap to save binary_build_id+offset instead of address")
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-02-27 10:08:56 +01:00
Stanislav Fomichev
c4555b9f28 bpf: don't assume build-id length is always 20 bytes
[ Upstream commit 0b698005a9 ]

Build-id length is not fixed to 20, it can be (`man ld` /--build-id):
  * 128-bit (uuid)
  * 160-bit (sha1)
  * any length specified in ld --build-id=0xhexstring

To fix the issue of missing BPF_STACK_BUILD_ID_VALID for shorter build-ids,
assume that build-id is somewhere in the range of 1 .. 20.
Set the remaining bytes to zero.

v2:
* don't introduce new "len = min(BPF_BUILD_ID_SIZE, nhdr->n_descsz)",
  we already know that nhdr->n_descsz <= BPF_BUILD_ID_SIZE if we enter
  this 'if' condition

Fixes: 615755a77b ("bpf: extend stackmap to save binary_build_id+offset instead of address")
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-02-27 10:08:56 +01:00
Song Liu
2f3480e340 bpf: fix panic in stack_map_get_build_id() on i386 and arm32
[ Upstream commit beaf3d1901 ]

As Naresh reported, test_stacktrace_build_id() causes panic on i386 and
arm32 systems. This is caused by page_address() returns NULL in certain
cases.

This patch fixes this error by using kmap_atomic/kunmap_atomic instead
of page_address.

Fixes: 615755a77b (" bpf: extend stackmap to save binary_build_id+offset instead of address")
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-02-27 10:08:54 +01:00
Quentin Perret
b5e57dbb5a tracing: Fix number of entries in trace header
commit 9e7382153f upstream.

The following commit

  441dae8f2f ("tracing: Add support for display of tgid in trace output")

removed the call to print_event_info() from print_func_help_header_irq()
which results in the ftrace header not reporting the number of entries
written in the buffer. As this wasn't the original intent of the patch,
re-introduce the call to print_event_info() to restore the orginal
behaviour.

Link: http://lkml.kernel.org/r/20190214152950.4179-1-quentin.perret@arm.com

Acked-by: Joel Fernandes <joelaf@google.com>
Cc: stable@vger.kernel.org
Fixes: 441dae8f2f ("tracing: Add support for display of tgid in trace output")
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-02-27 10:08:49 +01:00
Greg Kroah-Hartman
cca7d2df6d This is the 4.19.24 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlxtHSUACgkQONu9yGCS
 aT6WIw//Xdd2P7aIWSDnmS6L+RlEtdz2hnO07O2lMVIhsH1MgetAk0nJbP8b9sYp
 zJY4kW6gEue1/J7vHIiTnDpnhaylxESOxOb49O/E95bnPIdhE9EqR1aiMmo8R5rU
 8F2lH6kZncAawZ/7WGnQdHqukO9GtwFO0hwtsDT5RcwiIPe/fHkIpp1QfN21ntib
 nV7xQOfDFEMVcbyzAFa+59A4UUQSWKTkld9tcJFG62pjcN9DlpPej86M2qy9mG8U
 0d7hEkYGqHthh092TAxgobsJDq5j0jUUVjK80/Fq0TKe/OSvsgQ8MRBryVsgBf8J
 hJBCLHay4Ps3ONtNI3IXDVZu9dgdL8UynBEfbWdgJHlGgKJEvMGxFCckzI8OP3yr
 HJtFm6gG24kySFt+QnvwHg1/zwAho5FDBMfx02RkukJwDiN16DZSbKRimrTVAioV
 bBTmL+JyoCBLYGfUzq4gM1+3L5TRyMBqjgcjLvXWuXF18fdpVdb0exKlNtLHMUf6
 aqzhroF90sn5vUybpw02ouduxmRPWDaDel7ArA3zb16Dp+ODCxYtpkV7dmRBjPVK
 YxwweO604Rd2musacQkVAB5JkshjPdV7WA/FmhJf7Fn1/CaaXliAVrfMNbEGW15a
 NFQvTMrJva2pihXYJY4p/UfhnsBmxXmY30wYlLX2VxRFb+qoy3Y=
 =MO0g
 -----END PGP SIGNATURE-----

Merge 4.19.24 into android-4.19

Changes in 4.19.24
	dt-bindings: eeprom: at24: add "atmel,24c2048" compatible string
	eeprom: at24: add support for 24c2048
	blk-mq: fix a hung issue when fsync
	ARM: 8789/1: signal: copy registers using __copy_to_user()
	ARM: 8790/1: signal: always use __copy_to_user to save iwmmxt context
	ARM: 8791/1: vfp: use __copy_to_user() when saving VFP state
	ARM: 8792/1: oabi-compat: copy oabi events using __copy_to_user()
	ARM: 8793/1: signal: replace __put_user_error with __put_user
	ARM: 8794/1: uaccess: Prevent speculative use of the current addr_limit
	ARM: 8795/1: spectre-v1.1: use put_user() for __put_user()
	ARM: 8796/1: spectre-v1,v1.1: provide helpers for address sanitization
	ARM: 8797/1: spectre-v1.1: harden __copy_to_user
	ARM: 8810/1: vfp: Fix wrong assignement to ufp_exc
	ARM: make lookup_processor_type() non-__init
	ARM: split out processor lookup
	ARM: clean up per-processor check_bugs method call
	ARM: add PROC_VTABLE and PROC_TABLE macros
	ARM: spectre-v2: per-CPU vtables to work around big.Little systems
	ARM: ensure that processor vtables is not lost after boot
	ARM: fix the cockup in the previous patch
	drm/amdgpu/sriov:Correct pfvf exchange logic
	ACPI: NUMA: Use correct type for printing addresses on i386-PAE
	perf report: Fix wrong iteration count in --branch-history
	perf test shell: Use a fallback to get the pathname in vfs_getname
	tools uapi: fix RISC-V 64-bit support
	riscv: fix trace_sys_exit hook
	cpufreq: check if policy is inactive early in __cpufreq_get()
	drm/bridge: tc358767: add bus flags
	drm/bridge: tc358767: add defines for DP1_SRCCTRL & PHY_2LANE
	drm/bridge: tc358767: fix single lane configuration
	drm/bridge: tc358767: fix initial DP0/1_SRCCTRL value
	drm/bridge: tc358767: reject modes which require too much BW
	drm/bridge: tc358767: fix output H/V syncs
	nvme-pci: use the same attributes when freeing host_mem_desc_bufs.
	nvme-pci: fix out of bounds access in nvme_cqe_pending
	nvme-multipath: zero out ANA log buffer
	nvme: pad fake subsys NQN vid and ssvid with zeros
	drm/amdgpu: set WRITE_BURST_LENGTH to 64B to workaround SDMA1 hang
	ARM: dts: da850-evm: Correct the audio codec regulators
	ARM: dts: da850-evm: Correct the sound card name
	ARM: dts: da850-lcdk: Correct the audio codec regulators
	ARM: dts: da850-lcdk: Correct the sound card name
	ARM: dts: kirkwood: Fix polarity of GPIO fan lines
	gpio: pl061: handle failed allocations
	drm/nouveau: Don't disable polling in fallback mode
	drm/nouveau/falcon: avoid touching registers if engine is off
	cifs: Limit memory used by lock request calls to a page
	kvm: sev: Fail KVM_SEV_INIT if already initialized
	CIFS: Do not assume one credit for async responses
	gpio: mxc: move gpio noirq suspend/resume to syscore phase
	Revert "Input: elan_i2c - add ACPI ID for touchpad in ASUS Aspire F5-573G"
	Input: elan_i2c - add ACPI ID for touchpad in Lenovo V330-15ISK
	ARM: OMAP5+: Fix inverted nirq pin interrupts with irq_set_type
	perf/core: Fix impossible ring-buffer sizes warning
	perf/x86: Add check_period PMU callback
	ALSA: hda - Add quirk for HP EliteBook 840 G5
	ALSA: usb-audio: Fix implicit fb endpoint setup by quirk
	ASoC: hdmi-codec: fix oops on re-probe
	tools uapi: fix Alpha support
	riscv: Add pte bit to distinguish swap from invalid
	x86/kvm/nVMX: read from MSR_IA32_VMX_PROCBASED_CTLS2 only when it is available
	kvm: vmx: Fix entry number check for add_atomic_switch_msr()
	mmc: sunxi: Filter out unsupported modes declared in the device tree
	mmc: block: handle complete_work on separate workqueue
	Input: bma150 - register input device after setting private data
	Input: elantech - enable 3rd button support on Fujitsu CELSIUS H780
	Revert "nfsd4: return default lease period"
	Revert "mm: don't reclaim inodes with many attached pages"
	Revert "mm: slowly shrink slabs with a relatively small number of objects"
	alpha: fix page fault handling for r16-r18 targets
	alpha: Fix Eiger NR_IRQS to 128
	s390/zcrypt: fix specification exception on z196 during ap probe
	tracing/uprobes: Fix output for multiple string arguments
	x86/platform/UV: Use efi_runtime_lock to serialise BIOS calls
	scsi: sd: fix entropy gathering for most rotational disks
	signal: Restore the stop PTRACE_EVENT_EXIT
	md/raid1: don't clear bitmap bits on interrupted recovery.
	x86/a.out: Clear the dump structure initially
	dm crypt: don't overallocate the integrity tag space
	dm thin: fix bug where bio that overwrites thin block ignores FUA
	drm: Use array_size() when creating lease
	drm/vkms: Fix license inconsistent
	drm/i915: Block fbdev HPD processing during suspend
	drm/i915: Prevent a race during I915_GEM_MMAP ioctl with WC set
	mm: proc: smaps_rollup: fix pss_locked calculation
	Linux 4.19.24

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2019-02-20 10:37:09 +01:00
Eric W. Biederman
a2b3e2c0f5 signal: Restore the stop PTRACE_EVENT_EXIT
commit cf43a757fd upstream.

In the middle of do_exit() there is there is a call
"ptrace_event(PTRACE_EVENT_EXIT, code);" That call places the process
in TACKED_TRACED aka "(TASK_WAKEKILL | __TASK_TRACED)" and waits for
for the debugger to release the task or SIGKILL to be delivered.

Skipping past dequeue_signal when we know a fatal signal has already
been delivered resulted in SIGKILL remaining pending and
TIF_SIGPENDING remaining set.  This in turn caused the
scheduler to not sleep in PTACE_EVENT_EXIT as it figured
a fatal signal was pending.  This also caused ptrace_freeze_traced
in ptrace_check_attach to fail because it left a per thread
SIGKILL pending which is what fatal_signal_pending tests for.

This difference in signal state caused strace to report
strace: Exit of unknown pid NNNNN ignored

Therefore update the signal handling state like dequeue_signal
would when removing a per thread SIGKILL, by removing SIGKILL
from the per thread signal mask and clearing TIF_SIGPENDING.

Acked-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Ivan Delalande <colona@arista.com>
Cc: stable@vger.kernel.org
Fixes: 35634ffa17 ("signal: Always notice exiting tasks")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-02-20 10:25:48 +01:00
Andreas Ziegler
45649b9996 tracing/uprobes: Fix output for multiple string arguments
commit 0722069a53 upstream.

When printing multiple uprobe arguments as strings the output for the
earlier arguments would also include all later string arguments.

This is best explained in an example:

Consider adding a uprobe to a function receiving two strings as
parameters which is at offset 0xa0 in strlib.so and we want to print
both parameters when the uprobe is hit (on x86_64):

$ echo 'p:func /lib/strlib.so:0xa0 +0(%di):string +0(%si):string' > \
    /sys/kernel/debug/tracing/uprobe_events

When the function is called as func("foo", "bar") and we hit the probe,
the trace file shows a line like the following:

  [...] func: (0x7f7e683706a0) arg1="foobar" arg2="bar"

Note the extra "bar" printed as part of arg1. This behaviour stacks up
for additional string arguments.

The strings are stored in a dynamically growing part of the uprobe
buffer by fetch_store_string() after copying them from userspace via
strncpy_from_user(). The return value of strncpy_from_user() is then
directly used as the required size for the string. However, this does
not take the terminating null byte into account as the documentation
for strncpy_from_user() cleary states that it "[...] returns the
length of the string (not including the trailing NUL)" even though the
null byte will be copied to the destination.

Therefore, subsequent calls to fetch_store_string() will overwrite
the terminating null byte of the most recently fetched string with
the first character of the current string, leading to the
"accumulation" of strings in earlier arguments in the output.

Fix this by incrementing the return value of strncpy_from_user() by
one if we did not hit the maximum buffer size.

Link: http://lkml.kernel.org/r/20190116141629.5752-1-andreas.ziegler@fau.de

Cc: Ingo Molnar <mingo@redhat.com>
Cc: stable@vger.kernel.org
Fixes: 5baaa59ef0 ("tracing/probes: Implement 'memory' fetch method for uprobes")
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Andreas Ziegler <andreas.ziegler@fau.de>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-02-20 10:25:48 +01:00
Jiri Olsa
74cbb754d6 perf/x86: Add check_period PMU callback
commit 81ec3f3c4c upstream.

Vince (and later on Ravi) reported crashes in the BTS code during
fuzzing with the following backtrace:

  general protection fault: 0000 [#1] SMP PTI
  ...
  RIP: 0010:perf_prepare_sample+0x8f/0x510
  ...
  Call Trace:
   <IRQ>
   ? intel_pmu_drain_bts_buffer+0x194/0x230
   intel_pmu_drain_bts_buffer+0x160/0x230
   ? tick_nohz_irq_exit+0x31/0x40
   ? smp_call_function_single_interrupt+0x48/0xe0
   ? call_function_single_interrupt+0xf/0x20
   ? call_function_single_interrupt+0xa/0x20
   ? x86_schedule_events+0x1a0/0x2f0
   ? x86_pmu_commit_txn+0xb4/0x100
   ? find_busiest_group+0x47/0x5d0
   ? perf_event_set_state.part.42+0x12/0x50
   ? perf_mux_hrtimer_restart+0x40/0xb0
   intel_pmu_disable_event+0xae/0x100
   ? intel_pmu_disable_event+0xae/0x100
   x86_pmu_stop+0x7a/0xb0
   x86_pmu_del+0x57/0x120
   event_sched_out.isra.101+0x83/0x180
   group_sched_out.part.103+0x57/0xe0
   ctx_sched_out+0x188/0x240
   ctx_resched+0xa8/0xd0
   __perf_event_enable+0x193/0x1e0
   event_function+0x8e/0xc0
   remote_function+0x41/0x50
   flush_smp_call_function_queue+0x68/0x100
   generic_smp_call_function_single_interrupt+0x13/0x30
   smp_call_function_single_interrupt+0x3e/0xe0
   call_function_single_interrupt+0xf/0x20
   </IRQ>

The reason is that while event init code does several checks
for BTS events and prevents several unwanted config bits for
BTS event (like precise_ip), the PERF_EVENT_IOC_PERIOD allows
to create BTS event without those checks being done.

Following sequence will cause the crash:

If we create an 'almost' BTS event with precise_ip and callchains,
and it into a BTS event it will crash the perf_prepare_sample()
function because precise_ip events are expected to come
in with callchain data initialized, but that's not the
case for intel_pmu_drain_bts_buffer() caller.

Adding a check_period callback to be called before the period
is changed via PERF_EVENT_IOC_PERIOD. It will deny the change
if the event would become BTS. Plus adding also the limit_period
check as well.

Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Cc: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20190204123532.GA4794@krava
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-02-20 10:25:45 +01:00
Ingo Molnar
d10e77c260 perf/core: Fix impossible ring-buffer sizes warning
commit 528871b456 upstream.

The following commit:

  9dff0aa95a ("perf/core: Don't WARN() for impossible ring-buffer sizes")

results in perf recording failures with larger mmap areas:

  root@skl:/tmp# perf record -g -a
  failed to mmap with 12 (Cannot allocate memory)

The root cause is that the following condition is buggy:

	if (order_base_2(size) >= MAX_ORDER)
		goto fail;

The problem is that @size is in bytes and MAX_ORDER is in pages,
so the right test is:

	if (order_base_2(size) >= PAGE_SHIFT+MAX_ORDER)
		goto fail;

Fix it.

Reported-by: "Jin, Yao" <yao.jin@linux.intel.com>
Bisected-by: Borislav Petkov <bp@alien8.de>
Analyzed-by: Peter Zijlstra <peterz@infradead.org>
Cc: Julien Thierry <julien.thierry@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: <stable@vger.kernel.org>
Fixes: 9dff0aa95a ("perf/core: Don't WARN() for impossible ring-buffer sizes")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-02-20 10:25:44 +01:00
Greg Kroah-Hartman
0755dc9375 This is the 4.19.22 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlxmZdUACgkQONu9yGCS
 aT7dtg/9G+mhds8xaPiFxqd+0np4NJ01neNeAvF5Mosf5OMC9yDgNe8iZqrfjXCG
 aJ/1mmnT+VXLihEZnjTrLuIOLF9iO83820lOwafPJWina8r6a0I2Uly+52Znv8Vz
 +Y860MQO4JncWcqBKR2En2690HgNLVv3zg3iDtYx/fzrDaq0gvtFHQFUJzjWp0yT
 vxmWMmEkEPVJfRLaRIb5Mgtmo5y4lpLeCZNj67Q/FKR0qPdTvnmqS8yS9O/CK1kR
 cdckosanno/HtbR2dhMrqyYuWmqj7JGAb4cCsvD2Y5X25ooFY/NRUSzxb6xSH1o8
 +xk7daqXZ75cnIA5DlJZJJFq5i8iUvltCQGTquk+nRXlDINwOvWxOuxNhWcEzR7Y
 pmp8A/mC/w5vat96rvpA3HmIrLLZC5ZLSYtSB1kuTDRKm7NZW1sT6u4kwPKvXJci
 S42QK4hJDmKzt+rWegu4K8kBQ8ihrOyERxtXA8xKG2VSaWvwSxrTFa/gInWrlB5W
 vswYWmul/8f6bpBIj+L+rYvJdaYFnSXhOobliCoUFK4ZQpQK8Vr/uAwPU7yamHh+
 3r40NzSpULfLHSEcSKYu4EII8hppaSosP37dHjGKET6bs4sOS2rTQVik/NfOHs3u
 fBIDIxklkBAcgOvue23KuD7xbAXONZI966qYoHExZOS5nDLlzVo=
 =tekf
 -----END PGP SIGNATURE-----

Merge 4.19.22 into android-4.19

Changes in 4.19.22
	mtd: Make sure mtd->erasesize is valid even if the partition is of size 0
	mtd: spinand: Handle the case where PROGRAM LOAD does not reset the cache
	mtd: spinand: Fix the error/cleanup path in spinand_init()
	mtd: rawnand: gpmi: fix MX28 bus master lockup problem
	libata: Add NOLPM quirk for SAMSUNG MZ7TE512HMHP-000L1 SSD
	tools: iio: iio_generic_buffer: make num_loops signed
	iio: adc: axp288: Fix TS-pin handling
	iio: chemical: atlas-ph-sensor: correct IIO_TEMP values to millicelsius
	iio: ti-ads8688: Update buffer allocation for timestamps
	signal: Always notice exiting tasks
	signal: Better detection of synchronous signals
	misc: vexpress: Off by one in vexpress_syscfg_exec()
	mei: me: add ice lake point device id.
	samples: mei: use /dev/mei0 instead of /dev/mei
	debugfs: fix debugfs_rename parameter checking
	pinctrl: sunxi: Correct number of IRQ banks on H6 main pin controller
	pinctrl: cherryview: fix Strago DMI workaround
	tracing: uprobes: Fix typo in pr_fmt string
	mips: cm: reprime error cause
	MIPS: OCTEON: don't set octeon_dma_bar_type if PCI is disabled
	MIPS: VDSO: Use same -m%-float cflag as the kernel proper
	mips: loongson64: remove unreachable(), fix loongson_poweroff().
	MIPS: VDSO: Include $(ccflags-vdso) in o32,n32 .lds builds
	ARM: iop32x/n2100: fix PCI IRQ mapping
	ARM: tango: Improve ARCH_MULTIPLATFORM compatibility
	ARM: dts: da850: fix interrupt numbers for clocksource
	firmware: arm_scmi: provide the mandatory device release callback
	powerpc/radix: Fix kernel crash with mremap()
	mic: vop: Fix use-after-free on remove
	mac80211: ensure that mgmt tx skbs have tailroom for encryption
	drm/modes: Prevent division by zero htotal
	drm/amd/powerplay: Fix missing break in switch
	drm/i915: always return something on DDI clock selection
	drm/vmwgfx: Fix setting of dma masks
	drm/vmwgfx: Return error code from vmw_execbuf_copy_fence_user
	SUNRPC: Always drop the XPRT_LOCK on XPRT_CLOSE_WAIT
	xfrm: Make set-mark default behavior backward compatible
	Revert "ext4: use ext4_write_inode() when fsyncing w/o a journal"
	libceph: avoid KEEPALIVE_PENDING races in ceph_con_keepalive()
	xfrm: refine validation of template and selector families
	batman-adv: Avoid WARN on net_device without parent in netns
	batman-adv: Force mac header to start of data on xmit
	svcrdma: Reduce max_send_sges
	svcrdma: Remove max_sge check at connect time
	Linux 4.19.22

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2019-02-15 09:02:44 +01:00
Andreas Ziegler
7e44aab927 tracing: uprobes: Fix typo in pr_fmt string
commit ea6eb5e7d1 upstream.

The subsystem-specific message prefix for uprobes was also
"trace_kprobe: " instead of "trace_uprobe: " as described in
the original commit message.

Link: http://lkml.kernel.org/r/20190117133023.19292-1-andreas.ziegler@fau.de

Cc: Ingo Molnar <mingo@redhat.com>
Cc: stable@vger.kernel.org
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Fixes: 7257634135 ("tracing/probe: Show subsystem name in messages")
Signed-off-by: Andreas Ziegler <andreas.ziegler@fau.de>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-02-15 08:10:11 +01:00
Eric W. Biederman
959e46afec signal: Better detection of synchronous signals
commit 7146db3317 upstream.

Recently syzkaller was able to create unkillablle processes by
creating a timer that is delivered as a thread local signal on SIGHUP,
and receiving SIGHUP SA_NODEFERER.  Ultimately causing a loop failing
to deliver SIGHUP but always trying.

When the stack overflows delivery of SIGHUP fails and force_sigsegv is
called.  Unfortunately because SIGSEGV is numerically higher than
SIGHUP next_signal tries again to deliver a SIGHUP.

From a quality of implementation standpoint attempting to deliver the
timer SIGHUP signal is wrong.  We should attempt to deliver the
synchronous SIGSEGV signal we just forced.

We can make that happening in a fairly straight forward manner by
instead of just looking at the signal number we also look at the
si_code.  In particular for exceptions (aka synchronous signals) the
si_code is always greater than 0.

That still has the potential to pick up a number of asynchronous
signals as in a few cases the same si_codes that are used
for synchronous signals are also used for asynchronous signals,
and SI_KERNEL is also included in the list of possible si_codes.

Still the heuristic is much better and timer signals are definitely
excluded.  Which is enough to prevent all known ways for someone
sending a process signals fast enough to cause unexpected and
arguably incorrect behavior.

Cc: stable@vger.kernel.org
Fixes: a27341cd5f ("Prioritize synchronous signals over 'normal' signals")
Tested-by: Dmitry Vyukov <dvyukov@google.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-02-15 08:10:11 +01:00
Eric W. Biederman
f681f2684f signal: Always notice exiting tasks
commit 35634ffa17 upstream.

Recently syzkaller was able to create unkillablle processes by
creating a timer that is delivered as a thread local signal on SIGHUP,
and receiving SIGHUP SA_NODEFERER.  Ultimately causing a loop
failing to deliver SIGHUP but always trying.

Upon examination it turns out part of the problem is actually most of
the solution.  Since 2.5 signal delivery has found all fatal signals,
marked the signal group for death, and queued SIGKILL in every threads
thread queue relying on signal->group_exit_code to preserve the
information of which was the actual fatal signal.

The conversion of all fatal signals to SIGKILL results in the
synchronous signal heuristic in next_signal kicking in and preferring
SIGHUP to SIGKILL.  Which is especially problematic as all
fatal signals have already been transformed into SIGKILL.

Instead of dequeueing signals and depending upon SIGKILL to
be the first signal dequeued, first test if the signal group
has already been marked for death.  This guarantees that
nothing in the signal queue can prevent a process that needs
to exit from exiting.

Cc: stable@vger.kernel.org
Tested-by: Dmitry Vyukov <dvyukov@google.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Ref: ebf5ebe31d2c ("[PATCH] signal-fixes-2.5.59-A4")
History Tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-02-15 08:10:11 +01:00
Greg Kroah-Hartman
6e0411bdc2 This is the 4.19.21 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlxjFL8ACgkQONu9yGCS
 aT643xAAk+mnsrOfU6/LBFjcJUUUYohK01UAU+PRvTjy4uH8rrA4G01xvp11ftIu
 jjEikA2582ScR5a2Ww+E4SfqOdKT4z9hOGLTnyI0P4xN9jeVidvu9+C90AYyBYhi
 orHm1osVQIj6n9+OQ5db+DzZYbZLbyfCoqNXbq9EoLvNRS3FUUH0y2VXqcz9Ghcj
 obpdHTVMKRaFkRWdCglo+3hSpoKrncSVpKrwUXR18GCt8jjZjj39kI9t6UoGNfc7
 nN3GOd26U1tpGo6ShZJYu6aPjV+zoYNlsg1o2zn9qJANIdulYe30vqNhWCeJ+/T7
 WcT3EHv4pPEO3Lvgfp+l10Nc6IbYdJEFUpAP3CvfP+MvRfKvz8Vo3Nm/BQlr20+q
 +MUYJb+wxhlHPRLV192XbnYFkEzZg7vzymoMPL034XheAkOkPbOK0IIVo41p5Rai
 LxmOdvhzfAktbtD/VWnLTUbexjs2EJ05bvZRjdPKKIMBNKnAWz4ux3KcHxpdsUF8
 KMCKwpJE8KDM4uiaKVdyfMhFeIg37pmy+7Uv9cUFjWwtsL3K+CiAXg8uaSvnajKr
 bOhwbFIgxoOI9VRBK8M0wvzouphA0miVbxOY81sdfbMeWNpbCLdb968AFz25llta
 rVlWp7bSAUQTsvTkVBv6mrTKPwzO4jnfHYN8yj5gO7pr5fmC2ig=
 =L+Mp
 -----END PGP SIGNATURE-----

Merge 4.19.21 into android-4.19

Changes in 4.19.21
	devres: Align data[] to ARCH_KMALLOC_MINALIGN
	drm/bufs: Fix Spectre v1 vulnerability
	staging: iio: adc: ad7280a: handle error from __ad7280_read32()
	drm/vgem: Fix vgem_init to get drm device available.
	pinctrl: bcm2835: Use raw spinlock for RT compatibility
	ASoC: Intel: mrfld: fix uninitialized variable access
	gpiolib: Fix possible use after free on label
	drm/sun4i: Initialize registers in tcon-top driver
	genirq/affinity: Spread IRQs to all available NUMA nodes
	gpu: ipu-v3: image-convert: Prevent race between run and unprepare
	nds32: Fix gcc 8.0 compiler option incompatible.
	wil6210: fix reset flow for Talyn-mb
	wil6210: fix memory leak in wil_find_tx_bcast_2
	ath10k: assign 'n_cipher_suites' for WCN3990
	ath9k: dynack: use authentication messages for 'late' ack
	scsi: lpfc: Correct LCB RJT handling
	scsi: mpt3sas: Call sas_remove_host before removing the target devices
	scsi: lpfc: Fix LOGO/PLOGI handling when triggerd by ABTS Timeout event
	ARM: 8808/1: kexec:offline panic_smp_self_stop CPU
	clk: boston: fix possible memory leak in clk_boston_setup()
	dlm: Don't swamp the CPU with callbacks queued during recovery
	x86/PCI: Fix Broadcom CNB20LE unintended sign extension (redux)
	powerpc/pseries: add of_node_put() in dlpar_detach_node()
	crypto: aes_ti - disable interrupts while accessing S-box
	drm/vc4: ->x_scaling[1] should never be set to VC4_SCALING_NONE
	serial: fsl_lpuart: clear parity enable bit when disable parity
	ptp: check gettime64 return code in PTP_SYS_OFFSET ioctl
	MIPS: Boston: Disable EG20T prefetch
	dpaa2-ptp: defer probe when portal allocation failed
	iwlwifi: fw: do not set sgi bits for HE connection
	staging:iio:ad2s90: Make probe handle spi_setup failure
	fpga: altera-cvp: Fix registration for CvP incapable devices
	Tools: hv: kvp: Fix a warning of buffer overflow with gcc 8.0.1
	fpga: altera-cvp: fix 'bad IO access' on x86_64
	vbox: fix link error with 'gcc -Og'
	platform/chrome: don't report EC_MKBP_EVENT_SENSOR_FIFO as wakeup
	i40e: prevent overlapping tx_timeout recover
	scsi: hisi_sas: change the time of SAS SSP connection
	staging: iio: ad7780: update voltage on read
	usbnet: smsc95xx: fix rx packet alignment
	drm/rockchip: fix for mailbox read size
	ARM: OMAP2+: hwmod: Fix some section annotations
	drm/amd/display: fix gamma not being applied correctly
	drm/amd/display: calculate stream->phy_pix_clk before clock mapping
	bpf: libbpf: retry map creation without the name
	net/mlx5: EQ, Use the right place to store/read IRQ affinity hint
	modpost: validate symbol names also in find_elf_symbol
	perf tools: Add Hygon Dhyana support
	soc/tegra: Don't leak device tree node reference
	media: rc: ensure close() is called on rc_unregister_device
	media: video-i2c: avoid accessing released memory area when removing driver
	media: mtk-vcodec: Release device nodes in mtk_vcodec_init_enc_pm()
	staging: erofs: fix the definition of DBG_BUGON
	clk: meson: meson8b: do not use cpu_div3 for cpu_scale_out_sel
	clk: meson: meson8b: fix the width of the cpu_scale_div clock
	clk: meson: meson8b: mark the CPU clock as CLK_IS_CRITICAL
	ptp: Fix pass zero to ERR_PTR() in ptp_clock_register
	dmaengine: xilinx_dma: Remove __aligned attribute on zynqmp_dma_desc_ll
	powerpc/32: Add .data..Lubsan_data*/.data..Lubsan_type* sections explicitly
	iio: adc: meson-saradc: check for devm_kasprintf failure
	iio: adc: meson-saradc: fix internal clock names
	iio: accel: kxcjk1013: Add KIOX010A ACPI Hardware-ID
	media: adv*/tc358743/ths8200: fill in min width/height/pixelclock
	ACPI: SPCR: Consider baud rate 0 as preconfigured state
	staging: pi433: fix potential null dereference
	f2fs: move dir data flush to write checkpoint process
	f2fs: fix race between write_checkpoint and write_begin
	f2fs: fix wrong return value of f2fs_acl_create
	i2c: sh_mobile: add support for r8a77990 (R-Car E3)
	arm64: io: Ensure calls to delay routines are ordered against prior readX()
	net: aquantia: return 'err' if set MPI_DEINIT state fails
	sunvdc: Do not spin in an infinite loop when vio_ldc_send() returns EAGAIN
	soc: bcm: brcmstb: Don't leak device tree node reference
	nfsd4: fix crash on writing v4_end_grace before nfsd startup
	drm: Clear state->acquire_ctx before leaving drm_atomic_helper_commit_duplicated_state()
	perf: arm_spe: handle devm_kasprintf() failure
	arm64: io: Ensure value passed to __iormb() is held in a 64-bit register
	Thermal: do not clear passive state during system sleep
	thermal: Fix locking in cooling device sysfs update cur_state
	firmware/efi: Add NULL pointer checks in efivars API functions
	s390/zcrypt: improve special ap message cmd handling
	mt76x0: dfs: fix IBI_R11 configuration on non-radar channels
	arm64: ftrace: don't adjust the LR value
	drm/v3d: Fix prime imports of buffers from other drivers.
	ARM: dts: mmp2: fix TWSI2
	ARM: dts: aspeed: add missing memory unit-address
	x86/fpu: Add might_fault() to user_insn()
	media: i2c: TDA1997x: select CONFIG_HDMI
	media: DaVinci-VPBE: fix error handling in vpbe_initialize()
	smack: fix access permissions for keyring
	xtensa: xtfpga.dtsi: fix dtc warnings about SPI
	usb: dwc3: Correct the logic for checking TRB full in __dwc3_prepare_one_trb()
	usb: dwc2: Disable power down feature on Samsung SoCs
	usb: hub: delay hub autosuspend if USB3 port is still link training
	timekeeping: Use proper seqcount initializer
	usb: mtu3: fix the issue about SetFeature(U1/U2_Enable)
	clk: sunxi-ng: a33: Set CLK_SET_RATE_PARENT for all audio module clocks
	media: imx274: select REGMAP_I2C
	drm/amdgpu/powerplay: fix clock stretcher limits on polaris (v2)
	tipc: fix node keep alive interval calculation
	driver core: Move async_synchronize_full call
	kobject: return error code if writing /sys/.../uevent fails
	IB/hfi1: Unreserve a reserved request when it is completed
	usb: dwc3: trace: add missing break statement to make compiler happy
	gpio: mt7621: report failure of devm_kasprintf()
	gpio: mt7621: pass mediatek_gpio_bank_probe() failure up the stack
	pinctrl: sx150x: handle failure case of devm_kstrdup
	iommu/amd: Fix amd_iommu=force_isolation
	ARM: dts: Fix OMAP4430 SDP Ethernet startup
	mips: bpf: fix encoding bug for mm_srlv32_op
	media: coda: fix H.264 deblocking filter controls
	ARM: dts: Fix up the D-Link DIR-685 MTD partition info
	watchdog: renesas_wdt: don't set divider while watchdog is running
	ARM: dts: imx51-zii-rdu1: Do not specify "power-gpio" for hpa1
	usb: dwc3: gadget: Disable CSP for stream OUT ep
	iommu/arm-smmu-v3: Avoid memory corruption from Hisilicon MSI payloads
	iommu/arm-smmu: Add support for qcom,smmu-v2 variant
	iommu/arm-smmu-v3: Use explicit mb() when moving cons pointer
	sata_rcar: fix deferred probing
	clk: imx6sl: ensure MMDC CH0 handshake is bypassed
	platform/x86: mlx-platform: Fix tachometer registers
	cpuidle: big.LITTLE: fix refcount leak
	OPP: Use opp_table->regulators to verify no regulator case
	tee: optee: avoid possible double list_del()
	drm/msm/dsi: fix dsi clock names in DSI 10nm PLL driver
	drm/msm: dpu: Only check flush register against pending flushes
	lightnvm: pblk: fix resubmission of overwritten write err lbas
	lightnvm: pblk: add lock protection to list operations
	i2c-axxia: check for error conditions first
	phy: sun4i-usb: add support for missing USB PHY index
	mlxsw: spectrum_acl: Limit priority value
	udf: Fix BUG on corrupted inode
	switchtec: Fix SWITCHTEC_IOCTL_EVENT_IDX_ALL flags overwrite
	selftests/bpf: use __bpf_constant_htons in test_prog.c
	ARM: pxa: avoid section mismatch warning
	ASoC: fsl: Fix SND_SOC_EUKREA_TLV320 build error on i.MX8M
	KVM: PPC: Book3S: Only report KVM_CAP_SPAPR_TCE_VFIO on powernv machines
	mmc: bcm2835: Recover from MMC_SEND_EXT_CSD
	mmc: bcm2835: reset host on timeout
	mmc: meson-mx-sdio: check devm_kasprintf for failure
	memstick: Prevent memstick host from getting runtime suspended during card detection
	mmc: sdhci-of-esdhc: Fix timeout checks
	mmc: sdhci-omap: Fix timeout checks
	mmc: sdhci-xenon: Fix timeout checks
	mmc: jz4740: Get CD/WP GPIOs from descriptors
	usb: renesas_usbhs: add support for RZ/G2E
	btrfs: harden agaist duplicate fsid on scanned devices
	serial: sh-sci: Fix locking in sci_submit_rx()
	serial: sh-sci: Resume PIO in sci_rx_interrupt() on DMA failure
	tty: serial: samsung: Properly set flags in autoCTS mode
	perf test: Fix perf_event_attr test failure
	perf dso: Fix unchecked usage of strncpy()
	perf header: Fix unchecked usage of strncpy()
	btrfs: use tagged writepage to mitigate livelock of snapshot
	perf probe: Fix unchecked usage of strncpy()
	i2c: sh_mobile: Add support for r8a774c0 (RZ/G2E)
	bnxt_en: Disable MSIX before re-reserving NQs/CMPL rings.
	tools/power/x86/intel_pstate_tracer: Fix non root execution for post processing a trace file
	livepatch: check kzalloc return values
	arm64: KVM: Skip MMIO insn after emulation
	usb: musb: dsps: fix otg state machine
	usb: musb: dsps: fix runtime pm for peripheral mode
	perf header: Fix up argument to ctime()
	perf tools: Cast off_t to s64 to avoid warning on bionic libc
	percpu: convert spin_lock_irq to spin_lock_irqsave.
	net: hns3: fix incomplete uninitialization of IRQ in the hns3_nic_uninit_vector_data()
	drm/amd/display: Add retry to read ddc_clock pin
	Bluetooth: hci_bcm: Handle deferred probing for the clock supply
	drm/amd/display: fix YCbCr420 blank color
	powerpc/uaccess: fix warning/error with access_ok()
	mac80211: fix radiotap vendor presence bitmap handling
	xfrm6_tunnel: Fix spi check in __xfrm6_tunnel_alloc_spi
	mlxsw: spectrum: Properly cleanup LAG uppers when removing port from LAG
	scsi: smartpqi: correct host serial num for ssa
	scsi: smartpqi: correct volume status
	scsi: smartpqi: increase fw status register read timeout
	cw1200: Fix concurrency use-after-free bugs in cw1200_hw_scan()
	net: hns3: add max vector number check for pf
	powerpc/perf: Fix thresholding counter data for unknown type
	iwlwifi: mvm: fix setting HE ppe FW config
	powerpc/powernv/ioda: Allocate indirect TCE levels of cached userspace addresses on demand
	mlx5: update timecounter at least twice per counter overflow
	drbd: narrow rcu_read_lock in drbd_sync_handshake
	drbd: disconnect, if the wrong UUIDs are attached on a connected peer
	drbd: skip spurious timeout (ping-timeo) when failing promote
	drbd: Avoid Clang warning about pointless switch statment
	drm/amd/display: validate extended dongle caps
	video: clps711x-fb: release disp device node in probe()
	md: fix raid10 hang issue caused by barrier
	fbdev: fbmem: behave better with small rotated displays and many CPUs
	i40e: define proper net_device::neigh_priv_len
	ice: Do not enable NAPI on q_vectors that have no rings
	igb: Fix an issue that PME is not enabled during runtime suspend
	ACPI/APEI: Clear GHES block_status before panic()
	fbdev: fbcon: Fix unregister crash when more than one framebuffer
	powerpc/mm: Fix reporting of kernel execute faults on the 8xx
	pinctrl: meson: meson8: fix the GPIO function for the GPIOAO pins
	pinctrl: meson: meson8b: fix the GPIO function for the GPIOAO pins
	KVM: x86: svm: report MSR_IA32_MCG_EXT_CTL as unsupported
	powerpc/fadump: Do not allow hot-remove memory from fadump reserved area.
	kvm: Change offset in kvm_write_guest_offset_cached to unsigned
	NFS: nfs_compare_mount_options always compare auth flavors.
	perf build: Don't unconditionally link the libbfd feature test to -liberty and -lz
	hwmon: (lm80) fix a missing check of the status of SMBus read
	hwmon: (lm80) fix a missing check of bus read in lm80 probe
	seq_buf: Make seq_buf_puts() null-terminate the buffer
	crypto: ux500 - Use proper enum in cryp_set_dma_transfer
	crypto: ux500 - Use proper enum in hash_set_dma_transfer
	MIPS: ralink: Select CONFIG_CPU_MIPSR2_IRQ_VI on MT7620/8
	cifs: check ntwrk_buf_start for NULL before dereferencing it
	f2fs: fix use-after-free issue when accessing sbi->stat_info
	um: Avoid marking pages with "changed protection"
	niu: fix missing checks of niu_pci_eeprom_read
	f2fs: fix sbi->extent_list corruption issue
	cgroup: fix parsing empty mount option string
	perf python: Do not force closing original perf descriptor in evlist.get_pollfd()
	scripts/decode_stacktrace: only strip base path when a prefix of the path
	arch/sh/boards/mach-kfr2r09/setup.c: fix struct mtd_oob_ops build warning
	ocfs2: don't clear bh uptodate for block read
	ocfs2: improve ocfs2 Makefile
	mm/page_alloc.c: don't call kasan_free_pages() at deferred mem init
	zram: fix lockdep warning of free block handling
	isdn: hisax: hfc_pci: Fix a possible concurrency use-after-free bug in HFCPCI_l1hw()
	gdrom: fix a memory leak bug
	fsl/fman: Use GFP_ATOMIC in {memac,tgec}_add_hash_mac_address()
	block/swim3: Fix -EBUSY error when re-opening device after unmount
	thermal: bcm2835: enable hwmon explicitly
	kdb: Don't back trace on a cpu that didn't round up
	PCI: imx: Enable MSI from downstream components
	thermal: generic-adc: Fix adc to temp interpolation
	HID: lenovo: Add checks to fix of_led_classdev_register
	arm64/sve: ptrace: Fix SVE_PT_REGS_OFFSET definition
	kernel/hung_task.c: break RCU locks based on jiffies
	proc/sysctl: fix return error for proc_doulongvec_minmax()
	kernel/hung_task.c: force console verbose before panic
	fs/epoll: drop ovflist branch prediction
	exec: load_script: don't blindly truncate shebang string
	kernel/kcov.c: mark write_comp_data() as notrace
	scripts/gdb: fix lx-version string output
	xfs: Fix xqmstats offsets in /proc/fs/xfs/xqmstat
	xfs: cancel COW blocks before swapext
	xfs: Fix error code in 'xfs_ioc_getbmap()'
	xfs: fix overflow in xfs_attr3_leaf_verify
	xfs: fix shared extent data corruption due to missing cow reservation
	xfs: fix transient reference count error in xfs_buf_resubmit_failed_buffers
	xfs: delalloc -> unwritten COW fork allocation can go wrong
	fs/xfs: fix f_ffree value for statfs when project quota is set
	xfs: fix PAGE_MASK usage in xfs_free_file_space
	xfs: fix inverted return from xfs_btree_sblock_verify_crc
	thermal: hwmon: inline helpers when CONFIG_THERMAL_HWMON is not set
	dccp: fool proof ccid_hc_[rt]x_parse_options()
	enic: fix checksum validation for IPv6
	lib/test_rhashtable: Make test_insert_dup() allocate its hash table dynamically
	net: dp83640: expire old TX-skb
	net: dsa: Fix lockdep false positive splat
	net: dsa: Fix NULL checking in dsa_slave_set_eee()
	net: dsa: mv88e6xxx: Fix counting of ATU violations
	net: dsa: slave: Don't propagate flag changes on down slave interfaces
	net/mlx5e: Force CHECKSUM_UNNECESSARY for short ethernet frames
	net: systemport: Fix WoL with password after deep sleep
	rds: fix refcount bug in rds_sock_addref
	Revert "net: phy: marvell: avoid pause mode on SGMII-to-Copper for 88e151x"
	rxrpc: bad unlock balance in rxrpc_recvmsg
	sctp: check and update stream->out_curr when allocating stream_out
	sctp: walk the list of asoc safely
	skge: potential memory corruption in skge_get_regs()
	virtio_net: Account for tx bytes and packets on sending xdp_frames
	net/mlx5e: FPGA, fix Innova IPsec TX offload data path performance
	xfs: eof trim writeback mapping as soon as it is cached
	ALSA: compress: Fix stop handling on compressed capture streams
	ALSA: usb-audio: Add support for new T+A USB DAC
	ALSA: hda - Serialize codec registrations
	ALSA: hda/realtek - Fix lose hp_pins for disable auto mute
	ALSA: hda/realtek - Use a common helper for hp pin reference
	ALSA: hda/realtek - Headset microphone support for System76 darp5
	fuse: call pipe_buf_release() under pipe lock
	fuse: decrement NR_WRITEBACK_TEMP on the right page
	fuse: handle zero sized retrieve correctly
	HID: debug: fix the ring buffer implementation
	dmaengine: bcm2835: Fix interrupt race on RT
	dmaengine: bcm2835: Fix abort of transactions
	dmaengine: imx-dma: fix wrong callback invoke
	futex: Handle early deadlock return correctly
	irqchip/gic-v3-its: Plug allocation race for devices sharing a DevID
	usb: phy: am335x: fix race condition in _probe
	usb: dwc3: gadget: Handle 0 xfer length for OUT EP
	usb: gadget: udc: net2272: Fix bitwise and boolean operations
	usb: gadget: musb: fix short isoc packets with inventra dma
	staging: speakup: fix tty-operation NULL derefs
	scsi: cxlflash: Prevent deadlock when adapter probe fails
	scsi: aic94xx: fix module loading
	KVM: x86: work around leak of uninitialized stack contents (CVE-2019-7222)
	kvm: fix kvm_ioctl_create_device() reference counting (CVE-2019-6974)
	KVM: nVMX: unconditionally cancel preemption timer in free_nested (CVE-2019-7221)
	cpu/hotplug: Fix "SMT disabled by BIOS" detection for KVM
	perf/x86/intel/uncore: Add Node ID mask
	x86/MCE: Initialize mce.bank in the case of a fatal error in mce_no_way_out()
	perf/core: Don't WARN() for impossible ring-buffer sizes
	perf tests evsel-tp-sched: Fix bitwise operator
	serial: fix race between flush_to_ldisc and tty_open
	serial: 8250_pci: Make PCI class test non fatal
	serial: sh-sci: Do not free irqs that have already been freed
	cacheinfo: Keep the old value if of_property_read_u32 fails
	IB/hfi1: Add limit test for RC/UC send via loopback
	perf/x86/intel: Delay memory deallocation until x86_pmu_dead_cpu()
	ath9k: dynack: make ewma estimation faster
	ath9k: dynack: check da->enabled first in sampling routines
	Linux 4.19.21

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2019-02-12 20:37:21 +01:00
Mark Rutland
1aeeb17668 perf/core: Don't WARN() for impossible ring-buffer sizes
commit 9dff0aa95a upstream.

The perf tool uses /proc/sys/kernel/perf_event_mlock_kb to determine how
large its ringbuffer mmap should be. This can be configured to arbitrary
values, which can be larger than the maximum possible allocation from
kmalloc.

When this is configured to a suitably large value (e.g. thanks to the
perf fuzzer), attempting to use perf record triggers a WARN_ON_ONCE() in
__alloc_pages_nodemask():

   WARNING: CPU: 2 PID: 5666 at mm/page_alloc.c:4511 __alloc_pages_nodemask+0x3f8/0xbc8

Let's avoid this by checking that the requested allocation is possible
before calling kzalloc.

Reported-by: Julien Thierry <julien.thierry@arm.com>
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Julien Thierry <julien.thierry@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20190110142745.25495-1-mark.rutland@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-02-12 19:47:26 +01:00
Josh Poimboeuf
97a7fa90ea cpu/hotplug: Fix "SMT disabled by BIOS" detection for KVM
commit b284909aba upstream.

With the following commit:

  73d5e2b472 ("cpu/hotplug: detect SMT disabled by BIOS")

... the hotplug code attempted to detect when SMT was disabled by BIOS,
in which case it reported SMT as permanently disabled.  However, that
code broke a virt hotplug scenario, where the guest is booted with only
primary CPU threads, and a sibling is brought online later.

The problem is that there doesn't seem to be a way to reliably
distinguish between the HW "SMT disabled by BIOS" case and the virt
"sibling not yet brought online" case.  So the above-mentioned commit
was a bit misguided, as it permanently disabled SMT for both cases,
preventing future virt sibling hotplugs.

Going back and reviewing the original problems which were attempted to
be solved by that commit, when SMT was disabled in BIOS:

  1) /sys/devices/system/cpu/smt/control showed "on" instead of
     "notsupported"; and

  2) vmx_vm_init() was incorrectly showing the L1TF_MSG_SMT warning.

I'd propose that we instead consider #1 above to not actually be a
problem.  Because, at least in the virt case, it's possible that SMT
wasn't disabled by BIOS and a sibling thread could be brought online
later.  So it makes sense to just always default the smt control to "on"
to allow for that possibility (assuming cpuid indicates that the CPU
supports SMT).

The real problem is #2, which has a simple fix: change vmx_vm_init() to
query the actual current SMT state -- i.e., whether any siblings are
currently online -- instead of looking at the SMT "control" sysfs value.

So fix it by:

  a) reverting the original "fix" and its followup fix:

     73d5e2b472 ("cpu/hotplug: detect SMT disabled by BIOS")
     bc2d8d262c ("cpu/hotplug: Fix SMT supported evaluation")

     and

  b) changing vmx_vm_init() to query the actual current SMT state --
     instead of the sysfs control value -- to determine whether the L1TF
     warning is needed.  This also requires the 'sched_smt_present'
     variable to exported, instead of 'cpu_smt_control'.

Fixes: 73d5e2b472 ("cpu/hotplug: detect SMT disabled by BIOS")
Reported-by: Igor Mammedov <imammedo@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Joe Mario <jmario@redhat.com>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: kvm@vger.kernel.org
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/e3a85d585da28cc333ecbc1e78ee9216e6da9396.1548794349.git.jpoimboe@redhat.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-02-12 19:47:25 +01:00
Thomas Gleixner
ee73954d9a futex: Handle early deadlock return correctly
commit 1a1fb985f2 upstream.

commit 56222b212e ("futex: Drop hb->lock before enqueueing on the
rtmutex") changed the locking rules in the futex code so that the hash
bucket lock is not longer held while the waiter is enqueued into the
rtmutex wait list. This made the lock and the unlock path symmetric, but
unfortunately the possible early exit from __rt_mutex_proxy_start() due to
a detected deadlock was not updated accordingly. That allows a concurrent
unlocker to observe inconsitent state which triggers the warning in the
unlock path.

futex_lock_pi()                         futex_unlock_pi()
  lock(hb->lock)
  queue(hb_waiter)				lock(hb->lock)
  lock(rtmutex->wait_lock)
  unlock(hb->lock)
                                        // acquired hb->lock
                                        hb_waiter = futex_top_waiter()
                                        lock(rtmutex->wait_lock)
  __rt_mutex_proxy_start()
     ---> fail
          remove(rtmutex_waiter);
     ---> returns -EDEADLOCK
  unlock(rtmutex->wait_lock)
                                        // acquired wait_lock
                                        wake_futex_pi()
                                        rt_mutex_next_owner()
					  --> returns NULL
                                          --> WARN

  lock(hb->lock)
  unqueue(hb_waiter)

The problem is caused by the remove(rtmutex_waiter) in the failure case of
__rt_mutex_proxy_start() as this lets the unlocker observe a waiter in the
hash bucket but no waiter on the rtmutex, i.e. inconsistent state.

The original commit handles this correctly for the other early return cases
(timeout, signal) by delaying the removal of the rtmutex waiter until the
returning task reacquired the hash bucket lock.

Treat the failure case of __rt_mutex_proxy_start() in the same way and let
the existing cleanup code handle the eventual handover of the rtmutex
gracefully. The regular rt_mutex_proxy_start() gains the rtmutex waiter
removal for the failure case, so that the other callsites are still
operating correctly.

Add proper comments to the code so all these details are fully documented.

Thanks to Peter for helping with the analysis and writing the really
valuable code comments.

Fixes: 56222b212e ("futex: Drop hb->lock before enqueueing on the rtmutex")
Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Co-developed-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: linux-s390@vger.kernel.org
Cc: Stefan Liebler <stli@linux.ibm.com>
Cc: Sebastian Sewior <bigeasy@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1901292311410.1950@nanos.tec.linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-02-12 19:47:24 +01:00
Anders Roxell
58e57bcbc1 kernel/kcov.c: mark write_comp_data() as notrace
[ Upstream commit 6347244316 ]

Since __sanitizer_cov_trace_const_cmp4 is marked as notrace, the
function called from __sanitizer_cov_trace_const_cmp4 shouldn't be
traceable either.  ftrace_graph_caller() gets called every time func
write_comp_data() gets called if it isn't marked 'notrace'.  This is the
backtrace from gdb:

 #0  ftrace_graph_caller () at ../arch/arm64/kernel/entry-ftrace.S:179
 #1  0xffffff8010201920 in ftrace_caller () at ../arch/arm64/kernel/entry-ftrace.S:151
 #2  0xffffff8010439714 in write_comp_data (type=5, arg1=0, arg2=0, ip=18446743524224276596) at ../kernel/kcov.c:116
 #3  0xffffff8010439894 in __sanitizer_cov_trace_const_cmp4 (arg1=<optimized out>, arg2=<optimized out>) at ../kernel/kcov.c:188
 #4  0xffffff8010201874 in prepare_ftrace_return (self_addr=18446743524226602768, parent=0xffffff801014b918, frame_pointer=18446743524223531344) at ./include/generated/atomic-instrumented.h:27
 #5  0xffffff801020194c in ftrace_graph_caller () at ../arch/arm64/kernel/entry-ftrace.S:182

Rework so that write_comp_data() that are called from
__sanitizer_cov_trace_*_cmp*() are marked as 'notrace'.

Commit 903e8ff867 ("kernel/kcov.c: mark funcs in __sanitizer_cov_trace_pc() as notrace")
missed to mark write_comp_data() as 'notrace'. When that patch was
created gcc-7 was used. In lib/Kconfig.debug
config KCOV_ENABLE_COMPARISONS
	depends on $(cc-option,-fsanitize-coverage=trace-cmp)

That code path isn't hit with gcc-7. However, it were that with gcc-8.

Link: http://lkml.kernel.org/r/20181206143011.23719-1-anders.roxell@linaro.org
Signed-off-by: Anders Roxell <anders.roxell@linaro.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Co-developed-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-02-12 19:47:20 +01:00
Liu, Chuansheng
f0d32c54ff kernel/hung_task.c: force console verbose before panic
[ Upstream commit 168e06f793 ]

Based on commit 401c636a0e ("kernel/hung_task.c: show all hung tasks
before panic"), we could get the call stack of hung task.

However, if the console loglevel is not high, we still can not see the
useful panic information in practice, and in most cases users don't set
console loglevel to high level.

This patch is to force console verbose before system panic, so that the
real useful information can be seen in the console, instead of being
like the following, which doesn't have hung task information.

  INFO: task init:1 blocked for more than 120 seconds.
        Tainted: G     U  W         4.19.0-quilt-2e5dc0ac-g51b6c21d76cc #1
  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  Kernel panic - not syncing: hung_task: blocked tasks
  CPU: 2 PID: 479 Comm: khungtaskd Tainted: G     U  W         4.19.0-quilt-2e5dc0ac-g51b6c21d76cc #1
  Call Trace:
   dump_stack+0x4f/0x65
   panic+0xde/0x231
   watchdog+0x290/0x410
   kthread+0x12c/0x150
   ret_from_fork+0x35/0x40
  reboot: panic mode set: p,w
  Kernel Offset: 0x34000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)

Link: http://lkml.kernel.org/r/27240C0AC20F114CBF8149A2696CBE4A6015B675@SHSMSX101.ccr.corp.intel.com
Signed-off-by: Chuansheng Liu <chuansheng.liu@intel.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-02-12 19:47:19 +01:00
Cheng Lin
9beb84c027 proc/sysctl: fix return error for proc_doulongvec_minmax()
[ Upstream commit 09be178400 ]

If the number of input parameters is less than the total parameters, an
EINVAL error will be returned.

For example, we use proc_doulongvec_minmax to pass up to two parameters
with kern_table:

{
	.procname       = "monitor_signals",
	.data           = &monitor_sigs,
	.maxlen         = 2*sizeof(unsigned long),
	.mode           = 0644,
	.proc_handler   = proc_doulongvec_minmax,
},

Reproduce:

When passing two parameters, it's work normal.  But passing only one
parameter, an error "Invalid argument"(EINVAL) is returned.

  [root@cl150 ~]# echo 1 2 > /proc/sys/kernel/monitor_signals
  [root@cl150 ~]# cat /proc/sys/kernel/monitor_signals
  1       2
  [root@cl150 ~]# echo 3 > /proc/sys/kernel/monitor_signals
  -bash: echo: write error: Invalid argument
  [root@cl150 ~]# echo $?
  1
  [root@cl150 ~]# cat /proc/sys/kernel/monitor_signals
  3       2
  [root@cl150 ~]#

The following is the result after apply this patch.  No error is
returned when the number of input parameters is less than the total
parameters.

  [root@cl150 ~]# echo 1 2 > /proc/sys/kernel/monitor_signals
  [root@cl150 ~]# cat /proc/sys/kernel/monitor_signals
  1       2
  [root@cl150 ~]# echo 3 > /proc/sys/kernel/monitor_signals
  [root@cl150 ~]# echo $?
  0
  [root@cl150 ~]# cat /proc/sys/kernel/monitor_signals
  3       2
  [root@cl150 ~]#

There are three processing functions dealing with digital parameters,
__do_proc_dointvec/__do_proc_douintvec/__do_proc_doulongvec_minmax.

This patch deals with __do_proc_doulongvec_minmax, just as
__do_proc_dointvec does, adding a check for parameters 'left'.  In
__do_proc_douintvec, its code implementation explicitly does not support
multiple inputs.

static int __do_proc_douintvec(...){
         ...
         /*
          * Arrays are not supported, keep this simple. *Do not* add
          * support for them.
          */
         if (vleft != 1) {
                 *lenp = 0;
                 return -EINVAL;
         }
         ...
}

So, just __do_proc_doulongvec_minmax has the problem.  And most use of
proc_doulongvec_minmax/proc_doulongvec_ms_jiffies_minmax just have one
parameter.

Link: http://lkml.kernel.org/r/1544081775-15720-1-git-send-email-cheng.lin130@zte.com.cn
Signed-off-by: Cheng Lin <cheng.lin130@zte.com.cn>
Acked-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-02-12 19:47:19 +01:00