Commit graph

28,672 commits

Author SHA1 Message Date
Quentin Perret
549020c814 ANDROID: sched: Disable find_best_target() by default
Now that the mainline EAS wake-up path has been extended to cope with
prefer-idle tasks, the need for a dedicated Android-specific wake-up
routine (find_best_target()) becomes less clear. Indeed, main reasons
for introducting find_best_target() in the first place were:

  1. the energy_diff function was very slow, so we couldn't afford to
     use it on all CPUs for each wake-up for latency reasons;

  2. schedtune provides additional information about tasks (the
     prefer-idle flag in particular) which needed to be taken into
     account in the placement algorithm.

Now that the energy diff calculation is much faster (with the simplified
energy model) and that the EAS path is aware of prefer-idle tasks, there
is no clear reason to use find_best_target() any more.

So, let's disable it for now to minimize the amount of out-of-tree code
used in the scheduler. If using the mainline path doesn't cause
regressions, it is a good sign find_best_target() can be removed safely,
eventually. Otherwise, reverting back to the old behaviour is trivial
since this patch only changes the sched_feat default, but doesn't remove
the fbt() code path.

Bug: 120440300
Change-Id: Idb5d68a3c4af7d2212e0922ab6d9a089170b5e1c
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2019-03-27 15:58:02 +00:00
Quentin Perret
d0eb1f3514 ANDROID: sched/fair: Make the EAS wake-up prefer-idle aware
Make the mainline EAS wake-up path aware of prefer idle tasks in
preparation for disabling find_best_target().

What is done in the mainline algoritm isn't strictly equivalent to the
find_best_target() algorithm but comes real close, and isn't very
invasive. The main differences with the original find_best_target()
behaviour are the following:

 1. the policy for prefer idle when there isn't a single idle CPU in the
    system is simpler now. We just pick the CPU with the highest spare
    capacity;

 2. the cstate awareness for prefer idle is implemented by minimizing
    the exit latency rather than the idle state index. This is how it is
    done in the slow path (find_idlest_group_cpu()), it doesn't require
    us to keep hooks into CPUIdle, and should actually be better because
    what we want is a CPU that can wake up quickly;

 3. non-prefer-idle tasks just use the standard mainline energy-aware
    wake-up path, which decides the placement using the Energy Model.

Bug: 120440300
Change-Id: I57769c90c57115f6a28d27c5a88e08aa93a30a56
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2019-03-27 15:58:02 +00:00
Vincent Guittot
d36e8b820e UPSTREAM: sched/pelt: Skip updating util_est when utilization is higher than CPU's capacity
util_est is mainly meant to be a lower-bound for tasks utilization.
That's why task_util_est() returns the actual util_avg when it's higher
than the estimated utilization.

With new invaraince signal and without any special check on samples
collection, if a task is limited because of thermal capping for
example, we could end up overestimating its utilization and thus
perhaps generating an unwanted frequency spike when the capping is
relaxed... and (even worst) it will take some more activations for the
estimated utilization to converge back to the actual utilization.

Since we cannot easily know if there is idle time in a CPU when a task
completes an activation with a utilization higher then the CPU capacity,
we skip the sampling when utilization is higher than CPU's capacity.

Bug: 120440300
Change-Id: If1a6001451f80acb953e2a5f955fd302b1b73bc0
Suggested-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: dietmar.eggemann@arm.com
Cc: pjt@google.com
Cc: pkondeti@codeaurora.org
Cc: quentin.perret@arm.com
Cc: rjw@rjwysocki.net
Cc: srinivas.pandruvada@linux.intel.com
Cc: thara.gopinath@linaro.org
Link: https://lkml.kernel.org/r/1548257214-13745-4-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 10a35e6812)
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2019-03-26 14:22:50 +00:00
Vincent Guittot
eb0db1782a UPSTREAM: sched/fair: Update scale invariance of PELT
The current implementation of load tracking invariance scales the
contribution with current frequency and uarch performance (only for
utilization) of the CPU. One main result of this formula is that the
figures are capped by current capacity of CPU. Another one is that the
load_avg is not invariant because not scaled with uarch.

The util_avg of a periodic task that runs r time slots every p time slots
varies in the range :

    U * (1-y^r)/(1-y^p) * y^i < Utilization < U * (1-y^r)/(1-y^p)

with U is the max util_avg value = SCHED_CAPACITY_SCALE

At a lower capacity, the range becomes:

    U * C * (1-y^r')/(1-y^p) * y^i' < Utilization <  U * C * (1-y^r')/(1-y^p)

with C reflecting the compute capacity ratio between current capacity and
max capacity.

so C tries to compensate changes in (1-y^r') but it can't be accurate.

Instead of scaling the contribution value of PELT algo, we should scale the
running time. The PELT signal aims to track the amount of computation of
tasks and/or rq so it seems more correct to scale the running time to
reflect the effective amount of computation done since the last update.

In order to be fully invariant, we need to apply the same amount of
running time and idle time whatever the current capacity. Because running
at lower capacity implies that the task will run longer, we have to ensure
that the same amount of idle time will be applied when system becomes idle
and no idle time has been "stolen". But reaching the maximum utilization
value (SCHED_CAPACITY_SCALE) means that the task is seen as an
always-running task whatever the capacity of the CPU (even at max compute
capacity). In this case, we can discard this "stolen" idle times which
becomes meaningless.

In order to achieve this time scaling, a new clock_pelt is created per rq.
The increase of this clock scales with current capacity when something
is running on rq and synchronizes with clock_task when rq is idle. With
this mechanism, we ensure the same running and idle time whatever the
current capacity. This also enables to simplify the pelt algorithm by
removing all references of uarch and frequency and applying the same
contribution to utilization and loads. Furthermore, the scaling is done
only once per update of clock (update_rq_clock_task()) instead of during
each update of sched_entities and cfs/rt/dl_rq of the rq like the current
implementation. This is interesting when cgroup are involved as shown in
the results below:

On a hikey (octo Arm64 platform).
Performance cpufreq governor and only shallowest c-state to remove variance
generated by those power features so we only track the impact of pelt algo.

each test runs 16 times:

	./perf bench sched pipe
	(higher is better)
	kernel	tip/sched/core     + patch
	        ops/seconds        ops/seconds         diff
	cgroup
	root    59652(+/- 0.18%)   59876(+/- 0.24%)    +0.38%
	level1  55608(+/- 0.27%)   55923(+/- 0.24%)    +0.57%
	level2  52115(+/- 0.29%)   52564(+/- 0.22%)    +0.86%

	hackbench -l 1000
	(lower is better)
	kernel	tip/sched/core     + patch
	        duration(sec)      duration(sec)        diff
	cgroup
	root    4.453(+/- 2.37%)   4.383(+/- 2.88%)     -1.57%
	level1  4.859(+/- 8.50%)   4.830(+/- 7.07%)     -0.60%
	level2  5.063(+/- 9.83%)   4.928(+/- 9.66%)     -2.66%

Then, the responsiveness of PELT is improved when CPU is not running at max
capacity with this new algorithm. I have put below some examples of
duration to reach some typical load values according to the capacity of the
CPU with current implementation and with this patch. These values has been
computed based on the geometric series and the half period value:

  Util (%)     max capacity  half capacity(mainline)  half capacity(w/ patch)
  972 (95%)    138ms         not reachable            276ms
  486 (47.5%)  30ms          138ms                     60ms
  256 (25%)    13ms           32ms                     26ms

On my hikey (octo Arm64 platform) with schedutil governor, the time to
reach max OPP when starting from a null utilization, decreases from 223ms
with current scale invariance down to 121ms with the new algorithm.

Bug: 120440300
Change-Id: I0bd4ed2317f2a9a965634e53ce1476417af697a6
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: dietmar.eggemann@arm.com
Cc: patrick.bellasi@arm.com
Cc: pjt@google.com
Cc: pkondeti@codeaurora.org
Cc: quentin.perret@arm.com
Cc: rjw@rjwysocki.net
Cc: srinivas.pandruvada@linux.intel.com
Cc: thara.gopinath@linaro.org
Link: https://lkml.kernel.org/r/1548257214-13745-3-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 2312729688)
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2019-03-26 14:22:50 +00:00
Vincent Guittot
0dd28f4253 UPSTREAM: sched/fair: Move the rq_of() helper function
Move rq_of() helper function so it can be used in pelt.c

[ mingo: Improve readability while at it. ]

Bug: 120440300
Change-Id: I2133979476631d68baaffcaa308f4cdab94f22b1
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: dietmar.eggemann@arm.com
Cc: patrick.bellasi@arm.com
Cc: pjt@google.com
Cc: pkondeti@codeaurora.org
Cc: quentin.perret@arm.com
Cc: rjw@rjwysocki.net
Cc: srinivas.pandruvada@linux.intel.com
Cc: thara.gopinath@linaro.org
Link: https://lkml.kernel.org/r/1548257214-13745-2-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 62478d9911)
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2019-03-26 14:22:50 +00:00
Dietmar Eggemann
5bfd3ce4e0 UPSTREAM: sched/fair: Remove setting task's se->runnable_weight during PELT update
A CFS (SCHED_OTHER, SCHED_BATCH or SCHED_IDLE policy) task's
se->runnable_weight must always be in sync with its se->load.weight.

se->runnable_weight is set to se->load.weight when the task is
forked (init_entity_runnable_average()) or reniced (reweight_entity()).

There are two cases in set_load_weight() which since they currently only
set se->load.weight could lead to a situation in which se->load.weight
is different to se->runnable_weight for a CFS task:

(1) A task switches to SCHED_IDLE.

(2) A SCHED_FIFO, SCHED_RR or SCHED_DEADLINE task which has been reniced
    (during which only its static priority gets set) switches to
    SCHED_OTHER or SCHED_BATCH.

Set se->runnable_weight to se->load.weight in these two cases to prevent
this. This eliminates the need to explicitly set it to se->load.weight
during PELT updates in the CFS scheduler fastpath.

Bug: 120440300
Change-Id: I52184a9e1fd53cb42ef3ae546b1fae78b744c9ad
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: http://lkml.kernel.org/r/20180803140538.1178-1-dietmar.eggemann@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 4a465e3ebb)
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2019-03-26 14:22:49 +00:00
Greg Kroah-Hartman
bb418a146a This is the 4.19.31 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlyWhJcACgkQONu9yGCS
 aT6XzxAAzP2QGzC4SVPgcFH1woF/d8Cz0zQ81mLXzjXtEPm39fZCM2hbBnxkXLu1
 peFyrKNk6/c9541D9gsQCQT6Fu+H6u1bJKcIezlKJ2xyB/MsU1hXkjZrTJYW3RRs
 gimy1EGdood2el1ubEBZiaspazoeRzBqtg1Nsmr4V0l+RT8HwtKKw+0+Nxixfp59
 NoVkqTpPI5mL0FiH2R9ogcfg3SvgMZOsOhOBjdPvSjiJJsbvIWcW48MCs95XSUpF
 R+l/fWn+oiFCcIqBaFheujuqZMvVrUHZHaWAPMuoR/c3Cdf0lTBokdv6UM9c0nv3
 61jX5r5ImRI/dfQANN5mbB1YKcs5xOI+I7QZHQ2q4clsWrWyLapXW4clrAZJ6z5t
 UVeVbuLV2y5PL9GJyBcXpyY0BOf4e2gZURaPY3C5McNwgybNoiR0ZePqKb8ZhZyh
 jYOYRoBjJJpZoVTSt6MNX95NTvGaSAtqKMu1s3IeMfpwCfQKBPMOuBHr/dUqSC6I
 U0xxjk/71C15dSPVcTVJT/lmcKc6TXgoagnfbn8GBtDOAjBNsYyUJLQI+db1ERCe
 9MEB9k1Z87ROQ5jQCQmWsewOVAtFZBEvSszFmpKv3zTe8M2oFpXG56zckdiumwHU
 nSfeZTTeWzsFJd30MioEnGYm3ZwKwZx7wi0x4B4WWvBfSpp20Us=
 =xtLx
 -----END PGP SIGNATURE-----

Merge 4.19.31 into android-4.19

Changes in 4.19.31
	media: videobuf2-v4l2: drop WARN_ON in vb2_warn_zero_bytesused()
	9p: use inode->i_lock to protect i_size_write() under 32-bit
	9p/net: fix memory leak in p9_client_create
	ASoC: fsl_esai: fix register setting issue in RIGHT_J mode
	ASoC: codecs: pcm186x: fix wrong usage of DECLARE_TLV_DB_SCALE()
	ASoC: codecs: pcm186x: Fix energysense SLEEP bit
	iio: adc: exynos-adc: Fix NULL pointer exception on unbind
	mei: hbm: clean the feature flags on link reset
	mei: bus: move hw module get/put to probe/release
	stm class: Fix an endless loop in channel allocation
	crypto: caam - fix hash context DMA unmap size
	crypto: ccree - fix missing break in switch statement
	crypto: caam - fixed handling of sg list
	crypto: caam - fix DMA mapping of stack memory
	crypto: ccree - fix free of unallocated mlli buffer
	crypto: ccree - unmap buffer before copying IV
	crypto: ccree - don't copy zero size ciphertext
	crypto: cfb - add missing 'chunksize' property
	crypto: cfb - remove bogus memcpy() with src == dest
	crypto: ahash - fix another early termination in hash walk
	crypto: rockchip - fix scatterlist nents error
	crypto: rockchip - update new iv to device in multiple operations
	drm/imx: ignore plane updates on disabled crtcs
	gpu: ipu-v3: Fix i.MX51 CSI control registers offset
	drm/imx: imx-ldb: add missing of_node_puts
	gpu: ipu-v3: Fix CSI offsets for imx53
	ASoC: rt5682: Correct the setting while select ASRC clk for AD/DA filter
	clocksource: timer-ti-dm: Fix pwm dmtimer usage of fck reparenting
	KVM: arm/arm64: vgic: Make vgic_dist->lpi_list_lock a raw_spinlock
	arm64: dts: rockchip: fix graph_port warning on rk3399 bob kevin and excavator
	s390/dasd: fix using offset into zero size array error
	Input: pwm-vibra - prevent unbalanced regulator
	Input: pwm-vibra - stop regulator after disabling pwm, not before
	ARM: dts: Configure clock parent for pwm vibra
	ARM: OMAP2+: Variable "reg" in function omap4_dsi_mux_pads() could be uninitialized
	ASoC: dapm: fix out-of-bounds accesses to DAPM lookup tables
	ASoC: rsnd: fixup rsnd_ssi_master_clk_start() user count check
	KVM: arm/arm64: Reset the VCPU without preemption and vcpu state loaded
	arm/arm64: KVM: Allow a VCPU to fully reset itself
	arm/arm64: KVM: Don't panic on failure to properly reset system registers
	KVM: arm/arm64: vgic: Always initialize the group of private IRQs
	KVM: arm64: Forbid kprobing of the VHE world-switch code
	ASoC: samsung: Prevent clk_get_rate() calls in atomic context
	ARM: OMAP2+: fix lack of timer interrupts on CPU1 after hotplug
	Input: cap11xx - switch to using set_brightness_blocking()
	Input: ps2-gpio - flush TX work when closing port
	Input: matrix_keypad - use flush_delayed_work()
	mac80211: call drv_ibss_join() on restart
	mac80211: Fix Tx aggregation session tear down with ITXQs
	netfilter: compat: initialize all fields in xt_init
	blk-mq: insert rq with DONTPREP to hctx dispatch list when requeue
	ipvs: fix dependency on nf_defrag_ipv6
	floppy: check_events callback should not return a negative number
	xprtrdma: Make sure Send CQ is allocated on an existing compvec
	NFS: Don't use page_file_mapping after removing the page
	mm/gup: fix gup_pmd_range() for dax
	Revert "mm: use early_pfn_to_nid in page_ext_init"
	scsi: qla2xxx: Fix panic from use after free in qla2x00_async_tm_cmd
	net: dsa: bcm_sf2: potential array overflow in bcm_sf2_sw_suspend()
	x86/CPU: Add Icelake model number
	mm: page_alloc: fix ref bias in page_frag_alloc() for 1-byte allocs
	net: hns: Fix object reference leaks in hns_dsaf_roce_reset()
	i2c: cadence: Fix the hold bit setting
	i2c: bcm2835: Clear current buffer pointers and counts after a transfer
	auxdisplay: ht16k33: fix potential user-after-free on module unload
	Input: st-keyscan - fix potential zalloc NULL dereference
	clk: sunxi-ng: v3s: Fix TCON reset de-assert bit
	kallsyms: Handle too long symbols in kallsyms.c
	clk: sunxi: A31: Fix wrong AHB gate number
	esp: Skip TX bytes accounting when sending from a request socket
	ARM: 8824/1: fix a migrating irq bug when hotplug cpu
	bpf: only adjust gso_size on bytestream protocols
	bpf: fix lockdep false positive in stackmap
	af_key: unconditionally clone on broadcast
	ARM: 8835/1: dma-mapping: Clear DMA ops on teardown
	assoc_array: Fix shortcut creation
	keys: Fix dependency loop between construction record and auth key
	scsi: libiscsi: Fix race between iscsi_xmit_task and iscsi_complete_task
	net: systemport: Fix reception of BPDUs
	net: dsa: bcm_sf2: Do not assume DSA master supports WoL
	pinctrl: meson: meson8b: fix the sdxc_a data 1..3 pins
	qmi_wwan: apply SET_DTR quirk to Sierra WP7607
	net: mv643xx_eth: disable clk on error path in mv643xx_eth_shared_probe()
	xfrm: Fix inbound traffic via XFRM interfaces across network namespaces
	mailbox: bcm-flexrm-mailbox: Fix FlexRM ring flush timeout issue
	ASoC: topology: free created components in tplg load error
	qed: Fix iWARP buffer size provided for syn packet processing.
	qed: Fix iWARP syn packet mac address validation.
	ARM: dts: armada-xp: fix Armada XP boards NAND description
	arm64: Relax GIC version check during early boot
	ARM: tegra: Restore DT ABI on Tegra124 Chromebooks
	net: marvell: mvneta: fix DMA debug warning
	mm: handle lru_add_drain_all for UP properly
	tmpfs: fix link accounting when a tmpfile is linked in
	ixgbe: fix older devices that do not support IXGBE_MRQC_L3L4TXSWEN
	ARCv2: lib: memcpy: fix doing prefetchw outside of buffer
	ARC: uacces: remove lp_start, lp_end from clobber list
	ARCv2: support manual regfile save on interrupts
	ARCv2: don't assume core 0x54 has dual issue
	phonet: fix building with clang
	mac80211_hwsim: propagate genlmsg_reply return code
	bpf, lpm: fix lookup bug in map_delete_elem
	net: thunderx: make CFG_DONE message to run through generic send-ack sequence
	net: thunderx: add nicvf_send_msg_to_pf result check for set_rx_mode_task
	nfp: bpf: fix code-gen bug on BPF_ALU | BPF_XOR | BPF_K
	nfp: bpf: fix ALU32 high bits clearance bug
	bnxt_en: Fix typo in firmware message timeout logic.
	bnxt_en: Wait longer for the firmware message response to complete.
	net: set static variable an initial value in atl2_probe()
	selftests: fib_tests: sleep after changing carrier. again.
	tmpfs: fix uninitialized return value in shmem_link
	stm class: Prevent division by zero
	nfit: acpi_nfit_ctl(): Check out_obj->type in the right place
	acpi/nfit: Fix bus command validation
	nfit/ars: Attempt a short-ARS whenever the ARS state is idle at boot
	nfit/ars: Attempt short-ARS even in the no_init_ars case
	libnvdimm/label: Clear 'updating' flag after label-set update
	libnvdimm, pfn: Fix over-trim in trim_pfn_device()
	libnvdimm/pmem: Honor force_raw for legacy pmem regions
	libnvdimm: Fix altmap reservation size calculation
	fix cgroup_do_mount() handling of failure exits
	crypto: aead - set CRYPTO_TFM_NEED_KEY if ->setkey() fails
	crypto: aegis - fix handling chunked inputs
	crypto: arm/crct10dif - revert to C code for short inputs
	crypto: arm64/aes-neonbs - fix returning final keystream block
	crypto: arm64/crct10dif - revert to C code for short inputs
	crypto: hash - set CRYPTO_TFM_NEED_KEY if ->setkey() fails
	crypto: morus - fix handling chunked inputs
	crypto: pcbc - remove bogus memcpy()s with src == dest
	crypto: skcipher - set CRYPTO_TFM_NEED_KEY if ->setkey() fails
	crypto: testmgr - skip crc32c context test for ahash algorithms
	crypto: x86/aegis - fix handling chunked inputs and MAY_SLEEP
	crypto: x86/aesni-gcm - fix crash on empty plaintext
	crypto: x86/morus - fix handling chunked inputs and MAY_SLEEP
	crypto: arm64/aes-ccm - fix logical bug in AAD MAC handling
	crypto: arm64/aes-ccm - fix bugs in non-NEON fallback routine
	CIFS: Do not reset lease state to NONE on lease break
	CIFS: Do not skip SMB2 message IDs on send failures
	CIFS: Fix read after write for files with read caching
	tracing: Use strncpy instead of memcpy for string keys in hist triggers
	tracing: Do not free iter->trace in fail path of tracing_open_pipe()
	tracing/perf: Use strndup_user() instead of buggy open-coded version
	xen: fix dom0 boot on huge systems
	ACPI / device_sysfs: Avoid OF modalias creation for removed device
	mmc: sdhci-esdhc-imx: fix HS400 timing issue
	mmc:fix a bug when max_discard is 0
	netfilter: ipt_CLUSTERIP: fix warning unused variable cn
	spi: ti-qspi: Fix mmap read when more than one CS in use
	spi: pxa2xx: Setup maximum supported DMA transfer length
	regulator: s2mps11: Fix steps for buck7, buck8 and LDO35
	regulator: max77620: Initialize values for DT properties
	regulator: s2mpa01: Fix step values for some LDOs
	clocksource/drivers/exynos_mct: Move one-shot check from tick clear to ISR
	clocksource/drivers/exynos_mct: Clear timer interrupt when shutdown
	clocksource/drivers/arch_timer: Workaround for Allwinner A64 timer instability
	s390/setup: fix early warning messages
	s390/virtio: handle find on invalid queue gracefully
	scsi: virtio_scsi: don't send sc payload with tmfs
	scsi: aacraid: Fix performance issue on logical drives
	scsi: sd: Optimal I/O size should be a multiple of physical block size
	scsi: target/iscsi: Avoid iscsit_release_commands_from_conn() deadlock
	scsi: qla2xxx: Fix LUN discovery if loop id is not assigned yet by firmware
	fs/devpts: always delete dcache dentry-s in dput()
	splice: don't merge into linked buffers
	ovl: During copy up, first copy up data and then xattrs
	ovl: Do not lose security.capability xattr over metadata file copy-up
	m68k: Add -ffreestanding to CFLAGS
	Btrfs: setup a nofs context for memory allocation at btrfs_create_tree()
	Btrfs: setup a nofs context for memory allocation at __btrfs_set_acl
	btrfs: ensure that a DUP or RAID1 block group has exactly two stripes
	Btrfs: fix corruption reading shared and compressed extents after hole punching
	soc: qcom: rpmh: Avoid accessing freed memory from batch API
	libertas_tf: don't set URB_ZERO_PACKET on IN USB transfer
	irqchip/gic-v3-its: Avoid parsing _indirect_ twice for Device table
	irqchip/brcmstb-l2: Use _irqsave locking variants in non-interrupt code
	x86/kprobes: Prohibit probing on optprobe template code
	cpufreq: kryo: Release OPP tables on module removal
	cpufreq: tegra124: add missing of_node_put()
	cpufreq: pxa2xx: remove incorrect __init annotation
	ext4: fix check of inode in swap_inode_boot_loader
	ext4: cleanup pagecache before swap i_data
	ext4: update quota information while swapping boot loader inode
	ext4: add mask of ext4 flags to swap
	ext4: fix crash during online resizing
	PCI/ASPM: Use LTR if already enabled by platform
	PCI/DPC: Fix print AER status in DPC event handling
	PCI: dwc: skip MSI init if MSIs have been explicitly disabled
	IB/hfi1: Close race condition on user context disable and close
	cxl: Wrap iterations over afu slices inside 'afu_list_lock'
	ext2: Fix underflow in ext2_max_size()
	clk: uniphier: Fix update register for CPU-gear
	clk: clk-twl6040: Fix imprecise external abort for pdmclk
	clk: samsung: exynos5: Fix possible NULL pointer exception on platform_device_alloc() failure
	clk: samsung: exynos5: Fix kfree() of const memory on setting driver_override
	clk: ingenic: Fix round_rate misbehaving with non-integer dividers
	clk: ingenic: Fix doc of ingenic_cgu_div_info
	usb: chipidea: tegra: Fix missed ci_hdrc_remove_device()
	usb: typec: tps6598x: handle block writes separately with plain-I2C adapters
	dmaengine: usb-dmac: Make DMAC system sleep callbacks explicit
	mm: hwpoison: fix thp split handing in soft_offline_in_use_page()
	mm/vmalloc: fix size check for remap_vmalloc_range_partial()
	mm/memory.c: do_fault: avoid usage of stale vm_area_struct
	kernel/sysctl.c: add missing range check in do_proc_dointvec_minmax_conv
	device property: Fix the length used in PROPERTY_ENTRY_STRING()
	intel_th: Don't reference unassigned outputs
	parport_pc: fix find_superio io compare code, should use equal test.
	i2c: tegra: fix maximum transfer size
	media: i2c: ov5640: Fix post-reset delay
	gpio: pca953x: Fix dereference of irq data in shutdown
	can: flexcan: FLEXCAN_IFLAG_MB: add () around macro argument
	drm/i915: Relax mmap VMA check
	bpf: only test gso type on gso packets
	serial: uartps: Fix stuck ISR if RX disabled with non-empty FIFO
	serial: 8250_of: assume reg-shift of 2 for mrvl,mmp-uart
	serial: 8250_pci: Fix number of ports for ACCES serial cards
	serial: 8250_pci: Have ACCES cards that use the four port Pericom PI7C9X7954 chip use the pci_pericom_setup()
	jbd2: clear dirty flag when revoking a buffer from an older transaction
	jbd2: fix compile warning when using JBUFFER_TRACE
	selinux: add the missing walk_size + len check in selinux_sctp_bind_connect
	security/selinux: fix SECURITY_LSM_NATIVE_LABELS on reused superblock
	powerpc/32: Clear on-stack exception marker upon exception return
	powerpc/wii: properly disable use of BATs when requested.
	powerpc/powernv: Make opal log only readable by root
	powerpc/83xx: Also save/restore SPRG4-7 during suspend
	powerpc/powernv: Don't reprogram SLW image on every KVM guest entry/exit
	powerpc: Fix 32-bit KVM-PR lockup and host crash with MacOS guest
	powerpc/ptrace: Simplify vr_get/set() to avoid GCC warning
	powerpc/hugetlb: Don't do runtime allocation of 16G pages in LPAR configuration
	powerpc/traps: fix recoverability of machine check handling on book3s/32
	powerpc/traps: Fix the message printed when stack overflows
	ARM: s3c24xx: Fix boolean expressions in osiris_dvs_notify
	arm64: Fix HCR.TGE status for NMI contexts
	arm64: debug: Ensure debug handlers check triggering exception level
	arm64: KVM: Fix architecturally invalid reset value for FPEXC32_EL2
	ipmi_si: fix use-after-free of resource->name
	dm: fix to_sector() for 32bit
	dm integrity: limit the rate of error messages
	mfd: sm501: Fix potential NULL pointer dereference
	cpcap-charger: generate events for userspace
	NFS: Fix I/O request leakages
	NFS: Fix an I/O request leakage in nfs_do_recoalesce
	NFS: Don't recoalesce on error in nfs_pageio_complete_mirror()
	nfsd: fix performance-limiting session calculation
	nfsd: fix memory corruption caused by readdir
	nfsd: fix wrong check in write_v4_end_grace()
	NFSv4.1: Reinitialise sequence results before retransmitting a request
	svcrpc: fix UDP on servers with lots of threads
	PM / wakeup: Rework wakeup source timer cancellation
	bcache: never writeback a discard operation
	stable-kernel-rules.rst: add link to networking patch queue
	vt: perform safe console erase in the right order
	x86/unwind/orc: Fix ORC unwind table alignment
	perf intel-pt: Fix CYC timestamp calculation after OVF
	perf tools: Fix split_kallsyms_for_kcore() for trampoline symbols
	perf auxtrace: Define auxtrace record alignment
	perf intel-pt: Fix overlap calculation for padding
	perf/x86/intel/uncore: Fix client IMC events return huge result
	perf intel-pt: Fix divide by zero when TSC is not available
	md: Fix failed allocation of md_register_thread
	tpm/tpm_crb: Avoid unaligned reads in crb_recv()
	tpm: Unify the send callback behaviour
	rcu: Do RCU GP kthread self-wakeup from softirq and interrupt
	media: imx: prpencvf: Stop upstream before disabling IDMA channel
	media: lgdt330x: fix lock status reporting
	media: uvcvideo: Avoid NULL pointer dereference at the end of streaming
	media: vimc: Add vimc-streamer for stream control
	media: imx: csi: Disable CSI immediately after last EOF
	media: imx: csi: Stop upstream before disabling IDMA channel
	drm/fb-helper: generic: Fix drm_fbdev_client_restore()
	drm/radeon/evergreen_cs: fix missing break in switch statement
	drm/amd/powerplay: correct power reading on fiji
	drm/amd/display: don't call dm_pp_ function from an fpu block
	KVM: Call kvm_arch_memslots_updated() before updating memslots
	KVM: x86/mmu: Detect MMIO generation wrap in any address space
	KVM: x86/mmu: Do not cache MMIO accesses while memslots are in flux
	KVM: nVMX: Sign extend displacements of VMX instr's mem operands
	KVM: nVMX: Apply addr size mask to effective address for VMX instructions
	KVM: nVMX: Ignore limit checks on VMX instructions using flat segments
	bcache: use (REQ_META|REQ_PRIO) to indicate bio for metadata
	s390/setup: fix boot crash for machine without EDAT-1
	Linux 4.19.31

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2019-03-23 21:13:30 +01:00
Zhang, Jun
e97a32a5a3 rcu: Do RCU GP kthread self-wakeup from softirq and interrupt
commit 1d1f898df6 upstream.

The rcu_gp_kthread_wake() function is invoked when it might be necessary
to wake the RCU grace-period kthread.  Because self-wakeups are normally
a useless waste of CPU cycles, if rcu_gp_kthread_wake() is invoked from
this kthread, it naturally refuses to do the wakeup.

Unfortunately, natural though it might be, this heuristic fails when
rcu_gp_kthread_wake() is invoked from an interrupt or softirq handler
that interrupted the grace-period kthread just after the final check of
the wait-event condition but just before the schedule() call.  In this
case, a wakeup is required, even though the call to rcu_gp_kthread_wake()
is within the RCU grace-period kthread's context.  Failing to provide
this wakeup can result in grace periods failing to start, which in turn
results in out-of-memory conditions.

This race window is quite narrow, but it actually did happen during real
testing.  It would of course need to be fixed even if it was strictly
theoretical in nature.

This patch does not Cc stable because it does not apply cleanly to
earlier kernel versions.

Fixes: 48a7639ce8 ("rcu: Make callers awaken grace-period kthread")
Reported-by: "He, Bo" <bo.he@intel.com>
Co-developed-by: "Zhang, Jun" <jun.zhang@intel.com>
Co-developed-by: "He, Bo" <bo.he@intel.com>
Co-developed-by: "xiao, jin" <jin.xiao@intel.com>
Co-developed-by: Bai, Jie A <jie.a.bai@intel.com>
Signed-off: "Zhang, Jun" <jun.zhang@intel.com>
Signed-off: "He, Bo" <bo.he@intel.com>
Signed-off: "xiao, jin" <jin.xiao@intel.com>
Signed-off: Bai, Jie A <jie.a.bai@intel.com>
Signed-off-by: "Zhang, Jun" <jun.zhang@intel.com>
[ paulmck: Switch from !in_softirq() to "!in_interrupt() &&
  !in_serving_softirq() to avoid redundant wakeups and to also handle the
  interrupt-handler scenario as well as the softirq-handler scenario that
  actually occurred in testing. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Link: https://lkml.kernel.org/r/CD6925E8781EFD4D8E11882D20FC406D52A11F61@SHSMSX104.ccr.corp.intel.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-03-23 20:10:12 +01:00
Zev Weiss
93c8a44a82 kernel/sysctl.c: add missing range check in do_proc_dointvec_minmax_conv
commit 8cf7630b29 upstream.

This bug has apparently existed since the introduction of this function
in the pre-git era (4500e91754d3 in Thomas Gleixner's history.git,
"[NET]: Add proc_dointvec_userhz_jiffies, use it for proper handling of
neighbour sysctls.").

As a minimal fix we can simply duplicate the corresponding check in
do_proc_dointvec_conv().

Link: http://lkml.kernel.org/r/20190207123426.9202-3-zev@bewilderbeest.net
Signed-off-by: Zev Weiss <zev@bewilderbeest.net>
Cc: Brendan Higgins <brendanhiggins@google.com>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: <stable@vger.kernel.org>	[2.6.2+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-03-23 20:10:04 +01:00
Jann Horn
24d5097655 tracing/perf: Use strndup_user() instead of buggy open-coded version
commit 83540fbc88 upstream.

The first version of this method was missing the check for
`ret == PATH_MAX`; then such a check was added, but it didn't call kfree()
on error, so there was still a small memory leak in the error case.
Fix it by using strndup_user() instead of open-coding it.

Link: http://lkml.kernel.org/r/20190220165443.152385-1-jannh@google.com

Cc: Ingo Molnar <mingo@kernel.org>
Cc: stable@vger.kernel.org
Fixes: 0eadcc7a7b ("perf/core: Fix perf_uprobe_init()")
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-03-23 20:09:56 +01:00
zhangyi (F)
f27077e5f5 tracing: Do not free iter->trace in fail path of tracing_open_pipe()
commit e7f0c424d0 upstream.

Commit d716ff71dd ("tracing: Remove taking of trace_types_lock in
pipe files") use the current tracer instead of the copy in
tracing_open_pipe(), but it forget to remove the freeing sentence in
the error path.

There's an error path that can call kfree(iter->trace) after the iter->trace
was assigned to tr->current_trace, which would be bad to free.

Link: http://lkml.kernel.org/r/1550060946-45984-1-git-send-email-yi.zhang@huawei.com

Cc: stable@vger.kernel.org
Fixes: d716ff71dd ("tracing: Remove taking of trace_types_lock in pipe files")
Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-03-23 20:09:56 +01:00
Tom Zanussi
ebca08d7e8 tracing: Use strncpy instead of memcpy for string keys in hist triggers
commit 9f0bbf3115 upstream.

Because there may be random garbage beyond a string's null terminator,
it's not correct to copy the the complete character array for use as a
hist trigger key.  This results in multiple histogram entries for the
'same' string key.

So, in the case of a string key, use strncpy instead of memcpy to
avoid copying in the extra bytes.

Before, using the gdbus entries in the following hist trigger as an
example:

  # echo 'hist:key=comm' > /sys/kernel/debug/tracing/events/sched/sched_waking/trigger
  # cat /sys/kernel/debug/tracing/events/sched/sched_waking/hist

  ...

  { comm: ImgDecoder #4                      } hitcount:        203
  { comm: gmain                              } hitcount:        213
  { comm: gmain                              } hitcount:        216
  { comm: StreamTrans #73                    } hitcount:        221
  { comm: mozStorage #3                      } hitcount:        230
  { comm: gdbus                              } hitcount:        233
  { comm: StyleThread#5                      } hitcount:        253
  { comm: gdbus                              } hitcount:        256
  { comm: gdbus                              } hitcount:        260
  { comm: StyleThread#4                      } hitcount:        271

  ...

  # cat /sys/kernel/debug/tracing/events/sched/sched_waking/hist | egrep gdbus | wc -l
  51

After:

  # cat /sys/kernel/debug/tracing/events/sched/sched_waking/hist | egrep gdbus | wc -l
  1

Link: http://lkml.kernel.org/r/50c35ae1267d64eee975b8125e151e600071d4dc.1549309756.git.tom.zanussi@linux.intel.com

Cc: Namhyung Kim <namhyung@kernel.org>
Cc: stable@vger.kernel.org
Fixes: 79e577cbce ("tracing: Support string type key properly")
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-03-23 20:09:56 +01:00
Al Viro
7a8b048430 fix cgroup_do_mount() handling of failure exits
commit 399504e21a upstream.

same story as with last May fixes in sysfs (7b745a4e40
"unfuck sysfs_mount()"); new_sb is left uninitialized
in case of early errors in kernfs_mount_ns() and papering
over it by treating any error from kernfs_mount_ns() as
equivalent to !new_ns ends up conflating the cases when
objects had never been transferred to a superblock with
ones when that has happened and resulting new superblock
had been dropped.  Easily fixed (same way as in sysfs
case).  Additionally, there's a superblock leak on
kernfs_node_dentry() failure *and* a dentry leak inside
kernfs_node_dentry() itself - the latter on probably
impossible errors, but the former not impossible to trigger
(as the matter of fact, injecting allocation failures
at that point *does* trigger it).

Cc: stable@kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-03-23 20:09:53 +01:00
Alban Crequy
02f8211b75 bpf, lpm: fix lookup bug in map_delete_elem
[ Upstream commit 7c0cdf0b39 ]

trie_delete_elem() was deleting an entry even though it was not matching
if the prefixlen was correct. This patch adds a check on matchlen.

Reproducer:

$ sudo bpftool map create /sys/fs/bpf/mylpm type lpm_trie key 8 value 1 entries 128 name mylpm flags 1
$ sudo bpftool map update pinned /sys/fs/bpf/mylpm key hex 10 00 00 00 aa bb cc dd value hex 01
$ sudo bpftool map dump pinned /sys/fs/bpf/mylpm
key: 10 00 00 00 aa bb cc dd  value: 01
Found 1 element
$ sudo bpftool map delete pinned /sys/fs/bpf/mylpm key hex 10 00 00 00 ff ff ff ff
$ echo $?
0
$ sudo bpftool map dump pinned /sys/fs/bpf/mylpm
Found 0 elements

A similar reproducer is added in the selftests.

Without the patch:

$ sudo ./tools/testing/selftests/bpf/test_lpm_map
test_lpm_map: test_lpm_map.c:485: test_lpm_delete: Assertion `bpf_map_delete_elem(map_fd, key) == -1 && errno == ENOENT' failed.
Aborted

With the patch: test_lpm_map runs without errors.

Fixes: e454cf5958 ("bpf: Implement map_delete_elem for BPF_MAP_TYPE_LPM_TRIE")
Cc: Craig Gallek <kraig@google.com>
Signed-off-by: Alban Crequy <alban@kinvolk.io>
Acked-by: Craig Gallek <kraig@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-03-23 20:09:51 +01:00
Alexei Starovoitov
c7c68a1b9a bpf: fix lockdep false positive in stackmap
[ Upstream commit 3defaf2f15 ]

Lockdep warns about false positive:
[   11.211460] ------------[ cut here ]------------
[   11.211936] DEBUG_LOCKS_WARN_ON(depth <= 0)
[   11.211985] WARNING: CPU: 0 PID: 141 at ../kernel/locking/lockdep.c:3592 lock_release+0x1ad/0x280
[   11.213134] Modules linked in:
[   11.214954] RIP: 0010:lock_release+0x1ad/0x280
[   11.223508] Call Trace:
[   11.223705]  <IRQ>
[   11.223874]  ? __local_bh_enable+0x7a/0x80
[   11.224199]  up_read+0x1c/0xa0
[   11.224446]  do_up_read+0x12/0x20
[   11.224713]  irq_work_run_list+0x43/0x70
[   11.225030]  irq_work_run+0x26/0x50
[   11.225310]  smp_irq_work_interrupt+0x57/0x1f0
[   11.225662]  irq_work_interrupt+0xf/0x20

since rw_semaphore is released in a different task vs task that locked the sema.
It is expected behavior.
Fix the warning with up_read_non_owner() and rwsem_release() annotation.

Fixes: bae77c5eb5 ("bpf: enable stackmap with build_id in nmi context")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-03-23 20:09:48 +01:00
Suren Baghdasaryan
617a4ba0ec FROMLIST: psi: introduce psi monitor
Psi monitor aims to provide a low-latency short-term pressure
detection mechanism configurable by users. It allows users to
monitor psi metrics growth and trigger events whenever a metric
raises above user-defined threshold within user-defined time window.

Time window and threshold are both expressed in usecs. Multiple psi
resources with different thresholds and window sizes can be monitored
concurrently.

Psi monitors activate when system enters stall state for the monitored
psi metric and deactivate upon exit from the stall state. While system
is in the stall state psi signal growth is monitored at a rate of 10 times
per tracking window. Min window size is 500ms, therefore the min monitoring
interval is 50ms. Max window size is 10s with monitoring interval of 1s.

When activated psi monitor stays active for at least the duration of one
tracking window to avoid repeated activations/deactivations when psi
signal is bouncing.

Notifications to the users are rate-limited to one per tracking window.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

(not upstream yet, latest version published at: https://lore.kernel.org/patchwork/patch/1052418/)

Bug: 127712811
Bug: 129157727
Test: lmkd in PSI mode
Change-Id: I860049d32420485346ad545c4650f990fe0c08e3
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2019-03-22 23:07:14 +00:00
Suren Baghdasaryan
3a905dc573 FROMLIST: refactor header includes to allow kthread.h inclusion in psi_types.h
kthread.h can't be included in psi_types.h because it creates a circular
inclusion with kthread.h eventually including psi_types.h and complaining
on kthread structures not being defined because they are defined further
in the kthread.h. Resolve this by removing psi_types.h inclusion from the
headers included from kthread.h.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>

(not upstream yet, latest version published at: https://lore.kernel.org/patchwork/patch/1052417/)

Bug: 127712811
Bug: 129157727
Test: lmkd in PSI mode
Change-Id: I88cd99f41534f0b9df18043cde8d1ee54aaa93de
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2019-03-22 23:07:04 +00:00
Suren Baghdasaryan
23c32cf595 FROMLIST: psi: track changed states
Introduce changed_states parameter into collect_percpu_times to track
the states changed since the last update.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>

(not upstream yet, latest version published at: https://lore.kernel.org/patchwork/patch/1052420/)

Bug: 127712811
Bug: 129157727
Test: lmkd in PSI mode
Change-Id: I944b024cd65e8520a57097bf5a3d7b2c01605bd0
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2019-03-22 23:06:55 +00:00
Suren Baghdasaryan
f270022469 FROMLIST: psi: split update_stats into parts
Split update_stats into collect_percpu_times and update_averages for
collect_percpu_times to be reused later inside psi monitor.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>

(not upstream yet, latest version published at: https://lore.kernel.org/patchwork/patch/1052419/)

Bug: 127712811
Bug: 129157727
Test: lmkd in PSI mode
Change-Id: Ia9cfed8964fd57e41098fca285a2be0252fd5277
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2019-03-22 23:06:46 +00:00
Suren Baghdasaryan
c6e18d9458 FROMLIST: psi: rename psi fields in preparation for psi trigger addition
Renaming psi_group structure member fields used for calculating psi totals
and averages for clear distinction between them and trigger-related fields
that will be added next.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>

(not upstream yet, latest version published at: https://lore.kernel.org/patchwork/patch/1052416/)

Bug: 127712811
Bug: 129157727
Test: lmkd in PSI mode
Change-Id: I579a60e0915fa8fedaa508357d3d1aefab9428c4
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2019-03-22 23:06:38 +00:00
Suren Baghdasaryan
18d15b1861 FROMLIST: psi: make psi_enable static
psi_enable is not used outside of psi.c, make it static.

Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>

(not upstream yet, latest version published at: https://lore.kernel.org/patchwork/patch/1052415/)

Bug: 127712811
Bug: 129157727
Test: lmkd in PSI mode
Change-Id: I249c6d2271f93a7975f1622faf2d2b4196b701bc
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2019-03-22 23:06:29 +00:00
Suren Baghdasaryan
ada57da3b1 FROMLIST: psi: introduce state_mask to represent stalled psi states
The psi monitoring patches will need to determine the same states as
record_times().  To avoid calculating them twice, maintain a state mask
that can be consulted cheaply.  Do this in a separate patch to keep the
churn in the main feature patch at a minimum.

This adds 4-byte state_mask member into psi_group_cpu struct which results
in its first cacheline-aligned part becoming 52 bytes long.  Add explicit
values to enumeration element counters that affect psi_group_cpu struct
size.

Link: http://lkml.kernel.org/r/20190124211518.244221-4-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>

(not upstream yet, latest version published at: https://lore.kernel.org/patchwork/patch/1052414/)

Bug: 127712811
Bug: 129157727
Test: lmkd in PSI mode
Change-Id: I38a1ca3d5c9e6cc3ba39e88c6a9af29ecdc0df5b
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2019-03-22 23:06:13 +00:00
Johannes Weiner
9f79143ebb UPSTREAM: kernel: cgroup: add poll file operation
Cgroup has a standardized poll/notification mechanism for waking all
pollers on all fds when a filesystem node changes.  To allow polling for
custom events, add a .poll callback that can override the default.

This is in preparation for pollable cgroup pressure files which have
per-fd trigger configurations.

Link: http://lkml.kernel.org/r/20190124211518.244221-3-surenb@google.com
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>

(cherry picked from commit: dc50537bdd)

Bug: 127712811
Test: lmkd in PSI mode
Change-Id: Idc648e7b7b7bd5fc00c7b32163e55a93b0f49a98
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2019-03-21 16:25:28 -07:00
Johannes Weiner
ec350213df UPSTREAM: psi: avoid divide-by-zero crash inside virtual machines
We've been seeing hard-to-trigger psi crashes when running inside VM
instances:

    divide error: 0000 [#1] SMP PTI
    Modules linked in: [...]
    CPU: 0 PID: 212 Comm: kworker/0:2 Not tainted 4.16.18-119_fbk9_3817_gfe944c98d695 #119
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
    Workqueue: events psi_clock
    RIP: 0010:psi_update_stats+0x270/0x490
    RSP: 0018:ffffc90001117e10 EFLAGS: 00010246
    RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff8800a35a13f8
    RDX: 0000000000000000 RSI: ffff8800a35a1340 RDI: 0000000000000000
    RBP: 0000000000000658 R08: ffff8800a35a1470 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
    R13: 0000000000000000 R14: 0000000000000000 R15: 00000000000f8502
    FS:  0000000000000000(0000) GS:ffff88023fc00000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007fbe370fa000 CR3: 00000000b1e3a000 CR4: 00000000000006f0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
     psi_clock+0x12/0x50
     process_one_work+0x1e0/0x390
     worker_thread+0x2b/0x3c0
     ? rescuer_thread+0x330/0x330
     kthread+0x113/0x130
     ? kthread_create_worker_on_cpu+0x40/0x40
     ? SyS_exit_group+0x10/0x10
     ret_from_fork+0x35/0x40
    Code: 48 0f 47 c7 48 01 c2 45 85 e4 48 89 16 0f 85 e6 00 00 00 4c 8b 49 10 4c 8b 51 08 49 69 d9 f2 07 00 00 48 6b c0 64 4c 8b 29 31 d2 <48> f7 f7 49 69 d5 8d 06 00 00 48 89 c5 4c 69 f0 00 98 0b 00 48

The Code-line points to `period` being 0 inside update_stats(), and we
divide by that when calculating that period's pressure percentage.

The elapsed period should never be 0.  The reason this can happen is due
to an off-by-one in the idle time / missing period calculation combined
with a coarse sched_clock() in the virtual machine.

The target time for aggregation is advanced into the future on a fixed
grid to prevent clock drift.  So when an aggregation runs after some idle
period, we can not just set it to "now + psi_period", but have to
calculate the downtime and advance the target time relative to itself.

However, if the aggregator was disabled exactly one psi_period (ns), we
drop one idle period in the calculation due to a > when we should do >=.
In that case, next_update will be advanced from 'now - psi_period' to
'now' when it should be moved to 'now + psi_period'.  The run finishes
with last_update == next_update == sched_clock().

With hardware clocks, this exact nanosecond match isn't likely in the
first place; but if it does happen, the clock will still have moved on and
the period non-zero by the time the worker runs.  A pointlessly short
period, but besides the extra work, no harm no foul.  However, a slow
sched_clock() like we have on VMs might not have advanced either by the
time the worker runs again.  And when we calculate the elapsed period, the
result, our pressure divisor, will be 0.  Ouch.

Fix this by correctly handling the situation when the elapsed time between
aggregation runs is precisely two periods, and advance the expiration
timestamp correctly to period into the future.

Link: http://lkml.kernel.org/r/20190214193157.15788-1-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Łukasz Siudut <lsiudut@fb.com
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

(cherry picked from commit 4e37504d1c)

Bug: 127712811
Test: lmkd in PSI mode
Change-Id: I40917c84354f9f32259c6703f00b6b1d21f45f02
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2019-03-21 16:25:27 -07:00
Johannes Weiner
2a070382c9 UPSTREAM: psi: fix aggregation idle shut-off
psi has provisions to shut off the periodic aggregation worker when
there is a period of no task activity - and thus no data that needs
aggregating.  However, while developing psi monitoring, Suren noticed
that the aggregation clock currently won't stay shut off for good.

Debugging this revealed a flaw in the idle design: an aggregation run
will see no task activity and decide to go to sleep; shortly thereafter,
the kworker thread that executed the aggregation will go idle and cause
a scheduling change, during which the psi callback will kick the
!pending worker again.  This will ping-pong forever, and is equivalent
to having no shut-off logic at all (but with more code!)

Fix this by exempting aggregation workers from psi's clock waking logic
when the state change is them going to sleep.  To do this, tag workers
with the last work function they executed, and if in psi we see a worker
going to sleep after aggregating psi data, we will not reschedule the
aggregation work item.

What if the worker is also executing other items before or after?

Any psi state times that were incurred by work items preceding the
aggregation work will have been collected from the per-cpu buckets
during the aggregation itself.  If there are work items following the
aggregation work, the worker's last_func tag will be overwritten and the
aggregator will be kept alive to process this genuine new activity.

If the aggregation work is the last thing the worker does, and we decide
to go idle, the brief period of non-idle time incurred between the
aggregation run and the kworker's dequeue will be stranded in the
per-cpu buckets until the clock is woken by later activity.  But that
should not be a problem.  The buckets can hold 4s worth of time, and
future activity will wake the clock with a 2s delay, giving us 2s worth
of data we can leave behind when disabling aggregation.  If it takes a
worker more than two seconds to go idle after it finishes its last work
item, we likely have bigger problems in the system, and won't notice one
sample that was averaged with a bogus per-CPU weight.

Link: http://lkml.kernel.org/r/20190116193501.1910-1-hannes@cmpxchg.org
Fixes: eb414681d5 ("psi: pressure stall information for CPU, memory, and IO")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

(cherry picked from commit 1b69ac6b40)

Bug: 127712811
Test: lmkd in PSI mode
Change-Id: I2877fec3d381b1006b8bd1261895fdfd68bd21db
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2019-03-21 16:25:27 -07:00
Johannes Weiner
3bbcbc8039 UPSTREAM: psi: make disabling/enabling easier for vendor kernels
Mel Gorman reports a hackbench regression with psi that would prohibit
shipping the suse kernel with it default-enabled, but he'd still like
users to be able to opt in at little to no cost to others.

With the current combination of CONFIG_PSI and the psi_disabled bool set
from the commandline, this is a challenge.  Do the following things to
make it easier:

1. Add a config option CONFIG_PSI_DEFAULT_DISABLED that allows distros
   to enable CONFIG_PSI in their kernel but leave the feature disabled
   unless a user requests it at boot-time.

   To avoid double negatives, rename psi_disabled= to psi=.

2. Make psi_disabled a static branch to eliminate any branch costs
   when the feature is disabled.

In terms of numbers before and after this patch, Mel says:

: The following is a comparision using CONFIG_PSI=n as a baseline against
: your patch and a vanilla kernel
:
:                          4.20.0-rc4             4.20.0-rc4             4.20.0-rc4
:                 kconfigdisable-v1r1                vanilla        psidisable-v1r1
: Amean     1       1.3100 (   0.00%)      1.3923 (  -6.28%)      1.3427 (  -2.49%)
: Amean     3       3.8860 (   0.00%)      4.1230 *  -6.10%*      3.8860 (  -0.00%)
: Amean     5       6.8847 (   0.00%)      8.0390 * -16.77%*      6.7727 (   1.63%)
: Amean     7       9.9310 (   0.00%)     10.8367 *  -9.12%*      9.9910 (  -0.60%)
: Amean     12     16.6577 (   0.00%)     18.2363 *  -9.48%*     17.1083 (  -2.71%)
: Amean     18     26.5133 (   0.00%)     27.8833 *  -5.17%*     25.7663 (   2.82%)
: Amean     24     34.3003 (   0.00%)     34.6830 (  -1.12%)     32.0450 (   6.58%)
: Amean     30     40.0063 (   0.00%)     40.5800 (  -1.43%)     41.5087 (  -3.76%)
: Amean     32     40.1407 (   0.00%)     41.2273 (  -2.71%)     39.9417 (   0.50%)
:
: It's showing that the vanilla kernel takes a hit (as the bisection
: indicated it would) and that disabling PSI by default is reasonably
: close in terms of performance for this particular workload on this
: particular machine so;

Link: http://lkml.kernel.org/r/20181127165329.GA29728@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: Mel Gorman <mgorman@techsingularity.net>
Reported-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

(cherry picked from commit e0c274472d)

Bug: 127712811
Test: lmkd in PSI mode
Change-Id: I6cb666fa351e8901df82e4d6931bfec0c5ce230d
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2019-03-21 16:25:27 -07:00
Olof Johansson
b822a6da85 UPSTREAM: kernel/sched/psi.c: simplify cgroup_move_task()
The existing code triggered an invalid warning about 'rq' possibly being
used uninitialized.  Instead of doing the silly warning suppression by
initializa it to NULL, refactor the code to bail out early instead.

Warning was:

  kernel/sched/psi.c: In function `cgroup_move_task':
  kernel/sched/psi.c:639:13: warning: `rq' may be used uninitialized in this function [-Wmaybe-uninitialized]

Link: http://lkml.kernel.org/r/20181103183339.8669-1-olof@lixom.net
Fixes: 2ce7135adc ("psi: cgroup support")
Signed-off-by: Olof Johansson <olof@lixom.net>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

(cherry picked from commit 8fcb2312d1)

Bug: 127712811
Test: lmkd in PSI mode
Change-Id: Id989da224a726082e0cfa5d5d9460bf63d448a93
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2019-03-21 16:25:27 -07:00
Johannes Weiner
dc9cd29ded UPSTREAM: psi: cgroup support
On a system that executes multiple cgrouped jobs and independent
workloads, we don't just care about the health of the overall system, but
also that of individual jobs, so that we can ensure individual job health,
fairness between jobs, or prioritize some jobs over others.

This patch implements pressure stall tracking for cgroups.  In kernels
with CONFIG_PSI=y, cgroup2 groups will have cpu.pressure, memory.pressure,
and io.pressure files that track aggregate pressure stall times for only
the tasks inside the cgroup.

Link: http://lkml.kernel.org/r/20180828172258.3185-10-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Daniel Drake <drake@endlessm.com>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

(cherry picked from commit 2ce7135adc)

Bug: 127712811
Test: lmkd in PSI mode
Change-Id: I163e6657aaa60aa5aab9372616a3bce2a65e90ec
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2019-03-21 16:25:27 -07:00
Johannes Weiner
e550f94252 UPSTREAM: psi: pressure stall information for CPU, memory, and IO
When systems are overcommitted and resources become contended, it's hard
to tell exactly the impact this has on workload productivity, or how close
the system is to lockups and OOM kills.  In particular, when machines work
multiple jobs concurrently, the impact of overcommit in terms of latency
and throughput on the individual job can be enormous.

In order to maximize hardware utilization without sacrificing individual
job health or risk complete machine lockups, this patch implements a way
to quantify resource pressure in the system.

A kernel built with CONFIG_PSI=y creates files in /proc/pressure/ that
expose the percentage of time the system is stalled on CPU, memory, or IO,
respectively.  Stall states are aggregate versions of the per-task delay
accounting delays:

       cpu: some tasks are runnable but not executing on a CPU
       memory: tasks are reclaiming, or waiting for swapin or thrashing cache
       io: tasks are waiting for io completions

These percentages of walltime can be thought of as pressure percentages,
and they give a general sense of system health and productivity loss
incurred by resource overcommit.  They can also indicate when the system
is approaching lockup scenarios and OOMs.

To do this, psi keeps track of the task states associated with each CPU
and samples the time they spend in stall states.  Every 2 seconds, the
samples are averaged across CPUs - weighted by the CPUs' non-idle time to
eliminate artifacts from unused CPUs - and translated into percentages of
walltime.  A running average of those percentages is maintained over 10s,
1m, and 5m periods (similar to the loadaverage).

[hannes@cmpxchg.org: doc fixlet, per Randy]
  Link: http://lkml.kernel.org/r/20180828205625.GA14030@cmpxchg.org
[hannes@cmpxchg.org: code optimization]
  Link: http://lkml.kernel.org/r/20180907175015.GA8479@cmpxchg.org
[hannes@cmpxchg.org: rename psi_clock() to psi_update_work(), per Peter]
  Link: http://lkml.kernel.org/r/20180907145404.GB11088@cmpxchg.org
[hannes@cmpxchg.org: fix build]
  Link: http://lkml.kernel.org/r/20180913014222.GA2370@cmpxchg.org
Link: http://lkml.kernel.org/r/20180828172258.3185-9-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Daniel Drake <drake@endlessm.com>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

(cherry picked from commit eb414681d5)

Bug: 127712811
Test: lmkd in PSI mode
Change-Id: Id00d23c977169b0c4636d92016fc1fee0274be05
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2019-03-21 16:25:27 -07:00
Johannes Weiner
8cd88f5398 UPSTREAM: sched: introduce this_rq_lock_irq()
do_sched_yield() disables IRQs, looks up this_rq() and locks it.  The next
patch is adding another site with the same pattern, so provide a
convenience function for it.

Link: http://lkml.kernel.org/r/20180828172258.3185-8-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Tested-by: Daniel Drake <drake@endlessm.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

(cherry picked from commit 246b3b3342)

Bug: 127712811
Test: lmkd in PSI mode
Change-Id: I24b42cff1624c80633f116b7cb485564f53a30a7
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2019-03-21 16:25:27 -07:00
Johannes Weiner
cdda3cf652 UPSTREAM: sched: sched.h: make rq locking and clock functions available in stats.h
kernel/sched/sched.h includes "stats.h" half-way through the file.  The
next patch introduces users of sched.h's rq locking functions and
update_rq_clock() in kernel/sched/stats.h.  Move those definitions up in
the file so they are available in stats.h.

Link: http://lkml.kernel.org/r/20180828172258.3185-7-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Tested-by: Daniel Drake <drake@endlessm.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

(cherry picked from commit 1f351d7f75)

Bug: 127712811
Test: lmkd in PSI mode
Change-Id: Id342e0ba9a62b49e64f2ce8b87f883ea70230b2f
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2019-03-21 16:25:27 -07:00
Johannes Weiner
4c9c09affa UPSTREAM: sched: loadavg: make calc_load_n() public
It's going to be used in a later patch. Keep the churn separate.

Link: http://lkml.kernel.org/r/20180828172258.3185-6-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Tested-by: Daniel Drake <drake@endlessm.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

(cherry picked from commit 5c54f5b9ed)

Bug: 127712811
Test: lmkd in PSI mode
Change-Id: I50e0cb0dbf20ced329a484493f82ff69ca1ae97a
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2019-03-21 16:25:27 -07:00
Johannes Weiner
2ba18b41d3 BACKPORT: sched: loadavg: consolidate LOAD_INT, LOAD_FRAC, CALC_LOAD
There are several definitions of those functions/macros in places that
mess with fixed-point load averages.  Provide an official version.

[akpm@linux-foundation.org: fix missed conversion in block/blk-iolatency.c]
Link: http://lkml.kernel.org/r/20180828172258.3185-5-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Tested-by: Daniel Drake <drake@endlessm.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

(cherry picked from commit 8508cf3ffa)

Conflicts:
        block/blk-iolatency.c

(1. manual merge to replace stat->rqs.mean with stat.mean)

Bug: 127712811
Test: lmkd in PSI mode
Change-Id: I716b4874491cff75a2355c6d95c64cf02d05e7ee
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2019-03-21 16:25:26 -07:00
Johannes Weiner
580f26b93e UPSTREAM: delayacct: track delays from thrashing cache pages
Delay accounting already measures the time a task spends in direct reclaim
and waiting for swapin, but in low memory situations tasks spend can spend
a significant amount of their time waiting on thrashing page cache.  This
isn't tracked right now.

To know the full impact of memory contention on an individual task,
measure the delay when waiting for a recently evicted active cache page to
read back into memory.

Also update tools/accounting/getdelays.c:

     [hannes@computer accounting]$ sudo ./getdelays -d -p 1
     print delayacct stats ON
     PID     1

     CPU             count     real total  virtual total    delay total  delay average
                     50318      745000000      847346785      400533713          0.008ms
     IO              count    delay total  delay average
                       435      122601218              0ms
     SWAP            count    delay total  delay average
                         0              0              0ms
     RECLAIM         count    delay total  delay average
                         0              0              0ms
     THRASHING       count    delay total  delay average
                        19       12621439              0ms

Link: http://lkml.kernel.org/r/20180828172258.3185-4-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Daniel Drake <drake@endlessm.com>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

(cherry picked from commit b1d29ba82c)

Bug: 127712811
Test: lmkd in PSI mode
Change-Id: I259f693987cf04e6a52ee7e8accf55a17e0de005
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2019-03-21 16:25:26 -07:00
Greg Kroah-Hartman
2e568c979c This is the 4.19.29 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlyJb/EACgkQONu9yGCS
 aT4y0g//b9t9/onhTaXcY/ByPmBAwqNgugi7eYcZqGDBp7aCDOBLF6eOwbhdvvuS
 ZTaZ5eWG3Twz3mZu9vveuskgMci2npDyLPgqBWGzW+Ef5r/xPd40diaI75ZUc68T
 gimWbQ0VANuXKklK6LysBUaVQWE3ilIy6qnnpj0DI3ipNDoE62Ry1LNthuKy+73J
 w6r7uwkb6X/CkXpNB/L4cDdpSy/CvhGQhd6p91lBuE4DfyPqEzslYCokD9aPXp9b
 Fedt/Re+8eULBNcgqPYxkS5pBrbHtqrGf00AMlzC8DkC+GZyDqSP2xjv6AiTfGJd
 uf0/Jvsv2OBnP4aYsbk+uB2z3plzPBgmXxa/1bm+yrGCMvbpi9mMx75HM2joAeVp
 tVN4ZN65kNgJkXCchJTHdQ3s6teOD8Par1czy570HyKBU6l1j3AGArGm+b4WGPWx
 dL+82coojMKxKNdTHfxUXES6QGKp716r3un6mCrKR0xET/SDayzDQMaSM8UOtArK
 ELzNeKzKTc5oBx6i+JfGmY8ZsedpNGCIPpsiuoSYAaon5ZzNbruzOAlDOThs157d
 YezDHZ9XMrx3kN/xYnqZD63x/5egq9REbZGWljeykbNkWcEY74jIkKwNLxqv3P64
 JsLp60owvjzwtzKycjZogNU//GGNTBdb+6pESq4MxJpPTteFWnc=
 =n9iV
 -----END PGP SIGNATURE-----

Merge 4.19.29 into android-4.19

Changes in 4.19.29
	media: uvcvideo: Fix 'type' check leading to overflow
	vti4: Fix a ipip packet processing bug in 'IPCOMP' virtual tunnel
	perf script: Fix crash with printing mixed trace point and other events
	perf core: Fix perf_proc_update_handler() bug
	perf tools: Handle TOPOLOGY headers with no CPU
	perf script: Fix crash when processing recorded stat data
	IB/{hfi1, qib}: Fix WC.byte_len calculation for UD_SEND_WITH_IMM
	iommu/amd: Call free_iova_fast with pfn in map_sg
	iommu/amd: Unmap all mapped pages in error path of map_sg
	riscv: fixup max_low_pfn with PFN_DOWN.
	ipvs: Fix signed integer overflow when setsockopt timeout
	iommu/amd: Fix IOMMU page flush when detach device from a domain
	clk: ti: Fix error handling in ti_clk_parse_divider_data()
	clk: qcom: gcc: Use active only source for CPUSS clocks
	xtensa: SMP: fix ccount_timer_shutdown
	riscv: Adjust mmap base address at a third of task size
	IB/ipoib: Fix for use-after-free in ipoib_cm_tx_start
	selftests: cpu-hotplug: fix case where CPUs offline > CPUs present
	xtensa: SMP: fix secondary CPU initialization
	xtensa: smp_lx200_defconfig: fix vectors clash
	xtensa: SMP: mark each possible CPU as present
	iomap: get/put the page in iomap_page_create/release()
	iomap: fix a use after free in iomap_dio_rw
	xtensa: SMP: limit number of possible CPUs by NR_CPUS
	net: altera_tse: fix msgdma_tx_completion on non-zero fill_level case
	net: hns: Fix for missing of_node_put() after of_parse_phandle()
	net: hns: Restart autoneg need return failed when autoneg off
	net: hns: Fix wrong read accesses via Clause 45 MDIO protocol
	net: stmmac: dwmac-rk: fix error handling in rk_gmac_powerup()
	netfilter: ebtables: compat: un-break 32bit setsockopt when no rules are present
	gpio: vf610: Mask all GPIO interrupts
	selftests: net: use LDLIBS instead of LDFLAGS
	selftests: timers: use LDLIBS instead of LDFLAGS
	nfs: Fix NULL pointer dereference of dev_name
	qed: Fix bug in tx promiscuous mode settings
	qed: Fix LACP pdu drops for VFs
	qed: Fix VF probe failure while FLR
	qed: Fix system crash in ll2 xmit
	qed: Fix stack out of bounds bug
	scsi: libfc: free skb when receiving invalid flogi resp
	scsi: scsi_debug: fix write_same with virtual_gb problem
	scsi: bnx2fc: Fix error handling in probe()
	scsi: 53c700: pass correct "dev" to dma_alloc_attrs()
	platform/x86: Fix unmet dependency warning for ACPI_CMPC
	platform/x86: Fix unmet dependency warning for SAMSUNG_Q10
	net: macb: Apply RXUBR workaround only to versions with errata
	x86/boot/compressed/64: Set EFER.LME=1 in 32-bit trampoline before returning to long mode
	cifs: fix computation for MAX_SMB2_HDR_SIZE
	x86/microcode/amd: Don't falsely trick the late loading mechanism
	arm64: kprobe: Always blacklist the KVM world-switch code
	apparmor: Fix aa_label_build() error handling for failed merges
	x86/kexec: Don't setup EFI info if EFI runtime is not enabled
	proc: fix /proc/net/* after setns(2)
	x86_64: increase stack size for KASAN_EXTRA
	mm, memory_hotplug: is_mem_section_removable do not pass the end of a zone
	mm, memory_hotplug: test_pages_in_a_zone do not pass the end of zone
	lib/test_kmod.c: potential double free in error handling
	fs/drop_caches.c: avoid softlockups in drop_pagecache_sb()
	autofs: drop dentry reference only when it is never used
	autofs: fix error return in autofs_fill_super()
	mm, memory_hotplug: fix off-by-one in is_pageblock_removable
	ARM: OMAP: dts: N950/N9: fix onenand timings
	ARM: dts: omap4-droid4: Fix typo in cpcap IRQ flags
	ARM: dts: sun8i: h3: Add ethernet0 alias to Beelink X2
	arm: dts: meson: Fix IRQ trigger type for macirq
	ARM: dts: meson8b: odroidc1: mark the SD card detection GPIO active-low
	ARM: dts: meson8m2: mxiii-plus: mark the SD card detection GPIO active-low
	ARM: dts: imx6sx: correct backward compatible of gpt
	arm64: dts: renesas: r8a7796: Enable DMA for SCIF2
	arm64: dts: renesas: r8a77965: Enable DMA for SCIF2
	soc: fsl: qbman: avoid race in clearing QMan interrupt
	pinctrl: mcp23s08: spi: Fix regmap allocation for mcp23s18
	wlcore: sdio: Fixup power on/off sequence
	bpftool: Fix prog dump by tag
	bpftool: fix percpu maps updating
	bpf: sock recvbuff must be limited by rmem_max in bpf_setsockopt()
	ARM: pxa: ssp: unneeded to free devm_ allocated data
	arm64: dts: add msm8996 compatible to gicv3
	batman-adv: release station info tidstats
	DTS: CI20: Fix bugs in ci20's device tree.
	usb: phy: fix link errors
	irqchip/gic-v4: Fix occasional VLPI drop
	irqchip/gic-v3-its: Gracefully fail on LPI exhaustion
	irqchip/mmp: Only touch the PJ4 IRQ & FIQ bits on enable/disable
	drm/amdgpu: Add missing power attribute to APU check
	drm/radeon: check if device is root before getting pci speed caps
	drm/amdgpu: Transfer fences to dmabuf importer
	net: stmmac: Fallback to Platform Data clock in Watchdog conversion
	net: stmmac: Send TSO packets always from Queue 0
	net: stmmac: Disable EEE mode earlier in XMIT callback
	irqchip/gic-v3-its: Fix ITT_entry_size accessor
	relay: check return of create_buf_file() properly
	bpf, selftests: fix handling of sparse CPU allocations
	bpf: fix lockdep false positive in percpu_freelist
	bpf: fix potential deadlock in bpf_prog_register
	bpf: Fix syscall's stackmap lookup potential deadlock
	drm/sun4i: tcon: Prepare and enable TCON channel 0 clock at init
	dmaengine: at_xdmac: Fix wrongfull report of a channel as in use
	vsock/virtio: fix kernel panic after device hot-unplug
	vsock/virtio: reset connected sockets on device removal
	dmaengine: dmatest: Abort test in case of mapping error
	selftests: netfilter: fix config fragment CONFIG_NF_TABLES_INET
	selftests: netfilter: add simple masq/redirect test cases
	netfilter: nf_nat: skip nat clash resolution for same-origin entries
	s390/qeth: release cmd buffer in error paths
	s390/qeth: fix use-after-free in error path
	s390/qeth: cancel close_dev work before removing a card
	perf symbols: Filter out hidden symbols from labels
	perf trace: Support multiple "vfs_getname" probes
	MIPS: Remove function size check in get_frame_info()
	Revert "scsi: libfc: Add WARN_ON() when deleting rports"
	i2c: omap: Use noirq system sleep pm ops to idle device for suspend
	drm/amdgpu: use spin_lock_irqsave to protect vm_manager.pasid_idr
	nvme: lock NS list changes while handling command effects
	nvme-pci: fix rapid add remove sequence
	fs: ratelimit __find_get_block_slow() failure message.
	qed: Fix EQ full firmware assert.
	qed: Consider TX tcs while deriving the max num_queues for PF.
	qede: Fix system crash on configuring channels.
	blk-iolatency: fix IO hang due to negative inflight counter
	nvme-pci: add missing unlock for reset error
	Input: wacom_serial4 - add support for Wacom ArtPad II tablet
	Input: elan_i2c - add id for touchpad found in Lenovo s21e-20
	iscsi_ibft: Fix missing break in switch statement
	scsi: aacraid: Fix missing break in switch statement
	x86/PCI: Fixup RTIT_BAR of Intel Denverton Trace Hub
	arm64: dts: zcu100-revC: Give wifi some time after power-on
	arm64: dts: hikey: Give wifi some time after power-on
	arm64: dts: hikey: Revert "Enable HS200 mode on eMMC"
	ARM: dts: exynos: Fix pinctrl definition for eMMC RTSN line on Odroid X2/U3
	ARM: dts: exynos: Add minimal clkout parameters to Exynos3250 PMU
	ARM: dts: exynos: Fix max voltage for buck8 regulator on Odroid XU3/XU4
	drm: disable uncached DMA optimization for ARM and arm64
	netfilter: xt_TEE: fix wrong interface selection
	netfilter: xt_TEE: add missing code to get interface index in checkentry.
	gfs2: Fix missed wakeups in find_insert_glock
	staging: erofs: add error handling for xattr submodule
	staging: erofs: fix fast symlink w/o xattr when fs xattr is on
	staging: erofs: fix memleak of inode's shared xattr array
	staging: erofs: fix race of initializing xattrs of a inode at the same time
	staging: erofs: keep corrupted fs from crashing kernel in erofs_namei()
	cifs: allow calling SMB2_xxx_free(NULL)
	ath9k: Avoid OF no-EEPROM quirks without qca,no-eeprom
	driver core: Postpone DMA tear-down until after devres release
	perf/x86/intel: Make cpuc allocations consistent
	perf/x86/intel: Generalize dynamic constraint creation
	x86: Add TSX Force Abort CPUID/MSR
	perf/x86/intel: Implement support for TSX Force Abort
	Linux 4.19.29

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2019-03-13 14:17:29 -07:00
Martin KaFai Lau
ae26a7109c bpf: Fix syscall's stackmap lookup potential deadlock
[ Upstream commit 7c4cd051ad ]

The map_lookup_elem used to not acquiring spinlock
in order to optimize the reader.

It was true until commit 557c0c6e7d ("bpf: convert stackmap to pre-allocation")
The syscall's map_lookup_elem(stackmap) calls bpf_stackmap_copy().
bpf_stackmap_copy() may find the elem no longer needed after the copy is done.
If that is the case, pcpu_freelist_push() saves this elem for reuse later.
This push requires a spinlock.

If a tracing bpf_prog got run in the middle of the syscall's
map_lookup_elem(stackmap) and this tracing bpf_prog is calling
bpf_get_stackid(stackmap) which also requires the same pcpu_freelist's
spinlock, it may end up with a dead lock situation as reported by
Eric Dumazet in https://patchwork.ozlabs.org/patch/1030266/

The situation is the same as the syscall's map_update_elem() which
needs to acquire the pcpu_freelist's spinlock and could race
with tracing bpf_prog.  Hence, this patch fixes it by protecting
bpf_stackmap_copy() with this_cpu_inc(bpf_prog_active)
to prevent tracing bpf_prog from running.

A later syscall's map_lookup_elem commit f1a2e44a3a ("bpf: add queue and stack maps")
also acquires a spinlock and races with tracing bpf_prog similarly.
Hence, this patch is forward looking and protects the majority
of the map lookups.  bpf_map_offload_lookup_elem() is the exception
since it is for network bpf_prog only (i.e. never called by tracing
bpf_prog).

Fixes: 557c0c6e7d ("bpf: convert stackmap to pre-allocation")
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-03-13 14:02:36 -07:00
Alexei Starovoitov
3bbe6a4212 bpf: fix potential deadlock in bpf_prog_register
[ Upstream commit e16ec34039 ]

Lockdep found a potential deadlock between cpu_hotplug_lock, bpf_event_mutex, and cpuctx_mutex:
[   13.007000] WARNING: possible circular locking dependency detected
[   13.007587] 5.0.0-rc3-00018-g2fa53f892422-dirty #477 Not tainted
[   13.008124] ------------------------------------------------------
[   13.008624] test_progs/246 is trying to acquire lock:
[   13.009030] 0000000094160d1d (tracepoints_mutex){+.+.}, at: tracepoint_probe_register_prio+0x2d/0x300
[   13.009770]
[   13.009770] but task is already holding lock:
[   13.010239] 00000000d663ef86 (bpf_event_mutex){+.+.}, at: bpf_probe_register+0x1d/0x60
[   13.010877]
[   13.010877] which lock already depends on the new lock.
[   13.010877]
[   13.011532]
[   13.011532] the existing dependency chain (in reverse order) is:
[   13.012129]
[   13.012129] -> #4 (bpf_event_mutex){+.+.}:
[   13.012582]        perf_event_query_prog_array+0x9b/0x130
[   13.013016]        _perf_ioctl+0x3aa/0x830
[   13.013354]        perf_ioctl+0x2e/0x50
[   13.013668]        do_vfs_ioctl+0x8f/0x6a0
[   13.014003]        ksys_ioctl+0x70/0x80
[   13.014320]        __x64_sys_ioctl+0x16/0x20
[   13.014668]        do_syscall_64+0x4a/0x180
[   13.015007]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
[   13.015469]
[   13.015469] -> #3 (&cpuctx_mutex){+.+.}:
[   13.015910]        perf_event_init_cpu+0x5a/0x90
[   13.016291]        perf_event_init+0x1b2/0x1de
[   13.016654]        start_kernel+0x2b8/0x42a
[   13.016995]        secondary_startup_64+0xa4/0xb0
[   13.017382]
[   13.017382] -> #2 (pmus_lock){+.+.}:
[   13.017794]        perf_event_init_cpu+0x21/0x90
[   13.018172]        cpuhp_invoke_callback+0xb3/0x960
[   13.018573]        _cpu_up+0xa7/0x140
[   13.018871]        do_cpu_up+0xa4/0xc0
[   13.019178]        smp_init+0xcd/0xd2
[   13.019483]        kernel_init_freeable+0x123/0x24f
[   13.019878]        kernel_init+0xa/0x110
[   13.020201]        ret_from_fork+0x24/0x30
[   13.020541]
[   13.020541] -> #1 (cpu_hotplug_lock.rw_sem){++++}:
[   13.021051]        static_key_slow_inc+0xe/0x20
[   13.021424]        tracepoint_probe_register_prio+0x28c/0x300
[   13.021891]        perf_trace_event_init+0x11f/0x250
[   13.022297]        perf_trace_init+0x6b/0xa0
[   13.022644]        perf_tp_event_init+0x25/0x40
[   13.023011]        perf_try_init_event+0x6b/0x90
[   13.023386]        perf_event_alloc+0x9a8/0xc40
[   13.023754]        __do_sys_perf_event_open+0x1dd/0xd30
[   13.024173]        do_syscall_64+0x4a/0x180
[   13.024519]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
[   13.024968]
[   13.024968] -> #0 (tracepoints_mutex){+.+.}:
[   13.025434]        __mutex_lock+0x86/0x970
[   13.025764]        tracepoint_probe_register_prio+0x2d/0x300
[   13.026215]        bpf_probe_register+0x40/0x60
[   13.026584]        bpf_raw_tracepoint_open.isra.34+0xa4/0x130
[   13.027042]        __do_sys_bpf+0x94f/0x1a90
[   13.027389]        do_syscall_64+0x4a/0x180
[   13.027727]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
[   13.028171]
[   13.028171] other info that might help us debug this:
[   13.028171]
[   13.028807] Chain exists of:
[   13.028807]   tracepoints_mutex --> &cpuctx_mutex --> bpf_event_mutex
[   13.028807]
[   13.029666]  Possible unsafe locking scenario:
[   13.029666]
[   13.030140]        CPU0                    CPU1
[   13.030510]        ----                    ----
[   13.030875]   lock(bpf_event_mutex);
[   13.031166]                                lock(&cpuctx_mutex);
[   13.031645]                                lock(bpf_event_mutex);
[   13.032135]   lock(tracepoints_mutex);
[   13.032441]
[   13.032441]  *** DEADLOCK ***
[   13.032441]
[   13.032911] 1 lock held by test_progs/246:
[   13.033239]  #0: 00000000d663ef86 (bpf_event_mutex){+.+.}, at: bpf_probe_register+0x1d/0x60
[   13.033909]
[   13.033909] stack backtrace:
[   13.034258] CPU: 1 PID: 246 Comm: test_progs Not tainted 5.0.0-rc3-00018-g2fa53f892422-dirty #477
[   13.034964] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014
[   13.035657] Call Trace:
[   13.035859]  dump_stack+0x5f/0x8b
[   13.036130]  print_circular_bug.isra.37+0x1ce/0x1db
[   13.036526]  __lock_acquire+0x1158/0x1350
[   13.036852]  ? lock_acquire+0x98/0x190
[   13.037154]  lock_acquire+0x98/0x190
[   13.037447]  ? tracepoint_probe_register_prio+0x2d/0x300
[   13.037876]  __mutex_lock+0x86/0x970
[   13.038167]  ? tracepoint_probe_register_prio+0x2d/0x300
[   13.038600]  ? tracepoint_probe_register_prio+0x2d/0x300
[   13.039028]  ? __mutex_lock+0x86/0x970
[   13.039337]  ? __mutex_lock+0x24a/0x970
[   13.039649]  ? bpf_probe_register+0x1d/0x60
[   13.039992]  ? __bpf_trace_sched_wake_idle_without_ipi+0x10/0x10
[   13.040478]  ? tracepoint_probe_register_prio+0x2d/0x300
[   13.040906]  tracepoint_probe_register_prio+0x2d/0x300
[   13.041325]  bpf_probe_register+0x40/0x60
[   13.041649]  bpf_raw_tracepoint_open.isra.34+0xa4/0x130
[   13.042068]  ? __might_fault+0x3e/0x90
[   13.042374]  __do_sys_bpf+0x94f/0x1a90
[   13.042678]  do_syscall_64+0x4a/0x180
[   13.042975]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[   13.043382] RIP: 0033:0x7f23b10a07f9
[   13.045155] RSP: 002b:00007ffdef42fdd8 EFLAGS: 00000202 ORIG_RAX: 0000000000000141
[   13.045759] RAX: ffffffffffffffda RBX: 00007ffdef42ff70 RCX: 00007f23b10a07f9
[   13.046326] RDX: 0000000000000070 RSI: 00007ffdef42fe10 RDI: 0000000000000011
[   13.046893] RBP: 00007ffdef42fdf0 R08: 0000000000000038 R09: 00007ffdef42fe10
[   13.047462] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000
[   13.048029] R13: 0000000000000016 R14: 00007f23b1db4690 R15: 0000000000000000

Since tracepoints_mutex will be taken in tracepoint_probe_register/unregister()
there is no need to take bpf_event_mutex too.
bpf_event_mutex is protecting modifications to prog array used in kprobe/perf bpf progs.
bpf_raw_tracepoints don't need to take this mutex.

Fixes: c4f6699dfc ("bpf: introduce BPF_RAW_TRACEPOINT")
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-03-13 14:02:36 -07:00
Alexei Starovoitov
e3bc64c9aa bpf: fix lockdep false positive in percpu_freelist
[ Upstream commit a89fac57b5 ]

Lockdep warns about false positive:
[   12.492084] 00000000e6b28347 (&head->lock){+...}, at: pcpu_freelist_push+0x2a/0x40
[   12.492696] but this lock was taken by another, HARDIRQ-safe lock in the past:
[   12.493275]  (&rq->lock){-.-.}
[   12.493276]
[   12.493276]
[   12.493276] and interrupts could create inverse lock ordering between them.
[   12.493276]
[   12.494435]
[   12.494435] other info that might help us debug this:
[   12.494979]  Possible interrupt unsafe locking scenario:
[   12.494979]
[   12.495518]        CPU0                    CPU1
[   12.495879]        ----                    ----
[   12.496243]   lock(&head->lock);
[   12.496502]                                local_irq_disable();
[   12.496969]                                lock(&rq->lock);
[   12.497431]                                lock(&head->lock);
[   12.497890]   <Interrupt>
[   12.498104]     lock(&rq->lock);
[   12.498368]
[   12.498368]  *** DEADLOCK ***
[   12.498368]
[   12.498837] 1 lock held by dd/276:
[   12.499110]  #0: 00000000c58cb2ee (rcu_read_lock){....}, at: trace_call_bpf+0x5e/0x240
[   12.499747]
[   12.499747] the shortest dependencies between 2nd lock and 1st lock:
[   12.500389]  -> (&rq->lock){-.-.} {
[   12.500669]     IN-HARDIRQ-W at:
[   12.500934]                       _raw_spin_lock+0x2f/0x40
[   12.501373]                       scheduler_tick+0x4c/0xf0
[   12.501812]                       update_process_times+0x40/0x50
[   12.502294]                       tick_periodic+0x27/0xb0
[   12.502723]                       tick_handle_periodic+0x1f/0x60
[   12.503203]                       timer_interrupt+0x11/0x20
[   12.503651]                       __handle_irq_event_percpu+0x43/0x2c0
[   12.504167]                       handle_irq_event_percpu+0x20/0x50
[   12.504674]                       handle_irq_event+0x37/0x60
[   12.505139]                       handle_level_irq+0xa7/0x120
[   12.505601]                       handle_irq+0xa1/0x150
[   12.506018]                       do_IRQ+0x77/0x140
[   12.506411]                       ret_from_intr+0x0/0x1d
[   12.506834]                       _raw_spin_unlock_irqrestore+0x53/0x60
[   12.507362]                       __setup_irq+0x481/0x730
[   12.507789]                       setup_irq+0x49/0x80
[   12.508195]                       hpet_time_init+0x21/0x32
[   12.508644]                       x86_late_time_init+0xb/0x16
[   12.509106]                       start_kernel+0x390/0x42a
[   12.509554]                       secondary_startup_64+0xa4/0xb0
[   12.510034]     IN-SOFTIRQ-W at:
[   12.510305]                       _raw_spin_lock+0x2f/0x40
[   12.510772]                       try_to_wake_up+0x1c7/0x4e0
[   12.511220]                       swake_up_locked+0x20/0x40
[   12.511657]                       swake_up_one+0x1a/0x30
[   12.512070]                       rcu_process_callbacks+0xc5/0x650
[   12.512553]                       __do_softirq+0xe6/0x47b
[   12.512978]                       irq_exit+0xc3/0xd0
[   12.513372]                       smp_apic_timer_interrupt+0xa9/0x250
[   12.513876]                       apic_timer_interrupt+0xf/0x20
[   12.514343]                       default_idle+0x1c/0x170
[   12.514765]                       do_idle+0x199/0x240
[   12.515159]                       cpu_startup_entry+0x19/0x20
[   12.515614]                       start_kernel+0x422/0x42a
[   12.516045]                       secondary_startup_64+0xa4/0xb0
[   12.516521]     INITIAL USE at:
[   12.516774]                      _raw_spin_lock_irqsave+0x38/0x50
[   12.517258]                      rq_attach_root+0x16/0xd0
[   12.517685]                      sched_init+0x2f2/0x3eb
[   12.518096]                      start_kernel+0x1fb/0x42a
[   12.518525]                      secondary_startup_64+0xa4/0xb0
[   12.518986]   }
[   12.519132]   ... key      at: [<ffffffff82b7bc28>] __key.71384+0x0/0x8
[   12.519649]   ... acquired at:
[   12.519892]    pcpu_freelist_pop+0x7b/0xd0
[   12.520221]    bpf_get_stackid+0x1d2/0x4d0
[   12.520563]    ___bpf_prog_run+0x8b4/0x11a0
[   12.520887]
[   12.521008] -> (&head->lock){+...} {
[   12.521292]    HARDIRQ-ON-W at:
[   12.521539]                     _raw_spin_lock+0x2f/0x40
[   12.521950]                     pcpu_freelist_push+0x2a/0x40
[   12.522396]                     bpf_get_stackid+0x494/0x4d0
[   12.522828]                     ___bpf_prog_run+0x8b4/0x11a0
[   12.523296]    INITIAL USE at:
[   12.523537]                    _raw_spin_lock+0x2f/0x40
[   12.523944]                    pcpu_freelist_populate+0xc0/0x120
[   12.524417]                    htab_map_alloc+0x405/0x500
[   12.524835]                    __do_sys_bpf+0x1a3/0x1a90
[   12.525253]                    do_syscall_64+0x4a/0x180
[   12.525659]                    entry_SYSCALL_64_after_hwframe+0x49/0xbe
[   12.526167]  }
[   12.526311]  ... key      at: [<ffffffff838f7668>] __key.13130+0x0/0x8
[   12.526812]  ... acquired at:
[   12.527047]    __lock_acquire+0x521/0x1350
[   12.527371]    lock_acquire+0x98/0x190
[   12.527680]    _raw_spin_lock+0x2f/0x40
[   12.527994]    pcpu_freelist_push+0x2a/0x40
[   12.528325]    bpf_get_stackid+0x494/0x4d0
[   12.528645]    ___bpf_prog_run+0x8b4/0x11a0
[   12.528970]
[   12.529092]
[   12.529092] stack backtrace:
[   12.529444] CPU: 0 PID: 276 Comm: dd Not tainted 5.0.0-rc3-00018-g2fa53f892422 #475
[   12.530043] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014
[   12.530750] Call Trace:
[   12.530948]  dump_stack+0x5f/0x8b
[   12.531248]  check_usage_backwards+0x10c/0x120
[   12.531598]  ? ___bpf_prog_run+0x8b4/0x11a0
[   12.531935]  ? mark_lock+0x382/0x560
[   12.532229]  mark_lock+0x382/0x560
[   12.532496]  ? print_shortest_lock_dependencies+0x180/0x180
[   12.532928]  __lock_acquire+0x521/0x1350
[   12.533271]  ? find_get_entry+0x17f/0x2e0
[   12.533586]  ? find_get_entry+0x19c/0x2e0
[   12.533902]  ? lock_acquire+0x98/0x190
[   12.534196]  lock_acquire+0x98/0x190
[   12.534482]  ? pcpu_freelist_push+0x2a/0x40
[   12.534810]  _raw_spin_lock+0x2f/0x40
[   12.535099]  ? pcpu_freelist_push+0x2a/0x40
[   12.535432]  pcpu_freelist_push+0x2a/0x40
[   12.535750]  bpf_get_stackid+0x494/0x4d0
[   12.536062]  ___bpf_prog_run+0x8b4/0x11a0

It has been explained that is a false positive here:
https://lkml.org/lkml/2018/7/25/756
Recap:
- stackmap uses pcpu_freelist
- The lock in pcpu_freelist is a percpu lock
- stackmap is only used by tracing bpf_prog
- A tracing bpf_prog cannot be run if another bpf_prog
  has already been running (ensured by the percpu bpf_prog_active counter).

Eric pointed out that this lockdep splats stops other
legit lockdep splats in selftests/bpf/test_progs.c.

Fix this by calling local_irq_save/restore for stackmap.

Another false positive had also been worked around by calling
local_irq_save in commit 89ad2fa3f0 ("bpf: fix lockdep splat").
That commit added unnecessary irq_save/restore to fast path of
bpf hash map. irqs are already disabled at that point, since htab
is holding per bucket spin_lock with irqsave.

Let's reduce overhead for htab by introducing __pcpu_freelist_push/pop
function w/o irqsave and convert pcpu_freelist_push/pop to irqsave
to be used elsewhere (right now only in stackmap).
It stops lockdep false positive in stackmap with a bit of acceptable overhead.

Fixes: 557c0c6e7d ("bpf: convert stackmap to pre-allocation")
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-03-13 14:02:36 -07:00
Greg Kroah-Hartman
232bd90cf2 relay: check return of create_buf_file() properly
[ Upstream commit 2c1cf00eea ]

If create_buf_file() returns an error, don't try to reference it later
as a valid dentry pointer.

This problem was exposed when debugfs started to return errors instead
of just NULL for some calls when they do not succeed properly.

Also, the check for WARN_ON(dentry) was just wrong :)

Reported-by: Kees Cook <keescook@chromium.org>
Reported-and-tested-by: syzbot+16c3a70e1e9b29346c43@syzkaller.appspotmail.com
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Rientjes <rientjes@google.com>
Fixes: ff9fb72bc0 ("debugfs: return error values, not NULL")
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-03-13 14:02:35 -07:00
Stephane Eranian
6ec0698f1c perf core: Fix perf_proc_update_handler() bug
[ Upstream commit 1a51c5da5a ]

The perf_proc_update_handler() handles /proc/sys/kernel/perf_event_max_sample_rate
syctl variable.  When the PMU IRQ handler timing monitoring is disabled, i.e,
when /proc/sys/kernel/perf_cpu_time_max_percent is equal to 0 or 100,
then no modification to sysctl_perf_event_sample_rate is allowed to prevent
possible hang from wrong values.

The problem is that the test to prevent modification is made after the
sysctl variable is modified in perf_proc_update_handler().

You get an error:

  $ echo 10001 >/proc/sys/kernel/perf_event_max_sample_rate
  echo: write error: invalid argument

But the value is still modified causing all sorts of inconsistencies:

  $ cat /proc/sys/kernel/perf_event_max_sample_rate
  10001

This patch fixes the problem by moving the parsing of the value after
the test.

Committer testing:

  # echo 100 > /proc/sys/kernel/perf_cpu_time_max_percent
  # echo 10001 > /proc/sys/kernel/perf_event_max_sample_rate
  -bash: echo: write error: Invalid argument
  # cat /proc/sys/kernel/perf_event_max_sample_rate
  10001
  #

Signed-off-by: Stephane Eranian <eranian@google.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Jiri Olsa <jolsa@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1547169436-6266-1-git-send-email-eranian@google.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-03-13 14:02:26 -07:00
Greg Kroah-Hartman
34e9e65731 This is the 4.19.28 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlyEq/IACgkQONu9yGCS
 aT77AhAAgqbxHsKsgqh7JV477xEgLqrLKYw+/6Bx+l79fUZoaR3PRwM62UEzikPQ
 KbvutNuOHOWuJA0Xyj5gqV4SJaqBOkoGNnHFOBi/qjtFUsOlBFrlGpow+0fsP/ay
 Bmo0LhoTvBub4ap7bJt4pwel/elWYVtOkA1Qgv3OCiDkTorYuPTbIUyuAVOJJbRn
 sZ1eKi00CQPrN65Rxgci0g0p/m7JWpvW2zqmDNZJuZZeEmSLdrrZGwt5ExiI6oKz
 CqX/VBGChEesMTEOLsSfRyg6NZW3j4rOUaCzkxDq/Tsh9XqNabhk1jod2p3t7Nmu
 n5Js3ujfuyCKf5tD49Z8xy5A++nYyJLa5jbFnURr2H/ZJPla0CHuc4RoFf/wFYr4
 xQAeA3XXiZZB02n6oZlweUSp7hVnCgLJ4Ev2ctAyPUpyf4ncl+vffzj40bozFvAC
 adJE1UyEJp0xUuRVdx0I+HyueHcWmRIAgs0iz9B5S2KNc4rYDqE+/t+ddrNqjvSF
 +C33nMn+7A+ngmQlwWjOBaZhhQn3qWrWU0ACERbIG/DUD2voBYu3oIfQzKsXTt0V
 erSSDq0KSy73PRKN4Tzf2GnDUAlNITUuyFgWITOg8p29HuXJO00p4MQ7fOMYacyB
 WdwXFB7islUtUoA8nEedZgF7IF1WRh6Iz2HJ5uMTl6pCMSV+8Dk=
 =OvW8
 -----END PGP SIGNATURE-----

Merge 4.19.28 into android-4.19

Changes in 4.19.28
	cpufreq: Use struct kobj_attribute instead of struct global_attr
	staging: erofs: fix mis-acted TAIL merging behavior
	USB: serial: option: add Telit ME910 ECM composition
	USB: serial: cp210x: add ID for Ingenico 3070
	USB: serial: ftdi_sio: add ID for Hjelmslund Electronics USB485
	staging: erofs: fix illegal address access under memory pressure
	staging: erofs: compressed_pages should not be accessed again after freed
	staging: comedi: ni_660x: fix missing break in switch statement
	staging: wilc1000: fix to set correct value for 'vif_num'
	staging: android: ion: fix sys heap pool's gfp_flags
	staging: android: ashmem: Don't call fallocate() with ashmem_mutex held.
	staging: android: ashmem: Avoid range_alloc() allocation with ashmem_mutex held.
	ip6mr: Do not call __IP6_INC_STATS() from preemptible context
	net: dsa: mv88e6xxx: handle unknown duplex modes gracefully in mv88e6xxx_port_set_duplex
	net: dsa: mv8e6xxx: fix number of internal PHYs for 88E6x90 family
	net: sched: put back q.qlen into a single location
	net-sysfs: Fix mem leak in netdev_register_kobject
	qmi_wwan: Add support for Quectel EG12/EM12
	sctp: call iov_iter_revert() after sending ABORT
	sky2: Disable MSI on Dell Inspiron 1545 and Gateway P-79
	team: Free BPF filter when unregistering netdev
	tipc: fix RDM/DGRAM connect() regression
	bnxt_en: Drop oversize TX packets to prevent errors.
	geneve: correctly handle ipv6.disable module parameter
	hv_netvsc: Fix IP header checksum for coalesced packets
	ipv4: Add ICMPv6 support when parse route ipproto
	lan743x: Fix TX Stall Issue
	net: dsa: mv88e6xxx: Fix statistics on mv88e6161
	net: dsa: mv88e6xxx: Fix u64 statistics
	netlabel: fix out-of-bounds memory accesses
	net: netem: fix skb length BUG_ON in __skb_to_sgvec
	net: nfc: Fix NULL dereference on nfc_llcp_build_tlv fails
	net: phy: Micrel KSZ8061: link failure after cable connect
	net: phy: phylink: fix uninitialized variable in phylink_get_mac_state
	net: sit: fix memory leak in sit_init_net()
	net: socket: set sock->sk to NULL after calling proto_ops::release()
	tipc: fix race condition causing hung sendto
	tun: fix blocking read
	xen-netback: don't populate the hash cache on XenBus disconnect
	xen-netback: fix occasional leak of grant ref mappings under memory pressure
	tun: remove unnecessary memory barrier
	net: Add __icmp_send helper.
	net: avoid use IPCB in cipso_v4_error
	ipv4: Return error for RTA_VIA attribute
	ipv6: Return error for RTA_VIA attribute
	mpls: Return error for RTA_GATEWAY attribute
	ipv4: Pass original device to ip_rcv_finish_core
	net: dsa: mv88e6xxx: power serdes on/off for 10G interfaces on 6390X
	net: dsa: mv88e6xxx: prevent interrupt storm caused by mv88e6390x_port_set_cmode
	net/sched: act_ipt: fix refcount leak when replace fails
	net/sched: act_skbedit: fix refcount leak when replace fails
	net: sched: act_tunnel_key: fix NULL pointer dereference during init
	x86/CPU/AMD: Set the CPB bit unconditionally on F17h
	x86/boot/compressed/64: Do not read legacy ROM on EFI system
	tracing: Fix event filters and triggers to handle negative numbers
	usb: xhci: Fix for Enabling USB ROLE SWITCH QUIRK on INTEL_SUNRISEPOINT_LP_XHCI
	applicom: Fix potential Spectre v1 vulnerabilities
	MIPS: irq: Allocate accurate order pages for irq stack
	aio: Fix locking in aio_poll()
	xtensa: fix get_wchan
	gnss: sirf: fix premature wakeup interrupt enable
	USB: serial: cp210x: fix GPIO in autosuspend
	selftests: firmware: fix verify_reqs() return value
	Bluetooth: btrtl: Restore old logic to assume firmware is already loaded
	Bluetooth: Fix locking in bt_accept_enqueue() for BH context
	exec: Fix mem leak in kernel_read_file
	scsi: core: reset host byte in DID_NEXUS_FAILURE case
	bpf: fix sanitation rewrite in case of non-pointers
	Linux 4.19.28

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-03-10 07:23:21 +01:00
Daniel Borkmann
ca490a9873 bpf: fix sanitation rewrite in case of non-pointers
commit 3612af783c upstream.

Marek reported that he saw an issue with the below snippet in that
timing measurements where off when loaded as unpriv while results
were reasonable when loaded as privileged:

    [...]
    uint64_t a = bpf_ktime_get_ns();
    uint64_t b = bpf_ktime_get_ns();
    uint64_t delta = b - a;
    if ((int64_t)delta > 0) {
    [...]

Turns out there is a bug where a corner case is missing in the fix
d3bd7413e0 ("bpf: fix sanitation of alu op with pointer / scalar
type from different paths"), namely fixup_bpf_calls() only checks
whether aux has a non-zero alu_state, but it also needs to test for
the case of BPF_ALU_NON_POINTER since in both occasions we need to
skip the masking rewrite (as there is nothing to mask).

Fixes: d3bd7413e0 ("bpf: fix sanitation of alu op with pointer / scalar type from different paths")
Reported-by: Marek Majkowski <marek@cloudflare.com>
Reported-by: Arthur Fabre <afabre@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/netdev/CAJPywTJqP34cK20iLM5YmUMz9KXQOdu1-+BZrGMAGgLuBWz7fg@mail.gmail.com/T/
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-03-10 07:17:22 +01:00
Pavel Tikhomirov
690e939da7 tracing: Fix event filters and triggers to handle negative numbers
commit 6a072128d2 upstream.

Then tracing syscall exit event it is extremely useful to filter exit
codes equal to some negative value, to react only to required errors.
But negative numbers does not work:

[root@snorch sys_exit_read]# echo "ret == -1" > filter
bash: echo: write error: Invalid argument
[root@snorch sys_exit_read]# cat filter
ret == -1
        ^
parse_error: Invalid value (did you forget quotes)?

Similar thing happens when setting triggers.

These is a regression in v4.17 introduced by the commit mentioned below,
testing without these commit shows no problem with negative numbers.

Link: http://lkml.kernel.org/r/20180823102534.7642-1-ptikhomirov@virtuozzo.com

Cc: stable@vger.kernel.org
Fixes: 80765597bc ("tracing: Rewrite filter logic to be simpler and faster")
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-03-10 07:17:20 +01:00
Connor O'Brien
fe90216264 ANDROID: proc: Add /proc/uid directory
Add support for reporting per-uid information through procfs, roughly
following the approach used for per-tid and per-tgid directories in
fs/proc/base.c.
This also entails some new tracking of which uids have been used, to
avoid losing information when the last task with a given uid exits.

Bug: 72339335
Bug: 127641090
Test: ls /proc/uid/; compare with UIDs in /proc/uid_time_in_state
Change-Id: I0908f0c04438b11ceb673d860e58441bf503d478
Signed-off-by: Connor O'Brien <connoro@google.com>
[AmitP: Fix proc_fill_cache() now that upstream commit
        0168b9e38c ("procfs: switch instantiate_t to d_splice_alias()"),
        switched instantiate() callback to d_splice_alias()]
Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
[astrachan: Folded 97b7790f505e ("ANDROID: proc: fix undefined behavior
            in proc_uid_base_readdir") into this change]
Signed-off-by: Alistair Strachan <astrachan@google.com>
2019-03-06 15:59:21 +00:00
Connor O'Brien
406d53f0c7 ANDROID: cpufreq: track per-task time in state
Add time in state data to task structs, and create
/proc/<pid>/time_in_state files to show how long each individual task
has run at each frequency.
Create a CONFIG_CPU_FREQ_TIMES option to enable/disable this tracking.

Bug: 72339335
Bug: 127641090
Test: Read /proc/<pid>/time_in_state
Change-Id: Ia6456754f4cb1e83b2bc35efa8fbe9f8696febc8
Signed-off-by: Connor O'Brien <connoro@google.com>
[astrachan: Folded the following changes into this patch:
            a6d3de6a7fba ("ANDROID: Reduce use of #ifdef CONFIG_CPU_FREQ_TIMES")
            b89ada5d9c09 ("ANDROID: Fix massive cpufreq_times memory leaks")]
Signed-off-by: Alistair Strachan <astrachan@google.com>
2019-03-06 15:57:25 +00:00
Greg Kroah-Hartman
36d178b3bc This is the 4.19.27 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlx+qs4ACgkQONu9yGCS
 aT4OQQ/8DMkQa61GLVtL1Zm0i2GW+fpgAJj59QP7jRdilolP7BBnrX2Ne26Dx+X0
 bYLJDJw6WexsvLKOkSMol4Y3Q3NsBguC+MasVrrWBHpfntizlaPzsYuTKkyNRG0c
 vjOYyTeAxElWqEh1WaVQhk/juhBbrTtkZIK63h9M60KYb0qHA7TY9oTGokEA8WA0
 11Yuge2aV6cR8zup1coWR9NRvW/uealiENAhr0Jw1ZRtrBqwzQoFyn5psXdbzkjb
 HrwvDa/nkwfJuc/1+grOTF5HfI7D5+giq0iRtvBuChjDwqfJ2IrVRPF/DKSmR0Oz
 Sq1ILUdj61gd2hp4N9zpUE87r2tV2ABg/yZrtDcKFhCiUWnfaaGArn7mm/R7OhCD
 PjRDz7acVvKq3LLenM6xcFsAWUdjOoUWcGk70eO/ZZctW8o+HozxFcVrNmT/75zJ
 LcagplvYLiLB4Gard6niNH8qunmq/jD+GkgdKtF9GFHOjG1QMdQ9d/VxeJWGAjtA
 nE8ZX9EFYgmq14OR+YK0DNOOFBZe10lcR9rMIFp6T9AanY7rVT7LqqG6BUXb226P
 mJmNS9sSJhictucyyf/ej9j0jULBBsAYJA5+fkECHIioewACjFGn1dZfm3BBrDAb
 OoNIwaLmrvjh2iKwsj17OK9Fv8PG6VKwYRFQ/I3OGVouZUX5SEk=
 =edyH
 -----END PGP SIGNATURE-----

Merge 4.19.27 into android-4.19

Changes in 4.19.27
	irq/matrix: Split out the CPU selection code into a helper
	irq/matrix: Spread managed interrupts on allocation
	genirq/matrix: Improve target CPU selection for managed interrupts.
	mac80211: Change default tx_sk_pacing_shift to 7
	scsi: libsas: Fix rphy phy_identifier for PHYs with end devices attached
	drm/msm: Unblock writer if reader closes file
	ASoC: Intel: Haswell/Broadwell: fix setting for .dynamic field
	ALSA: compress: prevent potential divide by zero bugs
	ASoC: Variable "val" in function rt274_i2c_probe() could be uninitialized
	clk: tegra: dfll: Fix a potential Oop in remove()
	clk: sysfs: fix invalid JSON in clk_dump
	clk: vc5: Abort clock configuration without upstream clock
	thermal: int340x_thermal: Fix a NULL vs IS_ERR() check
	usb: dwc3: gadget: synchronize_irq dwc irq in suspend
	usb: dwc3: gadget: Fix the uninitialized link_state when udc starts
	usb: gadget: Potential NULL dereference on allocation error
	selftests: rtc: rtctest: fix alarm tests
	selftests: rtc: rtctest: add alarm test on minute boundary
	genirq: Make sure the initial affinity is not empty
	x86/mm/mem_encrypt: Fix erroneous sizeof()
	ASoC: rt5682: Fix PLL source register definitions
	ASoC: dapm: change snprintf to scnprintf for possible overflow
	ASoC: imx-audmux: change snprintf to scnprintf for possible overflow
	selftests/vm/gup_benchmark.c: match gup struct to kernel
	phy: ath79-usb: Fix the power on error path
	phy: ath79-usb: Fix the main reset name to match the DT binding
	selftests: seccomp: use LDLIBS instead of LDFLAGS
	selftests: gpio-mockup-chardev: Check asprintf() for error
	irqchip/gic-v3-mbi: Fix uninitialized mbi_lock
	ARC: fix __ffs return value to avoid build warnings
	ARC: show_regs: lockdep: avoid page allocator...
	drivers: thermal: int340x_thermal: Fix sysfs race condition
	staging: rtl8723bs: Fix build error with Clang when inlining is disabled
	mac80211: fix miscounting of ttl-dropped frames
	sched/wait: Fix rcuwait_wake_up() ordering
	sched/wake_q: Fix wakeup ordering for wake_q
	futex: Fix (possible) missed wakeup
	locking/rwsem: Fix (possible) missed wakeup
	drm/amd/powerplay: OD setting fix on Vega10
	tty: serial: qcom_geni_serial: Allow mctrl when flow control is disabled
	serial: fsl_lpuart: fix maximum acceptable baud rate with over-sampling
	drm/sun4i: hdmi: Fix usage of TMDS clock
	staging: android: ion: Support cpu access during dma_buf_detach
	direct-io: allow direct writes to empty inodes
	writeback: synchronize sync(2) against cgroup writeback membership switches
	scsi: lpfc: nvme: avoid hang / use-after-free when destroying localport
	scsi: lpfc: nvmet: avoid hang / use-after-free when destroying targetport
	scsi: csiostor: fix NULL pointer dereference in csio_vport_set_state()
	net: altera_tse: fix connect_local_phy error path
	hv_netvsc: Fix ethtool change hash key error
	hv_netvsc: Refactor assignments of struct netvsc_device_info
	hv_netvsc: Fix hash key value reset after other ops
	nvme-rdma: fix timeout handler
	nvme-multipath: drop optimization for static ANA group IDs
	drm/msm: Fix A6XX support for opp-level
	net: usb: asix: ax88772_bind return error when hw_reset fail
	net: dev_is_mac_header_xmit() true for ARPHRD_RAWIP
	ibmveth: Do not process frames after calling napi_reschedule
	mac80211: don't initiate TDLS connection if station is not associated to AP
	mac80211: Add attribute aligned(2) to struct 'action'
	cfg80211: extend range deviation for DMG
	svm: Fix AVIC incomplete IPI emulation
	KVM: nSVM: clear events pending from svm_complete_interrupts() when exiting to L1
	kvm: selftests: Fix region overlap check in kvm_util
	mmc: spi: Fix card detection during probe
	mmc: tmio_mmc_core: don't claim spurious interrupts
	mmc: tmio: fix access width of Block Count Register
	mmc: core: Fix NULL ptr crash from mmc_should_fail_request
	mmc: cqhci: fix space allocated for transfer descriptor
	mmc: cqhci: Fix a tiny potential memory leak on error condition
	mmc: sdhci-esdhc-imx: correct the fix of ERR004536
	mm: enforce min addr even if capable() in expand_downwards()
	drm: Block fb changes for async plane updates
	hugetlbfs: fix races and page leaks during migration
	MIPS: fix truncation in __cmpxchg_small for short values
	MIPS: BCM63XX: provide DMA masks for ethernet devices
	MIPS: eBPF: Fix icache flush end address
	x86/uaccess: Don't leak the AC flag into __put_user() value evaluation
	Linux 4.19.27

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2019-03-05 18:07:53 +01:00
Xie Yongji
9ad6216e8c locking/rwsem: Fix (possible) missed wakeup
[ Upstream commit e158488be2 ]

Because wake_q_add() can imply an immediate wakeup (cmpxchg failure
case), we must not rely on the wakeup being delayed. However, commit:

  e38513905e ("locking/rwsem: Rework zeroing reader waiter->task")

relies on exactly that behaviour in that the wakeup must not happen
until after we clear waiter->task.

[ peterz: Added changelog. ]

Signed-off-by: Xie Yongji <xieyongji@baidu.com>
Signed-off-by: Zhang Yu <zhangyu31@baidu.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: e38513905e ("locking/rwsem: Rework zeroing reader waiter->task")
Link: https://lkml.kernel.org/r/1543495830-2644-1-git-send-email-xieyongji@baidu.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-03-05 17:58:49 +01:00
Peter Zijlstra
2368e6d3bc futex: Fix (possible) missed wakeup
[ Upstream commit b061c38bef ]

We must not rely on wake_q_add() to delay the wakeup; in particular
commit:

  1d0dcb3ad9 ("futex: Implement lockless wakeups")

moved wake_q_add() before smp_store_release(&q->lock_ptr, NULL), which
could result in futex_wait() waking before observing ->lock_ptr ==
NULL and going back to sleep again.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 1d0dcb3ad9 ("futex: Implement lockless wakeups")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-03-05 17:58:49 +01:00
Peter Zijlstra
653a1dbcb0 sched/wake_q: Fix wakeup ordering for wake_q
[ Upstream commit 4c4e373156 ]

Notable cmpxchg() does not provide ordering when it fails, however
wake_q_add() requires ordering in this specific case too. Without this
it would be possible for the concurrent wakeup to not observe our
prior state.

Andrea Parri provided:

  C wake_up_q-wake_q_add

  {
	int next = 0;
	int y = 0;
  }

  P0(int *next, int *y)
  {
	int r0;

	/* in wake_up_q() */

	WRITE_ONCE(*next, 1);   /* node->next = NULL */
	smp_mb();               /* implied by wake_up_process() */
	r0 = READ_ONCE(*y);
  }

  P1(int *next, int *y)
  {
	int r1;

	/* in wake_q_add() */

	WRITE_ONCE(*y, 1);      /* wake_cond = true */
	smp_mb__before_atomic();
	r1 = cmpxchg_relaxed(next, 1, 2);
  }

  exists (0:r0=0 /\ 1:r1=0)

  This "exists" clause cannot be satisfied according to the LKMM:

  Test wake_up_q-wake_q_add Allowed
  States 3
  0:r0=0; 1:r1=1;
  0:r0=1; 1:r1=0;
  0:r0=1; 1:r1=1;
  No
  Witnesses
  Positive: 0 Negative: 3
  Condition exists (0:r0=0 /\ 1:r1=0)
  Observation wake_up_q-wake_q_add Never 0 3

Reported-by: Yongji Xie <elohimes@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-03-05 17:58:49 +01:00
Prateek Sood
5024f0a29a sched/wait: Fix rcuwait_wake_up() ordering
[ Upstream commit 6dc080eeb2 ]

For some peculiar reason rcuwait_wake_up() has the right barrier in
the comment, but not in the code.

This mistake has been observed to cause a deadlock in the following
situation:

    P1					P2

    percpu_up_read()			percpu_down_write()
      rcu_sync_is_idle() // false
					  rcu_sync_enter()
					  ...
      __percpu_up_read()

[S] ,-  __this_cpu_dec(*sem->read_count)
    |   smp_rmb();
[L] |   task = rcu_dereference(w->task) // NULL
    |
    |				    [S]	    w->task = current
    |					    smp_mb();
    |				    [L]	    readers_active_check() // fail
    `-> <store happens here>

Where the smp_rmb() (obviously) fails to constrain the store.

[ peterz: Added changelog. ]

Signed-off-by: Prateek Sood <prsood@codeaurora.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Andrea Parri <andrea.parri@amarulasolutions.com>
Acked-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 8f95c90ceb ("sched/wait, RCU: Introduce rcuwait machinery")
Link: https://lkml.kernel.org/r/1543590656-7157-1-git-send-email-prsood@codeaurora.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-03-05 17:58:49 +01:00