Now that the mainline EAS wake-up path has been extended to cope with
prefer-idle tasks, the need for a dedicated Android-specific wake-up
routine (find_best_target()) becomes less clear. Indeed, main reasons
for introducting find_best_target() in the first place were:
1. the energy_diff function was very slow, so we couldn't afford to
use it on all CPUs for each wake-up for latency reasons;
2. schedtune provides additional information about tasks (the
prefer-idle flag in particular) which needed to be taken into
account in the placement algorithm.
Now that the energy diff calculation is much faster (with the simplified
energy model) and that the EAS path is aware of prefer-idle tasks, there
is no clear reason to use find_best_target() any more.
So, let's disable it for now to minimize the amount of out-of-tree code
used in the scheduler. If using the mainline path doesn't cause
regressions, it is a good sign find_best_target() can be removed safely,
eventually. Otherwise, reverting back to the old behaviour is trivial
since this patch only changes the sched_feat default, but doesn't remove
the fbt() code path.
Bug: 120440300
Change-Id: Idb5d68a3c4af7d2212e0922ab6d9a089170b5e1c
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Make the mainline EAS wake-up path aware of prefer idle tasks in
preparation for disabling find_best_target().
What is done in the mainline algoritm isn't strictly equivalent to the
find_best_target() algorithm but comes real close, and isn't very
invasive. The main differences with the original find_best_target()
behaviour are the following:
1. the policy for prefer idle when there isn't a single idle CPU in the
system is simpler now. We just pick the CPU with the highest spare
capacity;
2. the cstate awareness for prefer idle is implemented by minimizing
the exit latency rather than the idle state index. This is how it is
done in the slow path (find_idlest_group_cpu()), it doesn't require
us to keep hooks into CPUIdle, and should actually be better because
what we want is a CPU that can wake up quickly;
3. non-prefer-idle tasks just use the standard mainline energy-aware
wake-up path, which decides the placement using the Energy Model.
Bug: 120440300
Change-Id: I57769c90c57115f6a28d27c5a88e08aa93a30a56
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
util_est is mainly meant to be a lower-bound for tasks utilization.
That's why task_util_est() returns the actual util_avg when it's higher
than the estimated utilization.
With new invaraince signal and without any special check on samples
collection, if a task is limited because of thermal capping for
example, we could end up overestimating its utilization and thus
perhaps generating an unwanted frequency spike when the capping is
relaxed... and (even worst) it will take some more activations for the
estimated utilization to converge back to the actual utilization.
Since we cannot easily know if there is idle time in a CPU when a task
completes an activation with a utilization higher then the CPU capacity,
we skip the sampling when utilization is higher than CPU's capacity.
Bug: 120440300
Change-Id: If1a6001451f80acb953e2a5f955fd302b1b73bc0
Suggested-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: dietmar.eggemann@arm.com
Cc: pjt@google.com
Cc: pkondeti@codeaurora.org
Cc: quentin.perret@arm.com
Cc: rjw@rjwysocki.net
Cc: srinivas.pandruvada@linux.intel.com
Cc: thara.gopinath@linaro.org
Link: https://lkml.kernel.org/r/1548257214-13745-4-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 10a35e6812)
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
The current implementation of load tracking invariance scales the
contribution with current frequency and uarch performance (only for
utilization) of the CPU. One main result of this formula is that the
figures are capped by current capacity of CPU. Another one is that the
load_avg is not invariant because not scaled with uarch.
The util_avg of a periodic task that runs r time slots every p time slots
varies in the range :
U * (1-y^r)/(1-y^p) * y^i < Utilization < U * (1-y^r)/(1-y^p)
with U is the max util_avg value = SCHED_CAPACITY_SCALE
At a lower capacity, the range becomes:
U * C * (1-y^r')/(1-y^p) * y^i' < Utilization < U * C * (1-y^r')/(1-y^p)
with C reflecting the compute capacity ratio between current capacity and
max capacity.
so C tries to compensate changes in (1-y^r') but it can't be accurate.
Instead of scaling the contribution value of PELT algo, we should scale the
running time. The PELT signal aims to track the amount of computation of
tasks and/or rq so it seems more correct to scale the running time to
reflect the effective amount of computation done since the last update.
In order to be fully invariant, we need to apply the same amount of
running time and idle time whatever the current capacity. Because running
at lower capacity implies that the task will run longer, we have to ensure
that the same amount of idle time will be applied when system becomes idle
and no idle time has been "stolen". But reaching the maximum utilization
value (SCHED_CAPACITY_SCALE) means that the task is seen as an
always-running task whatever the capacity of the CPU (even at max compute
capacity). In this case, we can discard this "stolen" idle times which
becomes meaningless.
In order to achieve this time scaling, a new clock_pelt is created per rq.
The increase of this clock scales with current capacity when something
is running on rq and synchronizes with clock_task when rq is idle. With
this mechanism, we ensure the same running and idle time whatever the
current capacity. This also enables to simplify the pelt algorithm by
removing all references of uarch and frequency and applying the same
contribution to utilization and loads. Furthermore, the scaling is done
only once per update of clock (update_rq_clock_task()) instead of during
each update of sched_entities and cfs/rt/dl_rq of the rq like the current
implementation. This is interesting when cgroup are involved as shown in
the results below:
On a hikey (octo Arm64 platform).
Performance cpufreq governor and only shallowest c-state to remove variance
generated by those power features so we only track the impact of pelt algo.
each test runs 16 times:
./perf bench sched pipe
(higher is better)
kernel tip/sched/core + patch
ops/seconds ops/seconds diff
cgroup
root 59652(+/- 0.18%) 59876(+/- 0.24%) +0.38%
level1 55608(+/- 0.27%) 55923(+/- 0.24%) +0.57%
level2 52115(+/- 0.29%) 52564(+/- 0.22%) +0.86%
hackbench -l 1000
(lower is better)
kernel tip/sched/core + patch
duration(sec) duration(sec) diff
cgroup
root 4.453(+/- 2.37%) 4.383(+/- 2.88%) -1.57%
level1 4.859(+/- 8.50%) 4.830(+/- 7.07%) -0.60%
level2 5.063(+/- 9.83%) 4.928(+/- 9.66%) -2.66%
Then, the responsiveness of PELT is improved when CPU is not running at max
capacity with this new algorithm. I have put below some examples of
duration to reach some typical load values according to the capacity of the
CPU with current implementation and with this patch. These values has been
computed based on the geometric series and the half period value:
Util (%) max capacity half capacity(mainline) half capacity(w/ patch)
972 (95%) 138ms not reachable 276ms
486 (47.5%) 30ms 138ms 60ms
256 (25%) 13ms 32ms 26ms
On my hikey (octo Arm64 platform) with schedutil governor, the time to
reach max OPP when starting from a null utilization, decreases from 223ms
with current scale invariance down to 121ms with the new algorithm.
Bug: 120440300
Change-Id: I0bd4ed2317f2a9a965634e53ce1476417af697a6
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: dietmar.eggemann@arm.com
Cc: patrick.bellasi@arm.com
Cc: pjt@google.com
Cc: pkondeti@codeaurora.org
Cc: quentin.perret@arm.com
Cc: rjw@rjwysocki.net
Cc: srinivas.pandruvada@linux.intel.com
Cc: thara.gopinath@linaro.org
Link: https://lkml.kernel.org/r/1548257214-13745-3-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 2312729688)
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
A CFS (SCHED_OTHER, SCHED_BATCH or SCHED_IDLE policy) task's
se->runnable_weight must always be in sync with its se->load.weight.
se->runnable_weight is set to se->load.weight when the task is
forked (init_entity_runnable_average()) or reniced (reweight_entity()).
There are two cases in set_load_weight() which since they currently only
set se->load.weight could lead to a situation in which se->load.weight
is different to se->runnable_weight for a CFS task:
(1) A task switches to SCHED_IDLE.
(2) A SCHED_FIFO, SCHED_RR or SCHED_DEADLINE task which has been reniced
(during which only its static priority gets set) switches to
SCHED_OTHER or SCHED_BATCH.
Set se->runnable_weight to se->load.weight in these two cases to prevent
this. This eliminates the need to explicitly set it to se->load.weight
during PELT updates in the CFS scheduler fastpath.
Bug: 120440300
Change-Id: I52184a9e1fd53cb42ef3ae546b1fae78b744c9ad
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: http://lkml.kernel.org/r/20180803140538.1178-1-dietmar.eggemann@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 4a465e3ebb)
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlyWhJcACgkQONu9yGCS
aT6XzxAAzP2QGzC4SVPgcFH1woF/d8Cz0zQ81mLXzjXtEPm39fZCM2hbBnxkXLu1
peFyrKNk6/c9541D9gsQCQT6Fu+H6u1bJKcIezlKJ2xyB/MsU1hXkjZrTJYW3RRs
gimy1EGdood2el1ubEBZiaspazoeRzBqtg1Nsmr4V0l+RT8HwtKKw+0+Nxixfp59
NoVkqTpPI5mL0FiH2R9ogcfg3SvgMZOsOhOBjdPvSjiJJsbvIWcW48MCs95XSUpF
R+l/fWn+oiFCcIqBaFheujuqZMvVrUHZHaWAPMuoR/c3Cdf0lTBokdv6UM9c0nv3
61jX5r5ImRI/dfQANN5mbB1YKcs5xOI+I7QZHQ2q4clsWrWyLapXW4clrAZJ6z5t
UVeVbuLV2y5PL9GJyBcXpyY0BOf4e2gZURaPY3C5McNwgybNoiR0ZePqKb8ZhZyh
jYOYRoBjJJpZoVTSt6MNX95NTvGaSAtqKMu1s3IeMfpwCfQKBPMOuBHr/dUqSC6I
U0xxjk/71C15dSPVcTVJT/lmcKc6TXgoagnfbn8GBtDOAjBNsYyUJLQI+db1ERCe
9MEB9k1Z87ROQ5jQCQmWsewOVAtFZBEvSszFmpKv3zTe8M2oFpXG56zckdiumwHU
nSfeZTTeWzsFJd30MioEnGYm3ZwKwZx7wi0x4B4WWvBfSpp20Us=
=xtLx
-----END PGP SIGNATURE-----
Merge 4.19.31 into android-4.19
Changes in 4.19.31
media: videobuf2-v4l2: drop WARN_ON in vb2_warn_zero_bytesused()
9p: use inode->i_lock to protect i_size_write() under 32-bit
9p/net: fix memory leak in p9_client_create
ASoC: fsl_esai: fix register setting issue in RIGHT_J mode
ASoC: codecs: pcm186x: fix wrong usage of DECLARE_TLV_DB_SCALE()
ASoC: codecs: pcm186x: Fix energysense SLEEP bit
iio: adc: exynos-adc: Fix NULL pointer exception on unbind
mei: hbm: clean the feature flags on link reset
mei: bus: move hw module get/put to probe/release
stm class: Fix an endless loop in channel allocation
crypto: caam - fix hash context DMA unmap size
crypto: ccree - fix missing break in switch statement
crypto: caam - fixed handling of sg list
crypto: caam - fix DMA mapping of stack memory
crypto: ccree - fix free of unallocated mlli buffer
crypto: ccree - unmap buffer before copying IV
crypto: ccree - don't copy zero size ciphertext
crypto: cfb - add missing 'chunksize' property
crypto: cfb - remove bogus memcpy() with src == dest
crypto: ahash - fix another early termination in hash walk
crypto: rockchip - fix scatterlist nents error
crypto: rockchip - update new iv to device in multiple operations
drm/imx: ignore plane updates on disabled crtcs
gpu: ipu-v3: Fix i.MX51 CSI control registers offset
drm/imx: imx-ldb: add missing of_node_puts
gpu: ipu-v3: Fix CSI offsets for imx53
ASoC: rt5682: Correct the setting while select ASRC clk for AD/DA filter
clocksource: timer-ti-dm: Fix pwm dmtimer usage of fck reparenting
KVM: arm/arm64: vgic: Make vgic_dist->lpi_list_lock a raw_spinlock
arm64: dts: rockchip: fix graph_port warning on rk3399 bob kevin and excavator
s390/dasd: fix using offset into zero size array error
Input: pwm-vibra - prevent unbalanced regulator
Input: pwm-vibra - stop regulator after disabling pwm, not before
ARM: dts: Configure clock parent for pwm vibra
ARM: OMAP2+: Variable "reg" in function omap4_dsi_mux_pads() could be uninitialized
ASoC: dapm: fix out-of-bounds accesses to DAPM lookup tables
ASoC: rsnd: fixup rsnd_ssi_master_clk_start() user count check
KVM: arm/arm64: Reset the VCPU without preemption and vcpu state loaded
arm/arm64: KVM: Allow a VCPU to fully reset itself
arm/arm64: KVM: Don't panic on failure to properly reset system registers
KVM: arm/arm64: vgic: Always initialize the group of private IRQs
KVM: arm64: Forbid kprobing of the VHE world-switch code
ASoC: samsung: Prevent clk_get_rate() calls in atomic context
ARM: OMAP2+: fix lack of timer interrupts on CPU1 after hotplug
Input: cap11xx - switch to using set_brightness_blocking()
Input: ps2-gpio - flush TX work when closing port
Input: matrix_keypad - use flush_delayed_work()
mac80211: call drv_ibss_join() on restart
mac80211: Fix Tx aggregation session tear down with ITXQs
netfilter: compat: initialize all fields in xt_init
blk-mq: insert rq with DONTPREP to hctx dispatch list when requeue
ipvs: fix dependency on nf_defrag_ipv6
floppy: check_events callback should not return a negative number
xprtrdma: Make sure Send CQ is allocated on an existing compvec
NFS: Don't use page_file_mapping after removing the page
mm/gup: fix gup_pmd_range() for dax
Revert "mm: use early_pfn_to_nid in page_ext_init"
scsi: qla2xxx: Fix panic from use after free in qla2x00_async_tm_cmd
net: dsa: bcm_sf2: potential array overflow in bcm_sf2_sw_suspend()
x86/CPU: Add Icelake model number
mm: page_alloc: fix ref bias in page_frag_alloc() for 1-byte allocs
net: hns: Fix object reference leaks in hns_dsaf_roce_reset()
i2c: cadence: Fix the hold bit setting
i2c: bcm2835: Clear current buffer pointers and counts after a transfer
auxdisplay: ht16k33: fix potential user-after-free on module unload
Input: st-keyscan - fix potential zalloc NULL dereference
clk: sunxi-ng: v3s: Fix TCON reset de-assert bit
kallsyms: Handle too long symbols in kallsyms.c
clk: sunxi: A31: Fix wrong AHB gate number
esp: Skip TX bytes accounting when sending from a request socket
ARM: 8824/1: fix a migrating irq bug when hotplug cpu
bpf: only adjust gso_size on bytestream protocols
bpf: fix lockdep false positive in stackmap
af_key: unconditionally clone on broadcast
ARM: 8835/1: dma-mapping: Clear DMA ops on teardown
assoc_array: Fix shortcut creation
keys: Fix dependency loop between construction record and auth key
scsi: libiscsi: Fix race between iscsi_xmit_task and iscsi_complete_task
net: systemport: Fix reception of BPDUs
net: dsa: bcm_sf2: Do not assume DSA master supports WoL
pinctrl: meson: meson8b: fix the sdxc_a data 1..3 pins
qmi_wwan: apply SET_DTR quirk to Sierra WP7607
net: mv643xx_eth: disable clk on error path in mv643xx_eth_shared_probe()
xfrm: Fix inbound traffic via XFRM interfaces across network namespaces
mailbox: bcm-flexrm-mailbox: Fix FlexRM ring flush timeout issue
ASoC: topology: free created components in tplg load error
qed: Fix iWARP buffer size provided for syn packet processing.
qed: Fix iWARP syn packet mac address validation.
ARM: dts: armada-xp: fix Armada XP boards NAND description
arm64: Relax GIC version check during early boot
ARM: tegra: Restore DT ABI on Tegra124 Chromebooks
net: marvell: mvneta: fix DMA debug warning
mm: handle lru_add_drain_all for UP properly
tmpfs: fix link accounting when a tmpfile is linked in
ixgbe: fix older devices that do not support IXGBE_MRQC_L3L4TXSWEN
ARCv2: lib: memcpy: fix doing prefetchw outside of buffer
ARC: uacces: remove lp_start, lp_end from clobber list
ARCv2: support manual regfile save on interrupts
ARCv2: don't assume core 0x54 has dual issue
phonet: fix building with clang
mac80211_hwsim: propagate genlmsg_reply return code
bpf, lpm: fix lookup bug in map_delete_elem
net: thunderx: make CFG_DONE message to run through generic send-ack sequence
net: thunderx: add nicvf_send_msg_to_pf result check for set_rx_mode_task
nfp: bpf: fix code-gen bug on BPF_ALU | BPF_XOR | BPF_K
nfp: bpf: fix ALU32 high bits clearance bug
bnxt_en: Fix typo in firmware message timeout logic.
bnxt_en: Wait longer for the firmware message response to complete.
net: set static variable an initial value in atl2_probe()
selftests: fib_tests: sleep after changing carrier. again.
tmpfs: fix uninitialized return value in shmem_link
stm class: Prevent division by zero
nfit: acpi_nfit_ctl(): Check out_obj->type in the right place
acpi/nfit: Fix bus command validation
nfit/ars: Attempt a short-ARS whenever the ARS state is idle at boot
nfit/ars: Attempt short-ARS even in the no_init_ars case
libnvdimm/label: Clear 'updating' flag after label-set update
libnvdimm, pfn: Fix over-trim in trim_pfn_device()
libnvdimm/pmem: Honor force_raw for legacy pmem regions
libnvdimm: Fix altmap reservation size calculation
fix cgroup_do_mount() handling of failure exits
crypto: aead - set CRYPTO_TFM_NEED_KEY if ->setkey() fails
crypto: aegis - fix handling chunked inputs
crypto: arm/crct10dif - revert to C code for short inputs
crypto: arm64/aes-neonbs - fix returning final keystream block
crypto: arm64/crct10dif - revert to C code for short inputs
crypto: hash - set CRYPTO_TFM_NEED_KEY if ->setkey() fails
crypto: morus - fix handling chunked inputs
crypto: pcbc - remove bogus memcpy()s with src == dest
crypto: skcipher - set CRYPTO_TFM_NEED_KEY if ->setkey() fails
crypto: testmgr - skip crc32c context test for ahash algorithms
crypto: x86/aegis - fix handling chunked inputs and MAY_SLEEP
crypto: x86/aesni-gcm - fix crash on empty plaintext
crypto: x86/morus - fix handling chunked inputs and MAY_SLEEP
crypto: arm64/aes-ccm - fix logical bug in AAD MAC handling
crypto: arm64/aes-ccm - fix bugs in non-NEON fallback routine
CIFS: Do not reset lease state to NONE on lease break
CIFS: Do not skip SMB2 message IDs on send failures
CIFS: Fix read after write for files with read caching
tracing: Use strncpy instead of memcpy for string keys in hist triggers
tracing: Do not free iter->trace in fail path of tracing_open_pipe()
tracing/perf: Use strndup_user() instead of buggy open-coded version
xen: fix dom0 boot on huge systems
ACPI / device_sysfs: Avoid OF modalias creation for removed device
mmc: sdhci-esdhc-imx: fix HS400 timing issue
mmc:fix a bug when max_discard is 0
netfilter: ipt_CLUSTERIP: fix warning unused variable cn
spi: ti-qspi: Fix mmap read when more than one CS in use
spi: pxa2xx: Setup maximum supported DMA transfer length
regulator: s2mps11: Fix steps for buck7, buck8 and LDO35
regulator: max77620: Initialize values for DT properties
regulator: s2mpa01: Fix step values for some LDOs
clocksource/drivers/exynos_mct: Move one-shot check from tick clear to ISR
clocksource/drivers/exynos_mct: Clear timer interrupt when shutdown
clocksource/drivers/arch_timer: Workaround for Allwinner A64 timer instability
s390/setup: fix early warning messages
s390/virtio: handle find on invalid queue gracefully
scsi: virtio_scsi: don't send sc payload with tmfs
scsi: aacraid: Fix performance issue on logical drives
scsi: sd: Optimal I/O size should be a multiple of physical block size
scsi: target/iscsi: Avoid iscsit_release_commands_from_conn() deadlock
scsi: qla2xxx: Fix LUN discovery if loop id is not assigned yet by firmware
fs/devpts: always delete dcache dentry-s in dput()
splice: don't merge into linked buffers
ovl: During copy up, first copy up data and then xattrs
ovl: Do not lose security.capability xattr over metadata file copy-up
m68k: Add -ffreestanding to CFLAGS
Btrfs: setup a nofs context for memory allocation at btrfs_create_tree()
Btrfs: setup a nofs context for memory allocation at __btrfs_set_acl
btrfs: ensure that a DUP or RAID1 block group has exactly two stripes
Btrfs: fix corruption reading shared and compressed extents after hole punching
soc: qcom: rpmh: Avoid accessing freed memory from batch API
libertas_tf: don't set URB_ZERO_PACKET on IN USB transfer
irqchip/gic-v3-its: Avoid parsing _indirect_ twice for Device table
irqchip/brcmstb-l2: Use _irqsave locking variants in non-interrupt code
x86/kprobes: Prohibit probing on optprobe template code
cpufreq: kryo: Release OPP tables on module removal
cpufreq: tegra124: add missing of_node_put()
cpufreq: pxa2xx: remove incorrect __init annotation
ext4: fix check of inode in swap_inode_boot_loader
ext4: cleanup pagecache before swap i_data
ext4: update quota information while swapping boot loader inode
ext4: add mask of ext4 flags to swap
ext4: fix crash during online resizing
PCI/ASPM: Use LTR if already enabled by platform
PCI/DPC: Fix print AER status in DPC event handling
PCI: dwc: skip MSI init if MSIs have been explicitly disabled
IB/hfi1: Close race condition on user context disable and close
cxl: Wrap iterations over afu slices inside 'afu_list_lock'
ext2: Fix underflow in ext2_max_size()
clk: uniphier: Fix update register for CPU-gear
clk: clk-twl6040: Fix imprecise external abort for pdmclk
clk: samsung: exynos5: Fix possible NULL pointer exception on platform_device_alloc() failure
clk: samsung: exynos5: Fix kfree() of const memory on setting driver_override
clk: ingenic: Fix round_rate misbehaving with non-integer dividers
clk: ingenic: Fix doc of ingenic_cgu_div_info
usb: chipidea: tegra: Fix missed ci_hdrc_remove_device()
usb: typec: tps6598x: handle block writes separately with plain-I2C adapters
dmaengine: usb-dmac: Make DMAC system sleep callbacks explicit
mm: hwpoison: fix thp split handing in soft_offline_in_use_page()
mm/vmalloc: fix size check for remap_vmalloc_range_partial()
mm/memory.c: do_fault: avoid usage of stale vm_area_struct
kernel/sysctl.c: add missing range check in do_proc_dointvec_minmax_conv
device property: Fix the length used in PROPERTY_ENTRY_STRING()
intel_th: Don't reference unassigned outputs
parport_pc: fix find_superio io compare code, should use equal test.
i2c: tegra: fix maximum transfer size
media: i2c: ov5640: Fix post-reset delay
gpio: pca953x: Fix dereference of irq data in shutdown
can: flexcan: FLEXCAN_IFLAG_MB: add () around macro argument
drm/i915: Relax mmap VMA check
bpf: only test gso type on gso packets
serial: uartps: Fix stuck ISR if RX disabled with non-empty FIFO
serial: 8250_of: assume reg-shift of 2 for mrvl,mmp-uart
serial: 8250_pci: Fix number of ports for ACCES serial cards
serial: 8250_pci: Have ACCES cards that use the four port Pericom PI7C9X7954 chip use the pci_pericom_setup()
jbd2: clear dirty flag when revoking a buffer from an older transaction
jbd2: fix compile warning when using JBUFFER_TRACE
selinux: add the missing walk_size + len check in selinux_sctp_bind_connect
security/selinux: fix SECURITY_LSM_NATIVE_LABELS on reused superblock
powerpc/32: Clear on-stack exception marker upon exception return
powerpc/wii: properly disable use of BATs when requested.
powerpc/powernv: Make opal log only readable by root
powerpc/83xx: Also save/restore SPRG4-7 during suspend
powerpc/powernv: Don't reprogram SLW image on every KVM guest entry/exit
powerpc: Fix 32-bit KVM-PR lockup and host crash with MacOS guest
powerpc/ptrace: Simplify vr_get/set() to avoid GCC warning
powerpc/hugetlb: Don't do runtime allocation of 16G pages in LPAR configuration
powerpc/traps: fix recoverability of machine check handling on book3s/32
powerpc/traps: Fix the message printed when stack overflows
ARM: s3c24xx: Fix boolean expressions in osiris_dvs_notify
arm64: Fix HCR.TGE status for NMI contexts
arm64: debug: Ensure debug handlers check triggering exception level
arm64: KVM: Fix architecturally invalid reset value for FPEXC32_EL2
ipmi_si: fix use-after-free of resource->name
dm: fix to_sector() for 32bit
dm integrity: limit the rate of error messages
mfd: sm501: Fix potential NULL pointer dereference
cpcap-charger: generate events for userspace
NFS: Fix I/O request leakages
NFS: Fix an I/O request leakage in nfs_do_recoalesce
NFS: Don't recoalesce on error in nfs_pageio_complete_mirror()
nfsd: fix performance-limiting session calculation
nfsd: fix memory corruption caused by readdir
nfsd: fix wrong check in write_v4_end_grace()
NFSv4.1: Reinitialise sequence results before retransmitting a request
svcrpc: fix UDP on servers with lots of threads
PM / wakeup: Rework wakeup source timer cancellation
bcache: never writeback a discard operation
stable-kernel-rules.rst: add link to networking patch queue
vt: perform safe console erase in the right order
x86/unwind/orc: Fix ORC unwind table alignment
perf intel-pt: Fix CYC timestamp calculation after OVF
perf tools: Fix split_kallsyms_for_kcore() for trampoline symbols
perf auxtrace: Define auxtrace record alignment
perf intel-pt: Fix overlap calculation for padding
perf/x86/intel/uncore: Fix client IMC events return huge result
perf intel-pt: Fix divide by zero when TSC is not available
md: Fix failed allocation of md_register_thread
tpm/tpm_crb: Avoid unaligned reads in crb_recv()
tpm: Unify the send callback behaviour
rcu: Do RCU GP kthread self-wakeup from softirq and interrupt
media: imx: prpencvf: Stop upstream before disabling IDMA channel
media: lgdt330x: fix lock status reporting
media: uvcvideo: Avoid NULL pointer dereference at the end of streaming
media: vimc: Add vimc-streamer for stream control
media: imx: csi: Disable CSI immediately after last EOF
media: imx: csi: Stop upstream before disabling IDMA channel
drm/fb-helper: generic: Fix drm_fbdev_client_restore()
drm/radeon/evergreen_cs: fix missing break in switch statement
drm/amd/powerplay: correct power reading on fiji
drm/amd/display: don't call dm_pp_ function from an fpu block
KVM: Call kvm_arch_memslots_updated() before updating memslots
KVM: x86/mmu: Detect MMIO generation wrap in any address space
KVM: x86/mmu: Do not cache MMIO accesses while memslots are in flux
KVM: nVMX: Sign extend displacements of VMX instr's mem operands
KVM: nVMX: Apply addr size mask to effective address for VMX instructions
KVM: nVMX: Ignore limit checks on VMX instructions using flat segments
bcache: use (REQ_META|REQ_PRIO) to indicate bio for metadata
s390/setup: fix boot crash for machine without EDAT-1
Linux 4.19.31
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit 1d1f898df6 upstream.
The rcu_gp_kthread_wake() function is invoked when it might be necessary
to wake the RCU grace-period kthread. Because self-wakeups are normally
a useless waste of CPU cycles, if rcu_gp_kthread_wake() is invoked from
this kthread, it naturally refuses to do the wakeup.
Unfortunately, natural though it might be, this heuristic fails when
rcu_gp_kthread_wake() is invoked from an interrupt or softirq handler
that interrupted the grace-period kthread just after the final check of
the wait-event condition but just before the schedule() call. In this
case, a wakeup is required, even though the call to rcu_gp_kthread_wake()
is within the RCU grace-period kthread's context. Failing to provide
this wakeup can result in grace periods failing to start, which in turn
results in out-of-memory conditions.
This race window is quite narrow, but it actually did happen during real
testing. It would of course need to be fixed even if it was strictly
theoretical in nature.
This patch does not Cc stable because it does not apply cleanly to
earlier kernel versions.
Fixes: 48a7639ce8 ("rcu: Make callers awaken grace-period kthread")
Reported-by: "He, Bo" <bo.he@intel.com>
Co-developed-by: "Zhang, Jun" <jun.zhang@intel.com>
Co-developed-by: "He, Bo" <bo.he@intel.com>
Co-developed-by: "xiao, jin" <jin.xiao@intel.com>
Co-developed-by: Bai, Jie A <jie.a.bai@intel.com>
Signed-off: "Zhang, Jun" <jun.zhang@intel.com>
Signed-off: "He, Bo" <bo.he@intel.com>
Signed-off: "xiao, jin" <jin.xiao@intel.com>
Signed-off: Bai, Jie A <jie.a.bai@intel.com>
Signed-off-by: "Zhang, Jun" <jun.zhang@intel.com>
[ paulmck: Switch from !in_softirq() to "!in_interrupt() &&
!in_serving_softirq() to avoid redundant wakeups and to also handle the
interrupt-handler scenario as well as the softirq-handler scenario that
actually occurred in testing. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Link: https://lkml.kernel.org/r/CD6925E8781EFD4D8E11882D20FC406D52A11F61@SHSMSX104.ccr.corp.intel.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 8cf7630b29 upstream.
This bug has apparently existed since the introduction of this function
in the pre-git era (4500e91754d3 in Thomas Gleixner's history.git,
"[NET]: Add proc_dointvec_userhz_jiffies, use it for proper handling of
neighbour sysctls.").
As a minimal fix we can simply duplicate the corresponding check in
do_proc_dointvec_conv().
Link: http://lkml.kernel.org/r/20190207123426.9202-3-zev@bewilderbeest.net
Signed-off-by: Zev Weiss <zev@bewilderbeest.net>
Cc: Brendan Higgins <brendanhiggins@google.com>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: <stable@vger.kernel.org> [2.6.2+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 83540fbc88 upstream.
The first version of this method was missing the check for
`ret == PATH_MAX`; then such a check was added, but it didn't call kfree()
on error, so there was still a small memory leak in the error case.
Fix it by using strndup_user() instead of open-coding it.
Link: http://lkml.kernel.org/r/20190220165443.152385-1-jannh@google.com
Cc: Ingo Molnar <mingo@kernel.org>
Cc: stable@vger.kernel.org
Fixes: 0eadcc7a7b ("perf/core: Fix perf_uprobe_init()")
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit e7f0c424d0 upstream.
Commit d716ff71dd ("tracing: Remove taking of trace_types_lock in
pipe files") use the current tracer instead of the copy in
tracing_open_pipe(), but it forget to remove the freeing sentence in
the error path.
There's an error path that can call kfree(iter->trace) after the iter->trace
was assigned to tr->current_trace, which would be bad to free.
Link: http://lkml.kernel.org/r/1550060946-45984-1-git-send-email-yi.zhang@huawei.com
Cc: stable@vger.kernel.org
Fixes: d716ff71dd ("tracing: Remove taking of trace_types_lock in pipe files")
Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 9f0bbf3115 upstream.
Because there may be random garbage beyond a string's null terminator,
it's not correct to copy the the complete character array for use as a
hist trigger key. This results in multiple histogram entries for the
'same' string key.
So, in the case of a string key, use strncpy instead of memcpy to
avoid copying in the extra bytes.
Before, using the gdbus entries in the following hist trigger as an
example:
# echo 'hist:key=comm' > /sys/kernel/debug/tracing/events/sched/sched_waking/trigger
# cat /sys/kernel/debug/tracing/events/sched/sched_waking/hist
...
{ comm: ImgDecoder #4 } hitcount: 203
{ comm: gmain } hitcount: 213
{ comm: gmain } hitcount: 216
{ comm: StreamTrans #73 } hitcount: 221
{ comm: mozStorage #3 } hitcount: 230
{ comm: gdbus } hitcount: 233
{ comm: StyleThread#5 } hitcount: 253
{ comm: gdbus } hitcount: 256
{ comm: gdbus } hitcount: 260
{ comm: StyleThread#4 } hitcount: 271
...
# cat /sys/kernel/debug/tracing/events/sched/sched_waking/hist | egrep gdbus | wc -l
51
After:
# cat /sys/kernel/debug/tracing/events/sched/sched_waking/hist | egrep gdbus | wc -l
1
Link: http://lkml.kernel.org/r/50c35ae1267d64eee975b8125e151e600071d4dc.1549309756.git.tom.zanussi@linux.intel.com
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: stable@vger.kernel.org
Fixes: 79e577cbce ("tracing: Support string type key properly")
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 399504e21a upstream.
same story as with last May fixes in sysfs (7b745a4e40
"unfuck sysfs_mount()"); new_sb is left uninitialized
in case of early errors in kernfs_mount_ns() and papering
over it by treating any error from kernfs_mount_ns() as
equivalent to !new_ns ends up conflating the cases when
objects had never been transferred to a superblock with
ones when that has happened and resulting new superblock
had been dropped. Easily fixed (same way as in sysfs
case). Additionally, there's a superblock leak on
kernfs_node_dentry() failure *and* a dentry leak inside
kernfs_node_dentry() itself - the latter on probably
impossible errors, but the former not impossible to trigger
(as the matter of fact, injecting allocation failures
at that point *does* trigger it).
Cc: stable@kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 7c0cdf0b39 ]
trie_delete_elem() was deleting an entry even though it was not matching
if the prefixlen was correct. This patch adds a check on matchlen.
Reproducer:
$ sudo bpftool map create /sys/fs/bpf/mylpm type lpm_trie key 8 value 1 entries 128 name mylpm flags 1
$ sudo bpftool map update pinned /sys/fs/bpf/mylpm key hex 10 00 00 00 aa bb cc dd value hex 01
$ sudo bpftool map dump pinned /sys/fs/bpf/mylpm
key: 10 00 00 00 aa bb cc dd value: 01
Found 1 element
$ sudo bpftool map delete pinned /sys/fs/bpf/mylpm key hex 10 00 00 00 ff ff ff ff
$ echo $?
0
$ sudo bpftool map dump pinned /sys/fs/bpf/mylpm
Found 0 elements
A similar reproducer is added in the selftests.
Without the patch:
$ sudo ./tools/testing/selftests/bpf/test_lpm_map
test_lpm_map: test_lpm_map.c:485: test_lpm_delete: Assertion `bpf_map_delete_elem(map_fd, key) == -1 && errno == ENOENT' failed.
Aborted
With the patch: test_lpm_map runs without errors.
Fixes: e454cf5958 ("bpf: Implement map_delete_elem for BPF_MAP_TYPE_LPM_TRIE")
Cc: Craig Gallek <kraig@google.com>
Signed-off-by: Alban Crequy <alban@kinvolk.io>
Acked-by: Craig Gallek <kraig@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Psi monitor aims to provide a low-latency short-term pressure
detection mechanism configurable by users. It allows users to
monitor psi metrics growth and trigger events whenever a metric
raises above user-defined threshold within user-defined time window.
Time window and threshold are both expressed in usecs. Multiple psi
resources with different thresholds and window sizes can be monitored
concurrently.
Psi monitors activate when system enters stall state for the monitored
psi metric and deactivate upon exit from the stall state. While system
is in the stall state psi signal growth is monitored at a rate of 10 times
per tracking window. Min window size is 500ms, therefore the min monitoring
interval is 50ms. Max window size is 10s with monitoring interval of 1s.
When activated psi monitor stays active for at least the duration of one
tracking window to avoid repeated activations/deactivations when psi
signal is bouncing.
Notifications to the users are rate-limited to one per tracking window.
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
(not upstream yet, latest version published at: https://lore.kernel.org/patchwork/patch/1052418/)
Bug: 127712811
Bug: 129157727
Test: lmkd in PSI mode
Change-Id: I860049d32420485346ad545c4650f990fe0c08e3
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
kthread.h can't be included in psi_types.h because it creates a circular
inclusion with kthread.h eventually including psi_types.h and complaining
on kthread structures not being defined because they are defined further
in the kthread.h. Resolve this by removing psi_types.h inclusion from the
headers included from kthread.h.
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
(not upstream yet, latest version published at: https://lore.kernel.org/patchwork/patch/1052417/)
Bug: 127712811
Bug: 129157727
Test: lmkd in PSI mode
Change-Id: I88cd99f41534f0b9df18043cde8d1ee54aaa93de
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Introduce changed_states parameter into collect_percpu_times to track
the states changed since the last update.
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
(not upstream yet, latest version published at: https://lore.kernel.org/patchwork/patch/1052420/)
Bug: 127712811
Bug: 129157727
Test: lmkd in PSI mode
Change-Id: I944b024cd65e8520a57097bf5a3d7b2c01605bd0
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Split update_stats into collect_percpu_times and update_averages for
collect_percpu_times to be reused later inside psi monitor.
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
(not upstream yet, latest version published at: https://lore.kernel.org/patchwork/patch/1052419/)
Bug: 127712811
Bug: 129157727
Test: lmkd in PSI mode
Change-Id: Ia9cfed8964fd57e41098fca285a2be0252fd5277
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Renaming psi_group structure member fields used for calculating psi totals
and averages for clear distinction between them and trigger-related fields
that will be added next.
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
(not upstream yet, latest version published at: https://lore.kernel.org/patchwork/patch/1052416/)
Bug: 127712811
Bug: 129157727
Test: lmkd in PSI mode
Change-Id: I579a60e0915fa8fedaa508357d3d1aefab9428c4
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
psi_enable is not used outside of psi.c, make it static.
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
(not upstream yet, latest version published at: https://lore.kernel.org/patchwork/patch/1052415/)
Bug: 127712811
Bug: 129157727
Test: lmkd in PSI mode
Change-Id: I249c6d2271f93a7975f1622faf2d2b4196b701bc
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
The psi monitoring patches will need to determine the same states as
record_times(). To avoid calculating them twice, maintain a state mask
that can be consulted cheaply. Do this in a separate patch to keep the
churn in the main feature patch at a minimum.
This adds 4-byte state_mask member into psi_group_cpu struct which results
in its first cacheline-aligned part becoming 52 bytes long. Add explicit
values to enumeration element counters that affect psi_group_cpu struct
size.
Link: http://lkml.kernel.org/r/20190124211518.244221-4-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
(not upstream yet, latest version published at: https://lore.kernel.org/patchwork/patch/1052414/)
Bug: 127712811
Bug: 129157727
Test: lmkd in PSI mode
Change-Id: I38a1ca3d5c9e6cc3ba39e88c6a9af29ecdc0df5b
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Cgroup has a standardized poll/notification mechanism for waking all
pollers on all fds when a filesystem node changes. To allow polling for
custom events, add a .poll callback that can override the default.
This is in preparation for pollable cgroup pressure files which have
per-fd trigger configurations.
Link: http://lkml.kernel.org/r/20190124211518.244221-3-surenb@google.com
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
(cherry picked from commit: dc50537bdd)
Bug: 127712811
Test: lmkd in PSI mode
Change-Id: Idc648e7b7b7bd5fc00c7b32163e55a93b0f49a98
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
We've been seeing hard-to-trigger psi crashes when running inside VM
instances:
divide error: 0000 [#1] SMP PTI
Modules linked in: [...]
CPU: 0 PID: 212 Comm: kworker/0:2 Not tainted 4.16.18-119_fbk9_3817_gfe944c98d695 #119
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
Workqueue: events psi_clock
RIP: 0010:psi_update_stats+0x270/0x490
RSP: 0018:ffffc90001117e10 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff8800a35a13f8
RDX: 0000000000000000 RSI: ffff8800a35a1340 RDI: 0000000000000000
RBP: 0000000000000658 R08: ffff8800a35a1470 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 00000000000f8502
FS: 0000000000000000(0000) GS:ffff88023fc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fbe370fa000 CR3: 00000000b1e3a000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
psi_clock+0x12/0x50
process_one_work+0x1e0/0x390
worker_thread+0x2b/0x3c0
? rescuer_thread+0x330/0x330
kthread+0x113/0x130
? kthread_create_worker_on_cpu+0x40/0x40
? SyS_exit_group+0x10/0x10
ret_from_fork+0x35/0x40
Code: 48 0f 47 c7 48 01 c2 45 85 e4 48 89 16 0f 85 e6 00 00 00 4c 8b 49 10 4c 8b 51 08 49 69 d9 f2 07 00 00 48 6b c0 64 4c 8b 29 31 d2 <48> f7 f7 49 69 d5 8d 06 00 00 48 89 c5 4c 69 f0 00 98 0b 00 48
The Code-line points to `period` being 0 inside update_stats(), and we
divide by that when calculating that period's pressure percentage.
The elapsed period should never be 0. The reason this can happen is due
to an off-by-one in the idle time / missing period calculation combined
with a coarse sched_clock() in the virtual machine.
The target time for aggregation is advanced into the future on a fixed
grid to prevent clock drift. So when an aggregation runs after some idle
period, we can not just set it to "now + psi_period", but have to
calculate the downtime and advance the target time relative to itself.
However, if the aggregator was disabled exactly one psi_period (ns), we
drop one idle period in the calculation due to a > when we should do >=.
In that case, next_update will be advanced from 'now - psi_period' to
'now' when it should be moved to 'now + psi_period'. The run finishes
with last_update == next_update == sched_clock().
With hardware clocks, this exact nanosecond match isn't likely in the
first place; but if it does happen, the clock will still have moved on and
the period non-zero by the time the worker runs. A pointlessly short
period, but besides the extra work, no harm no foul. However, a slow
sched_clock() like we have on VMs might not have advanced either by the
time the worker runs again. And when we calculate the elapsed period, the
result, our pressure divisor, will be 0. Ouch.
Fix this by correctly handling the situation when the elapsed time between
aggregation runs is precisely two periods, and advance the expiration
timestamp correctly to period into the future.
Link: http://lkml.kernel.org/r/20190214193157.15788-1-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Łukasz Siudut <lsiudut@fb.com
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 4e37504d1c)
Bug: 127712811
Test: lmkd in PSI mode
Change-Id: I40917c84354f9f32259c6703f00b6b1d21f45f02
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
psi has provisions to shut off the periodic aggregation worker when
there is a period of no task activity - and thus no data that needs
aggregating. However, while developing psi monitoring, Suren noticed
that the aggregation clock currently won't stay shut off for good.
Debugging this revealed a flaw in the idle design: an aggregation run
will see no task activity and decide to go to sleep; shortly thereafter,
the kworker thread that executed the aggregation will go idle and cause
a scheduling change, during which the psi callback will kick the
!pending worker again. This will ping-pong forever, and is equivalent
to having no shut-off logic at all (but with more code!)
Fix this by exempting aggregation workers from psi's clock waking logic
when the state change is them going to sleep. To do this, tag workers
with the last work function they executed, and if in psi we see a worker
going to sleep after aggregating psi data, we will not reschedule the
aggregation work item.
What if the worker is also executing other items before or after?
Any psi state times that were incurred by work items preceding the
aggregation work will have been collected from the per-cpu buckets
during the aggregation itself. If there are work items following the
aggregation work, the worker's last_func tag will be overwritten and the
aggregator will be kept alive to process this genuine new activity.
If the aggregation work is the last thing the worker does, and we decide
to go idle, the brief period of non-idle time incurred between the
aggregation run and the kworker's dequeue will be stranded in the
per-cpu buckets until the clock is woken by later activity. But that
should not be a problem. The buckets can hold 4s worth of time, and
future activity will wake the clock with a 2s delay, giving us 2s worth
of data we can leave behind when disabling aggregation. If it takes a
worker more than two seconds to go idle after it finishes its last work
item, we likely have bigger problems in the system, and won't notice one
sample that was averaged with a bogus per-CPU weight.
Link: http://lkml.kernel.org/r/20190116193501.1910-1-hannes@cmpxchg.org
Fixes: eb414681d5 ("psi: pressure stall information for CPU, memory, and IO")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 1b69ac6b40)
Bug: 127712811
Test: lmkd in PSI mode
Change-Id: I2877fec3d381b1006b8bd1261895fdfd68bd21db
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Mel Gorman reports a hackbench regression with psi that would prohibit
shipping the suse kernel with it default-enabled, but he'd still like
users to be able to opt in at little to no cost to others.
With the current combination of CONFIG_PSI and the psi_disabled bool set
from the commandline, this is a challenge. Do the following things to
make it easier:
1. Add a config option CONFIG_PSI_DEFAULT_DISABLED that allows distros
to enable CONFIG_PSI in their kernel but leave the feature disabled
unless a user requests it at boot-time.
To avoid double negatives, rename psi_disabled= to psi=.
2. Make psi_disabled a static branch to eliminate any branch costs
when the feature is disabled.
In terms of numbers before and after this patch, Mel says:
: The following is a comparision using CONFIG_PSI=n as a baseline against
: your patch and a vanilla kernel
:
: 4.20.0-rc4 4.20.0-rc4 4.20.0-rc4
: kconfigdisable-v1r1 vanilla psidisable-v1r1
: Amean 1 1.3100 ( 0.00%) 1.3923 ( -6.28%) 1.3427 ( -2.49%)
: Amean 3 3.8860 ( 0.00%) 4.1230 * -6.10%* 3.8860 ( -0.00%)
: Amean 5 6.8847 ( 0.00%) 8.0390 * -16.77%* 6.7727 ( 1.63%)
: Amean 7 9.9310 ( 0.00%) 10.8367 * -9.12%* 9.9910 ( -0.60%)
: Amean 12 16.6577 ( 0.00%) 18.2363 * -9.48%* 17.1083 ( -2.71%)
: Amean 18 26.5133 ( 0.00%) 27.8833 * -5.17%* 25.7663 ( 2.82%)
: Amean 24 34.3003 ( 0.00%) 34.6830 ( -1.12%) 32.0450 ( 6.58%)
: Amean 30 40.0063 ( 0.00%) 40.5800 ( -1.43%) 41.5087 ( -3.76%)
: Amean 32 40.1407 ( 0.00%) 41.2273 ( -2.71%) 39.9417 ( 0.50%)
:
: It's showing that the vanilla kernel takes a hit (as the bisection
: indicated it would) and that disabling PSI by default is reasonably
: close in terms of performance for this particular workload on this
: particular machine so;
Link: http://lkml.kernel.org/r/20181127165329.GA29728@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: Mel Gorman <mgorman@techsingularity.net>
Reported-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit e0c274472d)
Bug: 127712811
Test: lmkd in PSI mode
Change-Id: I6cb666fa351e8901df82e4d6931bfec0c5ce230d
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
The existing code triggered an invalid warning about 'rq' possibly being
used uninitialized. Instead of doing the silly warning suppression by
initializa it to NULL, refactor the code to bail out early instead.
Warning was:
kernel/sched/psi.c: In function `cgroup_move_task':
kernel/sched/psi.c:639:13: warning: `rq' may be used uninitialized in this function [-Wmaybe-uninitialized]
Link: http://lkml.kernel.org/r/20181103183339.8669-1-olof@lixom.net
Fixes: 2ce7135adc ("psi: cgroup support")
Signed-off-by: Olof Johansson <olof@lixom.net>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 8fcb2312d1)
Bug: 127712811
Test: lmkd in PSI mode
Change-Id: Id989da224a726082e0cfa5d5d9460bf63d448a93
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
On a system that executes multiple cgrouped jobs and independent
workloads, we don't just care about the health of the overall system, but
also that of individual jobs, so that we can ensure individual job health,
fairness between jobs, or prioritize some jobs over others.
This patch implements pressure stall tracking for cgroups. In kernels
with CONFIG_PSI=y, cgroup2 groups will have cpu.pressure, memory.pressure,
and io.pressure files that track aggregate pressure stall times for only
the tasks inside the cgroup.
Link: http://lkml.kernel.org/r/20180828172258.3185-10-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Daniel Drake <drake@endlessm.com>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 2ce7135adc)
Bug: 127712811
Test: lmkd in PSI mode
Change-Id: I163e6657aaa60aa5aab9372616a3bce2a65e90ec
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
When systems are overcommitted and resources become contended, it's hard
to tell exactly the impact this has on workload productivity, or how close
the system is to lockups and OOM kills. In particular, when machines work
multiple jobs concurrently, the impact of overcommit in terms of latency
and throughput on the individual job can be enormous.
In order to maximize hardware utilization without sacrificing individual
job health or risk complete machine lockups, this patch implements a way
to quantify resource pressure in the system.
A kernel built with CONFIG_PSI=y creates files in /proc/pressure/ that
expose the percentage of time the system is stalled on CPU, memory, or IO,
respectively. Stall states are aggregate versions of the per-task delay
accounting delays:
cpu: some tasks are runnable but not executing on a CPU
memory: tasks are reclaiming, or waiting for swapin or thrashing cache
io: tasks are waiting for io completions
These percentages of walltime can be thought of as pressure percentages,
and they give a general sense of system health and productivity loss
incurred by resource overcommit. They can also indicate when the system
is approaching lockup scenarios and OOMs.
To do this, psi keeps track of the task states associated with each CPU
and samples the time they spend in stall states. Every 2 seconds, the
samples are averaged across CPUs - weighted by the CPUs' non-idle time to
eliminate artifacts from unused CPUs - and translated into percentages of
walltime. A running average of those percentages is maintained over 10s,
1m, and 5m periods (similar to the loadaverage).
[hannes@cmpxchg.org: doc fixlet, per Randy]
Link: http://lkml.kernel.org/r/20180828205625.GA14030@cmpxchg.org
[hannes@cmpxchg.org: code optimization]
Link: http://lkml.kernel.org/r/20180907175015.GA8479@cmpxchg.org
[hannes@cmpxchg.org: rename psi_clock() to psi_update_work(), per Peter]
Link: http://lkml.kernel.org/r/20180907145404.GB11088@cmpxchg.org
[hannes@cmpxchg.org: fix build]
Link: http://lkml.kernel.org/r/20180913014222.GA2370@cmpxchg.org
Link: http://lkml.kernel.org/r/20180828172258.3185-9-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Daniel Drake <drake@endlessm.com>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit eb414681d5)
Bug: 127712811
Test: lmkd in PSI mode
Change-Id: Id00d23c977169b0c4636d92016fc1fee0274be05
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
do_sched_yield() disables IRQs, looks up this_rq() and locks it. The next
patch is adding another site with the same pattern, so provide a
convenience function for it.
Link: http://lkml.kernel.org/r/20180828172258.3185-8-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Tested-by: Daniel Drake <drake@endlessm.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 246b3b3342)
Bug: 127712811
Test: lmkd in PSI mode
Change-Id: I24b42cff1624c80633f116b7cb485564f53a30a7
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
kernel/sched/sched.h includes "stats.h" half-way through the file. The
next patch introduces users of sched.h's rq locking functions and
update_rq_clock() in kernel/sched/stats.h. Move those definitions up in
the file so they are available in stats.h.
Link: http://lkml.kernel.org/r/20180828172258.3185-7-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Tested-by: Daniel Drake <drake@endlessm.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 1f351d7f75)
Bug: 127712811
Test: lmkd in PSI mode
Change-Id: Id342e0ba9a62b49e64f2ce8b87f883ea70230b2f
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
There are several definitions of those functions/macros in places that
mess with fixed-point load averages. Provide an official version.
[akpm@linux-foundation.org: fix missed conversion in block/blk-iolatency.c]
Link: http://lkml.kernel.org/r/20180828172258.3185-5-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Tested-by: Daniel Drake <drake@endlessm.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 8508cf3ffa)
Conflicts:
block/blk-iolatency.c
(1. manual merge to replace stat->rqs.mean with stat.mean)
Bug: 127712811
Test: lmkd in PSI mode
Change-Id: I716b4874491cff75a2355c6d95c64cf02d05e7ee
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Delay accounting already measures the time a task spends in direct reclaim
and waiting for swapin, but in low memory situations tasks spend can spend
a significant amount of their time waiting on thrashing page cache. This
isn't tracked right now.
To know the full impact of memory contention on an individual task,
measure the delay when waiting for a recently evicted active cache page to
read back into memory.
Also update tools/accounting/getdelays.c:
[hannes@computer accounting]$ sudo ./getdelays -d -p 1
print delayacct stats ON
PID 1
CPU count real total virtual total delay total delay average
50318 745000000 847346785 400533713 0.008ms
IO count delay total delay average
435 122601218 0ms
SWAP count delay total delay average
0 0 0ms
RECLAIM count delay total delay average
0 0 0ms
THRASHING count delay total delay average
19 12621439 0ms
Link: http://lkml.kernel.org/r/20180828172258.3185-4-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Daniel Drake <drake@endlessm.com>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit b1d29ba82c)
Bug: 127712811
Test: lmkd in PSI mode
Change-Id: I259f693987cf04e6a52ee7e8accf55a17e0de005
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlyJb/EACgkQONu9yGCS
aT4y0g//b9t9/onhTaXcY/ByPmBAwqNgugi7eYcZqGDBp7aCDOBLF6eOwbhdvvuS
ZTaZ5eWG3Twz3mZu9vveuskgMci2npDyLPgqBWGzW+Ef5r/xPd40diaI75ZUc68T
gimWbQ0VANuXKklK6LysBUaVQWE3ilIy6qnnpj0DI3ipNDoE62Ry1LNthuKy+73J
w6r7uwkb6X/CkXpNB/L4cDdpSy/CvhGQhd6p91lBuE4DfyPqEzslYCokD9aPXp9b
Fedt/Re+8eULBNcgqPYxkS5pBrbHtqrGf00AMlzC8DkC+GZyDqSP2xjv6AiTfGJd
uf0/Jvsv2OBnP4aYsbk+uB2z3plzPBgmXxa/1bm+yrGCMvbpi9mMx75HM2joAeVp
tVN4ZN65kNgJkXCchJTHdQ3s6teOD8Par1czy570HyKBU6l1j3AGArGm+b4WGPWx
dL+82coojMKxKNdTHfxUXES6QGKp716r3un6mCrKR0xET/SDayzDQMaSM8UOtArK
ELzNeKzKTc5oBx6i+JfGmY8ZsedpNGCIPpsiuoSYAaon5ZzNbruzOAlDOThs157d
YezDHZ9XMrx3kN/xYnqZD63x/5egq9REbZGWljeykbNkWcEY74jIkKwNLxqv3P64
JsLp60owvjzwtzKycjZogNU//GGNTBdb+6pESq4MxJpPTteFWnc=
=n9iV
-----END PGP SIGNATURE-----
Merge 4.19.29 into android-4.19
Changes in 4.19.29
media: uvcvideo: Fix 'type' check leading to overflow
vti4: Fix a ipip packet processing bug in 'IPCOMP' virtual tunnel
perf script: Fix crash with printing mixed trace point and other events
perf core: Fix perf_proc_update_handler() bug
perf tools: Handle TOPOLOGY headers with no CPU
perf script: Fix crash when processing recorded stat data
IB/{hfi1, qib}: Fix WC.byte_len calculation for UD_SEND_WITH_IMM
iommu/amd: Call free_iova_fast with pfn in map_sg
iommu/amd: Unmap all mapped pages in error path of map_sg
riscv: fixup max_low_pfn with PFN_DOWN.
ipvs: Fix signed integer overflow when setsockopt timeout
iommu/amd: Fix IOMMU page flush when detach device from a domain
clk: ti: Fix error handling in ti_clk_parse_divider_data()
clk: qcom: gcc: Use active only source for CPUSS clocks
xtensa: SMP: fix ccount_timer_shutdown
riscv: Adjust mmap base address at a third of task size
IB/ipoib: Fix for use-after-free in ipoib_cm_tx_start
selftests: cpu-hotplug: fix case where CPUs offline > CPUs present
xtensa: SMP: fix secondary CPU initialization
xtensa: smp_lx200_defconfig: fix vectors clash
xtensa: SMP: mark each possible CPU as present
iomap: get/put the page in iomap_page_create/release()
iomap: fix a use after free in iomap_dio_rw
xtensa: SMP: limit number of possible CPUs by NR_CPUS
net: altera_tse: fix msgdma_tx_completion on non-zero fill_level case
net: hns: Fix for missing of_node_put() after of_parse_phandle()
net: hns: Restart autoneg need return failed when autoneg off
net: hns: Fix wrong read accesses via Clause 45 MDIO protocol
net: stmmac: dwmac-rk: fix error handling in rk_gmac_powerup()
netfilter: ebtables: compat: un-break 32bit setsockopt when no rules are present
gpio: vf610: Mask all GPIO interrupts
selftests: net: use LDLIBS instead of LDFLAGS
selftests: timers: use LDLIBS instead of LDFLAGS
nfs: Fix NULL pointer dereference of dev_name
qed: Fix bug in tx promiscuous mode settings
qed: Fix LACP pdu drops for VFs
qed: Fix VF probe failure while FLR
qed: Fix system crash in ll2 xmit
qed: Fix stack out of bounds bug
scsi: libfc: free skb when receiving invalid flogi resp
scsi: scsi_debug: fix write_same with virtual_gb problem
scsi: bnx2fc: Fix error handling in probe()
scsi: 53c700: pass correct "dev" to dma_alloc_attrs()
platform/x86: Fix unmet dependency warning for ACPI_CMPC
platform/x86: Fix unmet dependency warning for SAMSUNG_Q10
net: macb: Apply RXUBR workaround only to versions with errata
x86/boot/compressed/64: Set EFER.LME=1 in 32-bit trampoline before returning to long mode
cifs: fix computation for MAX_SMB2_HDR_SIZE
x86/microcode/amd: Don't falsely trick the late loading mechanism
arm64: kprobe: Always blacklist the KVM world-switch code
apparmor: Fix aa_label_build() error handling for failed merges
x86/kexec: Don't setup EFI info if EFI runtime is not enabled
proc: fix /proc/net/* after setns(2)
x86_64: increase stack size for KASAN_EXTRA
mm, memory_hotplug: is_mem_section_removable do not pass the end of a zone
mm, memory_hotplug: test_pages_in_a_zone do not pass the end of zone
lib/test_kmod.c: potential double free in error handling
fs/drop_caches.c: avoid softlockups in drop_pagecache_sb()
autofs: drop dentry reference only when it is never used
autofs: fix error return in autofs_fill_super()
mm, memory_hotplug: fix off-by-one in is_pageblock_removable
ARM: OMAP: dts: N950/N9: fix onenand timings
ARM: dts: omap4-droid4: Fix typo in cpcap IRQ flags
ARM: dts: sun8i: h3: Add ethernet0 alias to Beelink X2
arm: dts: meson: Fix IRQ trigger type for macirq
ARM: dts: meson8b: odroidc1: mark the SD card detection GPIO active-low
ARM: dts: meson8m2: mxiii-plus: mark the SD card detection GPIO active-low
ARM: dts: imx6sx: correct backward compatible of gpt
arm64: dts: renesas: r8a7796: Enable DMA for SCIF2
arm64: dts: renesas: r8a77965: Enable DMA for SCIF2
soc: fsl: qbman: avoid race in clearing QMan interrupt
pinctrl: mcp23s08: spi: Fix regmap allocation for mcp23s18
wlcore: sdio: Fixup power on/off sequence
bpftool: Fix prog dump by tag
bpftool: fix percpu maps updating
bpf: sock recvbuff must be limited by rmem_max in bpf_setsockopt()
ARM: pxa: ssp: unneeded to free devm_ allocated data
arm64: dts: add msm8996 compatible to gicv3
batman-adv: release station info tidstats
DTS: CI20: Fix bugs in ci20's device tree.
usb: phy: fix link errors
irqchip/gic-v4: Fix occasional VLPI drop
irqchip/gic-v3-its: Gracefully fail on LPI exhaustion
irqchip/mmp: Only touch the PJ4 IRQ & FIQ bits on enable/disable
drm/amdgpu: Add missing power attribute to APU check
drm/radeon: check if device is root before getting pci speed caps
drm/amdgpu: Transfer fences to dmabuf importer
net: stmmac: Fallback to Platform Data clock in Watchdog conversion
net: stmmac: Send TSO packets always from Queue 0
net: stmmac: Disable EEE mode earlier in XMIT callback
irqchip/gic-v3-its: Fix ITT_entry_size accessor
relay: check return of create_buf_file() properly
bpf, selftests: fix handling of sparse CPU allocations
bpf: fix lockdep false positive in percpu_freelist
bpf: fix potential deadlock in bpf_prog_register
bpf: Fix syscall's stackmap lookup potential deadlock
drm/sun4i: tcon: Prepare and enable TCON channel 0 clock at init
dmaengine: at_xdmac: Fix wrongfull report of a channel as in use
vsock/virtio: fix kernel panic after device hot-unplug
vsock/virtio: reset connected sockets on device removal
dmaengine: dmatest: Abort test in case of mapping error
selftests: netfilter: fix config fragment CONFIG_NF_TABLES_INET
selftests: netfilter: add simple masq/redirect test cases
netfilter: nf_nat: skip nat clash resolution for same-origin entries
s390/qeth: release cmd buffer in error paths
s390/qeth: fix use-after-free in error path
s390/qeth: cancel close_dev work before removing a card
perf symbols: Filter out hidden symbols from labels
perf trace: Support multiple "vfs_getname" probes
MIPS: Remove function size check in get_frame_info()
Revert "scsi: libfc: Add WARN_ON() when deleting rports"
i2c: omap: Use noirq system sleep pm ops to idle device for suspend
drm/amdgpu: use spin_lock_irqsave to protect vm_manager.pasid_idr
nvme: lock NS list changes while handling command effects
nvme-pci: fix rapid add remove sequence
fs: ratelimit __find_get_block_slow() failure message.
qed: Fix EQ full firmware assert.
qed: Consider TX tcs while deriving the max num_queues for PF.
qede: Fix system crash on configuring channels.
blk-iolatency: fix IO hang due to negative inflight counter
nvme-pci: add missing unlock for reset error
Input: wacom_serial4 - add support for Wacom ArtPad II tablet
Input: elan_i2c - add id for touchpad found in Lenovo s21e-20
iscsi_ibft: Fix missing break in switch statement
scsi: aacraid: Fix missing break in switch statement
x86/PCI: Fixup RTIT_BAR of Intel Denverton Trace Hub
arm64: dts: zcu100-revC: Give wifi some time after power-on
arm64: dts: hikey: Give wifi some time after power-on
arm64: dts: hikey: Revert "Enable HS200 mode on eMMC"
ARM: dts: exynos: Fix pinctrl definition for eMMC RTSN line on Odroid X2/U3
ARM: dts: exynos: Add minimal clkout parameters to Exynos3250 PMU
ARM: dts: exynos: Fix max voltage for buck8 regulator on Odroid XU3/XU4
drm: disable uncached DMA optimization for ARM and arm64
netfilter: xt_TEE: fix wrong interface selection
netfilter: xt_TEE: add missing code to get interface index in checkentry.
gfs2: Fix missed wakeups in find_insert_glock
staging: erofs: add error handling for xattr submodule
staging: erofs: fix fast symlink w/o xattr when fs xattr is on
staging: erofs: fix memleak of inode's shared xattr array
staging: erofs: fix race of initializing xattrs of a inode at the same time
staging: erofs: keep corrupted fs from crashing kernel in erofs_namei()
cifs: allow calling SMB2_xxx_free(NULL)
ath9k: Avoid OF no-EEPROM quirks without qca,no-eeprom
driver core: Postpone DMA tear-down until after devres release
perf/x86/intel: Make cpuc allocations consistent
perf/x86/intel: Generalize dynamic constraint creation
x86: Add TSX Force Abort CPUID/MSR
perf/x86/intel: Implement support for TSX Force Abort
Linux 4.19.29
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
[ Upstream commit 7c4cd051ad ]
The map_lookup_elem used to not acquiring spinlock
in order to optimize the reader.
It was true until commit 557c0c6e7d ("bpf: convert stackmap to pre-allocation")
The syscall's map_lookup_elem(stackmap) calls bpf_stackmap_copy().
bpf_stackmap_copy() may find the elem no longer needed after the copy is done.
If that is the case, pcpu_freelist_push() saves this elem for reuse later.
This push requires a spinlock.
If a tracing bpf_prog got run in the middle of the syscall's
map_lookup_elem(stackmap) and this tracing bpf_prog is calling
bpf_get_stackid(stackmap) which also requires the same pcpu_freelist's
spinlock, it may end up with a dead lock situation as reported by
Eric Dumazet in https://patchwork.ozlabs.org/patch/1030266/
The situation is the same as the syscall's map_update_elem() which
needs to acquire the pcpu_freelist's spinlock and could race
with tracing bpf_prog. Hence, this patch fixes it by protecting
bpf_stackmap_copy() with this_cpu_inc(bpf_prog_active)
to prevent tracing bpf_prog from running.
A later syscall's map_lookup_elem commit f1a2e44a3a ("bpf: add queue and stack maps")
also acquires a spinlock and races with tracing bpf_prog similarly.
Hence, this patch is forward looking and protects the majority
of the map lookups. bpf_map_offload_lookup_elem() is the exception
since it is for network bpf_prog only (i.e. never called by tracing
bpf_prog).
Fixes: 557c0c6e7d ("bpf: convert stackmap to pre-allocation")
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 2c1cf00eea ]
If create_buf_file() returns an error, don't try to reference it later
as a valid dentry pointer.
This problem was exposed when debugfs started to return errors instead
of just NULL for some calls when they do not succeed properly.
Also, the check for WARN_ON(dentry) was just wrong :)
Reported-by: Kees Cook <keescook@chromium.org>
Reported-and-tested-by: syzbot+16c3a70e1e9b29346c43@syzkaller.appspotmail.com
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Rientjes <rientjes@google.com>
Fixes: ff9fb72bc0 ("debugfs: return error values, not NULL")
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 1a51c5da5a ]
The perf_proc_update_handler() handles /proc/sys/kernel/perf_event_max_sample_rate
syctl variable. When the PMU IRQ handler timing monitoring is disabled, i.e,
when /proc/sys/kernel/perf_cpu_time_max_percent is equal to 0 or 100,
then no modification to sysctl_perf_event_sample_rate is allowed to prevent
possible hang from wrong values.
The problem is that the test to prevent modification is made after the
sysctl variable is modified in perf_proc_update_handler().
You get an error:
$ echo 10001 >/proc/sys/kernel/perf_event_max_sample_rate
echo: write error: invalid argument
But the value is still modified causing all sorts of inconsistencies:
$ cat /proc/sys/kernel/perf_event_max_sample_rate
10001
This patch fixes the problem by moving the parsing of the value after
the test.
Committer testing:
# echo 100 > /proc/sys/kernel/perf_cpu_time_max_percent
# echo 10001 > /proc/sys/kernel/perf_event_max_sample_rate
-bash: echo: write error: Invalid argument
# cat /proc/sys/kernel/perf_event_max_sample_rate
10001
#
Signed-off-by: Stephane Eranian <eranian@google.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Jiri Olsa <jolsa@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1547169436-6266-1-git-send-email-eranian@google.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlyEq/IACgkQONu9yGCS
aT77AhAAgqbxHsKsgqh7JV477xEgLqrLKYw+/6Bx+l79fUZoaR3PRwM62UEzikPQ
KbvutNuOHOWuJA0Xyj5gqV4SJaqBOkoGNnHFOBi/qjtFUsOlBFrlGpow+0fsP/ay
Bmo0LhoTvBub4ap7bJt4pwel/elWYVtOkA1Qgv3OCiDkTorYuPTbIUyuAVOJJbRn
sZ1eKi00CQPrN65Rxgci0g0p/m7JWpvW2zqmDNZJuZZeEmSLdrrZGwt5ExiI6oKz
CqX/VBGChEesMTEOLsSfRyg6NZW3j4rOUaCzkxDq/Tsh9XqNabhk1jod2p3t7Nmu
n5Js3ujfuyCKf5tD49Z8xy5A++nYyJLa5jbFnURr2H/ZJPla0CHuc4RoFf/wFYr4
xQAeA3XXiZZB02n6oZlweUSp7hVnCgLJ4Ev2ctAyPUpyf4ncl+vffzj40bozFvAC
adJE1UyEJp0xUuRVdx0I+HyueHcWmRIAgs0iz9B5S2KNc4rYDqE+/t+ddrNqjvSF
+C33nMn+7A+ngmQlwWjOBaZhhQn3qWrWU0ACERbIG/DUD2voBYu3oIfQzKsXTt0V
erSSDq0KSy73PRKN4Tzf2GnDUAlNITUuyFgWITOg8p29HuXJO00p4MQ7fOMYacyB
WdwXFB7islUtUoA8nEedZgF7IF1WRh6Iz2HJ5uMTl6pCMSV+8Dk=
=OvW8
-----END PGP SIGNATURE-----
Merge 4.19.28 into android-4.19
Changes in 4.19.28
cpufreq: Use struct kobj_attribute instead of struct global_attr
staging: erofs: fix mis-acted TAIL merging behavior
USB: serial: option: add Telit ME910 ECM composition
USB: serial: cp210x: add ID for Ingenico 3070
USB: serial: ftdi_sio: add ID for Hjelmslund Electronics USB485
staging: erofs: fix illegal address access under memory pressure
staging: erofs: compressed_pages should not be accessed again after freed
staging: comedi: ni_660x: fix missing break in switch statement
staging: wilc1000: fix to set correct value for 'vif_num'
staging: android: ion: fix sys heap pool's gfp_flags
staging: android: ashmem: Don't call fallocate() with ashmem_mutex held.
staging: android: ashmem: Avoid range_alloc() allocation with ashmem_mutex held.
ip6mr: Do not call __IP6_INC_STATS() from preemptible context
net: dsa: mv88e6xxx: handle unknown duplex modes gracefully in mv88e6xxx_port_set_duplex
net: dsa: mv8e6xxx: fix number of internal PHYs for 88E6x90 family
net: sched: put back q.qlen into a single location
net-sysfs: Fix mem leak in netdev_register_kobject
qmi_wwan: Add support for Quectel EG12/EM12
sctp: call iov_iter_revert() after sending ABORT
sky2: Disable MSI on Dell Inspiron 1545 and Gateway P-79
team: Free BPF filter when unregistering netdev
tipc: fix RDM/DGRAM connect() regression
bnxt_en: Drop oversize TX packets to prevent errors.
geneve: correctly handle ipv6.disable module parameter
hv_netvsc: Fix IP header checksum for coalesced packets
ipv4: Add ICMPv6 support when parse route ipproto
lan743x: Fix TX Stall Issue
net: dsa: mv88e6xxx: Fix statistics on mv88e6161
net: dsa: mv88e6xxx: Fix u64 statistics
netlabel: fix out-of-bounds memory accesses
net: netem: fix skb length BUG_ON in __skb_to_sgvec
net: nfc: Fix NULL dereference on nfc_llcp_build_tlv fails
net: phy: Micrel KSZ8061: link failure after cable connect
net: phy: phylink: fix uninitialized variable in phylink_get_mac_state
net: sit: fix memory leak in sit_init_net()
net: socket: set sock->sk to NULL after calling proto_ops::release()
tipc: fix race condition causing hung sendto
tun: fix blocking read
xen-netback: don't populate the hash cache on XenBus disconnect
xen-netback: fix occasional leak of grant ref mappings under memory pressure
tun: remove unnecessary memory barrier
net: Add __icmp_send helper.
net: avoid use IPCB in cipso_v4_error
ipv4: Return error for RTA_VIA attribute
ipv6: Return error for RTA_VIA attribute
mpls: Return error for RTA_GATEWAY attribute
ipv4: Pass original device to ip_rcv_finish_core
net: dsa: mv88e6xxx: power serdes on/off for 10G interfaces on 6390X
net: dsa: mv88e6xxx: prevent interrupt storm caused by mv88e6390x_port_set_cmode
net/sched: act_ipt: fix refcount leak when replace fails
net/sched: act_skbedit: fix refcount leak when replace fails
net: sched: act_tunnel_key: fix NULL pointer dereference during init
x86/CPU/AMD: Set the CPB bit unconditionally on F17h
x86/boot/compressed/64: Do not read legacy ROM on EFI system
tracing: Fix event filters and triggers to handle negative numbers
usb: xhci: Fix for Enabling USB ROLE SWITCH QUIRK on INTEL_SUNRISEPOINT_LP_XHCI
applicom: Fix potential Spectre v1 vulnerabilities
MIPS: irq: Allocate accurate order pages for irq stack
aio: Fix locking in aio_poll()
xtensa: fix get_wchan
gnss: sirf: fix premature wakeup interrupt enable
USB: serial: cp210x: fix GPIO in autosuspend
selftests: firmware: fix verify_reqs() return value
Bluetooth: btrtl: Restore old logic to assume firmware is already loaded
Bluetooth: Fix locking in bt_accept_enqueue() for BH context
exec: Fix mem leak in kernel_read_file
scsi: core: reset host byte in DID_NEXUS_FAILURE case
bpf: fix sanitation rewrite in case of non-pointers
Linux 4.19.28
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 3612af783c upstream.
Marek reported that he saw an issue with the below snippet in that
timing measurements where off when loaded as unpriv while results
were reasonable when loaded as privileged:
[...]
uint64_t a = bpf_ktime_get_ns();
uint64_t b = bpf_ktime_get_ns();
uint64_t delta = b - a;
if ((int64_t)delta > 0) {
[...]
Turns out there is a bug where a corner case is missing in the fix
d3bd7413e0 ("bpf: fix sanitation of alu op with pointer / scalar
type from different paths"), namely fixup_bpf_calls() only checks
whether aux has a non-zero alu_state, but it also needs to test for
the case of BPF_ALU_NON_POINTER since in both occasions we need to
skip the masking rewrite (as there is nothing to mask).
Fixes: d3bd7413e0 ("bpf: fix sanitation of alu op with pointer / scalar type from different paths")
Reported-by: Marek Majkowski <marek@cloudflare.com>
Reported-by: Arthur Fabre <afabre@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/netdev/CAJPywTJqP34cK20iLM5YmUMz9KXQOdu1-+BZrGMAGgLuBWz7fg@mail.gmail.com/T/
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 6a072128d2 upstream.
Then tracing syscall exit event it is extremely useful to filter exit
codes equal to some negative value, to react only to required errors.
But negative numbers does not work:
[root@snorch sys_exit_read]# echo "ret == -1" > filter
bash: echo: write error: Invalid argument
[root@snorch sys_exit_read]# cat filter
ret == -1
^
parse_error: Invalid value (did you forget quotes)?
Similar thing happens when setting triggers.
These is a regression in v4.17 introduced by the commit mentioned below,
testing without these commit shows no problem with negative numbers.
Link: http://lkml.kernel.org/r/20180823102534.7642-1-ptikhomirov@virtuozzo.com
Cc: stable@vger.kernel.org
Fixes: 80765597bc ("tracing: Rewrite filter logic to be simpler and faster")
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Add support for reporting per-uid information through procfs, roughly
following the approach used for per-tid and per-tgid directories in
fs/proc/base.c.
This also entails some new tracking of which uids have been used, to
avoid losing information when the last task with a given uid exits.
Bug: 72339335
Bug: 127641090
Test: ls /proc/uid/; compare with UIDs in /proc/uid_time_in_state
Change-Id: I0908f0c04438b11ceb673d860e58441bf503d478
Signed-off-by: Connor O'Brien <connoro@google.com>
[AmitP: Fix proc_fill_cache() now that upstream commit
0168b9e38c ("procfs: switch instantiate_t to d_splice_alias()"),
switched instantiate() callback to d_splice_alias()]
Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
[astrachan: Folded 97b7790f505e ("ANDROID: proc: fix undefined behavior
in proc_uid_base_readdir") into this change]
Signed-off-by: Alistair Strachan <astrachan@google.com>
Add time in state data to task structs, and create
/proc/<pid>/time_in_state files to show how long each individual task
has run at each frequency.
Create a CONFIG_CPU_FREQ_TIMES option to enable/disable this tracking.
Bug: 72339335
Bug: 127641090
Test: Read /proc/<pid>/time_in_state
Change-Id: Ia6456754f4cb1e83b2bc35efa8fbe9f8696febc8
Signed-off-by: Connor O'Brien <connoro@google.com>
[astrachan: Folded the following changes into this patch:
a6d3de6a7fba ("ANDROID: Reduce use of #ifdef CONFIG_CPU_FREQ_TIMES")
b89ada5d9c09 ("ANDROID: Fix massive cpufreq_times memory leaks")]
Signed-off-by: Alistair Strachan <astrachan@google.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlx+qs4ACgkQONu9yGCS
aT4OQQ/8DMkQa61GLVtL1Zm0i2GW+fpgAJj59QP7jRdilolP7BBnrX2Ne26Dx+X0
bYLJDJw6WexsvLKOkSMol4Y3Q3NsBguC+MasVrrWBHpfntizlaPzsYuTKkyNRG0c
vjOYyTeAxElWqEh1WaVQhk/juhBbrTtkZIK63h9M60KYb0qHA7TY9oTGokEA8WA0
11Yuge2aV6cR8zup1coWR9NRvW/uealiENAhr0Jw1ZRtrBqwzQoFyn5psXdbzkjb
HrwvDa/nkwfJuc/1+grOTF5HfI7D5+giq0iRtvBuChjDwqfJ2IrVRPF/DKSmR0Oz
Sq1ILUdj61gd2hp4N9zpUE87r2tV2ABg/yZrtDcKFhCiUWnfaaGArn7mm/R7OhCD
PjRDz7acVvKq3LLenM6xcFsAWUdjOoUWcGk70eO/ZZctW8o+HozxFcVrNmT/75zJ
LcagplvYLiLB4Gard6niNH8qunmq/jD+GkgdKtF9GFHOjG1QMdQ9d/VxeJWGAjtA
nE8ZX9EFYgmq14OR+YK0DNOOFBZe10lcR9rMIFp6T9AanY7rVT7LqqG6BUXb226P
mJmNS9sSJhictucyyf/ej9j0jULBBsAYJA5+fkECHIioewACjFGn1dZfm3BBrDAb
OoNIwaLmrvjh2iKwsj17OK9Fv8PG6VKwYRFQ/I3OGVouZUX5SEk=
=edyH
-----END PGP SIGNATURE-----
Merge 4.19.27 into android-4.19
Changes in 4.19.27
irq/matrix: Split out the CPU selection code into a helper
irq/matrix: Spread managed interrupts on allocation
genirq/matrix: Improve target CPU selection for managed interrupts.
mac80211: Change default tx_sk_pacing_shift to 7
scsi: libsas: Fix rphy phy_identifier for PHYs with end devices attached
drm/msm: Unblock writer if reader closes file
ASoC: Intel: Haswell/Broadwell: fix setting for .dynamic field
ALSA: compress: prevent potential divide by zero bugs
ASoC: Variable "val" in function rt274_i2c_probe() could be uninitialized
clk: tegra: dfll: Fix a potential Oop in remove()
clk: sysfs: fix invalid JSON in clk_dump
clk: vc5: Abort clock configuration without upstream clock
thermal: int340x_thermal: Fix a NULL vs IS_ERR() check
usb: dwc3: gadget: synchronize_irq dwc irq in suspend
usb: dwc3: gadget: Fix the uninitialized link_state when udc starts
usb: gadget: Potential NULL dereference on allocation error
selftests: rtc: rtctest: fix alarm tests
selftests: rtc: rtctest: add alarm test on minute boundary
genirq: Make sure the initial affinity is not empty
x86/mm/mem_encrypt: Fix erroneous sizeof()
ASoC: rt5682: Fix PLL source register definitions
ASoC: dapm: change snprintf to scnprintf for possible overflow
ASoC: imx-audmux: change snprintf to scnprintf for possible overflow
selftests/vm/gup_benchmark.c: match gup struct to kernel
phy: ath79-usb: Fix the power on error path
phy: ath79-usb: Fix the main reset name to match the DT binding
selftests: seccomp: use LDLIBS instead of LDFLAGS
selftests: gpio-mockup-chardev: Check asprintf() for error
irqchip/gic-v3-mbi: Fix uninitialized mbi_lock
ARC: fix __ffs return value to avoid build warnings
ARC: show_regs: lockdep: avoid page allocator...
drivers: thermal: int340x_thermal: Fix sysfs race condition
staging: rtl8723bs: Fix build error with Clang when inlining is disabled
mac80211: fix miscounting of ttl-dropped frames
sched/wait: Fix rcuwait_wake_up() ordering
sched/wake_q: Fix wakeup ordering for wake_q
futex: Fix (possible) missed wakeup
locking/rwsem: Fix (possible) missed wakeup
drm/amd/powerplay: OD setting fix on Vega10
tty: serial: qcom_geni_serial: Allow mctrl when flow control is disabled
serial: fsl_lpuart: fix maximum acceptable baud rate with over-sampling
drm/sun4i: hdmi: Fix usage of TMDS clock
staging: android: ion: Support cpu access during dma_buf_detach
direct-io: allow direct writes to empty inodes
writeback: synchronize sync(2) against cgroup writeback membership switches
scsi: lpfc: nvme: avoid hang / use-after-free when destroying localport
scsi: lpfc: nvmet: avoid hang / use-after-free when destroying targetport
scsi: csiostor: fix NULL pointer dereference in csio_vport_set_state()
net: altera_tse: fix connect_local_phy error path
hv_netvsc: Fix ethtool change hash key error
hv_netvsc: Refactor assignments of struct netvsc_device_info
hv_netvsc: Fix hash key value reset after other ops
nvme-rdma: fix timeout handler
nvme-multipath: drop optimization for static ANA group IDs
drm/msm: Fix A6XX support for opp-level
net: usb: asix: ax88772_bind return error when hw_reset fail
net: dev_is_mac_header_xmit() true for ARPHRD_RAWIP
ibmveth: Do not process frames after calling napi_reschedule
mac80211: don't initiate TDLS connection if station is not associated to AP
mac80211: Add attribute aligned(2) to struct 'action'
cfg80211: extend range deviation for DMG
svm: Fix AVIC incomplete IPI emulation
KVM: nSVM: clear events pending from svm_complete_interrupts() when exiting to L1
kvm: selftests: Fix region overlap check in kvm_util
mmc: spi: Fix card detection during probe
mmc: tmio_mmc_core: don't claim spurious interrupts
mmc: tmio: fix access width of Block Count Register
mmc: core: Fix NULL ptr crash from mmc_should_fail_request
mmc: cqhci: fix space allocated for transfer descriptor
mmc: cqhci: Fix a tiny potential memory leak on error condition
mmc: sdhci-esdhc-imx: correct the fix of ERR004536
mm: enforce min addr even if capable() in expand_downwards()
drm: Block fb changes for async plane updates
hugetlbfs: fix races and page leaks during migration
MIPS: fix truncation in __cmpxchg_small for short values
MIPS: BCM63XX: provide DMA masks for ethernet devices
MIPS: eBPF: Fix icache flush end address
x86/uaccess: Don't leak the AC flag into __put_user() value evaluation
Linux 4.19.27
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
[ Upstream commit e158488be2 ]
Because wake_q_add() can imply an immediate wakeup (cmpxchg failure
case), we must not rely on the wakeup being delayed. However, commit:
e38513905e ("locking/rwsem: Rework zeroing reader waiter->task")
relies on exactly that behaviour in that the wakeup must not happen
until after we clear waiter->task.
[ peterz: Added changelog. ]
Signed-off-by: Xie Yongji <xieyongji@baidu.com>
Signed-off-by: Zhang Yu <zhangyu31@baidu.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: e38513905e ("locking/rwsem: Rework zeroing reader waiter->task")
Link: https://lkml.kernel.org/r/1543495830-2644-1-git-send-email-xieyongji@baidu.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit b061c38bef ]
We must not rely on wake_q_add() to delay the wakeup; in particular
commit:
1d0dcb3ad9 ("futex: Implement lockless wakeups")
moved wake_q_add() before smp_store_release(&q->lock_ptr, NULL), which
could result in futex_wait() waking before observing ->lock_ptr ==
NULL and going back to sleep again.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 1d0dcb3ad9 ("futex: Implement lockless wakeups")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 4c4e373156 ]
Notable cmpxchg() does not provide ordering when it fails, however
wake_q_add() requires ordering in this specific case too. Without this
it would be possible for the concurrent wakeup to not observe our
prior state.
Andrea Parri provided:
C wake_up_q-wake_q_add
{
int next = 0;
int y = 0;
}
P0(int *next, int *y)
{
int r0;
/* in wake_up_q() */
WRITE_ONCE(*next, 1); /* node->next = NULL */
smp_mb(); /* implied by wake_up_process() */
r0 = READ_ONCE(*y);
}
P1(int *next, int *y)
{
int r1;
/* in wake_q_add() */
WRITE_ONCE(*y, 1); /* wake_cond = true */
smp_mb__before_atomic();
r1 = cmpxchg_relaxed(next, 1, 2);
}
exists (0:r0=0 /\ 1:r1=0)
This "exists" clause cannot be satisfied according to the LKMM:
Test wake_up_q-wake_q_add Allowed
States 3
0:r0=0; 1:r1=1;
0:r0=1; 1:r1=0;
0:r0=1; 1:r1=1;
No
Witnesses
Positive: 0 Negative: 3
Condition exists (0:r0=0 /\ 1:r1=0)
Observation wake_up_q-wake_q_add Never 0 3
Reported-by: Yongji Xie <elohimes@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 6dc080eeb2 ]
For some peculiar reason rcuwait_wake_up() has the right barrier in
the comment, but not in the code.
This mistake has been observed to cause a deadlock in the following
situation:
P1 P2
percpu_up_read() percpu_down_write()
rcu_sync_is_idle() // false
rcu_sync_enter()
...
__percpu_up_read()
[S] ,- __this_cpu_dec(*sem->read_count)
| smp_rmb();
[L] | task = rcu_dereference(w->task) // NULL
|
| [S] w->task = current
| smp_mb();
| [L] readers_active_check() // fail
`-> <store happens here>
Where the smp_rmb() (obviously) fails to constrain the store.
[ peterz: Added changelog. ]
Signed-off-by: Prateek Sood <prsood@codeaurora.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Andrea Parri <andrea.parri@amarulasolutions.com>
Acked-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 8f95c90ceb ("sched/wait, RCU: Introduce rcuwait machinery")
Link: https://lkml.kernel.org/r/1543590656-7157-1-git-send-email-prsood@codeaurora.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>