The psi window size is a u64 an can be up to 10 seconds right now,
which exceeds the lower 32 bits of the variable. We currently use
div_u64 for it, which is meant only for 32-bit divisors. The result is
garbage pressure sampling values and even potential div0 crashes.
Use div64_u64.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Cc: Jingfeng Xie <xiejingfeng@linux.alibaba.com>
Link: https://lkml.kernel.org/r/20191203183524.41378-3-hannes@cmpxchg.org
Signed-off-by: Sasha Levin <sashal@kernel.org>
(cherry picked from commit c3466952ca)
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: I49fdfd55751d1a2cde19666624c9c5d76dc78dad
Jingfeng reports rare div0 crashes in psi on systems with some uptime:
[58914.066423] divide error: 0000 [#1] SMP
[58914.070416] Modules linked in: ipmi_poweroff ipmi_watchdog toa overlay fuse tcp_diag inet_diag binfmt_misc aisqos(O) aisqos_hotfixes(O)
[58914.083158] CPU: 94 PID: 140364 Comm: kworker/94:2 Tainted: G W OE K 4.9.151-015.ali3000.alios7.x86_64 #1
[58914.093722] Hardware name: Alibaba Alibaba Cloud ECS/Alibaba Cloud ECS, BIOS 3.23.34 02/14/2019
[58914.102728] Workqueue: events psi_update_work
[58914.107258] task: ffff8879da83c280 task.stack: ffffc90059dcc000
[58914.113336] RIP: 0010:[] [] psi_update_stats+0x1c1/0x330
[58914.122183] RSP: 0018:ffffc90059dcfd60 EFLAGS: 00010246
[58914.127650] RAX: 0000000000000000 RBX: ffff8858fe98be50 RCX: 000000007744d640
[58914.134947] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00003594f700648e
[58914.142243] RBP: ffffc90059dcfdf8 R08: 0000359500000000 R09: 0000000000000000
[58914.149538] R10: 0000000000000000 R11: 0000000000000000 R12: 0000359500000000
[58914.156837] R13: 0000000000000000 R14: 0000000000000000 R15: ffff8858fe98bd78
[58914.164136] FS: 0000000000000000(0000) GS:ffff887f7f380000(0000) knlGS:0000000000000000
[58914.172529] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[58914.178467] CR2: 00007f2240452090 CR3: 0000005d5d258000 CR4: 00000000007606f0
[58914.185765] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[58914.193061] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[58914.200360] PKRU: 55555554
[58914.203221] Stack:
[58914.205383] ffff8858fe98bd48 00000000000002f0 0000002e81036d09 ffffc90059dcfde8
[58914.213168] ffff8858fe98bec8 0000000000000000 0000000000000000 0000000000000000
[58914.220951] 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[58914.228734] Call Trace:
[58914.231337] [] psi_update_work+0x22/0x60
[58914.237067] [] process_one_work+0x189/0x420
[58914.243063] [] worker_thread+0x4e/0x4b0
[58914.248701] [] ? process_one_work+0x420/0x420
[58914.254869] [] kthread+0xe6/0x100
[58914.259994] [] ? kthread_park+0x60/0x60
[58914.265640] [] ret_from_fork+0x39/0x50
[58914.271193] Code: 41 29 c3 4d 39 dc 4d 0f 42 dc <49> f7 f1 48 8b 13 48 89 c7 48 c1
[58914.279691] RIP [] psi_update_stats+0x1c1/0x330
The crashing instruction is trying to divide the observed stall time
by the sampling period. The period, stored in R8, is not 0, but we are
dividing by the lower 32 bits only, which are all 0 in this instance.
We could switch to a 64-bit division, but the period shouldn't be that
big in the first place. It's the time between the last update and the
next scheduled one, and so should always be around 2s and comfortably
fit into 32 bits.
The bug is in the initialization of new cgroups: we schedule the first
sampling event in a cgroup as an offset of sched_clock(), but fail to
initialize the last_update timestamp, and it defaults to 0. That
results in a bogusly large sampling period the first time we run the
sampling code, and consequently we underreport pressure for the first
2s of a cgroup's life. But worse, if sched_clock() is sufficiently
advanced on the system, and the user gets unlucky, the period's lower
32 bits can all be 0 and the sampling division will crash.
Fix this by initializing the last update timestamp to the creation
time of the cgroup, thus correctly marking the start of the first
pressure sampling period in a new cgroup.
Reported-by: Jingfeng Xie <xiejingfeng@linux.alibaba.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Link: https://lkml.kernel.org/r/20191203183524.41378-2-hannes@cmpxchg.org
Signed-off-by: Sasha Levin <sashal@kernel.org>
(cherry picked from commit 3dfbe25c27)
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: Iaada5c2f1a03cf38cbb053adde478f762ce40843
With non-canonical CFI, LLVM generates jump table entries for external
symbols in modules and as a result, a function pointer passed from a
module to the core kernel will have a different address.
Disable the warning for now.
Bug: 145210207
Change-Id: Ifdcee3479280f7b97abdee6b4c746f447e0944e6
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Signed-off-by: Alistair Delva <adelva@google.com>
Some devices need an additional sched-tune boost group to
optimize performance for key tasks
Bug: 150302001
Change-Id: I392c8cc05a8851f1d416c381b4a27242924c2c27
Signed-off-by: Todd Kjos <tkjos@google.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAl5TfLwACgkQONu9yGCS
aT5wlRAAhZELK39c78NMCTZKHtKGLsGb2os2IiI7zIRbqNNwnvJi+jAc3kgbS9jP
+W+wnhYFtFisDvqdCQ009I6A0NA1p3Nqy166JplW0iIg1e7rgUKKUfabCN9sJmjh
HGK913cJlHwGmkSxq//sBucBwWhYYGaHec28pZ7uCFATjWrTaH3G4VrvLStuicYR
YgS9MH261tWJKJm5+V2MxnOOI0103+Uey+xVqwSnLlV+qmasxwDCMU5ae+SK7e7f
cXIkNZwvDph1zunekHg+jd64GN3GYswXVcRighWP0n7Lr+0tGPN7SY5pvZIjZLv/
sdroyrqAxytTYP32hypIUgsToVvJr7zXD09LGdsgOCKVwFVn8yl1e4zgGKH3L9Xu
OK2krI90v1MVevibyaNndZ4UDKilF75oE2YYDOFW/BU1lorFAIzk4hh15CfKc8s1
KHRjePfcgQREs/SGK8k2BAmf/JwxFN1/Ro5dl7MvKn07ZYqx6QOwUoMhgxspIntN
9TlFw6elu1RSwu2BFts9wvoHO1tr7GZBa1cVkNF8qV1rzaGVY68aLDvvHGdffD6W
JgX+BCfr6vcN7R4izak1RxzAoqDrRxS0vWoC1vVsPqeIIZydSxpYDquaFnbZm+Wc
MRuh5gpQ2PzTXuMLeBB+ig6UnzsAO3x+3yIG/l5ZmmYxJbMFBKU=
=zE/i
-----END PGP SIGNATURE-----
Merge 4.19.106 into android-4.19
Changes in 4.19.106
core: Don't skip generic XDP program execution for cloned SKBs
enic: prevent waking up stopped tx queues over watchdog reset
net/smc: fix leak of kernel memory to user space
net: dsa: tag_qca: Make sure there is headroom for tag
net/sched: matchall: add missing validation of TCA_MATCHALL_FLAGS
net/sched: flower: add missing validation of TCA_FLOWER_FLAGS
Revert "KVM: nVMX: Use correct root level for nested EPT shadow page tables"
Revert "KVM: VMX: Add non-canonical check on writes to RTIT address MSRs"
KVM: nVMX: Use correct root level for nested EPT shadow page tables
drm/gma500: Fixup fbdev stolen size usage evaluation
cpu/hotplug, stop_machine: Fix stop_machine vs hotplug order
brcmfmac: Fix use after free in brcmf_sdio_readframes()
leds: pca963x: Fix open-drain initialization
ext4: fix ext4_dax_read/write inode locking sequence for IOCB_NOWAIT
ALSA: ctl: allow TLV read operation for callback type of element in locked case
gianfar: Fix TX timestamping with a stacked DSA driver
pinctrl: sh-pfc: sh7264: Fix CAN function GPIOs
pxa168fb: Fix the function used to release some memory in an error handling path
media: i2c: mt9v032: fix enum mbus codes and frame sizes
powerpc/powernv/iov: Ensure the pdn for VFs always contains a valid PE number
gpio: gpio-grgpio: fix possible sleep-in-atomic-context bugs in grgpio_irq_map/unmap()
iommu/vt-d: Fix off-by-one in PASID allocation
char/random: silence a lockdep splat with printk()
media: sti: bdisp: fix a possible sleep-in-atomic-context bug in bdisp_device_run()
pinctrl: baytrail: Do not clear IRQ flags on direct-irq enabled pins
efi/x86: Map the entire EFI vendor string before copying it
MIPS: Loongson: Fix potential NULL dereference in loongson3_platform_init()
sparc: Add .exit.data section.
uio: fix a sleep-in-atomic-context bug in uio_dmem_genirq_irqcontrol()
usb: gadget: udc: fix possible sleep-in-atomic-context bugs in gr_probe()
usb: dwc2: Fix IN FIFO allocation
clocksource/drivers/bcm2835_timer: Fix memory leak of timer
kselftest: Minimise dependency of get_size on C library interfaces
jbd2: clear JBD2_ABORT flag before journal_reset to update log tail info when load journal
x86/sysfb: Fix check for bad VRAM size
pwm: omap-dmtimer: Simplify error handling
s390/pci: Fix possible deadlock in recover_store()
powerpc/iov: Move VF pdev fixup into pcibios_fixup_iov()
tracing: Fix tracing_stat return values in error handling paths
tracing: Fix very unlikely race of registering two stat tracers
ARM: 8952/1: Disable kmemleak on XIP kernels
ext4, jbd2: ensure panic when aborting with zero errno
ath10k: Correct the DMA direction for management tx buffers
drm/amd/display: Retrain dongles when SINK_COUNT becomes non-zero
nbd: add a flush_workqueue in nbd_start_device
KVM: s390: ENOTSUPP -> EOPNOTSUPP fixups
kconfig: fix broken dependency in randconfig-generated .config
clk: qcom: rcg2: Don't crash if our parent can't be found; return an error
drm/amdgpu: remove 4 set but not used variable in amdgpu_atombios_get_connector_info_from_object_table
drm/amdgpu: Ensure ret is always initialized when using SOC15_WAIT_ON_RREG
regulator: rk808: Lower log level on optional GPIOs being not available
net/wan/fsl_ucc_hdlc: reject muram offsets above 64K
NFC: port100: Convert cpu_to_le16(le16_to_cpu(E1) + E2) to use le16_add_cpu().
selinux: fall back to ref-walk if audit is required
arm64: dts: allwinner: H6: Add PMU mode
arm: dts: allwinner: H3: Add PMU node
selinux: ensure we cleanup the internal AVC counters on error in avc_insert()
arm64: dts: qcom: msm8996: Disable USB2 PHY suspend by core
ARM: dts: imx6: rdu2: Disable WP for USDHC2 and USDHC3
ARM: dts: imx6: rdu2: Limit USBH1 to Full Speed
PCI: iproc: Apply quirk_paxc_bridge() for module as well as built-in
media: cx23885: Add support for AVerMedia CE310B
PCI: Add generic quirk for increasing D3hot delay
PCI: Increase D3 delay for AMD Ryzen5/7 XHCI controllers
media: v4l2-device.h: Explicitly compare grp{id,mask} to zero in v4l2_device macros
reiserfs: Fix spurious unlock in reiserfs_fill_super() error handling
r8169: check that Realtek PHY driver module is loaded
fore200e: Fix incorrect checks of NULL pointer dereference
netfilter: nft_tunnel: add the missing ERSPAN_VERSION nla_policy
ALSA: usx2y: Adjust indentation in snd_usX2Y_hwdep_dsp_status
b43legacy: Fix -Wcast-function-type
ipw2x00: Fix -Wcast-function-type
iwlegacy: Fix -Wcast-function-type
rtlwifi: rtl_pci: Fix -Wcast-function-type
orinoco: avoid assertion in case of NULL pointer
ACPICA: Disassembler: create buffer fields in ACPI_PARSE_LOAD_PASS1
scsi: ufs: Complete pending requests in host reset and restore path
scsi: aic7xxx: Adjust indentation in ahc_find_syncrate
drm/mediatek: handle events when enabling/disabling crtc
ARM: dts: r8a7779: Add device node for ARM global timer
selinux: ensure we cleanup the internal AVC counters on error in avc_update()
dmaengine: Store module owner in dma_device struct
dmaengine: imx-sdma: Fix memory leak
crypto: chtls - Fixed memory leak
x86/vdso: Provide missing include file
PM / devfreq: rk3399_dmc: Add COMPILE_TEST and HAVE_ARM_SMCCC dependency
pinctrl: sh-pfc: sh7269: Fix CAN function GPIOs
reset: uniphier: Add SCSSI reset control for each channel
RDMA/rxe: Fix error type of mmap_offset
clk: sunxi-ng: add mux and pll notifiers for A64 CPU clock
ALSA: sh: Fix unused variable warnings
clk: uniphier: Add SCSSI clock gate for each channel
ALSA: sh: Fix compile warning wrt const
tools lib api fs: Fix gcc9 stringop-truncation compilation error
ACPI: button: Add DMI quirk for Razer Blade Stealth 13 late 2019 lid switch
mlx5: work around high stack usage with gcc
drm: remove the newline for CRC source name.
ARM: dts: stm32: Add power-supply for DSI panel on stm32f469-disco
usbip: Fix unsafe unaligned pointer usage
udf: Fix free space reporting for metadata and virtual partitions
staging: rtl8188: avoid excessive stack usage
IB/hfi1: Add software counter for ctxt0 seq drop
soc/tegra: fuse: Correct straps' address for older Tegra124 device trees
efi/x86: Don't panic or BUG() on non-critical error conditions
rcu: Use WRITE_ONCE() for assignments to ->pprev for hlist_nulls
Input: edt-ft5x06 - work around first register access error
x86/nmi: Remove irq_work from the long duration NMI handler
wan: ixp4xx_hss: fix compile-testing on 64-bit
ASoC: atmel: fix build error with CONFIG_SND_ATMEL_SOC_DMA=m
tty: synclinkmp: Adjust indentation in several functions
tty: synclink_gt: Adjust indentation in several functions
visorbus: fix uninitialized variable access
driver core: platform: Prevent resouce overflow from causing infinite loops
driver core: Print device when resources present in really_probe()
bpf: Return -EBADRQC for invalid map type in __bpf_tx_xdp_map
vme: bridges: reduce stack usage
drm/nouveau/secboot/gm20b: initialize pointer in gm20b_secboot_new()
drm/nouveau/gr/gk20a,gm200-: add terminators to method lists read from fw
drm/nouveau: Fix copy-paste error in nouveau_fence_wait_uevent_handler
drm/nouveau/drm/ttm: Remove set but not used variable 'mem'
drm/nouveau/fault/gv100-: fix memory leak on module unload
drm/vmwgfx: prevent memory leak in vmw_cmdbuf_res_add
usb: musb: omap2430: Get rid of musb .set_vbus for omap2430 glue
iommu/arm-smmu-v3: Use WRITE_ONCE() when changing validity of an STE
f2fs: set I_LINKABLE early to avoid wrong access by vfs
f2fs: free sysfs kobject
scsi: iscsi: Don't destroy session if there are outstanding connections
arm64: fix alternatives with LLVM's integrated assembler
drm/amd/display: fixup DML dependencies
watchdog/softlockup: Enforce that timestamp is valid on boot
f2fs: fix memleak of kobject
x86/mm: Fix NX bit clearing issue in kernel_map_pages_in_pgd
pwm: omap-dmtimer: Remove PWM chip in .remove before making it unfunctional
cmd64x: potential buffer overflow in cmd64x_program_timings()
ide: serverworks: potential overflow in svwks_set_pio_mode()
pwm: Remove set but not set variable 'pwm'
btrfs: fix possible NULL-pointer dereference in integrity checks
btrfs: safely advance counter when looking up bio csums
btrfs: device stats, log when stats are zeroed
module: avoid setting info->name early in case we can fall back to info->mod->name
remoteproc: Initialize rproc_class before use
irqchip/mbigen: Set driver .suppress_bind_attrs to avoid remove problems
ALSA: hda/hdmi - add retry logic to parse_intel_hdmi()
kbuild: use -S instead of -E for precise cc-option test in Kconfig
x86/decoder: Add TEST opcode to Group3-2
s390: adjust -mpacked-stack support check for clang 10
s390/ftrace: generate traced function stack frame
driver core: platform: fix u32 greater or equal to zero comparison
ALSA: hda - Add docking station support for Lenovo Thinkpad T420s
drm/nouveau/mmu: fix comptag memory leak
powerpc/sriov: Remove VF eeh_dev state when disabling SR-IOV
bcache: cached_dev_free needs to put the sb page
iommu/vt-d: Remove unnecessary WARN_ON_ONCE()
selftests: bpf: Reset global state between reuseport test runs
jbd2: switch to use jbd2_journal_abort() when failed to submit the commit record
jbd2: make sure ESHUTDOWN to be recorded in the journal superblock
ARM: 8951/1: Fix Kexec compilation issue.
hostap: Adjust indentation in prism2_hostapd_add_sta
iwlegacy: ensure loop counter addr does not wrap and cause an infinite loop
cifs: fix NULL dereference in match_prepath
bpf: map_seq_next should always increase position index
ceph: check availability of mds cluster on mount after wait timeout
rbd: work around -Wuninitialized warning
irqchip/gic-v3: Only provision redistributors that are enabled in ACPI
drm/nouveau/disp/nv50-: prevent oops when no channel method map provided
ftrace: fpid_next() should increase position index
trigger_next should increase position index
radeon: insert 10ms sleep in dce5_crtc_load_lut
ocfs2: fix a NULL pointer dereference when call ocfs2_update_inode_fsync_trans()
lib/scatterlist.c: adjust indentation in __sg_alloc_table
reiserfs: prevent NULL pointer dereference in reiserfs_insert_item()
bcache: explicity type cast in bset_bkey_last()
irqchip/gic-v3-its: Reference to its_invall_cmd descriptor when building INVALL
iwlwifi: mvm: Fix thermal zone registration
microblaze: Prevent the overflow of the start
brd: check and limit max_part par
drm/amdgpu/smu10: fix smu10_get_clock_by_type_with_latency
drm/amdgpu/smu10: fix smu10_get_clock_by_type_with_voltage
NFS: Fix memory leaks
help_next should increase position index
cifs: log warning message (once) if out of disk space
virtio_balloon: prevent pfn array overflow
mlxsw: spectrum_dpipe: Add missing error path
drm/amdgpu/display: handle multiple numbers of fclks in dcn_calcs.c (v2)
Linux 4.19.106
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ia1032b50dd82b42e13973120dcbf94ae7b864648
[ Upstream commit 6722b23e7a ]
if seq_file .next fuction does not change position index,
read after some lseek can generate unexpected output.
Without patch:
# dd bs=30 skip=1 if=/sys/kernel/tracing/events/sched/sched_switch/trigger
dd: /sys/kernel/tracing/events/sched/sched_switch/trigger: cannot skip to specified offset
n traceoff snapshot stacktrace enable_event disable_event enable_hist disable_hist hist
# Available triggers:
# traceon traceoff snapshot stacktrace enable_event disable_event enable_hist disable_hist hist
6+1 records in
6+1 records out
206 bytes copied, 0.00027916 s, 738 kB/s
Notice the printing of "# Available triggers:..." after the line.
With the patch:
# dd bs=30 skip=1 if=/sys/kernel/tracing/events/sched/sched_switch/trigger
dd: /sys/kernel/tracing/events/sched/sched_switch/trigger: cannot skip to specified offset
n traceoff snapshot stacktrace enable_event disable_event enable_hist disable_hist hist
2+1 records in
2+1 records out
88 bytes copied, 0.000526867 s, 167 kB/s
It only prints the end of the file, and does not restart.
Link: http://lkml.kernel.org/r/3c35ee24-dd3a-8119-9c19-552ed253388a@virtuozzo.comhttps://bugzilla.kernel.org/show_bug.cgi?id=206283
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit e4075e8bdf ]
if seq_file .next fuction does not change position index,
read after some lseek can generate unexpected output.
Without patch:
# dd bs=4 skip=1 if=/sys/kernel/tracing/set_ftrace_pid
dd: /sys/kernel/tracing/set_ftrace_pid: cannot skip to specified offset
id
no pid
2+1 records in
2+1 records out
10 bytes copied, 0.000213285 s, 46.9 kB/s
Notice the "id" followed by "no pid".
With the patch:
# dd bs=4 skip=1 if=/sys/kernel/tracing/set_ftrace_pid
dd: /sys/kernel/tracing/set_ftrace_pid: cannot skip to specified offset
id
0+1 records in
0+1 records out
3 bytes copied, 0.000202112 s, 14.8 kB/s
Notice that it only prints "id" and not the "no pid" afterward.
Link: http://lkml.kernel.org/r/4f87c6ad-f114-30bb-8506-c32274ce2992@virtuozzo.comhttps://bugzilla.kernel.org/show_bug.cgi?id=206283
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 708e0ada19 ]
In setup_load_info(), info->name (which contains the name of the module,
mostly used for early logging purposes before the module gets set up)
gets unconditionally assigned if .modinfo is missing despite the fact
that there is an if (!info->name) check near the end of the function.
Avoid assigning a placeholder string to info->name if .modinfo doesn't
exist, so that we can fall back to info->mod->name later on.
Fixes: 5fdc7db644 ("module: setup load info before module_sig_check()")
Reviewed-by: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: Jessica Yu <jeyu@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 11e31f608b ]
Robert reported that during boot the watchdog timestamp is set to 0 for one
second which is the indicator for a watchdog reset.
The reason for this is that the timestamp is in seconds and the time is
taken from sched clock and divided by ~1e9. sched clock starts at 0 which
means that for the first second during boot the watchdog timestamp is 0,
i.e. reset.
Use ULONG_MAX as the reset indicator value so the watchdog works correctly
right from the start. ULONG_MAX would only conflict with a real timestamp
if the system reaches an uptime of 136 years on 32bit and almost eternity
on 64bit.
Reported-by: Robert Richter <rrichter@marvell.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/87o8v3uuzl.fsf@nanos.tec.linutronix.de
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit dfb6cd1e65 ]
Looking through old emails in my INBOX, I came across a patch from Luis
Henriques that attempted to fix a race of two stat tracers registering the
same stat trace (extremely unlikely, as this is done in the kernel, and
probably doesn't even exist). The submitted patch wasn't quite right as it
needed to deal with clean up a bit better (if two stat tracers were the
same, it would have the same files).
But to make the code cleaner, all we needed to do is to keep the
all_stat_sessions_mutex held for most of the registering function.
Link: http://lkml.kernel.org/r/1410299375-20068-1-git-send-email-luis.henriques@canonical.com
Fixes: 002bb86d8d ("tracing/ftrace: separate events tracing and stats tracing engine")
Reported-by: Luis Henriques <luis.henriques@canonical.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit afccc00f75 ]
tracing_stat_init() was always returning '0', even on the error paths. It
now returns -ENODEV if tracing_init_dentry() fails or -ENOMEM if it fails
to created the 'trace_stat' debugfs directory.
Link: http://lkml.kernel.org/r/1410299381-20108-1-git-send-email-luis.henriques@canonical.com
Fixes: ed6f1c996b ("tracing: Check return value of tracing_init_dentry()")
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
[ Pulled from the archeological digging of my INBOX ]
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 45178ac0ce ]
Paul reported a very sporadic, rcutorture induced, workqueue failure.
When the planets align, the workqueue rescuer's self-migrate fails and
then triggers a WARN for running a work on the wrong CPU.
Tejun then figured that set_cpus_allowed_ptr()'s stop_one_cpu() call
could be ignored! When stopper->enabled is false, stop_machine will
insta complete the work, without actually doing the work. Worse, it
will not WARN about this (we really should fix this).
It turns out there is a small window where a freshly online'ed CPU is
marked 'online' but doesn't yet have the stopper task running:
BP AP
bringup_cpu()
__cpu_up(cpu, idle) --> start_secondary()
...
cpu_startup_entry()
bringup_wait_for_ap()
wait_for_ap_thread() <-- cpuhp_online_idle()
while (1)
do_idle()
... available to run kthreads ...
stop_machine_unpark()
stopper->enable = true;
Close this by moving the stop_machine_unpark() into
cpuhp_online_idle(), such that the stopper thread is ready before we
start the idle loop and schedule.
Reported-by: "Paul E. McKenney" <paulmck@kernel.org>
Debugged-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: "Paul E. McKenney" <paulmck@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAl5HEigACgkQONu9yGCS
aT7Gcw/6AkDGkK5U/aDpKMqWiRmZUqDIg8U9xR+44Gl57Q71vicrzq8NGPHxxbsF
slWoCyXLVSD7bMWGsTD0qJR8muROAraMxDl8dCxojEXHnXFMx4A4Cf0h1E0lY0mu
Jq/O9m33ZMSppjio88sCcLpo0pbXF+cCX1CY87NI5QUitUzHgRh18W8BtyFpMMI8
eC0Fc+hMWax3+qqHt/hFVpufaTKm35zLCpGjGAJiHd7GFvqUJnuAzBYCs1Cf8NO1
KrrL3l/IWk8z3Z0Wc9PbBz309a9H6FVpjrXSXj6URkxjtqJ0F0mBMaIYxhaUF8PD
CHY5xLyqKodC8/7O5zNOrP80oT9nqJvsmKwUwlG34IJuMVaq/o+hZu+88JVB02Yw
v9XVcaQda5aZgWF9cBWzFQEcNwHFDCQ9VNidLDcHJLGPyFo/BogvMo8T4yPM9tI0
O0PSFm/yYu0airZSCzIbPzuF2Iv+iilVtq+o10VRDsGtEYAOzTL7nA01MkdXFhwy
4V+Q51C90TGo13BnnZ6xpEqjspuDWgeOD71/xkQ5cnyFgam0XQq/5R6JJghJIHOP
7p8NMMyNhK2FnOGrFUgqvwBCp6Dap1ISZyKvie1Z8vuCJsZcwMVIw8fxAzoZWOjj
MlmmePjlbC7XTFxjdo0jrQTdvBwq+gFgNitD7UAlfHAdqKJKKA4=
=8ktI
-----END PGP SIGNATURE-----
Merge 4.19.104 into android-4.19
Changes in 4.19.104
ASoC: pcm: update FE/BE trigger order based on the command
hv_sock: Remove the accept port restriction
IB/mlx4: Fix memory leak in add_gid error flow
RDMA/netlink: Do not always generate an ACK for some netlink operations
RDMA/core: Fix locking in ib_uverbs_event_read
RDMA/uverbs: Verify MR access flags
scsi: ufs: Fix ufshcd_probe_hba() reture value in case ufshcd_scsi_add_wlus() fails
PCI/IOV: Fix memory leak in pci_iov_add_virtfn()
ath10k: pci: Only dump ATH10K_MEM_REGION_TYPE_IOREG when safe
PCI/switchtec: Fix vep_vector_number ioread width
PCI: Don't disable bridge BARs when assigning bus resources
nfs: NFS_SWAP should depend on SWAP
NFS: Revalidate the file size on a fatal write error
NFS/pnfs: Fix pnfs_generic_prepare_to_resend_writes()
NFSv4: try lease recovery on NFS4ERR_EXPIRED
serial: uartps: Add a timeout to the tx empty wait
gpio: zynq: Report gpio direction at boot
spi: spi-mem: Add extra sanity checks on the op param
spi: spi-mem: Fix inverted logic in op sanity check
rtc: hym8563: Return -EINVAL if the time is known to be invalid
rtc: cmos: Stop using shared IRQ
ARC: [plat-axs10x]: Add missing multicast filter number to GMAC node
platform/x86: intel_mid_powerbtn: Take a copy of ddata
ARM: dts: at91: Reenable UART TX pull-ups
ARM: dts: am43xx: add support for clkout1 clock
ARM: dts: at91: sama5d3: fix maximum peripheral clock rates
ARM: dts: at91: sama5d3: define clock rate range for tcb1
tools/power/acpi: fix compilation error
powerpc/pseries/vio: Fix iommu_table use-after-free refcount warning
powerpc/pseries: Allow not having ibm, hypertas-functions::hcall-multi-tce for DDW
iommu/arm-smmu-v3: Populate VMID field for CMDQ_OP_TLBI_NH_VA
KVM: arm/arm64: vgic-its: Fix restoration of unmapped collections
ARM: 8949/1: mm: mark free_memmap as __init
arm64: cpufeature: Fix the type of no FP/SIMD capability
arm64: ptrace: nofpsimd: Fail FP/SIMD regset operations
KVM: arm/arm64: Fix young bit from mmu notifier
KVM: arm: Fix DFSR setting for non-LPAE aarch32 guests
KVM: arm: Make inject_abt32() inject an external abort instead
KVM: arm64: pmu: Don't increment SW_INCR if PMCR.E is unset
mtd: onenand_base: Adjust indentation in onenand_read_ops_nolock
mtd: sharpslpart: Fix unsigned comparison to zero
crypto: artpec6 - return correct error code for failed setkey()
crypto: atmel-sha - fix error handling when setting hmac key
media: i2c: adv748x: Fix unsafe macros
pinctrl: sh-pfc: r8a7778: Fix duplicate SDSELF_B and SD1_CLK_B
mwifiex: Fix possible buffer overflows in mwifiex_ret_wmm_get_status()
mwifiex: Fix possible buffer overflows in mwifiex_cmd_append_vsie_tlv()
libertas: don't exit from lbs_ibss_join_existing() with RCU read lock held
libertas: make lbs_ibss_join_existing() return error code on rates overflow
scsi: megaraid_sas: Do not initiate OCR if controller is not in ready state
x86/stackframe: Move ENCODE_FRAME_POINTER to asm/frame.h
x86/stackframe, x86/ftrace: Add pt_regs frame annotations
serial: uartps: Move the spinlock after the read of the tx empty
padata: fix null pointer deref of pd->pinst
Linux 4.19.104
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I42a465b140183dcc8cf49e19903d0e8f4b688930
The 4.19 backport dc34710a7a ("padata: Remove broken queue flushing")
removed padata_alloc_pd()'s assignment to pd->pinst, resulting in:
Unable to handle kernel NULL pointer dereference ...
...
pc : padata_reorder+0x144/0x2e0
...
Call trace:
padata_reorder+0x144/0x2e0
padata_do_serial+0xc8/0x128
pcrypt_aead_enc+0x60/0x70 [pcrypt]
padata_parallel_worker+0xd8/0x138
process_one_work+0x1bc/0x4b8
worker_thread+0x164/0x580
kthread+0x134/0x138
ret_from_fork+0x10/0x18
This happened because the backport was based on an enhancement that
moved this assignment but isn't in 4.19:
bfde23ce20 ("padata: unbind parallel jobs from specific CPUs")
Simply restore the assignment to fix the crash.
Fixes: dc34710a7a ("padata: Remove broken queue flushing")
Reported-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: linux-kernel@vger.kernel.org
Cc: stable@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAl5Cn0wACgkQONu9yGCS
aT584xAAtePSlzTxst/jukREoyrpAfTM1BeovMdsZEBpKh+/F3n1udqHeo+iNAAN
qSOig012aW2qP7b5/4CrEU9ZRTvd0AM4fog7ABLJVahMYMqoJgod8TRaE4v0nVut
eRans6w3NbZJCZwdw2aiu5gwFfjwJLSUckBNmj4XVYdyfh7q0BgnZV5OY0V+zhuG
1MWXaylbRqjguR/ZFk0UPAmRaqNKHbwfCJ1V0ygL9xQkJM0cUn7hX9/CqM4aYnm6
m1oux4ektLAmF1XK4NiQEuRBMeFO74XlKcsZqQHf/b4FZfcPergcPwIj8ugtCHzJ
kx2QgURDjgH4Tnu+Q0ScPrjj2kjU8rWmjqlcv1PcUyOWm+MR0OK9bW7TLEntMSF8
HOEe9j6SsjQNIOoYh1YcMnuGjKNIZjl2L3VbDzpVN2GxZxwAutY6G68tV7sbA2pu
wtsrAVOqdcjoo0ruRmwognBqQAdNdsbiBx7bgcNjVEXWL0N3Ddiv6CNYwnehA5Hq
cvQwVQpFGP9ZGYUcCMbdwR+7kJzVy6V2S615M8GkE9FouOwTfV60zM/sZ1rFVt1J
70zxfRX5ys19aTAVkbi6pHHCUJ0ZAiTgWujp5Hp4kPt7gEz01Ur0s1kI3b7b6iWh
cuycRFULvqeXCApQacs//lOVDoUV20uFcL/zqOFM33v/+YzkyjA=
=3D8z
-----END PGP SIGNATURE-----
Merge 4.19.103 into android-4.19
Changes in 4.19.103
Revert "drm/sun4i: dsi: Change the start delay calculation"
ovl: fix lseek overflow on 32bit
kernel/module: Fix memleak in module_add_modinfo_attrs()
media: iguanair: fix endpoint sanity check
ocfs2: fix oops when writing cloned file
x86/cpu: Update cached HLE state on write to TSX_CTRL_CPUID_CLEAR
udf: Allow writing to 'Rewritable' partitions
printk: fix exclusive_console replaying
iwlwifi: mvm: fix NVM check for 3168 devices
sparc32: fix struct ipc64_perm type definition
cls_rsvp: fix rsvp_policy
gtp: use __GFP_NOWARN to avoid memalloc warning
l2tp: Allow duplicate session creation with UDP
net: hsr: fix possible NULL deref in hsr_handle_frame()
net_sched: fix an OOB access in cls_tcindex
net: stmmac: Delete txtimer in suspend()
bnxt_en: Fix TC queue mapping.
tcp: clear tp->total_retrans in tcp_disconnect()
tcp: clear tp->delivered in tcp_disconnect()
tcp: clear tp->data_segs{in|out} in tcp_disconnect()
tcp: clear tp->segs_{in|out} in tcp_disconnect()
rxrpc: Fix use-after-free in rxrpc_put_local()
rxrpc: Fix insufficient receive notification generation
rxrpc: Fix missing active use pinning of rxrpc_local object
rxrpc: Fix NULL pointer deref due to call->conn being cleared on disconnect
media: uvcvideo: Avoid cyclic entity chains due to malformed USB descriptors
mfd: dln2: More sanity checking for endpoints
ipc/msg.c: consolidate all xxxctl_down() functions
tracing: Fix sched switch start/stop refcount racy updates
rcu: Avoid data-race in rcu_gp_fqs_check_wake()
brcmfmac: Fix memory leak in brcmf_usbdev_qinit
usb: typec: tcpci: mask event interrupts when remove driver
usb: gadget: legacy: set max_speed to super-speed
usb: gadget: f_ncm: Use atomic_t to track in-flight request
usb: gadget: f_ecm: Use atomic_t to track in-flight request
ALSA: usb-audio: Fix endianess in descriptor validation
ALSA: dummy: Fix PCM format loop in proc output
mm/memory_hotplug: fix remove_memory() lockdep splat
mm: move_pages: report the number of non-attempted pages
media/v4l2-core: set pages dirty upon releasing DMA buffers
media: v4l2-core: compat: ignore native command codes
media: v4l2-rect.h: fix v4l2_rect_map_inside() top/left adjustments
lib/test_kasan.c: fix memory leak in kmalloc_oob_krealloc_more()
irqdomain: Fix a memory leak in irq_domain_push_irq()
platform/x86: intel_scu_ipc: Fix interrupt support
ALSA: hda: Add Clevo W65_67SB the power_save blacklist
KVM: arm64: Correct PSTATE on exception entry
KVM: arm/arm64: Correct CPSR on exception entry
KVM: arm/arm64: Correct AArch32 SPSR on exception entry
KVM: arm64: Only sign-extend MMIO up to register width
MIPS: fix indentation of the 'RELOCS' message
MIPS: boot: fix typo in 'vmlinux.lzma.its' target
s390/mm: fix dynamic pagetable upgrade for hugetlbfs
powerpc/xmon: don't access ASDR in VMs
powerpc/pseries: Advance pfn if section is not present in lmb_is_removable()
smb3: fix signing verification of large reads
PCI: tegra: Fix return value check of pm_runtime_get_sync()
mmc: spi: Toggle SPI polarity, do not hardcode it
ACPI: video: Do not export a non working backlight interface on MSI MS-7721 boards
ACPI / battery: Deal with design or full capacity being reported as -1
ACPI / battery: Use design-cap for capacity calculations if full-cap is not available
ACPI / battery: Deal better with neither design nor full capacity not being reported
alarmtimer: Unregister wakeup source when module get fails
ubifs: Reject unsupported ioctl flags explicitly
ubifs: don't trigger assertion on invalid no-key filename
ubifs: Fix FS_IOC_SETFLAGS unexpectedly clearing encrypt flag
ubifs: Fix deadlock in concurrent bulk-read and writepage
crypto: geode-aes - convert to skcipher API and make thread-safe
PCI: keystone: Fix link training retries initiation
mmc: sdhci-of-at91: fix memleak on clk_get failure
hv_balloon: Balloon up according to request page number
mfd: axp20x: Mark AXP20X_VBUS_IPSOUT_MGMT as volatile
crypto: api - Check spawn->alg under lock in crypto_drop_spawn
crypto: ccree - fix backlog memory leak
crypto: ccree - fix pm wrongful error reporting
crypto: ccree - fix PM race condition
scripts/find-unused-docs: Fix massive false positives
scsi: qla2xxx: Fix mtcp dump collection failure
power: supply: ltc2941-battery-gauge: fix use-after-free
ovl: fix wrong WARN_ON() in ovl_cache_update_ino()
f2fs: choose hardlimit when softlimit is larger than hardlimit in f2fs_statfs_project()
f2fs: fix miscounted block limit in f2fs_statfs_project()
f2fs: code cleanup for f2fs_statfs_project()
PM: core: Fix handling of devices deleted during system-wide resume
of: Add OF_DMA_DEFAULT_COHERENT & select it on powerpc
dm zoned: support zone sizes smaller than 128MiB
dm space map common: fix to ensure new block isn't already in use
dm crypt: fix benbi IV constructor crash if used in authenticated mode
dm: fix potential for q->make_request_fn NULL pointer
dm writecache: fix incorrect flush sequence when doing SSD mode commit
padata: Remove broken queue flushing
tracing: Annotate ftrace_graph_hash pointer with __rcu
tracing: Annotate ftrace_graph_notrace_hash pointer with __rcu
ftrace: Add comment to why rcu_dereference_sched() is open coded
ftrace: Protect ftrace_graph_hash with ftrace_sync
samples/bpf: Don't try to remove user's homedir on clean
crypto: ccp - set max RSA modulus size for v3 platform devices as well
crypto: pcrypt - Do not clear MAY_SLEEP flag in original request
crypto: atmel-aes - Fix counter overflow in CTR mode
crypto: api - Fix race condition in crypto_spawn_alg
crypto: picoxcell - adjust the position of tasklet_init and fix missed tasklet_kill
scsi: qla2xxx: Fix unbound NVME response length
NFS: Fix memory leaks and corruption in readdir
NFS: Directory page cache pages need to be locked when read
jbd2_seq_info_next should increase position index
Btrfs: fix missing hole after hole punching and fsync when using NO_HOLES
btrfs: set trans->drity in btrfs_commit_transaction
Btrfs: fix race between adding and putting tree mod seq elements and nodes
ARM: tegra: Enable PLLP bypass during Tegra124 LP1
iwlwifi: don't throw error when trying to remove IGTK
mwifiex: fix unbalanced locking in mwifiex_process_country_ie()
sunrpc: expiry_time should be seconds not timeval
gfs2: move setting current->backing_dev_info
gfs2: fix O_SYNC write handling
drm/rect: Avoid division by zero
media: rc: ensure lirc is initialized before registering input device
tools/kvm_stat: Fix kvm_exit filter name
xen/balloon: Support xend-based toolstack take two
watchdog: fix UAF in reboot notifier handling in watchdog core code
bcache: add readahead cache policy options via sysfs interface
eventfd: track eventfd_signal() recursion depth
aio: prevent potential eventfd recursion on poll
KVM: x86: Refactor picdev_write() to prevent Spectre-v1/L1TF attacks
KVM: x86: Refactor prefix decoding to prevent Spectre-v1/L1TF attacks
KVM: x86: Protect pmu_intel.c from Spectre-v1/L1TF attacks
KVM: x86: Protect DR-based index computations from Spectre-v1/L1TF attacks
KVM: x86: Protect kvm_lapic_reg_write() from Spectre-v1/L1TF attacks
KVM: x86: Protect kvm_hv_msr_[get|set]_crash_data() from Spectre-v1/L1TF attacks
KVM: x86: Protect ioapic_write_indirect() from Spectre-v1/L1TF attacks
KVM: x86: Protect MSR-based index computations in pmu.h from Spectre-v1/L1TF attacks
KVM: x86: Protect ioapic_read_indirect() from Spectre-v1/L1TF attacks
KVM: x86: Protect MSR-based index computations from Spectre-v1/L1TF attacks in x86.c
KVM: x86: Protect x86_decode_insn from Spectre-v1/L1TF attacks
KVM: x86: Protect MSR-based index computations in fixed_msr_to_seg_unit() from Spectre-v1/L1TF attacks
KVM: x86: Fix potential put_fpu() w/o load_fpu() on MPX platform
KVM: PPC: Book3S HV: Uninit vCPU if vcore creation fails
KVM: PPC: Book3S PR: Free shared page if mmu initialization fails
x86/kvm: Be careful not to clear KVM_VCPU_FLUSH_TLB bit
KVM: x86: Don't let userspace set host-reserved cr4 bits
KVM: x86: Free wbinvd_dirty_mask if vCPU creation fails
KVM: s390: do not clobber registers during guest reset/store status
clk: tegra: Mark fuse clock as critical
drm/amd/dm/mst: Ignore payload update failures
percpu: Separate decrypted varaibles anytime encryption can be enabled
scsi: qla2xxx: Fix the endianness of the qla82xx_get_fw_size() return type
scsi: csiostor: Adjust indentation in csio_device_reset
scsi: qla4xxx: Adjust indentation in qla4xxx_mem_free
scsi: ufs: Recheck bkops level if bkops is disabled
phy: qualcomm: Adjust indentation in read_poll_timeout
ext2: Adjust indentation in ext2_fill_super
powerpc/44x: Adjust indentation in ibm4xx_denali_fixup_memsize
drm: msm: mdp4: Adjust indentation in mdp4_dsi_encoder_enable
NFC: pn544: Adjust indentation in pn544_hci_check_presence
ppp: Adjust indentation into ppp_async_input
net: smc911x: Adjust indentation in smc911x_phy_configure
net: tulip: Adjust indentation in {dmfe, uli526x}_init_module
IB/mlx5: Fix outstanding_pi index for GSI qps
IB/core: Fix ODP get user pages flow
nfsd: fix delay timer on 32-bit architectures
nfsd: fix jiffies/time_t mixup in LRU list
nfsd: Return the correct number of bytes written to the file
ubi: fastmap: Fix inverted logic in seen selfcheck
ubi: Fix an error pointer dereference in error handling code
mfd: da9062: Fix watchdog compatible string
mfd: rn5t618: Mark ADC control register volatile
bonding/alb: properly access headers in bond_alb_xmit()
net: dsa: bcm_sf2: Only 7278 supports 2Gb/sec IMP port
net: mvneta: move rx_dropped and rx_errors in per-cpu stats
net_sched: fix a resource leak in tcindex_set_parms()
net: systemport: Avoid RBUF stuck in Wake-on-LAN mode
net/mlx5: IPsec, Fix esp modify function attribute
net/mlx5: IPsec, fix memory leak at mlx5_fpga_ipsec_delete_sa_ctx
net: macb: Remove unnecessary alignment check for TSO
net: macb: Limit maximum GEM TX length in TSO
net: dsa: b53: Always use dev->vlan_enabled in b53_configure_vlan()
ext4: fix deadlock allocating crypto bounce page from mempool
btrfs: use bool argument in free_root_pointers()
btrfs: free block groups after free'ing fs trees
drm: atmel-hlcdc: enable clock before configuring timing engine
drm/dp_mst: Remove VCPI while disabling topology mgr
btrfs: flush write bio if we loop in extent_write_cache_pages
KVM: x86/mmu: Apply max PA check for MMIO sptes to 32-bit KVM
KVM: x86: Use gpa_t for cr2/gpa to fix TDP support on 32-bit KVM
KVM: VMX: Add non-canonical check on writes to RTIT address MSRs
KVM: nVMX: vmread should not set rflags to specify success in case of #PF
KVM: Use vcpu-specific gva->hva translation when querying host page size
KVM: Play nice with read-only memslots when querying host page size
mm: zero remaining unavailable struct pages
mm: return zero_resv_unavail optimization
mm/page_alloc.c: fix uninitialized memmaps on a partially populated last section
cifs: fail i/o on soft mounts if sessionsetup errors out
x86/apic/msi: Plug non-maskable MSI affinity race
clocksource: Prevent double add_timer_on() for watchdog_timer
perf/core: Fix mlock accounting in perf_mmap()
rxrpc: Fix service call disconnection
Linux 4.19.103
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I0d7f09085c3541373e0fd6b2e3ffacc5e34f7d55
commit 003461559e upstream.
Decreasing sysctl_perf_event_mlock between two consecutive perf_mmap()s of
a perf ring buffer may lead to an integer underflow in locked memory
accounting. This may lead to the undesired behaviors, such as failures in
BPF map creation.
Address this by adjusting the accounting logic to take into account the
possibility that the amount of already locked memory may exceed the
current limit.
Fixes: c4b7547974 ("perf/core: Make the mlock accounting simple again")
Suggested-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: <stable@vger.kernel.org>
Acked-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Link: https://lkml.kernel.org/r/20200123181146.2238074-1-songliubraving@fb.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit febac332a8 upstream.
Kernel crashes inside QEMU/KVM are observed:
kernel BUG at kernel/time/timer.c:1154!
BUG_ON(timer_pending(timer) || !timer->function) in add_timer_on().
At the same time another cpu got:
general protection fault: 0000 [#1] SMP PTI of poinson pointer 0xdead000000000200 in:
__hlist_del at include/linux/list.h:681
(inlined by) detach_timer at kernel/time/timer.c:818
(inlined by) expire_timers at kernel/time/timer.c:1355
(inlined by) __run_timers at kernel/time/timer.c:1686
(inlined by) run_timer_softirq at kernel/time/timer.c:1699
Unfortunately kernel logs are badly scrambled, stacktraces are lost.
Printing the timer->function before the BUG_ON() pointed to
clocksource_watchdog().
The execution of clocksource_watchdog() can race with a sequence of
clocksource_stop_watchdog() .. clocksource_start_watchdog():
expire_timers()
detach_timer(timer, true);
timer->entry.pprev = NULL;
raw_spin_unlock_irq(&base->lock);
call_timer_fn
clocksource_watchdog()
clocksource_watchdog_kthread() or
clocksource_unbind()
spin_lock_irqsave(&watchdog_lock, flags);
clocksource_stop_watchdog();
del_timer(&watchdog_timer);
watchdog_running = 0;
spin_unlock_irqrestore(&watchdog_lock, flags);
spin_lock_irqsave(&watchdog_lock, flags);
clocksource_start_watchdog();
add_timer_on(&watchdog_timer, ...);
watchdog_running = 1;
spin_unlock_irqrestore(&watchdog_lock, flags);
spin_lock(&watchdog_lock);
add_timer_on(&watchdog_timer, ...);
BUG_ON(timer_pending(timer) || !timer->function);
timer_pending() -> true
BUG()
I.e. inside clocksource_watchdog() watchdog_timer could be already armed.
Check timer_pending() before calling add_timer_on(). This is sufficient as
all operations are synchronized by watchdog_lock.
Fixes: 75c5158f70 ("timekeeping: Update clocksource with stop_machine")
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/158048693917.4378.13823603769948933793.stgit@buzz
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 6f1a4891a5 upstream.
Evan tracked down a subtle race between the update of the MSI message and
the device raising an interrupt internally on PCI devices which do not
support MSI masking. The update of the MSI message is non-atomic and
consists of either 2 or 3 sequential 32bit wide writes to the PCI config
space.
- Write address low 32bits
- Write address high 32bits (If supported by device)
- Write data
When an interrupt is migrated then both address and data might change, so
the kernel attempts to mask the MSI interrupt first. But for MSI masking is
optional, so there exist devices which do not provide it. That means that
if the device raises an interrupt internally between the writes then a MSI
message is sent built from half updated state.
On x86 this can lead to spurious interrupts on the wrong interrupt
vector when the affinity setting changes both address and data. As a
consequence the device interrupt can be lost causing the device to
become stuck or malfunctioning.
Evan tried to handle that by disabling MSI accross an MSI message
update. That's not feasible because disabling MSI has issues on its own:
If MSI is disabled the PCI device is routing an interrupt to the legacy
INTx mechanism. The INTx delivery can be disabled, but the disablement is
not working on all devices.
Some devices lose interrupts when both MSI and INTx delivery are disabled.
Another way to solve this would be to enforce the allocation of the same
vector on all CPUs in the system for this kind of screwed devices. That
could be done, but it would bring back the vector space exhaustion problems
which got solved a few years ago.
Fortunately the high address (if supported by the device) is only relevant
when X2APIC is enabled which implies interrupt remapping. In the interrupt
remapping case the affinity setting is happening at the interrupt remapping
unit and the PCI MSI message is programmed only once when the PCI device is
initialized.
That makes it possible to solve it with a two step update:
1) Target the MSI msg to the new vector on the current target CPU
2) Target the MSI msg to the new vector on the new target CPU
In both cases writing the MSI message is only changing a single 32bit word
which prevents the issue of inconsistency.
After writing the final destination it is necessary to check whether the
device issued an interrupt while the intermediate state #1 (new vector,
current CPU) was in effect.
This is possible because the affinity change is always happening on the
current target CPU. The code runs with interrupts disabled, so the
interrupt can be detected by checking the IRR of the local APIC. If the
vector is pending in the IRR then the interrupt is retriggered on the new
target CPU by sending an IPI for the associated vector on the target CPU.
This can cause spurious interrupts on both the local and the new target
CPU.
1) If the new vector is not in use on the local CPU and the device
affected by the affinity change raised an interrupt during the
transitional state (step #1 above) then interrupt entry code will
ignore that spurious interrupt. The vector is marked so that the
'No irq handler for vector' warning is supressed once.
2) If the new vector is in use already on the local CPU then the IRR check
might see an pending interrupt from the device which is using this
vector. The IPI to the new target CPU will then invoke the handler of
the device, which got the affinity change, even if that device did not
issue an interrupt
3) If the new vector is in use already on the local CPU and the device
affected by the affinity change raised an interrupt during the
transitional state (step #1 above) then the handler of the device which
uses that vector on the local CPU will be invoked.
expose issues in device driver interrupt handlers which are not prepared to
handle a spurious interrupt correctly. This not a regression, it's just
exposing something which was already broken as spurious interrupts can
happen for a lot of reasons and all driver handlers need to be able to deal
with them.
Reported-by: Evan Green <evgreen@chromium.org>
Debugged-by: Evan Green <evgreen@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Evan Green <evgreen@chromium.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/87imkr4s7n.fsf@nanos.tec.linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 54a16ff6f2 ]
As function_graph tracer can run when RCU is not "watching", it can not be
protected by synchronize_rcu() it requires running a task on each CPU before
it can be freed. Calling schedule_on_each_cpu(ftrace_sync) needs to be used.
Link: https://lore.kernel.org/r/20200205131110.GT2935@paulmck-ThinkPad-P72
Cc: stable@vger.kernel.org
Fixes: b9b0c831be ("ftrace: Convert graph filter to use hash tables")
Reported-by: "Paul E. McKenney" <paulmck@kernel.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 16052dd5bd ]
Because the function graph tracer can execute in sections where RCU is not
"watching", the rcu_dereference_sched() for the has needs to be open coded.
This is fine because the RCU "flavor" of the ftrace hash is protected by
its own RCU handling (it does its own little synchronization on every CPU
and does not rely on RCU sched).
Acked-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 07928d9bfc ]
The function padata_flush_queues is fundamentally broken because
it cannot force padata users to complete the request that is
underway. IOW padata has to passively wait for the completion
of any outstanding work.
As it stands flushing is used in two places. Its use in padata_stop
is simply unnecessary because nothing depends on the queues to
be flushed afterwards.
The other use in padata_replace is more substantial as we depend
on it to free the old pd structure. This patch instead uses the
pd->refcnt to dynamically free the pd structure once all requests
are complete.
Fixes: 2b73b07ab8 ("padata: Flush the padata queues actively")
Cc: <stable@vger.kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 6b6d188aae upstream.
The alarmtimer_rtc_add_device() function creates a wakeup source and then
tries to grab a module reference. If that fails the function returns early
with an error code, but fails to remove the wakeup source.
Cleanup this exit path so there is no dangling wakeup source, which is
named 'alarmtime' left allocated which will conflict with another RTC
device that may be registered later.
Fixes: 51218298a2 ("alarmtimer: Ensure RTC module is not unloaded")
Signed-off-by: Stephen Boyd <swboyd@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20200109155910.907-2-swboyd@chromium.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 6935c3983b upstream.
The rcu_gp_fqs_check_wake() function uses rcu_preempt_blocked_readers_cgp()
to read ->gp_tasks while other cpus might overwrite this field.
We need READ_ONCE()/WRITE_ONCE() pairs to avoid compiler
tricks and KCSAN splats like the following :
BUG: KCSAN: data-race in rcu_gp_fqs_check_wake / rcu_preempt_deferred_qs_irqrestore
write to 0xffffffff85a7f190 of 8 bytes by task 7317 on cpu 0:
rcu_preempt_deferred_qs_irqrestore+0x43d/0x580 kernel/rcu/tree_plugin.h:507
rcu_read_unlock_special+0xec/0x370 kernel/rcu/tree_plugin.h:659
__rcu_read_unlock+0xcf/0xe0 kernel/rcu/tree_plugin.h:394
rcu_read_unlock include/linux/rcupdate.h:645 [inline]
__ip_queue_xmit+0x3b0/0xa40 net/ipv4/ip_output.c:533
ip_queue_xmit+0x45/0x60 include/net/ip.h:236
__tcp_transmit_skb+0xdeb/0x1cd0 net/ipv4/tcp_output.c:1158
__tcp_send_ack+0x246/0x300 net/ipv4/tcp_output.c:3685
tcp_send_ack+0x34/0x40 net/ipv4/tcp_output.c:3691
tcp_cleanup_rbuf+0x130/0x360 net/ipv4/tcp.c:1575
tcp_recvmsg+0x633/0x1a30 net/ipv4/tcp.c:2179
inet_recvmsg+0xbb/0x250 net/ipv4/af_inet.c:838
sock_recvmsg_nosec net/socket.c:871 [inline]
sock_recvmsg net/socket.c:889 [inline]
sock_recvmsg+0x92/0xb0 net/socket.c:885
sock_read_iter+0x15f/0x1e0 net/socket.c:967
call_read_iter include/linux/fs.h:1864 [inline]
new_sync_read+0x389/0x4f0 fs/read_write.c:414
read to 0xffffffff85a7f190 of 8 bytes by task 10 on cpu 1:
rcu_gp_fqs_check_wake kernel/rcu/tree.c:1556 [inline]
rcu_gp_fqs_check_wake+0x93/0xd0 kernel/rcu/tree.c:1546
rcu_gp_fqs_loop+0x36c/0x580 kernel/rcu/tree.c:1611
rcu_gp_kthread+0x143/0x220 kernel/rcu/tree.c:1768
kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352
Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 10 Comm: rcu_preempt Not tainted 5.3.0+ #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
[ paulmck: Added another READ_ONCE() for RCU CPU stall warnings. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 64ae572bc7 upstream.
Reading the sched_cmdline_ref and sched_tgid_ref initial state within
tracing_start_sched_switch without holding the sched_register_mutex is
racy against concurrent updates, which can lead to tracepoint probes
being registered more than once (and thus trigger warnings within
tracepoint.c).
[ May be the fix for this bug ]
Link: https://lore.kernel.org/r/000000000000ab6f84056c786b93@google.com
Link: http://lkml.kernel.org/r/20190817141208.15226-1-mathieu.desnoyers@efficios.com
Cc: stable@vger.kernel.org
CC: Steven Rostedt (VMware) <rostedt@goodmis.org>
CC: Joel Fernandes (Google) <joel@joelfernandes.org>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Paul E. McKenney <paulmck@linux.ibm.com>
Reported-by: syzbot+774fddf07b7ab29a1e55@syzkaller.appspotmail.com
Fixes: d914ba37d7 ("tracing: Add support for recording tgid of tasks")
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit def97da136 ]
Commit f92b070f2d ("printk: Do not miss new messages when replaying
the log") introduced a new variable @exclusive_console_stop_seq to
store when an exclusive console should stop printing. It should be
set to the @console_seq value at registration. However, @console_seq
is previously set to @syslog_seq so that the exclusive console knows
where to begin. This results in the exclusive console immediately
reactivating all the other consoles and thus repeating the messages
for those consoles.
Set @console_seq after @exclusive_console_stop_seq has stored the
current @console_seq value.
Fixes: f92b070f2d ("printk: Do not miss new messages when replaying the log")
Link: http://lkml.kernel.org/r/20191219115322.31160-1-john.ogness@linutronix.de
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit f6d061d617 ]
In module_add_modinfo_attrs() if sysfs_create_file() fails
on the first iteration of the loop (so i = 0), we forget to
free the modinfo_attrs.
Fixes: bc6f2a757d ("kernel/module: Fix mem leak in module_add_modinfo_attrs")
Reviewed-by: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Jessica Yu <jeyu@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAl461NIACgkQONu9yGCS
aT6Mqw//W5xIIcs0Ut+P+QYNN6lCTRJ0AvFUolz79M3pyK/rHUluwTvYJbDAeGE3
sckv96rE1pxj5ZSf6LegXIoALrA4RlYHS8xXkYnRrt6xfrb7UwpqsJtt4Mx+IrJ3
9uFfaWRSvuDfRCraZxLiE2Bl9xVYvaPfFJYBmH383VB+deYNfpwORFsqNDQT+gR6
PZLuV0x//Kerwmd4OvaaHR/fIl8YVKmIz5lu3+3WIuVKxTK6Bbd3YzVu13dhVaX2
mETflLEAO/sYsUQiS4SO22ejLAiWyD8LyMV8s9KeTFQXzML3JpibKnt3ySDfzsFE
m8VRlaLcQwB0Ca2AVGHA5QV0+V+2+6qh/IcZl630feBueGQX59qLQkOurD4e/9lm
Na6ZkLPTh9UipIfTu9fvA5HY5lPt2VcSWwG2nLluckfJIpKNFVQEB7vuk9zd7468
qkXmj/J1YDdJzt2YgD0WZuKu3f1/No7rXbNmT2Oj0+HNWWvIU9xFNFlIPAxo7pJy
kwekd9+gHI0n1OhLRjzYUyf0pD+j0o75ZHsYYsUW0y6cGoWX/LmQ8JPFi+waHiov
FOe8FJz/uDtfQnJ4+izAM5Jjbu1LE+L8uGoIExYAv4DuXgPZtI2wtHvP4HHM3Aov
mDWLesMgizsroViv57aXC0C1ZPksPpGeHT+HcH7RnDQ0kQmpe3E=
=2XGW
-----END PGP SIGNATURE-----
Merge 4.19.102 into android-4.19
Changes in 4.19.102
vfs: fix do_last() regression
x86/resctrl: Fix use-after-free when deleting resource groups
x86/resctrl: Fix use-after-free due to inaccurate refcount of rdtgroup
x86/resctrl: Fix a deadlock due to inaccurate reference
crypto: pcrypt - Fix user-after-free on module unload
rsi: add hci detach for hibernation and poweroff
rsi: fix use-after-free on failed probe and unbind
perf c2c: Fix return type for histogram sorting comparision functions
PM / devfreq: Add new name attribute for sysfs
tools lib: Fix builds when glibc contains strlcpy()
arm64: kbuild: remove compressed images on 'make ARCH=arm64 (dist)clean'
ext4: validate the debug_want_extra_isize mount option at parse time
mm/mempolicy.c: fix out of bounds write in mpol_parse_str()
reiserfs: Fix memory leak of journal device string
media: digitv: don't continue if remote control state can't be read
media: af9005: uninitialized variable printked
media: vp7045: do not read uninitialized values if usb transfer fails
media: gspca: zero usb_buf
media: dvb-usb/dvb-usb-urb.c: initialize actlen to 0
tomoyo: Use atomic_t for statistics counter
ttyprintk: fix a potential deadlock in interrupt context issue
Bluetooth: Fix race condition in hci_release_sock()
cgroup: Prevent double killing of css when enabling threaded cgroup
media: si470x-i2c: Move free() past last use of 'radio'
ARM: dts: sun8i: a83t: Correct USB3503 GPIOs polarity
ARM: dts: am57xx-beagle-x15/am57xx-idk: Remove "gpios" for endpoint dt nodes
ARM: dts: beagle-x15-common: Model 5V0 regulator
soc: ti: wkup_m3_ipc: Fix race condition with rproc_boot
tools lib traceevent: Fix memory leakage in filter_event
rseq: Unregister rseq for clone CLONE_VM
clk: sunxi-ng: h6-r: Fix AR100/R_APB2 parent order
mac80211: mesh: restrict airtime metric to peered established plinks
clk: mmp2: Fix the order of timer mux parents
ASoC: rt5640: Fix NULL dereference on module unload
ixgbevf: Remove limit of 10 entries for unicast filter list
ixgbe: Fix calculation of queue with VFs and flow director on interface flap
igb: Fix SGMII SFP module discovery for 100FX/LX.
platform/x86: GPD pocket fan: Allow somewhat lower/higher temperature limits
ASoC: sti: fix possible sleep-in-atomic
qmi_wwan: Add support for Quectel RM500Q
parisc: Use proper printk format for resource_size_t
wireless: fix enabling channel 12 for custom regulatory domain
cfg80211: Fix radar event during another phy CAC
mac80211: Fix TKIP replay protection immediately after key setup
wireless: wext: avoid gcc -O3 warning
netfilter: nft_tunnel: ERSPAN_VERSION must not be null
net: dsa: bcm_sf2: Configure IMP port for 2Gb/sec
bnxt_en: Fix ipv6 RFS filter matching logic.
riscv: delete temporary files
iwlwifi: Don't ignore the cap field upon mcc update
ARM: dts: am335x-boneblack-common: fix memory size
vti[6]: fix packet tx through bpf_redirect()
xfrm interface: fix packet tx through bpf_redirect()
xfrm: interface: do not confirm neighbor when do pmtu update
scsi: fnic: do not queue commands during fwreset
ARM: 8955/1: virt: Relax arch timer version check during early boot
tee: optee: Fix compilation issue with nommu
airo: Fix possible info leak in AIROOLDIOCTL/SIOCDEVPRIVATE
airo: Add missing CAP_NET_ADMIN check in AIROOLDIOCTL/SIOCDEVPRIVATE
r8152: get default setting of WOL before initializing
ARM: dts: am43x-epos-evm: set data pin directions for spi0 and spi1
qlcnic: Fix CPU soft lockup while collecting firmware dump
powerpc/fsl/dts: add fsl,erratum-a011043
net/fsl: treat fsl,erratum-a011043
net: fsl/fman: rename IF_MODE_XGMII to IF_MODE_10G
seq_tab_next() should increase position index
l2t_seq_next should increase position index
net: Fix skb->csum update in inet_proto_csum_replace16().
btrfs: do not zero f_bavail if we have available space
perf report: Fix no libunwind compiled warning break s390 issue
mm/migrate.c: also overwrite error when it is bigger than zero
Linux 4.19.102
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ia9b63c7932b66f469ab0e88467e1e07741408f0b
commit 3bc0bb36fa upstream.
The test_cgcore_no_internal_process_constraint_on_threads selftest when
running with subsystem controlling noise triggers two warnings:
> [ 597.443115] WARNING: CPU: 1 PID: 28167 at kernel/cgroup/cgroup.c:3131 cgroup_apply_control_enable+0xe0/0x3f0
> [ 597.443413] WARNING: CPU: 1 PID: 28167 at kernel/cgroup/cgroup.c:3177 cgroup_apply_control_disable+0xa6/0x160
Both stem from a call to cgroup_type_write. The first warning was also
triggered by syzkaller.
When we're switching cgroup to threaded mode shortly after a subsystem
was disabled on it, we can see the respective subsystem css dying there.
The warning in cgroup_apply_control_enable is harmless in this case
since we're not adding new subsys anyway.
The warning in cgroup_apply_control_disable indicates an attempt to kill
css of recently disabled subsystem repeatedly.
The commit prevents these situations by making cgroup_type_write wait
for all dying csses to go away before re-applying subtree controls.
When at it, the locations of WARN_ON_ONCE calls are moved so that
warning is triggered only when we are about to misuse the dying css.
Reported-by: syzbot+5493b2a54d31d6aea629@syzkaller.appspotmail.com
Reported-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The estimated utilization for a task:
util_est = max(util_avg, est.enqueue, est.ewma)
is defined based on:
- util_avg: the PELT defined utilization
- est.enqueued: the util_avg at the end of the last activation
- est.ewma: a exponential moving average on the est.enqueued samples
According to this definition, when a task suddenly changes its bandwidth
requirements from small to big, the EWMA will need to collect multiple
samples before converging up to track the new big utilization.
This slow convergence towards bigger utilization values is not
aligned to the default scheduler behavior, which is to optimize for
performance. Moreover, the est.ewma component fails to compensate for
temporarely utilization drops which spans just few est.enqueued samples.
To let util_est do a better job in the scenario depicted above, change
its definition by making util_est directly follow upward motion and
only decay the est.ewma on downward.
Bug: 120440300
Signed-off-by: Patrick Bellasi <patrick.bellasi@matbug.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Douglas Raillard <douglas.raillard@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <qperret@google.com>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20191023205630.14469-1-patrick.bellasi@matbug.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit b8c9636140)
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Change-Id: I5c0bdd401f3fe599a2b7b9215c9a3a621f91002d
By default uclamp RT tasks will use the max frequency, which is not the
desired default behavior on mobile devices.
Re-use the SUGOV_RT_MAX_FREQ sched_feat to control the default behavior.
When SUGOV_RT_MAX_FREQ is NOT selected, the uclamp_min value of the RT
tasks will be 0.
Note, since now we use SUGOV_RT_MAX_FREQ to enforce the default max
frequency for RT when uclamp is compiled in; the condition in
schedutil_cpu_util() needs to be inverted so that max no longer
unconditionally applied when uclamp is compiled in && SUGOV_RT_MAX_FREQ
is true. This unconditional application means uclamp values are always
ignored which is not what we want when uclamp is compiled in.
Bug: 120440300
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Change-Id: I3d36f1ebed6ef35a6299af32bbf4462d0353e783
Signed-off-by: Quentin Perret <qperret@google.com>
task_fits_capacity() has just been made uclamp-aware, and
find_energy_efficient_cpu() needs to go through the same treatment.
Things are somewhat different here however - using the task max clamp isn't
sufficient. Consider the following setup:
The target runqueue, rq:
rq.cpu_capacity_orig = 512
rq.cfs.avg.util_avg = 200
rq.uclamp.max = 768 // the max p.uclamp.max of all enqueued p's is 768
The waking task, p (not yet enqueued on rq):
p.util_est = 600
p.uclamp.max = 100
Now, consider the following code which doesn't use the rq clamps:
util = uclamp_task_util(p);
// Does the task fit in the spare CPU capacity?
cpu = cpu_of(rq);
fits_capacity(util, cpu_capacity(cpu) - cpu_util(cpu))
This would lead to:
util = 100;
fits_capacity(100, 512 - 200)
fits_capacity() would return true. However, enqueuing p on that CPU *will*
cause it to become overutilized since rq clamp values are max-aggregated,
so we'd remain with
rq.uclamp.max = 768
which comes from the other tasks already enqueued on rq. Thus, we could
select a high enough frequency to reach beyond 0.8 * 512 utilization
(== overutilized) after enqueuing p on rq. What find_energy_efficient_cpu()
needs here is uclamp_rq_util_with() which lets us peek at the future
utilization landscape, including rq-wide uclamp values.
Make find_energy_efficient_cpu() use uclamp_rq_util_with() for its
fits_capacity() check. This is in line with what compute_energy() ends up
using for estimating utilization.
[QP: moved changes to select_cpu_candidates(), which is the equivalent
to the mainline path, and fix missing dependency on fits_capacity() by
using the open coded version]
Bug: 120440300
Tested-By: Dietmar Eggemann <dietmar.eggemann@arm.com>
Suggested-by: Quentin Perret <qperret@google.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20191211113851.24241-6-valentin.schneider@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 1d42509e47)
Signed-off-by: Quentin Perret <qperret@google.com>
Change-Id: Ibe1643cd5e6c97daceceae9733344e54bf4a4857
task_fits_capacity() drives CPU selection at wakeup time, and is also used
to detect misfit tasks. Right now it does so by comparing task_util_est()
with a CPU's capacity, but doesn't take into account uclamp restrictions.
There's a few interesting uses that can come out of doing this. For
instance, a low uclamp.max value could prevent certain tasks from being
flagged as misfit tasks, so they could merrily remain on low-capacity CPUs.
Similarly, a high uclamp.min value would steer tasks towards high capacity
CPUs at wakeup (and, should that fail, later steered via misfit balancing),
so such "boosted" tasks would favor CPUs of higher capacity.
Introduce uclamp_task_util() and make task_fits_capacity() use it.
[QP: fixed missing dependency on fits_capacity() by using the open coded
alternative]
Bug: 120440300
Tested-By: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Quentin Perret <qperret@google.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20191211113851.24241-5-valentin.schneider@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit a7008c07a5)
Signed-off-by: Quentin Perret <qperret@google.com>
Change-Id: Iabde2eda7252c3bcc273e61260a7a12a7de991b1
The main SchedTune API calls realted to task tuning attributes are now
wrapped by more generic and mainlinish UtilClamp calls.
The new APIs are:
- uclamp_task(p) <= boosted_task_util(p)
- uclamp_boosted(p) <= schedtune_task_boost(p) > 0
- uclamp_latency_sensitive(p) <= schedtune_prefer_idle(p)
Let's provide also an implementation of the same API based on the new
uclamp.uclamp_latency_sensitive flag.
Bug: 120440300
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
[Modified the patch to use uclamp.latency_sensitive instead mainline
attributes]
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Change-Id: Ib1a6902e1c07a82a370e36bf1776d895b7528cbc
Signed-off-by: Quentin Perret <qperret@google.com>
Add a 'latency_sensitive' flag to uclamp in order to express the need
for some tasks to find a CPU where they can wake-up quickly. This is not
expected to be used without cgroup support, so add solely a cgroup
interface for it.
As this flag represents a boolean attribute and not an amount of
resources to be shared, it is not clear what the delegation logic should
be. As such, it is kept simple: every new cgroup starts with
latency_sensitive set to false, regardless of the parent.
In essence, this is similar to SchedTune's prefer-idle flag which was
used in android-4.19 and prior.
Bug: 120440300
Change-Id: I722d8ecabb428bb7b95a5b54bc70a87f182dde2a
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
(cherry picked from commit ad7dd648fc7dbe11f23673a3463af2468a274998)
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Quentin Perret <qperret@google.com>
The SchedTune CPU boosting API is currently used from sugov_get_util()
to get the boosted utilization and to pass it into schedutil_cpu_util().
When UtilClamp is in use instead we call schedutil_cpu_util() by
passing in just the CFS utilization and the clamping is done internally
on the aggregated CFS+RT utilization for FREQUENCY_UTIL calls.
This asymmetry is not required moreover, schedutil code is polluted by
non-mainline SchedTune code.
Wrap SchedTune API call related to cpu utilization boosting with a more
generic and mainlinish UtilClamp call:
- uclamp_rq_util_with(cpu, util, p) <= boosted_cpu_util(cpu)
This new API is already used in schedutil_cpu_util() to clamp the
aggregated RT+CFS utilization on FREQUENCY_UTIL calls.
Move the cpu boosting into uclamp_rq_util_with() so that we remove any
SchedTune specific bit from kernel/sched/cpufreq_schedutil.c.
Get rid of the no more required boosted_cpu_util(cpu) method and replace
it with a stune_util(cpu, util) which signature is better aligned with
its uclamp_rq_util_with(cpu, util, p) counterpart.
Bug: 120440300
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Change-Id: I45b0f0f54123fe0a2515fa9f1683842e6b99234f
[Removed superfluous __maybe_unused for capacity_orig_of]
Signed-off-by: Quentin Perret <qperret@google.com>
When a new cgroup is created, the effective uclamp value wasn't updated
with a call to cpu_util_update_eff() that looks at the hierarchy and
update to the most restrictive values.
Fix it by ensuring to call cpu_util_update_eff() when a new cgroup
becomes online.
Without this change, the newly created cgroup uses the default
root_task_group uclamp values, which is 1024 for both uclamp_{min, max},
which will cause the rq to to be clamped to max, hence cause the
system to run at max frequency.
The problem was observed on Ubuntu server and was reproduced on Debian
and Buildroot rootfs.
By default, Ubuntu and Debian create a cpu controller cgroup hierarchy
and add all tasks to it - which creates enough noise to keep the rq
uclamp value at max most of the time. Imitating this behavior makes the
problem visible in Buildroot too which otherwise looks fine since it's a
minimal userspace.
Bug: 120440300
Fixes: 0b60ba2dd3 ("sched/uclamp: Propagate parent clamps")
Reported-by: Doug Smythies <dsmythies@telus.net>
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Doug Smythies <dsmythies@telus.net>
Link: https://lore.kernel.org/lkml/000701d5b965$361b6c60$a2524520$@net/
(cherry picked from commit 7226017ad3https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git sched/core)
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Change-Id: I9636c60e04d58bbfc5041df1305b34a12b5a3f46
Signed-off-by: Quentin Perret <qperret@google.com>
The current helper returns (CPU) rq utilization with uclamp restrictions
taken into account. A uclamp task utilization helper would be quite
helpful, but this requires some renaming.
Prepare the code for the introduction of a uclamp_task_util() by renaming
the existing uclamp_util_with() to uclamp_rq_util_with().
Bug: 120440300
Tested-By: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Quentin Perret <qperret@google.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20191211113851.24241-4-valentin.schneider@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit d2b58a286ehttps://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git sched/core)
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Change-Id: I3e7146b788e079e400167203df5e5dadee2fd232
Signed-off-by: Quentin Perret <qperret@google.com>
Vincent pointed out recently that the canonical type for utilization
values is 'unsigned long'. Internally uclamp uses 'unsigned int' values for
cache optimization, but this doesn't have to be exported to its users.
Make the uclamp helpers that deal with utilization use and return unsigned
long values.
Bug: 120440300
Tested-By: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Quentin Perret <qperret@google.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20191211113851.24241-3-valentin.schneider@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 686516b55ehttps://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git sched/core)
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Change-Id: Id3837f12237e5b77eb3a236bd32457dcd7de743e
Signed-off-by: Quentin Perret <qperret@google.com>
The sole user of uclamp_util(), schedutil_cpu_util(), was made to use
uclamp_util_with() instead in commit:
af24bde8df ("sched/uclamp: Add uclamp support to energy_compute()")
From then on, uclamp_util() has remained unused. Being a simple wrapper
around uclamp_util_with(), we can get rid of it and win back a few lines.
Bug: 120440300
Tested-By: Dietmar Eggemann <dietmar.eggemann@arm.com>
Suggested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20191211113851.24241-2-valentin.schneider@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 59fe675248https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git sched/core)
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Change-Id: I11dbff80c6c4be9666438800b2527aca8cd24025
Signed-off-by: Quentin Perret <qperret@google.com>
Capacity Awareness refers to the fact that on heterogeneous systems
(like Arm big.LITTLE), the capacity of the CPUs is not uniform, hence
when placing tasks we need to be aware of this difference of CPU
capacities.
In such scenarios we want to ensure that the selected CPU has enough
capacity to meet the requirement of the running task. Enough capacity
means here that capacity_orig_of(cpu) >= task.requirement.
The definition of task.requirement is dependent on the scheduling class.
For CFS, utilization is used to select a CPU that has >= capacity value
than the cfs_task.util.
capacity_orig_of(cpu) >= cfs_task.util
DL isn't capacity aware at the moment but can make use of the bandwidth
reservation to implement that in a similar manner CFS uses utilization.
The following patchset implements that:
https://lore.kernel.org/lkml/20190506044836.2914-1-luca.abeni@santannapisa.it/
capacity_orig_of(cpu)/SCHED_CAPACITY >= dl_deadline/dl_runtime
For RT we don't have a per task utilization signal and we lack any
information in general about what performance requirement the RT task
needs. But with the introduction of uclamp, RT tasks can now control
that by setting uclamp_min to guarantee a minimum performance point.
ATM the uclamp value are only used for frequency selection; but on
heterogeneous systems this is not enough and we need to ensure that the
capacity of the CPU is >= uclamp_min. Which is what implemented here.
capacity_orig_of(cpu) >= rt_task.uclamp_min
Note that by default uclamp.min is 1024, which means that RT tasks will
always be biased towards the big CPUs, which make for a better more
predictable behavior for the default case.
Must stress that the bias acts as a hint rather than a definite
placement strategy. For example, if all big cores are busy executing
other RT tasks we can't guarantee that a new RT task will be placed
there.
On non-heterogeneous systems the original behavior of RT should be
retained. Similarly if uclamp is not selected in the config.
[ mingo: Minor edits to comments. ]
Bug: 120440300
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20191009104611.15363-1-qais.yousef@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 804d402fb6https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git)
[Qais: resolved minor conflict in kernel/sched/cpupri.c]
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Change-Id: Ifc9da1c47de1aec9b4d87be2614e4c8968366900
Signed-off-by: Quentin Perret <qperret@google.com>
Some uclamp helpers had their return type changed from 'unsigned int' to
'enum uclamp_id' by commit
0413d7f33e ("sched/uclamp: Always use 'enum uclamp_id' for clamp_id values")
but it happens that some do return a value in the [0, SCHED_CAPACITY_SCALE]
range, which should really be unsigned int. The affected helpers are
uclamp_none(), uclamp_rq_max_value() and uclamp_eff_value(). Fix those up.
Note that this doesn't lead to any obj diff using a relatively recent
aarch64 compiler (8.3-2019.03). The current code of e.g. uclamp_eff_value()
properly returns an 11 bit value (bits_per(1024)) and doesn't seem to do
anything funny. I'm still marking this as fixing the above commit to be on
the safe side.
Bug: 120440300
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Reviewed-by: Qais Yousef <qais.yousef@arm.com>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar.Eggemann@arm.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: patrick.bellasi@matbug.net
Cc: qperret@google.com
Cc: surenb@google.com
Cc: tj@kernel.org
Fixes: 0413d7f33e ("sched/uclamp: Always use 'enum uclamp_id' for clamp_id values")
Link: https://lkml.kernel.org/r/20191115103908.27610-1-valentin.schneider@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 7763baace1)
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Change-Id: I924a99c125372a8fca81cb4bc0c82e6a7183fc8a
Signed-off-by: Quentin Perret <qperret@google.com>