linux-uconsole

Author	SHA1	Message	Date
Quentin Perret	549020c814	ANDROID: sched: Disable find_best_target() by default Now that the mainline EAS wake-up path has been extended to cope with prefer-idle tasks, the need for a dedicated Android-specific wake-up routine (find_best_target()) becomes less clear. Indeed, the main reasons for introducing find_best_target() in the first place were: 1. the energy_diff function was very slow, so we couldn't afford to use it on all CPUs for each wake-up for latency reasons; 2. schedtune provides additional information about tasks (the prefer-idle flag in particular) which needed to be taken into account in the placement algorithm. Now that the energy diff calculation is much faster (with the simplified energy model) and that the EAS path is aware of prefer-idle tasks, there is no clear reason to use find_best_target() any more. So, let's disable it for now to minimize the amount of out-of-tree code used in the scheduler. If using the mainline path doesn't cause regressions, it is a good sign find_best_target() can be removed safely, eventually. Otherwise, reverting back to the old behaviour is trivial since this patch only changes the sched_feat default, but doesn't remove the fbt() code path. Bug: 120440300 Change-Id: Idb5d68a3c4af7d2212e0922ab6d9a089170b5e1c Signed-off-by: Quentin Perret <quentin.perret@arm.com>	2019-03-27 15:58:02 +00:00
Quentin Perret	d0eb1f3514	ANDROID: sched/fair: Make the EAS wake-up prefer-idle aware Make the mainline EAS wake-up path aware of prefer-idle tasks in preparation for disabling find_best_target(). What is done in the mainline algorithm isn't strictly equivalent to the find_best_target() algorithm but comes real close, and isn't very invasive. The main differences with the original find_best_target() behaviour are the following: 1. the policy for prefer idle when there isn't a single idle CPU in the system is simpler now. We just pick the CPU with the highest spare capacity; 2. the cstate awareness for prefer idle is implemented by minimizing the exit latency rather than the idle state index. This is how it is done in the slow path (find_idlest_group_cpu()), it doesn't require us to keep hooks into CPUIdle, and should actually be better because what we want is a CPU that can wake up quickly; 3. non-prefer-idle tasks just use the standard mainline energy-aware wake-up path, which decides the placement using the Energy Model. Bug: 120440300 Change-Id: I57769c90c57115f6a28d27c5a88e08aa93a30a56 Signed-off-by: Quentin Perret <quentin.perret@arm.com>	2019-03-27 15:58:02 +00:00
Vincent Guittot	d36e8b820e	UPSTREAM: sched/pelt: Skip updating util_est when utilization is higher than CPU's capacity util_est is mainly meant to be a lower-bound for a task's utilization. That's why task_util_est() returns the actual util_avg when it's higher than the estimated utilization. With the new invariance signal and without any special check on samples collection, if a task is limited because of thermal capping for example, we could end up overestimating its utilization and thus perhaps generating an unwanted frequency spike when the capping is relaxed... and (even worse) it will take some more activations for the estimated utilization to converge back to the actual utilization. Since we cannot easily know if there is idle time in a CPU when a task completes an activation with a utilization higher than the CPU capacity, we skip the sampling when utilization is higher than CPU's capacity. Bug: 120440300 Change-Id: If1a6001451f80acb953e2a5f955fd302b1b73bc0 Suggested-by: Patrick Bellasi <patrick.bellasi@arm.com> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Morten.Rasmussen@arm.com Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: bsegall@google.com Cc: dietmar.eggemann@arm.com Cc: pjt@google.com Cc: pkondeti@codeaurora.org Cc: quentin.perret@arm.com Cc: rjw@rjwysocki.net Cc: srinivas.pandruvada@linux.intel.com Cc: thara.gopinath@linaro.org Link: https://lkml.kernel.org/r/1548257214-13745-4-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org> (cherry picked from commit `10a35e6812`) Signed-off-by: Quentin Perret <quentin.perret@arm.com>	2019-03-26 14:22:50 +00:00
Vincent Guittot	eb0db1782a	UPSTREAM: sched/fair: Update scale invariance of PELT The current implementation of load tracking invariance scales the contribution with current frequency and uarch performance (only for utilization) of the CPU. One main result of this formula is that the figures are capped by current capacity of CPU. Another one is that the load_avg is not invariant because it is not scaled with uarch. The util_avg of a periodic task that runs r time slots every p time slots varies in the range: U * (1-y^r)/(1-y^p) * y^i < Utilization < U * (1-y^r)/(1-y^p) where U is the max util_avg value (SCHED_CAPACITY_SCALE). At a lower capacity, the range becomes: U * C * (1-y^r')/(1-y^p) * y^i' < Utilization < U * C * (1-y^r')/(1-y^p) with C reflecting the compute capacity ratio between current capacity and max capacity. So C tries to compensate for changes in (1-y^r') but it can't be accurate. Instead of scaling the contribution value of PELT algo, we should scale the running time. The PELT signal aims to track the amount of computation of tasks and/or rq so it seems more correct to scale the running time to reflect the effective amount of computation done since the last update. In order to be fully invariant, we need to apply the same amount of running time and idle time whatever the current capacity. Because running at lower capacity implies that the task will run longer, we have to ensure that the same amount of idle time will be applied when the system becomes idle and no idle time has been "stolen". But reaching the maximum utilization value (SCHED_CAPACITY_SCALE) means that the task is seen as an always-running task whatever the capacity of the CPU (even at max compute capacity). In this case, we can discard this "stolen" idle time, which becomes meaningless. In order to achieve this time scaling, a new clock_pelt is created per rq. The increase of this clock scales with current capacity when something is running on rq and synchronizes with clock_task when rq is idle.
With this mechanism, we ensure the same running and idle time whatever the current capacity. This also makes it possible to simplify the PELT algorithm by removing all references to uarch and frequency and applying the same contribution to utilization and loads. Furthermore, the scaling is done only once per update of the clock (update_rq_clock_task()) instead of during each update of sched_entities and cfs/rt/dl_rq of the rq like the current implementation. This is interesting when cgroups are involved, as shown in the results below:
On a hikey (octo Arm64 platform). Performance cpufreq governor and only shallowest c-state to remove variance generated by those power features so we only track the impact of the PELT algo. Each test runs 16 times:
./perf bench sched pipe (higher is better)
  kernel    tip/sched/core      + patch
            ops/seconds         ops/seconds         diff
  cgroup
  root      59652(+/- 0.18%)    59876(+/- 0.24%)    +0.38%
  level1    55608(+/- 0.27%)    55923(+/- 0.24%)    +0.57%
  level2    52115(+/- 0.29%)    52564(+/- 0.22%)    +0.86%
hackbench -l 1000 (lower is better)
  kernel    tip/sched/core      + patch
            duration(sec)       duration(sec)       diff
  cgroup
  root      4.453(+/- 2.37%)    4.383(+/- 2.88%)    -1.57%
  level1    4.859(+/- 8.50%)    4.830(+/- 7.07%)    -0.60%
  level2    5.063(+/- 9.83%)    4.928(+/- 9.66%)    -2.66%
Then, the responsiveness of PELT is improved when CPU is not running at max capacity with this new algorithm. I have put below some examples of duration to reach some typical load values according to the capacity of the CPU with current implementation and with this patch. These values have been computed based on the geometric series and the half period value:
  Util (%)       max capacity    half capacity(mainline)    half capacity(w/ patch)
  972 (95%)      138ms           not reachable              276ms
  486 (47.5%)    30ms            138ms                      60ms
  256 (25%)      13ms            32ms                       26ms
On my hikey (octo Arm64 platform) with schedutil governor, the time to reach max OPP when starting from a null utilization, decreases from 223ms with current scale invariance down to 121ms with the new algorithm.
Bug: 120440300 Change-Id: I0bd4ed2317f2a9a965634e53ce1476417af697a6 Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Morten.Rasmussen@arm.com Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: bsegall@google.com Cc: dietmar.eggemann@arm.com Cc: patrick.bellasi@arm.com Cc: pjt@google.com Cc: pkondeti@codeaurora.org Cc: quentin.perret@arm.com Cc: rjw@rjwysocki.net Cc: srinivas.pandruvada@linux.intel.com Cc: thara.gopinath@linaro.org Link: https://lkml.kernel.org/r/1548257214-13745-3-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org> (cherry picked from commit `2312729688`) Signed-off-by: Quentin Perret <quentin.perret@arm.com>	2019-03-26 14:22:50 +00:00
Vincent Guittot	0dd28f4253	UPSTREAM: sched/fair: Move the rq_of() helper function Move rq_of() helper function so it can be used in pelt.c [ mingo: Improve readability while at it. ] Bug: 120440300 Change-Id: I2133979476631d68baaffcaa308f4cdab94f22b1 Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Morten.Rasmussen@arm.com Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: bsegall@google.com Cc: dietmar.eggemann@arm.com Cc: patrick.bellasi@arm.com Cc: pjt@google.com Cc: pkondeti@codeaurora.org Cc: quentin.perret@arm.com Cc: rjw@rjwysocki.net Cc: srinivas.pandruvada@linux.intel.com Cc: thara.gopinath@linaro.org Link: https://lkml.kernel.org/r/1548257214-13745-2-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org> (cherry picked from commit `62478d9911`) Signed-off-by: Quentin Perret <quentin.perret@arm.com>	2019-03-26 14:22:50 +00:00
Dietmar Eggemann	5bfd3ce4e0	UPSTREAM: sched/fair: Remove setting task's se->runnable_weight during PELT update A CFS (SCHED_OTHER, SCHED_BATCH or SCHED_IDLE policy) task's se->runnable_weight must always be in sync with its se->load.weight. se->runnable_weight is set to se->load.weight when the task is forked (init_entity_runnable_average()) or reniced (reweight_entity()). There are two cases in set_load_weight() which since they currently only set se->load.weight could lead to a situation in which se->load.weight is different to se->runnable_weight for a CFS task: (1) A task switches to SCHED_IDLE. (2) A SCHED_FIFO, SCHED_RR or SCHED_DEADLINE task which has been reniced (during which only its static priority gets set) switches to SCHED_OTHER or SCHED_BATCH. Set se->runnable_weight to se->load.weight in these two cases to prevent this. This eliminates the need to explicitly set it to se->load.weight during PELT updates in the CFS scheduler fastpath. Bug: 120440300 Change-Id: I52184a9e1fd53cb42ef3ae546b1fae78b744c9ad Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Joel Fernandes <joelaf@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Morten Rasmussen <morten.rasmussen@arm.com> Cc: Patrick Bellasi <patrick.bellasi@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Quentin Perret <quentin.perret@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vincent Guittot <vincent.guittot@linaro.org> Link: http://lkml.kernel.org/r/20180803140538.1178-1-dietmar.eggemann@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org> (cherry picked from commit `4a465e3ebb`) Signed-off-by: Quentin Perret <quentin.perret@arm.com>	2019-03-26 14:22:49 +00:00
Greg Kroah-Hartman	bb418a146a	This is the 4.19.31 stable release -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlyWhJcACgkQONu9yGCS aT6XzxAAzP2QGzC4SVPgcFH1woF/d8Cz0zQ81mLXzjXtEPm39fZCM2hbBnxkXLu1 peFyrKNk6/c9541D9gsQCQT6Fu+H6u1bJKcIezlKJ2xyB/MsU1hXkjZrTJYW3RRs gimy1EGdood2el1ubEBZiaspazoeRzBqtg1Nsmr4V0l+RT8HwtKKw+0+Nxixfp59 NoVkqTpPI5mL0FiH2R9ogcfg3SvgMZOsOhOBjdPvSjiJJsbvIWcW48MCs95XSUpF R+l/fWn+oiFCcIqBaFheujuqZMvVrUHZHaWAPMuoR/c3Cdf0lTBokdv6UM9c0nv3 61jX5r5ImRI/dfQANN5mbB1YKcs5xOI+I7QZHQ2q4clsWrWyLapXW4clrAZJ6z5t UVeVbuLV2y5PL9GJyBcXpyY0BOf4e2gZURaPY3C5McNwgybNoiR0ZePqKb8ZhZyh jYOYRoBjJJpZoVTSt6MNX95NTvGaSAtqKMu1s3IeMfpwCfQKBPMOuBHr/dUqSC6I U0xxjk/71C15dSPVcTVJT/lmcKc6TXgoagnfbn8GBtDOAjBNsYyUJLQI+db1ERCe 9MEB9k1Z87ROQ5jQCQmWsewOVAtFZBEvSszFmpKv3zTe8M2oFpXG56zckdiumwHU nSfeZTTeWzsFJd30MioEnGYm3ZwKwZx7wi0x4B4WWvBfSpp20Us= =xtLx -----END PGP SIGNATURE----- Merge 4.19.31 into android-4.19 Changes in 4.19.31 media: videobuf2-v4l2: drop WARN_ON in vb2_warn_zero_bytesused() 9p: use inode->i_lock to protect i_size_write() under 32-bit 9p/net: fix memory leak in p9_client_create ASoC: fsl_esai: fix register setting issue in RIGHT_J mode ASoC: codecs: pcm186x: fix wrong usage of DECLARE_TLV_DB_SCALE() ASoC: codecs: pcm186x: Fix energysense SLEEP bit iio: adc: exynos-adc: Fix NULL pointer exception on unbind mei: hbm: clean the feature flags on link reset mei: bus: move hw module get/put to probe/release stm class: Fix an endless loop in channel allocation crypto: caam - fix hash context DMA unmap size crypto: ccree - fix missing break in switch statement crypto: caam - fixed handling of sg list crypto: caam - fix DMA mapping of stack memory crypto: ccree - fix free of unallocated mlli buffer crypto: ccree - unmap buffer before copying IV crypto: ccree - don't copy zero size ciphertext crypto: cfb - add missing 'chunksize' property crypto: cfb - remove bogus memcpy() with src == dest crypto: ahash - fix another early termination in hash walk 
crypto: rockchip - fix scatterlist nents error crypto: rockchip - update new iv to device in multiple operations drm/imx: ignore plane updates on disabled crtcs gpu: ipu-v3: Fix i.MX51 CSI control registers offset drm/imx: imx-ldb: add missing of_node_puts gpu: ipu-v3: Fix CSI offsets for imx53 ASoC: rt5682: Correct the setting while select ASRC clk for AD/DA filter clocksource: timer-ti-dm: Fix pwm dmtimer usage of fck reparenting KVM: arm/arm64: vgic: Make vgic_dist->lpi_list_lock a raw_spinlock arm64: dts: rockchip: fix graph_port warning on rk3399 bob kevin and excavator s390/dasd: fix using offset into zero size array error Input: pwm-vibra - prevent unbalanced regulator Input: pwm-vibra - stop regulator after disabling pwm, not before ARM: dts: Configure clock parent for pwm vibra ARM: OMAP2+: Variable "reg" in function omap4_dsi_mux_pads() could be uninitialized ASoC: dapm: fix out-of-bounds accesses to DAPM lookup tables ASoC: rsnd: fixup rsnd_ssi_master_clk_start() user count check KVM: arm/arm64: Reset the VCPU without preemption and vcpu state loaded arm/arm64: KVM: Allow a VCPU to fully reset itself arm/arm64: KVM: Don't panic on failure to properly reset system registers KVM: arm/arm64: vgic: Always initialize the group of private IRQs KVM: arm64: Forbid kprobing of the VHE world-switch code ASoC: samsung: Prevent clk_get_rate() calls in atomic context ARM: OMAP2+: fix lack of timer interrupts on CPU1 after hotplug Input: cap11xx - switch to using set_brightness_blocking() Input: ps2-gpio - flush TX work when closing port Input: matrix_keypad - use flush_delayed_work() mac80211: call drv_ibss_join() on restart mac80211: Fix Tx aggregation session tear down with ITXQs netfilter: compat: initialize all fields in xt_init blk-mq: insert rq with DONTPREP to hctx dispatch list when requeue ipvs: fix dependency on nf_defrag_ipv6 floppy: check_events callback should not return a negative number xprtrdma: Make sure Send CQ is allocated on an existing compvec 
NFS: Don't use page_file_mapping after removing the page mm/gup: fix gup_pmd_range() for dax Revert "mm: use early_pfn_to_nid in page_ext_init" scsi: qla2xxx: Fix panic from use after free in qla2x00_async_tm_cmd net: dsa: bcm_sf2: potential array overflow in bcm_sf2_sw_suspend() x86/CPU: Add Icelake model number mm: page_alloc: fix ref bias in page_frag_alloc() for 1-byte allocs net: hns: Fix object reference leaks in hns_dsaf_roce_reset() i2c: cadence: Fix the hold bit setting i2c: bcm2835: Clear current buffer pointers and counts after a transfer auxdisplay: ht16k33: fix potential user-after-free on module unload Input: st-keyscan - fix potential zalloc NULL dereference clk: sunxi-ng: v3s: Fix TCON reset de-assert bit kallsyms: Handle too long symbols in kallsyms.c clk: sunxi: A31: Fix wrong AHB gate number esp: Skip TX bytes accounting when sending from a request socket ARM: 8824/1: fix a migrating irq bug when hotplug cpu bpf: only adjust gso_size on bytestream protocols bpf: fix lockdep false positive in stackmap af_key: unconditionally clone on broadcast ARM: 8835/1: dma-mapping: Clear DMA ops on teardown assoc_array: Fix shortcut creation keys: Fix dependency loop between construction record and auth key scsi: libiscsi: Fix race between iscsi_xmit_task and iscsi_complete_task net: systemport: Fix reception of BPDUs net: dsa: bcm_sf2: Do not assume DSA master supports WoL pinctrl: meson: meson8b: fix the sdxc_a data 1..3 pins qmi_wwan: apply SET_DTR quirk to Sierra WP7607 net: mv643xx_eth: disable clk on error path in mv643xx_eth_shared_probe() xfrm: Fix inbound traffic via XFRM interfaces across network namespaces mailbox: bcm-flexrm-mailbox: Fix FlexRM ring flush timeout issue ASoC: topology: free created components in tplg load error qed: Fix iWARP buffer size provided for syn packet processing. qed: Fix iWARP syn packet mac address validation. 
ARM: dts: armada-xp: fix Armada XP boards NAND description arm64: Relax GIC version check during early boot ARM: tegra: Restore DT ABI on Tegra124 Chromebooks net: marvell: mvneta: fix DMA debug warning mm: handle lru_add_drain_all for UP properly tmpfs: fix link accounting when a tmpfile is linked in ixgbe: fix older devices that do not support IXGBE_MRQC_L3L4TXSWEN ARCv2: lib: memcpy: fix doing prefetchw outside of buffer ARC: uacces: remove lp_start, lp_end from clobber list ARCv2: support manual regfile save on interrupts ARCv2: don't assume core 0x54 has dual issue phonet: fix building with clang mac80211_hwsim: propagate genlmsg_reply return code bpf, lpm: fix lookup bug in map_delete_elem net: thunderx: make CFG_DONE message to run through generic send-ack sequence net: thunderx: add nicvf_send_msg_to_pf result check for set_rx_mode_task nfp: bpf: fix code-gen bug on BPF_ALU | BPF_XOR | BPF_K nfp: bpf: fix ALU32 high bits clearance bug bnxt_en: Fix typo in firmware message timeout logic. bnxt_en: Wait longer for the firmware message response to complete. net: set static variable an initial value in atl2_probe() selftests: fib_tests: sleep after changing carrier. again. 
tmpfs: fix uninitialized return value in shmem_link stm class: Prevent division by zero nfit: acpi_nfit_ctl(): Check out_obj->type in the right place acpi/nfit: Fix bus command validation nfit/ars: Attempt a short-ARS whenever the ARS state is idle at boot nfit/ars: Attempt short-ARS even in the no_init_ars case libnvdimm/label: Clear 'updating' flag after label-set update libnvdimm, pfn: Fix over-trim in trim_pfn_device() libnvdimm/pmem: Honor force_raw for legacy pmem regions libnvdimm: Fix altmap reservation size calculation fix cgroup_do_mount() handling of failure exits crypto: aead - set CRYPTO_TFM_NEED_KEY if ->setkey() fails crypto: aegis - fix handling chunked inputs crypto: arm/crct10dif - revert to C code for short inputs crypto: arm64/aes-neonbs - fix returning final keystream block crypto: arm64/crct10dif - revert to C code for short inputs crypto: hash - set CRYPTO_TFM_NEED_KEY if ->setkey() fails crypto: morus - fix handling chunked inputs crypto: pcbc - remove bogus memcpy()s with src == dest crypto: skcipher - set CRYPTO_TFM_NEED_KEY if ->setkey() fails crypto: testmgr - skip crc32c context test for ahash algorithms crypto: x86/aegis - fix handling chunked inputs and MAY_SLEEP crypto: x86/aesni-gcm - fix crash on empty plaintext crypto: x86/morus - fix handling chunked inputs and MAY_SLEEP crypto: arm64/aes-ccm - fix logical bug in AAD MAC handling crypto: arm64/aes-ccm - fix bugs in non-NEON fallback routine CIFS: Do not reset lease state to NONE on lease break CIFS: Do not skip SMB2 message IDs on send failures CIFS: Fix read after write for files with read caching tracing: Use strncpy instead of memcpy for string keys in hist triggers tracing: Do not free iter->trace in fail path of tracing_open_pipe() tracing/perf: Use strndup_user() instead of buggy open-coded version xen: fix dom0 boot on huge systems ACPI / device_sysfs: Avoid OF modalias creation for removed device mmc: sdhci-esdhc-imx: fix HS400 timing issue mmc:fix a bug when max_discard 
is 0 netfilter: ipt_CLUSTERIP: fix warning unused variable cn spi: ti-qspi: Fix mmap read when more than one CS in use spi: pxa2xx: Setup maximum supported DMA transfer length regulator: s2mps11: Fix steps for buck7, buck8 and LDO35 regulator: max77620: Initialize values for DT properties regulator: s2mpa01: Fix step values for some LDOs clocksource/drivers/exynos_mct: Move one-shot check from tick clear to ISR clocksource/drivers/exynos_mct: Clear timer interrupt when shutdown clocksource/drivers/arch_timer: Workaround for Allwinner A64 timer instability s390/setup: fix early warning messages s390/virtio: handle find on invalid queue gracefully scsi: virtio_scsi: don't send sc payload with tmfs scsi: aacraid: Fix performance issue on logical drives scsi: sd: Optimal I/O size should be a multiple of physical block size scsi: target/iscsi: Avoid iscsit_release_commands_from_conn() deadlock scsi: qla2xxx: Fix LUN discovery if loop id is not assigned yet by firmware fs/devpts: always delete dcache dentry-s in dput() splice: don't merge into linked buffers ovl: During copy up, first copy up data and then xattrs ovl: Do not lose security.capability xattr over metadata file copy-up m68k: Add -ffreestanding to CFLAGS Btrfs: setup a nofs context for memory allocation at btrfs_create_tree() Btrfs: setup a nofs context for memory allocation at __btrfs_set_acl btrfs: ensure that a DUP or RAID1 block group has exactly two stripes Btrfs: fix corruption reading shared and compressed extents after hole punching soc: qcom: rpmh: Avoid accessing freed memory from batch API libertas_tf: don't set URB_ZERO_PACKET on IN USB transfer irqchip/gic-v3-its: Avoid parsing _indirect_ twice for Device table irqchip/brcmstb-l2: Use _irqsave locking variants in non-interrupt code x86/kprobes: Prohibit probing on optprobe template code cpufreq: kryo: Release OPP tables on module removal cpufreq: tegra124: add missing of_node_put() cpufreq: pxa2xx: remove incorrect __init annotation ext4: fix 
check of inode in swap_inode_boot_loader ext4: cleanup pagecache before swap i_data ext4: update quota information while swapping boot loader inode ext4: add mask of ext4 flags to swap ext4: fix crash during online resizing PCI/ASPM: Use LTR if already enabled by platform PCI/DPC: Fix print AER status in DPC event handling PCI: dwc: skip MSI init if MSIs have been explicitly disabled IB/hfi1: Close race condition on user context disable and close cxl: Wrap iterations over afu slices inside 'afu_list_lock' ext2: Fix underflow in ext2_max_size() clk: uniphier: Fix update register for CPU-gear clk: clk-twl6040: Fix imprecise external abort for pdmclk clk: samsung: exynos5: Fix possible NULL pointer exception on platform_device_alloc() failure clk: samsung: exynos5: Fix kfree() of const memory on setting driver_override clk: ingenic: Fix round_rate misbehaving with non-integer dividers clk: ingenic: Fix doc of ingenic_cgu_div_info usb: chipidea: tegra: Fix missed ci_hdrc_remove_device() usb: typec: tps6598x: handle block writes separately with plain-I2C adapters dmaengine: usb-dmac: Make DMAC system sleep callbacks explicit mm: hwpoison: fix thp split handing in soft_offline_in_use_page() mm/vmalloc: fix size check for remap_vmalloc_range_partial() mm/memory.c: do_fault: avoid usage of stale vm_area_struct kernel/sysctl.c: add missing range check in do_proc_dointvec_minmax_conv device property: Fix the length used in PROPERTY_ENTRY_STRING() intel_th: Don't reference unassigned outputs parport_pc: fix find_superio io compare code, should use equal test. 
i2c: tegra: fix maximum transfer size media: i2c: ov5640: Fix post-reset delay gpio: pca953x: Fix dereference of irq data in shutdown can: flexcan: FLEXCAN_IFLAG_MB: add () around macro argument drm/i915: Relax mmap VMA check bpf: only test gso type on gso packets serial: uartps: Fix stuck ISR if RX disabled with non-empty FIFO serial: 8250_of: assume reg-shift of 2 for mrvl,mmp-uart serial: 8250_pci: Fix number of ports for ACCES serial cards serial: 8250_pci: Have ACCES cards that use the four port Pericom PI7C9X7954 chip use the pci_pericom_setup() jbd2: clear dirty flag when revoking a buffer from an older transaction jbd2: fix compile warning when using JBUFFER_TRACE selinux: add the missing walk_size + len check in selinux_sctp_bind_connect security/selinux: fix SECURITY_LSM_NATIVE_LABELS on reused superblock powerpc/32: Clear on-stack exception marker upon exception return powerpc/wii: properly disable use of BATs when requested. powerpc/powernv: Make opal log only readable by root powerpc/83xx: Also save/restore SPRG4-7 during suspend powerpc/powernv: Don't reprogram SLW image on every KVM guest entry/exit powerpc: Fix 32-bit KVM-PR lockup and host crash with MacOS guest powerpc/ptrace: Simplify vr_get/set() to avoid GCC warning powerpc/hugetlb: Don't do runtime allocation of 16G pages in LPAR configuration powerpc/traps: fix recoverability of machine check handling on book3s/32 powerpc/traps: Fix the message printed when stack overflows ARM: s3c24xx: Fix boolean expressions in osiris_dvs_notify arm64: Fix HCR.TGE status for NMI contexts arm64: debug: Ensure debug handlers check triggering exception level arm64: KVM: Fix architecturally invalid reset value for FPEXC32_EL2 ipmi_si: fix use-after-free of resource->name dm: fix to_sector() for 32bit dm integrity: limit the rate of error messages mfd: sm501: Fix potential NULL pointer dereference cpcap-charger: generate events for userspace NFS: Fix I/O request leakages NFS: Fix an I/O request leakage in 
nfs_do_recoalesce NFS: Don't recoalesce on error in nfs_pageio_complete_mirror() nfsd: fix performance-limiting session calculation nfsd: fix memory corruption caused by readdir nfsd: fix wrong check in write_v4_end_grace() NFSv4.1: Reinitialise sequence results before retransmitting a request svcrpc: fix UDP on servers with lots of threads PM / wakeup: Rework wakeup source timer cancellation bcache: never writeback a discard operation stable-kernel-rules.rst: add link to networking patch queue vt: perform safe console erase in the right order x86/unwind/orc: Fix ORC unwind table alignment perf intel-pt: Fix CYC timestamp calculation after OVF perf tools: Fix split_kallsyms_for_kcore() for trampoline symbols perf auxtrace: Define auxtrace record alignment perf intel-pt: Fix overlap calculation for padding perf/x86/intel/uncore: Fix client IMC events return huge result perf intel-pt: Fix divide by zero when TSC is not available md: Fix failed allocation of md_register_thread tpm/tpm_crb: Avoid unaligned reads in crb_recv() tpm: Unify the send callback behaviour rcu: Do RCU GP kthread self-wakeup from softirq and interrupt media: imx: prpencvf: Stop upstream before disabling IDMA channel media: lgdt330x: fix lock status reporting media: uvcvideo: Avoid NULL pointer dereference at the end of streaming media: vimc: Add vimc-streamer for stream control media: imx: csi: Disable CSI immediately after last EOF media: imx: csi: Stop upstream before disabling IDMA channel drm/fb-helper: generic: Fix drm_fbdev_client_restore() drm/radeon/evergreen_cs: fix missing break in switch statement drm/amd/powerplay: correct power reading on fiji drm/amd/display: don't call dm_pp_ function from an fpu block KVM: Call kvm_arch_memslots_updated() before updating memslots KVM: x86/mmu: Detect MMIO generation wrap in any address space KVM: x86/mmu: Do not cache MMIO accesses while memslots are in flux KVM: nVMX: Sign extend displacements of VMX instr's mem operands KVM: nVMX: Apply addr 
size mask to effective address for VMX instructions KVM: nVMX: Ignore limit checks on VMX instructions using flat segments bcache: use (REQ_META|REQ_PRIO) to indicate bio for metadata s390/setup: fix boot crash for machine without EDAT-1 Linux 4.19.31 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>	2019-03-23 21:13:30 +01:00
Zhang, Jun	e97a32a5a3	rcu: Do RCU GP kthread self-wakeup from softirq and interrupt commit `1d1f898df6` upstream. The rcu_gp_kthread_wake() function is invoked when it might be necessary to wake the RCU grace-period kthread. Because self-wakeups are normally a useless waste of CPU cycles, if rcu_gp_kthread_wake() is invoked from this kthread, it naturally refuses to do the wakeup. Unfortunately, natural though it might be, this heuristic fails when rcu_gp_kthread_wake() is invoked from an interrupt or softirq handler that interrupted the grace-period kthread just after the final check of the wait-event condition but just before the schedule() call. In this case, a wakeup is required, even though the call to rcu_gp_kthread_wake() is within the RCU grace-period kthread's context. Failing to provide this wakeup can result in grace periods failing to start, which in turn results in out-of-memory conditions. This race window is quite narrow, but it actually did happen during real testing. It would of course need to be fixed even if it was strictly theoretical in nature. This patch does not Cc stable because it does not apply cleanly to earlier kernel versions. Fixes: `48a7639ce8` ("rcu: Make callers awaken grace-period kthread") Reported-by: "He, Bo" <bo.he@intel.com> Co-developed-by: "Zhang, Jun" <jun.zhang@intel.com> Co-developed-by: "He, Bo" <bo.he@intel.com> Co-developed-by: "xiao, jin" <jin.xiao@intel.com> Co-developed-by: Bai, Jie A <jie.a.bai@intel.com> Signed-off: "Zhang, Jun" <jun.zhang@intel.com> Signed-off: "He, Bo" <bo.he@intel.com> Signed-off: "xiao, jin" <jin.xiao@intel.com> Signed-off: Bai, Jie A <jie.a.bai@intel.com> Signed-off-by: "Zhang, Jun" <jun.zhang@intel.com> [ paulmck: Switch from !in_softirq() to "!in_interrupt() && !in_serving_softirq() to avoid redundant wakeups and to also handle the interrupt-handler scenario as well as the softirq-handler scenario that actually occurred in testing. ] Signed-off-by: Paul E. 
McKenney <paulmck@linux.ibm.com> Link: https://lkml.kernel.org/r/CD6925E8781EFD4D8E11882D20FC406D52A11F61@SHSMSX104.ccr.corp.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2019-03-23 20:10:12 +01:00
Zev Weiss	93c8a44a82	kernel/sysctl.c: add missing range check in do_proc_dointvec_minmax_conv commit `8cf7630b29` upstream. This bug has apparently existed since the introduction of this function in the pre-git era (4500e91754d3 in Thomas Gleixner's history.git, "[NET]: Add proc_dointvec_userhz_jiffies, use it for proper handling of neighbour sysctls."). As a minimal fix we can simply duplicate the corresponding check in do_proc_dointvec_conv(). Link: http://lkml.kernel.org/r/20190207123426.9202-3-zev@bewilderbeest.net Signed-off-by: Zev Weiss <zev@bewilderbeest.net> Cc: Brendan Higgins <brendanhiggins@google.com> Cc: Iurii Zaikin <yzaikin@google.com> Cc: Kees Cook <keescook@chromium.org> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: <stable@vger.kernel.org> [2.6.2+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2019-03-23 20:10:04 +01:00
Jann Horn	24d5097655	tracing/perf: Use strndup_user() instead of buggy open-coded version commit `83540fbc88` upstream. The first version of this method was missing the check for `ret == PATH_MAX`; then such a check was added, but it didn't call kfree() on error, so there was still a small memory leak in the error case. Fix it by using strndup_user() instead of open-coding it. Link: http://lkml.kernel.org/r/20190220165443.152385-1-jannh@google.com Cc: Ingo Molnar <mingo@kernel.org> Cc: stable@vger.kernel.org Fixes: `0eadcc7a7b` ("perf/core: Fix perf_uprobe_init()") Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2019-03-23 20:09:56 +01:00
zhangyi (F)	f27077e5f5	tracing: Do not free iter->trace in fail path of tracing_open_pipe() commit `e7f0c424d0` upstream. Commit `d716ff71dd` ("tracing: Remove taking of trace_types_lock in pipe files") uses the current tracer instead of the copy in tracing_open_pipe(), but it forgot to remove the kfree() call in the error path. There's an error path that can call kfree(iter->trace) after iter->trace was assigned to tr->current_trace, which must not be freed. Link: http://lkml.kernel.org/r/1550060946-45984-1-git-send-email-yi.zhang@huawei.com Cc: stable@vger.kernel.org Fixes: `d716ff71dd` ("tracing: Remove taking of trace_types_lock in pipe files") Signed-off-by: zhangyi (F) <yi.zhang@huawei.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2019-03-23 20:09:56 +01:00
Tom Zanussi	ebca08d7e8	tracing: Use strncpy instead of memcpy for string keys in hist triggers commit `9f0bbf3115` upstream. Because there may be random garbage beyond a string's null terminator, it's not correct to copy the complete character array for use as a hist trigger key. This results in multiple histogram entries for the 'same' string key. So, in the case of a string key, use strncpy instead of memcpy to avoid copying in the extra bytes. Before, using the gdbus entries in the following hist trigger as an example: # echo 'hist:key=comm' > /sys/kernel/debug/tracing/events/sched/sched_waking/trigger # cat /sys/kernel/debug/tracing/events/sched/sched_waking/hist ... { comm: ImgDecoder #4 } hitcount: 203 { comm: gmain } hitcount: 213 { comm: gmain } hitcount: 216 { comm: StreamTrans #73 } hitcount: 221 { comm: mozStorage #3 } hitcount: 230 { comm: gdbus } hitcount: 233 { comm: StyleThread#5 } hitcount: 253 { comm: gdbus } hitcount: 256 { comm: gdbus } hitcount: 260 { comm: StyleThread#4 } hitcount: 271 ... # cat /sys/kernel/debug/tracing/events/sched/sched_waking/hist | egrep gdbus | wc -l 51 After: # cat /sys/kernel/debug/tracing/events/sched/sched_waking/hist | egrep gdbus | wc -l 1 Link: http://lkml.kernel.org/r/50c35ae1267d64eee975b8125e151e600071d4dc.1549309756.git.tom.zanussi@linux.intel.com Cc: Namhyung Kim <namhyung@kernel.org> Cc: stable@vger.kernel.org Fixes: `79e577cbce` ("tracing: Support string type key properly") Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2019-03-23 20:09:56 +01:00
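The effect described in the hist-trigger commit above can be reproduced in plain C. The sketch below is illustrative only: `KEY_SIZE`, `make_key_memcpy()` and `make_key_strncpy()` are hypothetical names, and the fixed-width key buffer is an assumption standing in for the histogram key. It shows why two buffers holding the same string can compare unequal as raw keys when bytes after the terminator differ.

```c
#include <assert.h>
#include <string.h>

#define KEY_SIZE 16 /* hypothetical fixed key width, for illustration */

/* Copies the whole array, including any garbage after the terminator. */
void make_key_memcpy(char key[KEY_SIZE], const char src[KEY_SIZE])
{
    assert(key != NULL && src != NULL);
    memcpy(key, src, KEY_SIZE);
}

/*
 * Copies only up to the terminator. strncpy() pads the rest of the
 * destination with NUL bytes, so equal strings yield identical keys.
 * The key is zeroed first so the final byte is always NUL as well.
 */
void make_key_strncpy(char key[KEY_SIZE], const char src[KEY_SIZE])
{
    assert(key != NULL && src != NULL);
    memset(key, 0, KEY_SIZE);
    strncpy(key, src, KEY_SIZE - 1);
}
```

If two source buffers both contain "gdbus" but were previously filled with different bytes, the `memcpy` keys differ under `memcmp()` while the `strncpy` keys match, which corresponds to the 51 entries collapsing into 1 in the commit's example.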
Al Viro	7a8b048430	fix cgroup_do_mount() handling of failure exits commit `399504e21a` upstream. same story as with last May fixes in sysfs (`7b745a4e40` "unfuck sysfs_mount()"); new_sb is left uninitialized in case of early errors in kernfs_mount_ns() and papering over it by treating any error from kernfs_mount_ns() as equivalent to !new_ns ends up conflating the cases when objects had never been transferred to a superblock with ones when that has happened and resulting new superblock had been dropped. Easily fixed (same way as in sysfs case). Additionally, there's a superblock leak on kernfs_node_dentry() failure and a dentry leak inside kernfs_node_dentry() itself - the latter on probably impossible errors, but the former not impossible to trigger (as the matter of fact, injecting allocation failures at that point does trigger it). Cc: stable@kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2019-03-23 20:09:53 +01:00
Alban Crequy	02f8211b75	bpf, lpm: fix lookup bug in map_delete_elem [ Upstream commit `7c0cdf0b39` ] trie_delete_elem() was deleting an entry even though it was not matching if the prefixlen was correct. This patch adds a check on matchlen. Reproducer: $ sudo bpftool map create /sys/fs/bpf/mylpm type lpm_trie key 8 value 1 entries 128 name mylpm flags 1 $ sudo bpftool map update pinned /sys/fs/bpf/mylpm key hex 10 00 00 00 aa bb cc dd value hex 01 $ sudo bpftool map dump pinned /sys/fs/bpf/mylpm key: 10 00 00 00 aa bb cc dd value: 01 Found 1 element $ sudo bpftool map delete pinned /sys/fs/bpf/mylpm key hex 10 00 00 00 ff ff ff ff $ echo $? 0 $ sudo bpftool map dump pinned /sys/fs/bpf/mylpm Found 0 elements A similar reproducer is added in the selftests. Without the patch: $ sudo ./tools/testing/selftests/bpf/test_lpm_map test_lpm_map: test_lpm_map.c:485: test_lpm_delete: Assertion `bpf_map_delete_elem(map_fd, key) == -1 && errno == ENOENT' failed. Aborted With the patch: test_lpm_map runs without errors. Fixes: `e454cf5958` ("bpf: Implement map_delete_elem for BPF_MAP_TYPE_LPM_TRIE") Cc: Craig Gallek <kraig@google.com> Signed-off-by: Alban Crequy <alban@kinvolk.io> Acked-by: Craig Gallek <kraig@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Sasha Levin <sashal@kernel.org>	2019-03-23 20:09:51 +01:00
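The matchlen check from the LPM trie commit above can be illustrated without a trie. The following is a simplified sketch under stated assumptions: `match_bits()` and `may_delete()` are hypothetical helpers, and a single candidate node is compared with the delete key instead of walking a tree. It shows that equal prefix lengths alone are not enough; the prefix bits must match too.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Number of leading bits (at most max_bits) that a and b share. */
uint32_t match_bits(const uint8_t *a, const uint8_t *b, uint32_t max_bits)
{
    uint32_t n = 0;

    assert(a != NULL && b != NULL);
    while (n < max_bits) {
        uint8_t mask = (uint8_t)(0x80u >> (n % 8)); /* MSB first */

        if ((a[n / 8] ^ b[n / 8]) & mask)
            break;
        n++;
    }
    return n;
}

/*
 * A node may be deleted only if the prefix lengths are equal AND all
 * prefix bits match. Dropping the second condition reproduces the bug:
 * any key with the right prefix length would delete the entry.
 */
bool may_delete(uint32_t node_prefixlen, const uint8_t *node_data,
                uint32_t key_prefixlen, const uint8_t *key_data)
{
    uint32_t limit = node_prefixlen < key_prefixlen ? node_prefixlen
                                                    : key_prefixlen;
    uint32_t matchlen = match_bits(node_data, key_data, limit);

    return node_prefixlen == key_prefixlen && node_prefixlen == matchlen;
}
```

With the reproducer's values (stored data `aa bb cc dd`, delete key `ff ff ff ff`, both with prefix length 16), only the first bit matches, so the delete is refused.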
Alexei Starovoitov	c7c68a1b9a	bpf: fix lockdep false positive in stackmap [ Upstream commit `3defaf2f15` ] Lockdep warns about false positive: [ 11.211460] ------------[ cut here ]------------ [ 11.211936] DEBUG_LOCKS_WARN_ON(depth <= 0) [ 11.211985] WARNING: CPU: 0 PID: 141 at ../kernel/locking/lockdep.c:3592 lock_release+0x1ad/0x280 [ 11.213134] Modules linked in: [ 11.214954] RIP: 0010:lock_release+0x1ad/0x280 [ 11.223508] Call Trace: [ 11.223705] <IRQ> [ 11.223874] ? __local_bh_enable+0x7a/0x80 [ 11.224199] up_read+0x1c/0xa0 [ 11.224446] do_up_read+0x12/0x20 [ 11.224713] irq_work_run_list+0x43/0x70 [ 11.225030] irq_work_run+0x26/0x50 [ 11.225310] smp_irq_work_interrupt+0x57/0x1f0 [ 11.225662] irq_work_interrupt+0xf/0x20 since rw_semaphore is released in a different task vs task that locked the sema. It is expected behavior. Fix the warning with up_read_non_owner() and rwsem_release() annotation. Fixes: `bae77c5eb5` ("bpf: enable stackmap with build_id in nmi context") Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Sasha Levin <sashal@kernel.org>	2019-03-23 20:09:48 +01:00
Suren Baghdasaryan	617a4ba0ec	FROMLIST: psi: introduce psi monitor Psi monitor aims to provide a low-latency short-term pressure detection mechanism configurable by users. It allows users to monitor psi metrics growth and trigger events whenever a metric rises above a user-defined threshold within a user-defined time window. Time window and threshold are both expressed in usecs. Multiple psi resources with different thresholds and window sizes can be monitored concurrently. Psi monitors activate when the system enters the stall state for the monitored psi metric and deactivate upon exit from the stall state. While the system is in the stall state, psi signal growth is monitored at a rate of 10 times per tracking window. Min window size is 500ms, therefore the min monitoring interval is 50ms. Max window size is 10s with monitoring interval of 1s. When activated, psi monitor stays active for at least the duration of one tracking window to avoid repeated activations/deactivations when psi signal is bouncing. Notifications to the users are rate-limited to one per tracking window. Signed-off-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> (not upstream yet, latest version published at: https://lore.kernel.org/patchwork/patch/1052418/) Bug: 127712811 Bug: 129157727 Test: lmkd in PSI mode Change-Id: I860049d32420485346ad545c4650f990fe0c08e3 Signed-off-by: Suren Baghdasaryan <surenb@google.com>	2019-03-22 23:07:14 +00:00
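The window arithmetic in the psi monitor commit above (10 updates per tracking window, windows between 500ms and 10s, times in usecs) can be written out as a small helper. This is a sketch only: `monitor_interval_us()` and the constant names are illustrative assumptions, and returning 0 for an out-of-range window is a convention of this example, not something the message specifies.

```c
#include <assert.h>
#include <stdint.h>

/* Limits taken from the commit message, expressed in microseconds. */
#define EXAMPLE_WINDOW_MIN_US        500000ULL /* 500ms */
#define EXAMPLE_WINDOW_MAX_US      10000000ULL /* 10s */
#define EXAMPLE_UPDATES_PER_WINDOW 10

/*
 * Returns the monitoring interval for a tracking window, or 0 if the
 * window lies outside the supported range (a choice of this sketch).
 */
uint64_t monitor_interval_us(uint64_t window_us)
{
    if (window_us < EXAMPLE_WINDOW_MIN_US || window_us > EXAMPLE_WINDOW_MAX_US)
        return 0;

    uint64_t interval = window_us / EXAMPLE_UPDATES_PER_WINDOW;

    /* The smallest valid window still yields a non-zero interval. */
    assert(interval > 0);
    return interval;
}
```

The minimum window gives 50000 usecs (50ms) and the maximum gives 1000000 usecs (1s), matching the figures quoted in the message.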
Suren Baghdasaryan	3a905dc573	FROMLIST: refactor header includes to allow kthread.h inclusion in psi_types.h kthread.h can't be included in psi_types.h because it creates a circular inclusion with kthread.h eventually including psi_types.h and complaining on kthread structures not being defined because they are defined further in the kthread.h. Resolve this by removing psi_types.h inclusion from the headers included from kthread.h. Signed-off-by: Suren Baghdasaryan <surenb@google.com> (not upstream yet, latest version published at: https://lore.kernel.org/patchwork/patch/1052417/) Bug: 127712811 Bug: 129157727 Test: lmkd in PSI mode Change-Id: I88cd99f41534f0b9df18043cde8d1ee54aaa93de Signed-off-by: Suren Baghdasaryan <surenb@google.com>	2019-03-22 23:07:04 +00:00
Suren Baghdasaryan	23c32cf595	FROMLIST: psi: track changed states Introduce changed_states parameter into collect_percpu_times to track the states changed since the last update. Signed-off-by: Suren Baghdasaryan <surenb@google.com> (not upstream yet, latest version published at: https://lore.kernel.org/patchwork/patch/1052420/) Bug: 127712811 Bug: 129157727 Test: lmkd in PSI mode Change-Id: I944b024cd65e8520a57097bf5a3d7b2c01605bd0 Signed-off-by: Suren Baghdasaryan <surenb@google.com>	2019-03-22 23:06:55 +00:00
Suren Baghdasaryan	f270022469	FROMLIST: psi: split update_stats into parts Split update_stats into collect_percpu_times and update_averages for collect_percpu_times to be reused later inside psi monitor. Signed-off-by: Suren Baghdasaryan <surenb@google.com> (not upstream yet, latest version published at: https://lore.kernel.org/patchwork/patch/1052419/) Bug: 127712811 Bug: 129157727 Test: lmkd in PSI mode Change-Id: Ia9cfed8964fd57e41098fca285a2be0252fd5277 Signed-off-by: Suren Baghdasaryan <surenb@google.com>	2019-03-22 23:06:46 +00:00
Suren Baghdasaryan	c6e18d9458	FROMLIST: psi: rename psi fields in preparation for psi trigger addition Renaming psi_group structure member fields used for calculating psi totals and averages for clear distinction between them and trigger-related fields that will be added next. Signed-off-by: Suren Baghdasaryan <surenb@google.com> (not upstream yet, latest version published at: https://lore.kernel.org/patchwork/patch/1052416/) Bug: 127712811 Bug: 129157727 Test: lmkd in PSI mode Change-Id: I579a60e0915fa8fedaa508357d3d1aefab9428c4 Signed-off-by: Suren Baghdasaryan <surenb@google.com>	2019-03-22 23:06:38 +00:00
Suren Baghdasaryan	18d15b1861	FROMLIST: psi: make psi_enable static psi_enable is not used outside of psi.c, make it static. Suggested-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Suren Baghdasaryan <surenb@google.com> (not upstream yet, latest version published at: https://lore.kernel.org/patchwork/patch/1052415/) Bug: 127712811 Bug: 129157727 Test: lmkd in PSI mode Change-Id: I249c6d2271f93a7975f1622faf2d2b4196b701bc Signed-off-by: Suren Baghdasaryan <surenb@google.com>	2019-03-22 23:06:29 +00:00
Suren Baghdasaryan	ada57da3b1	FROMLIST: psi: introduce state_mask to represent stalled psi states The psi monitoring patches will need to determine the same states as record_times(). To avoid calculating them twice, maintain a state mask that can be consulted cheaply. Do this in a separate patch to keep the churn in the main feature patch at a minimum. This adds 4-byte state_mask member into psi_group_cpu struct which results in its first cacheline-aligned part becoming 52 bytes long. Add explicit values to enumeration element counters that affect psi_group_cpu struct size. Link: http://lkml.kernel.org/r/20190124211518.244221-4-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Dennis Zhou <dennis@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Li Zefan <lizefan@huawei.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> (not upstream yet, latest version published at: https://lore.kernel.org/patchwork/patch/1052414/) Bug: 127712811 Bug: 129157727 Test: lmkd in PSI mode Change-Id: I38a1ca3d5c9e6cc3ba39e88c6a9af29ecdc0df5b Signed-off-by: Suren Baghdasaryan <surenb@google.com>	2019-03-22 23:06:13 +00:00
Johannes Weiner	9f79143ebb	UPSTREAM: kernel: cgroup: add poll file operation Cgroup has a standardized poll/notification mechanism for waking all pollers on all fds when a filesystem node changes. To allow polling for custom events, add a .poll callback that can override the default. This is in preparation for pollable cgroup pressure files which have per-fd trigger configurations. Link: http://lkml.kernel.org/r/20190124211518.244221-3-surenb@google.com Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Li Zefan <lizefan@huawei.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> (cherry picked from commit: `dc50537bdd`) Bug: 127712811 Test: lmkd in PSI mode Change-Id: Idc648e7b7b7bd5fc00c7b32163e55a93b0f49a98 Signed-off-by: Suren Baghdasaryan <surenb@google.com>	2019-03-21 16:25:28 -07:00
Johannes Weiner	ec350213df	UPSTREAM: psi: avoid divide-by-zero crash inside virtual machines We've been seeing hard-to-trigger psi crashes when running inside VM instances: divide error: 0000 [#1] SMP PTI Modules linked in: [...] CPU: 0 PID: 212 Comm: kworker/0:2 Not tainted 4.16.18-119_fbk9_3817_gfe944c98d695 #119 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015 Workqueue: events psi_clock RIP: 0010:psi_update_stats+0x270/0x490 RSP: 0018:ffffc90001117e10 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff8800a35a13f8 RDX: 0000000000000000 RSI: ffff8800a35a1340 RDI: 0000000000000000 RBP: 0000000000000658 R08: ffff8800a35a1470 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 R13: 0000000000000000 R14: 0000000000000000 R15: 00000000000f8502 FS: 0000000000000000(0000) GS:ffff88023fc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fbe370fa000 CR3: 00000000b1e3a000 CR4: 00000000000006f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: psi_clock+0x12/0x50 process_one_work+0x1e0/0x390 worker_thread+0x2b/0x3c0 ? rescuer_thread+0x330/0x330 kthread+0x113/0x130 ? kthread_create_worker_on_cpu+0x40/0x40 ? SyS_exit_group+0x10/0x10 ret_from_fork+0x35/0x40 Code: 48 0f 47 c7 48 01 c2 45 85 e4 48 89 16 0f 85 e6 00 00 00 4c 8b 49 10 4c 8b 51 08 49 69 d9 f2 07 00 00 48 6b c0 64 4c 8b 29 31 d2 <48> f7 f7 49 69 d5 8d 06 00 00 48 89 c5 4c 69 f0 00 98 0b 00 48 The Code-line points to `period` being 0 inside update_stats(), and we divide by that when calculating that period's pressure percentage. The elapsed period should never be 0. The reason this can happen is due to an off-by-one in the idle time / missing period calculation combined with a coarse sched_clock() in the virtual machine. 
The target time for aggregation is advanced into the future on a fixed grid to prevent clock drift. So when an aggregation runs after some idle period, we can not just set it to "now + psi_period", but have to calculate the downtime and advance the target time relative to itself. However, if the aggregator was disabled exactly one psi_period (ns), we drop one idle period in the calculation due to a > when we should do >=. In that case, next_update will be advanced from 'now - psi_period' to 'now' when it should be moved to 'now + psi_period'. The run finishes with last_update == next_update == sched_clock(). With hardware clocks, this exact nanosecond match isn't likely in the first place; but if it does happen, the clock will still have moved on and the period non-zero by the time the worker runs. A pointlessly short period, but besides the extra work, no harm no foul. However, a slow sched_clock() like we have on VMs might not have advanced either by the time the worker runs again. And when we calculate the elapsed period, the result, our pressure divisor, will be 0. Ouch. Fix this by correctly handling the situation when the elapsed time between aggregation runs is precisely two periods, and advance the expiration timestamp correctly to period into the future. Link: http://lkml.kernel.org/r/20190214193157.15788-1-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Łukasz Siudut <lsiudut@fb.com Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit `4e37504d1c`) Bug: 127712811 Test: lmkd in PSI mode Change-Id: I40917c84354f9f32259c6703f00b6b1d21f45f02 Signed-off-by: Suren Baghdasaryan <surenb@google.com>	2019-03-21 16:25:27 -07:00
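The off-by-one described in the divide-by-zero commit above can be shown with plain integer arithmetic. The helpers below are hypothetical and only model the grid advance the message describes: the number of whole missed periods is computed from `now - expires`, and the next expiry is moved forward on the fixed grid. They assume `now >= expires` and a non-zero `period`.

```c
#include <assert.h>
#include <stdint.h>

/* Strict comparison: drops a period when exactly one period was missed. */
uint64_t missed_strict(uint64_t now, uint64_t expires, uint64_t period)
{
    assert(now >= expires && period > 0);
    return (now - expires > period) ? (now - expires) / period : 0;
}

/* Inclusive comparison: counts the period in the exact-match case too. */
uint64_t missed_inclusive(uint64_t now, uint64_t expires, uint64_t period)
{
    assert(now >= expires && period > 0);
    return (now - expires >= period) ? (now - expires) / period : 0;
}

/* Advance the expiry relative to itself so it stays on the fixed grid. */
uint64_t next_update(uint64_t expires, uint64_t missed, uint64_t period)
{
    return expires + (1 + missed) * period;
}
```

With `expires = now - period`, the strict variant yields a next expiry equal to `now`, so a clock that has not advanced by the next run produces an elapsed period of 0. The inclusive variant yields `now + period`.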
Johannes Weiner	2a070382c9	UPSTREAM: psi: fix aggregation idle shut-off psi has provisions to shut off the periodic aggregation worker when there is a period of no task activity - and thus no data that needs aggregating. However, while developing psi monitoring, Suren noticed that the aggregation clock currently won't stay shut off for good. Debugging this revealed a flaw in the idle design: an aggregation run will see no task activity and decide to go to sleep; shortly thereafter, the kworker thread that executed the aggregation will go idle and cause a scheduling change, during which the psi callback will kick the !pending worker again. This will ping-pong forever, and is equivalent to having no shut-off logic at all (but with more code!) Fix this by exempting aggregation workers from psi's clock waking logic when the state change is them going to sleep. To do this, tag workers with the last work function they executed, and if in psi we see a worker going to sleep after aggregating psi data, we will not reschedule the aggregation work item. What if the worker is also executing other items before or after? Any psi state times that were incurred by work items preceding the aggregation work will have been collected from the per-cpu buckets during the aggregation itself. If there are work items following the aggregation work, the worker's last_func tag will be overwritten and the aggregator will be kept alive to process this genuine new activity. If the aggregation work is the last thing the worker does, and we decide to go idle, the brief period of non-idle time incurred between the aggregation run and the kworker's dequeue will be stranded in the per-cpu buckets until the clock is woken by later activity. But that should not be a problem. The buckets can hold 4s worth of time, and future activity will wake the clock with a 2s delay, giving us 2s worth of data we can leave behind when disabling aggregation. 
If it takes a worker more than two seconds to go idle after it finishes its last work item, we likely have bigger problems in the system, and won't notice one sample that was averaged with a bogus per-CPU weight. Link: http://lkml.kernel.org/r/20190116193501.1910-1-hannes@cmpxchg.org Fixes: `eb414681d5` ("psi: pressure stall information for CPU, memory, and IO") Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit `1b69ac6b40`) Bug: 127712811 Test: lmkd in PSI mode Change-Id: I2877fec3d381b1006b8bd1261895fdfd68bd21db Signed-off-by: Suren Baghdasaryan <surenb@google.com>	2019-03-21 16:25:27 -07:00
Johannes Weiner	3bbcbc8039	UPSTREAM: psi: make disabling/enabling easier for vendor kernels Mel Gorman reports a hackbench regression with psi that would prohibit shipping the suse kernel with it default-enabled, but he'd still like users to be able to opt in at little to no cost to others. With the current combination of CONFIG_PSI and the psi_disabled bool set from the commandline, this is a challenge. Do the following things to make it easier: 1. Add a config option CONFIG_PSI_DEFAULT_DISABLED that allows distros to enable CONFIG_PSI in their kernel but leave the feature disabled unless a user requests it at boot-time. To avoid double negatives, rename psi_disabled= to psi=. 2. Make psi_disabled a static branch to eliminate any branch costs when the feature is disabled. In terms of numbers before and after this patch, Mel says: : The following is a comparision using CONFIG_PSI=n as a baseline against : your patch and a vanilla kernel : : 4.20.0-rc4 4.20.0-rc4 4.20.0-rc4 : kconfigdisable-v1r1 vanilla psidisable-v1r1 : Amean 1 1.3100 ( 0.00%) 1.3923 ( -6.28%) 1.3427 ( -2.49%) : Amean 3 3.8860 ( 0.00%) 4.1230 * -6.10%* 3.8860 ( -0.00%) : Amean 5 6.8847 ( 0.00%) 8.0390 * -16.77%* 6.7727 ( 1.63%) : Amean 7 9.9310 ( 0.00%) 10.8367 * -9.12%* 9.9910 ( -0.60%) : Amean 12 16.6577 ( 0.00%) 18.2363 * -9.48%* 17.1083 ( -2.71%) : Amean 18 26.5133 ( 0.00%) 27.8833 * -5.17%* 25.7663 ( 2.82%) : Amean 24 34.3003 ( 0.00%) 34.6830 ( -1.12%) 32.0450 ( 6.58%) : Amean 30 40.0063 ( 0.00%) 40.5800 ( -1.43%) 41.5087 ( -3.76%) : Amean 32 40.1407 ( 0.00%) 41.2273 ( -2.71%) 39.9417 ( 0.50%) : : It's showing that the vanilla kernel takes a hit (as the bisection : indicated it would) and that disabling PSI by default is reasonably : close in terms of performance for this particular workload on this : particular machine so; Link: http://lkml.kernel.org/r/20181127165329.GA29728@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Tested-by: Mel Gorman <mgorman@techsingularity.net> 
Reported-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit `e0c274472d`) Bug: 127712811 Test: lmkd in PSI mode Change-Id: I6cb666fa351e8901df82e4d6931bfec0c5ce230d Signed-off-by: Suren Baghdasaryan <surenb@google.com>	2019-03-21 16:25:27 -07:00
Olof Johansson	b822a6da85	UPSTREAM: kernel/sched/psi.c: simplify cgroup_move_task() The existing code triggered an invalid warning about 'rq' possibly being used uninitialized. Instead of doing the silly warning suppression by initializing it to NULL, refactor the code to bail out early instead. Warning was: kernel/sched/psi.c: In function `cgroup_move_task': kernel/sched/psi.c:639:13: warning: `rq' may be used uninitialized in this function [-Wmaybe-uninitialized] Link: http://lkml.kernel.org/r/20181103183339.8669-1-olof@lixom.net Fixes: `2ce7135adc` ("psi: cgroup support") Signed-off-by: Olof Johansson <olof@lixom.net> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit `8fcb2312d1`) Bug: 127712811 Test: lmkd in PSI mode Change-Id: Id989da224a726082e0cfa5d5d9460bf63d448a93 Signed-off-by: Suren Baghdasaryan <surenb@google.com>	2019-03-21 16:25:27 -07:00
Johannes Weiner	dc9cd29ded	UPSTREAM: psi: cgroup support On a system that executes multiple cgrouped jobs and independent workloads, we don't just care about the health of the overall system, but also that of individual jobs, so that we can ensure individual job health, fairness between jobs, or prioritize some jobs over others. This patch implements pressure stall tracking for cgroups. In kernels with CONFIG_PSI=y, cgroup2 groups will have cpu.pressure, memory.pressure, and io.pressure files that track aggregate pressure stall times for only the tasks inside the cgroup. Link: http://lkml.kernel.org/r/20180828172258.3185-10-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Daniel Drake <drake@endlessm.com> Tested-by: Suren Baghdasaryan <surenb@google.com> Cc: Christopher Lameter <cl@linux.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <jweiner@fb.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Enderborg <peter.enderborg@sony.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Vinayak Menon <vinmenon@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit `2ce7135adc`) Bug: 127712811 Test: lmkd in PSI mode Change-Id: I163e6657aaa60aa5aab9372616a3bce2a65e90ec Signed-off-by: Suren Baghdasaryan <surenb@google.com>	2019-03-21 16:25:27 -07:00
Johannes Weiner	e550f94252	UPSTREAM: psi: pressure stall information for CPU, memory, and IO When systems are overcommitted and resources become contended, it's hard to tell exactly the impact this has on workload productivity, or how close the system is to lockups and OOM kills. In particular, when machines work multiple jobs concurrently, the impact of overcommit in terms of latency and throughput on the individual job can be enormous. In order to maximize hardware utilization without sacrificing individual job health or risk complete machine lockups, this patch implements a way to quantify resource pressure in the system. A kernel built with CONFIG_PSI=y creates files in /proc/pressure/ that expose the percentage of time the system is stalled on CPU, memory, or IO, respectively. Stall states are aggregate versions of the per-task delay accounting delays: cpu: some tasks are runnable but not executing on a CPU memory: tasks are reclaiming, or waiting for swapin or thrashing cache io: tasks are waiting for io completions These percentages of walltime can be thought of as pressure percentages, and they give a general sense of system health and productivity loss incurred by resource overcommit. They can also indicate when the system is approaching lockup scenarios and OOMs. To do this, psi keeps track of the task states associated with each CPU and samples the time they spend in stall states. Every 2 seconds, the samples are averaged across CPUs - weighted by the CPUs' non-idle time to eliminate artifacts from unused CPUs - and translated into percentages of walltime. A running average of those percentages is maintained over 10s, 1m, and 5m periods (similar to the loadaverage). 
[hannes@cmpxchg.org: doc fixlet, per Randy] Link: http://lkml.kernel.org/r/20180828205625.GA14030@cmpxchg.org [hannes@cmpxchg.org: code optimization] Link: http://lkml.kernel.org/r/20180907175015.GA8479@cmpxchg.org [hannes@cmpxchg.org: rename psi_clock() to psi_update_work(), per Peter] Link: http://lkml.kernel.org/r/20180907145404.GB11088@cmpxchg.org [hannes@cmpxchg.org: fix build] Link: http://lkml.kernel.org/r/20180913014222.GA2370@cmpxchg.org Link: http://lkml.kernel.org/r/20180828172258.3185-9-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Daniel Drake <drake@endlessm.com> Tested-by: Suren Baghdasaryan <surenb@google.com> Cc: Christopher Lameter <cl@linux.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <jweiner@fb.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Enderborg <peter.enderborg@sony.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vinayak Menon <vinmenon@codeaurora.org> Cc: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit `eb414681d5`) Bug: 127712811 Test: lmkd in PSI mode Change-Id: Id00d23c977169b0c4636d92016fc1fee0274be05 Signed-off-by: Suren Baghdasaryan <surenb@google.com>	2019-03-21 16:25:27 -07:00
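The aggregation step described in the psi commit above (samples averaged across CPUs, weighted by each CPU's non-idle time, then translated into a percentage of walltime) can be sketched as follows. The function names are hypothetical, and times are taken in milliseconds here purely to keep the products small; the unit and overflow handling are assumptions of this example, not details from the message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Average per-CPU stall time, weighting each CPU by its non-idle time
 * so that unused CPUs do not dilute the result. Inputs in milliseconds.
 */
uint64_t weighted_stall_ms(const uint64_t *stall_ms,
                           const uint64_t *nonidle_ms, size_t ncpus)
{
    uint64_t weighted_sum = 0, weight_total = 0;

    assert(stall_ms != NULL && nonidle_ms != NULL);
    for (size_t i = 0; i < ncpus; i++) {
        weighted_sum += stall_ms[i] * nonidle_ms[i];
        weight_total += nonidle_ms[i];
    }
    /* All CPUs idle: nothing was stalled. */
    return weight_total ? weighted_sum / weight_total : 0;
}

/* Translate a stall time into a whole percentage of the period. */
unsigned int stall_percent(uint64_t stall_ms, uint64_t period_ms)
{
    return period_ms ? (unsigned int)(stall_ms * 100 / period_ms) : 0;
}
```

For two CPUs where one was busy for the full 2000ms period with 1000ms stalled and the other was entirely idle, the weighted result is 1000ms, or 50%, whereas an unweighted mean would report 500ms.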
Johannes Weiner	8cd88f5398	UPSTREAM: sched: introduce this_rq_lock_irq() do_sched_yield() disables IRQs, looks up this_rq() and locks it. The next patch is adding another site with the same pattern, so provide a convenience function for it. Link: http://lkml.kernel.org/r/20180828172258.3185-8-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Suren Baghdasaryan <surenb@google.com> Tested-by: Daniel Drake <drake@endlessm.com> Cc: Christopher Lameter <cl@linux.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <jweiner@fb.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Enderborg <peter.enderborg@sony.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vinayak Menon <vinmenon@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit `246b3b3342`) Bug: 127712811 Test: lmkd in PSI mode Change-Id: I24b42cff1624c80633f116b7cb485564f53a30a7 Signed-off-by: Suren Baghdasaryan <surenb@google.com>	2019-03-21 16:25:27 -07:00
Johannes Weiner	cdda3cf652	UPSTREAM: sched: sched.h: make rq locking and clock functions available in stats.h kernel/sched/sched.h includes "stats.h" half-way through the file. The next patch introduces users of sched.h's rq locking functions and update_rq_clock() in kernel/sched/stats.h. Move those definitions up in the file so they are available in stats.h. Link: http://lkml.kernel.org/r/20180828172258.3185-7-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Suren Baghdasaryan <surenb@google.com> Tested-by: Daniel Drake <drake@endlessm.com> Cc: Christopher Lameter <cl@linux.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <jweiner@fb.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Enderborg <peter.enderborg@sony.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vinayak Menon <vinmenon@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit `1f351d7f75`) Bug: 127712811 Test: lmkd in PSI mode Change-Id: Id342e0ba9a62b49e64f2ce8b87f883ea70230b2f Signed-off-by: Suren Baghdasaryan <surenb@google.com>	2019-03-21 16:25:27 -07:00
Johannes Weiner	4c9c09affa	UPSTREAM: sched: loadavg: make calc_load_n() public It's going to be used in a later patch. Keep the churn separate. Link: http://lkml.kernel.org/r/20180828172258.3185-6-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Suren Baghdasaryan <surenb@google.com> Tested-by: Daniel Drake <drake@endlessm.com> Cc: Christopher Lameter <cl@linux.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <jweiner@fb.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Enderborg <peter.enderborg@sony.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vinayak Menon <vinmenon@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit `5c54f5b9ed`) Bug: 127712811 Test: lmkd in PSI mode Change-Id: I50e0cb0dbf20ced329a484493f82ff69ca1ae97a Signed-off-by: Suren Baghdasaryan <surenb@google.com>	2019-03-21 16:25:27 -07:00
Johannes Weiner	2ba18b41d3	BACKPORT: sched: loadavg: consolidate LOAD_INT, LOAD_FRAC, CALC_LOAD There are several definitions of those functions/macros in places that mess with fixed-point load averages. Provide an official version. [akpm@linux-foundation.org: fix missed conversion in block/blk-iolatency.c] Link: http://lkml.kernel.org/r/20180828172258.3185-5-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Suren Baghdasaryan <surenb@google.com> Tested-by: Daniel Drake <drake@endlessm.com> Cc: Christopher Lameter <cl@linux.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <jweiner@fb.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Enderborg <peter.enderborg@sony.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vinayak Menon <vinmenon@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit `8508cf3ffa`) Conflicts: block/blk-iolatency.c (1. manual merge to replace stat->rqs.mean with stat.mean) Bug: 127712811 Test: lmkd in PSI mode Change-Id: I716b4874491cff75a2355c6d95c64cf02d05e7ee Signed-off-by: Suren Baghdasaryan <surenb@google.com>	2019-03-21 16:25:26 -07:00
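The fixed-point load-average helpers consolidated by the commit above can be illustrated with a small user-space sketch. The constants (`FSHIFT` of 11, `EXP_1` of 1884) and the `LOAD_INT`/`LOAD_FRAC` definitions are written from memory of the kernel's load-average header and should be treated as assumptions; `calc_load_step()` and `format_load()` are hypothetical names introduced only for this illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Fixed-point layout: 11 fractional bits, so FIXED_1 represents 1.0.
 * Values are assumptions recalled from the kernel header. */
#define FSHIFT   11
#define FIXED_1  (1UL << FSHIFT)
#define EXP_1    1884UL   /* decay factor for the 1-minute average */

/* Integer part and two-digit fractional part of a fixed-point load. */
#define LOAD_INT(x)  ((x) >> FSHIFT)
#define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1 - 1)) * 100)

/* Hypothetical helper: one exponential-decay step of `load` toward
 * `active`, both expressed in fixed point. */
static unsigned long calc_load_step(unsigned long load, unsigned long exp,
                                    unsigned long active)
{
    assert(exp <= FIXED_1);
    return (load * exp + active * (FIXED_1 - exp)) >> FSHIFT;
}

/* Hypothetical helper: render a fixed-point load as "I.FF". */
static void format_load(char *buf, size_t len, unsigned long load)
{
    snprintf(buf, len, "%lu.%02lu", LOAD_INT(load), LOAD_FRAC(load));
}
```

For instance, a raw value of 3072 is 1.5 in this encoding (3072 / 2048), so `format_load()` yields the string `1.50`.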
Johannes Weiner	580f26b93e	UPSTREAM: delayacct: track delays from thrashing cache pages Delay accounting already measures the time a task spends in direct reclaim and waiting for swapin, but in low memory situations tasks can spend a significant amount of their time waiting on thrashing page cache. This isn't tracked right now. To know the full impact of memory contention on an individual task, measure the delay when waiting for a recently evicted active cache page to read back into memory. Also update tools/accounting/getdelays.c: [hannes@computer accounting]$ sudo ./getdelays -d -p 1 print delayacct stats ON PID 1 CPU count real total virtual total delay total delay average 50318 745000000 847346785 400533713 0.008ms IO count delay total delay average 435 122601218 0ms SWAP count delay total delay average 0 0 0ms RECLAIM count delay total delay average 0 0 0ms THRASHING count delay total delay average 19 12621439 0ms Link: http://lkml.kernel.org/r/20180828172258.3185-4-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Daniel Drake <drake@endlessm.com> Tested-by: Suren Baghdasaryan <surenb@google.com> Cc: Christopher Lameter <cl@linux.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <jweiner@fb.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Enderborg <peter.enderborg@sony.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vinayak Menon <vinmenon@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit `b1d29ba82c`) Bug: 127712811 Test: lmkd in PSI mode Change-Id: I259f693987cf04e6a52ee7e8accf55a17e0de005 Signed-off-by: Suren Baghdasaryan <surenb@google.com>	2019-03-21 16:25:26 -07:00
Greg Kroah-Hartman	2e568c979c	This is the 4.19.29 stable release -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlyJb/EACgkQONu9yGCS aT4y0g//b9t9/onhTaXcY/ByPmBAwqNgugi7eYcZqGDBp7aCDOBLF6eOwbhdvvuS ZTaZ5eWG3Twz3mZu9vveuskgMci2npDyLPgqBWGzW+Ef5r/xPd40diaI75ZUc68T gimWbQ0VANuXKklK6LysBUaVQWE3ilIy6qnnpj0DI3ipNDoE62Ry1LNthuKy+73J w6r7uwkb6X/CkXpNB/L4cDdpSy/CvhGQhd6p91lBuE4DfyPqEzslYCokD9aPXp9b Fedt/Re+8eULBNcgqPYxkS5pBrbHtqrGf00AMlzC8DkC+GZyDqSP2xjv6AiTfGJd uf0/Jvsv2OBnP4aYsbk+uB2z3plzPBgmXxa/1bm+yrGCMvbpi9mMx75HM2joAeVp tVN4ZN65kNgJkXCchJTHdQ3s6teOD8Par1czy570HyKBU6l1j3AGArGm+b4WGPWx dL+82coojMKxKNdTHfxUXES6QGKp716r3un6mCrKR0xET/SDayzDQMaSM8UOtArK ELzNeKzKTc5oBx6i+JfGmY8ZsedpNGCIPpsiuoSYAaon5ZzNbruzOAlDOThs157d YezDHZ9XMrx3kN/xYnqZD63x/5egq9REbZGWljeykbNkWcEY74jIkKwNLxqv3P64 JsLp60owvjzwtzKycjZogNU//GGNTBdb+6pESq4MxJpPTteFWnc= =n9iV -----END PGP SIGNATURE----- Merge 4.19.29 into android-4.19 Changes in 4.19.29 media: uvcvideo: Fix 'type' check leading to overflow vti4: Fix a ipip packet processing bug in 'IPCOMP' virtual tunnel perf script: Fix crash with printing mixed trace point and other events perf core: Fix perf_proc_update_handler() bug perf tools: Handle TOPOLOGY headers with no CPU perf script: Fix crash when processing recorded stat data IB/{hfi1, qib}: Fix WC.byte_len calculation for UD_SEND_WITH_IMM iommu/amd: Call free_iova_fast with pfn in map_sg iommu/amd: Unmap all mapped pages in error path of map_sg riscv: fixup max_low_pfn with PFN_DOWN. 
ipvs: Fix signed integer overflow when setsockopt timeout iommu/amd: Fix IOMMU page flush when detach device from a domain clk: ti: Fix error handling in ti_clk_parse_divider_data() clk: qcom: gcc: Use active only source for CPUSS clocks xtensa: SMP: fix ccount_timer_shutdown riscv: Adjust mmap base address at a third of task size IB/ipoib: Fix for use-after-free in ipoib_cm_tx_start selftests: cpu-hotplug: fix case where CPUs offline > CPUs present xtensa: SMP: fix secondary CPU initialization xtensa: smp_lx200_defconfig: fix vectors clash xtensa: SMP: mark each possible CPU as present iomap: get/put the page in iomap_page_create/release() iomap: fix a use after free in iomap_dio_rw xtensa: SMP: limit number of possible CPUs by NR_CPUS net: altera_tse: fix msgdma_tx_completion on non-zero fill_level case net: hns: Fix for missing of_node_put() after of_parse_phandle() net: hns: Restart autoneg need return failed when autoneg off net: hns: Fix wrong read accesses via Clause 45 MDIO protocol net: stmmac: dwmac-rk: fix error handling in rk_gmac_powerup() netfilter: ebtables: compat: un-break 32bit setsockopt when no rules are present gpio: vf610: Mask all GPIO interrupts selftests: net: use LDLIBS instead of LDFLAGS selftests: timers: use LDLIBS instead of LDFLAGS nfs: Fix NULL pointer dereference of dev_name qed: Fix bug in tx promiscuous mode settings qed: Fix LACP pdu drops for VFs qed: Fix VF probe failure while FLR qed: Fix system crash in ll2 xmit qed: Fix stack out of bounds bug scsi: libfc: free skb when receiving invalid flogi resp scsi: scsi_debug: fix write_same with virtual_gb problem scsi: bnx2fc: Fix error handling in probe() scsi: 53c700: pass correct "dev" to dma_alloc_attrs() platform/x86: Fix unmet dependency warning for ACPI_CMPC platform/x86: Fix unmet dependency warning for SAMSUNG_Q10 net: macb: Apply RXUBR workaround only to versions with errata x86/boot/compressed/64: Set EFER.LME=1 in 32-bit trampoline before returning to long mode cifs: fix 
computation for MAX_SMB2_HDR_SIZE x86/microcode/amd: Don't falsely trick the late loading mechanism arm64: kprobe: Always blacklist the KVM world-switch code apparmor: Fix aa_label_build() error handling for failed merges x86/kexec: Don't setup EFI info if EFI runtime is not enabled proc: fix /proc/net/* after setns(2) x86_64: increase stack size for KASAN_EXTRA mm, memory_hotplug: is_mem_section_removable do not pass the end of a zone mm, memory_hotplug: test_pages_in_a_zone do not pass the end of zone lib/test_kmod.c: potential double free in error handling fs/drop_caches.c: avoid softlockups in drop_pagecache_sb() autofs: drop dentry reference only when it is never used autofs: fix error return in autofs_fill_super() mm, memory_hotplug: fix off-by-one in is_pageblock_removable ARM: OMAP: dts: N950/N9: fix onenand timings ARM: dts: omap4-droid4: Fix typo in cpcap IRQ flags ARM: dts: sun8i: h3: Add ethernet0 alias to Beelink X2 arm: dts: meson: Fix IRQ trigger type for macirq ARM: dts: meson8b: odroidc1: mark the SD card detection GPIO active-low ARM: dts: meson8m2: mxiii-plus: mark the SD card detection GPIO active-low ARM: dts: imx6sx: correct backward compatible of gpt arm64: dts: renesas: r8a7796: Enable DMA for SCIF2 arm64: dts: renesas: r8a77965: Enable DMA for SCIF2 soc: fsl: qbman: avoid race in clearing QMan interrupt pinctrl: mcp23s08: spi: Fix regmap allocation for mcp23s18 wlcore: sdio: Fixup power on/off sequence bpftool: Fix prog dump by tag bpftool: fix percpu maps updating bpf: sock recvbuff must be limited by rmem_max in bpf_setsockopt() ARM: pxa: ssp: unneeded to free devm_ allocated data arm64: dts: add msm8996 compatible to gicv3 batman-adv: release station info tidstats DTS: CI20: Fix bugs in ci20's device tree. 
usb: phy: fix link errors irqchip/gic-v4: Fix occasional VLPI drop irqchip/gic-v3-its: Gracefully fail on LPI exhaustion irqchip/mmp: Only touch the PJ4 IRQ & FIQ bits on enable/disable drm/amdgpu: Add missing power attribute to APU check drm/radeon: check if device is root before getting pci speed caps drm/amdgpu: Transfer fences to dmabuf importer net: stmmac: Fallback to Platform Data clock in Watchdog conversion net: stmmac: Send TSO packets always from Queue 0 net: stmmac: Disable EEE mode earlier in XMIT callback irqchip/gic-v3-its: Fix ITT_entry_size accessor relay: check return of create_buf_file() properly bpf, selftests: fix handling of sparse CPU allocations bpf: fix lockdep false positive in percpu_freelist bpf: fix potential deadlock in bpf_prog_register bpf: Fix syscall's stackmap lookup potential deadlock drm/sun4i: tcon: Prepare and enable TCON channel 0 clock at init dmaengine: at_xdmac: Fix wrongfull report of a channel as in use vsock/virtio: fix kernel panic after device hot-unplug vsock/virtio: reset connected sockets on device removal dmaengine: dmatest: Abort test in case of mapping error selftests: netfilter: fix config fragment CONFIG_NF_TABLES_INET selftests: netfilter: add simple masq/redirect test cases netfilter: nf_nat: skip nat clash resolution for same-origin entries s390/qeth: release cmd buffer in error paths s390/qeth: fix use-after-free in error path s390/qeth: cancel close_dev work before removing a card perf symbols: Filter out hidden symbols from labels perf trace: Support multiple "vfs_getname" probes MIPS: Remove function size check in get_frame_info() Revert "scsi: libfc: Add WARN_ON() when deleting rports" i2c: omap: Use noirq system sleep pm ops to idle device for suspend drm/amdgpu: use spin_lock_irqsave to protect vm_manager.pasid_idr nvme: lock NS list changes while handling command effects nvme-pci: fix rapid add remove sequence fs: ratelimit __find_get_block_slow() failure message. qed: Fix EQ full firmware assert. 
qed: Consider TX tcs while deriving the max num_queues for PF. qede: Fix system crash on configuring channels. blk-iolatency: fix IO hang due to negative inflight counter nvme-pci: add missing unlock for reset error Input: wacom_serial4 - add support for Wacom ArtPad II tablet Input: elan_i2c - add id for touchpad found in Lenovo s21e-20 iscsi_ibft: Fix missing break in switch statement scsi: aacraid: Fix missing break in switch statement x86/PCI: Fixup RTIT_BAR of Intel Denverton Trace Hub arm64: dts: zcu100-revC: Give wifi some time after power-on arm64: dts: hikey: Give wifi some time after power-on arm64: dts: hikey: Revert "Enable HS200 mode on eMMC" ARM: dts: exynos: Fix pinctrl definition for eMMC RTSN line on Odroid X2/U3 ARM: dts: exynos: Add minimal clkout parameters to Exynos3250 PMU ARM: dts: exynos: Fix max voltage for buck8 regulator on Odroid XU3/XU4 drm: disable uncached DMA optimization for ARM and arm64 netfilter: xt_TEE: fix wrong interface selection netfilter: xt_TEE: add missing code to get interface index in checkentry. gfs2: Fix missed wakeups in find_insert_glock staging: erofs: add error handling for xattr submodule staging: erofs: fix fast symlink w/o xattr when fs xattr is on staging: erofs: fix memleak of inode's shared xattr array staging: erofs: fix race of initializing xattrs of a inode at the same time staging: erofs: keep corrupted fs from crashing kernel in erofs_namei() cifs: allow calling SMB2_xxx_free(NULL) ath9k: Avoid OF no-EEPROM quirks without qca,no-eeprom driver core: Postpone DMA tear-down until after devres release perf/x86/intel: Make cpuc allocations consistent perf/x86/intel: Generalize dynamic constraint creation x86: Add TSX Force Abort CPUID/MSR perf/x86/intel: Implement support for TSX Force Abort Linux 4.19.29 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>	2019-03-13 14:17:29 -07:00
Martin KaFai Lau	ae26a7109c	bpf: Fix syscall's stackmap lookup potential deadlock [ Upstream commit `7c4cd051ad` ] The map_lookup_elem used to not acquire a spinlock in order to optimize the reader. It was true until commit `557c0c6e7d` ("bpf: convert stackmap to pre-allocation") The syscall's map_lookup_elem(stackmap) calls bpf_stackmap_copy(). bpf_stackmap_copy() may find the elem no longer needed after the copy is done. If that is the case, pcpu_freelist_push() saves this elem for reuse later. This push requires a spinlock. If a tracing bpf_prog got run in the middle of the syscall's map_lookup_elem(stackmap) and this tracing bpf_prog is calling bpf_get_stackid(stackmap) which also requires the same pcpu_freelist's spinlock, it may end up with a deadlock situation as reported by Eric Dumazet in https://patchwork.ozlabs.org/patch/1030266/ The situation is the same as the syscall's map_update_elem() which needs to acquire the pcpu_freelist's spinlock and could race with tracing bpf_prog. Hence, this patch fixes it by protecting bpf_stackmap_copy() with this_cpu_inc(bpf_prog_active) to prevent tracing bpf_prog from running. A later syscall's map_lookup_elem commit `f1a2e44a3a` ("bpf: add queue and stack maps") also acquires a spinlock and races with tracing bpf_prog similarly. Hence, this patch is forward looking and protects the majority of the map lookups. bpf_map_offload_lookup_elem() is the exception since it is for network bpf_prog only (i.e. never called by tracing bpf_prog). Fixes: `557c0c6e7d` ("bpf: convert stackmap to pre-allocation") Reported-by: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Sasha Levin <sashal@kernel.org>	2019-03-13 14:02:36 -07:00
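The guard described in the commit above (bumping an "active" counter around the lookup so that a tracing program cannot run in the middle of it) can be modelled in a single-threaded user-space sketch. Everything here is a stand-in: `prog_active` plays the role of the per-CPU `bpf_prog_active` counter, `freelist_lock_held` models the pcpu_freelist spinlock, and both function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

static int prog_active;        /* stand-in for per-CPU bpf_prog_active */
static int freelist_lock_held; /* stand-in for the freelist spinlock   */

/* A tracing program only runs if nothing else is active on this CPU.
 * Returns true if it actually ran. */
static bool tracing_prog_try_run(void)
{
    bool ran = false;

    if (++prog_active == 1) {
        /* Taking the lock again while it is held would self-deadlock. */
        assert(!freelist_lock_held);
        ran = true;
    }
    --prog_active;
    return ran;
}

/* Syscall-side lookup: mark the CPU busy before touching the lock, so a
 * tracepoint firing midway finds prog_active != 1 and backs off. */
static bool guarded_lookup(bool fire_tracepoint_midway)
{
    bool tracing_ran = false;

    ++prog_active;
    freelist_lock_held = 1;
    if (fire_tracepoint_midway)
        tracing_ran = tracing_prog_try_run();
    freelist_lock_held = 0;
    --prog_active;
    return tracing_ran;
}
```

In this model a tracepoint that fires during `guarded_lookup()` is skipped rather than contending for the lock, which is the behaviour the commit message relies on.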
Alexei Starovoitov	3bbe6a4212	bpf: fix potential deadlock in bpf_prog_register [ Upstream commit `e16ec34039` ] Lockdep found a potential deadlock between cpu_hotplug_lock, bpf_event_mutex, and cpuctx_mutex: [ 13.007000] WARNING: possible circular locking dependency detected [ 13.007587] 5.0.0-rc3-00018-g2fa53f892422-dirty #477 Not tainted [ 13.008124] ------------------------------------------------------ [ 13.008624] test_progs/246 is trying to acquire lock: [ 13.009030] 0000000094160d1d (tracepoints_mutex){+.+.}, at: tracepoint_probe_register_prio+0x2d/0x300 [ 13.009770] [ 13.009770] but task is already holding lock: [ 13.010239] 00000000d663ef86 (bpf_event_mutex){+.+.}, at: bpf_probe_register+0x1d/0x60 [ 13.010877] [ 13.010877] which lock already depends on the new lock. [ 13.010877] [ 13.011532] [ 13.011532] the existing dependency chain (in reverse order) is: [ 13.012129] [ 13.012129] -> #4 (bpf_event_mutex){+.+.}: [ 13.012582] perf_event_query_prog_array+0x9b/0x130 [ 13.013016] _perf_ioctl+0x3aa/0x830 [ 13.013354] perf_ioctl+0x2e/0x50 [ 13.013668] do_vfs_ioctl+0x8f/0x6a0 [ 13.014003] ksys_ioctl+0x70/0x80 [ 13.014320] __x64_sys_ioctl+0x16/0x20 [ 13.014668] do_syscall_64+0x4a/0x180 [ 13.015007] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 13.015469] [ 13.015469] -> #3 (&cpuctx_mutex){+.+.}: [ 13.015910] perf_event_init_cpu+0x5a/0x90 [ 13.016291] perf_event_init+0x1b2/0x1de [ 13.016654] start_kernel+0x2b8/0x42a [ 13.016995] secondary_startup_64+0xa4/0xb0 [ 13.017382] [ 13.017382] -> #2 (pmus_lock){+.+.}: [ 13.017794] perf_event_init_cpu+0x21/0x90 [ 13.018172] cpuhp_invoke_callback+0xb3/0x960 [ 13.018573] _cpu_up+0xa7/0x140 [ 13.018871] do_cpu_up+0xa4/0xc0 [ 13.019178] smp_init+0xcd/0xd2 [ 13.019483] kernel_init_freeable+0x123/0x24f [ 13.019878] kernel_init+0xa/0x110 [ 13.020201] ret_from_fork+0x24/0x30 [ 13.020541] [ 13.020541] -> #1 (cpu_hotplug_lock.rw_sem){++++}: [ 13.021051] static_key_slow_inc+0xe/0x20 [ 13.021424] tracepoint_probe_register_prio+0x28c/0x300 
[ 13.021891] perf_trace_event_init+0x11f/0x250 [ 13.022297] perf_trace_init+0x6b/0xa0 [ 13.022644] perf_tp_event_init+0x25/0x40 [ 13.023011] perf_try_init_event+0x6b/0x90 [ 13.023386] perf_event_alloc+0x9a8/0xc40 [ 13.023754] __do_sys_perf_event_open+0x1dd/0xd30 [ 13.024173] do_syscall_64+0x4a/0x180 [ 13.024519] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 13.024968] [ 13.024968] -> #0 (tracepoints_mutex){+.+.}: [ 13.025434] __mutex_lock+0x86/0x970 [ 13.025764] tracepoint_probe_register_prio+0x2d/0x300 [ 13.026215] bpf_probe_register+0x40/0x60 [ 13.026584] bpf_raw_tracepoint_open.isra.34+0xa4/0x130 [ 13.027042] __do_sys_bpf+0x94f/0x1a90 [ 13.027389] do_syscall_64+0x4a/0x180 [ 13.027727] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 13.028171] [ 13.028171] other info that might help us debug this: [ 13.028171] [ 13.028807] Chain exists of: [ 13.028807] tracepoints_mutex --> &cpuctx_mutex --> bpf_event_mutex [ 13.028807] [ 13.029666] Possible unsafe locking scenario: [ 13.029666] [ 13.030140] CPU0 CPU1 [ 13.030510] ---- ---- [ 13.030875] lock(bpf_event_mutex); [ 13.031166] lock(&cpuctx_mutex); [ 13.031645] lock(bpf_event_mutex); [ 13.032135] lock(tracepoints_mutex); [ 13.032441] [ 13.032441] * DEADLOCK * [ 13.032441] [ 13.032911] 1 lock held by test_progs/246: [ 13.033239] #0: 00000000d663ef86 (bpf_event_mutex){+.+.}, at: bpf_probe_register+0x1d/0x60 [ 13.033909] [ 13.033909] stack backtrace: [ 13.034258] CPU: 1 PID: 246 Comm: test_progs Not tainted 5.0.0-rc3-00018-g2fa53f892422-dirty #477 [ 13.034964] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014 [ 13.035657] Call Trace: [ 13.035859] dump_stack+0x5f/0x8b [ 13.036130] print_circular_bug.isra.37+0x1ce/0x1db [ 13.036526] __lock_acquire+0x1158/0x1350 [ 13.036852] ? lock_acquire+0x98/0x190 [ 13.037154] lock_acquire+0x98/0x190 [ 13.037447] ? tracepoint_probe_register_prio+0x2d/0x300 [ 13.037876] __mutex_lock+0x86/0x970 [ 13.038167] ? tracepoint_probe_register_prio+0x2d/0x300 [ 13.038600] ? 
tracepoint_probe_register_prio+0x2d/0x300 [ 13.039028] ? __mutex_lock+0x86/0x970 [ 13.039337] ? __mutex_lock+0x24a/0x970 [ 13.039649] ? bpf_probe_register+0x1d/0x60 [ 13.039992] ? __bpf_trace_sched_wake_idle_without_ipi+0x10/0x10 [ 13.040478] ? tracepoint_probe_register_prio+0x2d/0x300 [ 13.040906] tracepoint_probe_register_prio+0x2d/0x300 [ 13.041325] bpf_probe_register+0x40/0x60 [ 13.041649] bpf_raw_tracepoint_open.isra.34+0xa4/0x130 [ 13.042068] ? __might_fault+0x3e/0x90 [ 13.042374] __do_sys_bpf+0x94f/0x1a90 [ 13.042678] do_syscall_64+0x4a/0x180 [ 13.042975] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 13.043382] RIP: 0033:0x7f23b10a07f9 [ 13.045155] RSP: 002b:00007ffdef42fdd8 EFLAGS: 00000202 ORIG_RAX: 0000000000000141 [ 13.045759] RAX: ffffffffffffffda RBX: 00007ffdef42ff70 RCX: 00007f23b10a07f9 [ 13.046326] RDX: 0000000000000070 RSI: 00007ffdef42fe10 RDI: 0000000000000011 [ 13.046893] RBP: 00007ffdef42fdf0 R08: 0000000000000038 R09: 00007ffdef42fe10 [ 13.047462] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000 [ 13.048029] R13: 0000000000000016 R14: 00007f23b1db4690 R15: 0000000000000000 Since tracepoints_mutex will be taken in tracepoint_probe_register/unregister() there is no need to take bpf_event_mutex too. bpf_event_mutex is protecting modifications to prog array used in kprobe/perf bpf progs. bpf_raw_tracepoints don't need to take this mutex. Fixes: `c4f6699dfc` ("bpf: introduce BPF_RAW_TRACEPOINT") Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Sasha Levin <sashal@kernel.org>	2019-03-13 14:02:36 -07:00
Alexei Starovoitov	e3bc64c9aa	bpf: fix lockdep false positive in percpu_freelist [ Upstream commit `a89fac57b5` ] Lockdep warns about false positive: [ 12.492084] 00000000e6b28347 (&head->lock){+...}, at: pcpu_freelist_push+0x2a/0x40 [ 12.492696] but this lock was taken by another, HARDIRQ-safe lock in the past: [ 12.493275] (&rq->lock){-.-.} [ 12.493276] [ 12.493276] [ 12.493276] and interrupts could create inverse lock ordering between them. [ 12.493276] [ 12.494435] [ 12.494435] other info that might help us debug this: [ 12.494979] Possible interrupt unsafe locking scenario: [ 12.494979] [ 12.495518] CPU0 CPU1 [ 12.495879] ---- ---- [ 12.496243] lock(&head->lock); [ 12.496502] local_irq_disable(); [ 12.496969] lock(&rq->lock); [ 12.497431] lock(&head->lock); [ 12.497890] <Interrupt> [ 12.498104] lock(&rq->lock); [ 12.498368] [ 12.498368] * DEADLOCK * [ 12.498368] [ 12.498837] 1 lock held by dd/276: [ 12.499110] #0: 00000000c58cb2ee (rcu_read_lock){....}, at: trace_call_bpf+0x5e/0x240 [ 12.499747] [ 12.499747] the shortest dependencies between 2nd lock and 1st lock: [ 12.500389] -> (&rq->lock){-.-.} { [ 12.500669] IN-HARDIRQ-W at: [ 12.500934] _raw_spin_lock+0x2f/0x40 [ 12.501373] scheduler_tick+0x4c/0xf0 [ 12.501812] update_process_times+0x40/0x50 [ 12.502294] tick_periodic+0x27/0xb0 [ 12.502723] tick_handle_periodic+0x1f/0x60 [ 12.503203] timer_interrupt+0x11/0x20 [ 12.503651] __handle_irq_event_percpu+0x43/0x2c0 [ 12.504167] handle_irq_event_percpu+0x20/0x50 [ 12.504674] handle_irq_event+0x37/0x60 [ 12.505139] handle_level_irq+0xa7/0x120 [ 12.505601] handle_irq+0xa1/0x150 [ 12.506018] do_IRQ+0x77/0x140 [ 12.506411] ret_from_intr+0x0/0x1d [ 12.506834] _raw_spin_unlock_irqrestore+0x53/0x60 [ 12.507362] __setup_irq+0x481/0x730 [ 12.507789] setup_irq+0x49/0x80 [ 12.508195] hpet_time_init+0x21/0x32 [ 12.508644] x86_late_time_init+0xb/0x16 [ 12.509106] start_kernel+0x390/0x42a [ 12.509554] secondary_startup_64+0xa4/0xb0 [ 12.510034] IN-SOFTIRQ-W at: [ 12.510305] 
_raw_spin_lock+0x2f/0x40 [ 12.510772] try_to_wake_up+0x1c7/0x4e0 [ 12.511220] swake_up_locked+0x20/0x40 [ 12.511657] swake_up_one+0x1a/0x30 [ 12.512070] rcu_process_callbacks+0xc5/0x650 [ 12.512553] __do_softirq+0xe6/0x47b [ 12.512978] irq_exit+0xc3/0xd0 [ 12.513372] smp_apic_timer_interrupt+0xa9/0x250 [ 12.513876] apic_timer_interrupt+0xf/0x20 [ 12.514343] default_idle+0x1c/0x170 [ 12.514765] do_idle+0x199/0x240 [ 12.515159] cpu_startup_entry+0x19/0x20 [ 12.515614] start_kernel+0x422/0x42a [ 12.516045] secondary_startup_64+0xa4/0xb0 [ 12.516521] INITIAL USE at: [ 12.516774] _raw_spin_lock_irqsave+0x38/0x50 [ 12.517258] rq_attach_root+0x16/0xd0 [ 12.517685] sched_init+0x2f2/0x3eb [ 12.518096] start_kernel+0x1fb/0x42a [ 12.518525] secondary_startup_64+0xa4/0xb0 [ 12.518986] } [ 12.519132] ... key at: [<ffffffff82b7bc28>] __key.71384+0x0/0x8 [ 12.519649] ... acquired at: [ 12.519892] pcpu_freelist_pop+0x7b/0xd0 [ 12.520221] bpf_get_stackid+0x1d2/0x4d0 [ 12.520563] ___bpf_prog_run+0x8b4/0x11a0 [ 12.520887] [ 12.521008] -> (&head->lock){+...} { [ 12.521292] HARDIRQ-ON-W at: [ 12.521539] _raw_spin_lock+0x2f/0x40 [ 12.521950] pcpu_freelist_push+0x2a/0x40 [ 12.522396] bpf_get_stackid+0x494/0x4d0 [ 12.522828] ___bpf_prog_run+0x8b4/0x11a0 [ 12.523296] INITIAL USE at: [ 12.523537] _raw_spin_lock+0x2f/0x40 [ 12.523944] pcpu_freelist_populate+0xc0/0x120 [ 12.524417] htab_map_alloc+0x405/0x500 [ 12.524835] __do_sys_bpf+0x1a3/0x1a90 [ 12.525253] do_syscall_64+0x4a/0x180 [ 12.525659] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 12.526167] } [ 12.526311] ... key at: [<ffffffff838f7668>] __key.13130+0x0/0x8 [ 12.526812] ... 
acquired at: [ 12.527047] __lock_acquire+0x521/0x1350 [ 12.527371] lock_acquire+0x98/0x190 [ 12.527680] _raw_spin_lock+0x2f/0x40 [ 12.527994] pcpu_freelist_push+0x2a/0x40 [ 12.528325] bpf_get_stackid+0x494/0x4d0 [ 12.528645] ___bpf_prog_run+0x8b4/0x11a0 [ 12.528970] [ 12.529092] [ 12.529092] stack backtrace: [ 12.529444] CPU: 0 PID: 276 Comm: dd Not tainted 5.0.0-rc3-00018-g2fa53f892422 #475 [ 12.530043] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014 [ 12.530750] Call Trace: [ 12.530948] dump_stack+0x5f/0x8b [ 12.531248] check_usage_backwards+0x10c/0x120 [ 12.531598] ? ___bpf_prog_run+0x8b4/0x11a0 [ 12.531935] ? mark_lock+0x382/0x560 [ 12.532229] mark_lock+0x382/0x560 [ 12.532496] ? print_shortest_lock_dependencies+0x180/0x180 [ 12.532928] __lock_acquire+0x521/0x1350 [ 12.533271] ? find_get_entry+0x17f/0x2e0 [ 12.533586] ? find_get_entry+0x19c/0x2e0 [ 12.533902] ? lock_acquire+0x98/0x190 [ 12.534196] lock_acquire+0x98/0x190 [ 12.534482] ? pcpu_freelist_push+0x2a/0x40 [ 12.534810] _raw_spin_lock+0x2f/0x40 [ 12.535099] ? pcpu_freelist_push+0x2a/0x40 [ 12.535432] pcpu_freelist_push+0x2a/0x40 [ 12.535750] bpf_get_stackid+0x494/0x4d0 [ 12.536062] ___bpf_prog_run+0x8b4/0x11a0 It has been explained that is a false positive here: https://lkml.org/lkml/2018/7/25/756 Recap: - stackmap uses pcpu_freelist - The lock in pcpu_freelist is a percpu lock - stackmap is only used by tracing bpf_prog - A tracing bpf_prog cannot be run if another bpf_prog has already been running (ensured by the percpu bpf_prog_active counter). Eric pointed out that this lockdep splats stops other legit lockdep splats in selftests/bpf/test_progs.c. Fix this by calling local_irq_save/restore for stackmap. Another false positive had also been worked around by calling local_irq_save in commit `89ad2fa3f0` ("bpf: fix lockdep splat"). That commit added unnecessary irq_save/restore to fast path of bpf hash map. 
irqs are already disabled at that point, since htab is holding per bucket spin_lock with irqsave. Let's reduce overhead for htab by introducing __pcpu_freelist_push/pop function w/o irqsave and convert pcpu_freelist_push/pop to irqsave to be used elsewhere (right now only in stackmap). It stops lockdep false positive in stackmap with a bit of acceptable overhead. Fixes: `557c0c6e7d` ("bpf: convert stackmap to pre-allocation") Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org> Reported-by: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Sasha Levin <sashal@kernel.org>	2019-03-13 14:02:36 -07:00
Greg Kroah-Hartman	232bd90cf2	relay: check return of create_buf_file() properly [ Upstream commit `2c1cf00eea` ] If create_buf_file() returns an error, don't try to reference it later as a valid dentry pointer. This problem was exposed when debugfs started to return errors instead of just NULL for some calls when they do not succeed properly. Also, the check for WARN_ON(dentry) was just wrong :) Reported-by: Kees Cook <keescook@chromium.org> Reported-and-tested-by: syzbot+16c3a70e1e9b29346c43@syzkaller.appspotmail.com Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: David Rientjes <rientjes@google.com> Fixes: `ff9fb72bc0` ("debugfs: return error values, not NULL") Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2019-03-13 14:02:35 -07:00
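The pitfall fixed by the relay commit above is treating an error-encoded pointer as a valid one. A minimal user-space sketch of the kernel's error-pointer idiom follows; `err_ptr()`, `ptr_err()` and `is_err()` are lowercase re-implementations for illustration (not the kernel's `ERR_PTR`/`PTR_ERR`/`IS_ERR` themselves), and `struct buf`, `buf_attach_file()` and the two `create_*` callbacks are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Encode a negative errno in a pointer, and decode it again. */
static void *err_ptr(long error)     { return (void *)(intptr_t)error; }
static long ptr_err(const void *ptr) { return (long)(intptr_t)ptr; }

/* Error pointers occupy the top MAX_ERRNO values of the address space. */
static int is_err(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

struct buf { void *dentry; };
typedef void *(*create_fn)(void);

/* Store the created file only if creation succeeded; on failure leave
 * b->dentry NULL and propagate the error instead of keeping the
 * error-encoded pointer around as if it were valid. */
static int buf_attach_file(struct buf *b, create_fn create)
{
    void *dentry;

    assert(b != NULL && create != NULL);
    dentry = create();
    if (dentry == NULL || is_err(dentry)) {
        b->dentry = NULL;
        return dentry ? (int)ptr_err(dentry) : -ENOMEM;
    }
    b->dentry = dentry;
    return 0;
}

/* Two stand-in callbacks: one succeeds, one returns an error pointer. */
static int dummy_file;
static void *create_ok(void)   { return &dummy_file; }
static void *create_fail(void) { return err_ptr(-ENOENT); }
```

A check for NULL alone would let the pointer returned by `create_fail()` through, which is why the error-pointer test is needed once the callee starts returning errors instead of NULL.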
Stephane Eranian	6ec0698f1c	perf core: Fix perf_proc_update_handler() bug [ Upstream commit `1a51c5da5a` ] The perf_proc_update_handler() handles /proc/sys/kernel/perf_event_max_sample_rate sysctl variable. When the PMU IRQ handler timing monitoring is disabled, i.e., when /proc/sys/kernel/perf_cpu_time_max_percent is equal to 0 or 100, then no modification to sysctl_perf_event_sample_rate is allowed to prevent possible hang from wrong values. The problem is that the test to prevent modification is made after the sysctl variable is modified in perf_proc_update_handler(). You get an error: $ echo 10001 >/proc/sys/kernel/perf_event_max_sample_rate echo: write error: invalid argument But the value is still modified causing all sorts of inconsistencies: $ cat /proc/sys/kernel/perf_event_max_sample_rate 10001 This patch fixes the problem by moving the parsing of the value after the test. Committer testing: # echo 100 > /proc/sys/kernel/perf_cpu_time_max_percent # echo 10001 > /proc/sys/kernel/perf_event_max_sample_rate -bash: echo: write error: Invalid argument # cat /proc/sys/kernel/perf_event_max_sample_rate 10001 # Signed-off-by: Stephane Eranian <eranian@google.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Jiri Olsa <jolsa@kernel.org> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1547169436-6266-1-git-send-email-eranian@google.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2019-03-13 14:02:26 -07:00
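The ordering problem in the perf commit above (reject the write before storing the value, not after) can be shown with a small user-space sketch. `struct perf_limits` and `set_max_sample_rate()` are hypothetical names; this is not the kernel handler, only the validate-then-commit pattern it was changed to follow.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the two sysctl values involved. */
struct perf_limits {
    int cpu_time_max_percent; /* 0 or 100 means throttling is disabled */
    int max_sample_rate;
};

/* Validate first, then commit: when throttling is disabled the write is
 * refused and max_sample_rate is left untouched. */
static int set_max_sample_rate(struct perf_limits *lim, int new_rate)
{
    assert(lim != NULL);

    if (lim->cpu_time_max_percent == 0 || lim->cpu_time_max_percent == 100)
        return -EINVAL;
    if (new_rate < 1)
        return -EINVAL;

    lim->max_sample_rate = new_rate;
    return 0;
}
```

With `cpu_time_max_percent` set to 100, a call such as `set_max_sample_rate(&lim, 10001)` returns `-EINVAL` and the previous rate is still in place, avoiding the inconsistency described in the commit message.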
Greg Kroah-Hartman	34e9e65731	This is the 4.19.28 stable release -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlyEq/IACgkQONu9yGCS aT77AhAAgqbxHsKsgqh7JV477xEgLqrLKYw+/6Bx+l79fUZoaR3PRwM62UEzikPQ KbvutNuOHOWuJA0Xyj5gqV4SJaqBOkoGNnHFOBi/qjtFUsOlBFrlGpow+0fsP/ay Bmo0LhoTvBub4ap7bJt4pwel/elWYVtOkA1Qgv3OCiDkTorYuPTbIUyuAVOJJbRn sZ1eKi00CQPrN65Rxgci0g0p/m7JWpvW2zqmDNZJuZZeEmSLdrrZGwt5ExiI6oKz CqX/VBGChEesMTEOLsSfRyg6NZW3j4rOUaCzkxDq/Tsh9XqNabhk1jod2p3t7Nmu n5Js3ujfuyCKf5tD49Z8xy5A++nYyJLa5jbFnURr2H/ZJPla0CHuc4RoFf/wFYr4 xQAeA3XXiZZB02n6oZlweUSp7hVnCgLJ4Ev2ctAyPUpyf4ncl+vffzj40bozFvAC adJE1UyEJp0xUuRVdx0I+HyueHcWmRIAgs0iz9B5S2KNc4rYDqE+/t+ddrNqjvSF +C33nMn+7A+ngmQlwWjOBaZhhQn3qWrWU0ACERbIG/DUD2voBYu3oIfQzKsXTt0V erSSDq0KSy73PRKN4Tzf2GnDUAlNITUuyFgWITOg8p29HuXJO00p4MQ7fOMYacyB WdwXFB7islUtUoA8nEedZgF7IF1WRh6Iz2HJ5uMTl6pCMSV+8Dk= =OvW8 -----END PGP SIGNATURE----- Merge 4.19.28 into android-4.19 Changes in 4.19.28 cpufreq: Use struct kobj_attribute instead of struct global_attr staging: erofs: fix mis-acted TAIL merging behavior USB: serial: option: add Telit ME910 ECM composition USB: serial: cp210x: add ID for Ingenico 3070 USB: serial: ftdi_sio: add ID for Hjelmslund Electronics USB485 staging: erofs: fix illegal address access under memory pressure staging: erofs: compressed_pages should not be accessed again after freed staging: comedi: ni_660x: fix missing break in switch statement staging: wilc1000: fix to set correct value for 'vif_num' staging: android: ion: fix sys heap pool's gfp_flags staging: android: ashmem: Don't call fallocate() with ashmem_mutex held. staging: android: ashmem: Avoid range_alloc() allocation with ashmem_mutex held. 
ip6mr: Do not call __IP6_INC_STATS() from preemptible context net: dsa: mv88e6xxx: handle unknown duplex modes gracefully in mv88e6xxx_port_set_duplex net: dsa: mv8e6xxx: fix number of internal PHYs for 88E6x90 family net: sched: put back q.qlen into a single location net-sysfs: Fix mem leak in netdev_register_kobject qmi_wwan: Add support for Quectel EG12/EM12 sctp: call iov_iter_revert() after sending ABORT sky2: Disable MSI on Dell Inspiron 1545 and Gateway P-79 team: Free BPF filter when unregistering netdev tipc: fix RDM/DGRAM connect() regression bnxt_en: Drop oversize TX packets to prevent errors. geneve: correctly handle ipv6.disable module parameter hv_netvsc: Fix IP header checksum for coalesced packets ipv4: Add ICMPv6 support when parse route ipproto lan743x: Fix TX Stall Issue net: dsa: mv88e6xxx: Fix statistics on mv88e6161 net: dsa: mv88e6xxx: Fix u64 statistics netlabel: fix out-of-bounds memory accesses net: netem: fix skb length BUG_ON in __skb_to_sgvec net: nfc: Fix NULL dereference on nfc_llcp_build_tlv fails net: phy: Micrel KSZ8061: link failure after cable connect net: phy: phylink: fix uninitialized variable in phylink_get_mac_state net: sit: fix memory leak in sit_init_net() net: socket: set sock->sk to NULL after calling proto_ops::release() tipc: fix race condition causing hung sendto tun: fix blocking read xen-netback: don't populate the hash cache on XenBus disconnect xen-netback: fix occasional leak of grant ref mappings under memory pressure tun: remove unnecessary memory barrier net: Add __icmp_send helper. 
net: avoid use IPCB in cipso_v4_error ipv4: Return error for RTA_VIA attribute ipv6: Return error for RTA_VIA attribute mpls: Return error for RTA_GATEWAY attribute ipv4: Pass original device to ip_rcv_finish_core net: dsa: mv88e6xxx: power serdes on/off for 10G interfaces on 6390X net: dsa: mv88e6xxx: prevent interrupt storm caused by mv88e6390x_port_set_cmode net/sched: act_ipt: fix refcount leak when replace fails net/sched: act_skbedit: fix refcount leak when replace fails net: sched: act_tunnel_key: fix NULL pointer dereference during init x86/CPU/AMD: Set the CPB bit unconditionally on F17h x86/boot/compressed/64: Do not read legacy ROM on EFI system tracing: Fix event filters and triggers to handle negative numbers usb: xhci: Fix for Enabling USB ROLE SWITCH QUIRK on INTEL_SUNRISEPOINT_LP_XHCI applicom: Fix potential Spectre v1 vulnerabilities MIPS: irq: Allocate accurate order pages for irq stack aio: Fix locking in aio_poll() xtensa: fix get_wchan gnss: sirf: fix premature wakeup interrupt enable USB: serial: cp210x: fix GPIO in autosuspend selftests: firmware: fix verify_reqs() return value Bluetooth: btrtl: Restore old logic to assume firmware is already loaded Bluetooth: Fix locking in bt_accept_enqueue() for BH context exec: Fix mem leak in kernel_read_file scsi: core: reset host byte in DID_NEXUS_FAILURE case bpf: fix sanitation rewrite in case of non-pointers Linux 4.19.28 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2019-03-10 07:23:21 +01:00
Daniel Borkmann	ca490a9873	bpf: fix sanitation rewrite in case of non-pointers commit `3612af783c` upstream. Marek reported that he saw an issue with the below snippet in that timing measurements were off when loaded as unpriv while results were reasonable when loaded as privileged: [...] uint64_t a = bpf_ktime_get_ns(); uint64_t b = bpf_ktime_get_ns(); uint64_t delta = b - a; if ((int64_t)delta > 0) { [...] Turns out there is a bug where a corner case is missing in the fix `d3bd7413e0` ("bpf: fix sanitation of alu op with pointer / scalar type from different paths"), namely fixup_bpf_calls() only checks whether aux has a non-zero alu_state, but it also needs to test for the case of BPF_ALU_NON_POINTER since in both occasions we need to skip the masking rewrite (as there is nothing to mask). Fixes: `d3bd7413e0` ("bpf: fix sanitation of alu op with pointer / scalar type from different paths") Reported-by: Marek Majkowski <marek@cloudflare.com> Reported-by: Arthur Fabre <afabre@cloudflare.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/netdev/CAJPywTJqP34cK20iLM5YmUMz9KXQOdu1-+BZrGMAGgLuBWz7fg@mail.gmail.com/T/ Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2019-03-10 07:17:22 +01:00
Pavel Tikhomirov	690e939da7	tracing: Fix event filters and triggers to handle negative numbers commit `6a072128d2` upstream. When tracing syscall exit events it is extremely useful to filter exit codes equal to some negative value, to react only to required errors. But negative numbers do not work: [root@snorch sys_exit_read]# echo "ret == -1" > filter bash: echo: write error: Invalid argument [root@snorch sys_exit_read]# cat filter ret == -1 ^ parse_error: Invalid value (did you forget quotes)? Similar thing happens when setting triggers. This is a regression in v4.17 introduced by the commit mentioned below, testing without this commit shows no problem with negative numbers. Link: http://lkml.kernel.org/r/20180823102534.7642-1-ptikhomirov@virtuozzo.com Cc: stable@vger.kernel.org Fixes: `80765597bc` ("tracing: Rewrite filter logic to be simpler and faster") Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2019-03-10 07:17:20 +01:00
Connor O'Brien	fe90216264	ANDROID: proc: Add /proc/uid directory Add support for reporting per-uid information through procfs, roughly following the approach used for per-tid and per-tgid directories in fs/proc/base.c. This also entails some new tracking of which uids have been used, to avoid losing information when the last task with a given uid exits. Bug: 72339335 Bug: 127641090 Test: ls /proc/uid/; compare with UIDs in /proc/uid_time_in_state Change-Id: I0908f0c04438b11ceb673d860e58441bf503d478 Signed-off-by: Connor O'Brien <connoro@google.com> [AmitP: Fix proc_fill_cache() now that upstream commit `0168b9e38c` ("procfs: switch instantiate_t to d_splice_alias()"), switched instantiate() callback to d_splice_alias()] Signed-off-by: Amit Pundir <amit.pundir@linaro.org> [astrachan: Folded 97b7790f505e ("ANDROID: proc: fix undefined behavior in proc_uid_base_readdir") into this change] Signed-off-by: Alistair Strachan <astrachan@google.com>	2019-03-06 15:59:21 +00:00
Connor O'Brien	406d53f0c7	ANDROID: cpufreq: track per-task time in state Add time in state data to task structs, and create /proc/<pid>/time_in_state files to show how long each individual task has run at each frequency. Create a CONFIG_CPU_FREQ_TIMES option to enable/disable this tracking. Bug: 72339335 Bug: 127641090 Test: Read /proc/<pid>/time_in_state Change-Id: Ia6456754f4cb1e83b2bc35efa8fbe9f8696febc8 Signed-off-by: Connor O'Brien <connoro@google.com> [astrachan: Folded the following changes into this patch: a6d3de6a7fba ("ANDROID: Reduce use of #ifdef CONFIG_CPU_FREQ_TIMES") b89ada5d9c09 ("ANDROID: Fix massive cpufreq_times memory leaks")] Signed-off-by: Alistair Strachan <astrachan@google.com>	2019-03-06 15:57:25 +00:00
Greg Kroah-Hartman	36d178b3bc	This is the 4.19.27 stable release -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlx+qs4ACgkQONu9yGCS aT4OQQ/8DMkQa61GLVtL1Zm0i2GW+fpgAJj59QP7jRdilolP7BBnrX2Ne26Dx+X0 bYLJDJw6WexsvLKOkSMol4Y3Q3NsBguC+MasVrrWBHpfntizlaPzsYuTKkyNRG0c vjOYyTeAxElWqEh1WaVQhk/juhBbrTtkZIK63h9M60KYb0qHA7TY9oTGokEA8WA0 11Yuge2aV6cR8zup1coWR9NRvW/uealiENAhr0Jw1ZRtrBqwzQoFyn5psXdbzkjb HrwvDa/nkwfJuc/1+grOTF5HfI7D5+giq0iRtvBuChjDwqfJ2IrVRPF/DKSmR0Oz Sq1ILUdj61gd2hp4N9zpUE87r2tV2ABg/yZrtDcKFhCiUWnfaaGArn7mm/R7OhCD PjRDz7acVvKq3LLenM6xcFsAWUdjOoUWcGk70eO/ZZctW8o+HozxFcVrNmT/75zJ LcagplvYLiLB4Gard6niNH8qunmq/jD+GkgdKtF9GFHOjG1QMdQ9d/VxeJWGAjtA nE8ZX9EFYgmq14OR+YK0DNOOFBZe10lcR9rMIFp6T9AanY7rVT7LqqG6BUXb226P mJmNS9sSJhictucyyf/ej9j0jULBBsAYJA5+fkECHIioewACjFGn1dZfm3BBrDAb OoNIwaLmrvjh2iKwsj17OK9Fv8PG6VKwYRFQ/I3OGVouZUX5SEk= =edyH -----END PGP SIGNATURE----- Merge 4.19.27 into android-4.19 Changes in 4.19.27 irq/matrix: Split out the CPU selection code into a helper irq/matrix: Spread managed interrupts on allocation genirq/matrix: Improve target CPU selection for managed interrupts. 
mac80211: Change default tx_sk_pacing_shift to 7 scsi: libsas: Fix rphy phy_identifier for PHYs with end devices attached drm/msm: Unblock writer if reader closes file ASoC: Intel: Haswell/Broadwell: fix setting for .dynamic field ALSA: compress: prevent potential divide by zero bugs ASoC: Variable "val" in function rt274_i2c_probe() could be uninitialized clk: tegra: dfll: Fix a potential Oop in remove() clk: sysfs: fix invalid JSON in clk_dump clk: vc5: Abort clock configuration without upstream clock thermal: int340x_thermal: Fix a NULL vs IS_ERR() check usb: dwc3: gadget: synchronize_irq dwc irq in suspend usb: dwc3: gadget: Fix the uninitialized link_state when udc starts usb: gadget: Potential NULL dereference on allocation error selftests: rtc: rtctest: fix alarm tests selftests: rtc: rtctest: add alarm test on minute boundary genirq: Make sure the initial affinity is not empty x86/mm/mem_encrypt: Fix erroneous sizeof() ASoC: rt5682: Fix PLL source register definitions ASoC: dapm: change snprintf to scnprintf for possible overflow ASoC: imx-audmux: change snprintf to scnprintf for possible overflow selftests/vm/gup_benchmark.c: match gup struct to kernel phy: ath79-usb: Fix the power on error path phy: ath79-usb: Fix the main reset name to match the DT binding selftests: seccomp: use LDLIBS instead of LDFLAGS selftests: gpio-mockup-chardev: Check asprintf() for error irqchip/gic-v3-mbi: Fix uninitialized mbi_lock ARC: fix __ffs return value to avoid build warnings ARC: show_regs: lockdep: avoid page allocator... 
drivers: thermal: int340x_thermal: Fix sysfs race condition staging: rtl8723bs: Fix build error with Clang when inlining is disabled mac80211: fix miscounting of ttl-dropped frames sched/wait: Fix rcuwait_wake_up() ordering sched/wake_q: Fix wakeup ordering for wake_q futex: Fix (possible) missed wakeup locking/rwsem: Fix (possible) missed wakeup drm/amd/powerplay: OD setting fix on Vega10 tty: serial: qcom_geni_serial: Allow mctrl when flow control is disabled serial: fsl_lpuart: fix maximum acceptable baud rate with over-sampling drm/sun4i: hdmi: Fix usage of TMDS clock staging: android: ion: Support cpu access during dma_buf_detach direct-io: allow direct writes to empty inodes writeback: synchronize sync(2) against cgroup writeback membership switches scsi: lpfc: nvme: avoid hang / use-after-free when destroying localport scsi: lpfc: nvmet: avoid hang / use-after-free when destroying targetport scsi: csiostor: fix NULL pointer dereference in csio_vport_set_state() net: altera_tse: fix connect_local_phy error path hv_netvsc: Fix ethtool change hash key error hv_netvsc: Refactor assignments of struct netvsc_device_info hv_netvsc: Fix hash key value reset after other ops nvme-rdma: fix timeout handler nvme-multipath: drop optimization for static ANA group IDs drm/msm: Fix A6XX support for opp-level net: usb: asix: ax88772_bind return error when hw_reset fail net: dev_is_mac_header_xmit() true for ARPHRD_RAWIP ibmveth: Do not process frames after calling napi_reschedule mac80211: don't initiate TDLS connection if station is not associated to AP mac80211: Add attribute aligned(2) to struct 'action' cfg80211: extend range deviation for DMG svm: Fix AVIC incomplete IPI emulation KVM: nSVM: clear events pending from svm_complete_interrupts() when exiting to L1 kvm: selftests: Fix region overlap check in kvm_util mmc: spi: Fix card detection during probe mmc: tmio_mmc_core: don't claim spurious interrupts mmc: tmio: fix access width of Block Count Register mmc: core: 
Fix NULL ptr crash from mmc_should_fail_request mmc: cqhci: fix space allocated for transfer descriptor mmc: cqhci: Fix a tiny potential memory leak on error condition mmc: sdhci-esdhc-imx: correct the fix of ERR004536 mm: enforce min addr even if capable() in expand_downwards() drm: Block fb changes for async plane updates hugetlbfs: fix races and page leaks during migration MIPS: fix truncation in __cmpxchg_small for short values MIPS: BCM63XX: provide DMA masks for ethernet devices MIPS: eBPF: Fix icache flush end address x86/uaccess: Don't leak the AC flag into __put_user() value evaluation Linux 4.19.27 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>	2019-03-05 18:07:53 +01:00
Xie Yongji	9ad6216e8c	locking/rwsem: Fix (possible) missed wakeup [ Upstream commit `e158488be2` ] Because wake_q_add() can imply an immediate wakeup (cmpxchg failure case), we must not rely on the wakeup being delayed. However, commit: `e38513905e` ("locking/rwsem: Rework zeroing reader waiter->task") relies on exactly that behaviour in that the wakeup must not happen until after we clear waiter->task. [ peterz: Added changelog. ] Signed-off-by: Xie Yongji <xieyongji@baidu.com> Signed-off-by: Zhang Yu <zhangyu31@baidu.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: `e38513905e` ("locking/rwsem: Rework zeroing reader waiter->task") Link: https://lkml.kernel.org/r/1543495830-2644-1-git-send-email-xieyongji@baidu.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2019-03-05 17:58:49 +01:00
Peter Zijlstra	2368e6d3bc	futex: Fix (possible) missed wakeup [ Upstream commit `b061c38bef` ] We must not rely on wake_q_add() to delay the wakeup; in particular commit: `1d0dcb3ad9` ("futex: Implement lockless wakeups") moved wake_q_add() before smp_store_release(&q->lock_ptr, NULL), which could result in futex_wait() waking before observing ->lock_ptr == NULL and going back to sleep again. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: `1d0dcb3ad9` ("futex: Implement lockless wakeups") Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2019-03-05 17:58:49 +01:00
Peter Zijlstra	653a1dbcb0	sched/wake_q: Fix wakeup ordering for wake_q [ Upstream commit `4c4e373156` ] Notable cmpxchg() does not provide ordering when it fails, however wake_q_add() requires ordering in this specific case too. Without this it would be possible for the concurrent wakeup to not observe our prior state. Andrea Parri provided: C wake_up_q-wake_q_add { int next = 0; int y = 0; } P0(int *next, int *y) { int r0; /* in wake_up_q() */ WRITE_ONCE(*next, 1); /* node->next = NULL */ smp_mb(); /* implied by wake_up_process() */ r0 = READ_ONCE(*y); } P1(int *next, int *y) { int r1; /* in wake_q_add() */ WRITE_ONCE(*y, 1); /* wake_cond = true */ smp_mb__before_atomic(); r1 = cmpxchg_relaxed(next, 1, 2); } exists (0:r0=0 /\ 1:r1=0) This "exists" clause cannot be satisfied according to the LKMM: Test wake_up_q-wake_q_add Allowed States 3 0:r0=0; 1:r1=1; 0:r0=1; 1:r1=0; 0:r0=1; 1:r1=1; No Witnesses Positive: 0 Negative: 3 Condition exists (0:r0=0 /\ 1:r1=0) Observation wake_up_q-wake_q_add Never 0 3 Reported-by: Yongji Xie <elohimes@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Waiman Long <longman@redhat.com> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2019-03-05 17:58:49 +01:00
Prateek Sood	5024f0a29a	sched/wait: Fix rcuwait_wake_up() ordering [ Upstream commit `6dc080eeb2` ] For some peculiar reason rcuwait_wake_up() has the right barrier in the comment, but not in the code. This mistake has been observed to cause a deadlock in the following situation: P1 P2 percpu_up_read() percpu_down_write() rcu_sync_is_idle() // false rcu_sync_enter() ... __percpu_up_read() [S] ,- __this_cpu_dec(*sem->read_count) | smp_rmb(); [L] | task = rcu_dereference(w->task) // NULL | | [S] w->task = current | smp_mb(); | [L] readers_active_check() // fail `-> <store happens here> Where the smp_rmb() (obviously) fails to constrain the store. [ peterz: Added changelog. ] Signed-off-by: Prateek Sood <prsood@codeaurora.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Andrea Parri <andrea.parri@amarulasolutions.com> Acked-by: Davidlohr Bueso <dbueso@suse.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: `8f95c90ceb` ("sched/wait, RCU: Introduce rcuwait machinery") Link: https://lkml.kernel.org/r/1543590656-7157-1-git-send-email-prsood@codeaurora.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2019-03-05 17:58:49 +01:00


28,672 commits