-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAl1GibIACgkQONu9yGCS
aT7z2hAAmv8AsH9IG43m7t6zLroJVswr/9594xk7yPBQgcY3/PW2aTFBCFbsdOL4
yXcj2PSwRiq9K6qAJULrvOvncR9fIILHqzWzyXnoaZ30lR/FxaaFmuHZX/5Ix1tB
e5EEE/EA49UAEjEDaMLq8g2IvibsReDxmSpnXyBJWoyRAdFIElVnMJ2+zvP/wRhF
NKzQj/bj/qecCbis2lUCaVWJFZ6+P/52UbD8lvIwqR3nk2TKsGDcLU6eY3yg4KrB
rEHl5T8KIPrkX3KNIEB8EcFREene+rdpZLLVe4fYwf+gOqfiFXSzZZvweauMkplq
ehlVHkykvQvlsVM2tjBD379z3C4aasZDuMVNMCbAy2FlruLeBQ7gEn77mCJB9VH5
/n/mlc2yizdoowtARCLWOUMfASpdSbqu2SQ7A/3kwG7l6GrpzKSIU2nQgm+41sUZ
QJVtZ3IYsPoYjnU4B3JZzgJnf3M9jcRz/3JegviqhSEbF1gaScJX0cqN8C1idN/v
ZAGCJK9S20/EEEsp5jn+bq2grUehvmD4TVDfot4P+5yRYyBIhMFpbM2RpjydOpwy
+x8D1Q34LYPFgZfQ0vF62vcSBhMBiJ/7j41rUeo44K+Lg00F3yCOyL6FxK6S8h6j
wsD0xLbllMrhV5KRYFizb3QbCHoHYiROIJk76uLvB+Tqq2Jg9VQ=
=qIi2
-----END PGP SIGNATURE-----
Merge 4.19.64 into android-4.19
Changes in 4.19.64
hv_sock: Add support for delayed close
vsock: correct removal of socket from the list
NFS: Fix dentry revalidation on NFSv4 lookup
NFS: Refactor nfs_lookup_revalidate()
NFSv4: Fix lookup revalidate of regular files
usb: dwc2: Disable all EP's on disconnect
usb: dwc2: Fix disable all EP's on disconnect
arm64: compat: Provide definition for COMPAT_SIGMINSTKSZ
binder: fix possible UAF when freeing buffer
ISDN: hfcsusb: checking idx of ep configuration
media: au0828: fix null dereference in error path
ath10k: Change the warning message string
media: cpia2_usb: first wake up, then free in disconnect
media: pvrusb2: use a different format for warnings
NFS: Cleanup if nfs_match_client is interrupted
media: radio-raremono: change devm_k*alloc to k*alloc
iommu/vt-d: Don't queue_iova() if there is no flush queue
iommu/iova: Fix compilation error with !CONFIG_IOMMU_IOVA
Bluetooth: hci_uart: check for missing tty operations
vhost: introduce vhost_exceeds_weight()
vhost_net: fix possible infinite loop
vhost: vsock: add weight support
vhost: scsi: add weight support
sched/fair: Don't free p->numa_faults with concurrent readers
sched/fair: Use RCU accessors consistently for ->numa_group
/proc/<pid>/cmdline: remove all the special cases
/proc/<pid>/cmdline: add back the setproctitle() special case
drivers/pps/pps.c: clear offset flags in PPS_SETPARAMS ioctl
Fix allyesconfig output.
ceph: hold i_ceph_lock when removing caps for freeing inode
block, scsi: Change the preempt-only flag into a counter
scsi: core: Avoid that a kernel warning appears during system resume
ip_tunnel: allow not to count pkts on tstats by setting skb's dev to NULL
Linux 4.19.64
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I3e9055b677bd8ad9d5070307fae0bc765d444e9d
commit 16d51a590a upstream.
When going through execve(), zero out the NUMA fault statistics instead of
freeing them.
During execve, the task is reachable through procfs and the scheduler. A
concurrent /proc/*/sched reader can read data from a freed ->numa_faults
allocation (confirmed by KASAN) and write it back to userspace.
I believe that it would also be possible for a use-after-free read to occur
through a race between a NUMA fault and execve(): task_numa_fault() can
lead to task_numa_compare(), which invokes task_weight() on the currently
running task of a different CPU.
Another way to fix this would be to make ->numa_faults RCU-managed or add
extra locking, but it seems easier to wipe the NUMA fault statistics on
execve.
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Fixes: 82727018b0 ("sched/numa: Call task_numa_free() from do_execve()")
Link: https://lkml.kernel.org/r/20190716152047.14424-1-jannh@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAl0Nx3oACgkQONu9yGCS
aT7hJRAAzdu1EKr/VUiFJshvPL/+veH1MLdMgafW6gXcoQCQSdMuv1j3mv3axElC
dz2e/nHXvIhMKMrhzgEvVwV8m8eKPwM4IZjTJ8ji8st9uy0R2UIdbPUHrLtRAQzI
bK2BIk6697B9/9z2s8nDXhKex9EtZ2vj8DeQLT1AgQKMM0+abPA2YR9+cp9HEjQR
Wa5NjajwWgpty1VVk+Q6GMD+GBnEILyqxtVGv30z/prWoVadFAXJEZrWsaXotUvh
ySc2CS9Iu/muiIegTSStnNXaWaawZblA4JdDcdRJ235rPK8sbACYtDgZispYopi4
y/9kUtuZzWEv61aFZkXd13jKZ8pJsrLZLoc+yDqew0IA6M2Hz7d7Pq/UWQ0m4qnt
rCcU1vTFooSp/IHOToXN0oQQrTqD2oK5zO4V9S5McwQniuO/RqUPJfX/sKjtrvNT
SI/yLVKMXImAwVXxNEfO4bE+T9F/QYptOHdpfqJkPvpKLzjxnPBPswG8poIKwQzJ
w3p1AoQ1ISuW6b94a/nI5nSC28gVslVg+oTjRjF7kfIcOtvglX79XPvfdM5bB2DC
qD51A8veEGCtAu57/tSyisGOomqTqqqbaiYXwuhgTI1NSszbqKDLNvjsw1GKj0rK
mSmDzVMgQhxaHOagOeDqLYlj1zgTamx5R6+dretf8AOXyEE2RSg=
=+MJy
-----END PGP SIGNATURE-----
Merge 4.19.54 into android-4.19
Changes in 4.19.54
ax25: fix inconsistent lock state in ax25_destroy_timer
be2net: Fix number of Rx queues used for flow hashing
hv_netvsc: Set probe mode to sync
ipv6: flowlabel: fl6_sock_lookup() must use atomic_inc_not_zero
lapb: fixed leak of control-blocks.
neigh: fix use-after-free read in pneigh_get_next
net: dsa: rtl8366: Fix up VLAN filtering
net: openvswitch: do not free vport if register_netdevice() is failed.
nfc: Ensure presence of required attributes in the deactivate_target handler
sctp: Free cookie before we memdup a new one
sunhv: Fix device naming inconsistency between sunhv_console and sunhv_reg
tipc: purge deferredq list for each grp member in tipc_group_delete
vsock/virtio: set SOCK_DONE on peer shutdown
net/mlx5: Avoid reloading already removed devices
net: mvpp2: prs: Fix parser range for VID filtering
net: mvpp2: prs: Use the correct helpers when removing all VID filters
Staging: vc04_services: Fix a couple error codes
perf/x86/intel/ds: Fix EVENT vs. UEVENT PEBS constraints
netfilter: nf_queue: fix reinject verdict handling
ipvs: Fix use-after-free in ip_vs_in
selftests: netfilter: missing error check when setting up veth interface
clk: ti: clkctrl: Fix clkdm_clk handling
powerpc/powernv: Return for invalid IMC domain
usb: xhci: Fix a potential null pointer dereference in xhci_debugfs_create_endpoint()
mISDN: make sure device name is NUL terminated
x86/CPU/AMD: Don't force the CPB cap when running under a hypervisor
perf/ring_buffer: Fix exposing a temporarily decreased data_head
perf/ring_buffer: Add ordering to rb->nest increment
perf/ring-buffer: Always use {READ,WRITE}_ONCE() for rb->user_page data
gpio: fix gpio-adp5588 build errors
net: stmmac: update rx tail pointer register to fix rx dma hang issue.
net: tulip: de4x5: Drop redundant MODULE_DEVICE_TABLE()
ACPI/PCI: PM: Add missing wakeup.flags.valid checks
drm/etnaviv: lock MMU while dumping core
net: aquantia: tx clean budget logic error
net: aquantia: fix LRO with FCS error
i2c: dev: fix potential memory leak in i2cdev_ioctl_rdwr
ALSA: hda - Force polling mode on CNL for fixing codec communication
configfs: Fix use-after-free when accessing sd->s_dentry
perf data: Fix 'strncat may truncate' build failure with recent gcc
perf namespace: Protect reading thread's namespace
perf record: Fix s390 missing module symbol and warning for non-root users
ia64: fix build errors by exporting paddr_to_nid()
xen/pvcalls: Remove set but not used variable
xenbus: Avoid deadlock during suspend due to open transactions
KVM: PPC: Book3S: Use new mutex to synchronize access to rtas token list
KVM: PPC: Book3S HV: Don't take kvm->lock around kvm_for_each_vcpu
arm64: fix syscall_fn_t type
arm64: use the correct function type in SYSCALL_DEFINE0
arm64: use the correct function type for __arm64_sys_ni_syscall
net: sh_eth: fix mdio access in sh_eth_close() for R-Car Gen2 and RZ/A1 SoCs
net: phylink: ensure consistent phy interface mode
net: phy: dp83867: Set up RGMII TX delay
scsi: libcxgbi: add a check for NULL pointer in cxgbi_check_route()
scsi: smartpqi: properly set both the DMA mask and the coherent DMA mask
scsi: scsi_dh_alua: Fix possible null-ptr-deref
scsi: libsas: delete sas port if expander discover failed
mlxsw: spectrum: Prevent force of 56G
ocfs2: fix error path kobject memory leak
coredump: fix race condition between collapse_huge_page() and core dumping
Abort file_remove_privs() for non-reg. files
Linux 4.19.54
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit 59ea6d06cf upstream.
When fixing the race conditions between the coredump and the mmap_sem
holders outside the context of the process, we focused on
mmget_not_zero()/get_task_mm() callers in 04f5866e41 ("coredump: fix
race condition between mmget_not_zero()/get_task_mm() and core
dumping"), but those aren't the only cases where the mmap_sem can be
taken outside of the context of the process as Michal Hocko noticed
while backporting that commit to older -stable kernels.
If mmgrab() is called in the context of the process, but then the
mm_count reference is transferred outside the context of the process,
that can also be a problem if the mmap_sem has to be taken for writing
through that mm_count reference.
khugepaged registration calls mmgrab() in the context of the process,
but the mmap_sem for writing is taken later in the context of the
khugepaged kernel thread.
collapse_huge_page() after taking the mmap_sem for writing doesn't
modify any vma, so it's not obvious that it could cause a problem to the
coredump, but it happens to modify the pmd in a way that breaks an
invariant that pmd_trans_huge_lock() relies upon. collapse_huge_page()
needs the mmap_sem for writing just to block concurrent page faults that
call pmd_trans_huge_lock().
Specifically the invariant that "!pmd_trans_huge()" cannot become a
"pmd_trans_huge()" doesn't hold while collapse_huge_page() runs.
The coredump will call __get_user_pages() without mmap_sem for reading,
which eventually can invoke a lockless page fault which will need a
functional pmd_trans_huge_lock().
So collapse_huge_page() needs to use mmget_still_valid() to check it's
not running concurrently with the coredump... as long as the coredump
can invoke page faults without holding the mmap_sem for reading.
This has "Fixes: khugepaged" to facilitate backporting, but in my view
it's more a bug in the coredump code that will eventually have to be
rewritten to stop invoking page faults without the mmap_sem for reading.
So the long term plan is still to drop all mmget_still_valid().
Link: http://lkml.kernel.org/r/20190607161558.32104-1-aarcange@redhat.com
Fixes: ba76149f47 ("thp: khugepaged")
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Michal Hocko <mhocko@suse.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit fcfc2aa018 ]
There are a few system calls (pselect, ppoll, etc) which replace a task
sigmask while they are running in a kernel-space
When a task calls one of these syscalls, the kernel saves a current
sigmask in task->saved_sigmask and sets a syscall sigmask.
On syscall-exit-stop, ptrace traps a task before restoring the
saved_sigmask, so PTRACE_GETSIGMASK returns the syscall sigmask and
PTRACE_SETSIGMASK does nothing, because its sigmask is replaced by
saved_sigmask, when the task returns to user-space.
This patch fixes this problem. PTRACE_GETSIGMASK returns saved_sigmask
if it's set. PTRACE_SETSIGMASK drops the TIF_RESTORE_SIGMASK flag.
Link: http://lkml.kernel.org/r/20181120060616.6043-1-avagin@gmail.com
Fixes: 29000caecb ("ptrace: add ability to get/set signal-blocked mask")
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin (Microsoft) <sashal@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlzEBokACgkQONu9yGCS
aT7G7w/8C93URGM67H7ynkCHTo8y3hkRE2rUJPckJNdS+IJKuecmOphak4tF0h07
qPWDPya70Q1S0cNu661TuVAGrhmE5jBx8/xfZaAOeaaU0xtZive+TfSHdAQQaHct
tDk32O85N1aZ49rDEz9ibr7CGLVFDZtyhxV5gFMYQpjbqA7MzJC61zQg1jHyPSCz
sKjQzW+uXMuSLru8jXHMvp41K5sFFp5gYdQbAVKlWtt79qPxWdxZPJbLbM0LBbtz
XHt9E45Ink3ALF9P6tZ4e6gi4zzlNbh9yR92+X5NK5/8AP57yWba4W9JHWIfMBpC
yyDYTOEAzdxqa2Jrgwr4WTdKH6U7FbQZFmWfTBB4VotbHLBWkVXj0OnF10qxP9eQ
p5wGDTJAlWezhX1BTCfYroglDsvqhj+gHfwHzDRF1Del1dRgydRMQc0qLD1d9tul
ovzwOkx1xyJrM2wq05I5gc0FoVyOL6/KCwqMrpVfKa3WKY7Uttjgf56bMqdIIkns
i/6opzF+wtvwlLlCoXgYPXdm6kbWdgvS+skVHfWcHmZFMuGrFGGzJNwzXb7qnVjK
T0hD1OestsfTyD/amnDNYkNeCkoOZqtHAi+xYOQR4kGY5cxP1lQJf85MgAy6RZSY
h+rjys76Qf6+hTCtrowLr8SgksX4ACWxm+UarfAiiNnnDXwGfu8=
=SrFV
-----END PGP SIGNATURE-----
Merge 4.19.37 into android-4.19
Changes in 4.19.37
bonding: fix event handling for stacked bonds
failover: allow name change on IFF_UP slave interfaces
net: atm: Fix potential Spectre v1 vulnerabilities
net: bridge: fix per-port af_packet sockets
net: bridge: multicast: use rcu to access port list from br_multicast_start_querier
net: Fix missing meta data in skb with vlan packet
net: fou: do not use guehdr after iptunnel_pull_offloads in gue_udp_recv
tcp: tcp_grow_window() needs to respect tcp_space()
team: set slave to promisc if team is already in promisc mode
tipc: missing entries in name table of publications
vhost: reject zero size iova range
ipv4: recompile ip options in ipv4_link_failure
ipv4: ensure rcu_read_lock() in ipv4_link_failure()
net: thunderx: raise XDP MTU to 1508
net: thunderx: don't allow jumbo frames with XDP
net/mlx5: FPGA, tls, hold rcu read lock a bit longer
net/tls: prevent bad memory access in tls_is_sk_tx_device_offloaded()
net/mlx5: FPGA, tls, idr remove on flow delete
route: Avoid crash from dereferencing NULL rt->from
sch_cake: Use tc_skb_protocol() helper for getting packet protocol
sch_cake: Make sure we can write the IP header before changing DSCP bits
nfp: flower: replace CFI with vlan present
nfp: flower: remove vlan CFI bit from push vlan action
sch_cake: Simplify logic in cake_select_tin()
net: IP defrag: encapsulate rbtree defrag code into callable functions
net: IP6 defrag: use rbtrees for IPv6 defrag
net: IP6 defrag: use rbtrees in nf_conntrack_reasm.c
CIFS: keep FileInfo handle live during oplock break
cifs: Fix use-after-free in SMB2_write
cifs: Fix use-after-free in SMB2_read
cifs: fix handle leak in smb2_query_symlink()
KVM: x86: Don't clear EFER during SMM transitions for 32-bit vCPU
KVM: x86: svm: make sure NMI is injected after nmi_singlestep
Staging: iio: meter: fixed typo
staging: iio: ad7192: Fix ad7193 channel address
iio: gyro: mpu3050: fix chip ID reading
iio/gyro/bmg160: Use millidegrees for temperature scale
iio:chemical:bme680: Fix, report temperature in millidegrees
iio:chemical:bme680: Fix SPI read interface
iio: cros_ec: Fix the maths for gyro scale calculation
iio: ad_sigma_delta: select channel when reading register
iio: dac: mcp4725: add missing powerdown bits in store eeprom
iio: Fix scan mask selection
iio: adc: at91: disable adc channel interrupt in timeout case
iio: core: fix a possible circular locking dependency
io: accel: kxcjk1013: restore the range after resume.
staging: most: core: use device description as name
staging: comedi: vmk80xx: Fix use of uninitialized semaphore
staging: comedi: vmk80xx: Fix possible double-free of ->usb_rx_buf
staging: comedi: ni_usb6501: Fix use of uninitialized mutex
staging: comedi: ni_usb6501: Fix possible double-free of ->usb_rx_buf
ALSA: hda/realtek - add two more pin configuration sets to quirk table
ALSA: core: Fix card races between register and disconnect
Input: elan_i2c - add hardware ID for multiple Lenovo laptops
serial: sh-sci: Fix HSCIF RX sampling point adjustment
serial: sh-sci: Fix HSCIF RX sampling point calculation
vt: fix cursor when clearing the screen
scsi: core: set result when the command cannot be dispatched
Revert "scsi: fcoe: clear FC_RP_STARTED flags when receiving a LOGO"
Revert "svm: Fix AVIC incomplete IPI emulation"
coredump: fix race condition between mmget_not_zero()/get_task_mm() and core dumping
ipmi: fix sleep-in-atomic in free_user at cleanup SRCU user->release_barrier
crypto: x86/poly1305 - fix overflow during partial reduction
drm/ttm: fix out-of-bounds read in ttm_put_pages() v2
arm64: futex: Restore oldval initialization to work around buggy compilers
x86/kprobes: Verify stack frame on kretprobe
kprobes: Mark ftrace mcount handler functions nokprobe
kprobes: Fix error check when reusing optimized probes
rt2x00: do not increment sequence number while re-transmitting
mac80211: do not call driver wake_tx_queue op during reconfig
drm/amdgpu/gmc9: fix VM_L2_CNTL3 programming
perf/x86/amd: Add event map for AMD Family 17h
x86/cpu/bugs: Use __initconst for 'const' init data
perf/x86: Fix incorrect PEBS_REGS
x86/speculation: Prevent deadlock on ssb_state::lock
timers/sched_clock: Prevent generic sched_clock wrap caused by tick_freeze()
nfit/ars: Remove ars_start_flags
nfit/ars: Introduce scrub_flags
nfit/ars: Allow root to busy-poll the ARS state machine
nfit/ars: Avoid stale ARS results
mmc: sdhci: Fix data command CRC error handling
mmc: sdhci: Rename SDHCI_ACMD12_ERR and SDHCI_INT_ACMD12ERR
mmc: sdhci: Handle auto-command errors
modpost: file2alias: go back to simple devtable lookup
modpost: file2alias: check prototype of handler
tpm/tpm_i2c_atmel: Return -E2BIG when the transfer is incomplete
tpm: Fix the type of the return value in calc_tpm2_event_size()
Revert "kbuild: use -Oz instead of -Os when using clang"
sched/fair: Limit sched_cfs_period_timer() loop to avoid hard lockup
device_cgroup: fix RCU imbalance in error case
mm/vmstat.c: fix /proc/vmstat format for CONFIG_DEBUG_TLBFLUSH=y CONFIG_SMP=n
ALSA: info: Fix racy addition/deletion of nodes
percpu: stop printing kernel addresses
tools include: Adopt linux/bits.h
ASoC: rockchip: add missing INTERLEAVED PCM attribute
i2c-hid: properly terminate i2c_hid_dmi_desc_override_table[] array
Revert "locking/lockdep: Add debug_locks check in __lock_downgrade()"
kernel/sysctl.c: fix out-of-bounds access when setting file-max
Linux 4.19.37
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit 04f5866e41 upstream.
The core dumping code has always run without holding the mmap_sem for
writing, despite that is the only way to ensure that the entire vma
layout will not change from under it. Only using some signal
serialization on the processes belonging to the mm is not nearly enough.
This was pointed out earlier. For example in Hugh's post from Jul 2017:
https://lkml.kernel.org/r/alpine.LSU.2.11.1707191716030.2055@eggly.anvils
"Not strictly relevant here, but a related note: I was very surprised
to discover, only quite recently, how handle_mm_fault() may be called
without down_read(mmap_sem) - when core dumping. That seems a
misguided optimization to me, which would also be nice to correct"
In particular because the growsdown and growsup can move the
vm_start/vm_end the various loops the core dump does around the vma will
not be consistent if page faults can happen concurrently.
Pretty much all users calling mmget_not_zero()/get_task_mm() and then
taking the mmap_sem had the potential to introduce unexpected side
effects in the core dumping code.
Adding mmap_sem for writing around the ->core_dump invocation is a
viable long term fix, but it requires removing all copy user and page
faults and to replace them with get_dump_page() for all binary formats
which is not suitable as a short term fix.
For the time being this solution manually covers the places that can
confuse the core dump either by altering the vma layout or the vma flags
while it runs. Once ->core_dump runs under mmap_sem for writing the
function mmget_still_valid() can be dropped.
Allowing mmap_sem protected sections to run in parallel with the
coredump provides some minor parallelism advantage to the swapoff code
(which seems to be safe enough by never mangling any vma field and can
keep doing swapins in parallel to the core dumping) and to some other
corner case.
In order to facilitate the backporting I added "Fixes: 86039bd3b4"
however the side effect of this same race condition in /proc/pid/mem
should be reproducible since before 2.6.12-rc2 so I couldn't add any
other "Fixes:" because there's no hash beyond the git genesis commit.
Because find_extend_vma() is the only location outside of the process
context that could modify the "mm" structures under mmap_sem for
reading, by adding the mmget_still_valid() check to it, all other cases
that take the mmap_sem for reading don't need the new check after
mmget_not_zero()/get_task_mm(). The expand_stack() in page fault
context also doesn't need the new check, because all tasks under core
dumping are frozen.
Link: http://lkml.kernel.org/r/20190325224949.11068-1-aarcange@redhat.com
Fixes: 86039bd3b4 ("userfaultfd: add new syscall to provide memory externalization")
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Jann Horn <jannh@google.com>
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Jann Horn <jannh@google.com>
Acked-by: Jason Gunthorpe <jgg@mellanox.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlynu40ACgkQONu9yGCS
aT5X6g//Wkfm/+qSZ0GhLDQkPniiH1QkvzhOmVrrxu+KB0qsiwsEl8Srw33ZVkJK
LT8+IPGiG9jEGu9dj+BYXTIfy9ZvfSsEL2N6GhYwDSXP0fok2rUaHbZvv1IB2g4W
afhGdNwNAUCJ/j1UrUsi+SAFJ+xWbVxFpGstd0cqM9IbKdEV7RIukvuKckHiKOKR
qI8FxC+G2PAr+BtnETfk5/suPDJ7B3ZicDoMhiWJGxJ6dfFTVmkSmasSoPDaMiHm
4S3hN2lu+WTeRpRPPB17Dlk4MmIp0k+bGYBKAlaxAMCc/RZxvbT2pRYaMQbId2/L
mNUfSnOQFGEAhlAPfb7wdbObphnyT34GhlkWfZBTrnhPO0/FomLOvU6xVdcNuakX
Tv2JKfDzb+2ttcMZ+0T84Ru9RztoswFATSw8uFMVxW8oTS6MVWnHu96Kxfl7QO3J
PdlIGcyqxSuWNE8OX1QVtdSruGZfwUDNs94S4nQJtkB8BViRwhGJlqaXuy4d9Wp6
fGlI2W6qhjyosi2wBSMTjh/ytk/jq0vfs+z2XjR2gAYssvB/SOLR/AlSVguWsDnf
WaoFBkXvCbuPvPlo0TrLpl5RW5WlOtLXHE3Vr3dKp458wLwpf/OZBGoZiknp7DrF
PzBZs2ie5tmyqTxbAygl7WkbQPJ682pd5R4nf5CY+zvUaOMZv1g=
=Iuup
-----END PGP SIGNATURE-----
Merge 4.19.34 into android-4.19
Changes in 4.19.34
arm64: debug: Don't propagate UNKNOWN FAR into si_code for debug signals
ext4: cleanup bh release code in ext4_ind_remove_space()
tty/serial: atmel: Add is_half_duplex helper
tty/serial: atmel: RS485 HD w/DMA: enable RX after TX is stopped
CIFS: fix POSIX lock leak and invalid ptr deref
h8300: use cc-cross-prefix instead of hardcoding h8300-unknown-linux-
f2fs: fix to adapt small inline xattr space in __find_inline_xattr()
f2fs: fix to avoid deadlock in f2fs_read_inline_dir()
tracing: kdb: Fix ftdump to not sleep
net/mlx5: Avoid panic when setting vport rate
net/mlx5: Avoid panic when setting vport mac, getting vport config
gpio: gpio-omap: fix level interrupt idling
include/linux/relay.h: fix percpu annotation in struct rchan
sysctl: handle overflow for file-max
net: stmmac: Avoid sometimes uninitialized Clang warnings
enic: fix build warning without CONFIG_CPUMASK_OFFSTACK
libbpf: force fixdep compilation at the start of the build
scsi: hisi_sas: Set PHY linkrate when disconnected
scsi: hisi_sas: Fix a timeout race of driver internal and SMP IO
iio: adc: fix warning in Qualcomm PM8xxx HK/XOADC driver
x86/hyperv: Fix kernel panic when kexec on HyperV
perf c2c: Fix c2c report for empty numa node
mm/sparse: fix a bad comparison
mm/cma.c: cma_declare_contiguous: correct err handling
mm/page_ext.c: fix an imbalance with kmemleak
mm, swap: bounds check swap_info array accesses to avoid NULL derefs
mm,oom: don't kill global init via memory.oom.group
memcg: killed threads should not invoke memcg OOM killer
mm, mempolicy: fix uninit memory access
mm/vmalloc.c: fix kernel BUG at mm/vmalloc.c:512!
mm/slab.c: kmemleak no scan alien caches
ocfs2: fix a panic problem caused by o2cb_ctl
f2fs: do not use mutex lock in atomic context
fs/file.c: initialize init_files.resize_wait
page_poison: play nicely with KASAN
cifs: use correct format characters
dm thin: add sanity checks to thin-pool and external snapshot creation
f2fs: fix to check inline_xattr_size boundary correctly
cifs: Accept validate negotiate if server return NT_STATUS_NOT_SUPPORTED
cifs: Fix NULL pointer dereference of devname
netfilter: nf_tables: check the result of dereferencing base_chain->stats
netfilter: conntrack: tcp: only close if RST matches exact sequence
jbd2: fix invalid descriptor block checksum
fs: fix guard_bio_eod to check for real EOD errors
tools lib traceevent: Fix buffer overflow in arg_eval
PCI/PME: Fix hotplug/sysfs remove deadlock in pcie_pme_remove()
wil6210: check null pointer in _wil_cfg80211_merge_extra_ies
mt76: fix a leaked reference by adding a missing of_node_put
crypto: crypto4xx - add missing of_node_put after of_device_is_available
crypto: cavium/zip - fix collision with generic cra_driver_name
usb: chipidea: Grab the (legacy) USB PHY by phandle first
powerpc/powernv/ioda: Fix locked_vm counting for memory used by IOMMU tables
scsi: core: replace GFP_ATOMIC with GFP_KERNEL in scsi_scan.c
kbuild: invoke syncconfig if include/config/auto.conf.cmd is missing
powerpc/xmon: Fix opcode being uninitialized in print_insn_powerpc
coresight: etm4x: Add support to enable ETMv4.2
serial: 8250_pxa: honor the port number from devicetree
ARM: 8840/1: use a raw_spinlock_t in unwind
iommu/io-pgtable-arm-v7s: Only kmemleak_ignore L2 tables
powerpc/hugetlb: Handle mmap_min_addr correctly in get_unmapped_area callback
btrfs: qgroup: Make qgroup async transaction commit more aggressive
mmc: omap: fix the maximum timeout setting
net: dsa: mv88e6xxx: Add lockdep classes to fix false positive splat
e1000e: Fix -Wformat-truncation warnings
mlxsw: spectrum: Avoid -Wformat-truncation warnings
platform/x86: ideapad-laptop: Fix no_hw_rfkill_list for Lenovo RESCUER R720-15IKBN
platform/mellanox: mlxreg-hotplug: Fix KASAN warning
loop: set GENHD_FL_NO_PART_SCAN after blkdev_reread_part()
IB/mlx4: Increase the timeout for CM cache
clk: fractional-divider: check parent rate only if flag is set
perf annotate: Fix getting source line failure
ASoC: qcom: Fix of-node refcount unbalance in qcom_snd_parse_of()
cpufreq: acpi-cpufreq: Report if CPU doesn't support boost technologies
efi: cper: Fix possible out-of-bounds access
s390/ism: ignore some errors during deregistration
scsi: megaraid_sas: return error when create DMA pool failed
scsi: fcoe: make use of fip_mode enum complete
drm/amd/display: Clear stream->mode_changed after commit
perf test: Fix failure of 'evsel-tp-sched' test on s390
mwifiex: don't advertise IBSS features without FW support
perf report: Don't shadow inlined symbol with different addr range
SoC: imx-sgtl5000: add missing put_device()
media: ov7740: fix runtime pm initialization
media: sh_veu: Correct return type for mem2mem buffer helpers
media: s5p-jpeg: Correct return type for mem2mem buffer helpers
media: rockchip/rga: Correct return type for mem2mem buffer helpers
media: s5p-g2d: Correct return type for mem2mem buffer helpers
media: mx2_emmaprp: Correct return type for mem2mem buffer helpers
media: mtk-jpeg: Correct return type for mem2mem buffer helpers
mt76: usb: do not run mt76u_queues_deinit twice
xen/gntdev: Do not destroy context while dma-bufs are in use
vfs: fix preadv64v2 and pwritev64v2 compat syscalls with offset == -1
HID: intel-ish-hid: avoid binding wrong ishtp_cl_device
cgroup, rstat: Don't flush subtree root unless necessary
jbd2: fix race when writing superblock
leds: lp55xx: fix null deref on firmware load failure
perf report: Add s390 diagnosic sampling descriptor size
iwlwifi: pcie: fix emergency path
ACPI / video: Refactor and fix dmi_is_desktop()
selftests: skip seccomp get_metadata test if not real root
kprobes: Prohibit probing on bsearch()
kprobes: Prohibit probing on RCU debug routine
netfilter: conntrack: fix cloned unconfirmed skb->_nfct race in __nf_conntrack_confirm
ARM: 8833/1: Ensure that NEON code always compiles with Clang
ARM: dts: meson8b: fix the Ethernet data line signals in eth_rgmii_pins
ALSA: PCM: check if ops are defined before suspending PCM
ath10k: fix shadow register implementation for WCN3990
usb: f_fs: Avoid crash due to out-of-scope stack ptr access
sched/topology: Fix percpu data types in struct sd_data & struct s_data
bcache: fix input overflow to cache set sysfs file io_error_halflife
bcache: fix input overflow to sequential_cutoff
bcache: fix potential div-zero error of writeback_rate_i_term_inverse
bcache: improve sysfs_strtoul_clamp()
genirq: Avoid summation loops for /proc/stat
net: marvell: mvpp2: fix stuck in-band SGMII negotiation
iw_cxgb4: fix srqidx leak during connection abort
net: phy: consider latched link-down status in polling mode
fbdev: fbmem: fix memory access if logo is bigger than the screen
cdrom: Fix race condition in cdrom_sysctl_register
drm: rcar-du: add missing of_node_put
drm/amd/display: Don't re-program planes for DPMS changes
drm/amd/display: Disconnect mpcc when changing tg
perf/aux: Make perf_event accessible to setup_aux()
e1000e: fix cyclic resets at link up with active tx
e1000e: Exclude device from suspend direct complete optimization
platform/x86: intel_pmc_core: Fix PCH IP sts reading
i2c: of: Try to find an I2C adapter matching the parent
staging: spi: mt7621: Add return code check on device_reset()
iwlwifi: mvm: fix RFH config command with >=10 CPUs
ASoC: fsl-asoc-card: fix object reference leaks in fsl_asoc_card_probe
sched/debug: Initialize sd_sysctl_cpus if !CONFIG_CPUMASK_OFFSTACK
efi/memattr: Don't bail on zero VA if it equals the region's PA
sched/core: Use READ_ONCE()/WRITE_ONCE() in move_queued_task()/task_rq_lock()
drm/vkms: Bugfix extra vblank frame
ARM: dts: lpc32xx: Remove leading 0x and 0s from bindings notation
efi/arm/arm64: Allow SetVirtualAddressMap() to be omitted
soc: qcom: gsbi: Fix error handling in gsbi_probe()
mt7601u: bump supported EEPROM version
ARM: 8830/1: NOMMU: Toggle only bits in EXC_RETURN we are really care of
ARM: avoid Cortex-A9 livelock on tight dmb loops
block, bfq: fix in-service-queue check for queue merging
bpf: fix missing prototype warnings
selftests/bpf: skip verifier tests for unsupported program types
powerpc/64s: Clear on-stack exception marker upon exception return
cgroup/pids: turn cgroup_subsys->free() into cgroup_subsys->release() to fix the accounting
backlight: pwm_bl: Use gpiod_get_value_cansleep() to get initial state
tty: increase the default flip buffer limit to 2*640K
powerpc/pseries: Perform full re-add of CPU for topology update post-migration
drm/amd/display: Enable vblank interrupt during CRC capture
ALSA: dice: add support for Solid State Logic Duende Classic/Mini
usb: dwc3: gadget: Fix OTG events when gadget driver isn't loaded
platform/x86: intel-hid: Missing power button release on some Dell models
perf script python: Use PyBytes for attr in trace-event-python
perf script python: Add trace_context extension module to sys.modules
media: mt9m111: set initial frame size other than 0x0
hwrng: virtio - Avoid repeated init of completion
soc/tegra: fuse: Fix illegal free of IO base address
HID: intel-ish: ipc: handle PIMR before ish_wakeup also clear PISR busy_clear bit
f2fs: UBSAN: set boolean value iostat_enable correctly
hpet: Fix missing '=' character in the __setup() code of hpet_mmap_enable
cpu/hotplug: Mute hotplug lockdep during init
dmaengine: imx-dma: fix warning comparison of distinct pointer types
dmaengine: qcom_hidma: assign channel cookie correctly
dmaengine: qcom_hidma: initialize tx flags in hidma_prep_dma_*
netfilter: physdev: relax br_netfilter dependency
media: rcar-vin: Allow independent VIN link enablement
media: s5p-jpeg: Check for fmt_ver_flag when doing fmt enumeration
regulator: act8865: Fix act8600_sudcdc_voltage_ranges setting
pinctrl: meson: meson8b: add the eth_rxd2 and eth_rxd3 pins
drm: Auto-set allow_fb_modifiers when given modifiers at plane init
drm/nouveau: Stop using drm_crtc_force_disable
x86/build: Specify elf_i386 linker emulation explicitly for i386 objects
selinux: do not override context on context mounts
brcmfmac: Use firmware_request_nowarn for the clm_blob
wlcore: Fix memory leak in case wl12xx_fetch_firmware failure
x86/build: Mark per-CPU symbols as absolute explicitly for LLD
drm/fb-helper: fix leaks in error path of drm_fb_helper_fbdev_setup
clk: meson: clean-up clock registration
clk: rockchip: fix frac settings of GPLL clock for rk3328
dmaengine: tegra: avoid overflow of byte tracking
Input: soc_button_array - fix mapping of the 5th GPIO in a PNP0C40 device
drm/dp/mst: Configure no_stop_bit correctly for remote i2c xfers
net: stmmac: Avoid one more sometimes uninitialized Clang warning
ACPI / video: Extend chassis-type detection with a "Lunch Box" check
bcache: fix potential div-zero error of writeback_rate_p_term_inverse
kprobes/x86: Blacklist non-attachable interrupt functions
Linux 4.19.34
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
[ Upstream commit 99687cdbb3 ]
The percpu members of struct sd_data and s_data are declared as:
struct ... ** __percpu member;
So their type is:
__percpu pointer to pointer to struct ...
But looking at how they're used, their type should be:
pointer to __percpu pointer to struct ...
and they should thus be declared as:
struct ... * __percpu *member;
So fix the placement of '__percpu' in the definition of these
structures.
This addresses a bunch of Sparse's warnings like:
warning: incorrect type in initializer (different address spaces)
expected void const [noderef] <asn:3> *__vpp_verify
got struct sched_domain **
Signed-off-by: Luc Van Oostenryck <luc.vanoostenryck@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190118144936.79158-1-luc.vanoostenryck@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
There are several definitions of those functions/macros in places that
mess with fixed-point load averages. Provide an official version.
[akpm@linux-foundation.org: fix missed conversion in block/blk-iolatency.c]
Link: http://lkml.kernel.org/r/20180828172258.3185-5-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Tested-by: Daniel Drake <drake@endlessm.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 8508cf3ffa)
Conflicts:
block/blk-iolatency.c
(1. manual merge to replace stat->rqs.mean with stat.mean)
Bug: 127712811
Test: lmkd in PSI mode
Change-Id: I716b4874491cff75a2355c6d95c64cf02d05e7ee
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlxbC5gACgkQONu9yGCS
aT4DYQ//Uqm/Q63KQuExgd7W+61FoP4NFHlYXZ31B5Rkydryyk2K5P2ONSdVd5n9
k3wjzRrxvlPvjOwbh9PHv+pLkGxBDqpT1X8IAXPe36bYUkvXoH71BE4YSRPRUJAf
sdzw/vs7WE/Kx41iT3SXiQih8ok0y3LoACBKmUsEXoLI1cJZCUnnSFpP++QNe1Iz
B/y04BigL8R7OWR/jow6OPWe9uXOI8iEe9QKVX26g4oaakzly4vkp6OwROSwM31q
0wut8jF/AtDcZpZXjJLjDCj10k5DRN8jwGcLD7iZeIKqexOabjUrsvfIHfbpUtXr
e7pJw2aUM8BFb8Ba2lsB7gkqvdHQohqVKQE4Qy59aPyesm2G5miH4gAbncoixjCa
u3eQV5ACpFLksUFR4RAMKq+10k7swsutyyJr5vG4qdbRpcTCNJirEwAGGqgI6IEP
SDqtw6u8gMP8+SicwA9p71Wwntcq9RR6fx0gX/3wi2DQp6F8Txem00SqaciE7uQ1
uIOUrhcpWzIq4m58SGhgTSQcBkm5qBD5S154/xRKIo0mvME+NwBub/x3fIsixN/u
AzWQmQPXBajHbYXbKGC7t2jNHkU5d9FedZ4iDmJk/+ZZsWyFByY1bH1cg4Qnq89e
tDxL114YmSujbZD/mFlbGWcqdmGNT355BmyetKDx6w0rNiU/RBU=
=oprJ
-----END PGP SIGNATURE-----
Merge 4.19.20 into android-4.19
Changes in 4.19.20
Fix "net: ipv4: do not handle duplicate fragments as overlapping"
drm/msm/gpu: fix building without debugfs
ipv6: Consider sk_bound_dev_if when binding a socket to an address
ipv6: sr: clear IP6CB(skb) on SRH ip4ip6 encapsulation
ipvlan, l3mdev: fix broken l3s mode wrt local routes
l2tp: copy 4 more bytes to linear part if necessary
l2tp: fix reading optional fields of L2TPv3
net: ip_gre: always reports o_key to userspace
net: ip_gre: use erspan key field for tunnel lookup
net/mlx4_core: Add masking for a few queries on HCA caps
netrom: switch to sock timer API
net/rose: fix NULL ax25_cb kernel panic
net: set default network namespace in init_dummy_netdev()
ravb: expand rx descriptor data to accommodate hw checksum
sctp: improve the events for sctp stream reset
tun: move the call to tun_set_real_num_queues
ucc_geth: Reset BQL queue when stopping device
vhost: fix OOB in get_rx_bufs()
net: ip6_gre: always reports o_key to userspace
sctp: improve the events for sctp stream adding
net/mlx5e: Allow MAC invalidation while spoofchk is ON
ip6mr: Fix notifiers call on mroute_clean_tables()
Revert "net/mlx5e: E-Switch, Initialize eswitch only if eswitch manager"
sctp: set chunk transport correctly when it's a new asoc
sctp: set flow sport from saddr only when it's 0
virtio_net: Don't enable NAPI when interface is down
virtio_net: Don't call free_old_xmit_skbs for xdp_frames
virtio_net: Fix not restoring real_num_rx_queues
virtio_net: Fix out of bounds access of sq
virtio_net: Don't process redirected XDP frames when XDP is disabled
virtio_net: Use xdp_return_frame to free xdp_frames on destroying vqs
virtio_net: Differentiate sk_buff and xdp_frame on freeing
CIFS: Do not count -ENODATA as failure for query directory
CIFS: Fix trace command logging for SMB2 reads and writes
CIFS: Do not consider -ENODATA as stat failure for reads
fs/dcache: Fix incorrect nr_dentry_unused accounting in shrink_dcache_sb()
iommu/vt-d: Fix memory leak in intel_iommu_put_resv_regions()
selftests/seccomp: Enhance per-arch ptrace syscall skip tests
NFS: Fix up return value on fatal errors in nfs_page_async_flush()
ARM: cns3xxx: Fix writing to wrong PCI config registers after alignment
arm64: kaslr: ensure randomized quantities are clean also when kaslr is off
arm64: Do not issue IPIs for user executable ptes
arm64: hyp-stub: Forbid kprobing of the hyp-stub
arm64: hibernate: Clean the __hyp_text to PoC after resume
gpio: altera-a10sr: Set proper output level for direction_output
gpiolib: fix line event timestamps for nested irqs
gpio: pcf857x: Fix interrupts on multiple instances
gpio: sprd: Fix the incorrect data register
gpio: sprd: Fix incorrect irq type setting for the async EIC
gfs2: Revert "Fix loop in gfs2_rbm_find"
mmc: bcm2835: Fix DMA channel leak on probe error
mmc: mediatek: fix incorrect register setting of hs400_cmd_int_delay
ALSA: usb-audio: Add Opus #3 to quirks for native DSD support
ALSA: hda/realtek - Fixed hp_pin no value
IB/hfi1: Remove overly conservative VM_EXEC flag check
platform/x86: asus-nb-wmi: Map 0x35 to KEY_SCREENLOCK
platform/x86: asus-nb-wmi: Drop mapping of 0x33 and 0x34 scan codes
mmc: sdhci-iproc: handle mmc_of_parse() errors during probe
Btrfs: fix deadlock when allocating tree block during leaf/node split
btrfs: On error always free subvol_name in btrfs_mount
kernel/exit.c: release ptraced tasks before zap_pid_ns_processes
mm/hugetlb.c: teach follow_hugetlb_page() to handle FOLL_NOWAIT
oom, oom_reaper: do not enqueue same task twice
mm,memory_hotplug: fix scan_movable_pages() for gigantic hugepages
mm, oom: fix use-after-free in oom_kill_process
mm: hwpoison: use do_send_sig_info() instead of force_sig()
mm: migrate: don't rely on __PageMovable() of newpage after unlocking it
of: Convert to using %pOFn instead of device_node.name
of: overlay: add tests to validate kfrees from overlay removal
of: overlay: add missing of_node_get() in __of_attach_node_sysfs
of: overlay: use prop add changeset entry for property in new nodes
of: overlay: do not duplicate properties from overlay for new nodes
md/raid5: fix 'out of memory' during raid cache recovery
cifs: Always resolve hostname before reconnecting
Linux 4.19.20
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit 9bcdeb51bd upstream.
Arkadiusz reported that enabling memcg's group oom killing causes
strange memcg statistics where there is no task in a memcg despite the
number of tasks in that memcg is not 0. It turned out that there is a
bug in wake_oom_reaper() which allows enqueuing same task twice which
makes impossible to decrease the number of tasks in that memcg due to a
refcount leak.
This bug existed since the OOM reaper became invokable from
task_will_free_mem(current) path in out_of_memory() in Linux 4.7,
T1@P1 |T2@P1 |T3@P1 |OOM reaper
----------+----------+----------+------------
# Processing an OOM victim in a different memcg domain.
try_charge()
mem_cgroup_out_of_memory()
mutex_lock(&oom_lock)
try_charge()
mem_cgroup_out_of_memory()
mutex_lock(&oom_lock)
try_charge()
mem_cgroup_out_of_memory()
mutex_lock(&oom_lock)
out_of_memory()
oom_kill_process(P1)
do_send_sig_info(SIGKILL, @P1)
mark_oom_victim(T1@P1)
wake_oom_reaper(T1@P1) # T1@P1 is enqueued.
mutex_unlock(&oom_lock)
out_of_memory()
mark_oom_victim(T2@P1)
wake_oom_reaper(T2@P1) # T2@P1 is enqueued.
mutex_unlock(&oom_lock)
out_of_memory()
mark_oom_victim(T1@P1)
wake_oom_reaper(T1@P1) # T1@P1 is enqueued again due to oom_reaper_list == T2@P1 && T1@P1->oom_reaper_list == NULL.
mutex_unlock(&oom_lock)
# Completed processing an OOM victim in a different memcg domain.
spin_lock(&oom_reaper_lock)
# T1P1 is dequeued.
spin_unlock(&oom_reaper_lock)
but memcg's group oom killing made it easier to trigger this bug by
calling wake_oom_reaper() on the same task from one out_of_memory()
request.
Fix this bug using an approach used by commit 855b018325 ("oom,
oom_reaper: disable oom_reaper for oom_kill_allocating_task"). As a
side effect of this patch, this patch also avoids enqueuing multiple
threads sharing memory via task_will_free_mem(current) path.
Link: http://lkml.kernel.org/r/e865a044-2c10-9858-f4ef-254bc71d6cc2@i-love.sakura.ne.jp
Link: http://lkml.kernel.org/r/5ee34fc6-1485-34f8-8790-903ddabaa809@i-love.sakura.ne.jp
Fixes: af8e15cc85 ("oom, oom_reaper: do not enqueue task if it is on the oom_reaper_list head")
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reported-by: Arkadiusz Miskiewicz <arekm@maven.pl>
Tested-by: Arkadiusz Miskiewicz <arekm@maven.pl>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Aleksa Sarai <asarai@suse.de>
Cc: Jay Kamat <jgkamat@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This reverts commit b78eec5fb7. It has not
been accepted upstream and is of no particular interest in Android since
EAS is always enabled anyway.
Bug: 120440300
Change-Id: I7d1f2ad206c05041d386fc99f18e29c4fa56cbc8
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
The core EAS patches have now been accepted upstream. The patches used
in Android are based on a slightly earlier version of the series. In
order to reduce the delta with mainline and ease backports, align the
EAS code paths with their upstream version.
This basically applies the output of git range-diff on the appropriate
commits, and fixes a conflict in schedutil regarding the integration of
schedtune.
Bug: 120440300
Change-Id: I208ebeb4207e3f4f4bbb5103c606b293a464c20f
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlwIG48ACgkQONu9yGCS
aT7g6Q//RkJ8ZWaRkykcCGaWIvwI6QF1tmKalIEWmToPdndDuQdUDGzWVwfE9G7P
yLcnp3GMlXo4F82BBwG8lFSAm9zaeqaLabnJnXbCc5mZ3xi/2aNqIGHzBY1isNZl
0fTzzcelnAKzjp0Aa/egRLOeraSLgVt/Cp7Ha3FXMP6RNxUMzs1pbQ2IFZ3m+P4G
CAD3Iye6geOaZTu/kXiiooUEUGFQFbV4c3AZ4VW7dZDdrG+ekwtF4YHtkEPseWJQ
Ugtrbr6S0IxYQ91o1Pk77kg4uwUFYo12jrk8Ni4gaPZE6mQCa08tr2Alg2oZkJGw
PdXnt2ASYGRWFYK2JAuTvKzhHrTEJYhiC323dKYCAx7BgfFaqdo5F20oNzYxXFBB
gGA3AzDDtLUD3OOO+lxrDxXMhpwXUx92WXsoJVsaSafdqIDAueq14sH19wqm0gUJ
D1fC2dWTsFrPZKjkU8Z6rJAyO1XZED55h7v1YlqAt2ibjCeDKpjnW3yvUt8Ivpqc
nlnmp8v/Yl2cdY55XtlgUadpknSc2jApFMwhSWetxAaqDCvha2dLQ28YMyPRJzat
ZHOkizM/VUntXvlUzFvVTsqLQiX0sfLG6MKcUkzWehPomNKT+B8XL1wtzytv9QXb
jOY8nRD5PiQo2p35cqdDCskBwqzEwY+WxDe7ji0yHZysBZLxoxQ=
=OiCf
-----END PGP SIGNATURE-----
Merge 4.19.7 into android-4.19
Changes in 4.19.7
mm/huge_memory: rename freeze_page() to unmap_page()
mm/huge_memory: splitting set mapping+index before unfreeze
mm/huge_memory: fix lockdep complaint on 32-bit i_size_read()
mm/khugepaged: collapse_shmem() stop if punched or truncated
mm/khugepaged: fix crashes due to misaccounted holes
mm/khugepaged: collapse_shmem() remember to clear holes
mm/khugepaged: minor reorderings in collapse_shmem()
mm/khugepaged: collapse_shmem() without freezing new_page
mm/khugepaged: collapse_shmem() do not crash on Compound
lan743x: Enable driver to work with LAN7431
lan743x: fix return value for lan743x_tx_napi_poll
net: don't keep lonely packets forever in the gro hash
net: gemini: Fix copy/paste error
net: thunderx: set tso_hdrs pointer to NULL in nicvf_free_snd_queue
packet: copy user buffers before orphan or clone
rapidio/rionet: do not free skb before reading its length
s390/qeth: fix length check in SNMP processing
usbnet: ipheth: fix potential recvmsg bug and recvmsg bug 2
net: thunderx: set xdp_prog to NULL if bpf_prog_add fails
net: skb_scrub_packet(): Scrub offload_fwd_mark
virtio-net: disable guest csum during XDP set
virtio-net: fail XDP set if guest csum is negotiated
net/dim: Update DIM start sample after each DIM iteration
tcp: defer SACK compression after DupThresh
net: phy: add workaround for issue where PHY driver doesn't bind to the device
tipc: fix lockdep warning during node delete
x86/speculation: Enable cross-hyperthread spectre v2 STIBP mitigation
x86/speculation: Apply IBPB more strictly to avoid cross-process data leak
x86/speculation: Propagate information about RSB filling mitigation to sysfs
x86/speculation: Add RETPOLINE_AMD support to the inline asm CALL_NOSPEC variant
x86/retpoline: Make CONFIG_RETPOLINE depend on compiler support
x86/retpoline: Remove minimal retpoline support
x86/speculation: Update the TIF_SSBD comment
x86/speculation: Clean up spectre_v2_parse_cmdline()
x86/speculation: Remove unnecessary ret variable in cpu_show_common()
x86/speculation: Move STIPB/IBPB string conditionals out of cpu_show_common()
x86/speculation: Disable STIBP when enhanced IBRS is in use
x86/speculation: Rename SSBD update functions
x86/speculation: Reorganize speculation control MSRs update
sched/smt: Make sched_smt_present track topology
x86/Kconfig: Select SCHED_SMT if SMP enabled
sched/smt: Expose sched_smt_present static key
x86/speculation: Rework SMT state change
x86/l1tf: Show actual SMT state
x86/speculation: Reorder the spec_v2 code
x86/speculation: Mark string arrays const correctly
x86/speculataion: Mark command line parser data __initdata
x86/speculation: Unify conditional spectre v2 print functions
x86/speculation: Add command line control for indirect branch speculation
x86/speculation: Prepare for per task indirect branch speculation control
x86/process: Consolidate and simplify switch_to_xtra() code
x86/speculation: Avoid __switch_to_xtra() calls
x86/speculation: Prepare for conditional IBPB in switch_mm()
ptrace: Remove unused ptrace_may_access_sched() and MODE_IBRS
x86/speculation: Split out TIF update
x86/speculation: Prevent stale SPEC_CTRL msr content
x86/speculation: Prepare arch_smt_update() for PRCTL mode
x86/speculation: Add prctl() control for indirect branch speculation
x86/speculation: Enable prctl mode for spectre_v2_user
x86/speculation: Add seccomp Spectre v2 user space protection mode
x86/speculation: Provide IBPB always command line options
userfaultfd: shmem/hugetlbfs: only allow to register VM_MAYWRITE vmas
kvm: mmu: Fix race in emulated page table writes
kvm: svm: Ensure an IBPB on all affected CPUs when freeing a vmcb
KVM: nVMX/nSVM: Fix bug which sets vcpu->arch.tsc_offset to L1 tsc_offset
KVM: x86: Fix kernel info-leak in KVM_HC_CLOCK_PAIRING hypercall
KVM: LAPIC: Fix pv ipis use-before-initialization
KVM: X86: Fix scan ioapic use-before-initialization
KVM: VMX: re-add ple_gap module parameter
xtensa: enable coprocessors that are being flushed
xtensa: fix coprocessor context offset definitions
xtensa: fix coprocessor part of ptrace_{get,set}xregs
udf: Allow mounting volumes with incorrect identification strings
btrfs: Always try all copies when reading extent buffers
Btrfs: ensure path name is null terminated at btrfs_control_ioctl
Btrfs: fix rare chances for data loss when doing a fast fsync
Btrfs: fix race between enabling quotas and subvolume creation
btrfs: relocation: set trans to be NULL after ending transaction
PCI: layerscape: Fix wrong invocation of outbound window disable accessor
PCI: dwc: Fix MSI-X EP framework address calculation bug
PCI: Fix incorrect value returned from pcie_get_speed_cap()
arm64: dts: rockchip: Fix PCIe reset polarity for rk3399-puma-haikou.
x86/MCE/AMD: Fix the thresholding machinery initialization order
x86/fpu: Disable bottom halves while loading FPU registers
perf/x86/intel: Move branch tracing setup to the Intel-specific source file
perf/x86/intel: Add generic branch tracing check to intel_pmu_has_bts()
perf/x86/intel: Disallow precise_ip on BTS events
fs: fix lost error code in dio_complete
ALSA: wss: Fix invalid snd_free_pages() at error path
ALSA: ac97: Fix incorrect bit shift at AC97-SPSA control write
ALSA: control: Fix race between adding and removing a user element
ALSA: sparc: Fix invalid snd_free_pages() at error path
ALSA: hda: Add ASRock N68C-S UCC the power_save blacklist
ALSA: hda/realtek - Support ALC300
ALSA: hda/realtek - fix headset mic detection for MSI MS-B171
ALSA: hda/realtek - fix the pop noise on headphone for lenovo laptops
ALSA: hda/realtek - Add auto-mute quirk for HP Spectre x360 laptop
function_graph: Create function_graph_enter() to consolidate architecture code
ARM: function_graph: Simplify with function_graph_enter()
microblaze: function_graph: Simplify with function_graph_enter()
x86/function_graph: Simplify with function_graph_enter()
nds32: function_graph: Simplify with function_graph_enter()
powerpc/function_graph: Simplify with function_graph_enter()
sh/function_graph: Simplify with function_graph_enter()
sparc/function_graph: Simplify with function_graph_enter()
parisc: function_graph: Simplify with function_graph_enter()
riscv/function_graph: Simplify with function_graph_enter()
s390/function_graph: Simplify with function_graph_enter()
arm64: function_graph: Simplify with function_graph_enter()
MIPS: function_graph: Simplify with function_graph_enter()
function_graph: Make ftrace_push_return_trace() static
function_graph: Use new curr_ret_depth to manage depth instead of curr_ret_stack
function_graph: Have profiler use curr_ret_stack and not depth
function_graph: Move return callback before update of curr_ret_stack
function_graph: Reverse the order of pushing the ret_stack and the callback
binder: fix race that allows malicious free of live buffer
ext2: initialize opts.s_mount_opt as zero before using it
ext2: fix potential use after free
ASoC: intel: cht_bsw_max98090_ti: Add quirk for boards using pmc_plt_clk_0
ASoC: pcm186x: Fix device reset-registers trigger value
ARM: dts: rockchip: Remove @0 from the veyron memory node
dmaengine: at_hdmac: fix memory leak in at_dma_xlate()
dmaengine: at_hdmac: fix module unloading
staging: most: use format specifier "%s" in snprintf
staging: vchiq_arm: fix compat VCHIQ_IOC_AWAIT_COMPLETION
staging: mt7621-dma: fix potentially dereferencing uninitialized 'tx_desc'
staging: mt7621-pinctrl: fix uninitialized variable ngroups
staging: rtl8723bs: Fix incorrect sense of ether_addr_equal
staging: rtl8723bs: Add missing return for cfg80211_rtw_get_station
USB: usb-storage: Add new IDs to ums-realtek
usb: core: quirks: add RESET_RESUME quirk for Cherry G230 Stream series
Revert "usb: dwc3: gadget: skip Set/Clear Halt when invalid"
iio/hid-sensors: Fix IIO_CHAN_INFO_RAW returning wrong values for signed numbers
iio:st_magn: Fix enable device after trigger
lib/test_kmod.c: fix rmmod double free
mm: cleancache: fix corruption on missed inode invalidation
mm: use swp_offset as key in shmem_replace_page()
Drivers: hv: vmbus: check the creation_status in vmbus_establish_gpadl()
misc: mic/scif: fix copy-paste error in scif_create_remote_lookup
Linux 4.19.7
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit a74cfffb03 upstream
arch_smt_update() is only called when the sysfs SMT control knob is
changed. This means that when SMT is enabled in the sysfs control knob the
system is considered to have SMT active even if all siblings are offline.
To allow finegrained control of the speculation mitigations, the actual SMT
state is more interesting than the fact that siblings could be enabled.
Rework the code, so arch_smt_update() is invoked from each individual CPU
hotplug function, and simplify the update function while at it.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Casey Schaufler <casey.schaufler@intel.com>
Cc: Asit Mallick <asit.k.mallick@intel.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Jon Masters <jcm@redhat.com>
Cc: Waiman Long <longman9394@gmail.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Dave Stewart <david.c.stewart@intel.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20181125185004.521974984@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 321a874a7e upstream
Make the scheduler's 'sched_smt_present' static key globaly available, so
it can be used in the x86 speculation control code.
Provide a query function and a stub for the CONFIG_SMP=n case.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Casey Schaufler <casey.schaufler@intel.com>
Cc: Asit Mallick <asit.k.mallick@intel.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Jon Masters <jcm@redhat.com>
Cc: Waiman Long <longman9394@gmail.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Dave Stewart <david.c.stewart@intel.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20181125185004.430168326@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This adds a counter to the taskstats extended accounting fields, which
tracks the number of times fsync is called, and then plumbs it through
to the uid_sys_stats driver.
Bug: 120442023
Change-Id: I6c138de5b2332eea70f57e098134d1d141247b3f
Signed-off-by: Jin Qian <jinqian@google.com>
[AmitP: Refactored changes to align with changes from upstream commit
9a07000400 ("sched/headers: Move CONFIG_TASK_XACCT bits from <linux/sched.h> to <linux/sched/xacct.h>")]
Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
[tkjos: Needed for storaged fsync accounting ("storaged --uid" and
"storaged --task").]
[astrachan: This is modifying a userspace interface and should probably
be reworked]
Signed-off-by: Alistair Strachan <astrachan@google.com>
This patch adds a parameter to select_task_rq, sibling_count_hint
allowing the caller, where it has this information, to inform the
sched_class the number of tasks that are being woken up as part of
the same event.
The wake_q mechanism is one case where this information is available.
select_task_rq_fair can then use the information to detect that it
needs to widen the search space for task placement in order to avoid
overloading the last-level cache domain's CPUs.
* * *
The reason I am investigating this change is the following use case
on ARM big.LITTLE (asymmetrical CPU capacity): 1 task per CPU, which
all repeatedly do X amount of work then
pthread_barrier_wait (i.e. sleep until the last task finishes its X
and hits the barrier). On big.LITTLE, the tasks which get a "big" CPU
finish faster, and then those CPUs pull over the tasks that are still
running:
v CPU v ->time->
-------------
0 (big) 11111 /333
-------------
1 (big) 22222 /444|
-------------
2 (LITTLE) 333333/
-------------
3 (LITTLE) 444444/
-------------
Now when task 4 hits the barrier (at |) and wakes the others up,
there are 4 tasks with prev_cpu=<big> and 0 tasks with
prev_cpu=<little>. want_affine therefore means that we'll only look
in CPUs 0 and 1 (sd_llc), so tasks will be unnecessarily coscheduled
on the bigs until the next load balance, something like this:
v CPU v ->time->
------------------------
0 (big) 11111 /333 31313\33333
------------------------
1 (big) 22222 /444|424\4444444
------------------------
2 (LITTLE) 333333/ \222222
------------------------
3 (LITTLE) 444444/ \1111
------------------------
^^^
underutilization
So, I'm trying to get want_affine = 0 for these tasks.
I don't _think_ any incarnation of the wakee_flips mechanism can help
us here because which task is waker and which tasks are wakees
generally changes with each iteration.
However pthread_barrier_wait (or more accurately FUTEX_WAKE) has the
nice property that we know exactly how many tasks are being woken, so
we can cheat.
It might be a disadvantage that we "widen" _every_ task that's woken in
an event, while select_idle_sibling would work fine for the first
sd_llc_size - 1 tasks.
IIUC, if wake_affine() behaves correctly this trick wouldn't be
necessary on SMP systems, so it might be best guarded by the presence
of SD_ASYM_CPUCAPACITY?
* * *
Final note..
In order to observe "perfect" behaviour for this use case, I also had
to disable the TTWU_QUEUE sched feature. Suppose during the wakeup
above we are working through the work queue and have placed tasks 3
and 2, and are about to place task 1:
v CPU v ->time->
--------------
0 (big) 11111 /333 3
--------------
1 (big) 22222 /444|4
--------------
2 (LITTLE) 333333/ 2
--------------
3 (LITTLE) 444444/ <- Task 1 should go here
--------------
If TTWU_QUEUE is enabled, we will not yet have enqueued task
2 (having instead sent a reschedule IPI) or attached its load to CPU
2. So we are likely to also place task 1 on cpu 2. Disabling
TTWU_QUEUE means that we enqueue task 2 before placing task 1,
solving this issue. TTWU_QUEUE is there to minimise rq lock
contention, and I guess that this contention is less of an issue on
big.LITTLE systems since they have relatively few CPUs, which
suggests the trade-off makes sense here.
Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
( - Applied from https://patchwork.kernel.org/patch/9895261/
- Fixed trivial conflict in kernel/sched/core.c
- Fixed select_task_rq_idle, now in kernel/sched/idle.c
- Fixed trivial conflict in select_task_rq_fair )
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Change-Id: I3cfc4bf48c3d7feef969db4d22449f4fbb4f795d
Since we don't do energy-aware wakeups when we are overutilized, always
honoring sync wakeups in this state does not prevent wake-wide mechanics
overruling the flag as normal.
This patch is based upon previous work to build EAS for android products.
sync-hint code taken from commit 4a5e890ec60d
"sched/fair: add tunable to force selection at cpu granularity" written
by Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
(cherry-picked from commit f1ec666a62dec1083ed52fe1ddef093b84373aaf)
[ Moved the feature to find_energy_efficient_cpu() ]
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Change-Id: I4b3d79141fc8e53dc51cd63ac11096c2e3cb10f5
Introduce a new sysctl for this option, 'sched_cstate_aware'.
When this is enabled, the scheduler can make use of the idle state
indexes in order to break the tie between potential CPU candidates.
This patch is based on 7f6fb825d6bc ("ANDROID: sched: EAS: take cstate
into account when selecting idle core") from android-4.14. All the
credits goes to the authors.
Change-Id: Ia076cf32faff91e90905291fa6f7924dc3dd6458
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
In its current state, Energy Aware Scheduling (EAS) starts automatically
on asymmetric platforms having an Energy Model (EM). However, there are
users who want to have an EM (for thermal management for example), but
don't want EAS with it.
In order to let users disable EAS explicitly, introduce a new sysctl
called 'sched_energy_aware'. It is enabled by default so that EAS can
start automatically on platforms where it makes sense. Flipping it to 0
rebuilds the scheduling domains and disables EAS.
Change-Id: I55764e70bf5e90795d2269ec9135ae6e82794a2b
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Message-Id: <20181016101513.26919-11-quentin.perret@arm.com>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Energy Aware Scheduling (EAS) is designed with the assumption that
frequencies of CPUs follow their utilization value. When using a CPUFreq
governor other than schedutil, the chances of this assumption being true
are small, if any. When schedutil is being used, EAS' predictions are at
least consistent with the frequency requests. Although those requests
have no guarantees to be honored by the hardware, they should at least
guide DVFS in the right direction and provide some hope in regards to the
EAS model being accurate.
To make sure EAS is only used in a sane configuration, create a strong
dependency on schedutil being used. Since having sugov compiled-in does
not provide that guarantee, make CPUFreq call a scheduler function on
governor changes hence letting it rebuild the scheduling domains, check
the governors of the online CPUs, and enable/disable EAS accordingly.
Change-Id: I872949134f97d2772fc681b7393eaed7f0e224f2
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Message-Id: <20181016101513.26919-9-quentin.perret@arm.com>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Schedutil requests frequency by aggregating utilization signals from
the scheduler (CFS, RT, DL, IRQ) and applying a 25% margin on top of
them. Since Energy Aware Scheduling (EAS) needs to be able to predict
the frequency requests, it needs to forecast the decisions made by the
governor.
In order to prepare the introduction of EAS, introduce
schedutil_freq_util() to centralize the aforementioned signal
aggregation and make it available to both schedutil and EAS. Since
frequency selection and energy estimation still need to deal with RT and
DL signals slightly differently, schedutil_freq_util() is called with a
different 'type' parameter in those two contexts, and returns an
aggregated utilization signal accordingly. While at it, introduce the
map_util_freq() function which is designed to make schedutil's 25%
margin usable easily for both sugov and EAS.
As EAS will be able to predict schedutil's frequency requests more
accurately than any other governor by design, it'd be sensible to make
sure EAS cannot be used without schedutil. This will be done later, once
EAS has actually been introduced.
Change-Id: Idbeeb00926045507b73f9cba37630b38ae0816c0
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Message-Id: <20181016101513.26919-3-quentin.perret@arm.com>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
By default, arch_scale_cpu_capacity() is only visible from within the
kernel/sched folder. Relocate it to include/linux/sched/topology.h to
make it visible to other clients needing to know about the capacity of
CPUs, such as the Energy Model framework.
Change-Id: I144c7299e122201dbcadc431d55d0a6d24d90005
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Message-Id: <20181016101513.26919-2-quentin.perret@arm.com>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
The SD_ASYM_CPUCAPACITY sched_domain flag is supposed to mark the
sched_domain in the hierarchy where all CPU capacities are visible for
any CPU's point of view on asymmetric CPU capacity systems. The
scheduler can then take to take capacity asymmetry into account when
balancing at this level. It also serves as an indicator for how wide
task placement heuristics have to search to consider all available CPU
capacities as asymmetric systems might often appear symmetric at
smallest level(s) of the sched_domain hierarchy.
The flag has been around for while but so far only been set by
out-of-tree code in Android kernels. One solution is to let each
architecture provide the flag through a custom sched_domain topology
array and associated mask and flag functions. However,
SD_ASYM_CPUCAPACITY is special in the sense that it depends on the
capacity and presence of all CPUs in the system, i.e. when hotplugging
all CPUs out except those with one particular CPU capacity the flag
should disappear even if the sched_domains don't collapse. Similarly,
the flag is affected by cpusets where load-balancing is turned off.
Detecting when the flags should be set therefore depends not only on
topology information but also the cpuset configuration and hotplug
state. The arch code doesn't have easy access to the cpuset
configuration.
Instead, this patch implements the flag detection in generic code where
cpusets and hotplug state is already taken care of. All the arch is
responsible for is to implement arch_scale_cpu_capacity() and force a
full rebuild of the sched_domain hierarchy if capacities are updated,
e.g. later in the boot process when cpufreq has initialized.
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: valentin.schneider@arm.com
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/1532093554-30504-2-git-send-email-morten.rasmussen@arm.com
[ Fixed 'CPU' capitalization. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 05484e0984)
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Change-Id: I1d5f695a95f8d023f1ecf14ecb71a558ceb67ed6
Merge more updates from Andrew Morton:
- the rest of MM
- procfs updates
- various misc things
- more y2038 fixes
- get_maintainer updates
- lib/ updates
- checkpatch updates
- various epoll updates
- autofs updates
- hfsplus
- some reiserfs work
- fatfs updates
- signal.c cleanups
- ipc/ updates
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (166 commits)
ipc/util.c: update return value of ipc_getref from int to bool
ipc/util.c: further variable name cleanups
ipc: simplify ipc initialization
ipc: get rid of ids->tables_initialized hack
lib/rhashtable: guarantee initial hashtable allocation
lib/rhashtable: simplify bucket_table_alloc()
ipc: drop ipc_lock()
ipc/util.c: correct comment in ipc_obtain_object_check
ipc: rename ipcctl_pre_down_nolock()
ipc/util.c: use ipc_rcu_putref() for failues in ipc_addid()
ipc: reorganize initialization of kern_ipc_perm.seq
ipc: compute kern_ipc_perm.id under the ipc lock
init/Kconfig: remove EXPERT from CHECKPOINT_RESTORE
fs/sysv/inode.c: use ktime_get_real_seconds() for superblock stamp
adfs: use timespec64 for time conversion
kernel/sysctl.c: fix typos in comments
drivers/rapidio/devices/rio_mport_cdev.c: remove redundant pointer md
fork: don't copy inconsistent signal handler state to child
signal: make get_signal() return bool
signal: make sigkill_pending() return bool
...
Patch series "signal: refactor some functions", v3.
This series refactors a bunch of functions in signal.c to simplify parts
of the code.
The greatest single change is declaring the static do_sigpending() helper
as void which makes it possible to remove a bunch of unnecessary checks in
the syscalls later on.
This patch (of 17):
force_sigsegv() returned 0 unconditionally so it doesn't make sense to have
it return at all. In addition, there are no callers that check
force_sigsegv()'s return value.
Link: http://lkml.kernel.org/r/20180602103653.18181-2-christian@brauner.io
Signed-off-by: Christian Brauner <christian@brauner.io>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: James Morris <james.morris@microsoft.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently task hung checking interval is equal to timeout, as the result
hung is detected anywhere between timeout and 2*timeout. This is fine for
most interactive environments, but this hurts automated testing setups
(syzbot). In an automated setup we need to strictly order CPU lockup <
RCU stall < workqueue lockup < task hung < silent loss, so that RCU stall
is not detected as task hung and task hung is not detected as silent
machine loss. The large variance in task hung detection timeout requires
setting silent machine loss timeout to a very large value (e.g. if task
hung is 3 mins, then silent loss need to be set to ~7 mins). The
additional 3 minutes significantly reduce testing efficiency because
usually we crash kernel within a minute, and this can add hours to bug
localization process as it needs to do dozens of tests.
Allow setting checking interval separately from timeout. This allows to
set timeout to, say, 3 minutes, but checking interval to 10 secs.
The interval is controlled via a new hung_task_check_interval_secs sysctl,
similar to the existing hung_task_timeout_secs sysctl. The default value
of 0 results in the current behavior: checking interval is equal to
timeout.
[akpm@linux-foundation.org: update hung_task_timeout_max's comment]
Link: http://lkml.kernel.org/r/20180611111004.203513-1-dvyukov@google.com
Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
refcount_t type and corresponding API should be used instead of atomic_t
wh en the variable is used as a reference counter. This avoids accidental
refcounter overflows that might lead to use-after-free situations.
Link: http://lkml.kernel.org/r/20180703200141.28415-6-bigeasy@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull core signal handling updates from Eric Biederman:
"It was observed that a periodic timer in combination with a
sufficiently expensive fork could prevent fork from every completing.
This contains the changes to remove the need for that restart.
This set of changes is split into several parts:
- The first part makes PIDTYPE_TGID a proper pid type instead
something only for very special cases. The part starts using
PIDTYPE_TGID enough so that in __send_signal where signals are
actually delivered we know if the signal is being sent to a a group
of processes or just a single process.
- With that prep work out of the way the logic in fork is modified so
that fork logically makes signals received while it is running
appear to be received after the fork completes"
* 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (22 commits)
signal: Don't send signals to tasks that don't exist
signal: Don't restart fork when signals come in.
fork: Have new threads join on-going signal group stops
fork: Skip setting TIF_SIGPENDING in ptrace_init_task
signal: Add calculate_sigpending()
fork: Unconditionally exit if a fatal signal is pending
fork: Move and describe why the code examines PIDNS_ADDING
signal: Push pid type down into complete_signal.
signal: Push pid type down into __send_signal
signal: Push pid type down into send_signal
signal: Pass pid type into do_send_sig_info
signal: Pass pid type into send_sigio_to_task & send_sigurg_to_task
signal: Pass pid type into group_send_sig_info
signal: Pass pid and pid type into send_sigqueue
posix-timers: Noralize good_sigevent
signal: Use PIDTYPE_TGID to clearly store where file signals will be sent
pid: Implement PIDTYPE_TGID
pids: Move the pgrp and session pid pointers from task_struct to signal_struct
kvm: Don't open code task_pid in kvm_vcpu_ioctl
pids: Compute task_tgid using signal->leader_pid
...
Patch series "Directed kmem charging", v8.
The Linux kernel's memory cgroup allows limiting the memory usage of the
jobs running on the system to provide isolation between the jobs. All
the kernel memory allocated in the context of the job and marked with
__GFP_ACCOUNT will also be included in the memory usage and be limited
by the job's limit.
The kernel memory can only be charged to the memcg of the process in
whose context kernel memory was allocated. However there are cases
where the allocated kernel memory should be charged to the memcg
different from the current processes's memcg. This patch series
contains two such concrete use-cases i.e. fsnotify and buffer_head.
The fsnotify event objects can consume a lot of system memory for large
or unlimited queues if there is either no or slow listener. The events
are allocated in the context of the event producer. However they should
be charged to the event consumer. Similarly the buffer_head objects can
be allocated in a memcg different from the memcg of the page for which
buffer_head objects are being allocated.
To solve this issue, this patch series introduces mechanism to charge
kernel memory to a given memcg. In case of fsnotify events, the memcg
of the consumer can be used for charging and for buffer_head, the memcg
of the page can be charged. For directed charging, the caller can use
the scope API memalloc_[un]use_memcg() to specify the memcg to charge
for all the __GFP_ACCOUNT allocations within the scope.
This patch (of 2):
A lot of memory can be consumed by the events generated for the huge or
unlimited queues if there is either no or slow listener. This can cause
system level memory pressure or OOMs. So, it's better to account the
fsnotify kmem caches to the memcg of the listener.
However the listener can be in a different memcg than the memcg of the
producer and these allocations happen in the context of the event
producer. This patch introduces remote memcg charging API which the
producer can use to charge the allocations to the memcg of the listener.
There are seven fsnotify kmem caches and among them allocations from
dnotify_struct_cache, dnotify_mark_cache, fanotify_mark_cache and
inotify_inode_mark_cachep happens in the context of syscall from the
listener. So, SLAB_ACCOUNT is enough for these caches.
The objects from fsnotify_mark_connector_cachep are not accounted as
they are small compared to the notification mark or events and it is
unclear whom to account connector to since it is shared by all events
attached to the inode.
The allocations from the event caches happen in the context of the event
producer. For such caches we will need to remote charge the allocations
to the listener's memcg. Thus we save the memcg reference in the
fsnotify_group structure of the listener.
This patch has also moved the members of fsnotify_group to keep the size
same, at least for 64 bit build, even with additional member by filling
the holes.
[shakeelb@google.com: use GFP_KERNEL_ACCOUNT rather than open-coding it]
Link: http://lkml.kernel.org/r/20180702215439.211597-1-shakeelb@google.com
Link: http://lkml.kernel.org/r/20180627191250.209150-2-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Wen Yang <wen.yang99@zte.com.cn> and majiang <ma.jiang@zte.com.cn>
report that a periodic signal received during fork can cause fork to
continually restart preventing an application from making progress.
The code was being overly pessimistic. Fork needs to guarantee that a
signal sent to multiple processes is logically delivered before the
fork and just to the forking process or logically delivered after the
fork to both the forking process and it's newly spawned child. For
signals like periodic timers that are always delivered to a single
process fork can safely complete and let them appear to logically
delivered after the fork().
While examining this issue I also discovered that fork today will miss
signals delivered to multiple processes during the fork and handled by
another thread. Similarly the current code will also miss blocked
signals that are delivered to multiple process, as those signals will
not appear pending during fork.
Add a list of each thread that is currently forking, and keep on that
list a signal set that records all of the signals sent to multiple
processes. When fork completes initialize the new processes
shared_pending signal set with it. The calculate_sigpending function
will see those signals and set TIF_SIGPENDING causing the new task to
take the slow path to userspace to handle those signals. Making it
appear as if those signals were received immediately after the fork.
It is not possible to send real time signals to multiple processes and
exceptions don't go to multiple processes, which means that that are
no signals sent to multiple processes that require siginfo. This
means it is safe to not bother collecting siginfo on signals sent
during fork.
The sigaction of a child of fork is initially the same as the
sigaction of the parent process. So a signal the parent ignores the
child will also initially ignore. Therefore it is safe to ignore
signals sent to multiple processes and ignored by the forking process.
Signals sent to only a single process or only a single thread and delivered
during fork are treated as if they are received after the fork, and generally
not dealt with. They won't cause any problems.
V2: Added removal from the multiprocess list on failure.
V3: Use -ERESTARTNOINTR directly
V4: - Don't queue both SIGCONT and SIGSTOP
- Initialize signal_struct.multiprocess in init_task
- Move setting of shared_pending to before the new task
is visible to signals. This prevents signals from comming
in before shared_pending.signal is set to delayed.signal
and being lost.
V5: - rework list add and delete to account for idle threads
v6: - Use sigdelsetmask when removing stop signals
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=200447
Reported-by: Wen Yang <wen.yang99@zte.com.cn> and
Reported-by: majiang <ma.jiang@zte.com.cn>
Fixes: 4a2c7a7837 ("[PATCH] make fork() atomic wrt pgrp/session signals")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
There are only two signals that are delivered to every member of a
signal group: SIGSTOP and SIGKILL. Signal delivery requires every
signal appear to be delivered either before or after a clone syscall.
SIGKILL terminates the clone so does not need to be considered. Which
leaves only SIGSTOP that needs to be considered when creating new
threads.
Today in the event of a group stop TIF_SIGPENDING will get set and the
fork will restart ensuring the fork syscall participates in the group
stop.
A fork (especially of a process with a lot of memory) is one of the
most expensive system so we really only want to restart a fork when
necessary.
It is easy so check to see if a SIGSTOP is ongoing and have the new
thread join it immediate after the clone completes. Making it appear
the clone completed happened just before the SIGSTOP.
The calculate_sigpending function will see the bits set in jobctl and
set TIF_SIGPENDING to ensure the new task takes the slow path to userspace.
V2: The call to task_join_group_stop was moved before the new task is
added to the thread group list. This should not matter as
sighand->siglock is held over both the addition of the threads,
the call to task_join_group_stop and do_signal_stop. But the change
is trivial and it is one less thing to worry about when reading
the code.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Add a function calculate_sigpending to test to see if any signals are
pending for a new task immediately following fork. Signals have to
happen either before or after fork. Today our practice is to push
all of the signals to before the fork, but that has the downside that
frequent or periodic signals can make fork take much much longer than
normal or prevent fork from completing entirely.
So we need move signals that we can after the fork to prevent that.
This updates the code to set TIF_SIGPENDING on a new task if there
are signals or other activities that have moved so that they appear
to happen after the fork.
As the code today restarts if it sees any such activity this won't
immediately have an effect, as there will be no reason for it
to set TIF_SIGPENDING immediately after the fork.
Adding calculate_sigpending means the code in fork can safely be
changed to not always restart if a signal is pending.
The new calculate_sigpending function sets sigpending if there
are pending bits in jobctl, pending signals, the freezer needs
to freeze the new task or the live kernel patching framework
need the new thread to take the slow path to userspace.
I have verified that setting TIF_SIGPENDING does make a new process
take the slow path to userspace before it executes it's first userspace
instruction.
I have looked at the callers of signal_wake_up and the code paths
setting TIF_SIGPENDING and I don't see anything else that needs to be
handled. The code probably doesn't need to set TIF_SIGPENDING for the
kernel live patching as it uses a separate thread flag as well. But
at this point it seems safer reuse the recalc_sigpending logic and get
the kernel live patching folks to sort out their story later.
V2: I have moved the test into schedule_tail where siglock can
be grabbed and recalc_sigpending can be reused directly.
Further as the last action of setting up a new task this
guarantees that TIF_SIGPENDING will be properly set in the
new process.
The helper calculate_sigpending takes the siglock and
uncontitionally sets TIF_SIGPENDING and let's recalc_sigpending
clear TIF_SIGPENDING if it is unnecessary. This allows reusing
the existing code and keeps maintenance of the conditions simple.
Oleg Nesterov <oleg@redhat.com> suggested the movement
and pointed out the need to take siglock if this code
was going to be called while the new task is discoverable.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
kernel_wait4() expects a userland address for status - it's only
rusage that goes as a kernel one (and needs a copyout afterwards)
[ Also, fix the prototype of kernel_wait4() to have that __user
annotation - Linus ]
Fixes: 92ebce5ac5 ("osf_wait4: switch to kernel_wait4()")
Cc: stable@kernel.org # v4.13+
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make the code more maintainable by performing more of the signal
related work in send_sigqueue.
A quick inspection of do_timer_create will show that this code path
does not lookup a thread group by a thread's pid. Making it safe
to find the task pointed to by it_pid with "pid_task(it_pid, type)";
This supports the changes needed in fork to tell if a signal was sent
to a single process or a group of processes.
Having the pid to task transition in signal.c will also make it easier
to sort out races with de_thread and and the thread group leader
exiting when it comes time to address that.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Everywhere except in the pid array we distinguish between a tasks pid and
a tasks tgid (thread group id). Even in the enumeration we want that
distinction sometimes so we have added __PIDTYPE_TGID. With leader_pid
we almost have an implementation of PIDTYPE_TGID in struct signal_struct.
Add PIDTYPE_TGID as a first class member of the pid_type enumeration and
into the pids array. Then remove the __PIDTYPE_TGID special case and the
leader_pid in signal_struct.
The net size increase is just an extra pointer added to struct pid and
an extra pair of pointers of an hlist_node added to task_struct.
The effect on code maintenance is the removal of a number of special
cases today and the potential to remove many more special cases as
PIDTYPE_TGID gets used to it's fullest. The long term potential
is allowing zombie thread group leaders to exit, which will remove
a lot more special cases in the code.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
To access these fields the code always has to go to group leader so
going to signal struct is no loss and is actually a fundamental simplification.
This saves a little bit of memory by only allocating the pid pointer array
once instead of once for every thread, and even better this removes a
few potential races caused by the fact that group_leader can be changed
by de_thread, while signal_struct can not.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
The cost is the the same and this removes the need
to worry about complications that come from de_thread
and group_leader changing.
__task_pid_nr_ns has been updated to take advantage of this change.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
While revisiting my Btrfs swapfile series [1], I introduced a situation
in which reclaim would lock i_rwsem, and even though the swapon() path
clearly made GFP_KERNEL allocations while holding i_rwsem, I got no
complaints from lockdep. It turns out that the rework of the fs_reclaim
annotation was broken: if the current task has PF_MEMALLOC set, we don't
acquire the dummy fs_reclaim lock, but when reclaiming we always check
this _after_ we've just set the PF_MEMALLOC flag. In most cases, we can
fix this by moving the fs_reclaim_{acquire,release}() outside of the
memalloc_noreclaim_{save,restore}(), althought kswapd is slightly
different. After applying this, I got the expected lockdep splats.
1: https://lwn.net/Articles/625412/
Link: http://lkml.kernel.org/r/9f8aa70652a98e98d7c4de0fc96a4addcee13efe.1523778026.git.osandov@fb.com
Fixes: d92a8cfcb3 ("locking/lockdep: Rework FS_RECLAIM annotation")
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
including:
- Extensive RST conversions and organizational work in the
memory-management docs thanks to Mike Rapoport.
- An update of Documentation/features from Andrea Parri and a script to
keep it updated.
- Various LICENSES updates from Thomas, along with a script to check SPDX
tags.
- Work to fix dangling references to documentation files; this involved a
fair number of one-liner comment changes outside of Documentation/
...and the usual list of documentation improvements, typo fixes, etc.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJbFTkKAAoJEI3ONVYwIuV6t24P/0K9qltHkLwsBo2fbGu/emem
mb1QrZCFZGebKVrCIvET3YcT0q0xPW+ZldwMQYEUeCcu/vD3cGHGXlDbVJCa1fFD
2OS10W/sEObPnREtlHO/zAzpapKP9DO1/f6NhO55iBJLGOCgoLL5xvSqgsI8MTGd
vcJDXLitkh4CJEcfNLkQt8dEZzq9Tb6wdSFIvZBBXRNon2ItVN92D5xoQ0wtB+qt
KmcGYofajK9bjtZpnC4iNg3i+zdwkd80bGTEN9f0hJTRZK5emCILk8fip8CMhRuB
iwmcqb2RnMLydNLyK9RSs6OS5z3G4fYu9llRtLlZBAupcjRVpalWaBGxLOVO6jBG
mvkqdKPMtxV4c7NvwKwFQL9dcjtxsxO4RDRYVWN82dS1L6WKKk8UvTuJUBLH0YA5
af7ZKn7mJVhJ1cxPblaEBOBM3oQuk57LLkjmcpMOXyJ/IOkTIuV1Ezht+XzFyFQv
VWSyekiKo+8D6WHACPTaWiizjW15e8CyP+WIhKzJyn7VQQrZwhsOS+R//ITsuvQ0
vRdZ20lwUeBhR+mnXd5NfIo2w7G+OiqiREVAgxjgRrS0PnkzWG7lzzcSVU8HTfT4
S7VXqval2a9Xg+N8aU2JUe49W858J8hKvIa98hBxGoZa84wxOGtEo7pIKhnMwMSe
Uhkh/1/bQMxsK3fBEF74
=I6FG
-----END PGP SIGNATURE-----
Merge tag 'docs-4.18' of git://git.lwn.net/linux
Pull documentation updates from Jonathan Corbet:
"There's been a fair amount of work in the docs tree this time around,
including:
- Extensive RST conversions and organizational work in the
memory-management docs thanks to Mike Rapoport.
- An update of Documentation/features from Andrea Parri and a script
to keep it updated.
- Various LICENSES updates from Thomas, along with a script to check
SPDX tags.
- Work to fix dangling references to documentation files; this
involved a fair number of one-liner comment changes outside of
Documentation/
... and the usual list of documentation improvements, typo fixes, etc"
* tag 'docs-4.18' of git://git.lwn.net/linux: (103 commits)
Documentation: document hung_task_panic kernel parameter
docs/admin-guide/mm: add high level concepts overview
docs/vm: move ksm and transhuge from "user" to "internals" section.
docs: Use the kerneldoc comments for memalloc_no*()
doc: document scope NOFS, NOIO APIs
docs: update kernel versions and dates in tables
docs/vm: transhuge: split userspace bits to admin-guide/mm/transhuge
docs/vm: transhuge: minor updates
docs/vm: transhuge: change sections order
Documentation: arm: clean up Marvell Berlin family info
Documentation: gpio: driver: Fix a typo and some odd grammar
docs: ranoops.rst: fix location of ramoops.txt
scripts/documentation-file-ref-check: rewrite it in perl with auto-fix mode
docs: uio-howto.rst: use a code block to solve a warning
mm, THP, doc: Add document for thp_swpout/thp_swpout_fallback
w1: w1_io.c: fix a kernel-doc warning
Documentation/process/posting: wrap text at 80 cols
docs: admin-guide: add cgroup-v2 documentation
Revert "Documentation/features/vm: Remove arch support status file for 'pte_special'"
Documentation: refcount-vs-atomic: Update reference to LKMM doc.
...
Although the api is documented in the source code Ted has pointed out
that there is no mention in the core-api Documentation and there are
people looking there to find answers how to use a specific API.
Requested-by: "Theodore Y. Ts'o" <tytso@mit.edu>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Gaurav reported a perceived problem with TASK_PARKED, which turned out
to be a broken wait-loop pattern in __kthread_parkme(), but the
reported issue can (and does) in fact happen for states that do not do
condition based sleeps.
When the 'current->state = TASK_RUNNING' store of a previous
(concurrent) try_to_wake_up() collides with the setting of a 'special'
sleep state, we can loose the sleep state.
Normal condition based wait-loops are immune to this problem, but for
sleep states that are not condition based are subject to this problem.
There already is a fix for TASK_DEAD. Abstract that and also apply it
to TASK_STOPPED and TASK_TRACED, both of which are also without
condition based wait-loop.
Reported-by: Gaurav Kohli <gkohli@codeaurora.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Mike Rapoport says:
These patches convert files in Documentation/vm to ReST format, add an
initial index and link it to the top level documentation.
There are no contents changes in the documentation, except few spelling
fixes. The relatively large diffstat stems from the indentation and
paragraph wrapping changes.
I've tried to keep the formatting as consistent as possible, but I could
miss some places that needed markup and add some markup where it was not
necessary.
[jc: significant conflicts in vm/hmm.rst]