* origin/upstream-f2fs-stable-linux-4.19.y:
f2fs: use EINVAL for superblock with invalid magic
f2fs: fix to read source block before invalidating it
f2fs: remove redundant check from f2fs_setflags_common()
f2fs: use generic checking function for FS_IOC_FSSETXATTR
f2fs: use generic checking and prep function for FS_IOC_SETFLAGS
ubifs, fscrypt: cache decrypted symlink target in ->i_link
vfs: use READ_ONCE() to access ->i_link
fs, fscrypt: clear DCACHE_ENCRYPTED_NAME when unaliasing directory
fscrypt: cache decrypted symlink target in ->i_link
fscrypt: fix race where ->lookup() marks plaintext dentry as ciphertext
fscrypt: only set dentry_operations on ciphertext dentries
fscrypt: fix race allowing rename() and link() of ciphertext dentries
fscrypt: clean up and improve dentry revalidation
fscrypt: use READ_ONCE() to access ->i_crypt_info
fscrypt: remove WARN_ON_ONCE() when decryption fails
fscrypt: drop inode argument from fscrypt_get_ctx()
f2fs: improve print log in f2fs_sanity_check_ckpt()
f2fs: avoid out-of-range memory access
f2fs: fix to avoid long latency during umount
f2fs: allow all the users to pin a file
f2fs: support swap file w/ DIO
f2fs: allocate blocks for pinned file
f2fs: fix is_idle() check for discard type
f2fs: add a rw_sem to cover quota flag changes
f2fs: set SBI_NEED_FSCK for xattr corruption case
f2fs: use generic EFSBADCRC/EFSCORRUPTED
f2fs: Use DIV_ROUND_UP() instead of open-coding
f2fs: print kernel message if filesystem is inconsistent
f2fs: introduce f2fs_<level> macros to wrap f2fs_printk()
f2fs: avoid get_valid_blocks() for cleanup
f2fs: ioctl for removing a range from F2FS
f2fs: only set project inherit bit for directory
f2fs: separate f2fs i_flags from fs_flags and ext4 i_flags
f2fs: Add option to limit required GC for checkpoint=disable
f2fs: Fix accounting for unusable blocks
f2fs: Fix root reserved on remount
f2fs: Lower threshold for disable_cp_again
f2fs: fix sparse warning
f2fs: fix f2fs_show_options to show nodiscard mount option
f2fs: add error prints for debugging mount failure
f2fs: fix to do sanity check on segment bitmap of LFS curseg
f2fs: add missing sysfs entries in documentation
f2fs: fix to avoid deadloop if data_flush is on
f2fs: always assume that the device is idle under gc_urgent
f2fs: add bio cache for IPU
f2fs: allow ssr block allocation during checkpoint=disable period
f2fs: fix to check layout on last valid checkpoint park
Change-Id: Ie910f127f574c2115e5b9a6725461ce002c267be
Signed-off-by: Jaegeuk Kim <jaegeuk@google.com>
Make the f2fs implementation of FS_IOC_FSSETXATTR use the new VFS helper
function vfs_ioc_fssetxattr_check(), and remove the project quota check
since it's now done by the helper function.
This is based on a patch from Darrick Wong, but reworked to apply after
commit 360985573b ("f2fs: separate f2fs i_flags from fs_flags and ext4
i_flags").
Originally-from: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Make the f2fs implementation of FS_IOC_SETFLAGS use the new VFS helper
function vfs_ioc_setflags_prepare().
This is based on a patch from Darrick Wong, but reworked to apply after
commit 360985573b ("f2fs: separate f2fs i_flags from fs_flags and ext4
i_flags").
Originally-from: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAl09QMoACgkQONu9yGCS
aT7bBQ//dApu9DLQ3tyfc+tucIc3ViOcif3zlfLWRqClcLXw1HF63/Q/u40kYuf1
TOnMkVXFWmpeGPgz2/tWw9TN+ibHaPBfyMoPGw/Ehs5dB8qn/F4AnLqkePFimIYQ
+YOZTbvPqW/xSh8I79i6y2NxrZLGvO9FDY9YUblBKz4lKWc+HULXn9RmjtxlBOLE
s6epoM0tWKPPeM5MyJIwT+9MZ0o3Cj89YNv+PqYLQSzcvbUbHDUiMt+DO4rqSyyf
6fQe4mSX4tk+tbviyEruvKH3kJg0rWS1Maw1/h+AOJ8Gmr2WZ+vYBorPEdwdYu5K
eootc3Jkj4wtcPsvKtNzWVPqJxdd5j651vS8/bxA0DmOVQM7A8dUBWKtMJGGG7nM
nIRvxoMo+kv9QE/DJpeQ/apE9WnBflSqw6/DYdqs11gTX4E+beR75aRoVD1ue0lS
if5CfnM0BTiOO1rMdMo42rzcBfh2DLt5a18WXsMmYOaSMGEJuF0KjgaGc5E4N00A
QlqIJ7PKk+Kb53jz5oLV2vB/SXNvAQRLAvMTMH2Mst4veDonYyiVTfp6C+r8DJYc
hvyaF1FoCMTWV5XsINmM2mlJ2+/G0nV5kLDmVPUHiFxE0JXnPQ8iCx9a96Ub+Zim
F//muwohNIUa6EEN6AwrEUsoyVZ7cP/aR5wHQ9PAY3RV5nis474=
=zWcX
-----END PGP SIGNATURE-----
Merge 4.19.62 into android-4.19
Changes in 4.19.62
bnx2x: Prevent load reordering in tx completion processing
caif-hsi: fix possible deadlock in cfhsi_exit_module()
hv_netvsc: Fix extra rcu_read_unlock in netvsc_recv_callback()
igmp: fix memory leak in igmpv3_del_delrec()
ipv4: don't set IPv6 only flags to IPv4 addresses
ipv6: rt6_check should return NULL if 'from' is NULL
ipv6: Unlink sibling route in case of failure
net: bcmgenet: use promisc for unsupported filters
net: dsa: mv88e6xxx: wait after reset deactivation
net: make skb_dst_force return true when dst is refcounted
net: neigh: fix multiple neigh timer scheduling
net: openvswitch: fix csum updates for MPLS actions
net: phy: sfp: hwmon: Fix scaling of RX power
net: stmmac: Re-work the queue selection for TSO packets
nfc: fix potential illegal memory access
r8169: fix issue with confused RX unit after PHY power-down on RTL8411b
rxrpc: Fix send on a connected, but unbound socket
sctp: fix error handling on stream scheduler initialization
sky2: Disable MSI on ASUS P6T
tcp: be more careful in tcp_fragment()
tcp: fix tcp_set_congestion_control() use from bpf hook
tcp: Reset bytes_acked and bytes_received when disconnecting
vrf: make sure skb->data contains ip header to make routing
net/mlx5e: IPoIB, Add error path in mlx5_rdma_setup_rn
macsec: fix use-after-free of skb during RX
macsec: fix checksumming after decryption
netrom: fix a memory leak in nr_rx_frame()
netrom: hold sock when setting skb->destructor
net_sched: unset TCQ_F_CAN_BYPASS when adding filters
net/tls: make sure offload also gets the keys wiped
sctp: not bind the socket in sctp_connect
net: bridge: mcast: fix stale nsrcs pointer in igmp3/mld2 report handling
net: bridge: mcast: fix stale ipv6 hdr pointer when handling v6 query
net: bridge: don't cache ether dest pointer on input
net: bridge: stp: don't cache eth dest pointer before skb pull
dma-buf: balance refcount inbalance
dma-buf: Discard old fence_excl on retrying get_fences_rcu for realloc
gpio: davinci: silence error prints in case of EPROBE_DEFER
MIPS: lb60: Fix pin mappings
perf/core: Fix exclusive events' grouping
perf/core: Fix race between close() and fork()
ext4: don't allow any modifications to an immutable file
ext4: enforce the immutable flag on open files
mm: add filemap_fdatawait_range_keep_errors()
jbd2: introduce jbd2_inode dirty range scoping
ext4: use jbd2_inode dirty range scoping
ext4: allow directory holes
KVM: nVMX: do not use dangling shadow VMCS after guest reset
KVM: nVMX: Clear pending KVM_REQ_GET_VMCS12_PAGES when leaving nested
mm: vmscan: scan anonymous pages on file refaults
net: sched: verify that q!=NULL before setting q->flags
Linux 4.19.62
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I2eb23bda9d5294a5c874fe4f403934fd99e84661
commit aa0bfcd939 upstream.
In the spirit of filemap_fdatawait_range() and
filemap_fdatawait_keep_errors(), introduce
filemap_fdatawait_range_keep_errors() which both takes a range upon
which to wait and does not clear errors from the address space.
Signed-off-by: Ross Zwisler <zwisler@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: stable@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* origin/upstream-f2fs-stable-linux-4.19.y:
fscrypt: remove filesystem specific build config option
f2fs: use IS_ENCRYPTED() to check encryption status
ext4: use IS_ENCRYPTED() to check encryption status
fscrypt: return -EXDEV for incompatible rename or link into encrypted dir
fscrypt: remove CRYPTO_CTR dependency
fscrypt: add Adiantum support
crypto: speck - remove Speck
Conflicts:
arch/arm/crypto/Kconfig
arch/arm/crypto/Makefile
crypto/testmgr.h
Change-Id: I1a6d1e35c857c4117190388b4797d0c11a109cf0
Signed-off-by: Jaegeuk Kim <jaegeuk@google.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlzSZ3MACgkQONu9yGCS
aT7f2hAAww50MxygeVQXgB0zJ7oSlEBwJAUH1GRCj1ual70pDoK9W8iuacPI4gNu
+A0V4cD+lndMZ8E+F9/Vmzhmj/xHdxRfCkRmtnpa5IKUJqA/bLr1kdspYO6G354G
ZBGvizvmFaAikbqNSnoadQZ5AYoT/C4FxxD6t/meQtnwmrK+OhRlh1xCkTddAbQo
UEPDns9pxqrwHK7tS0GbPXLWgsr4y7BPq92ILdt/GFZGmEBJa0+tf66/DF+RMKjD
OCmAkiFr+hN8KScUHeZpJKdmI3s7dHEUnQnwkHyC1TwjEXwxhtA0QZgzl9j1F8JI
f27kbJrpmYuw2QADJyZmCnTiqBo7XglItxEZ2csJKyV7+jFFlzKT4S26iyrHwgXp
BTdSuTaVe2Ubp/EiULwoSsmxaIyW6fROotQK/x4oB0eDm+OSXq217nnbnF2MRpMb
eOqzRPCd5CvL8+7T9rc0kOJcDuK/9h95OjzZWBo9rDGz0glxlokw7cJ4fUyrJMPo
kcw7ZIRRHLtKGTD/ZKX3XrErAqz9I4RAEzBRAXT6W0is7PUi00vFNcOU4ThNSsSK
ihfuHsp9Jq/SKEUl6/CumU8e91LWv77/gqYxLLuygpuKLFLpMMc/PQOruxXT6ojx
QC/KIUtcMfTP+xy23Wg2UQ1BXhl+HcfShXzvcEXxHzjgHpglKvc=
=S4RT
-----END PGP SIGNATURE-----
Merge 4.19.41 into android-4.19
Changes in 4.19.41
iwlwifi: fix driver operation for 5350
mwifiex: Make resume actually do something useful again on SDIO cards
mac80211: don't attempt to rename ERR_PTR() debugfs dirs
i2c: synquacer: fix enumeration of slave devices
i2c: imx: correct the method of getting private data in notifier_call
i2c: Remove unnecessary call to irq_find_mapping
i2c: Clear client->irq in i2c_device_remove
i2c: Allow recovery of the initial IRQ by an I2C client device.
i2c: Prevent runtime suspend of adapter when Host Notify is required
ALSA: hda/realtek - Add new Dell platform for headset mode
ALSA: hda/realtek - Fixed Dell AIO speaker noise
ALSA: hda/realtek - Apply the fixup for ASUS Q325UAR
USB: yurex: Fix protection fault after device removal
USB: w1 ds2490: Fix bug caused by improper use of altsetting array
USB: dummy-hcd: Fix failure to give back unlinked URBs
usb: usbip: fix isoc packet num validation in get_pipe
USB: core: Fix unterminated string returned by usb_string()
USB: core: Fix bug caused by duplicate interface PM usage counter
nvme-loop: init nvmet_ctrl fatal_err_work when allocate
efi: Fix debugobjects warning on 'efi_rts_work'
arm64: dts: rockchip: fix rk3328-roc-cc gmac2io tx/rx_delay
HID: logitech: check the return value of create_singlethread_workqueue
HID: debug: fix race condition with between rdesc_show() and device removal
rtc: cros-ec: Fail suspend/resume if wake IRQ can't be configured
rtc: sh: Fix invalid alarm warning for non-enabled alarm
batman-adv: Reduce claim hash refcnt only for removed entry
batman-adv: Reduce tt_local hash refcnt only for removed entry
batman-adv: Reduce tt_global hash refcnt only for removed entry
batman-adv: fix warning in function batadv_v_elp_get_throughput
ARM: dts: rockchip: Fix gpu opp node names for rk3288
reset: meson-audio-arb: Fix missing .owner setting of reset_controller_dev
igb: Fix WARN_ONCE on runtime suspend
riscv: fix accessing 8-byte variable from RV32
HID: quirks: Fix keyboard + touchpad on Lenovo Miix 630
net: hns3: fix compile error
net/mlx5: E-Switch, Fix esw manager vport indication for more vport commands
bonding: show full hw address in sysfs for slave entries
net: stmmac: use correct DMA buffer size in the RX descriptor
net: stmmac: ratelimit RX error logs
net: stmmac: don't stop NAPI processing when dropping a packet
net: stmmac: don't overwrite discard_frame status
net: stmmac: fix dropping of multi-descriptor RX frames
net: stmmac: don't log oversized frames
jffs2: fix use-after-free on symlink traversal
debugfs: fix use-after-free on symlink traversal
mfd: twl-core: Disable IRQ while suspended
block: use blk_free_flush_queue() to free hctx->fq in blk_mq_init_hctx
rtc: da9063: set uie_unsupported when relevant
HID: input: add mapping for Assistant key
vfio/pci: use correct format characters
scsi: core: add new RDAC LENOVO/DE_Series device
scsi: storvsc: Fix calculation of sub-channel count
arm/mach-at91/pm : fix possible object reference leak
arm64: fix wrong check of on_sdei_stack in nmi context
net: hns: fix KASAN: use-after-free in hns_nic_net_xmit_hw()
net: hns: Use NAPI_POLL_WEIGHT for hns driver
net: hns: Fix probabilistic memory overwrite when HNS driver initialized
net: hns: fix ICMP6 neighbor solicitation messages discard problem
net: hns: Fix WARNING when remove HNS driver with SMMU enabled
libcxgb: fix incorrect ppmax calculation
KVM: SVM: prevent DBG_DECRYPT and DBG_ENCRYPT overflow
kmemleak: powerpc: skip scanning holes in the .bss section
hugetlbfs: fix memory leak for resv_map
sh: fix multiple function definition build errors
xsysace: Fix error handling in ace_setup
fs: stream_open - opener for stream-like files so that read and write can run simultaneously without deadlock
ARM: orion: don't use using 64-bit DMA masks
ARM: iop: don't use using 64-bit DMA masks
block: pass no-op callback to INIT_WORK().
perf/x86/amd: Update generic hardware cache events for Family 17h
Bluetooth: btusb: request wake pin with NOAUTOEN
Bluetooth: mediatek: fix up an error path to restore bdev->tx_state
clk: qcom: Add missing freq for usb30_master_clk on 8998
staging: iio: adt7316: allow adt751x to use internal vref for all dacs
staging: iio: adt7316: fix the dac read calculation
staging: iio: adt7316: fix the dac write calculation
scsi: RDMA/srpt: Fix a credit leak for aborted commands
ASoC: Intel: bytcr_rt5651: Revert "Fix DMIC map headsetmic mapping"
ASoC: wm_adsp: Correct handling of compressed streams that restart
ASoC: stm32: fix sai driver name initialisation
platform/x86: intel_pmc_core: Fix PCH IP name
platform/x86: intel_pmc_core: Handle CFL regmap properly
IB/core: Unregister notifier before freeing MAD security
IB/core: Fix potential memory leak while creating MAD agents
IB/core: Destroy QP if XRC QP fails
Input: snvs_pwrkey - initialize necessary driver data before enabling IRQ
Input: stmfts - acknowledge that setting brightness is a blocking call
gpio: mxc: add check to return defer probe if clock tree NOT ready
selinux: avoid silent denials in permissive mode under RCU walk
selinux: never allow relabeling on context mounts
mac80211: Honor SW_CRYPTO_CONTROL for unicast keys in AP VLAN mode
powerpc/mm/hash: Handle mmap_min_addr correctly in get_unmapped_area topdown search
x86/mce: Improve error message when kernel cannot recover, p2
clk: x86: Add system specific quirk to mark clocks as critical
x86/mm/KASLR: Fix the size of the direct mapping section
x86/mm: Fix a crash with kmemleak_scan()
x86/mm/tlb: Revert "x86/mm: Align TLB invalidation info"
i2c: i2c-stm32f7: Fix SDADEL minimum formula
media: v4l2: i2c: ov7670: Fix PLL bypass register values
ASoC: wm_adsp: Check for buffer in trigger stop
mm/kmemleak.c: fix unused-function warning
Linux 4.19.41
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
[ Upstream commit 10dce8af34 ]
Commit 9c225f2655 ("vfs: atomic f_pos accesses as per POSIX") added
locking for file.f_pos access and in particular made concurrent read and
write not possible - now both those functions take f_pos lock for the
whole run, and so if e.g. a read is blocked waiting for data, write will
deadlock waiting for that read to complete.
This caused regression for stream-like files where previously read and
write could run simultaneously, but after that patch could not do so
anymore. See e.g. commit 581d21a2d0 ("xenbus: fix deadlock on writes
to /proc/xen/xenbus") which fixes such regression for particular case of
/proc/xen/xenbus.
The patch that added f_pos lock in 2014 did so to guarantee POSIX thread
safety for read/write/lseek and added the locking to file descriptors of
all regular files. In 2014 that thread-safety problem was not new as it
was already discussed earlier in 2006.
However even though 2006'th version of Linus's patch was adding f_pos
locking "only for files that are marked seekable with FMODE_LSEEK (thus
avoiding the stream-like objects like pipes and sockets)", the 2014
version - the one that actually made it into the tree as 9c225f2655 -
is doing so irregardless of whether a file is seekable or not.
See
https://lore.kernel.org/lkml/53022DB1.4070805@gmail.com/https://lwn.net/Articles/180387https://lwn.net/Articles/180396
for historic context.
The reason that it did so is, probably, that there are many files that
are marked non-seekable, but e.g. their read implementation actually
depends on knowing current position to correctly handle the read. Some
examples:
kernel/power/user.c snapshot_read
fs/debugfs/file.c u32_array_read
fs/fuse/control.c fuse_conn_waiting_read + ...
drivers/hwmon/asus_atk0110.c atk_debugfs_ggrp_read
arch/s390/hypfs/inode.c hypfs_read_iter
...
Despite that, many nonseekable_open users implement read and write with
pure stream semantics - they don't depend on passed ppos at all. And for
those cases where read could wait for something inside, it creates a
situation similar to xenbus - the write could be never made to go until
read is done, and read is waiting for some, potentially external, event,
for potentially unbounded time -> deadlock.
Besides xenbus, there are 14 such places in the kernel that I've found
with semantic patch (see below):
drivers/xen/evtchn.c:667:8-24: ERROR: evtchn_fops: .read() can deadlock .write()
drivers/isdn/capi/capi.c:963:8-24: ERROR: capi_fops: .read() can deadlock .write()
drivers/input/evdev.c:527:1-17: ERROR: evdev_fops: .read() can deadlock .write()
drivers/char/pcmcia/cm4000_cs.c:1685:7-23: ERROR: cm4000_fops: .read() can deadlock .write()
net/rfkill/core.c:1146:8-24: ERROR: rfkill_fops: .read() can deadlock .write()
drivers/s390/char/fs3270.c:488:1-17: ERROR: fs3270_fops: .read() can deadlock .write()
drivers/usb/misc/ldusb.c:310:1-17: ERROR: ld_usb_fops: .read() can deadlock .write()
drivers/hid/uhid.c:635:1-17: ERROR: uhid_fops: .read() can deadlock .write()
net/batman-adv/icmp_socket.c:80:1-17: ERROR: batadv_fops: .read() can deadlock .write()
drivers/media/rc/lirc_dev.c:198:1-17: ERROR: lirc_fops: .read() can deadlock .write()
drivers/leds/uleds.c:77:1-17: ERROR: uleds_fops: .read() can deadlock .write()
drivers/input/misc/uinput.c:400:1-17: ERROR: uinput_fops: .read() can deadlock .write()
drivers/infiniband/core/user_mad.c:985:7-23: ERROR: umad_fops: .read() can deadlock .write()
drivers/gnss/core.c:45:1-17: ERROR: gnss_fops: .read() can deadlock .write()
In addition to the cases above another regression caused by f_pos
locking is that now FUSE filesystems that implement open with
FOPEN_NONSEEKABLE flag, can no longer implement bidirectional
stream-like files - for the same reason as above e.g. read can deadlock
write locking on file.f_pos in the kernel.
FUSE's FOPEN_NONSEEKABLE was added in 2008 in a7c1b990f7 ("fuse:
implement nonseekable open") to support OSSPD. OSSPD implements /dev/dsp
in userspace with FOPEN_NONSEEKABLE flag, with corresponding read and
write routines not depending on current position at all, and with both
read and write being potentially blocking operations:
See
https://github.com/libfuse/osspdhttps://lwn.net/Articles/30844514a9cff0/osspd.c (L1406)14a9cff0/osspd.c (L1438-L1477)14a9cff0/osspd.c (L1479-L1510)
Corresponding libfuse example/test also describes FOPEN_NONSEEKABLE as
"somewhat pipe-like files ..." with read handler not using offset.
However that test implements only read without write and cannot exercise
the deadlock scenario:
https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L124-L131https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L146-L163https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L209-L216
I've actually hit the read vs write deadlock for real while implementing
my FUSE filesystem where there is /head/watch file, for which open
creates separate bidirectional socket-like stream in between filesystem
and its user with both read and write being later performed
simultaneously. And there it is semantically not easy to split the
stream into two separate read-only and write-only channels:
f13aa600/wcfs/wcfs.go (L88-169)
Let's fix this regression. The plan is:
1. We can't change nonseekable_open to include &~FMODE_ATOMIC_POS -
doing so would break many in-kernel nonseekable_open users which
actually use ppos in read/write handlers.
2. Add stream_open() to kernel to open stream-like non-seekable file
descriptors. Read and write on such file descriptors would never use
nor change ppos. And with that property on stream-like files read and
write will be running without taking f_pos lock - i.e. read and write
could be running simultaneously.
3. With semantic patch search and convert to stream_open all in-kernel
nonseekable_open users for which read and write actually do not
depend on ppos and where there is no other methods in file_operations
which assume @offset access.
4. Add FOPEN_STREAM to fs/fuse/ and open in-kernel file-descriptors via
steam_open if that bit is present in filesystem open reply.
It was tempting to change fs/fuse/ open handler to use stream_open
instead of nonseekable_open on just FOPEN_NONSEEKABLE flags, but
grepping through Debian codesearch shows users of FOPEN_NONSEEKABLE,
and in particular GVFS which actually uses offset in its read and
write handlers
https://codesearch.debian.net/search?q=-%3Enonseekable+%3Dhttps://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1080https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1247-1346https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1399-1481
so if we would do such a change it will break a real user.
5. Add stream_open and FOPEN_STREAM handling to stable kernels starting
from v3.14+ (the kernel where 9c225f2655 first appeared).
This will allow to patch OSSPD and other FUSE filesystems that
provide stream-like files to return FOPEN_STREAM | FOPEN_NONSEEKABLE
in their open handler and this way avoid the deadlock on all kernel
versions. This should work because fs/fuse/ ignores unknown open
flags returned from a filesystem and so passing FOPEN_STREAM to a
kernel that is not aware of this flag cannot hurt. In turn the kernel
that is not aware of FOPEN_STREAM will be < v3.14 where just
FOPEN_NONSEEKABLE is sufficient to implement streams without read vs
write deadlock.
This patch adds stream_open, converts /proc/xen/xenbus to it and adds
semantic patch to automatically locate in-kernel places that are either
required to be converted due to read vs write deadlock, or that are just
safe to be converted because read and write do not use ppos and there
are no other funky methods in file_operations.
Regarding semantic patch I've verified each generated change manually -
that it is correct to convert - and each other nonseekable_open instance
left - that it is either not correct to convert there, or that it is not
converted due to current stream_open.cocci limitations.
The script also does not convert files that should be valid to convert,
but that currently have .llseek = noop_llseek or generic_file_llseek for
unknown reason despite file being opened with nonseekable_open (e.g.
drivers/input/mousedev.c)
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Yongzhi Pan <panyongzhi@gmail.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Tejun Heo <tj@kernel.org>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Julia Lawall <Julia.Lawall@lip6.fr>
Cc: Nikolaus Rath <Nikolaus@rath.org>
Cc: Han-Wen Nienhuys <hanwen@google.com>
Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlzKo0YACgkQONu9yGCS
aT4dbQ//U1bo/8bdBJec+a0aNMy3cxzPF1Ozbrb/vEaHofj1BR87hgo4BODBO7pu
6ppwloPle9VFrsfT1FYOjsicUBhT4NmieHlsC3msAR4xlBEbHEOBTEbUdu3HinGV
Jn/uL/NDTrq+wA5rROGOh9sTlQ5w6dqItjHAWvnGkXlerbUJwIgnzbgH5qGBFZhQ
6SbPmqJv5V+C+qYy3yXNs2CnbtS7+cfulLy26MNnkFMEZGbHTWeNbeu9H41AK6T4
xtO8INse28RD6lbAPvW/xb//iAXsOHv+7KF1TgtZq89Z1RmlaqLSdPdgTYvCxm+Y
RhWa8KyIdhADJ8z8sRcPviFI5bR65cfCMUAEgBcFNYYByDv36KCBLsXajn4JbBsF
OOOtqnGaZyAJBZgMXySfVJIXLAx7cUlt07YD9cIdsOzjl1DCMP76XvypeGXLw5Mk
ZBXBJ+By+8jwnE7PAtecij/VH6qCDsfn4HqoRELsRLVahFsnFFid5lutVIjsO21j
QHrwi4hChuYGa89MhD48KyC2ZuaQmbs3rm6F3O0iQ0aipknvlsDoB4jYYp9qRI04
0FYMlZLlVyg+sNYOM2XvTtpOBFa1PFwFwscqXoyt0CGtig0D+pD3gDYExRONj6Fp
8h+OUBWbVHWscceMc6G1p/Qu+YcgmQTu8CFAUO8l/X8xq655c1A=
=isRm
-----END PGP SIGNATURE-----
Merge 4.19.38 into android-4.19
Changes in 4.19.38
netfilter: nft_compat: use refcnt_t type for nft_xt reference count
netfilter: nft_compat: make lists per netns
netfilter: nf_tables: split set destruction in deactivate and destroy phase
netfilter: nft_compat: destroy function must not have side effects
netfilter: nf_tables: warn when expr implements only one of activate/deactivate
netfilter: nf_tables: unbind set in rule from commit path
netfilter: nft_compat: don't use refcount_inc on newly allocated entry
netfilter: nft_compat: use .release_ops and remove list of extension
netfilter: nf_tables: fix set double-free in abort path
netfilter: nf_tables: bogus EBUSY when deleting set after flush
netfilter: nf_tables: bogus EBUSY in helper removal from transaction
net/ibmvnic: Fix RTNL deadlock during device reset
net: mvpp2: fix validate for PPv2.1
ext4: fix some error pointer dereferences
tipc: handle the err returned from cmd header function
loop: do not print warn message if partition scan is successful
drm/rockchip: fix for mailbox read validation.
vsock/virtio: fix kernel panic from virtio_transport_reset_no_sock
ipvs: fix warning on unused variable
powerpc/vdso32: fix CLOCK_MONOTONIC on PPC64
ALSA: hda/ca0132 - Fix build error without CONFIG_PCI
net: dsa: mv88e6xxx: add call to mv88e6xxx_ports_cmode_init to probe for new DSA framework
cifs: fix memory leak in SMB2_read
cifs: do not attempt cifs operation on smb2+ rename error
tracing: Fix a memory leak by early error exit in trace_pid_write()
tracing: Fix buffer_ref pipe ops
gpio: eic: sprd: Fix incorrect irq type setting for the sync EIC
zram: pass down the bvec we need to read into in the work struct
lib/Kconfig.debug: fix build error without CONFIG_BLOCK
MIPS: scall64-o32: Fix indirect syscall number load
trace: Fix preempt_enable_no_resched() abuse
IB/rdmavt: Fix frwr memory registration
RDMA/mlx5: Do not allow the user to write to the clock page
sched/numa: Fix a possible divide-by-zero
ceph: only use d_name directly when parent is locked
ceph: ensure d_name stability in ceph_dentry_hash()
ceph: fix ci->i_head_snapc leak
nfsd: Don't release the callback slot unless it was actually held
sunrpc: don't mark uninitialised items as VALID.
perf/x86/intel: Update KBL Package C-state events to also include PC8/PC9/PC10 counters
Input: synaptics-rmi4 - write config register values to the right offset
vfio/type1: Limit DMA mappings per container
dmaengine: sh: rcar-dmac: With cyclic DMA residue 0 is valid
dmaengine: sh: rcar-dmac: Fix glitch in dmaengine_tx_status
ARM: 8857/1: efi: enable CP15 DMB instructions before cleaning the cache
powerpc/mm/radix: Make Radix require HUGETLB_PAGE
drm/vc4: Fix memory leak during gpu reset.
Revert "drm/i915/fbdev: Actually configure untiled displays"
drm/vc4: Fix compilation error reported by kbuild test bot
USB: Add new USB LPM helpers
USB: Consolidate LPM checks to avoid enabling LPM twice
slip: make slhc_free() silently accept an error pointer
intel_th: gth: Fix an off-by-one in output unassigning
fs/proc/proc_sysctl.c: Fix a NULL pointer dereference
workqueue: Try to catch flush_work() without INIT_WORK().
binder: fix handling of misaligned binder object
sched/deadline: Correctly handle active 0-lag timers
NFS: Forbid setting AF_INET6 to "struct sockaddr_in"->sin_family.
netfilter: ebtables: CONFIG_COMPAT: drop a bogus WARN_ON
fm10k: Fix a potential NULL pointer dereference
tipc: check bearer name with right length in tipc_nl_compat_bearer_enable
tipc: check link name with right length in tipc_nl_compat_link_set
net: netrom: Fix error cleanup path of nr_proto_init
net/rds: Check address length before reading address family
rxrpc: fix race condition in rxrpc_input_packet()
aio: clear IOCB_HIPRI
aio: use assigned completion handler
aio: separate out ring reservation from req allocation
aio: don't zero entire aio_kiocb aio_get_req()
aio: use iocb_put() instead of open coding it
aio: split out iocb copy from io_submit_one()
aio: abstract out io_event filler helper
aio: initialize kiocb private in case any filesystems expect it.
aio: simplify - and fix - fget/fput for io_submit()
pin iocb through aio.
aio: fold lookup_kiocb() into its sole caller
aio: keep io_event in aio_kiocb
aio: store event at final iocb_put()
Fix aio_poll() races
x86, retpolines: Raise limit for generating indirect calls from switch-case
x86/retpolines: Disable switch jump tables when retpolines are enabled
mm: Fix warning in insert_pfn()
x86/fpu: Don't export __kernel_fpu_{begin,end}()
ipv4: add sanity checks in ipv4_link_failure()
ipv4: set the tcp_min_rtt_wlen range from 0 to one day
mlxsw: spectrum: Fix autoneg status in ethtool
net/mlx5e: ethtool, Remove unsupported SFP EEPROM high pages query
net: rds: exchange of 8K and 1M pool
net/rose: fix unbound loop in rose_loopback_timer()
net: stmmac: move stmmac_check_ether_addr() to driver probe
net/tls: fix refcount adjustment in fallback
stmmac: pci: Adjust IOT2000 matching
team: fix possible recursive locking when add slaves
net: hns: Fix WARNING when hns modules installed
mlxsw: pci: Reincrease PCI reset timeout
mlxsw: spectrum: Put MC TCs into DWRR mode
net/mlx5e: Fix the max MTU check in case of XDP
net/mlx5e: Fix use-after-free after xdp_return_frame
net/tls: avoid potential deadlock in tls_set_device_offload_rx()
net/tls: don't leak IV and record seq when offload fails
powerpc/fsl: Add FSL_PPC_BOOK3E as supported arch for nospectre_v2 boot arg
Linux 4.19.38
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit 84c4e1f89f upstream.
Al Viro root-caused a race where the IOCB_CMD_POLL handling of
fget/fput() could cause us to access the file pointer after it had
already been freed:
"In more details - normally IOCB_CMD_POLL handling looks so:
1) io_submit(2) allocates aio_kiocb instance and passes it to
aio_poll()
2) aio_poll() resolves the descriptor to struct file by req->file =
fget(iocb->aio_fildes)
3) aio_poll() sets ->woken to false and raises ->ki_refcnt of that
aio_kiocb to 2 (bumps by 1, that is).
4) aio_poll() calls vfs_poll(). After sanity checks (basically,
"poll_wait() had been called and only once") it locks the queue.
That's what the extra reference to iocb had been for - we know we
can safely access it.
5) With queue locked, we check if ->woken has already been set to
true (by aio_poll_wake()) and, if it had been, we unlock the
queue, drop a reference to aio_kiocb and bugger off - at that
point it's a responsibility to aio_poll_wake() and the stuff
called/scheduled by it. That code will drop the reference to file
in req->file, along with the other reference to our aio_kiocb.
6) otherwise, we see whether we need to wait. If we do, we unlock the
queue, drop one reference to aio_kiocb and go away - eventual
wakeup (or cancel) will deal with the reference to file and with
the other reference to aio_kiocb
7) otherwise we remove ourselves from waitqueue (still under the
queue lock), so that wakeup won't get us. No async activity will
be happening, so we can safely drop req->file and iocb ourselves.
If wakeup happens while we are in vfs_poll(), we are fine - aio_kiocb
won't get freed under us, so we can do all the checks and locking
safely. And we don't touch ->file if we detect that case.
However, vfs_poll() most certainly *does* touch the file it had been
given. So wakeup coming while we are still in ->poll() might end up
doing fput() on that file. That case is not too rare, and usually we
are saved by the still present reference from descriptor table - that
fput() is not the final one.
But if another thread closes that descriptor right after our fget()
and wakeup does happen before ->poll() returns, we are in trouble -
final fput() done while we are in the middle of a method:
Al also wrote a patch to take an extra reference to the file descriptor
to fix this, but I instead suggested we just streamline the whole file
pointer handling by submit_io() so that the generic aio submission code
simply keeps the file pointer around until the aio has completed.
Fixes: bfe4037e72 ("aio: implement IOCB_CMD_POLL")
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Reported-by: syzbot+503d4cc169fcec1cb18c@syzkaller.appspotmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In order to have a common code base for fscrypt "post read" processing
for all filesystems which support encryption, this commit removes
filesystem specific build config option (e.g. CONFIG_EXT4_FS_ENCRYPTION)
and replaces it with a build option (i.e. CONFIG_FS_ENCRYPTION) whose
value affects all the filesystems making use of fscrypt.
Reviewed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
This allows filesystems to use their mount private data to
influence the permssions they return in permission2. It has
been separated into a new call to avoid disrupting current
permission users.
Bug: 35848445
Bug: 120446149
Change-Id: I9d416e3b8b6eca84ef3e336bd2af89ddd51df6ca
Signed-off-by: Daniel Rosenberg <drosen@google.com>
[AmitP: Minor refactoring of original patch to align with
changes from the following upstream commit
4bfd054ae1 ("fs: fold __inode_permission() into inode_permission()").
Also introduce vfs_mkobj2(), because do_create()
moved from using vfs_create() to vfs_mkobj()
eecec19d9e ("mqueue: switch to vfs_mkobj(), quit abusing ->d_fsdata")
do_create() is dropped/cleaned-up upstream so a
minor refactoring there as well.
066cc813e9 ("do_mq_open(): move all work prior to dentry_open() into a helper")]
Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
[astrachan: Folded the following changes into this patch:
f46c9d62dd81 ("ANDROID: fs: Export vfs_rmdir2")
9992eb8b9a1e ("ANDROID: xattr: Pass EOPNOTSUPP to permission2")]
Signed-off-by: Alistair Strachan <astrachan@google.com>
This allows filesystems to use their mount private data to
influence the permssions they use in setattr2. It has
been separated into a new call to avoid disrupting current
setattr users.
Bug: 120446149
Change-Id: I19959038309284448f1b7f232d579674ef546385
Signed-off-by: Daniel Rosenberg <drosen@google.com>
Now we pass the vfsmount when mounting and remounting.
This allows the filesystem to actually set up the mount
specific data, although we can't quite do anything with
it yet. show_options is expanded to include data that
lives with the mount.
To avoid changing existing filesystems, these have
been added as new vfs functions.
Bug: 120446149
Change-Id: If80670bfad9f287abb8ac22457e1b034c9697097
Signed-off-by: Daniel Rosenberg <drosen@google.com>
This starts to add private data associated directly
to mount points. The intent is to give filesystems
a sense of where they have come from, as a means of
letting a filesystem take different actions based on
this information.
Bug: 62094374
Bug: 120446149
Change-Id: Ie769d7b3bb2f5972afe05c1bf16cf88c91647ab2
Signed-off-by: Daniel Rosenberg <drosen@google.com>
[astrachan: Folded 89a54ed3bf68 ("ANDROID: mnt: Fix next_descendent")
into this patch]
Signed-off-by: Alistair Strachan <astrachan@google.com>
commit 721fb6fbfd upstream.
Detaching of mark connector from fsnotify_put_mark() can race with
unmounting of the filesystem like:
CPU1 CPU2
fsnotify_put_mark()
spin_lock(&conn->lock);
...
inode = fsnotify_detach_connector_from_object(conn)
spin_unlock(&conn->lock);
generic_shutdown_super()
fsnotify_unmount_inodes()
sees connector detached for inode
-> nothing to do
evict_inode()
barfs on pending inode reference
iput(inode);
Resulting in "Busy inodes after unmount" message and possible kernel
oops. Make fsnotify_unmount_inodes() properly wait for outstanding inode
references from detached connectors.
Note that the accounting of outstanding inode references in the
superblock can cause some cacheline contention on the counter. OTOH it
happens only during deletion of the last notification mark from an inode
(or during unlinking of watched inode) and that is not too bad. I have
measured time to create & delete inotify watch 100000 times from 64
processes in parallel (each process having its own inotify group and its
own file on a shared superblock) on a 64 CPU machine. Average and
standard deviation of 15 runs look like:
Avg Stddev
Vanilla 9.817400 0.276165
Fixed 9.710467 0.228294
So there's no statistically significant difference.
Fixes: 6b3f05d24d ("fsnotify: Detach mark from object list when last reference is dropped")
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Commit 031a072a0b ("vfs: call vfs_clone_file_range() under freeze
protection") created a wrapper do_clone_file_range() around
vfs_clone_file_range() moving the freeze protection to former, so
overlayfs could call the latter.
The more common vfs practice is to call do_xxx helpers from vfs_xxx
helpers, where freeze protecction is taken in the vfs_xxx helper, so
this anomality could be a source of confusion.
It seems that commit 8ede205541 ("ovl: add reflink/copyfile/dedup
support") may have fallen a victim to this confusion -
ovl_clone_file_range() calls the vfs_clone_file_range() helper in the
hope of getting freeze protection on upper fs, but in fact results in
overlayfs allowing to bypass upper fs freeze protection.
Swap the names of the two helpers to conform to common vfs practice
and call the correct helpers from overlayfs and nfsd.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
This is going to be used by overlayfs and possibly useful
for other filesystems.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Disallows open of FIFOs or regular files not owned by the user in world
writable sticky directories, unless the owner is the same as that of the
directory or the file is opened without the O_CREAT flag. The purpose
is to make data spoofing attacks harder. This protection can be turned
on and off separately for FIFOs and regular files via sysctl, just like
the symlinks/hardlinks protection. This patch is based on Openwall's
"HARDEN_FIFO" feature by Solar Designer.
This is a brief list of old vulnerabilities that could have been prevented
by this feature, some of them even allow for privilege escalation:
CVE-2000-1134
CVE-2007-3852
CVE-2008-0525
CVE-2009-0416
CVE-2011-4834
CVE-2015-1838
CVE-2015-7442
CVE-2016-7489
This list is not meant to be complete. It's difficult to track down all
vulnerabilities of this kind because they were often reported without any
mention of this particular attack vector. In fact, before
hardlinks/symlinks restrictions, fifos/regular files weren't the favorite
vehicle to exploit them.
[s.mesoraca16@gmail.com: fix bug reported by Dan Carpenter]
Link: https://lkml.kernel.org/r/20180426081456.GA7060@mwanda
Link: http://lkml.kernel.org/r/1524829819-11275-1-git-send-email-s.mesoraca16@gmail.com
[keescook@chromium.org: drop pr_warn_ratelimited() in favor of audit changes in the future]
[keescook@chromium.org: adjust commit subjet]
Link: http://lkml.kernel.org/r/20180416175918.GA13494@beast
Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Suggested-by: Solar Designer <solar@openwall.com>
Suggested-by: Kees Cook <keescook@chromium.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This contains two new features:
1) Stack file operations: this allows removal of several hacks from the
VFS, proper interaction of read-only open files with copy-up,
possibility to implement fs modifying ioctls properly, and others.
2) Metadata only copy-up: when file is on lower layer and only metadata is
modified (except size) then only copy up the metadata and continue to
use the data from the lower file.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCW3srhAAKCRDh3BK/laaZ
PC6tAQCP+KklcN+TvNp502f+O/kATahSpgnun4NY1/p4I8JV+AEAzdlkTN3+MiAO
fn9brN6mBK7h59DO3hqedPLJy2vrgwg=
=QDXH
-----END PGP SIGNATURE-----
Merge tag 'ovl-update-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull overlayfs updates from Miklos Szeredi:
"This contains two new features:
- Stack file operations: this allows removal of several hacks from
the VFS, proper interaction of read-only open files with copy-up,
possibility to implement fs modifying ioctls properly, and others.
- Metadata only copy-up: when file is on lower layer and only
metadata is modified (except size) then only copy up the metadata
and continue to use the data from the lower file"
* tag 'ovl-update-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: (66 commits)
ovl: Enable metadata only feature
ovl: Do not do metacopy only for ioctl modifying file attr
ovl: Do not do metadata only copy-up for truncate operation
ovl: add helper to force data copy-up
ovl: Check redirect on index as well
ovl: Set redirect on upper inode when it is linked
ovl: Set redirect on metacopy files upon rename
ovl: Do not set dentry type ORIGIN for broken hardlinks
ovl: Add an inode flag OVL_CONST_INO
ovl: Treat metacopy dentries as type OVL_PATH_MERGE
ovl: Check redirects for metacopy files
ovl: Move some dir related ovl_lookup_single() code in else block
ovl: Do not expose metacopy only dentry from d_real()
ovl: Open file with data except for the case of fsync
ovl: Add helper ovl_inode_realdata()
ovl: Store lower data inode in ovl_inode
ovl: Fix ovl_getattr() to get number of blocks from lower
ovl: Add helper ovl_dentry_lowerdata() to get lower data dentry
ovl: Copy up meta inode data from lowest data inode
ovl: Modify ovl_lookup() and friends to lookup metacopy dentry
...
a_ops->readpages() is only ever used for read-ahead, yet we don't flag
the IO being submitted as such. Fix that up. Any file system that uses
mpage_readpages() as its ->readpages() implementation will now get this
right.
Since we're passing in whether the IO is read-ahead or not, we don't
need to pass in the 'gfp' separately, as it is dependent on the IO being
read-ahead. Kill off that member.
Add some documentation notes on ->readpages() being purely for
read-ahead.
Link: http://lkml.kernel.org/r/20180621010725.17813-3-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Chris Mason <clm@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This flag was introduce in 2.1.37pre1 and the only place it was tested
was removed in 2.1.43pre1. The flag was never set.
Let's discard it properly.
Link: http://lkml.kernel.org/r/877en0hewz.fsf@notabene.neil.brown.name
Signed-off-by: NeilBrown <neilb@suse.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull misc vfs updates from Al Viro:
"Misc cleanups from various folks all over the place
I expected more fs/dcache.c cleanups this cycle, so that went into a
separate branch. Said cleanups have missed the window, so in the
hindsight it could've gone into work.misc instead. Decided not to
cherry-pick, thus the 'work.dcache' branch"
* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fs: dcache: Use true and false for boolean values
fold generic_readlink() into its only caller
fs: shave 8 bytes off of struct inode
fs: Add more kernel-doc to the produced documentation
fs: Fix attr.c kernel-doc
removed extra extern file_fdatawait_range
* 'work.dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
kill dentry_update_name_case()
Pull vfs icache updates from Al Viro:
- NFS mkdir/open_by_handle race fix
- analogous solution for FUSE, replacing the one currently in mainline
- new primitive to be used when discarding halfway set up inodes on
failed object creation; gives sane warranties re icache lookups not
returning such doomed by still not freed inodes. A bunch of
filesystems switched to that animal.
- Miklos' fix for last cycle regression in iget5_locked(); -stable will
need a slightly different variant, unfortunately.
- misc bits and pieces around things icache-related (in adfs and jfs).
* 'work.mkdir' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
jfs: don't bother with make_bad_inode() in ialloc()
adfs: don't put inodes into icache
new helper: inode_fake_hash()
vfs: don't evict uninitialized inode
jfs: switch to discard_new_inode()
ext2: make sure that partially set up inodes won't be returned by ext2_iget()
udf: switch to discard_new_inode()
ufs: switch to discard_new_inode()
btrfs: switch to discard_new_inode()
new primitive: discard_new_inode()
kill d_instantiate_no_diralias()
nfs_instantiate(): prevent multiple aliases for directory inode
Pull vfs open-related updates from Al Viro:
- "do we need fput() or put_filp()" rules are gone - it's always fput()
now. We keep track of that state where it belongs - in ->f_mode.
- int *opened mess killed - in finish_open(), in ->atomic_open()
instances and in fs/namei.c code around do_last()/lookup_open()/atomic_open().
- alloc_file() wrappers with saner calling conventions are introduced
(alloc_file_clone() and alloc_file_pseudo()); callers converted, with
much simplification.
- while we are at it, saner calling conventions for path_init() and
link_path_walk(), simplifying things inside fs/namei.c (both on
open-related paths and elsewhere).
* 'work.open3' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (40 commits)
few more cleanups of link_path_walk() callers
allow link_path_walk() to take ERR_PTR()
make path_init() unconditionally paired with terminate_walk()
document alloc_file() changes
make alloc_file() static
do_shmat(): grab shp->shm_file earlier, switch to alloc_file_clone()
new helper: alloc_file_clone()
create_pipe_files(): switch the first allocation to alloc_file_pseudo()
anon_inode_getfile(): switch to alloc_file_pseudo()
hugetlb_file_setup(): switch to alloc_file_pseudo()
ocxlflash_getfile(): switch to alloc_file_pseudo()
cxl_getfile(): switch to alloc_file_pseudo()
... and switch shmem_file_setup() to alloc_file_pseudo()
__shmem_file_setup(): reorder allocations
new wrapper: alloc_file_pseudo()
kill FILE_{CREATED,OPENED}
switch atomic_open() and lookup_open() to returning 0 in all success cases
document ->atomic_open() changes
->atomic_open(): return 0 in all success cases
get rid of 'opened' in path_openat() and the helpers downstream
...
We don't want open-by-handle picking half-set-up in-core
struct inode from e.g. mkdir() having failed halfway through.
In other words, we don't want such inodes returned by iget_locked()
on their way to extinction. However, we can't just have them
unhashed - otherwise open-by-handle immediately *after* that would've
ended up creating a new in-core inode over the on-disk one that
is in process of being freed right under us.
Solution: new flag (I_CREATING) set by insert_inode_locked() and
removed by unlock_new_inode() and a new primitive (discard_new_inode())
to be used by such halfway-through-setup failure exits instead of
unlock_new_inode() / iput() combinations. That primitive unlocks new
inode, but leaves I_CREATING in place.
iget_locked() treats finding an I_CREATING inode as failure
(-ESTALE, once we sort out the error propagation).
insert_inode_locked() treats the same as instant -EBUSY.
ilookup() treats those as icache miss.
[Fix by Dan Carpenter <dan.carpenter@oracle.com> folded in]
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull vfs fixes from Al Viro:
"Fix several places that screw up cleanups after failures halfway
through opening a file (one open-coding filp_clone_open() and getting
it wrong, two misusing alloc_file()). That part is -stable fodder from
the 'work.open' branch.
And Christoph's regression fix for uapi breakage in aio series;
include/uapi/linux/aio_abi.h shouldn't be pulling in the kernel
definition of sigset_t, the reason for doing so in the first place had
been bogus - there's no need to expose struct __aio_sigset in
aio_abi.h at all"
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
aio: don't expose __aio_sigset in uapi
ocxlflash_getfile(): fix double-iput() on alloc_file() failures
cxl_getfile(): fix double-iput() on alloc_file() failures
drm_mode_create_lease_ioctl(): fix open-coded filp_clone_open()
Opening regular files on overlayfs is now handled via ovl_open(). Remove
the now unused "open_flags" argument from d_op->d_real() and the d_real()
helper.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
This partially reverts commit c568d68341.
Overlayfs files will now automatically get the correct locks, no need to
hack overlay support in VFS.
It is a partial revert, because it leaves the locks_inode() calls in place
and defines locks_inode() to file_inode(). We could revert those as well,
but it would be unnecessary code churn and it makes sense to document that
we are getting the inode for locking purposes.
Don't revert MS_NOREMOTELOCK yet since that has been part of the userspace
API for some time (though not in a useful way). Will try to remove
internal flags later when the dust around the new mount API settles.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Acked-by: Jeff Layton <jlayton@kernel.org>
Stacking file operations in overlay will store an extra open file for each
overlay file opened.
The overhead is just that of "struct file" which is about 256bytes, because
overlay already pins an extra dentry and inode when the file is open, which
add up to a much larger overhead.
For fear of breaking working setups, don't start accounting the extra file.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
open a file by given inode, faking ->f_path. Use with shitloads
of caution - at the very least you'd damn better make sure that
some dentry alias of that inode is pinned down by the path in
question. Again, this is no general-purpose interface and I hope
it will eventually go away. Right now overlayfs wants something
like that, but nothing else should.
Any out-of-tree code with bright idea of using this one *will*
eventually get hurt, with zero notice and great delight on my part.
I refuse to use EXPORT_SYMBOL_GPL(), especially in situations when
it's really EXPORT_SYMBOL_DONT_USE_IT(), but don't take that export
as "you are welcome to use it".
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Parallel to FILE_CREATED, goes into ->f_mode instead of *opened.
NFS is a bit of a wart here - it doesn't have file at the point
where FILE_CREATED used to be set, so we need to propagate it
there (for now). IMA is another one (here and everywhere)...
Note that this needs do_dentry_open() to leave old bits in ->f_mode
alone - we want it to preserve FMODE_CREATED if it had been already
set (no other bit can be there).
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
basically, "is that instance set up enough for regular fput(), or
do we want put_filp() for that one".
NOTE: the only alloc_file() caller that could be followed by put_filp()
is in arch/ia64/kernel/perfmon.c, which is (Kconfig-level) broken.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Failure of ->open() should *not* be followed by fput(). Fixed by
using filp_clone_open(), which gets the cleanups right.
Cc: stable@vger.kernel.org
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
it's exactly the same thing as
dentry_open(&file->f_path, file->f_flags, file->f_cred)
... and rename it to file_clone_open(), while we are at it.
'filp' naming convention is bogus; sure, it's "file pointer",
but we generally don't do that kind of Hungarian notation.
Some of the instances have too many callers to touch, but this
one has only two, so let's sanitize it while we can...
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Here is a link to Linus' reply to Jan's concern about making
i_blkbibts byte addressable:
https://marc.info/?l=linux-fsdevel&m=152882624707975&w=2
Here is a link to an lkp.org report about potential performance
improvement in some workload, which could(?) be related to packing
i_blkbits closer to i_bytes/i_lock:
https://marc.info/?l=linux-fsdevel&m=153077048108198&w=2
Changes since v1:
- Add links to relevant discussions
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Clean up f_op->dedupe_file_range() interface.
1) Use loff_t for offsets and length instead of u64
2) Order the arguments the same way as {copy|clone}_file_range().
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
The poll() changes were not well thought out, and completely
unexplained. They also caused a huge performance regression, because
"->poll()" was no longer a trivial file operation that just called down
to the underlying file operations, but instead did at least two indirect
calls.
Indirect calls are sadly slow now with the Spectre mitigation, but the
performance problem could at least be largely mitigated by changing the
"->get_poll_head()" operation to just have a per-file-descriptor pointer
to the poll head instead. That gets rid of one of the new indirections.
But that doesn't fix the new complexity that is completely unwarranted
for the regular case. The (undocumented) reason for the poll() changes
was some alleged AIO poll race fixing, but we don't make the common case
slower and more complex for some uncommon special case, so this all
really needs way more explanations and most likely a fundamental
redesign.
[ This revert is a revert of about 30 different commits, not reverted
individually because that would just be unnecessarily messy - Linus ]
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jeff added this extern twice in commit a823e4589e
Fixes: a823e4589e ("mm: add file_fdatawait_range and file_write_and_wait")
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This is a late set of changes from Deepa Dinamani doing an automated
treewide conversion of the inode and iattr structures from 'timespec'
to 'timespec64', to push the conversion from the VFS layer into the
individual file systems.
There were no conflicts between this and the contents of linux-next
until just before the merge window, when we saw multiple problems:
- A minor conflict with my own y2038 fixes, which I could address
by adding another patch on top here.
- One semantic conflict with late changes to the NFS tree. I addressed
this by merging Deepa's original branch on top of the changes that
now got merged into mainline and making sure the merge commit includes
the necessary changes as produced by coccinelle.
- A trivial conflict against the removal of staging/lustre.
- Multiple conflicts against the VFS changes in the overlayfs tree.
These are still part of linux-next, but apparently this is no longer
intended for 4.18 [1], so I am ignoring that part.
As Deepa writes:
The series aims to switch vfs timestamps to use struct timespec64.
Currently vfs uses struct timespec, which is not y2038 safe.
The series involves the following:
1. Add vfs helper functions for supporting struct timepec64 timestamps.
2. Cast prints of vfs timestamps to avoid warnings after the switch.
3. Simplify code using vfs timestamps so that the actual
replacement becomes easy.
4. Convert vfs timestamps to use struct timespec64 using a script.
This is a flag day patch.
Next steps:
1. Convert APIs that can handle timespec64, instead of converting
timestamps at the boundaries.
2. Update internal data structures to avoid timestamp conversions.
Thomas Gleixner adds:
I think there is no point to drag that out for the next merge window.
The whole thing needs to be done in one go for the core changes which
means that you're going to play that catchup game forever. Let's get
over with it towards the end of the merge window.
[1] https://www.spinics.net/lists/linux-fsdevel/msg128294.html
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJbInZAAAoJEGCrR//JCVInReoQAIlVIIMt5ZX6wmaKbrjy9Itf
MfgbFihQ/djLnuSPVQ3nztcxF0d66BKHZ9puVjz6+mIHqfDvJTRwZs9nU+sOF/T1
g78fRkM1cxq6ZCkGYAbzyjyo5aC4PnSMP/NQLmwqvi0MXqqrbDoq5ZdP9DHJw39h
L9lD8FM/P7T29Fgp9tq/pT5l9X8VU8+s5KQG1uhB5hii4VL6pD6JyLElDita7rg+
Z7/V7jkxIGEUWF7vGaiR1QTFzEtpUA/exDf9cnsf51OGtK/LJfQ0oiZPPuq3oA/E
LSbt8YQQObc+dvfnGxwgxEg1k5WP5ekj/Wdibv/+rQKgGyLOTz6Q4xK6r8F2ahxs
nyZQBdXqHhJYyKr1H1reUH3mrSgQbE5U5R1i3My0xV2dSn+vtK5vgF21v2Ku3A1G
wJratdtF/kVBzSEQUhsYTw14Un+xhBLRWzcq0cELonqxaKvRQK9r92KHLIWNE7/v
c0TmhFbkZA+zR8HdsaL3iYf1+0W/eYy8PcvepyldKNeW2pVk3CyvdTfY2Z87G2XK
tIkK+BUWbG3drEGG3hxZ3757Ln3a9qWyC5ruD3mBVkuug/wekbI8PykYJS7Mx4s/
WNXl0dAL0Eeu1M8uEJejRAe1Q3eXoMWZbvCYZc+wAm92pATfHVcKwPOh8P7NHlfy
A3HkjIBrKW5AgQDxfgvm
=CZX2
-----END PGP SIGNATURE-----
Merge tag 'vfs-timespec64' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground
Pull inode timestamps conversion to timespec64 from Arnd Bergmann:
"This is a late set of changes from Deepa Dinamani doing an automated
treewide conversion of the inode and iattr structures from 'timespec'
to 'timespec64', to push the conversion from the VFS layer into the
individual file systems.
As Deepa writes:
'The series aims to switch vfs timestamps to use struct timespec64.
Currently vfs uses struct timespec, which is not y2038 safe.
The series involves the following:
1. Add vfs helper functions for supporting struct timepec64
timestamps.
2. Cast prints of vfs timestamps to avoid warnings after the switch.
3. Simplify code using vfs timestamps so that the actual replacement
becomes easy.
4. Convert vfs timestamps to use struct timespec64 using a script.
This is a flag day patch.
Next steps:
1. Convert APIs that can handle timespec64, instead of converting
timestamps at the boundaries.
2. Update internal data structures to avoid timestamp conversions'
Thomas Gleixner adds:
'I think there is no point to drag that out for the next merge
window. The whole thing needs to be done in one go for the core
changes which means that you're going to play that catchup game
forever. Let's get over with it towards the end of the merge window'"
* tag 'vfs-timespec64' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground:
pstore: Remove bogus format string definition
vfs: change inode times to use struct timespec64
pstore: Convert internal records to timespec64
udf: Simplify calls to udf_disk_stamp_to_time
fs: nfs: get rid of memcpys for inode times
ceph: make inode time prints to be long long
lustre: Use long long type to print inode time
fs: add timespec64_truncate()