Commit graph

741 commits

Author SHA1 Message Date
Greg Kroah-Hartman
be3bb0daac This is the 4.19.119 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAl6pj8gACgkQONu9yGCS
 aT5mKxAAvzC4s4XHwDDckvvu57/sED2oEtp7MgmLuyK4Ih55GyyLGx9zg1A2z+rs
 wQSsVW+/WeurCj4CuVciakkCvgBeY494cbnghr2lohhJZ918/XnYmPODLJhlvtcV
 gZ4vxk5euNqpWGsmu+X+DRBG6QuU5GYf4ox39NZdtKm5I+kt5Lw44AHSNlFP0q3y
 drRFc49cqSxa4WkVRixJJOTQbSHARNWiayOG4uLb4zoZFvJOTDAp7+yX5LYD7lxY
 3FsQLVMSp7c/whppeGySVX0oJF/12weR9OQJZVxxhlMNggmGREwDxayBaPYqA4pa
 0OO83rO1aP9j2VK3HFiK4OwatKHcu0GvGV9I4rP3u8hWvJyUzTAfdcVUXHl6of12
 6hXG7F3f0TVY/OP6J2WepcQG5IbkiiAY1J0wlqbqo5MvOqESJZ/J0pGuFD1qzQ8n
 zaMnj2zhJHkJEfyP7Dvjo4y72eM9tWnFxKfm/PtuHWGovpP15rrsuHcs343U92Z7
 zQ/Ak10tA8FpSM7dXaTd98/3FkVdQbkImkEUOpWzPjiJFGyuk8j6/ZE9rCWtlNR1
 HP9cLgKB/PF/a3+kwtgGAhAHBVIA8trhSm1jRqEU7ki9sBQnV/2iR5b7UJ8xn4uA
 dl9HlxpiDYvIjRhHfMh6GXIhdO2T8coFzxKRztjxrM0dbaQeVKQ=
 =31Y5
 -----END PGP SIGNATURE-----

Merge 4.19.119 into android-4.19

Changes in 4.19.119
	ext4: fix extent_status fragmentation for plain files
	drm/msm: Use the correct dma_sync calls harder
	bpftool: Fix printing incorrect pointer in btf_dump_ptr
	crypto: mxs-dcp - make symbols 'sha1_null_hash' and 'sha256_null_hash' static
	vti4: removed duplicate log message.
	arm64: Add part number for Neoverse N1
	arm64: errata: Hide CTR_EL0.DIC on systems affected by Neoverse-N1 #1542419
	arm64: Fake the IminLine size on systems affected by Neoverse-N1 #1542419
	arm64: compat: Workaround Neoverse-N1 #1542419 for compat user-space
	arm64: Silence clang warning on mismatched value/register sizes
	watchdog: reset last_hw_keepalive time at start
	scsi: lpfc: Fix kasan slab-out-of-bounds error in lpfc_unreg_login
	scsi: lpfc: Fix crash in target side cable pulls hitting WAIT_FOR_UNREG
	ceph: return ceph_mdsc_do_request() errors from __get_parent()
	ceph: don't skip updating wanted caps when cap is stale
	pwm: rcar: Fix late Runtime PM enablement
	scsi: iscsi: Report unbind session event when the target has been removed
	ASoC: Intel: atom: Take the drv->lock mutex before calling sst_send_slot_map()
	nvme: fix deadlock caused by ANA update wrong locking
	kernel/gcov/fs.c: gcov_seq_next() should increase position index
	selftests: kmod: fix handling test numbers above 9
	ipc/util.c: sysvipc_find_ipc() should increase position index
	kconfig: qconf: Fix a few alignment issues
	s390/cio: avoid duplicated 'ADD' uevents
	loop: Better discard support for block devices
	Revert "powerpc/64: irq_work avoid interrupt when called with hardware irqs enabled"
	pwm: renesas-tpu: Fix late Runtime PM enablement
	pwm: bcm2835: Dynamically allocate base
	perf/core: Disable page faults when getting phys address
	ASoC: Intel: bytcr_rt5640: Add quirk for MPMAN MPWIN895CL tablet
	xhci: Ensure link state is U3 after setting USB_SS_PORT_LS_U3
	drm/amd/display: Not doing optimize bandwidth if flip pending.
	tracing/selftests: Turn off timeout setting
	virtio-blk: improve virtqueue error to BLK_STS
	scsi: smartpqi: fix call trace in device discovery
	PCI/ASPM: Allow re-enabling Clock PM
	net: ipv6: add net argument to ip6_dst_lookup_flow
	net: ipv6_stub: use ip6_dst_lookup_flow instead of ip6_dst_lookup
	blktrace: Protect q->blk_trace with RCU
	blktrace: fix dereference after null check
	f2fs: fix to avoid memory leakage in f2fs_listxattr
	KVM: VMX: Zero out *all* general purpose registers after VM-Exit
	KVM: nVMX: Always sync GUEST_BNDCFGS when it comes from vmcs01
	KVM: Introduce a new guest mapping API
	kvm: fix compilation on aarch64
	kvm: fix compilation on s390
	kvm: fix compile on s390 part 2
	KVM: Properly check if "page" is valid in kvm_vcpu_unmap
	x86/kvm: Introduce kvm_(un)map_gfn()
	x86/kvm: Cache gfn to pfn translation
	x86/KVM: Make sure KVM_VCPU_FLUSH_TLB flag is not missed
	x86/KVM: Clean up host's steal time structure
	cxgb4: fix adapter crash due to wrong MC size
	cxgb4: fix large delays in PTP synchronization
	ipv6: fix restrict IPV6_ADDRFORM operation
	macsec: avoid to set wrong mtu
	macvlan: fix null dereference in macvlan_device_event()
	net: bcmgenet: correct per TX/RX ring statistics
	net: netrom: Fix potential nr_neigh refcnt leak in nr_add_node
	net: stmmac: dwmac-meson8b: Add missing boundary to RGMII TX clock array
	net/x25: Fix x25_neigh refcnt leak when receiving frame
	sched: etf: do not assume all sockets are full blown
	tcp: cache line align MAX_TCP_HEADER
	team: fix hang in team_mode_get()
	vrf: Fix IPv6 with qdisc and xfrm
	net: dsa: b53: Lookup VID in ARL searches when VLAN is enabled
	net: dsa: b53: Fix ARL register definitions
	net: dsa: b53: Rework ARL bin logic
	net: dsa: b53: b53_arl_rw_op() needs to select IVL or SVL
	xfrm: Always set XFRM_TRANSFORMED in xfrm{4,6}_output_finish
	vrf: Check skb for XFRM_TRANSFORMED flag
	mlxsw: Fix some IS_ERR() vs NULL bugs
	KEYS: Avoid false positive ENOMEM error on key read
	ALSA: hda: Remove ASUS ROG Zenith from the blacklist
	ALSA: usb-audio: Add static mapping table for ALC1220-VB-based mobos
	ALSA: usb-audio: Add connector notifier delegation
	iio: core: remove extra semi-colon from devm_iio_device_register() macro
	iio: st_sensors: rely on odr mask to know if odr can be set
	iio: adc: stm32-adc: fix sleep in atomic context
	iio: xilinx-xadc: Fix ADC-B powerdown
	iio: xilinx-xadc: Fix clearing interrupt when enabling trigger
	iio: xilinx-xadc: Fix sequencer configuration for aux channels in simultaneous mode
	iio: xilinx-xadc: Make sure not exceed maximum samplerate
	fs/namespace.c: fix mountpoint reference counter race
	USB: sisusbvga: Change port variable from signed to unsigned
	USB: Add USB_QUIRK_DELAY_CTRL_MSG and USB_QUIRK_DELAY_INIT for Corsair K70 RGB RAPIDFIRE
	USB: early: Handle AMD's spec-compliant identifiers, too
	USB: core: Fix free-while-in-use bug in the USB S-Glibrary
	USB: hub: Fix handling of connect changes during sleep
	vmalloc: fix remap_vmalloc_range() bounds checks
	mm/hugetlb: fix a addressing exception caused by huge_pte_offset
	mm/ksm: fix NULL pointer dereference when KSM zero page is enabled
	tools/vm: fix cross-compile build
	ALSA: usx2y: Fix potential NULL dereference
	ALSA: hda/realtek - Fix unexpected init_amp override
	ALSA: hda/realtek - Add new codec supported for ALC245
	ALSA: usb-audio: Fix usb audio refcnt leak when getting spdif
	ALSA: usb-audio: Filter out unsupported sample rates on Focusrite devices
	tpm/tpm_tis: Free IRQ if probing fails
	tpm: ibmvtpm: retry on H_CLOSED in tpm_ibmvtpm_send()
	KVM: s390: Return last valid slot if approx index is out-of-bounds
	KVM: Check validity of resolved slot when searching memslots
	KVM: VMX: Enable machine check support for 32bit targets
	tty: hvc: fix buffer overflow during hvc_alloc().
	tty: rocket, avoid OOB access
	usb-storage: Add unusual_devs entry for JMicron JMS566
	audit: check the length of userspace generated audit records
	ASoC: dapm: fixup dapm kcontrol widget
	iwlwifi: pcie: actually release queue memory in TVQM
	iwlwifi: mvm: beacon statistics shouldn't go backwards
	ARM: imx: provide v7_cpu_resume() only on ARM_CPU_SUSPEND=y
	powerpc/setup_64: Set cache-line-size based on cache-block-size
	staging: comedi: dt2815: fix writing hi byte of analog output
	staging: comedi: Fix comedi_device refcnt leak in comedi_open
	vt: don't hardcode the mem allocation upper bound
	vt: don't use kmalloc() for the unicode screen buffer
	staging: vt6656: Don't set RCR_MULTICAST or RCR_BROADCAST by default.
	staging: vt6656: Fix calling conditions of vnt_set_bss_mode
	staging: vt6656: Fix drivers TBTT timing counter.
	staging: vt6656: Fix pairwise key entry save.
	staging: vt6656: Power save stop wake_up_count wrap around.
	cdc-acm: close race betrween suspend() and acm_softint
	cdc-acm: introduce a cool down
	UAS: no use logging any details in case of ENODEV
	UAS: fix deadlock in error handling and PM flushing work
	usb: dwc3: gadget: Fix request completion check
	usb: f_fs: Clear OS Extended descriptor counts to zero in ffs_data_reset()
	xhci: prevent bus suspend if a roothub port detected a over-current condition
	serial: sh-sci: Make sure status register SCxSR is read in correct sequence
	xfs: Fix deadlock between AGI and AGF with RENAME_WHITEOUT
	s390/mm: fix page table upgrade vs 2ndary address mode accesses
	Linux 4.19.119

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I4b16db8472367d135a4ff68d2863c634bf093ef5
2020-04-29 17:26:17 +02:00
Jan Kara
473d7f5ed7 blktrace: Protect q->blk_trace with RCU
commit c780e86dd4 upstream.

KASAN is reporting that __blk_add_trace() has a use-after-free issue
when accessing q->blk_trace. Indeed the switching of block tracing (and
thus eventual freeing of q->blk_trace) is completely unsynchronized with
the currently running tracing and thus it can happen that the blk_trace
structure is being freed just while __blk_add_trace() works on it.
Protect accesses to q->blk_trace by RCU during tracing and make sure we
wait for the end of RCU grace period when shutting down tracing. Luckily
that is rare enough event that we can afford that. Note that postponing
the freeing of blk_trace to an RCU callback should better be avoided as
it could have unexpected user visible side-effects as debugfs files
would be still existing for a short while block tracing has been shut
down.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=205711
CC: stable@vger.kernel.org
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reported-by: Tristan Madani <tristmd@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
[bwh: Backported to 4.19: adjust context]
Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-04-29 16:31:17 +02:00
Will McVicker
7b8719ed1c ANDROID: GKI: block: resolve ABI diff when CONFIG_BLK_DEV_BSG is unset
If CONFIG_BLK_DEV_BSG is disabled on the vendor side, then struct
request_queue will have an ABI diff. Remove the #ifdef in the struct so
that vendors can choose whether or not to use the config.

'struct request_queue at blkdev.h:434:1' changed:¬
  type size changed from 17600 to 17920 (in bits)¬
  2 data member insertions:¬
    'bsg_job_fn* request_queue::bsg_job_fn', at offset 13952 (in bits) at blkdev.h:650:1¬
    'bsg_class_device request_queue::bsg_dev', at offset 14016 (in bits) at blkdev.h:651:1¬

Signed-off-by: Will McVicker <willmcvicker@google.com>
Bug: 152917514
Test: compile, verify ABI diff
Change-Id: Iec2e3460b717b667fcc0409cbc4ec985cfe0f69c
2020-04-01 16:26:56 -07:00
Greg Kroah-Hartman
8cb4870403 This is the 4.19.98 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAl4pSYMACgkQONu9yGCS
 aT7Rkg/8C/AXaTp+2HxRj3ZO56uzpMBMb5duBzdzxnEnvFp+DIM7xxRX+NFI5CSK
 4rjnxMd2tPsFtqiWo/bBCUcHh9gu5HJKOMFRZGaRYAXvJ/8hgahgzkBE00JiAB6r
 mrk9Y/pwcKxMFsAHtu3xM0oENeefXOmavVTHc9N3DQLd3hNuyTrPztBMFaDg8djR
 pSwh1uE2G+Z2UOdi2kXmHiEIG6NViIqp+qFYI5CUIyeKfvOEsR5nSQ97LyNQ+dUX
 qshARQFuk78+Ax+GNPTQXiWdzN7+SH5aw5frFtdhAN90F+XrRDj4ZXw+EkX+/M2J
 NZU9P/v41ESG8RWxbAZ6osAUkQ4Dgq2BQpdyRxNNjTchXc0Kr4K6BCKuhY6cGxS7
 0PXPV7MsuAHYIrIvzG2lqif9gmknA0UrGVKuYJIZxBaWlHD2mEkFby0W0HIcBwir
 yKKK3fkFjmsGKYzh+VZVoGySWDbs7qYASWXHOCz0QCLb0CT8/ePbyxLdjY7u5KyX
 wDaDHXG9nm6Nu68HD/9CRnUkiK8dnsODZ0k+sBZfEa+xvHPJCdv3gnrf4SwU7dj7
 ZyhO9XkFzncOJDoxYxiXTfI+zbU1ZhaDw7fk2PFvAI6P1xRS3m6rp8pDWp8iw/MX
 92Sz1YzS68+otHLi+OBGxzu10PwMDtu2nUvqn68SYq6Rp0mZnnE=
 =2O94
 -----END PGP SIGNATURE-----

Merge 4.19.98 into android-4.19

Changes in 4.19.98
	ARM: dts: meson8: fix the size of the PMU registers
	clk: qcom: gcc-sdm845: Add missing flag to votable GDSCs
	dt-bindings: reset: meson8b: fix duplicate reset IDs
	ARM: dts: imx6q-dhcom: fix rtc compatible
	clk: Don't try to enable critical clocks if prepare failed
	ASoC: msm8916-wcd-digital: Reset RX interpolation path after use
	iio: buffer: align the size of scan bytes to size of the largest element
	USB: serial: simple: Add Motorola Solutions TETRA MTP3xxx and MTP85xx
	USB: serial: option: Add support for Quectel RM500Q
	USB: serial: opticon: fix control-message timeouts
	USB: serial: option: add support for Quectel RM500Q in QDL mode
	USB: serial: suppress driver bind attributes
	USB: serial: ch341: handle unbound port at reset_resume
	USB: serial: io_edgeport: handle unbound ports on URB completion
	USB: serial: io_edgeport: add missing active-port sanity check
	USB: serial: keyspan: handle unbound ports
	USB: serial: quatech2: handle unbound ports
	scsi: fnic: fix invalid stack access
	scsi: mptfusion: Fix double fetch bug in ioctl
	ASoC: msm8916-wcd-analog: Fix selected events for MIC BIAS External1
	ASoC: msm8916-wcd-analog: Fix MIC BIAS Internal1
	ARM: dts: imx6q-dhcom: Fix SGTL5000 VDDIO regulator connection
	ALSA: dice: fix fallback from protocol extension into limited functionality
	ALSA: seq: Fix racy access for queue timer in proc read
	ALSA: usb-audio: fix sync-ep altsetting sanity check
	arm64: dts: allwinner: a64: olinuxino: Fix SDIO supply regulator
	Fix built-in early-load Intel microcode alignment
	block: fix an integer overflow in logical block size
	ARM: dts: am571x-idk: Fix gpios property to have the correct gpio number
	LSM: generalize flag passing to security_capable
	ptrace: reintroduce usage of subjective credentials in ptrace_has_cap()
	usb: core: hub: Improved device recognition on remote wakeup
	x86/resctrl: Fix an imbalance in domain_remove_cpu()
	x86/CPU/AMD: Ensure clearing of SME/SEV features is maintained
	x86/efistub: Disable paging at mixed mode entry
	drm/i915: Add missing include file <linux/math64.h>
	x86/resctrl: Fix potential memory leak
	perf hists: Fix variable name's inconsistency in hists__for_each() macro
	perf report: Fix incorrectly added dimensions as switch perf data file
	mm/shmem.c: thp, shmem: fix conflict of above-47bit hint address and PMD alignment
	mm: memcg/slab: call flush_memcg_workqueue() only if memcg workqueue is valid
	btrfs: rework arguments of btrfs_unlink_subvol
	btrfs: fix invalid removal of root ref
	btrfs: do not delete mismatched root refs
	btrfs: fix memory leak in qgroup accounting
	mm/page-writeback.c: avoid potential division by zero in wb_min_max_ratio()
	ARM: dts: imx6qdl: Add Engicam i.Core 1.5 MX6
	ARM: dts: imx6q-icore-mipi: Use 1.5 version of i.Core MX6DL
	ARM: dts: imx7: Fix Toradex Colibri iMX7S 256MB NAND flash support
	net: stmmac: 16KB buffer must be 16 byte aligned
	net: stmmac: Enable 16KB buffer size
	mm/huge_memory.c: make __thp_get_unmapped_area static
	mm/huge_memory.c: thp: fix conflict of above-47bit hint address and PMD alignment
	arm64: dts: agilex/stratix10: fix pmu interrupt numbers
	bpf: Fix incorrect verifier simulation of ARSH under ALU32
	cfg80211: fix deadlocks in autodisconnect work
	cfg80211: fix memory leak in cfg80211_cqm_rssi_update
	cfg80211: fix page refcount issue in A-MSDU decap
	netfilter: fix a use-after-free in mtype_destroy()
	netfilter: arp_tables: init netns pointer in xt_tgdtor_param struct
	netfilter: nft_tunnel: fix null-attribute check
	netfilter: nf_tables: remove WARN and add NLA_STRING upper limits
	netfilter: nf_tables: store transaction list locally while requesting module
	netfilter: nf_tables: fix flowtable list del corruption
	NFC: pn533: fix bulk-message timeout
	batman-adv: Fix DAT candidate selection on little endian systems
	macvlan: use skb_reset_mac_header() in macvlan_queue_xmit()
	hv_netvsc: Fix memory leak when removing rndis device
	net: dsa: tag_qca: fix doubled Tx statistics
	net: hns: fix soft lockup when there is not enough memory
	net: usb: lan78xx: limit size of local TSO packets
	net/wan/fsl_ucc_hdlc: fix out of bounds write on array utdm_info
	ptp: free ptp device pin descriptors properly
	r8152: add missing endpoint sanity check
	tcp: fix marked lost packets not being retransmitted
	sh_eth: check sh_eth_cpu_data::dual_port when dumping registers
	mlxsw: spectrum: Wipe xstats.backlog of down ports
	mlxsw: spectrum_qdisc: Include MC TCs in Qdisc counters
	xen/blkfront: Adjust indentation in xlvbd_alloc_gendisk
	tcp: refine rule to allow EPOLLOUT generation under mem pressure
	irqchip: Place CONFIG_SIFIVE_PLIC into the menu
	cw1200: Fix a signedness bug in cw1200_load_firmware()
	arm64: dts: meson-gxl-s905x-khadas-vim: fix gpio-keys-polled node
	cfg80211: check for set_wiphy_params
	tick/sched: Annotate lockless access to last_jiffies_update
	arm64: dts: marvell: Fix CP110 NAND controller node multi-line comment alignment
	Revert "arm64: dts: juno: add dma-ranges property"
	mtd: devices: fix mchp23k256 read and write
	drm/nouveau/bar/nv50: check bar1 vmm return value
	drm/nouveau/bar/gf100: ensure BAR is mapped
	drm/nouveau/mmu: qualify vmm during dtor
	reiserfs: fix handling of -EOPNOTSUPP in reiserfs_for_each_xattr
	scsi: esas2r: unlock on error in esas2r_nvram_read_direct()
	scsi: qla4xxx: fix double free bug
	scsi: bnx2i: fix potential use after free
	scsi: target: core: Fix a pr_debug() argument
	scsi: qla2xxx: Fix qla2x00_request_irqs() for MSI
	scsi: qla2xxx: fix rports not being mark as lost in sync fabric scan
	scsi: core: scsi_trace: Use get_unaligned_be*()
	perf probe: Fix wrong address verification
	clk: sprd: Use IS_ERR() to validate the return value of syscon_regmap_lookup_by_phandle()
	regulator: ab8500: Remove SYSCLKREQ from enum ab8505_regulator_id
	hwmon: (pmbus/ibm-cffps) Switch LEDs to blocking brightness call
	Linux 4.19.98

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I74a43a9e60734aec6d24b10374ba97de89172eca
2020-01-23 08:36:16 +01:00
Mikulas Patocka
a7f79052d1 block: fix an integer overflow in logical block size
commit ad6bf88a6c upstream.

Logical block size has type unsigned short. That means that it can be at
most 32768. However, there are architectures that can run with 64k pages
(for example arm64) and on these architectures, it may be possible to
create block devices with 64k block size.

For exmaple (run this on an architecture with 64k pages):

Mount will fail with this error because it tries to read the superblock using 2-sector
access:
  device-mapper: writecache: I/O is not aligned, sector 2, size 1024, block size 65536
  EXT4-fs (dm-0): unable to read superblock

This patch changes the logical block size from unsigned short to unsigned
int to avoid the overflow.

Cc: stable@vger.kernel.org
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-01-23 08:21:29 +01:00
Satya Tangirala
b0a4fb22e5 BACKPORT: FROMLIST: block: Keyslot Manager for Inline Encryption
Inline Encryption hardware allows software to specify an encryption context
(an encryption key, crypto algorithm, data unit num, data unit size, etc.)
along with a data transfer request to a storage device, and the inline
encryption hardware will use that context to en/decrypt the data. The
inline encryption hardware is part of the storage device, and it
conceptually sits on the data path between system memory and the storage
device.

Inline Encryption hardware implementations often function around the
concept of "keyslots". These implementations often have a limited number
of "keyslots", each of which can hold an encryption context (we say that
an encryption context can be "programmed" into a keyslot). Requests made
to the storage device may have a keyslot associated with them, and the
inline encryption hardware will en/decrypt the data in the requests using
the encryption context programmed into that associated keyslot. As
keyslots are limited, and programming keys may be expensive in many
implementations, and multiple requests may use exactly the same encryption
contexts, we introduce a Keyslot Manager to efficiently manage keyslots.
The keyslot manager also functions as the interface that upper layers will
use to program keys into inline encryption hardware. For more information
on the Keyslot Manager, refer to documentation found in
block/keyslot-manager.c and linux/keyslot-manager.h.

Bug: 137270441
Test: tested as series; see I26aac0ac7845a9064f28bb1421eb2522828a6dec
Change-Id: I9a2dc72d61d5a3c64af379a97dd46155b41193eb
Signed-off-by: Satya Tangirala <satyat@google.com>
Link: https://patchwork.kernel.org/patch/11214713/
2019-11-14 14:47:49 -08:00
Bart Van Assche
c58a650736 block, scsi: Change the preempt-only flag into a counter
commit cd84a62e00 upstream.

The RQF_PREEMPT flag is used for three purposes:
- In the SCSI core, for making sure that power management requests
  are executed even if a device is in the "quiesced" state.
- For domain validation by SCSI drivers that use the parallel port.
- In the IDE driver, for IDE preempt requests.
Rename "preempt-only" into "pm-only" because the primary purpose of
this mode is power management. Since the power management core may
but does not have to resume a runtime suspended device before
performing system-wide suspend and since a later patch will set
"pm-only" mode as long as a block device is runtime suspended, make
it possible to set "pm-only" mode from more than one context. Since
with this change scsi_device_quiesce() is no longer idempotent, make
that function return early if it is called for a quiesced queue.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-08-04 09:30:57 +02:00
Jens Axboe
01c5f85aeb blk-cgroup: increase number of supported policies
After merging the iolatency policy, we potentially now have 4 policies
being registered, but only support 3. This causes one of them to fail
loading. Takashi reports that BFQ no longer works for him, because it
fails to load due to policy registration failure.

Bump to 5 policies, and also add a warning for when we have exceeded
the global amount. If we have to touch this again, we should switch
to a dynamic scheme instead.

Reported-by: Takashi Iwai <tiwai@suse.de>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Tested-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-11 10:59:53 -06:00
Bart Van Assche
b1f4267cc5 block: Remove two superfluous #include directives
Commit 12f5b93145 ("blk-mq: Remove generation seqeunce") removed the
only seqcount_t and u64_stats_sync instances from <linux/blkdev.h> but
did not remove the corresponding #include directives. Since these
include directives are no longer needed, remove them.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Hannes Reinecke <hare@suse.com>,
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-09 09:12:26 -06:00
Greg Edwards
359f642700 block: move bio_integrity_{intervals,bytes} into blkdev.h
This allows bio_integrity_bytes() to be called from drivers instead of
open coding it.

Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Greg Edwards <gedwards@ddn.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-26 15:49:41 -06:00
Tejun Heo
3f289dcb4b block: make bdev_ops->rw_page() take a REQ_OP instead of bool
c11f0c0b5b ("block/mm: make bdev_ops->rw_page() take a bool for
read/write") replaced @op with boolean @is_write, which limited the
amount of information going into ->rw_page() and more importantly
page_endio(), which removed the need to expose block internals to mm.

Unfortunately, we want to track discards separately and @is_write
isn't enough information.  This patch updates bdev_ops->rw_page() to
take REQ_OP instead but leaves page_endio() to take bool @is_write.
This allows the block part of operations to have enough information
while not leaking it to mm.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Mike Christie <mchristi@redhat.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-18 08:44:14 -06:00
Vladimir Zapolskiy
05814a1037 block: remove blkdev_entry_to_request() macro
Remove blkdev_entry_to_request() macro, which remained unused through
the observable history, also note that it repeats list_entry_rq() macro
verbatim.

Signed-off-by: Vladimir Zapolskiy <vz@mleia.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-13 08:12:26 -06:00
Josef Bacik
a79050434b blk-rq-qos: refactor out common elements of blk-wbt
blkcg-qos is going to do essentially what wbt does, only on a cgroup
basis.  Break out the common code that will be shared between blkcg-qos
and wbt into blk-rq-qos.* so they can both utilize the same
infrastructure.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-09 09:07:54 -06:00
Ming Lei
97889f9ac2 blk-mq: remove synchronize_rcu() from blk_mq_del_queue_tag_set()
We have to remove synchronize_rcu() from blk_queue_cleanup(),
otherwise long delay can be caused during lun probe. For removing
it, we have to avoid to iterate the set->tag_list in IO path, eg,
blk_mq_sched_restart().

This patch reverts 5b79413946d (Revert "blk-mq: don't handle
TAG_SHARED in restart"). Given we have fixed enough IO hang issue,
and there isn't any reason to restart all queues in one tags any more,
see the following reasons:

1) blk-mq core can deal with shared-tags case well via blk_mq_get_driver_tag(),
which can wake up queues waiting for driver tag.

2) SCSI is a bit special because it may return BLK_STS_RESOURCE if queue,
target or host is ready, but SCSI built-in restart can cover all these well,
see scsi_end_request(), queue will be rerun after any request initiated from
this host/target is completed.

In my test on scsi_debug(8 luns), this patch may improve IOPS by 20% ~ 30%
when running I/O on these 8 luns concurrently.

Fixes: 705cda97ee ("blk-mq: Make it safe to use RCU to iterate over blk_mq_tag_set.tag_list")
Cc: Omar Sandoval <osandov@fb.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: linux-scsi@vger.kernel.org
Reported-by: Andrew Jones <drjones@redhat.com>
Tested-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-09 09:07:52 -06:00
Bart Van Assche
6a5ac98465 block: Make struct request_queue smaller for CONFIG_BLK_DEV_ZONED=n
Exclude zoned block device members from struct request_queue for
CONFIG_BLK_DEV_ZONED == n. Avoid breaking the build by only building
the code that uses these struct request_queue members if
CONFIG_BLK_DEV_ZONED != n.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Matias Bjorling <mb@lightnvm.io>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-09 09:07:52 -06:00
Bart Van Assche
7c8542b798 block: Inline blk_queue_nr_zones()
Since the implementation of blk_queue_nr_zones() is trivial and since
it only has a single caller, inline this function.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Matias Bjorling <mb@lightnvm.io>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-09 09:07:52 -06:00
Bart Van Assche
6b1d83d274 block: Remove bdev_nr_zones()
Remove this function since it has no callers. This function was
introduced in commit 6cc77e9cb0 ("block: introduce zoned block
devices zone write locking").

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Matias Bjorling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-09 09:07:52 -06:00
Keith Busch
15bfd21fbc block: Fix transfer when chunk sectors exceeds max
A device may have boundary restrictions where the number of sectors
between boundaries exceeds its max transfer size. In this case, we need
to cap the max size to the smaller of the two limits.

Reported-by: Jitendra Bhivare <jitendra.bhivare@broadcom.com>
Tested-by: Jitendra Bhivare <jitendra.bhivare@broadcom.com>
Cc: <stable@vger.kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-06-26 12:18:27 -06:00
Christoph Hellwig
be7f99c536 block: remov blk_queue_invalidate_tags
This function is entirely unused, so remove it and the tag_queue_busy
member of struct request_queue.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-06-15 08:13:35 -06:00
Christoph Hellwig
da66126739 blk-mq: don't time out requests again that are in the timeout handler
We can currently call the timeout handler again on a request that has
already been handed over to the timeout handler.  Prevent that with a new
flag.

Fixes: 12f5b931 ("blk-mq: Remove generation seqeunce")
Reported-by: Andrew Randrianasulu <randrianasulu@gmail.com>
Tested-by: Andrew Randrianasulu <randrianasulu@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-06-14 08:39:46 -06:00
Kent Overstreet
338aa96d56 block: convert bounce, q->bio_split to bioset_init()/mempool_init()
Convert the core block functionality to embedded bio sets.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-30 15:33:32 -06:00
Jens Axboe
0b7576d8eb block: move ->timeout request member
After the recent timeout handling changes, we have two holes in
the struct. Move the timeout near the deadline, killing both,
and moving related members closer together. On my config on
x86-64, this shrinks struct request from 312 to 304 bytes.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-29 08:59:21 -06:00
Christoph Hellwig
88b0cfad28 block: document the blk_eh_timer_return values
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-29 08:59:21 -06:00
Christoph Hellwig
f6e7d48a78 block: remove BLK_EH_HANDLED
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-29 08:59:21 -06:00
Christoph Hellwig
6600593cbd block: rename BLK_EH_NOT_HANDLED to BLK_EH_DONE
The BLK_EH_NOT_HANDLED implies nothing happen, but very often that
is not what is happening - instead the driver already completed the
command.  Fix the symbolic name to reflect that a little better.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-29 08:59:21 -06:00
Keith Busch
12f5b93145 blk-mq: Remove generation seqeunce
This patch simplifies the timeout handling by relying on the request
reference counting to ensure the iterator is operating on an inflight
and truly timed out request. Since the reference counting prevents the
tag from being reallocated, the block layer no longer needs to prevent
drivers from completing their requests while the timeout handler is
operating on it: a driver completing a request is allowed to proceed to
the next state without additional syncronization with the block layer.

This also removes any need for generation sequence numbers since the
request lifetime is prevented from being reallocated as a new sequence
while timeout handling is operating on it.

To enables this a refcount is added to struct request so that request
users can be sure they're operating on the same request without it
changing while they're processing it.  The request's tag won't be
released for reuse until both the timeout handler and the completion
are done with it.

Signed-off-by: Keith Busch <keith.busch@intel.com>
[hch: slight cleanups, added back submission side hctx lock, use cmpxchg
 for completions]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-29 08:59:21 -06:00
Christoph Hellwig
ff005a0662 block: sanitize blk_get_request calling conventions
Switch everyone to blk_get_request_flags, and then rename
blk_get_request_flags to blk_get_request.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-14 08:55:12 -06:00
Omar Sandoval
522a777566 block: consolidate struct request timestamp fields
Currently, struct request has four timestamp fields:

- A start time, set at get_request time, in jiffies, used for iostats
- An I/O start time, set at start_request time, in ktime nanoseconds,
  used for blk-stats (i.e., wbt, kyber, hybrid polling)
- Another start time and another I/O start time, used for cfq and bfq

These can all be consolidated into one start time and one I/O start
time, both in ktime nanoseconds, shaving off up to 16 bytes from struct
request depending on the kernel config.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-09 08:33:09 -06:00
Omar Sandoval
84c7afcebe block: use ktime_get_ns() instead of sched_clock() for cfq and bfq
cfq and bfq have some internal fields that use sched_clock() which can
trivially use ktime_get_ns() instead. Their timestamp fields in struct
request can also use ktime_get_ns(), which resolves the 8 year old
comment added by commit 28f4197e5d ("block: disable preemption before
using sched_clock()").

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-09 08:33:06 -06:00
Omar Sandoval
544ccc8dc9 block: get rid of struct blk_issue_stat
struct blk_issue_stat squashes three things into one u64:

- The time the driver started working on a request
- The original size of the request (for the io.low controller)
- Flags for writeback throttling

It turns out that on x86_64, we have a 4 byte hole in struct request
which we can fill with the non-timestamp fields from blk_issue_stat,
simplifying things quite a bit.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-09 08:33:05 -06:00
Linus Torvalds
3442097b76 SCSI fixes on 20180425
8 bug fixes, one spelling update and one tracepoint addition.  The
 most serious is probably the mpt3sas write same fix because it means
 anyone using these controllers sees errors when modern filesystems try
 to issue discards.
 
 Signed-off-by: James E.J. Bottomley <jejb@linux.vnet.ibm.com>
 -----BEGIN PGP SIGNATURE-----
 
 iJwEABMIAEQWIQTnYEDbdso9F2cI+arnQslM7pishQUCWuDoQyYcamFtZXMuYm90
 dG9tbGV5QGhhbnNlbnBhcnRuZXJzaGlwLmNvbQAKCRDnQslM7pishU0eAP0QvCH8
 NF2L35OadCr7I1Nvcb8h/OKsVtF6IIpFDD/0DAEA/FwV9wxTknA2OoSWhFzxPfMY
 EkQR56i7DQAvX3Agrno=
 =jMLe
 -----END PGP SIGNATURE-----

Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

Pull SCSI fixes from James Bottomley:
 "Eight bug fixes, one spelling update and one tracepoint addition.

  The most serious is probably the mptsas write same fix because it
  means anyone using these controllers sees errors when modern
  filesystems try to issue discards"

* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
  scsi: target: fix crash with iscsi target and dvd
  scsi: sd_zbc: Avoid that resetting a zone fails sporadically
  scsi: sd: Defer spinning up drive while SANITIZE is in progress
  scsi: megaraid_sas: Do not log an error if FW successfully initializes.
  scsi: ufs: add trace event for ufs upiu
  scsi: core: remove reference to scsi_show_extd_sense()
  scsi: mptsas: Disable WRITE SAME
  scsi: fnic: fix spelling mistake in fnic stats "Abord" -> "Abort"
  scsi: scsi_debug: IMMED related delay adjustments
  scsi: iscsi: respond to netlink with unicast when appropriate
2018-04-25 21:13:40 -07:00
Bart Van Assche
ccce20fc79 scsi: sd_zbc: Avoid that resetting a zone fails sporadically
Since SCSI scanning occurs asynchronously, since sd_revalidate_disk() is
called from sd_probe_async() and since sd_revalidate_disk() calls
sd_zbc_read_zones() it can happen that sd_zbc_read_zones() is called
concurrently with blkdev_report_zones() and/or blkdev_reset_zones().  That can
cause these functions to fail with -EIO because sd_zbc_read_zones() e.g. sets
q->nr_zones to zero before restoring it to the actual value, even if no drive
characteristics have changed.  Avoid that this can happen by making the
following changes:

- Protect the code that updates zone information with blk_queue_enter()
  and blk_queue_exit().
- Modify sd_zbc_setup_seq_zones_bitmap() and sd_zbc_setup() such that
  these functions do not modify struct scsi_disk before all zone
  information has been obtained.

Note: since commit 055f6e18e0 ("block: Make q_usage_counter also track
legacy requests"; kernel v4.15) the request queue freezing mechanism also
affects legacy request queues.

Fixes: 89d9475610 ("sd: Implement support for ZBC devices")
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.com>
Cc: stable@vger.kernel.org # v4.16
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2018-04-19 00:04:10 -04:00
Dave Chinner
0ce9144471 block: add blk_queue_fua() helper function
So we can check FUA support status from the iomap direct IO code.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-04-18 08:24:29 -06:00
Bart Van Assche
233bde21aa block: Move SECTOR_SIZE and SECTOR_SHIFT definitions into <linux/blkdev.h>
It happens often while I'm preparing a patch for a block driver that
I'm wondering: is a definition of SECTOR_SIZE and/or SECTOR_SHIFT
available for this driver? Do I have to introduce definitions of these
constants before I can use these constants? To avoid this confusion,
move the existing definitions of SECTOR_SIZE and SECTOR_SHIFT into the
<linux/blkdev.h> header file such that these become available for all
block drivers. Make the SECTOR_SIZE definition in the uapi msdos_fs.h
header file conditional to avoid that including that header file after
<linux/blkdev.h> causes the compiler to complain about a SECTOR_SIZE
redefinition.

Note: the SECTOR_SIZE / SECTOR_SHIFT / SECTOR_BITS definitions have
not been removed from uapi header files nor from NAND drivers in
which these constants are used for another purpose than converting
block layer offsets and sizes into a number of sectors.

Cc: David S. Miller <davem@davemloft.net>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-17 14:45:23 -06:00
Bart Van Assche
8a0ac14b8d block: Move the queue_flag_*() functions from a public into a private header file
This patch helps to avoid that new code gets introduced in block drivers
that manipulates queue flags without holding the queue lock when that
lock should be held.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-08 14:13:48 -07:00
Bart Van Assche
1db2008b79 block: Complain if queue_flag_(set|clear)_unlocked() is abused
Since it is not safe to use queue_flag_(set|clear)_unlocked()
without holding the queue lock after the sysfs entries for a
queue have been created, complain if this happens.

Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-08 14:13:48 -07:00
Bart Van Assche
8814ce8a0f block: Introduce blk_queue_flag_{set,clear,test_and_{set,clear}}()
Introduce functions that modify the queue flags and that protect
these modifications with the request queue lock. Except for moving
one wake_up_all() call from inside to outside a critical section,
this patch does not change any functionality.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-08 14:13:48 -07:00
Bart Van Assche
66f91322f3 block: Reorder the queue flag manipulation function definitions
Move the definition of queue_flag_clear_unlocked() up and move the
definition of queue_in_flight() down such that all queue flag
manipulation function definitions become contiguous.

This patch does not change any functionality.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-08 14:13:48 -07:00
Bart Van Assche
5ee0524ba1 block: Add 'lock' as third argument to blk_alloc_queue_node()
This patch does not change any functionality.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Philipp Reisner <philipp.reisner@linbit.com>
Cc: Ulf Hansson <ulf.hansson@linaro.org>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-02-28 12:23:35 -07:00
Minwoo Im
096392e071 block: fix a typo in comment of BLK_MQ_POLL_STATS_BKTS
Update comment typo _consisitent_ to _consistent_ from following commit.
commit 0206319fdf ("blk-mq: Fix poll_stat for new size-based bucketing.")

Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-02-15 08:27:06 -07:00
Linus Torvalds
0a4b6e2f80 Merge branch 'for-4.16/block' of git://git.kernel.dk/linux-block
Pull block updates from Jens Axboe:
 "This is the main pull request for block IO related changes for the
  4.16 kernel. Nothing major in this pull request, but a good amount of
  improvements and fixes all over the map. This contains:

   - BFQ improvements, fixes, and cleanups from Angelo, Chiara, and
     Paolo.

   - Support for SMR zones for deadline and mq-deadline from Damien and
     Christoph.

   - Set of fixes for bcache by way of Michael Lyle, including fixes
     from himself, Kent, Rui, Tang, and Coly.

   - Series from Matias for lightnvm with fixes from Hans Holmberg,
     Javier, and Matias. Mostly centered around pblk, and the removing
     rrpc 1.2 in preparation for supporting 2.0.

   - A couple of NVMe pull requests from Christoph. Nothing major in
     here, just fixes and cleanups, and support for command tracing from
     Johannes.

   - Support for blk-throttle for tracking reads and writes separately.
     From Joseph Qi. A few cleanups/fixes also for blk-throttle from
     Weiping.

   - Series from Mike Snitzer that enables dm to register its queue more
     logically, something that's alwways been problematic on dm since
     it's a stacked device.

   - Series from Ming cleaning up some of the bio accessor use, in
     preparation for supporting multipage bvecs.

   - Various fixes from Ming closing up holes around queue mapping and
     quiescing.

   - BSD partition fix from Richard Narron, fixing a problem where we
     can't mount newer (10/11) FreeBSD partitions.

   - Series from Tejun reworking blk-mq timeout handling. The previous
     scheme relied on atomic bits, but it had races where we would think
     a request had timed out if it to reused at the wrong time.

   - null_blk now supports faking timeouts, to enable us to better
     exercise and test that functionality separately. From me.

   - Kill the separate atomic poll bit in the request struct. After
     this, we don't use the atomic bits on blk-mq anymore at all. From
     me.

   - sgl_alloc/free helpers from Bart.

   - Heavily contended tag case scalability improvement from me.

   - Various little fixes and cleanups from Arnd, Bart, Corentin,
     Douglas, Eryu, Goldwyn, and myself"

* 'for-4.16/block' of git://git.kernel.dk/linux-block: (186 commits)
  block: remove smart1,2.h
  nvme: add tracepoint for nvme_complete_rq
  nvme: add tracepoint for nvme_setup_cmd
  nvme-pci: introduce RECONNECTING state to mark initializing procedure
  nvme-rdma: remove redundant boolean for inline_data
  nvme: don't free uuid pointer before printing it
  nvme-pci: Suspend queues after deleting them
  bsg: use pr_debug instead of hand crafted macros
  blk-mq-debugfs: don't allow write on attributes with seq_operations set
  nvme-pci: Fix queue double allocations
  block: Set BIO_TRACE_COMPLETION on new bio during split
  blk-throttle: use queue_is_rq_based
  block: Remove kblockd_schedule_delayed_work{,_on}()
  blk-mq: Avoid that blk_mq_delay_run_hw_queue() introduces unintended delays
  blk-mq: Rename blk_mq_request_direct_issue() into blk_mq_request_issue_directly()
  lib/scatterlist: Fix chaining support in sgl_alloc_order()
  blk-throttle: track read and write request individually
  block: add bdev_read_only() checks to common helpers
  block: fail op_is_write() requests to read-only partitions
  blk-throttle: export io_serviced_recursive, io_service_bytes_recursive
  ...
2018-01-29 11:51:49 -08:00
Bart Van Assche
f5ced52aaa block: Remove kblockd_schedule_delayed_work{,_on}()
The previous patch removed all users of these two functions. Hence
also remove the functions themselves.

Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-19 12:52:03 -07:00
Jens Axboe
7c3fb70f03 block: rearrange a few request fields for better cache layout
Move completion related items (like the call single data) near the
end of the struct, instead of mixing them in with the initial
queueing related fields.

Move queuelist below the bio structures. Then we have all
queueing related bits in the first cache line.

This yields a 1.5-2% increase in IOPS for a null_blk test, both for
sync and for high thread count access. Sync test goes form 975K to
992K, 32-thread case from 20.8M to 21.2M IOPS.

Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-10 11:47:58 -07:00
Jens Axboe
e14575b3d4 block: convert REQ_ATOM_COMPLETE to stealing rq->__deadline bit
We only have one atomic flag left. Instead of using an entire
unsigned long for that, steal the bottom bit of the deadline
field that we already reserved.

Remove ->atomic_flags, since it's now unused.

Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-10 11:47:53 -07:00
Jens Axboe
0a72e7f449 block: add accessors for setting/querying request deadline
We reduce the resolution of request expiry, but since we're already
using jiffies for this where resolution depends on the kernel
configuration and since the timeout resolution is coarse anyway,
that should be fine.

Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-10 11:47:47 -07:00
Jens Axboe
76a86f9d02 block: remove REQ_ATOM_POLL_SLEPT
We don't need this to be an atomic flag, it can be a regular
flag. We either end up on the same CPU for the polling, in which
case the state is sane, or we did the sleep which would imply
the needed barrier to ensure we see the right state.

Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-10 11:47:43 -07:00
Tejun Heo
634f9e4631 blk-mq: remove REQ_ATOM_COMPLETE usages from blk-mq
After the recent updates to use generation number and state based
synchronization, blk-mq no longer depends on REQ_ATOM_COMPLETE except
to avoid firing the same timeout multiple times.

Remove all REQ_ATOM_COMPLETE usages and use a new rq_flags flag
RQF_MQ_TIMEOUT_EXPIRED to avoid firing the same timeout multiple
times.  This removes atomic bitops from hot paths too.

v2: Removed blk_clear_rq_complete() from blk_mq_rq_timed_out().

v3: Added RQF_MQ_TIMEOUT_EXPIRED flag.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-09 09:31:15 -07:00
Tejun Heo
1d9bd5161b blk-mq: replace timeout synchronization with a RCU and generation based scheme
Currently, blk-mq timeout path synchronizes against the usual
issue/completion path using a complex scheme involving atomic
bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence
rules.  Unfortunately, it contains quite a few holes.

There's a complex dancing around REQ_ATOM_STARTED and
REQ_ATOM_COMPLETE between issue/completion and timeout paths; however,
they don't have a synchronization point across request recycle
instances and it isn't clear what the barriers add.
blk_mq_check_expired() can easily read STARTED from N-2'th iteration,
deadline from N-1'th, blk_mark_rq_complete() against Nth instance.

In fact, it's pretty easy to make blk_mq_check_expired() terminate a
later instance of a request.  If we induce 5 sec delay before
time_after_eq() test in blk_mq_check_expired(), shorten the timeout to
2s, and issue back-to-back large IOs, blk-mq starts timing out
requests spuriously pretty quickly.  Nothing actually timed out.  It
just made the call on a recycle instance of a request and then
terminated a later instance long after the original instance finished.
The scenario isn't theoretical either.

This patch replaces the broken synchronization mechanism with a RCU
and generation number based one.

1. Each request has a u64 generation + state value, which can be
   updated only by the request owner.  Whenever a request becomes
   in-flight, the generation number gets bumped up too.  This provides
   the basis for the timeout path to distinguish different recycle
   instances of the request.

   Also, marking a request in-flight and setting its deadline are
   protected with a seqcount so that the timeout path can fetch both
   values coherently.

2. The timeout path fetches the generation, state and deadline.  If
   the verdict is timeout, it records the generation into a dedicated
   request abortion field and does RCU wait.

3. The completion path is also protected by RCU (from the previous
   patch) and checks whether the current generation number and state
   match the abortion field.  If so, it skips completion.

4. The timeout path, after RCU wait, scans requests again and
   terminates the ones whose generation and state still match the ones
   requested for abortion.

   By now, the timeout path knows that either the generation number
   and state changed if it lost the race or the completion will yield
   to it and can safely timeout the request.

While it's more lines of code, it's conceptually simpler, doesn't
depend on direct use of subtle memory ordering or coherence, and
hopefully doesn't terminate the wrong instance.

While this change makes REQ_ATOM_COMPLETE synchronization unnecessary
between issue/complete and timeout paths, REQ_ATOM_COMPLETE isn't
removed yet as it's still used in other places.  Future patches will
move all state tracking to the new mechanism and remove all bitops in
the hot paths.

Note that this patch adds a comment explaining a race condition in
BLK_EH_RESET_TIMER path.  The race has always been there and this
patch doesn't change it.  It's just documenting the existing race.

v2: - Fixed BLK_EH_RESET_TIMER handling as pointed out by Jianchao.
    - s/request->gstate_seqc/request->gstate_seq/ as suggested by Peter.
    - READ_ONCE() added in blk_mq_rq_update_state() as suggested by Peter.

v3: - Fixed possible extended seqcount / u64_stats_sync read looping
      spotted by Peter.
    - MQ_RQ_IDLE was incorrectly being set in complete_request instead
      of free_request.  Fixed.

v4: - Rebased on top of hctx_lock() refactoring patch.
    - Added comment explaining the use of hctx_lock() in completion path.

v5: - Added comments requested by Bart.
    - Note the addition of BLK_EH_RESET_TIMER race condition in the
      commit message.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <Bart.VanAssche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-09 09:31:15 -07:00
Christoph Hellwig
6cc77e9cb0 block: introduce zoned block devices zone write locking
Components relying only on the request_queue structure for accessing
block devices (e.g. I/O schedulers) have a limited knowledged of the
device characteristics. In particular, the device capacity cannot be
easily discovered, which for a zoned block device also result in the
inability to easily know the number of zones of the device (the zone
size is indicated by the chunk_sectors field of the queue limits).

Introduce the nr_zones field to the request_queue structure to simplify
access to this information. Also, add the bitmap seq_zone_bitmap which
indicates which zones of the device are sequential zones (write
preferred or write required) and the bitmap seq_zones_wlock which
indicates if a zone is write locked, that is, if a write request
targeting a zone was dispatched to the device. These fields are
initialized by the low level block device driver (sd.c for ZBC/ZAC
disks). They are not initialized by stacking drivers (device mappers)
handling zoned block devices (e.g. dm-linear).

Using this, I/O schedulers can introduce zone write locking to control
request dispatching to a zoned block device and avoid write request
reordering by limiting to at most a single write request per zone
outside of the scheduler at any time.

Based on previous patches from Damien Le Moal.

Signed-off-by: Christoph Hellwig <hch@lst.de>
[Damien]
* Fixed comments and identation in blkdev.h
* Changed helper functions
* Fixed this commit message
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-05 09:22:17 -07:00
Jens Axboe
4ccafe0320 block: unalign call_single_data in struct request
A previous change blindly added massive alignment to the
call_single_data structure in struct request. This ballooned it in size
from 296 to 320 bytes on my setup, for no valid reason at all.

Use the unaligned struct __call_single_data variant instead.

Fixes: 966a967116 ("smp: Avoid using two cache lines for struct call_single_data")
Cc: stable@vger.kernel.org # v4.14
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-12-20 13:16:33 -07:00