It was calling flush_cache_all() which is a no-op since a long time anyway
and which was overkill in the old days when it was actually doing something
because only the D-cache needs to be flushed, never the I-cache, never
the S-cache. Since however highmem on MIPS is still only supported on
processors that don't suffer from cache aliases, we could turn
flush_cache_kmaps() into a no-op - but for paranoia's sake we rather make
it BUG_ON(cpu_has_dc_aliases()).
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
It's probably a good idea to flush caches before reset and by the time
this code was written flush_cache_all did actually still do something.
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Flushing caches is probably sensible on reset but flush_cache_all has been
a no-op for a very long time.
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Don't set _machine_restart() on OF machines as the reset driver
now provides a system restart handler.
Signed-off-by: Alban Bedel <albeu@free.fr>
Cc: Felix Fietkau <nbd@openwrt.org>
Cc: Antony Pavlov <antonynpavlov@gmail.com>
Cc: Gabor Juhos <juhosg@openwrt.org>
Cc: linux-mips@linux-mips.org
Cc: linux-kernel@vger.kernel.org
Patchwork: https://patchwork.linux-mips.org/patch/12235/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Reuse the early printk code to support the serial in zboot. We copy
early_printk.c instead of referencing it because we need to build a
different object file for the normal kernel and zboot.
Signed-off-by: Alban Bedel <albeu@free.fr>
Cc: Andrew Bresticker <abrestic@chromium.org>
Cc: Alex Smith <alex.smith@imgtec.com>
Cc: Wu Zhangjin <wuzhangjin@gmail.com>
Cc: linux-mips@linux-mips.org
Cc: linux-kernel@vger.kernel.org
Patchwork: https://patchwork.linux-mips.org/patch/12234/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Now that appended DTB is usable we can drop the builtin DTB support.
Signed-off-by: Alban Bedel <albeu@free.fr>
Cc: Felix Fietkau <nbd@openwrt.org>
Cc: Antony Pavlov <antonynpavlov@gmail.com>
Cc: Gabor Juhos <juhosg@openwrt.org>
Cc: linux-mips@linux-mips.org
Cc: linux-kernel@vger.kernel.org
Patchwork: https://patchwork.linux-mips.org/patch/12231/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
This is needed for bootloader supporting UHI and to support appended
DTB.
Signed-off-by: Alban Bedel <albeu@free.fr>
Cc: Felix Fietkau <nbd@openwrt.org>
Cc: Antony Pavlov <antonynpavlov@gmail.com>
Cc: Gabor Juhos <juhosg@openwrt.org>
Cc: linux-mips@linux-mips.org
Cc: linux-kernel@vger.kernel.org
Patchwork: https://patchwork.linux-mips.org/patch/12230/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
The framebuffer mapping virtual address leaks information about the
kernel memory layout. Stop logging it.
Cc: Peter Jones <pjones@redhat.com>
Cc: Jean-Christophe Plagniol-Villard <plagnioj@jcrosoft.com>
Cc: Tomi Valkeinen <tomi.valkeinen@ti.com>
Cc: linux-fbdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Peter Jones <pjones@redhat.com>
Signed-off-by: Tomi Valkeinen <tomi.valkeinen@ti.com>
A common simplified DT parsing code for regulators was introduced in
commit a0c7b164ad ("regulator: of: Provide simplified DT parsing
method")
While at it also added RK8XX_DESC and RK8XX_DESC_SWITCH macros for the
regulator_desc struct initialization. This just makes the driver more compact.
Signed-off-by: Wadim Egorov <w.egorov@phytec.de>
Acked-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Mark Brown <broonie@kernel.org>
Happened when the Playback (or Capture) is running continuously
and Capture (or Playback) is restarted (xrun, manual stop/start...)
Since the RX (or TX) FIFO are only reset when the whole SSI is disabled,
pending samples from previous capture (or playback) session may still
be present. They must be erased to not introduce channel slipping.
FIFO Clear register fields are documented in IMX51, IMX35 reference manual.
They are not documented in IMX50 or IMX6 RM, despite they are
working as expected on IMX6SL and IMX6solo.
Signed-off-by: Arnaud Mouiche <arnaud.mouiche@invoxia.com>
Reviewed-by: Fabio Estevam <fabio.estevam@nxp.com>
Tested-by: Caleb Crome <caleb@crome.org>
Signed-off-by: Mark Brown <broonie@kernel.org>
Previously, SCR.SSIEN and SCR.TE were enabled at once if no capture
stream was also running.
This may not give a chance for the DMA to write the first sample in
TX FIFO before the streaming starts on the PCM bus, inserting void
samples first.
Those void samples are then responsible for slipping the channels.
Signed-off-by: Arnaud Mouiche <arnaud.mouiche@invoxia.com>
Reviewed-by: Fabio Estevam <fabio.estevam@nxp.com>
Tested-by: Caleb Crome <caleb@crome.org>
Signed-off-by: Mark Brown <broonie@kernel.org>
If the capture is already running while playback is started, it is highly
probable (>80% in a 8 channels scenario) that samples are lost between
the DMA and TX fifo.
The reason is that SIER.TDMAE is set before STCR.TFEN0, leaving a time
window where the FIFO doesn't receive the samples written by the DMA.
This particular case happened only if capture is already enabled as
SCR.SSIEN is already set at the playback startup instant.
Signed-off-by: Arnaud Mouiche <arnaud.mouiche@invoxia.com>
Reviewed-by: Fabio Estevam <fabio.estevam@nxp.com>
Tested-by: Caleb Crome <caleb@crome.org>
Signed-off-by: Mark Brown <broonie@kernel.org>
Most of functions only receive the ssi_private reference and don't have
a knowledge of 'dev' pointer, even for debug purpose.
Signed-off-by: Arnaud Mouiche <arnaud.mouiche@invoxia.com>
Tested-by: Caleb Crome <caleb@crome.org>
Signed-off-by: Mark Brown <broonie@kernel.org>
im6sl reference manual 47.7.4:
"
Bit clock - Used to serially clock the data bits in and out of the SSI port.
This clock is either generated internally (from SSI's sys clock) or taken
from external clock source (through the Tx/Rx clock ports).
[...]
Care should be taken to ensure that the bit clock frequency (either
internally generated by dividing the SSI's sys clock or sourced from
external device through Tx/Rx clock ports) is never greater than 1/5
of the ipg_clk (from CCM) frequency.
"
Since, in master mode, the sysclk is a multiple of bitclk, we can
easily reach a high sysclk value, whereas keeping a reasonable bitclk.
ex: 8ch x 16bit x 48kHz = 6144000, requires a 24576000 sysclk (PM=1)
yet ipg_clk/5 = 66Mhz/5 = 13.2
Signed-off-by: Arnaud Mouiche <arnaud.mouiche@invoxia.com>
Reviewed-by: Fabio Estevam <fabio.estevam@nxp.com>
Tested-by: Caleb Crome <caleb@crome.org>
Signed-off-by: Mark Brown <broonie@kernel.org>
The max number of slots in TDM mode is 32:
- Frame Rate Divider Control is a 5bit value
- Time slot mask registers control 32 slots.
Signed-off-by: Arnaud Mouiche <arnaud.mouiche@invoxia.com>
Reviewed-by: Fabio Estevam <fabio.estevam@nxp.com>
Tested-by: Caleb Crome <caleb@crome.org>
Signed-off-by: Mark Brown <broonie@kernel.org>
Some definitions to support the PCM5102A codec
by Texas Instruments.
Signed-off-by: Florian Meier <florian.meier@koalo.de>
Changes to original patch by Florian Meier:
* rebased (Makefile and Kconfig
* fixed checkpath errors (spaces, newlines)
* added dt-binding documentation
Signed-off-by: Martin Sperl <kernel@martin.sperl.org>
Signed-off-by: Mark Brown <broonie@kernel.org>
Manage the hda idisp link using shiny new link APIs. We need to
keep link On while we probe and also hold the reference in runtime
resume and drop in suspend
Signed-off-by: Jeeja KP <jeeja.kp@intel.com>
Signed-off-by: Vinod Koul <vinod.koul@intel.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
Use shiny new link APIs to manage the links. Also remove old link
configuration logic from driver.
We need to keep link and cmd dma to off during active suspend
to allow system to enter low power state and turn it on if
the link and cmd dma was on before active suspend in active
resume.
Signed-off-by: Jeeja KP <jeeja.kp@intel.com>
Signed-off-by: Vinod Koul <vinod.koul@intel.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
The HDA links can be switched off when not is use, similarly
command DMA can be stopped as well. This calls for a reference
counting mechanism on the link by it's users to manage the link
power. The DMA can be turned off when all links are off
For this we add two APIs
snd_hdac_ext_bus_link_get
snd_hdac_ext_bus_link_put
They help users to turn up/down link and manage the DMA as well
Signed-off-by: Jeeja KP <jeeja.kp@intel.com>
Signed-off-by: Vinod Koul <vinod.koul@intel.com>
Acked-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Mark Brown <broonie@kernel.org>
>From I2C specifications:
http://www.nxp.com/documents/user_manual/UM10204.pdf
Chapter 3.1.16, when the i2c device held the SDA line low, the master
should send 9 clocks pulses to try to recover.
Signed-off-by: Frederic Pillon <frederic.pillon@st.com>
Signed-off-by: Peter Griffin <peter.griffin@linaro.org>
Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
A custom recovery function doesn't need these pointers to be populated
because it may work differently internally.
Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
Tested-by: Peter Griffin <peter.griffin@linaro.org>
Change the adf_ctl_stop_devices to a void function.
Signed-off-by: Tadeusz Struk <tadeusz.struk@intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Added basic clock support for axi dma's.
The clocks are requested at probe and released at remove.
Reviewed-by: Shubhrajyoti Datta <shubhraj@xilinx.com>
Signed-off-by: Kedareswara rao Appana <appanad@xilinx.com>
Signed-off-by: Vinod Koul <vinod.koul@intel.com>
This patch adds config structure in the driver to differentiate
AXI DMA's and to add more features(clock support etc..) to these DMA's.
Signed-off-by: Kedareswara rao Appana <appanad@xilinx.com>
Signed-off-by: Vinod Koul <vinod.koul@intel.com>
Update the Tegra DMA driver maintainer field to include the newly added
Tegra210 ADMA and add Jon Hunter as a co-maintainer for Tegra DMA.
Signed-off-by: Jon Hunter <jonathanh@nvidia.com>
Signed-off-by: Vinod Koul <vinod.koul@intel.com>
Add support for the Tegra210 Audio DMA controller that is used for
transferring data between system memory and the Audio sub-system.
The driver only supports cyclic transfers because this is being solely
used for audio.
This driver is based upon the work by Dara Ramesh <dramesh@nvidia.com>.
Signed-off-by: Jon Hunter <jonathanh@nvidia.com>
Signed-off-by: Vinod Koul <vinod.koul@intel.com>
Patch 562abd39 "xen-netback: support multiple extra info fragments
passed from frontend" contained a mistake which can result in an in-
correct number of responses being generated when handling errors
encountered when processing packets containing extra info fragments.
This patch fixes the problem.
Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
Reported-by: Jan Beulich <JBeulich@suse.com>
Cc: Wei Liu <wei.liu2@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While testing an OpenStack configuration using VXLANs I saw the following
call trace:
RIP: 0010:[<ffffffff815fad49>] udp4_lib_lookup_skb+0x49/0x80
RSP: 0018:ffff88103867bc50 EFLAGS: 00010286
RAX: ffff88103269bf00 RBX: ffff88103269bf00 RCX: 00000000ffffffff
RDX: 0000000000004300 RSI: 0000000000000000 RDI: ffff880f2932e780
RBP: ffff88103867bc60 R08: 0000000000000000 R09: 000000009001a8c0
R10: 0000000000004400 R11: ffffffff81333a58 R12: ffff880f2932e794
R13: 0000000000000014 R14: 0000000000000014 R15: ffffe8efbfd89ca0
FS: 0000000000000000(0000) GS:ffff88103fd80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000488 CR3: 0000000001c06000 CR4: 00000000001426e0
Stack:
ffffffff81576515 ffffffff815733c0 ffff88103867bc98 ffffffff815fcc17
ffff88103269bf00 ffffe8efbfd89ca0 0000000000000014 0000000000000080
ffffe8efbfd89ca0 ffff88103867bcc8 ffffffff815fcf8b ffff880f2932e794
Call Trace:
[<ffffffff81576515>] ? skb_checksum+0x35/0x50
[<ffffffff815733c0>] ? skb_push+0x40/0x40
[<ffffffff815fcc17>] udp_gro_receive+0x57/0x130
[<ffffffff815fcf8b>] udp4_gro_receive+0x10b/0x2c0
[<ffffffff81605863>] inet_gro_receive+0x1d3/0x270
[<ffffffff81589e59>] dev_gro_receive+0x269/0x3b0
[<ffffffff8158a1b8>] napi_gro_receive+0x38/0x120
[<ffffffffa0871297>] gro_cell_poll+0x57/0x80 [vxlan]
[<ffffffff815899d0>] net_rx_action+0x160/0x380
[<ffffffff816965c7>] __do_softirq+0xd7/0x2c5
[<ffffffff8107d969>] run_ksoftirqd+0x29/0x50
[<ffffffff8109a50f>] smpboot_thread_fn+0x10f/0x160
[<ffffffff8109a400>] ? sort_range+0x30/0x30
[<ffffffff81096da8>] kthread+0xd8/0xf0
[<ffffffff81693c82>] ret_from_fork+0x22/0x40
[<ffffffff81096cd0>] ? kthread_park+0x60/0x60
The following trace is seen when receiving a DHCP request over a flow-based
VXLAN tunnel. I believe this is caused by the metadata dst having a NULL
dev value and as a result dev_net(dev) is causing a NULL pointer dereference.
To resolve this I am replacing the check for skb_dst(skb)->dev with just
skb->dev. This makes sense as the callers of this function are usually in
the receive path and as such skb->dev should always be populated. In
addition other functions in the area where these are called are already
using dev_net(skb->dev) to determine the namespace the UDP packet belongs
in.
Fixes: 63058308cd ("udp: Add udp6_lib_lookup_skb and udp4_lib_lookup_skb")
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sunrpc is using SOCKWQ_ASYNC_NOSPACE without setting SOCK_FASYNC,
so the recent optimizations done in sk_set_bit() and sk_clear_bit()
broke it.
There is still the risk that a subsequent sock_fasync() call
would clear SOCK_FASYNC, but sunrpc does not use this yet.
Fixes: 9317bb6982 ("net: SOCKWQ_ASYNC_NOSPACE optimizations")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Jiri Pirko <jiri@resnulli.us>
Reported-by: Huang, Ying <ying.huang@intel.com>
Tested-by: Jiri Pirko <jiri@resnulli.us>
Tested-by: Huang, Ying <ying.huang@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
User visible:
- Fallback to usermode only counters when perf_event_paranoid > 1, which
is the case now (Arnaldo Carvalho de Melo)
- Do not reassign parg after collapse_tree() in libtraceevent, which
may cause tool crashes (Steven Rostedt)
Build fixes:
- Fix the build on Fedora Rawhide, where readdir_r() is deprecated and
also wrt -Werror=unused-const-variable= + x86_32_regoffset_table on
!x86_64 (Arnaldo Carvalho de Melo)
- Fix the build on Ubuntu 12.04.5, where dwarf_getlocations() isn't
available, i.e. libdw-dev < 0.157 (Arnaldo Carvalho de Melo)
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJXNOELAAoJENZQFvNTUqpAEmcQAJIZePquYNAsnTcRAPcgXjSG
wdxMK1xlHW8WgPi7OoU6a2Hx04xRtEc8CUsb4mTUa1umAMe7Fp9BCuWZrX2tNAxJ
y6tn95RkMSxZSbPp9SvSJDLAi7NrOafoblPnsjLw3bpdMTY+ngxhMq2Aftjpz5ai
Ok3sC3deCYoibSzpxLNkoBPPR7eOhK8wzcVNAvu5mPY0EF9VPpNF+yPs6mW38wBs
TQtji1X7049bAjxMLeeXcR4Z6x5hhVM6i3U1lB5XSMoqkcQMLFHJze7Fe5VLMc/3
jtZFZvV79PlAhJtYxlLeuuSjjuaZ6dCSE+87YYAmukup3SMp3sCTvba+7YhWEgEE
hZEAaHe8eJbSQndhYpY82mV+AQe4dINTgdLBoV1uQ5EUh/KlOaph3MuCb2jMXtVb
JLROl6wktgFJ75NzkHvix798DtOVyLqa5z0H2h27Jqm2LqrotIn+trXuz6a+0nxP
aoHsKPyZYytPvZoWHjesIn86iOSCrLN8UGhaQPTIunfO0evlSvEkWtAqU1u4mX+P
CqDFI4/2dZXnAVBZvk47xsrcxhkvEO63SDv3AIdXvsihYBGJLtaVKT9qfsxyxyqN
fAy9MghKrPXTRRJP9Z/Fzbycbv/ioiRdnNmj+cgWkPkjqt+lr5Q1V+B9BZhqsWHF
Fl0GoIr//HFjSk7gSFaN
=qn4a
-----END PGP SIGNATURE-----
Merge tag 'perf-urgent-for-mingo-20160512' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/urgent
Pull perf/urgent fixes from Arnaldo Carvalho de Melo:
- Fallback to usermode-only counters when perf_event_paranoid > 1, which
is the case now (Arnaldo Carvalho de Melo)
- Do not reassign parg after collapse_tree() in libtraceevent, which
may cause tool crashes (Steven Rostedt)
- Fix the build on Fedora Rawhide, where readdir_r() is deprecated and
also wrt -Werror=unused-const-variable= + x86_32_regoffset_table on
!x86_64 (Arnaldo Carvalho de Melo)
- Fix the build on Ubuntu 12.04.5, where dwarf_getlocations() isn't
available, i.e. libdw-dev < 0.157 (Arnaldo Carvalho de Melo)
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This is mostly the same as on other filesystems except for attribute
names with an "os2." prefix: for those, the prefix is not stored on
disk, and on-attribute names without a prefix have "os2." added.
As on several other filesystems, the underlying function for
setting/removing xattrs (__jfs_setxattr) removes attributes when the
value is NULL, so the set xattr handlers will work as expected.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Instead of stripping "os2." prefixes in __jfs_setxattr, make callers
strip them, as __jfs_getxattr already does. With that change, use the
same name mapping function in jfs_{get,set,remove}xattr.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Switch to the generic xattr handlers and take the necessary glocks at
the layer below. The following are the new xattr "entry points"; they
are called with the glock held already in the following cases:
gfs2_xattr_get: From SELinux, during lookups.
gfs2_xattr_set: The glock is never held.
gfs2_get_acl: From gfs2_create_inode -> posix_acl_create and
gfs2_setattr -> posix_acl_chmod.
gfs2_set_acl: From gfs2_setattr -> posix_acl_chmod.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Merge fixes from Andrew Morton:
"4 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
mm: thp: calculate the mapcount correctly for THP pages during WP faults
ksm: fix conflict between mmput and scan_get_next_rmap_item
ocfs2: fix posix_acl_create deadlock
ocfs2: revert using ocfs2_acl_chmod to avoid inode cluster lock hang
Due to the optimization of lockless direct IO writes (the inode's i_mutex
is not held) introduced in commit 38851cc19a ("Btrfs: implement unlocked
dio write"), we started having races between such writes with concurrent
fsync operations that use the fast fsync path. These races were addressed
in the patches titled "Btrfs: fix race between fsync and lockless direct
IO writes" and "Btrfs: fix race between fsync and direct IO writes for
prealloc extents". The races happened because the direct IO path, like
every other write path, does create extent maps followed by the
corresponding ordered extents while the fast fsync path collected first
ordered extents and then it collected extent maps. This made it possible
to log file extent items (based on the collected extent maps) without
waiting for the corresponding ordered extents to complete (get their IO
done). The two fixes mentioned before added a solution that consists of
making the direct IO path create first the ordered extents and then the
extent maps, while the fsync path attempts to collect any new ordered
extents once it collects the extent maps. This was simple and did not
require adding any synchonization primitive to any data structure (struct
btrfs_inode for example) but it makes things more fragile for future
development endeavours and adds an exceptional approach compared to the
other write paths.
This change adds a read-write semaphore to the btrfs inode structure and
makes the direct IO path create the extent maps and the ordered extents
while holding read access on that semaphore, while the fast fsync path
collects extent maps and ordered extents while holding write access on
that semaphore. The logic for direct IO write path is encapsulated in a
new helper function that is used both for cow and nocow direct IO writes.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Relocation of a block group waits for all existing tasks flushing
dellaloc, starting direct IO writes and any ordered extents before
starting the relocation process. However for direct IO writes that end
up doing nocow (inode either has the flag nodatacow set or the write is
against a prealloc extent) we have a short time window that allows for a
race that makes relocation proceed without waiting for the direct IO
write to complete first, resulting in data loss after the relocation
finishes. This is illustrated by the following diagram:
CPU 1 CPU 2
btrfs_relocate_block_group(bg X)
direct IO write starts against
an extent in block group X
using nocow mode (inode has the
nodatacow flag or the write is
for a prealloc extent)
btrfs_direct_IO()
btrfs_get_blocks_direct()
--> can_nocow_extent() returns 1
btrfs_inc_block_group_ro(bg X)
--> turns block group into RO mode
btrfs_wait_ordered_roots()
--> returns and does not know about
the DIO write happening at CPU 2
(the task there has not created
yet an ordered extent)
relocate_block_group(bg X)
--> rc->stage == MOVE_DATA_EXTENTS
find_next_extent()
--> returns extent that the DIO
write is going to write to
relocate_data_extent()
relocate_file_extent_cluster()
--> reads the extent from disk into
pages belonging to the relocation
inode and dirties them
--> creates DIO ordered extent
btrfs_submit_direct()
--> submits bio against a location
on disk obtained from an extent
map before the relocation started
btrfs_wait_ordered_range()
--> writes all the pages read before
to disk (belonging to the
relocation inode)
relocation finishes
bio completes and wrote new data
to the old location of the block
group
So fix this by tracking the number of nocow writers for a block group and
make sure relocation waits for that number to go down to 0 before starting
to move the extents.
The same race can also happen with buffered writes in nocow mode since the
patch I recently made titled "Btrfs: don't do unnecessary delalloc flushes
when relocating", because we are no longer flushing all delalloc which
served as a synchonization mechanism (due to page locking) and ensured
the ordered extents for nocow buffered writes were created before we
called btrfs_wait_ordered_roots(). The race with direct IO writes in nocow
mode existed before that patch (no pages are locked or used during direct
IO) and that fixed only races with direct IO writes that do cow.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
When we do a direct IO write against a preallocated extent (fallocate)
that does not go beyond the i_size of the inode, we do the write operation
without holding the inode's i_mutex (an optimization that landed in
commit 38851cc19a ("Btrfs: implement unlocked dio write")). This allows
for a very tiny time window where a race can happen with a concurrent
fsync using the fast code path, as the direct IO write path creates first
a new extent map (no longer flagged as a prealloc extent) and then it
creates the ordered extent, while the fast fsync path first collects
ordered extents and then it collects extent maps. This allows for the
possibility of the fast fsync path to collect the new extent map without
collecting the new ordered extent, and therefore logging an extent item
based on the extent map without waiting for the ordered extent to be
created and complete. This can result in a situation where after a log
replay we end up with an extent not marked anymore as prealloc but it was
only partially written (or not written at all), exposing random, stale or
garbage data corresponding to the unwritten pages and without any
checksums in the csum tree covering the extent's range.
This is an extension of what was done in commit de0ee0edb2 ("Btrfs: fix
race between fsync and lockless direct IO writes").
So fix this by creating first the ordered extent and then the extent
map, so that this way if the fast fsync patch collects the new extent
map it also collects the corresponding ordered extent.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
When we do a rename with the whiteout flag, we need to create the whiteout
inode, which in the worst case requires 5 transaction units (1 inode item,
1 inode ref, 2 dir items and 1 xattr if selinux is enabled). So bump the
number of transaction units from 11 to 16 if the whiteout flag is set.
Signed-off-by: Filipe Manana <fdmanana@suse.com>