Commit graph

560039 commits

Author SHA1 Message Date
Ilia Mirkin
04b8a4bd8e drm/nouveau/gem: return only valid domain when there's only one
On nv50+, we restrict the valid domains to just the one where the buffer
was originally created. However after the buffer is evicted to system
memory, we might move it back to a different domain that was not
originally valid. When sharing the buffer and retrieving its GEM_INFO
data, we still want the domain that will be valid for this buffer in a
pushbuf, not the one where it currently happens to be.

This resolves fdo#92504 and several others. These are due to suspend
evicting all buffers, making it more likely that they temporarily end up
in the wrong place.

Cc: stable@vger.kernel.org
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=92504
Signed-off-by: Ilia Mirkin <imirkin@alum.mit.edu>
Signed-off-by: Ben Skeggs <bskeggs@redhat.com>
2015-11-03 14:56:06 +10:00
Shaohui Xie
5f6c99e0ab net: phy: fix a bug in get_phy_c45_ids
When probing devices-in-package for a c45 phy, device zero is the last
device to probe, however, if driver reads 0 from device zero,
c45_ids->devices_in_package is set to '0', the loop condition of probing
will be matched again, see codes below:

for (i = 1;i < num_ids && c45_ids->devices_in_package == 0;i++)

driver will run in a dead loop.

This patch restructures the bug and confusing loop, it provides a helper
function get_phy_c45_devs_in_pkg which to read devices-in-package registers
of a MMD, and rewrites the loop with using the helper function.

Signed-off-by: Shaohui Xie <Shaohui.Xie@freescale.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-02 23:45:20 -05:00
Jarod Wilson
fd867d51f8 net/core: generic support for disabling netdev features down stack
There are some netdev features, which when disabled on an upper device,
such as a bonding master or a bridge, must be disabled and cannot be
re-enabled on underlying devices.

This is a rework of an earlier more heavy-handed appraoch, which simply
disables and prevents re-enabling of netdev features listed in a new
define in include/net/netdev_features.h, NETIF_F_UPPER_DISABLES. Any upper
device that disables a flag in that feature mask, the disabling will
propagate down the stack, and any lower device that has any upper device
with one of those flags disabled should not be able to enable said flag.

Initially, only LRO is included for proof of concept, and because this
code effectively does the same thing as dev_disable_lro(), though it will
also activate from the ethtool path, which was one of the goals here.

[root@dell-per730-01 ~]# ethtool -k bond0 |grep large
large-receive-offload: on
[root@dell-per730-01 ~]# ethtool -k p5p1 |grep large
large-receive-offload: on
[root@dell-per730-01 ~]# ethtool -K bond0 lro off
[root@dell-per730-01 ~]# ethtool -k bond0 |grep large
large-receive-offload: off
[root@dell-per730-01 ~]# ethtool -k p5p1 |grep large
large-receive-offload: off

dmesg dump:

[ 1033.277986] bond0: Disabling feature 0x0000000000008000 on lower dev p5p2.
[ 1034.067949] bnx2x 0000:06:00.1 p5p2: using MSI-X  IRQs: sp 74  fp[0] 76 ... fp[7] 83
[ 1034.753612] bond0: Disabling feature 0x0000000000008000 on lower dev p5p1.
[ 1035.591019] bnx2x 0000:06:00.0 p5p1: using MSI-X  IRQs: sp 62  fp[0] 64 ... fp[7] 71

This has been successfully tested with bnx2x, qlcnic and netxen network
cards as slaves in a bond interface. Turning LRO on or off on the master
also turns it on or off on each of the slaves, new slaves are added with
LRO in the same state as the master, and LRO can't be toggled on the
slaves.

Also, this should largely remove the need for dev_disable_lro(), and most,
if not all, of its call sites can be replaced by simply making sure
NETIF_F_LRO isn't included in the relevant device's feature flags.

Note that this patch is driven by bug reports from users saying it was
confusing that bonds and slaves had different settings for the same
features, and while it won't be 100% in sync if a lower device doesn't
support a feature like LRO, I think this is a good step in the right
direction.

CC: "David S. Miller" <davem@davemloft.net>
CC: Eric Dumazet <edumazet@google.com>
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@gmail.com>
CC: Andy Gospodarek <gospo@cumulusnetworks.com>
CC: Jiri Pirko <jiri@resnulli.us>
CC: Nikolay Aleksandrov <razor@blackwall.org>
CC: Michal Kubecek <mkubecek@suse.cz>
CC: Alexander Duyck <alexander.duyck@gmail.com>
CC: netdev@vger.kernel.org
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-02 23:41:31 -05:00
Martin Habets
b2663a4f30 sfc: push partner queue for skb->xmit_more
When the IP stack passes SKBs the sfc driver puts them in 2 different TX
queues (called partners), one for checksummed and one for not checksummed.
If the SKB has xmit_more set the driver will delay pushing the work to the
NIC.

When later it does decide to push the buffers this patch ensures it also
pushes the partner queue, if that also has any delayed work. Before this
fix the work in the partner queue would be left for a long time and cause
a netdev watchdog.

Fixes: 70b33fb ("sfc: add support for skb->xmit_more")
Reported-by: Jianlin Shi <jishi@redhat.com>
Signed-off-by: Martin Habets <mhabets@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-02 23:02:58 -05:00
Sergei Shtylyov
c238041f51 sh_eth: fix typo in RX descriptor bit name
The correct name of the RX descriptor 0 bit 30 is RDLE (receive descriptor
list end),  not  RDEL.

Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-02 23:01:06 -05:00
Eric Dumazet
4ece900977 sit: fix sit0 percpu double allocations
sit0 device allocates its percpu storage twice :
- One time in ipip6_tunnel_init()
- One time in ipip6_fb_tunnel_init()

Thus we leak 48 bytes per possible cpu per network namespace dismantle.

ipip6_fb_tunnel_init() can be much simpler and does not
return an error, and should be called after register_netdev()

Note that ipip6_tunnel_clone_6rd() also needs to be called
after register_netdev() (calling ipip6_tunnel_init())

Fixes: ebe084aafb ("sit: Use ipip6_tunnel_init as the ndo_init function.")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-02 22:54:45 -05:00
David S. Miller
a176ded3f4 Merge branch 'bonding-actor-updates'
Mahesh Bandewar says:

====================
re-org actor admin/oper key updates

I was observing machines entering into weird LACP state when the
partner is in passive mode. This issue is not because of the partners
in passive state but probably because of some operational key update
which is pushing the state-machine is that weird state. This was
happening randomly on about 1% of the machine (when the sample size
is a large set of machines with a variety of NICs/ports bonded).

In this patch-series I'm attempting to unify the logic of actor-key
/ operational-key changes to one place to avoid possible errors in
update. Also this eliminates the need for the event-handler to decide
if the key needs update.

After this patch-set none of the machines (from same sample set) were
exhibiting LACP-weirdness that was observed earlier.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-02 22:52:24 -05:00
Mahesh Bandewar
52bc671681 bonding: simplify / unify event handling code for 3ad mode.
Old logic of updating state-machine is not required since
ad_update_actor_keys() does it implicitly. The only loss is
the notification differentiation between speed vs. duplex
change. Now only one unified notification is printed.

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-02 22:52:24 -05:00
Mahesh Bandewar
7bb11dc9f5 bonding: unify all places where actor-oper key needs to be updated.
actor_admin, and actor_oper key is changed at multiple locations in
the code. This patch brings all those updates into one location in
an attempt to avoid possible inconsistent updates causing LACP state
machine to go in weird state.

The unified place is ad_update_actor_key() with simple state-machine
logic -
  (a) If port is "duplex" then only it can participate in LACP
  (b) Speed change reinitializes the LACP state-machine.

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-02 22:52:24 -05:00
Mahesh Bandewar
b25c2e7d3c bonding: Simplify __get_duplex function.
Eliminate 'else' clause by simply initializing variable

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-02 22:52:24 -05:00
David S. Miller
12d4309636 Merge branch 'bpf-persistent'
Daniel Borkmann says:

====================
BPF updates

This set adds support for persistent maps/progs. Please see
individual patches for further details. A man-page update
to bpf(2) will be sent later on, also a iproute2 patch for
support in tc.

v1 -> v2:
  - Reworked most of patch 4 and 5
  - Rebased to latest net-next
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-02 22:48:39 -05:00
Daniel Borkmann
42984d7c1e bpf: add sample usages for persistent maps/progs
This patch adds a couple of stand-alone examples on how BPF_OBJ_PIN
and BPF_OBJ_GET commands can be used.

Example with maps:

  # ./fds_example -F /sys/fs/bpf/m -P -m -k 1 -v 42
  bpf: map fd:3 (Success)
  bpf: pin ret:(0,Success)
  bpf: fd:3 u->(1:42) ret:(0,Success)
  # ./fds_example -F /sys/fs/bpf/m -G -m -k 1
  bpf: get fd:3 (Success)
  bpf: fd:3 l->(1):42 ret:(0,Success)
  # ./fds_example -F /sys/fs/bpf/m -G -m -k 1 -v 24
  bpf: get fd:3 (Success)
  bpf: fd:3 u->(1:24) ret:(0,Success)
  # ./fds_example -F /sys/fs/bpf/m -G -m -k 1
  bpf: get fd:3 (Success)
  bpf: fd:3 l->(1):24 ret:(0,Success)

  # ./fds_example -F /sys/fs/bpf/m2 -P -m
  bpf: map fd:3 (Success)
  bpf: pin ret:(0,Success)
  # ./fds_example -F /sys/fs/bpf/m2 -G -m -k 1
  bpf: get fd:3 (Success)
  bpf: fd:3 l->(1):0 ret:(0,Success)
  # ./fds_example -F /sys/fs/bpf/m2 -G -m
  bpf: get fd:3 (Success)

Example with progs:

  # ./fds_example -F /sys/fs/bpf/p -P -p
  bpf: prog fd:3 (Success)
  bpf: pin ret:(0,Success)
  bpf sock:4 <- fd:3 attached ret:(0,Success)
  # ./fds_example -F /sys/fs/bpf/p -G -p
  bpf: get fd:3 (Success)
  bpf: sock:4 <- fd:3 attached ret:(0,Success)

  # ./fds_example -F /sys/fs/bpf/p2 -P -p -o ./sockex1_kern.o
  bpf: prog fd:5 (Success)
  bpf: pin ret:(0,Success)
  bpf: sock:3 <- fd:5 attached ret:(0,Success)
  # ./fds_example -F /sys/fs/bpf/p2 -G -p
  bpf: get fd:3 (Success)
  bpf: sock:4 <- fd:3 attached ret:(0,Success)

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-02 22:48:39 -05:00
Daniel Borkmann
b2197755b2 bpf: add support for persistent maps/progs
This work adds support for "persistent" eBPF maps/programs. The term
"persistent" is to be understood that maps/programs have a facility
that lets them survive process termination. This is desired by various
eBPF subsystem users.

Just to name one example: tc classifier/action. Whenever tc parses
the ELF object, extracts and loads maps/progs into the kernel, these
file descriptors will be out of reach after the tc instance exits.
So a subsequent tc invocation won't be able to access/relocate on this
resource, and therefore maps cannot easily be shared, f.e. between the
ingress and egress networking data path.

The current workaround is that Unix domain sockets (UDS) need to be
instrumented in order to pass the created eBPF map/program file
descriptors to a third party management daemon through UDS' socket
passing facility. This makes it a bit complicated to deploy shared
eBPF maps or programs (programs f.e. for tail calls) among various
processes.

We've been brainstorming on how we could tackle this issue and various
approches have been tried out so far, which can be read up further in
the below reference.

The architecture we eventually ended up with is a minimal file system
that can hold map/prog objects. The file system is a per mount namespace
singleton, and the default mount point is /sys/fs/bpf/. Any subsequent
mounts within a given namespace will point to the same instance. The
file system allows for creating a user-defined directory structure.
The objects for maps/progs are created/fetched through bpf(2) with
two new commands (BPF_OBJ_PIN/BPF_OBJ_GET). I.e. a bpf file descriptor
along with a pathname is being passed to bpf(2) that in turn creates
(we call it eBPF object pinning) the file system nodes. Only the pathname
is being passed to bpf(2) for getting a new BPF file descriptor to an
existing node. The user can use that to access maps and progs later on,
through bpf(2). Removal of file system nodes is being managed through
normal VFS functions such as unlink(2), etc. The file system code is
kept to a very minimum and can be further extended later on.

The next step I'm working on is to add dump eBPF map/prog commands
to bpf(2), so that a specification from a given file descriptor can
be retrieved. This can be used by things like CRIU but also applications
can inspect the meta data after calling BPF_OBJ_GET.

Big thanks also to Alexei and Hannes who significantly contributed
in the design discussion that eventually let us end up with this
architecture here.

Reference: https://lkml.org/lkml/2015/10/15/925
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-02 22:48:39 -05:00
Daniel Borkmann
e9d8afa90b bpf: consolidate bpf_prog_put{, _rcu} dismantle paths
We currently have duplicated cleanup code in bpf_prog_put() and
bpf_prog_put_rcu() cleanup paths. Back then we decided that it was
not worth it to make it a common helper called by both, but with
the recent addition of resource charging, we could have avoided
the fix in commit ac00737f4e ("bpf: Need to call bpf_prog_uncharge_memlock
from bpf_prog_put") if we would have had only a single, common path.
We can simplify it further by assigning aux->prog only once during
allocation time.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-02 22:48:39 -05:00
Daniel Borkmann
c210129760 bpf: align and clean bpf_{map,prog}_get helpers
Add a bpf_map_get() function that we're going to use later on and
align/clean the remaining helpers a bit so that we have them a bit
more consistent:

  - __bpf_map_get() and __bpf_prog_get() that both work on the fd
    struct, check whether the descriptor is eBPF and return the
    pointer to the map/prog stored in the private data.

    Also, we can return f.file->private_data directly, the function
    signature is enough of a documentation already.

  - bpf_map_get() and bpf_prog_get() that both work on u32 user fd,
    call their respective __bpf_map_get()/__bpf_prog_get() variants,
    and take a reference.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-02 22:48:39 -05:00
Daniel Borkmann
aa79781b65 bpf: abstract anon_inode_getfd invocations
Since we're going to use anon_inode_getfd() invocations in more than just
the current places, make a helper function for both, so that we only need
to pass a map/prog pointer to the helper itself in order to get a fd. The
new helpers are called bpf_map_new_fd() and bpf_prog_new_fd().

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-02 22:48:39 -05:00
Eric Dumazet
1d6119baf0 net: fix percpu memory leaks
This patch fixes following problems :

1) percpu_counter_init() can return an error, therefore
  init_frag_mem_limit() must propagate this error so that
  inet_frags_init_net() can do the same up to its callers.

2) If ip[46]_frags_ns_ctl_register() fail, we must unwind
   properly and free the percpu_counter.

Without this fix, we leave freed object in percpu_counters
global list (if CONFIG_HOTPLUG_CPU) leading to crashes.

This bug was detected by KASAN and syzkaller tool
(http://github.com/google/syzkaller)

Fixes: 6d7b857d54 ("net: use lib/percpu_counter API for fragmentation mem accounting")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-02 22:47:14 -05:00
Eric Dumazet
8fa677d270 net: avoid NULL deref in inet_ctl_sock_destroy()
Under low memory conditions, tcp_sk_init() and icmp_sk_init()
can both iterate on all possible cpus and call inet_ctl_sock_destroy(),
with eventual NULL pointer.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-02 22:46:09 -05:00
Marek Szyprowski
df547bf773 drm/exynos/gem: remove DMA-mapping hacks used for constructing page array
Exynos GEM objects contains an array of pointers to the pages, which the
allocated buffer consists of. Till now the code used some hacks (like
relying on DMA-mapping internal structures or using ARM-specific
dma_to_pfn helper) to build this array. This patch fixes this by adding
proper call to dma_get_sgtable_attrs() and using the acquired scatter-list
to construct needed array. This approach is more portable (work also for
ARM64) and finally fixes the layering violation that was present in this
code.

Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Inki Dae <inki.dae@samsung.com>
2015-11-03 11:46:39 +09:00
Andrzej Hajda
0135131546 ARM: exynos_defconfig: enable Exynos DRM Mixer driver
Mixer driver is selected by CONFIG_DRM_EXYNOS_HDMI option. Since Exynos5433
HDMI does not require Mixer. There will be separate options to select Mixer
and HDMI. Adding new option to defconfig before Kconfig will allow to keep
bisectability.

Signed-off-by: Andrzej Hajda <a.hajda@samsung.com>
Reviewed-by: Krzysztof Kozlowski <k.kozlowski@samsung.com>
Acked-by: Krzysztof Kozlowski <k.kozlowski@samsung.com>
Signed-off-by: Inki Dae <inki.dae@samsung.com>
2015-11-03 11:46:39 +09:00
Andrzej Hajda
5a3c35b377 drm/exynos: simplify Kconfig component names
Many Exynos DRM sub-options mentions Exynos DRM in their titles.
It is redundant and can be safely shortened. The patch additionally
makes some entries more descriptive.

Signed-off-by: Andrzej Hajda <a.hajda@samsung.com>
Signed-off-by: Inki Dae <inki.dae@samsung.com>
2015-11-03 11:46:38 +09:00
Andrzej Hajda
ea9776465d drm/exynos: re-arrange Kconfig entries
Exynos DRM driver have quite big number of components and options.
The patch re-arranges them into three logical groups:
- CRTCs,
- Encoders and Bridges,
- Sub-drivers.
It should make driver options more clear.

Signed-off-by: Andrzej Hajda <a.hajda@samsung.com>
Signed-off-by: Inki Dae <inki.dae@samsung.com>
2015-11-03 11:46:38 +09:00
Andrzej Hajda
dba6c5280d drm/exynos: abstract out common dependency
All options depends on DRM_EXYNOS so it can be moved to enclosing if clause.

Signed-off-by: Andrzej Hajda <a.hajda@samsung.com>
Signed-off-by: Inki Dae <inki.dae@samsung.com>
2015-11-03 11:46:38 +09:00
Andrzej Hajda
3cb02b4a9e drm/exynos: separate Mixer and HDMI drivers
Latest Exynos SoCs does not have Mixer IP, but they still have HDMI IP.
Their drivers should be configurable separately.

Signed-off-by: Andrzej Hajda <a.hajda@samsung.com>
Signed-off-by: Inki Dae <inki.dae@samsung.com>
2015-11-03 11:46:38 +09:00
Andrzej Hajda
3ae24362e0 drm/exynos/mixer: replace direct cross-driver call with drm mode validation
HDMI driver called directly function from MIXER driver to invalidate modes
not supported by MIXER. The patch replaces the hack with proper .atomic_check
callback.

Signed-off-by: Andrzej Hajda <a.hajda@samsung.com>
Signed-off-by: Inki Dae <inki.dae@samsung.com>
2015-11-03 11:46:38 +09:00
Andrzej Hajda
5625b3418a drm/exynos: add atomic_check callback to exynos_crtc
Some CRTCs needs mode validation, this patch adds neccessary
callback to Exynos DRM framework. It is called from DRM core
via atomic_check helper for drm_crtc.

Signed-off-by: Andrzej Hajda <a.hajda@samsung.com>
Signed-off-by: Inki Dae <inki.dae@samsung.com>
2015-11-03 11:46:37 +09:00
Andrzej Hajda
b8182832c5 drm/exynos/decon5433: add support for DECON-TV
DECON-TV IP is responsible for generating video stream which is transferred
to HDMI IP. It is almost fully compatible with DECON IP.

The patch is based on initial work of Hyungwon Hwang.

Signed-off-by: Andrzej Hajda <a.hajda@samsung.com>
Signed-off-by: Inki Dae <inki.dae@samsung.com>
2015-11-03 11:46:37 +09:00
Andrzej Hajda
5d929ba50a drm/exynos/decon5433: remove duplicated initialization
Field .commit is already initialized few lines above.

Signed-off-by: Andrzej Hajda <a.hajda@samsung.com>
Signed-off-by: Inki Dae <inki.dae@samsung.com>
2015-11-03 11:46:37 +09:00
Andrzej Hajda
7b6bb6ed01 drm/exynos/decon5433: merge different flag fields
Driver uses four different fields for internal flags. They can be merged
into one.

Signed-off-by: Andrzej Hajda <a.hajda@samsung.com>
Signed-off-by: Inki Dae <inki.dae@samsung.com>
2015-11-03 11:46:37 +09:00
Andrzej Hajda
b219207385 drm/exynos/decon5433: add function to set particular register bits
The driver often sets only particular bits of configuration registers.
Using separate function to such action simplifies the code.

Signed-off-by: Andrzej Hajda <a.hajda@samsung.com>
Signed-off-by: Inki Dae <inki.dae@samsung.com>
2015-11-03 11:46:37 +09:00
Andrzej Hajda
85de275ad9 drm/exynos/decon5433: fix timing registers writes
All timing registers should contain values decreased by one.

Signed-off-by: Andrzej Hajda <a.hajda@samsung.com>
Signed-off-by: Inki Dae <inki.dae@samsung.com>
2015-11-03 11:46:36 +09:00
Andrzej Hajda
4f54f21cd6 drm/exynos/decon5433: add PCLK clock
PCLK clock is used by DECON IP. The patch also replaces magic number with
number of clocks in array definition.

Signed-off-by: Andrzej Hajda <a.hajda@samsung.com>
Signed-off-by: Inki Dae <inki.dae@samsung.com>
2015-11-03 11:46:36 +09:00
Dave Chinner
264e89ad34 Merge branch 'xfs-dax-updates' into for-next 2015-11-03 13:28:41 +11:00
Dave Chinner
2da5c4b05a Merge branch 'xfs-misc-fixes-for-4.4-2' into for-next 2015-11-03 13:27:58 +11:00
Dave Chinner
fc0561cefc xfs: optimise away log forces on timestamp updates for fdatasync
xfs: timestamp updates cause excessive fdatasync log traffic

Sage Weil reported that a ceph test workload was writing to the
log on every fdatasync during an overwrite workload. Event tracing
showed that the only metadata modification being made was the
timestamp updates during the write(2) syscall, but fdatasync(2)
is supposed to ignore them. The key observation was that the
transactions in the log all looked like this:

INODE: #regs: 4   ino: 0x8b  flags: 0x45   dsize: 32

And contained a flags field of 0x45 or 0x85, and had data and
attribute forks following the inode core. This means that the
timestamp updates were triggering dirty relogging of previously
logged parts of the inode that hadn't yet been flushed back to
disk.

There are two parts to this problem. The first is that XFS relogs
dirty regions in subsequent transactions, so it carries around the
fields that have been dirtied since the last time the inode was
written back to disk, not since the last time the inode was forced
into the log.

The second part is that on v5 filesystems, the inode change count
update during inode dirtying also sets the XFS_ILOG_CORE flag, so
on v5 filesystems this makes a timestamp update dirty the entire
inode.

As a result when fdatasync is run, it looks at the dirty fields in
the inode, and sees more than just the timestamp flag, even though
the only metadata change since the last fdatasync was just the
timestamps. Hence we force the log on every subsequent fdatasync
even though it is not needed.

To fix this, add a new field to the inode log item that tracks
changes since the last time fsync/fdatasync forced the log to flush
the changes to the journal. This flag is updated when we dirty the
inode, but we do it before updating the change count so it does not
carry the "core dirty" flag from timestamp updates. The fields are
zeroed when the inode is marked clean (due to writeback/freeing) or
when an fsync/datasync forces the log. Hence if we only dirty the
timestamps on the inode between fsync/fdatasync calls, the fdatasync
will not trigger another log force.

Over 100 runs of the test program:

Ext4 baseline:
	runtime: 1.63s +/- 0.24s
	avg lat: 1.59ms +/- 0.24ms
	iops: ~2000

XFS, vanilla kernel:
        runtime: 2.45s +/- 0.18s
	avg lat: 2.39ms +/- 0.18ms
	log forces: ~400/s
	iops: ~1000

XFS, patched kernel:
        runtime: 1.49s +/- 0.26s
	avg lat: 1.46ms +/- 0.25ms
	log forces: ~30/s
	iops: ~1500

Reported-by: Sage Weil <sage@redhat.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-11-03 13:14:59 +11:00
Darrick J. Wong
af3b63822e xfs: don't leak uuid table on rmmod
Don't leak the UUID table when the module is unloaded.
(Found with kmemleak.)

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-11-03 13:06:34 +11:00
Andreas Gruenbacher
47e1bf6405 xfs: invalidate cached acl if set via ioctl
Setting or removing the "SGI_ACL_[FILE|DEFAULT]" attributes via the
XFS_IOC_ATTRMULTI_BY_HANDLE ioctl completely bypasses the POSIX ACL
infrastructure, like setting the "trusted.SGI_ACL_[FILE|DEFAULT]" xattrs
did until commit 6caa1056.  Similar to that commit, invalidate cached
acls when setting/removing them via the ioctl as well.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-11-03 12:56:17 +11:00
Andreas Gruenbacher
09cb22d2a5 xfs: Plug memory leak in xfs_attrmulti_attr_set
When setting attributes via XFS_IOC_ATTRMULTI_BY_HANDLE, the user-space
buffer is copied into a new kernel-space buffer via memdup_user; that
buffer then isn't freed.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-11-03 12:53:54 +11:00
Andreas Gruenbacher
86a21c7974 xfs: Validate the length of on-disk ACLs
In xfs_acl_from_disk, instead of trusting that xfs_acl.acl_cnt is correct,
make sure that the length of the attributes is correct as well.  Also, turn
the aclp parameter into a const pointer.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-11-03 12:41:59 +11:00
Brian Foster
67d8e04e34 xfs: invalidate cached acl if set directly via xattr
ACLs are stored as extended attributes of the inode to which they apply.
XFS converts the standard "system.posix_acl_[access|default]" attribute
names used to control ACLs to "trusted.SGI_ACL_[FILE|DEFAULT]" as stored
on-disk. These xattrs are directly exposed in on-disk format via
getxattr/setxattr, without any ACL aware code in the path to perform
validation, etc. This is partly historical and supports backup/restore
applications such as xfsdump to back up and restore the binary blob that
represents ACLs as-is.

Andreas reports that the ACLs observed via the getfacl interface is not
consistent when ACLs are set directly via the setxattr path. This occurs
because the ACLs are cached in-core against the inode and the xattr path
has no knowledge that the operation relates to ACLs.

Update the xattr set codepath to trap writes of the special XFS ACL
attributes and invalidate the associated cached ACL when this occurs.
This ensures that the correct ACLs are used on a subsequent operation
through the actual ACL interface.

Note that this does not update or add support for setting the ACL xattrs
directly beyond the restore use case that requires a correctly formatted
binary blob and to restore a consistent i_mode at the same time. It is
still possible for a root user to set an invalid or inconsistent (with
i_mode) ACL blob on-disk and potentially cause corruption.

[ With fixes from Andreas Gruenbacher. ]

Reported-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-11-03 12:40:59 +11:00
Dave Chinner
13ad4fe3e0 xfs: xfs_filemap_pmd_fault treats read faults as write faults
The code initially committed didn't have the same checks for write
faults as the dax_pmd_fault code and hence treats all faults as
write faults. We can get read faults through this path because they
is no pmd_mkwrite path for write faults similar to the normal page
fault path. Hence we need to ensure that we only do c/mtime updates
on write faults, and freeze protection is unnecessary for read
faults.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-11-03 12:37:02 +11:00
Dave Chinner
3af4928585 xfs: add ->pfn_mkwrite support for DAX
->pfn_mkwrite support is needed so that when a page with allocated
backing store takes a write fault we can check that the fault has
not raced with a truncate and is pointing to a region beyond the
current end of file.

This also allows us to update the timestamp on the inode, too, which
fixes a generic/080 failure.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-11-03 12:37:02 +11:00
Dave Chinner
01a155e6cf xfs: DAX does not use IO completion callbacks
For DAX, we are now doing block zeroing during allocation. This
means we no longer need a special DAX fault IO completion callback
to do unwritten extent conversion. Because mmap never extends the
file size (it SEGVs the process) we don't need a callback to update
the file size, either. Hence we can remove the completion callbacks
from the __dax_fault and __dax_mkwrite calls.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-11-03 12:37:02 +11:00
Dave Chinner
1ca191576f xfs: Don't use unwritten extents for DAX
DAX has a page fault serialisation problem with block allocation.
Because it allows concurrent page faults and does not have a page
lock to serialise faults to the same page, it can get two concurrent
faults to the page that race.

When two read faults race, this isn't a huge problem as the data
underlying the page is not changing and so "detect and drop" works
just fine. The issues are to do with write faults.

When two write faults occur, we serialise block allocation in
get_blocks() so only one faul will allocate the extent. It will,
however, be marked as an unwritten extent, and that is where the
problem lies - the DAX fault code cannot differentiate between a
block that was just allocated and a block that was preallocated and
needs zeroing. The result is that both write faults end up zeroing
the block and attempting to convert it back to written.

The problem is that the first fault can zero and convert before the
second fault starts zeroing, resulting in the zeroing for the second
fault overwriting the data that the first fault wrote with zeros.
The second fault then attempts to convert the unwritten extent,
which is then a no-op because it's already written. Data loss occurs
as a result of this race.

Because there is no sane locking construct in the page fault code
that we can use for serialisation across the page faults, we need to
ensure block allocation and zeroing occurs atomically in the
filesystem. This means we can still take concurrent page faults and
the only time they will serialise is in the filesystem
mapping/allocation callback. The page fault code will always see
written, initialised extents, so we will be able to remove the
unwritten extent handling from the DAX code when all filesystems are
converted.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-11-03 12:37:00 +11:00
Dave Chinner
3fbbbea34b xfs: introduce BMAPI_ZERO for allocating zeroed extents
To enable DAX to do atomic allocation of zeroed extents, we need to
drive the block zeroing deep into the allocator. Because
xfs_bmapi_write() can return merged extents on allocation that were
only partially allocated (i.e. requested range spans allocated and
hole regions, allocation into the hole was contiguous), we cannot
zero the extent returned from xfs_bmapi_write() as that can
overwrite existing data with zeros.

Hence we have to drive the extent zeroing into the allocation code,
prior to where we merge the extents into the BMBT and return the
resultant map. This means we need to propagate this need down to
the xfs_alloc_vextent() and issue the block zeroing at this point.

While this functionality is being introduced for DAX, there is no
reason why it is specific to DAX - we can per-zero blocks during the
allocation transaction on any type of device. It's just slow (and
usually slower than unwritten allocation and conversion) on
traditional block devices so doesn't tend to get used. We can,
however, hook hardware zeroing optimisations via sb_issue_zeroout()
to this operation, so it may be useful in future and hence the
"allocate zeroed blocks" API needs to be implementation neutral.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-11-03 12:27:22 +11:00
Dave Chinner
3e12dbbdbd xfs: fix inode size update overflow in xfs_map_direct()
Both direct IO and DAX pass an offset and count into get_blocks that
will overflow a s64 variable when an IO goes into the last supported
block in a file (i.e. at offset 2^63 - 1FSB bytes). This can be seen
from the tracing:

xfs_get_blocks_alloc: [...] offset 0x7ffffffffffff000 count 4096
xfs_gbmap_direct:     [...] offset 0x7ffffffffffff000 count 4096
xfs_gbmap_direct_none:[...] offset 0x7ffffffffffff000 count 4096

0x7ffffffffffff000 + 4096 = 0x8000000000000000, and hence that
overflows the s64 offset and we fail to detect the need for a
filesize update and an ioend is not allocated.

This is *mostly* avoided for direct IO because such extending IOs
occur with full block allocation, and so the "IS_UNWRITTEN()" check
still evaluates as true and we get an ioend that way. However, doing
single sector extending IOs to this last block will expose the fact
that file size updates will not occur after the first allocating
direct IO as the overflow will then be exposed.

There is one further complexity: the DAX page fault path also
exposes the same issue in block allocation. However, page faults
cannot extend the file size, so in this case we want to allocate the
block but do not want to allocate an ioend to enable file size
update at IO completion. Hence we now need to distinguish between
the direct IO patch allocation and dax fault path allocation to
avoid leaking ioend structures.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-11-03 12:27:22 +11:00
Neil Brown
3ce96239d4 Documentation: add new description of path-name lookup.
This document is based on three recent lwn.net articles.
Some of the introductory material and linkage between articles
has been removed, and some time-based descriptions have been
revised.

Also all links to code have been removed as the code is very close by.

Contains corrections and improvements from Randy Dunlap <rdunlap@infradead.org>.

Signed-off-by: NeilBrown <neil@brown.name>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2015-11-02 18:18:25 -07:00
Sergey Senozhatsky
05be961772 Documentation/vm/slub.txt: document slabinfo-gnuplot.sh
Add documentation on how to use slabinfo-gnuplot.sh script.

Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2015-11-02 18:15:33 -07:00
Masanari Iida
83432ef3b8 Doc: ABI/stable: Fix typo in ABI/stable
This patch fix some spelling typos in Documentation/ABI/stable.

Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2015-11-02 18:10:33 -07:00
Linus Torvalds
5062ecdb66 regmap: Updates for v4.4
Quite a few new features for regmap this time, mostly expanding things
 around the edges of the existing functionality to cover more devices
 rather than thinsg with wide applicability:
 
  - Support for offload of the update_bits() operation to hardware where
    devices implement bit level access.
  - Support for a few extra operations that need scratch buffers on
    fast_io devices where we can't sleep.
  - Expanded the feature set of regmap_irq to cope with some extra
    register layouts.
  - Cleanups to the debugfs code.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJWN6usAAoJECTWi3JdVIfQT7cH/1sNJlpt8s2pu/29LtBN0oMH
 ZPPT8p46Xph7lbVMbMWZfgcobtISKMYo8dGDEeIYR1c1DZObErJSDxj5I/g5PxUv
 FuJ1kVeJ0L0GJE78sB69LQQUAYXfmbtOnmivok/huNID+wpZbTE7uZjAby4tffuk
 Mom2TLSnrm63j/zDnsi4VBAB9w1pQCIp+UiJ8LR8PCglcSHJ/eZLrh53F6BzXRU8
 9QseD5YQCh4d3go6QZhVNX8iJEby9ehmtCnzN8zEfpXKpc2T1pCoHEH/BnKnxd16
 i35A4RjW0GZfilS1815prjDhHJSbrQiqzyIh7Wua+zll7kCx6K/oLdjUfTIK9L0=
 =A8Lm
 -----END PGP SIGNATURE-----

Merge tag 'regmap-v4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap

Pull regmap updates from Mark Brown:
 "Quite a few new features for regmap this time, mostly expanding things
  around the edges of the existing functionality to cover more devices
  rather than thinsg with wide applicability:

   - Support for offload of the update_bits() operation to hardware
     where devices implement bit level access.
   - Support for a few extra operations that need scratch buffers on
     fast_io devices where we can't sleep.
   - Expanded the feature set of regmap_irq to cope with some extra
     register layouts.
   - Cleanups to the debugfs code"

* tag 'regmap-v4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap:
  regmap: Allow installing custom reg_update_bits function
  regmap: debugfs: simplify regmap_reg_ranges_read_file() slightly
  regmap: debugfs: use memcpy instead of snprintf
  regmap: debugfs: use snprintf return value in regmap_reg_ranges_read_file()
  regmap: Add generic macro to define regmap_irq
  regmap: debugfs: Remove scratch buffer for register length calculation
  regmap: irq: add ack_invert flag for chips using cleared bits as ack
  regmap: irq: add support for chips who have separate unmask registers
  regmap: Allocate buffers with GFP_ATOMIC when fast_io == true
2015-11-02 16:16:24 -08:00