Commit graph

38,980 commits

Author SHA1 Message Date
YOSHIFUJI Hideaki
c15df306fc ipv6: Remove unused arguments for __ipv6_dev_get_saddr().
Signed-off-by: YOSHIFUJI Hideaki <hideaki.yoshifuji@miraclelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-16 01:00:56 -07:00
Anuradha Karuppiah
88d6378bd6 netlink: changes for setting and clearing protodown via netlink.
Signed-off-by: Anuradha Karuppiah <anuradhak@cumulusnetworks.com>
Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com>
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: Wilson Kok <wkok@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-15 21:39:40 -07:00
Anuradha Karuppiah
d746d707a8 net core: Add protodown support.
This patch introduces the proto_down flag that can be used by user space
applications to notify switch drivers that errors have been detected on the
device.

The switch driver can react to protodown notification by doing a phys down
on the associated switch port.

Signed-off-by: Anuradha Karuppiah <anuradhak@cumulusnetworks.com>
Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com>
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: Wilson Kok <wkok@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-15 21:39:40 -07:00
Alexei Starovoitov
ddf06c1e56 tc: act_bpf: fix memory leak
prog->bpf_ops is populated when act_bpf is used with classic BPF and
prog->bpf_name is optionally used with extended BPF.
Fix memory leak when act_bpf is released.

Fixes: d23b8ad8ab ("tc: add BPF based action")
Fixes: a8cb5f556b ("act_bpf: add initial eBPF support for actions")
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-15 21:36:35 -07:00
WANG Cong
c0afd9ce4d fq_codel: fix return value of fq_codel_drop()
The ->drop() is supposed to return the number of bytes it dropped,
however fq_codel_drop() returns the index of the flow where it drops
a packet from.

Fix this by introducing a helper to wrap fq_codel_drop().

Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Cong Wang <cwang@twopensource.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-15 21:36:35 -07:00
WANG Cong
e8d092aafd net_sched: fix a use-after-free in sfq
Fixes: 25331d6ce4 ("net: sched: implement qstat helper routines")
Cc: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Cong Wang <cwang@twopensource.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-15 21:36:34 -07:00
YOSHIFUJI Hideaki/吉藤英明
c0b8da1e76 ipv6: Fix finding best source address in ipv6_dev_get_saddr().
Commit 9131f3de2 ("ipv6: Do not iterate over all interfaces when
finding source address on specific interface.") did not properly
update best source address available.  Plus, it introduced
possible NULL pointer dereference.

Bug was reported by Erik Kline <ek@google.com>.
Based on patch proposed by Hajime Tazaki <thehajime@gmail.com>.

Fixes: 9131f3de24 ("ipv6: Do not
	iterate over all interfaces when finding source address
	on specific interface.")
Signed-off-by: YOSHIFUJI Hideaki <hideaki.yoshifuji@miraclelinux.com>
Acked-by: Hajime Tazaki <thehajime@gmail.com>
Acked-by: Erik Kline <ek@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-15 21:06:13 -07:00
Eric Dumazet
03645a11a5 ipv6: lock socket in ip6_datagram_connect()
ip6_datagram_connect() is doing a lot of socket changes without
socket being locked.

This looks wrong, at least for udp_lib_rehash() which could corrupt
lists because of concurrent udp_sk(sk)->udp_portaddr_hash accesses.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-15 17:25:51 -07:00
Andrea Parri
40bdc5360d pkt_sched: sch_qfq: remove unused member of struct qfq_sched
The member (u32) "num_active_agg" of struct qfq_sched has been unused
since its introduction in 462dbc9101
"pkt_sched: QFQ Plus: fair-queueing service at DRR cost" and (AFAICT)
there is no active plan to use it; this removes the member.

Signed-off-by: Andrea Parri <parri.andrea@gmail.com>
Acked-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-15 17:21:31 -07:00
WANG Cong
052cbda41f fq_codel: fix a use-after-free
Fixes: 25331d6ce4 ("net: sched: implement qstat helper routines")
Cc: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Cong Wang <cwang@twopensource.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-15 17:18:21 -07:00
Yuchung Cheng
f82b681a51 tcp: don't use F-RTO on non-recurring timeouts
Currently F-RTO may repeatedly send new data packets on non-recurring
timeouts in CA_Loss mode. This is a bug because F-RTO (RFC5682)
should only be used on either new recovery or recurring timeouts.

This exacerbates the recovery progress during frequent timeout &
repair, because we prioritize sending new data packets instead of
repairing the holes when the bandwidth is already scarce.

Fix it by correcting the test of a new recovery episode.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-15 17:17:21 -07:00
Nikolay Aleksandrov
5ebc784625 bridge: mdb: fix double add notification
Since the mdb add/del code was introduced there have been 2 br_mdb_notify
calls when doing br_mdb_add() resulting in 2 notifications on each add.

Example:
 Command: bridge mdb add dev br0 port eth1 grp 239.0.0.1 permanent
 Before patch:
 root@debian:~# bridge monitor all
 [MDB]dev br0 port eth1 grp 239.0.0.1 permanent
 [MDB]dev br0 port eth1 grp 239.0.0.1 permanent

 After patch:
 root@debian:~# bridge monitor all
 [MDB]dev br0 port eth1 grp 239.0.0.1 permanent

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Fixes: cfd5675435 ("bridge: add support of adding and deleting mdb entries")
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-15 17:15:01 -07:00
Satish Ashok
bc8c20acae bridge: multicast: treat igmpv3 report with INCLUDE and no sources as a leave
A report with INCLUDE/Change_to_include and empty source list should be
treated as a leave, specified by RFC 3376, section 3.1:
"If the requested filter mode is INCLUDE *and* the requested source
 list is empty, then the entry corresponding to the requested
 interface and multicast address is deleted if present.  If no such
 entry is present, the request is ignored."

Signed-off-by: Satish Ashok <sashok@cumulusnetworks.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-15 17:10:40 -07:00
Linus Torvalds
9090fdb9c9 Changes for 4.2-rc
- Mainly fix-ups for the various 4.2 items
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJVpWPTAAoJELgmozMOVy/dRjkQALGpY+m0Q+dfS4geU+JvvH0/
 tTNJevWA1lu5xIgzk8aW+RgNOt+IQV2K6XnKOdKWcdE2wg9Yg9/8Y7atxeNSLoB6
 mdUfstmTHoDFRXea1wog45QVVv9ABqSXIOmFd/PQSf2pEHvwgmVaxO2KW5n0n71H
 0G9nw28jx+Igd1IjX3UezbB4Rm5jeG6exQEM5KKQ/l/cD2Pt/DIr9OmTmkMhj4Ft
 4ppC+JsDHZMT2mQljEB1UL6Va1BGgULn4JwZviXRGgTEFiS+rtUkWXy9CVeNIV+A
 bo6aHEBlO8zbj9xg0POkHGBP2XCkDpsc5EP6/zon30MLpOTpnnTp2zq/bhGpyr5w
 ytH1x6V8Od1Or9BrHygy1yLGaqaVthIuZMReLq0Bu26px8mTECdu8bUjwxEvbvep
 u51vQ75h99td5XtXBHULcd/I8mUb3JqbEHqig0ChLOcnOMh72KLql8qfhLvzrYSv
 gFMvnqUZWq4WYET8cKcWypj5+cIME4objOim6T/TG+zTCmgbPnhwBsgvujLhfhDy
 JhUjUfS+31S99I4smL7ceKQpfA/7awLuZf/C20zdLZO2wXnx7VmbNlhQA3mvssyt
 ZOXeLkLGcXDIT3egNj5VWQR/KlVtvUS9FNVslxqo4ItyZZwJtbHTwRAgATp3XRwD
 Po7E7bWoUQqpanhgiX4Q
 =N/5G
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma

Pull rdma fixes from Doug Ledford:
 "Mainly fix-ups for the various 4.2 items"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (24 commits)
  IB/core: Destroy ocrdma_dev_id IDR on module exit
  IB/core: Destroy multcast_idr on module exit
  IB/mlx4: Optimize do_slave_init
  IB/mlx4: Fix memory leak in do_slave_init
  IB/mlx4: Optimize freeing of items on error unwind
  IB/mlx4: Fix use of flow-counters for process_mad
  IB/ipath: Convert use of __constant_<foo> to <foo>
  IB/ipoib: Set MTU to max allowed by mode when mode changes
  IB/ipoib: Scatter-Gather support in connected mode
  IB/ucm: Fix bitmap wrap when devnum > IB_UCM_MAX_DEVICES
  IB/ipoib: Prevent lockdep warning in __ipoib_ib_dev_flush
  IB/ucma: Fix lockdep warning in ucma_lock_files
  rds: rds_ib_device.refcount overflow
  RDMA/nes: Fix for incorrect recording of the MAC address
  RDMA/nes: Fix for resolving the neigh
  RDMA/core: Fixes for port mapper client registration
  IB/IPoIB: Fix bad error flow in ipoib_add_port()
  IB/mlx4: Do not attemp to report HCA clock offset on VFs
  IB/cm: Do not queue work to a device that's going away
  IB/srp: Avoid using uninitialized variable
  ...
2015-07-15 17:03:03 -07:00
Herbert Xu
89c22d8c3b net: Fix skb csum races when peeking
When we calculate the checksum on the recv path, we store the
result in the skb as an optimisation in case we need the checksum
again down the line.

This is in fact bogus for the MSG_PEEK case as this is done without
any locking.  So multiple threads can peek and then store the result
to the same skb, potentially resulting in bogus skb states.

This patch fixes this by only storing the result if the skb is not
shared.  This preserves the optimisations for the few cases where
it can be done safely due to locking or other reasons, e.g., SIOCINQ.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-15 15:59:58 -07:00
Richard Stearn
da278622bf NET: AX.25: Stop heartbeat timer on disconnect.
This may result in a kernel panic.  The bug has always existed but
somehow we've run out of luck now and it bites.

Signed-off-by: Richard Stearn <richard@rns-stearn.demon.co.uk>
Cc: stable@vger.kernel.org	# all branches
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-15 15:59:58 -07:00
Herbert Xu
738ac1ebb9 net: Clone skb before setting peeked flag
Shared skbs must not be modified and this is crucial for broadcast
and/or multicast paths where we use it as an optimisation to avoid
unnecessary cloning.

The function skb_recv_datagram breaks this rule by setting peeked
without cloning the skb first.  This causes funky races which leads
to double-free.

This patch fixes this by cloning the skb and replacing the skb
in the list when setting skb->peeked.

Fixes: a59322be07 ("[UDP]: Only increment counter on first peek/recv")
Reported-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-15 15:59:58 -07:00
Daniel Borkmann
035d210f92 rtnetlink: reject non-IFLA_VF_PORT attributes inside IFLA_VF_PORTS
Similarly as in commit 4f7d2cdfdd ("rtnetlink: verify IFLA_VF_INFO
attributes before passing them to driver"), we have a double nesting
of netlink attributes, i.e. IFLA_VF_PORTS only contains IFLA_VF_PORT
that is nested itself. While IFLA_VF_PORTS is a verified attribute
from ifla_policy[], we only check if the IFLA_VF_PORTS container has
IFLA_VF_PORT attributes and then pass the attribute's content itself
via nla_parse_nested(). It would be more correct to reject inner types
other than IFLA_VF_PORT instead of continuing parsing and also similarly
as in commit 4f7d2cdfdd, to check for a minimum of NLA_HDRLEN.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Roopa Prabhu <roopa@cumulusnetworks.com>
Cc: Scott Feldman <sfeldma@gmail.com>
Cc: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-15 15:53:27 -07:00
Florian Westphal
6c7941dee9 netfilter: xtables: remove __pure annotation
sparse complains:
ip_tables.c:361:27: warning: incorrect type in assignment (different modifiers)
ip_tables.c:361:27:    expected struct ipt_entry *[assigned] e
ip_tables.c:361:27:    got struct ipt_entry [pure] *

doesn't change generated code.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-07-15 18:18:07 +02:00
Florian Westphal
dcebd3153e netfilter: add and use jump label for xt_tee
Don't bother testing if we need to switch to alternate stack
unless TEE target is used.

Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-07-15 18:18:06 +02:00
Florian Westphal
7814b6ec6d netfilter: xtables: don't save/restore jumpstack offset
In most cases there is no reentrancy into ip/ip6tables.

For skbs sent by REJECT or SYNPROXY targets, there is one level
of reentrancy, but its not relevant as those targets issue an absolute
verdict, i.e. the jumpstack can be clobbered since its not used
after the target issues absolute verdict (ACCEPT, DROP, STOLEN, etc).

So the only special case where it is relevant is the TEE target, which
returns XT_CONTINUE.

This patch changes ip(6)_do_table to always use the jump stack starting
from 0.

When we detect we're operating on an skb sent via TEE (percpu
nf_skb_duplicated is 1) we switch to an alternate stack to leave
the original one alone.

Since there is no TEE support for arptables, it doesn't need to
test if tee is active.

The jump stack overflow tests are no longer needed as well --
since ->stacksize is the largest call depth we cannot exceed it.

A much better alternative to the external jumpstack would be to just
declare a jumps[32] stack on the local stack frame, but that would mean
we'd have to reject iptables rulesets that used to work before.

Another alternative would be to start rejecting rulesets with a larger
call depth, e.g. 1000 -- in this case it would be feasible to allocate the
entire stack in the percpu area which would avoid one dereference.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-07-15 18:18:06 +02:00
Florian Westphal
e7c8899f3e netfilter: move tee_active to core
This prepares for a TEE like expression in nftables.
We want to ensure only one duplicate is sent, so both will
use the same percpu variable to detect duplication.

The other use case is detection of recursive call to xtables, but since
we don't want dependency from nft to xtables core its put into core.c
instead of the x_tables core.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-07-15 18:18:05 +02:00
Florian Westphal
98d1bd802c netfilter: xtables: compute exact size needed for jumpstack
The {arp,ip,ip6tables} jump stack is currently sized based
on the number of user chains.

However, its rather unlikely that every user defined chain jumps to the
next, so lets use the existing loop detection logic to also track the
chain depths.

The stacksize is then set to the largest chain depth seen.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-07-15 18:18:04 +02:00
Eric W. Biederman
fd2ecda034 netfilter: nftables: Only run the nftables chains in the proper netns
- Register the nftables chains in the network namespace that they need
  to run in.

- Remove the hacks that stopped chains running in the wrong network
  namespace.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-07-15 18:17:36 +02:00
Eric W. Biederman
085db2c045 netfilter: Per network namespace netfilter hooks.
- Add a new set of functions for registering and unregistering per
  network namespace hooks.

- Modify the old global namespace hook functions to use the per
  network namespace hooks in their implementation, so their remains a
  single list that needs to be walked for any hook (this is important
  for keeping the hook priority working and for keeping the code
  walking the hooks simple).

- Only allow registering the per netdevice hooks in the network
  namespace where the network device lives.

- Dynamically allocate the structures in the per network namespace
  hook list in nf_register_net_hook, and unregister them in
  nf_unregister_net_hook.

  Dynamic allocate is required somewhere as the number of network
  namespaces are not fixed so we might as well allocate them in the
  registration function.

  The chain of registered hooks on any list is expected to be small so
  the cost of walking that list to find the entry we are unregistering
  should also be small.

  Performing the management of the dynamically allocated list entries
  in the registration and unregistration functions keeps the complexity
  from spreading.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2015-07-15 18:17:26 +02:00
Eric W. Biederman
0edcf282b0 netfilter: Factor out the hook list selection from nf_register_hook
- Add a new function find_nf_hook_list to select the nf_hook_list

- Fail nf_register_hook when asked for a per netdevice hook list when
  support for per netdevice hook lists is not built into the kernel.

- Move the hook list head selection outside of nf_hook_mutex as
  nothing in the selection requires the hook list, and error handling
  is simpler if a mutex is not held.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-07-15 17:51:44 +02:00
Eric W. Biederman
4c0911566d netfilter: Simply the tests for enabling and disabling the ingress queue hook
Replace an overcomplicated switch statement with a simple if statement.

This also removes the ingress queue enable outside of nf_hook_mutex as
the protection provided by the mutex is not necessary and the code is
clearer having both of the static key increments together.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-07-15 17:51:43 +02:00
Markus Elfring
0d6ef0688d ipvs: Delete an unnecessary check before the function call "module_put"
The module_put() function tests whether its argument is NULL and then
returns immediately. Thus the test around the call is not needed.

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-07-15 17:51:22 +02:00
Wengang Wang
4fabb59449 rds: rds_ib_device.refcount overflow
Fixes: 3e0249f9c0 ("RDS/IB: add refcount tracking to struct rds_ib_device")

There lacks a dropping on rds_ib_device.refcount in case rds_ib_alloc_fmr
failed(mr pool running out). this lead to the refcount overflow.

A complain in line 117(see following) is seen. From vmcore:
s_ib_rdma_mr_pool_depleted is 2147485544 and rds_ibdev->refcount is -2147475448.
That is the evidence the mr pool is used up. so rds_ib_alloc_fmr is very likely
to return ERR_PTR(-EAGAIN).

115 void rds_ib_dev_put(struct rds_ib_device *rds_ibdev)
116 {
117         BUG_ON(atomic_read(&rds_ibdev->refcount) <= 0);
118         if (atomic_dec_and_test(&rds_ibdev->refcount))
119                 queue_work(rds_wq, &rds_ibdev->free_work);
120 }

fix is to drop refcount when rds_ib_alloc_fmr failed.

Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Reviewed-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-07-14 13:20:11 -04:00
Julian Anastasov
e3895c0334 ipvs: call skb_sender_cpu_clear
Reset XPS's sender_cpu on forwarding.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
Fixes: 2bd82484bb ("xps: fix xps for stacked devices")
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-07-14 16:41:27 +09:00
Julian Anastasov
56184858d1 ipvs: fix crash with sync protocol v0 and FTP
Fix crash in 3.5+ if FTP is used after switching
sync_version to 0.

Fixes: 749c42b620 ("ipvs: reduce sync rate with time thresholds")
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-07-14 16:41:27 +09:00
Alex Gartrell
71563f3414 ipvs: skb_orphan in case of forwarding
It is possible that we bind against a local socket in early_demux when we
are actually going to want to forward it.  In this case, the socket serves
no purpose and only serves to confuse things (particularly functions which
implicitly expect sk_fullsock to be true, like ip_local_out).
Additionally, skb_set_owner_w is totally broken for non full-socks.

Signed-off-by: Alex Gartrell <agartrell@fb.com>
Fixes: 41063e9dd1 ("ipv4: Early TCP socket demux.")
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-07-14 16:41:27 +09:00
Julian Anastasov
05f00505a8 ipvs: fix crash if scheduler is changed
I overlooked the svc->sched_data usage from schedulers
when the services were converted to RCU in 3.10. Now
the rare ipvsadm -E command can change the scheduler
but due to the reverse order of ip_vs_bind_scheduler
and ip_vs_unbind_scheduler we provide new sched_data
to the old scheduler resulting in a crash.

To fix it without changing the scheduler methods we
have to use synchronize_rcu() only for the editing case.
It means all svc->scheduler readers should expect a
NULL value. To avoid breakage for the service listing
and ipvsadm -R we can use the "none" name to indicate
that scheduler is not assigned, a state when we drop
new connections.

Reported-by: Alexander Vasiliev <a.vasylev@404-group.com>
Fixes: ceec4c3816 ("ipvs: convert services to rcu")
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-07-14 16:41:27 +09:00
Julian Anastasov
4754957f04 ipvs: do not use random local source address for tunnels
Michael Vallaly reports about wrong source address used
in rare cases for tunneled traffic. Looks like
__ip_vs_get_out_rt in 3.10+ is providing uninitialized
dest_dst->dst_saddr.ip because ip_vs_dest_dst_alloc uses
kmalloc. While we retry after seeing EINVAL from routing
for data that does not look like valid local address, it
still succeeded when this memory was previously used from
other dests and with different local addresses. As result,
we can use valid local address that is not suitable for
our real server.

Fix it by providing 0.0.0.0 every time our cache is refreshed.
By this way we will get preferred source address from routing.

Reported-by: Michael Vallaly <lvs@nolatency.com>
Fixes: 026ace060d ("ipvs: optimize dst usage for real server")
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-07-14 16:41:27 +09:00
Alex Gartrell
326bf17ea5 ipvs: fix ipv6 route unreach panic
Previously there was a trivial panic

unshare -n /bin/bash <<EOF
ip addr add dev lo face::1/128
ipvsadm -A -t [face::1]:15213
ipvsadm -a -t [face::1]:15213 -r b00c::1
echo boom | nc face::1 15213
EOF

This patch allows us to replicate the net logic above and simply capture
the skb_dst(skb)->dev and use that for the purpose of the invocation.

Signed-off-by: Alex Gartrell <agartrell@fb.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-07-14 16:41:27 +09:00
David S. Miller
638d3c6381 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	net/bridge/br_mdb.c

Minor conflict in br_mdb.c, in 'net' we added a memset of the
on-stack 'ip' variable whereas in 'net-next' we assign a new
member 'vid'.

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-13 17:28:09 -07:00
Nikolay Aleksandrov
74fe61f17e bridge: mdb: add vlan support for user entries
Until now all user mdb entries were added in vlan 0, this patch adds
support to allow the user to specify the vlan for the entry.
About the uapi change a hole in struct br_mdb_entry is used so the size
and offsets are kept the same (verified with pahole and tested with older
iproute2).

Example:
$ bridge mdb
dev br0 port eth1 grp 239.0.0.1 permanent vlan 2000
dev br0 port eth1 grp 239.0.0.1 permanent vlan 200
dev br0 port eth1 grp 239.0.0.1 permanent

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-13 14:41:26 -07:00
Tom Herbert
de551f2eb2 net: Build IPv6 into kernel by default
This patch makes the default to build IPv6 into the kernel. IPv6
now has significant traction and any remaining vestiges of IPv6
not being provided parity with IPv4 should be swept away. IPv6 is now
core to the Internet and kernel.

Points on IPv6 adoption:

- Per Google statistics, IPv6 usage has reached 7% on the Internet
  and continues to exhibit an exponential growth rate
  https://www.google.com/intl/en/ipv6/statistics.html
- Just a few days ago ARIN officially depleted its IPv4 pool
- IPv6 only data centers are being successfully built
  (e.g. at Facebook)

This patch changes the IPv6 Kconfig for IPV6. Default for CONFIG_IPV6
is set to "y" and the text has been updated to reflect the maturity of
IPv6.

Impact:

Under some circumstances building modules in to kernel might have a
performance advantage. In my testing, I did notice a very slight
improvement.

This will obviously increase the size of the kernel image. In my
configuration I see:

IPv6 as module:

   text    data     bss     dec     hex filename
9703666 1899288  933888 12536842         bf4c0a vmlinux

IPv6 built into kernel

  text     data     bss     dec     hex filename
9436490 1879600  913408 12229498         ba9b7a vmlinux

Which increases text size by ~270K (2.8% increase in size for me). If
image size is an issue, presumably for a device which does not do IP
networking (IMO we should be discouraging IPv4-only devices), IPV6 can
be disabled or still built as a module.

Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-13 13:10:21 -07:00
Linus Torvalds
f760b87f8f Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Missing list head init in bluetooth hidp session creation, from Tedd
    Ho-Jeong An.

 2) Don't leak SKB in bridge netfilter error paths, from Florian
    Westphal.

 3) ipv6 netdevice private leak in netfilter bridging, fixed by Julien
    Grall.

 4) Fix regression in IP over hamradio bpq encapsulation, from Ralf
    Baechle.

 5) Fix race between rhashtable resize events and table walks, from Phil
    Sutter.

 6) Missing validation of IFLA_VF_INFO netlink attributes, fix from
    Daniel Borkmann.

 7) Missing security layer socket state initialization in tipc code,
    from Stephen Smalley.

 8) Fix shared IRQ handling in boomerang 3c59x interrupt handler, from
    Denys Vlasenko.

 9) Missing minor_idr destroy on module unload on macvtap driver, from
    Johannes Thumshirn.

10) Various pktgen kernel thread races, from Oleg Nesterov.

11) Fix races that can cause packets to be processed in the backlog even
    after a device attached to that SKB has been fully unregistered.
    From Julian Anastasov.

12) bcmgenet driver doesn't account packet drops vs.  errors properly,
    fix from Petri Gynther.

13) Array index validation and off by one fix in DSA layer from Florian
    Fainelli

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (66 commits)
  can: replace timestamp as unique skb attribute
  ARM: dts: dra7x-evm: Prevent glitch on DCAN1 pinmux
  can: c_can: Fix default pinmux glitch at init
  can: rcar_can: unify error messages
  can: rcar_can: print request_irq() error code
  can: rcar_can: fix typo in error message
  can: rcar_can: print signed IRQ #
  can: rcar_can: fix IRQ check
  net: dsa: Fix off-by-one in switch address parsing
  net: dsa: Test array index before use
  net: switchdev: don't abort unsupported operations
  net: bcmgenet: fix accounting of packet drops vs errors
  cdc_ncm: update specs URL
  Doc: z8530book: Fix typo in API-z8530-sync-txdma-open.html
  net: inet_diag: always export IPV6_V6ONLY sockopt for listening sockets
  bridge: mdb: allow the user to delete mdb entry if there's a querier
  net: call rcu_read_lock early in process_backlog
  net: do not process device backlog during unregistration
  bridge: fix potential crash in __netdev_pick_tx()
  net: axienet: Fix devm_ioremap_resource return value check
  ...
2015-07-13 11:18:25 -07:00
Pierre Morel
ea52bf8eda 9p/trans_virtio: reset virtio device on remove
On device shutdown/removal, virtio drivers need to trigger a reset on
the device; if this is neglected, the virtio core will complain about
non-zero device status.

This patch resets the status when the 9p virtio driver is removed
from the system by calling vdev->config->reset on the virtio_device
to send a reset to the host virtio device.

Signed-off-by: Pierre Morel <pmorel@linux.vnet.ibm.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2015-07-13 19:47:10 +03:00
Dmitry Torokhov
484836ec2d netfilter: IDLETIMER: fix lockdep warning
Dynamically allocated sysfs attributes should be initialized with
sysfs_attr_init() otherwise lockdep will be angry with us:

[   45.468653] BUG: key ffffffc030fad4e0 not in .data!
[   45.468655] ------------[ cut here ]------------
[   45.468666] WARNING: CPU: 0 PID: 1176 at /mnt/host/source/src/third_party/kernel/v3.18/kernel/locking/lockdep.c:2991 lockdep_init_map+0x12c/0x490()
[   45.468672] DEBUG_LOCKS_WARN_ON(1)
[   45.468672] CPU: 0 PID: 1176 Comm: iptables Tainted: G     U  W 3.18.0 #43
[   45.468674] Hardware name: XXX
[   45.468675] Call trace:
[   45.468680] [<ffffffc0002072b4>] dump_backtrace+0x0/0x10c
[   45.468683] [<ffffffc0002073d0>] show_stack+0x10/0x1c
[   45.468688] [<ffffffc000a86cd4>] dump_stack+0x74/0x94
[   45.468692] [<ffffffc000217ae0>] warn_slowpath_common+0x84/0xb0
[   45.468694] [<ffffffc000217b84>] warn_slowpath_fmt+0x4c/0x58
[   45.468697] [<ffffffc0002530a4>] lockdep_init_map+0x128/0x490
[   45.468701] [<ffffffc000367ef0>] __kernfs_create_file+0x80/0xe4
[   45.468704] [<ffffffc00036862c>] sysfs_add_file_mode_ns+0x104/0x170
[   45.468706] [<ffffffc00036870c>] sysfs_create_file_ns+0x58/0x64
[   45.468711] [<ffffffc000930430>] idletimer_tg_checkentry+0x14c/0x324
[   45.468714] [<ffffffc00092a728>] xt_check_target+0x170/0x198
[   45.468717] [<ffffffc000993efc>] check_target+0x58/0x6c
[   45.468720] [<ffffffc000994c64>] translate_table+0x30c/0x424
[   45.468723] [<ffffffc00099529c>] do_ipt_set_ctl+0x144/0x1d0
[   45.468728] [<ffffffc0009079f0>] nf_setsockopt+0x50/0x60
[   45.468732] [<ffffffc000946870>] ip_setsockopt+0x8c/0xb4
[   45.468735] [<ffffffc0009661c0>] raw_setsockopt+0x10/0x50
[   45.468739] [<ffffffc0008c1550>] sock_common_setsockopt+0x14/0x20
[   45.468742] [<ffffffc0008bd190>] SyS_setsockopt+0x88/0xb8
[   45.468744] ---[ end trace 41d156354d18c039 ]---

Signed-off-by: Dmitry Torokhov <dtor@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-07-13 17:23:25 +02:00
David S. Miller
cee9f6d018 linux-can-fixes-for-4.2-20150712
-----BEGIN PGP SIGNATURE-----
 
 iQEcBAABCgAGBQJVorxbAAoJEP5prqPJtc/H5zgH/2nvkmT3aQ1gBKdU8L+Ten5D
 LIKYyDJ67agNoMfEC/5xWOm7P3X8Qi8cGc326/9AZE22QLE5x6fMCq/SmEC4MA25
 GKu2ocwosdPVhIDQOY33IYZHxxtV8UZjt5KdNbQnNO6+iqE6tX5CbgueMlcWusry
 WmPCOlezTqHydXo0TYYLPjmqJ66RlrQMWYKvcB0SaxLYQfkBGbkW+fei3wdE4kW1
 t7cocRj6Ievv59wCeF+G+fpviTVQVR5bknGA0J3lku9onATOYfdArEAD9FRajygG
 KhA/eNr0YEA40VO/baUInuEAJ/YvDqdk9LoyJ1DOXgUv1ysxYSxu/44G/IcZsOU=
 =+zVD
 -----END PGP SIGNATURE-----

Merge tag 'linux-can-fixes-for-4.2-20150712' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can

Marc Kleine-Budde says:

====================
pull-request: can 2015-07-12

this is a pull request of 8 patchs for net/master.

Sergei Shtylyov contributes 5 patches for the rcar_can driver, fixing the IRQ
check and several info and error messages. There are two patches by J.D.
Schroeder and Roger Quadros for the c_can driver and dra7x-evm device tree,
which precent a glitch in the DCAN1 pinmux. Oliver Hartkopp provides a better
approach to make the CAN skbs unique, the timestamp is replaced by a counter.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-12 22:24:01 -07:00
Oliver Hartkopp
d3b58c47d3 can: replace timestamp as unique skb attribute
Commit 514ac99c64 "can: fix multiple delivery of a single CAN frame for
overlapping CAN filters" requires the skb->tstamp to be set to check for
identical CAN skbs.

Without timestamping to be required by user space applications this timestamp
was not generated which lead to commit 36c01245eb "can: fix loss of CAN frames
in raw_rcv" - which forces the timestamp to be set in all CAN related skbuffs
by introducing several __net_timestamp() calls.

This forces e.g. out of tree drivers which are not using alloc_can{,fd}_skb()
to add __net_timestamp() after skbuff creation to prevent the frame loss fixed
in mainline Linux.

This patch removes the timestamp dependency and uses an atomic counter to
create an unique identifier together with the skbuff pointer.

Btw: the new skbcnt element introduced in struct can_skb_priv has to be
initialized with zero in out-of-tree drivers which are not using
alloc_can{,fd}_skb() too.

Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Cc: linux-stable <stable@vger.kernel.org>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2015-07-12 21:13:22 +02:00
Florian Fainelli
c8cf89f73f net: dsa: Fix off-by-one in switch address parsing
cd->sw_addr is used as a MDIO bus address, which cannot exceed
PHY_MAX_ADDR (32), our check was off-by-one.

Fixes: 5e95329b70 ("dsa: add device tree bindings to register DSA switches")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-11 23:25:16 -07:00
Florian Fainelli
8f5063e97f net: dsa: Test array index before use
port_index is used an index into an array, and this information comes
from Device Tree, make sure that port_index is not equal to the array
size before using it. Move the check against port_index earlier in the
loop.

Fixes: 5e95329b70: ("dsa: add device tree bindings to register DSA switches")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-11 23:25:16 -07:00
Vivien Didelot
2ee94014d9 net: switchdev: don't abort unsupported operations
There is no need to abort attribute setting or object addition, if the
prepare phase returned operation not supported.

Thus, abort these two transactions only if the error is not -EOPNOTSUPP.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-11 21:29:55 -07:00
Florian Westphal
14fe22e334 Revert "ipv4: use skb coalescing in defragmentation"
This reverts commit 3cc4949269.

There is nothing wrong with coalescing during defragmentation, it
reduces truesize overhead and simplifies things for the receiving
socket (no fraglist walk needed).

However, it also destroys geometry of the original fragments.
While that doesn't cause any breakage (we make sure to not exceed largest
original size) ip_do_fragment contains a 'fastpath' that takes advantage
of a present frag list and results in fragments that (in most cases)
match what was received.

In case its needed the coalescing could be done later, when we're sure
the skb is not forwarded.  But discussion during NFWS resulted in
'lets just remove this for now'.

Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-11 21:16:40 -07:00
Phil Sutter
8220ea2324 net: inet_diag: always export IPV6_V6ONLY sockopt for listening sockets
Reconsidering my commit 20462155 "net: inet_diag: export IPV6_V6ONLY
sockopt", I am not happy with the limitations it causes for socket
analysing code in userspace. Exporting the value only if it is set makes
it hard for userspace to decide whether the option is not set or the
kernel does not support exporting the option at all.

>From an auditor's perspective, the interesting question for listening
AF_INET6 sockets is: "Does it NOT have IPV6_V6ONLY set?" Because it is
the unexpected case. This patch allows to answer this question reliably.

Signed-off-by: Phil Sutter <phil@nwl.cc>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-10 23:25:24 -07:00
YOSHIFUJI Hideaki/吉藤英明
9131f3de24 ipv6: Do not iterate over all interfaces when finding source address on specific interface.
If outgoing interface is specified and the candidate address is
restricted to the outgoing interface, it is enough to iterate
over that given interface only.

Signed-off-by: YOSHIFUJI Hideaki <hideaki.yoshifuji@miraclelinux.com>
Acked-by: Erik Kline <ek@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-10 23:19:25 -07:00
Satish Ashok
51ed7f3e7d bridge: mdb: allow the user to delete mdb entry if there's a querier
Until now when a querier was present static entries couldn't be deleted.
Fix this and allow the user to manipulate the mdb with or without a
querier.

Signed-off-by: Satish Ashok <sashok@cumulusnetworks.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-10 18:18:00 -07:00