linux-pinenote

Author	SHA1	Message	Date
Trond Myklebust	775f06ab49	SUNRPC: Set the TCP user timeout option on client sockets Use the TCP_USER_TIMEOUT socket option to advertise to the server how long we will keep the connection open if there is unacknowledged data. See RFC5482. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-06-20 15:31:54 -04:00
Trond Myklebust	4876cc779f	SUNRPC: Ensure we release the TCP socket once it has been closed This fixes a regression introduced by commit `caf4ccd4e8` ("SUNRPC: Make xs_tcp_close() do a socket shutdown rather than a sock_release"). Prior to that commit, the autoclose feature would ensure that an idle connection would result in the socket being both disconnected and released, whereas now only gets disconnected. While the current behaviour is harmless, it does leave the port bound until either RPC traffic resumes or the RPC client is shut down. Reported-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-06-19 19:20:12 -04:00
Trond Myklebust	3832591e6f	SUNRPC: Handle connection issues correctly on the back channel If the back channel is disconnected, we can and should just fail the transmission. The expectation is that the NFSv4.1 server will always retransmit any outstanding callbacks once the connection is re-established. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-06-19 13:04:13 -04:00
Fabian Frederick	d1381929de	sunrpc: use sg_init_one() in krb5_rc4_setup_enc/seq_key() Don't opencode sg_init_one() Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-06-18 08:58:38 -04:00
Trond Myklebust	3438995bc4	NFS: NFSoRDMA Client Changes These patches continue to build up for improving the rsize and wsize that the NFS client uses when talking over RDMA. In addition, these patches also add in scalability enhancements and other bugfixes. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQIcBAABCAAGBQJVey8qAAoJENfLVL+wpUDroT4P/3lwspXwdxS6VZWsW1VpNtdV V1KKd5D+TkpBpz/ih9GdOVaBZijaHpb6XtMReh8xuh0KI893iYmsmLoyhMTPJMsU 6sjUDEv8IFrXwlRKldX1KEfBvNgR0czCNiha6O+YsV5Az08+zr57ahyGKmLUzMxo 4XzPZbwnb5fxvgmBgENUU33g+xXGsXDbsdzLvKW3UGcPU2x6PGOTLr5vP7lQkwxE 20d9ak8xQeRUk0hsmRM4fAebzcluD1o3PLIFQBEhh0Gqm1VGtSCkr9o493gT5TgM /+XrU7B8OnbdJ1B4f/y4Bz4RucfKzyRuXMpulrnK1hL7QIiZLqiph7UrTel/ajcD 0us9PImNwXPo8tMz7Wjw2XMQplndHB3FG3M3lXlJGHlXvCI7F0yjm21AP4SeetOm kxL24Qiyi7l/S7JJxHqNlOc0b8kpVLohBZm6yee9w4r/JUPnynUqfnXCHLjIp/5W F1hzbCUATyfKrSs7VKO0hCQHfntigPEhRmyfoyXRAXzl5LnR1XqD6Wah3a3pwXn+ mEquUd6fKRHIIvJ8cKU6KtykkhRHg1sR/z1mw2ZEW/2PCd0cb+8+WN7X/fQqEN+u +VQSo7oPp38SHdsyozuUUyukN5qHptTMSrNZL+LI7J8/0+BuRuIvW0nojViapc51 LOUlcgqRdUlIvmn754Yo =N1tO -----END PGP SIGNATURE----- Merge tag 'nfs-rdma-for-4.2' of git://git.linux-nfs.org/projects/anna/nfs-rdma NFS: NFSoRDMA Client Changes These patches continue to build up for improving the rsize and wsize that the NFS client uses when talking over RDMA. In addition, these patches also add in scalability enhancements and other bugfixes. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> * tag 'nfs-rdma-for-4.2' of git://git.linux-nfs.org/projects/anna/nfs-rdma: (142 commits) xprtrdma: Reduce per-transport MR allocation xprtrdma: Stack relief in fmr_op_map() xprtrdma: Split rb_lock xprtrdma: Remove rpcrdma_ia::ri_memreg_strategy xprtrdma: Remove ->ro_reset xprtrdma: Remove unused LOCAL_INV recovery logic xprtrdma: Acquire MRs in rpcrdma_register_external() xprtrdma: Introduce an FRMR recovery workqueue xprtrdma: Acquire FMRs in rpcrdma_fmr_register_external() xprtrdma: Introduce helpers for allocating MWs xprtrdma: Use ib_device pointer safely xprtrdma: Remove rr_func xprtrdma: Replace rpcrdma_rep::rr_buffer with rr_rxprt xprtrdma: Warn when there are orphaned IB objects ...	2015-06-16 11:37:50 -04:00
Neil Brown	2980731811	SUNRPC: never enqueue a ->rq_cong request on ->sending If the sending queue has a task without ->rq_cong set at the front, and then a number of tasks with ->rq_cong set such that they use the entire congestion window, then the queue deadlocks. The first entry cannot be processed until later entries complete. This scenario has been seen with a client using UDP to access a server, and the network connection breaking for a period of time - it doesn't recover. It never really makes sense for an ->rq_cong request to be on the ->sending queue, but it can happen when a request is being retried, and finds the transport if locked (XPRT_LOCKED). In this case we simple call __xprt_put_cong() and the deadlock goes away. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-06-16 11:13:21 -04:00
Chuck Lever	40c6ed0c8a	xprtrdma: Reduce per-transport MR allocation Reduce resource consumption per-transport to make way for increasing the credit limit and maximum r/wsize. Pre-allocate fewer MRs. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Steve Wise <swise@opengridcomputing.com> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Reviewed-by: Devesh Sharma <devesh.sharma@avagotech.com> Tested-By: Devesh Sharma <devesh.sharma@avagotech.com> Reviewed-by: Doug Ledford <dledford@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2015-06-12 13:10:37 -04:00
Chuck Lever	acb9da7a57	xprtrdma: Stack relief in fmr_op_map() fmr_op_map() declares a 64 element array of u64 in automatic storage. This is 512 bytes (8 * 64) on the stack. Instead, when FMR memory registration is in use, pre-allocate a physaddr array for each rpcrdma_mw. This is a pre-requisite for increasing the r/wsize maximum for FMR on platforms with 4KB pages. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Steve Wise <swise@opengridcomputing.com> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Reviewed-by: Devesh Sharma <devesh.sharma@avagotech.com> Tested-By: Devesh Sharma <devesh.sharma@avagotech.com> Reviewed-by: Doug Ledford <dledford@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2015-06-12 13:10:37 -04:00
Chuck Lever	58d1dcf5a8	xprtrdma: Split rb_lock /proc/lock_stat showed contention between rpcrdma_buffer_get/put and the MR allocation functions during I/O intensive workloads. Now that MRs are no longer allocated in rpcrdma_buffer_get(), there's no reason the rb_mws list has to be managed using the same lock as the send/receive buffers. Split that lock. The new lock does not need to disable interrupts because buffer get/put is never called in an interrupt context. struct rpcrdma_buffer is re-arranged to ensure rb_mwlock and rb_mws are always in a different cacheline than rb_lock and the buffer pointers. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Steve Wise <swise@opengridcomputing.com> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Tested-By: Devesh Sharma <devesh.sharma@avagotech.com> Reviewed-by: Doug Ledford <dledford@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2015-06-12 13:10:37 -04:00
Chuck Lever	7e53df111b	xprtrdma: Remove rpcrdma_ia::ri_memreg_strategy Clean up: This field is no longer used. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Steve Wise <swise@opengridcomputing.com> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Reviewed-by: Devesh Sharma <devesh.sharma@avagotech.com> Tested-By: Devesh Sharma <devesh.sharma@avagotech.com> Reviewed-by: Doug Ledford <dledford@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2015-06-12 13:10:37 -04:00
Chuck Lever	3269a94b62	xprtrdma: Remove ->ro_reset An RPC can exit at any time. When it does so, xprt_rdma_free() is called, and it calls ->op_unmap(). If ->ro_reset() is running due to a transport disconnect, the two methods can race while processing the same rpcrdma_mw. The results are unpredictable. Because of this, in previous patches I've altered ->ro_map() to handle MR reset. ->ro_reset() is no longer needed and can be removed. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Steve Wise <swise@opengridcomputing.com> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Reviewed-by: Devesh Sharma <devesh.sharma@avagotech.com> Tested-By: Devesh Sharma <devesh.sharma@avagotech.com> Reviewed-by: Doug Ledford <dledford@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2015-06-12 13:10:37 -04:00
Chuck Lever	06b00880b0	xprtrdma: Remove unused LOCAL_INV recovery logic Clean up: Remove functions no longer used to recover broken FRMRs. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Steve Wise <swise@opengridcomputing.com> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Reviewed-by: Devesh Sharma <devesh.sharma@avagotech.com> Tested-By: Devesh Sharma <devesh.sharma@avagotech.com> Reviewed-by: Doug Ledford <dledford@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2015-06-12 13:10:37 -04:00
Chuck Lever	c14d86e591	xprtrdma: Acquire MRs in rpcrdma_register_external() Acquiring 64 MRs in rpcrdma_buffer_get() while holding the buffer pool lock is expensive, and unnecessary because most modern adapters can transfer 100s of KBs of payload using just a single MR. Instead, acquire MRs one-at-a-time as chunks are registered, and return them to rb_mws immediately during deregistration. Note: commit `539431a437` ("xprtrdma: Don't invalidate FRMRs if registration fails") is reverted: There is now a valid case where registration can fail (with -ENOMEM) but the QP is still in RTS. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Steve Wise <swise@opengridcomputing.com> Tested-By: Devesh Sharma <devesh.sharma@avagotech.com> Reviewed-by: Doug Ledford <dledford@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2015-06-12 13:10:37 -04:00
Chuck Lever	951e721ca0	xprtrdma: Introduce an FRMR recovery workqueue After a transport disconnect, FRMRs can be left in an undetermined state. In particular, the MR's rkey is no good. Currently, FRMRs are fixed up by the transport connect worker, but that can race with ->ro_unmap if an RPC happens to exit while the transport connect worker is running. A better way of dealing with broken FRMRs is to detect them before they are re-used by ->ro_map. Such FRMRs are either already invalid or are owned by the sending RPC, and thus no race with ->ro_unmap is possible. Introduce a mechanism for handing broken FRMRs to a workqueue to be reset in a context that is appropriate for allocating resources (ie. an ib_alloc_fast_reg_mr() API call). This mechanism is not yet used, but will be in subsequent patches. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Steve Wise <swise@opengridcomputing.com> Reviewed-By: Devesh Sharma <devesh.sharma@avagotech.com> Tested-By: Devesh Sharma <devesh.sharma@avagotech.com> Reviewed-by: Doug Ledford <dledford@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2015-06-12 13:10:36 -04:00
Chuck Lever	fc7fbb59e7	xprtrdma: Acquire FMRs in rpcrdma_fmr_register_external() Acquiring 64 FMRs in rpcrdma_buffer_get() while holding the buffer pool lock is expensive, and unnecessary because FMR mode can transfer up to a 1MB payload using just a single ib_fmr. Instead, acquire ib_fmrs one-at-a-time as chunks are registered, and return them to rb_mws immediately during deregistration. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Steve Wise <swise@opengridcomputing.com> Tested-By: Devesh Sharma <devesh.sharma@avagotech.com> Reviewed-by: Doug Ledford <dledford@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2015-06-12 13:10:36 -04:00
Chuck Lever	346aa66b2a	xprtrdma: Introduce helpers for allocating MWs We eventually want to handle allocating MWs one at a time, as needed, instead of grabbing 64 and throwing them at each RPC in the pipeline. Add a helper for grabbing an MW off rb_mws, and a helper for returning an MW to rb_mws. These will be used in a subsequent patch. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Steve Wise <swise@opengridcomputing.com> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Tested-By: Devesh Sharma <devesh.sharma@avagotech.com> Reviewed-by: Doug Ledford <dledford@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2015-06-12 13:10:36 -04:00
Chuck Lever	89e0d11258	xprtrdma: Use ib_device pointer safely The connect worker can replace ri_id, but prevents ri_id->device from changing during the lifetime of a transport instance. The old ID is kept around until a new ID is created and the ->device is confirmed to be the same. Cache a copy of ri_id->device in rpcrdma_ia and in rpcrdma_rep. The cached copy can be used safely in code that does not serialize with the connect worker. Other code can use it to save an extra address generation (one pointer dereference instead of two). Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Steve Wise <swise@opengridcomputing.com> Tested-By: Devesh Sharma <devesh.sharma@avagotech.com> Reviewed-by: Doug Ledford <dledford@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2015-06-12 13:10:36 -04:00
Chuck Lever	494ae30d2a	xprtrdma: Remove rr_func A posted rpcrdma_rep never has rr_func set to anything but rpcrdma_reply_handler. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-By: Devesh Sharma <devesh.sharma@avagotech.com> Reviewed-by: Doug Ledford <dledford@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2015-06-12 13:10:36 -04:00
Chuck Lever	fed171b35c	xprtrdma: Replace rpcrdma_rep::rr_buffer with rr_rxprt Clean up: Instead of carrying a pointer to the buffer pool and the rpc_xprt, carry a pointer to the controlling rpcrdma_xprt. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Steve Wise <swise@opengridcomputing.com> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Tested-By: Devesh Sharma <devesh.sharma@avagotech.com> Reviewed-by: Doug Ledford <dledford@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2015-06-12 13:10:36 -04:00
Chuck Lever	6d44698d54	xprtrdma: Warn when there are orphaned IB objects WARN during transport destruction if ib_dealloc_pd() fails. This is a sign that xprtrdma orphaned one or more RDMA API objects at some point, which can pin lower layer kernel modules and cause shutdown to hang. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Steve Wise <swise@opengridcomputing.com> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Reviewed-by: Devesh Sharma <devesh.sharma@avagotech.com> Tested-By: Devesh Sharma <devesh.sharma@avagotech.com> Reviewed-by: Doug Ledford <dledford@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2015-06-12 13:10:36 -04:00
Chuck Lever	5fd23f7e1d	SUNRPC: Address kbuild warning in net/sunrpc/debugfs.c Cross-compile test on ARCH=mn10300: In file included from include/linux/list.h:8:0, from include/linux/wait.h:6, from include/linux/fs.h:6, from include/linux/debugfs.h:18, from net/sunrpc/debugfs.c:7: net/sunrpc/debugfs.c: In function 'fault_disconnect_write': include/linux/kernel.h:723:17: warning: comparison of distinct pointer types lacks a cast (void) (&_min1 == &_min2); \ ^ >> net/sunrpc/debugfs.c:307:8: note: in expansion of macro 'min' len = min(len, sizeof(buffer) - 1); Fixes: ('SUNRPC: Transport fault injection') Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-06-11 14:01:06 -04:00
Chuck Lever	4a06825839	SUNRPC: Transport fault injection It has been exceptionally useful to exercise the logic that handles local immediate errors and RDMA connection loss. To enable developers to test this regularly and repeatably, add logic to simulate connection loss every so often. Fault injection is disabled by default. It is enabled with $ sudo echo xxx > /sys/kernel/debug/sunrpc/inject_fault/disconnect where "xxx" is a large positive number of transport method calls before a disconnect. A value of several thousand is usually a good number that allows reasonable forward progress while still causing a lot of connection drops. These hooks are disabled when SUNRPC_DEBUG is turned off. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-06-10 18:37:26 -04:00
Jeff Layton	d67fa4d85a	sunrpc: turn swapper_enable/disable functions into rpc_xprt_ops RDMA xprts don't have a sock_xprt, but an rdma_xprt, so the xs_swapper_enable/disable functions will likely oops when fed an RDMA xprt. Turn these functions into rpc_xprt_ops so that that doesn't occur. For now the RDMA versions are no-ops that just return -EINVAL on an attempt to swapon. Cc: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-06-10 18:26:26 -04:00
Jeff Layton	d6e971d8ec	sunrpc: lock xprt before trying to set memalloc on the sockets It's possible that we could race with a call to xs_reset_transport, in which case the xprt->inet pointer could be zeroed out while we're accessing it. Lock the xprt before we try to set memalloc on it. Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-06-10 18:26:24 -04:00
Jeff Layton	264d1df3b3	sunrpc: if we're closing down a socket, clear memalloc on it first We currently increment the memalloc_socks counter if we have a xprt that is associated with a swapfile. That socket can be replaced however during a reconnect event, and the memalloc_socks counter is never decremented if that occurs. When tearing down a xprt socket, check to see if the xprt is set up for swapping and sk_clear_memalloc before releasing the socket if so. Acked-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-06-10 18:26:22 -04:00
Jeff Layton	8e2281330f	sunrpc: make xprt->swapper an atomic_t Split xs_swapper into enable/disable functions and eliminate the "enable" flag. Currently, it's racy if you have multiple swapon/swapoff operations running in parallel over the same xprt. Also fix it so that we only set it to a memalloc socket on a 0->1 transition and only clear it on a 1->0 transition. Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-06-10 18:26:18 -04:00
Jeff Layton	3c87ef6efb	sunrpc: keep a count of swapfiles associated with the rpc_clnt Jerome reported seeing a warning pop when working with a swapfile on NFS. The nfs_swap_activate can end up calling sk_set_memalloc while holding the rcu_read_lock and that function can sleep. To fix that, we need to take a reference to the xprt while holding the rcu_read_lock, set the socket up for swapping and then drop that reference. But, xprt_put is not exported and having NFS deal with the underlying xprt is a bit of layering violation anyway. Fix this by adding a set of activate/deactivate functions that take a rpc_clnt pointer instead of an rpc_xprt, and have nfs_swap_activate and nfs_swap_deactivate call those. Also, add a per-rpc_clnt atomic counter to keep track of the number of active swapfiles associated with it. When the counter does a 0->1 transition, we enable swapping on the xprt, when we do a 1->0 transition we disable swapping on it. This also allows us to be a bit more selective with the RPC_TASK_SWAPPER flag. If non-swapper and swapper clnts are sharing a xprt, then we only need to flag the tasks from the swapper clnt with that flag. Acked-by: Mel Gorman <mgorman@suse.de> Reported-by: Jerome Marchand <jmarchan@redhat.com> Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-06-10 18:26:14 -04:00
Trond Myklebust	0d2a970d0a	SUNRPC: Fix a backchannel race We need to allow the server to send a new request immediately after we've replied to the previous one. Right now, there is a window between the send and the release of the old request in rpc_put_task(), where the server could send us a new backchannel RPC call, and we have no request to service it. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-06-05 11:15:43 -04:00
Trond Myklebust	1dddda86c0	SUNRPC: Clean up allocation and freeing of back channel requests Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-06-05 11:15:43 -04:00
Trond Myklebust	0f41979164	SUNRPC: Remove unused argument 'tk_ops' in rpc_run_bc_task Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-06-05 11:15:42 -04:00
Chuck Lever	632dda833e	SUNRPC: Clean up bc_send() Clean up: Merge bc_send() into bc_svc_process(). Note: even thought this touches svc.c, it is a client-side change. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-06-02 13:30:35 -04:00
Trond Myklebust	1193d58f75	SUNRPC: Backchannel handle socket nospace If the socket was busy due to a socket nospace error, then we should retry the send. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-06-02 13:30:35 -04:00
Trond Myklebust	88de6af24f	SUNRPC: Fix a memory leak in the backchannel code req->rq_private_buf isn't initialised when xprt_setup_backchannel calls xprt_free_allocation. Fixes: `fb7a0b9add` ("nfs41: New backchannel helper routines") Cc: stable@vger.kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-06-02 08:55:28 -04:00
Stefan Hajnoczi	9300fdba25	SUNRPC: drop stale doc comments in xprtsock.c Several functions have outdated arguments listed in the doc comments. Drop documentation for arguments that no longer exist. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-06-02 08:55:28 -04:00
David S. Miller	e453581dd5	Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf Pablo Neira Ayuso says: ==================== Netfilter fix for net The following patch reverts the ebtables chunk that enforces counters that was introduced in the recently applied `d26e2c9ffa` ('Revert "netfilter: ensure number of counters is >0 in do_replace()"') since this breaks ebtables. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2015-06-01 16:56:43 -07:00
Steffen Klassert	ccd740cbc6	vti6: Add pmtu handling to vti6_xmit. We currently rely on the PMTU discovery of xfrm. However if a packet is localy sent, the PMTU mechanism of xfrm tries to to local socket notification what might not work for applications like ping that don't check for this. So add pmtu handling to vti6_xmit to report MTU changes immediately. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-06-01 16:03:43 -07:00
David S. Miller	18ec898ee5	Revert "net: core: 'ethtool' issue with querying phy settings" This reverts commit `f96dee13b8`. It isn't right, ethtool is meant to manage one PHY instance per netdevice at a time, and this is selected by the SET command. Therefore by definition the GET command must only return the settings for the configured and selected PHY. Reported-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-06-01 14:43:50 -07:00
Bernhard Thaler	d26e2c9ffa	Revert "netfilter: ensure number of counters is >0 in do_replace()" This partially reverts commit `1086bbe97a` ("netfilter: ensure number of counters is >0 in do_replace()") in net/bridge/netfilter/ebtables.c. Setting rules with ebtables does not work any more with `1086bbe97a` place. There is an error message and no rules set in the end. e.g. ~# ebtables -t nat -A POSTROUTING --src 12:34:56:78:9a:bc -j DROP Unable to update the kernel. Two possible causes: 1. Multiple ebtables programs were executing simultaneously. The ebtables userspace tool doesn't by default support multiple ebtables programs running Reverting the ebtables part of `1086bbe97a` makes this work again. Signed-off-by: Bernhard Thaler <bernhard.thaler@wvnet.at> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2015-06-01 19:45:47 +02:00
Florian Fainelli	24595346d7	net: dsa: Properly propagate errors from dsa_switch_setup_one While shuffling some code around, dsa_switch_setup_one() was introduced, and it was modified to return either an error code using ERR_PTR() or a NULL pointer when running out of memory or failing to setup a switch. This is a problem for its caler: dsa_switch_setup() which uses IS_ERR() and expects to find an error code, not a NULL pointer, so we still try to proceed with dsa_switch_setup() and operate on invalid memory addresses. This can be easily reproduced by having e.g: the bcm_sf2 driver built-in, but having no such switch, such that drv->setup will fail. Fix this by using PTR_ERR() consistently which is both more informative and avoids for the caller to use IS_ERR_OR_NULL(). Fixes: `df197195a5` ("net: dsa: split dsa_switch_setup into two functions") Reported-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Tested-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-31 21:50:34 -07:00
Neal Cardwell	9f950415e4	tcp: fix child sockets to use system default congestion control if not set Linux 3.17 and earlier are explicitly engineered so that if the app doesn't specifically request a CC module on a listener before the SYN arrives, then the child gets the system default CC when the connection is established. See tcp_init_congestion_control() in 3.17 or earlier, which says "if no choice made yet assign the current value set as default". The change ("net: tcp: assign tcp cong_ops when tcp sk is created") altered these semantics, so that children got their parent listener's congestion control even if the system default had changed after the listener was created. This commit returns to those original semantics from 3.17 and earlier, since they are the original semantics from 2007 in `4d4d3d1e8` ("[TCP]: Congestion control initialization."), and some Linux congestion control workflows depend on that. In summary, if a listener socket specifically sets TCP_CONGESTION to "x", or the route locks the CC module to "x", then the child gets "x". Otherwise the child gets current system default from net.ipv4.tcp_congestion_control. That's the behavior in 3.17 and earlier, and this commit restores that. Fixes: `55d8694fa8` ("net: tcp: assign tcp cong_ops when tcp sk is created") Cc: Florian Westphal <fw@strlen.de> Cc: Daniel Borkmann <dborkman@redhat.com> Cc: Glenn Judd <glenn.judd@morganstanley.com> Cc: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-31 21:49:14 -07:00
Eric Dumazet	beb39db59d	udp: fix behavior of wrong checksums We have two problems in UDP stack related to bogus checksums : 1) We return -EAGAIN to application even if receive queue is not empty. This breaks applications using edge trigger epoll() 2) Under UDP flood, we can loop forever without yielding to other processes, potentially hanging the host, especially on non SMP. This patch is an attempt to make things better. We might in the future add extra support for rt applications wanting to better control time spent doing a recv() in a hostile environment. For example we could validate checksums before queuing packets in socket receive queue. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-31 21:42:18 -07:00
Eric Dumazet	71d9f6149c	bridge: fix br_multicast_query_expired() bug br_multicast_query_expired() querier argument is a pointer to a struct bridge_mcast_querier : struct bridge_mcast_querier { struct br_ip addr; struct net_bridge_port __rcu *port; }; Intent of the code was to clear port field, not the pointer to querier. Fixes: `2cd4143192` ("bridge: memorize and export selected IGMP/MLD querier port") Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Thadeu Lima de Souza Cascardo <cascardo@redhat.com> Acked-by: Linus Lüssing <linus.luessing@c0d3.blue> Cc: Linus Lüssing <linus.luessing@web.de> Cc: Steinar H. Gunderson <sesse@samfundet.no> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-30 23:31:28 -07:00
David S. Miller	5aab0e8a45	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== pull request (net): ipsec 2015-05-28 1) Fix a race in xfrm_state_lookup_byspi, we need to take the refcount before we release xfrm_state_lock. From Li RongQing. 2) Fix IV generation on ESN state. We used just the low order sequence numbers for IV generation on ESN, as a result the IV can repeat on the same state. Fix this by using the high order sequence number bits too and make sure to always initialize the high order bits with zero. These patches are serious stable candidates. Fixes from Herbert Xu. 3) Fix the skb->mark handling on vti. We don't reset skb->mark in skb_scrub_packet anymore, so vti must care to restore the original value back after it was used to lookup the vti policy and state. Fixes from Alexander Duyck. Please pull or let me know if there are problems. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-28 20:41:35 -07:00
Alexander Duyck	d55c670cbc	ip_vti/ip6_vti: Preserve skb->mark after rcv_cb call The vti6_rcv_cb and vti_rcv_cb calls were leaving the skb->mark modified after completing the function. This resulted in the original skb->mark value being lost. Since we only need skb->mark to be set for xfrm_policy_check we can pull the assignment into the rcv_cb calls and then just restore the original mark after xfrm_policy_check has been completed. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2015-05-28 06:23:32 +02:00
Alexander Duyck	049f8e2e28	xfrm: Override skb->mark with tunnel->parm.i_key in xfrm_input This change makes it so that if a tunnel is defined we just use the mark from the tunnel instead of the mark from the skb header. By doing this we can avoid the need to set skb->mark inside of the tunnel receive functions. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2015-05-28 06:23:31 +02:00
Alexander Duyck	cd5279c194	ip_vti/ip6_vti: Do not touch skb->mark on xmit Instead of modifying skb->mark we can simply modify the flowi_mark that is generated as a result of the xfrm_decode_session. By doing this we don't need to actually touch the skb->mark and it can be preserved as it passes out through the tunnel. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2015-05-28 06:23:31 +02:00
Linus Torvalds	8f98bcdf8f	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Pull networking fixes from David Miller: 1) Don't use MMIO on certain iwlwifi devices otherwise we get a firmware crash. 2) Don't corrupt the GRO lists of mac80211 contexts by doing sends via timer interrupt, from Johannes Berg. 3) SKB tailroom is miscalculated in AP_VLAN crypto code, from Michal Kazior. 4) Fix fw_status memory leak in iwlwifi, from Haim Dreyfuss. 5) Fix use after free in iwl_mvm_d0i3_enable_tx(), from Eliad Peller. 6) JIT'ing of large BPF programs is broken on x86, from Alexei Starovoitov. 7) EMAC driver ethtool register dump size is miscalculated, from Ivan Mikhaylov. 8) Fix PHY initial link mode when autonegotiation is disabled in amd-xgbe, from Tom Lendacky. 9) Fix NULL deref on SOCK_DEAD socket in AF_UNIX and CAIF protocols, from Mark Salyzyn. 10) credit_bytes not initialized properly in xen-netback, from Ross Lagerwall. 11) Fallback from MSI-X to INTx interrupts not handled properly in mlx4 driver, fix from Benjamin Poirier. 12) Perform ->attach() after binding dev->qdisc in packet scheduler, otherwise we can crash. From Cong WANG. 13) Don't clobber data in sctp_v4_map_v6(). From Jason Gunthorpe. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (30 commits) sctp: Fix mangled IPv4 addresses on a IPv6 listening socket net_sched: invoke ->attach() after setting dev->qdisc xen-netfront: properly destroy queues when removing device mlx4_core: Fix fallback from MSI-X to INTx xen/netback: Properly initialize credit_bytes net: netxen: correct sysfs bin attribute return code tools: bpf_jit_disasm: fix segfault on disabled debugging log output unix/caif: sk_socket can disappear when state is unlocked amd-xgbe-phy: Fix initial mode when autoneg is disabled net: dp83640: fix improper double spin locking. net: dp83640: reinforce locking rules. net: dp83640: fix broken calibration routine. net: stmmac: create one debugfs dir per net-device net/ibm/emac: fix size of emac dump memory areas x86: bpf_jit: fix compilation of large bpf programs net: phy: bcm7xxx: Fix 7425 PHY ID and flags iwlwifi: mvm: avoid use-after-free on iwl_mvm_d0i3_enable_tx() iwlwifi: mvm: clean net-detect info if device was reset during suspend iwlwifi: mvm: take the UCODE_DOWN reference when resuming iwlwifi: mvm: BT Coex - duplicate the command if sent ASYNC ...	2015-05-27 13:41:13 -07:00
WANG Cong	86e363dc3b	net_sched: invoke ->attach() after setting dev->qdisc For mq qdisc, we add per tx queue qdisc to root qdisc for display purpose, however, that happens too early, before the new dev->qdisc is finally set, this causes q->list points to an old root qdisc which is going to be freed right before assigning with a new one. Fix this by moving ->attach() after setting dev->qdisc. For the record, this fixes the following crash: ------------[ cut here ]------------ WARNING: CPU: 1 PID: 975 at lib/list_debug.c:59 __list_del_entry+0x5a/0x98() list_del corruption. prev->next should be ffff8800d1998ae8, but was 6b6b6b6b6b6b6b6b CPU: 1 PID: 975 Comm: tc Not tainted 4.1.0-rc4+ #1019 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 0000000000000009 ffff8800d73fb928 ffffffff81a44e7f 0000000047574756 ffff8800d73fb978 ffff8800d73fb968 ffffffff810790da ffff8800cfc4cd20 ffffffff814e725b ffff8800d1998ae8 ffffffff82381250 0000000000000000 Call Trace: [<ffffffff81a44e7f>] dump_stack+0x4c/0x65 [<ffffffff810790da>] warn_slowpath_common+0x9c/0xb6 [<ffffffff814e725b>] ? __list_del_entry+0x5a/0x98 [<ffffffff81079162>] warn_slowpath_fmt+0x46/0x48 [<ffffffff81820eb0>] ? dev_graft_qdisc+0x5e/0x6a [<ffffffff814e725b>] __list_del_entry+0x5a/0x98 [<ffffffff814e72a7>] list_del+0xe/0x2d [<ffffffff81822f05>] qdisc_list_del+0x1e/0x20 [<ffffffff81820cd1>] qdisc_destroy+0x30/0xd6 [<ffffffff81822676>] qdisc_graft+0x11d/0x243 [<ffffffff818233c1>] tc_get_qdisc+0x1a6/0x1d4 [<ffffffff810b5eaf>] ? mark_lock+0x2e/0x226 [<ffffffff817ff8f5>] rtnetlink_rcv_msg+0x181/0x194 [<ffffffff817ff72e>] ? rtnl_lock+0x17/0x19 [<ffffffff817ff72e>] ? rtnl_lock+0x17/0x19 [<ffffffff817ff774>] ? __rtnl_unlock+0x17/0x17 [<ffffffff81855dc6>] netlink_rcv_skb+0x4d/0x93 [<ffffffff817ff756>] rtnetlink_rcv+0x26/0x2d [<ffffffff818544b2>] netlink_unicast+0xcb/0x150 [<ffffffff81161db9>] ? might_fault+0x59/0xa9 [<ffffffff81854f78>] netlink_sendmsg+0x4fa/0x51c [<ffffffff817d6e09>] sock_sendmsg_nosec+0x12/0x1d [<ffffffff817d8967>] sock_sendmsg+0x29/0x2e [<ffffffff817d8cf3>] ___sys_sendmsg+0x1b4/0x23a [<ffffffff8100a1b8>] ? native_sched_clock+0x35/0x37 [<ffffffff810a1d83>] ? sched_clock_local+0x12/0x72 [<ffffffff810a1fd4>] ? sched_clock_cpu+0x9e/0xb7 [<ffffffff810def2a>] ? current_kernel_time+0xe/0x32 [<ffffffff810b4bc5>] ? lock_release_holdtime.part.29+0x71/0x7f [<ffffffff810ddebf>] ? read_seqcount_begin.constprop.27+0x5f/0x76 [<ffffffff810b6292>] ? trace_hardirqs_on_caller+0x17d/0x199 [<ffffffff811b14d5>] ? __fget_light+0x50/0x78 [<ffffffff817d9808>] __sys_sendmsg+0x42/0x60 [<ffffffff817d9838>] SyS_sendmsg+0x12/0x1c [<ffffffff81a50e97>] system_call_fastpath+0x12/0x6f ---[ end trace ef29d3fb28e97ae7 ]--- For long term, we probably need to clean up the qdisc_graft() code in case it hides other bugs like this. Fixes: `95dc19299f` ("pkt_sched: give visibility to mq slave qdiscs") Cc: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-27 14:09:55 -04:00
Mark Salyzyn	b48732e4a4	unix/caif: sk_socket can disappear when state is unlocked got a rare NULL pointer dereference in clear_bit Signed-off-by: Mark Salyzyn <salyzyn@android.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> ---- v2: switch to sock_flag(sk, SOCK_DEAD) and added net/caif/caif_socket.c v3: return -ECONNRESET in upstream caller of wait function for SOCK_DEAD Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-26 23:19:29 -04:00
David S. Miller	fe9066ade6	We have three more fixes: * AP_VLAN tailroom calculation fix, the bug leads to warnings along with dropped packets * NAPI context issue, calling napi_gro_receive() from a timer (obviously) can lead to crashes * remain-on-channel combining leads to dropped requests and not being able to finish certain operations, so remove it -----BEGIN PGP SIGNATURE----- iQIcBAABCAAGBQJVZB7ZAAoJEDBSmw7B7bqr/qkQAKW5RHFfo8TtjUB7pl/iWqc3 /gyym9tisy4hc7OOr8BDTQdUvqzY4fRMhuTAvjPLXCqdV2Isz1IetiogiJRMglNs 3v83QkmEI8vMQ3lP6Y+2Hnz9tw1zNaVXHGwIKPjk9YLrzsBV0AJoGqn8qo4OBfWl JjkHLM6/0PVDy5UDF95cRyM6+L0XJdPdVS/YRLslp5Tda8fgTbH+dMwLnzjQZbIu ZanFpAuRxx35g/Zg6vAsRhlva/zrucphteaiJGAa6a3NgH9Z4tDlGHRveHQOgNYt xHKNcvOgegaFNcEY8ftKMcQ/RIVJjxXr6nPYnQyFnG0aAxYePysNx8TJaUX2dxq7 +O1RlYHwJpRRUmScSsDFDa/CmQcpUxgloUfCmkWS611g1LZFnpVOcXEZRJg/c9lm hO6mH19OYlDiWeE3ZhKeYJNxmpWvPM4bxhswHcYfLG+vA93kLTYQ/xGi/0YfMKl+ +UCTPbdiXdRyYLzixiu/NWcKwWDH2pHAH1pjimH+r2266lQYs2Jsk8436uQKhZxI D4l3ethujDcmMzO+ZTzLHjWbNdO2fC4R1LIF/Eg/sDe7g11dQItWYrAVddQFIkfb /A5VV1DbGW33tpj4QUXfjuQ65I6rLOq4NbTY3j4/lPSiDkqIpEfwjzuc3/UZeuZl Z8cu8Qxf7jdOfihbnUFR =PS0g -----END PGP SIGNATURE----- Merge tag 'mac80211-for-davem-2015-05-26' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211 Johannes Berg says: ==================== We have three more fixes: * AP_VLAN tailroom calculation fix, the bug leads to warnings along with dropped packets * NAPI context issue, calling napi_gro_receive() from a timer (obviously) can lead to crashes * remain-on-channel combining leads to dropped requests and not being able to finish certain operations, so remove it ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-26 19:38:53 -04:00

1 2 3 4 5 ...

37,500 commits