Netdevice parameter hard_header_len is variously interpreted both as
an upper and lower bound on link layer header length. The field is
used as upper bound when reserving room at allocation, as lower bound
when validating user input in PF_PACKET.
Clarify the definition to be maximum header length. For validation
of untrusted headers, add an optional validate member to header_ops.
Allow bypassing of validation by passing CAP_SYS_RAWIO, for instance
for deliberate testing of corrupt input. In this case, pad trailing
bytes, as some device drivers expect completely initialized headers.
See also http://comments.gmane.org/gmane.linux.network/401064
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Originally I only wanted to drop the unneeded inclusion of
<linux/i2c.h>, but then noticed that struct
microread_nfc_platform_data isn't actually used, and
MICROREAD_DRIVER_NAME is redefined in the only file where it is used,
so we can get rid of the header file and dead code altogether.
Signed-off-by: Jean Delvare <jdelvare@suse.de>
Cc: Lauro Ramos Venancio <lauro.venancio@openbossa.org>
Cc: Aloisio Almeida Jr <aloisio.almeida@openbossa.org>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
Tom Herbert says:
====================
kcm: Kernel Connection Multiplexor (KCM)
Kernel Connection Multiplexor (KCM) is a facility that provides a
message based interface over TCP for generic application protocols.
The motivation for this is based on the observation that although
TCP is byte stream transport protocol with no concept of message
boundaries, a common use case is to implement a framed application
layer protocol running over TCP. To date, most TCP stacks offer
byte stream API for applications, which places the burden of message
delineation, message I/O operation atomicity, and load balancing
in the application. With KCM an application can efficiently send
and receive application protocol messages over TCP using a
datagram interface.
In order to delineate message in a TCP stream for receive in KCM, the
kernel implements a message parser. For this we chose to employ BPF
which is applied to the TCP stream. BPF code parses application layer
messages and returns a message length. Nearly all binary application
protocols are parsable in this manner, so KCM should be applicable
across a wide range of applications. Other than message length
determination in receive, KCM does not require any other application
specific awareness. KCM does not implement any other application
protocol semantics-- these are are provided in userspace or could be
implemented in a kernel module layered above KCM.
KCM implements an NxM multiplexor in the kernel as diagrammed below:
+------------+ +------------+ +------------+ +------------+
| KCM socket | | KCM socket | | KCM socket | | KCM socket |
+------------+ +------------+ +------------+ +------------+
| | | |
+-----------+ | | +----------+
| | | |
+----------------------------------+
| Multiplexor |
+----------------------------------+
| | | | |
+---------+ | | | ------------+
| | | | |
+----------+ +----------+ +----------+ +----------+ +----------+
| Psock | | Psock | | Psock | | Psock | | Psock |
+----------+ +----------+ +----------+ +----------+ +----------+
| | | | |
+----------+ +----------+ +----------+ +----------+ +----------+
| TCP sock | | TCP sock | | TCP sock | | TCP sock | | TCP sock |
+----------+ +----------+ +----------+ +----------+ +----------+
The KCM sockets provide the datagram interface to applications,
Psocks are the state for each attached TCP connection (i.e. where
message delineation is performed on receive).
A description of the APIs and design can be found in the included
Documentation/networking/kcm.txt.
In this patch set:
- Add MSG_BATCH flag. This is used in sendmsg msg_hdr flags to
indicate that more messages will be sent on the socket. The stack
may batch messages up if it is beneficial for transmission.
- In sendmmsg, set MSG_BATCH in all sub messages except for the last
one.
- In order to allow sendmmsg to contain multiple messages with
SOCK_SEQPAKET we allow each msg_hdr in the sendmmsg to set MSG_EOR.
- Add KCM module
- This supports SOCK_DGRAM and SOCK_SEQPACKET.
- KCM documentation
v2:
- Added splice and page operations.
- Assemble receive messages in place on TCP socket (don't have a
separate assembly queue.
- Based on above, enforce maxmimum receive message to be the size
of the recceive socket buffer.
- Support message assembly timeout. Use the timeout value in
sk_rcvtimeo on the TCP socket.
- Tested some with a couple of other production applications,
see ~5% improvement in application latency.
Testing:
Dave Watson has integrated KCM into Thrift and we intend to put these
changes into open source. Example of this is in:
https://github.com/djwatson/fbthrift/commit/
dd7e0f9cf4e80912fdb90f6cd394db24e61a14cc
Some initial KCM Thrift benchmark numbers (comment from Dave)
Thrift by default ties a single connection to a single thread. KCM is
instead able to load balance multiple connections across multiple epoll
loops easily.
A test sending ~5k bytes of data to a kcm thrift server, dropping the
bytes on recv:
QPS Latency / std dev Latency
without KCM
70336 209/123
with KCM
70353 191/124
A test sending a small request, then doing work in the epoll thread,
before serving more requests:
QPS Latency / std dev Latency
without KCM
14282 559/602
with KCM
23192 344/234
At the high end, there's definitely some additional kernel overhead:
Cranking the pipelining way up, with lots of small requests
QPS Latency / std dev Latency
without KCM
1863429 127/119
with KCM
1337713 192/241
---
So for a "realistic" workload, KCM performs pretty well (second case).
Under extreme conditions of highest tps we still have some work to do.
In its nature a multiplexor will spread work between CPUs which is
logically good for load balancing but coan conflict with the goal
promoting affinity. Batching messages on both send and receive are
the means to recoup performance.
Future support:
- Integration with TLS (TLS-in-kernel is a separate initiative).
- Page operations/splice support
- Unconnected KCM sockets. Will be able to attach sockets to different
destinations, AF_KCM addresses with be used in sendmsg and recvmsg
to indicate destination
- Explore more utility in performing BPF inline with a TCP data stream
(setting SO_MARK, rxhash for messages being sent received on
KCM sockets).
- Performance work
- Diagnose performance issues under high message load
FAQ (Questions posted on LWN)
Q: Why do this in the kernel?
A: Because the kernel is good at scheduling threads and steering packets
to threads. KCM fits well into this model since it allows the unit
of work for scheduling and steering to be the application layer
messages themselves. KCM should be thought of as generic application
protocol acceleration. It to the philosophy that the kernel provides
generic and extensible interfaces.
Q: How can adding code in the path yield better performance?
A: It is true that for just sending receiving a single message there
would be some performance loss since the code path is longer (for
instance comparing netperf to KCM). But for real production
applications performance takes on many dynamics. Parallelism, context
switching, affinity, granularity of locking, and load balancing are
all relevant. The theory of KCM is that by an application-centric
interface, the kernel can provide better support for these
performance characteristics.
Q: Why not use an existing message-oriented protocol such as RUDP,
DCCP, SCTP, RDS, and others?
A: Because that would entail using a completely new transport protocol.
Deploying a new protocol at scale is either a huge undertaking or
fundamentally infeasible. This is true in either the Internet and in
the data center due in a large part to protocol ossification.
Besides, KCM we want KCM to work existing, well deployed application
protocols that we couldn't change even if we wanted to (e.g. http/2).
KCM simply defines a new interface method, it does not redefine any
aspect of the transport protocol nor application protocol, nor set
any new requirements on these. Neither does KCM attempt to implement
any application protocol logic other than message deliniation in the
stream. These are fundamental requirement of KCM.
Q: How does this affect TCP?
A: It doesn't, not in the slightest. The use of KCM can be one-sided,
KCM has no effect on the wire.
Q: Why force TCP into doing something it's not designed for?
A: TCP is defined as transport protocol and there is no standard that
says the API into TCP must be stream based sockets, or for that
matter sockets at all (or even that TCP needs to be implemented in a
kernel). KCM is not inconsistent with the design of TCP just because
to makes an message based interface over TCP, if it were then every
application protocol sending messages over TCP would also be! :-)
Q: What about the problem of a connections with very slow rate of
incoming data? As a result your application can get storms of very
short reads. And it actually happens a lot with connection from
mobile devices and it is a problem for servers handling a lot of
connections.
A: The storm of short reads will occur regardless of whether KCM is used
or not. KCM does have one advantage in this scenario though, it will
only wake up the application when a full message has been received,
not for each packet that makes up part of a bigger messages. If a
bunch of small messages are received, the application can receive
messages in batches using recvmmsg.
Q: Why not just use DPDK, or at least provide KCM like functionality in
DPDK?
A: DPDK, or more generally OS bypass presumably with a TCP stack in
userland, presents a different model of load balancing than that of
KCM (and the kernel). KCM implements load balancing of messages
across the threads of an application, whereas DPDK load balances
based on queues which are more static and coarse-grained since
multiple connections are bound to queues. DPDK works best when
processing of packets is silo'ed in a thread on the CPU processing
a queue, and packet processing (for both the stack and application)
is fairly uniform. KCM works well for applications where the amount
of work to process messages varies an application work is commonly
delegated to worker threads often on different CPUs.
The message based interface over TCP is something that could be
provide by a DPDK or OS bypass library.
Q: I'm not quite seeing this for HTTP. Maybe for HTTP/2, I guess, or web
sockets?
A: Yes. KCM is most appropriate for message based protocols over TCP
where is easy to deduce the message length (e.g. a length field)
and the protocol implements its own message ordering semantics.
Fortunately this encompasses many modern protocols.
Q: How is memory limited and controlled?
A: In v2 all data for messages is now kept in socket buffers, either
those for TCP or KCM, so socket buffer limits are applicable.
This includes receive messages assembly which is now done ont teh
TCP socket buffer instead of a separate queue-- this has the
consequence that the TCP socket buffer limit provides an
enforceable maxmimum message size.
Additionally, a timeout may be set for messages assembly. The
value used for this is taken from sk_rcvtimeo of the TCP socket.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds receive timeout for message assembly on the attached TCP
sockets. The timeout is set when a new messages is started and the whole
message has not been received by TCP (not in the receive queue). If the
completely message is subsequently received the timer is cancelled, if the
timer expires the RX side is aborted.
The timeout value is taken from the socket timeout (SO_RCVTIMEO) that is
set on a TCP socket (i.e. set by get sockopt before attaching a TCP socket
to KCM.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Message assembly is performed on the TCP socket. This is logically
equivalent of an application that performs a peek on the socket to find
out how much memory is needed for a receive buffer. The receive socket
buffer also provides the maximum message size which is checked.
The receive algorithm is something like:
1) Receive the first skbuf for a message (or skbufs if multiple are
needed to determine message length).
2) Check the message length against the number of bytes in the TCP
receive queue (tcp_inq()).
- If all the bytes of the message are in the queue (incluing the
skbuf received), then proceed with message assembly (it should
complete with the tcp_read_sock)
- Else, mark the psock with the number of bytes needed to
complete the message.
3) In TCP data ready function, if the psock indicates that we are
waiting for the rest of the bytes of a messages, check the number
of queued bytes against that.
- If there are still not enough bytes for the message, just
return
- Else, clear the waiting bytes and proceed to receive the
skbufs. The message should now be received in one
tcp_read_sock
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Implement kcm_sendpage. Set in sendpage to kcm_sendpage in both
dgram and seqpacket ops.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Implement kcm_splice_read. This is supported only for seqpacket.
Add kcm_seqpacket_ops and set splice read to kcm_splice_read.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds various counters for KCM. These include counters for
messages and bytes received or sent, as well as counters for number of
attached/unattached TCP sockets and other error or edge events.
The statistics are exposed via a proc interface. /proc/net/kcm provides
statistics per KCM socket and per psock (attached TCP sockets).
/proc/net/kcm_stats provides aggregate statistics.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This module implements the Kernel Connection Multiplexor.
Kernel Connection Multiplexor (KCM) is a facility that provides a
message based interface over TCP for generic application protocols.
With KCM an application can efficiently send and receive application
protocol messages over TCP using datagram sockets.
For more information see the included Documentation/networking/kcm.txt
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Create a common kernel function to get the number of bytes available
on a TCP socket. This is based on code in INQ getsockopt and we now call
the function for that getsockopt.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a new msg flag called MSG_BATCH. This flag is used in sendmsg to
indicate that more messages will follow (i.e. a batch of messages is
being sent). This is similar to MSG_MORE except that the following
messages are not merged into one packet, they are sent individually.
sendmmsg is updated so that each contained message except for the
last one is marked as MSG_BATCH.
MSG_BATCH is a performance optimization in cases where a socket
implementation can benefit by transmitting packets in a batch.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch allows setting MSG_EOR in each individual msghdr passed
in sendmmsg. This allows a sendmmsg to send multiple messages when
using SOCK_SEQPACKET.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Export it for cases where we want to create sockets by hand.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is a convenience function that returns the next entry in an RCU
list or NULL if at the end of the list.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Gscan capabilities were updated with new capabilities supported
by the device. While at it, simplify the firmware support
conditional and move both conditions into the WARN() to make it
easier to undertand and use the unlikely() for both.
Signed-off-by: Ayala Beker <ayala.beker@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
The firmware/hardware only supports checking AES-CMAC on RX, not
using it on TX. For station mode this is fine, since it's the only
thing it will ever do. For AP mode, it never receives such frames,
but must be able to transmit them. This is currently broken since
we try to enable them for hardware crypto (for RX only) and then
treat them as TX_CMD_SEC_EXT, leading to FIFO underruns during TX
so the frames never go out to the air.
To fix this, simply use software on TX in AP (and IBSS) mode.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Newer firmware versions put different data in the memory
which is read by the driver upon firmware crash. Just
change the variable names in the code and the name of the
data in the log that we print withouth any functional
change.
On older firmware, there will be a mismatch between the
names that are printed and the content itself, but that's
harmless.
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
When trying to reach high Rx throughput of more than 500Mbps on
a device with a relatively weak CPU (Atom x5-Z8500), CPU utilization
may become a bottleneck. Analysis showed that we are looping in
iwl_pcie_rx_handle for very long periods which led to starvation
of other threads (iwl_pcie_rx_handle runs with _bh disabled).
We were handling Rx and allocating new buffers and the new buffers
were ready quickly enough to be available before we had finished
handling all the buffers available in the hardware. As a
consequence, we called iwl_pcie_rxq_restock to refill the hardware
with the new buffers, and start again handling new buffers without
exiting the function. Since we read the hardware pointer again when
we goto restart, new buffers were handled immediately instead of
exiting the function.
This patch avoids refilling RBs inside rx handling loop, unless an
emergency situation is reached. It also doesn't read the hardware
pointer again unless we are in an emergency (unlikely) case.
This significantly reduce the maximal time we spend in
iwl_pcie_rx_handle with _bh disabled.
Signed-off-by: Gregory Greenman <gregory.greenman@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
iwl_mvm_tcool_get_cur_state is the function that returns the
cooling state index to the sysfs handler. This function returns
mvm->cooling_dev.cur_state but that variable was set to the
budget and not the cooling state index. Fix that.
Add a missing blank line while at it.
Signed-off-by: Chaya Rachel Ivgi <chaya.rachel.ivgi@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
We need to track the next packet that we will reclaim in
order to know when the Tx queues are empty. This is useful
when we open or tear down an A-MPDU session which requires
to switch queue.
The next packet being reclaimed is identified by its WiFi
sequence number and this is relevant only when we use QoS.
QoS NDPs do have a TID but have a meaningless sequence
number. The spec mandates the receiver to ignore the
sequence number in this case, allowing the transmitter to
put any sequence number. Our implementation leaves it 0.
When we reclaim a QoS NDP, we can't update the next_relcaim
counter since the sequence number of the QoS NDP itself is
invalid.
We used to update the next_reclaim based on the sequence
number of the QoS NDP which reset it to 1 (0 + 1) and
because of this, we never knew when the queue got empty.
This had to sad consequence to stuck the A-MPDU state
machine in a transient state.
To fix this, don't update next_reclaim when we reclaim
a QoS NDP.
Alesya saw this bug when testing u-APSD. Because the
A-MPDU state machine was stuck in EMPTYING_DELBA, we
updated mac80211 that we still have frames for that
station when it got back to sleep. mac80211 then wrongly
set the TIM bit in the beacon and requested to release
non-existent frames from the A-MPDU queue. This led to
a situation where the client was trying to poll frames
but we had no frames to send.
Reported-by: Alesya Shapira <alesya.shapira@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
From 9000 family on, we need to get HW address from host
CSR registers.
OEM can override it by fusing the override registers - read
those first, and if those are 0 - read the OTP registers instead.
In addition - bail out if no valid mac address is present. Make
it shared for all NICs.
Signed-off-by: Sara Sharon <sara.sharon@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
We kick the allocator when we have 2 RBDs that don't have
attached RBs, and the allocator allocates 8 RBs meaning
that it needs another 6 RBDs to attach the RBs to.
The design is that allocator should always have enough RBDs
to fulfill requests, so we give in advance 6 RBDs to the
allocator so that when it is kicked, it gets additional 2 RBDs
and has enough RBDs.
These RBDs were taken from the Rx queue itself, meaning
that each Rx queue didn't have the maximal number of
RBDs, but MAX - 6.
Change initial number of RBDs in the system to include both
queue size and allocator reserves.
Note the multi-queue is always 511 instead of 512 to avoid a
full queue since we cannot detect this state easily enough in
the 9000 arch.
Signed-off-by: Sara Sharon <sara.sharon@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
When running async rx handler the framework holds the mvm->mutex
before starting the async handler, that might cause a deadlock in case
the handler calls to ops that lock the mutex as well.
Add support for running async rx handler without hold the mutex before
activating the handler.
Signed-off-by: Chaya Rachel Ivgi <chaya.rachel.ivgi@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Currently when the FW sends start/stop aux roc time event
notification with an error status, the driver returns an
error value, but does not remove the time event, and does
not notify the stack above that the time event is over.
This causes problems that the stack above assumes we are still
in the middle of a time event, and therefore can block different
events, such as scanning.
On FW failure notification, cleanup the time event parameters and
notify the stack above that the time event is over.
Signed-off-by: Matti Gottlieb <matti.gottlieb@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Our hardware de-aggregates AMSDUs but copies the mac header
as it to the de-aggregated MPDUs. We need to turn off the AMSDU
bit in the QoS control ourselves.
Signed-off-by: Sara Sharon <sara.sharon@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Add debugfs entries to get the ctdp budget average
and to stop ctdp.
Signed-off-by: Chaya Rachel Ivgi <chaya.rachel.ivgi@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
performance tests for hash map and per-cpu hash map
with and without pre-allocation
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
increase stress by also calling bpf_get_stackid() from
various *spin* functions
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
this test calls bpf programs from different contexts:
from inside of slub, from rcu, from pretty much everywhere,
since it kprobes all spin_lock functions.
It stresses the bpf hash and percpu map pre-allocation,
deallocation logic and call_rcu mechanisms.
User space part adding more stress by walking and deleting map elements.
Note that due to nature bpf_load.c the earlier kprobe+bpf programs are
already active while loader loads new programs, creates new kprobes and
attaches them.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Helpers like ip_tunnel_info_opts_{get,set}() are only available if
CONFIG_INET is set, thus add an empty definition into the header for
the !CONFIG_INET case, where already other empty inline helpers are
defined.
This avoids ifdef kludge inside filter.c, but also vxlan and geneve
themself where this facility can only be used with, depend on INET
being set. For the !INET case TUNNEL_OPTIONS_PRESENT would never be
set in flags.
Fixes: 14ca0751c9 ("bpf: support for access to tunnel options")
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexei Starovoitov says:
====================
bpf: map pre-alloc
v1->v2:
. fix few issues spotted by Daniel
. converted stackmap into pre-allocation as well
. added a workaround for lockdep false positive
. added pcpu_freelist_populate to be used by hashmap and stackmap
this path set switches bpf hash map to use pre-allocation by default
and introduces BPF_F_NO_PREALLOC flag to keep old behavior for cases
where full map pre-allocation is too memory expensive.
Some time back Daniel Wagner reported crashes when bpf hash map is
used to compute time intervals between preempt_disable->preempt_enable
and recently Tom Zanussi reported a dead lock in iovisor/bcc/funccount
tool if it's used to count the number of invocations of kernel
'*spin*' functions. Both problems are due to the recursive use of
slub and can only be solved by pre-allocating all map elements.
A lot of different solutions were considered. Many implemented,
but at the end pre-allocation seems to be the only feasible answer.
As far as pre-allocation goes it also was implemented 4 different ways:
- simple free-list with single lock
- percpu_ida with optimizations
- blk-mq-tag variant customized for bpf use case
- percpu_freelist
For bpf style of alloc/free patterns percpu_freelist is the best
and implemented in this patch set.
Detailed performance numbers in patch 3.
Patch 2 introduces percpu_freelist
Patch 1 fixes simple deadlocks due to missing recursion checks
Patch 5: converts stackmap to pre-allocation
Patches 6-9: prepare test infra
Patch 10: stress test for hash map infra. It attaches to spin_lock
functions and bpf_map_update/delete are called from different contexts
Patch 11: stress for bpf_get_stackid
Patch 12: map performance test
Reported-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Reported-by: Tom Zanussi <tom.zanussi@linux.intel.com>
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
extend test coveraged to include pre-allocated and run-time alloc maps
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
note old loader is compatible with new kernel.
map_flags are optional
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
move ksym search from offwaketime into library to be reused
in other tests
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
map creation is typically the first one to fail when rlimits are
too low, not enough memory, etc
Make this failure scenario more verbose
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
It was observed that calling bpf_get_stackid() from a kprobe inside
slub or from spin_unlock causes similar deadlock as with hashmap,
therefore convert stackmap to use pre-allocated memory.
The call_rcu is no longer feasible mechanism, since delayed freeing
causes bpf_get_stackid() to fail unpredictably when number of actual
stacks is significantly less than user requested max_entries.
Since elements are no longer freed into slub, we can push elements into
freelist immediately and let them be recycled.
However the very unlikley race between user space map_lookup() and
program-side recycling is possible:
cpu0 cpu1
---- ----
user does lookup(stackidX)
starts copying ips into buffer
delete(stackidX)
calls bpf_get_stackid()
which recyles the element and
overwrites with new stack trace
To avoid user space seeing a partial stack trace consisting of two
merged stack traces, do bucket = xchg(, NULL); copy; xchg(,bucket);
to preserve consistent stack trace delivery to user space.
Now we can move memset(,0) of left-over element value from critical
path of bpf_get_stackid() into slow-path of user space lookup.
Also disallow lookup() from bpf program, since it's useless and
program shouldn't be messing with collected stack trace.
Note that similar race between user space lookup and kernel side updates
is also present in hashmap, but it's not a new race. bpf programs were
always allowed to modify hash and array map elements while user space
is copying them.
Fixes: d5a3b1f691 ("bpf: introduce BPF_MAP_TYPE_STACK_TRACE")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
If kprobe is placed on spin_unlock then calling kmalloc/kfree from
bpf programs is not safe, since the following dead lock is possible:
kfree->spin_lock(kmem_cache_node->lock)...spin_unlock->kprobe->
bpf_prog->map_update->kmalloc->spin_lock(of the same kmem_cache_node->lock)
and deadlocks.
The following solutions were considered and some implemented, but
eventually discarded
- kmem_cache_create for every map
- add recursion check to slow-path of slub
- use reserved memory in bpf_map_update for in_irq or in preempt_disabled
- kmalloc via irq_work
At the end pre-allocation of all map elements turned out to be the simplest
solution and since the user is charged upfront for all the memory, such
pre-allocation doesn't affect the user space visible behavior.
Since it's impossible to tell whether kprobe is triggered in a safe
location from kmalloc point of view, use pre-allocation by default
and introduce new BPF_F_NO_PREALLOC flag.
While testing of per-cpu hash maps it was discovered
that alloc_percpu(GFP_ATOMIC) has odd corner cases and often
fails to allocate memory even when 90% of it is free.
The pre-allocation of per-cpu hash elements solves this problem as well.
Turned out that bpf_map_update() quickly followed by
bpf_map_lookup()+bpf_map_delete() is very common pattern used
in many of iovisor/bcc/tools, so there is additional benefit of
pre-allocation, since such use cases are must faster.
Since all hash map elements are now pre-allocated we can remove
atomic increment of htab->count and save few more cycles.
Also add bpf_map_precharge_memlock() to check rlimit_memlock early to avoid
large malloc/free done by users who don't have sufficient limits.
Pre-allocation is done with vmalloc and alloc/free is done
via percpu_freelist. Here are performance numbers for different
pre-allocation algorithms that were implemented, but discarded
in favor of percpu_freelist:
1 cpu:
pcpu_ida 2.1M
pcpu_ida nolock 2.3M
bt 2.4M
kmalloc 1.8M
hlist+spinlock 2.3M
pcpu_freelist 2.6M
4 cpu:
pcpu_ida 1.5M
pcpu_ida nolock 1.8M
bt w/smp_align 1.7M
bt no/smp_align 1.1M
kmalloc 0.7M
hlist+spinlock 0.2M
pcpu_freelist 2.0M
8 cpu:
pcpu_ida 0.7M
bt w/smp_align 0.8M
kmalloc 0.4M
pcpu_freelist 1.5M
32 cpu:
kmalloc 0.13M
pcpu_freelist 0.49M
pcpu_ida nolock is a modified percpu_ida algorithm without
percpu_ida_cpu locks and without cross-cpu tag stealing.
It's faster than existing percpu_ida, but not as fast as pcpu_freelist.
bt is a variant of block/blk-mq-tag.c simlified and customized
for bpf use case. bt w/smp_align is using cache line for every 'long'
(similar to blk-mq-tag). bt no/smp_align allocates 'long'
bitmasks continuously to save memory. It's comparable to percpu_ida
and in some cases faster, but slower than percpu_freelist
hlist+spinlock is the simplest free list with single spinlock.
As expeceted it has very bad scaling in SMP.
kmalloc is existing implementation which is still available via
BPF_F_NO_PREALLOC flag. It's significantly slower in single cpu and
in 8 cpu setup it's 3 times slower than pre-allocation with pcpu_freelist,
but saves memory, so in cases where map->max_entries can be large
and number of map update/delete per second is low, it may make
sense to use it.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce simple percpu_freelist to keep single list of elements
spread across per-cpu singly linked lists.
/* push element into the list */
void pcpu_freelist_push(struct pcpu_freelist *, struct pcpu_freelist_node *);
/* pop element from the list */
struct pcpu_freelist_node *pcpu_freelist_pop(struct pcpu_freelist *);
The object is pushed to the current cpu list.
Pop first trying to get the object from the current cpu list,
if it's empty goes to the neigbour cpu list.
For bpf program usage pattern the collision rate is very low,
since programs push and pop the objects typically on the same cpu.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
if kprobe is placed within update or delete hash map helpers
that hold bucket spin lock and triggered bpf program is trying to
grab the spinlock for the same bucket on the same cpu, it will
deadlock.
Fix it by extending existing recursion prevention mechanism.
Note, map_lookup and other tracing helpers don't have this problem,
since they don't hold any locks and don't modify global data.
bpf_trace_printk has its own recursive check and ok as well.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michal Kubecek says:
====================
ipv6: per netns FIB6 walkers and garbage collector
Commit 2ac3ac8f86 ("ipv6: prevent fib6_run_gc() contention") reduced
the risk of contention on FIB6 garbage collector lock on systems with
many CPUs. However, one of our customers can still observe heavy
contention on fib6_gc_lock which can even trigger the soft lockup
detector.
This is caused by garbage collector running in forced mode from a timer.
While there is one timer per network namespace, the instances of
fib6_run_gc() running from them are protected by one global spinlock so
that only one garbage collector can run at any moment and other
namespaces have to wait. As most relevant data structures are separated
per netns, there is little reason for garbage collectors blocking each
other.
Similar problem exists for walkers: changes in one tree do not need to
adjust (and block) walkers traversing FIB trees in other namespaces.
This series separates both the walkers infrastructure and garbage
collector so that they work independently in network namespaces.
v2: get rid of ifdef in ipv6_route_seq_setup_walk(), pass net from
callers instead
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
One of our customers observed issues with FIB6 garbage collectors
running in different network namespaces blocking each other, resulting
in soft lockups (fib6_run_gc() initiated from timer runs always in
forced mode).
Now that FIB6 walkers are separated per namespace, there is no more need
for instances of fib6_run_gc() in different namespaces blocking each
other. There is still a call to icmp6_dst_gc() which operates on shared
data but this function is protected by its own shared lock.
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The IPv6 FIB data structures are separated per network namespace but
there is still only one global walkers list and one global walker list
lock. This means changes in one namespace unnecessarily interfere with
walkers in other namespaces.
Replace the global list with per-netns lists (and give each its own
lock).
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Global variable gc_args is only used in fib6_run_gc() and functions
called from it. As fib6_run_gc() makes sure there is at most one
instance of fib6_clean_all() running at any moment, we can replace
gc_args with a local variable which will be needed once multiple
instances (per netns) of garbage collector are allowed.
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Chan says:
====================
bnxt_en: Updates for net-next.
Updates to support autoneg for all supported speeds, add PF port statistics,
and Advanced Error Reporting.
v2: Fixed patch 3 to not use parentheses on function return.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Add pci_error_handler callbacks to support for pcie advanced error
recovery.
Signed-off-by: Satish Baddipadige <sbaddipa@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Include the more useful port statistics in ethtool -S for the PF device.
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>