Currently lockd directly accesses the file_lock_list from fs/locks.c.
It does so to mark locks granted or reclaimable. This is very
suboptimal, because a) lockd needs to poke into locks.c internals, and
b) it needs to iterate over all locks in the system for marking locks
granted or reclaimable.
This patch adds lists for granted and reclaimable locks to the nlm_host
structure instead, and adds locks to those.
nlmclnt_lock:
now adds the lock to h_granted instead of setting the
NFS_LCK_GRANTED flag; still O(1)
nlmclnt_mark_reclaim:
goes away completely, replaced by a list_splice_init.
Complexity reduced from O(locks in the system) to O(1)
reclaimer:
iterates over h_reclaim now, complexity reduced from
O(locks in the system) to O(locks per nlm_host)
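A minimal userspace sketch of why the splice is O(1). The list helpers
imitate the kernel's list_head API; struct host and host_mark_reclaim()
are hypothetical stand-ins, not the code in this patch.

```c
#include <assert.h>
#include <stddef.h>

/* Circular doubly linked list, modelled on the kernel's list_head. */
struct list_head { struct list_head *next, *prev; };

static void list_init(struct list_head *h) { h->next = h->prev = h; }

static int list_empty(const struct list_head *h) { return h->next == h; }

static void list_add_tail(struct list_head *n, struct list_head *h)
{
	assert(n != NULL && h != NULL);
	n->prev = h->prev;
	n->next = h;
	h->prev->next = n;
	h->prev = n;
}

/* Move all entries of 'from' to the front of 'to', then reset 'from'.
 * Only a fixed number of pointers change, whatever the list length. */
static void list_splice_init(struct list_head *from, struct list_head *to)
{
	struct list_head *first, *last, *at;

	if (list_empty(from))
		return;
	first = from->next;
	last = from->prev;
	at = to->next;
	first->prev = to;
	to->next = first;
	last->next = at;
	at->prev = last;
	list_init(from);
}

/* Hypothetical stand-in for nlm_host: just the two lists named above. */
struct host { struct list_head h_granted, h_reclaim; };

static void host_init(struct host *h)
{
	list_init(&h->h_granted);
	list_init(&h->h_reclaim);
}

/* Stand-in for what replaces nlmclnt_mark_reclaim. */
static void host_mark_reclaim(struct host *h)
{
	list_splice_init(&h->h_granted, &h->h_reclaim);
}

/* Walks one host's list only, i.e. O(locks per host). */
static size_t list_count(const struct list_head *h)
{
	size_t n = 0;
	const struct list_head *p;

	for (p = h->next; p != h; p = p->next)
		n++;
	return n;
}
```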
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Currently NFS O_DIRECT writes use FILE_SYNC so that a COMMIT is not
necessary. This simplifies the internal logic, but this could be a
difficult workload for some servers.
Instead, let's send UNSTABLE writes, and after they all complete, send a
COMMIT for the dirty range. After the COMMIT returns successfully, then do
the wake_up or fire off aio_complete().
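The ordering described above can be sketched as a small state machine.
This is an illustration only; struct dreq and its helpers are
hypothetical names, and the real code tracks far more state.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-request state for one O_DIRECT write call. */
struct dreq {
	int outstanding;   /* UNSTABLE WRITEs still in flight */
	bool commit_sent;  /* COMMIT issued for the dirty range */
	bool completed;    /* waiter woken or aio completion fired */
};

static void dreq_start(struct dreq *d, int nwrites)
{
	assert(nwrites > 0);
	d->outstanding = nwrites;
	d->commit_sent = false;
	d->completed = false;
}

/* Called once per finished WRITE; the last one triggers the COMMIT. */
static void dreq_write_done(struct dreq *d)
{
	assert(d->outstanding > 0);
	if (--d->outstanding == 0)
		d->commit_sent = true;  /* stand-in for sending COMMIT */
}

/* Called when the COMMIT reply arrives; only success completes. */
static void dreq_commit_done(struct dreq *d, bool ok)
{
	assert(d->commit_sent);
	if (ok)
		d->completed = true;    /* stand-in for wake_up/aio_complete */
}
```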
Test plan:
Async direct I/O tests against Solaris (or any server that requires
committed unstable writes). Reboot server during test.
Based on an earlier patch by Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Duplicate infrastructure from direct read path that will allow write
path to generate multiple write requests concurrently. This will
enable us to add support for aio in this path.
Temporarily we will lose the ability to do UNSTABLE writes followed by
a COMMIT in the direct write path. However, all applications I am
aware of that use NFS O_DIRECT currently write in relatively small
chunks, so this should not be inconvenient in any way.
Test plan:
Millions of fsx-odirect ops. OraSim.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Same callback hierarchy inversion as for the NFS write calls. This patch is
not strictly speaking needed by the O_DIRECT code, but avoids confusing
differences between the asynchronous read and write code.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This patch inverts the callback hierarchy for NFS write calls.
Instead of having the NFSv2/v3/v4-specific code set up the RPC callback
ops, we allow the original caller to do so. This allows for more
flexibility w.r.t. how to set up and tear down the nfs_write_data
structure while still allowing the NFSv3/v4 code to perform error
handling.
The greater flexibility is needed by the asynchronous O_DIRECT code, which
wants to be able to hold on to the original nfs_write_data structures after
the WRITE RPC call has completed in order to be able to replay them if the
COMMIT call determines that the server has rebooted.
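A rough sketch of the inversion, assuming simplified types: the caller
picks the callbacks, and the protocol-specific error check becomes a
helper the caller's done() invokes. All names below are hypothetical
stand-ins for nfs_write_data and the RPC callback ops.

```c
#include <assert.h>
#include <stddef.h>

struct write_data;

/* Callbacks chosen by the original caller, not by protocol code. */
struct call_ops {
	void (*done)(struct write_data *);
	void (*release)(struct write_data *);
};

struct write_data {
	int status;    /* result of the WRITE call */
	int failed;    /* set by the protocol error check */
	int released;  /* set once the structure has been torn down */
	const struct call_ops *ops;
};

/* Protocol-specific error handling, still owned by the v3/v4 code. */
static void proto_write_done(struct write_data *d)
{
	d->failed = d->status < 0;
}

/* Buffered-write caller: tears the structure down right away. */
static void buffered_done(struct write_data *d) { proto_write_done(d); }
static void buffered_release(struct write_data *d) { d->released = 1; }

/* O_DIRECT caller: keeps the structure for a possible replay. */
static void direct_done(struct write_data *d) { proto_write_done(d); }
static void direct_release(struct write_data *d) { (void)d; }

static const struct call_ops buffered_ops = { buffered_done, buffered_release };
static const struct call_ops direct_ops = { direct_done, direct_release };

/* Stand-in for the RPC layer finishing a call. */
static void rpc_finish(struct write_data *d)
{
	assert(d->ops != NULL);
	d->ops->done(d);
	d->ops->release(d);
}
```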
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Currently lockd identifies its own locks using the FL_LOCKD flag. This
doesn't scale well to multiple lock managers--if we did this in nfsv4 too,
for example, we'd be left with only one free flag bit.
Instead, we just check whether the file manager ops (fl_lmops) set on this
lock are our own.
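In outline, the check becomes a pointer comparison. The structures
below are cut-down stand-ins, and lockd_lmops and other_lmops are
hypothetical names for two managers' ops tables.

```c
#include <assert.h>
#include <stddef.h>

/* Cut-down stand-ins for the kernel structures. */
struct lock_manager_ops { int placeholder; };
struct file_lock { const struct lock_manager_ops *fl_lmops; };

/* One ops table per lock manager; its address identifies the owner. */
static const struct lock_manager_ops lockd_lmops = { 0 };
static const struct lock_manager_ops other_lmops = { 0 };

/* No flag bit needed: a lock is ours if it points at our ops table. */
static int is_lockd_lock(const struct file_lock *fl)
{
	assert(fl != NULL);
	return fl->fl_lmops == &lockd_lmops;
}
```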
The only use for this is in nlm_traverse_locks, which uses it to find locks
that need cleaning up when freeing a host or a file.
In the long run it might be nice to do reference counting instead of
traversing all the locks like this....
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
posix_test_lock() returns a pointer to a struct file_lock which is unprotected
and can be removed while in use by the caller. Move the conflicting lock from
the return to a parameter, and copy the conflicting lock.
In most cases the caller ends up putting the copy of the conflicting lock on
the stack. On i386, sizeof(struct file_lock) appears to be about 100 bytes.
We're assuming that's reasonable.
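A simplified sketch of the new calling convention, with a hypothetical
struct lock and test_lock() in place of the real types. In the kernel
the copy would be made while the lock list is still protected; that
locking is omitted here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, much smaller stand-in for struct file_lock. */
struct lock { long start, end; int owner; };

static int overlaps(const struct lock *a, const struct lock *b)
{
	return a->start <= b->end && b->start <= a->end;
}

/* Returns 1 and copies the first conflicting lock into *conflict,
 * or returns 0 and leaves *conflict untouched. The caller owns the
 * copy, so later changes to the held locks cannot affect it. */
static int test_lock(const struct lock *held, size_t n,
		     const struct lock *req, struct lock *conflict)
{
	size_t i;

	assert(req != NULL && conflict != NULL);
	for (i = 0; i < n; i++) {
		if (held[i].owner != req->owner && overlaps(&held[i], req)) {
			*conflict = held[i];
			return 1;
		}
	}
	return 0;
}
```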
Signed-off-by: Andy Adamson <andros@citi.umich.edu>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
posix_lock_file() is used to add a blocked lock to Lockd's block, so
posix_block_lock() is no longer needed.
Signed-off-by: Andy Adamson <andros@citi.umich.edu>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Clean-up: replace rpc_call() helper with direct call to rpc_call_sync.
This makes NFSv2 and NFSv3 synchronous calls more computationally
efficient, and reduces stack consumption in functions that used to
invoke rpc_call more than once.
Test plan:
Compile kernel with CONFIG_NFS enabled. Connectathon on NFS version 2,
version 3, and version 4 mount points.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Add fields to the rpc_procinfo struct that allow the display of a
human-readable name for each procedure in the rpc_iostats output.
Also fix it so that the NFSv4 stats are broken up correctly by
sub-procedure number. NFSv4 uses only two real RPC procedures:
NULL, and COMPOUND.
Test plan:
Mount with NFSv2, NFSv3, and NFSv4, and do "cat /proc/self/mountstats".
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Add a simple mechanism for collecting stats in the RPC client. Stats are
tabulated during xprt_release. Note that per_cpu shenanigans are not
required here because the RPC client already serializes on the transport
write lock.
Test plan:
Compile kernel with CONFIG_NFS enabled. Basic performance regression
testing with high-speed networking and high performance server.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Account for various things that occur while an RPC task is executed.
Separate timers for RPC round trip and RPC execution time show how
long RPC requests wait in queue before being sent. Eventually these
will be accumulated at xprt_release time in one place where they can
be viewed from userland.
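A sketch of how two such timers might be accumulated, assuming the
execution time spans the whole life of a request and the round trip
spans send to reply. struct op_metrics and its fields are illustrative
names, not the ones added by this patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-procedure accumulator. */
struct op_metrics {
	uint64_t ops;      /* completed requests */
	uint64_t rtt;      /* summed round-trip ticks */
	uint64_t execute;  /* summed total execution ticks */
};

/* Fold one finished request into the totals. */
static void metrics_add(struct op_metrics *m, uint64_t start,
			uint64_t sent, uint64_t replied, uint64_t released)
{
	assert(start <= sent && sent <= replied && replied <= released);
	m->ops++;
	m->rtt += replied - sent;
	m->execute += released - start;
}

/* Average ticks per request not accounted for by the round trip,
 * a rough upper bound on time spent queued before sending. */
static uint64_t metrics_avg_overhead(const struct op_metrics *m)
{
	return m->ops ? (m->execute - m->rtt) / m->ops : 0;
}
```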
Test plan:
Compile kernel with CONFIG_NFS enabled.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Monitor generic transport events. Add a transport switch callout to
format transport counters for export to user-land.
Test plan:
Compile kernel with CONFIG_NFS enabled.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
RPC wait queue length will eventually be exported to userland via the RPC
iostats interface.
Test plan:
Compile kernel with CONFIG_NFS enabled.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Add a field in nfs_server to record a timestamp when a mount succeeds.
Report the number of seconds the file system has been mounted via
nfs_show_stats().
Test plan:
Mount an NFS file system, watch the mountstats reports and compare with
clock time.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Add a per-superblock performance counter facility to the NFS client. This
facility mimics the counters available for block devices and for
networking. Expose these new counters via the new /proc/self/mountstats
interface.
Thanks to Andrew Morton and Trond Myklebust for their review and comments.
Test plan:
fsx and iozone on UP and SMP systems, with and without pre-emption. Watch
for memory overwrite bugs, and performance loss (significantly more CPU
required per op).
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Sometimes it's important to know the exact RPC retransmit settings the
kernel is using for an NFS mount point. Add this facility to the NFS
client's show_options method.
Test plan:
Set various retransmit settings via the mount command, and check that the
settings are reflected in /proc/mounts.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Create a new file under /proc/self, called mountstats, where mounted file
systems can export information (configuration options, performance counters,
and so on). Use a mechanism similar to /proc/mounts and s_ops->show_options.
This mechanism does not violate namespace security, and is safe to use while
other processes are unmounting file systems.
Thanks to Mike Waychison for his review and comments.
Test plan:
Test concurrent mount/unmount operations while cat'ing /proc/self/mountstats.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
read_cache_mtime is no longer used in nfs_inode. This patch removes

references to read_cache_mtime from the code comments.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@gmail.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The nfs_open_context may live longer than the file descriptor that spawned
it, so it needs to carry a reference to the vfsmount. If not, then
generic_shutdown_super() may end up being called before reads and writes
have been flushed out.
Make a couple of functions static while we're at it...
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* 'upstream-linus' of master.kernel.org:/pub/scm/linux/kernel/git/jgarzik/netdev-2.6: (150 commits)
[PATCH] ipw2100: Update version ipw2100 stamp to 1.2.2
[PATCH] ipw2100: move mutex.h include from ipw2100.c to ipw2100.h
[PATCH] ipw2100: semaphore to mutexes conversion
[PATCH] ipw2100: Fix radiotap code gcc warning
[PATCH] ipw2100: add radiotap headers to packtes captured in monitor mode
[PATCH] ipw2x00: expend Copyright to 2006
[PATCH] drivers/net/wireless/ipw2200.c: fix an array overun
[PATCH] ieee80211: Don't update network statistics from off-channel packets.
[PATCH] ipw2200: Update ipw2200 version stamp to 1.1.1
[PATCH] ipw2200: switch to the new ipw2200-fw-3.0 image format
[PATCH] ipw2200: wireless extension sensitivity threshold support
[PATCH] ipw2200: Enables the "slow diversity" algorithm
[PATCH] ipw2200: Set a meaningful silence threshold value
[PATCH] ipw2200: export `debug' module param only if CONFIG_IPW2200_DEBUG
[PATCH] ipw2200: Change debug level for firmware error logging
[PATCH] ipw2200: Filter unsupported channels out in ad-hoc mode
[PATCH] ipw2200: Fix ipw_sw_reset() implementation inconsistent with comment
[PATCH] ipw2200: Fix rf_kill is activated after mode change with 'disable=1'
[PATCH] ipw2200: remove the WPA card associates to non-WPA AP checking
[PATCH] ipw2200: Add signal level to iwlist scan output
...
* 'block-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/block:
[PATCH] fix rmmod problems with elevator attributes, clean them up
[PATCH] elevator_t lifetime rules and sysfs fixes
[PATCH] noise removal: cfq-iosched.c
[PATCH] don't bother with refcounting for cfq_data
[PATCH] fix sysfs interaction and lifetime rules handling for queues
[PATCH] regularize blk_cleanup_queue() use
[PATCH] fix cfq_get_queue()/ioprio_set(2) races
[PATCH] deal with rmmod/put_io_context() races
[PATCH] stop elv_unregister() from rogering other iosched's data, fix locking
[PATCH] stop cfq from pinning queue down
[PATCH] make cfq_exit_queue() prune the cfq_io_context for that queue
[PATCH] fix the exclusion for ioprio_set()
[PATCH] keep sync and async cfq_queue separate
[PATCH] switch to use of ->key to get cfq_data by cfq_io_context
[PATCH] stop leaking cfq_data in cfq_set_request()
[PATCH] fix cfq hash lookups
[PATCH] fix locking in queue_requests_store()
[PATCH] fix double-free in blk_init_queue_node()
[PATCH] don't do exit_io_context() until we know we won't be doing any IO
Add support for sending and receiving large RMPP transfers. The old
code supports transfers only as large as a single contiguous kernel
memory allocation. This patch uses a linked list of memory buffers when
sending and receiving data to avoid needing contiguous pages for
larger transfers.
Receive side: copy the arriving MADs in chunks instead of coalescing
to one large buffer in kernel space.
Send side: split a multipacket MAD buffer into a list of segments
(multipacket_list) and send these using a gather list of size 2.
Also, save pointer to last sent segment, and retrieve requested
segments by walking list starting at last sent segment. Finally,
save pointer to last-acked segment. When retrying, retrieve
segments for resending relative to this pointer. When updating last
ack, start at this pointer.
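A sketch of the cursor-relative lookup, assuming consecutively numbered
segments on a doubly linked list. struct segment and seg_find() are
hypothetical; the real list layout may differ.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical segment node; 'num' is the segment number. */
struct segment {
	int num;
	struct segment *prev, *next;
};

/* Find segment 'num' starting from a cursor such as the last sent or
 * last acked segment, instead of walking from the head every time.
 * Returns NULL if no such segment exists. */
static struct segment *seg_find(struct segment *cur, int num)
{
	assert(cur != NULL);
	while (cur && cur->num < num)
		cur = cur->next;
	while (cur && cur->num > num)
		cur = cur->prev;
	return (cur && cur->num == num) ? cur : NULL;
}
```

Because retries and ack updates usually ask for a segment close to the
cursor, the walk is short in the common case.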
Signed-off-by: Jack Morgenstein <jackm@mellanox.co.il>
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Pass actual capacity of created SRQ back to userspace, so that
userspace can report accurate capacities. This requires an ABI bump,
to change struct ib_uverbs_create_srq_resp.
Signed-off-by: Dotan Barak <dotanb@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Have mthca's create_srq method return the actual capacity of the SRQ
that gets created. Also update comments in <rdma/ib_verbs.h> to
clarify that this is what is expected from ib_create_srq().
Signed-off-by: Dotan Barak <dotanb@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The size of struct ib_uverbs_create_qp_resp is not an even multiple of 8
bytes. This causes problems for low-level drivers that add private
data after the structure: 32-bit userspace will look in the wrong
place for a response from a 64-bit kernel. Fix this by adding a
reserved field. Also, bump the ABI version because this changes the
size of a structure.
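The mismatch can be illustrated with a hypothetical 28-byte response
header. On i386 a 64-bit member only needs 4-byte alignment, while a
64-bit kernel aligns it to 8, so private data following the header
lands at different offsets unless the header is padded. The structs
below are illustrative, not the real ABI layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical header of seven 32-bit fields: 28 bytes. */
struct resp_old { uint32_t fields[7]; };

/* Same header with a reserved field: 32 bytes, a multiple of 8. */
struct resp_new { uint32_t fields[7]; uint32_t reserved; };

/* Offset at which data needing 'align'-byte alignment would start
 * when placed after a header of 'size' bytes. */
static size_t private_offset(size_t size, size_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (size + align - 1) & ~(align - 1);
}
```

With the old layout the two sides compute offsets 28 and 32; with the
reserved field both compute 32.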
Pointed out by Hoang-Nam Nguyen <HNGUYEN@de.ibm.com>.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add support to uverbs to handle querying userspace SRQs (shared
receive queues), including adding an ABI for marshalling requests and
responses. The kernel midlayer already has the underlying
ib_query_srq() function.
Signed-off-by: Dotan Barak <dotanb@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add support to uverbs to handle querying userspace QPs (queue pairs),
including adding an ABI for marshalling requests and responses. The
kernel midlayer already has the underlying ib_query_qp() function.
Signed-off-by: Dotan Barak <dotanb@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The in-kernel mthca driver contains a table of which attributes are
valid for each queue pair state transition. It turns out that both
other IB drivers -- ipath and ehca -- which are being prepared for
merging have copied this table, errors and all.
To forestall this code duplication, move this table and the code to
check parameters against it into a midlayer library function,
ib_modify_qp_is_ok().
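A minimal sketch of a table-driven check of this kind. The states,
attribute bits and table contents are made up for illustration and do
not reproduce the real table or the signature of ib_modify_qp_is_ok().

```c
#include <assert.h>

/* Hypothetical subset of states and attribute bits. */
enum qp_state { QPS_RESET, QPS_INIT, QPS_RTR, QPS_COUNT };

enum {
	ATTR_STATE = 1 << 0,
	ATTR_PORT  = 1 << 1,
	ATTR_PKEY  = 1 << 2,
	ATTR_PATH  = 1 << 3
};

struct transition {
	int valid;          /* is this transition allowed at all? */
	unsigned int req;   /* attributes that must be supplied */
	unsigned int opt;   /* attributes that may be supplied */
};

/* Entries not listed are zero, i.e. invalid transitions. */
static const struct transition table[QPS_COUNT][QPS_COUNT] = {
	[QPS_RESET][QPS_INIT] = { 1, ATTR_STATE | ATTR_PORT | ATTR_PKEY, 0 },
	[QPS_INIT][QPS_RTR]   = { 1, ATTR_STATE | ATTR_PATH, ATTR_PKEY },
};

/* Returns 1 if 'mask' has every required attribute and nothing
 * outside required plus optional for the cur -> next transition. */
static int modify_is_ok(enum qp_state cur, enum qp_state next,
			unsigned int mask)
{
	const struct transition *t;

	assert(cur < QPS_COUNT && next < QPS_COUNT);
	t = &table[cur][next];
	if (!t->valid)
		return 0;
	if ((mask & t->req) != t->req)
		return 0;
	if (mask & ~(t->req | t->opt))
		return 0;
	return 1;
}
```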
Signed-off-by: Roland Dreier <rolandd@cisco.com>
This patch allows the consumer to set the page size of "pages" mapped
by the pool FMRs, which is a feature already existing in the base
verbs API. On the cosmetic side it changes ib_fmr_attr.page_size field
to be named page_shift.
Signed-off-by: Or Gerlitz <ogerlitz@voltaire.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Expose a writable "node_desc" sysfs attribute for InfiniBand devices.
This allows userspace to update the node description with information
such as the node's hostname, so that IB network management software
can tie its view to the real world.
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add support to uverbs to handle resizing userspace CQs (completion
queues), including adding an ABI for marshalling requests and
responses. The kernel midlayer already has ib_resize_cq().
Signed-off-by: Roland Dreier <rolandd@cisco.com>
1) huge_pte_offset() did not correctly check whether the page table
hierarchy elements were empty, resulting in an OOPS
2) Need platform specific hugetlb_get_unmapped_area() to handle
the top-down vs. bottom-up address space allocation strategies.
Signed-off-by: David S. Miller <davem@davemloft.net>
We only need to write an invalid tag every 16 bytes,
so taking advantage of this can save many instructions
compared to the simple memset() call we make now.
A prefetching version is implemented for sun4u
and a block-init store version is implemented for Niagara.
The next trick is to be able to perform an init and
a copy_tsb() in parallel when growing a TSB table.
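In plain C the idea looks roughly like this, assuming a 16-byte entry
made of a tag word and a PTE word. The TSB_TAG_INVALID value is an
illustrative assumption, and the real versions are written in assembly
with prefetch or block-init stores.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed entry layout: 8-byte tag followed by 8-byte PTE. */
struct tsb_entry {
	uint64_t tag;
	uint64_t pte;
};

/* Illustrative invalid-tag value; not taken from the real headers. */
#define TSB_TAG_INVALID (1ULL << 46)

/* Invalidate every entry by storing only the tag word, one store per
 * 16 bytes, instead of writing every byte as memset() would. */
static void tsb_init(struct tsb_entry *tsb, size_t nentries)
{
	size_t i;

	assert(tsb != NULL);
	for (i = 0; i < nentries; i++)
		tsb[i].tag = TSB_TAG_INVALID;
}
```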
Signed-off-by: David S. Miller <davem@davemloft.net>
Put it one page below the top of the 32-bit address space.
This gives us ~16MB more address space to work with.
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently allocations are very constrained for 32-bit processes.
It grows bottom-up from 0x70000000 to 0xf0000000 which gives about
2GB of stack + dynamic mmap() space.
So support the top-down method, and we need to override the
generic helper function in order to deal with D-cache coloring.
With these changes I was able to squeeze out a mmap() just over
3.6GB in size in a 32-bit process.
Signed-off-by: David S. Miller <davem@davemloft.net>