My test do: fallocate a big file and do write. The file is 512M, but
after file write is done btrfs-debug-tree shows:
item 6 key (257 EXTENT_DATA 0) itemoff 3516 itemsize 53
extent data disk byte 1103101952 nr 536870912
extent data offset 0 nr 399634432 ram 536870912
extent compression 0
Looks like a regression introducted by
6c7d54ac87, where we set wrong slot.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Acked-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
If we have a pinned inode it must have a log item attached to it.
Usually that log item will have ili_last_lsn already set, in which
case we only need to flush the log up to that LSN instead of doing a
full log force. This gives speedups of about 5% in some fsync heavy
workloads.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
file_remove_suid already calls into ->setattr to clear the suid and
sgid bits if needed, no need to start a second transaction to do it
ourselves.
Note that xfs_write_clear_setuid issues a sync transaction while the
path through ->setattr doesn't, but that is consistant with the
other filesystems.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
This patch solves a corner case during allocation which occurs if both
metadata (indirect) and data blocks are required but there is an
obstacle in the filesystem (e.g. a resource group header or another
allocated block) such that when the allocation is requested only
enough blocks for the metadata are returned.
By changing the exit condition of this loop, we ensure that a
minimum of one data block will always be returned.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
We need this one-liner to signal the mount helper of the 'insufficient journals' condition.
Signed-off-by: Abhijith Das <adas@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
* 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
NFS: Fix the mapping of the NFSERR_SERVERFAULT error
NFS: Remove a redundant check for PageFsCache in nfs_migrate_page()
NFS: Fix a bug in nfs_fscache_release_page()
* git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-rc-fixes-2.6:
[SCSI] qla2xxx: Obtain proper host structure during response-queue processing.
[SCSI] compat_ioct: fix bsg SG_IO
[SCSI] qla2xxx: make msix interrupt handler safe for irq
[SCSI] zfcp: Report FC BSG errors in correct field
[SCSI] mptfusion : mptscsih_abort return value should be SUCCESS instead of value 0.
When reserving stack space for a new process, make sure we're not
attempting to expand the stack by more than rlimit allows.
This fixes a bug caused by b6a2fea393 ("mm:
variable length argument support") and unmasked by
fc63cf2370 ("exec: setup_arg_pages() fails
to return errors").
This bug means that when limiting the stack to less the 20*PAGE_SIZE (eg.
80K on 4K pages or 'ulimit -s 79') all processes will be killed before
they start. This is particularly bad with 64K pages, where a ulimit below
1280K will kill every process.
To test, do:
'ulimit -s 15; ls'
before and after the patch is applied. Before it's applied, 'ls' should
be killed. After the patch is applied, 'ls' should no longer be killed.
A stack limit of 15KB since it's small enough to trigger 20*PAGE_SIZE.
Also 15KB not a multiple of PAGE_SIZE, which is a trickier case to handle
correctly with this code.
4K pages should be fine to test with.
[kosaki.motohiro@jp.fujitsu.com: cleanup]
[akpm@linux-foundation.org: cleanup cleanup]
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Americo Wang <xiyou.wangcong@gmail.com>
Cc: Anton Blanchard <anton@samba.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: James Morris <jmorris@namei.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Serge Hallyn <serue@us.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is used by tcgetsid(3).
Signed-off-by: Andreas Schwab <schwab@linux-m68k.org>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We were invalidating mapping pages when dropping FILE_CACHE in
__send_cap(). But ceph_check_caps attempts to invalidate already, and
also checks for success, so we should never get to this point.
Signed-off-by: Sage Weil <sage@newdream.net>
If a sync read gets a short result from the OSD, it may need to do a
getattr to see if it is short due to reaching end-of-file. The getattr
was being done while holding a reference to FILE_RD, which can lead to
a deadlock if the MDS is revoking that capability bit and can't process
the getattr until it does.
We fix this by setting a flag if EOF size validation is needed, and doing
the getattr in ceph_aio_read, after the RD cap ref is dropped. If the
read needs to be continued, we loop and continue traversing the file.
Signed-off-by: Sage Weil <sage@newdream.net>
Try to invalidate pages in ceph_check_caps() if FILE_CACHE is being
revoked. If we fail, queue an immediate async invalidate if FILE_CACHE
is being revoked. (If it's not being revoked, we just queue the caps
for later evaluation later, as per the old behavior.)
Signed-off-by: Sage Weil <sage@newdream.net>
In the cases where we either do a sync read or a write, we
need to make sure that everything in the page cache is flushed.
In the case of a sync write we invalidate the relevant pages,
so that subsequent read/write reflects the new data written.
Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
A truncation should occur when either we have the
specified caps for the file, or (in cases where we are
not the only ones referencing the file) when it is mapped
or when it is opened. The latter two cases were not
handled.
Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
Originally ceph_page_mkwrite called ceph_write_begin, hoping that
the returned locked page would be the page that it was requested
to mkwrite. Factored out relevant part of ceph_page_mkwrite and
we lock the right page anyway.
Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
Zeroing of holes was not done correctly: page_off was miscalculated and
zeroing the tail didn't not adjust the 'read' value to include the zeroed
portion.
Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
Instead of removing osd connection immediately when the
requests list is empty, put the osd connection on an lru.
Only if that osd has not been used for more than a specified
time, will it be removed.
Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
The auth_x protocol implements support for a kerberos-like mutual
authentication infrastructure used by Ceph. We do not simply use vanilla
kerberos because of scalability and performance issues when dealing with
a large cluster of nodes providing a single logical service.
Auth_x provides mutual authentication of client and server and protects
against replay and man in the middle attacks. It does not encrypt
the full session over the wire, however, so data payload may still be
snooped.
Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
Add infrastructure to allow the mon_client to periodically renew its auth
credentials. Also add a messenger callback that will force such a renewal
if a peer rejects our authenticator.
Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
Helper for decoding into a ceph_buffer, and other misc decoding helpers
we will need.
Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
Some places in kernel need to iterate over a hlist in seq_file,
so provide some common helpers.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
md ioctls are now handled by the md driver itself, but mdadm
may call RAID_VERSION on other devices as well. Mark the command
as IGNORE_IOCTL so this fails silently rather than printing
an annoying message.
Reported-by: "Michael S. Tsirkin" <m.s.tsirkin@gmail.com>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
cifs: fix dentry hash calculation for case-insensitive mounts
[CIFS] Don't cache timestamps on utimes due to coarse granularity
[CIFS] Maximum username length check in session setup does not match
cifs: fix length calculation for converted unicode readdir names
[CIFS] Add support for TCP_NODELAY
I found that the length of a file name when created cannot exceed 255
characters, yet, pathconf(), via statfs(), returns the maximum as 260.
Signed-off-by: Kevin Dankwardt <k@kcomputing.com>
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
For NFSv2 and v3:
O_DIRECT writes are always synchronous, and aren't cached, so nothing
should be flushed when closing an NFS O_DIRECT file descriptor. Thus
there are no write errors to report on close(2).
In addition, there's no cached data to verify on the next open(2),
so we don't need clean GETATTR results at close time to compare with.
Thus, there's no need for the nfs_revalidate_inode() call when closing
an NFS O_DIRECT file. This reduces the number of synchronous
on-the-wire requests for a simple open-write-close of an NFS O_DIRECT
file by roughly 20%.
For NFSv4:
Call nfs4_do_close() with wait set to zero when closing an NFS
O_DIRECT file. The CLOSE will go on the wire, but the application
won't wait for it to complete.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The bytes counted by the performance counters for NFS writes should
reflect write and sync errors. If the write(2) system call reports
an error, the bytes should not be counted. And, if the write is
short, the actual number of bytes that was written should be counted,
not the number of bytes that was requested.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Bytes read via the splice API should be accounted for in the NFS
performance statistics.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Currently, the NFS I/O counters count the number of bytes requested
by applications, rather than the number of bytes actually read by the
system calls.
The number of bytes requested for reads is actually not that useful,
because the value is usually a buffer size for reads. That is, that
requested number is usually a maximum, and frequently doesn't reflect
the actual number of bytes read.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Nit: The VFSOPEN and VFSFLUSH counters are function call counters.
Count every call to these routines.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
When session is reset, client can renegotiate slot table size.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Drain the fore channel and reset the max_slots to the new value.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
For now the back channel ca_maxresponsesize_cached is 0 and there is no
backchannel DRC. Return NFS4ERR_REP_TOO_BIG_TO_CACHE when a cb_sequence
cachethis is true. When it is false, return NFS4ERR_RETRY_UNCACHED_REP as the
next operation error.
Remember the replay error accross compound operation processing.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Make all cb_sequence arguments available to verify_seqid which will make
replay decisions.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
All callback operations have arguments to decode and require processing.
The preprocess_nfs4X_op functions catch unsupported or illegal ops so
decode_args and process_op pointers are always non NULL.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Skip all other processing when error is encountered.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Set NFS4ERR_RESOURCE as CB_COMPOUND status and do not return an op on
decode_op_hdr or encode_op_hdr buffer overflow.
NFS4ERR_RESOURCE is correct for v4.0. Will fix the return for v4.1 along with
all the other NFS4ERR_RESOURCE errors in a later patch.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If a CB_SEQUENCE referring call triple matches a slot table entry, the
client is still waiting for a response to the original request. In this
case, return NFS4ERR_DELAY as the response to the callback.
Signed-off-by: Mike Sager <sager@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Traverse a list of referring calls and look for a session/slot/seq number
match.
Signed-off-by: Mike Sager <sager@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
For the CREATE_SESSION attribute ca_maxresponsesize_cached, calculate
the value based on the rpc reply header size plus the maximum nfs compound
reply size.
Signed-off-by: Mike Sager <sager@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Add a wrapper around rpc_call_sync that handles -EKEYEXPIRED errors from
the RPC layer as it would an -EJUKEBOX error if NFSv2 had such a thing.
Also, add a handler for that error for async calls that makes it
resubmit the RPC on -EKEYEXPIRED.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>