Commit graph

42,026 commits

Author SHA1 Message Date
Trond Myklebust
3438995bc4 NFS: NFSoRDMA Client Changes
These patches continue to build up for improving the rsize and wsize that the
 NFS client uses when talking over RDMA.  In addition, these patches also add
 in scalability enhancements and other bugfixes.
 
 Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJVey8qAAoJENfLVL+wpUDroT4P/3lwspXwdxS6VZWsW1VpNtdV
 V1KKd5D+TkpBpz/ih9GdOVaBZijaHpb6XtMReh8xuh0KI893iYmsmLoyhMTPJMsU
 6sjUDEv8IFrXwlRKldX1KEfBvNgR0czCNiha6O+YsV5Az08+zr57ahyGKmLUzMxo
 4XzPZbwnb5fxvgmBgENUU33g+xXGsXDbsdzLvKW3UGcPU2x6PGOTLr5vP7lQkwxE
 20d9ak8xQeRUk0hsmRM4fAebzcluD1o3PLIFQBEhh0Gqm1VGtSCkr9o493gT5TgM
 /+XrU7B8OnbdJ1B4f/y4Bz4RucfKzyRuXMpulrnK1hL7QIiZLqiph7UrTel/ajcD
 0us9PImNwXPo8tMz7Wjw2XMQplndHB3FG3M3lXlJGHlXvCI7F0yjm21AP4SeetOm
 kxL24Qiyi7l/S7JJxHqNlOc0b8kpVLohBZm6yee9w4r/JUPnynUqfnXCHLjIp/5W
 F1hzbCUATyfKrSs7VKO0hCQHfntigPEhRmyfoyXRAXzl5LnR1XqD6Wah3a3pwXn+
 mEquUd6fKRHIIvJ8cKU6KtykkhRHg1sR/z1mw2ZEW/2PCd0cb+8+WN7X/fQqEN+u
 +VQSo7oPp38SHdsyozuUUyukN5qHptTMSrNZL+LI7J8/0+BuRuIvW0nojViapc51
 LOUlcgqRdUlIvmn754Yo
 =N1tO
 -----END PGP SIGNATURE-----

Merge tag 'nfs-rdma-for-4.2' of git://git.linux-nfs.org/projects/anna/nfs-rdma

NFS: NFSoRDMA Client Changes

These patches continue to build up for improving the rsize and wsize that the
NFS client uses when talking over RDMA.  In addition, these patches also add
in scalability enhancements and other bugfixes.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

* tag 'nfs-rdma-for-4.2' of git://git.linux-nfs.org/projects/anna/nfs-rdma: (142 commits)
  xprtrdma: Reduce per-transport MR allocation
  xprtrdma: Stack relief in fmr_op_map()
  xprtrdma: Split rb_lock
  xprtrdma: Remove rpcrdma_ia::ri_memreg_strategy
  xprtrdma: Remove ->ro_reset
  xprtrdma: Remove unused LOCAL_INV recovery logic
  xprtrdma: Acquire MRs in rpcrdma_register_external()
  xprtrdma: Introduce an FRMR recovery workqueue
  xprtrdma: Acquire FMRs in rpcrdma_fmr_register_external()
  xprtrdma: Introduce helpers for allocating MWs
  xprtrdma: Use ib_device pointer safely
  xprtrdma: Remove rr_func
  xprtrdma: Replace rpcrdma_rep::rr_buffer with rr_rxprt
  xprtrdma: Warn when there are orphaned IB objects
  ...
2015-06-16 11:37:50 -04:00
Trond Myklebust
5ba12443a1 NFSv4: Fix stateid recovery on revoked delegations
Ensure that we fix the non-NULL stateid case as well.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-06-16 11:29:51 -04:00
Olga Kornievskaia
ae2ffef383 Recover from stateid-type error on SETATTR
Client can receives stateid-type error (eg., BAD_STATEID) on SETATTR when
delegation stateid was used. When no open state exists, in case of application
calling truncate() on the file, client has no state to recover and fails with
EIO.

Instead, upon such error, return the bad delegation and then resend the
SETATTR with a zero stateid.

Signed-off: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-06-16 11:29:46 -04:00
Kinglong Mee
df05a49f72 nfs: Fix showing truncated fsid/dev in, /proc/net/nfsfs/volumes
A truncated fsid showing from /proc/fs/nfsfs/volumes as,
NV SERVER   PORT DEV     FSID              FSC
v4 c0a80881  801 0:43    34931f044c2a439b  no

It should be as,
NV SERVER   PORT DEV          FSID                              FSC
v4 c0a80881  801 0:43         34931f044c2a439b:954c5d830fa4be8c no

The max buffer length for storing "%llx:%llx" format should be
 16 + 1 + 16 + 1 = 34 (16 for %llx, 1 for ':', 1 for '\0').

Also, for storing "%u:%u" of MAJOR() and MINOR() should be
 8 + 1 + 3 + 1 = 13 (8 for 2^24, 1 for ':', 3 for 2^8, 1 for '\0').

v2, add comments for dev/fsid buffer and use sizeof in snprintf.

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-06-16 11:17:37 -04:00
Jeff Layton
873e385116 nfs: make nfs4_init_uniform_client_string use a dynamically allocated buffer
Change the uniform client string generator to dynamically allocate the
NFSv4 client name string buffer. With this patch, we can eliminate the
buffers that are embedded within the "args" structs and simply use the
name string that is hanging off the client.

This uniform string case is a little simpler than the nonuniform since
we don't need to deal with RCU, but we do have two different cases,
depending on whether there is a uniquifier or not.

Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-06-16 11:15:51 -04:00
Jeff Layton
a319268891 nfs: make nfs4_init_nonuniform_client_string use a dynamically allocated buffer
The way the *_client_string functions work is a little goofy. They build
the string in an on-stack buffer and then use kstrdup to copy it. This
is not only stack-heavy but artificially limits the size of the client
name string. Change it so that we determine the length of the string,
allocate it and then scnprintf into it.

Since the contents of the nonuniform string depend on rcu-managed data
structures, it's possible that they'll change between when we allocate
the string and when we go to fill it. If that happens, free the string,
recalculate the length and try again. If it the mismatch isn't resolved
on the second try then just give up and return -EINVAL.

Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-06-16 11:15:45 -04:00
Jeff Layton
b8fb2f595e nfs: update maxsz values for SETCLIENTID and EXCHANGE_ID
The spec allows for up to NFS4_OPAQUE_LIMIT (1k). While we'll almost
certainly never use that much, these ops are generally the only ones
in the compound so we might as well allow for them to be that large.

Also, the existing code didn't add in a word for the opaque length
field for either name string. Fix that while we're in there.

Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-06-16 11:15:40 -04:00
Jeff Layton
3a6bb73879 nfs: convert setclientid and exchange_id encoders to use clp->cl_owner_id
...instead of buffers that are part of their arg structs. We already
hold a reference to the client, so we might as well use the allocated
buffer. In the event that we can't allocate the clp->cl_owner_id, then
just return -ENOMEM.

Note too that we switch from a GFP_KERNEL allocation here to GFP_NOFS.
It's possible we could end up trying to do a SETCLIENTID or EXCHANGE_ID
in order to reclaim some memory, and the GFP_KERNEL allocations in the
existing code could cause recursion back into NFS reclaim.

Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-06-16 11:15:31 -04:00
Fabian Frederick
455b6ee645 pnfs/flexfiles: use swap() in ff_layout_sort_mirrors()
Use kernel.h macro definition.

Thanks to Julia Lawall for Coccinelle scripting support.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-06-16 11:14:03 -04:00
Al Viro
70d45cdb66 ufs: don't touch mtime/ctime of directory being moved
See "ext2: Do not update mtime of a moved directory" (and followup in
"ext2: fix unbalanced kmap()/kunmap()") for background; this is UFS
equivalent - the same problem exists here.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-06-16 02:08:34 -04:00
Al Viro
a50e4a02ad ufs: don't bother with lock_ufs()/unlock_ufs() for directory access
We are already serialized by ->i_mutex and operations on different
directories are independent.  These calls are just rudiments of
blind BKL conversion and they should've been removed back then.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-06-16 02:08:31 -04:00
Jan Kara
514d748f69 ufs: Fix possible deadlock when looking up directories
Commit e4502c63f5 (ufs: deal with nfsd/iget races) made ufs
create inodes with I_NEW flag set. However ufs_mkdir() never cleared
this flag. Thus if someone ever tried to lookup the directory by inode
number, he would deadlock waiting for I_NEW to be cleared. Luckily this
mostly happens only if the filesystem is exported over NFS since
otherwise we have the inode attached to dentry and don't look it up by
inode number. In rare cases dentry can get freed without inode being
freed and then we'd hit the deadlock even without NFS export.

Fix the problem by clearing I_NEW before instantiating new directory
inode.

Fixes: e4502c63f5
Reported-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-06-16 02:08:12 -04:00
Jan Kara
12ecbb4b1d ufs: Fix warning from unlock_new_inode()
Commit e4502c63f5 (ufs: deal with nfsd/iget races) introduced
unlock_new_inode() call into ufs_add_nondir(). However that function
gets called also from ufs_link() which hands it already initialized
inode and thus unlock_new_inode() complains. The problem is harmless but
annoying.

Fix the problem by opencoding necessary stuff in ufs_link()

Fixes: e4502c63f5
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-06-16 02:08:07 -04:00
Fabian Frederick
cdd9eefdf9 fs/ufs: restore s_lock mutex
Commit 0244756edc ("ufs: sb mutex merge + mutex_destroy") generated
deadlocks in read/write mode on mkdir.

This patch partially reverts it keeping fixes by Andrew Morton and
mutex_destroy()

[AV: fixed a missing bit in ufs_remount()]

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Reported-by: Ian Campbell <ian.campbell@citrix.com>
Suggested-by: Jan Kara <jack@suse.cz>
Cc: Ian Campbell <ian.campbell@citrix.com>
Cc: Evgeniy Dushistov <dushistov@mail.ru>
Cc: Alexey Khoroshilov <khoroshilov@ispras.ru>
Cc: Roger Pau Monne <roger.pau@citrix.com>
Cc: Ian Jackson <Ian.Jackson@eu.citrix.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-06-16 02:07:38 -04:00
Michal Hocko
7b506b1035 jbd2: get rid of open coded allocation retry loop
insert_revoke_hash does an open coded endless allocation loop if
journal_oom_retry is true. It doesn't implement any allocation fallback
strategy between the retries, though. The memory allocator doesn't know
about the never fail requirement so it cannot potentially help to move
on with the allocation (e.g. use memory reserves).

Get rid of the retry loop and use __GFP_NOFAIL instead. We will lose the
debugging message but I am not sure it is anyhow helpful.

Do the same for journal_alloc_journal_head which is doing a similar
thing.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-06-15 15:45:58 -04:00
Andreas Dilger
b03a2f7eb2 ext4: improve warning directory handling messages
Several ext4_warning() messages in the directory handling code do not
report the inode number of the (potentially corrupt) directory where a
problem is seen, and others report this in an ad-hoc manner.  Add an
ext4_warning_inode() helper to print the inode number and command name
consistent with ext4_error_inode().

Consolidate the place in ext4.h that these macros are defined.

Clean up some other directory error and warning messages to print the
calling function name.

Minor code style fixes in nearby lines.

Signed-off-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-06-15 14:50:26 -04:00
Joseph Qi
6f6a6fda29 jbd2: fix ocfs2 corrupt when updating journal superblock fails
If updating journal superblock fails after journal data has been
flushed, the error is omitted and this will mislead the caller as a
normal case.  In ocfs2, the checkpoint will be treated successfully
and the other node can get the lock to update. Since the sb_start is
still pointing to the old log block, it will rewrite the journal data
during journal recovery by the other node. Thus the new updates will
be overwritten and ocfs2 corrupts.  So in above case we have to return
the error, and ocfs2_commit_cache will take care of the error and
prevent the other node to do update first.  And only after recovering
journal it can do the new updates.

The issue discussion mail can be found at:
https://oss.oracle.com/pipermail/ocfs2-devel/2015-June/010856.html
http://comments.gmane.org/gmane.comp.file-systems.ext4/48841

[ Fixed bug in patch which allowed a non-negative error return from
  jbd2_cleanup_journal_tail() to leak out of jbd2_fjournal_flush(); this
  was causing xfstests ext4/306 to fail. -- Ted ]

Reported-by: Yiwen Jiang <jiangyiwen@huawei.com>
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Tested-by: Yiwen Jiang <jiangyiwen@huawei.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: stable@vger.kernel.org
2015-06-15 14:36:01 -04:00
Rasmus Villemoes
97b4af2f76 ext4: mballoc: avoid 20-argument function call
Making a function call with 20 arguments is rather expensive in both
stack and .text. In this case, doing the formatting manually doesn't
make it any less readable, so we might as well save 155 bytes of .text
and 112 bytes of stack.

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
2015-06-15 00:32:58 -04:00
Lukas Czerner
0d306dcf86 ext4: wait for existing dio workers in ext4_alloc_file_blocks()
Currently existing dio workers can jump in and potentially increase
extent tree depth while we're allocating blocks in
ext4_alloc_file_blocks().  This may cause us to underestimate the
number of credits needed for the transaction because the extent tree
depth can change after our estimation.

Fix this by waiting for all the existing dio workers in the same way
as we do it in ext4_punch_hole.  We've seen errors caused by this in
xfstest generic/299, however it's really hard to reproduce.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-06-15 00:23:53 -04:00
Lukas Czerner
4134f5c88d ext4: recalculate journal credits as inode depth changes
Currently in ext4_alloc_file_blocks() the number of credits is
calculated only once before we enter the allocation loop. However within
the allocation loop the extent tree depth can change, hence the number
of credits needed can increase potentially exceeding the number of credits
reserved in the handle which can cause journal failures.

Fix this by recalculating number of credits when the inode depth
changes. Note that even though ext4_alloc_file_blocks() is only
currently used by extent base inodes we will avoid recalculating number
of credits unnecessarily in the case of indirect based inodes.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-06-15 00:20:46 -04:00
Dmitry Monakhov
b4f1afcd06 jbd2: use GFP_NOFS in jbd2_cleanup_journal_tail()
jbd2_cleanup_journal_tail() can be invoked by jbd2__journal_start()
So allocations should be done with GFP_NOFS

[Full stack trace snipped from 3.10-rh7]
[<ffffffff815c4bd4>] dump_stack+0x19/0x1b
[<ffffffff8105dba1>] warn_slowpath_common+0x61/0x80
[<ffffffff8105dcca>] warn_slowpath_null+0x1a/0x20
[<ffffffff815c2142>] slab_pre_alloc_hook.isra.31.part.32+0x15/0x17
[<ffffffff8119c045>] kmem_cache_alloc+0x55/0x210
[<ffffffff811477f5>] ? mempool_alloc_slab+0x15/0x20
[<ffffffff811477f5>] mempool_alloc_slab+0x15/0x20
[<ffffffff81147939>] mempool_alloc+0x69/0x170
[<ffffffff815cb69e>] ? _raw_spin_unlock_irq+0xe/0x20
[<ffffffff8109160d>] ? finish_task_switch+0x5d/0x150
[<ffffffff811f1a8e>] bio_alloc_bioset+0x1be/0x2e0
[<ffffffff8127ee49>] blkdev_issue_flush+0x99/0x120
[<ffffffffa019a733>] jbd2_cleanup_journal_tail+0x93/0xa0 [jbd2] -->GFP_KERNEL
[<ffffffffa019aca1>] jbd2_log_do_checkpoint+0x221/0x4a0 [jbd2]
[<ffffffffa019afc7>] __jbd2_log_wait_for_space+0xa7/0x1e0 [jbd2]
[<ffffffffa01952d8>] start_this_handle+0x2d8/0x550 [jbd2]
[<ffffffff811b02a9>] ? __memcg_kmem_put_cache+0x29/0x30
[<ffffffff8119c120>] ? kmem_cache_alloc+0x130/0x210
[<ffffffffa019573a>] jbd2__journal_start+0xba/0x190 [jbd2]
[<ffffffff811532ce>] ? lru_cache_add+0xe/0x10
[<ffffffffa01c9549>] ? ext4_da_write_begin+0xf9/0x330 [ext4]
[<ffffffffa01f2c77>] __ext4_journal_start_sb+0x77/0x160 [ext4]
[<ffffffffa01c9549>] ext4_da_write_begin+0xf9/0x330 [ext4]
[<ffffffff811446ec>] generic_file_buffered_write_iter+0x10c/0x270
[<ffffffff81146918>] __generic_file_write_iter+0x178/0x390
[<ffffffff81146c6b>] __generic_file_aio_write+0x8b/0xb0
[<ffffffff81146ced>] generic_file_aio_write+0x5d/0xc0
[<ffffffffa01bf289>] ext4_file_write+0xa9/0x450 [ext4]
[<ffffffff811c31d9>] ? pipe_read+0x379/0x4f0
[<ffffffff811b93f0>] do_sync_write+0x90/0xe0
[<ffffffff811b9b6d>] vfs_write+0xbd/0x1e0
[<ffffffff811ba5b8>] SyS_write+0x58/0xb0
[<ffffffff815d4799>] system_call_fastpath+0x16/0x1b

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2015-06-15 00:18:02 -04:00
Fabian Frederick
13b987ea27 fs/ufs: revert "ufs: fix deadlocks introduced by sb mutex merge"
This reverts commit 9ef7db7f38 ("ufs: fix deadlocks introduced by sb
mutex merge") That patch tried to solve commit 0244756edc ("ufs: sb
mutex merge + mutex_destroy") which is itself partially reverted due to
multiple deadlocks.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Suggested-by: Jan Kara <jack@suse.cz>
Cc: Ian Campbell <ian.campbell@citrix.com>
Cc: Evgeniy Dushistov <dushistov@mail.ru>
Cc: Alexey Khoroshilov <khoroshilov@ispras.ru>
Cc: Roger Pau Monne <roger.pau@citrix.com>
Cc: Ian Jackson <Ian.Jackson@eu.citrix.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2015-06-14 11:31:51 -04:00
Al Viro
3f4a949410 ncpfs: successful rename() should invalidate caches for parents
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-06-14 11:31:39 -04:00
Fabian Frederick
bf86546760 ext4: use swap() in mext_page_double_lock()
Use kernel.h macro definition.

Thanks to Julia Lawall for Coccinelle scripting support.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-06-12 23:47:33 -04:00
Fabian Frederick
4b7e2db5c0 ext4: use swap() in memswap()
Use kernel.h macro definition.

Thanks to Julia Lawall for Coccinelle scripting support.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-06-12 23:46:33 -04:00
Theodore Ts'o
bdf96838ae ext4: fix race between truncate and __ext4_journalled_writepage()
The commit cf108bca46: "ext4: Invert the locking order of page_lock
and transaction start" caused __ext4_journalled_writepage() to drop
the page lock before the page was written back, as part of changing
the locking order to jbd2_journal_start -> page_lock.  However, this
introduced a potential race if there was a truncate racing with the
data=journalled writeback mode.

Fix this by grabbing the page lock after starting the journal handle,
and then checking to see if page had gotten truncated out from under
us.

This fixes a number of different warnings or BUG_ON's when running
xfstests generic/086 in data=journalled mode, including:

jbd2_journal_dirty_metadata: vdc-8: bad jh for block 115643: transaction (ee3fe7
c0, 164), jh->b_transaction (  (null), 0), jh->b_next_transaction (  (null), 0), jlist 0

	      	      	  - and -

kernel BUG at /usr/projects/linux/ext4/fs/jbd2/transaction.c:2200!
    ...
Call Trace:
 [<c02b2ded>] ? __ext4_journalled_invalidatepage+0x117/0x117
 [<c02b2de5>] __ext4_journalled_invalidatepage+0x10f/0x117
 [<c02b2ded>] ? __ext4_journalled_invalidatepage+0x117/0x117
 [<c027d883>] ? lock_buffer+0x36/0x36
 [<c02b2dfa>] ext4_journalled_invalidatepage+0xd/0x22
 [<c0229139>] do_invalidatepage+0x22/0x26
 [<c0229198>] truncate_inode_page+0x5b/0x85
 [<c022934b>] truncate_inode_pages_range+0x156/0x38c
 [<c0229592>] truncate_inode_pages+0x11/0x15
 [<c022962d>] truncate_pagecache+0x55/0x71
 [<c02b913b>] ext4_setattr+0x4a9/0x560
 [<c01ca542>] ? current_kernel_time+0x10/0x44
 [<c026c4d8>] notify_change+0x1c7/0x2be
 [<c0256a00>] do_truncate+0x65/0x85
 [<c0226f31>] ? file_ra_state_init+0x12/0x29

	      	      	  - and -

WARNING: CPU: 1 PID: 1331 at /usr/projects/linux/ext4/fs/jbd2/transaction.c:1396
irty_metadata+0x14a/0x1ae()
    ...
Call Trace:
 [<c01b879f>] ? console_unlock+0x3a1/0x3ce
 [<c082cbb4>] dump_stack+0x48/0x60
 [<c0178b65>] warn_slowpath_common+0x89/0xa0
 [<c02ef2cf>] ? jbd2_journal_dirty_metadata+0x14a/0x1ae
 [<c0178bef>] warn_slowpath_null+0x14/0x18
 [<c02ef2cf>] jbd2_journal_dirty_metadata+0x14a/0x1ae
 [<c02d8615>] __ext4_handle_dirty_metadata+0xd4/0x19d
 [<c02b2f44>] write_end_fn+0x40/0x53
 [<c02b4a16>] ext4_walk_page_buffers+0x4e/0x6a
 [<c02b59e7>] ext4_writepage+0x354/0x3b8
 [<c02b2f04>] ? mpage_release_unused_pages+0xd4/0xd4
 [<c02b1b21>] ? wait_on_buffer+0x2c/0x2c
 [<c02b5a4b>] ? ext4_writepage+0x3b8/0x3b8
 [<c02b5a5b>] __writepage+0x10/0x2e
 [<c0225956>] write_cache_pages+0x22d/0x32c
 [<c02b5a4b>] ? ext4_writepage+0x3b8/0x3b8
 [<c02b6ee8>] ext4_writepages+0x102/0x607
 [<c019adfe>] ? sched_clock_local+0x10/0x10e
 [<c01a8a7c>] ? __lock_is_held+0x2e/0x44
 [<c01a8ad5>] ? lock_is_held+0x43/0x51
 [<c0226dff>] do_writepages+0x1c/0x29
 [<c0276bed>] __writeback_single_inode+0xc3/0x545
 [<c0277c07>] writeback_sb_inodes+0x21f/0x36d
    ...

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2015-06-12 23:45:33 -04:00
Theodore Ts'o
1cb767cd4a ext4 crypto: fail the mount if blocksize != pagesize
We currently don't correctly handle the case where blocksize !=
pagesize, so disallow the mount in those cases.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-06-12 23:44:33 -04:00
Josef Bacik
37b8d27de5 Btrfs: use received_uuid of parent during send
Neil Horman pointed out a problem where if he did something like this

receive A
snap A B
change B
send -p A B

and then on another box do

recieve A
receive B

the receive B would fail because we use the UUID of A for the clone sources for
B.  This makes sense most of the time because normally you are sending from the
original sources, not a received source.  However when you use a recieved subvol
its UUID is going to be something completely different, so if you then try to
receive the diff on a different volume it won't find the UUID because the new A
will be something else.  The only constant is the received uuid.  So instead
check to see if we have received_uuid set on the root, and if so use that as the
clone source, as btrfs receive looks for matches either in received_uuid or
uuid.  Thanks,

Reported-by: Neil Horman <nhorman@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Hugo Mills <hugo@carfax.org.uk>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-12 13:20:38 -07:00
Liu Bo
0eeff2362b Btrfs: fix use-after-free in btrfs_replay_log
@log_root_tree should not be referenced after kfree.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Reported-by: Julia Lawall <julia.lawall@lip6.fr>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-12 11:03:21 -07:00
Chao Yu
3c45414527 f2fs: do not trim preallocated blocks when truncating after i_size
When we perform generic/092 in xfstests, output is like below:

     XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
     0: [0..10239]: data
     0: [0..10239]: data
    -1: [10240..20479]: unwritten
    +1: [10240..14335]: unwritten

This is because with this testcase, we redefine the regulation for
truncate in perallocated space past i_size as below:

"There was some confused about what the fs was supposed to do when you
truncate at i_size with preallocated space past i_size. We decided on the
following things.

1) truncate(i_size) will trim all blocks past i_size.
2) truncate(x) where x > i_size will not trim all blocks past i_size.
"

This method is used in xfs, and then ext4/btrfs will follow the rule.

This patch fixes to follow the new rule for f2fs.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-06-11 18:30:49 -07:00
Trond Myklebust
4e54ab8d8c NFS: Ensure that we update the sequence id under the slot table lock
Fix a callback slot table regression.

Fixes: e937ee714b ("nfs: Only update callback sequnce id when CB_SEQUENCE success")
Cc: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-06-11 21:15:52 -04:00
Kinglong Mee
0579c8d208 nfs: Initialize cb_sequenceres information before validate_seqid()
For a cb_layoutrecall replay, nfsd got CB_SEQUENCE status of zero,
but all informations of cb_sequenceres are zero too !!!

validate_seqid() return NFS4ERR_RETRY_UNCACHED_REP for a replay,
and skip the initlize cb_sequenceres.

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-06-11 21:09:06 -04:00
Jaegeuk Kim
43f54cd52f f2fs crypto: add alloc_bounce_page
This patch adds alloc_bounce_page likewise ext4.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-06-11 15:04:20 -07:00
Jaegeuk Kim
7e8e754a4b f2fs crypto: fix to handle errors likewise ext4
This patch makes some error handling policies same with ext4.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-06-11 15:00:36 -07:00
Jeff Layton
6f02dc88be nfs: deny backchannel RPCs with an incorrect authflavor instead of dropping them
A drop should really only be done when the frame is malformed or we have
reason to think that there is some sort of DoS going on. When we get an
RPC with bad auth, we should send back an error instead.

Cc: Andy Adamson <William.Adamson@netapp.com>
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-06-11 14:06:34 -04:00
Kinglong Mee
e937ee714b nfs: Only update callback sequnce id when CB_SEQUENCE success
When testing pnfs layout, nfsd got error NFS4ERR_SEQ_MISORDERED.
It is caused by nfs return NFS4ERR_DELAY before validate_seqid(),
don't update the sequnce id, but nfsd updates the sequnce id !!!

According to RFC5661 20.9.3,
" If CB_SEQUENCE returns an error, then the state of the slot
  (sequence ID, cached reply) MUST NOT change. "

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-06-11 14:00:40 -04:00
Vaishali Thakkar
4ed0d83d05 NFS: Convert use of __constant_htonl to htonl
In little endian cases, the macro htonl unfolds to __swab32 which
provides special case for constants. In big endian cases,
__constant_htonl and htonl expand directly to the same expression.
So, replace __constant_htonl with htonl with the goal of getting
rid of the definition of __constant_htonl completely.

The semantic patch that performs this transformation is as follows:

@@expression x;@@

- __constant_htonl(x)
+ htonl(x)

Signed-off-by: Vaishali Thakkar <vthakkar1994@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-06-10 18:57:59 -04:00
Anna Schumaker
11598b8ff2 NFS: Remove unused nfs_rw_ops->rw_release() function
This was only ever set to nfs_writeback_release_common(), a function
which is completely empty.  Let's just drop this function pointer and
simplify the code a bit.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-06-10 18:32:40 -04:00
Dominique Martinet
c86c90c656 NFSv4: handle nfs4_get_referral failure
nfs4_proc_lookup_common is supposed to return a posix error, we have to
handle any error returned that isn't errno

Reported-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Frank S. Filz <ffilzlnx@mindspring.com>
Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-06-10 18:28:02 -04:00
Jeff Layton
3c87ef6efb sunrpc: keep a count of swapfiles associated with the rpc_clnt
Jerome reported seeing a warning pop when working with a swapfile on
NFS. The nfs_swap_activate can end up calling sk_set_memalloc while
holding the rcu_read_lock and that function can sleep.

To fix that, we need to take a reference to the xprt while holding the
rcu_read_lock, set the socket up for swapping and then drop that
reference. But, xprt_put is not exported and having NFS deal with the
underlying xprt is a bit of layering violation anyway.

Fix this by adding a set of activate/deactivate functions that take a
rpc_clnt pointer instead of an rpc_xprt, and have nfs_swap_activate and
nfs_swap_deactivate call those.

Also, add a per-rpc_clnt atomic counter to keep track of the number of
active swapfiles associated with it. When the counter does a 0->1
transition, we enable swapping on the xprt, when we do a 1->0 transition
we disable swapping on it.

This also allows us to be a bit more selective with the RPC_TASK_SWAPPER
flag. If non-swapper and swapper clnts are sharing a xprt, then we only
need to flag the tasks from the swapper clnt with that flag.

Acked-by: Mel Gorman <mgorman@suse.de>
Reported-by: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-06-10 18:26:14 -04:00
Zhao Lei
9a4e7276d3 btrfs: wait for delayed iputs on no space
btrfs will report no_space when we run following write and delete
file loop:
 # FILE_SIZE_M=[ 75% of fs space ]
 # DEV=[ some dev ]
 # MNT=[ some dir ]
 #
 # mkfs.btrfs -f "$DEV"
 # mount -o nodatacow "$DEV" "$MNT"
 # for ((i = 0; i < 100; i++)); do dd if=/dev/zero of="$MNT"/file0 bs=1M count="$FILE_SIZE_M"; rm -f "$MNT"/file0; done
 #

Reason:
 iput() and evict() is run after write pages to block device, if
 write pages work is not finished before next write, the "rm"ed space
 is not freed, and caused above bug.

Fix:
 We can add "-o flushoncommit" mount option to avoid above bug, but
 it have performance problem. Actually, we can to wait for on-the-fly
 writes only when no-space happened, it is which this patch do.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-10 09:26:34 -07:00
Qu Wenruo
d672633545 btrfs: qgroup: Make snapshot accounting work with new extent-oriented
qgroup.

Make snapshot accounting work with new extent-oriented mechanism by
skipping given root in new/old_roots in create_pending_snapshot().

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-10 09:26:29 -07:00
Qu Wenruo
9086db86e0 btrfs: qgroup: Add the ability to skip given qgroup for old/new_roots.
This is used by later qgroup fix patches for snapshot.

As current snapshot accounting is done by btrfs_qgroup_inherit(), but
new extent oriented quota mechanism will account extent from
btrfs_copy_root() and other snapshot things, causing wrong result.

So add this ability to handle snapshot accounting.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-10 09:26:23 -07:00
Qu Wenruo
d4b8040459 btrfs: ulist: Add ulist_del() function.
This function will delete unode with given (val,aux) pair.
And with this patch, seqnum for debug usage doesn't have any meaning
now, so remove them.

This is used by later patches to skip snapshot root.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-10 09:26:17 -07:00
Qu Wenruo
e69bcee376 btrfs: qgroup: Cleanup the old ref_node-oriented mechanism.
Goodbye, the old mechanisim.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-10 09:26:11 -07:00
Qu Wenruo
442244c963 btrfs: qgroup: Switch self test to extent-oriented qgroup mechanism.
Since the self test transaction don't have delayed_ref_roots, so use
find_all_roots() and export btrfs_qgroup_account_extent() to simulate it

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-10 09:26:05 -07:00
Qu Wenruo
0ed4792af0 btrfs: qgroup: Switch to new extent-oriented qgroup mechanism.
Switch from old ref_node based qgroup to extent based qgroup mechanism
for normal operations.

The new mechanism should hugely reduce the overhead of btrfs quota
system, and further more, the codes and logic should be more clean and
easier to maintain.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-10 09:25:59 -07:00
Qu Wenruo
9d220c95f5 btrfs: qgroup: Switch rescan to new mechanism.
Switch rescan to use the new new extent oriented mechanism.

As rescan is also based on extent, new mechanism is just a perfect match
for rescan.

With re-designed internal functions, rescan is quite easy, just call
btrfs_find_all_roots() and then btrfs_qgroup_account_one_extent().

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-10 09:25:54 -07:00
Qu Wenruo
550d7a2ed5 btrfs: qgroup: Add new qgroup calculation function
btrfs_qgroup_account_extents().

The new btrfs_qgroup_account_extents() function should be called in
btrfs_commit_transaction() and it will update all the qgroup according
to delayed_ref_root->dirty_extent_root.

The new function can handle both normal operation during
commit_transaction() or in rescan in a unified method with clearer
logic.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-10 09:25:49 -07:00
Qu Wenruo
21633fc603 btrfs: backref: Add special time_seq == (u64)-1 case for
btrfs_find_all_roots().

Allow btrfs_find_all_roots() to skip all delayed_ref_head lock and tree
lock to do tree search.

This is important for later qgroup implement which will call
find_all_roots() after fs trees are committed.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-10 09:25:45 -07:00