Commit graph

33,155 commits

Author SHA1 Message Date
Haicheng Li
8d8451af68 f2fs: make locate_dirty_segment() as static
It's used only locally and could be static.

Signed-off-by: Haicheng Li <haicheng.li@linux.intel.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-06-14 09:04:44 +09:00
Haicheng Li
e79efe3b69 f2fs: remove unnecessary parameter "offset" from __add_sum_entry()
We can get the value directly from pointer "curseg".

Signed-off-by: Haicheng Li <haicheng.li@linux.intel.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-06-14 09:04:43 +09:00
Jaegeuk Kim
b3783873cc f2fs: avoid freqeunt write_inode calls
If update_inode is called, we don't need to do write_inode.
So, let's use a *dirty* flag for each inode.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-06-14 09:04:43 +09:00
Namjae Jeon
d7cc950b4c f2fs: optimise the truncate_data_blocks_range() range
The function truncate_data_blocks_range() decrements the valid
block count of inode via dec_valid_block_count(). Since this
function updates the i_blocks field of inode, we can update this
field once we have calculated total the number of blocks
to be freed.

Therefore we can decrement valid blocks outside of the for loop.

	if (nr_free) {
+		dec_valid_block_count(sbi, dn->inode, nr_free);
 		set_page_dirty(dn->node_page);
 		sync_inode_page(dn);
 	}

'nr_free' tells the total number of blocks freed. So, we can
just directly pass this value to dec_valid_block_count() and update
the i_blocks.

Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Pankaj Kumar <pankaj.km@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-06-14 09:04:42 +09:00
Namjae Jeon
6a3e8ef0de f2fs: use the F2FS specific flags in f2fs_ioctl()
In f2fs_ioctl() function, it is using generic flags.
Since F2FS specific flags are defined. So lets use
those flags.

Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Pankaj Kumar <pankaj.km@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-06-14 09:04:26 +09:00
Dave Chinner
ade1335afe xfs: ensure btree root split sets blkno correctly
For CRC enabled filesystems, the BMBT is rooted in an inode, so it
passes through a different code path on root splits than the
freespace and inode btrees. This is much less traversed by xfstests
than the other trees. When testing on a 1k block size filesystem,
I've been seeing ASSERT failures in generic/234 like:

XFS: Assertion failed: cur->bc_btnum != XFS_BTNUM_BMAP || cur->bc_private.b.allocated == 0, file: fs/xfs/xfs_btree.c, line: 317

which are generally preceded by a lblock check failure. I noticed
this in the bmbt stats:

$ pminfo -f xfs.btree.block_map

xfs.btree.block_map.lookup
    value 39135

xfs.btree.block_map.compare
    value 268432

xfs.btree.block_map.insrec
    value 15786

xfs.btree.block_map.delrec
    value 13884

xfs.btree.block_map.newroot
    value 2

xfs.btree.block_map.killroot
    value 0
.....

Very little coverage of root splits and merges. Indeed, on a 4k
filesystem, block_map.newroot and block_map.killroot are both zero.
i.e. the code is not exercised at all, and it's the only generic
btree infrastructure operation that is not exercised by a default run
of xfstests.

Turns out that on a 1k filesystem, generic/234 accounts for one of
those two root splits, and that is somewhat of a smoking gun. In
fact, it's the same problem we saw in the directory/attr code where
headers are memcpy()d from one block to another without updating the
self describing metadata.

Simple fix - when copying the header out of the root block, make
sure the block number is updated correctly.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-06-13 14:18:02 -05:00
Dave Chinner
8a1fd2950e xfs: fix implicit padding in directory and attr CRC formats
Michael L. Semon has been testing CRC patches on a 32 bit system and
been seeing assert failures in the directory code from xfs/080.
Thanks to Michael's heroic efforts with printk debugging, we found
that the problem was that the last free space being left in the
directory structure was too small to fit a unused tag structure and
it was being corrupted and attempting to log a region out of bounds.
Hence the assert failure looked something like:

.....
#5 calling xfs_dir2_data_log_unused() 36 32
#1 4092 4095 4096
#2 8182 8183 4096
XFS: Assertion failed: first <= last && last < BBTOB(bp->b_length), file: fs/xfs/xfs_trans_buf.c, line: 568

Where #1 showed the first region of the dup being logged (i.e. the
last 4 bytes of a directory buffer) and #2 shows the corrupt values
being calculated from the length of the dup entry which overflowed
the size of the buffer.

It turns out that the problem was not in the logging code, nor in
the freespace handling code. It is an initial condition bug that
only shows up on 32 bit systems. When a new buffer is initialised,
where's the freespace that is set up:

[  172.316249] calling xfs_dir2_leaf_addname() from xfs_dir_createname()
[  172.316346] #9 calling xfs_dir2_data_log_unused()
[  172.316351] #1 calling xfs_trans_log_buf() 60 63 4096
[  172.316353] #2 calling xfs_trans_log_buf() 4094 4095 4096

Note the offset of the first region being logged? It's 60 bytes into
the buffer. Once I saw that, I pretty much knew that the bug was
going to be caused by this.

Essentially, all direct entries are rounded to 8 bytes in length,
and all entries start with an 8 byte alignment. This means that we
can decode inplace as variables are naturally aligned. With the
directory data supposedly starting on a 8 byte boundary, and all
entries padded to 8 bytes, the minimum freespace in a directory
block is supposed to be 8 bytes, which is large enough to fit a
unused data entry structure (6 bytes in size). The fact we only have
4 bytes of free space indicates a directory data block alignment
problem.

And what do you know - there's an implicit hole in the directory
data block header for the CRC format, which means the header is 60
byte on 32 bit intel systems and 64 bytes on 64 bit systems. Needs
padding. And while looking at the structures, I found the same
problem in the attr leaf header. Fix them both.

Note that this only affects 32 bit systems with CRCs enabled.
Everything else is just fine. Note that CRC enabled filesystems created
before this fix on such systems will not be readable with this fix
applied.

Reported-by: Michael L. Semon <mlsemon35@gmail.com>
Debugged-by: Michael L. Semon <mlsemon35@gmail.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-06-13 10:30:03 -05:00
Jie Liu
72dac95d44 ext4: return FIEMAP_EXTENT_UNKNOWN for delalloc extents
Return the FIEMAP_EXTENT_UNKNOWN flag as well except the
FIEMAP_EXTENT_DELALLOC because the data location of an
delayed allocation extent is unknown.

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
2013-06-12 23:13:59 -04:00
Paul Gortmaker
75497d0607 jbd2: remove debug dependency on debug_fs and update Kconfig help text
Commit b6e96d0067 ("jbd2: use module parameters instead of debugfs
for jbd_debug") removed any need for a dependency on DEBUG_FS.  It
also moved the /sys variables out from underneath the typical debugfs
mount point.  Delete the dependency and update the /sys path to where
the debug settings are currently.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-12 23:07:51 -04:00
Paul Gortmaker
169f1a2a87 jbd2: use a single printk for jbd_debug()
Since the jbd_debug() is implemented with two separate printk()
calls, it can lead to corrupted and misleading debug output like
the following (see lines marked with "*"):

[  290.339362] (fs/jbd2/journal.c, 203): kjournald2: kjournald2 wakes
[  290.339365] (fs/jbd2/journal.c, 155): kjournald2: commit_sequence=42103, commit_request=42104
[  290.339369] (fs/jbd2/journal.c, 158): kjournald2: OK, requests differ
[* 290.339376] (fs/jbd2/journal.c, 648): jbd2_log_wait_commit:
[* 290.339379] (fs/jbd2/commit.c, 370): jbd2_journal_commit_transaction: JBD2: want 42104, j_commit_sequence=42103
[* 290.339382] JBD2: starting commit of transaction 42104
[  290.339410] (fs/jbd2/revoke.c, 566): jbd2_journal_write_revoke_records: Wrote 0 revoke records
[  290.376555] (fs/jbd2/commit.c, 1088): jbd2_journal_commit_transaction: JBD2: commit 42104 complete, head 42079

i.e. the debug output from log_wait_commit and journal_commit_transaction
have become interleaved.  The output should have been:

(fs/jbd2/journal.c, 648): jbd2_log_wait_commit: JBD2: want 42104, j_commit_sequence=42103
(fs/jbd2/commit.c, 370): jbd2_journal_commit_transaction: JBD2: starting commit of transaction 42104

It is expected that this is not easy to replicate -- I was only able
to cause it on preempt-rt kernels, and even then only under heavy
I/O load.

Reported-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Suggested-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-12 23:04:04 -04:00
Paul Gortmaker
cfc7bc896f jbd2: fix duplicate debug label for phase 2
Currently we see this output:

  $git grep phase fs/jbd2
  fs/jbd2/commit.c:       jbd_debug(3, "JBD2: commit phase 1\n");
  fs/jbd2/commit.c:       jbd_debug(3, "JBD2: commit phase 2\n");
  fs/jbd2/commit.c:       jbd_debug(3, "JBD2: commit phase 2\n");
  fs/jbd2/commit.c:       jbd_debug(3, "JBD2: commit phase 3\n");
  fs/jbd2/commit.c:       jbd_debug(3, "JBD2: commit phase 4\n");
  [...]

There is clearly a duplicate label for phase 2, and they are
both active (i.e. not in #if ... #else block).  Rename them to
be "2a" and "2b" so the debug output is unambiguous.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-12 22:56:35 -04:00
Paul Gortmaker
0ef54180e0 jbd2: drop checkpoint mutex when waiting in __jbd2_log_wait_for_space()
While trying to debug an an issue under extreme I/O loading
on preempt-rt kernels, the following backtrace was observed
via SysRQ output:

rm              D ffff8802203afbc0  4600  4878   4748 0x00000000
 ffff8802217bfb78 0000000000000082 ffff88021fc2bb80 ffff88021fc2bb80
 ffff88021fc2bb80 ffff8802217bffd8 ffff8802217bffd8 ffff8802217bffd8
 ffff88021f1d4c80 ffff88021fc2bb80 ffff8802217bfb88 ffff88022437b000
Call Trace:
 [<ffffffff8172dc34>] schedule+0x24/0x70
 [<ffffffff81225b5d>] jbd2_log_wait_commit+0xbd/0x140
 [<ffffffff81060390>] ? __init_waitqueue_head+0x50/0x50
 [<ffffffff81223635>] jbd2_log_do_checkpoint+0xf5/0x520
 [<ffffffff81223b09>] __jbd2_log_wait_for_space+0xa9/0x1f0
 [<ffffffff8121dc40>] start_this_handle.isra.10+0x2e0/0x530
 [<ffffffff81060390>] ? __init_waitqueue_head+0x50/0x50
 [<ffffffff8121e0a3>] jbd2__journal_start+0xc3/0x110
 [<ffffffff811de7ce>] ? ext4_rmdir+0x6e/0x230
 [<ffffffff8121e0fe>] jbd2_journal_start+0xe/0x10
 [<ffffffff811f308b>] ext4_journal_start_sb+0x5b/0x160
 [<ffffffff811de7ce>] ext4_rmdir+0x6e/0x230
 [<ffffffff811435c5>] vfs_rmdir+0xd5/0x140
 [<ffffffff8114370f>] do_rmdir+0xdf/0x120
 [<ffffffff8105c6b4>] ? task_work_run+0x44/0x80
 [<ffffffff81002889>] ? do_notify_resume+0x89/0x100
 [<ffffffff817361ae>] ? int_signal+0x12/0x17
 [<ffffffff81145d85>] sys_unlinkat+0x25/0x40
 [<ffffffff81735f22>] system_call_fastpath+0x16/0x1b

What is interesting here, is that we call log_wait_commit, from
within wait_for_space, but we are still holding the checkpoint_mutex
as it surrounds mostly the whole of wait_for_space.  And then, as we
are waiting, journal_commit_transaction can run, and if the JBD2_FLUSHED
bit is set, then we will also try to take the same checkpoint_mutex.

It seems that we need to drop the checkpoint_mutex while sitting in
jbd2_log_wait_commit, if we want to guarantee that progress can be made
by jbd2_journal_commit_transaction().  There does not seem to be
anything preempt-rt specific about this, other then perhaps increasing
the odds of it happening.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-12 22:47:35 -04:00
Paul Gortmaker
3ca841c106 jbd2: relocate assert after state lock in journal_commit_transaction()
The state lock is taken after we are doing an assert on the state
value, not before.  So we might in fact be doing an assert on a
transient value.  Ensure the state check is within the scope of
the state lock being taken.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-12 22:46:35 -04:00
Dmitry Monakhov
4418e14112 ext4: Fix fsync error handling after filesystem abort
If filesystem was aborted after inode's write back is complete
but before its metadata was updated we may return success
results in data loss.
In order to handle fs abort correctly we have to check
fs state once we discover that it is in MS_RDONLY state

Test case: http://patchwork.ozlabs.org/patch/244297

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-12 22:38:04 -04:00
Dmitry Monakhov
06a407f13d ext4: fix data integrity for ext4_sync_fs
Inode's data or non journaled quota may be written w/o jounral so we
_must_ send a barrier at the end of ext4_sync_fs. But it can be
skipped if journal commit will do it for us.

Also fix data integrity for nojournal mode.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-12 22:25:07 -04:00
Dmitry Monakhov
9ff8644624 jbd2: optimize jbd2_journal_force_commit
Current implementation of jbd2_journal_force_commit() is suboptimal because
result in empty and useless commits. But callers just want to force and wait
any unfinished commits. We already have jbd2_journal_force_commit_nested()
which does exactly what we want, except we are guaranteed that we do not hold
journal transaction open.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-12 22:24:07 -04:00
Linus Torvalds
a568fa1c91 Merge branch 'akpm' (updates from Andrew Morton)
Merge misc fixes from Andrew Morton:
 "Bunch of fixes and one little addition to math64.h"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (27 commits)
  include/linux/math64.h: add div64_ul()
  mm: memcontrol: fix lockless reclaim hierarchy iterator
  frontswap: fix incorrect zeroing and allocation size for frontswap_map
  kernel/audit_tree.c:audit_add_tree_rule(): protect `rule' from kill_rules()
  mm: migration: add migrate_entry_wait_huge()
  ocfs2: add missing lockres put in dlm_mig_lockres_handler
  mm/page_alloc.c: fix watermark check in __zone_watermark_ok()
  drivers/misc/sgi-gru/grufile.c: fix info leak in gru_get_config_info()
  aio: fix io_destroy() regression by using call_rcu()
  rtc-at91rm9200: use shadow IMR on at91sam9x5
  rtc-at91rm9200: add shadow interrupt mask
  rtc-at91rm9200: refactor interrupt-register handling
  rtc-at91rm9200: add configuration support
  rtc-at91rm9200: add match-table compile guard
  fs/ocfs2/namei.c: remove unecessary ERROR when removing non-empty directory
  swap: avoid read_swap_cache_async() race to deadlock while waiting on discard I/O completion
  drivers/rtc/rtc-twl.c: fix missing device_init_wakeup() when booted with device tree
  cciss: fix broken mutex usage in ioctl
  audit: wait_for_auditd() should use TASK_UNINTERRUPTIBLE
  drivers/rtc/rtc-cmos.c: fix accidentally enabling rtc channel
  ...
2013-06-12 16:29:53 -07:00
Xue jiufei
27749f2ff0 ocfs2: add missing lockres put in dlm_mig_lockres_handler
dlm_mig_lockres_handler() is missing a dlm_lockres_put() on an error path.

Signed-off-by: joyce <xuejiufei@huawei.com>
Reviewed-by: shencanquan <shencanquan@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-06-12 16:29:46 -07:00
Kent Overstreet
4fcc712f5c aio: fix io_destroy() regression by using call_rcu()
There was a regression introduced by 36f5588905 ("aio: refcounting
cleanup"), reported by Jens Axboe - the refcounting cleanup switched to
using RCU in the shutdown path, but the synchronize_rcu() was done in
the context of the io_destroy() syscall greatly increasing the time it
could block.

This patch switches it to call_rcu() and makes shutdown asynchronous
(more asynchronous than it was originally; before the refcount changes
io_destroy() would still wait on pending kiocbs).

Note that there's a global quota on the max outstanding kiocbs, and that
quota must be manipulated synchronously; otherwise io_setup() could
return -EAGAIN when there isn't quota available, and userspace won't
have any way of waiting until shutdown of the old kioctxs has finished
(besides busy looping).

So we release our quota before kioctx shutdown has finished, which
should be fine since the quota never corresponded to anything real
anyways.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Reported-by: Jens Axboe <axboe@kernel.dk>
Tested-by: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Tested-by: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-06-12 16:29:46 -07:00
Goldwyn Rodrigues
e099127169 fs/ocfs2/namei.c: remove unecessary ERROR when removing non-empty directory
While removing a non-empty directory, the kernel dumps a message:

  (rmdir,21743,1):ocfs2_unlink:953 ERROR: status = -39

Suppress the error message from being printed in the dmesg so users
don't panic.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Acked-by: Sunil Mushran <sunil.mushran@gmail.com>
Reviewed-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-06-12 16:29:45 -07:00
Xiaowei.Hu
7869e59067 ocfs2: ocfs2_prep_new_orphaned_file() should return ret
If an error occurs, for example an EIO in __ocfs2_prepare_orphan_dir,
ocfs2_prep_new_orphaned_file will release the inode_ac, then when the
caller of ocfs2_prep_new_orphaned_file gets a 0 return, it will refer to
a NULL ocfs2_alloc_context struct in the following functions.  A kernel
panic happens.

Signed-off-by: "Xiaowei.Hu" <xiaowei.hu@oracle.com>
Reviewed-by: shencanquan <shencanquan@huawei.com>
Acked-by: Sunil Mushran <sunil.mushran@gmail.com>
Cc: Joe Jin <joe.jin@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-06-12 16:29:44 -07:00
Kees Cook
637241a900 kmsg: honor dmesg_restrict sysctl on /dev/kmsg
The dmesg_restrict sysctl currently covers the syslog method for access
dmesg, however /dev/kmsg isn't covered by the same protections.  Most
people haven't noticed because util-linux dmesg(1) defaults to using the
syslog method for access in older versions.  With util-linux dmesg(1)
defaults to reading directly from /dev/kmsg.

To fix /dev/kmsg, let's compare the existing interfaces and what they
allow:

 - /proc/kmsg allows:
  - open (SYSLOG_ACTION_OPEN) if CAP_SYSLOG since it uses a destructive
    single-reader interface (SYSLOG_ACTION_READ).
  - everything, after an open.

 - syslog syscall allows:
  - anything, if CAP_SYSLOG.
  - SYSLOG_ACTION_READ_ALL and SYSLOG_ACTION_SIZE_BUFFER, if
    dmesg_restrict==0.
  - nothing else (EPERM).

The use-cases were:
 - dmesg(1) needs to do non-destructive SYSLOG_ACTION_READ_ALLs.
 - sysklog(1) needs to open /proc/kmsg, drop privs, and still issue the
   destructive SYSLOG_ACTION_READs.

AIUI, dmesg(1) is moving to /dev/kmsg, and systemd-journald doesn't
clear the ring buffer.

Based on the comments in devkmsg_llseek, it sounds like actions besides
reading aren't going to be supported by /dev/kmsg (i.e.
SYSLOG_ACTION_CLEAR), so we have a strict subset of the non-destructive
syslog syscall actions.

To this end, move the check as Josh had done, but also rename the
constants to reflect their new uses (SYSLOG_FROM_CALL becomes
SYSLOG_FROM_READER, and SYSLOG_FROM_FILE becomes SYSLOG_FROM_PROC).
SYSLOG_FROM_READER allows non-destructive actions, and SYSLOG_FROM_PROC
allows destructive actions after a capabilities-constrained
SYSLOG_ACTION_OPEN check.

 - /dev/kmsg allows:
  - open if CAP_SYSLOG or dmesg_restrict==0
  - reading/polling, after open

Addresses https://bugzilla.redhat.com/show_bug.cgi?id=903192

[akpm@linux-foundation.org: use pr_warn_once()]
Signed-off-by: Kees Cook <keescook@chromium.org>
Reported-by: Christian Kujau <lists@nerdbynature.de>
Tested-by: Josh Boyer <jwboyer@redhat.com>
Cc: Kay Sievers <kay@vrfy.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-06-12 16:29:44 -07:00
Theodore Ts'o
981250ca89 ext4: don't use EXT4_FREE_BLOCKS_FORGET unnecessarily
Commit 18888cf088: "ext4: speed up truncate/unlink by not using
bforget() unless needed" removed the use of EXT4_FREE_BLOCKS_FORGET in
the most important codepath for file systems using extents, but a
similar optimization also can be done for file systems using indirect
blocks, and for the two special cases in the ext4 extents code.

Cc: Andrey Sidorov <qrxd43@motorola.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-12 11:48:29 -04:00
Theodore Ts'o
2ed5724d5a ext4: add cond_resched() to ext4_free_blocks() & ext4_mb_regular_allocator()
For a file systems with a very large number of block groups, if all of
the block group bitmaps are in memory and the file system is
relatively badly fragmented, it's possible ext4_mb_regular_allocator()
to take a long time trying to find a good match.  This is especially
true if the tuning parameter mb_max_to_scan has been sent to a very
large number.  So add a cond_resched() to avoid soft lockup warnings
and to provide better system responsiveness.

For ext4_free_blocks(), if we are deleting a large range of blocks,
and data=journal is enabled so that EXT4_FREE_BLOCKS_FORGET is passed,
the loop to call sb_find_get_block() and to call ext4_forget() can
take over 10-15 milliseocnds or more.  So it's better to add a
cond_resched() here a well.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-12 11:43:02 -04:00
Linus Torvalds
8d7a8fe2ce Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull ceph fixes from Sage Weil:
 "There is a pair of fixes for double-frees in the recent bundle for
  3.10, a couple of fixes for long-standing bugs (sleep while atomic and
  an endianness fix), and a locking fix that can be triggered when osds
  are going down"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  rbd: fix cleanup in rbd_add()
  rbd: don't destroy ceph_opts in rbd_add()
  ceph: ceph_pagelist_append might sleep while atomic
  ceph: add cpu_to_le32() calls when encoding a reconnect capability
  libceph: must hold mutex for reset_changed_osds()
2013-06-12 08:28:19 -07:00
Jaegeuk Kim
699489bbbe f2fs: sync dir->i_size with its block allocation
If new dentry block is allocated and its i_size is updated, we should update
its inode block together in order to sync i_size and its block allocation.
Otherwise, we can loose additional dentry block due to the unconsistent i_size.

Errorneous Scenario
-------------------

In the recovery routine,
 - recovery_dentry
 | - __f2fs_add_link
 | | - get_new_data_page
 | | | - i_size_write(new_i_size)
 | | | - mark_inode_dirty_sync(dir)
 | | - update_parent_metadata
 | | | - mark_inode_dirty(dir)
 |
 - write_checkpoint
   - sync_dirty_dir_inodes
     - filemap_flush(dentry_blocks)
       - f2fs_write_data_page
         - skip to write the last dentry block due to index < i_size

In the above flow, new_i_size is not updated to its inode block so that the
last dentry block will be lost accordingly.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-06-12 08:18:35 +09:00
Steven Whitehouse
5a00f3cc97 GFS2: Only do one directory search on create
Creation of a new inode requires a directory search in order to ensure
that we are not trying to create an inode with the same name as an
existing one. This was hidden away inside the create_ok() function.

In the case that there was an existing inode, and a lookup can be
substituted for a create (which is the case with regular files
when the O_EXCL flag is not in use) then we were doing a second
lookup in order to return the inode.

This patch merges these two lookups into one. This can be done by
passing a flag to gfs2_dir_search() to tell it to just return -EEXIST
in the cases where we don't actually want to look up the inode.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-06-11 13:45:29 +01:00
Jaegeuk Kim
2d4d9fb591 f2fs: fix i_blocks translation on various types of files
Basically an inode manages the number of allocated blocks with inode->i_blocks
which is represented in a unit of sectors, not file system blocks.
But, f2fs has used i_blocks in a unit of file system blocks, and f2fs_getattr
translates it to the number of sectors when fstat is called.

However, previously f2fs_file_inode_operations only has this, so this patch adds
it to all the types of inode_operations.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-06-11 16:01:09 +09:00
Gu Zheng
5fb08372a6 f2fs: set sb->s_fs_info before calling parse_options()
In f2fs_fill_super(), set sb->s_fs_info before calling parse_options(), then we can get
f2fs_sb_info via F2FS_SB(sb) in parse_options().
So that the second argument "sbi" of func parse_options() is no longer needed.

Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-06-11 16:01:08 +09:00
Jaegeuk Kim
8ae8f1627f f2fs: support xattr security labels
This patch adds the support of security labels for f2fs, which will be used
by Linus Security Models (LSMs).

Quote from http://en.wikipedia.org/wiki/Linux_Security_Modules:
"Linux Security Modules (LSM) is a framework that allows the Linux kernel to
support a variety of computer security models while avoiding favoritism toward
any single security implementation. The framework is licensed under the terms of
the GNU General Public License and is standard part of the Linux kernel since
Linux 2.6. AppArmor, SELinux, Smack and TOMOYO Linux are the currently accepted
modules in the official kernel.".

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-06-11 16:01:03 +09:00
Mikulas Patocka
bbd465df73 hpfs: fix warnings when the filesystem fills up
This patch fixes warnings due to missing lock on write error path.

  WARNING: at fs/hpfs/hpfs_fn.h:353 hpfs_truncate+0x75/0x80 [hpfs]()
  Hardware name: empty
  Pid: 26563, comm: dd Tainted: P           O 3.9.4 #12
  Call Trace:
    hpfs_truncate+0x75/0x80 [hpfs]
    hpfs_write_begin+0x84/0x90 [hpfs]
    _hpfs_bmap+0x10/0x10 [hpfs]
    generic_file_buffered_write+0x121/0x2c0
    __generic_file_aio_write+0x1c7/0x3f0
    generic_file_aio_write+0x7c/0x100
    do_sync_write+0x98/0xd0
    hpfs_file_write+0xd/0x50 [hpfs]
    vfs_write+0xa2/0x160
    sys_write+0x51/0xa0
    page_fault+0x22/0x30
    system_call_fastpath+0x1a/0x1f

Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Cc: stable@kernel.org  # 2.6.39+
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-06-08 17:39:40 -07:00
Linus Torvalds
81db4dbf59 Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fixes from Thomas Gleixner:

 - Trivial: unused variable removal

 - Posix-timers: Add the clock ID to the new proc interface to make it
   useful.  The interface is new and should be functional when we reach
   the final 3.10 release.

 - Cure a false positive warning in the tick code introduced by the
   overhaul in 3.10

 - Fix for a persistent clock detection regression introduced in this
   cycle

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  timekeeping: Correct run-time detection of persistent_clock.
  ntp: Remove unused variable flags in __hardpps
  posix-timers: Show clock ID in proc file
  tick: Cure broadcast false positive pending bit warning
2013-06-08 15:51:21 -07:00
Bryan Schumaker
6b140b85d9 NFS: Add in v4.2 callback operation
NFS v4.2 adds a CB_OFFLOAD operation used by COPY and WRITE_PLUS.  Since
neither of these operations have been implemented yet, simply return
NFS4ERR_NOTSUPP.

Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-06-08 16:20:18 -04:00
Bryan Schumaker
459de2edb9 NFS: Make callbacks minor version generic
I found a few places that hardcode the minor version number rather than
making it dependent on the protocol the callback came in over.  This
patch makes it easier to add new minor versions in the future.

Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-06-08 16:20:18 -04:00
Steve Dickson
f58eda9bc2 Kconfig: Add Kconfig entry for Labeled NFS V4 client
This patch adds the NFS_V4_SECURITY_LABEL entry which
enables security label support for the NFSv4 client

Signed-off-by: Steve Dickson <steved@redhat.com>
[trond: Make this non-interactive]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-06-08 16:20:17 -04:00
David Quigley
c9bccef6b9 NFS: Extend NFS xattr handlers to accept the security namespace
The existing NFSv4 xattr handlers do not accept xattr calls to the security
namespace. This patch extends these handlers to accept xattrs from the security
namespace in addition to the default NFSv4 ACL namespace.

Acked-by: James Morris <james.l.morris@oracle.com>
Signed-off-by: Matthew N. Dodd <Matthew.Dodd@sparta.com>
Signed-off-by: Miguel Rodel Felipe <Rodel_FM@dsi.a-star.edu.sg>
Signed-off-by: Phua Eu Gene <PHUA_Eu_Gene@dsi.a-star.edu.sg>
Signed-off-by: Khin Mi Mi Aung <Mi_Mi_AUNG@dsi.a-star.edu.sg>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-06-08 16:20:17 -04:00
David Quigley
aa9c266962 NFS: Client implementation of Labeled-NFS
This patch implements the client transport and handling support for labeled
NFS. The patch adds two functions to encode and decode the security label
recommended attribute which makes use of the LSM hooks added earlier. It also
adds code to grab the label from the file attribute structures and encode the
label to be sent back to the server.

Acked-by: James Morris <james.l.morris@oracle.com>
Signed-off-by: Matthew N. Dodd <Matthew.Dodd@sparta.com>
Signed-off-by: Miguel Rodel Felipe <Rodel_FM@dsi.a-star.edu.sg>
Signed-off-by: Phua Eu Gene <PHUA_Eu_Gene@dsi.a-star.edu.sg>
Signed-off-by: Khin Mi Mi Aung <Mi_Mi_AUNG@dsi.a-star.edu.sg>
Signed-off-by: Steve Dickson <steved@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-06-08 16:20:16 -04:00
David Quigley
14c43f7678 NFS: Add label lifecycle management
This patch adds the lifecycle management for the security label structure
introduced in an earlier patch. The label is not used yet but allocations and
freeing of the structure is handled.

Signed-off-by: Matthew N. Dodd <Matthew.Dodd@sparta.com>
Signed-off-by: Miguel Rodel Felipe <Rodel_FM@dsi.a-star.edu.sg>
Signed-off-by: Phua Eu Gene <PHUA_Eu_Gene@dsi.a-star.edu.sg>
Signed-off-by: Khin Mi Mi Aung <Mi_Mi_AUNG@dsi.a-star.edu.sg>
Signed-off-by: Steve Dickson <steved@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-06-08 16:20:15 -04:00
David Quigley
1775fd3e80 NFS:Add labels to client function prototypes
After looking at all of the nfsv4 operations the label structure has been added
to the prototypes of the functions which can transmit label data.

Signed-off-by: Matthew N. Dodd <Matthew.Dodd@sparta.com>
Signed-off-by: Miguel Rodel Felipe <Rodel_FM@dsi.a-star.edu.sg>
Signed-off-by: Phua Eu Gene <PHUA_Eu_Gene@dsi.a-star.edu.sg>
Signed-off-by: Khin Mi Mi Aung <Mi_Mi_AUNG@dsi.a-star.edu.sg>
Signed-off-by: Steve Dickson <steved@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-06-08 16:20:15 -04:00
David Quigley
a09df2ca23 NFSv4: Extend fattr bitmaps to support all 3 words
The fattr handling bitmap code only uses the first two fattr words sofar. This
patch adds the 3rd word to being sent but doesn't populate it yet.

Signed-off-by: Miguel Rodel Felipe <Rodel_FM@dsi.a-star.edu.sg>
Signed-off-by: Phua Eu Gene <PHUA_Eu_Gene@dsi.a-star.edu.sg>
Signed-off-by: Khin Mi Mi Aung <Mi_Mi_AUNG@dsi.a-star.edu.sg>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-06-08 16:20:14 -04:00
Steve Dickson
e058f70b80 NFSv4: Introduce new label structure
In order to mimic the way that NFSv4 ACLs are implemented we have created a
structure to be used to pass label data up and down the call chain. This patch
adds the new structure and new members to the required NFSv4 call structures.

Signed-off-by: Matthew N. Dodd <Matthew.Dodd@sparta.com>
Signed-off-by: Miguel Rodel Felipe <Rodel_FM@dsi.a-star.edu.sg>
Signed-off-by: Phua Eu Gene <PHUA_Eu_Gene@dsi.a-star.edu.sg>
Signed-off-by: Khin Mi Mi Aung <Mi_Mi_AUNG@dsi.a-star.edu.sg>
Signed-off-by: Steve Dickson <steved@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-06-08 16:20:14 -04:00
Steve Dickson
42c2c4249c NFSv4.2: Added NFS v4.2 support to the NFS client
This enable NFSv4.2 support. To enable this code the
CONFIG_NFS_V4_2 Kconfig define needs to be set and
the -o v4.2 mount option need to be used.

Signed-off-by: Steve Dickson <steved@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-06-08 16:20:13 -04:00
David Quigley
649f6e7718 LSM: Add flags field to security_sb_set_mnt_opts for in kernel mount data.
There is no way to differentiate if a text mount option is passed from user
space or the kernel. A flags field is being added to the
security_sb_set_mnt_opts hook to allow for in kernel security flags to be sent
to the LSM for processing in addition to the text options received from mount.
This patch also updated existing code to fix compilation errors.

Acked-by: Eric Paris <eparis@redhat.com>
Acked-by: James Morris <james.l.morris@oracle.com>
Signed-off-by: David P. Quigley <dpquigl@tycho.nsa.gov>
Signed-off-by: Miguel Rodel Felipe <Rodel_FM@dsi.a-star.edu.sg>
Signed-off-by: Phua Eu Gene <PHUA_Eu_Gene@dsi.a-star.edu.sg>
Signed-off-by: Khin Mi Mi Aung <Mi_Mi_AUNG@dsi.a-star.edu.sg>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-06-08 16:20:12 -04:00
Josef Bacik
13e6c37b98 Btrfs: stop all workers before cleaning up roots
Dave reported a panic because the extent_root->commit_root was NULL in the
caching kthread.  That is because we just unset it in free_root_pointers, which
is not the correct thing to do, we have to either wait for the caching kthread
to complete or hold the extent_commit_sem lock so we know the thread has exited.
This patch makes the kthreads all stop first and then we do our cleanup.  This
should fix the race.  Thanks,

Reported-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-08 15:11:35 -04:00
Liu Bo
2932505abe Btrfs: fix use-after-free bug during umount
Commit be283b2e67
(    Btrfs: use helper to cleanup tree roots) introduced the following bug,

 BUG: unable to handle kernel NULL pointer dereference at 0000000000000034
 IP: [<ffffffffa039368c>] extent_buffer_get+0x4/0xa [btrfs]
[...]
 Pid: 2463, comm: btrfs-cache-1 Tainted: G           O 3.9.0+ #4 innotek GmbH VirtualBox/VirtualBox
 RIP: 0010:[<ffffffffa039368c>]  [<ffffffffa039368c>] extent_buffer_get+0x4/0xa [btrfs]
 Process btrfs-cache-1 (pid: 2463, threadinfo ffff880112d60000, task ffff880117679730)
[...]
 Call Trace:
  [<ffffffffa0398a99>] btrfs_search_slot+0x104/0x64d [btrfs]
  [<ffffffffa039aea4>] btrfs_next_old_leaf+0xa7/0x334 [btrfs]
  [<ffffffffa039b141>] btrfs_next_leaf+0x10/0x12 [btrfs]
  [<ffffffffa039ea13>] caching_thread+0x1a3/0x2e0 [btrfs]
  [<ffffffffa03d8811>] worker_loop+0x14b/0x48e [btrfs]
  [<ffffffffa03d86c6>] ? btrfs_queue_worker+0x25c/0x25c [btrfs]
  [<ffffffff81068d3d>] kthread+0x8d/0x95
  [<ffffffff81068cb0>] ? kthread_freezable_should_stop+0x43/0x43
  [<ffffffff8151e5ac>] ret_from_fork+0x7c/0xb0
  [<ffffffff81068cb0>] ? kthread_freezable_should_stop+0x43/0x43
RIP  [<ffffffffa039368c>] extent_buffer_get+0x4/0xa [btrfs]

We've free'ed commit_root before actually getting to free block groups where
caching thread needs valid extent_root->commit_root.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-06-08 15:10:01 -04:00
Josef Bacik
a9995eece3 Btrfs: init relocate extent_io_tree with a mapping
Dave reported a NULL pointer deref.  This is caused because he thought he'd be
smart and add sanity checks to the extent_io bit operations, but he didn't
expect a tree to have a NULL mapping.  To fix this we just need to init the
relocation's processed_blocks with the btree_inode->i_mapping.  Thanks,

Reported-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-06-08 15:07:53 -04:00
Naohiro Aota
6379ef9fb2 btrfs: Drop inode if inode root is NULL
There is a path where btrfs_drop_inode() is called with its inode's root
is NULL: In btrfs_new_inode(), when btrfs_set_inode_index() fails,
iput() is called. We should handle this case before taking look at the
root->root_item.

Signed-off-by: Naohiro Aota <naota@elisp.net>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-06-08 15:07:53 -04:00
Josef Bacik
7b5ff90ed0 Btrfs: don't delete fs_roots until after we cleanup the transaction
We get a use after free if we had a transaction to cleanup since there could be
delayed inodes which refer to their respective fs_root.  Thanks

Reported-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-06-08 15:07:53 -04:00
Tyler Hicks
0df5ed65c1 eCryptfs: Make extent and scatterlist crypt function parameters similar
The 'dest' abbreviation is only used in crypt_scatterlist(), while all
other functions in crypto.c use 'dst' so dest_sg should be renamed to
dst_sg.

The crypt_stat parameter is typically the first parameter in internal
eCryptfs functions so crypt_stat and dst_page should be swapped in
crypt_extent().

Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
2013-06-07 17:28:29 -07:00
Tyler Hicks
406c93df09 eCryptfs: Collapse crypt_page_offset() into crypt_extent()
crypt_page_offset() simply initialized the two scatterlists and called
crypt_scatterlist() so it is simple enough to move into the only
function that calls it.

Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
2013-06-07 17:28:28 -07:00