Commit graph

38980 commits

Author SHA1 Message Date
Jaegeuk Kim
e7a2bf2283 f2fs: fix counting inline_data inode numbers
This patch fixes wrongly counting inline_data inode numbers.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-11-03 16:07:33 -08:00
Jaegeuk Kim
3289c061c5 f2fs: add stat info for inline_dentry inodes
This patch adds status information for inline_dentry inodes.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-11-03 16:07:33 -08:00
Jaegeuk Kim
bce8d11207 f2fs: avoid deadlock on init_inode_metadata
Previously, init_inode_metadata does not hold any parent directory's inode
page. So, f2fs_init_acl can grab its parent inode page without any problem.
But, when we use inline_dentry, that page is grabbed during f2fs_add_link,
so that we can fall into deadlock condition like below.

INFO: task mknod:11006 blocked for more than 120 seconds.
      Tainted: G           OE  3.17.0-rc1+ #13
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
mknod           D ffff88003fc94580     0 11006  11004 0x00000000
 ffff880007717b10 0000000000000002 ffff88003c323220 ffff880007717fd8
 0000000000014580 0000000000014580 ffff88003daecb30 ffff88003c323220
 ffff88003fc94e80 ffff88003ffbb4e8 ffff880007717ba0 0000000000000002
Call Trace:
 [<ffffffff8173dc40>] ? bit_wait+0x50/0x50
 [<ffffffff8173d4cd>] io_schedule+0x9d/0x130
 [<ffffffff8173dc6c>] bit_wait_io+0x2c/0x50
 [<ffffffff8173da3b>] __wait_on_bit_lock+0x4b/0xb0
 [<ffffffff811640a7>] __lock_page+0x67/0x70
 [<ffffffff810acf50>] ? autoremove_wake_function+0x40/0x40
 [<ffffffff811652cc>] pagecache_get_page+0x14c/0x1e0
 [<ffffffffa029afa9>] get_node_page+0x59/0x130 [f2fs]
 [<ffffffffa02a63ad>] read_all_xattrs+0x24d/0x430 [f2fs]
 [<ffffffffa02a6ca2>] f2fs_getxattr+0x52/0xe0 [f2fs]
 [<ffffffffa02a7481>] f2fs_get_acl+0x41/0x2d0 [f2fs]
 [<ffffffff8122d847>] get_acl+0x47/0x70
 [<ffffffff8122db5a>] posix_acl_create+0x5a/0x150
 [<ffffffffa02a7759>] f2fs_init_acl+0x29/0xcb [f2fs]
 [<ffffffffa0286a8d>] init_inode_metadata+0x5d/0x340 [f2fs]
 [<ffffffffa029253a>] f2fs_add_inline_entry+0x12a/0x2e0 [f2fs]
 [<ffffffffa0286ea5>] __f2fs_add_link+0x45/0x4a0 [f2fs]
 [<ffffffffa028b5b6>] ? f2fs_new_inode+0x146/0x220 [f2fs]
 [<ffffffffa028b816>] f2fs_mknod+0x86/0xf0 [f2fs]
 [<ffffffff811e3ec1>] vfs_mknod+0xe1/0x160
 [<ffffffff811e4b26>] SyS_mknod+0x1f6/0x200
 [<ffffffff81741d7f>] tracesys+0xe1/0xe6

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-11-03 16:07:33 -08:00
Jaegeuk Kim
59a0615540 f2fs: fix to wait correct block type
The inode page needs to wait NODE block io.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-11-03 16:07:33 -08:00
Jaegeuk Kim
4e6ebf6d49 f2fs: reuse find_in_block code for find_in_inline_dir
This patch removes redundant copied code in find_in_inline_dir.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-11-03 16:07:32 -08:00
Jaegeuk Kim
a82afa2019 f2fs: reuse room_for_filename for inline dentry operation
This patch introduces to reuse the existing room_for_filename for inline dentry
operation.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-11-03 16:07:32 -08:00
Chao Yu
622f28ae9b f2fs: enable inline dir handling
Add inline dir functions into normal dir ops' function to handle inline ops.
Besides, we enable inline dir mode when a new dir inode is created if
inline_data option is on.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-11-03 16:07:32 -08:00
Chao Yu
201a05be96 f2fs: add key function to handle inline dir
Adds Functions to implement inline dir init/lookup/insert/delete/convert ops.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
[Jaegeuk Kim: remove needless reserved area copy, pointed by Dan Carpenter]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-11-03 16:07:31 -08:00
Chao Yu
dbeacf02eb f2fs: export dir operations for inline dir
This patch exports some dir operations for inline dir, additionally introduces
f2fs_drop_nlink from f2fs_delete_entry for reusing by inline dir function.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-11-03 16:07:31 -08:00
Chao Yu
5efd3c6f1b f2fs: add a new mount option for inline dir
Adds a new mount option 'inline_dentry' for inline dir.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-11-03 16:07:31 -08:00
Chao Yu
34d67debe0 f2fs: add infra struct and helper for inline dir
This patch defines macro/inline dentry structure, and adds some helpers for
inline dir infrastructure.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-11-03 16:07:31 -08:00
Jaegeuk Kim
af41d3ee00 f2fs: avoid infinite loop at cp_error
This patch avoids an infinite loop in sync_dirty_inode_page when -EIO was
detected.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-11-03 16:07:31 -08:00
Jaegeuk Kim
4a257ed677 f2fs: avoid build warning
This patch removes build warning.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-11-03 16:07:30 -08:00
Jaegeuk Kim
13fd8f89f6 f2fs: fix to call f2fs_unlock_op
This patch fixes to call f2fs_unlock_op, which was missing before.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-11-03 16:07:30 -08:00
Jaegeuk Kim
9ba69cf987 f2fs: avoid to allocate when inline_data was written
The sceanrio is like this.
inline_data   i_size     page                 write_begin/vm_page_mkwrite
  X             30       dirty_page
  X             30                            write to #4096 position
  X             30       get_dnode_of_data    wait for get_dnode_of_data
  O             30       write inline_data
  O             30                            get_dnode_of_data
  O             30                            reserve data block
..

In this case, we have #0 = NEW_ADDR and inline_data as well.
We should not allow this condition for further access.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-11-03 16:07:30 -08:00
Jaegeuk Kim
a78186ebe5 f2fs: use highmem for directory pages
This patch fixes to use highmem for directory pages.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-11-03 16:07:30 -08:00
Jaegeuk Kim
1ce86bf6f8 f2fs: fix race conditon on truncation with inline_data
Let's consider the following scenario.

blkaddr[0] inline_data i_size  i_blocks writepage           truncate
  NEW        X        4096        2    dirty page #0
  NEW        X         0                                    change i_size
  NEW        X         0          2    f2fs_write_inline_data
  NEW        X         0          2    get_dnode_of_data
  NEW        X         0          2    truncate_data_blocks_range
  NULL       O         0          1    memcpy(inline_data)
  NULL       O         0          1    f2fs_put_dnode
  NULL       O         0          1                         f2fs_truncate
  NULL       O         0          1                         get_dnode_of_data
  NULL       O         0          1                       *invalid block addr*

This patch adds checking inline_data flag during f2fs_truncate not to refer
corrupted block indices.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-11-03 16:07:29 -08:00
Jaegeuk Kim
c08a690b46 f2fs: should truncate any allocated block for inline_data write
When trying to write inline_data, we should truncate any data block allocated
and pointed by the inode block.
We should consider the data index is not 0.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-11-03 16:07:29 -08:00
Jaegeuk Kim
cbcb2872e3 f2fs: invalidate inmemory page
If user truncates file's data, we should truncate inmemory pages too.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-11-03 16:07:29 -08:00
Jaegeuk Kim
34ba94bac9 f2fs: do not make dirty any inmemory pages
This patch let inmemory pages be clean all the time.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-11-03 16:07:29 -08:00
Al Viro
ca5358ef75 deal with deadlock in d_walk()
... by not hitting rename_retry for reasons other than rename having
happened.  In other words, do _not_ restart when finding that
between unlocking the child and locking the parent the former got
into __dentry_kill().  Skip the killed siblings instead...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-11-03 15:22:16 -05:00
Al Viro
946e51f2bf move d_rcu from overlapping d_child to overlapping d_alias
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-11-03 15:20:29 -05:00
Bob Peterson
1a8550332a GFS2: If we use up our block reservation, request more next time
If we run out of blocks for a given multi-block allocation, we obviously
did not reserve enough. We should reserve more blocks for the next
reservation to reduce fragmentation. This patch increases the size hint
for reservations when they run out.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-11-03 19:26:54 +00:00
Bob Peterson
33ad5d5428 GFS2: Only increase rs_sizehint
If an application does a sequence of (1) big write, (2) little write
we don't necessarily want to reset the size hint based on the smaller
size. The fact that they did any big writes implies they may do more,
and therefore we should try to allocate bigger block reservations, even
if the last few were small writes. Therefore this patch changes function
gfs2_size_hint so that the size hint can only grow; it cannot shrink.
This is especially important where there are multiple writers.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-11-03 19:25:41 +00:00
Bob Peterson
0e27c18c30 GFS2: Set of distributed preferences for rgrps
This patch tries to use the journal numbers to evenly distribute
which node prefers which resource group for block allocations. This
is to help performance.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-11-03 19:24:49 +00:00
Fabian Frederick
37975f1503 GFS2: directly return gfs2_dir_check()
No need to store gfs2_dir_check result and test it before returning.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-11-03 19:23:32 +00:00
Linus Torvalds
7e05b807b9 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull VFS fixes from Al Viro:
 "A bunch of assorted fixes, most of them followups to overlayfs merge"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  ovl: initialize ->is_cursor
  Return short read or 0 at end of a raw device, not EIO
  isofs: don't bother with ->d_op for normal case
  isofs_cmp(): we'll never see a dentry for . or ..
  overlayfs: fix lockdep misannotation
  ovl: fix check for cursor
  overlayfs: barriers for opening upper-layer directory
  rcu: Provide counterpart to rcu_dereference() for non-RCU situations
  staging: android: logger: Fix log corruption regression
2014-11-02 10:28:43 -08:00
Linus Torvalds
4f4274af70 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "Filipe is nailing down some problems with our skinny extent variation,
  and Dave's patch fixes endian problems in the new super block checks"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: fix race that makes btrfs_lookup_extent_info miss skinny extent items
  Btrfs: properly clean up btrfs_end_io_wq_cache
  Btrfs: fix invalid leaf slot access in btrfs_lookup_extent()
  btrfs: use macro accessors in superblock validation checks
2014-11-01 10:41:26 -07:00
Linus Torvalds
32e8fd2f8e A set of miscellaneous ext4 bug fixes for 3.18.
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJUVAF/AAoJENNvdpvBGATwEbAQALNiAIChEyJTnQDkAQc2wqqn
 dv8NQmFr5aefc63A/+n/yJJGrQZtKs0ceh29ty5ksYLFXzUdc2ctFg6vBmllQfbz
 PQawAk2gOkF8zfVuqiQU7X+wTBpGmGXTa8HY+WJTtk0pBfhl+p0PDCYsWXMwZJ1D
 tAZpxJ4AmPc7A4hApWOvce6r7Xg24vZk/8UA93Tif9AkeY6VoN272Hx5b/UGmBHY
 RCEgpowuiIY38bghtLh5+T0J98/EQNof46cEHgGI9nIDZeXRzgvDojE5bLI0/IS/
 K07MjYlm/WFWsLFkgNJkTiqEXgnji9BNYRF1xxUjMMBAR4+fnFLw9kXXgcETrPCx
 U7lHOhs8M2FK40cWhUDz/tukvL4S4lQwPEeqBPlRE8J5/twRyXHeZDp4F7LOobwq
 mk6AajSJlP+05XwXOuCx7Hcf9uxjw/IpqhBS5IZxy8Nn3T2guPlY9wMhYU1RYFws
 54FeE76SJ8EDgjVK/txj7rgh11GggWsjsdXvftSElM2DsKsqYEOKAvDzvwmbm7eV
 dsFOlRB6B/X4UpiAC2MiPJynYg9TJ7LkVBzDZeZ/fbm7JhTqChSJDzapqdrmNPIY
 SQqwLmFXnHqaw6HNitZ5Bs+fD6nfvKqy85NeImxE3lhLWDuiTt77Y3o80IW30TgN
 5bnuXq8Rkukrxs/VDvPq
 =kI6P
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 bugfixes from Ted Ts'o:
 "A set of miscellaneous ext4 bug fixes for 3.18"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: make ext4_ext_convert_to_initialized() return proper number of blocks
  ext4: bail early when clearing inode journal flag fails
  ext4: bail out from make_indexed_dir() on first error
  jbd2: use a better hash function for the revoke table
  ext4: prevent bugon on race between write/fcntl
  ext4: remove extent status procfs files if journal load fails
  ext4: disallow changing journal_csum option during remount
  ext4: enable journal checksum when metadata checksum feature enabled
  ext4: fix oops when loading block bitmap failed
  ext4: fix overflow when updating superblock backups after resize
2014-10-31 16:22:29 -07:00
Linus Torvalds
e2488ab6ab Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull quota and ext3 fixes from Jan Kara.

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  fs, jbd: use a more generic hash function
  quota: Properly return errors from dquot_writeback_dquots()
  ext3: Don't check quota format when there are no quota files
2014-10-31 16:18:47 -07:00
Al Viro
a7400222e3 new helper: is_root_inode()
replace open-coded instances

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-10-31 17:48:54 -04:00
Miklos Szeredi
ac7576f4b1 vfs: make first argument of dir_context.actor typed
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-10-31 17:48:54 -04:00
Miklos Szeredi
9f2f7d4c8d ovl: initialize ->is_cursor
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-10-31 17:47:51 -04:00
David Jeffery
b2de525f09 Return short read or 0 at end of a raw device, not EIO
Author: David Jeffery <djeffery@redhat.com>
Changes to the basic direct I/O code have broken the raw driver when reading
to the end of a raw device.  Instead of returning a short read for a read that
extends partially beyond the device's end or 0 when at the end of the device,
these reads now return EIO.

The raw driver needs the same end of device handling as was added for normal
block devices.  Using blkdev_read_iter, which has the needed size checks,
prevents the EIO conditions at the end of the device.

Signed-off-by: David Jeffery <djeffery@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-10-31 06:33:26 -04:00
Al Viro
b0afd8e5db isofs: don't bother with ->d_op for normal case
we only need it for joliet and case-insensitive mounts

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-10-31 06:33:17 -04:00
Eric Rannaud
69a91c237a fs: allow open(dir, O_TMPFILE|..., 0) with mode 0
The man page for open(2) indicates that when O_CREAT is specified, the
'mode' argument applies only to future accesses to the file:

	Note that this mode applies only to future accesses of the newly
	created file; the open() call that creates a read-only file
	may well return a read/write file descriptor.

The man page for open(2) implies that 'mode' is treated identically by
O_CREAT and O_TMPFILE.

O_TMPFILE, however, behaves differently:

	int fd = open("/tmp", O_TMPFILE | O_RDWR, 0);
	assert(fd == -1);
	assert(errno == EACCES);

	int fd = open("/tmp", O_TMPFILE | O_RDWR, 0600);
	assert(fd > 0);

For O_CREAT, do_last() sets acc_mode to MAY_OPEN only:

	if (*opened & FILE_CREATED) {
		/* Don't check for write permission, don't truncate */
		open_flag &= ~O_TRUNC;
		will_truncate = false;
		acc_mode = MAY_OPEN;
		path_to_nameidata(path, nd);
		goto finish_open_created;
	}

But for O_TMPFILE, do_tmpfile() passes the full op->acc_mode to
may_open().

This patch lines up the behavior of O_TMPFILE with O_CREAT. After the
inode is created, may_open() is called with acc_mode = MAY_OPEN, in
do_tmpfile().

A different, but related glibc bug revealed the discrepancy:
https://sourceware.org/bugzilla/show_bug.cgi?id=17523

The glibc lazily loads the 'mode' argument of open() and openat() using
va_arg() only if O_CREAT is present in 'flags' (to support both the 2
argument and the 3 argument forms of open; same idea for openat()).
However, the glibc ignores the 'mode' argument if O_TMPFILE is in
'flags'.

On x86_64, for open(), it magically works anyway, as 'mode' is in
RDX when entering open(), and is still in RDX on SYSCALL, which is where
the kernel looks for the 3rd argument of a syscall.

But openat() is not quite so lucky: 'mode' is in RCX when entering the
glibc wrapper for openat(), while the kernel looks for the 4th argument
of a syscall in R10. Indeed, the syscall calling convention differs from
the regular calling convention in this respect on x86_64. So the kernel
sees mode = 0 when trying to use glibc openat() with O_TMPFILE, and
fails with EACCES.

Signed-off-by: Eric Rannaud <e@nanocritical.com>
Acked-by: Andy Lutomirski <luto@amacapital.net>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-30 15:50:13 -07:00
Jan Kara
ae9e9c6aee ext4: make ext4_ext_convert_to_initialized() return proper number of blocks
ext4_ext_convert_to_initialized() can return more blocks than are
actually allocated from map->m_lblk in case where initial part of the
on-disk extent is zeroed out. Luckily this doesn't have serious
consequences because the caller currently uses the return value
only to unmap metadata buffers. Anyway this is a data
corruption/exposure problem waiting to happen so fix it.

Coverity-id: 1226848
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-10-30 10:53:17 -04:00
Jan Kara
4f879ca687 ext4: bail early when clearing inode journal flag fails
When clearing inode journal flag, we call jbd2_journal_flush() to force
all the journalled data to their final locations. Currently we ignore
when this fails and continue clearing inode journal flag. This isn't a
big problem because when jbd2_journal_flush() fails, journal is likely
aborted anyway. But it can still lead to somewhat confusing results so
rather bail out early.

Coverity-id: 989044
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-10-30 10:53:17 -04:00
Jan Kara
6050d47adc ext4: bail out from make_indexed_dir() on first error
When ext4_handle_dirty_dx_node() or ext4_handle_dirty_dirent_node()
fail, there's really something wrong with the fs and there's no point in
continuing further. Just return error from make_indexed_dir() in that
case. Also initialize frames array so that if we return early due to
error, dx_release() doesn't try to dereference uninitialized memory
(which could happen also due to error in do_split()).

Coverity-id: 741300
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2014-10-30 10:53:17 -04:00
Theodore Ts'o
d48458d4a7 jbd2: use a better hash function for the revoke table
The old hash function didn't work well for 64-bit block numbers, and
used undefined (negative) shift right behavior.  Use the generic
64-bit hash function instead.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reported-by: Andrey Ryabinin <a.ryabinin@samsung.com>
2014-10-30 10:53:17 -04:00
Dmitry Monakhov
a41537e69b ext4: prevent bugon on race between write/fcntl
O_DIRECT flags can be toggeled via fcntl(F_SETFL). But this value checked
twice inside ext4_file_write_iter() and __generic_file_write() which
result in BUG_ON inside ext4_direct_IO.

Let's initialize iocb->private unconditionally.

TESTCASE: xfstest:generic/036  https://patchwork.ozlabs.org/patch/402445/

#TYPICAL STACK TRACE:
kernel BUG at fs/ext4/inode.c:2960!
invalid opcode: 0000 [#1] SMP
Modules linked in: brd iTCO_wdt lpc_ich mfd_core igb ptp dm_mirror dm_region_hash dm_log dm_mod
CPU: 6 PID: 5505 Comm: aio-dio-fcntl-r Not tainted 3.17.0-rc2-00176-gff5c017 #161
Hardware name: Intel Corporation W2600CR/W2600CR, BIOS SE5C600.86B.99.99.x028.061320111235 06/13/2011
task: ffff88080e95a7c0 ti: ffff88080f908000 task.ti: ffff88080f908000
RIP: 0010:[<ffffffff811fabf2>]  [<ffffffff811fabf2>] ext4_direct_IO+0x162/0x3d0
RSP: 0018:ffff88080f90bb58  EFLAGS: 00010246
RAX: 0000000000000400 RBX: ffff88080fdb2a28 RCX: 00000000a802c818
RDX: 0000040000080000 RSI: ffff88080d8aeb80 RDI: 0000000000000001
RBP: ffff88080f90bbc8 R08: 0000000000000000 R09: 0000000000001581
R10: 0000000000000000 R11: 0000000000000000 R12: ffff88080d8aeb80
R13: ffff88080f90bbf8 R14: ffff88080fdb28c8 R15: ffff88080fdb2a28
FS:  00007f23b2055700(0000) GS:ffff880818400000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f23b2045000 CR3: 000000080cedf000 CR4: 00000000000407e0
Stack:
 ffff88080f90bb98 0000000000000000 7ffffffffffffffe ffff88080fdb2c30
 0000000000000200 0000000000000200 0000000000000001 0000000000000200
 ffff88080f90bbc8 ffff88080fdb2c30 ffff88080f90be08 0000000000000200
Call Trace:
 [<ffffffff8112ca9d>] generic_file_direct_write+0xed/0x180
 [<ffffffff8112f2b2>] __generic_file_write_iter+0x222/0x370
 [<ffffffff811f495b>] ext4_file_write_iter+0x34b/0x400
 [<ffffffff811bd709>] ? aio_run_iocb+0x239/0x410
 [<ffffffff811bd709>] ? aio_run_iocb+0x239/0x410
 [<ffffffff810990e5>] ? local_clock+0x25/0x30
 [<ffffffff810abd94>] ? __lock_acquire+0x274/0x700
 [<ffffffff811f4610>] ? ext4_unwritten_wait+0xb0/0xb0
 [<ffffffff811bd756>] aio_run_iocb+0x286/0x410
 [<ffffffff810990e5>] ? local_clock+0x25/0x30
 [<ffffffff810ac359>] ? lock_release_holdtime+0x29/0x190
 [<ffffffff811bc05b>] ? lookup_ioctx+0x4b/0xf0
 [<ffffffff811bde3b>] do_io_submit+0x55b/0x740
 [<ffffffff811bdcaa>] ? do_io_submit+0x3ca/0x740
 [<ffffffff811be030>] SyS_io_submit+0x10/0x20
 [<ffffffff815ce192>] system_call_fastpath+0x16/0x1b
Code: 01 48 8b 80 f0 01 00 00 48 8b 18 49 8b 45 10 0f 85 f1 01 00 00 48 03 45 c8 48 3b 43 48 0f 8f e3 01 00 00 49 83 7c
24 18 00 75 04 <0f> 0b eb fe f0 ff 83 ec 01 00 00 49 8b 44 24 18 8b 00 85 c0 89
RIP  [<ffffffff811fabf2>] ext4_direct_IO+0x162/0x3d0
 RSP <ffff88080f90bb58>

Reported-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Cc: stable@vger.kernel.org
2014-10-30 10:53:16 -04:00
Darrick J. Wong
50460fe8c6 ext4: remove extent status procfs files if journal load fails
If we can't load the journal, remove the procfs files for the extent
status information file to avoid leaking resources.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2014-10-30 10:53:16 -04:00
Darrick J. Wong
6b992ff256 ext4: disallow changing journal_csum option during remount
ext4 does not permit changing the metadata or journal checksum feature
flag while mounted.  Until we decide to support that, don't allow a
remount to change the journal_csum flag (right now we silently fail to
change anything).

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-10-30 10:53:16 -04:00
Darrick J. Wong
98c1a7593f ext4: enable journal checksum when metadata checksum feature enabled
If metadata checksumming is turned on for the FS, we need to tell the
journal to use checksumming too.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2014-10-30 10:53:16 -04:00
Jan Kara
599a9b77ab ext4: fix oops when loading block bitmap failed
When we fail to load block bitmap in __ext4_new_inode() we will
dereference NULL pointer in ext4_journal_get_write_access(). So check
for error from ext4_read_block_bitmap().

Coverity-id: 989065
Cc: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-10-30 10:53:16 -04:00
Jan Kara
9378c6768e ext4: fix overflow when updating superblock backups after resize
When there are no meta block groups update_backups() will compute the
backup block in 32-bit arithmetics thus possibly overflowing the block
number and corrupting the filesystem. OTOH filesystems without meta
block groups larger than 16 TB should be rare. Fix the problem by doing
the counting in 64-bit arithmetics.

Coverity-id: 741252
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
2014-10-30 10:52:57 -04:00
Joe Perches
1f33c41c03 seq_file: Rename seq_overflow() to seq_has_overflowed() and make public
The return values of seq_printf/puts/putc are frequently misused.

Start down a path to remove all the return value uses of these
functions.

Move the seq_overflow() to a global inlined function called
seq_has_overflowed() that can be used by the users of seq_file() calls.

Update the documentation to not show return types for seq_printf
et al.  Add a description of seq_has_overflowed().

Link: http://lkml.kernel.org/p/848ac7e3d1c31cddf638a8526fa3c59fa6fdeb8a.1412031505.git.joe@perches.com

Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Joe Perches <joe@perches.com>
[ Reworked the original patch from Joe ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-10-29 20:26:06 -04:00
Linus Torvalds
a7ca10f263 Merge branch 'akpm' (incoming from Andrew Morton)
Merge misc fixes from Andrew Morton:
 "21 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (21 commits)
  mm/balloon_compaction: fix deflation when compaction is disabled
  sh: fix sh770x SCIF memory regions
  zram: avoid NULL pointer access in concurrent situation
  mm/slab_common: don't check for duplicate cache names
  ocfs2: fix d_splice_alias() return code checking
  mm: rmap: split out page_remove_file_rmap()
  mm: memcontrol: fix missed end-writeback page accounting
  mm: page-writeback: inline account_page_dirtied() into single caller
  lib/bitmap.c: fix undefined shift in __bitmap_shift_{left|right}()
  drivers/rtc/rtc-bq32k.c: fix register value
  memory-hotplug: clear pgdat which is allocated by bootmem in try_offline_node()
  drivers/rtc/rtc-s3c.c: fix initialization failure without rtc source clock
  kernel/kmod: fix use-after-free of the sub_info structure
  drivers/rtc/rtc-pm8xxx.c: rework to support pm8941 rtc
  mm, thp: fix collapsing of hugepages on madvise
  drivers: of: add return value to of_reserved_mem_device_init()
  mm: free compound page with correct order
  gcov: add ARM64 to GCOV_PROFILE_ALL
  fsnotify: next_i is freed during fsnotify_unmount_inodes.
  mm/compaction.c: avoid premature range skip in isolate_migratepages_range
  ...
2014-10-29 16:38:48 -07:00
Brian Foster
5d11fb4b9a xfs: rework zero range to prevent invalid i_size updates
The zero range operation is analogous to fallocate with the exception of
converting the range to zeroes. E.g., it attempts to allocate zeroed
blocks over the range specified by the caller. The XFS implementation
kills all delalloc blocks currently over the aligned range, converts the
range to allocated zero blocks (unwritten extents) and handles the
partial pages at the ends of the range by sending writes through the
pagecache.

The current implementation suffers from several problems associated with
inode size. If the aligned range covers an extending I/O, said I/O is
discarded and an inode size update from a previous write never makes it
to disk. Further, if an unaligned zero range extends beyond eof, the
page write induced for the partial end page can itself increase the
inode size, even if the zero range request is not supposed to update
i_size (via KEEP_SIZE, similar to an fallocate beyond EOF).

The latter behavior not only incorrectly increases the inode size, but
can lead to stray delalloc blocks on the inode. Typically, post-eof
preallocation blocks are either truncated on release or inode eviction
or explicitly written to by xfs_zero_eof() on natural file size
extension. If the inode size increases due to zero range, however,
associated blocks leak into the address space having never been
converted or mapped to pagecache pages. A direct I/O to such an
uncovered range cannot convert the extent via writeback and will BUG().
For example:

$ xfs_io -fc "pwrite 0 128k" -c "fzero -k 1m 54321" <file>
...
$ xfs_io -d -c "pread 128k 128k" <file>
<BUG>

If the entire delalloc extent happens to not have page coverage
whatsoever (e.g., delalloc conversion couldn't find a large enough free
space extent), even a full file writeback won't convert what's left of
the extent and we'll assert on inode eviction.

Rework xfs_zero_file_space() to avoid buffered I/O for partial pages.
Use the existing hole punch and prealloc mechanisms as primitives for
zero range. This implementation is not efficient nor ideal as we
writeback dirty data over the range and remove existing extents rather
than convert to unwrittern. The former writeback, however, is currently
the only mechanism available to ensure consistency between pagecache and
extent state. Even a pagecache truncate/delalloc punch prior to hole
punch has lead to inconsistencies due to racing with writeback.

This provides a consistent, correct implementation of zero range that
survives fsstress/fsx testing without assert failures. The
implementation can be optimized from this point forward once the
fundamental issue of pagecache and delalloc extent state consistency is
addressed.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-10-30 10:35:11 +11:00
Jan Kara
7a19dee116 xfs: Check error during inode btree iteration in xfs_bulkstat()
xfs_bulkstat() doesn't check error return from xfs_btree_increment(). In
case of specific fs corruption that could result in xfs_bulkstat()
entering an infinite loop because we would be looping over the same
chunk over and over again. Fix the problem by checking the return value
and terminating the loop properly.

Coverity-id: 1231338
cc: <stable@vger.kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jie Liu <jeff.u.liu@gmail.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-10-30 10:34:52 +11:00