Commit graph

39,687 commits

Author SHA1 Message Date
Zhao Lei
5d3edd8f44 Btrfs, replace: enable dev-replace for raid56
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2014-12-03 10:20:48 +08:00
Filipe Manana
ae0ab003f2 Btrfs: fix freeing used extents after removing empty block group
There's a race between adding a block group to the list of the unused
block groups and removing an unused block group (cleaner kthread) that
leads to freeing extents that are in use or a crash during transaction
commmit. Basically the cleaner kthread, when executing
btrfs_delete_unused_bgs(), might catch the newly added block group to
the list fs_info->unused_bgs and clear the range representing the whole
group from fs_info->freed_extents[] before the task that added the block
group to the list (running update_block_group()) marked the last freed
extent as dirty in fs_info->freed_extents (pinned_extents).

That is:

     CPU 1                                CPU 2

                                  btrfs_delete_unused_bgs()
update_block_group()
   add block group to
   fs_info->unused_bgs
                                    got block group from the list
                                    clear_extent_bits for the whole
                                    block group range in freed_extents[]
   set_extent_dirty for the
   range covering the freed
   extent in freed_extents[]
   (fs_info->pinned_extents)

                                  block group deleted, and a new block
                                  group with the same logical address is
                                  created

                                  reserve space from the new block group
                                  for new data or metadata - the reserved
                                  space overlaps the range specified by
                                  CPU 1 for set_extent_dirty()

                                  commit transaction
                                    find all ranges marked as dirty in
                                    fs_info->pinned_extents, clear them
                                    and add them to the free space cache

Alternatively, if CPU 2 doesn't create a new block group with the same
logical address, we get a crash/BUG_ON at transaction commit when unpining
extent ranges because we can't find a block group for the range marked as
dirty by CPU 1. Sample trace:

[ 2163.426462] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
[ 2163.426640] Modules linked in: btrfs xor raid6_pq dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio crc32c_generic libcrc32c dm_mod nfsd auth_rpc
gss oid_registry nfs_acl nfs lockd fscache sunrpc loop psmouse parport_pc parport i2c_piix4 processor thermal_sys i2ccore evdev button pcspkr microcode serio_raw ext4 crc16 jbd2 mbcache
 sg sr_mod cdrom sd_mod crc_t10dif crct10dif_generic crct10dif_common ata_generic virtio_scsi floppy ata_piix libata e1000 scsi_mod virtio_pci virtio_ring virtio
[ 2163.428209] CPU: 0 PID: 11858 Comm: btrfs-transacti Tainted: G        W      3.17.0-rc5-btrfs-next-1+ #1
[ 2163.428519] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[ 2163.428875] task: ffff88009f2c0650 ti: ffff8801356bc000 task.ti: ffff8801356bc000
[ 2163.429157] RIP: 0010:[<ffffffffa037728e>]  [<ffffffffa037728e>] unpin_extent_range.isra.58+0x62/0x192 [btrfs]
[ 2163.429562] RSP: 0018:ffff8801356bfda8  EFLAGS: 00010246
[ 2163.429802] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[ 2163.429990] RDX: 0000000041bfffff RSI: 0000000001c00000 RDI: ffff880024307080
[ 2163.430042] RBP: ffff8801356bfde8 R08: 0000000000000068 R09: ffff88003734f118
[ 2163.430042] R10: ffff8801356bfcb8 R11: fffffffffffffb69 R12: ffff8800243070d0
[ 2163.430042] R13: 0000000083c04000 R14: ffff8800751b0f00 R15: ffff880024307000
[ 2163.430042] FS:  0000000000000000(0000) GS:ffff88013f400000(0000) knlGS:0000000000000000
[ 2163.430042] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 2163.430042] CR2: 00007ff10eb43fc0 CR3: 0000000004cb8000 CR4: 00000000000006f0
[ 2163.430042] Stack:
[ 2163.430042]  ffff8800243070d0 0000000083c08000 0000000083c07fff ffff88012d6bc800
[ 2163.430042]  ffff8800243070d0 ffff8800751b0f18 ffff8800751b0f00 0000000000000000
[ 2163.430042]  ffff8801356bfe18 ffffffffa037a481 0000000083c04000 0000000083c07fff
[ 2163.430042] Call Trace:
[ 2163.430042]  [<ffffffffa037a481>] btrfs_finish_extent_commit+0xac/0xbf [btrfs]
[ 2163.430042]  [<ffffffffa038c06d>] btrfs_commit_transaction+0x6ee/0x882 [btrfs]
[ 2163.430042]  [<ffffffffa03881f1>] transaction_kthread+0xf2/0x1a4 [btrfs]
[ 2163.430042]  [<ffffffffa03880ff>] ? btrfs_cleanup_transaction+0x3d8/0x3d8 [btrfs]
[ 2163.430042]  [<ffffffff8105966b>] kthread+0xb7/0xbf
[ 2163.430042]  [<ffffffff810595b4>] ? __kthread_parkme+0x67/0x67
[ 2163.430042]  [<ffffffff813ebeac>] ret_from_fork+0x7c/0xb0
[ 2163.430042]  [<ffffffff810595b4>] ? __kthread_parkme+0x67/0x67

So fix this by making update_block_group() first set the range as dirty
in pinned_extents before adding the block group to the unused_bgs list.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-12-02 18:19:17 -08:00
Filipe Manana
4f69cb987e Btrfs: fix crash caused by block group removal
If we remove a block group (because it became empty), we might have left
a caching_ctl structure in fs_info->caching_block_groups that points to
the block group and is accessed at transaction commit time. This results
in accessing an invalid or incorrect block group. This issue became visible
after Josef's patch "Btrfs: remove empty block groups automatically".

So if the block group is removed make sure we don't leave a dangling
caching_ctl in caching_block_groups.

Sample crash trace:

[58380.439449] BUG: unable to handle kernel paging request at ffff8801446eaeb8
[58380.439707] IP: [<ffffffffa03f6d05>] block_group_cache_done.isra.21+0xc/0x1c [btrfs]
[58380.440879] PGD 1acb067 PUD 23f5ff067 PMD 23f5db067 PTE 80000001446ea060
[58380.441220] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
[58380.441486] Modules linked in: btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc loop psmouse processor i2c_piix4 parport_pc parport pcspkr serio_raw evdev i2ccore thermal_sys microcode button ext4 crc16 jbd2 mbcache sr_mod cdrom ata_generic sg sd_mod crc_t10dif crct10dif_generic crct10dif_common virtio_scsi floppy ata_piix e1000 libata virtio_pci scsi_mod virtio_ring virtio [last unloaded: btrfs]
[58380.443238] CPU: 3 PID: 25728 Comm: btrfs-transacti Tainted: G        W      3.17.0-rc5-btrfs-next-1+ #1
[58380.443238] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[58380.443238] task: ffff88013ac82090 ti: ffff88013896c000 task.ti: ffff88013896c000
[58380.443238] RIP: 0010:[<ffffffffa03f6d05>]  [<ffffffffa03f6d05>] block_group_cache_done.isra.21+0xc/0x1c [btrfs]
[58380.443238] RSP: 0018:ffff88013896fdd8  EFLAGS: 00010283
[58380.443238] RAX: ffff880222cae850 RBX: ffff880119ba74c0 RCX: 0000000000000000
[58380.443238] RDX: 0000000000000000 RSI: ffff880185e16800 RDI: ffff8801446eaeb8
[58380.443238] RBP: ffff88013896fdd8 R08: ffff8801a9ca9fa8 R09: ffff88013896fc60
[58380.443238] R10: ffff88013896fd28 R11: 0000000000000000 R12: ffff880222cae000
[58380.443238] R13: ffff880222cae850 R14: ffff880222cae6b0 R15: ffff8801446eae00
[58380.443238] FS:  0000000000000000(0000) GS:ffff88023ed80000(0000) knlGS:0000000000000000
[58380.443238] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[58380.443238] CR2: ffff8801446eaeb8 CR3: 0000000001811000 CR4: 00000000000006e0
[58380.443238] Stack:
[58380.443238]  ffff88013896fe18 ffffffffa03fe2d5 ffff880222cae850 ffff880185e16800
[58380.443238]  ffff88000dc41c20 0000000000000000 ffff8801a9ca9f00 0000000000000000
[58380.443238]  ffff88013896fe80 ffffffffa040fbcf ffff88018b0dcdb0 ffff88013ac82090
[58380.443238] Call Trace:
[58380.443238]  [<ffffffffa03fe2d5>] btrfs_prepare_extent_commit+0x5a/0xd7 [btrfs]
[58380.443238]  [<ffffffffa040fbcf>] btrfs_commit_transaction+0x45c/0x882 [btrfs]
[58380.443238]  [<ffffffffa040c058>] transaction_kthread+0xf2/0x1a4 [btrfs]
[58380.443238]  [<ffffffffa040bf66>] ? btrfs_cleanup_transaction+0x3d8/0x3d8 [btrfs]
[58380.443238]  [<ffffffff8105966b>] kthread+0xb7/0xbf
[58380.443238]  [<ffffffff810595b4>] ? __kthread_parkme+0x67/0x67
[58380.443238]  [<ffffffff813ebeac>] ret_from_fork+0x7c/0xb0
[58380.443238]  [<ffffffff810595b4>] ? __kthread_parkme+0x67/0x67

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-12-02 18:19:17 -08:00
Filipe Manana
292cbd51ec Btrfs: fix invalid block group rbtree access after bg is removed
If we grab a block group, for example in btrfs_trim_fs(), we will be holding
a reference on it but the block group can be removed after we got it (via
btrfs_remove_block_group), which means it will no longer be part of the
rbtree.

However, btrfs_remove_block_group() was only calling rb_erase() which leaves
the block group's rb_node left and right child pointers with the same content
they had before calling rb_erase. This was dangerous because a call to
next_block_group() would access the node's left and right child pointers (via
rb_next), which can be no longer valid.

Fix this by clearing a block group's node after removing it from the tree,
and have next_block_group() do a tree search to get the next block group
instead of using rb_next() if our block group was removed.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-12-02 18:19:17 -08:00
Miao Xie
4245215d6a Btrfs, raid56: fix use-after-free problem in the final device replace procedure on raid56
The commit c404e0dc (Btrfs: fix use-after-free in the finishing
procedure of the device replace) fixed a use-after-free problem
which happened when removing the source device at the end of device
replace, but at that time, btrfs didn't support device replace
on raid56, so we didn't fix the problem on the raid56 profile.
Currently, we implemented device replace for raid56, so we need
kick that problem out before we enable that function for raid56.

The fix method is very simple, we just increase the bio per-cpu
counter before we submit a raid56 io, and decrease the counter
when the raid56 io ends.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2014-12-03 10:18:47 +08:00
Miao Xie
7603597690 Btrfs, replace: write raid56 parity into the replace target device
This function reused the code of parity scrub, and we just write
the right parity or corrected parity into the target device before
the parity scrub end.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2014-12-03 10:18:46 +08:00
Miao Xie
2c8cdd6ee4 Btrfs, replace: write dirty pages into the replace target device
The implementation is simple:
- In order to avoid changing the code logic of btrfs_map_bio and
  RAID56, we add the stripes of the replace target devices at the
  end of the stripe array in btrfs bio, and we sort those target
  device stripes in the array. And we keep the number of the target
  device stripes in the btrfs bio.
- Except write operation on RAID56, all the other operation don't
  take the target device stripes into account.
- When we do write operation, we read the data from the common devices
  and calculate the parity. Then write the dirty data and new parity
  out, at this time, we will find the relative replace target stripes
  and wirte the relative data into it.

Note: The function that copying old data on the source device to
the target device was implemented in the past, it is similar to
the other RAID type.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2014-12-03 10:18:46 +08:00
Miao Xie
5a6ac9eacb Btrfs, raid56: support parity scrub on raid56
The implementation is:
- Read and check all the data with checksum in the same stripe.
  All the data which has checksum is COW data, and we are sure
  that it is not changed though we don't lock the stripe. because
  the space of that data just can be reclaimed after the current
  transction is committed, and then the fs can use it to store the
  other data, but when doing scrub, we hold the current transaction,
  that is that data can not be recovered, it is safe that read and check
  it out of the stripe lock.
- Lock the stripe
- Read out all the data without checksum and parity
  The data without checksum and the parity may be changed if we don't
  lock the stripe, so we need read it in the stripe lock context.
- Check the parity
- Re-calculate the new parity and write back it if the old parity
  is not right
- Unlock the stripe

If we can not read out the data or the data we read is corrupted,
we will try to repair it. If the repair fails. we will mark the
horizontal sub-stripe(pages on the same horizontal) as corrupted
sub-stripe, and we will skip the parity check and repair of that
horizontal sub-stripe.

And in order to skip the horizontal sub-stripe that has no data, we
introduce a bitmap. If there is some data on the horizontal sub-stripe,
we will the relative bit to 1, and when we check and repair the
parity, we will skip those horizontal sub-stripes that the relative
bits is 0.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2014-12-03 10:18:45 +08:00
Miao Xie
1b94b5567e Btrfs, raid56: use a variant to record the operation type
We will introduce new operation type later, if we still use integer
variant as bool variant to record the operation type, we would add new
variant and increase the size of raid bio structure. It is not good,
by this patch, we define different number for different operation,
and we can just use a variant to record the operation type.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2014-12-03 10:18:45 +08:00
Miao Xie
af8e2d1df9 Btrfs, scrub: repair the common data on RAID5/6 if it is corrupted
This patch implement the RAID5/6 common data repair function, the
implementation is similar to the scrub on the other RAID such as
RAID1, the differentia is that we don't read the data from the
mirror, we use the data repair function of RAID5/6.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2014-12-03 10:18:45 +08:00
Miao Xie
b89e1b012c Btrfs, raid56: don't change bbio and raid_map
Because we will reuse bbio and raid_map during the scrub later, it is
better that we don't change any variant of bbio and don't free it at
the end of IO request. So we introduced similar variants into the raid
bio, and don't access those bbio's variants any more.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2014-12-03 10:18:44 +08:00
Zhao Lei
6de6565075 Btrfs: remove unnecessary code of stripe_index assignment in __btrfs_map_block
stripe_index's value was set again in latter line:
stripe_index = 0;

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
2014-12-03 10:18:44 +08:00
Zhao Lei
f90523d1aa Btrfs: remove noused bbio_ret in __btrfs_map_block in condition
bbio_ret in this condition is always !NULL because previous code
already have a check-and-skip:
4908 if (!bbio_ret)
4909     goto out;

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
2014-12-03 10:18:44 +08:00
Dmitry Monakhov
14516bb7bb ext4: fix suboptimal seek_{data,hole} extents traversial
It is ridiculous practice to scan inode block by block, this technique
applicable only for old indirect files. This takes significant amount
of time for really large files. Let's reuse ext4_fiemap which already
traverse inode-tree in most optimal meaner.

TESTCASE:
ftruncate64(fd, 0);
ftruncate64(fd, 1ULL << 40);
/* lseek will spin very long time */
lseek64(fd, 0, SEEK_DATA);
lseek64(fd, 0, SEEK_HOLE);

Original report: https://lkml.org/lkml/2014/10/16/620

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-12-02 18:08:53 -05:00
Dmitry Monakhov
d952d69e26 ext4: ext4_inline_data_fiemap should respect callers argument
Currently ext4_inline_data_fiemap ignores requested arguments (start
and len) which may lead endless loop if start != 0.  Also fix incorrect
extent length determination.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-12-02 16:11:20 -05:00
Dmitry Monakhov
5cc28a9eaa ext4: prevent fsreentrance deadlock for inline_data
ext4_da_convert_inline_data_to_extent() invokes
grab_cache_page_write_begin().  grab_cache_page_write_begin performs
memory allocation, so fs-reentrance should be prohibited because we
are inside journal transaction.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-12-02 16:09:50 -05:00
Changman Lee
7dda2af83b f2fs: more fast lookup for gc_inode list
If there are many inodes that have data blocks in victim segment,
it takes long time to find a inode in gc_inode list.
Let's use radix_tree to reduce lookup time.

Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-12-02 11:02:50 -08:00
Eric W. Biederman
4fed655c41 mnt: Clear mnt_expire during pivot_root
When inspecting the pivot_root and the current mount expiry logic I
realized that pivot_root fails to clear like mount move does.

Add the missing line in case someone does the interesting feat of
moving an expirable submount.  This gives a strong guarantee that root
of the filesystem tree will never expire.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2014-12-02 10:46:51 -06:00
Eric W. Biederman
381cacb12c mnt: Carefully set CL_UNPRIVILEGED in clone_mnt
old->mnt_expiry should be ignored unless CL_EXPIRE is set.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2014-12-02 10:46:50 -06:00
Eric W. Biederman
8486a7882b mnt: Move the clear of MNT_LOCKED from copy_tree to it's callers.
Clear MNT_LOCKED in the callers of copy_tree except copy_mnt_ns, and
collect_mounts.  In copy_mnt_ns it is necessary to create an exact
copy of a mount tree, so not clearing MNT_LOCKED is important.
Similarly collect_mounts is used to take a snapshot of the mount tree
for audit logging purposes and auditing using a faithful copy of the
tree is important.

This becomes particularly significant when we start setting MNT_LOCKED
on rootfs to prevent it from being unmounted.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2014-12-02 10:46:50 -06:00
Eric W. Biederman
da362b09e4 umount: Do not allow unmounting rootfs.
Andrew Vagin <avagin@parallels.com> writes:

> #define _GNU_SOURCE
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <fcntl.h>
> #include <sched.h>
> #include <unistd.h>
> #include <sys/mount.h>
>
> int main(int argc, char **argv)
> {
> 	int fd;
>
> 	fd = open("/proc/self/ns/mnt", O_RDONLY);
> 	if (fd < 0)
> 	   return 1;
> 	   while (1) {
> 	   	 if (umount2("/", MNT_DETACH) ||
> 		        setns(fd, CLONE_NEWNS))
> 					break;
> 					}
>
> 					return 0;
> }
>
> root@ubuntu:/home/avagin# gcc -Wall nsenter.c -o nsenter
> root@ubuntu:/home/avagin# strace ./nsenter
> execve("./nsenter", ["./nsenter"], [/* 22 vars */]) = 0
> ...
> open("/proc/self/ns/mnt", O_RDONLY)     = 3
> umount("/", MNT_DETACH)                 = 0
> setns(3, 131072)                        = 0
> umount("/", MNT_DETACH
>
causes:

> [  260.548301] ------------[ cut here ]------------
> [  260.550941] kernel BUG at /build/buildd/linux-3.13.0/fs/pnode.c:372!
> [  260.552068] invalid opcode: 0000 [#1] SMP
> [  260.552068] Modules linked in: xt_CHECKSUM iptable_mangle xt_tcpudp xt_addrtype xt_conntrack ipt_MASQUERADE iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack bridge stp llc dm_thin_pool dm_persistent_data dm_bufio dm_bio_prison iptable_filter ip_tables x_tables crct10dif_pclmul crc32_pclmul ghash_clmulni_intel binfmt_misc nfsd auth_rpcgss nfs_acl aesni_intel nfs lockd aes_x86_64 sunrpc fscache lrw gf128mul glue_helper ablk_helper cryptd serio_raw ppdev parport_pc lp parport btrfs xor raid6_pq libcrc32c psmouse floppy
> [  260.552068] CPU: 0 PID: 1723 Comm: nsenter Not tainted 3.13.0-30-generic #55-Ubuntu
> [  260.552068] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> [  260.552068] task: ffff8800376097f0 ti: ffff880074824000 task.ti: ffff880074824000
> [  260.552068] RIP: 0010:[<ffffffff811e9483>]  [<ffffffff811e9483>] propagate_umount+0x123/0x130
> [  260.552068] RSP: 0018:ffff880074825e98  EFLAGS: 00010246
> [  260.552068] RAX: ffff88007c741140 RBX: 0000000000000002 RCX: ffff88007c741190
> [  260.552068] RDX: ffff88007c741190 RSI: ffff880074825ec0 RDI: ffff880074825ec0
> [  260.552068] RBP: ffff880074825eb0 R08: 00000000000172e0 R09: ffff88007fc172e0
> [  260.552068] R10: ffffffff811cc642 R11: ffffea0001d59000 R12: ffff88007c741140
> [  260.552068] R13: ffff88007c741140 R14: ffff88007c741140 R15: 0000000000000000
> [  260.552068] FS:  00007fd5c7e41740(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
> [  260.552068] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  260.552068] CR2: 00007fd5c7968050 CR3: 0000000070124000 CR4: 00000000000406f0
> [  260.552068] Stack:
> [  260.552068]  0000000000000002 0000000000000002 ffff88007c631000 ffff880074825ed8
> [  260.552068]  ffffffff811dcfac ffff88007c741140 0000000000000002 ffff88007c741160
> [  260.552068]  ffff880074825f38 ffffffff811dd12b ffffffff811cc642 0000000075640000
> [  260.552068] Call Trace:
> [  260.552068]  [<ffffffff811dcfac>] umount_tree+0x20c/0x260
> [  260.552068]  [<ffffffff811dd12b>] do_umount+0x12b/0x300
> [  260.552068]  [<ffffffff811cc642>] ? final_putname+0x22/0x50
> [  260.552068]  [<ffffffff811cc849>] ? putname+0x29/0x40
> [  260.552068]  [<ffffffff811dd88c>] SyS_umount+0xdc/0x100
> [  260.552068]  [<ffffffff8172aeff>] tracesys+0xe1/0xe6
> [  260.552068] Code: 89 50 08 48 8b 50 08 48 89 02 49 89 45 08 e9 72 ff ff ff 0f 1f 44 00 00 4c 89 e6 4c 89 e7 e8 f5 f6 ff ff 48 89 c3 e9 39 ff ff ff <0f> 0b 66 2e 0f 1f 84 00 00 00 00 00 90 66 66 66 66 90 55 b8 01
> [  260.552068] RIP  [<ffffffff811e9483>] propagate_umount+0x123/0x130
> [  260.552068]  RSP <ffff880074825e98>
> [  260.611451] ---[ end trace 11c33d85f1d4c652 ]--

Which in practice is totally uninteresting.  Only the global root user can
do it, and it is just a stupid thing to do.

However that is no excuse to allow a silly way to oops the kernel.

We can avoid this silly problem by setting MNT_LOCKED on the rootfs
mount point and thus avoid needing any special cases in the unmount
code.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2014-12-02 10:46:49 -06:00
Eric W. Biederman
b2f5d4dc38 umount: Disallow unprivileged mount force
Forced unmount affects not just the mount namespace but the underlying
superblock as well.  Restrict forced unmount to the global root user
for now.  Otherwise it becomes possible a user in a less privileged
mount namespace to force the shutdown of a superblock of a filesystem
in a more privileged mount namespace, allowing a DOS attack on root.

Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2014-12-02 10:46:48 -06:00
Eric W. Biederman
3e1866410f mnt: Implicitly add MNT_NODEV on remount when it was implicitly added by mount
Now that remount is properly enforcing the rule that you can't remove
nodev at least sandstorm.io is breaking when performing a remount.

It turns out that there is an easy intuitive solution implicitly
add nodev on remount when nodev was implicitly added on mount.

Tested-by: Cedric Bosdonnat <cbosdonnat@suse.com>
Tested-by: Richard Weinberger <richard@nod.at>
Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2014-12-02 10:46:39 -06:00
Linus Torvalds
3a18ca0613 Fix an ext4 metadata checksum regression introduced in v3.18-rc3.
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJUfTKdAAoJENNvdpvBGATwRhsQAJOWfDD1K57CFQQ0MbLdRfeA
 YzdTHlKIYpt4Dm3+Bc/LJskLrA0sn2vmxF/v0Jxr3F/agwfCODLMO8dIsB0nFX0E
 eC8fCx05titBMQLN4adWr59qQTeE1nssQdpA5SomPrJZr5pSabtj3ekFiHQJ+bWb
 9WNY737TxSPLK0ex9iDlAAp/AoxgF4K6zj/azsRY+mmlfM9+dFoprZWqWwgwl99m
 4LVx1waAnLQdU2Yj7ZYGReweFFTTOqGz4ds1GggymB3Z8Q873dVYO7vdbQWDFJgC
 TcAp8YbfrQC6/M/IaASZKVj6hwEPVMTgOs7dUeyfPtSaXBrW0WBGAhM5gFnQ6J+T
 DO4YHC+tH26GLsfBs9IZnHjAoeVZ93JFDKmxfclDs0AGY+0WgSyY8Bt6VJyoWR60
 RPbW15i/0oMSWEPxbuqQmFIqlcj5n9D80SEmvhpn7oJwkrrUMprcxcWTQN+Ca73e
 2EIOW0SHaLrkM7wpYjwlO4dgCxSZWg6QfHyznbuJcVKOw8KnDMuTEOjP2vBvwHwu
 Wax4EIGZw9XqZVI7a9Z1nd+mpUYi5KDgpS8Uo08Qz5QapWEYla3JPt76q3TwSCIz
 ExMwoRBUSMrpSoDMbyjmk4sh5ABTTWOf9SPmCdnVfzZ36EY0dJckeXj0jFqtyVdq
 p1bxjWPARBm1LfVBcQek
 =yIOT
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus_urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 bugfix from Ted Ts'o:
 "Fix an ext4 metadata checksum regression introduced in v3.18-rc3"

* tag 'ext4_for_linus_urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  jbd2: fix regression where we fail to initialize checksum seed when loading
2014-12-01 20:11:49 -08:00
Darrick J. Wong
32f3869184 jbd2: fix regression where we fail to initialize checksum seed when loading
When we're enabling journal features, we cannot use the predicate
jbd2_journal_has_csum_v2or3() because we haven't yet set the sb
feature flag fields!  Moreover, we just finished loading the shash
driver, so the test is unnecessary; calculate the seed always.

Without this patch, we fail to initialize the checksum seed the first
time we turn on journal_checksum, which means that all journal blocks
written during that first mount are corrupt.  Transactions written
after the second mount will be fine, since the feature flag will be
set in the journal superblock.  xfstests generic/{034,321,322} are the
regression tests.

(This is important for 3.18.)

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.coM>
Reported-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-12-01 21:57:06 -05:00
Changman Lee
9c01503f4d f2fs: cleanup redundant macro
We've already made fi and sbi for inode. Let's avoid duplicated work.

Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-12-01 14:16:50 -08:00
Chao Yu
cd34e2969b f2fs: fix to return correct error number in f2fs_write_begin
Fix the wrong error number in error path of f2fs_write_begin.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-12-01 13:56:02 -08:00
Dan Carpenter
818f2f57f2 nfsd: minor off by one checks in __write_versions()
My static checker complains that if "len == remaining" then it means we
have truncated the last character off the version string.

The intent of the code is that we print as many versions as we can
without truncating a version.  Then we put a newline at the end.  If the
newline can't fit we return -EINVAL.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-12-01 12:45:28 -07:00
Dave Chinner
c14fc01340 Merge branch 'xfs-coccinelle-cleanups' into for-next 2014-12-01 09:03:02 +11:00
kbuild test robot
d254aaec5d xfs: fix simple_return.cocci warning in xfs_bmse_shift_one
fs/xfs/libxfs/xfs_bmap.c:5591:1-6: WARNING: end returns can be simpified

 Simplify a trivial if-return sequence.  Possibly combine with a
 preceding function call.
Generated by: scripts/coccinelle/misc/simple_return.cocci

CC: Brian Foster <bfoster@redhat.com>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
2014-12-01 08:42:52 +11:00
kbuild test robot
8300475ebf xfs: fix simple_return.cocci warning in xfs_file_readdir
fs/xfs/xfs_file.c:919:1-6: WARNING: end returns can be simpified and declaration on line 902 can be dropped

 Simplify a trivial if-return sequence.  Possibly combine with a
 preceding function call.
Generated by: scripts/coccinelle/misc/simple_return.cocci

Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-12-01 08:25:28 +11:00
kbuild test robot
b72091f2fb libxfs: fix simple_return.cocci warnings
fs/xfs/libxfs/xfs_ialloc.c:1141:1-6: WARNING: end returns can be simpified

 Simplify a trivial if-return sequence.  Possibly combine with a
 preceding function call.
Generated by: scripts/coccinelle/misc/simple_return.cocci

Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-12-01 08:24:58 +11:00
Markus Elfring
d2a5e3c6fc xfs: remove unnecessary null checks
The functions xfs_blkdev_put() and xfs_qm_dqrele() test whether
their argument is NULL and then return immediately.  Thus the test
around the call is not needed.

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-12-01 08:24:20 +11:00
Chris Mason
2f19cad94c btrfs: zero out left over bytes after processing compression streams
Don Bailey noticed that our page zeroing for compression at end-io time
isn't complete.  This reworks a patch from Linus to push the zeroing
into the zlib and lzo specific functions instead of trying to handle the
corners inside btrfs_decompress_buf2page

Signed-off-by: Chris Mason <clm@fb.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reported-by: Don A. Bailey <donb@securitymouse.com>
cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-11-30 09:33:51 -08:00
Geert Uytterhoeven
4740f49652 jffs2: Drop bogus if in comment
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: linux-mtd@lists.infradead.org
Signed-off-by: Brian Norris <computersforpeace@gmail.com>
2014-11-28 18:23:44 -08:00
Changman Lee
31a3268839 f2fs: cleanup if-statement of phase in gc_data_segment
Little cleanup to distinguish each phase easily

Signed-off-by: Changman Lee <cm224.lee@samsung.com>
[Jaegeuk Kim: modify indentation for code readability]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-11-27 20:30:17 -08:00
Dave Chinner
216875a594 Merge branch 'xfs-consolidate-format-defs' into for-next 2014-11-28 14:52:16 +11:00
Dave Chinner
4bd47c1bf4 Merge branch 'xfs-misc-fixes-for-3.19-1' into for-next 2014-11-28 14:52:02 +11:00
Christoph Hellwig
508b6b3b73 xfs: merge xfs_inum.h into xfs_format.h
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-11-28 14:27:10 +11:00
Christoph Hellwig
bb58e6188a xfs: move most of xfs_sb.h to xfs_format.h
More on-disk format consolidation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-11-28 14:27:09 +11:00
Christoph Hellwig
4fb6e8ade2 xfs: merge xfs_ag.h into xfs_format.h
More on-disk format consolidation.  A few declarations that weren't on-disk
format related move into better suitable spots.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-11-28 14:25:04 +11:00
Christoph Hellwig
5beda58bf2 xfs: move acl structures to xfs_format.h
Move the on-disk ACL format to xfs_format.h, so that repair can
use the common defintion.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-11-28 14:24:37 +11:00
Christoph Hellwig
6d3ebaae7c xfs: merge xfs_dinode.h into xfs_format.h
More consolidatation for the on-disk format defintions.  Note that the
XFS_IS_REALTIME_INODE moves to xfs_linux.h instead as it is not related
to the on disk format, but depends on a CONFIG_ option.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-11-28 14:24:06 +11:00
Eric Sandeen
db52d09ecb xfs: catch invalid negative blknos in _xfs_buf_find()
Here blkno is a daddr_t, which is a __s64; it's possible to hold
a value which is negative, and thus pass the (blkno >= eofs)
test.  Then we try to do a xfs_perag_get() for a ridiculous
agno via xfs_daddr_to_agno(), and bad things happen when that
fails, and returns a null pag which is dereferenced shortly
thereafter.

Found via a user-supplied fuzzed image...

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-11-28 14:03:55 +11:00
Brian Foster
91ee575f2b xfs: allow lazy sb counter sync during filesystem freeze sequence
The expectation since the introduction the lazy superblock counters is
that the counters are synced and superblock logged appropriately as part
of the filesystem freeze sequence. This does not occur, however, due to
the logic in xfs_fs_writable() that prevents progress when the fs is in
any state other than SB_UNFROZEN.

While this is a bug, it has not been exposed to date because the last
thing XFS does during freeze is dirty the log. The log recovery process
recalculates the counters from AGI/AGF metadata to ensure everything is
correct. Therefore should a crash occur while an fs is frozen, the
subsequent log recovery puts everything back in order. See the following
commit for reference:

	92821e2b [XFS] Lazy Superblock Counters

We might not always want to rely on dirtying the log on a frozen fs.
Modify xfs_log_sbcount() to proceed when the filesystem is freezing but
not once the freeze process has completed. Modify xfs_fs_writable() to
accept the minimum freeze level for which modifications should be
blocked to support various codepaths.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-11-28 14:02:59 +11:00
Brian Foster
5d45ee1b41 xfs: fix error handling in xfs_qm_log_quotaoff()
The error handling in xfs_qm_log_quotaoff() has a couple problems. If
xfs_trans_commit() fails, we fall through to the error block and call
xfs_trans_cancel(). This is incorrect on commit failure. If
xfs_trans_reserve() fails, we jump to the error block, cancel the tp and
restore the superblock qflags to oldsbqflag. However, oldsbqflag has
been initialized to zero and not yet updated from the original flags so
we set the flags to zero.

Fix up the error handling in xfs_qm_log_quotaoff() to not restore flags
if they haven't been modified and not cancel the tp on commit failure.
Remove the flag restore code altogether because commit error is the only
failure condition and we don't know whether the transaction made it to
disk.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-11-28 14:00:53 +11:00
Brian Foster
062647a8b4 xfs: replace on-stack xfs_trans_res with pointer in xfs_create()
There's no need to store a full struct xfs_trans_res on the stack in
xfs_create() and copy the fields. Use a pointer to the appropriate
structures embedded in the xfs_mount.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-11-28 14:00:16 +11:00
Brian Foster
78c931b8be xfs: replace global xfslogd wq with per-mount wq
The xfslogd workqueue is a global, single-job workqueue for buffer ioend
processing. This means we allow for a single work item at a time for all
possible XFS mounts on a system. fsstress testing in loopback XFS over
XFS configurations has reproduced xfslogd deadlocks due to the single
threaded nature of the queue and dependencies introduced between the
separate XFS instances by online discard (-o discard).

Discard over a loopback device converts the discard request to a hole
punch (fallocate) on the underlying file. Online discard requests are
issued synchronously and from xfslogd context in XFS, hence the xfslogd
workqueue is blocked in the upper fs waiting on a hole punch request to
be servied in the lower fs. If the lower fs issues I/O that depends on
xfslogd to complete, both filesystems end up hung indefinitely. This is
reproduced reliabily by generic/013 on XFS->loop->XFS test devices with
the '-o discard' mount option.

Further, docker implementations appear to use this kind of configuration
for container instance filesystems by default (container fs->dm->
loop->base fs) and therefore are subject to this deadlock when running
on XFS.

Replace the global xfslogd workqueue with a per-mount variant. This
guarantees each mount access to a single worker and prevents deadlocks
due to inter-fs dependencies introduced by discard. Since the queue is
only responsible for buffer iodone processing at this point in time,
rename xfslogd to xfs-buf.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-11-28 13:59:58 +11:00
Phillip Lougher
62421645bb Squashfs: Add LZ4 compression configuration option
Add the glue code, and also update the documentation.

Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>
2014-11-27 18:48:44 +00:00
Phillip Lougher
9c06a46f15 Squashfs: add LZ4 compression support
Add support for reading file systems compressed with the
LZ4 compression algorithm.

This patch adds the LZ4 decompressor wrapper code.

Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>
2014-11-27 07:44:11 +00:00