Commit graph

39,678 commits

Author SHA1 Message Date
Zhao Lei
b968fed1c3 Btrfs: Combine per-page recover in dev-replace and scrub
The code are similar, combine them to make code clean and easy to maintenance.
Some lost condition are also completed with benefit of this combination.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:06:49 -08:00
Zhao Lei
8d6738c1bd Btrfs: Separate finding-right-mirror and writing-to-target's process in scrub_handle_errored_block()
In corrent code, code of finding-right-mirror and writing-to-target
are mixed in logic, if we find a right mirror but failed in writing
to target, it will treat as "hadn't found right block", and fill the
target with sblock_bad.

Actually, "failed in writing to target" does not mean "source
block is wrong", this patch separate above two condition in logic,
and do some cleanup to make code clean.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:06:49 -08:00
Zhao Lei
dc5f7a3bd8 Btrfs: Break loop when reach BTRFS_MAX_MIRRORS in scrub_setup_recheck_block()
Use break instead of useless loop should be more suitable in this
case.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:06:48 -08:00
Zhao Lei
7653947fe6 Btrfs: btrfs_rm_dev_replace_blocked(): Use wait_event()
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:06:48 -08:00
Zhao Lei
09dd7a01c3 Btrfs: Cleanup btrfs_bio_counter_inc_blocked()
1: Remove no-need DEFINE_WAIT(wait)
2: Add likely() for BTRFS_FS_STATE_DEV_REPLACING condition
3: Use while loop instead of goto

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:06:48 -08:00
Zhao Lei
114ab50d82 Btrfs: Remove noneed force_write in scrub_write_block_to_dev_replace
It is always 1 in this place, because !1 case was already jumped
out in previous code.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:06:48 -08:00
Zhao Lei
b25c94c580 Btrfs: Fix a jump typo of nodatasum_case to avoid wrong WARN_ON()
if (sctx->is_dev_replace && !is_metadata && !have_csum) {
    ...
    goto nodatasum_case;
}
...
nodatasum_case:
    WARN_ON(sctx->is_dev_replace);

In above code, nodatasum_case marker should be moved after
WARN_ON().

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:06:48 -08:00
Zhao Lei
6e9606d2a2 Btrfs: add ref_count and free function for btrfs_bio
1: ref_count is simple than current RBIO_HOLD_BBIO_MAP_BIT flag
   to keep btrfs_bio's memory in raid56 recovery implement.
2: free function for bbio will make code clean and flexible, plus
   forced data type checking in compile.

Changelog v1->v2:
 Rename following by David Sterba's suggestion:
 put_btrfs_bio() -> btrfs_put_bio()
 get_btrfs_bio() -> btrfs_get_bio()
 bbio->ref_count -> bbio->refs

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:06:48 -08:00
Zhao Lei
8e5cfb55d3 Btrfs: Make raid_map array be inlined in btrfs_bio structure
It can make code more simple and clear, we need not care about
free bbio and raid_map together.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:06:47 -08:00
Zhao Lei
cc7539edea Btrfs: sort raid_map before adding tgtdev stripes
It can avoid complex calculation of real stripes in sort,
moreover, we can clean up code of sorting tgtdev_map because it
will be in order initially.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:06:47 -08:00
Zhao Lei
e34c330d63 Btrfs: fix a out-of-bound access of raid_map
We add the number of stripes on target devices into bbio->num_stripes
if we are under device replacement, and we just sort the raid_map of
those stripes that not on the target devices, so if when we need
real raid_map, we need skip the stripes on the target devices.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:06:47 -08:00
Filipe Manana
df8d116ffa Btrfs: fix fsync log replay for inodes with a mix of regular refs and extrefs
If we have an inode with a large number of hard links, some of which may
be extrefs, turn a regular ref into an extref, fsync the inode and then
replay the fsync log (after a crash/reboot), we can endup with an fsync
log that makes the replay code always fail with -EOVERFLOW when processing
the inode's references.

This is easy to reproduce with the test case I made for xfstests. Its steps
are the following:

   _scratch_mkfs "-O extref" >> $seqres.full 2>&1
   _init_flakey
   _mount_flakey

   # Create a test file with 3001 hard links. This number is large enough to
   # make btrfs start using extrefs at some point even if the fs has the maximum
   # possible leaf/node size (64Kb).
   echo "hello world" > $SCRATCH_MNT/foo
   for i in `seq 1 3000`; do
       ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link_`printf "%04d" $i`
   done

   # Make sure all metadata and data are durably persisted.
   sync

   # Now remove one link, add a new one with a new name, add another new one with
   # the same name as the one we just removed and fsync the inode.
   rm -f $SCRATCH_MNT/foo_link_0001
   ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link_3001
   ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link_0001
   rm -f $SCRATCH_MNT/foo_link_0002
   ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link_3002
   ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link_3003
   $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo

   # Simulate a crash/power loss. This makes sure the next mount
   # will see an fsync log and will replay that log.

   _load_flakey_table $FLAKEY_DROP_WRITES
   _unmount_flakey

   _load_flakey_table $FLAKEY_ALLOW_WRITES
   _mount_flakey

   # Check that the number of hard links is correct, we are able to remove all
   # the hard links and read the file's data. This is just to verify we don't
   # get stale file handle errors (due to dangling directory index entries that
   # point to inodes that no longer exist).
   echo "Link count: $(stat --format=%h $SCRATCH_MNT/foo)"
   [ -f $SCRATCH_MNT/foo ] || echo "Link foo is missing"
   for ((i = 1; i <= 3003; i++)); do
       name=foo_link_`printf "%04d" $i`
       if [ $i -eq 2 ]; then
           [ -f $SCRATCH_MNT/$name ] && echo "Link $name found"
       else
           [ -f $SCRATCH_MNT/$name ] || echo "Link $name is missing"
       fi
   done
   rm -f $SCRATCH_MNT/foo_link_*
   cat $SCRATCH_MNT/foo
   rm -f $SCRATCH_MNT/foo

   status=0
   exit

The fix is simply to correct the overflow condition when overwriting a
reference item because it was wrong, trying to increase the item in the
fs/subvol tree by an impossible amount. Also ensure that we don't insert
one normal ref and one ext ref for the same dentry - this happened because
processing a dir index entry from the parent in the log happened when
the normal ref item was full, which made the logic insert an extref and
later when the normal ref had enough room, it would be inserted again
when processing the ref item from the child inode in the log.

This issue has been present since the introduction of the extrefs feature
(2012).

A test case for xfstests follows soon. This test only passes if the previous
patch titled "Btrfs: fix fsync when extend references are added to an inode"
is applied too.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:02:05 -08:00
Filipe Manana
2c2c452b0c Btrfs: fix fsync when extend references are added to an inode
If we added an extended reference to an inode and fsync'ed it, the log
replay code would make our inode have an incorrect link count, which
was lower then the expected/correct count.
This resulted in stale directory index entries after deleting some of
the hard links, and any access to the dangling directory entries resulted
in -ESTALE errors because the entries pointed to inode items that don't
exist anymore.

This is easy to reproduce with the test case I made for xfstests, and
the bulk of that test is:

    _scratch_mkfs "-O extref" >> $seqres.full 2>&1
    _init_flakey
    _mount_flakey

    # Create a test file with 3001 hard links. This number is large enough to
    # make btrfs start using extrefs at some point even if the fs has the maximum
    # possible leaf/node size (64Kb).
    echo "hello world" > $SCRATCH_MNT/foo
    for i in `seq 1 3000`; do
        ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link_`printf "%04d" $i`
    done

    # Make sure all metadata and data are durably persisted.
    sync

    # Add one more link to the inode that ends up being a btrfs extref and fsync
    # the inode.
    ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link_3001
    $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo

    # Simulate a crash/power loss. This makes sure the next mount
    # will see an fsync log and will replay that log.

    _load_flakey_table $FLAKEY_DROP_WRITES
    _unmount_flakey

    _load_flakey_table $FLAKEY_ALLOW_WRITES
    _mount_flakey

    # Now after the fsync log replay btrfs left our inode with a wrong link count N,
    # which was smaller than the correct link count M (N < M).
    # So after removing N hard links, the remaining M - N directory entries were
    # still visible to user space but it was impossible to do anything with them
    # because they pointed to an inode that didn't exist anymore. This resulted in
    # stale file handle errors (-ESTALE) when accessing those dentries for example.
    #
    # So remove all hard links except the first one and then attempt to read the
    # file, to verify we don't get an -ESTALE error when accessing the inodel
    #
    # The btrfs fsck tool also detected the incorrect inode link count and it
    # reported an error message like the following:
    #
    # root 5 inode 257 errors 2001, no inode item, link count wrong
    #   unresolved ref dir 256 index 2978 namelen 13 name foo_link_2976 filetype 1 errors 4, no inode ref
    #
    # The fstests framework automatically calls fsck after a test is run, so we
    # don't need to call fsck explicitly here.

    rm -f $SCRATCH_MNT/foo_link_*
    cat $SCRATCH_MNT/foo

    status=0
    exit

So make sure an fsync always flushes the delayed inode item, so that the
fsync log contains it (needed in order to trigger the link count fixup
code) and fix the extref counting function, which always return -ENOENT
to its caller (and made it assume there were always 0 extrefs).

This issue has been present since the introduction of the extrefs feature
(2012).

A test case for xfstests follows soon.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:02:04 -08:00
Filipe Manana
d36808e0d4 Btrfs: fix directory inconsistency after fsync log replay
If we have an inode (file) with a link count greater than 1, remove
one of its hard links, fsync the inode, power fail/crash and then
replay the fsync log on the next mount, we end up getting the parent
directory's metadata inconsistent - its i_size still reflects the
deleted hard link and has dangling index entries (with no matching
inode reference entries). This prevents the directory from ever being
deletable, as its i_size can never decrease to BTRFS_EMPTY_DIR_SIZE
even if all of its children inodes are deleted, and the dangling index
entries can never be removed (as they point to an inode that does not
exist anymore).

This is easy to reproduce with the following excerpt from the test case
for xfstests that I just made:

    _scratch_mkfs >> $seqres.full 2>&1

    _init_flakey
    _mount_flakey

    # Create a test file with 2 hard links in the same directory.
    mkdir -p $SCRATCH_MNT/a/b
    echo "hello world" > $SCRATCH_MNT/a/b/foo
    ln $SCRATCH_MNT/a/b/foo $SCRATCH_MNT/a/b/bar

    # Make sure all metadata and data are durably persisted.
    sync

    # Now remove one of the hard links and fsync the inode.
    rm -f $SCRATCH_MNT/a/b/bar
    $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/a/b/foo

    # Simulate a crash/power loss. This makes sure the next mount
    # will see an fsync log and will replay that log.

    _load_flakey_table $FLAKEY_DROP_WRITES
    _unmount_flakey

    _load_flakey_table $FLAKEY_ALLOW_WRITES
    _mount_flakey

    # Remove the last hard link of the file and attempt to remove its parent
    # directory - this failed in btrfs because the fsync log and replay code
    # didn't decrement the parent directory's i_size and left dangling directory
    # index entries - this made the btrfs rmdir implementation always fail with
    # the error -ENOTEMPTY.
    #
    # The dangling directory index entries were visible to user space, but it was
    # impossible to do anything on them (unlink, open, read, write, stat, etc)
    # because the inode they pointed to did not exist anymore.
    #
    # The parent directory's metadata inconsistency (stale index entries) was
    # also detected by btrfs' fsck tool, which is run automatically by the fstests
    # framework when the test finishes. The error message reported by fsck was:
    #
    # root 5 inode 259 errors 2001, no inode item, link count wrong
    #   unresolved ref dir 258 index 3 namelen 3 name bar filetype 1 errors 4, no inode ref
    #
    rm -f $SCRATCH_MNT/a/b/*
    rmdir $SCRATCH_MNT/a/b
    rmdir $SCRATCH_MNT/a

To fix this just make sure that after an unlink, if the inode is fsync'ed,
he parent inode is fully logged in the fsync log.

A test case for xfstests follows soon.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:02:04 -08:00
Filipe Manana
6219872dc6 Btrfs: lookup for block group only if needed when freeing a tree block
Very often our extent buffer's header generation doesn't match the current
transaction's id or it is also referenced by other trees (snapshots), so
we don't need the corresponding block group cache object. Therefore only
search for it if we are going to use it, so we avoid an unnecessary search
in the block groups rbtree (and acquiring and releasing its spinlock).

Freeing a tree block is performed when COWing or deleting a node/leaf,
which implies we are holding the node/leaf's parent node lock, therefore
reducing the amount of time spent when freeing a tree block helps reducing
the amount of time we are holding the parent node's lock.

For example, for a run of xfstests/generic/083, the block group cache
object was needed only 682 times for a total of 226691 calls to free
a tree block.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:02:04 -08:00
David Sterba
730a78c741 btrfs: remove a no-op unfreeze superbock callback
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:02:04 -08:00
David Sterba
9ee49a047d btrfs: switch extent_state state to unsigned
Currently there's a 4B hole in the structure between refs and state and there
are only 16 bits used so we can make it unsigned. This will get a better
packing and may save some stack space for local variables.

The size of extent_state gets reduced by 8B and there are usually a lot
of slab objects.

struct extent_state {
	u64                        start;                /*     0     8 */
	u64                        end;                  /*     8     8 */
	struct rb_node             rb_node;              /*    16    24 */
	wait_queue_head_t          wq;                   /*    40    24 */
	/* --- cacheline 1 boundary (64 bytes) --- */
	atomic_t                   refs;                 /*    64     4 */

	/* XXX 4 bytes hole, try to pack */

	long unsigned int          state;                /*    72     8 */
	u64                        private;              /*    80     8 */

	/* size: 88, cachelines: 2, members: 7 */
	/* sum members: 84, holes: 1, sum holes: 4 */
	/* last cacheline: 24 bytes */
};

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:02:04 -08:00
David Sterba
5efa0490cc btrfs: set proper message level for skinny metadata
This has been confusing people for too long, the message is really just
informative.

CC: <stable@vger.kernel.org> # 3.10+
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:02:03 -08:00
David Sterba
f0954c6637 btrfs: update message levels after checksum errors
The errors are worth noting and might get missed with INFO level.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:02:03 -08:00
David Sterba
aa8ee31209 btrfs: update message levels during failed mount
All error conditions from open_ctree shall be ERR. Warning would
suggest that something's wrong and we can continue.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:02:03 -08:00
David Sterba
68b663d13c btrfs: update message levels for errors
Several messages that point to some internal problem, level INFO is
wrong here.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:02:03 -08:00
Filipe Manana
a8df6fe666 Btrfs: fix setup_leaf_for_split() to avoid leaf corruption
We were incorrectly detecting when the target key didn't exist anymore
after releasing the path and re-searching the tree. This could make
us split or duplicate (btrfs_split_item() and btrfs_duplicate_item()
are its only callers at the moment) an item when we should not.

For the case of duplicating an item, we currently only duplicate
checksum items (csum tree) and file extent items (fs/subvol trees).
For the checksum items we end up overriding the item completely,
but for file extent items we update only some of their fields in
the copy (done in __btrfs_drop_extents), which means we can end up
having a logical corruption for some values.

Also for the case where we duplicate a file extent item it will make
us produce a leaf with a wrong key order, as btrfs_duplicate_item()
advances us to the next slot and then its caller sets a smaller key
on the new item at that slot (like in __btrfs_drop_extents() e.g.).
Alternatively if the tree search in setup_leaf_for_split() leaves
with path->slots[0] == btrfs_header_nritems(path->nodes[0]), we end
up accessing beyond the leaf's end (when we check if the item's size
has changed) and make our caller insert an item at the invalid slot
btrfs_header_nritems(path->nodes[0]) + 1, causing an invalid memory
access if the leaf is full or nearly full.

This issue has been present since the introduction of this function
in 2009:

    Btrfs: Add btrfs_duplicate_item
    commit ad48fd7546

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:02:03 -08:00
Chris Mason
57bbddd7fb Merge branch 'cleanup/blocksize-diet-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus 2015-01-21 17:49:35 -08:00
Chris Mason
d354183488 Merge branch 'fix/find-item-path-leak' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus 2015-01-21 17:45:25 -08:00
Jeff Layton
8116bf4cb6 locks: update comments that refer to inode->i_flock
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
2015-01-21 20:44:01 -05:00
Josef Bacik
ce93ec548c Btrfs: track dirty block groups on their own list
Currently any time we try to update the block groups on disk we will walk _all_
block groups and check for the ->dirty flag to see if it is set.  This function
can get called several times during a commit.  So if you have several terabytes
of data you will be a very sad panda as we will loop through _all_ of the block
groups several times, which makes the commit take a while which slows down the
rest of the file system operations.

This patch introduces a dirty list for the block groups that we get added to
when we dirty the block group for the first time.  Then we simply update any
block groups that have been dirtied since the last time we called
btrfs_write_dirty_block_groups.  This allows us to clean up how we write the
free space cache out so it is much cleaner.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 17:36:52 -08:00
Josef Bacik
e7070be198 Btrfs: change how we track dirty roots
I've been overloading root->dirty_list to keep track of dirty roots and which
roots need to have their commit roots switched at transaction commit time.  This
could cause us to lose an update to the root which could corrupt the file
system.  To fix this use a state bit to know if the root is dirty, and if it
isn't set we go ahead and move the root to the dirty list.  This way if we
re-dirty the root after adding it to the switch_commit list we make sure to
update it.  This also makes it so that the extent root is always the last root
on the dirty list to try and keep the amount of churn down at this point in the
commit.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 17:35:49 -08:00
Brian Foster
4d949021aa xfs: remove incorrect error negation in attr_multi ioctl
xfs_compat_attrmulti_by_handle() calls memdup_user() which returns a
negative error code. The error code is negated by the caller and thus
incorrectly converted to a positive error code.

Remove the error negation such that the negative error is passed
correctly back up to userspace.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-01-22 10:04:24 +11:00
Dave Chinner
438c3c8d2b Merge branch 'xfs-buf-type-fixes' into for-next 2015-01-22 09:51:30 +11:00
Dave Chinner
3443a3bca5 xfs: set superblock buffer type correctly
When the superblock is modified in a transaction, the commonly
modified fields are not actually copied to the superblock buffer to
avoid the buffer lock becoming a serialisation point. However, there
are some other operations that modify the superblock fields within
the transaction that don't directly log to the superblock but rely
on the changes to be applied during the transaction commit (to
minimise the buffer lock hold time).

When we do this, we fail to mark the buffer log item as being a
superblock buffer and that can lead to the buffer not being marked
with the corect type in the log and hence causing recovery issues.
Fix it by setting the type correctly, similar to xfs_mod_sb()...

cc: <stable@vger.kernel.org> # 3.10 to current
Tested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-01-22 09:30:23 +11:00
Dave Chinner
fe22d552b8 xfs: set buf types when converting extent formats
Conversion from local to extent format does not set the buffer type
correctly on the new extent buffer when a symlink data is moved out
of line.

Fix the symlink code and leave a comment in the generic bmap code
reminding us that the format-specific data copy needs to set the
destination buffer type appropriately.

cc: <stable@vger.kernel.org> # 3.10 to current
Tested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-01-22 09:30:06 +11:00
Dave Chinner
f19b872b08 xfs: inode unlink does not set AGI buffer type
This leads to log recovery throwing errors like:

XFS (md0): Mounting V5 Filesystem
XFS (md0): Starting recovery (logdev: internal)
XFS (md0): Unknown buffer type 0!
XFS (md0): _xfs_buf_ioapply: no ops on block 0xaea8802/0x1
ffff8800ffc53800: 58 41 47 49 .....

Which is the AGI buffer magic number.

Ensure that we set the type appropriately in both unlink list
addition and removal.

cc: <stable@vger.kernel.org> # 3.10 to current
Tested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-01-22 09:29:40 +11:00
Dave Chinner
0d612fb570 xfs: ensure buffer types are set correctly
Jan Kara reported that log recovery was finding buffers with invalid
types in them. This should not happen, and indicates a bug in the
logging of buffers. To catch this, add asserts to the buffer
formatting code to ensure that the buffer type is in range when the
transaction is committed.

We don't set a type on buffers being marked stale - they are not
going to get replayed, the format item exists only for recovery to
be able to prevent replay of the buffer, so the type does not
matter. Hence that needs special casing here.

cc: <stable@vger.kernel.org> # 3.10 to current
Reported-by: Jan Kara <jack@suse.cz>
Tested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-01-22 09:29:05 +11:00
Dave Chinner
465e2def7c Merge branch 'xfs-sb-logging-rework' into for-next
Conflicts:
	fs/xfs/xfs_mount.c
2015-01-22 09:20:53 +11:00
Anna Schumaker
2ef47eb1ae NFS: Fix use of nfs_attr_use_mounted_on_fileid()
This function call was being optimized out during nfs_fhget(), leading
to situations where we have a valid fileid but still want to use the
mounted_on_fileid.  For example, imagine we have our server configured
like this:

server % df
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda1       9.1G  6.5G  1.9G  78% /
/dev/vdb1       487M  2.3M  456M   1% /exports
/dev/vdc1       487M  2.3M  456M   1% /exports/vol1
/dev/vdd1       487M  2.3M  456M   1% /exports/vol2

If our client mounts /exports and tries to do a "chown -R" across the
entire mountpoint, we will get a nasty message warning us about a circular
directory structure.  Running chown with strace tells me that each directory
has the same device and inode number:

newfstatat(AT_FDCWD, "/nfs/", {st_dev=makedev(0, 38), st_ino=2, ...}) = 0
newfstatat(4, "vol1", {st_dev=makedev(0, 38), st_ino=2, ...}) = 0
newfstatat(4, "vol2", {st_dev=makedev(0, 38), st_ino=2, ...}) = 0

With this patch the mounted_on_fileid values are used for st_ino, so the
directory loop warning isn't reported.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-01-21 17:15:41 -05:00
Dave Chinner
074e427ba7 xfs: sanitise sb_bad_features2 handling
We currently have to ensure that every time we update sb_features2
that we update sb_bad_features2. Now that we log and format the
superblock in it's entirety we actually don't have to care because
we can simply update the sb_bad_features2 when we format it into the
buffer. This removes the need for anything but the mount and
superblock formatting code to care about sb_bad_features2, and
hence removes the possibility that we forget to update bad_features2
when necessary in the future.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-01-22 09:10:33 +11:00
Dave Chinner
61e63ecb57 xfs: consolidate superblock logging functions
We now have several superblock loggin functions that are identical
except for the transaction reservation and whether it shoul dbe a
synchronous transaction or not. Consolidate these all into a single
function, a single reserveration and a sync flag and call it
xfs_sync_sb().

Also, xfs_mod_sb() is not really a modification function - it's the
operation of logging the superblock buffer. hence change the name of
it to reflect this.

Note that we have to change the mp->m_update_flags that are passed
around at mount time to a boolean simply to indicate a superblock
update is needed.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-01-22 09:10:31 +11:00
Dave Chinner
4d11a40239 xfs: remove bitfield based superblock updates
When we log changes to the superblock, we first have to write them
to the on-disk buffer, and then log that. Right now we have a
complex bitfield based arrangement to only write the modified field
to the buffer before we log it.

This used to be necessary as a performance optimisation because we
logged the superblock buffer in every extent or inode allocation or
freeing, and so performance was extremely important. We haven't done
this for years, however, ever since the lazy superblock counters
pulled the superblock logging out of the transaction commit
fast path.

Hence we have a bunch of complexity that is not necessary that makes
writing the in-core superblock to disk much more complex than it
needs to be. We only need to log the superblock now during
management operations (e.g. during mount, unmount or quota control
operations) so it is not a performance critical path anymore.

As such, remove the complex field based logging mechanism and
replace it with a simple conversion function similar to what we use
for all other on-disk structures.

This means we always log the entirity of the superblock, but again
because we rarely modify the superblock this is not an issue for log
bandwidth or CPU time. Indeed, if we do log the superblock
frequently, delayed logging will minimise the impact of this
overhead.

[Fixed gquota/pquota inode sharing regression noticed by bfoster.]

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-01-22 09:10:26 +11:00
Trond Myklebust
3175e1dcec NFSv4.1: Fix an Oops in nfs41_walk_client_list
If we start state recovery on a client that failed to initialise correctly,
then we are very likely to Oops.

Reported-by: "Mkrtchyan, Tigran" <tigran.mkrtchyan@desy.de>
Link: http://lkml.kernel.org/r/130621862.279655.1421851650684.JavaMail.zimbra@desy.de
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-01-21 17:10:09 -05:00
Peng Tao
ee8a1a8b16 nfs: fix dio deadlock when O_DIRECT flag is flipped
We only support swap file calling nfs_direct_IO. However, application
might be able to get to nfs_direct_IO if it toggles O_DIRECT flag
during IO and it can deadlock because we grab inode->i_mutex in
nfs_file_direct_write(). So return 0 for such case. Then the generic
layer will fall back to buffer IO.

Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-01-21 17:10:08 -05:00
Jan Kara
a39427007e xfs: Remove some pointless quota checks
xfs_fs_get_xstate() and xfs_fs_get_xstatev() check whether there's quota
running before calling xfs_qm_scall_getqstat() or
xfs_qm_scall_getqstatv(). Thus we are certain that superblock supports
quota and xfs_sb_version_hasquota() check is pointless. Similarly we
know that when quota is running, mp->m_quotainfo will be allocated.

Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2015-01-21 19:21:33 +01:00
Jan Kara
fbf64b3df3 xfs: Remove some useless flags tests
'flags' have XFS_ALL_QUOTA_ACCT cleared immediately on function entry.
There's no point in checking these bits later in the function. Also
because we check something is going to change, we know some enforcement
bits are being added and thus there's no point in testing that later.

Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2015-01-21 19:21:32 +01:00
Jan Kara
8a2fdd4a49 xfs: Remove useless test
Q_XQUOTARM is never passed to xfs_fs_set_xstate() so remove the test.

Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2015-01-21 19:21:32 +01:00
Jan Kara
ca6cb0918e quota: Verify flags passed to Q_SETINFO
Currently flags passed via Q_SETINFO were just stored. This makes it
hard to add new flags since in theory userspace could be just setting /
clearing random flags. Since currently there is only one userspace
settable flag and that is somewhat obscure flags only for ancient v1
quota format, I'm reasonably sure noone operates these flags and
hopefully we are fine just adding the check that passed flags are sane.
If we indeed find some userspace program that gets broken by the strict
check, we can always remove it again.

Signed-off-by: Jan Kara <jack@suse.cz>
2015-01-21 19:21:31 +01:00
Jan Kara
9c45101e88 quota: Cleanup flags definitions
Currently all quota flags were defined just in kernel-private headers.
Export flags readable / writeable from userspace to userspace via
include/uapi/linux/quota.h.

Signed-off-by: Jan Kara <jack@suse.cz>
2015-01-21 19:21:30 +01:00
Jan Kara
96827adcc2 ocfs2: Move OLQF_CLEAN flag out of generic quota flags
OLQF_CLEAN flag is used by OCFS2 on disk to recognize whether quota
recovery is needed or not. We also somewhat abuse mem_dqinfo->dqi_flags
to pass this flag around. Use private flags for this to avoid clashes
with other quota flags / not pollute generic quota flag namespace.

Signed-off-by: Jan Kara <jack@suse.cz>
2015-01-21 19:21:30 +01:00
Jan Kara
c119c5b974 quota: Don't store flags for v2 quota format
Currently, v2 quota format blindly stored flags from in-memory dqinfo on
disk, although there are no flags supported. Since it is stupid to store
flags which have no effect, just store 0 unconditionally and don't
bother loading it from disk.

Note that userspace could have stored some flags there via Q_SETINFO
quotactl and then later read them (although flags have no effect) but
I'm pretty sure noone does that (most definitely quota-tools don't and
quota interface doesn't have too much other users).

Signed-off-by: Jan Kara <jack@suse.cz>
2015-01-21 19:21:07 +01:00
Ingo Molnar
f49028292c Merge branch 'for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
Pull RCU updates from Paul E. McKenney:

  - Documentation updates.

  - Miscellaneous fixes.

  - Preemptible-RCU fixes, including fixing an old bug in the
    interaction of RCU priority boosting and CPU hotplug.

  - SRCU updates.

  - RCU CPU stall-warning updates.

  - RCU torture-test updates.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-01-21 06:12:21 +01:00
Qu Wenruo
a53f4f8e9c btrfs: Don't call btrfs_start_transaction() on frozen fs to avoid deadlock.
Commit 6b5fe46dfa (btrfs: do commit in sync_fs if there are pending
changes) will call btrfs_start_transaction() in sync_fs(), to handle
some operations needed to be done in next transaction.

However this can cause deadlock if the filesystem is frozen, with the
following sys_r+w output:
[  143.255932] Call Trace:
[  143.255936]  [<ffffffff816c0e09>] schedule+0x29/0x70
[  143.255939]  [<ffffffff811cb7f3>] __sb_start_write+0xb3/0x100
[  143.255971]  [<ffffffffa040ec06>] start_transaction+0x2e6/0x5a0
[btrfs]
[  143.255992]  [<ffffffffa040f1eb>] btrfs_start_transaction+0x1b/0x20
[btrfs]
[  143.256003]  [<ffffffffa03dc0ba>] btrfs_sync_fs+0xca/0xd0 [btrfs]
[  143.256007]  [<ffffffff811f7be0>] sync_fs_one_sb+0x20/0x30
[  143.256011]  [<ffffffff811cbd01>] iterate_supers+0xe1/0xf0
[  143.256014]  [<ffffffff811f7d75>] sys_sync+0x55/0x90
[  143.256017]  [<ffffffff816c49d2>] system_call_fastpath+0x12/0x17
[  143.256111] Call Trace:
[  143.256114]  [<ffffffff816c0e09>] schedule+0x29/0x70
[  143.256119]  [<ffffffff816c3405>] rwsem_down_write_failed+0x1c5/0x2d0
[  143.256123]  [<ffffffff8133f013>] call_rwsem_down_write_failed+0x13/0x20
[  143.256131]  [<ffffffff811caae8>] thaw_super+0x28/0xc0
[  143.256135]  [<ffffffff811db3e5>] do_vfs_ioctl+0x3f5/0x540
[  143.256187]  [<ffffffff811db5c1>] SyS_ioctl+0x91/0xb0
[  143.256213]  [<ffffffff816c49d2>] system_call_fastpath+0x12/0x17

The reason is like the following:
(Holding s_umount)
VFS sync_fs staff:
|- btrfs_sync_fs()
   |- btrfs_start_transaction()
      |- sb_start_intwrite()
      (Waiting thaw_fs to unfreeze)
					VFS thaw_fs staff:
					thaw_fs()
					(Waiting sync_fs to release
					 s_umount)

So deadlock happens.
This can be easily triggered by fstest/generic/068 with inode_cache
mount option.

The fix is to check if the fs is frozen, if the fs is frozen, just
return and waiting for the next transaction.

Cc: David Sterba <dsterba@suse.cz>
Reported-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
[enhanced comment, changed to SB_FREEZE_WRITE]
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-20 17:20:21 -08:00
Qu Wenruo
6c9fe14f9d btrfs: Fix the bug that fs_info->pending_changes is never cleared.
Fs_info->pending_changes is never cleared since the original code uses
cmpxchg(&fs_info->pending_changes, 0, 0), which will only clear it if
pending_changes is already 0.

This will cause a lot of problem when mount it with inode_cache mount
option.
If the btrfs is mounted as inode_cache, pending_changes will always be
1, even when the fs is frozen.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-20 17:19:40 -08:00