Commit graph

272 commits

Author SHA1 Message Date
Chris Mason
27263e2832 Merge branch 'restriper' of git://github.com/idryomov/btrfs-unstable into integration 2012-01-16 15:26:02 -05:00
Ilya Dryomov
19a39dce3b Btrfs: add balance progress reporting
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2012-01-16 22:04:49 +02:00
Ilya Dryomov
a7e99c691a Btrfs: allow for canceling restriper
Implement an ioctl for canceling restriper.  Currently we wait until
relocation of the current block group is finished, in future this can be
done by triggering a commit.  Balance item is deleted and no memory
about the interrupted balance is kept.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2012-01-16 22:04:49 +02:00
Ilya Dryomov
837d5b6e46 Btrfs: allow for pausing restriper
Implement an ioctl for pausing restriper.  This pauses the relocation,
but balance is still considered to be "in progress": balance item is
not deleted, other volume operations cannot be started, etc.  If paused
in the middle of profile changing operation we will continue making
allocations with the target profile.

Add a hook to close_ctree() to pause restriper and free its data
structures on unmount.  (It's safe to unmount when restriper is in
"paused" state, we will resume with the same parameters on the next
mount)

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2012-01-16 22:04:49 +02:00
Ilya Dryomov
9555c6c180 Btrfs: add skip_balance mount option
Since restriper kthread starts involuntarily on mount and can suck cpu
and memory bandwidth add a mount option to forcefully skip it.  The
restriper in that case hangs around in paused state and can be resumed
from userspace when it's convenient.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2012-01-16 22:04:48 +02:00
Ilya Dryomov
596410151e Btrfs: recover balance on mount
On mount, if balance item is found, resume balance in a separate
kernel thread.

Try to be smart to continue roughly where previous balance (or convert)
was interrupted.  For chunk types that were being converted to some
profile we turn on soft convert, in case of a simple balance we turn on
usage filter and relocate only less-than-90%-full chunks of that type.
These are just heuristics but they help quite a bit, and can be improved
in future.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2012-01-16 22:04:48 +02:00
Ilya Dryomov
0940ebf6b9 Btrfs: save balance parameters to disk
Introduce a new btree objectid for storing balance item.  The reason is
to be able to resume restriper after a crash with the same parameters.
Balance item has a very high objectid and goes into tree of tree roots.

The key for the new item is as follows:

	[ BTRFS_BALANCE_OBJECTID ; BTRFS_BALANCE_ITEM_KEY ; 0 ]

Older kernels simply ignore it so it's safe to mount with an older
kernel and then go back to the newer one.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2012-01-16 22:04:48 +02:00
Ilya Dryomov
cfa4c961cc Btrfs: soft profile changing mode (aka soft convert)
When doing convert from one profile to another if soft mode is on
restriper won't touch chunks that already have the profile we are
converting to.  This is useful if e.g. half of the FS was converted
earlier.

The soft mode switch is (like every other filter) per-type.  This means
that we can convert for example meta chunks the "hard" way while
converting data chunks selectively with soft switch.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2012-01-16 22:04:48 +02:00
Ilya Dryomov
e4d8ec0f65 Btrfs: implement online profile changing
Profile changing is done by launching a balance with
BTRFS_BALANCE_CONVERT bits set and target fields of respective
btrfs_balance_args structs initialized.  Profile reducing code in this
case will pick restriper's target profile if it's available instead of
doing a blind reduce.  If target profile is not yet available it goes
back to a plain reduce.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2012-01-16 22:04:48 +02:00
Ilya Dryomov
ea67176ae8 Btrfs: virtual address space subset filter
Select chunks which have at least one byte located inside a given
[vstart, vend) virtual address space range.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2012-01-16 22:04:48 +02:00
Ilya Dryomov
94e60d5a5c Btrfs: devid subset filter
Select chunks which have at least one byte of at least one stripe
located on a device with devid X in a given [pstart,pend) physical
address range.

This filter only works when devid filter is turned on.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2012-01-16 22:04:48 +02:00
Ilya Dryomov
409d404b46 Btrfs: devid filter
Relocate chunks which have at least one stripe located on a device with
devid X.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2012-01-16 22:04:47 +02:00
Ilya Dryomov
5ce5b3c091 Btrfs: usage filter
Select chunks that are less than X percent full.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2012-01-16 22:04:47 +02:00
Ilya Dryomov
ed25e9b26f Btrfs: profiles filter
Select chunks based on a given profile mask.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2012-01-16 22:04:47 +02:00
Ilya Dryomov
f43ffb60fd Btrfs: add basic infrastructure for selective balancing
This allows to have a separate set of filters for each chunk type
(data,meta,sys).  The code however is generic and switch on chunk type
is only done once.

This commit also adds a type filter: it allows to balance for example
meta and system chunks w/o touching data ones.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2012-01-16 22:04:47 +02:00
Ilya Dryomov
c9e9f97bdf Btrfs: add basic restriper infrastructure
Add basic restriper infrastructure: extended balancing ioctl and all
related ioctl data structures, add data structure for tracking
restriper's state to fs_info, etc.  The semantics of the old balancing
ioctl are fully preserved.

Explicitly disallow any volume operations when balance is in progress.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2012-01-16 22:04:47 +02:00
Ilya Dryomov
52ba692972 Btrfs: introduce masks for chunk type and profile
Chunk's type and profile are encoded in u64 flags field.  Introduce
masks to easily access them.  Also fix the type of BTRFS_BLOCK_GROUP_*
constants, it should be ULL.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2012-01-16 22:04:47 +02:00
Ilya Dryomov
6fef8df1dc Btrfs: get rid of *_alloc_profile fields
{data,metadata,system}_alloc_profile fields have been unused for a long
time now.  Get rid of them.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2012-01-16 22:04:47 +02:00
Li Zefan
b367e47fb3 Btrfs: fix possible deadlock when opening a seed device
The correct lock order is uuid_mutex -> volume_mutex -> chunk_mutex,
but when we mount a filesystem which has backing seed devices, we have
this lock chain:

    open_ctree()
        lock(chunk_mutex);
        read_chunk_tree();
            read_one_dev();
                open_seed_devices();
                    lock(uuid_mutex);

and then we hit a lockdep splat.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2012-01-11 10:26:54 +08:00
Li Zefan
ec9ef7a13b Btrfs: simplfy calculation of stripe length for discard operation
For btrfs raid, while discarding a range of space, we'll need to know
the start offset and length to discard for each device, and it's done
in btrfs_map_block().

However the calculation is a bit complex for raid0 and raid10, so I
reimplement it based on a fact that:

        dev1          dev2           dev3    (raid0)
        -----------------------------------
        s0 s3 s6      s1 s4 s7       s2 s5

Each device has (total_stripes / nr_dev) stripes, or plus one.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2012-01-11 10:26:46 +08:00
Li Zefan
de11cc12df Btrfs: don't pre-allocate btrfs bio
We pre-allocate a btrfs bio with fixed size, and then may re-allocate
memory if we find stripes are bigger than the fixed size. But this
pre-allocation is not necessary.

Also we don't have to calcuate the stripe number twice.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2012-01-11 10:26:44 +08:00
Li Zefan
125ccb0ae6 Btrfs: don't pass a trans handle unnecessarily in volumes.c
Some functions never use the transaction handle passed to them.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2012-01-11 10:26:42 +08:00
Al Viro
10f6327b5d btrfs: fix a deadlock in btrfs_scan_one_device()
pathname resolution under a global mutex, taken on some paths in ->mount()
is a Bad Idea(tm) - think what happens if said pathname resolution triggers
automount of some btrfs instance and walks into attempt to grab the same
mutex.  Deadlock - we are waiting for daemon to finish walking the path,
daemon is waiting for us to release the mutex...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-08 19:33:24 -05:00
Chris Mason
1100373f8a Btrfs: use bigger metadata chunks on bigger filesystems
The 256MB chunk is a little small on a huge FS.  This scales up the
chunk size.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-01-06 15:47:38 -05:00
Stefan Behrens
21adbd5cbb Btrfs: integrate integrity check module into btrfs
This is the last part of the patch series. It modifies the btrfs
code to use the integrity check module if configured to do so
with the define BTRFS_FS_CHECK_INTEGRITY. If this define is not set,
the only effective change is that code is added that handles the
mount option to activate the integrity check. If the mount option is
set and the define BTRFS_FS_CHECK_INTEGRITY is not set, that code
complains in the log and the mount fails with EINVAL.

Add the mount option to activate the usage of the integrity check
code.
Add invocation of btrfs integrity check code init and cleanup
function on mount and umount, respectively.
Add hook to call btrfs integrity check code version of
submit_bh/submit_bio.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
2011-12-21 19:14:17 +01:00
Chris Mason
d85c8a6f1b Btrfs: unplug every once and a while
The btrfs io submission threads can build up massive plug lists.  This
keeps things more reasonable so we don't hand over huge dumps of IO at
once.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-12-15 15:38:41 -05:00
Chris Mason
5dbc8fca8e Btrfs: fix btrfs_end_bio to deal with write errors to a single mirror
btrfs_end_bio checks the number of errors on a bio against the max
number of errors allowed before sending any EIOs up to the higher
levels.

If we got enough copies of the bio done for a given raid level, it is
supposed to clear the bio error flag and return success.

We have pointers to the original bio sent down by the higher layers and
pointers to any cloned bios we made for raid purposes.  If the original
bio happens to be the one that got an io error, but not the last one to
finish, it might not have the BIO_UPTODATE bit set.

Then, when the last bio does finish, we'll call bio_end_io on the
original bio.  It won't have the uptodate bit set and we'll end up
sending EIO to the higher layers.

We already had a check for this, it just was conditional on getting the
IO error on the very last bio.  Make the check unconditional so we eat
the EIOs properly.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-12-09 11:07:37 -05:00
Li Zefan
a5d1633361 Btrfs: check if the to-be-added device is writable
If we call ioctl(BTRFS_IOC_ADD_DEV) directly, we'll succeed in adding
a readonly device to a btrfs filesystem, and btrfs will write to
that device, emitting kernel errors:

[ 3109.833692] lost page write due to I/O error on loop2
[ 3109.833720] lost page write due to I/O error on loop2
...

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-12-08 08:55:46 -05:00
Miao Xie
924cd8fbe4 Btrfs: fix nocow when deleting the item
btrfs_previous_item() just search the b+ tree, do not COW the nodes or leaves,
if we modify the result of it, the meta-data will be broken. fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-10 20:45:04 -05:00
Chris Mason
806468f8bf Merge git://git.jan-o-sch.net/btrfs-unstable into integration
Conflicts:
	fs/btrfs/Makefile
	fs/btrfs/extent_io.c
	fs/btrfs/extent_io.h
	fs/btrfs/scrub.c

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-06 03:07:10 -05:00
Chris Mason
531f4b1ae5 Merge branch 'for-chris' of git://github.com/sensille/linux into integration
Conflicts:
	fs/btrfs/ctree.h

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-06 03:05:08 -05:00
David Sterba
6c41761fc6 btrfs: separate superblock items out of fs_info
fs_info has now ~9kb, more than fits into one page. This will cause
mount failure when memory is too fragmented. Top space consumers are
super block structures super_copy and super_for_commit, ~2.8kb each.
Allocate them dynamically. fs_info will be ~3.5kb. (measured on x86_64)

Add a wrapper for freeing fs_info and all of it's dynamically allocated
members.

Signed-off-by: David Sterba <dsterba@suse.cz>
2011-11-06 03:04:01 -05:00
Ilya Dryomov
20bcd64934 Btrfs: close all bdevs on mount failure
Fix a bug introduced by 20b45077.  We have to return EINVAL on mount
failure, but doing that too early in the sequence leaves all of the
devices opened exclusively.  This also fixes an issue where under some
scenarios only a second mount -o degraded <devices> command would
succeed.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2011-10-20 18:20:57 +02:00
Josef Bacik
2bf64758fd Btrfs: allow us to overcommit our enospc reservations
One of the things that kills us is the fact that our ENOSPC reservations are
horribly over the top in most normal cases.  There isn't too much that can be
done about this because when we are completely full we really need them to work
like this so we don't under reserve.  However if there is plenty of unallocated
chunks on the disk we can use that to gauge how much we can overcommit.  So this
patch adds chunk free space accounting so we always know how much unallocated
space we have.  Then if we fail to make a reservation within our allocated
space, check to see if we can overcommit.  In the normal flushing case (like
with delalloc metadata reservations) we'll take the free space and divide it by
2 if our metadata profile is setup for DUP or any of those, and then divide it
by 8 to make sure we don't overcommit too much.  Then if we're in a non-flushing
case (we really need this reservation now!) we only limit ourselves to half of
the free space.  This makes this fio test

[torrent]
filename=torrent-test
rw=randwrite
size=4g
ioengine=sync
directory=/mnt/btrfs-test

go from taking around 45 minutes to 10 seconds on my freshly formatted 3 TiB
file system.  This doesn't seem to break my other enospc tests, but could really
use some more testing as this is a super scary change.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 15:12:50 -04:00
Arne Jansen
90519d66ab btrfs: state information for readahead
Add state information for readahead to btrfs_fs_info and btrfs_device

Changes v2:
 - don't wait in radix_trees
 - add own set of workers for readahead

Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Arne Jansen <sensille@gmx.net>
2011-10-02 08:48:30 +02:00
Jan Schmidt
2774b2ca3d btrfs: Put mirror_num in bi_bdev
The error correction code wants to make sure that only the bad mirror is
rewritten. Thus, we need to know which mirror is the bad one. I did not
find a more apropriate field than bi_bdev. But I think using this is fine,
because it is modified by the block layer, anyway, and should not be read
after the bio returned.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2011-09-29 13:38:42 +02:00
Jan Schmidt
a1d3c4786a btrfs: btrfs_multi_bio replaced with btrfs_bio
btrfs_bio is a bio abstraction able to split and not complete after the last
bio has returned (like the old btrfs_multi_bio). Additionally, btrfs_bio
tracks the mirror_num used to read data which can be used for error
correction purposes.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2011-09-29 13:38:42 +02:00
Miao Xie
0e58885961 Btrfs: fix uninitialized sync_pending
sync_pending is uninitialized before it be used, fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-08-16 21:09:31 -04:00
liubo
38c01b9605 Btrfs: fix a bug of balance on full multi-disk partitions
When balancing, we'll first try to shrink devices for some space,
but if it is working on a full multi-disk partition with raid protection,
we may encounter a bug, that is, while shrinking, total_bytes may be less
than bytes_used, and btrfs may allocate a dev extent that accesses out of
device's bounds.

Then we will not be able to write or read the data which stores at the end
of the device, and get the followings:

device fsid 0939f071-7ea3-46c8-95df-f176d773bfb6 devid 1 transid 10 /dev/sdb5
Btrfs detected SSD devices, enabling SSD mode
btrfs: relocating block group 476315648 flags 9
btrfs: found 4 extents
attempt to access beyond end of device
sdb5: rw=145, want=546176, limit=546147
attempt to access beyond end of device
sdb5: rw=145, want=546304, limit=546147
attempt to access beyond end of device
sdb5: rw=145, want=546432, limit=546147
attempt to access beyond end of device
sdb5: rw=145, want=546560, limit=546147
attempt to access beyond end of device

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-08-16 21:09:15 -04:00
Josef Bacik
d5e2003c2b Btrfs: detect wether a device supports discard
We have a problem where if a user specifies discard but doesn't actually support
it we will return EOPNOTSUPP from btrfs_discard_extent.  This is a problem
because this gets called (in a fashion) from the tree log recovery code, which
has a nice little BUG_ON(ret) after it, which causes us to fail the tree log
replay.  So instead detect wether our devices support discard when we're adding
them and then don't issue discards if we know that the device doesn't support
it.  And just for good measure set ret = 0 in btrfs_issue_discard just in case
we still get EOPNOTSUPP so we don't screw anybody up like this again.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-08-16 21:09:15 -04:00
Chris Mason
2ab1ba68ae Btrfs: force unplugs when switching from high to regular priority bios
Btrfs does bio submissions from a worker thread, and each device
has a list of high priority bios and regular priority bios.

Synchronous writes go to the high priority thread while async writes
go to regular list.  This commit brings back an explicit unplug
any time we switch from high to regular priority, which makes it
easier for the block layer to give us low latencies.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-08-05 13:48:18 -04:00
Chris Mason
b43b31bdf2 Merge branch 'alloc_path' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/btrfs-error-handling into for-linus 2011-08-01 14:27:34 -04:00
Chris Mason
85d4e46111 Btrfs: make a lockdep class for each root
This patch was originally from Tejun Heo.  lockdep complains about the btrfs
locking because we sometimes take btree locks from two different trees at the
same time.  The current classes are based only on level in the btree, which
isn't enough information for lockdep to figure out if the lock is safe.

This patch makes a class for each type of tree, and lumps all the FS trees that
actually have files and directories into the same class.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-07-27 12:46:46 -04:00
Mark Fasheh
92b8e897f6 btrfs: Don't BUG_ON alloc_path errors in find_next_chunk
I also removed the BUG_ON from error return of find_next_chunk in
init_first_rw_device(). It turns out that the only caller of
init_first_rw_device() also BUGS on any nonzero return so no actual behavior
change has occurred here.

do_chunk_alloc() also needed an update since it calls btrfs_alloc_chunk()
which can now return -ENOMEM. Instead of setting space_info->full on any
error from btrfs_alloc_chunk() I catch and return every error value _except_
-ENOSPC. Thanks goes to Tsutomu Itoh for pointing that issue out.

Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2011-07-25 14:34:54 -07:00
Mark Fasheh
17e9f796bd btrfs: Don't BUG_ON alloc_path errors in btrfs_balance()
Dealing with this seems trivial - the only caller of btrfs_balance() is
btrfs_ioctl() which passes the error code directly back to userspace. There
also isn't much state to unwind (if I'm wrong about this point, we can
always safely move the allocation to the top of btrfs_balance() anyway).

Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2011-07-14 14:14:45 -07:00
Josef Bacik
508794eb5e Btrfs: don't panic if we get an error while balancing V2
A user reported an error where if we try to balance an fs after a device has
been removed it will blow up.  This is because we get an EIO back and this is
where BUG_ON(ret) bites us in the ass.  To fix we just exit.  Thanks,

Reported-by: Anand Jain <Anand.Jain@oracle.com>
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-07-06 18:46:43 -04:00
Ilya Dryomov
22b63a2971 Btrfs - use %pU to print fsid
Get rid of FIXME comment.  Uuids from dmesg are now the same as uuids
given by btrfs-progs.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-06-10 19:02:04 -04:00
Arne Jansen
5f3f302a6f btrfs: false BUG_ON when degraded
In degraded mode the struct btrfs_device of missing devs don't have
device->name set. A kstrdup of NULL correctly returns NULL. Don't
BUG in this case.

Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-06-04 08:03:44 -04:00
Chris Mason
d6c0cb379c Merge branch 'cleanups_and_fixes' into inode_numbers
Conflicts:
	fs/btrfs/tree-log.c
	fs/btrfs/volumes.c

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-23 14:37:47 -04:00
Xiao Guangrong
1f78160ce1 Btrfs: using rcu lock in the reader side of devices list
fs_devices->devices is only updated on remove and add device paths, so we can
use rcu to protect it in the reader side

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-23 13:24:43 -04:00