Commit graph

332,963 commits

Author SHA1 Message Date
Alex Elder
1f7ba33115 rbd: handle locking inside __rbd_client_find()
There is only caller of __rbd_client_find(), and it somewhat
clumsily gets the appropriate lock and gets a reference to the
existing ceph_client structure if it's found.

Instead, have that function handle its own locking, and acquire the
reference if found while it holds the lock.  Drop the underscores
from the name because there's no need to signify anything special
about this function.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
2012-10-01 14:30:49 -05:00
Wei Yongjun
cc4829e596 ceph: use list_move_tail instead of list_del/list_add_tail
Using list_move_tail() instead of list_del() + list_add_tail().

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Sage Weil <sage@inktank.com>
2012-10-01 14:30:49 -05:00
Alex Elder
523f32582f rbd: add new snapshots at the tail
This fixes a bug that went in with this commit:

    commit f6e0c99092cca7be00fca4080cfc7081739ca544
    Author: Alex Elder <elder@inktank.com>
    Date:   Thu Aug 2 11:29:46 2012 -0500
    rbd: simplify __rbd_init_snaps_header()

The problem is that a new rbd snapshot needs to go either after an
existing snapshot entry, or at the *end* of an rbd device's snapshot
list.  As originally coded, it is placed at the beginning.  This was
based on the assumption the list would be empty (so it wouldn't
matter), but in fact if multiple new snapshots are added to an empty
list in one shot the list will be non-empty after the first one is
added.

This addresses http://tracker.newdream.net/issues/3063

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:49 -05:00
Alex Elder
843a0d0879 rbd: rename block_name -> object_prefix
In the on-disk image header structure there is a field "block_name"
which represents what we now call the "object prefix" for an rbd
image.  Rename this field "object_prefix" to be consistent with
modern usage.

This appears to be the only remaining vestige of the use of "block"
in symbols that represent objects in the rbd code.

This addresses http://tracker.newdream.net/issues/1761

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
2012-10-01 14:30:49 -05:00
Iulius Curt
7698f2f5e0 libceph: Fix sparse warning
Make ceph_monc_do_poolop() static to remove the following sparse warning:
 * net/ceph/mon_client.c:616:5: warning: symbol 'ceph_monc_do_poolop' was not
   declared. Should it be static?
Also drops the 'ceph_monc_' prefix, now being a private function.

Signed-off-by: Iulius Curt <icurt@ixiacom.com>
Signed-off-by: Sage Weil <sage@inktank.com>
2012-10-01 14:30:49 -05:00
Sage Weil
290e33593d libceph: remove unused monc->have_fsid
This is unused; use monc->client->have_fsid.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-10-01 14:30:49 -05:00
Alex Elder
c98f533c94 ceph: let path portion of mount "device" be optional
A recent change to /sbin/mountall causes any trailing '/' character
in the "device" (or fs_spec) field in /etc/fstab to be stripped.  As
a result, an entry for a ceph mount that intends to mount the root
of the name space ends up with now path portion, and the ceph mount
option processing code rejects this.

That is, an entry in /etc/fstab like:
    cephserver:port:/ /mnt ceph defaults 0 0
provides to the ceph code just "cephserver:port:" as the "device,"
and that gets rejected.

Although this is a bug in /sbin/mountall, we can have the ceph mount
code support an empty/nonexistent path, interpreting it to mean the
root of the name space.

RFC 5952 offers recommendations for how to express IPv6 addresses,
and recommends the usage found in RFC 3986 (which specifies the
format for URI's) for representing both IPv4 and IPv6 addresses that
include port numbers.  (See in particular the definition of
"authority" found in the Appendix of RFC 3986.)

According to those standards, no host specification will ever
contain a '/' character.  As a result, it is sufficient to scan a
provided "device" from an /etc/fstab entry for the first '/'
character, and if it's found, treat that as the beginning of the
path.  If no '/' character is present, we can treat the entire
string as the monitor host specification(s), and assume the path
to be the root of the name space.  We'll still require a ':' to
separate the host portion from the (possibly empty) path portion.

This means that we can more formally define how ceph will interpret
the "device" it's provided when processing a mount request:

    "device" will look like:
        <server_spec>[,<server_spec>...]:[<path>]
    where
        <server_spec> is <ip>[:<port>]
        <path> is optional, but if present must begin with '/'

This addresses http://tracker.newdream.net/issues/2919

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
2012-10-01 14:30:49 -05:00
Alex Elder
4156d99840 rbd: separate reading header from decoding it
Right now rbd_read_header() both reads the header object for an rbd
image and decodes its contents.  It does this repeatedly if needed,
in order to ensure a complete and intact header is obtained.

Separate this process into two steps--reading of the raw header
data (in new function, rbd_dev_v1_header_read()) and separately
decoding its contents (in rbd_header_from_disk()).  As a result,
the latter function no longer requires its allocated_snaps argument.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:49 -05:00
Alex Elder
103a150f0c rbd: expand rbd_dev_ondisk_valid() checks
Add checks on the validity of the snap_count and snap_names_len
field values in rbd_dev_ondisk_valid().  This eliminates the
need to do them in rbd_header_from_disk().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:48 -05:00
Alex Elder
28cb775de1 rbd: return earlier in rbd_header_from_disk()
The only caller of rbd_header_from_disk() is rbd_read_header().
It passes as allocated_snaps the number of snapshots it will
have received from the server for the snapshot context that
rbd_header_from_disk() is to interpret.  The first time through
it provides 0--mainly to extract the number of snapshots from
the snapshot context header--so that it can allocate an
appropriately-sized buffer to receive the entire snapshot
context from the server in a second request.

rbd_header_from_disk() will not fill in the array of snapshot ids
unless the number in the snapshot matches the number the caller
had allocated.

This patch adjusts that logic a little further to be more efficient.
rbd_read_header() doesn't even examine the snapshot context unless
the snapshot count (stored in header->total_snaps) matches the
number of snapshots allocated.  So rbd_header_from_disk() doesn't
need to allocate or fill in the snapshot context field at all in
that case.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:48 -05:00
Alex Elder
6a52325f61 rbd: rearrange rbd_header_from_disk()
This just moves code around for the most part.  It was pulled out as
a separate patch to avoid cluttering up some upcoming patches which
are more substantive.  The point is basically to group everything
related to initializing the snapshot context together.

The only functional change is that rbd_header_from_disk() now
ensures the (in-core) header it is passed is zero-filled.  This
allows a simpler error handling path in rbd_header_from_disk().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:48 -05:00
Alex Elder
d2bb24e506 rbd: use sizeof (object) instead of sizeof (type)
Fix a few spots in rbd_header_from_disk() to use sizeof (object)
rather than sizeof (type).  Use a local variable to record sizes
to shorten some lines and improve readability.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:48 -05:00
Alex Elder
d78fd7ae03 rbd: ensure invalid pointers are made null
Fix a number of spots where a pointer value that is known to
have become invalid but was not reset to null.

Also, toss in a change so we use sizeof (object) rather than
sizeof (type).

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:48 -05:00
Alex Elder
0f1d3f9385 rbd: make snap_names_len a u64
The snap_names_len field of an rbd_image_header structure is defined
with type size_t.  That field is used as both the source and target
of 64-bit byte-order swapping operations though, so it's best to
define it with type u64 instead.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:48 -05:00
Alex Elder
3593815022 rbd: simplify __rbd_init_snaps_header()
The purpose of __rbd_init_snaps_header() is to compare a new
snapshot context with an rbd device's list of existing snapshots.
It updates the list by adding any new snapshots or removing any
that are not present in the new snapshot context.

The code as written is a little confusing, because it traverses both
the existing snapshot list and the set of snapshots in the snapshot
context in reverse.  This was done based on an assumption about
snapshots that is not true--namely that a duplicate snapshot name
could cause an error in intepreting things if they were not
processed in ascending order.

These precautions are not necessary, because:
    - all snapshots are uniquely identified by their snapshot id
    - a new snapshot cannot be created if the rbd device has another
      snapshot with the same name
(It is furthermore not currently possible to rename a snapshot.)

This patch re-implements __rbd_init_snaps_header() so it passes
through both the existing snapshot list and the entries in the
snapshot context in forward order.  It still does the same thing
as before, but I find the logic considerably easier to understand.

By going forward through the names in the snapshot context, there
is no longer a need for the rbd_prev_snap_name() helper function.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:48 -05:00
Linus Torvalds
3498d13b80 TTY merge for 3.7-rc1
As we skipped the merge window for 3.6-rc1 for the tty tree, everything
 is now settled down and working properly, so we are ready for 3.7-rc1.
 Here's the patchset, it's big, but the large changes are removing a
 firmware file and adding a staging tty driver (it depended on the tty
 core changes, so it's going through this tree instead of the staging
 tree.)
 
 All of these patches have been in the linux-next tree for a while.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.19 (GNU/Linux)
 
 iEYEABECAAYFAlBp36oACgkQMUfUDdst+yk4WgCdEy13hot8fI2Lqnc7W0LKu7GX
 4p8AoLTjzrXhLosxdijskDQ9X1OtjrxU
 =S5Ng
 -----END PGP SIGNATURE-----

Merge tag 'tty-3.6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty

Pull TTY changes from Greg Kroah-Hartman:
 "As we skipped the merge window for 3.6-rc1 for the tty tree,
  everything is now settled down and working properly, so we are ready
  for 3.7-rc1.  Here's the patchset, it's big, but the large changes are
  removing a firmware file and adding a staging tty driver (it depended
  on the tty core changes, so it's going through this tree instead of
  the staging tree.)

  All of these patches have been in the linux-next tree for a while.

  Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>"

Fix up more-or-less trivial conflicts in
 - drivers/char/pcmcia/synclink_cs.c:
    tty NULL dereference fix vs tty_port_cts_enabled() helper function
 - drivers/staging/{Kconfig,Makefile}:
    add-add conflict (dgrp driver added close to other staging drivers)
 - drivers/staging/ipack/devices/ipoctal.c:
    "split ipoctal_channel from iopctal" vs "TTY: use tty_port_register_device"

* tag 'tty-3.6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (235 commits)
  tty/serial: Add kgdb_nmi driver
  tty/serial/amba-pl011: Quiesce interrupts in poll_get_char
  tty/serial/amba-pl011: Implement poll_init callback
  tty/serial/core: Introduce poll_init callback
  kdb: Turn KGDB_KDB=n stubs into static inlines
  kdb: Implement disable_nmi command
  kernel/debug: Mask KGDB NMI upon entry
  serial: pl011: handle corruption at high clock speeds
  serial: sccnxp: Make 'default' choice in switch last
  serial: sccnxp: Remove mask termios caps for SW flow control
  serial: sccnxp: Report actual baudrate back to core
  serial: samsung: Add poll_get_char & poll_put_char
  Powerpc 8xx CPM_UART setting MAXIDL register proportionaly to baud rate
  Powerpc 8xx CPM_UART maxidl should not depend on fifo size
  Powerpc 8xx CPM_UART too many interrupts
  Powerpc 8xx CPM_UART desynchronisation
  serial: set correct baud_base for EXSYS EX-41092 Dual 16950
  serial: omap: fix the reciever line error case
  8250: blacklist Winbond CIR port
  8250_pnp: do pnp probe before legacy probe
  ...
2012-10-01 12:26:52 -07:00
Miao Xie
90abccf2c6 Revert "Btrfs: do not do filemap_write_and_wait_range in fsync"
This reverts commit 0885ef5b56

After applying the above patch, the performance slowed down because the dirty
page flush can only be done by one task, so revert it.

The following is the test result of sysbench:
	Before		After
	24MB/s		39MB/s

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2012-10-01 15:19:22 -04:00
Josef Bacik
698d0082c4 Btrfs: remove bytes argument from do_chunk_alloc
Everybody is just making stuff up, and it's just used to see if we really do
need to alloc a chunk, and since we do this when we already know we really
do it's just a waste of space.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-10-01 15:19:21 -04:00
Josef Bacik
ea658badc4 Btrfs: delay block group item insertion
So we have lots of places where we try to preallocate chunks in order to
make sure we have enough space as we make our allocations.  This has
historically meant that we're constantly tweaking when we should allocate a
new chunk, and historically we have gotten this horribly wrong so we way
over allocate either metadata or data.  To try and keep this from happening
we are going to make it so that the block group item insertion is done out
of band at the end of a transaction.  This will allow us to create chunks
even if we are trying to make an allocation for the extent tree.  With this
patch my enospc tests run faster (didn't expect this) and more efficiently
use the disk space (this is what I wanted).  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-10-01 15:19:21 -04:00
Kent Overstreet
be3940c0a9 btrfs: Kill some bi_idx references
For immutable bio vecs, I've been auditing and removing bi_idx
references. These were harmless, but removing them will make auditing
easier.

scrub_bio_end_io_worker() was open coding a bio_reset() - but this
doesn't appear to have been needed for anything as right after it does a
bio_put(), and perusing the code it doesn't appear anything else was
holding a reference to the bio.

The other use end_bio_extent_readpage() was just for a pr_debug() -
changed it to something that might be a bit more useful.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Chris Mason <chris.mason@oracle.com>
CC: Stefan Behrens <sbehrens@giantdisaster.de>
2012-10-01 15:19:21 -04:00
Miao Xie
962197babe Btrfs: fix unnecessary warning when the fragments make the space alloc fail
When we wrote some data by compress mode into a btrfs filesystem which was full
of the fragments, the kernel will report:
	BTRFS warning (device xxx): Aborting unused transaction.

The reason is:
We can not find a long enough free space to store the compressed data because
of the fragmentary free space, and the compressed data can not be splited,
so the kernel outputed the above message.

In fact, btrfs can deal with this problem very well: it fall back to
uncompressed IO, split the uncompressed data into small ones, and then
store them into to the fragmentary free space. So we shouldn't output the
above warning message.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2012-10-01 15:19:20 -04:00
Josef Bacik
69ffb54347 Btrfs: create a pinned em when writing to a prealloc range in DIO
Wade Cline reported a problem where he was getting garbage and warnings when
writing to a preallocated range via O_DIRECT.  This is because we weren't
creating our normal pinned extent_map for the range we were writing to,
which was causing all sorts of issues.  This patch fixes the problem and
makes his testcase much happier.  Thanks,

Reported-by: Wade Cline <clinew@linux.vnet.ibm.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-10-01 15:19:20 -04:00
Josef Bacik
6df7881a84 Btrfs: move the sb_end_intwrite until after the throttle logic
Sage reported the following lockdep backtrace

=====================================
[ BUG: bad unlock balance detected! ]
3.6.0-rc2-ceph-00171-gc7ed62d #1 Not tainted
-------------------------------------
btrfs-cleaner/7607 is trying to release lock (sb_internal) at:
[<ffffffffa00422ae>] btrfs_commit_transaction+0xa6e/0xb20 [btrfs]
but there are no more locks to release!

other info that might help us debug this:
1 lock held by btrfs-cleaner/7607:
 #0:  (&fs_info->cleaner_mutex){+.+...}, at: [<ffffffffa003b405>] cleaner_kthread+0x95/0x120 [btrfs]

stack backtrace:
Pid: 7607, comm: btrfs-cleaner Not tainted 3.6.0-rc2-ceph-00171-gc7ed62d #1
Call Trace:
 [<ffffffffa00422ae>] ? btrfs_commit_transaction+0xa6e/0xb20 [btrfs]
 [<ffffffff810afa9e>] print_unlock_inbalance_bug+0xfe/0x110
 [<ffffffff810b289e>] lock_release_non_nested+0x1ee/0x310
 [<ffffffff81172f9b>] ? kmem_cache_free+0x7b/0x160
 [<ffffffffa004106c>] ? put_transaction+0x8c/0x130 [btrfs]
 [<ffffffffa00422ae>] ? btrfs_commit_transaction+0xa6e/0xb20 [btrfs]
 [<ffffffff810b2a95>] lock_release+0xd5/0x220
 [<ffffffff81173071>] ? kmem_cache_free+0x151/0x160
 [<ffffffff8117d9ed>] __sb_end_write+0x7d/0x90
 [<ffffffffa00422ae>] btrfs_commit_transaction+0xa6e/0xb20 [btrfs]
 [<ffffffff81079850>] ? __init_waitqueue_head+0x60/0x60
 [<ffffffff81634c6b>] ? _raw_spin_unlock+0x2b/0x40
 [<ffffffffa0042758>] __btrfs_end_transaction+0x368/0x3c0 [btrfs]
 [<ffffffffa0042808>] btrfs_end_transaction_throttle+0x18/0x20 [btrfs]
 [<ffffffffa00318f0>] btrfs_drop_snapshot+0x410/0x600 [btrfs]
 [<ffffffff8132babd>] ? do_raw_spin_unlock+0x5d/0xb0
 [<ffffffffa00430ef>] btrfs_clean_old_snapshots+0xaf/0x150 [btrfs]
 [<ffffffffa003b405>] ? cleaner_kthread+0x95/0x120 [btrfs]
 [<ffffffffa003b419>] cleaner_kthread+0xa9/0x120 [btrfs]
 [<ffffffffa003b370>] ? btrfs_destroy_delayed_refs.isra.102+0x220/0x220 [btrfs]
 [<ffffffff810791ee>] kthread+0xae/0xc0
 [<ffffffff810b379d>] ? trace_hardirqs_on+0xd/0x10
 [<ffffffff8163e744>] kernel_thread_helper+0x4/0x10
 [<ffffffff81635430>] ? retint_restore_args+0x13/0x13
 [<ffffffff81079140>] ? flush_kthread_work+0x1a0/0x1a0
 [<ffffffff8163e740>] ? gs_change+0x13/0x13

This is because the throttle stuff can commit the transaction, which expects to
be the one stopping the intwrite stuff, but we've already done it in the
__btrfs_end_transaction.  Moving the sb_end_intewrite after this logic makes the
lockdep go away.  Thanks,

Tested-by: Sage Weil <sage@inktank.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-10-01 15:19:19 -04:00
Liu Bo
425d17a290 Btrfs: use larger limit for translation of logical to inode
This is the change of the kernel side.

Translation of logical to inode used to have an upper limit 4k on
inode container's size, but the limit is not large enough for a data
with a great many of refs, so when resolving logical address,
we can end up with
"ioctl ret=0, bytes_left=0, bytes_missing=19944, cnt=510, missed=2493"

This changes to regard 64k as the upper limit and use vmalloc instead of
kmalloc to get memory more easily.

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
2012-10-01 15:19:19 -04:00
Liu Bo
df031f0752 Btrfs: use helper for logical resolve
We already have a helper, iterate_inodes_from_logical(), for logical resolve,
so just use it.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
2012-10-01 15:19:18 -04:00
Liu Bo
69917e4312 Btrfs: fix a bug in parsing return value in logical resolve
In logical resolve, we parse extent_from_logical()'s 'ret' as a kind of flag.

It is possible to lose our errors because
(-EXXXX & BTRFS_EXTENT_FLAG_TREE_BLOCK) is true.

I'm not sure if it is on purpose, it just looks too hacky if it is.
I'd rather use a real flag and a 'ret' to catch errors.

Acked-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Liu Bo <liub.liubo@gmail.com>
2012-10-01 15:19:18 -04:00
Liu Bo
dea7d76ecb Btrfs: update delayed ref's tracepoints to show sequence
We've added a new field 'sequence' to delayed ref node, so update related
tracepoints.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
2012-10-01 15:19:17 -04:00
liubo
0647d6bd16 Btrfs: cleanup for unused ref cache stuff
As ref cache has been removed from btrfs, there is no user on
its lock and its check.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
2012-10-01 15:19:17 -04:00
Miao Xie
8407aa4643 Btrfs: fix corrupted metadata in the snapshot
When we delete a inode, we will remove all the delayed items including delayed
inode update, and then truncate all the relative metadata. If there is lots of
metadata, we will end the current transaction, and start a new transaction to
truncate the left metadata. In this way, we will leave a inode item that its
link counter is > 0, and also may leave some directory index items in fs/file tree
after the current transaction ends. In other words, the metadata in this fs/file tree
is inconsistent. If we create a snapshot for this tree now, we will find a inode with
corrupted metadata in the new snapshot, and we won't continue to drop the left metadata,
because its link counter is not 0.

We fix this problem by updating the inode item before the current transaction ends.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2012-10-01 15:19:17 -04:00
David Sterba
837e197283 btrfs: polish names of kmem caches
Usecase:

  watch 'grep btrfs < /proc/slabinfo'

easy to watch all caches in one go.

Signed-off-by: David Sterba <dsterba@suse.cz>
2012-10-01 15:19:16 -04:00
Josef Bacik
a80c8dcf7e Btrfs: fix our overcommit math
I noticed I was seeing large lags when running my torrent test in a vm on my
laptop.  While trying to make it lag less I noticed that our overcommit math
was taking into account the number of bytes we wanted to reclaim, not the
number of bytes we actually wanted to allocate, which means we wouldn't
overcommit as often.  This patch fixes the overcommit math and makes
shrink_delalloc() use that logic so that it will stop looping faster.  We
still have pretty high spikes of latency, but the test now takes 3 minutes
less time (about 5% faster).  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-10-01 15:19:16 -04:00
Josef Bacik
dea31f5233 Btrfs: wait on async pages when shrinking delalloc
Mitch reported a problem where you could get an ENOSPC error when untarring
a kernel git tree onto a 16gb file system with compress-force=zlib.  This is
because compression is a huge pain, it will return from ->writepages()
without having actually created any ordered extents.  To get around this we
check to see if the async submit counter is up, and if it is wait until it
drops to 0 before doing our normal ordered wait dance.  With this patch I
can now untar a kernel git tree onto a 16gb file system without getting
ENOSPC errors.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-10-01 15:19:15 -04:00
Liu Bo
9e8a4a8b0b Btrfs: use flag EXTENT_DEFRAG for snapshot-aware defrag
We're going to use this flag EXTENT_DEFRAG to indicate which range
belongs to defragment so that we can implement snapshow-aware defrag:

We set the EXTENT_DEFRAG flag when dirtying the extents that need
defragmented, so later on writeback thread can differentiate between
normal writeback and writeback started by defragmentation.

Original-Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
2012-10-01 15:19:15 -04:00
Tsutomu Itoh
3d6b5c3b5c Btrfs: check return value of ulist_alloc() properly
ulist_alloc() has the possibility of returning NULL.
So, it is necessary to check the return value.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
2012-10-01 15:19:14 -04:00
Tsutomu Itoh
f54fb859da Btrfs: fix error handling in delete_block_group_cache()
btrfs_iget() never return NULL.
So, NULL check is unnecessary.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
2012-10-01 15:19:14 -04:00
Miao Xie
903889f462 Btrfs: fix wrong size for the reservation when doing, file pre-allocation.
When we ran fsstress(a program in xfstests), the filesystem hung up when it
is full. It was because the space reserved in btrfs_fallocate() was wrong,
btrfs_fallocate() just used the size of the pre-allocation to reserve the
space, didn't took the block size aligning into account, so the size of
the reserved space was less than the allocated space, it caused the over
reserve problem and made the filesystem hung up when invoking cow_file_range().
Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2012-10-01 15:19:14 -04:00
Miao Xie
69ce977a17 Btrfs: output more information when aborting a unused transaction handle
Though we dump the stack information when aborting a unused transaction
handle, we don't know the correct place where we decide to abort the
transaction handle if one function has several place where the transaction
abort function is invoked and jumps to the same place after this call.
And beside that we also don't know the reason why we jump to abort
the current handle. So I modify the transaction abort function and make
it output the function name, line and error information.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2012-10-01 15:19:13 -04:00
Miao Xie
2ecb79239b Btrfs: fix unprotected ->log_batch
We forget to protect ->log_batch when syncing a file, this patch fix
this problem by atomic operation. And ->log_batch is used to check
if there are parallel sync operations or not, so it is unnecessary to
reset it to 0 after the sync operation of the current log tree complete.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2012-10-01 15:19:12 -04:00
Miao Xie
48c03c4bcf Btrfs: fix wrong size for the reservation of the, snapshot creation
We should insert/update 6 items(root ref, root backref, dir item, dir index,
root item and parent inode) when creating a snapshot, not 5 items, fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2012-10-01 15:19:12 -04:00
Miao Xie
42874b3db7 Btrfs: fix the snapshot that should not exist
The snapshot should be the image of the fs tree before it was created,
so the metadata of the snapshot should not exist in the its tree. But now, we
found the directory item and directory name index is in both the snapshot tree
and the fs tree. It introduces some problems and makes the users feel strange:

 # mkfs.btrfs /dev/sda1
 # mount /dev/sda1 /mnt
 # mkdir /mnt/1
 # cd /mnt/1
 # btrfs subvolume snapshot /mnt snap0
 # ls -a /mnt/1/snap0/1
 .	..	[no other file/dir]

 # ll /mnt/1/snap0/
 total 0
 drwxr-xr-x 1 root root 10 Ju1 24 12:11 1
			^^^
			There is no file/dir in it, but it's size is 10

 # cd /mnt/1/snap0/1/snap0
 [Enter a unexisted directory successfully...]

There is nothing in the directory 1 in snap0, but btrfs told the length of
this directory is 10. Beside that, we can enter an unexisted directory, it is
very strange to the users.

 # btrfs subvolume snapshot /mnt/1/snap0 /mnt/snap1
 # ll /mnt/1/snap0/1/
 total 0
 [None]
 # ll /mnt/snap1/1/
 total 0
 drwxr-xr-x 1 root root 0 Ju1 24 12:14 snap0

And the source of snap1 did have any directory in Directory 1, but snap1 have
a snap0, it is different between the source and the snapshot.

So I think we should insert directory item and directory name index and update
the parent inode as the last step of snapshot creation, and do not leave the
useless metadata in the file tree.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2012-10-01 15:19:12 -04:00
Miao Xie
66d8f3dd1c Btrfs: add a new "type" field into the block reservation structure
Sometimes we need choose the method of the reservation according to the type
of the block reservation, such as the reservation for the delayed inode update.
Now we identify the type just by comparing the address of the reservation
variants, it is very ugly if it is a temporary one because we need compare it
with all the common reservation variants. So we add a new "type" field to keep
the type the reservation variants.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2012-10-01 15:19:11 -04:00
Miao Xie
6352b91da1 Btrfs: use a slab for ordered extents allocation
The ordered extent allocation is in the fast path of the IO, so use a slab
to improve the speed of the allocation.

 "Size of the struct is 280, so this will fall into the size-512 bucket,
  giving 8 objects per page, while own slab will pack 14 objects into a page.

  Another benefit I see is to check for leaked objects when the module is
  removed (and the cache destroy takes place)."
						-- David Sterba

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2012-10-01 15:19:11 -04:00
Miao Xie
b9a8cc5bef Btrfs: fix file extent discount problem in the, snapshot
If a snapshot is created while we are writing some data into the file,
the i_size of the corresponding file in the snapshot will be wrong, it will
be beyond the end of the last file extent. And btrfsck will report:
  root 256 inode 257 errors 100

Steps to reproduce:
 # mkfs.btrfs <partition>
 # mount <partition> <mnt>
 # cd <mnt>
 # dd if=/dev/zero of=tmpfile bs=4M count=1024 &
 # for ((i=0; i<4; i++))
 > do
 > btrfs sub snap . $i
 > done

This because the algorithm of disk_i_size update is wrong. Though there are
some ordered extents behind the current one which we use to update disk_i_size,
it doesn't mean those extents will be dealt with in the same transaction. So
We shouldn't use the offset of those extents to update disk_i_size. Or we will
get the wrong i_size in the snapshot.

We fix this problem by recording the max real i_size. If we find there is a
ordered extent which is in front of the current one and doesn't complete, we
will record the end of the current one into that ordered extent. Surely, if
the current extent holds the end of other extent(it must be greater than
the current one because it is behind the current one), we will record the
number that the current extent holds. In this way, we can exclude the ordered
extents that may not be dealth with in the same transaction, and be easy to
know the real disk_i_size.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2012-10-01 15:19:10 -04:00
Miao Xie
361048f586 Btrfs: fix full backref problem when inserting shared block reference
If we create several snapshots at the same time, the following BUG_ON() will be
triggered.

	kernel BUG at fs/btrfs/extent-tree.c:6047!

Steps to reproduce:
 # mkfs.btrfs <partition>
 # mount <partition> <mnt>
 # cd <mnt>
 # for ((i=0;i<2400;i++)); do touch long_name_to_make_tree_more_deep$i; done
 # for ((i=0; i<4; i++))
 > do
 > mkdir $i
 > for ((j=0; j<200; j++))
 > do
 > btrfs sub snap . $i/$j
 > done &
 > done

The reason is:
Before transaction commit, some operations changed the fs tree and new tree
blocks were allocated because of COW. We used the implicit non-shared back
reference for those newly allocated tree blocks because they were not shared by
two or more trees.

And then we created the first snapshot for the fs tree, according to the back
reference rules, we also used implicit back refs for the child tree blocks of
the root node of the fs tree, now those child nodes/leaves were shared by two
trees.

Then We didn't deal with the delayed references, and continued to change the fs
tree(created the second snapshot and inserted the dir item of the new snapshot
into the fs tree). According to the rules of the back reference, we added full
back refs for those tree blocks whose parents have be shared by two trees.
Now some newly allocated tree blocks had two types of the references.

As we know, the delayed reference system handles these delayed references from
back to front, and the full delayed reference is inserted after the implicit
ones. So when we dealt with the back references of those newly allocated tree
blocks, the full references was dealt with at first. And if the first reference
is a shared back reference and the tree block that the reference points to is
newly allocated, It would be considered as a tree block which is shared by two
or more trees when it is allocated and should be a full back reference not a
implicit one, the flag of its reference also should be set to FULL_BACKREF.
But in fact, it was a non-shared tree block with a implicit reference at
beginning, so it was not compulsory to set the flags to FULL_BACKREF. So BUG_ON
was triggered.

We have several methods to fix this bug:
1. deal with delayed references after the snapshot is created and before we
   change the source tree of the snapshot. This is the easiest and safest way.
2. modify the sort method of the delayed reference tree, make the full delayed
   references be inserted before the implicit ones. It is also very easy, but
   I don't know if it will introduce some problems or not.
3. modify select_delayed_ref() and make it select the implicit delayed reference
   at first. This way is not so good because it may wastes CPU time if we have
   lots of delayed references.
4. set the flags to FULL_BACKREF, this method is a little complex comparing with
   the 1st way.

I chose the 1st way to fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2012-10-01 15:19:10 -04:00
Miao Xie
6fa9700e73 Btrfs: fix error path in create_pending_snapshot()
This patch fixes the following problem:
- If we failed to deal with the delayed dir items, we should abort transaction,
  just as its comment said. Fix it.
- If root reference or root back reference insertion failed, we should
  abort transaction. Fix it.
- Fix the double free problem of pending->inherit.
- Do not restore the trans->rsv if we doesn't change it.
- make the error path more clearly.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2012-10-01 15:19:09 -04:00
Wei Yongjun
cf93dccea6 Btrfs: fix possible memory leak in scrub_setup_recheck_block()
bbio has been malloced in btrfs_map_block() and should be
freed before leaving from the error handling cases.

spatch with a semantic match is used to found this problem.
(http://coccinelle.lip6.fr/)

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
2012-10-01 15:19:09 -04:00
Josef Bacik
7014cdb493 Btrfs: btrfs_drop_extent_cache should never fail
I noticed this when I was doing the fsync stuff, we allocate split extents if we
drop an extent range that is in the middle of an existing extent.  This BUG()'s
if we fail to allocate memory, but the fact is this is just a cache, we will
just regenerate the cache if we need it, the important part is that we free the
range we are given.  This can be done without allocations, so if we fail to
allocate splits just skip the splitting stage and free our em and look for more
extents to drop.  This also makes btrfs_drop_extent_cache a void since nobody
was checking the return value anyway.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-10-01 15:19:09 -04:00
Sage Weil
ac14aed665 Btrfs: do not take cleanup_work_sem in btrfs_run_delayed_iputs()
Josef has suggested that this is not necessary.  Removing it also avoids
this lockdep splat (after the new sb_internal locking stuff was added):

[  604.090449] ======================================================
[  604.114819] [ INFO: possible circular locking dependency detected ]
[  604.139262] 3.6.0-rc2-ceph-00144-g463b030 #1 Not tainted
[  604.162193] -------------------------------------------------------
[  604.186139] btrfs-cleaner/6669 is trying to acquire lock:
[  604.209555]  (sb_internal#2){.+.+..}, at: [<ffffffffa0042b84>] start_transaction+0x124/0x430 [btrfs]
[  604.257100]
[  604.257100] but task is already holding lock:
[  604.300366]  (&fs_info->cleanup_work_sem){.+.+..}, at: [<ffffffffa0048002>] btrfs_run_delayed_iputs+0x72/0x130 [btrfs]
[  604.352989]
[  604.352989] which lock already depends on the new lock.
[  604.352989]
[  604.427104]
[  604.427104] the existing dependency chain (in reverse order) is:
[  604.478493]
[  604.478493] -> #1 (&fs_info->cleanup_work_sem){.+.+..}:
[  604.529313]        [<ffffffff810b2c82>] lock_acquire+0xa2/0x140
[  604.559621]        [<ffffffff81632b69>] down_read+0x39/0x4e
[  604.589382]        [<ffffffffa004db98>] btrfs_lookup_dentry+0x218/0x550 [btrfs]
[  604.596161] btrfs: unlinked 1 orphans
[  604.675002]        [<ffffffffa006aadd>] create_subvol+0x62d/0x690 [btrfs]
[  604.708859]        [<ffffffffa006d666>] btrfs_mksubvol.isra.52+0x346/0x3a0 [btrfs]
[  604.772466]        [<ffffffffa006d7f2>] btrfs_ioctl_snap_create_transid+0x132/0x190 [btrfs]
[  604.842245]        [<ffffffffa006d8ae>] btrfs_ioctl_snap_create+0x5e/0x80 [btrfs]
[  604.912852]        [<ffffffffa00708ae>] btrfs_ioctl+0x138e/0x1990 [btrfs]
[  604.951888]        [<ffffffff8118e9b8>] do_vfs_ioctl+0x98/0x560
[  604.989961]        [<ffffffff8118ef11>] sys_ioctl+0x91/0xa0
[  605.026628]        [<ffffffff8163d569>] system_call_fastpath+0x16/0x1b
[  605.064404]
[  605.064404] -> #0 (sb_internal#2){.+.+..}:
[  605.126832]        [<ffffffff810b25e8>] __lock_acquire+0x1ac8/0x1b90
[  605.163671]        [<ffffffff810b2c82>] lock_acquire+0xa2/0x140
[  605.200228]        [<ffffffff8117dac6>] __sb_start_write+0xc6/0x1b0
[  605.236818]        [<ffffffffa0042b84>] start_transaction+0x124/0x430 [btrfs]
[  605.274029]        [<ffffffffa00431a3>] btrfs_start_transaction+0x13/0x20 [btrfs]
[  605.340520]        [<ffffffffa004ccfa>] btrfs_evict_inode+0x19a/0x330 [btrfs]
[  605.378720]        [<ffffffff811972c8>] evict+0xb8/0x1c0
[  605.416057]        [<ffffffff811974d5>] iput+0x105/0x210
[  605.452373]        [<ffffffffa0048082>] btrfs_run_delayed_iputs+0xf2/0x130 [btrfs]
[  605.521627]        [<ffffffffa003b5e1>] cleaner_kthread+0xa1/0x120 [btrfs]
[  605.560520]        [<ffffffff810791ee>] kthread+0xae/0xc0
[  605.598094]        [<ffffffff8163e744>] kernel_thread_helper+0x4/0x10
[  605.636499]
[  605.636499] other info that might help us debug this:
[  605.636499]
[  605.736504]  Possible unsafe locking scenario:
[  605.736504]
[  605.801931]        CPU0                    CPU1
[  605.835126]        ----                    ----
[  605.867093]   lock(&fs_info->cleanup_work_sem);
[  605.898594]                                lock(sb_internal#2);
[  605.931954]                                lock(&fs_info->cleanup_work_sem);
[  605.965359]   lock(sb_internal#2);
[  605.994758]
[  605.994758]  *** DEADLOCK ***
[  605.994758]
[  606.075281] 2 locks held by btrfs-cleaner/6669:
[  606.104528]  #0:  (&fs_info->cleaner_mutex){+.+...}, at: [<ffffffffa003b5d5>] cleaner_kthread+0x95/0x120 [btrfs]
[  606.165626]  #1:  (&fs_info->cleanup_work_sem){.+.+..}, at: [<ffffffffa0048002>] btrfs_run_delayed_iputs+0x72/0x130 [btrfs]
[  606.231297]
[  606.231297] stack backtrace:
[  606.287723] Pid: 6669, comm: btrfs-cleaner Not tainted 3.6.0-rc2-ceph-00144-g463b030 #1
[  606.347823] Call Trace:
[  606.376184]  [<ffffffff8162a77c>] print_circular_bug+0x1fb/0x20c
[  606.409243]  [<ffffffff810b25e8>] __lock_acquire+0x1ac8/0x1b90
[  606.441343]  [<ffffffffa0042b84>] ? start_transaction+0x124/0x430 [btrfs]
[  606.474583]  [<ffffffff810b2c82>] lock_acquire+0xa2/0x140
[  606.505934]  [<ffffffffa0042b84>] ? start_transaction+0x124/0x430 [btrfs]
[  606.539429]  [<ffffffff8132babd>] ? do_raw_spin_unlock+0x5d/0xb0
[  606.571719]  [<ffffffff8117dac6>] __sb_start_write+0xc6/0x1b0
[  606.603498]  [<ffffffffa0042b84>] ? start_transaction+0x124/0x430 [btrfs]
[  606.637405]  [<ffffffffa0042b84>] ? start_transaction+0x124/0x430 [btrfs]
[  606.670165]  [<ffffffff81172e75>] ? kmem_cache_alloc+0xb5/0x160
[  606.702144]  [<ffffffffa0042b84>] start_transaction+0x124/0x430 [btrfs]
[  606.735562]  [<ffffffffa00256a6>] ? block_rsv_add_bytes+0x56/0x80 [btrfs]
[  606.769861]  [<ffffffffa00431a3>] btrfs_start_transaction+0x13/0x20 [btrfs]
[  606.804575]  [<ffffffffa004ccfa>] btrfs_evict_inode+0x19a/0x330 [btrfs]
[  606.838756]  [<ffffffff81634c6b>] ? _raw_spin_unlock+0x2b/0x40
[  606.872010]  [<ffffffff811972c8>] evict+0xb8/0x1c0
[  606.903800]  [<ffffffff811974d5>] iput+0x105/0x210
[  606.935416]  [<ffffffffa0048082>] btrfs_run_delayed_iputs+0xf2/0x130 [btrfs]
[  606.970510]  [<ffffffffa003b5d5>] ? cleaner_kthread+0x95/0x120 [btrfs]
[  607.005648]  [<ffffffffa003b5e1>] cleaner_kthread+0xa1/0x120 [btrfs]
[  607.040724]  [<ffffffffa003b540>] ? btrfs_destroy_delayed_refs.isra.102+0x220/0x220 [btrfs]
[  607.104740]  [<ffffffff810791ee>] kthread+0xae/0xc0
[  607.137119]  [<ffffffff810b379d>] ? trace_hardirqs_on+0xd/0x10
[  607.169797]  [<ffffffff8163e744>] kernel_thread_helper+0x4/0x10
[  607.202472]  [<ffffffff81635430>] ? retint_restore_args+0x13/0x13
[  607.235884]  [<ffffffff81079140>] ? flush_kthread_work+0x1a0/0x1a0
[  607.268731]  [<ffffffff8163e740>] ? gs_change+0x13/0x13

Signed-off-by: Sage Weil <sage@inktank.com>
2012-10-01 15:19:08 -04:00
Sage Weil
e209db7ace Btrfs: set journal_info in async trans commit worker
We expect current->journal_info to point to the trans handle we are
committing.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-10-01 15:19:08 -04:00
Sage Weil
6fc4e35485 Btrfs: pass lockdep rwsem metadata to async commit transaction
The freeze rwsem is taken by sb_start_intwrite() and dropped during the
commit_ or end_transaction().  In the async case, that happens in a worker
thread.  Tell lockdep the calling thread is releasing ownership of the
rwsem and the async thread is picking it up.

XFS plays the same trick in fs/xfs/xfs_aops.c.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-10-01 15:19:07 -04:00