This is a step in the direction of better -ENOSPC handling. Instead of
checking the global bytes counter we check the space_info bytes counters to
make sure we have enough space.
If we don't we go ahead and try to allocate a new chunk, and then if that fails
we return -ENOSPC. This patch adds two counters to btrfs_space_info,
bytes_delalloc and bytes_may_use.
bytes_delalloc account for extents we've actually setup for delalloc and will
be allocated at some point down the line.
bytes_may_use is to keep track of how many bytes we may use for delalloc at
some point. When we actually set the extent_bit for the delalloc bytes we
subtract the reserved bytes from the bytes_may_use counter. This keeps us from
not actually being able to allocate space for any delalloc bytes.
Signed-off-by: Josef Bacik <jbacik@redhat.com>
This reverts commit d2859751cd.
This commit caused regression. We'll try to fix use of new
vmap API for next release.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
This reverts commit 95f8e302c0.
This commit caused regression. We'll try to fix use of new
vmap API for next release.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
block: fix deadlock in blk_abort_queue() for drivers that readd to timeout list
block: fix booting from partitioned md array
block: revert part of 18ce3751cc
cciss: PCI power management reset for kexec
paride/pg.c: xs(): &&/|| confusion
fs/bio: bio_alloc_bioset: pass right object ptr to mempool_free
block: fix bad definition of BIO_RW_SYNC
bsg: Fix sense buffer bug in SG_IO
Otherwise, these don't work when called from 32-bit userspace on 64-bit
kernels.
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: <stable@kernel.org> [2.6.25.x, 2.6.26.x, 2.6.27.x, 2.6.28.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Li Zefan said:
Thread 1:
for ((; ;))
{
mount -t cpuset xxx /mnt > /dev/null 2>&1
cat /mnt/cpus > /dev/null 2>&1
umount /mnt > /dev/null 2>&1
}
Thread 2:
for ((; ;))
{
mount -t cpuset xxx /mnt > /dev/null 2>&1
umount /mnt > /dev/null 2>&1
}
(Note: It is irrelevant which cgroup subsys is used.)
After a while a lockdep warning showed up:
=============================================
[ INFO: possible recursive locking detected ]
2.6.28 #479
---------------------------------------------
mount/13554 is trying to acquire lock:
(&type->s_umount_key#19){--..}, at: [<c049d888>] sget+0x5e/0x321
but task is already holding lock:
(&type->s_umount_key#19){--..}, at: [<c049da0c>] sget+0x1e2/0x321
other info that might help us debug this:
1 lock held by mount/13554:
#0: (&type->s_umount_key#19){--..}, at: [<c049da0c>] sget+0x1e2/0x321
stack backtrace:
Pid: 13554, comm: mount Not tainted 2.6.28-mc #479
Call Trace:
[<c044ad2e>] validate_chain+0x4c6/0xbbd
[<c044ba9b>] __lock_acquire+0x676/0x700
[<c044bb82>] lock_acquire+0x5d/0x7a
[<c049d888>] ? sget+0x5e/0x321
[<c061b9b8>] down_write+0x34/0x50
[<c049d888>] ? sget+0x5e/0x321
[<c049d888>] sget+0x5e/0x321
[<c045a2e7>] ? cgroup_set_super+0x0/0x3e
[<c045959f>] ? cgroup_test_super+0x0/0x2f
[<c045bcea>] cgroup_get_sb+0x98/0x2e7
[<c045cfb6>] cpuset_get_sb+0x4a/0x5f
[<c049dfa4>] vfs_kern_mount+0x40/0x7b
[<c049e02d>] do_kern_mount+0x37/0xbf
[<c04af4a0>] do_mount+0x5c3/0x61a
[<c04addd2>] ? copy_mount_options+0x2c/0x111
[<c04af560>] sys_mount+0x69/0xa0
[<c0403251>] sysenter_do_call+0x12/0x31
The cause is after alloc_super() and then retry, an old entry in list
fs_supers is found, so grab_super(old) is called, but both functions hold
s_umount lock:
struct super_block *sget(...)
{
...
retry:
spin_lock(&sb_lock);
if (test) {
list_for_each_entry(old, &type->fs_supers, s_instances) {
if (!test(old, data))
continue;
if (!grab_super(old)) <--- 2nd: down_write(&old->s_umount);
goto retry;
if (s)
destroy_super(s);
return old;
}
}
if (!s) {
spin_unlock(&sb_lock);
s = alloc_super(type); <--- 1th: down_write(&s->s_umount)
if (!s)
return ERR_PTR(-ENOMEM);
goto retry;
}
...
}
It seems like a false positive, and seems like VFS but not cgroup needs to
be fixed.
Peter said:
We can simply put the new s_umount instance in a but lockdep doesn't
particularly cares about subclass order.
If there's any issue with the callers of sget() assuming the s_umount lock
being of sublcass 0, then there is another annotation we can use to fix
that, but lets not bother with that if this is sufficient.
Addresses http://bugzilla.kernel.org/show_bug.cgi?id=12673
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Tested-by: Li Zefan <lizf@cn.fujitsu.com>
Reported-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Paul Menage <menage@google.com>
Cc: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
YAMAMOTO-san noticed that task_dirty_inc doesn't seem to be called properly for
cases where set_page_dirty is not used to dirty a page (eg. mark_buffer_dirty).
Additionally, there is some inconsistency about when task_dirty_inc is
called. It is used for dirty balancing, however it even gets called for
__set_page_dirty_no_writeback.
So rather than increment it in a set_page_dirty wrapper, move it down to
exactly where the dirty page accounting stats are incremented.
Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As requested by Michael, add a missing check for valid flags in
timerfd_settime(), and make it return EINVAL in case some extra bits are
set.
Michael said:
If this is to be any use to userland apps that want to check flag
support (perhaps it is too late already), then the sooner we get it
into the kernel the better: 2.6.29 would be good; earlier stables as
well would be even better.
[akpm@linux-foundation.org: remove unused TFD_FLAGS_SET]
Acked-by: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Cc: <stable@kernel.org> [2.6.27.x, 2.6.28.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently seq_read assumes that the offset passed to it is always the
offset it passed to user space. In the case pread this assumption is
broken and we do the wrong thing when presented with pread.
To solve this I introduce an offset cache inside of struct seq_file so we
know where our logical file position is. Then in seq_read if we try to
read from another offset we reset our data structures and attempt to go to
the offset user space wanted.
[akpm@linux-foundation.org: restore FMODE_PWRITE]
[pjt@google.com: seq_open needs its fmode opened up to take advantage of this]
Signed-off-by: Eric Biederman <ebiederm@xmission.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Paul Turner <pjt@google.com>
Cc: <stable@kernel.org> [2.6.25.x, 2.6.26.x, 2.6.27.x, 2.6.28.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This reverts commit d2859751cd.
This commit caused regression. We'll try to fix use of new
vmap API for next release.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
This reverts commit 95f8e302c0.
This commit caused regression. We'll try to fix use of new
vmap API for next release.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
The above commit added WRITE_SYNC and switched various places to using
that for committing writes that will be waited upon immediately after
submission. However, this causes a performance regression with AS and CFQ
for ext3 at least, since sync_dirty_buffer() will submit some writes with
WRITE_SYNC while ext3 has sumitted others dependent writes without the sync
flag set. This causes excessive anticipation/idling in the IO scheduler
because sync and async writes get interleaved, causing a big performance
regression for the below test case (which is meant to simulate sqlite
like behaviour).
---- test case ----
int main(int argc, char **argv)
{
int fdes, i;
FILE *fp;
struct timeval start;
struct timeval end;
struct timeval res;
gettimeofday(&start, NULL);
for (i=0; i<ROWS; i++) {
fp = fopen("test_file", "a");
fprintf(fp, "Some Text Data\n");
fdes = fileno(fp);
fsync(fdes);
fclose(fp);
}
gettimeofday(&end, NULL);
timersub(&end, &start, &res);
fprintf(stdout, "time to write %d lines is %ld(msec)\n", ROWS,
(res.tv_sec*1000000 + res.tv_usec)/1000);
return 0;
}
-------------------
Thanks to Sean.White@APCC.com for tracking down this performance
regression and providing a test case.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
When freeing from bio pool use right ptr to account for bs->front_pad,
instead of bio ptr,
Signed-off-by: Subhash Peddamallu <subhash.peddamallu@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
Btrfs: hold trans_mutex when using btrfs_record_root_in_trans
Btrfs: make a lockdep class for the extent buffer locks
Btrfs: fs/btrfs/volumes.c: remove useless kzalloc
Btrfs: remove unused code in split_state()
Btrfs: remove btrfs_init_path
Btrfs: balance_level checks !child after access
Btrfs: Avoid using __GFP_HIGHMEM with slab allocator
Btrfs: don't clean old snapshots on sync(1)
Btrfs: use larger metadata clusters in ssd mode
Btrfs: process mount options on mount -o remount,
Btrfs: make sure all pending extent operations are complete
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: Fix NULL dereference in ext4_ext_migrate()'s error handling
ext4: Implement range_cyclic in ext4_da_writepages instead of write_cache_pages
ext4: Initialize preallocation list_head's properly
ext4: Fix lockdep warning
ext4: Fix to read empty directory blocks correctly in 64k
jbd2: Avoid possible NULL dereference in jbd2_journal_begin_ordered_truncate()
Revert "ext4: wait on all pending commits in ext4_sync_fs()"
jbd2: Fix return value of jbd2_journal_start_commit()
Getting this wrong caused
WARNING: at fs/namespace.c:636 mntput_no_expire+0xac/0xf2()
due to optimistically checking cpu_writer->mnt outside the spinlock.
Here's what we really want:
* we know that nobody will set cpu_writer->mnt to mnt from now on
* all changes to that sucker are done under cpu_writer->lock
* we want the laziest equivalent of
spin_lock(&cpu_writer->lock);
if (likely(cpu_writer->mnt != mnt)) {
spin_unlock(&cpu_writer->lock);
continue;
}
/* do stuff */
that would make sure we won't miss earlier setting of ->mnt done by
another CPU.
Anyway, for now we just move the spin_lock() earlier and move the test
into the properly locked region.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Reported-and-tested-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Trivial cleanup, list_del(); list_add{,_tail}() is equivalent
to list_move{,_tail}(). Semantic patch for coccinelle can be
found at www.cccmz.de/~snakebyte/list_move_tail.spatch
Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
This was found through a code checker (http://repo.or.cz/w/smatch.git/).
It looks like you might be able to trigger the error by trying to migrate
a readonly file system.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
At the moment there are few restrictions on which flags may be set on
which inodes. Specifically DIRSYNC may only be set on directories and
IMMUTABLE and APPEND may not be set on links. Tighten that to disallow
TOPDIR being set on non-directories and only NODUMP and NOATIME to be set
on non-regular file, non-directories.
Introduces a flags masking function which masks flags based on mode and
use it during inode creation and when flags are set via the ioctl to
facilitate future consistency.
Signed-off-by: Duane Griffin <duaneg@dghda.com>
Acked-by: Andreas Dilger <adilger@sun.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
At present INDEX and EXTENTS are the only flags that new ext4 inodes do
NOT inherit from their parent. In addition prevent the flags DIRTY,
ECOMPR, IMAGIC, TOPDIR, HUGE_FILE and EXT_MIGRATE from being inherited.
List inheritable flags explicitly to prevent future flags from
accidentally being inherited.
This fixes the TOPDIR flag inheritance bug reported at
http://bugzilla.kernel.org/show_bug.cgi?id=9866.
Signed-off-by: Duane Griffin <duaneg@dghda.com>
Acked-by: Andreas Dilger <adilger@sun.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
As spotted by kmemtrace, struct ext4_sb_info is 17664 bytes on 64-bit
which makes it a very bad fit for SLAB allocators. The culprit of the
wasted memory is ->s_blockgroup_lock which can be as big as 16 KB when
NR_CPUS >= 32.
To fix that, allocate ->s_blockgroup_lock, which fits nicely in a order 2
page in the worst case, separately. This shinks down struct ext4_sb_info
enough to fit a 2 KB slab cache so now we allocate 16 KB + 2 KB instead of
32 KB saving 14 KB of memory.
Acked-by: Andreas Dilger <adilger@sun.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
The rec_len field in the directory entry is 16 bits, so to encode
blocksizes larger than 64k becomes problematic. This patch allows us
to supprot block sizes up to 256k, by using the low 2 bits to extend
the range of rec_len to 2**18-1 (since valid rec_len sizes must be a
multiple of 4). We use the convention that a rec_len of 0 or 65535
means the filesystem block size, for compatibility with older kernels.
It's unlikely we'll see VM pages of up to 256k, but at some point we
might find that the Linux VM has been enhanced to support filesystem
block sizes > than the VM page size, at which point it might be useful
for some applications to allow very large filesystem block sizes.
Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
With delayed allocation we lock the page in write_cache_pages() and
try to build an in memory extent of contiguous blocks. This is needed
so that we can get large contiguous blocks request. If range_cyclic
mode is enabled, write_cache_pages() will loop back to the 0 index if
no I/O has been done yet, and try to start writing from the beginning
of the range. That causes an attempt to take the page lock of lower
index page while holding the page lock of higher index page, which can
cause a dead lock with another writeback thread.
The solution is to implement the range_cyclic behavior in
ext4_da_writepages() instead.
http://bugzilla.kernel.org/show_bug.cgi?id=12579
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
When creating a new ext4_prealloc_space structure, we have to
initialize its list_head pointers before we add them to any prealloc
lists. Otherwise, with list debug enabled, we will get list
corruption warnings.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
I've noticed some pretty poor behavior on OLPC machines after bootup, when
gdm/X are starting. The GCD monopolizes the scheduler (which in turns
means it gets to do more nand i/o), which results in processes taking much
much longer than they should to start.
As an example, on an OLPC machine going from OFW to a usable X (via
auto-login gdm) takes 2m 30s. The majority of this time is consumed by
the switch into graphical mode. With this patch, we cut a full 60s off of
bootup time. After bootup, things are much snappier as well.
Note that we have seen a CRC node error with this patch that causes the machine
to fail to boot, but we've also seen that problem without this patch.
Signed-off-by: Andres Salomon <dilinger@debian.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
btrfs_record_root_in_trans needs the trans_mutex held to make sure two
callers don't race to setup the root in a given transaction. This adds
it to all the places that were missing it.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Btrfs is currently using spin_lock_nested with a nested value based
on the tree depth of the block. But, this doesn't quite work because
the max tree depth is bigger than what spin_lock_nested can deal with,
and because locks are sometimes taken before the level field is filled in.
The solution here is to use lockdep_set_class_and_name instead, and to
set the class before unlocking the pages when the block is read from the
disk and just after init of a freshly allocated tree block.
btrfs_clear_path_blocking is also changed to take the locks in the proper
order, and it also makes sure all the locks currently held are properly
set to blocking before it tries to retake the spinlocks. Otherwise, lockdep
gets upset about bad lock orderin.
The lockdep magic cam from Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Swapfiles are magic - I/O is directly initialized by the VM without
involving the filesystem. Swapping out extents underneath the VM thus
can cause severe problems.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
We can't just call xfs_log_unmount_dealloc on any failure because the
ail thread which is torn down by xfs_log_unmount_dealloc might not
be initialized yet.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Reported-by: Lachlan McIlroy <lachlan@sgi.com>
The call to kzalloc is followed by a kmalloc whose result is stored in the
same variable.
The semantic match that finds the problem is as follows:
(http://www.emn.fr/x-info/coccinelle/)
// <smpl>
@r exists@
local idexpression x;
statement S;
expression E;
identifier f,l;
position p1,p2;
expression *ptr != NULL;
@@
(
if ((x@p1 = \(kmalloc\|kzalloc\|kcalloc\)(...)) == NULL) S
|
x@p1 = \(kmalloc\|kzalloc\|kcalloc\)(...);
...
if (x == NULL) S
)
<... when != x
when != if (...) { <+...x...+> }
x->f = E
...>
(
return \(0\|<+...x...+>\|ptr\);
|
return@p2 ...;
)
@script:python@
p1 << r.p1;
p2 << r.p2;
@@
print "* file: %s kmalloc %s return %s" % (p1[0].file,p1[0].line,p2[0].line)
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_init_path was initially used when the path objects were on the
stack. Now all the work is done by btrfs_alloc_path and btrfs_init_path
isn't required.
This patch removes it, and just uses kmem_cache_zalloc to zero out the object.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_releasepage may call kmem_cache_alloc indirectly,
and provide same GFP flags it gets to kmem_cache_alloc.
So it's possible to use __GFP_HIGHMEM with the slab
allocator.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Cleaning old snapshots can make sync(1) somewhat slow, and some users
and applications still use it in a global fsync kind of workload.
This patch changes btrfs not to clean old snapshots during sync, which is
safe from a FS consistency point of view. The major downside is that it
makes it difficult to tell when old snapshots have been reaped and
the space they were using has been reclaimed. A new ioctl will be added
for this purpose instead.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Larger metadata clusters can significantly improve writeback performance
on ssd drives with large erasure blocks. The larger clusters make it
more likely a given IO will completely overwrite the ssd block, so it
doesn't have to do an internal rwm cycle.
On spinning media, lager metadata clusters end up spreading out the
metadata more over time, which makes fsck slower, so we don't want this
to be the default.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Btrfs wasn't parsing any new mount options during remount, making it
difficult to set mount options on a root drive.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Theres a slight problem with finish_current_insert, if we set all to 1 and then
go through and don't actually skip any of the extents on the pending list, we
could exit right after we've added new extents.
This is a problem because by inserting the new extents we could have gotten new
COW's to happen and such, so we may have some pending updates to do or even
more inserts to do after that.
So this patch will only exit if we have never skipped any of the extents in the
pending list, and we have no extents to insert, this will make sure that all of
the pending work is truly done before we return. I've been running with this
patch for a few days with all of my other testing and have not seen issues.
Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
This patch allows LSM modules to determine whether current process is in an
execve operation or not so that they can behave differently while an execve
operation is in progress.
This patch is needed by TOMOYO. Please see another patch titled "LSM adapter
functions." for backgrounds.
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: James Morris <jmorris@namei.org>