Commit graph

13,640 commits

Author SHA1 Message Date
yanhai zhu
0df49b911d Btrfs: Check kthread_should_stop() before schedule() in worker_loop
In worker_loop(), the func should check whether it has been requested to stop
before it decides to schedule out.

Otherwise if the stop request(also the last wake_up()) sent by
btrfs_stop_workers() happens when worker_loop() running after the "while"
judgement and before schedule(), woker_loop() will schedule away and never be
woken up, which will also cause btrfs_stop_workers() wait forever.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-11-12 14:36:58 -05:00
Yan Zheng
c36047d729 Btrfs: Fix race in btrfs_mark_extent_written
When extent needs to be split, btrfs_mark_extent_written truncates the extent
first, then inserts a new extent and increases the reference count.

The race happens if someone else deletes the old extent before the new extent
is inserted. The fix here is increase the reference count in advance. This race
is similar to the race in btrfs_drop_extents that was recently fixed.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2008-11-12 14:19:50 -05:00
Yan Zheng
2b82032c34 Btrfs: Seed device support
Seed device is a special btrfs with SEEDING super flag
set and can only be mounted in read-only mode. Seed
devices allow people to create new btrfs on top of it.

The new FS contains the same contents as the seed device,
but it can be mounted in read-write mode.

This patch does the following:

1) split code in btrfs_alloc_chunk into two parts. The first part does makes
the newly allocated chunk usable, but does not do any operation that modifies
the chunk tree. The second part does the the chunk tree modifications. This
division is for the bootstrap step of adding storage to the seed device.

2) Update device management code to handle seed device.
The basic idea is: For an FS grown from seed devices, its
seed devices are put into a list. Seed devices are
opened on demand at mounting time. If any seed device is
missing or has been changed, btrfs kernel module will
refuse to mount the FS.

3) make btrfs_find_block_group not return NULL when all
block groups are read-only.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2008-11-17 21:11:30 -05:00
Yan Zheng
c146afad2c Btrfs: mount ro and remount support
This patch adds mount ro and remount support. The main
changes in patch are: adding btrfs_remount and related
helper function; splitting the transaction related code
out of close_ctree into btrfs_commit_super; updating
allocator to properly handle read only block group.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2008-11-12 14:34:12 -05:00
Josef Bacik
f3465ca44e Btrfs: batch extent inserts/updates/deletions on the extent root
While profiling the allocator I noticed a good amount of time was being spent in
finish_current_insert and del_pending_extents, and as the filesystem filled up
more and more time was being spent in those functions.  This patch aims to try
and reduce that problem.  This happens two ways

1) track if we tried to delete an extent that we are going to update or insert.
Once we get into finish_current_insert we discard any of the extents that were
marked for deletion.  This saves us from doing unnecessary work almost every
time finish_current_insert runs.

2) Batch insertion/updates/deletions.  Instead of doing a btrfs_search_slot for
each individual extent and doing the needed operation, we instead keep the leaf
around and see if there is anything else we can do on that leaf.  On the insert
case I introduced a btrfs_insert_some_items, which will take an array of keys
with an array of data_sizes and try and squeeze in as many of those keys as
possible, and then return how many keys it was able to insert.  In the update
case we search for an extent ref, update the ref and then loop through the leaf
to see if any of the other refs we are looking to update are on that leaf, and
then once we are done we release the path and search for the next ref we need to
update.  And finally for the deletion we try and delete the extent+ref in pairs,
so we will try to find extent+ref pairs next to the extent we are trying to free
and free them in bulk if possible.

This along with the other cluster fix that Chris pushed out a bit ago helps make
the allocator preform more uniformly as it fills up the disk.  There is still a
slight drop as we fill up the disk since we start having to stick new blocks in
odd places which results in more COW's than on a empty fs, but the drop is not
nearly as severe as it was before.

Signed-off-by: Josef Bacik <jbacik@redhat.com>
2008-11-12 14:19:50 -05:00
Sage Weil
c5c9cd4d1b Btrfs: allow clone of an arbitrary file range
This patch adds an additional CLONE_RANGE ioctl to clone an arbitrary 
(block-aligned) file range to another file.  The original CLONE ioctl 
becomes a special case of cloning the entire file range.  The logic is a 
bit more complex now since ranges may be cloned to different offsets, and 
because we may only be cloning the beginning or end of a particular extent 
or checksum item.

An additional sanity check ensures the source and destination files aren't 
the same (which would previously deadlock), although eventually this could 
be extended to allow the duplication of file data at a different offset 
within the same file.

Any extents within the destination range in the target file are dropped.

We currently do not cope with the case where a compressed inline extent 
needs to be split.  This will probably require decompressing the extent 
into a temporary address_space, and inserting just the cloned portion as a 
new compressed inline extent.  For now, just return -EINVAL in this case.  
Note that this never comes up in the more common case of cloning an entire 
file.
    
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-11-12 14:32:25 -05:00
Chris Mason
2ed6d66408 Btrfs: Fix handling of space info full during allocations
When we fail to allocate a new block group, we should still do the
checks to make sure allocations try again with the minimum requested
allocation size.

This also fixes a deadlock that come from a missed down_read in
the chunk allocation failure handling.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-11-13 09:59:33 -05:00
Chris Mason
6f3577bdc7 Btrfs: Improve metadata read latencies
This fixes latency problems on metadata reads by making sure they
don't go through the async submit queue, and by tuning down the amount
of readahead done during btree searches.

Also, the btrfs bdi congestion function is tuned to ignore the
number of pending async bios and checksums pending.  There is additional
code that throttles new async bios now and the congestion function
doesn't need to worry about it anymore.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-11-13 09:59:36 -05:00
Christoph Hellwig
5a6bb10393 fat: make sure to set d_ops in fat_get_parent
fat_get_parent needs to setup the dentry operations, otherwise we might
lose them when the NFS server needs to reconnect out of cache inodes.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
2008-11-12 08:51:22 +09:00
OGAWA Hirofumi
985eafcc54 fat: fix duplicate addition of ->llseek handler
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
2008-11-12 08:51:22 +09:00
OGAWA Hirofumi
ebeb0406f1 fat: drop negative dentry on rename() path
Drop the negative dentry on rename() path, in order to make sure to
use the case sensitive name which is specified by user if this is for
creation.

For it, this uses newly added LOOKUP_RENAME_TARGET like LOOKUP_CREATE.

Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
2008-11-12 08:51:22 +09:00
David S. Miller
7e452baf6b Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:

	drivers/message/fusion/mptlan.c
	drivers/net/sfc/ethtool.c
	net/mac80211/debugfs_sta.c
2008-11-11 15:43:02 -08:00
Linus Torvalds
04ca2c17e3 Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs
* 'for-linus' of git://oss.sgi.com/xfs/xfs:
  [XFS] XFS: Check for valid transaction headers in recovery
  [XFS] handle memory allocation failures during log initialisation
  [XFS] Account for allocated blocks when expanding directories
  [XFS] Wait for all I/O on truncate to zero file size
  [XFS] Fix use-after-free with log and quotas
2008-11-11 09:32:58 -08:00
Chris Mason
5b050f04c8 Btrfs: Fix compile warnings on 32 bit machines
Simple casting here and there to fix things up.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-11-11 09:34:41 -05:00
Yan Zheng
8247b41ac9 Btrfs: Fix starting search offset inside btrfs_drop_extents
btrfs_drop_extents will drop paths and search again when it needs to
force COW of higher nodes.  It was using the key it found during the last
search as the offset for the next search.

But, this wasn't always correct.  The key could be from before our desired
range, and because we're dropping the path, it is possible for file's items
to change while we do the search again.

The fix here is to make sure we don't search for something smaller than
the offset btrfs_drop_extents was called with.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-11-11 09:33:29 -05:00
Chris Mason
8a1413a296 Btrfs: empty_size allocation fixes again
The allocator wasn't catching all of the cases where it needed to do
extra loops because the check to enforce them wasn't happening early
enough.

When the allocator decided to increase the size of the allocation
for metadata clustering, it wasn't always setting the empty_size to
include the extra (optional) bytes.  This also fixes the empty_size field
to be correct.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-11-10 16:13:54 -05:00
Chris Mason
240d5d482b Btrfs: tune btrfs unplug functions for a small number of devices
When btrfs unplugs, it tries to find the correct device to unplug
via search through the extent_map tree.  This avoids unplugging
a device that doesn't need it, but is a waste of time for filesystems
with a small number of devices.

This patch checks the total number of devices before doing the
search.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-11-10 13:08:31 -05:00
Tiger Yang
6c1e183e12 ocfs2: Check search result in ocfs2_xattr_block_get()
ocfs2_xattr_block_get() calls ocfs2_xattr_search() to find an external
xattr, but doesn't check the search result that is passed back via struct
ocfs2_xattr_search. Add a check for search result, and pass back -ENODATA if
the xattr search failed. This avoids a later NULL pointer error.

Signed-off-by: Tiger Yang <tiger.yang@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-11-10 09:51:47 -08:00
Mark Fasheh
de29c08528 ocfs2: fix printk related build warnings in xattr.c
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-11-10 09:51:47 -08:00
Dmitri Monakhov
c435400140 ocfs2: truncate outstanding block after direct io failure
Signed-off-by: Dmitri Monakhov <dmonakhov@openvz.org>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Mark Fasheh <mark.fasheh@oracle.com>
Cc: Joel Becker <Joel.Becker@oracle.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-11-10 09:51:47 -08:00
Tao Ma
80bcaf3469 ocfs2/xattr: Proper hash collision handle in bucket division
In ocfs2/xattr, we must make sure the xattrs which have the same hash value
exist in the same bucket so that the search schema can work. But in the old
implementation, when we want to extend a bucket, we just move half number of
xattrs to the new bucket. This works in most cases, but if we are lucky
enough we will move 2 xattrs into 2 different buckets. This means that an
xattr from the previous bucket cannot be found anymore. This patch fix this
problem by finding the right position during extending the bucket and extend
an empty bucket if needed.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Cc: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-11-10 09:51:47 -08:00
Tao Ma
4c1bbf1ba6 ocfs2: return 0 in page_mkwrite to let VFS retry.
In ocfs2_page_mkwrite, we return -EINVAL when we found the page mapping
isn't updated, and it will cause the user space program get SIGBUS and
exit. The reason is that during race writeable mmap, we will do
unmap_mapping_range in ocfs2_data_downconvert_worker. The good thing is
that if we reuturn 0 in page_mkwrite, VFS will retry fault and then
call page_mkwrite again, so it is safe to return 0 here.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-11-10 09:51:47 -08:00
Sunil Mushran
ae0dff6830 ocfs2: Set journal descriptor to NULL after journal shutdown
Patch sets journal descriptor to NULL after the journal is shutdown.
This ensures that jbd2_journal_release_jbd_inode(), which removes the
jbd2 inode from txn lists, can be called safely from ocfs2_clear_inode()
even after the journal has been shutdown.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-11-10 09:51:47 -08:00
Tao Ma
d32647993c ocfs2: Fix check of return value of ocfs2_start_trans() in xattr.c.
On failure, ocfs2_start_trans() returns values like ERR_PTR(-ENOMEM),
so we should check whether handle is NULL. Fix them to use IS_ERR().
Jan has made the patch for other part in ocfs2(thank Jan for it), so
this is just the fix for fs/ocfs2/xattr.c.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-11-10 09:51:47 -08:00
Jan Kara
b99835c168 ocfs2: Let inode be really deleted when ocfs2_mknod_locked() fails
We forgot to set i_nlink to 0 when returning due to error from ocfs2_mknod_locked()
and thus inode was not properly released via ocfs2_delete_inode() (e.g. claimed
space was not released). Fix it.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-11-10 09:51:46 -08:00
Jan Kara
87cfa00432 ocfs2: Fix checking of return value of new_inode()
new_inode() does not return ERR_PTR() but NULL in case of failure. Correct
checking of the return value.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-11-10 09:51:46 -08:00
Jan Kara
fa38e92cb3 ocfs2: Fix check of return value of ocfs2_start_trans()
On failure, ocfs2_start_trans() returns values like ERR_PTR(-ENOMEM).
Thus checks for !handle are wrong. Fix them to use IS_ERR().

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-11-10 09:51:46 -08:00
Tao Ma
8573f79d30 ocfs2: Fix some typos in xattr annotations.
Fix some typos in the xattr annotations.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Reported-by: Coly Li <coyli@suse.de>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-11-10 09:51:46 -08:00
Tao Ma
63fd775737 ocfs2: Remove unused ocfs2_restore_xattr_block().
Since now ocfs2 supports empty xattr buckets, we will never remove
the xattr index tree even if all the xattrs are removed, so this
function will never be called. So remove it.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-11-10 09:51:46 -08:00
Joel Becker
54f443f4e7 ocfs2: Don't repeat ocfs2_xattr_block_find()
ocfs2_xattr_block_get() looks up the xattr in a startlingly familiar
way; it's identical to the function ocfs2_xattr_block_find().  Let's just
use the later in the former.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-11-10 09:51:46 -08:00
Joel Becker
eb6ff2397d ocfs2: Specify appropriate journal access for new xattr buckets.
There are a couple places that get an xattr bucket that may be reading
an existing one or may be allocating a new one.  They should specify the
correct journal access mode depending.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-11-10 09:51:46 -08:00
Joel Becker
bd60bd37ad ocfs2: Check errors from ocfs2_xattr_update_xattr_search()
The ocfs2_xattr_update_xattr_search() function can return an error when
trying to read blocks off of disk.  The caller needs to check this error
before using those (possibly invalid) blocks.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-11-10 09:51:45 -08:00
Joel Becker
b37c4d84e9 ocfs2: Don't return -EFAULT from a corrupt xattr entry.
If the xattr disk structures are corrupt, return -EIO, not -EFAULT.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-11-10 09:51:45 -08:00
Joel Becker
f6087fb799 ocfs2: Check xattr block signatures properly.
The xattr.c code is currently memcmp()ing naking buffer pointers.
Create the OCFS2_IS_VALID_XATTR_BLOCK() macro to match its peers and use
that.

In addition, failed signature checks were returning -EFAULT, which is
completely wrong.  Return -EIO.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-11-10 09:51:44 -08:00
Tiger Yang
c988fd045f ocfs2: add handler_map array bounds checking
Make the handler_map array as large as the possible value range to avoid
a fencepost error.

[ Utilize alternate method -- Joel ]

Signed-off-by: Tiger Yang <tiger.yang@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-11-10 09:51:44 -08:00
Tiger Yang
ceb1eba3dc ocfs2: remove duplicate definition in xattr
Include/linux/xattr.h already has the definition about xattr prefix,
so remove the duplicate definitions in xattr.c.

Signed-off-by: Tiger Yang <tiger.yang@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-11-10 09:51:44 -08:00
Tiger Yang
0030e00150 ocfs2: fix function declaration and definition in xattr
Because we merged the xattr sources into one file, some functions
no longer belong in the header file.

Signed-off-by: Tiger Yang <tiger.yang@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-11-10 09:51:44 -08:00
Tiger Yang
c3cb682735 ocfs2: fix license in xattr
This patch fixes the license in xattr.c and xattr.h.

Signed-off-by: Tiger Yang <tiger.yang@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-11-10 09:51:43 -08:00
Chris Mason
b47eda8690 Btrfs: Turn off extent state leak debugging
The extent_io.c code has a #define to find and cleanup extent state leaks
on module unmount.  This adds a very highly contended spinlock to a
hot path for most FS operations.

Turn it off by default.  A later changeset will add a .config option
for it.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-11-10 12:34:40 -05:00
Chris Mason
445a694499 Btrfs: Fix usage of struct extent_map->orig_start
This makes sure the orig_start field in struct extent_map gets set
everywhere the extent_map structs are created or modified.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-11-10 11:53:33 -05:00
Chris Mason
39be25cd89 Btrfs: Use invalidatepage when writepage finds a page outside of i_size
With all the recent fixes to the delalloc locking, it is now safe
again to use invalidatepage inside the writepage code for
pages outside of i_size.  This used to deadlock against some of the
code to write locked ranges of pages, but all of that has been fixed.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-11-10 11:50:50 -05:00
Chris Mason
f5a31e1667 Btrfs: Try harder while searching for free space
The loop searching for free space would exit out too soon when
metadata clustering was trying to allocate a large extent.  This makes
sure a full scan of the free space is done searching for only the
minimum extent size requested by the higher layers.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-11-10 11:47:09 -05:00
Chris Mason
e04ca626ba Btrfs: Fix use after free during compressed reads
Yan's fix to use the correct file offset during compressed reads used the
extent_map struct pointer after it had been freed.  This saves the
fields we want for later use instead.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-11-10 11:44:58 -05:00
Yan Zheng
ff5b7ee33d Btrfs: Fix csum error for compressed data
The decompress code doesn't take the logical offset in extent
pointer into account. If the logical offset isn't zero, data
will be decompressed into wrong pages.

The solution used here is to record the starting offset of the extent
in the file separately from the logical start of the extent_map struct.
This allows us to avoid problems inserting overlapping extents.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2008-11-10 07:34:43 -05:00
Chris Mason
f2b1c41cf9 Btrfs: Make sure pages are dirty before doing delalloc for them
This adds a PageDirty check to the writeback path that locks pages
for delalloc.  If a page wasn't dirty at this point, it is in the
process of being truncated away.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-11-10 07:31:30 -05:00
Chris Mason
5b7c3fcc46 Btrfs: Don't substract too much from the allocation target (avoid wrapping)
When metadata allocation clustering has to fall back to unclustered
allocs because large free areas could not be found, it was sometimes
substracting too much from the total bytes to allocate.  This would
make it wrap below zero.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-11-10 07:26:33 -05:00
David Chinner
220ca310a5 [XFS] XFS: Check for valid transaction headers in recovery
When we are about to add a new item to a transaction in recovery, we need
to check that it is valid first. Currently we just assert that header
magic number matches, but in production systems that is not present and we
add a corrupted transaction to the list to be processed. This results in a
kernel oops later when processing the corrupted transaction.

Instead, if we detect a corrupted transaction, abort recovery and leave
the user to clean up the mess that has occurred.

SGI-PV: 988145

SGI-Modid: xfs-linux-melb:xfs-kern:32356a

Signed-off-by: David Chinner <david@fromorbit.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
2008-11-10 18:01:50 +11:00
Dave Chinner
8f330f5149 [XFS] handle memory allocation failures during log initialisation
When there is no memory left in the system, xfs_buf_get_noaddr()
can fail. If this happens at mount time during xlog_alloc_log()
we fail to catch the error and oops.

Catch the error from xfs_buf_get_noaddr(), and allow other memory
allocations to fail and catch those errors too. Report the error
to the console and fail the mount with ENOMEM.

Tested by manually injecting errors into xfs_buf_get_noaddr() and
xlog_alloc_log().

Version 2:
o remove unnecessary casts of the returned pointer from kmem_zalloc()

SGI-PV: 987246

Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
2008-11-10 17:57:06 +11:00
David Chinner
6f9f51adb6 [XFS] Account for allocated blocks when expanding directories
When we create a directory, we reserve a number of blocks for the maximum
possible expansion of of the directory due to various btree splits,
freespace allocation, etc. Unfortunately, each allocation is not reflected
in the total number of blocks still available to the transaction, so the
maximal reservation is used over and over again.

This leads to problems where an allocation group has only enough blocks
for *some* of the allocations required for the directory modification.
After the first N allocations, the remaining blocks in the allocation
group drops below the total reservation, and subsequent allocations fail
because the allocator will not allow the allocation to proceed if the AG
does not have the enough blocks available for the entire allocation total.

This results in an ENOSPC occurring after an allocation has already
occurred. This results in aborting the directory operation (leaving the
directory in an inconsistent state) and cancelling a dirty transaction,
which results in a filesystem shutdown.

Avoid the problem by reflecting the number of blocks allocated in any
directory expansion in the total number of blocks available to the
modification in progress. This prevents a directory modification from
being aborted part way through with an ENOSPC.

SGI-PV: 988144

SGI-Modid: xfs-linux-melb:xfs-kern:32340a

Signed-off-by: David Chinner <david@fromorbit.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
2008-11-10 17:51:14 +11:00
Lachlan McIlroy
2cf7f0da3a [XFS] Wait for all I/O on truncate to zero file size
It's possible to have outstanding xfs_ioend_t's queued when the file size
is zero. This can happen in the direct I/O path when a direct I/O write
fails due to ENOSPC. In this case the xfs_ioend_t will still be queued (ie
xfs_end_io_direct() does not know that the I/O failed so can't force the
xfs_ioend_t to be flushed synchronously).

When we truncate a file on unlink we don't know to wait for these
xfs_ioend_ts and we can have a use-after-free situation if the inode is
reclaimed before the xfs_ioend_t is finally processed.

As was suggested by Dave Chinner lets wait for all I/Os to complete when
truncating the file size to zero.

SGI-PV: 981668

SGI-Modid: xfs-linux-melb:xfs-kern:32216a

Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
2008-11-10 17:51:00 +11:00