gfs2_fallocate wasn't checking inode_newsize_ok nor get_write_access.
Split out the context setup and inode locking pieces into a separate
function to make it more clear and add these missing calls.
inode_newsize_ok is called conditional on FALLOC_FL_KEEP_SIZE as there
is no need to enforce a file size limit if it isn't going to change.
Signed-off-by: Andrew Price <anprice@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Merge misc fixes from Andrew Morton:
"15 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
MAINTAINERS: add IIO include files
kernel/panic.c: update comments for print_tainted
mem-hotplug: reset node present pages when hot-adding a new pgdat
mem-hotplug: reset node managed pages when hot-adding a new pgdat
mm/debug-pagealloc: correct freepage accounting and order resetting
fanotify: fix notification of groups with inode & mount marks
mm, compaction: prevent infinite loop in compact_zone
mm: alloc_contig_range: demote pages busy message from warn to info
mm/slab: fix unalignment problem on Malta with EVA due to slab merge
mm/page_alloc: restrict max order of merging on isolated pageblock
mm/page_alloc: move freepage counting logic to __free_one_page()
mm/page_alloc: add freepage on isolate pageblock to correct buddy list
mm/page_alloc: fix incorrect isolation behavior by rechecking migratetype
mm/compaction: skip the range until proper target pageblock is met
zram: avoid kunmap_atomic() of a NULL pointer
fsnotify() needs to merge inode and mount marks lists when notifying
groups about events so that ignore masks from inode marks are reflected
in mount mark notifications and groups are notified in proper order
(according to priorities).
Currently the sorting of the lists done by fsnotify_add_inode_mark() /
fsnotify_add_vfsmount_mark() and fsnotify() differed which resulted
ignore masks not being used in some cases.
Fix the problem by always using the same comparison function when
sorting / merging the mark lists.
Thanks to Heinrich Schuchardt for improvements of my patch.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=87721
Signed-off-by: Jan Kara <jack@suse.cz>
Reported-by: Heinrich Schuchardt <xypron.glpk@gmx.de>
Tested-by: Heinrich Schuchardt <xypron.glpk@gmx.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
TID of cap flush ack is 64 bits, but ceph_inode_info::flushing_cap_tid
is only 16 bits. 16 bits should be plenty to let the cap flush updates
pipeline appropriately, but we need to cast in the proper direction when
comparing these differently-sized versions. So downcast the 64-bits one
to 16 bits.
Reflects ceph.git commit a5184cf46a6e867287e24aeb731634828467cd98.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@redhat.com>
Any attempt to call nfs_remove_bad_delegation() while a delegation is being
returned is currently a no-op. This means that we can end up looping
forever in nfs_end_delegation_return() if something causes the delegation
to be revoked.
This patch adds a mechanism whereby the state recovery code can communicate
to the delegation return code that the delegation is no longer valid and
that it should not be used when reclaiming state.
It also changes the return value for nfs4_handle_delegation_recall_error()
to ensure that nfs_end_delegation_return() does not reattempt the lock
reclaim before state recovery is done.
http://lkml.kernel.org/r/CAN-5tyHwG=Cn2Q9KsHWadewjpTTy_K26ee+UnSvHvG4192p-Xw@mail.gmail.com
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
This patch removes the assumption made previously, that we only need to
check the delegation stateid when it matches the stateid on a cached
open.
If we believe that we hold a delegation for this file, then we must assume
that its stateid may have been revoked or expired too. If we don't test it
then our state recovery process may end up caching open/lock state in a
situation where it should not.
We therefore rename the function nfs41_clear_delegation_stateid as
nfs41_check_delegation_stateid, and change it to always run through the
delegation stateid test and recovery process as outlined in RFC5661.
http://lkml.kernel.org/r/CAN-5tyHwG=Cn2Q9KsHWadewjpTTy_K26ee+UnSvHvG4192p-Xw@mail.gmail.com
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Somehow the nfs_v4_1_minor_ops had the NFS_CAP_SEEK flag set, enabling
SEEK over v4.1. This is wrong, and can make servers crash.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Tested-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Variable 'err' needn't be initialized when nfs_getattr() uses it to
check whether it should call generic_fillattr() or not. That can result
in spurious error returns. Initialize 'err' properly.
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Commit 3a6fd1f004 (pnfs/blocklayout: remove read-modify-write handling
in bl_write_pagelist) introduced a bogus assignment pg_index = pg_index
in variable initialization. AFAICS it's just a typo so remove it.
Spotted by Coverity (id 1248711).
CC: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
This WARN_ON_ONCE was supposed to catch reference counting bugs, but can
trigger in inappropriate situations.
This was reproducible using NFSv2 on an architecture with 64K pages -- we
verified that it was not a reference counting bug and the warning was
safe to ignore.
Reported-by: Will Deacon <will.deacon@arm.com>
Tested-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
For pNFS direct writes, layout driver may dynamically allocate ds_cinfo.buckets.
So we need to take care to free them when freeing dreq.
Ideally this needs to be done inside layout driver where ds_cinfo.buckets
are allocated. But buckets are attached to dreq and reused across LD IO iterations.
So I feel it's OK to free them in the generic layer.
Cc: stable@vger.kernel.org [v3.4+]
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
In some contexts, like in sysfs handlers, we don't want to trigger a
transaction commit. It's a heavy operation, we don't know what external
locks may be taken. Instead, make it possible to finish the operation
through sync syscall or SYNC_FS ioctl.
Signed-off-by: David Sterba <dsterba@suse.cz>
The pending mount option(s) now share namespace and bits with the normal
options, and the existing one for (inode_cache) is unset unconditionally
at each transaction commit.
Introduce a separate namespace for pending changes and enhance the
descriptions of the intended change to use separate bits for each
action.
Signed-off-by: David Sterba <dsterba@suse.cz>
If a pending change is requested, it's not processed unless there is a
transaction commit about to happen, not even after sync or SYNC_FS
ioctl. For example a remount that toggles the inode_cache option will
not take effect after sync on a quiescent filesystem.
Signed-off-by: David Sterba <dsterba@suse.cz>
There are some actions that modify global filesystem state but cannot be
performed at the time of request, but later at the transaction commit
time when the filesystem is in a known state.
For example enabling new incompat features on-the-fly or issuing
transaction commit from unsafe contexts (sysfs handlers).
Signed-off-by: David Sterba <dsterba@suse.cz>
There is no need to keep the module loaded when it serves no function in
case the EFI runtime services are disabled. Return an error in this case
so loading the module will fail.
Also supply a module_exit function to allow unloading the module.
Last, but not least, set the owner of the file_system_type struct.
Cc: Jeremy Kerr <jk@ozlabs.org>
Cc: Matthew Garrett <matthew.garrett@nebula.com>
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
If i_size becomes large outside of MAX_INLINE_DATA, we shoud convert the inode.
Otherwise, we can make some dirty pages during the truncation, and those pages
will be written through f2fs_write_data_page.
At that moment, the inode has still inline_data, so that it tries to write non-
zero pages into inline_data area.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The scenario is like this.
One trhead triggers:
f2fs_write_data_pages
lock_page
f2fs_write_data_page
f2fs_lock_op <- wait
The other thread triggers:
f2fs_truncate
truncate_blocks
f2fs_lock_op
truncate_partial_data_page
lock_page <- wait for locking the page
This patch resolves this bug by relocating truncate_partial_data_page.
This function is just to truncate user data page and not related to FS
consistency as well.
And, we don't need to call truncate_inline_data. Rather than that,
f2fs_write_data_page will finally update inline_data later.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The # of inline_data inode is decreased only when it has inline_data.
After clearing the flag, we can't decreased the number.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
It needs to write node pages if checkpoint is not doing in order to avoid
memory pressure.
Reviewed-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
All filesystems using VFS quotas are now converted to use their private
i_dquot fields. Remove the i_dquot field from generic inode structure.
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Dave Kleikamp <dave.kleikamp@oracle.com>
CC: jfs-discussion@lists.sourceforge.net
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
CC: Mark Fasheh <mfasheh@suse.com>
CC: Joel Becker <jlbec@evilplan.org>
CC: ocfs2-devel@oss.oracle.com
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
CC: linux-ext4@vger.kernel.org
Acked-by: "Theodore Ts'o" <tytso@mit.edu>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
i_dquot array is used by relatively few filesystems (ext?, ocfs2, jfs,
reiserfs) so it is beneficial to move this array to fs-private part of
the inode. We cannot just pass quota pointers from filesystems to quota
functions because during quotaon and quotaoff we have to traverse list
of all inodes and manipulate i_dquot pointers for each inode. So we
provide a function which generic quota code can use to get pointer to
the i_dquot array from the filesystem.
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
We support user, group, and project quotas. Tell VFS about it.
CC: xfs@oss.sgi.com
CC: Dave Chinner <david@fromorbit.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
We support user and group quotas. Tell vfs about it.
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
CC: cluster-devel@redhat.com
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Currently all filesystems supporting VFS quota support user and group
quotas. With introduction of project quotas this is going to change so
make sure filesystem isn't called for quota type it doesn't support by
introduction of a bitmask determining which quota types each filesystem
supports.
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Pull btrfs fix from Chris Mason:
"It's a one liner for an error cleanup path that leads to crashes"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: fix kfree on list_head in btrfs_lookup_csums_range error cleanup
This update fixes:
- incorrect warnings about i_mutex locking in
pagecache_isize_extended() and updates comments to match expected
locking
- another zero-range bug fix for stray file size updates
- a bunch of fixes for regression in the bulkstat code introduced in
3.17.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQIcBAABAgAGBQJUXTquAAoJEK3oKUf0dfodzPQP+wVWm3suRpvfpOeljlwcCB1w
r1MjJjEgKRmlRCwTe4IYSSS2ZBSz3f5qTunde1PEAcUyyf2gO5b/gV2WCaVDWIpV
0/1RDaXIRplTvY/i5UAOtqOSUpNwvWz7PCmCAR7RCHFfTyBBbFRlRdg1GPCrBAzv
rf/kRm9C6fRHHwLojwNCLEzA0MAdjKVG05A+Xv3MnWcd9fRxtNryQnhTIUJoDCVl
5keebmizquoc88WoQxDX29j2Ce+yjMPj27YhB9Z09mfmFvbLHT46UP2jI5Ty+DX5
rJikXA5Jv6wtig9wsbsenK7fsy1CyAAbmxS8vjyHsdCSAMpN98HkndsSadF8nH5U
sEh43OJjJS5LedVNqfV+LtI9ZD9+fGpAETQFVI8TpefFBX2aYw3fomWO0AUbRuMi
s4f6iz2sDvgDa+oE38XqKif6CqFsBX0QQSuCDiPaMi79vy2VLE/I5SU7QAj42+BW
sEyGVVcdJDsDpkUe6SicIpbNNwhXCR4GYc4jI4QYjDdcK6rrliCnsI8hHk5pPeKk
Qvt6ERiP5dLvp9f6KEzvPAkdxBmDKOtHKMSEMJzBBDtxFJXStJNuYZVtMYb8JwJq
nV0WKSoXR0xT/IeMw8336HF4GO04VCHjY3QnNh0qNdHYtfXJerZmpzVhae7dJmZZ
nLFimb3q2TlfgEvkPqmi
=8S0c
-----END PGP SIGNATURE-----
Merge tag 'xfs-for-linus-3.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs
Pull xfs fixes from Dave Chinner:
"This update fixes a warning in the new pagecache_isize_extended() and
updates some related comments, another fix for zero-range
misbehaviour, and an unforntuately large set of fixes for regressions
in the bulkstat code.
The bulkstat fixes are large but necessary. I wouldn't normally push
such a rework for a -rcX update, but right now xfsdump can silently
create incomplete dumps on 3.17 and it's possible that even xfsrestore
won't notice that the dumps were incomplete. Hence we need to get
this update into 3.17-stable kernels ASAP.
In more detail, the refactoring work I committed in 3.17 has exposed a
major hole in our QA coverage. With both xfsdump (the major user of
bulkstat) and xfsrestore silently ignoring missing files in the
dump/restore process, incomplete dumps were going unnoticed if they
were being triggered. Many of the dump/restore filesets were so small
that they didn't evenhave a chance of triggering the loop iteration
bugs we introduced in 3.17, so we didn't exercise the code
sufficiently, either.
We have already taken steps to improve QA coverage in xfstests to
avoid this happening again, and I've done a lot of manual verification
of dump/restore on very large data sets (tens of millions of inodes)
of the past week to verify this patch set results in bulkstat behaving
the same way as it does on 3.16.
Unfortunately, the fixes are not exactly simple - in tracking down the
problem historic API warts were discovered (e.g xfsdump has been
working around a 20 year old bug in the bulkstat API for the past 10
years) and so that complicated the process of diagnosing and fixing
the problems. i.e. we had to fix bugs in the code as well as
discover and re-introduce the userspace visible API bugs that we
unwittingly "fixed" in 3.17 that xfsdump relied on to work correctly.
Summary:
- incorrect warnings about i_mutex locking in pagecache_isize_extended()
and updates comments to match expected locking
- another zero-range bug fix for stray file size updates
- a bunch of fixes for regression in the bulkstat code introduced in
3.17"
* tag 'xfs-for-linus-3.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs:
xfs: track bulkstat progress by agino
xfs: bulkstat error handling is broken
xfs: bulkstat main loop logic is a mess
xfs: bulkstat chunk-formatter has issues
xfs: bulkstat chunk formatting cursor is broken
xfs: bulkstat btree walk doesn't terminate
mm: Fix comment before truncate_setsize()
xfs: rework zero range to prevent invalid i_size updates
mm: Remove false WARN_ON from pagecache_isize_extended()
xfs: Check error during inode btree iteration in xfs_bulkstat()
xfs: bulkstat doesn't release AGI buffer on error
The global state_lock protects the file_hashtbl, and that has the
potential to be a scalability bottleneck.
Address this by making the file_hashtbl use RCU. Add a rcu_head to the
nfs4_file and use that when freeing ones that have been hashed. In order
to conserve space, we union the fi_rcu field with the fi_delegations
list_head which must be clear by the time the last reference to the file
is dropped.
Convert find_file_locked to use RCU lookup primitives and not to require
that the state_lock be held, and convert find_file to do a lockless
lookup. Convert find_or_add_file to attempt a lockless lookup first, and
then fall back to doing a locked search and insert if that fails to find
anything.
Also, minimize the number of times we need to calculate the hash value
by passing it in as an argument to the search and insert functions, and
optimize the order of arguments in nfsd4_init_file.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
DEALLOCATE only returns a status value, meaning we can use the noop()
xdr encoder to reply to the client.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
The ALLOCATE operation is used to preallocate space in a file. I can do
this by using vfs_fallocate() to do the actual preallocation.
ALLOCATE only returns a status indicator, so we don't need to write a
special encode() function.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
This function needs to be exported so it can be used by the NFSD module
when responding to the new ALLOCATE and DEALLOCATE operations in NFS
v4.2. Christoph Hellwig suggested renaming the function to stay
consistent with how other vfs functions are named.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
To match the previous patch which used the pre-alloc buffer for
writes, this patch causes reads to use the same buffer.
This is not strictly necessary as the current seq_read() will allocate
on first read, so user-space can trigger the required pre-alloc. But
consistency is valuable.
The read function is somewhat simpler than seq_read() and, for example,
does not support reading from an offset into the file: reads must be
at the start of the file.
As seq_read() does not use the prealloc buffer, ->seq_show is
incompatible with ->prealloc and caused an EINVAL return from open().
sysfs code which calls into kernfs always chooses the correct function.
As the buffer is shared with writes and other reads, the mutex is
extended to cover the copy_to_user.
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
md/raid allows metadata management to be performed in user-space.
A various times, particularly on device failure, the metadata needs
to be updated before further writes can be permitted.
This means that the user-space program which updates metadata much
not block on writeout, and so must not allocate memory.
mlockall(MCL_CURRENT|MCL_FUTURE) and pre-allocation can avoid all
memory allocation issues for user-memory, but that does not help
kernel memory.
Several kernel objects can be pre-allocated. e.g. files opened before
any writes to the array are permitted.
However some kernel allocation happens in places that cannot be
pre-allocated.
In particular, writes to sysfs files (to tell md that it can now
allow writes to the array) allocate a buffer using GFP_KERNEL.
This patch allows attributes to be marked as "PREALLOC". In that case
the maximal buffer is allocated when the file is opened, and then used
on each write instead of allocating a new buffer.
As the same buffer is now shared for all writes on the same file
description, the mutex is extended to cover full use of the buffer
including the copy_from_user().
The new __ATTR_PREALLOC() 'or's a new flag in to the 'mode', which is
inspected by sysfs_add_file_mode_ns() to determine if the file should be
marked as requiring prealloc.
Despite the comment, we *do* use ->seq_show together with ->prealloc
in this patch. The next patch fixes that.
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
According to the user expectations common utilities like dd or sh
redirection operator > should work correctly over binary files from
sysfs. At the moment doing excessive write can not be completed:
write(1, "\0\0\0\0\0\0\0\0", 8) = 4
write(1, "\0\0\0\0", 4) = 0
write(1, "\0\0\0\0", 4) = 0
write(1, "\0\0\0\0", 4) = 0
...
Fix the problem by returning EFBIG described in man 2 write.
Signed-off-by: Vladimir Zapolskiy <vladimir_zapolskiy@mentor.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The journal update function did not work for extended attributes properly,
because extended attribute inodes carry the xattr data, and the size of this
data was not taken into account.
Artem: improved commit message, amended the patch a bit.
Signed-off-by: Subodh Nijsure <snijsure@grid-net.com>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Signed-off-by: Ben Shelton <ben.shelton@ni.com>
Acked-by: Brad Mouring <brad.mouring@ni.com>
Acked-by: Gratian Crisan <gratian.crisan@ni.com>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>