Replace all module uses with the new vfs_kern_mount() interface, and fix up
simple_pin_fs().
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
do_kern_mount() does not allow the kernel to use private mount interfaces
without exposing the same interfaces to userland. The problem is that the
filesystem is referenced by name, thus meaning that it and its mount
interface must be registered in the global filesystem list.
vfs_kern_mount() passes the struct file_system_type as an explicit
parameter in order to overcome this limitation.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Now that we have a real nfs_invalidate_page() to ensure that
truncate_inode_pages() does the right thing when there are pending dirty
pages, we can get rid of nfs_delete_inode().
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
In the case of a call to truncate_inode_pages(), we should really try to
cancel any pending writes on the page.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We just set *acl_len to zero, and attrlen is unsigned, so this comparison
is clearly bogus. I have no idea what I was thinking.
Fixes a bug that caused getacl to fail over krb5p.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Fix two errors in the client-side acl cache: First, when nfs3_proc_getacl
requests only the default acl of a file and the access acl is not cached
already, a NULL access acl entry is cached instead of ERR_PTR(-EAGAIN)
("not cached").
Second, update the cached acls in nfs3_proc_setacls: nfs_refresh_inode does
not always invalidate the cached acls, and when it does not, the cached acls
get out of sync.
Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Currently, we are accounting for all calls to nfs_revalidate_inode(), but not
to nfs_revalidate_mapping(), or nfs_lookup_verify_inode(), etc...
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Separate out the function of revalidating the inode metadata, and
revalidating the mapping. The former may be called by lookup(),
and only really needs to check that permissions, ctime, etc haven't changed
whereas the latter needs only done when we want to read data from the page
cache, and may need to sync and then invalidate the mapping.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Whenever the directory changes, we want to make sure that we always
invalidate its page cache. Fix up update_changeattr() and
nfs_mark_for_revalidate() so that they do so.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Fix up a bug in the handling of NFS_INO_REVAL_PAGECACHE: make sure that
nfs_update_inode() clears it when we're sure we're not racing with other
updates.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Clean up use of page_array, and fix an off-by-one error noticed by Tom
Talpey which causes kmalloc calls in cases where using the page_array
is sufficient.
Test plan:
Normal client functional testing with r/wsize=32768.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The XID generator uses get_random_bytes to generate an initial XID.
NFS_ROOT starts up before the random driver, though, so get_random_bytes
doesn't set a random XID for NFS_ROOT. This causes NFS_ROOT mount points
to reuse XIDs every time the client is booted. If the client boots often
enough, the server will start serving old replies out of its DRC.
Use net_random() instead.
Test plan:
I/O intensive workloads should perform well and generate no errors. Traces
taken during client reboots should show that NFS_ROOT mounts use unique
XIDs after every reboot.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Make the RPC client select privileged ephemeral source ports at
random. This improves DRC behavior on the server by using the
same port when reconnecting for the same mount point, but using
a different port for fresh mounts.
The Linux TCP implementation already does this for nonprivileged
ports. Note that TCP sockets in TIME_WAIT will prevent quick reuse
of a random ephemeral port number by leaving the port INUSE until
the connection transitions out of TIME_WAIT.
Test plan:
Connectathon against every known server implementation using multiple
mount points. Locking especially.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The Linux NFSv4 server violates RFC3530 in that the change attribute is not
guaranteed to be updated for every change to the inode. Our optimisation
for checking whether or not the inode metadata has changed or not is broken
too. Grr....
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The code that is supposed to zero the uninitialised partial pages when the
server returns a short read is currently broken: it looks at the nfs_page
wb_pgbase and wb_bytes fields instead of the equivalent nfs_read_data
values when deciding where to start truncating the page.
Also ensure that we are more careful about setting PG_uptodate
before retrying a short read: the retry will change the nfs_read_data
args.pgbase and args.count.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
A few cleanups in hvc_rtas.c:
1. Remove unused RTASCONS_PUT_ATTEMPTS
2. Remove unused rtascons_put_delay.
3. Use i as a loop counter like everyone else on earth.
4. Remove pointless variables, eg. x = foo; if (x) return something_else;
5. Whitespace cleanups and formatting.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Currently the hvc_rtas driver is painfully slow to use. Our "benchmark" is
ls -R /etc, which spits out about 27866 characters. The theoretical maximum
speed would be about 2.2 seconds, the current code takes ~50 seconds.
The core of the problem is that sometimes when the tty layer asks us to push
characters the firmware isn't able to handle some or all of them, and so
returns an error. The current code sees this and just returns to the tty code
with the buffer half sent.
The khvcd thread will eventually wake up and try to push more characters, which
will usually work because by then the firmware's had time to make room. But
the khvcd thread only wakes up every 10 milliseconds, which isn't fast enough.
So change the khvcd thread logic so that if there's an incomplete write we
yield() and then immediately try writing again. Doing so makes POLL_QUICK and
POLL_WRITE synonymous, so remove POLL_QUICK.
With this patch our "benchmark" takes ~2.8 seconds.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
This gives the ability to control whether alignment exceptions get
fixed up or reported to the process as a SIGBUS, using the existing
PR_SET_UNALIGN and PR_GET_UNALIGN prctls. We do not implement the
option of logging a message on alignment exceptions.
Signed-off-by: Paul Mackerras <paulus@samba.org>
This adds the PowerPC part of the code to allow processes to change
their endian mode via prctl.
This also extends the alignment exception handler to be able to fix up
alignment exceptions that occur in little-endian mode, both for
"PowerPC" little-endian and true little-endian.
We always enter signal handlers in big-endian mode -- the support for
little-endian mode does not amount to the creation of a little-endian
user/kernel ABI. If the signal handler returns, the endian mode is
restored to what it was when the signal was delivered.
We have two new kernel CPU feature bits, one for PPC little-endian and
one for true little-endian. Most of the classic 32-bit processors
support PPC little-endian, and this is reflected in the CPU feature
table. There are two corresponding feature bits reported to userland
in the AT_HWCAP aux vector entry.
This is based on an earlier patch by Anton Blanchard.
Signed-off-by: Paul Mackerras <paulus@samba.org>
This new prctl is intended for changing the execution mode of the
processor, on processors that support both a little-endian mode and a
big-endian mode. It is intended for use by programs such as
instruction set emulators (for example an x86 emulator on PowerPC),
which may find it convenient to use the processor in an alternate
endianness mode when executing translated instructions.
Note that this does not imply the existence of a fully-fledged ABI for
both endiannesses, or of compatibility code for converting system
calls done in the non-native endianness mode. The program is expected
to arrange for all of its system call arguments to be presented in the
native endianness.
Switching between big and little-endian mode will require some care in
constructing the instruction sequence for the switch. Generally the
instructions up to the instruction that invokes the prctl system call
will have to be in the old endianness, and subsequent instructions
will have to be in the new endianness.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
When debugging early kernel crashes that happen after console_init() and
before a proper console driver takes over, we often have to go hack into
udbg.c to prevent it from unregistering so we can "see" what is
happening. This patch adds a kernel command line option "udbg-immortal"
instead to avoid having to modify the kernel.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
POWER6 moves some of the MMCRA bits and also requires some bits to be
cleared each PMU interrupt.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Acked-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Make sure dma_alloc_coherent allocates memory from the local node. This
is important on Cell where we avoid going through the slow cpu
interconnect.
Note: I could only test this patch on Cell, it should be verified on
some pseries machine by those that have the hardware.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Paul Mackerras <paulus@samba.org>
On 64bit powerpc we can find out what node a pci bus hangs off, so
implement the topology.h macros that export this information.
For 32bit this seems a little more difficult, but I don't know of 32bit
powerpc NUMA machines either, so let's leave it out for now.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Paul Mackerras <paulus@samba.org>
This patch attempts to handle RTAS "busy" return codes in a more simple
and consistent manner. Typical callers of RTAS shouldn't have to
manage wait times and delay calls.
This patch also changes the kernel to use msleep() rather than udelay()
when a runtime delay is necessary. This will avoid CPU soft lockups
for extended delay conditions.
Signed-off-by: John Rose <johnrose@austin.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
From: Andrew Morton <akpm@osdl.org>
arch/powerpc/Kconfig:339:warning: leading whitespace ignored
arch/powerpc/Kconfig:347:warning: leading whitespace ignored
arch/powerpc/Kconfig:357:warning: leading whitespace ignored
arch/powerpc/Kconfig:373:warning: leading whitespace ignored
arch/powerpc/Kconfig:382:warning: leading whitespace ignored
arch/powerpc/Kconfig:394:warning: leading whitespace ignored
arch/powerpc/Kconfig:842:warning: leading whitespace ignored
arch/powerpc/Kconfig:847:warning: leading whitespace ignored
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
The 970MP cputable entry needs a num_pmcs entry for oprofile to work.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
My js20 appears to lack the ibm,#dma- properties, and boot fails with a
"Kernel panic - not syncing: iommu_init_table: Can't allocate 0 bytes"
message.
This adds a fallback to the "#address-cells" property in case the
"#ibm,dma-address-cells" property is missing. Tested on js20 and
power5 lpar.
Unless there is a more elegant solution... :-)
Signed-off-by: Will Schmidt <willschm@us.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Our MMU hash management code would not set the "C" bit (changed bit) in
the hardware PTE when updating a RO PTE into a RW PTE. That would cause
the hardware to possibly to a write back to the hash table to set it on
the first store access, which in addition to being a performance issue,
might also hit a bug when running with native hash management (non-HV)
as our code is specifically optimized for the case where no write back
happens.
Thus there is a very small therocial window were a hash PTE can become
corrupted if that HPTE has just been upgraded to read write, a store
access happens on it, and that races with another processor evicting
that same slot. Since eviction (caused by an almost full hash) is
extremely rare, the bug is very unlikely to happen fortunately.
This fixes by allowing the updating of the protection bits in the native
hash handling to also set (but not clear) the "C" bit, and, in order to
also improve performances in the general case, by always setting that
bit on newly inserted hash PTE so that writeback really never happens.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
This patch cleans up some locking & error handling in the ppc vdso and
moves the vdso base pointer from the thread struct to the mm context
where it more logically belongs. It brings the powerpc implementation
closer to Ingo's new x86 one and also adds an arch_vma_name() function
allowing to print [vsdo] in /proc/<pid>/maps if Ingo's x86 vdso patch is
also applied.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
I have tested PPC_PTRACE_GETREGS and PPC_PTRACE_SETREGS on umview.
I do not understand why historically these tags has been defined as
PPC_PTRACE_GETREGS and PPC_PTRACE_SETREGS instead of simply
PTRACE_[GS]ETREGS. The other "originality" is that the address must be
put into the "addr" field instead of the "data" field as stated in the
manual.
Signed-off-by: renzo davoli <renzo@cs.unibo.it>
Signed-off-by: Paul Mackerras <paulus@samba.org>
getting decremented by 1. Since nused never reaches 0, the "if
(!free->hdr.nused)" check in xfs_dir2_leafn_remove() fails every time and
xfs_dir2_shrink_inode() doesn't get called when it should. This causes
extra blocks to be left on an empty directory and the directory in unable
to be converted back to inline extent mode.
SGI-PV: 951958
SGI-Modid: xfs-linux-melb:xfs-kern:211382a
Signed-off-by: Mandy Kirkconnell <alkirkco@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>
truncate down followed by delayed allocation (buffered writes) - worst
case scenario for the notorious NULL files problem. This reduces the
window where we are exposed to that problem significantly.
SGI-PV: 917976
SGI-Modid: xfs-linux-melb:xfs-kern:26100a
Signed-off-by: Nathan Scott <nathans@sgi.com>