Rather than explicitly initializing many fields to 0, NULL, or false
in a newly-allocated message, just use kzalloc() for allocating new
messages. This will become a much more convenient way of doing
things anyway for upcoming patches that abstract the data field.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
While processing an outgoing pagelist (either the data pagelist or
trail) in a ceph message, the messenger cycles through each of the
pages on the list. This is accomplished in out_msg_pos_next(), if
the end of the first page on the list is reached, the first page is
moved to the end of the list.
There is a list operation, list_rotate_left(), which performs
exactly this operation, and by using it, what's really going on
becomes more obvious.
So replace these two list_move_tail() calls with list_rotate_left().
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Define a new function in_msg_pos_next() to match out_msg_pos_next(),
and use it in place of code at the end of read_partial_message_pages()
and read_partial_message_bio().
Note that the page number is incremented and offset reset under
slightly different conditions from before. The result is
equivalent, however, as explained below.
Each time an incoming message is going to arrive, we find out how
much room is left--not surpassing the current page--and provide that
as the number of bytes to receive. So the amount we'll use is the
lesser of: all that's left of the entire request; and all that's
left in the current page.
If we received exactly how many were requested, we either reached
the end of the request or the end of the page. In the first case,
we're done, in the second, we move onto the next page in the array.
In all cases but (possibly) on the last page, after adding the
number of bytes received, page_pos == PAGE_SIZE. On the last page,
it doesn't really matter whether we increment the page number and
reset the page position, because we're done and we won't come back
here again. The code previously skipped over that last case,
basically. The new code handles that case the same as the others,
incrementing and resetting.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
There is only one caller for read_partial_message_bio(), and it
always passes &msg->bio_iter and &bio_seg as the second and third
arguments. Furthermore, the message in question is always the
connection's in_msg, and we can get that inside the called function.
So drop those two parameters and use their derived equivalents.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Change the type of the "more" parameter from int to bool.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Some values printed are not (necessarily) in CPU order. We already
have a copy of the converted versions, so use them.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
This is probably unnecessary but the code read as if it were wrong
in read_partial_message().
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
In ceph_con_in_msg_alloc() it is possible for a connection's
alloc_msg method to indicate an incoming message should be skipped.
By default, read_partial_message() initializes the skip variable
to 0 before it gets provided to ceph_con_in_msg_alloc().
The osd client, mon client, and mds client each supply an alloc_msg
method. The mds client always assigns skip to be 0.
The other two leave the skip value of as-is, or assigns it to zero,
except:
- if no (osd or mon) request having the given tid is found, in
which case skip is set to 1 and NULL is returned; or
- in the osd client, if the data of the reply message is not
adequate to hold the message to be read, it assigns skip
value 1 and returns NULL.
So the returned message pointer will always be NULL if skip is ever
non-zero.
Clean up the logic a bit in ceph_con_in_msg_alloc() to make this
state of affairs more obvious. Add a comment explaining how a null
message pointer can mean either a message that should be skipped or
a problem allocating a message.
This resolves:
http://tracker.ceph.com/issues/4324
Reported-by: Greg Farnum <greg@inktank.com>
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
An osd request defines information about where data to be read
should be placed as well as where data to write comes from.
Currently these are represented by common fields.
Keep information about data for writing separate from data to be
read by splitting these into data_in and data_out fields.
This is the key patch in this whole series, in that it actually
identifies which osd requests generate outgoing data and which
generate incoming data. It's less obvious (currently) that an osd
CALL op generates both outgoing and incoming data; that's the focus
of some upcoming work.
This resolves:
http://tracker.ceph.com/issues/4127
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
An osd request uses either pages or a bio list for its data. Use a
union to record information about the two, and add a data type
tag to select between them.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Pull the fields in an osd request structure that define the data for
the request out into a separate structure.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Currently ceph_osdc_new_request() assigns an osd request's
r_num_pages and r_alignment fields. The only thing it does
after that is call ceph_osdc_build_request(), and that doesn't
need those fields to be assigned.
Move the assignment of those fields out of ceph_osdc_new_request()
and into its caller. As a result, the page_align parameter is no
longer used, so get rid of it.
Note that in ceph_sync_write(), the value for req->r_num_pages had
already been calculated earlier (as num_pages, and fortunately
it was computed the same way). So don't bother recomputing it,
but because it's not needed earlier, move that calculation after the
call to ceph_osdc_new_request(). Hold off making the assignment to
r_alignment, doing it instead r_pages and r_num_pages are
getting set.
Similarly, in start_read(), nr_pages already holds the number of
pages in the array (and is calculated the same way), so there's no
need to recompute it. Move the assignment of the page alignment
down with the others there as well.
This and the next few patches are preparation work for:
http://tracker.ceph.com/issues/4127
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
(This is being reposted. The first one had a problem because it
erroneously added a similar change elsewhere; that change has been
dropped.)
The next patch in this series points out that the calculation for
the number of pages in an osd request is getting done twice. It
is not obvious, but the result of both calculations is identical.
This patch simplifies one of them--as a separate step--to make
it clear that the transformation in the next patch is valid.
In ceph_sync_write() there is some magic that computes page_align
for an osd request. But a little analysis shows it can be
simplified.
First, we have:
io_align = pos & ~PAGE_MASK;
which is used here:
page_align = (pos - io_align + buf_align) & ~PAGE_MASK;
Note (pos - io_align) simply rounds "pos" down to the nearest multiple
of the page size.
We also have:
buf_align = (unsigned long)data & ~PAGE_MASK;
Adding buf_align to that rounded-down "pos" value will stay within
the same page; the result will just be offset by the page offset for
the "data" pointer. The final mask therefore leaves just the value
of "buf_align".
One more simplification. Note that the result of calc_pages_for()
is invariant of which page the offset starts in--the only thing that
matters is the offset within the starting page. We will have
put the proper page offset to use into "page_align", so just use
that in calculating num_pages.
This resolves:
http://tracker.ceph.com/issues/4166
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
There's a spot that computes the number of pages to allocate for a
page-aligned length by just shifting it. Use calc_pages_for()
instead, to be consistent with usage everywhere else. The result
is the same.
The reason for this is to make it clearer in an upcoming patch that
this calculation is duplicated.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Currently, incoming mds messages never use page data, which means
there is no need to set the page_alignment field in the message.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
The only user of the ceph messenger that doesn't define an alloc_msg
method is the mds client. Define one, such that it works just like
it did before, and simplify ceph_con_in_msg_alloc() by assuming the
alloc_msg method is always present.
This and the next patch resolve:
http://tracker.ceph.com/issues/4322
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
In ceph_con_in_msg_alloc(), if no alloc_msg method is defined for a
connection a new message is allocated with ceph_msg_new().
Drop the mutex before making this call, and make sure we're still
connected when we get it back again.
This is preparing for the next patch, which ensures all connections
define an alloc_msg method, and then handles them all the same way.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
The purpose of ceph_calc_object_layout() is to fill in the pool
number and seed for a ceph_pg structure provided, based on a given
osd map and target object id.
Currently that function takes a file layout parameter, but the only
thing used out of that is its pool number.
Change the function so it takes a pool number rather than the full
file layout structure. Only update the ceph_pg if the pool is found
in the osd map. Get rid of few useless lines of code from the
function while there.
Since the function now very clearly just fills in the ceph_pg
structure it's provided, rename it ceph_calc_ceph_pg().
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
The pagelist_count field is never actually used, so get rid of it.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Two of the fields defining osd operations are defined using (char *)
while the data they represent are really untyped, not character
strings. Change them to have type (void *).
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
The new cases added to osd_req_encode_op() caused a new sparse
error, which highlighted an existing problem that had been
overlooked since it was originally checked in. When an unsupported
opcode is found the destination rather than the source opcode was
being used in the error message. The two differ in their byte
order, and we want to be using the one in the source.
Fix the problem in both spots.
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
An osd request marked to linger will be re-submitted in the event
a connection to the target osd gets dropped. Currently, if there
is a callback function associated with a request it will be called
each time a request is submitted--which for lingering requests can
be more than once.
Change it so a request--including lingering ones--will get completed
(from the perspective of the user of the osd client) exactly once.
This resolves:
http://tracker.ceph.com/issues/3967
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
make __ceph_do_pending_vmtruncate() acquire the i_mutex if the caller
does not hold the i_mutex, so ceph_aio_read() can call safely.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
ceph_aio_write() has an optimization that marks CEPH_CAP_FILE_WR
cap dirty before data is copied to page cache and inode size is
updated. The optimization avoids slow cap revocation caused by
balance_dirty_pages(), but introduces inode size update race. If
ceph_check_caps() flushes the dirty cap before the inode size is
updated, MDS can miss the new inode size. So just remove the
optimization.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
commit 22cddde104 breaks the atomicity of write operation, it also
introduces a deadlock between write and truncate.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
Conflicts:
fs/ceph/addr.c
commit c6ffe10015 moved the flag that tracks if the dcache contents
for a directory are complete to dentry. The problem is there are
lots of places that use ceph_dir_{set,clear,test}_complete() while
holding i_ceph_lock. but ceph_dir_{set,clear,test}_complete() may
sleep because they call dput().
This patch basically reverts that commit. For ceph_d_prune(), it's
called with both the dentry to prune and the parent dentry are
locked. So it's safe to access the parent dentry's d_inode and
clear I_COMPLETE flag.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
MDS ignores cap update message if migrate_seq mismatch, so when
receiving a cap import message with higher migrate_seq, set mds_want
according to the cap import message.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
The page alignment field for a request is currently set in
ceph_osdc_build_request(). It's not needed at that point
nor do either of its callers need that value assigned at
any point before they call ceph_osdc_start_request().
So move that assignment into ceph_osdc_start_request().
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Use distinct fields for tracking the number of pages in a message's
page array and in a message's page list. Currently only one or the
other is used at a time, but that will be changing soon.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
The only remaining reason to pass the osd request to calc_layout()
is to fill in its r_num_pages and r_page_alignment fields. Once it
fills those in, it doesn't do anything more with them.
We can therefore move those assignments into the caller, and get rid
of the "req" parameter entirely.
Note, however, that the only caller is ceph_osdc_new_request(),
and that immediately overwrites those fields with values based on
its passed-in page offset. So the assignment inside calc_layout()
was redundant anyway.
This resolves:
http://tracker.ceph.com/issues/4262
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Move the formatting of the object name (oid) to use for an object
request into the caller of calc_layout(). This makes the "vino"
parameter no longer necessary, so get rid of it.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Have calc_layout() pass the computed object number back to its
caller. (This is a small step to simplify review.)
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
The bio_seg field is used by the ceph messenger in iterating through
a bio. It should never have a negative value, so make it an
unsigned. (I contemplated making it unsigned short to match the
struct bio definition, but it offered no benefit.)
Change variables used to hold bio_seg values to all be unsigned as
well. Change two variable names in init_bio_iter() to match the
convention used everywhere else.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
If an invalid layout is provided to ceph_osdc_new_request(), its
call to calc_layout() might return an error. At that point in the
function we've already allocated an osd request structure, so we
need to free it (drop a reference) in the event such an error
occurs.
The only other value calc_layout() will return is 0, so make that
explicit in the successful case.
This resolves:
http://tracker.ceph.com/issues/4240
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Port over the mgag200 fix to cirrus as it suffers the same issue.
On F19 testing, it was noticed we get a lot of errors in dmesg
about being unable to reserve the buffer when plymouth starts,
this is due to the buffer being in the process of migrating,
so it makes sense we can't reserve it.
In order to deal with it, this adds delayed updates for the dirty
updates, when the bo is unreservable, in the normal console case
this shouldn't ever happen, its just when plymouth or X is
pushing the console bo to system memory.
Cc: stable@vger.kernel.org
Signed-off-by: Dave Airlie <airlied@redhat.com>
Port over the mgag200 fix to ast as it suffers the same issue.
On F19 testing, it was noticed we get a lot of errors in dmesg
about being unable to reserve the buffer when plymouth starts,
this is due to the buffer being in the process of migrating,
so it makes sense we can't reserve it.
In order to deal with it, this adds delayed updates for the dirty
updates, when the bo is unreservable, in the normal console case
this shouldn't ever happen, its just when plymouth or X is
pushing the console bo to system memory.
Cc: stable@vger.kernel.org
Signed-off-by: Dave Airlie <airlied@redhat.com>
On F19 testing, it was noticed we get a lot of errors in dmesg
about being unable to reserve the buffer when plymouth starts,
this is due to the buffer being in the process of migrating,
so it makes sense we can't reserve it.
In order to deal with it, this adds delayed updates for the dirty
updates, when the bo is unreservable, in the normal console case
this shouldn't ever happen, its just when plymouth or X is
pushing the console bo to system memory.
Cc: stable@vger.kernel.org
Signed-off-by: Dave Airlie <airlied@redhat.com>
Pull VFS updates from Al Viro,
Misc cleanups all over the place, mainly wrt /proc interfaces (switch
create_proc_entry to proc_create(), get rid of the deprecated
create_proc_read_entry() in favor of using proc_create_data() and
seq_file etc).
7kloc removed.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (204 commits)
don't bother with deferred freeing of fdtables
proc: Move non-public stuff from linux/proc_fs.h to fs/proc/internal.h
proc: Make the PROC_I() and PDE() macros internal to procfs
proc: Supply a function to remove a proc entry by PDE
take cgroup_open() and cpuset_open() to fs/proc/base.c
ppc: Clean up scanlog
ppc: Clean up rtas_flash driver somewhat
hostap: proc: Use remove_proc_subtree()
drm: proc: Use remove_proc_subtree()
drm: proc: Use minor->index to label things, not PDE->name
drm: Constify drm_proc_list[]
zoran: Don't print proc_dir_entry data in debug
reiserfs: Don't access the proc_dir_entry in r_open(), r_start() r_show()
proc: Supply an accessor for getting the data from a PDE's parent
airo: Use remove_proc_subtree()
rtl8192u: Don't need to save device proc dir PDE
rtl8187se: Use a dir under /proc/net/r8180/
proc: Add proc_mkdir_data()
proc: Move some bits from linux/proc_fs.h to linux/{of.h,signal.h,tty.h}
proc: Move PDE_NET() to fs/proc/proc_net.c
...
We moved the definition of shift_to_mmu_psize and mmu_psize_to_shift
out of hugetlbpage.c in patch "powerpc: New hugepage directory format".
These functions are not related to hugetlbpage and we want to use them
outside hugetlbpage.c We missed a definition for book3e when we moved
these functions. Add similar functions to mmu-book3e.h
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This context switches the new Event Based Branching (EBB) SPRs. The three new
SPRs are:
- Event Based Branch Handler Register (EBBHR)
- Event Based Branch Return Register (EBBRR)
- Branch Event Status and Control Register (BESCR)
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Matt Evans <matt@ozlabs.org>
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This turns Event Based Branching (EBB) on in the Hypervisor Facility Status and
Control Register (HFSCR) and Facility Status and Control Register (FSCR).
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We are getting low on cpu feature bits. So rather than add a separate bit for
every new Power8 feature, add a bit for arch 2.07 server catagory and use that
instead.
Hijack the value we had for BCTAR, but swap the value with CFAR so that all the
ARCH defines are together.
Note we don't touch CPU_FTR_TM, because it is conditionally enabled if
the kernel is built with TM support.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Make BHRB instructions available in problem and privileged states.
Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We do not want to take single step and branch-taken debug exception
in kernel exception code. But the address range check was not covering
all kernel exception handlers address range.
With this patch we defined the interrupt_end label which defines the
end on kernel exception code. So now we check interrupt_base to
interrupt_end range for not handling debug exception in kernel
exception entry.
Signed-off-by: Bharat Bhushan <bharat.bhushan@freescale.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Pull x86/efi changes from Peter Anvin:
"The bulk of these changes are cleaning up the efivars handling and
breaking it up into a tree of files. There are a number of fixes as
well.
The entire changeset is pretty big, but most of it is code movement.
Several of these commits are quite new; the history got very messed up
due to a mismerge with the urgent changes for rc8 which completely
broke IA64, and so Ingo requested that we rebase it to straighten it
out."
* 'x86-efi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
efi: remove "kfree(NULL)"
efi: locking fix in efivar_entry_set_safe()
efi, pstore: Read data from variable store before memcpy()
efi, pstore: Remove entry from list when erasing
efi, pstore: Initialise 'entry' before iterating
efi: split efisubsystem from efivars
efivarfs: Move to fs/efivarfs
efivars: Move pstore code into the new EFI directory
efivars: efivar_entry API
efivars: Keep a private global pointer to efivars
efi: move utf16 string functions to efi.h
x86, efi: Make efi_memblock_x86_reserve_range more readable
efivarfs: convert to use simple_open()
Al's commit e1b5bb6d12 ("consolidate cond_syscall and SYSCALL_ALIAS
declarations") broke the build on blackfin and metag due to the
following code:
#ifndef SYMBOL_NAME
#ifdef CONFIG_SYMBOL_PREFIX
#define SYMBOL_NAME(x) CONFIG_SYMBOL_PREFIX ## x
#else
#define SYMBOL_NAME(x) x
#endif
#endif
#define __SYMBOL_NAME(x) __stringify(SYMBOL_NAME(x))
__stringify literally stringifies CONFIG_SYMBOL_PREFIX ##x, so you get
lines like this in kernel/sys_ni.s:
.weak CONFIG_SYMBOL_PREFIXsys_quotactl
.set CONFIG_SYMBOL_PREFIXsys_quotactl,CONFIG_SYMBOL_PREFIXsys_ni_syscall
The patches in Rusty's modules-next tree such as "CONFIG_SYMBOL_PREFIX:
cleanup." cleans up the whole mess around symbol prefixes, so this patch
just attempts to fix the build in the meantime.
The intermediate definition of SYMBOL_NAME above isn't used and is
incorrect when CONFIG_SYMBOL_PREFIX is defined as CONFIG_SYMBOL_PREFIX
is a quoted string literal, so define __SYMBOL_NAME directly depending
on CONFIG_SYMBOL_PREFIX.
Signed-off-by: James Hogan <james.hogan@imgtec.com>
Mea-culpa-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Mike Frysinger <vapier@gentoo.org>
Cc: uclinux-dist-devel@blackfin.uclinux.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>