Cosmetic change: make alignment explicit in to_ipoib_neigh.
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The size of struct ib_uverbs_create_qp_resp is not even multiple of 8
bytes. This causes problems for low-level drivers that add private
data after the structure: 32-bit userspace will look in the wrong
place for a response from a 64-bit kernel. Fix this by adding a
reserved field. Also, bump the ABI version because this changes the
size of a structure.
Pointed out by Hoang-Nam Nguyen <HNGUYEN@de.ibm.com>.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add support to uverbs to handle querying userspace SRQs (shared
receive queues), including adding an ABI for marshalling requests and
responses. The kernel midlayer already has the underlying
ib_query_srq() function.
Signed-off-by: Dotan Barak <dotanb@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add support to uverbs to handle querying userspace QPs (queue pairs),
including adding an ABI for marshalling requests and responses. The
kernel midlayer already has the underlying ib_query_qp() function.
Signed-off-by: Dotan Barak <dotanb@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Use ib_modify_qp_is_ok() in mthca, and delete the big table of
attributes for queue pair state transitions.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The in-kernel mthca driver contains a table of which attributes are
valid for each queue pair state transition. It turns out that both
other IB drivers -- ipath and ehca -- which are being prepared for
merging have copied this table, errors and all.
To forestall this code duplication, move this table and the code to
check parameters against it into a midlayer library function,
ib_modify_qp_is_ok().
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add low-level driver support to ib_mthca so that consumers can request
a "send queue drained" event be generated when a transiton to the SQD
state completes.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The call to ib_get_agent_port() shouldn't be possible to fail when
smi_check_local_dr_smp() is called from ib_mad_recv_done_handler().
When it is called from handle_outgoing_dr_smp(), the device and
port_num come from mad_agent_priv so I assume the call to
ib_get_agent_port() shouldn't fail either. In either case,
smi_check_local_smp() only uses the mad_agent pointer to check that
mad_agent->device->process_mad is not NULL. The device pointer would
have to be the same as the one passed to smi_check_local_dr_smp()
since that pointer is used later instead of the one checked in
smi_check_local_smp().
Signed-off-by: Hal Rosenstock <halr@voltaire.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
smi_check_local_dr_smp() is called only from two places in core/mad.c
It returns 0 or 1. In smi_check_local_dr_smp(), it checks for
a directed route SMP but this function is only called when the SMP
is a directed route so this is a NOP.
Signed-off-by: Hal Rosenstock <halr@voltaire.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
This patch allows the consumer to set the page size of "pages" mapped
by the pool FMRs, which is a feature already existing in the base
verbs API. On the cosmetic side it changes ib_fmr_attr.page_size field
to be named page_shift.
Signed-off-by: Or Gerlitz <ogerlitz@voltaire.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add a modify_device method to mthca, which implements setting the node
description. This makes the writable "node_desc" sysfs attribute work
for Mellanox HCAs.
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Expose a writable "node_desc" sysfs attribute for InfiniBand devices.
This allows userspace to update the node description with information
such as the node's hostname, so that IB network management software
can tie its view to the real world.
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add support to uverbs to handle resizing userspace CQs (completion
queues), including adding an ABI for marshalling requests and
responses. The kernel midlayer already has ib_resize_cq().
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The might_sleep() annotations in mthca are silly -- they all occur
shortly before calls that will end up in core functions like kmalloc()
that will print the same warning in an unsafe context anyway. In
fact, beyond cluttering the source, we're actually bloating text with
CONFIG_DEBUG_SPINLOCK_SLEEP and/or CONFIG_PREEMPT_VOLUNTARY set.
With both options set, getting rid of the might_sleep()s saves a lot:
add/remove: 0/0 grow/shrink: 0/7 up/down: 0/-171 (-171)
function old new delta
mthca_pd_alloc 132 109 -23
mthca_init_cq 969 946 -23
mthca_mr_alloc 592 568 -24
mthca_pd_free 67 42 -25
mthca_free_mr 219 194 -25
mthca_free_cq 570 545 -25
mthca_fmr_alloc 742 716 -26
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The function mthca_free_err_wqe() can never fail, so get rid of its
return value. That means handle_error_cqe() doesn't have to check
what mthca_free_err_wqe() returns, which means it can't fail either
and doesn't have to return anything either. All this results in
simpler source code and a slight object code improvement:
add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-10 (-10)
function old new delta
mthca_free_err_wqe 83 81 -2
mthca_poll_cq 1758 1750 -8
Signed-off-by: Roland Dreier <rolandd@cisco.com>
1) huge_pte_offset() did not check the page table hierarchy
elements as being empty correctly, resulting in an OOPS
2) Need platform specific hugetlb_get_unmapped_area() to handle
the top-down vs. bottom-up address space allocation strategies.
Signed-off-by: David S. Miller <davem@davemloft.net>
init/do_mounts_rd.c depends upon CONFIG_BLK_DEV_RAM, not CONFIG_BLK_DEV_INITRD.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
We only need to write an invalid tag every 16 bytes,
so taking advantage of this can save many instructions
compared to the simple memset() call we make now.
A prefetching implementation is implemented for sun4u
and a block-init store version if implemented for Niagara.
The next trick is to be able to perform an init and
a copy_tsb() in parallel when growing a TSB table.
Signed-off-by: David S. Miller <davem@davemloft.net>
online_page() is straightforward, and then add a dummy
remove_memory() that returns -EINVAL just like i386.
There is no point in implementing remove_memory() since
__remove_pages() has no implementation either.
Signed-off-by: David S. Miller <davem@davemloft.net>
Try only lightly on > 1 order allocations.
If a grow fails, we are under memory pressure, so do not try
to grow the TSB for this address space any more.
If a > 0 order TSB allocation fails on a new fork, retry using
a 0 order allocation.
Signed-off-by: David S. Miller <davem@davemloft.net>
Put it one page below the top of the 32-bit address space.
This gives us ~16MB more address space to work with.
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently allocations are very constrained for 32-bit processes.
It grows down-up from 0x70000000 to 0xf0000000 which gives about
2GB of stack + dynamic mmap() space.
So support the top-down method, and we need to override the
generic helper function in order to deal with D-cache coloring.
With these changes I was able to squeeze out a mmap() just over
3.6GB in size in a 32-bit process.
Signed-off-by: David S. Miller <davem@davemloft.net>
This is good for up to %50 performance improvement of some test cases.
The problem has been the race conditions, and hopefully I've plugged
them all up here.
1) There was a serious race in switch_mm() wrt. lazy TLB
switching to and from kernel threads.
We could erroneously skip a tsb_context_switch() and thus
use a stale TSB across a TSB grow event.
There is a big comment now in that function describing
exactly how it can happen.
2) All code paths that do something with the TSB need to be
guarded with the mm->context.lock spinlock. This makes
page table flushing paths properly synchronize with both
TSB growing and TLB context changes.
3) TSB growing events are moved to the end of successful fault
processing. Previously it was in update_mmu_cache() but
that is deadlock prone. At the end of do_sparc64_fault()
we hold no spinlocks that could deadlock the TSB grow
sequence. We also have dropped the address space semaphore.
While we're here, add prefetching to the copy_tsb() routine
and put it in assembler into the tsb.S file. This piece of
code is quite time critical.
There are some small negative side effects to this code which
can be improved upon. In particular we grab the mm->context.lock
even for the tsb insert done by update_mmu_cache() now and that's
a bit excessive. We can get rid of that locking, and the same
lock taking in flush_tsb_user(), by disabling PSTATE_IE around
the whole operation including the capturing of the tsb pointer
and tsb_nentries value. That would work because anyone growing
the TSB won't free up the old TSB until all cpus respond to the
TSB change cross call.
I'm not quite so confident in that optimization to put it in
right now, but eventually we might be able to and the description
is here for reference.
This code seems very solid now. It passes several parallel GCC
bootstrap builds, and our favorite "nut cruncher" stress test which is
a full "make -j8192" build of a "make allmodconfig" kernel. That puts
about 256 processes on each cpu's run queue, makes lots of process cpu
migrations occur, causes lots of page table and TLB flushing activity,
incurs many context version number changes, and it swaps the machine
real far out to disk even though there is 16GB of ram on this test
system. :-)
Signed-off-by: David S. Miller <davem@davemloft.net>
Sun does't put an SEEPROM behind the tigon3 chip, among other things,
so accesses to these areas just give bus timeouts.
Signed-off-by: David S. Miller <davem@davemloft.net>
Report 'sun4v' when appropriate in /proc/cpuinfo
Remove all the verifications of the OBP version string. Just
make sure it's there, and report it raw in the bootup logs and
via /proc/cpuinfo.
Signed-off-by: David S. Miller <davem@davemloft.net>
The mapping is a simple "(cpuid >> 2) == core" for now.
Later we'll add more sophisticated code that will walk
the sun4v machine description and figure this out from
there.
We should also add core mappings for jaguar and panther
processors.
Signed-off-by: David S. Miller <davem@davemloft.net>
The page->flags manipulations done by the D-cache dirty
state tracking was broken because the constants were not
marked with "UL" to make them 64-bit, which means we were
clobbering the upper 32-bits of page->flags all the time.
This doesn't jive well with sparsemem which stores the
section and indexing information in the top 32-bits of
page->flags.
This is yet another sparc64 bug which has been with us
forever.
While we're here, tidy up some things in bootmem_init()
and paginig_init():
1) Pass min_low_pfn to init_bootmem_node(), it's identical
to (phys_base >> PAGE_SHIFT) but we should use consistent
with the variable names we print in CONFIG_BOOTMEM_DEBUG
2) max_mapnr, although no longer used, was being set
inaccurately, we shouldn't subtract pfn_base any more.
3) All the games with phys_base in the zones_*[] arrays
we pass to free_area_init_node() are no longer necessary.
Thanks to Josh Grebe and Fabbione for the bug reports
and testing. Fix also verified locally on an SB2500
which had a memory layout that triggered the same problem.
Signed-off-by: David S. Miller <davem@davemloft.net>
This has been pending for a long time, and the fact
that we waste a ton of ram on some configurations
kind of pushed things over the edge.
Signed-off-by: David S. Miller <davem@davemloft.net>
Don't piggy back the SMP receive signal code to do the
context version change handling.
Instead allocate another fixed PIL number for this
asynchronous cross-call. We can't use smp_call_function()
because this thing is invoked with interrupts disabled
and a few spinlocks held.
Also, fix smp_call_function_mask() to count "cpus" correctly.
There is no guarentee that the local cpu is in the mask
yet that is exactly what this code was assuming.
Signed-off-by: David S. Miller <davem@davemloft.net>
1) Always spin_lock_init() in init_context(). The caller essentially
clears it out, or copies the mm info from the parent. In both
cases we need to explicitly initialize the spinlock.
2) Always do explicit IRQ disabling while taking mm->context.lock
and ctx_alloc_lock.
Signed-off-by: David S. Miller <davem@davemloft.net>
this patch converts arch/sparc64 to kzalloc usage.
Crosscompile tested with allyesconfig.
Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
If we were aligned, but didn't have at least 256MB left
to process, we would loop forever.
Thanks to fabbione for the report and testing the fix.
Signed-off-by: David S. Miller <davem@davemloft.net>