Commit graph

2710 commits

Author SHA1 Message Date
Anshuman Khandual
53b56ca019 powerpc: Setup BHRB instructions facility in HFSCR for POWER8
Make BHRB instructions available in problem and privileged states.

Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-05-02 10:35:15 +10:00
Linus Torvalds
73287a43cc Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:
 "Highlights (1721 non-merge commits, this has to be a record of some
  sort):

   1) Add 'random' mode to team driver, from Jiri Pirko and Eric
      Dumazet.

   2) Make it so that any driver that supports configuration of multiple
      MAC addresses can provide the forwarding database add and del
      calls by providing a default implementation and hooking that up if
      the driver doesn't have an explicit set of handlers.  From Vlad
      Yasevich.

   3) Support GSO segmentation over tunnels and other encapsulating
      devices such as VXLAN, from Pravin B Shelar.

   4) Support L2 GRE tunnels in the flow dissector, from Michael Dalton.

   5) Implement Tail Loss Probe (TLP) detection in TCP, from Nandita
      Dukkipati.

   6) In the PHY layer, allow supporting wake-on-lan in situations where
      the PHY registers have to be written for it to be configured.

      Use it to support wake-on-lan in mv643xx_eth.

      From Michael Stapelberg.

   7) Significantly improve firewire IPV6 support, from YOSHIFUJI
      Hideaki.

   8) Allow multiple packets to be sent in a single transmission using
      network coding in batman-adv, from Martin Hundebøll.

   9) Add support for T5 cxgb4 chips, from Santosh Rastapur.

  10) Generalize the VXLAN forwarding tables so that there is more
      flexibility in configurating various aspects of the endpoints.
      From David Stevens.

  11) Support RSS and TSO in hardware over GRE tunnels in bxn2x driver,
      from Dmitry Kravkov.

  12) Zero copy support in nfnelink_queue, from Eric Dumazet and Pablo
      Neira Ayuso.

  13) Start adding networking selftests.

  14) In situations of overload on the same AF_PACKET fanout socket, or
      per-cpu packet receive queue, minimize drop by distributing the
      load to other cpus/fanouts.  From Willem de Bruijn and Eric
      Dumazet.

  15) Add support for new payload offset BPF instruction, from Daniel
      Borkmann.

  16) Convert several drivers over to mdoule_platform_driver(), from
      Sachin Kamat.

  17) Provide a minimal BPF JIT image disassembler userspace tool, from
      Daniel Borkmann.

  18) Rewrite F-RTO implementation in TCP to match the final
      specification of it in RFC4138 and RFC5682.  From Yuchung Cheng.

  19) Provide netlink socket diag of netlink sockets ("Yo dawg, I hear
      you like netlink, so I implemented netlink dumping of netlink
      sockets.") From Andrey Vagin.

  20) Remove ugly passing of rtnetlink attributes into rtnl_doit
      functions, from Thomas Graf.

  21) Allow userspace to be able to see if a configuration change occurs
      in the middle of an address or device list dump, from Nicolas
      Dichtel.

  22) Support RFC3168 ECN protection for ipv6 fragments, from Hannes
      Frederic Sowa.

  23) Increase accuracy of packet length used by packet scheduler, from
      Jason Wang.

  24) Beginning set of changes to make ipv4/ipv6 fragment handling more
      scalable and less susceptible to overload and locking contention,
      from Jesper Dangaard Brouer.

  25) Get rid of using non-type-safe NLMSG_* macros and use nlmsg_*()
      instead.  From Hong Zhiguo.

  26) Optimize route usage in IPVS by avoiding reference counting where
      possible, from Julian Anastasov.

  27) Convert IPVS schedulers to RCU, also from Julian Anastasov.

  28) Support cpu fanouts in xt_NFQUEUE netfilter target, from Holger
      Eitzenberger.

  29) Network namespace support for nf_log, ebt_log, xt_LOG, ipt_ULOG,
      nfnetlink_log, and nfnetlink_queue.  From Gao feng.

  30) Implement RFC3168 ECN protection, from Hannes Frederic Sowa.

  31) Support several new r8169 chips, from Hayes Wang.

  32) Support tokenized interface identifiers in ipv6, from Daniel
      Borkmann.

  33) Use usbnet_link_change() helper in USB net driver, from Ming Lei.

  34) Add 802.1ad vlan offload support, from Patrick McHardy.

  35) Support mmap() based netlink communication, also from Patrick
      McHardy.

  36) Support HW timestamping in mlx4 driver, from Amir Vadai.

  37) Rationalize AF_PACKET packet timestamping when transmitting, from
      Willem de Bruijn and Daniel Borkmann.

  38) Bring parity to what's provided by /proc/net/packet socket dumping
      and the info provided by netlink socket dumping of AF_PACKET
      sockets.  From Nicolas Dichtel.

  39) Fix peeking beyond zero sized SKBs in AF_UNIX, from Benjamin
      Poirier"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1722 commits)
  filter: fix va_list build error
  af_unix: fix a fatal race with bit fields
  bnx2x: Prevent memory leak when cnic is absent
  bnx2x: correct reading of speed capabilities
  net: sctp: attribute printl with __printf for gcc fmt checks
  netlink: kconfig: move mmap i/o into netlink kconfig
  netpoll: convert mutex into a semaphore
  netlink: Fix skb ref counting.
  net_sched: act_ipt forward compat with xtables
  mlx4_en: fix a build error on 32bit arches
  Revert "bnx2x: allow nvram test to run when device is down"
  bridge: avoid OOPS if root port not found
  drivers: net: cpsw: fix kernel warn on cpsw irq enable
  sh_eth: use random MAC address if no valid one supplied
  3c509.c: call SET_NETDEV_DEV for all device types (ISA/ISAPnP/EISA)
  tg3: fix to append hardware time stamping flags
  unix/stream: fix peeking with an offset larger than data in queue
  unix/dgram: fix peeking with an offset larger than data in queue
  unix/dgram: peek beyond 0-sized skbs
  openvswitch: Remove unneeded ovs_netdev_get_ifindex()
  ...
2013-05-01 14:08:52 -07:00
Linus Torvalds
08d7676083 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal
Pull compat cleanup from Al Viro:
 "Mostly about syscall wrappers this time; there will be another pile
  with patches in the same general area from various people, but I'd
  rather push those after both that and vfs.git pile are in."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
  syscalls.h: slightly reduce the jungles of macros
  get rid of union semop in sys_semctl(2) arguments
  make do_mremap() static
  sparc: no need to sign-extend in sync_file_range() wrapper
  ppc compat wrappers for add_key(2) and request_key(2) are pointless
  x86: trim sys_ia32.h
  x86: sys32_kill and sys32_mprotect are pointless
  get rid of compat_sys_semctl() and friends in case of ARCH_WANT_OLD_COMPAT_IPC
  merge compat sys_ipc instances
  consolidate compat lookup_dcookie()
  convert vmsplice to COMPAT_SYSCALL_DEFINE
  switch getrusage() to COMPAT_SYSCALL_DEFINE
  switch epoll_pwait to COMPAT_SYSCALL_DEFINE
  convert sendfile{,64} to COMPAT_SYSCALL_DEFINE
  switch signalfd{,4}() to COMPAT_SYSCALL_DEFINE
  make SYSCALL_DEFINE<n>-generated wrappers do asmlinkage_protect
  make HAVE_SYSCALL_WRAPPERS unconditional
  consolidate cond_syscall and SYSCALL_ALIAS declarations
  teach SYSCALL_DEFINE<n> how to deal with long long/unsigned long long
  get rid of duplicate logics in __SC_....[1-6] definitions
2013-05-01 07:21:43 -07:00
Linus Torvalds
8700c95adb Merge branch 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull SMP/hotplug changes from Ingo Molnar:
 "This is a pretty large, multi-arch series unifying and generalizing
  the various disjunct pieces of idle routines that architectures have
  historically copied from each other and have grown in random, wildly
  inconsistent and sometimes buggy directions:

   101 files changed, 455 insertions(+), 1328 deletions(-)

  this went through a number of review and test iterations before it was
  committed, it was tested on various architectures, was exposed to
  linux-next for quite some time - nevertheless it might cause problems
  on architectures that don't read the mailing lists and don't regularly
  test linux-next.

  This cat herding excercise was motivated by the -rt kernel, and was
  brought to you by Thomas "the Whip" Gleixner."

* 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (40 commits)
  idle: Remove GENERIC_IDLE_LOOP config switch
  um: Use generic idle loop
  ia64: Make sure interrupts enabled when we "safe_halt()"
  sparc: Use generic idle loop
  idle: Remove unused ARCH_HAS_DEFAULT_IDLE
  bfin: Fix typo in arch_cpu_idle()
  xtensa: Use generic idle loop
  x86: Use generic idle loop
  unicore: Use generic idle loop
  tile: Use generic idle loop
  tile: Enter idle with preemption disabled
  sh: Use generic idle loop
  score: Use generic idle loop
  s390: Use generic idle loop
  powerpc: Use generic idle loop
  parisc: Use generic idle loop
  openrisc: Use generic idle loop
  mn10300: Use generic idle loop
  mips: Use generic idle loop
  microblaze: Use generic idle loop
  ...
2013-04-30 07:50:17 -07:00
Linus Torvalds
e0972916e8 Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf updates from Ingo Molnar:
 "Features:

   - Add "uretprobes" - an optimization to uprobes, like kretprobes are
     an optimization to kprobes.  "perf probe -x file sym%return" now
     works like kretprobes.  By Oleg Nesterov.

   - Introduce per core aggregation in 'perf stat', from Stephane
     Eranian.

   - Add memory profiling via PEBS, from Stephane Eranian.

   - Event group view for 'annotate' in --stdio, --tui and --gtk, from
     Namhyung Kim.

   - Add support for AMD NB and L2I "uncore" counters, by Jacob Shin.

   - Add Ivy Bridge-EP uncore support, by Zheng Yan

   - IBM zEnterprise EC12 oprofile support patchlet from Robert Richter.

   - Add perf test entries for checking breakpoint overflow signal
     handler issues, from Jiri Olsa.

   - Add perf test entry for for checking number of EXIT events, from
     Namhyung Kim.

   - Add perf test entries for checking --cpu in record and stat, from
     Jiri Olsa.

   - Introduce perf stat --repeat forever, from Frederik Deweerdt.

   - Add --no-demangle to report/top, from Namhyung Kim.

   - PowerPC fixes plus a couple of cleanups/optimizations in uprobes
     and trace_uprobes, by Oleg Nesterov.

  Various fixes and refactorings:

   - Fix dependency of the python binding wrt libtraceevent, from
     Naohiro Aota.

   - Simplify some perf_evlist methods and to allow 'stat' to share code
     with 'record' and 'trace', by Arnaldo Carvalho de Melo.

   - Remove dead code in related to libtraceevent integration, from
     Namhyung Kim.

   - Revert "perf sched: Handle PERF_RECORD_EXIT events" to get 'perf
     sched lat' back working, by Arnaldo Carvalho de Melo

   - We don't use Newt anymore, just plain libslang, by Arnaldo Carvalho
     de Melo.

   - Kill a bunch of die() calls, from Namhyung Kim.

   - Fix build on non-glibc systems due to libio.h absence, from Cody P
     Schafer.

   - Remove some perf_session and tracing dead code, from David Ahern.

   - Honor parallel jobs, fix from Borislav Petkov

   - Introduce tools/lib/lk library, initially just removing duplication
     among tools/perf and tools/vm.  from Borislav Petkov

  ... and many more I missed to list, see the shortlog and git log for
  more details."

* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (136 commits)
  perf/x86/intel/P4: Robistify P4 PMU types
  perf/x86/amd: Fix AMD NB and L2I "uncore" support
  perf/x86/amd: Remove old-style NB counter support from perf_event_amd.c
  perf/x86: Check all MSRs before passing hw check
  perf/x86/amd: Add support for AMD NB and L2I "uncore" counters
  perf/x86/intel: Add Ivy Bridge-EP uncore support
  perf/x86/intel: Fix SNB-EP CBO and PCU uncore PMU filter management
  perf/x86: Avoid kfree() in CPU_{STARTING,DYING}
  uprobes/perf: Avoid perf_trace_buf_prepare/submit if ->perf_events is empty
  uprobes/tracing: Don't pass addr=ip to perf_trace_buf_submit()
  uprobes/tracing: Change create_trace_uprobe() to support uretprobes
  uprobes/tracing: Make seq_printf() code uretprobe-friendly
  uprobes/tracing: Make register_uprobe_event() paths uretprobe-friendly
  uprobes/tracing: Make uprobe_{trace,perf}_print() uretprobe-friendly
  uprobes/tracing: Introduce is_ret_probe() and uretprobe_dispatcher()
  uprobes/tracing: Introduce uprobe_{trace,perf}_print() helpers
  uprobes/tracing: Generalize struct uprobe_trace_entry_head
  uprobes/tracing: Kill the pointless local_save_flags/preempt_count calls
  uprobes/tracing: Kill the pointless seq_print_ip_sym() call
  uprobes/tracing: Kill the pointless task_pt_regs() calls
  ...
2013-04-30 07:41:01 -07:00
Aneesh Kumar K.V
d8139ebf85 powerpc: print both base and actual page size on hash failure
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-04-30 16:00:22 +10:00
Aneesh Kumar K.V
b1022fbd29 powerpc: Decode the pte-lp-encoding bits correctly.
We look at both the segment base page size and actual page size and store
the pte-lp-encodings in an array per base page size.

We also update all relevant functions to take actual page size argument
so that we can use the correct PTE LP encoding in HPTE. This should also
get the basic Multiple Page Size per Segment (MPSS) support. This is needed
to enable THP on ppc64.

[Fixed PR KVM build --BenH]

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-04-30 16:00:14 +10:00
Aneesh Kumar K.V
5c1f6ee9a3 powerpc: Reduce PTE table memory wastage
We allocate one page for the last level of linux page table. With THP and
large page size of 16MB, that would mean we are wasting large part
of that page. To map 16MB area, we only need a PTE space of 2K with 64K
page size. This patch reduce the space wastage by sharing the page
allocated for the last level of linux page table with multiple pmd
entries. We call these smaller chunks PTE page fragments and allocated
page, PTE page.

In order to support systems which doesn't have 64K HPTE support, we also
add another 2K to PTE page fragment. The second half of the PTE fragments
is used for storing slot and secondary bit information of an HPTE. With this
we now have a 4K PTE fragment.

We use a simple approach to share the PTE page. On allocation, we bump the
PTE page refcount to 16 and share the PTE page with the next 16 pte alloc
request. This should help in the node locality of the PTE page fragment,
assuming that the immediate pte alloc request will mostly come from the
same NUMA node. We don't try to reuse the freed PTE page fragment. Hence
we could be waisting some space.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-04-30 16:00:07 +10:00
Aneesh Kumar K.V
d614bb0412 powerpc: Move the pte free routines from common header
Acked-by: Paul Mackerras <paulus@samba.org>

This patch moves the common code to 32/64 bit headers and also duplicate
4K_PAGES and 64K_PAGES section. We will later change the 64 bit 64K_PAGES
version to support smaller PTE fragments. The patch doesn't introduce
any functional changes.

Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-04-30 16:00:04 +10:00
Aneesh Kumar K.V
419df06eea powerpc: Reduce the PTE_INDEX_SIZE
This make one PMD cover 16MB range. That helps in easier implementation of THP
on power. THP core code make use of one pmd entry to track the hugepage and
the range mapped by a single pmd entry should be equal to the hugepage size
supported by the hardware.

This also switch PGD to cover 16GB. That is needed so that we can simplify the
hugetlb page walking code so that we have same pte format for explicit hugepage
and THP hugepage.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-04-30 16:00:00 +10:00
Aneesh Kumar K.V
e2b3d202d1 powerpc: Switch 16GB and 16MB explicit hugepages to a different page table format
We will be switching PMD_SHIFT to 24 bits to facilitate THP impmenetation.
With PMD_SHIFT set to 24, we now have 16MB huge pages allocated at PGD level.
That means with 32 bit process we cannot allocate normal pages at
all, because we cover the entire address space with one pgd entry. Fix this
by switching to a new page table format for hugepages. With the new page table
format for 16GB and 16MB hugepages we won't allocate hugepage directory. Instead
we encode the PTE information directly at the directory level. This forces 16MB
hugepage at PMD level. This will also make the page take walk much simpler later
when we add the THP support.

With the new table format we have 4 cases for pgds and pmds:
(1) invalid (all zeroes)
(2) pointer to next table, as normal; bottom 6 bits == 0
(3) leaf pte for huge page, bottom two bits != 00
(4) hugepd pointer, bottom two bits == 00, next 4 bits indicate size of table

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-04-30 15:59:56 +10:00
Aneesh Kumar K.V
cf9427b85e powerpc: New hugepage directory format
Change the hugepage directory format so that we can have leaf ptes directly
at page directory avoiding the allocation of hugepage directory.

With the new table format we have 3 cases for pgds and pmds:
(1) invalid (all zeroes)
(2) pointer to next table, as normal; bottom 6 bits == 0
(4) hugepd pointer, bottom two bits == 00, next 4 bits indicate size of table

Instead of storing shift value in hugepd pointer we use mmu_psize_def index
so that we can fit all the supported hugepage size in 4 bits

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-04-30 15:59:53 +10:00
Aneesh Kumar K.V
0e5f35d0e4 powerpc: Don't truncate pgd_index wrongly
With PGD_INDEX_SIZE set to 12 the existing macro doesn't work. Fix it to
use PTRS_PER_PGD

The idea originally was to have one more bit in the result of
pgd_index() than PGD_INDEX_SIZE, so that if one had an address
corresponding to the last PGD entry, and then incremented that address
by PGD_SIZE, and took pgd_index() of that, you wouldn't end up with
zero.  The commit that introduced that dates back to 2002, and the
code that was sensitive to that edge case has long since been
refactored (several times), so there is no need for it these days.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-04-30 15:59:49 +10:00
Aneesh Kumar K.V
cc3665a60a powerpc: Don't hard code the size of pte page
USE PTRS_PER_PTE to indicate the size of pte page. To support THP,
later patches will be changing PTRS_PER_PTE value.

Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-04-30 15:59:46 +10:00
Benjamin Herrenschmidt
bc23100a0d Merge remote-tracking branch 'kumar/next' into next
From Kumar Gala:
<<
Add support for T4 and B4 SoC families from Freescale, e6500 altivec
support, some various board fixes and other minor cleanups.
>>
2013-04-30 11:10:09 +10:00
Michel Lespinasse
34d07177b8 mm: remove free_area_cache use in powerpc architecture
As all other architectures have been converted to use vm_unmapped_area(),
we are about to retire the free_area_cache.

This change simply removes the use of that cache in
slice_get_unmapped_area(), which will most certainly have a
performance cost. Next one will convert that function to use the
vm_unmapped_area() infrastructure and regain the performance.

Signed-off-by: Michel Lespinasse <walken@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-04-30 11:05:10 +10:00
Gerald Schaefer
106c992a5e mm/hugetlb: add more arch-defined huge_pte functions
Commit abf09bed3c ("s390/mm: implement software dirty bits")
introduced another difference in the pte layout vs.  the pmd layout on
s390, thoroughly breaking the s390 support for hugetlbfs.  This requires
replacing some more pte_xxx functions in mm/hugetlbfs.c with a
huge_pte_xxx version.

This patch introduces those huge_pte_xxx functions and their generic
implementation in asm-generic/hugetlb.h, which will now be included on
all architectures supporting hugetlbfs apart from s390.  This change
will be a no-op for those architectures.

[akpm@linux-foundation.org: fix warning]
Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Hillf Danton <dhillf@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.cz>	[for !s390 parts]
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 15:54:33 -07:00
Paul Mackerras
8b78645c93 KVM: PPC: Book3S: Facilities to save/restore XICS presentation ctrler state
This adds the ability for userspace to save and restore the state
of the XICS interrupt presentation controllers (ICPs) via the
KVM_GET/SET_ONE_REG interface.  Since there is one ICP per vcpu, we
simply define a new 64-bit register in the ONE_REG space for the ICP
state.  The state includes the CPU priority setting, the pending IPI
priority, and the priority and source number of any pending external
interrupt.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-04-26 20:27:34 +02:00
Paul Mackerras
d19bd86204 KVM: PPC: Book3S: Add support for ibm,int-on/off RTAS calls
This adds support for the ibm,int-on and ibm,int-off RTAS calls to the
in-kernel XICS emulation and corrects the handling of the saved
priority by the ibm,set-xive RTAS call.  With this, ibm,int-off sets
the specified interrupt's priority in its saved_priority field and
sets the priority to 0xff (the least favoured value).  ibm,int-on
restores the saved_priority to the priority field, and ibm,set-xive
sets both the priority and the saved_priority to the specified
priority value.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-04-26 20:27:33 +02:00
Paul Mackerras
4619ac88b7 KVM: PPC: Book3S HV: Improve real-mode handling of external interrupts
This streamlines our handling of external interrupts that come in
while we're in the guest.  First, when waking up a hardware thread
that was napping, we split off the "napping due to H_CEDE" case
earlier, and use the code that handles an external interrupt (0x500)
in the guest to handle that too.  Secondly, the code that handles
those external interrupts now checks if any other thread is exiting
to the host before bouncing an external interrupt to the guest, and
also checks that there is actually an external interrupt pending for
the guest before setting the LPCR MER bit (mediated external request).

This also makes sure that we clear the "ceded" flag when we handle a
wakeup from cede in real mode, and fixes a potential infinite loop
in kvmppc_run_vcpu() which can occur if we ever end up with the ceded
flag set but MSR[EE] off.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-04-26 20:27:32 +02:00
Benjamin Herrenschmidt
54695c3088 KVM: PPC: Book3S HV: Speed up wakeups of CPUs on HV KVM
Currently, we wake up a CPU by sending a host IPI with
smp_send_reschedule() to thread 0 of that core, which will take all
threads out of the guest, and cause them to re-evaluate their
interrupt status on the way back in.

This adds a mechanism to differentiate real host IPIs from IPIs sent
by KVM for guest threads to poke each other, in order to target the
guest threads precisely when possible and avoid that global switch of
the core to host state.

We then use this new facility in the in-kernel XICS code.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-04-26 20:27:31 +02:00
Benjamin Herrenschmidt
bc5ad3f370 KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller
This adds in-kernel emulation of the XICS (eXternal Interrupt
Controller Specification) interrupt controller specified by PAPR, for
both HV and PR KVM guests.

The XICS emulation supports up to 1048560 interrupt sources.
Interrupt source numbers below 16 are reserved; 0 is used to mean no
interrupt and 2 is used for IPIs.  Internally these are represented in
blocks of 1024, called ICS (interrupt controller source) entities, but
that is not visible to userspace.

Each vcpu gets one ICP (interrupt controller presentation) entity,
used to store the per-vcpu state such as vcpu priority, pending
interrupt state, IPI request, etc.

This does not include any API or any way to connect vcpus to their
ICP state; that will be added in later patches.

This is based on an initial implementation by Michael Ellerman
<michael@ellerman.id.au> reworked by Benjamin Herrenschmidt and
Paul Mackerras.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
[agraf: fix typo, add dependency on !KVM_MPIC]
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-04-26 20:27:30 +02:00
Michael Ellerman
8e591cb720 KVM: PPC: Book3S: Add infrastructure to implement kernel-side RTAS calls
For pseries machine emulation, in order to move the interrupt
controller code to the kernel, we need to intercept some RTAS
calls in the kernel itself.  This adds an infrastructure to allow
in-kernel handlers to be registered for RTAS services by name.
A new ioctl, KVM_PPC_RTAS_DEFINE_TOKEN, then allows userspace to
associate token values with those service names.  Then, when the
guest requests an RTAS service with one of those token values, it
will be handled by the relevant in-kernel handler rather than being
passed up to userspace as at present.

Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
[agraf: fix warning]
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-04-26 20:27:29 +02:00
Alexander Graf
5efdb4be59 KVM: PPC: MPIC: Add support for KVM_IRQ_LINE
Now that all pieces are in place for reusing generic irq infrastructure,
we can copy x86's implementation of KVM_IRQ_LINE irq injection and simply
reuse it for PPC, as it will work there just as well.

Signed-off-by: Alexander Graf <agraf@suse.de>
2013-04-26 20:27:25 +02:00
Alexander Graf
de9ba2f363 KVM: PPC: Support irq routing and irqfd for in-kernel MPIC
Now that all the irq routing and irqfd pieces are generic, we can expose
real irqchip support to all of KVM's internal helpers.

This allows us to use irqfd with the in-kernel MPIC.

Signed-off-by: Alexander Graf <agraf@suse.de>
2013-04-26 20:27:25 +02:00
Scott Wood
eb1e4f43e0 kvm/ppc/mpic: add KVM_CAP_IRQ_MPIC
Enabling this capability connects the vcpu to the designated in-kernel
MPIC.  Using explicit connections between vcpus and irqchips allows
for flexibility, but the main benefit at the moment is that it
simplifies the code -- KVM doesn't need vm-global state to remember
which MPIC object is associated with this vm, and it doesn't need to
care about ordering between irqchip creation and vcpu creation.

Signed-off-by: Scott Wood <scottwood@freescale.com>
[agraf: add stub functions for kvmppc_mpic_{dis,}connect_vcpu]
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-04-26 20:27:24 +02:00
Scott Wood
5df554ad5b kvm/ppc/mpic: in-kernel MPIC emulation
Hook the MPIC code up to the KVM interfaces, add locking, etc.

Signed-off-by: Scott Wood <scottwood@freescale.com>
[agraf: add stub function for kvmppc_mpic_set_epr, non-booke, 64bit]
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-04-26 20:27:23 +02:00
Paul Mackerras
c35635efdc KVM: PPC: Book3S HV: Report VPA and DTL modifications in dirty map
At present, the KVM_GET_DIRTY_LOG ioctl doesn't report modifications
done by the host to the virtual processor areas (VPAs) and dispatch
trace logs (DTLs) registered by the guest.  This is because those
modifications are done either in real mode or in the host kernel
context, and in neither case does the access go through the guest's
HPT, and thus no change (C) bit gets set in the guest's HPT.

However, the changes done by the host do need to be tracked so that
the modified pages get transferred when doing live migration.  In
order to track these modifications, this adds a dirty flag to the
struct representing the VPA/DTL areas, and arranges to set the flag
when the VPA/DTL gets modified by the host.  Then, when we are
collecting the dirty log, we also check the dirty flags for the
VPA and DTL for each vcpu and set the relevant bit in the dirty log
if necessary.  Doing this also means we now need to keep track of
the guest physical address of the VPA/DTL areas.

So as not to lose track of modifications to a VPA/DTL area when it gets
unregistered, or when a new area gets registered in its place, we need
to transfer the dirty state to the rmap chain.  This adds code to
kvmppc_unpin_guest_page() to do that if the area was dirty.  To simplify
that code, we now require that all VPA, DTL and SLB shadow buffer areas
fit within a single host page.  Guests already comply with this
requirement because pHyp requires that these areas not cross a 4k
boundary.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-04-26 20:27:13 +02:00
Paul Mackerras
a1b4a0f606 KVM: PPC: Book3S HV: Make HPT reading code notice R/C bit changes
At present, the code that determines whether a HPT entry has changed,
and thus needs to be sent to userspace when it is copying the HPT,
doesn't consider a hardware update to the reference and change bits
(R and C) in the HPT entries to constitute a change that needs to
be sent to userspace.  This adds code to check for changes in R and C
when we are scanning the HPT to find changed entries, and adds code
to set the changed flag for the HPTE when we update the R and C bits
in the guest view of the HPTE.

Since we now need to set the HPTE changed flag in book3s_64_mmu_hv.c
as well as book3s_hv_rm_mmu.c, we move the note_hpte_modification()
function into kvm_book3s_64.h.

Current Linux guest kernels don't use the hardware updates of R and C
in the HPT, so this change won't affect them.  Linux (or other) kernels
might in future want to use the R and C bits and have them correctly
transferred across when a guest is migrated, so it is better to correct
this deficiency.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-04-26 20:27:12 +02:00
Mihai Caraman
9a6061d7fd KVM: PPC: e500: Add support for EPTCFG register
EPTCFG register defined by E.PT is accessed unconditionally by Linux guests
in the presence of MAV 2.0. Emulate it now.

Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-04-26 20:27:08 +02:00
Mihai Caraman
307d9008ed KVM: PPC: e500: Add support for TLBnPS registers
Add support for TLBnPS registers available in MMU Architecture Version
(MAV) 2.0.

Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-04-26 20:27:07 +02:00
Mihai Caraman
a85d2aa23e KVM: PPC: e500: Expose MMU registers via ONE_REG
MMU registers were exposed to user-space using sregs interface. Add them
to ONE_REG interface using kvmppc_get_one_reg/kvmppc_set_one_reg delegation
mechanism.

Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-04-26 20:27:06 +02:00
Bharat Bhushan
c402a3f457 Rename EMULATE_DO_PAPR to EMULATE_EXIT_USER
Instruction emulation return EMULATE_DO_PAPR when it requires
exit to userspace on book3s. Similar return is required
for booke. EMULATE_DO_PAPR reads out to be confusing so it is
renamed to EMULATE_EXIT_USER.

Signed-off-by: Bharat Bhushan <bharat.bhushan@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-04-26 20:27:03 +02:00
Bharat Bhushan
092d62ee93 KVM: PPC: debug stub interface parameter defined
This patch defines the interface parameter for KVM_SET_GUEST_DEBUG
ioctl support. Follow up patches will use this for setting up
hardware breakpoints, watchpoints and software breakpoints.

Also kvm_arch_vcpu_ioctl_set_guest_debug() is brought one level below.
This is because I am not sure what is required for book3s. So this ioctl
behaviour will not change for book3s.

Signed-off-by: Bharat Bhushan <bharat.bhushan@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-04-26 20:27:02 +02:00
Bharat Bhushan
adccf65ca4 KVM: PPC: cache flush for kernel managed pages
Kernel can only access pages which maps as memory.
So flush only the valid kernel pages.

Signed-off-by: Bharat Bhushan <bharat.bhushan@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-04-26 20:27:01 +02:00
Anshuman Khandual
3925f46bb5 powerpc/perf: Enable branch stack sampling framework
Provides basic enablement for perf branch stack sampling framework on
POWER8 processor based platforms. Adds new BHRB related elements into
cpu_hw_event structure to represent current BHRB config, BHRB filter
configuration, manage context and to hold output BHRB buffer during
PMU interrupt before passing to the user space. This also enables
processing of BHRB data and converts them into generic perf branch
stack data format.

Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-04-26 16:13:02 +10:00
Anshuman Khandual
5afc9b52a7 powerpc/perf: Add new BHRB related generic functions, data and flags
This patch adds couple of generic functions to power_pmu structure
which would configure the BHRB and it's filters. It also adds
representation of the number of BHRB entries present on the PMU.
A new PMU flag PPMU_BHRB would indicate presence of BHRB feature.

Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-04-26 16:11:12 +10:00
Anshuman Khandual
95213959ae powerpc/perf: Add new BHRB related instructions for POWER8
This patch adds new POWER8 instruction encoding for reading
and clearing Branch History Rolling Buffer entries. The new
instruction 'mfbhrbe' (move from branch history rolling buffer
entry) is used to read BHRB buffer entries and instruction
'clrbhrb' (clear branch history rolling buffer) is used to
clear the entire buffer. The instruction 'clrbhrb' has straight
forward encoding. But the instruction encoding format for
reading the BHRB entries is like 'mfbhrbe RT, BHRBE' where it
takes two arguments, i.e the index for the BHRB buffer entry to
read and a general purpose register to put the value which was
read from the buffer entry.

Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-04-26 16:11:11 +10:00
Michael Ellerman
8f61aa325f powerpc/perf: Add support for SIER
On power8 we have a new SIER (Sampled Instruction Event Register), which
captures information about instructions when we have random sampling
enabled.

Add support for loading the SIER into pt_regs, overloading regs->dar.
Also set the new NO_SIPR flag in regs->result if we don't have SIPR.

Update regs_sihv/sipr() to look for SIPR/SIHV in SIER.

Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-04-26 16:11:10 +10:00
Michael Ellerman
7a7868326d powerpc/perf: Add an explict flag indicating presence of SLOT field
In perf_ip_adjust() we potentially use the MMCRA[SLOT] field to adjust
the reported IP of a sampled instruction.

Currently the logic is written so that if the backend does NOT have
the PPMU_ALT_SIPR flag set then we assume MMCRA[SLOT] exists.

However on power8 we do not want to set ALT_SIPR (it's in a third
location), and we also do not have MMCRA[SLOT].

So add a new flag which only indicates whether MMCRA[SLOT] exists.

Naively we'd set it on everything except power6/7, because they set
ALT_SIPR, and we've reversed the polarity of the flag. But it's more
complicated than that.

mpc7450 is 32-bit, and uses its own version of perf_ip_adjust()
which doesn't use MMCRA[SLOT], so it doesn't need the new flag set and
the behaviour is unchanged.

PPC970 (and I assume power4) don't have MMCRA[SLOT], so shouldn't have
the new flag set. This is a behaviour change on those cpus, though we
were probably getting lucky and the bits in question were 0.

power5 and power5+ set the new flag, behaviour unchanged.

power6 & power7 do not set the new flag, behaviour unchanged.

Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-04-26 16:11:07 +10:00
Michael Ellerman
240686c136 powerpc: Initialise PMU related regs on Power8
For both HV and guest kernels, intialise PMU regs to something sane.

Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-04-26 16:11:06 +10:00
Gavin Shan
137436c9a6 powerpc/powernv: Patch MSI EOI handler on P8
The EOI handler of MSI/MSI-X interrupts for P8 (PHB3) need additional
steps to handle the P/Q bits in IVE before EOIing the corresponding
interrupt. The patch changes the EOI handler to cover that. we have
individual IRQ chip in each PHB instance. During the MSI IRQ setup
time, the IRQ chip is copied over from the original one for that IRQ,
and the EOI handler is patched with the one that will handle the P/Q
bits (As Ben suggested).

Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-04-26 16:09:59 +10:00
Paul Mackerras
a485c70989 powerpc: Fix "attempt to move .org backwards" error
Building a 64-bit powerpc kernel with PR KVM enabled currently gives
this error:

  AS      arch/powerpc/kernel/head_64.o
arch/powerpc/kernel/exceptions-64s.S: Assembler messages:
arch/powerpc/kernel/exceptions-64s.S:258: Error: attempt to move .org backwards
make[2]: *** [arch/powerpc/kernel/head_64.o] Error 1

This happens because the MASKABLE_EXCEPTION_PSERIES macro turns into
33 instructions, but we only have space for 32 at the decrementer
interrupt vector (from 0x900 to 0x980).

In the code generated by the MASKABLE_EXCEPTION_PSERIES macro, we
currently have two instances of the HMT_MEDIUM macro, which has the
effect of setting the SMT thread priority to medium.  One is the
first instruction, and is overwritten by a no-op on processors where
we save the PPR (processor priority register), that is, POWER7 or
later.  The other is after we have saved the PPR.

In order to reduce the code at 0x900 by one instruction, we omit the
first HMT_MEDIUM.  On processors without SMT this will have no effect
since HMT_MEDIUM is a no-op there.  On POWER5 and RS64 machines this
will mean that the first few instructions take a little longer in the
case where a decrementer interrupt occurs when the hardware thread is
running at low SMT priority.  On POWER6 and later machines, the
hardware automatically boosts the thread priority when a decrementer
interrupt is taken if the thread priority was below medium, so this
change won't make any difference.

The alternative would be to branch out of line after saving the CFAR.
However, that would incur an extra overhead on all processors, whereas
the approach adopted here only adds overhead on older threaded processors.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-04-26 16:08:27 +10:00
Nathan Fontenot
e04fa61214 powerpc/pseries: Add /proc interface to control topology updates
There are instances in which we do not want topology updates to occur.
In order to allow this a /proc interface (/proc/powerpc/topology_updates)
is introduced so that topology updates can be enabled and disabled.

This patch also adds a prrn_is_enabled() call so that PRRN events are
handled in the kernel only if topology updating is enabled.

Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-04-26 16:08:26 +10:00
Jesse Larrew
5d88aa85c0 powerpc/pseries: Update CPU maps when device tree is updated
Platform events such as partition migration or the new PRRN firmware
feature can cause the NUMA characteristics of a CPU to change, and these
changes will be reflected in the device tree nodes for the affected
CPUs.

This patch registers a handler for Open Firmware device tree updates
and reconfigures the CPU and node maps whenever the associativity
changes. Currently, this is accomplished by marking the affected CPUs in
the cpu_associativity_changes_mask and allowing
arch_update_cpu_topology() to retrieve the new associativity information
using hcall_vphn().

Protecting the NUMA cpu maps from concurrent access during an update
operation will be addressed in a subsequent patch in this series.

Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-04-26 16:08:23 +10:00
Nathan Fontenot
f0ff7eb483 powerpc/pseries: Update firmware_has_feature() to check architecture vector 5 bits
The firmware_has_feature() function makes it easy to check for supported
features of the hypervisor. This patch extends the capability of
firmware_has_feature() to include checking for specified bits
in vector 5 of the architecture vector as reported in the device tree.

As part of this the #defines used for the architecture vector are re-defined
such that each option has the index into vector 5 and the feature bit encoded
into it. This makes checking for architecture bits when initiating data
for firmware_has_feature much easier.

Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-04-26 16:08:22 +10:00
Nathan Fontenot
43c0ea6053 powerpc/pseries: Use ARRAY_SIZE to iterate over firmware_features_table array
When iterating over the entries in firmware_features_table we only need
to go over the actual number of entries in the array instead of declaring
it to be bigger and checking to make sure there is a valid entry in every
slot.

This patch removes the FIRMWARE_MAX_FEATURES #define and replaces the
array looping with the use of ARRAY_SIZE().

Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-04-26 16:08:21 +10:00
Nathan Fontenot
530b5e1475 powerpc/pseries: Move architecture vector definitions to prom.h
As part of handling of PRRN events we need to check vector 5 of the
architecture vector bits reported in the device tree to ensure PRRN event
handling is enabled. To do this firmware_has_feature() is updated (in a
subsequent patch) to make this check vector 5 bits. To avoid having to
re-define bits in the architecture vector the bit definitions are moved
to prom.h.

Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-04-26 16:08:21 +10:00
Jesse Larrew
49c68a8518 powerpc/pseries: Add PRRN RTAS event handler
A PRRN event is signaled via the RTAS event-scan mechanism, which
returns a Hot Plug Event message "fixed part" indicating "Platform
Resource Reassignment". In response to the Hot Plug Event message,
we must call ibm,update-nodes to determine which resources were
reassigned and then ibm,update-properties to obtain the new affinity
information about those resources.

The PRRN event-scan RTAS message contains only the "fixed part" with
the "Type" field set to the value 160 and no Extended Event Log. The
four-byte Extended Event Log Length field is re-purposed (since no
Extended Event Log message is included) to pass the "scope" parameter
that causes the ibm,update-nodes to return the nodes affected by the
specific resource reassignment.

This patch adds a handler for RTAS events. The function
pseries_devicetree_update() (from mobility.c) is used to make the
ibm,update-nodes/ibm,update-properties RTAS calls. Updating the NUMA maps
(handled by a subsequent patch) will require significant processing,
so pseries_devicetree_update() is called from an asynchronous workqueue
to allow event processing to continue.

PRRN RTAS events on pseries systems are rare events that have to be
initiated from the HMC console for the system by an IBM tech. This allows
us to assume that these events are widely spaced. Additionally, all work
on the queue is flushed before handling any new work to ensure we only have
one event in flight being handled at a time.

Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-04-26 16:08:20 +10:00
Nathan Fontenot
762ec15707 powerpc/pseries: Expose pseries devicetree_update()
Newer firmware on Power systems can transparently reassign platform resources
(CPU and Memory) in use. For instance, if a processor or memory unit is
predicted to fail, the platform may transparently move the processing to an
equivalent unused processor or the memory state to an equivalent unused
memory unit. However, reassigning resources across NUMA boundaries may alter
the performance of the partition. When such reassignment is necessary, the
Platform Resource Reassignment Notification (PRRN) option provides a
mechanism to inform the Linux kernel of changes to the NUMA affinity of
its platform resources.

When rtasd receives a PRRN event, it needs to make a series of RTAS
calls (ibm,update-nodes and ibm,update-properties) to retrieve the
updated device tree information. These calls are already handled in the
pseries_devicetree_update() routine used in partition migration.

This patch exposes pseries_devicetree_update() to make it accessible
to other pseries routines, this patch also updates pseries_devicetree_update()
to take a 32-bit scope parameter. The scope value, which was previously hard
coded to 1 for partition migration, is used for the RTAS calls
ibm,update-nodes/properties to update the device tree.

Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-04-26 16:08:19 +10:00