64bit mode bootstrap code does set nmi_watchdog to NMI_NONE
by default and doing the same on 32bit mode is safe too.
Such an action saves us from several #ifdef.
Btw, my previous commit
commit 19ec673ced
Author: Cyrill Gorcunov <gorcunov@gmail.com>
Date: Wed May 28 23:00:47 2008 +0400
x86: nmi - fix incorrect NMI watchdog used by default
did not fix the problem completely, moreover it
introduced additional bug - nmi_watchdog would be
set to either NMI_LOCAL_APIC or NMI_IO_APIC
_regardless_ to boot option if being enabled thru
/proc/sys/kernel/nmi_watchdog. Sorry for that.
Fix it too.
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: mingo@redhat.com
Cc: hpa@zytor.com
Cc: macro@linux-mips.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This patch adds linked list of struct setup_data supported for i386.
Signed-off-by: Huang Ying <ying.huang@intel.com>
Cc: andi@firstfloor.org
Cc: mingo@redhat.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This patch extracts the common part of head32.c and head64.c into head.c.
Signed-off-by: Huang Ying <ying.huang@intel.com>
Cc: andi@firstfloor.org
Cc: mingo@redhat.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This patch reserves the EFI memory map with reserve_early(). Because EFI
memory map is allocated by bootloader, if it is not reserved by
reserved_early(), it may be overwritten through address returned by
find_e820_area().
Signed-off-by: Huang Ying <ying.huang@intel.com>
Cc: andi@firstfloor.org
Cc: mingo@redhat.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This patch makes early reserved highmem pages become reserved
pages. This can be used for highmem pages allocated by bootloader such
as EFI memory map, linked list of setup_data, etc.
Signed-off-by: Huang Ying <ying.huang@intel.com>
Cc: andi@firstfloor.org
Cc: mingo@redhat.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This patch clean up reserve_early() family functions by extracting the
common part of reserve_early(), free_early() and bad_addr() into
find_overlapped_early().
Signed-off-by: Huang Ying <ying.huang@intel.com>
Cc: andi@firstfloor.org
Cc: mingo@redhat.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
If
fix == 0, aper_enabled == 1, gart_fix_e820 == 0
if (!fix && !aper_enabled)
return;
if (gart_fix_e820 && !fix && aper_enabled) {
if (e820_any_mapped(aper_base, aper_base + aper_size,
E820_RAM)) {
/* reserve it, so we can reuse it in second kernel */
printk(KERN_INFO "update e820 for GART\n");
add_memory_region(aper_base, aper_size, E820_RESERVED);
update_e820();
}
return;
}
/* different nodes have different setting, disable them all atfirst*/
we'll fall back here and disable all the settings, even when they were
all consistent.
What about this? (I hope it compiles...)
Signed-off-by: Pavel Machek <pavel@suse.cz>
Cc: Dave Jones <davej@codemonkey.org.uk>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Hi!
void __init early_gart_iommu_check(void)
contains
for (num = 24; num < 32; num++) {
if (!early_is_k8_nb(read_pci_config(0, num, 3, 0x00)))
continue;
loop, with very similar loop duplicated in
void __init gart_iommu_hole_init(void)
. First copy of a loop seems to be buggy, too. It uses 0 as a "nothing
set" value, which may actually bite us in last_aper_enabled case
(because it may be often zero).
(Beware, it is hard to test this patch, because this code has about
2^8 different code paths, depending on hardware and cmdline settings).
Plus, the second loop does not check for consistency of
aper_enabled. Should it?
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
should use right shift
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Jürgen Mell reported an FPU state corruption bug under CONFIG_PREEMPT,
and bisected it to commit v2.6.19-1363-gacc2076, "i386: add sleazy FPU
optimization".
Add tsk_used_math() checks to prevent calling math_state_restore()
which can sleep in the case of !tsk_used_math(). This prevents
making a blocking call in __switch_to().
Apparently "fpu_counter > 5" check is not enough, as in some signal handling
and fork/exec scenarios, fpu_counter > 5 and !tsk_used_math() is possible.
It's a side effect though. This is the failing scenario:
process 'A' in save_i387_ia32() just after clear_used_math()
Got an interrupt and pre-empted out.
At the next context switch to process 'A' again, kernel tries to restore
the math state proactively and sees a fpu_counter > 0 and !tsk_used_math()
This results in init_fpu() during the __switch_to()'s math_state_restore()
And resulting in fpu corruption which will be saved/restored
(save_i387_fxsave and restore_i387_fxsave) during the remaining
part of the signal handling after the context switch.
Bisected-by: Jürgen Mell <j.mell@t-online.de>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Tested-by: Jürgen Mell <j.mell@t-online.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@kernel.org
iommu/gart support misses suspend/resume code, which can do bad stuff,
including memory corruption on resume. Prevent system suspend in case we
would be unable to resume.
Signed-off-by: Pavel Machek <pavel@suse.cz>
Tested-by: Patrick <ragamuffin@datacomm.ch>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Fix this:
WARNING: vmlinux.o(.text+0x114bb): Section mismatch in reference from
the function nopat() to the function .cpuinit.text:pat_disable()
The function nopat() references
the function __cpuinit pat_disable().
This is often because nopat lacks a __cpuinit
annotation or the annotation of pat_disable is wrong.
Reported-by: "Fabio Comolli" <fabio.comolli@gmail.com>
Cc: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Clarify the usage of mtrr_lookup() in PAT code, and to make PAT code
resilient to mtrr lookup problems.
Specifically, pat_x_mtrr_type() is restructured to highlight, under what
conditions we look for mtrr hint. pat_x_mtrr_type() uses a default type
when there are any errors in mtrr lookup (still maintaining the pat
consistency). And, reserve_memtype() highlights its usage ot mtrr_lookup
for request type of '-1' and also defaults in a sane way on any mtrr
lookup failure.
pat.c looks at mtrr type of a range to get a hint on what mapping type
to request when user/API: (1) hasn't specified any type (/dev/mem
mapping) and we do not want to take performance hit by always mapping
UC_MINUS. This will be the case for /dev/mem mappings used to map BIOS
area or ACPI region which are WB'able. In this case, as long as MTRR is
not WB, PAT will request UC_MINUS for such mappings.
(2) user/API requests WB mapping while in reality MTRR may have UC or
WC. In this case, PAT can map as WB (without checking MTRR) and still
effective type will be UC or WC. But, a subsequent request to map same
region as UC or WC may fail, as the region will get trackked as WB in
PAT list. Looking at MTRR hint helps us to track based on effective type
rather than what user requested. Again, here mtrr_lookup is only used as
hint and we fallback to WB mapping (as requested by user) as default.
In both cases, after using the mtrr hint, we still go through the
memtype list to make sure there are no inconsistencies among multiple
users.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Tested-by: Rufus & Azrael <rufus-azrael@numericable.fr>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Changed the call to find_e820_area_size to pass u64 instead of unsigned long.
Signed-off-by: Kevin Winchester <kjwinchester@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
OGAWA Hirofumi and Fede have reported rare pmd_ERROR messages:
mm/memory.c:127: bad pmd ffff810000207xxx(9090909090909090).
Initialization's cleanup_highmap was leaving alignment filler
behind in the pmd for MODULES_VADDR: when vmalloc's guard page
would occupy a new page table, it's not allocated, and then
module unload's vfree hits the bad 9090 pmd entry left over.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Mika Kukkonen noticed that the nesting check in early_iounmap() is not
actually done.
Reported-by: Mika Kukkonen <mikukkon@srv1-m700-lanp.koti>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: torvalds@linux-foundation.org
Cc: arjan@linux.intel.com
Cc: mikukkon@iki.fi
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Fix the math emulation that got broken with the recent lazy allocation of FPU
area. init_fpu() need to be added for the math-emulation path aswell
for the FPU area allocation.
math emulation enabled kernel booted fine with this, in the presence
of "no387 nofxsr" boot param.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: hpa@zytor.com
Cc: mingo@elte.hu
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The RT team has been searching for a nasty latency. This latency shows
up out of the blue and has been seen to be as big as 5ms!
Using ftrace I found the cause of the latency.
pcscd-2995 3dNh1 52360300us : irq_exit (smp_apic_timer_interrupt)
pcscd-2995 3dN.2 52360301us : idle_cpu (irq_exit)
pcscd-2995 3dN.2 52360301us : rcu_irq_exit (irq_exit)
pcscd-2995 3dN.1 52360771us : smp_apic_timer_interrupt (apic_timer_interrupt
)
pcscd-2995 3dN.1 52360771us : exit_idle (smp_apic_timer_interrupt)
Here's an example of a 400 us latency. pcscd took a timer interrupt and
returned with "need resched" enabled, but did not reschedule until after
the next interrupt came in at 52360771us 400us later!
At first I thought we somehow missed a preemption check in entry.S. But
I also noticed that this always seemed to happen during a __delay call.
pcscd-2995 3dN.2 52360836us : rcu_irq_exit (irq_exit)
pcscd-2995 3.N.. 52361265us : preempt_schedule (__delay)
Looking at the x86 delay, I found my problem.
In git commit 35d5d08a08, Andrew Morton
placed preempt_disable around the entire delay due to TSC's not working
nicely on SMP. Unfortunately for those that care about latencies this
is devastating! Especially when we have callers to mdelay(8).
Here I enable preemption during the loop and account for anytime the task
migrates to a new CPU. The delay asked for may be extended a bit by
the migration, but delay only guarantees that it will delay for that minimum
time. Delaying longer should not be an issue.
[
Thanks to Thomas Gleixner for spotting that cpu wasn't updated,
and to place the rep_nop between preempt_enabled/disable.
]
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Cc: akpm@osdl.org
Cc: Clark Williams <clark.williams@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Luis Claudio R. Goncalves" <lclaudio@uudg.org>
Cc: Gregory Haskins <ghaskins@novell.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andi Kleen <andi-suse@firstfloor.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
for http://bugzilla.kernel.org/show_bug.cgi?id=10613
BIOS bug, APIC version is 0 for CPU#0! fixing up to 0x10. (tell your hw vendor)
v2: fix 64 bit compilation
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Gabriel C <nix.or.die@googlemail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
this way 32-bit is more similar to 64-bit, and smarter e820 and numa.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
when 1/3 user/kernel split is used, and less memory is installed, or if
we have a big hole below 4g, max_low_pfn is still using 3g-128m
try to go down from max_low_pfn until we get it. otherwise will panic.
need to make 32-bit code to use register_e820_active_regions ... later.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
It looks good to move bugs_64.c to cpu/bugs_64.c.
Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
on 64-bit we only get valid max_pfn_mapped after init_memory_mapping().
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
on 32-bit in head_32.S after initial page table is done, we get initial
max_pfn_mapped, and then kernel_physical_mapping_init will give us
a final one.
We need to use that to make sure find_e820_area will get valid addresses
for boot_map and for NODE_DATA(0) on numa32.
XEN PV and lguest may need to assign max_pfn_mapped too.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
make mptable to be consistent with acpi routing, so we could:
1. kexec kernel with acpi=off
2. work around BIOSes where acpi routing is working, but mptable is
not right, so can use kernel/kexec to start other OSes that don't have
good acpi support.
command line: update_mptable
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
we don't need to call memory_present that early.
numa and sparse will call memory_present later and might
even fail, it will call memory_present for the full range.
also for sparse it will call alloc_bootmem ... before we set up bootmem.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
... otherwise alloc_remap will not get node_mem_map from kva area, and
alloc_node_mem_map has to alloc_bootmem_node to get mem_map.
It will use two low address copies ...
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
so every element will represent 64M instead of 256M.
AMD opteron could have HW memory hole remapping, so could have
[0, 8g + 64M) on node0. Reduce element size to 64M to keep that on node 0
Later we need to use find_e820_area() to allocate memory_node_map like
on 64-bit. But need to move memory_present out of populate_mem_map...
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
on Summit it's possible to have:
CONFIG_ACPI_SRAT=y
CONFIG_HAVE_ARCH_PARSE_SRAT=y
in which case acpi.h defines the acpi_numa_slit_init() and
acpi_numa_processor_affinity_init() methods as a macro.
Define recently added XEN_ELFNOTEs, and use them appropriately.
Most significantly, this enables domain checkpointing (xm save -c).
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
On resume, the vcpu timer modes will not be restored. The timer
infrastructure doesn't do this for us, since it assumes the cpus
are offline. We can just poke the other vcpus into the right mode
directly though.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
If we're using vcpu_info mapping, then make sure its restored on all
processors before relasing them from stop_machine.
The only complication is that if this fails, we can't continue because
we've already made assumptions that the mapping is available (baked in
calls to the _direct versions of the functions, for example).
Fortunately this can only happen with a 32-bit hypervisor, which may
possibly run out of mapping space. On a 64-bit hypervisor, this is a
non-issue.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When operating on an unpinned pagetable (ie, one under construction or
destruction), it isn't necessary to use a hypercall to update a
pud/pmd entry. Jan Beulich observed that a similar optimisation
avoided many thousands of hypercalls while doing a kernel build.
One tricky part is that early in the kernel boot there's no page
structure, so we can't check to see if the page is pinned. In that
case, we just always use the hypercall.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
iommu/gart support misses suspend/resume code, which can do bad stuff,
including memory corruption on resume. Prevent system suspend in case we
would be unable to resume.
Signed-off-by: Pavel Machek <pavel@suse.cz>
Tested-by: Patrick <ragamuffin@datacomm.ch>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Update the UV address macros to better describe the
fields of UV physical addresses. Improve comments
in the header files. Add additional MMR definitions.
Signed-off-by: Jack Steiner <steiner@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>