linux-pinenote

Author	SHA1	Message	Date
Yinghai Lu	1411e0ec31	x86-64, numa: Put pgtable to local node memory Introduce init_memory_mapping_high(), and use it on 64bit. It goes through every memory segment above 4g and creates the page table for that memory range within the range itself. Before this patch, all page tables were on one node. With this patch, one RED-PEN is killed. Debug output for an 8-socket system after the patch: [ 0.000000] initial memory mapped : 0 - 20000000 [ 0.000000] init_memory_mapping: [0x00000000000000-0x0000007f74ffff] [ 0.000000] 0000000000 - 007f600000 page 2M [ 0.000000] 007f600000 - 007f750000 page 4k [ 0.000000] kernel direct mapping tables up to 7f750000 @ [0x7f74c000-0x7f74ffff] [ 0.000000] RAMDISK: 7bc84000 - 7f745000 .... [ 0.000000] Adding active range (0, 0x10, 0x95) 0 entries of 3200 used [ 0.000000] Adding active range (0, 0x100, 0x7f750) 1 entries of 3200 used [ 0.000000] Adding active range (0, 0x100000, 0x1080000) 2 entries of 3200 used [ 0.000000] Adding active range (1, 0x1080000, 0x2080000) 3 entries of 3200 used [ 0.000000] Adding active range (2, 0x2080000, 0x3080000) 4 entries of 3200 used [ 0.000000] Adding active range (3, 0x3080000, 0x4080000) 5 entries of 3200 used [ 0.000000] Adding active range (4, 0x4080000, 0x5080000) 6 entries of 3200 used [ 0.000000] Adding active range (5, 0x5080000, 0x6080000) 7 entries of 3200 used [ 0.000000] Adding active range (6, 0x6080000, 0x7080000) 8 entries of 3200 used [ 0.000000] Adding active range (7, 0x7080000, 0x8080000) 9 entries of 3200 used [ 0.000000] init_memory_mapping: [0x00000100000000-0x0000107fffffff] [ 0.000000] 0100000000 - 1080000000 page 2M [ 0.000000] kernel direct mapping tables up to 1080000000 @ [0x107ffbd000-0x107fffffff] [ 0.000000] memblock_x86_reserve_range: [0x107ffc2000-0x107fffffff] PGTABLE [ 0.000000] init_memory_mapping: [0x00001080000000-0x0000207fffffff] [ 0.000000] 1080000000 - 2080000000 page 2M [ 0.000000] kernel direct mapping tables up to 2080000000 @ [0x207ff7d000-0x207fffffff] [ 0.000000] 
memblock_x86_reserve_range: [0x207ffc0000-0x207fffffff] PGTABLE [ 0.000000] init_memory_mapping: [0x00002080000000-0x0000307fffffff] [ 0.000000] 2080000000 - 3080000000 page 2M [ 0.000000] kernel direct mapping tables up to 3080000000 @ [0x307ff3d000-0x307fffffff] [ 0.000000] memblock_x86_reserve_range: [0x307ffc0000-0x307fffffff] PGTABLE [ 0.000000] init_memory_mapping: [0x00003080000000-0x0000407fffffff] [ 0.000000] 3080000000 - 4080000000 page 2M [ 0.000000] kernel direct mapping tables up to 4080000000 @ [0x407fefd000-0x407fffffff] [ 0.000000] memblock_x86_reserve_range: [0x407ffc0000-0x407fffffff] PGTABLE [ 0.000000] init_memory_mapping: [0x00004080000000-0x0000507fffffff] [ 0.000000] 4080000000 - 5080000000 page 2M [ 0.000000] kernel direct mapping tables up to 5080000000 @ [0x507febd000-0x507fffffff] [ 0.000000] memblock_x86_reserve_range: [0x507ffc0000-0x507fffffff] PGTABLE [ 0.000000] init_memory_mapping: [0x00005080000000-0x0000607fffffff] [ 0.000000] 5080000000 - 6080000000 page 2M [ 0.000000] kernel direct mapping tables up to 6080000000 @ [0x607fe7d000-0x607fffffff] [ 0.000000] memblock_x86_reserve_range: [0x607ffc0000-0x607fffffff] PGTABLE [ 0.000000] init_memory_mapping: [0x00006080000000-0x0000707fffffff] [ 0.000000] 6080000000 - 7080000000 page 2M [ 0.000000] kernel direct mapping tables up to 7080000000 @ [0x707fe3d000-0x707fffffff] [ 0.000000] memblock_x86_reserve_range: [0x707ffc0000-0x707fffffff] PGTABLE [ 0.000000] init_memory_mapping: [0x00007080000000-0x0000807fffffff] [ 0.000000] 7080000000 - 8080000000 page 2M [ 0.000000] kernel direct mapping tables up to 8080000000 @ [0x807fdfc000-0x807fffffff] [ 0.000000] memblock_x86_reserve_range: [0x807ffbf000-0x807fffffff] PGTABLE [ 0.000000] Initmem setup node 0 [0000000000000000-000000107fffffff] [ 0.000000] NODE_DATA [0x0000107ffbd000-0x0000107ffc1fff] [ 0.000000] Initmem setup node 1 [0000001080000000-000000207fffffff] [ 0.000000] NODE_DATA [0x0000207ffbb000-0x0000207ffbffff] [ 0.000000] Initmem 
setup node 2 [0000002080000000-000000307fffffff] [ 0.000000] NODE_DATA [0x0000307ffbb000-0x0000307ffbffff] [ 0.000000] Initmem setup node 3 [0000003080000000-000000407fffffff] [ 0.000000] NODE_DATA [0x0000407ffbb000-0x0000407ffbffff] [ 0.000000] Initmem setup node 4 [0000004080000000-000000507fffffff] [ 0.000000] NODE_DATA [0x0000507ffbb000-0x0000507ffbffff] [ 0.000000] Initmem setup node 5 [0000005080000000-000000607fffffff] [ 0.000000] NODE_DATA [0x0000607ffbb000-0x0000607ffbffff] [ 0.000000] Initmem setup node 6 [0000006080000000-000000707fffffff] [ 0.000000] NODE_DATA [0x0000707ffbb000-0x0000707ffbffff] [ 0.000000] Initmem setup node 7 [0000007080000000-000000807fffffff] [ 0.000000] NODE_DATA [0x0000807ffba000-0x0000807ffbefff] Signed-off-by: Yinghai Lu <yinghai@kernel.org> LKML-Reference: <4D1933D1.9020609@kernel.org> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2010-12-29 15:48:08 -08:00
Yinghai Lu	dbef7b56d2	x86-64, numa: Allocate memnodemap under max_pfn_mapped We need to access it right away, so make sure that it is already mapped. This prepares for putting the page table on the local node; the nodemap is used before that. Signed-off-by: Yinghai Lu <yinghai@kernel.org> LKML-Reference: <4D1933C8.7060105@kernel.org> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2010-12-29 15:48:08 -08:00
Yinghai Lu	45635ab5e4	x86: Change get_max_mapped() to inline Move it into a header file to prepare for using it in other files. [ hpa: added missing <linux/types.h> and changed type to phys_addr_t. ] Signed-off-by: Yinghai Lu <yinghai@kernel.org> LKML-Reference: <4D1933BA.8000508@kernel.org> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2010-12-29 15:47:55 -08:00
Yinghai Lu	32e3f2b00c	x86-64, gart: Fix allocation with memblock While trying to change alloc_bootmem with memblock to go real top-down, found a failure on one old system: [ 0.000000] Node 0: aperture @ ac000000 size 64 MB [ 0.000000] Aperture pointing to e820 RAM. Ignoring. [ 0.000000] Your BIOS doesn't leave a aperture memory hole [ 0.000000] Please enable the IOMMU option in the BIOS setup [ 0.000000] This costs you 64 MB of RAM [ 0.000000] memblock_x86_reserve_range: [0x2020000000-0x2023ffffff] aperture64 [ 0.000000] Cannot allocate aperture memory hole (ffff882020000000,65536K) [ 0.000000] memblock_x86_free_range: [0x2020000000-0x2023ffffff] [ 0.000000] Kernel panic - not syncing: Not enough memory for aperture [ 0.000000] Pid: 0, comm: swapper Not tainted 2.6.37-rc5-tip-yh-06229-gb792dc2-dirty #331 [ 0.000000] Call Trace: [ 0.000000] [<ffffffff81cf50fe>] ? panic+0x91/0x1a3 [ 0.000000] [<ffffffff827c66b2>] ? gart_iommu_hole_init+0x3d7/0x4a3 [ 0.000000] [<ffffffff81d026a9>] ? _etext+0x0/0x3 [ 0.000000] [<ffffffff827ba940>] ? pci_iommu_alloc+0x47/0x71 [ 0.000000] [<ffffffff827c820b>] ? mem_init+0x19/0xec [ 0.000000] [<ffffffff827b3c40>] ? start_kernel+0x20a/0x3e8 [ 0.000000] [<ffffffff827b32cc>] ? x86_64_start_reservations+0x9c/0xa0 [ 0.000000] [<ffffffff827b33e4>] ? x86_64_start_kernel+0x114/0x11b This means __alloc_bootmem_nopanic() allocates too high for that aperture. Use memblock_find_in_range() with the limit directly. Signed-off-by: Yinghai Lu <yinghai@kernel.org> LKML-Reference: <4D0C0740.90104@kernel.org> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2010-12-29 14:46:54 -08:00
Yinghai Lu	4b239f458c	x86-64, mm: Put early page table high While debugging kdump, found that the current kernel has a problem with crashkernel=512M. It turns out that the initial mapping is to 512M, and the later initial mapping to 4G (actually 2040M on my platform) will put the page table near 512M. Then the initial mapping to 128g will put it near 2g. Before this patch: [ 0.000000] initial memory mapped : 0 - 20000000 [ 0.000000] init_memory_mapping: [0x00000000000000-0x0000007f74ffff] [ 0.000000] 0000000000 - 007f600000 page 2M [ 0.000000] 007f600000 - 007f750000 page 4k [ 0.000000] kernel direct mapping tables up to 7f750000 @ [0x1fffc000-0x1fffffff] [ 0.000000] memblock_x86_reserve_range: [0x1fffc000-0x1fffdfff] PGTABLE [ 0.000000] init_memory_mapping: [0x00000100000000-0x0000207fffffff] [ 0.000000] 0100000000 - 2080000000 page 2M [ 0.000000] kernel direct mapping tables up to 2080000000 @ [0x7bc01000-0x7bc83fff] [ 0.000000] memblock_x86_reserve_range: [0x7bc01000-0x7bc7efff] PGTABLE [ 0.000000] RAMDISK: 7bc84000 - 7f745000 [ 0.000000] crashkernel reservation failed - No suitable area found. 
After patch: [ 0.000000] initial memory mapped : 0 - 20000000 [ 0.000000] init_memory_mapping: [0x00000000000000-0x0000007f74ffff] [ 0.000000] 0000000000 - 007f600000 page 2M [ 0.000000] 007f600000 - 007f750000 page 4k [ 0.000000] kernel direct mapping tables up to 7f750000 @ [0x7f74c000-0x7f74ffff] [ 0.000000] memblock_x86_reserve_range: [0x7f74c000-0x7f74dfff] PGTABLE [ 0.000000] init_memory_mapping: [0x00000100000000-0x0000207fffffff] [ 0.000000] 0100000000 - 2080000000 page 2M [ 0.000000] kernel direct mapping tables up to 2080000000 @ [0x207ff7d000-0x207fffffff] [ 0.000000] memblock_x86_reserve_range: [0x207ff7d000-0x207fffafff] PGTABLE [ 0.000000] RAMDISK: 7bc84000 - 7f745000 [ 0.000000] memblock_x86_reserve_range: [0x17000000-0x36ffffff] CRASH KERNEL [ 0.000000] Reserving 512MB of memory at 368MB for crashkernel (System RAM: 133120MB) This means that with the patch, the page table for [0, 2g) will be near 2g instead of under 512M, and the page table for [4g, 128g) will be near 128g instead of under 2g. That is good: if we have lots of memory above 4g, like 1024g, 2048g or 16T, the related page tables will not be put under 2g, where they could otherwise fill up the area under 2g if 1G or 2M pages are not used. The code change adds map_low_page() and updates unmap_low_page() for 64bit, and uses them to access the corresponding high memory when setting up page tables. Signed-off-by: Yinghai Lu <yinghai@kernel.org> LKML-Reference: <4D0C0734.7060900@kernel.org> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2010-12-29 14:46:54 -08:00
H. Peter Anvin	d50e8fc7e3	Merge branch 'x86/apic-cleanups' into x86/numa	2010-12-29 11:36:26 -08:00
Avi Kivity	649497d1a3	KVM: MMU: Fix incorrect direct gfn for unpaged mode shadow We use the physical address instead of the base gfn for the four PAE page directories we use in unpaged mode. When the guest accesses an address above 1GB that is backed by a large host page, a BUG_ON() in kvm_mmu_set_gfn() triggers. Resolves: https://bugzilla.kernel.org/show_bug.cgi?id=21962 Reported-and-tested-by: Nicolas Prochazka <prochazka.nicolas@gmail.com> KVM-Stable-Tag. Signed-off-by: Avi Kivity <avi@redhat.com>	2010-12-29 12:35:29 +02:00
Cliff Wickman	c8217b8305	x86, paravirt: Use native_halt on a halt, not native_safe_halt halt() should use native_halt() safe_halt() uses native_safe_halt() If CONFIG_PARAVIRT=y, halt() is defined in arch/x86/include/asm/paravirt.h as static inline void halt(void) { PVOP_VCALL0(pv_irq_ops.safe_halt); } Otherwise (no CONFIG_PARAVIRT) halt() in arch/x86/include/asm/irqflags.h is static inline void halt(void) { native_halt(); } So it looks to me like the CONFIG_PARAVIRT case of using native_safe_halt() for a halt() is an oversight. Am I missing something? It probably hasn't shown up as a problem because the local apic is disabled on a shutdown or restart. But if we disable interrupts and call halt() we shouldn't expect that the halt() will re-enable interrupts. Signed-off-by: Cliff Wickman <cpw@sgi.com> LKML-Reference: <E1PSBcz-0001g1-FM@eag09.americas.sgi.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2010-12-27 14:02:11 -08:00
Jesper Juhl	5cdd2de0a7	x86/microcode: Fix double vfree() and remove redundant pointer checks before vfree() In arch/x86/kernel/microcode_intel.c::generic_load_microcode() we have this: while (leftover) { ... if (get_ucode_data(mc, ucode_ptr, mc_size) || microcode_sanity_check(mc) < 0) { vfree(mc); break; } ... } if (mc) vfree(mc); This will cause a double free of 'mc'. This patch fixes that by just removing the vfree() call in the loop since 'mc' will be freed nicely just after we break out of the loop. There's also a second change in the patch. I noticed a lot of checks for pointers being NULL before passing them to vfree(). That's completely redundant since vfree() deals gracefully with being passed a NULL pointer. Removing the redundant checks yields a nice size decrease for the object file. Size before the patch: text data bss dec hex filename 4578 240 1032 5850 16da arch/x86/kernel/microcode_intel.o Size after the patch: text data bss dec hex filename 4489 240 984 5713 1651 arch/x86/kernel/microcode_intel.o Signed-off-by: Jesper Juhl <jj@chaosbits.net> Acked-by: Tigran Aivazian <tigran@aivazian.fsnet.co.uk> Cc: Shaohua Li <shaohua.li@intel.com> LKML-Reference: <alpine.LNX.2.00.1012251946100.10759@swampdragon.chaosbits.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-12-27 14:33:30 +01:00
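The double-free pattern described in the commit above can be illustrated in user space. This is a minimal sketch, not the kernel code: malloc()/free() stand in for vmalloc()/vfree(), and chunk_is_bad() and count_valid_chunks() are hypothetical names standing in for the sanity checks and the loading loop. It shows the fixed shape: no release inside the loop on error, and a single unconditional release after it.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for get_ucode_data()/microcode_sanity_check():
 * a chunk whose first byte is 0xff is treated as invalid. */
static int chunk_is_bad(const unsigned char *chunk)
{
    return chunk[0] == 0xff;
}

/* Walk `total` bytes in pieces of `chunk` bytes and count the valid ones.
 * The working buffer has exactly one release point per allocation. */
int count_valid_chunks(const unsigned char *data, size_t total, size_t chunk)
{
    unsigned char *buf = NULL;
    int valid = 0;

    while (chunk > 0 && total >= chunk) {
        free(buf);              /* free(NULL) is a no-op, like vfree(NULL) */
        buf = malloc(chunk);
        if (!buf)
            break;
        memcpy(buf, data, chunk);
        if (chunk_is_bad(buf))
            break;              /* do NOT free here; the free below handles it */
        valid++;
        data += chunk;
        total -= chunk;
    }

    free(buf);                  /* single release, no NULL check needed */
    return valid;
}
```

Freeing inside the error branch as well as after the loop would release the same buffer twice, which is the bug the commit message describes.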
Linus Torvalds	79534f237f	Merge branches 'perf-fixes-for-linus' and 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: perf probe: Fix to support libdwfl older than 0.148 perf tools: Fix lazy wildcard matching perf buildid-list: Fix error return for success perf buildid-cache: Fix symbolic link handling perf symbols: Stop using vmlinux files with no symbols perf probe: Fix use of kernel image path given by 'k' option * 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86, kexec: Limit the crashkernel address appropriately	2010-12-23 15:39:40 -08:00
David Rientjes	a387e95a49	x86, numa: Fix cpu to node mapping for sparse node ids NUMA boot code assumes that physical node ids start at 0, but the DIMMs that the apic id represents may not be reachable. If this is the case, node 0 is never online and cpus never end up getting appropriately assigned to a node. This causes the cpumask of all online nodes to be empty and machines crash with kernel code assuming online nodes have valid cpus. The fix is to appropriately map all the address ranges for physical nodes and ensure the cpu to node mapping function checks all possible nodes (up to MAX_NUMNODES) instead of simply checking nodes 0-N, where N is the number of physical nodes, for valid address ranges. This requires no longer "compressing" the address ranges of nodes in the physical node map from 0-N, but rather leaving indices in physnodes[] to represent the actual node id of the physical node. Accordingly, the topology exported by both amd_get_nodes() and acpi_get_nodes() no longer must return the number of nodes to iterate through; all such iterations will now be to MAX_NUMNODES. This change also passes the end address of system RAM (which may be different from normal operation if mem= is specified on the command line) before the physnodes[] array is populated. ACPI parsed nodes are truncated to fit within the address range that respects the mem= boundaries and even some physical nodes may become unreachable in such cases. When NUMA emulation does succeed, any apicid to node mappings that exist for unreachable nodes are given default values so that proximity domains can still be assigned. This is important for node_distance() to function as desired. Signed-off-by: David Rientjes <rientjes@google.com> LKML-Reference: <alpine.DEB.2.00.1012221702090.3701@chino.kir.corp.google.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2010-12-23 15:27:16 -08:00
David Rientjes	c1c3443c9c	x86, numa: Fake node-to-cpumask for NUMA emulation It's necessary to fake the node-to-cpumask mapping so that an emulated node ID returns a cpumask that includes all cpus that have affinity to the memory it represents. This is a little intrusive because it requires knowledge of the physical topology of the system. setup_physnodes() gives us that information, but since NUMA emulation ends up altering the physnodes array, it's necessary to reset it before cpus are brought online. Accordingly, the physnodes array is moved out of init.data and into cpuinit.data since it will be needed on cpuup callbacks. This works regardless of whether numa=fake is used on the command line, or the setup of the fake node succeeds or fails. The physnodes array always contains the physical topology of the machine if CONFIG_NUMA_EMU is enabled and can be used to setup the correct node-to-cpumask mappings in all cases since setup_physnodes() is called whenever the array needs to be repopulated with the correct data. To fake the actual mappings, numa_add_cpu() and numa_remove_cpu() are rewritten for CONFIG_NUMA_EMU so that we first find the physical node to which each cpu has local affinity, then iterate through all online nodes to find the emulated nodes that have local affinity to that physical node, and then finally map the cpu to each of those emulated nodes. Signed-off-by: David Rientjes <rientjes@google.com> LKML-Reference: <alpine.DEB.2.00.1012221701520.3701@chino.kir.corp.google.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2010-12-23 15:27:15 -08:00
David Rientjes	f51bf3073a	x86, numa: Fake apicid and pxm mappings for NUMA emulation This patch adds the equivalent of acpi_fake_nodes() for AMD Northbridge platforms. The goal is to fake the apicid-to-node mappings for NUMA emulation so the physical topology of the machine is correctly maintained within the kernel. This change also fakes proximity domains for both ACPI and k8 code so the physical distance between emulated nodes is maintained via node_distance(). This exports the correct distances via /sys/devices/system/node/.../distance based on the underlying topology. A new helper function, fake_physnodes(), is introduced to correctly invoke the correct NUMA code to fake these two mappings based on the system type. If there is no underlying NUMA configuration, all cpus are mapped to node 0 for local distance. Since acpi_fake_nodes() is no longer called with CONFIG_ACPI_NUMA, its prototype can be removed from the header file for such a configuration. Signed-off-by: David Rientjes <rientjes@google.com> LKML-Reference: <alpine.DEB.2.00.1012221701360.3701@chino.kir.corp.google.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2010-12-23 15:27:14 -08:00
David Rientjes	4e76f4e67a	x86, numa: Avoid compiling NUMA emulation functions without CONFIG_NUMA_EMU Both acpi_get_nodes() and amd_get_nodes() are only necessary when CONFIG_NUMA_EMU is enabled, so avoid compiling them when the option is disabled. Signed-off-by: David Rientjes <rientjes@google.com> LKML-Reference: <alpine.DEB.2.00.1012221701210.3701@chino.kir.corp.google.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2010-12-23 15:27:12 -08:00
David Rientjes	34dc9e7496	x86, numa: Reduce minimum fake node size to 32M This patch changes the minimum fake node size from 64MB to 32MB so it is possible to test NUMA code at a greater scale on smaller machines (64 nodes on a 2G machine, 1024 nodes on 32G machine with CONFIG_NODES_SHIFT=10). Signed-off-by: David Rientjes <rientjes@google.com> LKML-Reference: <alpine.DEB.2.00.1012221700590.3701@chino.kir.corp.google.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2010-12-23 15:27:10 -08:00
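The node counts quoted in the commit above follow from simple division. The sketch below is only an illustration of that arithmetic, assuming the upper bound is the smaller of "memory divided by the 32MB minimum" and "1 << CONFIG_NODES_SHIFT"; max_fake_nodes() and FAKE_NODE_MIN_SIZE_MB are hypothetical names, and the real emulation code also has to account for memory holes and alignment.

```c
/* Minimum fake node size after the patch, in MB (from the commit message). */
#define FAKE_NODE_MIN_SIZE_MB 32ul

/* Hypothetical helper: upper bound on the number of fake nodes for a
 * machine with mem_mb megabytes of RAM and the given nodes shift. */
unsigned long max_fake_nodes(unsigned long mem_mb, unsigned int nodes_shift)
{
    unsigned long by_size = mem_mb / FAKE_NODE_MIN_SIZE_MB;
    unsigned long by_shift = 1ul << nodes_shift;

    return by_size < by_shift ? by_size : by_shift;
}
```

For example, max_fake_nodes(2048, 10) gives 64 and max_fake_nodes(32768, 10) gives 1024, matching the 2G and 32G figures in the message.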
Yinghai Lu	d3bd058826	x86, acpi: Parse all SRAT cpu entries even above the cpu number limitation Recent Intel systems use a different order in the MADT: they list all thread0 entries first, then all thread1 entries. But the SRAT table still uses the old order and lists the cpus of one socket all together. If the user has compiled with a limited NR_CPUS or boots with nr_cpus=, some cpus' apic id to node mappings could be missing from apicid_to_node[]. For example, a 4-socket system with 64 cpus booted with nr_cpus=32 will crash... [ 9.106288] Total of 32 processors activated (136190.88 BogoMIPS). [ 9.235021] divide error: 0000 [#1] SMP [ 9.235315] last sysfs file: [ 9.235481] CPU 1 [ 9.235592] Modules linked in: [ 9.245398] [ 9.245478] Pid: 2, comm: kthreadd Not tainted 2.6.37-rc1-tip-yh-01782-ge92ef79-dirty #274 /Sun Fire x4800 [ 9.265415] RIP: 0010:[<ffffffff81075a8f>] [<ffffffff81075a8f>] select_task_rq_fair+0x4f0/0x623 ... [ 9.645938] RIP [<ffffffff81075a8f>] select_task_rq_fair+0x4f0/0x623 [ 9.665356] RSP <ffff88103f8d1c40> [ 9.665568] ---[ end trace 2296156d35fdfc87 ]--- So let's just parse all cpu entries in the SRAT. Also add apicid checking against MAX_LOCAL_APIC, in case we go out of the boundaries of apicid_to_node[]. It fixes the following bug too: https://bugzilla.kernel.org/show_bug.cgi?id=22662 -v2: expand to 32bit according to hpa; need to add MAX_LOCAL_APIC for 32bit Reported-and-Tested-by: Wu Fengguang <fengguang.wu@intel.com> Reported-by: Bjorn Helgaas <bjorn.helgaas@hp.com> Tested-by: Myron Stowe <myron.stowe@hp.com> Signed-off-by: Yinghai Lu <yinghai@kernel.org> LKML-Reference: <4D0AD486.9020704@kernel.org> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2010-12-23 13:16:18 -08:00
Yinghai Lu	56d91f132c	x86, acpi: Add MAX_LOCAL_APIC for 32bit We should use MAX_LOCAL_APIC for max apic ids and MAX_APICS as number of local apics. Also apic_version[] array should use MAX_LOCAL_APICs. Signed-off-by: Yinghai Lu <yinghai@kernel.org> LKML-Reference: <4D0AD464.2020408@kernel.org> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2010-12-23 13:15:53 -08:00
Seth Heasley	9b444b36fe	x86/PCI: irq and pci_ids patch for Intel Patsburg This patch adds an additional LPC Controller DeviceID for the Intel Patsburg PCH. Signed-off-by: Seth Heasley <seth.heasley@intel.com> Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>	2010-12-23 12:53:10 -08:00
Ingo Molnar	26e20a108c	Merge commit 'v2.6.37-rc7' into x86/security	2010-12-23 09:48:41 +01:00
Don Zickus	4a7863cc2e	x86, nmi_watchdog: Remove ARCH_HAS_NMI_WATCHDOG and rely on CONFIG_HARDLOCKUP_DETECTOR The x86 arch has shifted its use of the nmi_watchdog from a local implementation to the global one provided by kernel/watchdog.c. This shift has caused a whole bunch of compile problems under different config options. I attempt to simplify things with the patch below. In order to simplify things, I had to come to terms with the meaning of two terms ARCH_HAS_NMI_WATCHDOG and CONFIG_HARDLOCKUP_DETECTOR. Basically they mean the same thing, the former on a local level and the latter on a global level. With the old x86 nmi watchdog gone, there is no need to rely on defining the ARCH_HAS_NMI_WATCHDOG variable because it doesn't make sense any more. x86 will now use the global implementation. The changes below do a few things. First it changes the few places that relied on ARCH_HAS_NMI_WATCHDOG to use CONFIG_X86_LOCAL_APIC (the former was an alias for the latter anyway, so nothing unusual here). Those pieces of code were relying more on local apic functionality than the nmi watchdog functionality, so the change should make sense. Second, I removed the x86 implementation of touch_nmi_watchdog(). It isn't needed now, instead x86 will rely on kernel/watchdog.c's implementation. Third, I removed the #define ARCH_HAS_NMI_WATCHDOG itself from x86. And tweaked the include/linux/nmi.h file to tell users to look for an externally defined touch_nmi_watchdog in the case of ARCH_HAS_NMI_WATCHDOG _or_ CONFIG_HARDLOCKUP_DETECTOR. This change removes some of the ugliness in that file. Finally, I added a Kconfig dependency for CONFIG_HARDLOCKUP_DETECTOR that said you can't have ARCH_HAS_NMI_WATCHDOG _and_ CONFIG_HARDLOCKUP_DETECTOR. You can only have one nmi_watchdog. 
Tested with ARCH=i386: allnoconfig, defconfig, allyesconfig, (various broken configs) ARCH=x86_64: allnoconfig, defconfig, allyesconfig, (various broken configs) Hopefully, after this patch I won't get any more compile broken emails. :-) v3: changed a couple of 'linux/nmi.h' -> 'asm/nmi.h' to pick-up correct function prototypes when CONFIG_HARDLOCKUP_DETECTOR is not set. Signed-off-by: Don Zickus <dzickus@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: fweisbec@gmail.com LKML-Reference: <1293044403-14117-1-git-send-email-dzickus@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-12-22 22:15:32 +01:00
Jiri Kosina	4b7bd36470	Merge branch 'master' into for-next Conflicts: MAINTAINERS arch/arm/mach-omap2/pm24xx.c drivers/scsi/bfa/bfa_fcpim.c Needed to update to apply fixes for which the old branch was too outdated.	2010-12-22 18:57:02 +01:00
Jack Steiner	d8850ba425	x86, UV: Fix the effect of extra bits in the hub nodeid register UV systems can be partitioned into multiple independent SSIs. Large partitioned systems may have extra bits in the node_id register. These bits are used when the total memory on all SSIs exceeds 16TB. These extra bits need to be ignored when calculating x2apic_extra_bits. Signed-off-by: Jack Steiner <steiner@sgi.com> LKML-Reference: <20101130195926.972776133@sgi.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-12-22 12:31:15 +01:00
Jack Steiner	e681041388	x86, UV: Add common uv_early_read_mmr() function for reading MMRs Early in boot, reading MMRs from the UV hub controller requires calls to early_ioremap()/early_iounmap(). Rather than duplicating code, add a common function to do the map/read/unmap. Signed-off-by: Jack Steiner <steiner@sgi.com> LKML-Reference: <20101130195926.834804371@sgi.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-12-22 12:31:15 +01:00
Ingo Molnar	6c529a266b	Merge commit 'v2.6.37-rc7' into perf/core Merge reason: Pick up the latest -rc. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-12-22 11:53:23 +01:00
Linus Torvalds	55ec86f848	Merge branches 'x86-fixes-for-linus' and 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86-32: Make sure we can map all of lowmem if we need to x86, vt-d: Handle previous faults after enabling fault handling x86: Enable the intr-remap fault handling after local APIC setup x86, vt-d: Fix the vt-d fault handling irq migration in the x2apic mode x86, vt-d: Quirk for masking vtd spec errors to platform error handling logic x86, xsave: Use alloc_bootmem_align() instead of alloc_bootmem() bootmem: Add alloc_bootmem_align() x86, gcc-4.6: Use gcc -m options when building vdso x86: HPET: Chose a paranoid safe value for the ETIME check x86: io_apic: Avoid unused variable warning when CONFIG_GENERIC_PENDING_IRQ=n * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: perf: Fix off by one in perf_swevent_init() perf: Fix duplicate events with multiple-pmu vs software events ftrace: Have recordmcount honor endianness in fn_ELF_R_INFO scripts/tags.sh: Add magic for trace-events tracing: Fix panic when lseek() called on "trace" opened for writing	2010-12-19 10:44:54 -08:00
Robert Richter	da169f5df2	oprofile, x86: Add support for 6 counters (AMD family 15h) This patch adds support for up to 6 hardware counters for AMD family 15h cpus. There is a new MSR range for hardware counters beginning at MSRC001_0200 Performance Event Select (PERF_CTL0). Signed-off-by: Robert Richter <robert.richter@amd.com>	2010-12-19 11:43:08 +01:00
Robert Richter	30570bced1	oprofile, x86: Add support for AMD family 15h This patch adds support for AMD family 15h (Interlagos/Valencia/ Zambezi) cpus. Signed-off-by: Robert Richter <robert.richter@amd.com>	2010-12-19 11:43:04 +01:00
Linus Torvalds	46bdfe6a50	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6: x86: avoid high BIOS area when allocating address space x86: avoid E820 regions when allocating address space x86: avoid low BIOS area when allocating address space resources: add arch hook for preventing allocation in reserved areas Revert "resources: support allocating space within a region from the top down" Revert "PCI: allocate bus resources from the top down" Revert "x86/PCI: allocate space from the end of a region, not the beginning" Revert "x86: allocate space within a region top-down" Revert "PCI: fix pci_bus_alloc_resource() hang, prefer positive decode" PCI: Update MCP55 quirk to not affect non HyperTransport variants	2010-12-18 10:13:24 -08:00
Tejun Heo	05c2d088d0	Merge branch 'this_cpu_ops' into for-2.6.38	2010-12-18 15:54:36 +01:00
Christoph Lameter	8270137a0d	cpuops: Use cmpxchg for xchg to avoid lock semantics Use cmpxchg instead of xchg to realize this_cpu_xchg. xchg will cause LOCK overhead since LOCK is always implied but cmpxchg will not. Baselines: xchg() = 18 cycles (no segment prefix, LOCK semantics) __this_cpu_xchg = 1 cycle (simulated using this_cpu_read/write, two prefixes. Looks like the cpu can use loop optimization to get rid of most of the overhead) Cycles before: this_cpu_xchg = 37 cycles (segment prefix and LOCK (implied by xchg)) After: this_cpu_xchg = 11 cycles (using cmpxchg without lock semantics) Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2010-12-18 15:54:04 +01:00
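The idea of building an exchange out of a compare-and-exchange retry loop can be modeled in plain C. This is a single-threaded sketch for illustration only: local_cmpxchg() and xchg_via_cmpxchg() are hypothetical names, the model has no per-cpu segment prefix, and the actual kernel implementation is inline assembly around the x86 cmpxchg instruction.

```c
/* Model of an unlocked compare-and-exchange on a CPU-local variable:
 * store new_val only if *ptr still equals old; return the value seen. */
static long local_cmpxchg(long *ptr, long old, long new_val)
{
    long cur = *ptr;

    if (cur == old)
        *ptr = new_val;
    return cur;
}

/* Exchange built from the compare-and-exchange above: read the current
 * value, try to swap it for new_val, and retry if the value changed
 * between the read and the swap. Returns the previous value. */
long xchg_via_cmpxchg(long *ptr, long new_val)
{
    long old;

    do {
        old = *ptr;
    } while (local_cmpxchg(ptr, old, new_val) != old);

    return old;
}
```

In this single-threaded model the loop always succeeds on the first pass; the retry only matters when something can modify the variable between the read and the swap, such as an interrupt on the same CPU.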
Christoph Lameter	7296e08aba	x86: this_cpu_cmpxchg and this_cpu_xchg operations Provide support as far as the hardware capabilities of the x86 cpus allow. Define CONFIG_CMPXCHG_LOCAL in Kconfig.cpu to allow core code to test for fast cpuops implementations. V1->V2: - Take out the definition for this_cpu_cmpxchg_8 and move it into a separate patch. tj: - Reordered ops to better follow this_cpu_* organization. - Renamed macro temp variables similar to their existing neighbours. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2010-12-18 15:54:04 +01:00
H. Peter Anvin	7f8595bfac	x86, kexec: Limit the crashkernel address appropriately Keep the crash kernel address below 512 MiB for 32 bits and 896 MiB for 64 bits. For 32 bits, this retains compatibility with earlier kernel releases, and makes it work even if the vmalloc= setting is adjusted. For 64 bits, we should be able to increase this substantially once a hard-coded limit in kexec-tools is fixed. Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Yinghai Lu <yinghai@kernel.org> LKML-Reference: <20101217195035.GE14502@redhat.com>	2010-12-17 15:04:00 -08:00
Bjorn Helgaas	a2c606d53a	x86: avoid high BIOS area when allocating address space This prevents allocation of the last 2MB before 4GB. The experiment described here shows Windows 7 ignoring the last 1MB: https://bugzilla.kernel.org/show_bug.cgi?id=23542#c27 This patch ignores the top 2MB instead of just 1MB because H. Peter Anvin says "There will be ROM at the top of the 32-bit address space; it's a fact of the architecture, and on at least older systems it was common to have a shadow 1 MiB below." Acked-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com> Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>	2010-12-17 10:01:30 -08:00
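Keeping allocations out of the last 2MB below 4GB amounts to clipping a candidate address range against [0xffe00000, 0xffffffff]. The sketch below is an assumption-laden illustration, not the kernel code: struct range, range_clip() and avoid_high_bios() are hypothetical names, and keeping the larger remaining piece is a choice made for this example.

```c
#include <stdint.h>

/* Inclusive address range; a hypothetical stand-in for a resource. */
struct range {
    uint64_t start;
    uint64_t end;
};

/* Top 2MB below 4GB, as described in the commit message. */
#define HIGH_BIOS_BASE 0xffe00000ull
#define HIGH_BIOS_END  0xffffffffull

/* Shrink r so it no longer overlaps [start, end], keeping whichever
 * remaining piece (below or above the conflict) is larger. If r lies
 * entirely inside the conflict, r ends up empty (r->start > r->end). */
void range_clip(struct range *r, uint64_t start, uint64_t end)
{
    uint64_t low = 0, high = 0;

    if (r->end < start || r->start > end)
        return;                     /* no overlap, nothing to do */

    if (r->start < start)
        low = start - r->start;     /* bytes available below the conflict */
    if (r->end > end)
        high = r->end - end;        /* bytes available above the conflict */

    if (low > high)
        r->end = start - 1;
    else
        r->start = end + 1;
}

/* Convenience wrapper for the high BIOS area. */
void avoid_high_bios(struct range *r)
{
    range_clip(r, HIGH_BIOS_BASE, HIGH_BIOS_END);
}
```

A window of [0xf0000000, 0xffffffff] is trimmed to end at 0xffdfffff, while a window that does not reach the top 2MB is left unchanged.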
Bjorn Helgaas	4dc2287c18	x86: avoid E820 regions when allocating address space When we allocate address space, e.g., to assign it to a PCI device, don't allocate anything mentioned in the BIOS E820 memory map. On recent machines (2008 and newer), we assign PCI resources from the windows described by the ACPI PCI host bridge _CRS. On many Dell machines, these windows overlap some E820 reserved areas, e.g., BIOS-e820: 00000000bfe4dc00 - 00000000c0000000 (reserved) pci_root PNP0A03:00: host bridge window [mem 0xbff00000-0xdfffffff] If we put devices at 0xbff00000, they don't work, probably because that's really RAM, not I/O memory. This patch prevents that by removing the 0xbfe4dc00-0xbfffffff area from the "available" resource. I'm not very happy with this solution because Windows solves the problem differently (it seems to ignore E820 reserved areas and it allocates top-down instead of bottom-up; details at comment 45 of the bugzilla below). That means we're vulnerable to BIOS defects that Windows would not trip over. For example, if BIOS described a device in ACPI but didn't mention it in E820, Windows would work fine but Linux would fail. Reference: https://bugzilla.kernel.org/show_bug.cgi?id=16228 Acked-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com> Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>	2010-12-17 10:01:24 -08:00
Bjorn Helgaas	30919b0bf3	x86: avoid low BIOS area when allocating address space This implements arch_remove_reservations() so allocate_resource() can avoid any arch-specific reserved areas. This currently just avoids the BIOS area (the first 1MB), but could be used for E820 reserved areas if that turns out to be necessary. We previously avoided this area in pcibios_align_resource(). This patch moves the test from that PCI-specific path to a generic path, so all resource allocations will avoid this area. Acked-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com> Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>	2010-12-17 10:01:17 -08:00
Bjorn Helgaas	d14125ecfe	Revert "x86/PCI: allocate space from the end of a region, not the beginning" This reverts commit `dc9887dc02`. Acked-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com> Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>	2010-12-17 10:00:49 -08:00
Bjorn Helgaas	5e52f1c5e8	Revert "x86: allocate space within a region top-down" This reverts commit `1af3c2e45e`. Acked-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com> Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>	2010-12-17 10:00:43 -08:00
Linus Torvalds	a6ac1f0af4	Merge branch 'kvm-updates/2.6.37' of git://git.kernel.org/pub/scm/virt/kvm/kvm * 'kvm-updates/2.6.37' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: Fix preemption counter leak in kvm_timer_init() KVM: enlarge number of possible CPUID leaves KVM: SVM: Do not report xsave in supported cpuid KVM: Fix OSXSAVE after migration	2010-12-17 09:32:39 -08:00
Tejun Heo	403047754c	percpu,x86: relocate this_cpu_add_return() and friends - include/linux/percpu.h: this_cpu_add_return() and friends were located next to __this_cpu_add_return(). However, the overall organization is to first group by preemption safeness. Relocate this_cpu_add_return() and friends to preemption-safe area. - arch/x86/include/asm/percpu.h: Relocate percpu_add_return_op() after other more basic operations. Relocate [__]this_cpu_add_return_8() so that they're first grouped by preemption safeness. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Christoph Lameter <cl@linux.com>	2010-12-17 16:13:22 +01:00
Tejun Heo	275c8b9328	Merge branch 'this_cpu_ops' into for-2.6.38	2010-12-17 15:16:46 +01:00
Christoph Lameter	8f1d97c79e	x86: Support for this_cpu_add, sub, dec, inc_return Supply an implementation for x86 in order to generate more efficient code. V2->V3: - Cleanup - Remove strange type checking from percpu_add_return_op. tj: - Dropped unused typedef from percpu_add_return_op(). - Renamed ret__ to paro_ret__ in percpu_add_return_op(). - Minor indentation adjustments. Acked-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2010-12-17 15:15:28 +01:00
Christoph Lameter	780f36d8b3	xen: Use this_cpu_ops Use this_cpu_ops to reduce code size and simplify things in various places. V3->V4: Move instance of this_cpu_inc_return to a later patchset so that this patch can be applied without infrastructure changes. Cc: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> Acked-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2010-12-17 15:07:19 +01:00
Christoph Lameter	b76834bc1b	kprobes: Use this_cpu_ops Use this_cpu ops in various places to optimize per cpu data access. Cc: Jason Baron <jbaron@redhat.com> Cc: Namhyung Kim <namhyung@gmail.com> Acked-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2010-12-17 15:07:19 +01:00
H. Peter Anvin	147dd5610c	x86-32: Make sure we can map all of lowmem if we need to A relocatable kernel can be anywhere in lowmem -- and in the case of a kdump kernel, is likely to be fairly high. Since the early page tables map everything from address zero up we need to make sure we allocate enough brk that we can map all of lowmem if we need to. Reported-by: Stanislaw Gruszka <sgruszka@redhat.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Tested-by: Yinghai Lu <yinghai@kernel.org> LKML-Reference: <4D0AD3ED.8070607@kernel.org>	2010-12-16 19:11:09 -08:00
Avi Kivity	3e26f23091	KVM: Fix preemption counter leak in kvm_timer_init() Based on a patch from Thomas Meyer. Signed-off-by: Avi Kivity <avi@redhat.com>	2010-12-16 12:39:31 +02:00
Peter Zijlstra	7639dae0ca	perf, x86: Provide a PEBS capable cycle event Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-12-16 11:36:44 +01:00
Peter Zijlstra	2e80a82a49	perf: Dynamic pmu types Extend the perf_pmu_register() interface to allow for named and dynamic pmu types. Because we need to support the existing static types we cannot use dynamic types for everything, hence provide a type argument. If we want to enumerate the PMUs they need a name, provide one. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20101117222056.259707703@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-12-16 11:36:43 +01:00
Peter Zijlstra	4407204c5c	perf, x86: Detect broken BIOSes that corrupt the PMU Some BIOSes use PMU resources, which can cause various bugs: - Non-working or erratic PMU based statistics - the PMU can end up counting the wrong thing, resulting in misleading statistics - Profiling can stop working or it can profile the wrong thing - A non-working or erratic NMI watchdog that cannot be relied on - The kernel may disturb whatever thing the BIOS tries to use the PMU for - possibly causing hardware malfunction in extreme cases. - ... and other forms of potential misbehavior Various forms of such misbehavior has been observed in practice - there are BIOSes that just corrupt the PMU state, consequences be damned. The PMU is a CPU resource that is handled by the kernel and the BIOS stealing+corrupting it is not acceptable nor robust, so we detect it, warn about it and further refuse to touch the PMU ourselves. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Jason Wessel <jason.wessel@windriver.com> Cc: Don Zickus <dzickus@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-12-16 11:36:42 +01:00
Ingo Molnar	006b20fe4c	Merge branch 'perf/urgent' into perf/core Merge reason: We want to apply a dependent patch. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-12-16 11:22:27 +01:00
Rusty Russell	da32dac101	lguest: populate initial_page_table Two x86 patches broke lguest: 1) v2.6.35-492-g72d7c3b, which changed x86 to use the memblock allocator. In lguest, the host places linear page tables at the top of mem, which used to be enough to get us up to the swapper_pg_dir page tables. With the first patch, the direct mapping tables used that memory: Before: kernel direct mapping tables up to 4000000 @ 7000-1a000 After: kernel direct mapping tables up to 4000000 @ 3fed000-4000000 I initially fixed this by lying about the amount of memory we had, so the kernel wouldn't blatt the lguest boot pagetables (yuk!), but then... 2) v2.6.36-rc8-54-gb40827f, which made x86 boot use initial_page_table. This was initialized in a part of head_32.S which isn't executed by lguest; it is then copied into swapper_pg_dir. So we have to initialize it; and anyway we switch to it before we blatt the old tables, so that fixes the previous damage as well. For the moment, I cut & pasted the code into lguest's boot code, but next merge window I will merge them. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Cc: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> To: x86@kernel.org	2010-12-16 17:03:15 +10:30

12,946 commits