linux-pinenote

Author	SHA1	Message	Date
Lee, Chun-Yi	8477957555	x86/mm, hibernate: Do not assume the first e820 area to be RAM In arch/x86/kernel/setup.c::trim_bios_range(), the codes introduced by `1b5576e6` (base on `d8a9e6a5`), it updates the first 4Kb of memory to be E820_RESERVED region. That's because it's a BIOS owned area but generally not listed in the E820 table: e820: BIOS-provided physical RAM map: BIOS-e820: [mem 0x0000000000000000-0x0000000000096fff] usable BIOS-e820: [mem 0x0000000000097000-0x0000000000097fff] reserved ... e820: update [mem 0x00000000-0x00000fff] usable ==> reserved e820: remove [mem 0x000a0000-0x000fffff] usable But the region of first 4Kb didn't register to nosave memory: PM: Registered nosave memory: [mem 0x00097000-0x00097fff] PM: Registered nosave memory: [mem 0x000a0000-0x000fffff] The code in e820_mark_nosave_regions() assumes the first e820 area to be RAM, so it causes the first 4Kb E820_RESERVED region ignored when register to nosave. This patch removed assumption of the first e820 area. Signed-off-by: Lee, Chun-Yi <jlee@suse.com> Acked-by: Pavel Machek <pavel@ucw.cz> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Len Brown <len.brown@intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Takashi Iwai <tiwai@suse.de> Link: http://lkml.kernel.org/r/1410491038-17576-1-git-send-email-jlee@suse.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2014-09-16 09:54:31 +02:00
Luiz Capitulino	8b375f64dc	x86/mm/numa: Drop dead code and rename setup_node_data() to setup_alloc_data() The setup_node_data() function allocates a pg_data_t object, inserts it into the node_data[] array and initializes the following fields: node_id, node_start_pfn and node_spanned_pages. However, a few function calls later during the kernel boot, free_area_init_node() re-initializes those fields, possibly with setup_node_data() is not used. This causes a small glitch when running Linux as a hyperv numa guest: SRAT: PXM 0 -> APIC 0x00 -> Node 0 SRAT: PXM 0 -> APIC 0x01 -> Node 0 SRAT: PXM 1 -> APIC 0x02 -> Node 1 SRAT: PXM 1 -> APIC 0x03 -> Node 1 SRAT: Node 0 PXM 0 [mem 0x00000000-0x7fffffff] SRAT: Node 1 PXM 1 [mem 0x80200000-0xf7ffffff] SRAT: Node 1 PXM 1 [mem 0x100000000-0x1081fffff] NUMA: Node 1 [mem 0x80200000-0xf7ffffff] + [mem 0x100000000-0x1081fffff] -> [mem 0x80200000-0x1081fffff] Initmem setup node 0 [mem 0x00000000-0x7fffffff] NODE_DATA [mem 0x7ffdc000-0x7ffeffff] Initmem setup node 1 [mem 0x80800000-0x1081fffff] NODE_DATA [mem 0x1081ea000-0x1081fdfff] crashkernel: memory value expected [ffffea0000000000-ffffea0001ffffff] PMD -> [ffff88007de00000-ffff88007fdfffff] on node 0 [ffffea0002000000-ffffea00043fffff] PMD -> [ffff880105600000-ffff8801077fffff] on node 1 Zone ranges: DMA [mem 0x00001000-0x00ffffff] DMA32 [mem 0x01000000-0xffffffff] Normal [mem 0x100000000-0x1081fffff] Movable zone start for each node Early memory node ranges node 0: [mem 0x00001000-0x0009efff] node 0: [mem 0x00100000-0x7ffeffff] node 1: [mem 0x80200000-0xf7ffffff] node 1: [mem 0x100000000-0x1081fffff] On node 0 totalpages: 524174 DMA zone: 64 pages used for memmap DMA zone: 21 pages reserved DMA zone: 3998 pages, LIFO batch:0 DMA32 zone: 8128 pages used for memmap DMA32 zone: 520176 pages, LIFO batch:31 On node 1 totalpages: 524288 DMA32 zone: 7672 pages used for memmap DMA32 zone: 491008 pages, LIFO batch:31 Normal zone: 520 pages used for memmap Normal zone: 33280 pages, LIFO batch:7 In this dmesg, the SRAT table reports that the memory range for node 1 starts at 0x80200000. However, the line starting with "Initmem" reports that node 1 memory range starts at 0x80800000. The "Initmem" line is reported by setup_node_data() and is wrong, because the kernel ends up using the range as reported in the SRAT table. This commit drops all that dead code from setup_node_data(), renames it to alloc_node_data() and adds a printk() to free_area_init_node() so that we report a node's memory range accurately. Here's the same dmesg section with this patch applied: SRAT: PXM 0 -> APIC 0x00 -> Node 0 SRAT: PXM 0 -> APIC 0x01 -> Node 0 SRAT: PXM 1 -> APIC 0x02 -> Node 1 SRAT: PXM 1 -> APIC 0x03 -> Node 1 SRAT: Node 0 PXM 0 [mem 0x00000000-0x7fffffff] SRAT: Node 1 PXM 1 [mem 0x80200000-0xf7ffffff] SRAT: Node 1 PXM 1 [mem 0x100000000-0x1081fffff] NUMA: Node 1 [mem 0x80200000-0xf7ffffff] + [mem 0x100000000-0x1081fffff] -> [mem 0x80200000-0x1081fffff] NODE_DATA(0) allocated [mem 0x7ffdc000-0x7ffeffff] NODE_DATA(1) allocated [mem 0x1081ea000-0x1081fdfff] crashkernel: memory value expected [ffffea0000000000-ffffea0001ffffff] PMD -> [ffff88007de00000-ffff88007fdfffff] on node 0 [ffffea0002000000-ffffea00043fffff] PMD -> [ffff880105600000-ffff8801077fffff] on node 1 Zone ranges: DMA [mem 0x00001000-0x00ffffff] DMA32 [mem 0x01000000-0xffffffff] Normal [mem 0x100000000-0x1081fffff] Movable zone start for each node Early memory node ranges node 0: [mem 0x00001000-0x0009efff] node 0: [mem 0x00100000-0x7ffeffff] node 1: [mem 0x80200000-0xf7ffffff] node 1: [mem 0x100000000-0x1081fffff] Initmem setup node 0 [mem 0x00001000-0x7ffeffff] On node 0 totalpages: 524174 DMA zone: 64 pages used for memmap DMA zone: 21 pages reserved DMA zone: 3998 pages, LIFO batch:0 DMA32 zone: 8128 pages used for memmap DMA32 zone: 520176 pages, LIFO batch:31 Initmem setup node 1 [mem 0x80200000-0x1081fffff] On node 1 totalpages: 524288 DMA32 zone: 7672 pages used for memmap DMA32 zone: 491008 pages, LIFO batch:31 Normal zone: 520 pages used for memmap Normal zone: 33280 pages, LIFO batch:7 This commit was tested on a two node bare-metal NUMA machine and Linux as a numa guest on hyperv and qemu/kvm. PS: The wrong memory range reported by setup_node_data() seems to be harmless in the current kernel because it's just not used. However, that bad range is used in kernel 2.6.32 to initialize the old boot memory allocator, which causes a crash during boot. Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: David Rientjes <rientjes@google.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@kernel.org>	2014-09-16 08:55:10 +02:00
Yasuaki Ishimatsu	9661d5bcd0	x86/mm/hotplug: Modify PGD entry when removing memory When hot-adding/removing memory, sync_global_pgds() is called for synchronizing PGD to PGD entries of all processes MM. But when hot-removing memory, sync_global_pgds() does not work correctly. At first, sync_global_pgds() checks whether target PGD is none or not. And if PGD is none, the PGD is skipped. But when hot-removing memory, PGD may be none since PGD may be cleared by free_pud_table(). So when sync_global_pgds() is called after hot-removing memory, sync_global_pgds() should not skip PGD even if the PGD is none. And sync_global_pgds() must clear PGD entries of all processes MM. Currently sync_global_pgds() does not clear PGD entries of all processes MM when hot-removing memory. So when hot adding memory which is same memory range as removed memory after hot-removing memory, following call traces are shown: kernel BUG at arch/x86/mm/init_64.c:206! ... [<ffffffff815e0c80>] kernel_physical_mapping_init+0x1b2/0x1d2 [<ffffffff815ced94>] init_memory_mapping+0x1d4/0x380 [<ffffffff8104aebd>] arch_add_memory+0x3d/0xd0 [<ffffffff815d03d9>] add_memory+0xb9/0x1b0 [<ffffffff81352415>] acpi_memory_device_add+0x1af/0x28e [<ffffffff81325dc4>] acpi_bus_device_attach+0x8c/0xf0 [<ffffffff813413b9>] acpi_ns_walk_namespace+0xc8/0x17f [<ffffffff81325d38>] ? acpi_bus_type_and_status+0xb7/0xb7 [<ffffffff81325d38>] ? acpi_bus_type_and_status+0xb7/0xb7 [<ffffffff813418ed>] acpi_walk_namespace+0x95/0xc5 [<ffffffff81326b4c>] acpi_bus_scan+0x9a/0xc2 [<ffffffff81326bff>] acpi_scan_bus_device_check+0x8b/0x12e [<ffffffff81326cb5>] acpi_scan_device_check+0x13/0x15 [<ffffffff81320122>] acpi_os_execute_deferred+0x25/0x32 [<ffffffff8107e02b>] process_one_work+0x17b/0x460 [<ffffffff8107edfb>] worker_thread+0x11b/0x400 [<ffffffff8107ece0>] ? rescuer_thread+0x400/0x400 [<ffffffff81085aef>] kthread+0xcf/0xe0 [<ffffffff81085a20>] ? kthread_create_on_node+0x140/0x140 [<ffffffff815fc76c>] ret_from_fork+0x7c/0xb0 [<ffffffff81085a20>] ? kthread_create_on_node+0x140/0x140 This patch clears PGD entries of all processes MM when sync_global_pgds() is called after hot-removing memory Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Acked-by: Toshi Kani <toshi.kani@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Tang Chen <tangchen@cn.fujitsu.com> Cc: Gu Zheng <guz.fnst@cn.fujitsu.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@kernel.org>	2014-09-16 08:55:09 +02:00
Yasuaki Ishimatsu	5255e0a79f	x86/mm/hotplug: Pass sync_global_pgds() a correct argument in remove_pagetable() When hot-adding memory after hot-removing memory, following call traces are shown: kernel BUG at arch/x86/mm/init_64.c:206! ... [<ffffffff815e0c80>] kernel_physical_mapping_init+0x1b2/0x1d2 [<ffffffff815ced94>] init_memory_mapping+0x1d4/0x380 [<ffffffff8104aebd>] arch_add_memory+0x3d/0xd0 [<ffffffff815d03d9>] add_memory+0xb9/0x1b0 [<ffffffff81352415>] acpi_memory_device_add+0x1af/0x28e [<ffffffff81325dc4>] acpi_bus_device_attach+0x8c/0xf0 [<ffffffff813413b9>] acpi_ns_walk_namespace+0xc8/0x17f [<ffffffff81325d38>] ? acpi_bus_type_and_status+0xb7/0xb7 [<ffffffff81325d38>] ? acpi_bus_type_and_status+0xb7/0xb7 [<ffffffff813418ed>] acpi_walk_namespace+0x95/0xc5 [<ffffffff81326b4c>] acpi_bus_scan+0x9a/0xc2 [<ffffffff81326bff>] acpi_scan_bus_device_check+0x8b/0x12e [<ffffffff81326cb5>] acpi_scan_device_check+0x13/0x15 [<ffffffff81320122>] acpi_os_execute_deferred+0x25/0x32 [<ffffffff8107e02b>] process_one_work+0x17b/0x460 [<ffffffff8107edfb>] worker_thread+0x11b/0x400 [<ffffffff8107ece0>] ? rescuer_thread+0x400/0x400 [<ffffffff81085aef>] kthread+0xcf/0xe0 [<ffffffff81085a20>] ? kthread_create_on_node+0x140/0x140 [<ffffffff815fc76c>] ret_from_fork+0x7c/0xb0 [<ffffffff81085a20>] ? kthread_create_on_node+0x140/0x140 The patch-set fixes the issue. This patch (of 2): remove_pagetable() gets start argument and passes the argument to sync_global_pgds(). In this case, the argument must not be modified. If the argument is modified and passed to sync_global_pgds(), sync_global_pgds() does not correctly synchronize PGD to PGD entries of all processes MM since synchronized range of memory [start, end] is wrong. Unfortunately the start argument is modified in remove_pagetable(). So this patch fixes the issue. Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Acked-by: Toshi Kani <toshi.kani@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Tang Chen <tangchen@cn.fujitsu.com> Cc: Gu Zheng <guz.fnst@cn.fujitsu.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@kernel.org>	2014-09-16 08:55:08 +02:00
Peter Neubauer	2e151c70df	x86: HPET force enable for e6xx based systems As the Soekris net6501 and other e6xx based systems do not have any ACPI implementation, HPET won't get enabled. This patch enables HPET on such platforms. [ 0.430149] pci 0000:00:01.0: Force enabled HPET at 0xfed00000 [ 0.644838] HPET: 3 timers in total, 0 timers will be used for per-cpu timer Original patch by Peter Neubauer (http://www.mail-archive.com/soekris-tech@lists.soekris.com/msg06462.html) slightly modified by Conrad Kostecki <ck@conrad-kostecki.de> and massaged accoring to Thomas Gleixners <tglx@linutronix.de> by me. Suggested-by: Conrad Kostecki <ck@conrad-kostecki.de> Signed-off-by: Eric Sesterhenn <eric.sesterhenn@lsexperts.de> Cc: Peter Neubauer <pneubauer@bluerwhite.org> Link: http://lkml.kernel.org/r/5412D3A5.2030909@lsexperts.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2014-09-15 17:53:35 -07:00
Dave Young	3eddc69ffe	x86 early_ioremap: Increase FIX_BTMAPS_SLOTS to 8 3.16 kernel boot fail with earlyprintk=efi, it keeps scrolling at the bottom line of screen. Bisected, the first bad commit is below: commit `86dfc6f339` Author: Lv Zheng <lv.zheng@intel.com> Date: Fri Apr 4 12:38:57 2014 +0800 ACPICA: Tables: Fix table checksums verification before installation. I did some debugging by enabling both serial and efi earlyprintk, below is some debug dmesg, seems early_ioremap fails in scroll up function due to no free slot, see below dmesg output: WARNING: CPU: 0 PID: 0 at mm/early_ioremap.c:116 __early_ioremap+0x90/0x1c4() __early_ioremap(ed00c800, 00000c80) not found slot Modules linked in: CPU: 0 PID: 0 Comm: swapper Not tainted 3.17.0-rc1+ #204 Hardware name: Hewlett-Packard HP Z420 Workstation/1589, BIOS J61 v03.15 05/09/2013 Call Trace: dump_stack+0x4e/0x7a warn_slowpath_common+0x75/0x8e ? __early_ioremap+0x90/0x1c4 warn_slowpath_fmt+0x47/0x49 __early_ioremap+0x90/0x1c4 ? sprintf+0x46/0x48 early_ioremap+0x13/0x15 early_efi_map+0x24/0x26 early_efi_scroll_up+0x6d/0xc0 early_efi_write+0x1b0/0x214 call_console_drivers.constprop.21+0x73/0x7e console_unlock+0x151/0x3b2 ? vprintk_emit+0x49f/0x532 vprintk_emit+0x521/0x532 ? console_unlock+0x383/0x3b2 printk+0x4f/0x51 acpi_os_vprintf+0x2b/0x2d acpi_os_printf+0x43/0x45 acpi_info+0x5c/0x63 ? __acpi_map_table+0x13/0x18 ? acpi_os_map_iomem+0x21/0x147 acpi_tb_print_table_header+0x177/0x186 acpi_tb_install_table_with_override+0x4b/0x62 acpi_tb_install_standard_table+0xd9/0x215 ? early_ioremap+0x13/0x15 ? __acpi_map_table+0x13/0x18 acpi_tb_parse_root_table+0x16e/0x1b4 acpi_initialize_tables+0x57/0x59 acpi_table_init+0x50/0xce acpi_boot_table_init+0x1e/0x85 setup_arch+0x9b7/0xcc4 start_kernel+0x94/0x42d ? early_idt_handlers+0x120/0x120 x86_64_start_reservations+0x2a/0x2c x86_64_start_kernel+0xf3/0x100 Quote reply from Lv.zheng about the early ioremap slot usage in this case: """ In early_efi_scroll_up(), 2 mapping entries will be used for the src/dst screen buffer. In drivers/acpi/acpica/tbutils.c, we've improved the early table loading code in acpi_tb_parse_root_table(). We now need 2 mapping entries: 1. One mapping entry is used for RSDT table mapping. Each RSDT entry contains an address for another ACPI table. 2. For each entry in RSDP, we need another mapping entry to map the table to perform necessary check/override before installing it. When acpi_tb_parse_root_table() prints something through EFI earlyprintk console, we'll have 4 mapping entries used. The current 4 slots setting of early_ioremap() seems to be too small for such a use case. """ Thus increase the slot to 8 in this patch to fix this issue. boot-time mappings become 512 page with this patch. Signed-off-by: Dave Young <dyoung@redhat.com> Cc: <stable@vger.kernel.org> # v3.16 Signed-off-by: Matt Fleming <matt.fleming@intel.com>	2014-09-14 15:24:31 +01:00
Linus Torvalds	72d9310460	Make ARCH_HAS_FAST_MULTIPLIER a real config variable It used to be an ad-hoc hack defined by the x86 version of <asm/bitops.h> that enabled a couple of library routines to know whether an integer multiply is faster than repeated shifts and additions. This just makes it use the real Kconfig system instead, and makes x86 (which was the only architecture that did this) select the option. NOTE! Even for x86, this really is kind of wrong. If we cared, we would probably not enable this for builds optimized for netburst (P4), where shifts-and-adds are generally faster than multiplies. This patch does not change that kind of logic, though, it is purely a syntactic change with no code changes. This was triggered by the fact that we have other places that really want to know "do I want to expand multiples by constants by hand or not", particularly the hash generation code. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2014-09-13 11:14:53 -07:00
Frederic Weisbecker	3010279f0f	x86: Tell irq work about self IPI support x86 supports irq work self-IPIs when local apic is available. This is partly known on runtime so lets implement arch_irq_work_has_interrupt() accordingly. This should be safely called after setup_arch(). Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2014-09-13 18:38:29 +02:00
Peter Zijlstra	c5c38ef3d7	irq_work: Introduce arch_irq_work_has_interrupt() The nohz full code needs irq work to trigger its own interrupt so that the subsystem can work even when the tick is stopped. Lets introduce arch_irq_work_has_interrupt() that archs can override to tell about their support for this ability. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2014-09-13 18:38:07 +02:00
Linus Torvalds	7ee2d2d671	xen: bug fixes for 3.17-rc4 - Fix for PVHVM suspend/resume and migration - Don't pointlessly retry certain ballooning ops - Fix gntalloc when grefs have run out. - Fix PV boot if KSALR is enable or very large modules are used. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.12 (GNU/Linux) iQEcBAABAgAGBQJUEZw4AAoJEFxbo/MsZsTRoHkIAJ5h287Wer2Yf4ALlFI47Esl AhcgVKi7WMJ6mBQUk/xpbSS3MZEuTuPJVbuiabJwRaZk9vX7w8q08yrAD8SOrCwY 6iGooaWEZ96P5DPI5fmWBuOgt5f2wfqzAp//wl/dK/kzr6Kw63huTXa0H0fmd8BY Gy13pN4JEPDyvjjZWFvDcrcEUA9eTcTaXQZisBUGuGictXySr5ovz9G3EKOtJP0D vkmDzApijFEEjJwZMGazgipng4mwh94fW4+7bs57s6iREpMzOJDTRKrl9suaXdza 5HA2ed7Au/qL8IwiQAFGxQpMBqNQxEGerlFfPBhRPuGQ+Ek3/pZF3BTU56w/nt4= =y/Ol -----END PGP SIGNATURE----- Merge tag 'stable/for-linus-3.17-b-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip Pull Xen bug fixes from David Vrabel: - fix for PVHVM suspend/resume and migration - don't pointlessly retry certain ballooning ops - fix gntalloc when grefs have run out. - fix PV boot if KSALR is enable or very large modules are used. * tag 'stable/for-linus-3.17-b-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: x86/xen: don't copy bogus duplicate entries into kernel page tables xen/gntalloc: safely delete grefs in add_grefs() undo path xen/gntalloc: fix oops after runnning out of grant refs xen/balloon: cancel ballooning if adding new memory failed xen/manage: Always freeze/thaw processes when suspend/resuming	2014-09-11 16:52:29 -07:00
Dave Hansen	9298b815ef	x86: Add more disabled features The original motivation for these patches was for an Intel CPU feature called MPX. The patch to add a disabled feature for it will go in with the other parts of the support. But, in the meantime, there are a few other features than MPX that we can make assumptions about at compile-time based on compile options. Add them to disabled-features.h and check them with cpu_feature_enabled(). Note that this gets rid of the last things that needed an #ifdef CONFIG_X86_64 in cpufeature.h. Yay! Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: http://lkml.kernel.org/r/20140911211524.C0EC332A@viggo.jf.intel.com Acked-by: Borislav Petkov <bp@suse.de> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2014-09-11 14:30:17 -07:00
Dave Hansen	381aa07a9b	x86: Introduce disabled-features I believe the REQUIRED_MASK aproach was taken so that it was easier to consult in assembly (arch/x86/kernel/verify_cpu.S). DISABLED_MASK does not have the same restriction, but I implemented it the same way for consistency. We have a REQUIRED_MASK... which does two things: 1. Keeps a list of cpuid bits to check in very early boot and refuse to boot if those are not present. 2. Consulted during cpu_has() checks, which allows us to optimize out things at compile-time. In other words, if we KNOW we will not boot with the feature off, then we can safely assume that it will be present forever. But, we don't have a similar mechanism for CPU features which may be present but that we know we will not use. We simply use our existing mechanisms to repeatedly check the status of the bit at runtime (well, the alternatives patching helps here but it does not provide compile-time optimization). Adding a feature to disabled-features.h allows the bit to be checked via a new macro: cpu_feature_enabled(). Note that for features in DISABLED_MASK, checks with this macro have all of the benefits of an #ifdef. Before, we would have done this in a header: #ifdef CONFIG_X86_INTEL_MPX #define cpu_has_mpx cpu_has(X86_FEATURE_MPX) #else #define cpu_has_mpx 0 #endif and this in the code: if (cpu_has_mpx) do_some_mpx_thing(); Now, just add your feature to DISABLED_MASK and you can do this everywhere, and get the same benefits you would have from #ifdefs: if (cpu_feature_enabled(X86_FEATURE_MPX)) do_some_mpx_thing(); We need a new function and not a modification to cpu_has() because there are cases where we actually need to check the CPU itself, despite what features the kernel supports. The best example of this is a hypervisor which has no control over what features its guests are using and where the guest does not depend on the host for support. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: http://lkml.kernel.org/r/20140911211513.9E35E931@viggo.jf.intel.com Acked-by: Borislav Petkov <bp@suse.de> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2014-09-11 14:30:02 -07:00
Dave Hansen	c8128cceb4	x86: Axe the lightly-used cpu_has_pae cpu_has_pae is only referenced in one place: the X86_32 kexec code (in a file not even built on 64-bit). It hardly warrants its own macro, or the trouble we go to ensuring that it can't be called in X86_64 code. Axe the macro and replace it with a direct cpu feature check. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: http://lkml.kernel.org/r/20140911211511.AD76E774@viggo.jf.intel.com Acked-by: Borislav Petkov <bp@suse.de> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2014-09-11 14:30:01 -07:00
Paolo Bonzini	a183b638b6	KVM: x86: make apic_accept_irq tracepoint more generic Initially the tracepoint was added only to the APIC_DM_FIXED case, also because it reported coalesced interrupts that only made sense for that case. However, the coalesced argument is not used anymore and tracing other delivery modes is useful, so hoist the call out of the switch statement. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2014-09-11 11:51:02 +02:00
Tang Chen	73a6d94162	kvm: Use APIC_DEFAULT_PHYS_BASE macro as the apic access page address. We have APIC_DEFAULT_PHYS_BASE defined as 0xfee00000, which is also the address of apic access page. So use this macro. Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> Reviewed-by: Gleb Natapov <gleb@kernel.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2014-09-11 11:10:22 +02:00
Stefan Bader	0b5a50635f	x86/xen: don't copy bogus duplicate entries into kernel page tables When RANDOMIZE_BASE (KASLR) is enabled; or the sum of all loaded modules exceeds 512 MiB, then loading modules fails with a warning (and hence a vmalloc allocation failure) because the PTEs for the newly-allocated vmalloc address space are not zero. WARNING: CPU: 0 PID: 494 at linux/mm/vmalloc.c:128 vmap_page_range_noflush+0x2a1/0x360() This is caused by xen_setup_kernel_pagetables() copying level2_kernel_pgt into level2_fixmap_pgt, overwriting many non-present entries. Without KASLR, the normal kernel image size only covers the first half of level2_kernel_pgt and module space starts after that. L4[511]->level3_kernel_pgt[510]->level2_kernel_pgt[ 0..255]->kernel [256..511]->module [511]->level2_fixmap_pgt[ 0..505]->module This allows 512 MiB of of module vmalloc space to be used before having to use the corrupted level2_fixmap_pgt entries. With KASLR enabled, the kernel image uses the full PUD range of 1G and module space starts in the level2_fixmap_pgt. So basically: L4[511]->level3_kernel_pgt[510]->level2_kernel_pgt[0..511]->kernel [511]->level2_fixmap_pgt[0..505]->module And now no module vmalloc space can be used without using the corrupt level2_fixmap_pgt entries. Fix this by properly converting the level2_fixmap_pgt entries to MFNs, and setting level1_fixmap_pgt as read-only. A number of comments were also using the the wrong L3 offset for level2_kernel_pgt. These have been corrected. Signed-off-by: Stefan Bader <stefan.bader@canonical.com> Signed-off-by: David Vrabel <david.vrabel@citrix.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: stable@vger.kernel.org	2014-09-10 15:23:42 +01:00
Waiman Long	6157c7e1bb	locking/rwlock, x86: Delete unused asm/rwlock.h and rwlock.S This patch removes the unused asm/rwlock.h and rwlock.S files. Signed-off-by: Waiman Long <Waiman.Long@hp.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: http://lkml.kernel.org/r/1408037251-45918-3-git-send-email-Waiman.Long@hp.com Cc: Scott J Norton <scott.norton@hp.com> Cc: Borislav Petkov <bp@suse.de> Cc: Daniel Borkmann <dborkman@redhat.com> Cc: David S. Miller <davem@davemloft.net> Cc: Francesco Fusco <ffusco@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Thomas Graf <tgraf@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org>	2014-09-10 11:46:39 +02:00
Waiman Long	2ff810a7ef	locking/rwlock, x86: Clean up asm/spinlock*.h to remove old rwlock code As the x86 architecture now uses qrwlock for its read/write lock implementation, it is no longer necessary to keep the old rwlock code around. This patch removes the old rwlock code in the asm/spinlock.h and asm/spinlock_types.h files. Now the ARCH_USE_QUEUE_RWLOCK config parameter cannot be removed from x86/Kconfig or there will be a compilation error. Signed-off-by: Waiman Long <Waiman.Long@hp.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Scott J Norton <scott.norton@hp.com> Cc: Dave Jones <davej@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Waiman Long <Waiman.Long@hp.com> Link: http://lkml.kernel.org/r/1408037251-45918-2-git-send-email-Waiman.Long@hp.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2014-09-10 11:46:38 +02:00
Daniel Borkmann	286aad3c40	net: bpf: be friendly to kmemcheck Reported by Mikulas Patocka, kmemcheck currently barks out a false positive since we don't have special kmemcheck annotation for bitfields used in bpf_prog structure. We currently have jited:1, len:31 and thus when accessing len while CONFIG_KMEMCHECK enabled, kmemcheck throws a warning that we're reading uninitialized memory. As we don't need the whole bit universe for pages member, we can just split it to u16 and use a bool flag for jited instead of a bitfield. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-09-09 16:58:56 -07:00
Daniel Borkmann	738cbe72ad	net: bpf: consolidate JIT binary allocator Introduced in commit `314beb9bca` ("x86: bpf_jit_comp: secure bpf jit against spraying attacks") and later on replicated in `aa2d2c73c2` ("s390/bpf,jit: address randomize and write protect jit code") for s390 architecture, write protection for BPF JIT images got added and a random start address of the JIT code, so that it's not on a page boundary anymore. Since both use a very similar allocator for the BPF binary header, we can consolidate this code into the BPF core as it's mostly JIT independant anyway. This will also allow for future archs that support DEBUG_SET_MODULE_RONX to just reuse instead of reimplementing it. JIT tested on x86_64 and s390x with BPF test suite. Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-09-09 16:58:56 -07:00
Alexei Starovoitov	02ab695bb3	net: filter: add "load 64-bit immediate" eBPF instruction add BPF_LD_IMM64 instruction to load 64-bit immediate value into a register. All previous instructions were 8-byte. This is first 16-byte instruction. Two consecutive 'struct bpf_insn' blocks are interpreted as single instruction: insn[0].code = BPF_LD \| BPF_DW \| BPF_IMM insn[0].dst_reg = destination register insn[0].imm = lower 32-bit insn[1].code = 0 insn[1].imm = upper 32-bit All unused fields must be zero. Classic BPF has similar instruction: BPF_LD \| BPF_W \| BPF_IMM which loads 32-bit immediate value into a register. x64 JITs it as single 'movabsq %rax, imm64' arm64 may JIT as sequence of four 'movk x0, #imm16, lsl #shift' insn Note that old eBPF programs are binary compatible with new interpreter. It helps eBPF programs load 64-bit constant into a register with one instruction instead of using two registers and 4 instructions: BPF_MOV32_IMM(R1, imm32) BPF_ALU64_IMM(BPF_LSH, R1, 32) BPF_MOV32_IMM(R2, imm32) BPF_ALU64_REG(BPF_OR, R1, R2) User space generated programs will use this instruction to load constants only. To tell kernel that user space needs a pointer the _pseudo_ variant of this instruction may be added later, which will use extra bits of encoding to indicate what type of pointer user space is asking kernel to provide. For example 'off' or 'src_reg' fields can be used for such purpose. src_reg = 1 could mean that user space is asking kernel to validate and load in-kernel map pointer. src_reg = 2 could mean that user space needs readonly data section pointer src_reg = 3 could mean that user space needs a pointer to per-cpu local data All such future pseudo instructions will not be carrying the actual pointer as part of the instruction, but rather will be treated as a request to kernel to provide one. The kernel will verify the request_for_a_pointer, then will drop _pseudo_ marking and will store actual internal pointer inside the instruction, so the end result is the interpreter and JITs never see pseudo BPF_LD_IMM64 insns and only operate on generic BPF_LD_IMM64 that loads 64-bit immediate into a register. User space never operates on direct pointers and verifier can easily recognize request_for_pointer vs other instructions. Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-09-09 10:26:47 -07:00
Ingo Molnar	5ac385d835	* Fix early boot regression affecting x86 EFI boot stub when loading initrds above 4GB - Yinghai Lu * Relocate GOT entries in the x86 EFI boot stub now that we have symbols with global visibility - Matt Fleming * fdt memory reservation fix for arm64 - Mark Salter -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQIcBAABAgAGBQJUDqq/AAoJEC84WcCNIz1VgzEP/1Ax+XnXQjIRMGcR7gHolcan lzBzDL3afEp28LcWevmDZ9Bp4VRFjCRecg1gdI64HJhn+b7Ay1iPX/hUaIqfgPfb ptY8uAomNAwDyxC7z0S13GNiZZxPKB6eHnoV8t2Hi3uM8oUnkca/WTXHOyXs+gJG 4fQZtXWn/T8j7vAXuHGSbdH1pF4HYf2vX9i0c7iWVIcKyl+Oe5xGMcql4BqPJnAz 6hN9etyRMWF37CHZjD1pH0YHhRMJ6uuqUFvUQZt2q+OPUzgYVPv1Es6984r5q2CI HHQK2RSfHifYhNuLHuQo+8hOzz41pTriUrrDLDYk9SXDaJM4nHF6n2AXvra320P3 Xa0TR87+DxOdCM+1s1LeLl/9wMrwz1tgx8m9St16yISnRcGkkJrWYeV9z4PXYsi5 Qe1uGFS4eVWMAGVuaQgOP/olLAOxr1Vxwrnci+mg4Zh5LgohDZ4FBqbDdMeP3GIF vuI+yNnH9jxqmKZXD7wKtxVmS5s3vB3bH0+H8fFCMdBfUWqcM2CA0QJjSCsYGgkB mv5jaccRBk8WlI4KrDDuJ2BzM5prg59XTsO0m8oaloCk8b2OEvOte3XsF1DAkYh+ DnMbESfyDJxc6OFzq6pzAFeY5JUbSgWe0AwnyDJ3Woo9qCCpSkQHImllohRuXVgO BruJYYr5r55mjhyDWzJk =DjZA -----END PGP SIGNATURE----- Merge tag 'efi-urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/mfleming/efi into x86/urgent Pull EFI fixes from Matt Fleming: * Fix early boot regression affecting x86 EFI boot stub when loading initrds above 4GB - Yinghai Lu * Relocate GOT entries in the x86 EFI boot stub now that we have symbols with global visibility - Matt Fleming * fdt memory reservation fix for arm64 - Mark Salter Signed-off-by: Ingo Molnar <mingo@kernel.org>	2014-09-09 16:56:12 +02:00
Jan-Simon Möller	cc99535eb4	x86/mm: Apply the section attribute to the variable, not its type This fixes a compilation error in clang in that a linker section attribute can't be added to a type: arch/x86/mm/mmap.c:34:8: error: '__section__' attribute only applies to functions and global variables struct __read_mostly ... By moving the section attribute to the variable declaration, the desired effect is achieved. Signed-off-by: Jan-Simon Möller <dl9pf@gmx.de> Signed-off-by: Behan Webster <behanw@converseincode.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/1409959005-11479-1-git-send-email-behanw@converseincode.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2014-09-09 07:13:39 +02:00
Andi Kleen	a08b6769d4	perf/x86: Fix section mismatch in split uncore driver The new split Intel uncore driver code that recently went into tip added a section mismatch, which the build process complains about. uncore_pmu_register() can be called from uncore_pci_probe,() which is not __init and can be called from pci driver ->probe. I'm not fully sure if it's actually possible to call the probe function later, but it seems safer to mark uncore_pmu_register not __init. This also fixes the warning. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: http://lkml.kernel.org/r/1409332858-29039-1-git-send-email-andi@firstfloor.org Signed-off-by: Ingo Molnar <mingo@kernel.org>	2014-09-09 06:53:08 +02:00
Mathias Krause	066ce64c7e	perf/x86/intel: Mark initialization code as such A few of the initialization functions are missing the __init annotation. Fix this and thereby allow ~680 additional bytes of code to be released after initialization. Signed-off-by: Mathias Krause <minipli@googlemail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: x86@kernel.org Link: http://lkml.kernel.org/r/1409071785-26015-1-git-send-email-minipli@googlemail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2014-09-09 06:53:06 +02:00
Ingo Molnar	bdea534db8	Linux 3.17-rc4 -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQEcBAABAgAGBQJUDOW+AAoJEHm+PkMAQRiGOXYH/00TPKm8PdM5cXXG2YYYv9eT W99K7KD2i0/qiVtlGgjjvB7fO3K0HcZusTd2jmVd8IWntXvauq7Zpw5YZkjwu4KX Y1HCwwCd2aw0FoqgrJhNP3+j5Cr1BD/HLtbffjCe+A3tppOIis4Bwt2wJOoYlXpS hU9Jxxc4lcRo8YKbffouDo7PIneWeJy8N+WGpUR5BfJIEK0ZZtCUqn3/3WLX4FYu fE6uiF/bACTpKXU/mo4dDbhZp439H/QdwQc9B0F8+8CBDMXKaNHrPV7kN36T2SWa fD4boikTsi/yh9Ks1fvHbvNq2N0ihoMnja+vLRyvjAcAQv2fKG3OZtYgFWSdghU= =Xknd -----END PGP SIGNATURE----- Merge tag 'v3.17-rc4' into perf/core, to pick up fixes Signed-off-by: Ingo Molnar <mingo@kernel.org>	2014-09-09 06:48:07 +02:00
Mel Gorman	59f6e2073c	percpu: Resolve ambiguities in __get_cpu_var/cpumask_var_t -fix A commit in linux-next was causing boot to fail and bisection identified the patch `4ba2968420` ("percpu: Resolve ambiguities in __get_cpu_var/cpumask_var_"). One of the changes in that patch looks very suspicious. Reverting the full patch fixes boot as does this fixlet. Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Christoph Lameter <cl@linux.com>	2014-09-09 07:17:43 +09:00
Andy Lutomirski	1dcf74f6ed	x86_64, entry: Use split-phase syscall_trace_enter for 64-bit syscalls On KVM on my box, this reduces the overhead from an always-accept seccomp filter from ~130ns to ~17ns. Most of that comes from avoiding IRET on every syscall when seccomp is enabled. In extremely approximate hacked-up benchmarking, just bypassing IRET saves about 80ns, so there's another 43ns of savings here from simplifying the seccomp path. The diffstat is also rather nice :) Signed-off-by: Andy Lutomirski <luto@amacapital.net> Link: http://lkml.kernel.org/r/a3dbd267ee990110478d349f78cccfdac5497a84.1409954077.git.luto@amacapital.net Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2014-09-08 14:14:12 -07:00
Andy Lutomirski	54eea9957f	x86_64, entry: Treat regs->ax the same in fastpath and slowpath syscalls For slowpath syscalls, we initialize regs->ax to -ENOSYS and stick the syscall number into regs->orig_ax prior to any possible tracing and syscall execution. This is user-visible ABI used by ptrace syscall emulation and seccomp. For fastpath syscalls, there's no good reason not to do the same thing. It's even slightly simpler than what we're currently doing. It probably has no measureable performance impact. It should have no user-visible effect. The purpose of this patch is to prepare for two-phase syscall tracing, in which the first phase might modify the saved RAX without leaving the fast path. This change is just subtle enough that I'm keeping it separate. Signed-off-by: Andy Lutomirski <luto@amacapital.net> Link: http://lkml.kernel.org/r/01218b493f12ae2f98034b78c9ae085e38e94350.1409954077.git.luto@amacapital.net Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2014-09-08 14:14:08 -07:00
Andy Lutomirski	e0ffbaabc4	x86: Split syscall_trace_enter into two phases This splits syscall_trace_enter into syscall_trace_enter_phase1 and syscall_trace_enter_phase2. Only phase 2 has full pt_regs, and only phase 2 is permitted to modify any of pt_regs except for orig_ax. The intent is that phase 1 can be called from the syscall fast path. In this implementation, phase1 can handle any combination of TIF_NOHZ (RCU context tracking), TIF_SECCOMP, and TIF_SYSCALL_AUDIT, unless seccomp requests a ptrace event, in which case phase2 is forced. In principle, this could yield a big speedup for TIF_NOHZ as well as for TIF_SECCOMP if syscall exit work were similarly split up. Signed-off-by: Andy Lutomirski <luto@amacapital.net> Link: http://lkml.kernel.org/r/2df320a600020fda055fccf2b668145729dd0c04.1409954077.git.luto@amacapital.net Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2014-09-08 14:14:03 -07:00
Andy Lutomirski	fd143b210e	x86, entry: Only call user_exit if TIF_NOHZ The RCU context tracking code requires that arch code call user_exit() on any entry into kernel code if TIF_NOHZ is set. This patch adds a check for TIF_NOHZ and a comment to the syscall entry tracing code. The main purpose of this patch is to make the code easier to follow: one can read the body of user_exit and of every function it calls without finding any explanation of why it's called for traced syscalls but not for untraced syscalls. This makes it clear when user_exit() is necessary. Cc: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Andy Lutomirski <luto@amacapital.net> Link: http://lkml.kernel.org/r/0b13e0e24ec0307d67ab7a23b58764f6b1270116.1409954077.git.luto@amacapital.net Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2014-09-08 14:13:59 -07:00
Andy Lutomirski	81f49a8fd7	x86, x32, audit: Fix x32's AUDIT_ARCH wrt audit is_compat_task() is the wrong check for audit arch; the check should be is_ia32_task(): x32 syscalls should be AUDIT_ARCH_X86_64, not AUDIT_ARCH_I386. CONFIG_AUDITSYSCALL is currently incompatible with x32, so this has no visible effect. Signed-off-by: Andy Lutomirski <luto@amacapital.net> Link: http://lkml.kernel.org/r/a0138ed8c709882aec06e4acc30bfa9b623b8717.1409954077.git.luto@amacapital.net Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2014-09-08 14:13:55 -07:00
Matt Fleming	9cb0e39423	x86/efi: Fixup GOT in all boot code paths Maarten reported that his Macbook pro 8.2 stopped booting after commit `f23cf8bd5c` ("efi/x86: efistub: Move shared dependencies to <asm/efi.h>"), the main feature of which is changing the visibility of symbol 'efi_early' from local to global. By making 'efi_early' global we end up requiring an entry in the Global Offset Table. Unfortunately, while we do include code to fixup GOT entries in the early boot code, it's only called after we've executed the EFI boot stub. What this amounts to is that references to 'efi_early' in the EFI boot stub don't point to the correct place. Since we've got multiple boot entry points we need to be prepared to fixup the GOT in multiple places, while ensuring that we never do it more than once, otherwise the GOT entries will still point to the wrong place. Reported-by: Maarten Lankhorst <maarten.lankhorst@canonical.com> Tested-by: Maarten Lankhorst <maarten.lankhorst@canonical.com> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Matt Fleming <matt.fleming@intel.com>	2014-09-08 20:52:02 +01:00
Yinghai Lu	47226ad4f4	x86/efi: Only load initrd above 4g on second try Mantas found that after commit `4bf7111f50` ("x86/efi: Support initrd loaded above 4G"), the kernel freezes at the earliest possible moment when trying to boot via UEFI on Asus laptop. Revert to old way to load initrd under 4G on first try, second try will use above 4G buffer when initrd is too big and does not fit under 4G. [ The cause of the freeze appears to be a firmware bug when reading file data into buffers above 4GB, though the exact reason is unknown. Mantas reports that the hang can be avoid if the file size is a multiple of 512 bytes, but I've seen some ASUS firmware simply corrupting the file data rather than freezing. Laszlo fixed an issue in the upstream EDK2 DiskIO code in Aug 2013 which may possibly be related, commit 4e39b75e ("MdeModulePkg/DiskIoDxe: fix source/destination pointer of overrun transfer"). Whatever the cause, it's unlikely that a fix will be forthcoming from the vendor, hence the workaround - Matt ] Cc: Laszlo Ersek <lersek@redhat.com> Reported-by: Mantas Mikulėnas <grawity@gmail.com> Reported-by: Harald Hoyer <harald@redhat.com> Tested-by: Anders Darander <anders@chargestorm.se> Tested-by: Calvin Walton <calvin.walton@kepstin.ca> Signed-off-by: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Matt Fleming <matt.fleming@intel.com>	2014-09-08 20:52:02 +01:00
Mathias Krause	8a5a5d1530	x86-64, ptdump: Mark espfix area only if existent We should classify the espfix area as such only if we actually have enabled the corresponding option. Otherwise the page table dump might look confusing. Signed-off-by: Mathias Krause <minipli@googlemail.com> Link: http://lkml.kernel.org/r/1410114629-24523-1-git-send-email-minipli@googlemail.com Cc: Arjan van de Ven <arjan.van.de.ven@intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2014-09-08 11:57:34 -07:00
David S. Miller	eb84d6b604	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net	2014-09-07 21:41:53 -07:00
Tejun Heo	908c7f1949	percpu_counter: add @gfp to percpu_counter_init() Percpu allocator now supports allocation mask. Add @gfp to percpu_counter_init() so that !GFP_KERNEL allocation masks can be used with percpu_counters too. We could have left percpu_counter_init() alone and added percpu_counter_init_gfp(); however, the number of users isn't that high and introducing _gfp variants to all percpu data structures would be quite ugly, so let's just do the conversion. This is the one with the most users. Other percpu data structures are a lot easier to convert. This patch doesn't make any functional difference. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Jan Kara <jack@suse.cz> Acked-by: "David S. Miller" <davem@davemloft.net> Cc: x86@kernel.org Cc: Jens Axboe <axboe@kernel.dk> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andrew Morton <akpm@linux-foundation.org>	2014-09-08 09:51:29 +09:00
Ingo Molnar	196cf35842	x86/tty/serial/8250: Clean up the asm/serial.h include file a bit - correct spelling - align fields vertically to make things more readable - make the layout of magic defines more obvious Cc: Mark Rustad <mark.d.rustad@intel.com> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Link: http://lkml.kernel.org/r/1409972149-26272-1-git-send-email-jeffrey.t.kirsher@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2014-09-06 10:20:55 +02:00
Mark Rustad	9ea029f12a	x86/tty/serial/8250: Resolve missing-field-initializers warnings Resolve some missing-field-initializers warnings by using designated initialization in the expansion of the SERIAL_PORT_DFNS macro. Signed-off-by: Mark Rustad <mark.d.rustad@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Link: http://lkml.kernel.org/r/1409972149-26272-1-git-send-email-jeffrey.t.kirsher@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2014-09-06 10:20:53 +02:00
Daniel Borkmann	60a3b2253c	net: bpf: make eBPF interpreter images read-only With eBPF getting more extended and exposure to user space is on it's way, hardening the memory range the interpreter uses to steer its command flow seems appropriate. This patch moves the to be interpreted bytecode to read-only pages. In case we execute a corrupted BPF interpreter image for some reason e.g. caused by an attacker which got past a verifier stage, it would not only provide arbitrary read/write memory access but arbitrary function calls as well. After setting up the BPF interpreter image, its contents do not change until destruction time, thus we can setup the image on immutable made pages in order to mitigate modifications to that code. The idea is derived from commit `314beb9bca` ("x86: bpf_jit_comp: secure bpf jit against spraying attacks"). This is possible because bpf_prog is not part of sk_filter anymore. After setup bpf_prog cannot be altered during its life-time. This prevents any modifications to the entire bpf_prog structure (incl. function/JIT image pointer). Every eBPF program (including classic BPF that are migrated) have to call bpf_prog_select_runtime() to select either interpreter or a JIT image as a last setup step, and they all are being freed via bpf_prog_free(), including non-JIT. Therefore, we can easily integrate this into the eBPF life-time, plus since we directly allocate a bpf_prog, we have no performance penalty. Tested with seccomp and test_bpf testsuite in JIT/non-JIT mode and manual inspection of kernel_page_tables. Brad Spengler proposed the same idea via Twitter during development of this patch. Joint work with Hannes Frederic Sowa. Suggested-by: Brad Spengler <spender@grsecurity.net> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Cc: Alexei Starovoitov <ast@plumgrid.com> Cc: Kees Cook <keescook@chromium.org> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-09-05 12:02:48 -07:00
Paolo Bonzini	54987b7afa	KVM: x86: propagate exception from permission checks on the nested page fault Currently, if a permission error happens during the translation of the final GPA to HPA, walk_addr_generic returns 0 but does not fill in walker->fault. To avoid this, add an x86_exception* argument to the translate_gpa function, and let it fill in walker->fault. The nested_page_fault field will be true, since the walk_mmu is the nested_mmu and translate_gpu instead operates on the "outer" (NPT) instance. Reported-by: Valentine Sinitsyn <valentine.sinitsyn@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2014-09-05 12:01:13 +02:00
Paolo Bonzini	ef54bcfeea	KVM: x86: skip writeback on injection of nested exception If a nested page fault happens during emulation, we will inject a vmexit, not a page fault. However because writeback happens after the injection, we will write ctxt->eip from L2 into the L1 EIP. We do not write back if an instruction caused an interception vmexit---do the same for page faults. Suggested-by: Gleb Natapov <gleb@kernel.org> Reviewed-by: Gleb Natapov <gleb@kernel.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2014-09-05 12:01:06 +02:00
Andy Lutomirski	a4412fc948	seccomp,x86,arm,mips,s390: Remove nr parameter from secure_computing The secure_computing function took a syscall number parameter, but it only paid any attention to that parameter if seccomp mode 1 was enabled. Rather than coming up with a kludge to get the parameter to work in mode 2, just remove the parameter. To avoid churn in arches that don't have seccomp filters (and may not even support syscall_get_nr right now), this leaves the parameter in secure_computing_strict, which is now a real function. For ARM, this is a bit ugly due to the fact that ARM conditionally supports seccomp filters. Fixing that would probably only be a couple of lines of code, but it should be coordinated with the audit maintainers. This will be a slight slowdown on some arches. The right fix is to pass in all of seccomp_data instead of trying to make just the syscall nr part be fast. This is a prerequisite for making two-phase seccomp work cleanly. Cc: Russell King <linux@arm.linux.org.uk> Cc: linux-arm-kernel@lists.infradead.org Cc: Ralf Baechle <ralf@linux-mips.org> Cc: linux-mips@linux-mips.org Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: linux-s390@vger.kernel.org Cc: x86@kernel.org Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Kees Cook <keescook@chromium.org>	2014-09-03 14:58:17 -07:00
Paolo Bonzini	5e35251951	KVM: nSVM: propagate the NPF EXITINFO to the guest This is similar to what the EPT code does with the exit qualification. This allows the guest to see a valid value for bits 33:32. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2014-09-03 10:18:54 +02:00
Paolo Bonzini	a0c0feb579	KVM: x86: reserve bit 8 of non-leaf PDPEs and PML4Es in 64-bit mode on AMD Bit 8 would be the "global" bit, which does not quite make sense for non-leaf page table entries. Intel ignores it; AMD ignores it in PDEs, but reserves it in PDPEs and PML4Es. The SVM test is relying on this behavior, so enforce it. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2014-09-03 10:04:11 +02:00
Tiejun Chen	d143148383	KVM: mmio: cleanup kvm_set_mmio_spte_mask Just reuse rsvd_bits() inside kvm_set_mmio_spte_mask() for slightly better code. Signed-off-by: Tiejun Chen <tiejun.chen@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2014-09-03 10:04:10 +02:00
David Matlack	56f17dd3fb	kvm: x86: fix stale mmio cache bug The following events can lead to an incorrect KVM_EXIT_MMIO bubbling up to userspace: (1) Guest accesses gpa X without a memory slot. The gfn is cached in struct kvm_vcpu_arch (mmio_gfn). On Intel EPT-enabled hosts, KVM sets the SPTE write-execute-noread so that future accesses cause EPT_MISCONFIGs. (2) Host userspace creates a memory slot via KVM_SET_USER_MEMORY_REGION covering the page just accessed. (3) Guest attempts to read or write to gpa X again. On Intel, this generates an EPT_MISCONFIG. The memory slot generation number that was incremented in (2) would normally take care of this but we fast path mmio faults through quickly_check_mmio_pf(), which only checks the per-vcpu mmio cache. Since we hit the cache, KVM passes a KVM_EXIT_MMIO up to userspace. This patch fixes the issue by using the memslot generation number to validate the mmio cache. Cc: stable@vger.kernel.org Signed-off-by: David Matlack <dmatlack@google.com> [xiaoguangrong: adjust the code to make it simpler for stable-tree fix.] Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Reviewed-by: David Matlack <dmatlack@google.com> Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Tested-by: David Matlack <dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2014-09-03 10:03:42 +02:00
David Matlack	ee3d1570b5	kvm: fix potentially corrupt mmio cache vcpu exits and memslot mutations can run concurrently as long as the vcpu does not aquire the slots mutex. Thus it is theoretically possible for memslots to change underneath a vcpu that is handling an exit. If we increment the memslot generation number again after synchronize_srcu_expedited(), vcpus can safely cache memslot generation without maintaining a single rcu_dereference through an entire vm exit. And much of the x86/kvm code does not maintain a single rcu_dereference of the current memslots during each exit. We can prevent the following case: vcpu (CPU 0) \| thread (CPU 1) --------------------------------------------+-------------------------- 1 vm exit \| 2 srcu_read_unlock(&kvm->srcu) \| 3 decide to cache something based on \| old memslots \| 4 \| change memslots \| (increments generation) 5 \| synchronize_srcu(&kvm->srcu); 6 retrieve generation # from new memslots \| 7 tag cache with new memslot generation \| 8 srcu_read_unlock(&kvm->srcu) \| ... \| <action based on cache occurs even \| though the caching decision was based \| on the old memslots> \| ... \| <action continues to occur until next \| memslot generation change, which may \| be never> \| \| By incrementing the generation after synchronizing with kvm->srcu readers, we ensure that the generation retrieved in (6) will become invalid soon after (8). Keeping the existing increment is not strictly necessary, but we do keep it and just move it for consistency from update_memslots to install_new_memslots. It invalidates old cached MMIOs immediately, instead of having to wait for the end of synchronize_srcu_expedited, which makes the code more clearly correct in case CPU 1 is preempted right after synchronize_srcu() returns. To avoid halving the generation space in SPTEs, always presume that the low bit of the generation is zero when reconstructing a generation number out of an SPTE. This effectively disables MMIO caching in SPTEs during the call to synchronize_srcu_expedited. Using the low bit this way is somewhat like a seqcount---where the protected thing is a cache, and instead of retrying we can simply punt if we observe the low bit to be 1. Cc: stable@vger.kernel.org Signed-off-by: David Matlack <dmatlack@google.com> Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Reviewed-by: David Matlack <dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2014-09-03 10:03:41 +02:00
Paolo Bonzini	00f034a12f	KVM: do not bias the generation number in kvm_current_mmio_generation The next patch will give a meaning (a la seqcount) to the low bit of the generation number. Ensure that it matches between kvm->memslots->generation and kvm_current_mmio_generation(). Cc: stable@vger.kernel.org Reviewed-by: David Matlack <dmatlack@google.com> Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2014-09-03 10:03:35 +02:00
Oleg Nesterov	6f46b3aef0	x86: copy_thread: Don't nullify ->ptrace_bps twice Both 32bit and 64bit versions of copy_thread() do memset(ptrace_bps) twice for no reason, kill the 2nd memset(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Link: http://lkml.kernel.org/r/20140902175733.GA21676@redhat.com Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2014-09-02 14:51:17 -07:00

... 2 3 4 5 6 ...

20056 commits