|   | Nested VMX | ||
|  | ========== | ||
|  | 
 | ||
|  | Overview | ||
|  | --------- | ||
|  | 
 | ||
|  | On Intel processors, KVM uses Intel's VMX (Virtual-Machine eXtensions) | ||
|  | to easily and efficiently run guest operating systems. Normally, these guests | ||
|  | *cannot* themselves be hypervisors running their own guests, because in VMX, | ||
|  | guests cannot use VMX instructions. | ||
|  | 
 | ||
|  | The "Nested VMX" feature adds this missing capability: running guest | ||
|  | hypervisors (which use VMX) with their own nested guests. It does so by | ||
|  | allowing a guest to use VMX instructions, and correctly and efficiently | ||
|  | emulating them using the single level of VMX available in the hardware. | ||
|  | 
 | ||
|  | We describe in much greater detail the theory behind the nested VMX feature, | ||
|  | its implementation and its performance characteristics, in the OSDI 2010 paper | ||
|  | "The Turtles Project: Design and Implementation of Nested Virtualization", | ||
|  | available at: | ||
|  | 
 | ||
|  | 	http://www.usenix.org/events/osdi10/tech/full_papers/Ben-Yehuda.pdf | ||
|  | 
 | ||
|  | 
 | ||
|  | Terminology | ||
|  | ----------- | ||
|  | 
 | ||
|  | Single-level virtualization has two levels - the host (KVM) and the guests. | ||
|  | In nested virtualization, we have three levels: the host (KVM), which we call | ||
|  | L0, the guest hypervisor, which we call L1, and its nested guest, which we | ||
|  | call L2. | ||
|  | 
 | ||
|  | 
 | ||
|  | Known limitations | ||
|  | ----------------- | ||
|  | 
 | ||
|  | The current code supports running Linux guests under KVM guests. | ||
|  | Only 64-bit guest hypervisors are supported. | ||
|  | 
 | ||
|  | Additional patches for running Windows under guest KVM, and Linux under | ||
|  | guest VMware server, and support for nested EPT, are currently running in | ||
|  | the lab, and will be sent as follow-on patchsets. | ||
|  | 
 | ||
|  | 
 | ||
|  | Running nested VMX | ||
|  | ------------------ | ||
|  | 
 | ||
|  | The nested VMX feature is disabled by default. It can be enabled by giving | ||
|  | the "nested=1" option to the kvm-intel module. | ||
|  | 
 | ||
|  | No modifications are required to user space (qemu). However, qemu's default | ||
|  | emulated CPU type (qemu64) does not list the "VMX" CPU feature, so it must be | ||
|  | explicitly enabled, by giving qemu one of the following options: | ||
|  | 
 | ||
|  |      -cpu host              (emulated CPU has all features of the real CPU) | ||
|  | 
 | ||
|  |      -cpu qemu64,+vmx       (add just the vmx feature to a named CPU type) | ||
|  | 
 | ||
|  | 
 | ||
|  | ABIs | ||
|  | ---- | ||
|  | 
 | ||
|  | Nested VMX aims to present a standard and (eventually) fully-functional VMX | ||
|  | implementation for a guest hypervisor to use. As such, the official | ||
|  | specification of the ABI that it provides is Intel's VMX specification, | ||
|  | namely volume 3B of their "Intel 64 and IA-32 Architectures Software | ||
|  | Developer's Manual". Not all of VMX's features are currently fully supported, | ||
|  | but the goal is to eventually support them all, starting with the VMX features | ||
|  | which are used in practice by popular hypervisors (KVM and others). | ||
|  | 
 | ||
|  | As a VMX implementation, nested VMX presents a VMCS structure to L1. | ||
|  | As mandated by the spec, other than the two fields revision_id and abort, | ||
|  | this structure is *opaque* to its user, who is not supposed to know or care | ||
|  | about its internal structure. Rather, the structure is accessed through the | ||
|  | VMREAD and VMWRITE instructions. | ||
|  | Still, for debugging purposes, KVM developers might be interested to know the | ||
|  | internals of this structure; this is struct vmcs12 from arch/x86/kvm/vmx.c. | ||
|  | 
 | ||
|  | The name "vmcs12" refers to the VMCS that L1 builds for L2. In the code we | ||
|  | also have "vmcs01", the VMCS that L0 built for L1, and "vmcs02", the VMCS | ||
|  | which L0 builds to actually run L2. How this is done is explained in the | ||
|  | aforementioned paper. | ||
|  | 
 | ||
|  | For convenience, we repeat the content of struct vmcs12 here. If the internals | ||
|  | of this structure change, this can break live migration across KVM versions. | ||
|  | VMCS12_REVISION (from vmx.c) should be changed if struct vmcs12 or its inner | ||
|  | struct shadow_vmcs is ever changed. | ||
|  | 
 | ||
|  | 	typedef u64 natural_width; | ||
|  | 	struct __packed vmcs12 { | ||
|  | 		/* According to the Intel spec, a VMCS region must start with | ||
|  | 		 * these two user-visible fields */ | ||
|  | 		u32 revision_id; | ||
|  | 		u32 abort; | ||
|  | 
 | ||
|  | 		u32 launch_state; /* set to 0 by VMCLEAR, to 1 by VMLAUNCH */ | ||
|  | 		u32 padding[7]; /* room for future expansion */ | ||
|  | 
 | ||
|  | 		u64 io_bitmap_a; | ||
|  | 		u64 io_bitmap_b; | ||
|  | 		u64 msr_bitmap; | ||
|  | 		u64 vm_exit_msr_store_addr; | ||
|  | 		u64 vm_exit_msr_load_addr; | ||
|  | 		u64 vm_entry_msr_load_addr; | ||
|  | 		u64 tsc_offset; | ||
|  | 		u64 virtual_apic_page_addr; | ||
|  | 		u64 apic_access_addr; | ||
|  | 		u64 ept_pointer; | ||
|  | 		u64 guest_physical_address; | ||
|  | 		u64 vmcs_link_pointer; | ||
|  | 		u64 guest_ia32_debugctl; | ||
|  | 		u64 guest_ia32_pat; | ||
|  | 		u64 guest_ia32_efer; | ||
|  | 		u64 guest_pdptr0; | ||
|  | 		u64 guest_pdptr1; | ||
|  | 		u64 guest_pdptr2; | ||
|  | 		u64 guest_pdptr3; | ||
|  | 		u64 host_ia32_pat; | ||
|  | 		u64 host_ia32_efer; | ||
|  | 		u64 padding64[8]; /* room for future expansion */ | ||
|  | 		natural_width cr0_guest_host_mask; | ||
|  | 		natural_width cr4_guest_host_mask; | ||
|  | 		natural_width cr0_read_shadow; | ||
|  | 		natural_width cr4_read_shadow; | ||
|  | 		natural_width cr3_target_value0; | ||
|  | 		natural_width cr3_target_value1; | ||
|  | 		natural_width cr3_target_value2; | ||
|  | 		natural_width cr3_target_value3; | ||
|  | 		natural_width exit_qualification; | ||
|  | 		natural_width guest_linear_address; | ||
|  | 		natural_width guest_cr0; | ||
|  | 		natural_width guest_cr3; | ||
|  | 		natural_width guest_cr4; | ||
|  | 		natural_width guest_es_base; | ||
|  | 		natural_width guest_cs_base; | ||
|  | 		natural_width guest_ss_base; | ||
|  | 		natural_width guest_ds_base; | ||
|  | 		natural_width guest_fs_base; | ||
|  | 		natural_width guest_gs_base; | ||
|  | 		natural_width guest_ldtr_base; | ||
|  | 		natural_width guest_tr_base; | ||
|  | 		natural_width guest_gdtr_base; | ||
|  | 		natural_width guest_idtr_base; | ||
|  | 		natural_width guest_dr7; | ||
|  | 		natural_width guest_rsp; | ||
|  | 		natural_width guest_rip; | ||
|  | 		natural_width guest_rflags; | ||
|  | 		natural_width guest_pending_dbg_exceptions; | ||
|  | 		natural_width guest_sysenter_esp; | ||
|  | 		natural_width guest_sysenter_eip; | ||
|  | 		natural_width host_cr0; | ||
|  | 		natural_width host_cr3; | ||
|  | 		natural_width host_cr4; | ||
|  | 		natural_width host_fs_base; | ||
|  | 		natural_width host_gs_base; | ||
|  | 		natural_width host_tr_base; | ||
|  | 		natural_width host_gdtr_base; | ||
|  | 		natural_width host_idtr_base; | ||
|  | 		natural_width host_ia32_sysenter_esp; | ||
|  | 		natural_width host_ia32_sysenter_eip; | ||
|  | 		natural_width host_rsp; | ||
|  | 		natural_width host_rip; | ||
|  | 		natural_width paddingl[8]; /* room for future expansion */ | ||
|  | 		u32 pin_based_vm_exec_control; | ||
|  | 		u32 cpu_based_vm_exec_control; | ||
|  | 		u32 exception_bitmap; | ||
|  | 		u32 page_fault_error_code_mask; | ||
|  | 		u32 page_fault_error_code_match; | ||
|  | 		u32 cr3_target_count; | ||
|  | 		u32 vm_exit_controls; | ||
|  | 		u32 vm_exit_msr_store_count; | ||
|  | 		u32 vm_exit_msr_load_count; | ||
|  | 		u32 vm_entry_controls; | ||
|  | 		u32 vm_entry_msr_load_count; | ||
|  | 		u32 vm_entry_intr_info_field; | ||
|  | 		u32 vm_entry_exception_error_code; | ||
|  | 		u32 vm_entry_instruction_len; | ||
|  | 		u32 tpr_threshold; | ||
|  | 		u32 secondary_vm_exec_control; | ||
|  | 		u32 vm_instruction_error; | ||
|  | 		u32 vm_exit_reason; | ||
|  | 		u32 vm_exit_intr_info; | ||
|  | 		u32 vm_exit_intr_error_code; | ||
|  | 		u32 idt_vectoring_info_field; | ||
|  | 		u32 idt_vectoring_error_code; | ||
|  | 		u32 vm_exit_instruction_len; | ||
|  | 		u32 vmx_instruction_info; | ||
|  | 		u32 guest_es_limit; | ||
|  | 		u32 guest_cs_limit; | ||
|  | 		u32 guest_ss_limit; | ||
|  | 		u32 guest_ds_limit; | ||
|  | 		u32 guest_fs_limit; | ||
|  | 		u32 guest_gs_limit; | ||
|  | 		u32 guest_ldtr_limit; | ||
|  | 		u32 guest_tr_limit; | ||
|  | 		u32 guest_gdtr_limit; | ||
|  | 		u32 guest_idtr_limit; | ||
|  | 		u32 guest_es_ar_bytes; | ||
|  | 		u32 guest_cs_ar_bytes; | ||
|  | 		u32 guest_ss_ar_bytes; | ||
|  | 		u32 guest_ds_ar_bytes; | ||
|  | 		u32 guest_fs_ar_bytes; | ||
|  | 		u32 guest_gs_ar_bytes; | ||
|  | 		u32 guest_ldtr_ar_bytes; | ||
|  | 		u32 guest_tr_ar_bytes; | ||
|  | 		u32 guest_interruptibility_info; | ||
|  | 		u32 guest_activity_state; | ||
|  | 		u32 guest_sysenter_cs; | ||
|  | 		u32 host_ia32_sysenter_cs; | ||
|  | 		u32 padding32[8]; /* room for future expansion */ | ||
|  | 		u16 virtual_processor_id; | ||
|  | 		u16 guest_es_selector; | ||
|  | 		u16 guest_cs_selector; | ||
|  | 		u16 guest_ss_selector; | ||
|  | 		u16 guest_ds_selector; | ||
|  | 		u16 guest_fs_selector; | ||
|  | 		u16 guest_gs_selector; | ||
|  | 		u16 guest_ldtr_selector; | ||
|  | 		u16 guest_tr_selector; | ||
|  | 		u16 host_es_selector; | ||
|  | 		u16 host_cs_selector; | ||
|  | 		u16 host_ss_selector; | ||
|  | 		u16 host_ds_selector; | ||
|  | 		u16 host_fs_selector; | ||
|  | 		u16 host_gs_selector; | ||
|  | 		u16 host_tr_selector; | ||
|  | 	}; | ||
|  | 
 | ||
|  | 
 | ||
|  | Authors | ||
|  | ------- | ||
|  | 
 | ||
|  | These patches were written by: | ||
|  |      Abel Gordon, abelg <at> il.ibm.com | ||
|  |      Nadav Har'El, nyh <at> il.ibm.com | ||
|  |      Orit Wasserman, oritw <at> il.ibm.com | ||
|  |      Ben-Ami Yassor, benami <at> il.ibm.com | ||
|  |      Muli Ben-Yehuda, muli <at> il.ibm.com | ||
|  | 
 | ||
|  | With contributions by: | ||
|  |      Anthony Liguori, aliguori <at> us.ibm.com | ||
|  |      Mike Day, mdday <at> us.ibm.com | ||
|  |      Michael Factor, factor <at> il.ibm.com | ||
|  |      Zvi Dubitzky, dubi <at> il.ibm.com | ||
|  | 
 | ||
|  | And valuable reviews by: | ||
|  |      Avi Kivity, avi <at> redhat.com | ||
|  |      Gleb Natapov, gleb <at> redhat.com | ||
|  |      Marcelo Tosatti, mtosatti <at> redhat.com | ||
|  |      Kevin Tian, kevin.tian <at> intel.com | ||
|  |      and others. |