summaryrefslogtreecommitdiff
path: root/arch/x86
AgeCommit message (Collapse)Author
2024-07-16Merge tag 'kvm-x86-generic-6.11' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM generic changes for 6.11 - Enable halt poll shrinking by default, as Intel found it to be a clear win. - Setup empty IRQ routing when creating a VM to avoid having to synchronize SRCU when creating a split IRQCHIP on x86. - Rework the sched_in/out() paths to replace kvm_arch_sched_in() with a flag that arch code can use for hooking both sched_in() and sched_out(). - Take the vCPU @id as an "unsigned long" instead of "u32" to avoid truncating a bogus value from userspace, e.g. to help userspace detect bugs. - Mark a vCPU as preempted if and only if it's scheduled out while in the KVM_RUN loop, e.g. to avoid marking it preempted and thus writing guest memory when retrieving guest state during live migration blackout. - A few minor cleanups
2024-07-16Merge tag 'kvm-x86-fixes-6.10-11' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM Xen: Fix a bug where KVM fails to check the validity of an incoming userspace virtual address and tries to activate a gfn_to_pfn_cache with a kernel address.
2024-07-12Merge tag 'loongarch-kvm-6.11' of ↵Paolo Bonzini
git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson into HEAD LoongArch KVM changes for v6.11 1. Add ParaVirt steal time support. 2. Add some VM migration enhancement. 3. Add perf kvm-stat support for loongarch.
2024-07-12Merge branch 'kvm-prefault' into HEADPaolo Bonzini
Pre-population has been requested several times to mitigate KVM page faults during guest boot or after live migration. It is also required by TDX before filling in the initial guest memory with measured contents. Introduce it as a generic API.
2024-07-12KVM: x86: Implement kvm_arch_vcpu_pre_fault_memory()Paolo Bonzini
Wire KVM_PRE_FAULT_MEMORY ioctl to kvm_mmu_do_page_fault() to populate guest memory. It can be called right after KVM_CREATE_VCPU creates a vCPU, since at that point kvm_mmu_create() and kvm_init_mmu() are called and the vCPU is ready to invoke the KVM page fault handler. The helper function kvm_tdp_map_page() takes care of the logic to process RET_PF_* return values and convert them to success or errno. Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Message-ID: <9b866a0ae7147f96571c439e75429a03dcb659b6.1712785629.git.isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-07-12KVM: x86/mmu: Make kvm_mmu_do_page_fault() return mapped levelPaolo Bonzini
The guest memory population logic will need to know what page size or level (4K, 2M, ...) is mapped. Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Message-ID: <eabc3f3e5eb03b370cadf6e1901ea34d7a020adc.1712785629.git.isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-07-12KVM: x86/mmu: Account pf_{fixed,emulate,spurious} in callers of "do page fault"Sean Christopherson
Move the accounting of the result of kvm_mmu_do_page_fault() to its callers, as only pf_fixed is common to guest page faults and async #PFs, and upcoming support KVM_PRE_FAULT_MEMORY won't bump _any_ stats. Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-07-12KVM: x86/mmu: Bump pf_taken stat only in the "real" page fault handlerSean Christopherson
Account stat.pf_taken in kvm_mmu_page_fault(), i.e. the actual page fault handler, instead of conditionally bumping it in kvm_mmu_do_page_fault(). The "real" page fault handler is the only path that should ever increment the number of taken page faults, as all other paths that "do page fault" are by definition not handling faults that occurred in the guest. Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-06-30x86-32: fix cmpxchg8b_emu build error with clangLinus Torvalds
The kernel test robot reported that clang no longer compiles the 32-bit x86 kernel in some configurations due to commit 95ece48165c1 ("locking/atomic/x86: Rewrite x86_32 arch_atomic64_{,fetch}_{and,or,xor}() functions"). The build fails with arch/x86/include/asm/cmpxchg_32.h:149:9: error: inline assembly requires more registers than available and the reason seems to be that not only does the cmpxchg8b instruction need four fixed registers (EDX:EAX and ECX:EBX), with the emulation fallback the inline asm also wants a fifth fixed register for the address (it uses %esi for that, but that's just a software convention with cmpxchg8b_emu). Avoiding using another pointer input to the asm (and just forcing it to use the "0(%esi)" addressing that we end up requiring for the sw fallback) seems to fix the issue. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202406230912.F6XFIyA6-lkp@intel.com/ Fixes: 95ece48165c1 ("locking/atomic/x86: Rewrite x86_32 arch_atomic64_{,fetch}_{and,or,xor}() functions") Link: https://lore.kernel.org/all/202406230912.F6XFIyA6-lkp@intel.com/ Suggested-by: Uros Bizjak <ubizjak@gmail.com> Reviewed-and-Tested-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-06-28Merge tag 'hardening-v6.10-rc6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull hardening fixes from Kees Cook: - Remove invalid tty __counted_by annotation (Nathan Chancellor) - Add missing MODULE_DESCRIPTION()s for KUnit string tests (Jeff Johnson) - Remove non-functional per-arch kstack entropy filtering * tag 'hardening-v6.10-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: tty: mxser: Remove __counted_by from mxser_board.ports[] randomize_kstack: Remove non-functional per-arch entropy filtering string: kunit: add missing MODULE_DESCRIPTION() macros
2024-06-28x86: stop playing stack games in profile_pc()Linus Torvalds
The 'profile_pc()' function is used for timer-based profiling, which isn't really all that relevant any more to begin with, but it also ends up making assumptions based on the stack layout that aren't necessarily valid. Basically, the code tries to account the time spent in spinlocks to the caller rather than the spinlock, and while I support that as a concept, it's not worth the code complexity or the KASAN warnings when no serious profiling is done using timers anyway these days. And the code really does depend on stack layout that is only true in the simplest of cases. We've lost the comment at some point (I think when the 32-bit and 64-bit code was unified), but it used to say: Assume the lock function has either no stack frame or a copy of eflags from PUSHF. which explains why it just blindly loads a word or two straight off the stack pointer and then takes a minimal look at the values to just check if they might be eflags or the return pc: Eflags always has bits 22 and up cleared unlike kernel addresses but that basic stack layout assumption assumes that there isn't any lock debugging etc going on that would complicate the code and cause a stack frame. It causes KASAN unhappiness reported for years by syzkaller [1] and others [2]. With no real practical reason for this any more, just remove the code. Just for historical interest, here's some background commits relating to this code from 2006: 0cb91a229364 ("i386: Account spinlocks to the caller during profiling for !FP kernels") 31679f38d886 ("Simplify profile_pc on x86-64") and a code unification from 2009: ef4512882dbe ("x86: time_32/64.c unify profile_pc") but the basics of this thing actually goes back to before the git tree. Link: https://syzkaller.appspot.com/bug?extid=84fe685c02cd112a2ac3 [1] Link: https://lore.kernel.org/all/CAK55_s7Xyq=nh97=K=G1sxueOFrJDAvPOJAL4TPTCAYvmxO9_A@mail.gmail.com/ [2] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-06-28randomize_kstack: Remove non-functional per-arch entropy filteringKees Cook
An unintended consequence of commit 9c573cd31343 ("randomize_kstack: Improve entropy diffusion") was that the per-architecture entropy size filtering reduced how many bits were being added to the mix, rather than how many bits were being used during the offsetting. All architectures fell back to the existing default of 0x3FF (10 bits), which will consume at most 1KiB of stack space. It seems that this is working just fine, so let's avoid the confusion and update everything to use the default. The prior intent of the per-architecture limits were: arm64: capped at 0x1FF (9 bits), 5 bits effective powerpc: uncapped (10 bits), 6 or 7 bits effective riscv: uncapped (10 bits), 6 bits effective x86: capped at 0xFF (8 bits), 5 (x86_64) or 6 (ia32) bits effective s390: capped at 0xFF (8 bits), undocumented effective entropy Current discussion has led to just dropping the original per-architecture filters. The additional entropy appears to be safe for arm64, x86, and s390. Quoting Arnd, "There is no point pretending that 15.75KB is somehow safe to use while 15.00KB is not." Co-developed-by: Yuntao Liu <liuyuntao12@huawei.com> Signed-off-by: Yuntao Liu <liuyuntao12@huawei.com> Fixes: 9c573cd31343 ("randomize_kstack: Improve entropy diffusion") Link: https://lore.kernel.org/r/20240617133721.377540-1-liuyuntao12@huawei.com Reviewed-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Mark Rutland <mark.rutland@arm.com> Acked-by: Heiko Carstens <hca@linux.ibm.com> # s390 Link: https://lore.kernel.org/r/20240619214711.work.953-kees@kernel.org Signed-off-by: Kees Cook <kees@kernel.org>
2024-06-28KVM: Validate hva in kvm_gpc_activate_hva() to fix __kvm_gpc_refresh() WARNPei Li
Check that the virtual address is "ok" when activating a gfn_to_pfn_cache with a host VA to ensure that KVM never attempts to use a bad address. This fixes a bug where KVM fails to check the incoming address when handling KVM_XEN_VCPU_ATTR_TYPE_VCPU_INFO_HVA in kvm_xen_vcpu_set_attr(). Reported-by: syzbot+fd555292a1da3180fc82@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=fd555292a1da3180fc82 Tested-by: syzbot+fd555292a1da3180fc82@syzkaller.appspotmail.com Signed-off-by: Pei Li <peili.dev@gmail.com> Reviewed-by: Paul Durrant <paul@xen.org> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Link: https://lore.kernel.org/r/20240627-bug5-v2-1-2c63f7ee6739@gmail.com [sean: rewrite changelog with --verbose] Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-25syscalls: fix compat_sys_io_pgetevents_time64 usageArnd Bergmann
Using sys_io_pgetevents() as the entry point for compat mode tasks works almost correctly, but misses the sign extension for the min_nr and nr arguments. This was addressed on parisc by switching to compat_sys_io_pgetevents_time64() in commit 6431e92fc827 ("parisc: io_pgetevents_time64() needs compat syscall in 32-bit compat mode"), as well as by using more sophisticated system call wrappers on x86 and s390. However, arm64, mips, powerpc, sparc and riscv still have the same bug. Change all of them over to use compat_sys_io_pgetevents_time64() like parisc already does. This was clearly the intention when the function was originally added, but it got hooked up incorrectly in the tables. Cc: stable@vger.kernel.org Fixes: 48166e6ea47d ("y2038: add 64-bit time_t syscalls to all 32-bit architectures") Acked-by: Heiko Carstens <hca@linux.ibm.com> # s390 Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2024-06-23Merge tag 'x86_urgent_for_v6.10_rc5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Borislav Petkov: - An ARM-relevant fix to not free default RMIDs of a resource control group - A randconfig build fix for the VMware virtual GPU driver * tag 'x86_urgent_for_v6.10_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/resctrl: Don't try to free nonexistent RMIDs drm/vmwgfx: Fix missing HYPERVISOR_GUEST dependency
2024-06-22Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull kvm fixes from Paolo Bonzini: "ARM: - Fix dangling references to a redistributor region if the vgic was prematurely destroyed. - Properly mark FFA buffers as released, ensuring that both parties can make forward progress. x86: - Allow getting/setting MSRs for SEV-ES guests, if they're using the pre-6.9 KVM_SEV_ES_INIT API. - Always sync pending posted interrupts to the IRR prior to IOAPIC route updates, so that EOIs are intercepted properly if the old routing table requested that. Generic: - Avoid __fls(0) - Fix reference leak on hwpoisoned page - Fix a race in kvm_vcpu_on_spin() by ensuring loads and stores are atomic. - Fix bug in __kvm_handle_hva_range() where KVM calls a function pointer that was intended to be a marker only (nothing bad happens but kind of a mine and also technically undefined behavior) - Do not bother accounting allocations that are small and freed before getting back to userspace. Selftests: - Fix compilation for RISC-V. - Fix a "shift too big" goof in the KVM_SEV_INIT2 selftest. - Compute the max mappable gfn for KVM selftests on x86 using GuestMaxPhyAddr from KVM's supported CPUID (if it's available)" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: SEV-ES: Fix svm_get_msr()/svm_set_msr() for KVM_SEV_ES_INIT guests KVM: Discard zero mask with function kvm_dirty_ring_reset virt: guest_memfd: fix reference leak on hwpoisoned page kvm: do not account temporary allocations to kmem MAINTAINERS: Drop Wanpeng Li as a Reviewer for KVM Paravirt support KVM: x86: Always sync PIR to IRR prior to scanning I/O APIC routes KVM: Stop processing *all* memslots when "null" mmu_notifier handler is found KVM: arm64: FFA: Release hyp rx buffer KVM: selftests: Fix RISC-V compilation KVM: arm64: Disassociate vcpus from redistributor region on teardown KVM: Fix a data race on last_boosted_vcpu in kvm_vcpu_on_spin() KVM: selftests: x86: Prioritize getting max_gfn from GuestPhysBits KVM: selftests: Fix shift of 32 bit unsigned int more than 32 bits
2024-06-21KVM: SEV-ES: Fix svm_get_msr()/svm_set_msr() for KVM_SEV_ES_INIT guestsMichael Roth
With commit 27bd5fdc24c0 ("KVM: SEV-ES: Prevent MSR access post VMSA encryption"), older VMMs like QEMU 9.0 and older will fail when booting SEV-ES guests with something like the following error: qemu-system-x86_64: error: failed to get MSR 0x174 qemu-system-x86_64: ../qemu.git/target/i386/kvm/kvm.c:3950: kvm_get_msrs: Assertion `ret == cpu->kvm_msr_buf->nmsrs' failed. This is because older VMMs that might still call svm_get_msr()/svm_set_msr() for SEV-ES guests after guest boot even if those interfaces were essentially just noops because of the vCPU state being encrypted and stored separately in the VMSA. Now those VMMs will get an -EINVAL and generally crash. Newer VMMs that are aware of KVM_SEV_INIT2 however are already aware of the stricter limitations of what vCPU state can be sync'd during guest run-time, so newer QEMU for instance will work both for legacy KVM_SEV_ES_INIT interface as well as KVM_SEV_INIT2. So when using KVM_SEV_INIT2 it's okay to assume userspace can deal with -EINVAL, whereas for legacy KVM_SEV_ES_INIT the kernel might be dealing with either an older VMM and so it needs to assume that returning -EINVAL might break the VMM. Address this by only returning -EINVAL if the guest was started with KVM_SEV_INIT2. Otherwise, just silently return. Cc: Ravi Bangoria <ravi.bangoria@amd.com> Cc: Nikunj A Dadhania <nikunj@amd.com> Reported-by: Srikanth Aithal <sraithal@amd.com> Closes: https://lore.kernel.org/lkml/37usuu4yu4ok7be2hqexhmcyopluuiqj3k266z4gajc2rcj4yo@eujb23qc3zcm/ Fixes: 27bd5fdc24c0 ("KVM: SEV-ES: Prevent MSR access post VMSA encryption") Signed-off-by: Michael Roth <michael.roth@amd.com> Message-ID: <20240604233510.764949-1-michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-06-20KVM: x86/tdp_mmu: Take a GFN in kvm_tdp_mmu_fast_pf_get_last_sptep()Rick Edgecombe
Pass fault->gfn into kvm_tdp_mmu_fast_pf_get_last_sptep(), instead of passing fault->addr and then converting it to a GFN. Future changes will make fault->addr and fault->gfn differ when running TDX guests. The GFN will be conceptually the same as it is for normal VMs, but fault->addr may contain a TDX specific bit that differentiates between "shared" and "private" memory. This bit will be used to direct faults to be handled on different roots, either the normal "direct" root or a new type of root that handles private memory. The TDP iterators will process the traditional GFN concept and apply the required TDX specifics depending on the root type. For this reason, it needs to operate on regular GFN and not the addr, which may contain these special TDX specific bits. Today kvm_tdp_mmu_fast_pf_get_last_sptep() takes fault->addr and then immediately converts it to a GFN with a bit shift. However, this would unfortunately retain the TDX specific bits in what is supposed to be a traditional GFN. Excluding TDX's needs, it is also is unnecessary to pass fault->addr and convert it to a GFN when the GFN is already on hand. So instead just pass the GFN into kvm_tdp_mmu_fast_pf_get_last_sptep() and use it directly. Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Message-ID: <20240619223614.290657-9-rick.p.edgecombe@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-06-20KVM: x86/tdp_mmu: Rename REMOVED_SPTE to FROZEN_SPTERick Edgecombe
Rename REMOVED_SPTE to FROZEN_SPTE so that it can be used for other multi-part operations. REMOVED_SPTE is used as a non-present intermediate value for multi-part operations that can happen when a thread doesn't have an MMU write lock. Today these operations are when removing PTEs. However, future changes will want to use the same concept for setting a PTE. In that case the REMOVED_SPTE name does not quite fit. So rename it to FROZEN_SPTE so it can be used for both types of operations. Also rename the relevant helpers and comments that refer to "removed" within the context of the SPTE value. Take care to not update naming referring the "remove" operations, which are still distinct. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Message-ID: <20240619223614.290657-2-rick.p.edgecombe@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-06-20Merge branch 'kvm-6.10-fixes' into HEADPaolo Bonzini
2024-06-20KVM: x86/tdp_mmu: Sprinkle __must_checkIsaku Yamahata
The TDP MMU function __tdp_mmu_set_spte_atomic uses a cmpxchg64 to replace the SPTE value and returns -EBUSY on failure. The caller must check the return value and retry. Add __must_check to it, as well as to two more functions that forward the return value of __tdp_mmu_set_spte_atomic to their caller. Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Message-Id: <8f7d5a1b241bf5351eaab828d1a1efe5c17699ca.1705965635.git.isaku.yamahata@intel.com> Acked-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-06-20KVM: x86: Always sync PIR to IRR prior to scanning I/O APIC routesSean Christopherson
Sync pending posted interrupts to the IRR prior to re-scanning I/O APIC routes, irrespective of whether the I/O APIC is emulated by userspace or by KVM. If a level-triggered interrupt routed through the I/O APIC is pending or in-service for a vCPU, KVM needs to intercept EOIs on said vCPU even if the vCPU isn't the destination for the new routing, e.g. if servicing an interrupt using the old routing races with I/O APIC reconfiguration. Commit fceb3a36c29a ("KVM: x86: ioapic: Fix level-triggered EOI and userspace I/OAPIC reconfigure race") fixed the common cases, but kvm_apic_pending_eoi() only checks if an interrupt is in the local APIC's IRR or ISR, i.e. misses the uncommon case where an interrupt is pending in the PIR. Failure to intercept EOI can manifest as guest hangs with Windows 11 if the guest uses the RTC as its timekeeping source, e.g. if the VMM doesn't expose a more modern form of time to the guest. Cc: stable@vger.kernel.org Cc: Adamos Ttofari <attofari@amazon.de> Cc: Raghavendra Rao Ananta <rananta@google.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20240611014845.82795-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-06-19x86/resctrl: Don't try to free nonexistent RMIDsDave Martin
Commit 6791e0ea3071 ("x86/resctrl: Access per-rmid structures by index") adds logic to map individual monitoring groups into a global index space used for tracking allocated RMIDs. Attempts to free the default RMID are ignored in free_rmid(), and this works fine on x86. With arm64 MPAM, there is a latent bug here however: on platforms with no monitors exposed through resctrl, each control group still gets a different monitoring group ID as seen by the hardware, since the CLOSID always forms part of the monitoring group ID. This means that when removing a control group, the code may try to free this group's default monitoring group RMID for real. If there are no monitors however, the RMID tracking table rmid_ptrs[] would be a waste of memory and is never allocated, leading to a splat when free_rmid() tries to dereference the table. One option would be to treat RMID 0 as special for every CLOSID, but this would be ugly since bookkeeping still needs to be done for these monitoring group IDs when there are monitors present in the hardware. Instead, add a gating check of resctrl_arch_mon_capable() in free_rmid(), and just do nothing if the hardware doesn't have monitors. This fix mirrors the gating checks already present in mkdir_rdt_prepare_rmid_alloc() and elsewhere. No functional change on x86. [ bp: Massage commit message. ] Fixes: 6791e0ea3071 ("x86/resctrl: Access per-rmid structures by index") Signed-off-by: Dave Martin <Dave.Martin@arm.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Tested-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lore.kernel.org/r/20240618140152.83154-1-Dave.Martin@arm.com
2024-06-18KVM: Introduce vcpu->wants_to_runDavid Matlack
Introduce vcpu->wants_to_run to indicate when a vCPU is in its core run loop, i.e. when the vCPU is running the KVM_RUN ioctl and immediate_exit was not set. Replace all references to vcpu->run->immediate_exit with !vcpu->wants_to_run to avoid TOCTOU races with userspace. For example, a malicious userspace could invoked KVM_RUN with immediate_exit=true and then after KVM reads it to set wants_to_run=false, flip it to false. This would result in the vCPU running in KVM_RUN with wants_to_run=false. This wouldn't cause any real bugs today but is a dangerous landmine. Signed-off-by: David Matlack <dmatlack@google.com> Link: https://lore.kernel.org/r/20240503181734.1467938-2-dmatlack@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-18KVM: x86: Prevent excluding the BSP on setting max_vcpu_idsSean Christopherson
If the BSP vCPU ID was already set, ensure it doesn't get excluded when limiting vCPU IDs via KVM_CAP_MAX_VCPU_ID. [mks: provide commit message, code by Sean] Signed-off-by: Mathias Krause <minipli@grsecurity.net> Link: https://lore.kernel.org/r/20240614202859.3597745-4-minipli@grsecurity.net Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-18KVM: x86: Limit check IDs for KVM_SET_BOOT_CPU_IDMathias Krause
Do not accept IDs which are definitely invalid by limit checking the passed value against KVM_MAX_VCPU_IDS and 'max_vcpu_ids' if it was already set. This ensures invalid values, especially on 64-bit systems, don't go unnoticed and lead to a valid id by chance when truncated by the final assignment. Fixes: 73880c80aa9c ("KVM: Break dependency between vcpu index in vcpus array and vcpu_id.") Signed-off-by: Mathias Krause <minipli@grsecurity.net> Link: https://lore.kernel.org/r/20240614202859.3597745-3-minipli@grsecurity.net Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-18Merge tag 'efi-fixes-for-v6.10-3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi Pull EFI fixes from Ard Biesheuvel: "Another small set of EFI fixes. Only the x86 one is likely to affect any actual users (and has a cc:stable), but the issue it fixes was only observed in an unusual context (kexec in a confidential VM). - Ensure that EFI runtime services are not unmapped by PAN on ARM - Avoid freeing the memory holding the EFI memory map inadvertently on x86 - Avoid a false positive kmemleak warning on arm64" * tag 'efi-fixes-for-v6.10-3' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi: efi/arm64: Fix kmemleak false positive in arm64_efi_rt_init() efi/x86: Free EFI memory map only when installing a new one. efi/arm: Disable LPAE PAN when calling EFI runtime services
2024-06-15Merge tag 'x86-urgent-2024-06-15' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Ingo Molnar: - Fix the 8 bytes get_user() logic on x86-32 - Fix build bug that creates weird & mistaken target directory under arch/x86/ * tag 'x86-urgent-2024-06-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/boot: Don't add the EFI stub to targets, again x86/uaccess: Fix missed zeroing of ia32 u64 get_user() range checking
2024-06-15efi/x86: Free EFI memory map only when installing a new one.Ard Biesheuvel
The logic in __efi_memmap_init() is shared between two different execution flows: - mapping the EFI memory map early or late into the kernel VA space, so that its entries can be accessed; - the x86 specific cloning of the EFI memory map in order to insert new entries that are created as a result of making a memory reservation via a call to efi_mem_reserve(). In the former case, the underlying memory containing the kernel's view of the EFI memory map (which may be heavily modified by the kernel itself on x86) is not modified at all, and the only thing that changes is the virtual mapping of this memory, which is different between early and late boot. In the latter case, an entirely new allocation is created that carries a new, updated version of the kernel's view of the EFI memory map. When installing this new version, the old version will no longer be referenced, and if the memory was allocated by the kernel, it will leak unless it gets freed. The logic that implements this freeing currently lives on the code path that is shared between these two use cases, but it should only apply to the latter. So move it to the correct spot. While at it, drop the dummy definition for non-x86 architectures, as that is no longer needed. Cc: <stable@vger.kernel.org> Fixes: f0ef6523475f ("efi: Fix efi_memmap_alloc() leaks") Tested-by: Ashish Kalra <Ashish.Kalra@amd.com> Link: https://lore.kernel.org/all/36ad5079-4326-45ed-85f6-928ff76483d3@amd.com Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
2024-06-13Merge tag 'fixes-2024-06-13' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock Pull memblock fixes from Mike Rapoport: "Fix validation of NUMA coverage. memblock_validate_numa_coverage() was checking for a unset node ID using NUMA_NO_NODE, but x86 used MAX_NUMNODES when no node ID was specified by buggy firmware. Update memblock to substitute MAX_NUMNODES with NUMA_NO_NODE in memblock_set_node() and use NUMA_NO_NODE in x86::numa_init()" * tag 'fixes-2024-06-13' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock: x86/mm/numa: Use NUMA_NO_NODE when calling memblock_set_node() memblock: make memblock_set_node() also warn about use of MAX_NUMNODES
2024-06-13x86/boot: Don't add the EFI stub to targets, againBenjamin Segall
This is a re-commit of da05b143a308 ("x86/boot: Don't add the EFI stub to targets") after the tagged patch incorrectly reverted it. vmlinux-objs-y is added to targets, with an assumption that they are all relative to $(obj); adding a $(objtree)/drivers/... path causes the build to incorrectly create a useless arch/x86/boot/compressed/drivers/... directory tree. Fix this just by using a different make variable for the EFI stub. Fixes: cb8bda8ad443 ("x86/boot/compressed: Rename efi_thunk_64.S to efi-mixed.S") Signed-off-by: Ben Segall <bsegall@google.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Ard Biesheuvel <ardb@kernel.org> Cc: stable@vger.kernel.org # v6.1+ Link: https://lore.kernel.org/r/xm267ceukksz.fsf@bsegall.svl.corp.google.com
2024-06-11x86/uaccess: Fix missed zeroing of ia32 u64 get_user() range checkingKees Cook
When reworking the range checking for get_user(), the get_user_8() case on 32-bit wasn't zeroing the high register. (The jump to bad_get_user_8 was accidentally dropped.) Restore the correct error handling destination (and rename the jump to using the expected ".L" prefix). While here, switch to using a named argument ("size") for the call template ("%c4" to "%c[size]") as already used in the other call templates in this file. Found after moving the usercopy selftests to KUnit: # usercopy_test_invalid: EXPECTATION FAILED at lib/usercopy_kunit.c:278 Expected val_u64 == 0, but val_u64 == -60129542144 (0xfffffff200000000) Closes: https://lore.kernel.org/all/CABVgOSn=tb=Lj9SxHuT4_9MTjjKVxsq-ikdXC4kGHO4CfKVmGQ@mail.gmail.com Fixes: b19b74bc99b1 ("x86/mm: Rework address range check in get_user() and put_user()") Reported-by: David Gow <davidgow@google.com> Signed-off-by: Kees Cook <kees@kernel.org> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Tested-by: David Gow <davidgow@google.com> Link: https://lore.kernel.org/all/20240610210213.work.143-kees%40kernel.org
2024-06-11KVM: x86: Drop now-superflous setting of l1tf_flush_l1d in vcpu_run()Sean Christopherson
Now that KVM unconditionally sets l1tf_flush_l1d in kvm_arch_vcpu_load(), drop the redundant store from vcpu_run(). The flag is cleared only when VM-Enter is imminent, deep below vcpu_run(), i.e. barring a KVM bug, it's impossible for l1tf_flush_l1d to be cleared between loading the vCPU and calling vcpu_run(). Acked-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20240522014013.1672962-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-11KVM: x86: Unconditionally set l1tf_flush_l1d during vCPU loadSean Christopherson
Always set l1tf_flush_l1d during kvm_arch_vcpu_load() instead of setting it only when the vCPU is being scheduled back in. The flag is processed only when VM-Enter is imminent, and KVM obviously needs to load the vCPU before VM-Enter, so attempting to precisely set l1tf_flush_l1d provides no meaningful value. I.e. the flag _will_ be set either way, it's simply a matter of when. Acked-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20240522014013.1672962-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-11KVM: Delete the now unused kvm_arch_sched_in()Sean Christopherson
Delete kvm_arch_sched_in() now that all implementations are nops. Reviewed-by: Bibo Mao <maobibo@loongson.cn> Acked-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20240522014013.1672962-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-11KVM: x86: Fold kvm_arch_sched_in() into kvm_arch_vcpu_load()Sean Christopherson
Fold the guts of kvm_arch_sched_in() into kvm_arch_vcpu_load(), keying off the recently added kvm_vcpu.scheduled_out as appropriate. Note, there is a very slight functional change, as PLE shrink updates will now happen after blasting WBINVD, but that is quite uninteresting as the two operations do not interact in any way. Acked-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20240522014013.1672962-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-11KVM: VMX: Move PLE grow/shrink helpers above vmx_vcpu_load()Sean Christopherson
Move VMX's {grow,shrink}_ple_window() above vmx_vcpu_load() in preparation of moving the sched_in logic, which handles shrinking the PLE window, into vmx_vcpu_load(). No functional change intended. Acked-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20240522014013.1672962-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-11KVM: x86: Don't re-setup empty IRQ routing when KVM_CAP_SPLIT_IRQCHIPYi Wang
Now that KVM sets up empty IRQ routing during VM creation, don't recreate empty routing during KVM_CAP_SPLIT_IRQCHIP. Setting IRQ routes during KVM_CAP_SPLIT_IRQCHIP can result in 20+ milliseconds of delay due to the synchronize_srcu_expedited() call in kvm_set_irq_routing(). Note, the empty routing is guaranteed to be intact as KVM x86 only allows changing the IRQ routing after an in-kernel IRQCHIP has been created, and KVM_CAP_SPLIT_IRQCHIP is disallowed after creating an IRQCHIP. Signed-off-by: Yi Wang <foxywang@tencent.com> Link: https://lore.kernel.org/r/20240506101751.3145407-3-foxywang@tencent.com [sean: massage changelog, remove unused empty_routing array] Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-08Merge tag 'x86-urgent-2024-06-08' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Ingo Molnar: "Miscellaneous fixes: - Fix kexec() crash if call depth tracking is enabled - Fix SMN reads on inaccessible registers on certain AMD systems" * tag 'x86-urgent-2024-06-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/amd_nb: Check for invalid SMN reads x86/kexec: Fix bug with call depth tracking
2024-06-06x86/mm/numa: Use NUMA_NO_NODE when calling memblock_set_node()Jan Beulich
memblock_set_node() warns about using MAX_NUMNODES, see e0eec24e2e19 ("memblock: make memblock_set_node() also warn about use of MAX_NUMNODES") for details. Reported-by: Narasimhan V <Narasimhan.V@amd.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Cc: stable@vger.kernel.org [bp: commit message] Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org> Tested-by: Paul E. McKenney <paulmck@kernel.org> Link: https://lore.kernel.org/r/20240603141005.23261-1-bp@kernel.org Link: https://lore.kernel.org/r/abadb736-a239-49e4-ab42-ace7acdd4278@suse.com Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
2024-06-05x86/amd_nb: Check for invalid SMN readsYazen Ghannam
AMD Zen-based systems use a System Management Network (SMN) that provides access to implementation-specific registers. SMN accesses are done indirectly through an index/data pair in PCI config space. The PCI config access may fail and return an error code. This would prevent the "read" value from being updated. However, the PCI config access may succeed, but the return value may be invalid. This is in similar fashion to PCI bad reads, i.e. return all bits set. Most systems will return 0 for SMN addresses that are not accessible. This is in line with AMD convention that unavailable registers are Read-as-Zero/Writes-Ignored. However, some systems will return a "PCI Error Response" instead. This value, along with an error code of 0 from the PCI config access, will confuse callers of the amd_smn_read() function. Check for this condition, clear the return value, and set a proper error code. Fixes: ddfe43cdc0da ("x86/amd_nb: Add SMN and Indirect Data Fabric access for AMD Fam17h") Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20230403164244.471141-1-yazen.ghannam@amd.com
2024-06-05KVM: SNP: Fix LBR Virtualization for SNP guestRavi Bangoria
SEV-ES and thus SNP guest mandates LBR Virtualization to be _always_ ON. Although commit b7e4be0a224f ("KVM: SEV-ES: Delegate LBR virtualization to the processor") did the correct change for SEV-ES guests, it missed the SNP. Fix it. Reported-by: Srikanth Aithal <sraithal@amd.com> Fixes: b7e4be0a224f ("KVM: SEV-ES: Delegate LBR virtualization to the processor") Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com> Message-ID: <20240605114810.1304-1-ravi.bangoria@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-06-05KVM: x86/mmu: Don't save mmu_invalidate_seq after checking private attrTao Su
Drop the second snapshot of mmu_invalidate_seq in kvm_faultin_pfn(). Before checking the mismatch of private vs. shared, mmu_invalidate_seq is saved to fault->mmu_seq, which can be used to detect an invalidation related to the gfn occurred, i.e. KVM will not install a mapping in page table if fault->mmu_seq != mmu_invalidate_seq. Currently there is a second snapshot of mmu_invalidate_seq, which may not be same as the first snapshot in kvm_faultin_pfn(), i.e. the gfn attribute may be changed between the two snapshots, but the gfn may be mapped in page table without hindrance. Therefore, drop the second snapshot as it has no obvious benefits. Fixes: f6adeae81f35 ("KVM: x86/mmu: Handle no-slot faults at the beginning of kvm_faultin_pfn()") Signed-off-by: Tao Su <tao1.su@linux.intel.com> Message-ID: <20240528102234.2162763-1-tao1.su@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-06-03Merge branch 'kvm-6.11-sev-snp' into HEADPaolo Bonzini
Pull base x86 KVM support for running SEV-SNP guests from Michael Roth: * add some basic infrastructure and introduces a new KVM_X86_SNP_VM vm_type to handle differences versus the existing KVM_X86_SEV_VM and KVM_X86_SEV_ES_VM types. * implement the KVM API to handle the creation of a cryptographic launch context, encrypt/measure the initial image into guest memory, and finalize it before launching it. * implement handling for various guest-generated events such as page state changes, onlining of additional vCPUs, etc. * implement the gmem/mmu hooks needed to prepare gmem-allocated pages before mapping them into guest private memory ranges as well as cleaning them up prior to returning them to the host for use as normal memory. Because those cleanup hooks supplant certain activities like issuing WBINVDs during KVM MMU invalidations, avoid duplicating that work to avoid unecessary overhead. This merge leaves out support support for attestation guest requests and for loading the signing keys to be used for attestation requests.
2024-06-03Merge branch 'kvm-fixes-6.10-1' into HEADPaolo Bonzini
* Fixes and debugging help for the #VE sanity check. Also disable it by default, even for CONFIG_DEBUG_KERNEL, because it was found to trigger spuriously (most likely a processor erratum as the exact symptoms vary by generation). * Avoid WARN() when two NMIs arrive simultaneously during an NMI-disabled situation (GIF=0 or interrupt shadow) when the processor supports virtual NMI. While generally KVM will not request an NMI window when virtual NMIs are supported, in this case it *does* have to single-step over the interrupt shadow or enable the STGI intercept, in order to deliver the latched second NMI. * Drop support for hand tuning APIC timer advancement from userspace. Since we have adaptive tuning, and it has proved to work well, drop the module parameter for manual configuration and with it a few stupid bugs that it had.
2024-06-03Merge branch 'kvm-fixes-6.10-1' into HEADPaolo Bonzini
* Fixes and debugging help for the #VE sanity check. Also disable it by default, even for CONFIG_DEBUG_KERNEL, because it was found to trigger spuriously (most likely a processor erratum as the exact symptoms vary by generation). * Avoid WARN() when two NMIs arrive simultaneously during an NMI-disabled situation (GIF=0 or interrupt shadow) when the processor supports virtual NMI. While generally KVM will not request an NMI window when virtual NMIs are supported, in this case it *does* have to single-step over the interrupt shadow or enable the STGI intercept, in order to deliver the latched second NMI. * Drop support for hand tuning APIC timer advancement from userspace. Since we have adaptive tuning, and it has proved to work well, drop the module parameter for manual configuration and with it a few stupid bugs that it had.
2024-06-03KVM: x86: Drop support for hand tuning APIC timer advancement from userspaceSean Christopherson
Remove support for specifying a static local APIC timer advancement value, and instead present a read-only boolean parameter to let userspace enable or disable KVM's dynamic APIC timer advancement. Realistically, it's all but impossible for userspace to specify an advancement that is more precise than what KVM's adaptive tuning can provide. E.g. a static value needs to be tuned for the exact hardware and kernel, and if KVM is using hrtimers, likely requires additional tuning for the exact configuration of the entire system. Dropping support for a userspace provided value also fixes several flaws in the interface. E.g. KVM interprets a negative value other than -1 as a large advancement, toggling between a negative and positive value yields unpredictable behavior as vCPUs will switch from dynamic to static advancement, changing the advancement in the middle of VM creation can result in different values for vCPUs within a VM, etc. Those flaws are mostly fixable, but there's almost no justification for taking on yet more complexity (it's minimal complexity, but still non-zero). The only arguments against using KVM's adaptive tuning is if a setup needs a higher maximum, or if the adjustments are too reactive, but those are arguments for letting userspace control the absolute max advancement and the granularity of each adjustment, e.g. similar to how KVM provides knobs for halt polling. Link: https://lore.kernel.org/all/20240520115334.852510-1-zhoushuling@huawei.com Cc: Shuling Zhou <zhoushuling@huawei.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20240522010304.1650603-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-06-03KVM: SEV-ES: Delegate LBR virtualization to the processorRavi Bangoria
As documented in APM[1], LBR Virtualization must be enabled for SEV-ES guests. Although KVM currently enforces LBRV for SEV-ES guests, there are multiple issues with it: o MSR_IA32_DEBUGCTLMSR is still intercepted. Since MSR_IA32_DEBUGCTLMSR interception is used to dynamically toggle LBRV for performance reasons, this can be fatal for SEV-ES guests. For ex SEV-ES guest on Zen3: [guest ~]# wrmsr 0x1d9 0x4 KVM: entry failed, hardware error 0xffffffff EAX=00000004 EBX=00000000 ECX=000001d9 EDX=00000000 Fix this by never intercepting MSR_IA32_DEBUGCTLMSR for SEV-ES guests. No additional save/restore logic is required since MSR_IA32_DEBUGCTLMSR is of swap type A. o KVM will disable LBRV if userspace sets MSR_IA32_DEBUGCTLMSR before the VMSA is encrypted. Fix this by moving LBRV enablement code post VMSA encryption. [1]: AMD64 Architecture Programmer's Manual Pub. 40332, Rev. 4.07 - June 2023, Vol 2, 15.35.2 Enabling SEV-ES. https://bugzilla.kernel.org/attachment.cgi?id=304653 Fixes: 376c6d285017 ("KVM: SVM: Provide support for SEV-ES vCPU creation/loading") Co-developed-by: Nikunj A Dadhania <nikunj@amd.com> Signed-off-by: Nikunj A Dadhania <nikunj@amd.com> Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com> Message-ID: <20240531044644.768-4-ravi.bangoria@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-06-03KVM: SEV-ES: Disallow SEV-ES guests when X86_FEATURE_LBRV is absentRavi Bangoria
As documented in APM[1], LBR Virtualization must be enabled for SEV-ES guests. So, prevent SEV-ES guests when LBRV support is missing. [1]: AMD64 Architecture Programmer's Manual Pub. 40332, Rev. 4.07 - June 2023, Vol 2, 15.35.2 Enabling SEV-ES. https://bugzilla.kernel.org/attachment.cgi?id=304653 Fixes: 376c6d285017 ("KVM: SVM: Provide support for SEV-ES vCPU creation/loading") Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com> Message-ID: <20240531044644.768-3-ravi.bangoria@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-06-03KVM: SEV-ES: Prevent MSR access post VMSA encryptionNikunj A Dadhania
KVM currently allows userspace to read/write MSRs even after the VMSA is encrypted. This can cause unintentional issues if MSR access has side- effects. For ex, while migrating a guest, userspace could attempt to migrate MSR_IA32_DEBUGCTLMSR and end up unintentionally disabling LBRV on the target. Fix this by preventing access to those MSRs which are context switched via the VMSA, once the VMSA is encrypted. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Nikunj A Dadhania <nikunj@amd.com> Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com> Message-ID: <20240531044644.768-2-ravi.bangoria@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>