diff options
author | Linus Torvalds <torvalds@linux-foundation.org> | 2024-05-15 14:46:43 -0700 |
---|---|---|
committer | Linus Torvalds <torvalds@linux-foundation.org> | 2024-05-15 14:46:43 -0700 |
commit | f4b0c4b508364fde023e4f7b9f23f7e38c663dfe (patch) | |
tree | d10d9c6602dcd1d2d50effe18ce63edc4d4bb706 /drivers | |
parent | 2e9250022e9f2c9cde3b98fd26dcad1c2a9aedf3 (diff) | |
parent | cba23f333fedf8e39743b0c9787b45a5bd7d03af (diff) |
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Paolo Bonzini:
"ARM:
- Move a lot of state that was previously stored on a per vcpu basis
into a per-CPU area, because it is only pertinent to the host while
the vcpu is loaded. This results in better state tracking, and a
smaller vcpu structure.
- Add full handling of the ERET/ERETAA/ERETAB instructions in nested
virtualisation. The last two instructions also require emulating
part of the pointer authentication extension. As a result, the trap
handling of pointer authentication has been greatly simplified.
- Turn the global (and not very scalable) LPI translation cache into
a per-ITS, scalable cache, making non directly injected LPIs much
cheaper to make visible to the vcpu.
- A batch of pKVM patches, mostly fixes and cleanups, as the
upstreaming process seems to be resuming. Fingers crossed!
- Allocate PPIs and SGIs outside of the vcpu structure, allowing for
smaller EL2 mapping and some flexibility in implementing more or
less than 32 private IRQs.
- Purge stale mpidr_data if a vcpu is created after the MPIDR map has
been created.
- Preserve vcpu-specific ID registers across a vcpu reset.
- Various minor cleanups and improvements.
LoongArch:
- Add ParaVirt IPI support
- Add software breakpoint support
- Add mmio trace events support
RISC-V:
- Support guest breakpoints using ebreak
- Introduce per-VCPU mp_state_lock and reset_cntx_lock
- Virtualize SBI PMU snapshot and counter overflow interrupts
- New selftests for SBI PMU and Guest ebreak
- Some preparatory work for both TDX and SNP page fault handling.
This also cleans up the page fault path, so that the priorities of
various kinds of fauls (private page, no memory, write to read-only
slot, etc.) are easier to follow.
x86:
- Minimize amount of time that shadow PTEs remain in the special
REMOVED_SPTE state.
This is a state where the mmu_lock is held for reading but
concurrent accesses to the PTE have to spin; shortening its use
allows other vCPUs to repopulate the zapped region while the zapper
finishes tearing down the old, defunct page tables.
- Advertise the max mappable GPA in the "guest MAXPHYADDR" CPUID
field, which is defined by hardware but left for software use.
This lets KVM communicate its inability to map GPAs that set bits
51:48 on hosts without 5-level nested page tables. Guest firmware
is expected to use the information when mapping BARs; this avoids
that they end up at a legal, but unmappable, GPA.
- Fixed a bug where KVM would not reject accesses to MSR that aren't
supposed to exist given the vCPU model and/or KVM configuration.
- As usual, a bunch of code cleanups.
x86 (AMD):
- Implement a new and improved API to initialize SEV and SEV-ES VMs,
which will also be extendable to SEV-SNP.
The new API specifies the desired encryption in KVM_CREATE_VM and
then separately initializes the VM. The new API also allows
customizing the desired set of VMSA features; the features affect
the measurement of the VM's initial state, and therefore enabling
them cannot be done tout court by the hypervisor.
While at it, the new API includes two bugfixes that couldn't be
applied to the old one without a flag day in userspace or without
affecting the initial measurement. When a SEV-ES VM is created with
the new VM type, KVM_GET_REGS/KVM_SET_REGS and friends are rejected
once the VMSA has been encrypted. Also, the FPU and AVX state will
be synchronized and encrypted too.
- Support for GHCB version 2 as applicable to SEV-ES guests.
This, once more, is only accessible when using the new
KVM_SEV_INIT2 flow for initialization of SEV-ES VMs.
x86 (Intel):
- An initial bunch of prerequisite patches for Intel TDX were merged.
They generally don't do anything interesting. The only somewhat
user visible change is a new debugging mode that checks that KVM's
MMU never triggers a #VE virtualization exception in the guest.
- Clear vmcs.EXIT_QUALIFICATION when synthesizing an EPT Misconfig
VM-Exit to L1, as per the SDM.
Generic:
- Use vfree() instead of kvfree() for allocations that always use
vcalloc() or __vcalloc().
- Remove .change_pte() MMU notifier - the changes to non-KVM code are
small and Andrew Morton asked that I also take those through the
KVM tree.
The callback was only ever implemented by KVM (which was also the
original user of MMU notifiers) but it had been nonfunctional ever
since calls to set_pte_at_notify were wrapped with
invalidate_range_start and invalidate_range_end... in 2012.
Selftests:
- Enhance the demand paging test to allow for better reporting and
stressing of UFFD performance.
- Convert the steal time test to generate TAP-friendly output.
- Fix a flaky false positive in the xen_shinfo_test due to comparing
elapsed time across two different clock domains.
- Skip the MONITOR/MWAIT test if the host doesn't actually support
MWAIT.
- Avoid unnecessary use of "sudo" in the NX hugepage test wrapper
shell script, to play nice with running in a minimal userspace
environment.
- Allow skipping the RSEQ test's sanity check that the vCPU was able
to complete a reasonable number of KVM_RUNs, as the assert can fail
on a completely valid setup.
If the test is run on a large-ish system that is otherwise idle,
and the test isn't affined to a low-ish number of CPUs, the vCPU
task can be repeatedly migrated to CPUs that are in deep sleep
states, which results in the vCPU having very little net runtime
before the next migration due to high wakeup latencies.
- Define _GNU_SOURCE for all selftests to fix a warning that was
introduced by a change to kselftest_harness.h late in the 6.9
cycle, and because forcing every test to #define _GNU_SOURCE is
painful.
- Provide a global pseudo-RNG instance for all tests, so that library
code can generate random, but determinstic numbers.
- Use the global pRNG to randomly force emulation of select writes
from guest code on x86, e.g. to help validate KVM's emulation of
locked accesses.
- Allocate and initialize x86's GDT, IDT, TSS, segments, and default
exception handlers at VM creation, instead of forcing tests to
manually trigger the related setup.
Documentation:
- Fix a goof in the KVM_CREATE_GUEST_MEMFD documentation"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (225 commits)
selftests/kvm: remove dead file
KVM: selftests: arm64: Test vCPU-scoped feature ID registers
KVM: selftests: arm64: Test that feature ID regs survive a reset
KVM: selftests: arm64: Store expected register value in set_id_regs
KVM: selftests: arm64: Rename helper in set_id_regs to imply VM scope
KVM: arm64: Only reset vCPU-scoped feature ID regs once
KVM: arm64: Reset VM feature ID regs from kvm_reset_sys_regs()
KVM: arm64: Rename is_id_reg() to imply VM scope
KVM: arm64: Destroy mpidr_data for 'late' vCPU creation
KVM: arm64: Use hVHE in pKVM by default on CPUs with VHE support
KVM: arm64: Fix hvhe/nvhe early alias parsing
KVM: SEV: Allow per-guest configuration of GHCB protocol version
KVM: SEV: Add GHCB handling for termination requests
KVM: SEV: Add GHCB handling for Hypervisor Feature Support requests
KVM: SEV: Add support to handle AP reset MSR protocol
KVM: x86: Explicitly zero kvm_caps during vendor module load
KVM: x86: Fully re-initialize supported_mce_cap on vendor module load
KVM: x86: Fully re-initialize supported_vm_types on vendor module load
KVM: x86/mmu: Sanity check that __kvm_faultin_pfn() doesn't create noslot pfns
KVM: x86/mmu: Initialize kvm_page_fault's pfn and hva to error values
...
Diffstat (limited to 'drivers')
-rw-r--r-- | drivers/perf/riscv_pmu.c | 3 | ||||
-rw-r--r-- | drivers/perf/riscv_pmu_sbi.c | 316 |
2 files changed, 286 insertions, 33 deletions
diff --git a/drivers/perf/riscv_pmu.c b/drivers/perf/riscv_pmu.c index b4efdddb2ad9..78c490e0505a 100644 --- a/drivers/perf/riscv_pmu.c +++ b/drivers/perf/riscv_pmu.c @@ -191,8 +191,6 @@ void riscv_pmu_stop(struct perf_event *event, int flags) struct hw_perf_event *hwc = &event->hw; struct riscv_pmu *rvpmu = to_riscv_pmu(event->pmu); - WARN_ON_ONCE(hwc->state & PERF_HES_STOPPED); - if (!(hwc->state & PERF_HES_STOPPED)) { if (rvpmu->ctr_stop) { rvpmu->ctr_stop(event, 0); @@ -408,6 +406,7 @@ struct riscv_pmu *riscv_pmu_alloc(void) cpuc->n_events = 0; for (i = 0; i < RISCV_MAX_COUNTERS; i++) cpuc->events[i] = NULL; + cpuc->snapshot_addr = NULL; } pmu->pmu = (struct pmu) { .event_init = riscv_pmu_event_init, diff --git a/drivers/perf/riscv_pmu_sbi.c b/drivers/perf/riscv_pmu_sbi.c index 82636273d726..a2e4005e1fd0 100644 --- a/drivers/perf/riscv_pmu_sbi.c +++ b/drivers/perf/riscv_pmu_sbi.c @@ -27,7 +27,7 @@ #define ALT_SBI_PMU_OVERFLOW(__ovl) \ asm volatile(ALTERNATIVE_2( \ - "csrr %0, " __stringify(CSR_SSCOUNTOVF), \ + "csrr %0, " __stringify(CSR_SCOUNTOVF), \ "csrr %0, " __stringify(THEAD_C9XX_CSR_SCOUNTEROF), \ THEAD_VENDOR_ID, ERRATA_THEAD_PMU, \ CONFIG_ERRATA_THEAD_PMU, \ @@ -57,6 +57,11 @@ asm volatile(ALTERNATIVE( \ PMU_FORMAT_ATTR(event, "config:0-47"); PMU_FORMAT_ATTR(firmware, "config:63"); +static bool sbi_v2_available; +static DEFINE_STATIC_KEY_FALSE(sbi_pmu_snapshot_available); +#define sbi_pmu_snapshot_available() \ + static_branch_unlikely(&sbi_pmu_snapshot_available) + static struct attribute *riscv_arch_formats_attr[] = { &format_attr_event.attr, &format_attr_firmware.attr, @@ -384,7 +389,7 @@ static int pmu_sbi_ctr_get_idx(struct perf_event *event) cmask = 1; } else if (event->attr.config == PERF_COUNT_HW_INSTRUCTIONS) { cflags |= SBI_PMU_CFG_FLAG_SKIP_MATCH; - cmask = 1UL << (CSR_INSTRET - CSR_CYCLE); + cmask = BIT(CSR_INSTRET - CSR_CYCLE); } } @@ -506,24 +511,126 @@ static int pmu_sbi_event_map(struct perf_event *event, u64 *econfig) return ret; } +static void pmu_sbi_snapshot_free(struct riscv_pmu *pmu) +{ + int cpu; + + for_each_possible_cpu(cpu) { + struct cpu_hw_events *cpu_hw_evt = per_cpu_ptr(pmu->hw_events, cpu); + + if (!cpu_hw_evt->snapshot_addr) + continue; + + free_page((unsigned long)cpu_hw_evt->snapshot_addr); + cpu_hw_evt->snapshot_addr = NULL; + cpu_hw_evt->snapshot_addr_phys = 0; + } +} + +static int pmu_sbi_snapshot_alloc(struct riscv_pmu *pmu) +{ + int cpu; + struct page *snapshot_page; + + for_each_possible_cpu(cpu) { + struct cpu_hw_events *cpu_hw_evt = per_cpu_ptr(pmu->hw_events, cpu); + + snapshot_page = alloc_page(GFP_ATOMIC | __GFP_ZERO); + if (!snapshot_page) { + pmu_sbi_snapshot_free(pmu); + return -ENOMEM; + } + cpu_hw_evt->snapshot_addr = page_to_virt(snapshot_page); + cpu_hw_evt->snapshot_addr_phys = page_to_phys(snapshot_page); + } + + return 0; +} + +static int pmu_sbi_snapshot_disable(void) +{ + struct sbiret ret; + + ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_SNAPSHOT_SET_SHMEM, SBI_SHMEM_DISABLE, + SBI_SHMEM_DISABLE, 0, 0, 0, 0); + if (ret.error) { + pr_warn("failed to disable snapshot shared memory\n"); + return sbi_err_map_linux_errno(ret.error); + } + + return 0; +} + +static int pmu_sbi_snapshot_setup(struct riscv_pmu *pmu, int cpu) +{ + struct cpu_hw_events *cpu_hw_evt; + struct sbiret ret = {0}; + + cpu_hw_evt = per_cpu_ptr(pmu->hw_events, cpu); + if (!cpu_hw_evt->snapshot_addr_phys) + return -EINVAL; + + if (cpu_hw_evt->snapshot_set_done) + return 0; + + if (IS_ENABLED(CONFIG_32BIT)) + ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_SNAPSHOT_SET_SHMEM, + cpu_hw_evt->snapshot_addr_phys, + (u64)(cpu_hw_evt->snapshot_addr_phys) >> 32, 0, 0, 0, 0); + else + ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_SNAPSHOT_SET_SHMEM, + cpu_hw_evt->snapshot_addr_phys, 0, 0, 0, 0, 0); + + /* Free up the snapshot area memory and fall back to SBI PMU calls without snapshot */ + if (ret.error) { + if (ret.error != SBI_ERR_NOT_SUPPORTED) + pr_warn("pmu snapshot setup failed with error %ld\n", ret.error); + return sbi_err_map_linux_errno(ret.error); + } + + memset(cpu_hw_evt->snapshot_cval_shcopy, 0, sizeof(u64) * RISCV_MAX_COUNTERS); + cpu_hw_evt->snapshot_set_done = true; + + return 0; +} + static u64 pmu_sbi_ctr_read(struct perf_event *event) { struct hw_perf_event *hwc = &event->hw; int idx = hwc->idx; struct sbiret ret; - union sbi_pmu_ctr_info info; u64 val = 0; + struct riscv_pmu *pmu = to_riscv_pmu(event->pmu); + struct cpu_hw_events *cpu_hw_evt = this_cpu_ptr(pmu->hw_events); + struct riscv_pmu_snapshot_data *sdata = cpu_hw_evt->snapshot_addr; + union sbi_pmu_ctr_info info = pmu_ctr_list[idx]; + + /* Read the value from the shared memory directly only if counter is stopped */ + if (sbi_pmu_snapshot_available() && (hwc->state & PERF_HES_STOPPED)) { + val = sdata->ctr_values[idx]; + return val; + } if (pmu_sbi_is_fw_event(event)) { ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_FW_READ, hwc->idx, 0, 0, 0, 0, 0); - if (!ret.error) - val = ret.value; + if (ret.error) + return 0; + + val = ret.value; + if (IS_ENABLED(CONFIG_32BIT) && sbi_v2_available && info.width >= 32) { + ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_FW_READ_HI, + hwc->idx, 0, 0, 0, 0, 0); + if (!ret.error) + val |= ((u64)ret.value << 32); + else + WARN_ONCE(1, "Unable to read upper 32 bits of firmware counter error: %ld\n", + ret.error); + } } else { - info = pmu_ctr_list[idx]; val = riscv_pmu_ctr_read_csr(info.csr); if (IS_ENABLED(CONFIG_32BIT)) - val = ((u64)riscv_pmu_ctr_read_csr(info.csr + 0x80)) << 31 | val; + val |= ((u64)riscv_pmu_ctr_read_csr(info.csr + 0x80)) << 32; } return val; @@ -553,6 +660,7 @@ static void pmu_sbi_ctr_start(struct perf_event *event, u64 ival) struct hw_perf_event *hwc = &event->hw; unsigned long flag = SBI_PMU_START_FLAG_SET_INIT_VALUE; + /* There is no benefit setting SNAPSHOT FLAG for a single counter */ #if defined(CONFIG_32BIT) ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_START, hwc->idx, 1, flag, ival, ival >> 32, 0); @@ -573,16 +681,36 @@ static void pmu_sbi_ctr_stop(struct perf_event *event, unsigned long flag) { struct sbiret ret; struct hw_perf_event *hwc = &event->hw; + struct riscv_pmu *pmu = to_riscv_pmu(event->pmu); + struct cpu_hw_events *cpu_hw_evt = this_cpu_ptr(pmu->hw_events); + struct riscv_pmu_snapshot_data *sdata = cpu_hw_evt->snapshot_addr; if ((hwc->flags & PERF_EVENT_FLAG_USER_ACCESS) && (hwc->flags & PERF_EVENT_FLAG_USER_READ_CNT)) pmu_sbi_reset_scounteren((void *)event); + if (sbi_pmu_snapshot_available()) + flag |= SBI_PMU_STOP_FLAG_TAKE_SNAPSHOT; + ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_STOP, hwc->idx, 1, flag, 0, 0, 0); - if (ret.error && (ret.error != SBI_ERR_ALREADY_STOPPED) && - flag != SBI_PMU_STOP_FLAG_RESET) + if (!ret.error && sbi_pmu_snapshot_available()) { + /* + * The counter snapshot is based on the index base specified by hwc->idx. + * The actual counter value is updated in shared memory at index 0 when counter + * mask is 0x01. To ensure accurate counter values, it's necessary to transfer + * the counter value to shared memory. However, if hwc->idx is zero, the counter + * value is already correctly updated in shared memory, requiring no further + * adjustment. + */ + if (hwc->idx > 0) { + sdata->ctr_values[hwc->idx] = sdata->ctr_values[0]; + sdata->ctr_values[0] = 0; + } + } else if (ret.error && (ret.error != SBI_ERR_ALREADY_STOPPED) && + flag != SBI_PMU_STOP_FLAG_RESET) { pr_err("Stopping counter idx %d failed with error %d\n", hwc->idx, sbi_err_map_linux_errno(ret.error)); + } } static int pmu_sbi_find_num_ctrs(void) @@ -640,10 +768,39 @@ static inline void pmu_sbi_stop_all(struct riscv_pmu *pmu) static inline void pmu_sbi_stop_hw_ctrs(struct riscv_pmu *pmu) { struct cpu_hw_events *cpu_hw_evt = this_cpu_ptr(pmu->hw_events); + struct riscv_pmu_snapshot_data *sdata = cpu_hw_evt->snapshot_addr; + unsigned long flag = 0; + int i, idx; + struct sbiret ret; + u64 temp_ctr_overflow_mask = 0; + + if (sbi_pmu_snapshot_available()) + flag = SBI_PMU_STOP_FLAG_TAKE_SNAPSHOT; + + /* Reset the shadow copy to avoid save/restore any value from previous overflow */ + memset(cpu_hw_evt->snapshot_cval_shcopy, 0, sizeof(u64) * RISCV_MAX_COUNTERS); + + for (i = 0; i < BITS_TO_LONGS(RISCV_MAX_COUNTERS); i++) { + /* No need to check the error here as we can't do anything about the error */ + ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_STOP, i * BITS_PER_LONG, + cpu_hw_evt->used_hw_ctrs[i], flag, 0, 0, 0); + if (!ret.error && sbi_pmu_snapshot_available()) { + /* Save the counter values to avoid clobbering */ + for_each_set_bit(idx, &cpu_hw_evt->used_hw_ctrs[i], BITS_PER_LONG) + cpu_hw_evt->snapshot_cval_shcopy[i * BITS_PER_LONG + idx] = + sdata->ctr_values[idx]; + /* Save the overflow mask to avoid clobbering */ + temp_ctr_overflow_mask |= sdata->ctr_overflow_mask << (i * BITS_PER_LONG); + } + } - /* No need to check the error here as we can't do anything about the error */ - sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_STOP, 0, - cpu_hw_evt->used_hw_ctrs[0], 0, 0, 0, 0); + /* Restore the counter values to the shared memory for used hw counters */ + if (sbi_pmu_snapshot_available()) { + for_each_set_bit(idx, cpu_hw_evt->used_hw_ctrs, RISCV_MAX_COUNTERS) + sdata->ctr_values[idx] = cpu_hw_evt->snapshot_cval_shcopy[idx]; + if (temp_ctr_overflow_mask) + sdata->ctr_overflow_mask = temp_ctr_overflow_mask; + } } /* @@ -652,11 +809,10 @@ static inline void pmu_sbi_stop_hw_ctrs(struct riscv_pmu *pmu) * while the overflowed counters need to be started with updated initialization * value. */ -static inline void pmu_sbi_start_overflow_mask(struct riscv_pmu *pmu, - unsigned long ctr_ovf_mask) +static inline void pmu_sbi_start_ovf_ctrs_sbi(struct cpu_hw_events *cpu_hw_evt, + u64 ctr_ovf_mask) { - int idx = 0; - struct cpu_hw_events *cpu_hw_evt = this_cpu_ptr(pmu->hw_events); + int idx = 0, i; struct perf_event *event; unsigned long flag = SBI_PMU_START_FLAG_SET_INIT_VALUE; unsigned long ctr_start_mask = 0; @@ -664,11 +820,12 @@ static inline void pmu_sbi_start_overflow_mask(struct riscv_pmu *pmu, struct hw_perf_event *hwc; u64 init_val = 0; - ctr_start_mask = cpu_hw_evt->used_hw_ctrs[0] & ~ctr_ovf_mask; - - /* Start all the counters that did not overflow in a single shot */ - sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_START, 0, ctr_start_mask, - 0, 0, 0, 0); + for (i = 0; i < BITS_TO_LONGS(RISCV_MAX_COUNTERS); i++) { + ctr_start_mask = cpu_hw_evt->used_hw_ctrs[i] & ~ctr_ovf_mask; + /* Start all the counters that did not overflow in a single shot */ + sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_START, i * BITS_PER_LONG, ctr_start_mask, + 0, 0, 0, 0); + } /* Reinitialize and start all the counter that overflowed */ while (ctr_ovf_mask) { @@ -691,6 +848,52 @@ static inline void pmu_sbi_start_overflow_mask(struct riscv_pmu *pmu, } } +static inline void pmu_sbi_start_ovf_ctrs_snapshot(struct cpu_hw_events *cpu_hw_evt, + u64 ctr_ovf_mask) +{ + int i, idx = 0; + struct perf_event *event; + unsigned long flag = SBI_PMU_START_FLAG_INIT_SNAPSHOT; + u64 max_period, init_val = 0; + struct hw_perf_event *hwc; + struct riscv_pmu_snapshot_data *sdata = cpu_hw_evt->snapshot_addr; + + for_each_set_bit(idx, cpu_hw_evt->used_hw_ctrs, RISCV_MAX_COUNTERS) { + if (ctr_ovf_mask & BIT(idx)) { + event = cpu_hw_evt->events[idx]; + hwc = &event->hw; + max_period = riscv_pmu_ctr_get_width_mask(event); + init_val = local64_read(&hwc->prev_count) & max_period; + cpu_hw_evt->snapshot_cval_shcopy[idx] = init_val; + } + /* + * We do not need to update the non-overflow counters the previous + * value should have been there already. + */ + } + + for (i = 0; i < BITS_TO_LONGS(RISCV_MAX_COUNTERS); i++) { + /* Restore the counter values to relative indices for used hw counters */ + for_each_set_bit(idx, &cpu_hw_evt->used_hw_ctrs[i], BITS_PER_LONG) + sdata->ctr_values[idx] = + cpu_hw_evt->snapshot_cval_shcopy[idx + i * BITS_PER_LONG]; + /* Start all the counters in a single shot */ + sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_START, idx * BITS_PER_LONG, + cpu_hw_evt->used_hw_ctrs[i], flag, 0, 0, 0); + } +} + +static void pmu_sbi_start_overflow_mask(struct riscv_pmu *pmu, + u64 ctr_ovf_mask) +{ + struct cpu_hw_events *cpu_hw_evt = this_cpu_ptr(pmu->hw_events); + + if (sbi_pmu_snapshot_available()) + pmu_sbi_start_ovf_ctrs_snapshot(cpu_hw_evt, ctr_ovf_mask); + else + pmu_sbi_start_ovf_ctrs_sbi(cpu_hw_evt, ctr_ovf_mask); +} + static irqreturn_t pmu_sbi_ovf_handler(int irq, void *dev) { struct perf_sample_data data; @@ -700,10 +903,11 @@ static irqreturn_t pmu_sbi_ovf_handler(int irq, void *dev) int lidx, hidx, fidx; struct riscv_pmu *pmu; struct perf_event *event; - unsigned long overflow; - unsigned long overflowed_ctrs = 0; + u64 overflow; + u64 overflowed_ctrs = 0; struct cpu_hw_events *cpu_hw_evt = dev; u64 start_clock = sched_clock(); + struct riscv_pmu_snapshot_data *sdata = cpu_hw_evt->snapshot_addr; if (WARN_ON_ONCE(!cpu_hw_evt)) return IRQ_NONE; @@ -725,7 +929,10 @@ static irqreturn_t pmu_sbi_ovf_handler(int irq, void *dev) pmu_sbi_stop_hw_ctrs(pmu); /* Overflow status register should only be read after counter are stopped */ - ALT_SBI_PMU_OVERFLOW(overflow); + if (sbi_pmu_snapshot_available()) + overflow = sdata->ctr_overflow_mask; + else + ALT_SBI_PMU_OVERFLOW(overflow); /* * Overflow interrupt pending bit should only be cleared after stopping @@ -751,9 +958,14 @@ static irqreturn_t pmu_sbi_ovf_handler(int irq, void *dev) if (!info || info->type != SBI_PMU_CTR_TYPE_HW) continue; - /* compute hardware counter index */ - hidx = info->csr - CSR_CYCLE; - /* check if the corresponding bit is set in sscountovf */ + if (sbi_pmu_snapshot_available()) + /* SBI implementation already updated the logical indicies */ + hidx = lidx; + else + /* compute hardware counter index */ + hidx = info->csr - CSR_CYCLE; + + /* check if the corresponding bit is set in sscountovf or overflow mask in shmem */ if (!(overflow & BIT(hidx))) continue; @@ -763,7 +975,10 @@ static irqreturn_t pmu_sbi_ovf_handler(int irq, void *dev) */ overflowed_ctrs |= BIT(lidx); hw_evt = &event->hw; + /* Update the event states here so that we know the state while reading */ + hw_evt->state |= PERF_HES_STOPPED; riscv_pmu_event_update(event); + hw_evt->state |= PERF_HES_UPTODATE; perf_sample_data_init(&data, 0, hw_evt->last_period); if (riscv_pmu_event_set_period(event)) { /* @@ -776,6 +991,8 @@ static irqreturn_t pmu_sbi_ovf_handler(int irq, void *dev) */ perf_event_overflow(event, &data, regs); } + /* Reset the state as we are going to start the counter after the loop */ + hw_evt->state = 0; } pmu_sbi_start_overflow_mask(pmu, overflowed_ctrs); @@ -807,6 +1024,9 @@ static int pmu_sbi_starting_cpu(unsigned int cpu, struct hlist_node *node) enable_percpu_irq(riscv_pmu_irq, IRQ_TYPE_NONE); } + if (sbi_pmu_snapshot_available()) + return pmu_sbi_snapshot_setup(pmu, cpu); + return 0; } @@ -819,6 +1039,9 @@ static int pmu_sbi_dying_cpu(unsigned int cpu, struct hlist_node *node) /* Disable all counters access for user mode now */ csr_write(CSR_SCOUNTEREN, 0x0); + if (sbi_pmu_snapshot_available()) + return pmu_sbi_snapshot_disable(); + return 0; } @@ -927,6 +1150,12 @@ static inline void riscv_pm_pmu_unregister(struct riscv_pmu *pmu) { } static void riscv_pmu_destroy(struct riscv_pmu *pmu) { + if (sbi_v2_available) { + if (sbi_pmu_snapshot_available()) { + pmu_sbi_snapshot_disable(); + pmu_sbi_snapshot_free(pmu); + } + } riscv_pm_pmu_unregister(pmu); cpuhp_state_remove_instance(CPUHP_AP_PERF_RISCV_STARTING, &pmu->node); } @@ -1094,10 +1323,6 @@ static int pmu_sbi_device_probe(struct platform_device *pdev) pmu->event_unmapped = pmu_sbi_event_unmapped; pmu->csr_index = pmu_sbi_csr_index; - ret = cpuhp_state_add_instance(CPUHP_AP_PERF_RISCV_STARTING, &pmu->node); - if (ret) - return ret; - ret = riscv_pm_pmu_register(pmu); if (ret) goto out_unregister; @@ -1106,8 +1331,34 @@ static int pmu_sbi_device_probe(struct platform_device *pdev) if (ret) goto out_unregister; + /* SBI PMU Snapsphot is only available in SBI v2.0 */ + if (sbi_v2_available) { + ret = pmu_sbi_snapshot_alloc(pmu); + if (ret) + goto out_unregister; + + ret = pmu_sbi_snapshot_setup(pmu, smp_processor_id()); + if (ret) { + /* Snapshot is an optional feature. Continue if not available */ + pmu_sbi_snapshot_free(pmu); + } else { + pr_info("SBI PMU snapshot detected\n"); + /* + * We enable it once here for the boot cpu. If snapshot shmem setup + * fails during cpu hotplug process, it will fail to start the cpu + * as we can not handle hetergenous PMUs with different snapshot + * capability. + */ + static_branch_enable(&sbi_pmu_snapshot_available); + } + } + register_sysctl("kernel", sbi_pmu_sysctl_table); + ret = cpuhp_state_add_instance(CPUHP_AP_PERF_RISCV_STARTING, &pmu->node); + if (ret) + goto out_unregister; + return 0; out_unregister: @@ -1135,6 +1386,9 @@ static int __init pmu_sbi_devinit(void) return 0; } + if (sbi_spec_version >= sbi_mk_version(2, 0)) + sbi_v2_available = true; + ret = cpuhp_setup_state_multi(CPUHP_AP_PERF_RISCV_STARTING, "perf/riscv/pmu:starting", pmu_sbi_starting_cpu, pmu_sbi_dying_cpu); |