author | Linus Torvalds <torvalds@linux-foundation.org> | 2023-11-03 08:17:38 -1000 |
---|---|---|
committer | Linus Torvalds <torvalds@linux-foundation.org> | 2023-11-03 08:17:38 -1000 |
commit | 7ab89417ed235f56d84c7893d38d4905e38d2692 | |
tree | 0980734f4e492a09e68d820fedce20465c69e3df /tools/perf/arch/x86 | |
parent | 31e5f934ff962820995c82a6953176a1c7d18ff5 | |
parent | fed3a1be6433e15833068c701bfde7b422d8b988 |
Merge tag 'perf-tools-for-v6.7-1-2023-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools
Pull perf tools updates from Namhyung Kim:
"Build:
- Compile BPF programs by default if clang (>= 12.0.1) is available
to enable more features like kernel lock contention, off-cpu
profiling, kwork, sample filtering and so on.
This can be disabled by passing BUILD_BPF_SKEL=0 to make.
- Produce better error messages for bison on debug build (make
DEBUG=1) by defining YYDEBUG symbol internally.
perf record:
- Track sideband events (like FORK/MMAP) from all CPUs even if perf
record targets only a subset of CPUs (using the -C option). Otherwise
it may lose information about events that happened on a CPU outside
the target list.
- Fix checking raw sched_switch tracepoint argument using system BTF.
This affects off-cpu profiling which attaches a BPF program to the
raw tracepoint.
perf lock contention:
- Add --lock-cgroup option to see contention by cgroups. This should
be used with BPF only (using -b option).
$ sudo perf lock con -ab --lock-cgroup -- sleep 1
contended   total wait     max wait     avg wait   cgroup
      835     14.06 ms     41.19 us     16.83 us   /system.slice/led.service
       25    122.38 us     13.77 us      4.89 us   /
       44     23.73 us      3.87 us       539 ns   /user.slice/user-657345.slice/session-c4.scope
        1       491 ns       491 ns       491 ns   /system.slice/connectd.service
- Add -G/--cgroup-filter option to see contention only for given
cgroups.
This can be useful when you have identified a cgroup with the above
command and want to investigate it further. It also works with
other output options like -t/--threads and -l/--lock-addr.
$ sudo perf lock con -ab -G /user.slice/user-657345.slice/session-c4.scope -- sleep 1
contended   total wait     max wait     avg wait   type       caller
        8     77.11 us     17.98 us      9.64 us   spinlock   futex_wake+0xc8
        2     24.56 us     14.66 us     12.28 us   spinlock   tick_do_update_jiffies64+0x25
        1      4.97 us      4.97 us      4.97 us   spinlock   futex_q_lock+0x2a
- Use per-cpu array for better spinlock tracking. This is to improve
performance of the BPF program and to avoid nested contention on a
lock in the BPF hash map.
- Update callstack check for PowerPC. To find a representative caller
of a lock, perf needs to look through the call stack. It ends the
lookup when it sees 0 in the call stack buffer. However, PowerPC call
stacks can have 0 values at the beginning, so skip them when valid
entries are expected to follow.
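The PowerPC call stack rule described above can be illustrated with a small sketch. This is not the code from the perf tree: the helper names `first_valid_entry` and `count_entries` and the plain `uint64_t` buffer are assumptions made for illustration. It only shows the idea of skipping zero entries at the start of the buffer while still treating a later zero as the end of the stack.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not the perf implementation): return the index of
 * the first non-zero entry in a call stack buffer, skipping any zero
 * entries at the very beginning. Returns -1 if every entry is zero.
 */
static int first_valid_entry(const uint64_t *stack, size_t max_depth)
{
	size_t i = 0;

	/* Leading zeros are treated as padding, not as the end of the stack. */
	while (i < max_depth && stack[i] == 0)
		i++;

	return i < max_depth ? (int)i : -1;
}

/*
 * Hypothetical helper: count the usable entries. After the leading zeros
 * have been skipped, the first zero ends the lookup, as it did before.
 */
static size_t count_entries(const uint64_t *stack, size_t max_depth)
{
	int start = first_valid_entry(stack, max_depth);
	size_t i, n = 0;

	if (start < 0)
		return 0;

	for (i = (size_t)start; i < max_depth && stack[i] != 0; i++)
		n++;

	return n;
}
```

With a buffer such as `{0, 0, 0x1000, 0x2000, 0, 0}`, the lookup would start at index 2 and see two entries; a buffer without leading zeros behaves as before.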
perf kwork:
- Support 'sched' class (for -k option) so that it can see task
scheduling events (using the sched_switch tracepoint) as well as irq
and workqueue items.
- Add perf kwork top subcommand to show more accurate CPU utilization
with the sched class above. It works both with recorded data (using
the perf kwork record command) and with BPF (using the -b option).
Unlike the perf top command, it does not support interactive mode (yet).
$ sudo perf kwork top -b -k sched
Starting trace, Hit <Ctrl+C> to stop and report
^C
Total : 160702.425 ms, 8 cpus
%Cpu(s): 36.00% id, 0.00% hi, 0.00% si
%Cpu0 [|||||||||||||||||| 61.66%]
%Cpu1 [|||||||||||||||||| 61.27%]
%Cpu2 [||||||||||||||||||| 66.40%]
%Cpu3 [|||||||||||||||||| 61.28%]
%Cpu4 [|||||||||||||||||| 61.82%]
%Cpu5 [||||||||||||||||||||||| 77.41%]
%Cpu6 [|||||||||||||||||| 61.73%]
%Cpu7 [|||||||||||||||||| 63.25%]
PID SPID %CPU RUNTIME COMMMAND
-------------------------------------------------------------
0 0 38.72 8089.463 ms [swapper/1]
0 0 38.71 8084.547 ms [swapper/3]
0 0 38.33 8007.532 ms [swapper/0]
0 0 38.26 7992.985 ms [swapper/6]
0 0 38.17 7971.865 ms [swapper/4]
0 0 36.74 7447.765 ms [swapper/7]
0 0 33.59 6486.942 ms [swapper/2]
0 0 22.58 3771.268 ms [swapper/5]
9545 9351 2.48 447.136 ms sched-messaging
9574 9351 2.09 418.583 ms sched-messaging
9724 9351 2.05 372.407 ms sched-messaging
9531 9351 2.01 368.804 ms sched-messaging
9512 9351 2.00 362.250 ms sched-messaging
9514 9351 1.95 357.767 ms sched-messaging
9538 9351 1.86 384.476 ms sched-messaging
9712 9351 1.84 386.490 ms sched-messaging
9723 9351 1.83 380.021 ms sched-messaging
9722 9351 1.82 382.738 ms sched-messaging
9517 9351 1.81 354.794 ms sched-messaging
9559 9351 1.79 344.305 ms sched-messaging
9725 9351 1.77 365.315 ms sched-messaging
<SNIP>
- Add hard/soft-irq statistics to perf kwork top. This will show the
total CPU utilization with IRQ stats like below:
$ sudo perf kwork top -b -k sched,irq,softirq
Starting trace, Hit <Ctrl+C> to stop and report
^C
Total : 12554.889 ms, 8 cpus
%Cpu(s): 96.23% id, 0.10% hi, 0.19% si <---- here
%Cpu0 [| 4.60%]
%Cpu1 [| 4.59%]
%Cpu2 [ 2.73%]
%Cpu3 [| 3.81%]
<SNIP>
perf bench:
- Add -G/--cgroups option to perf bench sched pipe. The pipe bench is
good for measuring context switch overhead. With this option, it puts
the reader and writer tasks in separate cgroups to force context
switches between two different cgroups.
The tasks also need to be pinned to a single CPU (via CPU affinity)
to accurately measure the impact of cgroup context switches.
$ sudo perf stat -e context-switches,cgroup-switches -- \
> taskset -c 0 perf bench sched pipe -l 100000
# Running 'sched/pipe' benchmark:
# Executed 100000 pipe operations between two processes
Total time: 0.307 [sec]
3.078180 usecs/op
324867 ops/sec
Performance counter stats for 'taskset -c 0 perf bench sched pipe -l 100000':
200,026 context-switches
63 cgroup-switches
0.321637922 seconds time elapsed
You can see a small number of cgroup-switches because both the writer
and reader tasks are in the same cgroup.
$ sudo mkdir /sys/fs/cgroup/{AAA,BBB}
$ sudo perf stat -e context-switches,cgroup-switches -- \
> taskset -c 0 perf bench sched pipe -l 100000 -G AAA,BBB
# Running 'sched/pipe' benchmark:
# Executed 100000 pipe operations between two processes
Total time: 0.351 [sec]
3.512990 usecs/op
284657 ops/sec
Performance counter stats for 'taskset -c 0 perf bench sched pipe -l 100000 -G AAA,BBB':
200,020 context-switches
200,019 cgroup-switches
0.365034567 seconds time elapsed
Now context-switches and cgroup-switches are almost the same, and you
can see the pipe operations took a little longer.
- Kill child processes when perf bench sched messaging exits
abnormally. Otherwise it would leave the children doing unnecessary work.
perf test:
- Fix various shellcheck issues on the tests written in shell script.
- Skip tests when condition is not satisfied:
- object code reading test for non-text section addresses.
- CoreSight test if cs_etm// event is not available.
- lock contention test if not enough CPUs.
Event parsing:
- Make PMU alias name loading lazy to reduce the startup time in the
event parsing code for perf record, stat and others in the general
case.
- Lazily compute PMU default config. In the same sense, delay PMU
initialization until it's really needed to reduce the startup cost.
- Fix event term values that look like raw events. The event
specification can have several terms, including the event name. But
sometimes a term value clashes with the raw event encoding, which
starts with 'r' followed by hex digits.
For example, an event named 'read' should be processed as a normal
event, but it was mistakenly treated as a raw encoding and caused a
failure.
$ perf stat -e 'uncore_imc_free_running/event=read/' -a sleep 1
event syntax error: '..nning/event=read/'
\___ parser error
Run 'perf list' for a list of valid events
Usage: perf stat [<options>] [<command>]
-e, --event <event> event selector. use 'perf list' to list available events
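To see why a name like 'read' collides with the raw encoding, consider a check that accepts 'r' followed only by hex digits. The function below is a hypothetical illustration, not the perf event parser; the name `looks_like_raw_event` is an assumption. Since 'e', 'a' and 'd' are all valid hex digits, 'read' passes such a check, while 'write' does not.

```c
#include <ctype.h>
#include <stdbool.h>

/*
 * Hypothetical check (not the perf parser): return true if the string has
 * the shape of a raw event encoding, i.e. 'r' followed by one or more
 * hex digits and nothing else.
 */
static bool looks_like_raw_event(const char *s)
{
	if (s[0] != 'r' || s[1] == '\0')
		return false;

	for (s++; *s; s++) {
		if (!isxdigit((unsigned char)*s))
			return false;
	}
	return true;
}
```

A parser relying only on this shape would have to consult the list of known event names first in order to treat 'read' as a normal event.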
Event metrics:
- Add "Compat" regex to match event with multiple identifiers.
- Usual updates for Intel, Power10, Arm telemetry/CMN and AmpereOne.
Misc:
- Assorted memory leak fixes and footprint reduction.
- Add "bpf_skeletons" to perf version --build-options so that users
can check whether their perf tools have BPF support easily.
- Fix unaligned access in Intel-PT packet decoder found by
undefined-behavior sanitizer.
- Avoid frequency mode for the dummy event. Surprisingly, it would hurt
kernel timer tick handler performance by forcing iteration over all
PMU events.
- Update bash shell completion for events and metrics"
* tag 'perf-tools-for-v6.7-1-2023-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools: (187 commits)
perf vendor events intel: Update tsx_cycles_per_elision metrics
perf vendor events intel: Update bonnell version number to v5
perf vendor events intel: Update westmereex events to v4
perf vendor events intel: Update meteorlake events to v1.06
perf vendor events intel: Update knightslanding events to v16
perf vendor events intel: Add typo fix for ivybridge FP
perf vendor events intel: Update a spelling in haswell/haswellx
perf vendor events intel: Update emeraldrapids to v1.01
perf vendor events intel: Update alderlake/alderlake events to v1.23
perf build: Disable BPF skeletons if clang version is < 12.0.1
perf callchain: Fix spelling mistake "statisitcs" -> "statistics"
perf report: Fix spelling mistake "heirachy" -> "hierarchy"
perf python: Fix binding linkage due to rename and move of evsel__increase_rlimit()
perf tests: test_arm_coresight: Simplify source iteration
perf vendor events intel: Add tigerlake two metrics
perf vendor events intel: Add broadwellde two metrics
perf vendor events intel: Fix broadwellde tma_info_system_dram_bw_use metric
perf mem_info: Add and use map_symbol__exit and addr_map_symbol__exit
perf callchain: Minor layout changes to callchain_list
perf callchain: Make brtype_stat in callchain_list optional
...
Diffstat (limited to 'tools/perf/arch/x86')
-rw-r--r-- | tools/perf/arch/x86/annotate/instructions.c | 9 |
-rw-r--r-- | tools/perf/arch/x86/util/intel-pt.c | 42 |
-rw-r--r-- | tools/perf/arch/x86/util/pmu.c | 145 |
3 files changed, 26 insertions, 170 deletions
diff --git a/tools/perf/arch/x86/annotate/instructions.c b/tools/perf/arch/x86/annotate/instructions.c
index 5f4ac4fc7fcf..5cdf457f5cbe 100644
--- a/tools/perf/arch/x86/annotate/instructions.c
+++ b/tools/perf/arch/x86/annotate/instructions.c
@@ -74,12 +74,15 @@ static struct ins x86__instructions[] = {
 	{ .name = "movdqa", .ops = &mov_ops, },
 	{ .name = "movdqu", .ops = &mov_ops, },
 	{ .name = "movsd", .ops = &mov_ops, },
-	{ .name = "movslq", .ops = &mov_ops, },
 	{ .name = "movss", .ops = &mov_ops, },
+	{ .name = "movsb", .ops = &mov_ops, },
+	{ .name = "movsw", .ops = &mov_ops, },
+	{ .name = "movsl", .ops = &mov_ops, },
 	{ .name = "movupd", .ops = &mov_ops, },
 	{ .name = "movups", .ops = &mov_ops, },
-	{ .name = "movzbl", .ops = &mov_ops, },
-	{ .name = "movzwl", .ops = &mov_ops, },
+	{ .name = "movzb", .ops = &mov_ops, },
+	{ .name = "movzw", .ops = &mov_ops, },
+	{ .name = "movzl", .ops = &mov_ops, },
 	{ .name = "mulsd", .ops = &mov_ops, },
 	{ .name = "mulss", .ops = &mov_ops, },
 	{ .name = "nop", .ops = &nop_ops, },
diff --git a/tools/perf/arch/x86/util/intel-pt.c b/tools/perf/arch/x86/util/intel-pt.c
index 31807791589e..fa0c718b9e72 100644
--- a/tools/perf/arch/x86/util/intel-pt.c
+++ b/tools/perf/arch/x86/util/intel-pt.c
@@ -60,36 +60,31 @@ struct intel_pt_recording {
 	size_t priv_size;
 };
 
-static int intel_pt_parse_terms_with_default(struct perf_pmu *pmu,
+static int intel_pt_parse_terms_with_default(const struct perf_pmu *pmu,
 					     const char *str,
 					     u64 *config)
 {
-	struct list_head *terms;
+	struct parse_events_terms terms;
 	struct perf_event_attr attr = { .size = 0, };
 	int err;
 
-	terms = malloc(sizeof(struct list_head));
-	if (!terms)
-		return -ENOMEM;
-
-	INIT_LIST_HEAD(terms);
-
-	err = parse_events_terms(terms, str, /*input=*/ NULL);
+	parse_events_terms__init(&terms);
+	err = parse_events_terms(&terms, str, /*input=*/ NULL);
 	if (err)
 		goto out_free;
 
 	attr.config = *config;
-	err = perf_pmu__config_terms(pmu, &attr, terms, /*zero=*/true, /*err=*/NULL);
+	err = perf_pmu__config_terms(pmu, &attr, &terms, /*zero=*/true, /*err=*/NULL);
 	if (err)
 		goto out_free;
 
 	*config = attr.config;
 out_free:
-	parse_events_terms__delete(terms);
+	parse_events_terms__exit(&terms);
 	return err;
 }
 
-static int intel_pt_parse_terms(struct perf_pmu *pmu, const char *str, u64 *config)
+static int intel_pt_parse_terms(const struct perf_pmu *pmu, const char *str, u64 *config)
 {
 	*config = 0;
 	return intel_pt_parse_terms_with_default(pmu, str, config);
@@ -182,7 +177,7 @@ static int intel_pt_pick_bit(int bits, int target)
 	return pick;
 }
 
-static u64 intel_pt_default_config(struct perf_pmu *intel_pt_pmu)
+static u64 intel_pt_default_config(const struct perf_pmu *intel_pt_pmu)
 {
 	char buf[256];
 	int mtc, mtc_periods = 0, mtc_period;
@@ -261,20 +256,17 @@ static int intel_pt_parse_snapshot_options(struct auxtrace_record *itr,
 	return 0;
 }
 
-struct perf_event_attr *
-intel_pt_pmu_default_config(struct perf_pmu *intel_pt_pmu)
+void intel_pt_pmu_default_config(const struct perf_pmu *intel_pt_pmu,
+				 struct perf_event_attr *attr)
 {
-	struct perf_event_attr *attr;
-
-	attr = zalloc(sizeof(struct perf_event_attr));
-	if (!attr)
-		return NULL;
-
-	attr->config = intel_pt_default_config(intel_pt_pmu);
+	static u64 config;
+	static bool initialized;
 
-	intel_pt_pmu->selectable = true;
-
-	return attr;
+	if (!initialized) {
+		config = intel_pt_default_config(intel_pt_pmu);
+		initialized = true;
+	}
+	attr->config = config;
 }
 
 static const char *intel_pt_find_filter(struct evlist *evlist,
diff --git a/tools/perf/arch/x86/util/pmu.c b/tools/perf/arch/x86/util/pmu.c
index f428cffb0378..469555ae9b3c 100644
--- a/tools/perf/arch/x86/util/pmu.c
+++ b/tools/perf/arch/x86/util/pmu.c
@@ -17,158 +17,19 @@
 #include "../../../util/pmus.h"
 #include "env.h"
 
-struct pmu_alias {
-	char *name;
-	char *alias;
-	struct list_head list;
-};
-
-static LIST_HEAD(pmu_alias_name_list);
-static bool cached_list;
-
-struct perf_event_attr *perf_pmu__get_default_config(struct perf_pmu *pmu __maybe_unused)
+void perf_pmu__arch_init(struct perf_pmu *pmu __maybe_unused)
 {
 #ifdef HAVE_AUXTRACE_SUPPORT
 	if (!strcmp(pmu->name, INTEL_PT_PMU_NAME)) {
 		pmu->auxtrace = true;
-		return intel_pt_pmu_default_config(pmu);
+		pmu->selectable = true;
+		pmu->perf_event_attr_init_default = intel_pt_pmu_default_config;
 	}
 	if (!strcmp(pmu->name, INTEL_BTS_PMU_NAME)) {
 		pmu->auxtrace = true;
 		pmu->selectable = true;
 	}
 #endif
-	return NULL;
-}
-
-static void pmu_alias__delete(struct pmu_alias *pmu_alias)
-{
-	if (!pmu_alias)
-		return;
-
-	zfree(&pmu_alias->name);
-	zfree(&pmu_alias->alias);
-	free(pmu_alias);
-}
-
-static struct pmu_alias *pmu_alias__new(char *name, char *alias)
-{
-	struct pmu_alias *pmu_alias = zalloc(sizeof(*pmu_alias));
-
-	if (pmu_alias) {
-		pmu_alias->name = strdup(name);
-		if (!pmu_alias->name)
-			goto out_delete;
-
-		pmu_alias->alias = strdup(alias);
-		if (!pmu_alias->alias)
-			goto out_delete;
-	}
-	return pmu_alias;
-
-out_delete:
-	pmu_alias__delete(pmu_alias);
-	return NULL;
-}
-
-static int setup_pmu_alias_list(void)
-{
-	int fd, dirfd;
-	DIR *dir;
-	struct dirent *dent;
-	struct pmu_alias *pmu_alias;
-	char buf[MAX_PMU_NAME_LEN];
-	FILE *file;
-	int ret = -ENOMEM;
-
-	dirfd = perf_pmu__event_source_devices_fd();
-	if (dirfd < 0)
-		return -1;
-
-	dir = fdopendir(dirfd);
-	if (!dir)
-		return -errno;
-
-	while ((dent = readdir(dir))) {
-		if (!strcmp(dent->d_name, ".") ||
-		    !strcmp(dent->d_name, ".."))
-			continue;
-
-		fd = perf_pmu__pathname_fd(dirfd, dent->d_name, "alias", O_RDONLY);
-		if (fd < 0)
-			continue;
-
-		file = fdopen(fd, "r");
-		if (!file)
-			continue;
-
-		if (!fgets(buf, sizeof(buf), file)) {
-			fclose(file);
-			continue;
-		}
-
-		fclose(file);
-
-		/* Remove the last '\n' */
-		buf[strlen(buf) - 1] = 0;
-
-		pmu_alias = pmu_alias__new(dent->d_name, buf);
-		if (!pmu_alias)
-			goto close_dir;
-
-		list_add_tail(&pmu_alias->list, &pmu_alias_name_list);
-	}
-
-	ret = 0;
-
-close_dir:
-	closedir(dir);
-	return ret;
-}
-
-static const char *__pmu_find_real_name(const char *name)
-{
-	struct pmu_alias *pmu_alias;
-
-	list_for_each_entry(pmu_alias, &pmu_alias_name_list, list) {
-		if (!strcmp(name, pmu_alias->alias))
-			return pmu_alias->name;
-	}
-
-	return name;
-}
-
-const char *pmu_find_real_name(const char *name)
-{
-	if (cached_list)
-		return __pmu_find_real_name(name);
-
-	setup_pmu_alias_list();
-	cached_list = true;
-
-	return __pmu_find_real_name(name);
-}
-
-static const char *__pmu_find_alias_name(const char *name)
-{
-	struct pmu_alias *pmu_alias;
-
-	list_for_each_entry(pmu_alias, &pmu_alias_name_list, list) {
-		if (!strcmp(name, pmu_alias->name))
-			return pmu_alias->alias;
-	}
-	return NULL;
-}
-
-const char *pmu_find_alias_name(const char *name)
-{
-	if (cached_list)
-		return __pmu_find_alias_name(name);
-
-	setup_pmu_alias_list();
-	cached_list = true;
-
-	return __pmu_find_alias_name(name);
 }
 
 int perf_pmus__num_mem_pmus(void)
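The intel-pt.c and pmu.c hunks replace an eagerly allocated default attribute with a callback that fills in a caller-provided attribute and computes the default config only on first use. The following standalone sketch mirrors that pattern outside the kernel tree. Every name in it (`toy_pmu`, `toy_attr`, `toy_arch_init`, `compute_default_config`, the constant `0x300e601`) is hypothetical and chosen for illustration; only the shape of the static cache and the function pointer follows the diff.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for struct perf_event_attr and struct perf_pmu. */
struct toy_attr {
	uint64_t config;
};

struct toy_pmu;
typedef void (*attr_init_fn)(const struct toy_pmu *pmu, struct toy_attr *attr);

struct toy_pmu {
	const char *name;
	attr_init_fn attr_init_default;	/* NULL when the PMU has no defaults */
};

/* Counts how often the (pretend) expensive computation actually runs. */
static int expensive_calls;

static uint64_t compute_default_config(const struct toy_pmu *pmu)
{
	(void)pmu;
	expensive_calls++;
	return 0x300e601;	/* arbitrary illustrative value */
}

/* Callback: compute the default once, then reuse the cached value. */
static void toy_default_config(const struct toy_pmu *pmu, struct toy_attr *attr)
{
	static uint64_t config;
	static bool initialized;

	if (!initialized) {
		config = compute_default_config(pmu);
		initialized = true;
	}
	attr->config = config;
}

/* Arch init only registers the callback; nothing is computed or allocated. */
static void toy_arch_init(struct toy_pmu *pmu)
{
	pmu->attr_init_default = toy_default_config;
}

/* Callers own the attr and ask the PMU to fill in its defaults on demand. */
static void toy_attr_init(const struct toy_pmu *pmu, struct toy_attr *attr)
{
	attr->config = 0;
	if (pmu->attr_init_default)
		pmu->attr_init_default(pmu, attr);
}
```

One consequence of this design is that the caller no longer has to free a returned attribute, and a PMU that is never used for an event never pays for the default computation. Note that the function-local static cache in this sketch is shared by every PMU that registers the same callback.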