swiotlb: reduce area lock contention for non-primary IO TLB pools - nova.git

diff options

author	Petr Tesarik <petr.tesarik1@huawei-partners.com>	2023-12-01 13:13:52 +0100
committer	Christoph Hellwig <hch@lst.de>	2023-12-15 12:32:45 +0100
commit	55c543865b76830f524ce5ce1243266a93d06f84 (patch)
tree	96327a25d40b8bf33a55dd454e3b22056aa542e3 /kernel/sched/pelt.c
parent	4ad4c1f394b84f9941a10aa8aaf11102478a390b (diff)

swiotlb: reduce area lock contention for non-primary IO TLB pools

If multiple areas and multiple IO TLB pools exist, first iterate the current CPU specific area in all pools. Then move to the next area index. This is best illustrated by a diagram: area 0 | area 1 | ... | area M | pool 0 A B C pool 1 D E ... pool N F G H Currently, each pool is searched before moving on to the next pool, i.e. the search order is A, B ... C, D, E ... F, G ... H. With this patch, each area is searched in all pools before moving on to the next area, i.e. the search order is A, D ... F, B, E ... G ... C ... H. Note that preemption is not disabled, and raw_smp_processor_id() may not return a stable result, but it is called only once to determine the initial area index. The search will iterate over all areas eventually, even if the current task is preempted. Next, some pools may have less (but not more) areas than default_nareas. Skip such pools if the distance from the initial area index is greater than pool->nareas. This logic ensures that for every pool the search starts in the initial CPU's own area and never tries any area twice. To verify performance impact, I booted the kernel with a minimum pool size ("swiotlb=512,4,force"), so multiple pools get allocated, and I ran these benchmarks: - small: single-threaded I/O of 4 KiB blocks, - big: single-threaded I/O of 64 KiB blocks, - 4way: 4-way parallel I/O of 4 KiB blocks. The "var" column in the tables below is the coefficient of variance over 5 runs of the test, the "diff" column is the relative difference against base in read-write I/O bandwidth (MiB/s). Tested on an x86 VM against a QEMU virtio SATA driver backed by a RAM-based block device on the host: base patched var var diff small 0.69% 0.62% +25.4% big 2.14% 2.27% +25.7% 4way 2.65% 1.70% +23.6% Tested on a Raspberry Pi against a class-10 A1 microSD card: base patched var var diff small 0.53% 1.96% -0.3% big 0.02% 0.57% +0.8% 4way 6.17% 0.40% +0.3% These results confirm that there is significant performance boost in the software IO TLB slot allocation itself. Where performance is dominated by actual hardware, there is no measurable change. Signed-off-by: Petr Tesarik <petr.tesarik1@huawei-partners.com> Reviewed-by: Mirsad Todorovac <mirsad.todorovac@alu.unizg.hr> Signed-off-by: Christoph Hellwig <hch@lst.de>

Diffstat (limited to 'kernel/sched/pelt.c')

0 files changed, 0 insertions, 0 deletions


context:
space:
mode: