arm64/mm: make set_ptes() robust when OAs cross 48-bit boundary

Patch series "mm/memory: optimize fork() with PTE-mapped THP", v3. Now that the rmap overhaul[1] is upstream that provides a clean interface for rmap batching, let's implement PTE batching during fork when processing PTE-mapped THPs. This series is partially based on Ryan's previous work[2] to implement cont-pte support on arm64, but its a complete rewrite based on [1] to optimize all architectures independent of any such PTE bits, and to use the new rmap batching functions that simplify the code and prepare for further rmap accounting changes. We collect consecutive PTEs that map consecutive pages of the same large folio, making sure that the other PTE bits are compatible, and (a) adjust the refcount only once per batch, (b) call rmap handling functions only once per batch and (c) perform batch PTE setting/updates. While this series should be beneficial for adding cont-pte support on ARM64[2], it's one of the requirements for maintaining a total mapcount[3] for large folios with minimal added overhead and further changes[4] that build up on top of the total mapcount. Independent of all that, this series results in a speedup during fork with PTE-mapped THP, which is the default with THPs that are smaller than a PMD (for example, 16KiB to 1024KiB mTHPs for anonymous memory[5]). On an Intel Xeon Silver 4210R CPU, fork'ing with 1GiB of PTE-mapped folios of the same size (stddev < 1%) results in the following runtimes for fork() (shorter is better): Folio Size | v6.8-rc1 | New | Change ------------------------------------------ 4KiB | 0.014328 | 0.014035 | - 2% 16KiB | 0.014263 | 0.01196 | -16% 32KiB | 0.014334 | 0.01094 | -24% 64KiB | 0.014046 | 0.010444 | -26% 128KiB | 0.014011 | 0.010063 | -28% 256KiB | 0.013993 | 0.009938 | -29% 512KiB | 0.013983 | 0.00985 | -30% 1024KiB | 0.013986 | 0.00982 | -30% 2048KiB | 0.014305 | 0.010076 | -30% Note that these numbers are even better than the ones from v1 (verified over multiple reboots), even though there were only minimal code changes. Well, I removed a pte_mkclean() call for anon folios, maybe that also plays a role. But my experience is that fork() is extremely sensitive to code size, inlining, ... so I suspect we'll see on other architectures rather a change of -20% instead of -30%, and it will be easy to "lose" some of that speedup in the future by subtle code changes. Next up is PTE batching when unmapping. Only tested on x86-64. Compile-tested on most other architectures. [1] https://lkml.kernel.org/r/20231220224504.646757-1-david@redhat.com [2] https://lkml.kernel.org/r/20231218105100.172635-1-ryan.roberts@arm.com [3] https://lkml.kernel.org/r/20230809083256.699513-1-david@redhat.com [4] https://lkml.kernel.org/r/20231124132626.235350-1-david@redhat.com [5] https://lkml.kernel.org/r/20231207161211.2374093-1-ryan.roberts@arm.com This patch (of 15): Since the high bits [51:48] of an OA are not stored contiguously in the PTE, there is a theoretical bug in set_ptes(), which just adds PAGE_SIZE to the pte to get the pte with the next pfn. This works until the pfn crosses the 48-bit boundary, at which point we overflow into the upper attributes. Of course one could argue (and Matthew Wilcox has :) that we will never see a folio cross this boundary because we only allow naturally aligned power-of-2 allocation, so this would require a half-petabyte folio. So its only a theoretical bug. But its better that the code is robust regardless. I've implemented pte_next_pfn() as part of the fix, which is an opt-in core-mm interface. So that is now available to the core-mm, which will be needed shortly to support forthcoming fork()-batching optimizations. Link: https://lkml.kernel.org/r/20240129124649.189745-1-david@redhat.com Link: https://lkml.kernel.org/r/20240125173534.1659317-1-ryan.roberts@arm.com Link: https://lkml.kernel.org/r/20240129124649.189745-2-david@redhat.com Fixes: 4a169d61c2ed ("arm64: implement the new page table range API") Closes: https://lore.kernel.org/linux-mm/fdaeb9a5-d890-499a-92c8-d171df43ad01@arm.com/ Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: David Hildenbrand <david@redhat.com> Tested-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: David S. Miller <davem@davemloft.net> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Russell King (Oracle) <linux@armlinux.org.uk> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Cc: Alexandre Ghiti <alexghiti@rivosinc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
author: Ryan Roberts <ryan.roberts@arm.com> 2024-01-29 13:46:35 +0100
committer: Andrew Morton <akpm@linux-foundation.org> 2024-02-22 10:24:50 -0800
commit: 6e8f588708971e0626f5be808e8c4b6cdb86eb0b (patch)
tree: fcfeb47298755f65dadc57b9b455d23001648241 /arch/arm64/include/asm/pgtable.h
parent: e321d7c934774b7c7e778fe4433d1beb38a1744a (diff)
1 files changed, 17 insertions, 11 deletions
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 79ce70fbb751..52d0b0a763f1 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -341,6 +341,22 @@ static inline void __sync_cache_and_tags(pte_t pte, unsigned int nr_pages)
 		mte_sync_tags(pte, nr_pages);
 }
 
+/*
+ * Select all bits except the pfn
+ */
+static inline pgprot_t pte_pgprot(pte_t pte)
+{
+	unsigned long pfn = pte_pfn(pte);
+
+	return __pgprot(pte_val(pfn_pte(pfn, __pgprot(0))) ^ pte_val(pte));
+}
+
+#define pte_next_pfn pte_next_pfn
+static inline pte_t pte_next_pfn(pte_t pte)
+{
+	return pfn_pte(pte_pfn(pte) + 1, pte_pgprot(pte));
+}
+
 static inline void set_ptes(struct mm_struct *mm,
 			    unsigned long __always_unused addr,
 			    pte_t *ptep, pte_t pte, unsigned int nr)
@@ -354,7 +370,7 @@ static inline void set_ptes(struct mm_struct *mm,
 		if (--nr == 0)
 			break;
 		ptep++;
-		pte_val(pte) += PAGE_SIZE;
+		pte = pte_next_pfn(pte);
 	}
 }
 #define set_ptes set_ptes
@@ -433,16 +449,6 @@ static inline pte_t pte_swp_clear_exclusive(pte_t pte)
 	return clear_pte_bit(pte, __pgprot(PTE_SWP_EXCLUSIVE));
 }
 
-/*
- * Select all bits except the pfn
- */
-static inline pgprot_t pte_pgprot(pte_t pte)
-{
-	unsigned long pfn = pte_pfn(pte);
-
-	return __pgprot(pte_val(pfn_pte(pfn, __pgprot(0))) ^ pte_val(pte));
-}
-
 #ifdef CONFIG_NUMA_BALANCING
 /*
  * See the comment in include/linux/pgtable.h
author	Ryan Roberts <ryan.roberts@arm.com>	2024-01-29 13:46:35 +0100
committer	Andrew Morton <akpm@linux-foundation.org>	2024-02-22 10:24:50 -0800
commit	6e8f588708971e0626f5be808e8c4b6cdb86eb0b (patch)
tree	fcfeb47298755f65dadc57b9b455d23001648241 /arch/arm64/include/asm/pgtable.h
parent	e321d7c934774b7c7e778fe4433d1beb38a1744a (diff)