mm/sparse-vmemmap: improve memory savings for compound devmaps

A compound devmap is a dev_pagemap with @vmemmap_shift > 0 and it means that pages are mapped at a given huge page alignment and utilize uses compound pages as opposed to order-0 pages. Take advantage of the fact that most tail pages look the same (except the first two) to minimize struct page overhead. Allocate a separate page for the vmemmap area which contains the head page and separate for the next 64 pages. The rest of the subsections then reuse this tail vmemmap page to initialize the rest of the tail pages. Sections are arch-dependent (e.g. on x86 it's 64M, 128M or 512M) and when initializing compound devmap with big enough @vmemmap_shift (e.g. 1G PUD) it may cross multiple sections. The vmemmap code needs to consult @pgmap so that multiple sections that all map the same tail data can refer back to the first copy of that data for a given gigantic page. On compound devmaps with 2M align, this mechanism lets 6 pages be saved out of the 8 necessary PFNs necessary to set the subsection's 512 struct pages being mapped. On a 1G compound devmap it saves 4094 pages. Altmap isn't supported yet, given various restrictions in altmap pfn allocator, thus fallback to the already in use vmemmap_populate(). It is worth noting that altmap for devmap mappings was there to relieve the pressure of inordinate amounts of memmap space to map terabytes of pmem. With compound pages the motivation for altmaps for pmem gets reduced. Link: https://lkml.kernel.org/r/20220420155310.9712-5-joao.m.martins@oracle.com Signed-off-by: Joao Martins <joao.m.martins@oracle.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Jane Chu <jane.chu@oracle.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
author: Joao Martins <joao.m.martins@oracle.com> 2022-04-28 23:16:16 -0700
committer: akpm <akpm@linux-foundation.org> 2022-04-28 23:16:16 -0700
commit: 4917f55b4ef963e2d2288fe4eb651728be8db406 (patch)
tree: 20fb204ebd3b92b32e39851bab258e0077c16f2b /Documentation
parent: 60a427db0f80f16b9bb9efe6cc79c93f336e8466 (diff)
1 files changed, 53 insertions, 3 deletions
diff --git a/Documentation/vm/vmemmap_dedup.rst b/Documentation/vm/vmemmap_dedup.rst
index 485ccf4f7b10..c9c495f62d12 100644
--- a/Documentation/vm/vmemmap_dedup.rst
+++ b/Documentation/vm/vmemmap_dedup.rst
@@ -1,8 +1,11 @@
 .. SPDX-License-Identifier: GPL-2.0
 
-==================================
-Free some vmemmap pages of HugeTLB
-==================================
+=========================================
+A vmemmap diet for HugeTLB and Device DAX
+=========================================
+
+HugeTLB
+=======
 
 The struct page structures (page structs) are used to describe a physical
 page frame. By default, there is a one-to-one mapping from a page frame to
@@ -171,3 +174,50 @@ tail vmemmap pages are mapped to the head vmemmap page frame. So we can see
 more than one struct page struct with PG_head (e.g. 8 per 2 MB HugeTLB page)
 associated with each HugeTLB page. The compound_head() can handle this
 correctly (more details refer to the comment above compound_head()).
+
+Device DAX
+==========
+
+The device-dax interface uses the same tail deduplication technique explained
+in the previous chapter, except when used with the vmemmap in
+the device (altmap).
+
+The following page sizes are supported in DAX: PAGE_SIZE (4K on x86_64),
+PMD_SIZE (2M on x86_64) and PUD_SIZE (1G on x86_64).
+
+The differences with HugeTLB are relatively minor.
+
+It only use 3 page structs for storing all information as opposed
+to 4 on HugeTLB pages.
+
+There's no remapping of vmemmap given that device-dax memory is not part of
+System RAM ranges initialized at boot. Thus the tail page deduplication
+happens at a later stage when we populate the sections. HugeTLB reuses the
+the head vmemmap page representing, whereas device-dax reuses the tail
+vmemmap page. This results in only half of the savings compared to HugeTLB.
+
+Deduplicated tail pages are not mapped read-only.
+
+Here's how things look like on device-dax after the sections are populated::
+
+ +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
+ |           |                     |     0     | -------------> |     0     |
+ |           |                     +-----------+                +-----------+
+ |           |                     |     1     | -------------> |     1     |
+ |           |                     +-----------+                +-----------+
+ |           |                     |     2     | ----------------^ ^ ^ ^ ^ ^
+ |           |                     +-----------+                   | | | | |
+ |           |                     |     3     | ------------------+ | | | |
+ |           |                     +-----------+                     | | | |
+ |           |                     |     4     | --------------------+ | | |
+ |    PMD    |                     +-----------+                       | | |
+ |   level   |                     |     5     | ----------------------+ | |
+ |  mapping  |                     +-----------+                         | |
+ |           |                     |     6     | ------------------------+ |
+ |           |                     +-----------+                           |
+ |           |                     |     7     | --------------------------+
+ |           |                     +-----------+
+ |           |
+ |           |
+ |           |
+ +-----------+
author	Joao Martins <joao.m.martins@oracle.com>	2022-04-28 23:16:16 -0700
committer	akpm <akpm@linux-foundation.org>	2022-04-28 23:16:16 -0700
commit	4917f55b4ef963e2d2288fe4eb651728be8db406 (patch)
tree	20fb204ebd3b92b32e39851bab258e0077c16f2b /Documentation
parent	60a427db0f80f16b9bb9efe6cc79c93f336e8466 (diff)