author		Linus Torvalds <torvalds@linux-foundation.org>	2023-04-27 19:42:02 -0700
committer	Linus Torvalds <torvalds@linux-foundation.org>	2023-04-27 19:42:02 -0700
commit		7fa8a8ee9400fe8ec188426e40e481717bc5e924 (patch)
tree		cc8fd6b4f936ec01e73238643757451e20478c07 /mm/hugetlb.c
parent		91ec4b0d11fe115581ce2835300558802ce55e6c (diff)
parent		4d4b6d66db63ceed399f1fb1a4b24081d2590eb1 (diff)
Merge tag 'mm-stable-2023-04-27-15-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- Nick Piggin's "shoot lazy tlbs" series, to improve the performance of
switching from a user process to a kernel thread.
- More folio conversions from Kefeng Wang, Zhang Peng and Pankaj
Raghav.
- zsmalloc performance improvements from Sergey Senozhatsky.
- Yue Zhao has found and fixed some data race issues around the
alteration of memcg userspace tunables.
- VFS rationalizations from Christoph Hellwig:
- removal of most of the callers of write_one_page()
- make __filemap_get_folio()'s return value more useful
- Luis Chamberlain has changed tmpfs so it no longer requires swap
backing. Use `mount -o noswap' (a mount(2) sketch follows this list).
- Qi Zheng has made the slab shrinkers operate locklessly, providing
some scalability benefits.
- Keith Busch has improved dmapool's performance, making part of its
operations O(1) rather than O(n).
- Peter Xu adds the UFFD_FEATURE_WP_UNPOPULATED feature to userfaultfd,
permitting userspace to wr-protect as-yet-unpopulated ptes in anon
memory (a userfaultfd sketch follows this list).
- Kirill Shutemov has changed MAX_ORDER's meaning to be inclusive
rather than exclusive, and has fixed a bunch of errors which were
caused by its unintuitive meaning (a before/after sketch follows this
list).
- Axel Rasmussen gives userfaultfd the UFFDIO_CONTINUE_MODE_WP feature,
which causes minor faults to install a write-protected pte.
- Vlastimil Babka has done some maintenance work on vma_merge():
cleanups to the kernel code and improvements to our userspace test
harness.
- Cleanups to do_fault_around() by Lorenzo Stoakes.
- Mike Rapoport has moved a lot of initialization code out of various
mm/ files and into mm/mm_init.c.
- Lorenzo Stoakes removed vmf_insert_mixed_prot(), which was added for
DRM, but DRM doesn't use it any more.
- Lorenzo has also converted read_kcore() and vread() to use iterators
and has thereby removed the use of bounce buffers in some cases.
- Lorenzo has also contributed further cleanups of vma_merge().
- Chaitanya Prakash provides some fixes to the mmap selftesting code.
- Matthew Wilcox changes xfs and afs so they no longer take sleeping
locks in ->map_pages(), a step towards RCUification of pagefaults.
- Suren Baghdasaryan has improved mmap_lock scalability by switching to
per-VMA locking.
- Frederic Weisbecker has reworked the percpu cache draining so that it
no longer causes latency glitches on cpu isolated workloads.
- Mike Rapoport cleans up and corrects the ARCH_FORCE_MAX_ORDER Kconfig
logic.
- Liu Shixin has changed zswap's initialization so we no longer waste a
chunk of memory if zswap is not being used.
- Yosry Ahmed has improved the performance of memcg statistics
flushing.
- David Stevens has fixed several issues involving khugepaged,
userfaultfd and shmem.
- Christoph Hellwig has provided some cleanup work to zram's IO-related
code paths.
- David Hildenbrand has fixed up some issues in the selftest code's
testing of our pte state changes.
- Pankaj Raghav has made page_endio() unneeded and has removed it.
- Peter Xu contributed some rationalizations of the userfaultfd
selftests.
- Yosry Ahmed has fixed an issue around memcg's page reclaim
accounting.
- Chaitanya Prakash has fixed some arm-related issues in the
selftests/mm code.
- Longlong Xia has improved the way in which KSM handles hwpoisoned
pages.
- Peter Xu fixes a few issues with uffd-wp at fork() time.
- Stefan Roesch has changed KSM so that it may now be used on a
per-process and per-cgroup basis (a prctl() sketch follows this list).
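
To make the tmpfs item concrete, here is a minimal mount(2) sketch. It
is not part of this merge; the mount point path is made up, and the
option needs a kernel carrying this series:

	/* Mount a tmpfs instance that will never be backed by swap. */
	#include <stdio.h>
	#include <sys/mount.h>

	int main(void)
	{
		/* "/mnt/scratch" is a hypothetical, pre-existing directory. */
		if (mount("tmpfs", "/mnt/scratch", "tmpfs", 0, "noswap"))
			perror("mount");
		return 0;
	}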
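
The UFFD_FEATURE_WP_UNPOPULATED handshake looks roughly like the sketch
below; it assumes uapi headers from a kernel with this feature, and all
error handling is reduced to perror():

	/* Request wr-protect tracking of unpopulated anon ptes. */
	#include <fcntl.h>
	#include <stdio.h>
	#include <sys/ioctl.h>
	#include <sys/syscall.h>
	#include <unistd.h>
	#include <linux/userfaultfd.h>

	int main(void)
	{
		long uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
		struct uffdio_api api = {
			.api = UFFD_API,
			.features = UFFD_FEATURE_PAGEFAULT_FLAG_WP |
				    UFFD_FEATURE_WP_UNPOPULATED,
		};

		/* UFFDIO_API must be the first ioctl on a fresh uffd. */
		if (uffd < 0 || ioctl(uffd, UFFDIO_API, &api))
			perror("userfaultfd");
		return 0;
	}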
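
The MAX_ORDER flip inverts the usual bounds tests, as the hugetlb hunks
in the diff below show. An illustration only; the MAX_ORDER value here
is an assumption (a typical x86_64 config after this change):

	#include <stdbool.h>
	#include <stdio.h>

	#define MAX_ORDER 10	/* now the largest valid order, not one past it */

	static bool order_is_gigantic(unsigned int order)
	{
		/* pre-change spelling of the same test: order >= MAX_ORDER,
		 * with MAX_ORDER then defaulting to 11 */
		return order > MAX_ORDER;
	}

	int main(void)
	{
		printf("order 10 gigantic? %d\n", order_is_gigantic(10));	/* 0 */
		printf("order 11 gigantic? %d\n", order_is_gigantic(11));	/* 1 */
		return 0;
	}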
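
Per-process KSM is driven with prctl(); a minimal sketch, where the
fallback define mirrors the value this series adds to linux/prctl.h:

	/* Opt the whole process into KSM; no per-VMA
	 * madvise(MADV_MERGEABLE) calls are needed afterwards. */
	#include <stdio.h>
	#include <sys/prctl.h>

	#ifndef PR_SET_MEMORY_MERGE
	#define PR_SET_MEMORY_MERGE 67	/* from this series' uapi addition */
	#endif

	int main(void)
	{
		if (prctl(PR_SET_MEMORY_MERGE, 1, 0, 0, 0))
			perror("PR_SET_MEMORY_MERGE");
		return 0;
	}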
* tag 'mm-stable-2023-04-27-15-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (369 commits)
mm,unmap: avoid flushing TLB in batch if PTE is inaccessible
shmem: restrict noswap option to initial user namespace
mm/khugepaged: fix conflicting mods to collapse_file()
sparse: remove unnecessary 0 values from rc
mm: move 'mmap_min_addr' logic from callers into vm_unmapped_area()
hugetlb: pte_alloc_huge() to replace huge pte_alloc_map()
maple_tree: fix allocation in mas_sparse_area()
mm: do not increment pgfault stats when page fault handler retries
zsmalloc: allow only one active pool compaction context
selftests/mm: add new selftests for KSM
mm: add new KSM process and sysfs knobs
mm: add new api to enable ksm per process
mm: shrinkers: fix debugfs file permissions
mm: don't check VMA write permissions if the PTE/PMD indicates write permissions
migrate_pages_batch: fix statistics for longterm pin retry
userfaultfd: use helper function range_in_vma()
lib/show_mem.c: use for_each_populated_zone() simplify code
mm: correct arg in reclaim_pages()/reclaim_clean_pages_from_list()
fs/buffer: convert create_page_buffers to folio_create_buffers
fs/buffer: add folio_create_empty_buffers helper
...
Diffstat (limited to 'mm/hugetlb.c')
-rw-r--r--	mm/hugetlb.c	136
1 file changed, 80 insertions, 56 deletions
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a93e070ab175..f154019e6b84 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2050,19 +2050,23 @@ int PageHuge(struct page *page)
 }
 EXPORT_SYMBOL_GPL(PageHuge);
 
-/*
- * PageHeadHuge() only returns true for hugetlbfs head page, but not for
- * normal or transparent huge pages.
+/**
+ * folio_test_hugetlb - Determine if the folio belongs to hugetlbfs
+ * @folio: The folio to test.
+ *
+ * Context: Any context.  Caller should have a reference on the folio to
+ * prevent it from being turned into a tail page.
+ * Return: True for hugetlbfs folios, false for anon folios or folios
+ * belonging to other filesystems.
  */
-int PageHeadHuge(struct page *page_head)
+bool folio_test_hugetlb(struct folio *folio)
 {
-	struct folio *folio = (struct folio *)page_head;
-
 	if (!folio_test_large(folio))
-		return 0;
+		return false;
+
 	return folio->_folio_dtor == HUGETLB_PAGE_DTOR;
 }
-EXPORT_SYMBOL_GPL(PageHeadHuge);
+EXPORT_SYMBOL_GPL(folio_test_hugetlb);
 
 /*
  * Find and lock address space (mapping) in write mode.
@@ -2090,7 +2094,7 @@ pgoff_t hugetlb_basepage_index(struct page *page)
 	pgoff_t index = page_index(page_head);
 	unsigned long compound_idx;
 
-	if (compound_order(page_head) >= MAX_ORDER)
+	if (compound_order(page_head) > MAX_ORDER)
 		compound_idx = page_to_pfn(page) - page_to_pfn(page_head);
 	else
 		compound_idx = page - page_head;
@@ -4504,7 +4508,7 @@ static int __init default_hugepagesz_setup(char *s)
 	 * The number of default huge pages (for this size) could have been
 	 * specified as the first hugetlb parameter: hugepages=X.  If so,
 	 * then default_hstate_max_huge_pages is set.  If the default huge
-	 * page size is gigantic (>= MAX_ORDER), then the pages must be
+	 * page size is gigantic (> MAX_ORDER), then the pages must be
 	 * allocated here from bootmem allocator.
 	 */
 	if (default_hstate_max_huge_pages) {
@@ -4994,11 +4998,15 @@ static bool is_hugetlb_entry_hwpoisoned(pte_t pte)
 
 static void hugetlb_install_folio(struct vm_area_struct *vma, pte_t *ptep,
 				  unsigned long addr,
-				  struct folio *new_folio)
+				  struct folio *new_folio, pte_t old)
 {
+	pte_t newpte = make_huge_pte(vma, &new_folio->page, 1);
+
 	__folio_mark_uptodate(new_folio);
 	hugepage_add_new_anon_rmap(new_folio, vma, addr);
-	set_huge_pte_at(vma->vm_mm, addr, ptep, make_huge_pte(vma, &new_folio->page, 1));
+	if (userfaultfd_wp(vma) && huge_pte_uffd_wp(old))
+		newpte = huge_pte_mkuffd_wp(newpte);
+	set_huge_pte_at(vma->vm_mm, addr, ptep, newpte);
 	hugetlb_count_add(pages_per_huge_page(hstate_vma(vma)), vma->vm_mm);
 	folio_set_hugetlb_migratable(new_folio);
 }
@@ -5073,14 +5081,12 @@ again:
 			 */
 			;
 		} else if (unlikely(is_hugetlb_entry_hwpoisoned(entry))) {
-			bool uffd_wp = huge_pte_uffd_wp(entry);
-
-			if (!userfaultfd_wp(dst_vma) && uffd_wp)
+			if (!userfaultfd_wp(dst_vma))
 				entry = huge_pte_clear_uffd_wp(entry);
 			set_huge_pte_at(dst, addr, dst_pte, entry);
 		} else if (unlikely(is_hugetlb_entry_migration(entry))) {
 			swp_entry_t swp_entry = pte_to_swp_entry(entry);
-			bool uffd_wp = huge_pte_uffd_wp(entry);
+			bool uffd_wp = pte_swp_uffd_wp(entry);
 
 			if (!is_readable_migration_entry(swp_entry) && cow) {
 				/*
@@ -5091,10 +5097,10 @@ again:
 							swp_offset(swp_entry));
 				entry = swp_entry_to_pte(swp_entry);
 				if (userfaultfd_wp(src_vma) && uffd_wp)
-					entry = huge_pte_mkuffd_wp(entry);
+					entry = pte_swp_mkuffd_wp(entry);
 				set_huge_pte_at(src, addr, src_pte, entry);
 			}
-			if (!userfaultfd_wp(dst_vma) && uffd_wp)
+			if (!userfaultfd_wp(dst_vma))
 				entry = huge_pte_clear_uffd_wp(entry);
 			set_huge_pte_at(dst, addr, dst_pte, entry);
 		} else if (unlikely(is_pte_marker(entry))) {
@@ -5138,9 +5144,14 @@ again:
 					ret = PTR_ERR(new_folio);
 					break;
 				}
-				copy_user_huge_page(&new_folio->page, ptepage, addr, dst_vma,
-						    npages);
+				ret = copy_user_large_folio(new_folio,
+							    page_folio(ptepage),
+							    addr, dst_vma);
 				put_page(ptepage);
+				if (ret) {
+					folio_put(new_folio);
+					break;
+				}
 
 				/* Install the new hugetlb folio if src pte stable */
 				dst_ptl = huge_pte_lock(h, dst, dst_pte);
@@ -5154,7 +5165,8 @@ again:
 					/* huge_ptep of dst_pte won't change as in child */
 					goto again;
 				}
-				hugetlb_install_folio(dst_vma, dst_pte, addr, new_folio);
+				hugetlb_install_folio(dst_vma, dst_pte, addr,
+						      new_folio, src_pte_old);
 				spin_unlock(src_ptl);
 				spin_unlock(dst_ptl);
 				continue;
@@ -5172,6 +5184,9 @@ again:
 				entry = huge_pte_wrprotect(entry);
 			}
 
+			if (!userfaultfd_wp(dst_vma))
+				entry = huge_pte_clear_uffd_wp(entry);
+
 			set_huge_pte_at(dst, addr, dst_pte, entry);
 			hugetlb_count_add(npages, dst);
 		}
@@ -5657,8 +5672,10 @@ retry_avoidcopy:
 		goto out_release_all;
 	}
 
-	copy_user_huge_page(&new_folio->page, old_page, address, vma,
-			    pages_per_huge_page(h));
+	if (copy_user_large_folio(new_folio, page_folio(old_page), address, vma)) {
+		ret = VM_FAULT_HWPOISON_LARGE;
+		goto out_release_all;
+	}
 	__folio_mark_uptodate(new_folio);
 
 	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, haddr,
@@ -5672,13 +5689,16 @@ retry_avoidcopy:
 	spin_lock(ptl);
 	ptep = hugetlb_walk(vma, haddr, huge_page_size(h));
 	if (likely(ptep && pte_same(huge_ptep_get(ptep), pte))) {
+		pte_t newpte = make_huge_pte(vma, &new_folio->page, !unshare);
+
 		/* Break COW or unshare */
 		huge_ptep_clear_flush(vma, haddr, ptep);
 		mmu_notifier_invalidate_range(mm, range.start, range.end);
 		page_remove_rmap(old_page, vma, true);
 		hugepage_add_new_anon_rmap(new_folio, vma, haddr);
-		set_huge_pte_at(mm, haddr, ptep,
-				make_huge_pte(vma, &new_folio->page, !unshare));
+		if (huge_pte_uffd_wp(pte))
+			newpte = huge_pte_mkuffd_wp(newpte);
+		set_huge_pte_at(mm, haddr, ptep, newpte);
 		folio_set_hugetlb_migratable(new_folio);
 		/* Make the old page be freed below */
 		new_folio = page_folio(old_page);
@@ -5835,7 +5855,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 	 */
 	new_folio = false;
 	folio = filemap_lock_folio(mapping, idx);
-	if (!folio) {
+	if (IS_ERR(folio)) {
 		size = i_size_read(mapping->host) >> huge_page_shift(h);
 		if (idx >= size)
 			goto out;
@@ -6126,6 +6146,8 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		vma_end_reservation(h, vma, haddr);
 
 		pagecache_folio = filemap_lock_folio(mapping, idx);
+		if (IS_ERR(pagecache_folio))
+			pagecache_folio = NULL;
 	}
 
 	ptl = huge_pte_lock(h, mm, ptep);
@@ -6209,19 +6231,19 @@ out_mutex:
 
 #ifdef CONFIG_USERFAULTFD
 /*
- * Used by userfaultfd UFFDIO_COPY.  Based on mcopy_atomic_pte with
- * modifications for huge pages.
+ * Used by userfaultfd UFFDIO_* ioctls. Based on userfaultfd's mfill_atomic_pte
+ * with modifications for hugetlb pages.
  */
-int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
-			     pte_t *dst_pte,
-			     struct vm_area_struct *dst_vma,
-			     unsigned long dst_addr,
-			     unsigned long src_addr,
-			     enum mcopy_atomic_mode mode,
-			     struct page **pagep,
-			     bool wp_copy)
-{
-	bool is_continue = (mode == MCOPY_ATOMIC_CONTINUE);
+int hugetlb_mfill_atomic_pte(pte_t *dst_pte,
+			     struct vm_area_struct *dst_vma,
+			     unsigned long dst_addr,
+			     unsigned long src_addr,
+			     uffd_flags_t flags,
+			     struct folio **foliop)
+{
+	struct mm_struct *dst_mm = dst_vma->vm_mm;
+	bool is_continue = uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE);
+	bool wp_enabled = (flags & MFILL_ATOMIC_WP);
 	struct hstate *h = hstate_vma(dst_vma);
 	struct address_space *mapping = dst_vma->vm_file->f_mapping;
 	pgoff_t idx = vma_hugecache_offset(h, dst_vma, dst_addr);
@@ -6237,11 +6259,11 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 	if (is_continue) {
 		ret = -EFAULT;
 		folio = filemap_lock_folio(mapping, idx);
-		if (!folio)
+		if (IS_ERR(folio))
 			goto out;
 		folio_in_pagecache = true;
-	} else if (!*pagep) {
-		/* If a page already exists, then it's UFFDIO_COPY for
+	} else if (!*foliop) {
+		/* If a folio already exists, then it's UFFDIO_COPY for
 		 * a non-missing case. Return -EEXIST.
 		 */
 		if (vm_shared &&
@@ -6256,9 +6278,8 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 			goto out;
 		}
 
-		ret = copy_huge_page_from_user(&folio->page,
-						(const void __user *) src_addr,
-						pages_per_huge_page(h), false);
+		ret = copy_folio_from_user(folio, (const void __user *) src_addr,
+					   false);
 
 		/* fallback to copy_from_user outside mmap_lock */
 		if (unlikely(ret)) {
@@ -6277,33 +6298,36 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 				ret = -ENOMEM;
 				goto out;
 			}
-			*pagep = &folio->page;
-			/* Set the outparam pagep and return to the caller to
+			*foliop = folio;
+			/* Set the outparam foliop and return to the caller to
 			 * copy the contents outside the lock. Don't free the
-			 * page.
+			 * folio.
 			 */
 			goto out;
 		}
 	} else {
 		if (vm_shared &&
 		    hugetlbfs_pagecache_present(h, dst_vma, dst_addr)) {
-			put_page(*pagep);
+			folio_put(*foliop);
 			ret = -EEXIST;
-			*pagep = NULL;
+			*foliop = NULL;
 			goto out;
 		}
 
 		folio = alloc_hugetlb_folio(dst_vma, dst_addr, 0);
 		if (IS_ERR(folio)) {
-			put_page(*pagep);
+			folio_put(*foliop);
 			ret = -ENOMEM;
-			*pagep = NULL;
+			*foliop = NULL;
+			goto out;
+		}
+		ret = copy_user_large_folio(folio, *foliop, dst_addr, dst_vma);
+		folio_put(*foliop);
+		*foliop = NULL;
+		if (ret) {
+			folio_put(folio);
 			goto out;
 		}
-		copy_user_huge_page(&folio->page, *pagep, dst_addr, dst_vma,
-				    pages_per_huge_page(h));
-		put_page(*pagep);
-		*pagep = NULL;
 	}
 
 	/*
@@ -6356,7 +6380,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 	 * For either: (1) CONTINUE on a non-shared VMA, or (2) UFFDIO_COPY
 	 * with wp flag set, don't set pte write bit.
	 */
-	if (wp_copy || (is_continue && !vm_shared))
+	if (wp_enabled || (is_continue && !vm_shared))
 		writable = 0;
 	else
 		writable = dst_vma->vm_flags & VM_WRITE;
@@ -6371,7 +6395,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 		_dst_pte = huge_pte_mkdirty(_dst_pte);
 	_dst_pte = pte_mkyoung(_dst_pte);
 
-	if (wp_copy)
+	if (wp_enabled)
 		_dst_pte = huge_pte_mkuffd_wp(_dst_pte);
 
 	set_huge_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);