summaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)Author
2022-11-17Merge tag 'net-6.1-rc6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Including fixes from bpf. Current release - regressions: - tls: fix memory leak in tls_enc_skb() and tls_sw_fallback_init() Previous releases - regressions: - bridge: fix memory leaks when changing VLAN protocol - dsa: make dsa_master_ioctl() see through port_hwtstamp_get() shims - dsa: don't leak tagger-owned storage on switch driver unbind - eth: mlxsw: avoid warnings when not offloaded FDB entry with IPv6 is removed - eth: stmmac: ensure tx function is not running in stmmac_xdp_release() - eth: hns3: fix return value check bug of rx copybreak Previous releases - always broken: - kcm: close race conditions on sk_receive_queue - bpf: fix alignment problem in bpf_prog_test_run_skb() - bpf: fix writing offset in case of fault in strncpy_from_kernel_nofault - eth: macvlan: use built-in RCU list checking - eth: marvell: add sleep time after enabling the loopback bit - eth: octeon_ep: fix potential memory leak in octep_device_setup() Misc: - tcp: configurable source port perturb table size - bpf: Convert BPF_DISPATCHER to use static_call() (not ftrace)" * tag 'net-6.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (51 commits) net: use struct_group to copy ip/ipv6 header addresses net: usb: smsc95xx: fix external PHY reset net: usb: qmi_wwan: add Telit 0x103a composition netdevsim: Fix memory leak of nsim_dev->fa_cookie tcp: configurable source port perturb table size l2tp: Serialize access to sk_user_data with sk_callback_lock net: thunderbolt: Fix error handling in tbnet_init() net: microchip: sparx5: Fix potential null-ptr-deref in sparx_stats_init() and sparx5_start() net: lan966x: Fix potential null-ptr-deref in lan966x_stats_init() net: dsa: don't leak tagger-owned storage on switch driver unbind net/x25: Fix skb leak in x25_lapb_receive_frame() net: ag71xx: call phylink_disconnect_phy if ag71xx_hw_enable() fail in ag71xx_open() bridge: switchdev: Fix memory leaks when changing VLAN protocol net: hns3: fix setting incorrect phy link ksettings for firmware in resetting process net: hns3: fix return value check bug of rx copybreak net: hns3: fix incorrect hw rss hash type of rx packet net: phy: marvell: add sleep time after enabling the loopback bit net: ena: Fix error handling in ena_init() kcm: close race conditions on sk_receive_queue net: ionic: Fix error handling in ionic_init_module() ...
2022-11-11Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfJakub Kicinski
Andrii Nakryiko says: ==================== bpf 2022-11-11 We've added 11 non-merge commits during the last 8 day(s) which contain a total of 11 files changed, 83 insertions(+), 74 deletions(-). The main changes are: 1) Fix strncpy_from_kernel_nofault() to prevent out-of-bounds writes, from Alban Crequy. 2) Fix for bpf_prog_test_run_skb() to prevent wrong alignment, from Baisong Zhong. 3) Switch BPF_DISPATCHER to static_call() instead of ftrace infra, with a small build fix on top, from Peter Zijlstra and Nathan Chancellor. 4) Fix memory leak in BPF verifier in some error cases, from Wang Yufen. 5) 32-bit compilation error fixes for BPF selftests, from Pu Lehui and Yang Jihong. 6) Ensure even distribution of per-CPU free list elements, from Xu Kuohai. 7) Fix copy_map_value() to track special zeroed out areas properly, from Xu Kuohai. * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: bpf: Fix offset calculation error in __copy_map_value and zero_map_value bpf: Initialize same number of free nodes for each pcpu_freelist selftests: bpf: Add a test when bpf_probe_read_kernel_str() returns EFAULT maccess: Fix writing offset in case of fault in strncpy_from_kernel_nofault() selftests/bpf: Fix test_progs compilation failure in 32-bit arch selftests/bpf: Fix casting error when cross-compiling test_verifier for 32-bit platforms bpf: Fix memory leaks in __check_func_call bpf: Add explicit cast to 'void *' for __BPF_DISPATCHER_UPDATE() bpf: Convert BPF_DISPATCHER to use static_call() (not ftrace) bpf: Revert ("Fix dispatcher patchable function entry to 5 bytes nop") bpf, test_run: Fix alignment problem in bpf_prog_test_run_skb() ==================== Link: https://lore.kernel.org/r/20221111231624.938829-1-andrii@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-11Merge tag 'mm-hotfixes-stable-2022-11-11' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc hotfixes from Andrew Morton: "22 hotfixes. Eight are cc:stable and the remainder address issues which were introduced post-6.0 or which aren't considered serious enough to justify a -stable backport" * tag 'mm-hotfixes-stable-2022-11-11' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (22 commits) docs: kmsan: fix formatting of "Example report" mm/damon/dbgfs: check if rm_contexts input is for a real context maple_tree: don't set a new maximum on the node when not reusing nodes maple_tree: fix depth tracking in maple_state arch/x86/mm/hugetlbpage.c: pud_huge() returns 0 when using 2-level paging fs: fix leaked psi pressure state nilfs2: fix use-after-free bug of ns_writer on remount x86/traps: avoid KMSAN bugs originating from handle_bug() kmsan: make sure PREEMPT_RT is off Kconfig.debug: ensure early check for KMSAN in CONFIG_KMSAN_WARN x86/uaccess: instrument copy_from_user_nmi() kmsan: core: kmsan_in_runtime() should return true in NMI context mm: hugetlb_vmemmap: include missing linux/moduleparam.h mm/shmem: use page_mapping() to detect page cache for uffd continue mm/memremap.c: map FS_DAX device memory as decrypted Partly revert "mm/thp: carry over dirty bit when thp splits on pmd" nilfs2: fix deadlock in nilfs_count_free_blocks() mm/mmap: fix memory leak in mmap_region() hugetlbfs: don't delete error page from pagecache maple_tree: reorganize testing to restore module testing ...
2022-11-11maccess: Fix writing offset in case of fault in strncpy_from_kernel_nofault()Alban Crequy
If a page fault occurs while copying the first byte, this function resets one byte before dst. As a consequence, an address could be modified and leaded to kernel crashes if case the modified address was accessed later. Fixes: b58294ead14c ("maccess: allow architectures to provide kernel probing directly") Signed-off-by: Alban Crequy <albancrequy@linux.microsoft.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Tested-by: Francis Laniel <flaniel@linux.microsoft.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: <stable@vger.kernel.org> [5.8] Link: https://lore.kernel.org/bpf/20221110085614.111213-2-albancrequy@linux.microsoft.com
2022-11-09Merge tag 'slab-for-6.1-rc4-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab Pull slab fixes from Vlastimil Babka: "Most are small fixups as described below. The !CONFIG_TRACING fix is a bit bigger and would normally be done in the next merge window as part of upcoming hardening changes. But we realized it can make the kmalloc waste tracking introduced in this window inaccurate, so decided to go with it now. Summary: - Remove !CONFIG_TRACING kmalloc() wrappers intended to save a function call, due to incompatilibity with recently introduced wasted space tracking and planned hardening changes. - A tracing parameter regression fix, by Kees Cook. - Two kernel-doc warning fixups, by Lukas Bulwahn and myself * tag 'slab-for-6.1-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab: mm, slab: remove duplicate kernel-doc comment for ksize() mm/slab_common: Restore passing "caller" for tracing mm/slab: remove !CONFIG_TRACING variants of kmalloc_[node_]trace() mm/slab_common: repair kernel-doc for __ksize()
2022-11-08mm/damon/dbgfs: check if rm_contexts input is for a real contextSeongJae Park
A user could write a name of a file under 'damon/' debugfs directory, which is not a user-created context, to 'rm_contexts' file. In the case, 'dbgfs_rm_context()' just assumes it's the valid DAMON context directory only if a file of the name exist. As a result, invalid memory access could happen as below. Fix the bug by checking if the given input is for a directory. This check can filter out non-context inputs because directories under 'damon/' debugfs directory can be created via only 'mk_contexts' file. This bug has found by syzbot[1]. [1] https://lore.kernel.org/damon/000000000000ede3ac05ec4abf8e@google.com/ Link: https://lkml.kernel.org/r/20221107165001.5717-2-sj@kernel.org Fixes: 75c1c2b53c78 ("mm/damon/dbgfs: support multiple contexts") Signed-off-by: SeongJae Park <sj@kernel.org> Reported-by: syzbot+6087eafb76a94c4ac9eb@syzkaller.appspotmail.com Cc: <stable@vger.kernel.org> [5.15.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-08kmsan: core: kmsan_in_runtime() should return true in NMI contextAlexander Potapenko
Without that, every call to __msan_poison_alloca() in NMI may end up allocating memory, which is NMI-unsafe. Link: https://lkml.kernel.org/r/20221102110611.1085175-1-glider@google.com Link: https://lore.kernel.org/lkml/20221025221755.3810809-1-glider@google.com/ Signed-off-by: Alexander Potapenko <glider@google.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Marco Elver <elver@google.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-08mm: hugetlb_vmemmap: include missing linux/moduleparam.hVasily Gorbik
The kernel test robot reported build failures with a 'randconfig' on s390: >> mm/hugetlb_vmemmap.c:421:11: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes] core_param(hugetlb_free_vmemmap, vmemmap_optimize_enabled, bool, 0); ^ Link: https://lore.kernel.org/linux-mm/202210300751.rG3UDsuc-lkp@intel.com/ Link: https://lkml.kernel.org/r/patch.git-296b83ca939b.your-ad-here.call-01667411912-ext-5073@work.hours Fixes: 30152245c63b ("mm: hugetlb_vmemmap: replace early_param() with core_param()") Signed-off-by: Vasily Gorbik <gor@linux.ibm.com> Reported-by: kernel test robot <lkp@intel.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-08mm/shmem: use page_mapping() to detect page cache for uffd continuePeter Xu
mfill_atomic_install_pte() checks page->mapping to detect whether one page is used in the page cache. However as pointed out by Matthew, the page can logically be a tail page rather than always the head in the case of uffd minor mode with UFFDIO_CONTINUE. It means we could wrongly install one pte with shmem thp tail page assuming it's an anonymous page. It's not that clear even for anonymous page, since normally anonymous pages also have page->mapping being setup with the anon vma. It's safe here only because the only such caller to mfill_atomic_install_pte() is always passing in a newly allocated page (mcopy_atomic_pte()), whose page->mapping is not yet setup. However that's not extremely obvious either. For either of above, use page_mapping() instead. Link: https://lkml.kernel.org/r/Y2K+y7wnhC4vbnP2@x1n Fixes: 153132571f02 ("userfaultfd/shmem: support UFFDIO_CONTINUE for shmem") Signed-off-by: Peter Xu <peterx@redhat.com> Reported-by: Matthew Wilcox <willy@infradead.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-08mm/memremap.c: map FS_DAX device memory as decryptedPankaj Gupta
virtio_pmem use devm_memremap_pages() to map the device memory. By default this memory is mapped as encrypted with SEV. Guest reboot changes the current encryption key and guest no longer properly decrypts the FSDAX device meta data. Mark the corresponding device memory region for FSDAX devices (mapped with memremap_pages) as decrypted to retain the persistent memory property. Link: https://lkml.kernel.org/r/20221102160728.3184016-1-pankaj.gupta@amd.com Fixes: b7b3c01b19159 ("mm/memremap_pages: support multiple ranges per invocation") Signed-off-by: Pankaj Gupta <pankaj.gupta@amd.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-08Partly revert "mm/thp: carry over dirty bit when thp splits on pmd"Peter Xu
Anatoly Pugachev reported sparc64 breakage on the patch: https://lore.kernel.org/r/20221021160603.GA23307@u164.east.ru The sparc64 impl of pte_mkdirty() is definitely slightly special in that it leverages a code patching mechanism for sun4u/sun4v on relevant pgtable entry operations. Before having a clue of why the sparc64 is special and caused the patch to SIGSEGV the processes, revert the patch for now. The swap path of dirty bit inheritage is kept because that's using the swap shared code so we assume it'll not be affected. Link: https://lkml.kernel.org/r/Y1Wbi4yyVvDtg4zN@x1n Fixes: 0ccf7f168e17 ("mm/thp: carry over dirty bit when thp splits on pmd") Signed-off-by: Peter Xu <peterx@redhat.com> Reported-by: Anatoly Pugachev <matorola@gmail.com> Tested-by: Anatoly Pugachev <matorola@gmail.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andi Kleen <andi.kleen@intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: David S. Miller <davem@davemloft.net> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Minchan Kim <minchan@kernel.org> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-08mm/mmap: fix memory leak in mmap_region()Li Zetao
There is a memory leak reported by kmemleak: unreferenced object 0xffff88817231ce40 (size 224): comm "mount.cifs", pid 19308, jiffies 4295917571 (age 405.880s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 60 c0 b2 00 81 88 ff ff 98 83 01 42 81 88 ff ff `..........B.... backtrace: [<ffffffff81936171>] __alloc_file+0x21/0x250 [<ffffffff81937051>] alloc_empty_file+0x41/0xf0 [<ffffffff81937159>] alloc_file+0x59/0x710 [<ffffffff81937964>] alloc_file_pseudo+0x154/0x210 [<ffffffff81741dbf>] __shmem_file_setup+0xff/0x2a0 [<ffffffff817502cd>] shmem_zero_setup+0x8d/0x160 [<ffffffff817cc1d5>] mmap_region+0x1075/0x19d0 [<ffffffff817cd257>] do_mmap+0x727/0x1110 [<ffffffff817518b2>] vm_mmap_pgoff+0x112/0x1e0 [<ffffffff83adf955>] do_syscall_64+0x35/0x80 [<ffffffff83c0006a>] entry_SYSCALL_64_after_hwframe+0x46/0xb0 The root cause was traced to an error handing path in mmap_region() when arch_validate_flags() or mas_preallocate() fails. In the shared anonymous mapping sence, vma will be setuped and mapped with a new shared anonymous file via shmem_zero_setup(). So in this case, the file resource needs to be released. Fix it by calling fput(vma->vm_file) and unmap_region() when arch_validate_flags() or mas_preallocate() returns an error in the shared anonymous mapping sence. Link: https://lkml.kernel.org/r/20221028073717.1179380-1-lizetao1@huawei.com Fixes: d4af56c5c7c6 ("mm: start tracking VMAs with maple tree") Fixes: c462ac288f2c ("mm: Introduce arch_validate_flags()") Signed-off-by: Li Zetao <lizetao1@huawei.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-08hugetlbfs: don't delete error page from pagecacheJames Houghton
This change is very similar to the change that was made for shmem [1], and it solves the same problem but for HugeTLBFS instead. Currently, when poison is found in a HugeTLB page, the page is removed from the page cache. That means that attempting to map or read that hugepage in the future will result in a new hugepage being allocated instead of notifying the user that the page was poisoned. As [1] states, this is effectively memory corruption. The fix is to leave the page in the page cache. If the user attempts to use a poisoned HugeTLB page with a syscall, the syscall will fail with EIO, the same error code that shmem uses. For attempts to map the page, the thread will get a BUS_MCEERR_AR SIGBUS. [1]: commit a76054266661 ("mm: shmem: don't truncate page if memory failure happens") Link: https://lkml.kernel.org/r/20221018200125.848471-1-jthoughton@google.com Signed-off-by: James Houghton <jthoughton@google.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Tested-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: James Houghton <jthoughton@google.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-07mm, slab: remove duplicate kernel-doc comment for ksize()Vlastimil Babka
Akira reports: > "make htmldocs" reports duplicate C declaration of ksize() as follows: > /linux/Documentation/core-api/mm-api:43: ./mm/slab_common.c:1428: WARNING: Duplicate C declaration, also defined at core-api/mm-api:212. > Declaration is '.. c:function:: size_t ksize (const void *objp)'. > This is due to the kernel-doc comment for ksize() declaration added in > include/linux/slab.h by commit 05a940656e1e ("slab: Introduce > kmalloc_size_roundup()"). There is an older kernel-doc comment for ksize() definition in mm/slab_common.c, which is not only duplicated, but also contradicts the new one - the additional storage discovered by ksize() should not be used by callers anymore. Delete the old kernel-doc. Reported-by: Akira Yokosawa <akiyks@gmail.com> Link: https://lore.kernel.org/all/d33440f6-40cf-9747-3340-e54ffaf7afb8@gmail.com/ Fixes: 05a940656e1e ("slab: Introduce kmalloc_size_roundup()") Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2022-11-06mm/slab_common: Restore passing "caller" for tracingKees Cook
The "caller" argument was accidentally being ignored in a few places that were recently refactored. Restore these "caller" arguments, instead of _RET_IP_. Fixes: 11e9734bcb6a ("mm/slab_common: unify NUMA and UMA version of tracepoints") Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: linux-mm@kvack.org Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2022-11-04mm/slab: remove !CONFIG_TRACING variants of kmalloc_[node_]trace()Vlastimil Babka
For !CONFIG_TRACING kernels, the kmalloc() implementation tries (in cases where the allocation size is build-time constant) to save a function call, by inlining kmalloc_trace() to a kmem_cache_alloc() call. However since commit 6edf2576a6cc ("mm/slub: enable debugging memory wasting of kmalloc") this path now fails to pass the original request size to be eventually recorded (for kmalloc caches with debugging enabled). We could adjust the code to call __kmem_cache_alloc_node() as the CONFIG_TRACING variant, but that would as a result inline a call with 5 parameters, bloating the kmalloc() call sites. The cost of extra function call (to kmalloc_trace()) seems like a lesser evil. It also appears that the !CONFIG_TRACING variant is incompatible with upcoming hardening efforts [1] so it's easier if we just remove it now. Kernels with no tracing are rare these days and the benefit is dubious anyway. [1] https://lore.kernel.org/linux-mm/20221101222520.never.109-kees@kernel.org/T/#m20ecf14390e406247bde0ea9cce368f469c539ed Link: https://lore.kernel.org/all/097d8fba-bd10-a312-24a3-a4068c4f424c@suse.cz/ Suggested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2022-11-03mm/slab_common: repair kernel-doc for __ksize()Lukas Bulwahn
Commit 445d41d7a7c1 ("Merge branch 'slab/for-6.1/kmalloc_size_roundup' into slab/for-next") resolved a conflict of two concurrent changes to __ksize(). However, it did not adjust the kernel-doc comment of __ksize(), while the name of the argument to __ksize() was renamed. Hence, ./scripts/ kernel-doc -none mm/slab_common.c warns about it. Adjust the kernel-doc comment for __ksize() for make W=1 happiness. Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com> Acked-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2022-10-28mmap: fix remap_file_pages() regressionLiam Howlett
When using the VMA iterator, the final execution will set the variable 'next' to NULL which causes the function to fail out. Restore the break in the loop to exit the VMA iterator early without clearing NULL fixes the issue. Link: https://lore.kernel.org/lkml/29344.1666681759@jrobl/ Link: https://lkml.kernel.org/r/20221025161222.2634030-1-Liam.Howlett@oracle.com Fixes: 763ecb035029 (mm: remove the vma linked list) Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reported-by: "J. R. Okajima" <hooanon05g@gmail.com> Tested-by: "J. R. Okajima" <hooanon05g@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-28mm/shmem: ensure proper fallback if page faultsIra Weiny
The kernel test robot flagged a recursive lock as a result of a conversion from kmap_atomic() to kmap_local_folio()[Link] The cause was due to the code depending on the kmap_atomic() side effect of disabling page faults. In that case the code expects the fault to fail and take the fallback case. git archaeology implied that the recursion may not be an actual bug.[1] However, depending on the implementation of the mmap_lock and the condition of the call there may still be a deadlock.[2] So this is not purely a lockdep issue. Considering a single threaded call stack there are 3 options. 1) Different mm's are in play (no issue) 2) Readlock implementation is recursive and same mm is in play (no issue) 3) Readlock implementation is _not_ recursive (issue) The mmap_lock is recursive so with a single thread there is no issue. However, Matthew pointed out a deadlock scenario when you consider additional process' and threads thusly. "The readlock implementation is only recursive if nobody else has taken a write lock. If you have a multithreaded process, one of the other threads can call mmap() and that will prevent recursion (due to fairness). Even if it's a different process that you're trying to acquire the mmap read lock on, you can still get into a deadly embrace. eg: process A thread 1 takes read lock on own mmap_lock process A thread 2 calls mmap, blocks taking write lock process B thread 1 takes page fault, read lock on own mmap lock process B thread 2 calls mmap, blocks taking write lock process A thread 1 blocks taking read lock on process B process B thread 1 blocks taking read lock on process A Now all four threads are blocked waiting for each other." Regardless using pagefault_disable() ensures that no matter what locking implementation is used a deadlock will not occur. Add an explicit pagefault_disable() and a big comment to explain this for future souls looking at this code. [1] https://lore.kernel.org/all/Y1MymJ%2FINb45AdaY@iweiny-desk3/ [2] https://lore.kernel.org/lkml/Y1bXBtGTCym77%2FoD@casper.infradead.org/ Link: https://lkml.kernel.org/r/20221025220108.2366043-1-ira.weiny@intel.com Link: https://lore.kernel.org/r/202210211215.9dc6efb5-yujie.liu@intel.com Fixes: 7a7256d5f512 ("shmem: convert shmem_mfill_atomic_pte() to use a folio") Signed-off-by: Ira Weiny <ira.weiny@intel.com> Reported-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reported-by: kernel test robot <yujie.liu@intel.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-28mm/userfaultfd: replace kmap/kmap_atomic() with kmap_local_page()Ira Weiny
kmap() and kmap_atomic() are being deprecated in favor of kmap_local_page() which is appropriate for any thread local context.[1] A recent locking bug report with userfaultfd showed that the conversion of the kmap_atomic()'s in those code flows requires care with regard to the prevention of deadlock.[2] git archaeology implied that the recursion may not be an actual bug.[3] However, depending on the implementation of the mmap_lock and the condition of the call there may still be a deadlock.[4] So this is not purely a lockdep issue. Considering a single threaded call stack there are 3 options. 1) Different mm's are in play (no issue) 2) Readlock implementation is recursive and same mm is in play (no issue) 3) Readlock implementation is _not_ recursive (issue) The mmap_lock is recursive so with a single thread there is no issue. However, Matthew pointed out a deadlock scenario when you consider additional process' and threads thusly. "The readlock implementation is only recursive if nobody else has taken a write lock. If you have a multithreaded process, one of the other threads can call mmap() and that will prevent recursion (due to fairness). Even if it's a different process that you're trying to acquire the mmap read lock on, you can still get into a deadly embrace. eg: process A thread 1 takes read lock on own mmap_lock process A thread 2 calls mmap, blocks taking write lock process B thread 1 takes page fault, read lock on own mmap lock process B thread 2 calls mmap, blocks taking write lock process A thread 1 blocks taking read lock on process B process B thread 1 blocks taking read lock on process A Now all four threads are blocked waiting for each other." Regardless using pagefault_disable() ensures that no matter what locking implementation is used a deadlock will not occur. Complete kmap conversion in userfaultfd by replacing the kmap() and kmap_atomic() calls with kmap_local_page(). When replacing the kmap_atomic() call ensure page faults continue to be disabled to support the correct fall back behavior and add a comment to inform future souls of the requirement. [1] https://lore.kernel.org/all/20220813220034.806698-1-ira.weiny@intel.com/ [2] https://lore.kernel.org/all/Y1Mh2S7fUGQ%2FiKFR@iweiny-desk3/ [3] https://lore.kernel.org/all/Y1MymJ%2FINb45AdaY@iweiny-desk3/ [4] https://lore.kernel.org/lkml/Y1bXBtGTCym77%2FoD@casper.infradead.org/ [ira.weiny@intel.com: v2] Link: https://lkml.kernel.org/r/20221025220136.2366143-1-ira.weiny@intel.com Link: https://lkml.kernel.org/r/20221024043452.1491677-1-ira.weiny@intel.com Signed-off-by: Ira Weiny <ira.weiny@intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-28x86: fortify: kmsan: fix KMSAN fortify buildsAlexander Potapenko
Ensure that KMSAN builds replace memset/memcpy/memmove calls with the respective __msan_XXX functions, and that none of the macros are redefined twice. This should allow building kernel with both CONFIG_KMSAN and CONFIG_FORTIFY_SOURCE. Link: https://lkml.kernel.org/r/20221024212144.2852069-5-glider@google.com Link: https://github.com/google/kmsan/issues/89 Signed-off-by: Alexander Potapenko <glider@google.com> Reported-by: Tamas K Lengyel <tamas.lengyel@zentific.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-28mm: kmsan: export kmsan_copy_page_meta()Alexander Potapenko
Certain modules call copy_user_highpage(), which calls kmsan_copy_page_meta() under KMSAN, so we need to export the latter. Link: https://lkml.kernel.org/r/20221024212144.2852069-1-glider@google.com Link: https://github.com/google/kmsan/issues/89 Fixes: b073d7f8aee4 ("mm: kmsan: maintain KMSAN metadata for page operations") Signed-off-by: Alexander Potapenko <glider@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-28mm: migrate: fix return value if all subpages of THPs are migrated successfullyBaolin Wang
During THP migration, if THPs are not migrated but they are split and all subpages are migrated successfully, migrate_pages() will still return the number of THP pages that were not migrated. This will confuse the callers of migrate_pages(). For example, the longterm pinning will failed though all pages are migrated successfully. Thus we should return 0 to indicate that all pages are migrated in this case Link: https://lkml.kernel.org/r/de386aa864be9158d2f3b344091419ea7c38b2f7.1666599848.git.baolin.wang@linux.alibaba.com Fixes: b5bade978e9b ("mm: migrate: fix the return value of migrate_pages()") Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-28mm: prep_compound_tail() clear page->privateHugh Dickins
Although page allocation always clears page->private in the first page or head page of an allocation, it has never made a point of clearing page->private in the tails (though 0 is often what is already there). But now commit 71e2d666ef85 ("mm/huge_memory: do not clobber swp_entry_t during THP split") issues a warning when page_tail->private is found to be non-0 (unless it's swapcache). Change that warning to dump page_tail (which also dumps head), instead of just the head: so far we have seen dead000000000122, dead000000000003, dead000000000001 or 0000000000000002 in the raw output for tail private. We could just delete the warning, but today's consensus appears to want page->private to be 0, unless there's a good reason for it to be set: so now clear it in prep_compound_tail() (more general than just for THP; but not for high order allocation, which makes no pass down the tails). Link: https://lkml.kernel.org/r/1c4233bb-4e4d-5969-fbd4-96604268a285@google.com Fixes: 71e2d666ef85 ("mm/huge_memory: do not clobber swp_entry_t during THP split") Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-28mm,madvise,hugetlb: fix unexpected data loss with MADV_DONTNEED on hugetlbfsRik van Riel
A common use case for hugetlbfs is for the application to create memory pools backed by huge pages, which then get handed over to some malloc library (eg. jemalloc) for further management. That malloc library may be doing MADV_DONTNEED calls on memory that is no longer needed, expecting those calls to happen on PAGE_SIZE boundaries. However, currently the MADV_DONTNEED code rounds up any such requests to HPAGE_PMD_SIZE boundaries. This leads to undesired outcomes when jemalloc expects a 4kB MADV_DONTNEED, but 2MB of memory get zeroed out, instead. Use of pre-built shared libraries means that user code does not always know the page size of every memory arena in use. Avoid unexpected data loss with MADV_DONTNEED by rounding up only to PAGE_SIZE (in do_madvise), and rounding down to huge page granularity. That way programs will only get as much memory zeroed out as they requested. Link: https://lkml.kernel.org/r/20221021192805.366ad573@imladris.surriel.com Fixes: 90e7e7f5ef3f ("mm: enable MADV_DONTNEED for hugetlb mappings") Signed-off-by: Rik van Riel <riel@surriel.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-28mm/page_isolation: fix clang deadcode warningMaria Yu
When !CONFIG_VM_BUG_ON, there is warning of clang-analyzer-deadcode.DeadStores: Value stored to 'mt' during its initialization is never read. Link: https://lkml.kernel.org/r/20221021101555.7992-2-quic_aiquny@quicinc.com Signed-off-by: Maria Yu <quic_aiquny@quicinc.com> Cc: David Hildenbrand <david@redhat.com> Cc: Doug Berger <opendmb@gmail.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Matthew Wilcox <willy@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-28memory tier, sysfs: rename attribute "nodes" to "nodelist"Huang Ying
In sysfs, we use attribute name "cpumap" or "cpus" for cpu mask and "cpulist" or "cpus_list" for cpu list. For example, in my system, $ cat /sys/devices/system/node/node0/cpumap f,ffffffff $ cat /sys/devices/system/cpu/cpu2/topology/core_cpus 0,00100004 $ cat cat /sys/devices/system/node/node0/cpulist 0-35 $ cat /sys/devices/system/cpu/cpu2/topology/core_cpus_list 2,20 It looks reasonable to use "nodemap" for node mask and "nodelist" for node list. So, rename the attribute to follow the naming convention. Link: https://lkml.kernel.org/r/20221020015122.290097-1-ying.huang@intel.com Fixes: 9832fb87834e2b ("mm/demotion: expose memory tier details via sysfs") Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Acked-by: Wei Xu <weixugc@google.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net> Cc: Alistair Popple <apopple@nvidia.com> Cc: Bharata B Rao <bharata@amd.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Hesham Almatary <hesham.almatary@huawei.com> Cc: Jagdish Gediya <jvgediya.oss@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tim Chen <tim.c.chen@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-28mm/kmemleak: prevent soft lockup in kmemleak_scan()'s object iteration loopsWaiman Long
Commit 6edda04ccc7c ("mm/kmemleak: prevent soft lockup in first object iteration loop of kmemleak_scan()") adds cond_resched() in the first object iteration loop of kmemleak_scan(). However, it turns that the 2nd objection iteration loop can still cause soft lockup to happen in some cases. So add a cond_resched() call in the 2nd and 3rd loops as well to prevent that and for completeness. Link: https://lkml.kernel.org/r/20221020175619.366317-1-longman@redhat.com Fixes: 6edda04ccc7c ("mm/kmemleak: prevent soft lockup in first object iteration loop of kmemleak_scan()") Signed-off-by: Waiman Long <longman@redhat.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-20mm/huge_memory: do not clobber swp_entry_t during THP splitMel Gorman
The following has been observed when running stressng mmap since commit b653db77350c ("mm: Clear page->private when splitting or migrating a page") watchdog: BUG: soft lockup - CPU#75 stuck for 26s! [stress-ng:9546] CPU: 75 PID: 9546 Comm: stress-ng Tainted: G E 6.0.0-revert-b653db77-fix+ #29 0357d79b60fb09775f678e4f3f64ef0579ad1374 Hardware name: SGI.COM C2112-4GP3/X10DRT-P-Series, BIOS 2.0a 05/09/2016 RIP: 0010:xas_descend+0x28/0x80 Code: cc cc 0f b6 0e 48 8b 57 08 48 d3 ea 83 e2 3f 89 d0 48 83 c0 04 48 8b 44 c6 08 48 89 77 18 48 89 c1 83 e1 03 48 83 f9 02 75 08 <48> 3d fd 00 00 00 76 08 88 57 12 c3 cc cc cc cc 48 c1 e8 02 89 c2 RSP: 0018:ffffbbf02a2236a8 EFLAGS: 00000246 RAX: ffff9cab7d6a0002 RBX: ffffe04b0af88040 RCX: 0000000000000002 RDX: 0000000000000030 RSI: ffff9cab60509b60 RDI: ffffbbf02a2236c0 RBP: 0000000000000000 R08: ffff9cab60509b60 R09: ffffbbf02a2236c0 R10: 0000000000000001 R11: ffffbbf02a223698 R12: 0000000000000000 R13: ffff9cab4e28da80 R14: 0000000000039c01 R15: ffff9cab4e28da88 FS: 00007fab89b85e40(0000) GS:ffff9cea3fcc0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fab84e00000 CR3: 00000040b73a4003 CR4: 00000000003706e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> xas_load+0x3a/0x50 __filemap_get_folio+0x80/0x370 ? put_swap_page+0x163/0x360 pagecache_get_page+0x13/0x90 __try_to_reclaim_swap+0x50/0x190 scan_swap_map_slots+0x31e/0x670 get_swap_pages+0x226/0x3c0 folio_alloc_swap+0x1cc/0x240 add_to_swap+0x14/0x70 shrink_page_list+0x968/0xbc0 reclaim_page_list+0x70/0xf0 reclaim_pages+0xdd/0x120 madvise_cold_or_pageout_pte_range+0x814/0xf30 walk_pgd_range+0x637/0xa30 __walk_page_range+0x142/0x170 walk_page_range+0x146/0x170 madvise_pageout+0xb7/0x280 ? asm_common_interrupt+0x22/0x40 madvise_vma_behavior+0x3b7/0xac0 ? find_vma+0x4a/0x70 ? find_vma+0x64/0x70 ? madvise_vma_anon_name+0x40/0x40 madvise_walk_vmas+0xa6/0x130 do_madvise+0x2f4/0x360 __x64_sys_madvise+0x26/0x30 do_syscall_64+0x5b/0x80 ? do_syscall_64+0x67/0x80 ? syscall_exit_to_user_mode+0x17/0x40 ? do_syscall_64+0x67/0x80 ? syscall_exit_to_user_mode+0x17/0x40 ? do_syscall_64+0x67/0x80 ? do_syscall_64+0x67/0x80 ? common_interrupt+0x8b/0xa0 entry_SYSCALL_64_after_hwframe+0x63/0xcd The problem can be reproduced with the mmtests config config-workload-stressng-mmap. It does not always happen and when it triggers is variable but it has happened on multiple machines. The intent of commit b653db77350c patch was to avoid the case where PG_private is clear but folio->private is not-NULL. However, THP tail pages uses page->private for "swp_entry_t if folio_test_swapcache()" as stated in the documentation for struct folio. This patch only clobbers page->private for tail pages if the head page was not in swapcache and warns once if page->private had an unexpected value. Link: https://lkml.kernel.org/r/20221019134156.zjyyn5aownakvztf@techsingularity.net Fixes: b653db77350c ("mm: Clear page->private when splitting or migrating a page") Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Yang Shi <shy828301@gmail.com> Cc: Brian Foster <bfoster@redhat.com> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Oleksandr Natalenko <oleksandr@natalenko.name> Cc: Seth Jennings <sjenning@redhat.com> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-20hugetlb: fix memory leak associated with vma_lock structureMike Kravetz
The hugetlb vma_lock structure hangs off the vm_private_data pointer of sharable hugetlb vmas. The structure is vma specific and can not be shared between vmas. At fork and various other times, vmas are duplicated via vm_area_dup(). When this happens, the pointer in the newly created vma must be cleared and the structure reallocated. Two hugetlb specific routines deal with this hugetlb_dup_vma_private and hugetlb_vm_op_open. Both routines are called for newly created vmas. hugetlb_dup_vma_private would always clear the pointer and hugetlb_vm_op_open would allocate the new vms_lock structure. This did not work in the case of this calling sequence pointed out in [1]. move_vma copy_vma new_vma = vm_area_dup(vma); new_vma->vm_ops->open(new_vma); --> new_vma has its own vma lock. is_vm_hugetlb_page(vma) clear_vma_resv_huge_pages hugetlb_dup_vma_private --> vma->vm_private_data is set to NULL When clearing hugetlb_dup_vma_private we actually leak the associated vma_lock structure. The vma_lock structure contains a pointer to the associated vma. This information can be used in hugetlb_dup_vma_private and hugetlb_vm_op_open to ensure we only clear the vm_private_data of newly created (copied) vmas. In such cases, the vma->vma_lock->vma field will not point to the vma. Update hugetlb_dup_vma_private and hugetlb_vm_op_open to not clear vm_private_data if vma->vma_lock->vma == vma. Also, log a warning if hugetlb_vm_op_open ever encounters the case where vma_lock has already been correctly allocated for the vma. [1] https://lore.kernel.org/linux-mm/5154292a-4c55-28cd-0935-82441e512fc3@huawei.com/ Link: https://lkml.kernel.org/r/20221019201957.34607-1-mike.kravetz@oracle.com Fixes: 131a79b474e9 ("hugetlb: fix vma lock handling during split vma and range unmapping") Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: James Houghton <jthoughton@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Prakash Sangappa <prakash.sangappa@oracle.com> Cc: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-20mm/page_alloc: reduce potential fragmentation in make_alloc_exact()Liam R. Howlett
Try to avoid using the left over split page on the next request for a page by calling __free_pages_ok() with FPI_TO_TAIL. This increases the potential of defragmenting memory when it's used for a short period of time. Link: https://lkml.kernel.org/r/20220531185626.yvlmymbxyoe5vags@revolver Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-20mm,hugetlb: take hugetlb_lock before decrementing h->resv_huge_pagesRik van Riel
The h->*_huge_pages counters are protected by the hugetlb_lock, but alloc_huge_page has a corner case where it can decrement the counter outside of the lock. This could lead to a corrupted value of h->resv_huge_pages, which we have observed on our systems. Take the hugetlb_lock before decrementing h->resv_huge_pages to avoid a potential race. Link: https://lkml.kernel.org/r/20221017202505.0e6a4fcd@imladris.surriel.com Fixes: a88c76954804 ("mm: hugetlb: fix hugepage memory leak caused by wrong reserve count") Signed-off-by: Rik van Riel <riel@surriel.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Glen McCready <gkmccready@meta.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-20mm/mmap: fix MAP_FIXED address return on VMA mergeLiam Howlett
mmap should return the start address of newly mapped area when successful. On a successful merge of a VMA, the return address was changed and thus was violating that expectation from userspace. This is a restoration of functionality provided by 309d08d9b3a3 (mm/mmap.c: fix mmap return value when vma is merged after call_mmap()). For completeness of fixing MAP_FIXED, implement the comments from the previous discussion to never update the address and fail if the address changes. Leaving the error as a WARN_ON() to avoid crashing the kernel. Link: https://lkml.kernel.org/r/20221018191613.4133459-1-Liam.Howlett@oracle.com Link: https://lore.kernel.org/all/Y06yk66SKxlrwwfb@lakrids/ Link: https://lore.kernel.org/all/20201203085350.22624-1-liuzixian4@huawei.com/ Fixes: 4dd1b84140c1 ("mm/mmap: use advanced maple tree API for mmap_region()") Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reported-by: Mark Rutland <mark.rutland@arm.com> Cc: Liu Zixian <liuzixian4@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-20mm/mmap.c: __vma_adjust(): suppress uninitialized var warningAndrew Morton
The code is OK, but it fools gcc. mm/mmap.c:802 __vma_adjust() error: uninitialized symbol 'next_next'. Fixes: 524e00b36e8c5 ("mm: remove rb tree.") Reported-by: kernel test robot <lkp@intel.com> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-20mm/mmap: undo ->mmap() when mas_preallocate() failsMike Kravetz
A memory leak in hugetlb_reserve_pages was reported in [1]. The root cause was traced to an error path in mmap_region when mas_preallocate() fails. In this case, the vma is freed after a successful call to filesystem specific mmap. The hugetlbfs mmap routine may allocate data structures pointed to by m_private_data. These need to be cleaned up by the hugetlb vm_ops->close() routine. The same issue was addressed by commit deb0f6562884 ("mm/mmap: undo ->mmap() when arch_validate_flags() fails") for the arch_validate_flags() test. Go to the same close_and_free_vma label if mas_preallocate() fails. [1] https://lore.kernel.org/linux-mm/CAKXUXMxf7OiCwbxib7MwfR4M1b5+b3cNTU7n5NV9Zm4967=FPQ@mail.gmail.com/ Link: https://lkml.kernel.org/r/20221018024945.415036-1-mike.kravetz@oracle.com Fixes: d4af56c5c7c6 ("mm: start tracking VMAs with maple tree") Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reported-by: Lukas Bulwahn <lukas.bulwahn@gmail.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Carlos Llamas <cmllamas@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-20zsmalloc: zs_destroy_pool: add size_class NULL checkAlexey Romanov
Inside the zs_destroy_pool() function, there can still be NULL size_class pointers: if when the next size_class is allocated, inside zs_create_pool() function, kzalloc will return NULL and handling the error condition, zs_create_pool() will call zs_destroy_pool(). Link: https://lkml.kernel.org/r/20221013112825.61869-1-avromanov@sberdevices.ru Fixes: f24263a5a076 ("zsmalloc: remove unnecessary size_class NULL check") Signed-off-by: Alexey Romanov <avromanov@sberdevices.ru> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-20mm/mempolicy: fix mbind_range() arguments to vma_merge()Liam Howlett
Fuzzing produced an invalid argument to vma_merge() which was caught by the newly added verification of the number of VMAs being removed on process exit. Analyzing the failure eventually resulted in finding an issue with the search of a VMA that started at address 0, which caused an underflow and thus the loss of many VMAs being tracked in the tree. Fix the underflow by changing the search of the maple tree to use the start address directly. Link: https://lkml.kernel.org/r/20221015021135.2816178-1-Liam.Howlett@oracle.com Fixes: 66850be55e8e ("mm/mempolicy: use vma iterator & maple state instead of vma linked list") Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reported-by: kernel test robot <oliver.sang@intel.com> Link: https://lore.kernel.org/r/202210052318.5ad10912-oliver.sang@intel.com Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-16Merge tag 'random-6.1-rc1-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/crng/random Pull more random number generator updates from Jason Donenfeld: "This time with some large scale treewide cleanups. The intent of this pull is to clean up the way callers fetch random integers. The current rules for doing this right are: - If you want a secure or an insecure random u64, use get_random_u64() - If you want a secure or an insecure random u32, use get_random_u32() The old function prandom_u32() has been deprecated for a while now and is just a wrapper around get_random_u32(). Same for get_random_int(). - If you want a secure or an insecure random u16, use get_random_u16() - If you want a secure or an insecure random u8, use get_random_u8() - If you want secure or insecure random bytes, use get_random_bytes(). The old function prandom_bytes() has been deprecated for a while now and has long been a wrapper around get_random_bytes() - If you want a non-uniform random u32, u16, or u8 bounded by a certain open interval maximum, use prandom_u32_max() I say "non-uniform", because it doesn't do any rejection sampling or divisions. Hence, it stays within the prandom_*() namespace, not the get_random_*() namespace. I'm currently investigating a "uniform" function for 6.2. We'll see what comes of that. By applying these rules uniformly, we get several benefits: - By using prandom_u32_max() with an upper-bound that the compiler can prove at compile-time is ≤65536 or ≤256, internally get_random_u16() or get_random_u8() is used, which wastes fewer batched random bytes, and hence has higher throughput. - By using prandom_u32_max() instead of %, when the upper-bound is not a constant, division is still avoided, because prandom_u32_max() uses a faster multiplication-based trick instead. - By using get_random_u16() or get_random_u8() in cases where the return value is intended to indeed be a u16 or a u8, we waste fewer batched random bytes, and hence have higher throughput. This series was originally done by hand while I was on an airplane without Internet. Later, Kees and I worked on retroactively figuring out what could be done with Coccinelle and what had to be done manually, and then we split things up based on that. So while this touches a lot of files, the actual amount of code that's hand fiddled is comfortably small" * tag 'random-6.1-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random: prandom: remove unused functions treewide: use get_random_bytes() when possible treewide: use get_random_u32() when possible treewide: use get_random_{u8,u16}() when possible, part 2 treewide: use get_random_{u8,u16}() when possible, part 1 treewide: use prandom_u32_max() when possible, part 2 treewide: use prandom_u32_max() when possible, part 1
2022-10-15Merge tag 'slab-for-6.1-rc1-hotfix' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab Pull slab hotfix from Vlastimil Babka: "A single fix for the common-kmalloc series, for warnings on mips and sparc64 reported by Guenter Roeck" * tag 'slab-for-6.1-rc1-hotfix' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab: mm/slab: use kmalloc_node() for off slab freelist_idx_t array allocation
2022-10-15mm/slab: use kmalloc_node() for off slab freelist_idx_t array allocationHyeonggon Yoo
After commit d6a71648dbc0 ("mm/slab: kmalloc: pass requests larger than order-1 page to page allocator"), SLAB passes large ( > PAGE_SIZE * 2) requests to buddy like SLUB does. SLAB has been using kmalloc caches to allocate freelist_idx_t array for off slab caches. But after the commit, freelist_size can be bigger than KMALLOC_MAX_CACHE_SIZE. Instead of using pointer to kmalloc cache, use kmalloc_node() and only check if the kmalloc cache is off slab during calculate_slab_order(). If freelist_size > KMALLOC_MAX_CACHE_SIZE, no looping condition happens as it allocates freelist_idx_t array directly from buddy. Link: https://lore.kernel.org/all/20221014205818.GA1428667@roeck-us.net/ Reported-and-tested-by: Guenter Roeck <linux@roeck-us.net> Fixes: d6a71648dbc0 ("mm/slab: kmalloc: pass requests larger than order-1 page to page allocator") Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2022-10-14Merge tag 'mm-stable-2022-10-13' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull more MM updates from Andrew Morton: - fix a race which causes page refcounting errors in ZONE_DEVICE pages (Alistair Popple) - fix userfaultfd test harness instability (Peter Xu) - various other patches in MM, mainly fixes * tag 'mm-stable-2022-10-13' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (29 commits) highmem: fix kmap_to_page() for kmap_local_page() addresses mm/page_alloc: fix incorrect PGFREE and PGALLOC for high-order page mm/selftest: uffd: explain the write missing fault check mm/hugetlb: use hugetlb_pte_stable in migration race check mm/hugetlb: fix race condition of uffd missing/minor handling zram: always expose rw_page LoongArch: update local TLB if PTE entry exists mm: use update_mmu_tlb() on the second thread kasan: fix array-bounds warnings in tests hmm-tests: add test for migrate_device_range() nouveau/dmem: evict device private memory during release nouveau/dmem: refactor nouveau_dmem_fault_copy_one() mm/migrate_device.c: add migrate_device_range() mm/migrate_device.c: refactor migrate_vma and migrate_deivce_coherent_page() mm/memremap.c: take a pgmap reference on page allocation mm: free device private pages have zero refcount mm/memory.c: fix race when faulting a device private page mm/damon: use damon_sz_region() in appropriate place mm/damon: move sz_damon_region to damon_sz_region lib/test_meminit: add checks for the allocation functions ...
2022-10-12highmem: fix kmap_to_page() for kmap_local_page() addressesIra Weiny
kmap_to_page() is used to get the page for a virtual address which may be kmap'ed. Unfortunately, kmap_local_page() stores mappings in a thread local array separate from kmap(). These mappings were not checked by the call. Check the kmap_local_page() mappings and return the page if found. Because it is intended to remove kmap_to_page() add a warn on once to the kmap checks to flag potential issues early. NOTE Due to 32bit x86 use of kmap local in iomap atmoic, KMAP_LOCAL does not require HIGHMEM to be set. Therefore the support calls required a new KMAP_LOCAL section to fix 0day build errors. [akpm@linux-foundation.org: fix warning] Link: https://lkml.kernel.org/r/20221006040555.1502679-1-ira.weiny@intel.com Signed-off-by: Ira Weiny <ira.weiny@intel.com> Reported-by: Al Viro <viro@zeniv.linux.org.uk> Reported-by: kernel test robot <lkp@intel.com> Cc: "Fabio M. De Francesco" <fmdefrancesco@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-12mm/page_alloc: fix incorrect PGFREE and PGALLOC for high-order pageYafang Shao
PGFREE and PGALLOC represent the number of freed and allocated pages. So the page order must be considered. Link: https://lkml.kernel.org/r/20221006101540.40686-1-laoar.shao@gmail.com Fixes: 44042b449872 ("mm/page_alloc: allow high-order pages to be stored on the per-cpu lists") Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-12mm/hugetlb: use hugetlb_pte_stable in migration race checkPeter Xu
After hugetlb_pte_stable() introduced, we can also rewrite the migration race condition against page allocation to use the new helper too. Link: https://lkml.kernel.org/r/20221004193400.110155-3-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-12mm/hugetlb: fix race condition of uffd missing/minor handlingPeter Xu
Patch series "mm/hugetlb: Fix selftest failures with write check", v3. Currently akpm mm-unstable fails with uffd hugetlb private mapping test randomly on a write check. The initial bisection of that points to the recent pmd unshare series, but it turns out there's no direction relationship with the series but only some timing change caused the race to start trigger. The race should be fixed in patch 1. Patch 2 is a trivial cleanup on the similar race with hugetlb migrations, patch 3 comment on the write check so when anyone read it again it'll be clear why it's there. This patch (of 3): After the recent rework patchset of hugetlb locking on pmd sharing, kselftest for userfaultfd sometimes fails on hugetlb private tests with unexpected write fault checks. It turns out there's nothing wrong within the locking series regarding this matter, but it could have changed the timing of threads so it can trigger an old bug. The real bug is when we call hugetlb_no_page() we're not with the pgtable lock. It means we're reading the pte values lockless. It's perfectly fine in most cases because before we do normal page allocations we'll take the lock and check pte_same() again. However before that, there are actually two paths on userfaultfd missing/minor handling that may directly move on with the fault process without checking the pte values. It means for these two paths we may be generating an uffd message based on an unstable pte, while an unstable pte can legally be anything as long as the modifier holds the pgtable lock. One example, which is also what happened in the failing kselftest and caused the test failure, is that for private mappings wr-protection changes can happen on one page. While hugetlb_change_protection() generally requires pte being cleared before being changed, then there can be a race condition like: thread 1 thread 2 -------- -------- UFFDIO_WRITEPROTECT hugetlb_fault hugetlb_change_protection pgtable_lock() huge_ptep_modify_prot_start pte==NULL hugetlb_no_page generate uffd missing event even if page existed!! huge_ptep_modify_prot_commit pgtable_unlock() Fix this by rechecking the pte after pgtable lock for both userfaultfd missing & minor fault paths. This bug should have been around starting from uffd hugetlb introduced, so attaching a Fixes to the commit. Also attach another Fixes to the minor support commit for easier tracking. Note that userfaultfd is actually fine with false positives (e.g. caused by pte changed), but not wrong logical events (e.g. caused by reading a pte during changing). The latter can confuse the userspace, so the strictness is very much preferred. E.g., MISSING event should never happen on the page after UFFDIO_COPY has correctly installed the page and returned. Link: https://lkml.kernel.org/r/20221004193400.110155-1-peterx@redhat.com Link: https://lkml.kernel.org/r/20221004193400.110155-2-peterx@redhat.com Fixes: 1a1aad8a9b7b ("userfaultfd: hugetlbfs: add userfaultfd hugetlb hook") Fixes: 7677f7fd8be7 ("userfaultfd: add minor fault registration mode") Signed-off-by: Peter Xu <peterx@redhat.com> Co-developed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-12mm: use update_mmu_tlb() on the second threadQi Zheng
As message in commit 7df676974359 ("mm/memory.c: Update local TLB if PTE entry exists") said, we should update local TLB only on the second thread. So in the do_anonymous_page() here, we should use update_mmu_tlb() instead of update_mmu_cache() on the second thread. As David pointed out, this is a performance improvement, not a correctness fix. Link: https://lkml.kernel.org/r/20220929112318.32393-2-zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Bibo Mao <maobibo@loongson.cn> Cc: Chris Zankel <chris@zankel.net> Cc: Huacai Chen <chenhuacai@loongson.cn> Cc: Max Filippov <jcmvbkbc@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-12kasan: fix array-bounds warnings in testsAndrey Konovalov
GCC's -Warray-bounds option detects out-of-bounds accesses to statically-sized allocations in krealloc out-of-bounds tests. Use OPTIMIZER_HIDE_VAR to suppress the warning. Also change kmalloc_memmove_invalid_size to use OPTIMIZER_HIDE_VAR instead of a volatile variable. Link: https://lkml.kernel.org/r/e94399242d32e00bba6fd0d9ec4c897f188128e8.1664215688.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reported-by: kernel test robot <lkp@intel.com> Reviewed-by: Kees Cook <keescook@chromium.org> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Marco Elver <elver@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-12mm/migrate_device.c: add migrate_device_range()Alistair Popple
Device drivers can use the migrate_vma family of functions to migrate existing private anonymous mappings to device private pages. These pages are backed by memory on the device with drivers being responsible for copying data to and from device memory. Device private pages are freed via the pgmap->page_free() callback when they are unmapped and their refcount drops to zero. Alternatively they may be freed indirectly via migration back to CPU memory in response to a pgmap->migrate_to_ram() callback called whenever the CPU accesses an address mapped to a device private page. In other words drivers cannot control the lifetime of data allocated on the devices and must wait until these pages are freed from userspace. This causes issues when memory needs to reclaimed on the device, either because the device is going away due to a ->release() callback or because another user needs to use the memory. Drivers could use the existing migrate_vma functions to migrate data off the device. However this would require them to track the mappings of each page which is both complicated and not always possible. Instead drivers need to be able to migrate device pages directly so they can free up device memory. To allow that this patch introduces the migrate_device family of functions which are functionally similar to migrate_vma but which skips the initial lookup based on mapping. Link: https://lkml.kernel.org/r/868116aab70b0c8ee467d62498bb2cf0ef907295.1664366292.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Yang Shi <shy828301@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Alex Sierra <alex.sierra@amd.com> Cc: Ben Skeggs <bskeggs@redhat.com> Cc: Christian König <christian.koenig@amd.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Felix Kuehling <Felix.Kuehling@amd.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Lyude Paul <lyude@redhat.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-12mm/migrate_device.c: refactor migrate_vma and migrate_deivce_coherent_page()Alistair Popple
migrate_device_coherent_page() reuses the existing migrate_vma family of functions to migrate a specific page without providing a valid mapping or vma. This looks a bit odd because it means we are calling migrate_vma_*() without setting a valid vma, however it was considered acceptable at the time because the details were internal to migrate_device.c and there was only a single user. One of the reasons the details could be kept internal was that this was strictly for migrating device coherent memory. Such memory can be copied directly by the CPU without intervention from a driver. However this isn't true for device private memory, and a future change requires similar functionality for device private memory. So refactor the code into something more sensible for migrating device memory without a vma. Link: https://lkml.kernel.org/r/c7b2ff84e9b33d022cf4a40f87d051f281a16d8f.1664366292.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Yang Shi <shy828301@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Alex Sierra <alex.sierra@amd.com> Cc: Ben Skeggs <bskeggs@redhat.com> Cc: Christian König <christian.koenig@amd.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Felix Kuehling <Felix.Kuehling@amd.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Lyude Paul <lyude@redhat.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-12mm/memremap.c: take a pgmap reference on page allocationAlistair Popple
ZONE_DEVICE pages have a struct dev_pagemap which is allocated by a driver. When the struct page is first allocated by the kernel in memremap_pages() a reference is taken on the associated pagemap to ensure it is not freed prior to the pages being freed. Prior to 27674ef6c73f ("mm: remove the extra ZONE_DEVICE struct page refcount") pages were considered free and returned to the driver when the reference count dropped to one. However the pagemap reference was not dropped until the page reference count hit zero. This would occur as part of the final put_page() in memunmap_pages() which would wait for all pages to be freed prior to returning. When the extra refcount was removed the pagemap reference was no longer being dropped in put_page(). Instead memunmap_pages() was changed to explicitly drop the pagemap references. This means that memunmap_pages() can complete even though pages are still mapped by the kernel which can lead to kernel crashes, particularly if a driver frees the pagemap. To fix this drivers should take a pagemap reference when allocating the page. This reference can then be returned when the page is freed. Link: https://lkml.kernel.org/r/12d155ec727935ebfbb4d639a03ab374917ea51b.1664366292.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Fixes: 27674ef6c73f ("mm: remove the extra ZONE_DEVICE struct page refcount") Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Felix Kuehling <Felix.Kuehling@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Christian König <christian.koenig@amd.com> Cc: Ben Skeggs <bskeggs@redhat.com> Cc: Lyude Paul <lyude@redhat.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Alex Sierra <alex.sierra@amd.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>