diff options
author | Linus Torvalds <torvalds@linux-foundation.org> | 2020-12-15 12:53:37 -0800 |
---|---|---|
committer | Linus Torvalds <torvalds@linux-foundation.org> | 2020-12-15 12:53:37 -0800 |
commit | ac73e3dc8acd0a3be292755db30388c3580f5674 (patch) | |
tree | 5abef6cb82b205b5dbbb69dca950b8a5aae716de | |
parent | 148842c98a24e508aecb929718818fbf4c2a6ff3 (diff) | |
parent | dfefd226b0bf7c435a58d75a0ce2f9273b9825f6 (diff) |
Merge branch 'akpm' (patches from Andrew)
Merge misc updates from Andrew Morton:
- a few random little subsystems
- almost all of the MM patches which are staged ahead of linux-next
material. I'll trickle to post-linux-next work in as the dependents
get merged up.
Subsystems affected by this patch series: kthread, kbuild, ide, ntfs,
ocfs2, arch, and mm (slab-generic, slab, slub, dax, debug, pagecache,
gup, swap, shmem, memcg, pagemap, mremap, hmm, vmalloc, documentation,
kasan, pagealloc, memory-failure, hugetlb, vmscan, z3fold, compaction,
oom-kill, migration, cma, page-poison, userfaultfd, zswap, zsmalloc,
uaccess, zram, and cleanups).
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (200 commits)
mm: cleanup kstrto*() usage
mm: fix fall-through warnings for Clang
mm: slub: convert sysfs sprintf family to sysfs_emit/sysfs_emit_at
mm: shmem: convert shmem_enabled_show to use sysfs_emit_at
mm:backing-dev: use sysfs_emit in macro defining functions
mm: huge_memory: convert remaining use of sprintf to sysfs_emit and neatening
mm: use sysfs_emit for struct kobject * uses
mm: fix kernel-doc markups
zram: break the strict dependency from lzo
zram: add stat to gather incompressible pages since zram set up
zram: support page writeback
mm/process_vm_access: remove redundant initialization of iov_r
mm/zsmalloc.c: rework the list_add code in insert_zspage()
mm/zswap: move to use crypto_acomp API for hardware acceleration
mm/zswap: fix passing zero to 'PTR_ERR' warning
mm/zswap: make struct kernel_param_ops definitions const
userfaultfd/selftests: hint the test runner on required privilege
userfaultfd/selftests: fix retval check for userfaultfd_open()
userfaultfd/selftests: always dump something in modes
userfaultfd: selftests: make __{s,u}64 format specifiers portable
...
216 files changed, 4331 insertions, 2882 deletions
diff --git a/Documentation/admin-guide/blockdev/zram.rst b/Documentation/admin-guide/blockdev/zram.rst index 9093228cf18b..700329d25f57 100644 --- a/Documentation/admin-guide/blockdev/zram.rst +++ b/Documentation/admin-guide/blockdev/zram.rst @@ -266,6 +266,7 @@ line of text and contains the following stats separated by whitespace: No memory is allocated for such pages. pages_compacted the number of pages freed during compaction huge_pages the number of incompressible pages + huge_pages_since the number of incompressible pages since zram set up ================ ============================================================= File /sys/block/zram<id>/bd_stat @@ -334,6 +335,11 @@ Admin can request writeback of those idle pages at right timing via:: With the command, zram writeback idle pages from memory to the storage. +If admin want to write a specific page in zram device to backing device, +they could write a page index into the interface. + + echo "page_index=1251" > /sys/block/zramX/writeback + If there are lots of write IO with flash device, potentially, it has flash wearout problem so that admin needs to design write limitation to guarantee storage health for entire product life. diff --git a/Documentation/admin-guide/cgroup-v1/memcg_test.rst b/Documentation/admin-guide/cgroup-v1/memcg_test.rst index 3f7115e07b5d..4f83de2dab6e 100644 --- a/Documentation/admin-guide/cgroup-v1/memcg_test.rst +++ b/Documentation/admin-guide/cgroup-v1/memcg_test.rst @@ -219,13 +219,11 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y. This is an easy way to test page migration, too. -9.5 mkdir/rmdir ---------------- +9.5 nested cgroups +------------------ - When using hierarchy, mkdir/rmdir test should be done. - Use tests like the following:: + Use tests like the following for testing nested cgroups:: - echo 1 >/opt/cgroup/01/memory/use_hierarchy mkdir /opt/cgroup/01/child_a mkdir /opt/cgroup/01/child_b diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst b/Documentation/admin-guide/cgroup-v1/memory.rst index 12757e63b26c..a44cd467d218 100644 --- a/Documentation/admin-guide/cgroup-v1/memory.rst +++ b/Documentation/admin-guide/cgroup-v1/memory.rst @@ -77,6 +77,8 @@ Brief summary of control files. memory.soft_limit_in_bytes set/show soft limit of memory usage memory.stat show various statistics memory.use_hierarchy set/show hierarchical account enabled + This knob is deprecated and shouldn't be + used. memory.force_empty trigger forced page reclaim memory.pressure_level set memory pressure notifications memory.swappiness set/show swappiness parameter of vmscan @@ -495,16 +497,13 @@ cgroup might have some charge associated with it, even though all tasks have migrated away from it. (because we charge against pages, not against tasks.) -We move the stats to root (if use_hierarchy==0) or parent (if -use_hierarchy==1), and no change on the charge except uncharging +We move the stats to parent, and no change on the charge except uncharging from the child. Charges recorded in swap information is not updated at removal of cgroup. Recorded information is discarded and a cgroup which uses swap (swapcache) will be charged as a new owner of it. -About use_hierarchy, see Section 6. - 5. Misc. interfaces =================== @@ -527,8 +526,6 @@ About use_hierarchy, see Section 6. write will still return success. In this case, it is expected that memory.kmem.usage_in_bytes == memory.usage_in_bytes. - About use_hierarchy, see Section 6. - 5.2 stat file ------------- @@ -675,31 +672,20 @@ hierarchy:: d e In the diagram above, with hierarchical accounting enabled, all memory -usage of e, is accounted to its ancestors up until the root (i.e, c and root), -that has memory.use_hierarchy enabled. If one of the ancestors goes over its -limit, the reclaim algorithm reclaims from the tasks in the ancestor and the -children of the ancestor. - -6.1 Enabling hierarchical accounting and reclaim ------------------------------------------------- - -A memory cgroup by default disables the hierarchy feature. Support -can be enabled by writing 1 to memory.use_hierarchy file of the root cgroup:: +usage of e, is accounted to its ancestors up until the root (i.e, c and root). +If one of the ancestors goes over its limit, the reclaim algorithm reclaims +from the tasks in the ancestor and the children of the ancestor. - # echo 1 > memory.use_hierarchy - -The feature can be disabled by:: +6.1 Hierarchical accounting and reclaim +--------------------------------------- - # echo 0 > memory.use_hierarchy +Hierarchical accounting is enabled by default. Disabling the hierarchical +accounting is deprecated. An attempt to do it will result in a failure +and a warning printed to dmesg. -NOTE1: - Enabling/disabling will fail if either the cgroup already has other - cgroups created below it, or if the parent cgroup has use_hierarchy - enabled. +For compatibility reasons writing 1 to memory.use_hierarchy will always pass:: -NOTE2: - When panic_on_oom is set to "2", the whole system will panic in - case of an OOM event in any cgroup. + # echo 1 > memory.use_hierarchy 7. Soft limits ============== diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index 608d7c279396..63521cd36ce5 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -1274,6 +1274,9 @@ PAGE_SIZE multiple when read back. kernel_stack Amount of memory allocated to kernel stacks. + pagetables + Amount of memory allocated for page tables. + percpu(npn) Amount of memory used for storing per-cpu kernel data structures. @@ -1300,6 +1303,14 @@ PAGE_SIZE multiple when read back. Amount of memory used in anonymous mappings backed by transparent hugepages + file_thp + Amount of cached filesystem data backed by transparent + hugepages + + shmem_thp + Amount of shm, tmpfs, shared anonymous mmap()s backed by + transparent hugepages + inactive_anon, active_anon, inactive_file, active_file, unevictable Amount of memory, swap-backed and filesystem-backed, on the internal memory management lists used by the diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst index b2acd0d395ca..3b8a336511a4 100644 --- a/Documentation/admin-guide/mm/transhuge.rst +++ b/Documentation/admin-guide/mm/transhuge.rst @@ -401,21 +401,6 @@ compact_fail is incremented if the system tries to compact memory but failed. -compact_pages_moved - is incremented each time a page is moved. If - this value is increasing rapidly, it implies that the system - is copying a lot of data to satisfy the huge page allocation. - It is possible that the cost of copying exceeds any savings - from reduced TLB misses. - -compact_pagemigrate_failed - is incremented when the underlying mechanism - for moving a page failed. - -compact_blocks_moved - is incremented each time memory compaction examines - a huge page aligned range of pages. - It is possible to establish how long the stalls were using the function tracer to record how long was spent in __alloc_pages_nodemask and using the mm_page_alloc tracepoint to identify which allocations were diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst index e0cf17ad2ae5..e972caa43bd4 100644 --- a/Documentation/admin-guide/sysctl/vm.rst +++ b/Documentation/admin-guide/sysctl/vm.rst @@ -873,12 +873,17 @@ file-backed pages is less than the high watermark in a zone. unprivileged_userfaultfd ======================== -This flag controls whether unprivileged users can use the userfaultfd -system calls. Set this to 1 to allow unprivileged users to use the -userfaultfd system calls, or set this to 0 to restrict userfaultfd to only -privileged users (with SYS_CAP_PTRACE capability). +This flag controls the mode in which unprivileged users can use the +userfaultfd system calls. Set this to 0 to restrict unprivileged users +to handle page faults in user mode only. In this case, users without +SYS_CAP_PTRACE must pass UFFD_USER_MODE_ONLY in order for userfaultfd to +succeed. Prohibiting use of userfaultfd for handling faults from kernel +mode may make certain vulnerabilities more difficult to exploit. -The default value is 1. +Set this to 1 to allow unprivileged users to use the userfaultfd system +calls without any restrictions. + +The default value is 0. user_reserve_kbytes diff --git a/Documentation/core-api/memory-allocation.rst b/Documentation/core-api/memory-allocation.rst index 4446a1ac36cc..5954ddf6ee13 100644 --- a/Documentation/core-api/memory-allocation.rst +++ b/Documentation/core-api/memory-allocation.rst @@ -147,6 +147,10 @@ The address of a chunk allocated with `kmalloc` is aligned to at least ARCH_KMALLOC_MINALIGN bytes. For sizes which are a power of two, the alignment is also guaranteed to be at least the respective size. +Chunks allocated with kmalloc() can be resized with krealloc(). Similarly +to kmalloc_array(): a helper for resizing arrays is provided in the form of +krealloc_array(). + For large allocations you can use vmalloc() and vzalloc(), or directly request pages from the page allocator. The memory allocated by `vmalloc` and related functions is not physically contiguous. diff --git a/Documentation/core-api/pin_user_pages.rst b/Documentation/core-api/pin_user_pages.rst index 7ca8c7bac650..fcf605be43d0 100644 --- a/Documentation/core-api/pin_user_pages.rst +++ b/Documentation/core-api/pin_user_pages.rst @@ -221,12 +221,12 @@ Unit testing ============ This file:: - tools/testing/selftests/vm/gup_benchmark.c + tools/testing/selftests/vm/gup_test.c has the following new calls to exercise the new pin*() wrapper functions: -* PIN_FAST_BENCHMARK (./gup_benchmark -a) -* PIN_BENCHMARK (./gup_benchmark -b) +* PIN_FAST_BENCHMARK (./gup_test -a) +* PIN_BASIC_TEST (./gup_test -b) You can monitor how many total dma-pinned pages have been acquired and released since the system was booted, via two new /proc/vmstat entries: :: diff --git a/Documentation/dev-tools/kasan.rst b/Documentation/dev-tools/kasan.rst index 0bd6d0e1ca6b..6b752a45a936 100644 --- a/Documentation/dev-tools/kasan.rst +++ b/Documentation/dev-tools/kasan.rst @@ -190,8 +190,9 @@ function calls GCC directly inserts the code to check the shadow memory. This option significantly enlarges kernel but it gives x1.1-x2 performance boost over outline instrumented kernel. -Generic KASAN prints up to 2 call_rcu() call stacks in reports, the last one -and the second to last. +Generic KASAN also reports the last 2 call stacks to creation of work that +potentially has access to an object. Call stacks for the following are shown: +call_rcu() and workqueue queuing. Software tag-based KASAN ~~~~~~~~~~~~~~~~~~~~~~~~ diff --git a/Documentation/filesystems/tmpfs.rst b/Documentation/filesystems/tmpfs.rst index c44f8b1d3cab..0408c245785e 100644 --- a/Documentation/filesystems/tmpfs.rst +++ b/Documentation/filesystems/tmpfs.rst @@ -4,7 +4,7 @@ Tmpfs ===== -Tmpfs is a file system which keeps all files in virtual memory. +Tmpfs is a file system which keeps all of its files in virtual memory. Everything in tmpfs is temporary in the sense that no files will be @@ -35,7 +35,7 @@ tmpfs has the following uses: memory. This mount does not depend on CONFIG_TMPFS. If CONFIG_TMPFS is not - set, the user visible part of tmpfs is not build. But the internal + set, the user visible part of tmpfs is not built. But the internal mechanisms are always present. 2) glibc 2.2 and above expects tmpfs to be mounted at /dev/shm for @@ -50,7 +50,7 @@ tmpfs has the following uses: This mount is _not_ needed for SYSV shared memory. The internal mount is used for that. (In the 2.3 kernel versions it was necessary to mount the predecessor of tmpfs (shm fs) to use SYSV - shared memory) + shared memory.) 3) Some people (including me) find it very convenient to mount it e.g. on /tmp and /var/tmp and have a big swap partition. And now @@ -83,7 +83,7 @@ If nr_blocks=0 (or size=0), blocks will not be limited in that instance; if nr_inodes=0, inodes will not be limited. It is generally unwise to mount with such options, since it allows any user with write access to use up all the memory on the machine; but enhances the scalability of -that instance in a system with many cpus making intensive use of it. +that instance in a system with many CPUs making intensive use of it. tmpfs has a mount option to set the NUMA memory allocation policy for diff --git a/Documentation/vm/memory-model.rst b/Documentation/vm/memory-model.rst index 9daadf9faba1..ce398a7dc6cd 100644 --- a/Documentation/vm/memory-model.rst +++ b/Documentation/vm/memory-model.rst @@ -51,8 +51,7 @@ call :c:func:`free_area_init` function. Yet, the mappings array is not usable until the call to :c:func:`memblock_free_all` that hands all the memory to the page allocator. -If an architecture enables `CONFIG_ARCH_HAS_HOLES_MEMORYMODEL` option, -it may free parts of the `mem_map` array that do not cover the +An architecture may free parts of the `mem_map` array that do not cover the actual physical pages. In such case, the architecture specific :c:func:`pfn_valid` implementation should take the holes in the `mem_map` into account. diff --git a/Documentation/vm/page_owner.rst b/Documentation/vm/page_owner.rst index 02deac76673f..4e67c2e9bbed 100644 --- a/Documentation/vm/page_owner.rst +++ b/Documentation/vm/page_owner.rst @@ -41,17 +41,17 @@ size change due to this facility. - Without page owner:: text data bss dec hex filename - 40662 1493 644 42799 a72f mm/page_alloc.o + 48392 2333 644 51369 c8a9 mm/page_alloc.o - With page owner:: text data bss dec hex filename - 40892 1493 644 43029 a815 mm/page_alloc.o - 1427 24 8 1459 5b3 mm/page_ext.o - 2722 50 0 2772 ad4 mm/page_owner.o + 48800 2445 644 51889 cab1 mm/page_alloc.o + 6574 108 29 6711 1a37 mm/page_owner.o + 1025 8 8 1041 411 mm/page_ext.o -Although, roughly, 4 KB code is added in total, page_alloc.o increase by -230 bytes and only half of it is in hotpath. Building the kernel with +Although, roughly, 8 KB code is added in total, page_alloc.o increase by +520 bytes and less than half of it is in hotpath. Building the kernel with page owner and turning it on if needed would be great option to debug kernel memory problem. diff --git a/arch/Kconfig b/arch/Kconfig index 7a3371d28508..becdd2f22d2e 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -261,7 +261,7 @@ config ARCH_HAS_SET_DIRECT_MAP # # Select if the architecture provides the arch_dma_set_uncached symbol to -# either provide an uncached segement alias for a DMA allocation, or +# either provide an uncached segment alias for a DMA allocation, or # to remap the page tables in place. # config ARCH_HAS_DMA_SET_UNCACHED @@ -314,14 +314,14 @@ config ARCH_32BIT_OFF_T config HAVE_ASM_MODVERSIONS bool help - This symbol should be selected by an architecure if it provides + This symbol should be selected by an architecture if it provides <asm/asm-prototypes.h> to support the module versioning for symbols exported from assembly code. config HAVE_REGS_AND_STACK_ACCESS_API bool help - This symbol should be selected by an architecure if it supports + This symbol should be selected by an architecture if it supports the API needed to access registers and stack entries from pt_regs, declared in asm/ptrace.h For example the kprobes-based event tracer needs this API. @@ -336,7 +336,7 @@ config HAVE_RSEQ config HAVE_FUNCTION_ARG_ACCESS_API bool help - This symbol should be selected by an architecure if it supports + This symbol should be selected by an architecture if it supports the API needed to access function arguments from pt_regs, declared in asm/ptrace.h @@ -665,6 +665,13 @@ config HAVE_IRQ_TIME_ACCOUNTING Archs need to ensure they use a high enough resolution clock to support irq time accounting and then call enable_sched_clock_irqtime(). +config HAVE_MOVE_PUD + bool + help + Architectures that select this are able to move page tables at the + PUD level. If there are only 3 page table levels, the move effectively + happens at the PGD level. + config HAVE_MOVE_PMD bool help @@ -1054,6 +1061,12 @@ config ARCH_WANT_LD_ORPHAN_WARN by the linker, since the locations of such sections can change between linker versions. +config HAVE_ARCH_PFN_VALID + bool + +config ARCH_SUPPORTS_DEBUG_PAGEALLOC + bool + source "kernel/gcov/Kconfig" source "scripts/gcc-plugins/Kconfig" diff --git a/arch/alpha/Kconfig b/arch/alpha/Kconfig index d6e9fc7a7b19..aedf5c296f13 100644 --- a/arch/alpha/Kconfig +++ b/arch/alpha/Kconfig @@ -40,6 +40,7 @@ config ALPHA select CPU_NO_EFFICIENT_FFS if !ALPHA_EV67 select MMU_GATHER_NO_RANGE select SET_FS + select SPARSEMEM_EXTREME if SPARSEMEM help The Alpha is a 64-bit general-purpose processor designed and marketed by the Digital Equipment Corporation of blessed memory, @@ -551,12 +552,19 @@ config NR_CPUS config ARCH_DISCONTIGMEM_ENABLE bool "Discontiguous Memory Support" + depends on BROKEN help Say Y to support efficient handling of discontiguous physical memory, for architectures which are either NUMA (Non-Uniform Memory Access) or have huge holes in the physical address space for other reasons. See <file:Documentation/vm/numa.rst> for more. +config ARCH_SPARSEMEM_ENABLE + bool "Sparse Memory Support" + help + Say Y to support efficient handling of discontiguous physical memory, + for systems that have huge holes in the physical address space. + config NUMA bool "NUMA Support (EXPERIMENTAL)" depends on DISCONTIGMEM && BROKEN diff --git a/arch/alpha/include/asm/mmzone.h b/arch/alpha/include/asm/mmzone.h index 9b521c857436..86644604d977 100644 --- a/arch/alpha/include/asm/mmzone.h +++ b/arch/alpha/include/asm/mmzone.h @@ -6,6 +6,8 @@ #ifndef _ASM_MMZONE_H_ #define _ASM_MMZONE_H_ +#ifdef CONFIG_DISCONTIGMEM + #include <asm/smp.h> /* @@ -45,8 +47,6 @@ PLAT_NODE_DATA_LOCALNR(unsigned long p, int n) } #endif -#ifdef CONFIG_DISCONTIGMEM - /* * Following are macros that each numa implementation must define. */ @@ -68,11 +68,6 @@ PLAT_NODE_DATA_LOCALNR(unsigned long p, int n) /* XXX: FIXME -- nyc */ #define kern_addr_valid(kaddr) (0) -#define virt_to_page(kaddr) pfn_to_page(__pa(kaddr) >> PAGE_SHIFT) - -#define pmd_page(pmd) (pfn_to_page(pmd_val(pmd) >> 32)) -#define pte_pfn(pte) (pte_val(pte) >> 32) - #define mk_pte(page, pgprot) \ ({ \ pte_t pte; \ @@ -95,16 +90,11 @@ PLAT_NODE_DATA_LOCALNR(unsigned long p, int n) __xx; \ }) -#define page_to_pa(page) \ - (page_to_pfn(page) << PAGE_SHIFT) - #define pfn_to_nid(pfn) pa_to_nid(((u64)(pfn) << PAGE_SHIFT)) #define pfn_valid(pfn) \ (((pfn) - node_start_pfn(pfn_to_nid(pfn))) < \ node_spanned_pages(pfn_to_nid(pfn))) \ -#define virt_addr_valid(kaddr) pfn_valid((__pa(kaddr) >> PAGE_SHIFT)) - #endif /* CONFIG_DISCONTIGMEM */ #endif /* _ASM_MMZONE_H_ */ diff --git a/arch/alpha/include/asm/page.h b/arch/alpha/include/asm/page.h index e241bd88880f..268f99b4602b 100644 --- a/arch/alpha/include/asm/page.h +++ b/arch/alpha/include/asm/page.h @@ -83,12 +83,13 @@ typedef struct page *pgtable_t; #define __pa(x) ((unsigned long) (x) - PAGE_OFFSET) #define __va(x) ((void *)((unsigned long) (x) + PAGE_OFFSET)) -#ifndef CONFIG_DISCONTIGMEM + #define virt_to_page(kaddr) pfn_to_page(__pa(kaddr) >> PAGE_SHIFT) +#define virt_addr_valid(kaddr) pfn_valid((__pa(kaddr) >> PAGE_SHIFT)) +#ifdef CONFIG_FLATMEM #define pfn_valid(pfn) ((pfn) < max_mapnr) -#define virt_addr_valid(kaddr) pfn_valid(__pa(kaddr) >> PAGE_SHIFT) -#endif /* CONFIG_DISCONTIGMEM */ +#endif /* CONFIG_FLATMEM */ #include <asm-generic/memory_model.h> #include <asm-generic/getorder.h> diff --git a/arch/alpha/include/asm/pgtable.h b/arch/alpha/include/asm/pgtable.h index 660b14ce1317..8d856c62e22a 100644 --- a/arch/alpha/include/asm/pgtable.h +++ b/arch/alpha/include/asm/pgtable.h @@ -203,10 +203,10 @@ extern unsigned long __zero_page(void); * Conversion functions: convert a page and protection to a page entry, * and a page entry and page directory to the page they refer to. */ -#ifndef CONFIG_DISCONTIGMEM -#define page_to_pa(page) (((page) - mem_map) << PAGE_SHIFT) - +#define page_to_pa(page) (page_to_pfn(page) << PAGE_SHIFT) #define pte_pfn(pte) (pte_val(pte) >> 32) + +#ifndef CONFIG_DISCONTIGMEM #define pte_page(pte) pfn_to_page(pte_pfn(pte)) #define mk_pte(page, pgprot) \ ({ \ @@ -236,10 +236,8 @@ pmd_page_vaddr(pmd_t pmd) return ((pmd_val(pmd) & _PFN_MASK) >> (32-PAGE_SHIFT)) + PAGE_OFFSET; } -#ifndef CONFIG_DISCONTIGMEM -#define pmd_page(pmd) (mem_map + ((pmd_val(pmd) & _PFN_MASK) >> 32)) -#define pud_page(pud) (mem_map + ((pud_val(pud) & _PFN_MASK) >> 32)) -#endif +#define pmd_page(pmd) (pfn_to_page(pmd_val(pmd) >> 32)) +#define pud_page(pud) (pfn_to_page(pud_val(pud) >> 32)) extern inline unsigned long pud_page_vaddr(pud_t pgd) { return PAGE_OFFSET + ((pud_val(pgd) & _PFN_MASK) >> (32-PAGE_SHIFT)); } diff --git a/arch/alpha/include/asm/sparsemem.h b/arch/alpha/include/asm/sparsemem.h new file mode 100644 index 000000000000..a0820fd2d4b1 --- /dev/null +++ b/arch/alpha/include/asm/sparsemem.h @@ -0,0 +1,18 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _ASM_ALPHA_SPARSEMEM_H +#define _ASM_ALPHA_SPARSEMEM_H + +#ifdef CONFIG_SPARSEMEM + +#define SECTION_SIZE_BITS 27 + +/* + * According to "Alpha Architecture Reference Manual" physical + * addresses are at most 48 bits. + * https://download.majix.org/dec/alpha_arch_ref.pdf + */ +#define MAX_PHYSMEM_BITS 48 + +#endif /* CONFIG_SPARSEMEM */ + +#endif /* _ASM_ALPHA_SPARSEMEM_H */ diff --git a/arch/alpha/kernel/setup.c b/arch/alpha/kernel/setup.c index 916e42d74a86..03dda3beb3bd 100644 --- a/arch/alpha/kernel/setup.c +++ b/arch/alpha/kernel/setup.c @@ -648,6 +648,7 @@ setup_arch(char **cmdline_p) /* Find our memory. */ setup_memory(kernel_end); memblock_set_bottom_up(true); + sparse_init(); /* First guess at cpu cache sizes. Do this before init_arch. */ determine_cpu_caches(cpu->type); diff --git a/arch/arc/Kconfig b/arch/arc/Kconfig index d8804001d550..6a821e13a98f 100644 --- a/arch/arc/Kconfig +++ b/arch/arc/Kconfig @@ -67,6 +67,7 @@ config GENERIC_CSUM config ARCH_DISCONTIGMEM_ENABLE def_bool n + depends on BROKEN config ARCH_FLATMEM_ENABLE def_bool y @@ -506,7 +507,7 @@ config LINUX_RAM_BASE config HIGHMEM bool "High Memory Support" - select ARCH_DISCONTIGMEM_ENABLE + select HAVE_ARCH_PFN_VALID select KMAP_LOCAL help With ARC 2G:2G address split, only upper 2G is directly addressable by diff --git a/arch/arc/include/asm/page.h b/arch/arc/include/asm/page.h index b0dfed0f12be..23e41e890eda 100644 --- a/arch/arc/include/asm/page.h +++ b/arch/arc/include/asm/page.h @@ -82,11 +82,25 @@ typedef pte_t * pgtable_t; */ #define virt_to_pfn(kaddr) (__pa(kaddr) >> PAGE_SHIFT) -#define ARCH_PFN_OFFSET virt_to_pfn(CONFIG_LINUX_RAM_BASE) +/* + * When HIGHMEM is enabled we have holes in the memory map so we need + * pfn_valid() that takes into account the actual extents of the physical + * memory + */ +#ifdef CONFIG_HIGHMEM + +extern unsigned long arch_pfn_offset; +#define ARCH_PFN_OFFSET arch_pfn_offset + +extern int pfn_valid(unsigned long pfn); +#define pfn_valid pfn_valid -#ifdef CONFIG_FLATMEM +#else /* CONFIG_HIGHMEM */ + +#define ARCH_PFN_OFFSET virt_to_pfn(CONFIG_LINUX_RAM_BASE) #define pfn_valid(pfn) (((pfn) - ARCH_PFN_OFFSET) < max_mapnr) -#endif + +#endif /* CONFIG_HIGHMEM */ /* * __pa, __va, virt_to_page (ALERT: deprecated, don't use them) diff --git a/arch/arc/mm/init.c b/arch/arc/mm/init.c index 3a35b82a718e..ce07e697916c 100644 --- a/arch/arc/mm/init.c +++ b/arch/arc/mm/init.c @@ -28,6 +28,8 @@ static unsigned long low_mem_sz; static unsigned long min_high_pfn, max_high_pfn; static phys_addr_t high_mem_start; static phys_addr_t high_mem_sz; +unsigned long arch_pfn_offset; +EXPORT_SYMBOL(arch_pfn_offset); #endif #ifdef CONFIG_DISCONTIGMEM @@ -98,16 +100,11 @@ void __init setup_arch_memory(void) init_mm.brk = (unsigned long)_end; /* first page of system - kernel .vector starts here */ - min_low_pfn = ARCH_PFN_OFFSET; + min_low_pfn = virt_to_pfn(CONFIG_LINUX_RAM_BASE); /* Last usable page of low mem */ max_low_pfn = max_pfn = PFN_DOWN(low_mem_start + low_mem_sz); -#ifdef CONFIG_FLATMEM - /* pfn_valid() uses this */ - max_mapnr = max_low_pfn - min_low_pfn; -#endif - /*------------- bootmem allocator setup -----------------------*/ /* @@ -153,7 +150,9 @@ void __init setup_arch_memory(void) * DISCONTIGMEM in turns requires multiple nodes. node 0 above is * populated with normal memory zone while node 1 only has highmem */ +#ifdef CONFIG_DISCONTIGMEM node_set_online(1); +#endif min_high_pfn = PFN_DOWN(high_mem_start); max_high_pfn = PFN_DOWN(high_mem_start + high_mem_sz); @@ -161,8 +160,15 @@ void __init setup_arch_memory(void) max_zone_pfn[ZONE_HIGHMEM] = min_low_pfn; high_memory = (void *)(min_high_pfn << PAGE_SHIFT); + + arch_pfn_offset = min(min_low_pfn, min_high_pfn); kmap_init(); -#endif + +#else /* CONFIG_HIGHMEM */ + /* pfn_valid() uses this when FLATMEM=y and HIGHMEM=n */ + max_mapnr = max_low_pfn - min_low_pfn; + +#endif /* CONFIG_HIGHMEM */ free_area_init(max_zone_pfn); } @@ -190,3 +196,12 @@ void __init mem_init(void) highmem_init(); mem_init_print_info(NULL); } + +#ifdef CONFIG_HIGHMEM +int pfn_valid(unsigned long pfn) +{ + return (pfn >= min_high_pfn && pfn <= max_high_pfn) || + (pfn >= min_low_pfn && pfn <= max_low_pfn); +} +EXPORT_SYMBOL(pfn_valid); +#endif diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig index 4708ede3b826..b2d7585fd2be 100644 --- a/arch/arm/Kconfig +++ b/arch/arm/Kconfig @@ -25,7 +25,7 @@ config ARM select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST select ARCH_HAVE_CUSTOM_GPIO_H select ARCH_HAS_GCOV_PROFILE_ALL - select ARCH_KEEP_MEMBLOCK if HAVE_ARCH_PFN_VALID || KEXEC + select ARCH_KEEP_MEMBLOCK select ARCH_MIGHT_HAVE_PC_PARPORT select ARCH_NO_SG_CHAIN if !ARM_HAS_SG_CHAIN select ARCH_OPTIONAL_KERNEL_RWX if ARCH_HAS_STRICT_KERNEL_RWX @@ -69,6 +69,7 @@ config ARM select HAVE_ARCH_JUMP_LABEL if !XIP_KERNEL && !CPU_ENDIAN_BE32 && MMU select HAVE_ARCH_KGDB if !CPU_ENDIAN_BE32 && MMU select HAVE_ARCH_MMAP_RND_BITS if MMU + select HAVE_ARCH_PFN_VALID select HAVE_ARCH_SECCOMP select HAVE_ARCH_SECCOMP_FILTER if AEABI && !OABI_COMPAT select HAVE_ARCH_THREAD_STRUCT_WHITELIST @@ -520,7 +521,6 @@ config ARCH_S3C24XX config ARCH_OMAP1 bool "TI OMAP1" depends on MMU - select ARCH_HAS_HOLES_MEMORYMODEL select ARCH_OMAP select CLKDEV_LOOKUP select CLKSRC_MMIO @@ -1480,9 +1480,6 @@ config OABI_COMPAT UNPREDICTABLE (in fact it can be predicted that it won't work at all). If in doubt say N. -config ARCH_HAS_HOLES_MEMORYMODEL - bool - config ARCH_SELECT_MEMORY_MODEL bool @@ -1493,9 +1490,6 @@ config ARCH_SPARSEMEM_ENABLE bool select SPARSEMEM_STATIC if SPARSEMEM -config HAVE_ARCH_PFN_VALID - def_bool ARCH_HAS_HOLES_MEMORYMODEL || !SPARSEMEM - config HIGHMEM bool "High Memory Support" depends on MMU diff --git a/arch/arm/kernel/vdso.c b/arch/arm/kernel/vdso.c index fddd08a6e063..3408269d19c7 100644 --- a/arch/arm/kernel/vdso.c +++ b/arch/arm/kernel/vdso.c @@ -50,15 +50,6 @@ static const struct vm_special_mapping vdso_data_mapping = { static int vdso_mremap(const struct vm_special_mapping *sm, struct vm_area_struct *new_vma) { - unsigned long new_size = new_vma->vm_end - new_vma->vm_start; - unsigned long vdso_size; - - /* without VVAR page */ - vdso_size = (vdso_total_pages - 1) << PAGE_SHIFT; - - if (vdso_size != new_size) - return -EINVAL; - current->mm->context.vdso = new_vma->vm_start; return 0; diff --git a/arch/arm/mach-bcm/Kconfig b/arch/arm/mach-bcm/Kconfig index ae790908fc74..9b594ae98153 100644 --- a/arch/arm/mach-bcm/Kconfig +++ b/arch/arm/mach-bcm/Kconfig @@ -211,7 +211,6 @@ config ARCH_BRCMSTB select BCM7038_L1_IRQ select BRCMSTB_L2_IRQ select BCM7120_L2_IRQ - select ARCH_HAS_HOLES_MEMORYMODEL select ZONE_DMA if ARM_LPAE select SOC_BRCMSTB select SOC_BUS diff --git a/arch/arm/mach-davinci/Kconfig b/arch/arm/mach-davinci/Kconfig index f56ff8c24043..de11030748d0 100644 --- a/arch/arm/mach-davinci/Kconfig +++ b/arch/arm/mach-davinci/Kconfig @@ -5,7 +5,6 @@ menuconfig ARCH_DAVINCI depends on ARCH_MULTI_V5 select DAVINCI_TIMER select ZONE_DMA - select ARCH_HAS_HOLES_MEMORYMODEL select PM_GENERIC_DOMAINS if PM select PM_GENERIC_DOMAINS_OF if PM && OF select REGMAP_MMIO diff --git a/arch/arm/mach-exynos/Kconfig b/arch/arm/mach-exynos/Kconfig index d2d249706ebb..56d272967fc0 100644 --- a/arch/arm/mach-exynos/Kconfig +++ b/arch/arm/mach-exynos/Kconfig @@ -8,7 +8,6 @@ menuconfig ARCH_EXYNOS bool "Samsung Exynos" depends on ARCH_MULTI_V7 - select ARCH_HAS_HOLES_MEMORYMODEL select ARCH_SUPPORTS_BIG_ENDIAN select ARM_AMBA select ARM_GIC diff --git a/arch/arm/mach-highbank/Kconfig b/arch/arm/mach-highbank/Kconfig index 1bc68913d62c..9de38ce8124f 100644 --- a/arch/arm/mach-highbank/Kconfig +++ b/arch/arm/mach-highbank/Kconfig @@ -2,7 +2,6 @@ config ARCH_HIGHBANK bool "Calxeda ECX-1000/2000 (Highbank/Midway)" depends on ARCH_MULTI_V7 - select ARCH_HAS_HOLES_MEMORYMODEL select ARCH_SUPPORTS_BIG_ENDIAN select ARM_AMBA select ARM_ERRATA_764369 if SMP diff --git a/arch/arm/mach-omap2/Kconfig b/arch/arm/mach-omap2/Kconfig index 3f62a0c9450d..164985505f9e 100644 --- a/arch/arm/mach-omap2/Kconfig +++ b/arch/arm/mach-omap2/Kconfig @@ -93,7 +93,6 @@ config SOC_DRA7XX config ARCH_OMAP2PLUS bool select ARCH_HAS_BANDGAP - select ARCH_HAS_HOLES_MEMORYMODEL select ARCH_HAS_RESET_CONTROLLER select ARCH_OMAP select CLKSRC_MMIO diff --git a/arch/arm/mach-s5pv210/Kconfig b/arch/arm/mach-s5pv210/Kconfig index 95d4e8284866..d644b45bc29d 100644 --- a/arch/arm/mach-s5pv210/Kconfig +++ b/arch/arm/mach-s5pv210/Kconfig @@ -8,7 +8,6 @@ config ARCH_S5PV210 bool "Samsung S5PV210/S5PC110" depends on ARCH_MULTI_V7 - select ARCH_HAS_HOLES_MEMORYMODEL select ARM_VIC select CLKSRC_SAMSUNG_PWM select COMMON_CLK_SAMSUNG diff --git a/arch/arm/mach-tango/Kconfig b/arch/arm/mach-tango/Kconfig index 25b2fd434861..a9eeda36aeb1 100644 --- a/arch/arm/mach-tango/Kconfig +++ b/arch/arm/mach-tango/Kconfig @@ -3,7 +3,6 @@ config ARCH_TANGO bool "Sigma Designs Tango4 (SMP87xx)" depends on ARCH_MULTI_V7 # Cortex-A9 MPCore r3p0, PL310 r3p2 - select ARCH_HAS_HOLES_MEMORYMODEL select ARM_ERRATA_754322 select ARM_ERRATA_764369 if SMP select ARM_ERRATA_775420 diff --git a/arch/arm/mm/init.c b/arch/arm/mm/init.c index c23dbf8bebee..db623d7c30de 100644 --- a/arch/arm/mm/init.c +++ b/arch/arm/mm/init.c @@ -267,83 +267,6 @@ static inline void poison_init_mem(void *s, size_t count) *p++ = 0xe7fddef0; } -static inline void __init -free_memmap(unsigned long start_pfn, unsigned long end_pfn) -{ - struct page *start_pg, *end_pg; - phys_addr_t pg, pgend; - - /* - * Convert start_pfn/end_pfn to a struct page pointer. - */ - start_pg = pfn_to_page(start_pfn - 1) + 1; - end_pg = pfn_to_page(end_pfn - 1) + 1; - - /* - * Convert to physical addresses, and - * round start upwards and end downwards. - */ - pg = PAGE_ALIGN(__pa(start_pg)); - pgend = __pa(end_pg) & PAGE_MASK; - - /* - * If there are free pages between these, - * free the section of the memmap array. - */ - if (pg < pgend) - memblock_free_early(pg, pgend - pg); -} - -/* - * The mem_map array can get very big. Free the unused area of the memory map. - */ -static void __init free_unused_memmap(void) -{ - unsigned long start, end, prev_end = 0; - int i; - - /* - * This relies on each bank being in address order. - * The banks are sorted previously in bootmem_init(). - */ - for_each_mem_pfn_range(i, MAX_NUMNODES, &start, &end, NULL) { -#ifdef CONFIG_SPARSEMEM - /* - * Take care not to free memmap entries that don't exist - * due to SPARSEMEM sections which aren't present. - */ - start = min(start, - ALIGN(prev_end, PAGES_PER_SECTION)); -#else - /* - * Align down here since the VM subsystem insists that the - * memmap entries are valid from the bank start aligned to - * MAX_ORDER_NR_PAGES. - */ - start = round_down(start, MAX_ORDER_NR_PAGES); -#endif - /* - * If we had a previous bank, and there is a space - * between the current bank and the previous, free it. - */ - if (prev_end && prev_end < start) - free_memmap(prev_end, start); - - /* - * Align up here since the VM subsystem insists that the - * memmap entries are valid from the bank end aligned to - * MAX_ORDER_NR_PAGES. - */ - prev_end = ALIGN(end, MAX_ORDER_NR_PAGES); - } - -#ifdef CONFIG_SPARSEMEM - if (!IS_ALIGNED(prev_end, PAGES_PER_SECTION)) - free_memmap(prev_end, - ALIGN(prev_end, PAGES_PER_SECTION)); -#endif -} - static void __init free_highpages(void) { #ifdef CONFIG_HIGHMEM @@ -385,7 +308,6 @@ void __init mem_init(void) set_max_mapnr(pfn_to_page(max_pfn) - mem_map); /* this will put all unused low memory onto the freelists */ - free_unused_memmap(); memblock_free_all(); #ifdef CONFIG_SA1111 diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index e5bcb4995268..d58133f49188 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -71,6 +71,7 @@ config ARM64 select ARCH_USE_QUEUED_RWLOCKS select ARCH_USE_QUEUED_SPINLOCKS select ARCH_USE_SYM_ANNOTATIONS + select ARCH_SUPPORTS_DEBUG_PAGEALLOC select ARCH_SUPPORTS_MEMORY_FAILURE select ARCH_SUPPORTS_SHADOW_CALL_STACK if CC_HAVE_SHADOW_CALL_STACK select ARCH_SUPPORTS_ATOMIC_RMW @@ -125,6 +126,7 @@ config ARM64 select HANDLE_DOMAIN_IRQ select HARDIRQS_SW_RESEND select HAVE_MOVE_PMD + select HAVE_MOVE_PUD select HAVE_PCI select HAVE_ACPI_APEI if (ACPI && EFI) select HAVE_ALIGNED_STRUCT_PAGE if SLUB @@ -139,6 +141,7 @@ config ARM64 select HAVE_ARCH_KGDB select HAVE_ARCH_MMAP_RND_BITS select HAVE_ARCH_MMAP_RND_COMPAT_BITS if COMPAT + select HAVE_ARCH_PFN_VALID select HAVE_ARCH_PREL32_RELOCATIONS select HAVE_ARCH_SECCOMP_FILTER select HAVE_ARCH_STACKLEAK @@ -1027,9 +1030,6 @@ config HOLES_IN_ZONE source "kernel/Kconfig.hz" -config ARCH_SUPPORTS_DEBUG_PAGEALLOC - def_bool y - config ARCH_SPARSEMEM_ENABLE def_bool y select SPARSEMEM_VMEMMAP_ENABLE @@ -1043,9 +1043,6 @@ config ARCH_SELECT_MEMORY_MODEL config ARCH_FLATMEM_ENABLE def_bool !NUMA -config HAVE_ARCH_PFN_VALID - def_bool y - config HW_PERF_EVENTS def_bool y depends on ARM_PMU diff --git a/arch/arm64/include/asm/cacheflush.h b/arch/arm64/include/asm/cacheflush.h index 9384fd8fc13c..45217f21f1fe 100644 --- a/arch/arm64/include/asm/cacheflush.h +++ b/arch/arm64/include/asm/cacheflush.h @@ -140,6 +140,7 @@ int set_memory_valid(unsigned long addr, int numpages, int enable); int set_direct_map_invalid_noflush(struct page *page); int set_direct_map_default_noflush(struct page *page); +bool kernel_page_present(struct page *page); #include <asm-generic/cacheflush.h> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h index 88a18d7f4e59..501562793ce2 100644 --- a/arch/arm64/include/asm/pgtable.h +++ b/arch/arm64/include/asm/pgtable.h @@ -463,6 +463,7 @@ static inline pmd_t pmd_mkdevmap(pmd_t pmd) #define pfn_pud(pfn,prot) __pud(__phys_to_pud_val((phys_addr_t)(pfn) << PAGE_SHIFT) | pgprot_val(prot)) #define set_pmd_at(mm, addr, pmdp, pmd) set_pte_at(mm, addr, (pte_t *)pmdp, pmd_pte(pmd)) +#define set_pud_at(mm, addr, pudp, pud) set_pte_at(mm, addr, (pte_t *)pudp, pud_pte(pud)) #define __p4d_to_phys(p4d) __pte_to_phys(p4d_pte(p4d)) #define __phys_to_p4d_val(phys) __phys_to_pte_val(phys) diff --git a/arch/arm64/kernel/vdso.c b/arch/arm64/kernel/vdso.c index debb8995d57f..cee5d04ea9ad 100644 --- a/arch/arm64/kernel/vdso.c +++ b/arch/arm64/kernel/vdso.c @@ -78,17 +78,9 @@ static union { } vdso_data_store __page_aligned_data; struct vdso_data *vdso_data = vdso_data_store.data; -static int __vdso_remap(enum vdso_abi abi, - const struct vm_special_mapping *sm, - struct vm_area_struct *new_vma) +static int vdso_mremap(const struct vm_special_mapping *sm, + struct vm_area_struct *new_vma) { - unsigned long new_size = new_vma->vm_end - new_vma->vm_start; - unsigned long vdso_size = vdso_info[abi].vdso_code_end - - vdso_info[abi].vdso_code_start; - - if (vdso_size != new_size) - return -EINVAL; - current->mm->context.vdso = (void *)new_vma->vm_start; return 0; @@ -219,17 +211,6 @@ static vm_fault_t vvar_fault(const struct vm_special_mapping *sm, return vmf_insert_pfn(vma, vmf->address, pfn); } -static int vvar_mremap(const struct vm_special_mapping *sm, - struct vm_area_struct *new_vma) -{ - unsigned long new_size = new_vma->vm_end - new_vma->vm_start; - - if (new_size != VVAR_NR_PAGES * PAGE_SIZE) - return -EINVAL; - - return 0; -} - static int __setup_additional_pages(enum vdso_abi abi, struct mm_struct *mm, struct linux_binprm *bprm, @@ -280,12 +261,6 @@ up_fail: /* * Create and map the vectors page for AArch32 tasks. */ -static int aarch32_vdso_mremap(const struct vm_special_mapping *sm, - struct vm_area_struct *new_vma) -{ - return __vdso_remap(VDSO_ABI_AA32, sm, new_vma); -} - enum aarch32_map { AA32_MAP_VECTORS, /* kuser helpers */ AA32_MAP_SIGPAGE, @@ -308,11 +283,10 @@ static struct vm_special_mapping aarch32_vdso_maps[] = { [AA32_MAP_VVAR] = { .name = "[vvar]", .fault = vvar_fault, - .mremap = vvar_mremap, }, [AA32_MAP_VDSO] = { .name = "[vdso]", - .mremap = aarch32_vdso_mremap, + .mremap = vdso_mremap, }, }; @@ -453,12 +427,6 @@ out: } #endif /* CONFIG_COMPAT */ -static int vdso_mremap(const struct vm_special_mapping *sm, - struct vm_area_struct *new_vma) -{ - return __vdso_remap(VDSO_ABI_AA64, sm, new_vma); -} - enum aarch64_map { AA64_MAP_VVAR, AA64_MAP_VDSO, @@ -468,7 +436,6 @@ static struct vm_special_mapping aarch64_vdso_maps[] __ro_after_init = { [AA64_MAP_VVAR] = { .name = "[vvar]", .fault = vvar_fault, - .mremap = vvar_mremap, }, [AA64_MAP_VDSO] = { .name = "[vdso]", diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c index fbd452e12397..69d4251ee079 100644 --- a/arch/arm64/mm/init.c +++ b/arch/arm64/mm/init.c @@ -444,71 +444,6 @@ void __init bootmem_init(void) memblock_dump_all(); } -#ifndef CONFIG_SPARSEMEM_VMEMMAP -static inline void free_memmap(unsigned long start_pfn, unsigned long end_pfn) -{ - struct page *start_pg, *end_pg; - unsigned long pg, pgend; - - /* - * Convert start_pfn/end_pfn to a struct page pointer. - */ - start_pg = pfn_to_page(start_pfn - 1) + 1; - end_pg = pfn_to_page(end_pfn - 1) + 1; - - /* - * Convert to physical addresses, and round start upwards and end - * downwards. - */ - pg = (unsigned long)PAGE_ALIGN(__pa(start_pg)); - pgend = (unsigned long)__pa(end_pg) & PAGE_MASK; - - /* - * If there are free pages between these, free the section of the - * memmap array. - */ - if (pg < pgend) - memblock_free(pg, pgend - pg); -} - -/* - * The mem_map array can get very big. Free the unused area of the memory map. - */ -static void __init free_unused_memmap(void) -{ - unsigned long start, end, prev_end = 0; - int i; - - for_each_mem_pfn_range(i, MAX_NUMNODES, &start, &end, NULL) { -#ifdef CONFIG_SPARSEMEM - /* - * Take care not to free memmap entries that don't exist due - * to SPARSEMEM sections which aren't present. - */ - start = min(start, ALIGN(prev_end, PAGES_PER_SECTION)); -#endif - /* - * If we had a previous bank, and there is a space between the - * current bank and the previous, free it. - */ - if (prev_end && prev_end < start) - free_memmap(prev_end, start); - - /* - * Align up here since the VM subsystem insists that the - * memmap entries are valid from the bank end aligned to - * MAX_ORDER_NR_PAGES. - */ - prev_end = ALIGN(end, MAX_ORDER_NR_PAGES); - } - -#ifdef CONFIG_SPARSEMEM - if (!IS_ALIGNED(prev_end, PAGES_PER_SECTION)) - free_memmap(prev_end, ALIGN(prev_end, PAGES_PER_SECTION)); -#endif -} -#endif /* !CONFIG_SPARSEMEM_VMEMMAP */ - /* * mem_init() marks the free areas in the mem_map and tells us how much memory * is free. This is done after various parts of the system have claimed their @@ -524,9 +459,6 @@ void __init mem_init(void) set_max_mapnr(max_pfn - PHYS_PFN_OFFSET); -#ifndef CONFIG_SPARSEMEM_VMEMMAP - free_unused_memmap(); -#endif /* this will put all unused low memory onto the freelists */ memblock_free_all(); diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c index 1b94f5b82654..92eccaf595c8 100644 --- a/arch/arm64/mm/pageattr.c +++ b/arch/arm64/mm/pageattr.c @@ -155,7 +155,7 @@ int set_direct_map_invalid_noflush(struct page *page) .clear_mask = __pgprot(PTE_VALID), }; - if (!rodata_full) + if (!debug_pagealloc_enabled() && !rodata_full) return 0; return apply_to_page_range(&init_mm, @@ -170,7 +170,7 @@ int set_direct_map_default_noflush(struct page *page) .clear_mask = __pgprot(PTE_RDONLY), }; - if (!rodata_full) + if (!debug_pagealloc_enabled() && !rodata_full) return 0; return apply_to_page_range(&init_mm, @@ -178,6 +178,7 @@ int set_direct_map_default_noflush(struct page *page) PAGE_SIZE, change_page_range, &data); } +#ifdef CONFIG_DEBUG_PAGEALLOC void __kernel_map_pages(struct page *page, int numpages, int enable) { if (!debug_pagealloc_enabled() && !rodata_full) @@ -185,6 +186,7 @@ void __kernel_map_pages(struct page *page, int numpages, int enable) set_memory_valid((unsigned long)page_address(page), numpages, enable); } +#endif /* CONFIG_DEBUG_PAGEALLOC */ /* * This function is used to determine if a linear map page has been marked as diff --git a/arch/ia64/Kconfig b/arch/ia64/Kconfig index 39b25a5a591b..6e67d6110249 100644 --- a/arch/ia64/Kconfig +++ b/arch/ia64/Kconfig @@ -288,6 +288,7 @@ config ARCH_SELECT_MEMORY_MODEL config ARCH_DISCONTIGMEM_ENABLE def_bool y + depends on BROKEN help Say Y to support efficient handling of discontiguous physical memory, for architectures which are either NUMA (Non-Uniform Memory Access) @@ -299,12 +300,11 @@ config ARCH_FLATMEM_ENABLE config ARCH_SPARSEMEM_ENABLE def_bool y - depends on ARCH_DISCONTIGMEM_ENABLE select SPARSEMEM_VMEMMAP_ENABLE -config ARCH_DISCONTIGMEM_DEFAULT +config ARCH_SPARSEMEM_DEFAULT def_bool y - depends on ARCH_DISCONTIGMEM_ENABLE + depends on ARCH_SPARSEMEM_ENABLE config NUMA bool "NUMA support" @@ -329,7 +329,7 @@ config NODES_SHIFT # VIRTUAL_MEM_MAP has been retained for historical reasons. config VIRTUAL_MEM_MAP bool "Virtual mem map" - depends on !SPARSEMEM + depends on !SPARSEMEM && !FLATMEM default y help Say Y to compile the kernel with support for a virtual mem map. @@ -342,9 +342,6 @@ config HOLES_IN_ZONE bool default y if VIRTUAL_MEM_MAP -config HAVE_ARCH_EARLY_PFN_TO_NID - def_bool NUMA && SPARSEMEM - config HAVE_ARCH_NODEDATA_EXTENSION def_bool y depends on NUMA diff --git a/arch/ia64/include/asm/meminit.h b/arch/ia64/include/asm/meminit.h index 092f1c91b36c..e789c0818edb 100644 --- a/arch/ia64/include/asm/meminit.h +++ b/arch/ia64/include/asm/meminit.h @@ -59,10 +59,8 @@ extern int reserve_elfcorehdr(u64 *start, u64 *end); extern int register_active_ranges(u64 start, u64 len, int nid); #ifdef CONFIG_VIRTUAL_MEM_MAP -# define LARGE_GAP 0x40000000 /* Use virtual mem map if hole is > than this */ extern unsigned long VMALLOC_END; extern struct page *vmem_map; - extern int find_largest_hole(u64 start, u64 end, void *arg); extern int create_mem_map_page_table(u64 start, u64 end, void *arg); extern int vmemmap_find_next_valid_pfn(int, int); #else diff --git a/arch/ia64/mm/contig.c b/arch/ia64/mm/contig.c index e30e360beef8..bfc4ecd0a2ab 100644 --- a/arch/ia64/mm/contig.c +++ b/arch/ia64/mm/contig.c @@ -19,15 +19,12 @@ #include <linux/mm.h> #include <linux/nmi.h> #include <linux/swap.h> +#include <linux/sizes.h> #include <asm/meminit.h> #include <asm/sections.h> #include <asm/mca.h> -#ifdef CONFIG_VIRTUAL_MEM_MAP -static unsigned long max_gap; -#endif - /* physical address where the bootmem map is located */ unsigned long bootmap_start; @@ -166,6 +163,32 @@ find_memory (void) alloc_per_cpu_data(); } +static int __init find_largest_hole(u64 start, u64 end, void *arg) +{ + u64 *max_gap = arg; + + static u64 last_end = PAGE_OFFSET; + + /* NOTE: this algorithm assumes efi memmap table is ordered */ + + if (*max_gap < (start - last_end)) + *max_gap = start - last_end; + last_end = end; + return 0; +} + +static void __init verify_gap_absence(void) +{ + unsigned long max_gap; + + /* Forbid FLATMEM if hole is > than 1G */ + efi_memmap_walk(find_largest_hole, (u64 *)&max_gap); + if (max_gap >= SZ_1G) + panic("Cannot use FLATMEM with %ldMB hole\n" + "Please switch over to SPARSEMEM\n", + (max_gap >> 20)); +} + /* * Set up the page tables. */ @@ -177,37 +200,12 @@ paging_init (void) unsigned long max_zone_pfns[MAX_NR_ZONES]; memset(max_zone_pfns, 0, sizeof(max_zone_pfns)); -#ifdef CONFIG_ZONE_DMA32 max_dma = virt_to_phys((void *) MAX_DMA_ADDRESS) >> PAGE_SHIFT; max_zone_pfns[ZONE_DMA32] = max_dma; -#endif max_zone_pfns[ZONE_NORMAL] = max_low_pfn; -#ifdef CONFIG_VIRTUAL_MEM_MAP - efi_memmap_walk(find_largest_hole, (u64 *)&max_gap); - if (max_gap < LARGE_GAP) { - vmem_map = (struct page *) 0; - } else { - unsigned long map_size; - - /* allocate virtual_mem_map */ + verify_gap_absence(); - map_size = PAGE_ALIGN(ALIGN(max_low_pfn, MAX_ORDER_NR_PAGES) * - sizeof(struct page)); - VMALLOC_END -= map_size; - vmem_map = (struct page *) VMALLOC_END; - efi_memmap_walk(create_mem_map_page_table, NULL); - - /* - * alloc_node_mem_map makes an adjustment for mem_map - * which isn't compatible with vmem_map. - */ - NODE_DATA(0)->node_mem_map = vmem_map + - find_min_pfn_with_active_regions(); - - printk("Virtual mem_map starts at 0x%p\n", mem_map); - } -#endif /* !CONFIG_VIRTUAL_MEM_MAP */ free_area_init(max_zone_pfns); zero_page_memmap_ptr = virt_to_page(ia64_imva(empty_zero_page)); } diff --git a/arch/ia64/mm/discontig.c b/arch/ia64/mm/discontig.c index dbe829fc5298..c7311131156e 100644 --- a/arch/ia64/mm/discontig.c +++ b/arch/ia64/mm/discontig.c @@ -584,6 +584,25 @@ void call_pernode_memory(unsigned long start, unsigned long len, void *arg) } } +static void __init virtual_map_init(void) +{ +#ifdef CONFIG_VIRTUAL_MEM_MAP + int node; + + VMALLOC_END -= PAGE_ALIGN(ALIGN(max_low_pfn, MAX_ORDER_NR_PAGES) * + sizeof(struct page)); + vmem_map = (struct page *) VMALLOC_END; + efi_memmap_walk(create_mem_map_page_table, NULL); + printk("Virtual mem_map starts at 0x%p\n", vmem_map); + + for_each_online_node(node) { + unsigned long pfn_offset = mem_data[node].min_pfn; + + NODE_DATA(node)->node_mem_map = vmem_map + pfn_offset; + } +#endif +} + /** * paging_init - setup page tables * @@ -593,38 +612,17 @@ void call_pernode_memory(unsigned long start, unsigned long len, void *arg) void __init paging_init(void) { unsigned long max_dma; - unsigned long pfn_offset = 0; - unsigned long max_pfn = 0; - int node; unsigned long max_zone_pfns[MAX_NR_ZONES]; max_dma = virt_to_phys((void *) MAX_DMA_ADDRESS) >> PAGE_SHIFT; sparse_init(); -#ifdef CONFIG_VIRTUAL_MEM_MAP - VMALLOC_END -= PAGE_ALIGN(ALIGN(max_low_pfn, MAX_ORDER_NR_PAGES) * - sizeof(struct page)); - vmem_map = (struct page *) VMALLOC_END; - efi_memmap_walk(create_mem_map_page_table, NULL); - printk("Virtual mem_map starts at 0x%p\n", vmem_map); -#endif - - for_each_online_node(node) { - pfn_offset = mem_data[node].min_pfn; - -#ifdef CONFIG_VIRTUAL_MEM_MAP - NODE_DATA(node)->node_mem_map = vmem_map + pfn_offset; -#endif - if (mem_data[node].max_pfn > max_pfn) - max_pfn = mem_data[node].max_pfn; - } + virtual_map_init(); memset(max_zone_pfns, 0, sizeof(max_zone_pfns)); -#ifdef CONFIG_ZONE_DMA32 max_zone_pfns[ZONE_DMA32] = max_dma; -#endif - max_zone_pfns[ZONE_NORMAL] = max_pfn; + max_zone_pfns[ZONE_NORMAL] = max_low_pfn; free_area_init(max_zone_pfns); zero_page_memmap_ptr = virt_to_page(ia64_imva(empty_zero_page)); diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c index ef12e097f318..9b5acf8fb092 100644 --- a/arch/ia64/mm/init.c +++ b/arch/ia64/mm/init.c @@ -574,20 +574,6 @@ ia64_pfn_valid (unsigned long pfn) } EXPORT_SYMBOL(ia64_pfn_valid); -int __init find_largest_hole(u64 start, u64 end, void *arg) -{ - u64 *max_gap = arg; - - static u64 last_end = PAGE_OFFSET; - - /* NOTE: this algorithm assumes efi memmap table is ordered */ - - if (*max_gap < (start - last_end)) - *max_gap = start - last_end; - last_end = end; - return 0; -} - #endif /* CONFIG_VIRTUAL_MEM_MAP */ int __init register_active_ranges(u64 start, u64 len, int nid) diff --git a/arch/ia64/mm/numa.c b/arch/ia64/mm/numa.c index f34964271101..46b6e5f3a40f 100644 --- a/arch/ia64/mm/numa.c +++ b/arch/ia64/mm/numa.c @@ -58,36 +58,6 @@ paddr_to_nid(unsigned long paddr) EXPORT_SYMBOL(paddr_to_nid); #if defined(CONFIG_SPARSEMEM) && defined(CONFIG_NUMA) -/* - * Because of holes evaluate on section limits. - * If the section of memory exists, then return the node where the section - * resides. Otherwise return node 0 as the default. This is used by - * SPARSEMEM to allocate the SPARSEMEM sectionmap on the NUMA node where - * the section resides. - */ -int __meminit __early_pfn_to_nid(unsigned long pfn, - struct mminit_pfnnid_cache *state) -{ - int i, section = pfn >> PFN_SECTION_SHIFT, ssec, esec; - - if (section >= state->last_start && section < state->last_end) - return state->last_nid; - - for (i = 0; i < num_node_memblks; i++) { - ssec = node_memblk[i].start_paddr >> PA_SECTION_SHIFT; - esec = (node_memblk[i].start_paddr + node_memblk[i].size + - ((1L << PA_SECTION_SHIFT) - 1)) >> PA_SECTION_SHIFT; - if (section >= ssec && section < esec) { - state->last_start = ssec; - state->last_end = esec; - state->last_nid = node_memblk[i].nid; - return node_memblk[i].nid; - } - } - - return -1; -} - void numa_clear_node(int cpu) { unmap_cpu_from_node(cpu, NUMA_NO_NODE); diff --git a/arch/m68k/Kconfig.cpu b/arch/m68k/Kconfig.cpu index 694c4fca9f5d..3e70fb7a8d83 100644 --- a/arch/m68k/Kconfig.cpu +++ b/arch/m68k/Kconfig.cpu @@ -20,6 +20,7 @@ choice config M68KCLASSIC bool "Classic M68K CPU family support" + select HAVE_ARCH_PFN_VALID config COLDFIRE bool "Coldfire CPU family support" @@ -373,16 +374,38 @@ config RMW_INSNS config SINGLE_MEMORY_CHUNK bool "Use one physical chunk of memory only" if ADVANCED && !SUN3 depends on MMU - default y if SUN3 - select NEED_MULTIPLE_NODES + default y if SUN3 || MMU_COLDFIRE help Ignore all but the first contiguous chunk of physical memory for VM purposes. This will save a few bytes kernel size and may speed up - some operations. Say N if not sure. + some operations. + When this option os set to N, you may want to lower "Maximum zone + order" to save memory that could be wasted for unused memory map. + Say N if not sure. config ARCH_DISCONTIGMEM_ENABLE + depends on BROKEN def_bool MMU && !SINGLE_MEMORY_CHUNK +config FORCE_MAX_ZONEORDER + int "Maximum zone order" if ADVANCED + depends on !SINGLE_MEMORY_CHUNK + default "11" + help + The kernel memory allocator divides physically contiguous memory + blocks into "zones", where each zone is a power of two number of + pages. This option selects the largest power of two that the kernel + keeps in the memory allocator. If you need to allocate very large + blocks of physically contiguous memory, then you may need to + increase this value. + + For systems that have holes in their physical address space this + value also defines the minimal size of the hole that allows + freeing unused memory map. + + This config option is actually maximum order plus one. For example, + a value of 11 means that the largest free memory block is 2^10 pages. + config 060_WRITETHROUGH bool "Use write-through caching for 68060 supervisor accesses" depends on ADVANCED && M68060 @@ -406,7 +429,7 @@ config M68K_L2_CACHE config NODES_SHIFT int default "3" - depends on !SINGLE_MEMORY_CHUNK + depends on DISCONTIGMEM config CPU_HAS_NO_BITFIELDS bool diff --git a/arch/m68k/include/asm/page.h b/arch/m68k/include/asm/page.h index 2614a1206f2f..6116d7094292 100644 --- a/arch/m68k/include/asm/page.h +++ b/arch/m68k/include/asm/page.h @@ -62,8 +62,10 @@ extern unsigned long _ramend; #include <asm/page_no.h> #endif +#ifdef CONFIG_DISCONTIGMEM #define __phys_to_pfn(paddr) ((unsigned long)((paddr) >> PAGE_SHIFT)) #define __pfn_to_phys(pfn) PFN_PHYS(pfn) +#endif #include <asm-generic/getorder.h> diff --git a/arch/m68k/include/asm/page_mm.h b/arch/m68k/include/asm/page_mm.h index e6b75992192b..7f5912af2a52 100644 --- a/arch/m68k/include/asm/page_mm.h +++ b/arch/m68k/include/asm/page_mm.h @@ -126,7 +126,7 @@ static inline void *__va(unsigned long x) extern int m68k_virt_to_node_shift; -#ifdef CONFIG_SINGLE_MEMORY_CHUNK +#ifndef CONFIG_DISCONTIGMEM #define __virt_to_node(addr) (&pg_data_map[0]) #else extern struct pglist_data *pg_data_table[]; @@ -153,6 +153,7 @@ static inline __attribute_const__ int __virt_to_node_shift(void) pfn_to_virt(page_to_pfn(page)); \ }) +#ifdef CONFIG_DISCONTIGMEM #define pfn_to_page(pfn) ({ \ unsigned long __pfn = (pfn); \ struct pglist_data *pgdat; \ @@ -165,6 +166,10 @@ static inline __attribute_const__ int __virt_to_node_shift(void) pgdat = &pg_data_map[page_to_nid(__p)]; \ ((__p) - pgdat->node_mem_map) + pgdat->node_start_pfn; \ }) +#else +#define ARCH_PFN_OFFSET (m68k_memory[0].addr) +#include <asm-generic/memory_model.h> +#endif #define virt_addr_valid(kaddr) ((void *)(kaddr) >= (void *)PAGE_OFFSET && (void *)(kaddr) < high_memory) #define pfn_valid(pfn) virt_addr_valid(pfn_to_virt(pfn)) diff --git a/arch/m68k/include/asm/virtconvert.h b/arch/m68k/include/asm/virtconvert.h index dfe43083b579..ca91b32dc6ef 100644 --- a/arch/m68k/include/asm/virtconvert.h +++ b/arch/m68k/include/asm/virtconvert.h @@ -29,12 +29,7 @@ static inline void *phys_to_virt(unsigned long address) } /* Permanent address of a page. */ -#if defined(CONFIG_MMU) && defined(CONFIG_SINGLE_MEMORY_CHUNK) -#define page_to_phys(page) \ - __pa(PAGE_OFFSET + (((page) - pg_data_map[0].node_mem_map) << PAGE_SHIFT)) -#else #define page_to_phys(page) (page_to_pfn(page) << PAGE_SHIFT) -#endif /* * IO bus memory addresses are 1:1 with the physical address, diff --git a/arch/m68k/mm/init.c b/arch/m68k/mm/init.c index 53040857a9ed..14c1e541451c 100644 --- a/arch/m68k/mm/init.c +++ b/arch/m68k/mm/init.c @@ -42,19 +42,19 @@ EXPORT_SYMBOL(empty_zero_page); #ifdef CONFIG_MMU +int m68k_virt_to_node_shift; + +#ifdef CONFIG_DISCONTIGMEM pg_data_t pg_data_map[MAX_NUMNODES]; EXPORT_SYMBOL(pg_data_map); -int m68k_virt_to_node_shift; - -#ifndef CONFIG_SINGLE_MEMORY_CHUNK pg_data_t *pg_data_table[65]; EXPORT_SYMBOL(pg_data_table); #endif void __init m68k_setup_node(int node) { -#ifndef CONFIG_SINGLE_MEMORY_CHUNK +#ifdef CONFIG_DISCONTIGMEM struct m68k_mem_info *info = m68k_memory + node; int i, end; diff --git a/arch/mips/vdso/genvdso.c b/arch/mips/vdso/genvdso.c index abb06ae04b40..09e30eb4be86 100644 --- a/arch/mips/vdso/genvdso.c +++ b/arch/mips/vdso/genvdso.c @@ -263,10 +263,6 @@ int main(int argc, char **argv) fprintf(out_file, " const struct vm_special_mapping *sm,\n"); fprintf(out_file, " struct vm_area_struct *new_vma)\n"); fprintf(out_file, "{\n"); - fprintf(out_file, " unsigned long new_size =\n"); - fprintf(out_file, " new_vma->vm_end - new_vma->vm_start;\n"); - fprintf(out_file, " if (vdso_image.size != new_size)\n"); - fprintf(out_file, " return -EINVAL;\n"); fprintf(out_file, " current->mm->context.vdso =\n"); fprintf(out_file, " (void *)(new_vma->vm_start);\n"); fprintf(out_file, " return 0;\n"); diff --git a/arch/nds32/mm/mm-nds32.c b/arch/nds32/mm/mm-nds32.c index 55bec50ccc03..f2778f2b39f6 100644 --- a/arch/nds32/mm/mm-nds32.c +++ b/arch/nds32/mm/mm-nds32.c @@ -34,8 +34,8 @@ pgd_t *pgd_alloc(struct mm_struct *mm) cpu_dcache_wb_range((unsigned long)new_pgd, (unsigned long)new_pgd + PTRS_PER_PGD * sizeof(pgd_t)); - inc_zone_page_state(virt_to_page((unsigned long *)new_pgd), - NR_PAGETABLE); + inc_lruvec_page_state(virt_to_page((unsigned long *)new_pgd), + NR_PAGETABLE); return new_pgd; } @@ -59,7 +59,7 @@ void pgd_free(struct mm_struct *mm, pgd_t * pgd) pte = pmd_page(*pmd); pmd_clear(pmd); - dec_zone_page_state(virt_to_page((unsigned long *)pgd), NR_PAGETABLE); + dec_lruvec_page_state(virt_to_page((unsigned long *)pgd), NR_PAGETABLE); pte_free(mm, pte); mm_dec_nr_ptes(mm); pmd_free(mm, pmd); diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index d4b2dfb2746f..2590a66d4d51 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -146,6 +146,7 @@ config PPC select ARCH_MIGHT_HAVE_PC_SERIO select ARCH_OPTIONAL_KERNEL_RWX if ARCH_HAS_STRICT_KERNEL_RWX select ARCH_SUPPORTS_ATOMIC_RMW + select ARCH_SUPPORTS_DEBUG_PAGEALLOC if PPC32 || PPC_BOOK3S_64 select ARCH_USE_BUILTIN_BSWAP select ARCH_USE_CMPXCHG_LOCKREF if PPC64 select ARCH_USE_QUEUED_RWLOCKS if PPC_QUEUED_SPINLOCKS @@ -356,10 +357,6 @@ config PPC_OF_PLATFORM_PCI depends on PCI depends on PPC64 # not supported on 32 bits yet -config ARCH_SUPPORTS_DEBUG_PAGEALLOC - depends on PPC32 || PPC_BOOK3S_64 - def_bool y - config ARCH_SUPPORTS_UPROBES def_bool y diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig index 44377fd7860e..9283c6f9ae2a 100644 --- a/arch/riscv/Kconfig +++ b/arch/riscv/Kconfig @@ -14,6 +14,7 @@ config RISCV def_bool y select ARCH_CLOCKSOURCE_INIT select ARCH_SUPPORTS_ATOMIC_RMW + select ARCH_SUPPORTS_DEBUG_PAGEALLOC if MMU select ARCH_HAS_BINFMT_FLAT select ARCH_HAS_DEBUG_VM_PGTABLE select ARCH_HAS_DEBUG_VIRTUAL if MMU @@ -153,9 +154,6 @@ config ARCH_SELECT_MEMORY_MODEL config ARCH_WANT_GENERAL_HUGETLB def_bool y -config ARCH_SUPPORTS_DEBUG_PAGEALLOC - def_bool y - config SYS_SUPPORTS_HUGETLBFS depends on MMU def_bool y diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h index 183f1f4b2ae6..41a72861987c 100644 --- a/arch/riscv/include/asm/pgtable.h +++ b/arch/riscv/include/asm/pgtable.h @@ -461,8 +461,6 @@ static inline int ptep_clear_flush_young(struct vm_area_struct *vma, #define VMALLOC_START 0 #define VMALLOC_END TASK_SIZE -static inline void __kernel_map_pages(struct page *page, int numpages, int enable) {} - #endif /* !CONFIG_MMU */ #define kern_addr_valid(addr) (1) /* FIXME */ diff --git a/arch/riscv/include/asm/set_memory.h b/arch/riscv/include/asm/set_memory.h index 4c5bae7ca01c..d690b08dff2a 100644 --- a/arch/riscv/include/asm/set_memory.h +++ b/arch/riscv/include/asm/set_memory.h @@ -24,6 +24,7 @@ static inline int set_memory_nx(unsigned long addr, int numpages) { return 0; } int set_direct_map_invalid_noflush(struct page *page); int set_direct_map_default_noflush(struct page *page); +bool kernel_page_present(struct page *page); #endif /* __ASSEMBLY__ */ diff --git a/arch/riscv/mm/pageattr.c b/arch/riscv/mm/pageattr.c index 19fecb362d81..87ba5a68bbb8 100644 --- a/arch/riscv/mm/pageattr.c +++ b/arch/riscv/mm/pageattr.c @@ -184,6 +184,7 @@ int set_direct_map_default_noflush(struct page *page) return ret; } +#ifdef CONFIG_DEBUG_PAGEALLOC void __kernel_map_pages(struct page *page, int numpages, int enable) { if (!debug_pagealloc_enabled()) @@ -196,3 +197,33 @@ void __kernel_map_pages(struct page *page, int numpages, int enable) __set_memory((unsigned long)page_address(page), numpages, __pgprot(0), __pgprot(_PAGE_PRESENT)); } +#endif + +bool kernel_page_present(struct page *page) +{ + unsigned long addr = (unsigned long)page_address(page); + pgd_t *pgd; + pud_t *pud; + p4d_t *p4d; + pmd_t *pmd; + pte_t *pte; + + pgd = pgd_offset_k(addr); + if (!pgd_present(*pgd)) + return false; + + p4d = p4d_offset(pgd, addr); + if (!p4d_present(*p4d)) + return false; + + pud = pud_offset(p4d, addr); + if (!pud_present(*pud)) + return false; + + pmd = pmd_offset(pud, addr); + if (!pmd_present(*pmd)) + return false; + + pte = pte_offset_kernel(pmd, addr); + return pte_present(*pte); +} diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig index a60cc523d810..a9b1328e375d 100644 --- a/arch/s390/Kconfig +++ b/arch/s390/Kconfig @@ -35,9 +35,6 @@ config GENERIC_LOCKBREAK config PGSTE def_bool y if KVM -config ARCH_SUPPORTS_DEBUG_PAGEALLOC - def_bool y - config AUDIT_ARCH def_bool y @@ -105,6 +102,7 @@ config S390 select ARCH_INLINE_WRITE_UNLOCK_IRQRESTORE select ARCH_STACKWALK select ARCH_SUPPORTS_ATOMIC_RMW + select ARCH_SUPPORTS_DEBUG_PAGEALLOC select ARCH_SUPPORTS_NUMA_BALANCING select ARCH_USE_BUILTIN_BSWAP select ARCH_USE_CMPXCHG_LOCKREF diff --git a/arch/s390/configs/debug_defconfig b/arch/s390/configs/debug_defconfig index c52113a238b1..1be32fcf6f2e 100644 --- a/arch/s390/configs/debug_defconfig +++ b/arch/s390/configs/debug_defconfig @@ -102,7 +102,7 @@ CONFIG_ZSMALLOC_STAT=y CONFIG_DEFERRED_STRUCT_PAGE_INIT=y CONFIG_IDLE_PAGE_TRACKING=y CONFIG_PERCPU_STATS=y -CONFIG_GUP_BENCHMARK=y +CONFIG_GUP_TEST=y CONFIG_NET=y CONFIG_PACKET=y CONFIG_PACKET_DIAG=m diff --git a/arch/s390/configs/defconfig b/arch/s390/configs/defconfig index 17d5df2c1eff..e2171a008809 100644 --- a/arch/s390/configs/defconfig +++ b/arch/s390/configs/defconfig @@ -95,7 +95,7 @@ CONFIG_ZSMALLOC_STAT=y CONFIG_DEFERRED_STRUCT_PAGE_INIT=y CONFIG_IDLE_PAGE_TRACKING=y CONFIG_PERCPU_STATS=y -CONFIG_GUP_BENCHMARK=y +CONFIG_GUP_TEST=y CONFIG_NET=y CONFIG_PACKET=y CONFIG_PACKET_DIAG=m diff --git a/arch/s390/kernel/vdso.c b/arch/s390/kernel/vdso.c index aef2edff9959..8bc269c55fd3 100644 --- a/arch/s390/kernel/vdso.c +++ b/arch/s390/kernel/vdso.c @@ -62,17 +62,8 @@ static vm_fault_t vdso_fault(const struct vm_special_mapping *sm, static int vdso_mremap(const struct vm_special_mapping *sm, struct vm_area_struct *vma) { - unsigned long vdso_pages; - - vdso_pages = vdso64_pages; - - if ((vdso_pages << PAGE_SHIFT) != vma->vm_end - vma->vm_start) - return -EINVAL; - - if (WARN_ON_ONCE(current->mm != vma->vm_mm)) - return -EFAULT; - current->mm->context.vdso_base = vma->vm_start; + return 0; } diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig index e841708cb830..85ac956dd5f8 100644 --- a/arch/sparc/Kconfig +++ b/arch/sparc/Kconfig @@ -88,6 +88,7 @@ config SPARC64 select HAVE_C_RECORDMCOUNT select HAVE_ARCH_AUDITSYSCALL select ARCH_SUPPORTS_ATOMIC_RMW + select ARCH_SUPPORTS_DEBUG_PAGEALLOC select HAVE_NMI select HAVE_REGS_AND_STACK_ACCESS_API select ARCH_USE_QUEUED_RWLOCKS @@ -149,9 +150,6 @@ config GENERIC_ISA_DMA bool default y if SPARC32 -config ARCH_SUPPORTS_DEBUG_PAGEALLOC - def_bool y if SPARC64 - config PGTABLE_LEVELS default 4 if 64BIT default 3 diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c index 96edf64d4fb3..182bb7bdaa0a 100644 --- a/arch/sparc/mm/init_64.c +++ b/arch/sparc/mm/init_64.c @@ -2894,7 +2894,7 @@ pgtable_t pte_alloc_one(struct mm_struct *mm) if (!page) return NULL; if (!pgtable_pte_page_ctor(page)) { - free_unref_page(page); + __free_page(page); return NULL; } return (pte_t *) page_address(page); diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index eeb87fce9c6f..00746dd364db 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -92,6 +92,7 @@ config X86 select ARCH_STACKWALK select ARCH_SUPPORTS_ACPI select ARCH_SUPPORTS_ATOMIC_RMW + select ARCH_SUPPORTS_DEBUG_PAGEALLOC select ARCH_SUPPORTS_NUMA_BALANCING if X86_64 select ARCH_SUPPORTS_KMAP_LOCAL_FORCE_MAP if NR_CPUS <= 4096 select ARCH_USE_BUILTIN_BSWAP @@ -202,6 +203,7 @@ config X86 select HAVE_MIXED_BREAKPOINTS_REGS select HAVE_MOD_ARCH_SPECIFIC select HAVE_MOVE_PMD + select HAVE_MOVE_PUD select HAVE_NMI select HAVE_OPROFILE select HAVE_OPTPROBES @@ -333,9 +335,6 @@ config ZONE_DMA32 config AUDIT_ARCH def_bool y if X86_64 -config ARCH_SUPPORTS_DEBUG_PAGEALLOC - def_bool y - config KASAN_SHADOW_OFFSET hex depends on KASAN diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c index de60cd37070b..825e829ffff1 100644 --- a/arch/x86/entry/vdso/vma.c +++ b/arch/x86/entry/vdso/vma.c @@ -89,30 +89,14 @@ static void vdso_fix_landing(const struct vdso_image *image, static int vdso_mremap(const struct vm_special_mapping *sm, struct vm_area_struct *new_vma) { - unsigned long new_size = new_vma->vm_end - new_vma->vm_start; const struct vdso_image *image = current->mm->context.vdso_image; - if (image->size != new_size) - return -EINVAL; - vdso_fix_landing(image, new_vma); current->mm->context.vdso = (void __user *)new_vma->vm_start; return 0; } -static int vvar_mremap(const struct vm_special_mapping *sm, - struct vm_area_struct *new_vma) -{ - const struct vdso_image *image = new_vma->vm_mm->context.vdso_image; - unsigned long new_size = new_vma->vm_end - new_vma->vm_start; - - if (new_size != -image->sym_vvar_start) - return -EINVAL; - - return 0; -} - #ifdef CONFIG_TIME_NS static struct page *find_timens_vvar_page(struct vm_area_struct *vma) { @@ -252,7 +236,6 @@ static const struct vm_special_mapping vdso_mapping = { static const struct vm_special_mapping vvar_mapping = { .name = "[vvar]", .fault = vvar_fault, - .mremap = vvar_mremap, }; /* diff --git a/arch/x86/include/asm/set_memory.h b/arch/x86/include/asm/set_memory.h index 5948218f35c5..4352f08bfbb5 100644 --- a/arch/x86/include/asm/set_memory.h +++ b/arch/x86/include/asm/set_memory.h @@ -82,6 +82,7 @@ int set_pages_rw(struct page *page, int numpages); int set_direct_map_invalid_noflush(struct page *page); int set_direct_map_default_noflush(struct page *page); +bool kernel_page_present(struct page *page); extern int kernel_set_to_readonly; diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c index 0daf2f1cf7a8..e916646adc69 100644 --- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c +++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c @@ -1458,7 +1458,7 @@ static int pseudo_lock_dev_release(struct inode *inode, struct file *filp) return 0; } -static int pseudo_lock_dev_mremap(struct vm_area_struct *area) +static int pseudo_lock_dev_mremap(struct vm_area_struct *area, unsigned long flags) { /* Not supported */ return -EINVAL; diff --git a/arch/x86/kernel/tboot.c b/arch/x86/kernel/tboot.c index ae64f98ec2ab..4c09ba110204 100644 --- a/arch/x86/kernel/tboot.c +++ b/arch/x86/kernel/tboot.c @@ -93,6 +93,7 @@ static struct mm_struct tboot_mm = { .pgd = swapper_pg_dir, .mm_users = ATOMIC_INIT(2), .mm_count = ATOMIC_INIT(1), + .write_protect_seq = SEQCNT_ZERO(tboot_mm.write_protect_seq), MMAP_LOCK_INITIALIZER(init_mm) .page_table_lock = __SPIN_LOCK_UNLOCKED(init_mm.page_table_lock), .mmlist = LIST_HEAD_INIT(init_mm.mmlist), diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c index 40baa90e74f4..16f878c26667 100644 --- a/arch/x86/mm/pat/set_memory.c +++ b/arch/x86/mm/pat/set_memory.c @@ -2194,6 +2194,7 @@ int set_direct_map_default_noflush(struct page *page) return __set_pages_p(page, 1); } +#ifdef CONFIG_DEBUG_PAGEALLOC void __kernel_map_pages(struct page *page, int numpages, int enable) { if (PageHighMem(page)) @@ -2225,8 +2226,8 @@ void __kernel_map_pages(struct page *page, int numpages, int enable) arch_flush_lazy_mmu_mode(); } +#endif /* CONFIG_DEBUG_PAGEALLOC */ -#ifdef CONFIG_HIBERNATION bool kernel_page_present(struct page *page) { unsigned int level; @@ -2238,7 +2239,6 @@ bool kernel_page_present(struct page *page) pte = lookup_address((unsigned long)page_address(page), &level); return (pte_val(*pte) & _PAGE_PRESENT); } -#endif /* CONFIG_HIBERNATION */ int __init kernel_map_pages_in_pgd(pgd_t *pgd, u64 pfn, unsigned long address, unsigned numpages, unsigned long page_flags) diff --git a/drivers/base/node.c b/drivers/base/node.c index 6ffa470e2984..04f71c7bc3f8 100644 --- a/drivers/base/node.c +++ b/drivers/base/node.c @@ -450,7 +450,7 @@ static ssize_t node_read_meminfo(struct device *dev, #ifdef CONFIG_SHADOW_CALL_STACK nid, node_page_state(pgdat, NR_KERNEL_SCS_KB), #endif - nid, K(sum_zone_node_page_state(nid, NR_PAGETABLE)), + nid, K(node_page_state(pgdat, NR_PAGETABLE)), nid, 0UL, nid, K(sum_zone_node_page_state(nid, NR_BOUNCE)), nid, K(node_page_state(pgdat, NR_WRITEBACK_TEMP)), diff --git a/drivers/block/zram/Kconfig b/drivers/block/zram/Kconfig index fe7a4b7d30cf..668c6bf2554d 100644 --- a/drivers/block/zram/Kconfig +++ b/drivers/block/zram/Kconfig @@ -2,7 +2,7 @@ config ZRAM tristate "Compressed RAM block device support" depends on BLOCK && SYSFS && ZSMALLOC && CRYPTO - select CRYPTO_LZO + depends on CRYPTO_LZO || CRYPTO_ZSTD || CRYPTO_LZ4 || CRYPTO_LZ4HC || CRYPTO_842 help Creates virtual block devices called /dev/zramX (X = 0, 1, ...). Pages written to these disks are compressed and stored in memory @@ -14,6 +14,46 @@ config ZRAM See Documentation/admin-guide/blockdev/zram.rst for more information. +choice + prompt "Default zram compressor" + default ZRAM_DEF_COMP_LZORLE + depends on ZRAM + +config ZRAM_DEF_COMP_LZORLE + bool "lzo-rle" + depends on CRYPTO_LZO + +config ZRAM_DEF_COMP_ZSTD + bool "zstd" + depends on CRYPTO_ZSTD + +config ZRAM_DEF_COMP_LZ4 + bool "lz4" + depends on CRYPTO_LZ4 + +config ZRAM_DEF_COMP_LZO + bool "lzo" + depends on CRYPTO_LZO + +config ZRAM_DEF_COMP_LZ4HC + bool "lz4hc" + depends on CRYPTO_LZ4HC + +config ZRAM_DEF_COMP_842 + bool "842" + depends on CRYPTO_842 + +endchoice + +config ZRAM_DEF_COMP + string + default "lzo-rle" if ZRAM_DEF_COMP_LZORLE + default "zstd" if ZRAM_DEF_COMP_ZSTD + default "lz4" if ZRAM_DEF_COMP_LZ4 + default "lzo" if ZRAM_DEF_COMP_LZO + default "lz4hc" if ZRAM_DEF_COMP_LZ4HC + default "842" if ZRAM_DEF_COMP_842 + config ZRAM_WRITEBACK bool "Write back incompressible or idle page to backing device" depends on ZRAM diff --git a/drivers/block/zram/zcomp.c b/drivers/block/zram/zcomp.c index 33e3b76c4fa9..052aa3f65514 100644 --- a/drivers/block/zram/zcomp.c +++ b/drivers/block/zram/zcomp.c @@ -15,8 +15,10 @@ #include "zcomp.h" static const char * const backends[] = { +#if IS_ENABLED(CONFIG_CRYPTO_LZO) "lzo", "lzo-rle", +#endif #if IS_ENABLED(CONFIG_CRYPTO_LZ4) "lz4", #endif diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c index 1b697208d661..66a33e418940 100644 --- a/drivers/block/zram/zram_drv.c +++ b/drivers/block/zram/zram_drv.c @@ -42,7 +42,7 @@ static DEFINE_IDR(zram_index_idr); static DEFINE_MUTEX(zram_index_mutex); static int zram_major; -static const char *default_compressor = "lzo-rle"; +static const char *default_compressor = CONFIG_ZRAM_DEF_COMP; /* Module params (documentation at end) */ static unsigned int num_devices = 1; @@ -620,15 +620,19 @@ static int read_from_bdev_async(struct zram *zram, struct bio_vec *bvec, return 1; } +#define PAGE_WB_SIG "page_index=" + +#define PAGE_WRITEBACK 0 #define HUGE_WRITEBACK 1 #define IDLE_WRITEBACK 2 + static ssize_t writeback_store(struct device *dev, struct device_attribute *attr, const char *buf, size_t len) { struct zram *zram = dev_to_zram(dev); unsigned long nr_pages = zram->disksize >> PAGE_SHIFT; - unsigned long index; + unsigned long index = 0; struct bio bio; struct bio_vec bio_vec; struct page *page; @@ -640,8 +644,17 @@ static ssize_t writeback_store(struct device *dev, mode = IDLE_WRITEBACK; else if (sysfs_streq(buf, "huge")) mode = HUGE_WRITEBACK; - else - return -EINVAL; + else { + if (strncmp(buf, PAGE_WB_SIG, sizeof(PAGE_WB_SIG) - 1)) + return -EINVAL; + + ret = kstrtol(buf + sizeof(PAGE_WB_SIG) - 1, 10, &index); + if (ret || index >= nr_pages) + return -EINVAL; + + nr_pages = 1; + mode = PAGE_WRITEBACK; + } down_read(&zram->init_lock); if (!init_done(zram)) { @@ -660,7 +673,7 @@ static ssize_t writeback_store(struct device *dev, goto release_init_lock; } - for (index = 0; index < nr_pages; index++) { + while (nr_pages--) { struct bio_vec bvec; bvec.bv_page = page; @@ -1071,7 +1084,7 @@ static ssize_t mm_stat_show(struct device *dev, max_used = atomic_long_read(&zram->stats.max_used_pages); ret = scnprintf(buf, PAGE_SIZE, - "%8llu %8llu %8llu %8lu %8ld %8llu %8lu %8llu\n", + "%8llu %8llu %8llu %8lu %8ld %8llu %8lu %8llu %8llu\n", orig_size << PAGE_SHIFT, (u64)atomic64_read(&zram->stats.compr_data_size), mem_used << PAGE_SHIFT, @@ -1079,7 +1092,8 @@ static ssize_t mm_stat_show(struct device *dev, max_used << PAGE_SHIFT, (u64)atomic64_read(&zram->stats.same_pages), pool_stats.pages_compacted, - (u64)atomic64_read(&zram->stats.huge_pages)); + (u64)atomic64_read(&zram->stats.huge_pages), + (u64)atomic64_read(&zram->stats.huge_pages_since)); up_read(&zram->init_lock); return ret; @@ -1411,6 +1425,7 @@ out: if (comp_len == PAGE_SIZE) { zram_set_flag(zram, index, ZRAM_HUGE); atomic64_inc(&zram->stats.huge_pages); + atomic64_inc(&zram->stats.huge_pages_since); } if (flags) { diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h index f2fd46daa760..9cabcbb13fd9 100644 --- a/drivers/block/zram/zram_drv.h +++ b/drivers/block/zram/zram_drv.h @@ -78,6 +78,7 @@ struct zram_stats { atomic64_t notify_free; /* no. of swap slot free notifications */ atomic64_t same_pages; /* no. of same element filled pages */ atomic64_t huge_pages; /* no. of huge pages */ + atomic64_t huge_pages_since; /* no. of huge pages since zram set up */ atomic64_t pages_stored; /* no. of pages currently stored */ atomic_long_t max_used_pages; /* no. of maximum pages stored */ atomic64_t writestall; /* no. of write slow paths */ diff --git a/drivers/dax/device.c b/drivers/dax/device.c index 25e0b84a4296..5da2980bb16b 100644 --- a/drivers/dax/device.c +++ b/drivers/dax/device.c @@ -256,7 +256,7 @@ static vm_fault_t dev_dax_fault(struct vm_fault *vmf) return dev_dax_huge_fault(vmf, PE_SIZE_PTE); } -static int dev_dax_split(struct vm_area_struct *vma, unsigned long addr) +static int dev_dax_may_split(struct vm_area_struct *vma, unsigned long addr) { struct file *filp = vma->vm_file; struct dev_dax *dev_dax = filp->private_data; @@ -277,7 +277,7 @@ static unsigned long dev_dax_pagesize(struct vm_area_struct *vma) static const struct vm_operations_struct dax_vm_ops = { .fault = dev_dax_fault, .huge_fault = dev_dax_huge_fault, - .split = dev_dax_split, + .may_split = dev_dax_may_split, .pagesize = dev_dax_pagesize, }; diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c index b4368c5b6a0c..403ec42472d1 100644 --- a/drivers/dax/kmem.c +++ b/drivers/dax/kmem.c @@ -61,7 +61,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax) return -EINVAL; } - data = kzalloc(sizeof(*data) + sizeof(struct resource *) * dev_dax->nr_range, GFP_KERNEL); + data = kzalloc(struct_size(data, res, dev_dax->nr_range), GFP_KERNEL); if (!data) return -ENOMEM; diff --git a/drivers/dma-buf/sync_file.c b/drivers/dma-buf/sync_file.c index 5a5a1da01a00..20d9bddbb985 100644 --- a/drivers/dma-buf/sync_file.c +++ b/drivers/dma-buf/sync_file.c @@ -270,8 +270,7 @@ static struct sync_file *sync_file_merge(const char *name, struct sync_file *a, fences[i++] = dma_fence_get(a_fences[0]); if (num_fences > i) { - nfences = krealloc(fences, i * sizeof(*fences), - GFP_KERNEL); + nfences = krealloc_array(fences, i, sizeof(*fences), GFP_KERNEL); if (!nfences) goto err; diff --git a/drivers/edac/ghes_edac.c b/drivers/edac/ghes_edac.c index a918ca93e4f7..6d1ddecbf0da 100644 --- a/drivers/edac/ghes_edac.c +++ b/drivers/edac/ghes_edac.c @@ -207,8 +207,8 @@ static void enumerate_dimms(const struct dmi_header *dh, void *arg) if (!hw->num_dimms || !(hw->num_dimms % 16)) { struct dimm_info *new; - new = krealloc(hw->dimms, (hw->num_dimms + 16) * sizeof(struct dimm_info), - GFP_KERNEL); + new = krealloc_array(hw->dimms, hw->num_dimms + 16, + sizeof(struct dimm_info), GFP_KERNEL); if (!new) { WARN_ON_ONCE(1); return; diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c index 6c6eec044a97..df3f9bcab581 100644 --- a/drivers/firmware/efi/efi.c +++ b/drivers/firmware/efi/efi.c @@ -57,6 +57,7 @@ struct mm_struct efi_mm = { .mm_rb = RB_ROOT, .mm_users = ATOMIC_INIT(2), .mm_count = ATOMIC_INIT(1), + .write_protect_seq = SEQCNT_ZERO(efi_mm.write_protect_seq), MMAP_LOCK_INITIALIZER(efi_mm) .page_table_lock = __SPIN_LOCK_UNLOCKED(efi_mm.page_table_lock), .mmlist = LIST_HEAD_INIT(efi_mm.mmlist), diff --git a/drivers/gpu/drm/drm_atomic.c b/drivers/gpu/drm/drm_atomic.c index b2d20eb6c807..dda60051854b 100644 --- a/drivers/gpu/drm/drm_atomic.c +++ b/drivers/gpu/drm/drm_atomic.c @@ -964,7 +964,8 @@ drm_atomic_get_connector_state(struct drm_atomic_state *state, struct __drm_connnectors_state *c; int alloc = max(index + 1, config->num_connector); - c = krealloc(state->connectors, alloc * sizeof(*state->connectors), GFP_KERNEL); + c = krealloc_array(state->connectors, alloc, + sizeof(*state->connectors), GFP_KERNEL); if (!c) return ERR_PTR(-ENOMEM); diff --git a/drivers/hwtracing/intel_th/msu.c b/drivers/hwtracing/intel_th/msu.c index 3a77551fb4fc..7d95242db900 100644 --- a/drivers/hwtracing/intel_th/msu.c +++ b/drivers/hwtracing/intel_th/msu.c @@ -2002,7 +2002,7 @@ nr_pages_store(struct device *dev, struct device_attribute *attr, } nr_wins++; - rewin = krealloc(win, sizeof(*win) * nr_wins, GFP_KERNEL); + rewin = krealloc_array(win, nr_wins, sizeof(*win), GFP_KERNEL); if (!rewin) { kfree(win); return -ENOMEM; diff --git a/drivers/ide/falconide.c b/drivers/ide/falconide.c index dbeb2605e5f6..77af4c1a3f38 100644 --- a/drivers/ide/falconide.c +++ b/drivers/ide/falconide.c @@ -51,8 +51,6 @@ static void falconide_release_lock(void) static void falconide_get_lock(irq_handler_t handler, void *data) { if (falconide_intr_lock == 0) { - if (in_interrupt() > 0) - panic("Falcon IDE hasn't ST-DMA lock in interrupt"); stdma_lock(handler, data); falconide_intr_lock = 1; } diff --git a/drivers/ide/ide-probe.c b/drivers/ide/ide-probe.c index 1ddc45a04418..430b29e0abdb 100644 --- a/drivers/ide/ide-probe.c +++ b/drivers/ide/ide-probe.c @@ -1592,9 +1592,6 @@ EXPORT_SYMBOL_GPL(ide_port_unregister_devices); static void ide_unregister(ide_hwif_t *hwif) { - BUG_ON(in_interrupt()); - BUG_ON(irqs_disabled()); - mutex_lock(&ide_cfg_mtx); if (hwif->present) { diff --git a/drivers/misc/lkdtm/Makefile b/drivers/misc/lkdtm/Makefile index c70b3822013f..1c4c7aca0026 100644 --- a/drivers/misc/lkdtm/Makefile +++ b/drivers/misc/lkdtm/Makefile @@ -11,6 +11,7 @@ lkdtm-$(CONFIG_LKDTM) += usercopy.o lkdtm-$(CONFIG_LKDTM) += stackleak.o lkdtm-$(CONFIG_LKDTM) += cfi.o +KASAN_SANITIZE_rodata.o := n KASAN_SANITIZE_stackleak.o := n KCOV_INSTRUMENT_rodata.o := n diff --git a/drivers/pinctrl/pinctrl-utils.c b/drivers/pinctrl/pinctrl-utils.c index f2bcbf62c03d..93df0d4c0a24 100644 --- a/drivers/pinctrl/pinctrl-utils.c +++ b/drivers/pinctrl/pinctrl-utils.c @@ -39,7 +39,7 @@ int pinctrl_utils_reserve_map(struct pinctrl_dev *pctldev, if (old_num >= new_num) return 0; - new_map = krealloc(*map, sizeof(*new_map) * new_num, GFP_KERNEL); + new_map = krealloc_array(*map, new_num, sizeof(*new_map), GFP_KERNEL); if (!new_map) { dev_err(pctldev->dev, "krealloc(map) failed\n"); return -ENOMEM; diff --git a/drivers/vhost/vringh.c b/drivers/vhost/vringh.c index b7403ba8e7f7..85d85faba058 100644 --- a/drivers/vhost/vringh.c +++ b/drivers/vhost/vringh.c @@ -198,7 +198,8 @@ static int resize_iovec(struct vringh_kiov *iov, gfp_t gfp) flag = (iov->max_num & VRINGH_IOV_ALLOCATED); if (flag) - new = krealloc(iov->iov, new_num * sizeof(struct iovec), gfp); + new = krealloc_array(iov->iov, new_num, + sizeof(struct iovec), gfp); else { new = kmalloc_array(new_num, sizeof(struct iovec), gfp); if (new) { diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index 481611c09dae..8985fc2cea86 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -1114,9 +1114,7 @@ static int virtballoon_validate(struct virtio_device *vdev) * page reporting as it could potentially change the contents * of our free pages. */ - if (!want_init_on_free() && - (IS_ENABLED(CONFIG_PAGE_POISONING_NO_SANITY) || - !page_poisoning_enabled())) + if (!want_init_on_free() && !page_poisoning_enabled_static()) __virtio_clear_bit(vdev, VIRTIO_BALLOON_F_PAGE_POISON); else if (!virtio_has_feature(vdev, VIRTIO_BALLOON_F_PAGE_POISON)) __virtio_clear_bit(vdev, VIRTIO_BALLOON_F_REPORTING); diff --git a/drivers/xen/unpopulated-alloc.c b/drivers/xen/unpopulated-alloc.c index 7762c1bb23cb..e64e6befc63b 100644 --- a/drivers/xen/unpopulated-alloc.c +++ b/drivers/xen/unpopulated-alloc.c @@ -27,11 +27,6 @@ static int fill_list(unsigned int nr_pages) if (!res) return -ENOMEM; - pgmap = kzalloc(sizeof(*pgmap), GFP_KERNEL); - if (!pgmap) - goto err_pgmap; - - pgmap->type = MEMORY_DEVICE_GENERIC; res->name = "Xen scratch"; res->flags = IORESOURCE_MEM | IORESOURCE_BUSY; @@ -43,6 +38,11 @@ static int fill_list(unsigned int nr_pages) goto err_resource; } + pgmap = kzalloc(sizeof(*pgmap), GFP_KERNEL); + if (!pgmap) + goto err_pgmap; + + pgmap->type = MEMORY_DEVICE_GENERIC; pgmap->range = (struct range) { .start = res->start, .end = res->end, @@ -92,10 +92,10 @@ static int fill_list(unsigned int nr_pages) return 0; err_memremap: - release_resource(res); -err_resource: kfree(pgmap); err_pgmap: + release_resource(res); +err_resource: kfree(res); return ret; } @@ -323,13 +323,16 @@ static void aio_free_ring(struct kioctx *ctx) } } -static int aio_ring_mremap(struct vm_area_struct *vma) +static int aio_ring_mremap(struct vm_area_struct *vma, unsigned long flags) { struct file *file = vma->vm_file; struct mm_struct *mm = vma->vm_mm; struct kioctx_table *table; int i, res = -EINVAL; + if (flags & MREMAP_DONTUNMAP) + return -EINVAL; + spin_lock(&mm->ioctx_lock); rcu_read_lock(); table = rcu_dereference(mm->ioctx_table); diff --git a/fs/ntfs/file.c b/fs/ntfs/file.c index f42967b738eb..e5aab265dff1 100644 --- a/fs/ntfs/file.c +++ b/fs/ntfs/file.c @@ -323,7 +323,7 @@ static ssize_t ntfs_prepare_file_for_write(struct kiocb *iocb, unsigned long flags; struct file *file = iocb->ki_filp; struct inode *vi = file_inode(file); - ntfs_inode *base_ni, *ni = NTFS_I(vi); + ntfs_inode *ni = NTFS_I(vi); ntfs_volume *vol = ni->vol; ntfs_debug("Entering for i_ino 0x%lx, attribute type 0x%x, pos " @@ -365,9 +365,6 @@ static ssize_t ntfs_prepare_file_for_write(struct kiocb *iocb, err = -EOPNOTSUPP; goto out; } - base_ni = ni; - if (NInoAttr(ni)) - base_ni = ni->ext.base_ntfs_ino; err = file_remove_privs(file); if (unlikely(err)) goto out; diff --git a/fs/ntfs/inode.c b/fs/ntfs/inode.c index caf563981532..f7e4cbc26eaf 100644 --- a/fs/ntfs/inode.c +++ b/fs/ntfs/inode.c @@ -2347,7 +2347,6 @@ int ntfs_truncate(struct inode *vi) ATTR_RECORD *a; const char *te = " Leaving file length out of sync with i_size."; int err, mp_size, size_change, alloc_change; - u32 attr_len; ntfs_debug("Entering for inode 0x%lx.", vi->i_ino); BUG_ON(NInoAttr(ni)); @@ -2721,7 +2720,6 @@ do_non_resident_truncate: * this cannot fail since we are making the attribute smaller thus by * definition there is enough space to do so. */ - attr_len = le32_to_cpu(a->length); err = ntfs_attr_record_resize(m, a, mp_size + le16_to_cpu(a->data.non_resident.mapping_pairs_offset)); BUG_ON(err); diff --git a/fs/ntfs/logfile.c b/fs/ntfs/logfile.c index a0c40f1be7ac..bc1bf217b38e 100644 --- a/fs/ntfs/logfile.c +++ b/fs/ntfs/logfile.c @@ -478,7 +478,7 @@ bool ntfs_check_logfile(struct inode *log_vi, RESTART_PAGE_HEADER **rp) u8 *kaddr = NULL; RESTART_PAGE_HEADER *rstr1_ph = NULL; RESTART_PAGE_HEADER *rstr2_ph = NULL; - int log_page_size, log_page_mask, err; + int log_page_size, err; bool logfile_is_empty = true; u8 log_page_bits; @@ -501,7 +501,6 @@ bool ntfs_check_logfile(struct inode *log_vi, RESTART_PAGE_HEADER **rp) log_page_size = DefaultLogPageSize; else log_page_size = PAGE_SIZE; - log_page_mask = log_page_size - 1; /* * Use ntfs_ffs() instead of ffs() to enable the compiler to * optimize log_page_size and log_page_bits into constants. diff --git a/fs/ocfs2/cluster/tcp.c b/fs/ocfs2/cluster/tcp.c index 79a231719460..3bd8119bed5e 100644 --- a/fs/ocfs2/cluster/tcp.c +++ b/fs/ocfs2/cluster/tcp.c @@ -1198,7 +1198,6 @@ static int o2net_process_message(struct o2net_sock_container *sc, msglog(hdr, "bad magic\n"); ret = -EINVAL; goto out; - break; } /* find a handler for it */ diff --git a/fs/ocfs2/namei.c b/fs/ocfs2/namei.c index c46bf7f581a1..2a237ab00453 100644 --- a/fs/ocfs2/namei.c +++ b/fs/ocfs2/namei.c @@ -1088,8 +1088,8 @@ static int ocfs2_check_if_ancestor(struct ocfs2_super *osb, child_inode_no = parent_inode_no; if (++i >= MAX_LOOKUP_TIMES) { - mlog(ML_NOTICE, "max lookup times reached, filesystem " - "may have nested directories, " + mlog_ratelimited(ML_NOTICE, "max lookup times reached, " + "filesystem may have nested directories, " "src inode: %llu, dest inode: %llu.\n", (unsigned long long)src_inode_no, (unsigned long long)dest_inode_no); diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c index e502414b3556..4d2e64e9016c 100644 --- a/fs/proc/kcore.c +++ b/fs/proc/kcore.c @@ -193,8 +193,6 @@ kclist_add_private(unsigned long pfn, unsigned long nr_pages, void *arg) return 1; p = pfn_to_page(pfn); - if (!memmap_valid_within(pfn, p, page_zone(p))) - return 1; ent = kmalloc(sizeof(*ent), GFP_KERNEL); if (!ent) diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c index 887a5532e449..d6fc74619625 100644 --- a/fs/proc/meminfo.c +++ b/fs/proc/meminfo.c @@ -107,7 +107,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v) global_node_page_state(NR_KERNEL_SCS_KB)); #endif show_val_kb(m, "PageTables: ", - global_zone_page_state(NR_PAGETABLE)); + global_node_page_state(NR_PAGETABLE)); show_val_kb(m, "NFS_Unstable: ", 0); show_val_kb(m, "Bounce: ", diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c index 000b457ad087..894cc28142e7 100644 --- a/fs/userfaultfd.c +++ b/fs/userfaultfd.c @@ -28,7 +28,7 @@ #include <linux/security.h> #include <linux/hugetlb.h> -int sysctl_unprivileged_userfaultfd __read_mostly = 1; +int sysctl_unprivileged_userfaultfd __read_mostly; static struct kmem_cache *userfaultfd_ctx_cachep __read_mostly; @@ -405,6 +405,13 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason) if (ctx->features & UFFD_FEATURE_SIGBUS) goto out; + if ((vmf->flags & FAULT_FLAG_USER) == 0 && + ctx->flags & UFFD_USER_MODE_ONLY) { + printk_once(KERN_WARNING "uffd: Set unprivileged_userfaultfd " + "sysctl knob to 1 if kernel faults must be handled " + "without obtaining CAP_SYS_PTRACE capability\n"); + goto out; + } /* * If it's already released don't get it. This avoids to loop @@ -1959,16 +1966,23 @@ SYSCALL_DEFINE1(userfaultfd, int, flags) struct userfaultfd_ctx *ctx; int fd; - if (!sysctl_unprivileged_userfaultfd && !capable(CAP_SYS_PTRACE)) + if (!sysctl_unprivileged_userfaultfd && + (flags & UFFD_USER_MODE_ONLY) == 0 && + !capable(CAP_SYS_PTRACE)) { + printk_once(KERN_WARNING "uffd: Set unprivileged_userfaultfd " + "sysctl knob to 1 if kernel faults must be handled " + "without obtaining CAP_SYS_PTRACE capability\n"); return -EPERM; + } BUG_ON(!current->mm); /* Check the UFFD_* constants for consistency. */ + BUILD_BUG_ON(UFFD_USER_MODE_ONLY & UFFD_SHARED_FCNTL_FLAGS); BUILD_BUG_ON(UFFD_CLOEXEC != O_CLOEXEC); BUILD_BUG_ON(UFFD_NONBLOCK != O_NONBLOCK); - if (flags & ~UFFD_SHARED_FCNTL_FLAGS) + if (flags & ~(UFFD_SHARED_FCNTL_FLAGS | UFFD_USER_MODE_ONLY)) return -EINVAL; ctx = kmem_cache_alloc(userfaultfd_ctx_cachep, GFP_KERNEL); diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h index fee0b5547cd0..559ee05f86b2 100644 --- a/include/linux/cgroup-defs.h +++ b/include/linux/cgroup-defs.h @@ -668,21 +668,6 @@ struct cgroup_subsys { */ bool threaded:1; - /* - * If %false, this subsystem is properly hierarchical - - * configuration, resource accounting and restriction on a parent - * cgroup cover those of its children. If %true, hierarchy support - * is broken in some ways - some subsystems ignore hierarchy - * completely while others are only implemented half-way. - * - * It's now disallowed to create nested cgroups if the subsystem is - * broken and cgroup core will emit a warning message on such - * cases. Eventually, all subsystems will be made properly - * hierarchical and this will go away. - */ - bool broken_hierarchy:1; - bool warned_broken_hierarchy:1; - /* the following two fields are initialized automtically during boot */ int id; const char *name; diff --git a/include/linux/compaction.h b/include/linux/compaction.h index 1de5a1151ee7..ed4070ed41ef 100644 --- a/include/linux/compaction.h +++ b/include/linux/compaction.h @@ -98,11 +98,8 @@ extern void reset_isolation_suitable(pg_data_t *pgdat); extern enum compact_result compaction_suitable(struct zone *zone, int order, unsigned int alloc_flags, int highest_zoneidx); -extern void defer_compaction(struct zone *zone, int order); -extern bool compaction_deferred(struct zone *zone, int order); extern void compaction_defer_reset(struct zone *zone, int order, bool alloc_success); -extern bool compaction_restarting(struct zone *zone, int order); /* Compaction has made some progress and retrying makes sense */ static inline bool compaction_made_progress(enum compact_result result) @@ -194,15 +191,6 @@ static inline enum compact_result compaction_suitable(struct zone *zone, int ord return COMPACT_SKIPPED; } -static inline void defer_compaction(struct zone *zone, int order) -{ -} - -static inline bool compaction_deferred(struct zone *zone, int order) -{ - return true; -} - static inline bool compaction_made_progress(enum compact_result result) { return false; diff --git a/include/linux/fs.h b/include/linux/fs.h index 8667d0cdc71e..1fcc2b00582b 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -3230,7 +3230,7 @@ static inline bool vma_is_fsdax(struct vm_area_struct *vma) { struct inode *inode; - if (!vma->vm_file) + if (!IS_ENABLED(CONFIG_FS_DAX) || !vma->vm_file) return false; if (!vma_is_dax(vma)) return false; diff --git a/include/linux/gfp.h b/include/linux/gfp.h index c603237e006c..6e479e9c48ce 100644 --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -580,8 +580,6 @@ void * __meminit alloc_pages_exact_nid(int nid, size_t size, gfp_t gfp_mask); extern void __free_pages(struct page *page, unsigned int order); extern void free_pages(unsigned long addr, unsigned int order); -extern void free_unref_page(struct page *page); -extern void free_unref_page_list(struct list_head *list); struct page_frag_cache; extern void __page_frag_cache_drain(struct page *page, unsigned int count); diff --git a/include/linux/highmem.h b/include/linux/highmem.h index f597830f26b4..d2c70d3772a3 100644 --- a/include/linux/highmem.h +++ b/include/linux/highmem.h @@ -204,13 +204,22 @@ static inline void clear_highpage(struct page *page) kunmap_atomic(kaddr); } +/* + * If we pass in a base or tail page, we can zero up to PAGE_SIZE. + * If we pass in a head page, we can zero up to the size of the compound page. + */ +#if defined(CONFIG_HIGHMEM) && defined(CONFIG_TRANSPARENT_HUGEPAGE) +void zero_user_segments(struct page *page, unsigned start1, unsigned end1, + unsigned start2, unsigned end2); +#else /* !HIGHMEM || !TRANSPARENT_HUGEPAGE */ static inline void zero_user_segments(struct page *page, - unsigned start1, unsigned end1, - unsigned start2, unsigned end2) + unsigned start1, unsigned end1, + unsigned start2, unsigned end2) { void *kaddr = kmap_atomic(page); + unsigned int i; - BUG_ON(end1 > PAGE_SIZE || end2 > PAGE_SIZE); + BUG_ON(end1 > page_size(page) || end2 > page_size(page)); if (end1 > start1) memset(kaddr + start1, 0, end1 - start1); @@ -219,8 +228,10 @@ static inline void zero_user_segments(struct page *page, memset(kaddr + start2, 0, end2 - start2); kunmap_atomic(kaddr); - flush_dcache_page(page); + for (i = 0; i < compound_nr(page); i++) + flush_dcache_page(page + i); } +#endif /* !HIGHMEM || !TRANSPARENT_HUGEPAGE */ static inline void zero_user_segment(struct page *page, unsigned start, unsigned end) diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 0365aa97f8e7..6a19f35f836b 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -7,43 +7,37 @@ #include <linux/fs.h> /* only for vma_is_dax() */ -extern vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf); -extern int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, - pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr, - struct vm_area_struct *vma); -extern void huge_pmd_set_accessed(struct vm_fault *vmf, pmd_t orig_pmd); -extern int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm, - pud_t *dst_pud, pud_t *src_pud, unsigned long addr, - struct vm_area_struct *vma); +vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf); +int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, + pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr, + struct vm_area_struct *vma); +void huge_pmd_set_accessed(struct vm_fault *vmf, pmd_t orig_pmd); +int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm, + pud_t *dst_pud, pud_t *src_pud, unsigned long addr, + struct vm_area_struct *vma); #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD -extern void huge_pud_set_accessed(struct vm_fault *vmf, pud_t orig_pud); +void huge_pud_set_accessed(struct vm_fault *vmf, pud_t orig_pud); #else static inline void huge_pud_set_accessed(struct vm_fault *vmf, pud_t orig_pud) { } #endif -extern vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd); -extern struct page *follow_trans_huge_pmd(struct vm_area_struct *vma, - unsigned long addr, - pmd_t *pmd, - unsigned int flags); -extern bool madvise_free_huge_pmd(struct mmu_gather *tlb, - struct vm_area_struct *vma, - pmd_t *pmd, unsigned long addr, unsigned long next); -extern int zap_huge_pmd(struct mmu_gather *tlb, - struct vm_area_struct *vma, - pmd_t *pmd, unsigned long addr); -extern int zap_huge_pud(struct mmu_gather *tlb, - struct vm_area_struct *vma, - pud_t *pud, unsigned long addr); -extern bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr, - unsigned long new_addr, - pmd_t *old_pmd, pmd_t *new_pmd); -extern int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, - unsigned long addr, pgprot_t newprot, - unsigned long cp_flags); +vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd); +struct page *follow_trans_huge_pmd(struct vm_area_struct *vma, + unsigned long addr, pmd_t *pmd, + unsigned int flags); +bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma, + pmd_t *pmd, unsigned long addr, unsigned long next); +int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma, pmd_t *pmd, + unsigned long addr); +int zap_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma, pud_t *pud, + unsigned long addr); +bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr, + unsigned long new_addr, pmd_t *old_pmd, pmd_t *new_pmd); +int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr, + pgprot_t newprot, unsigned long cp_flags); vm_fault_t vmf_insert_pfn_pmd_prot(struct vm_fault *vmf, pfn_t pfn, pgprot_t pgprot, bool write); @@ -100,13 +94,13 @@ enum transparent_hugepage_flag { struct kobject; struct kobj_attribute; -extern ssize_t single_hugepage_flag_store(struct kobject *kobj, - struct kobj_attribute *attr, - const char *buf, size_t count, - enum transparent_hugepage_flag flag); -extern ssize_t single_hugepage_flag_show(struct kobject *kobj, - struct kobj_attribute *attr, char *buf, - enum transparent_hugepage_flag flag); +ssize_t single_hugepage_flag_store(struct kobject *kobj, + struct kobj_attribute *attr, + const char *buf, size_t count, + enum transparent_hugepage_flag flag); +ssize_t single_hugepage_flag_show(struct kobject *kobj, + struct kobj_attribute *attr, char *buf, + enum transparent_hugepage_flag flag); extern struct kobj_attribute shmem_enabled_attr; #define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT) @@ -179,12 +173,11 @@ static inline bool transhuge_vma_suitable(struct vm_area_struct *vma, (transparent_hugepage_flags & \ (1<<TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG)) -extern unsigned long thp_get_unmapped_area(struct file *filp, - unsigned long addr, unsigned long len, unsigned long pgoff, - unsigned long flags); +unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr, + unsigned long len, unsigned long pgoff, unsigned long flags); -extern void prep_transhuge_page(struct page *page); -extern void free_transhuge_page(struct page *page); +void prep_transhuge_page(struct page *page); +void free_transhuge_page(struct page *page); bool is_transparent_hugepage(struct page *page); bool can_split_huge_page(struct page *page, int *pextra_pins); @@ -222,16 +215,12 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud, __split_huge_pud(__vma, __pud, __address); \ } while (0) -extern int hugepage_madvise(struct vm_area_struct *vma, - unsigned long *vm_flags, int advice); -extern void vma_adjust_trans_huge(struct vm_area_struct *vma, - unsigned long start, - unsigned long end, - long adjust_next); -extern spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, - struct vm_area_struct *vma); -extern spinlock_t *__pud_trans_huge_lock(pud_t *pud, - struct vm_area_struct *vma); +int hugepage_madvise(struct vm_area_struct *vma, unsigned long *vm_flags, + int advice); +void vma_adjust_trans_huge(struct vm_area_struct *vma, unsigned long start, + unsigned long end, long adjust_next); +spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma); +spinlock_t *__pud_trans_huge_lock(pud_t *pud, struct vm_area_struct *vma); static inline int is_swap_pmd(pmd_t pmd) { @@ -294,7 +283,7 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr, struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr, pud_t *pud, int flags, struct dev_pagemap **pgmap); -extern vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t orig_pmd); +vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t orig_pmd); extern struct page *huge_zero_page; diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 922a7f600465..f530d634f055 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -235,11 +235,6 @@ struct mem_cgroup { struct vmpressure vmpressure; /* - * Should the accounting and control be hierarchical, per subtree? - */ - bool use_hierarchy; - - /* * Should the OOM killer kill all belonging tasks, had it kill one? */ bool oom_group; @@ -296,7 +291,6 @@ struct mem_cgroup { int tcpmem_pressure; #ifdef CONFIG_MEMCG_KMEM - /* Index in the kmem_cache->memcg_params.memcg_caches array */ int kmemcg_id; enum memcg_kmem_state kmem_state; struct obj_cgroup __rcu *objcg; @@ -589,8 +583,6 @@ static inline bool mem_cgroup_is_descendant(struct mem_cgroup *memcg, { if (root == memcg) return true; - if (!root->use_hierarchy) - return false; return cgroup_is_descendant(memcg->css.cgroup, root->css.cgroup); } @@ -794,19 +786,15 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec, void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx, int val); -void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx, - int val); -void __mod_lruvec_slab_state(void *p, enum node_stat_item idx, int val); - -void mod_memcg_obj_state(void *p, int idx, int val); +void __mod_lruvec_kmem_state(void *p, enum node_stat_item idx, int val); -static inline void mod_lruvec_slab_state(void *p, enum node_stat_item idx, +static inline void mod_lruvec_kmem_state(void *p, enum node_stat_item idx, int val) { unsigned long flags; local_irq_save(flags); - __mod_lruvec_slab_state(p, idx, val); + __mod_lruvec_kmem_state(p, idx, val); local_irq_restore(flags); } @@ -820,43 +808,6 @@ static inline void mod_memcg_lruvec_state(struct lruvec *lruvec, local_irq_restore(flags); } -static inline void mod_lruvec_state(struct lruvec *lruvec, - enum node_stat_item idx, int val) -{ - unsigned long flags; - - local_irq_save(flags); - __mod_lruvec_state(lruvec, idx, val); - local_irq_restore(flags); -} - -static inline void __mod_lruvec_page_state(struct page *page, - enum node_stat_item idx, int val) -{ - struct page *head = compound_head(page); /* rmap on tail pages */ - pg_data_t *pgdat = page_pgdat(page); - struct lruvec *lruvec; - - /* Untracked pages have no memcg, no lruvec. Update only the node */ - if (!head->mem_cgroup) { - __mod_node_page_state(pgdat, idx, val); - return; - } - - lruvec = mem_cgroup_lruvec(head->mem_cgroup, pgdat); - __mod_lruvec_state(lruvec, idx, val); -} - -static inline void mod_lruvec_page_state(struct page *page, - enum node_stat_item idx, int val) -{ - unsigned long flags; - - local_irq_save(flags); - __mod_lruvec_page_state(page, idx, val); - local_irq_restore(flags); -} - unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order, gfp_t gfp_mask, unsigned long *total_scanned); @@ -1215,31 +1166,7 @@ static inline void __mod_memcg_lruvec_state(struct lruvec *lruvec, { } -static inline void __mod_lruvec_state(struct lruvec *lruvec, - enum node_stat_item idx, int val) -{ - __mod_node_page_state(lruvec_pgdat(lruvec), idx, val); -} - -static inline void mod_lruvec_state(struct lruvec *lruvec, - enum node_stat_item idx, int val) -{ - mod_node_page_state(lruvec_pgdat(lruvec), idx, val); -} - -static inline void __mod_lruvec_page_state(struct page *page, - enum node_stat_item idx, int val) -{ - __mod_node_page_state(page_pgdat(page), idx, val); -} - -static inline void mod_lruvec_page_state(struct page *page, - enum node_stat_item idx, int val) -{ - mod_node_page_state(page_pgdat(page), idx, val); -} - -static inline void __mod_lruvec_slab_state(void *p, enum node_stat_item idx, +static inline void __mod_lruvec_kmem_state(void *p, enum node_stat_item idx, int val) { struct page *page = virt_to_head_page(p); @@ -1247,7 +1174,7 @@ static inline void __mod_lruvec_slab_state(void *p, enum node_stat_item idx, __mod_node_page_state(page_pgdat(page), idx, val); } -static inline void mod_lruvec_slab_state(void *p, enum node_stat_item idx, +static inline void mod_lruvec_kmem_state(void *p, enum node_stat_item idx, int val) { struct page *page = virt_to_head_page(p); @@ -1255,10 +1182,6 @@ static inline void mod_lruvec_slab_state(void *p, enum node_stat_item idx, mod_node_page_state(page_pgdat(page), idx, val); } -static inline void mod_memcg_obj_state(void *p, int idx, int val) -{ -} - static inline unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order, gfp_t gfp_mask, @@ -1322,38 +1245,14 @@ static inline void __dec_memcg_page_state(struct page *page, __mod_memcg_page_state(page, idx, -1); } -static inline void __inc_lruvec_state(struct lruvec *lruvec, - enum node_stat_item idx) -{ - __mod_lruvec_state(lruvec, idx, 1); -} - -static inline void __dec_lruvec_state(struct lruvec *lruvec, - enum node_stat_item idx) -{ - __mod_lruvec_state(lruvec, idx, -1); -} - -static inline void __inc_lruvec_page_state(struct page *page, - enum node_stat_item idx) -{ - __mod_lruvec_page_state(page, idx, 1); -} - -static inline void __dec_lruvec_page_state(struct page *page, - enum node_stat_item idx) -{ - __mod_lruvec_page_state(page, idx, -1); -} - -static inline void __inc_lruvec_slab_state(void *p, enum node_stat_item idx) +static inline void __inc_lruvec_kmem_state(void *p, enum node_stat_item idx) { - __mod_lruvec_slab_state(p, idx, 1); + __mod_lruvec_kmem_state(p, idx, 1); } -static inline void __dec_lruvec_slab_state(void *p, enum node_stat_item idx) +static inline void __dec_lruvec_kmem_state(void *p, enum node_stat_item idx) { - __mod_lruvec_slab_state(p, idx, -1); + __mod_lruvec_kmem_state(p, idx, -1); } /* idx can be of type enum memcg_stat_item or node_stat_item */ @@ -1384,30 +1283,6 @@ static inline void dec_memcg_page_state(struct page *page, mod_memcg_page_state(page, idx, -1); } -static inline void inc_lruvec_state(struct lruvec *lruvec, - enum node_stat_item idx) -{ - mod_lruvec_state(lruvec, idx, 1); -} - -static inline void dec_lruvec_state(struct lruvec *lruvec, - enum node_stat_item idx) -{ - mod_lruvec_state(lruvec, idx, -1); -} - -static inline void inc_lruvec_page_state(struct page *page, - enum node_stat_item idx) -{ - mod_lruvec_page_state(page, idx, 1); -} - -static inline void dec_lruvec_page_state(struct page *page, - enum node_stat_item idx) -{ - mod_lruvec_page_state(page, idx, -1); -} - static inline struct lruvec *parent_lruvec(struct lruvec *lruvec) { struct mem_cgroup *memcg; @@ -1568,9 +1443,8 @@ static inline void memcg_kmem_uncharge(struct mem_cgroup *memcg, } /* - * helper for accessing a memcg's index. It will be used as an index in the - * child cache array in kmem_cache, and also to derive its name. This function - * will return -1 when this is not a kmem-limited memcg. + * A helper for accessing memcg's kmem_id, used for getting + * corresponding LRU lists. */ static inline int memcg_cache_id(struct mem_cgroup *memcg) { diff --git a/include/linux/migrate.h b/include/linux/migrate.h index 0f8d1583fa8e..4594838a0f7c 100644 --- a/include/linux/migrate.h +++ b/include/linux/migrate.h @@ -45,8 +45,8 @@ extern struct page *alloc_migration_target(struct page *page, unsigned long priv extern int isolate_movable_page(struct page *page, isolate_mode_t mode); extern void putback_movable_page(struct page *page); -extern int migrate_prep(void); -extern int migrate_prep_local(void); +extern void migrate_prep(void); +extern void migrate_prep_local(void); extern void migrate_page_states(struct page *newpage, struct page *page); extern void migrate_page_copy(struct page *newpage, struct page *page); extern int migrate_huge_page_move_mapping(struct address_space *mapping, diff --git a/include/linux/mm.h b/include/linux/mm.h index 1813fa86b981..e189509323f8 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -557,8 +557,9 @@ enum page_entry_size { struct vm_operations_struct { void (*open)(struct vm_area_struct * area); void (*close)(struct vm_area_struct * area); - int (*split)(struct vm_area_struct * area, unsigned long addr); - int (*mremap)(struct vm_area_struct * area); + /* Called any time before splitting to check if it's allowed */ + int (*may_split)(struct vm_area_struct *area, unsigned long addr); + int (*mremap)(struct vm_area_struct *area, unsigned long flags); /* * Called by mprotect() to make driver-specific permission * checks before mprotect() is finalised. The VMA must not @@ -1723,8 +1724,8 @@ extern int access_process_vm(struct task_struct *tsk, unsigned long addr, void *buf, int len, unsigned int gup_flags); extern int access_remote_vm(struct mm_struct *mm, unsigned long addr, void *buf, int len, unsigned int gup_flags); -extern int __access_remote_vm(struct task_struct *tsk, struct mm_struct *mm, - unsigned long addr, void *buf, int len, unsigned int gup_flags); +extern int __access_remote_vm(struct mm_struct *mm, unsigned long addr, + void *buf, int len, unsigned int gup_flags); long get_user_pages_remote(struct mm_struct *mm, unsigned long start, unsigned long nr_pages, @@ -2210,7 +2211,7 @@ static inline bool pgtable_pte_page_ctor(struct page *page) if (!ptlock_init(page)) return false; __SetPageTable(page); - inc_zone_page_state(page, NR_PAGETABLE); + inc_lruvec_page_state(page, NR_PAGETABLE); return true; } @@ -2218,7 +2219,7 @@ static inline void pgtable_pte_page_dtor(struct page *page) { ptlock_free(page); __ClearPageTable(page); - dec_zone_page_state(page, NR_PAGETABLE); + dec_lruvec_page_state(page, NR_PAGETABLE); } #define pte_offset_map_lock(mm, pmd, address, ptlp) \ @@ -2305,7 +2306,7 @@ static inline bool pgtable_pmd_page_ctor(struct page *page) if (!pmd_ptlock_init(page)) return false; __SetPageTable(page); - inc_zone_page_state(page, NR_PAGETABLE); + inc_lruvec_page_state(page, NR_PAGETABLE); return true; } @@ -2313,7 +2314,7 @@ static inline void pgtable_pmd_page_dtor(struct page *page) { pmd_ptlock_free(page); __ClearPageTable(page); - dec_zone_page_state(page, NR_PAGETABLE); + dec_lruvec_page_state(page, NR_PAGETABLE); } /* @@ -2440,9 +2441,6 @@ static inline int early_pfn_to_nid(unsigned long pfn) #else /* please see mm/page_alloc.c */ extern int __meminit early_pfn_to_nid(unsigned long pfn); -/* there is a per-arch backend function. */ -extern int __meminit __early_pfn_to_nid(unsigned long pfn, - struct mminit_pfnnid_cache *state); #endif extern void set_dma_reserve(unsigned long new_dma_reserve); @@ -2881,44 +2879,56 @@ extern int apply_to_existing_page_range(struct mm_struct *mm, unsigned long address, unsigned long size, pte_fn_t fn, void *data); +extern void init_mem_debugging_and_hardening(void); #ifdef CONFIG_PAGE_POISONING -extern bool page_poisoning_enabled(void); -extern void kernel_poison_pages(struct page *page, int numpages, int enable); +extern void __kernel_poison_pages(struct page *page, int numpages); +extern void __kernel_unpoison_pages(struct page *page, int numpages); +extern bool _page_poisoning_enabled_early; +DECLARE_STATIC_KEY_FALSE(_page_poisoning_enabled); +static inline bool page_poisoning_enabled(void) +{ + return _page_poisoning_enabled_early; +} +/* + * For use in fast paths after init_mem_debugging() has run, or when a + * false negative result is not harmful when called too early. + */ +static inline bool page_poisoning_enabled_static(void) +{ + return static_branch_unlikely(&_page_poisoning_enabled); +} +static inline void kernel_poison_pages(struct page *page, int numpages) +{ + if (page_poisoning_enabled_static()) + __kernel_poison_pages(page, numpages); +} +static inline void kernel_unpoison_pages(struct page *page, int numpages) +{ + if (page_poisoning_enabled_static()) + __kernel_unpoison_pages(page, numpages); +} #else static inline bool page_poisoning_enabled(void) { return false; } -static inline void kernel_poison_pages(struct page *page, int numpages, - int enable) { } +static inline bool page_poisoning_enabled_static(void) { return false; } +static inline void __kernel_poison_pages(struct page *page, int nunmpages) { } +static inline void kernel_poison_pages(struct page *page, int numpages) { } +static inline void kernel_unpoison_pages(struct page *page, int numpages) { } #endif -#ifdef CONFIG_INIT_ON_ALLOC_DEFAULT_ON -DECLARE_STATIC_KEY_TRUE(init_on_alloc); -#else DECLARE_STATIC_KEY_FALSE(init_on_alloc); -#endif static inline bool want_init_on_alloc(gfp_t flags) { - if (static_branch_unlikely(&init_on_alloc) && - !page_poisoning_enabled()) + if (static_branch_unlikely(&init_on_alloc)) return true; return flags & __GFP_ZERO; } -#ifdef CONFIG_INIT_ON_FREE_DEFAULT_ON -DECLARE_STATIC_KEY_TRUE(init_on_free); -#else DECLARE_STATIC_KEY_FALSE(init_on_free); -#endif static inline bool want_init_on_free(void) { - return static_branch_unlikely(&init_on_free) && - !page_poisoning_enabled(); + return static_branch_unlikely(&init_on_free); } -#ifdef CONFIG_DEBUG_PAGEALLOC -extern void init_debug_pagealloc(void); -#else -static inline void init_debug_pagealloc(void) {} -#endif extern bool _debug_pagealloc_enabled_early; DECLARE_STATIC_KEY_FALSE(_debug_pagealloc_enabled); @@ -2940,28 +2950,28 @@ static inline bool debug_pagealloc_enabled_static(void) return static_branch_unlikely(&_debug_pagealloc_enabled); } -#if defined(CONFIG_DEBUG_PAGEALLOC) || defined(CONFIG_ARCH_HAS_SET_DIRECT_MAP) -extern void __kernel_map_pages(struct page *page, int numpages, int enable); - +#ifdef CONFIG_DEBUG_PAGEALLOC /* - * When called in DEBUG_PAGEALLOC context, the call should most likely be - * guarded by debug_pagealloc_enabled() or debug_pagealloc_enabled_static() + * To support DEBUG_PAGEALLOC architecture must ensure that + * __kernel_map_pages() never fails */ -static inline void -kernel_map_pages(struct page *page, int numpages, int enable) -{ - __kernel_map_pages(page, numpages, enable); -} -#ifdef CONFIG_HIBERNATION -extern bool kernel_page_present(struct page *page); -#endif /* CONFIG_HIBERNATION */ -#else /* CONFIG_DEBUG_PAGEALLOC || CONFIG_ARCH_HAS_SET_DIRECT_MAP */ -static inline void -kernel_map_pages(struct page *page, int numpages, int enable) {} -#ifdef CONFIG_HIBERNATION -static inline bool kernel_page_present(struct page *page) { return true; } -#endif /* CONFIG_HIBERNATION */ -#endif /* CONFIG_DEBUG_PAGEALLOC || CONFIG_ARCH_HAS_SET_DIRECT_MAP */ +extern void __kernel_map_pages(struct page *page, int numpages, int enable); + +static inline void debug_pagealloc_map_pages(struct page *page, int numpages) +{ + if (debug_pagealloc_enabled_static()) + __kernel_map_pages(page, numpages, 1); +} + +static inline void debug_pagealloc_unmap_pages(struct page *page, int numpages) +{ + if (debug_pagealloc_enabled_static()) + __kernel_map_pages(page, numpages, 0); +} +#else /* CONFIG_DEBUG_PAGEALLOC */ +static inline void debug_pagealloc_map_pages(struct page *page, int numpages) {} +static inline void debug_pagealloc_unmap_pages(struct page *page, int numpages) {} +#endif /* CONFIG_DEBUG_PAGEALLOC */ #ifdef __HAVE_ARCH_GATE_AREA extern struct vm_area_struct *get_gate_vma(struct mm_struct *mm); diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 5a9238f6caad..915f4f100383 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -14,6 +14,7 @@ #include <linux/uprobes.h> #include <linux/page-flags-layout.h> #include <linux/workqueue.h> +#include <linux/seqlock.h> #include <asm/mmu.h> @@ -446,6 +447,13 @@ struct mm_struct { */ atomic_t has_pinned; + /** + * @write_protect_seq: Locked when any thread is write + * protecting pages mapped by this mm to enforce a later COW, + * for instance during page table copying for fork(). + */ + seqcount_t write_protect_seq; + #ifdef CONFIG_MMU atomic_long_t pgtables_bytes; /* PTE page table pages */ #endif diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h index 18e7eae9b5ba..0540f0156f58 100644 --- a/include/linux/mmap_lock.h +++ b/include/linux/mmap_lock.h @@ -1,11 +1,65 @@ #ifndef _LINUX_MMAP_LOCK_H #define _LINUX_MMAP_LOCK_H +#include <linux/lockdep.h> +#include <linux/mm_types.h> #include <linux/mmdebug.h> +#include <linux/rwsem.h> +#include <linux/tracepoint-defs.h> +#include <linux/types.h> #define MMAP_LOCK_INITIALIZER(name) \ .mmap_lock = __RWSEM_INITIALIZER((name).mmap_lock), +DECLARE_TRACEPOINT(mmap_lock_start_locking); +DECLARE_TRACEPOINT(mmap_lock_acquire_returned); +DECLARE_TRACEPOINT(mmap_lock_released); + +#ifdef CONFIG_TRACING + +void __mmap_lock_do_trace_start_locking(struct mm_struct *mm, bool write); +void __mmap_lock_do_trace_acquire_returned(struct mm_struct *mm, bool write, + bool success); +void __mmap_lock_do_trace_released(struct mm_struct *mm, bool write); + +static inline void __mmap_lock_trace_start_locking(struct mm_struct *mm, + bool write) +{ + if (tracepoint_enabled(mmap_lock_start_locking)) + __mmap_lock_do_trace_start_locking(mm, write); +} + +static inline void __mmap_lock_trace_acquire_returned(struct mm_struct *mm, + bool write, bool success) +{ + if (tracepoint_enabled(mmap_lock_acquire_returned)) + __mmap_lock_do_trace_acquire_returned(mm, write, success); +} + +static inline void __mmap_lock_trace_released(struct mm_struct *mm, bool write) +{ + if (tracepoint_enabled(mmap_lock_released)) + __mmap_lock_do_trace_released(mm, write); +} + +#else /* !CONFIG_TRACING */ + +static inline void __mmap_lock_trace_start_locking(struct mm_struct *mm, + bool write) +{ +} + +static inline void __mmap_lock_trace_acquire_returned(struct mm_struct *mm, + bool write, bool success) +{ +} + +static inline void __mmap_lock_trace_released(struct mm_struct *mm, bool write) +{ +} + +#endif /* CONFIG_TRACING */ + static inline void mmap_init_lock(struct mm_struct *mm) { init_rwsem(&mm->mmap_lock); @@ -13,57 +67,86 @@ static inline void mmap_init_lock(struct mm_struct *mm) static inline void mmap_write_lock(struct mm_struct *mm) { + __mmap_lock_trace_start_locking(mm, true); down_write(&mm->mmap_lock); + __mmap_lock_trace_acquire_returned(mm, true, true); } static inline void mmap_write_lock_nested(struct mm_struct *mm, int subclass) { + __mmap_lock_trace_start_locking(mm, true); down_write_nested(&mm->mmap_lock, subclass); + __mmap_lock_trace_acquire_returned(mm, true, true); } static inline int mmap_write_lock_killable(struct mm_struct *mm) { - return down_write_killable(&mm->mmap_lock); + int ret; + + __mmap_lock_trace_start_locking(mm, true); + ret = down_write_killable(&mm->mmap_lock); + __mmap_lock_trace_acquire_returned(mm, true, ret == 0); + return ret; } static inline bool mmap_write_trylock(struct mm_struct *mm) { - return down_write_trylock(&mm->mmap_lock) != 0; + bool ret; + + __mmap_lock_trace_start_locking(mm, true); + ret = down_write_trylock(&mm->mmap_lock) != 0; + __mmap_lock_trace_acquire_returned(mm, true, ret); + return ret; } static inline void mmap_write_unlock(struct mm_struct *mm) { up_write(&mm->mmap_lock); + __mmap_lock_trace_released(mm, true); } static inline void mmap_write_downgrade(struct mm_struct *mm) { downgrade_write(&mm->mmap_lock); + __mmap_lock_trace_acquire_returned(mm, false, true); } static inline void mmap_read_lock(struct mm_struct *mm) { + __mmap_lock_trace_start_locking(mm, false); down_read(&mm->mmap_lock); + __mmap_lock_trace_acquire_returned(mm, false, true); } static inline int mmap_read_lock_killable(struct mm_struct *mm) { - return down_read_killable(&mm->mmap_lock); + int ret; + + __mmap_lock_trace_start_locking(mm, false); + ret = down_read_killable(&mm->mmap_lock); + __mmap_lock_trace_acquire_returned(mm, false, ret == 0); + return ret; } static inline bool mmap_read_trylock(struct mm_struct *mm) { - return down_read_trylock(&mm->mmap_lock) != 0; + bool ret; + + __mmap_lock_trace_start_locking(mm, false); + ret = down_read_trylock(&mm->mmap_lock) != 0; + __mmap_lock_trace_acquire_returned(mm, false, ret); + return ret; } static inline void mmap_read_unlock(struct mm_struct *mm) { up_read(&mm->mmap_lock); + __mmap_lock_trace_released(mm, false); } static inline bool mmap_read_trylock_non_owner(struct mm_struct *mm) { - if (down_read_trylock(&mm->mmap_lock)) { + if (mmap_read_trylock(mm)) { rwsem_release(&mm->mmap_lock.dep_map, _RET_IP_); return true; } @@ -73,6 +156,7 @@ static inline bool mmap_read_trylock_non_owner(struct mm_struct *mm) static inline void mmap_read_unlock_non_owner(struct mm_struct *mm) { up_read_non_owner(&mm->mmap_lock); + __mmap_lock_trace_released(mm, false); } static inline void mmap_assert_locked(struct mm_struct *mm) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 9d0c454d23cd..98a80c01d150 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -152,7 +152,6 @@ enum zone_stat_item { NR_ZONE_UNEVICTABLE, NR_ZONE_WRITE_PENDING, /* Count of dirty, writeback and unstable pages */ NR_MLOCK, /* mlock()ed pages found and moved off LRU */ - NR_PAGETABLE, /* used for pagetables */ /* Second 128 byte cacheline */ NR_BOUNCE, #if IS_ENABLED(CONFIG_ZSMALLOC) @@ -207,6 +206,7 @@ enum node_stat_item { #if IS_ENABLED(CONFIG_SHADOW_CALL_STACK) NR_KERNEL_SCS_KB, /* measured in KiB */ #endif + NR_PAGETABLE, /* used for pagetables */ NR_VM_NODE_STAT_ITEMS }; @@ -450,6 +450,12 @@ struct zone { #endif struct pglist_data *zone_pgdat; struct per_cpu_pageset __percpu *pageset; + /* + * the high and batch values are copied to individual pagesets for + * faster access + */ + int pageset_high; + int pageset_batch; #ifndef CONFIG_SPARSEMEM /* @@ -1409,17 +1415,6 @@ void sparse_init(void); #endif /* CONFIG_SPARSEMEM */ /* - * During memory init memblocks map pfns to nids. The search is expensive and - * this caches recent lookups. The implementation of __early_pfn_to_nid - * may treat start/end as pfns or sections. - */ -struct mminit_pfnnid_cache { - unsigned long last_start; - unsigned long last_end; - int last_nid; -}; - -/* * If it is possible to have holes within a MAX_ORDER_NR_PAGES, then we * need to check pfn validity within that MAX_ORDER_NR_PAGES block. * pfn_valid_within() should be used in this case; we optimise this away @@ -1431,37 +1426,6 @@ struct mminit_pfnnid_cache { #define pfn_valid_within(pfn) (1) #endif -#ifdef CONFIG_ARCH_HAS_HOLES_MEMORYMODEL -/* - * pfn_valid() is meant to be able to tell if a given PFN has valid memmap - * associated with it or not. This means that a struct page exists for this - * pfn. The caller cannot assume the page is fully initialized in general. - * Hotplugable pages might not have been onlined yet. pfn_to_online_page() - * will ensure the struct page is fully online and initialized. Special pages - * (e.g. ZONE_DEVICE) are never onlined and should be treated accordingly. - * - * In FLATMEM, it is expected that holes always have valid memmap as long as - * there is valid PFNs either side of the hole. In SPARSEMEM, it is assumed - * that a valid section has a memmap for the entire section. - * - * However, an ARM, and maybe other embedded architectures in the future - * free memmap backing holes to save memory on the assumption the memmap is - * never used. The page_zone linkages are then broken even though pfn_valid() - * returns true. A walker of the full memmap must then do this additional - * check to ensure the memmap they are looking at is sane by making sure - * the zone and PFN linkages are still valid. This is expensive, but walkers - * of the full memmap are extremely rare. - */ -bool memmap_valid_within(unsigned long pfn, - struct page *page, struct zone *zone); -#else -static inline bool memmap_valid_within(unsigned long pfn, - struct page *page, struct zone *zone) -{ - return true; -} -#endif /* CONFIG_ARCH_HAS_HOLES_MEMORYMODEL */ - #endif /* !__GENERATING_BOUNDS.H */ #endif /* !__ASSEMBLY__ */ #endif /* _LINUX_MMZONE_H */ diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index 4f6ba9379112..c1368af622c7 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -86,8 +86,7 @@ */ /* - * Don't use the *_dontuse flags. Use the macros. Otherwise you'll break - * locked- and dirty-page accounting. + * Don't use the pageflags directly. Use the PageFoo macros. * * The page flags field is split into two parts, the main flags area * which extends from the low bits upwards, and the fields area which @@ -363,8 +362,7 @@ PAGEFLAG(SwapBacked, swapbacked, PF_NO_TAIL) * for its own purposes. * - PG_private and PG_private_2 cause releasepage() and co to be invoked */ -PAGEFLAG(Private, private, PF_ANY) __SETPAGEFLAG(Private, private, PF_ANY) - __CLEARPAGEFLAG(Private, private, PF_ANY) +PAGEFLAG(Private, private, PF_ANY) PAGEFLAG(Private2, private_2, PF_ANY) TESTSCFLAG(Private2, private_2, PF_ANY) PAGEFLAG(OwnerPriv1, owner_priv_1, PF_ANY) TESTCLEARFLAG(OwnerPriv1, owner_priv_1, PF_ANY) diff --git a/include/linux/page_ext.h b/include/linux/page_ext.h index cfce186f0c4e..aff81ba31bd8 100644 --- a/include/linux/page_ext.h +++ b/include/linux/page_ext.h @@ -44,8 +44,12 @@ static inline void page_ext_init_flatmem(void) { } extern void page_ext_init(void); +static inline void page_ext_init_flatmem_late(void) +{ +} #else extern void page_ext_init_flatmem(void); +extern void page_ext_init_flatmem_late(void); static inline void page_ext_init(void) { } @@ -76,6 +80,10 @@ static inline void page_ext_init(void) { } +static inline void page_ext_init_flatmem_late(void) +{ +} + static inline void page_ext_init_flatmem(void) { } diff --git a/include/linux/pagevec.h b/include/linux/pagevec.h index 081d934eda64..ad4ddc17d403 100644 --- a/include/linux/pagevec.h +++ b/include/linux/pagevec.h @@ -43,9 +43,6 @@ static inline unsigned pagevec_lookup(struct pagevec *pvec, unsigned pagevec_lookup_range_tag(struct pagevec *pvec, struct address_space *mapping, pgoff_t *index, pgoff_t end, xa_mark_t tag); -unsigned pagevec_lookup_range_nr_tag(struct pagevec *pvec, - struct address_space *mapping, pgoff_t *index, pgoff_t end, - xa_mark_t tag, unsigned max_pages); static inline unsigned pagevec_lookup_tag(struct pagevec *pvec, struct address_space *mapping, pgoff_t *index, xa_mark_t tag) { diff --git a/include/linux/poison.h b/include/linux/poison.h index dc8ae5d8db03..aff1c9250c82 100644 --- a/include/linux/poison.h +++ b/include/linux/poison.h @@ -27,11 +27,7 @@ #define TIMER_ENTRY_STATIC ((void *) 0x300 + POISON_POINTER_DELTA) /********** mm/page_poison.c **********/ -#ifdef CONFIG_PAGE_POISONING_ZERO -#define PAGE_POISON 0x00 -#else #define PAGE_POISON 0xaa -#endif /********** mm/page_alloc.c ************/ diff --git a/include/linux/rmap.h b/include/linux/rmap.h index 3a6adfa70fb0..70085ca1a3fc 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -91,7 +91,6 @@ enum ttu_flags { TTU_SPLIT_HUGE_PMD = 0x4, /* split huge PMD if any */ TTU_IGNORE_MLOCK = 0x8, /* ignore mlock */ - TTU_IGNORE_ACCESS = 0x10, /* don't age */ TTU_IGNORE_HWPOISON = 0x20, /* corrupted page is recoverable */ TTU_BATCH_FLUSH = 0x40, /* Batch TLB flushes where possible * and caller guarantees they will diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h index a91fb3ad9ec7..1ae08b8462a4 100644 --- a/include/linux/sched/mm.h +++ b/include/linux/sched/mm.h @@ -181,6 +181,22 @@ static inline void fs_reclaim_release(gfp_t gfp_mask) { } #endif /** + * might_alloc - Mark possible allocation sites + * @gfp_mask: gfp_t flags that would be used to allocate + * + * Similar to might_sleep() and other annotations, this can be used in functions + * that might allocate, but often don't. Compiles to nothing without + * CONFIG_LOCKDEP. Includes a conditional might_sleep() if @gfp allows blocking. + */ +static inline void might_alloc(gfp_t gfp_mask) +{ + fs_reclaim_acquire(gfp_mask); + fs_reclaim_release(gfp_mask); + + might_sleep_if(gfpflags_allow_blocking(gfp_mask)); +} + +/** * memalloc_noio_save - Marks implicit GFP_NOIO allocation scope. * * This functions marks the beginning of the GFP_NOIO allocation scope. diff --git a/include/linux/set_memory.h b/include/linux/set_memory.h index 860e0f843c12..fe1aa4e54680 100644 --- a/include/linux/set_memory.h +++ b/include/linux/set_memory.h @@ -23,6 +23,11 @@ static inline int set_direct_map_default_noflush(struct page *page) { return 0; } + +static inline bool kernel_page_present(struct page *page) +{ + return true; +} #endif #ifndef set_mce_nospec diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h index a5a5d1d4d7b1..d82b6f396588 100644 --- a/include/linux/shmem_fs.h +++ b/include/linux/shmem_fs.h @@ -67,7 +67,11 @@ extern unsigned long shmem_get_unmapped_area(struct file *, unsigned long addr, unsigned long len, unsigned long pgoff, unsigned long flags); extern int shmem_lock(struct file *file, int lock, struct user_struct *user); #ifdef CONFIG_SHMEM -extern bool shmem_mapping(struct address_space *mapping); +extern const struct address_space_operations shmem_aops; +static inline bool shmem_mapping(struct address_space *mapping) +{ + return mapping->a_ops == &shmem_aops; +} #else static inline bool shmem_mapping(struct address_space *mapping) { diff --git a/include/linux/slab.h b/include/linux/slab.h index dd6897f62010..be4ba5867ac5 100644 --- a/include/linux/slab.h +++ b/include/linux/slab.h @@ -593,6 +593,24 @@ static inline void *kmalloc_array(size_t n, size_t size, gfp_t flags) } /** + * krealloc_array - reallocate memory for an array. + * @p: pointer to the memory chunk to reallocate + * @new_n: new number of elements to alloc + * @new_size: new size of a single member of the array + * @flags: the type of memory to allocate (see kmalloc) + */ +static __must_check inline void * +krealloc_array(void *p, size_t new_n, size_t new_size, gfp_t flags) +{ + size_t bytes; + + if (unlikely(check_mul_overflow(new_n, new_size, &bytes))) + return NULL; + + return krealloc(p, bytes, flags); +} + +/** * kcalloc - allocate memory for an array. The memory is set to zero. * @n: number of elements. * @size: element size. diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h index 938eaf9517e2..80c0181c411d 100644 --- a/include/linux/vmalloc.h +++ b/include/linux/vmalloc.h @@ -72,16 +72,14 @@ struct vmap_area { struct list_head list; /* address sorted list */ /* - * The following three variables can be packed, because - * a vmap_area object is always one of the three states: + * The following two variables can be packed, because + * a vmap_area object can be either: * 1) in "free" tree (root is vmap_area_root) - * 2) in "busy" tree (root is free_vmap_area_root) - * 3) in purge list (head is vmap_purge_list) + * 2) or "busy" tree (root is free_vmap_area_root) */ union { unsigned long subtree_max_size; /* in "free" tree */ struct vm_struct *vm; /* in "busy" tree */ - struct llist_node purge_list; /* in purge list */ }; }; diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h index 322dcbfcc933..773135fc6e19 100644 --- a/include/linux/vmstat.h +++ b/include/linux/vmstat.h @@ -450,4 +450,108 @@ static inline const char *vm_event_name(enum vm_event_item item) } #endif /* CONFIG_VM_EVENT_COUNTERS || CONFIG_MEMCG */ +#ifdef CONFIG_MEMCG + +void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx, + int val); + +static inline void mod_lruvec_state(struct lruvec *lruvec, + enum node_stat_item idx, int val) +{ + unsigned long flags; + + local_irq_save(flags); + __mod_lruvec_state(lruvec, idx, val); + local_irq_restore(flags); +} + +void __mod_lruvec_page_state(struct page *page, + enum node_stat_item idx, int val); + +static inline void mod_lruvec_page_state(struct page *page, + enum node_stat_item idx, int val) +{ + unsigned long flags; + + local_irq_save(flags); + __mod_lruvec_page_state(page, idx, val); + local_irq_restore(flags); +} + +#else + +static inline void __mod_lruvec_state(struct lruvec *lruvec, + enum node_stat_item idx, int val) +{ + __mod_node_page_state(lruvec_pgdat(lruvec), idx, val); +} + +static inline void mod_lruvec_state(struct lruvec *lruvec, + enum node_stat_item idx, int val) +{ + mod_node_page_state(lruvec_pgdat(lruvec), idx, val); +} + +static inline void __mod_lruvec_page_state(struct page *page, + enum node_stat_item idx, int val) +{ + __mod_node_page_state(page_pgdat(page), idx, val); +} + +static inline void mod_lruvec_page_state(struct page *page, + enum node_stat_item idx, int val) +{ + mod_node_page_state(page_pgdat(page), idx, val); +} + +#endif /* CONFIG_MEMCG */ + +static inline void __inc_lruvec_state(struct lruvec *lruvec, + enum node_stat_item idx) +{ + __mod_lruvec_state(lruvec, idx, 1); +} + +static inline void __dec_lruvec_state(struct lruvec *lruvec, + enum node_stat_item idx) +{ + __mod_lruvec_state(lruvec, idx, -1); +} + +static inline void __inc_lruvec_page_state(struct page *page, + enum node_stat_item idx) +{ + __mod_lruvec_page_state(page, idx, 1); +} + +static inline void __dec_lruvec_page_state(struct page *page, + enum node_stat_item idx) +{ + __mod_lruvec_page_state(page, idx, -1); +} + +static inline void inc_lruvec_state(struct lruvec *lruvec, + enum node_stat_item idx) +{ + mod_lruvec_state(lruvec, idx, 1); +} + +static inline void dec_lruvec_state(struct lruvec *lruvec, + enum node_stat_item idx) +{ + mod_lruvec_state(lruvec, idx, -1); +} + +static inline void inc_lruvec_page_state(struct page *page, + enum node_stat_item idx) +{ + mod_lruvec_page_state(page, idx, 1); +} + +static inline void dec_lruvec_page_state(struct page *page, + enum node_stat_item idx) +{ + mod_lruvec_page_state(page, idx, -1); +} + #endif /* _LINUX_VMSTAT_H */ diff --git a/include/trace/events/mmap_lock.h b/include/trace/events/mmap_lock.h new file mode 100644 index 000000000000..0abff67b96f0 --- /dev/null +++ b/include/trace/events/mmap_lock.h @@ -0,0 +1,107 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#undef TRACE_SYSTEM +#define TRACE_SYSTEM mmap_lock + +#if !defined(_TRACE_MMAP_LOCK_H) || defined(TRACE_HEADER_MULTI_READ) +#define _TRACE_MMAP_LOCK_H + +#include <linux/tracepoint.h> +#include <linux/types.h> + +struct mm_struct; + +extern int trace_mmap_lock_reg(void); +extern void trace_mmap_lock_unreg(void); + +TRACE_EVENT_FN(mmap_lock_start_locking, + + TP_PROTO(struct mm_struct *mm, const char *memcg_path, bool write), + + TP_ARGS(mm, memcg_path, write), + + TP_STRUCT__entry( + __field(struct mm_struct *, mm) + __string(memcg_path, memcg_path) + __field(bool, write) + ), + + TP_fast_assign( + __entry->mm = mm; + __assign_str(memcg_path, memcg_path); + __entry->write = write; + ), + + TP_printk( + "mm=%p memcg_path=%s write=%s\n", + __entry->mm, + __get_str(memcg_path), + __entry->write ? "true" : "false" + ), + + trace_mmap_lock_reg, trace_mmap_lock_unreg +); + +TRACE_EVENT_FN(mmap_lock_acquire_returned, + + TP_PROTO(struct mm_struct *mm, const char *memcg_path, bool write, + bool success), + + TP_ARGS(mm, memcg_path, write, success), + + TP_STRUCT__entry( + __field(struct mm_struct *, mm) + __string(memcg_path, memcg_path) + __field(bool, write) + __field(bool, success) + ), + + TP_fast_assign( + __entry->mm = mm; + __assign_str(memcg_path, memcg_path); + __entry->write = write; + __entry->success = success; + ), + + TP_printk( + "mm=%p memcg_path=%s write=%s success=%s\n", + __entry->mm, + __get_str(memcg_path), + __entry->write ? "true" : "false", + __entry->success ? "true" : "false" + ), + + trace_mmap_lock_reg, trace_mmap_lock_unreg +); + +TRACE_EVENT_FN(mmap_lock_released, + + TP_PROTO(struct mm_struct *mm, const char *memcg_path, bool write), + + TP_ARGS(mm, memcg_path, write), + + TP_STRUCT__entry( + __field(struct mm_struct *, mm) + __string(memcg_path, memcg_path) + __field(bool, write) + ), + + TP_fast_assign( + __entry->mm = mm; + __assign_str(memcg_path, memcg_path); + __entry->write = write; + ), + + TP_printk( + "mm=%p memcg_path=%s write=%s\n", + __entry->mm, + __get_str(memcg_path), + __entry->write ? "true" : "false" + ), + + trace_mmap_lock_reg, trace_mmap_lock_unreg +); + +#endif /* _TRACE_MMAP_LOCK_H */ + +/* This part must be outside protection */ +#include <trace/define_trace.h> diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h index c96a4337afe6..5039af667645 100644 --- a/include/trace/events/sched.h +++ b/include/trace/events/sched.h @@ -5,6 +5,7 @@ #if !defined(_TRACE_SCHED_H) || defined(TRACE_HEADER_MULTI_READ) #define _TRACE_SCHED_H +#include <linux/kthread.h> #include <linux/sched/numa_balancing.h> #include <linux/tracepoint.h> #include <linux/binfmts.h> @@ -51,6 +52,89 @@ TRACE_EVENT(sched_kthread_stop_ret, TP_printk("ret=%d", __entry->ret) ); +/** + * sched_kthread_work_queue_work - called when a work gets queued + * @worker: pointer to the kthread_worker + * @work: pointer to struct kthread_work + * + * This event occurs when a work is queued immediately or once a + * delayed work is actually queued (ie: once the delay has been + * reached). + */ +TRACE_EVENT(sched_kthread_work_queue_work, + + TP_PROTO(struct kthread_worker *worker, + struct kthread_work *work), + + TP_ARGS(worker, work), + + TP_STRUCT__entry( + __field( void *, work ) + __field( void *, function) + __field( void *, worker) + ), + + TP_fast_assign( + __entry->work = work; + __entry->function = work->func; + __entry->worker = worker; + ), + + TP_printk("work struct=%p function=%ps worker=%p", + __entry->work, __entry->function, __entry->worker) +); + +/** + * sched_kthread_work_execute_start - called immediately before the work callback + * @work: pointer to struct kthread_work + * + * Allows to track kthread work execution. + */ +TRACE_EVENT(sched_kthread_work_execute_start, + + TP_PROTO(struct kthread_work *work), + + TP_ARGS(work), + + TP_STRUCT__entry( + __field( void *, work ) + __field( void *, function) + ), + + TP_fast_assign( + __entry->work = work; + __entry->function = work->func; + ), + + TP_printk("work struct %p: function %ps", __entry->work, __entry->function) +); + +/** + * sched_kthread_work_execute_end - called immediately after the work callback + * @work: pointer to struct work_struct + * @function: pointer to worker function + * + * Allows to track workqueue execution. + */ +TRACE_EVENT(sched_kthread_work_execute_end, + + TP_PROTO(struct kthread_work *work, kthread_work_func_t function), + + TP_ARGS(work, function), + + TP_STRUCT__entry( + __field( void *, work ) + __field( void *, function) + ), + + TP_fast_assign( + __entry->work = work; + __entry->function = function; + ), + + TP_printk("work struct %p: function %ps", __entry->work, __entry->function) +); + /* * Tracepoint for waking up a task: */ diff --git a/include/uapi/linux/const.h b/include/uapi/linux/const.h index 5ed721ad5b19..af2a44c08683 100644 --- a/include/uapi/linux/const.h +++ b/include/uapi/linux/const.h @@ -28,4 +28,9 @@ #define _BITUL(x) (_UL(1) << (x)) #define _BITULL(x) (_ULL(1) << (x)) +#define __ALIGN_KERNEL(x, a) __ALIGN_KERNEL_MASK(x, (typeof(x))(a) - 1) +#define __ALIGN_KERNEL_MASK(x, mask) (((x) + (mask)) & ~(mask)) + +#define __KERNEL_DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d)) + #endif /* _UAPI_LINUX_CONST_H */ diff --git a/include/uapi/linux/ethtool.h b/include/uapi/linux/ethtool.h index 9ca87bc73c44..cde753bb2093 100644 --- a/include/uapi/linux/ethtool.h +++ b/include/uapi/linux/ethtool.h @@ -14,7 +14,7 @@ #ifndef _UAPI_LINUX_ETHTOOL_H #define _UAPI_LINUX_ETHTOOL_H -#include <linux/kernel.h> +#include <linux/const.h> #include <linux/types.h> #include <linux/if_ether.h> diff --git a/include/uapi/linux/kernel.h b/include/uapi/linux/kernel.h index 0ff8f7477847..fadf2db71fe8 100644 --- a/include/uapi/linux/kernel.h +++ b/include/uapi/linux/kernel.h @@ -3,13 +3,6 @@ #define _UAPI_LINUX_KERNEL_H #include <linux/sysinfo.h> - -/* - * 'kernel.h' contains some often-used function prototypes etc - */ -#define __ALIGN_KERNEL(x, a) __ALIGN_KERNEL_MASK(x, (typeof(x))(a) - 1) -#define __ALIGN_KERNEL_MASK(x, mask) (((x) + (mask)) & ~(mask)) - -#define __KERNEL_DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d)) +#include <linux/const.h> #endif /* _UAPI_LINUX_KERNEL_H */ diff --git a/include/uapi/linux/lightnvm.h b/include/uapi/linux/lightnvm.h index f9a1be7fc696..ead2e72e5c88 100644 --- a/include/uapi/linux/lightnvm.h +++ b/include/uapi/linux/lightnvm.h @@ -21,7 +21,7 @@ #define _UAPI_LINUX_LIGHTNVM_H #ifdef __KERNEL__ -#include <linux/kernel.h> +#include <linux/const.h> #include <linux/ioctl.h> #else /* __KERNEL__ */ #include <stdio.h> diff --git a/include/uapi/linux/mroute6.h b/include/uapi/linux/mroute6.h index c36177a86516..a1fd6173e2db 100644 --- a/include/uapi/linux/mroute6.h +++ b/include/uapi/linux/mroute6.h @@ -2,7 +2,7 @@ #ifndef _UAPI__LINUX_MROUTE6_H #define _UAPI__LINUX_MROUTE6_H -#include <linux/kernel.h> +#include <linux/const.h> #include <linux/types.h> #include <linux/sockios.h> #include <linux/in6.h> /* For struct sockaddr_in6. */ diff --git a/include/uapi/linux/netfilter/x_tables.h b/include/uapi/linux/netfilter/x_tables.h index a8283f7dbc51..b8c6bb233ac1 100644 --- a/include/uapi/linux/netfilter/x_tables.h +++ b/include/uapi/linux/netfilter/x_tables.h @@ -1,7 +1,7 @@ /* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ #ifndef _UAPI_X_TABLES_H #define _UAPI_X_TABLES_H -#include <linux/kernel.h> +#include <linux/const.h> #include <linux/types.h> #define XT_FUNCTION_MAXNAMELEN 30 diff --git a/include/uapi/linux/netlink.h b/include/uapi/linux/netlink.h index c3816ff7bfc3..3d94269bbfa8 100644 --- a/include/uapi/linux/netlink.h +++ b/include/uapi/linux/netlink.h @@ -2,7 +2,7 @@ #ifndef _UAPI__LINUX_NETLINK_H #define _UAPI__LINUX_NETLINK_H -#include <linux/kernel.h> +#include <linux/const.h> #include <linux/socket.h> /* for __kernel_sa_family_t */ #include <linux/types.h> diff --git a/include/uapi/linux/sysctl.h b/include/uapi/linux/sysctl.h index 27c1ed2822e6..458179df9b27 100644 --- a/include/uapi/linux/sysctl.h +++ b/include/uapi/linux/sysctl.h @@ -23,7 +23,7 @@ #ifndef _UAPI_LINUX_SYSCTL_H #define _UAPI_LINUX_SYSCTL_H -#include <linux/kernel.h> +#include <linux/const.h> #include <linux/types.h> #include <linux/compiler.h> diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h index e7e98bde221f..5f2d88212f7c 100644 --- a/include/uapi/linux/userfaultfd.h +++ b/include/uapi/linux/userfaultfd.h @@ -257,4 +257,13 @@ struct uffdio_writeprotect { __u64 mode; }; +/* + * Flags for the userfaultfd(2) system call itself. + */ + +/* + * Create a userfaultfd that can handle page faults only in user mode. + */ +#define UFFD_USER_MODE_ONLY 1 + #endif /* _LINUX_USERFAULTFD_H */ diff --git a/init/main.c b/init/main.c index 32b2a8affafd..3024c4db17a9 100644 --- a/init/main.c +++ b/init/main.c @@ -58,7 +58,6 @@ #include <linux/rmap.h> #include <linux/mempolicy.h> #include <linux/key.h> -#include <linux/buffer_head.h> #include <linux/page_ext.h> #include <linux/debug_locks.h> #include <linux/debugobjects.h> @@ -825,9 +824,11 @@ static void __init mm_init(void) * bigger than MAX_ORDER unless SPARSEMEM. */ page_ext_init_flatmem(); - init_debug_pagealloc(); + init_mem_debugging_and_hardening(); report_meminit(); mem_init(); + /* page_owner must be initialized after buddy is ready */ + page_ext_init_flatmem_late(); kmem_cache_init(); kmemleak_init(); pgtable_init(); @@ -1034,7 +1035,6 @@ asmlinkage __visible void __init __no_sanitize_address start_kernel(void) fork_init(); proc_caches_init(); uts_ns_init(); - buffer_init(); key_init(); security_init(); dbg_late_init(); diff --git a/ipc/shm.c b/ipc/shm.c index e25c7c6106bc..febd88daba8c 100644 --- a/ipc/shm.c +++ b/ipc/shm.c @@ -434,13 +434,13 @@ static vm_fault_t shm_fault(struct vm_fault *vmf) return sfd->vm_ops->fault(vmf); } -static int shm_split(struct vm_area_struct *vma, unsigned long addr) +static int shm_may_split(struct vm_area_struct *vma, unsigned long addr) { struct file *file = vma->vm_file; struct shm_file_data *sfd = shm_file_data(file); - if (sfd->vm_ops->split) - return sfd->vm_ops->split(vma, addr); + if (sfd->vm_ops->may_split) + return sfd->vm_ops->may_split(vma, addr); return 0; } @@ -582,7 +582,7 @@ static const struct vm_operations_struct shm_vm_ops = { .open = shm_open, /* callback for a new vm-area open */ .close = shm_close, /* callback for when the vm-area is released */ .fault = shm_fault, - .split = shm_split, + .may_split = shm_may_split, .pagesize = shm_pagesize, #if defined(CONFIG_NUMA) .set_policy = shm_set_policy, diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index f2eeff74d713..fefa21981027 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -281,9 +281,6 @@ bool cgroup_ssid_enabled(int ssid) * - cpuset: a task can be moved into an empty cpuset, and again it takes * masks of ancestors. * - * - memcg: use_hierarchy is on by default and the cgroup file for the flag - * is not created. - * * - blkcg: blk-throttle becomes properly hierarchical. * * - debug: disallowed on the default hierarchy. @@ -5152,15 +5149,6 @@ static struct cgroup_subsys_state *css_create(struct cgroup *cgrp, if (err) goto err_list_del; - if (ss->broken_hierarchy && !ss->warned_broken_hierarchy && - cgroup_parent(parent)) { - pr_warn("%s (%d) created nested cgroup for controller \"%s\" which has incomplete hierarchy support. Nested cgroups may change behavior in the future.\n", - current->comm, current->pid, ss->name); - if (!strcmp(ss->name, "memory")) - pr_warn("\"memory\" requires setting use_hierarchy to 1 on the root\n"); - ss->warned_broken_hierarchy = true; - } - return css; err_list_del: diff --git a/kernel/fork.c b/kernel/fork.c index 99c76dab31c1..d61fa86eaaad 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -385,7 +385,7 @@ static void account_kernel_stack(struct task_struct *tsk, int account) mod_lruvec_page_state(vm->pages[0], NR_KERNEL_STACK_KB, account * (THREAD_SIZE / 1024)); else - mod_lruvec_slab_state(stack, NR_KERNEL_STACK_KB, + mod_lruvec_kmem_state(stack, NR_KERNEL_STACK_KB, account * (THREAD_SIZE / 1024)); } @@ -1009,6 +1009,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p, mm->vmacache_seqnum = 0; atomic_set(&mm->mm_users, 1); atomic_set(&mm->mm_count, 1); + seqcount_init(&mm->write_protect_seq); mmap_init_lock(mm); INIT_LIST_HEAD(&mm->mmlist); mm->core_state = NULL; diff --git a/kernel/kthread.c b/kernel/kthread.c index e6aa66551241..a5eceecd4513 100644 --- a/kernel/kthread.c +++ b/kernel/kthread.c @@ -704,8 +704,15 @@ repeat: raw_spin_unlock_irq(&worker->lock); if (work) { + kthread_work_func_t func = work->func; __set_current_state(TASK_RUNNING); + trace_sched_kthread_work_execute_start(work); work->func(work); + /* + * Avoid dereferencing work after this point. The trace + * event only cares about the address. + */ + trace_sched_kthread_work_execute_end(work, func); } else if (!freezing(current)) schedule(); @@ -786,7 +793,25 @@ EXPORT_SYMBOL(kthread_create_worker); * A good practice is to add the cpu number also into the worker name. * For example, use kthread_create_worker_on_cpu(cpu, "helper/%d", cpu). * - * Returns a pointer to the allocated worker on success, ERR_PTR(-ENOMEM) + * CPU hotplug: + * The kthread worker API is simple and generic. It just provides a way + * to create, use, and destroy workers. + * + * It is up to the API user how to handle CPU hotplug. They have to decide + * how to handle pending work items, prevent queuing new ones, and + * restore the functionality when the CPU goes off and on. There are a + * few catches: + * + * - CPU affinity gets lost when it is scheduled on an offline CPU. + * + * - The worker might not exist when the CPU was off when the user + * created the workers. + * + * Good practice is to implement two CPU hotplug callbacks and to + * destroy/create the worker when the CPU goes down/up. + * + * Return: + * The pointer to the allocated worker on success, ERR_PTR(-ENOMEM) * when the needed structures could not get allocated, and ERR_PTR(-EINTR) * when the worker was SIGKILLed. */ @@ -834,6 +859,8 @@ static void kthread_insert_work(struct kthread_worker *worker, { kthread_insert_work_sanity_check(worker, work); + trace_sched_kthread_work_queue_work(worker, work); + list_add_tail(&work->node, pos); work->worker = worker; if (!worker->current_work && likely(worker->task)) diff --git a/kernel/power/hibernate.c b/kernel/power/hibernate.c index 2fc7d509a34f..da0b41914177 100644 --- a/kernel/power/hibernate.c +++ b/kernel/power/hibernate.c @@ -326,7 +326,7 @@ static int create_image(int platform_mode) if (!in_suspend) { events_check_enabled = false; - clear_free_pages(); + clear_or_poison_free_pages(); } platform_leave(platform_mode); diff --git a/kernel/power/power.h b/kernel/power/power.h index 24f12d534515..778bf431ec02 100644 --- a/kernel/power/power.h +++ b/kernel/power/power.h @@ -106,7 +106,7 @@ extern int create_basic_memory_bitmaps(void); extern void free_basic_memory_bitmaps(void); extern int hibernate_preallocate_memory(void); -extern void clear_free_pages(void); +extern void clear_or_poison_free_pages(void); /** * Auxiliary structure used for reading the snapshot image data and diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c index 46b1804c1ddf..d63560e1cf87 100644 --- a/kernel/power/snapshot.c +++ b/kernel/power/snapshot.c @@ -76,6 +76,40 @@ static inline void hibernate_restore_protect_page(void *page_address) {} static inline void hibernate_restore_unprotect_page(void *page_address) {} #endif /* CONFIG_STRICT_KERNEL_RWX && CONFIG_ARCH_HAS_SET_MEMORY */ + +/* + * The calls to set_direct_map_*() should not fail because remapping a page + * here means that we only update protection bits in an existing PTE. + * It is still worth to have a warning here if something changes and this + * will no longer be the case. + */ +static inline void hibernate_map_page(struct page *page) +{ + if (IS_ENABLED(CONFIG_ARCH_HAS_SET_DIRECT_MAP)) { + int ret = set_direct_map_default_noflush(page); + + if (ret) + pr_warn_once("Failed to remap page\n"); + } else { + debug_pagealloc_map_pages(page, 1); + } +} + +static inline void hibernate_unmap_page(struct page *page) +{ + if (IS_ENABLED(CONFIG_ARCH_HAS_SET_DIRECT_MAP)) { + unsigned long addr = (unsigned long)page_address(page); + int ret = set_direct_map_invalid_noflush(page); + + if (ret) + pr_warn_once("Failed to remap page\n"); + + flush_tlb_kernel_range(addr, addr + PAGE_SIZE); + } else { + debug_pagealloc_unmap_pages(page, 1); + } +} + static int swsusp_page_is_free(struct page *); static void swsusp_set_page_forbidden(struct page *); static void swsusp_unset_page_forbidden(struct page *); @@ -1144,7 +1178,15 @@ void free_basic_memory_bitmaps(void) pr_debug("Basic memory bitmaps freed\n"); } -void clear_free_pages(void) +static void clear_or_poison_free_page(struct page *page) +{ + if (page_poisoning_enabled_static()) + __kernel_poison_pages(page, 1); + else if (want_init_on_free()) + clear_highpage(page); +} + +void clear_or_poison_free_pages(void) { struct memory_bitmap *bm = free_pages_map; unsigned long pfn; @@ -1152,12 +1194,12 @@ void clear_free_pages(void) if (WARN_ON(!(free_pages_map))) return; - if (IS_ENABLED(CONFIG_PAGE_POISONING_ZERO) || want_init_on_free()) { + if (page_poisoning_enabled() || want_init_on_free()) { memory_bm_position_reset(bm); pfn = memory_bm_next_pfn(bm); while (pfn != BM_END_OF_MAP) { if (pfn_valid(pfn)) - clear_highpage(pfn_to_page(pfn)); + clear_or_poison_free_page(pfn_to_page(pfn)); pfn = memory_bm_next_pfn(bm); } @@ -1355,9 +1397,9 @@ static void safe_copy_page(void *dst, struct page *s_page) if (kernel_page_present(s_page)) { do_copy_page(dst, page_address(s_page)); } else { - kernel_map_pages(s_page, 1, 1); + hibernate_map_page(s_page); do_copy_page(dst, page_address(s_page)); - kernel_map_pages(s_page, 1, 0); + hibernate_unmap_page(s_page); } } diff --git a/kernel/ptrace.c b/kernel/ptrace.c index add677d79fcf..61db50f7ca86 100644 --- a/kernel/ptrace.c +++ b/kernel/ptrace.c @@ -57,7 +57,7 @@ int ptrace_access_vm(struct task_struct *tsk, unsigned long addr, return 0; } - ret = __access_remote_vm(tsk, mm, addr, buf, len, gup_flags); + ret = __access_remote_vm(mm, addr, buf, len, gup_flags); mmput(mm); return ret; diff --git a/kernel/workqueue.c b/kernel/workqueue.c index c71da2a59e12..b5295a0b0536 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -1327,6 +1327,9 @@ static void insert_work(struct pool_workqueue *pwq, struct work_struct *work, { struct worker_pool *pool = pwq->pool; + /* record the work call stack in order to print it in KASAN reports */ + kasan_record_aux_stack(work); + /* we own @work, set data and link */ set_work_pwq(work, pwq, extra_flags); list_add_tail(&work->entry, head); diff --git a/lib/locking-selftest.c b/lib/locking-selftest.c index 4c24ac8a456c..9959ea23529e 100644 --- a/lib/locking-selftest.c +++ b/lib/locking-selftest.c @@ -15,6 +15,7 @@ #include <linux/mutex.h> #include <linux/ww_mutex.h> #include <linux/sched.h> +#include <linux/sched/mm.h> #include <linux/delay.h> #include <linux/lockdep.h> #include <linux/spinlock.h> @@ -2374,6 +2375,50 @@ static void queued_read_lock_tests(void) pr_cont("\n"); } +static void fs_reclaim_correct_nesting(void) +{ + fs_reclaim_acquire(GFP_KERNEL); + might_alloc(GFP_NOFS); + fs_reclaim_release(GFP_KERNEL); +} + +static void fs_reclaim_wrong_nesting(void) +{ + fs_reclaim_acquire(GFP_KERNEL); + might_alloc(GFP_KERNEL); + fs_reclaim_release(GFP_KERNEL); +} + +static void fs_reclaim_protected_nesting(void) +{ + unsigned int flags; + + fs_reclaim_acquire(GFP_KERNEL); + flags = memalloc_nofs_save(); + might_alloc(GFP_KERNEL); + memalloc_nofs_restore(flags); + fs_reclaim_release(GFP_KERNEL); +} + +static void fs_reclaim_tests(void) +{ + printk(" --------------------\n"); + printk(" | fs_reclaim tests |\n"); + printk(" --------------------\n"); + + print_testname("correct nesting"); + dotest(fs_reclaim_correct_nesting, SUCCESS, 0); + pr_cont("\n"); + + print_testname("wrong nesting"); + dotest(fs_reclaim_wrong_nesting, FAILURE, 0); + pr_cont("\n"); + + print_testname("protected nesting"); + dotest(fs_reclaim_protected_nesting, SUCCESS, 0); + pr_cont("\n"); +} + void locking_selftest(void) { /* @@ -2495,6 +2540,8 @@ void locking_selftest(void) if (IS_ENABLED(CONFIG_QUEUED_RWLOCKS)) queued_read_lock_tests(); + fs_reclaim_tests(); + if (unexpected_testcase_failures) { printk("-----------------------------------------------------------------\n"); debug_locks = 0; diff --git a/lib/test_kasan_module.c b/lib/test_kasan_module.c index 2d68db6ae67b..62a87854b120 100644 --- a/lib/test_kasan_module.c +++ b/lib/test_kasan_module.c @@ -91,6 +91,34 @@ static noinline void __init kasan_rcu_uaf(void) call_rcu(&global_rcu_ptr->rcu, kasan_rcu_reclaim); } +static noinline void __init kasan_workqueue_work(struct work_struct *work) +{ + kfree(work); +} + +static noinline void __init kasan_workqueue_uaf(void) +{ + struct workqueue_struct *workqueue; + struct work_struct *work; + + workqueue = create_workqueue("kasan_wq_test"); + if (!workqueue) { + pr_err("Allocation failed\n"); + return; + } + work = kmalloc(sizeof(struct work_struct), GFP_KERNEL); + if (!work) { + pr_err("Allocation failed\n"); + return; + } + + INIT_WORK(work, kasan_workqueue_work); + queue_work(workqueue, work); + destroy_workqueue(workqueue); + + pr_info("use-after-free on workqueue\n"); + ((volatile struct work_struct *)work)->data; +} static int __init test_kasan_module_init(void) { @@ -102,6 +130,7 @@ static int __init test_kasan_module_init(void) copy_user_test(); kasan_rcu_uaf(); + kasan_workqueue_uaf(); kasan_restore_multi_shot(multishot); return -EAGAIN; diff --git a/mm/Kconfig b/mm/Kconfig index 8c49d09da214..cf04bc3c866c 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -821,13 +821,28 @@ config PERCPU_STATS information includes global and per chunk statistics, which can be used to help understand percpu memory usage. -config GUP_BENCHMARK - bool "Enable infrastructure for get_user_pages() and related calls benchmarking" +config GUP_TEST + bool "Enable infrastructure for get_user_pages()-related unit tests" + depends on DEBUG_FS help - Provides /sys/kernel/debug/gup_benchmark that helps with testing - performance of get_user_pages() and related calls. + Provides /sys/kernel/debug/gup_test, which in turn provides a way + to make ioctl calls that can launch kernel-based unit tests for + the get_user_pages*() and pin_user_pages*() family of API calls. - See tools/testing/selftests/vm/gup_benchmark.c + These tests include benchmark testing of the _fast variants of + get_user_pages*() and pin_user_pages*(), as well as smoke tests of + the non-_fast variants. + + There is also a sub-test that allows running dump_page() on any + of up to eight pages (selected by command line args) within the + range of user-space addresses. These pages are either pinned via + pin_user_pages*(), or pinned via get_user_pages*(), as specified + by other command line arguments. + + See tools/testing/selftests/vm/gup_test.c + +comment "GUP_TEST needs to have DEBUG_FS enabled" + depends on !GUP_TEST && !DEBUG_FS config GUP_GET_PTE_LOW_HIGH bool diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug index 864f129f1937..1e73717802f8 100644 --- a/mm/Kconfig.debug +++ b/mm/Kconfig.debug @@ -64,7 +64,6 @@ config PAGE_OWNER config PAGE_POISONING bool "Poison pages after freeing" - select PAGE_POISONING_NO_SANITY if HIBERNATION help Fill the pages with poison patterns after free_pages() and verify the patterns before alloc_pages. The filling of the memory helps @@ -75,30 +74,11 @@ config PAGE_POISONING Note that "poison" here is not the same thing as the "HWPoison" for CONFIG_MEMORY_FAILURE. This is software poisoning only. - If unsure, say N - -config PAGE_POISONING_NO_SANITY - depends on PAGE_POISONING - bool "Only poison, don't sanity check" - help - Skip the sanity checking on alloc, only fill the pages with - poison on free. This reduces some of the overhead of the - poisoning feature. - - If you are only interested in sanitization, say Y. Otherwise - say N. + If you are only interested in sanitization of freed pages without + checking the poison pattern on alloc, you can boot the kernel with + "init_on_free=1" instead of enabling this. -config PAGE_POISONING_ZERO - bool "Use zero for poisoning instead of debugging value" - depends on PAGE_POISONING - help - Instead of using the existing poison value, fill the pages with - zeros. This makes it harder to detect when errors are occurring - due to sanitization but the zeroing at free means that it is - no longer necessary to write zeros when GFP_ZERO is used on - allocation. - - If unsure, say N + If unsure, say N config DEBUG_PAGE_REF bool "Enable tracepoint to track down page reference manipulation" diff --git a/mm/Makefile b/mm/Makefile index d73aed0fc99c..b6cd2fffa492 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -52,7 +52,7 @@ obj-y := filemap.o mempool.o oom_kill.o fadvise.o \ mm_init.o percpu.o slab_common.o \ compaction.o vmacache.o \ interval_tree.o list_lru.o workingset.o \ - debug.o gup.o $(mmu-y) + debug.o gup.o mmap_lock.o $(mmu-y) # Give 'page_alloc' its own module-parameter namespace page-alloc-y := page_alloc.o @@ -90,7 +90,7 @@ obj-$(CONFIG_PAGE_COUNTER) += page_counter.o obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o obj-$(CONFIG_MEMCG_SWAP) += swap_cgroup.o obj-$(CONFIG_CGROUP_HUGETLB) += hugetlb_cgroup.o -obj-$(CONFIG_GUP_BENCHMARK) += gup_benchmark.o +obj-$(CONFIG_GUP_TEST) += gup_test.o obj-$(CONFIG_MEMORY_FAILURE) += memory-failure.o obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o diff --git a/mm/backing-dev.c b/mm/backing-dev.c index 408d5051d05b..e33797579338 100644 --- a/mm/backing-dev.c +++ b/mm/backing-dev.c @@ -150,11 +150,11 @@ static ssize_t read_ahead_kb_store(struct device *dev, #define BDI_SHOW(name, expr) \ static ssize_t name##_show(struct device *dev, \ - struct device_attribute *attr, char *page) \ + struct device_attribute *attr, char *buf) \ { \ struct backing_dev_info *bdi = dev_get_drvdata(dev); \ \ - return snprintf(page, PAGE_SIZE-1, "%lld\n", (long long)expr); \ + return sysfs_emit(buf, "%lld\n", (long long)expr); \ } \ static DEVICE_ATTR_RW(name); @@ -200,11 +200,11 @@ BDI_SHOW(max_ratio, bdi->max_ratio) static ssize_t stable_pages_required_show(struct device *dev, struct device_attribute *attr, - char *page) + char *buf) { dev_warn_once(dev, "the stable_pages_required attribute has been removed. Use the stable_writes queue attribute instead.\n"); - return snprintf(page, PAGE_SIZE-1, "%d\n", 0); + return sysfs_emit(buf, "%d\n", 0); } static DEVICE_ATTR_RO(stable_pages_required); @@ -38,7 +38,6 @@ struct cma cma_areas[MAX_CMA_AREAS]; unsigned cma_area_count; -static DEFINE_MUTEX(cma_mutex); phys_addr_t cma_get_base(const struct cma *cma) { @@ -454,10 +453,9 @@ struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align, mutex_unlock(&cma->lock); pfn = cma->base_pfn + (bitmap_no << cma->order_per_bit); - mutex_lock(&cma_mutex); ret = alloc_contig_range(pfn, pfn + count, MIGRATE_CMA, GFP_KERNEL | (no_warn ? __GFP_NOWARN : 0)); - mutex_unlock(&cma_mutex); + if (ret == 0) { page = pfn_to_page(pfn); break; @@ -512,7 +510,7 @@ bool cma_release(struct cma *cma, const struct page *pages, unsigned int count) if (!cma || !pages) return false; - pr_debug("%s(page %p)\n", __func__, (void *)pages); + pr_debug("%s(page %p, count %u)\n", __func__, (void *)pages, count); pfn = page_to_pfn(pages); diff --git a/mm/compaction.c b/mm/compaction.c index 13cb7a961b31..dbcfdfce1b82 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -157,7 +157,7 @@ EXPORT_SYMBOL(__ClearPageMovable); * allocation success. 1 << compact_defer_shift, compactions are skipped up * to a limit of 1 << COMPACT_MAX_DEFER_SHIFT */ -void defer_compaction(struct zone *zone, int order) +static void defer_compaction(struct zone *zone, int order) { zone->compact_considered = 0; zone->compact_defer_shift++; @@ -172,7 +172,7 @@ void defer_compaction(struct zone *zone, int order) } /* Returns true if compaction should be skipped this time */ -bool compaction_deferred(struct zone *zone, int order) +static bool compaction_deferred(struct zone *zone, int order) { unsigned long defer_limit = 1UL << zone->compact_defer_shift; @@ -209,7 +209,7 @@ void compaction_defer_reset(struct zone *zone, int order, } /* Returns true if restarting compaction after many failures */ -bool compaction_restarting(struct zone *zone, int order) +static bool compaction_restarting(struct zone *zone, int order) { if (order < zone->compact_order_failed) return false; @@ -237,7 +237,7 @@ static void reset_cached_positions(struct zone *zone) } /* - * Compound pages of >= pageblock_order should consistenly be skipped until + * Compound pages of >= pageblock_order should consistently be skipped until * released. It is always pointless to compact pages of such order (if they are * migratable), and the pageblocks they occupy cannot contain any free pages. */ @@ -2070,13 +2070,6 @@ static enum compact_result compact_finished(struct compact_control *cc) return ret; } -/* - * compaction_suitable: Is this suitable to run compaction on this zone now? - * Returns - * COMPACT_SKIPPED - If there are too few free pages for compaction - * COMPACT_SUCCESS - If the allocation would succeed without compaction - * COMPACT_CONTINUE - If compaction should run now - */ static enum compact_result __compaction_suitable(struct zone *zone, int order, unsigned int alloc_flags, int highest_zoneidx, @@ -2120,6 +2113,13 @@ static enum compact_result __compaction_suitable(struct zone *zone, int order, return COMPACT_CONTINUE; } +/* + * compaction_suitable: Is this suitable to run compaction on this zone now? + * Returns + * COMPACT_SKIPPED - If there are too few free pages for compaction + * COMPACT_SUCCESS - If the allocation would succeed without compaction + * COMPACT_CONTINUE - If compaction should run now + */ enum compact_result compaction_suitable(struct zone *zone, int order, unsigned int alloc_flags, int highest_zoneidx) @@ -2275,7 +2275,7 @@ compact_zone(struct compact_control *cc, struct capture_control *capc) while ((ret = compact_finished(cc)) == COMPACT_CONTINUE) { int err; - unsigned long start_pfn = cc->migrate_pfn; + unsigned long iteration_start_pfn = cc->migrate_pfn; /* * Avoid multiple rescans which can happen if a page cannot be @@ -2287,7 +2287,7 @@ compact_zone(struct compact_control *cc, struct capture_control *capc) */ cc->rescan = false; if (pageblock_start_pfn(last_migrated_pfn) == - pageblock_start_pfn(start_pfn)) { + pageblock_start_pfn(iteration_start_pfn)) { cc->rescan = true; } @@ -2311,8 +2311,7 @@ compact_zone(struct compact_control *cc, struct capture_control *capc) goto check_drain; case ISOLATE_SUCCESS: update_cached = false; - last_migrated_pfn = start_pfn; - ; + last_migrated_pfn = iteration_start_pfn; } err = migrate_pages(&cc->migratepages, compaction_alloc, diff --git a/mm/filemap.c b/mm/filemap.c index 0b2067b3c328..39bb88140680 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -204,9 +204,9 @@ static void unaccount_page_cache_page(struct address_space *mapping, if (PageSwapBacked(page)) { __mod_lruvec_page_state(page, NR_SHMEM, -nr); if (PageTransHuge(page)) - __dec_node_page_state(page, NR_SHMEM_THPS); + __dec_lruvec_page_state(page, NR_SHMEM_THPS); } else if (PageTransHuge(page)) { - __dec_node_page_state(page, NR_FILE_THPS); + __dec_lruvec_page_state(page, NR_FILE_THPS); filemap_nr_thps_dec(mapping); } @@ -1583,19 +1583,20 @@ int __lock_page_or_retry(struct page *page, struct mm_struct *mm, else wait_on_page_locked(page); return 0; - } else { - if (flags & FAULT_FLAG_KILLABLE) { - int ret; + } + if (flags & FAULT_FLAG_KILLABLE) { + int ret; - ret = __lock_page_killable(page); - if (ret) { - mmap_read_unlock(mm); - return 0; - } - } else - __lock_page(page); - return 1; + ret = __lock_page_killable(page); + if (ret) { + mmap_read_unlock(mm); + return 0; + } + } else { + __lock_page(page); } + return 1; + } /** @@ -2166,6 +2167,259 @@ static void shrink_readahead_size_eio(struct file_ra_state *ra) ra->ra_pages /= 4; } +static int lock_page_for_iocb(struct kiocb *iocb, struct page *page) +{ + if (iocb->ki_flags & IOCB_WAITQ) + return lock_page_async(page, iocb->ki_waitq); + else if (iocb->ki_flags & IOCB_NOWAIT) + return trylock_page(page) ? 0 : -EAGAIN; + else + return lock_page_killable(page); +} + +static struct page * +generic_file_buffered_read_readpage(struct kiocb *iocb, + struct file *filp, + struct address_space *mapping, + struct page *page) +{ + struct file_ra_state *ra = &filp->f_ra; + int error; + + if (iocb->ki_flags & (IOCB_NOIO | IOCB_NOWAIT)) { + unlock_page(page); + put_page(page); + return ERR_PTR(-EAGAIN); + } + + /* + * A previous I/O error may have been due to temporary + * failures, eg. multipath errors. + * PG_error will be set again if readpage fails. + */ + ClearPageError(page); + /* Start the actual read. The read will unlock the page. */ + error = mapping->a_ops->readpage(filp, page); + + if (unlikely(error)) { + put_page(page); + return error != AOP_TRUNCATED_PAGE ? ERR_PTR(error) : NULL; + } + + if (!PageUptodate(page)) { + error = lock_page_for_iocb(iocb, page); + if (unlikely(error)) { + put_page(page); + return ERR_PTR(error); + } + if (!PageUptodate(page)) { + if (page->mapping == NULL) { + /* + * invalidate_mapping_pages got it + */ + unlock_page(page); + put_page(page); + return NULL; + } + unlock_page(page); + shrink_readahead_size_eio(ra); + put_page(page); + return ERR_PTR(-EIO); + } + unlock_page(page); + } + + return page; +} + +static struct page * +generic_file_buffered_read_pagenotuptodate(struct kiocb *iocb, + struct file *filp, + struct iov_iter *iter, + struct page *page, + loff_t pos, loff_t count) +{ + struct address_space *mapping = filp->f_mapping; + struct inode *inode = mapping->host; + int error; + + /* + * See comment in do_read_cache_page on why + * wait_on_page_locked is used to avoid unnecessarily + * serialisations and why it's safe. + */ + if (iocb->ki_flags & IOCB_WAITQ) { + error = wait_on_page_locked_async(page, + iocb->ki_waitq); + } else { + error = wait_on_page_locked_killable(page); + } + if (unlikely(error)) { + put_page(page); + return ERR_PTR(error); + } + if (PageUptodate(page)) + return page; + + if (inode->i_blkbits == PAGE_SHIFT || + !mapping->a_ops->is_partially_uptodate) + goto page_not_up_to_date; + /* pipes can't handle partially uptodate pages */ + if (unlikely(iov_iter_is_pipe(iter))) + goto page_not_up_to_date; + if (!trylock_page(page)) + goto page_not_up_to_date; + /* Did it get truncated before we got the lock? */ + if (!page->mapping) + goto page_not_up_to_date_locked; + if (!mapping->a_ops->is_partially_uptodate(page, + pos & ~PAGE_MASK, count)) + goto page_not_up_to_date_locked; + unlock_page(page); + return page; + +page_not_up_to_date: + /* Get exclusive access to the page ... */ + error = lock_page_for_iocb(iocb, page); + if (unlikely(error)) { + put_page(page); + return ERR_PTR(error); + } + +page_not_up_to_date_locked: + /* Did it get truncated before we got the lock? */ + if (!page->mapping) { + unlock_page(page); + put_page(page); + return NULL; + } + + /* Did somebody else fill it already? */ + if (PageUptodate(page)) { + unlock_page(page); + return page; + } + + return generic_file_buffered_read_readpage(iocb, filp, mapping, page); +} + +static struct page * +generic_file_buffered_read_no_cached_page(struct kiocb *iocb, + struct iov_iter *iter) +{ + struct file *filp = iocb->ki_filp; + struct address_space *mapping = filp->f_mapping; + pgoff_t index = iocb->ki_pos >> PAGE_SHIFT; + struct page *page; + int error; + + if (iocb->ki_flags & IOCB_NOIO) + return ERR_PTR(-EAGAIN); + + /* + * Ok, it wasn't cached, so we need to create a new + * page.. + */ + page = page_cache_alloc(mapping); + if (!page) + return ERR_PTR(-ENOMEM); + + error = add_to_page_cache_lru(page, mapping, index, + mapping_gfp_constraint(mapping, GFP_KERNEL)); + if (error) { + put_page(page); + return error != -EEXIST ? ERR_PTR(error) : NULL; + } + + return generic_file_buffered_read_readpage(iocb, filp, mapping, page); +} + +static int generic_file_buffered_read_get_pages(struct kiocb *iocb, + struct iov_iter *iter, + struct page **pages, + unsigned int nr) +{ + struct file *filp = iocb->ki_filp; + struct address_space *mapping = filp->f_mapping; + struct file_ra_state *ra = &filp->f_ra; + pgoff_t index = iocb->ki_pos >> PAGE_SHIFT; + pgoff_t last_index = (iocb->ki_pos + iter->count + PAGE_SIZE-1) >> PAGE_SHIFT; + int i, j, nr_got, err = 0; + + nr = min_t(unsigned long, last_index - index, nr); +find_page: + if (fatal_signal_pending(current)) + return -EINTR; + + nr_got = find_get_pages_contig(mapping, index, nr, pages); + if (nr_got) + goto got_pages; + + if (iocb->ki_flags & IOCB_NOIO) + return -EAGAIN; + + page_cache_sync_readahead(mapping, ra, filp, index, last_index - index); + + nr_got = find_get_pages_contig(mapping, index, nr, pages); + if (nr_got) + goto got_pages; + + pages[0] = generic_file_buffered_read_no_cached_page(iocb, iter); + err = PTR_ERR_OR_ZERO(pages[0]); + if (!IS_ERR_OR_NULL(pages[0])) + nr_got = 1; +got_pages: + for (i = 0; i < nr_got; i++) { + struct page *page = pages[i]; + pgoff_t pg_index = index + i; + loff_t pg_pos = max(iocb->ki_pos, + (loff_t) pg_index << PAGE_SHIFT); + loff_t pg_count = iocb->ki_pos + iter->count - pg_pos; + + if (PageReadahead(page)) { + if (iocb->ki_flags & IOCB_NOIO) { + for (j = i; j < nr_got; j++) + put_page(pages[j]); + nr_got = i; + err = -EAGAIN; + break; + } + page_cache_async_readahead(mapping, ra, filp, page, + pg_index, last_index - pg_index); + } + + if (!PageUptodate(page)) { + if ((iocb->ki_flags & IOCB_NOWAIT) || + ((iocb->ki_flags & IOCB_WAITQ) && i)) { + for (j = i; j < nr_got; j++) + put_page(pages[j]); + nr_got = i; + err = -EAGAIN; + break; + } + + page = generic_file_buffered_read_pagenotuptodate(iocb, + filp, iter, page, pg_pos, pg_count); + if (IS_ERR_OR_NULL(page)) { + for (j = i + 1; j < nr_got; j++) + put_page(pages[j]); + nr_got = i; + err = PTR_ERR_OR_ZERO(page); + break; + } + } + } + + if (likely(nr_got)) + return nr_got; + if (err) + return err; + /* + * No pages and no error means we raced and should retry: + */ + goto find_page; +} + /** * generic_file_buffered_read - generic file read routine * @iocb: the iocb to read @@ -2186,294 +2440,117 @@ ssize_t generic_file_buffered_read(struct kiocb *iocb, struct iov_iter *iter, ssize_t written) { struct file *filp = iocb->ki_filp; + struct file_ra_state *ra = &filp->f_ra; struct address_space *mapping = filp->f_mapping; struct inode *inode = mapping->host; - struct file_ra_state *ra = &filp->f_ra; - loff_t *ppos = &iocb->ki_pos; - pgoff_t index; - pgoff_t last_index; - pgoff_t prev_index; - unsigned long offset; /* offset into pagecache page */ - unsigned int prev_offset; - int error = 0; - - if (unlikely(*ppos >= inode->i_sb->s_maxbytes)) + struct page *pages_onstack[PAGEVEC_SIZE], **pages = NULL; + unsigned int nr_pages = min_t(unsigned int, 512, + ((iocb->ki_pos + iter->count + PAGE_SIZE - 1) >> PAGE_SHIFT) - + (iocb->ki_pos >> PAGE_SHIFT)); + int i, pg_nr, error = 0; + bool writably_mapped; + loff_t isize, end_offset; + + if (unlikely(iocb->ki_pos >= inode->i_sb->s_maxbytes)) return 0; iov_iter_truncate(iter, inode->i_sb->s_maxbytes); - index = *ppos >> PAGE_SHIFT; - prev_index = ra->prev_pos >> PAGE_SHIFT; - prev_offset = ra->prev_pos & (PAGE_SIZE-1); - last_index = (*ppos + iter->count + PAGE_SIZE-1) >> PAGE_SHIFT; - offset = *ppos & ~PAGE_MASK; - - /* - * If we've already successfully copied some data, then we - * can no longer safely return -EIOCBQUEUED. Hence mark - * an async read NOWAIT at that point. - */ - if (written && (iocb->ki_flags & IOCB_WAITQ)) - iocb->ki_flags |= IOCB_NOWAIT; + if (nr_pages > ARRAY_SIZE(pages_onstack)) + pages = kmalloc_array(nr_pages, sizeof(void *), GFP_KERNEL); - for (;;) { - struct page *page; - pgoff_t end_index; - loff_t isize; - unsigned long nr, ret; + if (!pages) { + pages = pages_onstack; + nr_pages = min_t(unsigned int, nr_pages, ARRAY_SIZE(pages_onstack)); + } + do { cond_resched(); -find_page: - if (fatal_signal_pending(current)) { - error = -EINTR; - goto out; - } - page = find_get_page(mapping, index); - if (!page) { - if (iocb->ki_flags & IOCB_NOIO) - goto would_block; - page_cache_sync_readahead(mapping, - ra, filp, - index, last_index - index); - page = find_get_page(mapping, index); - if (unlikely(page == NULL)) - goto no_cached_page; - } - if (PageReadahead(page)) { - if (iocb->ki_flags & IOCB_NOIO) { - put_page(page); - goto out; - } - page_cache_async_readahead(mapping, - ra, filp, page, - index, last_index - index); - } - if (!PageUptodate(page)) { - /* - * See comment in do_read_cache_page on why - * wait_on_page_locked is used to avoid unnecessarily - * serialisations and why it's safe. - */ - if (iocb->ki_flags & IOCB_WAITQ) { - if (written) { - put_page(page); - goto out; - } - error = wait_on_page_locked_async(page, - iocb->ki_waitq); - } else { - if (iocb->ki_flags & IOCB_NOWAIT) { - put_page(page); - goto would_block; - } - error = wait_on_page_locked_killable(page); - } - if (unlikely(error)) - goto readpage_error; - if (PageUptodate(page)) - goto page_ok; - - if (inode->i_blkbits == PAGE_SHIFT || - !mapping->a_ops->is_partially_uptodate) - goto page_not_up_to_date; - /* pipes can't handle partially uptodate pages */ - if (unlikely(iov_iter_is_pipe(iter))) - goto page_not_up_to_date; - if (!trylock_page(page)) - goto page_not_up_to_date; - /* Did it get truncated before we got the lock? */ - if (!page->mapping) - goto page_not_up_to_date_locked; - if (!mapping->a_ops->is_partially_uptodate(page, - offset, iter->count)) - goto page_not_up_to_date_locked; - unlock_page(page); + /* + * If we've already successfully copied some data, then we + * can no longer safely return -EIOCBQUEUED. Hence mark + * an async read NOWAIT at that point. + */ + if ((iocb->ki_flags & IOCB_WAITQ) && written) + iocb->ki_flags |= IOCB_NOWAIT; + + i = 0; + pg_nr = generic_file_buffered_read_get_pages(iocb, iter, + pages, nr_pages); + if (pg_nr < 0) { + error = pg_nr; + break; } -page_ok: + /* - * i_size must be checked after we know the page is Uptodate. + * i_size must be checked after we know the pages are Uptodate. * * Checking i_size after the check allows us to calculate * the correct value for "nr", which means the zero-filled * part of the page is not copied back to userspace (unless * another truncate extends the file - this is desired though). */ - isize = i_size_read(inode); - end_index = (isize - 1) >> PAGE_SHIFT; - if (unlikely(!isize || index > end_index)) { - put_page(page); - goto out; - } + if (unlikely(iocb->ki_pos >= isize)) + goto put_pages; - /* nr is the maximum number of bytes to copy from this page */ - nr = PAGE_SIZE; - if (index == end_index) { - nr = ((isize - 1) & ~PAGE_MASK) + 1; - if (nr <= offset) { - put_page(page); - goto out; - } - } - nr = nr - offset; + end_offset = min_t(loff_t, isize, iocb->ki_pos + iter->count); - /* If users can be writing to this page using arbitrary - * virtual addresses, take care about potential aliasing - * before reading the page on the kernel side. - */ - if (mapping_writably_mapped(mapping)) - flush_dcache_page(page); + while ((iocb->ki_pos >> PAGE_SHIFT) + pg_nr > + (end_offset + PAGE_SIZE - 1) >> PAGE_SHIFT) + put_page(pages[--pg_nr]); /* - * When a sequential read accesses a page several times, - * only mark it as accessed the first time. + * Once we start copying data, we don't want to be touching any + * cachelines that might be contended: */ - if (prev_index != index || offset != prev_offset) - mark_page_accessed(page); - prev_index = index; + writably_mapped = mapping_writably_mapped(mapping); /* - * Ok, we have the page, and it's up-to-date, so - * now we can copy it to user space... + * When a sequential read accesses a page several times, only + * mark it as accessed the first time. */ + if (iocb->ki_pos >> PAGE_SHIFT != + ra->prev_pos >> PAGE_SHIFT) + mark_page_accessed(pages[0]); + for (i = 1; i < pg_nr; i++) + mark_page_accessed(pages[i]); + + for (i = 0; i < pg_nr; i++) { + unsigned int offset = iocb->ki_pos & ~PAGE_MASK; + unsigned int bytes = min_t(loff_t, end_offset - iocb->ki_pos, + PAGE_SIZE - offset); + unsigned int copied; - ret = copy_page_to_iter(page, offset, nr, iter); - offset += ret; - index += offset >> PAGE_SHIFT; - offset &= ~PAGE_MASK; - prev_offset = offset; - - put_page(page); - written += ret; - if (!iov_iter_count(iter)) - goto out; - if (ret < nr) { - error = -EFAULT; - goto out; - } - continue; - -page_not_up_to_date: - /* Get exclusive access to the page ... */ - if (iocb->ki_flags & IOCB_WAITQ) { - if (written) { - put_page(page); - goto out; - } - error = lock_page_async(page, iocb->ki_waitq); - } else { - error = lock_page_killable(page); - } - if (unlikely(error)) - goto readpage_error; - -page_not_up_to_date_locked: - /* Did it get truncated before we got the lock? */ - if (!page->mapping) { - unlock_page(page); - put_page(page); - continue; - } - - /* Did somebody else fill it already? */ - if (PageUptodate(page)) { - unlock_page(page); - goto page_ok; - } - -readpage: - if (iocb->ki_flags & (IOCB_NOIO | IOCB_NOWAIT)) { - unlock_page(page); - put_page(page); - goto would_block; - } - /* - * A previous I/O error may have been due to temporary - * failures, eg. multipath errors. - * PG_error will be set again if readpage fails. - */ - ClearPageError(page); - /* Start the actual read. The read will unlock the page. */ - error = mapping->a_ops->readpage(filp, page); + /* + * If users can be writing to this page using arbitrary + * virtual addresses, take care about potential aliasing + * before reading the page on the kernel side. + */ + if (writably_mapped) + flush_dcache_page(pages[i]); - if (unlikely(error)) { - if (error == AOP_TRUNCATED_PAGE) { - put_page(page); - error = 0; - goto find_page; - } - goto readpage_error; - } + copied = copy_page_to_iter(pages[i], offset, bytes, iter); - if (!PageUptodate(page)) { - if (iocb->ki_flags & IOCB_WAITQ) { - if (written) { - put_page(page); - goto out; - } - error = lock_page_async(page, iocb->ki_waitq); - } else { - error = lock_page_killable(page); - } + written += copied; + iocb->ki_pos += copied; + ra->prev_pos = iocb->ki_pos; - if (unlikely(error)) - goto readpage_error; - if (!PageUptodate(page)) { - if (page->mapping == NULL) { - /* - * invalidate_mapping_pages got it - */ - unlock_page(page); - put_page(page); - goto find_page; - } - unlock_page(page); - shrink_readahead_size_eio(ra); - error = -EIO; - goto readpage_error; + if (copied < bytes) { + error = -EFAULT; + break; } - unlock_page(page); } +put_pages: + for (i = 0; i < pg_nr; i++) + put_page(pages[i]); + } while (iov_iter_count(iter) && iocb->ki_pos < isize && !error); - goto page_ok; - -readpage_error: - /* UHHUH! A synchronous read error occurred. Report it */ - put_page(page); - goto out; - -no_cached_page: - /* - * Ok, it wasn't cached, so we need to create a new - * page.. - */ - page = page_cache_alloc(mapping); - if (!page) { - error = -ENOMEM; - goto out; - } - error = add_to_page_cache_lru(page, mapping, index, - mapping_gfp_constraint(mapping, GFP_KERNEL)); - if (error) { - put_page(page); - if (error == -EEXIST) { - error = 0; - goto find_page; - } - goto out; - } - goto readpage; - } + file_accessed(filp); -would_block: - error = -EAGAIN; -out: - ra->prev_pos = prev_index; - ra->prev_pos <<= PAGE_SHIFT; - ra->prev_pos |= prev_offset; + if (pages != pages_onstack) + kfree(pages); - *ppos = ((loff_t)index << PAGE_SHIFT) + offset; - file_accessed(filp); return written ? written : error; } EXPORT_SYMBOL_GPL(generic_file_buffered_read); @@ -123,6 +123,28 @@ static __maybe_unused struct page *try_grab_compound_head(struct page *page, return NULL; } +static void put_compound_head(struct page *page, int refs, unsigned int flags) +{ + if (flags & FOLL_PIN) { + mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_RELEASED, + refs); + + if (hpage_pincount_available(page)) + hpage_pincount_sub(page, refs); + else + refs *= GUP_PIN_COUNTING_BIAS; + } + + VM_BUG_ON_PAGE(page_ref_count(page) < refs, page); + /* + * Calling put_page() for each ref is unnecessarily slow. Only the last + * ref needs a put_page(). + */ + if (refs > 1) + page_ref_sub(page, refs - 1); + put_page(page); +} + /** * try_grab_page() - elevate a page's refcount by a flag-dependent amount * @@ -177,41 +199,6 @@ bool __must_check try_grab_page(struct page *page, unsigned int flags) return true; } -#ifdef CONFIG_DEV_PAGEMAP_OPS -static bool __unpin_devmap_managed_user_page(struct page *page) -{ - int count, refs = 1; - - if (!page_is_devmap_managed(page)) - return false; - - if (hpage_pincount_available(page)) - hpage_pincount_sub(page, 1); - else - refs = GUP_PIN_COUNTING_BIAS; - - count = page_ref_sub_return(page, refs); - - mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_RELEASED, 1); - /* - * devmap page refcounts are 1-based, rather than 0-based: if - * refcount is 1, then the page is free and the refcount is - * stable because nobody holds a reference on the page. - */ - if (count == 1) - free_devmap_managed_page(page); - else if (!count) - __put_page(page); - - return true; -} -#else -static bool __unpin_devmap_managed_user_page(struct page *page) -{ - return false; -} -#endif /* CONFIG_DEV_PAGEMAP_OPS */ - /** * unpin_user_page() - release a dma-pinned page * @page: pointer to page to be released @@ -223,28 +210,7 @@ static bool __unpin_devmap_managed_user_page(struct page *page) */ void unpin_user_page(struct page *page) { - int refs = 1; - - page = compound_head(page); - - /* - * For devmap managed pages we need to catch refcount transition from - * GUP_PIN_COUNTING_BIAS to 1, when refcount reach one it means the - * page is free and we need to inform the device driver through - * callback. See include/linux/memremap.h and HMM for details. - */ - if (__unpin_devmap_managed_user_page(page)) - return; - - if (hpage_pincount_available(page)) - hpage_pincount_sub(page, 1); - else - refs = GUP_PIN_COUNTING_BIAS; - - if (page_ref_sub_and_test(page, refs)) - __put_page(page); - - mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_RELEASED, 1); + put_compound_head(compound_head(page), 1, FOLL_PIN); } EXPORT_SYMBOL(unpin_user_page); @@ -923,6 +889,9 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags) if (gup_flags & FOLL_ANON && !vma_is_anonymous(vma)) return -EFAULT; + if ((gup_flags & FOLL_LONGTERM) && vma_is_fsdax(vma)) + return -EOPNOTSUPP; + if (write) { if (!(vm_flags & VM_WRITE)) { if (!(gup_flags & FOLL_FORCE)) @@ -1060,10 +1029,14 @@ static long __get_user_pages(struct mm_struct *mm, goto next_page; } - if (!vma || check_vma_flags(vma, gup_flags)) { + if (!vma) { ret = -EFAULT; goto out; } + ret = check_vma_flags(vma, gup_flags); + if (ret) + goto out; + if (is_vm_hugetlb_page(vma)) { i = follow_hugetlb_page(mm, vma, pages, vmas, &start, &nr_pages, i, @@ -1567,26 +1540,6 @@ struct page *get_dump_page(unsigned long addr) } #endif /* CONFIG_ELF_CORE */ -#if defined(CONFIG_FS_DAX) || defined (CONFIG_CMA) -static bool check_dax_vmas(struct vm_area_struct **vmas, long nr_pages) -{ - long i; - struct vm_area_struct *vma_prev = NULL; - - for (i = 0; i < nr_pages; i++) { - struct vm_area_struct *vma = vmas[i]; - - if (vma == vma_prev) - continue; - - vma_prev = vma; - - if (vma_is_fsdax(vma)) - return true; - } - return false; -} - #ifdef CONFIG_CMA static long check_and_migrate_cma_pages(struct mm_struct *mm, unsigned long start, @@ -1705,63 +1658,23 @@ static long __gup_longterm_locked(struct mm_struct *mm, struct vm_area_struct **vmas, unsigned int gup_flags) { - struct vm_area_struct **vmas_tmp = vmas; unsigned long flags = 0; - long rc, i; - - if (gup_flags & FOLL_LONGTERM) { - if (!pages) - return -EINVAL; + long rc; - if (!vmas_tmp) { - vmas_tmp = kcalloc(nr_pages, - sizeof(struct vm_area_struct *), - GFP_KERNEL); - if (!vmas_tmp) - return -ENOMEM; - } + if (gup_flags & FOLL_LONGTERM) flags = memalloc_nocma_save(); - } - rc = __get_user_pages_locked(mm, start, nr_pages, pages, - vmas_tmp, NULL, gup_flags); + rc = __get_user_pages_locked(mm, start, nr_pages, pages, vmas, NULL, + gup_flags); if (gup_flags & FOLL_LONGTERM) { - if (rc < 0) - goto out; - - if (check_dax_vmas(vmas_tmp, rc)) { - if (gup_flags & FOLL_PIN) - unpin_user_pages(pages, rc); - else - for (i = 0; i < rc; i++) - put_page(pages[i]); - rc = -EOPNOTSUPP; - goto out; - } - - rc = check_and_migrate_cma_pages(mm, start, rc, pages, - vmas_tmp, gup_flags); -out: + if (rc > 0) + rc = check_and_migrate_cma_pages(mm, start, rc, pages, + vmas, gup_flags); memalloc_nocma_restore(flags); } - - if (vmas_tmp != vmas) - kfree(vmas_tmp); return rc; } -#else /* !CONFIG_FS_DAX && !CONFIG_CMA */ -static __always_inline long __gup_longterm_locked(struct mm_struct *mm, - unsigned long start, - unsigned long nr_pages, - struct page **pages, - struct vm_area_struct **vmas, - unsigned int flags) -{ - return __get_user_pages_locked(mm, start, nr_pages, pages, vmas, - NULL, flags); -} -#endif /* CONFIG_FS_DAX || CONFIG_CMA */ static bool is_valid_gup_flags(unsigned int gup_flags) { @@ -1932,7 +1845,19 @@ long get_user_pages(unsigned long start, unsigned long nr_pages, EXPORT_SYMBOL(get_user_pages); /** - * get_user_pages_locked() is suitable to replace the form: + * get_user_pages_locked() - variant of get_user_pages() + * + * @start: starting user address + * @nr_pages: number of pages from start to pin + * @gup_flags: flags modifying lookup behaviour + * @pages: array that receives pointers to the pages pinned. + * Should be at least nr_pages long. Or NULL, if caller + * only intends to ensure the pages are faulted in. + * @locked: pointer to lock flag indicating whether lock is held and + * subsequently whether VM_FAULT_RETRY functionality can be + * utilised. Lock must initially be held. + * + * It is suitable to replace the form: * * mmap_read_lock(mm); * do_something() @@ -1948,16 +1873,6 @@ EXPORT_SYMBOL(get_user_pages); * if (locked) * mmap_read_unlock(mm); * - * @start: starting user address - * @nr_pages: number of pages from start to pin - * @gup_flags: flags modifying lookup behaviour - * @pages: array that receives pointers to the pages pinned. - * Should be at least nr_pages long. Or NULL, if caller - * only intends to ensure the pages are faulted in. - * @locked: pointer to lock flag indicating whether lock is held and - * subsequently whether VM_FAULT_RETRY functionality can be - * utilised. Lock must initially be held. - * * We can leverage the VM_FAULT_RETRY functionality in the page fault * paths better by using either get_user_pages_locked() or * get_user_pages_unlocked(). @@ -2063,28 +1978,6 @@ EXPORT_SYMBOL(get_user_pages_unlocked); */ #ifdef CONFIG_HAVE_FAST_GUP -static void put_compound_head(struct page *page, int refs, unsigned int flags) -{ - if (flags & FOLL_PIN) { - mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_RELEASED, - refs); - - if (hpage_pincount_available(page)) - hpage_pincount_sub(page, refs); - else - refs *= GUP_PIN_COUNTING_BIAS; - } - - VM_BUG_ON_PAGE(page_ref_count(page) < refs, page); - /* - * Calling put_page() for each ref is unnecessarily slow. Only the last - * ref needs a put_page(). - */ - if (refs > 1) - page_ref_sub(page, refs - 1); - put_page(page); -} - static void __maybe_unused undo_dev_pagemap(int *nr, int nr_start, unsigned int flags, struct page **pages) @@ -2621,13 +2514,61 @@ static int __gup_longterm_unlocked(unsigned long start, int nr_pages, return ret; } -static int internal_get_user_pages_fast(unsigned long start, int nr_pages, +static unsigned long lockless_pages_from_mm(unsigned long start, + unsigned long end, + unsigned int gup_flags, + struct page **pages) +{ + unsigned long flags; + int nr_pinned = 0; + unsigned seq; + + if (!IS_ENABLED(CONFIG_HAVE_FAST_GUP) || + !gup_fast_permitted(start, end)) + return 0; + + if (gup_flags & FOLL_PIN) { + seq = raw_read_seqcount(¤t->mm->write_protect_seq); + if (seq & 1) + return 0; + } + + /* + * Disable interrupts. The nested form is used, in order to allow full, + * general purpose use of this routine. + * + * With interrupts disabled, we block page table pages from being freed + * from under us. See struct mmu_table_batch comments in + * include/asm-generic/tlb.h for more details. + * + * We do not adopt an rcu_read_lock() here as we also want to block IPIs + * that come from THPs splitting. + */ + local_irq_save(flags); + gup_pgd_range(start, end, gup_flags, pages, &nr_pinned); + local_irq_restore(flags); + + /* + * When pinning pages for DMA there could be a concurrent write protect + * from fork() via copy_page_range(), in this case always fail fast GUP. + */ + if (gup_flags & FOLL_PIN) { + if (read_seqcount_retry(¤t->mm->write_protect_seq, seq)) { + unpin_user_pages(pages, nr_pinned); + return 0; + } + } + return nr_pinned; +} + +static int internal_get_user_pages_fast(unsigned long start, + unsigned long nr_pages, unsigned int gup_flags, struct page **pages) { - unsigned long addr, len, end; - unsigned long flags; - int nr_pinned = 0, ret = 0; + unsigned long len, end; + unsigned long nr_pinned; + int ret; if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM | FOLL_FORCE | FOLL_PIN | FOLL_GET | @@ -2641,54 +2582,33 @@ static int internal_get_user_pages_fast(unsigned long start, int nr_pages, might_lock_read(¤t->mm->mmap_lock); start = untagged_addr(start) & PAGE_MASK; - addr = start; - len = (unsigned long) nr_pages << PAGE_SHIFT; - end = start + len; - - if (end <= start) + len = nr_pages << PAGE_SHIFT; + if (check_add_overflow(start, len, &end)) return 0; if (unlikely(!access_ok((void __user *)start, len))) return -EFAULT; - /* - * Disable interrupts. The nested form is used, in order to allow - * full, general purpose use of this routine. - * - * With interrupts disabled, we block page table pages from being - * freed from under us. See struct mmu_table_batch comments in - * include/asm-generic/tlb.h for more details. - * - * We do not adopt an rcu_read_lock(.) here as we also want to - * block IPIs that come from THPs splitting. - */ - if (IS_ENABLED(CONFIG_HAVE_FAST_GUP) && gup_fast_permitted(start, end)) { - unsigned long fast_flags = gup_flags; - - local_irq_save(flags); - gup_pgd_range(addr, end, fast_flags, pages, &nr_pinned); - local_irq_restore(flags); - ret = nr_pinned; - } - - if (nr_pinned < nr_pages && !(gup_flags & FOLL_FAST_ONLY)) { - /* Try to get the remaining pages with get_user_pages */ - start += nr_pinned << PAGE_SHIFT; - pages += nr_pinned; + nr_pinned = lockless_pages_from_mm(start, end, gup_flags, pages); + if (nr_pinned == nr_pages || gup_flags & FOLL_FAST_ONLY) + return nr_pinned; - ret = __gup_longterm_unlocked(start, nr_pages - nr_pinned, - gup_flags, pages); - - /* Have to be a bit careful with return values */ - if (nr_pinned > 0) { - if (ret < 0) - ret = nr_pinned; - else - ret += nr_pinned; - } + /* Slow path: try to get the remaining pages with get_user_pages */ + start += nr_pinned << PAGE_SHIFT; + pages += nr_pinned; + ret = __gup_longterm_unlocked(start, nr_pages - nr_pinned, gup_flags, + pages); + if (ret < 0) { + /* + * The caller has to unpin the pages we already pinned so + * returning -errno is not an option + */ + if (nr_pinned) + return nr_pinned; + return ret; } - - return ret; + return ret + nr_pinned; } + /** * get_user_pages_fast_only() - pin user pages in memory * @start: starting user address diff --git a/mm/gup_benchmark.c b/mm/gup_test.c index 8b3e5b5cd8fa..e3cf78e5873e 100644 --- a/mm/gup_benchmark.c +++ b/mm/gup_test.c @@ -4,40 +4,34 @@ #include <linux/uaccess.h> #include <linux/ktime.h> #include <linux/debugfs.h> - -#define GUP_FAST_BENCHMARK _IOWR('g', 1, struct gup_benchmark) -#define GUP_BENCHMARK _IOWR('g', 2, struct gup_benchmark) -#define PIN_FAST_BENCHMARK _IOWR('g', 3, struct gup_benchmark) -#define PIN_BENCHMARK _IOWR('g', 4, struct gup_benchmark) -#define PIN_LONGTERM_BENCHMARK _IOWR('g', 5, struct gup_benchmark) - -struct gup_benchmark { - __u64 get_delta_usec; - __u64 put_delta_usec; - __u64 addr; - __u64 size; - __u32 nr_pages_per_call; - __u32 flags; - __u64 expansion[10]; /* For future use */ -}; +#include "gup_test.h" static void put_back_pages(unsigned int cmd, struct page **pages, - unsigned long nr_pages) + unsigned long nr_pages, unsigned int gup_test_flags) { unsigned long i; switch (cmd) { case GUP_FAST_BENCHMARK: - case GUP_BENCHMARK: + case GUP_BASIC_TEST: for (i = 0; i < nr_pages; i++) put_page(pages[i]); break; case PIN_FAST_BENCHMARK: - case PIN_BENCHMARK: + case PIN_BASIC_TEST: case PIN_LONGTERM_BENCHMARK: unpin_user_pages(pages, nr_pages); break; + case DUMP_USER_PAGES_TEST: + if (gup_test_flags & GUP_TEST_FLAG_DUMP_PAGES_USE_PIN) { + unpin_user_pages(pages, nr_pages); + } else { + for (i = 0; i < nr_pages; i++) + put_page(pages[i]); + + } + break; } } @@ -49,14 +43,14 @@ static void verify_dma_pinned(unsigned int cmd, struct page **pages, switch (cmd) { case PIN_FAST_BENCHMARK: - case PIN_BENCHMARK: + case PIN_BASIC_TEST: case PIN_LONGTERM_BENCHMARK: for (i = 0; i < nr_pages; i++) { page = pages[i]; if (WARN(!page_maybe_dma_pinned(page), "pages[%lu] is NOT dma-pinned\n", i)) { - dump_page(page, "gup_benchmark failure"); + dump_page(page, "gup_test failure"); break; } } @@ -64,8 +58,39 @@ static void verify_dma_pinned(unsigned int cmd, struct page **pages, } } -static int __gup_benchmark_ioctl(unsigned int cmd, - struct gup_benchmark *gup) +static void dump_pages_test(struct gup_test *gup, struct page **pages, + unsigned long nr_pages) +{ + unsigned int index_to_dump; + unsigned int i; + + /* + * Zero out any user-supplied page index that is out of range. Remember: + * .which_pages[] contains a 1-based set of page indices. + */ + for (i = 0; i < GUP_TEST_MAX_PAGES_TO_DUMP; i++) { + if (gup->which_pages[i] > nr_pages) { + pr_warn("ZEROING due to out of range: .which_pages[%u]: %u\n", + i, gup->which_pages[i]); + gup->which_pages[i] = 0; + } + } + + for (i = 0; i < GUP_TEST_MAX_PAGES_TO_DUMP; i++) { + index_to_dump = gup->which_pages[i]; + + if (index_to_dump) { + index_to_dump--; // Decode from 1-based, to 0-based + pr_info("---- page #%u, starting from user virt addr: 0x%llx\n", + index_to_dump, gup->addr); + dump_page(pages[index_to_dump], + "gup_test: dump_pages() test"); + } + } +} + +static int __gup_test_ioctl(unsigned int cmd, + struct gup_test *gup) { ktime_t start_time, end_time; unsigned long i, nr_pages, addr, next; @@ -109,7 +134,7 @@ static int __gup_benchmark_ioctl(unsigned int cmd, nr = get_user_pages_fast(addr, nr, gup->flags, pages + i); break; - case GUP_BENCHMARK: + case GUP_BASIC_TEST: nr = get_user_pages(addr, nr, gup->flags, pages + i, NULL); break; @@ -117,7 +142,7 @@ static int __gup_benchmark_ioctl(unsigned int cmd, nr = pin_user_pages_fast(addr, nr, gup->flags, pages + i); break; - case PIN_BENCHMARK: + case PIN_BASIC_TEST: nr = pin_user_pages(addr, nr, gup->flags, pages + i, NULL); break; @@ -126,6 +151,14 @@ static int __gup_benchmark_ioctl(unsigned int cmd, gup->flags | FOLL_LONGTERM, pages + i, NULL); break; + case DUMP_USER_PAGES_TEST: + if (gup->flags & GUP_TEST_FLAG_DUMP_PAGES_USE_PIN) + nr = pin_user_pages(addr, nr, gup->flags, + pages + i, NULL); + else + nr = get_user_pages(addr, nr, gup->flags, + pages + i, NULL); + break; default: ret = -EINVAL; goto unlock; @@ -149,9 +182,12 @@ static int __gup_benchmark_ioctl(unsigned int cmd, */ verify_dma_pinned(cmd, pages, nr_pages); + if (cmd == DUMP_USER_PAGES_TEST) + dump_pages_test(gup, pages, nr_pages); + start_time = ktime_get(); - put_back_pages(cmd, pages, nr_pages); + put_back_pages(cmd, pages, nr_pages, gup->flags); end_time = ktime_get(); gup->put_delta_usec = ktime_us_delta(end_time, start_time); @@ -164,18 +200,19 @@ free_pages: return ret; } -static long gup_benchmark_ioctl(struct file *filep, unsigned int cmd, +static long gup_test_ioctl(struct file *filep, unsigned int cmd, unsigned long arg) { - struct gup_benchmark gup; + struct gup_test gup; int ret; switch (cmd) { case GUP_FAST_BENCHMARK: - case GUP_BENCHMARK: case PIN_FAST_BENCHMARK: - case PIN_BENCHMARK: case PIN_LONGTERM_BENCHMARK: + case GUP_BASIC_TEST: + case PIN_BASIC_TEST: + case DUMP_USER_PAGES_TEST: break; default: return -EINVAL; @@ -184,7 +221,7 @@ static long gup_benchmark_ioctl(struct file *filep, unsigned int cmd, if (copy_from_user(&gup, (void __user *)arg, sizeof(gup))) return -EFAULT; - ret = __gup_benchmark_ioctl(cmd, &gup); + ret = __gup_test_ioctl(cmd, &gup); if (ret) return ret; @@ -194,17 +231,17 @@ static long gup_benchmark_ioctl(struct file *filep, unsigned int cmd, return 0; } -static const struct file_operations gup_benchmark_fops = { +static const struct file_operations gup_test_fops = { .open = nonseekable_open, - .unlocked_ioctl = gup_benchmark_ioctl, + .unlocked_ioctl = gup_test_ioctl, }; -static int gup_benchmark_init(void) +static int __init gup_test_init(void) { - debugfs_create_file_unsafe("gup_benchmark", 0600, NULL, NULL, - &gup_benchmark_fops); + debugfs_create_file_unsafe("gup_test", 0600, NULL, NULL, + &gup_test_fops); return 0; } -late_initcall(gup_benchmark_init); +late_initcall(gup_test_init); diff --git a/mm/gup_test.h b/mm/gup_test.h new file mode 100644 index 000000000000..90a6713d50eb --- /dev/null +++ b/mm/gup_test.h @@ -0,0 +1,32 @@ +/* SPDX-License-Identifier: GPL-2.0-or-later */ +#ifndef __GUP_TEST_H +#define __GUP_TEST_H + +#include <linux/types.h> + +#define GUP_FAST_BENCHMARK _IOWR('g', 1, struct gup_test) +#define PIN_FAST_BENCHMARK _IOWR('g', 2, struct gup_test) +#define PIN_LONGTERM_BENCHMARK _IOWR('g', 3, struct gup_test) +#define GUP_BASIC_TEST _IOWR('g', 4, struct gup_test) +#define PIN_BASIC_TEST _IOWR('g', 5, struct gup_test) +#define DUMP_USER_PAGES_TEST _IOWR('g', 6, struct gup_test) + +#define GUP_TEST_MAX_PAGES_TO_DUMP 8 + +#define GUP_TEST_FLAG_DUMP_PAGES_USE_PIN 0x1 + +struct gup_test { + __u64 get_delta_usec; + __u64 put_delta_usec; + __u64 addr; + __u64 size; + __u32 nr_pages_per_call; + __u32 flags; + /* + * Each non-zero entry is the number of the page (1-based: first page is + * page 1, so that zero entries mean "do nothing") from the .addr base. + */ + __u32 which_pages[GUP_TEST_MAX_PAGES_TO_DUMP]; +}; + +#endif /* __GUP_TEST_H */ diff --git a/mm/highmem.c b/mm/highmem.c index 83f9660f168f..c3a9ea7875ef 100644 --- a/mm/highmem.c +++ b/mm/highmem.c @@ -359,6 +359,58 @@ void kunmap_high(struct page *page) wake_up(pkmap_map_wait); } EXPORT_SYMBOL(kunmap_high); + +#ifdef CONFIG_TRANSPARENT_HUGEPAGE +void zero_user_segments(struct page *page, unsigned start1, unsigned end1, + unsigned start2, unsigned end2) +{ + unsigned int i; + + BUG_ON(end1 > page_size(page) || end2 > page_size(page)); + + for (i = 0; i < compound_nr(page); i++) { + void *kaddr = NULL; + + if (start1 < PAGE_SIZE || start2 < PAGE_SIZE) + kaddr = kmap_atomic(page + i); + + if (start1 >= PAGE_SIZE) { + start1 -= PAGE_SIZE; + end1 -= PAGE_SIZE; + } else { + unsigned this_end = min_t(unsigned, end1, PAGE_SIZE); + + if (end1 > start1) + memset(kaddr + start1, 0, this_end - start1); + end1 -= this_end; + start1 = 0; + } + + if (start2 >= PAGE_SIZE) { + start2 -= PAGE_SIZE; + end2 -= PAGE_SIZE; + } else { + unsigned this_end = min_t(unsigned, end2, PAGE_SIZE); + + if (end2 > start2) + memset(kaddr + start2, 0, this_end - start2); + end2 -= this_end; + start2 = 0; + } + + if (kaddr) { + kunmap_atomic(kaddr); + flush_dcache_page(page + i); + } + + if (!end1 && !end2) + break; + } + + BUG_ON((start1 | start2 | end1 | end2) != 0); +} +EXPORT_SYMBOL(zero_user_segments); +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */ #endif /* CONFIG_HIGHMEM */ #ifdef CONFIG_KMAP_LOCAL diff --git a/mm/huge_memory.c b/mm/huge_memory.c index ec2bb93f7431..57d08156acb1 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -163,12 +163,17 @@ static struct shrinker huge_zero_page_shrinker = { static ssize_t enabled_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { + const char *output; + if (test_bit(TRANSPARENT_HUGEPAGE_FLAG, &transparent_hugepage_flags)) - return sprintf(buf, "[always] madvise never\n"); - else if (test_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG, &transparent_hugepage_flags)) - return sprintf(buf, "always [madvise] never\n"); + output = "[always] madvise never"; + else if (test_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG, + &transparent_hugepage_flags)) + output = "always [madvise] never"; else - return sprintf(buf, "always madvise [never]\n"); + output = "always madvise [never]"; + + return sysfs_emit(buf, "%s\n", output); } static ssize_t enabled_store(struct kobject *kobj, @@ -200,11 +205,11 @@ static struct kobj_attribute enabled_attr = __ATTR(enabled, 0644, enabled_show, enabled_store); ssize_t single_hugepage_flag_show(struct kobject *kobj, - struct kobj_attribute *attr, char *buf, - enum transparent_hugepage_flag flag) + struct kobj_attribute *attr, char *buf, + enum transparent_hugepage_flag flag) { - return sprintf(buf, "%d\n", - !!test_bit(flag, &transparent_hugepage_flags)); + return sysfs_emit(buf, "%d\n", + !!test_bit(flag, &transparent_hugepage_flags)); } ssize_t single_hugepage_flag_store(struct kobject *kobj, @@ -232,15 +237,24 @@ ssize_t single_hugepage_flag_store(struct kobject *kobj, static ssize_t defrag_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { - if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags)) - return sprintf(buf, "[always] defer defer+madvise madvise never\n"); - if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags)) - return sprintf(buf, "always [defer] defer+madvise madvise never\n"); - if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags)) - return sprintf(buf, "always defer [defer+madvise] madvise never\n"); - if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags)) - return sprintf(buf, "always defer defer+madvise [madvise] never\n"); - return sprintf(buf, "always defer defer+madvise madvise [never]\n"); + const char *output; + + if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, + &transparent_hugepage_flags)) + output = "[always] defer defer+madvise madvise never"; + else if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, + &transparent_hugepage_flags)) + output = "always [defer] defer+madvise madvise never"; + else if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, + &transparent_hugepage_flags)) + output = "always defer [defer+madvise] madvise never"; + else if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, + &transparent_hugepage_flags)) + output = "always defer defer+madvise [madvise] never"; + else + output = "always defer defer+madvise madvise [never]"; + + return sysfs_emit(buf, "%s\n", output); } static ssize_t defrag_store(struct kobject *kobj, @@ -281,10 +295,10 @@ static struct kobj_attribute defrag_attr = __ATTR(defrag, 0644, defrag_show, defrag_store); static ssize_t use_zero_page_show(struct kobject *kobj, - struct kobj_attribute *attr, char *buf) + struct kobj_attribute *attr, char *buf) { return single_hugepage_flag_show(kobj, attr, buf, - TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG); + TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG); } static ssize_t use_zero_page_store(struct kobject *kobj, struct kobj_attribute *attr, const char *buf, size_t count) @@ -296,9 +310,9 @@ static struct kobj_attribute use_zero_page_attr = __ATTR(use_zero_page, 0644, use_zero_page_show, use_zero_page_store); static ssize_t hpage_pmd_size_show(struct kobject *kobj, - struct kobj_attribute *attr, char *buf) + struct kobj_attribute *attr, char *buf) { - return sprintf(buf, "%lu\n", HPAGE_PMD_SIZE); + return sysfs_emit(buf, "%lu\n", HPAGE_PMD_SIZE); } static struct kobj_attribute hpage_pmd_size_attr = __ATTR_RO(hpage_pmd_size); @@ -2321,7 +2335,7 @@ void vma_adjust_trans_huge(struct vm_area_struct *vma, static void unmap_page(struct page *page) { - enum ttu_flags ttu_flags = TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS | + enum ttu_flags ttu_flags = TTU_IGNORE_MLOCK | TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD; bool unmap_success; @@ -2710,9 +2724,9 @@ int split_huge_page_to_list(struct page *page, struct list_head *list) spin_unlock(&ds_queue->split_queue_lock); if (mapping) { if (PageSwapBacked(head)) - __dec_node_page_state(head, NR_SHMEM_THPS); + __dec_lruvec_page_state(head, NR_SHMEM_THPS); else - __dec_node_page_state(head, NR_FILE_THPS); + __dec_lruvec_page_state(head, NR_FILE_THPS); } __split_huge_page(page, list, end, flags); diff --git a/mm/hugetlb.c b/mm/hugetlb.c index d029d938d26d..cbf32d2824fd 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -1944,13 +1944,14 @@ struct page *alloc_huge_page_vma(struct hstate *h, struct vm_area_struct *vma, * Increase the hugetlb pool such that it can accommodate a reservation * of size 'delta'. */ -static int gather_surplus_pages(struct hstate *h, int delta) +static int gather_surplus_pages(struct hstate *h, long delta) __must_hold(&hugetlb_lock) { struct list_head surplus_list; struct page *page, *tmp; - int ret, i; - int needed, allocated; + int ret; + long i; + long needed, allocated; bool alloc_ok = true; needed = (h->resv_huge_pages + delta) - h->free_huge_pages; @@ -2014,8 +2015,7 @@ retry: * This page is now managed by the hugetlb allocator and has * no users -- drop the buddy allocator's reference. */ - put_page_testzero(page); - VM_BUG_ON_PAGE(page_count(page), page); + VM_BUG_ON_PAGE(!put_page_testzero(page), page); enqueue_huge_page(h, page); } free: @@ -2760,7 +2760,7 @@ static ssize_t nr_hugepages_show_common(struct kobject *kobj, else nr_huge_pages = h->nr_huge_pages_node[nid]; - return sprintf(buf, "%lu\n", nr_huge_pages); + return sysfs_emit(buf, "%lu\n", nr_huge_pages); } static ssize_t __nr_hugepages_store_common(bool obey_mempolicy, @@ -2833,7 +2833,8 @@ HSTATE_ATTR(nr_hugepages); * huge page alloc/free. */ static ssize_t nr_hugepages_mempolicy_show(struct kobject *kobj, - struct kobj_attribute *attr, char *buf) + struct kobj_attribute *attr, + char *buf) { return nr_hugepages_show_common(kobj, attr, buf); } @@ -2851,7 +2852,7 @@ static ssize_t nr_overcommit_hugepages_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { struct hstate *h = kobj_to_hstate(kobj, NULL); - return sprintf(buf, "%lu\n", h->nr_overcommit_huge_pages); + return sysfs_emit(buf, "%lu\n", h->nr_overcommit_huge_pages); } static ssize_t nr_overcommit_hugepages_store(struct kobject *kobj, @@ -2889,7 +2890,7 @@ static ssize_t free_hugepages_show(struct kobject *kobj, else free_huge_pages = h->free_huge_pages_node[nid]; - return sprintf(buf, "%lu\n", free_huge_pages); + return sysfs_emit(buf, "%lu\n", free_huge_pages); } HSTATE_ATTR_RO(free_hugepages); @@ -2897,7 +2898,7 @@ static ssize_t resv_hugepages_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { struct hstate *h = kobj_to_hstate(kobj, NULL); - return sprintf(buf, "%lu\n", h->resv_huge_pages); + return sysfs_emit(buf, "%lu\n", h->resv_huge_pages); } HSTATE_ATTR_RO(resv_hugepages); @@ -2914,7 +2915,7 @@ static ssize_t surplus_hugepages_show(struct kobject *kobj, else surplus_huge_pages = h->surplus_huge_pages_node[nid]; - return sprintf(buf, "%lu\n", surplus_huge_pages); + return sysfs_emit(buf, "%lu\n", surplus_huge_pages); } HSTATE_ATTR_RO(surplus_hugepages); @@ -3198,8 +3199,6 @@ void __init hugetlb_add_hstate(unsigned int order) h = &hstates[hugetlb_max_hstate++]; h->order = order; h->mask = ~((1ULL << (order + PAGE_SHIFT)) - 1); - h->nr_huge_pages = 0; - h->free_huge_pages = 0; for (i = 0; i < MAX_NUMNODES; ++i) INIT_LIST_HEAD(&h->hugepage_freelists[i]); INIT_LIST_HEAD(&h->hugepage_activelist); @@ -3673,7 +3672,7 @@ const struct vm_operations_struct hugetlb_vm_ops = { .fault = hugetlb_vm_op_fault, .open = hugetlb_vm_op_open, .close = hugetlb_vm_op_close, - .split = hugetlb_vm_op_split, + .may_split = hugetlb_vm_op_split, .pagesize = hugetlb_vm_op_pagesize, }; @@ -5115,6 +5114,7 @@ int hugetlb_reserve_pages(struct inode *inode, if (unlikely(add < 0)) { hugetlb_acct_memory(h, -gbl_reserve); + ret = add; goto out_put_pages; } else if (unlikely(chg > add)) { /* diff --git a/mm/init-mm.c b/mm/init-mm.c index 3a613c85f9ed..153162669f80 100644 --- a/mm/init-mm.c +++ b/mm/init-mm.c @@ -31,6 +31,7 @@ struct mm_struct init_mm = { .pgd = swapper_pg_dir, .mm_users = ATOMIC_INIT(2), .mm_count = ATOMIC_INIT(1), + .write_protect_seq = SEQCNT_ZERO(init_mm.write_protect_seq), MMAP_LOCK_INITIALIZER(init_mm) .page_table_lock = __SPIN_LOCK_UNLOCKED(init_mm.page_table_lock), .arg_lock = __SPIN_LOCK_UNLOCKED(init_mm.arg_lock), diff --git a/mm/internal.h b/mm/internal.h index c43ccdddb0f6..25d2b2439f19 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -199,8 +199,13 @@ extern void post_alloc_hook(struct page *page, unsigned int order, gfp_t gfp_flags); extern int user_min_free_kbytes; +extern void free_unref_page(struct page *page); +extern void free_unref_page_list(struct list_head *list); + extern void zone_pcp_update(struct zone *zone); extern void zone_pcp_reset(struct zone *zone); +extern void zone_pcp_disable(struct zone *zone); +extern void zone_pcp_enable(struct zone *zone); #if defined CONFIG_COMPACTION || defined CONFIG_CMA diff --git a/mm/kasan/generic.c b/mm/kasan/generic.c index 248264b9cb76..30c0a5038b5c 100644 --- a/mm/kasan/generic.c +++ b/mm/kasan/generic.c @@ -339,9 +339,6 @@ void kasan_record_aux_stack(void *addr) object = nearest_obj(cache, page, addr); alloc_info = get_alloc_info(cache, object); - /* - * record the last two call_rcu() call stacks. - */ alloc_info->aux_stack[1] = alloc_info->aux_stack[0]; alloc_info->aux_stack[0] = kasan_save_stack(GFP_NOWAIT); } diff --git a/mm/kasan/report.c b/mm/kasan/report.c index 00a53f1355ae..5a0102f37171 100644 --- a/mm/kasan/report.c +++ b/mm/kasan/report.c @@ -185,12 +185,12 @@ static void describe_object(struct kmem_cache *cache, void *object, #ifdef CONFIG_KASAN_GENERIC if (alloc_info->aux_stack[0]) { - pr_err("Last call_rcu():\n"); + pr_err("Last potentially related work creation:\n"); print_stack(alloc_info->aux_stack[0]); pr_err("\n"); } if (alloc_info->aux_stack[1]) { - pr_err("Second to last call_rcu():\n"); + pr_err("Second to last potentially related work creation:\n"); print_stack(alloc_info->aux_stack[1]); pr_err("\n"); } diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 4e3dff13eb70..ad316d2e1fee 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -90,6 +90,8 @@ static struct kmem_cache *mm_slot_cache __read_mostly; * @hash: hash collision list * @mm_node: khugepaged scan list headed in khugepaged_scan.mm_head * @mm: the mm that this information is valid for + * @nr_pte_mapped_thp: number of pte mapped THP + * @pte_mapped_thp: address array corresponding pte mapped THP */ struct mm_slot { struct hlist_node hash; @@ -124,18 +126,18 @@ static ssize_t scan_sleep_millisecs_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { - return sprintf(buf, "%u\n", khugepaged_scan_sleep_millisecs); + return sysfs_emit(buf, "%u\n", khugepaged_scan_sleep_millisecs); } static ssize_t scan_sleep_millisecs_store(struct kobject *kobj, struct kobj_attribute *attr, const char *buf, size_t count) { - unsigned long msecs; + unsigned int msecs; int err; - err = kstrtoul(buf, 10, &msecs); - if (err || msecs > UINT_MAX) + err = kstrtouint(buf, 10, &msecs); + if (err) return -EINVAL; khugepaged_scan_sleep_millisecs = msecs; @@ -152,18 +154,18 @@ static ssize_t alloc_sleep_millisecs_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { - return sprintf(buf, "%u\n", khugepaged_alloc_sleep_millisecs); + return sysfs_emit(buf, "%u\n", khugepaged_alloc_sleep_millisecs); } static ssize_t alloc_sleep_millisecs_store(struct kobject *kobj, struct kobj_attribute *attr, const char *buf, size_t count) { - unsigned long msecs; + unsigned int msecs; int err; - err = kstrtoul(buf, 10, &msecs); - if (err || msecs > UINT_MAX) + err = kstrtouint(buf, 10, &msecs); + if (err) return -EINVAL; khugepaged_alloc_sleep_millisecs = msecs; @@ -180,17 +182,17 @@ static ssize_t pages_to_scan_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { - return sprintf(buf, "%u\n", khugepaged_pages_to_scan); + return sysfs_emit(buf, "%u\n", khugepaged_pages_to_scan); } static ssize_t pages_to_scan_store(struct kobject *kobj, struct kobj_attribute *attr, const char *buf, size_t count) { + unsigned int pages; int err; - unsigned long pages; - err = kstrtoul(buf, 10, &pages); - if (err || !pages || pages > UINT_MAX) + err = kstrtouint(buf, 10, &pages); + if (err || !pages) return -EINVAL; khugepaged_pages_to_scan = pages; @@ -205,7 +207,7 @@ static ssize_t pages_collapsed_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { - return sprintf(buf, "%u\n", khugepaged_pages_collapsed); + return sysfs_emit(buf, "%u\n", khugepaged_pages_collapsed); } static struct kobj_attribute pages_collapsed_attr = __ATTR_RO(pages_collapsed); @@ -214,7 +216,7 @@ static ssize_t full_scans_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { - return sprintf(buf, "%u\n", khugepaged_full_scans); + return sysfs_emit(buf, "%u\n", khugepaged_full_scans); } static struct kobj_attribute full_scans_attr = __ATTR_RO(full_scans); @@ -223,7 +225,7 @@ static ssize_t khugepaged_defrag_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { return single_hugepage_flag_show(kobj, attr, buf, - TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG); + TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG); } static ssize_t khugepaged_defrag_store(struct kobject *kobj, struct kobj_attribute *attr, @@ -248,7 +250,7 @@ static ssize_t khugepaged_max_ptes_none_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { - return sprintf(buf, "%u\n", khugepaged_max_ptes_none); + return sysfs_emit(buf, "%u\n", khugepaged_max_ptes_none); } static ssize_t khugepaged_max_ptes_none_store(struct kobject *kobj, struct kobj_attribute *attr, @@ -273,7 +275,7 @@ static ssize_t khugepaged_max_ptes_swap_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { - return sprintf(buf, "%u\n", khugepaged_max_ptes_swap); + return sysfs_emit(buf, "%u\n", khugepaged_max_ptes_swap); } static ssize_t khugepaged_max_ptes_swap_store(struct kobject *kobj, @@ -297,10 +299,10 @@ static struct kobj_attribute khugepaged_max_ptes_swap_attr = khugepaged_max_ptes_swap_store); static ssize_t khugepaged_max_ptes_shared_show(struct kobject *kobj, - struct kobj_attribute *attr, - char *buf) + struct kobj_attribute *attr, + char *buf) { - return sprintf(buf, "%u\n", khugepaged_max_ptes_shared); + return sysfs_emit(buf, "%u\n", khugepaged_max_ptes_shared); } static ssize_t khugepaged_max_ptes_shared_store(struct kobject *kobj, @@ -1414,7 +1416,11 @@ static int khugepaged_add_pte_mapped_thp(struct mm_struct *mm, } /** - * Try to collapse a pte-mapped THP for mm at address haddr. + * collapse_pte_mapped_thp - Try to collapse a pte-mapped THP for mm at + * address haddr. + * + * @mm: process address space where collapse happens + * @addr: THP collapse address * * This function checks whether all the PTEs in the PMD are pointing to the * right THP. If so, retract the page table so the THP can refault in with @@ -1605,6 +1611,12 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff) /** * collapse_file - collapse filemap/tmpfs/shmem pages into huge one. * + * @mm: process address space where collapse happens + * @file: file that collapse on + * @start: collapse start address + * @hpage: new allocated huge page for collapse + * @node: appointed node the new huge page allocate from + * * Basic scheme is simple, details are more complex: * - allocate and lock a new huge page; * - scan page cache replacing old pages with the new one @@ -1845,9 +1857,9 @@ out_unlock: } if (is_shmem) - __inc_node_page_state(new_page, NR_SHMEM_THPS); + __inc_lruvec_page_state(new_page, NR_SHMEM_THPS); else { - __inc_node_page_state(new_page, NR_FILE_THPS); + __inc_lruvec_page_state(new_page, NR_FILE_THPS); filemap_nr_thps_inc(mapping); } @@ -2833,18 +2833,18 @@ static void wait_while_offlining(void) static ssize_t sleep_millisecs_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { - return sprintf(buf, "%u\n", ksm_thread_sleep_millisecs); + return sysfs_emit(buf, "%u\n", ksm_thread_sleep_millisecs); } static ssize_t sleep_millisecs_store(struct kobject *kobj, struct kobj_attribute *attr, const char *buf, size_t count) { - unsigned long msecs; + unsigned int msecs; int err; - err = kstrtoul(buf, 10, &msecs); - if (err || msecs > UINT_MAX) + err = kstrtouint(buf, 10, &msecs); + if (err) return -EINVAL; ksm_thread_sleep_millisecs = msecs; @@ -2857,18 +2857,18 @@ KSM_ATTR(sleep_millisecs); static ssize_t pages_to_scan_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { - return sprintf(buf, "%u\n", ksm_thread_pages_to_scan); + return sysfs_emit(buf, "%u\n", ksm_thread_pages_to_scan); } static ssize_t pages_to_scan_store(struct kobject *kobj, struct kobj_attribute *attr, const char *buf, size_t count) { + unsigned int nr_pages; int err; - unsigned long nr_pages; - err = kstrtoul(buf, 10, &nr_pages); - if (err || nr_pages > UINT_MAX) + err = kstrtouint(buf, 10, &nr_pages); + if (err) return -EINVAL; ksm_thread_pages_to_scan = nr_pages; @@ -2880,17 +2880,17 @@ KSM_ATTR(pages_to_scan); static ssize_t run_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { - return sprintf(buf, "%lu\n", ksm_run); + return sysfs_emit(buf, "%lu\n", ksm_run); } static ssize_t run_store(struct kobject *kobj, struct kobj_attribute *attr, const char *buf, size_t count) { + unsigned int flags; int err; - unsigned long flags; - err = kstrtoul(buf, 10, &flags); - if (err || flags > UINT_MAX) + err = kstrtouint(buf, 10, &flags); + if (err) return -EINVAL; if (flags > KSM_RUN_UNMERGE) return -EINVAL; @@ -2927,9 +2927,9 @@ KSM_ATTR(run); #ifdef CONFIG_NUMA static ssize_t merge_across_nodes_show(struct kobject *kobj, - struct kobj_attribute *attr, char *buf) + struct kobj_attribute *attr, char *buf) { - return sprintf(buf, "%u\n", ksm_merge_across_nodes); + return sysfs_emit(buf, "%u\n", ksm_merge_across_nodes); } static ssize_t merge_across_nodes_store(struct kobject *kobj, @@ -2984,9 +2984,9 @@ KSM_ATTR(merge_across_nodes); #endif static ssize_t use_zero_pages_show(struct kobject *kobj, - struct kobj_attribute *attr, char *buf) + struct kobj_attribute *attr, char *buf) { - return sprintf(buf, "%u\n", ksm_use_zero_pages); + return sysfs_emit(buf, "%u\n", ksm_use_zero_pages); } static ssize_t use_zero_pages_store(struct kobject *kobj, struct kobj_attribute *attr, @@ -3008,7 +3008,7 @@ KSM_ATTR(use_zero_pages); static ssize_t max_page_sharing_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { - return sprintf(buf, "%u\n", ksm_max_page_sharing); + return sysfs_emit(buf, "%u\n", ksm_max_page_sharing); } static ssize_t max_page_sharing_store(struct kobject *kobj, @@ -3049,21 +3049,21 @@ KSM_ATTR(max_page_sharing); static ssize_t pages_shared_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { - return sprintf(buf, "%lu\n", ksm_pages_shared); + return sysfs_emit(buf, "%lu\n", ksm_pages_shared); } KSM_ATTR_RO(pages_shared); static ssize_t pages_sharing_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { - return sprintf(buf, "%lu\n", ksm_pages_sharing); + return sysfs_emit(buf, "%lu\n", ksm_pages_sharing); } KSM_ATTR_RO(pages_sharing); static ssize_t pages_unshared_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { - return sprintf(buf, "%lu\n", ksm_pages_unshared); + return sysfs_emit(buf, "%lu\n", ksm_pages_unshared); } KSM_ATTR_RO(pages_unshared); @@ -3080,21 +3080,21 @@ static ssize_t pages_volatile_show(struct kobject *kobj, */ if (ksm_pages_volatile < 0) ksm_pages_volatile = 0; - return sprintf(buf, "%ld\n", ksm_pages_volatile); + return sysfs_emit(buf, "%ld\n", ksm_pages_volatile); } KSM_ATTR_RO(pages_volatile); static ssize_t stable_node_dups_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { - return sprintf(buf, "%lu\n", ksm_stable_node_dups); + return sysfs_emit(buf, "%lu\n", ksm_stable_node_dups); } KSM_ATTR_RO(stable_node_dups); static ssize_t stable_node_chains_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { - return sprintf(buf, "%lu\n", ksm_stable_node_chains); + return sysfs_emit(buf, "%lu\n", ksm_stable_node_chains); } KSM_ATTR_RO(stable_node_chains); @@ -3103,7 +3103,7 @@ stable_node_chains_prune_millisecs_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { - return sprintf(buf, "%u\n", ksm_stable_node_chains_prune_millisecs); + return sysfs_emit(buf, "%u\n", ksm_stable_node_chains_prune_millisecs); } static ssize_t @@ -3127,7 +3127,7 @@ KSM_ATTR(stable_node_chains_prune_millisecs); static ssize_t full_scans_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { - return sprintf(buf, "%lu\n", ksm_scan.seqnr); + return sysfs_emit(buf, "%lu\n", ksm_scan.seqnr); } KSM_ATTR_RO(full_scans); diff --git a/mm/madvise.c b/mm/madvise.c index 13f5677b9322..6a660858784b 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -877,7 +877,6 @@ static long madvise_remove(struct vm_area_struct *vma, static int madvise_inject_error(int behavior, unsigned long start, unsigned long end) { - struct zone *zone; unsigned long size; if (!capable(CAP_SYS_ADMIN)) @@ -908,24 +907,13 @@ static int madvise_inject_error(int behavior, } else { pr_info("Injecting memory failure for pfn %#lx at process virtual address %#lx\n", pfn, start); - /* - * Drop the page reference taken by get_user_pages_fast(). In - * the absence of MF_COUNT_INCREASED the memory_failure() - * routine is responsible for pinning the page to prevent it - * from being released back to the page allocator. - */ - put_page(page); - ret = memory_failure(pfn, 0); + ret = memory_failure(pfn, MF_COUNT_INCREASED); } if (ret) return ret; } - /* Ensure that all poisoned pages are removed from per-cpu lists */ - for_each_populated_zone(zone) - drain_all_pages(zone); - return 0; } #endif diff --git a/mm/mapping_dirty_helpers.c b/mm/mapping_dirty_helpers.c index 2c7d03675903..b59054ef2e10 100644 --- a/mm/mapping_dirty_helpers.c +++ b/mm/mapping_dirty_helpers.c @@ -23,7 +23,8 @@ struct wp_walk { /** * wp_pte - Write-protect a pte * @pte: Pointer to the pte - * @addr: The virtual page address + * @addr: The start of protecting virtual address + * @end: The end of protecting virtual address * @walk: pagetable walk callback argument * * The function write-protects a pte and records the range in @@ -74,7 +75,8 @@ struct clean_walk { * clean_record_pte - Clean a pte and record its address space offset in a * bitmap * @pte: Pointer to the pte - * @addr: The virtual page address + * @addr: The start of virtual address to be clean + * @end: The end of virtual address to be clean * @walk: pagetable walk callback argument * * The function cleans a pte and records the range in diff --git a/mm/memblock.c b/mm/memblock.c index b68ee86788af..049df4163a97 100644 --- a/mm/memblock.c +++ b/mm/memblock.c @@ -1926,6 +1926,85 @@ static int __init early_memblock(char *p) } early_param("memblock", early_memblock); +static void __init free_memmap(unsigned long start_pfn, unsigned long end_pfn) +{ + struct page *start_pg, *end_pg; + phys_addr_t pg, pgend; + + /* + * Convert start_pfn/end_pfn to a struct page pointer. + */ + start_pg = pfn_to_page(start_pfn - 1) + 1; + end_pg = pfn_to_page(end_pfn - 1) + 1; + + /* + * Convert to physical addresses, and round start upwards and end + * downwards. + */ + pg = PAGE_ALIGN(__pa(start_pg)); + pgend = __pa(end_pg) & PAGE_MASK; + + /* + * If there are free pages between these, free the section of the + * memmap array. + */ + if (pg < pgend) + memblock_free(pg, pgend - pg); +} + +/* + * The mem_map array can get very big. Free the unused area of the memory map. + */ +static void __init free_unused_memmap(void) +{ + unsigned long start, end, prev_end = 0; + int i; + + if (!IS_ENABLED(CONFIG_HAVE_ARCH_PFN_VALID) || + IS_ENABLED(CONFIG_SPARSEMEM_VMEMMAP)) + return; + + /* + * This relies on each bank being in address order. + * The banks are sorted previously in bootmem_init(). + */ + for_each_mem_pfn_range(i, MAX_NUMNODES, &start, &end, NULL) { +#ifdef CONFIG_SPARSEMEM + /* + * Take care not to free memmap entries that don't exist + * due to SPARSEMEM sections which aren't present. + */ + start = min(start, ALIGN(prev_end, PAGES_PER_SECTION)); +#else + /* + * Align down here since the VM subsystem insists that the + * memmap entries are valid from the bank start aligned to + * MAX_ORDER_NR_PAGES. + */ + start = round_down(start, MAX_ORDER_NR_PAGES); +#endif + + /* + * If we had a previous bank, and there is a space + * between the current bank and the previous, free it. + */ + if (prev_end && prev_end < start) + free_memmap(prev_end, start); + + /* + * Align up here since the VM subsystem insists that the + * memmap entries are valid from the bank end aligned to + * MAX_ORDER_NR_PAGES. + */ + prev_end = ALIGN(end, MAX_ORDER_NR_PAGES); + } + +#ifdef CONFIG_SPARSEMEM + if (!IS_ALIGNED(prev_end, PAGES_PER_SECTION)) + free_memmap(prev_end, ALIGN(prev_end, PAGES_PER_SECTION)); +#endif +} + static void __init __free_pages_memory(unsigned long start, unsigned long end) { int order; @@ -2012,6 +2091,7 @@ unsigned long __init memblock_free_all(void) { unsigned long pages; + free_unused_memmap(); reset_all_zones_managed_pages(); pages = free_low_memory_core_early(); diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 29459a6ce1c7..b9419a3605eb 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -623,14 +623,9 @@ static void __mem_cgroup_insert_exceeded(struct mem_cgroup_per_node *mz, if (mz->usage_in_excess < mz_node->usage_in_excess) { p = &(*p)->rb_left; rightmost = false; - } - - /* - * We can't avoid mem cgroups that are over their soft - * limit by the same amount - */ - else if (mz->usage_in_excess >= mz_node->usage_in_excess) + } else { p = &(*p)->rb_right; + } } if (rightmost) @@ -858,7 +853,25 @@ void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx, __mod_memcg_lruvec_state(lruvec, idx, val); } -void __mod_lruvec_slab_state(void *p, enum node_stat_item idx, int val) +void __mod_lruvec_page_state(struct page *page, enum node_stat_item idx, + int val) +{ + struct page *head = compound_head(page); /* rmap on tail pages */ + pg_data_t *pgdat = page_pgdat(page); + struct lruvec *lruvec; + + /* Untracked pages have no memcg, no lruvec. Update only the node */ + if (!head->mem_cgroup) { + __mod_node_page_state(pgdat, idx, val); + return; + } + + lruvec = mem_cgroup_lruvec(head->mem_cgroup, pgdat); + __mod_lruvec_state(lruvec, idx, val); +} +EXPORT_SYMBOL(__mod_lruvec_page_state); + +void __mod_lruvec_kmem_state(void *p, enum node_stat_item idx, int val) { pg_data_t *pgdat = page_pgdat(virt_to_page(p)); struct mem_cgroup *memcg; @@ -882,17 +895,6 @@ void __mod_lruvec_slab_state(void *p, enum node_stat_item idx, int val) rcu_read_unlock(); } -void mod_memcg_obj_state(void *p, int idx, int val) -{ - struct mem_cgroup *memcg; - - rcu_read_lock(); - memcg = mem_cgroup_from_obj(p); - if (memcg) - mod_memcg_state(memcg, idx, val); - rcu_read_unlock(); -} - /** * __count_memcg_events - account VM events in a cgroup * @memcg: the memory cgroup @@ -1157,12 +1159,6 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, if (prev && !reclaim) pos = prev; - if (!root->use_hierarchy && root != root_mem_cgroup) { - if (prev) - goto out; - return root; - } - rcu_read_lock(); if (reclaim) { @@ -1242,7 +1238,6 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, out_unlock: rcu_read_unlock(); -out: if (prev && prev != root) css_put(&prev->css); @@ -1340,7 +1335,7 @@ int mem_cgroup_scan_tasks(struct mem_cgroup *memcg, * @page: the page * @pgdat: pgdat of the page * - * This function relies on page->mem_cgroup being stable - see the + * This function relies on page's memcg being stable - see the * access rules in commit_charge(). */ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgdat) @@ -1499,6 +1494,7 @@ static struct memory_stat memory_stats[] = { { "anon", PAGE_SIZE, NR_ANON_MAPPED }, { "file", PAGE_SIZE, NR_FILE_PAGES }, { "kernel_stack", 1024, NR_KERNEL_STACK_KB }, + { "pagetables", PAGE_SIZE, NR_PAGETABLE }, { "percpu", 1, MEMCG_PERCPU_B }, { "sock", PAGE_SIZE, MEMCG_SOCK }, { "shmem", PAGE_SIZE, NR_SHMEM }, @@ -1512,6 +1508,8 @@ static struct memory_stat memory_stats[] = { * constant(e.g. powerpc). */ { "anon_thp", 0, NR_ANON_THPS }, + { "file_thp", 0, NR_FILE_THPS }, + { "shmem_thp", 0, NR_SHMEM_THPS }, #endif { "inactive_anon", PAGE_SIZE, NR_INACTIVE_ANON }, { "active_anon", PAGE_SIZE, NR_ACTIVE_ANON }, @@ -1542,7 +1540,9 @@ static int __init memory_stats_init(void) for (i = 0; i < ARRAY_SIZE(memory_stats); i++) { #ifdef CONFIG_TRANSPARENT_HUGEPAGE - if (memory_stats[i].idx == NR_ANON_THPS) + if (memory_stats[i].idx == NR_ANON_THPS || + memory_stats[i].idx == NR_FILE_THPS || + memory_stats[i].idx == NR_SHMEM_THPS) memory_stats[i].ratio = HPAGE_PMD_SIZE; #endif VM_BUG_ON(!memory_stats[i].ratio); @@ -2891,7 +2891,7 @@ static void commit_charge(struct page *page, struct mem_cgroup *memcg) { VM_BUG_ON_PAGE(page->mem_cgroup, page); /* - * Any of the following ensures page->mem_cgroup stability: + * Any of the following ensures page's memcg stability: * * - the page lock * - LRU isolation @@ -2987,6 +2987,7 @@ __always_inline struct obj_cgroup *get_obj_cgroup_from_current(void) objcg = rcu_dereference(memcg->objcg); if (objcg && obj_cgroup_tryget(objcg)) break; + objcg = NULL; } rcu_read_unlock(); @@ -3246,8 +3247,10 @@ int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size) * independently later. */ rcu_read_lock(); +retry: memcg = obj_cgroup_memcg(objcg); - css_get(&memcg->css); + if (unlikely(!css_tryget(&memcg->css))) + goto retry; rcu_read_unlock(); nr_pages = size >> PAGE_SHIFT; @@ -3470,22 +3473,6 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order, } /* - * Test whether @memcg has children, dead or alive. Note that this - * function doesn't care whether @memcg has use_hierarchy enabled and - * returns %true if there are child csses according to the cgroup - * hierarchy. Testing use_hierarchy is the caller's responsibility. - */ -static inline bool memcg_has_children(struct mem_cgroup *memcg) -{ - bool ret; - - rcu_read_lock(); - ret = css_next_child(NULL, &memcg->css); - rcu_read_unlock(); - return ret; -} - -/* * Reclaims as many pages from the given memcg as possible. * * Caller is responsible for holding css reference for memcg. @@ -3533,37 +3520,20 @@ static ssize_t mem_cgroup_force_empty_write(struct kernfs_open_file *of, static u64 mem_cgroup_hierarchy_read(struct cgroup_subsys_state *css, struct cftype *cft) { - return mem_cgroup_from_css(css)->use_hierarchy; + return 1; } static int mem_cgroup_hierarchy_write(struct cgroup_subsys_state *css, struct cftype *cft, u64 val) { - int retval = 0; - struct mem_cgroup *memcg = mem_cgroup_from_css(css); - struct mem_cgroup *parent_memcg = mem_cgroup_from_css(memcg->css.parent); - - if (memcg->use_hierarchy == val) + if (val == 1) return 0; - /* - * If parent's use_hierarchy is set, we can't make any modifications - * in the child subtrees. If it is unset, then the change can - * occur, provided the current cgroup has no children. - * - * For the root cgroup, parent_mem is NULL, we allow value to be - * set if there are no children. - */ - if ((!parent_memcg || !parent_memcg->use_hierarchy) && - (val == 1 || val == 0)) { - if (!memcg_has_children(memcg)) - memcg->use_hierarchy = val; - else - retval = -EBUSY; - } else - retval = -EINVAL; + pr_warn_once("Non-hierarchical mode is deprecated. " + "Please report your usecase to linux-mm@kvack.org if you " + "depend on this functionality.\n"); - return retval; + return -EINVAL; } static unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap) @@ -3712,12 +3682,6 @@ static int memcg_online_kmem(struct mem_cgroup *memcg) static_branch_enable(&memcg_kmem_enabled_key); - /* - * A memory cgroup is considered kmem-online as soon as it gets - * kmemcg_id. Setting the id after enabling static branching will - * guarantee no one starts accounting before all call sites are - * patched. - */ memcg->kmemcg_id = memcg_id; memcg->kmem_state = KMEM_ONLINE; @@ -3757,8 +3721,6 @@ static void memcg_offline_kmem(struct mem_cgroup *memcg) child = mem_cgroup_from_css(css); BUG_ON(child->kmemcg_id != kmemcg_id); child->kmemcg_id = parent->kmemcg_id; - if (!memcg->use_hierarchy) - break; } rcu_read_unlock(); @@ -5349,38 +5311,22 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css) if (parent) { memcg->swappiness = mem_cgroup_swappiness(parent); memcg->oom_kill_disable = parent->oom_kill_disable; - } - if (!parent) { - page_counter_init(&memcg->memory, NULL); - page_counter_init(&memcg->swap, NULL); - page_counter_init(&memcg->kmem, NULL); - page_counter_init(&memcg->tcpmem, NULL); - } else if (parent->use_hierarchy) { - memcg->use_hierarchy = true; + page_counter_init(&memcg->memory, &parent->memory); page_counter_init(&memcg->swap, &parent->swap); page_counter_init(&memcg->kmem, &parent->kmem); page_counter_init(&memcg->tcpmem, &parent->tcpmem); } else { - page_counter_init(&memcg->memory, &root_mem_cgroup->memory); - page_counter_init(&memcg->swap, &root_mem_cgroup->swap); - page_counter_init(&memcg->kmem, &root_mem_cgroup->kmem); - page_counter_init(&memcg->tcpmem, &root_mem_cgroup->tcpmem); - /* - * Deeper hierachy with use_hierarchy == false doesn't make - * much sense so let cgroup subsystem know about this - * unfortunate state in our controller. - */ - if (parent != root_mem_cgroup) - memory_cgrp_subsys.broken_hierarchy = true; - } + page_counter_init(&memcg->memory, NULL); + page_counter_init(&memcg->swap, NULL); + page_counter_init(&memcg->kmem, NULL); + page_counter_init(&memcg->tcpmem, NULL); - /* The following stuff does not apply to the root */ - if (!parent) { root_mem_cgroup = memcg; return &memcg->css; } + /* The following stuff does not apply to the root */ error = memcg_online_kmem(memcg); if (error) goto fail; @@ -6217,24 +6163,6 @@ static void mem_cgroup_move_task(void) } #endif -/* - * Cgroup retains root cgroups across [un]mount cycles making it necessary - * to verify whether we're attached to the default hierarchy on each mount - * attempt. - */ -static void mem_cgroup_bind(struct cgroup_subsys_state *root_css) -{ - /* - * use_hierarchy is forced on the default hierarchy. cgroup core - * guarantees that @root doesn't have any children, so turning it - * on for the root memcg is enough. - */ - if (cgroup_subsys_on_dfl(memory_cgrp_subsys)) - root_mem_cgroup->use_hierarchy = true; - else - root_mem_cgroup->use_hierarchy = false; -} - static int seq_puts_memcg_tunable(struct seq_file *m, unsigned long value) { if (value == PAGE_COUNTER_MAX) @@ -6572,7 +6500,6 @@ struct cgroup_subsys memory_cgrp_subsys = { .can_attach = mem_cgroup_can_attach, .cancel_attach = mem_cgroup_cancel_attach, .post_attach = mem_cgroup_move_task, - .bind = mem_cgroup_bind, .dfl_cftypes = memory_files, .legacy_cftypes = mem_cgroup_legacy_files, .early_init = 0, @@ -6995,7 +6922,6 @@ void mem_cgroup_migrate(struct page *oldpage, struct page *newpage) if (newpage->mem_cgroup) return; - /* Swapcache readahead pages can get replaced before being charged */ memcg = oldpage->mem_cgroup; if (!memcg) return; @@ -7354,9 +7280,9 @@ bool mem_cgroup_swap_full(struct page *page) static int __init setup_swap_account(char *s) { if (!strcmp(s, "1")) - cgroup_memory_noswap = 0; + cgroup_memory_noswap = false; else if (!strcmp(s, "0")) - cgroup_memory_noswap = 1; + cgroup_memory_noswap = true; return 1; } __setup("swapaccount=", setup_swap_account); diff --git a/mm/memory-failure.c b/mm/memory-failure.c index 5d880d4eb9a2..5a38e9eade94 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -263,8 +263,8 @@ static int kill_proc(struct to_kill *tk, unsigned long pfn, int flags) } /* - * When a unknown page type is encountered drain as many buffers as possible - * in the hope to turn the page into a LRU or free page, which we can handle. + * Unknown page type encountered. Try to check whether it can turn PageLRU by + * lru_add_drain_all, or a free page by reclaiming slabs when possible. */ void shake_page(struct page *p, int access) { @@ -273,9 +273,6 @@ void shake_page(struct page *p, int access) if (!PageSlab(p)) { lru_add_drain_all(); - if (PageLRU(p)) - return; - drain_all_pages(page_zone(p)); if (PageLRU(p) || is_free_buddy_page(p)) return; } @@ -809,7 +806,7 @@ static int me_swapcache_clean(struct page *p, unsigned long pfn) */ static int me_huge_page(struct page *p, unsigned long pfn) { - int res = 0; + int res; struct page *hpage = compound_head(p); struct address_space *mapping; @@ -820,6 +817,7 @@ static int me_huge_page(struct page *p, unsigned long pfn) if (mapping) { res = truncate_error_page(hpage, pfn, mapping); } else { + res = MF_FAILED; unlock_page(hpage); /* * migration entry prevents later access on error anonymous @@ -828,8 +826,10 @@ static int me_huge_page(struct page *p, unsigned long pfn) */ if (PageAnon(hpage)) put_page(hpage); - dissolve_free_huge_page(p); - res = MF_RECOVERED; + if (!dissolve_free_huge_page(p) && take_page_off_buddy(p)) { + page_ref_inc(p); + res = MF_RECOVERED; + } lock_page(hpage); } @@ -946,13 +946,13 @@ static int page_action(struct page_state *ps, struct page *p, } /** - * get_hwpoison_page() - Get refcount for memory error handling: + * __get_hwpoison_page() - Get refcount for memory error handling: * @page: raw error page (hit by memory error) * * Return: return 0 if failed to grab the refcount, otherwise true (some * non-zero value.) */ -static int get_hwpoison_page(struct page *page) +static int __get_hwpoison_page(struct page *page) { struct page *head = compound_head(page); @@ -983,13 +983,80 @@ static int get_hwpoison_page(struct page *page) } /* + * Safely get reference count of an arbitrary page. + * + * Returns 0 for a free page, 1 for an in-use page, + * -EIO for a page-type we cannot handle and -EBUSY if we raced with an + * allocation. + * We only incremented refcount in case the page was already in-use and it + * is a known type we can handle. + */ +static int get_any_page(struct page *p, unsigned long flags) +{ + int ret = 0, pass = 0; + bool count_increased = false; + + if (flags & MF_COUNT_INCREASED) + count_increased = true; + +try_again: + if (!count_increased && !__get_hwpoison_page(p)) { + if (page_count(p)) { + /* We raced with an allocation, retry. */ + if (pass++ < 3) + goto try_again; + ret = -EBUSY; + } else if (!PageHuge(p) && !is_free_buddy_page(p)) { + /* We raced with put_page, retry. */ + if (pass++ < 3) + goto try_again; + ret = -EIO; + } + } else { + if (PageHuge(p) || PageLRU(p) || __PageMovable(p)) { + ret = 1; + } else { + /* + * A page we cannot handle. Check whether we can turn + * it into something we can handle. + */ + if (pass++ < 3) { + put_page(p); + shake_page(p, 1); + count_increased = false; + goto try_again; + } + put_page(p); + ret = -EIO; + } + } + + return ret; +} + +static int get_hwpoison_page(struct page *p, unsigned long flags, + enum mf_flags ctxt) +{ + int ret; + + zone_pcp_disable(page_zone(p)); + if (ctxt == MF_SOFT_OFFLINE) + ret = get_any_page(p, flags); + else + ret = __get_hwpoison_page(p); + zone_pcp_enable(page_zone(p)); + + return ret; +} + +/* * Do all that is necessary to remove user space mappings. Unmap * the pages and send SIGBUS to the processes if the data was dirty. */ static bool hwpoison_user_mappings(struct page *p, unsigned long pfn, int flags, struct page **hpagep) { - enum ttu_flags ttu = TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS; + enum ttu_flags ttu = TTU_IGNORE_MLOCK; struct address_space *mapping; LIST_HEAD(tokill); bool unmap_success = true; @@ -1162,7 +1229,7 @@ static int memory_failure_hugetlb(unsigned long pfn, int flags) num_poisoned_pages_inc(); - if (!(flags & MF_COUNT_INCREASED) && !get_hwpoison_page(p)) { + if (!(flags & MF_COUNT_INCREASED) && !get_hwpoison_page(p, flags, 0)) { /* * Check "filter hit" and "race with other subpage." */ @@ -1176,9 +1243,13 @@ static int memory_failure_hugetlb(unsigned long pfn, int flags) } } unlock_page(head); - dissolve_free_huge_page(p); - action_result(pfn, MF_MSG_FREE_HUGE, MF_DELAYED); - return 0; + res = MF_FAILED; + if (!dissolve_free_huge_page(p) && take_page_off_buddy(p)) { + page_ref_inc(p); + res = MF_RECOVERED; + } + action_result(pfn, MF_MSG_FREE_HUGE, res); + return res == MF_RECOVERED ? 0 : -EBUSY; } lock_page(head); @@ -1231,6 +1302,12 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags, loff_t start; dax_entry_t cookie; + if (flags & MF_COUNT_INCREASED) + /* + * Drop the extra refcount in case we come from madvise(). + */ + put_page(page); + /* * Prevent the inode from being freed while we are interrogating * the address_space, typically this would be handled by @@ -1319,6 +1396,7 @@ int memory_failure(unsigned long pfn, int flags) struct dev_pagemap *pgmap; int res; unsigned long page_flags; + bool retry = true; if (!sysctl_memory_failure_recovery) panic("Memory failure on page %lx", pfn); @@ -1336,6 +1414,7 @@ int memory_failure(unsigned long pfn, int flags) return -ENXIO; } +try_again: if (PageHuge(p)) return memory_failure_hugetlb(pfn, flags); if (TestSetPageHWPoison(p)) { @@ -1358,10 +1437,23 @@ int memory_failure(unsigned long pfn, int flags) * In fact it's dangerous to directly bump up page count from 0, * that may make page_ref_freeze()/page_ref_unfreeze() mismatch. */ - if (!(flags & MF_COUNT_INCREASED) && !get_hwpoison_page(p)) { + if (!(flags & MF_COUNT_INCREASED) && !get_hwpoison_page(p, flags, 0)) { if (is_free_buddy_page(p)) { - action_result(pfn, MF_MSG_BUDDY, MF_DELAYED); - return 0; + if (take_page_off_buddy(p)) { + page_ref_inc(p); + res = MF_RECOVERED; + } else { + /* We lost the race, try again */ + if (retry) { + ClearPageHWPoison(p); + num_poisoned_pages_dec(); + retry = false; + goto try_again; + } + res = MF_FAILED; + } + action_result(pfn, MF_MSG_BUDDY, res); + return res == MF_RECOVERED ? 0 : -EBUSY; } else { action_result(pfn, MF_MSG_KERNEL_HIGH_ORDER, MF_IGNORED); return -EBUSY; @@ -1385,14 +1477,6 @@ int memory_failure(unsigned long pfn, int flags) * walked by the page reclaim code, however that's not a big loss. */ shake_page(p, 0); - /* shake_page could have turned it free. */ - if (!PageLRU(p) && is_free_buddy_page(p)) { - if (flags & MF_COUNT_INCREASED) - action_result(pfn, MF_MSG_BUDDY, MF_DELAYED); - else - action_result(pfn, MF_MSG_BUDDY_2ND, MF_DELAYED); - return 0; - } lock_page(p); @@ -1596,6 +1680,7 @@ int unpoison_memory(unsigned long pfn) struct page *page; struct page *p; int freeit = 0; + unsigned long flags = 0; static DEFINE_RATELIMIT_STATE(unpoison_rs, DEFAULT_RATELIMIT_INTERVAL, DEFAULT_RATELIMIT_BURST); @@ -1640,7 +1725,7 @@ int unpoison_memory(unsigned long pfn) return 0; } - if (!get_hwpoison_page(p)) { + if (!get_hwpoison_page(p, flags, 0)) { if (TestClearPageHWPoison(p)) num_poisoned_pages_dec(); unpoison_pr_info("Unpoison: Software-unpoisoned free page %#lx\n", @@ -1671,75 +1756,6 @@ int unpoison_memory(unsigned long pfn) } EXPORT_SYMBOL(unpoison_memory); -/* - * Safely get reference count of an arbitrary page. - * Returns 0 for a free page, -EIO for a zero refcount page - * that is not free, and 1 for any other page type. - * For 1 the page is returned with increased page count, otherwise not. - */ -static int __get_any_page(struct page *p, unsigned long pfn, int flags) -{ - int ret; - - if (flags & MF_COUNT_INCREASED) - return 1; - - /* - * When the target page is a free hugepage, just remove it - * from free hugepage list. - */ - if (!get_hwpoison_page(p)) { - if (PageHuge(p)) { - pr_info("%s: %#lx free huge page\n", __func__, pfn); - ret = 0; - } else if (is_free_buddy_page(p)) { - pr_info("%s: %#lx free buddy page\n", __func__, pfn); - ret = 0; - } else if (page_count(p)) { - /* raced with allocation */ - ret = -EBUSY; - } else { - pr_info("%s: %#lx: unknown zero refcount page type %lx\n", - __func__, pfn, p->flags); - ret = -EIO; - } - } else { - /* Not a free page */ - ret = 1; - } - return ret; -} - -static int get_any_page(struct page *page, unsigned long pfn, int flags) -{ - int ret = __get_any_page(page, pfn, flags); - - if (ret == -EBUSY) - ret = __get_any_page(page, pfn, flags); - - if (ret == 1 && !PageHuge(page) && - !PageLRU(page) && !__PageMovable(page)) { - /* - * Try to free it. - */ - put_page(page); - shake_page(page, 1); - - /* - * Did it turn free? - */ - ret = __get_any_page(page, pfn, 0); - if (ret == 1 && !PageLRU(page)) { - /* Drop page reference which is from __get_any_page() */ - put_page(page); - pr_info("soft_offline: %#lx: unknown non LRU page type %lx (%pGp)\n", - pfn, page->flags, &page->flags); - return -EIO; - } - } - return ret; -} - static bool isolate_page(struct page *page, struct list_head *pagelist) { bool isolated = false; @@ -1839,11 +1855,11 @@ static int __soft_offline_page(struct page *page) pr_info("soft offline: %#lx: %s migration failed %d, type %lx (%pGp)\n", pfn, msg_page[huge], ret, page->flags, &page->flags); if (ret > 0) - ret = -EIO; + ret = -EBUSY; } } else { - pr_info("soft offline: %#lx: %s isolation failed: %d, page count %d, type %lx (%pGp)\n", - pfn, msg_page[huge], ret, page_count(page), page->flags, &page->flags); + pr_info("soft offline: %#lx: %s isolation failed, page count %d, type %lx (%pGp)\n", + pfn, msg_page[huge], page_count(page), page->flags, &page->flags); ret = -EBUSY; } return ret; @@ -1905,7 +1921,7 @@ int soft_offline_page(unsigned long pfn, int flags) return -EIO; if (PageHWPoison(page)) { - pr_info("soft offline: %#lx page already poisoned\n", pfn); + pr_info("%s: %#lx page already poisoned\n", __func__, pfn); if (flags & MF_COUNT_INCREASED) put_page(page); return 0; @@ -1913,16 +1929,20 @@ int soft_offline_page(unsigned long pfn, int flags) retry: get_online_mems(); - ret = get_any_page(page, pfn, flags); + ret = get_hwpoison_page(page, flags, MF_SOFT_OFFLINE); put_online_mems(); - if (ret > 0) + if (ret > 0) { ret = soft_offline_in_use_page(page); - else if (ret == 0) + } else if (ret == 0) { if (soft_offline_free_page(page) && try_again) { try_again = false; goto retry; } + } else if (ret == -EIO) { + pr_info("%s: %#lx: unknown page type: %lx (%pGP)\n", + __func__, pfn, page->flags, &page->flags); + } return ret; } diff --git a/mm/memory.c b/mm/memory.c index c48f8df6e502..4a42a74a2240 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1171,6 +1171,15 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma) mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_PAGE, 0, src_vma, src_mm, addr, end); mmu_notifier_invalidate_range_start(&range); + /* + * Disabling preemption is not needed for the write side, as + * the read side doesn't spin, but goes to the mmap_lock. + * + * Use the raw variant of the seqcount_t write API to avoid + * lockdep complaining about preemptibility. + */ + mmap_assert_write_locked(src_mm); + raw_write_seqcount_begin(&src_mm->write_protect_seq); } ret = 0; @@ -1187,8 +1196,10 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma) } } while (dst_pgd++, src_pgd++, addr = next, addr != end); - if (is_cow) + if (is_cow) { + raw_write_seqcount_end(&src_mm->write_protect_seq); mmu_notifier_invalidate_range_end(&range); + } return ret; } @@ -4874,11 +4885,10 @@ EXPORT_SYMBOL_GPL(generic_access_phys); #endif /* - * Access another process' address space as given in mm. If non-NULL, use the - * given task for page fault accounting. + * Access another process' address space as given in mm. */ -int __access_remote_vm(struct task_struct *tsk, struct mm_struct *mm, - unsigned long addr, void *buf, int len, unsigned int gup_flags) +int __access_remote_vm(struct mm_struct *mm, unsigned long addr, void *buf, + int len, unsigned int gup_flags) { struct vm_area_struct *vma; void *old_buf = buf; @@ -4955,7 +4965,7 @@ int __access_remote_vm(struct task_struct *tsk, struct mm_struct *mm, int access_remote_vm(struct mm_struct *mm, unsigned long addr, void *buf, int len, unsigned int gup_flags) { - return __access_remote_vm(NULL, mm, addr, buf, len, gup_flags); + return __access_remote_vm(mm, addr, buf, len, gup_flags); } /* @@ -4973,7 +4983,7 @@ int access_process_vm(struct task_struct *tsk, unsigned long addr, if (!mm) return 0; - ret = __access_remote_vm(tsk, mm, addr, buf, len, gup_flags); + ret = __access_remote_vm(mm, addr, buf, len, gup_flags); mmput(mm); diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 63b2e46b6555..e0a561c550b3 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -596,8 +596,7 @@ void generic_online_page(struct page *page, unsigned int order) * so we should map it first. This is better than introducing a special * case in page freeing fast path. */ - if (debug_pagealloc_enabled_static()) - kernel_map_pages(page, 1 << order, 1); + debug_pagealloc_map_pages(page, 1 << order); __free_pages_core(page, order); totalram_pages_add(1UL << order); #ifdef CONFIG_HIGHMEM @@ -1304,7 +1303,7 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn) if (WARN_ON(PageLRU(page))) isolate_lru_page(page); if (page_mapped(page)) - try_to_unmap(page, TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS); + try_to_unmap(page, TTU_IGNORE_MLOCK); continue; } @@ -1492,13 +1491,19 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages) } node = zone_to_nid(zone); + /* + * Disable pcplists so that page isolation cannot race with freeing + * in a way that pages from isolated pageblock are left on pcplists. + */ + zone_pcp_disable(zone); + /* set above range as isolated */ ret = start_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE, MEMORY_OFFLINE | REPORT_FAILURE); if (ret) { reason = "failure to isolate range"; - goto failed_removal; + goto failed_removal_pcplists_disabled; } arg.start_pfn = start_pfn; @@ -1550,21 +1555,8 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages) goto failed_removal_isolated; } - /* - * per-cpu pages are drained in start_isolate_page_range, but if - * there are still pages that are not free, make sure that we - * drain again, because when we isolated range we might - * have raced with another thread that was adding pages to pcp - * list. - * - * Forward progress should be still guaranteed because - * pages on the pcp list can only belong to MOVABLE_ZONE - * because has_unmovable_pages explicitly checks for - * PageBuddy on freed pages on other zones. - */ ret = test_pages_isolated(start_pfn, end_pfn, MEMORY_OFFLINE); - if (ret) - drain_all_pages(zone); + } while (ret); /* Mark all sections offline and remove free pages from the buddy. */ @@ -1580,6 +1572,8 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages) zone->nr_isolate_pageblock -= nr_pages / pageblock_nr_pages; spin_unlock_irqrestore(&zone->lock, flags); + zone_pcp_enable(zone); + /* removal success */ adjust_managed_page_count(pfn_to_page(start_pfn), -nr_pages); zone->present_pages -= nr_pages; @@ -1612,6 +1606,8 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages) failed_removal_isolated: undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE); memory_notify(MEM_CANCEL_OFFLINE, &arg); +failed_removal_pcplists_disabled: + zone_pcp_enable(zone); failed_removal: pr_debug("memory offlining [mem %#010llx-%#010llx] failed due to %s\n", (unsigned long long) start_pfn << PAGE_SHIFT, diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 3ca4898f3f24..8cf96bd21341 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -1114,9 +1114,7 @@ int do_migrate_pages(struct mm_struct *mm, const nodemask_t *from, int err; nodemask_t tmp; - err = migrate_prep(); - if (err) - return err; + migrate_prep(); mmap_read_lock(mm); @@ -1315,9 +1313,7 @@ static long do_mbind(unsigned long start, unsigned long len, if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) { - err = migrate_prep(); - if (err) - goto mpol_out; + migrate_prep(); } { NODEMASK_SCRATCH(scratch); diff --git a/mm/migrate.c b/mm/migrate.c index 5795cb82e27c..ee802cb509a3 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -62,7 +62,7 @@ * to be migrated using isolate_lru_page(). If scheduling work on other CPUs is * undesirable, use migrate_prep_local() */ -int migrate_prep(void) +void migrate_prep(void) { /* * Clear the LRU lists so pages can be isolated. @@ -71,16 +71,12 @@ int migrate_prep(void) * pages that may be busy. */ lru_add_drain_all(); - - return 0; } /* Do the necessary work of migrate_prep but not if it involves other CPUs */ -int migrate_prep_local(void) +void migrate_prep_local(void) { lru_add_drain(); - - return 0; } int isolate_movable_page(struct page *page, isolate_mode_t mode) @@ -1106,7 +1102,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage, * and treated as swapcache but it has no rmap yet. * Calling try_to_unmap() against a page->mapping==NULL page will * trigger a BUG. So handle it here. - * 2. An orphaned page (see truncate_complete_page) might have + * 2. An orphaned page (see truncate_cleanup_page) might have * fs-private metadata. The page can be picked up due to memory * offlining. Everywhere else except page reclaim, the page is * invisible to the vm, so the page can not be migrated. So try to @@ -1122,8 +1118,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage, /* Establish migration ptes */ VM_BUG_ON_PAGE(PageAnon(page) && !PageKsm(page) && !anon_vma, page); - try_to_unmap(page, - TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS); + try_to_unmap(page, TTU_MIGRATION|TTU_IGNORE_MLOCK); page_was_mapped = 1; } @@ -1169,13 +1164,14 @@ static int unmap_and_move(new_page_t get_new_page, free_page_t put_new_page, unsigned long private, struct page *page, int force, enum migrate_mode mode, - enum migrate_reason reason) + enum migrate_reason reason, + struct list_head *ret) { int rc = MIGRATEPAGE_SUCCESS; struct page *newpage = NULL; if (!thp_migration_supported() && PageTransHuge(page)) - return -ENOMEM; + return -ENOSYS; if (page_count(page) == 1) { /* page was freed from under us. So we are done. */ @@ -1206,7 +1202,14 @@ out: * migrated will have kept its references and be restored. */ list_del(&page->lru); + } + /* + * If migration is successful, releases reference grabbed during + * isolation. Otherwise, restore the page to right list unless + * we want to retry. + */ + if (rc == MIGRATEPAGE_SUCCESS) { /* * Compaction can migrate also non-LRU pages which are * not accounted to NR_ISOLATED_*. They can be recognized @@ -1215,35 +1218,16 @@ out: if (likely(!__PageMovable(page))) mod_node_page_state(page_pgdat(page), NR_ISOLATED_ANON + page_is_file_lru(page), -thp_nr_pages(page)); - } - /* - * If migration is successful, releases reference grabbed during - * isolation. Otherwise, restore the page to right list unless - * we want to retry. - */ - if (rc == MIGRATEPAGE_SUCCESS) { if (reason != MR_MEMORY_FAILURE) /* * We release the page in page_handle_poison. */ put_page(page); } else { - if (rc != -EAGAIN) { - if (likely(!__PageMovable(page))) { - putback_lru_page(page); - goto put_new; - } + if (rc != -EAGAIN) + list_add_tail(&page->lru, ret); - lock_page(page); - if (PageMovable(page)) - putback_movable_page(page); - else - __ClearPageIsolated(page); - unlock_page(page); - put_page(page); - } -put_new: if (put_new_page) put_new_page(newpage, private); else @@ -1274,7 +1258,8 @@ put_new: static int unmap_and_move_huge_page(new_page_t get_new_page, free_page_t put_new_page, unsigned long private, struct page *hpage, int force, - enum migrate_mode mode, int reason) + enum migrate_mode mode, int reason, + struct list_head *ret) { int rc = -EAGAIN; int page_was_mapped = 0; @@ -1290,7 +1275,7 @@ static int unmap_and_move_huge_page(new_page_t get_new_page, * kicking migration. */ if (!hugepage_migration_supported(page_hstate(hpage))) { - putback_active_hugepage(hpage); + list_move_tail(&hpage->lru, ret); return -ENOSYS; } @@ -1329,8 +1314,7 @@ static int unmap_and_move_huge_page(new_page_t get_new_page, if (page_mapped(hpage)) { bool mapping_locked = false; - enum ttu_flags ttu = TTU_MIGRATION|TTU_IGNORE_MLOCK| - TTU_IGNORE_ACCESS; + enum ttu_flags ttu = TTU_MIGRATION|TTU_IGNORE_MLOCK; if (!PageAnon(hpage)) { /* @@ -1376,8 +1360,10 @@ put_anon: out_unlock: unlock_page(hpage); out: - if (rc != -EAGAIN) + if (rc == MIGRATEPAGE_SUCCESS) putback_active_hugepage(hpage); + else if (rc != -EAGAIN && rc != MIGRATEPAGE_SUCCESS) + list_move_tail(&hpage->lru, ret); /* * If migration was not successful and there's a freeing callback, use @@ -1392,6 +1378,20 @@ out: return rc; } +static inline int try_split_thp(struct page *page, struct page **page2, + struct list_head *from) +{ + int rc = 0; + + lock_page(page); + rc = split_huge_page_to_list(page, from); + unlock_page(page); + if (!rc) + list_safe_reset_next(page, *page2, lru); + + return rc; +} + /* * migrate_pages - migrate the pages specified in a list, to the free pages * supplied as the target for the page migration @@ -1408,8 +1408,8 @@ out: * * The function returns after 10 attempts or if no pages are movable any more * because the list has become empty or no retryable pages exist any more. - * The caller should call putback_movable_pages() to return pages to the LRU - * or free list only if ret != 0. + * It is caller's responsibility to call putback_movable_pages() to return pages + * to the LRU or free list only if ret != 0. * * Returns the number of pages that were not migrated, or an error code. */ @@ -1430,6 +1430,7 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page, struct page *page2; int swapwrite = current->flags & PF_SWAPWRITE; int rc, nr_subpages; + LIST_HEAD(ret_pages); if (!swapwrite) current->flags |= PF_SWAPWRITE; @@ -1452,31 +1453,56 @@ retry: if (PageHuge(page)) rc = unmap_and_move_huge_page(get_new_page, put_new_page, private, page, - pass > 2, mode, reason); + pass > 2, mode, reason, + &ret_pages); else rc = unmap_and_move(get_new_page, put_new_page, private, page, pass > 2, mode, - reason); - + reason, &ret_pages); + /* + * The rules are: + * Success: non hugetlb page will be freed, hugetlb + * page will be put back + * -EAGAIN: stay on the from list + * -ENOMEM: stay on the from list + * Other errno: put on ret_pages list then splice to + * from list + */ switch(rc) { + /* + * THP migration might be unsupported or the + * allocation could've failed so we should + * retry on the same page with the THP split + * to base pages. + * + * Head page is retried immediately and tail + * pages are added to the tail of the list so + * we encounter them after the rest of the list + * is processed. + */ + case -ENOSYS: + /* THP migration is unsupported */ + if (is_thp) { + if (!try_split_thp(page, &page2, from)) { + nr_thp_split++; + goto retry; + } + + nr_thp_failed++; + nr_failed += nr_subpages; + break; + } + + /* Hugetlb migration is unsupported */ + nr_failed++; + break; case -ENOMEM: /* - * THP migration might be unsupported or the - * allocation could've failed so we should - * retry on the same page with the THP split - * to base pages. - * - * Head page is retried immediately and tail - * pages are added to the tail of the list so - * we encounter them after the rest of the list - * is processed. + * When memory is low, don't bother to try to migrate + * other pages, just exit. */ if (is_thp) { - lock_page(page); - rc = split_huge_page_to_list(page, from); - unlock_page(page); - if (!rc) { - list_safe_reset_next(page, page2, lru); + if (!try_split_thp(page, &page2, from)) { nr_thp_split++; goto retry; } @@ -1504,7 +1530,7 @@ retry: break; default: /* - * Permanent failure (-EBUSY, -ENOSYS, etc.): + * Permanent failure (-EBUSY, etc.): * unlike -EAGAIN case, the failed page is * removed from migration page list and not * retried in the next outer loop. @@ -1523,6 +1549,12 @@ retry: nr_thp_failed += thp_retry; rc = nr_failed; out: + /* + * Put the permanent failure page back to migration list, they + * will be put back to the right list by the caller. + */ + list_splice(&ret_pages, from); + count_vm_events(PGMIGRATE_SUCCESS, nr_succeeded); count_vm_events(PGMIGRATE_FAIL, nr_failed); count_vm_events(THP_MIGRATION_SUCCESS, nr_thp_succeeded); @@ -1698,7 +1730,7 @@ static int move_pages_and_store_status(struct mm_struct *mm, int node, * Positive err means the number of failed * pages to migrate. Since we are going to * abort and return the number of non-migrated - * pages, so need to incude the rest of the + * pages, so need to include the rest of the * nr_pages that have not been attempted as * well. */ @@ -2065,6 +2097,17 @@ bool pmd_trans_migrating(pmd_t pmd) return PageLocked(page); } +static inline bool is_shared_exec_page(struct vm_area_struct *vma, + struct page *page) +{ + if (page_mapcount(page) != 1 && + (page_is_file_lru(page) || vma_is_shmem(vma)) && + (vma->vm_flags & VM_EXEC)) + return true; + + return false; +} + /* * Attempt to migrate a misplaced page to the specified destination * node. Caller is expected to have an elevated reference count on @@ -2082,8 +2125,7 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma, * Don't migrate file pages that are mapped in multiple processes * with execute permissions as they are probably shared libraries. */ - if (page_mapcount(page) != 1 && page_is_file_lru(page) && - (vma->vm_flags & VM_EXEC)) + if (is_shared_exec_page(vma, page)) goto out; /* @@ -2138,6 +2180,9 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm, int page_lru = page_is_file_lru(page); unsigned long start = address & HPAGE_PMD_MASK; + if (is_shared_exec_page(vma, page)) + goto out; + new_page = alloc_pages_node(node, (GFP_TRANSHUGE_LIGHT | __GFP_THISNODE), HPAGE_PMD_ORDER); @@ -2249,6 +2294,7 @@ out_fail: out_unlock: unlock_page(page); +out: put_page(page); return 0; } @@ -2688,7 +2734,7 @@ static void migrate_vma_prepare(struct migrate_vma *migrate) */ static void migrate_vma_unmap(struct migrate_vma *migrate) { - int flags = TTU_MIGRATION | TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS; + int flags = TTU_MIGRATION | TTU_IGNORE_MLOCK; const unsigned long npages = migrate->npages; const unsigned long start = migrate->start; unsigned long addr, i, restore = 0; @@ -2848,8 +2894,7 @@ EXPORT_SYMBOL(migrate_vma_setup); static void migrate_vma_insert_page(struct migrate_vma *migrate, unsigned long addr, struct page *page, - unsigned long *src, - unsigned long *dst) + unsigned long *src) { struct vm_area_struct *vma = migrate->vma; struct mm_struct *mm = vma->vm_mm; @@ -3003,16 +3048,14 @@ void migrate_vma_pages(struct migrate_vma *migrate) if (!notified) { notified = true; - mmu_notifier_range_init(&range, - MMU_NOTIFY_CLEAR, 0, - NULL, - migrate->vma->vm_mm, - addr, migrate->end); + mmu_notifier_range_init_migrate(&range, 0, + migrate->vma, migrate->vma->vm_mm, + addr, migrate->end, + migrate->pgmap_owner); mmu_notifier_invalidate_range_start(&range); } migrate_vma_insert_page(migrate, addr, newpage, - &migrate->src[i], - &migrate->dst[i]); + &migrate->src[i]); continue; } diff --git a/mm/mm_init.c b/mm/mm_init.c index b06a30fbedff..8e02e865cc65 100644 --- a/mm/mm_init.c +++ b/mm/mm_init.c @@ -173,6 +173,7 @@ static int __meminit mm_compute_batch_notifier(struct notifier_block *self, case MEM_ONLINE: case MEM_OFFLINE: mm_compute_batch(sysctl_overcommit_memory); + break; default: break; } diff --git a/mm/mmap.c b/mm/mmap.c index 5c8b4485860d..10598e5d4757 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -2731,8 +2731,8 @@ int __split_vma(struct mm_struct *mm, struct vm_area_struct *vma, struct vm_area_struct *new; int err; - if (vma->vm_ops && vma->vm_ops->split) { - err = vma->vm_ops->split(vma, addr); + if (vma->vm_ops && vma->vm_ops->may_split) { + err = vma->vm_ops->may_split(vma, addr); if (err) return err; } @@ -3405,10 +3405,14 @@ static const char *special_mapping_name(struct vm_area_struct *vma) return ((struct vm_special_mapping *)vma->vm_private_data)->name; } -static int special_mapping_mremap(struct vm_area_struct *new_vma) +static int special_mapping_mremap(struct vm_area_struct *new_vma, + unsigned long flags) { struct vm_special_mapping *sm = new_vma->vm_private_data; + if (flags & MREMAP_DONTUNMAP) + return -EINVAL; + if (WARN_ON_ONCE(current->mm != new_vma->vm_mm)) return -EFAULT; @@ -3418,6 +3422,17 @@ static int special_mapping_mremap(struct vm_area_struct *new_vma) return 0; } +static int special_mapping_split(struct vm_area_struct *vma, unsigned long addr) +{ + /* + * Forbid splitting special mappings - kernel has expectations over + * the number of pages in mapping. Together with VM_DONTEXPAND + * the size of vma should stay the same over the special mapping's + * lifetime. + */ + return -EINVAL; +} + static const struct vm_operations_struct special_mapping_vmops = { .close = special_mapping_close, .fault = special_mapping_fault, @@ -3425,6 +3440,7 @@ static const struct vm_operations_struct special_mapping_vmops = { .name = special_mapping_name, /* vDSO code relies that VVAR can't be accessed remotely */ .access = NULL, + .may_split = special_mapping_split, }; static const struct vm_operations_struct legacy_special_mapping_vmops = { diff --git a/mm/mmap_lock.c b/mm/mmap_lock.c new file mode 100644 index 000000000000..dcdde4f722a4 --- /dev/null +++ b/mm/mmap_lock.c @@ -0,0 +1,230 @@ +// SPDX-License-Identifier: GPL-2.0 +#define CREATE_TRACE_POINTS +#include <trace/events/mmap_lock.h> + +#include <linux/mm.h> +#include <linux/cgroup.h> +#include <linux/memcontrol.h> +#include <linux/mmap_lock.h> +#include <linux/mutex.h> +#include <linux/percpu.h> +#include <linux/rcupdate.h> +#include <linux/smp.h> +#include <linux/trace_events.h> + +EXPORT_TRACEPOINT_SYMBOL(mmap_lock_start_locking); +EXPORT_TRACEPOINT_SYMBOL(mmap_lock_acquire_returned); +EXPORT_TRACEPOINT_SYMBOL(mmap_lock_released); + +#ifdef CONFIG_MEMCG + +/* + * Our various events all share the same buffer (because we don't want or need + * to allocate a set of buffers *per event type*), so we need to protect against + * concurrent _reg() and _unreg() calls, and count how many _reg() calls have + * been made. + */ +static DEFINE_MUTEX(reg_lock); +static int reg_refcount; /* Protected by reg_lock. */ + +/* + * Size of the buffer for memcg path names. Ignoring stack trace support, + * trace_events_hist.c uses MAX_FILTER_STR_VAL for this, so we also use it. + */ +#define MEMCG_PATH_BUF_SIZE MAX_FILTER_STR_VAL + +/* + * How many contexts our trace events might be called in: normal, softirq, irq, + * and NMI. + */ +#define CONTEXT_COUNT 4 + +static DEFINE_PER_CPU(char __rcu *, memcg_path_buf); +static char **tmp_bufs; +static DEFINE_PER_CPU(int, memcg_path_buf_idx); + +/* Called with reg_lock held. */ +static void free_memcg_path_bufs(void) +{ + int cpu; + char **old = tmp_bufs; + + for_each_possible_cpu(cpu) { + *(old++) = rcu_dereference_protected( + per_cpu(memcg_path_buf, cpu), + lockdep_is_held(®_lock)); + rcu_assign_pointer(per_cpu(memcg_path_buf, cpu), NULL); + } + + /* Wait for inflight memcg_path_buf users to finish. */ + synchronize_rcu(); + + old = tmp_bufs; + for_each_possible_cpu(cpu) { + kfree(*(old++)); + } + + kfree(tmp_bufs); + tmp_bufs = NULL; +} + +int trace_mmap_lock_reg(void) +{ + int cpu; + char *new; + + mutex_lock(®_lock); + + /* If the refcount is going 0->1, proceed with allocating buffers. */ + if (reg_refcount++) + goto out; + + tmp_bufs = kmalloc_array(num_possible_cpus(), sizeof(*tmp_bufs), + GFP_KERNEL); + if (tmp_bufs == NULL) + goto out_fail; + + for_each_possible_cpu(cpu) { + new = kmalloc(MEMCG_PATH_BUF_SIZE * CONTEXT_COUNT, GFP_KERNEL); + if (new == NULL) + goto out_fail_free; + rcu_assign_pointer(per_cpu(memcg_path_buf, cpu), new); + /* Don't need to wait for inflights, they'd have gotten NULL. */ + } + +out: + mutex_unlock(®_lock); + return 0; + +out_fail_free: + free_memcg_path_bufs(); +out_fail: + /* Since we failed, undo the earlier ref increment. */ + --reg_refcount; + + mutex_unlock(®_lock); + return -ENOMEM; +} + +void trace_mmap_lock_unreg(void) +{ + mutex_lock(®_lock); + + /* If the refcount is going 1->0, proceed with freeing buffers. */ + if (--reg_refcount) + goto out; + + free_memcg_path_bufs(); + +out: + mutex_unlock(®_lock); +} + +static inline char *get_memcg_path_buf(void) +{ + char *buf; + int idx; + + rcu_read_lock(); + buf = rcu_dereference(*this_cpu_ptr(&memcg_path_buf)); + if (buf == NULL) { + rcu_read_unlock(); + return NULL; + } + idx = this_cpu_add_return(memcg_path_buf_idx, MEMCG_PATH_BUF_SIZE) - + MEMCG_PATH_BUF_SIZE; + return &buf[idx]; +} + +static inline void put_memcg_path_buf(void) +{ + this_cpu_sub(memcg_path_buf_idx, MEMCG_PATH_BUF_SIZE); + rcu_read_unlock(); +} + +/* + * Write the given mm_struct's memcg path to a percpu buffer, and return a + * pointer to it. If the path cannot be determined, or no buffer was available + * (because the trace event is being unregistered), NULL is returned. + * + * Note: buffers are allocated per-cpu to avoid locking, so preemption must be + * disabled by the caller before calling us, and re-enabled only after the + * caller is done with the pointer. + * + * The caller must call put_memcg_path_buf() once the buffer is no longer + * needed. This must be done while preemption is still disabled. + */ +static const char *get_mm_memcg_path(struct mm_struct *mm) +{ + char *buf = NULL; + struct mem_cgroup *memcg = get_mem_cgroup_from_mm(mm); + + if (memcg == NULL) + goto out; + if (unlikely(memcg->css.cgroup == NULL)) + goto out_put; + + buf = get_memcg_path_buf(); + if (buf == NULL) + goto out_put; + + cgroup_path(memcg->css.cgroup, buf, MEMCG_PATH_BUF_SIZE); + +out_put: + css_put(&memcg->css); +out: + return buf; +} + +#define TRACE_MMAP_LOCK_EVENT(type, mm, ...) \ + do { \ + const char *memcg_path; \ + preempt_disable(); \ + memcg_path = get_mm_memcg_path(mm); \ + trace_mmap_lock_##type(mm, \ + memcg_path != NULL ? memcg_path : "", \ + ##__VA_ARGS__); \ + if (likely(memcg_path != NULL)) \ + put_memcg_path_buf(); \ + preempt_enable(); \ + } while (0) + +#else /* !CONFIG_MEMCG */ + +int trace_mmap_lock_reg(void) +{ + return 0; +} + +void trace_mmap_lock_unreg(void) +{ +} + +#define TRACE_MMAP_LOCK_EVENT(type, mm, ...) \ + trace_mmap_lock_##type(mm, "", ##__VA_ARGS__) + +#endif /* CONFIG_MEMCG */ + +/* + * Trace calls must be in a separate file, as otherwise there's a circular + * dependency between linux/mmap_lock.h and trace/events/mmap_lock.h. + */ + +void __mmap_lock_do_trace_start_locking(struct mm_struct *mm, bool write) +{ + TRACE_MMAP_LOCK_EVENT(start_locking, mm, write); +} +EXPORT_SYMBOL(__mmap_lock_do_trace_start_locking); + +void __mmap_lock_do_trace_acquire_returned(struct mm_struct *mm, bool write, + bool success) +{ + TRACE_MMAP_LOCK_EVENT(acquire_returned, mm, write, success); +} +EXPORT_SYMBOL(__mmap_lock_do_trace_acquire_returned); + +void __mmap_lock_do_trace_released(struct mm_struct *mm, bool write) +{ + TRACE_MMAP_LOCK_EVENT(released, mm, write); +} +EXPORT_SYMBOL(__mmap_lock_do_trace_released); diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c index 5654dd19addc..61ee40ed804e 100644 --- a/mm/mmu_notifier.c +++ b/mm/mmu_notifier.c @@ -612,13 +612,6 @@ int __mmu_notifier_register(struct mmu_notifier *subscription, mmap_assert_write_locked(mm); BUG_ON(atomic_read(&mm->mm_users) <= 0); - if (IS_ENABLED(CONFIG_LOCKDEP)) { - fs_reclaim_acquire(GFP_KERNEL); - lock_map_acquire(&__mmu_notifier_invalidate_range_start_map); - lock_map_release(&__mmu_notifier_invalidate_range_start_map); - fs_reclaim_release(GFP_KERNEL); - } - if (!mm->notifier_subscriptions) { /* * kmalloc cannot be called under mm_take_all_locks(), but we diff --git a/mm/mmzone.c b/mm/mmzone.c index 4686fdc23bb9..f337831affc2 100644 --- a/mm/mmzone.c +++ b/mm/mmzone.c @@ -72,20 +72,6 @@ struct zoneref *__next_zones_zonelist(struct zoneref *z, return z; } -#ifdef CONFIG_ARCH_HAS_HOLES_MEMORYMODEL -bool memmap_valid_within(unsigned long pfn, - struct page *page, struct zone *zone) -{ - if (page_to_pfn(page) != pfn) - return false; - - if (page_zone(page) != zone) - return false; - - return true; -} -#endif /* CONFIG_ARCH_HAS_HOLES_MEMORYMODEL */ - void lruvec_init(struct lruvec *lruvec) { enum lru_list lru; diff --git a/mm/mremap.c b/mm/mremap.c index 138abbae4f75..c5590afe7165 100644 --- a/mm/mremap.c +++ b/mm/mremap.c @@ -30,12 +30,11 @@ #include "internal.h" -static pmd_t *get_old_pmd(struct mm_struct *mm, unsigned long addr) +static pud_t *get_old_pud(struct mm_struct *mm, unsigned long addr) { pgd_t *pgd; p4d_t *p4d; pud_t *pud; - pmd_t *pmd; pgd = pgd_offset(mm, addr); if (pgd_none_or_clear_bad(pgd)) @@ -49,6 +48,18 @@ static pmd_t *get_old_pmd(struct mm_struct *mm, unsigned long addr) if (pud_none_or_clear_bad(pud)) return NULL; + return pud; +} + +static pmd_t *get_old_pmd(struct mm_struct *mm, unsigned long addr) +{ + pud_t *pud; + pmd_t *pmd; + + pud = get_old_pud(mm, addr); + if (!pud) + return NULL; + pmd = pmd_offset(pud, addr); if (pmd_none(*pmd)) return NULL; @@ -56,19 +67,27 @@ static pmd_t *get_old_pmd(struct mm_struct *mm, unsigned long addr) return pmd; } -static pmd_t *alloc_new_pmd(struct mm_struct *mm, struct vm_area_struct *vma, +static pud_t *alloc_new_pud(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long addr) { pgd_t *pgd; p4d_t *p4d; - pud_t *pud; - pmd_t *pmd; pgd = pgd_offset(mm, addr); p4d = p4d_alloc(mm, pgd, addr); if (!p4d) return NULL; - pud = pud_alloc(mm, p4d, addr); + + return pud_alloc(mm, p4d, addr); +} + +static pmd_t *alloc_new_pmd(struct mm_struct *mm, struct vm_area_struct *vma, + unsigned long addr) +{ + pud_t *pud; + pmd_t *pmd; + + pud = alloc_new_pud(mm, vma, addr); if (!pud) return NULL; @@ -249,14 +268,148 @@ static bool move_normal_pmd(struct vm_area_struct *vma, unsigned long old_addr, return true; } +#else +static inline bool move_normal_pmd(struct vm_area_struct *vma, + unsigned long old_addr, unsigned long new_addr, pmd_t *old_pmd, + pmd_t *new_pmd) +{ + return false; +} #endif +#ifdef CONFIG_HAVE_MOVE_PUD +static bool move_normal_pud(struct vm_area_struct *vma, unsigned long old_addr, + unsigned long new_addr, pud_t *old_pud, pud_t *new_pud) +{ + spinlock_t *old_ptl, *new_ptl; + struct mm_struct *mm = vma->vm_mm; + pud_t pud; + + /* + * The destination pud shouldn't be established, free_pgtables() + * should have released it. + */ + if (WARN_ON_ONCE(!pud_none(*new_pud))) + return false; + + /* + * We don't have to worry about the ordering of src and dst + * ptlocks because exclusive mmap_lock prevents deadlock. + */ + old_ptl = pud_lock(vma->vm_mm, old_pud); + new_ptl = pud_lockptr(mm, new_pud); + if (new_ptl != old_ptl) + spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING); + + /* Clear the pud */ + pud = *old_pud; + pud_clear(old_pud); + + VM_BUG_ON(!pud_none(*new_pud)); + + /* Set the new pud */ + set_pud_at(mm, new_addr, new_pud, pud); + flush_tlb_range(vma, old_addr, old_addr + PUD_SIZE); + if (new_ptl != old_ptl) + spin_unlock(new_ptl); + spin_unlock(old_ptl); + + return true; +} +#else +static inline bool move_normal_pud(struct vm_area_struct *vma, + unsigned long old_addr, unsigned long new_addr, pud_t *old_pud, + pud_t *new_pud) +{ + return false; +} +#endif + +enum pgt_entry { + NORMAL_PMD, + HPAGE_PMD, + NORMAL_PUD, +}; + +/* + * Returns an extent of the corresponding size for the pgt_entry specified if + * valid. Else returns a smaller extent bounded by the end of the source and + * destination pgt_entry. + */ +static unsigned long get_extent(enum pgt_entry entry, unsigned long old_addr, + unsigned long old_end, unsigned long new_addr) +{ + unsigned long next, extent, mask, size; + + switch (entry) { + case HPAGE_PMD: + case NORMAL_PMD: + mask = PMD_MASK; + size = PMD_SIZE; + break; + case NORMAL_PUD: + mask = PUD_MASK; + size = PUD_SIZE; + break; + default: + BUILD_BUG(); + break; + } + + next = (old_addr + size) & mask; + /* even if next overflowed, extent below will be ok */ + extent = (next > old_end) ? old_end - old_addr : next - old_addr; + next = (new_addr + size) & mask; + if (extent > next - new_addr) + extent = next - new_addr; + return extent; +} + +/* + * Attempts to speedup the move by moving entry at the level corresponding to + * pgt_entry. Returns true if the move was successful, else false. + */ +static bool move_pgt_entry(enum pgt_entry entry, struct vm_area_struct *vma, + unsigned long old_addr, unsigned long new_addr, + void *old_entry, void *new_entry, bool need_rmap_locks) +{ + bool moved = false; + + /* See comment in move_ptes() */ + if (need_rmap_locks) + take_rmap_locks(vma); + + switch (entry) { + case NORMAL_PMD: + moved = move_normal_pmd(vma, old_addr, new_addr, old_entry, + new_entry); + break; + case NORMAL_PUD: + moved = move_normal_pud(vma, old_addr, new_addr, old_entry, + new_entry); + break; + case HPAGE_PMD: + moved = IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && + move_huge_pmd(vma, old_addr, new_addr, old_entry, + new_entry); + break; + default: + WARN_ON_ONCE(1); + break; + } + + if (need_rmap_locks) + drop_rmap_locks(vma); + + return moved; +} + unsigned long move_page_tables(struct vm_area_struct *vma, unsigned long old_addr, struct vm_area_struct *new_vma, unsigned long new_addr, unsigned long len, bool need_rmap_locks) { - unsigned long extent, next, old_end; + unsigned long extent, old_end; struct mmu_notifier_range range; pmd_t *old_pmd, *new_pmd; @@ -269,53 +422,50 @@ unsigned long move_page_tables(struct vm_area_struct *vma, for (; old_addr < old_end; old_addr += extent, new_addr += extent) { cond_resched(); - next = (old_addr + PMD_SIZE) & PMD_MASK; - /* even if next overflowed, extent below will be ok */ - extent = next - old_addr; - if (extent > old_end - old_addr) - extent = old_end - old_addr; - next = (new_addr + PMD_SIZE) & PMD_MASK; - if (extent > next - new_addr) - extent = next - new_addr; + /* + * If extent is PUD-sized try to speed up the move by moving at the + * PUD level if possible. + */ + extent = get_extent(NORMAL_PUD, old_addr, old_end, new_addr); + if (IS_ENABLED(CONFIG_HAVE_MOVE_PUD) && extent == PUD_SIZE) { + pud_t *old_pud, *new_pud; + + old_pud = get_old_pud(vma->vm_mm, old_addr); + if (!old_pud) + continue; + new_pud = alloc_new_pud(vma->vm_mm, vma, new_addr); + if (!new_pud) + break; + if (move_pgt_entry(NORMAL_PUD, vma, old_addr, new_addr, + old_pud, new_pud, need_rmap_locks)) + continue; + } + + extent = get_extent(NORMAL_PMD, old_addr, old_end, new_addr); old_pmd = get_old_pmd(vma->vm_mm, old_addr); if (!old_pmd) continue; new_pmd = alloc_new_pmd(vma->vm_mm, vma, new_addr); if (!new_pmd) break; - if (is_swap_pmd(*old_pmd) || pmd_trans_huge(*old_pmd) || pmd_devmap(*old_pmd)) { - if (extent == HPAGE_PMD_SIZE) { - bool moved; - /* See comment in move_ptes() */ - if (need_rmap_locks) - take_rmap_locks(vma); - moved = move_huge_pmd(vma, old_addr, new_addr, - old_pmd, new_pmd); - if (need_rmap_locks) - drop_rmap_locks(vma); - if (moved) - continue; - } + if (is_swap_pmd(*old_pmd) || pmd_trans_huge(*old_pmd) || + pmd_devmap(*old_pmd)) { + if (extent == HPAGE_PMD_SIZE && + move_pgt_entry(HPAGE_PMD, vma, old_addr, new_addr, + old_pmd, new_pmd, need_rmap_locks)) + continue; split_huge_pmd(vma, old_pmd, old_addr); if (pmd_trans_unstable(old_pmd)) continue; - } else if (extent == PMD_SIZE) { -#ifdef CONFIG_HAVE_MOVE_PMD + } else if (IS_ENABLED(CONFIG_HAVE_MOVE_PMD) && + extent == PMD_SIZE) { /* * If the extent is PMD-sized, try to speed the move by * moving at the PMD level if possible. */ - bool moved; - - if (need_rmap_locks) - take_rmap_locks(vma); - moved = move_normal_pmd(vma, old_addr, new_addr, - old_pmd, new_pmd); - if (need_rmap_locks) - drop_rmap_locks(vma); - if (moved) + if (move_pgt_entry(NORMAL_PMD, vma, old_addr, new_addr, + old_pmd, new_pmd, need_rmap_locks)) continue; -#endif } if (pte_alloc(new_vma->vm_mm, new_pmd)) @@ -343,7 +493,7 @@ static unsigned long move_vma(struct vm_area_struct *vma, unsigned long excess = 0; unsigned long hiwater_vm; int split = 0; - int err; + int err = 0; bool need_rmap_locks; /* @@ -353,6 +503,15 @@ static unsigned long move_vma(struct vm_area_struct *vma, if (mm->map_count >= sysctl_max_map_count - 3) return -ENOMEM; + if (vma->vm_ops && vma->vm_ops->may_split) { + if (vma->vm_start != old_addr) + err = vma->vm_ops->may_split(vma, old_addr); + if (!err && vma->vm_end != old_addr + old_len) + err = vma->vm_ops->may_split(vma, old_addr + old_len); + if (err) + return err; + } + /* * Advise KSM to break any KSM pages in the area to be moved: * it would be confusing if they were to turn up at the new @@ -365,18 +524,26 @@ static unsigned long move_vma(struct vm_area_struct *vma, if (err) return err; + if (unlikely(flags & MREMAP_DONTUNMAP && vm_flags & VM_ACCOUNT)) { + if (security_vm_enough_memory_mm(mm, new_len >> PAGE_SHIFT)) + return -ENOMEM; + } + new_pgoff = vma->vm_pgoff + ((old_addr - vma->vm_start) >> PAGE_SHIFT); new_vma = copy_vma(&vma, new_addr, new_len, new_pgoff, &need_rmap_locks); - if (!new_vma) + if (!new_vma) { + if (unlikely(flags & MREMAP_DONTUNMAP && vm_flags & VM_ACCOUNT)) + vm_unacct_memory(new_len >> PAGE_SHIFT); return -ENOMEM; + } moved_len = move_page_tables(vma, old_addr, new_vma, new_addr, old_len, need_rmap_locks); if (moved_len < old_len) { err = -ENOMEM; } else if (vma->vm_ops && vma->vm_ops->mremap) { - err = vma->vm_ops->mremap(new_vma); + err = vma->vm_ops->mremap(new_vma, flags); } if (unlikely(err)) { @@ -398,7 +565,7 @@ static unsigned long move_vma(struct vm_area_struct *vma, } /* Conceal VM_ACCOUNT so old reservation is not undone */ - if (vm_flags & VM_ACCOUNT) { + if (vm_flags & VM_ACCOUNT && !(flags & MREMAP_DONTUNMAP)) { vma->vm_flags &= ~VM_ACCOUNT; excess = vma->vm_end - vma->vm_start - old_len; if (old_addr > vma->vm_start && @@ -423,34 +590,17 @@ static unsigned long move_vma(struct vm_area_struct *vma, untrack_pfn_moved(vma); if (unlikely(!err && (flags & MREMAP_DONTUNMAP))) { - if (vm_flags & VM_ACCOUNT) { - /* Always put back VM_ACCOUNT since we won't unmap */ - vma->vm_flags |= VM_ACCOUNT; - - vm_acct_memory(new_len >> PAGE_SHIFT); - } - - /* - * VMAs can actually be merged back together in copy_vma - * calling merge_vma. This can happen with anonymous vmas - * which have not yet been faulted, so if we were to consider - * this VMA split we'll end up adding VM_ACCOUNT on the - * next VMA, which is completely unrelated if this VMA - * was re-merged. - */ - if (split && new_vma == vma) - split = 0; - /* We always clear VM_LOCKED[ONFAULT] on the old vma */ vma->vm_flags &= VM_LOCKED_CLEAR_MASK; /* Because we won't unmap we don't need to touch locked_vm */ - goto out; + return new_addr; } if (do_munmap(mm, old_addr, old_len, uf_unmap) < 0) { /* OOM: unable to split vma, just get accounts right */ - vm_unacct_memory(excess >> PAGE_SHIFT); + if (vm_flags & VM_ACCOUNT && !(flags & MREMAP_DONTUNMAP)) + vm_acct_memory(new_len >> PAGE_SHIFT); excess = 0; } @@ -458,7 +608,7 @@ static unsigned long move_vma(struct vm_area_struct *vma, mm->locked_vm += new_len >> PAGE_SHIFT; *locked = true; } -out: + mm->hiwater_vm = hiwater_vm; /* Restore VM_ACCOUNT if one or two pieces of vma left */ diff --git a/mm/nommu.c b/mm/nommu.c index 0faf39b32cdb..870fea12823e 100644 --- a/mm/nommu.c +++ b/mm/nommu.c @@ -1675,8 +1675,8 @@ void filemap_map_pages(struct vm_fault *vmf, } EXPORT_SYMBOL(filemap_map_pages); -int __access_remote_vm(struct task_struct *tsk, struct mm_struct *mm, - unsigned long addr, void *buf, int len, unsigned int gup_flags) +int __access_remote_vm(struct mm_struct *mm, unsigned long addr, void *buf, + int len, unsigned int gup_flags) { struct vm_area_struct *vma; int write = gup_flags & FOLL_WRITE; @@ -1722,7 +1722,7 @@ int __access_remote_vm(struct task_struct *tsk, struct mm_struct *mm, int access_remote_vm(struct mm_struct *mm, unsigned long addr, void *buf, int len, unsigned int gup_flags) { - return __access_remote_vm(NULL, mm, addr, buf, len, gup_flags); + return __access_remote_vm(mm, addr, buf, len, gup_flags); } /* @@ -1741,7 +1741,7 @@ int access_process_vm(struct task_struct *tsk, unsigned long addr, void *buf, in if (!mm) return 0; - len = __access_remote_vm(tsk, mm, addr, buf, len, gup_flags); + len = __access_remote_vm(mm, addr, buf, len, gup_flags); mmput(mm); return len; diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 8b84661a6410..04b19b7b5435 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -170,11 +170,13 @@ static bool oom_unkillable_task(struct task_struct *p) return false; } -/* - * Print out unreclaimble slabs info when unreclaimable slabs amount is greater - * than all user memory (LRU pages) - */ -static bool is_dump_unreclaim_slabs(void) +/** + * Check whether unreclaimable slab amount is greater than + * all user memory(LRU pages). + * dump_unreclaimable_slab() could help in the case that + * oom due to too much unreclaimable slab used by kernel. +*/ +static bool should_dump_unreclaim_slab(void) { unsigned long nr_lru; @@ -463,7 +465,7 @@ static void dump_header(struct oom_control *oc, struct task_struct *p) mem_cgroup_print_oom_meminfo(oc->memcg); else { show_mem(SHOW_MEM_FILTER_NODES, oc->nodemask); - if (is_dump_unreclaim_slabs()) + if (should_dump_unreclaim_slab()) dump_unreclaimable_slab(); } if (sysctl_oom_dump_tasks) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index eaa227a479e4..b63294517e04 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -57,6 +57,7 @@ #include <trace/events/oom.h> #include <linux/prefetch.h> #include <linux/mm_inline.h> +#include <linux/mmu_notifier.h> #include <linux/migrate.h> #include <linux/hugetlb.h> #include <linux/sched/rt.h> @@ -70,6 +71,7 @@ #include <linux/psi.h> #include <linux/padata.h> #include <linux/khugepaged.h> +#include <linux/buffer_head.h> #include <asm/sections.h> #include <asm/tlbflush.h> @@ -165,53 +167,26 @@ unsigned long totalcma_pages __read_mostly; int percpu_pagelist_fraction; gfp_t gfp_allowed_mask __read_mostly = GFP_BOOT_MASK; -#ifdef CONFIG_INIT_ON_ALLOC_DEFAULT_ON -DEFINE_STATIC_KEY_TRUE(init_on_alloc); -#else DEFINE_STATIC_KEY_FALSE(init_on_alloc); -#endif EXPORT_SYMBOL(init_on_alloc); -#ifdef CONFIG_INIT_ON_FREE_DEFAULT_ON -DEFINE_STATIC_KEY_TRUE(init_on_free); -#else DEFINE_STATIC_KEY_FALSE(init_on_free); -#endif EXPORT_SYMBOL(init_on_free); +static bool _init_on_alloc_enabled_early __read_mostly + = IS_ENABLED(CONFIG_INIT_ON_ALLOC_DEFAULT_ON); static int __init early_init_on_alloc(char *buf) { - int ret; - bool bool_result; - ret = kstrtobool(buf, &bool_result); - if (ret) - return ret; - if (bool_result && page_poisoning_enabled()) - pr_info("mem auto-init: CONFIG_PAGE_POISONING is on, will take precedence over init_on_alloc\n"); - if (bool_result) - static_branch_enable(&init_on_alloc); - else - static_branch_disable(&init_on_alloc); - return 0; + return kstrtobool(buf, &_init_on_alloc_enabled_early); } early_param("init_on_alloc", early_init_on_alloc); +static bool _init_on_free_enabled_early __read_mostly + = IS_ENABLED(CONFIG_INIT_ON_FREE_DEFAULT_ON); static int __init early_init_on_free(char *buf) { - int ret; - bool bool_result; - - ret = kstrtobool(buf, &bool_result); - if (ret) - return ret; - if (bool_result && page_poisoning_enabled()) - pr_info("mem auto-init: CONFIG_PAGE_POISONING is on, will take precedence over init_on_free\n"); - if (bool_result) - static_branch_enable(&init_on_free); - else - static_branch_disable(&init_on_free); - return 0; + return kstrtobool(buf, &_init_on_free_enabled_early); } early_param("init_on_free", early_init_on_free); @@ -495,14 +470,6 @@ static inline int pfn_to_bitidx(struct page *page, unsigned long pfn) return (pfn >> pageblock_order) * NR_PAGEBLOCK_BITS; } -/** - * get_pfnblock_flags_mask - Return the requested group of flags for the pageblock_nr_pages block of pages - * @page: The page within the block of interest - * @pfn: The target page frame number - * @mask: mask of bits that the caller is interested in - * - * Return: pageblock_bits flags - */ static __always_inline unsigned long __get_pfnblock_flags_mask(struct page *page, unsigned long pfn, @@ -521,6 +488,14 @@ unsigned long __get_pfnblock_flags_mask(struct page *page, return (word >> bitidx) & mask; } +/** + * get_pfnblock_flags_mask - Return the requested group of flags for the pageblock_nr_pages block of pages + * @page: The page within the block of interest + * @pfn: The target page frame number + * @mask: mask of bits that the caller is interested in + * + * Return: pageblock_bits flags + */ unsigned long get_pfnblock_flags_mask(struct page *page, unsigned long pfn, unsigned long mask) { @@ -728,19 +703,6 @@ static int __init early_debug_pagealloc(char *buf) } early_param("debug_pagealloc", early_debug_pagealloc); -void init_debug_pagealloc(void) -{ - if (!debug_pagealloc_enabled()) - return; - - static_branch_enable(&_debug_pagealloc_enabled); - - if (!debug_guardpage_minorder()) - return; - - static_branch_enable(&_debug_guardpage_enabled); -} - static int __init debug_guardpage_minorder_setup(char *buf) { unsigned long res; @@ -792,6 +754,53 @@ static inline void clear_page_guard(struct zone *zone, struct page *page, unsigned int order, int migratetype) {} #endif +/* + * Enable static keys related to various memory debugging and hardening options. + * Some override others, and depend on early params that are evaluated in the + * order of appearance. So we need to first gather the full picture of what was + * enabled, and then make decisions. + */ +void init_mem_debugging_and_hardening(void) +{ + if (_init_on_alloc_enabled_early) { + if (page_poisoning_enabled()) + pr_info("mem auto-init: CONFIG_PAGE_POISONING is on, " + "will take precedence over init_on_alloc\n"); + else + static_branch_enable(&init_on_alloc); + } + if (_init_on_free_enabled_early) { + if (page_poisoning_enabled()) + pr_info("mem auto-init: CONFIG_PAGE_POISONING is on, " + "will take precedence over init_on_free\n"); + else + static_branch_enable(&init_on_free); + } + +#ifdef CONFIG_PAGE_POISONING + /* + * Page poisoning is debug page alloc for some arches. If + * either of those options are enabled, enable poisoning. + */ + if (page_poisoning_enabled() || + (!IS_ENABLED(CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC) && + debug_pagealloc_enabled())) + static_branch_enable(&_page_poisoning_enabled); +#endif + +#ifdef CONFIG_DEBUG_PAGEALLOC + if (!debug_pagealloc_enabled()) + return; + + static_branch_enable(&_debug_pagealloc_enabled); + + if (!debug_guardpage_minorder()) + return; + + static_branch_enable(&_debug_guardpage_enabled); +#endif +} + static inline void set_buddy_order(struct page *page, unsigned int order) { set_page_private(page, order); @@ -994,7 +1003,7 @@ static inline void __free_one_page(struct page *page, struct page *buddy; bool to_tail; - max_order = min_t(unsigned int, MAX_ORDER, pageblock_order + 1); + max_order = min_t(unsigned int, MAX_ORDER - 1, pageblock_order); VM_BUG_ON(!zone_is_initialized(zone)); VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page); @@ -1007,7 +1016,7 @@ static inline void __free_one_page(struct page *page, VM_BUG_ON_PAGE(bad_range(zone, page), page); continue_merging: - while (order < max_order - 1) { + while (order < max_order) { if (compaction_capture(capc, page, order, migratetype)) { __mod_zone_freepage_state(zone, -(1 << order), migratetype); @@ -1033,7 +1042,7 @@ continue_merging: pfn = combined_pfn; order++; } - if (max_order < MAX_ORDER) { + if (order < MAX_ORDER - 1) { /* If we are here, it means order is >= pageblock_order. * We want to prevent merge between freepages on isolate * pageblock and normal pageblock. Without this, pageblock @@ -1054,7 +1063,7 @@ continue_merging: is_migrate_isolate(buddy_mt))) goto done_merging; } - max_order++; + max_order = order + 1; goto continue_merging; } @@ -1264,7 +1273,8 @@ static __always_inline bool free_pages_prepare(struct page *page, if (want_init_on_free()) kernel_init_free_pages(page, 1 << order); - kernel_poison_pages(page, 1 << order, 0); + kernel_poison_pages(page, 1 << order); + /* * arch_free_page() can make the page's contents inaccessible. s390 * does this. So nothing which can access the page's contents should @@ -1272,8 +1282,7 @@ static __always_inline bool free_pages_prepare(struct page *page, */ arch_free_page(page, order); - if (debug_pagealloc_enabled_static()) - kernel_map_pages(page, 1 << order, 0); + debug_pagealloc_unmap_pages(page, 1 << order); kasan_free_nondeferred_pages(page, order); @@ -1344,7 +1353,7 @@ static void free_pcppages_bulk(struct zone *zone, int count, { int migratetype = 0; int batch_free = 0; - int prefetch_nr = 0; + int prefetch_nr = READ_ONCE(pcp->batch); bool isolated_pageblocks; struct page *page, *tmp; LIST_HEAD(head); @@ -1395,8 +1404,10 @@ static void free_pcppages_bulk(struct zone *zone, int count, * avoid excessive prefetching due to large count, only * prefetch buddy for the first pcp->batch nr of pages. */ - if (prefetch_nr++ < pcp->batch) + if (prefetch_nr) { prefetch_buddy(page); + prefetch_nr--; + } } while (--count && --batch_free && !list_empty(list)); } @@ -1558,14 +1569,23 @@ void __free_pages_core(struct page *page, unsigned int order) #ifdef CONFIG_NEED_MULTIPLE_NODES -static struct mminit_pfnnid_cache early_pfnnid_cache __meminitdata; +/* + * During memory init memblocks map pfns to nids. The search is expensive and + * this caches recent lookups. The implementation of __early_pfn_to_nid + * treats start/end as pfns. + */ +struct mminit_pfnnid_cache { + unsigned long last_start; + unsigned long last_end; + int last_nid; +}; -#ifndef CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID +static struct mminit_pfnnid_cache early_pfnnid_cache __meminitdata; /* * Required by SPARSEMEM. Given a PFN, return what node the PFN is on. */ -int __meminit __early_pfn_to_nid(unsigned long pfn, +static int __meminit __early_pfn_to_nid(unsigned long pfn, struct mminit_pfnnid_cache *state) { unsigned long start_pfn, end_pfn; @@ -1583,7 +1603,6 @@ int __meminit __early_pfn_to_nid(unsigned long pfn, return nid; } -#endif /* CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID */ int __meminit early_pfn_to_nid(unsigned long pfn) { @@ -2103,6 +2122,8 @@ void __init page_alloc_init_late(void) files_maxfiles_init(); #endif + buffer_init(); + /* Discard memblock private memory */ memblock_discard(); @@ -2207,12 +2228,6 @@ static inline int check_new_page(struct page *page) return 1; } -static inline bool free_pages_prezeroed(void) -{ - return (IS_ENABLED(CONFIG_PAGE_POISONING_ZERO) && - page_poisoning_enabled()) || want_init_on_free(); -} - #ifdef CONFIG_DEBUG_VM /* * With DEBUG_VM enabled, order-0 pages are checked for expected state when @@ -2270,11 +2285,13 @@ inline void post_alloc_hook(struct page *page, unsigned int order, set_page_refcounted(page); arch_alloc_page(page, order); - if (debug_pagealloc_enabled_static()) - kernel_map_pages(page, 1 << order, 1); + debug_pagealloc_map_pages(page, 1 << order); kasan_alloc_pages(page, order); - kernel_poison_pages(page, 1 << order, 1); + kernel_unpoison_pages(page, 1 << order); set_page_owner(page, order, gfp_flags); + + if (!want_init_on_free() && want_init_on_alloc(gfp_flags)) + kernel_init_free_pages(page, 1 << order); } static void prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags, @@ -2282,9 +2299,6 @@ static void prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags { post_alloc_hook(page, order, gfp_flags); - if (!free_pages_prezeroed() && want_init_on_alloc(gfp_flags)) - kernel_init_free_pages(page, 1 << order); - if (order && (gfp_flags & __GFP_COMP)) prep_compound_page(page, order); @@ -2470,12 +2484,12 @@ static bool can_steal_fallback(unsigned int order, int start_mt) return false; } -static inline void boost_watermark(struct zone *zone) +static inline bool boost_watermark(struct zone *zone) { unsigned long max_boost; if (!watermark_boost_factor) - return; + return false; /* * Don't bother in zones that are unlikely to produce results. * On small machines, including kdump capture kernels running @@ -2483,7 +2497,7 @@ static inline void boost_watermark(struct zone *zone) * memory situation immediately. */ if ((pageblock_nr_pages * 4) > zone_managed_pages(zone)) - return; + return false; max_boost = mult_frac(zone->_watermark[WMARK_HIGH], watermark_boost_factor, 10000); @@ -2497,12 +2511,14 @@ static inline void boost_watermark(struct zone *zone) * boosted watermark resulting in a hang. */ if (!max_boost) - return; + return false; max_boost = max(pageblock_nr_pages, max_boost); zone->watermark_boost = min(zone->watermark_boost + pageblock_nr_pages, max_boost); + + return true; } /* @@ -2540,8 +2556,7 @@ static void steal_suitable_fallback(struct zone *zone, struct page *page, * likelihood of future fallbacks. Wake kswapd now as the node * may be balanced overall and kswapd will not wake naturally. */ - boost_watermark(zone); - if (alloc_flags & ALLOC_KSWAPD) + if (boost_watermark(zone) && (alloc_flags & ALLOC_KSWAPD)) set_bit(ZONE_BOOSTED_WATERMARK, &zone->flags); /* We are not allowed to try stealing from the whole block */ @@ -3017,13 +3032,16 @@ static void drain_local_pages_wq(struct work_struct *work) } /* - * Spill all the per-cpu pages from all CPUs back into the buddy allocator. - * - * When zone parameter is non-NULL, spill just the single zone's pages. + * The implementation of drain_all_pages(), exposing an extra parameter to + * drain on all cpus. * - * Note that this can be extremely slow as the draining happens in a workqueue. + * drain_all_pages() is optimized to only execute on cpus where pcplists are + * not empty. The check for non-emptiness can however race with a free to + * pcplist that has not yet increased the pcp->count from 0 to 1. Callers + * that need the guarantee that every CPU has drained can disable the + * optimizing racy check. */ -void drain_all_pages(struct zone *zone) +static void __drain_all_pages(struct zone *zone, bool force_all_cpus) { int cpu; @@ -3062,7 +3080,13 @@ void drain_all_pages(struct zone *zone) struct zone *z; bool has_pcps = false; - if (zone) { + if (force_all_cpus) { + /* + * The pcp.count check is racy, some callers need a + * guarantee that no cpu is missed. + */ + has_pcps = true; + } else if (zone) { pcp = per_cpu_ptr(zone->pageset, cpu); if (pcp->pcp.count) has_pcps = true; @@ -3095,6 +3119,18 @@ void drain_all_pages(struct zone *zone) mutex_unlock(&pcpu_drain_mutex); } +/* + * Spill all the per-cpu pages from all CPUs back into the buddy allocator. + * + * When zone parameter is non-NULL, spill just the single zone's pages. + * + * Note that this can be extremely slow as the draining happens in a workqueue. + */ +void drain_all_pages(struct zone *zone) +{ + __drain_all_pages(zone, false); +} + #ifdef CONFIG_HIBERNATION /* @@ -3190,10 +3226,8 @@ static void free_unref_page_commit(struct page *page, unsigned long pfn) pcp = &this_cpu_ptr(zone->pageset)->pcp; list_add(&page->lru, &pcp->lists[migratetype]); pcp->count++; - if (pcp->count >= pcp->high) { - unsigned long batch = READ_ONCE(pcp->batch); - free_pcppages_bulk(zone, batch, pcp); - } + if (pcp->count >= READ_ONCE(pcp->high)) + free_pcppages_bulk(zone, READ_ONCE(pcp->batch), pcp); } /* @@ -3378,7 +3412,7 @@ static struct page *__rmqueue_pcplist(struct zone *zone, int migratetype, do { if (list_empty(list)) { pcp->count += rmqueue_bulk(zone, 0, - pcp->batch, list, + READ_ONCE(pcp->batch), list, migratetype, alloc_flags); if (unlikely(list_empty(list))) return NULL; @@ -4264,10 +4298,8 @@ should_compact_retry(struct alloc_context *ac, unsigned int order, int alloc_fla static struct lockdep_map __fs_reclaim_map = STATIC_LOCKDEP_MAP_INIT("fs_reclaim", &__fs_reclaim_map); -static bool __need_fs_reclaim(gfp_t gfp_mask) +static bool __need_reclaim(gfp_t gfp_mask) { - gfp_mask = current_gfp_context(gfp_mask); - /* no reclaim without waiting on it */ if (!(gfp_mask & __GFP_DIRECT_RECLAIM)) return false; @@ -4276,10 +4308,6 @@ static bool __need_fs_reclaim(gfp_t gfp_mask) if (current->flags & PF_MEMALLOC) return false; - /* We're only interested __GFP_FS allocations for now */ - if (!(gfp_mask & __GFP_FS)) - return false; - if (gfp_mask & __GFP_NOLOCKDEP) return false; @@ -4298,15 +4326,29 @@ void __fs_reclaim_release(void) void fs_reclaim_acquire(gfp_t gfp_mask) { - if (__need_fs_reclaim(gfp_mask)) - __fs_reclaim_acquire(); + gfp_mask = current_gfp_context(gfp_mask); + + if (__need_reclaim(gfp_mask)) { + if (gfp_mask & __GFP_FS) + __fs_reclaim_acquire(); + +#ifdef CONFIG_MMU_NOTIFIER + lock_map_acquire(&__mmu_notifier_invalidate_range_start_map); + lock_map_release(&__mmu_notifier_invalidate_range_start_map); +#endif + + } } EXPORT_SYMBOL_GPL(fs_reclaim_acquire); void fs_reclaim_release(gfp_t gfp_mask) { - if (__need_fs_reclaim(gfp_mask)) - __fs_reclaim_release(); + gfp_mask = current_gfp_context(gfp_mask); + + if (__need_reclaim(gfp_mask)) { + if (gfp_mask & __GFP_FS) + __fs_reclaim_release(); + } } EXPORT_SYMBOL_GPL(fs_reclaim_release); #endif @@ -5007,6 +5049,26 @@ static inline void free_the_page(struct page *page, unsigned int order) __free_pages_ok(page, order, FPI_NONE); } +/** + * __free_pages - Free pages allocated with alloc_pages(). + * @page: The page pointer returned from alloc_pages(). + * @order: The order of the allocation. + * + * This function can free multi-page allocations that are not compound + * pages. It does not check that the @order passed in matches that of + * the allocation, so it is easy to leak memory. Freeing more memory + * than was allocated will probably emit a warning. + * + * If the last reference to this page is speculative, it will be released + * by put_page() which only frees the first page of a non-compound + * allocation. To prevent the remaining pages from being leaked, we free + * the subsequent pages here. If you want to use the page's reference + * count to decide when to free the allocation, you should allocate a + * compound page, and use put_page() instead of __free_pages(). + * + * Context: May be called in interrupt context or while holding a normal + * spinlock, but not in NMI context or while holding a raw spinlock. + */ void __free_pages(struct page *page, unsigned int order) { if (put_page_testzero(page)) @@ -5465,7 +5527,7 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask) global_node_page_state_pages(NR_SLAB_UNRECLAIMABLE_B), global_node_page_state(NR_FILE_MAPPED), global_node_page_state(NR_SHMEM), - global_zone_page_state(NR_PAGETABLE), + global_node_page_state(NR_PAGETABLE), global_zone_page_state(NR_BOUNCE), global_zone_page_state(NR_FREE_PAGES), free_pcp, @@ -5497,6 +5559,7 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask) #ifdef CONFIG_SHADOW_CALL_STACK " shadow_call_stack:%lukB" #endif + " pagetables:%lukB" " all_unreclaimable? %s" "\n", pgdat->node_id, @@ -5522,6 +5585,7 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask) #ifdef CONFIG_SHADOW_CALL_STACK node_page_state(pgdat, NR_KERNEL_SCS_KB), #endif + K(node_page_state(pgdat, NR_PAGETABLE)), pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES ? "yes" : "no"); } @@ -5553,7 +5617,6 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask) " present:%lukB" " managed:%lukB" " mlocked:%lukB" - " pagetables:%lukB" " bounce:%lukB" " free_pcp:%lukB" " local_pcp:%ukB" @@ -5574,7 +5637,6 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask) K(zone->present_pages), K(zone_managed_pages(zone)), K(zone_page_state(zone, NR_MLOCK)), - K(zone_page_state(zone, NR_PAGETABLE)), K(zone_page_state(zone, NR_BOUNCE)), K(free_pcp), K(this_cpu_read(zone->pageset->pcp.count)), @@ -5904,7 +5966,10 @@ static void build_zonelists(pg_data_t *pgdat) * not check if the processor is online before following the pageset pointer. * Other parts of the kernel may not check if the zone is available. */ -static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch); +static void pageset_init(struct per_cpu_pageset *p); +/* These effectively disable the pcplists in the boot pageset completely */ +#define BOOT_PAGESET_HIGH 0 +#define BOOT_PAGESET_BATCH 1 static DEFINE_PER_CPU(struct per_cpu_pageset, boot_pageset); static DEFINE_PER_CPU(struct per_cpu_nodestat, boot_nodestats); @@ -5972,7 +6037,7 @@ build_all_zonelists_init(void) * (a chicken-egg dilemma). */ for_each_possible_cpu(cpu) - setup_pageset(&per_cpu(boot_pageset, cpu), 0); + pageset_init(&per_cpu(boot_pageset, cpu)); mminit_verify_zonelist(); cpuset_init_current_mems_allowed(); @@ -6255,13 +6320,16 @@ static int zone_batchsize(struct zone *zone) } /* - * pcp->high and pcp->batch values are related and dependent on one another: - * ->batch must never be higher then ->high. - * The following function updates them in a safe manner without read side - * locking. + * pcp->high and pcp->batch values are related and generally batch is lower + * than high. They are also related to pcp->count such that count is lower + * than high, and as soon as it reaches high, the pcplist is flushed. * - * Any new users of pcp->batch and pcp->high should ensure they can cope with - * those fields changing asynchronously (acording to the above rule). + * However, guaranteeing these relations at all times would require e.g. write + * barriers here but also careful usage of read barriers at the read side, and + * thus be prone to error and bad for performance. Thus the update only prevents + * store tearing. Any new users of pcp->batch and pcp->high should ensure they + * can cope with those fields changing asynchronously, and fully trust only the + * pcp->count field on the local CPU with interrupts disabled. * * mutex_is_locked(&pcp_batch_high_lock) required when calling this function * outside of boot time (or some other assurance that no concurrent updaters @@ -6270,21 +6338,8 @@ static int zone_batchsize(struct zone *zone) static void pageset_update(struct per_cpu_pages *pcp, unsigned long high, unsigned long batch) { - /* start with a fail safe value for batch */ - pcp->batch = 1; - smp_wmb(); - - /* Update high, then batch, in order */ - pcp->high = high; - smp_wmb(); - - pcp->batch = batch; -} - -/* a companion to pageset_set_high() */ -static void pageset_set_batch(struct per_cpu_pageset *p, unsigned long batch) -{ - pageset_update(&p->pcp, 6 * batch, max(1UL, 1 * batch)); + WRITE_ONCE(pcp->batch, batch); + WRITE_ONCE(pcp->high, high); } static void pageset_init(struct per_cpu_pageset *p) @@ -6297,53 +6352,70 @@ static void pageset_init(struct per_cpu_pageset *p) pcp = &p->pcp; for (migratetype = 0; migratetype < MIGRATE_PCPTYPES; migratetype++) INIT_LIST_HEAD(&pcp->lists[migratetype]); + + /* + * Set batch and high values safe for a boot pageset. A true percpu + * pageset's initialization will update them subsequently. Here we don't + * need to be as careful as pageset_update() as nobody can access the + * pageset yet. + */ + pcp->high = BOOT_PAGESET_HIGH; + pcp->batch = BOOT_PAGESET_BATCH; } -static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch) +static void __zone_set_pageset_high_and_batch(struct zone *zone, unsigned long high, + unsigned long batch) { - pageset_init(p); - pageset_set_batch(p, batch); + struct per_cpu_pageset *p; + int cpu; + + for_each_possible_cpu(cpu) { + p = per_cpu_ptr(zone->pageset, cpu); + pageset_update(&p->pcp, high, batch); + } } /* - * pageset_set_high() sets the high water mark for hot per_cpu_pagelist - * to the value high for the pageset p. + * Calculate and set new high and batch values for all per-cpu pagesets of a + * zone, based on the zone's size and the percpu_pagelist_fraction sysctl. */ -static void pageset_set_high(struct per_cpu_pageset *p, - unsigned long high) +static void zone_set_pageset_high_and_batch(struct zone *zone) { - unsigned long batch = max(1UL, high / 4); - if ((high / 4) > (PAGE_SHIFT * 8)) - batch = PAGE_SHIFT * 8; + unsigned long new_high, new_batch; - pageset_update(&p->pcp, high, batch); -} + if (percpu_pagelist_fraction) { + new_high = zone_managed_pages(zone) / percpu_pagelist_fraction; + new_batch = max(1UL, new_high / 4); + if ((new_high / 4) > (PAGE_SHIFT * 8)) + new_batch = PAGE_SHIFT * 8; + } else { + new_batch = zone_batchsize(zone); + new_high = 6 * new_batch; + new_batch = max(1UL, 1 * new_batch); + } -static void pageset_set_high_and_batch(struct zone *zone, - struct per_cpu_pageset *pcp) -{ - if (percpu_pagelist_fraction) - pageset_set_high(pcp, - (zone_managed_pages(zone) / - percpu_pagelist_fraction)); - else - pageset_set_batch(pcp, zone_batchsize(zone)); -} + if (zone->pageset_high == new_high && + zone->pageset_batch == new_batch) + return; -static void __meminit zone_pageset_init(struct zone *zone, int cpu) -{ - struct per_cpu_pageset *pcp = per_cpu_ptr(zone->pageset, cpu); + zone->pageset_high = new_high; + zone->pageset_batch = new_batch; - pageset_init(pcp); - pageset_set_high_and_batch(zone, pcp); + __zone_set_pageset_high_and_batch(zone, new_high, new_batch); } void __meminit setup_zone_pageset(struct zone *zone) { + struct per_cpu_pageset *p; int cpu; + zone->pageset = alloc_percpu(struct per_cpu_pageset); - for_each_possible_cpu(cpu) - zone_pageset_init(zone, cpu); + for_each_possible_cpu(cpu) { + p = per_cpu_ptr(zone->pageset, cpu); + pageset_init(p); + } + + zone_set_pageset_high_and_batch(zone); } /* @@ -6386,6 +6458,8 @@ static __meminit void zone_pcp_init(struct zone *zone) * offset of a (static) per cpu variable into the per cpu area. */ zone->pageset = &boot_pageset; + zone->pageset_high = BOOT_PAGESET_HIGH; + zone->pageset_batch = BOOT_PAGESET_BATCH; if (populated_zone(zone)) printk(KERN_DEBUG " %s zone: %lu pages, LIFO batch:%u\n", @@ -7791,31 +7865,24 @@ static void calculate_totalreserve_pages(void) static void setup_per_zone_lowmem_reserve(void) { struct pglist_data *pgdat; - enum zone_type j, idx; + enum zone_type i, j; for_each_online_pgdat(pgdat) { - for (j = 0; j < MAX_NR_ZONES; j++) { - struct zone *zone = pgdat->node_zones + j; - unsigned long managed_pages = zone_managed_pages(zone); - - zone->lowmem_reserve[j] = 0; - - idx = j; - while (idx) { - struct zone *lower_zone; - - idx--; - lower_zone = pgdat->node_zones + idx; - - if (!sysctl_lowmem_reserve_ratio[idx] || - !zone_managed_pages(lower_zone)) { - lower_zone->lowmem_reserve[j] = 0; - continue; + for (i = 0; i < MAX_NR_ZONES - 1; i++) { + struct zone *zone = &pgdat->node_zones[i]; + int ratio = sysctl_lowmem_reserve_ratio[i]; + bool clear = !ratio || !zone_managed_pages(zone); + unsigned long managed_pages = 0; + + for (j = i + 1; j < MAX_NR_ZONES; j++) { + if (clear) { + zone->lowmem_reserve[j] = 0; } else { - lower_zone->lowmem_reserve[j] = - managed_pages / sysctl_lowmem_reserve_ratio[idx]; + struct zone *upper_zone = &pgdat->node_zones[j]; + + managed_pages += zone_managed_pages(upper_zone); + zone->lowmem_reserve[j] = managed_pages / ratio; } - managed_pages += zone_managed_pages(lower_zone); } } } @@ -8077,15 +8144,6 @@ int lowmem_reserve_ratio_sysctl_handler(struct ctl_table *table, int write, return 0; } -static void __zone_pcp_update(struct zone *zone) -{ - unsigned int cpu; - - for_each_possible_cpu(cpu) - pageset_set_high_and_batch(zone, - per_cpu_ptr(zone->pageset, cpu)); -} - /* * percpu_pagelist_fraction - changes the pcp->high for each zone on each * cpu. It is the fraction of total pages in each zone that a hot per cpu @@ -8118,7 +8176,7 @@ int percpu_pagelist_fraction_sysctl_handler(struct ctl_table *table, int write, goto out; for_each_populated_zone(zone) - __zone_pcp_update(zone); + zone_set_pageset_high_and_batch(zone); out: mutex_unlock(&pcp_batch_high_lock); return ret; @@ -8517,6 +8575,8 @@ int alloc_contig_range(unsigned long start, unsigned long end, if (ret) return ret; + drain_all_pages(cc.zone); + /* * In case of -EBUSY, we'd like to know which page causes problem. * So, just fall through. test_pages_isolated() has a tracepoint @@ -8725,7 +8785,28 @@ EXPORT_SYMBOL(free_contig_range); void __meminit zone_pcp_update(struct zone *zone) { mutex_lock(&pcp_batch_high_lock); - __zone_pcp_update(zone); + zone_set_pageset_high_and_batch(zone); + mutex_unlock(&pcp_batch_high_lock); +} + +/* + * Effectively disable pcplists for the zone by setting the high limit to 0 + * and draining all cpus. A concurrent page freeing on another CPU that's about + * to put the page on pcplist will either finish before the drain and the page + * will be drained, or observe the new high limit and skip the pcplist. + * + * Must be paired with a call to zone_pcp_enable(). + */ +void zone_pcp_disable(struct zone *zone) +{ + mutex_lock(&pcp_batch_high_lock); + __zone_set_pageset_high_and_batch(zone, 0, 1); + __drain_all_pages(zone, true); +} + +void zone_pcp_enable(struct zone *zone) +{ + __zone_set_pageset_high_and_batch(zone, zone->pageset_high, zone->pageset_batch); mutex_unlock(&pcp_batch_high_lock); } diff --git a/mm/page_counter.c b/mm/page_counter.c index b24a60b28bb0..c6860f51b6c6 100644 --- a/mm/page_counter.c +++ b/mm/page_counter.c @@ -183,14 +183,14 @@ int page_counter_set_max(struct page_counter *counter, unsigned long nr_pages) * the limit, so if it sees the old limit, we see the * modified counter and retry. */ - usage = atomic_long_read(&counter->usage); + usage = page_counter_read(counter); if (usage > nr_pages) return -EBUSY; old = xchg(&counter->max, nr_pages); - if (atomic_long_read(&counter->usage) <= usage) + if (page_counter_read(counter) <= usage) return 0; counter->max = old; diff --git a/mm/page_ext.c b/mm/page_ext.c index a3616f7a0e9e..16b161f28a31 100644 --- a/mm/page_ext.c +++ b/mm/page_ext.c @@ -99,12 +99,19 @@ static void __init invoke_init_callbacks(void) } } +#ifndef CONFIG_SPARSEMEM +void __init page_ext_init_flatmem_late(void) +{ + invoke_init_callbacks(); +} +#endif + static inline struct page_ext *get_entry(void *base, unsigned long index) { return base + page_ext_size * index; } -#if !defined(CONFIG_SPARSEMEM) +#ifndef CONFIG_SPARSEMEM void __meminit pgdat_page_ext_init(struct pglist_data *pgdat) @@ -177,7 +184,6 @@ void __init page_ext_init_flatmem(void) goto fail; } pr_info("allocated %ld bytes of page_ext\n", total_usage); - invoke_init_callbacks(); return; fail: diff --git a/mm/page_isolation.c b/mm/page_isolation.c index abbf42214485..bddf788f45bf 100644 --- a/mm/page_isolation.c +++ b/mm/page_isolation.c @@ -49,7 +49,6 @@ static int set_migratetype_isolate(struct page *page, int migratetype, int isol_ __mod_zone_freepage_state(zone, -nr_pages, mt); spin_unlock_irqrestore(&zone->lock, flags); - drain_all_pages(zone); return 0; } @@ -89,7 +88,7 @@ static void unset_migratetype_isolate(struct page *page, unsigned migratetype) */ if (PageBuddy(page)) { order = buddy_order(page); - if (order >= pageblock_order) { + if (order >= pageblock_order && order < MAX_ORDER - 1) { pfn = page_to_pfn(page); buddy_pfn = __find_buddy_pfn(pfn, order); buddy = page + (buddy_pfn - pfn); @@ -172,11 +171,12 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages) * * Please note that there is no strong synchronization with the page allocator * either. Pages might be freed while their page blocks are marked ISOLATED. - * In some cases pages might still end up on pcp lists and that would allow + * A call to drain_all_pages() after isolation can flush most of them. However + * in some cases pages might still end up on pcp lists and that would allow * for their allocation even when they are in fact isolated already. Depending - * on how strong of a guarantee the caller needs drain_all_pages might be needed - * (e.g. __offline_pages will need to call it after check for isolated range for - * a next retry). + * on how strong of a guarantee the caller needs, zone_pcp_disable/enable() + * might be used to flush and disable pcplist before isolation and enable after + * unisolation. * * Return: 0 on success and -EBUSY if any part of range cannot be isolated. */ diff --git a/mm/page_owner.c b/mm/page_owner.c index b735a8eafcdb..af464bb7fbe7 100644 --- a/mm/page_owner.c +++ b/mm/page_owner.c @@ -10,6 +10,7 @@ #include <linux/migrate.h> #include <linux/stackdepot.h> #include <linux/seq_file.h> +#include <linux/sched/clock.h> #include "internal.h" @@ -25,6 +26,8 @@ struct page_owner { gfp_t gfp_mask; depot_stack_handle_t handle; depot_stack_handle_t free_handle; + u64 ts_nsec; + pid_t pid; }; static bool page_owner_enabled = false; @@ -172,6 +175,8 @@ static inline void __set_page_owner_handle(struct page *page, page_owner->order = order; page_owner->gfp_mask = gfp_mask; page_owner->last_migrate_reason = -1; + page_owner->pid = current->pid; + page_owner->ts_nsec = local_clock(); __set_bit(PAGE_EXT_OWNER, &page_ext->flags); __set_bit(PAGE_EXT_OWNER_ALLOCATED, &page_ext->flags); @@ -236,6 +241,8 @@ void __copy_page_owner(struct page *oldpage, struct page *newpage) new_page_owner->last_migrate_reason = old_page_owner->last_migrate_reason; new_page_owner->handle = old_page_owner->handle; + new_page_owner->pid = old_page_owner->pid; + new_page_owner->ts_nsec = old_page_owner->ts_nsec; /* * We don't clear the bit on the oldpage as it's going to be freed @@ -349,9 +356,10 @@ print_page_owner(char __user *buf, size_t count, unsigned long pfn, return -ENOMEM; ret = snprintf(kbuf, count, - "Page allocated via order %u, mask %#x(%pGg)\n", + "Page allocated via order %u, mask %#x(%pGg), pid %d, ts %llu ns\n", page_owner->order, page_owner->gfp_mask, - &page_owner->gfp_mask); + &page_owner->gfp_mask, page_owner->pid, + page_owner->ts_nsec); if (ret >= count) goto err; @@ -427,8 +435,9 @@ void __dump_page_owner(struct page *page) else pr_alert("page_owner tracks the page as freed\n"); - pr_alert("page last allocated via order %u, migratetype %s, gfp_mask %#x(%pGg)\n", - page_owner->order, migratetype_names[mt], gfp_mask, &gfp_mask); + pr_alert("page last allocated via order %u, migratetype %s, gfp_mask %#x(%pGg), pid %d, ts %llu\n", + page_owner->order, migratetype_names[mt], gfp_mask, &gfp_mask, + page_owner->pid, page_owner->ts_nsec); handle = READ_ONCE(page_owner->handle); if (!handle) { diff --git a/mm/page_poison.c b/mm/page_poison.c index ae0482cded87..06ec518b2089 100644 --- a/mm/page_poison.c +++ b/mm/page_poison.c @@ -8,45 +8,17 @@ #include <linux/ratelimit.h> #include <linux/kasan.h> -static DEFINE_STATIC_KEY_FALSE_RO(want_page_poisoning); +bool _page_poisoning_enabled_early; +EXPORT_SYMBOL(_page_poisoning_enabled_early); +DEFINE_STATIC_KEY_FALSE(_page_poisoning_enabled); +EXPORT_SYMBOL(_page_poisoning_enabled); static int __init early_page_poison_param(char *buf) { - int ret; - bool tmp; - - ret = strtobool(buf, &tmp); - if (ret) - return ret; - - if (tmp) - static_branch_enable(&want_page_poisoning); - else - static_branch_disable(&want_page_poisoning); - - return 0; + return kstrtobool(buf, &_page_poisoning_enabled_early); } early_param("page_poison", early_page_poison_param); -/** - * page_poisoning_enabled - check if page poisoning is enabled - * - * Return true if page poisoning is enabled, or false if not. - */ -bool page_poisoning_enabled(void) -{ - /* - * Assumes that debug_pagealloc_enabled is set before - * memblock_free_all. - * Page poisoning is debug page alloc for some arches. If - * either of those options are enabled, enable poisoning. - */ - return (static_branch_unlikely(&want_page_poisoning) || - (!IS_ENABLED(CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC) && - debug_pagealloc_enabled())); -} -EXPORT_SYMBOL_GPL(page_poisoning_enabled); - static void poison_page(struct page *page) { void *addr = kmap_atomic(page); @@ -58,7 +30,7 @@ static void poison_page(struct page *page) kunmap_atomic(addr); } -static void poison_pages(struct page *page, int n) +void __kernel_poison_pages(struct page *page, int n) { int i; @@ -79,9 +51,6 @@ static void check_poison_mem(unsigned char *mem, size_t bytes) unsigned char *start; unsigned char *end; - if (IS_ENABLED(CONFIG_PAGE_POISONING_NO_SANITY)) - return; - start = memchr_inv(mem, PAGE_POISON, bytes); if (!start) return; @@ -117,7 +86,7 @@ static void unpoison_page(struct page *page) kunmap_atomic(addr); } -static void unpoison_pages(struct page *page, int n) +void __kernel_unpoison_pages(struct page *page, int n) { int i; @@ -125,17 +94,6 @@ static void unpoison_pages(struct page *page, int n) unpoison_page(page + i); } -void kernel_poison_pages(struct page *page, int numpages, int enable) -{ - if (!page_poisoning_enabled()) - return; - - if (enable) - unpoison_pages(page, numpages); - else - poison_pages(page, numpages); -} - #ifndef CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC void __kernel_map_pages(struct page *page, int numpages, int enable) { diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c index 5e77b269c330..86e3a3688d59 100644 --- a/mm/page_vma_mapped.c +++ b/mm/page_vma_mapped.c @@ -66,18 +66,19 @@ static inline bool pfn_is_match(struct page *page, unsigned long pfn) /** * check_pte - check if @pvmw->page is mapped at the @pvmw->pte + * @pvmw: page_vma_mapped_walk struct, includes a pair pte and page for checking * * page_vma_mapped_walk() found a place where @pvmw->page is *potentially* * mapped. check_pte() has to validate this. * - * @pvmw->pte may point to empty PTE, swap PTE or PTE pointing to arbitrary - * page. + * pvmw->pte may point to empty PTE, swap PTE or PTE pointing to + * arbitrary page. * * If PVMW_MIGRATION flag is set, returns true if @pvmw->pte contains migration * entry that points to @pvmw->page or any subpage in case of THP. * - * If PVMW_MIGRATION flag is not set, returns true if @pvmw->pte points to - * @pvmw->page or any subpage in case of THP. + * If PVMW_MIGRATION flag is not set, returns true if pvmw->pte points to + * pvmw->page or any subpage in case of THP. * * Otherwise, return false. * diff --git a/mm/process_vm_access.c b/mm/process_vm_access.c index 702250f148e7..4bcc11958089 100644 --- a/mm/process_vm_access.c +++ b/mm/process_vm_access.c @@ -260,7 +260,7 @@ static ssize_t process_vm_rw(pid_t pid, struct iovec iovstack_l[UIO_FASTIOV]; struct iovec iovstack_r[UIO_FASTIOV]; struct iovec *iov_l = iovstack_l; - struct iovec *iov_r = iovstack_r; + struct iovec *iov_r; struct iov_iter iter; ssize_t rc; int dir = vm_write ? WRITE : READ; diff --git a/mm/rmap.c b/mm/rmap.c index 31b29321adfe..6657000b18d4 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1533,15 +1533,6 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, goto discard; } - if (!(flags & TTU_IGNORE_ACCESS)) { - if (ptep_clear_flush_young_notify(vma, address, - pvmw.pte)) { - ret = false; - page_vma_mapped_walk_done(&pvmw); - break; - } - } - /* Nuke the page table entry. */ flush_cache_page(vma, address, pte_pfn(*pvmw.pte)); if (should_defer_flush(mm, flags)) { diff --git a/mm/shmem.c b/mm/shmem.c index 537c137698f8..7c6b6d8f6c39 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -246,7 +246,7 @@ static inline void shmem_inode_unacct_blocks(struct inode *inode, long pages) } static const struct super_operations shmem_ops; -static const struct address_space_operations shmem_aops; +const struct address_space_operations shmem_aops; static const struct file_operations shmem_file_operations; static const struct inode_operations shmem_inode_operations; static const struct inode_operations shmem_dir_inode_operations; @@ -713,7 +713,7 @@ next: } if (PageTransHuge(page)) { count_vm_event(THP_FILE_ALLOC); - __inc_node_page_state(page, NR_SHMEM_THPS); + __inc_lruvec_page_state(page, NR_SHMEM_THPS); } mapping->nrpages += nr; __mod_lruvec_page_state(page, NR_FILE_PAGES, nr); @@ -1152,7 +1152,7 @@ static void shmem_evict_inode(struct inode *inode) struct shmem_inode_info *info = SHMEM_I(inode); struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb); - if (inode->i_mapping->a_ops == &shmem_aops) { + if (shmem_mapping(inode->i_mapping)) { shmem_unacct_size(info->flags, inode->i_size); inode->i_size = 0; shmem_truncate_range(inode, 0, (loff_t)-1); @@ -1858,7 +1858,7 @@ repeat: } /* shmem_symlink() */ - if (mapping->a_ops != &shmem_aops) + if (!shmem_mapping(mapping)) goto alloc_nohuge; if (shmem_huge == SHMEM_HUGE_DENY || sgp_huge == SGP_NOHUGE) goto alloc_nohuge; @@ -2352,11 +2352,6 @@ static struct inode *shmem_get_inode(struct super_block *sb, const struct inode return inode; } -bool shmem_mapping(struct address_space *mapping) -{ - return mapping->a_ops == &shmem_aops; -} - static int shmem_mfill_atomic_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd, struct vm_area_struct *dst_vma, @@ -3865,7 +3860,7 @@ static void shmem_destroy_inodecache(void) kmem_cache_destroy(shmem_inode_cachep); } -static const struct address_space_operations shmem_aops = { +const struct address_space_operations shmem_aops = { .writepage = shmem_writepage, .set_page_dirty = __set_page_dirty_no_writeback, #ifdef CONFIG_TMPFS @@ -3877,6 +3872,7 @@ static const struct address_space_operations shmem_aops = { #endif .error_remove_page = generic_error_remove_page, }; +EXPORT_SYMBOL(shmem_aops); static const struct file_operations shmem_file_operations = { .mmap = shmem_mmap, @@ -4024,7 +4020,7 @@ out2: #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && defined(CONFIG_SYSFS) static ssize_t shmem_enabled_show(struct kobject *kobj, - struct kobj_attribute *attr, char *buf) + struct kobj_attribute *attr, char *buf) { static const int values[] = { SHMEM_HUGE_ALWAYS, @@ -4034,16 +4030,19 @@ static ssize_t shmem_enabled_show(struct kobject *kobj, SHMEM_HUGE_DENY, SHMEM_HUGE_FORCE, }; - int i, count; - - for (i = 0, count = 0; i < ARRAY_SIZE(values); i++) { - const char *fmt = shmem_huge == values[i] ? "[%s] " : "%s "; + int len = 0; + int i; - count += sprintf(buf + count, fmt, - shmem_format_huge(values[i])); + for (i = 0; i < ARRAY_SIZE(values); i++) { + len += sysfs_emit_at(buf, len, + shmem_huge == values[i] ? "%s[%s]" : "%s%s", + i ? " " : "", + shmem_format_huge(values[i])); } - buf[count - 1] = '\n'; - return count; + + len += sysfs_emit_at(buf, len, "\n"); + + return len; } static ssize_t shmem_enabled_store(struct kobject *kobj, @@ -4312,7 +4311,7 @@ struct page *shmem_read_mapping_page_gfp(struct address_space *mapping, struct page *page; int error; - BUG_ON(mapping->a_ops != &shmem_aops); + BUG_ON(!shmem_mapping(mapping)); error = shmem_getpage_gfp(inode, index, &page, SGP_CACHE, gfp, NULL, NULL, NULL); if (error) diff --git a/mm/slab.c b/mm/slab.c index b1113561b98b..d7c8da9319c7 100644 --- a/mm/slab.c +++ b/mm/slab.c @@ -1399,7 +1399,8 @@ static void kmem_freepages(struct kmem_cache *cachep, struct page *page) __ClearPageSlabPfmemalloc(page); __ClearPageSlab(page); page_mapcount_reset(page); - page->mapping = NULL; + /* In union with page->mapping where page allocator expects NULL */ + page->slab_cache = NULL; if (current->reclaim_state) current->reclaim_state->reclaimed_slab += 1 << order; @@ -1434,7 +1435,7 @@ static void slab_kernel_map(struct kmem_cache *cachep, void *objp, int map) if (!is_debug_pagealloc_cache(cachep)) return; - kernel_map_pages(virt_to_page(objp), cachep->size / PAGE_SIZE, map); + __kernel_map_pages(virt_to_page(objp), cachep->size / PAGE_SIZE, map); } #else @@ -3416,6 +3417,9 @@ free_done: static __always_inline void __cache_free(struct kmem_cache *cachep, void *objp, unsigned long caller) { + if (unlikely(slab_want_init_on_free(cachep))) + memset(objp, 0, cachep->object_size); + /* Put the object into the quarantine, don't touch it for now. */ if (kasan_slab_free(cachep, objp, _RET_IP_)) return; @@ -3434,8 +3438,6 @@ void ___cache_free(struct kmem_cache *cachep, void *objp, struct array_cache *ac = cpu_cache_get(cachep); check_irq_off(); - if (unlikely(slab_want_init_on_free(cachep))) - memset(objp, 0, cachep->object_size); kmemleak_free_recursive(objp, cachep->flags); objp = cache_free_debugcheck(cachep, objp, caller); memcg_slab_free_hook(cachep, &objp, 1); diff --git a/mm/slab.h b/mm/slab.h index f9977d6613d6..54faf9a623f9 100644 --- a/mm/slab.h +++ b/mm/slab.h @@ -204,7 +204,7 @@ ssize_t slabinfo_write(struct file *file, const char __user *buffer, void __kmem_cache_free_bulk(struct kmem_cache *, size_t, void **); int __kmem_cache_alloc_bulk(struct kmem_cache *, gfp_t, size_t, void **); -static inline int cache_vmstat_idx(struct kmem_cache *s) +static inline enum node_stat_item cache_vmstat_idx(struct kmem_cache *s) { return (s->flags & SLAB_RECLAIM_ACCOUNT) ? NR_SLAB_RECLAIMABLE_B : NR_SLAB_UNRECLAIMABLE_B; @@ -304,7 +304,7 @@ static inline bool memcg_slab_pre_alloc_hook(struct kmem_cache *s, static inline void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat, - int idx, int nr) + enum node_stat_item idx, int nr) { struct mem_cgroup *memcg; struct lruvec *lruvec; @@ -510,10 +510,7 @@ static inline struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s, { flags &= gfp_allowed_mask; - fs_reclaim_acquire(flags); - fs_reclaim_release(flags); - - might_sleep_if(gfpflags_allow_blocking(flags)); + might_alloc(flags); if (should_failslab(s, flags)) return NULL; diff --git a/mm/slab_common.c b/mm/slab_common.c index f9ccd5dc13f3..2f2b55c2798e 100644 --- a/mm/slab_common.c +++ b/mm/slab_common.c @@ -978,7 +978,7 @@ static int slab_show(struct seq_file *m, void *p) void dump_unreclaimable_slab(void) { - struct kmem_cache *s, *s2; + struct kmem_cache *s; struct slabinfo sinfo; /* @@ -996,7 +996,7 @@ void dump_unreclaimable_slab(void) pr_info("Unreclaimable slab info:\n"); pr_info("Name Used Total\n"); - list_for_each_entry_safe(s, s2, &slab_caches, list) { + list_for_each_entry(s, &slab_caches, list) { if (s->flags & SLAB_RECLAIM_ACCOUNT) continue; @@ -1091,9 +1091,9 @@ static __always_inline void *__do_krealloc(const void *p, size_t new_size, * @flags: the type of memory to allocate. * * The contents of the object pointed to are preserved up to the - * lesser of the new and old sizes. If @p is %NULL, krealloc() - * behaves exactly like kmalloc(). If @new_size is 0 and @p is not a - * %NULL pointer, the object pointed to is freed. + * lesser of the new and old sizes (__GFP_ZERO flag is effectively ignored). + * If @p is %NULL, krealloc() behaves exactly like kmalloc(). If @new_size + * is 0 and @p is not a %NULL pointer, the object pointed to is freed. * * Return: pointer to the allocated memory or %NULL in case of error */ diff --git a/mm/slob.c b/mm/slob.c index 7cc9805c8091..8d4bfa46247f 100644 --- a/mm/slob.c +++ b/mm/slob.c @@ -474,8 +474,7 @@ __do_kmalloc_node(size_t size, gfp_t gfp, int node, unsigned long caller) gfp &= gfp_allowed_mask; - fs_reclaim_acquire(gfp); - fs_reclaim_release(gfp); + might_alloc(gfp); if (size < PAGE_SIZE - minalign) { int align = minalign; @@ -597,8 +596,7 @@ static void *slob_alloc_node(struct kmem_cache *c, gfp_t flags, int node) flags &= gfp_allowed_mask; - fs_reclaim_acquire(flags); - fs_reclaim_release(flags); + might_alloc(flags); if (c->size < PAGE_SIZE) { b = slob_alloc(c->size, flags, c->align, node, 0); diff --git a/mm/slub.c b/mm/slub.c index 34dcc09e2ec9..4552319148f6 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -1836,8 +1836,8 @@ static void __free_slab(struct kmem_cache *s, struct page *page) __ClearPageSlabPfmemalloc(page); __ClearPageSlab(page); - - page->mapping = NULL; + /* In union with page->mapping where page allocator expects NULL */ + page->slab_cache = NULL; if (current->reclaim_state) current->reclaim_state->reclaimed_slab += pages; unaccount_slab_page(page, order, s); @@ -2245,8 +2245,7 @@ redo: } } else { m = M_FULL; -#ifdef CONFIG_SLUB_DEBUG - if ((s->flags & SLAB_STORE_USER) && !lock) { + if (kmem_cache_debug_flags(s, SLAB_STORE_USER) && !lock) { lock = 1; /* * This also ensures that the scanning of full @@ -2255,7 +2254,6 @@ redo: */ spin_lock(&n->list_lock); } -#endif } if (l != m) { @@ -3433,7 +3431,7 @@ static inline int calculate_order(unsigned int size) */ min_objects = slub_min_objects; if (!min_objects) - min_objects = 4 * (fls(nr_cpu_ids) + 1); + min_objects = 4 * (fls(num_online_cpus()) + 1); max_objects = order_objects(slub_max_order, size); min_objects = min(min_objects, max_objects); @@ -4726,7 +4724,7 @@ static void process_slab(struct loc_track *t, struct kmem_cache *s, } static int list_locations(struct kmem_cache *s, char *buf, - enum track_item alloc) + enum track_item alloc) { int len = 0; unsigned long i; @@ -4736,7 +4734,7 @@ static int list_locations(struct kmem_cache *s, char *buf, if (!alloc_loc_track(&t, PAGE_SIZE / sizeof(struct location), GFP_KERNEL)) { - return sprintf(buf, "Out of memory\n"); + return sysfs_emit(buf, "Out of memory\n"); } /* Push back cpu slabs */ flush_all(s); @@ -4759,50 +4757,45 @@ static int list_locations(struct kmem_cache *s, char *buf, for (i = 0; i < t.count; i++) { struct location *l = &t.loc[i]; - if (len > PAGE_SIZE - KSYM_SYMBOL_LEN - 100) - break; - len += sprintf(buf + len, "%7ld ", l->count); + len += sysfs_emit_at(buf, len, "%7ld ", l->count); if (l->addr) - len += sprintf(buf + len, "%pS", (void *)l->addr); + len += sysfs_emit_at(buf, len, "%pS", (void *)l->addr); else - len += sprintf(buf + len, "<not-available>"); - - if (l->sum_time != l->min_time) { - len += sprintf(buf + len, " age=%ld/%ld/%ld", - l->min_time, - (long)div_u64(l->sum_time, l->count), - l->max_time); - } else - len += sprintf(buf + len, " age=%ld", - l->min_time); + len += sysfs_emit_at(buf, len, "<not-available>"); + + if (l->sum_time != l->min_time) + len += sysfs_emit_at(buf, len, " age=%ld/%ld/%ld", + l->min_time, + (long)div_u64(l->sum_time, + l->count), + l->max_time); + else + len += sysfs_emit_at(buf, len, " age=%ld", l->min_time); if (l->min_pid != l->max_pid) - len += sprintf(buf + len, " pid=%ld-%ld", - l->min_pid, l->max_pid); + len += sysfs_emit_at(buf, len, " pid=%ld-%ld", + l->min_pid, l->max_pid); else - len += sprintf(buf + len, " pid=%ld", - l->min_pid); + len += sysfs_emit_at(buf, len, " pid=%ld", + l->min_pid); if (num_online_cpus() > 1 && - !cpumask_empty(to_cpumask(l->cpus)) && - len < PAGE_SIZE - 60) - len += scnprintf(buf + len, PAGE_SIZE - len - 50, - " cpus=%*pbl", - cpumask_pr_args(to_cpumask(l->cpus))); - - if (nr_online_nodes > 1 && !nodes_empty(l->nodes) && - len < PAGE_SIZE - 60) - len += scnprintf(buf + len, PAGE_SIZE - len - 50, - " nodes=%*pbl", - nodemask_pr_args(&l->nodes)); - - len += sprintf(buf + len, "\n"); + !cpumask_empty(to_cpumask(l->cpus))) + len += sysfs_emit_at(buf, len, " cpus=%*pbl", + cpumask_pr_args(to_cpumask(l->cpus))); + + if (nr_online_nodes > 1 && !nodes_empty(l->nodes)) + len += sysfs_emit_at(buf, len, " nodes=%*pbl", + nodemask_pr_args(&l->nodes)); + + len += sysfs_emit_at(buf, len, "\n"); } free_loc_track(&t); if (!t.count) - len += sprintf(buf, "No data\n"); + len += sysfs_emit_at(buf, len, "No data\n"); + return len; } #endif /* CONFIG_SLUB_DEBUG */ @@ -4899,12 +4892,13 @@ __setup("slub_memcg_sysfs=", setup_slub_memcg_sysfs); #endif static ssize_t show_slab_objects(struct kmem_cache *s, - char *buf, unsigned long flags) + char *buf, unsigned long flags) { unsigned long total = 0; int node; int x; unsigned long *nodes; + int len = 0; nodes = kcalloc(nr_node_ids, sizeof(unsigned long), GFP_KERNEL); if (!nodes) @@ -4993,15 +4987,19 @@ static ssize_t show_slab_objects(struct kmem_cache *s, nodes[node] += x; } } - x = sprintf(buf, "%lu", total); + + len += sysfs_emit_at(buf, len, "%lu", total); #ifdef CONFIG_NUMA - for (node = 0; node < nr_node_ids; node++) + for (node = 0; node < nr_node_ids; node++) { if (nodes[node]) - x += sprintf(buf + x, " N%d=%lu", - node, nodes[node]); + len += sysfs_emit_at(buf, len, " N%d=%lu", + node, nodes[node]); + } #endif + len += sysfs_emit_at(buf, len, "\n"); kfree(nodes); - return x + sprintf(buf + x, "\n"); + + return len; } #define to_slab_attr(n) container_of(n, struct slab_attribute, attr) @@ -5023,37 +5021,37 @@ struct slab_attribute { static ssize_t slab_size_show(struct kmem_cache *s, char *buf) { - return sprintf(buf, "%u\n", s->size); + return sysfs_emit(buf, "%u\n", s->size); } SLAB_ATTR_RO(slab_size); static ssize_t align_show(struct kmem_cache *s, char *buf) { - return sprintf(buf, "%u\n", s->align); + return sysfs_emit(buf, "%u\n", s->align); } SLAB_ATTR_RO(align); static ssize_t object_size_show(struct kmem_cache *s, char *buf) { - return sprintf(buf, "%u\n", s->object_size); + return sysfs_emit(buf, "%u\n", s->object_size); } SLAB_ATTR_RO(object_size); static ssize_t objs_per_slab_show(struct kmem_cache *s, char *buf) { - return sprintf(buf, "%u\n", oo_objects(s->oo)); + return sysfs_emit(buf, "%u\n", oo_objects(s->oo)); } SLAB_ATTR_RO(objs_per_slab); static ssize_t order_show(struct kmem_cache *s, char *buf) { - return sprintf(buf, "%u\n", oo_order(s->oo)); + return sysfs_emit(buf, "%u\n", oo_order(s->oo)); } SLAB_ATTR_RO(order); static ssize_t min_partial_show(struct kmem_cache *s, char *buf) { - return sprintf(buf, "%lu\n", s->min_partial); + return sysfs_emit(buf, "%lu\n", s->min_partial); } static ssize_t min_partial_store(struct kmem_cache *s, const char *buf, @@ -5073,7 +5071,7 @@ SLAB_ATTR(min_partial); static ssize_t cpu_partial_show(struct kmem_cache *s, char *buf) { - return sprintf(buf, "%u\n", slub_cpu_partial(s)); + return sysfs_emit(buf, "%u\n", slub_cpu_partial(s)); } static ssize_t cpu_partial_store(struct kmem_cache *s, const char *buf, @@ -5098,13 +5096,13 @@ static ssize_t ctor_show(struct kmem_cache *s, char *buf) { if (!s->ctor) return 0; - return sprintf(buf, "%pS\n", s->ctor); + return sysfs_emit(buf, "%pS\n", s->ctor); } SLAB_ATTR_RO(ctor); static ssize_t aliases_show(struct kmem_cache *s, char *buf) { - return sprintf(buf, "%d\n", s->refcount < 0 ? 0 : s->refcount - 1); + return sysfs_emit(buf, "%d\n", s->refcount < 0 ? 0 : s->refcount - 1); } SLAB_ATTR_RO(aliases); @@ -5137,7 +5135,7 @@ static ssize_t slabs_cpu_partial_show(struct kmem_cache *s, char *buf) int objects = 0; int pages = 0; int cpu; - int len; + int len = 0; for_each_online_cpu(cpu) { struct page *page; @@ -5150,52 +5148,53 @@ static ssize_t slabs_cpu_partial_show(struct kmem_cache *s, char *buf) } } - len = sprintf(buf, "%d(%d)", objects, pages); + len += sysfs_emit_at(buf, len, "%d(%d)", objects, pages); #ifdef CONFIG_SMP for_each_online_cpu(cpu) { struct page *page; page = slub_percpu_partial(per_cpu_ptr(s->cpu_slab, cpu)); - - if (page && len < PAGE_SIZE - 20) - len += sprintf(buf + len, " C%d=%d(%d)", cpu, - page->pobjects, page->pages); + if (page) + len += sysfs_emit_at(buf, len, " C%d=%d(%d)", + cpu, page->pobjects, page->pages); } #endif - return len + sprintf(buf + len, "\n"); + len += sysfs_emit_at(buf, len, "\n"); + + return len; } SLAB_ATTR_RO(slabs_cpu_partial); static ssize_t reclaim_account_show(struct kmem_cache *s, char *buf) { - return sprintf(buf, "%d\n", !!(s->flags & SLAB_RECLAIM_ACCOUNT)); + return sysfs_emit(buf, "%d\n", !!(s->flags & SLAB_RECLAIM_ACCOUNT)); } SLAB_ATTR_RO(reclaim_account); static ssize_t hwcache_align_show(struct kmem_cache *s, char *buf) { - return sprintf(buf, "%d\n", !!(s->flags & SLAB_HWCACHE_ALIGN)); + return sysfs_emit(buf, "%d\n", !!(s->flags & SLAB_HWCACHE_ALIGN)); } SLAB_ATTR_RO(hwcache_align); #ifdef CONFIG_ZONE_DMA static ssize_t cache_dma_show(struct kmem_cache *s, char *buf) { - return sprintf(buf, "%d\n", !!(s->flags & SLAB_CACHE_DMA)); + return sysfs_emit(buf, "%d\n", !!(s->flags & SLAB_CACHE_DMA)); } SLAB_ATTR_RO(cache_dma); #endif static ssize_t usersize_show(struct kmem_cache *s, char *buf) { - return sprintf(buf, "%u\n", s->usersize); + return sysfs_emit(buf, "%u\n", s->usersize); } SLAB_ATTR_RO(usersize); static ssize_t destroy_by_rcu_show(struct kmem_cache *s, char *buf) { - return sprintf(buf, "%d\n", !!(s->flags & SLAB_TYPESAFE_BY_RCU)); + return sysfs_emit(buf, "%d\n", !!(s->flags & SLAB_TYPESAFE_BY_RCU)); } SLAB_ATTR_RO(destroy_by_rcu); @@ -5214,33 +5213,33 @@ SLAB_ATTR_RO(total_objects); static ssize_t sanity_checks_show(struct kmem_cache *s, char *buf) { - return sprintf(buf, "%d\n", !!(s->flags & SLAB_CONSISTENCY_CHECKS)); + return sysfs_emit(buf, "%d\n", !!(s->flags & SLAB_CONSISTENCY_CHECKS)); } SLAB_ATTR_RO(sanity_checks); static ssize_t trace_show(struct kmem_cache *s, char *buf) { - return sprintf(buf, "%d\n", !!(s->flags & SLAB_TRACE)); + return sysfs_emit(buf, "%d\n", !!(s->flags & SLAB_TRACE)); } SLAB_ATTR_RO(trace); static ssize_t red_zone_show(struct kmem_cache *s, char *buf) { - return sprintf(buf, "%d\n", !!(s->flags & SLAB_RED_ZONE)); + return sysfs_emit(buf, "%d\n", !!(s->flags & SLAB_RED_ZONE)); } SLAB_ATTR_RO(red_zone); static ssize_t poison_show(struct kmem_cache *s, char *buf) { - return sprintf(buf, "%d\n", !!(s->flags & SLAB_POISON)); + return sysfs_emit(buf, "%d\n", !!(s->flags & SLAB_POISON)); } SLAB_ATTR_RO(poison); static ssize_t store_user_show(struct kmem_cache *s, char *buf) { - return sprintf(buf, "%d\n", !!(s->flags & SLAB_STORE_USER)); + return sysfs_emit(buf, "%d\n", !!(s->flags & SLAB_STORE_USER)); } SLAB_ATTR_RO(store_user); @@ -5284,7 +5283,7 @@ SLAB_ATTR_RO(free_calls); #ifdef CONFIG_FAILSLAB static ssize_t failslab_show(struct kmem_cache *s, char *buf) { - return sprintf(buf, "%d\n", !!(s->flags & SLAB_FAILSLAB)); + return sysfs_emit(buf, "%d\n", !!(s->flags & SLAB_FAILSLAB)); } SLAB_ATTR_RO(failslab); #endif @@ -5308,7 +5307,7 @@ SLAB_ATTR(shrink); #ifdef CONFIG_NUMA static ssize_t remote_node_defrag_ratio_show(struct kmem_cache *s, char *buf) { - return sprintf(buf, "%u\n", s->remote_node_defrag_ratio / 10); + return sysfs_emit(buf, "%u\n", s->remote_node_defrag_ratio / 10); } static ssize_t remote_node_defrag_ratio_store(struct kmem_cache *s, @@ -5335,7 +5334,7 @@ static int show_stat(struct kmem_cache *s, char *buf, enum stat_item si) { unsigned long sum = 0; int cpu; - int len; + int len = 0; int *data = kmalloc_array(nr_cpu_ids, sizeof(int), GFP_KERNEL); if (!data) @@ -5348,16 +5347,19 @@ static int show_stat(struct kmem_cache *s, char *buf, enum stat_item si) sum += x; } - len = sprintf(buf, "%lu", sum); + len += sysfs_emit_at(buf, len, "%lu", sum); #ifdef CONFIG_SMP for_each_online_cpu(cpu) { - if (data[cpu] && len < PAGE_SIZE - 20) - len += sprintf(buf + len, " C%d=%u", cpu, data[cpu]); + if (data[cpu]) + len += sysfs_emit_at(buf, len, " C%d=%u", + cpu, data[cpu]); } #endif kfree(data); - return len + sprintf(buf + len, "\n"); + len += sysfs_emit_at(buf, len, "\n"); + + return len; } static void clear_stat(struct kmem_cache *s, enum stat_item si) diff --git a/mm/swap.c b/mm/swap.c index 47a47681c86b..16a525296960 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -909,6 +909,9 @@ void release_pages(struct page **pages, int nr) put_devmap_managed_page(page); continue; } + if (put_page_testzero(page)) + put_dev_pagemap(page->pgmap); + continue; } if (!put_page_testzero(page)) @@ -1164,15 +1167,6 @@ unsigned pagevec_lookup_range_tag(struct pagevec *pvec, } EXPORT_SYMBOL(pagevec_lookup_range_tag); -unsigned pagevec_lookup_range_nr_tag(struct pagevec *pvec, - struct address_space *mapping, pgoff_t *index, pgoff_t end, - xa_mark_t tag, unsigned max_pages) -{ - pvec->nr = find_get_pages_range_tag(mapping, index, end, tag, - min_t(unsigned int, max_pages, PAGEVEC_SIZE), pvec->pages); - return pagevec_count(pvec); -} -EXPORT_SYMBOL(pagevec_lookup_range_nr_tag); /* * Perform any setup for the swap system */ diff --git a/mm/swap_state.c b/mm/swap_state.c index ee465827420e..751c1ef2fe0e 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -839,7 +839,9 @@ static struct page *swap_vma_readahead(swp_entry_t fentry, gfp_t gfp_mask, swp_entry_t entry; unsigned int i; bool page_allocated; - struct vma_swap_readahead ra_info = {0,}; + struct vma_swap_readahead ra_info = { + .win = 1, + }; swap_ra_info(vmf, &ra_info); if (ra_info.win == 1) @@ -900,7 +902,8 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask, static ssize_t vma_ra_enabled_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { - return sprintf(buf, "%s\n", enable_vma_readahead ? "true" : "false"); + return sysfs_emit(buf, "%s\n", + enable_vma_readahead ? "true" : "false"); } static ssize_t vma_ra_enabled_store(struct kobject *kobj, struct kobj_attribute *attr, diff --git a/mm/swapfile.c b/mm/swapfile.c index d58361109066..1c0a829f7311 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -975,8 +975,7 @@ static int swap_alloc_cluster(struct swap_info_struct *si, swp_entry_t *slot) { unsigned long idx; struct swap_cluster_info *ci; - unsigned long offset, i; - unsigned char *map; + unsigned long offset; /* * Should not even be attempting cluster allocations when huge @@ -996,9 +995,7 @@ static int swap_alloc_cluster(struct swap_info_struct *si, swp_entry_t *slot) alloc_cluster(si, idx); cluster_set_count_flag(ci, SWAPFILE_CLUSTER, CLUSTER_FLAG_HUGE); - map = si->swap_map + offset; - for (i = 0; i < SWAPFILE_CLUSTER; i++) - map[i] = SWAP_HAS_CACHE; + memset(si->swap_map + offset, SWAP_HAS_CACHE, SWAPFILE_CLUSTER); unlock_cluster(ci); swap_range_alloc(si, offset, SWAPFILE_CLUSTER); *slot = swp_entry(si->type, offset); @@ -3445,11 +3442,11 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage) unsigned long offset; unsigned char count; unsigned char has_cache; - int err = -EINVAL; + int err; p = get_swap_device(entry); if (!p) - goto out; + return -EINVAL; offset = swp_offset(entry); ci = lock_cluster_or_swap_info(p, offset); @@ -3496,7 +3493,6 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage) unlock_out: unlock_cluster_or_swap_info(p, ci); -out: if (p) put_swap_device(p); return err; @@ -3613,7 +3609,7 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask) ci = lock_cluster(si, offset); - count = si->swap_map[offset] & ~SWAP_HAS_CACHE; + count = swap_count(si->swap_map[offset]); if ((count & ~COUNT_CONTINUED) != SWAP_MAP_MAX) { /* diff --git a/mm/truncate.c b/mm/truncate.c index 960edf5803ca..8aa4907e06e0 100644 --- a/mm/truncate.c +++ b/mm/truncate.c @@ -637,9 +637,15 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping, EXPORT_SYMBOL(invalidate_mapping_pages); /** - * This helper is similar with the above one, except that it accounts for pages - * that are likely on a pagevec and count them in @nr_pagevec, which will used by - * the caller. + * invalidate_mapping_pagevec - Invalidate all the unlocked pages of one inode + * @mapping: the address_space which holds the pages to invalidate + * @start: the offset 'from' which to invalidate + * @end: the offset 'to' which to invalidate (inclusive) + * @nr_pagevec: invalidate failed page number for caller + * + * This helper is similar to invalidate_mapping_pages(), except that it accounts + * for pages that are likely on a pagevec and counts them in @nr_pagevec, which + * will be used by the caller. */ void invalidate_mapping_pagevec(struct address_space *mapping, pgoff_t start, pgoff_t end, unsigned long *nr_pagevec) diff --git a/mm/vmalloc.c b/mm/vmalloc.c index 6ae491a8b210..4d88fe5a277a 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -413,10 +413,13 @@ static DEFINE_SPINLOCK(vmap_area_lock); static DEFINE_SPINLOCK(free_vmap_area_lock); /* Export for kexec only */ LIST_HEAD(vmap_area_list); -static LLIST_HEAD(vmap_purge_list); static struct rb_root vmap_area_root = RB_ROOT; static bool vmap_initialized __read_mostly; +static struct rb_root purge_vmap_area_root = RB_ROOT; +static LIST_HEAD(purge_vmap_area_list); +static DEFINE_SPINLOCK(purge_vmap_area_lock); + /* * This kmem_cache is used for vmap_area objects. Instead of * allocating from slab we reuse an object from this cache to @@ -820,10 +823,17 @@ insert: if (!merged) link_va(va, root, parent, link, head); - /* - * Last step is to check and update the tree. - */ - augment_tree_propagate_from(va); + return va; +} + +static __always_inline struct vmap_area * +merge_or_add_vmap_area_augment(struct vmap_area *va, + struct rb_root *root, struct list_head *head) +{ + va = merge_or_add_vmap_area(va, root, head); + if (va) + augment_tree_propagate_from(va); + return va; } @@ -1138,7 +1148,7 @@ static void free_vmap_area(struct vmap_area *va) * Insert/Merge it back to the free tree/list. */ spin_lock(&free_vmap_area_lock); - merge_or_add_vmap_area(va, &free_vmap_area_root, &free_vmap_area_list); + merge_or_add_vmap_area_augment(va, &free_vmap_area_root, &free_vmap_area_list); spin_unlock(&free_vmap_area_lock); } @@ -1326,32 +1336,32 @@ void set_iounmap_nonlazy(void) static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end) { unsigned long resched_threshold; - struct llist_node *valist; - struct vmap_area *va; - struct vmap_area *n_va; + struct list_head local_pure_list; + struct vmap_area *va, *n_va; lockdep_assert_held(&vmap_purge_lock); - valist = llist_del_all(&vmap_purge_list); - if (unlikely(valist == NULL)) + spin_lock(&purge_vmap_area_lock); + purge_vmap_area_root = RB_ROOT; + list_replace_init(&purge_vmap_area_list, &local_pure_list); + spin_unlock(&purge_vmap_area_lock); + + if (unlikely(list_empty(&local_pure_list))) return false; - /* - * TODO: to calculate a flush range without looping. - * The list can be up to lazy_max_pages() elements. - */ - llist_for_each_entry(va, valist, purge_list) { - if (va->va_start < start) - start = va->va_start; - if (va->va_end > end) - end = va->va_end; - } + start = min(start, + list_first_entry(&local_pure_list, + struct vmap_area, list)->va_start); + + end = max(end, + list_last_entry(&local_pure_list, + struct vmap_area, list)->va_end); flush_tlb_kernel_range(start, end); resched_threshold = lazy_max_pages() << 1; spin_lock(&free_vmap_area_lock); - llist_for_each_entry_safe(va, n_va, valist, purge_list) { + list_for_each_entry_safe(va, n_va, &local_pure_list, list) { unsigned long nr = (va->va_end - va->va_start) >> PAGE_SHIFT; unsigned long orig_start = va->va_start; unsigned long orig_end = va->va_end; @@ -1361,8 +1371,8 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end) * detached and there is no need to "unlink" it from * anything. */ - va = merge_or_add_vmap_area(va, &free_vmap_area_root, - &free_vmap_area_list); + va = merge_or_add_vmap_area_augment(va, &free_vmap_area_root, + &free_vmap_area_list); if (!va) continue; @@ -1419,9 +1429,15 @@ static void free_vmap_area_noflush(struct vmap_area *va) nr_lazy = atomic_long_add_return((va->va_end - va->va_start) >> PAGE_SHIFT, &vmap_lazy_nr); - /* After this point, we may free va at any time */ - llist_add(&va->purge_list, &vmap_purge_list); + /* + * Merge or place it to the purge tree/list. + */ + spin_lock(&purge_vmap_area_lock); + merge_or_add_vmap_area(va, + &purge_vmap_area_root, &purge_vmap_area_list); + spin_unlock(&purge_vmap_area_lock); + /* After this point, we may free va at any time */ if (unlikely(nr_lazy > lazy_max_pages())) try_purge_vmap_area_lazy(); } @@ -2256,7 +2272,7 @@ static void __vunmap(const void *addr, int deallocate_pages) debug_check_no_locks_freed(area->addr, get_vm_area_size(area)); debug_check_no_obj_freed(area->addr, get_vm_area_size(area)); - kasan_poison_vmalloc(area->addr, area->size); + kasan_poison_vmalloc(area->addr, get_vm_area_size(area)); vm_remove_mappings(area, deallocate_pages); @@ -2275,7 +2291,6 @@ static void __vunmap(const void *addr, int deallocate_pages) } kfree(area); - return; } static inline void __vfree_deferred(const void *addr) @@ -2461,9 +2476,11 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask, { const gfp_t nested_gfp = (gfp_mask & GFP_RECLAIM_MASK) | __GFP_ZERO; unsigned int nr_pages = get_vm_area_size(area) >> PAGE_SHIFT; - unsigned int array_size = nr_pages * sizeof(struct page *), i; + unsigned long array_size; + unsigned int i; struct page **pages; + array_size = (unsigned long)nr_pages * sizeof(struct page *); gfp_mask |= __GFP_NOWARN; if (!(gfp_mask & (GFP_DMA | GFP_DMA32))) gfp_mask |= __GFP_HIGHMEM; @@ -2477,8 +2494,7 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask, } if (!pages) { - remove_vm_area(area->addr); - kfree(area); + free_vm_area(area); return NULL; } @@ -3134,6 +3150,7 @@ pvm_find_va_enclose_addr(unsigned long addr) * @va: * in - the VA we start the search(reverse order); * out - the VA with the highest aligned end address. + * @align: alignment for required highest address * * Returns: determined end address within vmap_area */ @@ -3350,8 +3367,8 @@ recovery: while (area--) { orig_start = vas[area]->va_start; orig_end = vas[area]->va_end; - va = merge_or_add_vmap_area(vas[area], &free_vmap_area_root, - &free_vmap_area_list); + va = merge_or_add_vmap_area_augment(vas[area], &free_vmap_area_root, + &free_vmap_area_list); if (va) kasan_release_vmalloc(orig_start, orig_end, va->va_start, va->va_end); @@ -3400,8 +3417,8 @@ err_free_shadow: for (area = 0; area < nr_vms; area++) { orig_start = vas[area]->va_start; orig_end = vas[area]->va_end; - va = merge_or_add_vmap_area(vas[area], &free_vmap_area_root, - &free_vmap_area_list); + va = merge_or_add_vmap_area_augment(vas[area], &free_vmap_area_root, + &free_vmap_area_list); if (va) kasan_release_vmalloc(orig_start, orig_end, va->va_start, va->va_end); @@ -3448,11 +3465,11 @@ static void *s_next(struct seq_file *m, void *p, loff_t *pos) } static void s_stop(struct seq_file *m, void *p) - __releases(&vmap_purge_lock) __releases(&vmap_area_lock) + __releases(&vmap_purge_lock) { - mutex_unlock(&vmap_purge_lock); spin_unlock(&vmap_area_lock); + mutex_unlock(&vmap_purge_lock); } static void show_numa_info(struct seq_file *m, struct vm_struct *v) @@ -3481,18 +3498,15 @@ static void show_numa_info(struct seq_file *m, struct vm_struct *v) static void show_purge_info(struct seq_file *m) { - struct llist_node *head; struct vmap_area *va; - head = READ_ONCE(vmap_purge_list.first); - if (head == NULL) - return; - - llist_for_each_entry(va, head, purge_list) { + spin_lock(&purge_vmap_area_lock); + list_for_each_entry(va, &purge_vmap_area_list, list) { seq_printf(m, "0x%pK-0x%pK %7ld unpurged vm_area\n", (void *)va->va_start, (void *)va->va_end, va->va_end - va->va_start); } + spin_unlock(&purge_vmap_area_lock); } static int s_show(struct seq_file *m, void *p) @@ -3550,10 +3564,7 @@ static int s_show(struct seq_file *m, void *p) seq_putc(m, '\n'); /* - * As a final step, dump "unpurged" areas. Note, - * that entire "/proc/vmallocinfo" output will not - * be address sorted, because the purge list is not - * sorted. + * As a final step, dump "unpurged" areas. */ if (list_is_last(&va->list, &vmap_area_list)) show_purge_info(m); diff --git a/mm/vmscan.c b/mm/vmscan.c index 7b4e31eac2cf..242368592ea7 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1,7 +1,5 @@ // SPDX-License-Identifier: GPL-2.0 /* - * linux/mm/vmscan.c - * * Copyright (C) 1991, 1992, 1993, 1994 Linus Torvalds * * Swap reorganised 29.12.95, Stephen Tweedie. @@ -1072,7 +1070,6 @@ static void page_check_dirty_writeback(struct page *page, static unsigned int shrink_page_list(struct list_head *page_list, struct pglist_data *pgdat, struct scan_control *sc, - enum ttu_flags ttu_flags, struct reclaim_stat *stat, bool ignore_references) { @@ -1297,7 +1294,7 @@ static unsigned int shrink_page_list(struct list_head *page_list, * processes. Try to unmap it here. */ if (page_mapped(page)) { - enum ttu_flags flags = ttu_flags | TTU_BATCH_FLUSH; + enum ttu_flags flags = TTU_BATCH_FLUSH; bool was_swapbacked = PageSwapBacked(page); if (unlikely(PageTransHuge(page))) @@ -1372,6 +1369,7 @@ static unsigned int shrink_page_list(struct list_head *page_list, if (PageDirty(page) || PageWriteback(page)) goto keep_locked; mapping = page_mapping(page); + fallthrough; case PAGE_CLEAN: ; /* try to free the page below */ } @@ -1393,7 +1391,7 @@ static unsigned int shrink_page_list(struct list_head *page_list, * * Rarely, pages can have buffers and no ->mapping. These are * the pages which were not successfully invalidated in - * truncate_complete_page(). We try to drop those buffers here + * truncate_cleanup_page(). We try to drop those buffers here * and if that worked, and the page is no longer mapped into * process address space (page_count == 1) it can be freed. * Otherwise, leave the page on the LRU so it is swappable. @@ -1514,7 +1512,7 @@ unsigned int reclaim_clean_pages_from_list(struct zone *zone, } nr_reclaimed = shrink_page_list(&clean_pages, zone->zone_pgdat, &sc, - TTU_IGNORE_ACCESS, &stat, true); + &stat, true); list_splice(&clean_pages, page_list); mod_node_page_state(zone->zone_pgdat, NR_ISOLATED_FILE, -(long)nr_reclaimed); @@ -1958,8 +1956,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec, if (nr_taken == 0) return 0; - nr_reclaimed = shrink_page_list(&page_list, pgdat, sc, 0, - &stat, false); + nr_reclaimed = shrink_page_list(&page_list, pgdat, sc, &stat, false); spin_lock_irq(&pgdat->lru_lock); @@ -2131,8 +2128,7 @@ unsigned long reclaim_pages(struct list_head *page_list) nr_reclaimed += shrink_page_list(&node_page_list, NODE_DATA(nid), - &sc, 0, - &dummy_stat, false); + &sc, &dummy_stat, false); while (!list_empty(&node_page_list)) { page = lru_to_page(&node_page_list); list_del(&page->lru); @@ -2145,8 +2141,7 @@ unsigned long reclaim_pages(struct list_head *page_list) if (!list_empty(&node_page_list)) { nr_reclaimed += shrink_page_list(&node_page_list, NODE_DATA(nid), - &sc, 0, - &dummy_stat, false); + &sc, &dummy_stat, false); while (!list_empty(&node_page_list)) { page = lru_to_page(&node_page_list); list_del(&page->lru); @@ -3899,7 +3894,7 @@ kswapd_try_sleep: highest_zoneidx); /* Read the new order and highest_zoneidx */ - alloc_order = reclaim_order = READ_ONCE(pgdat->kswapd_order); + alloc_order = READ_ONCE(pgdat->kswapd_order); highest_zoneidx = kswapd_highest_zoneidx(pgdat, highest_zoneidx); WRITE_ONCE(pgdat->kswapd_order, 0); diff --git a/mm/vmstat.c b/mm/vmstat.c index 698bc0bc18d1..f8942160fc95 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -1157,7 +1157,6 @@ const char * const vmstat_text[] = { "nr_zone_unevictable", "nr_zone_write_pending", "nr_mlock", - "nr_page_table_pages", "nr_bounce", #if IS_ENABLED(CONFIG_ZSMALLOC) "nr_zspages", @@ -1215,6 +1214,7 @@ const char * const vmstat_text[] = { #if IS_ENABLED(CONFIG_SHADOW_CALL_STACK) "nr_shadow_call_stack", #endif + "nr_page_table_pages", /* enum writeback_stat_item counters */ "nr_dirty_threshold", @@ -1503,10 +1503,6 @@ static void pagetypeinfo_showblockcount_print(struct seq_file *m, if (!page) continue; - /* Watch for unexpected holes punched in the memmap */ - if (!memmap_valid_within(pfn, page, zone)) - continue; - if (page_zone(page) != zone) continue; diff --git a/mm/workingset.c b/mm/workingset.c index 975a4d2dd02e..25f75bbe80e0 100644 --- a/mm/workingset.c +++ b/mm/workingset.c @@ -445,12 +445,12 @@ void workingset_update_node(struct xa_node *node) if (node->count && node->count == node->nr_values) { if (list_empty(&node->private_list)) { list_lru_add(&shadow_nodes, &node->private_list); - __inc_lruvec_slab_state(node, WORKINGSET_NODES); + __inc_lruvec_kmem_state(node, WORKINGSET_NODES); } } else { if (!list_empty(&node->private_list)) { list_lru_del(&shadow_nodes, &node->private_list); - __dec_lruvec_slab_state(node, WORKINGSET_NODES); + __dec_lruvec_kmem_state(node, WORKINGSET_NODES); } } } @@ -544,7 +544,7 @@ static enum lru_status shadow_lru_isolate(struct list_head *item, } list_lru_isolate(lru, item); - __dec_lruvec_slab_state(node, WORKINGSET_NODES); + __dec_lruvec_kmem_state(node, WORKINGSET_NODES); spin_unlock(lru_lock); @@ -559,7 +559,7 @@ static enum lru_status shadow_lru_isolate(struct list_head *item, goto out_invalid; mapping->nrexceptional -= node->nr_values; xa_delete_node(node, workingset_update_node); - __inc_lruvec_slab_state(node, WORKINGSET_NODERECLAIM); + __inc_lruvec_kmem_state(node, WORKINGSET_NODERECLAIM); out_invalid: xa_unlock_irq(&mapping->i_pages); diff --git a/mm/z3fold.c b/mm/z3fold.c index 18feaa0bc537..dacb0d70fa61 100644 --- a/mm/z3fold.c +++ b/mm/z3fold.c @@ -90,7 +90,7 @@ struct z3fold_buddy_slots { * be enough slots to hold all possible variants */ unsigned long slot[BUDDY_MASK + 1]; - unsigned long pool; /* back link + flags */ + unsigned long pool; /* back link */ rwlock_t lock; }; #define HANDLE_FLAG_MASK (0x03) @@ -185,7 +185,7 @@ enum z3fold_page_flags { * handle flags, go under HANDLE_FLAG_MASK */ enum z3fold_handle_flags { - HANDLES_ORPHANED = 0, + HANDLES_NOFREE = 0, }; /* @@ -303,10 +303,9 @@ static inline void put_z3fold_header(struct z3fold_header *zhdr) z3fold_page_unlock(zhdr); } -static inline void free_handle(unsigned long handle) +static inline void free_handle(unsigned long handle, struct z3fold_header *zhdr) { struct z3fold_buddy_slots *slots; - struct z3fold_header *zhdr; int i; bool is_free; @@ -316,22 +315,19 @@ static inline void free_handle(unsigned long handle) if (WARN_ON(*(unsigned long *)handle == 0)) return; - zhdr = handle_to_z3fold_header(handle); slots = handle_to_slots(handle); write_lock(&slots->lock); *(unsigned long *)handle = 0; - if (zhdr->slots == slots) { + + if (test_bit(HANDLES_NOFREE, &slots->pool)) { write_unlock(&slots->lock); return; /* simple case, nothing else to do */ } - /* we are freeing a foreign handle if we are here */ - zhdr->foreign_handles--; + if (zhdr->slots != slots) + zhdr->foreign_handles--; + is_free = true; - if (!test_bit(HANDLES_ORPHANED, &slots->pool)) { - write_unlock(&slots->lock); - return; - } for (i = 0; i <= BUDDY_MASK; i++) { if (slots->slot[i]) { is_free = false; @@ -343,6 +339,8 @@ static inline void free_handle(unsigned long handle) if (is_free) { struct z3fold_pool *pool = slots_to_pool(slots); + if (zhdr->slots == slots) + zhdr->slots = NULL; kmem_cache_free(pool->c_handle, slots); } } @@ -525,8 +523,6 @@ static void __release_z3fold_page(struct z3fold_header *zhdr, bool locked) { struct page *page = virt_to_page(zhdr); struct z3fold_pool *pool = zhdr_to_pool(zhdr); - bool is_free = true; - int i; WARN_ON(!list_empty(&zhdr->buddy)); set_bit(PAGE_STALE, &page->private); @@ -536,21 +532,6 @@ static void __release_z3fold_page(struct z3fold_header *zhdr, bool locked) list_del_init(&page->lru); spin_unlock(&pool->lock); - /* If there are no foreign handles, free the handles array */ - read_lock(&zhdr->slots->lock); - for (i = 0; i <= BUDDY_MASK; i++) { - if (zhdr->slots->slot[i]) { - is_free = false; - break; - } - } - if (!is_free) - set_bit(HANDLES_ORPHANED, &zhdr->slots->pool); - read_unlock(&zhdr->slots->lock); - - if (is_free) - kmem_cache_free(pool->c_handle, zhdr->slots); - if (locked) z3fold_page_unlock(zhdr); @@ -642,15 +623,39 @@ static inline void add_to_unbuddied(struct z3fold_pool *pool, { if (zhdr->first_chunks == 0 || zhdr->last_chunks == 0 || zhdr->middle_chunks == 0) { - struct list_head *unbuddied = get_cpu_ptr(pool->unbuddied); - + struct list_head *unbuddied; int freechunks = num_free_chunks(zhdr); + + migrate_disable(); + unbuddied = this_cpu_ptr(pool->unbuddied); spin_lock(&pool->lock); list_add(&zhdr->buddy, &unbuddied[freechunks]); spin_unlock(&pool->lock); zhdr->cpu = smp_processor_id(); - put_cpu_ptr(pool->unbuddied); + migrate_enable(); + } +} + +static inline enum buddy get_free_buddy(struct z3fold_header *zhdr, int chunks) +{ + enum buddy bud = HEADLESS; + + if (zhdr->middle_chunks) { + if (!zhdr->first_chunks && + chunks <= zhdr->start_middle - ZHDR_CHUNKS) + bud = FIRST; + else if (!zhdr->last_chunks) + bud = LAST; + } else { + if (!zhdr->first_chunks) + bud = FIRST; + else if (!zhdr->last_chunks) + bud = LAST; + else + bud = MIDDLE; } + + return bud; } static inline void *mchunk_memmove(struct z3fold_header *zhdr, @@ -714,18 +719,7 @@ static struct z3fold_header *compact_single_buddy(struct z3fold_header *zhdr) if (WARN_ON(new_zhdr == zhdr)) goto out_fail; - if (new_zhdr->first_chunks == 0) { - if (new_zhdr->middle_chunks != 0 && - chunks >= new_zhdr->start_middle) { - new_bud = LAST; - } else { - new_bud = FIRST; - } - } else if (new_zhdr->last_chunks == 0) { - new_bud = LAST; - } else if (new_zhdr->middle_chunks == 0) { - new_bud = MIDDLE; - } + new_bud = get_free_buddy(new_zhdr, chunks); q = new_zhdr; switch (new_bud) { case FIRST: @@ -847,9 +841,8 @@ static void do_compact_page(struct z3fold_header *zhdr, bool locked) return; } - if (unlikely(PageIsolated(page) || - test_bit(PAGE_CLAIMED, &page->private) || - test_bit(PAGE_STALE, &page->private))) { + if (test_bit(PAGE_STALE, &page->private) || + test_and_set_bit(PAGE_CLAIMED, &page->private)) { z3fold_page_unlock(zhdr); return; } @@ -858,13 +851,16 @@ static void do_compact_page(struct z3fold_header *zhdr, bool locked) zhdr->mapped_count == 0 && compact_single_buddy(zhdr)) { if (kref_put(&zhdr->refcount, release_z3fold_page_locked)) atomic64_dec(&pool->pages_nr); - else + else { + clear_bit(PAGE_CLAIMED, &page->private); z3fold_page_unlock(zhdr); + } return; } z3fold_compact_page(zhdr); add_to_unbuddied(pool, zhdr); + clear_bit(PAGE_CLAIMED, &page->private); z3fold_page_unlock(zhdr); } @@ -886,8 +882,9 @@ static inline struct z3fold_header *__z3fold_alloc(struct z3fold_pool *pool, int chunks = size_to_chunks(size), i; lookup: + migrate_disable(); /* First, try to find an unbuddied z3fold page. */ - unbuddied = get_cpu_ptr(pool->unbuddied); + unbuddied = this_cpu_ptr(pool->unbuddied); for_each_unbuddied_list(i, chunks) { struct list_head *l = &unbuddied[i]; @@ -905,7 +902,7 @@ lookup: !z3fold_page_trylock(zhdr)) { spin_unlock(&pool->lock); zhdr = NULL; - put_cpu_ptr(pool->unbuddied); + migrate_enable(); if (can_sleep) cond_resched(); goto lookup; @@ -919,7 +916,7 @@ lookup: test_bit(PAGE_CLAIMED, &page->private)) { z3fold_page_unlock(zhdr); zhdr = NULL; - put_cpu_ptr(pool->unbuddied); + migrate_enable(); if (can_sleep) cond_resched(); goto lookup; @@ -934,7 +931,7 @@ lookup: kref_get(&zhdr->refcount); break; } - put_cpu_ptr(pool->unbuddied); + migrate_enable(); if (!zhdr) { int cpu; @@ -973,6 +970,9 @@ lookup: } } + if (zhdr && !zhdr->slots) + zhdr->slots = alloc_slots(pool, + can_sleep ? GFP_NOIO : GFP_ATOMIC); return zhdr; } @@ -1109,17 +1109,8 @@ static int z3fold_alloc(struct z3fold_pool *pool, size_t size, gfp_t gfp, retry: zhdr = __z3fold_alloc(pool, size, can_sleep); if (zhdr) { - if (zhdr->first_chunks == 0) { - if (zhdr->middle_chunks != 0 && - chunks >= zhdr->start_middle) - bud = LAST; - else - bud = FIRST; - } else if (zhdr->last_chunks == 0) - bud = LAST; - else if (zhdr->middle_chunks == 0) - bud = MIDDLE; - else { + bud = get_free_buddy(zhdr, chunks); + if (bud == HEADLESS) { if (kref_put(&zhdr->refcount, release_z3fold_page_locked)) atomic64_dec(&pool->pages_nr); @@ -1265,12 +1256,11 @@ static void z3fold_free(struct z3fold_pool *pool, unsigned long handle) pr_err("%s: unknown bud %d\n", __func__, bud); WARN_ON(1); put_z3fold_header(zhdr); - clear_bit(PAGE_CLAIMED, &page->private); return; } if (!page_claimed) - free_handle(handle); + free_handle(handle, zhdr); if (kref_put(&zhdr->refcount, release_z3fold_page_locked_list)) { atomic64_dec(&pool->pages_nr); return; @@ -1280,8 +1270,7 @@ static void z3fold_free(struct z3fold_pool *pool, unsigned long handle) z3fold_page_unlock(zhdr); return; } - if (unlikely(PageIsolated(page)) || - test_and_set_bit(NEEDS_COMPACTING, &page->private)) { + if (test_and_set_bit(NEEDS_COMPACTING, &page->private)) { put_z3fold_header(zhdr); clear_bit(PAGE_CLAIMED, &page->private); return; @@ -1345,6 +1334,10 @@ static int z3fold_reclaim_page(struct z3fold_pool *pool, unsigned int retries) struct page *page = NULL; struct list_head *pos; unsigned long first_handle = 0, middle_handle = 0, last_handle = 0; + struct z3fold_buddy_slots slots __attribute__((aligned(SLOTS_ALIGN))); + + rwlock_init(&slots.lock); + slots.pool = (unsigned long)pool | (1 << HANDLES_NOFREE); spin_lock(&pool->lock); if (!pool->ops || !pool->ops->evict || retries == 0) { @@ -1359,35 +1352,36 @@ static int z3fold_reclaim_page(struct z3fold_pool *pool, unsigned int retries) list_for_each_prev(pos, &pool->lru) { page = list_entry(pos, struct page, lru); - /* this bit could have been set by free, in which case - * we pass over to the next page in the pool. - */ - if (test_and_set_bit(PAGE_CLAIMED, &page->private)) { - page = NULL; - continue; - } - - if (unlikely(PageIsolated(page))) { - clear_bit(PAGE_CLAIMED, &page->private); - page = NULL; - continue; - } zhdr = page_address(page); if (test_bit(PAGE_HEADLESS, &page->private)) break; + if (kref_get_unless_zero(&zhdr->refcount) == 0) { + zhdr = NULL; + break; + } if (!z3fold_page_trylock(zhdr)) { - clear_bit(PAGE_CLAIMED, &page->private); + if (kref_put(&zhdr->refcount, + release_z3fold_page)) + atomic64_dec(&pool->pages_nr); zhdr = NULL; continue; /* can't evict at this point */ } - if (zhdr->foreign_handles) { - clear_bit(PAGE_CLAIMED, &page->private); - z3fold_page_unlock(zhdr); + + /* test_and_set_bit is of course atomic, but we still + * need to do it under page lock, otherwise checking + * that bit in __z3fold_alloc wouldn't make sense + */ + if (zhdr->foreign_handles || + test_and_set_bit(PAGE_CLAIMED, &page->private)) { + if (kref_put(&zhdr->refcount, + release_z3fold_page)) + atomic64_dec(&pool->pages_nr); + else + z3fold_page_unlock(zhdr); zhdr = NULL; continue; /* can't evict such page */ } - kref_get(&zhdr->refcount); list_del_init(&zhdr->buddy); zhdr->cpu = -1; break; @@ -1409,12 +1403,16 @@ static int z3fold_reclaim_page(struct z3fold_pool *pool, unsigned int retries) first_handle = 0; last_handle = 0; middle_handle = 0; + memset(slots.slot, 0, sizeof(slots.slot)); if (zhdr->first_chunks) - first_handle = encode_handle(zhdr, FIRST); + first_handle = __encode_handle(zhdr, &slots, + FIRST); if (zhdr->middle_chunks) - middle_handle = encode_handle(zhdr, MIDDLE); + middle_handle = __encode_handle(zhdr, &slots, + MIDDLE); if (zhdr->last_chunks) - last_handle = encode_handle(zhdr, LAST); + last_handle = __encode_handle(zhdr, &slots, + LAST); /* * it's safe to unlock here because we hold a * reference to this page @@ -1429,19 +1427,16 @@ static int z3fold_reclaim_page(struct z3fold_pool *pool, unsigned int retries) ret = pool->ops->evict(pool, middle_handle); if (ret) goto next; - free_handle(middle_handle); } if (first_handle) { ret = pool->ops->evict(pool, first_handle); if (ret) goto next; - free_handle(first_handle); } if (last_handle) { ret = pool->ops->evict(pool, last_handle); if (ret) goto next; - free_handle(last_handle); } next: if (test_bit(PAGE_HEADLESS, &page->private)) { @@ -1455,9 +1450,11 @@ next: spin_unlock(&pool->lock); clear_bit(PAGE_CLAIMED, &page->private); } else { + struct z3fold_buddy_slots *slots = zhdr->slots; z3fold_page_lock(zhdr); if (kref_put(&zhdr->refcount, release_z3fold_page_locked)) { + kmem_cache_free(pool->c_handle, slots); atomic64_dec(&pool->pages_nr); return 0; } @@ -1573,8 +1570,7 @@ static bool z3fold_page_isolate(struct page *page, isolate_mode_t mode) VM_BUG_ON_PAGE(!PageMovable(page), page); VM_BUG_ON_PAGE(PageIsolated(page), page); - if (test_bit(PAGE_HEADLESS, &page->private) || - test_bit(PAGE_CLAIMED, &page->private)) + if (test_bit(PAGE_HEADLESS, &page->private)) return false; zhdr = page_address(page); @@ -1586,6 +1582,8 @@ static bool z3fold_page_isolate(struct page *page, isolate_mode_t mode) if (zhdr->mapped_count != 0 || zhdr->foreign_handles != 0) goto out; + if (test_and_set_bit(PAGE_CLAIMED, &page->private)) + goto out; pool = zhdr_to_pool(zhdr); spin_lock(&pool->lock); if (!list_empty(&zhdr->buddy)) @@ -1612,16 +1610,17 @@ static int z3fold_page_migrate(struct address_space *mapping, struct page *newpa VM_BUG_ON_PAGE(!PageMovable(page), page); VM_BUG_ON_PAGE(!PageIsolated(page), page); + VM_BUG_ON_PAGE(!test_bit(PAGE_CLAIMED, &page->private), page); VM_BUG_ON_PAGE(!PageLocked(newpage), newpage); zhdr = page_address(page); pool = zhdr_to_pool(zhdr); - if (!z3fold_page_trylock(zhdr)) { + if (!z3fold_page_trylock(zhdr)) return -EAGAIN; - } if (zhdr->mapped_count != 0 || zhdr->foreign_handles != 0) { z3fold_page_unlock(zhdr); + clear_bit(PAGE_CLAIMED, &page->private); return -EBUSY; } if (work_pending(&zhdr->work)) { @@ -1663,6 +1662,7 @@ static int z3fold_page_migrate(struct address_space *mapping, struct page *newpa queue_work_on(new_zhdr->cpu, pool->compact_wq, &new_zhdr->work); page_mapcount_reset(page); + clear_bit(PAGE_CLAIMED, &page->private); put_page(page); return 0; } @@ -1686,6 +1686,7 @@ static void z3fold_page_putback(struct page *page) spin_lock(&pool->lock); list_add(&page->lru, &pool->lru); spin_unlock(&pool->lock); + clear_bit(PAGE_CLAIMED, &page->private); z3fold_page_unlock(zhdr); } diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c index cdfaaadea8ff..7289f502ffac 100644 --- a/mm/zsmalloc.c +++ b/mm/zsmalloc.c @@ -726,13 +726,10 @@ static void insert_zspage(struct size_class *class, * We want to see more ZS_FULL pages and less almost empty/full. * Put pages with higher ->inuse first. */ - if (head) { - if (get_zspage_inuse(zspage) < get_zspage_inuse(head)) { - list_add(&zspage->list, &head->list); - return; - } - } - list_add(&zspage->list, &class->fullness_list[fullness]); + if (head && get_zspage_inuse(zspage) < get_zspage_inuse(head)) + list_add(&zspage->list, &head->list); + else + list_add(&zspage->list, &class->fullness_list[fullness]); } /* diff --git a/mm/zswap.c b/mm/zswap.c index fbb782924ccc..182f6ad5aa69 100644 --- a/mm/zswap.c +++ b/mm/zswap.c @@ -24,8 +24,10 @@ #include <linux/rbtree.h> #include <linux/swap.h> #include <linux/crypto.h> +#include <linux/scatterlist.h> #include <linux/mempool.h> #include <linux/zpool.h> +#include <crypto/acompress.h> #include <linux/mm_types.h> #include <linux/page-flags.h> @@ -81,7 +83,7 @@ static bool zswap_pool_reached_full; static bool zswap_enabled = IS_ENABLED(CONFIG_ZSWAP_DEFAULT_ON); static int zswap_enabled_param_set(const char *, const struct kernel_param *); -static struct kernel_param_ops zswap_enabled_param_ops = { +static const struct kernel_param_ops zswap_enabled_param_ops = { .set = zswap_enabled_param_set, .get = param_get_bool, }; @@ -91,7 +93,7 @@ module_param_cb(enabled, &zswap_enabled_param_ops, &zswap_enabled, 0644); static char *zswap_compressor = CONFIG_ZSWAP_COMPRESSOR_DEFAULT; static int zswap_compressor_param_set(const char *, const struct kernel_param *); -static struct kernel_param_ops zswap_compressor_param_ops = { +static const struct kernel_param_ops zswap_compressor_param_ops = { .set = zswap_compressor_param_set, .get = param_get_charp, .free = param_free_charp, @@ -102,7 +104,7 @@ module_param_cb(compressor, &zswap_compressor_param_ops, /* Compressed storage zpool to use */ static char *zswap_zpool_type = CONFIG_ZSWAP_ZPOOL_DEFAULT; static int zswap_zpool_param_set(const char *, const struct kernel_param *); -static struct kernel_param_ops zswap_zpool_param_ops = { +static const struct kernel_param_ops zswap_zpool_param_ops = { .set = zswap_zpool_param_set, .get = param_get_charp, .free = param_free_charp, @@ -127,9 +129,17 @@ module_param_named(same_filled_pages_enabled, zswap_same_filled_pages_enabled, * data structures **********************************/ +struct crypto_acomp_ctx { + struct crypto_acomp *acomp; + struct acomp_req *req; + struct crypto_wait wait; + u8 *dstmem; + struct mutex *mutex; +}; + struct zswap_pool { struct zpool *zpool; - struct crypto_comp * __percpu *tfm; + struct crypto_acomp_ctx __percpu *acomp_ctx; struct kref kref; struct list_head list; struct work_struct release_work; @@ -388,23 +398,43 @@ static struct zswap_entry *zswap_entry_find_get(struct rb_root *root, * per-cpu code **********************************/ static DEFINE_PER_CPU(u8 *, zswap_dstmem); +/* + * If users dynamically change the zpool type and compressor at runtime, i.e. + * zswap is running, zswap can have more than one zpool on one cpu, but they + * are sharing dtsmem. So we need this mutex to be per-cpu. + */ +static DEFINE_PER_CPU(struct mutex *, zswap_mutex); static int zswap_dstmem_prepare(unsigned int cpu) { + struct mutex *mutex; u8 *dst; dst = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL, cpu_to_node(cpu)); if (!dst) return -ENOMEM; + mutex = kmalloc_node(sizeof(*mutex), GFP_KERNEL, cpu_to_node(cpu)); + if (!mutex) { + kfree(dst); + return -ENOMEM; + } + + mutex_init(mutex); per_cpu(zswap_dstmem, cpu) = dst; + per_cpu(zswap_mutex, cpu) = mutex; return 0; } static int zswap_dstmem_dead(unsigned int cpu) { + struct mutex *mutex; u8 *dst; + mutex = per_cpu(zswap_mutex, cpu); + kfree(mutex); + per_cpu(zswap_mutex, cpu) = NULL; + dst = per_cpu(zswap_dstmem, cpu); kfree(dst); per_cpu(zswap_dstmem, cpu) = NULL; @@ -415,30 +445,54 @@ static int zswap_dstmem_dead(unsigned int cpu) static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node) { struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node); - struct crypto_comp *tfm; - - if (WARN_ON(*per_cpu_ptr(pool->tfm, cpu))) - return 0; + struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu); + struct crypto_acomp *acomp; + struct acomp_req *req; + + acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0, cpu_to_node(cpu)); + if (IS_ERR(acomp)) { + pr_err("could not alloc crypto acomp %s : %ld\n", + pool->tfm_name, PTR_ERR(acomp)); + return PTR_ERR(acomp); + } + acomp_ctx->acomp = acomp; - tfm = crypto_alloc_comp(pool->tfm_name, 0, 0); - if (IS_ERR_OR_NULL(tfm)) { - pr_err("could not alloc crypto comp %s : %ld\n", - pool->tfm_name, PTR_ERR(tfm)); + req = acomp_request_alloc(acomp_ctx->acomp); + if (!req) { + pr_err("could not alloc crypto acomp_request %s\n", + pool->tfm_name); + crypto_free_acomp(acomp_ctx->acomp); return -ENOMEM; } - *per_cpu_ptr(pool->tfm, cpu) = tfm; + acomp_ctx->req = req; + + crypto_init_wait(&acomp_ctx->wait); + /* + * if the backend of acomp is async zip, crypto_req_done() will wakeup + * crypto_wait_req(); if the backend of acomp is scomp, the callback + * won't be called, crypto_wait_req() will return without blocking. + */ + acomp_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG, + crypto_req_done, &acomp_ctx->wait); + + acomp_ctx->mutex = per_cpu(zswap_mutex, cpu); + acomp_ctx->dstmem = per_cpu(zswap_dstmem, cpu); + return 0; } static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node *node) { struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node); - struct crypto_comp *tfm; + struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu); + + if (!IS_ERR_OR_NULL(acomp_ctx)) { + if (!IS_ERR_OR_NULL(acomp_ctx->req)) + acomp_request_free(acomp_ctx->req); + if (!IS_ERR_OR_NULL(acomp_ctx->acomp)) + crypto_free_acomp(acomp_ctx->acomp); + } - tfm = *per_cpu_ptr(pool->tfm, cpu); - if (!IS_ERR_OR_NULL(tfm)) - crypto_free_comp(tfm); - *per_cpu_ptr(pool->tfm, cpu) = NULL; return 0; } @@ -561,8 +615,9 @@ static struct zswap_pool *zswap_pool_create(char *type, char *compressor) pr_debug("using %s zpool\n", zpool_get_type(pool->zpool)); strlcpy(pool->tfm_name, compressor, sizeof(pool->tfm_name)); - pool->tfm = alloc_percpu(struct crypto_comp *); - if (!pool->tfm) { + + pool->acomp_ctx = alloc_percpu(*pool->acomp_ctx); + if (!pool->acomp_ctx) { pr_err("percpu alloc failed\n"); goto error; } @@ -585,7 +640,8 @@ static struct zswap_pool *zswap_pool_create(char *type, char *compressor) return pool; error: - free_percpu(pool->tfm); + if (pool->acomp_ctx) + free_percpu(pool->acomp_ctx); if (pool->zpool) zpool_destroy_pool(pool->zpool); kfree(pool); @@ -596,14 +652,14 @@ static __init struct zswap_pool *__zswap_pool_create_fallback(void) { bool has_comp, has_zpool; - has_comp = crypto_has_comp(zswap_compressor, 0, 0); + has_comp = crypto_has_acomp(zswap_compressor, 0, 0); if (!has_comp && strcmp(zswap_compressor, CONFIG_ZSWAP_COMPRESSOR_DEFAULT)) { pr_err("compressor %s not available, using default %s\n", zswap_compressor, CONFIG_ZSWAP_COMPRESSOR_DEFAULT); param_free_charp(&zswap_compressor); zswap_compressor = CONFIG_ZSWAP_COMPRESSOR_DEFAULT; - has_comp = crypto_has_comp(zswap_compressor, 0, 0); + has_comp = crypto_has_acomp(zswap_compressor, 0, 0); } if (!has_comp) { pr_err("default compressor %s not available\n", @@ -639,7 +695,7 @@ static void zswap_pool_destroy(struct zswap_pool *pool) zswap_pool_debug("destroying", pool); cpuhp_state_remove_instance(CPUHP_MM_ZSWP_POOL_PREPARE, &pool->node); - free_percpu(pool->tfm); + free_percpu(pool->acomp_ctx); zpool_destroy_pool(pool->zpool); kfree(pool); } @@ -723,7 +779,7 @@ static int __zswap_param_set(const char *val, const struct kernel_param *kp, } type = s; } else if (!compressor) { - if (!crypto_has_comp(s, 0, 0)) { + if (!crypto_has_acomp(s, 0, 0)) { pr_err("compressor %s not available\n", s); return -ENOENT; } @@ -774,7 +830,7 @@ static int __zswap_param_set(const char *val, const struct kernel_param *kp, * failed, maybe both compressor and zpool params were bad. * Allow changing this param, so pool creation will succeed * when the other param is changed. We already verified this - * param is ok in the zpool_has_pool() or crypto_has_comp() + * param is ok in the zpool_has_pool() or crypto_has_acomp() * checks above. */ ret = param_set_charp(s, kp); @@ -876,8 +932,10 @@ static int zswap_writeback_entry(struct zpool *pool, unsigned long handle) pgoff_t offset; struct zswap_entry *entry; struct page *page; - struct crypto_comp *tfm; - u8 *src, *dst; + struct scatterlist input, output; + struct crypto_acomp_ctx *acomp_ctx; + + u8 *src; unsigned int dlen; int ret; struct writeback_control wbc = { @@ -916,14 +974,20 @@ static int zswap_writeback_entry(struct zpool *pool, unsigned long handle) case ZSWAP_SWAPCACHE_NEW: /* page is locked */ /* decompress */ + acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx); + dlen = PAGE_SIZE; src = (u8 *)zhdr + sizeof(struct zswap_header); - dst = kmap_atomic(page); - tfm = *get_cpu_ptr(entry->pool->tfm); - ret = crypto_comp_decompress(tfm, src, entry->length, - dst, &dlen); - put_cpu_ptr(entry->pool->tfm); - kunmap_atomic(dst); + + mutex_lock(acomp_ctx->mutex); + sg_init_one(&input, src, entry->length); + sg_init_table(&output, 1); + sg_set_page(&output, page, PAGE_SIZE, 0); + acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, dlen); + ret = crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait); + dlen = acomp_ctx->req->dlen; + mutex_unlock(acomp_ctx->mutex); + BUG_ON(ret); BUG_ON(dlen != PAGE_SIZE); @@ -1004,7 +1068,8 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset, { struct zswap_tree *tree = zswap_trees[type]; struct zswap_entry *entry, *dupentry; - struct crypto_comp *tfm; + struct scatterlist input, output; + struct crypto_acomp_ctx *acomp_ctx; int ret; unsigned int hlen, dlen = PAGE_SIZE; unsigned long handle, value; @@ -1074,12 +1139,32 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset, } /* compress */ - dst = get_cpu_var(zswap_dstmem); - tfm = *get_cpu_ptr(entry->pool->tfm); - src = kmap_atomic(page); - ret = crypto_comp_compress(tfm, src, PAGE_SIZE, dst, &dlen); - kunmap_atomic(src); - put_cpu_ptr(entry->pool->tfm); + acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx); + + mutex_lock(acomp_ctx->mutex); + + dst = acomp_ctx->dstmem; + sg_init_table(&input, 1); + sg_set_page(&input, page, PAGE_SIZE, 0); + + /* zswap_dstmem is of size (PAGE_SIZE * 2). Reflect same in sg_list */ + sg_init_one(&output, dst, PAGE_SIZE * 2); + acomp_request_set_params(acomp_ctx->req, &input, &output, PAGE_SIZE, dlen); + /* + * it maybe looks a little bit silly that we send an asynchronous request, + * then wait for its completion synchronously. This makes the process look + * synchronous in fact. + * Theoretically, acomp supports users send multiple acomp requests in one + * acomp instance, then get those requests done simultaneously. but in this + * case, frontswap actually does store and load page by page, there is no + * existing method to send the second page before the first page is done + * in one thread doing frontswap. + * but in different threads running on different cpu, we have different + * acomp instance, so multiple threads can do (de)compression in parallel. + */ + ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req), &acomp_ctx->wait); + dlen = acomp_ctx->req->dlen; + if (ret) { ret = -EINVAL; goto put_dstmem; @@ -1103,7 +1188,7 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset, memcpy(buf, &zhdr, hlen); memcpy(buf + hlen, dst, dlen); zpool_unmap_handle(entry->pool->zpool, handle); - put_cpu_var(zswap_dstmem); + mutex_unlock(acomp_ctx->mutex); /* populate entry */ entry->offset = offset; @@ -1131,7 +1216,7 @@ insert_entry: return 0; put_dstmem: - put_cpu_var(zswap_dstmem); + mutex_unlock(acomp_ctx->mutex); zswap_pool_put(entry->pool); freepage: zswap_entry_cache_free(entry); @@ -1148,7 +1233,8 @@ static int zswap_frontswap_load(unsigned type, pgoff_t offset, { struct zswap_tree *tree = zswap_trees[type]; struct zswap_entry *entry; - struct crypto_comp *tfm; + struct scatterlist input, output; + struct crypto_acomp_ctx *acomp_ctx; u8 *src, *dst; unsigned int dlen; int ret; @@ -1175,11 +1261,16 @@ static int zswap_frontswap_load(unsigned type, pgoff_t offset, src = zpool_map_handle(entry->pool->zpool, entry->handle, ZPOOL_MM_RO); if (zpool_evictable(entry->pool->zpool)) src += sizeof(struct zswap_header); - dst = kmap_atomic(page); - tfm = *get_cpu_ptr(entry->pool->tfm); - ret = crypto_comp_decompress(tfm, src, entry->length, dst, &dlen); - put_cpu_ptr(entry->pool->tfm); - kunmap_atomic(dst); + + acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx); + mutex_lock(acomp_ctx->mutex); + sg_init_one(&input, src, entry->length); + sg_init_table(&output, 1); + sg_set_page(&output, page, PAGE_SIZE, 0); + acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, dlen); + ret = crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait); + mutex_unlock(acomp_ctx->mutex); + zpool_unmap_handle(entry->pool->zpool, entry->handle); BUG_ON(ret); diff --git a/sound/core/pcm_lib.c b/sound/core/pcm_lib.c index bda3514c7b2d..b7e3d8f44511 100644 --- a/sound/core/pcm_lib.c +++ b/sound/core/pcm_lib.c @@ -1129,8 +1129,8 @@ int snd_pcm_hw_rule_add(struct snd_pcm_runtime *runtime, unsigned int cond, if (constrs->rules_num >= constrs->rules_all) { struct snd_pcm_hw_rule *new; unsigned int new_rules = constrs->rules_all + 16; - new = krealloc(constrs->rules, new_rules * sizeof(*c), - GFP_KERNEL); + new = krealloc_array(constrs->rules, new_rules, + sizeof(*c), GFP_KERNEL); if (!new) { va_end(args); return -ENOMEM; diff --git a/tools/include/linux/poison.h b/tools/include/linux/poison.h index d29725769107..2e6338ac5eed 100644 --- a/tools/include/linux/poison.h +++ b/tools/include/linux/poison.h @@ -35,12 +35,8 @@ */ #define TIMER_ENTRY_STATIC ((void *) 0x300 + POISON_POINTER_DELTA) -/********** mm/debug-pagealloc.c **********/ -#ifdef CONFIG_PAGE_POISONING_ZERO -#define PAGE_POISON 0x00 -#else +/********** mm/page_poison.c **********/ #define PAGE_POISON 0xaa -#endif /********** mm/page_alloc.c ************/ diff --git a/tools/testing/selftests/vm/.gitignore b/tools/testing/selftests/vm/.gitignore index 849e8226395a..9a35c3f6a557 100644 --- a/tools/testing/selftests/vm/.gitignore +++ b/tools/testing/selftests/vm/.gitignore @@ -8,6 +8,7 @@ thuge-gen compaction_test mlock2-tests mremap_dontunmap +mremap_test on-fault-limit transhuge-stress protection_keys @@ -15,8 +16,9 @@ userfaultfd mlock-intersect-test mlock-random-test virtual_address_range -gup_benchmark +gup_test va_128TBswitch map_fixed_noreplace write_to_hugetlbfs hmm-tests +local_config.* diff --git a/tools/testing/selftests/vm/Makefile b/tools/testing/selftests/vm/Makefile index 691893afc15d..9a25307f6115 100644 --- a/tools/testing/selftests/vm/Makefile +++ b/tools/testing/selftests/vm/Makefile @@ -1,5 +1,8 @@ # SPDX-License-Identifier: GPL-2.0 # Makefile for vm selftests + +include local_config.mk + uname_M := $(shell uname -m 2>/dev/null || echo not) MACHINE ?= $(shell echo $(uname_M) | sed -e 's/aarch64.*/arm64/') @@ -21,23 +24,24 @@ MACHINE ?= $(shell echo $(uname_M) | sed -e 's/aarch64.*/arm64/') MAKEFLAGS += --no-builtin-rules CFLAGS = -Wall -I ../../../../usr/include $(EXTRA_CFLAGS) -LDLIBS = -lrt +LDLIBS = -lrt -lpthread TEST_GEN_FILES = compaction_test -TEST_GEN_FILES += gup_benchmark +TEST_GEN_FILES += gup_test TEST_GEN_FILES += hmm-tests TEST_GEN_FILES += hugepage-mmap TEST_GEN_FILES += hugepage-shm -TEST_GEN_FILES += map_hugetlb +TEST_GEN_FILES += khugepaged TEST_GEN_FILES += map_fixed_noreplace +TEST_GEN_FILES += map_hugetlb TEST_GEN_FILES += map_populate TEST_GEN_FILES += mlock-random-test TEST_GEN_FILES += mlock2-tests TEST_GEN_FILES += mremap_dontunmap +TEST_GEN_FILES += mremap_test TEST_GEN_FILES += on-fault-limit TEST_GEN_FILES += thuge-gen TEST_GEN_FILES += transhuge-stress TEST_GEN_FILES += userfaultfd -TEST_GEN_FILES += khugepaged ifeq ($(ARCH),x86_64) CAN_BUILD_I386 := $(shell ./../x86/check_cc.sh $(CC) ../x86/trivial_32bit_program.c -m32) @@ -73,15 +77,13 @@ TEST_GEN_FILES += virtual_address_range TEST_GEN_FILES += write_to_hugetlbfs endif -TEST_PROGS := run_vmtests +TEST_PROGS := run_vmtests.sh TEST_FILES := test_vmalloc.sh KSFT_KHDR_INSTALL := 1 include ../lib.mk -$(OUTPUT)/hmm-tests: LDLIBS += -lhugetlbfs -lpthread - ifeq ($(ARCH),x86_64) BINARIES_32 := $(patsubst %,$(OUTPUT)/%,$(BINARIES_32)) BINARIES_64 := $(patsubst %,$(OUTPUT)/%,$(BINARIES_64)) @@ -131,6 +133,25 @@ warn_32bit_failure: endif endif -$(OUTPUT)/userfaultfd: LDLIBS += -lpthread - $(OUTPUT)/mlock-random-test: LDLIBS += -lcap + +$(OUTPUT)/gup_test: ../../../../mm/gup_test.h + +$(OUTPUT)/hmm-tests: local_config.h + +# HMM_EXTRA_LIBS may get set in local_config.mk, or it may be left empty. +$(OUTPUT)/hmm-tests: LDLIBS += $(HMM_EXTRA_LIBS) + +local_config.mk local_config.h: check_config.sh + /bin/sh ./check_config.sh $(CC) + +EXTRA_CLEAN += local_config.mk local_config.h + +ifeq ($(HMM_EXTRA_LIBS),) +all: warn_missing_hugelibs + +warn_missing_hugelibs: + @echo ; \ + echo "Warning: missing libhugetlbfs support. Some HMM tests will be skipped." ; \ + echo +endif diff --git a/tools/testing/selftests/vm/check_config.sh b/tools/testing/selftests/vm/check_config.sh new file mode 100644 index 000000000000..079c8a40b85d --- /dev/null +++ b/tools/testing/selftests/vm/check_config.sh @@ -0,0 +1,31 @@ +#!/bin/sh +# SPDX-License-Identifier: GPL-2.0 +# +# Probe for libraries and create header files to record the results. Both C +# header files and Makefile include fragments are created. + +OUTPUT_H_FILE=local_config.h +OUTPUT_MKFILE=local_config.mk + +# libhugetlbfs +tmpname=$(mktemp) +tmpfile_c=${tmpname}.c +tmpfile_o=${tmpname}.o + +echo "#include <sys/types.h>" > $tmpfile_c +echo "#include <hugetlbfs.h>" >> $tmpfile_c +echo "int func(void) { return 0; }" >> $tmpfile_c + +CC=${1:?"Usage: $0 <compiler> # example compiler: gcc"} +$CC -c $tmpfile_c -o $tmpfile_o >/dev/null 2>&1 + +if [ -f $tmpfile_o ]; then + echo "#define LOCAL_CONFIG_HAVE_LIBHUGETLBFS 1" > $OUTPUT_H_FILE + echo "HMM_EXTRA_LIBS = -lhugetlbfs" > $OUTPUT_MKFILE +else + echo "// No libhugetlbfs support found" > $OUTPUT_H_FILE + echo "# No libhugetlbfs support found, so:" > $OUTPUT_MKFILE + echo "HMM_EXTRA_LIBS = " >> $OUTPUT_MKFILE +fi + +rm ${tmpname}.* diff --git a/tools/testing/selftests/vm/config b/tools/testing/selftests/vm/config index 69dd0d1aa30b..60e82da0de85 100644 --- a/tools/testing/selftests/vm/config +++ b/tools/testing/selftests/vm/config @@ -3,4 +3,4 @@ CONFIG_USERFAULTFD=y CONFIG_TEST_VMALLOC=m CONFIG_DEVICE_PRIVATE=y CONFIG_TEST_HMM=m -CONFIG_GUP_BENCHMARK=y +CONFIG_GUP_TEST=y diff --git a/tools/testing/selftests/vm/gup_benchmark.c b/tools/testing/selftests/vm/gup_benchmark.c deleted file mode 100644 index 1d4359341e44..000000000000 --- a/tools/testing/selftests/vm/gup_benchmark.c +++ /dev/null @@ -1,143 +0,0 @@ -#include <fcntl.h> -#include <stdio.h> -#include <stdlib.h> -#include <unistd.h> - -#include <sys/ioctl.h> -#include <sys/mman.h> -#include <sys/prctl.h> -#include <sys/stat.h> -#include <sys/types.h> - -#include <linux/types.h> - -#define MB (1UL << 20) -#define PAGE_SIZE sysconf(_SC_PAGESIZE) - -#define GUP_FAST_BENCHMARK _IOWR('g', 1, struct gup_benchmark) -#define GUP_BENCHMARK _IOWR('g', 2, struct gup_benchmark) - -/* Similar to above, but use FOLL_PIN instead of FOLL_GET. */ -#define PIN_FAST_BENCHMARK _IOWR('g', 3, struct gup_benchmark) -#define PIN_BENCHMARK _IOWR('g', 4, struct gup_benchmark) -#define PIN_LONGTERM_BENCHMARK _IOWR('g', 5, struct gup_benchmark) - -/* Just the flags we need, copied from mm.h: */ -#define FOLL_WRITE 0x01 /* check pte is writable */ - -struct gup_benchmark { - __u64 get_delta_usec; - __u64 put_delta_usec; - __u64 addr; - __u64 size; - __u32 nr_pages_per_call; - __u32 flags; - __u64 expansion[10]; /* For future use */ -}; - -int main(int argc, char **argv) -{ - struct gup_benchmark gup; - unsigned long size = 128 * MB; - int i, fd, filed, opt, nr_pages = 1, thp = -1, repeats = 1, write = 0; - int cmd = GUP_FAST_BENCHMARK, flags = MAP_PRIVATE; - char *file = "/dev/zero"; - char *p; - - while ((opt = getopt(argc, argv, "m:r:n:f:abtTLUuwSH")) != -1) { - switch (opt) { - case 'a': - cmd = PIN_FAST_BENCHMARK; - break; - case 'b': - cmd = PIN_BENCHMARK; - break; - case 'L': - cmd = PIN_LONGTERM_BENCHMARK; - break; - case 'm': - size = atoi(optarg) * MB; - break; - case 'r': - repeats = atoi(optarg); - break; - case 'n': - nr_pages = atoi(optarg); - break; - case 't': - thp = 1; - break; - case 'T': - thp = 0; - break; - case 'U': - cmd = GUP_BENCHMARK; - break; - case 'u': - cmd = GUP_FAST_BENCHMARK; - break; - case 'w': - write = 1; - break; - case 'f': - file = optarg; - break; - case 'S': - flags &= ~MAP_PRIVATE; - flags |= MAP_SHARED; - break; - case 'H': - flags |= (MAP_HUGETLB | MAP_ANONYMOUS); - break; - default: - return -1; - } - } - - filed = open(file, O_RDWR|O_CREAT); - if (filed < 0) { - perror("open"); - exit(filed); - } - - gup.nr_pages_per_call = nr_pages; - if (write) - gup.flags |= FOLL_WRITE; - - fd = open("/sys/kernel/debug/gup_benchmark", O_RDWR); - if (fd == -1) { - perror("open"); - exit(1); - } - - p = mmap(NULL, size, PROT_READ | PROT_WRITE, flags, filed, 0); - if (p == MAP_FAILED) { - perror("mmap"); - exit(1); - } - gup.addr = (unsigned long)p; - - if (thp == 1) - madvise(p, size, MADV_HUGEPAGE); - else if (thp == 0) - madvise(p, size, MADV_NOHUGEPAGE); - - for (; (unsigned long)p < gup.addr + size; p += PAGE_SIZE) - p[0] = 0; - - for (i = 0; i < repeats; i++) { - gup.size = size; - if (ioctl(fd, cmd, &gup)) { - perror("ioctl"); - exit(1); - } - - printf("Time: get:%lld put:%lld us", gup.get_delta_usec, - gup.put_delta_usec); - if (gup.size != size) - printf(", truncated (size: %lld)", gup.size); - printf("\n"); - } - - return 0; -} diff --git a/tools/testing/selftests/vm/gup_test.c b/tools/testing/selftests/vm/gup_test.c new file mode 100644 index 000000000000..6c6336dd3b7f --- /dev/null +++ b/tools/testing/selftests/vm/gup_test.c @@ -0,0 +1,194 @@ +#include <fcntl.h> +#include <stdio.h> +#include <stdlib.h> +#include <unistd.h> +#include <sys/ioctl.h> +#include <sys/mman.h> +#include <sys/stat.h> +#include <sys/types.h> +#include "../../../../mm/gup_test.h" + +#define MB (1UL << 20) +#define PAGE_SIZE sysconf(_SC_PAGESIZE) + +/* Just the flags we need, copied from mm.h: */ +#define FOLL_WRITE 0x01 /* check pte is writable */ + +static char *cmd_to_str(unsigned long cmd) +{ + switch (cmd) { + case GUP_FAST_BENCHMARK: + return "GUP_FAST_BENCHMARK"; + case PIN_FAST_BENCHMARK: + return "PIN_FAST_BENCHMARK"; + case PIN_LONGTERM_BENCHMARK: + return "PIN_LONGTERM_BENCHMARK"; + case GUP_BASIC_TEST: + return "GUP_BASIC_TEST"; + case PIN_BASIC_TEST: + return "PIN_BASIC_TEST"; + case DUMP_USER_PAGES_TEST: + return "DUMP_USER_PAGES_TEST"; + } + return "Unknown command"; +} + +int main(int argc, char **argv) +{ + struct gup_test gup = { 0 }; + unsigned long size = 128 * MB; + int i, fd, filed, opt, nr_pages = 1, thp = -1, repeats = 1, write = 0; + unsigned long cmd = GUP_FAST_BENCHMARK; + int flags = MAP_PRIVATE; + char *file = "/dev/zero"; + char *p; + + while ((opt = getopt(argc, argv, "m:r:n:F:f:abctTLUuwSH")) != -1) { + switch (opt) { + case 'a': + cmd = PIN_FAST_BENCHMARK; + break; + case 'b': + cmd = PIN_BASIC_TEST; + break; + case 'L': + cmd = PIN_LONGTERM_BENCHMARK; + break; + case 'c': + cmd = DUMP_USER_PAGES_TEST; + /* + * Dump page 0 (index 1). May be overridden later, by + * user's non-option arguments. + * + * .which_pages is zero-based, so that zero can mean "do + * nothing". + */ + gup.which_pages[0] = 1; + break; + case 'F': + /* strtol, so you can pass flags in hex form */ + gup.flags = strtol(optarg, 0, 0); + break; + case 'm': + size = atoi(optarg) * MB; + break; + case 'r': + repeats = atoi(optarg); + break; + case 'n': + nr_pages = atoi(optarg); + break; + case 't': + thp = 1; + break; + case 'T': + thp = 0; + break; + case 'U': + cmd = GUP_BASIC_TEST; + break; + case 'u': + cmd = GUP_FAST_BENCHMARK; + break; + case 'w': + write = 1; + break; + case 'f': + file = optarg; + break; + case 'S': + flags &= ~MAP_PRIVATE; + flags |= MAP_SHARED; + break; + case 'H': + flags |= (MAP_HUGETLB | MAP_ANONYMOUS); + break; + default: + return -1; + } + } + + if (optind < argc) { + int extra_arg_count = 0; + /* + * For example: + * + * ./gup_test -c 0 1 0x1001 + * + * ...to dump pages 0, 1, and 4097 + */ + + while ((optind < argc) && + (extra_arg_count < GUP_TEST_MAX_PAGES_TO_DUMP)) { + /* + * Do the 1-based indexing here, so that the user can + * use normal 0-based indexing on the command line. + */ + long page_index = strtol(argv[optind], 0, 0) + 1; + + gup.which_pages[extra_arg_count] = page_index; + extra_arg_count++; + optind++; + } + } + + filed = open(file, O_RDWR|O_CREAT); + if (filed < 0) { + perror("open"); + exit(filed); + } + + gup.nr_pages_per_call = nr_pages; + if (write) + gup.flags |= FOLL_WRITE; + + fd = open("/sys/kernel/debug/gup_test", O_RDWR); + if (fd == -1) { + perror("open"); + exit(1); + } + + p = mmap(NULL, size, PROT_READ | PROT_WRITE, flags, filed, 0); + if (p == MAP_FAILED) { + perror("mmap"); + exit(1); + } + gup.addr = (unsigned long)p; + + if (thp == 1) + madvise(p, size, MADV_HUGEPAGE); + else if (thp == 0) + madvise(p, size, MADV_NOHUGEPAGE); + + for (; (unsigned long)p < gup.addr + size; p += PAGE_SIZE) + p[0] = 0; + + /* Only report timing information on the *_BENCHMARK commands: */ + if ((cmd == PIN_FAST_BENCHMARK) || (cmd == GUP_FAST_BENCHMARK) || + (cmd == PIN_LONGTERM_BENCHMARK)) { + for (i = 0; i < repeats; i++) { + gup.size = size; + if (ioctl(fd, cmd, &gup)) + perror("ioctl"), exit(1); + + printf("%s: Time: get:%lld put:%lld us", + cmd_to_str(cmd), gup.get_delta_usec, + gup.put_delta_usec); + if (gup.size != size) + printf(", truncated (size: %lld)", gup.size); + printf("\n"); + } + } else { + gup.size = size; + if (ioctl(fd, cmd, &gup)) { + perror("ioctl"); + exit(1); + } + + printf("%s: done\n", cmd_to_str(cmd)); + if (gup.size != size) + printf("Truncated (size: %lld)\n", gup.size); + } + + return 0; +} diff --git a/tools/testing/selftests/vm/hmm-tests.c b/tools/testing/selftests/vm/hmm-tests.c index c9404ef9698e..5d1ac691b9f4 100644 --- a/tools/testing/selftests/vm/hmm-tests.c +++ b/tools/testing/selftests/vm/hmm-tests.c @@ -21,12 +21,16 @@ #include <strings.h> #include <time.h> #include <pthread.h> -#include <hugetlbfs.h> #include <sys/types.h> #include <sys/stat.h> #include <sys/mman.h> #include <sys/ioctl.h> +#include "./local_config.h" +#ifdef LOCAL_CONFIG_HAVE_LIBHUGETLBFS +#include <hugetlbfs.h> +#endif + /* * This is a private UAPI to the kernel test module so it isn't exported * in the usual include/uapi/... directory. @@ -662,6 +666,7 @@ TEST_F(hmm, anon_write_huge) hmm_buffer_free(buffer); } +#ifdef LOCAL_CONFIG_HAVE_LIBHUGETLBFS /* * Write huge TLBFS page. */ @@ -720,6 +725,7 @@ TEST_F(hmm, anon_write_hugetlbfs) buffer->ptr = NULL; hmm_buffer_free(buffer); } +#endif /* LOCAL_CONFIG_HAVE_LIBHUGETLBFS */ /* * Read mmap'ed file memory. @@ -1336,6 +1342,7 @@ TEST_F(hmm2, snapshot) hmm_buffer_free(buffer); } +#ifdef LOCAL_CONFIG_HAVE_LIBHUGETLBFS /* * Test the hmm_range_fault() HMM_PFN_PMD flag for large pages that * should be mapped by a large page table entry. @@ -1411,6 +1418,7 @@ TEST_F(hmm, compound) buffer->ptr = NULL; hmm_buffer_free(buffer); } +#endif /* LOCAL_CONFIG_HAVE_LIBHUGETLBFS */ /* * Test two devices reading the same memory (double mapped). diff --git a/tools/testing/selftests/vm/mremap_test.c b/tools/testing/selftests/vm/mremap_test.c new file mode 100644 index 000000000000..9c391d016922 --- /dev/null +++ b/tools/testing/selftests/vm/mremap_test.c @@ -0,0 +1,344 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright 2020 Google LLC + */ +#define _GNU_SOURCE + +#include <errno.h> +#include <stdlib.h> +#include <string.h> +#include <sys/mman.h> +#include <time.h> + +#include "../kselftest.h" + +#define EXPECT_SUCCESS 0 +#define EXPECT_FAILURE 1 +#define NON_OVERLAPPING 0 +#define OVERLAPPING 1 +#define NS_PER_SEC 1000000000ULL +#define VALIDATION_DEFAULT_THRESHOLD 4 /* 4MB */ +#define VALIDATION_NO_THRESHOLD 0 /* Verify the entire region */ + +#define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0])) +#define MIN(X, Y) ((X) < (Y) ? (X) : (Y)) + +struct config { + unsigned long long src_alignment; + unsigned long long dest_alignment; + unsigned long long region_size; + int overlapping; +}; + +struct test { + const char *name; + struct config config; + int expect_failure; +}; + +enum { + _1KB = 1ULL << 10, /* 1KB -> not page aligned */ + _4KB = 4ULL << 10, + _8KB = 8ULL << 10, + _1MB = 1ULL << 20, + _2MB = 2ULL << 20, + _4MB = 4ULL << 20, + _1GB = 1ULL << 30, + _2GB = 2ULL << 30, + PTE = _4KB, + PMD = _2MB, + PUD = _1GB, +}; + +#define MAKE_TEST(source_align, destination_align, size, \ + overlaps, should_fail, test_name) \ +{ \ + .name = test_name, \ + .config = { \ + .src_alignment = source_align, \ + .dest_alignment = destination_align, \ + .region_size = size, \ + .overlapping = overlaps, \ + }, \ + .expect_failure = should_fail \ +} + +/* + * Returns the start address of the mapping on success, else returns + * NULL on failure. + */ +static void *get_source_mapping(struct config c) +{ + unsigned long long addr = 0ULL; + void *src_addr = NULL; +retry: + addr += c.src_alignment; + src_addr = mmap((void *) addr, c.region_size, PROT_READ | PROT_WRITE, + MAP_FIXED | MAP_ANONYMOUS | MAP_SHARED, -1, 0); + if (src_addr == MAP_FAILED) { + if (errno == EPERM) + goto retry; + goto error; + } + /* + * Check that the address is aligned to the specified alignment. + * Addresses which have alignments that are multiples of that + * specified are not considered valid. For instance, 1GB address is + * 2MB-aligned, however it will not be considered valid for a + * requested alignment of 2MB. This is done to reduce coincidental + * alignment in the tests. + */ + if (((unsigned long long) src_addr & (c.src_alignment - 1)) || + !((unsigned long long) src_addr & c.src_alignment)) + goto retry; + + if (!src_addr) + goto error; + + return src_addr; +error: + ksft_print_msg("Failed to map source region: %s\n", + strerror(errno)); + return NULL; +} + +/* Returns the time taken for the remap on success else returns -1. */ +static long long remap_region(struct config c, unsigned int threshold_mb, + char pattern_seed) +{ + void *addr, *src_addr, *dest_addr; + unsigned long long i; + struct timespec t_start = {0, 0}, t_end = {0, 0}; + long long start_ns, end_ns, align_mask, ret, offset; + unsigned long long threshold; + + if (threshold_mb == VALIDATION_NO_THRESHOLD) + threshold = c.region_size; + else + threshold = MIN(threshold_mb * _1MB, c.region_size); + + src_addr = get_source_mapping(c); + if (!src_addr) { + ret = -1; + goto out; + } + + /* Set byte pattern */ + srand(pattern_seed); + for (i = 0; i < threshold; i++) + memset((char *) src_addr + i, (char) rand(), 1); + + /* Mask to zero out lower bits of address for alignment */ + align_mask = ~(c.dest_alignment - 1); + /* Offset of destination address from the end of the source region */ + offset = (c.overlapping) ? -c.dest_alignment : c.dest_alignment; + addr = (void *) (((unsigned long long) src_addr + c.region_size + + offset) & align_mask); + + /* See comment in get_source_mapping() */ + if (!((unsigned long long) addr & c.dest_alignment)) + addr = (void *) ((unsigned long long) addr | c.dest_alignment); + + clock_gettime(CLOCK_MONOTONIC, &t_start); + dest_addr = mremap(src_addr, c.region_size, c.region_size, + MREMAP_MAYMOVE|MREMAP_FIXED, (char *) addr); + clock_gettime(CLOCK_MONOTONIC, &t_end); + + if (dest_addr == MAP_FAILED) { + ksft_print_msg("mremap failed: %s\n", strerror(errno)); + ret = -1; + goto clean_up_src; + } + + /* Verify byte pattern after remapping */ + srand(pattern_seed); + for (i = 0; i < threshold; i++) { + char c = (char) rand(); + + if (((char *) dest_addr)[i] != c) { + ksft_print_msg("Data after remap doesn't match at offset %d\n", + i); + ksft_print_msg("Expected: %#x\t Got: %#x\n", c & 0xff, + ((char *) dest_addr)[i] & 0xff); + ret = -1; + goto clean_up_dest; + } + } + + start_ns = t_start.tv_sec * NS_PER_SEC + t_start.tv_nsec; + end_ns = t_end.tv_sec * NS_PER_SEC + t_end.tv_nsec; + ret = end_ns - start_ns; + +/* + * Since the destination address is specified using MREMAP_FIXED, subsequent + * mremap will unmap any previous mapping at the address range specified by + * dest_addr and region_size. This significantly affects the remap time of + * subsequent tests. So we clean up mappings after each test. + */ +clean_up_dest: + munmap(dest_addr, c.region_size); +clean_up_src: + munmap(src_addr, c.region_size); +out: + return ret; +} + +static void run_mremap_test_case(struct test test_case, int *failures, + unsigned int threshold_mb, + unsigned int pattern_seed) +{ + long long remap_time = remap_region(test_case.config, threshold_mb, + pattern_seed); + + if (remap_time < 0) { + if (test_case.expect_failure) + ksft_test_result_pass("%s\n\tExpected mremap failure\n", + test_case.name); + else { + ksft_test_result_fail("%s\n", test_case.name); + *failures += 1; + } + } else { + /* + * Comparing mremap time is only applicable if entire region + * was faulted in. + */ + if (threshold_mb == VALIDATION_NO_THRESHOLD || + test_case.config.region_size <= threshold_mb * _1MB) + ksft_test_result_pass("%s\n\tmremap time: %12lldns\n", + test_case.name, remap_time); + else + ksft_test_result_pass("%s\n", test_case.name); + } +} + +static void usage(const char *cmd) +{ + fprintf(stderr, + "Usage: %s [[-t <threshold_mb>] [-p <pattern_seed>]]\n" + "-t\t only validate threshold_mb of the remapped region\n" + " \t if 0 is supplied no threshold is used; all tests\n" + " \t are run and remapped regions validated fully.\n" + " \t The default threshold used is 4MB.\n" + "-p\t provide a seed to generate the random pattern for\n" + " \t validating the remapped region.\n", cmd); +} + +static int parse_args(int argc, char **argv, unsigned int *threshold_mb, + unsigned int *pattern_seed) +{ + const char *optstr = "t:p:"; + int opt; + + while ((opt = getopt(argc, argv, optstr)) != -1) { + switch (opt) { + case 't': + *threshold_mb = atoi(optarg); + break; + case 'p': + *pattern_seed = atoi(optarg); + break; + default: + usage(argv[0]); + return -1; + } + } + + if (optind < argc) { + usage(argv[0]); + return -1; + } + + return 0; +} + +int main(int argc, char **argv) +{ + int failures = 0; + int i, run_perf_tests; + unsigned int threshold_mb = VALIDATION_DEFAULT_THRESHOLD; + unsigned int pattern_seed; + time_t t; + + pattern_seed = (unsigned int) time(&t); + + if (parse_args(argc, argv, &threshold_mb, &pattern_seed) < 0) + exit(EXIT_FAILURE); + + ksft_print_msg("Test configs:\n\tthreshold_mb=%u\n\tpattern_seed=%u\n\n", + threshold_mb, pattern_seed); + + struct test test_cases[] = { + /* Expected mremap failures */ + MAKE_TEST(_4KB, _4KB, _4KB, OVERLAPPING, EXPECT_FAILURE, + "mremap - Source and Destination Regions Overlapping"), + MAKE_TEST(_4KB, _1KB, _4KB, NON_OVERLAPPING, EXPECT_FAILURE, + "mremap - Destination Address Misaligned (1KB-aligned)"), + MAKE_TEST(_1KB, _4KB, _4KB, NON_OVERLAPPING, EXPECT_FAILURE, + "mremap - Source Address Misaligned (1KB-aligned)"), + + /* Src addr PTE aligned */ + MAKE_TEST(PTE, PTE, _8KB, NON_OVERLAPPING, EXPECT_SUCCESS, + "8KB mremap - Source PTE-aligned, Destination PTE-aligned"), + + /* Src addr 1MB aligned */ + MAKE_TEST(_1MB, PTE, _2MB, NON_OVERLAPPING, EXPECT_SUCCESS, + "2MB mremap - Source 1MB-aligned, Destination PTE-aligned"), + MAKE_TEST(_1MB, _1MB, _2MB, NON_OVERLAPPING, EXPECT_SUCCESS, + "2MB mremap - Source 1MB-aligned, Destination 1MB-aligned"), + + /* Src addr PMD aligned */ + MAKE_TEST(PMD, PTE, _4MB, NON_OVERLAPPING, EXPECT_SUCCESS, + "4MB mremap - Source PMD-aligned, Destination PTE-aligned"), + MAKE_TEST(PMD, _1MB, _4MB, NON_OVERLAPPING, EXPECT_SUCCESS, + "4MB mremap - Source PMD-aligned, Destination 1MB-aligned"), + MAKE_TEST(PMD, PMD, _4MB, NON_OVERLAPPING, EXPECT_SUCCESS, + "4MB mremap - Source PMD-aligned, Destination PMD-aligned"), + + /* Src addr PUD aligned */ + MAKE_TEST(PUD, PTE, _2GB, NON_OVERLAPPING, EXPECT_SUCCESS, + "2GB mremap - Source PUD-aligned, Destination PTE-aligned"), + MAKE_TEST(PUD, _1MB, _2GB, NON_OVERLAPPING, EXPECT_SUCCESS, + "2GB mremap - Source PUD-aligned, Destination 1MB-aligned"), + MAKE_TEST(PUD, PMD, _2GB, NON_OVERLAPPING, EXPECT_SUCCESS, + "2GB mremap - Source PUD-aligned, Destination PMD-aligned"), + MAKE_TEST(PUD, PUD, _2GB, NON_OVERLAPPING, EXPECT_SUCCESS, + "2GB mremap - Source PUD-aligned, Destination PUD-aligned"), + }; + + struct test perf_test_cases[] = { + /* + * mremap 1GB region - Page table level aligned time + * comparison. + */ + MAKE_TEST(PTE, PTE, _1GB, NON_OVERLAPPING, EXPECT_SUCCESS, + "1GB mremap - Source PTE-aligned, Destination PTE-aligned"), + MAKE_TEST(PMD, PMD, _1GB, NON_OVERLAPPING, EXPECT_SUCCESS, + "1GB mremap - Source PMD-aligned, Destination PMD-aligned"), + MAKE_TEST(PUD, PUD, _1GB, NON_OVERLAPPING, EXPECT_SUCCESS, + "1GB mremap - Source PUD-aligned, Destination PUD-aligned"), + }; + + run_perf_tests = (threshold_mb == VALIDATION_NO_THRESHOLD) || + (threshold_mb * _1MB >= _1GB); + + ksft_set_plan(ARRAY_SIZE(test_cases) + (run_perf_tests ? + ARRAY_SIZE(perf_test_cases) : 0)); + + for (i = 0; i < ARRAY_SIZE(test_cases); i++) + run_mremap_test_case(test_cases[i], &failures, threshold_mb, + pattern_seed); + + if (run_perf_tests) { + ksft_print_msg("\n%s\n", + "mremap HAVE_MOVE_PMD/PUD optimization time comparison for 1GB region:"); + for (i = 0; i < ARRAY_SIZE(perf_test_cases); i++) + run_mremap_test_case(perf_test_cases[i], &failures, + threshold_mb, pattern_seed); + } + + if (failures > 0) + ksft_exit_fail(); + else + ksft_exit_pass(); +} diff --git a/tools/testing/selftests/vm/run_vmtests b/tools/testing/selftests/vm/run_vmtests index a3f4f30f0a2e..e953f3cd9664 100755 --- a/tools/testing/selftests/vm/run_vmtests +++ b/tools/testing/selftests/vm/run_vmtests @@ -123,10 +123,10 @@ else echo "[PASS]" fi -echo "--------------------------------------------" -echo "running 'gup_benchmark -U' (normal/slow gup)" -echo "--------------------------------------------" -./gup_benchmark -U +echo "------------------------------------------------------" +echo "running: gup_test -u # get_user_pages_fast() benchmark" +echo "------------------------------------------------------" +./gup_test -u if [ $? -ne 0 ]; then echo "[FAIL]" exitcode=1 @@ -134,10 +134,22 @@ else echo "[PASS]" fi -echo "------------------------------------------" -echo "running gup_benchmark -b (pin_user_pages)" -echo "------------------------------------------" -./gup_benchmark -b +echo "------------------------------------------------------" +echo "running: gup_test -a # pin_user_pages_fast() benchmark" +echo "------------------------------------------------------" +./gup_test -a +if [ $? -ne 0 ]; then + echo "[FAIL]" + exitcode=1 +else + echo "[PASS]" +fi + +echo "------------------------------------------------------------" +echo "# Dump pages 0, 19, and 4096, using pin_user_pages:" +echo "running: gup_test -ct -F 0x1 0 19 0x1000 # dump_page() test" +echo "------------------------------------------------------------" +./gup_test -ct -F 0x1 0 19 0x1000 if [ $? -ne 0 ]; then echo "[FAIL]" exitcode=1 @@ -148,7 +160,7 @@ fi echo "-------------------" echo "running userfaultfd" echo "-------------------" -./userfaultfd anon 128 32 +./userfaultfd anon 20 16 if [ $? -ne 0 ]; then echo "[FAIL]" exitcode=1 @@ -173,7 +185,7 @@ rm -f $mnt/ufd_test_file echo "-------------------------" echo "running userfaultfd_shmem" echo "-------------------------" -./userfaultfd shmem 128 32 +./userfaultfd shmem 20 16 if [ $? -ne 0 ]; then echo "[FAIL]" exitcode=1 @@ -241,6 +253,17 @@ else echo "[PASS]" fi +echo "-------------------" +echo "running mremap_test" +echo "-------------------" +./mremap_test +if [ $? -ne 0 ]; then + echo "[FAIL]" + exitcode=1 +else + echo "[PASS]" +fi + echo "-----------------" echo "running thuge-gen" echo "-----------------" diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c index c4425597769a..9d8650d4ba5a 100644 --- a/tools/testing/selftests/vm/userfaultfd.c +++ b/tools/testing/selftests/vm/userfaultfd.c @@ -55,6 +55,8 @@ #include <setjmp.h> #include <stdbool.h> #include <assert.h> +#include <inttypes.h> +#include <stdint.h> #include "../kselftest.h" @@ -135,6 +137,13 @@ static void usage(void) exit(1); } +#define uffd_error(code, fmt, ...) \ + do { \ + fprintf(stderr, fmt, ##__VA_ARGS__); \ + fprintf(stderr, ": %" PRId64 "\n", (int64_t)(code)); \ + exit(1); \ + } while (0) + static void uffd_stats_reset(struct uffd_stats *uffd_stats, unsigned long n_cpus) { @@ -338,7 +347,7 @@ static int my_bcmp(char *str1, char *str2, size_t n) static void wp_range(int ufd, __u64 start, __u64 len, bool wp) { - struct uffdio_writeprotect prms = { 0 }; + struct uffdio_writeprotect prms; /* Write protection page faults */ prms.range.start = start; @@ -347,7 +356,8 @@ static void wp_range(int ufd, __u64 start, __u64 len, bool wp) prms.mode = wp ? UFFDIO_WRITEPROTECT_MODE_WP : 0; if (ioctl(ufd, UFFDIO_WRITEPROTECT, &prms)) { - fprintf(stderr, "clear WP failed for address 0x%Lx\n", start); + fprintf(stderr, "clear WP failed for address 0x%" PRIx64 "\n", + (uint64_t)start); exit(1); } } @@ -481,14 +491,11 @@ static void retry_copy_page(int ufd, struct uffdio_copy *uffdio_copy, if (ioctl(ufd, UFFDIO_COPY, uffdio_copy)) { /* real retval in ufdio_copy.copy */ if (uffdio_copy->copy != -EEXIST) { - fprintf(stderr, "UFFDIO_COPY retry error %Ld\n", - uffdio_copy->copy); - exit(1); + uffd_error(uffdio_copy->copy, + "UFFDIO_COPY retry error"); } - } else { - fprintf(stderr, "UFFDIO_COPY retry unexpected %Ld\n", - uffdio_copy->copy); exit(1); - } + } else + uffd_error(uffdio_copy->copy, "UFFDIO_COPY retry unexpected"); } static int __copy_page(int ufd, unsigned long offset, bool retry) @@ -509,14 +516,10 @@ static int __copy_page(int ufd, unsigned long offset, bool retry) uffdio_copy.copy = 0; if (ioctl(ufd, UFFDIO_COPY, &uffdio_copy)) { /* real retval in ufdio_copy.copy */ - if (uffdio_copy.copy != -EEXIST) { - fprintf(stderr, "UFFDIO_COPY error %Ld\n", - uffdio_copy.copy); - exit(1); - } + if (uffdio_copy.copy != -EEXIST) + uffd_error(uffdio_copy.copy, "UFFDIO_COPY error"); } else if (uffdio_copy.copy != page_size) { - fprintf(stderr, "UFFDIO_COPY unexpected copy %Ld\n", - uffdio_copy.copy); exit(1); + uffd_error(uffdio_copy.copy, "UFFDIO_COPY unexpected copy"); } else { if (test_uffdio_copy_eexist && retry) { test_uffdio_copy_eexist = false; @@ -791,11 +794,13 @@ static int userfaultfd_open(int features) uffdio_api.api = UFFD_API; uffdio_api.features = features; if (ioctl(uffd, UFFDIO_API, &uffdio_api)) { - fprintf(stderr, "UFFDIO_API\n"); + fprintf(stderr, "UFFDIO_API failed.\nPlease make sure to " + "run with either root or ptrace capability.\n"); return 1; } if (uffdio_api.api != UFFD_API) { - fprintf(stderr, "UFFDIO_API error %Lu\n", uffdio_api.api); + fprintf(stderr, "UFFDIO_API error: %" PRIu64 "\n", + (uint64_t)uffdio_api.api); return 1; } @@ -957,13 +962,12 @@ static void retry_uffdio_zeropage(int ufd, offset); if (ioctl(ufd, UFFDIO_ZEROPAGE, uffdio_zeropage)) { if (uffdio_zeropage->zeropage != -EEXIST) { - fprintf(stderr, "UFFDIO_ZEROPAGE retry error %Ld\n", - uffdio_zeropage->zeropage); - exit(1); + uffd_error(uffdio_zeropage->zeropage, + "UFFDIO_ZEROPAGE retry error"); } } else { - fprintf(stderr, "UFFDIO_ZEROPAGE retry unexpected %Ld\n", - uffdio_zeropage->zeropage); exit(1); + uffd_error(uffdio_zeropage->zeropage, + "UFFDIO_ZEROPAGE retry unexpected"); } } @@ -972,6 +976,7 @@ static int __uffdio_zeropage(int ufd, unsigned long offset, bool retry) struct uffdio_zeropage uffdio_zeropage; int ret; unsigned long has_zeropage; + __s64 res; has_zeropage = uffd_test_ops->expected_ioctls & (1 << _UFFDIO_ZEROPAGE); @@ -983,29 +988,17 @@ static int __uffdio_zeropage(int ufd, unsigned long offset, bool retry) uffdio_zeropage.range.len = page_size; uffdio_zeropage.mode = 0; ret = ioctl(ufd, UFFDIO_ZEROPAGE, &uffdio_zeropage); + res = uffdio_zeropage.zeropage; if (ret) { /* real retval in ufdio_zeropage.zeropage */ if (has_zeropage) { - if (uffdio_zeropage.zeropage == -EEXIST) { - fprintf(stderr, "UFFDIO_ZEROPAGE -EEXIST\n"); - exit(1); - } else { - fprintf(stderr, "UFFDIO_ZEROPAGE error %Ld\n", - uffdio_zeropage.zeropage); - exit(1); - } - } else { - if (uffdio_zeropage.zeropage != -EINVAL) { - fprintf(stderr, - "UFFDIO_ZEROPAGE not -EINVAL %Ld\n", - uffdio_zeropage.zeropage); - exit(1); - } - } + uffd_error(res, "UFFDIO_ZEROPAGE %s", + res == -EEXIST ? "-EEXIST" : "error"); + } else if (res != -EINVAL) + uffd_error(res, "UFFDIO_ZEROPAGE not -EINVAL"); } else if (has_zeropage) { - if (uffdio_zeropage.zeropage != page_size) { - fprintf(stderr, "UFFDIO_ZEROPAGE unexpected %Ld\n", - uffdio_zeropage.zeropage); exit(1); + if (res != page_size) { + uffd_error(res, "UFFDIO_ZEROPAGE unexpected"); } else { if (test_uffdio_zeropage_eexist && retry) { test_uffdio_zeropage_eexist = false; @@ -1014,11 +1007,8 @@ static int __uffdio_zeropage(int ufd, unsigned long offset, bool retry) } return 1; } - } else { - fprintf(stderr, - "UFFDIO_ZEROPAGE succeeded %Ld\n", - uffdio_zeropage.zeropage); exit(1); - } + } else + uffd_error(res, "UFFDIO_ZEROPAGE succeeded"); return 0; } @@ -1040,7 +1030,7 @@ static int userfaultfd_zeropage_test(void) if (uffd_test_ops->release_pages(area_dst)) return 1; - if (userfaultfd_open(0) < 0) + if (userfaultfd_open(0)) return 1; uffdio_register.range.start = (unsigned long) area_dst; uffdio_register.range.len = nr_pages * page_size; @@ -1090,7 +1080,7 @@ static int userfaultfd_events_test(void) features = UFFD_FEATURE_EVENT_FORK | UFFD_FEATURE_EVENT_REMAP | UFFD_FEATURE_EVENT_REMOVE; - if (userfaultfd_open(features) < 0) + if (userfaultfd_open(features)) return 1; fcntl(uffd, F_SETFL, uffd_flags | O_NONBLOCK); @@ -1162,7 +1152,7 @@ static int userfaultfd_sig_test(void) return 1; features = UFFD_FEATURE_EVENT_FORK|UFFD_FEATURE_SIGBUS; - if (userfaultfd_open(features) < 0) + if (userfaultfd_open(features)) return 1; fcntl(uffd, F_SETFL, uffd_flags | O_NONBLOCK); @@ -1242,7 +1232,7 @@ static int userfaultfd_stress(void) if (!area_dst) return 1; - if (userfaultfd_open(0) < 0) + if (userfaultfd_open(0)) return 1; count_verify = malloc(nr_pages * sizeof(unsigned long long)); @@ -1302,6 +1292,8 @@ static int userfaultfd_stress(void) printf(" ver"); if (bounces & BOUNCE_POLL) printf(" poll"); + else + printf(" read"); printf(", "); fflush(stdout); |