author     Linus Torvalds <torvalds@linux-foundation.org>  2023-04-27 19:42:02 -0700
committer  Linus Torvalds <torvalds@linux-foundation.org>  2023-04-27 19:42:02 -0700
commit     7fa8a8ee9400fe8ec188426e40e481717bc5e924 (patch)
tree       cc8fd6b4f936ec01e73238643757451e20478c07 /kernel
parent     91ec4b0d11fe115581ce2835300558802ce55e6c (diff)
parent     4d4b6d66db63ceed399f1fb1a4b24081d2590eb1 (diff)
Merge tag 'mm-stable-2023-04-27-15-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- Nick Piggin's "shoot lazy tlbs" series, to improve the performance of
switching from a user process to a kernel thread.
- More folio conversions from Kefeng Wang, Zhang Peng and Pankaj
Raghav.
- zsmalloc performance improvements from Sergey Senozhatsky.
- Yue Zhao has found and fixed some data race issues around the
alteration of memcg userspace tunables.
- VFS rationalizations from Christoph Hellwig:
- removal of most of the callers of write_one_page()
- make __filemap_get_folio()'s return value more useful
- Luis Chamberlain has changed tmpfs so it no longer requires swap
backing. Use `mount -o noswap' (a small mount(2) sketch follows this
list).
- Qi Zheng has made the slab shrinkers operate locklessly, providing
some scalability benefits.
- Keith Busch has improved dmapool's performance, making part of its
operations O(1) rather than O(n).
- Peter Xu adds the UFFD_FEATURE_WP_UNPOPULATED feature to userfaultfd,
permitting userspace to write-protect unpopulated anonymous PTEs (a
probing sketch follows this list).
- Kirill Shutemov has changed MAX_ORDER's meaning to be inclusive
rather than exclusive, and has fixed a bunch of errors which were
caused by its unintuitive meaning.
- Axel Rasmussen gives userfaultfd the UFFDIO_CONTINUE_MODE_WP feature,
which causes minor faults to install a write-protected pte.
- Vlastimil Babka has done some maintenance work on vma_merge():
cleanups to the kernel code and improvements to our userspace test
harness.
- Cleanups to do_fault_around() by Lorenzo Stoakes.
- Mike Rapoport has moved a lot of initialization code out of various
mm/ files and into mm/mm_init.c.
- Lorenzo Stoakes removed vmf_insert_mixed_prot(), which was added for
DRM, but DRM doesn't use it any more.
- Lorenzo has also converted read_kcore() and vread() to use iterators
and has thereby removed the use of bounce buffers in some cases.
- Lorenzo has also contributed further cleanups of vma_merge().
- Chaitanya Prakash provides some fixes to the mmap selftesting code.
- Matthew Wilcox changes xfs and afs so they no longer take sleeping
locks in ->map_pages(), a step towards RCUification of pagefaults.
- Suren Baghdasaryan has improved mmap_lock scalability by switching to
per-VMA locking.
- Frederic Weisbecker has reworked the percpu cache draining so that it
no longer causes latency glitches on cpu isolated workloads.
- Mike Rapoport cleans up and corrects the ARCH_FORCE_MAX_ORDER Kconfig
logic.
- Liu Shixin has changed zswap's initialization so we no longer waste a
chunk of memory if zswap is not being used.
- Yosry Ahmed has improved the performance of memcg statistics
flushing.
- David Stevens has fixed several issues involving khugepaged,
userfaultfd and shmem.
- Christoph Hellwig has provided some cleanup work to zram's IO-related
code paths.
- David Hildenbrand has fixed up some issues in the selftest code's
testing of PTE state changes.
- Pankaj Raghav has made page_endio() unneeded and has removed it.
- Peter Xu contributed some rationalizations of the userfaultfd
selftests.
- Yosry Ahmed has fixed an issue around memcg's page reclaim
accounting.
- Chaitanya Prakash has fixed some arm-related issues in the
selftests/mm code.
- Longlong Xia has improved the way in which KSM handles hwpoisoned
pages.
- Peter Xu fixes a few issues with uffd-wp at fork() time.
- Stefan Roesch has changed KSM so that it may now be used on a
per-process and per-cgroup basis; the per-process side is driven by two
new prctl() options (a usage sketch follows this list).
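
The tmpfs `noswap' option mentioned above is passed as ordinary tmpfs mount
data. A minimal sketch, not taken from the series (the /mnt/noswap-tmp mount
point is a made-up example; the call needs CAP_SYS_ADMIN in the initial user
namespace and a kernel carrying this series):

    #include <stdio.h>
    #include <sys/mount.h>

    int main(void)
    {
            /* Equivalent to: mount -t tmpfs -o noswap tmpfs /mnt/noswap-tmp */
            if (mount("tmpfs", "/mnt/noswap-tmp", "tmpfs", 0, "noswap") == -1) {
                    perror("mount tmpfs noswap");   /* e.g. EINVAL on older kernels */
                    return 1;
            }
            return 0;
    }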
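
UFFD_FEATURE_WP_UNPOPULATED (see the userfaultfd bullet above) is negotiated
through the usual UFFDIO_API handshake; registering a range with
UFFDIO_REGISTER_MODE_WP and issuing UFFDIO_WRITEPROTECT then works as before,
now without pre-faulting anonymous pages first. A minimal probing sketch,
assuming uapi headers new enough to define the flag:

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <linux/userfaultfd.h>

    int main(void)
    {
            /* UFFD_USER_MODE_ONLY keeps the probe usable without privileges */
            int uffd = syscall(__NR_userfaultfd,
                               O_CLOEXEC | O_NONBLOCK | UFFD_USER_MODE_ONLY);
            if (uffd < 0) {
                    perror("userfaultfd");
                    return 1;
            }

            struct uffdio_api api = {
                    .api = UFFD_API,
                    .features = UFFD_FEATURE_WP_UNPOPULATED,
            };
            /* Kernels without the feature reject the request with EINVAL. */
            if (ioctl(uffd, UFFDIO_API, &api) == -1)
                    perror("UFFDIO_API (feature unsupported?)");
            else
                    printf("supported features: 0x%llx\n",
                           (unsigned long long)api.features);

            close(uffd);
            return 0;
    }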
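
The per-process KSM opt-in referenced above is the PR_SET_MEMORY_MERGE /
PR_GET_MEMORY_MERGE pair visible in the kernel/sys.c hunk at the bottom of
this diff. A minimal usage sketch (needs CONFIG_KSM and uapi headers that
already define the two constants):

    #include <stdio.h>
    #include <sys/prctl.h>

    int main(void)
    {
            /* arg3..arg5 must be zero, mirroring the checks in kernel/sys.c */
            if (prctl(PR_SET_MEMORY_MERGE, 1, 0, 0, 0))
                    perror("PR_SET_MEMORY_MERGE");

            /* Reads back MMF_VM_MERGE_ANY: 1 if merging is enabled, else 0 */
            printf("memory merge: %d\n", prctl(PR_GET_MEMORY_MERGE, 0, 0, 0, 0));
            return 0;
    }
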
* tag 'mm-stable-2023-04-27-15-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (369 commits)
mm,unmap: avoid flushing TLB in batch if PTE is inaccessible
shmem: restrict noswap option to initial user namespace
mm/khugepaged: fix conflicting mods to collapse_file()
sparse: remove unnecessary 0 values from rc
mm: move 'mmap_min_addr' logic from callers into vm_unmapped_area()
hugetlb: pte_alloc_huge() to replace huge pte_alloc_map()
maple_tree: fix allocation in mas_sparse_area()
mm: do not increment pgfault stats when page fault handler retries
zsmalloc: allow only one active pool compaction context
selftests/mm: add new selftests for KSM
mm: add new KSM process and sysfs knobs
mm: add new api to enable ksm per process
mm: shrinkers: fix debugfs file permissions
mm: don't check VMA write permissions if the PTE/PMD indicates write permissions
migrate_pages_batch: fix statistics for longterm pin retry
userfaultfd: use helper function range_in_vma()
lib/show_mem.c: use for_each_populated_zone() simplify code
mm: correct arg in reclaim_pages()/reclaim_clean_pages_from_list()
fs/buffer: convert create_page_buffers to folio_create_buffers
fs/buffer: add folio_create_empty_buffers helper
...
Diffstat (limited to 'kernel')
-rw-r--r--  kernel/cgroup/rstat.c          4
-rw-r--r--  kernel/cpu.c                   2
-rw-r--r--  kernel/crash_core.c            2
-rw-r--r--  kernel/dma/pool.c              6
-rw-r--r--  kernel/events/ring_buffer.c    2
-rw-r--r--  kernel/exit.c                  2
-rw-r--r--  kernel/fork.c                163
-rw-r--r--  kernel/kcsan/kcsan_test.c     20
-rw-r--r--  kernel/kthread.c              22
-rw-r--r--  kernel/printk/printk.c         2
-rw-r--r--  kernel/sched/core.c           15
-rw-r--r--  kernel/sched/fair.c           57
-rw-r--r--  kernel/sys.c                  42
13 files changed, 287 insertions, 52 deletions
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index 0a2b4967e333..9c4c55228567 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -241,12 +241,12 @@ __bpf_kfunc void cgroup_rstat_flush(struct cgroup *cgrp)
 }
 
 /**
- * cgroup_rstat_flush_irqsafe - irqsafe version of cgroup_rstat_flush()
+ * cgroup_rstat_flush_atomic - atomic version of cgroup_rstat_flush()
  * @cgrp: target cgroup
  *
  * This function can be called from any context.
  */
-void cgroup_rstat_flush_irqsafe(struct cgroup *cgrp)
+void cgroup_rstat_flush_atomic(struct cgroup *cgrp)
 {
 	unsigned long flags;
 
diff --git a/kernel/cpu.c b/kernel/cpu.c
index c59b73d13a3a..f4a2c5845bcb 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -623,7 +623,7 @@ static int finish_cpu(unsigned int cpu)
 	 */
 	if (mm != &init_mm)
 		idle->active_mm = &init_mm;
-	mmdrop(mm);
+	mmdrop_lazy_tlb(mm);
 	return 0;
 }
 
diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index 755f5f08ab38..90ce1dfd591c 100644
--- a/kernel/crash_core.c
+++ b/kernel/crash_core.c
@@ -474,7 +474,7 @@ static int __init crash_save_vmcoreinfo_init(void)
 	VMCOREINFO_OFFSET(list_head, prev);
 	VMCOREINFO_OFFSET(vmap_area, va_start);
 	VMCOREINFO_OFFSET(vmap_area, list);
-	VMCOREINFO_LENGTH(zone.free_area, MAX_ORDER);
+	VMCOREINFO_LENGTH(zone.free_area, MAX_ORDER + 1);
 	log_buf_vmcoreinfo_setup();
 	VMCOREINFO_LENGTH(free_area.free_list, MIGRATE_TYPES);
 	VMCOREINFO_NUMBER(NR_FREE_PAGES);
diff --git a/kernel/dma/pool.c b/kernel/dma/pool.c
index 4d40dcce7604..1acec2e22827 100644
--- a/kernel/dma/pool.c
+++ b/kernel/dma/pool.c
@@ -84,8 +84,8 @@ static int atomic_pool_expand(struct gen_pool *pool, size_t pool_size,
 	void *addr;
 	int ret = -ENOMEM;
 
-	/* Cannot allocate larger than MAX_ORDER-1 */
-	order = min(get_order(pool_size), MAX_ORDER-1);
+	/* Cannot allocate larger than MAX_ORDER */
+	order = min(get_order(pool_size), MAX_ORDER);
 
 	do {
 		pool_size = 1 << (PAGE_SHIFT + order);
@@ -190,7 +190,7 @@ static int __init dma_atomic_pool_init(void)
 
 	/*
 	 * If coherent_pool was not used on the command line, default the pool
-	 * sizes to 128KB per 1GB of memory, min 128KB, max MAX_ORDER-1.
+	 * sizes to 128KB per 1GB of memory, min 128KB, max MAX_ORDER.
 	 */
 	if (!atomic_pool_size) {
 		unsigned long pages = totalram_pages() / (SZ_1G / SZ_128K);
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index 273a0fe7910a..a0433f37b024 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -814,7 +814,7 @@ struct perf_buffer *rb_alloc(int nr_pages, long watermark, int cpu, int flags)
 	size = sizeof(struct perf_buffer);
 	size += nr_pages * sizeof(void *);
 
-	if (order_base_2(size) >= PAGE_SHIFT+MAX_ORDER)
+	if (order_base_2(size) > PAGE_SHIFT+MAX_ORDER)
 		goto fail;
 
 	node = (cpu == -1) ? cpu : cpu_to_node(cpu);
diff --git a/kernel/exit.c b/kernel/exit.c
index f2afdb0add7c..86902cb5ab78 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -537,7 +537,7 @@ static void exit_mm(void)
 		return;
 	sync_mm_rss(mm);
 	mmap_read_lock(mm);
-	mmgrab(mm);
+	mmgrab_lazy_tlb(mm);
 	BUG_ON(mm != current->active_mm);
 	/* more a memory barrier than a real lock */
 	task_lock(current);
diff --git a/kernel/fork.c b/kernel/fork.c
index bfe73db1c26c..4342200d5e2b 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -451,13 +451,49 @@ static struct kmem_cache *vm_area_cachep;
 /* SLAB cache for mm_struct structures (tsk->mm) */
 static struct kmem_cache *mm_cachep;
 
+#ifdef CONFIG_PER_VMA_LOCK
+
+/* SLAB cache for vm_area_struct.lock */
+static struct kmem_cache *vma_lock_cachep;
+
+static bool vma_lock_alloc(struct vm_area_struct *vma)
+{
+	vma->vm_lock = kmem_cache_alloc(vma_lock_cachep, GFP_KERNEL);
+	if (!vma->vm_lock)
+		return false;
+
+	init_rwsem(&vma->vm_lock->lock);
+	vma->vm_lock_seq = -1;
+
+	return true;
+}
+
+static inline void vma_lock_free(struct vm_area_struct *vma)
+{
+	kmem_cache_free(vma_lock_cachep, vma->vm_lock);
+}
+
+#else /* CONFIG_PER_VMA_LOCK */
+
+static inline bool vma_lock_alloc(struct vm_area_struct *vma) { return true; }
+static inline void vma_lock_free(struct vm_area_struct *vma) {}
+
+#endif /* CONFIG_PER_VMA_LOCK */
+
 struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
 {
 	struct vm_area_struct *vma;
 
 	vma = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
-	if (vma)
-		vma_init(vma, mm);
+	if (!vma)
+		return NULL;
+
+	vma_init(vma, mm);
+	if (!vma_lock_alloc(vma)) {
+		kmem_cache_free(vm_area_cachep, vma);
+		return NULL;
+	}
+
 	return vma;
 }
 
@@ -465,26 +501,56 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
 {
 	struct vm_area_struct *new = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
 
-	if (new) {
-		ASSERT_EXCLUSIVE_WRITER(orig->vm_flags);
-		ASSERT_EXCLUSIVE_WRITER(orig->vm_file);
-		/*
-		 * orig->shared.rb may be modified concurrently, but the clone
-		 * will be reinitialized.
-		 */
-		data_race(memcpy(new, orig, sizeof(*new)));
-		INIT_LIST_HEAD(&new->anon_vma_chain);
-		dup_anon_vma_name(orig, new);
+	if (!new)
+		return NULL;
+
+	ASSERT_EXCLUSIVE_WRITER(orig->vm_flags);
+	ASSERT_EXCLUSIVE_WRITER(orig->vm_file);
+	/*
+	 * orig->shared.rb may be modified concurrently, but the clone
+	 * will be reinitialized.
+	 */
+	data_race(memcpy(new, orig, sizeof(*new)));
+	if (!vma_lock_alloc(new)) {
+		kmem_cache_free(vm_area_cachep, new);
+		return NULL;
 	}
+	INIT_LIST_HEAD(&new->anon_vma_chain);
+	vma_numab_state_init(new);
+	dup_anon_vma_name(orig, new);
+
 	return new;
 }
 
-void vm_area_free(struct vm_area_struct *vma)
+void __vm_area_free(struct vm_area_struct *vma)
 {
+	vma_numab_state_free(vma);
 	free_anon_vma_name(vma);
+	vma_lock_free(vma);
 	kmem_cache_free(vm_area_cachep, vma);
 }
 
+#ifdef CONFIG_PER_VMA_LOCK
+static void vm_area_free_rcu_cb(struct rcu_head *head)
+{
+	struct vm_area_struct *vma = container_of(head, struct vm_area_struct,
+						  vm_rcu);
+
+	/* The vma should not be locked while being destroyed. */
+	VM_BUG_ON_VMA(rwsem_is_locked(&vma->vm_lock->lock), vma);
+	__vm_area_free(vma);
+}
+#endif
+
+void vm_area_free(struct vm_area_struct *vma)
+{
+#ifdef CONFIG_PER_VMA_LOCK
+	call_rcu(&vma->vm_rcu, vm_area_free_rcu_cb);
+#else
+	__vm_area_free(vma);
+#endif
+}
+
 static void account_kernel_stack(struct task_struct *tsk, int account)
 {
 	if (IS_ENABLED(CONFIG_VMAP_STACK)) {
@@ -775,6 +841,67 @@ static void check_mm(struct mm_struct *mm)
 #define allocate_mm()	(kmem_cache_alloc(mm_cachep, GFP_KERNEL))
 #define free_mm(mm)	(kmem_cache_free(mm_cachep, (mm)))
 
+static void do_check_lazy_tlb(void *arg)
+{
+	struct mm_struct *mm = arg;
+
+	WARN_ON_ONCE(current->active_mm == mm);
+}
+
+static void do_shoot_lazy_tlb(void *arg)
+{
+	struct mm_struct *mm = arg;
+
+	if (current->active_mm == mm) {
+		WARN_ON_ONCE(current->mm);
+		current->active_mm = &init_mm;
+		switch_mm(mm, &init_mm, current);
+	}
+}
+
+static void cleanup_lazy_tlbs(struct mm_struct *mm)
+{
+	if (!IS_ENABLED(CONFIG_MMU_LAZY_TLB_SHOOTDOWN)) {
+		/*
+		 * In this case, lazy tlb mms are refounted and would not reach
+		 * __mmdrop until all CPUs have switched away and mmdrop()ed.
+		 */
+		return;
+	}
+
+	/*
+	 * Lazy mm shootdown does not refcount "lazy tlb mm" usage, rather it
+	 * requires lazy mm users to switch to another mm when the refcount
+	 * drops to zero, before the mm is freed. This requires IPIs here to
+	 * switch kernel threads to init_mm.
+	 *
+	 * archs that use IPIs to flush TLBs can piggy-back that lazy tlb mm
+	 * switch with the final userspace teardown TLB flush which leaves the
+	 * mm lazy on this CPU but no others, reducing the need for additional
+	 * IPIs here. There are cases where a final IPI is still required here,
+	 * such as the final mmdrop being performed on a different CPU than the
+	 * one exiting, or kernel threads using the mm when userspace exits.
+	 *
+	 * IPI overheads have not found to be expensive, but they could be
+	 * reduced in a number of possible ways, for example (roughly
+	 * increasing order of complexity):
+	 * - The last lazy reference created by exit_mm() could instead switch
+	 *   to init_mm, however it's probable this will run on the same CPU
+	 *   immediately afterwards, so this may not reduce IPIs much.
+	 * - A batch of mms requiring IPIs could be gathered and freed at once.
+	 * - CPUs store active_mm where it can be remotely checked without a
+	 *   lock, to filter out false-positives in the cpumask.
+	 * - After mm_users or mm_count reaches zero, switching away from the
+	 *   mm could clear mm_cpumask to reduce some IPIs, perhaps together
+	 *   with some batching or delaying of the final IPIs.
+	 * - A delayed freeing and RCU-like quiescing sequence based on mm
+	 *   switching to avoid IPIs completely.
+	 */
+	on_each_cpu_mask(mm_cpumask(mm), do_shoot_lazy_tlb, (void *)mm, 1);
+	if (IS_ENABLED(CONFIG_DEBUG_VM_SHOOT_LAZIES))
+		on_each_cpu(do_check_lazy_tlb, (void *)mm, 1);
+}
+
 /*
  * Called when the last reference to the mm
  * is dropped: either by a lazy thread or by
@@ -786,6 +913,10 @@ void __mmdrop(struct mm_struct *mm)
 	BUG_ON(mm == &init_mm);
 	WARN_ON_ONCE(mm == current->mm);
+
+	/* Ensure no CPUs are using this as their lazy tlb mm */
+	cleanup_lazy_tlbs(mm);
+
 	WARN_ON_ONCE(mm == current->active_mm);
 	mm_free_pgd(mm);
 	destroy_context(mm);
@@ -1128,6 +1259,9 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	seqcount_init(&mm->write_protect_seq);
 	mmap_init_lock(mm);
 	INIT_LIST_HEAD(&mm->mmlist);
+#ifdef CONFIG_PER_VMA_LOCK
+	mm->mm_lock_seq = 0;
+#endif
 	mm_pgtables_bytes_init(mm);
 	mm->map_count = 0;
 	mm->locked_vm = 0;
@@ -3159,6 +3293,9 @@ void __init proc_caches_init(void)
 			NULL);
 
 	vm_area_cachep = KMEM_CACHE(vm_area_struct, SLAB_PANIC|SLAB_ACCOUNT);
+#ifdef CONFIG_PER_VMA_LOCK
+	vma_lock_cachep = KMEM_CACHE(vma_lock, SLAB_PANIC|SLAB_ACCOUNT);
+#endif
 	mmap_init();
 	nsproxy_cache_init();
 }
diff --git a/kernel/kcsan/kcsan_test.c b/kernel/kcsan/kcsan_test.c
index a60c561724be..0ddbdab5903d 100644
--- a/kernel/kcsan/kcsan_test.c
+++ b/kernel/kcsan/kcsan_test.c
@@ -1572,34 +1572,26 @@ static void test_exit(struct kunit *test)
 }
 
 __no_kcsan
-static void register_tracepoints(struct tracepoint *tp, void *ignore)
+static void register_tracepoints(void)
 {
-	check_trace_callback_type_console(probe_console);
-	if (!strcmp(tp->name, "console"))
-		WARN_ON(tracepoint_probe_register(tp, probe_console, NULL));
+	register_trace_console(probe_console, NULL);
 }
 
 __no_kcsan
-static void unregister_tracepoints(struct tracepoint *tp, void *ignore)
+static void unregister_tracepoints(void)
 {
-	if (!strcmp(tp->name, "console"))
-		tracepoint_probe_unregister(tp, probe_console, NULL);
+	unregister_trace_console(probe_console, NULL);
 }
 
 static int kcsan_suite_init(struct kunit_suite *suite)
 {
-	/*
-	 * Because we want to be able to build the test as a module, we need to
-	 * iterate through all known tracepoints, since the static registration
-	 * won't work here.
-	 */
-	for_each_kernel_tracepoint(register_tracepoints, NULL);
+	register_tracepoints();
 	return 0;
 }
 
 static void kcsan_suite_exit(struct kunit_suite *suite)
 {
-	for_each_kernel_tracepoint(unregister_tracepoints, NULL);
+	unregister_tracepoints();
 	tracepoint_synchronize_unregister();
 }
diff --git a/kernel/kthread.c b/kernel/kthread.c
index 4bc6e0971ec9..490792b1066e 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -1406,14 +1406,18 @@ void kthread_use_mm(struct mm_struct *mm)
 	WARN_ON_ONCE(!(tsk->flags & PF_KTHREAD));
 	WARN_ON_ONCE(tsk->mm);
 
+	/*
+	 * It is possible for mm to be the same as tsk->active_mm, but
+	 * we must still mmgrab(mm) and mmdrop_lazy_tlb(active_mm),
+	 * because these references are not equivalent.
+	 */
+	mmgrab(mm);
+
 	task_lock(tsk);
 	/* Hold off tlb flush IPIs while switching mm's */
 	local_irq_disable();
 	active_mm = tsk->active_mm;
-	if (active_mm != mm) {
-		mmgrab(mm);
-		tsk->active_mm = mm;
-	}
+	tsk->active_mm = mm;
 	tsk->mm = mm;
 	membarrier_update_current_mm(mm);
 	switch_mm_irqs_off(active_mm, mm, tsk);
@@ -1430,12 +1434,9 @@ void kthread_use_mm(struct mm_struct *mm)
 	 * memory barrier after storing to tsk->mm, before accessing
 	 * user-space memory. A full memory barrier for membarrier
 	 * {PRIVATE,GLOBAL}_EXPEDITED is implicitly provided by
-	 * mmdrop(), or explicitly with smp_mb().
+	 * mmdrop_lazy_tlb().
 	 */
-	if (active_mm != mm)
-		mmdrop(active_mm);
-	else
-		smp_mb();
+	mmdrop_lazy_tlb(active_mm);
 }
 EXPORT_SYMBOL_GPL(kthread_use_mm);
 
@@ -1463,10 +1464,13 @@ void kthread_unuse_mm(struct mm_struct *mm)
 	local_irq_disable();
 	tsk->mm = NULL;
 	membarrier_update_current_mm(NULL);
+	mmgrab_lazy_tlb(mm);
 	/* active_mm is still 'mm' */
 	enter_lazy_tlb(mm, tsk);
 	local_irq_enable();
 	task_unlock(tsk);
+
+	mmdrop(mm);
 }
 EXPORT_SYMBOL_GPL(kthread_unuse_mm);
 
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 5bcb11ad50bf..6a333adce3b3 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -71,6 +71,8 @@ EXPORT_SYMBOL_GPL(console_printk);
 atomic_t ignore_console_lock_warning __read_mostly = ATOMIC_INIT(0);
 EXPORT_SYMBOL(ignore_console_lock_warning);
 
+EXPORT_TRACEPOINT_SYMBOL_GPL(console);
+
 /*
  * Low level drivers may need that to know if they can schedule in
  * their unblank() callback or not. So let's export it.
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0d18c3969f90..143e46bd2a68 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5203,13 +5203,14 @@ static struct rq *finish_task_switch(struct task_struct *prev)
 	 * rq->curr, before returning to userspace, so provide them here:
 	 *
 	 * - a full memory barrier for {PRIVATE,GLOBAL}_EXPEDITED, implicitly
-	 *   provided by mmdrop(),
+	 *   provided by mmdrop_lazy_tlb(),
 	 * - a sync_core for SYNC_CORE.
 	 */
 	if (mm) {
 		membarrier_mm_sync_core_before_usermode(mm);
-		mmdrop_sched(mm);
+		mmdrop_lazy_tlb_sched(mm);
 	}
+
 	if (unlikely(prev_state == TASK_DEAD)) {
 		if (prev->sched_class->task_dead)
 			prev->sched_class->task_dead(prev);
@@ -5266,9 +5267,9 @@ context_switch(struct rq *rq, struct task_struct *prev,
 
 	/*
 	 * kernel -> kernel   lazy + transfer active
-	 *   user -> kernel   lazy + mmgrab() active
+	 *   user -> kernel   lazy + mmgrab_lazy_tlb() active
 	 *
-	 * kernel ->   user   switch + mmdrop() active
+	 * kernel ->   user   switch + mmdrop_lazy_tlb() active
 	 *   user ->   user   switch
 	 */
 	if (!next->mm) {                                // to kernel
@@ -5276,7 +5277,7 @@ context_switch(struct rq *rq, struct task_struct *prev,
 
 		next->active_mm = prev->active_mm;
 		if (prev->mm)                           // from user
-			mmgrab(prev->active_mm);
+			mmgrab_lazy_tlb(prev->active_mm);
 		else
 			prev->active_mm = NULL;
 	} else {                                        // to user
@@ -5293,7 +5294,7 @@ context_switch(struct rq *rq, struct task_struct *prev,
 		lru_gen_use_mm(next->mm);
 
 		if (!prev->mm) {                        // from kernel
-			/* will mmdrop() in finish_task_switch(). */
+			/* will mmdrop_lazy_tlb() in finish_task_switch(). */
 			rq->prev_mm = prev->active_mm;
 			prev->active_mm = NULL;
 		}
@@ -9935,7 +9936,7 @@ void __init sched_init(void)
 	/*
 	 * The boot idle thread does lazy MMU switching as well:
 	 */
-	mmgrab(&init_mm);
+	mmgrab_lazy_tlb(&init_mm);
 	enter_lazy_tlb(&init_mm, current);
 
 	/*
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5f6587d94c1d..da0d8b0b8a2a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2928,6 +2928,24 @@ static void reset_ptenuma_scan(struct task_struct *p)
 	p->mm->numa_scan_offset = 0;
 }
 
+static bool vma_is_accessed(struct vm_area_struct *vma)
+{
+	unsigned long pids;
+	/*
+	 * Allow unconditional access first two times, so that all the (pages)
+	 * of VMAs get prot_none fault introduced irrespective of accesses.
+	 * This is also done to avoid any side effect of task scanning
+	 * amplifying the unfairness of disjoint set of VMAs' access.
+	 */
+	if (READ_ONCE(current->mm->numa_scan_seq) < 2)
+		return true;
+
+	pids = vma->numab_state->access_pids[0] | vma->numab_state->access_pids[1];
+	return test_bit(hash_32(current->pid, ilog2(BITS_PER_LONG)), &pids);
+}
+
+#define VMA_PID_RESET_PERIOD (4 * sysctl_numa_balancing_scan_delay)
+
 /*
  * The expensive part of numa migration is done from task_work context.
  * Triggered from task_tick_numa().
@@ -3027,6 +3045,45 @@ static void task_numa_work(struct callback_head *work)
 		if (!vma_is_accessible(vma))
 			continue;
 
+		/* Initialise new per-VMA NUMAB state. */
+		if (!vma->numab_state) {
+			vma->numab_state = kzalloc(sizeof(struct vma_numab_state),
+				GFP_KERNEL);
+			if (!vma->numab_state)
+				continue;
+
+			vma->numab_state->next_scan = now +
+				msecs_to_jiffies(sysctl_numa_balancing_scan_delay);
+
+			/* Reset happens after 4 times scan delay of scan start */
+			vma->numab_state->next_pid_reset = vma->numab_state->next_scan +
+				msecs_to_jiffies(VMA_PID_RESET_PERIOD);
+		}
+
+		/*
+		 * Scanning the VMA's of short lived tasks add more overhead. So
+		 * delay the scan for new VMAs.
+		 */
+		if (mm->numa_scan_seq && time_before(jiffies,
+						vma->numab_state->next_scan))
+			continue;
+
+		/* Do not scan the VMA if task has not accessed */
+		if (!vma_is_accessed(vma))
+			continue;
+
+		/*
+		 * RESET access PIDs regularly for old VMAs. Resetting after checking
+		 * vma for recent access to avoid clearing PID info before access..
+		 */
+		if (mm->numa_scan_seq &&
+		    time_after(jiffies, vma->numab_state->next_pid_reset)) {
+			vma->numab_state->next_pid_reset = vma->numab_state->next_pid_reset +
+				msecs_to_jiffies(VMA_PID_RESET_PERIOD);
+			vma->numab_state->access_pids[0] = READ_ONCE(vma->numab_state->access_pids[1]);
+			vma->numab_state->access_pids[1] = 0;
+		}
+
 		do {
 			start = max(start, vma->vm_start);
 			end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
diff --git a/kernel/sys.c b/kernel/sys.c
index 351de7916302..72cdb16e2636 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -15,6 +15,7 @@
 #include <linux/highuid.h>
 #include <linux/fs.h>
 #include <linux/kmod.h>
+#include <linux/ksm.h>
 #include <linux/perf_event.h>
 #include <linux/resource.h>
 #include <linux/kernel.h>
@@ -2388,6 +2389,16 @@ static inline int prctl_get_mdwe(unsigned long arg2, unsigned long arg3,
 		PR_MDWE_REFUSE_EXEC_GAIN : 0;
 }
 
+static int prctl_get_auxv(void __user *addr, unsigned long len)
+{
+	struct mm_struct *mm = current->mm;
+	unsigned long size = min_t(unsigned long, sizeof(mm->saved_auxv), len);
+
+	if (size && copy_to_user(addr, mm->saved_auxv, size))
+		return -EFAULT;
+	return sizeof(mm->saved_auxv);
+}
+
 SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 		unsigned long, arg4, unsigned long, arg5)
 {
@@ -2518,6 +2529,11 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 		else
 			return -EINVAL;
 		break;
+	case PR_GET_AUXV:
+		if (arg4 || arg5)
+			return -EINVAL;
+		error = prctl_get_auxv((void __user *)arg2, arg3);
+		break;
 	default:
 		return -EINVAL;
 	}
@@ -2672,6 +2688,32 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 	case PR_SET_VMA:
 		error = prctl_set_vma(arg2, arg3, arg4, arg5);
 		break;
+#ifdef CONFIG_KSM
+	case PR_SET_MEMORY_MERGE:
+		if (arg3 || arg4 || arg5)
+			return -EINVAL;
+		if (mmap_write_lock_killable(me->mm))
+			return -EINTR;
+
+		if (arg2) {
+			error = ksm_enable_merge_any(me->mm);
+		} else {
+			/*
+			 * TODO: we might want disable KSM on all VMAs and
+			 * trigger unsharing to completely disable KSM.
+			 */
+			clear_bit(MMF_VM_MERGE_ANY, &me->mm->flags);
+			error = 0;
+		}
+		mmap_write_unlock(me->mm);
+		break;
+	case PR_GET_MEMORY_MERGE:
+		if (arg2 || arg3 || arg4 || arg5)
+			return -EINVAL;
+
+		error = !!test_bit(MMF_VM_MERGE_ANY, &me->mm->flags);
+		break;
+#endif
 	default:
 		error = -EINVAL;
 		break;
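
For reference, the prctl_get_auxv() helper added in the kernel/sys.c hunk above
is reached with option PR_GET_AUXV: arg2/arg3 give a destination buffer and its
length, arg4/arg5 must be zero, and the return value is sizeof(mm->saved_auxv)
rather than the number of bytes copied. A minimal sketch (assumes a 64-bit
target and uapi headers that define PR_GET_AUXV):

    #include <elf.h>
    #include <stdio.h>
    #include <sys/prctl.h>

    int main(void)
    {
            /* Zero-filled and larger than the kernel's saved_auxv array */
            Elf64_auxv_t auxv[64] = { 0 };
            int full = prctl(PR_GET_AUXV, auxv, sizeof(auxv), 0, 0);

            if (full < 0) {
                    perror("PR_GET_AUXV");
                    return 1;
            }

            for (size_t i = 0; i < sizeof(auxv) / sizeof(auxv[0]); i++) {
                    if (auxv[i].a_type == AT_NULL)
                            break;
                    printf("type %-4lu value 0x%lx\n",
                           (unsigned long)auxv[i].a_type,
                           (unsigned long)auxv[i].a_un.a_val);
            }
            return 0;
    }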