Age | Commit message (Collapse) | Author |
|
Remove the redundant updating of stats_flush_threshold. If the global var
stats_flush_threshold has exceeded the trigger value for
__mem_cgroup_flush_stats, further increment is unnecessary.
Apply the patch and test the pts/hackbench-1.0.0 Count:4 (160 threads).
Score gain: 1.95x
Reduce CPU cycles in __mod_memcg_lruvec_state (44.88% -> 0.12%)
CPU: ICX 8380 x 2 sockets
Core number: 40 x 2 physical cores
Benchmark: pts/hackbench-1.0.0 Count:4 (160 threads)
Link: https://lkml.kernel.org/r/20220722164949.47760-1-jiebin.sun@intel.com
Signed-off-by: Jiebin Sun <jiebin.sun@intel.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Acked-by: Muchun Song <songmuchun@bytedance.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Amadeusz Sawiski <amadeuszx.slawinski@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
We forget to set cft->private for numa stat file. As a result, numa stat
of hstates[0] is always showed for all hstates. Encode the hstates index
into cft->private to fix this issue.
Link: https://lkml.kernel.org/r/20220723073804.53035-1-linmiaohe@huawei.com
Fixes: f47761999052 ("hugetlb: add hugetlb.*.numa_stat file")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Muchun Song <songmuchun@bytedance.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Avoids truncating the debugfs output to 16 chars. Potentially alters
the userspace output, but this is a debugfs interface and there are no
stability guarantees.
Link: https://lkml.kernel.org/r/20220719091554.27864-1-quic_yingangl@quicinc.com
Signed-off-by: Kassey Li <quic_yingangl@quicinc.com>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
We can use unlock label to unlock ptl and return ret directly to remove
the unneeded out label and reduce the size of mempolicy.o. No functional
change intended.
[Before]
text data bss dec hex filename
26702 3972 6168 36842 8fea mm/mempolicy.o
[After]
text data bss dec hex filename
26662 3972 6168 36802 8fc2 mm/mempolicy.o
Link: https://lkml.kernel.org/r/20220719115233.6706-1-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
cpuset.c was moved to kernel/cgroup/ in below commit
201af4c0fab0 ("cgroup: move cgroup files under kernel/cgroup/")
Correct the wrong path in comment.
Link: https://lkml.kernel.org/r/20220718120336.5145-1-mark-pk.tsai@mediatek.com
Signed-off-by: Mark-PK Tsai <mark-pk.tsai@mediatek.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
When code reaches here, the page must be !PageAnon. There's no need to
check PageAnon again. Remove it.
Link: https://lkml.kernel.org/r/20220716081816.10752-1-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
This allows userspace to set flags like FS_APPEND_FL, FS_IMMUTABLE_FL,
FS_NODUMP_FL, etc., like all other standard Linux file systems.
[akpm@linux-foundation.org: fix CONFIG_TMPFS_XATTR=n warnings]
Link: https://lkml.kernel.org/r/20220715015912.2560575-1-tytso@mit.edu
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
damon_reclaim_init() allocates a memory chunk for ctx with
damon_new_ctx(). When damon_select_ops() fails, ctx is not released,
which will lead to a memory leak.
We should release the ctx with damon_destroy_ctx() when damon_select_ops()
fails to fix the memory leak.
Link: https://lkml.kernel.org/r/20220714063746.2343549-1-niejianglei2021@163.com
Fixes: 4d69c3457821 ("mm/damon/reclaim: use damon_select_ops() instead of damon_{v,p}a_set_operations()")
Signed-off-by: Jianglei Nie <niejianglei2021@163.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
memory.reclaim is a cgroup v2 interface that allows users to proactively
reclaim memory from a memcg, without real memory pressure. Reclaim
operations invoke vmpressure, which is used: (a) To notify userspace of
reclaim efficiency in cgroup v1, and (b) As a signal for a memcg being
under memory pressure for networking (see
mem_cgroup_under_socket_pressure()).
For (a), vmpressure notifications in v1 are not affected by this change
since memory.reclaim is a v2 feature.
For (b), the effects of the vmpressure signal (according to Shakeel [1])
are as follows:
1. Reducing send and receive buffers of the current socket.
2. May drop packets on the rx path.
3. May throttle current thread on the tx path.
Since proactive reclaim is invoked directly by userspace, not by memory
pressure, it makes sense not to throttle networking. Hence, this change
makes sure that proactive reclaim caused by memory.reclaim does not
trigger vmpressure.
[1] https://lore.kernel.org/lkml/CALvZod68WdrXEmBpOkadhB5GPYmCXaDZzXH=yyGOCAjFRn4NDQ@mail.gmail.com/
[yosryahmed@google.com: update documentation]
Link: https://lkml.kernel.org/r/20220721173015.2643248-1-yosryahmed@google.com
Link: https://lkml.kernel.org/r/20220714064918.2576464-1-yosryahmed@google.com
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: NeilBrown <neilb@suse.de>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
zs_malloc returns 0 if it fails. zs_zpool_malloc will return -1 when
zs_malloc return 0. But -1 makes the return value unclear.
For example, when zswap_frontswap_store calls zs_malloc through
zs_zpool_malloc, it will return -1 to its caller. The other return value
is -EINVAL, -ENODEV or something else.
This commit changes zs_malloc to return ERR_PTR on failure. It didn't
just let zs_zpool_malloc return -ENOMEM becaue zs_malloc has two types of
failure:
- size is not OK return -EINVAL
- memory alloc fail return -ENOMEM.
Link: https://lkml.kernel.org/r/20220714080757.12161-1-teawater@gmail.com
Signed-off-by: Hui Zhu <teawater@antgroup.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
In a system(Huawei Ascend ARM64 SoC) using HBM, a multi-bit ECC error
occurs, and the BIOS will mark the corresponding area (for example, 2 MB)
as unusable. When the system restarts next time, these areas are not
reported or reported as EFI_UNUSABLE_MEMORY. Both cases lead to an
increase in the number of memblocks, whereas EFI_UNUSABLE_MEMORY leads to
a larger number of memblocks.
For example, if the EFI_UNUSABLE_MEMORY type is reported:
...
memory[0x92] [0x0000200834a00000-0x0000200835bfffff], 0x0000000001200000 bytes on node 7 flags: 0x0
memory[0x93] [0x0000200835c00000-0x0000200835dfffff], 0x0000000000200000 bytes on node 7 flags: 0x4
memory[0x94] [0x0000200835e00000-0x00002008367fffff], 0x0000000000a00000 bytes on node 7 flags: 0x0
memory[0x95] [0x0000200836800000-0x00002008369fffff], 0x0000000000200000 bytes on node 7 flags: 0x4
memory[0x96] [0x0000200836a00000-0x0000200837bfffff], 0x0000000001200000 bytes on node 7 flags: 0x0
memory[0x97] [0x0000200837c00000-0x0000200837dfffff], 0x0000000000200000 bytes on node 7 flags: 0x4
memory[0x98] [0x0000200837e00000-0x000020087fffffff], 0x0000000048200000 bytes on node 7 flags: 0x0
memory[0x99] [0x0000200880000000-0x0000200bcfffffff], 0x0000000350000000 bytes on node 6 flags: 0x0
memory[0x9a] [0x0000200bd0000000-0x0000200bd01fffff], 0x0000000000200000 bytes on node 6 flags: 0x4
memory[0x9b] [0x0000200bd0200000-0x0000200bd07fffff], 0x0000000000600000 bytes on node 6 flags: 0x0
memory[0x9c] [0x0000200bd0800000-0x0000200bd09fffff], 0x0000000000200000 bytes on node 6 flags: 0x4
memory[0x9d] [0x0000200bd0a00000-0x0000200fcfffffff], 0x00000003ff600000 bytes on node 6 flags: 0x0
memory[0x9e] [0x0000200fd0000000-0x0000200fd01fffff], 0x0000000000200000 bytes on node 6 flags: 0x4
memory[0x9f] [0x0000200fd0200000-0x0000200fffffffff], 0x000000002fe00000 bytes on node 6 flags: 0x0
...
The EFI memory map is parsed to construct the memblock arrays before the
memblock arrays can be resized. As the result, memory regions beyond
INIT_MEMBLOCK_REGIONS are lost.
Add a new macro INIT_MEMBLOCK_MEMORY_REGIONS to replace
INIT_MEMBLOCK_REGTIONS to define the size of the static memblock.memory
array.
Allow overriding memblock.memory array size with architecture defined
INIT_MEMBLOCK_MEMORY_REGIONS and make arm64 to set
INIT_MEMBLOCK_MEMORY_REGIONS to 1024 when CONFIG_EFI is enabled.
Link: https://lkml.kernel.org/r/20220615102742.96450-1-zhouguanghui1@huawei.com
Signed-off-by: Zhou Guanghui <zhouguanghui1@huawei.com>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Tested-by: Darren Hart <darren@os.amperecomputing.com>
Acked-by: Will Deacon <will@kernel.org> [arm64]
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Xu Qiang <xuqiang36@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Since commit 7267ec008b5c ("mm: postpone page table allocation until we
have page to map"), do_fault_around is not called with page table lock
held. Cleanup the corresponding comments.
Link: https://lkml.kernel.org/r/20220716080359.38791-1-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The number of scanned pages can be lower than the number of isolated pages
when isolating mirgratable or free pageblock. The metric is being
reported in trace event and also used in vmstat.
some example output from trace where it shows nr_taken can be greater
than nr_scanned:
Produced by kernel v5.19-rc6
kcompactd0-42 [001] ..... 1210.268022: mm_compaction_isolate_migratepages: range=(0x107ae4 ~ 0x107c00) nr_scanned=265 nr_taken=255
[...]
kcompactd0-42 [001] ..... 1210.268382: mm_compaction_isolate_freepages: range=(0x215800 ~ 0x215a00) nr_scanned=13 nr_taken=128
kcompactd0-42 [001] ..... 1210.268383: mm_compaction_isolate_freepages: range=(0x215600 ~ 0x215680) nr_scanned=1 nr_taken=128
mm_compaction_isolate_migratepages does not seem to have this
behaviour, but for the reason of consistency, nr_scanned should also be
taken care of in that side.
This behaviour is confusing since currently the count for isolated pages
takes account of compound page but not for the case of scanned pages. And
given that the number of isolated pages(nr_taken) reported in
mm_compaction_isolate_template trace event is on a single-page basis, the
ambiguity when reporting the number of scanned pages can be removed by
also including compound page count.
Link: https://lkml.kernel.org/r/20220711202806.22296-1-william.lam@bytedance.com
Signed-off-by: William Lam <william.lam@bytedance.com>
Reviewed-by: Punit Agrawal <punit.agrawal@bytedance.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Yafang Shao reported an issue related to the accounting of bpf memory:
if a bpf map is charged indirectly for memory consumed from an
interrupt context and allocations are enforced, MEMCG_MAX events are
not raised.
It's not/less of an issue in a generic case because consequent
allocations from a process context will trigger the direct reclaim and
MEMCG_MAX events will be raised. However a bpf map can belong to a
dying/abandoned memory cgroup, so there will be no allocations from a
process context and no MEMCG_MAX events will be triggered.
Link: https://lkml.kernel.org/r/20220702033521.64630-1-roman.gushchin@linux.dev
Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
Reported-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Restructure the logic in filemap_write_and_wait_range to simplify the code
and make it more consistent with file_write_and_wait_range. No functional
change intended.
Link: https://lkml.kernel.org/r/20220627132351.55680-1-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Since the beginning, charged is set to 0 to avoid calling vm_unacct_memory
twice because vm_unacct_memory will be called by above unmap_region. But
since commit 4f74d2c8e827 ("vm: remove 'nr_accounted' calculations from
the unmap_vmas() interfaces"), unmap_region doesn't call vm_unacct_memory
anymore. So charged shouldn't be set to 0 now otherwise the calling to
paired vm_unacct_memory will be missed and leads to imbalanced account.
Link: https://lkml.kernel.org/r/20220618082027.43391-1-linmiaohe@huawei.com
Fixes: 4f74d2c8e827 ("vm: remove 'nr_accounted' calculations from the unmap_vmas() interfaces")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
syzbot is reporting double kfree() at free_prealloced_shrinker() [1], for
destroy_unused_super() calls free_prealloced_shrinker() even if
prealloc_shrinker() returned an error. Explicitly clear shrinker name
when prealloc_shrinker() called kfree().
[roman.gushchin@linux.dev: zero shrinker->name in all cases where shrinker->name is freed]
Link: https://lkml.kernel.org/r/YtgteTnQTgyuKUSY@castle
Link: https://syzkaller.appspot.com/bug?extid=8b481578352d4637f510 [1]
Link: https://lkml.kernel.org/r/ffa62ece-6a42-2644-16cf-0d33ef32c676@I-love.SAKURA.ne.jp
Fixes: e33c267ab70de424 ("mm: shrinkers: provide shrinkers with names")
Reported-by: syzbot <syzbot+8b481578352d4637f510@syzkaller.appspotmail.com>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
If hmm_range_fault() is called with the HMM_PFN_REQ_FAULT flag and a
device private PTE is found, the hmm_range::dev_private_owner page is used
to determine if the device private page should not be faulted in.
However, if the device private page is not owned by the caller,
hmm_range_fault() returns an error instead of calling migrate_to_ram() to
fault in the page.
For example, if a page is migrated to GPU private memory and a RDMA fault
capable NIC tries to read the migrated page, without this patch it will
get an error. With this patch, the page will be migrated back to system
memory and the NIC will be able to read the data.
Link: https://lkml.kernel.org/r/20220727000837.4128709-2-rcampbell@nvidia.com
Link: https://lkml.kernel.org/r/20220725183615.4118795-2-rcampbell@nvidia.com
Fixes: 08ddddda667b ("mm/hmm: check the device private page owner in hmm_range_fault()")
Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Reported-by: Felix Kuehling <felix.kuehling@amd.com>
Reviewed-by: Alistair Popple <apopple@nvidia.com>
Cc: Philip Yang <Philip.Yang@amd.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
There was a report that a task is waiting at the
throttle_direct_reclaim. The pgscan_direct_throttle in vmstat was
increasing.
This is a bug where zone_watermark_fast returns true even when the free
is very low. The commit f27ce0e14088 ("page_alloc: consider highatomic
reserve in watermark fast") changed the watermark fast to consider
highatomic reserve. But it did not handle a negative value case which
can be happened when reserved_highatomic pageblock is bigger than the
actual free.
If watermark is considered as ok for the negative value, allocating
contexts for order-0 will consume all free pages without direct reclaim,
and finally free page may become depleted except highatomic free.
Then allocating contexts may fall into throttle_direct_reclaim. This
symptom may easily happen in a system where wmark min is low and other
reclaimers like kswapd does not make free pages quickly.
Handle the negative case by using MIN.
Link: https://lkml.kernel.org/r/20220725095212.25388-1-jaewon31.kim@samsung.com
Fixes: f27ce0e14088 ("page_alloc: consider highatomic reserve in watermark fast")
Signed-off-by: Jaewon Kim <jaewon31.kim@samsung.com>
Reported-by: GyeongHwan Hong <gh21.hong@samsung.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Yong-Taek Lee <ytk.lee@samsung.com>
Cc: <stable@vger.kerenl.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"Thirteen hotfixes.
Eight are cc:stable and the remainder are for post-5.18 issues or are
too minor to warrant backporting"
* tag 'mm-hotfixes-stable-2022-07-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
mailmap: update Gao Xiang's email addresses
userfaultfd: provide properly masked address for huge-pages
Revert "ocfs2: mount shared volume without ha stack"
hugetlb: fix memoryleak in hugetlb_mcopy_atomic_pte
fs: sendfile handles O_NONBLOCK of out_fd
ntfs: fix use-after-free in ntfs_ucsncmp()
secretmem: fix unhandled fault in truncate
mm/hugetlb: separate path for hwpoison entry in copy_hugetlb_page_range()
mm: fix missing wake-up event for FSDAX pages
mm: fix page leak with multiple threads mapping the same page
mailmap: update Seth Forshee's email address
tmpfs: fix the issue that the mount and remount results are inconsistent.
mm: kfence: apply kmemleak_ignore_phys on early allocated pool
|
|
The vmf->page can be NULL when the wp_page_reuse() is invoked by
wp_pfn_shared(), it will cause the following panic:
BUG: kernel NULL pointer dereference, address: 000000000000008
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT SMP PTI
CPU: 18 PID: 923 Comm: Xorg Not tainted 5.19.0-rc8.bm.1-amd64 #263
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g14
RIP: 0010:_compound_head+0x0/0x40
[...]
Call Trace:
wp_page_reuse+0x1c/0xa0
do_wp_page+0x1a5/0x3f0
__handle_mm_fault+0x8cf/0xd20
handle_mm_fault+0xd5/0x2a0
do_user_addr_fault+0x1d0/0x680
exc_page_fault+0x78/0x170
asm_exc_page_fault+0x22/0x30
To fix it, this patch performs a NULL pointer check before dereferencing
the vmf->page.
Fixes: 6c287605fd56 ("mm: remember exclusively mapped anonymous pages with PG_anon_exclusive")
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
__kunmap_ {local,atomic}() currently take pointers to void. However, this
is semantically incorrect, since these functions do not change the memory
their arguments point to.
Therefore, make this semantics explicit by modifying the
__kunmap_{local,atomic}() prototypes to take pointers to const void.
As a side effect, compilers may produce more efficient code.
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Helge Deller <deller@gmx.de> # parisc
Suggested-by: David Sterba <dsterba@suse.cz>
Suggested-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
* for-next/mte:
arm64: kasan: Revert "arm64: mte: reset the page tag in page->flags"
mm: kasan: Skip page unpoisoning only if __GFP_SKIP_KASAN_UNPOISON
mm: kasan: Skip unpoisoning of user pages
mm: kasan: Ensure the tags are visible before the tag in page->flags
|
|
* for-next/mm:
arm64: enable THP_SWAP for arm64
|
|
If we're creating a page cache page with FGP_CREAT but FGP_NOWAIT is
set, we should dial back the gfp flags to avoid frivolous blocking
which is trivial to hit in low memory conditions:
[ 10.117661] __schedule+0x8c/0x550
[ 10.118305] schedule+0x58/0xa0
[ 10.118897] schedule_timeout+0x30/0xdc
[ 10.119610] __wait_for_common+0x88/0x114
[ 10.120348] wait_for_completion+0x1c/0x24
[ 10.121103] __flush_work.isra.0+0x16c/0x19c
[ 10.121896] flush_work+0xc/0x14
[ 10.122496] __drain_all_pages+0x144/0x218
[ 10.123267] drain_all_pages+0x10/0x18
[ 10.123941] __alloc_pages+0x464/0x9e4
[ 10.124633] __folio_alloc+0x18/0x3c
[ 10.125294] __filemap_get_folio+0x17c/0x204
[ 10.126084] iomap_write_begin+0xf8/0x428
[ 10.126829] iomap_file_buffered_write+0x144/0x24c
[ 10.127710] xfs_file_buffered_write+0xe8/0x248
[ 10.128553] xfs_file_write_iter+0xa8/0x120
[ 10.129324] io_write+0x16c/0x38c
[ 10.129940] io_issue_sqe+0x70/0x1cc
[ 10.130617] io_queue_sqe+0x18/0xfc
[ 10.131277] io_submit_sqes+0x5d4/0x600
[ 10.131946] __arm64_sys_io_uring_enter+0x224/0x600
[ 10.132752] invoke_syscall.constprop.0+0x70/0xc0
[ 10.133616] do_el0_svc+0xd0/0x118
[ 10.134238] el0_svc+0x78/0xa0
Clear IO, FS, and reclaim flags and mark the allocation as GFP_NOWAIT and
add __GFP_NOWARN to avoid polluting dmesg with pointless allocations
failures. A caller with FGP_NOWAIT must be expected to handle the
resulting -EAGAIN return and retry from a suitable context without NOWAIT
set.
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This adds the helper function balance_dirty_pages_ratelimited_flags().
It adds the parameter flags to balance_dirty_pages_ratelimited().
The flags parameter is passed to balance_dirty_pages(). For async
buffered writes the flag value will be BDP_ASYNC.
If balance_dirty_pages() gets called for async buffered write, we don't
want to wait. Instead we need to indicate to the caller that throttling
is needed so that it can stop writing and offload the rest of the write
to a context that can block.
The new helper function is also used by balance_dirty_pages_ratelimited().
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Stefan Roesch <shr@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220623175157.1715274-4-shr@fb.com
[axboe: fix kerneltest bot 'ret' issue]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Transition of wb->dirty_exceeded from 0 to 1 happens before we go to
sleep in balance_dirty_pages() while transition from 1 to 0 happens when
exiting from balance_dirty_pages(), possibly based on old values. This
does not make a lot of sense since wb->dirty_exceeded should simply
reflect whether wb is over dirty limit and so we should ratelimit
entering to balance_dirty_pages() less. Move the two updates together.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Stefan Roesch <shr@fb.com>
Link: https://lore.kernel.org/r/20220623175157.1715274-3-shr@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
We start background writeback if we are over background threshold after
exiting the main loop in balance_dirty_pages(). This may result in
basing the decision on already stale values (we may have slept for
significant amount of time) and it is also inconvenient for refactoring
needed for async dirty throttling. Move the check into the main waiting
loop.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Stefan Roesch <shr@fb.com>
Link: https://lore.kernel.org/r/20220623175157.1715274-2-shr@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The core of devm_request_free_mem_region() is a helper that searches for
free space in iomem_resource and performs __request_region_locked() on
the result of that search. The policy choices of the implementation
conform to what CONFIG_DEVICE_PRIVATE users want which is memory that is
immediately marked busy, and a preference to search for the first-fit
free range in descending order from the top of the physical address
space.
CXL has a need for a similar allocator, but with the following tweaks:
1/ Search for free space in ascending order
2/ Search for free space relative to a given CXL window
3/ 'insert' rather than 'request' the new resource given downstream
drivers from the CXL Region driver (like the pmem or dax drivers) are
responsible for request_mem_region() when they activate the memory
range.
Rework __request_free_mem_region() into get_free_mem_region() which
takes a set of GFR_* (Get Free Region) flags to control the allocation
policy (ascending vs descending), and "busy" policy (insert_resource()
vs request_region()).
As part of the consolidation of the legacy GFR_REQUEST_REGION case with
the new default of just inserting a new resource into the free space
some minor cleanups like not checking for NULL before calling
devres_free() (which does its own check) is included.
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/linux-cxl/20220420143406.GY2120790@nvidia.com/
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Link: https://lore.kernel.org/r/165784333333.1758207.13703329337805274043.stgit@dwillia2-xfh.jf.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
|
Now that only SLOB use __kmem_cache_{alloc,free}_bulk(), move them to
SLOB. No functional change intended.
Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
There is no benefit to call generic bulk free function when
kmem_cache_alloc_bulk() failed. Use own kmem_cache_free_bulk()
instead of generic function.
Note that if kmem_cache_alloc_bulk() fails to allocate first object in
SLUB, size is zero. So allow passing size == 0 to kmem_cache_free_bulk()
like SLAB's.
Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
THP_SWAP has been proven to improve the swap throughput significantly
on x86_64 according to commit bd4c82c22c367e ("mm, THP, swap: delay
splitting THP after swapped out").
As long as arm64 uses 4K page size, it is quite similar with x86_64
by having 2MB PMD THP. THP_SWAP is architecture-independent, thus,
enabling it on arm64 will benefit arm64 as well.
A corner case is that MTE has an assumption that only base pages
can be swapped. We won't enable THP_SWAP for ARM64 hardware with
MTE support until MTE is reworked to coexist with THP_SWAP.
A micro-benchmark is written to measure thp swapout throughput as
below,
unsigned long long tv_to_ms(struct timeval tv)
{
return tv.tv_sec * 1000 + tv.tv_usec / 1000;
}
main()
{
struct timeval tv_b, tv_e;;
#define SIZE 400*1024*1024
volatile void *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
if (!p) {
perror("fail to get memory");
exit(-1);
}
madvise(p, SIZE, MADV_HUGEPAGE);
memset(p, 0x11, SIZE); /* write to get mem */
gettimeofday(&tv_b, NULL);
madvise(p, SIZE, MADV_PAGEOUT);
gettimeofday(&tv_e, NULL);
printf("swp out bandwidth: %ld bytes/ms\n",
SIZE/(tv_to_ms(tv_e) - tv_to_ms(tv_b)));
}
Testing is done on rk3568 64bit Quad Core Cortex-A55 platform -
ROCK 3A.
thp swp throughput w/o patch: 2734bytes/ms (mean of 10 tests)
thp swp throughput w/ patch: 3331bytes/ms (mean of 10 tests)
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Steven Price <steven.price@arm.com>
Cc: Yang Shi <shy828301@gmail.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
Link: https://lore.kernel.org/r/20220720093737.133375-1-21cnbao@gmail.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
When alloc_huge_page fails, *pagep is set to NULL without put_page first.
So the hugepage indicated by *pagep is leaked.
Link: https://lkml.kernel.org/r/20220709092629.54291-1-linmiaohe@huawei.com
Fixes: 8cc5fcbb5be8 ("mm, hugetlb: fix racy resv_huge_pages underflow on UFFDIO_COPY")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
syzkaller reports the following issue:
BUG: unable to handle page fault for address: ffff888021f7e005
PGD 11401067 P4D 11401067 PUD 11402067 PMD 21f7d063 PTE 800fffffde081060
Oops: 0002 [#1] PREEMPT SMP KASAN
CPU: 0 PID: 3761 Comm: syz-executor281 Not tainted 5.19.0-rc4-syzkaller-00014-g941e3e791269 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:memset_erms+0x9/0x10 arch/x86/lib/memset_64.S:64
Code: c1 e9 03 40 0f b6 f6 48 b8 01 01 01 01 01 01 01 01 48 0f af c6 f3 48 ab 89 d1 f3 aa 4c 89 c8 c3 90 49 89 f9 40 88 f0 48 89 d1 <f3> aa 4c 89 c8 c3 90 49 89 fa 40 0f b6 ce 48 b8 01 01 01 01 01 01
RSP: 0018:ffffc9000329fa90 EFLAGS: 00010202
RAX: 0000000000000000 RBX: 0000000000001000 RCX: 0000000000000ffb
RDX: 0000000000000ffb RSI: 0000000000000000 RDI: ffff888021f7e005
RBP: ffffea000087df80 R08: 0000000000000001 R09: ffff888021f7e005
R10: ffffed10043efdff R11: 0000000000000000 R12: 0000000000000005
R13: 0000000000000000 R14: 0000000000001000 R15: 0000000000000ffb
FS: 00007fb29d8b2700(0000) GS:ffff8880b9a00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffff888021f7e005 CR3: 0000000026e7b000 CR4: 00000000003506f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
zero_user_segments include/linux/highmem.h:272 [inline]
folio_zero_range include/linux/highmem.h:428 [inline]
truncate_inode_partial_folio+0x76a/0xdf0 mm/truncate.c:237
truncate_inode_pages_range+0x83b/0x1530 mm/truncate.c:381
truncate_inode_pages mm/truncate.c:452 [inline]
truncate_pagecache+0x63/0x90 mm/truncate.c:753
simple_setattr+0xed/0x110 fs/libfs.c:535
secretmem_setattr+0xae/0xf0 mm/secretmem.c:170
notify_change+0xb8c/0x12b0 fs/attr.c:424
do_truncate+0x13c/0x200 fs/open.c:65
do_sys_ftruncate+0x536/0x730 fs/open.c:193
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x46/0xb0
RIP: 0033:0x7fb29d900899
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 11 15 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007fb29d8b2318 EFLAGS: 00000246 ORIG_RAX: 000000000000004d
RAX: ffffffffffffffda RBX: 00007fb29d988408 RCX: 00007fb29d900899
RDX: 00007fb29d900899 RSI: 0000000000000005 RDI: 0000000000000003
RBP: 00007fb29d988400 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007fb29d98840c
R13: 00007ffca01a23bf R14: 00007fb29d8b2400 R15: 0000000000022000
</TASK>
Modules linked in:
CR2: ffff888021f7e005
---[ end trace 0000000000000000 ]---
Eric Biggers suggested that this happens when
secretmem_setattr()->simple_setattr() races with secretmem_fault() so that
a page that is faulted in by secretmem_fault() (and thus removed from the
direct map) is zeroed by inode truncation right afterwards.
Use mapping->invalidate_lock to make secretmem_fault() and
secretmem_setattr() mutually exclusive.
[rppt@linux.ibm.com: v3]
Link: https://lkml.kernel.org/r/20220714091337.412297-1-rppt@kernel.org
Link: https://lkml.kernel.org/r/20220707165650.248088-1-rppt@kernel.org
Reported-by: syzbot+9bd2b7adbd34b30b87e4@syzkaller.appspotmail.com
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Suggested-by: Eric Biggers <ebiggers@kernel.org>
Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Eric Biggers <ebiggers@kernel.org>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Originally copy_hugetlb_page_range() handles migration entries and
hwpoisoned entries in similar manner. But recently the related code path
has more code for migration entries, and when
is_writable_migration_entry() was converted to
!is_readable_migration_entry(), hwpoison entries on source processes got
to be unexpectedly updated (which is legitimate for migration entries, but
not for hwpoison entries). This results in unexpected serious issues like
kernel panic when forking processes with hwpoison entries in pmd.
Separate the if branch into one for hwpoison entries and one for migration
entries.
Link: https://lkml.kernel.org/r/20220704013312.2415700-3-naoya.horiguchi@linux.dev
Fixes: 6c287605fd56 ("mm: remember exclusively mapped anonymous pages with PG_anon_exclusive")
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Cc: <stable@vger.kernel.org> [5.18]
Cc: David Hildenbrand <david@redhat.com>
Cc: Liu Shixin <liushixin2@huawei.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
FSDAX page refcounts are 1-based, rather than 0-based: if refcount is
1, then the page is freed. The FSDAX pages can be pinned through GUP,
then they will be unpinned via unpin_user_page() using a folio variant
to put the page, however, folio variants did not consider this special
case, the result will be to miss a wakeup event (like the user of
__fuse_dax_break_layouts()). This results in a task being permanently
stuck in TASK_INTERRUPTIBLE state.
Since FSDAX pages are only possibly obtained by GUP users, so fix GUP
instead of folio_put() to lower overhead.
Link: https://lkml.kernel.org/r/20220705123532.283-1-songmuchun@bytedance.com
Fixes: d8ddc099c6b3 ("mm/gup: Add gup_put_folio()")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
We have an application with a lot of threads that use a shared mmap backed
by tmpfs mounted with -o huge=within_size. This application started
leaking loads of huge pages when we upgraded to a recent kernel.
Using the page ref tracepoints and a BPF program written by Tejun Heo we
were able to determine that these pages would have multiple refcounts from
the page fault path, but when it came to unmap time we wouldn't drop the
number of refs we had added from the faults.
I wrote a reproducer that mmap'ed a file backed by tmpfs with -o
huge=always, and then spawned 20 threads all looping faulting random
offsets in this map, while using madvise(MADV_DONTNEED) randomly for huge
page aligned ranges. This very quickly reproduced the problem.
The problem here is that we check for the case that we have multiple
threads faulting in a range that was previously unmapped. One thread maps
the PMD, the other thread loses the race and then returns 0. However at
this point we already have the page, and we are no longer putting this
page into the processes address space, and so we leak the page. We
actually did the correct thing prior to f9ce0be71d1f, however it looks
like Kirill copied what we do in the anonymous page case. In the
anonymous page case we don't yet have a page, so we don't have to drop a
reference on anything. Previously we did the correct thing for file based
faults by returning VM_FAULT_NOPAGE so we correctly drop the reference on
the page we faulted in.
Fix this by returning VM_FAULT_NOPAGE in the pmd_devmap_trans_unstable()
case, this makes us drop the ref on the page properly, and now my
reproducer no longer leaks the huge pages.
[josef@toxicpanda.com: v2]
Link: https://lkml.kernel.org/r/e90c8f0dbae836632b669c2afc434006a00d4a67.1657721478.git.josef@toxicpanda.com
Link: https://lkml.kernel.org/r/2b798acfd95c9ab9395fe85e8d5a835e2e10a920.1657051137.git.josef@toxicpanda.com
Fixes: f9ce0be71d1f ("mm: Cleanup faultaround and finish_fault() codepaths")
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Chris Mason <clm@fb.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
An undefined-behavior issue has not been completely fixed since commit
d14f5efadd84 ("tmpfs: fix undefined-behaviour in shmem_reconfigure()").
In the commit, check in the shmem_reconfigure() is added in remount
process to avoid the Ubsan problem. However, the check is not added to
the mount process. It causes inconsistent results between mount and
remount. The operations to reproduce the problem in user mode as follows:
If nr_blocks is set to 0x8000000000000000, the mounting is successful.
# mount tmpfs /dev/shm/ -t tmpfs -o nr_blocks=0x8000000000000000
However, when -o remount is used, the mount fails because of the
check in the shmem_reconfigure()
# mount tmpfs /dev/shm/ -t tmpfs -o remount,nr_blocks=0x8000000000000000
mount: /dev/shm: mount point not mounted or bad option.
Therefore, add checks in the shmem_parse_one() function and remove the
check in shmem_reconfigure() to avoid this problem.
Link: https://lkml.kernel.org/r/20220629124324.1640807-1-wangzhaolong1@huawei.com
Signed-off-by: ZhaoLong Wang <wangzhaolong1@huawei.com>
Cc: Luo Meng <luomeng12@huawei.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Yu Kuai <yukuai3@huawei.com>
Cc: Zhihao Cheng <chengzhihao1@huawei.com>
Cc: Zhang Yi <yi.zhang@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
This patch solves two issues.
(1) The pool allocated by memblock needs to unregister from
kmemleak scanning. Apply kmemleak_ignore_phys to replace the
original kmemleak_free as its address now is stored in the phys tree.
(2) The pool late allocated by page-alloc doesn't need to unregister.
Move out the freeing operation from its call path.
Link: https://lkml.kernel.org/r/20220628113714.7792-2-yee.lee@mediatek.com
Fixes: 0c24e061196c21d5 ("mm: kmemleak: add rbtree and store physical address for objects allocated with PA")
Signed-off-by: Yee Lee <yee.lee@mediatek.com>
Suggested-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Suggested-by: Marco Elver <elver@google.com>
Reviewed-by: Marco Elver <elver@google.com>
Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
mmget_still_valid() has already been removed via commit 4d45e75a9955 ("mm:
remove the now-unnecessary mmget_still_valid() hack"). Update the
corresponding comment.
Link: https://lkml.kernel.org/r/20220709092527.47778-1-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Use helper function huge_pte_lock() to lock the huge pte to simplify the
code a bit. No functional change intended.
Link: https://lkml.kernel.org/r/20220709092440.43018-1-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Use try_cmpxchg instead of cmpxchg in set_pfnblock_flags_mask. x86
CMPXCHG instruction returns success in ZF flag, so this change saves a
compare after cmpxchg (and related move instruction in front of cmpxchg).
The main loop improves from:
1c5d: 48 89 c2 mov %rax,%rdx
1c60: 48 89 c1 mov %rax,%rcx
1c63: 48 21 fa and %rdi,%rdx
1c66: 4c 09 c2 or %r8,%rdx
1c69: f0 48 0f b1 16 lock cmpxchg %rdx,(%rsi)
1c6e: 48 39 c1 cmp %rax,%rcx
1c71: 75 ea jne 1c5d <...>
to:
1c60: 48 89 ca mov %rcx,%rdx
1c63: 48 21 c2 and %rax,%rdx
1c66: 4c 09 c2 or %r8,%rdx
1c69: f0 48 0f b1 16 lock cmpxchg %rdx,(%rsi)
1c6e: 75 f0 jne 1c60 <...>
Link: https://lkml.kernel.org/r/20220708140736.8737-1-ubizjak@gmail.com
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
show_free_areas() allows to filter out node specific data which is
irrelevant to the allocation request. But hugetlb_show_meminfo() still
shows hugetlb on all nodes, which is redundant and unnecessary.
Use show_mem_node_skip() to skip irrelevant nodes. And replace
hugetlb_show_meminfo() with hugetlb_show_meminfo_node(nid).
before-and-after sample output of OOM:
before:
```
[ 214.362453] Node 1 active_anon:148kB inactive_anon:4050920kB active_file:112kB inactive_file:100kB
[ 214.375429] Node 1 Normal free:45100kB boost:0kB min:45576kB low:56968kB high:68360kB reserved_hig
[ 214.388334] lowmem_reserve[]: 0 0 0 0 0
[ 214.390251] Node 1 Normal: 423*4kB (UE) 320*8kB (UME) 187*16kB (UE) 117*32kB (UE) 57*64kB (UME) 20
[ 214.397626] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 214.401518] Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
```
after:
```
[ 145.069705] Node 1 active_anon:128kB inactive_anon:4049412kB active_file:56kB inactive_file:84kB u
[ 145.110319] Node 1 Normal free:45424kB boost:0kB min:45576kB low:56968kB high:68360kB reserved_hig
[ 145.152315] lowmem_reserve[]: 0 0 0 0 0
[ 145.155244] Node 1 Normal: 470*4kB (UME) 373*8kB (UME) 247*16kB (UME) 168*32kB (UE) 86*64kB (UME)
[ 145.164119] Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
```
Link: https://lkml.kernel.org/r/20220706034655.1834-1-ligang.bdlg@bytedance.com
Signed-off-by: Gang Li <ligang.bdlg@bytedance.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Kmemleak recently added a rbtree to store the objects allocted with
physical address. Those objects can't be freed with kmemleak_free().
According to the comments, percpu allocations are tracked by kmemleak
separately. Kmemleak_free() was used to avoid the unnecessary
tracking. If kmemleak_free() fails, those objects would be scanned by
kmemleak, which is unnecessary but shouldn't lead to other effects.
Use kmemleak_ignore_phys() instead of kmemleak_free() for those
objects.
Link: https://lkml.kernel.org/r/20220705113158.127600-1-patrick.wang.shcn@gmail.com
Fixes: 0c24e061196c ("mm: kmemleak: add rbtree and store physical address for objects allocated with PA")
Signed-off-by: Patrick Wang <patrick.wang.shcn@gmail.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The variable error will be assigned correctly before it is used, the
initialization is redundant, so remove it.
Link: https://lkml.kernel.org/r/20220704114112.163112-1-xiujianfeng@huawei.com
Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Use helper macro IS_ERR_OR_NULL to check the validity of page to simplify
the code. Minor readability improvement.
Link: https://lkml.kernel.org/r/20220704132201.14611-17-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zach O'Keefe <zokeefe@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
It's dangerous and wrong to call page_folio(pmd_page(*pmd)) when pmd isn't
present. But the caller guarantees pmd is present when folio is set. So we
should be safe here. Add comment to make it clear.
Link: https://lkml.kernel.org/r/20220704132201.14611-16-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zach O'Keefe <zokeefe@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
We use page->mapping and page->index, instead of page->indexlru in second
tail page as list_head. Correct it.
Link: https://lkml.kernel.org/r/20220704132201.14611-15-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zach O'Keefe <zokeefe@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
There is nothing to do if a zone doesn't have any pages managed by the
buddy allocator. So we should check managed_zone instead. Also if a thp
is found, there's no need to traverse the subpages again.
Link: https://lkml.kernel.org/r/20220704132201.14611-13-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zach O'Keefe <zokeefe@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Subpages in swapcache won't be freed even if it is the last user of the
page until next time reclaim. It shouldn't hurt indeed, but we could try
to free these pages to save more memory for system.
Link: https://lkml.kernel.org/r/20220704132201.14611-12-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zach O'Keefe <zokeefe@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|