Age | Commit message | Author
2015-09-08 | hugetlbfs: truncate_hugepages() takes a range of pages | Mike Kravetz
Modify truncate_hugepages() to take a range of pages (start, end) instead of simply start. If an end value of LLONG_MAX is passed, the current "truncate" functionality is maintained. Existing callers are modified to pass LLONG_MAX as end of range. By keying off end == LLONG_MAX, the routine behaves differently for truncate and hole punch. Page removal is now synchronized with page allocation via faults by using the fault mutex table. The hole punch case can experience the rare region_del error and must handle accordingly. Add the routine hugetlb_fix_reserve_counts to fix up reserve counts in the case where region_del returns an error. Since the routine handles more than just the truncate case, it is renamed to remove_inode_hugepages(). To be consistent, the routine truncate_huge_page() is renamed remove_huge_page(). Downstream of remove_inode_hugepages(), the routine hugetlb_unreserve_pages() is also modified to take a range of pages. hugetlb_unreserve_pages is modified to detect an error from region_del and pass it back to the caller. Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-08 | hugetlbfs: hugetlb_vmtruncate_list() needs to take a range to delete | Mike Kravetz
fallocate hole punch will want to unmap a specific range of pages. Modify the existing hugetlb_vmtruncate_list() routine to take a start/end range. If end is 0, this indicates all pages after start should be unmapped. This is the same as the existing truncate functionality. Modify existing callers to add 0 as end of range. Since the routine will be used in hole punch as well as truncate operations, it is more appropriately renamed to hugetlb_vmdelete_list(). Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-08 | mm/hugetlb: expose hugetlb fault mutex for use by fallocate | Mike Kravetz
hugetlb page faults are currently synchronized by the table of mutexes (htlb_fault_mutex_table). fallocate code will need to synchronize with the page fault code when it allocates or deletes pages. Expose interfaces so that fallocate operations can be synchronized with page faults. Minor name changes to be more consistent with other global hugetlb symbols. Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-08 | mm/hugetlb: add region_del() to delete a specific range of entries | Mike Kravetz
fallocate hole punch will want to remove a specific range of pages. The existing region_truncate() routine deletes all region/reserve map entries after a specified offset. region_del() will provide this same functionality if the end of region is specified as LONG_MAX. Hence, region_del() can replace region_truncate(). Unlike region_truncate(), region_del() can return an error in the rare case where it can not allocate memory for a region descriptor. This ONLY happens in the case where an existing region must be split. Current callers passing LONG_MAX as end of range will never experience this error and do not need to deal with error handling. Future callers of region_del() (such as fallocate hole punch) will need to handle this error. Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
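The split case described above can be sketched in userspace. This is a toy model under stated assumptions, not the kernel's `region_del()`: the `struct region` list, `region_alloc()`, `toy_region_del()` and the `fail_next_alloc` test hook are all hypothetical names introduced for illustration. It shows why only a deletion strictly inside an existing region needs a new descriptor, and why an end of `LONG_MAX` can never hit that path.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Toy reserve map: a sorted singly linked list of half-open ranges [from, to). */
struct region {
    long from, to;
    struct region *next;
};

/* Hypothetical test hook: when nonzero, the next descriptor allocation fails. */
static int fail_next_alloc;

static struct region *region_alloc(long from, long to, struct region *next)
{
    struct region *rg;

    if (fail_next_alloc) {
        fail_next_alloc = 0;
        return NULL;
    }
    rg = malloc(sizeof(*rg));
    if (rg) {
        rg->from = from;
        rg->to = to;
        rg->next = next;
    }
    return rg;
}

/*
 * Delete [f, t) from the map. Returns the number of pages removed, or
 * -ENOMEM if a region had to be split and no descriptor could be allocated.
 * On -ENOMEM the map is left unchanged.
 */
static long toy_region_del(struct region **head, long f, long t)
{
    struct region **pp = head;
    long removed = 0;

    assert(f < t);
    while (*pp) {
        struct region *rg = *pp;

        if (rg->to <= f) {              /* entirely before the range */
            pp = &rg->next;
            continue;
        }
        if (rg->from >= t)              /* entirely after the range */
            break;

        if (rg->from < f && rg->to > t) {
            /* Range is strictly inside one region: split, needs a descriptor. */
            struct region *tail = region_alloc(t, rg->to, rg->next);

            if (!tail)
                return -ENOMEM;
            rg->to = f;
            rg->next = tail;
            return t - f;
        }
        if (rg->from >= f && rg->to <= t) {
            /* Region fully covered: unlink and free it. */
            removed += rg->to - rg->from;
            *pp = rg->next;
            free(rg);
            continue;
        }
        if (rg->from < f) {
            /* Range covers the tail of this region: trim it. */
            removed += rg->to - f;
            rg->to = f;
            pp = &rg->next;
        } else {
            /* Range covers the head of this region: trim it and stop. */
            removed += t - rg->from;
            rg->from = t;
            break;
        }
    }
    return removed;
}
```

With `t == LONG_MAX` the condition `rg->to > t` is never true, so the allocating branch is unreachable. That matches the commit's statement that truncate-style callers never see the error.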
2015-09-08 | mm/hugetlb: add cache of descriptors to resv_map for region_add | Mike Kravetz
hugetlbfs is used today by applications that want a high degree of control over huge page usage. Often, large hugetlbfs files are used to map a large number of huge pages into the application processes. The applications know when page ranges within these large files will no longer be used, and ideally would like to release them back to the subpool or global pools for other uses. The fallocate() system call provides an interface for preallocation and hole punching within files. This patch set adds fallocate functionality to hugetlbfs. fallocate hole punch will want to remove a specific range of pages. When pages are removed, their associated entries in the region/reserve map will also be removed. This will break an assumption in the region_chg/region_add calling sequence. If a new region descriptor must be allocated, it is done as part of the region_chg processing. In this way, region_add cannot fail because it does not need to attempt an allocation. To prepare for fallocate hole punch, create a "cache" of descriptors that can be used by region_add if necessary. region_chg will ensure there are sufficient entries in the cache. It will be necessary to track the number of in-progress add operations to know that a sufficient number of descriptors reside in the cache. A new routine region_abort is added to adjust this in-progress count when add operations are aborted. vma_abort_reservation is also added for callers creating reservations with vma_needs_reservation/vma_commit_reservation. 
[akpm@linux-foundation.org: fix typo in comment, use more cols] Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
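From userspace, the hole punch that this series prepares for is requested through fallocate(2). The following is a minimal sketch, assuming a Linux system with glibc; `punch_hole()` is a hypothetical wrapper name, and nothing here is specific to hugetlbfs. On a hugetlbfs file the range would presumably need to cover whole huge pages to release anything.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Ask the filesystem to release the byte range [offset, offset + len) of an
 * open file while keeping the file size unchanged.
 * Returns 0 on success, -1 with errno set on failure (for example when the
 * filesystem does not support hole punching).
 */
static int punch_hole(int fd, off_t offset, off_t len)
{
    assert(offset >= 0 && len > 0);
    return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                     offset, len);
}
```

FALLOC_FL_PUNCH_HOLE must be combined with FALLOC_FL_KEEP_SIZE, which is why the wrapper always passes both flags.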
2015-09-08 | mm: rename and move get/set_freepage_migratetype | Vlastimil Babka
The pair of get/set_freepage_migratetype() functions are used to cache pageblock migratetype for a page put on a pcplist, so that it does not have to be retrieved again when the page is put on a free list (e.g. when pcplists become full). Historically it was also assumed that the value is accurate for pages on freelists (as the functions' names unfortunately suggest), but that cannot be guaranteed without affecting various allocator fast paths. It is in fact not needed and all such uses have been removed. The last remaining (but pointless) usage related to pages of freelists is in move_freepages(), which this patch removes. To prevent further confusion, rename the functions to get/set_pcppage_migratetype() and expand their description. Since all the users are now in mm/page_alloc.c, move the functions there from the shared header. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Minchan Kim <minchan@kernel.org> Acked-by: Michal Nazarewicz <mina86@mina86.com> Cc: Laura Abbott <lauraa@codeaurora.org> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Seungho Park <seungho1.park@lge.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-08 | mm, page_isolation: remove bogus tests for isolated pages | Vlastimil Babka
The __test_page_isolated_in_pageblock() is used to verify whether all pages in pageblock were either successfully isolated, or are hwpoisoned. Two of the possible state of pages, that are tested, are however bogus and misleading. Both tests rely on get_freepage_migratetype(page), which however has no guarantees about pages on freelists. Specifically, it doesn't guarantee that the migratetype returned by the function actually matches the migratetype of the freelist that the page is on. Such guarantee is not its purpose and would have negative impact on allocator performance. The first test checks whether the freepage_migratetype equals MIGRATE_ISOLATE, supposedly to catch races between page isolation and allocator activity. These races should be fixed nowadays with 51bb1a4093 ("mm/page_alloc: add freepage on isolate pageblock to correct buddy list") and related patches. As explained above, the check wouldn't be able to catch them reliably anyway. For the same reason false positives can happen, although they are harmless, as the move_freepages() call would just move the page to the same freelist it's already on. So removing the test is not a bug fix, just cleanup. After this patch, we assume that all PageBuddy pages are on the correct freelist and that the races were really fixed. A truly reliable verification in the form of e.g. VM_BUG_ON() would be complicated and is arguably not needed. The second test (page_count(page) == 0 && get_freepage_migratetype(page) == MIGRATE_ISOLATE) is probably supposed (the code comes from a big memory isolation patch from 2007) to catch pages on MIGRATE_ISOLATE pcplists. However, pcplists don't contain MIGRATE_ISOLATE freepages nowadays, those are freed directly to free lists, so the check is obsolete. Remove it as well. 
Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Minchan Kim <minchan@kernel.org> Acked-by: Michal Nazarewicz <mina86@mina86.com> Cc: Laura Abbott <lauraa@codeaurora.org> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Seungho Park <seungho1.park@lge.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-08 | lib/show_mem.c: correct reserved memory calculation | Vishnu Pratap Singh
CMA reserved memory is not part of total reserved memory. Currently, when we print the total reserved memory, the code treats CMA as part of reserved memory and subtracts totalcma_pages from reserved, which is wrong. In cases where total reserved is less than CMA reserved, the result is negative, and because it is printed as unsigned we get a very large value. Below is the show_mem output on an x86 Ubuntu-based system where CMA reserved is 100MB (25600 pages) and total reserved is ~40MB (10316 pages). Reserved memory shows a large value because of this bug.
Before:
[ 127.066430] 898908 pages RAM
[ 127.066432] 671682 pages HighMem/MovableOnly
[ 127.066434] 4294952012 pages reserved
[ 127.066436] 25600 pages cma reserved
After:
[ 44.663129] 898908 pages RAM
[ 44.663130] 671682 pages HighMem/MovableOnly
[ 44.663130] 10316 pages reserved
[ 44.663131] 25600 pages cma reserved
Signed-off-by: Vishnu Pratap Singh <vishnu.ps@samsung.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Laurent Pinchart <laurent.pinchart+renesas@ideasonboard.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Danesh Petigara <dpetigara@broadcom.com> Cc: Laura Abbott <lauraa@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
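The large "pages reserved" value in the Before output is consistent with a 32-bit unsigned wraparound. The sketch below reproduces the arithmetic, assuming the counter is 32 bits wide as on the 32-bit x86 system in the report; `reserved_old()` and `reserved_fixed()` are hypothetical names, not functions from lib/show_mem.c.

```c
#include <assert.h>
#include <stdint.h>

/* Old behaviour as described in the commit: CMA pages subtracted from reserved. */
static uint32_t reserved_old(uint32_t reserved, uint32_t totalcma_pages)
{
    /* Unsigned subtraction wraps modulo 2^32 when totalcma_pages > reserved. */
    return reserved - totalcma_pages;
}

/* Behaviour after the fix: reserved pages are reported unchanged. */
static uint32_t reserved_fixed(uint32_t reserved, uint32_t totalcma_pages)
{
    (void)totalcma_pages;   /* CMA pages are reported on their own line */
    return reserved;
}
```

With the numbers from the report, 10316 - 25600 = -15284, and 2^32 - 15284 = 4294952012, which is the value printed before the fix.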
2015-09-08 | memcg: move memcg_proto_active from sock.h | Michal Hocko
The only user is sock_update_memcg which is living in memcontrol.c so it doesn't make much sense to pollute sock.h by this inline helper. Move it to memcontrol.c and open code it into its only caller. Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Vladimir Davydov <vdavydov@parallels.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-08 | memcg, tcp_kmem: check for cg_proto in sock_update_memcg | Michal Hocko
sk_prot->proto_cgroup is allowed to return NULL but sock_update_memcg doesn't check for NULL. The function relies on the mem_cgroup_is_root check because we shouldn't get NULL otherwise because mem_cgroup_from_task will always return !NULL. All other callers are checking for NULL and we can safely replace mem_cgroup_is_root() check by cg_proto != NULL which will be more straightforward (proto_cgroup returns NULL for the root memcg already). Signed-off-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-08 | memcg: restructure mem_cgroup_can_attach() | Tejun Heo
Restructure it to lower nesting level and help the planned threadgroup leader iteration changes. This is pure reorganization. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-08 | memcg: get rid of extern for functions in memcontrol.h | Michal Hocko
Most of the exported functions in this header are not marked extern so change the rest to follow the same style. Signed-off-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-08 | memcg: get rid of mem_cgroup_root_css for !CONFIG_MEMCG | Michal Hocko
The only user is cgwb_bdi_init and that one depends on CONFIG_CGROUP_WRITEBACK which in turn depends on CONFIG_MEMCG so it doesn't make much sense to define an empty stub for !CONFIG_MEMCG. Moreover ERR_PTR(-EINVAL) is ugly and would lead to runtime crashes if used in unguarded code paths. Better fail during compilation. Signed-off-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-08 | memcg: export struct mem_cgroup | Michal Hocko
The mem_cgroup structure is currently defined in mm/memcontrol.c, which means that code outside of this file has to use an external API even for trivial accesses. This patch exports struct mem_cgroup with its dependencies and makes some of the exported functions inlines. This even helps to reduce the code size a bit (make defconfig + CONFIG_MEMCG=y):
    text    data     bss      dec    hex filename
12355346 1823792 1089536 15268674 e8fb42 vmlinux.before
12354970 1823792 1089536 15268298 e8f9ca vmlinux.after
This is not much (370B) but better than nothing. We also save a function call in some hot paths like callers of mem_cgroup_count_vm_event which is used for accounting. The patch doesn't introduce any functional changes. [vdavydov@parallels.com: inline memcg_kmem_is_active] [vdavydov@parallels.com: do not expose type outside of CONFIG_MEMCG] [akpm@linux-foundation.org: memcontrol.h needs eventfd.h for eventfd_ctx] [akpm@linux-foundation.org: export mem_cgroup_from_task() to modules] Signed-off-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: Vladimir Davydov <vdavydov@parallels.com> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-08 | sparc32: do not include swap.h from pgtable_32.h | Michal Hocko
"memcg: export struct mem_cgroup" will add includes into linux/memcontrol.h which lead to further header dependency issues as reported by Guenter Roeck: In file included from include/linux/highmem.h:7:0, from include/linux/bio.h:23, from include/linux/writeback.h:192, from include/linux/memcontrol.h:30, from include/linux/swap.h:8, from ./arch/sparc/include/asm/pgtable_32.h:17, from ./arch/sparc/include/asm/pgtable.h:6, from arch/sparc/kernel/traps_32.c:23: include/linux/mm.h: In function 'is_vmalloc_addr': include/linux/mm.h:371:17: error: 'VMALLOC_START' undeclared (first use in this function) include/linux/mm.h:371:17: note: each undeclared identifier is reported only once for each function it appears in include/linux/mm.h:371:41: error: 'VMALLOC_END' undeclared (first use in this function) include/linux/mm.h: In function 'maybe_mkwrite': include/linux/mm.h:556:3: error: implicit declaration of function 'pte_mkwrite' The issue is that pgtable_32.h depends on swap.h to get swap_entry_t but that goes all the way down to linux/mm.h which wants to have VMALLOC_* which is defined later in pgtable_32.h, though. swap_entry_t is defined in include/mm_types.h so it should be sufficient to include this header without more dependencies. Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Guenter Roeck <linux@roeck-us.net> Tested-by: Guenter Roeck <linux@roeck-us.net> Cc: David Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-08 | mm/dmapool: allow NULL `pool' pointer in dma_pool_destroy() | Sergey Senozhatsky
dma_pool_destroy() does not tolerate a NULL dma_pool pointer argument and performs a NULL-pointer dereference. This requires additional attention and effort from developers/reviewers and forces all dma_pool_destroy() callers to do a NULL check if (pool) dma_pool_destroy(pool); Or, otherwise, be invalid dma_pool_destroy() users. Tweak dma_pool_destroy() and NULL-check the pointer there. Proposed by Andrew Morton. Link: https://lkml.org/lkml/2015/6/8/583 Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Julia Lawall <julia.lawall@lip6.fr> Cc: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-08 | mm/mempool: allow NULL `pool' pointer in mempool_destroy() | Sergey Senozhatsky
mempool_destroy() does not tolerate a NULL mempool_t pointer argument and performs a NULL-pointer dereference. This requires additional attention and effort from developers/reviewers and forces all mempool_destroy() callers to do a NULL check if (pool) mempool_destroy(pool); Or, otherwise, be invalid mempool_destroy() users. Tweak mempool_destroy() and NULL-check the pointer there. Proposed by Andrew Morton. Link: https://lkml.org/lkml/2015/6/8/583 Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Julia Lawall <julia.lawall@lip6.fr> Cc: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-08 | mm/slab_common: allow NULL cache pointer in kmem_cache_destroy() | Sergey Senozhatsky
kmem_cache_destroy() does not tolerate a NULL kmem_cache pointer argument and performs a NULL-pointer dereference. This requires additional attention and effort from developers/reviewers and forces all kmem_cache_destroy() callers (200+ as of 4.1) to do a NULL check if (cache) kmem_cache_destroy(cache); Or, otherwise, be invalid kmem_cache_destroy() users. Tweak kmem_cache_destroy() and NULL-check the pointer there. Proposed by Andrew Morton. Link: https://lkml.org/lkml/2015/6/8/583 Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Julia Lawall <julia.lawall@lip6.fr> Cc: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
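The three destroy-function patches above apply the same pattern: move the NULL check from every caller into the callee, as free(NULL) already does. A userspace sketch of that pattern follows; `struct toy_pool`, `toy_pool_create()` and `toy_pool_destroy()` are hypothetical stand-ins, not the kernel's dma_pool, mempool or kmem_cache code, and the int return value exists only so the behaviour can be observed (the kernel functions return void).

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for a pool object. */
struct toy_pool {
    void *storage;
};

static struct toy_pool *toy_pool_create(size_t size)
{
    struct toy_pool *pool = malloc(sizeof(*pool));

    if (!pool)
        return NULL;
    pool->storage = malloc(size);
    if (!pool->storage) {
        free(pool);
        return NULL;
    }
    return pool;
}

/*
 * Destroying a NULL pool is a no-op, so error paths can call this
 * unconditionally. Returns 1 if something was freed, 0 for NULL.
 */
static int toy_pool_destroy(struct toy_pool *pool)
{
    if (!pool)          /* the check that callers no longer have to write */
        return 0;
    free(pool->storage);
    free(pool);
    return 1;
}
```

With the check in the callee, a cleanup path can call the destroy function for every pool it might have created, without tracking which creations succeeded.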
2015-09-08 | mm, oom: remove unnecessary variable | David Rientjes
The "killed" variable in out_of_memory() can be removed since the call to oom_kill_process() where we should block to allow the process time to exit is obvious. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-08 | mm, oom: add description of struct oom_control | David Rientjes
Describe the purpose of struct oom_control and what each member does. Also make gfp_mask and order const since they are never manipulated or passed to functions that discard the qualifier. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-08 | mm, oom: do not panic for oom kills triggered from sysrq | David Rientjes
Sysrq+f is used to kill a process either for debug or when the VM is otherwise unresponsive. It is not intended to trigger a panic when no process may be killed. Avoid panicking the system for sysrq+f when no processes are killed. Signed-off-by: David Rientjes <rientjes@google.com> Suggested-by: Michal Hocko <mhocko@suse.cz> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-08 | mm, oom: pass an oom order of -1 when triggered by sysrq | David Rientjes
The force_kill member of struct oom_control isn't needed if an order of -1 is used instead. This is the same as order == -1 in struct compact_control which requires full memory compaction. This patch introduces no functional change. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-08 | mm, oom: organize oom context into struct | David Rientjes
There are essential elements to an oom context that are passed around to multiple functions. Organize these elements into a new struct, struct oom_control, that specifies the context for an oom condition. This patch introduces no functional change. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-08 | mm: make set_recommended_min_free_kbytes() return void | Nicholas Krause
This makes set_recommended_min_free_kbytes() have a return type of void as it cannot fail. Signed-off-by: Nicholas Krause <xerofoify@gmail.com> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-08 | mm: improve __GFP_NORETRY comment based on implementation | David Rientjes
Explicitly state that __GFP_NORETRY will attempt direct reclaim and memory compaction before returning NULL and that the oom killer is not called in the current implementation of the page allocator. [akpm@linux-foundation.org: s/has/have/] Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-08 | fs: do not prefault sys_write() user buffer pages | Dave Hansen
=== Short summary ===
iov_iter_fault_in_readable() works around a really rare case and we can avoid the deadlock it addresses in another way: disable page faults and work around copy failures by faulting after the copy in a slow path instead of before in a hot one. I have a little microbenchmark that does repeated, small writes to tmpfs. This patch speeds that micro up by 6.2%.
=== Long version ===
When doing a sys_write() we have a source buffer in userspace and then a target file page. If both of those are the same physical page, there is a potential deadlock that we avoid. It would happen something like this:
1. We start the write to the file
2. Allocate page cache page and set it !Uptodate
3. Touch the userspace buffer to copy in the user data
4. Page fault (since source of the write not yet mapped)
5. Page fault code tries to lock the page and deadlocks (more details on this below)
To avoid this, we prefault the page to guarantee that this fault does not occur. But, this prefault comes at a cost. It is one of the most expensive things that we do in a hot write() path (especially if we compare it to the read path). It is working around a pretty rare case. To fix this, it's pretty simple. We move the "prefault" code to run after we attempt the copy. We explicitly disable page faults _during_ the copy, detect the copy failure, then execute the "prefault" outside of where the page lock needs to be held. iov_iter_copy_from_user_atomic() actually already has an implicit pagefault_disable() inside of it (at least on x86), but we add an explicit one. I don't think we can depend on every kmap_atomic() implementation to pagefault_disable() for eternity.
=================================================== The stack trace when this happens looks like this: wait_on_page_bit_killable+0xc0/0xd0 __lock_page_or_retry+0x84/0xa0 filemap_fault+0x1ed/0x3d0 __do_fault+0x41/0xc0 handle_mm_fault+0x9bb/0x1210 __do_page_fault+0x17f/0x3d0 do_page_fault+0xc/0x10 page_fault+0x22/0x30 generic_perform_write+0xca/0x1a0 __generic_file_write_iter+0x190/0x1f0 ext4_file_write_iter+0xe9/0x460 __vfs_write+0xaa/0xe0 vfs_write+0xa6/0x1a0 SyS_write+0x46/0xa0 entry_SYSCALL_64_fastpath+0x12/0x6a 0xffffffffffffffff (Note, this does *NOT* happen in practice today because the kmap_atomic() does a pagefault_disable(). The trace above was obtained by taking out the pagefault_disable().) You can trigger the deadlock with this little code snippet: fd = open("foo", O_RDWR); fdmap = mmap(NULL, len, PROT_WRITE|PROT_READ, MAP_SHARED, fd, 0); write(fd, &fdmap[0], 1); Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Michal Hocko <mhocko@suse.cz> Cc: Jens Axboe <axboe@fb.com> Cc: Tejun Heo <tj@kernel.org> Cc: NeilBrown <neilb@suse.de> Cc: Matthew Wilcox <matthew.r.wilcox@intel.com> Cc: Paul Cassella <cassella@cray.com> Cc: Greg Thelen <gthelen@google.com> Cc: Andi Kleen <ak@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
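The three-line snippet in the commit message can be expanded into a self-contained function. This is a sketch under assumptions: the commit opens an existing file "foo", whereas the version below adds O_CREAT and ftruncate() so it can run on a fresh path, and `write_from_own_mapping()` is a hypothetical name. On a kernel that handles the case correctly the write simply succeeds and returns 1.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Write one byte to a file, using the file's own shared mapping as the
 * source buffer, so the source page and the destination page cache page
 * are the same page. Returns the result of write(), or -1 on setup failure.
 */
static ssize_t write_from_own_mapping(const char *path, size_t len)
{
    char *fdmap;
    ssize_t n;
    int fd;

    assert(path != NULL && len > 0);
    fd = open(path, O_RDWR | O_CREAT, 0600);    /* O_CREAT is an addition */
    if (fd < 0)
        return -1;
    if (ftruncate(fd, (off_t)len) < 0) {        /* give the file a size to map */
        close(fd);
        return -1;
    }
    fdmap = mmap(NULL, len, PROT_WRITE | PROT_READ, MAP_SHARED, fd, 0);
    if (fdmap == MAP_FAILED) {
        close(fd);
        return -1;
    }
    /* The mapping has not been touched yet, so this copy must fault. */
    n = write(fd, &fdmap[0], 1);
    munmap(fdmap, len);
    close(fd);
    return n;
}
```

The key property is that the mapping is never read or written before the write() call, so the user buffer is still unmapped when the kernel starts copying from it.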
2015-09-08 | mm: /proc/pid/smaps:: show proportional swap share of the mapping | Minchan Kim
We want to know the per-process working set size for smart memory management in userland, and we use swap (e.g. zram) heavily to maximize memory efficiency, so the working set includes swap as well as RSS. On such a system, if there are lots of shared anonymous pages (e.g. on Android), it is really hard to figure out exactly how much memory each process consumes (i.e. rss + swap). This patch introduces a SwapPss field in /proc/<pid>/smaps so we can get a more exact working set size per process. Bongkyu tested it. Result is below.
1. 50M used swap
SwapTotal: 461976 kB
SwapFree: 411192 kB
$ adb shell cat /proc/*/smaps | grep "SwapPss:" | awk '{sum += $2} END {print sum}';
48236
$ adb shell cat /proc/*/smaps | grep "Swap:" | awk '{sum += $2} END {print sum}';
141184
2. 240M used swap
SwapTotal: 461976 kB
SwapFree: 216808 kB
$ adb shell cat /proc/*/smaps | grep "SwapPss:" | awk '{sum += $2} END {print sum}';
230315
$ adb shell cat /proc/*/smaps | grep "Swap:" | awk '{sum += $2} END {print sum}';
1387744
[akpm@linux-foundation.org: simplify kunmap_atomic() call] Signed-off-by: Minchan Kim <minchan@kernel.org> Reported-by: Bongkyu Kim <bongkyu.kim@lge.com> Tested-by: Bongkyu Kim <bongkyu.kim@lge.com> Cc: Hugh Dickins <hughd@google.com> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Jerome Marchand <jmarchan@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
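The grep and awk pipelines in the test results sum one smaps field over all mappings. A C equivalent might look like the sketch below, assuming the usual "Name:   value kB" line layout of smaps; `sum_smaps_field()` is a hypothetical helper, and it takes an already opened stream so it can be pointed at /proc/<pid>/smaps or at a saved copy.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Sum the numeric values of every line in an smaps-style stream whose
 * field name is exactly `key` (including the colon), e.g. "SwapPss:".
 * The result is in the unit used by the file, which is kB for smaps.
 */
static unsigned long long sum_smaps_field(FILE *fp, const char *key)
{
    char line[256];
    unsigned long long total = 0;
    size_t keylen = strlen(key);

    assert(fp != NULL);
    while (fgets(line, sizeof(line), fp)) {
        unsigned long long kb;

        /* "Swap:" does not match "SwapPss:" because the colon is part of the key. */
        if (strncmp(line, key, keylen) != 0)
            continue;
        if (sscanf(line + keylen, "%llu", &kb) == 1)
            total += kb;
    }
    return total;
}
```

Summing "Swap:" counts a shared swap entry once per process that maps it, while summing "SwapPss:" divides it among the sharers, which is why the SwapPss totals in the results stay close to the swap actually in use.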
2015-09-08 | memtest: remove unused header files | Vladimir Murzin
memtest does not require these headers to be included. Signed-off-by: Vladimir Murzin <vladimir.murzin@arm.com> Cc: Leon Romanovsky <leon@leon.nu> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-08 | memtest: cleanup log messages | Vladimir Murzin
- prefer pr_info(... to printk(KERN_INFO ... - use %pa for phys_addr_t - use cpu_to_be64 while printing pattern in reserve_bad_mem() Signed-off-by: Vladimir Murzin <vladimir.murzin@arm.com> Cc: Leon Romanovsky <leon@leon.nu> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-08 | memtest: use kstrtouint instead of simple_strtoul | Vladimir Murzin
Since simple_strtoul is obsolete and memtest_pattern is type of int, use kstrtouint instead. Signed-off-by: Vladimir Murzin <vladimir.murzin@arm.com> Cc: Leon Romanovsky <leon@leon.nu> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
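The practical difference between the two parsers is strictness: kstrtouint() rejects trailing garbage and out-of-range values, where simple_strtoul() silently stops at the first non-digit. The userspace sketch below approximates that strict behaviour with strtoull(); `parse_uint_strict()` is a hypothetical name and this is not the kernel implementation.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Parse the whole string as an unsigned int. A single trailing newline is
 * accepted. Returns 0 on success, -EINVAL for malformed input, -ERANGE if
 * the value does not fit in an unsigned int.
 */
static int parse_uint_strict(const char *s, unsigned int base, unsigned int *res)
{
    unsigned long long val;
    char *end;

    assert(s != NULL && res != NULL);
    /* strtoull() would accept leading whitespace and a minus sign. */
    if (*s == '-' || isspace((unsigned char)*s))
        return -EINVAL;
    errno = 0;
    val = strtoull(s, &end, (int)base);
    if (end == s)
        return -EINVAL;         /* no digits at all */
    if (*end == '\n')
        end++;
    if (*end != '\0')
        return -EINVAL;         /* trailing garbage */
    if (errno == ERANGE || val > UINT_MAX)
        return -ERANGE;
    *res = (unsigned int)val;
    return 0;
}
```

An input such as "17abc" is rejected here, whereas a lenient parser would return 17 and hide the typo.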
2015-09-08 | pagemap: update documentation | Konstantin Khlebnikov
Notes about recent changes. [akpm@linux-foundation.org: various tweaks] Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Mark Williamson <mwilliamson@undo-software.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-08 | pagemap: add mmap-exclusive bit for marking pages mapped only here | Konstantin Khlebnikov
This patch sets bit 56 in pagemap if this page is mapped only once. It allows to detect exclusively used pages without exposing PFN:
present file exclusive state
   0     0      0      non-present
   1     1      0      file page mapped somewhere else
   1     1      1      file page mapped only here
   1     0      0      anon non-CoWed page (shared with parent/child)
   1     0      1      anon CoWed page (or never forked)
CoWed pages in (MAP_FILE | MAP_PRIVATE) areas are anon in this context. MMap-exclusive bit doesn't reflect potential page-sharing via swapcache: page could be mapped once but has several swap-ptes which point to it. Application could detect that by swap bit in pagemap entry and touch that pte via /proc/pid/mem to get real information. See http://lkml.kernel.org/r/CAEVpBa+_RyACkhODZrRvQLs80iy0sqpdrd0AaP_-tgnX3Y9yNQ@mail.gmail.com Requested by Mark Williamson. [akpm@linux-foundation.org: fix spello] Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Reviewed-by: Mark Williamson <mwilliamson@undo-software.com> Tested-by: Mark Williamson <mwilliamson@undo-software.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
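The state table in this commit can be turned into a small decoder for a 64-bit pagemap entry. Bit 56 comes from the commit itself; the positions of the present bit (63) and the file bit (61) are taken from the kernel's pagemap documentation rather than from this message. The macro, enum and function names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

#define PM_PRESENT        (1ULL << 63)  /* page present */
#define PM_FILE           (1ULL << 61)  /* file page or shared anon */
#define PM_MMAP_EXCLUSIVE (1ULL << 56)  /* page mapped exactly once */

enum page_sharing {
    PAGE_NOT_PRESENT,
    FILE_MAPPED_ELSEWHERE,
    FILE_MAPPED_ONLY_HERE,
    ANON_SHARED,            /* non-CoWed, shared with parent/child */
    ANON_EXCLUSIVE,         /* CoWed, or never forked */
};

/* Classify one pagemap entry according to the present/file/exclusive table. */
static enum page_sharing classify_pagemap_entry(uint64_t entry)
{
    int file = (entry & PM_FILE) != 0;
    int exclusive = (entry & PM_MMAP_EXCLUSIVE) != 0;

    if (!(entry & PM_PRESENT))
        return PAGE_NOT_PRESENT;
    if (file)
        return exclusive ? FILE_MAPPED_ONLY_HERE : FILE_MAPPED_ELSEWHERE;
    return exclusive ? ANON_EXCLUSIVE : ANON_SHARED;
}
```

Because only flag bits are inspected, the decoder also works for an unprivileged reader whose PFN field has been hidden.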
2015-09-08pagemap: hide physical addresses from non-privileged usersKonstantin Khlebnikov
This patch makes pagemap readable for normal users and hides physical addresses from them. For some use-cases PFN isn't required at all. See http://lkml.kernel.org/r/1425935472-17949-1-git-send-email-kirill@shutemov.name Fixes: ab676b7d6fbf ("pagemap: do not leak physical addresses to non-privileged userspace") Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reviewed-by: Mark Williamson <mwilliamson@undo-software.com> Tested-by: Mark Williamson <mwilliamson@undo-software.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-08pagemap: rework hugetlb and thp reportKonstantin Khlebnikov
This patch moves pmd dissection out of the reporting loop: huge pages are reported as a bunch of normal pages with contiguous PFNs. Add the missing "FILE" bit in hugetlb vmas. Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reviewed-by: Mark Williamson <mwilliamson@undo-software.com> Tested-by: Mark Williamson <mwilliamson@undo-software.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-08pagemap: switch to the new format and do some cleanupKonstantin Khlebnikov
This patch removes the page-shift bits (scheduled for removal since 3.11) and completes the migration to the new bit layout. It also cleans up a messy macro. Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Mark Williamson <mwilliamson@undo-software.com> Tested-by: Mark Williamson <mwilliamson@undo-software.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-08pagemap: check permissions and capabilities at open timeKonstantin Khlebnikov
This patchset makes pagemap usable again in a safe way (after the row hammer bug it was made CAP_SYS_ADMIN-only). It restores access for non-privileged users but hides PFNs from them. It also adds the 'mmap-exclusive' bit, which is set if the page is mapped only here: this helps in estimating the working set without exposing PFNs and allows distinguishing CoWed from non-CoWed private anonymous pages. The second patch removes the page-shift bits and completes the migration to the new pagemap format: the soft-dirty and mmap-exclusive flags are available only in the new format. This patch (of 5): This patch moves permission checks from pagemap_read() into pagemap_open(). A pointer to the mm is saved in file->private_data. This reference pins only the mm_struct itself. /proc/*/mem, maps, smaps already work in the same way. See http://lkml.kernel.org/r/CA+55aFyKpWrt_Ajzh1rzp_GcwZ4=6Y=kOv8hBz172CFJp6L8Tg@mail.gmail.com Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reviewed-by: Mark Williamson <mwilliamson@undo-software.com> Tested-by: Mark Williamson <mwilliamson@undo-software.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-08mm: remove put_page_unless_one()Vineet Gupta
It has no callers. Signed-off-by: Vineet Gupta <vgupta@synopsys.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-08mm/memblock.c: WARN_ON when flags differs from overlap regionWei Yang
Each memblock_region has flags to indicate the type of this range. For the overlap case, memblock_add_range() inserts the lower part and leaves the upper part as indicated in the overlapped region. If the flags of the new range differ from those of the overlapped region, the information recorded is not correct. This patch adds a WARN_ON when the flags of the new range differ from those of the overlapped region. Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-08mm/page_alloc.c: remove unused variable in free_area_init_core()Wei Yang
Commit febd5949e134 ("mm/memory hotplug: init the zone's size when calculating node totalpages") refines the function free_area_init_core(). After doing so, these two parameters are not used anymore. This patch removes these two parameters. Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com> Cc: Gu Zheng <guz.fnst@cn.fujitsu.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-08mm/page_alloc.c: refine the calculation of highest possible node idWei Yang
nr_node_ids records the highest possible node id, which is calculated by scanning the bitmap node_states[N_POSSIBLE]. The current implementation scans the bitmap from the beginning, which means scanning the whole bitmap. This patch reverses the order by scanning from the end with find_last_bit(). Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com> Cc: Tejun Heo <tj@kernel.org> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
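The idea of scanning from the end can be sketched in userspace as below. The function name last_set_bit is hypothetical and this is not the kernel's find_last_bit() implementation; it only follows the same convention of returning the bitmap size when no bit is set.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/*
 * Return the index of the highest set bit among the first nbits bits of
 * bitmap, or nbits if none is set. Words are visited from the last one
 * down, so the scan stops at the first non-zero word it meets.
 */
static size_t last_set_bit(const unsigned long *bitmap, size_t nbits)
{
    size_t words = (nbits + BITS_PER_WORD - 1) / BITS_PER_WORD;

    assert(bitmap != NULL || nbits == 0);

    while (words-- > 0) {
        unsigned long w = bitmap[words];

        /* Ignore bits at or beyond nbits in a partially used last word. */
        if (words == (nbits - 1) / BITS_PER_WORD && nbits % BITS_PER_WORD)
            w &= (1UL << (nbits % BITS_PER_WORD)) - 1;

        if (w != 0) {
            size_t bit = BITS_PER_WORD - 1;

            while (!(w & (1UL << bit)))
                bit--;
            return words * BITS_PER_WORD + bit;
        }
    }
    return nbits;
}
```

When the highest set bit sits near the end of the bitmap, the backward scan finishes after inspecting one word, whereas a forward scan has to walk every set bit to learn which one is last.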
2015-09-08mm, dax: use i_mmap_unlock_write() in do_cow_fault()Kirill A. Shutemov
__dax_fault() takes i_mmap_lock for write. Let's pair it with a write unlock on the do_cow_fault() side. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Matthew Wilcox <willy@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-08mm: take i_mmap_lock in unmap_mapping_range() for DAXKirill A. Shutemov
DAX is not so special: we need i_mmap_lock to protect mapping->i_mmap. __dax_pmd_fault() uses unmap_mapping_range() to shoot out the zero page from all mappings. We need to drop i_mmap_lock there to avoid a deadlock. Re-acquiring the lock should be fine since we check i_size after that point. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-08dax: use linear_page_index()Matthew Wilcox
I was basically open-coding it (thanks to copying code from do_fault() which probably also needs to be fixed). Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-08dax: ensure that zero pages are removed from other processesMatthew Wilcox
If the first access to a huge page was a store, there would be no existing zero pmd in this process's page tables. There could be a zero pmd in another process's page tables, if it had done a load. We can detect this case by noticing that the buffer_head returned from the filesystem is New, and ensure that other processes mapping this huge page have their page tables flushed. Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Reported-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-08dax: don't use set_huge_zero_page()Kirill A. Shutemov
This is another place where DAX assumed that pgtable_t was a pointer. Open code the important parts of set_huge_zero_page() in DAX and make set_huge_zero_page() static again. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-08thp: fix zap_huge_pmd() for DAXKirill A. Shutemov
The original DAX code assumed that pgtable_t was a pointer, which isn't true on all architectures. Restructure the code to not rely on that assumption. [willy@linux.intel.com: further fixes integrated into this patch] Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-08thp: decrement refcount on huge zero page if it is splitKirill A. Shutemov
The DAX code neglected to put the refcount on the huge zero page. Also we must notify on splits. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-08dax: fix race between simultaneous faultsMatthew Wilcox
If two threads write-fault on the same hole at the same time, the winner of the race will return to userspace and complete their store, only to have the loser overwrite their store with zeroes. Fix this for now by taking the i_mmap_sem for write instead of read, and do so outside the call to get_block(). Now the loser of the race will see the block has already been zeroed, and will not zero it again. This severely limits our scalability. I have ideas for improving it, but those can wait for a later patch. Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-08ext4: start transaction before calling into DAXMatthew Wilcox
Jan Kara pointed out that in the case where we are writing to a hole, we can end up with a lock inversion between the page lock and the journal lock. We can avoid this by starting the transaction in ext4 before calling into DAX. The journal lock nests inside the superblock pagefault lock, so we have to duplicate that code from dax_fault, like XFS does. Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-08ext4: add ext4_get_block_dax()Matthew Wilcox
DAX wants different semantics from any currently-existing ext4 get_block callback. Unlike ext4_get_block_write(), it needs to honour the 'create' flag, and unlike ext4_get_block(), it needs to be able to return unwritten extents. So introduce a new ext4_get_block_dax() which has those semantics. We could also change ext4_get_block_write() to honour the 'create' flag, but that might have consequences on other users that I do not currently understand. Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>