Age | Commit message (Collapse) | Author |
|
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm updates from Dan Williams:
"This adds a user for the new 'bytes-remaining' updates to
memcpy_mcsafe() that you already received through Ingo via the
x86-dax- for-linus pull.
Not included here, but still targeting this cycle, is support for
handling memory media errors (poison) consumed via userspace dax
mappings.
Summary:
- DAX broke a fundamental assumption of truncate of file mapped
pages. The truncate path assumed that it is safe to disconnect a
pinned page from a file and let the filesystem reclaim the physical
block. With DAX the page is equivalent to the filesystem block.
Introduce dax_layout_busy_page() to enable filesystems to wait for
pinned DAX pages to be released. Without this wait a filesystem
could allocate blocks under active device-DMA to a new file.
- DAX arranges for the block layer to be bypassed and uses
dax_direct_access() + copy_to_iter() to satisfy read(2) calls.
However, the memcpy_mcsafe() facility is available through the pmem
block driver. In order to safely handle media errors, via the DAX
block-layer bypass, introduce copy_to_iter_mcsafe().
- Fix cache management policy relative to the ACPI NFIT Platform
Capabilities Structure to properly elide cache flushes when they
are not necessary. The table indicates whether CPU caches are
power-fail protected. Clarify that a deep flush is always performed
on REQ_{FUA,PREFLUSH} requests"
* tag 'libnvdimm-for-4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (21 commits)
dax: Use dax_write_cache* helpers
libnvdimm, pmem: Do not flush power-fail protected CPU caches
libnvdimm, pmem: Unconditionally deep flush on *sync
libnvdimm, pmem: Complete REQ_FLUSH => REQ_PREFLUSH
acpi, nfit: Remove ecc_unit_size
dax: dax_insert_mapping_entry always succeeds
libnvdimm, e820: Register all pmem resources
libnvdimm: Debug probe times
linvdimm, pmem: Preserve read-only setting for pmem devices
x86, nfit_test: Add unit test for memcpy_mcsafe()
pmem: Switch to copy_to_iter_mcsafe()
dax: Report bytes remaining in dax_iomap_actor()
dax: Introduce a ->copy_to_iter dax operation
uio, lib: Fix CONFIG_ARCH_HAS_UACCESS_MCSAFE compilation
xfs, dax: introduce xfs_break_dax_layouts()
xfs: prepare xfs_break_layouts() for another layout type
xfs: prepare xfs_break_layouts() to be called with XFS_MMAPLOCK_EXCL
mm, fs, dax: handle layout changes to pinned dax mappings
mm: fix __gup_device_huge vs unmap
mm: introduce MEMORY_DEVICE_FS_DAX and CONFIG_DEV_PAGEMAP_OPS
...
|
|
|
|
|
|
Pull block fixes from Jens Axboe:
"A few fixes for this merge window, where some of them should go in
sooner rather than later, hence a new pull this week. This pull
request contains:
- Set of NVMe fixes, mostly follow up cleanups/fixes to the queue
changes, but also teardown/removal and misc changes (Christop/Dan/
Johannes/Sagi/Steve).
- Two lightnvm fixes for issues that showed up in this window
(Colin/Wei).
- Failfast/driver flags inheritance for flush requests (Hannes).
- The md device put sanitization and fix (Kent).
- dm bio_set inheritance fix (me).
- nbd discard granularity fix (Josef).
- nbd consistency in command printing (Kevin).
- Loop recursion validation fix (Ted).
- Partition overlap check (Wang)"
[ .. and now my build is warning-free again thanks to the md fix - Linus ]
* tag 'for-linus-20180608' of git://git.kernel.dk/linux-block: (22 commits)
nvme: cleanup double shift issue
nvme-pci: make CMB SQ mod-param read-only
nvme-pci: unquiesce dead controller queues
nvme-pci: remove HMB teardown on reset
nvme-pci: queue creation fixes
nvme-pci: remove unnecessary completion doorbell check
nvme-pci: remove unnecessary nested locking
nvmet: filter newlines from user input
nvme-rdma: correctly check for target keyed sgl support
nvme: don't hold nvmf_transports_rwsem for more than transport lookups
nvmet: return all zeroed buffer when we can't find an active namespace
md: Unify mddev destruction paths
dm: use bioset_init_from_src() to copy bio_set
block: add bioset_init_from_src() helper
block: always set partition number to '0' in blk_partition_remap()
block: pass failfast and driver-specific flags to flush requests
nbd: set discard_alignment to the granularity
nbd: Consistently use request pointer in debug messages.
block: add verifier for cmdline partition
lightnvm: pblk: fix resource leak of invalid_bitmap
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator
Pull regulator updates from Mark Brown:
"Quite a lot of core work this time around, though not 100% successful.
We gained support for runtime mode changes thanks to David Collins and
improved support for write only regulators (ones where we can't read
back the configuration) from Douglas Anderson.
There's been quite a bit of work from Linus Walleij on converting from
specfying GPIOs by numbers to descriptors. Sadly the testing turned
out to be less good than we had hoped and so a lot of this had to be
reverted.
We also have the start of updates to use coupled regulators from
Maciej Purski, unfortunately there are further problems there so the
last couple of patches have been reverted.
We also have new drivers for BD71837 and SY8106A devices, SAW
regulators on Qualcomm SPMI and dropped support for some preproduction
chips that never made it to market from the AB8500 driver"
* tag 'regulator-v4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator: (57 commits)
regulator: gpio: Revert
ARM: pxa, regulator: fix building ezx e680
regulator: Revert coupled regulator support again
regulator: wm8994: Fix shared GPIOs
regulator: max77686: Fix shared GPIOs
regulator: bd71837: BD71837 PMIC regulator driver
regulator: bd71837: Devicetree bindings for BD71837 regulators
regulator: gpio: Get enable GPIO using GPIO descriptor
regulator: fixed: Convert to use GPIO descriptor only
regulator: s2mps11: Fix boot on Odroid XU3
dt-bindings: qcom_spmi: Document SAW support
regulator: qcom_spmi: Add support for SAW
regulator: tps65090: Pass descriptor instead of GPIO number
regulator: s5m8767: Pass descriptor instead of GPIO number
regulator: pfuze100: Delete reference to ena_gpio
regulator: max8952: Pass descriptor instead of GPIO number
regulator: lp8788-ldo: Pass descriptor instead of GPIO number
regulator: lm363x: Pass descriptor instead of GPIO number
regulator: max8973: Pass descriptor instead of GPIO number
regulator: mc13xxx-core: Switch to SPDX identifier
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Catalin Marinas:
"Apart from the core arm64 and perf changes, the Spectre v4 mitigation
touches the arm KVM code and the ACPI PPTT support touches drivers/
(acpi and cacheinfo). I should have the maintainers' acks in place.
Summary:
- Spectre v4 mitigation (Speculative Store Bypass Disable) support
for arm64 using SMC firmware call to set a hardware chicken bit
- ACPI PPTT (Processor Properties Topology Table) parsing support and
enable the feature for arm64
- Report signal frame size to user via auxv (AT_MINSIGSTKSZ). The
primary motivation is Scalable Vector Extensions which requires
more space on the signal frame than the currently defined
MINSIGSTKSZ
- ARM perf patches: allow building arm-cci as module, demote
dev_warn() to dev_dbg() in arm-ccn event_init(), miscellaneous
cleanups
- cmpwait() WFE optimisation to avoid some spurious wakeups
- L1_CACHE_BYTES reverted back to 64 (for performance reasons that
have to do with some network allocations) while keeping
ARCH_DMA_MINALIGN to 128. cache_line_size() returns the actual
hardware Cache Writeback Granule
- Turn LSE atomics on by default in Kconfig
- Kernel fault reporting tidying
- Some #include and miscellaneous cleanups"
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (53 commits)
arm64: Fix syscall restarting around signal suppressed by tracer
arm64: topology: Avoid checking numa mask for scheduler MC selection
ACPI / PPTT: fix build when CONFIG_ACPI_PPTT is not enabled
arm64: cpu_errata: include required headers
arm64: KVM: Move VCPU_WORKAROUND_2_FLAG macros to the top of the file
arm64: signal: Report signal frame size to userspace via auxv
arm64/sve: Thin out initialisation sanity-checks for sve_max_vl
arm64: KVM: Add ARCH_WORKAROUND_2 discovery through ARCH_FEATURES_FUNC_ID
arm64: KVM: Handle guest's ARCH_WORKAROUND_2 requests
arm64: KVM: Add ARCH_WORKAROUND_2 support for guests
arm64: KVM: Add HYP per-cpu accessors
arm64: ssbd: Add prctl interface for per-thread mitigation
arm64: ssbd: Introduce thread flag to control userspace mitigation
arm64: ssbd: Restore mitigation status on CPU resume
arm64: ssbd: Skip apply_ssbd if not using dynamic mitigation
arm64: ssbd: Add global mitigation state accessor
arm64: Add 'ssbd' command-line option
arm64: Add ARCH_WORKAROUND_2 probing
arm64: Add per-cpu infrastructure to call ARCH_WORKAROUND_2
arm64: Call ARCH_WORKAROUND_2 on transitions between EL0 and EL1
...
|
|
Pull dmaengine updates from Vinod Koul:
- updates to sprd, bam_dma, stm drivers
- remove VLAs in dmatest
- move TI drivers to their own subdir
- switch to SPDX tags for ima/mxs dma drivers
- simplify getting .drvdata on bunch of drivers by Wolfram Sang
* tag 'dmaengine-4.18-rc1' of git://git.infradead.org/users/vkoul/slave-dma: (32 commits)
dmaengine: sprd: Add Spreadtrum DMA configuration
dmaengine: sprd: Optimize the sprd_dma_prep_dma_memcpy()
dmaengine: imx-dma: Switch to SPDX identifier
dmaengine: mxs-dma: Switch to SPDX identifier
dmaengine: imx-sdma: Switch to SPDX identifier
dmaengine: usb-dmac: Document R8A7799{0,5} bindings
dmaengine: qcom: bam_dma: fix some doc warnings.
dmaengine: qcom: bam_dma: fix invalid assignment warning
dmaengine: sprd: fix an NULL vs IS_ERR() bug
dmaengine: sprd: Use devm_ioremap_resource() to map memory
dmaengine: sprd: Fix potential NULL dereference in sprd_dma_probe()
dmaengine: pl330: flush before wait, and add dev burst support.
dmaengine: axi-dmac: Request IRQ with IRQF_SHARED
dmaengine: stm32-mdma: fix spelling mistake: "avalaible" -> "available"
dmaengine: rcar-dmac: Document R-Car D3 bindings
dmaengine: sprd: Move DMA request mode and interrupt type into head file
dmaengine: sprd: Define the DMA data width type
dmaengine: sprd: Define the DMA transfer step type
dmaengine: ti: New directory for Texas Instruments DMA drivers
dmaengine: shdmac: Change platform check to CONFIG_ARCH_RENESAS
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
Pull IOMMU updates from Joerg Roedel:
"Nothing big this time. In particular:
- Debugging code for Tegra-GART
- Improvement in Intel VT-d fault printing to prevent soft-lockups
when on fault storms
- Improvements in AMD IOMMU event reporting
- NUMA aware allocation in io-pgtable code for ARM
- Various other small fixes and cleanups all over the place"
* tag 'iommu-updates-v4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
iommu/io-pgtable-arm: Make allocations NUMA-aware
iommu/amd: Prevent possible null pointer dereference and infinite loop
iommu/amd: Fix grammar of comments
iommu: Clean up the comments for iommu_group_alloc
iommu/vt-d: Remove unnecessary parentheses
iommu/vt-d: Clean up pasid quirk for pre-production devices
iommu/vt-d: Clean up unused variable in find_or_alloc_domain
iommu/vt-d: Fix iotlb psi missing for mappings
iommu/vt-d: Introduce __mapping_notify_one()
iommu: Remove extra NULL check when call strtobool()
iommu/amd: Update logging information for new event type
iommu/amd: Update the PASID information printed to the system log
iommu/tegra: gart: Fix gart_iommu_unmap()
iommu/tegra: gart: Add debugging facility
iommu/io-pgtable-arm: Use for_each_set_bit to simplify code
iommu/qcom: Simplify getting .drvdata
iommu: Remove depends on HAS_DMA in case of platform dependency
iommu/vt-d: Ratelimit each dmar fault printing
|
|
Pull MTD updates from Boris Brezillon:
"Core changes:
- Add a sysfs attribute to expose available OOB size
Driver changes:
- Remove HAS_DMA dependency on various drivers
- Use dev_get_drvdata() instead of platform_get_drvdata() in docg3
- Replace msleep by usleep_range() in the dataflash driver
- Avoid VLA usage in nftl layers
- Remove useless .owner assignment in pismo
- Fix various issues in the CFI driver
- Improve TRX partition handling expose a DT compat for this part
parser
- Clarify OFFSET_CONTINUOUS meaning
NAND core changes:
- Add Miquel as a NAND maintainer
- Add access mode to the nand_page_io_req struct
- Fix kernel-doc in rawnand.h
- Support bit-wise majority to recover from corrupted ONFI parameter
pages
- Stop checking FAIL bit after a SET_FEATURES, as documented in the
ONFI spec
Raw NAND Driver changes:
- Fix and cleanup the error path of many NAND controller drivers
- GPMI:
+ Cleanup/simplification of a few aspects in the driver
+ Take ECC setup specified in the DT into account
- sunxi: remove support for GPIO-based R/B polling
- MTK:
+ Use of_device_get_match_data() instead of of_match_device()
+ Add an entry in MAINTAINERS for this driver
+ Fix nand-ecc-step-size and nand-ecc-strength description in the
DT bindings doc
- fsl_ifc: fix ->cmdfunc() to read more than one ONFI parameter page
OneNAND driver changes:
- samsung: use dev_get_drvdata() instead of platform_get_drvdata()
SPI NOR core changes:
- Add support for a bunch of SPI NOR chips
- Clear EAR reg when switching to 3-byte addressing mode on Winbond
chips
SPI NOR controller driver changes:
- cadence: Add DMA support for direct mode reads
- hisi: Prefix a few functions with hisi_
- intel:
+ Mark the driver as "dangerous" in Kconfig
+ Fix atomic sequence handling
+ Pass a 40us delay (instead of 0us) to readl_poll_timeout()
- fsl:
+ fix a typo in a function name
+ add support for IP variants embedded in the ls2080a and ls1080a
SoCs
- stm32: request exclusive control of the reset line"
* tag 'mtd/for-4.18' of git://git.infradead.org/linux-mtd: (66 commits)
mtd: nand: Pass mode information to nand_page_io_req
mtd: cfi_cmdset_0002: Change erase one block to enable XIP once
mtd: cfi_cmdset_0002: Change erase functions to check chip good only
mtd: cfi_cmdset_0002: Change erase functions to retry for error
mtd: cfi_cmdset_0002: Change definition naming to retry write operation
mtd: cfi_cmdset_0002: Change write buffer to check correct value
mtd: cmdlinepart: Update comment for introduction of OFFSET_CONTINUOUS
mtd: bcm47xxpart: add of_match_table with a new DT binding
dt-bindings: mtd: document Broadcom's BCM47xx partitions
mtd: spi-nor: Add support for EN25QH32
mtd: spi-nor: Add support for is25wp series chips
mtd: spi-nor: Add Winbond w25q32jv support
mtd: spi-nor: fsl-quadspi: add support for ls2080a/ls1080a
mtd: spi-nor: stm32-quadspi: explicitly request exclusive reset control
mtd: spi-nor: intel: provide a range for poll_timout
mtd: spi-nor: fsl-quadspi: fix api naming typo _init_ahb_read
mtd: spi-nor: intel-spi: Explicitly mark the driver as dangerous in Kconfig
mtd: spi-nor: intel-spi: Fix atomic sequence handling
mtd: rawnand: Do not check FAIL bit when executing a SET_FEATURES op
mtd: rawnand: use bit-wise majority to recover the ONFI param page
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio
Pull GPIO updates from Linus Walleij:
"This is the bulk of GPIO changes for the v4.18 development cycle.
Core changes:
- We have killed off VLA from the core library and all drivers.
The background should be clear for everyone at this point:
https://lwn.net/Articles/749064/
Also I just don't like VLA's, kernel developers hate it when
compilers do things behind their back. It's as simple as that.
I'm sorry that they even slipped in to begin with. Kudos to Laura
Abbott for exorcising them.
- Support GPIO hogs in machines/board files.
New drivers and chip support:
- R-Car r8a77470 (RZ/G1C)
- R-Car r8a77965 (M3-N)
- R-Car r8a77990 (E3)
- PCA953x driver improvements to accomodate more variants.
Improvements and new features:
- Support one interrupt per line on port A in the DesignWare dwapb
driver.
Misc:
- Random cleanups, right header files in the drivers, some size
optimizations etc"
* tag 'gpio-v4.18-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio: (73 commits)
gpio: davinci: fix build warning when !CONFIG_OF
gpio: dwapb: Fix rework support for 1 interrupt per port A GPIO
gpio: pxa: Include the right header
gpio: pl061: Include the right header
gpio: pch: Include the right header
gpio: pcf857x: Include the right header
gpio: pca953x: Include the right header
gpio: palmas: Include the right header
gpio: omap: Include the right header
gpio: octeon: Include the right header
gpio: mxs: Switch to SPDX identifier
gpio: Remove VLA from stmpe driver
gpio: mxc: Switch to SPDX identifier
gpio: mxc: add clock operation
gpio: Remove VLA from gpiolib
gpio: aspeed: Use a cache of output data registers
gpio: aspeed: Set output latch before changing direction
gpio: pca953x: fix address calculation for pcal6524
gpio: pca953x: define masks for addressing common and extended registers
gpio: pca953x: set the PCA_PCAL flag also when matching by DT
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid
Pull HID updates from Jiri Kosina:
- Valve Steam Controller support from Rodrigo Rivas Costa
- Redragon Asura support from Robert Munteanu
- improvement of duplicate usage handling in generic hid-input from
Benjamin Tissoires
- Win 8.1 precisioun touchpad spec implementation from Benjamin
Tissoires
- Support for "In Range" flag for Wacom Intuos/Bamboo devices from
Jason Gerecke
- other various assorted smaller fixes and improvements
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid: (27 commits)
HID: rmi: use HID_QUIRK_NO_INPUT_SYNC
HID: multitouch: fix calculation of last slot field in multi-touch reports
HID: quirks: remove Delcom Visual Signal Indicator from hid_have_special_driver[]
HID: steam: select CONFIG_POWER_SUPPLY
HID: i2c-hid: remove i2c_hid_open_mut
HID: wacom: Support "in range" for Intuos/Bamboo tablets where possible
HID: core: fix hid_hw_open() comment
HID: hid-plantronics: Re-resend Update to map button for PTT products
HID: multitouch: fix types returned from mt_need_to_apply_feature()
HID: i2c-hid: check if device is there before really probing
HID: steam: add missing fields in client initialization
HID: steam: add battery device.
HID: add driver for Valve Steam Controller
HID: alps: Fix some style in 't4_read_write_register()'
HID: alps: Check errors returned by 't4_read_write_register()'
HID: alps: Save a memory allocation in 't4_read_write_register()' when writing data
HID: alps: Report an error if we receive invalid data in 't4_read_write_register()'
HID: multitouch: implement precision touchpad latency and switches
HID: multitouch: simplify the settings of the various features
HID: multitouch: make use of HID_QUIRK_INPUT_PER_APP
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull aio iopriority support from Al Viro:
"The rest of aio stuff for this cycle - Adam's aio ioprio series"
* 'work.aio' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fs: aio ioprio use ioprio_check_cap ret val
fs: aio ioprio add explicit block layer dependence
fs: iomap dio set bio prio from kiocb prio
fs: blkdev set bio prio from kiocb prio
fs: Add aio iopriority support
fs: Convert kiocb rw_hint from enum to u16
block: add ioprio_check_cap function
|
|
Add a helper that allows a caller to initialize a new bio_set,
using the settings from an existing bio_set.
Reported-by: Venkat R.B <vrbagal1@linux.vnet.ibm.com>
Tested-by: Venkat R.B <vrbagal1@linux.vnet.ibm.com>
Tested-by: Li Wang <liwang@redhat.com>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
- improvement of duplicate usage handling in hid-input from Benjamin Tissoires
- Win 8.1 precisioun touchpad spec implementation from Benjamin Tissoires
|
|
Merge updates from Andrew Morton:
- a few misc things
- ocfs2 updates
- v9fs updates
- MM
- procfs updates
- lib/ updates
- autofs updates
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (118 commits)
autofs: small cleanup in autofs_getpath()
autofs: clean up includes
autofs: comment on selinux changes needed for module autoload
autofs: update MAINTAINERS entry for autofs
autofs: use autofs instead of autofs4 in documentation
autofs: rename autofs documentation files
autofs: create autofs Kconfig and Makefile
autofs: delete fs/autofs4 source files
autofs: update fs/autofs4/Makefile
autofs: update fs/autofs4/Kconfig
autofs: copy autofs4 to autofs
autofs4: use autofs instead of autofs4 everywhere
autofs4: merge auto_fs.h and auto_fs4.h
fs/binfmt_misc.c: do not allow offset overflow
checkpatch: improve patch recognition
lib/ucs2_string.c: add MODULE_LICENSE()
lib/mpi: headers cleanup
lib/percpu_ida.c: use _irqsave() instead of local_irq_save() + spin_lock
lib/idr.c: remove simple_ida_lock
lib/bitmap.c: micro-optimization for __bitmap_complement()
...
|
|
MPI headers contain definitions for huge number of non-existing
functions.
Most part of these functions was removed in 2012 by Dmitry Kasatkin
- 7cf4206a99d1 ("Remove unused code from MPI library")
- 9e235dcaf4f6 ("Revert "crypto: GnuPG based MPI lib - additional ...")
- bc95eeadf5c6 ("lib/mpi: removed unused functions")
however headers wwere not updated properly.
Also I deleted some unused macros.
Link: http://lkml.kernel.org/r/fb2fc1ef-1185-f0a3-d8d0-173d2f97bbaf@virtuozzo.com
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Dmitry Kasatkin <dmitry.kasatkin@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This header file is not exported. It is safe to reference types without
double-underscore prefix.
Link: http://lkml.kernel.org/r/1526350925-14922-3-git-send-email-yamada.masahiro@socionext.com
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Lihao Liang <lianglihao@huawei.com>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
<uapi/linux/types.h> has the same typedefs except that it prefixes them
with double-underscore for user space. Use them for the kernel space
typedefs.
Link: http://lkml.kernel.org/r/1526350925-14922-2-git-send-email-yamada.masahiro@socionext.com
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Lihao Liang <lianglihao@huawei.com>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
When commit bd33ef368135 ("mm: enable page poisoning early at boot") got
rid of the PAGE_EXT_DEBUG_POISON, page_is_poisoned in the header left
behind. This patch cleans up the leftovers under the table.
Link: http://lkml.kernel.org/r/1528101069-21637-1-git-send-email-kpark3469@gmail.com
Signed-off-by: Sahara <keun-o.park@darkmatter.ae>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
same cacheline
The LKP robot found a 27% will-it-scale/page_fault3 performance
regression regarding commit e27be240df53("mm: memcg: make sure
memory.events is uptodate when waking pollers").
What the test does is:
1 mkstemp() a 128M file on a tmpfs;
2 start $nr_cpu processes, each to loop the following:
2.1 mmap() this file in shared write mode;
2.2 write 0 to this file in a PAGE_SIZE step till the end of the file;
2.3 unmap() this file and repeat this process.
3 After 5 minutes, check how many loops they managed to complete, the
higher the better.
The commit itself looks innocent enough as it merely changed some event
counting mechanism and this test didn't trigger those events at all.
Perf shows increased cycles spent on accessing root_mem_cgroup->stat_cpu
in count_memcg_event_mm()(called by handle_mm_fault()) and in
__mod_memcg_state() called by page_add_file_rmap(). So it's likely due
to the changed layout of 'struct mem_cgroup' that either make stat_cpu
falling into a constantly modifying cacheline or some hot fields stop
being in the same cacheline.
I verified this by moving memory_events[] back to where it was:
: --- a/include/linux/memcontrol.h
: +++ b/include/linux/memcontrol.h
: @@ -205,7 +205,6 @@ struct mem_cgroup {
: int oom_kill_disable;
:
: /* memory.events */
: - atomic_long_t memory_events[MEMCG_NR_MEMORY_EVENTS];
: struct cgroup_file events_file;
:
: /* protect arrays of thresholds */
: @@ -238,6 +237,7 @@ struct mem_cgroup {
: struct mem_cgroup_stat_cpu __percpu *stat_cpu;
: atomic_long_t stat[MEMCG_NR_STAT];
: atomic_long_t events[NR_VM_EVENT_ITEMS];
: + atomic_long_t memory_events[MEMCG_NR_MEMORY_EVENTS];
:
: unsigned long socket_pressure;
And performance restored.
Later investigation found that as long as the following 3 fields
moving_account, move_lock_task and stat_cpu are in the same cacheline,
performance will be good. To avoid future performance surprise by other
commits changing the layout of 'struct mem_cgroup', this patch makes
sure the 3 fields stay in the same cacheline.
One concern of this approach is, moving_account and move_lock_task could
be modified when a process changes memory cgroup while stat_cpu is a
always read field, it might hurt to place them in the same cacheline. I
assume it is rare for a process to change memory cgroup so this should
be OK.
Link: https://lkml.kernel.org/r/20180528114019.GF9904@yexl-desktop
Link: http://lkml.kernel.org/r/20180601071115.GA27302@intel.com
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
Reported-by: kernel test robot <xiaolong.ye@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
When bit is equal to 0x4, it means OPT_ZONE_DMA32 should be got from
GFP_ZONE_TABLE. OPT_ZONE_DMA32 shall be equal to ZONE_DMA32 or
ZONE_NORMAL according to the status of CONFIG_ZONE_DMA32.
Similarly, when bit is equal to 0xc, that means OPT_ZONE_DMA32 should be
got with an allocation policy GFP_MOVABLE. So ZONE_DMA32 or ZONE_NORMAL
is the possible result value.
Link: http://lkml.kernel.org/r/20180601163403.1032-1-yehs2007@zoho.com
Signed-off-by: Huaisheng Ye <yehs1@lenovo.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Cc: "Levin, Alexander (Sasha Levin)" <alexander.levin@verizon.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
If a process monitored with userfaultfd changes it's memory mappings or
forks() at the same time as uffd monitor fills the process memory with
UFFDIO_COPY, the actual creation of page table entries and copying of
the data in mcopy_atomic may happen either before of after the memory
mapping modifications and there is no way for the uffd monitor to
maintain consistent view of the process memory layout.
For instance, let's consider fork() running in parallel with
userfaultfd_copy():
process | uffd monitor
---------------------------------+------------------------------
fork() | userfaultfd_copy()
... | ...
dup_mmap() | down_read(mmap_sem)
down_write(mmap_sem) | /* create PTEs, copy data */
dup_uffd() | up_read(mmap_sem)
copy_page_range() |
up_write(mmap_sem) |
dup_uffd_complete() |
/* notify monitor */ |
If the userfaultfd_copy() takes the mmap_sem first, the new page(s) will
be present by the time copy_page_range() is called and they will appear
in the child's memory mappings. However, if the fork() is the first to
take the mmap_sem, the new pages won't be mapped in the child's address
space.
If the pages are not present and child tries to access them, the monitor
will get page fault notification and everything is fine. However, if
the pages *are present*, the child can access them without uffd
noticing. And if we copy them into child it'll see the wrong data.
Since we are talking about background copy, we'd need to decide whether
the pages should be copied or not regardless #PF notifications.
Since userfaultfd monitor has no way to determine what was the order,
let's disallow userfaultfd_copy in parallel with the non-cooperative
events. In such case we return -EAGAIN and the uffd monitor can
understand that userfaultfd_copy() clashed with a non-cooperative event
and take an appropriate action.
Link: http://lkml.kernel.org/r/1527061324-19949-1-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Acked-by: Pavel Emelyanov <xemul@virtuozzo.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The reserved field was only used for embedding an rcu_head in the data
structure. With the previous commit, we no longer need it. That lets us
remove the 'reserved' argument to a lot of functions.
Link: http://lkml.kernel.org/r/20180518194519.3820-16-willy@infradead.org
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Make hmm_data an explicit member of the struct page union.
Link: http://lkml.kernel.org/r/20180518194519.3820-14-willy@infradead.org
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
For pgd page table pages, x86 overloads the page->index field to store a
pointer to the mm_struct. Rename this to pt_mm so it's visible to other
users.
Link: http://lkml.kernel.org/r/20180518194519.3820-13-willy@infradead.org
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Rewrite the documentation to describe what you can use in struct page
rather than what you can't.
Link: http://lkml.kernel.org/r/20180518194519.3820-12-willy@infradead.org
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This gives us five words of space in a single union in struct page. The
compound_mapcount moves position (from offset 24 to offset 20) on 64-bit
systems, but that does not seem likely to cause any trouble.
Link: http://lkml.kernel.org/r/20180518194519.3820-11-willy@infradead.org
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Since the LRU is two words, this does not affect the double-word alignment
of SLUB's freelist.
Link: http://lkml.kernel.org/r/20180518194519.3820-10-willy@infradead.org
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
By combining these three one-word unions into one three-word union, we
make it easier for users to add their own multi-word fields to struct
page, as well as making it obvious that SLUB needs to keep its double-word
alignment for its freelist & counters.
No field moves position; verified with pahole.
Link: http://lkml.kernel.org/r/20180518194519.3820-8-willy@infradead.org
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Keeping the refcount in the union only encourages people to put something
else in the union which will overlap with _refcount and eventually explode
messily. pahole reports no fields change location.
Link: http://lkml.kernel.org/r/20180518194519.3820-7-willy@infradead.org
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
By moving page->private to the fourth word of struct page, we can put the
SLUB counters in the same word as SLAB's s_mem and still do the
cmpxchg_double trick. Now the SLUB counters no longer overlap with the
mapcount or refcount so we can drop the call to page_mapcount_reset() and
simplify set_page_slub_counters() to a single line.
Link: http://lkml.kernel.org/r/20180518194519.3820-6-willy@infradead.org
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This will allow us to store slub's counters in the same bits as slab's
s_mem. slub now needs to set page->mapping to NULL as it frees the page,
just like slab does.
Link: http://lkml.kernel.org/r/20180518194519.3820-5-willy@infradead.org
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Define a new PageTable bit in the page_type and use it to mark pages in
use as page tables. This can be helpful when debugging crashdumps or
analysing memory fragmentation. Add a KPF flag to report these pages to
userspace and update page-types.c to interpret that flag.
Note that only pages currently accounted as NR_PAGETABLES are tracked as
PageTable; this does not include pgd/p4d/pud/pmd pages. Those will be the
subject of a later patch.
Link: http://lkml.kernel.org/r/20180518194519.3820-4-willy@infradead.org
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
We're already using a union of many fields here, so stop abusing the
_mapcount and make page_type its own field. That implies renaming some of
the machinery that creates PageBuddy, PageBalloon and PageKmemcg; bring
back the PG_buddy, PG_balloon and PG_kmemcg names.
As suggested by Kirill, make page_type a bitmask. Because it starts out
life as -1 (thanks to sharing the storage with _mapcount), setting a page
flag means clearing the appropriate bit. This gives us space for probably
twenty or so extra bits (depending how paranoid we want to be about
_mapcount underflow).
Link: http://lkml.kernel.org/r/20180518194519.3820-3-willy@infradead.org
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
___GFP_COLD and ___GFP_OTHER_NODE were removed but their bits were
stranded. Fill the gaps by moving the existing gfp masks around.
Link: http://lkml.kernel.org/r/20180516211439.177440-1-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Greg Thelen <gthelen@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Use new return type vm_fault_t for fault handler in struct
vm_operations_struct. For now, this is just documenting that the
function returns a VM_FAULT value rather than an errno. Once all
instances are converted, vm_fault_t will become a distinct type.
See commit 1c8f422059ae ("mm: change return type to vm_fault_t")
Link: http://lkml.kernel.org/r/20180512063745.GA26866@jordon-HP-15-Notebook-PC
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Joe Perches <joe@perches.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Use new return type vm_fault_t for fault handler in struct
vm_operations_struct. For now, this is just documenting that the
function returns a VM_FAULT value rather than an errno. Once all
instances are converted, vm_fault_t will become a distinct type.
Link: http://lkml.kernel.org/r/20180511190542.GA2412@jordon-HP-15-Notebook-PC
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Memory controller implements the memory.low best-effort memory
protection mechanism, which works perfectly in many cases and allows
protecting working sets of important workloads from sudden reclaim.
But its semantics has a significant limitation: it works only as long as
there is a supply of reclaimable memory. This makes it pretty useless
against any sort of slow memory leaks or memory usage increases. This
is especially true for swapless systems. If swap is enabled, memory
soft protection effectively postpones problems, allowing a leaking
application to fill all swap area, which makes no sense. The only
effective way to guarantee the memory protection in this case is to
invoke the OOM killer.
It's possible to handle this case in userspace by reacting on MEMCG_LOW
events; but there is still a place for a fail-safe in-kernel mechanism
to provide stronger guarantees.
This patch introduces the memory.min interface for cgroup v2 memory
controller. It works very similarly to memory.low (sharing the same
hierarchical behavior), except that it's not disabled if there is no
more reclaimable memory in the system.
If cgroup is not populated, its memory.min is ignored, because otherwise
even the OOM killer wouldn't be able to reclaim the protected memory,
and the system can stall.
[guro@fb.com: s/low/min/ in docs]
Link: http://lkml.kernel.org/r/20180510130758.GA9129@castle.DHCP.thefacebook.com
Link: http://lkml.kernel.org/r/20180509180734.GA4856@castle.DHCP.thefacebook.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
is_pageblock_removable_nolock() is not used outside of
mm/memory_hotplug.c. Move it next to unique caller
is_mem_section_removable() and make it static.
Remove prototype in <linux/memory_hotplug.h> to silence gcc warning (W=1):
mm/page_alloc.c:7704:6: warning: no previous prototype for `is_pageblock_removable_nolock' [-Wmissing-prototypes]
Link: http://lkml.kernel.org/r/20180509190001.24789-1-malat@debian.org
Signed-off-by: Mathieu Malaterre <malat@debian.org>
Suggested-by: Michal Hocko <mhocko@kernel.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
mem_cgroup_cgwb_list is a very simple wrapper and it will never be used
outside of code under CONFIG_CGROUP_WRITEBACK. so use memcg->cgwb_list
directly.
Link: http://lkml.kernel.org/r/1524406173-212182-1-git-send-email-wanglong19@meituan.com
Signed-off-by: Wang Long <wanglong19@meituan.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
page_stable_node() and set_page_stable_node() are only used in mm/ksm.c
and there is no point to keep them in the include/linux/ksm.h
[akpm@linux-foundation.org: fix SYSFS=n build]
Link: http://lkml.kernel.org/r/1524552106-7356-3-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Commit 9f32624be943 ("mm/rmap: use rmap_walk() in page_referenced()")
removed the declaration of page_referenced_ksm for the case
CONFIG_KSM=y, but left one for CONFIG_KSM=n.
Remove the unused leftover.
Link: http://lkml.kernel.org/r/1524552106-7356-2-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
While revisiting my Btrfs swapfile series [1], I introduced a situation
in which reclaim would lock i_rwsem, and even though the swapon() path
clearly made GFP_KERNEL allocations while holding i_rwsem, I got no
complaints from lockdep. It turns out that the rework of the fs_reclaim
annotation was broken: if the current task has PF_MEMALLOC set, we don't
acquire the dummy fs_reclaim lock, but when reclaiming we always check
this _after_ we've just set the PF_MEMALLOC flag. In most cases, we can
fix this by moving the fs_reclaim_{acquire,release}() outside of the
memalloc_noreclaim_{save,restore}(), althought kswapd is slightly
different. After applying this, I got the expected lockdep splats.
1: https://lwn.net/Articles/625412/
Link: http://lkml.kernel.org/r/9f8aa70652a98e98d7c4de0fc96a4addcee13efe.1523778026.git.osandov@fb.com
Fixes: d92a8cfcb37e ("locking/lockdep: Rework FS_RECLAIM annotation")
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This patch aims to address an issue in current memory.low semantics,
which makes it hard to use it in a hierarchy, where some leaf memory
cgroups are more valuable than others.
For example, there are memcgs A, A/B, A/C, A/D and A/E:
A A/memory.low = 2G, A/memory.current = 6G
//\\
BC DE B/memory.low = 3G B/memory.current = 2G
C/memory.low = 1G C/memory.current = 2G
D/memory.low = 0 D/memory.current = 2G
E/memory.low = 10G E/memory.current = 0
If we apply memory pressure, B, C and D are reclaimed at the same pace
while A's usage exceeds 2G. This is obviously wrong, as B's usage is
fully below B's memory.low, and C has 1G of protection as well. Also, A
is pushed to the size, which is less than A's 2G memory.low, which is
also wrong.
A simple bash script (provided below) can be used to reproduce
the problem. Current results are:
A: 1430097920
A/B: 711929856
A/C: 717426688
A/D: 741376
A/E: 0
To address the issue a concept of effective memory.low is introduced.
Effective memory.low is always equal or less than original memory.low.
In a case, when there is no memory.low overcommittment (and also for
top-level cgroups), these two values are equal.
Otherwise it's a part of parent's effective memory.low, calculated as a
cgroup's memory.low usage divided by sum of sibling's memory.low usages
(under memory.low usage I mean the size of actually protected memory:
memory.current if memory.current < memory.low, 0 otherwise). It's
necessary to track the actual usage, because otherwise an empty cgroup
with memory.low set (A/E in my example) will affect actual memory
distribution, which makes no sense. To avoid traversing the cgroup tree
twice, page_counters code is reused.
Calculating effective memory.low can be done in the reclaim path, as we
conveniently traversing the cgroup tree from top to bottom and check
memory.low on each level. So, it's a perfect place to calculate
effective memory low and save it to use it for children cgroups.
This also eliminates a need to traverse the cgroup tree from bottom to
top each time to check if parent's guarantee is not exceeded.
Setting/resetting effective memory.low is intentionally racy, but it's
fine and shouldn't lead to any significant differences in actual memory
distribution.
With this patch applied results are matching the expectations:
A: 2147930112
A/B: 1428721664
A/C: 718393344
A/D: 815104
A/E: 0
Test script:
#!/bin/bash
CGPATH="/sys/fs/cgroup"
truncate /file1 --size 2G
truncate /file2 --size 2G
truncate /file3 --size 2G
truncate /file4 --size 50G
mkdir "${CGPATH}/A"
echo "+memory" > "${CGPATH}/A/cgroup.subtree_control"
mkdir "${CGPATH}/A/B" "${CGPATH}/A/C" "${CGPATH}/A/D" "${CGPATH}/A/E"
echo 2G > "${CGPATH}/A/memory.low"
echo 3G > "${CGPATH}/A/B/memory.low"
echo 1G > "${CGPATH}/A/C/memory.low"
echo 0 > "${CGPATH}/A/D/memory.low"
echo 10G > "${CGPATH}/A/E/memory.low"
echo $$ > "${CGPATH}/A/B/cgroup.procs" && vmtouch -qt /file1
echo $$ > "${CGPATH}/A/C/cgroup.procs" && vmtouch -qt /file2
echo $$ > "${CGPATH}/A/D/cgroup.procs" && vmtouch -qt /file3
echo $$ > "${CGPATH}/cgroup.procs" && vmtouch -qt /file4
echo "A: " `cat "${CGPATH}/A/memory.current"`
echo "A/B: " `cat "${CGPATH}/A/B/memory.current"`
echo "A/C: " `cat "${CGPATH}/A/C/memory.current"`
echo "A/D: " `cat "${CGPATH}/A/D/memory.current"`
echo "A/E: " `cat "${CGPATH}/A/E/memory.current"`
rmdir "${CGPATH}/A/B" "${CGPATH}/A/C" "${CGPATH}/A/D" "${CGPATH}/A/E"
rmdir "${CGPATH}/A"
rm /file1 /file2 /file3 /file4
Link: http://lkml.kernel.org/r/20180405185921.4942-2-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This patch renames struct page_counter fields:
count -> usage
limit -> max
and the corresponding functions:
page_counter_limit() -> page_counter_set_max()
mem_cgroup_get_limit() -> mem_cgroup_get_max()
mem_cgroup_resize_limit() -> mem_cgroup_resize_max()
memcg_update_kmem_limit() -> memcg_update_kmem_max()
memcg_update_tcp_limit() -> memcg_update_tcp_max()
The idea behind this renaming is to have the direct matching
between memory cgroup knobs (low, high, max) and page_counters API.
This is pure renaming, this patch doesn't bring any functional change.
Link: http://lkml.kernel.org/r/20180405185921.4942-1-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
So far code was using ULLONG_MAX and type casting to obtain a
phys_addr_t with all bits set. The typecast is necessary to silence
compiler warnings on 32-bit platforms.
Use the simpler but still type safe approach "~(phys_addr_t)0" to create a
preprocessor define for all bits set.
Link: http://lkml.kernel.org/r/20180406213809.566-1-stefan@agner.ch
Signed-off-by: Stefan Agner <stefan@agner.ch>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Currently the PTE special supports is turned on in per architecture
header files. Most of the time, it is defined in
arch/*/include/asm/pgtable.h depending or not on some other per
architecture static definition.
This patch introduce a new configuration variable to manage this
directly in the Kconfig files. It would later replace
__HAVE_ARCH_PTE_SPECIAL.
Here notes for some architecture where the definition of
__HAVE_ARCH_PTE_SPECIAL is not obvious:
arm
__HAVE_ARCH_PTE_SPECIAL which is currently defined in
arch/arm/include/asm/pgtable-3level.h which is included by
arch/arm/include/asm/pgtable.h when CONFIG_ARM_LPAE is set.
So select ARCH_HAS_PTE_SPECIAL if ARM_LPAE.
powerpc
__HAVE_ARCH_PTE_SPECIAL is defined in 2 files:
- arch/powerpc/include/asm/book3s/64/pgtable.h
- arch/powerpc/include/asm/pte-common.h
The first one is included if (PPC_BOOK3S & PPC64) while the second is
included in all the other cases.
So select ARCH_HAS_PTE_SPECIAL all the time.
sparc:
__HAVE_ARCH_PTE_SPECIAL is defined if defined(__sparc__) &&
defined(__arch64__) which are defined through the compiler in
sparc/Makefile if !SPARC32 which I assume to be if SPARC64.
So select ARCH_HAS_PTE_SPECIAL if SPARC64
There is no functional change introduced by this patch.
Link: http://lkml.kernel.org/r/1523433816-14460-2-git-send-email-ldufour@linux.vnet.ibm.com
Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Suggested-by: Jerome Glisse <jglisse@redhat.com>
Reviewed-by: Jerome Glisse <jglisse@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Rich Felker <dalias@libc.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Palmer Dabbelt <palmer@sifive.com>
Cc: Albert Ou <albert@sifive.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Christophe LEROY <christophe.leroy@c-s.fr>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
With the addition of memfd hugetlbfs support, we now have the situation
where memfd depends on TMPFS -or- HUGETLBFS. Previously, memfd was only
supported on tmpfs, so it made sense that the code resided in shmem.c.
In the current code, memfd is only functional if TMPFS is defined. If
HUGETLFS is defined and TMPFS is not defined, then memfd functionality
will not be available for hugetlbfs. This does not cause BUGs, just a
lack of potentially desired functionality.
Code is restructured in the following way:
- include/linux/memfd.h is a new file containing memfd specific
definitions previously contained in shmem_fs.h.
- mm/memfd.c is a new file containing memfd specific code previously
contained in shmem.c.
- memfd specific code is removed from shmem_fs.h and shmem.c.
- A new config option MEMFD_CREATE is added that is defined if TMPFS
or HUGETLBFS is defined.
No functional changes are made to the code: restructuring only.
Link: http://lkml.kernel.org/r/20180415182119.4517-4-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Herrmann <dh.herrmann@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Marc-Andr Lureau <marcandre.lureau@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Add swap max and fail events so that userland can monitor and respond to
running out of swap.
I'm not too sure about the fail event. Right now, it's a bit confusing
which stats / events are recursive and which aren't and also which ones
reflect events which originate from a given cgroup and which targets the
cgroup. No idea what the right long term solution is and it could just
be that growing them organically is actually the only right thing to do.
Link: http://lkml.kernel.org/r/20180416231151.GI1911913@devbig577.frc2.facebook.com
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: <linux-api@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
mmap_sem is on the hot path of kernel, and it very contended, but it is
abused too. It is used to protect arg_start|end and evn_start|end when
reading /proc/$PID/cmdline and /proc/$PID/environ, but it doesn't make
sense since those proc files just expect to read 4 values atomically and
not related to VM, they could be set to arbitrary values by C/R.
And, the mmap_sem contention may cause unexpected issue like below:
INFO: task ps:14018 blocked for more than 120 seconds.
Tainted: G E 4.9.79-009.ali3000.alios7.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
ps D 0 14018 1 0x00000004
Call Trace:
schedule+0x36/0x80
rwsem_down_read_failed+0xf0/0x150
call_rwsem_down_read_failed+0x18/0x30
down_read+0x20/0x40
proc_pid_cmdline_read+0xd9/0x4e0
__vfs_read+0x37/0x150
vfs_read+0x96/0x130
SyS_read+0x55/0xc0
entry_SYSCALL_64_fastpath+0x1a/0xc5
Both Alexey Dobriyan and Michal Hocko suggested to use dedicated lock
for them to mitigate the abuse of mmap_sem.
So, introduce a new spinlock in mm_struct to protect the concurrent
access to arg_start|end, env_start|end and others, as well as replace
write map_sem to read to protect the race condition between prctl and
sys_brk which might break check_data_rlimit(), and makes prctl more
friendly to other VM operations.
This patch just eliminates the abuse of mmap_sem, but it can't resolve
the above hung task warning completely since the later
access_remote_vm() call needs acquire mmap_sem. The mmap_sem
scalability issue will be solved in the future.
[yang.shi@linux.alibaba.com: add comment about mmap_sem and arg_lock]
Link: http://lkml.kernel.org/r/1524077799-80690-1-git-send-email-yang.shi@linux.alibaba.com
Link: http://lkml.kernel.org/r/1523730291-109696-1-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mateusz Guzik <mguzik@redhat.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|