summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2018-12-28Merge tag 'gpio-v4.21-1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio Pull GPIO updates from Linus Walleij: "This is the bulk of GPIO changes for the v4.21 kernel series. Core changes: - Some core changes are already in outside of this pull request as they came through the regulator tree, most notably devm_gpiod_unhinge() that removes devres refcount management from a GPIO descriptor. This is needed in subsystems such as regulators where the regulator core need to take over the reference counting and lifecycle management for a GPIO descriptor. - We dropped devm_gpiochip_remove() and devm_gpio_chip_match() as nothing needs it. We can bring it back if need be. - Add a global TODO so people see where we are going. This helps setting the direction now that we are two GPIO maintainers. - Handle the MMC CD/WP properties in the device tree core. (The bulk of patches activating this code is already merged through the MMC/SD tree.) - Augment gpiochip_request_own_desc() to pass a flag so we as gpiochips can request lines as active low or open drain etc even from ourselves. New drivers: - New driver for Cadence GPIO blocks. - New driver for Atmel SAMA5D2 PIOBU GPIO lines. Driver improvements: - A major refactoring of the PCA953x driver - this driver has been around for ages, and is now modernized to reduce code duplication that has stacked up and is using regmap to read write and cache registers. - Intel drivers are now maintained in a separate tree and start with a round of cleanups and unifications" * tag 'gpio-v4.21-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio: (99 commits) gpio: sama5d2-piobu: Depend on OF_GPIO gpio: Add Cadence GPIO driver dt-bindings: gpio: Add bindings for Cadence GPIO gpiolib-acpi: remove unused variable 'err', cleans up build warning gpio: mxs: read pin level directly instead of using .get gpio: aspeed: remove duplicated statement gpio: add driver for SAMA5D2 PIOBU pins dt-bindings: arm: atmel: describe SECUMOD usage as a GPIO controller gpio/mmc/of: Respect polarity in the device tree dt-bindings: gpio: rcar: Add r8a774c0 (RZ/G2E) support memory: omap-gpmc: Get the header of the enum ARM: omap1: Fix new user of gpiochip_request_own_desc() gpio: pca953x: Add regmap dependency for PCA953x driver gpio: raspberrypi-exp: decrease refcount on firmware dt node gpiolib: Fix return value of gpio_to_desc() stub if !GPIOLIB gpio: pca953x: Restore registers after suspend/resume cycle gpio: pca953x: Zap single use of pca953x_read_single() gpio: pca953x: Zap ad-hoc reg_output cache gpio: pca953x: Zap ad-hoc reg_direction cache gpio: pca953x: Perform basic regmap conversion ...
2018-12-28Merge tag 'drm-next-2018-12-27' of git://anongit.freedesktop.org/drm/drmLinus Torvalds
Pull more drm updates from Dave Airlie: "Daniel collected a couple of pulls after I want on holidays, back for a couple of days, so may as well send them out. This has exynos and etnaviv work for 4.21. exynos: - plane alpha and blending configurability etnaviv: - mostly cleanups in prep for new features" * tag 'drm-next-2018-12-27' of git://anongit.freedesktop.org/drm/drm: drm/etnaviv: remove lastctx member from gpu struct drm/etnaviv: replace header include with forward declaration drm/etnaviv: remove unnecessary local irq disable drm/exynos: fimd: Make pixel blend mode configurable drm/exynos: fimd: Make plane alpha configurable drm/etnaviv: Replace drm_dev_unref with drm_dev_put drm/etnaviv: consolidate hardware fence handling in etnaviv_gpu drm/etnaviv: kill active fence tracking
2018-12-28Merge tag 'hwmon-for-v4.21' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging Pull hwmon updates from Guenter Roeck: "The big change in this series is for the most part automatic: Introducing SENSOR[_DEVICE]_ATTR_{RO,RW,WO} variants and conversion of various drivers to use it. This is similar to DEVICE_ATTR variants. Other than that, we have - Some conversions of S_<PERMS> with octal values, also automated - Added support for Hygon Dhyana CPUs to k10temp driver - Added support for STLM75 to lm75 driver - B57891S0103 to ntc_thermistor - Added pm-runtime support to ina3221 driver - Support for PowerPC On-Chip Controller (OCC) - Various minor bug fices and improvements" * tag 'hwmon-for-v4.21' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging: (80 commits) hwmon: (lm80) fix a missing check of bus read in lm80 probe hwmon: (lm80) fix a missing check of the status of SMBus read hwmon: (asus_atk0110) Fix debugfs_simple_attr.cocci warnings hwmon: (ftsteutates) Use permission specific SENSOR[_DEVICE]_ATTR variants hwmon: (fschmd) Use permission specific SENSOR[_DEVICE]_ATTR variants hwmon: (emc6w201) Use permission specific SENSOR[_DEVICE]_ATTR variants hwmon: (emc2103) Use permission specific SENSOR[_DEVICE]_ATTR variants hwmon: (emc1403) Use permission specific SENSOR[_DEVICE]_ATTR variants hwmon: (ds620) Use permission specific SENSOR[_DEVICE]_ATTR variants hwmon: (ds1621) Use permission specific SENSOR[_DEVICE]_ATTR variants hwmon: (dell-smm-hwmon) Use permission specific SENSOR[_DEVICE]_ATTR variants hwmon: (da9055-hwmon) Use permission specific SENSOR[_DEVICE]_ATTR variants hwmon: (da9052-hwmon) Use permission specific SENSOR[_DEVICE]_ATTR variants hwmon: (coretemp) Replace S_<PERMS> with octal values hwmon: (asus_atk0110) Replace S_<PERMS> with octal values hwmon: (aspeed-pwm-tacho) Use permission specific SENSOR[_DEVICE]_ATTR variants hwmon: (applesmc) Replace S_<PERMS> with octal values hwmon: (amc6821) Use permission specific SENSOR[_DEVICE]_ATTR variants hwmon: (adt7x10) Use permission specific SENSOR[_DEVICE]_ATTR variants hwmon: (adt7475) Use permission specific SENSOR[_DEVICE]_ATTR variants ...
2018-12-28Merge tag 'vfio-v4.21-rc1' of git://github.com/awilliam/linux-vfioLinus Torvalds
Pull VFIO updates from Alex Williamson: - Replace global vfio-pci lock with per bus lock to allow concurrent open and release (Alex Williamson) - Declare mdev function as static (Paolo Cretaro) - Convert char to u8 in mdev/mtty sample driver (Nathan Chancellor) * tag 'vfio-v4.21-rc1' of git://github.com/awilliam/linux-vfio: vfio-mdev/samples: Use u8 instead of char for handle functions vfio/mdev: add static modifier to add_mdev_supported_type vfio/pci: Parallelize device open and release
2018-12-28Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge misc updates from Andrew Morton: - large KASAN update to use arm's "software tag-based mode" - a few misc things - sh updates - ocfs2 updates - just about all of MM * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (167 commits) kernel/fork.c: mark 'stack_vm_area' with __maybe_unused memcg, oom: notify on oom killer invocation from the charge path mm, swap: fix swapoff with KSM pages include/linux/gfp.h: fix typo mm/hmm: fix memremap.h, move dev_page_fault_t callback to hmm hugetlbfs: Use i_mmap_rwsem to fix page fault/truncate race hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization memory_hotplug: add missing newlines to debugging output mm: remove __hugepage_set_anon_rmap() include/linux/vmstat.h: remove unused page state adjustment macro mm/page_alloc.c: allow error injection mm: migrate: drop unused argument of migrate_page_move_mapping() blkdev: avoid migration stalls for blkdev pages mm: migrate: provide buffer_migrate_page_norefs() mm: migrate: move migrate_page_lock_buffers() mm: migrate: lock buffers before migrate_page_move_mapping() mm: migration: factor out code to compute expected number of page references mm, page_alloc: enable pcpu_drain with zone capability kmemleak: add config to select auto scan mm/page_alloc.c: don't call kasan_free_pages() at deferred mem init ...
2018-12-28Merge tag 'mmc-v4.21' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmcLinus Torvalds
Pull MMC updates from Ulf Hansson: "This time, this pull request contains changes crossing subsystems and archs/platforms, which is mainly because of a bigger modernization of moving from legacy GPIO to GPIO descriptors for MMC (by Linus Walleij). Additionally, once again, I am funneling changes to drivers/misc/cardreader/* and drivers/memstick/* through my MMC tree, mostly due to that we lack a maintainer for these. Summary: MMC core: - Cleanup BKOPS support - Introduce MMC_CAP_SYNC_RUNTIME_PM - slot-gpio: Delete legacy slot GPIO handling MMC host: - alcor: Add new mmc host driver for Alcor Micro PCI based cardreader - bcm2835: Several improvements to better recover from errors - jz4740: Rework and fixup pre|post_req support - mediatek: Add support for SDIO IRQs - meson-gx: Improve clock phase management - meson-gx: Stop descriptor on errors - mmci: Complete the sbc error path by sending a stop command - renesas_sdhi/tmio: Fixup reset/resume operations - renesas_sdhi: Add support for r8a774c0 and R7S9210 - renesas_sdhi: Whitelist R8A77990 SDHI - renesas_sdhi: Fixup eMMC HS400 compatibility issues for H3 and M3-W - rtsx_usb_sdmmc: Re-work card detection/removal support - rtsx_usb_sdmmc: Re-work runtime PM support - sdhci: Fix timeout loops for some variant drivers - sdhci: Improve support for error handling due to failing commands - sdhci-acpi/pci: Disable LED control for Intel BYT-based controllers - sdhci_am654: Add new SDHCI variant driver to support TI's AM654 SOCs - sdhci-of-esdhc: Add support for eMMC HS400 mode - sdhci-omap: Fixup reset support - sdhci-omap: Workaround errata regarding SDR104/HS200 tuning failures - sdhci-msm: Fixup sporadic write transfers issues for SDR104/HS200 - sdhci-msm: Fixup dynamical clock gating issues - various: Complete converting all hosts into using slot GPIO descriptors Other: - Move GPIO mmc platform data for mips/sh/arm to GPIO descriptors - Add new Alcor Micro cardreader PCI driver - Support runtime power management for memstick rtsx_usb_ms driver - Use USB remote wakeups for card detection for rtsx_usb misc driver" * tag 'mmc-v4.21' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc: (99 commits) mmc: mediatek: Add MMC_CAP_SDIO_IRQ support mmc: renesas_sdhi_internal_dmac: Whitelist r8a774c0 dt-bindings: mmc: renesas_sdhi: Add r8a774c0 support mmc: core: Cleanup BKOPS support mmc: core: Drop redundant check in mmc_send_hpi_cmd() mmc: sdhci-omap: Workaround errata regarding SDR104/HS200 tuning failures (i929) dt-bindings: sdhci-omap: Add note for cpu_thermal mmc: sdhci-acpi: Disable LED control for Intel BYT-based controllers mmc: sdhci-pci: Disable LED control for Intel BYT-based controllers mmc: sdhci: Add quirk to disable LED control mmc: mmci: add variant property to set command stop bit misc: alcor_pci: fix spelling mistake "invailid" -> "invalid" mmc: meson-gx: add signal resampling mmc: meson-gx: align default phase on soc vendor tree mmc: meson-gx: remove useless lock mmc: meson-gx: make sure the descriptor is stopped on errors mmc: sdhci_am654: Add Initial Support for AM654 SDHCI driver dt-bindings: mmc: sdhci-of-arasan: Add deprecated message for AM65 dt-bindings: mmc: sdhci-am654: Document bindings for the host controllers on TI's AM654 SOCs mmc: sdhci-msm: avoid unused function warning ...
2018-12-28Merge tag 'libnvdimm-for-4.21' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull libnvdimm updates from Dan Williams: "The vast bulk of this update is the new support for the security capabilities of some nvdimms. The userspace tooling for this capability is still a work in progress, but the changes survive the existing libnvdimm unit tests. The changes also pass manual checkout on hardware and the new nfit_test emulation of the security capability. The touches of the security/keys/ files have received the necessary acks from Mimi and David. Those changes were necessary to allow for a new generic encrypted-key type, and allow the nvdimm sub-system to lookup key material referenced by the libnvdimm-sysfs interface. Summary: - Add support for the security features of nvdimm devices that implement a security model similar to ATA hard drive security. The security model supports locking access to the media at device-power-loss, to be unlocked with a passphrase, and secure-erase (crypto-scramble). Unlike the ATA security case where the kernel expects device security to be managed in a pre-OS environment, the libnvdimm security implementation allows key provisioning and key-operations at OS runtime. Keys are managed with the kernel's encrypted-keys facility to provide data-at-rest security for the libnvdimm key material. The usage model mirrors fscrypt key management, but is driven via libnvdimm sysfs. - Miscellaneous updates for api usage and comment fixes" * tag 'libnvdimm-for-4.21' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (21 commits) libnvdimm/security: Quiet security operations libnvdimm/security: Add documentation for nvdimm security support tools/testing/nvdimm: add Intel DSM 1.8 support for nfit_test tools/testing/nvdimm: Add overwrite support for nfit_test tools/testing/nvdimm: Add test support for Intel nvdimm security DSMs acpi/nfit, libnvdimm/security: add Intel DSM 1.8 master passphrase support acpi/nfit, libnvdimm/security: Add security DSM overwrite support acpi/nfit, libnvdimm: Add support for issue secure erase DSM to Intel nvdimm acpi/nfit, libnvdimm: Add enable/update passphrase support for Intel nvdimms acpi/nfit, libnvdimm: Add disable passphrase support to Intel nvdimm. acpi/nfit, libnvdimm: Add unlock of nvdimm support for Intel DIMMs acpi/nfit, libnvdimm: Add freeze security support to Intel nvdimm acpi/nfit, libnvdimm: Introduce nvdimm_security_ops keys-encrypted: add nvdimm key format type to encrypted keys keys: Export lookup_user_key to external users acpi/nfit, libnvdimm: Store dimm id as a member to struct nvdimm libnvdimm, namespace: Replace kmemdup() with kstrndup() libnvdimm, label: Switch to bitmap_zalloc() ACPI/nfit: Adjust annotation for why return 0 if fail to find NFIT at start libnvdimm, bus: Check id immediately following ida_simple_get ...
2018-12-28Merge tag 'for-4.21/dm-changes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper updates from Mike Snitzer: - Eliminate a couple indirect calls from bio-based DM core. - Fix DM to allow reads that exceed readahead limits by setting io_pages in the backing_dev_info. - A couple code cleanups in request-based DM. - Fix various DM targets to check for device sector overflow if CONFIG_LBDAF is not set. - Use u64 instead of sector_t to store iv_offset in DM crypt; sector_t isn't large enough on 32bit when CONFIG_LBDAF is not set. - Performance fixes to DM's kcopyd and the snapshot target focused on limiting memory use and workqueue stalls. - Fix typos in the integrity and writecache targets. - Log which algorithm is used for dm-crypt's encryption and dm-integrity's hashing. - Fix false -EBUSY errors in DM raid target's handling of check/repair messages. - Fix DM flakey target's corrupt_bio_byte feature to reliably corrupt the Nth byte in a bio's payload. * tag 'for-4.21/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm: do not allow readahead to limit IO size dm raid: fix false -EBUSY when handling check/repair message dm rq: cleanup leftover code from recently removed q->mq_ops branching dm verity: log the hash algorithm implementation dm crypt: log the encryption algorithm implementation dm integrity: fix spelling mistake in workqueue name dm flakey: Properly corrupt multi-page bios. dm: Check for device sector overflow if CONFIG_LBDAF is not set dm crypt: use u64 instead of sector_t to store iv_offset dm kcopyd: Fix bug causing workqueue stalls dm snapshot: Fix excessive memory usage and workqueue stalls dm bufio: update comment in dm-bufio.c dm writecache: fix typo in error msg for creating writecache_flush_thread dm: remove indirect calls from __send_changing_extent_only() dm mpath: only flush workqueue when needed dm rq: remove unused arguments from rq_completed() dm: avoid indirect call in __dm_make_request
2018-12-28Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdmaLinus Torvalds
Pull rdma updates from Jason Gunthorpe: "This has been a fairly typical cycle, with the usual sorts of driver updates. Several series continue to come through which improve and modernize various parts of the core code, and we finally are starting to get the uAPI command interface cleaned up. - Various driver fixes for bnxt_re, cxgb3/4, hfi1, hns, i40iw, mlx4, mlx5, qib, rxe, usnic - Rework the entire syscall flow for uverbs to be able to run over ioctl(). Finally getting past the historic bad choice to use write() for command execution - More functional coverage with the mlx5 'devx' user API - Start of the HFI1 series for 'TID RDMA' - SRQ support in the hns driver - Support for new IBTA defined 2x lane widths - A big series to consolidate all the driver function pointers into a big struct and have drivers provide a 'static const' version of the struct instead of open coding initialization - New 'advise_mr' uAPI to control device caching/loading of page tables - Support for inline data in SRPT - Modernize how umad uses the driver core and creates cdev's and sysfs files - First steps toward removing 'uobject' from the view of the drivers" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (193 commits) RDMA/srpt: Use kmem_cache_free() instead of kfree() RDMA/mlx5: Signedness bug in UVERBS_HANDLER() IB/uverbs: Signedness bug in UVERBS_HANDLER() IB/mlx5: Allocate the per-port Q counter shared when DEVX is supported IB/umad: Start using dev_groups of class IB/umad: Use class_groups and let core create class file IB/umad: Refactor code to use cdev_device_add() IB/umad: Avoid destroying device while it is accessed IB/umad: Simplify and avoid dynamic allocation of class IB/mlx5: Fix wrong error unwind IB/mlx4: Remove set but not used variable 'pd' RDMA/iwcm: Don't copy past the end of dev_name() string IB/mlx5: Fix long EEH recover time with NVMe offloads IB/mlx5: Simplify netdev unbinding IB/core: Move query port to ioctl RDMA/nldev: Expose port_cap_flags2 IB/core: uverbs copy to struct or zero helper IB/rxe: Reuse code which sets port state IB/rxe: Make counters thread safe IB/mlx5: Use the correct commands for UMEM and UCTX allocation ...
2018-12-28Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsiLinus Torvalds
Pull SCSI updates from James Bottomley: "This is mostly update of the usual drivers: smarpqi, lpfc, qedi, megaraid_sas, libsas, zfcp, mpt3sas, hisi_sas. Additionally, we have a pile of annotation, unused variable and minor updates. The big API change is the updates for Christoph's DMA rework which include removing the DISABLE_CLUSTERING flag. And finally there are a couple of target tree updates" * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (259 commits) scsi: isci: request: mark expected switch fall-through scsi: isci: remote_node_context: mark expected switch fall-throughs scsi: isci: remote_device: Mark expected switch fall-throughs scsi: isci: phy: Mark expected switch fall-through scsi: iscsi: Capture iscsi debug messages using tracepoints scsi: myrb: Mark expected switch fall-throughs scsi: megaraid: fix out-of-bound array accesses scsi: mpt3sas: mpt3sas_scsih: Mark expected switch fall-through scsi: fcoe: remove set but not used variable 'port' scsi: smartpqi: call pqi_free_interrupts() in pqi_shutdown() scsi: smartpqi: fix build warnings scsi: smartpqi: update driver version scsi: smartpqi: add ofa support scsi: smartpqi: increase fw status register read timeout scsi: smartpqi: bump driver version scsi: smartpqi: add smp_utils support scsi: smartpqi: correct lun reset issues scsi: smartpqi: correct volume status scsi: smartpqi: do not offline disks for transient did no connect conditions scsi: smartpqi: allow for larger raid maps ...
2018-12-28Merge tag 'dma-mapping-4.21' of git://git.infradead.org/users/hch/dma-mappingLinus Torvalds
Pull DMA mapping updates from Christoph Hellwig: "A huge update this time, but a lot of that is just consolidating or removing code: - provide a common DMA_MAPPING_ERROR definition and avoid indirect calls for dma_map_* error checking - use direct calls for the DMA direct mapping case, avoiding huge retpoline overhead for high performance workloads - merge the swiotlb dma_map_ops into dma-direct - provide a generic remapping DMA consistent allocator for architectures that have devices that perform DMA that is not cache coherent. Based on the existing arm64 implementation and also used for csky now. - improve the dma-debug infrastructure, including dynamic allocation of entries (Robin Murphy) - default to providing chaining scatterlist everywhere, with opt-outs for the few architectures (alpha, parisc, most arm32 variants) that can't cope with it - misc sparc32 dma-related cleanups - remove the dma_mark_clean arch hook used by swiotlb on ia64 and replace it with the generic noncoherent infrastructure - fix the return type of dma_set_max_seg_size (Niklas Söderlund) - move the dummy dma ops for not DMA capable devices from arm64 to common code (Robin Murphy) - ensure dma_alloc_coherent returns zeroed memory to avoid kernel data leaks through userspace. We already did this for most common architectures, but this ensures we do it everywhere. dma_zalloc_coherent has been deprecated and can hopefully be removed after -rc1 with a coccinelle script" * tag 'dma-mapping-4.21' of git://git.infradead.org/users/hch/dma-mapping: (73 commits) dma-mapping: fix inverted logic in dma_supported dma-mapping: deprecate dma_zalloc_coherent dma-mapping: zero memory returned from dma_alloc_* sparc/iommu: fix ->map_sg return value sparc/io-unit: fix ->map_sg return value arm64: default to the direct mapping in get_arch_dma_ops PCI: Remove unused attr variable in pci_dma_configure ia64: only select ARCH_HAS_DMA_COHERENT_TO_PFN if swiotlb is enabled dma-mapping: bypass indirect calls for dma-direct vmd: use the proper dma_* APIs instead of direct methods calls dma-direct: merge swiotlb_dma_ops into the dma_direct code dma-direct: use dma_direct_map_page to implement dma_direct_map_sg dma-direct: improve addressability error reporting swiotlb: remove dma_mark_clean swiotlb: remove SWIOTLB_MAP_ERROR ACPI / scan: Refactor _CCA enforcement dma-mapping: factor out dummy DMA ops dma-mapping: always build the direct mapping code dma-mapping: move dma_cache_sync out of line dma-mapping: move various slow path functions out of line ...
2018-12-28Merge tag 'for-4.21/libata-20181221' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull libata updates from Jens Axboe: "Here are the libata changes for this merge window. Nothing major in here. This contains: - GPIO descriptor conversions (Linus Walleij) - rcar deferred probing fix (Sergei Shtylyov)" * tag 'for-4.21/libata-20181221' of git://git.kernel.dk/linux-block: sata_rcar: fix deferred probing ata: palmld: Introduce state container ata: palmld: Convert to GPIO descriptors ata: rb532_cf: Convert to use GPIO descriptors ata: sata_highbank: Convert to use GPIO descriptors ata: pxa: Drop <linux/gpio.h> include
2018-12-28Merge tag 'for-4.21/aio-20181221' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull aio updates from Jens Axboe: "Flushing out pre-patches for the buffered/polled aio series. Some fixes in here, but also optimizations" * tag 'for-4.21/aio-20181221' of git://git.kernel.dk/linux-block: aio: abstract out io_event filler helper aio: split out iocb copy from io_submit_one() aio: use iocb_put() instead of open coding it aio: only use blk plugs for > 2 depth submissions aio: don't zero entire aio_kiocb aio_get_req() aio: separate out ring reservation from req allocation aio: use assigned completion handler
2018-12-28Merge tag 'for-4.21/block-20181221' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block updates from Jens Axboe: "This is the main pull request for block/storage for 4.21. Larger than usual, it was a busy round with lots of goodies queued up. Most notable is the removal of the old IO stack, which has been a long time coming. No new features for a while, everything coming in this week has all been fixes for things that were previously merged. This contains: - Use atomic counters instead of semaphores for mtip32xx (Arnd) - Cleanup of the mtip32xx request setup (Christoph) - Fix for circular locking dependency in loop (Jan, Tetsuo) - bcache (Coly, Guoju, Shenghui) * Optimizations for writeback caching * Various fixes and improvements - nvme (Chaitanya, Christoph, Sagi, Jay, me, Keith) * host and target support for NVMe over TCP * Error log page support * Support for separate read/write/poll queues * Much improved polling * discard OOM fallback * Tracepoint improvements - lightnvm (Hans, Hua, Igor, Matias, Javier) * Igor added packed metadata to pblk. Now drives without metadata per LBA can be used as well. * Fix from Geert on uninitialized value on chunk metadata reads. * Fixes from Hans and Javier to pblk recovery and write path. * Fix from Hua Su to fix a race condition in the pblk recovery code. * Scan optimization added to pblk recovery from Zhoujie. * Small geometry cleanup from me. - Conversion of the last few drivers that used the legacy path to blk-mq (me) - Removal of legacy IO path in SCSI (me, Christoph) - Removal of legacy IO stack and schedulers (me) - Support for much better polling, now without interrupts at all. blk-mq adds support for multiple queue maps, which enables us to have a map per type. This in turn enables nvme to have separate completion queues for polling, which can then be interrupt-less. Also means we're ready for async polled IO, which is hopefully coming in the next release. - Killing of (now) unused block exports (Christoph) - Unification of the blk-rq-qos and blk-wbt wait handling (Josef) - Support for zoned testing with null_blk (Masato) - sx8 conversion to per-host tag sets (Christoph) - IO priority improvements (Damien) - mq-deadline zoned fix (Damien) - Ref count blkcg series (Dennis) - Lots of blk-mq improvements and speedups (me) - sbitmap scalability improvements (me) - Make core inflight IO accounting per-cpu (Mikulas) - Export timeout setting in sysfs (Weiping) - Cleanup the direct issue path (Jianchao) - Export blk-wbt internals in block debugfs for easier debugging (Ming) - Lots of other fixes and improvements" * tag 'for-4.21/block-20181221' of git://git.kernel.dk/linux-block: (364 commits) kyber: use sbitmap add_wait_queue/list_del wait helpers sbitmap: add helpers for add/del wait queue handling block: save irq state in blkg_lookup_create() dm: don't reuse bio for flushes nvme-pci: trace SQ status on completions nvme-rdma: implement polling queue map nvme-fabrics: allow user to pass in nr_poll_queues nvme-fabrics: allow nvmf_connect_io_queue to poll nvme-core: optionally poll sync commands block: make request_to_qc_t public nvme-tcp: fix spelling mistake "attepmpt" -> "attempt" nvme-tcp: fix endianess annotations nvmet-tcp: fix endianess annotations nvme-pci: refactor nvme_poll_irqdisable to make sparse happy nvme-pci: only set nr_maps to 2 if poll queues are supported nvmet: use a macro for default error location nvmet: fix comparison of a u16 with -1 blk-mq: enable IO poll if .nr_queues of type poll > 0 blk-mq: change blk_mq_queue_busy() to blk_mq_queue_inflight() blk-mq: skip zero-queue maps in blk_mq_map_swqueue ...
2018-12-28Merge tag 'y2038-for-4.21' of ↵Linus Torvalds
ssh://gitolite.kernel.org:/pub/scm/linux/kernel/git/arnd/playground Pull y2038 updates from Arnd Bergmann: "More syscalls and cleanups This concludes the main part of the system call rework for 64-bit time_t, which has spread over most of year 2018, the last six system calls being - ppoll - pselect6 - io_pgetevents - recvmmsg - futex - rt_sigtimedwait As before, nothing changes for 64-bit architectures, while 32-bit architectures gain another entry point that differs only in the layout of the timespec structure. Hopefully in the next release we can wire up all 22 of those system calls on all 32-bit architectures, which gives us a baseline version for glibc to start using them. This does not include the clock_adjtime, getrusage/waitid, and getitimer/setitimer system calls. I still plan to have new versions of those as well, but they are not required for correct operation of the C library since they can be emulated using the old 32-bit time_t based system calls. Aside from the system calls, there are also a few cleanups here, removing old kernel internal interfaces that have become unused after all references got removed. The arch/sh cleanups are part of this, there were posted several times over the past year without a reaction from the maintainers, while the corresponding changes made it into all other architectures" * tag 'y2038-for-4.21' of ssh://gitolite.kernel.org:/pub/scm/linux/kernel/git/arnd/playground: timekeeping: remove obsolete time accessors vfs: replace current_kernel_time64 with ktime equivalent timekeeping: remove timespec_add/timespec_del timekeeping: remove unused {read,update}_persistent_clock sh: remove board_time_init() callback sh: remove unused rtc_sh_get/set_time infrastructure sh: sh03: rtc: push down rtc class ops into driver sh: dreamcast: rtc: push down rtc class ops into driver y2038: signal: Add compat_sys_rt_sigtimedwait_time64 y2038: signal: Add sys_rt_sigtimedwait_time32 y2038: socket: Add compat_sys_recvmmsg_time64 y2038: futex: Add support for __kernel_timespec y2038: futex: Move compat implementation into futex.c io_pgetevents: use __kernel_timespec pselect6: use __kernel_timespec ppoll: use __kernel_timespec signal: Add restore_user_sigmask() signal: Add set_user_sigmask()
2018-12-28Fix failure path in alloc_pid()Matthew Wilcox
The failure path removes the allocated PIDs from the wrong namespace. This could lead to us inadvertently reusing PIDs in the leaf namespace and leaking PIDs in parent namespaces. Fixes: 95846ecf9dac ("pid: replace pid bitmap implementation with IDR API") Cc: <stable@vger.kernel.org> Signed-off-by: Matthew Wilcox <willy@infradead.org> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28kernel/fork.c: mark 'stack_vm_area' with __maybe_unusedYueHaibing
Fixes gcc '-Wunused-but-set-variable' warning when CONFIG_VMAP_STACK is not set: kernel/fork.c: In function 'dup_task_struct': kernel/fork.c:843:20: warning: variable 'stack_vm_area' set but not used [-Wunused-but-set-variable] Link: http://lkml.kernel.org/r/1545965190-2381-1-git-send-email-yuehaibing@huawei.com Signed-off-by: YueHaibing <yuehaibing@huawei.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28memcg, oom: notify on oom killer invocation from the charge pathMichal Hocko
Burt Holzman has noticed that memcg v1 doesn't notify about OOM events via eventfd anymore. The reason is that 29ef680ae7c2 ("memcg, oom: move out_of_memory back to the charge path") has moved the oom handling back to the charge path. While doing so the notification was left behind in mem_cgroup_oom_synchronize. Fix the issue by replicating the oom hierarchy locking and the notification. Link: http://lkml.kernel.org/r/20181224091107.18354-1-mhocko@kernel.org Fixes: 29ef680ae7c2 ("memcg, oom: move out_of_memory back to the charge path") Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Burt Holzman <burt@fnal.gov> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com Cc: <stable@vger.kernel.org> [4.19+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28mm, swap: fix swapoff with KSM pagesHuang Ying
KSM pages may be mapped to the multiple VMAs that cannot be reached from one anon_vma. So during swapin, a new copy of the page need to be generated if a different anon_vma is needed, please refer to comments of ksm_might_need_to_copy() for details. During swapoff, unuse_vma() uses anon_vma (if available) to locate VMA and virtual address mapped to the page, so not all mappings to a swapped out KSM page could be found. So in try_to_unuse(), even if the swap count of a swap entry isn't zero, the page needs to be deleted from swap cache, so that, in the next round a new page could be allocated and swapin for the other mappings of the swapped out KSM page. But this contradicts with the THP swap support. Where the THP could be deleted from swap cache only after the swap count of every swap entry in the huge swap cluster backing the THP has reach 0. So try_to_unuse() is changed in commit e07098294adf ("mm, THP, swap: support to reclaim swap space for THP swapped out") to check that before delete a page from swap cache, but this has broken KSM swapoff too. Fortunately, KSM is for the normal pages only, so the original behavior for KSM pages could be restored easily via checking PageTransCompound(). That is how this patch works. The bug is introduced by e07098294adf ("mm, THP, swap: support to reclaim swap space for THP swapped out"), which is merged by v4.14-rc1. So I think we should backport the fix to from 4.14 on. But Hugh thinks it may be rare for the KSM pages being in the swap device when swapoff, so nobody reports the bug so far. Link: http://lkml.kernel.org/r/20181226051522.28442-1-ying.huang@intel.com Fixes: e07098294adf ("mm, THP, swap: support to reclaim swap space for THP swapped out") Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reported-by: Hugh Dickins <hughd@google.com> Tested-by: Hugh Dickins <hughd@google.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Shaohua Li <shli@kernel.org> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28include/linux/gfp.h: fix typoKyle Spiers
Fix misspelled "satisfied" Link: http://lkml.kernel.org/r/20181227232354.64562-1-ksspiers@google.com Signed-off-by: Kyle Spiers <ksspiers@google.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28mm/hmm: fix memremap.h, move dev_page_fault_t callback to hmmDan Williams
The kbuild robot reported the following on a development branch that used memremap.h in a new path: In file included from arch/m68k/include/asm/pgtable_mm.h:148:0, from arch/m68k/include/asm/pgtable.h:5, from include/linux/memremap.h:7, from drivers//dax/bus.c:3: arch/m68k/include/asm/motorola_pgtable.h: In function 'pgd_offset': >> arch/m68k/include/asm/motorola_pgtable.h:199:11: error: dereferencing pointer to incomplete type 'const struct mm_struct' return mm->pgd + pgd_index(address); ^~ The ->page_fault() callback is specific to HMM. Move it to 'struct hmm_devmem' where the unusual asm/pgtable.h dependency can be contained in include/linux/hmm.h. Longer term refactoring this dependency out of HMM is recommended, but in the meantime memremap.h remains generic. Link: http://lkml.kernel.org/r/154534090899.3120190.6652620807617715272.stgit@dwillia2-desk3.amr.corp.intel.com Fixes: 5042db43cc26 ("mm/ZONE_DEVICE: new type of ZONE_DEVICE memory...") Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: "Jérôme Glisse" <jglisse@redhat.com> Cc: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28hugetlbfs: Use i_mmap_rwsem to fix page fault/truncate raceMike Kravetz
hugetlbfs page faults can race with truncate and hole punch operations. Current code in the page fault path attempts to handle this by 'backing out' operations if we encounter the race. One obvious omission in the current code is removing a page newly added to the page cache. This is pretty straight forward to address, but there is a more subtle and difficult issue of backing out hugetlb reservations. To handle this correctly, the 'reservation state' before page allocation needs to be noted so that it can be properly backed out. There are four distinct possibilities for reservation state: shared/reserved, shared/no-resv, private/reserved and private/no-resv. Backing out a reservation may require memory allocation which could fail so that needs to be taken into account as well. Instead of writing the required complicated code for this rare occurrence, just eliminate the race. i_mmap_rwsem is now held in read mode for the duration of page fault processing. Hold i_mmap_rwsem longer in truncation and hold punch code to cover the call to remove_inode_hugepages. With this modification, code in remove_inode_hugepages checking for races becomes 'dead' as it can not longer happen. Remove the dead code and expand comments to explain reasoning. Similarly, checks for races with truncation in the page fault path can be simplified and removed. [mike.kravetz@oracle.com: incorporat suggestions from Kirill] Link: http://lkml.kernel.org/r/20181222223013.22193-3-mike.kravetz@oracle.com Link: http://lkml.kernel.org/r/20181218223557.5202-3-mike.kravetz@oracle.com Fixes: ebed4bfc8da8 ("hugetlb: fix absurd HugePages_Rsvd") Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Prakash Sangappa <prakash.sangappa@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronizationMike Kravetz
While looking at BUGs associated with invalid huge page map counts, it was discovered and observed that a huge pte pointer could become 'invalid' and point to another task's page table. Consider the following: A task takes a page fault on a shared hugetlbfs file and calls huge_pte_alloc to get a ptep. Suppose the returned ptep points to a shared pmd. Now, another task truncates the hugetlbfs file. As part of truncation, it unmaps everyone who has the file mapped. If the range being truncated is covered by a shared pmd, huge_pmd_unshare will be called. For all but the last user of the shared pmd, huge_pmd_unshare will clear the pud pointing to the pmd. If the task in the middle of the page fault is not the last user, the ptep returned by huge_pte_alloc now points to another task's page table or worse. This leads to bad things such as incorrect page map/reference counts or invalid memory references. To fix, expand the use of i_mmap_rwsem as follows: - i_mmap_rwsem is held in read mode whenever huge_pmd_share is called. huge_pmd_share is only called via huge_pte_alloc, so callers of huge_pte_alloc take i_mmap_rwsem before calling. In addition, callers of huge_pte_alloc continue to hold the semaphore until finished with the ptep. - i_mmap_rwsem is held in write mode whenever huge_pmd_unshare is called. [mike.kravetz@oracle.com: add explicit check for mapping != null] Link: http://lkml.kernel.org/r/20181218223557.5202-2-mike.kravetz@oracle.com Fixes: 39dde65c9940 ("shared page table for hugetlb page") Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Prakash Sangappa <prakash.sangappa@oracle.com> Cc: Colin Ian King <colin.king@canonical.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28memory_hotplug: add missing newlines to debugging outputMichal Hocko
pages_correctly_probed is missing new lines which means that the line is not printed rightaway but it rather waits for additional printks. Add \n to all three messages in pages_correctly_probed. Link: http://lkml.kernel.org/r/20181218162307.10518-1-mhocko@kernel.org Fixes: b77eab7079d9 ("mm/memory_hotplug: optimize probe routine") Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28mm: remove __hugepage_set_anon_rmap()Kirill Tkhai
This function is identical to __page_set_anon_rmap() since the time, when it was introduced (8 years ago). The patch removes the function, and makes its users to use __page_set_anon_rmap() instead. Link: http://lkml.kernel.org/r/154504875359.30235.6237926369392564851.stgit@localhost.localdomain Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Jerome Glisse <jglisse@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28include/linux/vmstat.h: remove unused page state adjustment macroWei Yang
These four macro are not used anymore. Just remove them. Link: http://lkml.kernel.org/r/20181214063211.2290-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28mm/page_alloc.c: allow error injectionBenjamin Poirier
Model call chain after should_failslab(). Likewise, we can now use a kprobe to override the return value of should_fail_alloc_page() and inject allocation failures into alloc_page*(). This will allow injecting allocation failures using the BCC tools even without building kernel with CONFIG_FAIL_PAGE_ALLOC and booting it with a fail_page_alloc= parameter, which incurs some overhead even when failures are not being injected. On the other hand, this patch adds an unconditional call to should_fail_alloc_page() from page allocation hotpath. That overhead should be rather negligible with CONFIG_FAIL_PAGE_ALLOC=n when there's no kprobe attached, though. [vbabka@suse.cz: changelog addition] Link: http://lkml.kernel.org/r/20181214074330.18917-1-bpoirier@suse.com Signed-off-by: Benjamin Poirier <bpoirier@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Pavel Tatashin <pavel.tatashin@microsoft.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28mm: migrate: drop unused argument of migrate_page_move_mapping()Jan Kara
All callers of migrate_page_move_mapping() now pass NULL for 'head' argument. Drop it. Link: http://lkml.kernel.org/r/20181211172143.7358-7-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28blkdev: avoid migration stalls for blkdev pagesJan Kara
Currently, block device pages don't provide a ->migratepage callback and thus fallback_migrate_page() is used for them. This handler cannot deal with dirty pages in async mode and also with the case a buffer head is in the LRU buffer head cache (as it has elevated b_count). Thus such page can block memory offlining. Fix the problem by using buffer_migrate_page_norefs() for migrating block device pages. That function takes care of dropping bh LRU in case migration would fail due to elevated buffer refcount to avoid stalls and can also migrate dirty pages without writing them. Link: http://lkml.kernel.org/r/20181211172143.7358-6-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28mm: migrate: provide buffer_migrate_page_norefs()Jan Kara
Provide a variant of buffer_migrate_page() that also checks whether there are no unexpected references to buffer heads. This function will then be safe to use for block device pages. [akpm@linux-foundation.org: remove EXPORT_SYMBOL(buffer_migrate_page_norefs)] Link: http://lkml.kernel.org/r/20181211172143.7358-5-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28mm: migrate: move migrate_page_lock_buffers()Jan Kara
buffer_migrate_page() is the only caller of migrate_page_lock_buffers() move it close to it and also drop the now unused stub for !CONFIG_BLOCK. Link: http://lkml.kernel.org/r/20181211172143.7358-4-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28mm: migrate: lock buffers before migrate_page_move_mapping()Jan Kara
Lock buffers before calling into migrate_page_move_mapping() so that that function doesn't have to know about buffers (which is somewhat unexpected anyway) and all the buffer head logic is in buffer_migrate_page(). Link: http://lkml.kernel.org/r/20181211172143.7358-3-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28mm: migration: factor out code to compute expected number of page referencesJan Kara
Patch series "mm: migrate: Fix page migration stalls for blkdev pages". This patchset deals with page migration stalls that were reported by our customer due to a block device page that had a bufferhead that was in the bh LRU cache. The patchset modifies the page migration code so that bufferheads are completely handled inside buffer_migrate_page() and then provides a new migration helper for pages with buffer heads that is safe to use even for block device pages and that also deals with bh lrus. This patch (of 6): Factor out function to compute number of expected page references in migrate_page_move_mapping(). Note that we move hpage_nr_pages() and page_has_private() checks from under xas_lock_irq() however this is safe since we hold page lock. [jack@suse.cz: fix expected_page_refs()] Link: http://lkml.kernel.org/r/20181217131710.GB8611@quack2.suse.cz Link: http://lkml.kernel.org/r/20181211172143.7358-2-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28mm, page_alloc: enable pcpu_drain with zone capabilityWei Yang
drain_all_pages is documented to drain per-cpu pages for a given zone (if non-NULL). The current implementation doesn't match the description though. It will drain all pcp pages for all zones that happen to have cached pages on the same cpu as the given zone. This will lead to premature pcp cache draining for zones that are not of any interest to the caller - e.g. compaction, hwpoison or memory offline. This forces the page allocator to take locks and potential lock contention as a result. There is no real reason for this sub-optimal implementation. Replace per-cpu work item with a dedicated structure which contains a pointer to the zone and pass it over to the worker. This will get the zone information all the way down to the worker function and do the right job. [akpm@linux-foundation.org: avoid 80-col tricks] [mhocko@suse.com: refactor the whole changelog] Link: http://lkml.kernel.org/r/20181212142550.61686-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28kmemleak: add config to select auto scanSri Krishna chowdary
Kmemleak scan can be cpu intensive and can stall user tasks at times. To prevent this, add config DEBUG_KMEMLEAK_AUTO_SCAN to enable/disable auto scan on boot up. Also protect first_run with DEBUG_KMEMLEAK_AUTO_SCAN as this is meant for only first automatic scan. Link: http://lkml.kernel.org/r/1540231723-7087-1-git-send-email-prpatel@nvidia.com Signed-off-by: Sri Krishna chowdary <schowdary@nvidia.com> Signed-off-by: Sachin Nikam <snikam@nvidia.com> Signed-off-by: Prateek <prpatel@nvidia.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28mm/page_alloc.c: don't call kasan_free_pages() at deferred mem initWaiman Long
When CONFIG_KASAN is enabled on large memory SMP systems, the deferrred pages initialization can take a long time. Below were the reported init times on a 8-socket 96-core 4TB IvyBridge system. 1) Non-debug kernel without CONFIG_KASAN [ 8.764222] node 1 initialised, 132086516 pages in 7027ms 2) Debug kernel with CONFIG_KASAN [ 146.288115] node 1 initialised, 132075466 pages in 143052ms So the page init time in a debug kernel was 20X of the non-debug kernel. The long init time can be problematic as the page initialization is done with interrupt disabled. In this particular case, it caused the appearance of following warning messages as well as NMI backtraces of all the cores that were doing the initialization. [ 68.240049] rcu: INFO: rcu_sched detected stalls on CPUs/tasks: [ 68.241000] rcu: 25-...0: (100 ticks this GP) idle=b72/1/0x4000000000000000 softirq=915/915 fqs=16252 [ 68.241000] rcu: 44-...0: (95 ticks this GP) idle=49a/1/0x4000000000000000 softirq=788/788 fqs=16253 [ 68.241000] rcu: 54-...0: (104 ticks this GP) idle=03a/1/0x4000000000000000 softirq=721/825 fqs=16253 [ 68.241000] rcu: 60-...0: (103 ticks this GP) idle=cbe/1/0x4000000000000000 softirq=637/740 fqs=16253 [ 68.241000] rcu: 72-...0: (105 ticks this GP) idle=786/1/0x4000000000000000 softirq=536/641 fqs=16253 [ 68.241000] rcu: 84-...0: (99 ticks this GP) idle=292/1/0x4000000000000000 softirq=537/537 fqs=16253 [ 68.241000] rcu: 111-...0: (104 ticks this GP) idle=bde/1/0x4000000000000000 softirq=474/476 fqs=16253 [ 68.241000] rcu: (detected by 13, t=65018 jiffies, g=249, q=2) The long init time was mainly caused by the call to kasan_free_pages() to poison the newly initialized pages. On a 4TB system, we are talking about almost 500GB of memory probably on the same node. In reality, we may not need to poison the newly initialized pages before they are ever allocated. So KASAN poisoning of freed pages before the completion of deferred memory initialization is now disabled. Those pages will be properly poisoned when they are allocated or freed after deferred pages initialization is done. With this change, the new page initialization time became: [ 21.948010] node 1 initialised, 132075466 pages in 18702ms This was still about double the non-debug kernel time, but was much better than before. Link: http://lkml.kernel.org/r/1544459388-8736-1-git-send-email-longman@redhat.com Signed-off-by: Waiman Long <longman@redhat.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28userfaultfd: clear flag if remap event not enabledPeter Xu
When the process being tracked does mremap() without UFFD_FEATURE_EVENT_REMAP on the corresponding tracking uffd file handle, we should not generate the remap event, and at the same time we should clear all the uffd flags on the new VMA. Without this patch, we can still have the VM_UFFD_MISSING|VM_UFFD_WP flags on the new VMA even the fault handling process does not even know the existance of the VMA. Link: http://lkml.kernel.org/r/20181211053409.20317-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Hugh Dickins <hughd@google.com> Cc: Pavel Emelyanov <xemul@virtuozzo.com> Cc: Pravin Shedge <pravin.shedge4linux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28mm/pageblock: throw compile error if pageblock_bits cannot hold MIGRATE_TYPESPingfan Liu
Currently, NR_PAGEBLOCK_BITS and MIGRATE_TYPES are not associated by code. If someone adds extra migrate type, then he may forget to enlarge the NR_PAGEBLOCK_BITS. Hence it requires some way to fix. NR_PAGEBLOCK_BITS depends on MIGRATE_TYPES, while these macro spread on two different .h file with reverse dependency, it is a little hard to refer to MIGRATE_TYPES in pageblock-flag.h. This patch tries to remind such relation in compiling-time. Link: http://lkml.kernel.org/r/1544508709-11358-1-git-send-email-kernelfans@gmail.com Signed-off-by: Pingfan Liu <kernelfans@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@suse.com> Cc: Pavel Tatashin <pavel.tatashin@microsoft.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28ksm: react on changing "sleep_millisecs" parameter fasterKirill Tkhai
ksm thread unconditionally sleeps in ksm_scan_thread() after each iteration: schedule_timeout_interruptible( msecs_to_jiffies(ksm_thread_sleep_millisecs)) The timeout is configured in /sys/kernel/mm/ksm/sleep_millisecs. In case of user writes a big value by a mistake, and the thread enters into schedule_timeout_interruptible(), it's not possible to cancel the sleep by writing a new smaler value; the thread is just sleeping till timeout expires. The patch fixes the problem by waking the thread each time after the value is updated. This also may be useful for debug purposes; and also for userspace daemons, which change sleep_millisecs value in dependence of system load. Link: http://lkml.kernel.org/r/154454107680.3258.3558002210423531566.stgit@localhost.localdomain Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Cyrill Gorcunov <gorcunov@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28mm, fault_around: do not take a reference to a locked pageMichal Hocko
filemap_map_pages takes a speculative reference to each page in the range before it tries to lock that page. While this is correct it also can influence page migration which will bail out when seeing an elevated reference count. The faultaround code would bail on seeing a locked page so we can pro-actively check the PageLocked bit before page_cache_get_speculative and prevent from pointless reference count churn. Link: http://lkml.kernel.org/r/20181211142741.2607-4-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Suggested-by: Jan Kara <jack@suse.cz> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Hugh Dickins <hughd@google.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28mm, memory_hotplug: deobfuscate migration part of offliningMichal Hocko
Memory migration might fail during offlining and we keep retrying in that case. This is currently obfuscated by goto retry loop. The code is hard to follow and as a result it is even suboptimal becase each retry round scans the full range from start_pfn even though we have successfully scanned/migrated [start_pfn, pfn] range already. This is all only because check_pages_isolated failure has to rescan the full range again. De-obfuscate the migration retry loop by promoting it to a real for loop. In fact remove the goto altogether by making it a proper double loop (yeah, gotos are nasty in this specific case). In the end we will get a slightly more optimal code which is better readable. [akpm@linux-foundation.org: reflow comments to 80 cols] Link: http://lkml.kernel.org/r/20181211142741.2607-3-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28mm, memory_hotplug: try to migrate full pfn rangeMichal Hocko
Patch series "few memory offlining enhancements". I have been chasing memory offlining not making progress recently. On the way I have noticed few weird decisions in the code. The migration itself is restricted without a reasonable justification and the retry loop around the migration is quite messy. This is addressed by patch 1 and patch 2. Patch 3 is targeting on the faultaround code which has been a hot candidate for the initial issue reported upstream [2] and that I am debugging internally. It turned out to be not the main contributor in the end but I believe we should address it regardless. See the patch description for more details. [1] http://lkml.kernel.org/r/20181120134323.13007-1-mhocko@kernel.org [2] http://lkml.kernel.org/r/20181114070909.GB2653@MiWiFi-R3L-srv This patch (of 3): do_migrate_range has been limiting the number of pages to migrate to 256 for some reason which is not documented. Even if the limit made some sense back then when it was introduced it doesn't really serve a good purpose these days. If the range contains huge pages then we break out of the loop too early and go through LRU and pcp caches draining and scan_movable_pages is quite suboptimal. The only reason to limit the number of pages I can think of is to reduce the potential time to react on the fatal signal. But even then the number of pages is a questionable metric because even a single page migration might block in a non-killable state (e.g. __unmap_and_move). Remove the limit and offline the full requested range (this is one memblock worth of pages with the current code). Should we ever get a report that offlining takes too long to react on fatal signal then we should rather fix the core migration to use killable waits and bailout on a signal. Link: http://lkml.kernel.org/r/20181211142741.2607-1-mhocko@kernel.org Link: http://lkml.kernel.org/r/20181211142741.2607-2-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28mm, proc: report PR_SET_THP_DISABLE in procMichal Hocko
David Rientjes has reported that commit 1860033237d4 ("mm: make PR_SET_THP_DISABLE immediately active") has changed the way how we report THPable VMAs to the userspace. Their monitoring tool is triggering false alarms on PR_SET_THP_DISABLE tasks because it considers an insufficient THP usage as a memory fragmentation resp. memory pressure issue. Before the said commit each newly created VMA inherited VM_NOHUGEPAGE flag and that got exposed to the userspace via /proc/<pid>/smaps file. This implementation had its downsides as explained in the commit message but it is true that the userspace doesn't have any means to query for the process wide THP enabled/disabled status. PR_SET_THP_DISABLE is a process wide flag so it makes a lot of sense to export in the process wide context rather than per-vma. Introduce a new field to /proc/<pid>/status which export this status. If PR_SET_THP_DISABLE is used then it reports false same as when the THP is not compiled in. It doesn't consider the global THP status because we already export that information via sysfs Link: http://lkml.kernel.org/r/20181211143641.3503-4-mhocko@kernel.org Fixes: 1860033237d4 ("mm: make PR_SET_THP_DISABLE immediately active") Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reported-by: David Rientjes <rientjes@google.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Paul Oppenheimer <bepvte@gmail.com> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28mm, thp, proc: report THP eligibility for each vmaMichal Hocko
Userspace falls short when trying to find out whether a specific memory range is eligible for THP. There are usecases that would like to know that http://lkml.kernel.org/r/alpine.DEB.2.21.1809251248450.50347@chino.kir.corp.google.com : This is used to identify heap mappings that should be able to fault thp : but do not, and they normally point to a low-on-memory or fragmentation : issue. The only way to deduce this now is to query for hg resp. nh flags and confronting the state with the global setting. Except that there is also PR_SET_THP_DISABLE that might change the picture. So the final logic is not trivial. Moreover the eligibility of the vma depends on the type of VMA as well. In the past we have supported only anononymous memory VMAs but things have changed and shmem based vmas are supported as well these days and the query logic gets even more complicated because the eligibility depends on the mount option and another global configuration knob. Simplify the current state and report the THP eligibility in /proc/<pid>/smaps for each existing vma. Reuse transparent_hugepage_enabled for this purpose. The original implementation of this function assumes that the caller knows that the vma itself is supported for THP so make the core checks into __transparent_hugepage_enabled and use it for existing callers. __show_smap just use the new transparent_hugepage_enabled which also checks the vma support status (please note that this one has to be out of line due to include dependency issues). [mhocko@kernel.org: fix oops with NULL ->f_mapping] Link: http://lkml.kernel.org/r/20181224185106.GC16738@dhcp22.suse.cz Link: http://lkml.kernel.org/r/20181211143641.3503-3-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Paul Oppenheimer <bepvte@gmail.com> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28mm, proc: be more verbose about unstable VMA flags in /proc/<pid>/smapsMichal Hocko
Patch series "THP eligibility reporting via proc". This series of three patches aims at making THP eligibility reporting much more robust and long term sustainable. The trigger for the change is a regression report [2] and the long follow up discussion. In short the specific application didn't have good API to query whether a particular mapping can be backed by THP so it has used VMA flags to workaround that. These flags represent a deep internal state of VMAs and as such they should be used by userspace with a great deal of caution. A similar has happened for [3] when users complained that VM_MIXEDMAP is no longer set on DAX mappings. Again a lack of a proper API led to an abuse. The first patch in the series tries to emphasise that that the semantic of flags might change and any application consuming those should be really careful. The remaining two patches provide a more suitable interface to address [2] and provide a consistent API to query the THP status both for each VMA and process wide as well. [1] http://lkml.kernel.org/r/20181120103515.25280-1-mhocko@kernel.org [2] http://lkml.kernel.org/r/http://lkml.kernel.org/r/alpine.DEB.2.21.1809241054050.224429@chino.kir.corp.google.com [3] http://lkml.kernel.org/r/20181002100531.GC4135@quack2.suse.cz This patch (of 3): Even though vma flags exported via /proc/<pid>/smaps are explicitly documented to be not guaranteed for future compatibility the warning doesn't go far enough because it doesn't mention semantic changes to those flags. And they are important as well because these flags are a deep implementation internal to the MM code and the semantic might change at any time. Let's consider two recent examples: http://lkml.kernel.org/r/20181002100531.GC4135@quack2.suse.cz : commit e1fb4a086495 "dax: remove VM_MIXEDMAP for fsdax and device dax" has : removed VM_MIXEDMAP flag from DAX VMAs. Now our testing shows that in the : mean time certain customer of ours started poking into /proc/<pid>/smaps : and looks at VMA flags there and if VM_MIXEDMAP is missing among the VMA : flags, the application just fails to start complaining that DAX support is : missing in the kernel. http://lkml.kernel.org/r/alpine.DEB.2.21.1809241054050.224429@chino.kir.corp.google.com : Commit 1860033237d4 ("mm: make PR_SET_THP_DISABLE immediately active") : introduced a regression in that userspace cannot always determine the set : of vmas where thp is ineligible. : Userspace relies on the "nh" flag being emitted as part of /proc/pid/smaps : to determine if a vma is eligible to be backed by hugepages. : Previous to this commit, prctl(PR_SET_THP_DISABLE, 1) would cause thp to : be disabled and emit "nh" as a flag for the corresponding vmas as part of : /proc/pid/smaps. After the commit, thp is disabled by means of an mm : flag and "nh" is not emitted. : This causes smaps parsing libraries to assume a vma is eligible for thp : and ends up puzzling the user on why its memory is not backed by thp. In both cases userspace was relying on a semantic of a specific VMA flag. The primary reason why that happened is a lack of a proper interface. While this has been worked on and it will be fixed properly, it seems that our wording could see some refinement and be more vocal about semantic aspect of these flags as well. Link: http://lkml.kernel.org/r/20181211143641.3503-2-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Jan Kara <jack@suse.cz> Acked-by: Dan Williams <dan.j.williams@intel.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Paul Oppenheimer <bepvte@gmail.com> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28include/linux/memory_hotplug.h: remove duplicate declaration of offline_pages()Wei Yang
offline_pages() is already declared in this file. Just remove the duplicated one. Link: http://lkml.kernel.org/r/20181205031357.24769-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28mm/mmu_notifier: use structure for invalidate_range_start/end calls v2Jérôme Glisse
To avoid having to change many call sites everytime we want to add a parameter use a structure to group all parameters for the mmu_notifier invalidate_range_start/end cakks. No functional changes with this patch. [akpm@linux-foundation.org: coding style fixes] Link: http://lkml.kernel.org/r/20181205053628.3210-3-jglisse@redhat.com Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Acked-by: Christian König <christian.koenig@amd.com> Acked-by: Jan Kara <jack@suse.cz> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Ross Zwisler <zwisler@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krcmar <rkrcmar@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Felix Kuehling <felix.kuehling@amd.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> From: Jérôme Glisse <jglisse@redhat.com> Subject: mm/mmu_notifier: use structure for invalidate_range_start/end calls v3 fix build warning in migrate.c when CONFIG_MMU_NOTIFIER=n Link: http://lkml.kernel.org/r/20181213171330.8489-3-jglisse@redhat.com Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28mm/mmu_notifier: use structure for invalidate_range_start/end callbackJérôme Glisse
Patch series "mmu notifier contextual informations", v2. This patchset adds contextual information, why an invalidation is happening, to mmu notifier callback. This is necessary for user of mmu notifier that wish to maintains their own data structure without having to add new fields to struct vm_area_struct (vma). For instance device can have they own page table that mirror the process address space. When a vma is unmap (munmap() syscall) the device driver can free the device page table for the range. Today we do not have any information on why a mmu notifier call back is happening and thus device driver have to assume that it is always an munmap(). This is inefficient at it means that it needs to re-allocate device page table on next page fault and rebuild the whole device driver data structure for the range. Other use case beside munmap() also exist, for instance it is pointless for device driver to invalidate the device page table when the invalidation is for the soft dirtyness tracking. Or device driver can optimize away mprotect() that change the page table permission access for the range. This patchset enables all this optimizations for device drivers. I do not include any of those in this series but another patchset I am posting will leverage this. The patchset is pretty simple from a code point of view. The first two patches consolidate all mmu notifier arguments into a struct so that it is easier to add/change arguments. The last patch adds the contextual information (munmap, protection, soft dirty, clear, ...). This patch (of 3): To avoid having to change many callback definition everytime we want to add a parameter use a structure to group all parameters for the mmu_notifier invalidate_range_start/end callback. No functional changes with this patch. [akpm@linux-foundation.org: fix drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c kerneldoc] Link: http://lkml.kernel.org/r/20181205053628.3210-2-jglisse@redhat.com Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Acked-by: Jan Kara <jack@suse.cz> Acked-by: Jason Gunthorpe <jgg@mellanox.com> [infiniband] Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Ross Zwisler <zwisler@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krcmar <rkrcmar@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Christian Koenig <christian.koenig@amd.com> Cc: Felix Kuehling <felix.kuehling@amd.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28hwpoison, memory_hotplug: allow hwpoisoned pages to be offlinedMichal Hocko
We have received a bug report that an injected MCE about faulty memory prevents memory offline to succeed on 4.4 base kernel. The underlying reason was that the HWPoison page has an elevated reference count and the migration keeps failing. There are two problems with that. First of all it is dubious to migrate the poisoned page because we know that accessing that memory is possible to fail. Secondly it doesn't make any sense to migrate a potentially broken content and preserve the memory corruption over to a new location. Oscar has found out that 4.4 and the current upstream kernels behave slightly differently with his simply testcase === int main(void) { int ret; int i; int fd; char *array = malloc(4096); char *array_locked = malloc(4096); fd = open("/tmp/data", O_RDONLY); read(fd, array, 4095); for (i = 0; i < 4096; i++) array_locked[i] = 'd'; ret = mlock((void *)PAGE_ALIGN((unsigned long)array_locked), sizeof(array_locked)); if (ret) perror("mlock"); sleep (20); ret = madvise((void *)PAGE_ALIGN((unsigned long)array_locked), 4096, MADV_HWPOISON); if (ret) perror("madvise"); for (i = 0; i < 4096; i++) array_locked[i] = 'd'; return 0; } === + offline this memory. In 4.4 kernels he saw the hwpoisoned page to be returned back to the LRU list kernel: [<ffffffff81019ac9>] dump_trace+0x59/0x340 kernel: [<ffffffff81019e9a>] show_stack_log_lvl+0xea/0x170 kernel: [<ffffffff8101ac71>] show_stack+0x21/0x40 kernel: [<ffffffff8132bb90>] dump_stack+0x5c/0x7c kernel: [<ffffffff810815a1>] warn_slowpath_common+0x81/0xb0 kernel: [<ffffffff811a275c>] __pagevec_lru_add_fn+0x14c/0x160 kernel: [<ffffffff811a2eed>] pagevec_lru_move_fn+0xad/0x100 kernel: [<ffffffff811a334c>] __lru_cache_add+0x6c/0xb0 kernel: [<ffffffff81195236>] add_to_page_cache_lru+0x46/0x70 kernel: [<ffffffffa02b4373>] extent_readpages+0xc3/0x1a0 [btrfs] kernel: [<ffffffff811a16d7>] __do_page_cache_readahead+0x177/0x200 kernel: [<ffffffff811a18c8>] ondemand_readahead+0x168/0x2a0 kernel: [<ffffffff8119673f>] generic_file_read_iter+0x41f/0x660 kernel: [<ffffffff8120e50d>] __vfs_read+0xcd/0x140 kernel: [<ffffffff8120e9ea>] vfs_read+0x7a/0x120 kernel: [<ffffffff8121404b>] kernel_read+0x3b/0x50 kernel: [<ffffffff81215c80>] do_execveat_common.isra.29+0x490/0x6f0 kernel: [<ffffffff81215f08>] do_execve+0x28/0x30 kernel: [<ffffffff81095ddb>] call_usermodehelper_exec_async+0xfb/0x130 kernel: [<ffffffff8161c045>] ret_from_fork+0x55/0x80 And that latter confuses the hotremove path because an LRU page is attempted to be migrated and that fails due to an elevated reference count. It is quite possible that the reuse of the HWPoisoned page is some kind of fixed race condition but I am not really sure about that. With the upstream kernel the failure is slightly different. The page doesn't seem to have LRU bit set but isolate_movable_page simply fails and do_migrate_range simply puts all the isolated pages back to LRU and therefore no progress is made and scan_movable_pages finds same set of pages over and over again. Fix both cases by explicitly checking HWPoisoned pages before we even try to get reference on the page, try to unmap it if it is still mapped. As explained by Naoya: : Hwpoison code never unmapped those for no big reason because : Ksm pages never dominate memory, so we simply didn't have strong : motivation to save the pages. Also put WARN_ON(PageLRU) in case there is a race and we can hit LRU HWPoison pages which shouldn't happen but I couldn't convince myself about that. Naoya has noted the following: : Theoretically no such gurantee, because try_to_unmap() doesn't have a : guarantee of success and then memory_failure() returns immediately : when hwpoison_user_mappings fails. : Or the following code (comes after hwpoison_user_mappings block) also impli= : es : that the target page can still have PageLRU flag. : : /* : * Torn down by someone else? : */ : if (PageLRU(p) && !PageSwapCache(p) && p->mapping =3D=3D NULL) { : action_result(pfn, MF_MSG_TRUNCATED_LRU, MF_IGNORED); : res =3D -EBUSY; : goto out; : } : : So I think it's OK to keep "if (WARN_ON(PageLRU(page)))" block in : current version of your patch. Link: http://lkml.kernel.org/r/20181206120135.14079-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Oscar Salvador <osalvador@suse.com> Debugged-by: Oscar Salvador <osalvador@suse.com> Tested-by: Oscar Salvador <osalvador@suse.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28mm, kmemleak: little optimization while scanningOscar Salvador
kmemleak_scan() goes through all online nodes and tries to scan all used pages. We can do better and use pfn_to_online_page(), so in case we have CONFIG_MEMORY_HOTPLUG, offlined pages will be skiped automatically. For boxes where CONFIG_MEMORY_HOTPLUG is not present, pfn_to_online_page() will fallback to pfn_valid(). Another little optimization is to check if the page belongs to the node we are currently checking, so in case we have nodes interleaved we will not check the same pfn multiple times. I ran some tests: Add some memory to node1 and node2 making it interleaved: (qemu) object_add memory-backend-ram,id=ram0,size=1G (qemu) device_add pc-dimm,id=dimm0,memdev=ram0,node=1 (qemu) object_add memory-backend-ram,id=ram1,size=1G (qemu) device_add pc-dimm,id=dimm1,memdev=ram1,node=2 (qemu) object_add memory-backend-ram,id=ram2,size=1G (qemu) device_add pc-dimm,id=dimm2,memdev=ram2,node=1 Then, we offline that memory: # for i in {32..39} ; do echo "offline" > /sys/devices/system/node/node1/memory$i/state;done # for i in {48..55} ; do echo "offline" > /sys/devices/system/node/node1/memory$i/state;don # for i in {40..47} ; do echo "offline" > /sys/devices/system/node/node2/memory$i/state;done And we run kmemleak_scan: # echo "scan" > /sys/kernel/debug/kmemleak before the patch: kmemleak: time spend: 41596 us after the patch: kmemleak: time spend: 34899 us [akpm@linux-foundation.org: remove stray newline, per Oscar] Link: http://lkml.kernel.org/r/20181206131918.25099-1-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Suggested-by: Michal Hocko <mhocko@suse.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>