summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2015-01-13Merge git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pendingLinus Torvalds
Pull scsi target fixes from Nicholas Bellinger: "Mostly minor fixes this time, including: - Add missing virtio-scsi -> TCM attribute conversion in vhost-scsi. - Fix persistent reservations write exclusive handling to allow readers for all registered I_T nexuses. - Drop arbitrary maximum I/O size limit in order to process I/Os larger than 4 MB, required for initiators that don't honor block limits EVPD. - Drop the now left-over fabric_max_sectors attribute" * git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending: iscsi-target: Fix typos in enum cmd_flags_table MAINTAINERS: Add entry for iSER target driver target: Allow Write Exclusive non-reservation holders to READ target: Drop left-over fabric_max_sectors attribute target: Drop arbitrary maximum I/O size limit Documentation/target: Update fabric_ops to latest code vhost-scsi: Add missing virtio-scsi -> TCM attribute conversion
2015-01-13mm: mmu_gather: use tlb->end != 0 only for TLB invalidationWill Deacon
When batching up address ranges for TLB invalidation, we check tlb->end != 0 to indicate that some pages have actually been unmapped. As of commit f045bbb9fa1b ("mmu_gather: fix over-eager tlb_flush_mmu_free() calling"), we use the same check for freeing these pages in order to avoid a performance regression where we call free_pages_and_swap_cache even when no pages are actually queued up. Unfortunately, the range could have been reset (tlb->end = 0) by tlb_end_vma, which has been shown to cause memory leaks on arm64. Furthermore, investigation into these leaks revealed that the fullmm case on task exit no longer invalidates the TLB, by virtue of tlb->end == 0 (in 3.18, need_flush would have been set). This patch resolves the problem by reverting commit f045bbb9fa1b, using instead tlb->local.nr as the predicate for page freeing in tlb_flush_mmu_free and ensuring that tlb->end is initialised to a non-zero value in the fullmm case. Tested-by: Mark Langsdorf <mlangsdo@redhat.com> Tested-by: Dave Hansen <dave@sr71.net> Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-01-11linux 3.19-rc4v3.19-rc4Linus Torvalds
2015-01-11Merge branch 'fixes' of git://ftp.arm.linux.org.uk/~rmk/linux-armLinus Torvalds
Pull ARM fixes from Russell King: "Three small fixes from over the Christmas period, and wiring up the new execveat syscall for ARM" * 'fixes' of git://ftp.arm.linux.org.uk/~rmk/linux-arm: ARM: 8275/1: mm: fix PMD_SECT_RDONLY undeclared compile error ARM: 8253/1: mm: use phys_addr_t type in map_lowmem() for kernel mem region ARM: 8249/1: mm: dump: don't skip regions ARM: wire up execveat syscall
2015-01-11Merge branch 'x86-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Ingo Molnar: "Misc fixes: two vdso fixes, two kbuild fixes and a boot failure fix with certain odd memory mappings" * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86, vdso: Use asm volatile in __getcpu x86/build: Clean auto-generated processor feature files x86: Fix mkcapflags.sh bash-ism x86: Fix step size adjustment during initial memory mapping x86_64, vdso: Fix the vdso address randomization algorithm
2015-01-11Merge branch 'sched-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Ingo Molnar: "Misc fixes: group scheduling corner case fix, two deadline scheduler fixes, effective_load() overflow fix, nested sleep fix, 6144 CPUs system fix" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/fair: Fix RCU stall upon -ENOMEM in sched_create_group() sched/deadline: Avoid double-accounting in case of missed deadlines sched/deadline: Fix migration of SCHED_DEADLINE tasks sched: Fix odd values in effective_load() calculations sched, fanotify: Deal with nested sleeps sched: Fix KMALLOC_MAX_SIZE overflow during cpumask allocation
2015-01-11Merge branch 'perf-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Ingo Molnar: "Mostly tooling fixes, but also some kernel side fixes: uncore PMU driver fix, user regs sampling fix and an instruction decoder fix that unbreaks PEBS precise sampling" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86/uncore/hsw-ep: Handle systems with only two SBOXes perf/x86_64: Improve user regs sampling perf: Move task_pt_regs sampling into arch code x86: Fix off-by-one in instruction decoder perf hists browser: Fix segfault when showing callchain perf callchain: Free callchains when hist entries are deleted perf hists: Fix children sort key behavior perf diff: Fix to sort by baseline field by default perf list: Fix --raw-dump option perf probe: Fix crash in dwarf_getcfi_elf perf probe: Fix to fall back to find probe point in symbols perf callchain: Append callchains only when requested perf ui/tui: Print backtrace symbols when segfault occurs perf report: Show progress bar for output resorting
2015-01-11Merge branch 'locking-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking fixes from Ingo Molnar: "A liblockdep fix and a mutex_unlock() mutex-debugging fix" * 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: mutex: Always clear owner field upon mutex_unlock() tools/liblockdep: Fix debug_check thinko in mutex destroy
2015-01-11mm: fix corner case in anon_vma endless growing preventionKonstantin Khlebnikov
Fix for BUG_ON(anon_vma->degree) splashes in unlink_anon_vmas() ("kernel BUG at mm/rmap.c:399!") caused by commit 7a3ef208e662 ("mm: prevent endless growth of anon_vma hierarchy") Anon_vma_clone() is usually called for a copy of source vma in destination argument. If source vma has anon_vma it should be already in dst->anon_vma. NULL in dst->anon_vma is used as a sign that it's called from anon_vma_fork(). In this case anon_vma_clone() finds anon_vma for reusing. Vma_adjust() calls it differently and this breaks anon_vma reusing logic: anon_vma_clone() links vma to old anon_vma and updates degree counters but vma_adjust() overrides vma->anon_vma right after that. As a result final unlink_anon_vmas() decrements degree for wrong anon_vma. This patch assigns ->anon_vma before calling anon_vma_clone(). Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com> Reported-and-tested-by: Chris Clayton <chris2553@googlemail.com> Reported-and-tested-by: Oded Gabbay <oded.gabbay@amd.com> Reported-and-tested-by: Chih-Wei Huang <cwhuang@android-x86.org> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Daniel Forrest <dan.forrest@ssec.wisc.edu> Cc: Michal Hocko <mhocko@suse.cz> Cc: stable@vger.kernel.org # to match back-porting of 7a3ef208e662 Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-01-11mm: Don't count the stack guard page towards RLIMIT_STACKLinus Torvalds
Commit fee7e49d4514 ("mm: propagate error from stack expansion even for guard page") made sure that we return the error properly for stack growth conditions. It also theorized that counting the guard page towards the stack limit might break something, but also said "Let's see if anybody notices". Somebody did notice. Apparently android-x86 sets the stack limit very close to the limit indeed, and including the guard page in the rlimit check causes the android 'zygote' process problems. So this adds the (fairly trivial) code to make the stack rlimit check be against the actual real stack size, rather than the size of the vma that includes the guard page. Reported-and-tested-by: Chih-Wei Huang <cwhuang@android-x86.org> Cc: Jay Foad <jay.foad@gmail.com> Cc: stable@kernel.org # to match back-porting of fee7e49d4514 Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-01-11Merge branch 'core/urgent' into locking/urgent, to collect all pending ↵Ingo Molnar
locking fixes Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-01-10Merge tag 'vfio-v3.19-rc4' of git://github.com/awilliam/linux-vfioLinus Torvalds
Pull VFIO fix from Alex Williamson: "Fix PCI header check in vfio_pci_probe() (Wei Yang)" * tag 'vfio-v3.19-rc4' of git://github.com/awilliam/linux-vfio: vfio-pci: Fix the check on pci device type in vfio_pci_probe()
2015-01-10Merge tag 'scsi-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi Pull SCSI fix from James Bottomley: "Just one fix: a qlogic busy wait regression" * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: qla2xxx: fix busy wait regression
2015-01-09Merge tag 'sound-3.19-rc4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound Pull sound fixes from Takashi Iwai: "All a few small regression or stable fixes: a Nvidia HDMI ID addition, a regression fix for CAIAQ stream count, a typo fix for GPIO setup with STAC/IDT HD-audio codecs, and a Fireworks big-endian fix" * tag 'sound-3.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: ALSA: fireworks: fix an endianness bug for transaction length ALSA: hda - Add new GPU codec ID 0x10de0072 to snd-hda ALSA: hda - Fix wrong gpio_dir & gpio_mask hint setups for IDT/STAC codecs ALSA: snd-usb-caiaq: fix stream count check
2015-01-09Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid Pull HID updates from Jiri Kosina: - bounds checking fixes in logitech and roccat drivers, from Peter Wu and Dan Carpenter - double-kfree fix in i2c-hid driver on bus shutdown, from Mika Westerberg - a couple of various small driver fixes - a few device id additions * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid: HID: roccat: potential out of bounds in pyra_sysfs_write_settings() HID: Add a new id 0x501a for Genius MousePen i608X HID: logitech-hidpp: prefix the name with "Logitech" HID: logitech-hidpp: avoid unintended fall-through HID: Allow HID_BATTERY_STRENGTH to be enabled HID: i2c-hid: Do not free buffers in i2c_hid_stop() HID: add battery quirk for USB_DEVICE_ID_APPLE_ALU_WIRELESS_2011_ISO keyboard HID: logitech-hidpp: check WTP report length HID: logitech-dj: check report length
2015-01-09Merge branch 'drm-fixes' of git://people.freedesktop.org/~airlied/linuxLinus Torvalds
Pull drm fixes from Dave Airlie: "I'm briefly working between holidays and LCA, so this is close to a couple of weeks of fixes, Two sets of amdkfd fixes, this is a new feature this kernel, and this pull fixes a few issues since it got merged, ordering when built-in to kernel and also the iommu vs gpu ordering patch, it also reworks the ioctl before the initial release. Otherwise: - radeon: some misc fixes all over, hdmi, 4k, dpm - nouveau: mcp77 init fixes, oops fix, bug on fix, msi fix - i915: power fixes, revert VGACNTR patch Probably be quiteer next week since I'll be at LCA anyways" * 'drm-fixes' of git://people.freedesktop.org/~airlied/linux: (33 commits) drm/amdkfd: rewrite kfd_ioctl() according to drm_ioctl() drm/amdkfd: reformat IOCTL definitions to drm-style drm/amdkfd: Do copy_to/from_user in general kfd_ioctl() drm/radeon: integer underflow in radeon_cp_dispatch_texture() drm/radeon: adjust default bapm settings for KV drm/radeon: properly filter DP1.2 4k modes on non-DP1.2 hw drm/radeon: fix sad_count check for dce3 drm/radeon: KV has three PPLLs (v2) drm/amdkfd: unmap VMID<-->PASID when relesing VMID (non-HWS) drm/radeon: Init amdkfd only if it was compiled amdkfd: actually allocate longs for the pasid bitmask drm/nouveau/nouveau: Do not BUG_ON(!spin_is_locked()) on UP drm/nv4c/mc: disable msi drm/nouveau/fb/ram/mcp77: enable NISO poller drm/nouveau/fb/ram/mcp77: use carveout reg to determine size drm/nouveau/fb/ram/mcp77: subclass nouveau_ram drm/nouveau: wake up the card if necessary during gem callbacks drm/nouveau/device: Add support for GK208B, resolves bug 86935 drm/nouveau: fix missing return statement in nouveau_ttm_tt_unpopulate drm/nouveau/bios: fix oops on pre-nv50 chipsets ...
2015-01-09Merge tag 'arm64-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull arm64 fixes from Will Deacon: "Here is a handful of minor arm64 fixes discovered and fixed over the Christmas break. The main part is adding some missing #includes that we seem to be getting transitively but have started causing problems in -next. - Fix early mapping fixmap corruption by EFI runtime services - Fix __NR_compat_syscalls off-by-one - Add missing sanity checks for some 32-bit registers - Add some missing #includes which we get transitively - Remove unused prepare_to_copy() macro" * tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: arm64/efi: add missing call to early_ioremap_reset() arm64: fix missing asm/io.h include in kernel/smp_spin_table.c arm64: fix missing asm/alternative.h include in kernel/module.c arm64: fix missing linux/bug.h include in asm/arch_timer.h arm64: fix missing asm/pgtable-hwdef.h include in asm/processor.h arm64: sanity checks: add missing AArch32 registers arm64: Remove unused prepare_to_copy() arm64: Correct __NR_compat_syscalls for bpf
2015-01-09Merge tag 'for_linus-3.19-rc4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/kgdb Pull kgdb/kdb fixes from Jason Wessel: "These have been around since 3.17 and in kgdb-next for the last 9 weeks and some will go back to -stable. Summary of changes: Cleanups - kdb: Remove unused command flags, repeat flags and KDB_REPEAT_NONE Fixes - kgdb/kdb: Allow access on a single core, if a CPU round up is deemed impossible, which will allow inspection of the now "trashed" kernel - kdb: Add enable mask for the command groups - kdb: access controls to restrict sensitive commands" * tag 'for_linus-3.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/kgdb: kernel/debug/debug_core.c: Logging clean-up kgdb: timeout if secondary CPUs ignore the roundup kdb: Allow access to sensitive commands to be restricted by default kdb: Add enable mask for groups of commands kdb: Categorize kdb commands (similar to SysRq categorization) kdb: Remove KDB_REPEAT_NONE flag kdb: Use KDB_REPEAT_* values as flags kdb: Rename kdb_register_repeat() to kdb_register_flags() kdb: Rename kdb_repeat_t to kdb_cmdflags_t, cmd_repeat to cmd_flags kdb: Remove currently unused kdbtab_t->cmd_flags
2015-01-09Merge branch 'for-3.19' of git://linux-nfs.org/~bfields/linuxLinus Torvalds
Pull two nfsd bugfixes from Bruce Fields. * 'for-3.19' of git://linux-nfs.org/~bfields/linux: rpc: fix xdr_truncate_encode to handle buffer ending on page boundary nfsd: fix fi_delegees leak when fi_had_conflict returns true
2015-01-09Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull two Ceph fixes from Sage Weil: "These are both pretty trivial: a sparse warning fix and size_t printk thing" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: libceph: fix sparse endianness warnings ceph: use %zu for len in ceph_fill_inline_data()
2015-01-09Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "None of these are huge, but my commit does fix a regression from 3.18 that could cause lost files during log replay. This also adds Dave Sterba to the list of Btrfs maintainers. It doesn't mean we're doing things differently, but Dave has really been helping with the maintainer workload for years" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: don't delay inode ref updates during log replay Btrfs: correctly get tree level in tree_backref_for_extent Btrfs: call inode_dec_link_count() on mkdir error path Btrfs: abort transaction if we don't find the block group Btrfs, scrub: uninitialized variable in scrub_extent_for_parity() Btrfs: add more maintainers
2015-01-09iscsi-target: Fix typos in enum cmd_flags_tableAndy Grover
Everything else starts with ICF so the last two should as well. Fix places they are used to match. Signed-off-by: Andy Grover <agrover@redhat.com> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
2015-01-09MAINTAINERS: Add entry for iSER target driverSagi Grimberg
iSCSI extensions for RDMA - Target mode. Signed-off-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
2015-01-09target: Allow Write Exclusive non-reservation holders to READLee Duncan
For PGR reservation of type Write Exclusive Access, allow all non reservation holding I_T nexuses with active registrations to READ from the device. This addresses a bug where active registrations that attempted to READ would result in an reservation conflict. Signed-off-by: Lee Duncan <lduncan@suse.com> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
2015-01-09target: Drop left-over fabric_max_sectors attributeNicholas Bellinger
Now that fabric_max_sectors is no longer used to enforce the maximum I/O size, go ahead and drop it's left-over usage in target-core and associated backend drivers. Cc: Christoph Hellwig <hch@lst.de> Cc: Martin K. Petersen <martin.petersen@oracle.com> Cc: Roland Dreier <roland@purestorage.com> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
2015-01-09target: Drop arbitrary maximum I/O size limitNicholas Bellinger
This patch drops the arbitrary maximum I/O size limit in sbc_parse_cdb(), which currently for fabric_max_sectors is hardcoded to 8192 (4 MB for 512 byte sector devices), and for hw_max_sectors is a backend driver dependent value. This limit is problematic because Linux initiators have only recently started to honor block limits MAXIMUM TRANSFER LENGTH, and other non-Linux based initiators (eg: MSFT Fibre Channel) can also generate I/Os larger than 4 MB in size. Currently when this happens, the following message will appear on the target resulting in I/Os being returned with non recoverable status: SCSI OP 28h with too big sectors 16384 exceeds fabric_max_sectors: 8192 Instead, drop both [fabric,hw]_max_sector checks in sbc_parse_cdb(), and convert the existing hw_max_sectors into a purely informational attribute used to represent the granuality that backend driver and/or subsystem code is splitting I/Os upon. Also, update FILEIO with an explicit FD_MAX_BYTES check in fd_execute_rw() to deal with the one special iovec limitiation case. v2 changes: - Drop hw_max_sectors check in sbc_parse_cdb() Reported-by: Lance Gropper <lance.gropper@qosserver.com> Reported-by: Stefan Priebe <s.priebe@profihost.ag> Cc: Christoph Hellwig <hch@lst.de> Cc: Martin K. Petersen <martin.petersen@oracle.com> Cc: Roland Dreier <roland@purestorage.com> Cc: stable@vger.kernel.org # 3.4 Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
2015-01-09Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge misc fixes from Andrew Morton: "12 fixes" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: mm, vmscan: prevent kswapd livelock due to pfmemalloc-throttled process being killed memcg: fix destination cgroup leak on task charges migration mm: memcontrol: switch soft limit default back to infinity mm/debug_pagealloc: remove obsolete Kconfig options vfs: renumber FMODE_NONOTIFY and add to uniqueness check arch/blackfin/mach-bf533/boards/stamp.c: add linux/delay.h ocfs2: fix the wrong directory passed to ocfs2_lookup_ino_from_name() when link file MAINTAINERS: update rydberg's addresses mm: protect set_page_dirty() from ongoing truncation mm: prevent endless growth of anon_vma hierarchy exit: fix race between wait_consider_task() and wait_task_zombie() ocfs2: remove bogus check in dlm_process_recovery_data
2015-01-09ARM: 8275/1: mm: fix PMD_SECT_RDONLY undeclared compile errorVictor Kamensky
In v3.19-rc3 tree when CONFIG_ARM_LPAE and CONFIG_DEBUG_RODATA are enabled image failed to compile with the following error: arch/arm/mm/init.c:661:14: error: ‘PMD_SECT_RDONLY’ undeclared here (not in a function) It seems that '80d6b0c ARM: mm: allow text and rodata sections to be read-only' and 'ded9477 ARM: 8109/1: mm: Modify pte_write and pmd_write logic for LPAE' commits crossed. 80d6b0c uses PMD_SECT_RDONLY macro but ded9477 renames it and uses software bits L_PMD_SECT_RDONLY instead. Fix is to use L_PMD_SECT_RDONLY instead PMD_SECT_RDONLY as ded9477 does in another places. Signed-off-by: Victor Kamensky <victor.kamensky@linaro.org> Acked-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2015-01-09HID: roccat: potential out of bounds in pyra_sysfs_write_settings()Dan Carpenter
This is a static checker fix. We write some binary settings to the sysfs file. One of the settings is the "->startup_profile". There isn't any checking to make sure it fits into the pyra->profile_settings[] array in the profile_activated() function. I added a check to pyra_sysfs_write_settings() in both places because I wasn't positive that the other callers were correct. Cc: <stable@vger.kernel.org> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2015-01-09mutex: Always clear owner field upon mutex_unlock()Chris Wilson
Currently if DEBUG_MUTEXES is enabled, the mutex->owner field is only cleared iff debug_locks is active. This exposes a race to other users of the field where the mutex->owner may be still set to a stale value, potentially upsetting mutex_spin_on_owner() among others. References: https://bugs.freedesktop.org/show_bug.cgi?id=87955 Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Davidlohr Bueso <dave@stgolabs.net> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1420540175-30204-1-git-send-email-chris@chris-wilson.co.uk Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-01-09sched/fair: Fix RCU stall upon -ENOMEM in sched_create_group()Tetsuo Handa
When alloc_fair_sched_group() in sched_create_group() fails, free_sched_group() is called, and free_fair_sched_group() is called by free_sched_group(). Since destroy_cfs_bandwidth() is called by free_fair_sched_group() without calling init_cfs_bandwidth(), RCU stall occurs at hrtimer_cancel(): INFO: rcu_sched self-detected stall on CPU { 1} (t=60000 jiffies g=13074 c=13073 q=0) Task dump for CPU 1: (fprintd) R running task 0 6249 1 0x00000088 ... Call Trace: <IRQ> [<ffffffff81094988>] sched_show_task+0xa8/0x110 [<ffffffff81097acd>] dump_cpu_task+0x3d/0x50 [<ffffffff810c3a80>] rcu_dump_cpu_stacks+0x90/0xd0 [<ffffffff810c7751>] rcu_check_callbacks+0x491/0x700 [<ffffffff810cbf2b>] update_process_times+0x4b/0x80 [<ffffffff810db046>] tick_sched_handle.isra.20+0x36/0x50 [<ffffffff810db0a2>] tick_sched_timer+0x42/0x70 [<ffffffff810ccb19>] __run_hrtimer+0x69/0x1a0 [<ffffffff810db060>] ? tick_sched_handle.isra.20+0x50/0x50 [<ffffffff810ccedf>] hrtimer_interrupt+0xef/0x230 [<ffffffff810452cb>] local_apic_timer_interrupt+0x3b/0x70 [<ffffffff8164a465>] smp_apic_timer_interrupt+0x45/0x60 [<ffffffff816485bd>] apic_timer_interrupt+0x6d/0x80 <EOI> [<ffffffff810cc588>] ? lock_hrtimer_base.isra.23+0x18/0x50 [<ffffffff81193cf1>] ? __kmalloc+0x211/0x230 [<ffffffff810cc9d2>] hrtimer_try_to_cancel+0x22/0xd0 [<ffffffff81193cf1>] ? __kmalloc+0x211/0x230 [<ffffffff810ccaa2>] hrtimer_cancel+0x22/0x30 [<ffffffff810a3cb5>] free_fair_sched_group+0x25/0xd0 [<ffffffff8108df46>] free_sched_group+0x16/0x40 [<ffffffff810971bb>] sched_create_group+0x4b/0x80 [<ffffffff810aa383>] sched_autogroup_create_attach+0x43/0x1c0 [<ffffffff8107dc9c>] sys_setsid+0x7c/0x110 [<ffffffff81647729>] system_call_fastpath+0x12/0x17 Check whether init_cfs_bandwidth() was called before calling destroy_cfs_bandwidth(). Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> [ Move the check into destroy_cfs_bandwidth() to aid compilability. ] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Paul Turner <pjt@google.com> Cc: Ben Segall <bsegall@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/201412252210.GCC30204.SOMVFFOtQJFLOH@I-love.SAKURA.ne.jp Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-01-09sched/deadline: Avoid double-accounting in case of missed deadlinesLuca Abeni
The dl_runtime_exceeded() function is supposed to ckeck if a SCHED_DEADLINE task must be throttled, by checking if its current runtime is <= 0. However, it also checks if the scheduling deadline has been missed (the current time is larger than the current scheduling deadline), further decreasing the runtime if this happens. This "double accounting" is wrong: - In case of partitioned scheduling (or single CPU), this happens if task_tick_dl() has been called later than expected (due to small HZ values). In this case, the current runtime is also negative, and replenish_dl_entity() can take care of the deadline miss by recharging the current runtime to a value smaller than dl_runtime - In case of global scheduling on multiple CPUs, scheduling deadlines can be missed even if the task did not consume more runtime than expected, hence penalizing the task is wrong This patch fix this problem by throttling a SCHED_DEADLINE task only when its runtime becomes negative, and not modifying the runtime Signed-off-by: Luca Abeni <luca.abeni@unitn.it> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Juri Lelli <juri.lelli@gmail.com> Cc: <stable@vger.kernel.org> Cc: Dario Faggioli <raistlin@linux.it> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1418813432-20797-3-git-send-email-luca.abeni@unitn.it Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-01-09sched/deadline: Fix migration of SCHED_DEADLINE tasksLuca Abeni
According to global EDF, tasks should be migrated between runqueues without checking if their scheduling deadlines and runtimes are valid. However, SCHED_DEADLINE currently performs such a check: a migration happens doing: deactivate_task(rq, next_task, 0); set_task_cpu(next_task, later_rq->cpu); activate_task(later_rq, next_task, 0); which ends up calling dequeue_task_dl(), setting the new CPU, and then calling enqueue_task_dl(). enqueue_task_dl() then calls enqueue_dl_entity(), which calls update_dl_entity(), which can modify scheduling deadline and runtime, breaking global EDF scheduling. As a result, some of the properties of global EDF are not respected: for example, a taskset {(30, 80), (40, 80), (120, 170)} scheduled on two cores can have unbounded response times for the third task even if 30/80+40/80+120/170 = 1.5809 < 2 This can be fixed by invoking update_dl_entity() only in case of wakeup, or if this is a new SCHED_DEADLINE task. Signed-off-by: Luca Abeni <luca.abeni@unitn.it> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Juri Lelli <juri.lelli@gmail.com> Cc: <stable@vger.kernel.org> Cc: Dario Faggioli <raistlin@linux.it> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1418813432-20797-2-git-send-email-luca.abeni@unitn.it Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-01-09sched: Fix odd values in effective_load() calculationsYuyang Du
In effective_load, we have (long w * unsigned long tg->shares) / long W, when w is negative, it is cast to unsigned long and hence the product is insanely large. Fix this by casting tg->shares to long. Reported-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Yuyang Du <yuyang.du@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Dave Jones <davej@redhat.com> Cc: Andrey Ryabinin <a.ryabinin@samsung.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20141219002956.GA25405@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-01-09sched, fanotify: Deal with nested sleepsPeter Zijlstra
As per e23738a7300a ("sched, inotify: Deal with nested sleeps"). fanotify_read is a wait loop with sleeps in. Wait loops rely on task_struct::state and sleeps do too, since that's the only means of actually sleeping. Therefore the nested sleeps destroy the wait loop state and the wait loop breaks the sleep functions that assume TASK_RUNNING (mutex_lock). Fix this by using the new woken_wake_function and wait_woken() stuff, which registers wakeups in wait and thereby allows shrinking the task_state::state changes to the actual sleep part. Reported-by: Yuanhan Liu <yuanhan.liu@linux.intel.com> Reported-by: Sedat Dilek <sedat.dilek@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Takashi Iwai <tiwai@suse.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Eric Paris <eparis@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Eric Paris <eparis@redhat.com> Link: http://lkml.kernel.org/r/20141216152838.GZ3337@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-01-09perf/x86/uncore/hsw-ep: Handle systems with only two SBOXesAndi Kleen
There was another report of a boot failure with a #GP fault in the uncore SBOX initialization. The earlier work around was not enough for this system. The boot was failing while trying to initialize the third SBOX. This patch detects parts with only two SBOXes and limits the number of SBOX units to two there. Stable material, as it affects boot problems on 3.18. Tested-by: Andreas Oehler <andreas@oehler-net.de> Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <stable@vger.kernel.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Stephane Eranian <eranian@google.com> Cc: Yan, Zheng <zheng.z.yan@intel.com> Link: http://lkml.kernel.org/r/1420583675-9163-1-git-send-email-andi@firstfloor.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-01-09perf/x86_64: Improve user regs samplingAndy Lutomirski
Perf reports user regs for kernel-mode samples so that samples can be backtraced through user code. The old code was very broken in syscall context, resulting in useless backtraces. The new code, in contrast, is still dangerously racy, but it should at least work most of the time. Tested-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: chenggang.qcg@taobao.com Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Namhyung Kim <namhyung@gmail.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: David Ahern <dsahern@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/243560c26ff0f739978e2459e203f6515367634d.1420396372.git.luto@amacapital.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-01-09perf: Move task_pt_regs sampling into arch codeAndy Lutomirski
On x86_64, at least, task_pt_regs may be only partially initialized in many contexts, so x86_64 should not use it without extra care from interrupt context, let alone NMI context. This will allow x86_64 to override the logic and will supply some scratch space to use to make a cleaner copy of user regs. Tested-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: chenggang.qcg@taobao.com Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Namhyung Kim <namhyung@gmail.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: David Ahern <dsahern@gmail.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Jean Pihet <jean.pihet@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mark Salter <msalter@redhat.com> Cc: Russell King <linux@arm.linux.org.uk> Cc: Will Deacon <will.deacon@arm.com> Cc: linux-arm-kernel@lists.infradead.org Link: http://lkml.kernel.org/r/e431cd4c18c2e1c44c774f10758527fb2d1025c4.1420396372.git.luto@amacapital.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-01-09x86: Fix off-by-one in instruction decoderPeter Zijlstra
Stephane reported that the PEBS fixup was broken by the recent commit to the instruction decoder. The thing had an off-by-one which resulted in not being able to decode the last instruction and always bail. Reported-by: Stephane Eranian <eranian@google.com> Fixes: 6ba48ff46f76 ("x86: Remove arbitrary instruction size limit in instruction decoder") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org # 3.18 Cc: <ak@linux.intel.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Liang Kan <kan.liang@intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Jim Keniston <jkenisto@us.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Link: http://lkml.kernel.org/r/20141216104614.GV3337@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-01-09Merge tag 'perf-urgent-for-mingo' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/urgent Pull perf/urgent fixes from Arnaldo Carvalho de Melo: - Free callchains when hist entries are deleted, plugging a massive leak in 'top -g', where hist_entries (and its callchains) are decayed over time. (Namhyung Kim) - Fix segfault when showing callchain in the hists browser (report & top) (Namhyung Kim) - Fix children sort key behavior, and also the 'perf test 32' test that was failing due to reliance on undefined behaviour (Namhyung Kim) Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-01-08mm, vmscan: prevent kswapd livelock due to pfmemalloc-throttled process ↵Vlastimil Babka
being killed Charles Shirron and Paul Cassella from Cray Inc have reported kswapd stuck in a busy loop with nothing left to balance, but kswapd_try_to_sleep() failing to sleep. Their analysis found the cause to be a combination of several factors: 1. A process is waiting in throttle_direct_reclaim() on pgdat->pfmemalloc_wait 2. The process has been killed (by OOM in this case), but has not yet been scheduled to remove itself from the waitqueue and die. 3. kswapd checks for throttled processes in prepare_kswapd_sleep(): if (waitqueue_active(&pgdat->pfmemalloc_wait)) { wake_up(&pgdat->pfmemalloc_wait); return false; // kswapd will not go to sleep } However, for a process that was already killed, wake_up() does not remove the process from the waitqueue, since try_to_wake_up() checks its state first and returns false when the process is no longer waiting. 4. kswapd is running on the same CPU as the only CPU that the process is allowed to run on (through cpus_allowed, or possibly single-cpu system). 5. CONFIG_PREEMPT_NONE=y kernel is used. If there's nothing to balance, kswapd encounters no voluntary preemption points and repeatedly fails prepare_kswapd_sleep(), blocking the process from running and removing itself from the waitqueue, which would let kswapd sleep. So, the source of the problem is that we prevent kswapd from going to sleep until there are processes waiting on the pfmemalloc_wait queue, and a process waiting on a queue is guaranteed to be removed from the queue only when it gets scheduled. This was done to make sure that no process is left sleeping on pfmemalloc_wait when kswapd itself goes to sleep. However, it isn't necessary to postpone kswapd sleep until the pfmemalloc_wait queue actually empties. To prevent processes from being left sleeping, it's actually enough to guarantee that all processes waiting on pfmemalloc_wait queue have been woken up by the time we put kswapd to sleep. This patch therefore fixes this issue by substituting 'wake_up' with 'wake_up_all' and removing 'return false' in the code snippet from prepare_kswapd_sleep() above. Note that if any process puts itself in the queue after this waitqueue_active() check, or after the wake up itself, it means that the process will also wake up kswapd - and since we are under prepare_to_wait(), the wake up won't be missed. Also we update the comment prepare_kswapd_sleep() to hopefully more clearly describe the races it is preventing. Fixes: 5515061d22f0 ("mm: throttle direct reclaimers if PF_MEMALLOC reserves are low and swap is backed by network storage") Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Rik van Riel <riel@redhat.com> Cc: <stable@vger.kernel.org> [3.6+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-01-08memcg: fix destination cgroup leak on task charges migrationVladimir Davydov
We are supposed to take one css reference per each memory page and per each swap entry accounted to a memory cgroup. However, during task charges migration we take a reference to the destination cgroup twice per each swap entry: first in mem_cgroup_do_precharge()->try_charge() and then in mem_cgroup_move_swap_account(), permanently leaking the destination cgroup. The hunk taking the second reference seems to be a leftover from the pre-00501b531c472 ("mm: memcontrol: rewrite charge API") era. Remove it to fix the leak. Fixes: e8ea14cc6ead (mm: memcontrol: take a css reference for each charged page) Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-01-08mm: memcontrol: switch soft limit default back to infinityJohannes Weiner
Commit 3e32cb2e0a12 ("mm: memcontrol: lockless page counters") accidentally switched the soft limit default from infinity to zero, which turns all memcgs with even a single page into soft limit excessors and engages soft limit reclaim on all of them during global memory pressure. This makes global reclaim generally more aggressive, but also inverts the meaning of existing soft limit configurations where unset soft limits are usually more generous than set ones. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-01-08mm/debug_pagealloc: remove obsolete Kconfig optionsJoonsoo Kim
These are obsolete since commit e30825f1869a ("mm/debug-pagealloc: prepare boottime configurable") was merged. So remove them. [pebolle@tiscali.nl: find obsolete Kconfig options] Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Paul Bolle <pebolle@tiscali.nl> Cc: Mel Gorman <mgorman@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Dave Hansen <dave@sr71.net> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Jungsoo Son <jungsoo.son@lge.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-01-08vfs: renumber FMODE_NONOTIFY and add to uniqueness checkDavid Drysdale
Fix clashing values for O_PATH and FMODE_NONOTIFY on sparc. The clashing O_PATH value was added in commit 5229645bdc35 ("vfs: add nonconflicting values for O_PATH") but this can't be changed as it is user-visible. FMODE_NONOTIFY is only used internally in the kernel, but it is in the same numbering space as the other O_* flags, as indicated by the comment at the top of include/uapi/asm-generic/fcntl.h (and its use in fs/notify/fanotify/fanotify_user.c). So renumber it to avoid the clash. All of this has happened before (commit 12ed2e36c98a: "fanotify: FMODE_NONOTIFY and __O_SYNC in sparc conflict"), and all of this will happen again -- so update the uniqueness check in fcntl_init() to include __FMODE_NONOTIFY. Signed-off-by: David Drysdale <drysdale@google.com> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Jan Kara <jack@suse.cz> Cc: Heinrich Schuchardt <xypron.glpk@gmx.de> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Eric Paris <eparis@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-01-08arch/blackfin/mach-bf533/boards/stamp.c: add linux/delay.hOleg Nesterov
build error arch/blackfin/mach-bf533/boards/stamp.c:834:2: error: implicit declaration of function 'mdelay' Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reported-by: Wu Fengguang <fengguang.wu@intel.com> Acked-by: Mike Frysinger <vapier@gentoo.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-01-08ocfs2: fix the wrong directory passed to ocfs2_lookup_ino_from_name() when ↵Xue jiufei
link file In ocfs2_link(), the parent directory inode passed to function ocfs2_lookup_ino_from_name() is wrong. Parameter dir is the parent of new_dentry not old_dentry. We should get old_dir from old_dentry and lookup old_dentry in old_dir in case another node remove the old dentry. With this change, hard linking works again, when paths are relative with at least one subdirectory. This is how the problem was reproducable: # mkdir a # mkdir b # touch a/test # ln a/test b/test ln: failed to create hard link `b/test' => `a/test': No such file or directory However when creating links in the same dir, it worked well. Now the link gets created. Fixes: 0e048316ff57 ("ocfs2: check existence of old dentry in ocfs2_link()") Signed-off-by: joyce.xue <xuejiufei@huawei.com> Reported-by: Szabo Aron - UBIT <aron@ubit.hu> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Tested-by: Aron Szabo <aron@ubit.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-01-08MAINTAINERS: update rydberg's addressesHenrik Rydberg
My ISP finally gave up on the old mail address, so I am moving things over to bitmath.org instead. Also change the status fields to better reflect reality. Signed-off-by: Henrik Rydberg <rydberg@bitmath.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-01-08mm: protect set_page_dirty() from ongoing truncationJohannes Weiner
Tejun, while reviewing the code, spotted the following race condition between the dirtying and truncation of a page: __set_page_dirty_nobuffers() __delete_from_page_cache() if (TestSetPageDirty(page)) page->mapping = NULL if (PageDirty()) dec_zone_page_state(page, NR_FILE_DIRTY); dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE); if (page->mapping) account_page_dirtied(page) __inc_zone_page_state(page, NR_FILE_DIRTY); __inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE); which results in an imbalance of NR_FILE_DIRTY and BDI_RECLAIMABLE. Dirtiers usually lock out truncation, either by holding the page lock directly, or in case of zap_pte_range(), by pinning the mapcount with the page table lock held. The notable exception to this rule, though, is do_wp_page(), for which this race exists. However, do_wp_page() already waits for a locked page to unlock before setting the dirty bit, in order to prevent a race where clear_page_dirty() misses the page bit in the presence of dirty ptes. Upgrade that wait to a fully locked set_page_dirty() to also cover the situation explained above. Afterwards, the code in set_page_dirty() dealing with a truncation race is no longer needed. Remove it. Reported-by: Tejun Heo <tj@kernel.org> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-01-08mm: prevent endless growth of anon_vma hierarchyKonstantin Khlebnikov
Constantly forking task causes unlimited grow of anon_vma chain. Each next child allocates new level of anon_vmas and links vma to all previous levels because pages might be inherited from any level. This patch adds heuristic which decides to reuse existing anon_vma instead of forking new one. It adds counter anon_vma->degree which counts linked vmas and directly descending anon_vmas and reuses anon_vma if counter is lower than two. As a result each anon_vma has either vma or at least two descending anon_vmas. In such trees half of nodes are leafs with alive vmas, thus count of anon_vmas is no more than two times bigger than count of vmas. This heuristic reuses anon_vmas as few as possible because each reuse adds false aliasing among vmas and rmap walker ought to scan more ptes when it searches where page is might be mapped. Link: http://lkml.kernel.org/r/20120816024610.GA5350@evergreen.ssec.wisc.edu Fixes: 5beb49305251 ("mm: change anon_vma linking to fix multi-process server scalability issue") [akpm@linux-foundation.org: fix typo, per Rik] Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com> Reported-by: Daniel Forrest <dan.forrest@ssec.wisc.edu> Tested-by: Michal Hocko <mhocko@suse.cz> Tested-by: Jerome Marchand <jmarchan@redhat.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: <stable@vger.kernel.org> [2.6.34+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>