summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2014-06-12Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-nextLinus Torvalds
Pull networking updates from David Miller: 1) Seccomp BPF filters can now be JIT'd, from Alexei Starovoitov. 2) Multiqueue support in xen-netback and xen-netfront, from Andrew J Benniston. 3) Allow tweaking of aggregation settings in cdc_ncm driver, from Bjørn Mork. 4) BPF now has a "random" opcode, from Chema Gonzalez. 5) Add more BPF documentation and improve test framework, from Daniel Borkmann. 6) Support TCP fastopen over ipv6, from Daniel Lee. 7) Add software TSO helper functions and use them to support software TSO in mvneta and mv643xx_eth drivers. From Ezequiel Garcia. 8) Support software TSO in fec driver too, from Nimrod Andy. 9) Add Broadcom SYSTEMPORT driver, from Florian Fainelli. 10) Handle broadcasts more gracefully over macvlan when there are large numbers of interfaces configured, from Herbert Xu. 11) Allow more control over fwmark used for non-socket based responses, from Lorenzo Colitti. 12) Do TCP congestion window limiting based upon measurements, from Neal Cardwell. 13) Support busy polling in SCTP, from Neal Horman. 14) Allow RSS key to be configured via ethtool, from Venkata Duvvuru. 15) Bridge promisc mode handling improvements from Vlad Yasevich. 16) Don't use inetpeer entries to implement ID generation any more, it performs poorly, from Eric Dumazet. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1522 commits) rtnetlink: fix userspace API breakage for iproute2 < v3.9.0 tcp: fixing TLP's FIN recovery net: fec: Add software TSO support net: fec: Add Scatter/gather support net: fec: Increase buffer descriptor entry number net: fec: Factorize feature setting net: fec: Enable IP header hardware checksum net: fec: Factorize the .xmit transmit function bridge: fix compile error when compiling without IPv6 support bridge: fix smatch warning / potential null pointer dereference via-rhine: fix full-duplex with autoneg disable bnx2x: Enlarge the dorq threshold for VFs bnx2x: Check for UNDI in uncommon branch bnx2x: Fix 1G-baseT link bnx2x: Fix link for KR with swapped polarity lane sctp: Fix sk_ack_backlog wrap-around problem net/core: Add VF link state control policy net/fsl: xgmac_mdio is dependent on OF_MDIO net/fsl: Make xgmac_mdio read error message useful net_sched: drr: warn when qdisc is not work conserving ...
2014-06-12Merge tag 'pm+acpi-3.16-rc1-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull more ACPI and power management updates from Rafael Wysocki: "These are fixups on top of the previous PM+ACPI pull request, regression fixes (ACPI hotplug, cpufreq ppc-corenet), other bug fixes (ACPI reset, cpufreq), new PM trace points for system suspend profiling and a copyright notice update. Specifics: - I didn't remember correctly that the Hans de Goede's ACPI video patches actually didn't flip the video.use_native_backlight default, although we had discussed that and decided to do that. Since I said we would do that in the previous PM+ACPI pull request, make that change for real now. - ACPI bus check notifications for PCI host bridges don't cause the bus below the host bridge to be checked for changes as they should because of a mistake in the ACPI-based PCI hotplug (ACPIPHP) subsystem that forgets to add hotplug contexts to PCI host bridge ACPI device objects. Create hotplug contexts for PCI host bridges too as appropriate. - Revert recent cpufreq commit related to the big.LITTLE cpufreq driver that breaks arm64 builds. - Fix for a regression in the ppc-corenet cpufreq driver introduced during the 3.15 cycle and causing the driver to use the remainder from do_div instead of the quotient. From Ed Swarthout. - Resets triggered by panic activate a BUG_ON() in vmalloc.c on systems where the ACPI reset register is located in memory address space. Fix from Randy Wright. - Fix for a problem with cpufreq governors that decisions made by them may be suboptimal due to the fact that deferrable timers are used by them for CPU load sampling. From Srivatsa S Bhat. - Fix for a problem with the Tegra cpufreq driver where the CPU frequency is temporarily switched to a "stable" level that is different from both the initial and target frequencies during transitions which causes udelay() to expire earlier than it should sometimes. From Viresh Kumar. - New trace points and rework of some existing trace points for system suspend/resume profiling from Todd Brandt. - Assorted cpufreq fixes and cleanups from Stratos Karafotis and Viresh Kumar. - Copyright notice update for suspend-and-cpuhotplug.txt from Srivatsa S Bhat" * tag 'pm+acpi-3.16-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: ACPI / hotplug / PCI: Add hotplug contexts to PCI host bridges PM / sleep: trace events for device PM callbacks cpufreq: cpufreq-cpu0: remove dependency on THERMAL and REGULATOR cpufreq: tegra: update comment for clarity cpufreq: intel_pstate: Remove duplicate CPU ID check cpufreq: Mark CPU0 driver with CPUFREQ_NEED_INITIAL_FREQ_CHECK flag PM / Documentation: Update copyright in suspend-and-cpuhotplug.txt cpufreq: governor: remove copy_prev_load from 'struct cpu_dbs_common_info' cpufreq: governor: Be friendly towards latency-sensitive bursty workloads PM / sleep: trace events for suspend/resume cpufreq: ppc-corenet-cpu-freq: do_div use quotient Revert "cpufreq: Enable big.LITTLE cpufreq driver on arm64" cpufreq: Tegra: implement intermediate frequency callbacks cpufreq: add support for intermediate (stable) frequencies ACPI / video: Change the default for video.use_native_backlight to 1 ACPI: Fix bug when ACPI reset register is implemented in system memory
2014-06-12Merge branch 'pm-sleep'Rafael J. Wysocki
* pm-sleep: PM / sleep: trace events for device PM callbacks PM / sleep: trace events for suspend/resume
2014-06-11Merge tag 'modules-next-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux Pull module updates from Rusty Russell: "Most of this is cleaning up various driver sysfs permissions so we can re-add the perm check (we unified the module param and sysfs checks, but the module ones were stronger so we weakened them temporarily). Param parsing gets documented, and also "--" now forces args to be handed to init (and ignored by the kernel). Module NX/RO protections get tightened: we now set them before calling parse_args()" * tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux: module: set nx before marking module MODULE_STATE_COMING. samples/kobject/: avoid world-writable sysfs files. drivers/hid/hid-picolcd_fb: avoid world-writable sysfs files. drivers/staging/speakup/: avoid world-writable sysfs files. drivers/regulator/virtual: avoid world-writable sysfs files. drivers/scsi/pm8001/pm8001_ctl.c: avoid world-writable sysfs files. drivers/hid/hid-lg4ff.c: avoid world-writable sysfs files. drivers/video/fbdev/sm501fb.c: avoid world-writable sysfs files. drivers/mtd/devices/docg3.c: avoid world-writable sysfs files. speakup: fix incorrect perms on speakup_acntsa.c cpumask.h: silence warning with -Wsign-compare Documentation: Update kernel-parameters.tx param: hand arguments after -- straight to init modpost: Fix resource leak in read_dump()
2014-06-10Merge branch 'akpm' (patches from Andrew Morton)Linus Torvalds
Merge leftovers from Andrew Morton: "A few leftovers: ocfs2, gcov, RTC" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: rtc: s5m: consolidate two device type switch statements rtc: s5m: add support for S2MPS14 RTC rtc: s5m: support different register layout rtc: s5m: use shorter time of register update rtc: s5m: remove undocumented time init on first boot mfd/rtc: sec/s5m: rename SEC* symbols to S5M gcov: add support for GCC 4.9 ocfs2/o2net: incorrect to terminate accepting connections loop upon rejecting an invalid one
2014-06-10gcov: add support for GCC 4.9Yuan Pengfei
This patch handles the gcov-related changes in GCC 4.9: A new counter (time profile) is added. The total number is 9 now. A new profile merge function __gcov_merge_time_profile is added. See gcc/gcov-io.h and libgcc/libgcov-merge.c For the first change, the layout of struct gcov_info is affected. For the second one, a dummy function is added to kernel/gcov/base.c similarly. Signed-off-by: Yuan Pengfei <coolypf@qq.com> Acked-by: Peter Oberparleiter <oberpar@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-10fs,userns: Change inode_capable to capable_wrt_inode_uidgidAndy Lutomirski
The kernel has no concept of capabilities with respect to inodes; inodes exist independently of namespaces. For example, inode_capable(inode, CAP_LINUX_IMMUTABLE) would be nonsense. This patch changes inode_capable to check for uid and gid mappings and renames it to capable_wrt_inode_uidgid, which should make it more obvious what it does. Fixes CVE-2014-4014. Cc: Theodore Ts'o <tytso@mit.edu> Cc: Serge Hallyn <serge.hallyn@ubuntu.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Dave Chinner <david@fromorbit.com> Cc: stable@vger.kernel.org Signed-off-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-10auditsc: audit_krule mask accesses need bounds checkingAndy Lutomirski
Fixes an easy DoS and possible information disclosure. This does nothing about the broken state of x32 auditing. eparis: If the admin has enabled auditd and has specifically loaded audit rules. This bug has been around since before git. Wow... Cc: stable@vger.kernel.org Signed-off-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Eric Paris <eparis@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-09Merge tag 'trace-3.16' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing updates from Steven Rostedt: "Lots of tweaks, small fixes, optimizations, and some helper functions to help out the rest of the kernel to ease their use of trace events. The big change for this release is the allowing of other tracers, such as the latency tracers, to be used in the trace instances and allow for function or function graph tracing to be in the top level simultaneously" * tag 'trace-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (44 commits) tracing: Fix memory leak on instance deletion tracing: Fix leak of ring buffer data when new instances creation fails tracing/kprobes: Avoid self tests if tracing is disabled on boot up tracing: Return error if ftrace_trace_arrays list is empty tracing: Only calculate stats of tracepoint benchmarks for 2^32 times tracing: Convert stddev into u64 in tracepoint benchmark tracing: Introduce saved_cmdlines_size file tracing: Add __get_dynamic_array_len() macro for trace events tracing: Remove unused variable in trace_benchmark tracing: Eliminate double free on failure of allocation on boot up ftrace/x86: Call text_ip_addr() instead of the duplicated code tracing: Print max callstack on stacktrace bug tracing: Move locking of trace_cmdline_lock into start/stop seq calls tracing: Try again for saved cmdline if failed due to locking tracing: Have saved_cmdlines use the seq_read infrastructure tracing: Add tracepoint benchmark tracepoint tracing: Print nasty banner when trace_printk() is in use tracing: Add funcgraph_tail option to print function name after closing braces tracing: Eliminate duplicate TRACE_GRAPH_PRINT_xx defines tracing: Add __bitmask() macro to trace events to cpumasks and other bitmasks ...
2014-06-09Merge branch 'for-3.16' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: "A lot of activities on cgroup side. Heavy restructuring including locking simplification took place to improve the code base and enable implementation of the unified hierarchy, which currently exists behind a __DEVEL__ mount option. The core support is mostly complete but individual controllers need further work. To explain the design and rationales of the the unified hierarchy Documentation/cgroups/unified-hierarchy.txt is added. Another notable change is css (cgroup_subsys_state - what each controller uses to identify and interact with a cgroup) iteration update. This is part of continuing updates on css object lifetime and visibility. cgroup started with reference count draining on removal way back and is now reaching a point where csses behave and are iterated like normal refcnted objects albeit with some complexities to allow distinguishing the state where they're being deleted. The css iteration update isn't taken advantage of yet but is planned to be used to simplify memcg significantly" * 'for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (77 commits) cgroup: disallow disabled controllers on the default hierarchy cgroup: don't destroy the default root cgroup: disallow debug controller on the default hierarchy cgroup: clean up MAINTAINERS entries cgroup: implement css_tryget() device_cgroup: use css_has_online_children() instead of has_children() cgroup: convert cgroup_has_live_children() into css_has_online_children() cgroup: use CSS_ONLINE instead of CGRP_DEAD cgroup: iterate cgroup_subsys_states directly cgroup: introduce CSS_RELEASED and reduce css iteration fallback window cgroup: move cgroup->serial_nr into cgroup_subsys_state cgroup: link all cgroup_subsys_states in their sibling lists cgroup: move cgroup->sibling and ->children into cgroup_subsys_state cgroup: remove cgroup->parent device_cgroup: remove direct access to cgroup->children memcg: update memcg_has_children() to use css_next_child() memcg: remove tasks/children test from mem_cgroup_force_empty() cgroup: remove css_parent() cgroup: skip refcnting on normal root csses and cgrp_dfl_root self css cgroup: use cgroup->self.refcnt for cgroup refcnting ...
2014-06-09Merge branch 'for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wqLinus Torvalds
Pull workqueue updates from Tejun Heo: "Lai simplified worker destruction path and internal workqueue locking and there are some other minor changes. Except for the removal of some long-deprecated interfaces which haven't had any in-kernel user for quite a while, there shouldn't be any difference to workqueue users" * 'for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: kernel/workqueue.c: pr_warning/pr_warn & printk/pr_info workqueue: remove the confusing POOL_FREEZING workqueue: rename first_worker() to first_idle_worker() workqueue: remove unused work_clear_pending() workqueue: remove unused WORK_CPU_END workqueue: declare system_highpri_wq workqueue: use generic attach/detach routine for rescuers workqueue: separate pool-attaching code out from create_worker() workqueue: rename manager_mutex to attach_mutex workqueue: narrow the protection range of manager_mutex workqueue: convert worker_idr to worker_ida workqueue: separate iteration role from worker_idr workqueue: destroy worker directly in the idle timeout handler workqueue: async worker destruction workqueue: destroy_worker() should destroy idle workers only workqueue: use manager lock only to protect worker_idr workqueue: Remove deprecated system_nrt[_freezable]_wq workqueue: Remove deprecated flush[_delayed]_work_sync() kernel/workqueue.c: pr_warning/pr_warn & printk/pr_info workqueue: simplify wq_update_unbound_numa() by jumping to use_dfl_pwq if the target cpumask equals wq's
2014-06-08numa,sched: fix load_to_imbalanced logic inversionRik van Riel
This function is supposed to return true if the new load imbalance is worse than the old one. It didn't. I can only hope brown paper bags are in style. Now things converge much better on both the 4 node and 8 node systems. I am not sure why this did not seem to impact specjbb performance on the 4 node system, which is the system I have full-time access to. This bug was introduced recently, with commit e63da03639cc ("sched/numa: Allow task switch if load imbalance improves") Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-08Merge branch 'next' (accumulated 3.16 merge window patches) into masterLinus Torvalds
Now that 3.15 is released, this merges the 'next' branch into 'master', bringing us to the normal situation where my 'master' branch is the merge window. * accumulated work in next: (6809 commits) ufs: sb mutex merge + mutex_destroy powerpc: update comments for generic idle conversion cris: update comments for generic idle conversion idle: remove cpu_idle() forward declarations nbd: zero from and len fields in NBD_CMD_DISCONNECT. mm: convert some level-less printks to pr_* MAINTAINERS: adi-buildroot-devel is moderated MAINTAINERS: add linux-api for review of API/ABI changes mm/kmemleak-test.c: use pr_fmt for logging fs/dlm/debug_fs.c: replace seq_printf by seq_puts fs/dlm/lockspace.c: convert simple_str to kstr fs/dlm/config.c: convert simple_str to kstr mm: mark remap_file_pages() syscall as deprecated mm: memcontrol: remove unnecessary memcg argument from soft limit functions mm: memcontrol: clean up memcg zoneinfo lookup mm/memblock.c: call kmemleak directly from memblock_(alloc|free) mm/mempool.c: update the kmemleak stack trace for mempool allocations lib/radix-tree.c: update the kmemleak stack trace for radix tree allocations mm: introduce kmemleak_update_trace() mm/kmemleak.c: use %u to print ->checksum ...
2014-06-06tracing: Fix memory leak on instance deletionSteven Rostedt (Red Hat)
When an instance is created, it also gets a snapshot ring buffer allocated (with minimum of pages). But when it is deleted the snapshot buffer is not. There was a helper function added to match the allocation of these ring buffers to a way to free them, but it wasn't used by the deletion of an instance. Using that helper function solves this memory leak. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-06-06sysctl: convert use of typedef ctl_table to struct ctl_tableJoe Perches
This typedef is unnecessary and should just be removed. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06kernel/seccomp.c: kernel-doc warning fixFabian Frederick
+ fix small typo Signed-off-by: Fabian Frederick <fabf@skynet.be> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06ipc, kernel: clear whitespacePaul McQuade
trailing whitespace Signed-off-by: Paul McQuade <paulmcquad@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06ipc, kernel: use Linux headersPaul McQuade
Use #include <linux/uaccess.h> instead of <asm/uaccess.h> Use #include <linux/types.h> instead of <asm/types.h> Signed-off-by: Paul McQuade <paulmcquad@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06kernel/profile.c: use static const char instead of static charFabian Frederick
schedstr, sleepstr and kvmstr are only used in strcmp & strlen Signed-off-by: Fabian Frederick <fabf@skynet.be> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06kernel/profile.c: convert printk to pr_foo()Fabian Frederick
Signed-off-by: Fabian Frederick <fabf@skynet.be> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06kernel/user_namespace.c: kernel-doc/checkpatch fixesFabian Frederick
-uid->gid -split some function declarations -if/then/else warning Signed-off-by: Fabian Frederick <fabf@skynet.be> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06sysctl: allow for strict write position handlingKees Cook
When writing to a sysctl string, each write, regardless of VFS position, begins writing the string from the start. This means the contents of the last write to the sysctl controls the string contents instead of the first: open("/proc/sys/kernel/modprobe", O_WRONLY) = 1 write(1, "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"..., 4096) = 4096 write(1, "/bin/true", 9) = 9 close(1) = 0 $ cat /proc/sys/kernel/modprobe /bin/true Expected behaviour would be to have the sysctl be "AAAA..." capped at maxlen (in this case KMOD_PATH_LEN: 256), instead of truncating to the contents of the second write. Similarly, multiple short writes would not append to the sysctl. The old behavior is unlike regular POSIX files enough that doing audits of software that interact with sysctls can end up in unexpected or dangerous situations. For example, "as long as the input starts with a trusted path" turns out to be an insufficient filter, as what must also happen is for the input to be entirely contained in a single write syscall -- not a common consideration, especially for high level tools. This provides kernel.sysctl_writes_strict as a way to make this behavior act in a less surprising manner for strings, and disallows non-zero file position when writing numeric sysctls (similar to what is already done when reading from non-zero file positions). For now, the default (0) is to warn about non-zero file position use, but retain the legacy behavior. Setting this to -1 disables the warning, and setting this to 1 enables the file position respecting behavior. [akpm@linux-foundation.org: fix build] [akpm@linux-foundation.org: move misplaced hunk, per Randy] Signed-off-by: Kees Cook <keescook@chromium.org> Cc: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06sysctl: refactor sysctl string writing logicKees Cook
Consolidate buffer length checking with new-line/end-of-line checking. Additionally, instead of reading user memory twice, just do the assignment during the loop. This change doesn't affect the potential races here. It was already possible to read a sysctl that was in the middle of a write. In both cases, the string will always be NULL terminated. The pre-existing race remains a problem to be solved. Signed-off-by: Kees Cook <keescook@chromium.org> Cc: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06sysctl: clean up char buffer argumentsKees Cook
When writing to a sysctl string, each write, regardless of VFS position, began writing the string from the start. This meant the contents of the last write to the sysctl controlled the string contents instead of the first. This misbehavior was featured in an exploit against Chrome OS. While it's not in itself a vulnerability, it's a weirdness that isn't on the mind of most auditors: "This filter looks correct, the first line written would not be meaningful to sysctl" doesn't apply here, since the size of the write and the contents of the final write are what matter when writing to sysctls. This adds the sysctl kernel.sysctl_writes_strict to control the write behavior. The default (0) reports when VFS position is non-0 on a write, but retains legacy behavior, -1 disables the warning, and 1 enables the position-respecting behavior. The long-term plan here is to wait for userspace to be fixed in response to the new warning and to then switch the default kernel behavior to the new position-respecting behavior. This patch (of 4): The char buffer arguments are needlessly cast in weird places. Clean it up so things are easier to read. Signed-off-by: Kees Cook <keescook@chromium.org> Cc: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06kernel/kexec.c: convert printk to pr_foo()Fabian Frederick
+ some pr_warning -> pr_warn and checkpatch warning fixes Signed-off-by: Fabian Frederick <fabf@skynet.be> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06kernel/panic.c: add "crash_kexec_post_notifiers" option for kdump after ↵Masami Hiramatsu
panic_notifers Add a "crash_kexec_post_notifiers" boot option to run kdump after running panic_notifiers and dump kmsg. This can help rare situations where kdump fails because of unstable crashed kernel or hardware failure (memory corruption on critical data/code), or the 2nd kernel is already broken by the 1st kernel (it's a broken behavior, but who can guarantee that the "crashed" kernel works correctly?). Usage: add "crash_kexec_post_notifiers" to kernel boot option. Note that this actually increases risks of the failure of kdump. This option should be set only if you worry about the rare case of kdump failure rather than increasing the chance of success. Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Acked-by: Motohiro Kosaki <Motohiro.Kosaki@us.fujitsu.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Yoshihiro YUNOMAE <yoshihiro.yunomae.ez@hitachi.com> Cc: Satoru MORIYA <satoru.moriya.br@hitachi.com> Cc: Tomoki Sekiyama <tomoki.sekiyama@hds.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06smp: print more useful debug info upon receiving IPI on an offline CPUSrivatsa S. Bhat
There is a longstanding problem related to CPU hotplug which causes IPIs to be delivered to offline CPUs, and the smp-call-function IPI handler code prints out a warning whenever this is detected. Every once in a while this (usually harmless) warning gets reported on LKML, but so far it has not been completely fixed. Usually the solution involves finding out the IPI sender and fixing it by adding appropriate synchronization with CPU hotplug. However, while going through one such internal bug reports, I found that there is a significant bug in the receiver side itself (more specifically, in stop-machine) that can lead to this problem even when the sender code is perfectly fine. This patchset fixes that synchronization problem in the CPU hotplug stop-machine code. Patch 1 adds some additional debug code to the smp-call-function framework, to help debug such issues easily. Patch 2 modifies the stop-machine code to ensure that any IPIs that were sent while the target CPU was online, would be noticed and handled by that CPU without fail before it goes offline. Thus, this avoids scenarios where IPIs are received on offline CPUs (as long as the sender uses proper hotplug synchronization). In fact, I debugged the problem by using Patch 1, and found that the payload of the IPI was always the block layer's trigger_softirq() function. But I was not able to find anything wrong with the block layer code. That's when I started looking at the stop-machine code and realized that there is a race-window which makes the IPI _receiver_ the culprit, not the sender. Patch 2 fixes that race and hence this should put an end to most of the hard-to-debug IPI-to-offline-CPU issues. This patch (of 2): Today the smp-call-function code just prints a warning if we get an IPI on an offline CPU. This info is sufficient to let us know that something went wrong, but often it is very hard to debug exactly who sent the IPI and why, from this info alone. In most cases, we get the warning about the IPI to an offline CPU, immediately after the CPU going offline comes out of the stop-machine phase and reenables interrupts. Since all online CPUs participate in stop-machine, the information regarding the sender of the IPI is already lost by the time we exit the stop-machine loop. So even if we dump the stack on each CPU at this point, we won't find anything useful since all of them will show the stack-trace of the stopper thread. So we need a better way to figure out who sent the IPI and why. To achieve this, when we detect an IPI targeted to an offline CPU, loop through the call-single-data linked list and print out the payload (i.e., the name of the function which was supposed to be executed by the target CPU). This would give us an insight as to who might have sent the IPI and help us debug this further. [akpm@linux-foundation.org: correctly suppress warning output on second and later occurrences] Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Borislav Petkov <bp@suse.de> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Mike Galbraith <mgalbraith@suse.de> Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Rafael J. Wysocki <rjw@rjwysocki.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06signals: change wait_for_helper() to use kernel_sigaction()Oleg Nesterov
Now that we have kernel_sigaction() we can change wait_for_helper() to use it and cleans up the code a bit. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Richard Weinberger <richard@nod.at> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06signals: introduce kernel_sigaction()Oleg Nesterov
Now that allow_signal() is really trivial we can unify it with disallow_signal(). Add the new helper, kernel_sigaction(), and reimplement allow_signal/disallow_signal as a trivial wrappers. This saves one EXPORT_SYMBOL() and the new helper can have more users. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Richard Weinberger <richard@nod.at> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06signals: disallow_signal() should flush the potentially pending signalOleg Nesterov
disallow_signal() simply sets SIG_IGN, this is not enough and recalc_sigpending() is simply pointless because in can never change the state of TIF_SIGPENDING. If we ignore a signal, we also need to do flush_sigqueue_mask() for the case when this signal is pending, this way recalc_sigpending() can actually clear TIF_SIGPENDING and we do not "leak" the allocated siginfo's. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Richard Weinberger <richard@nod.at> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06signals: kill the obsolete sigdelset() and recalc_sigpending() in allow_signal()Oleg Nesterov
allow_signal() does sigdelset(current->blocked) due to historic reason, previously it could be called by a daemonize()'ed kthread, and daemonize() played with current->blocked. Now that daemonize() has gone away we can remove sigdelset() and recalc_sigpending(). If a user really wants to unblock a signal, it must use sigprocmask() or set_current_block() explicitely. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Richard Weinberger <richard@nod.at> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06signals: mv {dis,}allow_signal() from sched.h/exit.c to signal.[ch]Oleg Nesterov
Move the declaration/definition of allow_signal/disallow_signal to signal.h/signal.c. The new place is more logical and allows to use the static helpers in signal.c (see the next changes). While at it, make them return void and remove the valid_signal() check. Nobody checks the returned value, and in-kernel users must not pass the wrong signal number. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Richard Weinberger <richard@nod.at> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06signals: cleanup the usage of t/current in do_sigaction()Oleg Nesterov
The usage of "task_struct *t" and "current" in do_sigaction() looks really annoying and chaotic. Initially "t" is used as a cached value of current but not consistently, then it is reused as a loop variable and we have to use "current" again. Clean up this mess and also convert the code to use for_each_thread(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Richard Weinberger <richard@nod.at> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06signals: rename rm_from_queue_full() to flush_sigqueue_mask()Oleg Nesterov
"rm_from_queue_full" looks ugly and misleading, especially now that rm_from_queue() has gone away. Rename it to flush_sigqueue_mask(), this matches flush_sigqueue() we already have. Also remove the obsolete comment which explains the difference with rm_from_queue() we already killed. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Richard Weinberger <richard@nod.at> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06signals: kill rm_from_queue(), change prepare_signal() to use for_each_thread()Oleg Nesterov
rm_from_queue() doesn't make sense. The only caller, prepare_signal(), can use rm_from_queue_full() with the same effect. While at it, change prepare_signal() to use for_each_thread() instead of do/while_each_thread. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Richard Weinberger <richard@nod.at> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06signals: s/siginitset/sigemptyset/ in do_sigtimedwait()Oleg Nesterov
Cosmetic, but siginitset(0) looks a bit strange, sigemptyset() is what do_sigtimedwait() needs. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Richard Weinberger <richard@nod.at> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06ptrace: task_clear_jobctl_trapping()->wake_up_bit() needs mb()Oleg Nesterov
__wake_up_bit() checks waitqueue_active() and thus the caller needs mb() as wake_up_bit() documents, fix task_clear_jobctl_trapping(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06ptrace: fix fork event messages across pid namespacesMatthew Dempsky
When tracing a process in another pid namespace, it's important for fork event messages to contain the child's pid as seen from the tracer's pid namespace, not the parent's. Otherwise, the tracer won't be able to correlate the fork event with later SIGTRAP signals it receives from the child. We still risk a race condition if a ptracer from a different pid namespace attaches after we compute the pid_t value. However, sending a bogus fork event message in this unlikely scenario is still a vast improvement over the status quo where we always send bogus fork event messages to debuggers in a different pid namespace than the forking process. Signed-off-by: Matthew Dempsky <mdempsky@chromium.org> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Julien Tinnes <jln@chromium.org> Cc: Roland McGrath <mcgrathr@chromium.org> Cc: Jan Kratochvil <jan.kratochvil@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-07PM / sleep: trace events for suspend/resumeTodd E Brandt
Adds trace events that give finer resolution into suspend/resume. These events are graphed in the timelines generated by the analyze_suspend.py script. They represent large areas of time consumed that are typical to suspend and resume. The event is triggered by calling the function "trace_suspend_resume" with three arguments: a string (the name of the event to be displayed in the timeline), an integer (case specific number, such as the power state or cpu number), and a boolean (where true is used to denote the start of the timeline event, and false to denote the end). The suspend_resume trace event reproduces the data that the machine_suspend trace event did, so the latter has been removed. Signed-off-by: Todd Brandt <todd.e.brandt@intel.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2014-06-07Merge branch 'acpi-pm' into pm-sleepRafael J. Wysocki
2014-06-06Merge branch 'sched-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Ingo Molnar: "Four misc fixes: each was deemed serious enough to warrant v3.15 inclusion" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/fair: Fix tg_set_cfs_bandwidth() deadlock on rq->lock sched/dl: Fix race in dl_task_timer() sched: Fix sched_policy < 0 comparison sched/numa: Fix use of spin_{un}lock_irq() when interrupts are disabled
2014-06-06tracing: Fix leak of ring buffer data when new instances creation failsSteven Rostedt (Red Hat)
Yoshihiro Yunomae reported that the ring buffer data for a trace instance does not get properly cleaned up when it fails. He proposed a patch that manually cleaned the data up and addad a bunch of labels. The labels are not needed because all trace array is allocated with a kzalloc which initializes it to 0 and all kfree()s can take a NULL pointer and will ignore it. Adding a new helper function free_trace_buffers() that can also take null buffers to free the buffers that were allocated by allocate_trace_buffers(). Link: http://lkml.kernel.org/r/20140605223522.32311.31664.stgit@yunodevel Reported-by: Yoshihiro YUNOMAE <yoshihiro.yunomae.ez@hitachi.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-06-06tracing/kprobes: Avoid self tests if tracing is disabled on boot upYoshihiro YUNOMAE
If tracing is disabled on boot up, the kernel should not execute tracing self tests. The kernel should check whether tracing is disabled or not before executing any of the tracing self tests. Link: http://lkml.kernel.org/p/20140605223520.32311.56097.stgit@yunodevel Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Signed-off-by: Yoshihiro YUNOMAE <yoshihiro.yunomae.ez@hitachi.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-06-06tracing: Return error if ftrace_trace_arrays list is emptyYoshihiro YUNOMAE
ftrace_trace_arrays links global_trace.list. However, global_trace is not added to ftrace_trace_arrays if trace_alloc_buffers() failed. As the result, ftrace_trace_arrays becomes an empty list. If ftrace_trace_arrays is an empty list, current top_trace_array() returns an invalid pointer. As the result, the kernel can induce memory corruption or panic. Current implementation does not check whether ftrace_trace_arrays is empty list or not. So, in this patch, if ftrace_trace_arrays is empty list, top_trace_array() returns NULL. Moreover, this patch makes all functions calling top_trace_array() handle it appropriately. Link: http://lkml.kernel.org/p/20140605223517.32311.99233.stgit@yunodevel Signed-off-by: Yoshihiro YUNOMAE <yoshihiro.yunomae.ez@hitachi.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-06-06tracing: Only calculate stats of tracepoint benchmarks for 2^32 timesSteven Rostedt (Red Hat)
When calculating the average and standard deviation, it is required that the count be less than UINT_MAX, otherwise the do_div() will get undefined results. After 2^32 counts of data, the average and standard deviation should pretty much be set anyway. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-06-05tracing: Convert stddev into u64 in tracepoint benchmarkSteven Rostedt (Red Hat)
I've been told that do_div() expects an unsigned 64 bit number, and is undefined if a signed is used. This gave a warning on the MIPS build. I'm not sure if a signed 64 bit dividend is really an issue or not, but the calculation this is used for is standard deviation, and that isn't going to be negative. We can just convert it to unsigned and be safe. Reported-by: David Daney <ddaney.cavm@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-06-05Merge branch 'for-linus' of git://ftp.arm.linux.org.uk/~rmk/linux-arm into nextLinus Torvalds
Pull ARM updates from Russell King: - Major clean-up of the L2 cache support code. The existing mess was becoming rather unmaintainable through all the additions that others have done over time. This turns it into a much nicer structure, and implements a few performance improvements as well. - Clean up some of the CP15 control register tweaks for alignment support, moving some code and data into alignment.c - DMA properties for ARM, from Santosh and reviewed by DT people. This adds DT properties to specify bus translations we can't discover automatically, and to indicate whether devices are coherent. - Hibernation support for ARM - Make ftrace work with read-only text in modules - add suspend support for PJ4B CPUs - rework interrupt masking for undefined instruction handling, which allows us to enable interrupts earlier in the handling of these exceptions. - support for big endian page tables - fix stacktrace support to exclude stacktrace functions from the trace, and add save_stack_trace_regs() implementation so that kprobes can record stack traces. - Add support for the Cortex-A17 CPU. - Remove last vestiges of ARM710 support. - Removal of ARM "meminfo" structure, finally converting us solely to memblock to handle the early memory initialisation. * 'for-linus' of git://ftp.arm.linux.org.uk/~rmk/linux-arm: (142 commits) ARM: ensure C page table setup code follows assembly code (part II) ARM: ensure C page table setup code follows assembly code ARM: consolidate last remaining open-coded alignment trap enable ARM: remove global cr_no_alignment ARM: remove CPU_CP15 conditional from alignment.c ARM: remove unused adjust_cr() function ARM: move "noalign" command line option to alignment.c ARM: provide common method to clear bits in CPU control register ARM: 8025/1: Get rid of meminfo ARM: 8060/1: mm: allow sub-architectures to override PCI I/O memory type ARM: 8066/1: correction for ARM patch 8031/2 ARM: 8049/1: ftrace/add save_stack_trace_regs() implementation ARM: 8065/1: remove last use of CONFIG_CPU_ARM710 ARM: 8062/1: Modify ldrt fixup handler to re-execute the userspace instruction ARM: 8047/1: rwsem: use asm-generic rwsem implementation ARM: l2c: trial at enabling some Cortex-A9 optimisations ARM: l2c: add warnings for stuff modifying aux_ctrl register values ARM: l2c: print a warning with L2C-310 caches if the cache size is modified ARM: l2c: remove old .set_debug method ARM: l2c: kill L2X0_AUX_CTRL_MASK before anyone else makes use of this ...
2014-06-05Merge branch 'futex-fixes' (futex fixes from Thomas Gleixner)Linus Torvalds
Merge futex fixes from Thomas Gleixner: "So with more awake and less futex wreckaged brain, I went through my list of points again and came up with the following 4 patches. 1) Prevent pi requeueing on the same futex I kept Kees check for uaddr1 == uaddr2 as a early check for private futexes and added a key comparison to both futex_requeue and futex_wait_requeue_pi. Sebastian, sorry for the confusion yesterday night. I really misunderstood your question. You are right the check is pointless for shared futexes where the same physical address is mapped to two different virtual addresses. 2) Sanity check atomic acquisiton in futex_lock_pi_atomic That's basically what Darren suggested. I just simplified it to use futex_top_waiter() to find kernel internal state. If state is found return -EINVAL and do not bother to fix up the user space variable. It's corrupted already. 3) Ensure state consistency in futex_unlock_pi The code is silly versus the owner died bit. There is no point to preserve it on unlock when the user space thread owns the futex. What's worse is that it does not update the user space value when the owner died bit is set. So the kernel itself creates observable inconsistency. Another "optimization" is to retry an atomic unlock. That's pointless as in a sane environment user space would not call into that code if it could have unlocked it atomically. So we always check whether there is kernel state around and only if there is none, we do the unlock by setting the user space value to 0. 4) Sanitize lookup_pi_state lookup_pi_state is ambigous about TID == 0 in the user space value. This can be a valid state even if there is kernel state on this uaddr, but we miss a few corner case checks. I tried to come up with a smaller solution hacking the checks into the current cruft, but it turned out to be ugly as hell and I got more confused than I was before. So I rewrote the sanity checks along the state documentation with awful lots of commentry" * emailed patches from Thomas Gleixner <tglx@linutronix.de>: futex: Make lookup_pi_state more robust futex: Always cleanup owner tid in unlock_pi futex: Validate atomic acquisition in futex_lock_pi_atomic() futex-prevent-requeue-pi-on-same-futex.patch futex: Forbid uaddr == uaddr2 in futex_requeue(..., requeue_pi=1)
2014-06-05futex: Make lookup_pi_state more robustThomas Gleixner
The current implementation of lookup_pi_state has ambigous handling of the TID value 0 in the user space futex. We can get into the kernel even if the TID value is 0, because either there is a stale waiters bit or the owner died bit is set or we are called from the requeue_pi path or from user space just for fun. The current code avoids an explicit sanity check for pid = 0 in case that kernel internal state (waiters) are found for the user space address. This can lead to state leakage and worse under some circumstances. Handle the cases explicit: Waiter | pi_state | pi->owner | uTID | uODIED | ? [1] NULL | --- | --- | 0 | 0/1 | Valid [2] NULL | --- | --- | >0 | 0/1 | Valid [3] Found | NULL | -- | Any | 0/1 | Invalid [4] Found | Found | NULL | 0 | 1 | Valid [5] Found | Found | NULL | >0 | 1 | Invalid [6] Found | Found | task | 0 | 1 | Valid [7] Found | Found | NULL | Any | 0 | Invalid [8] Found | Found | task | ==taskTID | 0/1 | Valid [9] Found | Found | task | 0 | 0 | Invalid [10] Found | Found | task | !=taskTID | 0/1 | Invalid [1] Indicates that the kernel can acquire the futex atomically. We came came here due to a stale FUTEX_WAITERS/FUTEX_OWNER_DIED bit. [2] Valid, if TID does not belong to a kernel thread. If no matching thread is found then it indicates that the owner TID has died. [3] Invalid. The waiter is queued on a non PI futex [4] Valid state after exit_robust_list(), which sets the user space value to FUTEX_WAITERS | FUTEX_OWNER_DIED. [5] The user space value got manipulated between exit_robust_list() and exit_pi_state_list() [6] Valid state after exit_pi_state_list() which sets the new owner in the pi_state but cannot access the user space value. [7] pi_state->owner can only be NULL when the OWNER_DIED bit is set. [8] Owner and user space value match [9] There is no transient state which sets the user space TID to 0 except exit_robust_list(), but this is indicated by the FUTEX_OWNER_DIED bit. See [4] [10] There is no transient state which leaves owner and user space TID out of sync. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Kees Cook <keescook@chromium.org> Cc: Will Drewry <wad@chromium.org> Cc: Darren Hart <dvhart@linux.intel.com> Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-05futex: Always cleanup owner tid in unlock_piThomas Gleixner
If the owner died bit is set at futex_unlock_pi, we currently do not cleanup the user space futex. So the owner TID of the current owner (the unlocker) persists. That's observable inconsistant state, especially when the ownership of the pi state got transferred. Clean it up unconditionally. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Kees Cook <keescook@chromium.org> Cc: Will Drewry <wad@chromium.org> Cc: Darren Hart <dvhart@linux.intel.com> Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>