summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2008-04-29Taint kernel after WARN_ON(condition)Nur Hussein
The kernel is sent to tainted within the warn_on_slowpath() function, and whenever a warning occurs the new taint flag 'W' is set. This is useful to know if a warning occurred before a BUG by preserving the warning as a flag in the taint state. This does not work on architectures where WARN_ON has its own definition. These archs are: 1. s390 2. superh 3. avr32 4. parisc The maintainers of these architectures have been added in the Cc: list in this email to alert them to the situation. The documentation in oops-tracing.txt has been updated to include the new flag. Signed-off-by: Nur Hussein <nurhussein@gmail.com> Cc: Arjan van de Ven <arjan@infradead.org> Cc: "Randy.Dunlap" <rdunlap@xenotime.net> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Kyle McMartin <kyle@mcmartin.ca> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Haavard Skinnemoen <hskinnemoen@atmel.com> Cc: Paul Mundt <lethal@linux-sh.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28kernel: fix integer as NULL pointer warningsHarvey Harrison
kernel/cpuset.c:1268:52: warning: Using plain integer as NULL pointer kernel/pid_namespace.c:95:24: warning: Using plain integer as NULL pointer Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Reviewed-by: Paul Jackson <pj@sgi.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28ptrace: conditionalize compat_ptrace_requestRoland McGrath
My recent additions to compat_ptrace_request made it mandatory for CONFIG_COMPAT arch's to define copy_siginfo_from_user32. This broke some builds, though they all really should get cleaned up in that way. Since all the arch's that actually call compat_ptrace_request have now been cleaned up to use the generic compat_sys_ptrace, we can avoid the build problems on the crufty arch's by changing the conditionals on the definition. Signed-off-by: Roland McGrath <roland@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28hrtimer: raise softirq unlocked to avoid circular lock dependencyThomas Gleixner
The scheduler hrtimer bits in 2.6.25 introduced a circular lock dependency in a rare code path: ======================================================= [ INFO: possible circular locking dependency detected ] 2.6.25-sched-devel.git-x86-latest.git #19 ------------------------------------------------------- X/2980 is trying to acquire lock: (&rq->rq_lock_key#2){++..}, at: [<ffffffff80230146>] task_rq_lock+0x56/0xa0 but task is already holding lock: (&cpu_base->lock){++..}, at: [<ffffffff80257ae1>] lock_hrtimer_base+0x31/0x60 which lock already depends on the new lock. The scenario which leads to this is: posix-timer signal is delivered -> posix-timer is rearmed timer is already expired in hrtimer_enqueue() -> softirq is raised To prevent this we need to move the raise of the softirq out of the base->lock protected code path. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@kernel.org Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
2008-04-28Merge git://git.kernel.org/pub/scm/linux/kernel/git/tglx/linux-2.6-hrtLinus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/tglx/linux-2.6-hrt: hrtimer: timeout too long when using HRTIMER_CB_SOFTIRQ
2008-04-28PM/gxfb: add hook to PM console layer that allows disabling of suspend VT switchAndres Salomon
Prior to suspend, we allocate and switch to a new VT; after suspend, we switch back to the original VT. This can be slow, and is completely unnecessary if the framebuffer we're using can restore video properly. This adds a hook that allows drivers to select whether or not to do this vt switch, and changes the gxfb driver to call this hook. It also adds a module param to gxfb to allow controlling of the vt switch (defaulting to no switch). (Note: I'm not convinced that console_sem is the best way to protect this, but we should probably have some form of locking..) [akpm@linux-foundation.org: build fix] Signed-off-by: Andres Salomon <dilinger@debian.org> Cc: Jordan Crouse <jordan.crouse@amd.com> Cc: "Antonino A. Daplas" <adaplas@pol.net> Cc: Pavel Machek <pavel@ucw.cz> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28kprobes: add (un)register_jprobes for batch registrationMasami Hiramatsu
Introduce unregister_/register_jprobes() for jprobe batch registration. Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Jim Keniston <jkenisto@us.ibm.com> Cc: Prasanna S Panchamukhi <prasanna@in.ibm.com> Cc: Shaohua Li <shaohua.li@intel.com> Cc: David Miller <davem@davemloft.net> Cc: "Frank Ch. Eigler" <fche@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28kprobes: add (un)register_kretprobes for batch registrationMasami Hiramatsu
Introduce unregister_/register_kretprobes() for kretprobe batch registration. Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Jim Keniston <jkenisto@us.ibm.com> Cc: Prasanna S Panchamukhi <prasanna@in.ibm.com> Cc: Shaohua Li <shaohua.li@intel.com> Cc: David Miller <davem@davemloft.net> Cc: "Frank Ch. Eigler" <fche@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28kprobes: add (un)register_kprobes for batch registrationMasami Hiramatsu
Introduce unregister_/register_kprobes() for kprobe batch registration. This can reduce waiting time for synchronized_sched() when a lot of probes have to be unregistered at once. Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Jim Keniston <jkenisto@us.ibm.com> Cc: Prasanna S Panchamukhi <prasanna@in.ibm.com> Cc: Shaohua Li <shaohua.li@intel.com> Cc: David Miller <davem@davemloft.net> Cc: "Frank Ch. Eigler" <fche@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28kprobes: prevent probing of preempt_schedule()Srinivasa Ds
Prohibit users from probing preempt_schedule(). One way of prohibiting the user from probing functions is by marking such functions with __kprobes. But this method doesn't work for those functions, which are already marked to different section like preempt_schedule() (belongs to __sched section). So we use blacklist approach to refuse user from probing these functions. In blacklist approach we populate the blacklisted function's starting address and its size in kprobe_blacklist structure. Then we verify the user specified address against start and end of the blacklisted function. So any attempt to register probe on blacklisted functions will be rejected. [akpm@linux-foundation.org: build fix] [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Srinivasa DS <srinivasa@in.ibm.com> Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Signed-off-by: Jim Keniston <jkenisto@us.ibm.com> Cc: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28capabilities: implement per-process securebitsAndrew G. Morgan
Filesystem capability support makes it possible to do away with (set)uid-0 based privilege and use capabilities instead. That is, with filesystem support for capabilities but without this present patch, it is (conceptually) possible to manage a system with capabilities alone and never need to obtain privilege via (set)uid-0. Of course, conceptually isn't quite the same as currently possible since few user applications, certainly not enough to run a viable system, are currently prepared to leverage capabilities to exercise privilege. Further, many applications exist that may never get upgraded in this way, and the kernel will continue to want to support their setuid-0 base privilege needs. Where pure-capability applications evolve and replace setuid-0 binaries, it is desirable that there be a mechanisms by which they can contain their privilege. In addition to leveraging the per-process bounding and inheritable sets, this should include suppressing the privilege of the uid-0 superuser from the process' tree of children. The feature added by this patch can be leveraged to suppress the privilege associated with (set)uid-0. This suppression requires CAP_SETPCAP to initiate, and only immediately affects the 'current' process (it is inherited through fork()/exec()). This reimplementation differs significantly from the historical support for securebits which was system-wide, unwieldy and which has ultimately withered to a dead relic in the source of the modern kernel. With this patch applied a process, that is capable(CAP_SETPCAP), can now drop all legacy privilege (through uid=0) for itself and all subsequently fork()'d/exec()'d children with: prctl(PR_SET_SECUREBITS, 0x2f); This patch represents a no-op unless CONFIG_SECURITY_FILE_CAPABILITIES is enabled at configure time. [akpm@linux-foundation.org: fix uninitialised var warning] [serue@us.ibm.com: capabilities: use cap_task_prctl when !CONFIG_SECURITY] Signed-off-by: Andrew G. Morgan <morgan@kernel.org> Acked-by: Serge Hallyn <serue@us.ibm.com> Reviewed-by: James Morris <jmorris@namei.org> Cc: Stephen Smalley <sds@tycho.nsa.gov> Cc: Paul Moore <paul.moore@hp.com> Signed-off-by: Serge E. Hallyn <serue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28mempolicy: rename mpol_copy to mpol_dupLee Schermerhorn
This patch renames mpol_copy() to mpol_dup() because, well, that's what it does. Like, e.g., strdup() for strings, mpol_dup() takes a pointer to an existing mempolicy, allocates a new one and copies the contents. In a later patch, I want to use the name mpol_copy() to copy the contents from one mempolicy to another like, e.g., strcpy() does for strings. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Christoph Lameter <clameter@sgi.com> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28mempolicy: rename mpol_free to mpol_putLee Schermerhorn
This is a change that was requested some time ago by Mel Gorman. Makes sense to me, so here it is. Note: I retain the name "mpol_free_shared_policy()" because it actually does free the shared_policy, which is NOT a reference counted object. However, ... The mempolicy object[s] referenced by the shared_policy are reference counted, so mpol_put() is used to release the reference held by the shared_policy. The mempolicy might not be freed at this time, because some task attached to the shared object associated with the shared policy may be in the process of allocating a page based on the mempolicy. In that case, the task performing the allocation will hold a reference on the mempolicy, obtained via mpol_shared_policy_lookup(). The mempolicy will be freed when all tasks holding such a reference have called mpol_put() for the mempolicy. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Christoph Lameter <clameter@sgi.com> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28vmcoreinfo: add page flags valuesKen'ichi Ohmichi
Add some values of page flags to the vmcoreinfo data. The vmcoreinfo data has the minimum debugging information only for dump filtering. makedumpfile (dump filtering command) gets it to distinguish unnecessary pages, and makedumpfile creates a small dumpfile. An old makedumpfile (v1.2.4 or before) had assumed some values of page flags internally, and this implementation could not follow the change of these values. For example, Christoph Lameter is changing these values by the follwing patch: http://lkml.org/lkml/2008/2/29/463 So a new makedumpfile (v1.2.5) came to need these values and I created this patch to let the kernel output them. Signed-off-by: Ken'ichi Ohmichi <oomichi@mxs.nes.nec.co.jp> Cc: Christoph Lameter <clameter@sgi.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28mm: Get rid of __ZONE_COUNTChristoph Lameter
It was used to compensate because MAX_NR_ZONES was not available to the #ifdefs. Export MAX_NR_ZONES via the new mechanism and get rid of __ZONE_COUNT. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28pageflags: get rid of FLAGS_RESERVEDChristoph Lameter
NR_PAGEFLAGS specifies the number of page flags we are using. From that we can calculate the number of bits leftover that can be used for zone, node (and maybe the sections id). There is no need anymore for FLAGS_RESERVED if we use NR_PAGEFLAGS. Use the new methods to make NR_PAGEFLAGS available via the preprocessor. NR_PAGEFLAGS is used to calculate field boundaries in the page flags fields. These field widths have to be available to the preprocessor. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: David Miller <davem@davemloft.net> Cc: Andy Whitcroft <apw@shadowen.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28kbuild: create a way to create preprocessor constants from C expressionsChristoph Lameter
The use of enums create constants that are not available to the preprocessor when building the kernel (f.e. MAX_NR_ZONES). Arch code already has a way to export constants calculated to the preprocessor through the asm-offsets.c file. Generate something similar for the core kernel through kbuild. Signed-off-by: Sam Ravnborg <sam@ravnborg.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Andy Whitcroft <apw@shadowen.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28mm: filter based on a nodemask as well as a gfp_maskMel Gorman
The MPOL_BIND policy creates a zonelist that is used for allocations controlled by that mempolicy. As the per-node zonelist is already being filtered based on a zone id, this patch adds a version of __alloc_pages() that takes a nodemask for further filtering. This eliminates the need for MPOL_BIND to create a custom zonelist. A positive benefit of this is that allocations using MPOL_BIND now use the local node's distance-ordered zonelist instead of a custom node-id-ordered zonelist. I.e., pages will be allocated from the closest allowed node with available memory. [Lee.Schermerhorn@hp.com: Mempolicy: update stale documentation and comments] [Lee.Schermerhorn@hp.com: Mempolicy: make dequeue_huge_page_vma() obey MPOL_BIND nodemask] [Lee.Schermerhorn@hp.com: Mempolicy: make dequeue_huge_page_vma() obey MPOL_BIND nodemask rework] Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28mm: have zonelist contains structs with both a zone pointer and zone_idxMel Gorman
Filtering zonelists requires very frequent use of zone_idx(). This is costly as it involves a lookup of another structure and a substraction operation. As the zone_idx is often required, it should be quickly accessible. The node idx could also be stored here if it was found that accessing zone->node is significant which may be the case on workloads where nodemasks are heavily used. This patch introduces a struct zoneref to store a zone pointer and a zone index. The zonelist then consists of an array of these struct zonerefs which are looked up as necessary. Helpers are given for accessing the zone index as well as the node index. [kamezawa.hiroyu@jp.fujitsu.com: Suggested struct zoneref instead of embedding information in pointers] [hugh@veritas.com: mm-have-zonelist: fix memcg ooms] [hugh@veritas.com: just return do_try_to_free_pages] [hugh@veritas.com: do_try_to_free_pages gfp_mask redundant] Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Christoph Lameter <clameter@sgi.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <clameter@sgi.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-27hrtimer: timeout too long when using HRTIMER_CB_SOFTIRQBodo Stroesser
When using hrtimer with timer->cb_mode == HRTIMER_CB_SOFTIRQ in some cases the clockevent is not programmed. This happens, if: - a timer is rearmed while it's state is HRTIMER_STATE_CALLBACK - hrtimer_reprogram() returns -ETIME, when it is called after CALLBACK is finished. This occurs if the new timer->expires is in the past when CALLBACK is done. In this case, the timer needs to be removed from the tree and put onto the pending list again. The patch is against 2.6.22.5, but AFAICS, it is relevant for 2.6.25 also (in run_hrtimer_pending()). Signed-off-by: Bodo Stroesser <bstroesser@fujitsu-siemens.com> Cc: stable@kernel.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-04-27s390: KVM preparation: provide hook to enable pgstes in user pagetableCarsten Otte
The SIE instruction on s390 uses the 2nd half of the page table page to virtualize the storage keys of a guest. This patch offers the s390_enable_sie function, which reorganizes the page tables of a single-threaded process to reserve space in the page table: s390_enable_sie makes sure that the process is single threaded and then uses dup_mm to create a new mm with reorganized page tables. The old mm is freed and the process has now a page status extended field after every page table. Code that wants to exploit pgstes should SELECT CONFIG_PGSTE. This patch has a small common code hit, namely making dup_mm non-static. Edit (Carsten): I've modified Martin's patch, following Jeremy Fitzhardinge's review feedback. Now we do have the prototype for dup_mm in include/linux/sched.h. Following Martin's suggestion, s390_enable_sie() does now call task_lock() to prevent race against ptrace modification of mm_users. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Carsten Otte <cotte@de.ibm.com> Acked-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-04-26Fix uninitialized 'copy' in unshare_filesAl Viro
Arrgghhh... Sorry about that, I'd been sure I'd folded that one, but it actually got lost. Please apply - that breaks execve(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Tested-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-25Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: [PATCH] sanitize locate_fd() [PATCH] sanitize unshare_files/reset_files_struct [PATCH] sanitize handling of shared descriptor tables in failing execve() [PATCH] close race in unshare_files() [PATCH] restore sane ->umount_begin() API cifs: timeout dfs automounts +little fix.
2008-04-25Merge branch 'release' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux-2.6 * 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux-2.6: [PATCH] Build fix for CONFIG_NUMA=y && CONFIG_SMP=n [IA64] fix bootmem regression on Altix
2008-04-25Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched-fixes * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched-fixes: sched: fix share (re)distribution softlockup: fix NOHZ wakeup seqlock: livelock fix
2008-04-25[PATCH] sanitize unshare_files/reset_files_structAl Viro
* let unshare_files() give caller the displaced files_struct * don't bother with grabbing reference only to drop it in the caller if it hadn't been shared in the first place * in that form unshare_files() is trivially implemented via unshare_fd(), so we eliminate the duplicate logics in fork.c * reset_files_struct() is not just only called for current; it will break the system if somebody ever calls it for anything else (we can't modify ->files of somebody else). Lose the task_struct * argument. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-04-25[PATCH] sanitize handling of shared descriptor tables in failing execve()Al Viro
* unshare_files() can fail; doing it after irreversible actions is wrong and de_thread() is certainly irreversible. * since we do it unconditionally anyway, we might as well do it in do_execve() and save ourselves the PITA in binfmt handlers, etc. * while we are at it, binfmt_som actually leaked files_struct on failure. As a side benefit, unshare_files(), put_files_struct() and reset_files_struct() become unexported. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-04-25[PATCH] close race in unshare_files()Al Viro
updating current->files requires task_lock Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-04-25sched: use alloc_bootmem() instead of alloc_bootmem_low()David Miller
There is no guarantee that there is physical ram below 4GB, and in fact many boxes don't have exactly that. Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-25sched: fix share (re)distributionPeter Zijlstra
fix __aggregate_redistribute_shares() related lockup reported by David S. Miller. The problem this code tries to solve is 'accurately' calculating the 'fair' share of the group weight for each cpu. The current code falls back to a global group rebalance in case the sched_domain's span it looks at has no shares, but does have tasks. The reason it gets stuck here, is because its inherently racy - if someone steals the last task after we compute the agg->rq_weight, but before we rebalance, we'll never get out of the loop. We could of course go fix that, but while looking at this issue I found that this 'fallback' wasn't nearly as rare as I'd hoped it to be. In fact its quite common - and given it walks the whole machine, thats very bad. The new approach is simple (why didn't I think of it before?), we set the aggregate shares to the full task group weight, and each larger sched domain that encounters an aggregate shares larger than the weight, clips it (it already re-distributes anyway). This nicely converges to the desired global picture where the sum of all shares equals the task group weight. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-25softlockup: fix NOHZ wakeupIngo Molnar
David Miller reported: |---------------> the following commit: | commit 27ec4407790d075c325e1f4da0a19c56953cce23 | Author: Ingo Molnar <mingo@elte.hu> | Date: Thu Feb 28 21:00:21 2008 +0100 | | sched: make cpu_clock() globally synchronous | | Alexey Zaytsev reported (and bisected) that the introduction of | cpu_clock() in printk made the timestamps jump back and forth. | | Make cpu_clock() more reliable while still keeping it fast when it's | called frequently. | | Signed-off-by: Ingo Molnar <mingo@elte.hu> causes watchdog triggers when a cpu exits NOHZ state when it has been there for >= the soft lockup threshold, for example here are some messages from a 128 cpu Niagara2 box: [ 168.106406] BUG: soft lockup - CPU#11 stuck for 128s! [dd:3239] [ 168.989592] BUG: soft lockup - CPU#21 stuck for 86s! [swapper:0] [ 168.999587] BUG: soft lockup - CPU#29 stuck for 91s! [make:4511] [ 168.999615] BUG: soft lockup - CPU#2 stuck for 85s! [swapper:0] [ 169.020514] BUG: soft lockup - CPU#37 stuck for 91s! [swapper:0] [ 169.020514] BUG: soft lockup - CPU#45 stuck for 91s! [sh:4515] [ 169.020515] BUG: soft lockup - CPU#69 stuck for 92s! [swapper:0] [ 169.020515] BUG: soft lockup - CPU#77 stuck for 92s! [swapper:0] [ 169.020515] BUG: soft lockup - CPU#61 stuck for 92s! [swapper:0] [ 169.112554] BUG: soft lockup - CPU#85 stuck for 92s! [swapper:0] [ 169.112554] BUG: soft lockup - CPU#101 stuck for 92s! [swapper:0] [ 169.112554] BUG: soft lockup - CPU#109 stuck for 92s! [swapper:0] [ 169.112554] BUG: soft lockup - CPU#117 stuck for 92s! [swapper:0] [ 169.171483] BUG: soft lockup - CPU#40 stuck for 80s! [dd:3239] [ 169.331483] BUG: soft lockup - CPU#13 stuck for 86s! [swapper:0] [ 169.351500] BUG: soft lockup - CPU#43 stuck for 101s! [dd:3239] [ 169.531482] BUG: soft lockup - CPU#9 stuck for 129s! [mkdir:4565] [ 169.595754] BUG: soft lockup - CPU#20 stuck for 93s! [swapper:0] [ 169.626787] BUG: soft lockup - CPU#52 stuck for 93s! [swapper:0] [ 169.626787] BUG: soft lockup - CPU#84 stuck for 92s! [swapper:0] [ 169.636812] BUG: soft lockup - CPU#116 stuck for 94s! [swapper:0] It's simple enough to trigger this by doing a 10 minute sleep after a fresh bootup then starting a parallel kernel build. I suspect this might be reintroducing a problem we've had and fixed before, see the thread: http://marc.info/?l=linux-kernel&m=119546414004065&w=2 <---------------| touch the softlockup watchdog when exiting NOHZ state - we are obviously not locked up. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-24[PATCH] Build fix for CONFIG_NUMA=y && CONFIG_SMP=nMike Travis
Regression caused by 434d53b00d6bb7be0a1d3dcc0d0d5df6c042e164 Signed-off-by: Mike Travis <travis@sgi.com> Signed-off-by: Tony Luck <tony.luck@intel.com>
2008-04-24[IA64] fix bootmem regression on AltixRuss Anderson
A recent change prevents SGI Altix from booting. This patch fixes the problem. The regresson was introduced in commit 434d53b00d6bb7be0a1d3dcc0d0d5df6c042e164 Signed-off-by: Russ Anderson <rja@sgi.com> Signed-off-by: Tony Luck <tony.luck@intel.com>
2008-04-23Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6Linus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: iwlwifi: Fix built-in compilation of iwlcore net: Unexport move_addr_to_{kernel,user} rt2x00: Select LEDS_CLASS. iwlwifi: Select LEDS_CLASS. leds: Do not guard NEW_LEDS with HAS_IOMEM [IPSEC]: Fix catch-22 with algorithm IDs above 31 time: Export set_normalized_timespec. tcp: Make use of before macro in tcp_input.c hamradio: Remove unneeded and deprecated cli()/sti() calls in dmascc.c [NETNS]: Remove empty ->init callback. [DCCP]: Convert do_gettimeofday() to getnstimeofday(). [NETNS]: Don't initialize err variable twice. [NETNS]: The ip6_fib_timer can work with garbage on net namespace stop. [IPV4]: Convert do_gettimeofday() to getnstimeofday(). [IPV4]: Make icmp_sk_init() static. [IPV6]: Make struct ip6_prohibit_entry_template static. tcp: Trivial fix to correct function name in a comment in net/ipv4/tcp.c [NET]: Expose netdevice dev_id through sysfs skbuff: fix missing kernel-doc notation [ROSE]: Fix soft lockup wrt. rose_node_list_lock
2008-04-22Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: [PATCH] get rid of __exit_files(), __exit_fs() and __put_fs_struct() [PATCH] proc_readfd_common() race fix [PATCH] double-free of inode on alloc_file() failure exit in create_write_pipe() [PATCH] teach seq_file to discard entries [PATCH] umount_tree() will unhash everything itself [PATCH] get rid of more nameidata passing in namespace.c [PATCH] switch a bunch of LSM hooks from nameidata to path [PATCH] lock exclusively in collect_mounts() and drop_collected_mounts() [PATCH] move a bunch of declarations to fs/internal.h
2008-04-22[PATCH] get rid of __exit_files(), __exit_fs() and __put_fs_struct()Al Viro
The only reason to have separated __...() for those was to keep them inlined for local users in exit.c. Since Alexey removed the inline on those, there's no reason whatsoever to keep them around; just collapse with normal variants. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-04-22kernel-doc: fix sched.c missing parameterRandy Dunlap
Add missing kernel-doc in kernel/sched.c: Warning(linux-2.6.25-git3//kernel/sched.c:7044): No description found for parameter 'span' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-21time: Export set_normalized_timespec.YOSHIFUJI Hideaki
Sorry I have just realized set_normalized_timespec() (used in timespec_sub()) is not exported, and link will fail because of it... Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-21Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/juhl/trivial * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/juhl/trivial: (24 commits) DOC: A couple corrections and clarifications in USB doc. Generate a slightly more informative error msg for bad HZ fix typo "is" -> "if" in Makefile ext*: spelling fix prefered -> preferred DOCUMENTATION: Use newer DEFINE_SPINLOCK macro in docs. KEYS: Fix the comment to match the file name in rxrpc-type.h. RAID: remove trailing space from printk line DMA engine: typo fixes Remove unused MAX_NODES_SHIFT MAINTAINERS: Clarify access to OCFS2 development mailing list. V4L: Storage class should be before const qualifier (sn9c102) V4L: Storage class should be before const qualifier sonypi: Storage class should be before const qualifier intel_menlow: Storage class should be before const qualifier DVB: Storage class should be before const qualifier arm: Storage class should be before const qualifier ALSA: Storage class should be before const qualifier acpi: Storage class should be before const qualifier firmware_sample_driver.c: fix coding style MAINTAINERS: Add ati_remote2 driver ... Fixed up trivial conflicts in firmware_sample_driver.c
2008-04-21Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/pci-2.6Linus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/pci-2.6: (42 commits) PCI: Change PCI subsystem MAINTAINER PCI: pci-iommu-iotlb-flushing-speedup PCI: pci_setup_bridge() mustn't be __devinit PCI: pci_bus_size_cardbus() mustn't be __devinit PCI: pci_scan_device() mustn't be __devinit PCI: pci_alloc_child_bus() mustn't be __devinit PCI: replace remaining __FUNCTION__ occurrences PCI: Hotplug: fakephp: Return success, not ENODEV, when bus rescan is triggered PCI: Hotplug: Fix leaks in IBM Hot Plug Controller Driver - ibmphp_init_devno() PCI: clean up resource alignment management PCI: aerdrv_acpi.c: remove unneeded NULL check PCI: Update VIA CX700 quirk PCI: Expose PCI VPD through sysfs PCI: iommu: iotlb flushing PCI: simplify quirk debug output PCI: iova RB tree setup tweak PCI: parisc: use generic pci_enable_resources() PCI: ppc: use generic pci_enable_resources() PCI: powerpc: use generic pci_enable_resources() PCI: ia64: use generic pci_enable_resources() ...
2008-04-21ptrace: compat_ptrace_request siginfoRoland McGrath
This adds support for PTRACE_GETSIGINFO and PTRACE_SETSIGINFO in compat_ptrace_request. It relies on existing arch definitions for copy_siginfo_to_user32 and copy_siginfo_from_user32. On powerpc, this fixes a longstanding regression of 32-bit ptrace calls on 64-bit kernels vs native calls (64-bit calls or 32-bit kernels). This can be seen in a 32-bit call using PTRACE_GETSIGINFO to examine e.g. siginfo_t.si_addr from a signal that sets it. (This was broken as of 2.6.24 and, I presume, many or all prior versions.) Signed-off-by: Roland McGrath <roland@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-21Merge branch 'master' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tglx/linux-2.6-hrt * 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/tglx/linux-2.6-hrt: hrtimer: optimize the softirq time optimization hrtimer: reduce calls to hrtimer_get_softirq_time() clockevents: fix typo in tick-broadcast.c jiffies: add time_is_after_jiffies and others which compare with jiffies
2008-04-21Merge branch 'semaphore' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/willy/misc * 'semaphore' of git://git.kernel.org/pub/scm/linux/kernel/git/willy/misc: Deprecate the asm/semaphore.h files in feature-removal-schedule. Convert asm/semaphore.h users to linux/semaphore.h security: Remove unnecessary inclusions of asm/semaphore.h lib: Remove unnecessary inclusions of asm/semaphore.h kernel: Remove unnecessary inclusions of asm/semaphore.h include: Remove unnecessary inclusions of asm/semaphore.h fs: Remove unnecessary inclusions of asm/semaphore.h drivers: Remove unnecessary inclusions of asm/semaphore.h net: Remove unnecessary inclusions of asm/semaphore.h arch: Remove unnecessary inclusions of asm/semaphore.h
2008-04-21Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched-devel * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched-devel: (62 commits) sched: build fix sched: better rt-group documentation sched: features fix sched: /debug/sched_features sched: add SCHED_FEAT_DEADLINE sched: debug: show a weight tree sched: fair: weight calculations sched: fair-group: de-couple load-balancing from the rb-trees sched: fair-group scheduling vs latency sched: rt-group: optimize dequeue_rt_stack sched: debug: add some debug code to handle the full hierarchy sched: fair-group: SMP-nice for group scheduling sched, cpuset: customize sched domains, core sched, cpuset: customize sched domains, docs sched: prepatory code movement sched: rt: multi level group constraints sched: task_group hierarchy sched: fix the task_group hierarchy for UID grouping sched: allow the group scheduler to have multiple levels sched: mix tasks and groups ...
2008-04-21trivial: small cleanupsPavel Machek
These are small cleanups all over the tree. Trivial style and comment changes to fs/select.c, kernel/signal.c, kernel/stop_machine.c & mm/pdflush.c Signed-off-by: Pavel Machek <pavel@suse.cz> Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
2008-04-21hrtimer: optimize the softirq time optimizationThomas Gleixner
The previous optimization did not take the case into account where a clock provides its own softirq_get_time() function. Check for the availablitiy of the clock get time function first and then check if we need to retrieve the time for both clocks via hrtimer_softirq_gettime() to avoid a double evaluation of time in that case as well. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-04-21hrtimer: reduce calls to hrtimer_get_softirq_time()Dimitri Sivanich
It seems that hrtimer_run_queues() is calling hrtimer_get_softirq_time() more often than it needs to. This can cause frequent contention on systems with large numbers of processors/cores. With this patch, hrtimer_run_queues only calls hrtimer_get_softirq_time() if there is a pending timer in one of the hrtimer bases, and only once. This also combines hrtimer_run_queues() and the inline run_hrtimer_queue() into one function. [ tglx@linutronix.de: coding style ] Signed-off-by: Dimitri Sivanich <sivanich@sgi.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-04-21clockevents: fix typo in tick-broadcast.cGlauber Costa
braodcast -> broadcast Signed-off-by: Glauber Costa <gcosta@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-04-20PCI: clean up resource alignment managementIvan Kokshaysky
Done per Linus' request and suggestions. Linus has explained that better than I'll be able to explain: On Thu, Mar 27, 2008 at 10:12:10AM -0700, Linus Torvalds wrote: > Actually, before we go any further, there might be a less intrusive > alternative: add just a couple of flags to the resource flags field (we > still have something like 8 unused bits on 32-bit), and use those to > implement a generic "resource_alignment()" routine. > > Two flags would do it: > > - IORESOURCE_SIZEALIGN: size indicates alignment (regular PCI device > resources) > > - IORESOURCE_STARTALIGN: start field is alignment (PCI bus resources > during probing) > > and then the case of both flags zero (or both bits set) would actually be > "invalid", and we would also clear the IORESOURCE_STARTALIGN flag when we > actually allocate the resource (so that we don't use the "start" field as > alignment incorrectly when it no longer indicates alignment). > > That wouldn't be totally generic, but it would have the nice property of > automatically at least add sanity checking for that whole "res->start has > the odd meaning of 'alignment' during probing" and remove the need for a > new field, and it would allow us to have a generic "resource_alignment()" > routine that just gets a resource pointer. Besides, I removed IORESOURCE_BUS_HAS_VGA flag which was unused for ages. Signed-off-by: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Gary Hade <garyhade@us.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-04-19sched: build fixIngo Molnar
Signed-off-by: Ingo Molnar <mingo@elte.hu>