summaryrefslogtreecommitdiff
path: root/include/linux
AgeCommit message (Collapse)Author
2010-10-27cgroup: add clone_children control fileDaniel Lezcano
The ns_cgroup is a control group interacting with the namespaces. When a new namespace is created, a corresponding cgroup is automatically created too. The cgroup name is the pid of the process who did 'unshare' or the child of 'clone'. This cgroup is tied with the namespace because it prevents a process to escape the control group and use the post_clone callback, so the child cgroup inherits the values of the parent cgroup. Unfortunately, the more we use this cgroup and the more we are facing problems with it: (1) when a process unshares, the cgroup name may conflict with a previous cgroup with the same pid, so unshare or clone return -EEXIST (2) the cgroup creation is out of control because there may have an application creating several namespaces where the system will automatically create several cgroups in his back and let them on the cgroupfs (eg. a vrf based on the network namespace). (3) the mix of (1) and (2) force an administrator to regularly check and clean these cgroups. This patchset removes the ns_cgroup by adding a new flag to the cgroup and the cgroupfs mount option. It enables the copy of the parent cgroup when a child cgroup is created. We can then safely remove the ns_cgroup as this flag brings a compatibility. We have now to manually create and add the task to a cgroup, which is consistent with the cgroup framework. This patch: Sent as an answer to a previous thread around the ns_cgroup. https://lists.linux-foundation.org/pipermail/containers/2009-June/018627.html It adds a control file 'clone_children' for a cgroup. This control file is a boolean specifying if the child cgroup should be a clone of the parent cgroup or not. The default value is 'false'. This flag makes the child cgroup to call the post_clone callback of all the subsystem, if it is available. At present, the cpuset is the only one which had implemented the post_clone callback. The option can be set at mount time by specifying the 'clone_children' mount option. Signed-off-by: Daniel Lezcano <daniel.lezcano@free.fr> Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Acked-by: Paul Menage <menage@google.com> Reviewed-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Jamal Hadi Salim <hadi@cyberus.ca> Cc: Matt Helsley <matthltc@us.ibm.com> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-27fbmem: fix fb_read, fb_write unaligned accessesJames Hogan
fb_{read,write} access the framebuffer using lots of fb_{read,write}l's but don't check that the file position is aligned which can cause problems on some architectures which do not support unaligned accesses. Since the operations are essentially memcpy_{from,to}io, new fb_memcpy_{from,to}fb macros have been defined and these are used instead. For Sparc, fb_{read,write} macros use sbus_{read,write}, so this defines new sbus_memcpy_{from,to}io functions the same as memcpy_{from,to}io but using sbus_{read,write}b instead of {read,write}b. Signed-off-by: James Hogan <james@albanarts.com> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Florian Tobias Schandinat <FlorianSchandinat@gmx.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-27gpio: adp5588-gpio: add i2c forward declarationMichael Hennerich
Some ADP5588 functions take a pointer to an i2c_client, but if the i2c header doesn't happen to be included first, we hit the standard "struct declared inside parameter list" warnings from gcc. So add a simple forward decl of the i2c_client struct. Signed-off-by: Michael Hennerich <michael.hennerich@analog.com> Signed-off-by: Mike Frysinger <vapier@gentoo.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-27gpio: adp5588-gpio: gpio_start must be signedMichael Hennerich
Common code interprets this as a signed value (a negative value is used to request dynamic ID allocation), so make sure the platform data has proper types to support that. Signed-off-by: Michael Hennerich <michael.hennerich@analog.com> Signed-off-by: Mike Frysinger <vapier@gentoo.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-27gpio: adp5588-gpio: support interrupt controllerMichael Hennerich
Implement irq_chip functionality on ADP5588/5587 GPIO expanders. Only level sensitive interrupts are supported. Interrupts provided by this irq_chip must be requested using request_threaded_irq(). Signed-off-by: Michael Hennerich <michael.hennerich@analog.com> Signed-off-by: Mike Frysinger <vapier@gentoo.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-27gpio: add support for 74x164 serial-in/parallel-out 8-bit shift registerMiguel Gaio
Add support for generic 74x164 serial-in/parallel-out 8-bits shift register. This driver can be used as a GPIO output expander. [akpm@linux-foundation.org: remove unused local `refresh'] Signed-off-by: Miguel Gaio <miguel.gaio@efixo.com> Signed-off-by: Juhos Gabor <juhosg@openwrt.org> Signed-off-by: Florian Fainelli <florian@openwrt.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-27gpio: add driver for basic memory-mapped GPIO controllersAnton Vorontsov
The basic GPIO controllers may be found in various on-board FPGA and ASIC solutions that are used to control board's switches, LEDs, chip-selects, Ethernet/USB PHY power, etc. These controllers may not provide any means of pin setup (in/out/open drain). The driver supports: - 8/16/32/64 bits registers; - GPIO controllers with clear/set registers; - GPIO controllers with a single "data" register; - Big endian bits/GPIOs ordering (mostly used on PowerPC). Signed-off-by: Anton Vorontsov <cbouatmailru@gmail.com> Reviewed-by: Mark Brown <broonie@opensource.wolfsonmicro.com> Cc: David Brownell <david-b@pacbell.net> Cc: Samuel Ortiz <sameo@linux.intel.com>, Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-27mm: fix race in kunmap_atomic()Peter Zijlstra
Christoph reported a nice splat which illustrated a race in the new stack based kmap_atomic implementation. The problem is that we pop our stack slot before we're completely done resetting its state -- in particular clearing the PTE (sometimes that's CONFIG_DEBUG_HIGHMEM). If an interrupt happens before we actually clear the PTE used for the last slot, that interrupt can reuse the slot in a dirty state, which triggers a BUG in kmap_atomic(). Fix this by introducing kmap_atomic_idx() which reports the current slot index without actually releasing it and use that to find the PTE and delay the _pop() until after we're completely done. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Reported-by: Christoph Hellwig <hch@infradead.org> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-27mm,x86: fix kmap_atomic_push vs ioremap_32.cPeter Zijlstra
It appears i386 uses kmap_atomic infrastructure regardless of CONFIG_HIGHMEM which results in a compile error when highmem is disabled. Cure this by providing the needed few bits for both CONFIG_HIGHMEM and CONFIG_X86_32. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Reported-by: Chris Wilson <chris@chris-wilson.co.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-26Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (52 commits) split invalidate_inodes() fs: skip I_FREEING inodes in writeback_sb_inodes fs: fold invalidate_list into invalidate_inodes fs: do not drop inode_lock in dispose_list fs: inode split IO and LRU lists fs: switch bdev inode bdi's correctly fs: fix buffer invalidation in invalidate_list fsnotify: use dget_parent smbfs: use dget_parent exportfs: use dget_parent fs: use RCU read side protection in d_validate fs: clean up dentry lru modification fs: split __shrink_dcache_sb fs: improve DCACHE_REFERENCED usage fs: use percpu counter for nr_dentry and nr_dentry_unused fs: simplify __d_free fs: take dcache_lock inside __d_path fs: do not assign default i_ino in new_inode fs: introduce a per-cpu last_ino allocator new helper: ihold() ...
2010-10-26Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband: (63 commits) IB/qib: clean up properly if pci_set_consistent_dma_mask() fails IB/qib: Allow driver to load if PCIe AER fails IB/qib: Fix uninitialized pointer if CONFIG_PCI_MSI not set IB/qib: Fix extra log level in qib_early_err() RDMA/cxgb4: Remove unnecessary KERN_<level> use RDMA/cxgb3: Remove unnecessary KERN_<level> use IB/core: Add link layer type information to sysfs IB/mlx4: Add VLAN support for IBoE IB/core: Add VLAN support for IBoE IB/mlx4: Add support for IBoE mlx4_en: Change multicast promiscuous mode to support IBoE mlx4_core: Update data structures and constants for IBoE mlx4_core: Allow protocol drivers to find corresponding interfaces IB/uverbs: Return link layer type to userspace for query port operation IB/srp: Sync buffer before posting send IB/srp: Use list_first_entry() IB/srp: Reduce number of BUSY conditions IB/srp: Eliminate two forward declarations IB/mlx4: Signal node desc changes to SM by using FW to generate trap 144 IB: Replace EXTRA_CFLAGS with ccflags-y ...
2010-10-26docbook: add idr/ida to kernel-api docbookRandy Dunlap
Add idr/ida to kernel-api docbook. Fix typos and kernel-doc notation. Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Naohiro Aota <naota@elisp.net> Cc: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-26docbook: add more wait/wake/completion to device-drivers docbookRandy Dunlap
Add more wait, wake, and completion interfaces to the device-drivers docbook. Fix kernel-doc notation in the added files. Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-26Merge branch 'release' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6 * 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6: (53 commits) ACPI: install ACPI table handler before any dynamic tables being loaded ACPI / PM: Blacklist another machine that needs acpi_sleep=nonvs ACPI: Page based coalescing of I/O remappings optimization ACPI: Convert simple locking to RCU based locking ACPI: Pre-map 'system event' related register blocks ACPI: Add interfaces for ioremapping/iounmapping ACPI registers ACPI: Maintain a list of ACPI memory mapped I/O remappings ACPI: Fix ioremap size for MMIO reads and writes ACPI / Battery: Return -ENODEV for unknown values in get_property() ACPI / PM: Fix reference counting of power resources Subject: [PATCH] ACPICA: Fix Scope() op in module level code ACPI battery: support percentage battery remaining capacity ACPI: Make Embedded Controller command timeout delay configurable ACPI dock: move some functions to .init.text ACPI: thermal: remove unused limit code ACPI: static sleep_states[] and acpi_gts_bfs_check ACPI: remove dead code ACPI: delete dedicated MAINTAINERS entries for ACPI EC and BATTERY drivers ACPI: Only processor needs CPU_IDLE ACPICA: Update version to 20101013 ...
2010-10-26Merge branch 'sfi-release' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-sfi-2.6 * 'sfi-release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-sfi-2.6: SFI: remove the v0.7 related definitions from sfi.h
2010-10-26Merge branch 'akpm-incoming-1'Linus Torvalds
* akpm-incoming-1: (176 commits) scripts/checkpatch.pl: add check for declaration of pci_device_id scripts/checkpatch.pl: add warnings for static char that could be static const char checkpatch: version 0.31 checkpatch: statement/block context analyser should look at sanitised lines checkpatch: handle EXPORT_SYMBOL for DEVICE_ATTR and similar checkpatch: clean up structure definition macro handline checkpatch: update copyright dates checkpatch: Add additional attribute #defines checkpatch: check for incorrect permissions checkpatch: ensure kconfig help checks only apply when we are adding help checkpatch: simplify and consolidate "missing space after" checks checkpatch: add check for space after struct, union, and enum checkpatch: returning errno typically should be negative checkpatch: handle casts better fixing false categorisation of : as binary checkpatch: ensure we do not collapse bracketed sections into constants checkpatch: suggest cleanpatch and cleanfile when appropriate checkpatch: types may sit on a line on their own checkpatch: fix regressions in "fix handling of leading spaces" div64_u64(): improve precision on 32bit platforms lib/parser: cleanup match_number() ...
2010-10-26div64_u64(): improve precision on 32bit platformsBrian Behlendorf
The current implementation of div64_u64 for 32bit systems returns an approximately correct result when the divisor exceeds 32bits. Since doing 64bit division using 32bit hardware is a long since solved problem we just use one of the existing proven methods. Additionally, add a div64_s64 function to correctly handle doing signed 64bit division. Addresses https://bugzilla.redhat.com/show_bug.cgi?id=616105 Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Ben Woodard <bwoodard@llnl.gov> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Cc: Mark Grondona <mgrondona@llnl.gov> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-26ratelimit: add comment warning people off printk_ratelimit()Andrew Morton
printk_ratelimit() was a bad idea - we don't want subsytem A causing ratelimiting of subsystem B's messages. Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-26printk: declare printk_ratelimit_state in ratelimit.hNamhyung Kim
Adding declaration of printk_ratelimit_state in ratelimit.h removes potential build breakage and following sparse warning: kernel/printk.c:1426:1: warning: symbol 'printk_ratelimit_state' was not declared. Should it be static? [akpm@linux-foundation.org: remove unneeded ifdef] Signed-off-by: Namhyung Kim <namhyung@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-26kernel: remove PF_FLUSHERPeter Zijlstra
PF_FLUSHER is only ever set, not tested, remove it. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-26fs: allow for more than 2^31 filesEric Dumazet
Robin Holt tried to boot a 16TB system and found af_unix was overflowing a 32bit value : <quote> We were seeing a failure which prevented boot. The kernel was incapable of creating either a named pipe or unix domain socket. This comes down to a common kernel function called unix_create1() which does: atomic_inc(&unix_nr_socks); if (atomic_read(&unix_nr_socks) > 2 * get_max_files()) goto out; The function get_max_files() is a simple return of files_stat.max_files. files_stat.max_files is a signed integer and is computed in fs/file_table.c's files_init(). n = (mempages * (PAGE_SIZE / 1024)) / 10; files_stat.max_files = n; In our case, mempages (total_ram_pages) is approx 3,758,096,384 (0xe0000000). That leaves max_files at approximately 1,503,238,553. This causes 2 * get_max_files() to integer overflow. </quote> Fix is to let /proc/sys/fs/file-nr & /proc/sys/fs/file-max use long integers, and change af_unix to use an atomic_long_t instead of atomic_t. get_max_files() is changed to return an unsigned long. get_nr_files() is changed to return a long. unix_nr_socks is changed from atomic_t to atomic_long_t, while not strictly needed to address Robin problem. Before patch (on a 64bit kernel) : # echo 2147483648 >/proc/sys/fs/file-max # cat /proc/sys/fs/file-max -18446744071562067968 After patch: # echo 2147483648 >/proc/sys/fs/file-max # cat /proc/sys/fs/file-max 2147483648 # cat /proc/sys/fs/file-nr 704 0 2147483648 Reported-by: Robin Holt <holt@sgi.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: David Miller <davem@davemloft.net> Reviewed-by: Robin Holt <holt@sgi.com> Tested-by: Robin Holt <holt@sgi.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-26drivers/misc: driver for APDS990X ALS and proximity sensorsSamu Onkalo
This is a driver for Avago APDS990X combined ALS and proximity sensor. Interface is sysfs based. The driver uses interrupts to provide new data. The driver supports pm_runtime and regulator frameworks. See Documentation/misc-devices/apds990x.txt for details Signed-off-by: Samu Onkalo <samu.p.onkalo@nokia.com> Acked-by: Jonathan Cameron <jic23@cam.ac.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-26drivers/misc: driver for bh1770glc / sfh7770 ALS and proximity sensorSamu Onkalo
This is a driver for ROHM BH1770GLC and OSRAM SFH7770 combined ALS and proximity sensor. Interface is sysfs based. The driver uses interrupts to provide new data. The driver supports pm_runtime and regulator frameworks. See Documentation/misc-devices/bh1770glc.txt for details Signed-off-by: Samu Onkalo <samu.p.onkalo@nokia.com> Acked-by: Jonathan Cameron <jic23@cam.ac.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-26workqueues: s/ON_STACK/ONSTACK/Andrew Morton
Silly though it is, completions and wait_queue_heads use foo_ONSTACK (COMPLETION_INITIALIZER_ONSTACK, DECLARE_COMPLETION_ONSTACK, __WAIT_QUEUE_HEAD_INIT_ONSTACK and DECLARE_WAIT_QUEUE_HEAD_ONSTACK) so I guess workqueues should do the same thing. s/INIT_WORK_ON_STACK/INIT_WORK_ONSTACK/ s/INIT_DELAYED_WORK_ON_STACK/INIT_DELAYED_WORK_ONSTACK/ Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-26modules: no need to align .modinfo stringsJan Beulich
gcc aligns strings as a performance consideration for those cases where strings are being used a lot. Their use is not performance critical, and hence it seems better to save some space. Signed-off-by: Jan Beulich <jbeulich@novell.com> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-26include/linux/kernel.h: add __must_check to strict_strto*()Andrew Morton
The whole point to using the strict functions is to check the return value. If you don't, strict_strto*() will return you uninitialised garbage. Offenders have been observed in the wild. Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-26kernel.h: add {min,max}3 macrosHagen Paul Pfeifer
Introduce two additional min/max macros to compare three operands. This will save some cycles as well as some bytes on the stack and last but not least more pleasing as macro nesting. [akpm@linux-foundation.org: fix warnings] Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net> Cc: Joe Perches <joe@perches.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Hartley Sweeten <hsweeten@visionengravers.com> Cc: Russell King <linux@arm.linux.org.uk> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Roland Dreier <rolandd@cisco.com> Cc: Sean Hefty <sean.hefty@intel.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-26mm: add vzalloc() and vzalloc_node() helpersDave Young
Add vzalloc() and vzalloc_node() to encapsulate the vmalloc-then-memset-zero operation. Use __GFP_ZERO to zero fill the allocated memory. Signed-off-by: Dave Young <hidave.darkstar@gmail.com> Cc: Christoph Lameter <cl@linux-foundation.org> Acked-by: Greg Ungerer <gerg@snapgear.com> Cc: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-26mm: fix sparse warnings on GFP_ZONE_TABLE/BADNamhyung Kim
Introduce ___GFP_* masks in order for gfp_t to not be mixed with plain integers which causes a lot of warnings like the following: warning: restricted gfp_t degrades to integer Signed-off-by: Namhyung Kim <namhyung@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-26mm: declare some external symbolsNamhyung Kim
Declare 'bdi_pending_list' and 'tag_pages_for_writeback()' to remove following sparse warnings: mm/backing-dev.c:46:1: warning: symbol 'bdi_pending_list' was not declared. Should it be static? mm/page-writeback.c:825:6: warning: symbol 'tag_pages_for_writeback' was not declared. Should it be static? Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-26rmap: wrap page_check_address() using __cond_lock()Namhyung Kim
The page_check_address() conditionally grabs *@ptlp in case of returning non-NULL. Rename and wrap it using __cond_lock() removes following warnings from sparse: mm/rmap.c:472:9: warning: context imbalance in 'page_mapped_in_vma' - unexpected unlock mm/rmap.c:524:9: warning: context imbalance in 'page_referenced_one' - unexpected unlock mm/rmap.c:706:9: warning: context imbalance in 'page_mkclean_one' - unexpected unlock mm/rmap.c:1066:9: warning: context imbalance in 'try_to_unmap_one' - unexpected unlock Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-26rmap: annotate lock context change on page_[un]lock_anon_vma()Namhyung Kim
The page_lock_anon_vma() conditionally grabs RCU and anon_vma lock but page_unlock_anon_vma() releases them unconditionally. This leads sparse to complain about context imbalance. Annotate them. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-26mm: wrap get_locked_pte() using __cond_lock()Namhyung Kim
The get_locked_pte() conditionally grabs 'ptl' in case of returning non-NULL. This leads sparse to complain about context imbalance. Rename and wrap it using __cond_lock() to make sparse happy. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-26mm: retry page fault when blocking on disk transferMichel Lespinasse
This change reduces mmap_sem hold times that are caused by waiting for disk transfers when accessing file mapped VMAs. It introduces the VM_FAULT_ALLOW_RETRY flag, which indicates that the call site wants mmap_sem to be released if blocking on a pending disk transfer. In that case, filemap_fault() returns the VM_FAULT_RETRY status bit and do_page_fault() will then re-acquire mmap_sem and retry the page fault. It is expected that the retry will hit the same page which will now be cached, and thus it will complete with a low mmap_sem hold time. Tests: - microbenchmark: thread A mmaps a large file and does random read accesses to the mmaped area - achieves about 55 iterations/s. Thread B does mmap/munmap in a loop at a separate location - achieves 55 iterations/s before, 15000 iterations/s after. - We are seeing related effects in some applications in house, which show significant performance regressions when running without this change. [akpm@linux-foundation.org: fix warning & crash] Signed-off-by: Michel Lespinasse <walken@google.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Reviewed-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Ying Han <yinghan@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Acked-by: "H. Peter Anvin" <hpa@zytor.com> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-26mm: remove alignment padding from anon_vma on (some) 64 bit buildsRichard Kennedy
Reorder structure anon_vma to remove alignment padding on 64 builds when (CONFIG_KSM || CONFIG_MIGRATION). This will shrink the size of the anon_vma structure from 40 to 32 bytes & allow more objects per slab in its kmem_cache. Under slub the objects in the anon_vma kmem_cache will then be 40 bytes with 102 objects per slab. (On v2.6.36 without this patch,the size is 48 bytes and 85 objects/slab.) Signed-off-by: Richard Kennedy <richard@rsk.demon.co.uk> Reviewed-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-26mm: stack based kmap_atomic()Peter Zijlstra
Keep the current interface but ignore the KM_type and use a stack based approach. The advantage is that we get rid of crappy code like: #define __KM_PTE \ (in_nmi() ? KM_NMI_PTE : \ in_irq() ? KM_IRQ_PTE : \ KM_PTE0) and in general can stop worrying about what context we're in and what kmap slots might be appropriate for that. The downside is that FRV kmap_atomic() gets more expensive. For now we use a CPP trick suggested by Andrew: #define kmap_atomic(page, args...) __kmap_atomic(page) to avoid having to touch all kmap_atomic() users in a single patch. [ not compiled on: - mn10300: the arch doesn't actually build with highmem to begin with ] [akpm@linux-foundation.org: coding-style fixes] [akpm@linux-foundation.org: fix up drivers/gpu/drm/i915/intel_overlay.c] Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Chris Metcalf <cmetcalf@tilera.com> Cc: David Howells <dhowells@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: David Miller <davem@davemloft.net> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Dave Airlie <airlied@linux.ie> Cc: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-26mm: strictly nested kmap_atomic()Peter Zijlstra
Ensure kmap_atomic() usage is strictly nested Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Reviewed-by: Rik van Riel <riel@redhat.com> Acked-by: Chris Metcalf <cmetcalf@tilera.com> Cc: David Howells <dhowells@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: David Miller <davem@davemloft.net> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-26writeback: do not sleep on the congestion queue if there are no congested ↵Mel Gorman
BDIs or if significant congestion is not being encountered in the current zone If congestion_wait() is called with no BDI congested, the caller will sleep for the full timeout and this may be an unnecessary sleep. This patch adds a wait_iff_congested() that checks congestion and only sleeps if a BDI is congested else, it calls cond_resched() to ensure the caller is not hogging the CPU longer than its quota but otherwise will not sleep. This is aimed at reducing some of the major desktop stalls reported during IO. For example, while kswapd is operating, it calls congestion_wait() but it could just have been reclaiming clean page cache pages with no congestion. Without this patch, it would sleep for a full timeout but after this patch, it'll just call schedule() if it has been on the CPU too long. Similar logic applies to direct reclaimers that are not making enough progress. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-26mm: fix typo in mm.h when NODE_NOT_IN_PAGE_FLAGSWill Deacon
NODE_NOT_IN_PAGE_FLAGS is defined in mm.h when the node information is not stored in the page flags bitmap. Unfortunately, there's a typo in one of the checks for it. This patch fixes it (s/NODE_NOT_IN_PAGEFLAGS/NODE_NOT_IN_PAGE_FLAGS/). Since this has been around for ages, I doubt it's been causing any serious problems. Signed-off-by: Will Deacon <will.deacon@arm.com> Cc: Christoph Lameter <clameter@sgi.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-26writeback: add nr_dirtied and nr_written to /proc/vmstatMichael Rubin
To help developers and applications gain visibility into writeback behaviour adding two entries to vm_stat_items and /proc/vmstat. This will allow us to track the "written" and "dirtied" counts. # grep nr_dirtied /proc/vmstat nr_dirtied 3747 # grep nr_written /proc/vmstat nr_written 3618 Signed-off-by: Michael Rubin <mrubin@google.com> Reviewed-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-26mm: add account_page_writeback()Michael Rubin
To help developers and applications gain visibility into writeback behaviour this patch adds two counters to /proc/vmstat. # grep nr_dirtied /proc/vmstat nr_dirtied 3747 # grep nr_written /proc/vmstat nr_written 3618 These entries allow user apps to understand writeback behaviour over time and learn how it is impacting their performance. Currently there is no way to inspect dirty and writeback speed over time. It's not possible for nr_dirty/nr_writeback. These entries are necessary to give visibility into writeback behaviour. We have /proc/diskstats which lets us understand the io in the block layer. We have blktrace for more in depth understanding. We have e2fsprogs and debugsfs to give insight into the file systems behaviour, but we don't offer our users the ability understand what writeback is doing. There is no way to know how active it is over the whole system, if it's falling behind or to quantify it's efforts. With these values exported users can easily see how much data applications are sending through writeback and also at what rates writeback is processing this data. Comparing the rates of change between the two allow developers to see when writeback is not able to keep up with incoming traffic and the rate of dirty memory being sent to the IO back end. This allows folks to understand their io workloads and track kernel issues. Non kernel engineers at Google often use these counters to solve puzzling performance problems. Patch #4 adds a pernode vmstat file with nr_dirtied and nr_written Patch #5 add writeback thresholds to /proc/vmstat Currently these values are in debugfs. But they should be promoted to /proc since they are useful for developers who are writing databases and file servers and are not debugging the kernel. The output is as below: # grep threshold /proc/vmstat nr_pages_dirty_threshold 409111 nr_pages_dirty_background_threshold 818223 This patch: This allows code outside of the mm core to safely manipulate page writeback state and not worry about the other accounting. Not using these routines means that some code will lose track of the accounting and we get bugs. Modify nilfs2 to use interface. Signed-off-by: Michael Rubin <mrubin@google.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Wu Fengguang <fengguang.wu@intel.com> Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp> Cc: Jiro SEKIBA <jir@unicus.jp> Cc: Dave Chinner <david@fromorbit.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-26memory hotplug: unify is_removable and offline detection codeKAMEZAWA Hiroyuki
Now, sysfs interface of memory hotplug shows whether the section is removable or not. But it checks only migrateype of pages and doesn't check details of cluster of pages. Next, memory hotplug's set_migratetype_isolate() has the same kind of check, too. This patch adds the function __count_unmovable_pages() and makes above 2 checks to use the same logic. Then, is_removable and hotremove code uses the same logic. No changes in the hotremove logic itself. TODO: need to find a way to check RECLAMABLE. But, considering bit, calling shrink_slab() against a range before starting memory hotremove sounds better. If so, this patch's logic doesn't need to be changed. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reported-by: Michal Hocko <mhocko@suse.cz> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-26mm: only build per-node scan_unevictable functions when NUMA is enabledThadeu Lima de Souza Cascardo
Non-NUMA systems do never create these files anyway, since they are only created by driver subsystem when NUMA is configured. [akpm@linux-foundation.org: cleanup] Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@holoscopio.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-26include/linux/pageblock-flags.h: fix set_pageblock_flags() macro definitonzeal
The presently-unused macro was missing one parameter. Signed-off-by: zeal <zealcook@gmail.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-26oom: add per-mm oom disable countYing Han
It's pointless to kill a task if another thread sharing its mm cannot be killed to allow future memory freeing. A subsequent patch will prevent kills in such cases, but first it's necessary to have a way to flag a task that shares memory with an OOM_DISABLE task that doesn't incur an additional tasklist scan, which would make select_bad_process() an O(n^2) function. This patch adds an atomic counter to struct mm_struct that follows how many threads attached to it have an oom_score_adj of OOM_SCORE_ADJ_MIN. They cannot be killed by the kernel, so their memory cannot be freed in oom conditions. This only requires task_lock() on the task that we're operating on, it does not require mm->mmap_sem since task_lock() pins the mm and the operation is atomic. [rientjes@google.com: changelog and sys_unshare() code] [rientjes@google.com: protect oom_disable_count with task_lock in fork] [rientjes@google.com: use old_mm for oom_disable_count in exec] Signed-off-by: Ying Han <yinghan@google.com> Signed-off-by: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-26kfifo: disable __kfifo_must_check_helper()Andrew Morton
This helper is wrong: it coerces signed values into unsigned ones, so code such as if (kfifo_alloc(...) < 0) { error } will fail to detect the error. So let's disable __kfifo_must_check_helper() for 2.6.36. Cc: Randy Dunlap <randy.dunlap@oracle.com> Cc: Stefani Seibold <stefani@seibold.net> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-26types.h: move misplaced commentAndrew Morton
This comment landed in the wrong place. Cc: Andi Kleen <andi@firstfloor.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: David Miller <davem@davemloft.net> Cc: Eric Paris <eparis@redhat.com> Cc: Jan Engelhardt <jengelh@medozas.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-26Merge branches 'amso1100', 'cma', 'cxgb3', 'cxgb4', 'ehca', 'iboe', 'ipoib', ↵Roland Dreier
'misc', 'mlx4', 'nes', 'qib' and 'srp' into for-next
2010-10-26Merge branch 'misc' into releaseLen Brown
2010-10-26Merge branch 'ima-memory-use-fixes'Linus Torvalds
* ima-memory-use-fixes: IMA: fix the ToMToU logic IMA: explicit IMA i_flag to remove global lock on inode_delete IMA: drop refcnt from ima_iint_cache since it isn't needed IMA: only allocate iint when needed IMA: move read counter into struct inode IMA: use i_writecount rather than a private counter IMA: use inode->i_lock to protect read and write counters IMA: convert internal flags from long to char IMA: use unsigned int instead of long for counters IMA: drop the inode opencount since it isn't needed for operation IMA: use rbtree instead of radix tree for inode information cache