summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2014-11-13fanotify: fix notification of groups with inode & mount marksJan Kara
fsnotify() needs to merge inode and mount marks lists when notifying groups about events so that ignore masks from inode marks are reflected in mount mark notifications and groups are notified in proper order (according to priorities). Currently the sorting of the lists done by fsnotify_add_inode_mark() / fsnotify_add_vfsmount_mark() and fsnotify() differed which resulted ignore masks not being used in some cases. Fix the problem by always using the same comparison function when sorting / merging the mark lists. Thanks to Heinrich Schuchardt for improvements of my patch. Link: https://bugzilla.kernel.org/show_bug.cgi?id=87721 Signed-off-by: Jan Kara <jack@suse.cz> Reported-by: Heinrich Schuchardt <xypron.glpk@gmx.de> Tested-by: Heinrich Schuchardt <xypron.glpk@gmx.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-11-09Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fix from Chris Mason: "It's a one liner for an error cleanup path that leads to crashes" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: fix kfree on list_head in btrfs_lookup_csums_range error cleanup
2014-11-07Merge tag 'xfs-for-linus-3.18-rc3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs Pull xfs fixes from Dave Chinner: "This update fixes a warning in the new pagecache_isize_extended() and updates some related comments, another fix for zero-range misbehaviour, and an unforntuately large set of fixes for regressions in the bulkstat code. The bulkstat fixes are large but necessary. I wouldn't normally push such a rework for a -rcX update, but right now xfsdump can silently create incomplete dumps on 3.17 and it's possible that even xfsrestore won't notice that the dumps were incomplete. Hence we need to get this update into 3.17-stable kernels ASAP. In more detail, the refactoring work I committed in 3.17 has exposed a major hole in our QA coverage. With both xfsdump (the major user of bulkstat) and xfsrestore silently ignoring missing files in the dump/restore process, incomplete dumps were going unnoticed if they were being triggered. Many of the dump/restore filesets were so small that they didn't evenhave a chance of triggering the loop iteration bugs we introduced in 3.17, so we didn't exercise the code sufficiently, either. We have already taken steps to improve QA coverage in xfstests to avoid this happening again, and I've done a lot of manual verification of dump/restore on very large data sets (tens of millions of inodes) of the past week to verify this patch set results in bulkstat behaving the same way as it does on 3.16. Unfortunately, the fixes are not exactly simple - in tracking down the problem historic API warts were discovered (e.g xfsdump has been working around a 20 year old bug in the bulkstat API for the past 10 years) and so that complicated the process of diagnosing and fixing the problems. i.e. we had to fix bugs in the code as well as discover and re-introduce the userspace visible API bugs that we unwittingly "fixed" in 3.17 that xfsdump relied on to work correctly. Summary: - incorrect warnings about i_mutex locking in pagecache_isize_extended() and updates comments to match expected locking - another zero-range bug fix for stray file size updates - a bunch of fixes for regression in the bulkstat code introduced in 3.17" * tag 'xfs-for-linus-3.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: xfs: track bulkstat progress by agino xfs: bulkstat error handling is broken xfs: bulkstat main loop logic is a mess xfs: bulkstat chunk-formatter has issues xfs: bulkstat chunk formatting cursor is broken xfs: bulkstat btree walk doesn't terminate mm: Fix comment before truncate_setsize() xfs: rework zero range to prevent invalid i_size updates mm: Remove false WARN_ON from pagecache_isize_extended() xfs: Check error during inode btree iteration in xfs_bulkstat() xfs: bulkstat doesn't release AGI buffer on error
2014-11-07xfs: track bulkstat progress by aginoDave Chinner
The bulkstat main loop progress is tracked by the "lastino" variable, which is a full 64 bit inode. However, the loop actually works on agno/agino pairs, and so there's a significant disconnect between the rest of the loop and the main cursor. Convert this to use the agino, and pass the agino into the chunk formatting function and convert it too. This gets rid of the inconsistency in the loop processing, and finally makes it simple for us to skip inodes at any point in the loop simply by incrementing the agino cursor. cc: <stable@vger.kernel.org> # 3.17 Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-11-07xfs: bulkstat error handling is brokenDave Chinner
The error propagation is a horror - xfs_bulkstat() returns a rval variable which is only set if there are formatter errors. Any sort of btree walk error or corruption will cause the bulkstat walk to terminate but will not pass an error back to userspace. Worse is the fact that formatter errors will also be ignored if any inodes were correctly formatted into the user buffer. Hence bulkstat can fail badly yet still report success to userspace. This causes significant issues with xfsdump not dumping everything in the filesystem yet reporting success. It's not until a restore fails that there is any indication that the dump was bad and tha bulkstat failed. This patch now triggers xfsdump to fail with bulkstat errors rather than silently missing files in the dump. This now causes bulkstat to fail when the lastino cookie does not fall inside an existing inode chunk. The pre-3.17 code tolerated that error by allowing the code to move to the next inode chunk as the agino target is guaranteed to fall into the next btree record. With the fixes up to this point in the series, xfsdump now passes on the troublesome filesystem image that exposes all these bugs. cc: <stable@vger.kernel.org> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2014-11-07xfs: bulkstat main loop logic is a messDave Chinner
There are a bunch of variables tha tare more wildy scoped than they need to be, obfuscated user buffer checks and tortured "next inode" tracking. This all needs cleaning up to expose the real issues that need fixing. cc: <stable@vger.kernel.org> # 3.17 Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-11-07xfs: bulkstat chunk-formatter has issuesDave Chinner
The loop construct has issues: - clustidx is completely unused, so remove it. - the loop tries to be smart by terminating when the "freecount" tells it that all inodes are free. Just drop it as in most cases we have to scan all inodes in the chunk anyway. - move the "user buffer left" condition check to the only point where we consume space int eh user buffer. - move the initialisation of agino out of the loop, leaving just a simple loop control logic using the clusteridx. Also, double handling of the user buffer variables leads to problems tracking the current state - use the cursor variables directly rather than keeping local copies and then having to update the cursor before returning. cc: <stable@vger.kernel.org> # 3.17 Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-11-07xfs: bulkstat chunk formatting cursor is brokenDave Chinner
The xfs_bulkstat_agichunk formatting cursor takes buffer values from the main loop and passes them via the structure to the chunk formatter, and the writes the changed values back into the main loop local variables. Unfortunately, this complex dance is full of corner cases that aren't handled correctly. The biggest problem is that it is double handling the information in both the main loop and the chunk formatting function, leading to inconsistent updates and endless loops where progress is not made. To fix this, push the struct xfs_bulkstat_agichunk outwards to be the primary holder of user buffer information. this removes the double handling in the main loop. Also, pass the last inode processed by the chunk formatter as a separate parameter as it purely an output variable and is not related to the user buffer consumption cursor. Finally, the chunk formatting code is not shared by anyone, so make it local to xfs_itable.c. cc: <stable@vger.kernel.org> # 3.17 Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-11-07xfs: bulkstat btree walk doesn't terminateDave Chinner
The bulkstat code has several different ways of detecting the end of an AG when doing a walk. They are not consistently detected, and the code that checks for the end of AG conditions is not consistently coded. Hence the are conditions where the walk code can get stuck in an endless loop making no progress and not triggering any termination conditions. Convert all the "tmp/i" status return codes from btree operations to a common name (stat) and apply end-of-ag detection to these operations consistently. cc: <stable@vger.kernel.org> # 3.17 Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-11-05fix breakage in o2net_send_tcp_msg()Al Viro
uninitialized msghdr. Broken in "ocfs2: don't open-code kernel_recvmsg()" by me ;-/ Cc: stable@vger.kernel.org # 3.15+ Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-11-05ovl: don't poison cursorMiklos Szeredi
ovl_cache_put() can be called from ovl_dir_reset() if the cache needs to be rebuilt. We did list_del() on the cursor, which results in an Oops on the poisoned pointer in ovl_seek_cursor(). Reported-by: Jordi Pujol Palomer <jordipujolp@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Tested-by: Jordi Pujol Palomer <jordipujolp@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-11-04Btrfs: fix kfree on list_head in btrfs_lookup_csums_range error cleanupChris Mason
If we hit any errors in btrfs_lookup_csums_range, we'll loop through all the csums we allocate and free them. But the code was using list_entry incorrectly, and ended up trying to free the on-stack list_head instead. This bug came from commit 0678b6185 btrfs: Don't BUG_ON kzalloc error in btrfs_lookup_csums_range() Signed-off-by: Chris Mason <clm@fb.com> Reported-by: Erik Berg <btrfs@slipsprogrammoer.no> cc: stable@vger.kernel.org # 3.3 or newer
2014-11-02Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull VFS fixes from Al Viro: "A bunch of assorted fixes, most of them followups to overlayfs merge" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: ovl: initialize ->is_cursor Return short read or 0 at end of a raw device, not EIO isofs: don't bother with ->d_op for normal case isofs_cmp(): we'll never see a dentry for . or .. overlayfs: fix lockdep misannotation ovl: fix check for cursor overlayfs: barriers for opening upper-layer directory rcu: Provide counterpart to rcu_dereference() for non-RCU situations staging: android: logger: Fix log corruption regression
2014-11-01Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "Filipe is nailing down some problems with our skinny extent variation, and Dave's patch fixes endian problems in the new super block checks" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: fix race that makes btrfs_lookup_extent_info miss skinny extent items Btrfs: properly clean up btrfs_end_io_wq_cache Btrfs: fix invalid leaf slot access in btrfs_lookup_extent() btrfs: use macro accessors in superblock validation checks
2014-10-31Merge tag 'ext4_for_linus_stable' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 bugfixes from Ted Ts'o: "A set of miscellaneous ext4 bug fixes for 3.18" * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: make ext4_ext_convert_to_initialized() return proper number of blocks ext4: bail early when clearing inode journal flag fails ext4: bail out from make_indexed_dir() on first error jbd2: use a better hash function for the revoke table ext4: prevent bugon on race between write/fcntl ext4: remove extent status procfs files if journal load fails ext4: disallow changing journal_csum option during remount ext4: enable journal checksum when metadata checksum feature enabled ext4: fix oops when loading block bitmap failed ext4: fix overflow when updating superblock backups after resize
2014-10-31Merge branch 'for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull quota and ext3 fixes from Jan Kara. * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: fs, jbd: use a more generic hash function quota: Properly return errors from dquot_writeback_dquots() ext3: Don't check quota format when there are no quota files
2014-10-31ovl: initialize ->is_cursorMiklos Szeredi
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-10-31Return short read or 0 at end of a raw device, not EIODavid Jeffery
Author: David Jeffery <djeffery@redhat.com> Changes to the basic direct I/O code have broken the raw driver when reading to the end of a raw device. Instead of returning a short read for a read that extends partially beyond the device's end or 0 when at the end of the device, these reads now return EIO. The raw driver needs the same end of device handling as was added for normal block devices. Using blkdev_read_iter, which has the needed size checks, prevents the EIO conditions at the end of the device. Signed-off-by: David Jeffery <djeffery@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-10-31isofs: don't bother with ->d_op for normal caseAl Viro
we only need it for joliet and case-insensitive mounts Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-10-30fs: allow open(dir, O_TMPFILE|..., 0) with mode 0Eric Rannaud
The man page for open(2) indicates that when O_CREAT is specified, the 'mode' argument applies only to future accesses to the file: Note that this mode applies only to future accesses of the newly created file; the open() call that creates a read-only file may well return a read/write file descriptor. The man page for open(2) implies that 'mode' is treated identically by O_CREAT and O_TMPFILE. O_TMPFILE, however, behaves differently: int fd = open("/tmp", O_TMPFILE | O_RDWR, 0); assert(fd == -1); assert(errno == EACCES); int fd = open("/tmp", O_TMPFILE | O_RDWR, 0600); assert(fd > 0); For O_CREAT, do_last() sets acc_mode to MAY_OPEN only: if (*opened & FILE_CREATED) { /* Don't check for write permission, don't truncate */ open_flag &= ~O_TRUNC; will_truncate = false; acc_mode = MAY_OPEN; path_to_nameidata(path, nd); goto finish_open_created; } But for O_TMPFILE, do_tmpfile() passes the full op->acc_mode to may_open(). This patch lines up the behavior of O_TMPFILE with O_CREAT. After the inode is created, may_open() is called with acc_mode = MAY_OPEN, in do_tmpfile(). A different, but related glibc bug revealed the discrepancy: https://sourceware.org/bugzilla/show_bug.cgi?id=17523 The glibc lazily loads the 'mode' argument of open() and openat() using va_arg() only if O_CREAT is present in 'flags' (to support both the 2 argument and the 3 argument forms of open; same idea for openat()). However, the glibc ignores the 'mode' argument if O_TMPFILE is in 'flags'. On x86_64, for open(), it magically works anyway, as 'mode' is in RDX when entering open(), and is still in RDX on SYSCALL, which is where the kernel looks for the 3rd argument of a syscall. But openat() is not quite so lucky: 'mode' is in RCX when entering the glibc wrapper for openat(), while the kernel looks for the 4th argument of a syscall in R10. Indeed, the syscall calling convention differs from the regular calling convention in this respect on x86_64. So the kernel sees mode = 0 when trying to use glibc openat() with O_TMPFILE, and fails with EACCES. Signed-off-by: Eric Rannaud <e@nanocritical.com> Acked-by: Andy Lutomirski <luto@amacapital.net> Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-30ext4: make ext4_ext_convert_to_initialized() return proper number of blocksJan Kara
ext4_ext_convert_to_initialized() can return more blocks than are actually allocated from map->m_lblk in case where initial part of the on-disk extent is zeroed out. Luckily this doesn't have serious consequences because the caller currently uses the return value only to unmap metadata buffers. Anyway this is a data corruption/exposure problem waiting to happen so fix it. Coverity-id: 1226848 Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-10-30ext4: bail early when clearing inode journal flag failsJan Kara
When clearing inode journal flag, we call jbd2_journal_flush() to force all the journalled data to their final locations. Currently we ignore when this fails and continue clearing inode journal flag. This isn't a big problem because when jbd2_journal_flush() fails, journal is likely aborted anyway. But it can still lead to somewhat confusing results so rather bail out early. Coverity-id: 989044 Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-10-30ext4: bail out from make_indexed_dir() on first errorJan Kara
When ext4_handle_dirty_dx_node() or ext4_handle_dirty_dirent_node() fail, there's really something wrong with the fs and there's no point in continuing further. Just return error from make_indexed_dir() in that case. Also initialize frames array so that if we return early due to error, dx_release() doesn't try to dereference uninitialized memory (which could happen also due to error in do_split()). Coverity-id: 741300 Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
2014-10-30jbd2: use a better hash function for the revoke tableTheodore Ts'o
The old hash function didn't work well for 64-bit block numbers, and used undefined (negative) shift right behavior. Use the generic 64-bit hash function instead. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reported-by: Andrey Ryabinin <a.ryabinin@samsung.com>
2014-10-30ext4: prevent bugon on race between write/fcntlDmitry Monakhov
O_DIRECT flags can be toggeled via fcntl(F_SETFL). But this value checked twice inside ext4_file_write_iter() and __generic_file_write() which result in BUG_ON inside ext4_direct_IO. Let's initialize iocb->private unconditionally. TESTCASE: xfstest:generic/036 https://patchwork.ozlabs.org/patch/402445/ #TYPICAL STACK TRACE: kernel BUG at fs/ext4/inode.c:2960! invalid opcode: 0000 [#1] SMP Modules linked in: brd iTCO_wdt lpc_ich mfd_core igb ptp dm_mirror dm_region_hash dm_log dm_mod CPU: 6 PID: 5505 Comm: aio-dio-fcntl-r Not tainted 3.17.0-rc2-00176-gff5c017 #161 Hardware name: Intel Corporation W2600CR/W2600CR, BIOS SE5C600.86B.99.99.x028.061320111235 06/13/2011 task: ffff88080e95a7c0 ti: ffff88080f908000 task.ti: ffff88080f908000 RIP: 0010:[<ffffffff811fabf2>] [<ffffffff811fabf2>] ext4_direct_IO+0x162/0x3d0 RSP: 0018:ffff88080f90bb58 EFLAGS: 00010246 RAX: 0000000000000400 RBX: ffff88080fdb2a28 RCX: 00000000a802c818 RDX: 0000040000080000 RSI: ffff88080d8aeb80 RDI: 0000000000000001 RBP: ffff88080f90bbc8 R08: 0000000000000000 R09: 0000000000001581 R10: 0000000000000000 R11: 0000000000000000 R12: ffff88080d8aeb80 R13: ffff88080f90bbf8 R14: ffff88080fdb28c8 R15: ffff88080fdb2a28 FS: 00007f23b2055700(0000) GS:ffff880818400000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f23b2045000 CR3: 000000080cedf000 CR4: 00000000000407e0 Stack: ffff88080f90bb98 0000000000000000 7ffffffffffffffe ffff88080fdb2c30 0000000000000200 0000000000000200 0000000000000001 0000000000000200 ffff88080f90bbc8 ffff88080fdb2c30 ffff88080f90be08 0000000000000200 Call Trace: [<ffffffff8112ca9d>] generic_file_direct_write+0xed/0x180 [<ffffffff8112f2b2>] __generic_file_write_iter+0x222/0x370 [<ffffffff811f495b>] ext4_file_write_iter+0x34b/0x400 [<ffffffff811bd709>] ? aio_run_iocb+0x239/0x410 [<ffffffff811bd709>] ? aio_run_iocb+0x239/0x410 [<ffffffff810990e5>] ? local_clock+0x25/0x30 [<ffffffff810abd94>] ? __lock_acquire+0x274/0x700 [<ffffffff811f4610>] ? ext4_unwritten_wait+0xb0/0xb0 [<ffffffff811bd756>] aio_run_iocb+0x286/0x410 [<ffffffff810990e5>] ? local_clock+0x25/0x30 [<ffffffff810ac359>] ? lock_release_holdtime+0x29/0x190 [<ffffffff811bc05b>] ? lookup_ioctx+0x4b/0xf0 [<ffffffff811bde3b>] do_io_submit+0x55b/0x740 [<ffffffff811bdcaa>] ? do_io_submit+0x3ca/0x740 [<ffffffff811be030>] SyS_io_submit+0x10/0x20 [<ffffffff815ce192>] system_call_fastpath+0x16/0x1b Code: 01 48 8b 80 f0 01 00 00 48 8b 18 49 8b 45 10 0f 85 f1 01 00 00 48 03 45 c8 48 3b 43 48 0f 8f e3 01 00 00 49 83 7c 24 18 00 75 04 <0f> 0b eb fe f0 ff 83 ec 01 00 00 49 8b 44 24 18 8b 00 85 c0 89 RIP [<ffffffff811fabf2>] ext4_direct_IO+0x162/0x3d0 RSP <ffff88080f90bb58> Reported-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Cc: stable@vger.kernel.org
2014-10-30ext4: remove extent status procfs files if journal load failsDarrick J. Wong
If we can't load the journal, remove the procfs files for the extent status information file to avoid leaking resources. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
2014-10-30ext4: disallow changing journal_csum option during remountDarrick J. Wong
ext4 does not permit changing the metadata or journal checksum feature flag while mounted. Until we decide to support that, don't allow a remount to change the journal_csum flag (right now we silently fail to change anything). Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-10-30ext4: enable journal checksum when metadata checksum feature enabledDarrick J. Wong
If metadata checksumming is turned on for the FS, we need to tell the journal to use checksumming too. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
2014-10-30ext4: fix oops when loading block bitmap failedJan Kara
When we fail to load block bitmap in __ext4_new_inode() we will dereference NULL pointer in ext4_journal_get_write_access(). So check for error from ext4_read_block_bitmap(). Coverity-id: 989065 Cc: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-10-30ext4: fix overflow when updating superblock backups after resizeJan Kara
When there are no meta block groups update_backups() will compute the backup block in 32-bit arithmetics thus possibly overflowing the block number and corrupting the filesystem. OTOH filesystems without meta block groups larger than 16 TB should be rare. Fix the problem by doing the counting in 64-bit arithmetics. Coverity-id: 741252 CC: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Lukas Czerner <lczerner@redhat.com>
2014-10-29Merge branch 'akpm' (incoming from Andrew Morton)Linus Torvalds
Merge misc fixes from Andrew Morton: "21 fixes" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (21 commits) mm/balloon_compaction: fix deflation when compaction is disabled sh: fix sh770x SCIF memory regions zram: avoid NULL pointer access in concurrent situation mm/slab_common: don't check for duplicate cache names ocfs2: fix d_splice_alias() return code checking mm: rmap: split out page_remove_file_rmap() mm: memcontrol: fix missed end-writeback page accounting mm: page-writeback: inline account_page_dirtied() into single caller lib/bitmap.c: fix undefined shift in __bitmap_shift_{left|right}() drivers/rtc/rtc-bq32k.c: fix register value memory-hotplug: clear pgdat which is allocated by bootmem in try_offline_node() drivers/rtc/rtc-s3c.c: fix initialization failure without rtc source clock kernel/kmod: fix use-after-free of the sub_info structure drivers/rtc/rtc-pm8xxx.c: rework to support pm8941 rtc mm, thp: fix collapsing of hugepages on madvise drivers: of: add return value to of_reserved_mem_device_init() mm: free compound page with correct order gcov: add ARM64 to GCOV_PROFILE_ALL fsnotify: next_i is freed during fsnotify_unmount_inodes. mm/compaction.c: avoid premature range skip in isolate_migratepages_range ...
2014-10-30xfs: rework zero range to prevent invalid i_size updatesBrian Foster
The zero range operation is analogous to fallocate with the exception of converting the range to zeroes. E.g., it attempts to allocate zeroed blocks over the range specified by the caller. The XFS implementation kills all delalloc blocks currently over the aligned range, converts the range to allocated zero blocks (unwritten extents) and handles the partial pages at the ends of the range by sending writes through the pagecache. The current implementation suffers from several problems associated with inode size. If the aligned range covers an extending I/O, said I/O is discarded and an inode size update from a previous write never makes it to disk. Further, if an unaligned zero range extends beyond eof, the page write induced for the partial end page can itself increase the inode size, even if the zero range request is not supposed to update i_size (via KEEP_SIZE, similar to an fallocate beyond EOF). The latter behavior not only incorrectly increases the inode size, but can lead to stray delalloc blocks on the inode. Typically, post-eof preallocation blocks are either truncated on release or inode eviction or explicitly written to by xfs_zero_eof() on natural file size extension. If the inode size increases due to zero range, however, associated blocks leak into the address space having never been converted or mapped to pagecache pages. A direct I/O to such an uncovered range cannot convert the extent via writeback and will BUG(). For example: $ xfs_io -fc "pwrite 0 128k" -c "fzero -k 1m 54321" <file> ... $ xfs_io -d -c "pread 128k 128k" <file> <BUG> If the entire delalloc extent happens to not have page coverage whatsoever (e.g., delalloc conversion couldn't find a large enough free space extent), even a full file writeback won't convert what's left of the extent and we'll assert on inode eviction. Rework xfs_zero_file_space() to avoid buffered I/O for partial pages. Use the existing hole punch and prealloc mechanisms as primitives for zero range. This implementation is not efficient nor ideal as we writeback dirty data over the range and remove existing extents rather than convert to unwrittern. The former writeback, however, is currently the only mechanism available to ensure consistency between pagecache and extent state. Even a pagecache truncate/delalloc punch prior to hole punch has lead to inconsistencies due to racing with writeback. This provides a consistent, correct implementation of zero range that survives fsstress/fsx testing without assert failures. The implementation can be optimized from this point forward once the fundamental issue of pagecache and delalloc extent state consistency is addressed. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-10-30xfs: Check error during inode btree iteration in xfs_bulkstat()Jan Kara
xfs_bulkstat() doesn't check error return from xfs_btree_increment(). In case of specific fs corruption that could result in xfs_bulkstat() entering an infinite loop because we would be looping over the same chunk over and over again. Fix the problem by checking the return value and terminating the loop properly. Coverity-id: 1231338 cc: <stable@vger.kernel.org> Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Jie Liu <jeff.u.liu@gmail.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-10-29ocfs2: fix d_splice_alias() return code checkingRichard Weinberger
d_splice_alias() can return a valid dentry, NULL or an ERR_PTR. Currently the code checks not for ERR_PTR and will cuase an oops in ocfs2_dentry_attach_lock(). Fix this by using IS_ERR_OR_NULL(). Signed-off-by: Richard Weinberger <richard@nod.at> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-29fsnotify: next_i is freed during fsnotify_unmount_inodes.Jerry Hoemann
During file system stress testing on 3.10 and 3.12 based kernels, the umount command occasionally hung in fsnotify_unmount_inodes in the section of code: spin_lock(&inode->i_lock); if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) { spin_unlock(&inode->i_lock); continue; } As this section of code holds the global inode_sb_list_lock, eventually the system hangs trying to acquire the lock. Multiple crash dumps showed: The inode->i_state == 0x60 and i_count == 0 and i_sb_list would point back at itself. As this is not the value of list upon entry to the function, the kernel never exits the loop. To help narrow down problem, the call to list_del_init in inode_sb_list_del was changed to list_del. This poisons the pointers in the i_sb_list and causes a kernel to panic if it transverse a freed inode. Subsequent stress testing paniced in fsnotify_unmount_inodes at the bottom of the list_for_each_entry_safe loop showing next_i had become free. We believe the root cause of the problem is that next_i is being freed during the window of time that the list_for_each_entry_safe loop temporarily releases inode_sb_list_lock to call fsnotify and fsnotify_inode_delete. The code in fsnotify_unmount_inodes attempts to prevent the freeing of inode and next_i by calling __iget. However, the code doesn't do the __iget call on next_i if i_count == 0 or if i_state & (I_FREEING | I_WILL_FREE) The patch addresses this issue by advancing next_i in the above two cases until we either find a next_i which we can __iget or we reach the end of the list. This makes the handling of next_i more closely match the handling of the variable "inode." The time to reproduce the hang is highly variable (from hours to days.) We ran the stress test on a 3.10 kernel with the proposed patch for a week without failure. During list_for_each_entry_safe, next_i is becoming free causing the loop to never terminate. Advance next_i in those cases where __iget is not done. Signed-off-by: Jerry Hoemann <jerry.hoemann@hp.com> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Cc: Ken Helias <kenhelias@firemail.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-29Merge branch 'for-linus' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block layer fixes from Jens Axboe: "A small collection of fixes for the current kernel. This contains: - Two error handling fixes from Jan Kara. One for null_blk on failure to add a device, and the other for the block/scsi_ioctl SCSI_IOCTL_SEND_COMMAND fixing up the error jump point. - A commit added in the merge window for the bio integrity bits unfortunately disabled merging for all requests if CONFIG_BLK_DEV_INTEGRITY wasn't set. Reverse the logic, so that integrity checking wont disallow merges when not enabled. - A fix from Ming Lei for merging and generating too many segments. This caused a BUG in virtio_blk. - Two error handling printk() fixups from Robert Elliott, improving the information given when we rate limit. - Error handling fixup on elevator_init() failure from Sudip Mukherjee. - A fix from Tony Battersby, fixing up a memory leak in the scatterlist handling with scsi-mq" * 'for-linus' of git://git.kernel.dk/linux-block: block: Fix merge logic when CONFIG_BLK_DEV_INTEGRITY is not defined lib/scatterlist: fix memory leak with scsi-mq block: fix wrong error return in elevator_init() scsi: Fix error handling in SCSI_IOCTL_SEND_COMMAND null_blk: Cleanup error recovery in null_add_dev() blk-merge: recaculate segment if it isn't less than max segments fs: clarify rate limit suppressed buffer I/O errors fs: merge I/O error prints into one line
2014-10-28isofs_cmp(): we'll never see a dentry for . or ..Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-10-28overlayfs: fix lockdep misannotationMiklos Szeredi
In an overlay directory that shadows an empty lower directory, say /mnt/a/empty102, do: touch /mnt/a/empty102/x unlink /mnt/a/empty102/x rmdir /mnt/a/empty102 It's actually harmless, but needs another level of nesting between I_MUTEX_CHILD and I_MUTEX_NORMAL. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Tested-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-10-28ovl: fix check for cursorMiklos Szeredi
ovl_cache_entry.name is now an array not a pointer, so it makes no sense test for it being NULL. Detected by coverity. From: Miklos Szeredi <mszeredi@suse.cz> Fixes: 68bf8611076a ("overlayfs: make ovl_cache_entry->name an array instead of +pointer") Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-10-28overlayfs: barriers for opening upper-layer directoryAl Viro
make sure that a) all stores done by opening struct file don't leak past storing the reference in od->upperfile b) the lockless side has read dependency barrier Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-10-29xfs: bulkstat doesn't release AGI buffer on errorDave Chinner
The recent refactoring of the bulkstat code left a small landmine in the code. If a inobt read fails, then the tree walk is aborted and returns without releasing the AGI buffer or freeing the cursor. This can lead to a subsequent bulkstat call hanging trying to grab the AGI buffer again. cc: <stable@vger.kernel.org> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-10-28Btrfs: fix race that makes btrfs_lookup_extent_info miss skinny extent itemsFilipe Manana
We have a race that can lead us to miss skinny extent items in the function btrfs_lookup_extent_info() when the skinny metadata feature is enabled. So basically the sequence of steps is: 1) We search in the extent tree for the skinny extent, which returns > 0 (not found); 2) We check the previous item in the returned leaf for a non-skinny extent, and we don't find it; 3) Because we didn't find the non-skinny extent in step 2), we release our path to search the extent tree again, but this time for a non-skinny extent key; 4) Right after we released our path in step 3), a skinny extent was inserted in the extent tree (delayed refs were run) - our second extent tree search will miss it, because it's not looking for a skinny extent; 5) After the second search returned (with ret > 0), we look for any delayed ref for our extent's bytenr (and we do it while holding a read lock on the leaf), but we won't find any, as such delayed ref had just run and completed after we released out path in step 3) before doing the second search. Fix this by removing completely the path release and re-search logic. This is safe, because if we seach for a metadata item and we don't find it, we have the guarantee that the returned leaf is the one where the item would be inserted, and so path->slots[0] > 0 and path->slots[0] - 1 must be the slot where the non-skinny extent item is if it exists. The only case where path->slots[0] is zero is when there are no smaller keys in the tree (i.e. no left siblings for our leaf), in which case the re-search logic isn't needed as well. This race has been present since the introduction of skinny metadata (change 3173a18f70554fe7880bb2d85c7da566e364eb3c). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-10-28Merge branch 'for-3.18' of git://linux-nfs.org/~bfields/linuxLinus Torvalds
Pull two nfsd fixes from Bruce Fields: "One regression from the 3.16 xdr rewrite, one an older bug exposed by a separate bug in the client's new SEEK code" * 'for-3.18' of git://linux-nfs.org/~bfields/linux: nfsd4: fix crash on unknown operation number nfsd4: fix response size estimation for OP_SEQUENCE
2014-10-27Btrfs: properly clean up btrfs_end_io_wq_cacheJosef Bacik
In one of Dave's cleanup commits he forgot to call btrfs_end_io_wq_exit on unload, which makes us unable to unload and then re-load the btrfs module. This fixes the problem. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Reviewed-by: David Sterba <dsterba@suse.cz> Reviewed-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-10-27Btrfs: fix invalid leaf slot access in btrfs_lookup_extent()Filipe Manana
If we couldn't find our extent item, we accessed the current slot (path->slots[0]) to check if it corresponds to an equivalent skinny metadata item. However this slot could be beyond our last item in the leaf (i.e. path->slots[0] >= btrfs_header_nritems(leaf)), in which case we shouldn't process it. Since btrfs_lookup_extent() is only used to find extent items for data extents, fix this by removing completely the logic that looks up for an equivalent skinny metadata item, since it can not exist. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-10-27btrfs: use macro accessors in superblock validation checksDavid Sterba
The initial patch c926093ec516f5d316 (btrfs: add more superblock checks) did not properly use the macro accessors that wrap endianness and the code would not work correctly on big endian machines. Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
2014-10-26Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs updates from Al Viro: "overlayfs merge + leak fix for d_splice_alias() failure exits" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: overlayfs: embed middle into overlay_readdir_data overlayfs: embed root into overlay_readdir_data overlayfs: make ovl_cache_entry->name an array instead of pointer overlayfs: don't hold ->i_mutex over opening the real directory fix inode leaks on d_splice_alias() failure exits fs: limit filesystem stacking depth overlay: overlay filesystem documentation overlayfs: implement show_options overlayfs: add statfs support overlay filesystem shmem: support RENAME_WHITEOUT ext4: support RENAME_WHITEOUT vfs: add RENAME_WHITEOUT vfs: add whiteout support vfs: export check_sticky() vfs: introduce clone_private_mount() vfs: export __inode_permission() to modules vfs: export do_splice_direct() to modules vfs: add i_op->dentry_open()
2014-10-24overlayfs: embed middle into overlay_readdir_dataAl Viro
same story... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-10-24overlayfs: embed root into overlay_readdir_dataAl Viro
no sense having it a pointer - all instances have it pointing to local variable in the same stack frame Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-10-24overlayfs: make ovl_cache_entry->name an array instead of pointerAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>